Follow The Money: State Legislature Part 1

The 2016 election is rapidly approaching, and one of the major issues of this year's race is campaign finance reform. I am not big into politics, but I am well aware of the Citizens United v. FEC ruling. One thing I do not know, however, is on what scale politicians actually receive donations. I set out to see how much an average senator or congressman actually receives in a given year.

My intuition led me to believe that these men and women were pulling in millions of dollars each year in donations, but that may be based on watching a little too much House of Cards. The first thing I needed was the data. Luckily for me, all politicians at the state level are required to file info on their finances. Even luckier for me, there is an amazing website that databases it all and has an easy-to-use API:

Followthemoney.org

First I wanted to start at the state level, looking at state senators and assemblymen. My guess was that these people were not pulling in the big bucks when it came to campaign donations. I downloaded a data set from Follow the Money which contained records of donations to lawmakers in the state of New Jersey. From there I cleaned it up and visualized it. Heads up: lots of bar charts coming!

With New Jersey leaning Democratic, I expect the Democrats to pull in a little more money than the Republicans.

[Figure: donation totals by party and year]

WOW, that’s a big difference. However, there seems to be an issue: our data doesn’t look complete. Look at 2012 and 2014; there is missing data for both parties. The total amount is lower than it was back in 1997. Knowing that this data might be incomplete, all analysis must be taken with a grain of salt. Let's move on to Senate vs. House.

Senators in New Jersey serve one two-year term and two four-year terms every ten years, in what is called a 2-4-4 term system. This means that this year all the State Senate seats are up. This makes me question the data even more, as 2015 is relatively low compared to, say, 2011, another year where all seats were up. State House members serve 2 years. I have two conflicting trains of thought. One is that Senators will receive more donations, as the contributor gets more bang for their buck, to put it bluntly. The other is that Assemblymen get more donations, as they are up for election more frequently and constantly need to replenish the war chest. Let's see.

[Figure: donation totals by office and year]

Looks like Senators outdo Assemblymen. Look at 2011: that year all State Senate seats were up for election. A grand total of around 31 million was raised that year. That’s pretty impressive, but where is all this money coming from? Let's take a look. I’m going to stick to 2011 as it seems to be the most complete out of all the years.

[Figure: 2011 donation totals by industry]

And our winner is Uncoded, with a distant second, unitemized contributions. What does this mean? According to followthemoney.org, unitemized contributions are donations that are under the reportable limit. They are aggregated and listed under this heading. For New Jersey, the limit is $300 from an individual. As for uncoded, this money can come from various industries or, most prominently, previous years. Uncoded gives an idea of how much these politicians have stocked up in the war chest.
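To make the unitemized bucket concrete, here's a little pandas sketch of how sub-limit donations get rolled together. The column names and donation amounts here are made up for illustration, not from the real data set:

```python
import pandas as pd

# Hypothetical mini donation table; anything at or under the $300
# New Jersey individual limit gets lumped into "Unitemized"
donations = pd.DataFrame({
    "contributor": ["A. Smith", "B. Jones", "C. Lee", "D. Kim"],
    "amount":      [250, 1000, 120, 5000],
})

# keep the contributor name only for donations over the limit
donations["bucket"] = donations["contributor"].where(
    donations["amount"] > 300, other="Unitemized"
)

print(donations.groupby("bucket")["amount"].sum())
```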

As for the others, General Trade Unions comes in third and Lawyers & Lobbyists in fourth, at around half of General Trade Unions. This is interesting, as my previous beliefs on donations were based on big conglomerates or super PACs donating massive amounts of money, not general trade unions. Nevertheless, this is the state level; maybe when we look at the federal level there will be much more, for lack of a better word, interesting donors.

Pretty interesting. If you wanna take a look at the data set yourself, I’ve included it here: NJlegDon. My code is copied below if you wanna check it out (very unoptimized, and also in Python!).

-Marcello

 

"""

@author: Marcello

Campaign Donation NJ totals
data sourced from: followthemoney.org

goal of program is to break down campaign donations to NJ State Senators and
Assemblymen who are currently in office.
"""
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def summarytablegen(variable_list, varname):
    # builds a summary table: one row per value of `varname`,
    # one column per election year plus a grand total
    # (relies on the globals df, election_year, and columns_4_summary below)
    index = np.arange(len(variable_list))
    donation_summary2 = pd.DataFrame(columns=columns_4_summary, index=index)
    donation_total = {}
    total_year = 0
    i = 0
    for variable in variable_list:
        for year in election_year:
            df1 = df.loc[(df['Election_Year'] == year) & (df[varname] == variable)]
            total_donation = df1['Total_$'].sum()
            donation_total[str(year)] = total_donation
            total_year = total_year + total_donation
        donation_total['Variable'] = variable
        donation_total['Grand Total'] = total_year
        donation_summary2.loc[i] = pd.Series(donation_total)
        i = i + 1
        donation_total = {}
        total_year = 0
    return donation_summary2

# data preprocessing, removing unnecessary columns

df = pd.read_csv('NJlegDon.csv', index_col=0)

df=df.reset_index()

column_names = df.columns.values.tolist()

columns_to_drop = ['request','Election_Year:token','Election_Year:id','Lawmaker:token',
'Office:token','Office:id','General_Office:token','General_Office:id',
'General_Party:token','General_Party:id','Contributor:token','Type_of_Contributor:token',
'Type_of_Contributor:id','General_Industry:token','Broad_Sector:token','In-Jurisdiction:token',
'In-Jurisdiction:id','#_of_Records']

df = df.drop(columns=columns_to_drop)

# drop all negative donations

df = df[df['Total_$'] >= 0]

# %%find total donations for candidate by year
Lawmaker_Id = list(set(df['Lawmaker'].tolist()))
election_year = list(set(df['Election_Year'].tolist()))
Industry = list(set(df['General_Industry'].tolist()))
party = list(set(df['General_Party'].tolist()))
office= list(set(df['General_Office'].tolist()))

str1 =','.join(str(e) for e in election_year)
str1=str1.split(',')

columns_4_summary = ['Variable','Grand Total']
columns_4_summary.extend(str1)


dflawmaker=summarytablegen(Lawmaker_Id,'Lawmaker')
dfIndustry=summarytablegen(Industry,'General_Industry')
dfparty = summarytablegen(party,'General_Party')
dfoffice = summarytablegen(office,'General_Office')


 
#%% 
# breakdown by party 

party = pd.melt(dfparty, id_vars=['Variable'], value_vars=str1,var_name='year', value_name='Donations')

colors = ["windows blue", "red"]
ax = sns.barplot(x="year", y="Donations",hue="Variable", data=party,palette=sns.xkcd_palette(colors))
ax.set( ylabel='Donation Total')
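By the way, the summarytablegen loop above could also be written as a single pandas pivot_table call. Here's a sketch on a tiny made-up frame with the same column names (not the real data):

```python
import pandas as pd

# Hypothetical mini version of the donation records used above
df = pd.DataFrame({
    "Election_Year": [2011, 2011, 2013, 2013],
    "General_Party": ["Democratic", "Republican", "Democratic", "Republican"],
    "Total_$": [100.0, 80.0, 60.0, 40.0],
})

# One pivot_table call replaces the nested summarytablegen loops:
# rows are parties, columns are election years, cells are summed donations
summary = pd.pivot_table(df, values="Total_$", index="General_Party",
                         columns="Election_Year", aggfunc="sum", fill_value=0)
summary["Grand Total"] = summary.sum(axis=1)
print(summary)
```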

Journal Club: Week of 11/06/2015

Welcome to the first week of my journal club! I’ve gotten into the habit of looking for new and exciting papers to read. I made it a goal to read at least one new paper each week, and I thought I’d share. The subject matter is not limited to optimization; it covers various data analysis techniques, machine learning, multivariate controls, and other topics. Most of these papers are available online for free.


Multivariate Analysis
by Herve Abdi
University of Texas at Dallas


This paper serves as what is best described as a catalog of multivariate techniques. It is by no means exhaustive, but it contains a decent number of techniques and brief blurbs on usage and statistical methodology. The paper is organized by the amount and format of one's data. First the author looks at techniques focused on one data set, then he expands to two data sets. The two-data-set section is split into two categories. The first category assumes that one data set is trying to predict the other; the second assumes that they are just different sets of variables. I like this organization, as it achieves the author's goal of a catalog. When I am looking for possible techniques, I can first look toward my data to rapidly see which techniques are not suitable.
When talking about the statistical techniques, the author goes into just enough detail. The reviews rarely go above 2-3 paragraphs. One major criticism I have of this review is that it may be too brief. The author goes over how the techniques work, but barely touches upon possible issues or, more importantly, the most appropriate usage. The author limits his discussion of usage to at most 1 or 2 sentences. However, this may be by design, as the author mentions up front in the abstract that “choice of the proper technique for a given problem is often difficult.” Nevertheless, this article serves as a jumping-off point. If one desires to know more about the technique in question, this paper at least gives a basis to expand upon.
Overall the paper is short, but provides enough insight for a reader to begin exploring possible options for multivariate analysis.


The paper can be found here.


Principal Component Analysis: Concept, Geometrical Interpretation, Mathematical Background, Algorithms, History, Practice
by K.H. Esbensen and  P. Geladi


This chapter by Esbensen and Geladi fully guides the reader through the ins and outs of Principal Component Analysis (PCA). Because PCA is a basis and starting point for many multivariate methods, one needs a strong fundamental understanding. This chapter provides that and more. The chapter uses a geometrical interpretation of PCA which helps the reader better understand what the algorithm does to decompose a series of variables and observations. Out of all the papers I've read on PCA, this chapter helped me the most.
This chapter includes an abundance of diagrams which step the reader through all the projections PCA makes on our data set. Esbensen and Geladi take a matrix of variables and observations X, represented as a rectangle, and run it through the PCA algorithm. This algorithm decomposes the matrix X into two smaller matrices: the loadings, P, and the scores, T. This is represented as the rectangle X decomposing into two smaller rectangles, T and P. They then go on to represent the “master equation” of PCA in the same way. This allows the reader to quickly grasp how PCA works visually. This is reinforced in the next section, where PCA is represented as a change in coordinate axes. Finally, if geometric interpretations are not your thing, the authors include a simple algebraic approach, which leads into an algorithm for PCA. The algorithm is laid out briefly by the authors. I would have liked a more thorough step-by-step guide, but it is satisfactory and enough to get a basic PCA program up and running.
Finally, the chapter ends with an example and limitations. The example shows sample outputs and interpretation of the data, which I found very beneficial. However, the example section is the weakest of the paper. I would have liked the authors to go into more detail on the analysis and what conclusions you can make; these are only briefly addressed (they may be contained in another section or chapter). Another thing: I would have liked the actual data set to play around with, but I realize this was probably an excerpted chapter, shortened for space. Nevertheless, this paper is my go-to for any questions or issues with PCA, just not for analysis.
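For the curious, the scores-and-loadings decomposition the chapter describes can be sketched in a few lines of NumPy using the SVD. This is not the chapter's algorithm, just the same decomposition on made-up data:

```python
import numpy as np

# PCA via SVD: decompose a mean-centered data matrix X (observations x variables)
# into scores T and loadings P so that X is approximately T @ P.T
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))      # made-up data: 50 observations, 4 variables
Xc = X - X.mean(axis=0)           # center each variable

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2                             # number of principal components kept
T = U[:, :k] * s[:k]              # scores (projections onto the PCs)
P = Vt[:k].T                      # loadings (directions of the PCs)

X_hat = T @ P.T                   # rank-k reconstruction; the residual is Xc - X_hat
```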


Check out these two articles for an intro to multivariate analysis!
Two new articles next week!
-Marcello

Advanced Optimization Methods: Artificial Neural Networks Part 3

In our last part we went over the mathematical design of the neurons and the network itself. Now we are going to build our network in MatLab and test it out on a real world problem.

Let’s say that we work in a chemical plant. We are creating some compound and we want to anticipate and optimize our production. The compound is synthesized in a fluidized bed reactor. For those of you without a chemical engineering background, think of a tube that contains tons of pellets. Fluid then runs over these pellets and turns into a new compound. Your boss comes to you and tells you that there is too much impurity in our output stream. There are two things you can change to reduce the impurity: the amount of catalyst (the pellets in our tube) and the amount of stabilizer.

In the pilot scale facility, you run a few tests varying the amount of catalyst and stabilizer. You come up with the following table of your results.

Catalyst   Stabilizer   Impurities %
0.57       3.41         3.7
3.41       3.41         4.8
0          2            3.7
4          2            8.9
2          0            6.6
2          4            3.6
2          2            4.2

 

 

After looking at the results you decide to create a neural network to predict and optimize these values. As we know, we have two inputs, catalyst and stabilizer, and one output, impurity percent. From our last part on structures of neural networks, we decided that we need two neurons in our input layer (one for catalyst and one for stabilizer) and one neuron in our output layer (impurity percent). That only leaves our hidden layer; since we do not expect a complex problem that requires deep learning, we choose only one layer. As for neurons, we will choose 3 to make the problem a little more interesting. The structure is shown below.

[Figure: the 2-3-1 network structure]

Now that we have the structure, let us build our network in MatLab. The code is actually quite simple for this part. First we input our two variables in a 7-by-2 matrix (one row per experiment). We then multiply these by the weights of our hidden layer and pass them through our sigmoid function. These values are then multiplied by the weights of the output layer and passed through the sigmoid function again. After they pass through, they become our output, impurity %. So let's see how our network performs: the vector on the left is our actual values (scaled to the max) and on the right is what our network determined.

[Figures: actual scaled values (left) vs. network output (right)]

As you can see, the network did not guess even remotely correctly. Well, we are missing the most important part of the neural network: the training. We must train our network to get the right predictions. In order to do this we need to do our favorite thing, optimize.

-Marcello

Here’s the code:


% ANN code
% structure:
% 2 input
% 3 hidden nodes
% 1 output

%initial input data [ catalyst, stabilizer]
input_t0 = [0.57	3.41; 3.41	3.41;0	2;4	2;2	0;2	4;2	2];
%normalize input data
input_t0(:,1) = input_t0(:,1)/max(input_t0(:,1));
input_t0(:,2) = input_t0(:,2)/max(input_t0(:,2));

%normalize output data
output_t0 = [3.7 4.8 3.7 8.9 6.6 3.6 4.2];
output_t0 = output_t0/max(output_t0);

%randomly assigned weights
weight_in = [.3 .6 .7;.2 .8 .5];
weight_out = [ .4 .6 .7]';

%initialize matrices
actHidSig = zeros(7,3);
actOutSig=zeros(7,1);

%find activation for hidden layer
act_hid = input_t0*weight_in ;

%apply sigmoid to hidden activation
for i = 1:7
    for j = 1:3
        actHidSig(i,j) = 1/(1+exp(-act_hid(i,j)));
    end
end

%find activation for output layer
act_out = actHidSig*weight_out;

%apply sigmoid to output activation
for i = 1:7
    actOutSig(i) = 1/(1+exp(-act_out(i)));
end

%show results
output_t0'
actOutSig
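If you don't have MatLab handy, here's the same untrained forward pass in Python/NumPy, using the same data and starting weights as above:

```python
import numpy as np

# 2 inputs -> 3 hidden sigmoid neurons -> 1 sigmoid output, untrained weights
X = np.array([[0.57, 3.41], [3.41, 3.41], [0, 2], [4, 2],
              [2, 0], [2, 4], [2, 2]], dtype=float)
X = X / X.max(axis=0)                    # scale each input column to its max

y = np.array([3.7, 4.8, 3.7, 8.9, 6.6, 3.6, 4.2])
y = y / y.max()                          # scale the output the same way

W_in = np.array([[0.3, 0.6, 0.7], [0.2, 0.8, 0.5]])  # 2x3 hidden-layer weights
W_out = np.array([[0.4], [0.6], [0.7]])              # 3x1 output-layer weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden = sigmoid(X @ W_in)      # activations of the 3 hidden neurons
pred = sigmoid(hidden @ W_out)  # predicted (scaled) impurity %

print(np.column_stack([y, pred.ravel()]))  # actual vs. untrained prediction
```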

Optimization Problem Overview: Setting Up 

The hands-down most important part of our adventures in optimization is the correct and proper setup of the situation we hope to optimize. In previous posts I gave a glimpse of how to formally define optimization problems. Now you will see the proper way to set up our problems.

All optimization problems start the same way, with a cost or objective function. This function is what we are trying to minimize. Our function can be our cost of ingredients, our time traveled, or seating space in a restaurant. All of these are possible functions. We will call the function we are trying to minimize (our objective function) f(x), where x is a vector of variables. So the first part of our problem setup looks like this.

So far it's a pretty boring optimization problem. We need to add rules or constraints to make our problem more interesting and more meaningful. There are two general types of constraints: equality and inequality. Obviously one type sets our variables equal to something, while the other tells us the relationship of the variable to constants or other variables. However, we like all of our optimization problems to look pretty much the same. This enables us to draw parallels between different problems and hopefully use the same methods of solving. For this reason we put all our inequality and equality constraints in the form below.
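In symbols, the general form we are building toward is the standard one, with f as the objective, the g's as the inequality constraints, and the h's as the equality constraints:

```latex
\begin{aligned}
\min_{x}\quad & f(x) \\
\text{subject to}\quad & g_i(x) \le 0, \quad i = 1, \dots, m \\
& h_j(x) = 0, \quad j = 1, \dots, p
\end{aligned}
```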

Now our problem is starting to get a little more interesting and also conveying more information to anyone else who is looking at it. However, we are not done yet. We have to determine what type of optimization problem we have. By identifying our problem type, we know how to approach solving the problem. Certain methods and solvers work better with certain problem types (remember our no-free-lunch talk). But this is saved for the next post: identifying and categorizing our problem.

Before we go, let’s take our diet problem from yesterday for a spin. Let’s say we live in a small town with one grocery store. This store is poorly stocked and only has 8 items on its shelves at any given time. Each of these items has a cost associated with it and a certain nutritional value. Since we are watching what we eat, we decided to count our macros. Our macros are fats, carbohydrates, and protein. Also, I am going to tack on another “macro”: vitamins. So let’s see what this supermarket has to offer.

Walking down the aisle we see the 8 items. They have apples, steak, gummy vitamins (Vitafusion only), potatoes, orange juice, ice cream, broccoli, and chicken breasts. Before heading home you take note of all the prices and put them in a list below so they are all nice and organized.

[Figure: price list for the 8 items]

Once you get home you open up Chrome and check out some of the nutritional facts on the items from the store. You pop open Excel and make a spreadsheet that lists all the nutritional facts broken down into our four “macros”. The spreadsheet is shown below.

food             fat  carbs  protein  vitamins
apples           0    5      1        3
steak            5    2      10       0
gummy vitamins   0    2      0        10
potatoes         0    8      0        1
orange juice     0    4      0        4
ice cream        10   4      0        0
broccoli         0    5      0        5
chicken breasts  1    2      7        2

As you can see, some foods provide a lot more macros than others. However, upon first inspection I cannot tell which foods are going to be the best options for our diet. But before we determine the optimal diet we need to know how much of each macro we need. Conservatively we guess that we need 40 grams of fat, 60 grams of carbs, 50 grams of protein, and 45 grams of vitamins. With this information, we can formulate our problem. First we need to create a few vectors and matrices. The first vector is going to represent the amount of each foodstuff we buy. The next vector, our cost vector, is going to come from the price list above.

[Figure: the purchase-amount vector x and the cost vector]

One thing we have to realize is that all the above x’s are non-negative, as we cannot vomit up food and sell it back to the store. Anyway, it's starting to look like an optimization problem. We need two more elements: our constraint matrix and constraint vector. These are going to stem from the spreadsheet we made above and our target macros. The constraint matrix (spreadsheet values) is denoted by “A” and the constraint vector (our macro targets) by “b”; they are shown below.

[Figure: the constraint matrix A and constraint vector b]

We have all the necessary elements for our optimization problem. Going back we remember the goal of our optimization, to minimize the amount of money we spend on food. However, this is subject to the constraint that we have to fit our macros. Formally declaring the problem gives us the following.

[Figure: the formally declared diet problem]

There we have it: our first optimization problem. This isn’t exactly standard form, but it is close. In the next couple of posts we will go over various methods to get our problems into standard form. But before that we need to classify our optimization problem. When a problem falls into the form above, we classify it as a linear programming problem. This is because both the objective function (our cost minimizing) and our constraints (macro targets) are linear equations. Linear equations are a nice basis for optimization. Next part we will dive deeper into linear equations and the best ways to solve them.
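If you want to play with the formulation numerically, here's a quick NumPy sketch. Heads up: the prices here are made up (the real price list is in the figure above), and the diet vector is just one feasible guess, not the optimum:

```python
import numpy as np

# A: one row per food, columns are fat, carbs, protein, vitamins (from the spreadsheet)
A = np.array([
    [0, 5, 1, 3],    # apples
    [5, 2, 10, 0],   # steak
    [0, 2, 0, 10],   # gummy vitamins
    [0, 8, 0, 1],    # potatoes
    [0, 4, 0, 4],    # orange juice
    [10, 4, 0, 0],   # ice cream
    [0, 5, 0, 5],    # broccoli
    [1, 2, 7, 2],    # chicken breasts
], dtype=float)
b = np.array([40, 60, 50, 45], dtype=float)             # macro targets
c = np.array([0.5, 6.0, 4.0, 0.3, 2.5, 3.0, 1.5, 5.0])  # hypothetical prices

x = np.array([0, 6, 4, 4, 0, 1, 1, 0], dtype=float)     # one candidate diet

cost = c @ x                           # objective: total money spent
feasible = bool(np.all(A.T @ x >= b))  # constraints: every macro target is met
print(cost, feasible)
```

An LP solver would search over all non-negative x for the cheapest feasible one; this sketch only checks one candidate against the constraints.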

-Marcello


Optimization Problem Overview: Introduction

Now that we have a basic quiver of optimization methods, we need something to use them on. So far we have been limited to functions of varying complexity and difficulty. At this point we need to formally go over what optimization is and why we optimize.

Optimization takes many forms, but most often when we talk about optimization we are talking about a cost (or objective) function (whether this actually refers to money is another story). We set up our problems in such a way that all our information is contained in our cost function. From a more general standpoint, we are trying to either maximize or minimize our function while subject to various constraints or rules.

We will use the classic diet problem as an example. Say we are trying to minimize the amount of money we spend on food while ensuring that we hit all our daily nutritional values. We have lots of foods to choose from, all with their own nutritional value and cost. From a purely mathematical standpoint there is a combination of food that will reach all of our daily values and cost us the least amount of money. This is our optimal solution.

However, sometimes our problems are not always as clear cut as the diet problem. Our question or problem could have different sets of rules. Certain variables could be dependent on each other. Also we could have problems that rely on a choice of yes or no. All these have to fit into our optimization problem. However there are general consistencies amongst most of our optimization problems. Each problem is trying to either minimize or maximize a function. Each problem is also subject to certain parameters which must be followed. Keeping these two things in mind, you can turn almost any real world problem into a rigorous mathematical problem.

But why do we even want to optimize? Well, one reason is efficiency. The first optimization methods were created during the world wars. Our strategists needed to know the most efficient way to supply our troops. This was a big issue: if the troops were not adequately supplied, they had less of a chance of winning. The first optimization methods sought to remedy this. These methods ensured that all troops were adequately supplied with the least amount of waste. After the wars, more complex and interesting optimization methods began to develop. Optimization expanded to even more real-life applications. Even though the subject matter changes, they are all rooted in the desire to carry out objectives efficiently with the guaranteed least amount of waste. Another reason, closely tied to efficiency, is marginal improvement. When we find the optimal conditions for our objective, we know that changing any of our parameters will give us no marginal improvement. This lets us safely assess risk and ensure we are making the right decision, and do both with confidence.

Optimization is an incredibly interesting field and one that I am incredibly passionate about. The applications range from scheduling to determining which subway line to take to get to your destination. Optimization can be used in most of our daily lives. Next time you commute to work, ask yourself if you are taking the optimal route. Are you minimizing time or distance? When you sit down at a restaurant, look around at all the tables. Are they arranged in a way that maximizes occupancy while allowing freedom of movement for the wait staff? The more you look around, the more you will see all the possibilities for optimization, and hopefully foster a passion of your own. But before you try to optimize the next burger joint you walk into, we need to know the most important thing about optimization problems: how to set them up.

-Marcello


Advanced Optimization Methods: Artificial Neural Network Part 2

The previous part was a brief overview of how to construct a neural network. In this part we will go in depth and actually do a little math to create these networks. Just a quick reminder from part 1: our general network contains three layers, one input layer, one hidden layer, and one output layer. For this example I will try to maintain generality. We start with an input layer neuron.

Input layer neurons (ILN) are the simplest (and most boring) neurons of the system. ILN do not accept any inputs, as they are the inputs. Their job is simply to pass the inputs to the hidden layer. There is exactly one input layer for any given NN (as far as I know). Each variable we are looking at gets its own input layer neuron. If we are looking at 3 variables, we would have 3 ILN; if we had 5 variables, we would have 5 ILN; 100 variables, 100 ILN; and so on. They do not modify or weigh the inputs, nor do they apply the activation function before passing on the inputs. The next layer is the hidden layer, which is much more interesting.

[Figure: artificial neuron model (from Wikipedia)]

Above is a schematic of our hidden layer neuron (HLN). The x values here are from our input layer (or a previous hidden layer). These values are then weighted and summed, giving us a new value: z_j = Σ_i w_ij * x_i. This new z value is passed to our activation function. Last post we skipped over the activation function, but it deserves some attention now. The point of our activation function is to introduce non-linearity to our method. For this neural network I have chosen the sigmoid function. Why did I choose this function? For one, it is real-valued and differentiable. Second, it has an interesting property of its derivative: the derivative of the sigmoid function is equal to the sigmoid function times one minus the sigmoid function. This makes the derivative computationally easy. However, the sigmoid function is used most often for the non-linearity I mentioned before. Below is the function (from Wikipedia) as well as the equation and derivative property.

[Figure: the logistic (sigmoid) curve, from Wikipedia]

σ(z) = 1/(1 + e^(−z)),  σ′(z) = σ(z)(1 − σ(z))
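That derivative property is easy to check numerically. Here's a quick Python sanity check of the identity against a centered finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# check sigma'(z) = sigma(z) * (1 - sigma(z)) at a few points
for z in (-2.0, 0.0, 1.5):
    h = 1e-6
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # finite difference
    analytic = sigmoid(z) * (1 - sigmoid(z))               # the cheap formula
    assert abs(numeric - analytic) < 1e-8

print(sigmoid(0.0))  # 0.5
```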

Using this activation function, we then take our z value, which is the summation of our newly weighted inputs, and put it through our sigmoid function. The result of the activation function becomes the input for the next layer, which is our output layer. However, before we move on to an output neuron, we have to go over how to determine how many hidden layers we need and how many neurons are in each of those layers. For most applications one layer is appropriate. One thing to know is that increasing the number of layers gives only marginally better results. There are a good number of differing opinions on the number of hidden layers needed, and there is even more confusion about the number of hidden neurons needed in those layers. A quick heuristic is that the number of hidden neurons should be somewhere between the number of input neurons and output neurons. However, this is just a rule of thumb and should not be taken as hard and fast. Really, the best way is to test out your network and “prune” out those neurons that aren't contributing. Before we get to that we have to look at our last layer, the output layer.

An output layer neuron (OLN) is just like a hidden layer neuron; however, the output of the OLN is the answer or prediction we have been looking for. The output layer takes all the results of the activation functions from the previous layer and weighs them. It then sums all the weighted values and puts them through our sigmoid function again. After this pass our method is complete. Like the input layer, there is only one output layer. The number of neurons in this layer depends on the problem. If we are trying to predict a value, only one neuron is needed; however, if you are trying to classify a data point you may need more than one neuron.

One big thing we have overlooked so far is why and how ANN work. Well, we have to fine-tune the weights for each neuron (w_ij). By changing and optimizing these values we can make the network give us the information we want and need. This is called training our network, and we will go over it in the next part.

-Marcello

Advanced Optimization Methods: Artificial Neural Networks Part 1

If you’ve been following along, we’ve finally got to the fun stuff. The first “advanced” topic we will explore is Artificial Neural Networks (ANN). An ANN is a simplified model of a network of “neurons” like the ones that make up our brains. Each neuron takes in information and transmits it to another neuron, which in turn analyzes (weighs) and modifies the input and sends it to the next. Well, how does this help us optimize? Neural networks let us predict and optimize problems at a greater speed and with greater accuracy than other methods. That is, after they are trained, but more on that later. First let's look at an individual neuron, then see how it interacts in a network.

[Figure: multipolar neuron diagram (from Wikipedia)]

Above is a diagram of a neuron (from Wikipedia), the ones actually in your brain. Our artificial neurons look very similar to the one above. Now, not all the structures labeled in the above picture are important. The two we are concerned with are the dendrites and the axon; take a quick note that both of these structures have branches. Just a brief tangent from when I used to work as a neuroscientist at the Motor Neuron Center: neurons work by transmitting signals through their axons and receiving signals from their dendrites. Neurons can receive tons of connections from other neurons through their dendrites. However, they can only have one outgoing connection through their axon. This is a very simplified explanation, but nevertheless important. Our artificial neurons will work in pretty much the same way.

[Figure: artificial neuron model (from Wikipedia)]

Above is a diagram of an artificial neuron (from Wikipedia), the ones we will use in our method. Unlike our real neuron diagram, all the labels are important here. We will start with the axons from other neurons (inputs), to the dendrites (weights), into the cell body (transfer function/activation function), and then out of the neuron to the next neuron through the axon (activation). So we have an idea of how the data moves through a neuron, but what does any of this actually mean?

So imagine we have the neuron pictured above. It has n axons attached to it. Each axon will transmit a number (x_n) to this neuron. The neuron then weighs each of these numbers by multiplying it by its respective weight (w_nj). These newly weighted numbers are then summed at our transfer function step. This new number is then passed to our activation function. This is where the power of neural networks is derived: different activation functions can be used to achieve more accurate methods than their linearly bound counterparts. After the weighted sum is passed through our chosen activation function, we pass the newly created value to the next neuron and the process repeats. This leads us to the network part of neural networks.
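The whole journey through one neuron, weigh, sum, activate, fits in a few lines of Python. The inputs and weights here are made up for illustration:

```python
import numpy as np

# One artificial neuron: weigh the incoming values, sum them,
# then apply the activation function (sigmoid here)
def neuron(x, w):
    z = np.dot(w, x)                 # transfer function: weighted sum
    return 1.0 / (1.0 + np.exp(-z))  # activation: sigmoid

x = np.array([0.5, 0.1, 0.9])  # signals arriving on n = 3 "axons" (hypothetical)
w = np.array([0.4, 0.3, 0.6])  # this neuron's weights (hypothetical)

print(neuron(x, w))  # a single value, passed on to the next layer
```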

[Figure: simple three-layer neural network (from Wikipedia)]

Above is an example of a very simple neural network. The network consists of three parts or layers: input layer, hidden layer, and output layer. Each layer can be made up of as few or as many neurons as the user designs. It must also be noted that the hidden layer is not restricted to a single column of neurons. There can be 3, 4, or 100 layers in the hidden portion depending on the design and the neurons required; these, however, must be designed before any optimization takes place. Another thing to note is that all neurons point forward. This is called a feed-forward design. Having this limitation becomes important when we look at the optimization portion, but that comes later. Each neuron outputs to every neuron in the next layer and receives inputs from every neuron in the previous layer.

That’s a general overview of how feed-forward artificial neural networks work. Next part we will derive some equations and make a general outline of how to create an ANN.

-Marcello

Gradient Search Methods: Newton-Raphson Method Part 3

Most of the limitations of NRM in multiple dimensions are the same as those for NM in one dimension. The main issues stem from three major areas (per usual): the initial point, derivatives, and function evaluations.

Our initial value for Newton's method greatly determines its effectiveness. The closer our initial guess is to our minimum, the better the method will work. However, this requires some intuition about the function, which can get quite complex. Let's take a look at two different initial points, (20,20) and (2,2). The graph for the (20,20) run is below.

[Plot: Newton-Raphson iterations starting from (20,20)]

As you can see, the method is flying back and forth past our minimum. However, it eventually converges at (0.9987, 0.4987), our target minimum. Below I have included a matrix containing our initial step and all of the following steps (taken from the MATLAB output).

[Matrix of iterates for the (20,20) starting point, from the MATLAB output]

Now let's try it again with our other initial point, (2,2). This is already close to our minimum, so it shouldn't take too many iterations.

[Plot and matrix of iterates for the (2,2) starting point]

The closer initial starting point converged in a fifth of the iterations, which means fewer function evaluations. This brings us to our next two issues: derivatives and function evaluations. We need to be able to find the Hessian of our function as well as the gradient. If we cannot take these two, we will need to pick a different method or modify our existing one. Most problems stem from the Hessian. Sometimes the Hessian cannot be found; other times it can be found but not inverted; but most of the time it is simply too expensive to evaluate or invert. Mathematicians have come up with alternatives called quasi-Newton methods (these actually stem from the secant methods I mentioned before) which seek to remedy this issue. Like the one-dimensional version, there are a lot of function evaluations. Each component of the gradient must be evaluated, as well as all of the second derivatives that make up the Hessian. The Hessian then needs to be inverted, which is expensive in higher dimensions.
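The initial-point sensitivity is easy to check numerically. Below is a small Python translation of the Newton-Raphson iteration on the same f(x,y) = x^2 - x + cos(x+y) + y^2 used throughout these posts (a sketch of my own, not the original MATLAB code), run from both starting points.

```python
import math

def newton_raphson(x, y, tol=1e-4, max_iter=200):
    """Newton-Raphson on f(x,y) = x^2 - x + cos(x+y) + y^2.
    Returns (x, y, iterations taken)."""
    for k in range(1, max_iter + 1):
        s, c = math.sin(x + y), math.cos(x + y)
        g0, g1 = 2 * x - 1 - s, 2 * y - s        # gradient components
        h11, h12 = 2 - c, -c                     # symmetric 2x2 Hessian entries
        det = h11 * h11 - h12 * h12
        dx = (h11 * g0 - h12 * g1) / det         # H^-1 * g via the 2x2 inverse
        dy = (h11 * g1 - h12 * g0) / det
        x, y = x - dx, y - dy
        if abs(dx) < tol and abs(dy) < tol:      # same stopping condition as the post
            return x, y, k
    return x, y, max_iter

x_far, y_far, far_iters = newton_raphson(20.0, 20.0)
x_near, y_near, near_iters = newton_raphson(2.0, 2.0)
```

From (2,2) the method settles on the minimum near (0.9987, 0.4987) in a handful of steps, while the (20,20) run needs noticeably more iterations before it stops bouncing around.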

Now you have two tools to deal with finding minima in higher dimensions. Up next we will explore new, more advanced methods for optimization.

-Marcello

Gradient Search Methods: Newton-Raphson Method Part 2

Last post we got an idea of how Newton-Raphson works (and how similar it is to the one-dimensional version). Here we will explore step by step how the algorithm works and how quickly it arrives at our minimum. We will be using the same function as before so we can compare how quickly the method descends. Here is our function in both a surface plot and a contour plot as a reminder: f(x,y) = x^2 - x + cos(x+y) + y^2.

[Surface plot and contour plot of f(x,y) = x^2 - x + cos(x+y) + y^2]

As you can see, it is a relatively simple, convex surface. In order to use Newton-Raphson, we need to take both the gradient and the Hessian of our function. A quick mathematics reminder: the gradient is the vector containing the first partial derivatives of the function with respect to each of its variables. Below you can see the general case for the gradient as well as the specific gradient for this function.

\[
g = \nabla f = \begin{bmatrix} \partial f/\partial x \\ \partial f/\partial y \end{bmatrix}
= \begin{bmatrix} 2x - 1 - \sin(x+y) \\ 2y - \sin(x+y) \end{bmatrix}
\]

As I mentioned last post, we also need the equivalent of the second derivative for Newton-Raphson. This takes the form of the Hessian. Another quick mathematics reminder: the Hessian is the square matrix consisting of all the second partial derivatives of our function, taken exhaustively with respect to every pair of variables. To clarify this, I have left the general case below along with our specific case.

\[
F = \begin{bmatrix} \partial^2 f/\partial x^2 & \partial^2 f/\partial x\,\partial y \\ \partial^2 f/\partial y\,\partial x & \partial^2 f/\partial y^2 \end{bmatrix}
= \begin{bmatrix} 2 - \cos(x+y) & -\cos(x+y) \\ -\cos(x+y) & 2 - \cos(x+y) \end{bmatrix}
\]

Note that the Hessian above is a square matrix (it always should be). The second check that has to be made is to ensure that the Hessian is invertible. Since we are dealing with matrices we cannot "divide" them; we must use the inverse of the Hessian instead. This comes into play with our iterative step. Now, to start our method we need an initial point; we are going to use the same one as we did with Steepest Descent, (8,8). With these values we calculate our Hessian and gradient, which gives us the following.

[Gradient and Hessian evaluated at the initial point (8,8)]

We now take the inverse of the Hessian and then find our next point. The iterative equation is listed below as well as the output.
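That first step is straightforward to verify numerically. Here is a small sketch (in Python with NumPy rather than the original MATLAB, and using `np.linalg.solve` instead of an explicit inverse) evaluating the gradient and Hessian at (8,8) and taking one Newton-Raphson step:

```python
import numpy as np

def f(p):
    return p[0]**2 - p[0] + np.cos(p[0] + p[1]) + p[1]**2

x = np.array([8.0, 8.0])                     # same starting point as Steepest Descent

s, c = np.sin(x.sum()), np.cos(x.sum())
g = np.array([2 * x[0] - 1 - s, 2 * x[1] - s])   # gradient at (8,8)
F = np.array([[2 - c, -c], [-c, 2 - c]])         # Hessian at (8,8)

# One iterative step: x_next = x - F^-1 g (solve is cheaper and more stable
# than forming the inverse explicitly).
x_next = x - np.linalg.solve(F, g)
```

A single step already drops the function value considerably, which matches the "considerable decrease" described below.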

\[
x_{k+1} = x_k - F(x_k)^{-1}\, g(x_k)
\]

[Output of the first iteration]

Our new point was a considerable decrease from our starting point. We repeat the above steps until we reach our stopping conditions which I set as our normal tolerance limits. I allowed the method to iterate until these conditions were met. The steps are outlined in the plot below.

[Contour plot showing the Newton-Raphson steps from (8,8) to the minimum]

The graph above looks similar to our dynamic steepest descent. As you can see, the method was very direct. Taking only 5 iterations, our method produces a minimum of (0.9987, 0.4987). This is pretty much dead on the actual minimum of our function. Newton-Raphson makes quick work of the minimization. Next post we will look at the limitations of this method. Code below.

-Marcello


%Newton-Raphson in 2 dimensions
%f(x,y) = x^2 - x + cos(x+y) + y^2

x=zeros(2,7);
z=zeros(1,7);

x(1,1)=8; %initial guess
x(2,1)=8;
z(1) = x(1,1)^2 - x(1,1) + cos(x(2,1)+x(1,1)) + x(2,1)^2; %initial function value

tol(1)=1; %start tolerances above the threshold so the loop runs
tol(2)=1;

i=2;


while tol(1) > 0.0001 && tol(2) > 0.0001 %stop conditions

    %gradient
    dx1 = -1+2*x(1,i-1)-sin(x(1,i-1)+x(2,i-1)); %df/dx
    dx2 = 2*x(2,i-1)-sin(x(1,i-1)+x(2,i-1)); %df/dy

    g=[dx1,dx2]'; %gradient vector

    %hessian
    ddx1 = 2 - cos(x(1,i-1)+x(2,i-1)); %same as ddx2
    dx1dx2 = -cos(x(1,i-1)+x(2,i-1)); %same as dx2dx1

    h=[ddx1 dx1dx2; dx1dx2 ddx1]; %hessian matrix

    x(:,i) = x(:,i-1) - h\g; %iterative step: x(k+1) = x(k) - inv(H)*g

    tol(1) = abs(x(1,i)-x(1,i-1)); %tolerance calc
    tol(2) = abs(x(2,i)-x(2,i-1));

    z(i) = x(1,i)^2 - x(1,i) + cos(x(2,i)+x(1,i)) + x(2,i)^2; %function eval

    i=i+1;
end


Gradient Search Methods: Newton-Raphson Method Part 1

This method is gonna feel more than a little familiar. But before we get to the nitty gritty I have to give a disclaimer: this method is not purely a gradient method, as it does not rely solely on the gradient for optimization. It uses the gradient along with other properties of the function to arrive at our minimum. Now that that has been cleared up, let's move on to the method.

We are going to take a similar approach as we did with our one-dimensional search methods. Steepest Descent reminds us of the golden section search and the Fibonacci search; while they are used differently, there are prominent similarities. When we looked at Steepest Descent we had two cases, one with a constant step size and another with a variable step size. Eventually those methods hit their limits and we needed a new one. At that point we moved on to Newton's method. Well, we are going to do that again, only in higher dimensions.

The next part is gonna be verrrrry familiar.

As you know, the first and most important thing when thinking about using the Newton-Raphson method is to ensure that the function is twice differentiable. Like in one dimension, the Newton-Raphson method uses a truncated Taylor series to estimate where the minimum lies.

The Newton-Raphson method begins with a single point, which is used as the basis for our Taylor expansion. We need to take a brief detour to go over why we want to construct a Taylor expansion: by doing this we create a quadratic approximation of our function locally, which we can in turn minimize. We repeat this until we hit a stationary point, hopefully our minimum. So let's start with the Taylor series in N dimensions; however, we are only going to focus on the first three terms (this gives us up to the quadratic approximation).

\[
f(x) \approx f(x_k) + g(x_k)^T (x - x_k) + \tfrac{1}{2}(x - x_k)^T F(x_k)(x - x_k)
\]

The above equation looks a little different than our previous Taylor expansion. As you can see, there are no squares and no explicit first or second derivatives (well, not completely true). The first derivative here is none other than our gradient, which is denoted by g. However, if you remember from the last time we dealt with the Taylor series, the third term deals with the second derivative. In n dimensions we need an equivalent to the second derivative, and for this we use the Hessian matrix, which I have denoted as F. Now we use our first-order necessary condition for optimality, which sets the gradient of our approximation equal to zero. From this we can determine our iterative step.

Setting the gradient of the quadratic approximation to zero gives

\[
g(x_k) + F(x_k)(x - x_k) = 0 \quad\Longrightarrow\quad x_{k+1} = x_k - F(x_k)^{-1} g(x_k)
\]

This looks awfully similar to the iterative step from our one-dimensional Newton's method. As we get closer to our minimum, the gradient g of our function moves toward 0, which makes xk ≈ xk+1. This is when we know our method has found a minimum (hopefully). Hopefully this all feels familiar. Next post we will see how Newton-Raphson performs.

-Marcello