Journal Club: Week of 11/13/2015

Got two more for you this week: one on machine learning and the other on multivariate regression. Check them out.

Supervised Machine Learning: A Review of Classification Techniques
By S.B. Kotsiantis
University of Peloponnese (2007)

This paper reviews a subset of supervised machine learning algorithms with a focus on classification. Because of the vast number of algorithms available, the author organizes the paper around key features of the algorithms. First the author gives a brief overview of machine learning in general: why and how it is used. What I liked most about this paper is that, even before any algorithms are mentioned, the author talks about general issues with classifiers and algorithm selection. This prepares the reader and removes the notion of a “silver bullet” algorithm.
The article is well organized. Kotsiantis starts with the most intuitive of machine learning algorithms, decision trees, and works his way up to more recent (well, for 2007 at least) techniques. Each section covers several techniques under its subheading; for example, the statistical learning section contains Naïve Bayes and Bayesian networks. I liked this organization, as it guides the reader into more complex techniques. One thing the paper lacks is depth. Most techniques are rushed through and not fully explained, but the paper’s purpose is not to outline precise steps for implementing each technique but rather to familiarize the reader with the existence of certain techniques.
Another criticism I have is that the paper feels a little dated. This is no fault of the author, of course, but a more recent paper may be worth following up on. There is a very useful table in the paper comparing the different techniques in terms of speed, tolerance, and other parameters. However, it should be checked for accuracy, as it might be outdated.

Partial Least Squares Regression: A Tutorial
By Paul Geladi and Bruce R Kowalski
Analytica Chimica Acta, 185 (1986) 1-17

Here is an oldie but a goodie. When I was first learning about Partial Least Squares (PLS, sometimes called projection to latent structures), there was a vast number of papers, but none really drove the point home for me. I went back to one paper that was constantly being cited, this one from 1986. It provides a very clear tutorial on how to get PLS up and running, and it assumes you have an understanding of linear algebra. Starting with data preprocessing, the paper states what form your data needs to be in and how to get it into that form.
The paper takes a detour, however. It first goes over existing methods like multiple linear regression and principal component regression before it begins to explain PLS. This was both good and bad for me: I was solely interested in PLS, but the other tutorials gave insight into quick, rudimentary ways of using other regression methods. Still, I was here for the PLS. After the detour, the paper dives straight into building the PLS model. Take care reading this section, as the explanation is sparse. Overall, it’s not the best tutorial, but it has two invaluable takeaways. The first is Figure 9, which shows a geometrical representation of all the inputs and outputs the PLS model uses. It shows the exact dimensions of each and how they relate to each other.
The other is the sample PLS algorithm. The appendix of the paper contains an almost pseudocode-like description of the PLS algorithm. Using this, I was able to get a PLS program up and running in less than an hour. The algorithm clearly shows every step that must be taken and exactly how to do it. This is the main reason I would recommend this paper. There are others out there that explain PLSR better, but this paper allows for a rapid implementation of PLS.
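To give a feel for how short such a program can be, here is a minimal sketch of a NIPALS-style PLS loop in Python. It is not a transcription of the paper's appendix: it assumes a single response variable (often called PLS1), mean-centering as the only preprocessing, and the function names `pls1_fit` and `pls1_predict` are made up for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with a NIPALS-style loop (sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E = X - x_mean  # X residuals, starting from mean-centered X
    f = y - y_mean  # y residuals, starting from mean-centered y
    n_features = X.shape[1]
    W = np.zeros((n_features, n_components))  # weight vectors
    P = np.zeros((n_features, n_components))  # X loadings
    q = np.zeros(n_components)                # y loadings
    for a in range(n_components):
        w = E.T @ f                  # direction in X most covariant with y
        w /= np.linalg.norm(w)
        t = E @ w                    # scores for this component
        tt = t @ t
        p = E.T @ t / tt             # regress X residuals on the scores
        q[a] = f @ t / tt            # regress y residuals on the scores
        E = E - np.outer(t, p)       # deflate X
        f = f - q[a] * t             # deflate y
        W[:, a], P[:, a] = w, p
    # Collapse the components into one coefficient vector for centered X
    coef = W @ np.linalg.solve(P.T @ W, q)
    return {"x_mean": x_mean, "y_mean": y_mean, "coef": coef}

def pls1_predict(model, X):
    """Predict y for new rows of X using a model from pls1_fit."""
    X = np.asarray(X, dtype=float)
    return (X - model["x_mean"]) @ model["coef"] + model["y_mean"]

# Made-up, noise-free data just to exercise the functions
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0
model = pls1_fit(X, y, n_components=3)
pred = pls1_predict(model, X)
```

With as many components as there are columns in X, this reduces to ordinary least squares, so the toy data above is recovered exactly. The interesting cases are the ones with fewer components than columns.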

-Marcello

Follow The Money: Federal Legislature Part 2

In the last part we took a look at campaign donations to New Jersey state legislators. Now we are moving on up to the US House and Senate. The stakes are a little higher, the politicians have more power, and hopefully their campaigns are full of donations. Luckily for me, we have Followthemoney.org on our side.

All data for the following graphs was collected using followthemoney.org’s API. This made it easy to tabulate and graph all the recorded donations. First up is Democrats vs. Republicans.

[Figure: federal legislature donations by party]

This follows the state legislature pretty closely. Democrats stomp Republicans in terms of donations; however, this may be due to our data source rather than reality. 2014 and 2010 show close donation totals, while 2012 shows a blowout. 2013 seems to be completely missing Republican data. That, or only Democrats won.

One important qualification to make on this data set is that it only represents donations to candidates who won their elections. We need context for 2013: as it is an off-year election, there must be some special circumstance. Luckily, Wikipedia is here to help out. Sadly, a senator passed away during this time and a special election was held. As we suspected, a Democratic candidate won. This may have contributed to the lopsided data. Now let’s see if office matters at all.

[Figure: federal legislature donations by office]

Depending on the year, it looks like office matters quite a bit. The special Senate election in 2013 influenced all campaign spending that year. 2010 was similar to 2013, but completely dominated by House campaign donations. As you probably know, House seats are up every two years. In the data above, House donations are all in the same range except in 2013, when there was no House election. Senate elections, on the other hand, are also held every two years, but only about 1/3 of the seats are up each time. New Jersey Senators were up for reelection in both 2012 and 2014 but not in 2010, explaining the lack of Senate donations that year. Finally, let’s look at industry donations in 2012.

[Figure: federal legislature donations by industry]

Here we see uncoded donations eclipsing all the other industries. After seeing uncoded in part 1, I investigated. Uncoded actually includes PAC donations as well as individual donations. This is why uncoded always comes in as the largest category. I did some quick calculations to see what percentage came from individuals like you and me and what percentage came from corporations and other PACs.

Individual       $14,760,750.00
Non-Individual   $ 1,412,439.00
Grand Total      $16,173,189.00
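As a quick check of the split, the percentages can be worked out directly from the two totals above. This is just the arithmetic, with variable names chosen for illustration:

```python
# Totals taken from the table above
individual = 14_760_750.00
non_individual = 1_412_439.00
total = individual + non_individual

# Share of all uncoded donations from each group, in percent
pct_individual = 100 * individual / total
pct_non_individual = 100 * non_individual / total

print(f"Individual:     {pct_individual:.1f}%")      # about 91.3%
print(f"Non-individual: {pct_non_individual:.1f}%")  # about 8.7%
```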

Overwhelmingly, the donations stemmed from individuals. That is super surprising to me. There are a lot more visualizations I can do with this data, but before that, we have to go nationwide.

-Marcello

Find the data here: NJfedDon

Follow The Money: State Legislature Part 1

The 2016 election is rapidly approaching, and one of the major issues of this year’s race is campaign finance reform. I am not big into politics, but I am well aware of the Citizens United v. FEC ruling. One thing I do not know, however, is on what scale politicians actually receive donations. I set out to see how much an average senator or congressman actually receives in a given year.

My intuition led me to believe that these men and women were pulling in millions of dollars each year in donations, but that may be based on watching a little too much House of Cards. The first thing I needed was the data. Luckily for me, all politicians at the state level are required to file info on their finances. Even luckier for me, there is an amazing website that databases it all and has an easy-to-use API:

Followthemoney.org

First I wanted to start at the state level, looking at state senators and assemblymen. My guess was that these people were not pulling in the big bucks when it came to campaign donations. I downloaded a data set from Followthemoney.org which contained records of donations to lawmakers in the state of New Jersey. From there I cleaned it up and visualized it. Heads up, lots of bar charts coming!
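For a rough idea of the tabulation behind the bar charts, here is a minimal sketch using a pandas pivot table. The column names (`Election_Year`, `General_Party`, `Total_$`) match the ones used in my code at the end of this post, but the rows and party labels below are made up for illustration, and this is a more compact alternative to that code rather than a copy of it.

```python
import pandas as pd

# Made-up donation records, one row per donation
donations = pd.DataFrame({
    "Election_Year": [2011, 2011, 2011, 2013, 2013],
    "General_Party": ["Democratic", "Republican", "Democratic",
                      "Democratic", "Republican"],
    "Total_$": [500.0, 250.0, 100.0, 300.0, -50.0],
})

# Drop negative donations (refunds and corrections), as in the cleanup step
donations = donations[donations["Total_$"] >= 0]

# One row per party, one column per election year, summed donations in cells
by_party = donations.pivot_table(
    index="General_Party",
    columns="Election_Year",
    values="Total_$",
    aggfunc="sum",
    fill_value=0,
)
by_party["Grand Total"] = by_party.sum(axis=1)
print(by_party)
```

Swapping `General_Party` for another column such as `General_Office` or `General_Industry` gives the other breakdowns.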

With New Jersey leaning Democratic, I expect the Democrats to pull in a little more money than the Republicans.

[Figure: state legislature donations by party]

WOW, that’s a big difference. However, there seems to be an issue: our data doesn’t look complete. Look at 2012 and 2014; there is missing data for both parties. The total amount is lower than it was back in 1997. Knowing that this data might be incomplete, all analysis must be taken with a grain of salt. Let’s move on to Senate vs. House.

Senators in New Jersey serve one two-year term and two four-year terms every ten years, which is known as a 2-4-4 term system. This means that this year all the State Senate seats are up. This makes me question the data even more, as 2015 is relatively low compared to, say, 2011, another year where all seats were up. State House members serve two-year terms. I have two conflicting trains of thought. One is that Senators will receive more donations, as the contributor gets more bang for their buck, to put it bluntly. Two is that assemblymen get more donations, as they are up for election more frequently and constantly need to replenish the war chest. Let’s see.

[Figure: state legislature donations by office]

Looks like Senators outdo Assemblymen. Look at 2011: this year all State Senate seats were up for election. A grand total of around $31 million was raised that year. That’s pretty impressive, but where is all this money coming from? Let’s take a look. I’m going to stick to 2011, as it seems to be the most complete out of all the years.

[Figure: state legislature donations by industry]

And our winner is Uncoded, with unitemized contributions a distant second. What does this mean? According to followthemoney.org, unitemized contributions are donations that are under the reportable limit. They are aggregated and listed under this heading. For New Jersey, the limit is $300 from an individual. As for uncoded, this money can come from various industries or, most prominently, previous years. Uncoded gives an idea of how much these politicians have stocked up in the war chest.

As for the others, General Trade Unions comes in third and Lawyers & Lobbyists in fourth, at around half of General Trade Unions. This is interesting, as my previous beliefs on donations were based on big conglomerates or super PACs donating massive amounts of money, not general trade unions. Nevertheless, this is the state level; maybe when we look at the federal level there will be many more, for lack of a better word, interesting donors.

Pretty interesting. If you wanna take a look at the data set yourself, I’ve included it here: NJlegDon. My code is copied below if you wanna check it out (very unoptimized and also in Python!).

-Marcello

 

"""

@author: Marcello

Campaign Donation NJ totals
data sourced from: followthemoney.org

goal of program is to breakdown campaign donations to NJ Senators and 
Congressmen who are currently in office.
"""
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def summarytablegen(variable_list, varname):
    """Build a table of donation totals per value of `varname`, by year.

    Relies on the module-level `df`, `election_year` and
    `columns_4_summary`, which are defined before this is called.
    """
    index = np.arange(len(variable_list))
    donation_summary2 = pd.DataFrame(columns=columns_4_summary, index=index)
    donation_total = {}
    total_year = 0
    i = 0
    for variable in variable_list:
        for year in election_year:
            # rows for this year and this value of the grouping column
            df1 = df.loc[(df['Election_Year'] == year) & (df[varname] == variable)]
            total_donation = df1['Total_$'].sum()
            donation_total[str(year)] = total_donation
            total_year = total_year + total_donation
        donation_total['Variable'] = variable
        donation_total['Grand Total'] = total_year
        donation_summary2.loc[i] = pd.Series(donation_total)
        i = i + 1
        # reset the accumulators for the next value
        donation_total = {}
        total_year = 0
    return donation_summary2

# data preprocessing, removing unnecessary columns

# DataFrame.from_csv has been removed from pandas; read_csv with index_col=0 replaces it
df = pd.read_csv('NJlegDon.csv', index_col=0)

df=df.reset_index()

column_names = df.columns.values.tolist()

columns_to_drop = ['request','Election_Year:token','Election_Year:id','Lawmaker:token',
'Office:token','Office:id','General_Office:token','General_Office:id',
'General_Party:token','General_Party:id','Contributor:token','Type_of_Contributor:token',
'Type_of_Contributor:id','General_Industry:token','Broad_Sector:token','In-Jurisdiction:token',
'In-Jurisdiction:id','#_of_Records']

df = df.drop(columns=columns_to_drop)

# drop all negative donations

df = df[df['Total_$'] >= 0]

# %% find total donations for candidate by year
Lawmaker_Id = list(set(df['Lawmaker'].tolist()))
election_year = list(set(df['Election_Year'].tolist()))
Industry = list(set(df['General_Industry'].tolist()))
party = list(set(df['General_Party'].tolist()))
office= list(set(df['General_Office'].tolist()))

# election years as strings, used as column names in the summary tables
str1 = [str(e) for e in election_year]

columns_4_summary = ['Variable','Grand Total']
columns_4_summary.extend(str1)


dflawmaker=summarytablegen(Lawmaker_Id,'Lawmaker')
dfIndustry=summarytablegen(Industry,'General_Industry')
dfparty = summarytablegen(party,'General_Party')
dfoffice = summarytablegen(office,'General_Office')


 
#%% 
# breakdown by party 

party = pd.melt(dfparty, id_vars=['Variable'], value_vars=str1,var_name='year', value_name='Donations')

colors = ["windows blue", "red"]
ax = sns.barplot(x="year", y="Donations",hue="Variable", data=party,palette=sns.xkcd_palette(colors))
ax.set( ylabel='Donation Total')