





The schedule and teams
However, simulating a tournament is not as easy as flipping a coin for each matchup. The generated data needs to carry weight: it needs to be significant and, most importantly, it needs to model real life. To give weight to my simulations, I needed to determine probabilistically which team was going to win each matchup.
In theory, in any matchup the team with the higher skill should beat the team with the lower skill more often than not. The best way to quantify this is to calculate an Elo rating, which ideally represents a team’s skill. Elo is used in many online multiplayer games to determine a player’s skill. It originated in the chess scene: named after its creator, Arpad Elo, it was used to rank the top players based on who they competed against. Elo has since been extended to many other sports and games. Most notably, Nate Silver and his team at 538 have been using Elo for all types of sports. They have built Elo models for football and basketball as well as baseball. However, you need match history and other data to build an accurate Elo for a team or player.
I scraped two years of professional game results, around 700 games. Each team in those two years started with an Elo of 1500. As teams competed against each other, their Elos fell, rose, and stabilized. The method for calculating Elo is easy and the formula is quite straightforward (check out this tutorial here). You need a few things to calculate Elo: the rating of both teams before the match, the outcome of the match, and a “K factor”.
Formulas for calculating Elo ratings
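
The formula image did not survive here, so as a rough sketch this is the standard Elo update written out in Python. The function names are my own for illustration, and the default K of 20 is the value the post settles on; the post's actual code may have looked different.

```python
def expected_score(rating_a, rating_b):
    """Probability that team A beats team B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, a_won, k=20):
    """Return the new (rating_a, rating_b) after a single game."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    # The winner gains exactly what the loser gives up.
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Two teams that both start at 1500; team A wins the game.
new_a, new_b = update_elo(1500, 1500, a_won=True)
print(new_a, new_b)  # 1510.0 1490.0
```

An upset moves the ratings more than an expected result does, because the winner's expected score was low going in.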
Ratings and outcomes are handled by the data and the equations above; the “K factor”, however, must be estimated. I tried out a few K factors to determine which provided the most stable but still reactive Elos, and settled on a value of 20. I looked at a lot of Nate Silver/538's work to see if they had any insight. I highly suggest reading how they calculate their Elo values, as the insight was invaluable.
Using my K factor, I started to run through the match history of each team. As I mentioned before, I gave every team an equal start at 1500 Elo; it was up to them to raise or lower their rating. One thing to keep in mind: around summer 2016, games were handled differently. Instead of one game, LoL switched to best of 5. There were two ways to approach this: you could update Elo based on who won the set, or based on individual games. I decided to update based on individual games, as a 3-0 win is more telling of who is better than a close 3-2 win.
Nate Silver uses game scores, home court advantage, and margins of victory in his Elo calculations. LoL is different, as there are no “home courts” in an online game. There is no game score or easily measurable margin of victory or point spread. There are W’s and L’s. As Dustin Pedroia is quoted saying in Nate Silver’s book, “All I care about is W’s and L’s.” So my Elo calc is simple: it only cares about who won and who lost. Maybe down the line that will change, maybe after I compare the results of the spring playoffs to my predictions. After all these calcs and all their games were run, I had an Elo for every team.
Finally, using the Elo calculated after two years of games (less for more recent LCS teams, looking at you FLY), I could probabilistically determine the outcome of any matchup. My mind immediately went to my favorite statistical hammer: Monte Carlo simulation. If you have read my previous posts, I have used Monte Carlo simulation here and here.
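
To make the idea concrete, here is a minimal sketch of a Monte Carlo bracket simulation driven by Elo. Everything specific in it is an assumption for illustration: the four-team single-elimination bracket, the team names, the ratings, and keeping ratings fixed during a simulated run. It is not the exact playoff format or code used for the results in this post.

```python
import random
from collections import Counter


def win_probability(rating_a, rating_b):
    """Standard Elo expected score for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def play_series(team_a, team_b, ratings, rng, best_of=5):
    """Simulate a best-of-N series game by game and return the winner."""
    needed = best_of // 2 + 1
    p_a = win_probability(ratings[team_a], ratings[team_b])
    wins_a = wins_b = 0
    while wins_a < needed and wins_b < needed:
        if rng.random() < p_a:
            wins_a += 1
        else:
            wins_b += 1
    return team_a if wins_a == needed else team_b


def simulate_bracket(seeds, ratings, rng):
    """Single elimination; adjacent entries meet each round.

    Assumes the number of teams is a power of two.
    """
    teams = list(seeds)
    while len(teams) > 1:
        teams = [play_series(teams[i], teams[i + 1], ratings, rng)
                 for i in range(0, len(teams), 2)]
    return teams[0]


def title_odds(seeds, ratings, n_sims=50_000, seed=0):
    """Share of simulated brackets won by each team."""
    rng = random.Random(seed)
    champions = Counter(simulate_bracket(seeds, ratings, rng)
                        for _ in range(n_sims))
    return {team: champions[team] / n_sims for team in seeds}


# Hypothetical ratings and seeding, for illustration only.
ratings = {"A": 1600, "B": 1550, "C": 1500, "D": 1450}
odds = title_odds(["A", "D", "B", "C"], ratings, n_sims=10_000)
print(odds)
```

Each simulated bracket is one possible playoff; the share of brackets a team wins is its estimated chance at the title.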
50,000 simulations of the spring 2017 playoffs later, I produced the bracket you see below.
See original visualization here: http://setosa.io/blog/2014/07/26/markov-chains/
After finishing up my last post about modelling artists and their probability of releasing consecutive Best New Music albums (see part 1 here), I got to thinking about what else I could use the scraped data for. I had all album reviews from 2003 to present, including the relevant metadata: artist, album, genre, author of the review, and date reviewed. I also had the order in which they were reviewed.
Then, with Markov chains still fresh in my mind, I got to thinking: do albums get reviewed in a genre-based pattern? Are certain genres likely to follow others?
Using the JavaScript code from http://setosa.io/blog/2014/07/26/markov-chains/, I plugged in my labels (each of the genres) and the probabilities of state change (moving from one genre to another), which resulted in the 9-node chain at the top of the post. If you let the chain run a little while, you will notice a few patterns. The most obvious pattern is that all roads lead to Rock. For each node, the probability of the next album being a rock album is close to 50%. This is because not all genres are equally represented, and also because of the way Pitchfork labels genres. Pitchfork can assign up to 5 genres to an album it reviews. With up to 5 possibilities to get a spot, some genres start to gain a lead on others. Rock, for instance, is tacked on to other genres more frequently than any other genre. This causes our Markov chain to highly favor going to Rock rather than to genres like Global and Jazz, which are not tacked on to others as frequently. So if you are the betting type, the next album Pitchfork will review is probably a rock album.
-Marcello
See original visualization here: http://setosa.io/blog/2014/07/26/markov-chains/
I am an avid Pitchfork reader; it is a great way to keep up to date on new music. Pitchfork lets me know what albums to listen to and what not to waste my time on. It’s definitely one source I love to go to when I need something new.
One way Pitchfork distills down all the music they review and listen to is to award certain albums (and more recently tracks) as “Best New Music.” Best New Music, or BNM as I’ll start calling it, is pretty self-explanatory. BNM is awarded to albums (or reissues) that are recently released and show an exemplary effort. BNM is loosely governed by scores (the lowest BNM was a 7.8), but I noticed that I would see some of the same artists pop up over the years. This got me wondering: if an artist gets a BNM, is their next album more likely to be BNM or meh?
We need data. Unfortunately Pitchfork doesn’t have an API and no one has developed a good one, so that led me to scrape all the album info. Luckily, all album reviews are listed on this page: http://pitchfork.com/reviews/albums/. To get them all, I simply iterated through each page and scraped all new albums. I scraped the artist name, album name, genre, main author of the review, and year released. BNM started back in 2003, so I had a natural endpoint. To go easy on Pitchfork’s servers, I built in a little rest between requests (don’t get too mad, Pitchfork).
Now that I have the data, how should I model it? We can think of BNM and “meh” as two possible options or “states” for albums (ignoring scores completely). Markov chains allow us to model these states and how the artists flow through them. Each pass through the chain represents a new album being released. A conventional example is weather. Imagine there are only rainy days and sunny days. If it rained today, there may be a stronger probability that it will rain again tomorrow; the weather could also change to sunny, but with a lower probability. The same goes for sunny days.
For my model, just replace sunny days with BNM and rainy days with meh.
Sunny “S”, Rainy “R”, and the probabilities of swapping or staying the course
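
As a small sketch of the sunny/rainy picture, this is what a two-state Markov chain looks like in Python. The transition probabilities are hypothetical placeholders, not numbers from the diagram or from my data; swapping the labels for BNM and meh gives the album version.

```python
import random

# Hypothetical transition probabilities, not numbers from the post.
# Each row gives P(next state | current state) and must sum to 1.
TRANSITIONS = {
    "S": {"S": 0.8, "R": 0.2},
    "R": {"S": 0.4, "R": 0.6},
}


def next_state(current, rng):
    """Draw the next state using the current state's row of probabilities."""
    states = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def simulate(start, steps, seed=0):
    """Walk the chain for `steps` transitions, returning every state visited."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path


path = simulate("S", 10)
print("".join(path))
```

The key property is that the next state depends only on the current one, not on how the chain got there.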
With all my data, I was able to calculate the overall Markov models. I took all artists that had at least 1 BNM album, 2 albums minimum, and at least 1 album after the BNM album. This ensures that the probabilities actually mean something: I can only tell what the probability of staying BNM is if you have at least one more album after your first BNM. Once I distilled all the artists down using the above criteria, getting the probabilities was easy. I simply iterated through each artist’s discography, classifying the “state” change between consecutive albums (meh to meh, meh to BNM, BNM to BNM, BNM to meh).
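
The counting step can be sketched roughly like this. The data layout (a dict mapping each artist to a chronological list of "BNM"/"meh" labels), the function names, and the example discographies are all made up for illustration; they are not my scraped data or my original code.

```python
from collections import Counter, defaultdict


def qualifies(albums):
    """True if some BNM album is followed by at least one more album.

    That single check also implies at least 1 BNM and at least 2 albums.
    """
    return "BNM" in albums[:-1]


def transition_probabilities(discographies):
    """Estimate P(next label | current label) from consecutive album pairs."""
    counts = defaultdict(Counter)
    for albums in discographies.values():
        if not qualifies(albums):
            continue
        for current, following in zip(albums, albums[1:]):
            counts[current][following] += 1
    probs = {}
    for state, followers in counts.items():
        total = sum(followers.values())
        probs[state] = {nxt: n / total for nxt, n in followers.items()}
    return probs


# Made-up discographies, for illustration only.
example = {
    "Artist A": ["meh", "BNM", "BNM", "meh"],
    "Artist B": ["BNM", "meh", "meh"],
    "Artist C": ["meh", "meh"],  # never got a BNM, so it is filtered out
}
probs = transition_probabilities(example)
print(probs)
```

In this toy example, one of the three albums that follow a BNM is itself a BNM, so P(BNM to BNM) comes out to 1/3.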
Finally, with all the numbers crunched, I plugged them into the visualization at the top. NOTE: the visualizations were NOT created by me. I simply plugged in my calculated probabilities and labels. The original visualization, along with a fantastic explanation of Markov chains, can be found at http://setosa.io/blog/2014/07/26/markov-chains/. The visualization and all the code behind it were created by its original author, NOT me. As I said before, I only supplied the probabilities.
If you look at the size of the arrows, you can tell the relative probability of each state change. BNM albums are pretty rare, and artists don’t stay in that state for long (thin arrow). What is much more common, as you probably guessed, is meh albums leading to more meh albums (thick arrow). It is also more likely than not that an artist will produce a meh album after a BNM. What is interesting is that an artist is more likely to release a BNM after a BNM than to go from meh to BNM. These conclusions seem pretty obvious in retrospect; however, since we lumped all artists together, we might be missing some nuance.
Now, the above metrics are for all artists, but it is probably unfair to lump Radiohead (who churn out BNM like it’s nothing) in with the latest EDM artist. I redid my analysis, this time further splitting the artists by genre. Below are the three most interesting genres.
METAL
POP/R&B
RAP
A while back I went out to dinner with a bunch of my buddies from high school. We inevitably started talking about all the people that went to our high school and what they were doing today. Eventually we started talking about one of our friends who has actually become Instagram famous. As the night waned, one of my friends came up to me and said, “You know all of his/her Instagram followers are fake.” I immediately went to their account and started clicking on some of the followers. Sure enough, they started to look a little fishy. However, as a data scientist I wasn’t 100% convinced. Down the rabbit hole I went.
Gun control has become a hot topic recently in the United States. Due to the increase in firearm deaths, there have been a lot of studies showing how guns flow through America. I wondered about the larger weaponry: items like missiles, tanks, and jet fighters. Who is buying this heavy-duty weaponry? Or do governments just produce their own weapons?
Campaign finances are becoming a prominent issue in today’s elections. We have candidates like Jeb Bush who are receiving record-breaking amounts of donations from private citizens and private companies alike. On the other hand, we have candidates like Bernie Sanders who only receive small donations from citizens. Regardless of where on that spectrum you think candidates should sit, campaign donations are an important part of US elections. When campaign donations are discussed, it is almost always about presidential candidates, but what about our legislators? The only time I ever hear about donations to legislators is when there is a huge scandal. Do they pull in as much money as presidential candidates? Do they receive more money from the average citizen or the average corporation? Do legislators of a certain party pull in more than another?
To achieve this, I needed data on campaign donations for all the federal legislators. Luckily for me, I am not the first to look for this data. There are quite a few places to go for this information, but I wanted a place with an easy-to-understand API and something reliable. This led me to followthemoney.org. They have a very soft “API”, but it is nevertheless super useful and easy to parse. I took the data for all legislators from the past 5 years, for any candidate that ran for either the Senate or the House of Representatives. Using their API, I exported their data in CSV format. From there, the preprocessing and the analysis were all performed in Python (Anaconda distribution).
Before we jump into the analysis, we need to know a little more about campaign contributions themselves. There are federal contribution limits imposed to limit how much people (and corporations, parties, PACs, etc.) can donate. There are a few ways to get around these limits, however, and recent legislation has helped to facilitate that.
That’s much better. As you can see, this is obviously not a population map. States like NY, NJ, MA, and CA are not top tier, but rather toward the bottom. Interestingly enough, states that have fewer people in them seem to have much greater donations per person; Alaska is a notable example. Why do these states get way more contributions than others? One possible explanation is that some of these states are swing states. Swing states (like New Hampshire above) are very closely divided between the Republicans and the Democrats. These states should naturally garner more donations, as the races should be more exciting and volatile. In coarser terms, campaign money is more valuable in these states.
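
For reference, here is a minimal sketch of how a donations-per-person figure can be computed so that a map stops looking like a population map. The input shapes, the function name, and every number are placeholders I made up for illustration; they are not followthemoney.org's export format or real totals.

```python
from collections import defaultdict


def donations_per_capita(donations, populations):
    """Sum donations per state, then divide by that state's population.

    donations: iterable of (state, amount) pairs.
    populations: dict mapping state -> number of residents.
    States with no population entry are skipped.
    """
    totals = defaultdict(float)
    for state, amount in donations:
        totals[state] += amount
    return {state: totals[state] / populations[state]
            for state in totals if state in populations}


# Placeholder figures for illustration only, not real data.
donations = [("small_state", 300.0), ("small_state", 200.0),
             ("large_state", 1000.0)]
populations = {"small_state": 100, "large_state": 1000}
per_capita = donations_per_capita(donations, populations)
print(per_capita)  # {'small_state': 5.0, 'large_state': 1.0}
```

In this toy case the larger state raises more in total but far less per resident, which is the pattern the per-person view brings out.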
Before we go any further, we have to look into who’s donating. Let’s take a look nationwide at who is donating the most. Is it mostly large sums, or small donations?
PEOPLE, PACS, AND THINGS
Speaking of small donations, who actually donates to campaigns? I personally never have. My naïve and uninformed idea of campaign donations is just giant faceless corporations throwing money at candidates. Let’s take a peek at average Joes like you and me and how much they spend. Below you can see two maps of the US, one for 2012 and one for 2014. Hover over each state to see which citizen donated the most and how much they donated; the color scale lets you compare states to each other.
Now, what about those big faceless corporations? Here are two more visualizations, though these are only for the year 2014. The map on the left shows the top industry for each state; the chart on the right shows the top ten industries that donate the most nationwide.
Again we have what looks to be a population map. It seems like states with the most people have the highest individual donors, whether citizens or corporations. One thing that stood out to me was the biggest donors. Real estate and medical professionals were the top players in most states. Much less surprising was that Oil & Gas donated the most where, you guessed it, there is oil and gas.
Finally, what about groups who donate based on ideology? Some examples of these groups are pro-Israel, pro-life/pro-choice, and environmental policy, as well as many others. The bar chart on the left shows a nationwide average of which ideologies get the most money. The map on the left shows the most popular ideology per state.
PARTY FOUL
So far we have skipped over the two most important groups in American politics, the Republicans and the Democrats. How do the parties compare? Seeing that the country is pretty evenly divided on party allegiance, I’d expect donations to each party to be relatively the same. I’d also expect that third-party candidates don’t pull in anywhere near the same order of magnitude as the two major parties.
from: Politico.com
The first graph, from Politico, shows which party each state voted for. The one below shows which party received more donations in each state. The two maps look quite similar. Both the east and west coasts mirror each other to an extent. The midwest also aligns with donations. Donations to legislators in each state may be a good predictor of where the electoral votes end up. Or, more likely, states that were going to vote for a certain party donate to that party more.
Some states receive a lot more attention than others when it comes time for presidential elections. Currently I am only looking at federal legislators’ donations, but I wonder if they reflect presidential politics as well. Certain states I will refer to as swing states. These states are not as deeply entrenched as others. The swing states for 2014 were: Nevada, Colorado, Iowa, Wisconsin, Ohio, New Hampshire, Virginia, North Carolina, and Florida. The map below highlights states that have the closest spending between the Democrats and the Republicans.
Most of the swing states have very similar donations between the two parties. Swing states like Virginia, Florida, and Nevada have very close donation totals. Virginia actually has the closest of all the states. On the other end, states like California, Texas, and New York have the greatest difference in donations. This makes sense, as these states are deeply entrenched in one party. Just look at Texas: the donations are completely lopsided. There is some good news in this map. Most states are relatively close when it comes to donations to both parties.
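
One simple way to score how close a state's donations are is a relative gap between the two party totals. This metric, the function names, and the numbers below are assumptions for illustration; they are not necessarily what drives the map.

```python
def party_gap(dem_total, rep_total):
    """Relative gap between party totals.

    0.0 means identical totals; 1.0 means one party received everything.
    """
    combined = dem_total + rep_total
    if combined == 0:
        return 0.0
    return abs(dem_total - rep_total) / combined


def rank_by_closeness(totals):
    """totals: dict mapping state -> (dem_total, rep_total). Closest first."""
    return sorted(totals, key=lambda state: party_gap(*totals[state]))


# Placeholder totals for illustration only, not real data.
totals = {"state_x": (510.0, 490.0), "state_y": (900.0, 100.0)}
print(rank_by_closeness(totals))  # ['state_x', 'state_y']
```

Dividing by the combined total keeps big states from looking lopsided just because more money flows through them overall.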
MONEY MONEY MONEY MONEY
Political donations are a critical component of the United States government. Looking at the donations, many of my previous assumptions were confirmed and many were discredited. However, one must keep a critical eye on the data presented. The analysis is only as good as the data collected. I believe it is essential to have reliable and vetted donation data, as it holds many insights. I’d like to thank followthemoney.org for their data and commitment. If you liked this analysis, please check out their website and explore the data yourself! Maybe even consider donating!
-Marcello
[1] https://en.wikipedia.org/wiki/Citizens_United_v._FEC
[2] https://en.wikipedia.org/wiki/Independent_expenditure
[3] https://en.wikipedia.org/wiki/McCutcheon_v._FEC
I took a quick look at candidate donations limited to New Jersey; now I’ve moved nationwide. Let’s see if the trends that showed up in New Jersey were typical of the whole nation or just Jersey. I restricted the data to just 2014 to make it a little more manageable. As always, let’s look at Dems versus Repubs.