LCS 2017 Spring Playoffs – Postmortem
50,000 Simulations of League of Legends 2017 Spring Playoffs
The schedule and teams
However, simulating a tournament is not as easy as flipping a coin for each matchup. The data generated needs to have weight; it needs to be significant and, most importantly, needs to model real life. In order to give weight to my simulations, I needed to determine probabilistically which team was going to win each matchup.

In theory, in any matchup the team with the higher skill should beat the team with the lower skill more often than not. The best way to quantify this is to calculate an Elo rating, which ideally represents a team's skill. Elo is used in many online multiplayer games to determine a player's skill, but it originated in the chess scene. Named after its creator, Arpad Elo, it was used to rank the top players based on who they competed against, and it has since been extended to many other sports and games. Most notably, Nate Silver and his team at 538 have been using Elo for all types of sports; they have built Elo models for football, basketball, and baseball. However, you need match history and other data to build an accurate Elo rating for a team or player.

I scraped two years of professional game results, around 700 games. Each team started with an Elo of 1500, and as teams competed against each other their ratings rose, fell, and stabilized. The method for calculating Elo is easy and the formula is quite straightforward (check out this tutorial here). You need a few things to calculate Elo: the rating of both teams before the match, the outcome of the match, and a "K factor".

Formulas for calculating Elo ratings
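In R, those two formulas look roughly like the minimal sketch below; the K factor is left as a parameter, and the ratings in the example are just placeholders.

# Expected score for team A against team B, given their current ratings
expected_score <- function(rating_a, rating_b) {
  1 / (1 + 10 ^ ((rating_b - rating_a) / 400))
}

# Update both ratings after one game; outcome_a is 1 if team A won, 0 if it lost
update_elo <- function(rating_a, rating_b, outcome_a, k = 20) {
  exp_a <- expected_score(rating_a, rating_b)
  new_a <- rating_a + k * (outcome_a - exp_a)
  new_b <- rating_b + k * ((1 - outcome_a) - (1 - exp_a))
  c(new_a, new_b)
}

# Example: two teams at the default 1500, team A wins the game
update_elo(1500, 1500, 1)  # c(1510, 1490): each rating moves 10 points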
Ratings and outcomes are handled by the data and the equations above; the K factor, however, must be estimated. I tried out a few K factors to determine which provided the most stable but reactive Elo ratings and settled on a value of 20. I also looked to Nate Silver and 538 to see if they had any insight. I highly suggest reading how they calculate their Elo values, as the insight was invaluable.

Using my K factor, I started to run through the match history of each team. As I mentioned before, I gave every team an equal start at 1500 Elo; it was up to them to raise or lower their rating. One thing to keep in mind: around summer 2016, games were handled differently. Instead of one game, LoL switched to best of five. There were two ways to approach this: update Elo based on who won the set, or based on individual games. I decided to update based on individual games, as a 3-0 win is more telling of who is better than a close 3-2 win.

Nate Silver uses game scores, home court advantage, and margins of victory in his Elo calculations. LoL is different: there are no "home courts" since it is an online game, and there is no game score or easily measurable margin of victory or point spread. There are W's and L's. As Dustin Pedroia is quoted saying in Nate Silver's book, "All I care about is W's and L's." So my Elo calculation is simple: it only cares about who won and who lost. Maybe down the line that will change, perhaps after I compare my predictions to the results of the spring playoffs. After all the calculations were run over all their games, I had an Elo rating for every team.

Finally, using the Elo calculated after two years of games, less for more recent LCS teams (looking at you, FLY), I could probabilistically determine the outcome of any matchup. My mind immediately went to my favorite statistical hammer: Monte Carlo simulation. If you have read my previous posts, I have used Monte Carlo simulation here and here. 50,000 simulations of the spring 2017 playoffs later, I produced the bracket you see below.

As you can see, teams like TSM and C9 have high chances of placing due to their high Elo and first-week byes. Teams like Dignitas and FLY have much lower chances (especially Dig) at claiming the title. However, their chance is based upon building momentum: after each game of the tournament, Elo is recalculated. If a team like Dig 3-0s a higher-seeded team, their Elo is boosted to a point where they have a better chance against a high-Elo team like TSM.

I'll update my predictions after each stage of the tourney. I am excited to see if my predictions were close or not.

GO TSM!

-Marcello
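As a rough illustration of the per-game updates, here is a sketch of how a single best-of-five could be simulated with the expected_score and update_elo helpers above; the starting ratings are made up, and the full playoff run just repeats this over the whole bracket 50,000 times.

# Simulate one best-of-five, re-rating both teams after every individual game
simulate_bo5 <- function(rating_a, rating_b, k = 20) {
  wins_a <- 0
  wins_b <- 0
  while (wins_a < 3 && wins_b < 3) {
    p_a <- expected_score(rating_a, rating_b)   # win probability from current ratings
    a_wins_game <- runif(1) < p_a               # random outcome of this game
    updated <- update_elo(rating_a, rating_b, as.numeric(a_wins_game), k)
    rating_a <- updated[1]
    rating_b <- updated[2]
    if (a_wins_game) wins_a <- wins_a + 1 else wins_b <- wins_b + 1
  }
  list(winner = if (wins_a == 3) "A" else "B",
       score = c(wins_a, wins_b),
       ratings = c(rating_a, rating_b))
}

set.seed(1)
simulate_bo5(1600, 1450)  # a hypothetical favorite vs. a lower-rated team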
Kingdom Death: Monster Fight Optimization
Kingdom Death: Monster Weapon Optimization
Like most tabletop games and RPGs, all actions the player takes are resolved by dice rolls that determine whether they succeed or fail. When a player wants to attack a monster, she rolls a die and the number tells her whether she hits the monster or not. Simulating actions this way makes it much easier to determine outcomes mathematically, as I just have to model a dice roll for each attack.

Before we go into simulation, I should clearly outline the process of attacking a monster in KDM. KDM takes a unique approach to attacking a monster. There are two main stages to an attack: the hit phase and the wound phase.

During the hit phase, the player rolls to see if they actually hit the monster. They look at the monster's evasion stat and their weapon's accuracy stat. In order to hit, the player must roll higher than their weapon's accuracy stat plus the monster's evasion stat. For each point of speed the weapon has, the player may roll one die. For example, if their sword has a speed of 3, they can roll up to three dice.

Once the player determines if they hit or not, they go to the next phase: the wound phase. Here the player rolls again to see if they wound the monster, re-rolling each die that beat the weapon's accuracy plus the monster's evasion in the previous phase. To actually wound the monster, the player must roll higher than the monster's toughness (or hide) minus the strength of the weapon. Every die that passes this check wounds the monster.

Every roll is done on a 10-sided die. However, in KDM there are two special rolls, 1 and 10. Whenever a player rolls a 1, the hit or the wound fails, no exception. Whenever a player rolls a 10, the hit or wound succeeds, no exception.

In order to quantify which weapon performs better against certain monsters, I need to determine the probabilities of hitting and, more importantly, wounding the monster. This could probably be done with distributions and simple probabilities; however, with the 1/10 rule and the various strength and accuracy checks, I decided on a different approach. In come Monte Carlo methods.

Monte Carlo methods simulate your problem many times to determine probabilities of success. Essential to the method is randomness. Each trial produces a different result, and if the number of trials is decently large the trials begin to model the distribution of the initial problem. I stumbled upon a great blog post that outlines how to perform these methods using R. The website Count Bayesie outlines different ways to use Monte Carlo methods here. I highly recommend reading the post, as it explains MC methods pretty clearly and gives real-world examples with easy-to-follow code.

Using R, I created a script that would crunch the numbers on each weapon. The script calculates the chance to hit, wound, critical hit, and critical wound (which happens when a 10 is rolled). The script will also plot the distributions of each stat and compare them to another weapon. All of the above depends on the monster; each monster has 3 levels with various stats.

The script initially produced tables and graphs, but wasn't super useful. I converted the script into the Shiny web app below, so now anyone can use my Monte Carlo script to compare any two weapons for any given monster. There are a few limitations so far: you can only compare 2 weapons at a time.
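The heart of that script is a simulated attack along the lines of the sketch below; the weapon and monster stats in the example are invented for illustration, not pulled from the game.

# One simulated attack, following the hit and wound rules described above
simulate_attack <- function(speed, accuracy, strength, evasion, toughness) {
  hit_rolls <- sample(1:10, speed, replace = TRUE)
  # a 1 always misses, a 10 always hits, otherwise compare to accuracy + evasion
  hits <- hit_rolls == 10 | (hit_rolls != 1 & hit_rolls > accuracy + evasion)
  n_hits <- sum(hits)
  if (n_hits == 0) return(c(hits = 0, wounds = 0, crit_wounds = 0))
  wound_rolls <- sample(1:10, n_hits, replace = TRUE)
  # a 1 always fails, a 10 always wounds and counts as a critical wound
  wounds <- wound_rolls == 10 | (wound_rolls != 1 & wound_rolls > toughness - strength)
  c(hits = n_hits, wounds = sum(wounds), crit_wounds = sum(wound_rolls == 10))
}

# Monte Carlo: repeat the attack many times and average the outcomes
trials <- replicate(10000, simulate_attack(speed = 3, accuracy = 6,
                                           strength = 2, evasion = 1, toughness = 9))
rowMeans(trials)  # average hits, wounds, and critical wounds per attack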
Pitchfork’s Best New Markov Chains Part 2
See original visualization here: http://setosa.io/blog/2014/07/26/markov-chains/
After finishing up my last post about modelling artists and their probability of releasing consecutive Best New Music albums (see part 1 here), I got to thinking about what else I could do with the data I scraped. I had all album reviews from 2003 to present, including the relevant metadata: artist, album, genre, author of the review, and date reviewed. I also had the order in which they were reviewed. Then, with Markov chains still fresh in my mind, I got to wondering: do albums get reviewed in a genre-based pattern? Are certain genres likely to follow others?

Using the JavaScript code from http://setosa.io/blog/2014/07/26/markov-chains/, I plugged in my labels (each of the genres) and the probabilities of state change (moving from one genre to another), which resulted in the 9-node chain at the top of the post.

If you let the chain run a little while, you will notice a few patterns. The most obvious is that all roads lead to Rock. For each node, the probability of the next album being a rock album is close to 50%. This is because not all genres are equally represented, and also because of the way Pitchfork labels genres. Pitchfork can assign up to 5 genres to an album it reviews. With up to 5 possibilities to get a spot, some genres start to gain a lead on others. Rock, for instance, is tacked on to other genres more frequently than any other genre. This causes our Markov chain to heavily favor moving to Rock rather than to genres like Global and Jazz, which are not tacked onto others as frequently. So if you are the betting type, the next album Pitchfork reviews is probably a rock album.

-Marcello
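For the curious, those state-change probabilities can be computed in a few lines of R; here reviews is assumed to be a data frame of the scraped reviews sorted by review date, with a single genre label per album (a simplification, given the multi-genre labelling mentioned above).

# Pair each review's genre with the genre of the review that followed it
from <- head(reviews$genre, -1)
to   <- tail(reviews$genre, -1)
transitions <- table(from, to)

# Normalize each row into probabilities; these feed the setosa.io chain above
transition_probs <- prop.table(transitions, margin = 1)
round(transition_probs, 2)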
Pitchfork's Best New Markov Chains
See original visualization here: http://setosa.io/blog/2014/07/26/markov-chains/
I am an avid Pitchfork reader; it is a great way to keep up to date on new music. Pitchfork lets me know what albums to listen to and what not to waste my time on. It's definitely one source I love to go to when I need something new.

One way Pitchfork distills down all the music they review and listen to is to award certain albums (and more recently tracks) as "Best New Music." Best New Music, or BNM as I'll start calling it, is pretty self-explanatory: it is awarded to albums (or reissues) that are recently released but show an exemplary effort. BNM is loosely governed by scores (the lowest BNM was a 7.8), but I noticed that I would see some of the same artists pop up over the years. This got me wondering: if an artist gets a BNM, is their next album more likely to be BNM or meh?

We need data. Unfortunately, Pitchfork doesn't have an API and no one has developed a good one, so that led me to scrape all the album info. Luckily, all album reviews are listed on this page: http://pitchfork.com/reviews/albums/. To get them all I simply iterated through each page and scraped all new albums. I scraped the artist name, album name, genre, main author of the review, and year released. BNM started back in 2003, so I had a natural endpoint. In order to go easy on Pitchfork's servers, I built in a little rest between requests (don't get too mad, Pitchfork).

Now that I have the data, how should I model it? We can think of BNM and "meh" as two possible options or "states" for albums (ignoring scores completely). Markov chains allow us to model these states and how artists flow through them. Each pass through the chain represents a new album being released. A conventional example is weather. Imagine there are only rainy days and sunny days. If it rained yesterday, there may be a stronger probability that it rains tomorrow; the weather could also change to sunny, but at a lower probability. The same goes for sunny days. For my model, just replace sunny days with BNM and rainy days with meh.

Sunny "S", Rainy "R", and the probabilities of swapping or staying the course
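As for actually collecting the data, the scraping loop works roughly like the sketch below; the page count, URL pattern, and CSS selectors are placeholders rather than Pitchfork's real markup.

library(rvest)

pages <- vector("list", 1000)                 # placeholder page count
for (page in seq_along(pages)) {
  url <- paste0("http://pitchfork.com/reviews/albums/?page=", page)  # assumed URL pattern
  html <- read_html(url)
  # selector names below are stand-ins for whatever the page actually uses
  artists <- html_text(html_nodes(html, ".review-artist"))
  albums  <- html_text(html_nodes(html, ".review-album"))
  pages[[page]] <- data.frame(artist = artists, album = albums,
                              stringsAsFactors = FALSE)
  Sys.sleep(2)                                # the little rest between requests
}
reviews <- do.call(rbind, pages)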
With all my data, I was able to calculate the overall Markov models. I took all artists that had at least 1 BNM album, at least 2 albums total, and at least 1 album after the BNM album. This ensures that the probabilities actually mean something: I can only tell what the probability of staying BNM is if there is at least one more album after the first BNM. Once I distilled the artists down using the above criteria, getting the probabilities was easy. I simply iterated through each artist's discography, classifying the "state" change between consecutive albums (meh to meh, meh to BNM, BNM to BNM, BNM to meh).
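Concretely, that counting step looks something like this sketch; albums is assumed to be a data frame with artist, release_order, and a logical bnm column (the names are mine), and the filtering to qualifying artists described above is left out for brevity.

# Count state changes between consecutive albums in one discography
count_transitions <- function(states) {
  from <- head(states, -1)
  to   <- tail(states, -1)
  table(factor(from, levels = c("meh", "BNM")),
        factor(to,   levels = c("meh", "BNM")))
}

# Tally the transitions for each artist's discography, in release order
per_artist <- lapply(split(albums, albums$artist), function(d) {
  d <- d[order(d$release_order), ]
  count_transitions(ifelse(d$bnm, "BNM", "meh"))
})

# Sum over artists and turn the counts into the probabilities used in the chain
totals <- Reduce(`+`, per_artist)
prop.table(totals, margin = 1)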
Finally, with all the numbers crunched, I plugged them into the visualization at the top. NOTE: the visualizations were NOT created by me. I simply plugged in my calculated probabilities and labels. The original visualization, along with a fantastic explanation of Markov chains, can be found at http://setosa.io/blog/2014/07/26/markov-chains/. The visualization and all the code behind it were created by its author, NOT me. As I said before, I only supplied the probabilities.
If you look at the size of the arrows, you can tell the relative probability of each state change. As you can see, BNM albums are pretty rare and artists don't stay that way for long (thin arrow). What is much more common, as you probably guessed, are meh albums leading to more meh albums (thick arrow). It is more likely that an artist will produce a meh album after a BNM. What is interesting is that it is more likely to release a BNM after a BNM than it is to go from meh to BNM. These conclusions seem pretty obvious in retrospect; however, since we lumped all artists together, we might be missing some nuance.
Now, the above metrics are for all artists, but it is probably unfair to lump Radiohead (who churn out BNM like it's nothing) in with the latest EDM artist. I redid my analysis, this time further splitting the artists by their genre. Below are the three most interesting genres.
METAL
POP/R&B
RAP
Breaking artists out by genre led to some interesting results. For the most part, genres followed the general outline of the overall BNM Markov chain; however, the three above deviated. Metal had a much higher chance of an artist releasing consecutive BNM albums, with a probability of almost 50%; however, it is much harder for a metal artist to transition from meh to BNM. The exact opposite is true for Pop/R&B (Pitchfork lumps the two together in their categorization): pop artists switch back and forth between BNM and meh, but rarely produce two albums of the same state consecutively. Rap is a little different. Rap is more resistant to change: it is harder to switch between states, but rather easy to stay in a state.

There are some drawbacks to this subsetting. The number of observations drops for each group, so these models are based on less data. Some albums also have multiple genre designations. Should a rock/electronic album count for both rock and electronic, be weighted at 50% of a pure rock album, or be separated out into its own rock/electronic genre? Nevertheless, as exploratory, mildly useful Markov chains, we can see that some artists may have an advantage if they have already produced a BNM album, but not by much.

-Marcello
Fakestagram – Using Machine Learning to Determine Fake Followers
A while back I went out to dinner with a bunch of my buddies from high school. We inevitably started talking about all the people that went to our high school and what they were doing today. Eventually we started talking about one of our friends who has actually become Instagram famous. As the night waned, one of my friends came up to me and said, "You know all of his/her Instagram followers are fake." I immediately went to their account and started clicking on some of the followers. Sure enough, they started to look a little fishy. However, as a data scientist, I wasn't 100% convinced. Down the rabbit hole I went.
-Marcello
Travelling Pokemon Trainer
- It is not a closed loop: I will have to walk allllllll the way back home from wherever this route lets me off. We need to add a constraint that the last stop has to be the same as the first.
- Will my bag be full at the end of this? I don't wanna waste time walking to more pokestops if I can't receive any more supplies.
- I don't have to visit all the pokestops / pokestops recharge: this is the most important one. Unlike the travelling salesman, I do not have to visit every stop. I could stay in a small subsection and just repeatedly go to the same 3 stops if I wanted to (a toy sketch of one greedy take on these constraints follows this list).
Journal Club: Week of 3/18/2016
The ASA’s statement on p-values: context, process, and purpose