Pitchfork’s Best New Markov Chains Part 2

See original visualization here: http://setosa.io/blog/2014/07/26/markov-chains/

After finishing up my last post about modelling artists and their probability of releasing consecutive Best New Music albums (see Part 1 here), I got to thinking about what else I could do with the data I had scraped. I had every album review from 2003 to the present, including the relevant metadata: artist, album, genre, author of the review, and date reviewed. I also had the order in which the albums were reviewed.

Then, with Markov chains still fresh in my mind, I got to thinking: do albums get reviewed in a genre-based pattern? Are certain genres likely to follow others?

Using the JavaScript code from http://setosa.io/blog/2014/07/26/markov-chains/, I plugged in my labels (each of the genres) and the probabilities of state change (moving from one genre to the next), which resulted in the nine-node chain at the top of the post.
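In case it helps, here is a minimal sketch of how those state-change probabilities could be derived from the ordered review list. The file name and the "date"/"genre" columns are assumptions about my scraped data, not Pitchfork's actual schema, and albums with multiple genre tags would need extra handling.

```python
# Sketch: estimate genre-to-genre transition probabilities from the reviews,
# taken in the order they were published. Column names are assumptions.
from collections import Counter, defaultdict

import pandas as pd

reviews = pd.read_csv("pitchfork_reviews.csv")          # hypothetical file
genres = reviews.sort_values("date")["genre"].tolist()  # reviews in order

counts = defaultdict(Counter)
for current, nxt in zip(genres, genres[1:]):
    counts[current][nxt] += 1

# Normalize each genre's outgoing counts into probabilities.
transitions = {
    genre: {nxt: n / sum(c.values()) for nxt, n in c.items()}
    for genre, c in counts.items()
}
print(transitions.get("Rock"))  # e.g. probability the next review is Rock, Rap, ...
```

These per-genre probability rows are exactly what the visualization takes as input, alongside the genre labels.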

If you let the chain run for a little while, you will notice a few patterns. The most obvious is that all roads lead to Rock: from every node, the probability that the next album is a rock album is close to 50%. This is partly because the genres are not equally represented and partly because of the way Pitchfork labels genres. Pitchfork can assign up to five genres to an album it reviews, and with up to five slots available, some genres pull ahead of others. Rock, for instance, is tacked on to other genres more frequently than any other genre. This causes our Markov chain to heavily favor transitions to Rock over genres like Global and Jazz, which are not tacked onto other albums as frequently.

So if you are the betting type, the next album Pitchfork will review is probably a rock album.

-Marcello

Pitchfork’s Best New Markov Chains

See original visualization here: http://setosa.io/blog/2014/07/26/markov-chains/

I am an avid Pitchfork reader; it is a great way to keep up to date on new music. Pitchfork lets me know which albums to listen to and which ones not to waste my time on. It's definitely one source I love to go to when I need something new.

One way Pitchfork distills down all the music they review and listen to is to award certain albums (and, more recently, tracks) the designation "Best New Music." Best New Music, or BNM as I'll call it from here on, is pretty self-explanatory: it is awarded to recently released albums (or reissues) that show an exemplary effort. BNM is loosely governed by scores (the lowest-scoring BNM was a 7.8), but I noticed that I would see some of the same artists pop up over the years. This got me wondering: if an artist gets a BNM, is their next album more likely to be BNM or meh?

We need data. Unfortunately, Pitchfork doesn't have an API and no one has developed a good one, so that led me to scrape all the album info myself. Luckily, all album reviews are listed on this page: http://pitchfork.com/reviews/albums/. To get them all, I simply iterated through each page and scraped every album, recording the artist name, album name, genre, main author of the review, and year released. BNM started back in 2003, so I had a natural endpoint. In order to go easy on Pitchfork's servers, I built in a little rest between requests (don't get too mad, Pitchfork).
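A minimal sketch of that paged scrape, assuming the review index can be walked with a "page" query parameter and that each review sits in an element matched by a placeholder ".review" selector; the real markup on pitchfork.com would need to be inspected and will differ.

```python
# Sketch: iterate over the paginated review index with a polite pause between
# requests. Selectors and the pagination parameter are assumptions.
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://pitchfork.com/reviews/albums/"

def scrape_page(page_number):
    response = requests.get(BASE_URL, params={"page": page_number})
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Placeholder selector: pull the text of each review card on the page.
    return [card.get_text(" ", strip=True) for card in soup.select(".review")]

all_reviews = []
for page in range(1, 5):          # walk as many pages as needed
    all_reviews.extend(scrape_page(page))
    time.sleep(2)                 # the "little rest" between requests
```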

Now that I have the data, how should I model it? We can think of BNM and "meh" as the two possible options, or "states," for albums (ignoring scores completely). Markov chains let us model these states and how artists flow through them, with each pass through the chain representing a new album being released. A conventional example is weather. Imagine there are only rainy days and sunny days. If it rained yesterday, there may be a stronger probability that it rains again tomorrow; the weather could also change to sunny, but with a lower probability. The same goes for sunny days. For my model, just replace sunny days with BNM and rainy days with meh.

markov-chain-rain-sun

Sunny “S”, Rainy “R”, and the probabilities of swapping or staying the course
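To make the "state" idea concrete, here is a toy version of the rain/sun chain above. The 0.7/0.3 and 0.4/0.6 probabilities are purely illustrative, not taken from any real weather data.

```python
# Sketch: simulate a two-state Markov chain (Sunny/Rainy) for a few steps.
import numpy as np

states = ["Sunny", "Rainy"]
# transition[i][j] = probability of moving from state i to state j
transition = np.array([
    [0.7, 0.3],   # Sunny -> Sunny, Sunny -> Rainy
    [0.4, 0.6],   # Rainy -> Sunny, Rainy -> Rainy
])

rng = np.random.default_rng(0)
current = 0                       # start on a sunny day
for _ in range(10):
    current = rng.choice(2, p=transition[current])
    print(states[current])
```

For my model, the two states are simply relabeled BNM and meh, and each step is a new album instead of a new day.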


With all my data, I was able to calculate the overall Markov model. I took all artists that had at least one BNM album, at least two albums total, and at least one album after their first BNM album. This ensures that the probabilities actually mean anything: I can only tell what the probability of staying BNM is if the artist has at least one more album after their first BNM. Once I had distilled the artists down using the above criteria, getting the probabilities was easy. I simply iterated through each artist's discography and classified the "state" change between consecutive albums (meh to meh, meh to BNM, BNM to BNM, BNM to meh).
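A minimal sketch of that counting step, assuming the filtered artists are held in a dict mapping each artist to their albums in release order, each labeled "BNM" or "meh"; the data structure and toy entries are hypothetical.

```python
# Sketch: count state changes across each artist's discography and normalize
# them into the BNM/meh transition probabilities.
from collections import Counter

discographies = {
    "Artist A": ["meh", "BNM", "meh", "meh"],   # toy data
    "Artist B": ["meh", "BNM", "BNM"],
}

counts = Counter()
for albums in discographies.values():
    for prev, nxt in zip(albums, albums[1:]):
        counts[(prev, nxt)] += 1

# Divide by the number of transitions leaving each state.
for state in ("BNM", "meh"):
    total = sum(n for (prev, _), n in counts.items() if prev == state)
    for nxt in ("BNM", "meh"):
        print(f"P({state} -> {nxt}) = {counts[(state, nxt)] / total:.2f}")
```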


Finally, with all the numbers crunched, I plugged them into the visualization at the top. NOTE: the visualizations were NOT created by me. I simply plugged in my calculated probabilities and labels. The original visualization, along with a fantastic explanation of Markov chains, can be found at http://setosa.io/blog/2014/07/26/markov-chains/. The visualization and all the code behind it were created by its author, NOT me. As I said before, I only supplied the probabilities.

If you look at the size of the arrows, you can tell the relative probability of each state change. As you can see, BNM albums are pretty rare, and artists don't stay in that state for long (thin arrow). Much more common, as you probably guessed, are meh albums leading to more meh albums (thick arrow). It is more likely that an artist will produce a meh album after a BNM than another BNM. What is interesting is that it is more likely to release a BNM after a BNM than it is to go from meh to BNM. These conclusions seem pretty obvious in retrospect; however, since we lumped all artists together, we might be missing some nuance.

Now, the above metrics are for all artists, but it is probably unfair to lump Radiohead (who churn out BNM albums like it's nothing) in with the latest EDM artist. I redid my analysis, this time further splitting the artists by genre. Below are the three most interesting genres.


METAL

POP/R&B

RAP


Breaking artists out by genre led to some interesting results. Most genres followed the general outline of our overall BNM Markov chain; however, the three above deviated. Metal had a much higher chance of an artist releasing consecutive BNM albums, with a probability of almost 50%; however, it is much harder for a metal artist to transition from meh to BNM. The exact opposite is true for Pop/R&B (Pitchfork lumps the two together in its categorization): pop artists switch back and forth between BNM and meh, but rarely produce two albums of the same state consecutively. Rap is a little different. Rap is more resistant to change: for rap artists, it is harder to switch between states, but rather easy to stay in one.

There are some drawbacks to this subsetting. The number of observations drops for each group, so these models are based on less data. Some albums also have multiple genre designations. Should a rock/electronic album count for both rock and electronic, be weighted at 50% of a pure rock album, or be split out into its own rock/electronic category? Nevertheless, as exploratory and mildly useful Markov chains, they show that some artists may have an advantage if they have already produced a BNM album, but not by much.

-Marcello

Fakestagram – Using Machine Learning to Determine Fake Followers


A while back, I went out to dinner with a bunch of my buddies from high school. We inevitably started talking about the people who went to our high school and what they are doing today. Eventually we got to one of our friends who has actually become Instagram famous. As the night waned, one of my friends came up to me and said, "You know, all of his/her Instagram followers are fake." I immediately went to their account and started clicking on some of the followers. Sure enough, they started to look a little fishy. However, as a data scientist, I wasn't 100% convinced. Down the rabbit hole I went.

In order to solve this problem, I needed some data. I have an Instagram account (@gospelofmarcello) and a few followers. The problem is that all my followers were real: mostly friends and family, with a few random businesses sprinkled in. Unfortunately, I didn't have any fake followers. My first step was to correct that.

Pre-fake followers (shameless plug, it’s 90% food pictures):

mr-isnta

It turns out there is a whole market around buying followers (and likes too, but that's a story for another blog post). I won't post links here, but I found a site where I could get 100 followers for $3. Since I only had about 100 real followers, these fake followers would complete my dataset. I spent the $3 (sorry, Instagram! I'm doing it for science) and within the hour I had 100 new followers.

Post-fake followers (plus a few real ones):

mr-with-fakes


The next step was to actually get info on all my followers. If you've used Instagram before, you've probably seen something like the photo above. Instagram profiles have some great data, which I was going to need to build my model. Unfortunately for me, Instagram recently changed their API so that you can only access 10 other users (and their info/data) at a time, and even worse, you need their permission. I assumed that these bots would not consent to being subjected to my probe, so I needed another solution. In comes Selenium.

Selenium lets me open web pages like a normal user and interact with them. I wrote a script that would first scrape all my followers, then open each follower's profile one by one and gather data. My program collects a user's Instagram handle, number of followers, number of people they are following, number of posts, their real name, and their bio. I labeled each of my followers 0 if they were fake and 1 if they were real. Now it's time to build and train the model.
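A minimal sketch of that Selenium pass over follower profiles. The CSS selector below is a placeholder; Instagram's real markup changes often and would need to be inspected at scrape time, and the follower handles are assumed to come from an earlier scraping step.

```python
# Sketch: visit each follower's profile and grab the header stats
# (posts / followers / following). Selectors are placeholders.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

def scrape_profile(handle):
    driver.get(f"https://www.instagram.com/{handle}/")
    time.sleep(2)  # give the page time to render
    stats = [e.text for e in driver.find_elements(By.CSS_SELECTOR, "header li")]
    return {"handle": handle, "stats": stats}

followers = ["some_follower", "another_follower"]   # from the earlier scrape
profiles = [scrape_profile(h) for h in followers]
driver.quit()
```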

I decided to start off really simple with a decision tree algorithm. With this as a baseline, I could always get more complex with random forests or even the holy grail, gradient boosted trees. But for the sake of good practice, I started simple. Using scikit-learn, I fit a simple decision tree, reserving 30% of my data for testing. Scoring the predictions gave me 1.0: a perfect model, or, more likely, a wildly overfit one. Naturally, I loaded up scikit-learn's cross-validation tools to check how badly overfit my model was. To my surprise, cross-validation produced an average score of 0.97 with a standard deviation of 0.03.
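A minimal sketch of that first model, assuming the scraped followers live in a dataframe with the five features described above plus a 0/1 "real" label; the file and column names are assumptions, not my actual schema.

```python
# Sketch: fit a decision tree on a 70/30 split, then cross-validate to
# sanity-check the suspiciously perfect holdout score.
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

followers = pd.read_csv("my_followers.csv")                 # hypothetical file
features = ["n_followers", "n_following", "n_posts", "has_name", "has_bio"]
X, y = followers[features], followers["real"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
print("holdout accuracy:", tree.score(X_test, y_test))

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
print("cv mean:", scores.mean(), "cv std:", scores.std())
```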

The original model:

dt3

The model was basic, but with all my metrics I had some confidence. However, I needed more data to test on. I reached out to a friend who kindly allowed me to scrape her Instagram followers. The only downside was that all of her followers are real (she had verified them all before I scraped). So I bought 100 more fake followers and appended them to her dataset to make a richer and more varied test set (sorry, Instagram! All in the name of data science). I refit my model on all of the original data and tested it on the new dataset. My decision tree model had an accuracy of 0.69, a precision of 0.62, and a recall of 1.0, and it predicted that 82.5% of my friend's followers were real when the true figure was closer to 51.4%.
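Continuing the sketch above, this is roughly how the new dataset could be scored against the refit tree. The file name is again hypothetical, and "features" and "tree" are carried over from the earlier sketch.

```python
# Sketch: evaluate the refit tree on the friend's followers plus the second
# batch of fakes, using the same hypothetical columns as before.
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

new_data = pd.read_csv("friend_followers_plus_fakes.csv")   # hypothetical file
new_X, new_y = new_data[features], new_data["real"]

predictions = tree.predict(new_X)
print("accuracy:", accuracy_score(new_y, predictions))
print("precision:", precision_score(new_y, predictions))
print("recall:", recall_score(new_y, predictions))
print("predicted % real:", 100 * predictions.mean())        # vs. the true share
```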

There was a huge drop in all metrics, and I wondered why my model performed so badly. After a little exploratory analysis I realized that I had bought the two sets of fake followers from two different sites (you'd be surprised how many there are). The second batch of fake followers was of significantly higher quality than my first set: they had bios, names, and uploads, while the first set had only followers and maybe a name. The decision tree weeded the low-quality fakes out pretty quickly; however, it struggled on the high-quality fakes.

First round fake followers vs second round fake followers:

robot

I needed to retrain my tree. I pooled all my data together and set aside 40% of it for training. I repeated all my steps of training, model building, and cross-validation, then tested the new decision tree model against my friend's followers and the remaining mix of fake followers.

The model performed much better, with an accuracy of 0.99, a precision of 0.98, and a recall of 1.0, and it predicted that 51.9% of my friend's followers were real, which was close to the true 51.4%.

I chose decision trees because of their easy interpretability. Below is a picture of the structure of the refined decision tree model used to classify each follower.

dt2

My second-iteration model used only three of the five features I supplied: number of followers, number of people followed, and number of posts. Whether or not the user had a name or a bio did not come into play. There are many limitations to this model, as it is based strictly on a certain group of Instagram users. My dataset leaves out real users who follow far more people than follow them back. It also lacks users who post very little but might be highly engaged with the community (likes, comments, etc.). The model is quite basic and has room for growth, but I would need far more varied data. In all likelihood this model is overfit (look at the last branch); however, it provides some insight into catching fake followers. Definitely look at the follower-to-following ratio as a major sign of "realness."
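For what it's worth, this is how that kind of inspection can be done with scikit-learn, again assuming the "tree" and "features" names from the earlier sketches: feature_importances_ shows which inputs the model actually leaned on, and plot_tree renders a structure like the picture above.

```python
# Sketch: inspect which features the fitted tree uses and draw its structure.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

for name, importance in zip(features, tree.feature_importances_):
    print(f"{name}: {importance:.2f}")

plot_tree(tree, feature_names=features, class_names=["fake", "real"], filled=True)
plt.show()
```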

Now that the model had been built, trained (and retrained), and tested (kinda) successfully, it was time to answer the question that spawned all of this: how many of my friend's followers are actually real? I scraped all 17 thousand of them and ran each one through the decision tree.

83% of his/her followers are fake.

Gotcha.

Thanks for reading,
-Marcello