Fakestagram – Using Machine Learning to Determine Fake Followers

A while back I went out to dinner with a bunch of my buddies from high school. We inevitably started talking about the people who went to our high school and what they were doing today. Eventually the conversation turned to one of our friends who has actually become Instagram famous. As the night waned, one of my friends came up to me and said, “You know all of his/her Instagram followers are fake.” I immediately went to the account and started clicking on some of the followers. Sure enough, they started to look a little fishy. However, as a data scientist, I wasn’t 100% convinced. Down the rabbit hole I went.

In order to solve this problem I needed some data. I have an Instagram account (@gospelofmarcello) and a few followers. The problem was that all my followers were real: mostly friends and family, with a few random businesses sprinkled in. I didn’t have any fake followers, so my first step was to correct that.

Pre-fake followers (shameless plug, it’s 90% food pictures):

So I found out there is a whole market around buying followers (and likes as well, but that’s a story for another blog post). I won’t post links here, but I found a site where I could get 100 followers for $3. Since I only had about 100 real followers, these fake followers would complete my dataset. I spent the $3 (sorry Instagram! I’m doing it for science) and within the hour I had 100 new followers.

Post-fake followers (plus a few real ones):

The next step was to actually get info on all my followers. If you’ve used Instagram before, you’ve probably seen something like the photo above. Instagram profiles have some great data, which I was going to need to build my model. Unfortunately for me, Instagram recently changed its API so that you can only access 10 other users (and their info/data) at a time. Even worse, you need their permission. I assumed that these bots would not consent to being subjected to my probe, so I needed a solution. In comes Selenium.

Selenium lets me open web pages as a normal user would and interact with them. I wrote a script that first scrapes all my followers, then opens each follower’s profile one by one and gathers data. The program collects a user’s Instagram handle, number of followers, number of people they are following, number of posts, real name, and bio. I labeled each of my followers 0 if they were fake and 1 if they were real. Now it’s time to build and train the model.
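To make the shape of the data concrete, here is a rough sketch of how one scraped follower could be turned into a row for the model. It is not my actual scraping script: the `FollowerProfile` class, its field names, and the two example rows are made up for illustration, and reducing the name and bio to simple "is it filled in?" flags is an assumption about the encoding.

```python
from dataclasses import dataclass


@dataclass
class FollowerProfile:
    """One scraped follower (hypothetical structure, for illustration only)."""
    handle: str
    followers: int
    following: int
    posts: int
    name: str
    bio: str
    is_real: int  # label: 0 = fake, 1 = real

    def to_features(self) -> list:
        # Assumption: name and bio are encoded as presence flags
        # (1 = filled in, 0 = empty) rather than as raw text.
        return [
            self.followers,
            self.following,
            self.posts,
            int(bool(self.name.strip())),
            int(bool(self.bio.strip())),
        ]


# Two made-up rows, not real scraped accounts.
profiles = [
    FollowerProfile("real_example", 250, 300, 42, "Jane Doe", "Food and travel", 1),
    FollowerProfile("fake_example", 3, 4100, 0, "", "", 0),
]

X = [p.to_features() for p in profiles]  # feature matrix
y = [p.is_real for p in profiles]        # labels
```

The handle is kept on the record for bookkeeping but left out of the feature row, since it is an identifier, not a signal.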

I decided to start off really simple with a decision tree. With this as a baseline I could always get more complex with a random forest or even the holy grail, gradient boosted trees. But for the sake of good practice I started simple. Using scikit-learn, I fit a simple decision tree, reserving 30% of my data for testing. Scoring the predictions gave me 1.0: a perfect model or, more likely, a super overfit model. Naturally, I loaded up scikit-learn’s cross-validation tools to see how badly overfit my model was. To my surprise, cross-validation produced an average score of 0.97 with a standard deviation of 0.03.
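For anyone who wants to try the same workflow, a minimal sketch of the fit, score, and cross-validate steps might look like the following. The data here is randomly generated stand-in data, not my scraped followers, so the scores it prints will not match the numbers above. The five folds, the stratified split, and the random seeds are my assumptions for the sketch.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data, NOT the scraped followers.
# Columns: followers, following, posts, has_name, has_bio.
n = 100
real = np.column_stack([
    rng.integers(50, 2000, size=n),
    rng.integers(50, 1500, size=n),
    rng.integers(5, 500, size=n),
    np.ones(n, dtype=int),
    rng.integers(0, 2, size=n),
])
fake = np.column_stack([
    rng.integers(0, 30, size=n),
    rng.integers(1000, 7500, size=n),
    rng.integers(0, 3, size=n),
    rng.integers(0, 2, size=n),
    np.zeros(n, dtype=int),
])
X = np.vstack([real, fake])
y = np.array([1] * n + [0] * n)  # 1 = real, 0 = fake

# Reserve 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)  # accuracy on the held-out 30%

# Cross-validation refits a fresh tree on each fold (5 folds assumed here).
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print(f"held-out accuracy: {test_accuracy:.2f}")
print(f"cross-validation: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}")
```

A single held-out score can look great by luck of the split, which is why the mean and spread across folds are the more trustworthy numbers.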

The original model:

The model was basic, but with all my metrics I had some confidence. However, I needed more data to test. I reached out to a friend who kindly allowed me to scrape her Instagram followers. The only downside was that all her followers were real (she verified them all before I scraped). So I bought 100 more fake followers to append to her dataset, to make it richer and more varied (sorry Instagram! All in the name of data science). I refit my model on all the original data and tested it on the new dataset. My decision tree had an accuracy of 0.69, precision of 0.62, and recall of 1.0, and it predicted that 82.5% of my friend’s followers were real when the true figure was closer to 51.4%.
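These numbers hang together. With "real" as the positive class (label 1), a recall of 1.0 means every real follower was caught, so the predicted share of real followers is the true share divided by precision: 0.514 / 0.62 is roughly 0.83, in line with the 82.5% the model reported. The sketch below shows the same effect on a tiny made-up set of labels. The `evaluate` helper is hypothetical and written in plain Python so the arithmetic is visible.

```python
def evaluate(y_true, y_pred):
    """Score predictions where 1 = real follower and 0 = fake follower."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # real, predicted real
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # fake, predicted real
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # fake, predicted fake
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # real, predicted fake
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "predicted_real_share": (tp + fp) / total,
        "actual_real_share": (tp + fn) / total,
    }


# Toy labels, not the scraped data: every real account is caught,
# but two of the three fakes are also labeled real.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0]
scores = evaluate(y_true, y_pred)

for metric, value in scores.items():
    print(f"{metric}: {value:.2f}")
```

In the toy case the fakes that slip through inflate the predicted share of real followers well above the actual share, which is the same failure pattern as above.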

There was a huge drop in performance, and I wondered why my model did so badly. I did a little exploratory analysis and then realized that I’d bought the two sets of fake followers from two different sites (you’d be surprised how many sites there are). The second set of fake followers was of significantly higher quality than my first set. They had bios, names, and uploads, while the first set had only followers and maybe a name. The decision tree weeded out the low-quality fakes pretty quickly, but it struggled with the high-quality ones.

First round fake followers vs second round fake followers:

I needed to retrain my tree. I pooled all my data together and set aside 40% for training. I repeated all my steps of training, model building, and cross-validation. I then tested the new decision tree against my friend’s followers and the remaining mix of fake followers.

The model performed much better, with an accuracy of 0.99, precision of 0.98, and recall of 1.0. It predicted that 51.9% of my friend’s followers were real, close to the true figure of 51.4%.

I chose decision trees because of their easy interpretability. Below is a picture of the structure of the refined decision tree model used to classify each follower.

My second-iteration model used only 3 of the 5 features I supplied: number of followers, number following, and number of posts. Whether or not the user had a name or a bio did not come into play. The model has many limitations, as it is based strictly on a certain group of Instagram users. My dataset leaves out real users who follow way more accounts than follow them. It also lacks users who post very little but might be more engaged with the community (likes, comments, etc.). The model is quite basic and has room for growth, but I need way more varied data. In all likelihood this model is overfit (look at the last branch), but it provides some insight into catching fake followers. Definitely look at the follower-to-following ratio as a major sign of “realness.”
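If you want to check which features a fitted tree actually uses, scikit-learn can print the learned splits as text and report per-feature importances. The sketch below runs on a handful of made-up rows, not my real dataset, so the splits it prints are not the ones from my model. The feature names are my own labels for the five columns.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["followers", "following", "posts", "has_name", "has_bio"]

# Made-up rows for illustration; 1 = real, 0 = fake.
X = [
    [250, 300, 42, 1, 1],
    [900, 410, 120, 1, 0],
    [130, 180, 15, 1, 1],
    [3, 4100, 0, 0, 0],
    [12, 6900, 1, 1, 0],
    [0, 2500, 0, 0, 0],
]
y = [1, 1, 1, 0, 0, 0]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Text rendering of the learned splits, one line per node.
rules = export_text(tree, feature_names=feature_names)
print(rules)

# A feature with importance 0.0 was never used in a split.
for name, importance in zip(feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.2f}")
```

A very deep final branch that isolates only a few rows in this kind of printout is one quick hint that a tree has memorized its training data.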

Now that the model has been built, trained (and retrained), and tested (kinda) successfully, it is time to answer the question that spawned all this. So how many of my Instagram-famous friend’s followers are actually real? I scraped all 17 thousand and ran each one through the decision tree.

83% of his/her followers are fake.

Gotcha.

Thanks for reading,
-Marcello