SelenaBot Boxing Day update

Merry Christmas.


Well, on Christmas Eve, the day after I wrote and blogged about SelenaBot, I tried to run it again and got a bunch of “connection error” exceptions. After I ran this by some friends, we suspected that Facebook’s servers had detected that Instagram was being scraped, got angry about it, and started blocking me.

Changing my IP address – by coming back to my parents’ house for Christmas – seems to have worked: It runs again smoothly, although much more slowly as 1Gbps internet isn’t a thing in the countryside. I’m hoping that a more subtle scraper that I’m working on, which will only download images taken in the previous 24 hours, will be much kinder on the Instagram server and not get blocked. Otherwise I might have to teach my code how to change its IP address midway through operation…

Evening update:

Well, things have gone very successfully, and SelenaBot will now only download files that were taken on the calendar day that SelenaBot is run. I need to alter this so it is actually the preceding 24 hours, but I’m not sure how to do that yet. This means that SelenaBot only scrapes around 150 pictures as opposed to 1,200, and the requests come in at more random time intervals (as opposed to one after the other). So far this hasn’t resulted in any blocks. However, all my scraping at this IP address has been done within one 24-hour period, so it will be interesting to see if I suddenly start getting “server no reply” exceptions tomorrow (i.e. does Facebook analyse all its traffic once every 24 hours and then block offending requesters from the next day’s traffic?).
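One possible way to get a true "preceding 24 hours" window is to compare each picture's Unix timestamp against the current time minus 86,400 seconds. This is only a sketch: the field names "date" and "display_src" are assumptions about what each picture entry looks like, and the sample data is made up.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def taken_in_last_24h(nodes, now=None):
    """Keep only the entries whose 'date' falls within the preceding 24 hours.

    'date' is assumed to be a Unix timestamp in seconds.
    """
    if now is None:
        now = time.time()
    cutoff = now - SECONDS_PER_DAY
    return [node for node in nodes if cutoff <= node["date"] <= now]


# Made-up sample data: one picture from an hour ago, one from 25 hours ago.
now = 1482750000
sample = [
    {"date": now - 3600, "display_src": "https://example.com/a.jpg"},
    {"date": now - 90000, "display_src": "https://example.com/b.jpg"},
]

recent = taken_in_last_24h(sample, now=now)
print(len(recent))  # prints 1
```

Passing `now` in explicitly makes the function easy to check with fixed numbers; in a real run it can be left out so the current time is used.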

Another interesting thing that happened today is that the account “repostapp” suddenly disappeared (as in, it worked this morning and stopped working this afternoon), which meant that it was derailing the whole of SelenaBot, in much the same way Justin Bieber did the first time I ran the code. To stop this from happening in the future as and when other accounts get deleted or taken down, I’ve taught myself exception handling, so now when an account disappears, the code can carry on running.
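The general shape of that fix is a try/except around each account, so one failure skips that account instead of stopping the whole run. This is a rough sketch rather than the actual SelenaBot code: `download_recent` and `AccountUnavailable` are hypothetical stand-ins, and the usernames other than "repostapp" are just illustrative. In a real scraper the error is more likely to be something like a KeyError from the parsing step or an exception raised by the HTTP library.

```python
class AccountUnavailable(Exception):
    """Hypothetical error raised when a profile page cannot be read."""


def download_recent(username):
    # Stand-in for the real scraping function; pretends 'repostapp' has vanished.
    if username == "repostapp":
        raise AccountUnavailable(username)
    return f"downloaded {username}"


def run_all(usernames):
    """Try every account, recording the ones that fail instead of crashing."""
    done, skipped = [], []
    for name in usernames:
        try:
            done.append(download_recent(name))
        except AccountUnavailable:
            # Note the missing account and carry on with the next one.
            skipped.append(name)
    return done, skipped


done, skipped = run_all(["selenagomez", "repostapp", "theellenshow"])
print(skipped)  # prints ['repostapp']
```

Catching one specific exception type, rather than a bare `except:`, means genuine bugs still show up as errors.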


One of my New Year’s resolutions for 2016 was to learn to code a bit, and with just over a week to go I feel like I’ve been able to produce an idea for a piece of code, structure it, and get it to work. From scratch. By myself.

I started learning Python using Automate the Boring Stuff, a website I found so useful I bought the book to say thank you to the author. I’ve learnt all the basics from section one and started coming up with exercises myself to practise with, which is how SelenaBot came about.

I’ve had an idea for a Science Engagement stunt/thing (which, if it works, I’ll devote future blog posts to as it develops) that requires access to lots of Instagram posts, preferably by celebrities. So I’ve put together a piece of code that automatically accesses celebrity Instagrams and downloads their most recent 12 pictures, and I’m going to talk us through it here.

To start with, we import all the libraries we’ll need: Requests, for requesting webpages; BeautifulSoup, to read the information on the webpages; json, because once you read the information on the webpage you discover it’s an unholy pile of tangled-up JavaScript that takes a whole evening to pick apart; and a list of the top 100 Instagram accounts that I made separately (and will talk about later).

As alluded to above, the real killer for using BeautifulSoup to scrape Instagram is that, when you run a request for an Instagram page, instead of getting the website, you get a horrible pile of JavaScript.

[Image: the page when you access it using a web browser]
[Image: the page when you access it using the Python Requests module]

I suspect that this gross mess is because using BeautifulSoup is not the most intelligent way to scrape Instagram, but I’m new here so that’s what we’re using.

Line 13 of the code pulls all the JavaScript out of everything else (mostly CSS and a bit of HTML) and pops it in a list called pageJS.
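To illustrate the idea of separating the scripts from the rest of the page, here is a small sketch that uses only Python's built-in html.parser rather than BeautifulSoup (with BeautifulSoup, a call like `soup.find_all('script')` does a similar job). The sample page and the `ScriptCollector` class are made up for the example and are not the SelenaBot code.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text found inside every <script> tag on a page."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in several pieces, so gather them up.
        if self._in_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._chunks))
            self._in_script = False


# Made-up page: one style block, two scripts and a paragraph.
page = """<html><head><style>body { margin: 0; }</style>
<script>var a = 1;</script></head>
<body><p>Hello</p><script>var b = {"x": 1};</script></body></html>"""

collector = ScriptCollector()
collector.feed(page)
collector.close()
page_js = collector.scripts
print(len(page_js))  # prints 2
```

The CSS and the paragraph text are ignored, leaving a list with one string per script, which can then be indexed in the same way as pageJS.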

pageJS, and specifically pageJS[6] (the bit of code that contains the information that allows us to access the pictures), happens to be the ugliest thing you’ve ever damn well seen, and line 14 of SelenaBot took me the best part of an evening to write.

Here’s the full content of pageJS[6]:

and here’s line 14 again

allPics= json.loads(str(pageJS[6])[52:-10])['entry_data']['ProfilePage'][0]['user']['media']['nodes']

After stripping it of the beginning and end characters to render it into something that the json module can interpret, we can see that all the information we need (image URLs, timestamps, and more) is in a dictionary with the key “nodes”. To get to “nodes”, however, we have to pick our way through another dictionary nested in a third dictionary, nested in a fourth, nested in a list, which is nested in two further dictionaries. This took me a long time, painstakingly going through a text file I made of pageJS[6], and I rather suspect there was a tool that would’ve got me there a lot quicker. But whatever: we can access the dictionary, and we’ve called it allPics. From here on out it’s plain sailing. The for loop iterates through allPics and saves each picture to the hard drive, giving it a file name of the username plus the Unix timestamp.
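The kind of tool that might have saved that evening can be sketched in a few lines. This is a hedged example, not the original code. `extract_json` is a hypothetical helper that finds the outermost curly braces instead of relying on the hard-coded [52:-10] offsets, and `find_key_path` is a hypothetical helper that searches nested dictionaries and lists for a key and reports the route to it. The sample script text is a cut-down, made-up stand-in for pageJS[6]; the wrapper variable name and the fields inside "nodes" are assumptions.

```python
import json


def extract_json(script_text):
    """Parse the JSON object embedded in a script tag's text.

    Assumes there is exactly one top-level {...} object and that the text
    around it (the tag and the variable assignment) contains no braces.
    """
    start = script_text.index("{")
    end = script_text.rindex("}") + 1
    return json.loads(script_text[start:end])


def find_key_path(data, target, path=()):
    """Return the first route (keys and list indexes) to `target`, or None."""
    if isinstance(data, dict):
        for key, value in data.items():
            if key == target:
                return path + (key,)
            found = find_key_path(value, target, path + (key,))
            if found is not None:
                return found
    elif isinstance(data, list):
        for index, value in enumerate(data):
            found = find_key_path(value, target, path + (index,))
            if found is not None:
                return found
    return None


# Made-up, cut-down stand-in for pageJS[6]; the wrapper name is an assumption.
script = (
    '<script type="text/javascript">window._sharedData = '
    '{"entry_data": {"ProfilePage": [{"user": {"media": {"nodes": '
    '[{"date": 1482500000, "display_src": "https://example.com/a.jpg"}]}}}]}}'
    ';</script>'
)

data = extract_json(script)
path = find_key_path(data, "nodes")
print(path)  # prints ('entry_data', 'ProfilePage', 0, 'user', 'media', 'nodes')

# Follow the route step by step to reach the picture entries.
all_pics = data
for step in path:
    all_pics = all_pics[step]
```

The printed route matches the chain of keys used in line 14, so the same helper could be pointed at a different key if the page layout ever changes.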

The final two lines of code tell SelenaBot to open up a list of the top 100 most followed celebrity accounts and then download each account’s 12 most recent images, and off it goes.

Automated downloading of the pics of the Ellen Show and Cara Delevingne: for all my eyebrowinspo and mum-meme needs
And the final result!

Just briefly, the top100gram module is another, smaller scraper I wrote that pulls a list of the top 100 most followed Instagram accounts off a not particularly reliable-looking website called SocialBlade. SocialBlade hasn’t been updated since Justin Bieber quit and then rejoined Instagram, which meant the list was passing an incorrect username into SelenaBot. The poetic irony of having a piece of code named after Selena Gomez react violently and stop working at the mention of Justin Bieber’s name was not lost on me, and it also forced me to write my favourite three lines of code yet.

if 'justinbieber' in top100: 

It’s not too late to say sorry.