One of my New Year's resolutions for 2016 was to learn to code a bit, and with just over a week to go I feel like I've been able to produce an idea for a piece of code, structure it, and get it to work. From scratch. By myself.

I started learning Python using Automate the Boring Stuff, a website I found so useful I bought the book to say thank you to the author. I've learnt all the basics from section one and started coming up with exercises of my own to practise with, which is how SelenaBot came about.

I've had an idea for a Science Engagement stunt/thing (which, if it works, I'll devote future blog posts to as it develops) that requires access to lots of Instagram posts, preferably by celebrities. So I've put together a piece of code that automatically accesses celebrity Instagram accounts and downloads their 12 most recent pictures, and I'm going to walk through it here.

To start with, we import all the libraries we'll need: Requests, for requesting webpages; BeautifulSoup, to read the information on those webpages; json, because once you read the information on the page you discover it's an unholy pile of tangled-up JavaScript that takes a whole evening to pick apart; and a list of the top 100 Instagram accounts that I made separately (and will talk about later).
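In code, that setup looks something like this. The top 100 list comes from my own top100gram module, so I've stubbed in a short stand-in list here:

```python
import json                    # stdlib: for untangling the JavaScript blob
import requests                # for requesting webpages (pip install requests)
from bs4 import BeautifulSoup  # for reading the pages (pip install beautifulsoup4)

# In SelenaBot this list comes from my top100gram module; a stand-in here:
top100 = ['selenagomez', 'taylorswift', 'beyonce']
```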

As alluded to above, the real killer for using BeautifulSoup to scrape Instagram is that, when you run a request for an Instagram page, instead of getting the website you get a horrible pile of JavaScript.

When you access it using a web browser
When you access it using the Python Requests module

I suspect this gross mess is because BeautifulSoup is not the most intelligent way to scrape Instagram, but I'm new here, so that's what we're using.

Line 13 of the code pulls all the JavaScript out from everything else (mostly CSS and a bit of HTML) and pops it in a list called pageJS.
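Roughly, it works like this, with a toy page standing in for the real Instagram response (the real line 13 runs on the soup of the full page):

```python
from bs4 import BeautifulSoup

# A toy page standing in for the real Instagram response
html = """<html><head>
<style>.nav { color: red; }</style>
<script>window._sharedData = {"entry_data": {}};</script>
</head><body></body></html>"""

soup = BeautifulSoup(html, 'html.parser')
pageJS = soup.find_all('script')  # every <script> tag, leaving the CSS and HTML behind
```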

pageJS, and specifically pageJS[6] – the bit of code that contains the information that allows us to access the pictures – happens to be the ugliest thing you've ever damn well seen, and line 14 of SelenaBot took me the best part of an evening to write.

Here's the full content of pageJS[6]:

and here's line 14 again:

allPics = json.loads(str(pageJS[6])[52:-10])['entry_data']['ProfilePage'][0]['user']['media']['nodes']

After stripping it of the beginning and end characters to render it into something that the json module can interpret, we can see that all the information we need – image URLs, timestamps, and more – is in a dictionary under the key "nodes". To get to "nodes", however, we have to pick our way through another dictionary nested in a third dictionary, nested in a fourth, nested in a list, which is nested in two further dictionaries. This took me a long time of painstakingly going through a text file I made of pageJS[6], and I rather suspect there was a tool that would've got me there a lot quicker. But whatever: we can access the dictionary, and we've called it allPics.

From here on out it's plain sailing: the for loop iterates through the allPics dictionary and saves each picture to the hard drive, giving it a file name of the username plus the Unix timestamp.
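Here's the same idea on a toy blob. The exact [52:-10] offsets belong to the real page, so I strip the wrapper more generally here, and I've swapped the actual download call for a comment:

```python
import json

# A toy stand-in for pageJS[6]; the real blob is thousands of characters long.
script = ('<script type="text/javascript">window._sharedData = '
          '{"entry_data": {"ProfilePage": [{"user": {"media": {"nodes": '
          '[{"display_src": "https://example.com/a.jpg", "date": 1482000000}]'
          '}}}]}};</script>')

# Strip the wrapper so json.loads can cope (SelenaBot hard-codes [52:-10])
payload = script.split('= ', 1)[1].rsplit(';', 1)[0]
allPics = json.loads(payload)['entry_data']['ProfilePage'][0]['user']['media']['nodes']

user = 'selenagomez'
for pic in allPics:
    filename = user + str(pic['date']) + '.jpg'  # username + Unix timestamp
    # In SelenaBot: fetch pic['display_src'] with Requests and write the
    # bytes to `filename` on the hard drive.
```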

The final two lines of code tell SelenaBot to open up a list of the top 100 most-followed celebrity accounts and then download each account's 12 most recent images, and off it goes.
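The control flow at the bottom of the script is essentially this, with stand-ins for both the top100gram list and the download function so the loop is visible on its own:

```python
# Stand-ins: the real list comes from my top100gram module, and the real
# function fetches a profile page and saves its 12 most recent pictures.
top100 = ['selenagomez', 'taylorswift', 'beyonce']
downloaded = []

def selenabot(user):
    downloaded.append(user)  # record the call so the flow is visible

for user in top100:
    selenabot(user)
```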

Automated downloading of the pics of the Ellen Show and Cara Delevingne: for all my eyebrow-inspo and mum-meme needs
And the final result!

Just briefly, the top100gram module is another, smaller scraper I wrote that pulls a list of the top 100 most-followed Instagram accounts off a not particularly reliable-looking website called SocialBlade. SocialBlade hadn't been updated since Justin Bieber quit and then rejoined Instagram, which meant the list was passing an incorrect username into SelenaBot. The poetic irony of having a piece of code named after Selena Gomez react violently and stop working at the mention of Justin Bieber's name was not lost on me, and it also forced me to write my favourite three lines of code yet.

if 'justinbieber' in top100: 

It’s not too late to say sorry.
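Only the first of the three lines survives above; the rest of the guard presumably looks something like this (my reconstruction, with a stand-in list):

```python
top100 = ['selenagomez', 'justinbieber', 'taylorswift']  # stale SocialBlade list

# Drop the username that no longer resolved on Instagram
if 'justinbieber' in top100:
    top100.remove('justinbieber')
```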

