Well, it’s Christmas so I’m taking on stupid self-invented programming projects, much like last year with SelenaBot. This year I’ve decided I’m going to create a Markov text generator that recreates the work of a Young Kadhim Shubber (YKS).
I’ve decided to scrape all Kadhim’s work on Felix that’s available in non-PDF form (for now… if this becomes very easy perhaps I’ll try and automate PDF reading so we can even get his frozen hot-takes on world politics from 2008).
First order of the day is making a Markov text parser and generator that work in Python 3. I forked some old Python 2 code I found on GitHub into Markov_text3, which is a Python 3 module for generating Markov chains of text from a given source (saved as a .txt file).
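For anyone who hasn’t met one before, a Markov text generator boils down to something like the sketch below. This is not the Markov_text3 code itself: the function names build_chain and generate, and the choice to use a single word as the state, are my own assumptions for illustration.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the source."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, length=20, rng=random):
    """Walk the chain from `start`, picking a random follower at each step."""
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: nothing ever follows this word
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


if __name__ == "__main__":
    source = "the cat sat on the mat and the cat ate the fish"
    print(generate(build_chain(source), "the", length=10))
```

Words that appear more often after a given word get picked more often, because they appear more often in that word’s list of followers. That is all the "Markov" part really amounts to here.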
Now we have Markov_text3, I need to get the source material. Having a quick look at Kadhim’s page on Felix (http://felixonline.co.uk/authors/ks607/) he has around 140 articles online, although not all of them published under his name are written by him (he used to be editor and thus I guess uploaded other people’s stuff). It appears to be material from pages 6 & 7 (his later work) so I’ll scrape that material manually.
Mid-Jan Update: I got busy over the xmas holidays and this project fell by the wayside. The main problem was that I have completely forgotten how to use Beautiful Soup properly and thus couldn’t easily get a list of all the articles that Kadhim had written. This was stupid because I managed to get a Markov generating script going quite nicely.
A few weeks ago, whilst reading a hand-wringing article about how automated content production on YouTube & social media will lead to the inevitable collapse of western civilization, my mind wandered to the idea of social media content algorithms behaving like real social media users. Then the stupid question at the top of this post popped into my head and the stupid answer popped in right afterwards. An algorithm’s selfie would be its source code. I’m a big fan of contextless robotic text tweets, especially @IAM_SHAKESPEARE, and have always wanted to own a bot of my own. So I filed this in my head under “stupid programming exercise to do when you’ve got nothing on” and got on with my life, which currently involves a lot of redecorating.
Fast-forward to today, where I am in day 2 of what feels like a 2 day cold. I’m too bleary and snotty to concentrate on anything serious like my job, but I’ve also just slept for about 18 hours, so I needed some moderate activity to tire me out in preparation for another long sleep. 2 hours later, I present you with @Recurbot.
Code below, but basically it’s a standard Twitter bot that reads itself, chops three random lines out of itself and then tweets them. I chose to make it random, as opposed to sequential, to add a bit of variety to it.
So far I’ve not set this up to be automated; it’s just running manually when I execute it on my laptop. I’m going to use this as an excuse to start playing around with pythonanywhere.com to see if I can get it to work there. If I can, I expect I’ll set it up to tweet every 2 hours.
By picking three random lines the bot has 27 unique tweets, whereas going through the code sequentially in 3-line chunks would only give 10, a 170% increase in content!! I’m not sure if I should beef up the code by adding in the modules that it currently imports: random.py, json.py and tweepy.py. random.py is 726 lines long and obviously most of it isn’t really used in the production of the tweets; only randint is called, once, to get a number between 0 and 27. The more of this paragraph I write, the more convinced I am to just leave it at 30 lines long and let it have some repetition. After all, the bot that tweets Shakespeare is on its 4th run through and it doesn’t make it any less amusing when it pops up in my work feed.
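The self-reading part looks roughly like the sketch below. It is a reconstruction rather than Recurbot’s actual source: pick_snippet is a made-up name, and I’m assuming the three lines are consecutive, starting from a random line number. The tweeting step is left out entirely.

```python
import random


def pick_snippet(path, n_lines=3, rng=random):
    """Return n_lines consecutive lines from the file, starting at a random line."""
    with open(path, encoding="utf-8") as source:
        lines = source.read().splitlines()
    if len(lines) <= n_lines:
        return "\n".join(lines)
    # randint is inclusive at both ends, so the last valid start is len - n_lines
    start = rng.randint(0, len(lines) - n_lines)
    return "\n".join(lines[start:start + n_lines])


# Usage inside the bot's own script, where __file__ is the script's path:
#     tweet_text = pick_snippet(__file__)
```

Passing the script’s own path is what makes the bot quote itself; point it at any other text file and it will happily quote that instead.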
Well, on Christmas Eve, the day after I wrote & blogged about SelenaBot, I tried to run it again and got a bunch of “connection error” exceptions. Having run this by some friends, we suspected that Facebook’s servers detected Instagram was getting scraped, got angry about it and started blocking me.
Changing my IP address – by coming back to my parents’ house for Christmas – seems to have worked: It runs again smoothly, although much more slowly as 1Gbps internet isn’t a thing in the countryside. I’m hoping that a more subtle scraper that I’m working on, which will only download images taken in the previous 24 hours, will be much kinder on the Instagram server and not get blocked. Otherwise I might have to teach my code how to change its IP address midway through operation…
Well, things have gone very successfully, and SelenaBot will now only download files that were taken on the calendar day that SelenaBot is run. I need to alter this so it is actually the preceding 24hrs, but I’m not sure how to do that yet. This means that SelenaBot only scrapes around 150 pictures as opposed to 1,200 pictures, and the requests come in at more random time intervals (as opposed to one after the other), and so far this hasn’t resulted in any blocks. However, all my scraping at this IP address has been done within one 24hr period, so it will be interesting to see if I suddenly start getting server no reply exceptions tomorrow (i.e. does Facebook analyse all its traffic once every 24hrs and then block offending requesters from the next day’s traffic?).
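One way the “preceding 24hrs” check might work, assuming each picture comes with a Unix timestamp (which is what SelenaBot uses in its file names): compare the timestamp against the current time rather than against the calendar date. The name taken_in_last_24h is made up for this sketch.

```python
import time

DAY_SECONDS = 24 * 60 * 60


def taken_in_last_24h(unix_timestamp, now=None):
    """True if the timestamp falls within the 24 hours before `now`."""
    if now is None:
        now = time.time()
    # Negative differences are timestamps in the future, which we reject
    return 0 <= now - unix_timestamp <= DAY_SECONDS
```

Because Unix timestamps are just seconds since 1970, this sidesteps time zones and midnight boundaries altogether.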
Another interesting thing that happened between this morning and this afternoon is that the account “repostapp” suddenly disappeared in the middle of the day (as in, it worked this morning, stopped working this afternoon) which meant that it was derailing the whole of SelenaBot, in much the same way Justin Bieber did the first time I ran the code. To stop this from happening in the future as and when other accounts get deleted/taken down, I’ve taught myself Exception Handling, so now when an account disappears, the code can carry on running.
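The pattern is roughly the following. It’s a sketch, not SelenaBot’s real code: download_all and the download_one argument are placeholder names standing in for whatever actually fetches one account’s pictures.

```python
def download_all(usernames, download_one):
    """Try every account; collect the ones that fail instead of crashing."""
    failed = []
    for user in usernames:
        try:
            download_one(user)
        except Exception as error:  # broad on purpose: any one account may break
            print(f"Skipping {user}: {error}")
            failed.append(user)
    return failed
```

Catching Exception is a blunt instrument, but for a scraper where accounts vanish without warning it keeps the loop going, and the returned list says which accounts need a look afterwards.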
One of my New Year’s resolutions for 2016 was to learn to code a bit, and with just over a week to go I feel like I’ve been able to produce an idea for a piece of code, structure it, and get it to work. From scratch. By myself.
I started learning Python using Automate the Boring Stuff, a website I found so useful I bought the book to say thank you to the author. I’ve learnt all the basics from section one and started coming up with exercises myself to practice with, which is how SelenaBot came about.
I’ve had an idea for a Science Engagement stunt/thing (which, if it works, I’ll devote future blog posts to as it develops) that requires access to lots of Instagram posts, preferably by celebrities. So I’ve put together a piece of code that automatically accesses celebrity Instagrams and downloads their most recent 12 pictures, and I’m going to talk us through it here.
I suspect that this gross mess is because using BeautifulSoup is not the most intelligent way to scrape Instagram, but I’m new here so that’s what we’re using.
PageJS, and specifically PageJS– the bit of code that contains the information that allows us to access the pictures – happens to be the ugliest thing you’ve ever damn well seen, and line 14 of SelenaBot took me the best part of an evening to write.
After stripping it of the beginning and end characters to render it into something that the JSON module can interpret, we can see that all the information we need (image URLs, timestamps, and more) is in a dictionary with the key “Nodes”. To get to “Nodes”, however, we have to pick our way through another dictionary nested in a third dictionary, nested in a fourth, nested in a list, which is nested in two further dictionaries. This took me a long time, painstakingly going through a text file I made of the JS, and I really rather suspect there was a tool that would’ve got me there a lot quicker. But whatever, we can access the dictionary and we’ve called it allPics. From here on out it’s plain sailing: the for loop iterates through the allPics dictionary and saves each picture to the hard drive, giving it the file name of User + the Unix time stamp.
The final two lines of code tell SelenaBot to open up a list of the top 100 most followed celebrity accounts and then download all their most recent 12 images, and off it goes.
Just briefly, the top100gram module is another smaller scraper I wrote that pulls a list of the Top 100 most followed Instagram accounts off a not particularly reliable looking website called SocialBlade. SocialBlade hasn’t been updated since Justin Bieber quit and then rejoined Instagram, which meant the list was passing an incorrect username into SelenaBot. The poetic irony of having a piece of code named after Selena Gomez react violently and stop working at the mention of Justin Bieber’s name was not lost on me, and also forced me to write my favourite three lines of code yet.
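To make that less abstract, here is a toy version of the same dance. The blob below is invented and far shallower than the real thing: the wrapper text and every key except "nodes" are assumptions for illustration, as are the function names.

```python
import json

# A toy stand-in for the page's JavaScript blob. The wrapper text and every
# key except "nodes" are invented for this example.
page_js = (
    'window._data = {"profile": {"media": {"nodes": ['
    '{"src": "https://example.com/a.jpg", "date": 1482537600}, '
    '{"src": "https://example.com/b.jpg", "date": 1482541200}]}}};'
)


def extract_json(js_text):
    """Strip the JavaScript wrapper, leaving only the {...} payload."""
    start = js_text.index("{")
    end = js_text.rindex("}") + 1
    return json.loads(js_text[start:end])


def picture_filenames(user, js_text):
    """Return (filename, url) pairs, each file named user + Unix timestamp."""
    data = extract_json(js_text)
    all_pics = data["profile"]["media"]["nodes"]  # hypothetical path
    return [(f"{user}{pic['date']}.jpg", pic["src"]) for pic in all_pics]


if __name__ == "__main__":
    for filename, url in picture_filenames("selenagomez", page_js):
        print(filename, url)
```

As for the tool I suspected existed: printing json.dumps(data, indent=2) lays the nesting out one level per indent, which makes the path to the pictures a lot easier to spot than squinting at a text file.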
if 'justinbieber' in top100: