Hama time

This is what happens at your holiday cottage in the pretty Dorset coastal village of Charmouth when it has been raining for the last 24 hours, your boyfriend is poorly with a bad tooth infection, and the kids have your laptop and tablet because Roblox, but you brought a big tub of Hama beads for just such an occasion –

Side note 1 – Apparently you can program Roblox with the Lua programming language. I have never tried Lua, but making cool things for the small ones in my life would be a good excuse to give it a go.

Side note 2 – Charmouth beach was lovely when it wasn’t raining; it’s good to feel the sea and sand on your feet (I even found a fossil-decorated rock). We also went to Monkey World Ape Rescue Centre, where many marmosets and orangutans were admired.

Charmouth Beach

What happened on General Election night, and what Twitter said about it

Will flesh this out into a proper post when I’m not about to go to Sheffield! Edit 05/08/2017 – I wasn’t in Sheffield that long, but holidays and related activities in the last couple of weeks have not left much time for adding meat to the blog post bones.

Anyway, this was an interesting one to do, further indulging my election data nerd leanings. The seat declaration dataset was made available here by the House of Commons Library with an accompanying report (note that I used the first published version, so any errors present in there will be here also). The tweet data was again collected using the Search endpoint of the Twitter REST API via Tweepy (a library which greatly simplifies using the Twitter API from Python). I used the same method as in this post for the #dogsatpollingstations hashtag, and as I said there, a neater, more automated way of doing it would be good and is still in progress.
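For anyone new to Tweepy, a minimal sketch of the kind of call involved looks like this (the credential variables are placeholders for your own app’s keys, and the query is just an example) –

import tweepy

# Authenticate with your own application credentials (placeholders here)
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

# A single Search API call returns up to 100 matching tweets
for tweet in api.search(q='#GE2017', count=100):
    print(tweet.id, tweet.created_at, tweet.text)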

Counting tweets is obviously a simplistic way of analysing Twitter data, and much more detailed investigation would be needed before drawing any conclusions, but as so much was made of how Labour won social media, it was interesting to see consistently more tweets about Labour seats than Conservative ones through the night, despite more actual seats being declared for the Conservatives. With the usual warnings about correlation and causation in mind, the spike in tweet numbers at around 4:50am was striking, coming very close to the point at which 90% of constituencies had declared according to the Commons Library report. Shortly before this point, tweets about ‘hung parliament’ began to overtake those about ‘exit poll’ for the first time, suggesting that between 4am and 5am was when the realisation of what the result would actually be set in.

Plotting the tweets against the actual declaration times was intriguing, but the ‘high profile’ declarations didn’t really appear to correlate with spikes in tweets for the relevant party as I had thought they might. I couldn’t identify any patterns in the declaration time compared with the party or the percentage vote share either, but plotting the constituencies by party in this way revealed some interesting information, such as that the highest percentage vote shares (> 70%) tended to be Labour, while in seats won by the Lib Dems and SNP the share tended to be lower (< 50%).

Picking out some of the ‘extremes’ was informative too; Buckingham, for example, is the constituency of the Speaker of the House of Commons, who stands independent of any party and is traditionally not opposed in the constituency by any of the main parties. Furthermore, the Speaker does not vote in Parliament except to break ties, meaning that the people of the constituency essentially do not get the same representation as the rest of the country. This may explain why this constituency also had the largest number of invalid votes (as was also the case in 2015), and it would be interesting to investigate whether the votes were deemed invalid because the electors had actively spoiled their ballots in protest, or because they were just confused as to what their choices were.

 

Click the plot for a big version –

#GE2017

The making of

The Commons Library data is provided as two tables and I wanted to combine them to get the time, vote share, and turnout in a single table. For my data fiddlings I’m using the virtual machine handily provided with TM351 (the Open University’s data analysis course), which comes with a PostgreSQL database already set up, but through the course I found it a bit fiddly to interact with from the Jupyter notebook environment. Since I know where I am with Microsoft SQL Server (and hey, I’m not being assessed here), I ended up doing a bulk insert of the CSV files into two new tables, combining the two with a SQL join, and then copying the result to a new CSV, ready to import back into the notebook. The data didn’t require much cleaning, but I altered some constituency names to remove the commas, as they were causing me CSV import headaches.
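If you’d rather not leave Python at all, the same combination could be done with a pandas merge instead of a SQL join – a rough sketch, with made-up file and column names standing in for the real Commons Library ones –

import pandas as pd

# Hypothetical file and column names - the real Commons Library tables differ
times = pd.read_csv('declaration_times.csv')
results = pd.read_csv('constituency_results.csv')

# The equivalent of the SQL join: match rows on the constituency name
combined = times.merge(results, on='Constituency', how='inner')
combined.to_csv('combined.csv', index=False)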

Once I had the data I spent the bulk of the time tweaking the code for the plot to make the subplots work properly together with the annotations in the right places, and learnt / relearnt plenty by revisiting old TM351 notebooks and stumbling across StackOverflow answers. Rather than dissect it all here, I have once again put the notebook, combined CSV and SQL script on Github here.

Wildly speculating about what time of day people* voted with the #dogsatpollingstations hashtag

*Firstly let’s clarify that by ‘people’, I mean ‘some people, who use Twitter, perhaps have a dog, and took said dog when they voted’!

This is another ‘doodle’ which I did last month when I’d come down with severe election / post exam-season fever. I’m something of an election nerd; I worked at a polling station in Morley on election days for about 8 or 9 years and really like the buzz of it, to the extent that I based my project for TM356 around some method of helping Poll Clerks in their job (more about that in another post maybe).

Anyway, a nice (and non party political) thing to come out of recent general elections has been the #dogsatpollingstations hashtag on Twitter in which people post pictures of their faithful hounds outside the polling station when they go to vote. To try and learn how to work with Twitter data I used Tweepy to grab tweets from 8th June 2017 (UK General Election day), and then plotted them on a graph to see how the number of tweets containing the hashtag varied across the day –

The vast majority of the tweets on the day were sent during polling hours (07:00 – 22:00), although overall Twitter activity is presumably higher during those waking hours anyway. There appeared to be consistently more tweets in the first half of the polling hours (07:00 – 14:30) than in the second half (14:31 – 22:00), and although the rate starts to drop around 12:00, there’s a spike almost halfway through at 14:15 (maybe a lot of retweeting in a short period?). In the evening, another spike can be seen at 20:00 (after-dinner / EastEnders dog walkers? People settling down in front of their devices for the evening?).

One thing to mention is that I used the Search endpoint of the Twitter REST API, since I wasn’t organised / skilled enough to set up a thing to catch tweets from the Streaming API (maybe next time). Twitter notes that results from the REST search may not be absolutely exhaustive, being more concerned with relevance than completeness. Still, it was a good way to learn and provided at least a rough idea of the numbers.

Grabbing the tweets

This was done mostly based on this very useful post, which uses a loop and the ID of the last tweet of the previous results to get around the limit of 100 tweets returned per query, allowing you to harvest around 18,000 before you reach your rate limit (information about the limits here). There were many more than 18,000 #dogsatpollingstations tweets (I ended up with 96,519), so when I got the rate limit exception, I waited for 15 minutes (the limit resets after that) and then ran the whole thing again, passing in the ID of the very last tweet I’d got on the previous run. This is obviously a clunky way to do it and I’ve started to write a thing to do the waiting / resetting automatically, but it did the job.
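The general shape of the loop is something like this – a sketch rather than my exact code, with the waiting built in as the automated version would do it –

import time
import tweepy

api = tweepy.API(auth)  # auth set up with your own app credentials as earlier

all_tweets = []
max_id = None  # or the ID of the last tweet from a previous run

while True:
    try:
        batch = api.search(q='#dogsatpollingstations', count=100, max_id=max_id)
    except tweepy.RateLimitError:
        # The rate limit window resets after 15 minutes, so wait it out
        time.sleep(15 * 60)
        continue
    if not batch:
        break  # no older tweets left to fetch
    all_tweets.extend(batch)
    # Continue the next query from just below the oldest tweet seen so far
    max_id = batch[-1].id - 1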

Plotting the result 

I didn’t do anything all that fancy to plot the data but I did learn some interesting bits –

TimeGroupers

pd.TimeGrouper allows you to group data with a timestamp into ‘chunks’ of time (say, every 15 minutes) and can be used with groupby() to aggregate the datapoints in each chunk. For example, I had a DataFrame with just a tweet ID and a timestamp, and I grouped it to show the number of tweets in each 15-minute period like this –

# The TimeGrouper will need the timestamp as the index, so set this first
daps_pd.set_index('Timestamp', drop=False, inplace=True)

# Then use a groupby and pass in a TimeGrouper with the frequency you want -
daps_pd_plot = daps_pd.groupby(pd.TimeGrouper(freq='15Min')).count()
daps_pd_plot
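(One thing worth knowing if you try this with a newer pandas: pd.TimeGrouper has since been deprecated in favour of pd.Grouper, which takes the same freq argument and works the same way here –)

daps_pd_plot = daps_pd.groupby(pd.Grouper(freq='15Min')).count()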

Using images with the matplotlib plt.figimage and plt.imread functions 

These make it much easier to add an image to your plot than I would have expected, if you want to go a bit infographic-y. With an image prepared, you can add it to a plot in a single line as below (the alpha is set very low to have it overlaid faintly on the plot, and the zorder controls when it is drawn). Looking at the documentation for the image stuff, there’s all sorts you can do, so it’s one for further investigation!

plt.figimage(plt.imread('woof.png'), 100, 70, alpha=.07, zorder=5)

Here’s the full snippet anyway –

import matplotlib.pyplot as plt

xtick_array = [array of the hourly timestamps which I'm sure you can imagine]

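# Area plot of the tweets per 15-minute period, with an x tick at each hour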
daps_pd_plot.plot(kind='area', alpha=0.5, figsize=(12,6), legend=False, xticks=xtick_array)

plt.title("Number of tweets containing #dogsatpollingstations per 15-minute period, 8th June 2017\n")
plt.xlabel("Time")
plt.ylabel("Tweets per 15 minute period")

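# Mark the polling hours (07:00 - 22:00) with labelled vertical lines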
plt.axvline('2017-06-08 07:00:00', color='black')
plt.axvline('2017-06-08 22:00:00', color='black')
plt.text('2017-06-08 07:15:00', 2000,'Polls open', rotation=90)
plt.text('2017-06-08 22:15:00', 2000,'Polls close', rotation=90)

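# Overlay the dog image very faintly on the plot, then save the result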
plt.figimage(plt.imread('woof.png'), 100, 70, alpha=.07, zorder=5)
plt.savefig('vis_output/DogsAtPollingStations.png')

 

Edit 13/08/2017 – This one is on Github now too

Visualising the BBC salaries for Pandas / Matplotlib etc. data tinkering practice

Data on BBC salaries was published on Tuesday as part of the organisation’s annual report, to much controversy over the gender gap it revealed (amongst other issues). Given the other revelation earlier this week that I hadn’t completely fouled up my TM351 grade, I thought I’d have a play with whatever data I could find on the subject, and the below is what I came up with. What it shows has been covered a lot in the news already, but a good honest bar graph can put things in proportion and I hadn’t actually seen many. It helped me understand just how much more the highest earners were paid compared with the rest, and what the actual distribution of the women looked like –

 

BBC Salaries for 'On-air Talent and Contributors' 2016-17

 

I couldn’t find a handy CSV file of the results, but found the report here and converted it to an Excel workbook using Nitro (which is free, seemed to do a reasonable job, and only asks for your email address). I cleaned up the file in Excel and added the genders (based on name, and Googling if I wasn’t sure, so please correct any inaccuracies!), and then imported it into Pandas. Plotting it was fairly straightforward, and I learned how to plot categorical data on the x axis, and a nice way to colour the bars based on category from this StackOverflow answer, like this –

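# One colour per gender, looked up from each row's Sex to colour its bar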
colors = {'Male': 'orange', 'Female': 'purple'}
salary_df.plot.bar(x='Name', y=['Band Upper Bound'], figsize=(24, 6),
                   color=[colors[i] for i in salary_df['Sex']])

I’d never seen .gcf() before either; I used it for plotting the ‘source’ note, since it gets hold of the current figure and lets you place text relative to the whole figure rather than the axes. Here is the CSV file and the Jupyter notebook for anyone curious (the original BBC PDF says that ‘The text of this document… may be reproduced free of charge in any format or medium provided that it is reproduced accurately and not in a misleading context’ so we’re fine to fiddle with it).
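For illustration, a call along these lines does the job (the wording and coordinates here are just made up) –

# Place a source note relative to the whole figure (coordinates run 0-1 across the figure)
plt.gcf().text(0.99, 0.01, 'Source: BBC Annual Report 2016/17', ha='right', fontsize=8)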

Digression about version control – I’ll put this on Github at some point, but I’ve always used SVN through TortoiseSVN (does anyone still use that? I used to quite like it) or TFS for such matters, and have somehow managed to avoid Git, so it remains on my endless to-do list.

Edit – OK, it’s on Github now here. That wasn’t so bad, but I did just use the interface and uploaded the files manually.

Digression about cycling – Similarly to the Git admission, I also ‘came out’ this week about never having really learned to ride a bike properly. My lovely partner however bought me this beautiful metal beast for my birthday on Monday, so I will be riding her to work with regularity by my next birthday –

 

Bike