I’ve always wondered what the distribution of Twitter users looks like across the world. I assume usage falls off at night in the US, but who knows? Can I really be sure that most users are in the Mission in San Francisco? Like any fiddler with a penchant for coding and some free time, I decided to figure it out myself.
Step 1: Get some tweets.
Twitter offers a sampling API that gives you random tweets from up to 3 years or so ago. I set up an instance of Phirehose and let it run for about half an hour, saving the geo and created_at fields in a MySQL table. It swiftly dumped ~1.3 million tweets, consuming 58 MB. Awesome!
Step 2: Normalize the data.
Tweets are a complex chunk of data delivered in either JSON or XML. Their fidelity to the official format is at the mercy of the client used to create the tweet, as well as the user. Needless to say, the data was ugly. Basic steps taken (along with the tweets left after each filter):
- Remove empty geo fields (~200,000 left)
- Regex for XX.XXXX, XX.XXXX (35,462 left)
- Remove “true” lat/long values written with North/N/n markers, to leave just signed decimal coordinates (~29,000 left)
- Remove identical coordinate pairs (21,750 left)
- Check for out-of-bounds values for date/time (~19,000 left)
So now I was left with 19,000 rows in a DB similar to: 40.4410, -105.25529, "Feb 2 2009 4:41:44 EST"
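A minimal sketch of this kind of cleanup in JavaScript follows. It is an illustration, not the code used for the project: the names `parseGeo` and `cleanTweets` are hypothetical, the regex is only one way to match the XX.XXXX, XX.XXXX pattern, and the date check only tests whether the timestamp parses at all.

```javascript
// Matches "lat, lon" as signed decimals, e.g. "40.4410, -105.25529".
// Anything with compass letters (N/S/E/W) or place names will not match.
const LAT_LON = /^\s*(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)\s*$/;

// Hypothetical helper: turn a raw geo string into { lat, lon }, or null.
function parseGeo(raw) {
  if (typeof raw !== "string") return null;
  const match = LAT_LON.exec(raw);
  if (!match) return null;
  const lat = parseFloat(match[1]);
  const lon = parseFloat(match[2]);
  // Reject coordinates that cannot exist on the globe.
  if (Math.abs(lat) > 90 || Math.abs(lon) > 180) return null;
  return { lat, lon };
}

// Hypothetical helper: apply the filters in order to rows shaped like
// { geo: "...", created_at: "..." } and return the surviving points.
function cleanTweets(rows) {
  const seen = new Set();
  const kept = [];
  for (const row of rows) {
    const point = parseGeo(row.geo); // drops empty and malformed geo fields
    if (!point) continue;
    const time = Date.parse(row.created_at); // NaN when the date is unusable
    if (Number.isNaN(time)) continue;
    const key = point.lat + "," + point.lon; // drop identical coordinate pairs
    if (seen.has(key)) continue;
    seen.add(key);
    kept.push({ lat: point.lat, lon: point.lon, time });
  }
  return kept;
}
```

With these definitions, `parseGeo("40.4410, -105.25529")` yields a latitude of 40.441 and a longitude of -105.25529, while a value such as "Boulder, CO" yields `null`. Note that `Date.parse` is only guaranteed to handle ISO 8601 strings; other date formats are parsed on a best-effort basis that varies by engine.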
Step 3: Put it on to <canvas>, a drawing element supported by browsers that play nicely with HTML5 (newish Firefox, Safari, and Chrome builds).
The earth isn’t flat, so latitude and longitude don’t drop straight onto a flat grid. Luckily someone else has solved this problem with the equirectangular projection. It is an equidistant projection: longitude and latitude map linearly to x and y, so distances stay true along the meridians, at the cost of more distortion toward the poles. I used the Creative Commons map available on Wikipedia.
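With an equirectangular map, turning a coordinate into a pixel is just a linear rescale. The sketch below assumes the map image spans the whole globe (longitude -180 to 180, latitude -90 to 90) and is centered on the prime meridian; the function name `project` is hypothetical.

```javascript
// Map a latitude/longitude pair (in degrees) to pixel coordinates on an
// equirectangular map of the given size. Assumes the image covers the full
// globe, with (0, 0) at the top-left corner of the canvas.
function project(lat, lon, width, height) {
  const x = ((lon + 180) / 360) * width;  // -180..180 maps to 0..width
  const y = ((90 - lat) / 180) * height;  // 90..-90 maps to 0..height
  return { x, y };
}
```

For example, on a 360 by 180 pixel map, `project(0, 0, 360, 180)` lands in the center at x = 180, y = 90, and the north-west corner `project(90, -180, 360, 180)` lands at x = 0, y = 0.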
I built the page up with this image placed behind the <canvas> element with CSS, and started in on the JavaScript. For a quick project where page heaviness is fine, jQuery is a good choice. I made some basic functions and threw in a jQuery UI slider (all of which is visible in the source).
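The actual functions live in the page source and are not reproduced here. As a rough sketch of the idea, assuming each tweet is an object with `lat`, `lon`, and a millisecond `time`, and that the slider picks an hour of the day in UTC, the drawing step might look something like this. The names `tweetsInHour` and `drawTweets`, the dot size, and the color are all made up for illustration.

```javascript
// Keep only the tweets whose timestamp falls in the given UTC hour (0-23).
function tweetsInHour(tweets, hour) {
  return tweets.filter((t) => new Date(t.time).getUTCHours() === hour);
}

// Clear the canvas and draw one small square per tweet, using the same
// linear equirectangular mapping as the background map.
// ctx is a 2D rendering context, e.g. canvas.getContext("2d").
function drawTweets(ctx, tweets, width, height) {
  ctx.clearRect(0, 0, width, height);
  ctx.fillStyle = "rgba(255, 0, 0, 0.6)";
  for (const t of tweets) {
    const x = ((t.lon + 180) / 360) * width;
    const y = ((90 - t.lat) / 180) * height;
    ctx.fillRect(x - 1, y - 1, 2, 2); // 2x2 pixel dot centered on the point
  }
}

// In the browser, a jQuery UI slider could drive the redraw. The element IDs
// and the allTweets array are hypothetical:
//
//   var canvas = document.getElementById("map");
//   var ctx = canvas.getContext("2d");
//   $("#hour-slider").slider({
//     min: 0,
//     max: 23,
//     slide: function (event, ui) {
//       drawTweets(ctx, tweetsInHour(allTweets, ui.value), canvas.width, canvas.height);
//     }
//   });
```

Clearing and redrawing everything on each slider move is the simplest approach, and it should hold up fine for a data set of this size.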
Results: So I wound up with a pretty cool way of showing who tweets where, and when. Each hemisphere does indeed see lower usage during its night-time, which matches my assumptions. Neat!
Caveats: A few words of warning:
- There is a sampling bias in how I “cleaned” my data. My filters kept no Kanji characters, for example, which may explain the apparently lower usage in Japan.
- I also opted against geocoding locations such as “Boulder, CO” due to computational overhead. Google does offer a geocoding API through Google Maps, but I deemed it outside the scope of the project.
- This will not render on older browsers or any version of Internet Explorer. There is a project called Excanvas to get <canvas> rolling on IE, but this was also outside of my scope.
Thanks for looking!