Scraping pdfs, Ruby and Idaho elections, or …

… why I am never going to be a programmer. Last week I saw several reporters tweeting about checking the Idaho Secretary of State’s Elections page throughout the day so as not to miss any new candidate filings during the current Idaho election filing period.

I feel for Kevin and the rest of the Idaho Capitol Press Corps. While the Secretary of State does make a good effort to provide public information, and Secretary of State Ben Ysursa has been a champion of public records, there has got to be a better way.

See the live map here!

This is what happens now: during the filing period, the Secretary of State updates a .pdf file two or more times a day and posts it on his web page. That means anyone interested in the intrigue of candidate filings in Idaho’s newly redrawn legislative districts has to bookmark the page, reload it several times a day and scan it for new entries.

I’ve been reading a lot about data journalism and thought I could follow some of the tutorials out there and create a nice little web scraper to tease out the scoops on candidate registration. I’m really inspired by sites like ProPublica, which puts together awesome databases like this one on what influenced support for SOPA/PIPA, or this one on unequal course offerings in schools across the nation, which incorporates maps. I like maps. The Guardian in the UK is also doing amazing things with data journalism, making information public, available and accessible. These guys make it look easy.

Brett Nelson, AKA "The Ruby"

So last week I took a Boise City Community Education class on Ruby programming, thinking I could pick up a few new skills. The only things I really learned were that I’m going to stick with WordPress as a blog platform and that it’s unlikely I’ll ever be a real programmer. But I pitched an idea to the teacher, my friend Brett Nelson, who works at Customated.com: grab the candidate .pdf from the Secretary of State, convert it into a useful format and post it somewhere automatically, so that my reporter friends don’t have to work so hard.

As with anything, it’s harder than it looks, and Brett ended up doing all of the heavy lifting on this.

Here is what (I think) we did:

  1. Wrote a Ruby script (a little computer program) that automatically downloads the .pdf from the Secretary of State’s website as often as we like, converts it to text using pdftotext from the Xpdf suite (which preserves the column spacing), and then uses regular expressions to parse the text into fields: District, Office, Party, Name, Address. The program runs remotely, on a server, and activates itself on a schedule. Yeah, it’s a bot.
  2. Used our nicely formatted text file to build a Google Fusion Table. Again, this is automated, using Google Maps/Fusion Tables API commands. Fusion Tables lets us easily place candidates on a Google Map, though actually building the map on a web page is more difficult and requires a working knowledge of JavaScript, another programming language.
  3. I told Brett I’d take it from there, but again, it was not that simple. First of all, I could not figure out how to embed all the JavaScript I’d need in a blog post on this blog, so I had to make a new web page that just looks like it’s part of this blog. Also, I really don’t get the details of JavaScript anyway. So once again, Brett bailed me out.
  4. I found a cool use of Fusion Tables that I wanted to emulate, letting the user select Senate, House A or House B races to display on the map so they can easily see which districts have contested races. I spent hours trying to customize the Chicago homicide map linked above to my own needs, but gave up in desperation. Brett figured out the right API calls we’d need to redraw the map (it’s not really that complicated, but it’s like learning a foreign language).
  5. Finally, after I’d say 25 hours between us (though my hours were much less valuable than his), we had a workable map. I’m still bummed it’s not as pretty or as full of functionality as the ProPublica projects, but I think it’s quite useful.
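To give a flavor of step 1, here’s a rough Ruby sketch of the parsing stage. The row layout below is entirely hypothetical — the real regular expression depends on how the Secretary of State’s .pdf comes out of pdftotext — but the idea is the same: columns separated by runs of spaces get split into named fields.

```ruby
# Sketch of the scraper's parsing stage (hypothetical column layout).
# Assumes the .pdf was first converted with something like:
#   pdftotext -layout candidates.pdf candidates.txt
# so each candidate row keeps its column spacing.

# Hypothetical row: district, office, party, name, address,
# with fields separated by two or more spaces.
ROW = /^\s*(\d+)\s{2,}(.+?)\s{2,}(\w+)\s{2,}(.+?)\s{2,}(.+?)\s*$/

def parse_row(line)
  m = ROW.match(line) or return nil
  {
    district: m[1],
    office:   m[2],
    party:    m[3],
    name:     m[4],
    address:  m[5]
  }
end

sample = "14   State Senator     REP   Jane Doe      123 Main St, Boise"
p parse_row(sample)
# => {:district=>"14", :office=>"State Senator", :party=>"REP",
#     :name=>"Jane Doe", :address=>"123 Main St, Boise"}
```

The lazy quantifiers (`.+?`) matter here: office names and addresses contain single spaces, so the regex only breaks fields on double-space runs.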
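Step 2 works the same way in spirit: each parsed row gets turned into a SQL-style INSERT statement that the Fusion Tables API accepts. This is just a sketch of the statement-building part — the table id and column names here are placeholders, and real code would also handle authentication and the HTTP POST to Google.

```ruby
# Sketch of step 2: turning a parsed candidate row into a Fusion Tables
# INSERT statement. TABLE_ID and the column names are made up.
TABLE_ID = "1234567"

def insert_sql(row)
  cols = [:district, :office, :party, :name, :address]
  values = cols.map do |c|
    v = row[c].to_s.gsub("'") { "\\'" }  # escape single quotes (e.g. O'Brien)
    "'#{v}'"
  end.join(", ")
  "INSERT INTO #{TABLE_ID} (District, Office, Party, Name, Address) " \
  "VALUES (#{values})"
end

row = { district: "14", office: "State Senator", party: "REP",
        name: "Jane Doe", address: "123 Main St, Boise" }
puts insert_sql(row)
```

One statement per new candidate, sent whenever the bot spots a filing that isn’t already in the table, keeps the map current without anyone touching it.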

So what can you do with this web app? First of all, I think it’s the first Google map of the newly drawn legislative districts. I converted the .shp file (plan L93) that the Commission published to a .kml file (which Google Maps reads) using QGIS on my Mac. So now people can fly around the new districts, find their house, make sure that candidates actually live in their districts, figure out which legislators have the best nearby hunting spots, etc. UPDATE: The locations plotted for candidates are their filing addresses, not necessarily their home addresses, and thus may fall in a different district.

Second, you can easily see which races are contested in the May 15 primary elections by selecting one of the three legislative races and looking for red or blue markers on the map. The map is constantly updated with new filing information. Candidate filing closes in a week, on March 9, so we will have to figure out how to keep this information useful on an ongoing basis. I’d like to add web links and Twitter feeds for the candidates, and links to news articles about the races (if anyone wants to help, please speak up)!

Please let us know what you think and feel free to use any of the info you glean from our web app!

This entry was posted in Features, Journalism 2.0. Bookmark the permalink.