We’re totally PDF’d: Open state-level datasets still fail to inspire

(Edited to add Max Ogden’s recommendation of ScraperWiki to help deal with PDF datasets.)


Courtesy of a recommendation by John Wonderlich at the Sunlight Foundation, here’s a faceted browser/catalog of state- and other-level datasets to explore: http://datos.fundacionctic.org/sandbox/catalog/faceted/

And while the tool itself is indispensable, it highlights the bane of our data-loving existence: tons of state-level data have been posted to data.gov-style sites only as PDFs.

Want to know how the Alabama liquor control board has been spending its money? You’ll have to read through forty-five 20+ page PDF’d spreadsheets: http://open.alabama.gov/frmsReport/ReportList.aspx?AppID=GFS&AgencyID=00…

How is it that this is a legal norm for open data sites? The Alabama liquor control board’s licensing and enforcement arm spent $10 million between January and June, yet you have no easy way of knowing if that’s normal, suspicious, supremely well-run, or historically wasteful without downloading 45 files with names like “364077.pdf”, finding the matching figures in each, and doing the comparisons by hand.

That’s nuts. To anyone trying to actually use the data, it suggests laziness, incompetence, or corruption.

So states: if you’re going to open your data, open your data. People can still take it by force, as developers are proving with tools like ScraperWiki, but if the point is to make it easy, eliminate the extra step of turning the doc into a PDF and just post the original spreadsheet and raw data.
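
To show how little work the comparison takes once the raw data is available, here is a minimal Python sketch. It assumes a state posts its spending as a CSV with `period`, `category`, and `amount` columns. Those column names, the helper functions, and the sample rows are all hypothetical placeholders for illustration, not real Alabama figures or an actual open.alabama.gov export.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Placeholder rows for illustration only; these are NOT real spending figures.
SAMPLE_CSV = """period,category,amount
2010-H1,Salaries,"1,000.00"
2010-H1,Travel,250.00
2011-H1,Salaries,"1,100.00"
2011-H1,Travel,400.00
"""


def totals_by_period(csv_text, period_field="period", amount_field="amount"):
    """Sum spending per period from CSV text (column names are assumptions)."""
    totals = defaultdict(Decimal)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Drop dollar signs and thousands separators before parsing the amount.
        amount = row[amount_field].replace("$", "").replace(",", "").strip()
        totals[row[period_field].strip()] += Decimal(amount)
    return dict(totals)


def percent_change(totals, earlier, later):
    """Percent change in total spending between two periods."""
    return (totals[later] - totals[earlier]) / totals[earlier] * 100


if __name__ == "__main__":
    totals = totals_by_period(SAMPLE_CSV)
    for period in sorted(totals):
        print(period, totals[period])
    print("change (%):", percent_change(totals, "2010-H1", "2011-H1"))
```

With a spreadsheet or CSV, answering "is this year's spending normal?" is a few lines of code. With 45 PDFs, the same question first requires a scraping project.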