With the Virginia State Board of Elections starting to provide bulk campaign finance data, a whole new world of data has opened up, and I intend to make the most of it.
Although the esteemed Virginia Public Access Project has long provided this information (laboriously cleaned up and displayed in a user-friendly fashion), it’s useful only to end users. There’s no API, no bulk downloads, etc., so it’s not possible for that data to be incorporated into Richmond Sunlight, Virginia Decoded, iOS apps, etc. That’s not a knock on VPAP—their line of business is providing this information to end users, period.
My normal instinct is to create a website that gathers and displays this data and, by the way, provides bulk downloads and an API. (For example, see Richmond Sunlight’s API guide and downloads directory, or Virginia Decoded’s downloads directory (the API is in alpha testing now).) But the website is, in this instance, unnecessary. VPAP is doing a better job of that than I can.
Instead, I intend to provide the tools for others to use this data. To that end, I’m developing Saberva, currently hosted on GitHub, a parser that gathers the data from the SBE’s servers, cleans it up, and exports it all to a MySQL database. (“Saber” as in Spanish for “to know,” and “VA” as in Virginia.) At first it’ll just be a program that anybody can run to get a big, beautiful pile of data, but I intend to provide bulk downloads (as MySQL and CSV) and an API (probably just as JSON). Slowing things down somewhat is the fact that I’m writing this in Python, a programming language that I know well enough to muck around in other people’s code, but not nearly well enough to write something of my own from scratch. This seems like the chance to learn it, and I think that Python is the right language for this project.
Awkwardly (for me), I’m learning this new language out in the open, on GitHub. GitHub, for those non-programmers, is a source code sharing website, for folks who, like me, develop software collaboratively. Every change that I make—every new line of code, every mistake—is chronicled on the project’s GitHub page. The tradeoff is that others can contribute to my code, making improvements or correcting my errors. Open government hacker Derek Willis has already forked Saberva, replacing and improving my laborious CSV parsing processes with Christopher Groskopf’s excellent csvkit.
Right now, Saberva will download the data for a single month (April), clean it up a bit, save a new CSV file, and create a file to allow it to be imported into a MySQL database. I’ve got the framework for something useful, and now it remains to be made genuinely useful.
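Here’s a rough sketch of what the clean-up and export steps could look like. This is not Saberva’s actual code: the column names, the sample rows, the `contributions` table name, and every function name here are made up for illustration, and the SBE’s real files surely have different columns and more mess than this.

```python
import csv
import io
import re


def normalize_header(name):
    """Turn a header like ' Committee Name ' into 'committee_name'."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def clean_csv(raw_text):
    """Return (headers, rows) with normalized headers and trimmed cells.

    Rows in which every cell is empty are dropped.
    """
    reader = csv.reader(io.StringIO(raw_text))
    headers = [normalize_header(h) for h in next(reader)]
    rows = []
    for row in reader:
        cells = [cell.strip() for cell in row]
        if any(cells):
            rows.append(cells)
    return headers, rows


def to_csv(headers, rows):
    """Serialize the cleaned data back to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return out.getvalue()


def create_table_sql(table, headers):
    """Build a naive MySQL CREATE TABLE statement (every column is TEXT)."""
    columns = ",\n".join(f"  `{h}` TEXT" for h in headers)
    return f"CREATE TABLE `{table}` (\n{columns}\n);"


# Hypothetical stand-in for one month's download.
SAMPLE = (
    "Committee Name , Amount,Date Received\n"
    " Friends of Example ,  250.00 ,2012-04-03\n"
    ",,\n"
    "Example PAC,1000.00, 2012-04-17 \n"
)

headers, rows = clean_csv(SAMPLE)
cleaned = to_csv(headers, rows)
schema = create_table_sql("contributions", headers)
```

Treating every column as `TEXT` is a shortcut to keep the sketch short. A real schema would want proper numeric and date types.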
If you’re handy with Python, and you know your way around Git, I hope you’ll consider lending a hand, even just cleaning up a few lines of code or adding a bit more functionality. Lord knows I could use the help.
Waldo, can you translate into common-man English what you are doing? When I go to VPAP as an “end user”, what am I not able to do that you are now going to provide?
What does “bulk download” mean?
Can you give an example scenario of what someone could find out with your tool that they could not find out with VPAP?
Does Saberva cure baldness, heartburn, or erectile dysfunction?
Realistically, there is little Waldo (and collaborators) are doing at the moment that you (or I) will likely use. That doesn’t mean it isn’t important, however.
There are three basic steps in the process right now. First, the SBE puts out really ugly data in large quantities. VPAP currently handles steps two and three in the process, of converting that data into manageable chunks, and then doing something with it (in this case, presenting it nice and tidy so that you can see how much your neighbor gave to a campaign). That’s cool, as that’s what most people want. But let’s say I came up with another use, like integrating it into a larger app of mine for campaign work. I would have to build the campaign app, but *also* build the converter that VPAP has already done to make SBE’s data manageable. Far fewer people in this arrangement go ahead and pursue that path.
Waldo (and collaborators) are building an open-source converter, in essence. So that instead of having to handle steps two and three myself, I can now just focus on step three: useful presentation of data.
To risk oversimplifying the issue, the status quo is that each inventor has to also design and implement a set of power lines to get power from the plant to their device. Waldo and crew are building a set of free power lines, thus increasing the true accessibility for anyone who wants to do something with the power the plant puts out.
Oh, I’m not minimizing what Waldo is up to one bit. I will take it at face value that once it’s done it will be just as important (and elegant) as Richmond Sunlight and Virginia Decoded.
I just can’t wrap my little mind around what it does…
Larry, I’ll try a metaphor, because that’s my default method of explaining anything. :) The SBE is providing tons of raw earth. VPAP is sifting through that to find the gold, extracting it, melting it down, and giving away jewelry that they make out of it. I want to give others the tools to extract and melt down their own gold.
In non-metaphorical terms, the SBE is providing very large spreadsheets of campaign finance data, spreadsheets that are kind of messy (with no descriptions of which columns contain what, unnecessary data contained in some of those columns, etc.), with information spread across dozens of spreadsheets. I’m working on creating a tool that will take those messy spreadsheets, combine them all into one big record, clean up the inconsistencies, and let them be loaded into a database. What will people do with that data? I have no idea. And that’s the point! Other people are smarter than me. They’ll have better ideas than I will have. So rather than building a completed product, I want to give people the tools to build their own completed products.
As Joe said, this is a tool that will be useless to 99.99% of people. There can’t be one person in a thousand that visits VPAP who would have the foggiest notion of what to do with Saberva. Will anybody in Virginia use this? It’s entirely possible that I will be the only person. But my wager is that I won’t be, that at least a few other people will find that this is easy enough to use that they should give it a whirl. I hope that’s right.
That’s a help… so you’re basically automating some of the clean-up of the data, and apparently there is more data than what VPAP is extracting to the end-user level.
but I’m still not entirely clear on what that other data is…. it sounds very vague….
specifics..might be what is needed for hard heads (like me).
In terms of it being useful to others.. well.. I still don’t understand exactly what data …. but I’m suspecting that it will be as valuable as GA bills and Decoded laws.
The value of what you and VPAP have done is woefully underappreciated by most… Only legislators, their staff, and SBE and their staff really know just what a nightmare the data is, and essentially what you (and VPAP) have done is “massage” it into… information… valuable information… totally on the cheap… without an esoteric bureaucratic process; you’ve encapsulated it inside of a computer program and database.
You’ve undoubtedly saved thousands of trees and killed dozens of potential government jobs in the process!
The big challenge with VPAP, beyond manually dealing with both electronic and paper data, is timeliness: the information becomes exceptionally useful late in the election campaigns, because that’s often when last-minute big money tries to outrun disclosure before the election.
There’s really not more data, at least not that I know about. VPAP isn’t holding out on us. :)
I’ll give you an example. I heard from a legislator the other day, who wanted to know where on Richmond Sunlight he could see every time that a legislator had failed to vote. Well, that’s not a feature of the site, and it’s not going to be. (I mean, I guess it could be some day, but I don’t think it’d be useful to many people, and most people would draw inaccurate conclusions from that data, by not understanding what they were looking at.) With a closed-data model, that’s the end of the road: there’s no way for somebody to get that data. With Richmond Sunlight, though, I give away the raw data. Anybody can download raw vote data from the site, load it into a spreadsheet, sort by vote, and delete every “yea,” “nay,” and “abstain,” leaving only the times that a legislator didn’t vote. That’s pretty great, because it lets people do things that never would have occurred to me to allow them to do, or things that I don’t have the time or the interest to add to the site.
That data is on Richmond Sunlight, but not as a single lump in one place…except as a bulk download. I want to create bulk downloads of campaign finance data, so that this sort of analysis can be done by anybody with a copy of Microsoft Excel or MySQL.
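For anybody who would rather script the missed-vote question than work it out in a spreadsheet, here’s a rough sketch in Python. The column names (`legislator`, `bill`, `vote`) and the one-letter vote codes are assumptions made up for this example, so check them against the actual download before trusting the output.

```python
import csv
import io
from collections import Counter

# Hypothetical vote codes; the real download may use different ones.
CAST_VOTES = {"Y", "N", "A"}  # yea, nay, abstain

# Made-up sample standing in for a bulk download of vote data.
SAMPLE = """legislator,bill,vote
Smith,HB1,Y
Smith,HB2,X
Jones,HB1,N
Jones,HB2,A
Smith,HB3,X
"""


def missed_votes(csv_text):
    """Count rows per legislator whose vote is not yea, nay, or abstain."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["vote"].strip().upper() not in CAST_VOTES:
            counts[row["legislator"]] += 1
    return counts


missed = missed_votes(SAMPLE)
```

With the sample above, `missed` only has an entry for Smith, since every one of Jones’s rows is a yea, nay, or abstain.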
Okay… same data… different ways to slice/dice it…
SBE must, at some point, convert paper filings to electronic, eh?