Richmond Sunlight needs some money.

A couple of years ago I created a website, Richmond Sunlight, that has been rather successful (if I may brag briefly) in reshaping how Virginians understand, follow, interact with, and affect Virginia’s General Assembly. While hundreds of thousands of people have found the site very useful, I look at it and see unfulfilled promise. I want to rewrite Richmond Sunlight and give it away to a nonpartisan political group in every state in the union. I want to complete the API so that anybody can write software to interact with the Capital Sunlight in their own state (and let even more newspapers integrate it into their own websites). I want a Facebook application, I want daily podcasts, I want people recording secret subcommittee votes, I want to mash up the daily floor calendar with campaign finance data with minutes with video and create the most radical transparency a state legislature has ever seen.

But I have two hurdles. And I need help to get over them, help that is beyond the resources that can be provided by the Virginia Interfaith Center (the nonpartisan nonprofit to which I gave the site upon its launch). It’s my hope that somebody reading this can help me get over those hurdles.

These are the two things that would make the difference in this project:

  1. A Mac Pro.
  2. Lobbyists to cover the cost of my time to give away what was to be a paid section of the site.

Each warrants explanation, of course.

A Mac Pro

I have a dual-core 1.66GHz Intel Core Duo Mac mini with 2GB of RAM. It’s a marvelous little computer. But it cannot possibly keep up with what I need to do, and I just can’t justify the $2,500 outlay for a system that can.

For every day that the legislature is in session, I have to do the following with the video:

  1. Rip the DVD.
  2. Convert the VOB to a master MP4, using MPEG Streamclip.
  3. Convert the MP4 into a QuickTime file of a reasonable size for people to download, using QuickTime. (My system is doing this at this very moment.)
  4. Upload the QuickTime file to Google Video.
  5. Convert the MP4 into a smaller podcastable-sized MP4.
  6. Extract the audio and save as an MP3.
  7. Excerpt every 120th frame from the video (using ImageMagick) and build up an array of hundreds of screen captures.
  8. Crop each frame down to isolate the top corner of the frame.
  9. Re-crop each frame down to isolate the bottom quarter of the frame.
  10. OCR all of those cropped images to check for bill numbers (from the latter) and speaker names (from the former) that are overlaid on the video for viewers.
  11. Take the text resulting from that OCR and compare it to an array of known bill numbers and legislators; store all matches.
  12. Reduce the listing of matched legislators and bill numbers to time spans and store those, creating a record of who spoke when about what for how long.
  13. Use the timestamps to chop up the video twice—once by bill being discussed, and once by legislator—to create dozens or hundreds of individual units of video, using MPlayer.
  14. Upload all of this to the web server (this takes ~3 hours).

And I have to do that twice: once for the house and once for the senate. In the meantime, 24/7, my desktop computer—my only computer—is so slow as to be useless for anything else. I’ve filled up three hard drives with video. This is stupidly laborious. But I feel strongly that it needs to be done, which is why I surrendered my life last year between December and February; I’m facing the same prospect this year, too. The biggest time-suck is the amount of time required for each of these conversion and extraction steps. My Mac mini just isn’t fast enough for this to be feasible. Every day I fall farther and farther behind. So I’ve just had to end the process after uploading the video to Google Video; the remaining steps simply don’t get done. Which is a shame, because that’s where it gets interesting, that’s where transparency becomes radical transparency. That’s the part where I look at Richmond Sunlight and see unfulfilled promise.
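Steps 11 and 12 in the list above are the least obvious, so here is a minimal sketch of how they might work. It assumes the video runs at roughly 30 frames per second (so every 120th frame is about four seconds apart), that the OCR output arrives as one string per sampled frame, and that a match is a plain case-insensitive substring test. `FRAME_STEP`, `FPS`, `Span`, `match_frames`, and `to_spans` are names invented for this illustration, as are the sample bill number and legislator names; none of this is the code Richmond Sunlight actually runs.

```python
from dataclasses import dataclass

FRAME_STEP = 120  # every 120th frame is sampled (step 7)
FPS = 30          # assumed frame rate; the post does not state one


@dataclass
class Span:
    label: str    # a bill number or a legislator's name
    start: float  # seconds from the start of the video
    end: float


def match_frames(ocr_texts, known_labels):
    """Step 11: list (sample_index, label) for each known label seen in a frame's OCR text."""
    matches = []
    for index, text in enumerate(ocr_texts):
        haystack = text.upper()
        for label in known_labels:
            if label.upper() in haystack:
                matches.append((index, label))
    return matches


def to_spans(matches, max_gap=2):
    """Step 12: merge nearby sightings of the same label into time spans.

    Two sightings join the same span when their sample indexes differ by at
    most max_gap; the default of 2 tolerates one sample where OCR missed it.
    """
    seconds_per_sample = FRAME_STEP / FPS
    runs = {}   # label -> [first_sample, last_sample] of the run still open
    spans = []

    def close(label):
        first, last = runs.pop(label)
        spans.append(Span(label, first * seconds_per_sample,
                          (last + 1) * seconds_per_sample))

    for index, label in sorted(matches):
        if label in runs and index - runs[label][1] > max_gap:
            close(label)
        if label in runs:
            runs[label][1] = index
        else:
            runs[label] = [index, index]
    for label in list(runs):
        close(label)
    return sorted(spans, key=lambda s: (s.start, s.label))


# Hypothetical OCR output for six sampled frames; the fourth has a misread bill number.
texts = ["", "HB 1234 Del. Smith", "HB 1234 Del. Smith",
         "HB 1Z34 Del. Smith", "HB 1234 Del. Jones", ""]
for span in to_spans(match_frames(texts, ["HB 1234", "Smith", "Jones"])):
    print(span)
```

The resulting spans are exactly what step 13 needs: each one gives a start time and an end time for a clip, labeled by bill or by legislator.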

I need to step up to a Mac Pro to get this done.[1] And, really, I can’t justify buying it myself—I put a lot of time into this site, but I need to draw the line at this kind of money. My hope is that there’s somebody out there—some business with a commitment to open government or some grant-making organization—who can just drop a Mac Pro into my lap and say “have fun.”

Lobbyists to cover the cost of my time to give away what was to be a paid section of the site.

Last fall I spent hundreds of hours building a new feature for Richmond Sunlight: Photosynthesis. The idea is to let people pick which bills they want to track, and keep up only with those. They can take notes on the bills, and those notes appear on the page below every bill, so that everybody can share in that insight. But Photosynthesis is really just a baby version of what I was building: Photosynthesis Pro. That’s a paid version of Photosynthesis, basically Lobbyist-in-a-Box, but a) not horrible and b) cheaper. It comes complete with e-mail alerts, multiple portfolios, public/private toggles, the ability to include video and audio, and highly-flexible, constantly-updating queries, meaning that a lobbyist could (for example) receive an e-mail every time a bill that mentions Fairfax County is filed in the house, or be notified when any bill filed by a Democrat that’s crossed over into the senate has been killed in subcommittee. My wife was distinctly unhappy with me spending every evening and every weekend on this for almost three months, but I assured her that it would be great—at $500/subscriber/year, it would be cheaper than the General Assembly’s system, much better, and I would split the proceeds with the Virginia Interfaith Center.
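The “constantly-updating queries” are easiest to picture as saved filters that get checked against every new bill event. The sketch below is purely illustrative: the `BillEvent` fields, the `SavedQuery` class, the `alerts_for` helper, and the sample bill are all assumptions made up for this example, not Photosynthesis Pro’s actual schema or code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BillEvent:
    number: str
    origin: str   # chamber where the bill was filed: "house" or "senate"
    chamber: str  # chamber currently considering it
    party: str    # patron's party, e.g. "D" or "R"
    action: str   # e.g. "filed" or "killed in subcommittee"
    text: str     # bill summary or full text


@dataclass
class SavedQuery:
    name: str
    matches: Callable[[BillEvent], bool]


def alerts_for(event: BillEvent, queries: List[SavedQuery]) -> List[str]:
    """Return the names of every saved query that the new event satisfies."""
    return [q.name for q in queries if q.matches(event)]


# The two example queries from the paragraph above, written as predicates.
QUERIES = [
    SavedQuery(
        "House bills mentioning Fairfax County",
        lambda e: e.action == "filed" and e.origin == "house"
        and "fairfax county" in e.text.lower(),
    ),
    SavedQuery(
        "Democratic crossover bills killed in Senate subcommittee",
        lambda e: e.origin == "house" and e.chamber == "senate"
        and e.party == "D" and e.action == "killed in subcommittee",
    ),
]

# A hypothetical event; the bill number and text are invented.
event = BillEvent("HB 0000", "house", "house", "R", "filed",
                  "Allows Fairfax County to adjust its meals tax.")
print(alerts_for(event, QUERIES))
```

Each name returned would correspond to one e-mail alert; because the queries are re-run on every event rather than once, the results stay current without the subscriber doing anything.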

Here’s the thing—I don’t like selling things. I like giving things away. I really don’t like the idea of such a great tool being available only to people who can throw around $500/year on this sort of thing (I know I couldn’t justify it). And I especially don’t like the notion of having to be a salesman and customer relations rep. I have a full-time job that I like very much; there is no time for such things. Photosynthesis Pro is 95% finished. I want to give it away. But I have a wife to answer to and, honestly, I want to make some money back for what has been an enormous amount of hard work.

My hope is that one or more lobbying firms are willing to pay to have this feature completed. (Unless, again, there’s some grant-making organization all about open government that is down with this.) In exchange, they would get something better than Lobbyist-in-a-Box (LIAB), for a price that would quite likely be less than their existing LIAB subscription, and they’d be enabling anybody else to use it for free. I’d love to talk with any firm interested in covering this, or willing to go in with some other firms.

I’m a lousy fundraiser—it feels too much like busking—but give me the right candidate or the right cause, and I’m not about to hold back. Richmond Sunlight is that cause. That’s why I was so happy to receive $2,500 from the Sunlight Foundation last year, the grant that made it possible to purchase this year’s video from the General Assembly. (I haven’t yet figured out how to pay for next year’s session.) If you are the right person to fulfill either of these needs, contact me. If you know somebody who might fit the bill, I hope you’ll pass along this blog entry to them. Trying to raise thousands of dollars via a blog post is a little unusual, I guess, but it’s all I’ve got.

1. EC2, incidentally, won’t do it; the 1GB MP4s are too big to be slinging around the internet, and Linux OCR software isn’t good enough to work, unfortunately.

Published by Waldo Jaquith

Waldo Jaquith (JAKE-with) is an open government technologist who lives near Charlottesville, VA, USA.

8 replies on “Richmond Sunlight needs some money.”

  1. You mention that EC2 won’t work because (in part) the files involved are too large — have you considered contacting Amazon AWS about this issue? I’m sure they could come up with some very interesting solutions. :-)

    Of course, the OCR is the real problem. However, I recall that the New York Times used Amazon’s EC2 to OCR all their old pre-electronic archives. I wonder what they used for that?

  2. “You mention that EC2 won’t work because (in part) the files involved are too large — have you considered contacting Amazon AWS about this issue? I’m sure they could come up with some very interesting solutions. :-)

    “Of course, the OCR is the real problem.”

    I’ve never gotten a response from an e-mail to Amazon—I don’t think they’re likely to start now. :) If I could solve the OCR problem (I do that in Windows under Parallels right now, using ABBYY FineReader), then it might actually be worth the upload time—then I could do everything, end-to-end, in EC2. Where I live, I can get 10Mbps ADSL, but the price is another $30/month over my existing 1.5Mbps connection. But now that I check, I see that only gets me 896kbps upload, which is almost twice as fast as my theoretical upload speed now, but it’s not real hot.

    “However, I recall that the New York Times used Amazon’s EC2 to OCR all their old pre-electronic archives. I wonder what they used for that?”

    I wondered the same thing when I read about that. I never did find out. It was Derek Gottfrid who headed that up for the Times—I’ll e-mail him and ask him, if I can manage to track down his address. Thanks for the suggestion!

    FWIW, the stuff I’m trying to OCR is pretty straightforward. A screen capture might look like this:

    And in this case, there’s no bill number, but there is a legislator name, which I detect (thanks to the shade of blue), hack out, and increase the contrast like such:

    As OCRing goes, that’s really not bad. But you can see why it’s important that I use large, high-quality video: the bigger the text, and the less noise around it, the easier it is to OCR it. Googling around this evening, I discovered Cuneiform and Ocrad, two Linux OCR packages that I don’t recall seeing before. I’ll have to give those a whirl.

    It’s this process that brings about my two biggest technical obstacles. The first is the size of the video, and the second is the processing time. DreamHost just chokes on the amount of processing that’s required to do all of these transformations, and they just kill the process, no matter how fiercely I renice it. Which is why I need to do all of this on the desktop or, better still, offload it to EC2. EC2 or no, though, I’ll never get around the need to locally rip video, and the quality of video generated by QuickTime (rather than mencoder) is so much higher that I really need to use it to generate the MP4 and the MOV.
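To put a rough number on the file-size obstacle, the sketch below estimates how long a single 1GB MP4 would take to upload at the 896kbps figure quoted earlier in this thread. It assumes decimal units (1GB = 10^9 bytes, 1kbps = 1,000 bits per second) and a link that sustains its rated speed with no protocol overhead, so real transfers would be slower. `upload_hours` is a name made up for this example.

```python
def upload_hours(size_bytes, link_bits_per_second):
    """Ideal transfer time in hours: bits to send divided by bits per second."""
    return size_bytes * 8 / link_bits_per_second / 3600


ONE_GB = 10**9  # assumed decimal gigabyte

print(round(upload_hours(ONE_GB, 896_000), 2))  # 2.48
```

That is about two and a half hours per master file under ideal conditions, and there is one such file for each chamber on every day of the session. Halving the link speed doubles the time.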

  3. At work I have an 8-core Mac Pro with 10 gigs of RAM for video editing. I can attest that processing and rendering video is a snap. We ended up ordering the Mac from Apple with 2 gigs and saved tons of money by ordering the remaining 8 from a third-party manufacturer.

    Good luck.

  4. Waldo, I generally consider you something of a good-natured, sharp-tongued ass, but I must compliment you on Richmond Sunlight. It is an amazing accomplishment and a model for states everywhere. From reading all of your constant whining I would never have known you had something that big and profound in you.

    Great Job. I love it.

  5. “At work I have an 8-core Mac Pro with 10 gigs of RAM for video editing. I can attest that processing and rendering video is a snap. We ended up ordering the Mac from Apple with 2 gigs and saved tons of money by ordering the remaining 8 from a third-party manufacturer.”

    Oooh, just as I’d hoped. :) Ten gigs—my lord, that’s a lot of memory (hence, good instinct going third party on those). May I some day have the chance to find out for myself how well those eight cores crunch video. :)

  6. “Waldo, I generally consider you something of a good-natured, sharp-tongued ass, but I must compliment you on Richmond Sunlight. It is an amazing accomplishment and a model for states everywhere. From reading all of your constant whining I would never have known you had something that big and profound in you.

    “Great Job. I love it.”

    *Laugh* Thanks, Halsey…I think. :) I guess you could say I’ve earned my sharp-tongued whining. :)

  7. Halsey, I enjoy your comments here. How about sponsoring this noble goal, or at least the Mac Pro? You’ve got to know some people at Apple (of course, Jobs might be upset at you, taking his bad boy of Silicon Valley title and all). I have never seen so much democracy available for such a small price.

    A better investment than most of the politicians you have probably given to. What do you say?

  8. FWIW, neither Cuneiform nor Ocrad is up to par. Ocrad is inferior to both gocr and Tesseract, and Cuneiform is crude and sketchy. Silvercoders OCR, the lone commercial offering I can find, might be worth trying, but the fact that they don’t list any prices on their website strikes me as a bad sign. I may be doomed to continue doing this on my desktop.

Comments are closed.