Tag Archives: richmond sunlight

How I OCR hundreds of hours of video.

One of the features that I’m most pleased with on Richmond Sunlight is the integration of video. It’s one thing to put up chunks of video for people to paw through, but it’s another to automatically index it so that people can be directed to just the parts of the video that interest them. That opens the whole affair up, making it a great deal more useful. The process by which I do that is technologically interesting, I think, so I want to explain it.

The process is, narratively, pretty straightforward. I take a screenshot every few seconds, and then I use optical character recognition on the regions of the screen that contain text that identifies the speaker and the current bill. That speaker name and bill number are stored in the database, along with the timestamp at which they appear, and that leaves me with a big listing of every bill and every legislator and every time that they spoke.

But the specifics are where it gets fun. Let’s go through it again, step by step, in more detail.

Rip the Video

I get video from the legislature on DVD, one or two for each day. They go for $10/disc. (This year’s discs are paid for by these good people who bought them, one day at a time!) I put these in my computer, one by one, and pull off the video. I use a free program called “DVD Backup” (which I can find no evidence of anywhere on the internet anymore) to extract the files from the DVD and store them on a dedicated 1TB FireWire drive. This process takes just a few minutes per disc.

Convert the Video

After trying many programs over the years, I have settled on using MPEG Streamclip to turn the DVD files into H.264 MP4s. This takes hours for each DVD. The resulting files are generally between 400 and 800MB apiece, though they can top 2GB or, on really brief days, be just 200MB.

Upload the Video

All of these videos have to get onto the internet, and that means uploading them to the Richmond Sunlight server. My DSL’s upload speed is nothing to write home about and, as such, I can only upload a few videos a day. In the meantime, our home internet connection isn’t good for much else.

Take Screenshots

Once the video is on the server (donated by Blue Ridge InternetWorks), it’s time to take screenshots. This is done with MPlayer. I have it play through the video, and save a screenshot every few seconds. If it’s Senate video, a screenshot is saved every 60 frames (or two seconds), and if it’s House video, it’s every 150 frames (or five seconds). That’s because the House video production team keeps the chyrons up for the entire time that a bill is being discussed or a legislator is speaking, while the Senate video production team apparently relishes flashing them for as little time as possible. (“Chyron”? Vocab time! This is the text that you see on TV, such as during a newscast, which uses it to identify the speaker. The Chyron Corporation came up with the idea of putting graphics on TV screens, rather than filming paper cards. Their name has become synonymous with graphics overlaid on video. Chyrons are also known as “lower thirds.”) Senate chyrons stick around for as little as two seconds, and average around three. Taking the screenshots can take half an hour or an hour and, when it’s done, I’ve got a directory full of JPEGs, anywhere from one to four thousand of them. I do this like so:

mplayer -vf framestep=60 -framedrop -nosound video.mp4 -speed 100 -vo jpeg:outdir=video

Selecting a screenshot more or less at random, what gets output are files that look like this:

Just to be careful, I use the brilliant ImageMagick at this point (and, in fact, for the next few steps) to make sure that the screenshots are at the size that I need them to be: 642 by 480.

for f in *.jpg; do mogrify -resize 642x480 "$f"; done
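Each screenshot's position in the video follows from the frame step described above. Here is a minimal Python sketch of that arithmetic, not the code the site actually uses. It assumes 30 frames per second (implied by "60 frames (or two seconds)"), that screenshots are numbered sequentially starting at 1, and that the first one corresponds to the start of the video. The function names are made up for illustration.

```python
def screenshot_timestamp(index: int, frame_step: int, fps: float = 30.0) -> float:
    """Seconds into the video for the index-th screenshot (1-based).

    Assumption: screenshot 1 is the first frame, and each later screenshot
    is frame_step frames after the previous one.
    """
    if index < 1:
        raise ValueError("screenshot numbering is assumed to start at 1")
    return (index - 1) * frame_step / fps


def format_hms(seconds: float) -> str:
    """Render a number of seconds as HH:MM:SS, dropping any fraction."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Senate video uses a step of 60 frames, House video a step of 150.
print(format_hms(screenshot_timestamp(451, frame_step=60)))
```

If the real frame rate is 29.97 rather than 30, passing `fps=29.97` keeps long videos from drifting.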

Extract Chyrons

From every one of these frames, I need to cut out the two areas that could contain chyrons. I say “could” because I don’t, at this point, have any idea if there’s a chyron in any of these screenshots. The point of these next couple of steps is to figure that out. So I use ImageMagick again, this time to make two new images for each screenshot, one of the area of the image where a bill number could be located, and one of the area where the speaker’s name could appear. The House and the Senate put these in different locations. Here is how I accomplish this for the House:

for f in *.jpg; do convert "$f" -crop 312x57+158+345 +repage -compress none -depth 8 "$f.name.tif"; done
for f in *.jpg; do convert "$f" -crop 150x32+428+65 +repage -compress none -depth 8 "$f.bill.tif"; done
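The crop arguments above follow ImageMagick's WIDTHxHEIGHT+X+Y geometry form: a box of that size whose top-left corner sits X pixels from the left and Y pixels from the top. As an illustrative sketch only, this Python snippet parses such a string and checks that the box fits inside a 642 by 480 frame. The `CropBox`, `parse_geometry`, and `fits_inside` names are hypothetical, and only this one simple geometry form is handled.

```python
import re
from typing import NamedTuple


class CropBox(NamedTuple):
    width: int
    height: int
    x: int
    y: int


def parse_geometry(geometry: str) -> CropBox:
    """Parse the simple WIDTHxHEIGHT+X+Y form used in the crop commands."""
    match = re.fullmatch(r"(\d+)x(\d+)\+(\d+)\+(\d+)", geometry)
    if match is None:
        raise ValueError(f"unsupported geometry: {geometry!r}")
    return CropBox(*(int(group) for group in match.groups()))


def fits_inside(box: CropBox, frame_width: int = 642, frame_height: int = 480) -> bool:
    """True if the crop box lies entirely within the frame."""
    return box.x + box.width <= frame_width and box.y + box.height <= frame_height


for geometry in ("312x57+158+345", "150x32+428+65"):
    print(geometry, fits_inside(parse_geometry(geometry)))
```

A check like this is a cheap way to catch a crop region that has slid off the edge of the frame after a change in screenshot size.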

Instead of a few thousand images, now I have three times as many. The bill chyron images look like this:

And the speaker chyron images look like this:

Determine Chyron Color

Now I put these chyrons to work. As you can see in the above screenshot, chyron text has a background color behind it. In the Senate, it’s maroon, and in the House, it’s blue. This is good news, because it allows me to check for that color to know if either the bill or the legislator chyron is on the screen in this screenshot. Again with ImageMagick I take a one pixel sample of the image, pipe it through the sed text filter, and save the detected color to a file. This is done for every single (potential) chyron image:

for f in *.tif; do convert "$f" -crop 1x1+1+1 -depth 8 txt:- | sed -n 's/.* \(#.*\)/\1/p' > "$f.color.txt"; done

And that means that yet another few thousand files are in my screenshot directory. Looking through each of those text files will tell me whether the corresponding JPEG contains a chyron or not. For the example bill chyron image, the color is #525f8c; for the speaker chyron image, it’s #555f94. (Those are hexadecimal triplets.) It is possible that a similar shade of red or blue happens to be on that very spot on the screen, so I can get some false positives, but it’s rare and, as you’ll see, not problematic. At this point, though, I still haven’t peered into those files, so I have no idea what’s a chyron and what’s just a random sliver of a screenshot.
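Checking whether a sampled color is "close enough" to the chyron background could look something like the following Python sketch. It treats the two hex triplets as points in RGB space and compares their straight-line distance to a tolerance. The tolerance of 30 is an arbitrary illustrative value, not the threshold the site uses, and the function names are hypothetical. Because the saved text may carry extra characters after the hex value, depending on the ImageMagick version, only the first whitespace-separated token is read.

```python
import math
from typing import Tuple


def hex_to_rgb(color: str) -> Tuple[int, int, int]:
    """Convert a string such as '#525f8c' into (82, 95, 140)."""
    tokens = color.split()
    if not tokens:
        raise ValueError("empty color string")
    digits = tokens[0].lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected a six-digit hex triplet, got {color!r}")
    return (int(digits[0:2], 16), int(digits[2:4], 16), int(digits[4:6], 16))


def color_distance(first: str, second: str) -> float:
    """Euclidean distance between two hex colors in RGB space."""
    return math.dist(hex_to_rgb(first), hex_to_rgb(second))


def looks_like_chyron(sample: str, reference: str, tolerance: float = 30.0) -> bool:
    """True if the sampled pixel is within tolerance of the reference color."""
    return color_distance(sample, reference) <= tolerance


# The two example colors from the text are close neighbors.
print(looks_like_chyron("#555f94", "#525f8c"))
```

A looser tolerance catches more real chyrons at the cost of more false positives, which the later steps can absorb.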

Optimize Chyrons for OCRing

At this point I do something lazy, but simple. I optimize every single (potential) chyron image to be run through optical character recognition (OCR) software and turned into text. If I wanted to be really parsimonious, I would do this after I’d identified which images really are chyrons, but ImageMagick is so fast that I can convert all of these thousands of images in just a few seconds. I convert them all to black and white, dropping out almost all shades of gray, like this:

for f in *.tif; do convert "$f" -negate -fx '.8*r+.8*g+0*b' -compress none -depth 8 "$f"; done
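To see why that command pushes the image toward pure black and white, it helps to run the arithmetic on a single pixel. The Python sketch below mimics the two operations on channel values scaled from 0 to 1. It assumes that results above 1 are clipped to white, and that the chyron text is light-colored, which the source does not state. It is a worked example of the formula, not a replacement for ImageMagick.

```python
def ocr_prep_value(hex_color: str) -> float:
    """Gray level (0.0 = black, 1.0 = white) after -negate and the -fx formula."""
    digits = hex_color.lstrip("#")
    red, green, blue = (int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))
    # -negate inverts every channel.
    red, green, blue = 1 - red, 1 - green, 1 - blue
    # The -fx expression: blue is ignored, red and green are weighted at 0.8.
    value = 0.8 * red + 0.8 * green + 0 * blue
    # Assumption: values outside 0..1 are clipped when the image is written.
    return min(1.0, max(0.0, value))


print(ocr_prep_value("#525f8c"))  # the example blue background
print(ocr_prep_value("#ffffff"))  # pure white, as light text would be
```

Under these assumptions the blue background comes out as white and white text comes out as black, which is the dark-on-light form that OCR software tends to handle best.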

That leaves the bill chyrons looking like this:

And the speaker chyrons looking like this:

OCR the Chyrons

Still without knowing which of these images are really text-bearing chyrons, I run every one of them through the free, simple, and excellent Tesseract OCR software. I have tried every Unix-based OCR package out there, subjecting them to rigorous testing, and nothing is nearly as good as Tesseract. This spits out a small text file for each image. Any file that has a chyron will have its text recorded more or less faithfully. For any file that doesn’t have a chyron, Tesseract will still dutifully attempt to find words, which usually amounts to spitting out nonsense text. That OCRing is done, simply, like this:

ls *.tif | xargs -t -I {} tesseract {} {}

To recap, we have a screenshot every few seconds, two potentially chyron-bearing files cropped out of each of those screenshots, a text file containing the background color of every one of those potential chyrons, and another text file for every potential chyron that contains OCRd text.

Identify and Save the Chyrons

At this point it’s all turned over to code that I wrote in PHP. It iterates through this big pile of files, checking to see if the color is close enough to the appropriate shade of red or blue and, if so, pulling the OCRd text out of the file containing it and loading it into a database. There is a record for every screenshot, containing the screenshot itself, the timestamp at which it’s been recorded, and the text as OCRd.
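The original script is PHP and isn't shown here, so the following Python sketch only illustrates the general shape of that pass over the files. It relies on the file names the earlier commands produce: each cropped image `X.name.tif` or `X.bill.tif` has its color in `X.name.tif.color.txt` and its OCR output in `X.name.tif.txt`. The `collect_chyrons` function, its record layout, and the `is_chyron_color` callback are all hypothetical, and the database insert is left out.

```python
from pathlib import Path
from typing import Callable, Dict, List


def collect_chyrons(
    directory: str,
    kind: str,
    is_chyron_color: Callable[[str], bool],
) -> List[Dict[str, str]]:
    """Gather OCR text for every crop whose sampled color passes the check.

    kind is "name" or "bill", matching the suffixes used when cropping.
    """
    suffix = f".{kind}.tif"
    records = []
    for color_file in sorted(Path(directory).glob(f"*{suffix}.color.txt")):
        tif_name = color_file.name[: -len(".color.txt")]
        color = color_file.read_text().strip()
        if not is_chyron_color(color):
            continue  # background color is wrong, so no chyron here
        text_file = color_file.with_name(tif_name + ".txt")
        if not text_file.exists():
            continue  # OCR output is missing for this crop
        text = text_file.read_text().strip()
        if text:
            records.append(
                {
                    "screenshot": tif_name[: -len(suffix)],
                    "kind": kind,
                    "text": text,
                }
            )
    return records
```

Each returned record would then be paired with its timestamp and written to the database; passing the color test in as a function keeps the House and Senate shades out of this loop.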

I also use MPlayer’s -identify flag to gather all of the data about the video that I can get, and store all of that in the database. Resolution, frames per second, bit rate, and so on.

The chyron that I’ve been using as an example, for Del. Jennifer L. McClellan, OCRd particularly badly, like this:

Del. Jennifer L i\1cCie1ian
Richmond City (071)


Although Tesseract’s OCR is better than anything else out there, it’s also pretty bad, by any practical measurement. A legislator who speaks for five minutes could easily have their name OCRd fifty different ways in that time. Helping nothing, each chamber has ways of referring to legislators by which they are never referred to by the General Assembly at any other time. Sen. Dave Marsden is mysteriously referred to as “Senator Marsden (D) Western Fairfax.” Not “Dave” Marsden—unlike anywhere on the legislature’s website, he doesn’t get a first name. And “Western” Fairfax? His district municipality is never referred to as that anywhere else by the legislature. So how am I to associate that chyron content with Sen. Marsden?

The solution was to train it. I make a first pass on the speaker chyrons and calculate the Levenshtein distance for each one, relative to a master list of all legislators, with their names formatted similarly, and match any that are within 15% of identical. I make a second pass and see if any unresolved chyrons are the same as any past chyrons that were identified. And I make a third pass, basically repeating the second, only this time calculating the Levenshtein distance and accepting anything within 15%. In this way, the spellcheck gets a little smarter every time that it runs, and does quite well at recognizing names that OCR badly. The only danger is that two legislators with very similar names will represent the same municipality, and the acceptable range of misspellings of their names will get close enough that the system won’t be able to tell them apart. I keep an eye out for that.
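The three passes can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the site's PHP code. "Within 15%" is interpreted here as an edit distance of no more than 15% of the longer string's length, the comparison ignores case, and chyrons are resolved one at a time rather than in batch passes. The function names and the `learned` dictionary (mapping past chyron text to a legislator) are hypothetical.

```python
from typing import Dict, Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_match(text: str, candidates: Iterable[str], max_ratio: float = 0.15) -> Optional[str]:
    """Closest candidate whose distance is within max_ratio of the longer length."""
    best_name, best_ratio = None, max_ratio
    for candidate in candidates:
        longest = max(len(text), len(candidate))
        if longest == 0:
            continue
        ratio = levenshtein(text.lower(), candidate.lower()) / longest
        if ratio <= best_ratio:
            best_name, best_ratio = candidate, ratio
    return best_name


def resolve(text: str, roster: Iterable[str], learned: Dict[str, str]) -> Optional[str]:
    """Apply the three passes to one chyron, remembering any successful match."""
    name = best_match(text, roster)            # pass 1: fuzzy match on the master list
    if name is None:
        name = learned.get(text)               # pass 2: exact match on past chyrons
    if name is None:
        past = best_match(text, list(learned))  # pass 3: fuzzy match on past chyrons
        name = learned[past] if past is not None else None
    if name is not None:
        learned[text] = name                   # the lookup table grows with every run
    return name
```

Storing each newly resolved spelling in `learned` is what lets later runs recognize misreadings that are too far from the master list but close to an earlier misreading.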

Put the Pieces Together

What I’m left with is a big listing of every time that a bill or speaker chyron appeared on the screen, and the contents of those chyrons, which I then tie back to the database of legislators and bills to allow video to be sliced and diced dynamically based on that knowledge. (For example, every bill page features a highlights reel of all of the video of that bill being discussed on the floor of the legislature, courtesy of HTTP pseudo-streaming; here’s a random example.) This also enables some other fun things, such as calculating how many times each legislator has spoken, how long they’ve spoken, which subjects get the most time devoted to them on the floor, and lots of other toys that I haven’t had time to implement yet, but have had plenty of time to dream up.

Everything after uploading the video until the spellcheck is done with a single shell script, which is to say that it’s automated. And everything after that is done with a PHP script. So all of these steps are actually pretty easy, and require a minimal amount of work on my part.

And that’s how a video gets turned into thousands of data points.

That went better than expected.

That business of soliciting donations to buy the 2011 General Assembly session video? It took just under 46 hours for all of the money to be donated. A lot of it was donated by friends and regulars here. Larry Gross, Shaun Kenney, Jim Duncan, Craig Fifer, Kathy Mateer, Jeffrey Uphoff, Jeannine Lalonde, Bruce Roemmelt, Janis Jaquith, Susan Schoppelrey, Paul Wright, Peter Griesar, Vivian Paige, Tim Tolson, Jill Jaquith, Connie Jorgensen, Claire Chantell, Richard Martin, Sean Holihan, B G Hays, and Sheila McMillen (in honor of Hamish) all donated. (And then three more donations came in after all of the money had been raised. I don’t want to name those donors until I confirm that they still want to donate the money, which would be used in case the costs run over what’s estimated or, failing that, held for next year’s session.) It’s really great to have this project made possible by so many people who believe in the importance of this like I do.

The icing on top was the news that came just minutes after the announcement that we’d met the goal—Carl Malamud informed me that Public.Resource.Org would be providing $900 to buy the missing 2010 videos. I’m a big fan of Carl’s work, and aspire to have an impact on the scale and of the types that Public.Resource.Org does, so not only is it really wonderful to be able to buy that video, but I’m flattered that the money is coming from him.

The comparison that I made on Richmond Sunlight is apt, I think. This evening, after so many good folks finished contributing to this project, hearing from Carl was like the famous closing scene from It’s a Wonderful Life:

Ernie Bishop: Just a minute! Quiet everybody! Quiet, quiet. Now get this, it’s from London.
Ma Bailey: Oh!
Ernie Bishop: [Reading the telegram in his hand] “Mr. Gower cabled you need cash, stop. My office instructed to advance you up to twenty-five thousand dollars, stop. Hee Haw and Merry Christmas! Sam Wainwright.”

It was a good day.

Legislative video sponsorships.

Richmond Sunlight has no video of the legislature for a single day of the 2010 session, since a) the site doesn’t have a budget and b) I didn’t raise any money. So, this year, I’m soliciting sponsors for every day’s video. The average day’s House and Senate video requires buying $18 worth of DVDs from the legislature, so I’m trying to get people to underwrite a day’s video by donating $18. In exchange, their (your?) name will appear as the sponsor of that day’s video. I have no idea if this will work. Nothing ventured, nothing gained.

CNS does it right.

Although it’s true that basically no media outlets bother to mention bill numbers when writing about legislation, I really have to give credit to the always-vital Capital News Service, run by Jeff South at the VCU School of Mass Communication. Every one of their articles about legislation provides bill numbers for every single bill that they refer to, as well as other bills on the same topic. It’s no coincidence that these stories are written by students. Because their stories are distributed directly to media outlets, and not published on their website, I can’t link directly to an example, but you can download a Word file containing all of their stories from the past week.

You might think that you’ve never heard of CNS, but you have. Remember when Del. Bob Marshall claimed that women who have abortions end up having handicapped kids, due to God’s wrath? That was a CNS scoop—every other media outlet at that press conference overlooked that comment.

Memo to Virginia journalists.

Please start including bill numbers in your coverage of legislation. If you did that, then Richmond Sunlight would promote your coverage of that bill, prominently, on that bill’s page, as well as on pages about related bills. Media coverage is the only major component of the information ecosystem that simply can’t be incorporated into this legislative data structure, because the lack of bill numbers makes it impossible to know programmatically which bill an article is about. For bonus points, you could include a listing of bill numbers within the article as Dublin Core metadata at the head of the HTML, which would make the task easier still.
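To give a sense of how little it takes once the numbers are in the text, here is a hedged Python sketch that pulls bill numbers out of an article. It only recognizes the HB and SB prefixes seen elsewhere on this site (other prefixes would need adding to the pattern), and the function name and sample sentence are made up for illustration. It is not how Richmond Sunlight itself matches articles to bills.

```python
import re
from typing import List

# Matches forms like "HB778", "hb 1721", or "SB1436".
BILL_PATTERN = re.compile(r"\b(HB|SB)\s?(\d{1,4})\b", re.IGNORECASE)


def extract_bill_numbers(text: str) -> List[str]:
    """Return each distinct bill number, normalized, in order of first mention."""
    found = []
    for prefix, number in BILL_PATTERN.findall(text):
        bill = f"{prefix.upper()}{int(number)}"
        if bill not in found:
            found.append(bill)
    return found


sample = "The House passed HB778; see also hb 1721 and SB1436, and HB778 again."
print(extract_bill_numbers(sample))
```

With a list like that in hand, linking an article to the matching bill pages becomes a simple lookup.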

Also, of course, then citizens can look up the bill and learn more about it. Without that, your average citizen is probably out of luck. Writing about a bill without giving the number is like writing about a great new restaurant without bothering to mention its name.

I understand that this isn’t likely to happen with this session, but it sure would be great if it could happen by next year. Need some help? Have your programmer get in touch with me and I’m glad to take some time to talk this through.

Richmond Sunlight’s JSON API.

I’ve just released v1.0 of the Richmond Sunlight API. It’s JSON-based, simple, and straightforward. This turns Richmond Sunlight into a web-based service that allows any application or website to get data about the General Assembly automatically and basically seamlessly. The most exciting bit, I think, is the Photosynthesis API. Now any individual or organization can track a bunch of bills, and then display the status of those bills on their own website (along with their own commentary on each bill). Lots of interest groups do this sort of thing already, but they have to update the listing manually as bills advance. That’s laborious and awkward. Now jQuery and a half hour of free time is all that’s necessary for interest groups to have a slick new feature on their website to keep stakeholders posted.

Here’s hoping folks put this to work.

The Senate killed a bill to put their own voting records online. So I did it for them.

The Senate Rules Committee killed a bill today that would have put legislators’ voting records online. The House passed freshman Republican Jim LeMunyon’s HB778 overwhelmingly. But the Senate Rules Committee—overwhelmingly Democratic, incidentally—barely allowed it out of subcommittee, and then killed it on a 13-2 vote. Officially, they think it’d just be too darned hard to put that data on their website. Which, the Roanoke Times editorial board points out today, seems unlikely, given that I’ve provided that very data on Richmond Sunlight for several years now, in the form of spreadsheets downloadable from any legislator’s page on the site. Realistically, they likely killed this because they don’t want their voting records to be available for opposition research.

Anyhow, just to stick a thumb in the eye of Senate Democrats, this evening I put together an HTML version of the same data, making it easier for folks to access and for search engines to index. It took me, no kidding, about twenty minutes. (For example, here’s my senator’s 2009 voting record.) As always, every scrap of legislative data on Richmond Sunlight comes directly from the legislature’s website, so I don’t have access to any special fairy dust that the Senate doesn’t have. I’ve said it before, and I’ll say it again: I don’t care who’s in charge of the legislature, transparency is essential. Any Democrat who thinks I’m going to go easy on them had best think again.

Sen. Hurt is, in fact, the most partisan member of the senate.

I just finished adding a new feature to Richmond Sunlight—the ability to sort through legislators by a variety of attributes like location, race, sex, year they started in office, etc.—and when I was done, I found a bug. For some reason, my code was listing Sen. Robert Hurt as the most partisan Republican in the senate. And I knew that couldn’t be true, because I mentioned earlier this month that he’s the least partisan Republican in the senate, a fact that I repeated on Weekend Virginia a few days ago. After a good half hour of debugging, I realized that the fault (dear Brutus) was in myself. There was nothing wrong with my code. Hurt is, in fact, ranked as the single most partisan member of the senate.

For the curious, a quick explanation as to how I made this particular error. Partisanship is ranked within the database from 0-100, with 0 arbitrarily assigned to Democrats (meaning “this person cosponsors bills exclusively with Democrats”) and 100 assigned to Republicans. The effect of that is that the lower Democrats’ numbers, the more partisan that they are, but the higher Republicans’ numbers, the more partisan that they are. In the course of writing a blog entry about bipartisan Democrats, an offhand mention of just one Republican left me in a prime position to misread the data, using the inverse scale.
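One way to avoid that kind of misreading is to convert the raw 0-100 score into a number that always means "how partisan, relative to this person's own party." The short Python sketch below shows the idea. The function name and the single-letter party codes are assumptions for illustration, and independents are rejected because nothing here says how they should be scored.

```python
def partisan_strength(score: float, party: str) -> float:
    """Map the raw 0-100 score onto a scale where higher is always more partisan.

    Raw scale: 0 means cosponsoring only with Democrats, 100 only with Republicans.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if party == "D":
        return 100 - score  # a low raw score is strongly partisan for a Democrat
    if party == "R":
        return score        # a high raw score is strongly partisan for a Republican
    raise ValueError(f"no rule defined for party {party!r}")


# A Democrat at 5 and a Republican at 95 are equally partisan.
print(partisan_strength(5, "D"), partisan_strength(95, "R"))
```

Sorting legislators by this derived value gives one ranking that reads the same way for both parties.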

The moral of the story here is that it is far better to interpret publicly verifiable data than data that only I have access to. Not only does it make that interpretation a springboard for further exploration of the data by others, but it enables peer review, which helps make sure that cited facts are, indeed, facts.

I’ll close with a fun fact. Ignoring freshmen, for whom there’s little data just yet, the most partisan member of the General Assembly is Del. Todd Gilbert (R-Woodstock). Don’t believe me? (And who could blame you?) Look it up.

The legislature’s most prolific copatrons.

The more time that I spend mapping the social relationships of legislators via their copatroning habits, the more fascinated I am by this mechanism for exploring the General Assembly. It really is a powerful tool. (To see one of the ways I’m using it on Richmond Sunlight now, check out HB1721, SB1436, or HB2482, where you can see a graph indicating the average partisan position of each bill’s cosponsors.)

Here’s a less thoughtful usage of this data, but no less interesting: a listing of the legislators in each body and the total number of bills that they copatroned in this year’s session.

Senate
Legislator (party), bills copatroned
Patsy Ticer (D) 61
Chap Petersen (D) 55
Robert Hurt (R) 51
Richard Stuart (R) 50
John Edwards (D) 45
Jill Holtzman Vogel (R) 44
Roscoe Reynolds (D) 44
Mary Margaret Whipple (D) 43
Harry Blevins (R) 41
Creigh Deeds (D) 40
Frank Wagner (R) 39
Fred Quayle (R) 39
Walter Stosch (R) 39
Janet Howell (D) 38
Tommy Norment (R) 38
John Watkins (R) 37
Don McEachin (D) 37
Toddy Puller (D) 37
Ken Stolle (R) 36
Henry Marsh (D) 35
Emmett Hanger (R) 35
Phil Puckett (D) 34
Steve Newman (R) 33
Ken Cuccinelli (R) 33
Frank Ruff (R) 33
Louise Lucas (D) 32
Mamie Locke (D) 32
George Barker (D) 31
Ralph Northam (D) 31
Edd Houck (D) 31
Yvonne Miller (D) 30
Stephen Martin (R) 30
William Wampler (R) 28
Ralph Smith (R) 27
Mark Herring (D) 27
Dick Saslaw (D) 26
Ryan McDougle (R) 26
John Miller (D) 23
Mark Obenshain (R) 22
Chuck Colgan (D) 21

House
Legislator (party), bills copatroned
Clay Athey (R) 102
Mark Cole (R) 96
Tom Rust (R) 91
Don Merricks (R) 89
Frank Hall (D) 83
Scott Lingamfelter (R) 81
John O’Bannon (R) 81
Vivian Watts (D) 80
Ken Plum (D) 75
Joe Bouchard (D) 74
Bob Hull (D) 74
Jennifer McClellan (D) 72
Dave Marsden (D) 70
David Englin (D) 69
Bobby Mathieson (D) 68
Al Eisenberg (D) 67
Jimmie Massie (R) 67
Paula Miller (D) 67
Joe Morrissey (D) 66
Tim Hugo (R) 66
Adam Ebbin (D) 66
Beverly Sherwood (R) 65
Mark Sickles (D) 65
Steve Landes (R) 65
Jim Scott (D) 63
Mamye BaCote (D) 63
Robert Tata (R) 63
William Barlow (D) 63
Bob Brink (D) 62
Phil Hamilton (R) 62
Ken Melvin (D) 62
Chris Peace (R) 62
Algie Howell (D) 60
Dave Albo (R) 60
Sal Iaquinto (R) 59
David Bulova (D) 58
Anne Crockett-Stark (R) 58
Margi Vanderhye (D) 58
Terry Kilgore (R) 57
John Cosgrove (R) 56
Chuck Caputo (D) 56
Charles Poindexter (R) 56
Kris Amundson (D) 55
David Toscano (D) 55
Ward Armstrong (D) 54
Jeion Ward (D) 54
Albert Pollard (D) 53
Jackson Miller (R) 53
Todd Gilbert (R) 53
Danny Bowling (D) 53
Dave Nutter (R) 52
Shannon Valentine (D) 52
Onzlee Ware (D) 52
Bill Carrico (R) 52
Danny Marshall (R) 50
Morgan Griffith (R) 50
Steve Shannon (D) 50
Harvey Morgan (R) 50
Bill Janis (R) 49
Riley Ingram (R) 49
Lionell Spruill (D) 49
Manoli Loupassi (R) 48
Ed Scott (R) 48
Glenn Oder (R) 48
Lynwood Lewis (D) 48
Roslyn Tyler (D) 48
Kenny Alexander (D) 47
Barry Knight (R) 47
Johnny Joannou (D) 47
Delores McQuinn (D) 47
Ben Cline (R) 47
Matt Lohr (R) 46
Rosalyn Dance (D) 46
Chris Jones (R) 46
Sam Nixon (R) 45
David Poisson (D) 45
Brenda Pogge (R) 45
Harry Purkey (R) 45
Joseph Johnson (D) 44
Kirk Cox (R) 43
Tom Gear (R) 43
Chris Saxman (R) 42
Paul Nichols (D) 42
Kathy Byron (R) 41
Jim Shuler (D) 41
Bud Phillips (D) 40
Bill Fralin (R) 39
Lee Ware (R) 39
Bill Howell (R) 39
Jeff Frederick (R) 37
Joe May (R) 37
Charniele Herring (D) 36
Lacey Putney (I) 36
Bobby Orrock (R) 35
Frank Hargrove (R) 35
Rob Bell (R) 35
Watkins Abbitt (I) 35
Bob Marshall (R) 32
Tom Wright (R) 28
Clarke Hogan (R) 22

Sen. Patsy Ticer comes in #1 in her chamber, with 1.69 times as many bills copatroned as the average member of her chamber, while Del. Clay Athey is at the top of his half of the legislature—and the whole of the General Assembly—with an impressive 1.82 times as many bills copatroned as the average member of the House.