How I OCR hundreds of hours of video.

One of the features that I’m most pleased with on Richmond Sunlight is the integration of video. It’s one thing to put up chunks of video for people to paw through, but it’s another to automatically index it so that people can be directed to just the parts of the video that interest them. That opens the whole affair up, making it a great deal more useful. The process by which I do that is technologically interesting, I think, so I want to explain it.

The process is, narratively, pretty straightforward. I take a screenshot every few seconds, and then I use optical character recognition on the regions of the screen that contain text identifying the speaker and the current bill. That speaker name and bill number are stored in the database, along with the timestamp at which they appear, and that leaves me with a big listing of every bill, every legislator, and every time that they spoke.

But the specifics are where it gets fun. Let’s go through it again, step by step, in more detail.

Rip the Video

I get video from the legislature on DVD, one or two for each day. They go for $10/disc. (This year’s discs are paid for by these good people who bought them, one day at a time!) I put these in my computer, one by one, and pull off the video. I use a free program called “DVD Backup” (which I can find no evidence of anywhere on the internet anymore) to extract the files from the DVD and store them on a dedicated 1TB FireWire drive. This process takes just a few minutes per disc.

Convert the Video

After trying many programs over the years, I have settled on using MPEG Streamclip to turn the DVD files into H.264 MP4s. This takes hours for each DVD. The resulting files are generally between 400MB and 800MB apiece, though they can top 2GB or, on really brief days, come in at just 200MB.

Upload the Video

All of these videos have to get onto the internet, and that means uploading them to the Richmond Sunlight server. My DSL’s upload speed is nothing to write home about and, as such, I can only upload a few videos a day. In the meantime, our home internet connection isn’t good for much else.

Take Screenshots

Once the video is on the server (donated by Blue Ridge InternetWorks), it’s time to take screenshots. This is done with MPlayer. I have it play through the video and save a screenshot every few seconds. If it’s Senate video, a screenshot is saved every 60 frames (or two seconds), and if it’s House video, it’s every 150 frames (or five seconds). That’s because the House video production team keeps the chyrons up for the entire time that a bill is being discussed or a legislator is speaking, while the Senate video production team apparently relishes flashing them for as little time as possible. (“Chyron”? Vocab time! This is the text that you see on TV, such as during a newscast, where it’s used to identify the speaker. The Chyron Corporation came up with the idea of putting graphics on TV screens, rather than filming paper cards, and their name has become synonymous with graphics overlaid on video. Chyrons are also known as “lower thirds.”) Senate chyrons stick around for as little as two seconds, and average around three. This can take half an hour or an hour to run and, when it’s done, I’ve got a directory full of JPEGs, anywhere from one to four thousand of them. I do this like so:

mplayer -vf framestep=60 -framedrop -nosound video.mp4 -speed 100 -vo jpeg:outdir=video

Selecting a screenshot more or less at random, what gets output are files that look like this:

Just to be careful, I use the brilliant ImageMagick at this point (and, in fact, for the next few steps) to make sure that the screenshots are at the size that I need them to be: 642 by 480.

for f in *.jpg; do mogrify -resize 642x480 $f; done

Extract Chyrons

From every one of these frames, I need to cut out the two areas that could contain chyrons. I say “could” because I don’t, at this point, have any idea if there’s a chyron in any of these screenshots. The point of these next couple of steps is to figure that out. So I use ImageMagick again, this time to make two new images for each screenshot, one of the area of the image where a bill number could be located, and one of the area where the speaker’s name could appear. The House and the Senate put these in different locations. Here is how I accomplish this for the House:

for f in *.jpg; do convert $f -crop 312x57+158+345 +repage -compress none -depth 8 $f.name.tif; done
for f in *.jpg; do convert $f -crop 150x32+428+65 +repage -compress none -depth 8 $f.bill.tif; done

Instead of a few thousand images, now I have three times as many. The bill chyron images look like this:

And the speaker chyron images look like this:

Determine Chyron Color

Now I put these chyrons to work. As you can see in the above screenshot, chyron text has a background color behind it. In the Senate, it’s maroon, and in the House, it’s blue. This is good news, because it allows me to check for that color to know if either the bill or the legislator chyron is on the screen in this screenshot. Again with ImageMagick I take a one pixel sample of the image, pipe it through the sed text filter, and save the detected color to a file. This is done for every single (potential) chyron image:

for f in *.tif; do convert $f -crop 1x1+1+1 -depth 8 txt:- | sed -n 's/.* \(#.*\)/\1/p' > $f.color.txt; done

And that means that yet another few thousand files are in my screenshot directory. Looking through each of those text files will tell me whether the corresponding JPEG contains a chyron or not. For the example bill chyron image, the color is #525f8c; for the speaker chyron image, it’s #555f94. (Those are hexadecimal triplets.) It is possible that a similar shade of red or blue happens to be on that very spot on the screen, so I can get some false positives, but it’s rare and, as you’ll see, not problematic. At this point, though, I still haven’t peered into those files, so I have no idea what’s a chyron and what’s just a random sliver of a screenshot.

Optimize Chyrons for OCRing

At this point I do something lazy, but simple. I optimize every single (potential) chyron image to be run through optical character recognition (OCR) software and turned into text. If I wanted to be really parsimonious, I would do this after I’d identified which images really are chyrons, but ImageMagick is so fast that I can convert all of these thousands of images in just a few seconds. I convert them all to black and white, dropping out almost all shades of gray, like this:

for f in *.tif; do convert $f -negate -fx '.8*r+.8*g+0*b' -compress none -depth 8 $f; done

That leaves the bill chyrons looking like this:

And the speaker chyrons looking like this:

OCR the Chyrons

Still without knowing which of these images are really text-bearing chyrons, I run every one of them through the free, simple, and excellent Tesseract OCR software. I have tried every Unix-based OCR package out there, subjecting them to rigorous testing, and nothing is nearly as good as Tesseract. This spits out a small text file for each image. Any file that has a chyron will have its text recorded more or less faithfully. Any file that doesn’t have a chyron, Tesseract will still faithfully attempt to find words in, which usually amounts to spitting out nonsense text. The OCRing is done, simply, like this:

ls *.tif | xargs -t -i tesseract {} {}

To recap, we have a screenshot every few seconds, two potentially chyron-bearing images cropped out of each of those screenshots, a text file containing the background color of every one of those potential chyrons, and, for every potential chyron, another text file containing its OCRd text.

Identify and Save the Chyrons

At this point it’s all turned over to code that I wrote in PHP. It iterates through this big pile of files, checking to see if the color is close enough to the appropriate shade of red or blue and, if so, pulling the OCRd text out of the file containing it and loading it into a database. There is a record for every screenshot, containing the screenshot itself, the timestamp at which it’s been recorded, and the text as OCRd.
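
For the sake of illustration, here’s a minimal sketch of what that loop looks like. This is not the actual Richmond Sunlight code: the database table and columns, the tolerance, and the Senate reference color are all invented (the House blue is just the sample color from the chyron above), and the timestamp math assumes the House’s 150-frame step.

<?php

/*
 * A sketch only: the table name, columns, tolerance, and the Senate color
 * below are assumptions, not the production values.
 */
$db = new PDO('mysql:host=localhost;dbname=richmondsunlight', 'user', 'password');

$reference = array(
    'house'  => array(0x52, 0x5f, 0x8c),  // the sample House blue from above
    'senate' => array(0x6a, 0x1f, 0x2f),  // a placeholder maroon
);
$chamber = 'house';
$tolerance = 40;        // maximum per-channel difference; tuned by eye
$framestep = 150;       // 60 for Senate video
$fps = 29.97;

foreach (glob('video/*.tif.color.txt') as $color_file) {

    // The sampled background color is a hex triplet, e.g. "#525f8c".
    $hex = trim(file_get_contents($color_file));
    if (!preg_match('/^#([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})/i', $hex, $rgb)) {
        continue;
    }

    // Is it close enough to the chamber's chyron background color?
    $difference = max(
        abs(hexdec($rgb[1]) - $reference[$chamber][0]),
        abs(hexdec($rgb[2]) - $reference[$chamber][1]),
        abs(hexdec($rgb[3]) - $reference[$chamber][2])
    );
    if ($difference > $tolerance) {
        continue;
    }

    // The OCRd text sits alongside, e.g. 00000042.jpg.bill.tif.txt.
    $ocr_file = str_replace('.color.txt', '.txt', $color_file);
    if (!is_readable($ocr_file)) {
        continue;
    }
    $text = trim(file_get_contents($ocr_file));

    // Derive the timestamp from the screenshot number: MPlayer numbers the
    // JPEGs sequentially, and each one is $framestep frames after the last.
    preg_match('/(\d+)\.jpg/', $color_file, $frame);
    $seconds = round(($frame[1] - 1) * $framestep / $fps);

    // Bill chyron or speaker chyron?
    $type = (strpos($color_file, '.bill.') !== false) ? 'bill' : 'speaker';

    $stmt = $db->prepare('INSERT INTO video_index (screenshot, time_offset, type, raw_text)
        VALUES (?, ?, ?, ?)');
    $stmt->execute(array(basename($color_file, '.color.txt'), $seconds, $type, $text));
}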

I also use MPlayer’s -identify flag to gather all of the data about the video that I can get, and store all of that in the database. Resolution, frames per second, bit rate, and so on.
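
The parsing of that output looks something like the sketch below. The invocation here is an approximation, modeled on the usual “identify without playing anything” flags, rather than my actual script.

<?php

/*
 * A sketch of pulling metadata out of MPlayer's -identify output; the
 * exact mplayer flags are an approximation, not necessarily mine.
 */
$video = 'video.mp4';
$output = shell_exec(
    'mplayer -identify -frames 0 -vo null -ao null '
    . escapeshellarg($video) . ' 2>/dev/null'
);

// -identify emits machine-readable KEY=VALUE lines prefixed with ID_.
preg_match_all('/^ID_([A-Z0-9_]+)=(.*)$/m', $output, $matches, PREG_SET_ORDER);

$metadata = array();
foreach ($matches as $match) {
    $metadata[strtolower($match[1])] = $match[2];
}

// Now $metadata holds entries like video_width, video_height, video_fps,
// video_bitrate, and length, ready to be stored alongside the video record.
print_r($metadata);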

The chyron that I’ve been using as an example, for Del. Jennifer L. McClellan, OCRd particularly badly, like this:

Del. Jennifer L i\1cCie1ian
Richmond City (071)

Spellcheck

Although Tesseract’s OCR is better than anything else out there, it’s also pretty bad, by any practical measurement. A legislator who speaks for five minutes could easily have their name OCRd fifty different ways in that time. Not helping matters, each chamber refers to legislators in ways that the General Assembly never uses anywhere else. Sen. Dave Marsden is mysteriously referred to as “Senator Marsden (D) Western Fairfax.” Not “Dave” Marsden: unlike anywhere on the legislature’s website, he doesn’t get a first name. And “Western” Fairfax? The legislature never describes his district that way anywhere else. So how am I to associate that chyron content with Sen. Marsden?

The solution was to train the matching. I make a first pass over the speaker chyrons and calculate the Levenshtein distance for each one, relative to a master list of all legislators with their names formatted similarly, and match any that are within 15% of identical. I make a second pass and see if any unresolved chyrons are identical to any past chyrons that have already been matched. And I make a third pass, basically repeating the second, only this time calculating the Levenshtein distance and accepting anything within 15%. In this way, the spellcheck gets a little smarter every time that it runs, and does quite well at recognizing names that OCR badly. The only danger is that two legislators with very similar names will represent the same municipality, and the acceptable ranges of misspellings of their names will overlap enough that the system won’t be able to tell them apart. I keep an eye out for that.
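
Here’s a rough sketch of those three passes, with invented data structures standing in for the database: $legislators maps a legislator ID to their name as it’s formatted in chyrons, and $known_chyrons maps previously resolved chyron text to a legislator ID.

<?php

/*
 * A sketch of the three-pass matching, not the production code. Both
 * arrays stand in for database tables.
 */

function close_enough($a, $b, $threshold = 0.15)
{
    // "Within 15% of identical": the edit distance may be at most 15%
    // of the length of the name we're matching against.
    return levenshtein(strtolower($a), strtolower($b)) <= ceil(strlen($b) * $threshold);
}

function match_chyron($raw, array $legislators, array $known_chyrons)
{
    // Pass 1: fuzzy-match against the master list of legislator names.
    foreach ($legislators as $id => $name) {
        if (close_enough($raw, $name)) {
            return $id;
        }
    }

    // Pass 2: an exact match against chyrons that have been resolved before.
    if (isset($known_chyrons[$raw])) {
        return $known_chyrons[$raw];
    }

    // Pass 3: a fuzzy match against those same past chyrons.
    foreach ($known_chyrons as $past => $id) {
        if (close_enough($raw, $past)) {
            return $id;
        }
    }

    return false; // unresolved; set it aside for manual review
}

// For example, a badly OCRd Marsden chyron still resolves:
$legislators = array(12 => 'Senator Marsden (D) Western Fairfax');
var_dump(match_chyron('Senator Marsden (Dl Western Falrfax', $legislators, array()));  // int(12)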

Put the Pieces Together

What I’m left with is a big listing of every time that a bill or speaker chyron appeared on the screen, and the contents of those chyrons, which I then tie back to the database of legislators and bills to allow video to be sliced and diced dynamically based on that knowledge. (For example, every bill page features a highlights reel of all of the video of that bill being discussed on the floor of the legislature—here’s a random example—courtesy of HTTP pseudo-streaming). This also enabled some other fun things, such as calculating how many times each legislator has spoken, how long they’ve spoken, which subjects get the most time devoted to them on the floor, and lots of other toys that I haven’t had time to implement yet, but plenty of time to dream up.

Everything between uploading the video and the spellcheck is handled by a single shell script, which is to say that it’s automated. And everything after that is done with a PHP script. So all of these steps are actually pretty easy, and require a minimal amount of work on my part.

And that’s how a video gets turned into thousands of data points.

Published by Waldo Jaquith

Waldo Jaquith (JAKE-with) is an open government technologist who lives near Charlottesville, VA, USA.

7 replies on “How I OCR hundreds of hours of video.”

  1. That’s great! Now, any ideas on automatic transcription?

    Don’t I wish! At this point, transcription is only good enough to provide keywords, for people searching. Every year, around November, I conduct a search for any new technologies that will make transcription feasible. We get closer every year, but it’s just not there yet. I’m optimistic that Google will provide an API for their own voice transcription service (you can see it in action on YouTube), which isn’t good enough to provide a transcript of floor sessions, but it’d sure improve search.

    Very educational and potentially useful to many.

    That’s really what I’m hoping to get across here—that this is a pretty straightforward assembly of open source programs, and doesn’t require a CS degree to reproduce. I’ve been describing this arrangement at various gatherings of geeks over the years, and people are always impressed at how retrospectively obvious it all is. :)

  2. Yes, I spent some time last year trying to do automated transcription of recordings using CMU Sphinx speech recognition, and we’re definitely far from that point. Of course, I thought Tesseract OCR was similarly nearly useless, but it did work for your limited application, so Sphinx could be usable in a situation with a very limited vocabulary.

  3. I’m very impressed. No particular one step is so amazing, but the vision to piece it all together and automate much of it is quite impressive. I’m sure others will benefit from this description.

  4. Now, just cross your fingers and hope the GA people don’t mess up the chyrons!

    Has that happened much? At all?

  5. Yes, I spent some time last year trying to do automated transcription of recordings using CMU Sphinx speech recognition, and we’re definitely far from that point

    I think Sphinx is a good example of the fundamental challenge of speech-to-text transcription: the necessity of having a corpus of transcribed text is an insurmountable obstacle for many applications. I’d love to use Sphinx on Richmond Sunlight; that would be amazing. But for it to have even a token degree of accuracy, I’d first need to manually transcribe many hours of audio. And if I had those kinds of resources, well, then I wouldn’t need Sphinx, would I? :)

    Now, just cross your fingers and hope the GA people don’t mess up the chyrons!

    Has that happened much? At all?

    It does happen, sometimes, but not often. Generally, they’re quite good. Any mistakes are way more likely to be my fault than the GA’s staff’s fault!
