Digital audio content authentication.

I posted this over at Advogato, but it seems worth reproducing here. It’s a rough idea, but one that I’d like to spread around a bit.

As record labels become more aggressive about propagating false MP3s of songs (creating a file of the same size in bytes, the same length in minutes and seconds, and the same title, but containing garbage), it is inevitable that file-sharing networks will come up with a method of fighting back.

I propose the use of a partial-match authentication system. I’ll say right up front that I know virtually nothing about this concept, but I believe it’s likely to be altogether achievable. Thus far, tracking of audio and images, Digimarc-style, has involved embedding a digital fingerprint in the file. This is good when the original creator of the information makes use of it, but when data has multiple points of origin (i.e., many people ripping and sharing tracks), such a system is of no use for authenticating the data.

Instead, it would be more desirable to derive a unique string for a song based not merely on the track length and the name, but on the actual content of the song. If the actual music data in a track can be boiled down to a short string, perhaps somewhere in the realm of 64 bytes, it will enable comparisons between tracks for purposes of determining whether or not they match. This is not a digital signature in the traditional sense, as nothing is ever applied to the file in the first place. It’s simply an extraction from the data. We’ll call this the authentication string. This string will need to be constructed in such a manner that two strings that are extremely similar are likely to come from extremely similar versions of the same audio file. A song that is encoded once at 192kbps and once at 128kbps should produce very similar authentication strings.
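To make the idea a little more concrete, here is a rough Python sketch of what deriving and comparing authentication strings might look like. Everything in it is an assumption on my part: the names authentication_string and similarity are made up, the input is assumed to be already-decoded audio samples, and the "did the energy go up or down from one slice of the song to the next" scheme is just a toy stand-in for whatever a real system would use.

```python
FINGERPRINT_BITS = 512  # 64 bytes, the size suggested above


def authentication_string(samples):
    """Derive a 64-byte string from decoded audio samples (toy scheme).

    The song is cut into 513 equal slices. Each bit records whether a
    slice is louder (has more energy) than the slice before it.
    """
    n_frames = FINGERPRINT_BITS + 1
    frame_len = len(samples) // n_frames
    if frame_len == 0:
        raise ValueError("need at least %d samples" % n_frames)
    energies = [
        sum(s * s for s in samples[i * frame_len:(i + 1) * frame_len])
        for i in range(n_frames)
    ]
    bits = 0
    for prev, cur in zip(energies, energies[1:]):
        bits = (bits << 1) | (1 if cur > prev else 0)
    return bits.to_bytes(FINGERPRINT_BITS // 8, "big")


def similarity(a, b):
    """Fraction of bits on which two authentication strings agree."""
    if len(a) != len(b):
        raise ValueError("authentication strings must be the same length")
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (8 * len(a))
```

The point of comparing bit by bit, rather than checking for an exact match the way an ordinary checksum would, is that two encodings of the same song should disagree on only a handful of bits, while a garbage file should agree with the real song on only about half of them.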

Now, this authentication string is not useful on its own. If Gnutella were modified to generate this data for every shared track, the information would be meaningless without a data source to compare it to. This is where a trust metric, of sorts, comes into play. Gnutella clients would generate this information for each track and, rather than storing it in an ID3 tag, store that data separately from the tracks. The servers (and perhaps the clients) would build up a database of the authentication strings for songs spotted on the network. This stateful database would track previously-spotted authentication strings for an MP3, along with a voting-style system of currently-available MP3s, and perhaps even weight various authentication strings based on the total number of files shared by the owner and other, similar criteria. Whatever the nature of that trust metric, it would obviously have to be set up in a manner that would prevent the RIAA from poisoning the well.
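Here is an equally rough sketch of the voting side of things, again in Python and again built entirely on my own assumptions. The SightingDatabase class, its method names, the 0.9 match threshold, the one-vote-per-peer rule, and the logarithmic weighting by number of files shared are all hypothetical. In particular, this simple weighting would not stop a determined party from poisoning the well; it only shows the shape of the bookkeeping.

```python
import math
from collections import defaultdict


def bit_agreement(a, b):
    """Fraction of matching bits; 0.0 if the strings differ in length."""
    if len(a) != len(b) or not a:
        return 0.0
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (8 * len(a))


class SightingDatabase:
    """Remembers which authentication strings peers have reported per title."""

    def __init__(self, match_threshold=0.9):
        self.match_threshold = match_threshold
        # title -> {peer_id: (auth_string, weight)}
        self._votes = defaultdict(dict)

    def report(self, title, auth_string, peer_id, files_shared):
        # One vote per peer per title: a repeat report replaces the old one.
        # Peers sharing more files count for more, with diminishing returns.
        weight = math.log1p(files_shared)
        self._votes[title][peer_id] = (auth_string, weight)

    def tally(self, title):
        """Group near-identical strings and rank the groups by total weight."""
        groups = []  # each entry is [representative_string, total_weight]
        for auth_string, weight in self._votes.get(title, {}).values():
            for group in groups:
                if bit_agreement(group[0], auth_string) >= self.match_threshold:
                    group[1] += weight
                    break
            else:
                groups.append([auth_string, weight])
        ranked = sorted(groups, key=lambda g: g[1], reverse=True)
        return [(g[0], g[1]) for g in ranked]

    def is_trusted(self, title, auth_string):
        """True if the string is close to the top-ranked one for this title."""
        ranking = self.tally(title)
        if not ranking:
            return False
        return bit_agreement(ranking[0][0], auth_string) >= self.match_threshold
```

A client would call report for each track it sees on the network and is_trusted before bothering to download one. Grouping near-identical strings matters because two honest rips of the same song will rarely produce exactly the same authentication string.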

I have no idea if somebody has already come up with a system like that. I obviously only know about the concepts behind this in the loosest of terms, so I’m not of any use in the development of a system of this nature. But I do expect that, short of some sort of data-authentication system being put into place, file-sharing systems will be spammed into oblivion by the recording industry.

Published by Waldo Jaquith