Google and Viacom have reached an agreement meant to ease privacy concerns about YouTube records being handed over to Viacom through a court order. However, there remain some questions about how exactly the “anonymizing” of these records will actually work. Until those are answered, I wouldn’t breathe a complete privacy sigh of relief yet.
Google was ordered to hand over YouTube viewing records earlier this month as part of Viacom’s copyright infringement lawsuit against the company. Privacy concerns were raised immediately. Viacom responded that it didn’t want all the information the court had granted it rights to, and Google said it was hopeful the data could be anonymized in some way.
The agreement addresses this as follows:
When producing data from the Logging Database pursuant to the Order, Defendants shall substitute values while preserving uniqueness for entries in the following fields: User ID, IP Address and Visitor ID. The parties shall agree as promptly as feasible on a specific protocol to govern this substitution whereby each unique value contained in these fields shall be assigned a correlative unique substituted value, and preexisting interdependencies shall be retained in the version of the data produced.
OK, the idea here is that by replacing the “real” IP address or user ID information, it won’t be possible to know the “real” identity of those who did searches. Unfortunately, changing to “fake” information still doesn’t solve the privacy concerns entirely.
In the case of AOL search data that was released in 2006, all the associated user information with those records was also anonymized. But because each individual still had the same unique “fake” address, it was possible to see all the queries done by an “anonymous” user. That activity profile in a few cases made it possible to guess at the real person doing the searches.
Consider this example. Let’s say part of a YouTube log originally looks like this:
- May 5, 2008 – User: juliefielding – Search: “Battlestar Galactica”
- May 5, 2008 – User: juliefielding – Video Watched: Battlestar Galactica, Season 2, Episode 5
- May 5, 2008 – User: juliefielding – Search: “Julie Fielding Homecoming”
- May 5, 2008 – User: juliefielding – Video: “Julie Fielding’s Homecoming Party”
OK, this is GREATLY simplified (see Google Anonymizing Search Records To Protect Privacy for a more real-life example of how logging works). But you can see how the logs show how someone with the user account of juliefielding (who probably is a “real” Julie Fielding) has done searches and watched particular videos.
Now let’s say we “anonymize” the user name like this:
- May 5, 2008 – User: dskw92qw4 – Search: “Battlestar Galactica”
- May 5, 2008 – User: dskw92qw4 – Video Watched: Battlestar Galactica, Season 2, Episode 5
- May 5, 2008 – User: dskw92qw4 – Search: “Julie Fielding Homecoming”
- May 5, 2008 – User: dskw92qw4 – Video: “Julie Fielding’s Homecoming Party”
Now “juliefielding” has become the anonymous user “dskw92qw4,” so supposedly we can’t identify her. However, we do know everything this user has watched, and if they’re watching something called “Julie Fielding’s Homecoming Party,” we might assume they’re connected with Julie Fielding. Moreover, if we have a long pattern of viewing (and possibly searching) history, we might be better able to guess at who the person is. Imagine a person who is constantly watching videos that they themselves have uploaded, for example.
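To make the point concrete, here is a minimal Python sketch of the kind of substitution described above. It is only an illustration under assumptions: the actual protocol had not been agreed, and the row layout, the `pseudonymize` and `profiles` helpers, the second user and the random-token scheme are all hypothetical. It shows that when each unique user gets one unique substitute, every row for that user can still be grouped back into a single activity profile.

```python
import secrets
from collections import defaultdict

# Hypothetical, greatly simplified log rows: (date, user, action, detail).
LOG = [
    ("2008-05-05", "juliefielding", "search", "Battlestar Galactica"),
    ("2008-05-05", "juliefielding", "watch", "Battlestar Galactica, Season 2, Episode 5"),
    ("2008-05-05", "juliefielding", "search", "Julie Fielding Homecoming"),
    ("2008-05-05", "juliefielding", "watch", "Julie Fielding's Homecoming Party"),
    ("2008-05-05", "someoneelse", "watch", "Some Other Video"),
]


def pseudonymize(rows):
    """Swap each distinct user for one random token, reused on every row.

    This is an assumed scheme, not the parties' protocol. It preserves
    uniqueness: same user -> same token, different users -> different tokens.
    """
    mapping = {}
    used = set()
    out = []
    for date, user, action, detail in rows:
        if user not in mapping:
            token = secrets.token_hex(5)
            while token in used:  # regenerate on the rare collision
                token = secrets.token_hex(5)
            used.add(token)
            mapping[user] = token
        out.append((date, mapping[user], action, detail))
    return out


def profiles(rows):
    """Group activity by (substituted) user, as a recipient of the data could."""
    by_user = defaultdict(list)
    for _date, user, action, detail in rows:
        by_user[user].append((action, detail))
    return dict(by_user)


anonymized = pseudonymize(LOG)
rebuilt = profiles(anonymized)

for token, events in rebuilt.items():
    print(token, events)
```

The real user name never appears in the output, yet one token still carries the full list of searches and views, including the video title that points back to a real person.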
I asked Google about this. Why wasn’t the agreement worked out to drop user information entirely, especially as it still doesn’t seem necessary to achieve Viacom’s overall goal of simply estimating how much infringing content may be viewed overall? As I wrote in my earlier piece:
If you only want to know the percentage of “infringing videos” that are watched versus “non-infringing” ones, then you only need a record of the videos requested. You don’t need IP addresses of those requesting them. You certainly don’t need the very personally-identifiable information of those who are logged in and watching them.
Saul Hansell at the New York Times raises the same issue in his write-up today:
I’m not entirely sure what Viacom will get out of all this. No doubt they will be able to prove that lots of people uploaded clips of material from MTV, Comedy Central and other Viacom properties and that lots of people watched them. You don’t need server logs to show that.
In response, Google said that the exact anonymizing protocol hadn’t been worked out and it highlighted this part of the agreement:
The parties agree that they shall not engage in any efforts to circumvent the encryption utilized pursuant to Paragraph 1 this Stipulation. This Paragraph does not limit in any way any party’s rights under Paragraph 8 below.
Neither response solves the concerns. Yes, how exactly substitution of user information will be done is yet to be worked out, but the underlying principle that each unique user will be given a unique alternative identity is not being challenged — and so individual user profiles can still potentially be worked out.
As for “encryption,” what encryption? I suppose this is meant to say that Viacom won’t try to build profiles to then work out who an anonymous user might really be. But if it won’t do that, then why hand over even anonymous user info? The company doesn’t need this, and handing over the information at all remains a privacy threat, agreement or not. Records get lost. Records get into the wrong hands.
Agreements are all well and good, but the most secure way to protect privacy is not to hand out information that isn’t needed. While building up user profiles as I’ve described is not a major threat to the vast majority of YouTube users, it’s still a concern. And since Viacom doesn’t need even anonymized user data, I’d hope the two parties get together once again and drop it entirely.
Give Viacom records of what was watched on YouTube, sure. But no, don’t give the company records about who watched this material, “anonymized” or not.
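As a rough sketch of what “what was watched” without “who watched” could look like, the Python below drops the identity fields before anything is counted and then computes the share of views that went to flagged videos. Everything in it is an assumption for illustration: the row layout, the video IDs, the `CLAIMED` set of allegedly infringing videos, and the helper names are hypothetical and are not either party’s method.

```python
from collections import Counter

# Hypothetical raw rows: (user_id, ip_address, video_id).
VIEWS = [
    ("usera", "10.0.0.1", "clip1"),
    ("userb", "10.0.0.2", "clip2"),
    ("usera", "10.0.0.1", "clip1"),
    ("userc", "10.0.0.3", "clip3"),
]

# Hypothetical set of video IDs a plaintiff claims are infringing.
CLAIMED = {"clip1"}


def strip_identity(rows):
    """Keep only the video ID from each row; user and IP are discarded."""
    return [video_id for _user, _ip, video_id in rows]


def infringing_share(video_ids, claimed):
    """Fraction of all views that went to videos in the claimed set."""
    counts = Counter(video_ids)
    total = sum(counts.values())
    flagged = sum(n for vid, n in counts.items() if vid in claimed)
    return flagged / total if total else 0.0


watched_only = strip_identity(VIEWS)
share = infringing_share(watched_only, CLAIMED)
print(f"{share:.0%} of views were of claimed videos")
```

Because the user and IP columns are gone before the data is handed over, there is nothing left to group into a per-person profile, while the overall percentage is unchanged.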
For more, see discussion on Techmeme.