Anonymizing Google’s Server Log Data — How’s It Going?
Back in March 2007, Google promised to begin anonymizing log data to better protect user privacy. That kicked off a wave of privacy pledges from competing search engines. In addition, by agreeing to limit itself, Google inadvertently got the European Union to demand even faster data destruction. Below is a look at Google’s progress toward its initial 18-month anonymization plan, the 9 months it recently agreed to, and that question about cookies: are they really deleted or not?
Question: When did you start anonymizing data, and what’s been changed so far?
We began implementing our 18 month anonymization policy this year and will apply it retroactively. We started anonymizing in May 2008, and we’re continuing to anonymize as data comes in. Although we had originally planned to begin anonymization in January 2008, we started anonymizing in May 2008 because we were working on better ways of anonymizing cookies and believed it was important to get it right the first time. Also, pending litigation has slowed our efforts and has required us to keep a set of data to meet our legal obligations. To date, we have anonymized the unauthenticated user search logs, ads logs, and dozens of other logs used by our engineers for search quality, spam-detection, and other product and security work that are older than 18 months, and have completed anonymization back to December 2006. We are continuing the work to anonymize all search logs and now commit to do so after 9 months.
Question: So what period of time is currently anonymized?
Data between December 2006 and March 2007 has currently been anonymized.
Question: You’ll work “forward” from March 2007 as more data reaches its 18 month birthday, correct? Does this happen each day — on the day log data turns 18 months old, does it get anonymized?
Question: How about going “backwards” from the current December 2006 date? How much of this older data do you process? Any estimate on when all the log data you possess will be anonymized?
Each day, we go forward. When we go backward, it may be faster than one day at a time.
Question: What’s being removed from the IP addresses? The last three digits?
We are removing the last octet of the IP address. In other words, we put zeros into the last eight bits of a 32-bit IP address. Technically speaking, there can be one to three digits in the last octet, when it is written in decimal notation.
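To make the octet math concrete, here is a minimal Python sketch of zeroing the last eight bits of a 32-bit IPv4 address. It only illustrates the operation described in the answer above and is not Google's code; the function name anonymize_ipv4 and the sample addresses are hypothetical.

```python
import ipaddress


def anonymize_ipv4(address: str) -> str:
    """Return the address with its last octet (lowest 8 bits) set to zero."""
    ip = ipaddress.IPv4Address(address)
    # 0xFFFFFF00 keeps the upper 24 bits and clears the lower 8.
    masked = int(ip) & 0xFFFFFF00
    return str(ipaddress.IPv4Address(masked))


if __name__ == "__main__":
    # Addresses from the documentation ranges, used purely as examples.
    for sample in ("192.0.2.7", "192.0.2.123", "198.51.100.45"):
        print(sample, "->", anonymize_ipv4(sample))
```

Whether the last octet was written with one, two or three decimal digits, the result is the same: every address in the same block of 256 collapses to a single value ending in .0.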
Question: Microsoft has said it will anonymize the entire IP address, rather than the last three digits. Why don’t you do the same?
We decided prior to any other search provider that we would remove the last octet of the IP address. The decision to remove only the last octet was to balance the utility of having 24 bits of address and the ability to provide approximate geo location based on it against the need for improving privacy.
Question: You’ve said that you will start anonymizing data after 9 months. Why not switch that on now? Don’t you just have to do what you’re doing now, just sooner?
Anonymization requires a great deal of operational planning and work, and we are working on a better algorithm to extract non-identifiable information.
Question: So you’re extracting data from the logs? What exactly is being recorded about IP addresses before they are anonymized? Is Google storing a general location elsewhere (City? State? Other level?)? And what’s hoped to be recorded as part of the new process?
Yes. In some cases, we may retain non-identifiable information that can help us improve services, such as general locations based on an IP-geo mapping. This is the same kind of information that is used to serve geographically relevant ads such as when someone searches for [Italian restaurants] and is delivered ads for the general vicinity of their IP address, when it is available.
Question: What’s happening to cookies: anonymized or deleted? If anonymized, is a unique cookie replaced with an anonymous but still unique replacement for each instance? In other words, say my cookie was “ABC” on my computer and thus “ABC” in Google’s logs each time I made a request. Is Google going to change it to something like “GHB,” so that each instance of “ABC” becomes “GHB”? Or will each instance of “ABC” get replaced with a different code (GHB, GYT, UIE, etc.)?
In the 18 month anonymization process, we anonymize cookie identifiers by transforming the identifiers with an HMAC (keyed hash function) — using a randomly generated, ephemeral key for each day of logs, destroyed immediately after anonymization. This assures that it is impossible for Google to associate the cookie when it appears again with the old anonymized cookie, and it further assures that two anonymized cookies originating from the same cookie in two different days cannot be matched by Google. This process is stronger than creating “anonymous” identifiers by using a one-way cryptographic hash function; using a one-way hash may prevent the extraction of the cookie from its scrambled version, but it allows easy association of a given cookie with a given “anonymous” one and allows matching of all instances of a cookie.
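The keyed-hash scheme described in that answer can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Google's implementation: the answer does not say which hash function or key length is used, so SHA-256 and a 32-byte key are my choices, and the function name anonymize_day_of_cookies is hypothetical.

```python
import hashlib
import hmac
import secrets


def anonymize_day_of_cookies(cookie_ids):
    """Replace each cookie ID in one day's logs with an HMAC of that ID.

    A fresh random key is generated per call (one call per day of logs).
    The key is never stored or returned, so it is gone once the function
    returns. SHA-256 and the 32-byte key size are assumptions.
    """
    key = secrets.token_bytes(32)
    return [
        hmac.new(key, cookie.encode("utf-8"), hashlib.sha256).hexdigest()
        for cookie in cookie_ids
    ]


if __name__ == "__main__":
    day_one = anonymize_day_of_cookies(["ABC", "ABC", "XYZ"])
    day_two = anonymize_day_of_cookies(["ABC"])
    print(day_one[0] == day_one[1])  # same cookie, same day
    print(day_one[0] == day_two[0])  # same cookie, different days
```

In this sketch, every instance of “ABC” within one day maps to the same replacement, but the replacement on another day is unrelated because the key differs. A plain unkeyed hash such as hashlib.sha256(b"ABC") would instead give the same output every day, which is the weakness the answer points to.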
Question: What’s the situation with backup data? Are there offline copies of logs kept? If so, do you destroy this data as well?
Yes, the anonymization policy applies to backups on tapes. Anonymized logs will be re-taped in conformance with internal data security and data protection (reliability) practices. Pre-anonymized log data will be erased shortly after the 18 or 9 months (as applicable) from the collection date.
Question: What’s the view on the “European Privacy Seal” run by EuroPriSe, which meta search engine Ixquick announced last July it was the only search engine to receive? I assume Google has no plans to get one, but could you? Would you qualify?
Google has no plans to get the privacy seal. We comply with the FTC Fair Information Practice Principles and annually certify our practices under the US Safe Harbor program. On privacy matters, we have found engaging directly with our users and paying attention to their needs more fruitful than relying on a third party standardized seal program.
Question: Regarding the YouTube data given to Viacom, I was concerned that usernames were being replaced with anonymous names but names that still stayed unique for each individual user, which could still allow for a revealing profile to be created. What’s the situation? Is it being anonymized in the same way you do cookie data?
As previously reported, Viacom has agreed to the anonymization of the data and we are continuing our discussions with them about the specific implementation.
One of the things that kicked off this revisit to how the anonymization program has been going was Google’s announcement last month that it would shorten the anonymization time period, which was followed soon after by a News.com article that got some buzz saying “Google will not delete or anonymize user cookies from the logs” and thus declaring the entire program as simply “little more than snake oil” for PR purposes.
As you can see, cookies are being anonymized. Google’s cookie also lasts only two years, compared to Yahoo, which tosses out four, the longest lasting for 29 years, and Microsoft, which tosses out 11, the longest running for 12 years.
Just Snake Oil?
As for the snake oil accusation, I disagree. The program is destroying data as many privacy groups and advocates have wanted. If there’s a “snake oil” element, my frustration is that those groups — and the EU over this past year — have focused too much on worries about IP addresses and cookies when there is far more personally identifiable information retained. As I wrote in April 2007:
I’m actually pretty annoyed at some of the privacy advocacy groups. When Google announced it would anonymize server data last month, I still saw some old school concerns that fairly anonymous cookie data and IP addresses were a privacy concern. C’mon — you want to be concerned about something, you get concerned about the fact Google has — and is growing — real honest-to-goodness personally identifiable profiles of individual searchers. And if you want to get concerned about that, also get concerned that Yahoo and Microsoft have similar profiling — just not as visible to the searcher.
This leads to another issue — that only “unauthenticated” logs are anonymized. If you’re logged into Google in any way, then it records your search activity and does nothing to destroy this from server logs.
Sometimes this makes sense. For example, if you’ve enrolled in Google Web History, then having your searches automatically be destroyed after nine months will be pretty annoying. After all, the point of the program is to record them for you. It’s what you’ve explicitly asked Google to do.
In contrast, if you’ve logged into Gmail and are NOT using Web History, it sounds like your data will still be retained if you do a search. This is because you will be an authenticated user, someone who is signed in, so that data won’t be removed.
If that’s the case (and I’ll check on it), it should be changed. All search log data should be anonymized regardless of authentication, unless a searcher has specifically opted in for retention. Even then, my assumption is that Web History databases are maintained separately from the log data, so log data could still be anonymized.
As for data kept through programs like Gmail, Google Talk or Web History, it’s still difficult to know what exactly gets deleted and when. Does it go poof right away? Does it come out of offline backups?
More Detailed Privacy Info Needed
A fairly recent development to help with this is the Google Privacy Center, which consolidates pages from various programs into one place. It’s a huge improvement from the confusing situation I reviewed last year.
Still, say you decide to kill your web history account. You’re told:
OK, when you say “removed,” here’s what I want to know:
- When will it be removed?
- Is it removed from server logs as well as a stored database?
- Will it be removed from all backups?
With Gmail, you get more specific information:
Such deletions or terminations will take immediate effect in your account view. Residual copies of deleted messages and accounts may take up to 60 days to be deleted from our active servers and may remain in our offline backup systems.
What tends to happen with these policies for search engines, as I’ve watched them over the years, is that things get added as specific worries are raised. For instance, years ago Yahoo had (and probably still does have) a big page about “web beacons” as people freaked out about those. But a page on search privacy? No one was concerned, so no page emerged.
With Gmail, soon after it launched, people wondered if data was being immediately deleted. Google was asked, and it turned out not right away. That got a lot of attention, so the policy was expanded. With Web History, if no one asks, then the policy doesn’t get more detailed.
Where’s My Privacy Dashboard?
Hopefully, we’ll see the policies get even clearer over time — plus users will get even more control over what they want to delete. I still like John Battelle’s idea of a privacy dashboard, and I’ll go back to what I wrote about that last June:
Figuring out where all my data resides and how to kill it is a pain — at Google or Microsoft or Yahoo, for that matter. John Battelle had a good suggestion back in early 2006 for a sort of private data control panel that could show you exactly what was stored where and put the user in control:
I bet 95% of the public will never edit, or even view the data more than once. But the sense that the control panel is there, just in case, will be invaluable to establishing trust.
We could use that more than ever. Google especially could use that, if it wants to stop the privacy attacks or at least stem them. How about it? I asked Google’s global privacy counsel Peter Fleischer about this yesterday, when talking to him about the Privacy International survey.
“We’re thinking hard internally along the digital dashboard-type of approach. Is there a way to give users a dashboard and visibility to all these elements and give them control,” he said. “It would be hugely complicated to build, but in terms of that vision, I completely share it, and we’re having deep discussions about it.”
I’m still not happy with the situation over YouTube data that’s being given to Viacom. When the news came out about this court order, people quickly realized that Viacom was getting usernames associated with viewing records that could represent a real privacy violation.
Pledges that the anonymization will protect things aren’t enough. We have to know exactly what’s happening. Unless usernames are destroyed — or anonymized in a way so that you cannot rebuild a profile for any particular “anonymous” user — then Google should fight the demand forcefully. And Viacom should step up and be very clear that usernames won’t be needed or accepted if there’s a one-to-one anonymous replacement.
I feel like the entire issue is being shunted to the side, and I’m amazed that nearly three months later, I still can’t get a clear answer on this. For further background, please see Hold On — Issues Remain Over Google & Viacom’s Deal On YouTube Viewing Privacy.
And The Others?
As I said, Google’s move to destroy data kicked off a wave with the other major search engines. Microsoft is notable in saying it would destroy ALL cookie information and IP data, rather than just anonymizing cookies and destroying part of the IP address, as Google is doing (you can read Microsoft’s statements here, here and here – all PDFs). I’m not sure if this has happened. Even if so, the issue of what’s happening with logged-in data remains the bigger concern to me. I’ll be checking on both.
Postscript: Regarding the question about Web History, Google’s now told me:
The Web History databases are maintained separately from the search logs data. Users control what goes into their Web History through their Google Account settings. They can also view and delete Web History entries. Once deleted, the entry is permanently removed from the Web History database. When a user has “paused” their Web History, no entries are written into the Web History database.
Google also said that all log data is scrubbed, even if a user is logged in:
It is also true that we anonymize all logs regardless of whether user is logged in or not.
When a user is logged into a Google Account using a service like Gmail or Web History, Google web search servers log queries without any personally identifying information. At the same time, a user’s Web History database entry–a completely separate system from our log system–is updated to save a query.
In other words, your Google Account ID, or the email address equivalent, is not written into the search logs. Likewise, your non-authenticated user ID from the PREF cookie is not written into your Google Web History.