Google Anonymizing Search Records To Protect Privacy
Google has announced that it will now anonymize the server log data that it collects after 18 to 24 months, as a way to better protect the privacy of its users. Until now, Google has retained server log data in its original form indefinitely, which made it possible for anyone with access to those logs (such as government agencies obtaining them through legal processes) to potentially track queries back to users.
I’m going to revisit what Google collects in its server logs to explain how that can, and cannot, be used to track information back to a particular user. Then I’ll also recap some of the other places where search history is retained, since it isn’t only within server logs.
Server Log Records
When you visit any web site, the web server records certain information about your visit. Here’s a simplified view of what that might look like if you came to Google and did a search:
67.42.06.24 – 13/Mar/2007 00:44:15 – http://www.google.com/search?q=laptop+broadband – DQG4AADOkAAAAB_kWn0FCUZ15
You can see there are four segments. A server log would have more than these four bits of information, but this is enough to illustrate the key points. These are what the segments tell you:
1) IP Address
The first number on the line is the IP address someone had when visiting Google. An IP address is like an internet telephone number. You can trace the address back to who placed the “call,” so to speak.
In this case, the address can be traced back through a reverse DNS lookup tool to a named location, AUTHNS1.MPLS.QWEST.NET. If you know a bit more, the QWEST.NET part tells you that the call came from a Qwest connection. The MPLS part tells you it was from the Minnesota area.
Even with this tracing, you still don’t know the actual person — the named individual — who placed the call. To get that, you’d need to contact Qwest with the information and ask them which account was accessing Google through their servers at this time. Google itself only knows, at best, that this was a Qwest connection.
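To make the lookup step concrete, here is a minimal Python sketch. The address 192.0.2.10 comes from a range reserved for documentation and is my assumption for illustration only, not an address from Google’s logs, and both function names are hypothetical. The first function only builds the name that a reverse DNS lookup asks about. The second performs a real lookup, so it needs a network connection and may return nothing at all.

```python
import ipaddress
import socket


def reverse_lookup_name(ip: str) -> str:
    """Return the special DNS name that a reverse lookup queries for this IP."""
    return ipaddress.ip_address(ip).reverse_pointer


def provider_hostname(ip: str):
    """Ask DNS which hostname is registered for the IP (needs network access).

    Returns None when no reverse record exists or the lookup fails.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        return None


# Hypothetical example address from the range reserved for documentation.
print(reverse_lookup_name("192.0.2.10"))  # 10.2.0.192.in-addr.arpa
```

Either way, the most a lookup like this gives you is a hostname belonging to the access provider, not the name of the person at the keyboard.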
2) Date & Time
Pretty self-explanatory, this is the day and time when a request was made.
3) Query Terms
See the two words after “q=” in the address? Those are the terms that someone searched for, which get stored in what’s called referrer information. This tells us the person looked for [laptop broadband], in this case.
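As a rough sketch of how anyone holding a log line could pull the search terms back out of it, here is a short Python example using the standard urllib.parse module. The function name search_terms is my own, and I’m assuming the terms always sit in the q parameter, as they do in the example line.

```python
from urllib.parse import parse_qs, urlsplit


def search_terms(url: str) -> str:
    """Return the decoded value of the q parameter, or '' if it is absent."""
    query_string = urlsplit(url).query
    values = parse_qs(query_string).get("q", [""])
    return values[0]


print(search_terms("http://www.google.com/search?q=laptop+broadband"))
# laptop broadband
```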
4) Cookie
This is a unique code that’s assigned to a particular computer by Google. Once assigned, it allows Google to continue to know if requests came from a particular computer, even if the connection changes.
For example, say you use your laptop at home using your broadband provider. Google assigns you a unique cookie stored on the laptop. This allows you to do things like access Gmail and have all your settings saved. Now you go traveling. You use wireless in an airport. Your access provider — and thus IP address — is different. But the cookie, since it’s on your computer, stays the same. Google continues to know that your particular computer has been seen before and continues to maintain your settings.
What Google does not know is who you are as a named person, unless you’ve provided that information to them as part of some account you’ve signed up for. If so, that information is NOT stored in the server logs. It’s maintained elsewhere on Google.
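Putting the four segments together, here is a minimal Python sketch that splits a simplified line into its parts and then checks whether two visits came from the same computer. Everything in it is an assumption for illustration: the LogRecord and parse_line names, the “ – ” separator from my simplified example, the two documentation IP addresses and the second, airport-style line. Real server logs are laid out differently.

```python
from datetime import datetime
from typing import NamedTuple


class LogRecord(NamedTuple):
    ip: str
    when: datetime
    url: str
    cookie: str


def parse_line(line: str) -> LogRecord:
    """Split the simplified four-segment line on its ' – ' separators."""
    ip, when, url, cookie = [part.strip() for part in line.split(" – ")]
    return LogRecord(ip, datetime.strptime(when, "%d/%b/%Y %H:%M:%S"), url, cookie)


# Hypothetical visits: same laptop, first at home, then on airport wireless.
home = parse_line(
    "192.0.2.10 – 13/Mar/2007 00:44:15 – "
    "http://www.google.com/search?q=laptop+broadband – DQG4AADOkAAAAB_kWn0FCUZ15"
)
airport = parse_line(
    "198.51.100.7 – 14/Mar/2007 09:12:40 – "
    "http://www.google.com/search?q=airport+wireless – DQG4AADOkAAAAB_kWn0FCUZ15"
)

# The IP address changed with the connection, but the cookie did not.
same_computer = home.cookie == airport.cookie
print(home.ip != airport.ip, same_computer)  # True True
```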
Anonymizing The Logs
Google’s plan is to change the IP addresses and cookies kept in logs. By doing this, it will make it difficult, probably impossible, to trace any particular query back to a particular computer, much less a person that used that computer.
How exactly the information will be changed is still being determined. One method would be to replace the information with something randomly generated. For example, the line above might become:
67.42.X.XX – 13/Mar/2007 00:44:15 – http://www.google.com/search?q=laptop+broadband – DQG4AADOkAAAXXXXXXXXXXX
Doing that, overwriting some of the key data, would make it much harder, probably impossible, to trace a query back to a particular cookie or IP address.
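Here is a small Python sketch of that kind of overwriting. It is only my guess at one possible approach, since Google hasn’t said how it will do this. The function names and the choice to keep two parts of the IP address and the first 12 characters of the cookie are assumptions.

```python
def mask_ip(ip: str, keep_octets: int = 2) -> str:
    """Keep the leading parts of a dotted IP address and X out the rest."""
    parts = ip.split(".")
    masked = ["X" * len(part) for part in parts[keep_octets:]]
    return ".".join(parts[:keep_octets] + masked)


def mask_cookie(cookie: str, keep_chars: int = 12) -> str:
    """Keep the first characters of the cookie and X out the remainder."""
    hidden = max(len(cookie) - keep_chars, 0)
    return cookie[:keep_chars] + "X" * hidden


print(mask_ip("67.42.06.24"))                     # 67.42.XX.XX
print(mask_cookie("DQG4AADOkAAAAB_kWn0FCUZ15"))   # DQG4AADOkAAAXXXXXXXXXXXXX
```

Once the original values are overwritten and discarded, there is nothing left to match against an access provider’s records or a cookie on someone’s computer.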
How Anonymous Is Anonymous?
You may recall that last year, AOL released some “anonymous” data for researchers that quickly became a way for people to make educated guesses about exactly which named individuals made certain searches. To date, only one person has been positively identified: Thelma Arnold, who was featured in a New York Times article about the release. New York Times reporters tracked her down based on her queries, and she confirmed making them.
If that could happen, then how could anonymizing data as Google is doing still protect people? With AOL, it was a one-to-one change. Each person’s identifier was changed, but the same replacement was used every time that person appeared, and a different replacement was used for someone else.
It’s easier to illustrate. Imagine there were these requests:
126.96.36.199 – murder someone – DQG4AADOkAAAAB_kWn0FCUZ15
67.42.06.24 – house cleaning – DQG33434AADOkAAAABdfdfdsdCUZ15
188.8.131.52 – poisoning – DQG4AADOkAAAAB_kWn0FCUZ15
184.108.40.206 – exercise routines – WEQa333OkAAAABdfdfdsdCUZ15
220.127.116.11 – faking a death – DQG4AADOkAAAAB_kWn0FCUZ15
They could be randomly changed like this:
27.42.xx.xx – murder someone – anon1
67.42.xx.xx – house cleaning – anon2
27.42.xx.xx – poisoning – anon1
45.42.xx.xx – exercise routines – anon3
37.42.xx.xx – faking a death – anon1
Now see how the cookies are all changed? Anything that was “DQG4AADOkAAAAB_kWn0FCUZ15” becomes “anon1.” That means I can’t trace that cookie back to a particular computer, since the information is lost. But it does mean I can see all the queries this “anon1” person has made:
- murder someone
- poisoning
- faking a death
Now if anon1 did more queries — a lot of them — potentially I might have enough information to guess at who they are. That’s because despite anonymizing the information, the one-to-one change meant it was still possible to know a particular anonymous individual did a string of queries.
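The one-to-one change can be sketched in a few lines of Python. This is my own illustration, not AOL’s actual method. The function names are hypothetical, and the records reuse the queries and cookies from the example above.

```python
from collections import defaultdict


def pseudonymize(records):
    """Replace each distinct cookie with a stable label: anon1, anon2, ..."""
    labels = {}
    result = []
    for query, cookie in records:
        if cookie not in labels:
            labels[cookie] = f"anon{len(labels) + 1}"
        result.append((query, labels[cookie]))
    return result


def queries_by_label(pseudonymized):
    """Group queries under the label that replaced the original cookie."""
    grouped = defaultdict(list)
    for query, label in pseudonymized:
        grouped[label].append(query)
    return dict(grouped)


records = [
    ("murder someone", "DQG4AADOkAAAAB_kWn0FCUZ15"),
    ("house cleaning", "DQG33434AADOkAAAABdfdfdsdCUZ15"),
    ("poisoning", "DQG4AADOkAAAAB_kWn0FCUZ15"),
    ("exercise routines", "WEQa333OkAAAABdfdfdsdCUZ15"),
    ("faking a death", "DQG4AADOkAAAAB_kWn0FCUZ15"),
]

profiles = queries_by_label(pseudonymize(records))
print(profiles["anon1"])  # ['murder someone', 'poisoning', 'faking a death']
```

The original cookie is gone, but the label does the same job of tying one searcher’s queries together.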
In contrast, say you did this:
18.104.22.168 – murder someone – DJSAFDKJDKDJDK
67.42.06.24 – house cleaning – DA9D98D98D
22.214.171.124 – poisoning – D87F9DA0898DD
126.96.36.199 – exercise routines – DAA90HQH34
188.8.131.52 – faking a death – DA908FD0DA
Now there’s no way to know that a particular anonymous person did a string of queries over time. There’s nothing to link all the queries together.
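By contrast, here is a sketch of the unlinked approach, where every record gets its own throwaway token. Again, this is an assumption about how it could be done, not a description of Google’s process, and the unlink name is hypothetical.

```python
import secrets


def unlink(records):
    """Give every record its own random token so nothing ties records together."""
    return [(query, secrets.token_hex(8)) for query, _cookie in records]


records = [
    ("murder someone", "DQG4AADOkAAAAB_kWn0FCUZ15"),
    ("poisoning", "DQG4AADOkAAAAB_kWn0FCUZ15"),
    ("faking a death", "DQG4AADOkAAAAB_kWn0FCUZ15"),
]

unlinked = unlink(records)
tokens = [token for _query, token in unlinked]

# All three queries came from one cookie, yet no two tokens match afterward.
print(len(set(tokens)) == len(tokens))
```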
How & When The Change Will Happen
At the moment, Google is still working out how cookie data will be made anonymous. As for IP data, that’s having the last 8 bits removed. A FAQ (PDF format) from Google has some details on this and the change overall, in particular the how and why of Google retaining data.
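For the IP part, removing the last 8 bits can be sketched in Python with the standard ipaddress module. I’m assuming “removed” means the bits are set to zero. The function name and the documentation addresses are mine, not Google’s.

```python
import ipaddress


def drop_last_8_bits(ip: str) -> str:
    """Zero the final 8 bits of an IPv4 address."""
    value = int(ipaddress.IPv4Address(ip))
    return str(ipaddress.IPv4Address(value & 0xFFFFFF00))


print(drop_last_8_bits("192.0.2.123"))  # 192.0.2.0
print(drop_last_8_bits("192.0.2.7"))    # 192.0.2.0
```

Eight bits cover 256 values, so after the change any one logged address could have been any of 256 original addresses.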
As said, Google is going to alter the data after 18-24 months, except where it might be required to keep it longer for legal reasons. In Europe, data must be kept from six months to two years, depending on the particular country (a requirement that covers ISPs even outside the EU). The US is considering enacting a similar data retention law.
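The timing rule itself is simple enough to sketch in Python. This is only an illustration under my own assumptions: I approximate 18 months as 18 blocks of 30 days, and the legal_hold flag and function name are hypothetical stand-ins for whatever legal exceptions apply.

```python
from datetime import datetime, timedelta

# Assumption: 18 months approximated as 18 * 30 days.
RETENTION = timedelta(days=18 * 30)


def due_for_anonymizing(logged_at: datetime, now: datetime, legal_hold: bool = False) -> bool:
    """Return True once a record is old enough and no legal hold applies."""
    if legal_hold:
        return False
    return now - logged_at >= RETENTION


logged = datetime(2007, 3, 13, 0, 44, 15)
print(due_for_anonymizing(logged, datetime(2007, 9, 13)))                    # False
print(due_for_anonymizing(logged, datetime(2009, 3, 13)))                    # True
print(due_for_anonymizing(logged, datetime(2009, 3, 13), legal_hold=True))   # False
```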
The exact date of when data will be anonymized hasn’t been announced, though Google says it hopes to have it in place by the end of the year. That will come after the exact process is determined. Once it happens, all information past, present and future will be changed.
I asked Google about backup data — data that might not be easily accessible or easily altered. Would that also get changed? Yes — but figuring out how to reach this data (most is on tape) is part of the engineering task underway.
Search History: Stored In Various Places
Changing the server logs is only one part of the privacy chain. There are still other ways your search history can be discovered, some not under Google’s control. Here’s a rundown on how your queries flow to Google (or other search engines) and how they might be exposed along the way.
1) Search History On Your Computer
Today, we wrote about a murder case involving searches done on Google and Microsoft Live. Those searches were found on the computer of the accused. Law enforcement didn’t have to get them from Google or Microsoft. They simply had to seize the computer and check the traces that searching leaves behind on it.
In particular, your computer will make a “cached” copy of pages you visit — which can include search results you’ve viewed. Searches you do are also often stored within search boxes in your browser or within search toolbars you use.
For more advice on clearing some of this, here are some help pages to check out:
- How do I delete the drop-down list of my past searches?, from Google, has lots of good advice that will clear things out for you when accessing Google as well as other search engines.
- How do I clear my Internet search history?, Yahoo
- How do I clear the search history in my toolbar?, Yahoo
2) Search History & Your ISP
Everything you do goes through your ISP. Google may be anonymizing its records, but your ISP might not be. The best way to protect yourself, if you’re very concerned here, is to use tools that anonymize your connection securely, such as Anonymizer or Tor.
Postscript: TechDirt points to a timely article, Compete CEO: ISPs Sell Clickstreams For $5 A Month, that highlights this point. I’ve stressed this for years (2003, 2005, 2006). As above, what you do is seen by your ISP, and ISPs can and do sell that data to others.
3) Search History & Search Engine Server Logs
Every visit you make to a search engine (or any site) is recorded in server logs as I’ve described in this article. Google’s changes will alter these logs. Other search engines operate differently. The last comprehensive review of log retention was by News.com back in February 2006, which found that Yahoo (like Google, until now) seemed to retain logs indefinitely with no alterations. Microsoft said some data is deleted, but not what or when.
Postscript: Google adding search privacy protection, out now from News.com, provides an update:
Yahoo and Microsoft have declined to disclose their exact data retention policies with respect to Web searches. AOL saves personally-identifiable search data for up to 30 days in a way that’s visible to the user and uses an encryption hashing technique to obscure it thereafter, said AOL spokesman Andrew Weinstein.
“We do not keep any IP addresses in our search database, and we de-identify any associated account information through an encryption algorithm,” he said. “We have also made a business decision not to keep any unique identifiers (i.e. the hashed user ID) for longer than 13 months. … That said, it still might contain information of a personal nature, as the data released last year clearly did.”
Note that when AOL says the IP address is not kept in the “search database,” that’s likely a database entirely different from server logs. Those might be kept with IP and cookie information still intact.
4) Search History & Personalized Results Or Personal Search History Records
Last month, Google made a change that caused many more people to be using its Personalized Search and Search History features. The log changes announced WILL NOT alter your search history. That information is NOT being destroyed or anonymized over time. If you want it wiped out, Google says you have to do that separately from the action they announced today.
Google Ramps Up Personalized Search is my comprehensive look at both features, how to switch them off and how to clear your history, if you want to.
Yahoo’s MyWeb feature stores your searches, if you’ve switched that on. To turn it off, log in to MyWeb, go to your profile page and under Tools on the left-hand side, click on the Search History link. On the next page that loads, you’ll see a Clear History feature on the left-hand side.
Overall, it’s a good move by Google. It’s long been singled out as keeping all data indefinitely, even if others also do the same. At least now, the company is showing a strong move toward getting rid of personal data when it is no longer needed, which may be reassuring to some users.
Of course, it has also made a recent change causing many more people to have search histories recorded in another way. The difference is that anyone can delete those histories at any time, something that was not possible to do with server logs. The control is in the hands of the searchers, assuming they have an awareness of the feature and exercise that control. Many likely won’t.
For reaction to the change, check out the Techmeme round-up here.
Not Good Enough: “I don’t think the Google proposal is adequate. This period is too long and it’s not in fact data destruction, it’s more data de-identification, and that should be happening in 18 to 24 hours, not months,” said Marc Rotenberg, executive director of the Electronic Privacy Information Center. “I’m not persuaded that this isn’t still a ticking time bomb for Google’s search engine.”
Also Not Good Enough: Richard M. Smith, an Internet security and privacy consultant at Boston Software Forensics, said Google should never be archiving the IP address and cookies on servers. “Google should not be in the spy business,” he said. “By logging IP addresses and search strings they are running the largest intelligence operation in the world.”
They’re Trying: “This is really the first time we have seen them make a decision to try and work out the conflict between wanting to be pro-privacy and collecting all the world’s information,” said Ari Schwartz, deputy director of the Center for Democracy and Technology, an advocacy group. “They are not going to keep a profile on you indefinitely.”
Good First Move: Kevin Bankston, staff attorney at the Electronic Frontier Foundation, said he would like to see Google scrub the entire IP address within six months, but praised Google for making this “positive first step.” “We hope other online service providers will heed this example and work to minimize the amount of data they keep about their customers,” Bankston said.