Wikipedia Releases Search Data To Public But Pulls It After Privacy Concerns

Barry Schwartz on September 20, 2012 at 9:29 am | Reading time: 2 minutes

Wikipedia announced that it had decided to give away its search data to the public for free. Yes, it would simply hand over search data to anyone who wanted to download it. Shortly after the announcement, Wikipedia said it had “temporarily taken down this data to make additional improvements to the anonymization protocol related to the search queries.”

My first reaction when I saw that Wikipedia was releasing this information was: privacy issue! Imagine how people use Wikipedia. They may search for family information, medical conditions, religious beliefs, political beliefs and so on. If you can tie those search patterns to the same user (for example, through their IP address), you can technically backtrack to who the searcher is and build a profile of that user and their beliefs and tastes. Great for marketers, but potentially horrible for the privacy of the searcher.

Back in 2006, AOL released similar search data and was blasted for doing so. In fact, the New York Times profiled one of those searchers, Searcher No. 4417749, to prove this point. Heck, they even made a movie around this leaked search data.

So when I heard Wikipedia was doing the same, I was a bit surprised. Why did they decide to release it? Well, they listed three reasons:

(1) it provides valuable feedback to our editor community, who can use it to detect topics of interest that are currently insufficiently covered.
(2) we can improve our search index by benchmarking improvements against real queries.
(3) we give outside researchers the opportunity to discover gems in the data.

The data includes:

Server hostname
Timestamp (UTC)
Wikimedia project
URL encoded search query
Total number of results
Lucene score of best match
Interwiki result
Namespace (coded as integer)
Namespace (human-readable)
Title of best matching article
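
For readers who want to picture what one record with these ten fields might look like in practice, here is a minimal Python sketch. Wikipedia's announcement, as quoted here, does not describe the file layout, so the tab separator, the field order (taken from the list above), the timestamp format, the use of "+" for spaces in the query and the sample line itself are all assumptions made for illustration, and the names SearchRecord and parse_record are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.parse import unquote_plus

FIELD_COUNT = 10  # the ten fields listed in the article


@dataclass
class SearchRecord:
    server_hostname: str
    timestamp: datetime
    project: str
    query: str  # decoded from its URL-encoded form
    total_results: int
    best_lucene_score: float
    interwiki_result: str
    namespace_id: int
    namespace_name: str
    best_match_title: str


def parse_record(line: str) -> SearchRecord:
    """Parse one log line. ASSUMES tab-separated fields in the listed order."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != FIELD_COUNT:
        raise ValueError(f"expected {FIELD_COUNT} fields, got {len(parts)}")
    host, ts, project, raw_query, total, score, interwiki, ns_id, ns_name, title = parts
    return SearchRecord(
        server_hostname=host,
        # ASSUMED timestamp layout; the article only says the value is in UTC.
        timestamp=datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc),
        project=project,
        # ASSUMES "+" stands for a space, as in HTML form encoding.
        query=unquote_plus(raw_query),
        total_results=int(total),
        best_lucene_score=float(score),
        interwiki_result=interwiki,
        namespace_id=int(ns_id),
        namespace_name=ns_name,
        best_match_title=title,
    )


# Made-up sample line for illustration only; it is not taken from the real data set.
sample = "example-host\t2012-09-19 12:00:00\ten.wikipedia\tsearch+engine%20land\t42\t3.5\t\t0\tMain\tSearch engine"
record = parse_record(sample)
print(record.query, record.total_results)
```

Note that none of the listed fields is an IP address or user ID. Even so, the decoded query text paired with a timestamp is the kind of detail that can hint at who was searching.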

Again, Wikipedia has pulled the data down until it can figure out how to better anonymize it.

About the author

Barry Schwartz

Barry Schwartz is a technologist and a Contributing Editor to Search Engine Land and a member of the programming team for SMX events. He owns RustyBrick, a NY based web consulting firm. He also runs Search Engine Roundtable, a popular search blog on very advanced SEM topics.

In 2019, Barry received the Outstanding Community Services Award from Search Engine Land; in 2018, he was named "US Search Personality Of The Year" by the US Search Awards; and in 2023, he was listed as one of the top 50 most influential PPCers by Marketing O'Clock.
