Wikipedia Releases Search Data To Public But Pulls It After Privacy Concerns

Wikipedia announced that it would give away its search data to the public for free. Yes, it would simply hand search data to anyone who wanted to download it. Shortly after the announcement, Wikipedia said it had “temporarily taken down this data to make additional improvements to the anonymization protocol related to the search queries.”

My first reaction when I saw that Wikipedia was releasing this information was: privacy issue! Imagine how people use Wikipedia. They may search for family information, medical conditions, religious beliefs, political beliefs and so on. If you can tie those search patterns to the same user (e.g. by their IP address), you can technically backtrack to who the searcher is and build a profile of the user and their beliefs and tastes. Great for marketers, but potentially horrible for the privacy of the searcher.
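
To make the risk concrete, here is a minimal Python sketch of how queries that share one identifier could be grouped into a per-user profile. The records, the identifier values, and the build_profiles helper are all hypothetical, invented for illustration; none of them come from Wikipedia's data.

```python
from collections import defaultdict

# Hypothetical (identifier, query) pairs; not real data.
SAMPLE_LOG = [
    ("user-a", "diabetes symptoms"),
    ("user-b", "history of jazz"),
    ("user-a", "insulin side effects"),
    ("user-a", "churches in springfield"),
]


def build_profiles(log):
    """Group queries by the identifier they share, preserving order."""
    profiles = defaultdict(list)
    for user_id, query in log:
        profiles[user_id].append(query)
    return dict(profiles)


profiles = build_profiles(SAMPLE_LOG)
for user_id, queries in profiles.items():
    print(user_id, "->", queries)
```

Even in this toy log, the three queries tied to one identifier already hint at a medical condition and a religious interest, which is the kind of profile the concern is about.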

Back in 2006, AOL released similar data and was blasted for doing so. In fact, the New York Times profiled one of those searchers, Searcher No. 4417749, to prove this point. Heck, they even made a movie around this leaked search data.

So when I heard Wikipedia was doing the same, I was a bit surprised. Why did they decide to release it? Well, they listed three reasons:

(1) it provides valuable feedback to our editor community, who can use it to detect topics of interest that are currently insufficiently covered. (2) we can improve our search index by benchmarking improvements against real queries. (3) we give outside researchers the opportunity to discover gems in the data.

The data includes:

  • Server hostname
  • Timestamp (UTC)
  • Wikimedia project
  • URL encoded search query
  • Total number of results
  • Lucene score of best match
  • Interwiki result
  • Namespace (coded as integer)
  • Namespace (human-readable)
  • Title of best matching article
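
The fields above could be modeled roughly as follows. This is only a sketch under assumptions: the article does not say how the records are delimited, so the tab-separated layout, the field order, the SearchRecord name, and the sample line are all hypothetical. The one step that relies on a known standard-library call is the URL decoding, done with urllib.parse.unquote_plus.

```python
from dataclasses import dataclass
from urllib.parse import unquote_plus


@dataclass
class SearchRecord:
    hostname: str
    timestamp_utc: str     # kept as text; the timestamp format is not documented here
    project: str
    query: str             # decoded from its URL-encoded form
    total_results: int
    best_score: float      # Lucene score of best match
    interwiki_result: str
    namespace_id: int
    namespace_name: str
    best_title: str


def parse_record(line: str) -> SearchRecord:
    """Parse one record, ASSUMING ten tab-separated fields in list order."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 10:
        raise ValueError(f"expected 10 fields, got {len(parts)}")
    return SearchRecord(
        hostname=parts[0],
        timestamp_utc=parts[1],
        project=parts[2],
        query=unquote_plus(parts[3]),
        total_results=int(parts[4]),
        best_score=float(parts[5]),
        interwiki_result=parts[6],
        namespace_id=int(parts[7]),
        namespace_name=parts[8],
        best_title=parts[9],
    )


# Made-up sample line for illustration only.
sample = "\t".join([
    "search1", "2012-09-19 12:00:00", "enwiki", "privacy%20policy+history",
    "42", "1.5", "", "0", "Main", "Privacy policy",
])
record = parse_record(sample)
print(record.query)
```

Note that none of the listed fields is an IP address or user ID; the query text and timestamp are what would carry any identifying detail.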

Again, Wikipedia has pulled the data until it can figure out how to better anonymize it.



About The Author: Barry is Search Engine Land's News Editor and owns RustyBrick, a NY-based web consulting firm. He also runs Search Engine Roundtable, a popular search blog on very advanced SEM topics. Barry's personal blog is named Cartoon Barry.

  • http://twitter.com/BlueHatOfficial Blue Hat Marketing

    This is really interesting and illustrates the paradox of information freedom: sometimes individual rights and privacy conflict with the need to share information and maintain transparency.

  • Pat Grady

    in Latin, that’s…. wikipedus coitus datum interuptus

  • Alan

    Releasing the data is a good idea. Imagine the keyword research you could do on that data :) However, IPs and anything else that can tie down a person do need to be removed!
