Q&A With Jimmy Wales On Search Wikia

News came out earlier this week that Wikipedia founder Jimmy Wales had a new project in mind, to build a community-driven "Google-killer" search engine. I’ve just finished talking with Jimmy about his plans. Here’s a rundown on his vision and what may come as his Search Wikia project grows over the course of the next year or two.

Note that in the Q&A, I’ve had to recreate my questions as best I remember asking them. I was focused more on getting down Jimmy’s responses.

Q. Since the news emerged, there’s been some confusion about Amazon and Wikipedia in relation to the Search Wikia project. What’s the situation?

We recently completed a funding round with Amazon [for Wikia], but other than that, they don’t have anything to do with the search project. [The project] is a Wikia project [the for-profit company that Wales is chairman of], not a Wikipedia project [the separate community-driven encyclopedia he co-founded].

Q. Was the search project formally announced, or did the Search Wikia site come online as a result of The Times article discussing it?

It was a combination of both. I’ve been working on this for a long time. We didn’t actually intend to announce per se just yet, but me and my big mouth: the reporter asked me if I ever thought about search.

Q. It’s been said the search engine would launch in the first quarter of 2007. That’s fast. Is that really just when you expect active development work to begin?

During Q1, we’re going to set up a project to get developers involved with building the site, writing the code and getting the search engine going. We’re going to rely initially on Nutch and Lucene [related open-source search software that's been developed over the past few years].

We’ll start from scratch on how to apply the Wikipedia principles to keep it as simple as possible and move forward.

It’s just the development starting. We’re not producing a Google-killing search engine in three months. I only wish I were that good of a programmer.

We’ll have some servers open, some development, maybe a pre-pre-alpha demo site up. We’d really anticipate it would be a year or two until we’re able to launch a viable search engine.
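
[NOTE: For readers unfamiliar with the building blocks Wales names: Lucene supplies the core indexing and query machinery, and Nutch layers web crawling on top of it. Below is a minimal, hypothetical sketch of indexing and searching a single page with Lucene. It is written against a modern Lucene API for readability (the 2006-era API differed in details), and the page and query are invented for illustration.]

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        ByteBuffersDirectory dir = new ByteBuffersDirectory(); // in-memory index

        // Index one "crawled" page. Nutch would feed millions of these in.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("url", "http://example.com/ford", Field.Store.YES));
            doc.add(new TextField("body", "Ford Motor Company is an American automaker.",
                                  Field.Store.YES));
            writer.addDocument(doc);
        }

        // Query the index the way a search front end would.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        ScoreDoc[] hits = searcher.search(
            new QueryParser("body", analyzer).parse("automaker"), 10).scoreDocs;
        for (ScoreDoc hit : hits)
            System.out.println(searcher.doc(hit.doc).get("url"));
    }
}
```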

Q. How do you see this improving on what’s out there?

There are a lot of things that we’ve learned in the wiki world on how to get communities involved and engaged to build trusted networks in communities.

A lot of the people who have tried to do this in the past have stumbled not on technical issues but on community issues … dmoz [The Open Directory] was too closed … that was their response because of the pressure of spammers … others have thought in terms of ranking algorithms. That’s not the right approach. The right approach allows for open dialog and debate and discussion.

Q. How do you envision the community participating? Will they be selecting sites? Will this leverage material in Wikipedia? Will they rate sites?

This will be completely independent of Wikipedia.

Exactly how people can be involved is not yet certain. If I had to speculate about it, I would say it’s several of those things: not just the community rating URLs but also the community rating whole web sites, deciding what to include or not to include, and also the whole algorithm … That’s a human type of process where we can empower people to guide the spider.

Q. Do you see humans reviewing the most popular queries, perhaps picking the right answers to come up?

Part of it might be a human review of queries. For the narrow subset of the really popular queries, I think it’s important to apply humans … if someone types Ford Motor Company, there is a correct answer for that. There’s no reason to beat our brains out to train our algorithm to do that.

Q. Search engines have actually gotten much better over time with these types of navigational requests. You don’t need humans so much to make sure the right answer shows up.

Those kinds are not too difficult. The harder one is if you type ford: did you mean President Ford, or did you mean the Ford Motor Company? That’s the type of thing where human disambiguation pages like we have at Wikipedia are helpful.

Q. Search engines already do a lot of this type of stuff. Ask has its Zoom suggestions, others have clusterings or related searches. Do you imagine people being forced to make a query refinement choice before they actually get search results?

If you type ford, you should get some disambiguation terms that humans have collected, then some search results … this is one of the places where I think human intelligence is most important.
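
[NOTE: To make that flow concrete, here is a small, hypothetical sketch: a hand-curated table of disambiguation options consulted before any algorithmic ranking runs. The entries are invented; in Wales's vision they would live in a wiki that the community edits.]

```java
import java.util.List;
import java.util.Map;

public class Disambiguation {
    // Invented, hand-curated options; a wiki community would maintain these.
    static final Map<String, List<String>> PAGES = Map.of(
        "ford", List.of("Ford Motor Company",
                        "Gerald Ford (U.S. president)",
                        "Harrison Ford (actor)"));

    static void serve(String query) {
        List<String> options = PAGES.get(query.trim().toLowerCase());
        if (options != null) {
            System.out.println("Did you mean:");
            options.forEach(o -> System.out.println("  * " + o));
        }
        System.out.println("[algorithmic results would follow here]");
    }

    public static void main(String[] args) {
        serve("ford");
    }
}
```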

[NOTE: For more on query refinement, see some of my past posts, such as Robert Scoble Wants What We Had -- Better Query Refinement. So Do I!, Hello Natural Language Search, My Old Over-Hyped Search Friend and Why Search Sucks & You Won't Fix It The Way You Think. The first link in particular discusses how Microsoft used to have disambiguation created by editors very similar to what Wales hopes to recreate. Sadly, it was killed in the quest to chase Google on the algorithmic front.]

Q. Are you planning to crawl the entire web, billions and billions of pages? Or will you go after a subset of important ones?

The number of pages is yet to be determined. Obviously we won’t be doing that initially [gathering everything], but we’ll invest in the hardware. Not to belittle the investment required to do a full crawl of the web on a regular basis, but I think it’s fairly commoditized.

Q. Crawling is one thing. Serving up millions of queries per day is another issue entirely. Wikipedia handles a lot of traffic, but not at Google scale. How’s that going to work?

The traffic’s not too bad. Servers are getting more and more powerful. Bandwidth is getting cheaper. It’s all pretty much off the shelf. It’s pretty efficient.

Q. Will you be selling ads, and if so, how will that work?

There are no immediate plans to sell ads, so for now we’re not too focused on that. If we don’t build something useful, selling ads on it is sort of a moot point.

Q. Why do this at all? What do you see wrong with search?

For certain types of searches, search engines are very good. But I still see major failures, where they aren’t delivering useful results. At a deeper, almost political level, I think it’s important that we as a global society have some transparency in search. What are the algorithms involved? What are the reasons why one site comes up over another one? [Wales also raised the issue of how ads might influence regular listings, perhaps search engines trying to keep commercial sites out of the free listings to make money. From there, he went on....] Those types of incentives are problematic in search. The only solution I know to that is to be transparent.
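
[NOTE: As one way to picture what transparency could mean mechanically, here is a hypothetical scoring function whose factors and weights are all published and whose output explains itself. The factors and numbers are invented for illustration, not anything Wales specified.]

```java
public class TransparentRank {
    // Published weights: anyone can inspect and debate them. All values invented.
    static final double W_TEXT = 1.0, W_TITLE = 2.5, W_COMMUNITY = 1.5;

    static double score(int termFrequency, boolean termInTitle, double communityRating) {
        double s = W_TEXT * Math.log1p(termFrequency)
                 + (termInTitle ? W_TITLE : 0.0)
                 + W_COMMUNITY * communityRating;
        // The engine can show its work instead of hiding it.
        System.out.printf("tf=%d, title=%b, community=%.1f -> %.2f%n",
                          termFrequency, termInTitle, communityRating, s);
        return s;
    }

    public static void main(String[] args) {
        score(7, true, 0.8);
    }
}
```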

Q. How are you going to keep the community from being gamed? Wikipedia is very good at keeping out spam, but it’s not perfect. And despite its size, it’s dealing with far fewer topics than the unique searches that will happen on any particular day. How do you police all those searches?

You have to recognize the difference between the way community is often used on the internet, which is shorthand for millions of people clicking on some stuff, as compared to community in the wiki world, which is people who actually know each other.

It’s one thing to say you have millions of spammers out there trying to game and trick an algorithm … but it’s not the number of queries, it’s the web sites themselves. A lot of numbers are thrown about for sites on the web, but the number of legitimate pages that are not coming from affiliate sites and spammers is a much more finite number. It’s much easier for a community to ban the bad stuff.
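
[NOTE: The mechanics of banning bad sites are indeed simple; it is the community process for maintaining the list that is hard. A minimal sketch, with invented domain names, of filtering a crawl against a community blacklist:]

```java
import java.net.URI;
import java.util.Set;

public class CrawlFilter {
    // A community-maintained blacklist; these domains are invented examples.
    static final Set<String> BANNED = Set.of("spam-farm.example", "affiliate-junk.example");

    static boolean shouldCrawl(String url) {
        String host = URI.create(url).getHost();
        return host != null && !BANNED.contains(host);
    }

    public static void main(String[] args) {
        System.out.println(shouldCrawl("http://spam-farm.example/page1")); // false
        System.out.println(shouldCrawl("http://example.com/about"));       // true
    }
}
```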

Q. But what if someone gets into a "good" domain? We’ve had cases where bad content gets shoved into "trusted" sites or even places like university sites. Do you ban those entire domains? How do they get back in?

At Wikipedia, we’d have a big discussion. [Wales then explained that people might realize a domain had done something accidentally wrong or without thinking about spam issues and so might be allowed back in.]

Q. You probably already search a lot, probably mostly with Google. Isn’t it already finding what you want most of the time, without a flood of spam or crud in your way?

Usually I’m looking for pages on Wikipedia, so they do a good job with that. It depends on the types of searches you are doing. If you’re doing a factual search, then Wikipedia [in the results] would be good. In other areas, I think there’s a strong commercial incentive. Why is it bad if I search for tampa hotels?

[NOTE: I then did this search on Google, which we discussed. I noted I saw plenty of good hotels listed, and that if I clicked through to the local search results, I got an even better experience of hotels listed.

Wales replied that he's often after reviews of hotels, not the hotels themselves. That took me back to the original results, where I pointed out the top listing was from TripAdvisor, exactly the type of review site he mentioned liking -- and that I often found them listed on these types of queries.

I also noted that Google even offers refinement categories at the top of the page similar to the disambiguation he wanted, with lodging guides as one of the categories. Unfortunately for Google, I didn't find that the results from that refinement did a good job bringing back trusted hotel guides.]

Q. Back to transparency. People keep saying they want more of this. But can you name some exact examples of what you want to see? Do you want Google to say that using a term in bold text adds X percent of a score to the ranking criteria? And if you do that, don’t you think spammers will just abuse the recipe that’s been published?

If your search relies on some secret factors that you hope people won’t discover, you haven’t really come up with a good solution to the problem.

Q. Microsoft has spent millions of dollars and years of effort trying to be a Google killer and hasn’t made it. You’re coming into this fresh with fewer resources and no real prior experience. Can you really do it?

I have no idea. I only do whatever sounds like it is fun.

Q. What type of funding do you have behind this?

Wikia’s initial round was $4 million from a variety of angels; then there was a second round from Amazon, but the amount wasn’t announced.

Closing Comments

When I first heard of the plans, I was pretty dubious the project would have much success. For one thing, the idea of an "open source" search engine taking on the world and providing more transparency is old news. Consider this from New Scientist in 2003, back when Nutch first came out:

The project "is about providing free technology that should not be controlled by private, commercial, secretive organisations," says Doug Cuttings, veteran web search engineer, and a Nutch founder.

Three years on, nothing has really changed, even though the reasoning behind such a project remains the same. And that’s despite Nutch having some big names behind it.

In 2004, Nutch got another round of attention in an ACM article looking at how it works. My comment at that time was:

Interesting read especially for the efforts that are involved to defeat spam. The argument is that though Nutch is open, revealing secrets won’t hurt because spammers will batter down any defenses, no matter how tightly protected. OK, so what will stop spam? Nutch hopes that an open, public discussion may reveal new methods. Perhaps. But the real test will only come if Nutch is deployed by a major, highly-trafficked site. Spammers aren’t going to bother trying the defenses of other places. It’s not worth the time. That’s also a positive for those considering Nutch. If you operate a small, vertical site or just want Nutch to be used on your own content, then spam concerns are much less of an issue.

The spam test simply hasn’t happened with Nutch. And every new search engine project I’ve seen come along over the years completely underestimates the spam problem it faces. When I looked at the Search Wikia site, comments like this almost seemed laughable:

search actively for spammer sites

  • trying to simulate user typos (i.e. "yaoho.com" rather than "yahoo.com"); see also: Microsoft’s URL Tracer

  • blacklist domains where spam mails are linking to; actively create honeypots to get spam; use a pattern like <domain-where-we-have-registered>@myhoneypot.com to identify the spam networks; shall the common user get the possibility to register such a mail address?

Seek out the spam sites? Hey, don’t worry — if you’re popular, they’ll find you fast enough. And as you blacklist one, two more throwaway domains will show up in their place.
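
For what it's worth, the "simulate user-typos" idea quoted above is at least easy to prototype. Here is a hedged, hypothetical sketch that generates transposition and deletion variants of a legitimate domain name, which could then be checked against a crawl; it says nothing about the much harder problem of keeping up with throwaway domains.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class TypoSquats {
    // Generate simple typo candidates for a legitimate name, in the spirit of
    // the "yaoho.com" example: adjacent swaps and single-character deletions.
    static Set<String> variants(String name) {
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i + 1 < name.length(); i++) {          // swap neighbors
            char[] c = name.toCharArray();
            char t = c[i]; c[i] = c[i + 1]; c[i + 1] = t;
            out.add(new String(c));
        }
        for (int i = 0; i < name.length(); i++)                // drop one char
            out.add(name.substring(0, i) + name.substring(i + 1));
        out.remove(name);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(variants("yahoo")); // includes "yaoho", "yhoo", ...
    }
}
```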

I also tend to think Wales is completely underestimating how crawling a big chunk of the web, keeping those pages fresh, ranking them quickly to provide answers and doing all of this for millions of queries each day isn’t an off-the-shelf commodity.

Still, I find myself oddly hopeful. I don’t think a Google killer will emerge, but perhaps some new ways for a community to be involved with search will come out of it. I wouldn’t have thought Wikipedia would work. Certainly it’s flawed, but it’s also an incredible resource. Maybe something useful will come from the Search Wikia project.

At the very least, I’ve long wanted humans to be back in the role of reviewing queries and actually looking to see if they make sense, rather than so much reliance on algorithms. Maybe the mere concept of the Search Wikia project will encourage the major search engines to do more in this area.

Postscript: Originally I had Wales listed as cofounder of Wikipedia, but he got in touch saying he was the founder. I’d noted that Wikipedia itself lists him as founding (well, creating) it with Larry Sanger. Is Wikipedia incorrect on this, I asked? “Yes, it is wrong,” he emailed. Sanger posts his own views on the origins here.


About The Author: Danny Sullivan is a Founding Editor of Search Engine Land. He’s a widely cited authority on search engines and search marketing issues who has covered the space since 1996. Danny also serves as Chief Content Officer for Third Door Media, which publishes Search Engine Land and produces the SMX: Search Marketing Expo conference series. He has a personal blog called Daggle (and keeps his disclosures page there). He can be found on Facebook, Google+ and microblogs on Twitter as @dannysullivan.

  • http://www.mattcutts.com/blog/ Matt Cutts

    Wow–that’s a really deep look, right down to the link that mentions that back in 2003, Nutch was getting funding from Overture, a Google competitor that was eventually bought by Yahoo.

    I’m glad that someone finally asked about the fact that [tampa hotels] is really rather clean, with lots of variety: several specific hotels, several overview sites, and local search results and a Plusbox map.

    I respect Jimmy’s opinion that he’d rather have “hubs” with reviews of hotels rather than the hotels themselves. The pendulum hasn’t always swung in that direction though. I remember when a different search engine (now absorbed by someone else, and named for a Gaelic word) was making the rounds talking about the query “grocery stores” and how their search engine returned individual grocery stores instead of directories or hubs that listed grocery stores. At the time, Google leaned more toward returning overview/hub sites, but we had the ability to change that. So that’s something that existing search engines can adjust if they think users prefer a different mix; I know because we’ve changed that balance in the past.

  • http://monetizethis.blogspot.com NaplesDave

    Interesting!

    Would like to see how they will juggle user trust with user intervention. However, I am all for letting users tailor entries as with Wikipedia. Seems to be a source everyone trusts.

  • http://sethf.com/ Seth Finkelstein

    One thing many other open-source projects haven’t had is the marketing muscle to build a user-base and get donated development. Wikia has much more of this than many other efforts.

    While Wikia may be underestimating the resources required, I also think they have more resources available than most other open-source projects who have attempted the task.

    Plus, one of their sleeper strengths is they’ve developed an amazing ability to neutralize criticism of quality control failures. That also sets them apart from many other projects.

    So while I’m usually skeptical too, they do have some decent answers to the obvious question of why they have a chance of succeeding when so many others have failed.

  • http://www.xanersoft.com Web Design

    YES, I was looking for this info for many days. But to be very frank, I have my doubts about whether Search Wikia will really beat Google.

    Also, an open-source project like this needs a lot of resources; I hope Wikia knows everything it will require.

    Anyway, thanks for publicizing the talk. Let’s see what happens.

  • Lukas

    Good article!
    Even if they don’t manage to dethrone Google, the *VERY* positive message I can read between the lines is that they plan on using Lucene and Nutch (and Hadoop could follow in a few minutes, right? :-). As a Nutch/Lucene/Hadoop hobbyist I can’t help but loudly call Bravo! (provided they contribute their code back to the Apache repository). Then the real winner can be the open-source community.

  • http://seoptimization.blog.com/ ★ ★ Search Engines WEB ★ ★

    If possible, Wikipedia should make its ongoing detailed stats public. It would be interesting if Wales would consider it ASAP.

    The stats that are currently available are generalized.

    This is exceptionally important because Wikipedia is now on page one or page two for many competitive terms in Google and Yahoo.
    ( including moving to #1 for the term:
    SEARCH ENGINE OPTIMIZATION )

    This would offer a unique opportunity to analyze what percentage of traffic each of the top search engines brings – and also how much approximate traffic competitive keywords bring.

    Also, a cumulative overview for 2006 would be interesting.

  • http://www.comstockfilms.com/blog/tony Tony Comstock

    Just three days ago I wrote to Matt Cutts:

    “I’m no expert, but it’s hard for me to imagine what sort of algorithm would be able to distinguish the highly entertaining, very intelligent, but often utterly filthy Pretty Dumb Things (apparently still in the Google penalty box) from run-of-the-mill sites that use similar language in similar quantities, and even in similar, but tremendously less artful ways.”

    So you can imagine how amused I am to see that human-powered search is the topic of the day.

    Thankfully, PDT appears to be out of the Google penalty box. Unfortunately Comstock Films is back in. As of this morning, sites featuring photos of women fellating dogs and copulating with horses are out-ranking (out-ranking by pages) Comstock Films on Google searches for ‘couples porn’. (We make award-winning films about the intimate aspects of couples in long-term relationships.)

    At this point I’m thinking this is neither a bug nor an anti-sex bias at Google. I think the explanation is that the googlebot is one kinky mo’fo’!

  • http://www.acclivitymarketing.com/blog Solomon Rothman Web Design Search Engine Marketing

    If Search Wikia can even produce usable results (as in, not chock full of spam or outdated sites), they’ll gain an almost immediate following. Since Google has become so large and powerful, it’s attracted a large amount of criticism arising solely out of fear of the power and control Google possesses over the online market and the world economy.

    Just losing your rankings on a few prime keywords can cost a company hundreds of thousands of dollars, and due to the closed nature of Google’s algorithms it’s inevitable someone will claim Google is manipulating things “unfairly” or, more accurately, “un-ideally” for the user. This belief will directly lead to a huge following of users who will support Search Wikia and use it as a replacement for Google even if it produces inferior results.

    Search Wikia doesn’t have to produce a product superior to or even on par with Google. It just has to give the proponents of open-source philosophy what they want: to feel like they’re positively contributing to their search experience and that they (collectively, of course) have control and assurances over the fairness of the results.

    And I say “hell yeah.” Chaos refines order (eventually) and competition yields innovation.

  • http://www.bessed.com AdamJusko

    It’s an interesting project, something similar to what we’re already doing at Bessed (http://www.bessed.com). However, unlike what it sounds Wales may be up to, we’re using paid editors and have built our site on WordPress to give visitors a say on search results (via comments) without completely giving up the keys to the castle.

    Our issue is scalability–how much can humans really cover in terms of the billions of searches possible? In Wikiasari’s case (or whatever Wales’ final search site is called), there is the challenge of getting volunteers to work on a search engine that is for-profit, something very different from working on the non-profit Wikipedia. It’s fun to voluntarily add to the Wikipedia entry for The Oak Ridge Boys or Carl Sagan—is it fun to voluntarily gather a list of landscaping companies in Springfield, Illinois, as you’d expect would be necessary for a new human-powered engine? (Again, this is part of why Bessed uses paid editors; no one wants to do boring stuff for free.)

    I’m glad to see human-powered search in the news, though; it gives me a chance to blow Bessed’s horn and invite Webmasters to submit their sites.

  • http://krugle.com Ken Krugler

    After two years of working with Nutch 0.7 and 0.8, I’d have to strongly agree with Danny’s comment about how easy it is to underestimate the level of effort required to do a good job of regularly crawling lots of pages, filtering out spam, and quickly serving up high quality results.

    Many of the problems require technical solutions – e.g. you want the hit summarizer to ignore sidebars. Not ridiculously hard to do, but it’s one of countless small programming tasks that *somebody* has to define/code/test.

    So when I read posts on the wikia mailing list about P2P vs. centralized search, the semantic web, etc., I have to laugh. Those aren’t the things that make or break you; it’s fixing the 1,000 little problems that constantly show up.

  • http://www.mattcutts.com/blog/ Matt Cutts

    Ken, is that where Krugle the code search engine gets its name–from your last name? I wondered about that. :)

    Yup, I remember talking about navbars and boilerplate and their impact on snippets way back a few years ago. There really are a lot of little things to get right for a search engine. :)
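
[NOTE: To illustrate the snippet problem Ken and Matt describe, here is a deliberately crude, hypothetical sketch that drops obvious page chrome before snippet text is chosen. Real boilerplate detection is far messier, which is exactly their point.]

```java
import java.util.regex.Pattern;

public class SnippetText {
    // Drop <nav>/<aside>/<footer> blocks, then strip remaining tags.
    static final Pattern CHROME = Pattern.compile("(?is)<(nav|aside|footer)\\b.*?</\\1>");

    static String mainText(String html) {
        return CHROME.matcher(html).replaceAll(" ")
                     .replaceAll("(?s)<[^>]+>", " ")
                     .replaceAll("\\s+", " ")
                     .trim();
    }

    public static void main(String[] args) {
        String page = "<html><nav>Home | About</nav>"
                    + "<p>Ford reported quarterly earnings today.</p>"
                    + "<aside>Sponsored links</aside></html>";
        System.out.println(mainText(page)); // -> Ford reported quarterly earnings today.
    }
}
```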

  • gary price

    Danny, you are correct.

    Ask.com offers Zoom related search, contextually based search suggestions (and often related names). More here. Zoom related search can not only serve to help with narrowing and expanding a query but can also, in some cases, serve as a knowledge discovery tool.

    Ask also offers other types of disambiguation tools. Here are a few examples:

    + Rock Concerts New York City.
    Here, at the top of the results page, you see material from AskCity. Note the disambiguation pull-down boxes providing the searcher options to narrow to a specific borough of NYC.

    + Zip Codes, Springfield
    Springfield, MA is listed first but directly below it, a pull-down menu listing other cities and towns named Springfield in the U.S. is seen on the page. Area code Columbus does the same type of thing.

    + A search for MSFT asks the searcher whether they want information about Microsoft or the latest stock quote.

    + A search for Rocky. First, you’re prompted if you want information on Rocky, the movie. Then, if selected, you’ll find info on the latest Rocky release (Rocky Balboa) and you’ll also see a pull-down menu to go directly to info about Rocky 1 – Rocky 5.

    + Of course, a search for Miami, FL provides a “Smart Answer” with direct links to various types of info directly at the top of the page.
    http://www.ask.com/web?q=miami%20fl&o=0&l=dir

    + The new prototype/beta, AskX, offers even more info on a results page in a three column format. Here’s a search for Chicago. Everything from the local time to information about the band. Same type of thing when you search Boston.

    + Btw, if you like this concept, WikiWax from SurfWax offers the same type of thing using dynamically generated suggestions from Wikipedia’s subject headings.

    + Clusty’s dynamically generated clusters can also serve both as a tool to narrow and focus and as an info discovery vehicle. You can also see various types of clusters with their ClusterMed service.
    http://www.clustermed.info

  • cutting

    As Ken mentions, there are scores of minor engineering tasks involved in maintaining a search engine, but that’s the more tractable part. The trickier part is maintaining high quality results. The major search engines all employ people to evaluate search results and use these to train their search algorithms. This is an expensive process, with no open-source starting points. It is typically blind: evaluators are presented unranked results from unknown search engines and asked to indicate which are relevant to the query, a query that the user did not submit. It seems unlikely that volunteers will be interested in this task. And it is very different from Wikipedia. Raising a volunteer army for search quality is a research task. An interesting one, and one that I’m glad to see someone attack.
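
[NOTE: A quick, hypothetical sketch of the blind setup Cutting describes: results from competing engines are pooled and shuffled so the judge cannot tell which engine produced which result. All names and URLs are invented.]

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BlindEval {
    record Result(String engine, String url) {}

    public static void main(String[] args) {
        // Invented results from two anonymous engines for the same query.
        List<Result> pooled = new ArrayList<>(List.of(
            new Result("A", "http://example.com/one"),
            new Result("A", "http://example.com/two"),
            new Result("B", "http://example.org/three")));

        Collections.shuffle(pooled); // hide origin and original ranking
        for (Result r : pooled)
            System.out.println("Relevant to the query? [y/n] " + r.url);
        // Judgments are logged per (query, url) and only later joined
        // back to the engine that returned each result.
    }
}
```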

  • http://dmoz.org/profiles/chris2001.html chris2001

    Raising volunteer forces for this task is indeed an interesting challenge – but the idea is not really new. One possible way to do it has already been found and realized: ODP’s volunteer editors have scanned the web for quality content since 1998, and the result of our work is Open Content – freely available for everybody who needs a large amount of human-reviewed data, be it as initial input for a new search tool, or for a rather long list of other purposes, from enriching results to ranking and quality control.
    As the Nutch documentation explicitly refers to the ODP RDF dump, I guess I don’t have to go into details about what advantages this might have for an Open Source search engine – you probably know a lot more about it than I do ;-)

    If you are looking into methods to measure quality and other properties of search engines without human testers, you might find the following papers interesting:
    - Random Sampling from a Search Engine’s Corpus by Ziv Bar Yossef, Maxim Gurevich (via SEO by the Sea which has additional links).
    - Using Titles and Category Names from Editor-driven Taxonomies for Automatic Evaluation by S. Beitzel, E. Jensen, A. Chowdhury, D. Grossman.

  • http://www.centiare.com thekohser

    In my opinion, Jimmy Wales is quite the hypocrite, and I’ve seen that he’ll attempt to change history in ways that rival Stalin’s efforts.

    How is it that Larry Sanger was “co-founder” of Wikipedia in the project’s own press releases — for years — but then he’s demoted after Sanger and Wales have a falling out?

    How is it that Wales has said of Wikipedia, “We try really hard to deal with customer service complaints. We don’t allow libel. We’re not a wide-open free speech forum that allows people to post whatever. We’re happy to delete rants and things like that as necessary. I think that’s part of the reason why we haven’t been sued,” but when MyWikiBiz followed his recommendation to the letter (http://www.nabble.com/MyWikiBiz-t2080660.html), Wales still banned the MyWikiBiz User account, AND posted a libelous message on that user’s page?

    How is it that Wales has been on a rampage, railing against companies editing their own space in Wikipedia, but when his partner Angela Beesley edits “her” article about Wikia in Wikipedia, everyone pooh-poohs it, saying her edits aren’t biased?

    How is it that Wales orders the “nofollow” tag to be applied on all external links from Wikipedia, but somehow (miraculously) the “interwiki” template links that point to Wikia.com (there are thousands of them) are exempt from the “nofollow”?

    Isn’t this all convenient? Or, could it just be that Wales is a big hypocrite? I’m biased in my assessment, but you look at the facts, and you make the call.

  • Pozycjonowanie

    Thanks for a very interesting article. May I translate your article into Polish and publish it on my weblog? I will come back here and check your answer. Keep up the good work. Greetings
    Pozycjonowanie

  • http://bestproxy.info Proxy

    Yeah, why not? You can publish it on your blog.

  • http://www.wyliczanka.pl Konin

    Very interesting article. Good work. In my opinion, other open-source projects need a lot of resources. Greetings. My site: Konin

 
