News came out earlier this week that Wikipedia founder Jimmy Wales had a new project in mind, to build a community-driven "Google-killer" search engine. I’ve just finished talking with Jimmy about his plans. Here’s a rundown on his vision and what may come as his Search Wikia project grows over the course of the next year or two.
Note that in the Q&A, I’ve had to recreate my questions as best I remember asking them. I was focused more on getting down Jimmy’s responses.
Q. Since the news emerged, there’s been some confusion about Amazon and Wikipedia in relation to the Search Wikia project. What’s the situation?
We recently completed a funding round with Amazon [for Wikia], but other than that, they don’t have anything to do with the search project. [The project] is a Wikia project [the for-profit company that Wales is chairman of], not a Wikipedia project [the separate community-driven encyclopedia he co-founded].
It was a combination of them both. I’ve been working on this for a long time. We didn’t actually intend to announce per se just yet, but me and my big mouth, the reporter asked me if I ever thought about search.
Q. It’s been said the search engine would launch in the first quarter of 2007. That’s fast. Is that really just when you expect active development work to begin?
During Q1, we’re going to set up a project to get developers involved with building the site, writing the code and getting the search engine going. We’re going to rely initially on Nutch and Lucene [related open-source search software that's been developed over the past few years].
We’ll start from scratch on how to apply the Wikipedia principles to keep it as simple as possible and move forward.
It’s just the development starting. We’re not producing a Google killing search engine in three months. I only wish I were that good of a programmer.
We’ll have some servers open, some development, maybe a pre-pre-alpha demo site up. We’d really anticipate it would be a year or two until we’re able to launch a viable search engine.
Q. How do you see this improving on what’s out there?
There are a lot of things that we’ve learned in the wiki world on how to get communities involved and engaged to build trusted networks in communities.
A lot of the people who have tried to do this in the past have stumbled not on technical issues but on community issues … dmoz [The Open Directory] was too closed … that was their response because of the pressure of spammers … others have thought in terms of ranking algorithms. That’s not the right approach. The right approach allows for open dialog and debate and discussion.
Q. How do you envision the community participating? Will they be selecting sites? Will this leverage material in Wikipedia? Will they rate sites?
This will be completely independent of Wikipedia.
Exactly how people can be involved is not yet certain. If I had to speculate about it, I would say it’s several of those things, not just community involved with rating URLs but also community rating for whole web sites, what to include or not to include and also the whole algorithm … That’s a human type process that we can empower people to guide the spider.
Q. Do you see humans reviewing the most popular queries, perhaps picking the right answers to come up?
Part of it might be a human review of queries. For the narrow subset of the really popular queries, I think it’s important to apply humans … If someone types Ford Motor Company, there is a correct answer for that. There’s no reason to beat our brains out to train our algorithm to do that.
Q. Search engines have actually gotten much better over time with these types of navigational requests. You don’t need humans so much to make sure the right answer shows up.
Those kinds are not too difficult. The harder one is if you type ford: did you mean President Ford or do you mean the Ford Motor Company? That’s the type of thing where human disambiguation pages like we have at Wikipedia are helpful.
Q. Search engines already do a lot of this type of stuff. Ask has its Zoom suggestions, others have clusterings or related searches. Do you imagine people being forced to make a query refinement choice before they actually get search results?
If you type ford, you should get some disambiguation terms that humans have collected, then some search results … This is one of the places where I think human intelligence is most important.
[NOTE: For more on query refinement, see some of my past posts such as Robert Scoble Wants What We Had -- Better Query Refinement. So Do I!, Hello Natural Language Search, My Old Over-Hyped Search Friend and Why Search Sucks & You Won't Fix It The Way You Think. The first link in particular discusses how Microsoft used to have disambiguation created by editors very similar to what Wales hopes to recreate. Sadly, it was killed in the quest to chase Google on the algorithmic front.]
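To make the idea concrete, here is a minimal sketch of how editor-collected disambiguation might sit in front of ordinary results. It is only an illustration of the concept discussed above, not anything Wales or Wikia has described building. The `DISAMBIGUATION` table, the `search` function and the `algorithmic_search` stand-in are all hypothetical names made up for this example.

```python
# Hypothetical, hand-maintained table: ambiguous query -> meanings picked by human editors.
DISAMBIGUATION = {
    "ford": ["Ford Motor Company", "President Ford"],
}


def algorithmic_search(query):
    """Stand-in for a real ranked search; it only returns a placeholder result."""
    return [f"result for {query!r}"]


def search(query):
    """Return human-collected refinements first, then the ordinary results."""
    key = query.strip().lower()
    return {
        "refinements": DISAMBIGUATION.get(key, []),  # empty when no editor has covered the query
        "results": algorithmic_search(key),
    }
```

A query with no entry in the table simply gets an empty refinement list, so the searcher is never forced to pick a meaning before seeing results.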
Q. Are you planning to crawl the entire web, billions and billions of pages? Or will you go after a subset of important ones?
The number of pages is yet to be determined. Obviously we won’t be doing that initially [gathering everything], but we’ll invest in the hardware. Not to belittle the investment required to do a full crawl of the web on a regular basis, but I think it’s fairly commoditized.
Q. Crawling is one thing. Serving up millions of queries per day is an entire other issue. Wikipedia handles a lot of traffic, but not at a Google scale. How’s it going with that?
The traffic’s not too bad. Servers are getting more and more powerful. Bandwidth is getting cheaper. It’s all pretty much off the shelf. It’s pretty efficient.
Q. Will you be selling ads, and if so, how will that work?
There are no immediate plans to sell ads, so for now we’re not too focused on that. If we don’t build something useful, selling ads on it is sort of a moot point.
Q. Why do this at all? What do you see wrong with search?
For certain types of searches, search engines are very good. But I still see major failures, where they aren’t delivering useful results. I think at a deeper almost political level, I think it’s important that we as a global society have some transparency in search. What are the algorithms involved? What are the reasons why one site comes up over another one? [Wales also raised the issue of how ads might influence regular listings, perhaps search engines trying to keep commercial sites out of the free listings to make money. From there, he went on....] Those types of incentives are problematic in search. The only solution I know to that is to be transparent.
Q. How are you going to keep the community from being gamed? Wikipedia is very good at keeping out spam, but it’s not perfect. And despite its size, it’s dealing with far fewer topics than unique searches that will happen on any particular day. How do you police all those searches?
You have to recognize the difference between the way community is often used on the internet, which is short hand for millions of people clicking on some stuff as compared to community in the wiki world, which is people who actually know each other.
It’s one thing to say if you have millions of spammers out there trying to game and trick an algorithm … but it’s not the number of queries. It’s the web sites themselves. A lot of numbers are thrown about for sites on the web, but the number of legitimate pages that are not coming from affiliate sites and spammers is a much more finite number. It’s much easier for a community to ban the bad stuff.
Q. But what if someone gets into a "good" domain? We’ve had cases where bad content gets shoved into "trusted" sites or even places like university sites. Do you ban those entire domains? How do they get back in?
At Wikipedia, we’d have a big discussion. [Wales then explained that people might realize a domain had done something accidentally wrong or without thinking about spam issues and so might be allowed back in.]
Q. You probably already search a lot, probably mostly with Google. Is it not finding what you want already most of the time, without a flood of spam or crud in your way?
Usually I’m looking for pages on Wikipedia, so they do a good job with that. It depends on the types of searches you are doing. If you’re doing a factual search, then Wikipedia [in the results] would be good. In other areas, I think there’s a strong commercial incentive. Why is it bad if I search for tampa hotels?
[NOTE: I then did this search on Google, which we discussed. I noted I saw plenty of good hotels listed, and that if I clicked through to the local search results, I got an even better experience of hotels listed.
Wales replied that he's often after reviews of hotels, not the hotels themselves. That took me back to the original results, where I pointed out the top listing was from TripAdvisor, exactly the type of review site he mentioned liking -- and that I often found them listed on these types of queries.
I also noted that Google even offers refinement categories at the top of the page similar to the disambiguation he wanted, with lodging guides as one of the categories. Unfortunately for Google, I didn't find that the results from that refinement did a good job bringing back trusted hotel guides.]
Q. Back to transparency. People keep saying they want more of this. But can you name some exact examples of what you want to see? Do you want Google to say that using a term in bold text adds X percent of a score to the ranking criteria? And if you do that, don’t you think spammers will just abuse the recipe that’s been published?
If your search relies on some secret factors that you hope people won’t discover, you haven’t really come up with a good solution to the problem.
Q. Microsoft has spent millions of dollars and years now of effort to try and be a Google killer and hasn’t made it. You’re coming into this fresh with fewer resources and no real prior experience. Can you really do it?
I have no idea. I only do whatever sounds like it is fun.
Q. What type of funding do you have behind this?
Wikia’s initial round was $4 million from a variety of angels, then there was a second round from Amazon, but the amount wasn’t announced.
When I first heard of the plans, I was pretty dubious the project would have much success. For one thing, the idea of an "open source" search engine to take on the world and provide more transparency is old news. Consider this from New Scientist in 2003, back when Nutch first came out:
The project "is about providing free technology that should not be controlled by private, commercial, secretive organisations," says Doug Cutting, veteran web search engineer, and a Nutch founder.
Three years on, nothing has really changed, even though the reasoning behind such a project remains the same. And this is despite Nutch having some big names behind it.
In 2004, Nutch got another round of attention in an ACM article looking at how it works. My comment at that time was:
Interesting read especially for the efforts that are involved to defeat spam. The argument is that though Nutch is open, revealing secrets won’t hurt because spammers will batter down any defenses, no matter how tightly protected. OK, so what will stop spam? Nutch hopes that an open, public discussion may reveal new methods. Perhaps. But the real test will only come if Nutch is deployed by a major, highly-trafficked site. Spammers aren’t going to bother trying the defenses of other places. It’s not worth the time. That’s also a positive for those considering Nutch. If you operate a small, vertical site or just want Nutch to be used on your own content, then spam concerns are much less an issue.
The spam test simply hasn’t happened with Nutch. And every new search engine project I’ve looked at coming in over the years completely underestimates the spam problem they face. When I looked at the Search Wikia site, comments like this almost seemed laughable:
search active for spammer sites
- trying to simulate user-typos (ie. "yaoho.com" rather than "yahoo.com");
see also: Microsoft’s
- blacklist domains, where spammails are linking to; create actively
honeypods to get spam; use a pattern like
<domain-where-we-have-registered>@myhoneypod.com to identify the spam networks; shell the common user get the possibility to register such a mail-adress?
Seek out the spam sites? Hey, don’t worry — if you’re popular, they’ll find you fast enough. And as you blacklist one, two more throwaway domains will show up in their place.
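For what it's worth, the typo simulation that comment describes is the easy part. Here is a rough sketch of one way to do it, assuming the typos of interest are swaps of two adjacent letters, which is how "yahoo.com" becomes "yaoho.com". The function name `transposition_typos` is made up for this illustration, and it expects a bare domain such as "yahoo.com" with no "www." prefix.

```python
def transposition_typos(domain):
    """Return variants of `domain` with two adjacent letters of its name swapped."""
    # Split "yahoo.com" into the name ("yahoo") and the rest (".com").
    name, dot, suffix = domain.partition(".")
    variants = set()
    for i in range(len(name) - 1):
        # Swapping two identical letters would just reproduce the original name.
        if name[i] != name[i + 1]:
            swapped = name[:i] + name[i + 1] + name[i] + name[i + 2:]
            variants.add(swapped + dot + suffix)
    return sorted(variants)


print(transposition_typos("yahoo.com"))
```

Generating the candidates is trivial. Checking each one, deciding what counts as spam and keeping up as throwaway domains are replaced is where the real work lies.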
I also tend to think Wales is badly underestimating the infrastructure challenge. Crawling a big chunk of the web, keeping those pages fresh, ranking them quickly to provide answers and doing so for millions of searches each day isn’t an off-the-shelf commodity.
Still, I find myself oddly hopeful. I don’t think a Google killer will emerge, but perhaps some new ways for a community to be involved with search will come out of it. I wouldn’t have thought Wikipedia would work. Certainly it’s flawed, but it’s also an incredible resource. Maybe something useful will come from the Search Wikia project.
At the very least, I’ve long wanted humans to be back in the role of reviewing queries and actually looking to see if they make sense, rather than so much reliance on algorithms. Maybe the mere concept of the Search Wikia project will encourage the major search engines to do more in this area.
Postscript: Originally I had Wales listed as cofounder of Wikipedia, but he got in touch saying he was the founder. I’d noted that Wikipedia itself lists him as founding (well, creating) it with Larry Sanger. Is Wikipedia incorrect on this, I asked? “Yes, it is wrong,” he emailed. Sanger posts his own views on the origins here.