• http://www.mattcutts.com/blog/ Matt Cutts

    Wow–that’s a really deep look, right down to the link that mentions that back in 2003, Nutch was getting funding from Overture, a Google competitor that was eventually bought by Yahoo.

    I’m glad that someone finally asked about the fact that [tampa hotels] is really rather clean, with lots of variety: several specific hotels, several overview sites, and local search results and a Plusbox map.

    I respect Jimmy’s opinion that he’d rather have “hubs” with reviews of hotels rather than the hotels themselves. The pendulum hasn’t always swung in that direction though. I remember when a different search engine (now absorbed by someone else, and named for a Gaelic word) was making the rounds talking about the query “grocery stores” and how their search engine returned individual grocery stores instead of directories or hubs that listed grocery stores. At the time, Google leaned more toward returning overview/hub sites, but we had the ability to change that. So that’s something that existing search engines can adjust if they think users prefer a different mix; I know because we’ve changed that balance in the past.

  • http://monetizethis.blogspot.com NaplesDave

    Interesting!

    Would like to see how they will juggle user trust with user intervention. However, I am all for letting users tailor entries as with Wikipedia. Seems to be a source everyone trusts.

  • http://sethf.com/ Seth Finkelstein

    One thing many other open-source projects haven’t had is the marketing muscle to build a user-base and get donated development. Wikia has much more of this than many other efforts.

    While Wikia may be underestimating the resources required, I also think they have more resources available than most other open-source projects who have attempted the task.

    Plus, one of their sleeper strengths is they’ve developed an amazing ability to neutralize criticism of quality control failures. That also sets them apart from many other projects.

    So while I’m usually skeptical too, they do have some decent answers to the obvious question of why they have a chance of succeeding when so many others have failed.

  • http://www.xanersoft.com Web Design

    YES, I was looking for this info for many days. But to be very frank, I have doubts about whether Search Wikia will really beat Google.

    Also, an open-source project like this needs a lot of resources; I hope Wikia knows everything it requires.

    Anyway, thanks for publicizing the talk. Let’s see what happens.

  • Lukas

    Good article!
    Even if they don’t manage to dethrone Google, the *VERY* positive message I can read between the lines is that they plan on using Lucene and Nutch (and Hadoop could follow in a few minutes, right? :-). As a Nutch/Lucene/Hadoop hobbyist I can only loudly call Bravo! (provided they contribute their code back to the Apache repository). Then the real winner can be the open-source community.

  • http://seoptimization.blog.com/ ★ ★ Search Engines WEB ★ ★

    If possible, Wikipedia should make its ongoing detailed stats public. It would be interesting if Wales would consider it ASAP.

    The stats that are currently available are generalized.

    This is exceptionally important because Wikipedia is now on page one or page two for many competitive terms in Google and Yahoo.
    ( including moving to #1 for the term:
    SEARCH ENGINE OPTIMIZATION )

    This would offer a unique opportunity to analyze what percentage of traffic each of the top search engines brings – and also approximately how much traffic competitive keywords bring.

    Also, a cumulative overview for 2006 would be interesting.

  • http://www.comstockfilms.com/blog/tony Tony Comstock

    Just three days ago I wrote to Matt Cutts:

    “I’m no expert, but it’s hard for me to imagine what sort of algorithm would be able to distinguish the highly entertaining, very intelligent, but often utterly filthy Pretty Dumb Things (apparently still in the Google penalty box) from run-of-the-mill sites that use similar language in similar quantities, and even in similar, but tremendously less artful ways.”

    So you can imagine how amused I am to see that human-powered search is the topic of the day.

    Thankfully, PDT appears to be out of the Google penalty box. Unfortunately, Comstock Films is back in. As of this morning, sites featuring photos of women fellating dogs and copulating with horses are out-ranking (out-ranking by pages) Comstock Films on Google searches for ‘couples porn’. (We make award-winning films about the intimate aspects of couples in long-term relationships.)

    At this point I’m thinking this is neither a bug nor an anti-sex bias at Google. I think the explanation is that the googlebot is one kinky mo’fo’!

  • http://www.acclivitymarketing.com/blog Solomon Rothman Web Design Search Engine Marketing

    If Search Wikia can even produce usable results (as in not chock-full of spam or outdated sites) they’ll gain an almost immediate following. Since Google has become so large and powerful, it’s attracted a large amount of criticism arising solely out of the fear of the power and control Google possesses over the online market and the world economy.

    Just losing your rankings on a few prime keywords can cost a company hundreds of thousands of dollars, and due to the closed nature of Google’s algorithms it’s inevitable that someone will claim Google is manipulating things “unfairly” or, more accurately, “un-ideally” for the user. This belief will directly lead to a huge following of users who will support Search Wikia and use it as a replacement for Google even if it produces inferior results.

    Search Wikia doesn’t have to produce a product superior to or even on par with Google. It just has to give the proponents of open-source philosophy what they want: to feel like they’re positively contributing to their search experience and that they (collectively, of course) have control and assurances over the fairness of the results.

    And I say “hell yeah.” Chaos refines order (eventually) and competition yields innovation.

  • http://www.bessed.com AdamJusko

    It’s an interesting project, something similar to what we’re already doing at Bessed (http://www.bessed.com). However, unlike what it sounds like Wales may be up to, we’re using paid editors and have built our site on WordPress to give visitors a say on search results (via comments) without completely giving up the keys to the castle.

    Our issue is scalability–how much can humans really cover in terms of the billions of searches possible? In Wikiasari’s case (or whatever Wales’ final search site is called), there is the challenge of getting volunteers to work on a search engine that is for-profit, something very different than working on the non-profit Wikipedia. It’s fun to voluntarily add to the Wikipedia entry for The Oak Ridge Boys or Carl Sagan—is it fun to voluntarily gather a list of landscaping companies in Springfield, Illinois, as you’d expect would be necessary for a new human-powered engine? (Again, this is part of why Bessed uses paid editors; no one wants to do boring stuff for free.)

    I’m glad to see human-powered search in the news, though; it gives me a chance to blow Bessed’s horn and invite Webmasters to submit their sites.

  • http://krugle.com Ken Krugler

    After two years of working with Nutch 0.7 and 0.8, I’d have to strongly agree with Danny’s comment about how easy it is to underestimate the level of effort required to do a good job of regularly crawling lots of pages, filtering out spam, and quickly serving up high quality results.

    Many of the problems require technical solutions – e.g. you want the hit summarizer to ignore sidebars. Not ridiculously hard to do, but it’s one of countless small programming tasks that *somebody* has to define/code/test.

    So when I read posts on the wikia mailing list about P2P vs. centralized search, the semantic web, etc. I have to laugh. Those aren’t the things that make or break you; it’s fixing the 1000 little problems that constantly show up.
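
    A minimal sketch of the kind of small task Ken mentions: a hit summarizer that ignores sidebars and navigation when building a snippet. This is not how Nutch does it. The tag list, the class/id hints, and the names `MainTextExtractor` and `make_snippet` are all assumptions made for illustration, and it uses only Python’s standard `html.parser`.

```python
from html.parser import HTMLParser

# Assumed heuristics: which elements count as boilerplate is a guess.
SKIP_TAGS = {"nav", "aside", "header", "footer", "script", "style"}
SKIP_HINTS = ("sidebar", "navbar", "menu")
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not inside a boilerplate-looking element."""

    def __init__(self):
        super().__init__()
        self.stack = []       # (tag, starts_skipped_region) per open element
        self.skip_depth = 0   # > 0 while inside any skipped region
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        hints = " ".join(v or "" for k, v in attrs if k in ("class", "id")).lower()
        is_skip = tag in SKIP_TAGS or any(h in hints for h in SKIP_HINTS)
        self.stack.append((tag, is_skip))
        if is_skip:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self.stack):
            return  # stray closing tag, ignore it
        # Pop back to the matching element, closing unclosed children too.
        while self.stack:
            open_tag, is_skip = self.stack.pop()
            if is_skip:
                self.skip_depth -= 1
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def make_snippet(html, query, width=60):
    """Return a window of main-content text around the first query match."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    pos = text.lower().find(query.lower())
    if pos < 0:
        return text[:width]
    start = max(0, pos - width // 2)
    return text[start:start + width]
```

    Even this toy version shows why the work adds up: real pages mark their sidebars in countless inconsistent ways, so each heuristic needs its own testing and tuning.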

  • http://www.mattcutts.com/blog/ Matt Cutts

    Ken, is that where Krugle the code search engine gets its name–from your last name? I wondered about that. :)

    Yup, I remember talking about navbars and boilerplate and their impact on snippets way back a few years ago. There really are a lot of little things to get right for a search engine. :)

  • gary price

    Danny, you are correct.

    Ask.com offers Zoom related search, contextually based search suggestions (and often related names). More here. Zoom related search can not only serve to help with narrowing and expanding a query but can also, in some cases, serve as a knowledge discovery tool.

    Ask also offers other types of disambiguation tools. Here are a few examples:

    + Rock Concerts New York City.
    Here, at the top of the results page, you see material from AskCity. Note the disambiguation pull-down boxes providing the searcher options to narrow to a specific borough of NYC.

    + Zip Codes, Springfield
    Springfield, MA is listed first but directly below it, a pull-down menu listing other cities and towns named Springfield in the U.S. is seen on the page. Area code Columbus does the same type of thing.

    + A search for MSFT asks the searcher whether they want information about Microsoft or the latest stock quote.

    + A search for Rocky. First, you’re prompted if you want information on Rocky, the movie. Then, if selected, you’ll find info on the latest Rocky release (Rocky Balboa) and you’ll also see a pull-down menu to go directly to info about Rocky 1 – Rocky 5.

    + Of course, a search for Miami, FL provides a “Smart Answer” with direct links to various types of info directly at the top of the page.
    http://www.ask.com/web?q=miami%20fl&o=0&l=dir

    + The new prototype/beta, AskX, offers even more info on a results page in a three column format. Here’s a search for Chicago. Everything from the local time to information about the band. Same type of thing when you search Boston.

    + Btw, if you like this concept, WikiWax from SurfWax offers the same type of thing using dynamically generated suggestions from Wikipedia’s subject headings.

    + Clusty’s dynamically generated clusters can also serve both as a tool to narrow and focus and as an info discovery vehicle. You can also see various types of clusters with their Clustermed service.
    http://www.clustermed.info

  • cutting

    As Ken mentions, there are scores of minor engineering tasks involved in maintaining a search engine, but that’s the more tractable part. The trickier part is maintaining high quality results. The major search engines all employ people to evaluate search results and use these to train their search algorithms. This is an expensive process, with no open-source starting points. It is typically blind: evaluators are presented unranked results from unknown search engines and asked to indicate which are relevant to the query, a query that the user did not submit. It seems unlikely that volunteers will be interested in this task. And it is very different from Wikipedia. Raising a volunteer army for search quality is a research task. An interesting one, and one that I’m glad to see someone attack.
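
    The blind evaluation described above can be sketched roughly as follows: pool the results from several engines, strip the source and rank, shuffle, collect relevance judgments, and only then score each engine. This is an illustrative sketch, not any search engine’s actual process; the function names and the precision metric are assumptions.

```python
import random


def build_blind_pool(results_by_engine, seed=None):
    """Merge every engine's results into one deduplicated, shuffled list.

    The evaluator sees only this list, so neither the source engine nor
    the original rank of a result is visible.
    """
    pool = sorted({url for urls in results_by_engine.values() for url in urls})
    random.Random(seed).shuffle(pool)
    return pool


def precision_by_engine(results_by_engine, judgments):
    """Score each engine as the fraction of its results judged relevant.

    `judgments` maps url -> bool; unjudged results count as not relevant.
    """
    scores = {}
    for engine, urls in results_by_engine.items():
        if not urls:
            scores[engine] = 0.0
            continue
        relevant = sum(1 for url in urls if judgments.get(url, False))
        scores[engine] = relevant / len(urls)
    return scores
```

    The code is the cheap part. The expense lies in gathering enough judgments, across enough queries, for the scores to mean anything.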

  • http://dmoz.org/profiles/chris2001.html chris2001

    Raising volunteer forces for this task is indeed an interesting challenge – but the idea is not really new. One possible way to do it has already been found and realized: ODP’s volunteer editors have scanned the web for quality content since 1998, and the result of our work is Open Content – freely available for everybody who needs a large amount of human-reviewed data, be it as initial input for a new search tool, or for a rather long list of other purposes, from enriching results to ranking and quality control.
    As the Nutch documentation explicitly refers to the ODP RDF dump, I guess I don’t have to go into detail about what advantages this might have for an Open Source search engine – you probably know a lot more about it than I do ;-)

    If you are looking into methods to measure quality and other properties of search engines without human testers, you might find the following papers interesting:
    - Random Sampling from a Search Engine’s Corpus by Ziv Bar-Yossef and Maxim Gurevich (via SEO by the Sea, which has additional links).
    - Using Titles and Category Names from Editor-driven Taxonomies for Automatic Evaluation by S. Beitzel, E. Jensen, A. Chowdhury, D. Grossman.

  • http://www.centiare.com thekohser

    In my opinion, Jimmy Wales is quite the hypocrite, and I’ve seen that he’ll attempt to change history in ways that rival Stalin’s efforts.

    How is it that Larry Sanger was “co-founder” of Wikipedia in the project’s own press releases — for years — but then he’s demoted after Sanger and Wales have a falling out?

    How is it that Wales has said of Wikipedia, “We try really hard to deal with customer service complaints. We don’t allow libel. We’re not a wide-open free speech forum that allows people to post whatever. We’re happy to delete rants and things like that as necessary. I think that’s part of the reason why we haven’t been sued,” but when MyWikiBiz followed his recommendation to the letter (http://www.nabble.com/MyWikiBiz-t2080660.html), Wales still banned the MyWikiBiz User account, AND posted a libelous message on that user’s page?

    How is it that Wales has been on a rampage, railing against companies editing their own space in Wikipedia, but when his partner Angela Beesley edits “her” article about Wikia in Wikipedia, everyone pooh-poohs it, saying her edits aren’t biased?

    How is it that Wales orders the “nofollow” tag to be applied on all external links from Wikipedia, but somehow (miraculously) the “interwiki” template links that point to Wikia.com (there are thousands of them) are exempt from the “nofollow”?

    Isn’t this all convenient? Or, could it just be that Wales is a big hypocrite? I’m biased in my assessment, but you look at the facts, and you make the call.

  • Pozycjonowanie

    Thanks for the very interesting article. Can I translate your article into Polish and publish it on my weblog? I will come back here and check your answer. Keep up the good work. Greetings
    Pozycjonowanie

  • http://bestproxy.info Proxy

    Yeah, why not? You can publish it on your blog.

  • http://www.wyliczanka.pl Konin

    Very interesting article. Good work. In my opinion, other open-source projects need a lot of resources. Greetings. My site: Konin