Search Wikia Takes Steps To Crawl; Acquires Grub

Wikia, Inc., the for-profit company developing the open source search engine Search Wikia, has acquired Grub, a distributed crawler platform, from LookSmart.

Distributed crawler? Crawlers are software programs used by search engines to roam the web to discover pages that are then downloaded and indexed for searching. The crawlers operated by the major search engines are highly centralized, operating out of massive data centers, and are capable of finding and downloading millions of pages per minute.

Grub, by contrast, taps into the spare power of thousands of personal computers connected to the internet. Volunteers download the Grub client, and then allow it to operate as a background process on their computers—even as a screensaver. While each individual Grub client is a mite compared to a search engine crawler, the collective power of thousands of Grub clients working in tandem can be impressive. “We’re hoping to get lots of people involved to help us crawl the web,” said Jimmy Wales, co-founder and chairman, Wikia, Inc.

Wikia acquired Grub as part of its plan to build a “transparent and open platform for search,” according to Wales. Wikia has transformed Grub into an open source project, allowing developers to add to or extend the functionality of the software. Wales called it “the next step” in Wikia’s efforts to build a better search engine. Danny did an interesting Q&A With Jimmy Wales On Search Wikia where they discussed the details of that project.

I played around with Grub for a few weeks back in 2003, right after LookSmart acquired the technology. It’s fascinating to watch a crawler in action, fetching page after page from all over the web. It’s also an eye-opener, revealing the amazing variety of content on the web, from great sites you’ve never heard of to obviously spammy garbage.

I asked Wales about the spam problem, and he said that it wasn’t a concern at this point. To maximize efficiency and eliminate redundancy, there’s a “master crawl” list that’s broken up into small chunks that are sent to each client. Once a client has crawled a group of URLs, it gets another list, and so on. Wales said that at least initially, that master list would be made up of well-known, “whitelisted” sites.
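The chunked "master crawl" list described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Grub's actual protocol: the `CrawlCoordinator` class, the `next_chunk` method, the chunk size, and the example URLs are all assumptions made up for this sketch.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class CrawlCoordinator:
    """Hypothetical coordinator that hands out non-overlapping URL chunks."""

    def __init__(self, master_list: List[str], chunk_size: int = 2) -> None:
        # Drop duplicate URLs (keeping first-seen order) so no page is
        # assigned twice.
        unique = list(dict.fromkeys(master_list))
        # Break the master list into small chunks, one per client request.
        self._pending: Deque[List[str]] = deque(
            unique[i:i + chunk_size] for i in range(0, len(unique), chunk_size)
        )
        self._assigned: Dict[str, List[str]] = {}
        self.completed: List[str] = []

    def next_chunk(self, client_id: str) -> Optional[List[str]]:
        """Mark the client's previous chunk as crawled, then give it another."""
        finished = self._assigned.pop(client_id, None)
        if finished:
            self.completed.extend(finished)
        if not self._pending:
            return None  # Nothing left on the master list.
        chunk = self._pending.popleft()
        self._assigned[client_id] = chunk
        return chunk


if __name__ == "__main__":
    # Placeholder "whitelisted" sites; these are not real crawl targets.
    whitelist = [
        "http://a.example/",
        "http://b.example/",
        "http://a.example/",  # duplicate, removed by the coordinator
        "http://c.example/",
    ]
    coordinator = CrawlCoordinator(whitelist, chunk_size=2)
    print(coordinator.next_chunk("client-1"))
    print(coordinator.next_chunk("client-2"))
```

A real system would also need to handle clients that disappear without reporting back, for example by returning their chunk to the queue after a timeout. The sketch leaves that out.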

Want to help Wikia crawl the web? Download the Grub client. You can run it either as your default screensaver, or as a background process. Currently available for Windows only—from the download page: “Ozra is pretty sure the Linux client isn’t going to work at all. We need to get it ported to Linux as soon as possible…”

Postscript: I’m not sure the Windows client is working either. It took me three tries to download—the first two attempts only downloaded fragments of the program that Windows couldn’t figure out how to open. Then, the client kept attempting to contact the host server to get an initial list of URLs to crawl, only to fail repeatedly. Let’s hope the open source community is motivated to start work asap!




About The Author: (@CJSherman) is a Founding Editor of SearchEngineLand.com and President of Searchwise LLC, a Boulder, Colorado-based web consulting firm. He also programs and co-chairs the Search Marketing Expo (SMX) conference series.

  • http://jeremie.com jeremie

    We’re working on getting the client working in test mode, but yes, lots needs to be fixed :)

  • wolf

    http://www.faroo.com goes even a step further: distributed crawling and distributed search.

  • hoodmonkey

    So what stops these clients from downloading illegal material and harmful software to end users’ computers?

    Even in a protected environment, my ISP might like to know why I’ve been downloading gigabytes of information over the past week which may or may not have included questionable adult material. They might also question my use of the Wikia client, but that’s another story.

  • http://www.widgetlogic.com JasonD

    It sounds nice to have an open source spidering engine, but will the backend infrastructure be open sourced, and will the collected data be made available to the community as well?

  • http://sethf.com/ Seth Finkelstein

    I downloaded the Linux version – it seems to be from December 2002! Is it really worth putting effort into it?

  • http://neosmart.net/blog/ Computer Guru

    I think this is disgusting. Wikipedia exploits end-users to generate content and links, but that’s one thing seeing as Wikipedia is non-profit and all.

    But with WIKIA using end users’ internet connections and power bills (both of which already cost too much), it’s become Jimmy Wales exploiting the human race to fatten his wallet.

    Wikia is a for-profit corporation, people! Don’t fall for this gimmick, support a REAL distributed computing project like Folding@Home or Seti@Home and help the world at large, not just Jimmy Wales’s plans to buy a small country!

    http://neosmart.net/blog/2007/wikias-outrageous-exploitation-of-the-human-race/
