Jul. 27, 2007 at 12:30pm Eastern by Chris Sherman

Search Wikia Takes Steps To Crawl; Acquires Grub

Wikia, Inc., the for-profit company developing the open source search engine Search Wikia, has acquired Grub, a distributed crawler platform, from LookSmart.

Distributed crawler? Crawlers are software programs used by search engines to roam the web to discover pages that are then downloaded and indexed for searching. The crawlers operated by the major search engines are highly centralized, operating out of massive data centers, and are capable of finding and downloading millions of pages per minute.

Grub, by contrast, taps into the spare power of thousands of personal computers connected to the internet. Volunteers download the Grub client, and then allow it to operate as a background process on their computers—even as a screensaver. While each individual Grub client is a mite compared to a search engine crawler, the collective power of thousands of Grub clients working in tandem can be impressive. "We're hoping to get lots of people involved to help us crawl the web," said Jimmy Wales, co-founder and chairman, Wikia, Inc.

Wikia acquired Grub as part of its plan to build a "transparent and open platform for search," according to Wales. Wikia has transformed Grub into an open source project, allowing developers to add to or extend the functionality of the software. Wales called it "the next step" in Wikia's efforts to build a better search engine. Danny did an interesting Q&A With Jimmy Wales On Search Wikia where they discussed the details of that project.

I played around with Grub for a few weeks back in 2003, right after LookSmart acquired the technology. It's fascinating to watch a crawler in action, fetching page after page from all over the web. It's also an eye-opener, revealing the amazing variety of content on the web, from great sites you've never heard of to obviously spammy garbage.

I asked Wales about the spam problem, and he said that it wasn't a concern at this point. To maximize efficiency and eliminate redundancy, there's a "master crawl" list that's broken up into small chunks that are sent to each client. Once a client has crawled a group of URLs, it gets another list, and so on. Wales said that at least initially, that master list would be made up of well-known, "whitelisted" sites.
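To make the chunking scheme Wales describes more concrete, here is a minimal Python sketch of how a coordinator might split a master list and hand pieces to clients. It is an illustration only, not Grub's actual code: the Coordinator class, its method names, the chunk size, and the example URLs are all assumptions made for this example.

```python
from collections import deque


def split_into_chunks(urls, size):
    """Break a list of URLs into consecutive chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


class Coordinator:
    """Hypothetical server-side view of a master crawl list (not Grub's real API)."""

    def __init__(self, master_list, chunk_size):
        # Remove duplicate URLs while preserving order, so no page is crawled twice.
        unique = list(dict.fromkeys(master_list))
        self.pending = deque(split_into_chunks(unique, chunk_size))
        self.assigned = {}  # client id -> chunk currently being crawled
        self.crawled = []   # URLs that clients have reported as finished

    def next_chunk(self, client_id):
        """Hand the next unassigned chunk to a client, or None if none remain."""
        if client_id in self.assigned:
            raise ValueError(f"{client_id} has not finished its current chunk")
        if not self.pending:
            return None
        work = self.pending.popleft()
        self.assigned[client_id] = work
        return work

    def report_done(self, client_id):
        """Record a client's chunk as crawled so it can request another."""
        self.crawled.extend(self.assigned.pop(client_id))


# Example run with a made-up whitelist of five distinct URLs (one duplicate).
master = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/c",
    "http://example.com/a",
    "http://example.org/d",
    "http://example.net/e",
]
coordinator = Coordinator(master, chunk_size=2)
first = coordinator.next_chunk("client-1")
second = coordinator.next_chunk("client-2")
coordinator.report_done("client-1")
third = coordinator.next_chunk("client-1")
```

Because every chunk leaves the pending queue the moment it is assigned, no two clients receive the same URLs, which is the redundancy the master list is meant to eliminate. A real system would also need to reassign chunks from clients that disappear mid-crawl, which this sketch leaves out.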

Want to help Wikia crawl the web? Download the Grub client. You can run it either as your default screensaver, or as a background process. Currently available for Windows only—from the download page: "Ozra is pretty sure the Linux client isn't going to work at all. We need to get it ported to Linux as soon as possible..."

Postscript: I'm not sure the Windows client is working either. It took me three tries to download—the first two attempts only downloaded fragments of the program that Windows couldn't figure out how to open. Then, the client kept attempting to contact the host server to get an initial list of URLs to crawl, only to fail repeatedly. Let's hope the open source community is motivated to start work asap!

[Image: grub-error.jpg]

Reader Comments

We're working on getting the client working in test mode, but yes, lots needs to be fixed :)

www.faroo.com goes even a step further: distributed crawling and distributed search.

Comment by wolf [TypeKey Profile Page] | July 27, 2007 1:21 PM

So what stops these clients from downloading illegal material and harmful software to end users' computers?

Even in a protected environment, my ISP might like to know why I've been downloading gigabytes of information over the past week which may or may not have included questionable adult material. They might also question my use of the Wikia client, but that's another story.

Comment by hoodmonkey [TypeKey Profile Page] | July 27, 2007 2:12 PM

It sounds nice to have an open source spidering engine, but will the backend infrastructure be open sourced, and will the collected data be made available to the community as well?

I downloaded the Linux version - it seems to be from December 2002! Is it really worth putting effort into it?

I think this is disgusting. Wikipedia exploits end-users to generate content and links, but that's one thing seeing as Wikipedia is non-profit and all.

But with WIKIA using end users' internet connections and power bills (both of which already cost too much), it's become Jimmy Wales exploiting the human race to fatten his wallet.

Wikia is a for-profit corporation, people! Don't fall for this gimmick, support a REAL distributed computing project like Folding@Home or Seti@Home and help the world at large, not just Jimmy Wales's plans to buy a small country!

http://neosmart.net/blog/2007/wikias-outrageous-exploitation-of-the-human-race/
