Wikia, Inc., the for-profit company developing the open source search engine Search Wikia, has acquired Grub, a distributed crawler platform, from LookSmart.
Distributed crawler? Crawlers are software programs used by search engines to roam the web to discover pages that are then downloaded and indexed for searching. The crawlers operated by the major search engines are highly centralized, operating out of massive data centers, and are capable of finding and downloading millions of pages per minute.
Grub, by contrast, taps into the spare power of thousands of personal computers connected to the internet. Volunteers download the Grub client, and then allow it to operate as a background process on their computers—even as a screensaver. While each individual Grub client is a mite compared to a search engine crawler, the collective power of thousands of Grub clients working in tandem can be impressive. “We’re hoping to get lots of people involved to help us crawl the web,” said Jimmy Wales, co-founder and chairman, Wikia, Inc.
Wikia acquired Grub as part of its plan to build a “transparent and open platform for search,” according to Wales. Wikia has transformed Grub into an open source project, allowing developers to add or extend the functionality of the software. Wales called it “the next step” in Wikia’s efforts to build a better search engine. Danny did an interesting Q&A With Jimmy Wales On Search Wikia, where they discussed the details of that project.
I played around with Grub for a few weeks back in 2003, right after LookSmart acquired the technology. It’s fascinating to watch a crawler in action, fetching page after page from all over the web. It’s also an eye-opener, revealing the amazing variety of content on the web, from great sites you’ve never heard of to obviously spammy garbage.
I asked Wales about the spam problem, and he said that it wasn’t a concern at this point. To maximize efficiency and eliminate redundancy, there’s a “master crawl” list that’s broken up into small chunks that are sent to each client. Once a client has crawled a group of URLs, it gets another list, and so on. Wales said that at least initially, that master list would be made up of well-known, “whitelisted” sites.
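For readers curious how that kind of work distribution might look in practice, here is a minimal Python sketch of the idea as Wales described it: split a master list into small chunks and hand a client a new chunk each time it finishes the last one. This is not Grub's actual code or protocol. The names chunk_urls, CrawlCoordinator and run_demo, the example.com URLs, and the round-robin choice of client are all assumptions made up for illustration, and no pages are actually fetched.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Optional, Tuple


def chunk_urls(master_list: Iterable[str], chunk_size: int) -> List[List[str]]:
    """Split the master list into consecutive chunks, skipping duplicate URLs."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    seen = set()
    unique: List[str] = []
    for url in master_list:
        if url not in seen:  # drop repeats so no URL is crawled twice
            seen.add(url)
            unique.append(url)
    return [unique[i:i + chunk_size] for i in range(0, len(unique), chunk_size)]


class CrawlCoordinator:
    """Hypothetical server side: hands out chunks and records finished URLs."""

    def __init__(self, master_list: Iterable[str], chunk_size: int = 2) -> None:
        self._pending: Deque[List[str]] = deque(chunk_urls(master_list, chunk_size))
        self.crawled: List[str] = []

    def next_chunk(self) -> Optional[List[str]]:
        """Return the next unassigned chunk, or None when the list is exhausted."""
        return self._pending.popleft() if self._pending else None

    def report_done(self, chunk: List[str]) -> None:
        """Record that a client has finished crawling a chunk."""
        self.crawled.extend(chunk)


def run_demo() -> Tuple[Dict[str, List[List[str]]], List[str]]:
    # Stand-in "whitelisted" master list; note the deliberate duplicate of /a.
    whitelist = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
        "https://example.com/a",
        "https://example.com/d",
        "https://example.com/e",
    ]
    coordinator = CrawlCoordinator(whitelist, chunk_size=2)
    assignments: Dict[str, List[List[str]]] = {"client-a": [], "client-b": []}
    clients = list(assignments)
    turn = 0
    # Each client asks for work in turn; a real client would fetch the pages
    # before reporting back, which this sketch skips.
    while (chunk := coordinator.next_chunk()) is not None:
        client = clients[turn % len(clients)]
        assignments[client].append(chunk)
        coordinator.report_done(chunk)
        turn += 1
    return assignments, coordinator.crawled


if __name__ == "__main__":
    for client, chunks in run_demo()[0].items():
        print(client, chunks)
```

Because each URL appears in exactly one chunk, no two clients duplicate each other's work, which is the redundancy the master list is meant to eliminate. A real system would also need to handle clients that take a chunk and never report back, something this sketch leaves out.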
Want to help Wikia crawl the web? Download the Grub client. You can run it either as your default screensaver or as a background process. The client is currently available for Windows only. From the download page: “Ozra is pretty sure the Linux client isn’t going to work at all. We need to get it ported to Linux as soon as possible…”
Postscript: I’m not sure the Windows client is working either. It took me three tries to download it; the first two attempts only retrieved fragments of the program that Windows couldn’t figure out how to open. Then, the client kept attempting to contact the host server to get an initial list of URLs to crawl, only to fail repeatedly. Let’s hope the open source community is motivated to start work ASAP!