Can any start-up search engine “be the next Google?” Many have wondered this, and today’s launch of Cuil (pronounced “cool”) may provide the best test case since Google itself overtook more established search engines. Cuil provides what appears to be a comprehensive index of the web, offers a unique display presentation, and emerges at a time when people might be ready to embrace a quality “underdog” service.
The big questions now are how the relevancy holds up and whether word-of-mouth can really still build significant share. [Note: The Cuil site was supposed to be live for searches as of 9:01pm Pacific time on July 27, but so far I'm still seeing only a holding page. I'd expect this to change fairly soon.]
Why Care About Cuil?
There’s no end of companies that have been trying to take on Google as a search destination. Earlier this year, my Google Challengers: 2008 Edition article covered some of these, like Hakia, Mahalo, and Search Wikia. You can add to that list other companies like Gigaweb and Exalead. None of them have made a dent in Google’s share.
Indeed, the established players of Yahoo, Microsoft, and Ask.com — all of whom have established quality search products — haven’t dented Google either. So what makes Cuil worthy of special attention?
For one, Cuil has an impressive pedigree with its three founders: Tom Costello of IBM’s WebFountain project, plus Anna Patterson and Russell Power of Google’s TeraGoogle project, Google’s massive search index. Cuil also counts former AltaVista founder Louis Monier — who later went to eBay and then Google — as part of the team.
These people know search. In particular, they know on-the-firing line, heavy duty, industrial strength search. Not only that, they’re unleashing what appears to be a comprehensive service that anyone can use. Indeed, Google already did a blog post in reaction to Cuil and its size claims on Friday, before Cuil even launched or those claims became public. If Google’s paying that much attention, then anyone should.
What Cuil Offers
There are three major areas that Cuil is putting out to distinguish itself from other services. These are:
- Big web index
- Unique relevance algorithm
- Unique results display
I’m going to dive into each of these areas in depth, to examine their importance as well as dissect some of the misconceptions and PR spin that surround them.
Size Wars Return?
Cuil is claiming to have the largest index of the web: 120 billion pages indexed (with a total of 186 billion seen by its crawler; spam and duplicate content are among things excluded from what gets indexed). In talking with them, Cuil estimated they were three times the size of Google. Sounds pretty awesome, right?
Sigh. Yes, size matters. You want to have a comprehensive collection of documents from across the web. But having a lot of documents doesn’t mean you are most relevant. As I wrote back in September 2005, when Google famously dropped the number of documents it had indexed:
Last century, in December 1995 to be exact, AltaVista burst upon the search engine scene with what was at that time a giant index of 21 million pages, well above rivals that were in the 1 million to 2 million range. The web was growing fast, and the more pages you had, the greater the odds you really were going to find that needle in a haystack. Bigger did to some degree mean better.
That fact wasn’t wasted on the PR folks. Games to seem bigger began in earnest. Lycos would talk about the number of pages it “knew” about, even if these weren’t actually indexed or in any way accessible to searchers through its search engine. That irritated search engine Excite so much that it even posted a page on how to count URLs, as you can see archived here.
While size initially DID mean bigger was better, that soon disappeared when the scale of indexes grew from counting millions of pages to tens of millions. Bigger no longer meant better because for many queries, you could get overwhelmed with matches.
I’ve long played with the needle-in-the-haystack metaphor to explain this. You want to find the needle? You need to have the whole haystack, size proponents will say. But if I dump the entire haystack on your head, can you find the needle then? Just being biggest isn’t good enough.
That’s why I and others have been saying don’t fixate on size for as long as 1997 and 1998. Bigger no longer meant better, regardless of the many size wars that continued to erupt. Remember, Google — when it came to popular attention in 1998 and 1999 — was one of the tiniest search engines at around 20 to 85 million pages. Despite that supposed lack of comprehensiveness, it grew and grew because of the quality of its results.
Why have the size wars persisted? Search engines have seen an index size announcement as a quick, effective way to give the impression they were more relevant. In lieu of a relevancy figure, size figures could be trotted out and the search engine with the biggest bar on the chart wins!
Given this history, seeing Cuil trot out size figures is incredibly disheartening and a step backwards, not forwards. Time better spent on other things (such as measuring the RELEVANCY of the results) will instead get consumed by those trying to count pages. Without even running queries and trying to perform comparison counts, I already have issues with Cuil’s claims. For example:
- Cuil told us that Google was at 40 billion documents. According to whom? According to what Cuil says reporters have told it they hear from Google. OK, I talk regularly with both Google and the reporters who cover it. I’ve never heard this figure put out there. After our initial talk, Cuil added that comparison testing makes them believe that Google hasn’t grown.
- Yahoo was said to be at 20 billion. Cuil said this is based on where Yahoo said it was back in 2005, with the assumption that if they’d gotten bigger, they would have announced this. Bad assumption given that since 2005, the search size detente has kept both Google and Yahoo from talking about size figures.
- Microsoft was said to be at 12 billion. Actually, Microsoft said it was at 20 billion last September — but if that hard figure isn’t being used by Cuil, then you start doubting the other ones they’ve put out. In a follow-up, Cuil said they believe Microsoft has fallen back to a smaller index of 12 billion, based on its testing.
We can also start testing in short order, however. Just run a query, see what Google reports as a count for it, then run the same thing on Cuil. If Cuil regularly reports more, they win. Or not. This is what people especially started doing in droves during the last size battle between Google and Yahoo, and then issues about duplicate content and spam started coming up.
Assuming you get beyond that, any advantage Cuil has on the size front right now will be short-lived, if they make size an issue. Google will simply crawl more documents and ensure that whatever Cuil is, Google will be +1.
We asked Cuil about this, why Google wouldn’t just match them. “If they wanted to triple size of their index, they’d have to triple the size of every server and cluster. It’s not easy or fast,” said Patterson.
In a follow-up, Cuil added that Google being as large as they estimated it to be now was largely down to Patterson’s work at Google, and since she’s no longer there, increasing the index size will be a “non-trivial” exercise.
Perhaps. And perhaps the infrastructure that Cuil has built does make it easier for them to index documents from across the web more cheaply than Google. But Google has plenty of money and engineering expertise of its own. It’s foolish to think they wouldn’t counter what might be perceived as a weakness. They responded to Yahoo in 2005; they’d do the same with Cuil. And for what? Even if Cuil is bigger than Google, it doesn’t mean Cuil is more relevant. Nor does it mean adding more documents in an “I’m bigger than you” game would improve the state of search overall.
Unfortunately, Google started reacting to Cuil’s claims even before Cuil made them. In a post on Friday, Google just so happened to decide it was time to mention they “knew” of 1 trillion items on the web. That will confuse some people into thinking Google has indexed 1 trillion documents, even though they don’t say this. What Google did say clearly was:
We don’t index every one of those trillion pages — many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn’t very useful to searchers. But we’re proud to have the most comprehensive index of any search engine, and our goal always has been to index all the world’s data.
My response to Google — and to Cuil — and to any search engine that tries to do the size battle is what I said on Friday:
There’s no exact answer to what’s a useful page — and so in turn, there’s no one exact answer to who has the “most” of them collected. Tell me you have a good chunk of the web, and I’m fine. But when Google or any search engine start making size claims, my hackles go way up. There are better things to focus on.
As a side note, one issue with any large index is keeping it fresh. Cuil says that they crawl 1 to 1.5 billion pages per day, which means it would take roughly 3 to 4 months to refresh the 120 billion pages they’ve indexed, and longer to revisit everything they’ve spidered. However, some important pages are crawled on a weekly basis, they said. That’s good, but Google has pages that can be added in near-real time thanks to its instant layer.
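To sanity-check that refresh estimate, here is a quick back-of-the-envelope sketch in Python. It uses only the figures Cuil gave (120 billion pages indexed, 186 billion seen by the crawler, 1 to 1.5 billion pages crawled per day). The function name, and the assumption that the whole daily crawl budget goes to revisiting known pages, are mine and not anything Cuil described.

```python
def refresh_days(pages: float, pages_per_day: float) -> float:
    """Days needed to revisit every page once at a constant crawl rate."""
    return pages / pages_per_day


INDEXED = 120e9   # pages Cuil says it has indexed
SEEN = 186e9      # pages Cuil says its crawler has seen
FAST_RATE = 1.5e9  # upper end of the stated daily crawl rate
SLOW_RATE = 1.0e9  # lower end of the stated daily crawl rate

for label, pages in (("indexed", INDEXED), ("seen by crawler", SEEN)):
    fastest = refresh_days(pages, FAST_RATE)
    slowest = refresh_days(pages, SLOW_RATE)
    print(f"{label}: {fastest:.0f} to {slowest:.0f} days")
```

Under those assumptions, the indexed set works out to roughly 80 to 120 days per full pass, and the larger set seen by the crawler to roughly 124 to 186 days.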
So Long, PageRank?
Cuil is making a big push that it ranks pages by content, rather than popularity. The idea here is to poke at how Google is commonly viewed as just rewarding pages that have the most PageRank value.
The problem is that PageRank is just part of the way Google ranks pages. It looks at a variety of other factors, so that ranking is not just a popularity contest (see What Is Google PageRank? A Guide For Searchers & Webmasters for more about this).
The other issue is that despite the PR pitch, Cuil is indeed using popularity to rank results, as far as I can tell.
For example, in a search for [harry potter], the Harry Potter & The Order Of The Phoenix movie web site comes up first on Cuil. This is out of thousands of possible pages. How on earth can Cuil know just from the content on the page itself that the movie site should be in the top results, especially in a web environment where people can (and will) custom tailor content to mislead search algorithms?
The answer is link analysis: counting links and effectively seeing who is pointed at the most. The twist is that it is done by measuring the links from pages relevant to what someone searched on.
Let’s go back to the [harry potter] search. When you do that at Cuil, it finds all the pages that it thinks are related to those two words. This means pages that use those words, as well as pages that have other words on them, such as “harry potter books” or “gryffindor.” It figures out these relationships by seeing what type of words commonly appear across the entire set of pages it finds. Since “gryffindor” appears often on pages that also say “harry potter,” it can tell these two words (well, three words, but two different query terms) are related.
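To make the co-occurrence idea concrete, here is a toy sketch in Python. It is not Cuil's code, which I haven't seen. The function name, the sample pages, and the simple "count each word once per matching page" rule are all my own assumptions for illustration.

```python
from collections import Counter


def related_terms(pages, query, top_n=3):
    """Return the words that most often share a page with the query phrase.

    pages: list of page texts (hypothetical sample data).
    query: phrase to look for, matched case-insensitively as a substring.
    """
    query = query.lower()
    query_words = set(query.split())
    counts = Counter()
    for text in pages:
        lowered = text.lower()
        if query not in lowered:
            continue  # only pages that mention the query contribute
        # Count each non-query word once per page, so one page
        # repeating a word many times cannot dominate.
        counts.update(set(lowered.split()) - query_words)
    return [word for word, _ in counts.most_common(top_n)]


sample_pages = [
    "Harry Potter Gryffindor Hogwarts",
    "harry potter gryffindor quidditch",
    "Harry Potter Gryffindor",
    "cooking recipes pasta",
]
print(related_terms(sample_pages, "harry potter", top_n=1))
```

In this made-up sample, “gryffindor” shows up on all three pages that mention “harry potter,” so it comes back as the top related term. A real system would also need to deal with common words, spam and scale, none of which this sketch attempts.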
Cuil then looks at the entire set to see which pages are linked to from them. Those with many or important links are likely to do better. Since the Harry Potter movie page has a lot of links pointing at it, it comes up higher in the results. Cuil even has a name for this — IdeaRank.
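Here is a minimal Python sketch of that kind of subject-specific link counting, as I understand the description. It is a toy under stated assumptions, not IdeaRank itself: the function name, the page structure, the example URLs and the plain "one vote per linking page" scoring are all hypothetical, and it ignores the "important links" weighting entirely.

```python
from collections import Counter


def topical_link_rank(pages, query):
    """Rank on-topic pages by links coming only from other on-topic pages.

    pages: dict mapping url -> {"text": str, "links": list of urls}
           (hypothetical structure for illustration).
    """
    query = query.lower()
    # Step 1: the topical set is every page that mentions the query.
    topical = {url for url, page in pages.items()
               if query in page["text"].lower()}
    # Step 2: count inbound links, but only those cast by topical pages.
    inbound = Counter()
    for url in topical:
        for target in set(pages[url]["links"]):  # one vote per linking page
            if target != url:
                inbound[target] += 1
    # Step 3: most-linked first, with the url as a stable tie-breaker.
    return sorted(topical, key=lambda u: (-inbound[u], u))


sample_web = {
    "movie.example": {"text": "Harry Potter movie site", "links": []},
    "fan1.example": {"text": "harry potter fan page", "links": ["movie.example"]},
    "fan2.example": {"text": "Harry Potter books",
                     "links": ["movie.example", "fan1.example"]},
    "spam.example": {"text": "cheap pills",
                     "links": ["fan2.example", "fan2.example"]},
}
print(topical_link_rank(sample_web, "harry potter"))
```

In the sample, the movie page wins because two on-topic pages link to it, while the links from the off-topic page count for nothing. That is the "twist": the votes still come from links, just from a narrower electorate.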
If this sounds familiar to some people, that’s because this particular flavor of link analysis was popularized by Teoma, which was later acquired by Ask. When Teoma appeared, it tried to distinguish itself against Google by saying it analyzed only the “subject-specific” links to do ranking. This is still played up at Ask today:
Our ExpertRank algorithm goes beyond mere link popularity (which ranks pages based on the sheer volume of links pointing to a particular page) to determine popularity among pages considered to be experts on the topic of your search. This is known as subject-specific popularity. Identifying topics (also known as “clusters”), the experts on those topics, and the popularity of millions of pages amongst those experts — at the exact moment your search query is conducted — requires many additional calculations that other search engines do not perform. The result is world-class relevance that often offers a unique editorial flavor compared to other search engines.
Fair to say, despite Ask’s supposed improved analysis, it never trounced Google. Moreover, there are plenty who assume, including myself, that Google itself does subject-specific link analysis.
So the rank by content twist at Cuil? As I’ve said, more twist than substance. But the content analysis is used in other ways, as I’ll get into next.
Three Column “Magazine” Display
Probably the most dramatic difference between Cuil and Google is how Cuil runs search results across three columns, rather than all in a straight line.
It’s appealing in one sense that you can see more results all at once. In fact, Cuil said in user testing, the display had an impact on which result was seen as “number one.” Some viewed the result in the top left corner as most important; others go to the one in the top right. When all nine results can be seen on a large screen, some assume the one in the middle is best.
In the top right corner, there’s a “Related Searches” box that allows you to refine your search and drill into specific topics.
Somewhat related to this, “tabs” appear at the top of the search page listing related searches.
Select one of these tabs, and you get back results on that particular topic. Tabs that appear reflect how popular those phrases are on the web.
Underneath the display, Cuil is also working to do what it views as a twist. It’s trying to diversify the results by topic, so that in the case of Harry Potter, for example, you might get pages about the book as well as the movie and the author, rather than just results that are all about the book.
“We’re trying to choose pages that aren’t all on the same topic and show you the diversity of the web,” said Patterson.
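One simple way to picture that kind of diversification is the Python sketch below. Cuil didn't explain how it actually does this, so treat the function name, the topic labels and the "first result from each topic goes ahead of repeats" rule as my assumptions, not Cuil's method.

```python
def diversify(results, limit):
    """Reorder results so each topic's best page appears before any repeats.

    results: list of (url, topic) pairs, already sorted by relevance
             (hypothetical input format).
    limit:   how many results to return.
    """
    seen_topics = set()
    first_of_topic, repeats = [], []
    for url, topic in results:
        if topic in seen_topics:
            repeats.append(url)       # topic already represented
        else:
            seen_topics.add(topic)
            first_of_topic.append(url)
    return (first_of_topic + repeats)[:limit]


ranked = [
    ("book-page-1", "book"),
    ("book-page-2", "book"),
    ("movie-page", "movie"),
    ("author-page", "author"),
]
print(diversify(ranked, limit=3))
```

With the sample input, the second book page gets pushed out of the top three in favor of the movie and author pages, which is the effect Patterson describes.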
Again, however, it’s not like a Google search lacks diversity. A search on Harry Potter there also brings back different types of results.
Actually, a far more dramatic example of results diversity is what someone like Hakia does. Unlike Cuil, which divides pages into topics based on word patterns, Hakia is doing real semantic analysis — trying to understand what words actually mean and what pages are about. As a result, it groups results for [harry potter] into various categories such as:
- The Books
- News & Interviews
- The Soundtrack
- The Games
- Photographs & Pictures
- Blogs & Fan Sites
- Myths & Controversies
Over at Mahalo, you get similar groupings — though these are done through human work rather than through concept analysis.
Where are the ads? At launch, there will be some public service ads at the bottom of the page, so that people get used to ads being in that spot. As for revenue-generating ones, the company is considering whether it should build its own ad network or partner with someone else.
Related to display, Cuil automatically suggests search topics as you type into the box on its home page, queries that come from looking at the most popular related words from across the web. It will also suggest actual web sites to take you to, showing an icon next to their name.
The Privacy Card
Cuil says that it’s not logging IP information or keeping any type of material that could be traced back to individual searchers. In contrast, all three major search engines do log IP addresses plus cookie searches and, in the case of Google, even allow searchers to store search history over time.
That may be reassuring to some searchers, but to date, even scare stories about what Google could do (not that it does) haven’t kept searchers away from it. Ask and Microsoft have both tried to play the privacy card against Google and gained nothing for it. Small player Ixquick has been playing the card even longer and has gotten no visible traction out of it.
Another issue is that by not allowing searchers to voluntarily allow for personalized results, Cuil might be missing out on an advancement in search where Google’s ahead of the pack.
Under The Hood
Behind the scenes, Cuil talks about an infrastructure that’s designed to be faster and more efficient than those at Google or the competing search engines. Supposedly, that means Cuil can do things cheaper and better than the other players.
I’ll leave this to others who are tech heads to dissect more in the future. For my part, I’ll just say that I’ve heard this line many times over the years. AltaVista would say how it was better in architecture than Lycos. Inktomi would then talk about how it was more distributed and cheaper than AltaVista. Then FAST would say it was even more distributed than Inktomi. And Google, you know, was so with it that you could break circuit boards and things just kept working.
I’d also get briefed on how super-duper the Google infrastructure was, for example, then a few years later, there would be a completely new one introduced that lets them do things not possible under the old one.
So call me jaded. Everyone’s always saying they’ve got the best, fastest and least expensive way of doing stuff. If they do, proving it has been difficult, nor has it seemed to make much difference in market share.
I’ll leave with a few stats, however. Right now Cuil runs off of two data centers, using a combined 1,000 machines each running 8 CPUs. Another 280 machines split between data centers are used to serve results rather than index the web.
Does something seem missing with Cuil’s name? They lost the second L that was part of their original name, Cuill.
“We had a moment of silence for the departing L,” said Patterson, explaining that people found it easier to remember the name with one L rather than two.
The name means “wisdom” in Gaelic, from the legend of Finn MacCuil. That brings in another Teoma connection. Teoma is apparently a Gaelic word for “cunning.”
Will Cuil Succeed?
Does Cuil think it can beat Google in the search space? They won’t come out and say that, but you can’t help but feel that’s the goal when talking to them. How about beating at least Microsoft or Yahoo?
“We’ll take that for a start. If we do that in the next year and a half, I’ll be an extremely happy person,” Patterson said.
But is even that realistic? Microsoft and Yahoo themselves both have mature search products. What’s especially important is that the big three also offer more than the web search that Cuil is providing at launch. News search, image search, video search, local search — these are just some of the verticals that Cuil lacks but which do get used by searchers. Not offering these makes Cuil feel too focused on what “old school” search used to be and missing out on the Search 3.0 vertical and blended search revolution that has been going on.
Clustering of pages; subject-specific link analysis — these are things others have tried and gained no market share with. Having a comprehensive index is great, but no one prior to launch has been able to play with the service and measure core relevancy. Cuil itself said it had no metrics to show it is more relevant than Google. So why would anyone think it could gain share?
I tend to think Cuil’s hitting the timing right to pick up a little share, maybe a point or two (which is huge compared to other start-ups). I don’t think word-of-mouth is dead, and I’m cautiously optimistic that Cuil will have good relevancy even without having tried it yet (if not, all bets are off). I think you’ve still got a core of early adopters and tech geeks out on the web who want the “next Google,” especially at a time when the existing Google seems so big and threatening to others. A good underdog can fill a need, if it’s a quality underdog, and neither Yahoo nor Microsoft has that underdog spirit.
It’s possible that Cuil could be a wild success that eclipses Yahoo and Microsoft and does threaten Google itself, of course. Anything’s possible. But I think that’s unlikely for reasons I wrote before:
Google came along at a very special time, as I’ve long written. It had better technology at a time when all the search engines had abandoned improving search, since that was seen as a loss leader. The money was in portal features.
Today, search is a multi-billion dollar industry. If someone with a serious search threat comes along, you buy them (such as with YouTube), or you start to develop your own rival if it seems a real threat. Google’s not omnipotent — but you’ve already got a space where it’s Google, Yahoo, Microsoft, and Ask all seriously fighting it out (and the latter three, despite their funding and experience, still struggle against Google as being synonymous as a trusted search brand for most users).
To date, Google is the real exception of “a better mousetrap wins.” It’s far more likely the companies above, if they do gain traction, will end up being purchased for a large amount by one of the existing “search utility companies.”
Is Cuil open to being purchased, as happened to search start-up Powerset earlier this month? Powerset was seen by many as a Google-killer (though not by me and several others). In the end, despite the hype that Powerset itself helped fuel (Cuil’s been careful to avoid this), it got gobbled up. What if Microsoft’s Steve Ballmer came knocking?
“I don’t know what he’d be knocking for, whether it be acquisition or partnership or whatever. We do intend on being polite. We believe in getting to know people and making friends because you never know what deal may come down the line,” Patterson said.
For related discussion, see Techmeme.
Postscript: See Cuil Fast Test – Relevancy Isn’t A Google Killer.