How big is Google’s web index? Google hasn’t said for years and still isn’t saying, but it is blogging today about how it "knows" of 1 trillion web-based items. Joy, we haven’t had a search engine talk about pages it knows about to confuse things since the Lycos days. Glad to have Google firing the first shot to take us back into the search engine size wars of old.
As a short refresher, Google used to list the number of pages it had indexed on its home page. It dropped that count back in September 2005, after Yahoo (for a short period) had claimed to have indexed more. Both search engines swapped PR blows over who was bigger, then we got detente when the count went away.
That was good. Very good. This is because search engine size has long been used as a substitute for a relevancy metric that doesn’t exist. If a search engine wants to seem twice as good as a competitor, it need only trot out a bar chart showing it has twice as many documents as that competitor. Plus, you toss in the famous haystack metaphor. You can’t find the needle in the haystack if you’re searching only half the haystack!
But more documents doesn’t mean better relevancy. Indeed, more documents can make a search engine worse, if it doesn’t have good relevancy. My turn on the haystack metaphor has always been to ask: if I dump the entire haystack on your head, does that help you find the needle? Chances are, you just get overwhelmed by a bunch of hay.
Still, size has long been an appealing stat that the search engines would go to, which in turn would cause them to find ways to inflate the size figures they’d report. Way back in 1996, Lycos talked about the number of pages it "knew" about, even though these weren’t actually indexed or made accessible to searchers. Excite was so annoyed that it pushed back with a page on how to count URLs, as you can see archived here.
Now we’ve got Google talking about "knowing" 1 trillion items of content out there:
Recently, even our search engineers stopped in awe about just how big the web is these days — when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!
It’s easy to come away with the idea that Google lets you search against 1 trillion documents. That’s not the case, as the post does explain:
We don’t index every one of those trillion pages — many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn’t very useful to searchers. But we’re proud to have the most comprehensive index of any search engine, and our goal always has been to index all the world’s data.
All I really want is the last part — that Google has what it believes to be a comprehensive index of the web. I don’t even want them or anyone saying that they have the "most" comprehensive, given that this is so difficult to verify. Indeed, consider this from Google’s own post:
So how many unique pages does the web really contain? We don’t know; we don’t have time to look at them all! :-) Strictly speaking, the number of pages out there is infinite — for example, web calendars may have a "next day" link, and we could follow that link forever, each time finding a "new" page. We’re not doing that, obviously, since there would be little benefit to you. But this example shows that the size of the web really depends on your definition of what’s a useful page, and there is no exact answer.
Right, there’s no exact answer to what’s a useful page — and so in turn, there’s no one exact answer to who has the "most" of them collected. Tell me you have a good chunk of the web, and I’m fine. But when Google or any search engine starts making size claims, my hackles go way up. There are better things to focus on.
For more, see related discussion on Techmeme.