Google “Knows” About 1 Trillion Web Items
How big is Google’s web index? Google hasn’t said for years and still
isn’t saying — but it is
blogging today about how it "knows" of 1 trillion web-based items. Joy,
we haven’t had a search engine talk about pages it knows about to confuse
things since the Lycos days. Glad to have Google firing the first shot to
take us back into the search engine size wars of old.
As a short refresher, Google used to list the number of pages it had
indexed on its home page. It dropped that count back in September 2005,
after Yahoo (for a short period) had claimed to have indexed more. Both search
engines swapped PR blows over who was bigger, then we got detente when the
count went away.
That was good. Very good. This is because search engine size has long
been used as a substitute for a relevancy metric that doesn’t exist. If a
search engine wants to seem twice as good as a competitor, it need only
trot out a bar chart showing it has twice as many documents as that
competitor. Plus, you toss in the famous haystack metaphor. You can’t find
the needle in the haystack if you’re searching only half the haystack!
But more documents doesn’t mean better relevancy. Indeed, more documents
can make a search engine worse, if it doesn’t have good relevancy. My turn
on the haystack metaphor has always been to say that if I dump the entire
haystack on your head, does that help you find the needle? Chances are, you
just get overwhelmed by a bunch of hay.
Still, size has long been an appealing stat that the search engines would
go to — which in turn would cause search engines to find ways to inflate
the size figures they’d report. Way back in 1996, Lycos talked about the
number of pages it "knew" about, even though these weren’t actually indexed
or made accessible to searchers. Excite was so annoyed that it pushed back
with a page on how to count URLs, which has since been archived.
Now we’ve got Google talking about "knowing" 1 trillion items of content:
Recently, even our search engineers stopped in awe about just how big
the web is these days — when our systems that process links on the web to
find new content hit a milestone: 1 trillion (as in 1,000,000,000,000)
unique URLs on the web at once!
It’s easy to come away with the idea that Google lets you search against
1 trillion documents. That’s not the case, as the post does explain:
We don’t index every one of those trillion pages — many of them are
similar to each other, or represent auto-generated content similar to the
calendar example that isn’t very useful to searchers. But we’re proud to
have the most comprehensive index of any search engine, and our goal
always has been to index all the world’s data.
All I really want is the last part — that Google has what it believes to
be a comprehensive index of the web. I don’t even want them or anyone saying
that they have the "most" comprehensive, given that this is so difficult to
verify. Indeed, consider this from Google’s own post:
So how many unique pages does the web really contain? We don’t know; we
don’t have time to look at them all! :-) Strictly speaking, the number of
pages out there is infinite — for example, web calendars may have a "next
day" link, and we could follow that link forever, each time finding a
"new" page. We’re not doing that, obviously, since there would be little
benefit to you. But this example shows that the size of the web really
depends on your definition of what’s a useful page, and there is no exact
answer.
Right, there’s no exact answer to what’s a useful page — and so in turn,
there’s no one exact answer to who has the "most" of them collected. Tell me
you have a good chunk of the web, and I’m fine. But when Google or any
search engine starts making size claims, my hackles go way up. There are
better things to focus on.
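To make the gap between "known" and "indexed" concrete, here is a minimal Python sketch of the calendar example from Google's post. Everything in it is hypothetical: the toy site, its URLs, and the `links_from`, `content_of`, and `crawl` helpers are invented for illustration and say nothing about how Google's systems actually work. It simply counts every URL a crawler discovers as "known" and counts pages with distinct content as "indexed."

```python
from collections import deque


def links_from(url):
    """Hypothetical toy web: return the outgoing links found on a URL."""
    if url == "site/home":
        return ["site/about", "site/calendar?day=1", "site/about?ref=home"]
    if url.startswith("site/calendar?day="):
        day = int(url.split("=")[1])
        # The endless "next day" link: every calendar page points to a new URL.
        return [f"site/calendar?day={day + 1}"]
    return []


def content_of(url):
    """Hypothetical page content; many URLs share identical content."""
    if url.startswith("site/calendar"):
        return "empty calendar page"
    if url.startswith("site/about"):
        return "about us"
    return "home"


def crawl(seed, max_fetches):
    """Breadth-first crawl with a fetch budget.

    Returns (known, indexed): every URL discovered, and a mapping from
    distinct page content to the first URL that served it.
    """
    known = {seed}
    indexed = {}
    queue = deque([seed])
    fetches = 0
    while queue and fetches < max_fetches:
        url = queue.popleft()
        fetches += 1
        # Keep only one URL per distinct piece of content.
        indexed.setdefault(content_of(url), url)
        for link in links_from(url):
            if link not in known:
                known.add(link)
                queue.append(link)
    return known, indexed


if __name__ == "__main__":
    for budget in (50, 500):
        known, indexed = crawl("site/home", budget)
        print(f"budget={budget}: known={len(known)}, indexed={len(indexed)}")
```

In this toy setup, the "known" count keeps climbing as long as you give the crawler a bigger budget, while the number of distinct pages worth indexing never moves. That is why a "known" figure says so little about what you can actually search.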
For more, see related discussion on Techmeme.