Google “Knows” About 1 Trillion Web Items

Danny Sullivan on
  • How big is Google’s web index? Google hasn’t said for years and still
    isn’t saying, but it is blogging today about how it "knows" of 1 trillion
    web-based items. Joy,
    we haven’t had a search engine talk about pages it knows about to confuse
    things since the Lycos days. Glad to have Google making the first shot to
    take us back into the search engine size wars of old.

    As a short refresher, Google used to list the number of pages it had
    indexed on its home page. It dropped that count back in September 2005,
    after Yahoo (for a short period) had claimed to have indexed more. Both search
    engines swapped PR blows over who was bigger, then we got detente when the
    count went away.

    That was good. Very good. This is because search engine size has long
    been used as a substitute for a relevancy metric that doesn’t exist. If a
    search engine wants to seem twice as good as a competitor, it need only
    trot out a bar chart showing it has twice as many documents as that
    competitor. Then toss in the famous haystack metaphor: you can’t find
    the needle in the haystack if you’re searching only half the haystack!

    But more documents doesn’t mean better relevancy. Indeed, more documents
    can make a search engine worse, if it doesn’t have good relevancy. My turn
    on the haystack metaphor has always been to ask: if I dump the entire
    haystack on your head, does that help you find the needle? Chances are, you
    just get overwhelmed by a bunch of hay.

    Still, size has long been an appealing stat that the search engines would
    go to — which in turn would cause search engines to find ways to inflate
    the size figures they’d report. Way back in 1996, Lycos talked about the
    number of pages it "knew" about, even though these weren’t actually indexed
    or made accessible to searchers. Excite was so annoyed that it pushed back
    with a page on how to count URLs, which has since been archived.

    Now we’ve got Google talking about "knowing" 1 trillion items of content
    out there:

    Recently, even our search engineers stopped in awe about just how big
    the web is these days — when our systems that process links on the web to
    find new content hit a milestone: 1 trillion (as in 1,000,000,000,000)
    unique URLs on the web at once!

    It’s easy to come away with the idea that Google lets you search against
    1 trillion documents. That’s not the case, as the post does explain:

    We don’t index every one of those trillion pages — many of them are
    similar to each other, or represent auto-generated content similar to the
    calendar example that isn’t very useful to searchers. But we’re proud to
    have the most comprehensive index of any search engine, and our goal
    always has been to index all the world’s data.

    All I really want is the last part — that Google has what it believes to
    be a comprehensive index of the web. I don’t even want them or anyone saying
    that they have the "most" comprehensive, given that this is so difficult to
    verify. Indeed, consider this from Google’s own post:

    So how many unique pages does the web really contain? We don’t know; we
    don’t have time to look at them all! :-) Strictly speaking, the number of
    pages out there is infinite — for example, web calendars may have a "next
    day" link, and we could follow that link forever, each time finding a
    "new" page. We’re not doing that, obviously, since there would be little
    benefit to you. But this example shows that the size of the web really
    depends on your definition of what’s a useful page, and there is no exact
    answer.

    Right, there’s no exact answer to what’s a useful page — and so in turn,
    there’s no one exact answer to who has the "most" of them collected. Tell me
    you have a good chunk of the web, and I’m fine. But when Google or any
    search engine starts making size claims, my hackles go way up. There are
    better things to focus on.

    For more, see related discussion on Techmeme.

    About The Author

    Danny Sullivan
    Danny Sullivan was a journalist and analyst who covered the digital and search marketing space from 1996 through 2017. He was also a cofounder of Third Door Media, which publishes Search Engine Land, Marketing Land, MarTech Today and produces the SMX: Search Marketing Expo and MarTech events. He retired from journalism and Third Door Media in June 2017. You can learn more about him on his personal site & blog. He can also be found on Facebook and Twitter.