Wayback Machine Now Has 240 Billion URLs



The Wayback Machine from the Internet Archive, one of the most useful and important Internet research tools, recently reached a major milestone.

In a blog post, archive founder Brewster Kahle announced that The Wayback Machine now provides access to an index containing more than 240 billion URLs (about five petabytes of data), with archived pages dating back to 1996.

The amount of newly accessible archived material is huge. Prior to this update, The Wayback Machine provided access to about 150 billion URLs.

Researchers should note that a small portion of the index available in the prior release is temporarily unavailable through the new, larger index. For that reason, the older index remains available through a different interface.

Along with announcing the new release, Kahle said the database can potentially access archived pages that were online as recently as early December 2012.

This is also exciting news for researchers, since the lag between when a page was crawled and indexed and when it became accessible via The Wayback Machine was often six months or longer.

Kahle also mentions that The Wayback Machine is now handling more than 1,000 queries per second from more than 500,000 people a day.


All of this news follows a New Year’s Eve blog post by Brewster Kahle announcing that the Internet Archive had just completed raising one million dollars that will allow the organization to purchase four more petabytes of storage. The fundraising continues because the archive estimates they’ll need more than ten petabytes of storage during 2013.

Archive It


Although web pages indexed by The Wayback Machine are NOT keyword searchable, more than a thousand collections of archived web pages focusing on a wide variety of topics ARE keyword searchable.

These archives are made available by Archive-It, a fee-based service that’s part of The Internet Archive. These collections are built by targeting specific URLs to crawl, index, and archive.

Archive-It works with the education community (K-12 and higher ed), libraries, government agencies, non-profits, and others. Many of these groups make their collections accessible to all users.


Here are a Few Examples of Archive-It Collections:

Here’s a directory with information and links to more than 1,800 Archive-It collections that are keyword searchable.




About the author

Gary Price
Contributor
Gary Price is a librarian, author, and an online information analyst based in suburban Washington, DC. He is the co-founder and co-editor of INFOdocket and FullTextReports.com and prior to that was founder/editor of ResourceShelf and DocuTicker for 10 years. He has worked for Blekko, Ask.com, and at Search Engine Watch where he was news editor. In 2001, Price was the co-author (with Chris Sherman) of the best-selling book The Invisible Web.
