Wayback Machine Now Has 240 Billion URLs
The Wayback Machine from the Internet Archive, one of the most useful and important Internet research tools, recently reached a major milestone.
In a blog post, archive founder Brewster Kahle announced that The Wayback Machine now provides access to an index containing more than 240 billion URLs (about five petabytes of data), with archived pages dating back to 1996.
The amount of newly accessible archived material is huge. Prior to this update, The Wayback Machine provided access to about 150 billion URLs.
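To put those figures in perspective, here is a rough back-of-the-envelope sketch. It uses only the numbers quoted above and assumes decimal petabytes (10^15 bytes); since both URL counts are "more than" figures, the results are approximations rather than official statistics. The function and constant names are illustrative.

```python
# Figures quoted in the article; both URL counts are lower bounds ("more than").
OLD_URLS = 150e9      # URLs accessible before the update
NEW_URLS = 240e9      # URLs accessible after the update
INDEX_BYTES = 5e15    # "about five petabytes"; assumption: decimal petabytes


def growth_percent(old, new):
    """Percentage increase from old to new."""
    return (new - old) / old * 100


def avg_bytes_per_url(total_bytes, url_count):
    """Average amount of stored data per indexed URL."""
    return total_bytes / url_count


growth = growth_percent(OLD_URLS, NEW_URLS)
avg_kb = avg_bytes_per_url(INDEX_BYTES, NEW_URLS) / 1000  # bytes -> kilobytes

print(f"Index growth: about {growth:.0f}%")
print(f"Average data per URL: about {avg_kb:.1f} KB")
```

Under these assumptions, the update expands the accessible index by roughly 60 percent, and the average archived URL works out to about 21 KB of data.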
Researchers should note that a small portion of the index available in the prior release is temporarily unavailable via the new, larger index. For that reason, the older index remains available through a different interface.
Along with announcing the new release, Kahle said the database can potentially provide access to archived pages that were online as recently as early December 2012.
This is also exciting news for researchers, since the lag between the time a page was crawled and indexed and the time it became accessible via The Wayback Machine was often six months or longer.
Kahle also mentions that Wayback is now handling more than 1,000 queries per second from more than 500,000 people a day.
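As a rough illustration of what that load implies, the sketch below converts the quoted rate into daily totals. It assumes the 1,000 queries per second rate is sustained around the clock, which the blog post does not state, so treat the result as an order-of-magnitude estimate only. The constant names are illustrative.

```python
# Figures quoted in the article; both are lower bounds ("more than").
QUERIES_PER_SECOND = 1_000
DAILY_USERS = 500_000
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: the quoted rate holds for the full day (not stated in the source).
queries_per_day = QUERIES_PER_SECOND * SECONDS_PER_DAY
queries_per_user = queries_per_day / DAILY_USERS

print(f"Queries per day: about {queries_per_day:,}")
print(f"Queries per person per day: about {queries_per_user:.0f}")
```

If the rate were constant, that would come to roughly 86 million queries a day, or about 170 queries per visitor.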
All of this news follows a New Year’s Eve blog post by Brewster Kahle announcing that the Internet Archive had just finished raising one million dollars, which will allow the organization to purchase four more petabytes of storage. The fundraising continues because the archive estimates it will need more than ten petabytes of storage during 2013.
Although web pages indexed by The Wayback Machine are NOT keyword searchable, more than a thousand collections of archived web pages focusing on a wide variety of topics ARE keyword searchable.
These archives are made available by Archive-It, a fee-based service that’s part of The Internet Archive. These collections are built by targeting specific URLs to crawl, index, and archive.
Archive-It works with the education community (K-12 and higher ed), libraries, government agencies, non-profits, and others. Many of these groups make their collections accessible to all users.
Here are a Few Examples of Archive-It Collections:
- New Zealand Earthquake of 2011
- Human Rights Documentation Initiative
- NHTSA, National Highway Traffic Safety Administration
- Wisconsin Historical Society
- National September 11 Memorial Museum
Here’s a directory with information and links to more than 1,800 Archive-It collections that are keyword searchable.
Opinions expressed in this article are those of the guest author and not necessarily those of Search Engine Land.