Of Permanent Value: Archiving The Web
Yes, it’s a busy life but I’m very fortunate to do what I love and even get paid for it. The challenge, at least as I see it, is writing on something of interest for Search Engine Land and not worrying about conflicts of interest with every sentence I write.
Good news: I have found a topic that not only interests me but also grows in significance for all of us with each passing day and each new version of a web page: the importance of making web content more permanent. For historical purposes, it’s crucial that web content become less ephemeral.
It’s my goal in this series of articles to keep you posted on some of the major web archiving initiatives, databases, research and services, while at the same time offering quick peeks at tools you can use to save web pages and other forms of electronic content on your own. Naturally, awareness of copyright is key.
There is a lot going on all over the world and I will do my best to offer you introductions to many digital preservation initiatives, along with the research from universities and organizations engaged in collecting and storing online content.
So, where do we begin?
Many people know about The Internet Archive, based at the Presidio in San Francisco and home to The Wayback Machine. But many people aren’t aware of numerous additional projects (archiving, digitizing, preservation) that the Internet Archive, under the leadership of Brewster Kahle, is involved in.
One is a service the Internet Archive offers for a growing number of institutional clients, named Archive-It.
In a nutshell, this subscription service allows an organization to use an application that includes crawling, recrawling and data hosting services.
From the web site:
Internet Archive’s subscription service, Archive-It, allows institutions to build, manage and search their own web archive through a user friendly web application, without requiring any technical expertise or hosting facilities.
Subscribers can capture, catalog, and archive their institution’s own web site or build collections from the web, and then search and browse the collection when complete.
The collections are then made public (unless a user decides to keep them private) via the Archive-It web site. At last count, Archive-It was permanently archiving more than 135 million pages in nearly 300 collections.
For those interested, Archive-It regularly offers webinars explaining its services.
A few of the most interesting collections include:
- Tragedy at Virginia Tech: A collection of web pages from the University and elsewhere immediately following the tragedy.
- California High Speed Rail Authority
- Orange County California Web Sites
- Latin American Government Documents Archive (University of Texas)
- Canadian Political Parties and Political Interest Groups
It’s also worth noting that, unlike the tens of millions of archived pages accessible via The Wayback Machine, which cannot be searched by keyword, pages archived using the Archive-It service can be.
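For readers who want to see why that difference matters, here is a minimal Python sketch of the two kinds of access: looking a page up by its address versus finding it by the words it contains. Everything in it is an assumption made for illustration. The sample URLs, the page text, and the function names are hypothetical, and it is not a description of how The Wayback Machine or Archive-It actually works.

```python
from collections import defaultdict

# Hypothetical archived pages: URL -> captured text (illustrative data only).
ARCHIVE = {
    "http://example.org/rail": "High speed rail plan for California",
    "http://example.org/parties": "Canadian political parties and interest groups",
    "http://example.org/county": "Orange County California web sites",
}


def lookup_by_url(archive, url):
    """URL-style access: you must already know the address."""
    return archive.get(url)


def build_keyword_index(archive):
    """Map each lowercased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in archive.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


INDEX = build_keyword_index(ARCHIVE)
```

With an index like this, `search(INDEX, "california rail")` finds a page even when you have no idea what its address was, while `lookup_by_url` only helps if you already have the exact URL in hand.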
In an upcoming article I will take a look at two massive web archives that combine the best of both the National Archives of the United States and The Internet Archive. They are named Web Harvest Presidential Term 2004 and Web Harvest 109th Congress (2006). Between them they contain terabytes of archived U.S. Government web data.
Gary Price is Director of Online Information Resources for Ask.com and also editor of ResourceShelf and DocuTicker. The Search On Search column, written by employees of major search engines, appears periodically at Search Engine Land.