Of Permanent Value: Archiving The Web

Search On Search - A Column From Search Engine Land

I love working for Ask.com as Director of Online Information Resources and also compiling and editing ResourceShelf and DocuTicker.

Yes, it’s a busy life, but I’m very fortunate to do what I love and even get paid for it. The challenge, at least as I see it, is writing about something of interest for Search Engine Land without worrying about conflicts of interest with every sentence I write.

Good news: I have found a topic that not only interests me but grows in significance with each passing day and each new version of a web page: the importance of making web content more permanent. For historical purposes, it’s crucial that web content become less ephemeral.

It’s my goal in this series of articles to keep you posted on some of the major web archiving initiatives, databases, research and services, while at the same time offering quick peeks at tools you can use to save web pages and other forms of electronic content on your own. Naturally, awareness of copyright is key.

There is a lot going on all over the world and I will do my best to offer you introductions to many digital preservation initiatives, along with the research from universities and organizations engaged in collecting and storing online content.

So, where do we begin?

Many people know about The Internet Archive, based at the Presidio in San Francisco and home to The Wayback Machine. But many people aren’t aware of numerous additional projects (archiving, digitizing, preservation) that the Internet Archive, under the leadership of Brewster Kahle, is involved in.

One is a service the Internet Archive offers for a growing number of institutional clients, named Archive-It.

In a nutshell, this subscription service allows an organization to use an application that includes crawling, recrawling and data hosting services.

From the web site:

Internet Archive’s subscription service, Archive-It, allows institutions to build, manage and search their own web archive through a user friendly web application, without requiring any technical expertise or hosting facilities.

Subscribers can capture, catalog, and archive their institution’s own web site or build collections from the web, and then search and browse the collection when complete.

The collections are then made public (unless a user decides to keep them private) via the Archive-It web site. At last count, Archive-It was permanently archiving more than 135 million pages in nearly 300 collections.

For those interested, Archive-It regularly offers webinars explaining their services.

This page offers direct links to all of the Archive-It collections. In recent weeks, many new collections have been added to the service.

A few of the most interesting collections include:

It’s also worth noting that unlike the tens of millions of archived pages accessible via The Wayback Machine, which cannot be keyword searched, pages archived using the Archive-It service can be searched by keyword.
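For readers who want to script lookups against the Wayback Machine itself, the Internet Archive offers a JSON "availability" endpoint (a later addition to the service, and an assumption here rather than something covered in this column). A minimal Python sketch of building a query URL and parsing the response:

```python
import json
from urllib.parse import urlencode

WAYBACK_API = "https://archive.org/wayback/available"

def availability_url(page_url, timestamp=None):
    """Build a query URL for the Wayback Machine availability endpoint.

    timestamp, if given, is a YYYYMMDD string; the API returns the
    snapshot closest to that date.
    """
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urlencode(params)

def closest_snapshot(response_text):
    """Extract the closest archived snapshot URL from an API response."""
    data = json.loads(response_text)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap else None

# Example response shape (abridged) as documented for the endpoint:
sample = ('{"archived_snapshots": {"closest": '
          '{"url": "http://web.archive.org/web/2007/http://example.com/", '
          '"available": true}}}')
```

Fetching `availability_url(...)` with any HTTP client and passing the body to `closest_snapshot` yields the snapshot URL, or None if the page was never captured.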

In an upcoming article I will take a look at two massive web archives that combine the best of both the National Archives of the United States and The Internet Archive. They are named Web Harvest Presidential Term 2004 and Web Harvest 109th Congress (2006). Between them they contain terabytes of archived U.S. Government web data.

Gary Price is Director of Online Information Resources for Ask.com and also editor of ResourceShelf and DocuTicker. The Search On Search column, written by employees of major search engines, appears periodically at Search Engine Land.



About The Author: Gary Price is a librarian, author, and online information analyst based in suburban Washington, DC. He is the co-founder and co-editor of INFOdocket and FullTextReports.com and prior to that was founder/editor of ResourceShelf and DocuTicker for 10 years. He has worked for Blekko, Ask.com, and at Search Engine Watch, where he was news editor. In 2001, Price was the co-author (with Chris Sherman) of the best-selling book The Invisible Web.

  • http://blogoscoped.com Philipp Lenssen

    (Hey, nice to see you here Gary!)

    One thing missing from many archiving sites these days is interactive snapshots, e.g. video of someone using the site. How am I gonna find out what Google Docs — or let’s say, AskX — really “felt” like? All that the Wayback Machine is going to provide us, at best, is a static snapshot of the homepage.

    (In the context of Google, I created a couple of “museums” which also have video, by the way.
    http://blogoscoped.com/search/?q=google+museum )

    There’s another big problem with the Archive.org snapshots, unless I’ve misunderstood this when coming across it in the past: if a website changes ownership and the new owner disallows crawling via robots.txt, then the old (pre-forbidden-crawling) Archive.org snapshots will also no longer be visible. (But it would be neat to get some official confirmation of this; it’s just a theory, because I once got this message on the Wayback Machine.)
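The exclusion mechanism behind the behavior Philipp describes is easy to check locally. A minimal sketch using Python's standard robotparser, assuming the Archive's crawler still identifies itself as "ia_archiver" (the hypothetical robots.txt below is illustrative, not from any real site):

```python
from urllib import robotparser

# A robots.txt a new domain owner might publish: block the Internet
# Archive's crawler entirely while allowing everyone else.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""

def archive_allowed(robots_txt, url, agent="ia_archiver"):
    """Return True if `agent` may fetch `url` under the given robots.txt."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```

The sketch only shows what a crawler would conclude from the file itself; whether the Wayback Machine retroactively hides already-captured snapshots on seeing such a file is exactly the behavior Philipp is asking to have confirmed.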

  • http://searchengineland.com Danny Sullivan

    That’s right, Philipp – the snapshots can go away. I was at Foocamp a few weeks ago and have been meaning to write up a conversation I had with Brewster about this. My suggested solution is to take actual screenshots of sites. Currently, the IA copies the HTML. That’s nice, but it can be pulled. But legally, I don’t think you could get a screenshot of a site’s home page, or perhaps of many pages within it, yanked. That would allow for some history retention yet still, I feel, respect fair use.

  • http://www.modified-news.com Rebekah

    Just a note that archive-it is coded to dot-com. i googled (it’s tuesday) and now i’ll check it out – thanks! (as of now i’ve got seven “searchengineland.com-referral” tags on del.icio.us and i’ve only been reading for a few days…)
