Jul 23, 2007 at 2:46pm ET by Gary Price
I love working for Ask.com as Director of Online Information Resources and also compiling and editing ResourceShelf and DocuTicker.
Yes, it’s a busy life but I’m very fortunate to do what I love and even get paid for it. The challenge, at least as I see it, is writing on something of interest for Search Engine Land and not worrying about conflicts of interest with every sentence I write.
Good news: I have found a topic that not only interests me but grows in significance for all of us as each day and each version of a web page passes: The importance of making web content more permanent. It’s crucial for historical purposes for web content to become less ephemeral.
It’s my goal in this series of articles to keep you posted on some of the major web archiving initiatives, databases, research and services, while at the same time offering quick peeks at tools you can use to save web pages and other forms of electronic content on your own. Naturally, awareness of copyright is key.
There is a lot going on all over the world and I will do my best to offer you introductions to many digital preservation initiatives, along with the research from universities and organizations engaged in collecting and storing online content.
So, where do we begin?
Many people know about The Internet Archive, based at the Presidio in San Francisco and home to The Wayback Machine. But many people aren’t aware of numerous additional projects (archiving, digitizing, preservation) that the Internet Archive, under the leadership of Brewster Kahle, is involved in.
One is a service the Internet Archive offers for a growing number of institutional clients, named Archive-It.
In a nutshell, this subscription service allows an organization to use an application that includes crawling, recrawling and data hosting services.
From the web site:
Internet Archive’s subscription service, Archive-It, allows institutions to build, manage and search their own web archive through a user friendly web application, without requiring any technical expertise or hosting facilities. Subscribers can capture, catalog, and archive their institution’s own web site or build collections from the web, and then search and browse the collection when complete.
The collections are then made public (unless a user decides to keep them private) via the Archive-It web site. At last count, Archive-It was permanently archiving more than 135 million pages in nearly 300 collections.
For those interested, Archive-It regularly offers webinars explaining their services.
This page offers direct links to all of the Archive-It collections. In recent weeks, many new collections have been added to the service.
A few of the most interesting collections include:
It’s also worth noting that, unlike the tens of millions of archived pages accessible via The Wayback Machine, which cannot be keyword searched, pages archived using the Archive-It service can be searched by keyword.
In an upcoming article I will take a look at two massive web archives that combine the best of both the National Archives of the United States and The Internet Archive. They are named Web Harvest Presidential Term 2004 and Web Harvest 109th Congress (2006). Between them they contain terabytes of archived U.S. Government web data.
Gary Price is Director of Online Information Resources for Ask.com and also editor of ResourceShelf and DocuTicker. The Search On Search column, written by employees of major search engines, appears periodically at Search Engine Land.
Opinions expressed in the article are those of the guest author and not necessarily Search Engine Land.
(Hey, nice to see you here Gary!)
One thing missing from many archiving sites these days is interactive snapshots, e.g. video of someone using the site. How am I gonna find out what Google Docs (or, let’s say, AskX) really “felt” like? All that the Wayback Machine is going to provide to us, at best, is a static snapshot of the homepage.
(In the context of Google, I created a couple of “museums” which also have video, by the way: http://blogoscoped.com/search/?q=google+museum)
There’s another big problem with the Archive.org snapshots, unless I’ve misunderstood this when coming across it in the past: if a website changes ownership and the new owner disallows crawling via robots.txt, then the old (pre-forbidden-crawling) Archive.org snapshots will also not be visible anymore. (But it would be neat if you could get some official confirmation for this; it’s just a theory because I once got this message on the Wayback Machine.)
That’s right, Philipp: the snapshots can go away. I was at Foocamp a few weeks ago and have been meaning to write up a conversation I had with Brewster about this. My suggestion/solution is to take actual screenshots of sites. Currently, the IA copies the HTML. That’s nice, but it can be pulled. But legally, I don’t think you could get a screenshot of a site’s home page, or perhaps of many pages within it, yanked. That would allow for some history retention yet, I feel, still respect fair use.
Just a note that archive-it is coded to dot-com. i googled (it’s tuesday) and now i’ll check it out – thanks! (as of now i’ve got seven “searchengineland.com-referral” tags on del.icio.us and i’ve only been reading for a few days…)