Library Of Congress Struggling To Make A Searchable Twitter Archive



The Library of Congress is still working on plans to create a searchable archive of nearly every public tweet ever sent, but the challenges inherent in that task are making it a slow process.

Understandably so, considering the substantial growth in tweets in recent years; the LoC is essentially trying to tame a very rapidly moving dataset.

If it ever happens, a searchable archive of tweets could prove valuable to researchers, analysts, marketers and others. You can imagine brands wanting to search for Twitter trends surrounding major product/service announcements, or researchers looking for Twitter activity surrounding major world events.

On Friday, Gayle Osterberg, the Library’s Director of Communications, announced that the LoC is now receiving about 500 million tweets per day, up from about 140 million when the project began in February 2011. She spelled out some of the challenges that the project poses:

Currently, executing a single search of just the fixed 2006-2010 archive on the Library’s systems could take 24 hours. This is an inadequate situation in which to begin offering access to researchers, as it so severely limits the number of possible searches.

The Library has assessed existing software and hardware solutions that divide and simultaneously search large data sets to reduce search time, so-called “distributed and parallel computing”. To achieve a significant reduction of search time, however, would require an extensive infrastructure of hundreds if not thousands of servers. This is cost-prohibitive and impractical for a public institution.
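To make the "divide and simultaneously search" idea concrete, here is a minimal Python sketch. It is an illustration only, not the Library's system: the function names, the shard count, and the sample tweets are all made up for this example. It splits a list of tweets into shards, scans each shard in its own worker, and merges the matches.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term):
    """Linear scan of one shard; returns the tweets that contain the term."""
    needle = term.lower()
    return [tweet for tweet in shard if needle in tweet.lower()]


def split_into_shards(tweets, shard_count):
    """Divide the archive into roughly equal, contiguous shards."""
    size = max(1, -(-len(tweets) // shard_count))  # ceiling division
    return [tweets[i:i + size] for i in range(0, len(tweets), size)]


def parallel_search(tweets, term, shard_count=4):
    """Search every shard at the same time and merge results in shard order."""
    shards = split_into_shards(tweets, shard_count)
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        per_shard = list(pool.map(search_shard, shards, [term] * len(shards)))
    return [match for matches in per_shard for match in matches]


# Hypothetical sample data, for illustration only.
sample = [
    "Hello world",
    "Library of Congress archive",
    "congress votes",
    "nothing to see",
]
print(parallel_search(sample, "congress", shard_count=2))
```

Threads keep the sketch short, but in standard Python they will not speed up a CPU-bound scan like this one. A real deployment would spread the shards across separate processes or machines, which is where the hundreds or thousands of servers come in.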

In a Washington Post article, Deputy Librarian of Congress Robert Dizard Jr. says the collection will eventually be made available only within the Library itself, so that its archive doesn’t compete with commercial services that offer Twitter archives. That restriction is part of the agreement with Twitter.

But, as Gary Price said on INFOdocket, it doesn’t sound like any of that will happen anytime soon.

Twitter itself recently began letting users download their own tweet history, but the company doesn’t appear to have any plans to offer a historical search engine of its own.




About the author

Matt McGee
Contributor
Matt McGee joined Third Door Media as a writer/reporter/editor in September 2008. He served as Editor-In-Chief from January 2013 until his departure in July 2017. He can be found on Twitter at @MattMcGee.
