The Library of Congress is still working on plans to create a searchable archive of nearly every public tweet ever sent, but the challenges inherent in that task are making it a slow process.
Understandably so: given the substantial growth in tweet volume in recent years, the LoC is essentially trying to tame a very rapidly moving dataset.
If it ever happens, a searchable archive of tweets could prove valuable to researchers, analysts, marketers, and others. You can imagine brands wanting to search for Twitter trends surrounding major product or service announcements, or researchers looking for Twitter activity surrounding major world events.
On Friday, Gayle Osterberg, the Library’s Director of Communications, announced that the LoC is now getting about 500 million tweets per day, up from about 140 million when the project began in February 2011. She spelled out some of the challenges that the project poses.
Currently, executing a single search of just the fixed 2006-2010 archive on the Library's systems can take 24 hours. That is far too slow a starting point for offering access to researchers, since it severely limits the number of searches that could be run.

The Library has assessed existing software and hardware solutions that divide and simultaneously search large datasets to reduce search time (so-called "distributed and parallel computing"). Achieving a significant reduction in search time, however, would require an extensive infrastructure of hundreds, if not thousands, of servers. That is cost-prohibitive and impractical for a public institution.
In a Washington Post article, Deputy Librarian of Congress Robert Dizard Jr. says the collection will eventually be made available only within the Library itself, so that its archive doesn't compete with commercial services that offer Twitter archives. That restriction is part of the Library's agreement with Twitter.
But, as Gary Price noted on INFOdocket, it doesn't sound like any of that will happen anytime soon.