Way back in April of 2010 the Library of Congress and Twitter signed an agreement that provided the institution with all public tweets from 2006 through April of 2010. Today, the LoC has announced that it will complete that initial archiving work this month, and will continue to collect them all going forward.
The Library now has 170B tweets on file, and the collection is growing at a rate of 500M tweets a day. That is up from the 140M a day it was collecting in February 2011, when the collection system was launched.
Currently, the archive measures some 133.2 terabytes including two compressed copies of the 2006-2010 archive.
The project has been slow going, as it took nearly three years to get the work done. “We were excited to be involved with acquiring the Twitter archives because it’s a unique record of our time. It’s also a unique way of communication,” said Bill Lefurgy, digital initiatives program manager at the library.
Now, the Library’s goal is to make that archive accessible to researchers in a ‘comprehensive, useful way’. Right now, the Library is organizing the archive chronologically into hourly files, a process it says will also be completed this month.
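The Library has not published the code behind that hourly organization, but the idea is easy to sketch. The Python below is a minimal illustration only: the `created_at` field, the `hourly_key` and `bucket_by_hour` helpers, and the `YYYY/MM/DD/HH.jsonl` naming scheme are all assumptions made for the example, not details from the Library or Twitter.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_key(created_at: datetime) -> str:
    """Map a timezone-aware timestamp to a hypothetical hourly file name (UTC)."""
    utc = created_at.astimezone(timezone.utc)
    return utc.strftime("%Y/%m/%d/%H.jsonl")


def bucket_by_hour(tweets):
    """Group tweet records into hourly buckets, sorted by time within each bucket.

    Each tweet is assumed to be a dict with a timezone-aware
    'created_at' datetime; real archive records may look different.
    """
    buckets = defaultdict(list)
    for tweet in tweets:
        buckets[hourly_key(tweet["created_at"])].append(tweet)
    for items in buckets.values():
        items.sort(key=lambda t: t["created_at"])
    return dict(buckets)


# Three made-up records: two in the 15:00 hour, one in the 16:00 hour.
sample = [
    {"id": 2, "created_at": datetime(2010, 4, 14, 15, 59, 59, tzinfo=timezone.utc)},
    {"id": 1, "created_at": datetime(2010, 4, 14, 15, 5, 0, tzinfo=timezone.utc)},
    {"id": 3, "created_at": datetime(2010, 4, 14, 16, 0, 0, tzinfo=timezone.utc)},
]
buckets = bucket_by_hour(sample)
```

Normalizing every timestamp to UTC before bucketing keeps the hour boundaries consistent no matter where a tweet was posted, which is one plausible reason to file an archive like this by the hour.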
The Library’s Director of Communications, Gayle Osterberg, had this to say about the Twitter collection:
“Twitter is a new kind of collection for the Library of Congress but an important one to its mission. As society turns to social media as a primary method of communication and creative expression, social media is supplementing, and in some cases supplanting, letters, journals, serial publications and other sources routinely collected by research libraries.”
She says that the Library has received about 400 inquiries from researchers on topics like the rise of citizen journalism and the tweets of elected officials.
Here are some examples the Library gives of possible uses:
- A master’s student is interested in understanding the role of citizens in disruptive events. The student is focusing on real-time micro-blogging of terrorist attacks. The questions focus on the timeliness and accuracy of tweets during specified events.
- A post-doctoral researcher is looking at the language used to spread information about charities’ activities and solicitations via social media during and immediately following natural disasters. The questions focus on audience targets and effectiveness.
The Library has also published a white paper with more details about how it compiled the archive. The Library is using Gnip, one of Twitter’s preferred data resellers, as the delivery agent from the service’s firehose into its storage system.
Twitter itself recently launched a way for users to download archives of their own personal tweets, but it has been rolling out very slowly so far.
Image Credit: Mark Wilson/Getty Images