OPLIN 4Cast #301: Tackling the big jobs

In this age of user-generated Internet content (Web 2.0, if you want to call it that), astounding amounts of information are generated in very short time spans. It has been pointed out, for example, that every 24 hours more than twice as many words are posted to Twitter as were printed in the entire New York Times over the last 60 years. If you are an archivist, the task of saving even a little of this Internet material for future research seems overwhelmingly large, and when you factor in other non-print information it seems even worse. Yet a few organizations, most notably the Internet Archive and the Library of Congress in the United States, have tackled portions of the job. Below are some recent news stories about the problem and about these organizations' latest efforts to capture and provide effective access to huge amounts of information that might otherwise be beyond the reach of many.

The disappearing web: Decay is eating our history (Businessweek/Mathew Ingram) “They took a number of recent major news events over the past three years—including the Egyptian revolution, Michael Jackson’s death, the elections and related protests in Iran, and the outbreak of the H1N1 virus—and tracked the links that were shared on Twitter about each. Following the links to their ultimate source showed that an alarming number of them had simply vanished. In fact, the researchers said that within a year of these events, an average of 11 percent of the material that was linked to had disappeared completely (and another 20 percent had been archived), and after two-and-a-half years, close to 30 percent had been lost altogether and 41 percent had been archived.”
Launch of TV news search & borrow with 350,000 broadcasts (Internet Archive Blog/Brewster Kahle) “Like library collections of books and newspapers, this accessible archive of TV news enables anyone to reference and compare statements from this influential medium. The collection now contains 350,000 news programs collected over 3 years from national U.S. networks and stations in San Francisco and Washington D.C. The archive is updated with new broadcasts 24 hours after they are aired. Older materials are also being added.”
Congress.gov unveiled today (Library of Congress Blog/Erin Allen) “The Congress.gov site includes bill status and summary, bill text and member profiles and other new features like comprehensive searching across bill text, summary and statuses; persistent URLs for search results; Members’ legislative history and biographical profiles; and maintenance of existing features such as links to video of the House and Senate floor, top searched bills and the save/share feature.”
So, is the Library of Congress still archiving Twitter? (BuzzFeed/John Herrman) “Serving up billions upon billions of tweets in even the most basic way is a hard job for a technology company, much less for a government agency whose requested budget [pdf] for ‘Digital Initiatives’ in 2013 — all of them, including web archiving, historic newspapers, the online American history archive, the veteran’s history project, early sound recordings — is under $50m, and actually lower than it was in 2011.”

Tweet fact:
When the Library of Congress announced in April 2010 that it was going to archive Twitter, there were 50 million tweets a day. Now there are 400 million a day.

