Let us save what remains: not by vaults and locks which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident.
Thomas Jefferson1
Information stored on paper can survive for millennia; information stored digitally today may not be recoverable this time next week. With seven million pages of new information added to the world wide web each day, the volatility of websites has emerged as an urgent problem, especially as websites are becoming the version of record for scientific journals. Three studies of links in peer reviewed journals all found their useful life to be a few years.2-4 For Stewart Brand, president of the Long Now Foundation, “This is not a good way to run a civilization.”5 For librarians whose mission is to transmit today's intellectual, cultural, and historical output to the future, it's fast becoming a nightmare. A project initiated by Stanford University Libraries is coming to their aid.
Called LOCKSS (for “Lots of Copies Keeps Stuff Safe”), it aims to provide librarians with a cheap and easy way to collect, preserve, and provide access to their own, local copy of web published material (http://lockss.stanford.edu). The project has developed software that converts a personal computer into a digital preservation appliance. If a publisher gives permission, the appliance collects content by slowly crawling the publisher's site in the manner of a search engine. Access to the collected content is transparent; the appliance acts like a web cache to deliver requested pages from the publisher, or stored pages if the publisher fails to respond. In this way a library's readers see the subscribed pages at their original location, even though the publisher may no longer provide them there.
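The cache-like behaviour described above can be illustrated with a minimal sketch. It is not the LOCKSS code: the function name `serve_page`, the `fetch_from_publisher` callable, and the `preserved` dictionary are hypothetical stand-ins for the appliance's network access and local store, and a failure to respond is assumed to surface as an `OSError`.

```python
from typing import Callable, Dict, Optional


def serve_page(
    url: str,
    fetch_from_publisher: Callable[[str], bytes],
    preserved: Dict[str, bytes],
) -> Optional[bytes]:
    """Return the publisher's page if it responds, else the preserved copy.

    Hypothetical sketch: `fetch_from_publisher` stands in for an HTTP
    request and is assumed to raise OSError (for example ConnectionError)
    when the publisher fails to respond.
    """
    try:
        # Normal case: the publisher still serves the page.
        return fetch_from_publisher(url)
    except OSError:
        # Publisher failed to respond: fall back to the copy collected
        # earlier. Returns None if this URL was never collected.
        return preserved.get(url)
```

Because the reader asks for the original address in both cases, the fallback is invisible to them; only the source of the bytes changes.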
These appliances do not stand alone but are linked via the internet. They continually audit each other's content, comparing their versions by voting on its digest (a short value computed from the content that is, in practice, unique to it). If an appliance finds its copy outvoted and thus probably damaged, it can repair the damage from the appliances that outvoted it.6 LOCKSS uses this process of mutual audit and repair because the alternative, careful backups and manual auditing of the backup copies, is very expensive. Librarians' defence against irreplaceable loss has always rested on redundancy (when one library burns, only one of many copies of a work is destroyed). LOCKSS provides for Jefferson's “multiplication of copies,” but with an electronic twist.
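The idea of voting on a digest and repairing an outvoted copy can be sketched as follows. This is a deliberately simplified illustration, not the LOCKSS protocol itself: the use of SHA-256, the simple majority rule, and the names `digest` and `audit_and_repair` are all assumptions made for the example.

```python
import hashlib
from collections import Counter
from typing import Dict


def digest(content: bytes) -> str:
    """Digest of the content (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(copies: Dict[str, bytes]) -> Dict[str, bytes]:
    """Return the copies after every outvoted appliance has been repaired.

    `copies` maps a hypothetical appliance name to the content it holds.
    """
    # Each appliance "votes" with the digest of its own copy.
    votes = Counter(digest(content) for content in copies.values())
    winning_digest, count = votes.most_common(1)[0]

    # Without an absolute majority there is no agreed version to repair from.
    if count <= len(copies) / 2:
        return dict(copies)

    # Take the agreed content from any appliance on the winning side.
    good = next(c for c in copies.values() if digest(c) == winning_digest)

    # Outvoted appliances replace their copy; the rest keep theirs.
    return {
        name: content if digest(content) == winning_digest else good
        for name, content in copies.items()
    }
```

With three appliances of which one holds a damaged copy, the damaged copy is replaced by the version the other two agree on; with only two appliances that disagree, nothing is changed because neither side is outvoted.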
Beta testing of the LOCKSS system, which initially used content provided by the BMJ and is adding other titles at an increasing rate, is under way at 80 libraries worldwide, and the system should go into production in spring 2004. Some 50 publishers of academic journals are supporting the project.
As flaws in digital preservation systems may not come to light until it is too late to save their content, diversity is essential. Fortunately, LOCKSS is not the only game in town. The Internet Archive makes heroic, if inevitably only partly successful, efforts to archive the entire web (www.archive.org/). The Dutch National Library is cooperating with the publisher Elsevier to preserve its journals.7 Debate continues over the economic and technical advantages of distributed versus centralised approaches to archiving, as national and institutional libraries plan for the digital future.
The LOCKSS team hopes to extend these techniques to other forms of content, for example less formal journals in the arts and humanities, and government documents. With such tools libraries can continue to serve as society's memory.
Competing interests: The LOCKSS program has received cash and support in kind from Sun Microsystems. Other computer companies currently support researchers who are contributing to the program. DR was until November 2002 employed by and holds shares in Sun Microsystems and in other computer hardware and software companies.
References
- 1. Thomas Jefferson to Ebenezer Hazard, Philadelphia, February 18, 1791. In Thomas Jefferson: Writings: autobiography, notes on the State of Virginia, public and private papers, addresses, letters, edited by Merrill D Peterson. New York: Library of America, 1984.
- 2. Lawrence S, Coetzee F, Glover E, Pennock D, Flake G, Nielsen F, et al. The persistence of web references in scientific research. IEEE Computer 2001;34(2):26-31. http://www.neci.nec.com/~lawrence/papers/persistence-computer01/persistence-computer01.pdf (accessed 5 Jan 2004).
- 3. Spinellis D. The decay and failures of URL references. Communications of the ACM 2003;46:71-7. http://www.spinellis.gr/sw/url-decay/ (accessed 5 Jan 2004).
- 4. Dellavalle RP, Hester EJ, Heilig LF, Drake AL, Kuntzman JW, Graber M, et al. Going, going, gone: lost internet references. Science 2003;302:787-8.
- 5. Meloan S. No way to run a culture. Wired News 1998 February 13, 6.19 pm. www.wired.com/news/culture/0,1284,10301,00.html (accessed 5 Jan 2004).
- 6. Storing e-text for centuries. Economist 2003. June 21(suppl): S8.
- 7. National Library of the Netherlands. Unique agreement between Elsevier Science BV and the Koninklijke Bibliotheek. www.kb.nl/kb/resources/frameset_kb.html?/kb/pr/pers/pers1996/elspers-en.html (accessed 5 Jan 2004).
