Eternal Bits Continued
By Mackenzie Smith
There's no one right way to preserve digital content. Just as biodiversity is good for the natural environment, different digital preservation policies and strategies are good for the preservation environment [see sidebar]. But to ensure that we don't wind up with a digital Tower of Babel, we need to agree to use open, published standards, such as XML, TIFF, PDF, and MPEG.
And that's true not just for the obvious items like images, documents, and audio files, but also for scientific images, genomics data sets, and multimedia presentations and simulations. In the scientific research community, standards are emerging here and there—HDF (Hierarchical Data Format), NetCDF (network Common Data Form), FITS (Flexible Image Transport System)—but much work remains to be done to define a common cyberinfrastructure.
MIT is working closely with the University of Cambridge to develop a preservation strategy for each of the formats that the DSpace project intends to support. We plan to share these as widely as possible for peer review. As a start, we are tackling the most commonly deposited formats: PDF, HTML, and a couple of the Microsoft formats, Excel and Word. We will work our way down the list over time, in order of popularity and ease of preservation, and we'll also publish guidelines for the MIT community about which formats should be used when possible to make archiving easier.
We hope and expect that the worldwide community of digital archivists will begin to divide and conquer so that one group by itself doesn't have to address the tens of thousands of file formats that are out there. Toward that end, efforts are under way to build new collaborative services like the proposed Global Digital Format Registry, which we can all add to and use as an authoritative source of information about digital formats and the tools for processing them.
Individuals can help, too. Document what you create, when you created it, in what format, on what computer, with what parameters, and so on. Also try to tag documents with metadata. By the time archivists get digital items, they're often unmoored from their originator, so sometimes archivists don't even know what the items are or who made them, much less whether the institution has the right to archive them.
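As a rough illustration of the record keeping described above, the Python sketch below writes a small "sidecar" file next to a document, noting who made it, when it was last modified, its format, the computer it was described on, and any parameters the creator wants to keep. The field names, the `.metadata.json` suffix, and the function names `describe_file` and `write_sidecar` are assumptions made for this example, not part of DSpace or any archival standard. The script also records the last-modified time rather than a true creation date, which not every file system stores.

```python
import hashlib
import json
import mimetypes
import platform
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path, creator, parameters=None):
    """Build a provenance record for one file (field names are illustrative)."""
    path = Path(path)
    data = path.read_bytes()
    modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    # Guess the format from the file extension; this is only a hint.
    mime_type, _ = mimetypes.guess_type(path.name)
    return {
        "title": path.name,
        "creator": creator,
        "modified": modified.isoformat(),
        "format": mime_type or "unknown",
        "size_bytes": len(data),
        # A checksum lets an archivist confirm later that the bits are unchanged.
        "sha256": hashlib.sha256(data).hexdigest(),
        "computer": f"{platform.system()} {platform.release()} ({platform.machine()})",
        "parameters": parameters or {},
    }


def write_sidecar(path, creator, parameters=None):
    """Save the record beside the file as <name>.metadata.json and return its path."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".metadata.json")
    record = describe_file(path, creator, parameters)
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

Calling `write_sidecar("results.xls", "A. Researcher", {"instrument": "demo"})`, with a hypothetical file name and creator, would leave a `results.xls.metadata.json` file alongside the spreadsheet. Keeping the record in plain JSON means it stays readable even if the software that produced the original file does not.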
Digital preservationists know that metadata tagging is a lot to ask of people and that we need to make doing the right thing much, much easier. Until we accomplish that goal, back up your hard disk tonight, and maybe print out your most important documents, just in case.
About the Author
Mackenzie Smith is associate director for technology at the Massachusetts Institute of Technology Libraries, in Cambridge.
To Probe Further
For more information on creating an institutional repository with DSpace, go to http://www.dspace.org.
To browse through 40 billion Web pages archived from 1996 to a few months ago, check out the Internet Archive's Wayback Machine at http://www.archive.org/web/web.php.
More than 80 institutions are participating in the LOCKSS program to preserve digital content; for details, see http://lockss.stanford.edu/index.html.