
When Good Data Go Bad: Preserving Exabytes of Data Across Centuries of Time
2006, Q2 (February 19, 2007)
By Ronald W. Garrison, Carolina Chapter Member

Ronald Garrison

"This stuff goes bad if you don't work on it once in a while."

That's what I said once to some co-workers to whom I was demonstrating an application program, after one of many rounds of maintenance. This remark, made totally in jest, now seems, 20 years later, strangely true on a very serious level. It turns out that preserving digital data in a useful form for long periods of time is neither simple nor easy; it can, in very real ways, "go bad." At the same time, there is really no alternative to the digital option, given the vast amounts of information we have accumulated, and given that digital media, their problems notwithstanding, offer the only real hope for preserving information intact against the ravages of time and occasional mishaps. Like the loss of biological species, the long-term loss of digital data is a serious problem of our modern era, and one we have not yet confronted effectively.

My last article concluded with the remark that digitizing millions of books, as large an effort as that is, is already well on its way to completion, with most of the remaining difficulties lying in the legal realm of copyright. In assessing the sheer immensity of storage requirements, we find that only a trivial portion of humanity's total store of accumulated information is on paper. Audio and video recordings, Web sites, and machine-generated information, most of it never even dealt with directly by human beings, amount to tens of exabytes—that is, tens of millions of terabytes. Although little of that information needs to be scanned one printed page at a time, there is no denying the immensity of the task that confronts the keepers of the biggest collections ever assembled.
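The scale mentioned above is easier to grasp with a little unit arithmetic. A minimal sketch in Python, using decimal (SI) units; the 50 EB figure and the 10 TB drive size are illustrative stand-ins, not figures from the article:

```python
# Decimal (SI) units: 1 terabyte = 10**12 bytes, 1 exabyte = 10**18 bytes,
# so one exabyte is a million terabytes.
TERABYTE = 10**12
EXABYTE = 10**18

total = 50 * EXABYTE  # illustrative midpoint for "tens of exabytes"

print(total // TERABYTE)           # -> 50000000 (fifty million terabytes)
print(total // (10 * TERABYTE))    # -> 5000000 (five million hypothetical 10 TB drives)
```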

As if the vastness of this ocean of data were not enough, there are serious problems standing in the way of preserving this storehouse of information well enough to make it accessible a long time from now. Finding a durable physical medium is just the start. The way the information is written on such a medium, the file system used, the data format within the file, and even high-level matters such as human language and writing style, all bear on this problem of how to render intelligible and useful what we hand down to future generations.

For me to be commenting on the looming problem of digital data loss actually represents a major shift in my thinking over the years. It was Jeff Rothenberg's 1995 article in Scientific American titled "Ensuring the Longevity of Digital Documents" that brought the matter to my attention, but at the time I actually scoffed at the claims he made. To be unable to figure out the contents of a digital recording, it seemed to me, would amount to creating an unbreakable code—and doing so accidentally. With some effort at analysis of one artifact, I reasoned, that item would be deciphered; then, huge numbers of similar data sets would be easy to recover. It also seemed silly to me that paper should be a much more durable storage medium than anything to which we write with a computer—again, largely by accident. Rothenberg's own glib one-liner that "Digital data lasts forever...or five years, whichever comes first" did little to soften my skepticism. Neither did various later claims, made by some, that data on CD-Rs could only be expected to last a few years—a claim that did not square with my own experience. Five years after the issue came to my attention, I was ready to consign the bogeyman of digital data loss to the same dustbin of history where I had already put the CD rot claims, the quickly forgotten hype about Y2K, the scattershot objections raised by Clifford Stoll in his book Silicon Snake Oil a few years earlier, and similar-sounding predictions about one or another aspect of the imminent fall of modern civilization.

Much of my skepticism remains. Yet I cannot dismiss all the concerns raised about this subject, many of them voiced by respected researchers; to do so would be to adopt a viewpoint at the opposite extreme. I have had to soften my derogatory views, and have been dragged, virtually kicking and screaming, into a grudging acceptance of the notion that our global civilization could lose vast amounts of important information, collected at great expense, if we do not make a concerted, sincere effort to deal with the full set of issues involved.

A recent part of my education in this subject came from a 1998 video called Into the Future: On the Preservation of Knowledge in the Electronic Age. Directed by Terry Sanders and narrated by Robert MacNeil, this engaging treatment of a disturbing subject made me take a whole set of issues more seriously. I, who as a teenager naïvely supposed that even local radio and TV shows were carefully preserved for posterity, began to see just how ephemeral even data from advanced interplanetary missions can be. As was said in the video, unlike paper records, "digital records don't just survive by accident." I learned that the total amount of electronic archival material in the world is roughly doubling every year; that the Internet, by about 1997, had become too big to preserve snapshots of it (à la the Wayback Machine); and that much data had already been lost from advanced interplanetary missions launched in the 1970s. Peter Norton eloquently described the personal trauma that can accompany the loss of valuable data, as well as how horrifyingly often it happens. As the credits rolled, I was not quite crying, but I was thinking.

There does not seem to be any all-encompassing solution to the problem of how best to send a hundred exabytes into deep time. Printing any more than a tiny portion of it is obviously out of the question, and is in any case undesirable for a number of reasons. Keeping data in its native software environment works nicely when it can be done, but is extremely difficult to sustain for very long. Converting data to open, standardized formats has advantages, but can sacrifice important qualities, such as the interactivity of the original versions. And other approaches, such as saving the data along with the version of the software needed to use it, or emulating that software in a hardware-independent fashion, run into difficulties because of the many connections between application software and the underlying systems that support it.
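The format-conversion approach can be made concrete with a small, entirely hypothetical sketch: the packed binary record layout below is invented for illustration, and the "open format" target is simply self-describing JSON lines. The point is that the migrated copy documents its own structure, while the original is opaque without its layout specification:

```python
import json
import struct

# Hypothetical legacy binary record: little-endian uint32 id + three float64 readings.
RECORD = struct.Struct("<I3d")

def migrate(raw: bytes) -> str:
    """Convert packed legacy records to self-describing JSON lines."""
    lines = []
    for offset in range(0, len(raw), RECORD.size):
        rec_id, a, b, c = RECORD.unpack_from(raw, offset)
        lines.append(json.dumps({"id": rec_id, "readings": [a, b, c]}))
    return "\n".join(lines)

raw = RECORD.pack(1, 1.0, 2.0, 3.0) + RECORD.pack(2, 4.0, 5.0, 6.0)
print(migrate(raw))
```

The trade-off the article names is visible even here: the JSON copy is readable without the original software, but any behavior the original application attached to those records (validation, interactive display) is left behind.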

These complex issues, along with the sheer immensity of the data collections to be preserved, are difficult enough that I can confidently make a prediction: Solving the data preservation problem will be a struggle for the rest of the lives of my contemporaries. Yet for all that, I have not fully gone over to the other side. My remaining skepticism reminds me that media must exist that are both much denser than paper and more durable, though we have not emphasized those properties so far; that the data that exists in the largest quantities is generated by machines, and tends to be simple and highly repetitive, in contrast to documents generated by people; and that we should not confuse preserving data with preserving computer programs, something which is much more difficult and complex, especially when we lack the original hardware that ran them—and which certainly does not occupy exabytes. I am therefore optimistic that the preservation of digital data can be mastered in time to prevent a digital reprise of the loss of the library at Alexandria. Indeed, the ability given to us by the digital paradigm to make perfect copies, and spread them around the world cheaply, may be the only way to prevent eventual information loss catastrophes on scales previously unimaginable.
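The observation that machine-generated data tends to be simple and highly repetitive can be checked with any general-purpose compressor. A small sketch using Python's standard zlib module; the log lines are invented for illustration:

```python
import zlib

# Machine-generated records tend to repeat their structure almost verbatim:
# only a small field varies from line to line.
machine_data = b"\n".join(
    b"2006-02-19T12:00:%02d sensor=42 temp=20.1 status=OK" % (i % 60)
    for i in range(10_000)
)

compressed = zlib.compress(machine_data, level=9)
ratio = len(machine_data) / len(compressed)
print(f"{len(machine_data)} bytes -> {len(compressed)} bytes (~{ratio:.0f}x smaller)")
```

Human-written prose compresses far less dramatically, which supports the article's point: the bulkiest data is also the most redundant, and therefore the cheapest to replicate widely.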


Ron can be contacted at rgarrison1 at nc dot rr dot com.
