
Bit rot

PICTURE yourself as a historian in 2035, trying to make sense of this year’s American election campaign. Many of the websites and blogs now abuzz with news and comment will have long since perished. Data stored electronically decays. Many floppy disks from the early digital age are already unreadable. If you are lucky, copies of campaign material, and of e-mails and other records (including declassified official documents), will be available in public libraries.

But will you be able to read them? Already, NASA has lost data from some of its earliest missions to the moon because the machines used to read the tapes were scrapped and cannot be rebuilt. A wise librarian will wish to keep in working order a few antique computers that can read such ancient technologies as CDs and USB thumb-drives. But even that may not be enough. Computer files are not worth anything without software to open them.

One way round that is to print everything out. Printed on durable acid-free paper, such records should remain at least as readable as medieval manuscripts handwritten on vellum. But printouts of digital material are a second-best solution. They risk losing the metadata that make documents interesting: e-mails make most sense as threads, not as stacks of paper. Only in digitised form can data be sifted and crunched.

Conscientious institutions already make copies of some web pages, e-books and other digital material, and shift the data to new hardware every five years (see the next article). As software becomes obsolete, libraries and companies can create emulators—old operating systems working inside newer ones.

But that effort is hampered by regulation that makes archiving digital artefacts even more difficult than it already is. In America, for instance, circumventing the anti-piracy digital-rights management (DRM) software that publishers attach to their products is a criminal offence. If that software disappears, the material will no longer be accessible. In 2010 the United States Copyright Office exempted publishers of online-only works from the duty of depositing a copy with the Library of Congress unless specifically requested. National libraries have the right to demand a copy of every printed book published on their territory (and they receive huge quantities of other documents too). But they have no mandate to collect the software or smartphone apps without which much electronic data remains encrypted gibberish.

Regulators are pondering the problem. In early May America’s Copyright Office will hold public hearings to discuss exemptions to the ban on circumventing DRM. In Britain the government wants to make it compulsory for publishers, including software-makers, to provide the British Library with a copy of the finished version of everything they produce within a month of publication. The proposed law will allow the library to harvest web pages and material hidden behind paywalls or login requirements. The sole exceptions are social networks and sites comprising only video or music.

Copy me in on the costs

Publishers complain that this will be costly, at a time when the industry is struggling to stay afloat. They fear that library access will compete with commercial sales and that providing copiable versions of their products will encourage piracy.

These complaints look a bit overblown. Libraries would not make digital material available to everyone, but only to users actually in the library building. The proposed British regulation even allows publishers to request that their materials be kept under wraps for three years—a concession rarely granted for print works. For centuries libraries have provided public access to even the most expensive books and journals. The principle is worth maintaining in the digital age, too.

The stakes are high. Mistakes 30 years ago mean that much of the early digital age is already a closed book (or no book at all) to historians. Without a wider mandate for libraries, giving them the right to store both digital material and the tools needed to open it, historians of the future will be unable to reconstruct our times. They may not even know what they have lost.

***

Digital archiving – History flushed

IN 1086 William the Conqueror completed a comprehensive survey of England and Wales. “The Domesday Book”, as it came to be called, contained details of 13,418 places and 112 boroughs—and is still available for public inspection at the National Archives in London. Not so the original version of a new survey that was commissioned for the 900th anniversary of “The Domesday Book”. It was recorded on special 12-inch laser discs. Their format is now obsolete.

The digital era brought with it the promise of indefinite memory. Growing computing power and disk space, combined with falling costs, were supposed to make it possible to store anything born digital for ever. But digital data often has a surprisingly short life. “If we’re not careful, we will know more about the beginning of the 20th century than the beginning of the 21st century,” says Adam Farquhar, who is in charge of the British Library’s digital-preservation efforts.

The most obvious problems for digital archivists have to do with hardware, but they are also the easiest to fix. Many archives replace their data-storage systems every three to five years to guard against obsolescence and decay. This is not as expensive as it sounds: hard drives are cheap and reliable. The threat of hardware failure is overcome by keeping copies in different places. The British Library has storage sites in London, Yorkshire, Wales and Scotland.
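In practice, “keeping copies in different places” is policed with fixity checks: checksums computed when an object enters the archive and re-verified on every copy thereafter. Below is a minimal sketch of such a check in Python, using the standard hashlib module; the three mount points are hypothetical stand-ins for an archive’s separate storage sites.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute a SHA-256 checksum, reading in chunks so the whole file
    need not sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def replicas_match(replicas: list[Path]) -> bool:
    """Return True if every copy of an archived object has the same checksum."""
    return len({sha256_of(p) for p in replicas}) == 1

# Hypothetical paths to the same object held at three storage sites
copies = [Path("/mnt/london/item-0001.tif"),
          Path("/mnt/yorkshire/item-0001.tif"),
          Path("/mnt/wales/item-0001.tif")]
# replicas_match(copies)  # False would mean at least one copy has decayed
```

A mismatch does not by itself say which copy rotted; archives typically keep the checksum recorded at ingest as the reference, so any later deviation can be pinned on a specific replica.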

Collecting digital material is trickier, particularly online. Archivists can only harvest the parts of the web that are freely accessible. Anything requiring user input—passwords, searches, forms—is off-limits. Streaming media, such as online video, are hard to capture.
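What “freely accessible” means is visible even in a toy harvester: a plain HTTP GET with no credentials, cookies or form data. Here is a sketch using only Python’s standard library; the URL is illustrative, and real crawlers such as the Internet Archive’s open-source Heritrix add link-following, politeness delays and WARC packaging on top of this.

```python
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

def harvest(url: str, out_dir: Path) -> Path:
    """Fetch one publicly accessible page and store it with a capture timestamp."""
    with urllib.request.urlopen(url, timeout=30) as response:
        body = response.read()  # pages behind logins or forms never get this far
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    out = out_dir / f"capture-{stamp}.html"
    out.write_bytes(body)
    return out

# harvest("https://example.com/", Path("."))  # illustrative URL
```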

Changes in software and file formats create more hurdles. “Many of the digital objects we create can only be rendered by the software that created them,” says Vint Cerf, a pioneer of the internet who now works for Google. If the original program has gone, an archive of mint-condition files can be useless. By the time software is more than a decade old, running it usually requires hardware emulation—essentially fooling programs into thinking that they are running on old hardware.
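Working out which software a file needs starts with working out what the file is, since names and extensions routinely lie or go missing. Format-identification tools such as DROID match files against a registry of “magic number” signatures; the sketch below shows the idea with a deliberately tiny, illustrative signature table.

```python
# A few well-known leading-byte signatures; real registries such as
# PRONOM describe thousands of formats in far more detail.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP container (also DOCX, EPUB, JAR...)",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF89a": "GIF image",
    b"\xff\xd8\xff": "JPEG image",
}

def identify(path: str) -> str:
    """Guess a file's format from its first bytes, ignoring its name."""
    with open(path, "rb") as f:
        head = f.read(16)
    for magic, label in SIGNATURES.items():
        if head.startswith(magic):
            return label
    return "unknown format"
```

Identification only tells an archivist which renderer or emulator to look for; it does nothing to guarantee that such software still exists.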

Although technical problems can usually be solved, regulatory obstacles are harder to overcome. Laws force copyright libraries, such as the Library of Congress, to seek permission before archiving a website. Regulation can be even more damaging when it comes to preserving such things as computer programs, games, music and books. These often come with digital-rights management (DRM) software to protect them against piracy. Archivists who want to circumvent such programs can find themselves on the wrong side of the law. America’s Digital Millennium Copyright Act (DMCA) makes such circumvention a criminal offence.

Copyright and DRM will loom even larger as the nature of information systems evolves. The original internet was by default an open environment, making copying easy. The mobile world, with its widely popular smartphone apps, is much less so. As companies more fiercely protect their wares, contemporary digital artefacts run the risk of never being archived. Libraries have no mandate to collect apps, such as Angry Birds or Instagram, which form part of popular culture.

Despite all these difficulties, the world’s libraries have tried for over a decade to conserve some aspects of their national digital heritage. America’s Library of Congress started its digital-preservation programme in 2000 with $100m from the government. Its web archive currently stands at around 10,000 sites, many of them owned by the American government, and therefore exempt from copyright. Privately run sites are more difficult to include. For some archiving projects, only a fifth of webmasters reply to e-mails seeking permission for a copy.

Digital pack rats

Following the Library of Congress, most national libraries in rich countries now have some sort of digital-archiving programme. In Britain, for instance, the National Archives keeps copies of all government websites. The British Library is archiving all British online material.

Yet the best-known digital-preservation effort is the Internet Archive, a private non-profit organisation. Its servers are home to the Wayback Machine, a popular web service that lets users see how a website looked on specified dates in the past. Founded by Brewster Kahle in 1996, the Internet Archive collects, stores and provides access to billions of web pages, as well as other digital media such as books, video and software. The collection stands at roughly 160 billion web pages. It operates on the principle that it is better to seek forgiveness than to ask permission.
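The Wayback Machine also exposes a simple public endpoint that reports the archived snapshot closest to a requested date. A sketch of querying it with Python’s standard library follows; the JSON field names match the API’s documented shape, but treat them as an assumption to verify against the current documentation.

```python
import json
import urllib.parse
import urllib.request

def closest_snapshot(url: str, timestamp: str) -> str | None:
    """Return the archived URL of the snapshot of `url` nearest `timestamp`
    (YYYYMMDD), or None if the Wayback Machine never captured the page."""
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    api = f"https://archive.org/wayback/available?{query}"
    with urllib.request.urlopen(api, timeout=30) as response:
        data = json.load(response)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

# closest_snapshot("economist.com", "20120428")  # illustrative query
```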

More recently, geeks have rushed in where official agencies fear to tread. They have always been pack rats. Today they gather on websites such as TOSEC (short for “The Old School Emulation Centre”) to collect old software. But these collections have their own limitations. They focus heavily on games and operating systems; people tend not to have the same nostalgia for early versions of spreadsheet applications as they do for Super Mario Bros. More important, the material is very much under copyright.

Despite the proliferation of archives, digital preservation is patchy at best. Until the law catches up with technology, digital history will have to be written in dribs and drabs rather than the great gushes promised by the digital age.