Computer Science Faculty Research and Publications


Using the Web Infrastructure to Preserve Web Pages

Document Type


Publication Title

International Journal on Digital Libraries

Publication Date






First Page


Last Page



To date, most of the focus in digital preservation has been on replicating copies of the resources to be preserved from the “living web” and placing them in an archive for controlled curation. Once inside an archive, the resources are subject to careful processes of refreshing (making additional copies to new media) and migrating (conversion to new formats and applications). For small numbers of resources of known value, this is a practical and worthwhile approach to digital preservation. However, due to the infrastructure costs (storage, networks, machines) and, more importantly, the human management costs, this approach is unsuitable for web-scale preservation. The result is that difficult decisions need to be made as to what is saved and what is not. We provide an overview of our ongoing research projects that focus on using the “web infrastructure” to provide preservation capabilities for web pages, and we examine the overlap these approaches have with the field of information retrieval. The common characteristic of the projects is that they creatively employ the web infrastructure to provide shallow but broad preservation capability for all web pages. These approaches are not intended to replace conventional archiving approaches; rather, they focus on providing at least some form of archival capability for the mass of web pages that may prove to have value in the future. We characterize the preservation approaches by the level of effort required by the web administrator: web sites are reconstructed from the caches of search engines (“lazy preservation”); lexical signatures are used to find the same or similar pages elsewhere on the web (“just-in-time preservation”); resources are pushed to other sites using NNTP newsgroups and SMTP email attachments (“shared infrastructure preservation”); and an Apache module is used to provide OAI-PMH access to MPEG-21 DIDL representations of web pages (“web server enhanced preservation”).
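
As a rough illustration of the lexical-signature idea behind “just-in-time preservation”, the Python sketch below picks the few terms of a page that score highest under TF-IDF, which could then be submitted to a search engine as a query for the same or similar pages. It is not the authors' implementation: the function names, the small in-memory corpus, the smoothed IDF formula, and the default of five terms are all assumptions made for the example. A real system would likely need document frequencies from a far larger collection than the one used here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page, corpus, k=5):
    """Return the k terms of `page` with the highest TF-IDF score.

    `corpus` is a list of document strings used to estimate how common
    each term is (hypothetical stand-in for web-wide statistics).
    """
    term_counts = Counter(tokenize(page))
    doc_terms = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus)

    scores = {}
    for term, count in term_counts.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for terms in doc_terms if term in terms)
        # Smoothed IDF (an assumption), so unseen terms do not divide by zero.
        idf = math.log((n_docs + 1) / (df + 1))
        scores[term] = count * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

Terms that appear in every document of the corpus receive an IDF of zero and fall to the bottom of the ranking, so the signature favors words that are frequent in the page but rare elsewhere.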


The final publication is available at Springer via

Copyright held by