Web Archive Collection Service - Harvard University Library

The public interface for Harvard's new Web Archive Collection Service (WAX) launched on February 4, 2009. WAX began as a pilot project in July 2006, funded by the University's Library Digital Initiative (LDI) to address the management of web sites by collection managers for long-term archiving. It was the first LDI project specifically oriented toward preserving "born-digital" material. WAX has now transitioned to a production system supported by the University Library's central infrastructure.

Collection managers, working in the online environment, must continue to acquire the content that they have always collected physically. With blogs supplanting diaries, e-mail supplanting traditional correspondence, and HTML materials supplanting many forms of print collateral, collection managers have grown increasingly concerned about potential gaps in the documentation of our cultural heritage.

WAX was developed as an initial--and only partial--response to these and other concerns, which range from technical feasibility to legal and financial implications. The pilot focused on harvesting content from the surface web--content that is discoverable to search engines through web crawlers, as opposed to content hidden from web crawlers in a database or restricted by password or login protection.

The WAX pilot was designed to address the capture, management, storage, and display of web sites for long-term archiving. It was a collaboration of the University Library's Office for Information Systems with three University partners, each fielding a single project: the Harvard University Archives (Harvard University Library); the Arthur and Elizabeth Schlesinger Library on the History of Women in America (Radcliffe Institute for Advanced Study); and the Edwin O. Reischauer Institute of Japanese Studies (Faculty of Arts and Sciences, with sponsorship from Harvard College Library).

The WAX system was built using several open source tools developed by the Internet Archive and other International Internet Preservation Consortium (IIPC) members. These IIPC tools include the Heritrix web crawler; the Wayback index and rendering tool; and the NutchWAX index and search tool. WAX also uses Quartz open source job scheduling software from OpenSymphony.

The WAX Crawler and Robots.txt Files

Our crawler identifies itself with the user-agent name: hul-wax

Our crawler obeys all common instructions in robots.txt files. You may specifically instruct our crawler to harvest, or not to harvest, material from your site by adding an entry for it to your robots.txt file. The robots.txt file must be placed at the root of your server. More information about robots.txt files can be found at: http://www.robotstxt.org/robotstxt.html.
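As a small illustration of the "root of your server" requirement, the sketch below derives the location where a crawler would look for robots.txt from any page URL on a site. The function name robots_txt_url is hypothetical, and the example URL is the placeholder address used elsewhere on this page; this is not part of WAX itself.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://www.school.edu/faculty"))
    # prints http://www.school.edu/robots.txt
```

A robots.txt file stored anywhere else, for example inside a subdirectory, would not be found at this location.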

Allowing the WAX Crawler

The following text added to the robots.txt file will allow our harvester to crawl your web site (the empty Disallow line excludes nothing):
  User-agent: hul-wax
  Disallow:

Prohibiting the WAX Crawler

The following text added to the robots.txt file will prevent our harvester from crawling your web site:
  User-agent: hul-wax
  Disallow: /
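
Site owners who want to check their rules before publishing them could test them locally. The following is a minimal sketch using Python's standard urllib.robotparser module. It shows how a standards-following parser interprets the two rule sets, not how the WAX crawler is implemented. The helper can_crawl, the "other-bot" agent, and the page URL are illustrative assumptions, and the permissive rule set uses an empty Disallow line.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Rule set that permits the hul-wax user-agent everywhere.
ALLOW_RULES = """\
User-agent: hul-wax
Disallow:
"""

# Rule set that excludes the hul-wax user-agent from the whole site.
BLOCK_RULES = """\
User-agent: hul-wax
Disallow: /
"""

PAGE = "http://www.school.edu/faculty"  # placeholder page URL

if __name__ == "__main__":
    print(can_crawl(ALLOW_RULES, "hul-wax", PAGE))    # True
    print(can_crawl(BLOCK_RULES, "hul-wax", PAGE))    # False
    print(can_crawl(BLOCK_RULES, "other-bot", PAGE))  # True: rule names only hul-wax
```

Because each rule set names only hul-wax, other crawlers are unaffected by it.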

Information for Copyright Holders

If you own or control copyrighted content available in WAX and wish it to be taken down, please let us know. To make a takedown request or inquire about inclusion of your content in WAX, go to Questions and Comments. Please identify in your submission the URL(s) of the web page(s) carrying your content, the date(s) and time(s) of archiving, the specific content on the page(s) to which you claim rights, and the nature of your rights, e.g.:

http://www.school.edu/faculty archived January 1, 2009 at 12:00 AM, photograph of teachers, creator Jane Doe, photograph registered for copyright.