Local Web Archiving Policies

Collection Development Policy: 

Web archives projects are intended to strengthen the library’s research resources and should therefore reflect the University Library mission and the policies established in its collection development statements. Web sites are selected for harvest to bolster, complement, and parallel existing library collections; meet administrative documentation retention requirements; assist researchers; and capture ephemeral materials valuable to subject specialties, including grey literature, blog posts, and other relevant but vulnerable content.

Unit Responsibilities: 

When creating a collection, units must be open to discussing the roles and responsibilities of Web archives administration and maintenance, such as who will be responsible for creating metadata and performing quality assurance.

Archive-It Access: 

To crawl seeds and add metadata and descriptions to the University of Illinois Urbana-Champaign web collections, one must be trained by Preservation Services staff. To set up a training session, please email webarchives@library.illinois.edu.

Intellectual Property/Copyright:

Copying materials is an inherent part of the web archiving process. Intellectual property and copyright are areas where Web archivists should be especially careful to respect the rights of rights holders without limiting the rights of libraries and archives to preserve important historical content.

robots.txt Exclusions:

Webmasters use robots.txt files, also known as the Robots Exclusion Standard, to tell Web crawlers which parts of a site they may crawl. A robots.txt file can block specific files, directories, or even an entire site from harvest by a Web crawler. Webmasters may implement robots.txt exclusions for any number of reasons, such as ensuring optimal server performance or protecting privacy.

Archive-It’s web crawler honors all robots.txt exclusion requests by default; however, the crawler can be configured to ignore these blocks in specific cases.

A robots.txt file is always located at the topmost level of a website, and the file itself is always named robots.txt. To determine whether a crawl may be blocked, view a site’s robots.txt file by adding “/robots.txt” to the site’s top-level address.
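For illustration only, the Python standard library can check whether a given page would be blocked for a particular crawler. This is a minimal sketch, not part of Archive-It; the site address and user-agent string below are placeholders.

    # Minimal sketch: check a site's robots.txt with Python's standard library.
    # The site address and user-agent string are hypothetical placeholders.
    from urllib.robotparser import RobotFileParser

    SITE_ROOT = "https://www.example.edu"      # topmost level of the site
    CRAWLER_AGENT = "example-crawler"          # user agent to test against

    # The robots.txt file always lives at the root of the site.
    parser = RobotFileParser()
    parser.set_url(SITE_ROOT + "/robots.txt")
    parser.read()

    # Check whether a specific page would be blocked for this crawler.
    page = SITE_ROOT + "/departments/reports/annual-report.pdf"
    if parser.can_fetch(CRAWLER_AGENT, page):
        print("robots.txt allows crawling:", page)
    else:
        print("robots.txt blocks crawling:", page)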

Storage and Contingency Planning:

Archive-It crawls are stored in the WARC (Web ARChive) file format, an ISO standard (ISO 28500) for storing content harvested from the World Wide Web. Archive-It’s primary crawler, Heritrix, and the Wayback Machine viewing software are open-source tools supported by an international community of institutions.
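WARC files can also be inspected locally with open-source tools. The sketch below assumes the third-party warcio library (a common choice, not a tool named in this policy) and uses a placeholder filename to list the URLs captured in a file.

    # Minimal sketch, assuming the third-party "warcio" library is installed
    # (pip install warcio); the filename below is a placeholder.
    from warcio.archiveiterator import ArchiveIterator

    with open("example-crawl.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            # "response" records hold the harvested page content.
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"))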

Content files are hosted on servers at the Internet Archive in San Francisco. A copy of Archive-It data is hosted and stored in a secure, controlled-access facility in Richmond, California, and mirrored in additional Internet Archive repositories. In addition, a dark copy of the Archive-It repository is replicated for preservation purposes at a university in the Eastern United States.  

Note that at this time the Archives do not maintain a local copy of WARC files.

Password-Protected Content: 

You can capture password-protected pages if the crawler is provided with login credentials for the site. Some login pages work differently from others and may be difficult to capture; if you encounter problems with password-protected sites, please contact Archive-It Support.

To capture a password-protected site, add the login screen of the site you wish to capture as the seed URL. Under the Seeds tab, check the box next to the login URL and click the Edit Settings button. In the Edit Seed Settings dialog box, enter the page’s Login Name and Login Password, then click the Apply button.
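As a general illustration of why the crawler needs these credentials (this sketch is not Archive-It’s implementation; the login URL and form field names are hypothetical), a crawler typically submits the login form first and reuses the authenticated session for subsequent requests.

    # Generic illustration only -- not Archive-It's implementation. The login
    # URL and form field names ("username", "password") are hypothetical.
    import requests

    session = requests.Session()
    session.post(
        "https://www.example.edu/login",   # the site's login screen (seed)
        data={"username": "crawler-account", "password": "example-password"},
    )

    # Once authenticated, the same session can fetch pages behind the login.
    response = session.get("https://www.example.edu/members/newsletter.html")
    print(response.status_code)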