Getting Started with Web Archiving 

What Is Archive-It?

Archive-It is a paid subscription service, offered by the Internet Archive, that allows institutions to build and preserve collections of digital content.

How Does Archive-It Work?

Web archiving is the targeted harvesting of Web-based content for archival and preservation purposes. At its core, Archive-It relies on Heritrix, Java-based software described as an “open-source, extensible, Web-scale, archival-quality” Web crawler. The crawler harvests content automatically, beginning from one or more specific Web sites, or “seeds.” The crawl follows links, harvesting and saving content such as text, audiovisual materials, and site style sheets. Related harvested content is stored together in .WARC files. The .WARC file format is a publicly documented, open standard used to wrap and aggregate related Web content together with its associated information or metadata. For more information on how Archive-It works, please refer to Archive-It’s About Archive-It APIs and access integration.
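The seed-and-follow-links idea described above can be illustrated with a minimal sketch. This is not Heritrix or Archive-It code. It is a toy breadth-first crawl written in Python, and the names crawl, LinkExtractor, fetch, and TOY_WEB are hypothetical. A small in-memory dictionary stands in for the live Web so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect the href/src targets found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first harvest starting from the seed URLs.

    `fetch` is a hypothetical callable that returns the text of a URL,
    or None when the URL cannot be retrieved.
    """
    queue = deque(seeds)
    captured = {}  # URL -> saved content
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        if url in captured:
            continue  # already harvested
        body = fetch(url)
        if body is None:
            continue  # nothing to save for this URL
        captured[url] = body
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            # Resolve relative links against the current page and drop fragments.
            absolute, _fragment = urldefrag(urljoin(url, link))
            if absolute not in captured:
                queue.append(absolute)
    return captured


# A tiny in-memory "web" stands in for real HTTP requests (assumption).
TOY_WEB = {
    "http://example.org/": '<a href="/about.html">About</a><link href="style.css">',
    "http://example.org/about.html": '<a href="/">Home</a><img src="logo.png">',
    "http://example.org/style.css": "body { color: black; }",
}

captured = crawl(["http://example.org/"], TOY_WEB.get)
print(sorted(captured))
```

Starting from the single seed, the sketch captures the home page, the linked page, and the style sheet. The image link is discovered but not captured because the toy web has no content for it. A production crawler would also write each capture into a .WARC file and apply scoping and politeness rules, which this sketch leaves out.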

Archive-It Limitations

Due to technical limitations, the exact content and appearance of every site on the Web may not be preserved. The most reliable captures are generally of static HTML sites whose pages contain text and images and whose constituent files all reside on a single host server and domain. Web crawling software generally has the most difficulty with:

  • Dynamically created pages: pages generated by, or dependent on, dynamic scripts or applications such as JavaScript or Adobe Flash
  • Password-protected material
  • Forms or database-driven content that requires interaction with the live host site
  • Content excluded by a site’s robots.txt file
  • Multimedia: streaming media players with video or audio content
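To make the robots.txt limitation concrete, here is a small sketch of how a crawler can check a URL against a site’s exclusion rules. It uses Python’s standard urllib.robotparser module. The robots.txt content, the URLs, and the user-agent name "example-crawler" are made up for illustration and do not describe how Archive-It itself is configured.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that a site owner might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-crawler"):
    """Return True when the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("http://example.org/index.html"))           # True
print(allowed("http://example.org/private/report.html"))  # False
```

A crawler that honors these rules skips everything under /private/, so that content would be missing from the resulting capture.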

Getting to Work with Archive-It

To work with the University of Illinois Urbana-Champaign web archives in Archive-It, you must first be trained by someone in the preservation service. To set up a training session, please email webarchives@library.illinois.edu.