Final Report: Enhancing Library Contributions to the HathiTrust (Library Innovation Fund ACTY60)
Kyle Rimkus
August 27, 2013
Table of Contents
Overview
Summary of Work Accomplished
Technical Work
Policy Work
Policy for Use of HathiTrust for Preservation and Access to Non-Unique Book Content
Financial Summary
Next Steps
Appendix I: Project Proposal (Enhancing Library Contributions to the HathiTrust: An Innovation Funding Proposal)
PROBLEM STATEMENT
BACKGROUND
PROPOSED USE OF INNOVATION FUNDING
BENEFITS
ALIGNMENT WITH LIBRARY PRIORITIES
BUDGET
TIMELINE
Overview
In October 2012, Kyle Rimkus, MJ Han, Betsy Kruger, and Tom Habing submitted an Innovation Fund request to develop internal tools for improving HathiTrust ingest workflows. The authors recognized the need to stimulate the contribution of locally digitized book content to HathiTrust, which had stalled because UIUC JPEG2000 book image files were consistently rejected by HathiTrust ingest tools. Innovation funds were used to analyze the technical problems behind these rejections and to develop tools that reduce barriers to ingest.
Summary of Work Accomplished
Technical Work
Technical work began later than scheduled because of difficulty hiring a graduate hourly programmer. Instead of beginning in November 2012, the project began in earnest in March 2013, when Haruit Kumar joined as a graduate hourly programmer.
All locally developed scripts are available in a restricted Bitbucket repository at https://bitbucket.org/hkumar3/htfeeduiuc. Over the past six months, Haruit, under the direction of Kirk Hess and Kyle Rimkus and with input from the UIUC HathiTrust Users Group and HathiTrust staff at the University of Michigan (Jeremy York, Aaron Elkiss), has made the following progress:
- HathiTool customized to run for UIUC books with the following refinements:
  - Convert tool to an Eclipse project
  - Modify for local debugging
  - New Remediation stage added: fixes issues with image files
  - New PdfToText stage
  - Volume/SourceMets stage changes
  - Changes for file naming conventions
  - Change MARC file with 955 tag included (create a copy)
- Deployed tools on local server: hathi.library.illinois.edu
- Modified “CreateDirectory script”: because the tool runs for one book at a time, the new wrapper creates book directories following the structure that HathiTool expects (it has to be changed for each book class [unica, brittle])
- Run Feed Script: runs the tool recursively over these directories
- Remediation progress:
  - The CreateDirectory script has been used to remediate 268 Unica books (multiple-volume books have to be identified individually and remediated manually).
  - Illinois Chemist books have been remediated.
  - Brittle Books scripts are still a work in progress.
  - Remediated books are hosted prior to HathiTrust ingest on http://hathi.library.illinois.edu/feed/
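The CreateDirectory and Run Feed steps listed above can be pictured with a short sketch. This is an illustration only, not the project's code (which lives in the restricted repository): the function names, the one-directory-per-book layout, and the `process_book` callback standing in for a HathiTool run are all assumptions made for the example.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def stage_book(source_dir: Path, staging_root: Path, book_id: str) -> Path:
    """Copy one book's files into its own directory under staging_root.

    The layout (staging_root/book_id/files) is hypothetical; the real
    wrapper follows whatever structure HathiTool expects per book class.
    """
    target = staging_root / book_id
    target.mkdir(parents=True, exist_ok=True)
    for item in sorted(source_dir.iterdir()):
        if item.is_file():
            shutil.copy2(item, target / item.name)
    return target


def find_book_dirs(staging_root: Path) -> List[Path]:
    """Return every directory under staging_root that directly holds files."""
    return sorted(
        d for d in staging_root.rglob("*")
        if d.is_dir() and any(p.is_file() for p in d.iterdir())
    )


def run_feed(staging_root: Path, process_book: Callable[[Path], None]) -> List[str]:
    """Call process_book once per staged book and return the processed IDs.

    process_book is a placeholder for a single-book tool invocation.
    """
    processed = []
    for book_dir in find_book_dirs(staging_root):
        process_book(book_dir)
        processed.append(book_dir.name)
    return processed
```

Keeping the per-book step behind a callback mirrors the constraint noted above: the underlying tool handles one book at a time, so the wrapper's only job is to prepare directories and iterate over them.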
Policy Work
Technical progress was complemented by significant progress in HathiTrust policy development. The Library formed a HathiTrust Users Group (https://wiki.cites.illinois.edu/wiki/display/libemployees/HathiTrust+at+UIUC) under the umbrella of the Digital Repository Advisory Group to advise on local HathiTrust needs. Most importantly, the Library approved a clear policy on the use of HathiTrust as a preservation repository for specific types of materials (reproduced below; also available at https://wiki.cites.illinois.edu/wiki/display/LibraryDigitalPreservation/Policy+for+Use+of+HathiTrust+for+Preservation+and+Access+to+Non-Unique+Book+Content).
Policy for Use of HathiTrust for Preservation and Access to Non-Unique Book Content
The HathiTrust Digital Library will serve as the preservation repository for most non-unique book-like content that is digitized by the University Library whether in house or outsourced and that has an OCLC number (a requirement of HathiTrust). In addition, the Library will no longer store or provide access to the access copies of this material, but will instead link to the copies held by the HathiTrust and/or the Internet Archive (or other access mechanisms that become available). We will provide links to these copies via URLs in the catalog records, as well as through a splash page containing the metadata and the links. The Preservation Unit will periodically ensure that the HathiTrust is continuing to perform the preservation and access activities to which it has committed.
Exceptions to this policy include:
- Material digitized from the Rare Book and Manuscript Library: it is important that the uncropped masters be retained by the University Library;
- Material that was published by the University of Illinois (such as technical reports, etc.): an access copy should be retained by IDEALS and/or the University Archives depending on the material;
- Material that has special access requirements or is part of a specific project: Emblem books, for example;
- Reformatted Brittle Books materials published post-1923 or otherwise protected by copyright restrictions;
- Other material as considered on a case by case basis by a sub-group of the Digital Library Access, Repository, and Scholarly Communication Services Advisory Group.
Note that these exceptions may change as the HathiTrust updates their policies and procedures. For example, if the HathiTrust allows the deposit of uncropped archival masters, we may reconsider depositing those from RBML into the HathiTrust.
Financial Summary
Financial codes
- ACTY60: Increasing Library Contributions to the HathiTrust
- BANNER FUND Number: 1 200250 600000 600051
Project funds were spent exclusively to fund the time of Haruit Kumar, a graduate programmer, who created the scripts and tools described above.
Original budget:
dollars/hour | hours/week | total weeks | TOTAL |
$19.47 | 20 | 22 | $8566.80 |
Actual expenditures:
Budget | Funds spent | Balance |
$8566.80 | $7534.89 | $1031.91 |
Next Steps
Refinements to the ingest process are ongoing. Future steps include:
- Running remediation scripts on all locally digitized legacy content identified for HathiTrust ingest and contributing that content to the HathiTrust Digital Library
- Developing a more robust workflow tracking tool for improved file management (as outlined here: https://wiki.cites.illinois.edu/wiki/display/libemployees/workflow+tracking)
Fortunately, the Library has identified Provost/IT Fee funds to be spent in FY2014 to support continued efforts in streamlining HathiTrust ingest. Haruit Kumar is currently employed at 20 hours a week to meet these needs.
Appendix I: Project Proposal (Enhancing Library Contributions to the HathiTrust: An Innovation Funding Proposal)
Kyle Rimkus, MJ Han, Betsy Kruger, Tom Habing
September 14, 2012
PROBLEM STATEMENT
The University of Illinois Library’s book digitization efforts lack effective tools for contributing locally digitized content to the HathiTrust. Hundreds of books digitized under the supervision of Digital Content Creation intended for contribution to the HathiTrust are sitting on local servers with no clear workflow for moving them into the HathiTrust. Likewise, the Brittle Books program in Preservation has been unable to contribute content to the HathiTrust, despite strong interest in restructuring current workflows to rely on it as a key pillar of local content preservation and access strategies.
BACKGROUND
On July 2, 2012, a subgroup of the Digital Library Access, Repository, and Scholarly Communications Services Advisory Group consisting of Tim Cole, Bill Ingram, Betsy Kruger, Michael Norman, Kyle Rimkus, and Sarah Shreeves met to recommend policies for how best to utilize the HathiTrust’s access and digital preservation services within the context of the library’s broader digital content management strategies. Action items from this meeting included the following:
- establish a regular audit workflow to ensure archival masters are appearing in the HathiTrust
- change current workflow so that archival masters (excluding Rare Books and Manuscript Library and other unique materials) are no longer downloaded from the Internet Archive in order to rely on the HathiTrust as a digital preservation repository for our Internet Archive digitized book collections
- establish a regular workflow to send archival masters for our locally digitized and Brittle Books program to HathiTrust
- establish a regular workflow to download uncropped archival masters for special collections material digitized by the Internet Archive
- improve workflow to update catalog records with links to the Internet Archive and HathiTrust
- delete archival masters for those items that are now in Hathi and for which we do not need a local copy
On September 13, 2012, Kyle Rimkus convened a meeting of HathiTrust users to discuss, among other things, progress on the items above. This group included Betsy Kruger, Michael Norman, Rimkus, MJ Han, Annette Morris, Gary Maixner, William Weathers, Mike Tang, and Kirk Hess. The group concluded that insufficient progress had been made on these tasks. In addition, Digital Content Creation and Brittle Books representatives confirmed that much of their work intended for Hathi — hundreds of volumes of locally digitized content, in fact — is sitting on local servers with no clear workflow for moving content into the HathiTrust, and has been, in many cases, for at least a year’s time.
This is due to three factors:
- lack of tools and support staff dedicated to coordinating and tracking complex workflows
- lack of integration of existing tools for packaging digitized book content for contribution to the HathiTrust
- lack of coordination of staff with available time for contribution to HathiTrust activities
PROPOSED USE OF INNOVATION FUNDING
We are proposing an Innovation grant to stimulate University of Illinois contributions of content to the HathiTrust. We will hire a graduate student, preferably in Computer Science, to work twenty hours a week over the Fall and Spring semesters. This student will, under the supervision of Kyle Rimkus, MJ Han, Tom Habing, and Betsy Kruger, write scripts and develop a web-based management tool to facilitate contributing locally created materials into the HathiTrust. This includes:
- identifying, packaging, and delivering digitized content with metadata to the HathiTrust
- tracking the status of all files and file packages ingested into the HathiTrust (to include the deletion of staged files)
- modifying workflows for IA digitized books for which it is our policy to retain archival files
- improving workflows for updating catalog records to improve user experience of patrons using the OPAC
- integrating all tool and file management development, when applicable, with the UIUC Library’s Medusa digital preservation repository
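As a rough illustration of the tracking item in the list above, the sketch below models a package moving through a fixed sequence of steps and allows staged files to be deleted only once ingest is confirmed. The status names, the `PackageRecord` class, and the rule itself are hypothetical choices for the example, not a description of any tool the Library has built.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical workflow steps for one file package."""
    IDENTIFIED = 1
    PACKAGED = 2
    DELIVERED = 3
    INGESTED = 4
    STAGED_FILES_DELETED = 5


@dataclass
class PackageRecord:
    package_id: str
    status: Status = Status.IDENTIFIED
    history: list = field(default_factory=list)

    def advance(self, new_status: Status) -> None:
        """Move to the next step; refuse to skip, repeat, or go backwards."""
        if new_status != self.status + 1:
            raise ValueError(
                f"{self.package_id}: cannot go from "
                f"{self.status.name} to {new_status.name}"
            )
        self.history.append(self.status)
        self.status = new_status


def safe_to_delete(record: PackageRecord) -> bool:
    """Staged files may be removed only after ingest has been confirmed."""
    return record.status == Status.INGESTED
```

Recording each prior status in `history` gives a simple audit trail per package, which is the kind of information a web-based management tool would need to display.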
This student will report to Tom Habing in the Library’s Software Development Group and will follow the direction of key stakeholders MJ Han and Annette Morris under the guidance of project leaders Betsy Kruger and Kyle Rimkus.
BENEFITS
An improved HathiTrust workflow would have several important benefits. Namely, we would:
- save valuable server space by better coordinating our reliance on the HathiTrust as a trusted digital preservation repository
- increase access to current “hidden” content — that is, digitized books produced by DCC and Preservation workflows that are not currently accessible to patrons
- build simple, scalable workflows for the shared benefit of Brittle Books, Digital Content Creation, and Content Access and Management
- ensure enduring access to our work by securing the persistence of archival files in the HathiTrust
ALIGNMENT WITH LIBRARY PRIORITIES
The Library’s strategic plan explicitly mentions participation in the HathiTrust as a priority:
“Promote collaborative efforts toward accomplishing local, regional, and national goals for digital preservation programs through participation in initiatives such as the DuraSpace Foundation, ArchivesSpace, and HathiTrust.”
This project will allow the Library to reap some return on our already considerable investment in the HathiTrust by allowing us to rely on its services as an essential component of our digital preservation, access, and file management practices for digitized books.
BUDGET
We are proposing to hire a programmer, preferably a Master's student in Computer Science, at the Library’s graduate hourly rate of $19.47/hour for 20 hours a week, from the remainder of the Fall semester through the end of the Spring semester. This comes to 440 hours, or $8,566.80.
dollars/hour | hours/week | total weeks | TOTAL |
$19.47 | 20 | 22 | $8566.80 |
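The budget total can be checked with a few lines of arithmetic. The snippet below uses only the rate, hours, and weeks stated in the table; the variable names are illustrative, and `Decimal` is used so the cents are exact.

```python
from decimal import Decimal

HOURLY_RATE = Decimal("19.47")  # dollars per hour, from the table above
HOURS_PER_WEEK = 20
TOTAL_WEEKS = 22

total_hours = HOURS_PER_WEEK * TOTAL_WEEKS
total_cost = HOURLY_RATE * total_hours

print(total_hours)            # 440
print(f"${total_cost:,.2f}")  # $8,566.80
```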
TIMELINE
This project will begin in October 2012 and will end at the close of the Spring semester in May 2013.