The INP and Newspaper Digitization at Illinois

Newspaper digitization at the University of Illinois began in 2004, under the leadership of Professor Mary Stuart, History Librarian and head of the History and Philosophy Library. At the time, Stuart was developing a plan to merge the History and Philosophy Library with the Newspaper Library, to form a single unit: the History, Philosophy, and Newspaper Library (HPNL). As part of the proposed merger, the Illinois Newspaper Project (INP) would be brought under the umbrella of the new unit, with Stuart becoming the project’s Principal Investigator. While developing the proposal for the new unit, Stuart imagined that newspaper digitization would be a logical outgrowth of the INP. To lay the groundwork for this future program, Stuart created the position of Research Information Specialist for the new unit. One of the responsibilities of the Research Information Specialist would be to provide technical support for the unit’s newspaper digitization program.

By the time the new unit opened in 2005, it was already clear that newspaper digitization would transform historical research, and that public sector institutions had a role to play alongside the major private sector companies like ProQuest, which had unveiled its Historical New York Times and Historical Wall Street Journal just a few years earlier. In 2005, the National Digital Newspaper Program (NDNP) awarded its first two-year cycle of grants to Virginia, California, Kentucky, New York, and Florida. Each state recipient digitized 100,000 pages of newspapers as part of the program. In 2007, the NDNP awarded a second two-year cycle of grants and also unveiled Chronicling America, the freely available online collection of newspapers digitized by the NDNP partners. Meanwhile, the Brooklyn Public Library had digitized the Brooklyn Daily Eagle (1841-1902), and although the newspaper itself held limited interest for most researchers, the technology demonstrated exciting possibilities for the future of newspaper digitization in public institutions. Around the same time, the Colorado State Library and the Colorado Historical Society jointly received a Library Services and Technology Act (LSTA) grant to digitize Colorado newspapers from 1859 to 1930. The University of Utah began digitizing Utah newspapers from the 1850s to the 1960s, and newspaper digitization projects were forming at Pennsylvania State University, Virginia, and Kentucky, to name just a few of the larger programs.

Newspaper digitization at Illinois gained momentum once the new unit opened. The Research Information Specialist position was filled in October 2005, with the hiring of Nathan Yarasavage. The earliest obstacles to the creation of the program were, surprisingly, internal. There was a desire within the Library to centralize all digitization activities, and many regarded the work being done in the HPNL as anomalous at best. The choice of a delivery platform became a flashpoint for these organizational tensions. In 2005-2006, there were few platform options. The Library of Congress was developing its own open-source platform for the NDNP (what later became known as “ChronAm”), but at the time a local implementation would have been prohibitively labor- and resource-intensive, far beyond what the Library could support in-house. The two main proprietary platforms were Olive Software’s Active Paper Archive (used by the Brooklyn Public Library, Penn State, and the Colorado State Library) and CONTENTdm (used by Utah, which had a full-time programmer adapting CONTENTdm for newspapers). Olive offered the only out-of-the-box solution at the time. Even though the Library was already using CONTENTdm for some of its digitization projects, there was no interest in, or funding for, developing a newspaper application here.

Active Paper Archive from Olive Software was selected as the first platform for the digital newspaper collection, and the collection was titled the Illinois Digital Newspaper Collection (IDNC). The IDNC was unveiled to the public in 2007:

IDNC Original Interface

The HPNL’s newspaper digitization program was funded by a combination of grants, gifts, and support from the University Library. In 2006, Stuart received an LSTA grant to digitize the Urbana Daily Courier (1916-1925). Additional funding came from the Clifford Family Endowment. The launch of the Courier was held on July 28, 2007, at the Urbana Free Library (view photographs from the Urbana Courier launch event here). Many former Courier employees were present, as well as the former publisher, Byron Vedder (then in his mid-90s). In 2008, the HPNL received a Special Heritage Award from the Preservation and Conservation Association (PACA) for digitizing the Urbana Daily Courier.

In 2007, the Library Executive Committee gave the HPNL an innovation seed grant to digitize the Daily Illini (1916-1936). Additional funding for the digitization of the Daily Illini came from the Clifford Family Endowment, the Stewart Howe Foundation Endowment, and the University of Illinois Library. The launch of the digitized Daily Illini was held at the Illini Media building on April 17, 2008. (View photographs of the Daily Illini launch event here.)

Achieving a high-quality digital image depends on a number of factors. One of the most important is the quality of the originals (print or microfilm). Both the Urbana Daily Courier and the Daily Illini were digitized directly from existing negative microfilm. Unfortunately, this film was created circa 1960, before best practices for preservation microfilming were established. Consequently, the microfilm suffers from bad lighting, incorrect exposure, uneven focus, and bleed-through. Furthermore, the originals from which the film was produced were often torn, soiled, faded, or stained, and many issues were missing pages. (Microfilming practices have improved considerably in the intervening decades, and newspapers microfilmed today are of much higher quality.) Tragically, there are no known original print copies of the Urbana Courier; in order to free up space, local libraries, including the University of Illinois Library, destroyed the bound volumes of the original print newspaper after the microfilm was produced. If original newsprint were found, it could be re-filmed and re-digitized, which would be the first step toward improving the legibility of the digital files for that title. (Please contact us if you have any information about extant back files of the Urbana Courier.)

Technical specifications for the first round of digitization: microfilm for the Urbana Daily Courier and the Daily Illini was scanned and digitally converted into bitonal 300 dpi TIFF files. PDF and PNG derivatives were then created from the TIFFs. These derivatives are served over the web. The TIFF files are stored offline on DVD. The microfilm was scanned in bitonal black-and-white rather than 8-bit grayscale because bitonal scans yield higher Optical Character Recognition (OCR) accuracy rates than do grayscale scans. (OCR is the process through which scanned images of newspapers are made keyword searchable.)
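To illustrate the difference between the two capture modes, here is a minimal Python sketch that reduces 8-bit grayscale pixel values to bitonal ones with a fixed threshold. The function name `to_bitonal`, the threshold of 128, and the sample pixel values are assumptions made for illustration; the scanning process actually used for these titles is not documented here and may have worked differently.

```python
def to_bitonal(gray_rows, threshold=128):
    """Map 8-bit grayscale pixels (0-255) to bitonal values.

    Returns 0 (black) for pixels darker than the threshold and
    1 (white) for pixels at or above it. The threshold is an
    illustrative assumption, not a documented scanning setting.
    """
    return [[1 if px >= threshold else 0 for px in row] for row in gray_rows]


# A hypothetical 2x4 patch of grayscale pixels: dark ink on lighter paper.
patch = [
    [12, 200, 190, 30],
    [140, 127, 128, 255],
]

bitonal = to_bitonal(patch)
for row in bitonal:
    print(row)
```

Each bitonal pixel carries one bit of information instead of eight, so faint marks near the threshold are forced to either pure black or pure white.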

In August 2008, Stuart was awarded an LSTA grant from the Illinois State Library to digitize approximately 100,000 pages of weekly farm newspapers published in Midwestern states from 1870 to 1923. This project, titled Farm, Field, and Fireside (FFF), became freely available online in the summer of 2009. Like the IDNC, FFF used Olive Software’s Active Paper Archive:

Original Interface for the Farm, Field, and Fireside Digital Collection

Thanks to additional grants and gifts, FFF was expanded to include many of the leading farm newspapers. Other sources of funding for FFF included the Douglas C. Roberts Family, the Norma Jean Johnston Estate, the Clifford Family Endowment, Lancaster Farming, Inc., the Minnesota Historical Society, Pennsylvania State University, the Wisconsin Historical Society, Ohio State University, and the T & C Schwartz Family Foundation.

Even with the generous support of private donors and external granting agencies, FFF includes only a small fraction of the entire farm newspaper output from the late nineteenth and early twentieth centuries. The University of Illinois Library has one of the world’s largest collections of farm newspapers in the original print editions. Many important farm newspapers remain to be filmed and digitized for inclusion in FFF.

Eventually, Stuart and Yarasavage developed three separate digital newspaper collections: the Illinois Digital Newspaper Collection; Farm, Field, and Fireside; and American Popular Entertainment (a collection of vaudeville trade newspapers, microfilmed and digitized with the support of Robert O. Endres, a University of Illinois alumnus who worked as head film projectionist at Radio City Music Hall and later at Dolby Laboratories). A fourth, pilot-project collection, called the Collegiate Chronicle, was to be a repository of college student newspapers for the period 1875-1975. The goal of the project was both to digitize student newspapers for institutions that lacked the necessary IT infrastructure and to aggregate already-digitized student newspapers from around the country into a single, searchable collection. Unfortunately, Stuart and Yarasavage were unable to secure sufficient funding to realize this project.

In June 2009, Stuart was awarded the HPNL’s first National Endowment for the Humanities (NEH) grant for participation in the National Digital Newspaper Program (NDNP). This grant funded the digitization of 100,000 pages of Illinois newspapers published between 1860 and 1922. Stuart applied for, and subsequently received, two additional grants, extending the HPNL’s participation in the NDNP through August 2015 (the second grant was awarded in August 2011, and the third in August 2013). Like the first grant, the second and third grants each funded the digitization of 100,000 pages of Illinois newspapers. The Illinois Digital Newspaper Program (IDNP) was the only state partner in this program to have all its batches of digital content accepted by the Library of Congress on first submission, without need for corrections or adjustments, thanks to the meticulous microfilm evaluation, quality review, metadata creation, and overall attention to detail by the Project Coordinator, Amy Sullivan, and the Metadata/Quality Review Specialist, Tracy Nectoux.

The following criteria were used when selecting newspapers for digitization by the IDNP:

  • Newspapers recognized as the “paper of record” at the state or county level.
  • Newspapers with statewide or regional influence.
  • Titles considered to be important informational sources for specific ethnic, racial, political, economic, religious, or other special audiences or interest groups.
  • Orphaned titles.
  • Titles with statewide or multi-county geographical representation.
  • Titles with long runs of complete chronological coverage (i.e., lacking major gaps on the microfilm between the eligible years of 1860-1920).
  • Mix of Chicago/urban/industrial and downstate/rural/agricultural titles.
  • Equal representation of papers serving Chicago and urban populations outside of Chicago.
  • Equal representation of labor, commercial, industrial, and agricultural groups.

In the first two-year award cycle, the IDNP digitized three Chicago newspapers: the Chicago Eagle, a Democratic party organ devoted to municipal politics; the Broad Ax, an African American newspaper started in 1895 in Salt Lake City that moved to Chicago with editor and publisher Julius F. Taylor in 1899; and the Day Book, a six-year experiment in advertisement-free newspaper publishing by E.W. Scripps, founder of the media empire. Carl Sandburg was a leading reporter for the Day Book and contributed at least 135 articles during his tenure at the paper.

The HPNL’s final batch under the first NDNP award was the Cairo Bulletin, an important newspaper from southern Illinois, which in many respects embodies the intersection of “Southern” and “Northern” society, culture, and politics that characterizes Illinois. The IDNP continued digitizing the Cairo Bulletin after Stuart was awarded a second, two-year grant in 2011. With the second and third grants, the IDNP digitized the Ottawa Free Trader, the Joliet Signal, and the Rock Island Argus.

Technical specifications for IDNP digitization, as required by the NDNP grant guidelines:

  • Scan from clean second-generation duplicate silver negative preservation microfilm.
  • Capture images at 8-bit grayscale at the maximum resolution possible between 300 and 400 dpi, relative to the physical dimensions of the original material.
  • Split two-up film so that there is one page image per file.
  • Deskew images with a skew of greater than 3 degrees.
  • Crop to include the visible edge of the paper, retaining up to a quarter inch beyond the edge.
  • Capture microfilm target frames and additional scanning resolution targets at the start of each session to monitor scan quality.
  • For every page image, the vendor will create OCR text encoded using the ALTO (Analyzed Layout and Text Object) schema, Version 1-4 or greater.
  • Create one OCR text file per page image.
  • Name each OCR text file to correspond with the page image it represents.
  • Use the UTF-8 character set.
  • Refrain from saving graphic elements with the OCR text.
  • Order OCR text in natural reading order (column-by-column).
  • Create the OCR text file with bounding-box coordinates at the word level.
  • Conform to the ALTO XML schema.
  • Create an ALTO XML file containing recognized text for all page images.
  • Create a searchable PDF image with hidden text and a JPEG2000 compressed image file for each page.
  • Name derivatives in such a way that each name corresponds to the page image it represents.
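To make the word-level bounding-box requirement more concrete, the following Python sketch reads the words and their coordinates out of a small ALTO-style fragment. The sample XML is a simplified, hand-written illustration rather than a file from the IDNP, and it is not guaranteed to validate against any particular ALTO version; the helper names `local_name` and `word_boxes` are likewise assumptions. Only the element and attribute names (`String`, `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`) come from the ALTO schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified ALTO fragment for a single text line.
SAMPLE_ALTO = """<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="6000" HEIGHT="8000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Cairo" HPOS="120" VPOS="300" WIDTH="210" HEIGHT="60"/>
            <SP/>
            <String CONTENT="Bulletin" HPOS="350" VPOS="300" WIDTH="330" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def word_boxes(alto_bytes):
    """Return (word, x, y, width, height) tuples in document order."""
    root = ET.fromstring(alto_bytes)
    boxes = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            boxes.append((
                element.get("CONTENT"),
                float(element.get("HPOS")),
                float(element.get("VPOS")),
                float(element.get("WIDTH")),
                float(element.get("HEIGHT")),
            ))
    return boxes


boxes = word_boxes(SAMPLE_ALTO.encode("utf-8"))
for word, x, y, width, height in boxes:
    print(word, x, y, width, height)
```

Because each word carries its own coordinates, a delivery platform can highlight search hits directly on the page image.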

Yarasavage resigned in 2011, leaving the HPNL for a job with the NDNP at the Library of Congress. In October 2011, Kirk Hess began working in the HPNL as Digital Humanities Specialist, taking over Yarasavage’s role in the HPNL’s newspaper digitization program.

In 2013, the HPNL migrated its digital newspaper collections from Olive’s Active Paper Archive to Veridian:

Veridian Interface for the Illinois Digital Newspaper Collections

Olive’s Active Paper Archive was increasingly viewed as obsolete, and our technical specialists felt that Olive’s software was not being updated in a sufficiently timely manner. Veridian, from DL Consulting, is better supported and preserves the article-level segmentation we liked in Olive’s Active Paper Archive. It also enabled us to convert from PrXML, the schema Olive uses to describe article boundaries and store OCR text, to the more widely adopted METS/ALTO standard. In addition, Veridian supports crowdsourcing for OCR correction and tagging.

Hess resigned in February 2015.

The third NDNP grant officially ended in August 2015, but the University Library was granted an extension to digitize non-English language newspapers. In September 2015, Erica Parker took over as project director for Illinois’s participation in the NDNP extension.

In 2015, the production side of newspaper digitization was moved to the Department of Conservation and Preservation, under the leadership of Assistant Professor Kyle Rimkus.

–Geoffrey Ross
