Using open source tools in a newspaper digitization workflow


At the GLBT Historical Society we’re diligently digitizing more than 1,500 issues of the Bay Area Reporter, the San Francisco-based weekly newspaper that’s been serving the LGBT community since 1971. Thanks to a generous grant from the Bob Ross Foundation, we purchased a shiny new scanner that could accommodate newspaper spreads, and we set about digitizing the paper to the specifications put forth by the National Digital Newspaper Program (NDNP) and the California Digital Newspaper Collection (CDNC). When the project is complete, we’ll have created a publicly accessible, full-text-searchable collection of more than three decades’ worth of LGBT and California history, written week by week.

The software that accompanied our scanning hardware appeared well-suited for the task, with image processing capabilities like deskewing, optical character recognition (OCR), and image format conversion baked in. However, in practice we quickly realized that while this software worked well for small projects and one-off scans, it was not sufficient for the large-scale effort before us, mainly because it did not allow us to shift this processor-intensive work to off hours. We set out to construct a digital workflow using free, open-source tools that would replicate these image-processing tasks, but could churn through large batches of newspaper scans at night and over the weekend, freeing up precious work hours for staff, interns, and volunteers to move quickly from one newspaper issue to the next.

Technical Requirements

Starting with our archival TIFFs, scanned in 24-bit color at 400 dpi, we are generating three kinds of derivatives: JPEG 2000 files (JP2s) that follow NDNP specs; Analyzed Layout and Text Object (ALTO) XML, which encodes the results of the OCR; and full-text-searchable PDF access copies. NDNP specs also dictate that all image files contain certain embedded metadata.

Our ideal workflow, starting with folders of TIFFs, would produce all the necessary derivatives, ensure images have the correct embedded metadata, and perform some basic quality control. For example, it would confirm that every page of an issue was scanned by checking the number of TIFFs in a folder against a spreadsheet that lists the page count for each issue. The workflow would be triggered at the end of the workday, run overnight without human intervention, and move completed newspaper issues from the processing queue to a quality control queue. Workflow events, including errors, would be recorded in a log file.
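As a rough illustration of the logging requirement, here is a minimal sketch built on Python's standard logging module. The names make_logger and run_batch, the logger name, and the idea of passing in a process_issue callable are all hypothetical and are not taken from our actual scripts. The point is only that a failure on one issue gets written to the log and does not stop the rest of the overnight batch.

```python
import logging


def make_logger(log_path):
    """Return a logger that appends timestamped events to log_path."""
    logger = logging.getLogger("bar_workflow")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate lines if called twice
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def run_batch(issues, process_issue, logger):
    """Process each issue in turn; log failures and keep going.

    Returns the list of issues that raised an error.
    """
    failed = []
    for issue in issues:
        try:
            process_issue(issue)
            logger.info("processed %s", issue)
        except Exception:
            # logger.exception records the message at ERROR level
            # together with the traceback.
            logger.exception("failed %s", issue)
            failed.append(issue)
    return failed
```

In the morning, searching the log file for lines containing ERROR would show which issues need attention.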

Our Workflow

Assembling our toolkit, and correctly installing each piece of software on our Windows 10 PC, was time-consuming but essential work. Along the way we tested and rejected several promising options: some tools or methods simply produced bad results, and others were nearly impossible to get running on a Windows machine.

For our collaborative metadata spreadsheet, which includes volume and issue numbers, dates, page counts, and other descriptive information, the choice was easy: Google Sheets.

We used Python to code our workflow script, mainly because it’s what we know. Each piece of the workflow is coded as a Python function. The script first checks a directory on the local hard drive to see if there are any new newspaper issues, i.e., folders of TIFFs, to process; for each issue in the queue, it then collects basic metadata from the Google Sheet.
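The queue check might look something like the sketch below. It assumes that each issue is a subfolder named with an identifier that also appears in the spreadsheet, and that the page counts have already been read from the Google Sheet into a plain dictionary. Both are assumptions made for illustration, as are the function names and the date-style folder names in the usage note.

```python
from pathlib import Path

TIFF_SUFFIXES = {".tif", ".tiff"}


def count_tiffs(issue_dir):
    """Count the TIFF files directly inside one issue folder."""
    return sum(
        1
        for path in Path(issue_dir).iterdir()
        if path.is_file() and path.suffix.lower() in TIFF_SUFFIXES
    )


def find_ready_issues(queue_dir, expected_pages):
    """Split queued issue folders into ready and mismatched.

    expected_pages maps an issue folder name to the page count
    recorded in the metadata spreadsheet (hypothetical structure).
    """
    ready, mismatched = [], []
    for issue_dir in sorted(Path(queue_dir).iterdir()):
        if not issue_dir.is_dir():
            continue
        found = count_tiffs(issue_dir)
        expected = expected_pages.get(issue_dir.name)
        if expected is not None and found == expected:
            ready.append(issue_dir.name)
        else:
            # Either the counts differ or the issue is not in the sheet.
            mismatched.append((issue_dir.name, found, expected))
    return ready, mismatched
```

Calling find_ready_issues("queue", {"1991-02-07": 48}) would return the folder names whose TIFF count matches the sheet, plus a list of (name, found, expected) tuples for anything that should be held back and logged.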

For each issue, the TIFFs are first deskewed to improve the accuracy of our OCR. We’re using Marek Mauder’s Deskew to determine the angle at which each image is skewed, then ImageMagick to actually rotate the image. ExifTool then allows us to edit and insert the necessary embedded metadata. We again invoke ImageMagick to create JP2 derivatives according to NDNP specs, and again use ExifTool to embed an “XML box” of metadata in these versions of the newspaper pages.
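To make the deskew and metadata steps more concrete, here is a sketch of how the pieces could be wired together from Python. It makes several assumptions: that Deskew reports its result on a line beginning "Skew angle found", that ImageMagick 7 is installed and available as the magick command, and that the tag passed to ExifTool is a placeholder and not the full set NDNP requires. The helper names are hypothetical. Whether the reported angle needs to be negated before rotating is something to confirm on a test scan, so the rotation helper simply takes the number of degrees to apply.

```python
import re

# Assumption: Deskew prints a line like "Skew angle found [deg]: 1.25".
ANGLE_PATTERN = re.compile(r"Skew angle found[^:\n]*:\s*(-?\d+(?:\.\d+)?)")


def parse_skew_angle(deskew_output):
    """Pull the reported angle out of Deskew's console output, or None."""
    match = ANGLE_PATTERN.search(deskew_output)
    return float(match.group(1)) if match else None


def build_rotate_command(src, dst, degrees):
    """ImageMagick command that rotates src by degrees and writes dst."""
    return ["magick", str(src), "-rotate", f"{degrees:.2f}", str(dst)]


def build_exiftool_command(image, tags):
    """ExifTool command that writes each tag=value pair into image.

    -overwrite_original stops ExifTool from leaving a backup copy
    of the file next to the edited one.
    """
    args = ["exiftool", "-overwrite_original"]
    args += [f"-{name}={value}" for name, value in tags.items()]
    return args + [str(image)]
```

Each returned list can be handed to subprocess.run(command, check=True), which raises an error if the tool exits with a failure code, so the problem surfaces in the log instead of passing silently.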

Next, we use Tesseract as our OCR engine, producing both hOCR files and PDFs from our deskewed TIFFs. The resulting high-resolution PDFs of each newspaper page are then compressed using Ghostscript; these files are combined into a single PDF for the entire issue. Then we transform the hOCR files into ALTO XML by running them through an XSL stylesheet using the Saxon XSLT processor.
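The command lines for this stage could be assembled along the following lines. This is a sketch under stated assumptions and not a copy of our script: the function names, the Saxon jar filename, the English language setting, and the /ebook compression preset are all placeholders. It also compresses and merges the page PDFs in a single Ghostscript call by listing several input files, which is a simplification of the two separate steps described above. The default executable name gswin64c is the 64-bit Windows console build of Ghostscript.

```python
def build_tesseract_command(tiff, out_base, lang="eng"):
    """Tesseract command that writes out_base.hocr and out_base.pdf."""
    # The trailing "hocr" and "pdf" are Tesseract config names that
    # select the output formats.
    return ["tesseract", str(tiff), str(out_base), "-l", lang, "hocr", "pdf"]


def build_ghostscript_command(page_pdfs, issue_pdf,
                              gs_exe="gswin64c", preset="/ebook"):
    """Ghostscript command that recompresses and merges page PDFs."""
    return [
        gs_exe,
        "-sDEVICE=pdfwrite",
        f"-dPDFSETTINGS={preset}",  # assumed preset; tune for quality
        "-dNOPAUSE",
        "-dBATCH",
        "-dQUIET",
        f"-sOutputFile={issue_pdf}",
        *[str(pdf) for pdf in page_pdfs],  # inputs are merged in order
    ]


def build_saxon_command(hocr, stylesheet, alto_xml,
                        saxon_jar="saxon-he.jar"):
    """Saxon command that applies an XSL stylesheet to one hOCR file."""
    return [
        "java", "-jar", str(saxon_jar),
        f"-s:{hocr}",
        f"-xsl:{stylesheet}",
        f"-o:{alto_xml}",
    ]
```

As with the earlier steps, each list can be run with subprocess.run(command, check=True). Building the commands as lists, and not as one long string, avoids quoting problems with Windows paths that contain spaces.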

Finally, the script updates our Google Sheet to indicate that (1) the number of TIFFs matches the page count, and therefore the issue was enqueued for processing, (2) JP2s were created, and (3) the issue was OCR’d.

The next morning we’re able to look through the log file to see if the script threw any errors, and we perform quality control on the processed issues. If we notice any pages that need further deskewing, or if any pages need to be rescanned, we can make the appropriate corrections and re-process the issue.

From there, files are shuttled to an external hard drive in a RAID 1 configuration, and are also backed up to Dropbox, while we work on finalizing our plans for access and preservation.

We still have many hundreds of issues left to digitize, but this workflow helps to ensure we’re producing consistent, high-quality derivatives and metadata, and that our scanner operators can keep their focus on producing accurate digital surrogates of this important resource.

Our scripts for the Bay Area Reporter project are available on GitHub.