Digital collections workflows at CHS

Fremont Gate, Elysian Park, Los Angeles, Views of Los Angeles, California, PC-GS-Photographers-Los Angeles-Putnam & Valentine, California Historical Society

The California Historical Society recently added four collections of historical photographs to its digital library, including images of Los Angeles at the turn of the 20th century and photos taken by 15-year-old Alice Burr of volunteer infantrymen mustering in San Francisco during the Spanish-American War. These collections and more are available at digitallibrary.californiahistoricalsociety.org.

Perhaps more importantly, we’ve established new guidelines and workflows for our digital collections that help streamline time-consuming processes: cataloging digital objects at the item level and creating robust MODS records, preparing digital objects for ingestion into our Islandora DAMS, and making collection- or system-wide changes to objects’ descriptive metadata. Our GitHub account is a growing public repository of our digital tools and documentation of these workflows.

Gathering Existing Metadata

As is the case at many institutions, information about CHS’s collections and digital objects is spread across multiple digital platforms, including ArchivesSpace, PastPerfect, the Online Archive of California (OAC) and Calisphere, our OPAC, and Flickr Commons, among others. When preparing materials for DAMS ingestion, we aim to reuse existing metadata when possible; this information will likely have to be augmented to meet our MODS specifications, but at least we can avoid keying in the same information more than once. To name two examples, we used the export feature in PastPerfect to generate a comma-separated values (CSV) file of our entire photography collection database, and we wrote a quick Python script to parse EAD XML for item-level metadata. Figuring out how best to export data from these various systems is also good practice for the inevitable migration down the road.
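
As a rough illustration of the EAD step, the sketch below pulls item-level components out of a finding aid using only Python's standard library and writes them out as CSV. It is not the script CHS used. The sample XML, the identifier value, the date, the function names `ead_items` and `rows_to_csv`, and the choice of fields are all assumptions, and the code assumes an EAD 2002 file with no default namespace.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical EAD fragment; a real finding aid would be read from a file.
# The unitid and unitdate values are invented for illustration.
SAMPLE_EAD = """<ead><archdesc level="collection"><dsc>
  <c01 level="series"><did><unittitle>Views of Los Angeles</unittitle></did>
    <c02 level="item"><did>
      <unitid>PC-GS_00001</unitid>
      <unittitle>Fremont Gate, Elysian Park</unittitle>
      <unitdate>circa 1900</unitdate>
    </did></c02>
  </c01>
</dsc></archdesc></ead>"""


def ead_items(xml_text):
    """Return one dict per component whose level attribute is 'item'."""
    root = ET.fromstring(xml_text)
    rows = []
    for comp in root.iter():
        if comp.get("level") != "item":
            continue
        did = comp.find("did")
        if did is None:
            continue
        rows.append({
            "identifier": (did.findtext("unitid") or "").strip(),
            "title": (did.findtext("unittitle") or "").strip(),
            "date": (did.findtext("unitdate") or "").strip(),
        })
    return rows


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


items = ead_items(SAMPLE_EAD)
print(rows_to_csv(items, ["identifier", "title", "date"]))
```

Filtering on the level attribute rather than on a specific tag such as c02 means the same function works whether items sit at the second or the fifth level of the hierarchy.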

Metadata from Flickr

The CHS Flickr Commons account, in particular, had good descriptive metadata for several image collections we identified as candidates for inclusion in our digital library. I was able to use a Python script to extract all of this metadata through the Flickr API and output it to a CSV file. The Flickr “Description” field, however, contained the bulk of the descriptive metadata in a single blob of unstructured text, including the collection title and call number, digital object ID, and date.

Flickr Commons metadata
The unstructured “blob” of descriptive metadata in Flickr Commons for the photo shown above.
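
The flattening half of such an extraction script might look something like the sketch below. It deliberately leaves out the HTTP request and works on a trimmed, hypothetical response in the shape the Flickr REST API returns for photo-list methods when asked for JSON with the description and date_taken extras. The photo ID, description text, and date are placeholders, and the function names are invented for this example.

```python
import csv
import io

# Hypothetical, trimmed API response. With extras=description the
# description arrives as a nested object with a "_content" key.
SAMPLE_RESPONSE = {
    "photos": {
        "page": 1,
        "pages": 1,
        "photo": [
            {
                "id": "0000000001",
                "title": "Fremont Gate, Elysian Park, Los Angeles",
                "description": {"_content": "placeholder description text"},
                "datetaken": "1900-01-01 00:00:00",
            }
        ],
    },
    "stat": "ok",
}

FIELDNAMES = ["id", "title", "description", "date_taken"]


def flatten_photos(response):
    """Turn one page of photo records into flat dicts suitable for CSV."""
    rows = []
    for photo in response.get("photos", {}).get("photo", []):
        rows.append({
            "id": photo.get("id", ""),
            "title": photo.get("title", ""),
            "description": photo.get("description", {}).get("_content", ""),
            "date_taken": photo.get("datetaken", ""),
        })
    return rows


def to_csv(rows):
    """Serialize the flattened rows to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = flatten_photos(SAMPLE_RESPONSE)
print(to_csv(rows))
```

A complete script would request each page of results in turn and call `flatten_photos` on every page before writing the CSV.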

To parse this data I turned to OpenRefine, a favorite tool of metadata wranglers. With some careful use of regular expressions and other functions, I successfully separated these blobs into columns. I then fleshed out the spreadsheet with columns representing each MODS element and attribute we wished to include in the final XML documents.

Flickr data parsed into columns
The above data separated into columns. Parsing this Description field across the entire Flickr collection was accomplished using OpenRefine and regular expressions.
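
The same kind of pattern matching can be sketched outside OpenRefine in Python. The layout of the sample description, the field labels, the date, and the object ID format below are invented for illustration; the real Flickr descriptions may be structured quite differently, and the actual work was done with OpenRefine's own expression language rather than this code.

```python
import re

# Hypothetical description blob: a title line followed by labeled lines.
SAMPLE = (
    "Fremont Gate, Elysian Park, Los Angeles\n"
    "Date: circa 1900\n"
    "Collection: Views of Los Angeles, California\n"
    "Call number: PC-GS-Photographers-Los Angeles-Putnam & Valentine\n"
    "Digital object ID: PC-GS_00001\n"
)

# One labeled field per line; trailing spaces or tabs are dropped.
FIELD = re.compile(
    r"^(Date|Collection|Call number|Digital object ID):[ \t]*(.+?)[ \t]*$",
    re.MULTILINE,
)


def parse_description(blob):
    """Split a description blob into a dict with one key per column."""
    fields = {
        label.lower().replace(" ", "_"): value
        for label, value in FIELD.findall(blob)
    }
    lines = blob.strip().splitlines()
    # Assumption: the first line of the blob is the image title.
    fields["title"] = lines[0].strip() if lines else ""
    return fields


parsed = parse_description(SAMPLE)
for column, value in parsed.items():
    print(f"{column}: {value}")
```

Each key in the returned dict corresponds to one spreadsheet column, which is the shape needed before mapping columns to MODS elements.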

Using the “templating” feature in OpenRefine, I exported metadata for an entire image collection as a single XML document, which I then ran through a Python script to clean and split the data into individual MODS records with filenames that matched the digital object IDs. From there we could ingest large batches of high-resolution TIFF images and MODS XML metadata into Islandora. With this process defined and documented, we can now do the same for any other CHS Flickr Commons collections we wish to publish to our digital library.
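
The splitting step could be sketched roughly as follows. This is an assumption-laden example rather than the CHS script: it supposes the exported document wraps its records in a modsCollection element, that each record carries its digital object ID in an identifier element with type "local", and that no further cleaning is needed. The sample IDs and the second title are made up.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

MODS_NS = "http://www.loc.gov/mods/v3"
# Serialize MODS elements in the default namespace instead of ns0: prefixes.
ET.register_namespace("", MODS_NS)

# Hypothetical combined export containing two records.
SAMPLE = f"""<modsCollection xmlns="{MODS_NS}">
  <mods>
    <titleInfo><title>Fremont Gate, Elysian Park, Los Angeles</title></titleInfo>
    <identifier type="local">PC-GS_00001</identifier>
  </mods>
  <mods>
    <titleInfo><title>Second hypothetical view</title></titleInfo>
    <identifier type="local">PC-GS_00002</identifier>
  </mods>
</modsCollection>"""


def split_records(xml_text, out_dir):
    """Write each mods record to <object ID>.xml and return the paths."""
    root = ET.fromstring(xml_text)
    written = []
    for record in root.findall(f"{{{MODS_NS}}}mods"):
        object_id = record.findtext(f"{{{MODS_NS}}}identifier[@type='local']")
        if not object_id or not object_id.strip():
            continue  # skip records that cannot be matched to a file
        path = Path(out_dir) / f"{object_id.strip()}.xml"
        ET.ElementTree(record).write(
            str(path), encoding="utf-8", xml_declaration=True
        )
        written.append(path)
    return written


with tempfile.TemporaryDirectory() as tmp:
    paths = split_records(SAMPLE, tmp)
    names = sorted(p.name for p in paths)
    first_record = paths[0].read_text(encoding="utf-8")

print(names)
```

Naming each file after the object ID is what lets a batch ingest pair a MODS record with the TIFF that shares its base filename.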

In fact, as long as we have metadata in spreadsheet form, we can follow the process outlined above to produce high-quality MODS records for our digital objects. This has led to changes in how CHS archivists process photo collections that have been identified as candidates for digitization and inclusion in the DAMS. We’re now more likely, for example, to catalog images in a spreadsheet, where repeating or similar data can be quickly replicated down the rows and anomalies are easier to spot. Our spreadsheet template, with column headers that map to MODS elements and attributes, makes it much easier to produce XML records for collections of any size.
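
To show how a spreadsheet with MODS-oriented column headers can become XML, here is a minimal sketch that reads CSV rows and builds one MODS record per row. The column names (identifier, title, dateCreated), the sample values, and the helper names are hypothetical and much simpler than the actual CHS template, which is not reproduced here.

```python
import csv
import io
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
# Serialize MODS elements in the default namespace.
ET.register_namespace("", MODS_NS)

# Hypothetical spreadsheet export; the identifier and date are invented.
SAMPLE_CSV = """identifier,title,dateCreated
PC-GS_00001,"Fremont Gate, Elysian Park, Los Angeles",circa 1900
"""


def q(tag):
    """Return the namespace-qualified name ElementTree expects."""
    return f"{{{MODS_NS}}}{tag}"


def row_to_mods(row):
    """Build a small MODS record from one spreadsheet row."""
    mods = ET.Element(q("mods"))
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = row["title"]
    origin_info = ET.SubElement(mods, q("originInfo"))
    ET.SubElement(origin_info, q("dateCreated")).text = row["dateCreated"]
    identifier = ET.SubElement(mods, q("identifier"), {"type": "local"})
    identifier.text = row["identifier"]
    return mods


records = [row_to_mods(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
xml_out = ET.tostring(records[0], encoding="unicode")
print(xml_out)
```

Because each column feeds exactly one element or attribute, adding a field to the records means adding a column and a line or two to the builder, whatever the size of the collection.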

Cataloging Guidelines

As we refined the specifications for our MODS records, we created a document for cataloging visual materials, based on Descriptive Cataloging of Rare Materials (Graphics), or DCRM(G), that functions as both step-by-step instructions and style guide. It shows how each element is encoded in a MODS XML document, and how each element maps to Dublin Core. Whether cataloging in a spreadsheet or directly in an Islandora web form, CHS catalogers now have concrete guidance for each element in a record, ensuring more complete and consistent metadata going forward.
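
A crosswalk of this kind can be pictured as a simple lookup from MODS paths to Dublin Core element names. The sketch below uses a handful of pairings that follow the commonly used Library of Congress MODS to Dublin Core mapping; the mapping in the CHS guidelines may differ, and the sample record values (other than the title) are invented.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# Partial, illustrative crosswalk: (MODS path, Dublin Core element).
CROSSWALK = [
    ("mods:titleInfo/mods:title", "title"),
    ("mods:originInfo/mods:dateCreated", "date"),
    ("mods:subject/mods:topic", "subject"),
    ("mods:abstract", "description"),
    ("mods:identifier", "identifier"),
    ("mods:accessCondition", "rights"),
]

# Hypothetical record for demonstration.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Fremont Gate, Elysian Park, Los Angeles</title></titleInfo>
  <originInfo><dateCreated>circa 1900</dateCreated></originInfo>
  <subject><topic>Parks</topic></subject>
  <subject><topic>Gates</topic></subject>
  <identifier type="local">PC-GS_00001</identifier>
</mods>"""


def mods_to_dc(xml_text):
    """Collect Dublin Core values from a MODS record using CROSSWALK."""
    root = ET.fromstring(xml_text)
    dc = {}
    for path, dc_element in CROSSWALK:
        for node in root.findall(path, NS):
            text = (node.text or "").strip()
            if text:
                dc.setdefault(dc_element, []).append(text)
    return dc


dc_record = mods_to_dc(SAMPLE_MODS)
for element, values in dc_record.items():
    print(element, values)
```

Values are kept in lists because Dublin Core elements are repeatable, so two MODS topics become two subject values rather than one overwriting the other.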

Other Workflows

No matter how much energy we spend making sure our MODS metadata and its Dublin Core derivatives are clean and consistent before they’re published to the web, we’ll inevitably find an error here or there, or perhaps we’ll want to make some collection-wide changes down the line. With the help of some Islandora modules we can easily find and replace text strings across collections or even export, edit, and replace batches of MODS records. We’re also happy to report that we’re now employing a cloud-based digital preservation workflow, about which we hope to share more in the future.

This continuing work is certainly a team effort here at CHS, and we could not have figured out most of it without helpful blog posts and discussion threads from metadata professionals and Islandora developers, as well as tips and tricks gleaned from countless Stack Overflow users. In that same spirit of sharing, we hope you’ll find something useful here or in our GitHub repositories. If you’re interested in learning more about our use of Islandora and our digital workflows, feel free to get in touch.


This piece was originally published in the Summer 2017 newsletter of the Society of California Archivists.