
Guidelines for Preparing UC Campus Digitized Materials for Submission to HathiTrust

The HathiTrust page outlining object ingest requirements can be found here.

Below is a simplified set of instructions prepared by CDL staff specifically for UC campuses. This document is intended to help campuses create a standardized, repeatable workflow for preparing content for submission.

  1. Contact CDL for consultation, preferably prior to any digitization activities.
  2. Create a single directory for each object, named with the identifier to be used for management. The identifier needs to be unique across all campus content (but not all UC content).
  3. For Continuous Tone (non-b/w) images:
    1. Ideally 300 dpi or greater
    2. Should be 8-bit grayscale or 24-bit sRGB
    3. The resolution and color requirements above are flexible, but any deviation requires consultation with CDL
    4. TIFFs will be converted to JP2 on ingest
  4. Make sure image file names are sequential and reflect the reading order of the object, e.g., 00000001.tif, 00000002.tif, … 00000234.tif.
  5. Check that image header metadata contains all required elements: use http://babel.hathitrust.org/feed/validate_image.html to check a single image.
  6. The validator will typically report a large number of problems with the image, but most of those are handled by remediation routines during ingest or can be resolved by supplying the missing values in a supplemental YAML file. Work with CDL for help interpreting the results and for identifying and remediating any issues.
  7. Produce one OCR file for each image: plain text OCR (in UTF-8) is required; coordinate OCR can also be provided (and is encouraged). Make sure OCR file names match the page image file names (same numbering, different file extension) and that different OCR types are distinguished by extension, e.g., 00000001.txt, 00000001.xml, 00000002.txt, 00000002.xml, … 00000234.txt, 00000234.xml.
  8. Create a file of MD5 checksums for all files in the directory.
  9. If image header metadata is missing, create a YAML file named “meta.yml” to supply it; see #5 and #6 above.
  10. ZIP the directory in preparation for delivery. The ZIP file must be named using the identifier that will be used in HathiTrust, e.g., i1738822838.zip, 31280384772839.zip, or ark+=13960=t0js9j62c.zip (this last example uses the pairtree encoding).
  11. Produce a MARCXML file that contains the base identifier; work with CDL to document the MARC formatting and location of critical fields. The MARCXML file should not be included in the main ZIP file.
  12. Coordinate with CDL on the delivery of files.
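
The file-level steps above (sequential naming, matching OCR files, MD5 checksums, and the ZIP package) can be sketched as a small Python script. This is only an illustrative sketch, not a CDL or HathiTrust tool. The function names are hypothetical, and the checksum file name `checksum.md5`, the md5sum-style line format, the assumption that numbering starts at 00000001, and the choice to place files at the root of the ZIP are all assumptions to confirm with CDL. The `pairtree_name` helper covers only the three character substitutions visible in the ark example (`/` to `=`, `:` to `+`, `.` to `,`), not the full pairtree encoding.

```python
import hashlib
import re
import zipfile
from pathlib import Path

# Assumption: page images are named with eight digits, e.g. 00000001.tif
PAGE_NAME = re.compile(r"^(\d{8})\.(tif|jp2)$")


def page_numbers(obj_dir: Path) -> list[int]:
    """Return sorted page numbers taken from image names like 00000001.tif."""
    nums = []
    for p in obj_dir.iterdir():
        m = PAGE_NAME.match(p.name)
        if m:
            nums.append(int(m.group(1)))
    return sorted(nums)


def find_problems(obj_dir: Path) -> list[str]:
    """Report sequence gaps and images without a UTF-8 plain-text OCR file."""
    problems = []
    nums = page_numbers(obj_dir)
    if not nums:
        return ["no page images found"]
    if nums != list(range(1, len(nums) + 1)):
        problems.append(f"image sequence is not 1..{len(nums)}: {nums}")
    for n in nums:
        txt = obj_dir / f"{n:08d}.txt"
        if not txt.is_file():
            problems.append(f"missing plain-text OCR: {txt.name}")
            continue
        try:
            txt.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append(f"OCR file is not valid UTF-8: {txt.name}")
    return problems


def write_checksums(obj_dir: Path, name: str = "checksum.md5") -> Path:
    """Write one '<md5>  <filename>' line per file (file name is an assumption)."""
    out = obj_dir / name
    lines = []
    for p in sorted(obj_dir.iterdir()):
        if p.is_file() and p.name != name:
            digest = hashlib.md5(p.read_bytes()).hexdigest()
            lines.append(f"{digest}  {p.name}")
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out


def pairtree_name(identifier: str) -> str:
    """Simplified pairtree cleaning: only '/', ':' and '.' are substituted."""
    return identifier.translate(str.maketrans({"/": "=", ":": "+", ".": ","}))


def zip_object(obj_dir: Path, identifier: str, dest_dir: Path) -> Path:
    """Zip the object's files flat; dest_dir should be outside obj_dir."""
    zip_path = dest_dir / f"{pairtree_name(identifier)}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in sorted(obj_dir.iterdir()):
            if p.is_file():
                zf.write(p, arcname=p.name)
    return zip_path
```

A typical run would call `find_problems` first and fix anything it reports, then `write_checksums`, then `zip_object`. For example, `pairtree_name("ark:/13960/t0js9j62c")` yields `ark+=13960=t0js9j62c`, matching the ZIP name shown in step 10. The MARCXML file from step 11 stays outside the archive.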

CDL would very much like to be involved if you are planning to submit content to HathiTrust, whether legacy content or newly digitized local materials. We can help you navigate the requirements and submission process, help track and steward your content, and include it in our regular reporting to CoUL.