HathiTrust April 2012 Update
This month’s HathiTrust update is now available. The update reports that “California Digital Library created prototype exports of the metadata that will be used to populate HathiTrust’s tab-delimited inventory files (‘hathifiles’) and bibliographic catalog. Timing tests for these exports were also conducted. The CDL team continued to reconcile bibliographic records in Zephir with records in the current system at the University of Michigan to ensure all the data is accounted for, addressing record discrepancies and ingest errors as encountered. The team has also begun development of a process to sync rights information in Zephir (the new management system) with the HathiTrust rights database.”
The update also covers the HathiTrust Research Center (HTRC), describing the creation of “Meandre workflow components (Meandre is part of the SEASR infrastructure) that retrieve texts from the HTRC using the HTRC data API, spell-check the texts, correct OCR errors, and then perform topic modeling on the texts. The HTRC has demonstrated this functionality, creating topic models of all pages returned from the data API from single-word queries on a full-text index of volumes. For example, a search for ‘dickens’ in the non-Google digitized public domain corpus returns more than 100 topics with associated keywords. The diagrams below show tag clouds of keywords for the topics ‘lady’ and ‘men’.” The tag clouds in the update are fascinating and worth a look.
In other news, the HathiTrust collection continues to grow, with more than 80,000 volumes ingested in April. HathiTrust is now approaching 3 million volumes in the public domain: 2,903,378 volumes, or approximately 28% of the total.