Summer Internship at CDL: Machine Learning Techniques Applied to Calisphere

February 27, 2019 Author: Barbara HuiCategories:

Tags:

During the summer of 2018, the Publishing, Archives, and Digitization group at CDL was fortunate enough to host UCB School of Information masters student Sejal Popat as an intern. While at the CDL, Popat worked to apply machine learning techniques to image content on Calisphere, with the goal of seeing whether or not this approach could be used to discover meaningful new patterns or other information within the corpus. Might this sort of computational approach be used to power a new kind of search for content and style similarities across Calisphere, or help to uncover new metadata subject tags, for example?

Popat’s strategy was to first explore the collections to identify promising materials to work with, eventually settling on a set of archival political poster images from several collections and institutions across Calisphere that all contain an interesting combination of text, images, poses and aesthetic styles. Moreover, the images were mainly drawn or painted and therefore quite different from the photographs that existing machine learning algorithms have been trained on. Would she be able to use these algorithms to detect meaningful patterns in non-photographic images?

Above: a sample of the poster images from Calisphere that Popat analyzed using machine learning techniques.

Popat’s results were quite successful; she was indeed able to detect several kinds of visual patterns by applying machine learning algorithms to the poster image data, including:

Potential duplicate images
Content similarity and visual similarities within collections
Content similarity across different collections
Similar composition, figures, colors and styles
Image layout similarities

Above: an example of content similarity between posters in different collections

Above: an example of content similarity between posters in different collections

Above: an example of style similarities discovered across collections

Above: the machine learning process also turned up a few surprising results!

Above: the machine learning process also turned up a few surprising results!

Popat also explored whether “pose estimation” models could be used to detect human subject poses, e.g.. standing with hands on hips, reclining, etc. This line of experimentation was less successful, as often, limbs and facial features are hard to detect using the relatively indistinct lines present in paintings and drawings.

Given the positive results of Popat’s research, the CDL Calisphere team has additional strategies to consider for incorporating machine learning-based approaches into the Calisphere user interface. This will potentially facilitate new kinds of discovery and use of resources available through the website.

We loved having Sejal on board for the summer, and we welcome future summer interns through the iSchool and other programs! If interested, please contact info@cdlib.org.