Mass Digitization: Open Content Alliance and the UC Libraries
On December 6, 2006, Microsoft released the beta of Live Search Books (http://books.live.com), providing a new portal to UC library books scanned by the Internet Archive (IA) for the Open Content Alliance. An initial review of Microsoft’s service was provided by CNET (http://news.com.com/Microsoft+releasing+book+search+in+beta/2100-1038_3-6141162.html).
Microsoft’s Live Search Books provides a window into scanned books that is as serendipitously fruitful as article indexes are for searching the content of scholarly articles. It searches every page of the scanned books and returns a link to the page that contains your search phrase.
At this site (http://search.live.com/results.aspx?q=&scope=books), search on “Adolph Sutro” (mayor of San Francisco from 1895 to 1897, mining engineer, and philanthropist) and you find not only important information about his role in the beginnings of San Francisco, his Nevada mining enterprises, and his interactions with President Benjamin Harrison, but also two poems lauding him, one by Carrie Walter and another by Joaquin Miller.
Search on the “Golden Gate Bridge” to see that landmark remembered in oral histories, documented by its creators, and praised in poem and song.
The opportunities for uncovering unknown connections are endless, and will only grow as Microsoft continues to digitize more historic titles.
Update on the Mass Digitization Projects
The Open Content Alliance (OCA) (http://www.opencontentalliance.org/) is one of two mass digitization projects now underway within the UC libraries. (The other is Google, about which more will be forthcoming in future articles as its workflow and scope unfold.) With the approval of the University Librarians, the UC libraries became one of the earliest contributing members of the OCA. OCA is a coordinating body whose purpose is to build open access electronic collections and make them available through the Internet Archive (IA) (http://www.archive.org/index.php). UC library books scanned with Microsoft funding for the Open Content Alliance are now available through both the Internet Archive interface and Microsoft Live Search Books (beta).
As OCA contributors, the UC libraries are providing out-of-copyright, public domain materials and content for which the UC Regents hold copyright. The University of Toronto libraries are among the many other libraries contributing content to the OCA (for a complete contributor list, see http://www.opencontentalliance.org/contributors.html). UC hosts two Internet Archive (IA) scanning facilities, one at NRLF (which came online in April 2006) and one at SRLF (which came online in August 2006). A third IA-operated scanning site resides at the University of Toronto.
Under the direction of Brewster Kahle, the IA is the organization that provides the technology and staff for the scanning service. IA servers in San Francisco host the resulting files. In the case of scanned images of UC materials, the digital files will also become part of the UC libraries’ Digital Preservation Repository (DPR). The files will include JPEG 2000 images, PDFs, fully searchable OCR text, and a meta.xml metadata file.
CDL is investigating the implications of integrating the content generated through the OCA and Google projects into our UC library access systems and will be consulting with UC library advisory groups as the issues are better defined. Content scanned by Google will be available through WorldCat, and discussions are underway to provide OCA-scanned materials through OCLC as well.
The two OCA funding sources (Yahoo! and Microsoft) requested that IA initially scan thousands of books that can broadly be defined as reflecting Americana. CDL has created lists of titles (known as picklists) by searching the Melvyl Catalog with a combination of date limits, subject headings, and broad classification codes. These lists are drawing from UC’s systemwide library book collections managed at NRLF and SRLF, from the UC Berkeley and UCLA main libraries, and from the Bancroft Library’s and UCLA YRL’s Special Collections.
With advice from SOPAG members and AULs from across the system, several UC librarians were identified to help define those searches so that they retrieve the widest range of materials. This subject approach depends upon the consistency and completeness of cataloging done over decades on different campuses. Librarians will recognize that this makes any such search far from perfect! But it has identified thousands of books so far (including oral histories from the Bancroft’s Regional Oral History Office), many of which have been digitized and can be viewed at: http://www.archive.org/search.php?query=collection%3A%28cdl%29
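To make the picklist approach concrete, here is a minimal sketch in Python of a search that combines a date limit with subject headings and broad classification codes. It is an illustration only, not CDL's actual Melvyl queries: the record fields, the cutoff year, the subject terms, the classification prefixes, the sample titles, and the rule for combining the criteria (date limit AND either a subject match or a classification match) are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogRecord:
    """A simplified, hypothetical catalog record (not a real Melvyl schema)."""
    title: str
    year: Optional[int]
    subjects: List[str] = field(default_factory=list)
    call_number: str = ""


def build_picklist(records, latest_year, subject_terms, class_prefixes):
    """Keep records published on or before latest_year that match
    at least one subject term or one classification prefix."""
    terms = [t.lower() for t in subject_terms]
    prefixes = tuple(p.upper() for p in class_prefixes)
    picked = []
    for rec in records:
        # Date limit: skip records with no date or a date past the cutoff.
        if rec.year is None or rec.year > latest_year:
            continue
        # Subject match: any search term appears in any subject heading.
        subject_hit = any(t in s.lower() for s in rec.subjects for t in terms)
        # Classification match: call number starts with a broad class code.
        class_hit = bool(prefixes) and rec.call_number.upper().startswith(prefixes)
        if subject_hit or class_hit:
            picked.append(rec)
    return picked


# Made-up sample records, for illustration only.
records = [
    CatalogRecord("Sample mining report", 1889,
                  ["Mines and mineral resources -- Nevada"], "TN413"),
    CatalogRecord("Sample county history", 1905, [], "F868"),
    CatalogRecord("Sample modern study", 1975, ["California -- History"], "F861"),
    CatalogRecord("Sample uncataloged pamphlet", 1890, [], ""),
]

# Hypothetical criteria: cutoff year, subject terms, and class prefixes.
picklist = build_picklist(records, 1922, ["Nevada", "California"], ["E", "F"])
for rec in picklist:
    print(rec.year, rec.title)
```

In this sketch the last sample record falls within the date limit but has no subject headings and no call number, so it is never selected. That is the kind of gap that incomplete or inconsistent cataloging introduces into a subject-based search.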
Books are scanned non-invasively. A small test batch of 800 Berkeley mathematics books was digitized first to confirm that the process does no harm to the original volumes. IA designed and manufactured special scanning stations, called Scribes. These hold the book face up, open at a 90-degree angle. Carefully trained operators manually turn pages, check that metadata is correct, and replace the books on carts for return to their shelving locations.
Staff at the two RLFs have been actively and creatively involved in this project so far, devising workflows, troubleshooting, and ensuring that all scanned books are returned to their rightful homes. Many UC librarians have offered excellent advice as CDL staff have wrestled with devising lists of books that meet the criteria for scanning. This systemwide teamwork enables UC to take advantage of this timely opportunity to add new levels of access to our priceless collections.
Mass Digitization Collection Advisory Committee
Recognizing the need to formalize the content selection process as we continue to move forward on both the OCA and Google mass digitization projects, CDL obtained approval from the ULs for the formation of a Mass Digitization Collection Advisory Committee (MDCAC).
MDCAC’s charge will include developing an internal process for the review, identification, and selection of collections for scanning across the UC libraries; developing criteria for evaluating potential collections for scanning; communicating with CDL staff, UC bibliographer consortial groups, HOPS members, and HOTS members as needed for advice and assistance on technical and programmatic issues as recommendations for collection scanning are developed; and advising the SOPAG Collection Development Committee (CDC) on collection development issues for mass digitization projects and recommending collections for its review and approval. The CDC has proposed members for this committee, who will be appointed in the near future.
We wish to express our deep thanks to all of the UC librarians and CDL staff who have helped and will continue to assist in this great effort. Congratulations on reaching this milestone!