The integration of mass book digitization efforts into the practice of academic library management reached an important milestone in November 2008 with the debut of HathiTrust, an academic library digital content repository [1]. HathiTrust's origin was an effort led by the University of Michigan Library to provide a systematic and controlled means of storing, managing, and discovering the millions of files of scanned books created by its participation in the Google Books project. The name of the digital library was inspired by the Hindi word for elephant (hathi), an animal popularly admired for its long memory.
Supporting this effort from the beginning, as co-participants, were the University of California library system and the libraries of the twelve members of the Committee on Institutional Cooperation, an academic consortium of the Big Ten universities plus the University of Chicago. Since the launch of HathiTrust, several prominent institutions have agreed to participate, including Cornell University Library, Dartmouth College Library, the New York Public Library, Princeton University, the Triangle Research Libraries Network (North Carolina), the University of Virginia, and Yale University. In November of 2010, HathiTrust added an international partner to its list of participants, La Universidad Complutense de Madrid.
At the time of this writing, HathiTrust's database contains more than 7.6 million volumes. The text of all of the database's volumes is searchable through optical character recognition (OCR) processing of the scanned images, but perhaps of greatest interest to librarians and library users is the fact that approximately 1.8 million volumes, or 24% of the repository's content, are in the public domain. The full content of these public domain items is viewable as text or in tagged image file format (TIFF) or portable document format (PDF) by anyone searching the repository. In addition to the public domain items, some publishers, such as the University of Michigan Press, have selectively made content freely available as well. The full-content items are referred to as “Full View” items, while books protected by copyright are referred to as “Limited (search only)” items.
Users affiliated with HathiTrust's partner institutions have an additional valuable option. They are able to log into the repository database and download complete PDF files of the public domain items, as opposed to the single-page viewing option available to general users.
Both affiliated and unaffiliated users may also create and annotate custom “Collections” of bibliographic records. These collections may be organized around any principle the user chooses, for example, groups of titles related to a certain subject or by a certain author. Unaffiliated users are offered the option to create University of Michigan guest accounts, which enable them to take advantage of this feature of the repository.
All HathiTrust bibliographic records contain a “Find in a library” link, which takes users to the record in OCLC's WorldCat.org database, the freely available web version of the WorldCat database. This will be particularly helpful to users who are interested in obtaining a library copy of a book for which the full text is not viewable.
Although the core of HathiTrust's content overlaps with the content produced in the Google Books scanning project, HathiTrust does contain items not included in the Google Books database. This content comes from libraries participating in HathiTrust that were not involved in the Google Books undertaking. In addition, HathiTrust is distinguished from Google Books by its commitment to bibliographic standards for item records, sophisticated search options, long-term preservation efforts, and orientation toward cooperative national and international academic institutional endeavors.
In parallel with the technical efforts to produce and manage digital repositories such as HathiTrust, librarians should develop methods for using them productively (for two examples of such efforts, see Jones [2] and Blakeley [3]). Because much public domain content is by definition material published many years ago, one possibility for medical special collections librarians and students of the history of medicine is to use HathiTrust to facilitate access to the content of classic titles in the literature of the history of medicine. Along these lines, the remarks below are the result of a brief examination of the content of HathiTrust, using some titles included in Morton's Medical Bibliography (fifth edition) [4].
The quality of the scans appearing in the repository is consistently very good, including the scans of plates and figures. When color plates appear in books, they are often, though not always, scanned in color. For an example of this discrepancy, see the two full-view copies of Studies in the History and Method of Science (Garrison and Morton #6411). The scans of Plate IX contributed by both Indiana University and the University of Michigan are good examples of scans of color images. Plate XXIII in the copy contributed by Indiana University is also in color, but the scan of the same plate in the copy contributed by the University of Michigan is in black and white.
Examining the results also raises a question about public domain status. Specifically, it is not always apparent to the HathiTrust user why the full-view version of a book appears in the repository. Items published prior to 1923 are unambiguously in the public domain, but, for example, Disease and Destiny by Ralph H. Major (Garrison and Morton #6432) was published after this date (1936), and a copyright statement is clearly visible on the verso of the title page.
The holdings of multivolume items appear to be somewhat fragmented. For example, Lynn Thorndike's A History of Magic and Experimental Science (Garrison and Morton #6422) is an eight-volume work. Only two volumes appear in HathiTrust, but a quick search of the University of Michigan's Mirlyn online public access catalog (OPAC) reveals that more than these two are held by the library.
HathiTrust can be characterized as a work in progress. New institutions are joining the effort to contribute content, and new features are being added to the repository regularly. Although users of HathiTrust may raise questions like the ones above, the potential for beneficial use of this huge, searchable database of printed knowledge, much of it available in full text, is considerable.
Because of their scope and potential for wide use, mass book digitization efforts can be said to be among the most interesting endeavors being undertaken in the library world today. In turn, the size of the collections involved in the HathiTrust digital repository makes it one of the most promising of these efforts.
Librarians in academic medical libraries should be aware of this new resource and should be thinking of creative ways to use and publicize it.
References
- 1. Albanese A. HathiTrust is launched. Libr J. 2008 Nov;133(18):21.
- 2. Jones E. Google Books as a general research collection. Libr Resour Tech Serv. 2010 Apr;54(2):77–89.
- 3. Blakeley R. What was lost now is found: using Google Books and Internet Archive to enhance a government documents collection with digital documents. Documents People. 2009 Fall;37(3):26–9.
- 4. Morton L.T. Morton's medical bibliography: an annotated checklist of texts illustrating the history of medicine. 5th ed. Brookfield, VT: Gower; 1991.