Unlocking the Secrets of 4.5 Billion Pages: A HathiTrust Research Center Update

January 7, 2015 @ 12:00 pm – 1:00 pm
Horton Room, Weston Library
Broad Street
Oxford, Oxfordshire OX1 3AS
Pip Willcox

Professor J. Stephen Downie will present an open, public seminar co-hosted by Bodleian Libraries and the Oxford e-Research Centre.

This seminar provides an update on the recent developments and activities of the HathiTrust Research Center (HTRC). The HTRC is the research arm of the HathiTrust, an online repository dedicated to the provision of access to a comprehensive body of published works for scholarship and education.

The HathiTrust is a partnership of over 100 major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future. Membership is open to institutions worldwide.

Over 12.5 million volumes (4.5 billion pages) have been ingested into the HathiTrust digital archive from sources including Google Books, member university libraries, the Internet Archive, and numerous private collections. The HTRC is dedicated to facilitating scholarship using this enormous corpus by enabling access to it, developing research tools, fostering research projects and communities, and providing additional resources, such as enhanced metadata and indices, that will help scholars exploit the HathiTrust corpus more easily.

This lecture will outline the mission, goals and structure of the HTRC. It will also provide an overview of recent work being conducted on a range of projects, partnerships and initiatives. Projects include the Workset Creation for Scholarly Analysis project (WCSA, funded by the Andrew W. Mellon Foundation) and the HathiTrust + Bookworm project (HT+BW, funded by the National Endowment for the Humanities). The HTRC’s involvement with the NOVELTM text mining project and the Single Interface for Music Score Searching and Analysis (SIMSSA) project, both funded by the SSHRC Partnership Grant programme, will be introduced. The HTRC’s new feature extraction and Data Capsule initiatives, part of its ongoing efforts to enable the non-consumptive analysis of copyrighted materials, will also be discussed. The talk will conclude with a brief discussion of the ways in which scholars can work with and through the HTRC.


J. Stephen Downie is the Associate Dean for Research and a Professor at the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign.

Professor Downie is the Illinois Co-Director of the HathiTrust Research Center. He is also Director of the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL) and founder and ongoing director of the Music Information Retrieval Evaluation eXchange (MIREX). He was the Principal Investigator on the Networked Environment for Music Analysis (NEMA) project, funded by the Andrew W. Mellon Foundation. He is Co-PI on the Structural Analysis of Large Amounts of Music Information (SALAMI) project, jointly funded by the National Science Foundation (NSF), the Canadian Social Sciences and Humanities Research Council (SSHRC), and the UK’s Joint Information Systems Committee (JISC). He represents the HTRC on the NOVELTM text mining project and the Single Interface for Music Score Searching and Analysis (SIMSSA) project, both funded by the SSHRC Partnership Grant programme.

He has been very active in the establishment of the Music Information Retrieval (MIR) community through his ongoing work with the International Society for Music Information Retrieval (ISMIR) conferences. He was ISMIR’s founding President and now serves on the ISMIR board.

Professor Downie holds a BA (Music Theory and Composition) along with a Master’s and a PhD in Library and Information Science, all earned at the University of Western Ontario, London, Canada.