Libraryland is a-buzz about a new role we can play in the pursuit of scientific knowledge: data curation. Data curation serves, in particular, the new scientific methodology that goes under the name e-science. E-science involves the collection of data sets which are made widely available to the research community. Researchers then “mine” these data sets by using automated systems to find statistically significant relationships within the data. The library’s role is to curate the data, i.e., identify, acquire, and manage the data sets through the course of their life cycle. As exciting as this new methodology is, one should be aware of its weaknesses. E-science can be a valuable addition to traditional scientific methodology, but by itself, it is no panacea.
In a commentary entitled “Implications of the Principle of Question Propagation for Comparative-Effectiveness and ‘Data Mining’ Research” in the Journal of the American Medical Association, 305(3), 2011, Mia and Benjamin Djulbegovic argue that data mining does not provide definitive answers to research questions. Instead, it should be considered merely a hypothesis-generating technique. Their first point had already been demonstrated vividly by a piece of data mining research entitled “Testing Multiple Statistical Hypotheses Resulted in Spurious Associations: A Study of Astrological Signs and Health,” published in the Journal of Clinical Epidemiology, 59(9), 2006, by Peter Austin et al. Austin et al.’s research showed that residents of Ontario, Canada, who were born under the astrological sign of Leo had a higher chance of suffering from a gastrointestinal hemorrhage than others in the population, and those born under the sign of Sagittarius had a higher probability of being hospitalized for a humerus fracture. These results were statistically significant, even after being tested against an independent validation cohort. The study “emphasizes the hazards of testing multiple, non-prespecified hypotheses.” In other words, it warns us that given enough data points, one can, after the fact, find any number of ways to connect them.
The second point in Djulbegovic and Djulbegovic, that data mining should be used as a hypothesis-generating technique, is, on the other hand, undermined by Austin et al. Austin et al. point out that the statistical methods that are at the heart of data mining are not able to distinguish real from spurious associations. Data mining employs the automated examination of enormous bodies of data. Its usefulness is thought to be proportional to the size of the data set that it collates; however, as the data set becomes larger and as the number of attributes that serve as potential relata increases, the number of potential relationships increases exponentially. Importantly, the number of spurious associations also increases. With enough data, no significance test will be stringent enough to provide assurance against the kind of results found in Austin et al. What is needed, according to Austin et al., is a “pre-specified plausible hypothesis.” For statistical analysis to be useful, the researcher must begin with a hypothesis, preferably a plausible one, if the research is to be valuable.
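To make the multiple-testing hazard concrete, here is a minimal simulation sketch. It is not Austin et al.’s method or data: the population size, the number of diagnoses, the 10% base rate, and the function names are all assumptions chosen for illustration. Every “diagnosis” is assigned at random, so no sign is truly associated with anything, and yet a pooled two-proportion z-test still flags a steady share of sign/diagnosis pairs as significant.

```python
import math
import random


def two_proportion_p_value(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    if n_a == 0 or n_b == 0:
        return 1.0  # an empty group gives no evidence either way
    pooled = (hits_a + hits_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (hits_a / n_a - hits_b / n_b) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


def count_spurious_associations(n_people=5000, n_signs=12, n_diagnoses=100,
                                base_rate=0.1, alpha=0.05, seed=0):
    """Count 'significant' sign/diagnosis pairs in purely random data.

    All parameters are hypothetical. Signs and diagnoses are drawn
    independently, so every significant result is a false positive.
    """
    rng = random.Random(seed)
    signs = [rng.randrange(n_signs) for _ in range(n_people)]
    sign_sizes = [signs.count(s) for s in range(n_signs)]
    significant = 0
    for _ in range(n_diagnoses):
        # Give each person this diagnosis with the same probability,
        # regardless of sign, and tally the cases per sign.
        hits = [0] * n_signs
        for s in signs:
            if rng.random() < base_rate:
                hits[s] += 1
        total_hits = sum(hits)
        # Compare each sign against everyone born under the other signs.
        for s in range(n_signs):
            p = two_proportion_p_value(
                hits[s], sign_sizes[s],
                total_hits - hits[s], n_people - sign_sizes[s])
            if p < alpha:
                significant += 1
    return significant, n_signs * n_diagnoses


if __name__ == "__main__":
    found, tests = count_spurious_associations()
    print(f"{found} of {tests} tests were 'significant' in random data")
```

With these assumed settings the script runs 1,200 tests, and at a 0.05 threshold one should expect something in the neighborhood of 60 of them to pass by chance alone. Adding more attributes adds more tests, and with them more spurious hits.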
What exactly is a pre-specified plausible hypothesis, and how can we generate it if data mining can’t do that for us? The question was posed some sixty years ago by the philosopher Nelson Goodman using different terms: Goodman believed that a critical question for epistemology was to distinguish between “projectible and non-projectible hypotheses.” One can more or less replace “pre-specified plausible hypothesis” with Goodman’s term “projectible hypothesis.” According to Goodman, when we seek to understand which hypotheses are (or are not) projectible, we do not come to the problem “empty-headed but with some stock of knowledge” which we use to determine what is (or is not) projectible. Projectible hypotheses will be those which do not conflict with other hypotheses that have been supported in the past. They will commonly use the same terminology as previously supported hypotheses. The terminology appearing in the hypotheses will have become “entrenched” in the language. This goes a long distance toward explaining why we don’t find the link between one’s astrological sign and medical conditions plausible. Twenty-first-century Western medicine is not accustomed to linking astrological signs to ailments and so must find any hypothesis that does so implausible.
If Goodman is correct, then data mining is of little use without an historical understanding of the field of science to which the data pertains. Library administrators should keep this in mind when allocating resources. Clearly, purchasing data sets is a necessary part of serving our research patrons, but the emphasis must be not on the mere accumulation of data but on the selection of data that is critical to continuing the scientific discourse. While data sets that distinguish astrological signs are clearly insignificant for medicine, there are many other attributes that form the basis of data sets that are more or less reasonable. Librarians must be able to perform the complex task of distinguishing the more from the less. It is the curation of data that is important, i.e., the acquisition and management of data sets through the whole of their life cycle; and most importantly, the curation of data sets that are of interest and value to the scholarly and research community.
Here, we have another argument for allocating library resources to pay for librarians with deep subject expertise. As e-science develops, vendors will make more and more data sets available, regardless of their actual worth to researchers. To effectively choose the data sets that are of value, librarians must have a thorough understanding of the research needs of their patrons. To do this, they must have a deep understanding of the field. Unfortunately, with the excitement swirling around e-science, the mere access to large data sets threatens to become the be-all and end-all in collection management. If we aren’t careful, we may find ourselves with mountains of data from which everything and nothing can be concluded.
I spent the better part of Wednesday at VuStuff II, a small regional gathering hosted by Villanova University’s Falvey Memorial Library, which focused on the intersection of technology and scholarly communication in libraries. The attendees were an interesting mix of people from academic and special libraries, and included library directors, archivists, systems librarians, special collections librarians, reference librarians, technical services librarians, and more. In the group discussion session, some of us regretted the lack of representation from public libraries. It sounded like it is now on the agenda to do outreach to that sector next year.
I’ve been impressed with what’s going on at Villanova for a while now. Not only are they doing some of the most interesting, cutting-edge work that I’ve seen in terms of presenting digital content from their special collections, but the culture of their library work environment is very different from (and I might judge it as “better” than) what I know of in other libraries and work settings. This is an outsider’s view, based on perceptions gleaned from what people who work there have told me and things that I’ve read. The following are some of the things I find particularly intriguing and feel might serve as a good model for other places to consider:
1) Falvey library staff are given time to explore special projects based on their own interests. By doing this, the library is taking a risk – some work hours may indeed be “wasted,” but new products and new services may be born. A lot of workplaces harp on the need for employees to be “creative,” “collaborative,” and “innovative,” but very few actually provide the time and space to support their staff in doing this.
2) Falvey funds technology. Money for digital projects and technology-based services is written into the budget. Many workplaces expect staff to “make do” with no financial support or else fund projects on an ad hoc basis. Falvey models the fact that superior technology-based projects require dedicated, ongoing funding.
3) Falvey diversifies the responsibility for technology. There is no one staff position that is responsible for technology initiatives; rather, various aspects of technology are integrated into the job descriptions of numerous library staff members. This means that if a library staff position is cut or a staff member leaves, technology initiatives don’t evaporate along with that change.
4) Falvey supports open access. The VuFind product they’ve developed for use as a flexible library resource portal is available for free through a GPL open source license. The digital library content they present is available freely to anyone (with a few exceptions for some materials with outside restrictions). Instead of partnering with commercial interests to market a product, Falvey keeps to the ideal of libraries providing information and resources free of charge.
I think that Joe Lucia, Villanova’s university librarian and the director of Falvey Memorial Library, deserves a lot of credit for his leadership in these areas. I missed his opening remarks at the conference, but found his questions and comments throughout the sessions to be interesting and thought-provoking. He seems to be looking further forward than many library directors, asking questions like “What does it mean for libraries if the ILS as we know it is dead in the next five to eight years?” and “What does it mean if 80% of the content of our book collections is available electronically?” A word to the wise: the two books he specifically mentioned were Siva Vaidhyanathan’s The Googlization of Everything and R. David Lankes’ The Atlas of New Librarianship.
The presentations at the conference were informative and sometimes inspiring. Amy Baker of the University of Pittsburgh described her institution’s project to preserve archival mining maps, which was spurred by a mining accident in western Pennsylvania. Carried out in conjunction with the Pennsylvania Department of Environmental Protection, this project is a good example of a university/government partnership that provides publicly available information in order to help protect people and property. It reminded me that although librarians and archivists rarely see our work as having life-or-death consequences, sometimes it does.
Eric Lease Morgan of the University of Notre Dame demonstrated the Catholic Research Resources Alliance website (the “Catholic Portal”) and explained how it uses the VuFind product to draw together metadata from various formats and sources into one seamless product. I was particularly interested in its ability to perform full text searches and construct KWIC word concordances. I’m not sure how well known or well utilized this site is, but I think it holds a great deal of potential for researchers in literature, history, religious studies, and other fields to mine text data for a variety of purposes.
Eric Zino of the LYRASIS library network explained the Mass Digitization Collaborative, undertaken to help libraries digitize selected resources in a cost-effective way. Unique items of historical value have been the major focus, although participating libraries are free to choose any materials they wish to include (provided copyright restrictions are met). Digitized materials are made publicly available via the Internet Archive, and can also be hosted locally. This project underscored the benefits of libraries working together to cut costs, minimize staff time spent on projects, produce consistent products, and share content more broadly.
I missed the final presentation of the conference, which was Rob Behary of Duquesne University speaking on his library’s project to digitize the Pittsburgh Catholic newspaper. His presentation highlighted some of the benefits of moving from microfilm to digital content. Most librarians will agree that efforts like this, to preserve smaller regional publications with a unique focus or viewpoint, are an important service that libraries should be involved in.
All in all, this was an interesting day with plenty of time for networking built in. I enjoyed reconnecting with former colleagues and students, and meeting some new people as well. It was particularly rewarding to be with a group of people who were interested in moving library services forward into the 21st century, while still retaining the traditional library value of open access to information. I suspect that the organizers may be seeking larger quarters for future VuStuff gatherings as the event’s reputation continues to grow.