DIGITAL COLLECTION DEVELOPMENT: WHO IS DOING WHAT IN THE UNITED STATES

Abby Smith

Council on Library and Information Resources

Washington, DC

November 28, 2000

The Digital Library Federation (DLF), a program of the Council on Library and Information Resources (CLIR), is looking at strategies for developing sustainable and scalable digital library collections. We are surveying the practices of our member libraries and will issue reports on our findings this spring.

In the United States, we are developing collections in three ways:

1) acquiring or licensing from third parties

2) creating resources from the Web, so-called free resources

3) creating digital representations of our holdings

I will share with you some preliminary findings, and I want to focus on the last of these - digitized collections - primarily because it brings into focus the general challenge research libraries face in defining their proper roles and responsibilities in the networked environment. I will also take advantage of this time with you to reflect on what I see as emerging trends and issues in the larger research community that are affecting libraries. While it is said here and elsewhere that the Americans are leading in digital library development in many areas, from the vantage point of my compatriots it often feels as if we are simply the first to learn - the hard way - what does and does not work. I will share with you not only the successes we have created, but also the failures. It is often the latter that are more instructive, if more painful, to share with colleagues.

 

DIGITIZATION

I want to assure you that, in the United States, selecting what to digitize is much easier than selecting who is to be the next president. It is not an issue that must be settled in the Supreme Court. On the contrary, the principle of "states' rights," or every institution deciding what is best for itself, is well entrenched. We may talk a great deal about our federal, or federated, approach. But institutions usually make decisions based on particular institutional needs rather than on consensus community priorities.

How are leading research libraries moving from the early stages of experimentation with scanning, which has been project-based, to integration of this technology into the daily routine of library services?

In other words, when does digitization become a standard option for serving and preserving collections?

When do libraries plan for the implications of creating new collection assets from existing ones - that is, routinize the creation of metadata; plan for long-term retention and migration; and provide reader services and tools to support online research? When is funding for this activity not ad hoc, but routine, either from base funding or from regular external funds such as endowments? When is this activity NOT grant-dependent? And how do we define that? FTEs covered but not conversion costs?

There are many written guides on selection for digitization. These guidelines bear a striking resemblance to preservation selection criteria - often because they are written by preservation staff. We see careful attention paid to the relation of the source material to the potential of added functionality, identification of research value, and so forth. Frankly, these policies are in effect more or less meaningless. What does it mean for something to have intrinsic research value? How many items are collected by research libraries that do not? The only things excluded under these criteria are things that do not fit under the camera, like architectural drawings, or things that are very boring, or things seldom used, like foreign language materials. These guidelines presume that funding and technical issues are addressed, that a larger purpose to the program has already been identified. What is missing is a high-level articulation of how an institution’s digitization practices reflect its overall strategic view of collections.

However, this overall strategic view becomes very clear when you take a look at what libraries are doing, rather than what they say they are doing, at least officially. I am concerned that we tend to focus on tactics in lieu of strategy, and will end up mistaking one for the other, because we are not yet at the stage when we can articulate our strategies easily. But here are the trends.

Preservation

There is a growing consensus that digital scans serve as the preferred type of preservation surrogates. Digital preservation surrogates are widely embraced by scholars and preferred to microfilm. This is not surprising, of course, but the emphasis on creating digital surrogates for access purposes can and does result in money being diverted from preservation reformatting to digital reformatting. While CLIR has made efforts to develop a hybrid approach to this issue - developing film and digital copies of brittle materials - we are discouraged by the lack of interest in what many see as an expensive option. We have been urged to focus instead on solving the digital preservation problem, so that digital scans can be considered a viable preservation alternative to film.

In one institution - the University of Michigan - digital scans are actually used to create replacements for brittle books. And at Cornell, we see a preference for digital replacements of brittle materials with backup to COM (computer output microfilm).

 

Access

We find two types of approaches to digitization-as-access, and they are clear, if not really cogent, reflections of the institutional view of collections. The one is collection-driven, the other user-driven (which, I’d like to argue, is, by any other name, scholarly publishing).

Collection-driven

First, I will address the digitizing of collections in public institutions. Both the Library of Congress and the New York Public Library base their selection decisions on the realization that they are not part of a larger academic community, with faculty and students to set priorities. Rather, they serve a very broad community of the public, and so wish to make available things that both scholars and a broader audience would find interesting. They tend to select from among their special collections, special-format materials and rare or archival holdings that are often in the public domain, may be unique, and are certainly not generally available in libraries locally. This makes sense for a couple of reasons:

1) There are obligations to their supporters, the taxpayer and public at large. For the Library of Congress this means America and the larger world; for the New York Public, this means New York City and the larger world - which to New Yorkers is the same thing, of course. Again, this is not an academic audience.

2) This means putting up unique materials, usually primary sources, often unpublished.

3) They are easier to sell to funders, be they private donors or the US Congress. Both institutions feel the need to justify the expense of digitizing by making general cases about how much access to these materials will improve education or our civic life. Members of Congress are not often moved by arguments about the scholarly value of obscure resources.

4) These materials play to the general sense of the staff that this kind of access is mission related.

Beyond these criteria lurk other, more practical considerations as well. Given that creating metadata is usually a more expensive activity than the actual scanning, there is the need to take advantage of existing metadata - better known as cataloging. Often money has come with a promise by the library director that they will put up several thousand - even million - images, usually an optimistic and not terribly realistic assertion. So the library will turn to existing catalogues of special collections, as they are doing in New York. And this is a sound decision, in that the library is able to capitalize on previous investments in its collections for access purposes.

The public institutions, under special pressures to meet the needs of many differing publics, wrestle routinely with what they see as mission-driven activities versus what is simply urgent for institutional reasons. For example, both the Library of Congress and the New York Public identify K-12 [kindergarten through pre-college] as a primary audience, largely for political purposes. But this is not, in my view, what drives the decisions they actually make about what to put online. They start with decisions about existing collections, as I outlined above, and then make an effort to find funding to create user interfaces for children or, more importantly, for teachers.

And both institutions wrestle with the problems that come with all primary sources - the intellectual question of how much of a collection is enough? They may choose to do something that represents the strengths of the collection, but what is that? How much is enough?

In private institutions, such as Yale University, and those mixed public/private like Cornell University and the University of Michigan, we see self-conscious attempts to build a critical mass of digitized holdings for research and teaching. With or without input from faculty, librarians will begin with a subject or collection that is strong - often something defined by thematic or chronological parameters - and they attempt to be as complete as possible in their coverage of that chosen topic. For selection decisions they rely on traditional bibliographies or selection procedures. They put together a proposal for funding that tries to sell the merits of the collections for research and teaching purposes, always supported by largely unexamined assumptions - or at least untested - about what people will use if made available online. The actual behavior and demands of the larger academic community are largely undocumented at this stage of development. When the materials they are scanning are brittle or frail, the digital images are touted as preservation surrogates.

We have seen that Michigan is dealing with the consequences of their decisions to reformat brittle materials and they have developed policies for the disposition of brittle materials after scanning. This is largely a defensive move, but an honest one. In fact, they place little value on the artifactual value of brittle general collections and are interested in creating masses of converted searchable text. Michigan is unique in talking about their large-scale production of digital collections as a type of collection management.

CRITICAL MASS

There are several libraries that talk of achieving a critical mass of materials online that, once achieved, will make it worthwhile to add items incrementally. This sounds like a nice concept, kind of intuitive, in fact, but I don’t know what it means. The idea of a critical mass is almost a mystical concept. I recognize that concept’s origins in two unrelated sources - Hegelian and Marxist theories of history, and nuclear physics. In physics, we are speaking of a density of matter that sets in motion a chain reaction. In Hegelian theory, it refers to a mass that is sufficient to transmute quantity into quality. In both cases, the idea is that a number can become so great that it wreaks a transformation in the nature of matter itself.

How much of a primary source is enough to recreate the semblance of real research, in which this critical mass is said to reside? What is a critical mass of research materials and how do we measure it? Libraries have traditionally used this term to mean that collections are significant to the extent that they are comprehensive. Why? Because a large and comprehensive collection provides a context for interpretation. But in the digital realm, it turns out that we really mean something else by this term, critical mass, something ill-defined and quite new. An example of that type of collection is the Making of America (MOA) at the University of Michigan, a database of thousands of low-use brittle nineteenth-century imprints. While the books themselves were seldom called from the stacks, the MOA database is heavily used, though not primarily by students and teachers of Michigan. Its largest user is the Oxford University Press, which uses this database for etymological and lexical research.

I propose that what we really mean by the phrase "critical mass" is contextual mass, and recreating that online has become the great challenge for many American libraries. Whereas in the analog realm, searching within a so-called critical mass has always been very labor-intensive and takes great human effort to reveal the relationship in and among items in that collection, once those items are online in a form that is word-searchable, one has a mass that is now accessible to machine searching, not human researching.

This notion is implicitly premised on the transformative power of the technology that can not only re-create a collection online - one that can be used anytime, anywhere - but that gives the original sources new functionality, purpose, and audience.

JSTOR, a database for back issues of scholarly journals, is a good example of such a transformation. JSTOR started out, as its name implies, as a journal storage project whose aim was to free up scarce shelf space in library stacks by letting libraries get rid of their back issues of journals. Curiously, this has yet to come to pass. It turns out that large libraries have been able to move some journals into remote storage, perhaps, but few if any are willing to risk the disapprobation of some faculty and actually deaccession the journals themselves. Nonetheless, JSTOR has been successful in getting a significant number of important journals online, a large number of libraries to subscribe, both big and small, and it keeps careful track of how these journals are used. Their own statistics show that scholars are rapidly embracing this method of accessing old journals. It appears that they are querying this database in a way that they would not - could not - conduct research using single issues. Part of the genius of JSTOR, in my view, is that the staff not only keep careful track of how people are using this resource, but they are also putting these usage statistics online with some analysis and thus holding a mirror up to their users to show them their behavior. This tool for analysis and consciousness-raising, if you will, is key to the often enthusiastic adoption of JSTOR in research.

User-driven

There are libraries that are approaching digital conversion not as a way of getting a lot of their collections online for general access purposes, as I described above, but rather in response to what their local academic community has indicated that it wants. The University of Virginia is the best example of this approach, though the depth and breadth of their activities in this area are far from typical. It has several digital conversion initiatives, both in the library and elsewhere on campus. In the Institute for Advanced Technology in the Humanities (IATH), an academic center that is not in the library, scholars develop deep and deeply interpreted and edited digital objects that are, by any other name, publications. Examples include the projects on the writers Blake, Rossetti, and Twain. Within the library there is the Electronic Text Center, where the staff will choose to encode humanities texts that they put up without the interpretive apparatus of the IATH objects. They are more analogous to traditional library materials that are made available for others to interpret. Except, of course, that encoded text is far more complicated a creature than the OCR’d [Optical Character Recognition] text that other libraries are creating.

What is curious in this mix at Virginia of scholars and librarians working in essentially parallel tracks is that they are just now beginning to grapple seriously with the long-term consequences of their actions. The Andrew W. Mellon Foundation has granted IATH funds to work with the library to develop an architecture that will allow the library to provide long-term care, so to speak, of the digital publications that the scholars are creating. While no one says this out loud, these scholars are in fact data-creators, building what is in their mind a critical mass - a group of materials that are rich enough that they create a context for interpretation - but that we as librarians and archivists would see as publishing because the scale is so small. The work that Mellon is funding at Virginia to develop a system that will host these digital resources and make them cross-searchable and interoperable is an important attempt to make these pieces into a critical, or at least a contextual, mass.

American conditions and assumptions

I would like to close by briefly mentioning a few facts about the conditions and assumptions that underlie much of what we in America are doing, in the hope that this will put our work into a broader context. The United States is estimated to produce half the world’s published information each year - even if the Germans own most of our publishing houses. Our view of information is deeply tied to our roots as a nation of immigrants. Public institutions, both schools and libraries, have been the vehicles for the integration, if not assimilation, of new populations into the country. Our founders believed that unhindered access to information is vital to a republic, and they also believed that citizens have the obligation to keep themselves informed. We have lost that burning sense of responsibility, perhaps, but our avid love of the Internet, and the Web in particular, is vivid testimony to our unexamined assumption that information per se is a good thing. Yes, the creator of the Web is an Englishman and its creation happened in Switzerland. But Tim Berners-Lee is now at MIT, and so is the home of the World Wide Web Consortium.

Americans tend to believe that information per se offers us a cure for whatever ails us as individuals and society. This seriously complicates our thinking about what to put online, because we have this general idea that everyone and every idea should have "equal time." This creates a tension between what might be good for educational and scholarly purposes and what might be good for the general public. I think we will see a further refinement of selection criteria as we track more carefully how our patrons are actually using our materials. We must be bold in risking failure in the pursuit of knowledge and faithful in sharing with our colleagues the lessons we have learned.




BSB München 15.12.2000 MDZ@bsb-muenchen.de