Rand Health Quarterly. 2012 Dec 1;1(4):13.

Enabling Long-Term Access to Scientific, Technical and Medical Data Collections

Jeff Rothenberg, Stijn Hoorens
PMCID: PMC4945262  PMID: 28083220

Short abstract

Presents the results of a scoping study that addresses the potential role of the British Library in facilitating access to relevant datasets in the biosciences and environmental science.

Abstract

In recent decades, online access to large, high quality data collections has led to a new, deeper level of sharing and analysis, potentially accelerating and improving the quality of scientific research. These online datasets are becoming imperative at all stages of the research process, particularly in scientific, technical and medical (STM) disciplines. Since libraries have a traditional responsibility to guarantee the availability of the output of scholarly research, they have a potentially important role to play in facilitating long-term access to these resources. Yet, the role of a national library in the realm of STM data remains unclear. This article presents the results of a scoping study that addresses the potential role of the British Library (BL) in facilitating access to relevant datasets in the biosciences and environmental science. The aim of this study is to assist the BL in developing an appropriate strategy that would enable it to establish a role for itself in the intake, curation, archiving, and preservation of STM reference datasets, in order to provide access to these datasets for research purposes. The focus of this study is to explore a range of alternative strategies for the BL, which might be different for different types of databases or for data supporting different research fields or disciplines.


In recent decades, online access to large, high quality data collections has led to a new, deeper level of sharing and analysis, potentially accelerating and improving the quality of scientific research. These online datasets are becoming imperative at all stages of the research process, particularly in the scientific, technical and medical (STM) disciplines. Since libraries have a traditional responsibility to guarantee the availability of the output of scholarly research, they have a potentially important role to play in facilitating long-term access to these resources. Yet the role of a national library in the realm of STM data remains unclear.

This article presents the results of a scoping study that addresses the potential role of the British Library (BL) in facilitating access to relevant datasets in the biosciences and environmental science. The aim of this study is to assist the BL in developing an appropriate strategy that would enable it to establish a role for itself in the intake, curation, archiving and preservation of STM reference datasets, in order to provide access to these datasets for research purposes. The focus of this study is to explore a range of alternative strategies for the BL, which might be different for different types of databases or for data supporting different research fields or disciplines.

Characterising the Dimensions of Reference Data Collections

Developing a strategy for providing access to these resources requires a comprehensive picture of the diversity in how research data are produced and offered. Since the BL might function as a gateway to these resources, it is equally important to characterise the interests and needs of the potential users of these datasets. We therefore distinguished between the supply domain of datasets and the demand domain, i.e., the use of such data. As illustrated in Table 1, we developed a set of seven supply-side dimensions (S1 to S7) and a set of six demand-side dimensions (D1 to D6). Each dimension has several attributes, which allow each candidate database to be characterised and a set of options for the BL to be delineated for each attribute.

Table 1.

Span of Plausible Attribute Values in Supply- and Demand-Side Dimensions

| Dimension Number | Dimension | Attribute | Attribute Values |
|---|---|---|---|
| S1 | Access | Restriction | none, role-based (e.g., government, commercial, individual), location-or-affiliation-based (e.g., by country, agency, professional society), by-registration, requiring unpaid-membership, paid-membership, use-payment (unlimited or by data-item, query, dataset, etc.) |
| | | Access media | online-only, offline-only, on-or-offline |
| | | Granularity | attribute, data-item, query-result, subset, dataset |
| | | Functionality | low, medium, high |
| | | Software-requirements | generic, modifiable, free-download, server-resident-proprietary, proprietary |
| S2 | Scale, dynamism, coverage and completeness | Scale | small, medium, large |
| | | Dynamism of discipline | frozen, static, dynamic, volatile |
| | | Dynamism (of database) | frozen, static, dynamic, volatile |
| | | Temporal-depth | historical, current-only, multiple versions/editions |
| | | Coverage | narrow, medium, broad |
| | | Completeness | low, medium, high |
| | | Collection-strategy | passive, active |
| | | Processing | none, minimal, significant, intensive |
| | | Validation | none, minimal, significant, intensive |
| | | Timeliness | low, medium, high |
| S3 | Disciplinary usage | Cross-discipline | no, somewhat, yes |
| | | Disciplines | <discipline designations> |
| | | Level of user support | low, medium, high |
| S4 | Interface | User-interface | menu, graphical-selection, text-query, graphics-input |
| | | Programmable-interfaces | no, server-support, framework-support, API |
| S5 | Interoperability | Self-describing data | no, somewhat, yes |
| | | Semantic transparency | no, somewhat, yes |
| | | Linkage-to-other-collections | no, somewhat, yes |
| | | Use of semantic standards | no, somewhat, yes |
| | | Cross-domain semantic crosswalks | no, somewhat, yes |
| | | Programmable interfaces | non-existent, unique, standard, open |
| S6 | Ownership, funding, governance, management and contributors | Reputation | low, medium, high |
| | | Involvement | low, medium, high |
| | | Accessibility | low, medium, high |
| | | Funding-level | low, medium, high |
| | | Funding-reliability | low, medium, high |
| | | Governance-quality | low, medium, high |
| | | Sustainability | short-term, medium-term, indefinite |
| S7 | Attribution & IP | Attribution completeness | low, medium, high |
| | | Attribution accuracy | low, medium, high |
| | | Attribution granularity | low, medium, high |
| | | Licensing, registration, agreements with owners | inapplicable, minimal, partial, complete |
| | | End-user licensing | inapplicable, minimal, partial, complete |
| | | Redaction/anonymisation of data | inapplicable, minimal, partial, complete |
| D1 | Research methodology, funding and stakeholder requirements | Required-access-granularity | attribute, data-item, query-result, dataset, database |
| | | Required-metadata | low, medium, high |
| | | Required-access-to-models | low, medium, high |
| | | Methods | <method designations> |
| | | Publication/distribution requirements | <various> |
| D2 | Discovery methods | Search-engines | generic, specialised |
| | | Discovery metadata | generic, specialised |
| | | Other discovery resources | <indexes, catalogues, etc.> |
| D3 | Query style | Expressivity | low, medium, high |
| | | Desired-interface | menu, graphical-selection, text-query, graphics-input |
| | | Required-programmable-access | low, medium, high |
| D4 | Federation | Need-to-federate | low, medium, high |
| | | Required-metadata-support | low, medium, high |
| D5 | Cross-disciplinary usage | Cross-disciplinary-usage | low, medium, high |
| | | Required-metadata-support | low, medium, high |
| D6 | Timeliness and temporal access | Required-recency | low, medium, high |
| | | Required-timestamp-granularity | low, medium, high |
| | | Desired-update-method | asynchronous, time-stamped, transaction-based |
| | | Required-temporal-access | historical, current-only, versioned, multi-epoch, reconstructible |

The identified attributes can have different values that represent the variation among the data collections' characteristics. We explored the online resources of a small sample of candidate data collections, and, to the extent possible, reviewed documentation about their ownership, management, data processing and validation methods, access mechanisms, query interfaces, browsing capabilities, metadata, etc. The identified dimensions, their attributes and the span of plausible values on the supply and demand side are given in Table 1.
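
One way to make this characterisation concrete is to record each candidate collection as a profile of attribute values and check it against the values permitted in Table 1. The Python sketch below is illustrative only and is not part of the study: it encodes a small excerpt of Table 1, and the names `ALLOWED_VALUES`, `validate_profile` and `example_profile` are hypothetical, as is the example profile itself.

```python
# Excerpt of Table 1: (dimension number, attribute) -> permitted values.
# Only a handful of attributes are encoded here, for illustration.
ALLOWED_VALUES = {
    ("S1", "Access media"): {"online-only", "offline-only", "on-or-offline"},
    ("S2", "Scale"): {"small", "medium", "large"},
    ("S2", "Dynamism (of database)"): {"frozen", "static", "dynamic", "volatile"},
    ("S4", "Programmable-interfaces"): {"no", "server-support", "framework-support", "API"},
    ("S6", "Sustainability"): {"short-term", "medium-term", "indefinite"},
    ("D4", "Need-to-federate"): {"low", "medium", "high"},
}


def validate_profile(profile):
    """Return a list of problems found in a collection's attribute profile."""
    problems = []
    for key, value in profile.items():
        allowed = ALLOWED_VALUES.get(key)
        if allowed is None:
            problems.append(f"{key}: not in the encoded excerpt of Table 1")
        elif value not in allowed:
            problems.append(f"{key}: {value!r} is not one of {sorted(allowed)}")
    return problems


# Hypothetical profile, invented for this example (not a database from the study).
example_profile = {
    ("S1", "Access media"): "online-only",
    ("S2", "Scale"): "large",
    ("S6", "Sustainability"): "permanent",  # deliberately invalid value
}

for problem in validate_profile(example_profile):
    print(problem)
```

Keying the dictionary on the pair of dimension number and attribute keeps attributes that share a name in different dimensions, such as Required-metadata-support in D4 and D5, distinct.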

Bundles of Strategic Options for a National Library

Analysis of the sample of candidate collections led to the identification of a range of options, each addressing one salient attribute value or a small set of them. For example, the BL should (or should not) hold a given dataset itself, and should (or should not) develop and provide its own metadata and query or access mechanisms for a given dataset.

As an initial exercise for how the BL can develop a strategy for providing long-term access to these high quality reference data collections, we specified three exemplary clusters of attribute values, each of which characterises a class of databases. Each such attribute cluster defines a bundle of options that, taken together, can be considered a strategy.

  1. The first cluster of attributes can be labelled as neutral: it represents the issues arising in the sample of databases that were investigated. For this cluster, the national library might consider providing transparent access to the data collections.

  2. The second cluster of attributes represents a class of databases involving a complex, demanding set of requirements combined with relatively minimal support by the database itself. For this cluster, the national library might consider providing gateway access to the data collections.

  3. The third cluster represents a class of databases involving a simple, undemanding set of requirements, combined with relatively good support by the database itself. These data collections have minimal access restriction, and their supporting mechanisms are relatively simple. For this cluster, the national library might consider providing transparent access to the data collections.

The three bundles of options associated with these attribute clusters should be considered indicative strategies rather than definitive ones. The “demanding” and “undemanding” clusters have been deliberately formulated as two extreme ends on a spectrum of plausible cases. The BL may choose different options, depending on its missions and policies.
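
The mapping from attribute clusters to indicative access options can be sketched in a few lines of Python. This is a minimal illustration, not the study's method: the access option per cluster is taken from the list above, but the `classify` rule, which reduces a database to two coarse ratings (how demanding its requirements are and how well the database itself supports them), is an assumption made purely for this example, and all names are hypothetical.

```python
from enum import Enum


class Cluster(Enum):
    NEUTRAL = "neutral"
    DEMANDING = "demanding"
    UNDEMANDING = "undemanding"


# Indicative access option for each cluster, as given in the list above.
INDICATIVE_OPTION = {
    Cluster.NEUTRAL: "transparent access",
    Cluster.DEMANDING: "gateway access",
    Cluster.UNDEMANDING: "transparent access",
}

RATINGS = ("low", "medium", "high")


def classify(requirements, support):
    """Assign a cluster from two coarse ratings.

    Assumed rule (not from the study): the two extreme clusters are matched
    only at the ends of the spectrum; everything else is treated as neutral.
    """
    if requirements not in RATINGS or support not in RATINGS:
        raise ValueError("ratings must be 'low', 'medium' or 'high'")
    if requirements == "high" and support == "low":
        return Cluster.DEMANDING
    if requirements == "low" and support == "high":
        return Cluster.UNDEMANDING
    return Cluster.NEUTRAL


cluster = classify(requirements="high", support="low")
print(cluster.value, "->", INDICATIVE_OPTION[cluster])
```

Because the bundles are indicative rather than definitive, the lookup table is kept separate from the classification rule, so either could be revised independently in light of the BL's missions and policies.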

Lessons and Next Steps

The option bundles presented are only a starting point. The BL's strategy with respect to any given database should be decided on the basis of an overall assessment of the importance and uniqueness of that database, its relevance to the BL's policies with regard to STM data, and the BL's assessment of the degree to which users of the database would benefit from having the BL apply its own curatorial, preservation, or access resources to the database.

Although the limited resources of our study enabled us to obtain reasonable information for most supply-side attributes, details of ownership and funding (accessibility of owners, owner reputation, reliability of funding, etc.) could in many cases only be inferred by our necessarily informal methods. Demand-side attributes were even harder to obtain; our values for most of these attributes are derived deductively rather than empirically. These values need to be validated and revised on the basis of future demand-side analysis.

The results of this study should therefore be replicated with greater depth and resources, using a larger number and wider range of sample databases, augmented by demand-side input from researchers and user groups. The more in-depth examination should employ direct contact with database administrators, parent organisations, data processing managers, discipline-based organisations whose members use the database, and user communities. This should help fill in the supply-side attributes for each database and provide demand-side attributes, whose values were largely assumed in the current study.


Articles from Rand Health Quarterly are provided here courtesy of The RAND Corporation
