Enabling Long-Term Access to Scientific, Technical and Medical Data Collections

Jeff Rothenberg; Stijn Hoorens

2012 Dec 1;1(4):13.

PMCID: PMC4945262 PMID: 28083220

Short abstract

Presents the results of a scoping study that addresses the potential role of the British Library in facilitating access to relevant datasets in the biosciences and environmental science.

Abstract

In recent decades, online access to large, high-quality data collections has led to a new, deeper level of sharing and analysis, potentially accelerating scientific research and improving its quality. These online datasets are becoming essential at all stages of the research process, particularly in scientific, technical and medical (STM) disciplines. Since libraries have a traditional responsibility to guarantee the availability of the output of scholarly research, they have a potentially important role to play in facilitating long-term access to these resources. Yet the role of a national library in the realm of STM data remains unclear. This article presents the results of a scoping study that addresses the potential role of the British Library (BL) in facilitating access to relevant datasets in the biosciences and environmental science. The aim of this study is to assist the BL in developing an appropriate strategy that would enable it to establish a role for itself in the intake, curation, archiving and preservation of STM reference datasets, in order to provide access to these datasets for research purposes. The focus of this study is to explore a range of alternative strategies for the BL, which might differ for different types of databases or for data supporting different research fields or disciplines.

Characterising the Dimensions of Reference Data Collections

To develop a strategy for providing access to these resources, the BL needs a comprehensive picture of the diversity in how research data are produced and offered. Because the BL might function as a gateway to these resources, it is equally important to characterise the interests and needs of the potential users of these datasets. We therefore distinguished between the supply of datasets (the supply domain) and the use of such data (the demand domain). As shown in Table 1, we developed a set of seven supply-side dimensions and a set of six demand-side dimensions. Each dimension has several attributes, which allow each candidate database to be characterised and a set of options for the BL to be delineated for each attribute.

Table 1.

Span of Plausible Attribute Values in Supply- and Demand-Side Dimensions

Dimension Number	Dimension	Attribute	Attribute Values
S1	Access	Restriction	none, role-based (e.g., government, commercial, individual), location-or-affiliation-based (e.g., by country, agency, professional society), by-registration, requiring unpaid-membership, paid-membership, use-payment (unlimited or by data-item, query, dataset, etc.)
		Access media	online-only, offline-only, on-or-offline
		Granularity	attribute, data-item, query-result, subset, dataset
		Functionality	low, medium, high
		Software-requirements	generic, modifiable, free-download, server-resident-proprietary, proprietary
S2	Scale, dynamism, coverage and completeness	Scale	small, medium, large
		Dynamism of discipline	frozen, static, dynamic, volatile
		Dynamism (of database)	frozen, static, dynamic, volatile
		Temporal-depth	historical, current-only, multiple versions/editions
		Coverage	narrow, medium, broad
		Completeness	low, medium, high
		Collection-strategy	passive, active
		Processing	none, minimal, significant, intensive
		Validation	none, minimal, significant, intensive
		Timeliness	low, medium, high
S3	Disciplinary usage	Cross-discipline	no, somewhat, yes
		Disciplines	<discipline designations>
		Level of user support	low, medium, high
S4	Interface	User-interface	menu, graphical-selection, text-query, graphics-input
		Programmable-interfaces	no, server-support, framework-support, API
S5	Interoperability	Self-describing data	no, somewhat, yes
		Semantic transparency	no, somewhat, yes
		Linkage-to-other-collections	no, somewhat, yes
		Use of semantic standards	no, somewhat, yes
		Cross-domain semantic crosswalks	no, somewhat, yes
		Programmable interfaces	non-existent, unique, standard, open
S6	Ownership, funding, governance, management and contributors	Reputation	low, medium, high
		Involvement	low, medium, high
		Accessibility	low, medium, high
		Funding-level	low, medium, high
		Funding-reliability	low, medium, high
		Governance-quality	low, medium, high
		Sustainability	short-term, medium-term, indefinite
S7	Attribution & IP	Attribution completeness	low, medium, high
		Attribution accuracy	low, medium, high
		Attribution granularity	low, medium, high
		Licensing, registration, agreements with owners	inapplicable, minimal, partial, complete
		End-user licensing	inapplicable, minimal, partial, complete
		Redaction/anonymisation of data	inapplicable, minimal, partial, complete
D1	Research methodology, funding and stakeholder requirements	Required-access-granularity	attribute, data-item, query-result, dataset, database
		Required-metadata	low, medium, high
		Required-access-to-models	low, medium, high
		Methods	<method designations>
		Publication/distribution requirements	<various>
D2	Discovery methods	Search-engines	generic, specialised
		Discovery metadata	generic, specialised
		Other discovery resources	<indexes, catalogues, etc.>
D3	Query style	Expressivity	low, medium, high
		Desired-interface	menu, graphical-selection, text-query, graphics-input
		Required-programmable-access	low, medium, high
D4	Federation	Need-to-federate	low, medium, high
		Required-metadata-support	low, medium, high
D5	Cross-disciplinary usage	Cross-disciplinary-usage	low, medium, high
		Required-metadata-support	low, medium, high
D6	Timeliness and temporal access	Required-recency	low, medium, high
		Required-timestamp-granularity	low, medium, high
		Desired-update-method	asynchronous, time-stamped, transaction-based
		Required-temporal-access	historical, current-only, versioned, multi-epoch, reconstructible


The identified attributes can have different values that represent the variation among the data collections' characteristics. We explored the online resources of a small sample of candidate data collections, and, to the extent possible, reviewed documentation about their ownership, management, data processing and validation methods, access mechanisms, query interfaces, browsing capabilities, metadata, etc. The identified dimensions, their attributes and the span of plausible values on the supply and demand side are given in Table 1.
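One way to make this characterisation concrete is to record each candidate collection as a profile whose attribute values are checked against the plausible values in Table 1. The sketch below is illustrative only: the class name `CollectionProfile`, the dictionary layout and the example collection are assumptions rather than part of the study, and only four of the table's attributes are included.

```python
from dataclasses import dataclass, field

# Plausible values for a few supply-side attributes, copied from Table 1.
# The (dimension, attribute) key layout is an illustrative assumption.
ALLOWED_VALUES = {
    ("S1", "Access media"): ("online-only", "offline-only", "on-or-offline"),
    ("S2", "Scale"): ("small", "medium", "large"),
    ("S2", "Dynamism (of database)"): ("frozen", "static", "dynamic", "volatile"),
    ("S6", "Sustainability"): ("short-term", "medium-term", "indefinite"),
}


@dataclass
class CollectionProfile:
    """Attribute values recorded for one candidate data collection."""

    name: str
    values: dict = field(default_factory=dict)

    def set_value(self, dimension, attribute, value):
        """Record a value, rejecting anything outside the plausible span."""
        key = (dimension, attribute)
        if key not in ALLOWED_VALUES:
            raise KeyError(f"unknown attribute: {key}")
        if value not in ALLOWED_VALUES[key]:
            raise ValueError(f"{value!r} is not a plausible value for {key}")
        self.values[key] = value

    def missing(self):
        """Return the attributes for which no value has been recorded yet."""
        return [key for key in ALLOWED_VALUES if key not in self.values]


# Hypothetical collection; the name and values are invented for illustration.
profile = CollectionProfile("example-collection")
profile.set_value("S2", "Scale", "large")
profile.set_value("S6", "Sustainability", "medium-term")
print(profile.missing())
```

Keeping the plausible values in one lookup table means a profile cannot silently drift away from the vocabulary of Table 1, and `missing()` shows which attributes still have to be filled in for a given collection.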

Bundles of Strategic Options for a National Library

Analysis of the sample of candidate collections identified a range of options, each addressing one salient attribute value or a small set of them. For example, the BL should (or should not) hold a given dataset itself, or should (or should not) develop and provide its own metadata and query or access mechanisms for a given dataset.

As an initial exercise in how the BL might develop a strategy for providing long-term access to these high-quality reference data collections, we specified three exemplary clusters of attribute values, each of which characterises a class of databases. Each such attribute cluster defines a bundle of options that, taken together, can be considered a strategy.

The first cluster of attributes can be labelled as neutral: it represents the issues arising in the sample of databases that were investigated. For this cluster, the national library might consider providing transparent access to the data collections.
The second cluster of attributes represents a class of databases involving a complex, demanding set of requirements combined with relatively minimal support by the database itself. For this cluster, the national library might consider providing gateway access to the data collections.
The third cluster represents a class of databases involving a simple, undemanding set of requirements, combined with relatively good support by the database itself. These data collections have minimal access restriction, and their supporting mechanisms are relatively simple. For this cluster, the national library might consider providing transparent access to the data collections.

The three bundles of options associated with these attribute clusters should be considered indicative strategies rather than definitive ones. The “demanding” and “undemanding” clusters have been deliberately formulated as two extreme ends on a spectrum of plausible cases. The BL may choose different options, depending on its missions and policies.
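The link from an attribute cluster to a bundle of options can be sketched as a simple lookup. In the example below, the attribute values assigned to the "demanding" and "undemanding" clusters are hypothetical, since the article does not list them, and the scoring rule (count the matching attributes, return nothing on a tie) is an assumption made for illustration. Only the pairing of the demanding cluster with gateway access and the undemanding cluster with transparent access comes from the text above.

```python
# Hypothetical cluster definitions: the attribute names and values come from
# Table 1, but their assignment to clusters is invented for illustration.
CLUSTERS = {
    "demanding": {
        "Restriction": {"paid-membership", "use-payment"},
        "Level of user support": {"low"},
        "Programmable-interfaces": {"no"},
    },
    "undemanding": {
        "Restriction": {"none"},
        "Level of user support": {"high"},
        "Programmable-interfaces": {"API"},
    },
}

# Access approach the article associates with each of these two clusters.
STRATEGY = {
    "demanding": "gateway access",
    "undemanding": "transparent access",
}


def match_score(profile, cluster):
    """Count the cluster attributes whose accepted values include the profile's value."""
    return sum(
        1 for attribute, accepted in cluster.items()
        if profile.get(attribute) in accepted
    )


def suggest_strategy(profile):
    """Return (cluster, strategy) for the best-matching cluster, or None if unclear."""
    scores = {name: match_score(profile, cluster) for name, cluster in CLUSTERS.items()}
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    if best == 0 or len(winners) != 1:
        return None  # no match at all, or a tie between clusters
    return winners[0], STRATEGY[winners[0]]


# Hypothetical database profile, invented for illustration.
print(suggest_strategy({"Restriction": "none", "Level of user support": "high"}))
```

A real assessment would weigh far more attributes and would be subject to the BL's missions and policies, so a lookup of this kind is at most a first filter.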

Lessons and Next Steps

The option bundles presented are only a starting point. The BL's strategy with respect to any given database should be decided on the basis of an overall assessment of the importance and uniqueness of that database, its relevance to the BL's policies with regard to STM data, and the BL's assessment of the degree to which users of the database would benefit from having the BL apply its own curatorial, preservation, or access resources to the database.

Although our study's limited resources allowed us to obtain reasonable information for most supply-side attributes, details of ownership and funding (accessibility of owners, owner reputation, reliability of funding, etc.) could in many cases only be inferred by our necessarily informal methods. Demand-side attributes were even harder to obtain; our values for most of these attributes are derived deductively rather than empirically. These need to be validated and revised based on future demand-side analysis.

The results of this study should therefore be replicated with greater depth and resources, using a larger number and wider range of sample databases, augmented by demand-side input from researchers and user groups. The more in-depth examination should employ direct contact with database administrators, parent organisations, data processing managers, discipline-based organisations whose members use the database, and user communities. This should help fill in the supply-side attributes for each database and provide the demand-side attributes, whose values were supplied largely by assumption in the current study.
