2020 Nov 4;1(8):100136. doi: 10.1016/j.patter.2020.100136

Table 1.

Compilation of Reusability Features for Datasets

Feature | Description | References

Access

License | (1) available, (2) allows reuse | W3C; https://github.com/laurakoesten/; 3, 22, 45, 46, 47
Format/machine readability | (1) consistent format, (2) single value type per column, (3) human- and machine-readable, non-proprietary format, (4) different formats available | W3C; 2, 22, 48, 49, 50
Code available | for cleaning, analysis, visualizations | 51, 52, 53
Unique identifier | PID for the dataset/IDs within the dataset | W3C; 2, 53
Download link/API | (1) available, (2) functioning | W3C; 47, 50

Documentation: Summary Representations and Understandability

Description/README file | meaningful textual description (can also include text, code, images) | 22, 54, 55
Purpose | purpose of data collection, context of creation | 3, 21, 49, 56, 57
Summarizing statistics | (1) on dataset level, (2) on column level | 22, 49
Visual representations | statistical properties of the dataset | 22, 58
Headers understandable | (1) column-level documentation (e.g., abbreviations explained), (2) variable types, (3) how derived (e.g., categorization, such as labels or codes) | 22, 59, 60
Geographical scope | (1) defined, (2) level of granularity | 45, 54, 61, 62
Temporal scope | (1) defined, (2) level of granularity | 45, 54, 61, 62
Time of data collection | (1) when collected, (2) what time span | 63, 64, 65

Documentation: Methodological Choices

Methodology | description of experimental setup (sampling, tools, etc.), link to publication or project | 3, 13, 54, 60, 63, 66
Units and reference systems | (1) defined, (2) consistently used | 54, 67
Representativeness/Population | in relation to a total population | 21, 60
Caveats | changes: classification/seasonal or special event/sample size/coverage/rounding | 48, 54
Cleaning/pre-processing | (1) cleaning choices described, (2) availability of the raw data | 3, 13, 21, 68
Biases/limitations | different types of bias (e.g., sampling bias) | 21, 49, 69
Data management | (1) mode of storage, (2) duration of storage | 3, 70, 71

Documentation: Quality

Missing values/null values | (1) defined what they mean, (2) ratio of empty cells | W3C; 22, 48, 49, 59, 60
Margin of error/reliability/quality control procedures | (1) confidence intervals, (2) estimates versus actual measurements | 54, 65
Formatting | (1) consistent data type per column, (2) consistent date format | W3C; 41, 65
Outliers | data points that differ significantly from the rest | 22
Possible options/constraints on a variable | (1) value type, (2) whether the data contain an “other” category | W3C; 72
Last update | information about data maintenance, if applicable | 21, 62
Completeness of metadata | empty fields in the applied metadata structure | 41
Abbreviations/acronyms/codes | defined | 49, 54

Connections

Relationships between variables defined | (1) explained in documentation, (2) formulae | 21, 22
Cite sources | (1) links or citation, (2) indication of link quality | 21
Links to dataset being used elsewhere | e.g., in publications, community-led projects | 21, 59
Contact | person or organization, mode of contact specified | W3C; 41, 73

Provenance and Versioning

Publisher/producer/repository | (1) authoritativeness of source, (2) funding mechanisms/other interests that influenced data collection specified | 21, 49, 54, 59, 74, 75
Version indicator | version or modification of dataset documented | W3C; 50, 66, 76
Version history | workflow provenance | W3C; 50, 76
Prior reuse/advice on data reuse | (1) example projects, (2) access to discussions | 3, 27, 59, 60

Ethics

Ethical considerations, personal data | (1) data related to individually identifiable people, (2) whether consent was given, if applicable | 21, 57, 71, 75

Semantics

Schema/Syntax/Data Model | defined | W3C; 47, 67
Use of existing taxonomies/vocabularies | (1) documented, (2) link | W3C; 2

This table does not claim to be comprehensive but aims to provide an overview of the many recommended documentation practices for dataset reuse. W3C refers to the Data on the Web Best Practices Vocabulary (https://www.w3.org/TR/vocab-dqv/).
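
Some of the features in the table can be checked mechanically. The following is a minimal sketch, not a method from the article, of how two of them might be computed for a CSV file: the ratio of empty cells (Missing values/null values) and whether each column holds a single value type (Format/machine readability, Formatting). The function names `infer_type` and `profile_columns`, the set of missing-value markers, and the sample data are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Assumption: these strings mark a missing value. Real datasets should
# document their own markers, as the "Missing values" row recommends.
MISSING_MARKERS = {"", "NA", "N/A", "null", "NULL"}


def infer_type(value):
    """Return a coarse type label ('int', 'float' or 'str') for one cell."""
    for caster, label in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return label
        except ValueError:
            continue
    return "str"


def profile_columns(csv_text):
    """Report the missing-value ratio and the observed types per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    report = {}
    for col in reader.fieldnames or []:
        # Short rows yield None for absent cells; treat those as empty.
        values = [(row.get(col) or "").strip() for row in rows]
        present = [v for v in values if v not in MISSING_MARKERS]
        types = sorted({infer_type(v) for v in present})
        report[col] = {
            "missing_ratio": (len(values) - len(present)) / len(values) if values else 0.0,
            "types": types,
            # Simplification: a column mixing 'int' and 'float' is also flagged.
            "single_type": len(types) <= 1,
        }
    return report


# Hypothetical sample data, not taken from any real dataset.
SAMPLE = """station,temp_c,date
A,21.5,2020-11-04
B,NA,2020-11-05
C,warm,2020-11-06
D,19.0,
"""

report = profile_columns(SAMPLE)
for column, stats in report.items():
    print(column, stats)
```

In this sample, `temp_c` would be flagged because it mixes numeric and text values, and both `temp_c` and `date` have one empty cell out of four. Such a check covers only the measurable part of each feature; whether missing values are explained in the documentation still has to be verified by a person.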