Table 1.
Feature | Description | References |
---|---|---|
Access | ||
License | (1) available, (2) allows reuse | W3C, 3, 22, 45, 46, 47 |
Format/machine readability | (1) consistent format, (2) single value type per column, (3) human as well as machine readable and non-proprietary format, (4) different formats available | W3C, 2, 22, 48, 49, 50 |
Code available | for cleaning, analysis, visualizations | 51, 52, 53 |
Unique identifier | PID for the dataset/IDs within the dataset | W3C, 2, 53 |
Download link/API | (1) available, (2) functioning | W3C, 47, 50 |
Documentation: Summary Representations and Understandability | ||
Description/README file | meaningful textual description (can also include text, code, images) | 22,54,55 |
Purpose | purpose of data collection, context of creation | 3,21,49,56,57 |
Summarizing statistics | (1) on dataset level, (2) on column level | 22,49 |
Visual representations | statistical properties of the dataset | 22,58 |
Headers understandable | (1) column-level documentation (e.g., abbreviations explained), (2) variable types, (3) how derived (e.g., categorization, such as labels or codes) | 22,59,60 |
Geographical scope | (1) defined, (2) level of granularity | 45,54,61,62 |
Temporal scope | (1) defined, (2) level of granularity | 45,54,61,62 |
Time of data collection | (1) when collected, (2) what time span | 63, 64, 65 |
Documentation: Methodological Choices | ||
Methodology | description of experimental setup (sampling, tools, etc.), link to publication or project | 3,13,54,60,63,66 |
Units and reference systems | (1) defined, (2) consistently used | 54,67 |
Representativeness/Population | in relation to a total population | 21,60 |
Caveats | changes: classification/seasonal or special event/sample size/coverage/rounding | 48,54 |
Cleaning/pre-processing | (1) cleaning choices described, (2) are the raw data available? | 3,13,21,68 |
Biases/limitations | different types of bias (e.g., sampling bias) | 21,49,69 |
Data management | (1) mode of storage, (2) duration of storage | 3,70,71 |
Documentation: Quality | ||
Missing values/null values | (1) defined what they mean, (2) ratio of empty cells | W3C, 22, 48, 49, 59, 60 |
Margin of error/reliability/quality control procedures | (1) confidence intervals, (2) estimates versus actual measurements | 54,65 |
Formatting | (1) consistent data type per column, (2) consistent date format | W3C, 41, 65 |
Outliers | are there data points that differ significantly from the rest | 22 |
Possible options/constraints on a variable | (1) value type, (2) if data contains an “other” category | W3C, 72 |
Last update | information about data maintenance if applicable | 21,62 |
Completeness of metadata | empty fields in the applied metadata structure? | 41 |
Abbreviations/acronyms/codes | defined | 49,54 |
Connections | ||
Relationships between variables defined | (1) explained in documentation, (2) formulae | 21,22 |
Cite sources | (1) links or citation, (2) indication of link quality | 21 |
Links to dataset being used elsewhere | e.g., in publications, community-led projects | 21,59 |
Contact | person or organization, mode of contact specified | W3C, 41, 73 |
Provenance and Versioning | ||
Publisher/producer/repository | (1) authoritativeness of source, (2) funding mechanisms/other interests that influenced data collection specified | 21,49,54,59,74,75 |
Version indicator | version or modification of dataset documented | W3C, 50, 66, 76 |
Version history | workflow provenance | W3C, 50, 76 |
Prior reuse/advice on data reuse | (1) example projects, (2) access to discussions | 3,27,59,60 |
Ethics | ||
Ethical considerations, personal data | (1) data related to individually identifiable people, (2) if applicable, was consent given | 21,57,71,75 |
Semantics | ||
Schema/Syntax/Data Model | defined | W3C, 47, 67 |
Use of existing taxonomies/vocabularies | (1) documented, (2) link | W3C, 2 |
This table does not claim to be comprehensive but aims to provide an overview of the many recommended documentation practices for dataset reuse. W3C refers to the Data on the Web Best Practices: Data Quality Vocabulary (https://www.w3.org/TR/vocab-dqv/).
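
Some of the quality features listed in the table, such as the ratio of empty cells and whether each column holds a single value type, can be checked mechanically before documentation is written. The following is a minimal Python sketch of such a check and is not part of the original table. The function names `infer_type` and `profile_csv`, the choice of null tokens, and the sample data are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def infer_type(value: str) -> str:
    """Classify a raw CSV cell as 'int', 'float', or 'str'."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            continue
    return "str"


def profile_csv(text: str, null_tokens=("", "NA", "null")) -> dict:
    """Report, per column, the ratio of missing cells and the value types seen.

    The null tokens are an assumption; a real dataset should document
    which markers denote missing values.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    if not rows:
        return {}
    report = {}
    for column in reader.fieldnames or []:
        cells = [row[column] for row in rows]
        # A cell counts as present if it exists and is not a null token.
        present = [
            c.strip() for c in cells
            if c is not None and c.strip() not in null_tokens
        ]
        types = Counter(infer_type(c) for c in present)
        report[column] = {
            "missing_ratio": 1 - len(present) / len(cells),
            "types": dict(types),
            "single_type": len(types) <= 1,
        }
    return report


# Hypothetical sample data, for illustration only.
SAMPLE = """id,temp_c,site
1,20.5,A
2,NA,B
3,warm,
4,19.0,A
"""

if __name__ == "__main__":
    for column, stats in profile_csv(SAMPLE).items():
        print(column, stats)
```

In this sample, `temp_c` mixes numeric and text values and would be flagged as violating the "single value type per column" recommendation, while `site` has one empty cell out of four. A report like this could feed the "Missing values/null values" and "Formatting" entries of a dataset's documentation.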