PLOS Computational Biology. 2021 Jun 16;17(6):e1009041. doi: 10.1371/journal.pcbi.1009041

Ten simple rules for making a vocabulary FAIR

Simon J D Cox 1,*, Alejandra N Gonzalez-Beltran 2, Barbara Magagna 3, Maria-Cristina Marinescu 4
Editor: Scott Markel 5
PMCID: PMC8238180  PMID: 34133421

Abstract

We present ten simple rules that support converting a legacy vocabulary—a list of terms available in a print-based glossary or in a table not accessible using web standards—into a FAIR vocabulary. Various pathways may be followed to publish the FAIR vocabulary, but we emphasise particularly the goal of providing a globally unique resolvable identifier for each term or concept. A standard representation of the concept should be returned when the individual web identifier is resolved, using SKOS or OWL serialised in an RDF-based representation for machine-interchange and in a web-page for human consumption. Guidelines for vocabulary and term metadata are provided, as well as development and maintenance considerations. The rules are arranged as a stepwise recipe for creating a FAIR vocabulary based on the legacy vocabulary. By following these rules you can achieve the outcome of converting a legacy vocabulary into a standalone FAIR vocabulary, which can be used for unambiguous data annotation. In turn, this increases data interoperability and enables data integration.

Author summary

We present ten simple rules that support converting a list of terms not currently accessible using web standards into a vocabulary conforming to the FAIR principles (Findable, Accessible, Interoperable and Reusable). In a FAIR vocabulary each term has its own persistent web identifier, and its definition can be downloaded in both human-readable and standard machine-readable formats. The goal is to enable terminology to be unambiguously cited within technical datasets, in both the dataset description and individual fields within the data, so that data can be discovered and integrated. The rules consider arrangements for governance of a terminology alongside the technical aspects related to conversion of (typically) print-based forms to standards-based knowledge representations. The rules are presented in the sequence in which they should be considered in a conversion process.

Introduction

Environmental sustainability, global pandemics and other natural disasters are some of the challenges we are facing in the 21st century. Addressing these challenges involves analysing vast amounts of data from different sources, which is more effective when these sources are aggregated to find evidence-based solutions. Understanding the data, identifying the terminology used in each dataset and how the terminology in different datasets relates is a prerequisite to enable data integration.

Shared terminology is key to accurate communication and an enabler for data integration. Many organizations and disciplines have a tradition of curating lists of terms to serve various roles, particularly in metadata, column headings, and for some values in datasets. These are often called code-lists or glossaries, and if there is a process to manage them, ‘controlled-vocabularies’. Vocabularies may also be structured as hierarchies, thesauri, taxonomies, through to axiomatized ontologies [1]. Other sets of terms and codes that are used in data include units of measure, lists of materials, taxa, substances, and reference systems like geologic and dynastic time-scales (which are composed of ordered named intervals).

These vocabularies were typically managed as lists or tables within text-based resources (books and manuals), or sometimes as authority-tables in databases or in spreadsheets, for use within very specific communities and applications. We refer to these as “legacy vocabularies”. However, integration of datasets, both within and across applications, requires that the terminology used in them is interoperable, so that users in the target communities (a) share an understanding of the meaning of terms, and (b) use the same conventions for representing the terms within datasets.

Standard knowledge representation languages make a vocabulary not only useful for humans, but also for machines. A number of guidelines are available for creating and publishing new vocabularies (e.g. [2,3]). Nevertheless, the legacy vocabularies represent the accumulated consensus of important disciplines and communities. Hence, making those vocabularies FAIR—or Findable, Accessible, Interoperable and Reusable [4,5]—is a high-value activity that can preserve the embedded domain intuition and knowledge. While controlled-vocabularies were often defined and used within small communities or organizations, FAIR vocabularies can be used in the context of much larger interconnected data and communities, and be actionable by machines.

Our approach to making a vocabulary FAIR is to use Web technology as outlined in the rules below. We focus on the publication of the vocabulary as ‘Linked Data’, which means that (i) the vocabulary is on the web, with an individual persistent, resolvable, unique web identifier (web link) per term, i.e. an HTTP (Hypertext Transfer Protocol) IRI (Internationalized Resource Identifier), and (ii) when a term IRI is requested, a machine-readable representation of the term using Semantic Web standards is obtained. See Table 1 for a summary of how we assess if a vocabulary is FAIR, and Box 1 for some basic definitions relating to Semantic Web standards (https://www.w3.org/standards/semanticweb/data).

Table 1. Summary of FAIR principles applied to a vocabulary.

F (Findable)
  • Each vocabulary is denoted by a persistent unique web identifier
  • Each term is denoted by a persistent unique web identifier
  • It is possible to search for a term or vocabulary and get a web identifier for it
  • The vocabulary is available from at least one repository recognised by the community

A (Accessible)
  • When the vocabulary or term identifier is de-referenced, a machine- or human-readable representation is returned, as requested

I (Interoperable)
  • At least one representation conforms to a community standard for vocabularies
  • The vocabulary includes mapping relations to other vocabularies

R (Reusable)
  • The license for use of the vocabulary is clear and accessible
  • Enough metadata at vocabulary and term-level is provided, including provenance and maintenance information
  • The definitions are sufficient for a user to understand what each term means

Box 1. Some basic Semantic Web definitions

The Resource Description Framework (RDF) is the core data model of the Semantic Web. RDF-Schema (RDFS) is an extension of RDF and is used for representing simple RDF vocabularies on the Web. Based on RDF, the Web Ontology Language (OWL) is a computational logic-based language for ontologies. The Simple Knowledge Organization System (SKOS) is a simple OWL ontology to represent Knowledge Organization Systems (KOS) such as thesauri, term lists and controlled vocabularies.

To make legacy vocabularies FAIR, processes and practices are required for transitioning and adapting vocabularies from traditional forms rooted in print technologies to more broadly accessible modes that are available openly on-demand, as web resources. These have been demonstrated in many projects and services (e.g. [6]). Our goal here is to distill guidelines for taking an existing list of terms and converting it to a web-accessible, FAIR vocabulary, and present the guidelines as ‘ten simple rules’.

In this paper we focus on one specific scenario, where:

  1. there is a community requirement to use agreed terms in data or metadata

  2. a suitable vocabulary (list of terms or codes with definitions) is available, hereafter called the legacy vocabulary; it was created by an organisation, person or group of people that we refer to as the ‘content custodian’, who may also be maintaining and revising it moving forward

  3. the legacy vocabulary is in the form of a print document, a digital document, or in a semi-structured form such as a spreadsheet, comma-separated value file (CSV), database table, or XML document, and is not arranged and published in a FAIR way that allows references to the terms to be resolved to learn what they mean, using standard web technology

  4. no other vocabulary that is suitable for the application and acceptable to the community is published in a FAIR way either.

The Ten Simple Rules below describe how to convert that legacy vocabulary, using existing, widely used practices, into a form that can be understood and linked on the Web and that is compatible with, and thus potentially able to be integrated with, related FAIR vocabularies. Some of the rules refer explicitly to the main FAIR principles, while others are basic vocabulary prerequisites. This scenario is narrow, but common. The resulting representation may not be axiomatized enough to support automated reasoning and logic operations, but publication in a form that allows specific web references is a significant improvement over the legacy forms.

We provide extensive supplementary material online at https://fairvocabularies.github.io/examples/ in the form of detailed examples taken from real vocabularies that illustrate the rules. It is strongly recommended to consult these examples in order to more fully understand details of our Ten Simple Rules.

This paper is complementary to Ten Simple Rules about vocabulary development [7] and vocabulary selection [8], and the best practices and recommendations for implementing FAIR vocabularies that primarily apply to new vocabularies rather than legacy ones that need conversion into FAIR [2,3]. The rules are arranged as a stepwise recipe for creating a FAIR vocabulary based on the legacy vocabulary. A partial alignment to the best practice recommendations is provided after the rules.

Rules

Rule 1. Determine the governance arrangements and custodian of the legacy vocabulary

Identify the content custodian, which is the agent (i.e. organization or person/people) that was responsible for creating or selecting the list of terms in the legacy vocabulary. They will have expertise in the subject-matter. They may be an individual, a formal or informal committee or working group, or an official organization, such as a government agency, or learned society, and will usually be managing the vocabulary on behalf of a specified community, discipline, organization, and/or jurisdiction.

When you have identified the content custodian, it is recommended that you advise them of your plan to repurpose the legacy vocabulary as a FAIR vocabulary, to get their acknowledgement of your initiative. Enrol them in the repurposing process if possible. Find out their planned revision schedule for the legacy vocabulary, so that you can allow for this in your FAIR vocabulary maintenance plan (Rule 10).

Rule 2. Verify that the legacy-vocabulary license allows repurposing, and agree on the license for the FAIR vocabulary

Verify that the copyright-holder grants permission for the list of terms to be re-published as ‘Linked Data’ (noting that the copyright-holder is often different to the maintainer or content custodian—see Rule 1).

If the source carries a Creative Commons license, then the No Derivatives (ND) options (CC BY-ND, CC BY-NC-ND) are not suitable, since you are developing a ‘derivative product’.

The other CC licenses (CC0, and CC BY, CC BY-SA, CC BY-NC, CC BY-NC-SA) are suitable, provided you are also able to meet any BY (attribution), SA (share-alike) and NC (non-commercial) constraints.
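The license check described above can be sketched as a small lookup. This is an illustration only: the function name `license_conditions` and the way license strings are normalised are assumptions made for this sketch, and it is no substitute for reading the actual license text.

```python
# Creative Commons licenses named in this rule, by short identifier.
NO_DERIVATIVES = {"CC BY-ND", "CC BY-NC-ND"}
DERIVATIVES_ALLOWED = {"CC0", "CC BY", "CC BY-SA", "CC BY-NC", "CC BY-NC-SA"}

# Constraints that must still be met when a derivative is permitted.
CONSTRAINTS = {"BY": "attribution", "NC": "non-commercial", "SA": "share-alike"}


def license_conditions(license_id: str) -> list[str]:
    """Return the constraints to honour when repurposing under a CC license.

    Raises ValueError for No Derivatives licenses and for anything that is
    not one of the CC licenses listed above (which needs manual analysis).
    """
    # Accept spellings such as "cc-by" or " CC BY-NC " (hypothetical convention).
    normalised = license_id.strip().upper().replace("CC-", "CC ")
    if normalised in NO_DERIVATIVES:
        raise ValueError(f"{license_id}: No Derivatives license, repurposing not permitted")
    if normalised not in DERIVATIVES_ALLOWED:
        raise ValueError(f"{license_id}: not a listed CC license, analyse it manually")
    tokens = normalised[2:].strip().split("-")
    return [CONSTRAINTS[t] for t in tokens if t in CONSTRAINTS]
```

For example, `license_conditions("CC BY-NC-SA")` returns `["attribution", "non-commercial", "share-alike"]`, while `license_conditions("CC BY-ND")` raises an error.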

If the original content uses another type of license, you must analyse it to understand whether you are able to produce a derivative product, and what the conditions for derivation are. It may be necessary to contact the copyright-holder directly in order to explain what is planned and get permission.

Agree on the license for the FAIR vocabulary, preferably an open license for users (e.g. CC0 or CC-BY).

Rule 3. Check term and definition completeness and consistency in the legacy vocabulary

Ensure there is at least (i) a unique label and (ii) a description or textual definition for each term in the list. These are the minimum requirements for a useful vocabulary, and the minimum required information for encoding the FAIR vocabulary (Rule 6). Verify that the definitions are unambiguous, and ideally that they are distinct. If definitions overlap, or are missing or ambiguous, consult the custodian of the legacy vocabulary and ensure that the representation follows the reality of the domain (Rule 1), else identify or recruit an expert group to revise, review or provide definitions and sources. Ideally this should be composed of more than one person to allow a quality control cycle. As a last resort check with a public source for definitions such as Wikipedia, DBpedia, or Wikidata.
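The minimum checks described above (a unique label and a non-empty definition for each term) are easy to automate once the legacy list is in tabular form. The following is a minimal sketch, assuming each term has been read into a dictionary with the hypothetical keys `label` and `definition`; detecting ambiguous or overlapping definitions still needs human review.

```python
from collections import Counter


def check_terms(terms: list[dict]) -> dict[str, list[str]]:
    """Report labels that are duplicated, lack a definition, or are missing.

    Each term is a dict with the (hypothetical) keys "label" and
    "definition", e.g. one row of a spreadsheet export.
    """
    labels = [(t.get("label") or "").strip() for t in terms]
    # Compare case-insensitively so that "Mercury" and "mercury" collide.
    counts = Counter(label.lower() for label in labels if label)
    duplicates = sorted({l for l in labels if l and counts[l.lower()] > 1})
    missing_definition = [
        label
        for label, term in zip(labels, terms)
        if label and not (term.get("definition") or "").strip()
    ]
    # Rows are numbered from 1, as in a spreadsheet without a header row.
    unlabelled = [f"row {i}" for i, label in enumerate(labels, start=1) if not label]
    return {
        "duplicate_labels": duplicates,
        "missing_definition": missing_definition,
        "unlabelled": unlabelled,
    }
```

Any non-empty list in the report is a prompt to consult the content custodian (Rule 1) rather than something to fix silently.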

The legacy vocabulary may also contain synonyms, intra-vocabulary relationships such as a broader/narrower hierarchy, specified subsets, and cross-vocabulary mappings. Guidelines to encode all of these elements in a FAIR vocabulary are given in Rule 6.

Rule 4. Establish a traceable maintenance-environment for the FAIR vocabulary content

It is common to store the reference version of the FAIR vocabulary in a single file, using one of the standard RDF serializations (e.g. Turtle, RDF/XML, JSON-LD). It is strongly recommended to maintain this in a system that allows any changes made in the vocabulary to be easily traced. Thus, we recommend use of a version control system, hosted on a platform such as Bitbucket, GitHub or GitLab. Public access should be allowed, unless the content owner has good reason not to. An issue tracker or ticket system should be used to capture term requests or other proposals by members of the community, and to record the justification for individual changes made by the content custodian.

Note that an issue tracker is built into GitHub and GitLab; JIRA or Trac are popular stand-alone options.

More details on reflecting changes and revisions to the vocabulary content out to the published FAIR vocabulary are discussed in Rule 10.

Rule 5. Assign a unique and persistent identifier to (a) the vocabulary and (b) each term in the vocabulary

Choose a domain name for persistent identifier IRIs for the terms and other vocabulary items (e.g. collections of terms). Those IRIs must resolve to appropriate representations on the web over the lifetime of any datasets that will make use of them, so plan to manage this domain for at least 10 years. Since this is longer than many organization names and most organizational structures last, domain names based on organizations are generally not suitable, unless the organization was specifically created for the purpose of managing vocabularies. Consider existing open solutions for persistent identifiers such as https://w3id.org or http://purl.org as an alternative to managing your own HTTP server.

Choose and document the pattern for individual IRIs that identify terms in the vocabulary [9,10]. A common pattern is:

IRI = [http://|https://] + {domain} + {vocab} + {term-id}

where {domain} is the long-lasting host for the FAIR vocabulary, {vocab} is a path composed of a sequence of tokens separated by slash characters (‘/’), and {term-id} denotes the individual term, and must be unique in the context of the vocabulary. Some complete IRIs for terms that demonstrate this pattern are

http://anzsoil.org/def/au/asls/landform/modal-slope

http://resource.geosciml.org/classifier/ics/ischart/Cambrian

http://vocabs.lter-europe.net/EnvThes/21279

http://purl.obolibrary.org/obo/ENVO_00000081

http://qudt.org/vocab/unit/DEG_C

http://vocab.nerc.ac.uk/collection/P06/current/UPAA/

The {term-id} may be an opaque code (e.g. numeric), or it may be based on the term or primary label for each term, or some other rule. For vocabularies with up to a few hundred terms where the meanings do not change over time, use of a label as the basis for a {term-id} may be manageable, and this can be a useful mnemonic for developers and maintainers. However, it is important to consider the stability of the current label, and have a strategy for managing the IRI if a different label becomes preferred for the same concept. For large vocabularies, or when labels may change over time, label-based patterns are difficult to sustain for the {term-id}, and numeric or opaque identifiers are more common [9].
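The IRI pattern and the two styles of {term-id} can be sketched in a few lines. The helper names `label_to_term_id` and `term_iri` are hypothetical, and reducing labels to lower-case ASCII with hyphens is only one possible convention (IRIs themselves may contain non-ASCII characters).

```python
import re
import unicodedata


def label_to_term_id(label: str) -> str:
    """Derive a mnemonic {term-id} from a label, e.g. "Modal slope" -> "modal-slope"."""
    # Strip accents, then collapse every run of other characters to one hyphen.
    ascii_label = (
        unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_label.lower()).strip("-")


def term_iri(domain: str, vocab: str, term_id: str, scheme: str = "https") -> str:
    """Assemble an IRI of the form scheme://{domain}/{vocab}/{term-id}."""
    if not term_id:
        raise ValueError("term_id must not be empty")
    return f"{scheme}://{domain.strip('/')}/{vocab.strip('/')}/{term_id}"


# Label-based identifier, reproducing the first example IRI listed above.
print(term_iri("anzsoil.org", "def/au/asls/landform",
               label_to_term_id("Modal slope"), scheme="http"))

# Opaque (numeric) identifier, as in the EnvThes example.
print(term_iri("vocabs.lter-europe.net", "EnvThes", "21279", scheme="http"))
```

Whichever style is chosen, the identifier should be minted once and stored with the term, not recomputed from the label, so that a later change of preferred label does not change the IRI.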

It is recommended not to embed version information in the path or identifier, as this creates challenges if the same concept persists over multiple versions or releases.

It is recommended to avoid long paths. Hierarchical relationships should not be implied by the IRI path, but rather should be recorded explicitly within the representation of the term (see Rule 6).

It is recommended to use slash (‘/’) IRIs for large vocabularies, rather than hash (‘#’) IRIs. An HTTP client removes the fragment (the part after ‘#’) before sending the request, so when a # IRI is requested the entire vocabulary document is returned instead of just a single term. This may be acceptable for a small vocabulary, but is undesirable for large vocabularies [10].
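The difference between hash and slash IRIs can be seen by looking at what is actually sent to the server. The sketch below uses the standard library function `urllib.parse.urldefrag`; the `example.org` IRIs and the helper name `requested_url` are made up for illustration.

```python
from urllib.parse import urldefrag


def requested_url(iri: str) -> str:
    """Return the URL an HTTP client requests for an IRI (fragment removed)."""
    url, _fragment = urldefrag(iri)
    return url


hash_iris = ["http://example.org/vocab#term-a", "http://example.org/vocab#term-b"]
slash_iris = ["http://example.org/vocab/term-a", "http://example.org/vocab/term-b"]

# Both hash IRIs collapse to one request for the whole vocabulary document.
print({requested_url(iri) for iri in hash_iris})

# Each slash IRI is requested separately, so the server can answer per term.
print({requested_url(iri) for iri in slash_iris})
```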

In Rule 9 we outline how the IRIs should be made resolvable, thus making the vocabulary, and its terms, accessible.

For more examples, see the online supplementary material https://fairvocabularies.github.io/examples/

Rule 6. Create machine readable representations of the vocabulary terms

Convert the vocabulary to semantic standards, using either the Simple Knowledge Organisation System (SKOS) [11-13] or the Web Ontology Language (OWL) [14,15], together with elements from other standard vocabularies and ontologies where appropriate (e.g. Dublin Core [16,17]).

The table below details various technical steps and patterns for use of either SKOS or OWL to represent a vocabulary in RDF. There are a number of considerations in making a choice of one or the other of these pathways [18]:

  • SKOS was designed for sets of definitions optionally arranged in a hierarchy, so nicely fits the primary scenario under consideration here: i.e. conversion of a legacy vocabulary to an RDF-based form using a semi-formal representation. SKOS includes a number of features designed to make the conversion straightforward, including synonyms, codes, subsets, and broader/narrower relationships. However, there are limitations in its logical completeness that are considered weaknesses in some applications;

  • OWL supports axiomatization (based on description logics) for representing formal ontologies, and was designed for a much wider range of applications than the primary scenario. However, the design choices using OWL are complex, and describing them is well beyond the scope of this paper. Nevertheless, a basic OWL pattern is outlined below, with the namespaces limited to core vocabularies. This option most closely parallels SKOS, and is thus suitable for the primary use-case covered by this paper.

Other sources provide details on SKOS and OWL, their particular strengths, and how they can be used together (e.g. [13,19]). We include the OWL option here because a rich OWL representation is a potential future goal for a FAIR vocabulary, so a minimal version is a useful starting point. However, the choice of representation is not critical in this phase of vocabulary formalization. The most important feature is that a unique IRI is used to denote each distinct term (see Rule 5), so that these IRIs can be used in data or metadata. The representation of each term might be changed or supplemented later while retaining the same IRI, and alternative representations or descriptions can be provided to suit each application, as long as they describe the same underlying concept (see Rule 9).

Table 2 illustrates basic steps to follow to create a FAIR vocabulary relying on SKOS or OWL (for the namespace prefixes see Box 2). The SKOS terminology is standard. For an OWL representation we suggest some common elements, and for more expressive ontologies we recommend investigating OWL and the conventions of the community you want to target.

Table 2. Steps in the creation of machine-readable definitions.

Step: Identify terms
  SKOS: Encode each vocabulary term as a skos:Concept, assigning an identifier as discussed in Rule 5.
  OWL (basic): Encode each vocabulary term as an owl:Class, assigning an identifier as discussed in Rule 5.

Step: Encode term labels and synonyms
  SKOS: Encode the term name as the skos:prefLabel and synonyms and abbreviations in skos:altLabel. Language tags [20] can be used for multilingual vocabularies.
  OWL (basic): Encode the term name as rdfs:label. For synonyms you can use the SKOS elements skos:prefLabel and skos:altLabel. Note that some communities have their own terminology for labels and annotations (e.g. the OBO Foundry relies on the Information Artifact Ontology). Language tags [20] can be used for multilingual terms.

Step: Add textual definitions
  SKOS: Encode the textual definition as skos:definition.
  OWL (basic): Encode the textual definition as rdfs:comment.

Step: Add codes and symbols
  SKOS: Add any code or symbol as skos:notation (this applies if a formal code or symbol for the term is available, in addition to the name used for the skos:prefLabel).
  OWL (basic): Add any code or symbol as additional labels (rdfs:label).

Step: Add notes or comments for clarifications
  SKOS: Comments can be encoded using skos:note. Clarifications on usage can be recorded using skos:scopeNote.
  OWL (basic): Comments and clarifications can be encoded in the rdfs:comment.

Step: Add per-term metadata, if available
  SKOS: Individual terms may be annotated using standard elements such as dcterms:creator, dcterms:created, dcterms:identifier, dcterms:modified, dcterms:source, dcterms:replaces, rdfs:seeAlso. Per-term annotations are needed when they differ from values associated with the vocabulary as-a-whole (see Rule 7).
  OWL (basic): The same metadata elements can be used to annotate terms in the OWL encoding, as well as owl:versionInfo, rdfs:comment, rdfs:isDefinedBy. Alternatively, adopt a specific solution for describing term metadata such as the OBO Metadata Ontology (http://www.obofoundry.org/ontology/omo.html).

Step: Define the hierarchy of terms
  SKOS: If hierarchical relationships between terms are provided in the source document, encode these using skos:broader and skos:narrower. A narrower concept or subclass may be related to more than one broader concept or parent class, so each term may appear in more than one place in a hierarchy.
  OWL (basic): If hierarchical (is-a-kind-of) relationships between terms are provided in the source document, encode these using rdfs:subClassOf. A more specific concept or subclass may be related to more than one broader concept or parent class, so each term may appear in more than one place in a hierarchy. N.B. OWL sub-class relationships have more precise semantics than SKOS narrower/broader relations.

Step: Encode relationships between terms
  SKOS: If other relationships (non-hierarchical) between terms within the vocabulary are provided in the source document, they may be encoded using skos:related. Relationships to terms in other vocabularies (mappings) may be encoded using skos:broadMatch, skos:closeMatch, skos:exactMatch, skos:narrowMatch, skos:relatedMatch.
  OWL (basic): dcterms:relation may be used to indicate related resources. However, it is usually better to use OWL object properties with the specific required semantics. It is recommended to re-use existing elements (e.g. relations ontology http://www.obofoundry.org/ontology/ro.html), or create your own if no existing one fulfils your requirement. owl:equivalentClass may be used for mappings.

Step: Define subsets
  SKOS: If subsets or other groupings of terms are present in the source, encode each as a skos:Collection. Collections may be nested. Concepts may be members of more than one collection.
  OWL (basic): If subsets or other groupings of terms are present in the source, encode each as a class whose members are sub-classes.

Step: Define the whole vocabulary
  SKOS: The complete vocabulary should be encoded as a skos:ConceptScheme. Every skos:Concept should have a skos:inScheme relationship to the scheme, else the top terms in broader/narrower chains should have a skos:topConceptOf relationship to the concept scheme.
  OWL (basic): The complete vocabulary should be encoded as an owl:Ontology. Every member term that is not in the same namespace as the ontology should have a rdfs:isDefinedBy relationship to the ontology.

Box 2. Namespace prefixes mentioned in Rule 6

dcterms: http://purl.org/dc/terms/

owl: http://www.w3.org/2002/07/owl#

rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#

rdfs: http://www.w3.org/2000/01/rdf-schema#

skos: http://www.w3.org/2004/02/skos/core#

In Box 3 we show an example of both representations for the same term (serialized in Turtle [21]):

Box 3. SKOS and OWL representations of the same term

# SKOS representation
ex:element-80 a skos:Concept;
    skos:prefLabel "mercury"@en;
    skos:prefLabel "mercurio"@es;
    skos:altLabel "quicksilver"@en;
    skos:notation "Hg";
    skos:definition "A heavy, silvery d-block element, mercury is the only metallic element that is liquid at standard conditions for temperature and pressure.";
    dcterms:identifier "7439-97-6";
    dcterms:source <https://en.wikipedia.org/wiki/Mercury_(element)>;
    skos:broader ex:group-12, ex:period-6;
    skos:exactMatch <http://purl.obolibrary.org/obo/CHEBI_16170>;
    skos:inScheme ex:periodicTable;
.

# OWL (basic) representation
ex:element-80 a owl:Class;
    rdfs:label "mercury"@en;
    rdfs:label "mercurio"@es;
    skos:altLabel "quicksilver"@en;
    rdfs:label "Hg";
    rdfs:comment "A heavy, silvery d-block element, mercury is the only metallic element that is liquid at standard conditions for temperature and pressure.";
    dcterms:identifier "7439-97-6";
    dcterms:source <https://en.wikipedia.org/wiki/Mercury_(element)>;
    rdfs:subClassOf ex:group-12, ex:period-6;
    owl:equivalentClass <http://purl.obolibrary.org/obo/CHEBI_16170>;
    rdfs:isDefinedBy ex:periodicTable;
.

Different approaches will be required for the conversion, depending on the form of the source material.

  • Where the original vocabulary is only available as a printed document, scanning, or even rekeying the essential information may be the only practical route; if available as a digital text document, you may be able to copy and paste the information

  • Where the legacy vocabulary is tabulated, either fully or in part, it may be possible to identify a pattern or template from the elements of your vocabulary which will allow you to (fully or partly) automate the creation of the FAIR vocabulary. Tools such as SKOS-Play! or sheet2rdf and OpenRefine can convert spreadsheets to RDF. Links to these, and to tools to convert many other formats to RDF are available at https://www.w3.org/wiki/ConverterToRdf.

  • qSKOS (https://qskos.poolparty.biz/) is a useful structure- and quality-checker for SKOS vocabularies, and SKOSify (https://skosify.readthedocs.io/en/latest/) automates some conversion and cleaning operations.

  • Ontorat [22] and ROBOT [23] can be used for generating terms, annotations and axioms of an OWL vocabulary based on ontology design patterns or templates; in addition, ROBOT has other functionality to automate ontology development workflows.

Either way, it is recommended to use an RDF/OWL or SKOS IDE (Integrated Development Environment) such as TopBraid, Protégé, VocBench, or PoolParty for data entry, or for tidying up after an automated phase, and for consistency checking.
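Where the legacy vocabulary is tabulated, the template idea mentioned above can be illustrated with a short script that renders one row as a skos:Concept in Turtle. This is a sketch under stated assumptions, not a replacement for the tools listed: the column names (`id`, `label`, `definition`, `synonyms`, `notation`), the `ex:` prefix, the default scheme and the `en` language tag are all hypothetical, and the prefix declarations must be written elsewhere in the output file.

```python
def turtle_literal(text: str, lang: str = "") -> str:
    """Quote a string as a Turtle literal, escaping backslashes, quotes and newlines."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"' + (f"@{lang}" if lang else "")


def concept_to_turtle(row: dict, scheme: str = "ex:periodicTable") -> str:
    """Render one tabulated term as a skos:Concept.

    `row` uses hypothetical column names: id, label, definition,
    synonyms (semicolon-separated, optional) and notation (optional).
    """
    lines = [f"ex:{row['id']} a skos:Concept ;"]
    lines.append(f"    skos:prefLabel {turtle_literal(row['label'], 'en')} ;")
    synonyms = (s.strip() for s in row.get("synonyms", "").split(";"))
    for synonym in filter(None, synonyms):
        lines.append(f"    skos:altLabel {turtle_literal(synonym, 'en')} ;")
    if row.get("notation"):
        lines.append(f"    skos:notation {turtle_literal(row['notation'])} ;")
    lines.append(f"    skos:definition {turtle_literal(row['definition'], 'en')} ;")
    lines.append(f"    skos:inScheme {scheme} .")
    return "\n".join(lines)


row = {
    "id": "element-80",
    "label": "mercury",
    "definition": "A heavy, silvery d-block element.",
    "synonyms": "quicksilver",
    "notation": "Hg",
}
print(concept_to_turtle(row))
```

The output should still be loaded into an RDF tool or IDE for validation, since string templating does not check that the result is consistent SKOS.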

The FAIR vocabulary should represent the legacy vocabulary as closely as possible, so it is not recommended to change the vocabulary content or structure during encoding, even if there appear to be errors or potential improvements. The initial FAIR representation can serve as a baseline for future revisions, while clearly anchored to an archival source. Changes to the content of the legacy and FAIR vocabularies remain the prerogative of the content custodian identified in Rule 1, and the maintenance process described in Rule 10.

Rule 7. Add vocabulary metadata

Add metadata for the vocabulary, by adding metadata elements to the skos:ConceptScheme or owl:Ontology that represent the vocabulary-as-a-whole.

The description of the vocabulary must include at least:

  • provenance and ownership information (citation of or links to the source, pointers to the organization or community responsible for the content),

  • lifecycle information (creation and update dates, vocabulary status, pointers to the people responsible for the conversion and encoding, version information),

  • vocabulary license, as agreed in Rule 2.
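A simple completeness check can confirm that the three kinds of metadata listed above are present before the vocabulary is registered. The mapping from each category to specific Dublin Core and OWL properties below is an assumption made for this sketch; the repository policy (Rule 8) may prescribe different properties.

```python
# One possible mapping from the required metadata categories to properties.
# This selection is an assumption, not a prescribed profile.
REQUIRED_METADATA = {
    "provenance": ["dcterms:source", "dcterms:publisher"],
    "lifecycle": [
        "dcterms:created",
        "dcterms:modified",
        "dcterms:creator",
        "owl:versionInfo",
    ],
    "license": ["dcterms:license"],
}


def missing_metadata(scheme_metadata: dict[str, str]) -> dict[str, list[str]]:
    """Return, per category, the required properties that are absent or empty.

    `scheme_metadata` maps property names to the values attached to the
    skos:ConceptScheme or owl:Ontology.
    """
    gaps = {}
    for category, properties in REQUIRED_METADATA.items():
        absent = [p for p in properties if not scheme_metadata.get(p)]
        if absent:
            gaps[category] = absent
    return gaps
```

An empty result means that every property in the assumed profile has a value; it says nothing about whether the values themselves are accurate.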

Different communities rely on metadata elements as defined by different vocabularies such as Data Catalog Vocabulary (DCAT [24]), Linked Open Vocabularies (LOV [25]), Ontology Metadata Vocabulary (OMV [26]), or the Metadata for Ontology Description and Publication Ontology (MOD [27]). OWL includes some built-in annotation properties that are applicable to OWL ontologies (e.g. owl:priorVersion, owl:backwardsCompatibleWith, owl:incompatibleWith). The choice of metadata vocabulary and the details of mandatory requirements should be prescribed in the policies of the vocabulary repository (Rule 8), and documented in the metadata for the vocabulary, either as full text or as a link to a policies document.

Rule 8. Register the vocabulary

Load or register the encoded content in a vocabulary service or semantic repository, such as Research Vocabularies Australia (RVA) (https://vocabs.ardc.edu.au/) (for SKOS vocabularies), Linked Open Vocabularies (LOV) (https://lov.linkeddata.es/dataset/lov/ [28]) (for OWL ontologies and SKOS vocabularies), the ESIP Community Ontology Repository (https://cor.esipfed.org/) or BioPortal (https://bioportal.org) and its derivatives such as Agroportal (http://aims.fao.org/agroportal) and Ecoportal (http://ecoportal.lifewatchitaly.eu/ontologies) (for OWL ontologies and SKOS vocabularies). If you expect to be maintaining many vocabularies you might establish your own service using one of the software stacks available.

You should also deposit release snapshots of the vocabulary in a repository such as Zenodo (https://zenodo.org) or Dryad (https://datadryad.org/stash), or in an institutional data repository available to you. This step will assign a DOI to the vocabulary and will ensure that the vocabulary is indexed in more general search engines. See Rule 4 for recommendations on using a version control system, and consider that there are automated ways to store GitHub releases in Zenodo (with associated DOI). You may also consider registering the FAIR vocabulary as a ‘standard’ in FAIRsharing (https://fairsharing.org/).

Finally, the community for whom the vocabulary is provided (identified in Rule 1) is likely to maintain a listing of community resources, which is often the first place that members of the community would look. Such venues would be a good target for linking to the vocabulary.

Rule 9. Make the vocabulary accessible for humans and machines

The web identifiers used in the vocabulary should resolve to specific digital objects. Thus, the HTTP server for the vocabulary domain (identified in Rule 5) must be configured so that any request for an IRI denoting a term gets a representation of the individual term from the service that hosts the vocabulary. Use standard HTTP content negotiation to provide access to different representations (using Accept: and Accept-profile: headers [29]). The representation should be a web page (if HTML is requested) or a serialized skos:Concept or owl:Class (if RDF is requested). The IRI for the vocabulary-as-a-whole should get a suitable ‘Landing Page’ (if HTML is requested) or a representation of the skos:ConceptScheme or owl:Ontology (if RDF is requested). The HTML representation can be generated automatically with existing tools (e.g. [30]). The representation should include metadata and attribution information. (Note that inbuilt metadata means that there is no advantage to licensing the FAIR vocabulary with CC-BY compared with CC0 (see Rule 2).)
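The content negotiation step can be illustrated with a simplified selection function that reads the Accept header and chooses between an HTML page and an RDF serialization. This is a sketch only: the set of available media types is an assumption, partial wildcards such as `text/*` and the Accept-profile header are ignored, and a production server would normally rely on its web framework or server configuration instead.

```python
def parse_accept(header: str) -> list[tuple[str, float]]:
    """Parse an HTTP Accept header into (media type, q-value) pairs, best first."""
    choices = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type, q = parts[0].lower(), 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        if media_type:
            choices.append((media_type, q))
    # The sort is stable, so equal q-values keep the client's order.
    return sorted(choices, key=lambda choice: choice[1], reverse=True)


# Representations this hypothetical vocabulary server can produce.
AVAILABLE = ("text/html", "text/turtle", "application/ld+json")


def choose_representation(accept_header: str, default: str = "text/html") -> str:
    """Pick the media type to serve for a term IRI, given the Accept header."""
    for media_type, q in parse_accept(accept_header or "*/*"):
        if q <= 0:
            continue
        if media_type in AVAILABLE:
            return media_type
        if media_type == "*/*":
            return default
    # Simplification: fall back to the default rather than answering 406.
    return default
```

With this logic a browser-style header such as `text/html,application/xhtml+xml;q=0.9,*/*;q=0.8` selects the web page, while a client sending `text/turtle` receives the serialized skos:Concept or owl:Class.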

SPARQL [31,32] is the standard RDF query interface, so a SPARQL endpoint may be provided to support flexible queries and interactions. A link to the SPARQL endpoint should be provided on the HTML landing pages. The public endpoint should not allow SPARQL Update operations [33]. The hosting service may provide other vocabulary Application Programming Interfaces (e.g. RVA provides SISSvoc [34]). These should be clearly advertised to the user-community.
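As an illustration of the kind of read-only request a SPARQL endpoint supports, the sketch below builds a query for the preferred label and definition of one term, and encodes it as a GET request URL with the query in the `query` parameter. The endpoint address and the function names are hypothetical, and the query assumes a SKOS encoding of the vocabulary.

```python
from urllib.parse import urlencode


def concept_query(term_iri: str) -> str:
    """Build a SPARQL SELECT query for one concept's label and definition."""
    # Refuse characters that are not allowed inside <...> in SPARQL.
    if any(c in term_iri for c in '<>"{}|^`\\ '):
        raise ValueError("not a valid IRI for use in a SPARQL query")
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?label ?definition WHERE {\n"
        f"  <{term_iri}> skos:prefLabel ?label .\n"
        f"  OPTIONAL {{ <{term_iri}> skos:definition ?definition }}\n"
        "}"
    )


def query_url(endpoint: str, query: str) -> str:
    """Build a GET request URL carrying the query in the `query` parameter."""
    params = urlencode({"query": query})
    return f"{endpoint}?{params}"


query = concept_query("http://example.org/vocab/term-a")
print(query_url("https://example.org/sparql", query))
```

The resulting URL can be fetched with any HTTP client, typically with an Accept header requesting a SPARQL results format such as `application/sparql-results+json`.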

Rule 10. Implement a process for publishing revisions of the FAIR vocabulary

The FAIR vocabulary should be created and maintained so that it reflects the content and updates agreed and issued by the content custodian, so it is important to obtain the maintenance schedule and versioning strategy for the vocabulary from the content custodian (Rule 1).

We recommend updating the FAIR vocabulary as soon as practical after the content custodian updates the legacy vocabulary. If the content custodian wishes to maintain the content in its original form (i.e. the legacy vocabulary), then try to arrange for the custodian to issue alerts advising you of changes, in order to trigger the update of the FAIR vocabulary. Alternatively, it may be possible to transition to an arrangement in which the FAIR vocabulary becomes the primary version or ‘point of truth’ for the content, in which case individual revisions should be proposed and tracked in a traceable maintenance environment (see Rule 4). This should only be done with the consent of the content custodian. Note that as well as improved tracking of revisions, some kinds of improvement may be supported better in the FAIR representation (see Rule 6) than on the legacy (print-based) platform, including specific relationships between terms, mappings to other vocabularies, and detailed axiomatization of definitions.

If revision of the vocabulary is by new releases of the vocabulary-as-a-whole, then status and version information will be in the vocabulary metadata (see Rule 7). If maintenance is continuous, then the per-term metadata should capture its status and version information (see Rule 6). Standard Dublin Core, SKOS and OWL properties that may be useful in versioning include:

  • dcterms:created—date or date-time that the vocabulary or term was initially created

  • dcterms:modified—date or date-time that the vocabulary or term was last updated

  • dcterms:isReplacedBy—to point to a superseding vocabulary or term

  • dcterms:replaces—to point to a prior version of a vocabulary or term

  • owl:deprecated = ‘true’ if the vocabulary or term is no longer valid

  • owl:priorVersion—to point to a previous version of a vocabulary

  • owl:versionInfo—general annotations relating to versioning

  • skos:changeNote—modifications to a term relative to prior versions

  • skos:historyNote—past state/use/meaning of a term
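As a sketch of how several of these properties might be combined, the following Python function writes a Turtle [21] description of a term that has been superseded. The function name, the example IRIs and the choice of properties are assumptions for illustration; a real vocabulary would normally be edited with an RDF toolkit rather than by string formatting.

```python
PREFIXES = """\
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
"""


def escape(text):
    """Escape backslashes, double quotes and newlines for a Turtle literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def deprecation_turtle(old_iri, new_iri, modified, note):
    """Describe old_iri as deprecated and replaced by new_iri.

    modified is an ISO 8601 date string such as "2021-06-16".
    """
    lines = [
        f"<{old_iri}> a skos:Concept ;",
        f'    dcterms:modified "{modified}"^^xsd:date ;',
        f"    dcterms:isReplacedBy <{new_iri}> ;",
        "    owl:deprecated true ;",
        f'    skos:historyNote "{escape(note)}"@en .',
    ]
    return PREFIXES + "\n" + "\n".join(lines) + "\n"
```

The superseding term would usually carry the reciprocal dcterms:replaces statement, so that the history can be followed in both directions.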

Do not re-assign or remove identifiers; they are persistently associated with the term to which they were originally assigned (Rule 5). If necessary, you can deprecate or retire an identifier. However, the IRI of every retired or superseded term, and of every previous version of the vocabulary, must remain de-referenceable, so that references to them still return a result, annotated with the status.
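The requirement that retired identifiers keep resolving can be sketched as a lookup in which records are never deleted, only annotated. The in-memory table, the example IRIs and the field names below are hypothetical; an actual service would read from the vocabulary store and return a negotiated representation rather than a Python dictionary.

```python
# Hypothetical term register: deprecated entries stay in place.
TERMS = {
    "https://example.org/voc/old": {
        "label": "old term",
        "status": "deprecated",
        "replaced_by": "https://example.org/voc/new",
    },
    "https://example.org/voc/new": {
        "label": "new term",
        "status": "current",
        "replaced_by": None,
    },
}


def resolve(iri, terms=TERMS):
    """Return (HTTP status, body) for a term IRI.

    Deprecated terms still answer 200, with their status and successor
    included; 404 is reserved for identifiers that were never assigned.
    """
    record = terms.get(iri)
    if record is None:
        return 404, {"error": "identifier was never assigned"}
    body = {"iri": iri, "label": record["label"], "status": record["status"]}
    if record["replaced_by"]:
        body["isReplacedBy"] = record["replaced_by"]
    return 200, body
```

Returning the successor alongside the status lets a client decide whether to follow the replacement or keep the historical meaning of the term.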

Terms that carry over between releases without the definition changing must retain the same IRI. If the IRI were changed, then datasets that use different versions of the same vocabulary could not interoperate. Consult with the content custodian to clarify the ‘identity-determining’ characteristics of terms, but note that changing relationships (e.g. position in a hierarchy) or the textual definition does not necessarily require changing the identifier (i.e. minting a new IRI), provided that the intention for the concept is still the same.

Alignment with other guidelines

We mentioned existing guidelines that focus primarily on the development of new vocabularies. In Table 3 we align our Ten Simple Rules with the recommendations and practices from two of these [2,3] as well as with the W3C Data on the Web Best Practices [35]. The alignment is only partial, as the other work goes into more detail on some topics, while some of the concerns discussed in our rules are not addressed in the other work.

Table 3. Alignment of the Ten Simple Rules with some other best practices.

| Ten Simple Rules | FAIRsFAIR [3] | Best practices—Garijo & Poveda [2] | Data on the Web Best Practice [35] |
| --- | --- | --- | --- |
| 1 Determine the governance arrangements and custodian of the legacy vocabulary | BP-Rec 7—Interact with the designated community and manage user-centric development; BP-Rec 9—The underlying logic of semantic artefacts should be accurately grounded on the domain it intends to describe | | 33—Provide Feedback to the Original Publisher |
| 2 Verify that the legacy-vocabulary license allows repurposing, and agree on the license for the FAIR vocabulary | P-Rec 16—The semantic artefact should be clearly licenced for machines and humans | | 4—Provide data license information; 34—Follow Licensing Terms |
| 3 Check term and definition completeness and consistency in the legacy vocabulary | BP-Rec 8—Provide a structured definition for each concept; BP-Rec 9—The underlying logic of semantic artefacts should be accurately grounded on the domain it intends to describe | | |
| 4 Establish a traceable maintenance-environment for the FAIR vocabulary content | BP-Rec 10—Define a set of governance policies for the semantic artefacts | | |
| 5 Assign a unique identifier to (a) the vocabulary and (b) each term in the vocabulary | P-Rec 1—Use Globally Unique, Persistent and Resolvable Identifier for Semantic Artefacts, their content and their versions; BP-Rec 1—Use a unique naming convention for concept/class and relations; BP-Rec 2—Use an Ontology Naming Convention | 2.1—Ontology name and prefix; 2.2—Hash versus slash URIs; 2.5—Using permanent URIs | 9—Use persistent URIs as identifiers of datasets; 10—Use persistent URIs as identifiers within datasets; 27—Preserve identifiers |
| 6 Create machine readable representations of the vocabulary terms | P-Rec 3—Use a common minimum metadata schema to describe semantic artefacts and their content; P-Rec 9—Semantic artefacts should be compliant with Semantic Web and Linked Data standards; P-Rec 11—Use a standardized description for complex logical relations; P-Rec 14—Use standard vocabularies to describe semantic artefacts; BP-Rec 3—Use defined ontology design patterns; BP-Rec 6—Harmonize the methodologies used to develop semantic artefacts; BP-Rec 8—Provide a structured definition for each concept | | 12—Use machine-readable standardized data formats; 16—Choose the right formalization level |
| 7 Add vocabulary metadata | P-Rec 3—Use a common minimum metadata schema to describe semantic artefacts and their content; P-Rec 17—Provenance should be clear for both humans and machines | 3.1—Ontology Metadata | 1—Provide metadata; 2—Provide descriptive metadata; 5—Provide data provenance information; 7—Provide a version indicator; 8—Provide version history; 35—Cite the Original Publication |
| 8 Register the vocabulary | P-Rec 4—Publish the Semantic Artefact and its content in a semantic repository | 4.2—Making an Ontology Findable on the Web | |
| 9 Make the vocabulary accessible for humans and machines | P-Rec 4—Publish the Semantic Artefact and its content in a semantic repository; P-Rec 5—Semantic repositories should offer a common API to access Semantic Artefacts and their content in various serializations for both use/reuse and indexation by search engines | 3.2—Creating a Human-Readable Documentation; 3.3—Ontology visualization; 4.1—Ontology Accessibility in Multiple Interoperable Formats | 17—Provide bulk download; 19—Use content negotiation for serving data available in multiple formats; 20—Provide real-time access; 23—Make data available through an API; 24—Use Web Standards as the foundation of APIs; 32—Provide Complementary Presentations |
| 10 Implement a process for publishing revisions of the FAIR vocabulary | P-Rec 8—Define human and machine-readable persistency policies for metadata; BP-Rec 7—Interact with the designated community and manage user-centric development; BP-Rec 9—The underlying logic of semantic artefacts should be accurately grounded on the domain it intends to describe; BP-Rec 10—Define a set of governance policies for the semantic artefacts | 2.4—Ontology versioning; 2.5—Using permanent URIs | 21—Provide data up to date; 10—Use persistent URIs as identifiers within datasets; 27—Preserve identifiers; 33—Provide Feedback to the Original Publisher |

Note that our Ten Simple Rules are ordered in a natural implementation workflow for the primary scenario, i.e. the conversion of existing vocabularies. This means that some recommendations that are grouped together in other guidelines are separated here. The sequence of Ten Simple Rules is designed for a specific audience, i.e. people assisting domain specialists, neither of whom are semantics or web specialists.

Summary and conclusion

We have presented ten simple rules that support converting a legacy vocabulary—a list of terms available in a print-based glossary or table not accessible using web standards—into a FAIR vocabulary. Various pathways may be followed to publish the FAIR vocabulary, but we emphasise particularly the goal of providing a distinct IRI for each term or concept. A standard representation of the concept should be returned when the individual IRI is de-referenced, using SKOS or OWL serialised in an RDF-based representation for machine-interchange, or in a web-page for human consumption. Guidelines for vocabulary and term metadata are provided, as well as development and maintenance considerations.

By following these rules you can achieve the outcome of converting a legacy vocabulary into a standalone FAIR vocabulary, which can be used for unambiguous data annotation. In turn, this increases data interoperability and enables data integration, which is essential for addressing global challenges such as environmental sustainability, and pandemic and natural disaster response. A set of examples illustrating the application of these rules is provided as supplementary material at https://fairvocabularies.github.io/examples/. These include environmental definitions that are needed to cover some of the data integration challenges that we referred to in the introduction.

Further steps towards broader interoperability that may be considered, but are beyond the scope of this paper, include:

  • relationships to terms and definitions in other FAIR vocabularies

  • patterns for re-use of terms from and subsets of existing FAIR vocabularies

  • supplementation of generic SKOS/OWL encoding with domain-based elements and axiomatization (see examples in the supplementary material)

  • rules for maintenance (expanding on Rules 1, 4 & 10)

These will be addressed in future guidelines.

Acknowledgments

We thank CODATA (https://codata.org) and the DDI Alliance (https://ddialliance.org/), who organised a Workshop on Cross-domain Metadata at Schloss Dagstuhl in October 2019, where this work was initiated. The FAIR vocabulary practices activity which triggered the preparation of this guideline initially also involved Pier Luigi Buttigieg, Niklas Kolbe, and Dan Brickley.

Data Availability

All relevant data are within the manuscript.

Funding Statement

The contribution of SJDC was supported through a CSIRO Strategic Project for engagement with CODATA. The contribution of BM was supported through eLTERplus, a project funded from the INFRAIA-01-2018-2019 programme of the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871128, and through OBARIS, an FFG-funded project (No 887389). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. McGuinness D. Ontologies Come of Age. In: Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential [Internet]. Wadern, Germany; 2003. p. 171–94. Available from: https://www.researchgate.net/publication/221024668_Ontologies_Come_of_Age
  • 2. Garijo D, Poveda-Villalón M. Best Practices for Implementing FAIR Vocabularies and Ontologies on the Web. In: Cota G, Daquino M, Pozzato GL, editors. Studies on the Semantic Web [Internet]. IOS Press; 2020 [cited 2021 Jan 7]. Available from: http://ebooks.iospress.nl/doi/10.3233/SSW200034
  • 3. Le Franc Y, Parland-von Essen J, Bonino L, Lehväslaiho H, Coen G, Staiger C. D2.2 FAIR Semantics: First recommendations [Internet]. Zenodo; 2020 Mar [cited 2020 Oct 15]. Available from: https://zenodo.org/record/3707984
  • 4. Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016 Mar 15;3(1):160018. doi: 10.1038/sdata.2016.18
  • 5. Poveda-Villalón M, Espinoza-Arias P, Garijo D, Corcho O. Coming to Terms with FAIR Ontologies. In: Keet CM, Dumontier M, editors. Knowledge Engineering and Knowledge Management [Internet]. Cham: Springer International Publishing; 2020 [cited 2021 Jan 28]. p. 255–70. (Lecture Notes in Computer Science; vol. 12387). Available from: http://link.springer.com/10.1007/978-3-030-61244-3_18
  • 6. Martin P, Magagna B, Liao X, Zhao Z. Semantic Linking of Research Infrastructure Metadata. In: Zhao Z, Hellström M, editors. Towards Interoperable Research Infrastructures for Environmental and Earth Sciences: A Reference Model Guided Approach for Common Challenges [Internet]. Cham: Springer International Publishing; 2020 [cited 2020 Oct 23]. p. 226–46. (Lecture Notes in Computer Science). doi: 10.1007/978-3-030-52829-4_13
  • 7. Courtot M, Malone J, Mungall CJ. Ten simple rules for biomedical ontology development. In: Proceedings of the Joint International Conference on Biological Ontology and BioCreative [Internet]. Corvallis, Oregon, US: CEUR Workshop Proceedings; 2016. p. 4. Available from: http://ceur-ws.org/Vol-1747/IT404_ICBO2016.pdf
  • 8. Malone J, Stevens R, Jupp S, Hancocks T, Parkinson H, Brooksbank C. Ten Simple Rules for Selecting a Bio-ontology. PLOS Comput Biol. 2016 Feb 11;12(2):e1004743. doi: 10.1371/journal.pcbi.1004743
  • 9. McMurry JA, Juty N, Blomberg N, Burdett T, Conlin T, Conte N, et al. Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLOS Biol. 2017 Jun 29;15(6):e2001414. doi: 10.1371/journal.pbio.2001414
  • 10. Berrueta D, Phipps J. Best Practice Recipes for Publishing RDF Vocabularies [Internet]. Cambridge, Mass. USA: World Wide Web Consortium; 2008. Available from: http://www.w3.org/TR/swbp-vocab-pub/
  • 11. Isaac A, Summers E. SKOS Simple Knowledge Organization System Primer [Internet]. World Wide Web Consortium; 2009 [cited 2020 Oct 23]. Available from: https://www.w3.org/TR/skos-primer/
  • 12. Miles A, Bechhofer S. SKOS Simple Knowledge Organization System Reference [Internet]. Cambridge, Mass. USA: World Wide Web Consortium; 2009. Available from: http://www.w3.org/TR/skos-reference/
  • 13. Baker T, Bechhofer S, Isaac A, Miles A, Schreiber G, Summers E. Key choices in the design of Simple Knowledge Organization System (SKOS). J Web Semant. 2013 May 1;20:35–49.
  • 14. W3C OWL Working Group. OWL 2 Web Ontology Language Document Overview (Second Edition) [Internet]. W3C Recommendation. Cambridge, Mass. USA: World Wide Web Consortium; 2012. Available from: http://www.w3.org/TR/owl2-overview/
  • 15. Hitzler P, Krötzsch M, Parsia B, Patel-Schneider PF, Rudolph S. OWL 2 Web Ontology Language Primer (Second Edition) [Internet]. World Wide Web Consortium; 2012 [cited 2020 Oct 23]. Available from: https://www.w3.org/TR/owl-primer/
  • 16. Kunze J, Baker T. The Dublin Core Metadata Element Set [Internet]. Vol. 5013, IETF RFC. Internet Engineering Task Force; 2007 [cited 2014 Mar 30]. Available from: http://dublincore.org/documents/dces/ http://www.ietf.org/rfc/rfc5013.txt
  • 17. DCMI Usage Board. DCMI Metadata Terms [Internet]. 2020 [cited 2020 Oct 23]. Available from: https://dublincore.org/specifications/dublin-core/dcmi-terms/
  • 18. Bechhofer S, Miles A. Using OWL and SKOS [Internet]. 2008 [cited 2020 Oct 23]. Available from: https://www.w3.org/2006/07/SWD/SKOS/skos-and-owl/master.html
  • 19. Noy NF, McGuinness DL. Ontology Development 101: A Guide to Creating Your First Ontology [Internet]. Available from: https://protegewiki.stanford.edu/wiki/Ontology101
  • 20. Cyganiak R, Wood D, Lanthaler M. RDF 1.1 Concepts and Abstract Syntax [Internet]. W3C Recommendation. 2014. Available from: https://www.w3.org/TR/rdf11-concepts/
  • 21. Beckett D, Berners-Lee T, Prud’hommeaux E, Carothers G. RDF 1.1 Turtle [Internet]. W3C Recommendation. World Wide Web Consortium; 2014. Available from: https://www.w3.org/TR/turtle/
  • 22. Xiang Z, Zheng J, Lin Y, He Y. Ontorat: automatic generation of new ontology terms, annotations, and axioms based on ontology design patterns. J Biomed Semant. 2015 Jan 9;6(1):4.
  • 23. Jackson RC, Balhoff JP, Douglass E, Harris NL, Mungall CJ, Overton JA. ROBOT: A Tool for Automating Ontology Workflows. BMC Bioinformatics. 2019 Jul 29;20(1):407. doi: 10.1186/s12859-019-3002-3
  • 24. Albertoni R, Browning D, Cox SJD, Gonzalez-Beltran A, Perego A, Winstanley P. Data Catalog Vocabulary (DCAT)—Version 2 [Internet]. World Wide Web Consortium; 2020 [cited 2020 Oct 23]. Available from: https://www.w3.org/TR/vocab-dcat/
  • 25. Vandenbussche P-Y, Vatant B. Metadata Recommendations For Linked Open Data Vocabularies [Internet]. 2012 [cited 2020 Oct 27]. p. 4. Available from: https://lov.linkeddata.es/Recommendations_Vocabulary_Design.pdf
  • 26. Hartmann J, Palma R, Sure Y, Suárez-Figueroa MC, Haase P, Gómez-Pérez A, et al. Ontology Metadata Vocabulary and Applications. In: Meersman R, Tari Z, Herrero P, editors. On the Move to Meaningful Internet Systems 2005: OTM 2005 Workshops. Berlin, Heidelberg: Springer; 2005. p. 906–15. (Lecture Notes in Computer Science).
  • 27. Dutta B, Toulet A, Emonet V, Jonquet C. New Generation Metadata Vocabulary for Ontology Description and Publication. In: Garoufallou E, Virkus S, Siatri R, Koutsomiha D, editors. Metadata and Semantic Research. Cham: Springer International Publishing; 2017. p. 173–85. (Communications in Computer and Information Science).
  • 28. Vandenbussche P-Y, Atemezing GA, Poveda-Villalón M, Vatant B. Linked Open Vocabularies (LOV): A gateway to reusable semantic vocabularies on the Web. Semantic Web. 2016;8(3):437–52.
  • 29. Svensson L. Indicating, Discovering, Negotiating, and Writing Profiled Representations [Internet]. IETF; 2020 Apr [cited 2020 Oct 23]. Available from: https://profilenegotiation.github.io/I-D-Profile-Negotiation/I-D-Profile-Negotiation
  • 30. Garijo D. WIDOCO: A Wizard for Documenting Ontologies. In: d’Amato C, Fernandez M, Tamma V, Lecue F, Cudré-Mauroux P, Sequeda J, et al., editors. The Semantic Web–ISWC 2017. Cham: Springer International Publishing; 2017. p. 94–102.
  • 31. Feigenbaum L, Williams GT, Clark KG, Torres E. SPARQL 1.1 Protocol [Internet]. World Wide Web Consortium; 2013 [cited 2020 Oct 23]. Available from: https://www.w3.org/TR/sparql11-protocol/
  • 32. Harris S, Seaborne A. SPARQL 1.1 Query Language [Internet]. World Wide Web Consortium; 2013 [cited 2020 Oct 23]. Available from: https://www.w3.org/TR/sparql11-query/
  • 33. Gearon P, Passant A, Polleres A. SPARQL 1.1 Update [Internet]. World Wide Web Consortium; 2013 [cited 2020 Oct 23]. Available from: https://www.w3.org/TR/sparql11-update/
  • 34. Cox SJD, Yu J, Rankine T. SISSVoc: A Linked Data API for access to SKOS vocabularies. Semantic Web J. 2016;7(1):9–24.
  • 35. Lóscio BF, Burle C, Calegari N. Data on the Web Best Practices [Internet]. World Wide Web Consortium; 2017 Jan [cited 2020 Oct 28]. Available from: https://www.w3.org/TR/dwbp/
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009041.r001

Decision Letter 0

Scott Markel

31 Dec 2020

Dear Dr Cox,

Thank you very much for submitting your manuscript "Ten Simple Rules for making a vocabulary FAIR" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Scott Markel, Ph.D.

Ten Simple Rules Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Two Documents attached:

1. pdf of manuscript with comments using Acrobat

2. pdf of document with comments and suggestions (CoxEtAl10RulesVocabularySMR.pdf)

Reviewer #2: The authors present here a succinct and comprehensive set of advice for both encoding vocabularies in Semantic Web standards and for making vocabularies fit well into the FAIR principles for Data Management. Having been responsible for publishing a number of controlled vocabularies, I find their steps here to be a well laid out Best Practice. As such this is a timely paper and the authors are to be congratulated for their input and presentation. I have only one very minor comment, which is that on lines 42, 94 and 95 they render the phrase "linked data" and "linked-data", which I would more normally expect to see as "Linked Data".

Once again - thank you for the opportunity to review this well written paper.

Reviewer #3: This paper proposes a set of 10 rules to provide FAIR vocabularies. The work is interesting and worth being published as FAIR data and related initiatives are gaining momentum. In this sense, the first recommendation for improving the paper would be to align the proposed rules to the previous works as "Best Practices for Implementing FAIR Vocabularies and Ontologies on the Web" (http://ebooks.iospress.nl/volumearticle/56005 and arxiv version https://arxiv.org/abs/2003.13084) and the FAIRsFAIR D2.2 FAIR Semantics: First recommendations (https://zenodo.org/record/3707985). Authors may also be interested in the special issue https://www.mitpressjournals.org/doi/full/10.1162/dint_e_00023

Other important comments to enhance the current paper are:

Line 171: What do you mean by "Resource Description Framework (RDF) vocabularies"? The references mix the RDF and RDFS recommendations with Dublin Core. RDF is the data model for the web, RDFS is the language for vocabularies and OWL is a language for ontologies. Finally, Dublin Core is a controlled vocabulary that happens to be implemented in RDFS and in OWL (depending on the elements or terms). Please clarify these concepts. It would also be advisable to clarify in general that OWL is the ontology language and SKOS is an OWL ontology to represent thesauri, not a particular language itself, as in general lines 175 to 187 might be a bit misleading, in particular lines 185-187, as the constructs that are "missed" in OWL are actually built in SKOS based on OWL and the reasoning features are defined in OWL.

Regarding the table of steps for SKOS and OWL conversion, I would suggest keeping it just for SKOS, as there are several points in the OWL column. In general I feel that that column could lead to conceptual errors for readers, especially those who are not experts in knowledge representation, as it seems to oversimplify a great amount of work on ontological engineering. While the steps for converting to SKOS are clear and could be automated, for building an ontology there are a number of considerations that should be made, and given the type of publication it would be more suitable to keep it to SKOS rather than entering into ontological engineering work. The points about the table are:

-- "Identify terms": "or as an instance of an owl:Class if it is the most specific concept in the vocabulary " --> not always, one may want to have all as classes and create individual for the last level. It depends on the use.

-- "Encode term labels and synonyms" --> In general, it is a bad practice to use the owl:equivalentClass construct to define synonyms, they could be added as several labels to a given class or with other annotations but not creating N classes in the same ontology unless there is a good reason for that.

-- "Define the hierarchy of terms" --> transforming broader/narrower relationships to rdfs:subClassOf would lead to semantic errors and reasoning issues. Note that in a thesaurus we can have PC - narrower - mouse. If we make mouse a subclass of PC, it is a semantic inconsistency, as individuals of mouse will be classified as individuals of PC. In these cases a transformation is not that simple, and if one wants to transform a thesaurus into an ontology there are some modelling decisions to be taken.

-- minor: rdfs:comment is used in steps 3 and 5 for different purposes, the original type of annotation would be lost.

-- "Add per-item metadata": for the case of ontologies the LOV, MOD https://github.com/sifrproject/MOD-Ontology and Widoco guidelines could be included but as said, I would suggest keeping the table for SKOS.

Other comments are:

In rule 8 LOV should be added as registry as it is widely used by many communities and Ontobee.

Line 32 and 61: the term glossary seems to be used as synonym for controlled vocabularies and could be a bit inaccurate. I'd suggest using and referring to the ontology spectrum by McGuinness http://www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-mit-press-(with-citation).htm

Lines 61 to 65: I don't see why those examples "may not be initially recognised as vocabularies": if word-lists are vocabularies, lists of terms which are units of measure are too.

Minor comments:

Line 73 to 76: the list of digital documents is too narrow, it might include relational databases, XML files, etc.

Line 81: ".." remove one "."

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Stephen M Richard

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

Attachment

Submitted filename: CoxEtAl10RulesVocabularySMR.pdf

Attachment

Submitted filename: PCOMPBIOL-D-20-02105_reviewerCoxEtAlVocabulariesOnTheWeb.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009041.r003

Decision Letter 1

Scott Markel

8 Apr 2021

Dear Dr Cox,

Thank you very much for submitting your manuscript "Ten Simple Rules for making a vocabulary FAIR" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Scott Markel, Ph.D.

Ten Simple Rules Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The revised version is excellent. Thanks for putting this together!

Reviewer #2: This revision looks excellent. I recommend it for publication.

Reviewer #3: Thank you very much for your detailed comments and the effort improving the paper.

There are substantial changes in the manuscript addressing the reviewers' comments. I think the paper has improved and it is actually a good starting point for converting legacy terminologies to FAIR vocabularies.

I only have some small comments:

Line 73: is there an "and" or "or" missing before "(b)"?

Lines 226 to 229: It is common to use # for small vocabularies or ontologies, so it would be better to have that mentioned. I understand it is different if you are transforming a huge KOS.

SKOS and OWL table, item "Define the hierarchy of terms" in the OWL column: "A narrower concept or subclass" --> I would avoid mentioning "narrower" here; maybe "A more specific concept or subclass... ". Also, I would add a clarification stating that in OWL a subclass denotes a subset of elements, which is not exactly the same as "narrower/broader" in SKOS.

SKOS and OWL table, item "Define the whole vocabulary", regarding the sentence "Every member term should have a rdfs:isDefinedBy relationship to the ontology" --> Could this be optional when the element URI is defined in the same namespace as the ontology URI?

Regarding the options to transform vocabularies to SKOS or OWL, I find OpenRefine very useful, although I am not sure how easy it is to use for non-experts in RDF or semantic technologies.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Stephen M Richard

Reviewer #2: No

Reviewer #3: No

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #3: Yes

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009041.r005

Decision Letter 2

Scott Markel

4 May 2021

Dear Dr Cox,

We are pleased to inform you that your manuscript 'Ten Simple Rules for making a vocabulary FAIR' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Scott Markel, Ph.D.

Ten Simple Rules Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009041.r006

Acceptance letter

Scott Markel

11 Jun 2021

PCOMPBIOL-D-20-02105R2

Ten Simple Rules for making a vocabulary FAIR

Dear Dr Cox,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Katalin Szabo

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: CoxEtAl10RulesVocabularySMR.pdf

    Attachment

    Submitted filename: PCOMPBIOL-D-20-02105_reviewerCoxEtAlVocabulariesOnTheWeb.pdf

    Attachment

    Submitted filename: ResponsesToReviewersTenSimpleRulesMakingVocabularyFAIR.docx

    Attachment

    Submitted filename: Ten simple rules #1 Response to second review.docx

    Data Availability Statement

    All relevant data are within the manuscript.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
