CovidPubGraph: A FAIR Knowledge Graph of COVID-19 Publications

Svetlana Pestryakova; Daniel Vollmers; Mohamed Ahmed Sherif; Stefan Heindorf; Muhammad Saleem; Diego Moussallem; Axel-Cyrille Ngonga Ngomo

doi:10.1038/s41597-022-01298-2

Sci Data. 2022 Jul 8;9:389. doi: 10.1038/s41597-022-01298-2


PMCID: PMC9263802 PMID: 35803947

Abstract

The rapid generation of large amounts of information about the coronavirus SARS-CoV-2 and the disease COVID-19 makes it increasingly difficult to gain a comprehensive overview of current insights related to the disease. With this work, we aim to support the rapid access to a comprehensive data source on COVID-19 targeted especially at researchers. Our knowledge graph, CovidPubGraph, an RDF knowledge graph of scientific publications, abides by the Linked Data and FAIR principles. The base dataset for the extraction is CORD-19, a dataset of COVID-19-related publications, which is updated regularly. Consequently, CovidPubGraph is updated biweekly. Our generation pipeline applies named entity recognition, entity linking and link discovery approaches to the original data. The current version of CovidPubGraph contains 268,108,670 triples and is linked to 9 other datasets by over 1 million links. In our use case studies, we demonstrate the usefulness of our knowledge graph for different applications. CovidPubGraph is publicly available under the Creative Commons Attribution 4.0 International license.

Subject terms: Medical research, Diseases, Research data

Measurement(s)	COVID-19-related publications
Technology Type(s)	named entity recognition, entity linking and link discovery approaches


Background & Summary

The number of papers pertaining to SARS-CoV-2 and COVID-19 has surged over the last few months, making it hard to keep track of the latest research findings on the subject matter. Hence, the Allen Institute initiated a growing corpus of publications about COVID-19 called CORD-19¹, which is updated on a regular basis. While the CORD-19 dataset provides the extracted full texts and corresponding licenses, it is still difficult to consume for end users and applications. For example, the data is available as one download (see https://www.semanticscholar.org/cord19/download). Hence, users first need to download the dataset and carry out some processing (e.g., some form of information retrieval) to get the information they desire. The integration of insights from different sources, which is of central importance in scientific research, cannot be carried out on the dataset directly. Moreover, the data being available in textual form makes it difficult to query using a structured query language such as SQL or SPARQL.

A growing number of research labs are hence building upon CORD-19 to make the data more amenable to automated processing. Table 1 gives an overview of existing datasets pertaining to COVID-19. Some datasets such as Wikidata Scholia only contain a small subset of the publications available as CORD-19. Other knowledge graphs about COVID-19 focus exclusively on case statistics instead of scientific publications (e.g., Covid-19 by STKO Lab) or text mining on the CORD-19 dataset without providing much information about the content present in the publications (e.g., Covid19-KG by Blender Lab and Cord-19-on-FHIR). Our goal differs from that of other COVID-19-related datasets: We aim to provide a comprehensive RDF representation of the CORD-19 data and include Natural Language Processing (NLP) results on the data to facilitate the development of intelligent search engines, domain-specific conversational AIs and structured machine learning solutions for COVID-19.

Table 1.

Overview of COVID-19 datasets.

Dataset	Format	Endpoint	Publ.	Base
CovidPubGraph (DICE Lab) (publications, links to DrugBank, Sider, Kegg, Cord19-NEKG, LitCovid, …)	rdf	LodView	160,271	CORD-19¹
Cord19-NEKG⁹ (Wimmics) (publications, links to DBpedia, Wikidata and BioPortal)	rdf	Virtuoso	111,256	CORD-19¹
Covid-19-Literature¹⁰ (IDLab) (publications, links to DBpedia)	rdf	Download	40,750	CORD-19¹
Wikidata Scholia²⁴ (publications)	json/csv	WDQS
Covid19-KG¹² (Blender Lab) (genes, diseases, chemicals, organisms)	csv	Download	0	CORD-19¹
Cord-19-on-FHIR¹³ (conditions, medications, procedures)	rdf	GraphDB	0	CORD-19¹
Covid-19²¹ (STKO Lab) (case statistics by region)	rdf	GraphDB	0	JHU^30,31


In this paper, we present CovidPubGraph, a comprehensive RDF knowledge graph of COVID-19 based on CORD-19. Our dataset follows the Linked Data lifecycle². We provide a detailed representation of the COVID-19 publications in RDF including properties like publication title, authors names and their institutions, paper sections (e.g., abstract, introduction, body, discussion, etc.) and annotated references (e.g., references to figures). Resources such as authors and named entities augment the original data and make it easier to process for the sake of question answering and machine learning. All resources in the dataset are dereferenceable HTTP IRIs, which can be accessed via LodView (https://lodview.it/) or via the dataset’s SPARQL endpoint (https://covid-19ds.data.dice-research.org/sparql/). In addition, we link our dataset to the biomedical entities in other relevant datasets (e.g., DrugBank, Sider, Kegg).

Our knowledge graph also abides by the FAIR principles³: It is findable by virtue of being annotated with rich metadata and indexable by search engines. We make it accessible by providing our data via an RDF dump download (https://hobbitdata.informatik.uni-leipzig.de/COVID19DS/archive/), a SPARQL endpoint as well as dereferenceable individual resources. For example, see https://covid-19ds.data.dice-research.org/resource/4bf4b71883a26d15dcc13b2800ec470b99764956. We make it interoperable by employing standard vocabularies, e.g., for authors, papers, and sections within papers, as well as through the aforementioned links to 9 knowledge graphs including Cord19-NEKG, Cord-19-on-FHIR as well as Covid-19-Literature (see Table 2). We make it reusable by associating the data with clear provenance and licensing information as well as by reusing popular vocabularies such as NIF and Fabio ourselves.
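As an illustration of the access paths described above, the following minimal sketch builds (but does not send) an HTTP request against the SPARQL endpoint. It assumes the endpoint follows the standard SPARQL 1.1 Protocol with a `query` parameter and accepts the JSON results media type; the helper name `build_sparql_request` is hypothetical, and whether the endpoint is still reachable is not verified here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint URL as stated in the text.
ENDPOINT = "https://covid-19ds.data.dice-research.org/sparql"


def build_sparql_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build (without sending) a SPARQL 1.1 Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format (assumed to be supported).
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Example: a generic query that fetches five arbitrary triples.
request = build_sparql_request("SELECT * WHERE { ?s ?p ?o } LIMIT 5")
```

Passing `request` to `urllib.request.urlopen` would execute the query; that call is left out so that the sketch has no network dependency.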

Table 2.

External datasets linking statistics.

Dataset	#Links	Predicate	Link classes
Cord19-NEKG¹	160,271	owl:sameAs	Publications
LitCovid²	143,840	owl:sameAs	Publications
Covid-19-Literature³	160,271	owl:sameAs	Publications
Cord-19-on-FHIR⁴	160,271	owl:sameAs	Publications
Cord-19-on-FHIR	160,271	rdfs:seeAlso	Publications
Makg⁵	160,271	owl:sameAs	Publications
Makg	22,885	owl:sameAs	Authors
Makg	6,589	owl:sameAs	Institutions
Kegg⁶	202,482	itsrdf:taIdentRef	Named entities
Sider⁷	41,741	itsrdf:taIdentRef	Named entities
DrugBank⁸	78,969	itsrdf:taIdentRef	Named entities
Total number of links	1,297,861


¹http://ns.inria.fr/covid19/.

²https://www.ncbi.nlm.nih.gov/pmc/articles/.

³https://data.linkeddatafragments.org/covid19.

⁴https://fhircat.org/cord-19/fhir/.
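The total in Table 2 can be checked by summing the individual rows. The sketch below copies the link counts from the table and aggregates them overall and per link class; the variable and function names are illustrative only.

```python
from collections import defaultdict

# (dataset, number of links, predicate, link class), copied from Table 2.
TABLE_2 = [
    ("Cord19-NEKG", 160_271, "owl:sameAs", "Publications"),
    ("LitCovid", 143_840, "owl:sameAs", "Publications"),
    ("Covid-19-Literature", 160_271, "owl:sameAs", "Publications"),
    ("Cord-19-on-FHIR", 160_271, "owl:sameAs", "Publications"),
    ("Cord-19-on-FHIR", 160_271, "rdfs:seeAlso", "Publications"),
    ("Makg", 160_271, "owl:sameAs", "Publications"),
    ("Makg", 22_885, "owl:sameAs", "Authors"),
    ("Makg", 6_589, "owl:sameAs", "Institutions"),
    ("Kegg", 202_482, "itsrdf:taIdentRef", "Named entities"),
    ("Sider", 41_741, "itsrdf:taIdentRef", "Named entities"),
    ("DrugBank", 78_969, "itsrdf:taIdentRef", "Named entities"),
]


def links_by_class(rows):
    """Sum the link counts per link class."""
    totals = defaultdict(int)
    for _dataset, count, _predicate, link_class in rows:
        totals[link_class] += count
    return dict(totals)


total_links = sum(count for _, count, _, _ in TABLE_2)
per_class = links_by_class(TABLE_2)
```

The sum of the rows reproduces the reported total of 1,297,861 links.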

Potential use cases of our knowledge graph include:

Finding papers about certain biomedical entities, e.g., drugs, side effects, genes, or proteins.
Discovering links between specific genome subsequences and drugs.
Training explainable machine learning models by running structured machine learning on selected named entities (e.g., drug names) to find similar drugs for clinical trials. The models can be trained with DL-Learner⁴, EvoLearner⁵, or DRILL⁶ and they learn class expressions in description logics based on the publication graph (e.g., drugs investigated by similar authors or in similar articles). The class expressions are comprehensible by domain experts.
Supporting scientometric research on various aspects related to COVID-19 publications, such as international collaboration trends⁷ and peer review trends⁸, which would be informative for policy-makers and the scientific community.

Methods

Knowledge graphs in the field of COVID-19 can be grouped by the topics they cover: publications, biomedical entities, and case statistics.

Knowledge graphs of publications

Most knowledge graphs of COVID-19 publications are based on the COVID-19 Open Research Dataset (CORD-19) by the Allen Institute¹. The CORD-19 dataset is based on papers and preprints from Semantic Scholar. Papers in CORD-19 are sourced from PubMedCentral (PMC), PubMed, the World Health Organization’s Covid-19 Database, and the preprint servers bioRxiv, medRxiv, and arXiv¹. While CORD-19 contains the full texts of scientific publications, it does not adhere to the FAIR principles³, e.g., it is only available via download and does not use common vocabularies. The two knowledge graphs most closely related to ours are Cord19-NEKG⁹ and Covid-19-Literature¹⁰. However, neither of them provides comprehensive metadata about the publications, and neither provides fine-grained information pertaining to the publications (e.g., section information). An alternative to CORD-19 is the Lens dataset on COVID-19¹¹. Lens contains metadata about scientific publications on COVID-19. However, it is only available as one big download (in JSON format). The Covidgraph project (https://covidgraph.org/) aims to utilize the dataset. However, at the time of writing, the knowledge graph proposed by that project has not been released yet, making it hard to compare it to other knowledge graphs. To enable interoperability, we link our dataset to other datasets such as Cord19-NEKG.

Knowledge graphs of biomedical entities

Most works utilizing CORD-19 focus on extracting named entities^9,10,12,13 such as genes, drugs, and proteins and linking them to existing knowledge bases such as DBpedia. To do so, established tools such as DBpedia Spotlight¹⁴ and Entity Fishing (Wikidata) (https://github.com/kermitt2/entity-fishing/) are used. Alternatively, novel tools for recognizing biomedical entities in CORD-19 have also been developed^15,16. Also noteworthy is the work by Zhou et al.¹⁷, which proposes a network of genes, proteins, and viruses. The network is based on pre-existing biomedical databases (e.g., DrugBank, Therapeutic Target Database, and BindingDB) and does not cover the latest research findings. Still, such biomedical knowledge graphs might be employed to identify promising treatment options such as repurposing existing drugs or developing novel drugs regardless of the underlying construction methodology. We perform named entity recognition on CORD-19 and link the discovered entities to other biomedical RDF databases such as DrugBank¹⁸ (drugs), Sider¹⁹ (side effects), and Kegg²⁰ (genes), thus making our dataset more amenable to tasks such as machine learning based on entities.

Knowledge graphs of case statistics

Another class of knowledge graphs focuses on the case statistics of COVID-19²¹, e.g., subdivided by region and based on the dashboard data of Johns Hopkins University.

Data Records

RDF data model design

The ontology behind our knowledge graph was derived from the source from which it was extracted, i.e., the full-texts of publications provided as part of the CORD-19 dataset. The ontology was designed to enable search, question answering and machine learning. At the time of writing, our dataset is based on CORD-19 version 2021-11-08 (https://www.semanticscholar.org/cord19/download). Our conversion process is implemented in Python 3.6 with RDFLib 5.0.0 (https://github.com/RDFLib/rdflib). We make our source code publicly available (https://github.com/dice-group/COVID19DS) to ensure the reproducibility of our results and the rapid conversion of novel CORD-19 versions. One version of the generated RDF dataset can be found at Zenodo²².

Listing 1.

List of all used vocabularies in CovidPubGraph.

@prefix cvdr: <https://covid-19ds.data.dice-research.org/resource/> .
@prefix cvdo: <https://covid-19ds.data.dice-research.org/ontology/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix bibtex: <http://purl.org/net/nknouf/ns/bibtex#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix fabio: <http://purl.org/spar/fabio/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix its: <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix sdo: <http://salt.semanticauthoring.org/ontologies/sdo#> .
@prefix swc: <http://data.semanticweb.org/ns/swc/ontology#> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix inria: <http://ns.inria.fr/covid19/> .
@prefix ncbi: <https://www.ncbi.nlm.nih.gov/pmc/articles/> .
@prefix pubnt: <http://pubannotation.org/docs/sourcedb/CORD-19/sourceid/> .
@prefix ldf: <https://data.linkeddatafragments.org/> .
@prefix fccc: <https://fhircat.org/cord-19/fhir/Commercial/Composition/> .
@prefix makg: <http://ma-graph.org/property/> .
@prefix dbo: <https://dbpedia.org/ontology/> .

RDF namespaces

To facilitate the reusability of our knowledge graph, we represent our data in widely used vocabularies and namespaces as shown in Listing 1.

RDF data model

Figure 1 shows important classes (e.g., papers, authors, sections, bibliographic entries, and named entities) as well as predicates (e.g., first name, last name, license).

Fig. 1 — UML class diagram of the CovidPubGraph Ontology.

Papers

We represent bibliographic information of papers using four vocabularies: bibo, bibtex, fabio, and schema (see namespaces above). Important attributes include the title, PMID, DOI, publication date, publisher, publisher URI, license and authors. For each paper, we store provenance information, namely a reference to the original CORD-19 raw files as well as the time at which we generated the resource. The URIs of our generated Paper resources follow the format https://covid-19ds.data.dice-research.org/resource/<paperId> where <paperId> is the unique paper id within the CORD-19 dataset. An example resource is given in Listing 2.

Authors

Authors are represented in FOAF (http://xmlns.com/foaf/spec/). Important attributes include the first, middle, and last names as well as mail addresses and institutions.

Sections

Papers are further subdivided by section and the corresponding information is expressed in the SALT ontology²³. We keep track of a set of predefined sections including Abstract, Introduction, Background, Related Work, Preliminaries, Conclusion, Experiment and Discussion. In case another section heading appears in the paper, we assign it to the default section Body. We further subdivide a section using cvdo:hasSection. An example is given in Listing 3.
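The assignment of headings to predefined sections can be sketched as follows. The list of section names is taken from the text; the matching rule (case-insensitive, whitespace-normalized exact match) and the function name `classify_section` are assumptions made for illustration and may differ from the actual conversion code.

```python
# Predefined sections named in the text; anything else falls back to "Body".
PREDEFINED_SECTIONS = (
    "Abstract", "Introduction", "Background", "Related Work",
    "Preliminaries", "Conclusion", "Experiment", "Discussion",
)
DEFAULT_SECTION = "Body"


def classify_section(heading: str) -> str:
    """Map a raw section heading to a predefined section or to the default."""
    # Assumed matching rule: ignore case and collapse runs of whitespace.
    normalized = " ".join(heading.lower().split())
    for name in PREDEFINED_SECTIONS:
        if normalized == name.lower():
            return name
    return DEFAULT_SECTION


labels = [classify_section(h) for h in ("INTRODUCTION", "Materials and Methods")]
```

A fuzzier rule (for example, prefix matching to catch "Conclusions") would be a natural variation, but nothing in the text indicates which rule is used.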

References

References to other sections, figures and tables in the text are resolved and stored as RDF using Bibref. Important attributes are the anchor of the reference (e.g., the number of the section, figure, or table), its source string in the text (nif:referenceContext) along with its position in the text (nif:beginIndex, nif:endIndex) as well as the referenced object (its:taIdentRef) which might be a paper (BibEntry), a figure (Figure), or a table (Table).

Listing 2.

Example paper resource.

Named entities

As machine learning and question answering often rely on named entities and their locations in texts, we annotate CORD-19 papers accordingly and represent this information with the NIF 2.0 Core Ontology (https://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/nif-core.html). Further details of our entity linking process are described in the Linking section.

RDF example resources

Listing 2 provides an example of a paper represented as an RDF resource. Listing 3 shows an example of a section resource. Each section is linked to its text string via nif:isString and its title via bibtex:hasTitle. If a section includes references to other papers, figures or tables (e.g., (1-3), (4,5), Figure 1A, Fig. 1, etc.), we represent each reference in RDF as follows: the anchor of the reference with nif:anchorOf (e.g., the number of a figure), the start position of the reference with nif:beginIndex, the end position of the reference with nif:endIndex, the source section of the reference with nif:referenceContext, and the referenced target with its:taIdentRef (e.g., a bibtex entry, figure or table). An example is shown in Listing 4. Listing 5 shows an example of provenance information.
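To make the reference representation concrete, the following sketch locates figure and table references in a section text and records them under the NIF property names mentioned above. The regular expression, the placeholder section identifier, and the choice to store the full matched string as the anchor are assumptions for illustration; resolving the target (its:taIdentRef) is omitted.

```python
import re

# Hypothetical pattern for anchors such as "Figure 1A", "Fig. 1" or "Table 2".
REFERENCE_PATTERN = re.compile(r"\b(?:Figure|Fig\.|Table)\s+\d+[A-Z]?")


def annotate_references(section_id: str, text: str) -> list:
    """Return one NIF-style annotation per figure/table reference in text."""
    annotations = []
    for match in REFERENCE_PATTERN.finditer(text):
        annotations.append({
            "nif:anchorOf": match.group(0),
            # Character offsets into the section string; the end is exclusive.
            "nif:beginIndex": match.start(),
            "nif:endIndex": match.end(),
            "nif:referenceContext": section_id,
        })
    return annotations


text = "As shown in Figure 1A and Table 2, the effect is small (see Fig. 3)."
annotations = annotate_references("urn:example:section-1", text)
```

Because the offsets index into the section string, slicing the text with `nif:beginIndex` and `nif:endIndex` recovers the anchor exactly.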

Linking

We link our dataset to other data sources to ensure its reusability and integrability as well as to improve its use for search, question answering and structured machine learning. We generate links from our paper and author resources to publicly available related knowledge bases. Moreover, we extract named entities related to diseases, genes, and cells from all converted papers and link them to three external knowledge bases.

Linking publications, authors and institutes

We link publications in our knowledge graph to six other datasets using the owl:sameAs and rdfs:seeAlso predicates (see top six rows of Table 2). To the best of our knowledge, those six datasets are the most relevant RDF datasets that deal with the same publication data. We leave it to future work to link our dataset to non-RDF datasets such as Covid19-KG¹² and Wikidata Scholia²⁴.

Listing 3.

Example section representation.

Listing 4.

Example of a reference to a paper and its associated bibtex entry.

Cord19-NEKG and our dataset use the same CORD-19 paperId, making the linking process straightforward. For LitCovid, we use the PubMed Central Id (PMC-id) that is provided as part of CORD-19. For Covid-19-Literature and Cord-19-on-FHIR, we employ SHA hash values from CORD-19. Moreover, we link our dataset to the publications’ JSON files in Cord-19-on-FHIR with the predicate rdfs:seeAlso. Listing 6 shows an example of linked publications from our dataset CovidPubGraph to Cord19-NEKG and LitCovid.
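Identifier-based linking of this kind can be sketched as simple URI construction. The namespaces below come from Listing 1, but the target URI patterns (namespace followed directly by the identifier) are assumptions, as are the function names and the PMC identifier used in the example, which is a placeholder rather than the real identifier of that paper.

```python
CVDR = "https://covid-19ds.data.dice-research.org/resource/"
INRIA = "http://ns.inria.fr/covid19/"                  # Cord19-NEKG namespace
NCBI = "https://www.ncbi.nlm.nih.gov/pmc/articles/"    # LitCovid / PMC namespace
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def publication_links(paper_id, pmc_id=None):
    """Create owl:sameAs triples for one paper from its shared identifiers."""
    subject = CVDR + paper_id
    # Assumption: Cord19-NEKG reuses the CORD-19 paper id as local name.
    triples = [(subject, OWL_SAME_AS, INRIA + paper_id)]
    if pmc_id:
        # Assumption: the PMC target is the namespace plus the PMC id.
        triples.append((subject, OWL_SAME_AS, NCBI + pmc_id))
    return triples


def to_ntriples(triples):
    """Serialize IRI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


links = publication_links(
    "4bf4b71883a26d15dcc13b2800ec470b99764956",
    pmc_id="PMC0000000",  # placeholder identifier
)
ntriples = to_ntriples(links)
```

Papers without a PMC identifier simply receive no LitCovid link, which would be consistent with LitCovid having fewer links than the other publication datasets in Table 2.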

We link both our author and institute resources to the Microsoft Academic Knowledge Graph (MAKG)²⁵ using the latest version of our link discovery framework LIMES²⁶. For linking the authors, LIMES is configured to discover owl:sameAs links between our instances of foaf:Person and Microsoft’s makg:Author. For linking the institutes, we look for links between instances of type dbo:EducationalInstitution from our knowledge graph and MAKG’s resources of type makg:Affiliation. LIMES configuration files for linking authors and institutes are available as part of our source code (https://github.com/dice-group/COVID19DS).

Linking named entities

We apply entity linking to connect entities derived from the sections of papers to other knowledge bases. This process comprises two steps: (1) entity extraction and (2) entity linking. For the extraction step, we use Scispacy²⁷ in version 0.2.4 in conjunction with the model en_ner_bionlp13cg_md (https://github.com/allenai/scispacy) which allows the extraction of biomedical entities such as diseases, genes and cells. Scispacy is a specialized NLP library based on the spaCy library (https://spacy.io/). The NER model in spaCy is a transition-based chunking model that represents tokens as hashed embedded representations of the prefix, suffix, shape and lemmatized features of individual words²⁷.

Listing 5.

Provenance information for the non-commercial dataset.

Listing 6.

An example of a linked publication.

Listing 7.

Entity linking example.

For the linking step, we adapt the entity linking framework MAG²⁸ to link our extracted resources to the three knowledge bases Sider¹⁹, Kegg²⁰ and DrugBank¹⁸, using their RDF versions provided by the Bio2RDF project (https://bio2rdf.org/). We adapt MAG by creating a search index for each of the external knowledge bases and running MAG once per knowledge base. The output is a set of entities in the NLP Interchange Format (NIF) (https://persistence.uni-leipzig.org/nlp2rdf/). In Listing 7, we provide an example for the named entity “folic acid”.
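The idea of one search index per knowledge base, queried in one pass per knowledge base, can be illustrated with a deliberately simplified stand-in. This is not MAG: it replaces MAG's candidate generation and disambiguation with a case-insensitive exact label lookup, and the labels and `urn:example:` identifiers are placeholders, not real Bio2RDF URIs.

```python
def build_index(labels_to_uris):
    """Build a toy search index: lowercased label -> resource URI."""
    return {label.lower(): uri for label, uri in labels_to_uris.items()}


def link_mentions(mentions, indexes):
    """Run one linking pass per knowledge base over all extracted mentions."""
    annotations = []
    for kb_name, index in indexes.items():      # one run per knowledge base
        for mention in mentions:
            uri = index.get(mention.lower())    # exact match only (simplification)
            if uri is not None:
                annotations.append({
                    "nif:anchorOf": mention,
                    "its:taIdentRef": uri,
                    "knowledge_base": kb_name,
                })
    return annotations


# Placeholder indexes; the URIs are hypothetical.
indexes = {
    "DrugBank": build_index({"Folic Acid": "urn:example:drugbank/folic-acid"}),
    "Sider": build_index({"Headache": "urn:example:sider/headache"}),
}
annotations = link_mentions(["folic acid", "spike protein"], indexes)
```

Mentions without a match in any index, such as "spike protein" here, remain unlinked.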

Automated generation of CovidPubGraph

During the second half of 2020, CORD-19 published new data almost every day, so the process of updating our knowledge graph had to be automated. To this end, we developed a pipeline that automates the entire process (see Fig. 2). The pipeline comprises the following steps:

Crawling. We start by crawling the most recent version as a zip file from the CORD-19 website, which includes a CSV metadata file and JSON parsed full texts of scientific papers about the coronavirus.
RDF conversion. Then, we convert the CORD-19 data into an RDF knowledge graph with a Python script using the RDFLib library (https://github.com/RDFLib/rdflib).
Linking. We integrate the AGDISTIS library (https://github.com/dice-group/AGDISTIS) into the generation process to extract and link the named entities from abstracts of the scholarly articles. Moreover, we carry out the link discovery tasks (i.e., linking publications and authors to other datasets) using the link discovery framework LIMES (https://github.com/dice-group/LIMES).
KG Update. We upload the new version of the CovidPubGraph dumps to the HOBBIT server (https://hobbitdata.informatik.uni-leipzig.de/COVID19DS/archive/) as well as to the Virtuoso triple store (https://hub.docker.com/r/openlink/virtuoso-opensource-7).

Starting from 2021, CORD-19 publishes new data only every two weeks. Therefore, we keep our KG up-to-date by crawling the new version of the CORD-19 dataset biweekly. Then, we follow the KG creation procedure presented in Fig. 2. As the dataset is still not too big to be regenerated, we regenerate the complete dataset biweekly. Still, having an automatic incremental update is part of our future plans.
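The biweekly full regeneration described above can be sketched as a small driver that checks whether an update is due and then runs the four pipeline steps in order. The step functions are empty placeholders standing in for crawling, RDF conversion, linking and the KG update; all names and the example dates are illustrative and not taken from the project's source code.

```python
from datetime import date, timedelta

UPDATE_INTERVAL = timedelta(weeks=2)  # biweekly regeneration


def update_due(last_run: date, today: date) -> bool:
    """Return True once at least two weeks have passed since the last run."""
    return today - last_run >= UPDATE_INTERVAL


# Placeholder steps: each takes the pipeline state and returns an updated copy.
def crawl(state):
    return {**state, "crawled": True}


def convert_to_rdf(state):
    return {**state, "converted": True}


def link(state):
    return {**state, "linked": True}


def update_kg(state):
    return {**state, "published": True}


STEPS = [
    ("Crawling", crawl),
    ("RDF conversion", convert_to_rdf),
    ("Linking", link),
    ("KG Update", update_kg),
]


def regenerate(state):
    """Rebuild the complete dataset by running every step in order."""
    completed = []
    for name, step in STEPS:
        state = step(state)
        completed.append(name)
    return state, completed


due = update_due(last_run=date(2021, 10, 25), today=date(2021, 11, 8))
final_state, completed = regenerate({"cord19_version": "2021-11-08"})
```

An incremental update would replace `regenerate` with a variant that processes only the papers added since the previous version.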

Technical Validation

Representing COVID-19-related publications as RDF promises to facilitate many applications and use cases—some of which we outline in this section.

Updating the dataset

An example of how the data are constantly updated is provided in Table 3, which details the growing number of resources of each type across successive versions of our knowledge graph. As we trust the data provider, i.e., the Allen Institute, we do not perform any data cleaning beyond the pipeline introduced in Fig. 2. Moreover, the number of links generated to external datasets (see Table 2) provides further evidence of the quality of the data.

Table 3.

CovidPubGraph statistics.

	Version 1.0	Version 2.0	Version 27.0	Version 28.0
Distinct number of over all resources	11,249,740	15,761,537	214,036,877	268,108,670
Distinct number of publications	40,224	58,739	216,664	262,954
Distinct number of authors	1,434,809	1,484,024	2,892,156	3,388,001
Distinct number of bib entries	1,482,257	2,022,147	6,156,150	7,748,575
Distinct number of bib figures	333,509	461,386	1,243,561	1,532,443
Distinct number of bib tables	158,896	251,970	538,523	690,478


Listing 8.

List the URIs of the 10 papers with the most authors.

Listing 9.

List all paper URIs written by the author “Ian Mackay.”

Data retrieval

While our base dataset CORD-19 contains a significant number of publications, they are not represented in a format optimized for retrieval.

By providing CovidPubGraph in RDF with a well-defined ontology, we enable the easy retrieval of data with structured query languages such as SPARQL. For example, Listing 9 shows a query to retrieve all papers written by the author “Ian Mackay.” Another query to retrieve the top 10 papers in terms of their number of authors is provided in Listing 8.
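A query of the first kind, retrieving the papers of a given author, might be assembled as in the following sketch. The FOAF name properties exist in the FOAF vocabulary, but whether CovidPubGraph uses exactly `foaf:firstName` and `foaf:lastName`, and the paper-to-author predicate `bibtex:hasAuthor`, are assumptions; the published listings and the ontology remain the authoritative reference.

```python
PREFIXES = (
    "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
    "PREFIX bibtex: <http://purl.org/net/nknouf/ns/bibtex#>\n"
)


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def papers_by_author_query(first_name: str, last_name: str) -> str:
    """Build a SPARQL query for all papers by the given author."""
    return (
        PREFIXES
        + "SELECT DISTINCT ?paper WHERE {\n"
        # Assumed predicate linking a paper to its author resources.
        + "  ?paper bibtex:hasAuthor ?author .\n"
        + f'  ?author foaf:firstName "{escape_literal(first_name)}" ;\n'
        + f'          foaf:lastName "{escape_literal(last_name)}" .\n'
        + "}"
    )


query = papers_by_author_query("Ian", "Mackay")
```

Escaping the name before interpolation keeps quotation marks in user-supplied names from breaking the query string.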

Using SPARQL queries, we carried out random checks for duplicate articles and authors and found none. This could be a direct consequence of the high quality of the original CORD-19 dataset. Still, a full deduplication of the KG is part of our future work.

Interoperability using NIF

Using the interoperability capabilities provided by NIF, it is easy to query all occurrences of a certain text segment within the whole dataset and still know exactly where each mention occurs. For example, in Listing 10, we provide a SPARQL query to list all papers where “folic acid” is mentioned with their respective sections.
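A mention query of this kind can be sketched as follows. The NIF properties are those named in the data model; that the anchor is stored as a plain literal and that papers are connected to their sections via `cvdo:hasSection` in this direction are assumptions, so the query may need adjusting against the actual ontology.

```python
PREFIXES = (
    "PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>\n"
    "PREFIX cvdo: <https://covid-19ds.data.dice-research.org/ontology/>\n"
)


def mentions_query(surface_form: str, limit: int = 100) -> str:
    """Build a SPARQL query listing papers and sections that mention a text segment."""
    # Escape backslashes and double quotes for use inside a string literal.
    literal = surface_form.replace("\\", "\\\\").replace('"', '\\"')
    return (
        PREFIXES
        + "SELECT DISTINCT ?paper ?section WHERE {\n"
        + f'  ?mention nif:anchorOf "{literal}" ;\n'
        + "           nif:referenceContext ?section .\n"
        # Assumed link from a paper to the section containing the mention.
        + "  ?paper cvdo:hasSection ?section .\n"
        + "}\n"
        + f"LIMIT {int(limit)}"
    )


query = mentions_query("folic acid")
```

Adding `nif:beginIndex` and `nif:endIndex` to the selected variables would also return the exact position of each mention within its section.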

Information aggregation

Linking our dataset to other RDF datasets adds a considerable amount of value. For example, Microsoft Academic Knowledge Graph (MAKG) covers more than 209 million publications (http://ma-graph.org/) and our interlinking enables the retrieval of an author’s citation count (Listing 11).

Usage Notes

Table 4 summarizes all technical details of our dataset pertaining to its availability.

Table 4.

Technical details of CovidPubGraph.

Name	CovidPubGraph
Example Resource	https://covid-19ds.data.dice-research.org/resource/pmc4913562
Dataset dump	https://hobbitdata.informatik.uni-leipzig.de/COVID19DS/archive/
Archived Dump	10.5281/zenodo.4650261
Sparql Endpoint	https://covid-19ds.data.dice-research.org/sparql
Dataset Graph	https://covid-19ds.data.dice-research.org/resource/corona
Ontology	https://covid-19ds.data.dice-research.org/ontology/
Void File	https://covid-19ds.data.dice-research.org/void/
Ver. Date	November 8, 2021
Ver. No.	28.0
Source Code	https://github.com/dice-group/COVID19DS
Software License	GPL 3.0 (https://www.gnu.org/licenses/gpl-3.0)
Dataset License	Creative Commons Attribution 4.0 International (https://creativecommons.org/licenses/by/4.0/)


Persistent URIs

All our resources are served from one of our servers via persistent URIs. The resources will be maintained by the DICE research team (https://dice-research.org) as part of the lab’s HOBBIT dataset efforts²⁹. A 100 TB server maintained by Paderborn University’s computing centre will host the datasets.

Resource dereferencing

We employ LodView (https://lodview.it/) for dereferencing our dataset URIs and allowing users to conveniently browse HTML pages. Figure 3 shows an example of a resource being served by LodView.

Fig. 3 — Excerpt of an example resource served by LodView.

Listing 10.

List all papers and sections mentioning “folic acid.”

Listing 11.

SPARQL example for retrieving more data via interlinking with MAKG.

Dump files

We provide dump files of our dataset for download. The generated RDF datasets are located on our HOBBIT storage (https://hobbitdata.informatik.uni-leipzig.de/COVID19DS/archive/) and archived on Zenodo (https://zenodo.org/record/4650261).

SPARQL endpoint

We publicly serve CovidPubGraph via a SPARQL endpoint (https://covid-19ds.data.dice-research.org/sparql).

Acknowledgements

This work has been supported by the German Federal Ministry of Economics and Climate Protection (BMWK) project RAKI (GA no. 01MD19012D), the EU H2020 project KnowGraphs (GA no. 860801) as well as the BMVI projects LIMBO (GA no. 19F2029C) and OPAL (GA no. 19F2028A).

Author contributions

Svetlana Pestryakova carried out the main RDF data transformation and linking tasks. Daniel Vollmers deployed the NLP algorithm for the named entity extraction. Mohamed Ahmed Sherif analysed the data and conceived the work. Stefan Heindorf prepared the initial manuscript. Muhammad Saleem enhanced the manuscript. Diego Moussallem enhanced the manuscript. Axel-Cyrille Ngonga Ngomo supervised the work. All authors contributed to the text of the article, read and approved the final manuscript.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Code availability

Our source code to generate the new versions of our knowledge graph is publicly available at https://github.com/dice-group/COVID19DS and is maintained in parallel with the knowledge graph.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Svetlana Pestryakova, Email: pestryak@mail.uni-paderborn.de.

Mohamed Ahmed Sherif, Email: mohamed.sherif@upb.de.

References

1.Wang, L. L. et al. CORD-19: the covid-19 open research dataset. CoRR abs/2004.10706 (2020).
2.Ngomo, A.-C. N., Auer, S., Lehmann, J. & Zaveri, A. Introduction to linked data and its lifecycle on the web. In Reasoning Web International Summer School, 1–99 (Springer, 2014).
3.Wilkinson, M. D. et al. The fair guiding principles for scientific data management and stewardship. Scientific Data 3 (2016).
4.Bühmann L, Lehmann J, Westphal P. Dl-learner - A framework for inductive learning on the semantic web. J. Web Semant. 2016;39:15–24. doi: 10.1016/j.websem.2016.06.001.
5.Heindorf, S. et al. Evolearner: Learning description logics with evolutionary algorithms. In WWW (ACM, 2022).
6.Demir, C. & Ngomo, A. N. DRILL- deep reinforcement learning for refinement operators in ALC. CoRR abs/2106.15373 (2021).
7.Cai X, Fry CV, Wagner CS. International collaboration during the covid-19 crisis: autumn 2020 developments. Scientometrics. 2021;126:3683–3692. doi: 10.1007/s11192-021-03873-7.
8.Horbach SPJM. No time for that now! Qualitative changes in manuscript peer review during the Covid-19 pandemic. Research Evaluation. 2021;30:231–239. doi: 10.1093/reseval/rvaa037.
9.Wang, X., Song, X., Guan, Y., Li, B. & Han, J. Comprehensive named entity recognition on CORD-19 with distant or weak supervision. CoRR abs/2003.12218 (2020).
10.Vandewiele, G., Steenwinckel, B. & Weyns, M. Covid-19 literature knowledge graph. https://www.kaggle.com/group16/covid19-literature-knowledge-graph. Accessed: 2020-05-15.
11.Human coronavirus innovation landscape: Patent and research works open datasets. https://about.lens.org/covid-19 Accessed: 2020-05-19 (2020).
12.Wang, Q. et al. Knowledge extraction to assist scientific discovery from corona virus literature. http://blender.cs.illinois.edu/covid19/. Accessed: 2020-05-15.
13.Jiang, G., Booth, D., Jiao, D. & Solbrig, H. Cord-19-on-fhir – semantics for covid-19 discovery. https://github.com/fhircat/CORD-19-on-FHIR. Accessed: 2020-05-15.
14.Mendes, P. N., Jakob, M., García-Silva, A. & Bizer, C. Dbpedia spotlight: shedding light on the web of documents. In I-SEMANTICS, ACM International Conference Proceeding Series, 1–8 (ACM, 2011).
15.Kroll, H., Pirklbauer, J., Ruthmann, J. & Balke, W.-T. A semantically enriched dataset based on biomedical ner for the covid19 open research dataset challenge (2020).
16.Wang, X., Song, X., Guan, Y., Li, B. & Han, J. Comprehensive named entity recognition on cord-19 with distant or weak supervision (2020).
17.Zhou Y, et al. Network-based drug repurposing for novel coronavirus 2019-ncov/sars-cov-2. Cell Discovery. 2020;6:1–18. doi: 10.1038/s41421-020-0153-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Wishart DS, et al. Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic Acids Research. 2018;46:D1074–D1082. doi: 10.1093/nar/gkx1037. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Kuhn M, Letunic I, Jensen LJ, Bork P. The SIDER database of drugs and side effects. Nucleic Acids Research. 2016;44:1075–1079. doi: 10.1093/nar/gkv1075. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Kanehisa M, Sato Y, Furumichi M, Morishima K, Tanabe M. New approach for understanding genome variations in KEGG. Nucleic Acids Research. 2019;47:D590–D595. doi: 10.1093/nar/gky962. [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Janowicz, K. et al. Covid-19 by stko lab, ucsb. https://covid.geog.ucsb.edu/. Accessed: 2020-05-15.
22.Pestryakova S, 2021. Covidpubgraph: A fair knowledge graph of covid-19 publications. Zenodo.
23.Groza, T., Handschuh, S., Möller, K. & Decker, S. SALT - semantically annotated latex for scientific publications. In ESWC, vol. 4519 of Lecture Notes in Computer Science, 518–532 (Springer, 2007).
24.Wikidata scholia topic covid-19. https://tools.wmflabs.org/scholia/topic/Q84263196. Accessed: 2020-05-15.
25.Färber, M. The Microsoft Academic Knowledge Graph: A Linked Data Source with 8 Billion Triples of Scholarly Data. In Proceedings of the 18th International Semantic Web Conference, ISWC’19, 113–129, 10.1007/978-3-030-30796-7_8 (2019).
26.Ngonga Ngomo, A.-C. et al. LIMES - A Framework for Link Discovery on the Semantic Web. KI - Künstliche Intelligenz, German Journal of Artificial Intelligence - Organ des Fachbereichs “Künstliche Intelligenz” der Gesellschaft für Informatik e.V. (2021).
27.Neumann, M., King, D., Beltagy, I. & Ammar, W. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, 319–327, 10.18653/v1/W19-5034 (Association for Computational Linguistics, Florence, Italy, 2019).
28.Moussallem, D., Usbeck, R., Röder, M. & Ngonga Ngomo, A.-C. MAG: A Multilingual, Knowledge-base Agnostic and Deterministic Entity Linking Approach. In K-CAP 2017: Knowledge Capture Conference, https://svn.aksw.org/papers/2017/KCAPMAG=sigconf-main:pdf 8 (ACM, 2017).
29.Röder, M., Kuchelev, D. & Ngonga Ngomo, A.-C. Hobbit: A platform for benchmarking big linked data. Data Science 1–21 (2019).
30.Dong, E., Du, H. & Gardner, L. Covid-19 data repository by the center for systems science and engineering (csse) at johns hopkins university. https://github.com/CSSEGISandData/COVID-19. Accessed: 2020-05-15.
31.Dong, E., Du, H. & Gardner, L. An interactive web-based dashboard to track covid-19 in real time. The Lancet infectious diseases (2020).

Associated Data

Data Citations

Pestryakova S, 2021. Covidpubgraph: A fair knowledge graph of covid-19 publications. Zenodo.

Data Availability Statement

Our source code to generate the new versions of our knowledge graph is publicly available at https://github.com/dice-group/COVID19DS and is maintained in parallel with the knowledge graph.
