Briefings in Bioinformatics. 2020 Dec 2:bbaa331. doi: 10.1093/bib/bbaa331

How do we share data in COVID-19 research? A systematic review of COVID-19 datasets in PubMed Central Articles

Xu Zuo 1, Yong Chen 2, Lucila Ohno-Machado 3, Hua Xu 4
PMCID: PMC7799277  PMID: 33757278

Abstract

Objective

This study aims to review novel coronavirus disease (COVID-19) datasets extracted from PubMed Central articles and to provide a quantitative analysis that answers questions related to dataset contents, accessibility and citations.

Methods

We downloaded COVID-19-related full-text articles published until 31 May 2020 from PubMed Central. Dataset URL links mentioned in full-text articles were extracted, and each dataset was manually reviewed to provide information on 10 variables: (1) type of the dataset, (2) geographic region where the data were collected, (3) whether the dataset was immediately downloadable, (4) format of the dataset files, (5) where the dataset was hosted, (6) whether the dataset was updated regularly, (7) the type of license used, (8) whether the metadata were explicitly provided, (9) whether there was a PubMed Central paper describing the dataset and (10) the number of times the dataset was cited by PubMed Central articles. Descriptive statistics about these 10 variables were reported for all extracted datasets.

Results

We found that 28.5% of the 12 324 COVID-19 full-text articles in PubMed Central provided at least one dataset link. In total, 128 unique datasets were mentioned in these 12 324 COVID-19 full-text articles. Further analysis showed that epidemiological datasets accounted for the largest portion (53.9%) of the dataset collection, and most datasets (84.4%) were available for immediate download. GitHub was the most popular repository for hosting COVID-19 datasets. CSV, XLSX and JSON were the most popular data formats. Additionally, citation patterns of COVID-19 datasets varied depending on the specific dataset.

Conclusion

PubMed Central articles are an important source of COVID-19 datasets, but there is significant heterogeneity in the way these datasets are mentioned, shared, updated and cited.

Keywords: COVID-19, data sharing, review

Introduction

The novel coronavirus disease (COVID-19) outbreak was first reported in Wuhan, China, on 31 December 2019. On 11 March 2020, the World Health Organization officially declared COVID-19 a pandemic, marking the recognition of a global crisis [1]. To fight the COVID-19 pandemic, researchers worldwide have quickly investigated different aspects of this disease and reported novel scientific findings on a daily basis. According to LitCovid [2], a curated literature hub for tracking COVID-19 publications, 34 890 new articles (as of 25 July 2020) have been published in the past seven months. Along with published articles, massive and heterogeneous datasets have been created, ranging from testing and case statistics at various locations (medical centers, cities, counties, states, countries), clinical data from studies (e.g., ‘omics, imaging, assays, questionnaires) or from electronic health records, surveys for patient-reported outcomes, administrative data [e.g., ventilators, hospitalizations, intensive care unit (ICU) beds], vital statistics (e.g., obituaries, death certificates), as well as sociodemographic, environmental, economic, individual mobility and transportation data.

Efficient sharing of biomedical data is an important component in the development of successful data-driven research on COVID-19 [3]. For example, researchers reconstructed the early evolutionary paths of COVID-19 through genetic network analysis of existing virus genome data collected across the world, providing insights into virus transmission patterns [4]. Nevertheless, it is challenging for researchers to find and identify reliable datasets for novel scientific discoveries, given the large volume and sometimes contradictory information (e.g. non-peer-reviewed sources) about available datasets. Principles such as FAIR (Findable, Accessible, Interoperable and Reusable) [5] and TRUST (Transparency, Responsibility, User focus, Sustainability and Technology) [6] have been proposed for sharing digital data and digital repositories, with applications to COVID-19 datasets as well (e.g. the Virus Outbreak Data Network) [7].

Here, we conduct a systematic review of COVID-19 datasets that are associated with published literature. Our study aims to identify a comprehensive list of available COVID-19 datasets across domains and to provide insights into how researchers share datasets as they publish COVID-19 research articles. We also assess the accessibility, sustainability and impact of published datasets. More specifically, we attempt to answer the following research questions about COVID-19 datasets that are associated with publications:

Q1. Contents: What types of data are published to support different studies and where are those data collected from?

Q2. Accessibility: How can users access datasets and where are the data hosted?

Q3. Citation: How are datasets cited by others and what are the top high-impact datasets by citation count?

Our ultimate goal is to promote data sharing and data reuse through careful analysis of current practices among researchers. Through a systematic review, we provide researchers with a comprehensive list of reliable datasets that are available to the public. Additionally, we provide insights about data sharing strategies to aid those who plan to develop and publish new COVID-19 datasets.

Methods

COVID-19 publication collection

To identify and collect COVID-19-related articles, we leveraged LitCovid [2], a newly established literature database for tracking the latest scientific articles about COVID-19, developed by the National Library of Medicine in the United States. LitCovid provides essential bibliographic information such as the PubMed ID, title, abstract and journal of publications related to COVID-19. In this review, we included all LitCovid articles published before 31 May 2020, resulting in 18 332 articles. Because recognizing associated datasets requires access to full-text articles, we further limited the set to articles with full text available in PubMed Central (PMC), one of the largest open-access repositories of full-text biomedical articles. We then removed 16 errata notices, which further reduced the number of articles to 12 324, from which we carried out our dataset collection process.
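Checking whether a PubMed ID has a PMC full-text version can be automated. Below is a minimal sketch (not the authors' pipeline) that uses the public NCBI ID Converter API; the batch size of 200 IDs per request is an assumption based on the service's documented limit.

```python
import requests

ID_CONV_URL = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"

def pmids_with_pmc_fulltext(pmids):
    """Return a dict mapping each PMID that has PMC full text to its PMCID."""
    available = {}
    for i in range(0, len(pmids), 200):  # assumed batch limit per request
        batch = pmids[i:i + 200]
        resp = requests.get(ID_CONV_URL,
                            params={"ids": ",".join(batch), "format": "json"},
                            timeout=30)
        resp.raise_for_status()
        for rec in resp.json().get("records", []):
            # records without a 'pmcid' field have no PMC full-text version
            if "pmcid" in rec:
                available[rec.get("pmid")] = rec["pmcid"]
    return available

# Usage (hypothetical variable): pmc_available = pmids_with_pmc_fulltext(litcovid_pmids)
```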

COVID-19 dataset collection

We manually reviewed 100 PMC full-text articles and identified the following patterns for mentioning datasets:

  • (1) Dataset information is given in the Data Availability Statement section provided by PMC, which allows authors to disclose information about data availability and access and often contains URL links to data sources, or

  • (2) When the Data Availability Statement section was missing, datasets could be mentioned in the full text as (a) external URL links to the data sources, (b) supplemental files (e.g. additional tables, sometimes in PDF) or (c) textual statements about data availability (e.g. ‘available upon request’).

As datasets in categories 2b and 2c often required additional effort before they could be used, we limited our data collection to categories 1 and 2a, which led to the task of identifying URL links in PMC full-text articles. Of course, external URL links in PMC articles do not always refer to datasets. Therefore, we developed a process that combines automatic extraction with manual review to identify dataset links mentioned in articles. We first downloaded the full texts of the 12 324 PMC articles in XML format using E-Fetch queries [8]. All URLs tagged with the markup ‘ext-link’ were then automatically extracted from the articles, including URLs both in the main text and in the citations. These URLs then underwent a normalization step, in which elements such as the ‘http’ prefix and the ‘htm’ extension were removed, resulting in a list of 23 467 URLs in total. We then manually reviewed all of them and identified 144 links pointing to actual datasets. We noticed that a single dataset can be associated with multiple links; for example, the Johns Hopkins University Dashboard [9] was cited in articles using four different URLs. After merging different data links that pointed to the same dataset, we obtained 128 unique datasets from the verified data links. The complete process of extracting COVID-19 publications from LitCovid and extracting datasets mentioned in full-text COVID-19 publications in PMC is shown in Figure 1.

Figure 1. The workflow of screening and collecting publications and datasets from 18 332 PMC articles.
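As an illustration of the URL extraction step, the sketch below fetches one PMC article as XML through E-Fetch and collects the URLs carried by ‘ext-link’ elements, followed by a crude normalization for de-duplication. It approximates the procedure described above rather than reproducing the authors' exact code, and the normalization rules are assumptions.

```python
import re
import requests
from lxml import etree

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def extract_ext_links(pmcid):
    """Fetch one PMC article as XML and return all URLs tagged as <ext-link>."""
    resp = requests.get(EFETCH,
                        params={"db": "pmc", "id": pmcid, "retmode": "xml"},
                        timeout=30)
    resp.raise_for_status()
    root = etree.fromstring(resp.content)
    urls = []
    for node in root.iter("ext-link"):
        # JATS ext-link elements carry the target URL in the xlink:href attribute
        href = node.get(XLINK_HREF) or (node.text or "")
        if href.strip():
            urls.append(href.strip())
    return urls

def normalize_url(url):
    """Crude normalization for de-duplication: drop scheme, 'www.', fragments, trailing slashes."""
    url = re.sub(r"^https?://", "", url.lower())
    url = re.sub(r"^www\.", "", url)
    return url.split("#")[0].rstrip("/")

# Usage with a hypothetical numeric PMCID:
# unique_links = {normalize_url(u) for u in extract_ext_links("7799277")}
```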

COVID-19 dataset review and analysis

For each of the 128 COVID-19 datasets, we manually reviewed its web pages. We extracted information for 10 descriptive variables: (1) type of the dataset (e.g. epidemiological or genomic data), (2) geographic region where the data were collected, (3) whether the dataset was immediately downloadable, (4) file format (e.g. CSV), (5) where the dataset was hosted, (6) whether the data were updated regularly, (7) the type of license used, (8) whether the metadata were explicitly provided, (9) whether there was a PMC paper describing the creation of the dataset and (10) the number of times the dataset was cited by PMC articles (either via URL links or via articles). The definitions and examples of values for the 10 variables are shown in Table 1.

Table 1.

Descriptions and examples of metadata variables collected for each dataset

Question | Variable | Description | Examples
Content | Data type | Types of dataset | Epidemiological, clinical, etc.
Content | Geographic region | The region from where the data were collected | Worldwide, China, United States, etc.
Accessibility | Download | Can the user download the dataset | Immediately downloadable versus request needed
Accessibility | Data format | Format of the dataset | CSV, XLSX, etc.
Accessibility | Data hosting | Data repository where the dataset was hosted | GitHub, Mendeley, etc.
Accessibility | Data update frequency | Whether the dataset was being updated regularly and the last date of update | Regularly updated versus one time only
Accessibility | License | The type of license used | CC BY 4.0, MIT, etc.
Accessibility | Metadata availability | Whether the metadata are provided and the metadata format | Machine readable, unstructured or unavailable
Citation | Dataset article | Whether there was a PMC paper that described the dataset | The JHU dataset was described in the PMID 32087114 article
Citation | Citation count | The number of times that the dataset was cited by PMC articles (either via URL links or via data articles) | The JHU dataset was cited by 454 PMC articles

JHU, Johns Hopkins University; CSV, comma separated values; XLSX, Microsoft Excel Open XML Spreadsheet; CC BY 4.0, Creative Commons Attribution 4.0; MIT, The MIT License.
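For concreteness, the 10 review variables in Table 1 can be thought of as one record per dataset. The sketch below is an illustrative data structure, not the authors' actual review spreadsheet, and all field names are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DatasetRecord:
    """One reviewed dataset: the 10 variables from Table 1 (illustrative schema)."""
    data_type: str                   # e.g. "epidemiology", "genomics"
    region: str                      # e.g. "Worldwide", "China"
    downloadable: bool               # immediately downloadable vs. request needed
    formats: List[str]               # e.g. ["CSV", "JSON"]
    hosting: str                     # e.g. "GitHub", "individual web page"
    updated_regularly: bool
    license: Optional[str]           # e.g. "CC BY 4.0"; None if unavailable
    metadata: str                    # "machine readable" | "unstructured" | "unavailable"
    dataset_article: Optional[str]   # PMID of a paper describing the dataset, if any
    citation_count: int              # citations by PMC articles (URL or article)
```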

Results

Among the 12 324 PMC articles screened, 249 papers included a Data Availability Statement section, of which 23 provided valid online data sources. Of the 12 075 papers without a Data Availability Statement, 3486 contained at least one dataset link in the full text. The proportion of COVID-19 articles in PMC that provided at least one dataset link was therefore 28.5% (3509/12 324). In total, 128 unique datasets were mentioned in the 12 324 COVID-19 articles in PMC.

Q1. Content

Data types

Table 2 presents the distribution of dataset types. As expected, epidemiological datasets (N = 69; 53.9%) constituted the largest portion of our dataset collection. Some were created by governmental sources and others by independent statistics suppliers, and they were aimed at tracking the latest COVID-19 case updates, including confirmed cases, deaths, recoveries and the number of tests conducted [9–13, 18–23, 31–50, 53–57, 61, 66, 67, 74–76, 132]. Some epidemiological studies focused on the modeling [15, 16, 26, 29, 51, 58, 64, 135] and prediction of transmission patterns [24, 30, 52, 60, 63, 65, 70–73]. Of all datasets, 14.8% provided COVID-19 genome [76–91, 93] or protein sequences [92, 94, 95, 98]. Clinical datasets (N = 15; 11.7%) largely concentrated on three aspects: (1) the incubation period and other clinical characteristics of COVID-19 patients [96, 104, 107], (2) potential treatments such as vaccines [97, 105] and medications [99] and (3) imaging: three imaging datasets (N = 3; 2.3%) contained chest computed tomography (CT) images of COVID-19 patients, among other data [110–112]. Mobility datasets (N = 7; 5.5%), which tracked movement trends over time by geography, were also present [114–120]. Social studies datasets (N = 4; 3.1%) gathered information about people’s responses to the pandemic [121–124]. LitCovid [2] and CORD-19 [113] were the two major COVID-19 literature databases that appeared in PMC articles. Healthcare administration data [100–103, 108] mainly described hospital capacities, for example the number of ventilators [103] and ICU beds [108] available in hospitals. Other non-biomedical datasets (N = 10; 7.8%), such as climate [133, 134], economic [127], geographical [128] and population data [129, 131], were also discovered in articles investigating the effects of disease transmission and the long-term impacts of the pandemic.

Table 2.

The distribution of data types among 128 COVID-19 datasets

Data type Number Percent
Epidemiology 69 53.9%
Genomics 19 14.8%
Clinical 15 11.7%
Imaging 3 2.3%
Mobility 7 5.5%
Social science 4 3.1%
Healthcare administration 2 1.6%
Literature 2 1.6%
Other 10 7.8%

Note: Imaging is considered a subset of clinical datasets.

Geographic region

Figure 2 illustrates the geographic regions that the datasets covered. More than half of the datasets (N = 68; 53.1%) incorporated data from around the world. Eighteen (14.1%) datasets involved data from China. Multiple epidemiological datasets were reported in the United States [66, 69, 90, 102], the United Kingdom [40, 54, 62] and India [35, 36, 57] as the coronavirus disease spread to these countries. Aside from country-level data, there were also continent-level datasets covering Africa [59] and Europe [109]. There were also smaller datasets that covered only states [39, 47], counties [56] and cities [14, 16, 17, 43, 53]. Such datasets were often created by local health departments and incorporated detailed demographic breakdowns of COVID-19 patients.

Figure 2. Distribution of geographic regions where COVID-19 datasets were collected.

Q2. Accessibility

Download

Among the 128 datasets in our study, 20 did not provide clear downloading information; users who wish to use these datasets need to contact the owners for download instructions, so we marked their accessibility as ‘request needed’. The remaining 108 (84.4%) datasets were immediately downloadable. Registration prior to accessing the data was required for 9 of the 108 downloadable datasets.

Data format

Of the 108 datasets that could be downloaded immediately, 19 were available in multiple formats. CSV (N = 53; 49.1%), XLSX (N = 27; 25.0%) and JSON (N = 10; 9.3%) were the three most popular formats for data exchange (Figure 3). Almost all genetic studies shared data in FASTA. RDS and RDA were two common data formats in studies that utilized the R programming language [28, 60, 68, 106, 108, 126]. Imaging datasets typically shared CT images as JPG files. Datasets of protein structures offered data in PDB files [92, 95]. GeoTIFF files were provided in a worldwide population dataset, allowing the data to be projected onto a geographical map [129].

Figure 3. Data formats used by the 108 downloadable datasets, among 128 COVID-19 datasets.
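As an example of how an immediately downloadable CSV dataset can be consumed, the sketch below reads a GitHub-hosted epidemiological time series with pandas. The Johns Hopkins CSSE file path is illustrative only and may have changed since the repository was archived.

```python
import pandas as pd

# Assumed path to the JHU CSSE global confirmed-cases time series (layout may have changed)
JHU_CONFIRMED = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
                 "csse_covid_19_data/csse_covid_19_time_series/"
                 "time_series_covid19_confirmed_global.csv")

confirmed = pd.read_csv(JHU_CONFIRMED)

# The file is wide: one row per province/country, one column per date.
totals_by_country = (confirmed
                     .drop(columns=["Province/State", "Lat", "Long"])
                     .groupby("Country/Region")
                     .sum())

# Cumulative confirmed cases on the most recent date, top five countries
print(totals_by_country.iloc[:, -1].sort_values(ascending=False).head())
```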

Data hosting

As shown in Table 3, the most popular data repository was GitHub, hosting 57 (44.5%) datasets. Six (4.7%) datasets were stored on Mendeley Data, a cloud-based repository for research data from scholarly articles. Individual web pages (N = 55; 43.0%) refer to datasets accessible only via stand-alone websites, as opposed to those deposited in established data repositories.

Table 3.

List of data repositories used by datasets in this study

Repository Number Percent
GitHub 57 44.5%
Google Drive 7 5.5%
Mendeley 6 4.7%
Kaggle 3 2.3%
Individual web page 55 43.0%

Data update frequency

More than half (N = 74; 57.8%) of the datasets were updated regularly (often daily or weekly). If data depositors did not offer any information regarding update frequency, we treated those datasets as not being updated on a regular basis and recorded the date of their last update. Figure 4 illustrates the number of datasets that stopped updating in each month.

Figure 4. The time of last update for datasets that were not being updated regularly, among 128 COVID-19 datasets.

License

Table 4 shows the statistics for data licensing. Among the 128 datasets we collected, 39 (30.5%) clearly specified data licenses governing permitted use; the COVID-19 Image Data Collection [111] used multiple licenses for different subsets of data. Another 37.5% (N = 48) stated their own terms and policies for data usage in detail online. A further 8.6% (N = 11) of datasets required users to cite their associated papers when using the data but offered no other information on data sharing and usage, and 23.4% (N = 30) released no information regarding data usage.

Table 4.

The number of times that a data license is used in dataset collection

License Number Percent
MIT 12 9.4%
Creative Commons Attribution 4.0 12 9.4%
GNU General Public License v3.0 7 5.5%
Apache license 2.0 4 3.1%
Creative Commons Zero v1.0 Universal 3 2.3%
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 2 1.6%
Mozilla Public License 2.0 1 0.8%
Self-defined data usage policy 48 37.5%
Citation required 11 8.6%
Unavailable 30 23.4%

Note: One dataset [111] used multiple licenses, thus percentages in this table may not add up. Self-defined data usage: data owners defined their own data usage policy; citation required: data owners only require users to cite their associated papers when using the data.

Metadata availability

Of the 108 datasets that are immediately downloadable, 77.8% (N = 84) provide metadata in machine-readable formats. Several datasets [40, 74, 78, 125, 130] and data deposited on established data repositories (GitHub, Mendeley and Kaggle) offer application programming interfaces (APIs) to automatically retrieve metadata. Another 9.3% (N = 10) of datasets provide metadata in free text, including information such as dataset names, data owners and data descriptions, and 13.0% (N = 14) provide no metadata at all.
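For datasets hosted on GitHub, basic machine-readable metadata (description, license, last update) can be retrieved programmatically. The sketch below uses the public GitHub REST API and is an illustration rather than part of the authors' methodology.

```python
import requests

def github_repo_metadata(owner, repo):
    """Retrieve basic machine-readable metadata for a GitHub-hosted dataset."""
    resp = requests.get(f"https://api.github.com/repos/{owner}/{repo}",
                        headers={"Accept": "application/vnd.github+json"},
                        timeout=30)
    resp.raise_for_status()
    meta = resp.json()
    return {
        "description": meta.get("description"),
        "license": (meta.get("license") or {}).get("spdx_id"),  # None if no license file
        "last_push": meta.get("pushed_at"),
    }

# e.g. github_repo_metadata("CSSEGISandData", "COVID-19")
```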

Q3. Citation

Dataset article availability

A total of 41.4% (N = 53) of datasets were described in detail in PMC publications. Of the 53 articles describing datasets, 5 extensively described the purpose and techniques of building COVID-19 databases. The main purpose of the remaining 48 articles was to carry out modeling, prediction or other types of analysis in diverse domains, with some description of the datasets used in the study. These were often the datasets that were not updated on a regular basis: the data were collected, standardized and maintained by the authors themselves for their specific studies.

Citation count

Figure 5 shows the number of citations for each dataset. Typically, a dataset can be cited in two ways in articles: (1) as a URL in the full text and (2) as an article that describes the dataset. It is possible for an article to cite both the URL and the article of the dataset. The number shown in Figure 5 is the overall citation count (both articles and URLs), with duplicated citations removed. The number of citations varied widely across datasets. The dataset available in the Johns Hopkins University Dashboard [9] was the most popular and was cited 454 times. Of the top 10 datasets, 9 were from the epidemiology domain. However, a low number of citations does not necessarily indicate that a dataset has little impact; it may simply reflect that more recently published datasets have not yet had enough time to accrue citations.

Figure 5. Number of citations for each dataset. The horizontal axis indicates the number of citations of a dataset. The vertical axis label corresponds to the dataset ID in our dataset summary list (included in the Supplementary Data available online at https://academic.oup.com/bib).
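The de-duplication rule used for the overall citation counts (an article citing a dataset both by URL and by its description paper counts once) amounts to a set union. A toy sketch with hypothetical PMC IDs:

```python
def overall_citations(url_citers, article_citers):
    """Count unique citing PMC articles, merging URL citations and article citations.

    url_citers / article_citers: collections of PMC IDs citing the dataset via a URL
    or via its description paper; an article appearing in both is counted once.
    """
    return len(set(url_citers) | set(article_citers))

# Hypothetical IDs: one article cites the dataset both ways, so it counts once -> 3
print(overall_citations({"PMC1", "PMC2"}, {"PMC2", "PMC3"}))
```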

Table 5 presents the top 10 cited datasets in our study. The Johns Hopkins University Dashboard [9] had a large number of citations both as an online data link and as a publication. Worldometers [33] and CDC [31] are high-impact data sources for COVID-19 case updates and were cited frequently as external data links. They are used in the Johns Hopkins University Dashboard but, since they do not accrue citations indirectly, their impact may be underestimated. The remaining seven datasets were almost all cited as articles published on PubMed and had few or no URL citations.

Table 5.

Top 10 cited datasets

Dataset | Overall citations | URL citations | Article citations
Johns Hopkins University Dashboard [6] | 454 | 416 | 275
Real-time estimation of the novel coronavirus incubation time [93] | 239 | 0 | 239
Worldometers [30] | 231 | 231 | 0
Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2) [22] | 189 | 0 | 189
Estimates of the severity of coronavirus disease 2019: a model-based analysis [60] | 132 | 0 | 132
Feasibility of controlling COVID-19 outbreaks by isolation of cases and contacts [25] | 104 | 0 | 104
Pattern of early human-to-human transmission of Wuhan 2019 novel coronavirus [14] | 102 | 1 | 102
Early dynamics of transmission and control of COVID-19: a mathematical modelling study [61] | 97 | 0 | 97
The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak [27] | 90 | 0 | 90
CDC [28] | 87 | 87 | 0

Note: The URL and article citation counts may not add up to the overall citation count because we merged URL citations and article citations coming from the same article.

Discussion

Although no extensive analyses have been carried out on the availability, accessibility and types of COVID-19 datasets, discussion of the collection and sharing of COVID-19 data has received great attention in the scientific community. Alamo et al. [136] highlighted a variety of significant open data sources and evaluated the limitations and readability of available data. They concluded that notable progress had been achieved by certain scientific communities, particularly epidemiologists, healthcare specialists, the machine learning community and data scientists. Several studies also reviewed and explored available COVID-19 data in specific domains. Kalkreuth and Kaufmann [137] reviewed publicly available medical imaging resources for COVID-19 cases worldwide. Rubin [138] reported on recent progress in collecting data on ventilated patients with confirmed COVID-19. Robinson and Yazdany [139] described an initiative to collect data about COVID-19 patients with rheumatic diseases. Khalatbari-Soltani et al. [140] listed a series of important socioeconomic characteristics often overlooked when collecting and reporting social science data related to the pandemic.

In our study, we took a different approach and conducted a systematic review of COVID-19 literature in PMC to identify associated datasets. A number of interesting findings emerged from our analysis. First, although PMC implemented the Data Availability Statement section, the percentage of articles that explicitly provided such information was only about 2% (249/12 324), indicating a low adoption rate. Nevertheless, 3509 (about 28.5%) papers provided at least one URL link to the datasets used in the study, demonstrating that a significant portion of researchers are aware of the importance of sharing data. Additionally, all 128 datasets identified in PMC articles allowed users to access their data, and 84.4% (108/128) were available for immediate download, indicating the level of openness of data sharing in COVID-19 research. Epidemiology datasets constituted more than half of our dataset collection, while imaging datasets accounted for only 2.3%, indicating the need to develop more datasets in the latter and related domains, which will probably require worldwide collaboration in order to grow to the same size as the epidemiology datasets. As for data format, although FAIR [5] recommends the RDF (Resource Description Framework) format, no dataset in this study adopted RDF, probably because common machine-readable formats such as CSV, JSON and TXT are easier to understand. We observed two major practices in licensing data usage: data owners who use established data repositories often apply a variety of existing data licenses to govern data usage and sharing, whereas data owners who publish datasets on individual web pages prefer to specify their own terms and policies. Overall, 76.6% (98/128) of data owners allow non-commercial use of the data and specify the degree of openness by releasing data usage policies.

The data update frequency relied heavily on the objectives behind creating the dataset. Among the 75 datasets only available as online sources, the majority were updated regularly for public use. However, for the 41.4% (53/128) of datasets associated with publications, the authors collected and maintained the datasets themselves for different purposes. Five articles aimed at describing how COVID-19 databases were built, discussing data collection, storage and visualization. The remaining 48 articles focused on modeling, prediction or other analyses related to COVID-19. The authors of these analysis articles kept not only the data but also the code and tools they used in their own studies. The datasets mentioned in these articles represent the collection of raw data that the authors used as input for their analyses. Such data are often limited to a specific period of time and contain a relatively small number of cases.

We observed two approaches for citing datasets: (1) URL citations, which cite URLs that lead to the data sources, and (2) article citations, which cite the article that describes the dataset. After examining the articles that cited datasets in the full text, we also discovered two major purposes of citing datasets: (1) citing a dataset as the data source used in the study and (2) citing a dataset as a general reference. Researchers are typically more likely to have used the dataset if they cite it directly as a URL. On the other hand, when citing a dataset as an article, the authors are more likely to mention it as a general reference rather than as a data source. This suggests that a larger number of URL citations to a dataset indicates higher reuse. However, we also saw that datasets that aggregate data from several sources can be popular and highly cited, while the data sources they use may not always receive citations. This indicates that indirect citations may need to be considered when assessing the true impact of a dataset in terms of its utility. Additionally, if a dataset is associated with a dedicated description paper, e.g. the Johns Hopkins University Dashboard [9] or the Epidemiological Data from the nCoV-2019 Outbreak [18], other papers that used the dataset may cite it as both a URL and a paper.

One limitation of this study is that we limited our analysis to full-text articles in PMC. Although PMC is the largest full-text article repository in the biomedical domain, about one-third (5992/18 332) of LitCovid papers were not included in this study because their full text was unavailable in PMC. Considering that LitCovid collects articles from PubMed only, the actual number of COVID-19 articles not included in this study could be even higher. In the future, we plan to look into other sources of full-text articles to study the status of COVID-19 datasets. Additionally, our study did not take into account high-impact datasets cited often by preprints, such as the Public Coronavirus Twitter Dataset [141]. Furthermore, we reviewed only the URLs extracted from articles, rather than other potential types of references that could have been revealed by reviewing the whole text; there is a chance that we missed data source information stated in plain text. We hope to resolve this problem and expand the dataset collection by introducing natural language processing techniques in future studies.

Conclusion

We screened 12 324 COVID-19 related full-text articles in PMC and collected 128 unique dataset URLs. By systematically analyzing the collected datasets in terms of content, accessibility and citation, we observed significant heterogeneity in the way these datasets are mentioned, shared, updated and cited. Those findings on current practice on generating, sharing and citing datasets for COVID-19 research can provide valuable insights for future improvements.

Key Points

  • 128 COVID-19 datasets from 12 324 COVID-19 articles were collected for this systematic review.

  • We conducted a quantitative analysis of dataset contents, accessibility and citations.

  • 84.4% of COVID-19 scholarly datasets are available for immediate download.

  • The number of dataset URL citations is a valuable indicator of dataset utility.

Supplementary Material

Supplementary_dataset_summary_revised_bbaa331

Xu Zuo is a PhD student at the School of Biomedical Informatics, The University of Texas Health Science Center at Houston. Her research focuses on clinical natural language processing.

Yong Chen is an associate professor of biostatistics at the University of Pennsylvania. His research interests include evidence synthesis, data integration and real-world evidence.

Lucila Ohno-Machado is a professor of medicine (Biomedical Informatics) at the University of California San Diego. Her research interests include privacy technology, predictive analytics and data sharing.

Hua Xu is a professor at the School of Biomedical Informatics, The University of Texas Health Science Center at Houston. His research interests include biomedical text mining and data mining.

Contributor Information

Xu Zuo, School of Biomedical Informatics, The University of Texas Health Science Center at Houston.

Yong Chen, University of Pennsylvania.

Lucila Ohno-Machado, University of California San Diego.

Hua Xu, School of Biomedical Informatics, The University of Texas Health Science Center at Houston.

Data Availability

The original data presented in the study are included in the article/supplementary materials.

Funding

UTHealth CCTS Pilot Project (0015300); National Science Foundation (OIA-1937136).

References


