Employing a systematic approach to biobanking and analyzing clinical and genetic data for advancing COVID-19 research

Sergio Daga; Chiara Fallerini; Margherita Baldassarri; Francesca Fava; Floriana Valentino; Gabriella Doddato; Elisa Benetti; Simone Furini; Annarita Giliberti; Rossella Tita; Sara Amitrano; Mirella Bruttini; Ilaria Meloni; Anna Maria Pinto; Francesco Raimondi; Alessandra Stella; Filippo Biscarini; Nicola Picchiotti; Marco Gori; Pietro Pinoli; Stefano Ceri; Maurizio Sanarico; Francis P Crawley; Giovanni Birolo; GEN-COVID Multicenter Study; Alessandra Renieri; Francesca Mari; Elisa Frullanti

doi:10.1038/s41431-020-00793-7

2021 Jan 17;29(5):745–759. doi: 10.1038/s41431-020-00793-7

Sergio Daga ¹,²,#, Chiara Fallerini ¹,²,#, Margherita Baldassarri ¹,², Francesca Fava ¹,²,³, Floriana Valentino ¹,², Gabriella Doddato ¹,², Elisa Benetti ², Simone Furini ², Annarita Giliberti ¹,², Rossella Tita ³, Sara Amitrano ³, Mirella Bruttini ¹,²,³, Ilaria Meloni ¹,², Anna Maria Pinto ³, Francesco Raimondi ⁴, Alessandra Stella ⁵, Filippo Biscarini ⁵,¹³, Nicola Picchiotti ⁶,⁷, Marco Gori ⁶,⁸, Pietro Pinoli ⁹, Stefano Ceri ⁹, Maurizio Sanarico ¹⁰, Francis P Crawley ¹¹,¹³, Giovanni Birolo ¹²; GEN-COVID Multicenter Study, Alessandra Renieri ¹,²,³,✉, Francesca Mari ¹,²,³, Elisa Frullanti ¹,²

¹Medical Genetics, University of Siena, Siena, Italy

²Med Biotech Hub and Competence Center, Department of Medical Biotechnologies, University of Siena, Siena, Italy

³Genetica Medica, Azienda Ospedaliero-Universitaria Senese, Siena, Italy

⁴Scuola Normale Superiore, Pisa, Italy

⁵CNR-Consiglio Nazionale delle Ricerche, Istituto di Biologia e Biotecnologia Agraria (IBBA), Milano, Italy

⁶University of Siena, DIISM- SAILAB, Siena, Italy

⁷Department of Mathematics, University of Pavia, Pavia, Italy

⁸Université Côte d’Azur, Inria, CNRS, I3S, Maasai, Italy

⁹Department of Electronics, Information and Bioengineering (DEIB), Politecnico di Milano, Milano, Italy

¹⁰Independent Data Scientist, Milan, Italy

¹¹Good Clinical Practice Alliance-Europe (GCPA) and Strategic Initiative for Developing Capacity in Ethical Review-Europe (SIDCER), Brussels, Belgium

¹²Department of Medical Sciences, University of Turin, Turin, Italy

¹³Present Address: ERCEA (European Research Council Executive Agency), Bruxelles, Belgium

¹⁴Department of Specialized and Internal Medicine, Tropical and Infectious Diseases Unit, Siena, Italy

¹⁵Unit of Respiratory Diseases and Lung Transplantation, Department of Internal and Specialist Medicine, University of Siena, Siena, Italy

¹⁶Department of Emergency and Urgency, Medicine, Surgery and Neurosciences, Unit of Intensive Care Medicine, Siena University Hospital, Siena, Italy

¹⁷Department of Medical, Surgical and Neuro Sciences and Radiological Sciences, Unit of Diagnostic Imaging University, Siena, Italy

¹⁸Rheumatology Unit, Department of Medicine, Surgery and Neurosciences, University of Siena, Policlinico Le Scotte, Siena, Italy

¹⁹Department of Specialized and Internal Medicine, Infectious Diseases Unit, San Donato Hospital, Arezzo, Italy

²⁰Department of Emergency, Anesthesia Unit, San Donato Hospital, Arezzo, Italy

²¹Department of Specialized and Internal Medicine, Pneumology Unit and UTIP, San Donato Hospital, Arezzo, Italy

²²Department of Emergency, Anesthesia Unit, Misericordia Hospital, Grosseto, Italy

²³Department of Specialized and Internal Medicine, Infectious Diseases Unit, Misericordia Hospital, Grosseto, Italy

²⁴Clinical Chemical Analysis Laboratory, Misericordia Hospital, Grosseto, Italy

²⁵Department of Preventive Medicine, Azienda USL Toscana Sud Est, Arezzo, Italy

²⁶Territorial Scientific Technician Department, Azienda USL Toscana Sud Est, Arezzo, Italy

²⁷Clinical Chemical Analysis Laboratory, San Donato Hospital, Arezzo, Italy

²⁸Chirurgia Vascolare, Ospedale Maggiore di Crema, Crema, Italy

²⁹Department of Health Sciences, Clinic of Infectious Diseases, ASST Santi Paolo e Carlo, University of Milan, Milan, Italy

³⁰Division of Infectious Diseases and Immunology, Fondazione IRCCS Policlinico San Matteo, Pavia, Italy

³¹Department of Internal Medicine and Therapeutics, University of Pavia, Pavia, Italy

³²Department of Anesthesia and Intensive Care, University of Modena and Reggio Emilia, Modena, Italy

³³Department of Medical and Surgical Sciences for Children and Adults, University of Modena and Reggio Emilia, Modena, Italy

³⁴HIV/AIDS Department, National Institute for Infectious Diseases, IRCCS, Lazzaro Spallanzani, Rome, Italy

³⁵III Infectious Diseases Unit, ASST-FBF-Sacco, Milan, Italy

³⁶Department of Biomedical and Clinical Sciences Luigi Sacco, University of Milan, Milan, Italy

³⁷Infectious Diseases Clinic, Department of Medicine 2, Azienda Ospedaliera di Perugia and University of Perugia, Santa Maria Hospital, Perugia, Italy

³⁸Infectious Diseases Clinic, “Santa Maria” Hospital, University of Perugia, Perugia, Italy

³⁹Department of Infectious Diseases, Treviso Hospital, Local Health Unit 2 Marca Trevigiana, Treviso, Italy

⁴⁰Clinical Infectious Diseases, Mestre Hospital, Venezia, Italy

⁴¹Infectious Diseases Clinic, ULSS1, Belluno, Italy

⁴²Department of Molecular Medicine, University of Padova, Padova, Italy

⁴³Department of Infectious and Tropical Diseases, University of Brescia and ASST Spedali Civili Hospital, Brescia, Italy

⁴⁴Department of Molecular and Translational Medicine, University of Brescia, Italy; Clinical Chemistry Laboratory, Cytogenetics and Molecular Genetics Section, Diagnostic Department, ASST Spedali Civili di Brescia, Brescia, Italy

⁴⁵Medical Genetics and Laboratory of Medical Genetics Unit, A.O.R.N. “Antonio Cardarelli”, Naples, Italy

⁴⁶Department of Molecular Medicine and Medical Biotechnology, University of Naples Federico II, Naples, Italy

⁴⁷CEINGE Biotecnologie Avanzate, Naples, Italy

⁴⁸IRCCS SDN, Naples, Italy

⁴⁹Unit of Respiratory Physiopathology, AORN dei Colli, Monaldi Hospital, Naples, Italy

⁵⁰Division of Medical Genetics, Fondazione IRCCS Casa Sollievo della Sofferenza Hospital, San Giovanni Rotondo, Italy

⁵¹Department of Medical Sciences, Fondazione IRCCS Casa Sollievo della Sofferenza Hospital, San Giovanni Rotondo, Italy

⁵²Clinical Trial Office, Fondazione IRCCS Casa Sollievo della Sofferenza Hospital, San Giovanni Rotondo, Italy

⁵³Department of Health Sciences, University of Genova, Genova, Italy

⁵⁴Infectious Diseases Clinic, Policlinico San Martino Hospital, IRCCS for Cancer Research Genova, Genova, Italy

⁵⁵Microbiology, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Catholic University of Medicine, Rome, Italy

⁵⁶Department of Laboratory Sciences and Infectious Diseases, Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy

⁵⁷Independent Scientist, Milan, Italy

⁵⁸Department of Cardiovascular Diseases, University of Siena, Siena, Italy

⁵⁹Otolaryngology Unit, University of Siena, Siena, Italy

⁶⁰Department of Internal Medicine, ASST Valtellina e Alto Lario, Sondrio, Italy

⁶¹Study Coordinator Oncologia Medica e Ufficio Flussi Sondrio, Sondrio, Italy

⁶²Department of Infectious and Tropical Diseases, University of Padova, Padova, Italy

⁶³First Aid Department, Luigi Curto Hospital, Polla, Salerno Italy

⁶⁴Local Health Unit-Pharmaceutical Department of Grosseto, Toscana Sud Est Local Health Unit, Grosseto, Italy

⁶⁵U.O.C. Laboratorio di Genetica Umana, IRCCS Istituto G. Gaslini, Genova, Italy

⁶⁶Infectious Diseases Clinics, University of Modena and Reggio Emilia, Modena, Italy

⁶⁷Department of Respiratory Diseases, Azienda Ospedaliera di Cremona, Cremona, Italy

⁶⁸U.O.C. Medicina, ASST Nord Milano, Ospedale Bassini, Cinisello Balsamo, MI Italy

✉Corresponding author.

#Contributed equally.

PMCID: PMC7811682 PMID: 33456056

Abstract

Within the GEN-COVID Multicenter Study, biospecimens from more than 1000 SARS-CoV-2-positive individuals have thus far been collected in the GEN-COVID Biobank (GCB). Sample types include whole blood, plasma, serum, leukocytes, and DNA. The GCB links samples to detailed clinical data available in the GEN-COVID Patient Registry (GCPR). It includes hospitalized patients (74.25%), broken down into intubated, treated by CPAP-biPAP, treated with O₂ supplementation, and without respiratory support (9.5%, 18.4%, 31.55%, and 14.8%, respectively); and non-hospitalized subjects (25.75%), either pauci- or asymptomatic. More than 150 clinical patient-level data fields have been collected and binarized for further statistics according to the organs/systems primarily affected by COVID-19: heart, liver, pancreas, kidney, chemosensors, innate or adaptive immunity, and clotting system. Hierarchical clustering analysis identified five main clinical categories: (1) severe multisystemic failure with either thromboembolic or pancreatic variant; (2) cytokine storm type, either severe with liver involvement or moderate; (3) moderate heart type, either with or without liver damage; (4) moderate multisystemic involvement, either with or without liver damage; (5) mild, either with or without hyposmia. GCB and GCPR are further linked to the GEN-COVID Genetic Data Repository (GCGDR), which includes data from whole-exome sequencing and high-density SNP genotyping. The data are available for sharing through the Network for Italian Genomes, in its COVID-19 dedicated section. The study objective is to systematize this comprehensive data collection and begin identifying multi-organ involvement in COVID-19, defining genetic parameters for infection susceptibility within the population, and genetically mapping COVID-19 severity and clinical complexity among patients.

Subject terms: Genetics research, Viral infection

Introduction

The GEN-COVID Multicenter Study was designed to collect and systematize biological samples and clinical data across multiple hospitals and healthcare facilities in Italy with the purpose of deriving patient-level phenotypic and genotypic data, and the specific intention to make samples and data available to COVID-19 researchers globally. To reach these aims, the project collected and organized high-quality samples and data whose integrity was assured and which could be readily accessed and processed for COVID-19 research using existing interoperability standards and tools. To this end, a GEN-COVID Biobank (GCB) and a GEN-COVID Patient Registry (GCPR) were established using already existing biobanking and patient registry infrastructure. The collection of samples and data is now used in the GEN-COVID Multicenter Study to generate genotyping (GWAS) and whole-exome sequencing (WES) results. This study also works collaboratively with other genomic studies on COVID-19. The data resulting from these studies are then stored and made available through the GEN-COVID Genetic Data Repository (GCGDR). All samples and data have also been systematized in accordance with the FAIR (findability, accessibility, interoperability, and reusability) data principles [1] to promote their international availability and use for COVID-19 research.

The outbreak of coronavirus disease 2019 (COVID-19), the severe acute respiratory syndrome caused by the coronavirus SARS-CoV-2, which first appeared in December 2019 in Wuhan, Hubei Province, China, resulted in millions of cases worldwide within a few months and rapidly evolved into a pandemic [2]. The COVID-19 pandemic represents an enormous challenge to the world’s healthcare systems. Among the European countries, Italy was the first to experience the epidemic wave of SARS-CoV-2 infection, accompanied by a severe clinical picture and a mortality rate reaching 14%. In Italy, as of July 16, 2020, there were 243,506 confirmed COVID-19 cases and 34,997 related deaths reported [3].

The disease is characterized by a highly heterogeneous phenotypic response to SARS-CoV-2 infection, with the large majority of infected individuals having only mild or even no symptoms. However, the severe cases can rapidly evolve toward a critical respiratory distress syndrome and multiple organ failure. The symptoms of COVID-19 range from fever, cough, sore throat, congestion, and fatigue to shortness of breath, hemoptysis, pneumonia followed by respiratory disorders, and septic shock [4].

The overburdened healthcare infrastructure and the working conditions within healthcare centers are tremendously challenging. Direct patient care is given the highest priority. Focus is concentrated on monitoring infection evolution in terms of the number of new cases and the number of deaths. Disease severity is also an important parameter that is being continually evaluated, with a current focus on patients experiencing serious pulmonary disease and other life-threatening conditions. Although patient care is the first priority, in the public health emergency situation brought on by the COVID-19 pandemic, it is also of the utmost importance to collect, process, and share with rapidity and confidence human biological materials, clinical data, and study outcomes. The best suited tool to address this need and accelerate research on COVID-19 is an accessible, high-quality biobank with associated clinical data and the necessary tools to guarantee interoperability with other biobanks and databanks.

This paper addresses the main aim of the project: the collection and systematization of human biological materials, clinical data stored in a patient registry, and derived patient-level genetic data. It describes the methods for sample and data collection and the systematization of the samples and data for research purposes. As COVID-19 increasingly reveals itself as a multisystemic disease, the purpose of this data collection is to include the most relevant clinical variables that identify multi-organ involvement, as well as to identify the genetic determinants of virus–host interaction, so as to disclose holistically the effect of COVID-19 on several physiological subsystems. The samples and the complete datasets are then used within the GEN-COVID Multicenter Study for identifying multi-organ involvement in COVID-19, defining genetic parameters for infection susceptibility within the population, and genetically mapping COVID-19 severity and clinical complexity among patients. Going forward, the main challenge will be to define the genetic parameters for infection susceptibility within specific populations in order to map and genetically identify COVID-19 severity and clinical complexity within and across patient groups.

Methods

Study design

The purpose of the GEN-COVID Multicenter Study is to make the best use of the widest possible sets of patient data and genetic material in order to identify potential links between patient genetic variation and clinical variability, patient presentation, and disease severity. By exposing the potential links between genetic variability and disease variability, we believe the study can contribute to improved patient-level diagnostics, prognosis, and personalized treatment of COVID-19. To achieve this overall aim, the following specific objectives are being pursued: (1) to perform sequencing (WES) on DNA of 2000 COVID-19 patient samples [performed by the University of Siena (UNISI)]; (2) to perform genotyping (GWAS) on DNA of 2000 COVID-19 patients [performed by the Institute for Molecular Medicine of Finland (FIMM)]; (3) to associate the host genetic data obtained on 2000 COVID-19 patients with severity and prognosis; (4) to share phenotypic data and samples across the GEN-COVID consortium platform as well as in cooperation with research institutions and national platforms through the GEN-COVID Disease Registry and Biobank; (5) to share genetic data through the Network for Italian Genomes (NIG, http://www.nig.cineca.it/, NIG database, http://nigdb.cineca.it) at CINECA, the largest Italian computing center.

Planned key deliverables of the project are (1) to develop a state-of-the-art Patient Registry and Biobank for COVID-19 clinical research with access for academic and industry partners; (2) to understand the genetic and molecular basis of susceptibility to SARS-CoV-2 infection and of susceptibility to a potentially more severe clinical outcome (prognosis) within 12 months; and (3) to understand the genetic profile of patients. The overall aim is to contribute to the rapid identification of medicines to be repurposed for personalized therapeutic approaches that demonstrate greater efficacy against the COVID-19 virus. As the initial starting point of this process, the ACE2 gene has already been extensively investigated in the Italian population [5].

The GEN-COVID Multicenter Study includes a network of 22 Italian hospitals, 13 from Northern Italy, 5 from Central Italy, and 4 from Southern Italy. It also includes local healthcare units and departments of preventive medicine (https://sites.google.com/dbm.unisi.it/gen-covid). The network continues to grow as more hospitals and healthcare centers express an interest in contributing samples and data. It started its activity on March 16, 2020, following approval by the Ethical Review Board of the Promoter Center, University of Siena (Protocol n. 16929, approval dated March 16, 2020). Written informed consent was obtained from all individuals who contributed samples and data. Detailed clinical and laboratory characteristics (data), specifically related to COVID-19, were collected for all subjects.

Study participants and recruitment

To ensure a collection that is as comprehensive and representative of the Italian population as possible, hospitals from across Italy, local healthcare units, and departments of preventive medicine have been involved in collecting samples and associated patient-level data for the GEN-COVID Multicenter Study. The inclusion criteria for the study are PCR-positive SARS-CoV-2 infection, age ≥18 years, and appropriately given informed consent that includes detailed information about the study, maintaining the confidentiality of personal data. In addition to sample collection, an extensive questionnaire is used to assess disease severity and collect basic demographic information from each patient (Supplementary Table 1).

As of July 16, 2020, we have collected samples and data from 1033 individuals (1021 without family ties and 12 with family relations). All tested positive for SARS-CoV-2 and represent a wide range of disease severity, ranging from hospitalized patients with severe COVID-19 to asymptomatic individuals. Infection status was confirmed by a SARS-CoV-2 viral RNA polymerase chain reaction (PCR) test performed on, at a minimum, nasopharyngeal swabs. Recruitment remains ongoing, with the goal of including samples and data from 2000 individuals by the end of September 2020; so far, recruitment has averaged 200 patients per week.

Data collection and storage

The GEN-COVID registry was designed to guarantee data accuracy and, at the same time, to ensure ease of data entry in order to facilitate compliance and save clinicians’ time. The highest data integrity and data privacy standards, with reference to the EU General Data Protection Regulation (GDPR) [6], were also built into the training for personnel. Samples and data were collected and systematized in order to meet the FAIR Data Principles requirements.

The socio-demographic information included sex, age, and ethnicity. Information about family history, (pre-existing) chronic conditions, and SARS-CoV-2-related symptoms was collected through a detailed core clinical questionnaire as previously reported [7]. These clinical data were continually updated as new information appeared regarding COVID-19 (Supplementary Table 1). More than 150 clinical items have been collected and synthesized in a binary mode for each involved organ/system: heart, liver, pancreas, kidney, and olfactory/gustatory and lymphoid systems. The collection and organizing methodologies allowed for rapid statistical analysis. Data were handled and stored in accordance with the EU GDPR [6].

Peripheral blood samples in ethylenediamine tetraacetic acid-containing tubes were collected for all subjects. Genomic DNA was centrally isolated from peripheral blood samples using the MagCore® Genomic DNA Whole Blood Kit (Diatech Pharmacogenetics, Jesi, Italy) according to the manufacturer’s protocol at the Promoter Center. For all subjects, aliquots of plasma and serum are also available. Whenever possible, leukocytes were isolated from whole blood by density gradient centrifugation, stored in dimethyl sulfoxide solution, and frozen using liquid nitrogen. For the majority of the cohort, swab specimens are also available and stored at the reference hospitals.

Genetic data from GWAS and WES were generated for all patients. The generation of such a massive amount of sequencing data required sufficient computing resources able to store and analyze large quantities of data. For this purpose, GEN-COVID took advantage of the University of Siena’s participation in the Network for Italian Genomes (NIG, http://www.nig.cineca.it/, NIG database, http://nigdb.cineca.it/), which collects genome sequencing data from the Italian population. NIG has a specific agreement with CINECA, the largest computing center in Italy and one of the largest in Europe, for the use of the CINECA facility for the storage and analysis of data. Data upload followed quality and regulatory requirements already in place to ensure adequate uniformity and homogeneity levels. Data were formatted to meet the requirements of the FAIR Data Principles and thus made interoperable with other FAIR omics data and reference databases.

Collected laboratory and instrumental data

A continuous quantitative respiratory score, the PaO₂/FiO₂ ratio [partial pressure of oxygen/fraction of inspired oxygen (P/F)], was assigned to each patient as an indicator of respiratory involvement. Taking the normal value of >300 as the threshold, we defined four grades of severity for the PaO₂/FiO₂ ratio: P/F less than or equal to 100, between 101 and 200, between 201 and 300, and greater than 300. A P/F value is not available for the non-hospitalized subjects because the test is only performed in hospitalized patients when needed. Heart involvement was considered on the basis of one or more of the following abnormal data: a cardiac Troponin T (cTnT) value higher than the reference range (<15 ng/L) (indicative of ischemic disorder), an increase in the N-terminal pro-hormone BNP (NT-proBNP) value (reference value <88 pg/mL for males and <153 pg/mL for females) (indicative of heart failure), and the presence of arrhythmias (indicative of electric disorder). Hepatic involvement was defined on the basis of a clear elevation of liver enzymes, namely glutamate-pyruvate transaminase (GPT) and glutamate-oxaloacetate transaminase (GOT) higher than the sex-specific reference value (for GPT, <41 IU/L in males and <31 IU/L in females; for GOT, <37 IU/L in males and <31 IU/L in females). Pancreatic involvement was considered on the basis of pancreatic enzymes, namely pancreatic amylase (PA) and lipase (PL), higher or lower than their specific reference range (13–53 IU/L for PA and 13–60 IU/L for PL). Kidney involvement was defined by a creatinine value higher than the sex-specific reference range (0.7–1.20 mg/dL in males and 0.5–1.10 mg/dL in females). Lymphoid system involvement was defined as natural killer (NK) cell and/or peripheral CD4⁺ T cell counts below the reference values (NK cells >90 cells/µL (mm³); CD4⁺ T cells >400 cells/µL (mm³)).

For each patient, a numerical grading of olfactory and gustatory dysfunction was defined through a clinical questionnaire administered by ENT specialists. D-dimer values of >10×, with or without a low fibrinogen level, were used to define involvement of the blood clotting system. Interleukin 6 (IL6), lactate dehydrogenase (LDH), and C-reactive protein (CRP) values above the reference range (<0.5 mg/dL for CRP; 135–225 IU/L in males and 135–214 IU/L in females for LDH) were used to determine involvement of the pro-inflammatory cytokine system.
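The binarization rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the registry's actual code: the function names, the 0 to 3 numbering of the P/F bands, the treatment of missing values as "not involved", and the choice that a single elevated liver enzyme is sufficient are all hypothetical. Only the numeric thresholds come from the text.

```python
from typing import Optional

# Upper reference limits quoted in the text, keyed by sex ("M" or "F").
GPT_LIMIT = {"M": 41.0, "F": 31.0}         # IU/L, reference is "< limit"
GOT_LIMIT = {"M": 37.0, "F": 31.0}         # IU/L, reference is "< limit"
NT_PROBNP_LIMIT = {"M": 88.0, "F": 153.0}  # pg/mL, reference is "< limit"
CREATININE_UPPER = {"M": 1.20, "F": 1.10}  # mg/dL, upper end of the range
CTNT_LIMIT = 15.0                          # ng/L, reference is "< limit"


def pf_grade(pf_ratio: Optional[float]) -> Optional[int]:
    """Map a PaO2/FiO2 ratio to one of the four bands given in the text.

    The labels 0 (>300) to 3 (<=100) are an assumption of this sketch.
    """
    if pf_ratio is None:
        return None  # no P/F value, e.g. non-hospitalized subjects
    if pf_ratio > 300:
        return 0
    if pf_ratio > 200:
        return 1
    if pf_ratio > 100:
        return 2
    return 3


def _at_or_above(value: Optional[float], limit: float) -> bool:
    """True when a value is outside a '< limit' reference.

    Assumption: a missing value is treated as not abnormal.
    """
    return value is not None and value >= limit


def liver_involved(sex: str, gpt: Optional[float] = None,
                   got: Optional[float] = None) -> bool:
    # Assumption: either enzyme at or above its limit is enough.
    return _at_or_above(gpt, GPT_LIMIT[sex]) or _at_or_above(got, GOT_LIMIT[sex])


def heart_involved(sex: str, ctnt: Optional[float] = None,
                   nt_probnp: Optional[float] = None,
                   arrhythmia: bool = False) -> bool:
    # "One or more" abnormal findings, as stated in the text.
    return (_at_or_above(ctnt, CTNT_LIMIT)
            or _at_or_above(nt_probnp, NT_PROBNP_LIMIT[sex])
            or arrhythmia)


def kidney_involved(sex: str, creatinine: Optional[float] = None) -> bool:
    # Creatinine strictly above the upper end of the sex-specific range.
    return creatinine is not None and creatinine > CREATININE_UPPER[sex]
```

For example, `liver_involved("F", gpt=35.0, got=20.0)` evaluates to `True` because 35 IU/L exceeds the female GPT limit, whereas the same values for a male patient do not.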

Whole-exome sequencing

WES with at least 97% coverage at 20× was performed using the Illumina NovaSeq6000 System (Illumina, San Diego, CA, USA).

Sample preparation was performed following the Nextera Flex for Enrichment manufacturer protocol. The workflow uses a bead-based transposome complex to tagment genomic DNA, a process that fragments DNA and then tags it with adapter sequences in one step. After it is saturated with input DNA, the bead-based transposome complex fragments a set number of DNA molecules. This fragmentation provides flexibility to use a wide DNA input range to generate normalized libraries with a consistent, tight fragment size distribution. Following tagmentation, a limited-cycle PCR adds adapter sequences to the ends of each DNA fragment. A subsequent target enrichment workflow is then applied. Following pooling, the double-stranded DNA libraries are denatured and biotinylated Illumina CEX Panel probes are hybridized to the denatured library fragments. After hybridization, Streptavidin Magnetic Beads capture the targeted library fragments within the regions of interest. The captured and indexed libraries are eluted from the beads and further amplified before sequencing. The WES analysis was performed on the Illumina NovaSeq6000 System (Illumina, San Diego, CA, USA) according to the NovaSeq6000 System Guide.
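The coverage criterion quoted for WES can be illustrated with a short Python sketch. It assumes that "at least 97% coverage at 20×" means that at least 97% of targeted positions have a read depth of 20 or more; the function names and the list-of-depths input are hypothetical and do not reflect the pipeline actually used.

```python
from typing import Sequence


def breadth_of_coverage(depths: Sequence[int], min_depth: int = 20) -> float:
    """Fraction of target positions with read depth >= min_depth."""
    if not depths:
        raise ValueError("at least one target position is required")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return covered / len(depths)


def passes_coverage_check(depths: Sequence[int], min_depth: int = 20,
                          min_breadth: float = 0.97) -> bool:
    """True when the sample meets the assumed 97%-at-20x criterion."""
    return breadth_of_coverage(depths, min_depth) >= min_breadth
```

A sample with 97 of 100 target positions at depth 25 and the remaining three at depth 5 has a breadth of 0.97 and would pass this check.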

Genotyping

Genotyping data on 700,000 genetic markers were obtained on genomic DNA using the Illumina Global Screening Array (Illumina) according to the manufacturer’s protocol. The Genome Reference Consortium Human Build 38 (GRCh38) was used. Quality checks (SNP calling quality, cluster separation, and Mendelian and replication error) were done using GenomeStudio analysis software (Illumina). The software package Plink v1.90 [8] was used to process the 700k SNP-genotyping data and to calculate SNP genotype statistics.
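As a rough illustration of what per-SNP genotype statistics involve, the Python sketch below computes call rate and minor allele frequency for one marker. It is not Plink and does not reproduce its output format; the 0/1/2 genotype coding (copies of the alternate allele, with `None` for a missing call) and the function name are assumptions made for this example.

```python
from typing import Dict, Optional, Sequence


def snp_statistics(genotypes: Sequence[Optional[int]]) -> Dict[str, Optional[float]]:
    """Call rate, alternate allele frequency, and minor allele frequency.

    Each genotype is 0, 1, or 2 copies of the alternate allele,
    or None when the genotype call is missing.
    """
    if not genotypes:
        raise ValueError("at least one genotype is required")
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    if not called:
        return {"call_rate": 0.0, "alt_freq": None, "maf": None}
    # Each called diploid genotype contributes two alleles.
    alt_freq = sum(called) / (2 * len(called))
    return {
        "call_rate": call_rate,
        "alt_freq": alt_freq,
        "maf": min(alt_freq, 1.0 - alt_freq),
    }
```

For the genotypes `[0, 0, 1, 0]`, one of eight alleles is the alternate allele, so the minor allele frequency is 0.125 and the call rate is 1.0.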

Statistical analysis

Descriptive statistics were calculated to determine the distribution of clinical features by sex, age, and ethnicity. Chi-square tests were used to evaluate the statistical association between the clinical severity of the disease (from no hospitalization to intubation) and the categorical clinical variables: gender, ethnicity, blood group, respiratory severity, taste/smell involvement, heart involvement, liver involvement, pancreas involvement, kidney involvement, lymphoid involvement, cytokines trigger, D-dimer, and number of comorbidities. A linear regression model was used to test the statistical association between COVID-19 severity and age.
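The chi-square test of association mentioned above compares observed counts in a contingency table with the counts expected under independence. The sketch below computes Pearson's statistic and its degrees of freedom in plain Python; it is an illustration rather than the R code used in the study, it applies no continuity correction, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def chi_square_statistic(table: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom.

    `table` holds observed counts, with one row per level of the first
    variable (e.g. clinical severity group) and one column per level of
    the second (e.g. presence or absence of an organ involvement).
    """
    row_totals: List[float] = [sum(row) for row in table]
    col_totals: List[float] = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("every row and column needs at least one observation")
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof
```

With the made-up table `[[10, 20], [20, 10]]`, every expected count is 15, so the statistic is 4 × 25/15 ≈ 6.67 with one degree of freedom. A p-value would then be read from the chi-square distribution, which is left out here to keep the sketch dependency-free.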

The variability within clinical features and their relative relationships have been summarized and described by principal component analysis (PCA). Only numerical variables with a missing rate lower than 50% were selected; these included: hyposmia, neutrophils, CRP, fibrinogen, LDH, D-dimer, and number of comorbidities. Missing values were imputed as the most common value among the k nearest neighbors (k = 5) [9], with neighbors defined by Gower distances [10]. After imputation, variables were centered and scaled prior to PCA. Descriptive statistics, chi-square tests, linear regression, and PCA were performed with the R environment for statistical computing [11].
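The preprocessing before PCA (nearest-neighbor imputation under Gower distance, then centering and scaling) can be sketched as follows. This is a plain-Python illustration for numerical variables only, not the R implementation cited in the text; the function names are hypothetical, and handling of ties and of rows with no shared observed variables is an assumption of this sketch.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def column_ranges(rows: Sequence[Row]) -> List[float]:
    """Observed range (max - min) of each column, ignoring missing values."""
    ranges = []
    for column in zip(*rows):
        observed = [v for v in column if v is not None]
        ranges.append(max(observed) - min(observed) if observed else 0.0)
    return ranges


def gower_distance(a: Row, b: Row, ranges: Sequence[float]) -> float:
    """Mean range-scaled absolute difference over variables observed in both rows."""
    total, shared = 0.0, 0
    for x, y, value_range in zip(a, b, ranges):
        if x is None or y is None:
            continue
        shared += 1
        if value_range > 0:
            total += abs(x - y) / value_range
    return total / shared if shared else math.inf


def knn_impute(rows: Sequence[Row], k: int = 5) -> List[List[Optional[float]]]:
    """Fill each missing cell with the most common value among the k nearest donors."""
    ranges = column_ranges(rows)
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows that have this variable observed.
            donors = [(gower_distance(row, other, ranges), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda pair: pair[0])
            nearest = [donor_value for _, donor_value in donors[:k]]
            if nearest:
                imputed[i][j] = Counter(nearest).most_common(1)[0][0]
    return imputed


def standardize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Center each column to mean 0 and scale it to unit sample standard deviation."""
    columns = []
    for column in zip(*rows):
        mean = sum(column) / len(column)
        variance = sum((v - mean) ** 2 for v in column) / (len(column) - 1)
        sd = math.sqrt(variance)
        columns.append([(v - mean) / sd if sd > 0 else 0.0 for v in column])
    return [list(row) for row in zip(*columns)]
```

The output of `standardize(knn_impute(rows))` is the kind of complete, scaled matrix on which a PCA routine would then be run.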

A descriptive analysis of the phenotypes was performed using a hierarchically clustered heatmap. In particular, both patients and phenotypes are clustered with agglomerative hierarchical clustering, where the chosen metric is the Hamming distance and the linkage criterion is the “average” one (unweighted pair group method with arithmetic mean, UPGMA). The corresponding dendrograms of the clustering are reported in the upper and left parts of the heatmap. The information on the severity grading of the patients is then added a posteriori in the left strip. The resulting plot is obtained with the Python Seaborn package.
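To make the clustering step concrete, the sketch below implements agglomerative clustering with Hamming distance and average linkage (UPGMA) in plain Python. It is a teaching example under stated assumptions, not the code used to produce the figure: the function names are hypothetical, ties are broken by cluster creation order, and the naive search is only suitable for small inputs.

```python
from typing import Dict, List, Sequence, Tuple

Merge = Tuple[List[int], List[int], float]


def hamming(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions at which two binary phenotype vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(vectors: Sequence[Sequence[int]]) -> List[Merge]:
    """Return the merge history as (members_a, members_b, distance) tuples."""
    n = len(vectors)
    # Pairwise Hamming distances between the original items.
    pairwise: Dict[Tuple[int, int], float] = {
        (i, j): hamming(vectors[i], vectors[j])
        for i in range(n) for j in range(i + 1, n)
    }

    def average_linkage(c1: List[int], c2: List[int]) -> float:
        # UPGMA: unweighted mean of all between-cluster item distances.
        total = sum(pairwise[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters: Dict[int, List[int]] = {i: [i] for i in range(n)}
    merges: List[Merge] = []
    next_id = n
    while len(clusters) > 1:
        ids = sorted(clusters)
        # Pick the closest pair of clusters under average linkage.
        distance, a, b = min(
            (average_linkage(clusters[a], clusters[b]), a, b)
            for pos, a in enumerate(ids) for b in ids[pos + 1:]
        )
        merges.append((sorted(clusters[a]), sorted(clusters[b]), distance))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges
```

Applied to rows (patients) and then to columns (phenotypes) of the binary matrix, the merge history defines the two dendrograms. In practice a plotting call such as Seaborn's `clustermap` with `metric="hamming"` and `method="average"` performs both clusterings and draws the heatmap, although the exact arguments the authors used are not stated.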

Results

The GEN-COVID Multicenter Study, through a cooperative and carefully curated mode of sample and data collection, has employed rigorous analyses to obtain phenotypic and genotypic data that can now be used to begin to identify host genetic predispositions to COVID-19. The careful methodological approach was carried out across a large geographical area to develop a biobank (the GCB), a registry (the GCPR), and finally the resulting genetic data repository (the GCGDR). Following the timelines and milestones of the GEN-COVID Multicenter Study (see Fig. 1), the study has achieved a COVID-19 biobank, registry, and genetic data collection linked to one another, providing a high degree of confidence in sample and data integrity, and open to the world for COVID-19 research at what may still be considered an early point in this pandemic.

Fig. 1 — (A) Main milestones of the study with the timeline for the 22 Italian hospitals (P: Promoter, Policlinico Santa Maria Alle Scotte, Azienda Ospedaliera Universitaria Senese, Siena; 1: San Matteo Hospital Fondazione IRCCS, Pavia; 2: ASST Santi Paolo e Carlo, University of Milan, Italy; 3: Ospedale Maggiore di Crema, Italy; 4: ASST Valtellina e Alto Lario, Sondrio; 5: University Hospital of Modena and Reggio Emilia, Modena; 6: IRCCS, Lazzaro Spallanzani, Rome; 7: ASST-FBF-Sacco, Milan; 8: Santa Maria Hospital, Azienda Ospedaliera di Perugia, Perugia; 9: Treviso Hospital, Local Health Unit (ULSS) 2 Marca Trevigiana, Treviso; 10: Ospedale dell’Angelo, ULSS 3 Serenissima, Mestre; 11: Belluno Hospital, ULSS 1 Dolomiti, Belluno; 12: ASST Spedali Civili Hospital, Brescia; 13: Policlinico San Martino Hospital, IRCCS, Genova; 14: AORN dei Colli, Monaldi Hospital, Naples; 15: A.O.R.N. “Antonio Cardarelli”, Naples; 16: Fondazione IRCCS Casa Sollievo della Sofferenza Hospital, San Giovanni Rotondo; 17: IRCCS Istituto G. Gaslini, Genoa; 18: CEINGE Biotecnologie Avanzate, Naples; 19: San Donato Hospital, Arezzo; 20: Misericordia Hospital, Grosseto; 21: Fondazione Policlinico Universitario Agostino Gemelli IRCCS; 22: Luigi Curto Hospital, Polla (SA)). (B) Main milestones of the study with the timeline for local health units (Continuity Assistance Special Units, USCA) and departments of preventive medicine (1: USCA Chianciano; 2: USCA Sansepolcro; 3: USCA Siena; 4: USCA Orbetello; 5: USCA Arezzo; 6: Department of preventive medicine Senese, Siena; 7: Department of preventive medicine Aretino-Casentino-Valtiberina, Arezzo; 8: Department of preventive medicine Alta Val d’Elsa, Poggibonsi; 9: Department of preventive medicine Amiata Senese e Val d’Orcia - Valdichiana Senese, Montepulciano). A further 11 USCA and 4 departments of preventive medicine have obtained IRB approval and are about to start sample collection.

The GEN-COVID Biobank (GCB)

The GCB, a collection of biospecimens from patients affected by COVID-19, and the associated GCPR were established and are maintained at the University of Siena using the infrastructure of an already well-established biobank (est. 1998) (http://www.biobank.unisi.it/ScegliArchivio.asp).

The Biobank is closely linked to national and international biobanking efforts aimed at collecting high-quality samples and patient data in a uniform manner and ensuring their FAIR management. It is part of the BBMRI-IT [12], EuroBioBank [13], Telethon Network of Genetic Biobanks [14], and RD-Connect [15]. The biobank and registry are ISO-certified (certificate 199556-2016-AQ-ITA-ACCREDIA) and accredited according to SIGU (the Italian Society of Human Genetics) requirements (Certificate 204107-2016-AQ-ITA-DNV).

Collected biological samples include peripheral blood, plasma, serum, primary leukocytes, and DNA samples. Samples were stored in a dedicated biobank section, while associated clinical data were entered in the related registry. The biobank and registry were organized according to the highest scientific standards, preserving patients’ and citizens’ privacy, while providing services to the healthcare and scientific community to develop better treatments, test diagnostic tools, and advance COVID-19 and coronavirus research. Biobank personnel are responsible for sample pseudonymization, storage, and insertion in the online biobank catalog.

Geographical coverage

The GEN-COVID Multicenter Study reached a large number of subjects throughout Italy. Tuscany, the region in which the study is coordinated, currently contributes 22.8% of enrolled patients. The Northern Italian regions, particularly Lombardy and Veneto, currently contribute 52.3% of enrolled patients (Fig. 2). This distribution closely reflects the incidence of SARS-CoV-2 infection per 100,000 inhabitants for each Italian region, as updated to 4 July 2020 [3].

Fig. 2 — Comparison of GEN-COVID geographical coverage (right) and the incidence of SARS-CoV-2 infection per 100,000 inhabitants by Italian provinces (left).

The GEN-COVID Patient Registry (GCPR)

From April 7, 2020 to July 16, 2020, the GCPR collected clinical data from a total of 1033 Italian SARS-CoV-2 PCR-positive individuals. For each individual, we collected clinical information using standardized clinical schedules (Supplementary Table 1). The study protocol also provides access to patients’ medical records and continual updating of clinical data in order to secure continuity of patient follow-up.

The mean age of the entire cohort is presently 58.7 years (range 18–99). The cohort is predominantly male (57.1%); the mean age of the males is 59.5 years (range 18–99) and the mean age of the females is 57.6 years (range 19–98) (Table 1). About 40.3% of the cohort has no chronic conditions. The overall case-fatality rate is 3.6% (37 deaths among 1033 cases), with a mean age of 75.2 years (range 62–91) among the deceased. Regarding ethnicity, the cohort is composed of 998 White (96.61%), 21 Hispanic (2.03%), 4 Black (0.38%), and 10 Asian (0.96%) patients (Table 1).

Table 1.

Characteristics of cohort.

No. of subjects	1033
Median age (range)^a	58.7 (18–99)
Gender no. (%)
Male	590 (57.1%)
Female	443 (42.9%)
Ethnicity no. (%)
White	998 (96.61%)
Hispanic	21 (2.03%)
Black	4 (0.38%)
Asian	10 (0.96%)
Clinical category no. (%)
Hospitalized intubated (Group 4)	98 (9.5%)
Hospitalized CPAP/BiPAP (Group 3)	190 (18.4%)
Hospitalized oxygen support (Group 2)	326 (31.55%)
Hospitalized w/o oxygen support (Group 1)	153 (14.8%)
Not hospitalized a/paucisymptomatic (Group 0)	266 (25.75%)
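The percentages quoted in the text and in Table 1 can be re-derived from the reported counts. The Python sketch below is only an illustration of that arithmetic: the helper name `percent` is hypothetical, and the counts are copied from Table 1 and from the case-fatality figure reported above.

```python
# Counts copied from Table 1; the helper name is illustrative, not from the study.
COHORT_SIZE = 1033


def percent(count, total=COHORT_SIZE, digits=1):
    """Return `count` as a percentage of `total`, rounded to `digits` decimals."""
    return round(100.0 * count / total, digits)


# Clinical severity groups (Group 4 = intubated ... Group 0 = not hospitalized).
clinical_groups = {4: 98, 3: 190, 2: 326, 1: 153, 0: 266}

# The five groups should partition the whole cohort.
groups_cover_cohort = sum(clinical_groups.values()) == COHORT_SIZE

case_fatality = percent(37)   # 37 deaths among 1033 cases
male_share = percent(590)     # 590 males among 1033 subjects

# Hospitalized subjects are Groups 1 to 4.
hospitalized = sum(n for group, n in clinical_groups.items() if group >= 1)
intubated_among_hospitalized = percent(clinical_groups[4], hospitalized)
```

Running the same helper on any other row of Table 1 gives a quick consistency check between counts and percentages.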


Subjects have been divided into five qualitative severity clinical categories depending on the need for hospitalization, the respiratory impairment and, consequently, the type of ventilation required: (1) hospitalized and intubated (9.5%); (2) hospitalized and CPAP-BiPAP and high-flow oxygen treated (18.4%); (3) hospitalized and treated with conventional oxygen support only (31.55%); (4) hospitalized without respiratory support (14.8%); (5) not hospitalized pauci/asymptomatic individuals (25.75%) (Group 4 to 0 in Table 1).
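The grading described above amounts to a small lookup from hospitalization status and type of respiratory support to a group number. The following Python sketch shows one way to encode it; the function name and the string labels for the support types are assumptions made for illustration, not identifiers used by the registry.

```python
# Hypothetical labels for the type of respiratory support recorded for a patient.
SUPPORT_TO_GROUP = {
    None: 1,          # hospitalized without respiratory support
    "oxygen": 2,      # conventional oxygen support only
    "cpap_bipap": 3,  # CPAP/BiPAP
    "high_flow": 3,   # high-flow oxygen is graded together with CPAP/BiPAP
    "intubation": 4,  # hospitalized and intubated
}


def severity_group(hospitalized, respiratory_support=None):
    """Map a patient to the severity group (0 to 4) used in Table 1."""
    if not hospitalized:
        # Not hospitalized, pauci/asymptomatic individuals.
        return 0
    if respiratory_support not in SUPPORT_TO_GROUP:
        raise ValueError(f"unknown respiratory support: {respiratory_support!r}")
    return SUPPORT_TO_GROUP[respiratory_support]
```

Keeping the mapping in a single table makes the ordinal nature of the grading explicit and easy to audit.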

The gender distribution differed significantly among the 5 groups (p value = 7.81 × 10⁻⁶). In the group with high-care intensity (Group 4), 72.4% of subjects were male, while in the group with the milder phenotype (Group 0) 59.8% of subjects were female (Table 2). Hyposmia and/or hypogeusia were present in 13.9% of cases in Group 4, 25.3% in Group 3, 31.6% in Group 2, 19.3% in Group 1, and 57.1% in Group 0. A modest but statistically significant difference among the 5 groups was found regarding the presence of comorbidities (p value = 0.012). No statistically significant difference was present for ethnicity or blood group distribution (Table 2).
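A comparison of a categorical variable across severity groups, such as the gender comparison above, can be illustrated with a Pearson chi-square test of independence. This section does not state which test produced the reported p values, so the Python sketch below is an assumption, not the authors' procedure. The male/female counts are reconstructed by multiplying the group sizes in Table 1 by the percentages in Table 2 and rounding, so they are approximate, and the resulting p value is not expected to reproduce the published one.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def chi_square_sf_even_dof(statistic, dof):
    """Upper-tail probability of a chi-square variable (closed form, even dof only)."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form requires a positive, even number of degrees of freedom")
    half = statistic / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))


# Rows: Group 4, 3, 2, 1, 0; columns: male, female.
# Counts are reconstructed (group size x percentage, rounded), not taken from the registry.
gender_by_group = [
    [71, 27],
    [136, 54],
    [195, 131],
    [82, 71],
    [107, 159],
]

statistic, dof = chi_square_independence(gender_by_group)
p_value = chi_square_sf_even_dof(statistic, dof)
```

A 5 × 2 table has four degrees of freedom, which is why the closed-form tail probability for even degrees of freedom is sufficient here; a statistics library would be the natural choice for general tables.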

Table 2.

Cohort stratification by disease severity.

Subject characteristics	Group 4	Group 3	Group 2	Group 1	Group 0	p value
Median age (range)	61.3 (29–79)	65.3 (21–91)	66.0 (21–99)	55.86 (25–93)	46.75 (19–72)	1.35 × 10⁻⁷
Gender
Male (%)	72.4%	71.5%	59.8%	53.6%	40.2%	7.81×10⁻⁶
Female (%)	27.6%	28.5%	40.2%	46.4%	59.8%
Ethnicity
White	97%	97.3%	96.7%	96.6%	99.6%
Hispanic	2%	1.1%	1.5%	2%	0.4%	0.731
Black	1%	0	0.6%	0.7%	0
Asian	0	1.6%	1.2%	0.7%	0
Blood group
A	40.6%	46.6%	43.8%	43.25%	55%
B	8.7%	12.8%	16.1%	13.5%	10%	0.209
O	50.7%	39.8%	39.2%	43.25%	30%
AB	0	0.8%	0.9%	0	5%
Comorbidities
None	21.7%	18.8%	26.3%	32.4%	72.5%
One	34.8%	30.1%	26.7%	35.1%	17.5%	0.012
More than one	42%	45.1%	45.6%	28.4%	10%
Unknown	1.5%	6%	1.4%	4.1%	0


Figure 3 shows the relationships, as obtained from PCA, between the continually updated laboratory variables. The first two principal components explain 42.4% of the variability in the data (PC1: 23.8%; PC2: 18.6%). Neutrophils, LDH, and D-dimer appear to be positively correlated with one another, whereas two pairs of variables were found to be negatively correlated: fibrinogen with CRP, and hyposmia with the number of comorbidities. The largest contributors to PC1 were LDH (24.4%), neutrophils (24%), D-dimer (23.8%), and hyposmia (15.3%); the largest contributors to PC2 were hyposmia (25.8%), fibrinogen (22.1%), CRP (19.8%), and the number of comorbidities (15.4%) (Fig. 3).
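The two kinds of percentages reported for the PCA can be made concrete with a short sketch. Under the usual definitions, the variance explained by a component is its eigenvalue divided by the sum of all eigenvalues, and the contribution of a variable to a component is its squared loading divided by the sum of squared loadings. The section does not say which software or definition was used, so the Python code below is an assumption, and the loadings in the example are invented for illustration.

```python
def explained_variance_ratio(eigenvalues):
    """Fraction of total variance carried by each principal component."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def variable_contributions(loadings):
    """Percentage contribution of each variable to one principal component.

    `loadings` maps a variable name to its coefficient in the component's
    eigenvector. Squared loadings are renormalised, so the result sums to 100.
    """
    total = sum(value ** 2 for value in loadings.values())
    return {name: 100.0 * value ** 2 / total for name, value in loadings.items()}


# Hypothetical loadings on one component; NOT the values behind Fig. 3.
example_loadings = {"LDH": 0.5, "neutrophils": 0.5, "D-dimer": 0.5, "hyposmia": -0.5}
example_contributions = variable_contributions(example_loadings)
```

Because contributions depend on squared loadings, a variable can contribute strongly to a component while pointing in the opposite direction from the others, as hyposmia does in this toy example.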

The continually updated laboratory values used in Fig. 3 can be further mined through clinical reasoning and represented as a binary clinical classification for organ/system damage (Table 3).

Table 3.

Binary clinical classification.

Organ/system	Value	Rule	Clinical Interpretation
Lung	1, 0	1 if severity grading 4–2 and 0 if severity grading 1–0	Lung disease
Heart	1, 0	1 if cTnT > reference value or NT-proBNP > gender-specific reference value or arrhythmia	Heart disease
Liver	1, 0	1 if ALT and AST > gender-specific reference value	Liver disease
Pancreas	1, 0	1 if lipase and/or pancreatic amylase > or < specific reference value	Pancreas disease (either inflammation or depletion)
Kidney	1, 0	1 if creatinine > gender-specific reference value	Kidney disease
Lymphoid system	1, 0	1 if NK cells < reference value or CD4 lymphocytes < reference value	Innate and adaptive immune deficit
Olfactory/gustatory system	1, 0	1 if hypogeusia or hyposmia	Olfactory and Gustatory deficit
Clotting system	1, 0	1 if D-dimer > 10× with or without low fibrinogen level (with high basal level)	Thromboembolism
Pro-inflammatory cytokines system	1, 0	1 if IL6 > reference value or LDH and CRP > reference value	Hyperinflammatory response


cTnT cardiac Troponin T, NT-proBNP N-terminal (NT)-pro hormone BNP, ALT Alanine transaminase, AST Aspartate transaminase, CD4 CD4+ T cells, NK Natural killer, IL6 Interleukin 6, LDH lactate dehydrogenase, CRP c-reactive protein.
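Several of the rules in Table 3 translate directly into boolean expressions. The Python sketch below encodes a subset of them as an illustration. The dictionary keys are hypothetical field names, the reference values are placeholders rather than clinical reference ranges, gender-specific references are assumed to be selected before the call, and the NT-proBNP rule is read as "greater than the reference value".

```python
def binary_organ_profile(patient, reference):
    """Return 0/1 flags for a subset of the organ/system rules in Table 3."""

    def above(key):
        # True when the patient's value exceeds the supplied reference value.
        return patient[key] > reference[key]

    return {
        "lung": int(patient["severity_group"] >= 2),
        "heart": int(above("cTnT") or above("NT_proBNP") or patient["arrhythmia"]),
        "liver": int(above("ALT") and above("AST")),
        "kidney": int(above("creatinine")),
        "olfactory_gustatory": int(patient["hyposmia"] or patient["hypogeusia"]),
        "hyperinflammation": int(above("IL6") or (above("LDH") and above("CRP"))),
    }


# Placeholder reference values, for illustration only.
reference = {"cTnT": 14.0, "NT_proBNP": 125.0, "ALT": 40.0, "AST": 40.0,
             "creatinine": 1.2, "IL6": 7.0, "LDH": 250.0, "CRP": 0.5}

# Invented patient record using the hypothetical field names above.
patient = {"severity_group": 3, "cTnT": 10.0, "NT_proBNP": 300.0, "arrhythmia": False,
           "ALT": 55.0, "AST": 30.0, "creatinine": 0.9, "hyposmia": False,
           "hypogeusia": True, "IL6": 3.0, "LDH": 400.0, "CRP": 2.0}

profile = binary_organ_profile(patient, reference)
```

The sketch does not handle missing laboratory values; in practice each rule would need an explicit policy for measurements that were never taken.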

Table 4 shows the prevalence of damage to the different organs/systems in the five clinical categories based on respiratory failure. Heart involvement was detected in 55% of subjects in Group 4, 39% of subjects in Group 3, 34.1% in Group 2, and 21.6% in Group 1. Liver involvement was present in 72.4% of cases in Group 4, 59.3% in Group 3, 46% in Group 2, and 33.7% in Group 1. A statistically significant difference among the 5 groups was found for all organs/systems except the lymphoid system.

Table 4.

Cohort systemic description.

Organ/system involvement	Group 0	Group 1	Group 2	Group 3	Group 4	p value
Heart disease	0^a	0.216	0.341	0.390	0.550	0.00016
Liver disease	0^a	0.337	0.460	0.593	0.724	2.96 × 10⁻³³
Pancreas disease	0^a	0.054	0.073	0.218	0.304	7.15 × 10⁻⁵
Kidney disease	0^a	0.121	0.244	0.278	0.434	0.0117
Innate and adaptive immune deficit	0^a	0.202	0.138	0.270	0.507	0.229
Olfactory/gustatory deficit	0.4	0.162	0.225	0.157	0.086	0.0011
Thromboembolism	0^a	0.040	0.073	0.097	0.318	4.2 × 10⁻⁷
Hyperinflammatory response	NA	0.081	0.152	0.278	0.492	2.02 × 10⁻⁵


NA not applicable.

^aAssigned on clinical ground.

Finally, the dendrogram in Fig. 4 shows how COVID-19 phenotypes can be distributed and clustered using the above reported clinical data representations. In particular, Hierarchical Clustering analysis identified five main clinical categories and several subcategories: (A) severe multisystemic, with either thromboembolic (A1) or pancreatic variant (A2); (B) cytokine storm, either moderate (B1) or severe with liver involvement (B2); (C) mild, either with (C1) or without hyposmia (C2); (D) moderate, either without (D1) or with (D2) liver damage; (E) heart type, either with (E1) or without (E2) liver damage (Fig. 4).
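To make the clustering step more tangible, the Python sketch below runs a naive agglomerative clustering on binary organ-involvement profiles. The distance (fraction of flags on which two patients differ) and the average linkage are assumptions chosen for illustration, since this section does not state which distance or linkage produced Fig. 4. The four profiles are invented toy data.

```python
def matching_distance(a, b):
    """Fraction of binary features on which two profiles differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage(profiles, n_clusters):
    """Naive agglomerative clustering; returns clusters as lists of row indices."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(profiles))]

    def cluster_distance(c1, c2):
        # Average of all pairwise distances between members of the two clusters.
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(matching_distance(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, a, b = min(
            ((cluster_distance(clusters[a], clusters[b]), a, b)
             for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda item: item[0],
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy profiles: (lung, heart, liver, olfactory/gustatory), invented for illustration.
toy_profiles = [
    (1, 1, 1, 0),
    (1, 1, 1, 1),
    (0, 0, 0, 1),
    (0, 0, 0, 0),
]
toy_clusters = average_linkage(toy_profiles, 2)
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, is what yields a dendrogram; that bookkeeping is omitted here to keep the sketch short.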

GEN-COVID Genetic Data Repository (GCGDR)

WES and genotype (GWAS) data were generated within the GCGDR. To store and analyze the massive amount of genomic data (mainly WES with coverage > 97% at 20×, and prospectively also WGS) generated from the entire cohort of samples populating the biobank, we relied on the NIG. External users can upload and analyze data using the NIG pipeline by registering and creating a specific project. A section dedicated to COVID-19 samples has been created within the NIG database (http://nigdb.cineca.it/) that provides variant frequencies as a free tool for both clinicians and researchers.

The data from WES are available either as Variant Call Format files or as binarized files, according to the different classes of variants: (1) rare variants (minor allele frequency (MAF) < 1%); (2) low-frequency variants (MAF between 1% and 5%); (3) common polymorphisms (MAF > 5%) in either homozygosity or supposed compound heterozygosity with rare or low-frequency variants. The distribution of these three classes of variants according to mutated genes in our cohort is shown in Supplementary Fig. 1.

From WES, 580,688 variants have been called: of these, 543,138 are SNPs and 37,550 are MNPs (multi-nucleotide polymorphisms). Exonic SNPs were distributed over the 22 autosomes of the human genome, plus the sex chromosomes. The average missing rate was 0.01, with a per-sample maximum value of 0.017. A total of 15,285 SNP loci had a missing rate greater than 5%. The average MAF was 0.032 (std. dev. 0.091), with a right-skewed distribution (median MAF = 0.0007). Only 1,041 SNPs were monomorphic (0.2%), but 437,246 (80.5%) had a frequency <0.01. From the genotype perspective, the average observed heterozygosity was 0.047.
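The summary statistics quoted above (missing rate, MAF, monomorphic loci, observed heterozygosity) have simple per-locus definitions. The Python sketch below spells them out for one locus, assuming genotypes are coded as the number of alternate alleles (0, 1, or 2) with `None` for a missing call. The coding scheme and function name are assumptions for illustration; the section does not describe the tool that produced the reported figures.

```python
def locus_summary(genotypes):
    """Summarise one biallelic locus.

    `genotypes` holds one entry per sample: 0, 1 or 2 alternate alleles,
    or None when the genotype call is missing.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1.0 - len(called) / len(genotypes)
    if not called:
        return {"missing_rate": missing_rate, "maf": None,
                "heterozygosity": None, "monomorphic": None}
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1.0 - alt_freq)
    return {
        "missing_rate": missing_rate,
        "maf": maf,
        # Observed heterozygosity: share of called samples that are heterozygous.
        "heterozygosity": sum(g == 1 for g in called) / len(called),
        "monomorphic": maf == 0.0,
    }


# Invented genotypes for four samples at one locus.
example_locus = locus_summary([0, 0, 1, None])
```

Cohort-level figures such as the average MAF or the number of loci with a missing rate above 5% follow by applying the function to every locus and aggregating the results.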

High-density (700k) SNP genotyping data were also generated on the same cohort and shared with international collaborations, including the COVID-19 Host Genetics Initiative (https://covid-19genehostinitiative.net/) and GoFAIR VODAN [16]. From this analysis, SNP genotypes at 730,059 loci, distributed over the entire human genome, have been obtained. The average missing rate was 0.015, with a per-sample maximum value of 0.042. A total of 11,163 SNP loci had a missing rate greater than 5%. The average MAF was 0.113 (std. dev. 0.145), with a right-skewed distribution (median MAF = 0.035). In total, 147,579 SNPs were monomorphic (20.2%). From the genotype perspective, the average observed heterozygosity was 0.155.

Discussion

The COVID-19 pandemic represents an enormous challenge for the world’s healthcare systems. The healthcare infrastructures and the working conditions are tremendously challenged in many hospitals and direct patient care has rightly been given the highest priority. The main public health focus is on monitoring infection evolution in terms of the number of new cases and the number of deaths as well as the number of patients experiencing serious pulmonary or systemic disease. To better characterize the current outbreak and facilitate prospective research to address the current and possible future epidemics/pandemics, we set up a COVID-19 biobank and patient registry where biological samples and associated clinical data from patients are collected in a standardized manner.

As expected, the majority of subjects in the group with high-care intensity (Group 4) were males (72.4%), while in the group with the mild phenotype the majority of subjects were females (59.8%). This confirms previously published data reporting a predominance of males among the patients most severely affected by COVID-19 [17]. Among the 767 SARS-CoV-2 positive hospitalized patients in the current cohort, 63% are males and 12.8% required intubation. This is in line with the distribution of the Italian population of hospitalized COVID-19 patients [3], underlining the representativeness of our cohort.

Heart involvement was detected in the majority of severe cases (Group 4), confirming again a recent report [18]. Hospitalized SARS-CoV-2 positive patients (Group 2 to 4) have multiple-organ involvement: in particular, heart, liver, pancreas, and kidney. In line with our previous data and with literature findings, this confirms that COVID-19 is a systemic disease rather than simply a lung disorder [19, 20].

Clinical data representation and interpretation

Clinical data may be represented and consequently interpreted in different ways. The simplest representation uses the raw laboratory/instrumental values. In this case, reasoning about which value has to be considered and/or at which time of clinical evolution the value needs to be measured is necessary in order to have consistency within the cohort. PCA using the WORSEN score at the time of admission has shown the expected variability, with hyposmia juxtaposed to the number of comorbidities and thus representing a marker of lower severity. The fibrinogen value is juxtaposed to inflammatory markers, such as CRP (and D-dimer and LDH), because fibrinogen is consumed during the prothrombotic state. We can conclude that such raw laboratory values are fairly good at representing the clinical variability of the cohort in classical PCA.

A more elaborate way of representing clinical data is to filter the raw laboratory/instrument values by clinical reasoning, which often requires a face-to-face meeting with organ reference specialists and direct access to the patients’ medical records. The proposed mediation of such a clinical methodology for COVID-19 is represented in Table 3 and its distribution against lung dysfunction synthesized in Table 4.

Involvement of relevant organs or systems is represented in binary and is then used for representing COVID-19 as a systemic disorder (Fig. 4). We propose this representation as one of the best, being closer to the real complexity of the disease. It should be considered for use in further data mining and correlation with genetic data. The emerging clinical categories from Hierarchical Cluster Analysis point to specific types and subtypes that are more likely to have common genetic factors.

As unmasked by our dendrogram (group A), there is indeed a growing body of evidence suggesting that, in addition to the common respiratory symptoms (fever, cough, and dyspnea), severely ill COVID-19 patients can often have symptoms of a multisystemic disorder [21]. Multiple organ failure due to diffuse microvascular damage is an important cause of death in severely affected COVID-19 patients [22]. In line with our definition of an A1 subgroup, a retrospective study on 21 deaths after SARS-CoV-2 infection recently reported that 71% of the patients who died had disseminated intravascular coagulation (DIC), while the incidence of DIC in surviving patients was 0.6% [23]. These data suggest that DIC is an important risk factor for increased in-hospital mortality and that special attention should be paid to its early diagnosis and treatment.

While a debate still exists about the significance of pancreatic enzyme elevations during COVID-19 infection and the capability of the SARS-CoV-2 virus to induce pancreatic injury due to cytotoxic effects [24, 25], it is worth noting that among patients with multisystemic involvement we observe a subclass of individuals (group A2) with pancreatic damage, likely suggesting a secondary effect of SARS-CoV-2 infection on a subgroup of genetically predisposed individuals. The inflammatory cytokine “storm” has been reported as playing a key role in the severe immune injury to the lungs caused by T-cell overactivation (group B) [26]. While some investigators have suggested a potential mechanism of myocardial injury due to a COVID-19-induced cytokine storm that is mediated by a mixed T helper cell response in combination with hypoxia [27], our findings indicate rather a distinct class of patients (group E) presenting with heart involvement in the absence of an inflammatory cascade. This would tend to support the hypothesis that SARS-CoV-2 may directly damage myocardial tissue and induce a major cardiovascular event. Thus, as currently recommended, our research reinforces the need to monitor plasma cTnT and NT-proBNP levels in COVID-19 patients. In line with current evidence [28, 29], although liver injury seems to occur more frequently among critically ill patients with COVID-19 (group B), it can also be present in non-critically ill patients (groups D and E) and, as suggested, it could be mostly related to prolonged hospitalization and viral shedding duration. This allows a clinical subclass to be defined for each group according to liver involvement.

A recent extensive review determined the prevalence of chemosensory deficits based on pooling together 42 studies reporting on 23,353 patients [30]. Estimated random prevalence was 38.5% for olfactory dysfunction, 30.4% for taste dysfunction, and 50.2% for overall chemosensory dysfunction. No correlation with age was detected, but anosmia/hypogeusia decreased with disease severity and ethnicity turned out to play a significant role with Caucasians having a three to six times higher prevalence of chemosensory deficits than East Asians. In accordance with evidence found in the literature, hyposmia was mostly represented among patients in group C with mild clinical symptoms [31].

Genetic data representation and interpretation

Similar to the clinical data, large aggregates of genetic data derived from WES may be represented, and consequently interpreted, in different ways. After variant calling, it is possible to use data as such, or variants can be prioritized and filtered according to standard bioinformatics procedures [32], such as damaging effect predictions, healthy population allele frequency, and gene constraints to variation.

Alternatively, it is also possible to represent data in a binary mode as follows: (1) select missense, splicing, and loss of function variants below 1% (rare variants); (2) select missense, splicing, and loss of function variants between 1 and 5% (low frequency variants); (3) select missense, splicing, and loss of function variants above 5% (common polymorphisms) in either homozygosity or supposed compound heterozygosity with rare or low frequency variants. The majority of patients showed about 3% of mutated genes in the above class (1), 5% in class (2), and 28% in class (3) variants (Supplementary Fig. 1A). No patients showed variants in more than 8000 genes (Supplementary Fig. 1B).
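The binary representation described above can be sketched as a function that turns one patient's variant calls into three 0/1 vectors, one per frequency class, indexed by gene. In the Python sketch below, the dictionary keys, the gene names, and the treatment of a MAF of exactly 1% or 5% are assumptions made for illustration; the input is assumed to be already restricted to missense, splicing, and loss-of-function variants.

```python
def frequency_class(maf, rare_cutoff=0.01, common_cutoff=0.05):
    """Assign a variant to one of the three frequency classes by its MAF."""
    if maf < rare_cutoff:
        return "rare"
    if maf < common_cutoff:
        return "low_frequency"
    return "common"  # assumption: a MAF of exactly 5% is counted as common


def binarize_patient(variants, genes):
    """Build one 0/1 vector per class for a single patient.

    `variants` is a list of dicts with hypothetical keys "gene", "maf" and
    "zygosity" ("het" or "hom"); `genes` fixes the order of the output vectors.
    """
    by_gene = {}
    for variant in variants:
        by_gene.setdefault(variant["gene"], []).append(variant)

    features = {"rare": [], "low_frequency": [], "common": []}
    for gene in genes:
        calls = [(frequency_class(v["maf"]), v["zygosity"]) for v in by_gene.get(gene, [])]
        has_rare = any(cls == "rare" for cls, _ in calls)
        has_low = any(cls == "low_frequency" for cls, _ in calls)
        common_hom = any(cls == "common" and zyg == "hom" for cls, zyg in calls)
        common_het = any(cls == "common" and zyg == "het" for cls, zyg in calls)
        features["rare"].append(int(has_rare))
        features["low_frequency"].append(int(has_low))
        # Class 3: a homozygous common variant, or a heterozygous one paired with
        # a rarer variant in the same gene (supposed compound heterozygosity).
        features["common"].append(int(common_hom or (common_het and (has_rare or has_low))))
    return features


# Invented calls on placeholder genes.
genes = ["GENE_A", "GENE_B", "GENE_C"]
variants = [
    {"gene": "GENE_A", "maf": 0.002, "zygosity": "het"},
    {"gene": "GENE_A", "maf": 0.20, "zygosity": "het"},
    {"gene": "GENE_B", "maf": 0.30, "zygosity": "het"},
    {"gene": "GENE_C", "maf": 0.03, "zygosity": "hom"},
]
patient_features = binarize_patient(variants, genes)
```

Stacking these vectors over all patients gives one patient-by-gene binary matrix per class, which is the kind of input the downstream feature reduction and classification steps would consume.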

Protein interaction network and pathway analyses have been widely used to uncover and describe genetic relationships in complex diseases, such as cancer [33, 34]. For example, overrepresentation analysis of the biological processes and pathways significantly affected by mutations will be instrumental in strengthening the statistical detection of genetic signatures associated with specific COVID-19 phenotypes and in reducing the number of parameters to consider (i.e., dimensionality reduction), with the purpose of developing robust algorithms for the prediction of genetic susceptibility to COVID-19 infection and response. Variants, genes, or biological processes will be employed as features to train interpretable, supervised machine learning classifiers (e.g., gradient boosting decision trees [35, 36]), which will ease the identification of the genetic factors associated with clinical phenotypes.
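Overrepresentation analysis is commonly framed as a one-sided hypergeometric test: given a universe of genes, a pathway of known size, and a list of mutated genes, how likely is an overlap at least as large as the one observed? The Python sketch below implements that textbook formulation as an illustration; it is not taken from the study's pipeline, and all the numbers in the example are invented.

```python
from math import comb


def overrepresentation_p_value(universe, pathway, hits, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(universe, pathway, hits).

    universe: number of genes considered in total
    pathway:  number of those genes annotated to the pathway
    hits:     number of mutated genes drawn from the universe
    overlap:  number of mutated genes that fall in the pathway
    """
    upper = min(pathway, hits)
    favourable = sum(
        comb(pathway, k) * comb(universe - pathway, hits - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(universe, hits)


# Invented example: 1000 genes, a 40-gene pathway, 50 mutated genes, 8 in the pathway.
example_p = overrepresentation_p_value(universe=1000, pathway=40, hits=50, overlap=8)
```

When many pathways are tested at once, the resulting p values would also need a multiple-testing correction, which is left out of this sketch.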

While data collection is being consolidated and brought to completion according to the study design, we have started to work on a relatively new methodology based on topological data analysis to provide a detailed multidimensional and multiscale exploration of the whole-exome data that can drive an AI selection of genes that provide higher predictive power in a machine learning model. The method will be presented, together with the results, in a forthcoming paper.

Post-Mendelian model of complex diseases

Previous attempts to interpret the genetic bases of complex disorders have, with very few exceptions, failed, even for disorders in which (as for COVID-19) twin studies have demonstrated very high heritability, such as psychiatric disorders. This lack of success stems from several weak points in the overall genetic approach to complex diseases: (1) the method used to represent the complexity of the phenotype; (2) the procedure employed to represent the huge amount of different genetic data; and (3) the absence of a robust mathematical model able to interpret genetic data in non-Mendelian (non-rare) disorders. This paper contributes to the first two points, likely paving the way for a solution to the third.

Frequently the phenotype of common (complex) disorders is oversimplified, which weakens any correlation with genetic data. Limiting the representation to differences in single parameters, such as respiratory assistance (intubation, CPAP-BiPAP, oxygen supplementation, etc.), is a possible trap for studies on complex disorders, as may be the case with COVID-19. Similarly, genetic data are often too large to be mined and are fragmented across different non-communicating methods that bet either on the power of common polymorphisms (GWAS) or on the power of variant accumulation (gene burden tests for WES). The binary representation we are proposing here, together with network propagation for feature reduction, and followed by machine learning approaches, may help in this task. A rare disorder called TAR (OMIM # 274000) teaches us that combinatorial rules linking rare variant(s) with more common polymorphism(s) are what we are looking for [37, 38].

The GEN-COVID Multicenter Study with its Registry (GCPR), Biobank (GCB), and GCGDR is structured to continually link with leading European and international research organizations, public and private, as well as with regulatory and public health authorities for developing COVID-19 and SARS-related medicines research and treatment protocols. The success of the developing research and understanding of COVID-19 and the underlying SARS-CoV-2 virus will rely in large part on human biological materials and patient-level data that is comprehensively collected and systematically organized with careful attention to sample and data integrity as well as the FAIR Data Principles. Improving diagnostics, developing existing or new therapeutics, improving treatment protocols, and even developing public health policies relies upon a foundation of evidence that requires the comprehensive, patient, and systematic collection and organizing of COVID-19 patient biological samples and data of high integrity, confidence, and interoperability. The GEN-COVID Multicenter Study’s GCPR, GCB, and GCGDR present a model that can be further explored as a systematic approach to sample and data collection while also being immediately deployable in our collective fight against COVID-19.

Supplementary information

Supplementary Figure 1 (61.3KB, jpg)

Supplementary Table 1 (60.1KB, xlsx)

Acknowledgements

This study is part of the GEN-COVID Multicenter Study, https://sites.google.com/dbm.unisi.it/gen-covid, the Italian multicenter study aimed at identifying the host genetic bases of COVID-19. The COVID-19 Biobank of Siena, which is part of the Genetic Biobank of Siena, a member of BBMRI-IT, of the Telethon Network of Genetic Biobanks (project no. GTB18001), of EuroBioBank, and of RD-Connect, provided us with specimens. We thank the CINECA consortium for providing computational resources and the Network for Italian Genomes NIG http://www.nig.cineca.it for its support. We thank private donors for their support to AR (Department of Medical Biotechnologies, University of Siena) for the COVID-19 host genetics research project (D.L n.18 of March 17, 2020). We also thank the COVID-19 Host Genetics Initiative (https://www.covid19hg.org/). The views expressed here are purely those of the writer and may not in any circumstances be regarded as stating an official position of the European Commission.

GEN-COVID Multicenter Study

Francesca Montagnani²^,¹⁴, Laura Di Sarno¹^,², Andrea Tommasi¹^,²^,³, Maria Palmieri¹^,², Susanna Croci¹^,², Arianna Emiliozzi²^,¹⁴, Massimiliano Fabbiani¹⁴, Barbara Rossetti¹⁴, Giacomo Zanelli²^,¹⁴, Laura Bergantini¹⁵, Miriana D’Alessandro¹⁵, Paolo Cameli¹⁵, David Bennett¹⁵, Federico Anedda¹⁶, Simona Marcantonio¹⁶, Sabino Scolletta¹⁶, Federico Franchi¹⁶, Maria Antonietta Mazzei¹⁷, Susanna Guerrini¹⁷, Edoardo Conticini¹⁸, Luca Cantarini¹⁸, Bruno Frediani¹⁸, Danilo Tacconi¹⁹, Chiara Spertilli¹⁹, Marco Feri²⁰, Alice Donati²⁰, Raffaele Scala²¹, Luca Guidelli²¹, Genni Spargi²², Marta Corridi²², Cesira Nencioni²³, Leonardo Croci²³, Gian Piero Caldarelli²⁴, Maurizio Spagnesi²⁵, Paolo Piacentini²⁵, Maria Bandini²⁵, Elena Desanctis²⁵, Silvia Cappelli²⁵, Anna Canaccini²⁶, Agnese Verzuri²⁶, Valentina Anemoli²⁶, Agostino Ognibene²⁷, Massimo Vaghi²⁸, Antonella D’Arminio Monforte²⁹, Esther Merlini²⁹, Mario U. Mondelli³⁰^,³¹, Stefania Mantovani³⁰, Serena Ludovisi³⁰^,³¹, Massimo Girardis³², Sophie Venturelli³², Marco Sita³², Andrea Cossarizza³³, Andrea Antinori³⁴, Alessandra Vergori³⁴, Stefano Rusconi³⁵^,³⁶, Matteo Siano³⁶, Arianna Gabrieli³⁶, Agostino Riva³⁵^,³⁶, Daniela Francisci³⁷^,³⁸, Elisabetta Schiaroli³⁷, Pier Giorgio Scotton³⁹, Francesca Andretta³⁹, Sandro Panese⁴⁰, Renzo Scaggiante⁴¹, Francesca Gatti⁴¹, Saverio Giuseppe Parisi⁴², Francesco Castelli⁴³, Maria Eugenia Quiros-Roldan⁴³, Paola Magro⁴³, Isabella Zanella⁴⁴, Matteo Della Monica⁴⁵, Carmelo Piscopo⁴⁵, Mario Capasso⁴⁶^,⁴⁷^,⁴⁸, Roberta Russo⁴⁶^,⁴⁷, Immacolata Andolfo⁴⁶^,⁴⁷, Achille Iolascon⁴⁶^,⁴⁷, Giuseppe Fiorentino⁴⁹, Massimo Carella⁵⁰, Marco Castori⁵⁰, Giuseppe Merla⁵⁰, Filippo Aucella⁵¹, Pamela Raggi⁵², Carmen Marciano⁵², Rita Perna⁵², Matteo Bassetti⁵³^,⁵⁴, Antonio Di Biagio⁵⁴, Maurizio Sanguinetti⁵⁵^,⁵⁶, Luca Masucci⁵⁵^,⁵⁶, Chiara Gabbi⁵⁷, Serafina Valente⁵⁸, Ilaria Meloni¹^,², Maria Antonietta Mencarelli³, Caterina Lo Rizzo³, Elena Bargagli¹⁵, Marco Mandalà⁵⁹, Alessia Giorli⁵⁹, Lorenzo Salerni⁵⁹, Patrizia 
Zucchi⁶⁰, Pierpaolo Parravicini⁶⁰, Elisabetta Menatti⁶¹, Stefano Baratti⁶², Tullio Trotta⁶³, Ferdinando Giannattasio⁶³, Gabriella Coiro⁶³, Fabio Lena⁶⁴, Domenico A. Coviello⁶⁵, Cristina Mussini⁶⁶, Giancarlo Bosio⁶⁷, Sandro Mancarella⁶⁸, Luisa Tavecchia⁶⁸.

Author contributions

EF, FM, and AR designed the study. CF and IM were in charge of biological samples’ collection and biobanking. MB and FF were in charge of clinical data collection. MB, FF, AR, and FM performed analysis/interpretation of clinical data. AS and MB were in charge of DNA isolations from peripheral blood samples. FV, GD, AG, and RT carried out the sequencing experiments. EB, SF, FR, AS, FB, NP, MC, PP, SC, and MS performed bioinformatics and statistical analyses. SD, FC, and FF prepared figures and tables. SD, CF, AMP, FPC, AR, and EF wrote the manuscript. CF submitted this paper. All authors have reviewed and approved the manuscript.

Proof note

At the time of proof correction, the cohort had grown to 2026 individuals.

Data availability

The data and samples referenced here in the GEN-COVID Patient Registry and the GEN-COVID Biobank are available for consultation.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

The GEN-COVID study was approved by the University Hospital of Siena Ethical Review Board (Protocol no. 16929, dated March 16, 2020).

Footnotes

Members of the GEN-COVID Multicenter Study are listed below Acknowledgements.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Sergio Daga, Chiara Fallerini

These authors jointly supervised this work: Francesca Mari, Elisa Frullanti

Contributor Information

Alessandra Renieri, Email: alessandra.renieri@unisi.it.

Supplementary information

The online version of this article (10.1038/s41431-020-00793-7) contains supplementary material, which is available to authorized users.

References

1.Wilkinson M, Dumontier M, Aalbersberg I, Appleton G, Axton M, Baal A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2019;6:6. doi: 10.1038/sdata.2016.18. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Zhu N, Zhang D, Wang W, Li X, Yang B, Song J, et al. A novel coronavirus from patients with pneumonia in China, 2019. N Engl J Med. 2020;382:727–33. doi: 10.1056/NEJMoa2001017. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Italian Civil Protection Department—COVID-19 case update. http://opendatadpc.maps.arcgis.com/apps/opsdashboard/index.html#/b0c68bce2cce478eaac82fe38d4138b1. [DOI] [PMC free article] [PubMed]
4.Wu Z, McGoogan JM. Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention. JAMA. 2020. 10.1001/jama.2020.2648. [DOI] [PubMed]
5.Benetti E, Tita R, Spiga O, Ciolfi A, Birolo G, Bruselles A, et al. ACE2 gene variants may underlie interindividual variability and susceptibility to COVID-19 in the Italian population. Eur J Hum Genet. 2020;28:1602–14. doi: 10.1038/s41431-020-0691-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation).
7.Benetti E, Giliberti A, Emiliozzi A, Valentino F, Bergantini L, Fallerini C, et al. Clinical and molecular characterization of COVID-19 hospitalized patients. 2020. http://medrxiv.org/content/early/2020/05/25/2020.05.22.20108845. [DOI] [PMC free article] [PubMed]
8.Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:s13742–015. doi: 10.1186/s13742-015-0047-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, et al. Missing value estimation methods for DNA microarrays. Bioinformatics. 2001;17:520–5. doi: 10.1093/bioinformatics/17.6.520.
10. Gower JC. A general coefficient of similarity and some of its properties. Biometrics. 1971;27:857.
11. R Core Team. R: a language and environment for statistical computing. 2018, Vienna: R Foundation for Statistical Computing.
12. ‘BBMRI.it’. https://www.bbmri.it/. Accessed Mar 2020.
13. ‘EuroBioBank – EuroBioBank website’. http://www.eurobiobank.org/. Accessed Mar 2020.
14. ‘Telethon Network of Genetic Biobanks’. http://biobanknetwork.telethon.it/. Accessed Mar 2020.
15. ‘RD-Connect – RD-Connect website’. https://rd-connect.eu/. Accessed Mar 2020.
16. COVID-19 Host Genetics Initiative. The COVID-19 Host Genetics Initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the SARS-CoV-2 virus pandemic. Eur J Hum Genet. 2020;28:715–8. doi: 10.1038/s41431-020-0636-6.
17. Cai H. Sex difference and smoking predisposition in patients with COVID-19. Lancet Respir Med. 2020;8:e20. doi: 10.1016/S2213-2600(20)30117-X.
18. Knight DS, Kotecha T, Razvi Y, Chacko L, Brown JT, Jeetley PS, et al. COVID-19: myocardial injury in survivors. Circulation. 2020. doi: 10.1161/CIRCULATIONAHA.120.049252.
19. Waisse S, Oberbaum M, Frass M. The hydra-headed coronaviruses: implications of COVID-19 for Homeopathy. Homeopathy. 2020. doi: 10.1055/s-0040-1714053.
20. Massabeti R, Cipriani MS, Valenti I. Covid-19: a systemic disease treated with a wide-ranging approach: a case report. J Popul Ther Clin Pharmacol. 2020;27:e26–30. doi: 10.15586/jptcp.v27iSP1.691.
21. Zheng KI, Feng G, Liu WY, Targher G, Byrne CD, Zheng MH. Extrapulmonary complications of COVID-19: a multisystem disease? J Med Virol. 2020. doi: 10.1002/jmv.26294.
22. Chen G, Wu D, Guo W, Cao Y, Huang D, Wang H, et al. Clinical and immunological features of severe and moderate coronavirus disease 2019. J Clin Investig. 2020;130:2620–9. doi: 10.1172/JCI137244.
23. Tang N, Li D, Wang X, Sun Z. Abnormal coagulation parameters are associated with poor prognosis in patients with novel coronavirus pneumonia. J Thromb Haemost. 2020;18:844–7. doi: 10.1111/jth.14768.
24. Wang F, Wang H, Fan J, Zhang Y, Wang H, Zhao Q. Pancreatic injury patterns in patients with COVID-19 pneumonia. Gastroenterology. 2020;159:367–70. doi: 10.1053/j.gastro.2020.03.055.
25. Ashok A, Faghih M, Singh VK. Mild pancreatic enzyme elevations in COVID-19 pneumonia: synonymous with injury or noise? Gastroenterology. 2020. doi: 10.1053/j.gastro.2020.05.086.
26. Szatmary P, Arora A, Raraty MGT, Dunne DFJ, Baron RD, Halloran CM. Emerging phenotype of SARS-CoV2 associated pancreatitis. Gastroenterology. 2020;159:1551–4. doi: 10.1053/j.gastro.2020.05.069.
27. Xu Z, Shi L, Wang Y, Zhang J, Huang L, Zhang C, et al. Pathological findings of COVID-19 associated with acute respiratory distress syndrome. Lancet Respir Med. 2020;8:420–2. doi: 10.1016/S2213-2600(20)30076-X.
28. Zheng YY, Ma YT, Zhang JY, Xie X. COVID-19 and the cardiovascular system. Nat Rev Cardiol. 2020;17:259–60. doi: 10.1038/s41569-020-0360-5.
29. Jiang S, Wang R, Li L, Hong D, Ru R, Rao Y, et al. Liver injury in critically ill and non-critically ill COVID-19 patients: a multicenter, retrospective, observational study. Front Med. 2020;7:347. doi: 10.3389/fmed.2020.00347.
30. Feng G, Zheng KI, Yan QQ, Rios RS, Targher G, Byrne CD, et al. COVID-19 and liver dysfunction: current insights and emergent therapeutic strategies. J Clin Transl Hepatol. 2020;8:18–24. doi: 10.14218/JCTH.2020.00018.
31. Von Bartheld CS, Hagen MM, Butowt R. Prevalence of chemosensory dysfunction in COVID-19 patients: a systematic review and meta-analysis reveals significant ethnic differences. ACS Chem Neurosci. 2020;11:2944–61. doi: 10.1021/acschemneuro.0c00460.
32. Kim GU, Kim MJ, Ra SH, Lee J, Bae S, Hung J, et al. Clinical characteristics of asymptomatic and symptomatic patients with mild COVID-19. Clin Microbiol Infect. 2020;26:948.e1–3. doi: 10.1016/j.cmi.2020.04.040.
33. Eilbeck K, Quinlan A, Yandell M. Settling the score: variant prioritization and Mendelian disease. Nat Rev Genet. 2017;18:599–612. doi: 10.1038/nrg.2017.52.
34. Creixell P, Reimand J, Haider S, Wu G, Shibata T, Vazquez M, et al. Pathway and network analysis of cancer genomes. Nat Methods. 2015;12:615–21. doi: 10.1038/nmeth.3440.
35. Cowen L, Ideker T, Raphael BJ, Sharan R. Network propagation: a universal amplifier of genetic associations. Nat Rev Genet. 2017;18:551–62. doi: 10.1038/nrg.2017.38.
36. Yan L, Zhang H-T, Goncalves J, Xiao Y, Wang M, Guo Y, et al. An interpretable mortality prediction model for COVID-19 patients. Nat Mach Intell. 2020;2:283–8.
37. Shuai S, PCAWG Drivers and Functional Interpretation Working Group, Gallinger S, Stein L, PCAWG Consortium. Combined burden and functional impact tests for cancer driver discovery using DriverPower. Nat Commun. 2020;11:734. doi: 10.1038/s41467-019-13929-1.
38. Albers CA, Paul DS, Schulze H, Freson K, Stephens JC, Smethurst PA, et al. Compound inheritance of a low-frequency regulatory SNP and a rare null mutation in exon-junction complex subunit RBM8A causes TAR syndrome. Nat Genet. 2012;44:435–9, S1–2. doi: 10.1038/ng.1083.

Supplementary Materials

Supplementary Figure 1 (61.3 KB, jpg)

Supplementary Table 1 (60.1 KB, xlsx)

Data Availability Statement

The data and samples referenced here in the GEN-COVID Patient Registry and the GEN-COVID Biobank are available for consultation.
