JAMIA Open. 2020 Sep 11;3(3):472–486. doi: 10.1093/jamiaopen/ooaa030

The case for open science: rare diseases

Yaffa R Rubinstein 1, Peter N Robinson 2,#, William A Gahl 3, Paul Avillach 4, Gareth Baynam 5, Helene Cederroth 6, Rebecca M Goodwin 7, Stephen C Groft 8, Mats G Hansson 9, Nomi L Harris 10, Vojtech Huser 11, Deborah Mascalzoni 12, Julie A McMurry 13, Matthew Might 14, Christoffer Nellaker 15, Barend Mons 16, Dina N Paltoo 7, Jonathan Pevsner 17, Manuel Posada 18, Alison P Rockett-Frase 19, Marco Roos 20, Tamar B Rubinstein 21,✉,#, Domenica Taruscio 22, Esther van Enckevort 23, Melissa A Haendel 13,#
PMCID: PMC7660964  PMID: 33426479

Abstract

The premise of Open Science is that research and medical management will progress faster if data and knowledge are openly shared. The value of Open Science is nowhere more important and appreciated than in the rare disease (RD) community. Research into RDs has been limited by insufficient patient data and resources, a paucity of trained disease experts, and lack of therapeutics, leading to long delays in diagnosis and treatment. These issues can be ameliorated by following the principles and practices of sharing that are intrinsic to Open Science. Here, we describe how the RD community has adopted the core pillars of Open Science, adding new initiatives to promote care and research for RD patients and, ultimately, for all of medicine. We also present recommendations that can advance Open Science more globally.

Keywords: open science, ontology, FAIR data, common data elements, rare disease patients, data standards

INTRODUCTION

In the United States, a rare disease (RD) is defined as one that affects fewer than 200,000 persons; for Japan, it is fewer than 50,000; and for South Korea, fewer than 20,000. In contrast, Europe and Australia define a disease as rare if it affects fewer than 1 in 2000 individuals.1,2 Taken together, RDs represent a public health problem; ∼10% of people eventually present with an RD.2–4 Roughly 5000–8000 RDs have been described, but the number of RDs is estimated to exceed 10,000.5 Most RDs are severe and chronic and some are life-threatening. RDs, which are often inherited, frequently present in childhood and can have deleterious long-term effects. Patients with RDs often face diagnostic delays; it can take 7 years or more to reach an accurate diagnosis.6,7 Delayed or inaccurate diagnoses hinder the development of effective treatment plans, preclude prognoses and genetic counseling, create skepticism among relatives, colleagues, and physicians, and exclude patients from a community of individuals with similar experiences. Appropriate information and medical expertise on RDs are often insufficient, and access to care is difficult. Because many RDs affect multiple organ systems, care can be fragmented across several specialties. Electronic health records (EHRs) are not well suited for recording and sharing information about RDs; it remains difficult to stratify patients into useful classifications and to identify individuals with specific RDs8,9 (Figure 1).

Figure 1.

Rare diseases. (A) RDs are individually rare but collectively impact ∼10% of the population. Here, RDs are represented in the classic aphorism, “When you hear hoofbeats, think of horses, not zebras”—in other words, look for the most common disease that matches the symptoms, not the rarest one. It was originally used by Theodore Woodward, professor at the University of Maryland School of Medicine in the 1940s. (B) Defining RDs requires carefully matching a patient’s spectrum of phenotypes with the phenotypic profile of candidate diseases, here represented by a single color-feature. Each zebra (patient) has a constellation of phenotypes that may match none, some (dashed lines), or all (solid lines) of the phenotypes of other zebras. The diagnosis of RDs often involves recognition of phenotypic patterns and is aided by computational phenotype analysis.

Programs have been established to accelerate the diagnosis of very rare diseases, identify new RDs, and provide improved RD patient care. One such program was the NIH Undiagnosed Diseases Program (UDP),10–13 which expanded to the Undiagnosed Diseases Network (UDN). This NIH-funded consortium includes 12 clinical sites and analytical cores around the United States14,15; both the UDN and the UDP provide multidisciplinary clinical evaluations, research collaborations, and translational validations for RD patients. The UDN uses many hundreds of open data resources that have helped inform many diagnoses, illustrating the success of Open Science for diagnosing RD patients. Similar RD diagnostic initiatives have been established elsewhere, including in Japan in 2015 (Initiative on Rare and Undiagnosed Disease [IRUD]16) and in Western Australia in 2013 (Rare and Undiagnosed Diseases Diagnostic Service [RUDDS]). The Undiagnosed Diseases Network International (UDNI), established in 2014, is dedicated to discovering new diseases and defining standards for sharing data and best practices in RD programs throughout the world.11 With the Cross-Border Healthcare Directive (2011/24/EU), the European Union established a mandatory framework to foster cooperation on RDs within European Reference Networks.17 Despite these laudable efforts to coordinate internationally, there are not enough programs worldwide to provide the care needed for the many RD patients. In addition, RD patients often lack a supporting community that shares the same disease, despite the many support groups such as the National Organization for Rare Disorders (NORD), European Organization for Rare Diseases (EURORDIS), and Coordination of Rare Diseases at Sanford (CoRDS).

OPEN SCIENCE AND MAKING DATA FAIR

Individually, RDs are rare, and so any one physician, researcher, or institution will not accrue sufficient experience, data, or knowledge to effectively treat or research RDs. Therefore, progress in diagnosing, treating, and understanding a particular RD requires the synthesis of all available data from multiple institutions.

To facilitate this exchange of data, the field has started to embrace the principles of Open Science. The premise of Open Science is that research will progress faster if data and knowledge are openly shared with proper data safety measures and ethical frameworks.18 Open Science, an umbrella term for a wide range of activities from basic biological research to clinical research, makes it easier for scientists and clinicians to share and access knowledge, resources, tools, and data. Open Science considers scientific knowledge a product of social collaboration that belongs to the community; hence, the public should have access to it at little or no cost.

In a very real sense, Open Science means open data. To be open, the data need to be FAIR, that is, Findable, Accessible, Interoperable, and Reusable for humans and machines.19 These FAIR Guiding Principles,20 adopted in 2014, are followed by many organizations worldwide, including the G20, NIH, and IRDiRC (the International Rare Diseases Research Consortium). Many projects, such as the European Joint Programme on Rare Diseases, are now working on the implementation of FAIR. Germany, France, and The Netherlands decided to support communities in organizing Global Open FAIR implementation networks. The RDs GO FAIR Network was established to foster implementation in the RD domain.21 Also important are factors specifically related to data reusability, such as traceability (eg, provenance and attribution), data licensing, and connectedness of the data.22–24

FAIR data stewardship is challenging, because it requires a wide range of expertise: knowledge of the domain, local IT systems, local and cloud storage systems, local and global data access policies, machine-readable formats for data and knowledge, and software for communication between FAIR resources. Making data FAIR should be considered a team effort. There is no comprehensive suite of tools for a stakeholder to make data FAIR; ELIXIR’s “service bundles” may provide that in the future, but teams of experts are needed.

The FAIR principles require data to be prepared for reuse. Moreover, for diseases with low prevalence, sparsity of data necessitates that data are prepared for analysis across multiple sources. Current lack of interoperability is an obstacle for Open Science.25 Data scientists must go through a laborious and error-prone process of finding data, assuring access and permissions, and making data compatible and optimally reusable. In our experience, this post hoc data preparation may take up a substantial part of their time,26 and inevitably leads to an inability to address certain research questions. Open Science needs international collaboration, infrastructure, and good data stewardship to address the costly inefficiency caused by data that are not prepared for reuse.

Sharing data can be problematic in general, but particularly in the RD domain, because of (1) ethical and legal constraints that can differ among institutes, regions, and countries, (2) the scale of the distribution of RD data, and (3) hesitation of scientists to share data that are precious to their careers. The FAIR principles can provide an alternative approach to centralizing data, especially clinical data, from multiple sources for analysis. When data are FAIR “at source,” distributed analysis can be effectively performed, with only the result of the analysis leaving the source and the data secure and private. In principle, all source data are available, enabling analyses ranging from counting how many patients show certain conditions to distributed machine learning to predict treatment outcomes. Some computer algorithms will be too demanding for distributed analysis, but even in that case, application of the FAIR principles will prepare data for efficient analyses.
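The idea of analysis "at source" can be illustrated with a minimal sketch. It assumes a hypothetical `Site` class that holds patient records locally and exposes only a `run` method; the record layout, site names, and function names are invented for illustration and do not describe any specific FAIR infrastructure. The point is that the analysis travels to the data and only an aggregate count travels back.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical record layout: one dict per patient, holding a set of phenotype codes.
Patient = Dict[str, set]


@dataclass
class Site:
    """A data holder (e.g., a registry) that keeps patient-level records local."""
    name: str
    _records: List[Patient]

    def run(self, analysis: Callable[[List[Patient]], int]) -> int:
        # The analysis is executed where the data live; only its result is returned.
        return analysis(self._records)


def count_with_phenotype(code: str) -> Callable[[List[Patient]], int]:
    """Build an analysis that counts local patients annotated with `code`."""
    def analysis(records: List[Patient]) -> int:
        return sum(1 for record in records if code in record["phenotypes"])
    return analysis


def federated_count(sites: List[Site], code: str) -> Dict[str, int]:
    """Collect one aggregate count per site; no patient record leaves a site."""
    return {site.name: site.run(count_with_phenotype(code)) for site in sites}


sites = [
    Site("registry_a", [{"phenotypes": {"HP:0000545"}}, {"phenotypes": set()}]),
    Site("registry_b", [{"phenotypes": {"HP:0000545"}}]),
]
per_site = federated_count(sites, "HP:0000545")
total = sum(per_site.values())
```

A real deployment would add authentication, consent checks, and safeguards against small counts that could identify individuals; none of these are modeled here.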

Another significant challenge is data licensing. Integrative analytical platforms aimed at facilitating RD research and mechanism and drug discovery, such as the Monarch Initiative,27 the NCATS Biomedical Data Translator,28–30 and the Gabriella Miller Kids First Data Resource Portal,31 rely on the ability to integrate and redistribute data from other third-party public knowledge sources. The more FAIR-ready these sources are, the more the integrated data may be effectively applied for RDs. However, a recent study evaluating more than 50 data sources suggested that current licensing terms may significantly impede the use, reuse, and redistribution of data. The lack of legal data redistribution is a fundamental problem for RDs, for which maximal utility must be garnered from all possible knowledge sources. Custom licenses constitute the largest single class of licenses found in these data resources, suggesting that the providers either did not know about standard licenses or believed that standard licenses did not meet their needs.22,23 The (Re)usable Data Project32 aims to help data providers evaluate the impact of their licensing terms on downstream users, and is already assisting RD data providers to improve their reusability.

Despite these challenges, the benefits of FAIR outweigh the cost of implementation. Theoretically, the additional time to make data compatible for multi-source analysis by a data analyst is zero when data are already FAIR.33 Considering that RD data sets are precious and reused often, the efficiency gain multiplies quickly. The RD community was the first community in Europe to embrace the concept of a “Bring Your Own Data” workshop (BYOD) aimed at learning how to make data interoperable. BYODs for RD registry managers have been organized by the Istituto Superiore di Sanità since 2015,34 and are planned to continue as part of an annual summer school at least until 2023 with support from the European Joint Programme on Rare Diseases (EJPRD). Inspired by the feedback from these BYODs, “RDs GO FAIR” was created to foster adoption of FAIR principles toward a critical mass of FAIR data resources.17 Through interdisciplinary collaboration fostered by RDs GO FAIR and others, and activities of ELIXIR, BBMRI (Biobanking and BioMolecular resources Research Infrastructure), the NIH, the EJPRD, NORD, and EURORDIS, we expect gradual maturation of guidelines, supporting tools, FAIR data stewardship (including in patient organizations), and for-profit and not-for-profit service providers. A FAIR ecosystem thus brings about an Open Science environment where new analysis possibilities can be explored under well-defined and transparent conditions for sensitive data.

OPEN SCIENCE IN THE RARE DISEASE FIELD

RD patient empowerment and resources

Patients, families, and their advocates are key stakeholders that have not always been sufficiently engaged in many biomedical research initiatives.35 Engaging patients as partners in product development is important to better understand the patient perspectives and the pathogenesis of the disease. Patients and caregivers are often the best advocates for raising awareness and describing the clinical manifestations and the daily progress of the disease and treatments.36 Engagement of patients and other stakeholders (such as caregivers, advocacy organizations, and clinicians) in clinical research can help to ensure that research efforts address relevant clinical questions and patient-centered health outcomes.37 Numerous RD programs and organizations exist, including the NIH Rare Disease Clinical Research Network (RDCRN),38 the EURORDIS-Rare Diseases Europe,39 Patient-Centered Outcomes Research Institute (PCORI),40 the Genetic Alliance,41 NORD,42 and the Innovative Medicine Initiative (IMI).43

Many patients and their families look for ways to improve dissemination of their data and help catalyze research in their RD in the hope of faster and better diagnosis and treatment. There are many inspiring examples of individual patients or parents who, with few resources but much determination, established a foundation for their RD, shared their data, and created a successful collaboration between scientific researchers and patient organizations. A few examples are: Syndromes Without A Name USA (SWAN),44–47 Ngly1.org foundation,48,49 the Chordoma Foundation,50 the Castleman Disease Collaborative Network,51 the Joshua Frase Foundation,52 the Cystic Fibrosis Foundation,53 and PXE International.54 In all of these cases, there has been not only a patient or parent creating research programs and collaborations but also data sharing and data reutilization to support diagnosis and discovery. These foundational patient-scientist collaborations are a clear window into what will become the de facto standard, that is, Open Science, international collaborations involving patients, clinicians, researchers, and data technologists in a global venue.

The diversity of the aforementioned activities has contributed to the mention of the importance of patient engagement in RD clinical trials in the US Food and Drug Administration (FDA)’s 2019 draft guidance document for industry.35 The role of patients’ and parents’ support groups is growing beyond the boundaries of individual national initiatives aimed at raising public awareness and promoting medical care and social benefits.

Common data elements

Open access to data is not sufficient to make the data useful to science; data must also be structured, documented, interoperable, and curated. The magnitude of this task has led to the development of programs and software that help automate data curation, data integration, and data mining; it has also underscored the need for machine learning and language processing.8,55–58 Health data come from many different sources, and many different people produce, curate, and use the data. Integration is obstructed when systems and studies use different words to describe the same objects or concepts, use the same words intending different meanings, or use different data formats or structures.

Common data elements (CDEs) are a universal language that describes the data collected in a study. CDEs make data meaningful by structuring and defining commonly used, community-shaped, recommended measures and assessment instruments. Using CDEs when first collecting biomedical data makes it easier to develop meaningful analyses and research projects. When data are associated with CDEs, they can be more readily analyzed and reused to accelerate research into disease pathogenesis and therapeutic development. Although some CDEs were originally developed to address the needs of a specific research domain or clinical application, many CDEs address universal concepts of interest to a wide variety of domains for a variety of data collection purposes, such as demographic characteristics of research participants. In many cases, CDEs related to RDs may be broadly applicable for collecting data about other diseases, or for rapidly pivoting to collecting well-defined data critical for research related to emerging diseases; for example, lung function measures developed for people with cystic fibrosis might be leveraged for use with patients with COVID-19. Identifying and reusing existing CDEs paves the way for smoothly finding, interpreting, and exchanging data. Unambiguous definitions are critical. For comparability among sources, CDEs should describe not only the data to be collected, but also rich metadata, that is, the manner in which the data are collected and how the data are recorded. CDEs should define the parameter space for the data point and, instead of using natural language, they should encourage the use of standardized terminologies and ontologies. While consistency of data collection and the use of CDEs within an individual study are essential for maintaining data quality and enabling analysis, consistency of data collection across multiple studies brings additional value by promoting data sharing.59
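What it means for a CDE to define the parameter space of a data point, together with metadata on how the value is collected, can be sketched as a small data structure. The class name, field names, and the two example elements below are hypothetical and are not drawn from any CDE repository; the numeric range is likewise an illustrative assumption rather than a clinical reference.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CommonDataElement:
    """Illustrative CDE: a definition plus the allowed values and collection metadata."""
    name: str
    definition: str
    unit: Optional[str] = None
    permissible_values: Optional[Tuple[str, ...]] = None  # for coded answers
    value_range: Optional[Tuple[float, float]] = None     # for numeric answers
    collection_method: str = ""                           # how the value is obtained

    def validate(self, value) -> bool:
        """Return True if `value` falls inside the element's parameter space."""
        if self.permissible_values is not None:
            return value in self.permissible_values
        if self.value_range is not None:
            low, high = self.value_range
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            return is_number and low <= value <= high
        return True


# Hypothetical elements; names, ranges, and value lists are assumptions.
fev1 = CommonDataElement(
    name="fev1_percent_predicted",
    definition="Forced expiratory volume in 1 second, as percent of predicted",
    unit="%",
    value_range=(0.0, 200.0),
    collection_method="spirometry",
)
smoking = CommonDataElement(
    name="smoking_status",
    definition="Self-reported smoking status",
    permissible_values=("never", "former", "current"),
    collection_method="questionnaire",
)
```

In practice the permissible values would point to codes from a standardized terminology rather than free-text strings, which is the design choice the paragraph above argues for.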

Nevertheless, despite potential benefits and the extensive use of CDEs across clinical research studies, there are some challenges. There may be differences across studies in the interpretation and implementation of the data elements; researchers must ensure that CDEs are valid in different populations recruited for a study (eg, participants may have different cultural and linguistic backgrounds). Adoption of CDEs can be inhibited by existing research practices and legacy data systems. In addition, use of clinical research data beyond the original purpose for which they were collected requires that researchers ensure that the collected data and their use are consistent with the informed consent and research ethics.

Data collection and annotation with a well-defined, controlled vocabulary allow the meaning of data to be described in a human- and machine-readable way, enable data harmonization and meta-analyses, and enhance data sharing. Lack of standardization hinders data sharing and interoperability, so the use of CDEs is particularly critical for research and clinical care for people with RDs. The National Institutes of Health (NIH) Common Data Element Repository (CDE-R), developed and hosted by the US National Library of Medicine (NLM), is a platform for identifying related data elements in use across diverse areas, for harmonizing data elements, and for linking CDEs to other existing standards and terminologies.60,61 NLM and others across NIH work to ensure that formal vocabularies used to describe people, health problems, and health care processes are sufficiently robust to encompass the full range of health and disease across all populations and all communities.62–75 The CDE-R contains many CDEs developed for and by the RD research community, including those of the Global Rare Diseases Registry Data Repository (GRDR).76,77 The PhenX toolkit is a catalog of measurement protocols, developed with a robust community consensus protocol.78 PhenX notes that its protocols can be used to combine studies to increase statistical power, enable comparisons of studies to validate results, and increase the impact of individual studies. PhenX has been used for the application of standardized measures in many clinical research studies, many of which are submitted to dbGaP. PhenX contains a collection of measures for Rare Genetic Conditions79 that, while very useful, would require significant expansion beyond their current remit of 10 per domain to be relevant to the 10 000 RDs that exist. PhenX also allows the creation of clinical data collection forms in standardized tools such as REDCap,80 which is a great advantage for standardizing data prospectively. All of the aforementioned efforts help support improved interoperability of clinical data across studies; they are critically important for RDs, for which data from one study or a different RD may help inform others.

Many RDs lack consistent identifiable terms, limiting literature searches, registry interoperability, and comparability in clinical information systems. Despite the advances in the creation of CDEs, many RDs lack a comprehensive set of disease definitions, associated phenotypes, genetic variations, treatments, prognoses, and other disease characteristics. However, CDE-development efforts that involve multidisciplinary collaboration, including informatics expertise, can address some of these challenges by identifying synonymy, clearly defining terms, and achieving consensus of key stakeholders for adoption of the CDEs. For example, this process was used to develop CDEs and guidance for health information exchange of newborn screening orders and results for lysosomal storage disorders. We now detail ongoing efforts to address this gap; the next steps would be to implement such components into CDEs, clinical systems such as EHRs, clinical decision support tools, and RD registries.

Data collected for RD research typically includes laboratory measurements, clinical observations, imaging, genomics and other 'omics data, as well as patient-reported outcomes (PROs). However, one of the biggest challenges for RD diagnosis is that RDs are not well-represented in terminologies typically used within EHRs, diagnostic settings, or other clinical information systems. The aforementioned CDEs for RD are intended to address this issue, but standardized ontologies are still lacking for use in those CDEs and clinical systems. Ontologies provide precise definitions of terms and relationships between different terms, which makes it possible to provide better quality checks, remove ambiguity, and provide much greater computability and utility in diagnostic or other algorithms. Precision medicine would greatly benefit from improved logical representation of clinical terminologies for classifying patients9; simply put, RD diagnostics requires it.

The Human Phenotype Ontology

The Human Phenotype Ontology (HPO) provides a structured, comprehensive, and well-defined set of terms that describe phenotypic abnormalities seen in human disease. It also provides a collection of disease-phenotype annotations, that is, computational assertions that a disease is associated with a given phenotypic abnormality. The HPO was created to enable “deep phenotyping,” that is, capture of symptoms and phenotypic findings using a logically constructed hierarchy of phenotypic terms.81,82 The HPO is a flagship project of the Monarch Initiative, an international consortium dedicated to developing integrative semantic technologies for disease diagnosis and mechanism discovery.27,83–85 The HPO allows algorithms to match sets of patient phenotype profiles in a “fuzzy” non-exact manner to gold standard RD profiles, other patients, and model organisms, greatly facilitating diagnosis.86–88 The HPO has therefore become the de facto standard for representing clinical phenotype data for the diagnosis of rare genetic diseases, and is used by the 100 000 Genomes Project,89 the UDP,13,90 and the Undiagnosed Diseases Network (UDN), as well as thousands of other clinics, laboratories, tools, and databases59,91,92; it is also an IRDiRC (International Rare Diseases Research Consortium) Recognized Resource.93
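How a term hierarchy enables “fuzzy” matching can be shown with a toy example. The hierarchy below is a hand-made stand-in, not the HPO, and the similarity measure (Jaccard overlap of each profile's terms plus all their ancestors) is a deliberately simple substitute for the information-content-based measures used by published tools. The disease names are placeholders.

```python
from typing import Dict, Iterable, Set

# Toy hierarchy (child -> parents). Labels are illustrative, not HPO identifiers.
PARENTS: Dict[str, Set[str]] = {
    "phenotypic abnormality": set(),
    "eye abnormality": {"phenotypic abnormality"},
    "refractive error": {"eye abnormality"},
    "myopia": {"refractive error"},
    "hyperopia": {"refractive error"},
    "skeletal abnormality": {"phenotypic abnormality"},
    "scoliosis": {"skeletal abnormality"},
    "heart abnormality": {"phenotypic abnormality"},
    "heart murmur": {"heart abnormality"},
}


def with_ancestors(term: str) -> Set[str]:
    """Return the term together with every ancestor reachable in the hierarchy."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def closure(profile: Iterable[str]) -> Set[str]:
    """Expand a phenotype profile to include all ancestors of its terms."""
    expanded: Set[str] = set()
    for term in profile:
        expanded |= with_ancestors(term)
    return expanded


def similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity of two ancestor-expanded profiles (0.0 to 1.0)."""
    ca, cb = closure(a), closure(b)
    return len(ca & cb) / len(ca | cb)


patient = {"hyperopia", "scoliosis"}
diseases = {
    "disease_x": {"myopia", "scoliosis"},
    "disease_y": {"heart murmur"},
}
ranked = sorted(diseases, key=lambda d: similarity(patient, diseases[d]), reverse=True)
```

An exact comparison would credit the patient and `disease_x` only for the shared term “scoliosis”; the ancestor expansion also credits the fact that hyperopia and myopia are both refractive errors, which is the kind of partial match the paragraph above describes.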

Although the focus of the HPO has, to date, been on RDs, it has been extended to provide a computational foundation for phenotype-driven analysis of genomes and other translational research on complex human disease.91 For example, many of the laboratory data recorded in EHRs for RD patients are expressed in an exact manner, such as measurements captured using the Logical Observation Identifiers Names and Codes (LOINC) standard for identifying medical laboratory observations. Recent efforts have been made to support interoperability between HPO and LOINC, such that direct measurements can be converted into HPO codes and used for diagnostic purposes.94
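Converting a direct measurement into a phenotype code can be sketched as a lookup that compares a laboratory value with a reference interval. This is an assumption-laden illustration, not the published LOINC-to-HPO mapping: the `LabRule` structure is invented, the reference interval is a placeholder, and the LOINC and HPO codes shown should be verified against current releases before any use.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LabRule:
    """Illustrative rule: reference interval plus the terms for out-of-range values."""
    low: float
    high: float
    below_term: str
    above_term: str


# Assumed example: a serum/plasma glucose test (mg/dL) mapped to hypo-/hyperglycemia.
# Interval and codes are for illustration only and must be checked before use.
RULES: Dict[str, LabRule] = {
    "2345-7": LabRule(low=70.0, high=100.0,
                      below_term="HP:0001943", above_term="HP:0003074"),
}


def lab_to_phenotype(loinc_code: str, value: float) -> Optional[str]:
    """Return a phenotype code for an abnormal result, or None if normal or unmapped."""
    rule = RULES.get(loinc_code)
    if rule is None:
        return None
    if value < rule.low:
        return rule.below_term
    if value > rule.high:
        return rule.above_term
    return None
```

Returning `None` for a value inside the interval is a simplification; a fuller design might also record that an abnormality was explicitly excluded, since a normal finding can be diagnostically informative.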

Deep phenotyping can be time-consuming and may miss key phenotypic features because they are not assessed (eg, phenotypes in internal organs that are only observable if a CT is performed) or not reported (eg, an inconsolable child or heavy snoring may not be documented in a clinical setting). Patients could therefore provide informative contributions to their computable phenotype profiles; however, the “terminology gap” between medical professionals and patients can limit patient participation both in research studies and in clinical phenotyping. Current patient vocabularies provide broad consumer equivalents for clinical findings, medical procedures and equipment but are not well integrated with research terminologies. For undiagnosed patients and those with RDs, affected individuals themselves are an especially critical source of phenotyping information. These patients accumulate a clear, firsthand knowledge about their condition, first from observing how the condition progresses daily, but also from multiple clinician evaluations and from other families and patients with similar conditions. In some cases, patients’ self-phenotyping combined with additional investigations has led to clinical diagnoses.95

To address these issues, the HPO was further developed to allow capture of patient-generated phenotypic profiles for use in diagnostic and patient community settings (registries, forums, clinics, and patient websites). To achieve this, a patient-centered lexicon of relevant terms was developed and added to the HPO.59,91,92 These terms are frequently referred to in plain language but can also include clinical terms (eg, Myopia [HP:0000545] has a lay synonym “Near-sightedness”). Since the lay translation of the HPO uses the same logical infrastructure as the HPO itself, patient-generated phenotyping data can be readily combined with clinical phenotyping data to prioritize variants, improve diagnostic rates, and examine expressivity, penetrance, and disease progression. Formal evaluation of the diagnostic capabilities of the lay HPO is in progress, and includes an informatics comparison against the gold-standard HPO disease annotations used in genomic diagnostics for patients with RDs. The lay HPO is expected to serve as a resource that will allow patients and families to become more effective partners in translational research, empowering families to achieve an accurate diagnosis and enabling people to improve the lives of others with RDs by increasing medical knowledge through their personal perspectives. The lay HPO should also enable RD patients to share their phenotyping profiles openly on the web using standards such as Phenopackets (see below), which allows the use of informatics to support open querying for similar patients to improve diagnosis.
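Because lay synonyms and clinical labels resolve to the same term identifiers, patient-reported and clinician-reported findings can be merged into one profile. A minimal sketch follows, using only the Myopia example given above; the data layout and function names are hypothetical, and the matching is plain string normalization rather than anything the HPO tooling is known to do.

```python
from typing import Dict, Iterable, List, Set

# One term from the text above; the dictionary layout is an assumption.
TERMS: Dict[str, Dict[str, List[str]]] = {
    "HP:0000545": {"label": ["Myopia"], "lay": ["Near-sightedness"]},
}


def normalize(text: str) -> str:
    """Lowercase and drop spaces and punctuation so spelling variants compare equal."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def build_index(terms: Dict[str, Dict[str, List[str]]]) -> Dict[str, str]:
    """Map every normalized clinical label and lay synonym to its term identifier."""
    index: Dict[str, str] = {}
    for term_id, names in terms.items():
        for name in names["label"] + names["lay"]:
            index[normalize(name)] = term_id
    return index


def to_profile(reported: Iterable[str], index: Dict[str, str]) -> Set[str]:
    """Translate free-text findings to term identifiers, skipping unknown text."""
    return {index[normalize(item)] for item in reported if normalize(item) in index}


index = build_index(TERMS)
patient_reported = to_profile(["near sightedness", "tired all the time"], index)
clinician_reported = to_profile(["Myopia"], index)
combined = patient_reported | clinician_reported
```

Unrecognized text such as “tired all the time” is silently dropped here; a real capture tool would more likely prompt the user to pick the closest term.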

Databases to share rare disease knowledge

While the HPO81 has become a global ontological standard for representing phenotypic attributes of RDs, community coordination of RD disease terminology is still emerging. Different terminological and database resources have been developed that describe RDs. The Online Mendelian Inheritance in Man (OMIM) was begun in the early 1960s by Dr. Victor A. McKusick as a catalog of Mendelian traits and disorders and has since become a global standard for documentation of Mendelian diseases.96,97 OMIM provides highly curated knowledge on genes and genetic diseases, phenotypes, and the relationships between them. Each disease listed in OMIM has a current summary of information based on expert review of the biomedical literature. Orphanet was established in France by the INSERM (French National Institute for Health and Medical Research) in 1997, and provides the community with information and nomenclature on RDs; it is focused on improving the visibility of RDs in health and research information systems, particularly in Europe. The Orphanet Rare Disease Ontology (ORDO),98 an IRDiRC Recognized Resource,93 has been successfully used in the RD-Connect Sample Catalogue,99,100 which is an open data repository with information about biological samples from RD patients that are available to scientists for (re-)use. Disease InfoSearch (diseaseinfosearch.org) is a crowdsourced database of thousands of diseases that helps patients find resources and studies, and integrates information from numerous sources, such as the NIH Genetic and Rare Diseases Information Center (https://rarediseases.info.nih.gov/). These RD-spanning databases are complemented by gene-, disease-, and/or locus-specific databases. For example, both the Human Genome Variation Society (HGVS)101 and the Leiden Open Variation Database (LOVD) list approximately 1500 expert-curated locus-specific mutation databases.102

Approaches to discovering the genetic basis of disease include linkage studies, genome-wide association studies, and a variety of designs involving next-generation sequencing including whole-exome and whole-genome sequencing. The majority of the software used for these analyses is open access, greatly facilitating the pace of discovery. The results of many studies are also readily available. For example, the National Human Genome Research Institute (NHGRI)-EBI GWAS catalog reports over 70,000 variant-trait associations from >5000 studies.103 GenBank has freely released DNA sequence data since 1982.104 ClinVar is a public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.105 The Genome Aggregation Database (gnomAD) is an aggregation and harmonization of exome and genome sequencing data from a variety of large-scale sequencing projects. Summary data are available for the wider scientific community based on genomic sequences of over 140,000 individuals. Also available are the Database of Genotypes and Phenotypes (dbGaP) at NIH and the European Genome-phenome Archive (EGA) at EBI.106 These are examples of resources that are facilitating major progress in the discovery of genes and their functional characterization, leading to progress toward improved diagnosis and treatment.

Recently, major knowledge sources on RDs such as Orphanet, OMIM, ClinGen, MedGen, GARD, NCI Thesaurus, and others have been working together to harmonize disease definitions in a new ontology called “Mondo”,107,108 meaning “world”. While Mondo is still in development, the new ontology already provides a computational framework for defining RDs based upon logical representation of a variety of attributes such as phenotypes, genetic variants, treatment, onset, and frequency. Algorithmic and manual curation efforts have been used to align these RD terminologies, yielding preliminary estimates that the total number of RDs may exceed 10,000, that is, many more than the ∼7000 estimated during the inception of the Orphan Drug Act.109 More than half of these RDs can be found in three or more resources, whereas ∼4000 are unique to a given source. This preliminary analysis suggests that there could be a substantially higher number of RDs than currently assumed, with obvious implications for diagnostics, drug discovery, and treatment. However, it should be emphasized that much more rigorous analysis is needed to establish the accuracy of this estimate.
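The kind of tally behind statements such as “found in three or more resources” or “unique to a given source” can be sketched once source terminologies have been aligned. The merged identifiers and source assignments below are made-up placeholders, not Mondo content, and the alignment step itself (the hard part) is assumed to have already been done.

```python
from collections import Counter
from typing import Dict, Set

# Hypothetical alignment result: merged disease class -> sources that contain it.
merged: Dict[str, Set[str]] = {
    "merged:0001": {"orphanet", "omim", "gard"},
    "merged:0002": {"omim"},
    "merged:0003": {"orphanet", "omim", "medgen", "ncit"},
    "merged:0004": {"gard"},
}


def source_coverage(classes: Dict[str, Set[str]]) -> Counter:
    """Count merged classes by the number of sources in which each appears."""
    return Counter(len(sources) for sources in classes.values())


coverage = source_coverage(merged)
total_classes = len(merged)
in_three_or_more = sum(n for k, n in coverage.items() if k >= 3)
unique_to_one_source = coverage[1]
```

The totals are only as reliable as the alignment: two sources describing the same disease under unmatched names would be counted as two unique entries, which is one reason the estimate above is described as preliminary.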

Because RD patient presentations are heterogeneous and may not perfectly match existing disease definitions based on very small populations, it is critically important to share patients’ phenotypic information to support diagnosis, matchmaking, patient registries, communities, and target drug development. Further, despite the substantial improvements in exome analysis that have revealed numerous new rare Mendelian disease genes, the specific causal gene cannot be identified for more than half of patients.109 For these patients, evidence for causality depends on identifying other affected individuals with a similar phenotype and functionally impactful variants in the same candidate gene. In order to support this n-of-1 patient matching, the Global Alliance for Genomics and Health (GA4GH) initiated the Matchmaker Exchange (MME).110 MME is a federated network connecting different patient databases containing genomic and phenotypic data using a common application programming interface and allowing data exchange among them. MME has helped diagnose thousands of patients globally, by connecting these regional resources in a data sharing network that preserves privacy and maintains clinical review of potential matches and subsequent diagnoses.
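
The matching logic described above, in which a candidate gene gains support when another patient shares both the gene and a similar phenotype, can be sketched in a few lines. This is an illustration only and does not reproduce the MME application programming interface: the `PatientProfile` record, the Jaccard overlap score, the 0.3 threshold, and the placeholder gene names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    """Hypothetical, simplified record; not the Matchmaker Exchange format."""
    patient_id: str
    hpo_terms: frozenset
    candidate_genes: frozenset


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_matches(query, database, min_phenotype_similarity=0.3):
    """Return (patient_id, shared_genes, similarity), best match first.

    A match needs at least one shared candidate gene and a phenotype
    overlap at or above the (arbitrary, illustrative) threshold.
    """
    matches = []
    for other in database:
        if other.patient_id == query.patient_id:
            continue
        shared = query.candidate_genes & other.candidate_genes
        similarity = jaccard(query.hpo_terms, other.hpo_terms)
        if shared and similarity >= min_phenotype_similarity:
            matches.append((other.patient_id, sorted(shared), similarity))
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Toy data: gene names are placeholders, not real gene-disease claims.
query = PatientProfile(
    "p1",
    frozenset({"HP:0001250", "HP:0001263", "HP:0000252"}),
    frozenset({"GENE_A"}),
)
database = [
    PatientProfile("p2", frozenset({"HP:0001250", "HP:0001263"}),
                   frozenset({"GENE_A", "GENE_B"})),
    PatientProfile("p3", frozenset({"HP:0004322"}), frozenset({"GENE_A"})),
    PatientProfile("p4", query.hpo_terms, frozenset({"GENE_C"})),
]
matches = find_matches(query, database)
```

In this toy data only `p2` qualifies: `p3` shares the gene but not the phenotype, and `p4` shares the phenotype but not the gene. In a federated network each participating database would run a comparison of this kind locally, and any candidate match would still go through clinical review.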

While the MME has significantly advanced diagnostic potential for patients with very rare diseases, it does depend upon a patient being registered within a participating MME database. To increase computability of the phenotype data and to maximize potential open data sharing of patient phenotype information, the GA4GH created Phenopackets. Phenopackets is a standard file format that enables structured sharing of information about a participant’s phenotype, such as clinical diagnosis, age of onset, results from lab tests, and disease severity.111 It can link to separate files containing a patient’s genetic sequence and pedigree, if available. Phenopackets are expected to standardize phenotypic data exchange within medical and scientific settings. This will allow phenotypic data to flow among clinics, databases, clinical labs, journals, and patient registries in ways that are currently feasible only for more quantifiable data, like sequence data. As more Phenopackets for RD patients are shared, clinicians, biologists, RD registries, and disease and drug researchers will build more complete models of disease and match similar patients (Figure 2). In addition, the use of Phenopackets to better represent and share the heterogeneity of RD presentation will lend itself well to drug repurposing. However, repurposing drugs similarly relies on sharing knowledge that has already been generated but may otherwise be difficult to access for those trying to repurpose.112 Monarch’s RD diagnosis tool Exomiser113,114 now takes Phenopackets as input, and Phenopackets are being adopted for projects such as the Japanese Agency for Medical Research and Development’s BioBank Network (biobank-search.megabank.tohoku.ac.jp) as well as SOLVE-RD (solve-rd.eu), the RD project of the European Commission.

Figure 2.

Phenopackets provide a mechanism for structured, de-identified, patient-level phenotype data sharing for computational use across the globe and in different information systems. Image credit: GA4GH Communications Team.
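
To give a feel for what such a structured record might look like, the sketch below builds a small record loosely modeled on the Phenopacket layout and round-trips it through JSON. The field names follow our reading of the GA4GH schema, but the record is deliberately simplified, is not validated against the official specification, and uses invented identifiers; it should be read as an illustration rather than a conformant Phenopacket.

```python
import json

# Simplified, illustrative record loosely modeled on the GA4GH Phenopacket
# layout. It is NOT validated against the official schema, and the
# identifiers below are invented for the example.
phenopacket = {
    "id": "example-record-1",  # hypothetical record identifier
    "subject": {"id": "patient-1", "sex": "FEMALE"},  # de-identified subject
    "phenotypicFeatures": [
        {"type": {"id": "HP:0001250", "label": "Seizure"}},
        {"type": {"id": "HP:0000252", "label": "Microcephaly"}},
    ],
    "metaData": {
        "resources": [
            {"id": "hp", "name": "Human Phenotype Ontology",
             "namespacePrefix": "HP"}
        ]
    },
}


def hpo_ids(packet):
    """Extract the ontology term identifiers from a record."""
    return [f["type"]["id"] for f in packet.get("phenotypicFeatures", [])]


# Serialize for exchange, then read the record back as a receiver would.
serialized = json.dumps(phenopacket, indent=2)
restored = json.loads(serialized)
```

Because the phenotype is carried as ontology term identifiers rather than free text, a receiving system can compute on `hpo_ids(restored)` directly, for example as input to a patient-matching or diagnostic tool.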

RD registries

Registries are considered key instruments for developing RD clinical research, enhancing patient care and health planning, improving social, economic, and quality-of-life outcomes,115,116 and analyzing the natural history of RDs.117 Traditionally, registries have been either population-based or hospital-based. The former aim to capture all cases from a specific population and are focused on incident cases, seeking to describe the natural history of diseases.118 The latter provide responses to different clinical questions, serving as a source of patients for clinical trials and identifying and analyzing biomarkers as clinical prognosis factors.119 Both strategies are valid and are complementary because each can control for different types of biases.

Defining a standardized set of data elements is a key function and a key challenge for all registries120; the process of standardization is closely linked to the original sources of information used. The primary source of information is the patient and/or the physician collecting information directly from the patient; these sources have been used for centuries. However, standardizing the phenotype is not simple because we want the data collected to represent the patient’s clinical course. Standards such as CDEs, PROs,121 and ontologies such as HPO82 are not used by most registries; those that use them often do so in an ad hoc manner. Therefore, the main challenge for capture and reuse of registry data is transforming the physician’s free text or bespoke encoding into a standardized form. Specifically, how can the reliability between observers and within observers be guaranteed in an RD registry?122 Is the phenotype collected at a single point enough to define the full natural history of the disease? How long should the follow-up period be for each specific RD? How can a registry help in the analysis of natural and temporal variability of diseases? In fact, the only way to provide valid health outcomes is to guarantee the quality of all procedures included in the registry123; the use of ontologies instead of classical registry-specific standardization provides added value. Such standardization uses strict definitions, controls all parameters for each data element, and provides a high level of certainty about the data already collected and saved. Conversely, ontologies allow clinicians a certain level of confidentiality and flexibility because the terms are probabilistically linked. Ontologies and related standards facilitate data sharing among registries and improve interoperability between clinical and research systems.
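
As a rough illustration of what transforming free text into a standardized form involves, the sketch below maps phrases in a clinical note to HPO terms using a tiny hand-written lookup table. The table, the function name `encode_note`, and the example note are assumptions made for this example; the phrases are not taken from the official HPO synonym lists, and a production registry would rely on the full ontology and a proper concept-recognition tool.

```python
import re

# Tiny hand-written lookup table for illustration only; the phrases are
# our own and are not the official HPO synonym lists.
PHRASE_TO_HPO = {
    "seizure": ("HP:0001250", "Seizure"),
    "seizures": ("HP:0001250", "Seizure"),
    "microcephaly": ("HP:0000252", "Microcephaly"),
    "small head": ("HP:0000252", "Microcephaly"),
    "short stature": ("HP:0004322", "Short stature"),
    "global developmental delay": ("HP:0001263", "Global developmental delay"),
}


def encode_note(note):
    """Return sorted (HPO id, label) pairs for phrases found in a note.

    Matching is whole-phrase and case-insensitive. Negation ("no seizures")
    and misspellings are NOT handled in this sketch.
    """
    text = re.sub(r"[^a-z0-9 ]+", " ", note.lower())  # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()          # collapse whitespace
    padded = f" {text} "                              # enforce word boundaries
    found = {}
    for phrase, (hpo_id, label) in PHRASE_TO_HPO.items():
        if f" {phrase} " in padded:
            found[hpo_id] = label  # dict removes duplicate terms
    return sorted(found.items())


note = "Small head noted at birth; recurrent seizures. Short stature."
terms = encode_note(note)
```

Even this toy version shows why inter-observer reliability matters: two clinicians who write "small head" and "microcephaly" produce the same coded term, whereas a registry that stores the raw text would treat them as different values.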

Other secondary sources of information such as EHRs124 can provide some structured information that is usually well standardized. EHRs can provide some information for certain types of registries, but since they have been built for other purposes with different criteria, they are not always appropriate for the aims of registries (EHRs typically contain a problem list functionality, while standardized and structured capture of symptoms is almost never available). RD registries have the capacity to reveal new disease genes, modifier variants, and new or very rare phenotypes, and to support the assessment of biomarkers, new treatments, and the impact of the implementation of health measurements. However, maximizing a registry’s ability to address unmet needs of RD patients requires researchers to share data, including phenotypic and omics data. Well designed and managed registries are regularly used for these purposes, but they must adapt their methods by collecting data directly from the EHR to identify the phenotypes instead of searching and recording specific data elements.

At early stages of registry planning, patient groups can provide support both as advisors and as partners. Patient groups can propose the creation of registries to healthcare institutions and work in a partnership. These outstanding, emerging possibilities should be carefully considered.125 As a recent example, several patient associations have contacted the Italian National Health Institute to establish and maintain disease-specific registries, and formal agreements have been signed between each association and the Institute. Further, standardized registry software such as NORD’s Natural Histories Patient Registry Platform, RDconnect’s Registry Finder,126 Coordination of Rare Diseases at Sanford (CoRDS), and the Program for Engaging Everyone Responsibly (PEER) are all examples of improved interoperability and data sharing, and of evolution with the standards over time. Ideally, such platforms will eventually robustly support both patient-generated individual content and synchronization with EHR data, something that is likely to improve clinical trial efficacy, recruitment, and engagement.

In general, early engagement of patient groups can substantially contribute to the success of the registry. The patient’s general contribution will assure that the Registry meets the patient’s needs and priorities, as well as their own data sharing wishes. More specifically, patient engagement supports recruitment, relevance to patient healthcare, and the transparency of the process.127 Nevertheless, robust guidance on this issue is still insufficient and approaches to meet the challenge should be refined. Methods of engagement may vary based on the registry’s aim and many other factors. However, direct participation in the Registry governance at several levels is suitable for engaging patient partners in decision-making. Patient engagement in registries is an evolving field that presents both opportunities and challenges. Early engagement in the planning phase, consistent engagement throughout the registry functioning, relevance to patient needs, empowerment of each team component as well as transparency will create a tool that will both serve the patients and society and provide novel and integrated know-how. Simply put, RD registries are key to maximizing data sharing, patient communication across the globe for RD communities, delineating disease mechanisms, and promoting drug discovery; however, they are challenged in interoperability, maintenance, multi-model data types and sources, and governance.

Facial imaging, an artificial intelligence technology for RD research and diagnostics

The ability of artificial intelligence (AI) technologies to integrate and analyze data from different sources can be used to overcome some of the RDs’ challenges.128,129 In recent years, there have been significant advances in disease diagnosis as a result of new technologies for collecting and analyzing data. Researchers and clinicians are using these technologies to diagnose rare genetic diseases by scanning a person’s face or a photograph. AI can also be applied to speech structure and patient movement.129

The eagerness of the RD patients and their advocates to share their data and collaborate, despite the many privacy concerns, has facilitated the implementation of state of the art technologies in diagnosis and improving quality of life. Such new technologies include the ability to diagnose rare genetic diseases by scanning a person’s face or a photograph. Many RDs are manifested in a distinctive and recognizable facial phenotype, such as Noonan syndrome and Cornelia de Lange syndrome.130,131 Algorithms that analyze facial images have matured in recent years so that they predict several hundred RDs with a high degree of accuracy.20 Three-dimensional facial analysis (3DFA), an evolving deep phenotyping application, provides detailed representation and analysis of the RD phenotypes that can generate biological insights. In the RD domain, 3DFA is increasingly being implemented primarily for diagnostic purposes132–134 but also for monitoring existing and trial therapies.133,135,136

Advanced facial analysis platforms such as Cliniface,137 FACE2GENE,138 FaceBase,139 and DeepGestalt130 can point doctors in the direction of specific disorders or genes that could be responsible for the patient’s symptoms, potentially reducing the number of diagnostic tests needed to confirm the diagnosis. Facial analysis can also offer greater diagnostic certainty when the genetic causation remains undetermined or when molecular testing is unavailable, for example, in resource poor environments. AI and other analytic approaches provide objective analysis of phenotypes and the association of phenotype and genotype to streamline diagnostics, including genomic sequence interpretation.129,140 The application of facial analysis to RD diagnosis and care will require open source approaches as well as platforms that facilitate pre-competitive tools and partnerships, and that can be integrated with multi-omics initiatives.

An example is the Cliniface 3D facial analysis platform.137 Cliniface 3D tools have been shared for integration in multi-omics platforms for RD research, including through the Personalized Medicine Center for Children at the Telethon Kids Institute, and it is being prepared for partnership with the National Rare Diseases Registry System of China.141 Cliniface has been implemented across multiple research and clinical environments, including state-wide for the Western Australian Health Department, and is being increasingly integrated with the Patient Archive knowledge management platform142 which is connected to MME.110 Cliniface converts 3D facial images to text-based descriptions, specifically HPO terms. Converting face-to-text reduces the risk of individual identification, mitigating against the inherently identifying nature of facial data. These text-based descriptions can be shared through MME or Phenopackets, and they can be incorporated into text-based diagnostic support algorithms.

One of the most promising resources for facial data sharing is the Minerva Initiative.143 While it was originally launched for 2-D data sharing, the underlying principles are intentionally extensible to 3-D data. The initiative includes a research data resource (Minerva Image Resource—MIR) and an open research consortium (Minerva Consortium—MC) which allows the sharing of identifiable patient data, such as facial photographs and collaborative research projects on RD. It operates in the spirit of Open Science to enable precision public health. The Minerva Initiative has the following objectives: to build a community of researchers and clinicians, to continue to develop ethical structures and provisions for working on identifiable clinical images, and to deliver secure data sharing among consortium members. It has been constructed to align with the goals and objectives of the GA4GH.144 The Minerva Consortium (MC) is an international network of clinicians and researchers, from both public and private organizations. The public website Minerva&Me allows anyone around the world to participate directly in the Minerva Initiative.145 Initiatives such as the Minerva Initiative are poised to lead the way in terms of not only amassing data but also using integrative technologies for accessing and using data at the point of care.

While 3DFA was originally developed for RD diagnostic applications, it can also be applied to treatment monitoring for both rare and common diseases, as demonstrated in a new project traversing specialties at the Perth Children’s Hospital and Western Australia’s premier clinical trials facility, Linear Clinical Research. In addition, while 3DFA is yielding translational insights into innovations for diagnosis, treatment, and monitoring in the RD domain, it also examines the overlap between rare and more common diseases and, therefore, mechanistic research. Notably, population-level studies demonstrated that common genetic variations (polymorphisms) were associated with discrete patterns of facial variation. Moreover, these facial signatures recapitulated the characteristic facies of the respective genetic syndrome due to rare genetic variation (pathogenic variants).146 An example of a common disease that is poised for 3D facial translational research is obstructive sleep apnea (OSA). OSA is a condition seen in RDs such as mucopolysaccharidoses, where it regularly has an earlier onset than in the general population. These findings highlight the overlap between common and rare phenotypes, with implications for possible reciprocal (rare-common) insights.147

Data acquisition, analysis, and sharing mechanisms for identifiable facial data are key to RD diagnosis and research, but specialized approaches are required to simultaneously facilitate more Open Science while respecting patient privacy.

Telemedicine

Developments in modern communication technology such as telemedicine have created new opportunities for the delivery of health services to remote areas and unprivileged communities. Telemedicine refers to communication tools for medical care delivery at a distance, including telephones, smart phones, interactive televideo, “store-and-forward” images and medical record transmission via personal computers, and remote monitoring.148 High-speed telecommunications systems, in addition to the invention of devices capable of capturing and transmitting images and other data in digital form, have facilitated better sharing, collaboration, and efficiency in telemedicine. As a result, health professionals can communicate faster, more widely, and more directly with other clinicians and patients regardless of location.

Access to medical care is a major concern for RD patients and their families, and not only in rural areas and developing countries. Among the main issues are a lack of physicians specialized in RD treatment; concerns about sharing personal information and the security of personal information; few programs and resources to support low socioeconomic families with travel accommodation; and loss of income associated with obtaining care from specialists at long distances.149

RD and undiagnosed patients are usually dispersed over a large geographical area, yet they require multidisciplinary experts. As a result, a correct diagnosis may be delayed, and ready access to ongoing care is limited. Thus, telemedicine can profoundly change patient care for individuals with RD and directly address challenges of geography, travel burden, and access to experts; it can provide open access and global data sharing. Telemedicine can increase patient access to health care services that are otherwise unavailable,149 including for patients in developing countries and rural/remote areas.150 If utilized to its potential, telemedicine may open the way for more equitable distribution of knowledge and medical care throughout the world.149,150 In 2020, the Mayo Clinic plans to serve 200 million patients, many of them from outside the United States and most of them remotely.149

Telemedicine can revolutionize the way in which healthcare is delivered and allow the home to become a preferred place of care. The advantages of this approach are patient satisfaction, reduced travel requirements to health care providers, clinics and hospitals, early intervention for disease progression, support for caregivers, and economic benefits associated with reduced hospitalization rates.151

In addition to the increased connectivity between providers and patients, telemedicine also provides a means for researchers to connect to potential participants. Mobile and wearable medical devices enable patients to share and transmit a wealth of digital health data to databases contributing to patient registries, natural history studies, and clinical trials. Telemedicine has already been used and proven its value for chronic non-RDs, such as congestive heart failure and chronic obstructive pulmonary disease152 as well as some RDs, such as mesothelioma,153,154 cystic fibrosis,155 diabetes,156–159 Prader–Willi syndrome,160 and juvenile idiopathic arthritis.161 As promising as telemedicine sounds, it cannot be a replacement for in-person examination. There are significant limitations and barriers that need to be addressed and overcome, including quality of patient-clinician interaction, insurance coverage, reimbursement for services, privacy and legal issues of state licensure laws and liability concerns.

Providing care through telemedicine technology may not work for every organization. However, with the move toward personalized medicine, incorporating telemedicine into the health system can offer benefits to physicians and patients.162 Examples of reductions in the use of services include fewer hospital admissions and re-admissions, shorter hospital stays, and fewer emergency department visits, which translate into reduced mortality.152 To increase the uses and implementation of telemedicine, more resources and studies are needed to evaluate the net value, visibility, and access for patients and the health care providers.

Telemedicine can emerge as an important component of the health care delivery system that relies on sharing medical information, knowledge and collaboration, which are the building blocks necessary to facilitate Open Science. RD patients and their families seem to more enthusiastically share personal information and collaborate because they desperately want to find the correct diagnosis, experts, and treatment.

In the context of global health, telemedicine is beginning to have an important impact on many aspects of healthcare, especially in developing countries and in rural areas, opening the way for distribution of knowledge and medical care throughout the world.150 Although Open Science can aid the RD community, the RD community can in turn be instrumental in furthering the development of Open Science by adopting and incorporating telemedicine and new technologies into health care delivery.

Ethical and legal considerations

We have demonstrated the significant need for Open Science practice to share data and collaborate in support of RD diagnosis, research, and patient care. Open Science creates some dilemmas and opposing forces with regard to privacy and ethical concerns. Experience with the RD patients and their families has demonstrated that their eagerness for adequate diagnosis and treatment overrides the privacy concerns. Nevertheless, as new technologies and systems are developed and implemented, the ethical and legal challenges increase.163,164

Global data sharing creates significant challenges for the responsible stewardship of the growing number of large and complex datasets, including oversight, accountability, and data management. Ethical and legal frameworks are required to protect the rights of affected individuals, while still sharing data appropriately to promote progress in RD research and health care. For example, before increasing the availability and dissemination of RD patient data, scientists must consider participant protections and appropriate data use, consent, and participant understanding of data sharing, ownership, reuse, analysis and the generation of new or derived data, among other concerns.

A number of international organizations165 have devoted considerable attention and resources to developing regulatory frameworks for open data production, dissemination and use. The frameworks have resulted in national policies on Open Science, and documents such as the international accord Open Data in a Big Data World,166 the Open Science policy by the European Commission,167 and the National Academies of Science “Open Science by Design: Realizing a Vision for 21st Century Research”,168 which recommends that data be made FAIR based on legal and ethical considerations. The Biobanking and BioMolecular Resources Research Infrastructure- European Research Infrastructure Consortium (BBMRI ERIC) is building an international Code of Conduct for health research, with the aim of contributing to the proper application of the regulation, taking into account the specific features of processing personal data in the area of health research in order to clarify and specify certain rules of the General Data Protection Regulation (GDPR) for those who process personal data for scientific research in the area of health.169 Similarly, the NIH Genomic Data Sharing Policy170 includes provisions for the sharing of large-scale genomic data while taking into account participant protections and limitations on data use based on the consent of the study participants. All of these efforts emphasize the importance of good data practices in the sharing, dissemination, and re-use of biomedical data, particularly considering issues of privacy, confidentiality, intellectual property and security.165 Clinical trial data sharing has been particularly challenging when it involves the pharmaceutical industry or other entities with IP interests. With the extremely small RD cohorts, it is especially important to coordinate and to share results globally. 
The Vivli (https://vivli.org/) program aims to support sharing and reuse of clinical research data, including individual participant-level data from completed clinical trials globally. Medical journals generally require clinical trials to be posted on clinicaltrials.gov, and FDAAA requirements include submission of the data of both positive and negative studies. However, for RDs, it is especially important to share trial information as trials are being designed and launched, so that different studies can be aligned and patients can be recruited from around the world. PAGs for each RD have been successful in doing so, and approaches such as those attempted in OpenTrials (https://opentrials.net/) are laudable but are not yet established enough to support the RD community.

Approaching ethical challenges from an international standpoint is central to the promise of Open Science.171

Addressing Indigenous rights and interests in genomic and other data sharing is critical for equitable scientific translation. While Indigenous experiences with genetic research have been shaped by a series of negative interactions, there is increasing recognition that equitable benefits can only be realized through greater participation of Indigenous communities. Issues of trust, accountability, return of benefit and equity will need to be addressed. In this context, it is notable that the Research Data Alliance International Indigenous Data Sovereignty Interest Group172 developed the CARE Principles for Indigenous Data Governance. These principles identify Collective benefit, Authority to control, Responsibility, and Ethics to be used alongside other data centric principles.

While critically important and relevant to all health care domains, endeavors such as these do not specifically take into account the special Open Science needs of RD patients and caregivers. RD advocacy and methods for robust and informed data sharing must be developed alongside policies and secure infrastructure that are specifically designed for sharing data about RD patients, who by their very rarity have a much greater likelihood of re-identification or even a desire to share identified data. Toward the end of supporting genomic health ethics for all types of genetic diseases, the GA4GH has created a “Framework for Responsible Sharing of Genomic and Health-Related Data”. It contains foundational principles and core elements for responsible data sharing and is guided by concern for human rights, including the right to benefit from the progress of science, as well as privacy, non-discrimination, and procedural fairness. This is pivotal for RD patients, since traditional medical privacy laws around the world may not adequately support the Open Science strategies that RD diagnosis and research necessitate.

The best practices and ethical-legal considerations for FAIR data sharing in the context of RDs are still evolving. A necessary improvement in the management of data in a FAIR-er direction is the annotation of patient data with Ethical Legal Social Issue (ELSI) requirements and choices as determined at the time of collection. This could constitute a great addition to the quality of data that could be transferred along with the data itself. This means that if a certain dataset was collected under the condition of a specific use (eg, cancer research only in Europe) this information should travel with the data, ensuring sustainable and ELSI reusability of the data. While the promises of the Open Science paradigm and the FAIRification of data are key to effective research, especially in RD, compliance with existing regulatory requirements and ethical norms is necessary to ensure long-term sustainability of data stewardship.
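
One way to picture use conditions that "travel with the data" is to store them in the same record as the dataset and check every access request against them. The sketch below does this for the example given above (cancer research only in Europe). The class names, field names, and the simple purpose and region check are assumptions made for illustration; they do not represent an established ELSI metadata standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseConditions:
    """Hypothetical consent annotation recorded at the time of collection."""
    allowed_purposes: frozenset
    allowed_regions: frozenset


@dataclass(frozen=True)
class Dataset:
    """A dataset that carries its own use conditions wherever it is copied."""
    name: str
    conditions: UseConditions


@dataclass(frozen=True)
class AccessRequest:
    purpose: str
    region: str


def is_permitted(dataset, request):
    """Allow access only if both purpose and region match the conditions."""
    c = dataset.conditions
    return (request.purpose in c.allowed_purposes
            and request.region in c.allowed_regions)


# The example from the text: collected for cancer research in Europe only.
cohort = Dataset(
    name="example-cohort",  # invented name
    conditions=UseConditions(
        allowed_purposes=frozenset({"cancer research"}),
        allowed_regions=frozenset({"Europe"}),
    ),
)

ok = is_permitted(cohort, AccessRequest("cancer research", "Europe"))
wrong_region = is_permitted(cohort, AccessRequest("cancer research", "North America"))
wrong_purpose = is_permitted(cohort, AccessRequest("rare disease research", "Europe"))
```

Keeping the conditions inside the dataset record, rather than in a separate agreement, means that a downstream repository can apply the same check without having to recover the original consent documents.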

New perspectives, understanding and challenges introduced by rapidly developing machine learning approaches increase the necessity of open data sharing to realize the public good, but simultaneously can give rise to new ethical and legal dilemmas. Among the challenges already becoming apparent are the potential risks for re-identification, incidental/secondary findings, and biases for equitable access to algorithmically assisted decision making. Particularly in the context of RD, implications of new machine learning will influence the best practices and acceptable frameworks for FAIR data sharing in the coming years.

A ROADMAP FOR OPEN SCIENCE IN RARE DISEASES

To accelerate the diagnosis and care of RD patients, we propose a set of recommendations to advance Open Science:

  1. Create shared RD definitions, models, and governance.

  2. Consider how to realize the FAIR principles in all aspects of the RD data lifecycle for any given RD, clinical system, or research initiative.

  3. Create metrics for successful compliance with RD-GO FAIR.

  4. Support RD tools that enable patients to share their own data in a well-informed manner and establish standards for consistent representation of phenotype data (eg, Phenopackets and HPO) as well as genotype and pedigree data.

  5. Adopt new standards for registries to support interoperability and data sharing internationally.

  6. Develop methods to create “proxy” data to share representations or subsets of personally identifiable data (such as facial images) in a deidentified manner.

  7. Establish networks of controlled-access data that can be searched using diagnostic algorithms for research on RDs.

  8. Increase centers specializing in RDs, train more clinicians in diagnosing and treating RDs, and create improved clinical decision-making guidelines related to RDs.

  9. Create opportunities for patients to be better informed and encourage patient engagement with the scientific community to increase openness and data sharing.

  10. Welcome and attribute openly-developed novel technologies and interventions in RD clinical settings.

A fundamental component of addressing the RD public health challenge involves improvements in ethical Open Science, whose core principles are data sharing and collaboration. RD families and their advocates, as well as RD physicians and scientists, have led the way toward openness, data sharing, and collaboration to find diagnoses, treatments, and improved quality of life. Despite privacy concerns, institutional policies, and technological barriers, the RD community has demonstrated that they are thought leaders in Open Science, forging the way forward for the world.

FUNDING

This work was supported by the U.S. Department of Health and Human Services National Institutes of Health (5r24od011883), National Institutes of Health (NIH) Office of the Director (OD); the Monarch Initiative (1R24OD011883) and the Eunice Kennedy Shriver National Institute of Child Health and Human Development (U54 HD079123).

AUTHOR CONTRIBUTIONS

All authors contributed texts and ideas, participated in critical revision of the article, and approved the final version. WAG, RMG, MAH, PNR, and YRR participated in the initial discussion and the conception/design of the article. MAH, PNR, and YRR supervised the conception, design, and revision of the manuscript.

ACKNOWLEDGMENTS

This research was supported in part by the Office of Strategic Initiatives, Office of the Director, and the Intramural Research Program at the National Library of Medicine (NLM) at the National Institutes of Health (NIH). The findings and conclusions in this article are those of the authors and do not necessarily represent the official position of HRSA, NIH, NLM, or the Department of Health and Human Services.

The authors would like to thank Dr. Mike Huerta (the Director of the Office of Strategic Initiatives, Associate Director of the National Library of Medicine, NIH) for his enthusiastic support to initiate this article and his fruitful and valuable advice throughout the writing of the manuscript.

We would like to dedicate this article to Dr. Stephen Groft, for his great mentorship and exemplary caring and kindness toward others. Dr. Groft devoted his career to the Rare Disease community, providing hope and voice for rare disease patients and their families. This article is also dedicated to the rare disease patients, their families, the caregivers, the medical practitioners and the research community, for their commitment to sharing data and Open Science.

CONFLICT OF INTEREST

MAH and JAM have a conflict of interest. Both are the founders of Pryzm Health.

REFERENCES


Articles from JAMIA Open are provided here courtesy of Oxford University Press
