Abstract
Despite the increasing prevalence of clinical sequencing, the difficulty of identifying additional affected families is a key obstacle to solving many rare diseases. There may only be a handful of similar patients worldwide, and their data may be stored in diverse clinical and research databases. Computational methods are necessary to enable finding similar patients across the growing number of patient repositories and registries. We present the Matchmaker Exchange Application Programming Interface (MME API), a protocol and data format for exchanging phenotype and genotype profiles to enable matchmaking among patient databases, facilitate the identification of additional cohorts, and increase the rate at which rare diseases can be researched and diagnosed. We designed the API to be straightforward and flexible in order to simplify its adoption across a large number of data types and workflows. We also provide a public test data set, curated from the literature, to facilitate implementation of the API and development of new matching algorithms. The initial version of the API has been successfully implemented by three members of the Matchmaker Exchange and was immediately able to reproduce previously identified matches and generate several new leads currently being validated. The API is available at https://github.com/ga4gh/mme-apis.
Keywords: MME, patient matchmaking, genomic API, rare disease, GA4GH, HPO, Matchmaker Exchange
Introduction
Rare genetic disorders collectively affect around 350 million people worldwide, but the number of people affected by any one of these disorders can be extremely small. These individuals may be seen by different clinicians and sequenced at different centres, with each individual’s data being stored in one of a rapidly growing number of different databases and patient registries. Siloing of data severely impedes the discovery of genetic causes of these disorders, while directly copying such data across various resources is impossible due to a number of legal and privacy concerns. Emerging efforts such as the Global Alliance for Genomics and Health (GA4GH) APIs are designed to facilitate the exchange of genetic data between such databases; however, these currently target genetic data and hypothesis-driven queries. To address the need for flexible data sharing amongst resources with rare disease patient data, we developed the Matchmaker Exchange Application Programming Interface (MME API), a data format and protocol for querying databases to identify individuals with similar phenotypic profiles and genetic variation, a process we call “matchmaking.”
The MME API specifies the format of both the query, which is sent to participating databases (which we call “matchmaker services”), and the response, which contains information about matching individuals in the remote database. The initial version of this API follows a query-by-example philosophy, in which the request is simply a description of the individual to be matched and the response is a list of the descriptions of similar individuals. Because the API is built around the description of an individual rather than a complex query language, it is easy to understand, straightforward to implement, and provides the various databases the flexibility of experimenting with matching algorithms and regulating the amount of data that is disclosed. Further, because the case is used as the query, more specific and complete case records will return more relevant matches, thus encouraging users to submit the most complete and specific case information possible.
The sharing and automated analysis of genetic and phenotypic data has necessitated standardization using a number of ontologies and controlled terminologies. In this API, we use the Sequence Ontology (Eilbeck et al., 2005) to describe the class of each genetic variant (e.g. whether it is an insertion, deletion, or SNV; missense or stopgain, etc.) and the Human Phenotype Ontology (HPO) (Köhler et al., 2014) to describe patient phenotypes. The HPO has over 11,000 terms corresponding to phenotypic abnormalities, which are structured from general (e.g. “abnormality of the nervous system”) to specific (e.g. “atonic seizures”). Importantly, the HPO follows the “true path rule”, which states that the presence of a lower-level term implies the presence of all ancestors of the term (a patient with “atonic seizures”, by definition, also has “seizures” and has an “abnormality of the nervous system”). This feature makes it possible to “obfuscate” a term by using one of its ancestors instead, and to match distinct but related terms by identifying shared ancestors.
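The true path rule, obfuscation, and matching by shared ancestors can be illustrated with a minimal sketch. The three-level hierarchy below uses the term labels from the text; the sibling term “Focal seizures” and its placement, the use of labels instead of HPO identifiers, and all function names are assumptions made for illustration and do not reflect the real HPO graph.

```python
# Toy ontology fragment: each term maps to its parent terms.
# Simplified for illustration; this is NOT the real HPO structure.
PARENTS = {
    "Atonic seizures": ["Seizures"],
    "Focal seizures": ["Seizures"],  # hypothetical sibling, assumed placement
    "Seizures": ["Abnormality of the nervous system"],
    "Abnormality of the nervous system": [],
}


def ancestors(term, parents=PARENTS):
    """Return every ancestor of a term (the term itself is excluded)."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def implied_terms(term, parents=PARENTS):
    """True path rule: a term implies itself and all of its ancestors."""
    return {term} | ancestors(term, parents)


def obfuscate(term, levels=1, parents=PARENTS):
    """Replace a term with an ancestor `levels` steps up (first parent each step)."""
    current = term
    for _ in range(levels):
        up = parents.get(current, [])
        if not up:
            break  # already at the root of this toy hierarchy
        current = up[0]
    return current


def shared_ancestors(term_a, term_b, parents=PARENTS):
    """Terms implied by both inputs; a basis for matching related phenotypes."""
    return implied_terms(term_a, parents) & implied_terms(term_b, parents)


print(obfuscate("Atonic seizures"))
print(sorted(shared_ancestors("Atonic seizures", "Focal seizures")))
```

In this sketch, two patients annotated with different seizure subtypes still share the more general terms, which is the property that allows related but non-identical phenotypes to contribute to a match.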
Many MME partners perform some form of internal matchmaking to identify similar patients within their database, but each organization has a different focus, collects different types of data, and stores its data in a different format. The MME API provides a standardized language for exchanging patient profiles in order to enable matchmaking between patient databases. Here we present a description of the MME API, the method used to authenticate endpoints of this API within the MME, and a test dataset available to verify that endpoints are behaving as expected and assist in the development of novel matching algorithms. The API has been developed in collaboration with the GA4GH and uses standard field names and data formats wherever possible. It complies with current best practices for Web APIs and uses JavaScript Object Notation (JSON) to encode all content that is sent and received.
Methods & Results
The Matchmaker Exchange (MME) API
The matchmaking workflow
An overview of the match request and response process is shown in Figure 1. The user starts by contributing a case to one of the Matchmaker Exchange services (Philippakis et al., 2015, this issue). On behalf of the user, the matchmaker service then queries other MME services using the MME API. These other services use the structured patient data in the query to identify and return descriptions of similar cases within their respective databases. They are not permitted to store request data for uses other than analytics and diagnostics (i.e. the data exchanged over the API does not become a part of the data stored by the receiving services). Similar cases found through the API are then reported to the users for evaluation. The users can then follow up with each other on any promising matches using contact information provided with the query and response. It is currently up to each MME service to define the process for alerting their respective users of the match (i.e. step 4 in Figure 1).
Format
The API defines a set of data types, each with a corresponding set of properties (e.g. the Disorder type has two properties, “id”, which is mandatory, and “label”, which is optional). An object is a particular example (instantiation) of a type (an example Disorder object in JSON format is: {“id”: “OMIM:269880”, “label”: “SHORT syndrome”}). The core of the format is a specification of an individual with relevant phenotypic and/or genotypic features (the Patient type, defined in Table 1). A match request (see Figure 2B) contains a single case in this format, used as the query, and the match response contains a scored list of the most similar cases in the remote system, also in this format. The Patient type is designed to be flexible to facilitate matchmaking between cases with varying degrees of phenotypic and/or genotypic detail. It can contain a list of diagnoses, phenotypic features, and/or genotypic features, along with metadata such as an identifier, sex, and contact information of the submitter of the case (so that promising matches can be followed up on). There are few required fields, making it easy to implement regardless of the data stored by the matchmaker service, and many optional fields, enabling additional information to be conveyed to improve the accuracy of matchmaking and help users interpret the matches.
Table 1.
Type | Property | Req* | Expected Type | Description | Example |
---|---|---|---|---|---|
Match Request | patient | ✓ | Patient | query patient | see Fig. 2B, lines 2–53 and Patient type |
Patient | id | ✓ | string | unique, persistent patient identifier | “F0000011” |
Patient | label | | string | human-readable identifier, no personally identifiable information | “174_170258” |
Patient | contact | ✓ | Contact | contact details for depositor of patient record | see Fig. 2B, lines 5–9 and Contact type |
Patient | species | | string | NCBI taxon identifier | “NCBITaxon:9606” |
Patient | sex | | string | genetic sex (“FEMALE”, “MALE”, “OTHER”) | “FEMALE” |
Patient | ageOfOnset | | string | age interval at onset of the majority of the symptoms (HPO term identifier) | “HP:0003623” |
Patient | inheritanceMode | | string | mode of inheritance (HPO term identifier) | “HP:0000006” |
Patient | disorders | | list of Disorders | list of diagnoses | see Fig. 2B, lines 12–17 and Disorder type |
Patient | features | † | list of Features | list of phenotypic traits | see Fig. 2B, lines 18–33 and Feature type |
Patient | genomicFeatures | † | list of GenomicFeatures | list of candidate causal genes and variants | see Fig. 2B, lines 34–52 and GenomicFeature type |
Contact | name | ✓ | string | name of the clinician or organization | “Kym Boycott” |
Contact | institution | | string | institution of the clinician | “FORGE Canada” |
Contact | href | ✓ | string | contact URL; either public webpage or email address (mailto) | “http://dx.doi.org/10.1016/j.ajhg.2011.12.001” |
Disorder | id | ✓ | string | OMIM or ORDO identifier | “MIM:136140” |
Disorder | label | | string | human-readable description | “Floating-Harbor Syndrome” |
Feature | id | ✓ | string | HPO term identifier | “HP:0004322” |
Feature | label | | string | human-readable description | “Short stature” |
Feature | observed | | string | the feature has been explicitly observed (“yes”) or explicitly not observed (“no”) | “yes” |
Feature | ageOfOnset | | string | age interval at onset (HPO term identifier) | “HP:0003577” |
GenomicFeature | gene | ✓ | Gene | candidate gene | see Fig. 2B, lines 36–38 and Gene type |
GenomicFeature | variant | | Variant | candidate variant in gene | see Fig. 2B, lines 39–45 and Variant type |
GenomicFeature | zygosity | | number | allelic dosage (1: heterozygous, 2: homozygous) | 1 |
GenomicFeature | type | | GenomicFeature Type | cDNA effect of the mutation | see Fig. 2B, lines 47–50 and GenomicFeature Type type |
Gene | id | ✓ | string | gene symbol, Ensembl gene ID, or Entrez gene ID | “SRCAP” |
Variant | assembly | ✓ | string | reference assembly identifier | “GRCh37” |
Variant | referenceName | ✓ | string | chromosome | “16” |
Variant | start | ✓ | number | start position (0-based) | 30748691 |
Variant | end | | number | end position (0-based, exclusive) | 30748692 |
Variant | referenceBases | | string | VCF-style reference allele of at least one base | “C” |
Variant | alternateBases | | string | VCF-style alternate allele of at least one base | “T” |
GenomicFeature Type | id | ✓ | string | SO term identifier | “SO:0001587” |
GenomicFeature Type | label | | string | human-readable description | “STOPGAIN” |
Match Response | results | ✓ | list of Match Results | list of similar/matching patients | see Fig. 2D, lines 2–10 and Match Result type |
Match Result | score | ✓ | Match Score | scoring details for the match | see Fig. 2D, lines 4–6 and Match Score type |
Match Result | patient | ✓ | Patient | matching patient | see Fig. 2D, line 7 and Patient type |
Match Score | patient | ✓ | number | overall match score (in the range [0, 1], where 0.0 is a poor match and 1.0 is a perfect match) | 0.983 |
Example values are taken from a patient description in Hood et al. (2012).
* The “Req” column contains a check mark for properties that are mandatory for objects of the given type.
† It is preferred to have both the “features” and “genomicFeatures” properties defined for every Patient object; it is mandatory to have at least one of the two.
Standardized identifiers and ontologies are used wherever possible. Diagnoses are specified using OMIM (Hamosh et al., 2005) or Orphanet (http://www.orphadata.org/) identifiers. Each phenotypic feature (a Feature object) is specified using a term from the HPO, and can be recorded as either observed (the default) or explicitly absent (it may be important for similarity measures and differential diagnosis to know if particular features or co-morbidities were explicitly checked for but not observed in the individual). To protect privacy, phenotypic features can be intentionally obfuscated in the query or the response by substituting HPO terms with ancestors of those terms. Each genotypic feature (a GenomicFeature object) represents a candidate gene or variant believed to be directly involved in the individual’s phenotype. It contains a gene identifier, specified as an HGNC gene symbol, an Ensembl gene identifier, or an Entrez gene identifier, and can include details about the type of variant (specified as a Sequence Ontology term) and/or the specific variant with respect to a reference genome. Extensive additional documentation is available on the GitHub page (https://github.com/ga4gh/mme-apis).
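As an illustration of the format, the sketch below assembles a match request from the example values in Table 1 and checks the mandatory properties. The helper name `validate_patient` and its error messages are hypothetical; the checks cover only the requirements stated in Table 1 and are not a substitute for the full specification on the GitHub page.

```python
import json


def validate_patient(patient):
    """Return a list of problems; an empty list means the required fields are present."""
    problems = []
    if not patient.get("id"):
        problems.append("patient.id is required")
    contact = patient.get("contact")
    if not isinstance(contact, dict):
        problems.append("patient.contact is required")
    else:
        for key in ("name", "href"):
            if not contact.get(key):
                problems.append(f"contact.{key} is required")
    # At least one of the two feature lists must be present and non-empty.
    if not patient.get("features") and not patient.get("genomicFeatures"):
        problems.append("at least one of features or genomicFeatures is required")
    for index, feature in enumerate(patient.get("features") or []):
        if not feature.get("id"):
            problems.append(f"features[{index}].id is required")
    for index, genomic in enumerate(patient.get("genomicFeatures") or []):
        gene = genomic.get("gene")
        if not isinstance(gene, dict) or not gene.get("id"):
            problems.append(f"genomicFeatures[{index}].gene.id is required")
        variant = genomic.get("variant")
        if variant is not None:
            for key in ("assembly", "referenceName", "start"):
                if key not in variant:
                    problems.append(f"genomicFeatures[{index}].variant.{key} is required")
    return problems


# Example patient built from the example column of Table 1.
patient = {
    "id": "F0000011",
    "label": "174_170258",
    "contact": {
        "name": "Kym Boycott",
        "institution": "FORGE Canada",
        "href": "http://dx.doi.org/10.1016/j.ajhg.2011.12.001",
    },
    "species": "NCBITaxon:9606",
    "sex": "FEMALE",
    "disorders": [{"id": "MIM:136140", "label": "Floating-Harbor Syndrome"}],
    "features": [{"id": "HP:0004322", "label": "Short stature", "observed": "yes"}],
    "genomicFeatures": [
        {
            "gene": {"id": "SRCAP"},
            "variant": {
                "assembly": "GRCh37",
                "referenceName": "16",
                "start": 30748691,
                "end": 30748692,
                "referenceBases": "C",
                "alternateBases": "T",
            },
            "zygosity": 1,
            "type": {"id": "SO:0001587", "label": "STOPGAIN"},
        }
    ],
}

# The match request wraps the patient in a single "patient" property.
request_body = json.dumps({"patient": patient}, indent=2)
print(validate_patient(patient))
```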
The match response (see Figure 2D and Table 1) contains a list of the cases in the database most similar to the case specified in the query, scored according to the particular matchmaker service’s matching algorithm. Scores must be a number between 0.0 (a poor match) and 1.0 (an excellent match), but scores are not yet comparable across matchmaker services as matching algorithms vary. Currently, only an overall score for the strength of each match is required, but more detailed scoring of the phenotypic and genotypic aspects of each match will likely be added in future versions.
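A receiving client might check the score range and order the results before presenting them to a user. The following is a minimal sketch under that assumption; the function name `ranked_matches` and the patient identifiers in the example response are hypothetical, and because scores are not comparable across services, the ranking is only meaningful within a single response.

```python
def ranked_matches(response):
    """Validate scores and return match results sorted from best to worst."""
    results = response["results"]
    for result in results:
        score = result["score"]["patient"]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {score}")
    return sorted(results, key=lambda r: r["score"]["patient"], reverse=True)


# Hypothetical response from a single matchmaker service.
example_response = {
    "results": [
        {"score": {"patient": 0.41}, "patient": {"id": "remote-2"}},
        {"score": {"patient": 0.983}, "patient": {"id": "remote-1"}},
    ]
}

for match in ranked_matches(example_response):
    print(match["patient"]["id"], match["score"]["patient"])
```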
API versioning
The MME API is semantically versioned (http://semver.org/), with version numbers taking the form “X.Y”, where X is incremented for major releases and Y is incremented for backwards-compatible minor releases. Every request must specify the API version within the HTTP Accept header, and the remote server must provide the API version of the response in the Content-Type header of every response (see Figure 2A and 2C).
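Version negotiation through headers can be sketched as follows. The vendor media type string used here is an assumption that is not stated in this text (consult the GitHub documentation for the exact value), as are the function names and the rule that versions sharing a major number are treated as compatible.

```python
import re

# Assumed media type pattern; verify against the specification before use.
MEDIA_TYPE_TEMPLATE = "application/vnd.ga4gh.matchmaker.v{version}+json"
_VERSION_RE = re.compile(r"application/vnd\.ga4gh\.matchmaker\.v(\d+)\.(\d+)\+json")


def media_type(version):
    """Build the versioned media type for a version string such as '1.0'."""
    return MEDIA_TYPE_TEMPLATE.format(version=version)


def request_headers(version):
    """Headers a client would send to state which API version it expects."""
    return {"Accept": media_type(version)}


def parse_version(header_value):
    """Extract (major, minor) from an Accept or Content-Type value, or None."""
    match = _VERSION_RE.search(header_value or "")
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_compatible(requested, served):
    """Assumption: minor releases are backwards compatible, so majors must agree."""
    return requested is not None and served is not None and requested[0] == served[0]


sent = parse_version(request_headers("1.0")["Accept"])
received = parse_version(media_type("1.1"))
print(sent, received, is_compatible(sent, received))
```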
Error handling
The remote server should use HTTP status codes to report any error encountered while processing the match request. Table 2 lists the status codes and their meanings with regard to this API. The error response should include a JSON-formatted body with a human-readable “message” containing further details about the error (see Figure 2E). The exact error message is up to the implementer, and additional fields can be provided with further information.
Table 2.
HTTP Status Code | Reason Phrase | Description |
---|---|---|
200 | OK | no error |
400 | Bad Request | missing/invalid data |
401 | Unauthorized | missing/invalid authentication token |
405 | Method Not Allowed | invalid method (POST required) |
406 | Not Acceptable | missing/unsupported API version |
415 | Unsupported Media Type | missing/invalid content type |
422 | Unprocessable Entity | missing/invalid request body |
500 | Internal Server Error | default error |
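A client can translate these status codes into readable errors. The sketch below mirrors Table 2 and reads the optional “message” field from the error body; the exception class `MatchError` and the function `check_response` are hypothetical names introduced for illustration.

```python
import json

# Status codes and descriptions as listed in Table 2.
STATUS_MEANINGS = {
    200: "no error",
    400: "missing/invalid data",
    401: "missing/invalid authentication token",
    405: "invalid method (POST required)",
    406: "missing/unsupported API version",
    415: "missing/invalid content type",
    422: "missing/invalid request body",
    500: "default error",
}


class MatchError(Exception):
    """Raised when a matchmaker service reports an error."""


def check_response(status, body):
    """Return the parsed JSON body on success, otherwise raise MatchError."""
    if status == 200:
        return json.loads(body)
    meaning = STATUS_MEANINGS.get(status, "unexpected status")
    try:
        detail = json.loads(body).get("message", "")
    except (TypeError, ValueError, AttributeError):
        detail = ""  # the body was missing, not JSON, or not a JSON object
    suffix = f": {detail}" if detail else ""
    raise MatchError(f"{status} {meaning}{suffix}")


try:
    check_response(406, '{"message": "unsupported API version"}')
except MatchError as error:
    print(error)
```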
Request authentication in the Matchmaker Exchange
All communication between servers in the Matchmaker Exchange must occur over secure HTTP (HTTPS), and requests are currently authenticated through a simple yet effective protocol. If Matchmaker B wishes to accept match requests from Matchmaker A, Matchmaker B securely sends a secret authentication token to Matchmaker A (e.g. through encrypted email). We recommend the authentication token be a randomly generated SHA1 hexadecimal digest. This authentication token must be specified as the X-Auth-Token header of all requests that Matchmaker A makes to Matchmaker B (see Figure 2A). Matchmaker B will then verify the authentication token and may perform additional checks such as validating the originating IP address of the request (though this is not required). We are currently exploring support for a federated user authentication scheme, such as OAuth 2.0 (http://oauth.net/), in future versions of the API.
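The token exchange described above can be sketched with the Python standard library. Hashing random bytes to obtain a SHA1 hexadecimal digest and comparing tokens in constant time are implementation choices assumed here, not requirements of the API, and the function names are hypothetical.

```python
import hashlib
import hmac
import os


def generate_token():
    """Create a random 40-character SHA1 hexadecimal digest to share with a peer."""
    return hashlib.sha1(os.urandom(32)).hexdigest()


def is_authorized(headers, expected_token):
    """Check the X-Auth-Token header of an incoming request against the shared token."""
    supplied = headers.get("X-Auth-Token", "")
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(supplied.encode("utf-8"), expected_token.encode("utf-8"))


# Matchmaker B generates the token and sends it securely to Matchmaker A.
token_for_a = generate_token()

# Matchmaker A includes the token in every request it sends to Matchmaker B.
incoming_headers = {"X-Auth-Token": token_for_a}
print(is_authorized(incoming_headers, token_for_a))
```

HTTP header names are case-insensitive, so a production server would normally read the header through its web framework rather than from a plain dictionary as in this sketch.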
Test data
In order to facilitate testing the ability of systems to query, match, and respond to requests, we have compiled a standardized test dataset of 50 de-identified individuals spanning 22 disorders. These cases were selected from publications by the FORGE Canada (Beaulieu et al., 2014) and Care4Rare Canada projects (http://care4rare.ca/), and deliberately include conditions with diverse phenotypes. Some of the conditions involve multiple organ systems (e.g. OMIM:269880 SHORT syndrome; OMIM:182212 Shprintzen-Goldberg Syndrome), while others mainly affect a single system (e.g. OMIM:614665 Meconium ileus; OMIM:243150 Intestinal atresia, multiple). In addition, multiple individuals with variable severity were included for many of the disorders (e.g. OMIM:615960 Cerebellar Dysplasia and Cysts; OMIM:615273 Congenital disorder of glycosylation, type IV), which serve as internal controls for evaluating the performance of matchmaking algorithms. These test cases are available in the MME API JSON format, and are annotated with phenotypic features, the diagnosed disorder (OMIM identifier), and the causal variant(s). New matchmaking organizations can use this dataset internally, to verify that the query and response are formatted correctly and the matching is accurate, or externally, to verify that links to other matchmaker services are functioning properly. In these cases, an additional property of the Patient object, “test”, should be set to true. This informs the system being queried that the query is a test, allowing it to respond accordingly. Normally, the system being queried will match against real patient data, return any matches, and notify users of identified matches. With a test query, the system should run the match against test data, return any matches, and suppress any notifications.
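The handling of the “test” property on the receiving side can be sketched as below. The handler signature, the placeholder matcher (which assigns a score of 1.0 to any record sharing a phenotype term), and the record identifiers are assumptions for illustration only; each service defines its own matching and notification logic.

```python
def shared_feature_matcher(patient, database):
    """Placeholder matcher: return records sharing at least one feature id."""
    wanted = {f["id"] for f in patient.get("features", [])}
    matches = []
    for record in database:
        have = {f["id"] for f in record.get("features", [])}
        if wanted & have:
            matches.append({"score": {"patient": 1.0}, "patient": record})
    return matches


def handle_match_request(request, real_db, test_db, matcher, notify):
    """Match against test data and suppress notifications when "test" is true."""
    patient = request["patient"]
    is_test = patient.get("test") is True
    database = test_db if is_test else real_db
    results = matcher(patient, database)
    if results and not is_test:
        notify(patient, results)
    return {"results": results}


notifications = []


def record_notification(patient, results):
    """Stand-in for a service-specific user notification mechanism."""
    notifications.append((patient["id"], len(results)))


real_records = [{"id": "real-1", "features": [{"id": "HP:0004322"}]}]
test_records = [{"id": "test-1", "features": [{"id": "HP:0004322"}]}]
query = {"patient": {"id": "query-1", "test": True, "features": [{"id": "HP:0004322"}]}}

response = handle_match_request(
    query, real_records, test_records, shared_feature_matcher, record_notification
)
print([r["patient"]["id"] for r in response["results"]], notifications)
```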
Deployment of the API across the MME Network
The MME API is currently implemented at the DECIPHER (Chatzimichali et al., 2015, this issue), GeneMatcher (Sobreira et al., 2015), and PhenomeCentral (Buske et al., 2015, this issue) portals. We have validated the API in two ways. First, we queried the services with the test data (described above), which recovered all of the expected matches. Second, as a preliminary test with clinical cases, we used the MME API to find matches for unsolved PhenomeCentral cases within GeneMatcher. We identified 60 unsolved PhenomeCentral cases submitted by the Care4Rare Canada project, which together included 45 different candidate genes (1–5 candidate genes per record). At least one match was found for 37 out of 60 PhenomeCentral cases, with 33 matching cases returned in total. Of the 33 matches, 16 were duplicate records (entered by the same clinician in both systems) and 2 were excluded because the GeneMatcher records had many (≥ 30) candidate genes. We followed up on the 10 matching genes within the remaining 15 matching records: 6 of the gene matches were classified as false positives (i.e. the phenotypes of the two patients were not significantly similar after clinician review), 2 are still unresolved, and 2 were classified as potentially significant hits, with additional validation currently underway. GeneMatcher currently matches only on gene, since most of its cases do not have phenotypic information, which may contribute to the false positive rate of this test.
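Gene-only matching of the kind used in this preliminary test can be sketched as a set intersection over candidate genes. This is an illustrative approximation and not GeneMatcher's implementation; the function names, record identifiers, and placeholder gene symbols (other than SRCAP, taken from Table 1) are hypothetical, and the threshold of 30 genes follows the exclusion criterion described above.

```python
def candidate_genes(patient):
    """Collect the gene identifiers listed in a patient's genomicFeatures."""
    return {
        feature["gene"]["id"]
        for feature in patient.get("genomicFeatures", [])
        if "gene" in feature
    }


def gene_matches(query, records, max_genes=30):
    """Return (record id, shared genes) for records sharing a candidate gene.

    Records listing `max_genes` or more candidate genes are skipped, since
    long gene lists are likely to produce uninformative matches.
    """
    query_genes = candidate_genes(query)
    matches = []
    for record in records:
        genes = candidate_genes(record)
        if len(genes) >= max_genes:
            continue
        shared = query_genes & genes
        if shared:
            matches.append((record["id"], sorted(shared)))
    return matches


query_patient = {"id": "query-1", "genomicFeatures": [{"gene": {"id": "SRCAP"}}]}
remote_records = [
    {"id": "remote-1", "genomicFeatures": [{"gene": {"id": "SRCAP"}}]},
    {"id": "remote-2", "genomicFeatures": [{"gene": {"id": "GENE_A"}}]},
    # A record with 30 candidate genes, including SRCAP, is excluded.
    {
        "id": "remote-3",
        "genomicFeatures": [{"gene": {"id": "SRCAP"}}]
        + [{"gene": {"id": f"GENE_{n}"}} for n in range(29)],
    },
]

print(gene_matches(query_patient, remote_records))
```

Because such matching ignores phenotype, a shared candidate gene alone does not establish that two patients are similar, which is consistent with the false positives observed after clinician review.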
Discussion
The Matchmaker Exchange is an international collaboration to facilitate the exchange of phenotypic and genotypic data for cases of rare disorders. The MME API presented here was designed to enable automated sharing of this data between multiple patient databases. The overarching principle guiding the design was to create a framework that is flexible enough to support a large number of data types and workflows, as the various members of the Matchmaker Exchange support varying depth of phenotypic and genetic data. The details of the algorithms used in each matchmaker service are also still in development. We decided on a hypothesis-free approach, in which the patient record defines the query and the receiving site determines how to optimally process the query, as it likely has the best understanding of the data available and how to use it to measure patient similarity. One added advantage of this approach is that to obtain optimum matches, the query patient has to be deeply phenotyped, thus encouraging contribution of data into the network. We believe that our approach will have utility beyond the rare disease community, and have contributed our APIs to the Global Alliance for Genomics and Health. Wherever possible, we coordinated field names and data formats with those used by the GA4GH APIs, and will continue to engage in the development of these standards.
While this API has proven successful for the first iteration of matchmaking, we are also considering extensions that should improve the efficacy of the API. These include improvements to the security/privacy configurations and a gradual adoption of hypothesis-driven queries. We believe that two changes could enhance the privacy protections offered by the MME API. First, some MME sites currently apply obfuscation to the provided data before returning it, and require direct communication between the submitting users before showing full patient data. Currently the API does not support reporting when data has been obfuscated; however, this information may be useful for the receiving user. Second, a centralized identification framework, using a technology such as OpenID, would enable users to have a single sign-on for all of the MME partners, and would allow the receiving site to decide what data to show in response to a query based on the user’s profile and their membership in the receiving site.
Finally, we expect the current hypothesis-free nature of the API to develop into a partially hypothesis-driven approach. Towards this end, the API should allow for weighting or requiring of features (e.g. specifying a specific gene or phenotype as “required”, suggesting a scoring function to be applied when computing a match score, or filtering the results based on a feature). In our tests, we have found an increasing need for such features, as the scoring schemes differ significantly between matchmaker services, making expected results difficult to validate.
Acknowledgments
We are grateful to all members of the Matchmaker Exchange working group for steering our effort, as well as to the leadership of the International Rare Disease Research Consortium (IRDiRC), the Global Alliance for Genomics and Health (GA4GH), and the Clinical Genome Resource (ClinGen) for supporting the MME project. The development of the MME API was supported by funding from the National Human Genome Research Institute (1U54HG006542) as well as Genome Canada and the Canadian Institutes for Health Research through the Large Scale Advanced Research (LSARP) and Bioinformatics/Computational Biology (BCB) Programs. OB was supported by the Garron Family Cancer Centre and Hospital for Sick Children Foundation Student Scholarship Program.
Footnotes
The authors have no competing interests to declare.
References
- Beaulieu CL, Majewski J, Schwartzentruber J, Samuels ME, Fernandez BA, Bernier FP, Brudno M, Knoppers B, Marcadier J, Dyment D, Adam S, Bulman DE, et al. FORGE Canada Consortium: outcomes of a 2-year national rare-disease gene-discovery project. Am J Hum Genet. 2014;94(6):809–817. doi: 10.1016/j.ajhg.2014.05.003.
- Buske OJ, Girdea M, Dumitriu S, Gallinger B, Hartley T, Trang H, Misyura A, Friedman T, Beaulieu C, Bone WP, Links AE, Washington NL, et al. PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases. Submitted to same issue. 2015. doi: 10.1002/humu.22851.
- Chatzimichali EA, Brent S, Hutton B, Perrett D, Wright CF, Bevan AP, Hurles ME, Firth HV, Swaminathan GJ. Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER. Submitted to same issue. 2015. doi: 10.1002/humu.22842.
- Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology. 2005;6(5):R44. doi: 10.1186/gb-2005-6-5-r44.
- Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005;33:D514–D517. doi: 10.1093/nar/gki033.
- Hood RL, Lines MA, Nikkel SM, Schwartzentruber J, Beaulieu C, Nowaczyk MJ, Allanson J, Kim CA, Wieczorek D, Moilanen JS, Lacombe D, Gillessen-Kaesbach G, et al. Mutations in SRCAP, encoding SNF2-related CREBBP activator protein, cause Floating-Harbor syndrome. Am J Hum Genet. 2012;90(2):308–313. doi: 10.1016/j.ajhg.2011.12.001.
- Köhler S, Doelken SC, Mungall CJ, Bauer S, Firth HV, Bailleul-Forestier I, Black GCM, Brown DL, Brudno M, Campbell J, FitzPatrick DR, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 2014;42(D1):D966–D974. doi: 10.1093/nar/gkt1026.
- Philippakis A, Azzariti D, Beltran S, Brookes A, Brownstein C, Brudno M, Brunner H, Buske O, Carey K, Doll C, Dumitriu S, Dyke S, et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Submitted to same issue. 2015. doi: 10.1002/humu.22858.
- Sobreira N, Schiettecatte F, Boehm C, Valle D, Hamosh A. New Tools for Mendelian Disease Gene Identification: PhenoDB Variant Analysis Module; and GeneMatcher, a Web-Based Tool for Linking Investigators with an Interest in the Same Gene. Hum Mutat. 2015;36(4):425–431. doi: 10.1002/humu.22769.