Author manuscript; available in PMC: 2016 Oct 1.
Published in final edited form as: Hum Mutat. 2015 Oct;36(10):998–1003. doi: 10.1002/humu.22847

Mitigating false-positive associations in rare disease gene discovery

Sebastian Akle 1,2, Sung Chun 2,3, Daniel M Jordan 2,3, Christopher A Cassa 2,3
PMCID: PMC4576452  NIHMSID: NIHMS711494  PMID: 26378430

Abstract

Clinical sequencing is expanding, but causal variants are still not identified in the majority of cases. These unsolved cases can aid in gene discovery when individuals with similar phenotypes are identified in systems such as the Matchmaker Exchange. We describe risks for gene discovery in this growing set of unsolved cases. In a set of rare disease cases with the same phenotype, it is not difficult to find two individuals with the same phenotype that carry variants in the same gene. We quantify the risk of false-positive association in a cohort of individuals with the same phenotype, using the prior probability of observing a variant in each gene from over 60,000 individuals (ExAC). Based on the number of individuals with a genic variant, cohort size, specific gene, and mode of inheritance, we calculate a p-value that the match represents a true association. A match in two of ten patients in MECP2 is statistically significant (p=0.0014), while a match in TTN would not reach significance, as expected (p>0.999). Finally, we analyze the probability of matching in clinical exome cases to estimate the number of cases needed to identify genes related to different disorders. We offer RD-Match, an online tool to mitigate the uncertainty of false-positive associations.

Keywords: matchmaking, false-positive associations, rare diseases, incidental findings, incidentalome, Matchmaker Exchange

Introduction

Most clinical genomic interpretation protocols inspect databases of disease-associated variants and identify rare coding variation that is predicted to be deleterious (Stenson et al. 2012; Landrum et al. 2014). Promising novel candidates may be validated using co-segregation with disease or functional data from in vitro or in vivo models (MacArthur et al. 2014). If these approaches fail to identify a causal variant, systems such as those linked through the Matchmaker Exchange (MME) may prove useful in locating additional cases with similar disorders or variation. Matchmaking services connect researchers and clinicians with the many unsolved cases around the globe that generally have some evidence to support a disease–gene association, and can be used to assist in gene discovery and validation.

The set of unsolved cases is growing dramatically due to the increasing availability of affordable clinical sequencing. At current diagnostic rates, over two thirds of exome sequencing cases from major clinical labs will go unsolved (Lee et al. 2014; Yang et al. 2014). Some of these cases will be “nearly solved” with a variant that appears to be causal but lacks supporting evidence in databases or the literature. Data from segregation analysis, functional validation, or disease–gene association studies make these excellent candidates for matchmaking.

As the number and breadth of unsolved cases from clinical laboratories continue to expand, matchmaking will not be limited to cases with a single suspected disease–gene association or precise phenotypes. This will allow researchers to discover potential matches deeper in underlying datasets, but it will also increase the potential for false-positive associations due to the greater likelihood of finding other individuals with similar phenotypes or genotypes within an exchange. This issue has been described in the context of genomic medicine and in the risk and management of incidental findings in research studies (Kohane et al. 2006; Berg et al. 2012). With the prospect of integrating data from thousands of unsolved cases, we quantify the risks inherent in broadening the use of matchmaking services to maximize their diagnostic power.

Materials and Methods

The risk of false-positive associations during gene discovery

Clinical sequencing cases are rich with matchmaking potential, but this can also lead to harmful false-positive associations. Depending on filtering criteria and the putative mode of inheritance, the average clinical exome is expected to have several rare loss of function, homozygous alternative, and compound heterozygous candidate variants (Collection 2014; Sulem et al. 2015). In each case, most of these variants will not be causal, and if the same non-causal variants are also identified in another individual in the MME, this may lead to a false-positive association. In fact, it has been well documented that false-positive matches may appear to be plausible causal variants, so we must beware of the “narrative potential” of the genome (Goldstein et al. 2013). These false associations may cause harm to patients, lead to inefficient use of clinical resources, and interfere with the discovery of true associations.

The statistical uncertainty of association—the probability that a case match will be made by chance alone—must be quantified. To that end, the prior probability of encountering a variant from the same gene or from the same phenotype in the cohort must be considered. For example, if the exchange includes many patients with developmental delay phenotypes, and some of these patients have variants in genes that commonly harbor rare variation that is unrelated to disease, our confidence in making an association in those genes is reduced. Furthermore, the inclusion of multiple candidate variants or phenotypic terms in a case may increase the risk of incorrect matches.

Modeling matchmaking probabilities using large-scale population data

To evaluate the risk of false-positive association within the MME, we model the prior probability of having a rare, non-synonymous gene variant in the general population. Given the mode of inheritance, individual demographics, and the filtering strategy, it is possible to calculate the expected matchmaking potential per exome. Using population frequency data from the Exome Aggregation Consortium (ExAC) (ExAC 2014), we calculate the expected frequency of a rare, non-synonymous variant in each gene, $p_g = \sum_{i \in V_g} p_{gi}$, where $V_g$ is the set of rare non-synonymous variants in gene $g$ and $p_{gi}$ is the frequency of the $i$th variant in ExAC. In other words, $p_g$ is the sum of the allele frequencies of all variants annotated in ExAC as non-synonymous with an allele frequency of less than 0.1%. This threshold filters most variants that are only mildly deleterious, while including enough low frequency variants to conservatively evaluate false-positive associations. To avoid bias from potential variant under-calling, we include only genes with at least 30x average depth of read coverage.
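The per-gene summation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the constant name, and the example allele frequencies are hypothetical, the input is assumed to already contain only non-synonymous variants, and the 30x coverage filter is not modeled.

```python
from typing import Iterable

# 0.1% allele-frequency cutoff, as stated in the text.
RARE_AF_THRESHOLD = 0.001


def gene_rare_frequency(allele_freqs: Iterable[float],
                        threshold: float = RARE_AF_THRESHOLD) -> float:
    """Return p_g: the sum of allele frequencies below the rarity threshold."""
    return sum(af for af in allele_freqs if af < threshold)


# Hypothetical non-synonymous allele frequencies for one gene (not real ExAC data).
example_afs = [0.0002, 0.00005, 0.004, 0.0009]
p_g = gene_rare_frequency(example_afs)
print(f"p_g = {p_g:.5f}")
```

In this toy input the variant at 0.4% frequency is excluded by the cutoff, so only the three rarer variants contribute to the sum.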

We use $p_g$ to calculate the probability that a randomly selected patient has a suspicious variant in each gene, given the observed inheritance pattern. The probability of a compound heterozygous variant in an individual is the square of the sum of the allele frequencies of the rare, non-synonymous variants in each gene ($f_{g,ch} = p_g^2$). Dominant heterozygous variants are expected to occur at twice the sum of the allele frequencies of all rare, non-synonymous alleles ($f_{g,d} = 2p_g$). Homozygous variants may arise from genomic regions that are identical by descent (IBD) due to population isolation or consanguinity (measured by the inbreeding coefficient $F$) or by the statistical chance of an individual having two copies of the same variant at a given locus. This probability is represented by $f_{g,r} = F p_g + (1 - F)\sum_{i \in V_g} p_{gi}^2$.
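As a minimal sketch, the three inheritance-mode frequencies defined above can be computed directly from a gene's list of rare allele frequencies. The function names and the example frequencies are hypothetical. The value F = 0.0016 is the estimate for Italy quoted later in the text.

```python
from typing import Sequence


def compound_het_frequency(rare_afs: Sequence[float]) -> float:
    """f_{g,ch}: square of the summed rare allele frequency p_g."""
    p_g = sum(rare_afs)
    return p_g ** 2


def dominant_frequency(rare_afs: Sequence[float]) -> float:
    """f_{g,d}: twice the summed rare allele frequency p_g."""
    return 2.0 * sum(rare_afs)


def recessive_frequency(rare_afs: Sequence[float], inbreeding_f: float) -> float:
    """f_{g,r}: F * p_g plus (1 - F) times the sum of squared frequencies."""
    p_g = sum(rare_afs)
    chance_homozygous = sum(af ** 2 for af in rare_afs)
    return inbreeding_f * p_g + (1.0 - inbreeding_f) * chance_homozygous


# Hypothetical rare allele frequencies for one gene (not real ExAC data).
afs = [0.0002, 0.00005, 0.0009]
print("compound het:", compound_het_frequency(afs))
print("dominant:    ", dominant_frequency(afs))
print("recessive:   ", recessive_frequency(afs, inbreeding_f=0.0016))
```

Keeping the identity-by-descent term and the chance-homozygosity term separate in `recessive_frequency` makes it easy to see which one dominates for a given gene and population.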

Calculating the probability of false-positive matches

We treat the MME as a cohort of individuals with a number of cases for each phenotype. To address the problem of false-positive associations, we calculate a p-value for each gene and then use a Bonferroni corrected p-value threshold to account for the number of independent tests (the number of genes in our dataset derived from ExAC). The null hypothesis is that none of the variants in the gene of interest causes the disorder. We use the following equation to calculate the p-value for each match:

$$p(m_g \mid N, f_{g,h}) = 1 - \sum_{j=0}^{m_g - 1} \binom{N}{j} f_{g,h}^{\,j} \left(1 - f_{g,h}\right)^{N-j}, \qquad (1)$$

where $m_g$ is the number of patients that have variants in a matching gene with rare, non-synonymous variants at frequency $f_{g,h}$. $N$ represents the number of patients with the particular phenotype, and the value of $f_{g,h}$ in a given case is determined by the mode of inheritance $h$: dominant (d), compound heterozygous (ch), or recessive (r). The p-value we compute in equation (1) corresponds to the probability of observing $m_g$ or more matches under the null hypothesis.

For each gene under consideration, we first calculate the probability of observing fewer than $m_g$ matches in that gene, which is the cumulative binomial probability of finding 0, 1, ..., $m_g - 1$ matches. The p-value is equal to 1 minus this probability. We assume that all $T$ covered ExAC genes carry variants independently of one another, and use a Bonferroni correction to set an appropriate p-value threshold ($\alpha/T$), where $\alpha$ is the uncorrected threshold.
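The binomial tail in equation (1) and the comparison against the Bonferroni threshold can be sketched as follows. The function names and example inputs are hypothetical, and the value T = 17000 is an assumption based on the roughly 17,000 genes mentioned elsewhere in the text. The p-value is evaluated as the upper-tail sum, which is algebraically equal to one minus the lower-tail sum but avoids subtracting two nearly equal numbers when the p-value is very small.

```python
from math import comb


def match_p_value(m_g: int, n_patients: int, f_gh: float) -> float:
    """P(at least m_g of n_patients carry a qualifying variant) under the null."""
    if m_g <= 0:
        return 1.0
    # Upper-tail sum; equal to 1 minus the sum over j = 0 .. m_g - 1.
    return sum(
        comb(n_patients, j) * f_gh ** j * (1.0 - f_gh) ** (n_patients - j)
        for j in range(m_g, n_patients + 1)
    )


def passes_bonferroni(p_value: float, alpha: float, n_genes: int) -> bool:
    """Compare a per-gene p-value with the corrected threshold alpha / T."""
    return p_value < alpha / n_genes


# Hypothetical inputs: 2 matches among 10 patients, per-patient frequency 1e-4.
p = match_p_value(m_g=2, n_patients=10, f_gh=1e-4)
print(p, passes_bonferroni(p, alpha=0.01, n_genes=17000))
```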

Some genes harbor a great deal of rare variation due to their size or tolerance to mutations (Petrovski et al. 2013; Samocha et al. 2014), which can increase the risk of false association. For example, if $N=10$ patients in the MME share the same phenotype, and we observe two patients ($m_g=2$) with compound heterozygous variants in TTN, this match alone would not be convincing. The average individual in ExAC carries 0.45 rare, non-synonymous variants in this gene, so this degree of matching is not unexpected in 10 individuals (Bonferroni corrected p-value>0.999). Conversely, if we observe the same match in MECP2, in which the average individual in ExAC carries 0.0066 rare, non-synonymous variants, we would be more confident that this match is causal (Bonferroni corrected p-value=0.00144).
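The contrast between variation-rich and variation-poor genes can also be expressed as the smallest number of matching patients needed to pass the corrected threshold. The sketch below uses hypothetical per-patient frequencies, not the TTN or MECP2 values, and does not attempt to reproduce the p-values quoted above. The helper names and the defaults alpha = 0.01 and T = 17000 are assumptions.

```python
from math import comb
from typing import Optional


def match_p_value(m_g: int, n_patients: int, f_gh: float) -> float:
    """Upper-tail binomial probability of m_g or more matches (equation 1)."""
    if m_g <= 0:
        return 1.0
    return sum(
        comb(n_patients, j) * f_gh ** j * (1.0 - f_gh) ** (n_patients - j)
        for j in range(m_g, n_patients + 1)
    )


def min_matches_for_significance(n_patients: int, f_gh: float,
                                 alpha: float = 0.01,
                                 n_genes: int = 17000) -> Optional[int]:
    """Smallest m_g whose p-value falls below alpha / n_genes, or None."""
    threshold = alpha / n_genes
    for m_g in range(1, n_patients + 1):
        if match_p_value(m_g, n_patients, f_gh) < threshold:
            return m_g
    return None  # no number of matches in this cohort reaches significance


# Hypothetical per-patient frequencies for two kinds of gene.
for label, f in [("variation-poor gene", 1e-5), ("variation-rich gene", 0.45)]:
    print(label, min_matches_for_significance(n_patients=10, f_gh=f))
```

With these toy values, two matches suffice for the variation-poor gene, while no number of matches among ten patients is enough for the variation-rich gene.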

Online system for generating risk estimates

We have released a service, Rare Disease Match (RD-Match), which calculates the probability of a false-positive association in a newly entered individual. This service can be accessed at http://genetics.bwh.harvard.edu/rdmatch/.

Power simulations to estimate the number of cases required for matching

Conversely, it may be helpful to quantify the potential to identify novel true-positive disease matches within the MME. Given a set of patients with the same phenotype, we can model the probability that a certain number will carry rare mutations in the same gene that are responsible for the phenotype, after specifying the mode of inheritance. These mutations may be present in any one of G genes that we estimate to be associated with the disorder. The probability that these mutations will occur in any one of these G disease genes is not equally likely, so we rely on information about the prior expectation of encountering disease-causing variants in the general population, in order to estimate the power to encounter true matches.

To estimate the frequencies of different causal mutations among the $G$ genes in the $N$ patients, we first estimate the distribution of disease-causing mutations in the general population. To do this, we use data from the most severe class (DM) of rare, non-synonymous variants in HGMD and measure the population frequencies of these putative disease variants in ExAC. The resulting distribution is highly skewed: the top 152 disease genes account for over half of all observed rare disease variants in ExAC. Because systems such as the MME are focused on discovering associations in rare, lesser-known disease genes, we exclude the 50 most common ones, which account for 27.2% of the total population frequency of disease variants. In the list of excluded genes, 17 are well-known disease genes from the ACMG guidelines for reporting incidental findings, and others are very large genes that are known to tolerate extensive missense variation, including TTN, OBSCN and mucin genes. Our results are robust to both the choice of the number of top common genes excluded and the proportion of unknown disease variants included (sensitivity analyses that support this cutoff are detailed in the Supporting Information). We also include genes in which there are no known disease variants and apply a very small “pseudo-count” value in 10% of these so that matches in yet-to-be-discovered disease genes may be considered. We call this curated distribution of disease variants $p_D$.

Next, we account for the likelihood of missing a truly causal variant in a given case, as it is possible that some MME cases will not have a causal variant among the genes listed. This could result from a sequencing error or coverage issue that overlooks the variant, misclassification of the phenotype, or a non-genetic cause leading to the same condition. We apply a parameter ($p_{miss}$) to describe the probability of not observing a causal variant in a patient. The lower the $p_{miss}$ value, the more matches we can expect to observe.

We estimate the power to detect a number of true associations ($k$) given a specified p-value threshold sufficient to reject the null hypothesis, the number of genes ($G$) truly associated with the phenotype, and the number of cases with a specific phenotype ($N$). For each round of the simulation, the process is non-deterministic at two levels. First, we select $G$ genes from the $p_D$ distribution of disease variants. Next, we assume that each of the $N$ patients in the MME carries a mutation in one of the $G$ genes, or in none of them with probability $p_{miss}$.

For each of the $N$ patients, we then randomly select which gene carries a mutation (or none) according to the frequencies of the $G$ genes we previously sampled. We identify the genes that have mutations in more than one patient and compute the corresponding p-value for each using Equation (1). Finally, we track how often this p-value falls below a given threshold $\alpha/T$. This algorithm is detailed in the Supporting Information.
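The simulation steps above can be sketched as a small Monte Carlo loop. This is an illustrative reading of the prose, not the algorithm from the Supporting Information: all function names, the toy gene names, the toy weights, and the toy null frequencies are hypothetical. Sampling the G genes without replacement is an assumption, as is using a single per-patient null frequency for each gene.

```python
import random
from collections import Counter
from math import comb
from typing import Dict, List


def match_p_value(m_g: int, n_patients: int, f_gh: float) -> float:
    """Upper-tail binomial probability of m_g or more matches (Equation 1)."""
    return sum(
        comb(n_patients, j) * f_gh ** j * (1.0 - f_gh) ** (n_patients - j)
        for j in range(m_g, n_patients + 1)
    )


def sample_disease_genes(rng: random.Random, p_d: Dict[str, float],
                         g: int) -> List[str]:
    """Draw g distinct genes, each pick weighted by its share of p_D."""
    pool = dict(p_d)
    chosen = []
    for _ in range(min(g, len(pool))):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names])[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


def estimate_power(p_d: Dict[str, float], null_freq: Dict[str, float],
                   n_patients: int, n_disease_genes: int, p_miss: float,
                   k: int, alpha: float = 0.01, n_tests: int = 17000,
                   n_rounds: int = 1000, seed: int = 0) -> float:
    """Fraction of rounds in which at least k genes reach significance."""
    rng = random.Random(seed)
    threshold = alpha / n_tests
    successes = 0
    for _ in range(n_rounds):
        # Level 1: choose which genes are truly associated in this round.
        genes = sample_disease_genes(rng, p_d, n_disease_genes)
        weights = [p_d[g] for g in genes]
        # Level 2: assign each patient a causal gene, or none with p_miss.
        counts: Counter = Counter()
        for _ in range(n_patients):
            if not genes or rng.random() < p_miss:
                continue
            counts[rng.choices(genes, weights=weights)[0]] += 1
        # Count genes shared by more than one patient that pass the threshold.
        significant = sum(
            1 for gene, m_g in counts.items()
            if m_g > 1
            and match_p_value(m_g, n_patients, null_freq[gene]) < threshold
        )
        if significant >= k:
            successes += 1
    return successes / n_rounds


# Toy inputs (hypothetical gene names and values, not derived from HGMD or ExAC).
toy_p_d = {"GENE_A": 0.5, "GENE_B": 0.3, "GENE_C": 0.2}
toy_null = {"GENE_A": 1e-4, "GENE_B": 2e-4, "GENE_C": 5e-5}
print(estimate_power(toy_p_d, toy_null, n_patients=20, n_disease_genes=2,
                     p_miss=0.5, k=1, n_rounds=200))
```

Fixing the seed makes a run repeatable, which is convenient when comparing the effect of changing one parameter at a time.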

Results

Probability of false-positive associations

We compute the probability of false-positive matches for both the autosomal recessive and compound heterozygous classes of variants, using three different numbers of individuals with a matching phenotype: 2, 10, and 100 (Figure 1).

Figure 1.


The probability of observing a single homozygous variant match or a single compound heterozygous variant match in a set of patients with F=0.0016. We measure the probability of observing a single homozygous variant match [a] or a single compound heterozygous variant match [b] in cohorts of 2, 10, and 100 individuals with the same phenotype, sorted by gene rank. Genes are ranked in descending order of rare, non-synonymous variation using data from the Exome Aggregation Consortium. The probability for each curve is shown as the –log base 10 of the p-value of observing a single match, meaning that a match between 2 individuals is more significant than one among 10 or 100 individuals. Matches in genes that harbor a great deal of rare variation (low rank) are also less significant than those in genes that less commonly have rare variants (high rank).

Contribution from the inbreeding coefficient

The probability of observing autosomal recessive variants is linearly proportional to the inbreeding coefficient, which can greatly affect the distribution of recessive variants versus compound heterozygous variants. In cases of consanguinity or in populations with high inbreeding coefficients, our expectation of observing recessive variants is greatly increased. This is especially the case when it is very unlikely to observe recessive variants without identity by descent due to the low aggregate population frequency of variants in a given gene. To evaluate the influence of the inbreeding coefficient (F) on the probability of observing matches in the MME, we use estimates for F from three populations: Italy, 0.0016; Japan, 0.0046; and Andhra Pradesh, India, 0.0195 (Hedrick 1986).

For 10 individuals with a given phenotype in the MME, we measure the probability of observing a match in each of approximately 17,000 genes in ExAC (Figure 2). Genes are ranked in descending order of aggregate rare, non-synonymous variation. The probability of observing a compound heterozygous variant exceeds the probability of observing an autosomal recessive variant for genes that harbor greater degrees of rare, non-synonymous variation, and the opposite is true for genes in which such variation is less common. The point at which these two probability curves intersect is determined by the inbreeding coefficient, and it occurs at lower rank numbers as the inbreeding coefficient increases.
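The crossover between the two modes can be illustrated by comparing the compound heterozygous frequency (the square of the summed rare allele frequency) with the homozygous frequency (the inbreeding term plus the chance-homozygosity term) for the same gene under different inbreeding coefficients. The sketch below uses the three F estimates quoted above. The function name and the two toy genes, each built from equally frequent variants, are hypothetical.

```python
from typing import Sequence


def more_likely_mode(rare_afs: Sequence[float], inbreeding_f: float) -> str:
    """Return which per-patient frequency is larger for one gene."""
    p_g = sum(rare_afs)
    f_ch = p_g ** 2
    f_r = inbreeding_f * p_g + (1.0 - inbreeding_f) * sum(af ** 2 for af in rare_afs)
    return "compound heterozygous" if f_ch > f_r else "homozygous"


# Inbreeding coefficients quoted in the text (Hedrick 1986).
populations = {"Italy": 0.0016, "Japan": 0.0046, "Andhra Pradesh": 0.0195}

# Hypothetical genes: 20 or 2 rare variants, each at frequency 0.0005.
variation_rich = [0.0005] * 20   # p_g = 0.01
variation_poor = [0.0005] * 2    # p_g = 0.001

for name, f in populations.items():
    print(name,
          "| rich:", more_likely_mode(variation_rich, f),
          "| poor:", more_likely_mode(variation_poor, f))
```

In this toy setup the variation-rich gene switches from compound heterozygous to homozygous only at the highest inbreeding coefficient, which mirrors the shift of the intersection point described above.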

Figure 2.


The probability of observing a single homozygous alternate variant match or a single compound heterozygous variant match in 10 patients, using three different population inbreeding coefficients. We measure the probability of observing a single homozygous alternate variant match [blue/light] or a single compound heterozygous variant match [red/dark] using estimated population inbreeding coefficients from Italy [a], Japan [b], and Andhra Pradesh, India [c]. Genes are ranked in descending order of rare, non-synonymous variation using data from the Exome Aggregation Consortium. The probability for each curve is presented as the –log base 10 of the p-value of observing a single match, and we observe that the point at which these two curves intersect is determined by the inbreeding coefficient, and it occurs at lower rank numbers as the inbreeding coefficient increases. In genes that harbor very low numbers of rare, non-synonymous variants, it is more likely to observe a homozygous recessive variant due to a region that is identical by descent rather than resulting from a chance match from the general population. The jaggedness of the line corresponding to recessives comes from the fact that genes are ordered by rank in frequency of mutations, which is highly correlated but slightly different than the frequency of homozygotes under Hardy-Weinberg Equilibrium.

Number of cases required to detect causal matches for different phenotypes

We attempt to broadly estimate the number of cases required to detect causal matches for different phenotypes. Using data from a set of recently ascertained clinical exome sequencing patients at the Baylor College of Medicine (Yang et al. 2014), we extract the exact number of cases ($N$) processed for each phenotypic category. We then obtain a broad estimate of the number of genes associated with each phenotype ($G$) using data from HGMD. To do this, we cross-reference each phenotype from the clinical exome service with HGMD and count the genes associated with it. We remove any phenotype with fewer than 10 variants listed in HGMD, as these disease names did not have sufficient specificity for text matching. Using these estimates for $G$ and $N$ for each phenotypic category, we estimate the expected number of matches that would be made within this set (Table 1).

Table 1.

Estimated power to detect matches in a clinical exome sequencing service

| Phenotypic category | Cohort (N) | Genes (G) | Mean number of matches | P(k=1) | P(k=10) | P(k=25) |
|---|---|---|---|---|---|---|
| Seizure | 665 | 63 | 33.447 | > 0.999 | > 0.999 | 0.999 |
| Microcephaly | 339 | 85 | 24.6023 | > 0.999 | > 0.999 | 0.5158 |
| Congenital heart | 268 | 61 | 20.4469 | > 0.999 | > 0.999 | 0.0394 |
| Short stature | 263 | 59 | 20.0965 | > 0.999 | > 0.999 | 0.0267 |
| Abn. mvmt/tremor | 234 | 15 | 10.6631 | > 0.999 | 0.795 | 0 |
| Speech delay | 215 | 10 | 7.8842 | > 0.999 | 0.0602 | 0 |
| Ataxia | 169 | 101 | 11.0382 | > 0.999 | 0.7628 | < 0.001 |
| Macrocephaly | 131 | 14 | 8.5664 | > 0.999 | 0.2438 | 0 |

P(k) is the probability of observing k matches.

Estimates of the number and probability of true-positive matches for different phenotypes. Using the phenotypic category and number of cases ($N$) processed in a large clinical exome sequencing service (Yang et al. 2014), we estimate the number of genes ($G$) that are associated in the medical and scientific literature with each phenotypic category (Stenson et al. 2012). Using a Bonferroni corrected p-value threshold of $5.76 \times 10^{-7}$ ($0.01/T$), we then estimate the number of true-positive matches expected for each phenotype using the number of individuals and associated genes. We also estimate the probability (power) of observing $k$ or more different matching genes, all with p-values $< 5.76 \times 10^{-7}$.
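As a quick arithmetic check, the number of tests implied by the threshold in this caption can be recovered by inverting the Bonferroni correction. The result is an inference from the two quoted numbers rather than a figure stated in the text, and it agrees with the approximately 17,000 ExAC genes mentioned in the Results.

```latex
T \;=\; \frac{\alpha}{\alpha / T}
  \;=\; \frac{0.01}{5.76 \times 10^{-7}}
  \;\approx\; 17{,}361
```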

Discussion

Comparison with database annotation and existing practices

Clinical sequencing cases are often reviewed to identify known variants that have been classified as pathogenic and reported in databases such as HGMD and ClinVar (Stenson et al. 2012; Landrum et al. 2014). When these databases contain genotype–phenotype associations similar to those in a given case, this increases support for their causality. However, these databases are notoriously riddled with potentially false-positive findings, variants with low effect size, and disease-associated polymorphisms (Tong et al. 2011; Cassa et al. 2013).

Given the uncertainties of these associations in the literature, the presence of one of these variants in a patient may not suffice as evidence of causality without variant review and/or functional validation. Databases like ClinVar that capture variant classifications from clinical labs are also problematic, as classifications and the evidence used to establish disease–gene associations may vary from one lab to another. These same issues must be considered when evaluating matches in the MME, in which each individual candidate variant is unlikely to be truly causal. The prior probability of a disease-associated variant match must be considered for any matching variant in a database or publication, and multiple test correction is needed to arrive at an accurate probability of causality. We describe the trade-offs using two broad qualitative parameters: the quality of phenotype match and the quality of variant association (Figure 3).

Figure 3.


Trade-offs between quality of variant association and quality of phenotype match.

Variants having both low quality of phenotype match and low quality of variant association are much less likely to be causal, particularly for severe and/or rare Mendelian disorders (Figure 3, bottom left). When the quality of phenotype match is poor but the quality of variant association is strong, such a match may still be a false positive. This may be due to phenotypic imprecision (i.e., too many phenotypic terms, which leads to a greater chance of matching another patient within this larger number of phenotypes) or may indicate the potential for phenotypic expansion (i.e., the case provides a new component of the true disease phenotype that may appear in cases) (Figure 3, top left). Variants with high quality of phenotype match but low quality of variant association may be either promising new candidates or potential false positives (Figure 3, bottom right). Finally, variants with high quality of both phenotype match and variant association are the most promising class of matching candidates, especially when rare (Figure 3, top right).

Integration of risk estimates into the matchmaker exchange

There are several ways to mitigate the uncertainty associated with broader matchmaking cases and integrate risk estimates into matchmaking activities. One option is for each clinical group to disclose limitations on less-certain associations, such as those producing an inconsistent phenotype or incomplete penetrance within a family. Second, the criteria used to establish the list of candidate variants could be standardized using best practices for genome alignment, variant calling, and filtering. Of course, this second option would be difficult to realize given the diversity of genomic interpretation protocols currently in use (Brownstein et al. 2014).

Our model for generating risk estimates for false-positive associations and the power to detect matches is available as a web application. The application allows users to calculate the probability of a false-positive match given a set of genes and the number of individuals expressing a certain phenotype, as well as other parameters including the inbreeding coefficient. Users may also calculate the statistical power to identify one or more true associations for a set of genes using parameters such as the estimated target size of a disease and the number of individuals with a specific phenotype. Furthermore, users can estimate the number of individuals that would be needed in order to have the desired probability for establishing an association. Our p-values (RD-Match) are calculated using variant frequencies from a large pool of data (ExAC) and some basic assumptions. We caution, however, that our power calculations rely on estimates of parameters for which it is difficult to be certain ($p_{miss}$, $G$, $p_D$). One limitation of the application is that it uses frequency data drawn from populations that may not be representative of specific case populations. In the future, this can be mitigated with larger variant datasets from different populations. Samples with mixed ancestry could also lead to violations of independence between different genes. Furthermore, the allele frequency data used to impute the frequencies of homozygous alternative and compound heterozygous variants may in fact be depleted in the general population, causing our estimates to be conservative.

Additionally, phenotypic precision may affect these estimates. The potential for false-positive matching will be increased if phenotypes are ill-defined in the MME, as this effectively increases the number of cases with any given phenotype ($N$). This effect is amplified when a disorder is broadly defined or its spectrum of presentation is not well developed. Furthermore, given the role of inbreeding in proliferating autosomal recessive variants (which thereby lowers the probability that a match with that mode of inheritance is valid), we recommend that the $F$ parameter be specified when using this model. Users of the MME who are unable to calculate $F$ directly in consanguineous cases should attempt to estimate its value for the population considered if it is likely to be higher than average.
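The effect of an inflated N can be seen by holding the gene and the number of matches fixed while the cohort grows. The sketch below evaluates the binomial tail from Equation (1) for the three cohort sizes used in Figure 1. The function name and the per-patient frequency of 0.001 are hypothetical.

```python
from math import comb


def match_p_value(m_g: int, n_patients: int, f_gh: float) -> float:
    """Upper-tail binomial probability of m_g or more matches (Equation 1)."""
    if m_g <= 0:
        return 1.0
    return sum(
        comb(n_patients, j) * f_gh ** j * (1.0 - f_gh) ** (n_patients - j)
        for j in range(m_g, n_patients + 1)
    )


f_gh = 1e-3  # hypothetical per-patient frequency for one gene
for n in (2, 10, 100):
    # Same two matching patients, but a progressively larger cohort.
    print(n, match_p_value(m_g=2, n_patients=n, f_gh=f_gh))
```

The same two-patient match becomes steadily less significant as the cohort grows, which is why loosely defined phenotypes that pull more cases into a category weaken each individual match.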

Furthermore, we caution that our p-value calculations are only valid in a set of unrelated individuals with matching phenotypes and disease variants. When validating associations, investigators should check if matching individuals are separately ascertained members of the same family. In this case, traditional methods could be used to quantify the significance of any shared variant. Otherwise, related individuals may be removed from consideration and then statistical significance for matches in each gene can be checked.

Conclusions

The broader use of matchmaking in clinical exome cases allows previously siloed data to be used in aggregate to identify additional cases with matching rare genotypes or phenotypes. As the set of accessible case data expands, it will become easier to find promising matches, but not without the danger of false-positive associations.

Supplementary Material

Supp MaterialS1

Acknowledgments

This research was supported by National Institutes of Health grant K99-HG007229. We thank Michael Brudno, Peter Robinson, Dana Vuzman, and Shamil Sunyaev for helpful comments and suggestions.

References

  1. 2014. (ExAC) EAC.
  2. Berg JS, Adams M, Nassar N, Bizon C, Lee K, Schmitt CP, Wilhelmsen KC, Evans JP. An informatics approach to analyzing the incidentalome. Genet Med. 2012 doi: 10.1038/gim.2012.112.
  3. Brownstein CA, Beggs AH, Homer N, Merriman B, Yu TW, Flannery KC, Dechene ET, Towne MC, Savage SK, Price EN, Holm IA, Luquette LJ, et al. An international effort towards developing standards for best practices in analysis, interpretation and reporting of clinical genome sequencing results in the CLARITY Challenge. Genome Biol. 2014;15:R53. doi: 10.1186/gb-2014-15-3-r53.
  4. Cassa CA, Tong MY, Jordan DM. Large numbers of genetic variants considered to be pathogenic are common in asymptomatic individuals. Hum. Mutat. 2013;34:1216–1220. doi: 10.1002/humu.22375.
  5. Collection S. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat. Genet. 2014:1–95. doi: 10.1038/ng.3021.
  6. Goldstein DB, Allen A, Keebler J, Margulies EH, Petrou S, Petrovski S, Sunyaev S. Sequencing studies in human genetics: design and interpretation. Nat. Rev. Genet. 2013;14:460–70. doi: 10.1038/nrg3455.
  7. Hedrick PW. Average inbreeding or equilibrium inbreeding? Am. J. Hum. Genet. 1986;38:965–970.
  8. Kohane IS, Masys DR, Altman RB. The incidentalome: a threat to genomic medicine. JAMA. 2006;296:212–215. doi: 10.1001/jama.296.2.212.
  9. Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR. ClinVar: Public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014;42 doi: 10.1093/nar/gkt1113.
  10. Lee H, Deignan JL, Dorrani N, Strom SP, Kantarci S, Quintero-Rivera F, Das K, Toy T, Harry B, Yourshaw M, Fox M, Fogel BL, et al. Clinical Exome Sequencing for Genetic Identification of Rare Mendelian Disorders. JAMA. 2014 doi: 10.1001/jama.2014.14604.
  11. MacArthur DG, Manolio TA, Dimmock DP, Rehm HL, Shendure J, Abecasis GR, Adams DR, Altman RB, Antonarakis SE, Ashley EA, Barrett JC, Biesecker LG, et al. Guidelines for investigating causality of sequence variants in human disease. Nature. 2014;508:469–476. doi: 10.1038/nature13127.
  12. Petrovski S, Wang Q, Heinzen EL, Allen AS, Goldstein DB. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 2013;9:e1003709. doi: 10.1371/journal.pgen.1003709.
  13. Samocha KE, Robinson EB, Sanders SJ, Stevens C, Sabo A, McGrath LM, Kosmicki JA, Rehnström K, Mallick S, Kirby A, Wall DP, MacArthur DG, et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 2014 doi: 10.1038/ng.3050.
  14. Stenson PD, Ball EV, Mort M, Phillips AD, Shaw K, Cooper DN. The Human Gene Mutation Database (HGMD) and its exploitation in the fields of personalized genomics and molecular evolution. Curr Protoc Bioinforma. 2012 doi: 10.1002/0471250953.bi0113s39. Chapter 1: Unit1 13.
  15. Sulem P, Helgason H, Oddson A, Stefansson H, Gudjonsson SA, Zink F, Hjartarson E, Sigurdsson GT, Jonasdottir A, Jonasdottir A, Sigurdsson A, Magnusson OT, et al. Identification of a large set of rare complete human knockouts. Nat. Genet. 2015 doi: 10.1038/ng.3243.
  16. Tong MY, Cassa CA, Kohane IS. Automated validation of genetic variants from large databases: ensuring that variant references refer to the same genomic locations. Bioinformatics. 2011;27:891–893. doi: 10.1093/bioinformatics/btr029.
  17. Yang Y, Muzny DM, Xia F, Niu Z, Person R, Ding Y, Ward P, Braxton A, Wang M, Buhay C, Veeraraghavan N, Hawes A, et al. Molecular Findings Among Patients Referred for Clinical Whole-Exome Sequencing. JAMA. 2014 doi: 10.1001/jama.2014.14601.
