Author manuscript; available in PMC: 2017 Mar 1.
Published in final edited form as: Hum Mutat. 2015 Dec 31;37(3):231–234. doi: 10.1002/humu.22944

Multiallelic Positions in the Human Genome: Challenges for Genetic Analyses

Ian M Campbell 1,#, Tomasz Gambin 1,#, Shalini Jhangiani 2, Megan L Grove 3, Narayanan Veeraraghavan 2, Donna M Muzny 2, Chad A Shaw 1, Richard A Gibbs 1,2, Eric Boerwinkle 2,3, Fuli Yu 1,2, James R Lupski 1,2,4,5,*
PMCID: PMC4752396  NIHMSID: NIHMS745621  PMID: 26670213

Abstract

As the amount of human genomic sequence available from personal genomes and exomes has increased, so too has the observation of genomic positions having two or more alternative alleles, so-called multiallelic sites. For portions of the haploid genome that are present in more than one copy, including segmental duplications, variation at such multisite variant positions becomes even more complex. Despite the frequency of multiallelic variants, a number of commonly used resources and tools in genomic research and diagnostics either do not support them at all or require special modifications. Here, we explore the frequency of multiallelic sites in large samples with whole exome sequencing and discuss potential outcomes of failing to account for multiple variant alleles. We also briefly discuss some commonly utilized resources that fully support multiallelic sites.

Keywords: multiallelic, single nucleotide variant, paralogous sequence variant, human variation, pathogenic variant


The occurrence of multiallelic sites in the human genome has long been recognized, but the impact of incorrectly accounting for their frequency and distribution is underappreciated. Single nucleotide variant (SNV) alleles, often referred to as single nucleotide polymorphisms (SNPs), are identified by the presence of a different DNA nucleotide from the reference genome allele at a unique position. The nature of Watson-Crick DNA base pairing allows any of the three other nucleotides to be present at the same genomic coordinate position across different individuals. Genomic positions that vary to more than one other nucleotide are often referred to as multiallelic sites (Lindsay et al. 2006; Alvarez 2008). The difficulty of mapping DNA sequence reads to non-unique segments of the genome has complicated identification of multiallelic sites at unique positions. In some cases ‘mis-mapped’ reads have single bases of non-identity that mimic new alleles. Improved sequence length and quality, advances in mapping methodology, and an increase in the number of human genomes sequenced facilitate improved read mapping. Accurate mapping reduces the errors contributed by mis-mapped sequence reads and, with them, the erroneous inference of variant sites.

Past estimates from the 1000 Genomes Project (1000G) Phase 3 indicate that multiallelic sites account for ~ 2.3% (1,742,796 / 77,228,885) of all autosomal positions with SNVs (1000 Genomes Project Consortium et al. 2012). As more personal genomes are studied, however, it is clear that more sites uniquely positioned on the reference haploid genome bear additional alleles and that the true frequency of multiallelic variation may be much higher. Correctly accounting for multiallelic sites is important both for understanding genetic structure in populations and for the more pragmatic searches for ‘causative’ disease alleles in individuals and cohorts. For population studies, even rare haplotypes can be further subdivided when the true pattern of variation of specific sites is revealed. Phasing, recombination, and other elements of inheritance are all dependent on such fine resolution.

In a disease study setting, the significance of multiallelic sites for genomic sequence analysis is an even more important practical issue. In family, cohort and diagnostic studies, patterns of segregation or predicted impact on gene function can lead to specific alleles being considered as candidates for disease ‘causation’. Given the limits of our currently available knowledge of the function of individual variant alleles, the rarity of any observed variation is commonly used as a proxy for prediction of functionally deleterious changes. Specifically, if an allelic variant has never been seen in a large control cohort or variant database, its likelihood of pathogenicity is considered to be higher than if it has been observed in multiple other individuals. This is assumed to be true even if the control cohort has not been specifically ascertained for the phenotype of interest and is merely assumed to represent ‘healthy individuals’. To enable comparisons across the full variant frequency spectrum, accurate and large-scale collections of variant frequency information across populations are essential.

Some databases and tools have not fully supported identification and analysis of multiallelic sites. Our own in-house variant caller (Atlas 2.0) (Shen et al. 2010) was not initially designed to report two distinct non-reference alleles at a given genomic position in a single individual. Similarly, two frequently utilized human variant datasets, the NHLBI Exome Sequencing Project (ESP) Exome Variant Server (Tennessen et al. 2012) and the 1000 Genomes Project Phase 1 (1000 Genomes Project Consortium et al. 2012), do not report multiallelic SNVs. More recent versions of tools such as GATK, as well as the databases dbSNP, 1000G Phase 3, and ExAC (Exome Aggregation Consortium, http://exac.broadinstitute.org), do now support multiallelic SNVs.

The above issues can be particularly problematic at sites where the reference genome contains an allele that, when considering the human population or a major ethnic group as a whole, is actually a minor allele. At these sites, the majority of individuals carry at least one “variant”, potentially masking a second and possibly pathogenic variant. Such sites account for ~ 2.3% (1,742,796 / 77,228,885) of all autosomal positions with SNVs in the 1000G Phase 3.

To examine the frequency of multiallelic sites among our in-house sample sets, we analyzed 14,443 individuals and found that 50,935 of 2,268,345 positions (2.24%) that were observed to be variant (using conservative sequence mapping criteria of a total sequence read count > 30 and a ‘variant read’:‘total read’ ratio of > 0.3) are actually multiallelic (Figure 1). Relatedly, 4.12% of all genotype calls occur at these multiallelic positions. As expected, the rate of identification of multiallelic sites among all variant sites increases non-linearly as the total sample size increases. This relationship is similar to the increase of SNV occurrences with sample size (International HapMap 3 Consortium et al. 2010). These data predict that 6.00% of variant sites will be multiallelic in a cohort of ~100,000 samples (see Figure 1 for formula). This approximates the number found in data from the ExAC server, using their default calling criteria, where 600,072 of 9,462,741 variant positions (6.4%) (in 61,486 individuals) are actually observed to be multiallelic for SNVs. This suggests that the challenges resulting from multiallelic variants increase in concert with the number of individuals in the study sample.
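The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout of `calls`, the function name `multiallelic_fraction`, and the toy records are hypothetical and are not the pipeline used for the reported cohort. Only the two thresholds (total reads > 30, variant:total read ratio > 0.3) come from the text.

```python
from collections import defaultdict

MIN_TOTAL_READS = 30     # "total sequence read count > 30"
MIN_VARIANT_RATIO = 0.3  # 'variant read':'total read' ratio > 0.3


def multiallelic_fraction(calls):
    """Fraction of variant positions that carry more than one ALT allele.

    calls: iterable of (chrom, pos, alt, variant_reads, total_reads),
    a hypothetical layout chosen for this sketch.
    """
    alts_by_site = defaultdict(set)
    for chrom, pos, alt, variant_reads, total_reads in calls:
        # Keep only calls that pass both read-based filters.
        if total_reads > MIN_TOTAL_READS and variant_reads / total_reads > MIN_VARIANT_RATIO:
            alts_by_site[(chrom, pos)].add(alt)
    if not alts_by_site:
        return 0.0
    n_multi = sum(1 for alts in alts_by_site.values() if len(alts) > 1)
    return n_multi / len(alts_by_site)


# Toy records (invented for illustration only).
calls = [
    ("1", 100, "A", 20, 40),
    ("1", 100, "T", 18, 35),  # second ALT at the same position
    ("1", 200, "G", 25, 50),
    ("2", 300, "C", 5, 50),   # fails the ratio filter
    ("2", 400, "C", 16, 31),
]
fraction = multiallelic_fraction(calls)  # 1 of 3 passing sites is multiallelic
```

Note that positions are pooled across all individuals before counting, so a site is multiallelic here whenever two different ALT alleles are seen anywhere in the cohort, not necessarily in the same person.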

Figure 1.

Observation of multiallelic positions increases with cohort size. To assess the relationship between cohort size and the proportion of sites observed to be multiallelic, we resampled the HGSC cohort to produce random subsets of various sizes. The x-axis shows the number of individuals in the sub-sampled cohorts and the y-axis shows the median proportion of sites observed to be multiallelic across samples. The red line corresponds to a fitted non-linear model, while the dotted line corresponds to a linear model. This model suggests that the challenges that result from multiallelic sites increase with cohort size. The observed fractions for the Center for Mendelian Genomics (CMG) and Atherosclerosis Risk in Communities (ARIC) sub-cohorts are also plotted.
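The resampling procedure in this caption could be sketched as follows. The data structure (`sample_alts`, mapping each individual to the set of alleles they carry), the function names, the number of replicates, and the toy cohort are all assumptions made for illustration. The fitted non-linear model is not reproduced because its formula is not given in the text.

```python
import random
from statistics import median


def multiallelic_proportion(sample_alts, subset):
    """Proportion of variant sites with more than one ALT allele in a subset."""
    alts_by_site = {}
    for sample in subset:
        for site, alt in sample_alts[sample]:
            alts_by_site.setdefault(site, set()).add(alt)
    if not alts_by_site:
        return 0.0
    n_multi = sum(1 for alts in alts_by_site.values() if len(alts) > 1)
    return n_multi / len(alts_by_site)


def resample_curve(sample_alts, sizes, n_replicates=100, seed=0):
    """Median multiallelic proportion over random sub-cohorts of each size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    samples = sorted(sample_alts)
    curve = {}
    for size in sizes:
        proportions = [
            multiallelic_proportion(sample_alts, rng.sample(samples, size))
            for _ in range(n_replicates)
        ]
        curve[size] = median(proportions)
    return curve


# Toy cohort (invented): each sample maps to the (site, alt) pairs it carries.
toy_cohort = {
    "s1": {("site1", "A")},
    "s2": {("site1", "T")},
    "s3": {("site1", "A"), ("site2", "G")},
    "s4": {("site2", "G")},
}
curve = resample_curve(toy_cohort, sizes=[1, 4])
```

In the toy cohort a single individual can never reveal a multiallelic site, while the full cohort of four shows one multiallelic site out of two, which mirrors in miniature why the observed proportion grows with cohort size.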

To illustrate the importance of recognizing multiallelic sites in real data, we explored variants that occur in > 0.1% of individuals from our local cohort (≥ 15 individuals) and compared these variants to those reported in ESP. Under a Poisson model, the variants we analyzed are unlikely to occur zero times in a cohort of the size of the ESP (estimated Poisson model zero probability for a single locus, p = 0.001). We restricted our analyses to genomic positions that had one reported allele in ESP to limit biases arising due to differences in coverage. Figure 2A indicates positions of synonymous, non-synonymous and stop-gain/loss variants that occur in > 0.1% of individuals in our cohort but that are not reported in ESP because they occur at multiallelic sites. At 52 positions, the alternative allele not present in ESP was assayed on the Illumina HumanExome BeadChip (Grove et al. 2013). We found that 100% of the variant calls from exome sequencing at these positions were concordant with microarray results for individuals tested with both technologies. This result is unsurprising given the strict variant read (VR) and total read (TR) cutoff values we used for our analysis. These experimental observations are consistent with the contention that the multiallelic variants we detected are indeed true variants rather than sequencing errors.
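The Poisson argument above can be made concrete with a short calculation. Under a Poisson model the probability of observing zero carriers is exp(-λ), where λ is the expected number of carriers. The sketch below assumes an ESP cohort of roughly 6,500 individuals, a figure that is not stated in the text, and the function name is hypothetical.

```python
import math


def poisson_zero_probability(carrier_frequency, cohort_size):
    """P(X = 0) for X ~ Poisson(carrier_frequency * cohort_size)."""
    expected_carriers = carrier_frequency * cohort_size
    return math.exp(-expected_carriers)


ASSUMED_ESP_SIZE = 6500  # assumption for illustration; not given in the text
p_zero = poisson_zero_probability(0.001, ASSUMED_ESP_SIZE)
```

With these assumed inputs the zero-carrier probability is on the order of 0.001, and it falls further for any variant more frequent than the 0.1% threshold.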

Figure 2.

Alleles Not Reported in ESP but Present in >0.1% of Individuals Sequenced at the Baylor College of Medicine Human Genome Sequencing Center. A. Ideogram indicating the distribution of multiallelic variants missing from ESP. Variants are color-coded by their effect on coding sequence. B. Number of multiallelic SNV genotypes and alleles identified in our cohort in more than 0.1% of individuals that are missing from ESP, most likely because they are masked by other alleles. Note that no multiallelic SNVs are reported in ESP.

Figure 2B gives the breakdown of such variants by type. As an example, this analysis identified 17 stop-gain/loss alleles that occur in >0.1% of individuals but are not included in ESP. To estimate the upper bound of how many stop-gain/loss mutations could potentially be masked, we searched dbNSFP (Liu et al. 2013) version 2.1 for stop-gain and stop-loss alleles not reported in ESP that could arise at a position with a different ESP allele; we found 113,538 potential stop-gain or stop-loss alleles that, if they did occur, would likely be masked.

Failing to account for multiallelic variants in personal genomes sequenced clinically could theoretically lead to missed diagnoses. For example, consider an individual heterozygous for a common synonymous variant as well as a rare pathogenic variant at the same genomic position. If the variant calling algorithm is not equipped to account for two non-reference variant alleles, the pathogenic variant may go unnoticed. Similarly, if databases commonly used for variant interpretation do not contain multiallelic variants, misinterpretation of variant frequencies can potentially occur. As a specific example, Trivellin et al. report in their 2014 New England Journal of Medicine article regarding patients with gigantism and acromegaly that, “11 patients had a c.924G>C substitution (p.E308D) in GPR101, which was not found in 7600 control samples obtained from public databases,” referring to data from the ESP (Trivellin et al. 2014). However, the GPR101 (MIM #300393, GenBank NM_054021.1) p.E308D variant is absent in ESP not necessarily because it does not occur in the population, but because another, even rarer, synonymous c.924G>T (p.E308=) variant is reported instead. More recent data from ExAC suggest the p.E308D variant is present in ~ 0.4% of the population. Thus, while a significant enrichment of the p.E308D variant remains in the Trivellin patient cohort (p = 1×10^-13, Fisher's exact test), the interpretation is nonetheless subtly altered.
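One way to guard against this kind of misreading is to distinguish "the site is absent from the database" from "the site is present, but with a different allele". The sketch below is a hypothetical helper, not the interface of any real database. The single entry mirrors the GPR101 example from the text, where ESP reports c.924G>T but not c.924G>C; the dictionary layout and status strings are assumptions.

```python
def classify_lookup(reported, site, ref, alt):
    """Classify a query allele against a database keyed by site.

    reported: dict mapping a site key to the set of (ref, alt) pairs listed.
    Returns one of 'site_absent', 'allele_present', 'other_allele_at_site'.
    """
    alleles = reported.get(site)
    if alleles is None:
        return "site_absent"
    if (ref, alt) in alleles:
        return "allele_present"
    # The position is listed, but only with a different allele: the query
    # allele may be masked rather than truly unobserved.
    return "other_allele_at_site"


# Hypothetical database holding one allele per site, as in the text's example.
esp_like = {("GPR101", 924): {("G", "T")}}

status = classify_lookup(esp_like, ("GPR101", 924), "G", "C")
```

A result of `other_allele_at_site` would be a cue to check a resource with full multiallelic support before treating the query allele as novel.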

The occurrence of multiallelic sites in genomic regions present in more than one copy in the haploid reference, for example segmental duplications, is of further importance (Bailey et al. 2001; Fredman et al. 2004). Even biallelic SNVs are challenging to distinguish from paralogous sequence variants (PSVs) and multi-site variants (MSVs) (Lindsay et al. 2006). Unrecognized multiallelic variants have been mistaken for biallelic variation at remote positions, whereas the real structure is of a single multiallelic variant position (Estivill et al. 2002). Such variants are often removed outright by bioinformatics pipelines rather than being analyzed in a way that incorporates this complexity.

Data collections, databases and toolsets are continually evolving and new capabilities and data dimensions are constantly emerging. The possibility that additional multiallelic variation at unique sites in the human genome may be present, but not properly represented, in these evolving public data sets should be considered when interpreting local data. As with all genomic analyses, biological interpretations should first carefully consider the intended use and scope of existing datasets.

Table 1 lists a further selection of resources and tools commonly used in genomics. We contacted the developers or tested each for its handling of multiallelic variants. If a specific use case requires consideration of sites that could potentially vary in more than one way, as many do, we suggest accessing a resource that offers full compatibility. The complexities of human variation and often-limited resources necessitate developing tools that function on the majority of cases rather than all conceivable cases. In other cases, the mathematical or computational difficulties may hamper analysis.

Table 1.

Support for Multiallelic Variants Among Selected Genomics Resources

Name | URL | Compatibility
Variant classification databases
HGMD | http://www.hgmd.cf.ac.uk/ac/index.php | Full
ClinVar | http://www.ncbi.nlm.nih.gov/clinvar/ | Full
COSMIC | http://cancer.sanger.ac.uk/cosmic | Full
Variant frequency databases
ExAC | http://exac.broadinstitute.org/ | Full
Exome Variant Server | http://evs.gs.washington.edu/EVS/ | None
1000 genomes | http://www.1000genomes.org/ | Full (Version 3+)
dbSNP | http://www.ncbi.nlm.nih.gov/SNP/ | Full
Variant callers
GATK | https://www.broadinstitute.org/gatk/ | Full
samtools | http://samtools.sourceforge.net/ | Full
Atlas | https://www.hgsc.bcm.edu/software/software/atlas-2 | None (Planned)
Annotation tools
ANNOVAR | http://annovar.openbioinformatics.org/en/latest/ | Partial
dbNSFP | https://sites.google.com/site/jpopgen/dbNSFP | Full
SnpEFF | http://snpeff.sourceforge.net/ | Full
Variant Effect Predictor | http://www.ensembl.org/Homo_sapiens/Tools/VEP | Full
VariantAnnotation | https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html | Partial
Analysis tools
PLINK | http://pngu.mgh.harvard.edu/~purcell/plink/ | None (Planned)
VAAST | http://www.yandell-lab.org/software/vaast.html | Full
pVAAST | http://www.hufflab.org/software/pvaast/ | Full
PhenIX | http://compbio.charite.de/PhenIX/ | Full
Exomiser | https://www.sanger.ac.uk/resources/software/exomiser/ | Full
eXtasy | http://extasy.esat.kuleuven.be/ | Partial
Phevor | http://weatherby.genetics.utah.edu/cgi-bin/Phevor/PhevorWeb.html | Full
SNPRelate | https://www.bioconductor.org/packages/release/bioc/html/SNPRelate.html | None
Variant imputation / Haplotype reconstruction
SNPHap | https://www-gene.cimr.cam.ac.uk/staff/clayton/software/ | None
PHASE | http://stephenslab.uchicago.edu/software.html | Full
FastPHASE | http://stephenslab.uchicago.edu/software.html | None
Arlequin | http://cmpg.unibe.ch/software/arlequin35/ | Full
IMPUTE2 | https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#examples | None
BEAGLE | http://faculty.washington.edu/browning/beagle/beagle.html | Full

Going forward, we suggest that multiallelic sites are sufficiently common and represent such a fundamental property of human variation that support for multiallelic variants should be considered a requirement for any genomic tool or repository. Ideally, tools would accept multiple variant alleles that occupy the same genomic position from different lines of an input file as well as encoded into the same line, as in the Variant Call Format (VCF) standard (Global Alliance Data, http://ga4gh.org/#/fileformats-team). Many users may be unaware of how multiallelic variants are encoded in their data.
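For users unsure how their own data encode multiallelic variants, a quick check of the VCF text can reveal both encodings: several comma-separated alleles in the ALT column of one line, or several lines sharing the same CHROM, POS and REF. The following is a minimal sketch using only the Python standard library. The function name and the toy records are hypothetical, and keying sites on (CHROM, POS, REF) is a simplification that suits SNVs but not necessarily indels.

```python
from collections import defaultdict


def find_multiallelic_sites(vcf_lines):
    """Return {(chrom, pos, ref): [alts]} for sites with more than one ALT.

    Handles both single-line (ALT = 'C,T') and multi-line encodings.
    """
    alts_by_site = defaultdict(set)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        for allele in alt.split(","):
            if allele != ".":  # '.' means no ALT allele was called
                alts_by_site[(chrom, pos, ref)].add(allele)
    return {
        site: sorted(alts)
        for site, alts in alts_by_site.items()
        if len(alts) > 1
    }


# Toy records (invented for illustration only).
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tG\tC,T\t.\tPASS\t.",  # single-line encoding
    "1\t200\t.\tA\tG\t.\tPASS\t.",    # multi-line encoding, first allele
    "1\t200\t.\tA\tT\t.\tPASS\t.",    # multi-line encoding, second allele
    "1\t300\t.\tC\tT\t.\tPASS\t.",    # ordinary biallelic site
]
multiallelic = find_multiallelic_sites(vcf_lines)
```

Running such a check before and after any conversion step is one way to confirm that alleles were not silently dropped.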

The choice of single-line versus multi-line representation can be particularly problematic and deserves further discussion. Single-line representations of multiallelic variants are often preferred by variant calling or joint genotyping algorithms. Meanwhile, tools designed to store variants in relational databases or make annotations may prefer a multi-line format. Therefore, standardized procedures to reliably convert between these two representations of multiallelic variants will need to be developed. For example, it is currently not clear how some properties of multiallelic variants, such as genotype likelihoods, should be split when converting from single to multi-line representation. This information is often lost during the conversion, hampering reproduction of the original dataset.
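The loss of information during splitting can be seen with the genotype likelihood fields themselves. In VCF, per-genotype values such as PL are ordered so that a diploid genotype j/k (j ≤ k) sits at index k(k+1)/2 + j. The sketch below applies one possible splitting convention, keeping 0/0, 0/a and a/a for each ALT allele a. This convention is an assumption chosen for illustration, not a standardized procedure, and the function names and PL values are hypothetical.

```python
def pl_index(j, k):
    """Index of diploid genotype j/k (j <= k) in VCF GL/PL ordering."""
    return k * (k + 1) // 2 + j


def split_pl(pl, n_alt):
    """Split a multiallelic PL vector into one biallelic triple per ALT.

    Returns (per_allele, dropped), where dropped lists the indices of
    genotypes that do not fit in any biallelic record.
    """
    expected = (n_alt + 1) * (n_alt + 2) // 2
    if len(pl) != expected:
        raise ValueError(f"expected {expected} PL values, got {len(pl)}")
    per_allele = {}
    kept = set()
    for a in range(1, n_alt + 1):
        indices = [pl_index(0, 0), pl_index(0, a), pl_index(a, a)]
        per_allele[a] = [pl[i] for i in indices]
        kept.update(indices)
    dropped = [i for i in range(len(pl)) if i not in kept]
    return per_allele, dropped


# Hypothetical PL values for REF=G, ALT=C,T in a sample called 1/2.
# Order: 0/0, 0/1, 1/1, 0/2, 1/2, 2/2
pl = [120, 60, 200, 55, 0, 180]
per_allele, dropped = split_pl(pl, n_alt=2)
```

Here the value for the 1/2 genotype, the most likely one in this example, has no place in either biallelic record, so the original call cannot be rebuilt from the split lines alone.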

Segmental duplications encode a number of functional genes (She et al. 2004), but our understanding of how variation of these genes may contribute to disease is lacking because few analysis pipelines are designed to investigate them. As the read length of sequencing technologies improves, so too will our ability to fully delineate both bi- and multiallelic variants as well as segmental gains from PSVs and MSVs. Informatics will evolve to address these issues and capitalize on the ability to assay such variation and assess potential association with trait manifestation.

Finally, Human Genetics lacks clearly defined shared standards or a recognized single organized community approach for large-scale meta-analyses of massively parallel sequencing data. Such integrated analyses would provide the opportunity to identify emerging data and analysis issues such as the appropriate handling of multiallelic sites. Some challenges might initially seem largely insignificant at the level of single or small numbers of samples but can have important consequences at larger scales. Resolving such problems will hopefully enable more robust and reliable comparison of sequencing data generated among different groups. Systematic practices will also improve the quality of sequencing data at the level of a single individual, facilitating molecular diagnosis. The development of community resources and standards frameworks is likely to be both logistically and politically difficult, but such approaches might be able to address a number of the challenges currently facing Human Genomics in a discipline and era where global data sharing is to the benefit of populations worldwide.

Acknowledgments

Genotype data were obtained in a manner conforming to the Institutional Review Board of Baylor College of Medicine. JRL has stock ownership in 23andMe, is a paid consultant for Regeneron Pharmaceuticals, has stock options in Lasergen Inc., is a member of the Scientific Advisory Board of Baylor Miraca Genetics Laboratories, and is a co-inventor on multiple United States and European patents related to molecular diagnostics for inherited neuropathies, eye diseases, and bacterial genomic fingerprinting. The Department of Molecular and Human Genetics at Baylor College of Medicine derives revenue from the chromosomal microarray analysis (CMA) and clinical exome sequencing offered in the Baylor Miraca Genetics Laboratory (http://www.bcm.edu/geneticlabs/).

Grant Sponsors:

IMC is a fellow of the Baylor College of Medicine Medical Scientist Training Program (T32 GM007330) and was supported by a fellowship from the National Institute of Neurological Disorders and Stroke (NINDS) (F31 NS083159); National Human Genome Research Institute (NHGRI)/National Heart Lung and Blood Institute (NHLBI) (grant U54 HG006542, NINDS grant R01 NS058529, NHGRI grant U54 HG003273); National Institutes of Health American Recovery and Reinvestment Act of 2009 (grant RC2 HL102419); Atherosclerosis Risk in Communities (ARIC) study carried out as a collaborative study supported by the NHLBI (contracts HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C, HHSN268201100012C).

Footnotes

The other authors have no conflicts of interest to report.

References

  1. 1000 Genomes Project Consortium. Abecasis GR, Auton A, Brooks LD, Depristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65. doi: 10.1038/nature11632.
  2. Alvarez G. Deviations from Hardy-Weinberg proportions for multiple alleles under viability selection. Genet Res (Camb) 2008;90:209–216. doi: 10.1017/S0016672307009068.
  3. Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 2001;11:1005–1017. doi: 10.1101/gr.187101.
  4. Estivill X, Cheung J, Pujana MA, Nakabayashi K, Scherer SW, Tsui L-C. Chromosomal regions containing high-density and ambiguously mapped putative single nucleotide polymorphisms (SNPs) correlate with segmental duplications in the human genome. Hum Mol Genet. 2002;11:1987–1995. doi: 10.1093/hmg/11.17.1987.
  5. Fredman D, White SJ, Potter S, Eichler EE, Dunnen den JT, Brookes AJ. Complex SNP-related sequence variation in segmental genome duplications. Nat Genet. 2004;36:861–866. doi: 10.1038/ng1401.
  6. Grove ML, Yu B, Cochran BJ, Haritunians T, Bis JC, Taylor KD, Hansen M, Borecki IB, Cupples LA, Fornage M, Gudnason V, Harris TB, et al. Best practices and joint calling of the HumanExome BeadChip: the CHARGE Consortium. PLoS ONE. 2013;8:e68095. doi: 10.1371/journal.pone.0068095.
  7. International HapMap 3 Consortium. Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Bonnen PE, de Bakker PIW, Deloukas P, Gabriel SB, Gwilliam R, et al. Integrating common and rare genetic variation in diverse human populations. Nature. 2010;467:52–58. doi: 10.1038/nature09298.
  8. Lindsay SJ, Khajavi M, Lupski JR, Hurles ME. A chromosomal rearrangement hotspot can be identified from population genetic variation and is coincident with a hotspot for allelic recombination. The American Journal of Human Genetics. 2006;79:890–902. doi: 10.1086/508709.
  9. Liu X, Jian X, Boerwinkle E. dbNSFP v2.0: a database of human non-synonymous SNVs and their functional predictions and annotations. Hum. Mutat. 2013;34:E2393–402. doi: 10.1002/humu.22376.
  10. She X, Jiang Z, Clark RA, Liu G, Cheng Z, Tuzun E, Church DM, Sutton G, Halpern AL, Eichler EE. Shotgun sequence assembly and recent segmental duplications within the human genome. Nature. 2004;431:927–930. doi: 10.1038/nature03062.
  11. Shen Y, Wan Z, Coarfa C, Drabek R, Chen L, Ostrowski EA, Liu Y, Weinstock GM, Wheeler DA, Gibbs RA, Yu F. A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome Res. 2010;20:273–280. doi: 10.1101/gr.096388.109.
  12. Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G, Kang HM, Jordan D, et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337:64–69. doi: 10.1126/science.1219240.
  13. Trivellin G, Daly AF, Faucz FR, Yuan B, Rostomyan L, Larco DO, Schernthaner-Reiter MH, Szarek E, Leal LF, Caberg J-H, Castermans E, Villa C, et al. Gigantism and acromegaly due to Xq26 microduplications and GPR101 mutation. N Engl J Med. 2014;371:2363–2374. doi: 10.1056/NEJMoa1408028.
