Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes

Qingbo Wang; Emma Pierce-Hoffman; Beryl B Cummings; Jessica Alföldi; Laurent C Francioli; Laura D Gauthier; Andrew J Hill; Anne H O’Donnell-Luria; Genome Aggregation Database Production Team; Genome Aggregation Database Consortium; Konrad J Karczewski; Daniel G MacArthur

doi:10.1038/s41467-019-12438-5

. 2020 May 27;11:2539. doi: 10.1038/s41467-019-12438-5

Qingbo Wang¹,²,³, Emma Pierce-Hoffman¹, Beryl B Cummings¹,²,⁴, Jessica Alföldi¹,², Laurent C Francioli¹,², Laura D Gauthier¹,⁵, Andrew J Hill¹,⁶, Anne H O’Donnell-Luria¹,²; Genome Aggregation Database Production Team; Genome Aggregation Database Consortium; Konrad J Karczewski¹,², Daniel G MacArthur¹,²,⁷,⁸,✉

¹Program in Medical and Population Genetics, The Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA

²Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114 USA

³Program in Bioinformatics and Integrative Genomics, Harvard Medical School, Boston, MA 02115 USA

⁴Program in Biomedical and Biological Sciences, Harvard Medical School, Boston, MA 02115 USA

⁵Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA

⁶Department of Genome Sciences, University of Washington, Seattle, WA 98195 USA

⁷Centre for Population Genomics, Garvan Institute of Medical Research, and UNSW Sydney, Sydney, Australia

⁸Centre for Population Genomics, Murdoch Children’s Research Institute, Melbourne, Australia

⁹European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD UK

¹⁰Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114 USA

¹¹Genomics Platform, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA

¹²Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA

¹³Broad Genomics, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA

¹⁴Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA UK

¹⁵National Heart & Lung Institute and MRC London Institute of Medical Sciences, Imperial College London, London, W12 0NN UK

¹⁶Cardiovascular Research Centre, Royal Brompton & Harefield Hospitals NHS Trust, London, SW3 6NP UK

¹⁷Unidad de Investigacion de Enfermedades Metabolicas. Instituto Nacional de Ciencias Medicas y Nutricion, Mexico City, 14080 Mexico

¹⁸Peninsula College of Medicine and Dentistry, Exeter, EX25DW UK

¹⁹Division of Preventive Medicine, Brigham and Women’s Hospital, Boston, MA 02115 USA

²⁰Division of Cardiovascular Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115 USA

²¹Department of Cardiology, University Hospital, 43100 Parma, Italy

²²Department of Biology, Faculty of Natural Sciences, University of Haifa, Haifa, 3498838 Israel

²³Departments of Medicine and Genetics, Albert Einstein College of Medicine, Bronx, NY 10461 USA

²⁴Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44122 USA

²⁵Sorbonne Université, APHP, Gastroenterology Department, Saint Antoine Hospital, Paris, 75012 France

²⁶NHLBI and Boston University’s Framingham Heart Study, Framingham, MA 01702 USA

²⁷Department of Medicine, Boston University School of Medicine, Boston, MA 02118 USA

²⁸Department of Epidemiology, Boston University School of Public Health, Boston, MA 02118 USA

²⁹Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109 USA

³⁰National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892 USA

³¹The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029 USA

³²Department of Biochemistry, Wake Forest School of Medicine, Winston-Salem, NC 27101 USA

³³Center for Genomics and Personalized Medicine Research, Wake Forest School of Medicine, Winston-Salem, NC 27157 USA

³⁴Center for Diabetes Research, Wake Forest School of Medicine, Winston-Salem, NC 27101 USA

³⁵Department of Cardiovascular Sciences, University of Leicester, Leicester, LE1 7RH UK

³⁶NIHR Leicester Biomedical Research Centre, Glenfield Hospital, Leicester, LE3 9QP UK

³⁷Department of Epidemiology and Biostatistics, Imperial College London, London, W2 1PG UK

³⁸Department of Cardiology, Ealing Hospital NHS Trust, Southall, UB1 3HW UK

³⁹Imperial College Healthcare NHS Trust, Imperial College London, London, W2 1NY UK

⁴⁰Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong, China

⁴¹Department of Medicine, Harvard Medical School, Boston, MA 02115 USA

⁴²Departments of Cardiovascular Medicine, Cellular and Molecular Medicine, Molecular Cardiology and Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH 44195 USA

⁴³McLean Hospital, Belmont, MA 02478 USA

⁴⁴Department of Medicine, University of Mississippi Medical Center, Jackson, MS 39216 USA

⁴⁵Department of Epidemiology, Colorado School of Public Health, Aurora, CO 80045 USA

⁴⁶Department of Medicine and Pharmacology, University of Illinois at Chicago, Chicago, IL 60612 USA

⁴⁷Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX 78227 USA

⁴⁸Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118 USA

⁴⁹Cardiac Arrhythmia Service and Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA 02114 USA

⁵⁰Cardiovascular Epidemiology and Genetics, Hospital del Mar Medical Research Institute (IMIM), Barcelona, 08003 Catalonia Spain

⁵¹CIBER CV, Barcelona, 08017 Catalonia Spain

⁵²Department of Medicine, Medical School, University of Vic-Central University of Catalonia, Barcelona, 08500 Spain

⁵³Institute for Cardiogenetics, University of Lübeck, Lübeck, 23562 Germany

⁵⁴DZHK (German Research Centre for Cardiovascular Research), Partner Site Hamburg/Lübeck/Kiel, 23562 Lübeck, Germany

⁵⁵University Heart Center Lübeck, 23562 Lübeck, Germany

⁵⁶Estonian Genome Center, Institute of Genomics, University of Tartu, Tartu, 51003 Estonia

⁵⁷Helsinki University and Helsinki University Hospital, Clinic of Gastroenterology, Helsinki, 00100 Finland

⁵⁸Institute of Clinical Molecular Biology (IKMB), Christian-Albrechts-University of Kiel, Kiel, 24118 Germany

⁵⁹Bioinformatics Program, MGH Cancer Center and Department of Pathology, Boston, MA 02129 USA

⁶⁰Cancer Genome Computational Analysis, Broad Institute, Cambridge, MA 02142 USA

⁶¹Endocrinology and Metabolism Department, Hadassah-Hebrew University Medical Center, Jerusalem, 91120 Israel

⁶²Department of Psychiatry and Behavioral Sciences, SUNY Upstate Medical University, Oneida, NY 13421 USA

⁶³Institute for Genomic Medicine, Columbia University Medical Center, Hammer Health Sciences, 1408, 701 West 168th Street, New York, NY 10032 USA

⁶⁴Department of Genetics & Development, Columbia University Medical Center, Hammer Health Sciences, 1602, 701 West 168th Street, New York, NY 10032 USA

⁶⁵Centro de Investigacion en Salud Poblacional. Instituto Nacional de Salud Publica MEXICO, Mexico, 62100 Mexico

⁶⁶Lund University, Lund, SE-221 00 Sweden

⁶⁷Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, 00014 Finland

⁶⁸Lund University Diabetes Centre, Lund, SE-214 28 Sweden

⁶⁹Human Genetics Center, University of Texas Health Science Center at Houston, Houston, TX 77030 USA

⁷⁰Department of Neurology, Columbia University, New York, NY 10032 USA

⁷¹Institute of Biomedicine, University of Eastern Finland, Kuopio, 70210 Finland

⁷²Department of Psychiatry, PL 320, Helsinki University Central Hospital, Lapinlahdentie, 00 180 Helsinki, Finland

⁷³Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, 171 77 Sweden

⁷⁴Icahn School of Medicine at Mount Sinai, New York, NY 10029 USA

⁷⁵Department of Neurology, Helsinki University Central Hospital, Helsinki, 00290 Finland

⁷⁶Department of Public Health, Faculty of Medicine, University of Helsinki, Helsinki, 00014 Finland

⁷⁷Center for Genome Science, Korea National Institute of Health, Chungcheongbuk-do, 363-951 Republic of Korea

⁷⁸MRC Centre for Neuropsychiatric Genetics & Genomics, Cardiff University School of Medicine, Hadyn Ellis Building, Maindy Road, Cardiff, CF24 4HQ UK

⁷⁹National Heart and Lung Institute, Cardiovascular Sciences, Hammersmith Campus, Imperial College London, London, SW3 6LY UK

⁸⁰Department of Health, THL-National Institute for Health and Welfare, 00271 Helsinki, Finland

⁸¹Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT 06510 USA

⁸²Center for Outcomes Research and Evaluation, Yale-New Haven Hospital, New Haven, CT 06510 USA

⁸³Division of Pediatric Gastroenterology, Emory University School of Medicine, Atlanta, GA 30322 USA

⁸⁴Department of Internal Medicine, Seoul National University Hospital, Seoul, 03080 Republic of Korea

⁸⁵Institute of Clinical Medicine, The University of Eastern Finland, Kuopio, 70210 Finland

⁸⁶Kuopio University Hospital, Kuopio, 70210 Finland

⁸⁷Department of Clinical Chemistry, Fimlab Laboratories and Finnish Cardiovascular Research Center-Tampere, Faculty of Medicine and Health Technology, Tampere University, Tampere, 33720 Finland

⁸⁸The Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, NY 10029 USA

⁸⁹Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Hong Kong, China

⁹⁰Hong Kong Institute of Diabetes and Obesity, The Chinese University of Hong Kong, Hong Kong, China

⁹¹Cardiovascular Research REGICOR Group, Hospital del Mar Medical Research Institute (IMIM), Barcelona, 08003 Catalonia Spain

⁹²Department of Genetics, Harvard Medical School, Boston, MA 02115 USA

⁹³Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Churchill Hospital, Old Road, Headington, Oxford, OX3 7LJ UK

⁹⁴Wellcome Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, OX3 7BN UK

⁹⁵Oxford NIHR Biomedical Research Centre, Oxford University Hospitals NHS Foundation Trust, John Radcliffe Hospital, Oxford, OX3 9DU UK

⁹⁶F Widjaja Foundation Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048 USA

⁹⁷Atherogenomics Laboratory, University of Ottawa Heart Institute, Ottawa, ON K1Y 4W7 Canada

⁹⁸Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA 02114 USA

⁹⁹Department of Clinical Sciences, University Hospital Malmo Clinical Research Center, Lund University, Malmo, 205 02 Sweden

¹⁰⁰Lund University, Department of Clinical Sciences, Skane University Hospital, Malmo, 222 42 Sweden

¹⁰¹Instituto Nacional de Medicina Genómica (INMEGEN), Mexico City, 14610 Mexico

¹⁰²Medical Research Institute, Ninewells Hospital and Medical School, University of Dundee, Dundee, DD1 9SY UK

¹⁰³Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, 08826 Republic of Korea

¹⁰⁴Department of Psychiatry, Keck School of Medicine at the University of Southern California, Los Angeles, CA 90033 USA

¹⁰⁵Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD 21205 USA

¹⁰⁶Division of Genetics and Epidemiology, Institute of Cancer Research, London, SM2 5NG UK

¹⁰⁷Medical Research Center, Oulu University Hospital, Oulu, Finland and Research Unit of Clinical Neuroscience, Neurology, University of Oulu, Oulu, 90014 Finland

¹⁰⁸Research Center, Montreal Heart Institute, Montreal, Quebec, H1T 1C8 Canada

¹⁰⁹Department of Medicine, Faculty of Medicine, Université de Montréal, Québec, H3T 1J4 Canada

¹¹⁰Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212 USA

¹¹¹Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212 USA

¹¹²Department of Biostatistics and Epidemiology, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA 19104 USA

¹¹³Department of Medicine, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA 19104 USA

¹¹⁴Center for Non-Communicable Diseases, Karachi, 75300 Pakistan

¹¹⁵National Institute for Health and Welfare, Helsinki, 00271 Finland

¹¹⁶Deutsches Herzzentrum München, München, 80636 Germany

¹¹⁷Technische Universität München, München, 80333 Germany

¹¹⁸Division of Cardiovascular Medicine, Nashville VA Medical Center and Vanderbilt University, School of Medicine, Nashville, TN 37232-8802 USA

¹¹⁹Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, NY 10029 USA

¹²⁰Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029 USA

¹²¹Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY 10029 USA

¹²²Institute of Clinical Medicine, Neurology, University of Eastern Finland, Kuopio, 80101 Finland

¹²³Department of Twin Research and Genetic Epidemiology, King’s College London, London, WC2R 2LS UK

¹²⁴Departments of Genetics and Psychiatry, University of North Carolina, Chapel Hill, NC 27599 USA

¹²⁵Saw Swee Hock School of Public Health, National University of Singapore, National University Health System, Singapore, 117549 Singapore

¹²⁶Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore

¹²⁷Duke-NUS Graduate Medical School, Singapore, 169857 Singapore

¹²⁸Life Sciences Institute, National University of Singapore, Singapore, 117456 Singapore

¹²⁹Department of Statistics and Applied Probability, National University of Singapore, Singapore, 117546 Singapore

¹³⁰Folkhälsan Institute of Genetics, Folkhälsan Research Center, Helsinki, 00250 Finland

¹³¹HUCH Abdominal Center, Helsinki University Hospital, Helsinki, 00100 Finland

¹³²Center for Behavioral Genomics, Department of Psychiatry, University of California, San Diego, CA 92093 USA

¹³³Institute of Genomic Medicine, University of California, San Diego, CA 92093 USA

¹³⁴Juliet Keidan Institute of Pediatric Gastroenterology, Shaare Zedek Medical Center, The Hebrew University of Jerusalem, Jerusalem, 91905 Israel

¹³⁵Instituto de Investigaciones Biomédicas UNAM, Mexico City, 04510 Mexico

¹³⁶Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán Mexico City, Mexico City, 14080 Mexico

¹³⁷Radcliffe Department of Medicine, University of Oxford, Oxford, OX3 9DU UK

¹³⁸Department of Gastroenterology and Hepatology, University of Groningen and University Medical Center Groningen, Groningen, 9713 The Netherlands

¹³⁹Department of Physiology and Biophysics, University of Mississippi Medical Center, Jackson, MS 39216 USA

¹⁴⁰Program in Infectious Disease and Microbiome, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA

¹⁴¹Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA 02114 USA

✉ Corresponding author.

PMCID: PMC7253413 PMID: 32461613

Abstract

Multi-nucleotide variants (MNVs), defined as two or more nearby variants existing on the same haplotype in an individual, are a clinically and biologically important class of genetic variation. However, existing tools typically do not accurately classify MNVs, and understanding of their mutational origins remains limited. Here, we systematically survey MNVs in 125,748 whole exomes and 15,708 whole genomes from the Genome Aggregation Database (gnomAD). We identify 1,792,248 MNVs across the genome with constituent variants falling within 2 bp distance of one another, including 18,756 variants with a novel combined effect on protein sequence. Finally, we estimate the relative impact of known mutational mechanisms - CpG deamination, replication error by polymerase zeta, and polymerase slippage at repeat junctions - on the generation of MNVs. Our results demonstrate the value of haplotype-aware variant annotation, and refine our understanding of genome-wide mutational mechanisms of MNVs.

Subject terms: Genetic variation, Genomics, Haplotypes

Multi-nucleotide variants (MNVs) are genetic variants in close proximity to each other on the same haplotype whose functional impact is difficult to predict if they reside in the same codon. Here, Wang et al. use the gnomAD dataset to assemble a catalogue of MNVs and estimate their global mutation rate.

Introduction

Multi-nucleotide variants (MNVs) are defined as clusters of two or more nearby variants existing on the same haplotype in an individual^1,2 (Fig. 1a). When variants in an MNV are found within the same codon, the overall impact may differ from the functional consequences of the individual variants³. For instance, the two variants depicted in Fig. 1b are each predicted individually to have missense consequences, but in combination result in a nonsense variant. Such cases, which would be missed by virtually all existing tools for clinical variant annotation, can result both in missed diagnoses and false positive pathogenic candidates in analyses of families affected by genetic diseases^1,2.

MNV identification tools^4–8 have been applied to databases of human genetic variation at varying scales, including 1000 Genomes⁹ Phase 3 (2504 individuals with high coverage exome and low coverage genome-sequencing data), and the Exome Aggregation Consortium¹ (60,706 individuals with high coverage exome data). Together, these analyses identified over 10,000 MNVs altering protein sequences, demonstrating the pervasive nature of MNV annotation in the population-level data. In addition, analysis of the 1000 Genomes data set highlighted differences in the frequencies of MNVs depending on sequence context¹⁰. In combination with yeast experiments^11–13, biological mechanisms that account for the enrichment of specific types of MNVs, such as DNA replication error by polymerase zeta, have been suggested.

Studies of newly occurring (de novo) MNVs have also been performed using trio data sets^2,14–16; analysis of 283 trios with whole-genome sequence data¹⁶ confirmed that MNV events occur much more frequently than expected by random chance. By focusing on noncoding regions, this study also highlighted potentially different mechanisms that dominate MNV generation depending on the genomic region and the distance between the two constitutive variants. As part of the Deciphering Developmental Disorders (DDD) study¹⁷, Kaplanis et al.² analyzed exome-sequence data from over 6000 trios to quantify the pathogenic impact of MNVs in developmental disorders, showing that such variants are substantially more likely to be deleterious than SNVs and further clarifying the mutational mechanisms that generate them. These analyses also have provided estimates of the germline MNV rate per generation, falling into a consistent range of 1–3% of the SNV rate. Although these studies have provided valuable information about the mutational origins and functional impact of MNVs, to date there has been no analysis that investigated MNVs across the entire genome (including noncoding regions) in many thousands of deeply sequenced individuals, limiting our understanding of the genome-wide profile and complete frequency distribution of this class of variation.

Here, we present the analysis of a large-scale collection of MNVs, along with clinical interpretation of MNVs from over 6000 sequenced individuals from rare disease families. We also provide gene-level statistics on MNVs and describe the distribution of MNVs by functional consequence and by gene-level constraint. Finally, to enhance our understanding of MNV mechanisms, we examine the distributions of MNVs stratified by more than ten different functional annotations across the human genome, as well as estimates of the genome-wide per-base frequencies of the dominant mutational processes generating MNVs.

Results

Read-based phasing for identification of MNVs

Identification of MNVs requires the constituent variants to be properly phased—that is, to be identified accurately as either both occurring on the same haplotype (in cis) or on two different haplotypes (in trans). Phasing can be performed following three broad strategies: read-based phasing¹⁸, which assesses whether nearby variants co-segregate on the same reads in DNA sequencing data; family-based phasing¹⁹, which assesses whether pairs of variants are co-inherited within families; and population-based phasing²⁰, which leverages haplotype sharing between members of a large genotyped population to make a statistical inference of phase. Read-based phasing is particularly effective for pairs of nearby variants, making it suitable for the analysis of MNVs.
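The read-based strategy can be illustrated with a minimal sketch. The toy read model (a dictionary from position to observed base) and the function name `phase_from_reads` are hypothetical simplifications introduced here for illustration; this is not the GATK HaplotypeCaller procedure used in this work.

```python
from collections import Counter


def phase_from_reads(reads, site1, site2):
    """Classify two heterozygous sites as 'cis', 'trans' or 'unphased'.

    reads: list of dicts mapping position -> observed base (toy read model).
    site1, site2: (position, alt_base) tuples.
    """
    (pos1, alt1), (pos2, alt2) = site1, site2
    counts = Counter()
    for read in reads:
        # Only reads spanning both positions are informative for phase.
        if pos1 in read and pos2 in read:
            counts[(read[pos1] == alt1, read[pos2] == alt2)] += 1
    cis = counts[(True, True)]                              # both alt alleles on one read
    trans = counts[(True, False)] + counts[(False, True)]   # alt alleles on different reads
    if cis > trans:
        return "cis"
    if trans > cis:
        return "trans"
    return "unphased"  # no spanning evidence, or a tie


# Hypothetical reads around two adjacent sites (positions 100 and 101).
cis_reads = [{100: "T", 101: "G"}, {100: "T", 101: "G"}, {100: "C", 101: "A"}, {100: "T"}]
trans_reads = [{100: "T", 101: "A"}, {100: "C", 101: "G"}]
print(phase_from_reads(cis_reads, (100, "T"), (101, "G")))
print(phase_from_reads(trans_reads, (100, "T"), (101, "G")))
```

Because a single short read must span both sites, this kind of evidence is only available for nearby variants, which is why the approach suits MNVs.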

For this project, we generated read-based phasing results for variants in the Genome Aggregation Database (gnomAD) v2.1 callset using GATK HaplotypeCaller²¹, yielding 125,748 human whole exomes and 15,708 genomes with local phase information; the properties of this callset are described in detail in an accompanying paper²². To assess phasing accuracy, we used 5785 family trios with exome-sequencing data and 635 family trios with whole-genome sequencing data that largely overlapped with the gnomAD 2.1 release data. We calculated the phasing sensitivity, defined as the fraction of heterozygous variant pairs that have read-based phase information assigned for both variants, and found that it was 87.9% for adjacent heterozygous variant pairs, reflecting the stringent haplotype-calling criteria of GATK²¹ (Supplementary Tables 1–3). We used Phase-By-Transmission (PBT)¹⁹, a family-based phasing method (Fig. 1c), to assess our phasing specificity, and found that over 99.8% of the MNVs identified with read-based phasing were consistent with the PBT trio-based phasing. The sensitivity and specificity of our read-based phasing remained high even when the two variants of the MNV were 10 bp apart (82.8% and 99.8%; Supplementary Fig. 1 and Supplementary Table 1). These results demonstrate high specificity and sensitivity for the detection of MNV events across the genome.

Functional impact of MNVs

To provide an overview of the functional impact of MNVs (Fig. 1b), we examined all phased high-quality SNV pairs (i.e., SNV pairs that pass stringent filtering criteria; see the Methods section) within 2 bp distance of each other across the 125,748 exome-sequenced individuals from our gnomAD 2.1 data set, resulting in the discovery of 31,575 MNVs that exist within the same codon. When the two variants comprising the MNV were considered together, the resulting functional impact on the protein differed from the independent impacts of the individual variants in ~60% of cases (18,756 MNVs; Fig. 2a; Supplementary Data 1). Among the differing annotations of functional consequence, 407 were gained nonsense (neither individual SNV was a nonsense mutation, but the resulting MNV is), and 1821 were rescued nonsense (at least one of the two individual SNVs would create a nonsense mutation, but the resulting MNV does not). Such categories of MNVs have a major impact on variant interpretation, and thus are critical for accurate variant annotation. On average, each individual carried 55.2 variants with altered functional interpretation due to MNVs (including 0.062 gained and 4.42 rescued nonsense).
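The gained and rescued nonsense categories can be made concrete with a short sketch that annotates a codon with each SNV alone and with both together. The function names and the reduced category set ("changed" and "unchanged" stand in for the finer missense categories of Fig. 2a) are assumptions made for illustration and do not reproduce the annotation pipeline used in this study.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def apply_snvs(codon, snvs):
    """Apply (offset, alt_base) substitutions to a three-base codon."""
    bases = list(codon)
    for offset, alt in snvs:
        bases[offset] = alt
    return "".join(bases)


def consequence(ref_codon, alt_codon):
    """Coarse consequence of replacing ref_codon with alt_codon."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


def classify_mnv(ref_codon, snv1, snv2):
    """Compare the combined effect of two SNVs with their individual effects."""
    single_codons = [apply_snvs(ref_codon, [s]) for s in (snv1, snv2)]
    single = [consequence(ref_codon, c) for c in single_codons]
    combined_codon = apply_snvs(ref_codon, [snv1, snv2])
    combined = consequence(ref_codon, combined_codon)
    if combined == "nonsense" and "nonsense" not in single:
        return "gained nonsense"
    if combined != "nonsense" and "nonsense" in single:
        return "rescued nonsense"
    # Simplification: 'changed' if the combined amino acid matches neither single-SNV result.
    if CODON_TABLE[combined_codon] not in {CODON_TABLE[c] for c in single_codons}:
        return "changed"
    return "unchanged"


# CAT (His): C>T gives TAT (Tyr), T>A gives CAA (Gln), both together give TAA (stop).
print(classify_mnv("CAT", (0, "T"), (2, "A")))
# CAA (Gln): C>T alone gives TAA (stop), but with A>T the codon becomes TAT (Tyr).
print(classify_mnv("CAA", (0, "T"), (2, "T")))
```

An annotator that evaluates each SNV independently would report two missense variants in the first case and a nonsense variant in the second, both of which are incorrect for the haplotype actually present.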

Fig. 2 — Functional impact of MNVs. a The number of MNVs in the gnomAD exome data set per MNV category. Of the 1821 rescued nonsense mutations, 1538 are rescued in all individuals that harbor the original nonsense mutation and are used for the analysis in (b) and (c). Gained and rescued nonsense MNVs were further filtered to HC pLoF in (b) and (c). b The number of gained/rescued nonsense mutations per gene, and examples of disease-associated genes with two or more gained/rescued nonsense mutations. c The fraction of each category of MNV found in a set of 3941 constrained genes (top two deciles of constraint²²)

To understand the overall impact of correctly annotating the functional consequence of MNVs in a population-level data set, we counted the number of gained/rescued nonsense mutations per gene in gnomAD (Fig. 2b; Supplementary Data 1). For rescued nonsense mutations, we found 1538 sites that are rescued in all the individuals with the component variants. A total of 1633 genes carried gained or rescued nonsense mutations within our data set, including 41 genes that are disease-relevant (reported by OMIM²³ or annotated as haploinsufficient by ClinGen^24,25). In addition, the proportion of rescued nonsense mutations falling in predicted loss-of-function (pLoF) constrained genes (genes with a significant depletion of pLoFs compared with an expectation based on a mutational model^1,26, defined as LOEUF²² decile <20%) was higher (proportion = 0.219) than for all the other classes of MNVs (proportion = 0.192; Fisher’s exact test, p = 0.0247; Fig. 2c; Supplementary Fig. 2). Conversely, gained nonsense mutations are depleted among constrained genes (proportion = 0.0620) compared with all other classes of MNVs (Fisher’s exact test, p = 1.01 × 10⁻¹¹). These results suggest a significant enrichment of LoF annotation errors in the absence of MNV annotation.

In addition, we have investigated another class of variant pairs whose combined interpretation can be highly different from either of the individual component variants: insertion/deletion (indel) pairs that result in frame restoration (e.g., 4 bp deletion + 7 bp insertion, resulting in 3 bp = 1 amino acid insertion), and have annotated such frame-restoring indel pairs (n = 1406) when separated by up to 30 bp (considering the limitations of read-based phasing; Supplementary Fig. 3). When we compared the LoF confidence of constituent indels, we found that the proportion of frame-restoring indel pairs falling on LoF-constrained genes was significantly higher when the constituent indels are high-confidence (HC) LoFs (proportion = 0.0262 for low-confidence, LC, and 0.167 for HC pairs; Fisher’s exact test, p = 1.66 × 10⁻⁷; Supplementary Fig. 3h), suggesting that frame-restoring indel pairs can also be a source of LoF annotation errors.
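The frame-restoration criterion reduces to simple arithmetic on allele lengths, as the following sketch shows. The VCF-style tuple representation, the function name, and the use of the difference in start positions as the distance are assumptions made for illustration; only the 30 bp limit and the 4 bp deletion + 7 bp insertion example come from the text.

```python
def is_frame_restoring(indel1, indel2, max_distance=30):
    """Return True if two frameshifting indels jointly preserve the reading frame.

    Each indel is a (position, ref_allele, alt_allele) tuple in VCF-style notation.
    """
    def net_change(indel):
        _, ref, alt = indel
        return len(alt) - len(ref)  # positive for insertions, negative for deletions

    d1, d2 = net_change(indel1), net_change(indel2)
    each_is_frameshift = d1 % 3 != 0 and d2 % 3 != 0
    frame_restored = (d1 + d2) % 3 == 0
    close_enough = abs(indel1[0] - indel2[0]) <= max_distance
    return each_is_frameshift and frame_restored and close_enough


# Hypothetical alleles: a 4 bp deletion and a 7 bp insertion 10 bp apart (net +3 bp).
deletion = (100, "ACGTA", "A")
insertion = (110, "G", "GACGTACG")
print(is_frame_restoring(deletion, insertion))
```

Annotated separately, both indels in this example would be called frameshifts, whereas together they add a single amino acid.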

Finally, in order to understand the impact of these variants in clinical applications, we also annotated MNVs in 6072 sequenced individuals from rare disease families, including 4275 case samples. This resulted in 16 gained nonsense mutations and 110 changed missense MNVs with high CADD²⁷ scores and low frequencies in gnomAD (CADD >20 and <10 individuals in gnomAD; Supplementary Data 2). However, after close manual curation, none of the corresponding MNVs were definitively causal variants for the diseases affecting the family, suggesting that MNVs contribute to only a small fraction of total rare disease diagnoses, in line with expectations based on their relative rarity and previous results².

Genome-wide mutational mechanisms of MNVs

We next turned our attention to understanding the mutational mechanisms underlying the origins of MNVs genome-wide, focusing on whole-genome sequence data from 15,708 individuals in the gnomAD v2.1 callset. We considered pairs of high-quality variants in autosomes separated by up to 10 bp, resulting in the assembly of a catalogue of 5,513,219 MNVs including 1,792,248 MNVs within 2 bp distance—an order-of-magnitude increase in size over previous collections.

We considered three established major categories of mutational origins of MNVs with constituent SNVs falling next to each other (adjacent MNVs; Fig. 3a), each of which is biased toward certain MNV patterns: (1) combinations of distinct single-nucleotide mutation events; (2) replication errors by error-prone polymerase zeta; and (3) polymerase slippage events at repeat junctions. MNVs in the first category are a product of two or more SNVs, which typically occur in different generations and may thus have different allele frequencies. We expect to see an enrichment of CpG transition compared with non-CpG transversion for this class, due to the underlying difference of SNV mutation rate^28–30. The second category, replication error introduced by DNA polymerase zeta (pol-zeta), is a well-known class of replication error that introduces MNVs. Previous studies^10–13,31 have shown that pol-zeta is prone to specific types of replication error, mainly TC→AA, GC→AA, and their reverse complements, with experimental evidence that these MNV patterns occur in a single generation; thus, the constituent SNVs will typically have the same allele frequencies. The third category, replication slippage, is another known mode of DNA replication error^32–34. This process is especially frequent at sites with repetitive sequence context; previous studies^35–37 have shown that the indel rate can be up to 10⁶ times higher than the SNV mutation rate at these sites. As shown in Fig. 3a, the combination of an insertion and then a deletion of two base pairs can result in an MNV.
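Because a substitution pattern and its reverse complement describe the same event on opposite strands, the TC→AA pattern named above is the same as the GA→TT pattern discussed in the following paragraphs. The sketch below collapses each pattern with its reverse complement; choosing the lexicographically smaller form as the representative is an arbitrary convention adopted here, not necessarily the one used for the figures.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_pattern(ref, alt):
    """Collapse a (ref, alt) pattern with its reverse-complement equivalent."""
    forward = (ref, alt)
    reverse = (reverse_complement(ref), reverse_complement(alt))
    return min(forward, reverse)  # arbitrary but deterministic representative


print(canonical_pattern("TC", "AA"))  # same event as GA -> TT on the other strand
print(canonical_pattern("GA", "TT"))
```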

We observed the signature of each of these MNV mechanisms in our data set. First, we calculated the number of MNVs for each MNV pattern (Fig. 3b) and observed that the most frequent MNV pattern is CA→TG substitutions, which are likely to occur as a combination of an A→G transition, followed by a high mutation rate C→T CpG transition (Supplementary Fig. 4a). On the other hand, the least frequent MNV pattern is TA→GC substitutions, which occur as a combination of two non-CpG transversions. The 273.4-fold difference (270,071 versus 988) of the frequency of MNVs between these two patterns is comparable with the theoretical ratio calculated based on the mutation rate of the component SNVs (475.6-fold), and the overall correlation between the theoretical and observed frequency of each MNV pattern was strong (Pearson correlation r = 0.839 with p = 9.15 × 10⁻²² in log space; Supplementary Fig. 4b–e).

To investigate the extent of pol-zeta signature, we calculated the number of MNVs in which the gnomAD allele counts of the constitutive single-nucleotide variants are equal (following previous methodology², also described in the Methods section), and observed that these one-step MNVs are significantly enriched in MNV patterns matching the pol-zeta signature (90.5% for GA→TT, and 80.5% for GC→AA, compared with 39.9% overall; Fisher’s exact test, p < 10⁻¹⁰⁰; Fig. 3c).
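The one-step criterion described here (equal allele counts for the two constituent SNVs) can be sketched as follows. The record layout and the toy allele counts are hypothetical; the full procedure is the one given in the Methods section.

```python
from collections import defaultdict


def one_step_fraction(mnvs):
    """Fraction of one-step MNVs per substitution pattern.

    mnvs: iterable of (pattern, allele_count_snv1, allele_count_snv2) tuples.
    An MNV is treated as one-step when both constituent SNVs have equal allele counts.
    """
    total = defaultdict(int)
    one_step = defaultdict(int)
    for pattern, ac1, ac2 in mnvs:
        total[pattern] += 1
        if ac1 == ac2:
            one_step[pattern] += 1
    return {pattern: one_step[pattern] / total[pattern] for pattern in total}


# Toy records with made-up allele counts, for illustration only.
toy_mnvs = [
    ("GA>TT", 5, 5),
    ("GA>TT", 12, 12),
    ("CA>TG", 3, 40),
    ("CA>TG", 7, 7),
]
print(one_step_fraction(toy_mnvs))
```

Equal allele counts are what one expects when both substitutions arose in the same mutational event and have been inherited together since.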

Finally, in order to capture polymerase slippage events, we calculated the fraction of MNVs in repetitive contexts per MNV pattern (Fig. 3d). For the MNV pattern AA→TT, >30% of all the MNVs observed were in repetitive contexts. The fractions of the MNV patterns AT→TA and TA→AT in repetitive contexts were also high, exceeding 10% (Fisher’s exact test, p < 10⁻¹⁰⁰ compared with the 3.15% across all patterns). For all MNV patterns in repeat contexts, we see a significant excess of MNVs compared with the expected number based on a model that assumes MNVs are a simple combination of two SNV events (Supplementary Fig. 4). These observations support the role of replication slippage as one of the major drivers of MNVs. In addition, we did not see a correlation between the frequency of one-step MNVs and the frequency of MNVs in repetitive contexts (Pearson correlation r = 0.0561, p > 0.05; the fraction of one-step MNVs exceeded 80% for AT→TA and TA→AT, but was 46% for AA→TT), suggesting that multiple slippage events leading to MNV generation can take place either as a single event (i.e., in a single generation) or multiple events (i.e., in different generations), or even recurrently. These findings come with the caveat that variants in repetitive regions will have higher error rates due to slippage and misalignment errors, but we have reduced this risk by applying random forest filtering for individual sites, as well as removing all the variants in low-complexity regions from our analysis (see the Methods section).

Estimation of global mutation rate of MNVs

In order to compare the frequency of three different mechanisms, we quantified the contribution of two single-nucleotide variation events vs other replication error modes, such as pol-zeta errors or replication slippage, using a simple probabilistic model. Specifically, focusing on adjacent MNVs, we assigned the MNV frequency for each MNV pattern to be the sum of the probability of two SNV events (P) and the probability of other replication error factors (Q), and estimated the Q term. In other words, we estimated the divergence of the observed number of MNV sites from the number expected by a simple SNV mutation model (see the Methods section). The resulting estimated proportion of two SNV events and other replication error events is described in Fig. 4a.

Fig. 4. Distribution of MNVs across the genome. a The number and the fraction of MNVs per origin, per substitution pattern. Gray is the estimated fraction of MNVs originating from two single-nucleotide substitution events, brown is polymerase slippage at repeat contexts, and purple is the others (presumably mainly replication error by pol-zeta). The colors along the bottom represent the estimated biological origins that dominate MNVs of that specific substitution pattern. b, c MNV density, defined as the number of MNVs per functional annotation divided by the base pair length in the annotation (relative to the whole-genome region), ordered by the methylation level of the functional category. d Estimated fraction of MNVs by different origins, per functional category around the coding region.

As expected, the proportion differs substantially from one MNV pattern to another. For example, while 98.0% of CA->TG MNVs appear to be caused by combinations of simple SNV events, the corresponding proportion is 5.84% for GA->TT, 18.9% for GC->AA, and 9.52% for AA->TT MNVs. We presume that the lower proportion of two simple SNV events is mainly due to pol-zeta errors for GA->TT and GC->AA, and polymerase slippage for AA->TT. Since 83.2% of the overall MNVs were classified as either SNV combination, repeat context, or pol-zeta error at GA->TT or GC->AA, our analysis suggests that these three major categories explain a substantial fraction of MNV events genome-wide, although some additional mechanisms with smaller frequencies might exist. These calculations also allow us to estimate the genome-wide mutation rate of MNVs caused by pol-zeta: 1.59 × 10⁻¹⁰ per 2 bp per generation for GA->TT, and 4.08 × 10⁻¹⁰ for GC->AA. Given that there are ~1.66 × 10⁸ GA pairs and 1.20 × 10⁸ GC pairs in the reference human genome, we estimate there are on average 0.026 GA->TT and 0.049 GC->AA mutations per generation (Supplementary Data 3).
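The per-generation counts quoted above follow from multiplying the per-site rate by the number of reference dinucleotides. The short sketch below repeats that arithmetic using only the figures stated in this paragraph; the dictionary and function names are illustrative and do not come from the study code.

```python
# Per-site pol-zeta MNV rates (per 2 bp per generation) and the number of
# reference dinucleotide sites, both taken from the paragraph above.
RATE_PER_SITE = {"GA->TT": 1.59e-10, "GC->AA": 4.08e-10}
REFERENCE_SITES = {"GA->TT": 1.66e8, "GC->AA": 1.20e8}


def expected_per_generation(pattern: str) -> float:
    """Expected number of new MNVs of this pattern per generation."""
    return RATE_PER_SITE[pattern] * REFERENCE_SITES[pattern]


for pattern in RATE_PER_SITE:
    print(f"{pattern}: {expected_per_generation(pattern):.3f}")
```

Rounded to three decimals, the products are 0.026 for GA->TT and 0.049 for GC->AA, matching the values in the text.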

We also explored the potential mutational mechanisms for MNVs with a greater distance between the component variants (Supplementary Figs. 5–7), and observed signatures of non-independence of mutation events extending over distances up to 10 bp, with an enrichment of motifs consistent with pol-zeta and polymerase slippage mechanisms for adjacent MNVs (minimum 1.08, maximum 4.06-fold enrichment of one-step MNV, Fisher’s exact test, p-value < 0.05; Supplementary Figs. 8,9). This confirms the presence of mutational mechanisms capable of creating simultaneous mutations separated by considerable distances^{16,29,38–40}, although further work will be required to fully characterize the underlying processes.

Overall, our analysis of MNVs in 15,708 whole-genome-sequenced individuals supports the three previously suggested major mechanisms of MNVs and quantifies the different contribution of each mechanism for different MNV patterns at the genome-wide scale.

MNV distribution across different genomic regions

We next examined how MNV pattern distributions differ between functional annotation categories. We used 13 different functional annotations such as coding sequence, enhancer, and promoter from Finucane et al.⁴¹, and the DNA methylation annotation from the Encyclopedia of DNA Elements (ENCODE)⁴², to calculate the number of MNVs that fall into each category (Supplementary Table 4). MNV density, defined as the number of MNVs observed in each functional category divided by the total length of the genomic interval belonging to each category, is shown in Fig. 4b and c. We found that MNV density of the substitution patterns typically involving CpG transitions is positively correlated with the methylation level (linear regression Pearson correlation r = 0.95 for CG->TA and r = 0.87 for CA->TG, p < 10⁻³). Conversely, MNV density for non-CpG transversion-related substitution patterns, and the substitution patterns related to pol-zeta slippage, negatively correlates with methylation status (linear regression Pearson correlation r = −0.90 for GA->TC, r = −0.91 for AG->CC, r = −0.91 for GA->TT, and r = −0.92 for GC->AA, p < 10⁻⁵; Fig. 4b, c).

Finally, we explored the effect of genic context on MNV origins and discovery: we selected the seven major regional annotations around gene-coding sequences^43,44, and calculated the fraction of MNVs likely explained by different mutational origins in each of these regions (Fig. 4d). Across all regions, we found that the MNV signal is primarily dominated by CpG transitions. The fraction of non-CpG transversions and polymerase slippage at repeats were consistently lower than (or nearly equal to) 5% of the overall signal. Pol-zeta signature was not as dominant as CpG transitions, except for at the transcription start site region, which has by far the lowest methylation rate in those seven annotations, and is thus expected to have a lower rate of CpG deamination mutations (which are dependent on the methylation of the original cytosine).

Overall, our results suggest that MNV density is highly dependent on the CpG methylation status of the surrounding sequence, and that MNVs that originate from non-CpG transversions or polymerase slippage at repeat junctions are relatively uncommon compared with those driven by CpG transitions or pol-zeta errors. Finally, MNVs that originate from pol-zeta error are the most common class of MNVs in the region close to the transcription start sites of genes, as low methylation levels in these regions result in low levels of CpG transitions.

Discussion

We analyzed 125,748 human exomes and 15,708 genomes and identified 1,792,248 MNVs across the genome with constituent variants falling within a 2 bp distance, including 31,575 that exist within a codon. We have shown that MNVs represent an important class of genetic variation, and that they have a significant impact on the functional interpretation of genomic data, both at the population and individual level. Although we did not encounter an individual in which an MNV is the likely cause of a rare disease after sequencing 6072 individuals from rare disease families, we expect that applying our pipeline to larger numbers of disease samples will identify previously missed diagnoses, as has been observed in another study of developmental delay cases².

The large number and high quality of variant calls in the gnomAD database provided increased power for statistical analysis of the three major mutational mechanisms (combinations of independent SNVs; replication errors by pol-zeta; and polymerase slippage at repeat junctions) responsible for the generation of MNVs, and importantly allowed us to estimate the relative contribution of each of these processes.

Our estimates of substitution pattern-specific MNV mutation rate and fraction come with important caveats. Our approach assumes that the local SNV mutation rate is invariant across instances of a specific 3 bp context; however, prior work has shown considerable regional variation in mutation rate across the genome, as well as variation driven by ancestry, environment, and other factors^45–48. Another important limitation is the lack of confident estimates of insertion and deletion rate as a function of repeat length, which limits the confidence of our estimate of the fraction of polymerase slippage. Future large genome-scale data sets with more accurate insertion and deletion calls, likely involving long-read sequencing data, will be required to improve modeling of insertion and deletion mutations.

One clear feature of our data set was the signature of non-independence of mutational events separated by up to 10 bp, as suggested in various de novo studies^{16,29,38–40}; further investigation of these clustered mutations, and contextualizing them with known sources of genomic instability, such as homologous recombination⁴⁹ or transposable elements^50,51, will be informative in exploring the mechanisms of clustered mutations.

The complete list of MNVs identified in gnomAD is publicly available (https://gnomad.broadinstitute.org/downloads), with the allele count annotated for both genome and exome. For the coding regions, we have also annotated the functional consequence of constituent SNVs and MNVs separately, and made the result viewable in an intuitive browser (https://gnomad.broadinstitute.org). Although some fraction of MNVs is missing from this list due to incomplete phasing sensitivity and read coverage, the database provides the most comprehensive set of estimates of MNV allele frequencies to date, valuable for further analysis of mutational mechanisms as well as the interpretation of MNVs in rare disease and cancer genomics^52,53.

Finally, despite the large sample size of our MNV data set, the fraction of MNVs that we have observed out of all the possible MNV configurations is still very far from saturating the space of possible MNVs, with only ~0.005% of all possible adjacent MNVs observed in our data (Supplementary Figs. 10, 11). Increasing the number of sequenced individuals⁵⁴ in both disease and non-disease cohorts will permit the discovery and determination of the phenotypic impact of an increasingly comprehensive catalogue of variation. This study confirms the importance of incorporating haplotypic phase into these efforts to permit the discovery and accurate interpretation of the full range of human variation.

Methods

Ethics

We have complied with all relevant ethical regulations. This study was overseen by the Broad Institute’s Office of Research Subject Protection and the Partners Human Research Committee, and was given a determination of Not Human Subjects Research. Informed consent was obtained from all participants.

MNV calling

125,748 human exomes and 15,708 genomes from the gnomAD 2.1 callset were used for the analyses (Supplementary Tables 5,6). We used hail (https://github.com/hail-is/hail), an open source, cloud-based scalable analysis tool for large genomic data. For MNV discovery, we exhaustively looked for variants that appear in the same individual, in cis, and within 2 bp distance for the exome data set and 10 bp distance for the genome data set, using the hail window_by_locus function. That is, we computationally checked every pair of genotypes within a certain window size, for every individual, to see whether the individual carries one or more pairs of mutations on the same haplotype (see Supplementary Methods for further detail). We did not expand the window size beyond 10 bp for MNV discovery, as phasing sensitivity drops significantly when the distance between variants is >10 bp, as shown in Supplementary Fig. 1d. For trio-based analyses, we expanded the range to 100 bp to obtain a more macroscopic view. Although we performed MNV calling in sex chromosomes for the coding region, we restricted our analysis to autosomes, in order to control for differences in zygosity.

MNV calling in rare disease samples was performed in a similar fashion as in the gnomAD exome data set. In total, 6072 rare disease whole-exome sequences were curated at the Broad Center for Mendelian Genomics (CMG)⁵⁵ and went through the MNV calling pipeline with the window size of 2 bp distance. The phenotypes observed in the cohort include: muscle disease such as Limb Girdle Muscular Dystrophy (LGMD; roughly one-third of the total), neurodevelopmental disorders, or severe phenotypes in eye, kidney, cardiac, or other orphan diseases (Supplementary Data 2).

MNV filtering

In the gnomAD MNV analysis, variant pairs for which one or both of their components have low-quality reads were filtered out. Specifically, we only selected the variant sites that pass the Random Forest filtering, resulting in acceptance of 53.3% of the initial MNV candidates (Supplementary Fig. 12a). We also filtered out variant sites that fall in low-complexity regions (LCRs) identified with the symmetric DUST algorithm⁵⁶ at a score threshold of 30, and additionally applied adjusted threshold criteria (GQ ≥ 20, DP ≥ 10, and allele balance > 0.2 for heterozygous genotypes) for filtering individual variants (Supplementary Table 7). For each MNV site, we annotated the number of alleles that appear as MNV, as well as the number of individuals carrying the MNV as a homozygous variant. The distribution of MNV sites that contain homozygous MNVs is shown in Supplementary Fig. 13. We also collapsed the MNV patterns that are reverse complements of each other, after observing that the numbers of MNVs are roughly symmetric (before collapsing, the ratio of each MNV pattern to its corresponding reverse complement pattern was mostly close to 1, with 0.95 being the lowest and 1.10 being the highest for adjacent MNVs) (Supplementary Fig. 14). All the MNV patterns in the main text and figures are equivalent to their reverse complement, and we do not distinguish them.
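The reverse-complement collapsing can be made concrete by enumerating all adjacent substitution patterns. The sketch below is illustrative only: the function names are hypothetical, and picking the lexicographically smaller member of each pair as the representative is an arbitrary choice that need not match the labels used in the figures.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def canonical_pattern(ref: str, alt: str) -> tuple:
    """Pick one representative for a pattern and its reverse complement."""
    rc = (reverse_complement(ref), reverse_complement(alt))
    return min((ref, alt), rc)


def adjacent_patterns() -> list:
    """All dinucleotide substitutions in which both bases change."""
    dinucs = ["".join(p) for p in product("ACGT", repeat=2)]
    return [(r, a) for r in dinucs for a in dinucs
            if r[0] != a[0] and r[1] != a[1]]


all_patterns = adjacent_patterns()
collapsed = {canonical_pattern(r, a) for r, a in all_patterns}
print(len(all_patterns), len(collapsed))
```

Of the 144 strand-specific adjacent patterns, 12 are their own reverse complement, so collapsing leaves (144 − 12)/2 + 12 = 78 patterns, the number used for adjacent MNVs elsewhere in the Methods.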

For the rare disease cohort, since our motivation was to find a definite example where an MNV is acting as a causal variant for a rare disease with severe phenotype rather than obtaining the population-level statistics, we did not apply site and sample-specific filtering, as opposed to the gnomAD MNV analysis. Instead of being computationally filtered by read quality, the 129 putative MNVs (16 gained nonsense mutations, 110 changed missense with high CADD score and low gnomAD MNV frequency, and 3 gained missense) went through manual inspection by the analysts at the Center for Mendelian Genomics (CMG) at the Broad Institute⁵⁵, after annotating the affected gene. Specifically, all the variants were checked manually under the criteria below:

- Whether the gene affected is constrained in the gnomAD population.

- Whether the case has already been solved with another causal variant.

- Whether the MNV looks real in the Integrative Genomics Viewer (IGV).⁵⁷

- Whether the MNV is in the proband and, if applicable, the segregation pattern of the MNV.

- Whether the known function of the gene affected matches the patient phenotype.

MNVs were filtered out if they failed one or more of the criteria above. These results suggest that MNVs explain only a small fraction of undiagnosed genetic disease cases, consistent with their overall frequency as a class of variation, and with prior work in large disease-affected cohorts². The summary of the MNV analysis in the rare disease cohort is also available in Supplementary Data 2.

Analysis of phasing sensitivity

In order to compare the phasing information derived from different methods (read-based and trio-based), we compared the relative phase (a binary classification of whether the two SNVs of an MNV are on the same haplotype or not), as shown in Supplementary Table 8. We investigated the heterozygous variant pairs whose phasing information is not provided by trio-based phasing and observed that the majority (83.5%) of the cases reflected both parents carrying a heterozygous variant, a scenario where trio-based phasing is inherently uninformative. We also investigated the heterozygous variant pairs whose phasing information is not provided by read-based phasing. Specifically, unphased pairs tend to have either low or high read depth (odds ratio = 3.20, Fisher’s exact test, p < 10⁻¹⁰⁰ for low, and odds ratio = 2.33, Fisher’s exact test, p < 10⁻¹⁰⁰ for high read depth; Supplementary Table 3), consistent with our previous understanding that an excess of reads can lead to the involvement of erroneous reads and thus reduce the phasing confidence of HaplotypeCaller⁵⁸, while a lack of reads reduces the calling rate. All the statistical tests are two-sided, throughout the paper.

Analysis of functional impact in coding region

We focused on the coding region of the canonical transcript of genes and examined the codon changes and their consequences for all the MNVs that fall in a single codon (see Supplementary Tables 9,10 for the number of MNVs that span two codons). When comparing with population-level constraint, for each MNV, we annotated the constraint metric (LOEUF²²) of the gene whose protein product is affected. For rescued nonsense mutations, we took only the ones that are rescued in all the individuals with the component variants (i.e., we excluded the ones for which the allele count of the MNV is not equal to the allele count of the SNV that introduces a nonsense mutation), resulting in 1538 out of 1821 rescued nonsense mutations. We next used the Loss-Of-Function Transcript Effect Estimator (LOFTEE²²) in order to exclude the nonsense mutations that are not likely to affect protein function. This resulted in 371 high-confidence (HC) gained nonsense mutations and 1400 HC rescued nonsense mutations, which were used for the population-level constraint analysis. In addition, we stratified the gene sets by core essential/nonessential genes from CRISPR/Cas knockout experiments^59,60 as an orthogonal indicator of gene constraint (Supplementary Fig. 2).
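To illustrate why the two SNVs of a within-codon MNV must be interpreted jointly, the sketch below translates a codon after each SNV alone and after both together, using the standard genetic code. It is a simplified illustration rather than the annotation pipeline used here: `mnv_nonsense_effect` is a hypothetical helper, it assumes the reference codon is not itself a stop codon, and it distinguishes only gained and rescued nonsense from everything else.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, indexed in TCAG order; "*" marks a stop codon.
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def apply_snv(codon: str, offset: int, alt: str) -> str:
    """Replace the base at a 0-based offset within the codon."""
    return codon[:offset] + alt + codon[offset + 1:]


def mnv_nonsense_effect(ref_codon: str, snv1: tuple, snv2: tuple) -> str:
    """Classify an MNV given as two (offset, alt_base) SNVs in one codon."""
    single1 = CODON_TABLE[apply_snv(ref_codon, *snv1)]
    single2 = CODON_TABLE[apply_snv(ref_codon, *snv2)]
    combined = CODON_TABLE[apply_snv(apply_snv(ref_codon, *snv1), *snv2)]
    any_single_stop = "*" in (single1, single2)
    if combined == "*" and not any_single_stop:
        return "gained nonsense"    # stop codon appears only in combination
    if combined != "*" and any_single_stop:
        return "rescued nonsense"   # one SNV alone is nonsense, the MNV is not
    return "other"


print(mnv_nonsense_effect("TCC", (1, "A"), (2, "A")))  # TCC -> TAA
print(mnv_nonsense_effect("TGG", (1, "C"), (2, "A")))  # TGG -> TCA
```

In the first example neither TAC (Tyr) nor TCA (Ser) is a stop codon but TAA is, so the MNV gains a nonsense effect; in the second, TGA alone is a stop codon but the combined codon TCA encodes Ser, so the nonsense effect is rescued.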

We did not include or correct for MNVs consisting of three SNVs in a single codon in the analysis of functional impact in the coding region, since the number and frequency of such MNVs are very low (228 in total, with 5 newly gained nonsense, but no re-rescued or re-gained nonsense; 0.220 in total per person). The full list of such MNVs is available as a separate file at: https://gnomad.broadinstitute.org/downloads.

Frame-restoring indel analysis was performed in a similar fashion. We used the gnomAD exome data set to call and filter the insertion/deletion pairs using the same filtering criteria (except for the fact that we did not restrict our analysis to cases where the frameshift effect would be rescued in all individuals), and focused on the canonical transcripts for the functional impact evaluation.

Defining one-step MNVs and MNVs in repetitive contexts

A one-step MNV was defined as a MNV for which the allele count of both SNVs that make up the MNV is the same and close to the allele count of the MNV itself. We also compared the allele count of constituent SNVs (AC1 and AC2) with the allele count of the corresponding MNV (AC_mnv), and observed that the majority of one-step MNVs we discovered have AC_mnv divided by AC1 >0.9 (Supplementary Fig. 15). Therefore, we expect the false discovery rate of one-step MNVs (misclassifying the MNV whose AC1 and AC2 are equal just by chance) to be limited. The full distribution of all the allele counts, including per-population characterizations, are shown in Supplementary Fig. 16 and Supplementary Table 11.
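The definition above reduces to a comparison of allele counts. The following sketch shows one way to express it; the function names are illustrative and not taken from the study code, and the equality test is the only criterion applied.

```python
def is_one_step(ac1: int, ac2: int) -> bool:
    """One-step candidate: both constituent SNVs have the same allele count."""
    return ac1 == ac2


def mnv_to_snv_ratio(ac_mnv: int, ac1: int) -> float:
    """AC_mnv divided by AC1; mostly above 0.9 for the one-step MNVs described above."""
    return ac_mnv / ac1


def one_step_fraction(allele_counts: list) -> float:
    """Fraction of (AC1, AC2) pairs classified as one-step."""
    if not allele_counts:
        return 0.0
    return sum(is_one_step(a, b) for a, b in allele_counts) / len(allele_counts)


# Hypothetical (AC1, AC2) pairs, for illustration only.
print(one_step_fraction([(3, 3), (10, 250), (1, 1), (7, 8)]))
```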

Repetitive sequences are defined by taking the ±4 bp context of the MNV and setting the threshold manually, by looking at the distribution of repeat contexts around all the MNVs (Supplementary Figs. 17, 18). Specifically, a sequence is defined as repetitive if the number of dinucleotide repeat units > 1, for both reference and alternative ±4 bp context, and the number of dinucleotide repeat units > 2, for either reference or alternative ±4 bp context, and, for adjacent MNVs only, if the reference and/or alternative 2 bp are mononucleotide repeat, increase the threshold by one mononucleotide repeat unit.

Here, a dinucleotide repeat unit is defined as the reference or the alternative allele itself (with the gap when d > 1, and counting overlaps); for example, the reference and alternative dinucleotide repeat counts for TATATAT->TAAAAAT are both 3. The third criterion was added specifically for adjacent MNVs to adjust for counting the overlap more than once. This threshold was set so that the number of MNVs with equal or higher repeats would be <5% of the total, corresponding to two standard deviations away from the mean, and also because the estimated mutation rate in these repetitive contexts is likely to be orders of magnitude higher than the background MNV mutation rate originating from the combination of two SNV events^35–37.

Calculating the proportion of MNVs per biological origin

We calculated the proportion of MNVs per biological origin by comparing the observed number of MNVs (that are not in repetitive contexts) with the expected number of MNVs under a single-nucleotide mutational model.

Specifically, if we hypothesize that most MNVs are combinations of two single-nucleotide substitution events, we can estimate the relative probability of an MNV event per substitution pattern. For example, the probability of observing a CA to TG MNV in a single individual at a single site, $p (CA \to TG)$ , is proportional to $p (CA \to TA) \cdot p (TA \to TG) + p (CA \to CG) \cdot p (CG \to TG)$ , and the probability of a TA to GC MNV, $p (TA \to GC)$ , is proportional to $p (TA \to GA) \cdot p (GA \to GC) + p (TA \to TC) \cdot p (TC \to GC)$ . The former expression involves a transition at CpG, while both terms of the latter are products of transversions at non-CpG sites, which is a reasonable explanation for the frequency difference between those two MNV patterns.

Using the same principle (and accounting for reference base pair frequency, population number, and the global SNV mutation rate defined by 3 bp context²⁶), we first constructed a null model of MNV distribution. In reality, this null model does not represent the real distribution we observe, due to biological mechanisms that introduce MNVs. Therefore, we allowed an additional factor q, which denotes the mutational event where two SNVs are introduced at the same time. For the example of $p (CA \to TG)$ , we model this probability to be proportional to $p (CA \to TA) \cdot p (TA \to TG) + p (CA \to CG) \cdot p (CG \to TG) + q (CA \to TG)$ , and try to estimate the q term, which corresponds to the proportion of MNVs that are explained by non-SNV (and non-repeat) factors. Further details are explained in the Supplementary Methods (section “Models and assumptions for calculating the proportion of MNV per biological mechanism”).
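The two-path structure of the null model can be sketched as follows. This is a deliberately simplified illustration: it ignores the 3 bp context, reference base pair frequency, and population-size terms that the actual model accounts for, the step probabilities are made-up numbers, and `excess_fraction` is a hypothetical stand-in for the estimation of q described in the Supplementary Methods.

```python
def two_snv_paths(ref: str, alt: str) -> list:
    """The two orders in which two SNVs can turn ref into alt (adjacent MNV)."""
    first_base_changed = alt[0] + ref[1]
    second_base_changed = ref[0] + alt[1]
    return [(ref, first_base_changed, alt), (ref, second_base_changed, alt)]


def two_snv_probability(ref: str, alt: str, step_prob: dict) -> float:
    """Sum over both paths of the product of the two single-step probabilities."""
    return sum(step_prob[(start, mid)] * step_prob[(mid, end)]
               for start, mid, end in two_snv_paths(ref, alt))


def excess_fraction(observed: float, expected: float) -> float:
    """Share of observed MNVs not explained by the two-SNV expectation."""
    if observed <= 0:
        return 0.0
    return max(0.0, 1.0 - expected / observed)


# Hypothetical relative step probabilities, for illustration only.
step_prob = {("CA", "TA"): 0.1, ("TA", "TG"): 0.2,
             ("CA", "CG"): 0.3, ("CG", "TG"): 0.4}
print(two_snv_paths("CA", "TG"))
print(two_snv_probability("CA", "TG", step_prob))
```

For CA to TG the two paths pass through TA and CG, matching the two product terms in the expression above.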

In addition, we annotated the predicted major mechanism for each MNV pattern in the following order:

1. Pol-zeta, for the patterns known as the pol-zeta signature (GA->TT and GC->AA)

2. Repeat, for the patterns whose fraction of MNVs in repeat contexts is >10% (corresponding to two standard deviations away from the mean; AA->TT, AT->TA, and TA->AT)

3. One of Ti at CpG, Ti, Ti at CpG + Tv, Ti + Tv, or Tv combination, based on the possible combinations of single-nucleotide mutational processes. For example, Ti at CpG applies when a transition at CpG combined with another transition can produce the pattern (Supplementary Data 3).

Estimation of the global MNV rate per substitution pattern

In order to estimate the global MNV mutation rate for adjacent MNVs, as well as the mutation rate per MNV pattern, we first focused on the number of one-step MNVs, assuming that there are no recurrent mutations and therefore that the allele frequencies of the constituent SNVs are equal if and only if the MNV originates from an MNV event in a single generation. In this section, we will simply write one-step MNV of distance 1 bp (i.e., adjacent) as MNV.

We then calculated the global MNV mutation rate under the Watterson estimator model, as in Kaplanis et al.². Specifically, we divided the number of MNV sites by the number of SNV sites in our gnomAD data set, and scaled by the global single-nucleotide mutation rate identified in previous research (1.2 × 10⁻⁸), which yielded 2.94 × 10⁻¹¹ per 2 bp per generation. This is roughly two-thirds of the estimate provided by Kaplanis et al.² using trio data, slightly smaller presumably due to differing filtering methods. Next, in order to get the mutation rate per 2 bp for each of the MNV patterns, we simply scaled the global MNV mutation rate described above by the number of reference 2 bp and the coverage difference. The full data for all the 78 patterns are shown in Supplementary Data 3. Further details are explained in the Supplementary Methods (section “Models and assumptions for estimation of the global MNV rate per substitution pattern”).
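The scaling step can be written as a one-line function. The sketch below is illustrative: the site counts are hypothetical values chosen only so that the ratio reproduces the quoted rate, and the coverage and per-pattern corrections described in the Supplementary Methods are not modeled.

```python
SNV_MUTATION_RATE = 1.2e-8  # per bp per generation, as quoted in the text


def global_mnv_rate(n_mnv_sites: int, n_snv_sites: int,
                    snv_rate: float = SNV_MUTATION_RATE) -> float:
    """Scale the SNV mutation rate by the ratio of MNV sites to SNV sites."""
    return n_mnv_sites / n_snv_sites * snv_rate


# Hypothetical counts with an MNV:SNV site ratio of 0.00245.
print(f"{global_mnv_rate(245, 100_000):.3g}")
```

Read in reverse, the quoted rate of 2.94 × 10⁻¹¹ corresponds to a ratio of 2.94 × 10⁻¹¹ / 1.2 × 10⁻⁸ ≈ 0.00245 one-step adjacent MNV sites per SNV site.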

Functional enrichment

Thirteen functional annotations were collected from Finucane et al.⁴¹ as a bed file (which originates from databases such as ENCODE, Roadmap⁶¹, and the UCSC genome browser⁶²). For the methylation data, we collected the genome methylation level from ENCODE, calculated the fraction of methylated CpGs out of all the CpGs in each region, and ordered the regions by this fraction (Supplementary Table 4).

MNV density calculation was performed under the null hypothesis that the number of MNV of type WX→YZ we observe in an arbitrary genomic interval is proportional to the number of WX in the interval. Specifically, the MNV density of WX→YZ in interval I is defined as

$D (WX \to YZ ∣ I) = \frac{N (WX \to YZ ∣ I)}{N (WX ∣ I)}$ , where N(WX→YZ|I) is the number of MNVs of WX→YZ, and N(WX|I) is the number of WX in the reference genome we observe in that specific genomic interval. We then normalized the density by dividing by D(WX→YZ|I = whole genome) for scaling purpose (i.e., D(WX→YZ|I) = k means that the probability of observing a mutation of WX→YZ given a sequence context of WX is k times higher in genomic functional category I than the overall genome.)
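The density definition and its whole-genome normalization translate directly into code. The sketch below uses hypothetical counts and illustrative function names; it assumes the MNV and reference dinucleotide counts have already been tabulated per interval.

```python
def mnv_density(n_mnv: int, n_ref_dinuc: int) -> float:
    """D(WX->YZ | I): MNVs of one pattern per reference WX site in interval I."""
    return n_mnv / n_ref_dinuc


def relative_density(n_mnv_interval: int, n_ref_interval: int,
                     n_mnv_genome: int, n_ref_genome: int) -> float:
    """Density in the interval divided by the whole-genome density."""
    return (mnv_density(n_mnv_interval, n_ref_interval)
            / mnv_density(n_mnv_genome, n_ref_genome))


# Hypothetical counts: 20 MNVs on 1,000 WX sites in the interval,
# against 1,000 MNVs on 100,000 WX sites genome-wide.
print(relative_density(20, 1_000, 1_000, 100_000))
```

A value of 2 here would mean that the pattern is observed twice as often per reference WX site in the interval as in the genome overall.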

For estimating the fraction of MNVs per origin, we took a thresholding approach and defined four MNV patterns (CA->TG, AC->GT, CC->TT, and GA->AG) as CpG signal, two (GC->AA, GA->TT) as pol-zeta, three (AA->TT, TA->AT, AT->TA) as repeat, and six (TA->GC, CG->AT, AT->CG, CG->GC, GC->CG, CG->AC) as transversion signal, based on the results from Fig. 3; we left all the other 78 − (4 + 2 + 3 + 6) = 63 patterns as others, in order to highlight the strongest signals. The fraction of MNVs per origin is then defined simply as the number of MNVs that fall into the patterns of that origin divided by the number of all MNVs in the genomic interval. The coverage difference per interval was negligibly small (Supplementary Table 4).
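The thresholding scheme amounts to a lookup from substitution pattern to origin label followed by a count. The sketch below encodes the fifteen patterns listed above; the function name and the example input are illustrative, and patterns are assumed to be already collapsed to the representatives used in the text.

```python
from collections import Counter

ORIGIN_BY_PATTERN = {
    **dict.fromkeys(["CA->TG", "AC->GT", "CC->TT", "GA->AG"], "CpG"),
    **dict.fromkeys(["GC->AA", "GA->TT"], "pol-zeta"),
    **dict.fromkeys(["AA->TT", "TA->AT", "AT->TA"], "repeat"),
    **dict.fromkeys(["TA->GC", "CG->AT", "AT->CG",
                     "CG->GC", "GC->CG", "CG->AC"], "transversion"),
}


def origin_fractions(patterns: list) -> dict:
    """Fraction of MNVs per origin; unlisted patterns are counted as 'others'."""
    counts = Counter(ORIGIN_BY_PATTERN.get(p, "others") for p in patterns)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {origin: n / total for origin, n in counts.items()}


# Hypothetical MNVs observed in one genomic interval.
observed = ["CA->TG", "CA->TG", "GA->TT", "AA->TT", "AG->CT"]
print(origin_fractions(observed))
```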

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Supplementary information

Supplementary Information

Dataset 1

Dataset 2

Dataset 3

Peer Review File

Description of Additional Supplementary Files

Reporting Summary

Acknowledgements

We would like to thank the many individuals whose sequence data are aggregated in gnomAD for their contributions to research, and for making this work possible. The results published here are in part based upon data: (1) generated by The Cancer Genome Atlas managed by the NCI and NHGRI (accession: phs000178.v10.p8). Information about TCGA can be found at http://cancergenome.nih.gov, (2) generated by the Genotype-Tissue Expression Project (GTEx) managed by the NIH Common Fund and NHGRI (accession: phs000424.v7.p2), (3) generated by the Exome Sequencing Project, managed by NHLBI, (4) generated by the Alzheimer’s Disease Sequencing Project (ADSP), managed by the NIA and NHGRI (accession: phs000572.v7.p4). We would like to thank the Hail team for developing tools essential for the large-scale computation in this work. We would like to thank the analysis team of the Broad’s Rare Disease Group for their manual inspection of MNVs in rare disease cohorts. This work was funded by NIDDK U54 DK105566, NIGMS R01 GM104371, and NHGRI UM1 HG008900-01. Q.W. was supported by the Nakajima Foundation Scholarship. K.J.K. was supported by NIGMS F32 GM115208. A.O.D.L. was supported by NICHD K12 HD052896.

Author contributions

Q.W. conducted the study, performed the analysis, and wrote the paper. E.P.H., A.J.H., and B.B.C. defined the MNV classification and drafted the research. A.O.D.L. provided the data set for rare disease analysis. L.C.F. and L.D.G. generated the trio-based and read-based phasing information. J.A., B.B.C., and K.J.K. reviewed and edited the paper. D.G.M. conceived the project, supervised the overall work, reviewed and edited the paper.

Data availability

The list of coding MNVs in the gnomAD exomes is available at gs://gnomad-public/release/2.1/mnv/gnomad_mnv_coding.tsv (tab separated file). The coding MNVs consisting of three SNVs in a single codon are available as a separate file at gs://gnomad-public/release/2.1/mnv/gnomad_mnv_coding_3bp.tsv. The list of frame-restoring indel pairs is available at gs://gnomad-public/release/2.1/mnv/frame_restoring_indels.tsv. The list of all the MNVs in the gnomAD genomes is available at gs://gnomad-public/release/2.1/mnv/genome/gnomad_mnv_genome_d{i}.tsv.bgz (tab separated file, compressed; replace {i} (0 < i < 11) with the distance between the two SNVs of the MNV), or gs://gnomad-public/release/2.1/mnv/genome/gnomad_mnv_genome_d{i}.ht (hail table; replace {i} (0 < i < 11) with the distance between the two SNVs of the MNV). Explanations for each column in each file can be found at gs://gnomad-public/release/2.1/mnv/mnv_readme.md. All the files above are also available at the download page of the gnomAD browser (https://gnomad.broadinstitute.org/downloads).
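The {i} placeholder in the genome file names expands to the ten distances 1 to 10. The sketch below only builds the path strings from the template quoted above; the function name is illustrative, and whether the objects are still hosted at these locations is not checked.

```python
BASE = "gs://gnomad-public/release/2.1/mnv/genome"


def genome_mnv_paths(suffix: str = "tsv.bgz") -> list:
    """Expand gnomad_mnv_genome_d{i} for every distance with 0 < i < 11."""
    return [f"{BASE}/gnomad_mnv_genome_d{i}.{suffix}" for i in range(1, 11)]


for path in genome_mnv_paths():
    print(path)
```

Passing `suffix="ht"` yields the corresponding hail table paths.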

Code availability

The code used in the study is available at https://github.com/macarthur-lab/gnomad_mnv.

Competing interests

D.G.M. is a founder with equity in Goldfinch Bio, and has received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Merck, Pfizer, and Sanofi-Genzyme. K.J.K. owns stock in Personalis. E.V.M. has received research support in the form of charitable contributions from Charles River Laboratories and Ionis Pharmaceuticals, and has consulted for Deerfield Management. M.I.M.: The views expressed in this article are those of the author(s) and not necessarily those of the NHS, the NIHR, or the Department of Health. He has served on advisory panels for Pfizer, NovoNordisk, Zoe Global; has received honoraria from Merck, Pfizer, NovoNordisk, and Eli Lilly; has stock options in Zoe Global and has received research funding from Abbvie, Astra Zeneca, Boehringer Ingelheim, Eli Lilly, Janssen, Merck, NovoNordisk, Pfizer, Roche, Sanofi Aventis, Servier, and Takeda. As of June 2019, M.I.M. is an employee of Genentech, and holds stock in Roche. R.K.W. has received unrestricted research grants from Takeda Pharmaceutical Company. M.J.D. is a founder of Maze Therapeutics. B.M.N. is a member of the scientific advisory board at Deep Genomics and consultant for Camp4 Therapeutics, Takeda Pharmaceutical, and Biogen. A.O.D.L. has received honoraria from ARUP and Chan Zuckerberg Initiative.

Footnotes

Peer review information Nature Communications thanks Jeffrey Rosenfeld and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A full list of consortium members appears at the end of the paper.

Contributor Information

Daniel G. MacArthur, Email: danmac@broadinstitute.org

Genome Aggregation Database Production Team:

Irina M. Armean, Eric Banks, Louis Bergelson, Kristian Cibulskis, Ryan L. Collins, Kristen M. Connolly, Miguel Covarrubias, Mark J. Daly, Stacey Donnelly, Yossi Farjoun, Steven Ferriera, Stacey Gabriel, Jeff Gentry, Namrata Gupta, Thibault Jeandet, Diane Kaplan, Kristen M. Laricchia, Christopher Llanwarne, Eric V. Minikel, Ruchi Munshi, Benjamin M. Neale, Sam Novod, Nikelle Petrillo, Timothy Poterba, David Roazen, Valentin Ruano-Rubio, Andrea Saltzman, Kaitlin E. Samocha, Molly Schleicher, Cotton Seed, Matthew Solomonson, Jose Soto, Grace Tiao, Kathleen Tibbetts, Charlotte Tolonen, Christopher Vittal, Gordon Wade, Arcturus Wang, James S. Ware, Nicholas A. Watts, Ben Weisburd, and Nicola Whiffin

Genome Aggregation Database Consortium:

Carlos A. Aguilar Salinas, Tariq Ahmad, Christine M. Albert, Diego Ardissino, Gil Atzmon, John Barnard, Laurent Beaugerie, Emelia J. Benjamin, Michael Boehnke, Lori L. Bonnycastle, Erwin P. Bottinger, Donald W. Bowden, Matthew J. Bown, John C. Chambers, Juliana C. Chan, Daniel Chasman, Judy Cho, Mina K. Chung, Bruce Cohen, Adolfo Correa, Dana Dabelea, Dawood Darbar, Ravindranath Duggirala, Josée Dupuis, Patrick T. Ellinor, Roberto Elosua, Jeanette Erdmann, Tõnu Esko, Martti Färkkilä, Jose Florez, Andre Franke, Gad Getz, Benjamin Glaser, Stephen J. Glatt, David Goldstein, Clicerio Gonzalez, Leif Groop, Christopher Haiman, Craig Hanis, Matthew Harms, Mikko Hiltunen, Matti M. Holi, Christina M. Hultman, Mikko Kallela, Jaakko Kaprio, Sekar Kathiresan, Bong-Jo Kim, Young Jin Kim, George Kirov, Jaspal Kooner, Seppo Koskinen, Harlan M. Krumholz, Subra Kugathasan, Soo Heon Kwak, Markku Laakso, Terho Lehtimäki, Ruth J. F. Loos, Steven A. Lubitz, Ronald C. W. Ma, Jaume Marrugat, Kari M. Mattila, Steven McCarroll, Mark I. McCarthy, Dermot McGovern, Ruth McPherson, James B. Meigs, Olle Melander, Andres Metspalu, Peter M. Nilsson, Michael C. O’Donovan, Dost Ongur, Lorena Orozco, Michael J. Owen, Colin N. A. Palmer, Aarno Palotie, Kyong Soo Park, Carlos Pato, Ann E. Pulver, Nazneen Rahman, Anne M. Remes, John D. Rioux, Samuli Ripatti, Dan M. Roden, Danish Saleheen, Veikko Salomaa, Nilesh J. Samani, Jeremiah Scharf, Heribert Schunkert, Moore B. Shoemaker, Pamela Sklar, Hilkka Soininen, Harry Sokol, Tim Spector, Patrick F. Sullivan, Jaana Suvisaari, E. Shyong Tai, Yik Ying Teo, Tuomi Tiinamaija, Ming Tsuang, Dan Turner, Teresa Tusie-Luna, Erkki Vartiainen, Hugh Watkins, Rinse K. Weersma, Maija Wessman, James G. Wilson, and Ramnik J. Xavier

Supplementary information

Supplementary information is available for this paper at https://doi.org/10.1038/s41467-019-12438-5.

References

1. Lek M, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057.
2. Kaplanis, J. et al. Exome-wide assessment of the functional impact and pathogenicity of multinucleotide mutations. Genome Res. gr.239756.118 (2019).
3. Rosenfeld JA, Malhotra AK, Lencz T. Novel multi-nucleotide polymorphisms in the human genome characterized by whole genome and exome sequencing. Nucleic Acids Res. 2010;38:6102–6111. doi: 10.1093/nar/gkq408.
4. Wei, L. et al. MAC: identifying and correcting annotation for multi-nucleotide variations. BMC Genomics 16, 569 (2015).
5. Lai Z, et al. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res. 2016;44:e108. doi: 10.1093/nar/gkw227.
6. Cheng S-J, et al. Accurately annotate compound effects of genetic variants using a context-sensitive framework. Nucleic Acids Res. 2017;45:e82. doi: 10.1093/nar/gkx041.
7. Danecek P, McCarthy SA. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics. 2017;33:2037–2039. doi: 10.1093/bioinformatics/btx100.
8. Khan W, et al. MACARON: a python framework to identify and re-annotate multi-base affected codons in whole genome/exome sequence data. Bioinformatics. 2018;34:3396–3398. doi: 10.1093/bioinformatics/bty382.
9. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68. doi: 10.1038/nature15393.
10. Harris K, Nielsen R. Error-prone polymerase activity causes multinucleotide mutations in humans. Genome Res. 2014;24:1445–1454. doi: 10.1101/gr.170696.113.
11. Zhong X, et al. The fidelity of DNA synthesis by yeast DNA polymerase zeta alone and with accessory proteins. Nucleic Acids Res. 2006;34:4731–4742. doi: 10.1093/nar/gkl465.
12. Sakamoto AN, et al. Mutator alleles of yeast DNA polymerase ζ. DNA Repair. 2007;6:1829–1838. doi: 10.1016/j.dnarep.2007.07.002.
13. Stone JE, Lujan SA, Kunkel TA. DNA polymerase zeta generates clustered mutations during bypass of endogenous DNA lesions in Saccharomyces cerevisiae. Environ. Mol. Mutagenesis. 2012;53:777–786. doi: 10.1002/em.21728.
14. Chen J-M, Férec C, Cooper DN. Closely spaced multiple mutations as potential signatures of transient hypermutability in human genes. Hum. Mutat. 2009;30:1435–1448. doi: 10.1002/humu.21088.
15. Schrider DR, Hourmozdi JN, Hahn MW. Pervasive multinucleotide mutational events in eukaryotes. Curr. Biol. 2011;21:1051–1054. doi: 10.1016/j.cub.2011.05.013.
16. Besenbacher S, et al. Multi-nucleotide de novo mutations in humans. PLOS Genet. 2016;12:e1006315. doi: 10.1371/journal.pgen.1006315.
17. The Deciphering Developmental Disorders Study et al. Large-scale discovery of novel genetic causes of developmental disorders. Nature. 2015;519:223–228. doi: 10.1038/nature14135.
18. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. arXiv preprint: arXiv:1207.3907 [q-bio] (2012).
19. Francioli LC, et al. A framework for the detection of de novo mutations in family-based sequencing data. Eur. J. Hum. Genet. 2017;25:227–233. doi: 10.1038/ejhg.2016.147.
20. Choi, Y., Chan, A. P., Kirkness, E., Telenti, A. & Schork, N. J. Comparison of phasing strategies for whole human genomes. PLoS Genet. 14, e1007308 (2018).
21. Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at: 10.1101/201178v3 (2018).
22. Karczewski, K. J. et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. Preprint at: 10.1101/531210v3 (2019).
23. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005;33:D514–D517. doi: 10.1093/nar/gki033.
24. Landrum MJ, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014;42:D980–D985. doi: 10.1093/nar/gkt1113.
25. Rehm HL, et al. ClinGen–the clinical genome resource. N. Engl. J. Med. 2015;372:2235–2242. doi: 10.1056/NEJMsr1406261.
26. Samocha KE, et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 2014;46:944–950. doi: 10.1038/ng.3050.
27. Kircher M, et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 2014;46:310–315. doi: 10.1038/ng.2892.
28. Nachman MW, Crowell SL. Estimate of the mutation rate per nucleotide in humans. Genetics. 2000;156:297–304. doi: 10.1093/genetics/156.1.297.
29. Francioli LC, et al. Genome-wide patterns and properties of de novo mutations in humans. Nat. Genet. 2015;47:822–826. doi: 10.1038/ng.3292.
30. Xue Y, et al. Human Y chromosome base-substitution mutation rate measured by direct sequencing in a deep-rooting pedigree. Curr. Biol. 2009;19:1453–1457. doi: 10.1016/j.cub.2009.07.032.
31. Northam MR, et al. DNA polymerases ζ and Rev1 mediate error-prone bypass of non-B DNA structures. Nucleic Acids Res. 2014;42:290–306. doi: 10.1093/nar/gkt830.
32. Montgomery SB, et al. The origin, evolution, and functional impact of short insertion–deletion variants identified in 179 human genomes. Genome Res. 2013;23:749–761. doi: 10.1101/gr.148718.112.
33. Bacolla A, et al. Local DNA dynamics shape mutational patterns of mononucleotide repeats in human genomes. Nucleic Acids Res. 2015;43:5065–5080. doi: 10.1093/nar/gkv364.
34. Ananda G, et al. Microsatellite interruptions stabilize primate genomes and exist as population-specific single nucleotide polymorphisms within individual human genomes. PLOS Genet. 2014;10:e1004498. doi: 10.1371/journal.pgen.1004498.
35. Leclercq S, Rivals E, Jarne P. DNA slippage occurs at microsatellite loci without minimal threshold length in humans: a comparative genomic approach. Genome Biol. Evol. 2010;2:325–335. doi: 10.1093/gbe/evq023.
36. Lai Y, Sun F. The relationship between microsatellite slippage mutation rate and the number of repeat units. Mol. Biol. Evol. 2003;20:2123–2131. doi: 10.1093/molbev/msg228.
37. Pumpernik D, Oblak B, Borštnik B. Replication slippage versus point mutation rates in short tandem repeats of the human genome. Mol. Genet. Genomics. 2008;279:53–61. doi: 10.1007/s00438-007-0294-1.
38. Chan K, Gordenin DA. Clusters of multiple mutations: incidence and molecular mechanisms. Annu Rev. Genet. 2015;49:243–267. doi: 10.1146/annurev-genet-112414-054714.
39. Supek F, Lehner B. Clustered mutation signatures reveal that error-prone DNA repair targets mutations to active genes. Cell. 2017;170:534–547. doi: 10.1016/j.cell.2017.07.003.
40. Michaelson JJ, et al. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell. 2012;151:1431–1442. doi: 10.1016/j.cell.2012.11.019.
41. Finucane HK, et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 2015;47:1228–1235. doi: 10.1038/ng.3404.
42. The ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004;306:636–640. doi: 10.1126/science.1105136.
43. Maston GA, Evans SK, Green MR. Transcriptional Regulatory Elements in the Human Genome. Annu. Rev. Genom. Hum. Genet. 2006;7:29–59. doi: 10.1146/annurev.genom.7.080505.115623.
44. Kulaeva OI, Nizovtseva EV, Polikanov YS, Ulianov SV, Studitsky VM. Distant Activation of transcription: mechanisms of enhancer action. Mol. Cell. Biol. 2012;32:4892–4897. doi: 10.1128/MCB.01127-12.
45. Aggarwala V, Voight BF. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat. Genet. 2016;48:349–355. doi: 10.1038/ng.3511.
46. Duret L. Mutation patterns in the human genome: more variable than expected. PLOS Biol. 2009;7:e1000028. doi: 10.1371/journal.pbio.1000028.
47. Ségurel L, Wyman MJ, Przeworski M. Determinants of mutation rate variation in the human germline. Annu. Rev. Genomics Hum. Genet. 2014;15:47–70. doi: 10.1146/annurev-genom-031714-125740.
48. Harris K. Evidence for recent, population-specific evolution of the human mutation rate. PNAS. 2015;112:3439–3444. doi: 10.1073/pnas.1418652112.
49. Guirouilh-Barbat, J., Lambert, S., Bertrand, P. & Lopez, B. S. Is homologous recombination really an error-free process? Front. Genet. 5, 175 (2014).
50. Smit AF. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genetics Dev. 1999;9:657–663. doi: 10.1016/s0959-437x(99)00031-3.
51. Wicker T, et al. A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 2007;8:973–982. doi: 10.1038/nrg2165.
52. Roberts SA, et al. An APOBEC cytidine deaminase mutagenesis pattern is widespread in human cancers. Nat. Genet. 2013;45:970–976. doi: 10.1038/ng.2702.
53. Alexandrov LB, et al. Signatures of mutational processes in human cancer. Nature. 2013;500:415–421. doi: 10.1038/nature12477.
54. Stark Z, et al. Integrating genomics into healthcare: a global responsibility. Am. J. Hum. Genet. 2019;104:13–20. doi: 10.1016/j.ajhg.2018.11.014.
55. Centers for Mendelian Genomics. Bamshad MJ. The genetic basis of Mendelian phenotypes: discoveries, challenges, and opportunities. Am. J. Hum. Genet. 2015;97:199–215. doi: 10.1016/j.ajhg.2015.06.009.
56. Morgulis A, Gertz EM, Schäffer AA, Agarwala R. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J. Comput. Biol. 2006;13:1028–1040. doi: 10.1089/cmb.2006.13.1028.
57. Robinson JT, et al. Integrative genomics viewer. Nat. Biotechnol. 2011;29:24–26. doi: 10.1038/nbt.1754.
58. Li H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics. 2014;30:2843–2851. doi: 10.1093/bioinformatics/btu356.
59. Lenoir WF, Lim TL, Hart T. PICKLES: the database of pooled in-vitro CRISPR knockout library essentiality screens. Nucleic Acids Res. 2018;46:D776–D780. doi: 10.1093/nar/gkx993.
60. Hart T, et al. High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell. 2015;163:1515–1526. doi: 10.1016/j.cell.2015.11.015.
61. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–330. doi: 10.1038/nature14248.
62. Kent WJ, et al. The human genome browser at UCSC. Genome Res. 2002;12:996–1006. doi: 10.1101/gr.229102.
63. Wagih O. ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics. 2017;33:3645–3647. doi: 10.1093/bioinformatics/btx469.

Associated Data

Supplementary Materials

Supplementary Information (6.7 MB, pdf)

Dataset 1 (520.9 KB, xlsx)

Dataset 2 (43.5 KB, xlsx)

Dataset 3 (17.9 KB, xlsx)

Peer Review File (1.3 MB, pdf)

41467_2019_12438_MOESM6_ESM.pdf (52.4 KB, pdf)

Description of Additional Supplementary Files

Reporting Summary (62.8 KB, pdf)

Data Availability Statement

The code used in the study is available at https://github.com/macarthur-lab/gnomad_mnv.
