Abstract
Mitochondrial DNA (mtDNA) has an important, yet often overlooked, role in health and disease. Constraint models quantify the removal of deleterious variation from the population by selection, representing a powerful tool for identifying genetic variation underlying human phenotypes1–4. However, nuclear constraint models could not be applied to the mtDNA, due to its unique features. Here we describe the development of a mitochondrial genome constraint model and its application to the Genome Aggregation Database (gnomAD), a large-scale population dataset reporting mtDNA variation across 56,434 humans5. Specifically, we analyze constraint by comparing the observed variation in gnomAD to that expected under neutrality, calculated using a mtDNA mutational model and observed maximum heteroplasmy level data. Our results demonstrate strong depletion of expected variation, suggesting many deleterious mtDNA variants remain undetected. To aid their discovery, we compute constraint metrics for every mitochondrial protein, tRNA, and rRNA gene, revealing a spectrum of intolerance to variation. We further characterize the most constrained regions within genes via regional constraint, and identify the most constrained sites within the entire mtDNA via local constraint, showing their enrichment in pathogenic variation. Constraint also clustered in 3D structures, providing insight into functionally important domains and their disease relevance. Notably, we identify constraint at often overlooked sites, including rRNA and non-coding regions. Lastly, we demonstrate these metrics can improve the discovery of deleterious variation underlying rare and common phenotypes.
Main
Mitochondria produce the majority of cellular energy and play a key role in many other processes including signaling, redox homeostasis, cell fate, immune response, and metabolic regulation6–8. The prokaryotic origin of mitochondria is reflected by the circular mitochondrial genome (mtDNA), which is maintained and expressed in the mitochondria9. The human mtDNA is ~16.5 kb and encodes 13 proteins, as well as 22 transfer RNAs (tRNAs) and two ribosomal RNAs (rRNAs) required for their translation10,11. These proteins are subunits of complexes I, III, IV and V in the oxidative phosphorylation pathway, which are enzymes central to energy generation and metabolism. The mtDNA has unique features that differentiate it from the nuclear genome. These include maternal inheritance, absence of introns, multiple copies per cell (e.g. 100s–1000s), a germline bottleneck, and distinct mutational mechanisms including a higher rate of mutation7,12. The percentage of mtDNA copies carrying a variant, known as the heteroplasmy level, can range from 0–100%. Variants are heteroplasmic when present in a fraction of mtDNA copies, or homoplasmic when present in all.
Variants in mtDNA can cause cellular dysfunction and have been linked to many phenotypes. This includes causing rare mitochondrial diseases, which are clinically heterogeneous disorders with ‘any symptom, in any organ, at any age’11, as well as contributing to common diseases such as autism13, cancer14–16, Alzheimer’s and Parkinson’s diseases17. Large cohort analyses have also yielded significant associations between mtDNA variants and phenotypes including multiple sclerosis, blood traits, and biomarkers18–20. Despite its importance in health and disease, the functional impact of most mtDNA variants is unknown. Classifying mtDNA variants is challenging, in part due to the lack of tools for predicting mtDNA variant effect12. One such tool not available for mtDNA is a constraint model, which quantifies the removal of deleterious variation from the population by selection1. Constrained genome regions are enriched in deleterious variation, and nuclear constraint metrics such as those calculated using the Genome Aggregation Database (gnomAD) are widely used in human genetic analyses1–4. Constraint also identifies sites most essential for function, which are subject to the strongest selection1–4.
We recently expanded the gnomAD population database to include mtDNA variation5. The gnomAD v3.1 dataset reports homoplasmic and heteroplasmic variants in 56,434 humans and at >50% of mtDNA positions5, representing one of the largest mtDNA databases and an opportunity to assess constraint. The inclusion of heteroplasmy data in gnomAD enabled new population metrics such as maximum heteroplasmy, which represents the highest observed heteroplasmy level of a variant in the population. Indeed, heteroplasmy is important for assessing selection in mitochondria, since the heteroplasmy level of a pathogenic variant must rise above a ‘disease threshold’ to cause symptoms7,11. While nonsynonymous to synonymous ratios or variants per base have been used to assess mtDNA selection21–24, these do not account for differences in mutability between variant types and positions, limiting utility. Furthermore, these studies focused on homoplasmies, observed variation, and small datasets22–25, providing an incomplete picture.
Here we describe the development of a mitochondrial genome constraint model and its application to gnomAD. We characterize constraint across the human mtDNA, and show that constrained regions are enriched in pathogenic variation and functionally-critical sites. We also demonstrate that our constraint metrics can aid the classification of variants that underlie human phenotypes. This work establishes a constraint model for the mitochondrial genome, and provides insight into which sites in the mtDNA are most important for health.
Building a constraint model for mtDNA
We analyzed constraint in mtDNA by comparing the observed level of variation in gnomAD to that expected under neutrality (Extended Data Fig. 1), as calculated by a mutational model. Nuclear constraint models could not be applied to mtDNA due to its unique features (Supplementary Methods), nor could nuclear mutational models since the two genomes have distinct mutational mechanisms26. Therefore, we developed a mitochondrial mutational model to calculate expected variation. We adapted a composite likelihood model27 for the mtDNA and applied it to a curated set of de novo mtDNA mutations to quantify mutability in trinucleotide contexts; this method is well-suited to handle the possible sparsity of counts arising from the small size of the mitochondrial genome (Methods). The predicted mutational signature was consistent with previous reports25,26, showing increased likelihood of transitions over transversions, strand bias for transitions, and a distinct signature in the non-coding OriB-OriH region (Fig. 1a, Extended Data Fig. 2a). Observed neutral variation in each locus in gnomAD correlated with the mutation likelihood of that locus (correlation coefficient R>0.99, p-value<2.2e-16), which was also a significantly better predictor than locus length (Supplementary Methods), establishing the predictive value of the mutational model.
Figure 1: Mutability and constraint in the human mtDNA.

(a) Trinucleotide mutational signature of mtDNA mutations predicted by the composite likelihood model. Mutation likelihoods for the six pyrimidine substitution types across 96 trinucleotides are shown, colored by whether the reference nucleotide is in the reference ‘light’ or reverse complement ‘heavy’ strand. This is computed for the reference sequence, excluding OriB-OriH (which has a distinct signature). (b) The observed:expected ratio of each functional class of mtDNA variation in gnomAD (synonymous, n=8219; missense, n=24,021; stop gain, n=1603; tRNA, n=4512; rRNA, n=7536 and non-coding, n=3618). The diamonds represent the observed:expected ratio, and error bars represent the 90% confidence interval. (c) The observed:expected ratio of disease-associated mtDNA variation in ClinVar and MITOMAP databases in gnomAD, which are mostly missense and tRNA variants (Extended Data Fig. 2b), grouped by classification. 2607 ClinVar variants (benign, n=910; likely benign, n=491; uncertain significance, n=1000; likely pathogenic, n=57 and pathogenic, n=149) and 881 MITOMAP variants (reported, n=791 and confirmed, n=90) are included. The diamonds represent the observed:expected ratio, and error bars represent the 90% confidence interval. (d-f) Observed and expected sum maximum heteroplasmy of variants in each protein in gnomAD are plotted for synonymous (d), missense (e) and stop gain (f) variants. The Pearson correlation coefficient R is also shown in (d). Values for (d-f) are provided in Supplementary Dataset 1.
We used this mutational model to calculate the expected level of mtDNA variation in gnomAD, comparing it to the observed to quantify constraint. Specifically, we calculated the observed and expected sum of the maximum heteroplasmy of mtDNA variants (Methods). The maximum heteroplasmy of a variant is the highest heteroplasmy level (expressed as a fraction, range 0.0–1.0) at which it is observed in the population dataset (across all haplogroups). It is well-suited for detecting selection in the mitochondria given most pathogenic mtDNA variants have maximum heteroplasmy <1.0 (ref. 5), consistent with selection reducing their heteroplasmy level below a ‘disease threshold’7,11. Simulation of mtDNA mutation and drift across generations supported that mutation rates correlate with maximum heteroplasmy under neutrality (Supplementary Methods), in line with reports that neutral mutations drifting to homoplasmy (i.e. to a maximum heteroplasmy of 1.0) increase linearly with mutation rate26, corroborating the validity of this method (Supplementary Discussion). We quantified constraint as the ratio of observed to expected sum maximum heteroplasmy and calculated a 90% confidence interval (CI) around these ratios (Methods). These values provide an inference on the strength of selection against a group of variants.
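The observed:expected ratio described above can be illustrated with a minimal sketch. It assumes that the expected sum maximum heteroplasmy of a group is proportional to its summed mutation likelihood, with the proportionality constant calibrated on a set of presumed neutral variants; this simple scaling, the `Variant` class, and both function names are illustrative assumptions rather than the procedure in Methods, and the confidence interval is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    likelihood: float        # mutation likelihood from a mutational model (assumed given)
    max_heteroplasmy: float  # highest observed level, 0.0-1.0 (0.0 if never observed)


def neutral_scale(neutral_variants):
    """Observed sum maximum heteroplasmy per unit of mutation likelihood
    among presumed neutral variants (illustrative calibration)."""
    total_likelihood = sum(v.likelihood for v in neutral_variants)
    total_observed = sum(v.max_heteroplasmy for v in neutral_variants)
    return total_observed / total_likelihood


def observed_expected_ratio(group, scale):
    """Ratio of observed to expected sum maximum heteroplasmy for a group of variants."""
    observed = sum(v.max_heteroplasmy for v in group)
    expected = scale * sum(v.likelihood for v in group)
    return observed / expected


# Hypothetical toy values, not taken from gnomAD
neutral = [Variant(2.0, 1.0), Variant(2.0, 0.5), Variant(1.0, 0.5)]
candidate_group = [Variant(2.0, 0.1), Variant(3.0, 0.0)]
ratio = observed_expected_ratio(candidate_group, neutral_scale(neutral))
```

In this toy example the candidate group is observed at far less than its expected sum maximum heteroplasmy, so the ratio is well below 1, which is the pattern interpreted as constraint.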
Evaluation of functional classes of variation in gnomAD showed that their predicted severity correlated with their observed:expected ratio (Fig. 1b), such that synonymous had a ratio close to 1 (0.99, CI 0.97–1.01) and stop gain close to 0 (0.008, CI 0.001–0.015). Other classes lay between these values; partial depletion of non-coding variation was also observed. Selection against disease-associated variation also correlated with classification with ratios ranging from 0.13 for pathogenic variation (CI 0.08–0.18) to 0.995 for benign (CI 0.99–1.0) (Fig. 1c). Variants of uncertain significance showed an intermediate value, reflecting the presence of pathogenic and benign variants in this category (Extended Data Fig. 2c). Missense and tRNA variants predicted deleterious in silico were also more constrained than those predicted as tolerated (Extended Data Fig. 2d). Evaluation of functional classes in another large mtDNA population database with heteroplasmy (HelixMTdb28) produced comparable results, supporting the robustness of our model (Extended Data Fig. 2e). Collectively, these data establish a constraint model for the mtDNA.
Identifying functionally critical sites in proteins
We evaluated protein gene constraint in gnomAD. The observed and expected values for synonymous variation in each gene were highly correlated (R=0.996), consistent with minimal selection (Fig. 1d). In contrast, stop gain variants were nearly absent in the population (Fig. 1f), as expected. Although gnomAD is not a random population sampling5, this observation is in line with previous reports5,28,29, and indicates most predicted loss of function variants are not compatible with life. Most proteins showed depletion of expected missense variation (Fig. 1e, Fig. 2a). We adopted the observed:expected ratio CI upper bound fraction (OEUF) as a conservative measure of constraint akin to nuclear constraint models1, which accounts for uncertainty around the ratio to avoid overestimating constraint (Methods). Gene missense OEUF values ranged from 0.16 to 0.98 (Supplementary Dataset 1) and correlated with gene function (Fig. 2a), in line with complex I and IV defects being the most common causes of pediatric disease30 and studies in mtDNA ‘mutator’ mice21. Mitochondrial genes showed similar missense constraint to nuclear genes involved in cell viability, development, and disease (Supplementary Discussion). Comparison with conservation showed similarities but was not significantly correlated (Extended Data Fig. 3a).
Figure 2: Assessment of missense constraint identifies gene and regional constraint.

(a) The missense observed:expected ratio in gnomAD for each protein (MT-ND1, n=1987; MT-ND2, n=2199; MT-ND3, n=732; MT-ND4, n=2900; MT-ND4L, n=611; MT-ND5, n=3874; MT-ND6, n=1083; MT-CYB, n=2433; MT-CO1, n=3289; MT-CO2, n=1466; MT-CO3, n=1664; MT-ATP6, n=1426; MT-ATP8, n=417), ordered by the oxidative phosphorylation complex it belongs to (I, III, IV and V). The diamonds represent the observed:expected ratio, and error bars represent the 90% confidence interval. Values are provided in Supplementary Dataset 1. (b) Areas of regional missense constraint identified in MT-ND1 are shown in red within linear protein sequence (top) and 3D protein structure (bottom). Residues in green form the shallow part of the quinone binding pocket and those in yellow are involved in proton pumping per Kampjut and Sazanov39. (c) The odds ratio (OR) enrichment of pathogenic (n=79) vs benign (n=625) missense variants (most severe consequence) outside, proximal to (<6 Ångstrom distance from), and within areas of regional constraint; the y-axis is displayed with square root scale. The height of the bar represents the OR, and the error bars represent the 95% confidence interval. (d) The proportion of curated missense variants from a clinical genetics service that are within (red), proximal to (purple), or outside (blue) regional missense constraint, categorized by classification. 171 missense variants are included (benign & likely benign, n=34; VUS of low clinical significance, n=77; VUS, n=31; VUS of high clinical significance, n=9 and pathogenic & likely pathogenic, n=20). The color legend is per (c). VUS are variants of uncertain significance.
Missense tolerance can vary within proteins, a phenomenon called regional constraint, whereby specific gene regions can be more constrained than the gene3. Since these regions are enriched in pathogenic missense in the nuclear genome3, we developed a method to assess regional constraint in the mtDNA (Methods). Approximately 15% of total protein sequence was regionally missense constrained, and all proteins except MT-ATP8 had at least one region identified (Extended Data Fig. 4a, 5a, Supplementary Dataset 2). Mapping of regional constraint onto protein structures revealed many were in close proximity in 3D space, as was the case for complex I subunit MT-ND1 where regional constraint topologically clustered in the binding pocket for quinone, a molecule essential for complex I function (Fig. 2b). Regional constraint was enriched in residues of known functional importance involved in cofactor binding or proton transfer (Extended Data Fig. 5a); manual inspection revealed other functionally critical sites clustering with constraint (Extended Data Fig. 5d–e, Supplementary Discussion). These data highlight the utility of regional constraint to identify residues not realized as functionally important, which is significant given the lack of functional domain annotations for these genes.
We then assessed the association between regional constraint and pathogenic variation. This showed regional constraint was highly enriched in pathogenic versus benign missense (Odds Ratio [OR]=26, 95% CI=12–55); residues near regional constraint in 3D space (<6 Ångstrom) were also enriched to a lower extent (OR=2.6, 95% CI=1.4–4.6, Fig. 2c). This was supported by variants curated by a clinical genetics service in cases suspected to have mitochondrial disease, where the proportion of missense variants within or proximal to regional constraint correlated with classification (Fig. 2d). Regional missense constraint performed particularly well at discriminating ‘true negatives’, given the lack of benign variants within regional constraint (Fig. 2d, Extended Data Fig. 5b). Regional constraint therefore provides a tool for variant classification, including by enabling use of the ACMG pathogenic criterion “Located in a mutational hotspot and/or critical and well-established functional domain without benign variation”12. Overall, we demonstrate the utility of regional constraint for mtDNA variant classification and for prioritizing variants of uncertain significance in individuals with rare disease.
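As an illustration of how an odds ratio and its 95% confidence interval can be obtained from a 2×2 table of pathogenic and benign variants inside versus outside a region, the sketch below uses the standard Wald interval on the log odds ratio. The choice of this interval, the function name, and the counts are assumptions for illustration only; they are not the study's counts or necessarily its statistical method.

```python
import math


def odds_ratio_ci(path_in, path_out, benign_in, benign_out, z=1.96):
    """Odds ratio with a Wald confidence interval computed on the log scale.

    All four counts must be greater than zero; a table with an empty cell
    would need a continuity correction, which is not applied here.
    """
    odds_ratio = (path_in * benign_out) / (path_out * benign_in)
    standard_error = math.sqrt(
        1 / path_in + 1 / path_out + 1 / benign_in + 1 / benign_out
    )
    lower = math.exp(math.log(odds_ratio) - z * standard_error)
    upper = math.exp(math.log(odds_ratio) + z * standard_error)
    return odds_ratio, lower, upper


# Hypothetical counts: 10 pathogenic and 20 benign variants inside the region,
# 5 pathogenic and 40 benign variants outside it
estimate, ci_lower, ci_upper = odds_ratio_ci(10, 5, 20, 40)
```

An interval that excludes 1 indicates enrichment (or depletion) of pathogenic relative to benign variants within the region.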
Revealing constraint across RNA genes
Most genes in the mtDNA encode RNAs; specifically, the tRNAs and rRNAs required for mitochondrial translation. The tRNAs showed a spectrum of intolerance to base substitutions in gnomAD, with OEUF values ranging from 0.21 for MT-TM to 0.87 for MT-TT (Fig. 3a, Supplementary Dataset 1). Gene constraint was not significantly correlated with tRNA codon usage (Extended Data Fig. 3c), suggesting other factors are driving selection. Indeed, MT-TM is the most constrained tRNA, encoding the initiator (and elongator) tRNAMet. Gene conservation only partially correlated with constraint (Extended Data Fig. 3b). The low OEUF values in the rRNA genes are striking given these genes are typically overlooked due to a lack of tools for predicting rRNA variant effect12, and support that they likely harbor missed pathogenic variation.
Figure 3: Constraint across and within RNA genes.

(a) The observed:expected ratio for variants in each RNA in gnomAD (MT-TM, n=204; MT-TL1, n=225; MT-TN, n=219; MT-TY, n=198; MT-TI, n=207; MT-TR, n=195; MT-TE, n=207; MT-TP, n=204; MT-TF, n=213; MT-TS1, n=207; MT-TV, n=207; MT-TK, n=210; MT-TD, n=204; MT-TL2, n=213; MT-TW, n=204; MT-TA, n=207; MT-TG, n=204; MT-TS2, n=177; MT-TQ, n=216; MT-TH, n=207; MT-TC, n=198; MT-TT, n=198; MT-RNR1, n=2862; MT-RNR2, n=4674), ordered by RNA type and value. Values are provided in Supplementary Dataset 1. (b) The observed:expected ratio for variants in each base type in tRNA (WC pair, n=2364; non-WC pair, n=318 and loop or other, n=1842) and rRNA (WC pair, n=3078; non-WC pair, n=354 and loop or other, n=4104). WC represents Watson-Crick, and loop or other includes all single-stranded regions. (c) The observed:expected ratio for variants in each tRNA domain (acceptor stem, n=933; D-stem, n=468; D-loop, n=411; anticodon stem, n=672; anticodon loop, n=462; variable region, n=270; T-stem, n=606 and T-loop n=459); their secondary structure location is per Extended Data Fig. 6c. The diamonds in (a-c) represent the observed:expected ratio, and error bars in (a-c) represent the 90% confidence interval (CI). (d) The observed:expected ratio CI upper bound fraction (OEUF) of variants at each tRNA secondary structure position. Darker colors represent lower values, per the legend. Values are provided in Supplementary Dataset 4. (e) Each tRNA position OEUF mapped onto tRNA tertiary structure; color legend is per (d). Labeled position 46 and D-stem positions are shown in nucleotide style. The mRNA molecule is colored green.
The RNAs form secondary structures with double-stranded stems and single-stranded loops. Variants disrupting Watson-Crick (WC) base pairs in stems were highly constrained in tRNAs and rRNAs (OEUF 0.18 and 0.17, Fig. 3b), in line with most pathogenic tRNA variants breaking WC pairs31 (Extended Data Fig. 6a). Modified bases were also more constrained than non-modified bases, consistent with their role in RNA stability and function32 (Extended Data Fig. 6b). The tRNAs share a cloverleaf secondary structure with annotated domains (Extended Data Fig. 6c). Assessment across domains revealed clear differences, such that the D-stem was most constrained (OEUF 0.15) and the T-loop the least (OEUF 0.72, Fig. 3c). These values correlated with enrichment in pathogenic variants (Extended Data Fig. 6d), and are consistent with studies on pathogenic burden across domains, especially in the anticodon and acceptor22,31. Although the latter have obvious functional significance due to binding mRNA or amino acids, these data also highlight a critical role for the D-stem which is involved in forming the tertiary tRNA structure33.
Owing to the shared structure of the tRNAs, each base can be assigned a position number. Evaluation across each tRNA position revealed a spectrum of constraint (Fig. 3d, Supplementary Dataset 4). These data affirm a loss of function effect of non-wobble anticodon variants (positions 35–36)26,34, whilst also highlighting positions not widely appreciated as functionally important. This includes the position 11–24 pairing in the D-stem, and position 46 in the variable region that interacts with the D-stem in the tertiary structure (Fig. 3e). Unlike the tRNAs, the rRNAs do not share a common structure. Therefore, we assessed regional constraint to characterize tolerance to variation across each rRNA (Supplementary Dataset 2). Approximately 15% of rRNA bases were regionally constrained (Extended Data Fig. 4b, 5c, 5g–h). Bases with modifications and in bridges connecting the mitoribosomal subunits, which have a key role in function32,35, were enriched within regional constraint (Extended Data Fig. 5c, Supplementary Discussion). The two well-established rRNA pathogenic variants in MT-RNR136 were also nearby to regional constraint (Extended Data Fig. 5f–g). Collectively, these data identify which tRNA and rRNA sites are most important for function, and thus most likely to harbor deleterious variation.
Constraint in non-coding elements
Approximately 10% of the mtDNA is non-coding. Most of this sequence is within the ‘control’ region, which contains elements involved in mtDNA transcription and replication37. Since we observed partial depletion of expected non-coding variation (Fig. 1b), we assessed constraint in non-coding elements in gnomAD (Supplementary Dataset 5). Several elements were constrained (Extended Data Fig. 7), including the promoter for transcription of the light strand (LSP, OEUF 0.53), conserved sequence block 3 (CSB3, OEUF 0.33), and the origin for replication of the light strand (OriL, OEUF 0.38). In contrast, the hypervariable sequences (HVS1–3) showed OEUF values ranging from 0.90 to 1.09; these regions are highly polymorphic, and therefore were expected to be tolerant of variation. These data support that the depletion of non-coding variation reflects a proportion of these variants being deleterious, due to effects on mtDNA replication and transcription.
Most constrained sites in human mtDNA
To identify the most constrained sites in the entire mtDNA, agnostic of locus annotation, we assessed local intolerance to base or amino acid substitution for every position in gnomAD. This was computed using an overlapping sliding window method which assigns every position a score between 0 and 1, termed the mtDNA local constraint (MLC) score (Methods). The most constrained positions, with a score >0.99, include residues in MT-CO1 and MT-CO2 binding copper, notable given copper metalation is required for complex IV function38, as well as a binding site for complex I inhibitor rotenone39 (Fig. 4a). RNA bases with high scores (>0.95) included rRNA sites involved in tRNA and mRNA binding during translation40 (Fig. 4a, Supplementary Video). Non-coding bases were depleted from the highest score quartile (Fig. 4b), and those with the highest scores were in the light strand replication origin or control region. The latter included regions of unknown function, which also have a low population frequency of variants (Extended Data Fig. 8); further investigation is needed to determine what is driving their signals (Supplementary Discussion).
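The general idea of an overlapping sliding window score can be sketched as follows. This is an illustration under stated assumptions and not the MLC procedure in Methods: it assumes per-position observed and expected values are already available, uses the plain observed:expected ratio of each window, averages the ratios of all windows covering a position, and converts the result to a rank-based score between 0 and 1. The window size and the function name are hypothetical.

```python
def local_scores(observed, expected, window=3):
    """Rank-based local constraint score per position on a circular sequence.

    Each window's observed:expected ratio is credited to every position it
    covers; positions with the lowest mean ratio receive the highest score.
    Assumes the expected value of every window is greater than zero and that
    the sequence has at least two positions.
    """
    n = len(observed)
    ratio_sums = [0.0] * n
    window_counts = [0] * n
    for start in range(n):
        # Indices wrap around because the genome is circular
        indices = [(start + offset) % n for offset in range(window)]
        ratio = sum(observed[i] for i in indices) / sum(expected[i] for i in indices)
        for i in indices:
            ratio_sums[i] += ratio
            window_counts[i] += 1
    mean_ratios = [s / c for s, c in zip(ratio_sums, window_counts)]

    # Sort from least to most constrained; ties are broken by position order
    order = sorted(range(n), key=lambda i: mean_ratios[i], reverse=True)
    scores = [0.0] * n
    for rank, position in enumerate(order):
        scores[position] = rank / (n - 1)
    return scores


# Hypothetical toy sequence with a depleted stretch at positions 3-5
toy_scores = local_scores([1, 1, 1, 0, 0, 0, 1, 1], [1.0] * 8)
```

Because each position is scored from all windows that overlap it, a position with no observed variation can still receive a modest score when its neighbors are tolerant of variation.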
Figure 4: Assessment of mtDNA local constraint (MLC) scores.

(a) The MLC score across every base position in the human mtDNA. Encoded genes are shown below, colored as follows: protein in blue, rRNA purple, tRNA orange, and non-coding yellow. The dashed gray line represents a score of 0.95. Per base scores are provided in Supplementary Dataset 6. The top panel shows examples of positions with the highest scores, from left to right in MT-RNR2 at the tRNA/mRNA interface in the mitoribosome (m.3032–3071), copper binding sites in MT-CO1 (p.240, p.290–291) and MT-CO2 (p.196, p.200, p.204), and residues in the MT-ND4 rotenone binding site (e.g. p.215). (b) The proportion of bases in each locus type in each MLC score quartile. (c) The odds ratio (OR) enrichment of pathogenic (n=205) vs benign (n=884) variants within each MLC score quartile, for RNA base and amino acid substitutions only. The height of the bar represents the OR, and the error bars represent the 95% confidence interval.
Variants at positions with high MLC scores are predicted to be more deleterious. Accordingly, pathogenic variants were 7.5 times more likely to be within the highest score quartile than benign (95% CI=4.96–11.5, Fig. 4c); they were also enriched in scores between 0.50–0.75 (OR=1.8, 95% CI=1.2–2.5) and depleted from the lowest quartile (OR=0.21, 95% CI=0.14–0.32). Although depleted, some pathogenic variants were in the lowest quartile (Extended Data Fig. 9a–b). Since this score measures constraint around each position, pathogenic variants can have a low score if their neighboring positions are tolerant of variation. Thus a low score decreases the likelihood of pathogenicity but does not preclude it (Supplementary Discussion). While pathogenic variants that cause disease at heteroplasmy had a higher score distribution, those that cause disease at homoplasmy also fell in highly constrained sites (Extended Data Fig. 9c–d). Sites with high scores were intolerant of indels, supporting that their variation is more likely to impair function (Extended Data Fig. 9e); they were also more likely to be conserved across vertebrates and in chimpanzees specifically (Extended Data Fig. 9f–g). We assigned scores to all possible SNVs (Methods) and showed variants with higher scores were more likely to be seen only as heteroplasmies or ultra-rare homoplasmies in population databases5,28,29 (Extended Data Fig. 9h–j), in line with increased pathogenicity5,12. Overall, these data support MLC as a predictor of deleterious impact, across every locus type in the mtDNA.
We then used the MLC score to assess the relationship between mtDNA and human phenotypes in ~200,000 individuals with genome sequencing in the UK Biobank. We generated an MLC score sum (MSS) for each participant from all of their heteroplasmies to assess their functional impact on phenotype41; specifically on blood cell count for neutrophils and platelets. When both heteroplasmy count and MSS were included in a linear regression model adjusted for age, sex, and smoking status, neutrophil count was only significantly associated with heteroplasmy count (p-value 3.4e-09), while platelet count was only significantly associated with MSS (p-value 3.0e-04, Extended Data Table 1). These outcomes remained significant when including adjustments for haplogroup, population structure, and restricting to participants self-identified as white (p-values 8.3e-08 and 6.1e-04 respectively). This supports different roles for mitochondria in neutrophils and platelets, consistent with the lack of nuclear DNA in platelets and the diminished role of mitochondria in neutrophils (Supplementary Discussion). Significant associations between MSS and various other phenotypes were also identified, as described in an accompanying manuscript41. Lastly, population frequencies and PhyloP scores of heteroplasmies in UK Biobank had a low correlation with MLC (Extended Data Fig. 10), supporting that constraint provides additional insight over traditional approaches. Overall, these data support the utility of the MLC score and show how these metrics can provide insight into the role of mtDNA in human phenotypes.
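The two per-participant predictors used above, heteroplasmy count and MSS, can be derived as in the following minimal sketch. The variant identifiers, the score values, and the function name are hypothetical, and the regression step itself is not shown.

```python
def heteroplasmy_summary(heteroplasmies, mlc_scores):
    """Return (heteroplasmy count, MLC score sum) for one participant.

    heteroplasmies: identifiers of the participant's heteroplasmic variants
    mlc_scores: mapping from variant identifier to its MLC score (0-1)
    """
    count = len(heteroplasmies)
    score_sum = sum(mlc_scores[variant] for variant in heteroplasmies)
    return count, score_sum


# Hypothetical scores and participants, for illustration only
mlc_scores = {"var_A": 0.95, "var_B": 0.0, "var_C": 0.40}
participants = {
    "participant_1": ["var_A", "var_C"],
    "participant_2": ["var_B", "var_C"],
    "participant_3": [],
}
summaries = {
    pid: heteroplasmy_summary(variants, mlc_scores)
    for pid, variants in participants.items()
}
```

Here the first two participants have the same heteroplasmy count but different MSS values, which is why the two quantities can carry distinct information when both are included as covariates.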
Assessing highly constrained sites in vitro
To further validate the constraint model, we used mitochondrial DddA-derived cytosine base editing (DdCBE)42 to assess mutations at highly constrained sites without a disease association; m.3047G>A and m.3075G>A in MT-RNR2. This gene is typically overlooked in disease analyses; furthermore there are no in silico predictors for these sites (Supplementary Discussion). These variants are within regional constraint and have MLC score >0.90. A synonymous mutation, a non-constrained class, was also assessed (MT-ND2 m.5147G>A with MLC score 0). We engineered the DdCBEs to enable enrichment of cells with both DdCBE halves by drug selection (Methods). On-target editing up to >95% was achieved in HEK293T cells for each target, with range 56–98% and mean 87% across replicates (Fig. 5a). Analysis supported minimal impact of bystander edits, which included a synonymous bystander nearby to ND2–5147 with high editing, and that RNR2–3047 had a low off-target rate while RNR2–3075 had a higher rate (Supplementary Information).
Figure 5: Functional assessment of mutations in highly constrained sites using base editing.

(a) Sanger sequencing chromatograms showing on-target editing in HEK293T cells transfected with DdCBE base editor or dead DdCBE control. Chromatograms displaying the wider editing window are provided in the Supplementary Information. (b) Growth curves in transfected cells cultured in glucose (‘GLU’, top panel) or galactose (‘GAL’, bottom panel, which requires mitochondrial energy generation). Values and error bars reflect the mean ± s.e.m. of n=4 (ND2–5147, RNR2–3047) or n=5 (RNR2–3075) biological replicates. (c) Normalized oxygen consumption rate (OCR) in transfected cells. (d) Relative values of mitochondrial respiration parameters. Values for each parameter are normalized OCR relative to the mean of all dead DdCBE controls. The dashed line represents the mean of all relative values for dead DdCBE controls (i.e. 1.0). Values and error bars in (c-d) reflect the mean ± s.e.m. of n=3 (ND2–5147, RNR2–3047) or n=4 (RNR2–3075) independent biological replicates. Values for each replicate are shown as dots. Base editing efficiencies across replicates in (b-d) were 58–96% for ND2–5147, 82–98% for RNR2–3047, and 80–96% for RNR2–3075. **P < 0.01, *P < 0.05 by unpaired two-tailed t-test. All p-values <0.05 in order of display are: 0.0122956, 0.0037154, 0.0195705, 0.0023793, 0.0052251, 0.011332, 0.0446153, and 0.0148744.
Functional studies showed that the cells with rRNA mutations at highly constrained sites (RNR2–3047 and RNR2–3075) had a severe growth impairment in galactose medium, a condition in which cells with mitochondrial energy generation defects die, compared to the synonymous mutation (ND2–5147) and dead DdCBE controls (Fig. 5b). Slower growth was also observed in glucose medium. Accordingly, mitochondrial respiration was significantly reduced in cells with the constrained mutations compared to dead DdCBE, confirming their deleterious impact (Fig. 5c–d). These data demonstrate the value of our constraint model for identifying variants that severely impair mitochondrial function, including in regions overlooked in disease analyses.
Discussion
Here, we advance efforts to map constraint across the human genome by expanding constraint models to include the mtDNA. Application of our model to gnomAD revealed strong depletion of expected variation, supporting that many deleterious variants remain uncharacterized. Our constraint metrics provide tools to help address this gap, by identifying which sites across the mtDNA are most likely to harbor pathogenic variation. We validated the utility of these metrics using disease-associated variants and functional annotations, and provided examples of how they can be used to investigate the role of mtDNA in rare and common phenotypes. This constraint model primarily captures negative selection against variants with functional impact at heteroplasmy; a criterion most reported pathogenic variants meet29. Future iterations that can assess weaker selection against variants seen at homoplasmy in the population are warranted.
There are important considerations when using gnomAD to assess mitochondrial constraint. Most gnomAD samples are from blood5, a dividing cell type that can exhibit lower heteroplasmy of pathogenic variants than other (post-mitotic) tissues due to selection across cell divisions43. The depletion of expected variation we observed may therefore be greater than in other tissues, although selection in the germline and embryos is stronger than in somatic cells44,45. Our data also provide a population-level quantification of constraint that does not capture any differences in selection that could occur on different genetic backgrounds. This is relevant given reports of mitochondrial haplogroups modifying penetrance of disease-associated variants12,46, and of nuclear genetic background impacting heteroplasmy transmission25. Larger datasets will be needed to provide the power to analyze haplogroup and population differences in constraint.
The increasing availability of genome sequencing data heralds an exciting era of mtDNA research. In this spirit, our constraint model aims to empower research into the role of mtDNA variation in health and disease.
Methods
Mitochondrial mutational model
Mitochondrial composite likelihood model:
We adapted a composite likelihood model described by Dietlein et al27, and applied it to de novo mutations to quantify mutability in trinucleotide contexts in the human mtDNA. The composite model decomposes the mutational likelihood of each trinucleotide context into multiplicative factors, namely the effects of the reference nucleotide, mutation class, and flanking nucleotides, making it optimal for the possible sparsity of mutation counts per context in the smaller mtDNA. Since this model was developed for quantifying mutability in the nuclear exome27, we adapted it for application to the mtDNA; a detailed description is in the Supplementary Information, and summarized here.
We classified the 12 base substitution types and their reference nucleotides, and categorized them into three mutational classes: transversions type I (class I), transversions type II (class II) and transitions (class III). We counted the de novo mutations of each base substitution type to produce a count vector. The likelihood ratio of each reference nucleotide was calculated as
where the probability of the reference nucleotide is its frequency in the reference sequence. The likelihood ratio of each mutation class was calculated as
where the probability of the mutation class was derived from the relative frequency of nucleotides in the reference sequence. The likelihood ratio of each base substitution type was calculated as
and the likelihood ratios for sequence context as
For transitions (class III), the counts are of each flanking nucleotide at each position around base substitutions of each type, and the frequencies are of each flanking nucleotide at each position around the reference nucleotide in the reference sequence. While there is replicative strand bias for transitions (class III), this has not been established for transversions26,47. Therefore, for transversions (classes I and II), the base substitution types were first classified into their pyrimidine type due to their lower counts; the counts are then of each flanking nucleotide at each position around the pyrimidine base substitution type, and the frequencies are of each flanking nucleotide at each position around the pyrimidine reference nucleotide. We then computed the mutability of each mutation class at every base in the reference sequence, excluding the non-coding OriB-OriH region, from its reference nucleotide and flanking nucleotides as
This produced a composite mutation likelihood for every possible base substitution in the mtDNA. Mutability in the non-coding OriB-OriH region was quantified separately to handle its inverted signature for transitions26 (Supplementary Methods).
De novo dataset:
We quantified mutability using mitochondrial de novo mutations from the literature and an in-house dataset. Published de novo mutation datasets were identified from a search of the literature25,45,48–51; these included germline as well as somatic de novo mutations which have a highly similar mutational signature in the mtDNA26,48. An overview of the various de novo mutation datasets used is provided in Supplementary Table 2. The in-house dataset of de novo mutations was identified from 1690 mother-child pairs with unaffected status in the SPARK cohort52, using genome sequencing data obtained from SFARI Base (application #12267.2). The GATK Mutect2 pipeline was used to call variants in the SPARK pairs, and variants at >1% heteroplasmy level that were present in the child but not in the mother after stringent filtering were regarded as de novo, akin to criteria used by others25. Checks were performed to confirm that de novo mutations across sources had highly similar mutation characteristics, and one outlier group was excluded (see Supplementary Methods). A final dataset of 4216 de novo mutation counts was used, extracted from each source using custom scripts run in Python v3.10. The trinucleotide context of all de novos was assumed to be the same as the mtDNA reference sequence. Additional details on the de novos and their curation are in the Supplementary Information.
Validation:
The predictive value of the mutational model was validated by measuring the correlation between the mutation likelihoods and observed level of neutral variation in gnomAD. Haplogroup variants from Phylotree Build 17 supplemented with variants at non-conserved sites in the lowest decile of PhyloP were used to ascertain a list of neutral variants (Supplementary Dataset 8). Custom scripts implemented in Python v3.10 were used to sum the observed maximum heteroplasmy of neutral variants in gnomAD and their mutation likelihood scores across each locus. Linear regression models fitting mutation likelihoods and observed neutral variation, and their Pearson correlation coefficients and p-values, were calculated using R v3.6.1 or v4.4.0. The highly mutable G>A and T>C variants were fit separately, akin to how CpG transitions are handled separately in nuclear models1, as was the non-coding OriB-OriH region. Additional details on this method are in the Supplementary Information.
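To illustrate the validation step, the following is a minimal sketch of an ordinary least-squares fit and Pearson correlation between per-locus mutation likelihood sums and observed neutral variation. The analysis described above was run in R; this Python version is only an illustration, the function name `fit_line` is hypothetical, and the input values are invented placeholders rather than gnomAD data.

```python
import math


def fit_line(x, y):
    """Least-squares fit y = slope * x + intercept, plus Pearson r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Placeholder per-locus sums (hypothetical, not real data):
# x = summed mutation likelihoods, y = summed maximum heteroplasmy
# of neutral variants in the same loci.
likelihood_sums = [1.0, 2.0, 3.0, 4.0]
neutral_observed_sums = [2.1, 3.9, 6.2, 7.8]

slope, intercept, r = fit_line(likelihood_sums, neutral_observed_sums)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

In practice a separate fit would be needed for each group that the text treats separately (G>A and T>C variants, and the OriB-OriH region).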
Assessment of mitochondrial genome constraint
We measured mtDNA constraint in gnomAD as a ratio of observed to expected variation. Specifically, we calculated the observed and expected sum maximum heteroplasmy of variation. Custom scripts run in Python v3.10 were used to calculate ratios and their CIs. Detailed descriptions of these methods are in the Supplementary Information, and they are summarized below. We also note that the nuclear genome constraint model could not be applied to the mtDNA, due to its unique features, which necessitated the development of an approach tailor-made to handle the mtDNA; key reasons for this are discussed in the Supplement.
Maximum heteroplasmy metric:
This value represents the maximum heteroplasmy (fraction of alternate to total reads, range 0.0–1.0) that a variant is observed at across all individuals in gnomAD, identified from genome sequencing data5. Every possible variant is assigned a maximum heteroplasmy value between 0.0 and 1.0; i.e. including those seen at homoplasmy (value >0.95), heteroplasmy only (between 0.1–0.95), or not observed (0.0). This is a population-level metric that does not incorporate the haplogroup background of carriers nor any linked variants.
Observed calculation:
The observed value for a variant class and/or locus in gnomAD v3.1 (ref. 5) was determined by summing the maximum heteroplasmy value of every possible SNV in the group (e.g. all missense within a protein gene).
Expected calculation:
The expected value under neutrality was determined by summing the mutation likelihoods of every possible SNV in the group, and applying the linear model formulas fit on the mutation likelihoods and observed sum maximum heteroplasmy of neutral variation in each locus in gnomAD (described above). This enabled the expected sum maximum heteroplasmy value to be predicted from the sum mutational likelihood.
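The observed and expected calculations can be sketched as follows. This is a simplified illustration under stated assumptions: a single linear model with a slope and intercept is applied to the group, whereas the text describes separate fits for some variant types and regions. The function names and all numeric values are hypothetical.

```python
def observed_sum(max_heteroplasmies):
    """Observed value: sum of maximum heteroplasmy over every possible SNV."""
    return sum(max_heteroplasmies)


def expected_sum(mutation_likelihoods, slope, intercept):
    """Expected value: linear model applied to the summed mutation likelihood.

    slope and intercept are assumed to come from a fit on neutral variation.
    """
    return slope * sum(mutation_likelihoods) + intercept


# Hypothetical group of four possible SNVs: two unobserved (0.0),
# one heteroplasmic only (0.25), one homoplasmic (1.0).
max_hets = [0.0, 0.0, 0.25, 1.0]
likelihoods = [0.5, 1.0, 1.5, 2.0]

obs = observed_sum(max_hets)
exp = expected_sum(likelihoods, slope=0.5, intercept=0.0)
print(obs, exp, obs / exp)
```

A ratio below 1 indicates less variation than expected under neutrality for the group.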
Confidence interval and OEUF:
We also calculated the 90% confidence interval (CI) around each observed:expected ratio using a beta distribution, adapting the method used for nuclear models1. The construction of the CI involved computing the density of the beta distribution for a given observed value across a range of expected values (drawn by a varying parameter). Since confidence in the ratio can vary depending on sample size (i.e. the expected value, and proportion of possible variation captured), the observed:expected ratio upper bound fraction of the confidence interval (OEUF) is useful to capture the uncertainty around ratio estimates. The OEUF was used as a conservative measure of constraint.
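One way such an interval could be constructed is sketched below. The exact parameterization of the beta distribution is given only in the Supplementary Information, so the shape parameters used here are an assumption made for illustration, as are the function names, the `possible` argument (the largest attainable observed sum, i.e. the number of possible SNVs in the group), and the scan range. The sketch evaluates the beta density of the observed fraction across a grid of candidate observed:expected ratios, accumulates it, and reads off the 5% and 95% points; the upper bound plays the role of the OEUF.

```python
import math


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def oe_confidence_interval(observed, expected, possible, max_ratio=2.5, step=0.001):
    """Approximate 90% CI for observed/expected by scanning candidate ratios."""
    if not 0 < observed < possible:
        raise ValueError("this sketch requires 0 < observed < possible")
    frac_observed = observed / possible
    ratios, cumulative, running = [], [], 0.0
    for i in range(1, int(round(max_ratio / step)) + 1):
        ratio = i * step
        implied_expected = expected * ratio  # expected value scaled by the candidate ratio
        if implied_expected >= possible:
            break
        # Assumed shape parameters; not confirmed by the source text.
        running += beta_pdf(frac_observed, implied_expected + 1,
                            possible - implied_expected + 1)
        ratios.append(ratio)
        cumulative.append(running)
    lower = next(r for r, c in zip(ratios, cumulative) if c >= 0.05 * running)
    upper = next(r for r, c in zip(ratios, cumulative) if c >= 0.95 * running)
    return lower, upper


lower, oeuf = oe_confidence_interval(observed=50, expected=100, possible=300)
print(f"ratio=0.50, 90% CI lower={lower:.3f}, OEUF={oeuf:.3f}")
```

Because the interval widens as the expected value shrinks, ranking by the upper bound penalizes small groups whose low ratio could be due to chance.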
Simulation of germline mtDNA mutation and heteroplasmy
To validate a correlation between mitochondrial mutation rates and population maximum heteroplasmy, we adapted a computational model by Colnaghi et al to simulate mutation and heteroplasmy drift for neutral mutations in the human female germline53. Heteroplasmy levels were tracked across 10,000 maternal lineages for five generations for 10 mutation rates (between $10^{-9}$ and $10^{-7}$ per base pair), using model parameters drawn from human data by Colnaghi et al53. The maximum heteroplasmy distribution for each mutation rate in the simulated population was evaluated through bootstrapping. This simulation was implemented using a custom R script (run in v3.6.1 or v4.4.0); a detailed description of this method is in the Supplementary Information.
Gene, non-coding, and variant in silico annotations
A ‘synthetic’ VCF with every possible SNV in the human mtDNA reference sequence NC_012920.1, and their Ensembl Variant Effect Predictor gene and consequence annotations, generated as described5, was used for computing observed and expected variation. Note that each human mtDNA gene has only one transcript, thus distinction between canonical and non-canonical transcripts was not required. Non-coding elements in the human mtDNA and their coordinates were downloaded from MITOMAP29; elements with expected value <10 were excluded from analyses. The coordinates for the non-coding ‘OriB-OriH’ region are per Ju et al26. Haplogroup variants were extracted from PhyloTree Build 17 (ref. 54) as previously described55. The phyloP conservation scores derived from 100 vertebrate genomes, and APOGEE, HmtVar and MitoTIP in silico predictions, were retrieved as described5,55. Bases conserved in chimpanzee were identified from sequence alignment of the human and chimpanzee (CM054459.1) reference sequences using msa R package v1.36 (ref. 56) and annotated using custom Python scripts.
Protein and RNA annotations
Functional sites within the proteins were gleaned from UniProt and the literature as follows. UniProt Knowledgebase annotations were downloaded from the UniProt FTP site (https://www.uniprot.org/downloads), and binding site annotations for human mtDNA-encoded proteins were extracted from available bed files (date of access 16 November 2020)57. Residues involved in complex I proton transfer were curated based on ovine data reported by Kampjut and Sazanov39; the reported residue positions were manually confirmed as equivalent to human except for MT-ND6 which was shifted by one residue. For RNA genes, the base type annotation (e.g. base pair in stem) was determined using a custom script in Python v3.10 and manually curated secondary structure data reported previously55. Modified bases in RNA genes and tRNA domain annotations were obtained as described55. The tRNA secondary structure position numbers and their corresponding mtDNA position were obtained from Sonney et al58. Codon usage of each tRNA was determined by counting the corresponding amino acid within the protein-coding sequence using custom scripts in Python v3.10; for Leucine and Serine the codon sequence was used to distinguish between the two tRNAs for these amino acids. Bases involved in rRNA:rRNA bridges connecting the two mitoribosomal subunits are per Amunts et al35.
Population databases
Data from the Genome Aggregation Database (gnomAD) v3.1 generated from whole genome sequences from 56,434 individuals were retrieved from the gnomAD browser (https://gnomad.broadinstitute.org/downloads)5. HelixMTdb data generated by proprietary exome+ assay of 195,983 individuals were downloaded from the Helix website (https://www.helix.com, version dated 03-27-2020)28; observed maximum heteroplasmy of homoplasmic variants was not reported and therefore assigned as 1.0. MITOMAP data generated from 56,910 GenBank sequences were downloaded from the MITOMAP website (https://www.mitomap.org/MITOMAP/resources, polymorphism table, download date 07-14-2022)29. Note that the MITOMAP database does not include heteroplasmy information. An overview of these population databases and their application within this manuscript is provided in Supplementary Table 3.
Disease-associated variation
Disease-associated mtDNA variants were obtained from ClinVar and MITOMAP databases. For ClinVar, all mtDNA SNVs were retrieved (download date 05-25-2022)59. Variants listed as associated only with cancer were excluded to focus on germline conditions, and those with conflicting interpretations were also excluded. For MITOMAP, all disease-associated variants were retrieved (disease table, download date 05-25-2022)29. A total of 2607 ClinVar variants and 882 MITOMAP variants were used. For Extended Data Fig. 2c, ClinVar Uncertain Significance and MITOMAP Reported variants were subset by whether they satisfied pathogenic (PP3 and PM2_supporting) or benign (BP4 and BS1) criteria for computational algorithms and population frequency in ACMG/AMP mtDNA variant classification guidelines12. Variants which met both criteria PP3 and PM2_supporting were regarded as ‘with pathogenic criteria’, and those satisfying both criteria BP4 and BS1 as ‘with benign criteria’. For Extended Data Fig. 9d, all variants with a confirmed status and plasmy status of ‘−/+’ in MITOMAP (associated with disease at heteroplasmy only) were reviewed. Any variants reported to be observed at homoplasmy in an individual in at least one publication were shown in the ‘at homoplasmy’ group in Extended Data Fig. 9d, as per Supplementary Dataset 9. Curated missense variants identified in cases suspected to have a mitochondrial disease (in Fig. 2d) were obtained from the Victorian Clinical Genetics Service. In brief, variants identified from clinical-grade targeted mtDNA sequencing were assessed using criteria adapted from the American College of Medical Genetics mitochondrial variant classification guidelines12, as described previously60. Variants were curated by medical genomics scientists, and variant classifications were reviewed by a multidisciplinary team. Curated missense variants are listed in Supplementary Dataset 3.
Replication dataset
HelixMTdb population data was obtained as described above28. Linear models fitting neutral variation observed in HelixMTdb and their mutational likelihoods across loci were used to calculate expected values, as above. Variants at bases m.300–316, m.513–525, and m.16182–16194 were not called in HelixMTdb and therefore were excluded from calculations. Ratios of observed:expected variation for each functional class of mtDNA variation in HelixMTdb and their CIs were calculated as above.
Regional constraint
A method was developed for assessment of mitochondrial regional constraint, which adapts methods described by Samocha et al3 and Davydov et al61 for nuclear constraint analyses. This analysis was implemented using custom scripts run in Python v3.10, as follows. For protein genes, the missense observed:expected ratio of all possible regions ≥30 bp within each gene was calculated, and a beta distribution used to compute the probability of the observed:expected ratio of each region being ≤ the gene’s missense observed:expected ratio. Regions with a p-value <0.01 were retained, and a greedy algorithm was applied to discard any region overlapping another with a lower p-value; for overlapping regions with the same p-value the longest was retained. This produced a list of non-overlapping candidate regions significantly more missense constrained than the gene. The false discovery rate (FDR) of each candidate was then estimated by applying the same method to 1000 random permutations of each gene, calculated as the proportion of permutations that produced a false positive result of the same length and ≤ p-value as the candidate region. Areas of regional missense constraint with FDR <0.1 were regarded as high-confidence and used for all analyses. Regional constraint in the rRNA genes was evaluated using the same process with minor modifications. All high-confidence regions are provided in Supplementary Dataset 2. A detailed description of this method is in the Supplementary Information.
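The greedy selection of non-overlapping candidate regions can be sketched as below. This is one possible reading of the step described above, not the authors' implementation: significant regions are visited from lowest to highest p-value (longest first on ties), and a region is kept only if it does not overlap a region already kept. The `Region` type, function names, and example values are hypothetical.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: int      # first position in the region (inclusive)
    end: int        # last position in the region (inclusive)
    p_value: float  # probability from the beta distribution test


def overlaps(a: Region, b: Region) -> bool:
    """True if the two inclusive intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def select_regions(candidates: List[Region], alpha: float = 0.01) -> List[Region]:
    """Keep significant regions greedily, preferring lower p-values, then length."""
    significant = [r for r in candidates if r.p_value < alpha]
    significant.sort(key=lambda r: (r.p_value, -(r.end - r.start + 1)))
    kept: List[Region] = []
    for region in significant:
        if not any(overlaps(region, k) for k in kept):
            kept.append(region)
    return sorted(kept, key=lambda r: r.start)


candidates = [
    Region(1, 40, 0.001),
    Region(30, 80, 0.005),    # overlaps the first region, higher p-value
    Region(90, 130, 0.001),
    Region(90, 150, 0.001),   # same p-value as the previous region, but longer
    Region(200, 240, 0.02),   # not significant at alpha = 0.01
]
print(select_regions(candidates))
```

The permutation-based FDR step would then rerun the same procedure on shuffled versions of each gene and count how often a region of the same length and an equal or lower p-value arises.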
The distance between residues and bases in 3D protein and rRNA structures was calculated using custom scripts implementing the Bio.PDB Biopython module62 in Python v3.10, to identify those in close proximity to regional constraint. The electron microscopy structures of human complex I (PDB:5XTD)63, complex III (PDB:5XTE)63, complex IV (PDB:5Z62)64, and the mitochondrial ribosome (PDB:6ZSE)65 from Protein Data Bank (PDB) were used. For the human complex V subunit MT-ATP6, the 3D structure predicted by AlphaFold obtained from UniProt was used (AF-P00846-F1)66. A protein residue was regarded to be in close proximity to regional constraint when the minimum distance between its alpha carbon atom and the alpha carbon of a residue in regional constraint was <6 Ångstrom, a threshold commonly used to define contacting residues67. A rRNA base was regarded to be in close proximity to regional constraint when the minimum distance between its nitrogen atom involved in base pairing and the equivalent nitrogen of a base in regional constraint was <6 Ångstrom, or if its phosphate atom and the phosphate of a base in regional constraint was <6 Ångstrom. Nitrogen atoms at position N1 were used for purines and position N3 for pyrimidines to capture base pair interactions, and phosphate atoms to capture flanking bases.
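The proximity rule for protein residues can be illustrated with plain coordinates. The analysis above used the Bio.PDB module to read atoms from structure files; to stay self-contained, this sketch instead assumes the alpha-carbon coordinates have already been extracted into a dictionary. The function name, the dictionary layout, and the coordinates are hypothetical.

```python
import math


def residues_near_constraint(ca_coords, constrained, threshold=6.0):
    """Return residues outside regional constraint whose alpha carbon lies
    within `threshold` Angstrom of the alpha carbon of a constrained residue.

    ca_coords: dict mapping residue number -> (x, y, z) in Angstrom.
    constrained: set of residue numbers inside regional constraint.
    """
    near = set()
    for residue, xyz in ca_coords.items():
        if residue in constrained:
            continue
        if any(math.dist(xyz, ca_coords[c]) < threshold for c in constrained):
            near.add(residue)
    return near


# Toy coordinates along one axis (not taken from any PDB entry).
coords = {1: (0.0, 0.0, 0.0), 2: (3.8, 0.0, 0.0), 3: (7.6, 0.0, 0.0), 4: (20.0, 0.0, 0.0)}
print(residues_near_constraint(coords, constrained={1}))
```

The same pattern would apply to rRNA bases, with the base-pairing nitrogen or phosphate coordinates in place of alpha carbons.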
mtDNA local constraint score
The mtDNA local constraint (MLC) score was developed to measure local intolerance to base or amino acid substitutions at and around every position in the human mtDNA, and is not an adaptation of a nuclear genome constraint method. This score was calculated using an overlapping sliding window method, implemented with custom scripts run in Python v3.10. Starting from position m.1, a window of fixed length was drawn, and the observed:expected ratio of all substitutions within the window and its 90% CI were calculated. The window start position was moved by 1 bp, and the process repeated until all possible windows of that length in the mtDNA were evaluated. A window length of 30 bp was used to enable all windows to have expected value >10. For positions in protein genes only amino acid substitutions (missense) were included in calculations, while all base substitutions were included for positions in RNA genes and non-coding regions. For each mtDNA position, the mean observed:expected ratio CI upper bound fraction (OEUF) of all overlapping windows was computed and percentile ranked to achieve a score between 0.0 and 1.0 (ranging from least to most constrained). The score for every base position is provided in Supplementary Dataset 6. An MLC score was then assigned for every possible SNV as follows: non-coding, RNA and missense variants were assigned their positional score, and non-missense variants in protein genes were assigned scores based on the variant class OEUF, with synonymous, stop gain, and start or stop lost assigned scores of 0.0, 1.0, and 0.70 respectively. A higher score is predicted to be more deleterious; scores for every SNV are in Supplementary Dataset 7. A detailed description of this method is in the Supplementary Information.
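The positional averaging and percentile ranking can be sketched as follows. Several details are assumptions made for illustration: the sequence is treated as linear (no windows wrapping around the circular genome), per-window OEUF values are taken as already computed, and ties in the ranking share a score. The function name and the toy values are hypothetical.

```python
from bisect import bisect_right


def local_constraint_scores(window_oeuf, window_length):
    """Convert per-window OEUF values into per-position scores in [0, 1].

    window_oeuf[i] is the OEUF of the window starting at 0-based position i
    and covering `window_length` consecutive positions. Assumes the sequence
    has at least two positions.
    """
    genome_length = len(window_oeuf) + window_length - 1
    totals = [0.0] * genome_length
    counts = [0] * genome_length
    for start, oeuf in enumerate(window_oeuf):
        for pos in range(start, start + window_length):
            totals[pos] += oeuf
            counts[pos] += 1
    mean_oeuf = [t / c for t, c in zip(totals, counts)]

    # Rank so that the lowest mean OEUF (most constrained) scores highest:
    # score = fraction of other positions with a strictly higher mean OEUF.
    ordered = sorted(mean_oeuf)
    n = len(ordered)
    return [(n - bisect_right(ordered, value)) / (n - 1) for value in mean_oeuf]


# Toy example: 5 positions, window length 2, four windows.
print(local_constraint_scores([1.0, 0.5, 0.5, 1.0], window_length=2))
```

Averaging over all overlapping windows smooths the score, so a position inherits constraint from its neighbors as well as from its own substitutions.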
Odds ratio enrichment analysis
The odds ratio (OR) assessing pathogenic versus benign variation across categories of regional constraint and MLC score quartiles was calculated as $\mathrm{OR} = (a/b)/(c/d) = ad/bc$, as previously described2, where $a$ is the number of pathogenic variants in the category/quartile, $b$ is the number of benign variants in the category/quartile, $c$ is the number of pathogenic variants not in the category/quartile, and $d$ is the number of benign variants not in the category/quartile. The standard error was calculated as $\mathrm{SE} = \sqrt{1/a + 1/b + 1/c + 1/d}$. A 95% confidence interval for each OR was calculated from the SE, as $\exp(\ln(\mathrm{OR}) \pm 1.96 \times \mathrm{SE})$. Pathogenic variants included those with a pathogenic or likely pathogenic classification in ClinVar or a confirmed disease association in MITOMAP. Benign variants were those with a benign classification in ClinVar.
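As a worked illustration of this enrichment calculation, the sketch below uses the standard odds ratio, the standard error of its logarithm, and a normal-approximation 95% interval. The names a, b, c and d are chosen here for the four counts (pathogenic in, benign in, pathogenic not in, and benign not in the category), and the example counts are invented.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with standard error of ln(OR) and a 95% confidence interval.

    a: pathogenic variants in the category    b: benign variants in the category
    c: pathogenic variants not in it          d: benign variants not in it
    All four counts must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, se, (lower, upper)


# Invented counts for one quartile.
odds_ratio, se, (lower, upper) = odds_ratio_ci(a=20, b=10, c=10, d=40)
print(f"OR={odds_ratio:.2f}, SE={se:.3f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 indicates enrichment (or depletion) of pathogenic relative to benign variants in that category.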
Visualization of protein and RNA 3D structures
Protein and rRNA 3D structures from Protein Data Bank (PDB) were visualized using UCSF ChimeraX68 v1.3. The electron microscopy structures of human complex I (PDB:5XTD)63, complex III (PDB:5XTE)63, complex IV (PDB:5Z62)64, and the mitochondrial ribosome including A/P-site and P/E-site tRNAs (PDB:6ZSE)65 were used. The ovine complex I electron microscopy structure (PDB:6ZKM)39 was also used to show the homologous MT-ND4 region binding rotenone in Fig. 4a. Figures and videos displaying constraint data on 3D structures were generated using custom ChimeraX command files.
Evaluation of heteroplasmy and blood cell counts in UK Biobank
Mitochondrial heteroplasmy was identified from whole genome sequencing (WGS) data from the UK Biobank, a large population study of people from the United Kingdom aged 40–69 years69, as described previously41. In brief, MitoHPC70 (version 20230418) was used to call heteroplasmic SNVs with a heteroplasmy level of >5% and <95%, and variant allele fraction (VAF) is reported with respect to the human mtDNA reference sequence (NC_012920.1). Variants were filtered using the following criteria: at poly-C homopolymer regions, read depth <300, and/or with base quality, strandedness, slippage, weak evidence, germline, or position flags. Samples were excluded using the following criteria: mitochondrial contamination level >3%, two or more variants from a different mitochondrial haplogroup, multiple variants predicted as nuclear-encoded mitochondrial sequences, low coverage, mtDNA copy number ≤40, and/or heteroplasmy count above five. Samples with cell count outliers more than three standard deviations from the mean were also excluded. 193,115 of 200,000 samples with WGS data were retained for analysis. The association between heteroplasmy metrics (count and MLC score sum) and cell counts (platelets and neutrophils) was determined using a linear regression model adjusting for age (natural spline, 4 degrees of freedom), sex, and smoking status (“current”, “former”, or “never smoker”). Raw neutrophil count was transformed using log(neutrophil count + 1) to more closely approximate a normal distribution. Regression models were run with each metric separately (‘Single’) or together in the same model (‘Combined’) in R v4.0.4. P-values were computed using a two-sided test.
Nuclear gene annotations
Nuclear missense constraint metrics were obtained from Karczewski et al1; non-canonical transcripts and genes with constraint flags were excluded. Cell essential genes and genes required for development or viability were obtained from the International Mouse Phenotyping Consortium, using FUSIL bin codes “CL”, “DL”, and “SV”; genes with a low orthologue category or a FUSIL outlier flag were excluded71. Genes associated with developmental disorders were obtained from DECIPHER; genes with limited confidence were excluded72.
Mitochondrial base editing
A mitochondrially-targeted DddA-derived cytosine base editor (DdCBE) dual plasmid system42 was used. The left (Addgene #179682) and right (Addgene #179686) DdCBE plasmids used for editing, and the left dead DdCBE plasmid (Addgene #179683) used with the right as a control, were all gifts from David Liu. We replaced the BSD gene in the right plasmid with PuroR to enable dual drug selection. Transcription activator-like effector (TALE) arrays targeting m.3047, m.3075 or m.5147 were synthesized by IDT or Invitrogen and incorporated into the plasmids; TALE sequences are in the Supplementary Information. Final plasmids were checked with long-read sequencing by Plasmidsaurus.
A step-by-step protocol for the transfection of DdCBE has been deposited in the protocols.io repository73. In brief, base editing was done in HEK293T cells (ATCC CRL-3216) that were maintained in DMEM (Gibco) with 10% FBS and 1x Antibiotic-Antimycotic (Gibco) at 37 °C with 5% CO2. The cell line tested negative for mycoplasma and was authenticated as being hypotriploid. $1.5 \times 10^{5}$ cells were plated per well of a 12-well plate one day prior, and 2 μg of each plasmid was transfected using Lipofectamine 3000 (Invitrogen) per manufacturer’s protocol, after which selection with up to 5 μg/mL blasticidin and 1 μg/mL puromycin (Gibco) was used to enrich cells with both plasmids. Selection was typically continued for 10–14 days, during which the medium was replaced every 2 days. Cells were maintained in standard media after selection (without blasticidin or puromycin) for at least two days, and DNA was extracted. On-target editing was quantified from Sanger sequencing of targeted PCR amplicons using EditR74, and a subset were validated using next generation sequencing of PCR amplicons by Genewiz. We observed a range of editing efficiencies between 56–98% with a mean of 87%. Off-target editing rate was assessed through mtDNA sequencing and calculated as previously described75. A detailed description of these methods and the primers used is in the Supplementary Information.
Cell growth and oxygen consumption measurements
Cell confluency was measured in HEK293T cells grown in glucose-containing (4.5 g/L glucose, 110 mg/L sodium pyruvate, 10% FBS, 1x Antibiotic-Antimycotic) or galactose-containing (4.5 g/L galactose, 10% FBS, 1x Antibiotic-Antimycotic) DMEM medium using an Incucyte S3 live-cell imaging system (Essen BioScience) that captured 9–16 images per well every 3 h. For oxygen consumption measurement, Seahorse Cell Culture Microplates (Agilent) were coated using 50 μg/mL Poly-D-Lysine (Gibco), and $1.2$–$1.6 \times 10^{4}$ HEK293T cells per well were plated 16 h prior. Seahorse XFe96/XF Pro Extracellular Flux Assay and Cell Mito Stress Test Kits (Agilent) were used per manufacturer’s instructions, with 1.5 μM oligomycin, 1 μM FCCP and 0.5 μM Rotenone/Antimycin A in Seahorse XF DMEM medium (Agilent). Oxygen consumption was measured using a Seahorse XF Analyzer, and values were normalized to cell confluency measured prior. Three or five technical replicates were used in each cell growth and Seahorse biological replicate, respectively.
Statistical analysis
The statistical tests used in this study are described in detail in the relevant Methods and/or Supplementary Methods sections.
Extended Data
Extended Data Figure 1: Schematic overview of mitochondrial genome constraint.

(a) We established a constraint model for the mtDNA to quantify the removal of deleterious variants from the population by negative selection. We assessed constraint by identifying genes and regions where the observed variation is less than expected, under neutrality. Observed is calculated using maximum heteroplasmy in gnomAD, and specifically by summing the maximum heteroplasmy value of every variant in a gene or region. Expected is calculated using a mutational model, and specifically by summing the mutational likelihoods of every variant in a gene or region and applying linear models fit on neutral variation (ascertained using Phylotree and PhyloP). The ratio of observed:expected variation and its 90% confidence interval is calculated, and the OEUF is used as a conservative measure of constraint. (b) A suite of constraint metrics are available via Supplementary Datasets, including constraint metrics for each gene and non-coding element, as well as regional missense constraint for each protein gene, regional constraint for each rRNA gene, position constraint for tRNA genes, and local constraint for every position in the mtDNA (MLC scores). (c) Constraint metrics can identify deleterious variants, and constrained sites are enriched in pathogenic variants from ClinVar and MITOMAP. Example applications include using regional constraint for variant classification and prioritization in individuals with rare disease and using the MLC score to assess associations between heteroplasmy burden and common phenotypes. Created with BioRender.com.
Extended Data Figure 2: Mutability, disease-associated variation, and constraint across classes of human mtDNA variation.

(a) Trinucleotide mutational signature of mtDNA mutations within the OriB-OriH region (m.16197–191) predicted by the composite likelihood model. Mutation likelihoods for the six pyrimidine base substitution types across 96 trinucleotides are shown, colored by whether the reference nucleotide is in the reference ‘light’ or reverse complement ‘heavy’ strand. (b) Proportion of total disease-associated variants in ClinVar (n=2607) and MITOMAP (n=882) by consequence. (c) The observed:expected ratio of ClinVar Uncertain Significance and MITOMAP Reported mtDNA variants in gnomAD, and subset by whether they met pathogenic or benign criteria for computational algorithms and MITOMAP population frequency in ACMG/AMP guidelines for mtDNA variant interpretation. 1000 ClinVar variants (with pathogenic evidence, n=147; note none satisfied both benign criteria) and 791 MITOMAP variants (with pathogenic evidence, n=175; with benign evidence, n=28) are included. (d) The observed:expected ratio of in silico predictions in gnomAD for missense variants by APOGEE (pathogenic, n=7276 and neutral, n=16,800), and for tRNA variants by MitoTIP (likely pathogenic, n=981; possibly pathogenic, n=1171, possibly benign, n=1162 and likely benign, n=1198) and HmtVar (pathogenic, n=202; likely pathogenic, n=6, likely polymorphic, n=4139 and polymorphic, n=24); all of which are recommended per ACMG/AMP mtDNA guidelines for variant interpretation. Note the outlier HmtVar ‘likely pathogenic’ group only includes six variants. (e) Assessment of functional classes of mtDNA variation in a replication dataset, HelixMTdb. The n per class is per Fig. 1d. The diamonds in (c-e) represent the observed:expected ratio, and the error bars in (c-e) represent 90% confidence intervals.
Extended Data Figure 3: Assessment of conservation and codon usage across genes.

Median phyloP base conservation scores for each protein (a) or RNA (b) gene, derived from 100 vertebrate genomes. Higher phyloP values represent increased conservation. (c) The count of codons within the protein-coding sequence corresponding to each tRNA. Pearson correlation coefficient (R) and its p-value (p) computed using a two-sided test is shown in (a-c). OEUF stands for observed:expected ratio confidence interval upper bound fraction, and OEUF values for (a-c) are provided in Supplementary Dataset 1.
Extended Data Figure 4: Areas of regional constraint within each protein and rRNA gene.

(a) Intervals of regional missense constraint identified in each protein are colored in red. For display purposes each protein is shown at the same length (i.e. are not scaled by their actual protein length), and amino acid residue numbering is shown. (b) Intervals of regional constraint identified in each rRNA are colored in red. The rRNA sequences are not scaled by their length, and mtDNA position coordinates are shown. Coordinates for (a-b) are provided in Supplementary Dataset 2.
Extended Data Figure 5: Characteristics of regional constraint.

(a-b) The proportion of bases encoding proteins (n=11,341) or residues of functional significance (n=141) (a) or benign (n=625) and pathogenic (n=79) missense variants (most severe consequence) (b) that are within, proximal to (<6 Ångstrom distance from), or outside regional missense constraint. (c) The proportion of bases encoding rRNA (n=2512), or modified bases and bases in rRNA:rRNA intersubunit bridges (n=63) that are within, proximal to (<6 Ångstrom distance from), or outside regional constraint. (d) The four areas of regional missense constraint in MT-CYB are shown in red, visualized in its dimeric 3D structure. Heme molecules involved in electron transfer are colored green. (e) The two areas of regional missense constraint in MT-ND6 are shown in red in the 3D structure. Residues colored in yellow are involved in the transition from the open to closed complex state in the π-bulge (p.61–63 and p.67) per Kampjut and Sazanov39. (f) An area of regional constraint within the MT-RNR1 tertiary structure, indicated in red. Modified bases are colored blue, and disease-associated bases (m.1494 and m.1555) purple. The mRNA molecule is colored green. (g) Areas of regional constraint within MT-RNR1 secondary structure, indicated by red font. The box highlights an area including regional constraint, modified bases (blue font) and disease-associated variants (at m.1494 and m.1555, bold purple font); also shown in tertiary structure in Extended Data Fig. 5f. (h) The areas of regional constraint within the MT-RNR2 secondary structure, indicated by red font; modified bases are in blue font.
Extended Data Figure 6: Characteristics of RNA variants and bases.

(a) The proportion of pathogenic (n=121) and benign (n=232) tRNA variants for each base type. (b) The observed:expected ratio for variants in modified and non-modified bases in tRNA (modified, n=411 and non-modified, n=4101) and rRNA (modified, n=30 and non-modified, n=7506). The diamonds represent the observed:expected ratio, and error bars represent the 90% confidence interval. (c) The generic tRNA secondary structure, with positions colored by domain. (d) The proportion of pathogenic (n=121) and benign (n=232) tRNA variants for each domain, following the color legend in (c).
Extended Data Figure 7: Measuring constraint across non-coding elements.

Top schematic shows annotated elements within the non-coding control region, which spans the artificial chromosome break (m.16569–1). The top row includes the three hypervariable sequences (HV1, HV2, HV3), the second row includes termination-associated sequences (TAS, TAS2), conserved sequence blocks (CSB1, CSB2, CSB3) and L-strand and H-strand promoters (LSP, HSP1), and the third row includes a control element (MT-5) and transcription factor binding sites (TFX, TFY, TFL, TFH). The observed:expected ratio 90% confidence interval upper fraction (OEUF) within each element is shown per the color gradient legend; darker colors represent lower OEUF. Values are provided in Supplementary Dataset 5. The bottom schematic shows the position of the control region and origin for replication of the light strand (OriL) within the mtDNA, with encoded loci colored by their type (non-coding in yellow, protein blue, rRNA purple and tRNA orange).
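As a worked illustration of the OEUF, the sketch below derives a 90% interval around an observed:expected ratio and takes its upper bound. It assumes a Poisson likelihood evaluated over a grid of candidate ratios, in the style used for nuclear constraint metrics; this is an assumption for illustration only, and the interval calculation used to produce the values in Supplementary Dataset 5 may differ. The function name, grid settings and input values are hypothetical.

```python
import math

def oe_confidence_interval(observed, expected, max_ratio=2.0, step=0.001):
    """Return (ratio, lower, upper) for a 90% interval on observed:expected.

    Assumes observed ~ Poisson(ratio * expected) and a flat prior on the
    ratio over (0, max_ratio]; both are illustrative assumptions.
    """
    grid = [i * step for i in range(1, int(max_ratio / step) + 1)]
    # Poisson log-likelihood of `observed` for each candidate ratio
    # (the term that does not depend on the ratio is dropped).
    log_like = [observed * math.log(r * expected) - r * expected for r in grid]
    peak = max(log_like)
    weights = [math.exp(value - peak) for value in log_like]
    total = sum(weights)

    cumulative, lower, upper = 0.0, None, None
    for r, w in zip(grid, weights):
        cumulative += w / total
        if lower is None and cumulative >= 0.05:
            lower = r  # 5th percentile of the normalized likelihood
        if upper is None and cumulative >= 0.95:
            upper = r  # 95th percentile, i.e. the OEUF
            break
    return observed / expected, lower, upper

# Hypothetical counts: 50 observed against 100 expected.
ratio, lower, oeuf = oe_confidence_interval(observed=50, expected=100)
```

A lower OEUF means that even the upper end of the interval sits well below 1, which is why darker colors in the schematic mark elements with stronger evidence of constraint.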
Extended Data Figure 8: mtDNA local constraint (MLC) scores and population allele frequencies across the non-coding control region.

(a) The MLC score of positions across the control region, calculated using gnomAD maximum heteroplasmy data, are shown; a schematic of annotated non-coding elements is displayed above. The five peaks from left to right overlap (1) a recently discovered second light strand promoter76, (2–3) regions of unknown function within the D-loop, (4) conserved sequence block 3, or (5) the light strand promoter. Base scores are provided in Supplementary Dataset 6. (b-c) The homoplasmic allele frequency (AF) of variants across the control region in gnomAD (b) or HelixMTdb (c). (d) The population allele frequency of variants across the control region in the MITOMAP database (which does not include heteroplasmy data). (b-d) are displayed with a square root transformed y-axis; note only SNVs are included.
Extended Data Figure 9: Relationship between the mtDNA local constraint (MLC) score and genomic annotations.

(a) The proportion of benign (n=884) and pathogenic (n=205) variants in each score quartile. (b) Density plot showing the score distribution of disease-associated variants; numbers per (a). (c) Density plot showing the score distribution of 184 pathogenic variants with disease plasmy status in MITOMAP, colored by association with disease at heteroplasmy only, or at homoplasmy. (d) Density plot showing the score distribution of 88 ‘confirmed’ pathogenic variants from MITOMAP, colored by whether reported in individuals at heteroplasmy only or at homoplasmy, per a manual literature review. Plots (a-d) include missense and RNA variants only, and for (c-d) ‘at homoplasmy’ includes variants observed at both homoplasmy and heteroplasmy. (e) Boxplot showing the score distribution for base positions where indels are observed in gnomAD (n=416), HelixMTdb (n=697), and MITOMAP (n=667) databases. (f) The distribution of phyloP base conservation scores for bases within each score quartile (0.0–0.25, n=4142; 0.25–0.50, n=4142; 0.50–0.75, n=4141; 0.75–1.0, n=4143); a dashed line is shown at score = 0. (g) The MLC score across every base position in the human mtDNA; bases that are conserved in chimpanzees are denoted by black pipe symbols, while those non-conserved and encoding base or amino acid substitutions are shown as white pipe symbols. (h-j) The MLC variant score distribution for SNVs across population frequency categories in gnomAD (homoplasmy AF ≥0.002%, n=7363; homoplasmy AF <0.002%, n=1846 and heteroplasmy only, n=1641) (h), HelixMTdb (homoplasmy AF ≥0.002%, n=8049; homoplasmy AF <0.002%, n=3442 and heteroplasmy only, n=2613) (i) and MITOMAP (AF ≥0.002%, n=8617 and AF <0.002%, n=10,343) (j) databases. Note that allele frequency <0.002% is recommended as evidence of pathogenicity in ACMG/AMP mtDNA guidelines12, and that heteroplasmy data is not available for MITOMAP.
For (e-f, h-j), boxplot elements include: center line, median; box limits, 25th and 75th percentiles; minima and maxima, 1.5x interquartile range; points, outliers.
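The per-quartile proportions in panel (a) amount to binning scores into four equal-width intervals on the 0–1 scale and normalizing the counts. A minimal sketch follows, assuming each bin includes its lower bound and that a score of exactly 1.0 falls in the top bin; the function name and example scores are hypothetical and are not data from this study.

```python
from collections import Counter

QUARTILE_LABELS = ["0.0-0.25", "0.25-0.50", "0.50-0.75", "0.75-1.0"]

def quartile_proportions(scores):
    """Proportion of variants whose score falls in each quarter of [0, 1].

    Assumes every score lies in [0, 1] and that `scores` is non-empty.
    """
    # int(score * 4) gives the bin index; min(..., 3) keeps a score of 1.0 in the top bin.
    counts = Counter(QUARTILE_LABELS[min(int(s * 4), 3)] for s in scores)
    n = len(scores)
    return {label: counts[label] / n for label in QUARTILE_LABELS}

# Hypothetical scores for a small set of variants.
example = quartile_proportions([0.9, 0.8, 0.95, 0.3])
```

Running the same function separately on the benign and pathogenic variant sets would give the two groups of proportions that the panel compares.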
Extended Data Figure 10: MLC scores versus population frequency or phyloP for heteroplasmies in the UK Biobank.

The MLC score of single nucleotide variant heteroplasmies in the UK Biobank (UKB) is plotted against their population allele frequency (a) or phyloP scores (b). The plot in (a) is displayed with a square root transformed y-axis. Note that a phyloP score of >3 represents significantly conserved sites with p-value < 0.05. Variant classes in each plot are described in the titles in (a), and the histograms on the right margin show the distribution of variants across the y-axis. The R² coefficient of determination and a blue line of linear model fit are shown.
Extended Data Table 1: Association between MLC score sum (MSS) and blood cell counts in the UK Biobank.
| Cell type | Variable | Single model: β coefficient | Single model: SE | Single model: p-value | Combined model: β coefficient | Combined model: SE | Combined model: p-value |
|---|---|---|---|---|---|---|---|
| Neutrophils | Heteroplasmy count | 0.00682 | 0.000867 | 3.52 × 10⁻¹⁵ | 0.00653 | 0.00110 | 3.42 × 10⁻⁰⁹ |
| Neutrophils | MLC score sum (MSS) | 0.0142 | 0.00271 | 1.82 × 10⁻⁰⁷ | 0.0015 | 0.00346 | 0.664 |
| Platelets | Heteroplasmy count | 0.835 | 0.199 | 2.81 × 10⁻⁰⁵ | 0.266 | 0.254 | 0.294 |
| Platelets | MLC score sum (MSS) | 3.39 | 0.624 | 5.61 × 10⁻⁰⁸ | 2.87 | 0.795 | 3.03 × 10⁻⁰⁴ |
Table showing β coefficients, standard error (SE), and p-values from linear regression models of the association between platelet or neutrophil count and heteroplasmy count and/or mtDNA local constraint (MLC) score sum (MSS). Regressions were run as separate models (‘Single’) or together in the same model (‘Combined’, where both heteroplasmy count and MSS were included in the same model), and adjusted for age, sex and smoking status. Significant p-values, computed using a two-sided test, are bolded.
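The two predictors in the table can be illustrated with a minimal sketch of how they might be derived for one individual. It assumes that MSS is the sum of the MLC scores of that individual's heteroplasmic variants and that heteroplasmy count is the number of those variants; the function name, variant identifiers and scores are hypothetical.

```python
def heteroplasmy_count_and_mss(variants, mlc_scores):
    """Return (heteroplasmy count, MLC score sum) for one individual.

    variants: iterable of heteroplasmic variant identifiers carried by the individual.
    mlc_scores: dict mapping variant identifier -> MLC score between 0 and 1.
    """
    scored = [mlc_scores[v] for v in variants]
    return len(scored), sum(scored)

# Hypothetical scores; not values from the study.
toy_scores = {"variant_a": 0.25, "variant_b": 0.75, "variant_c": 0.5}
count, mss = heteroplasmy_count_and_mss(["variant_a", "variant_b"], toy_scores)
```

Because MSS can only grow as more heteroplasmies are added, the two predictors are correlated, which is why the table reports them both separately and in a combined model.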
Acknowledgements
N.J.L. received a National Health and Medical Research Council (NHMRC) Early Career Fellowship APP1159456 and an Australian American Association Scholarship. This research was conducted using the UK Biobank Resource under Application Number 17731, and supported by National Heart, Lung and Blood Institute, National Institutes of Health (NIH) grant R01HL144569. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. D.R.T. was supported by an NHMRC Principal Research Fellowship GNT1155244. The Chair in Genomic Medicine awarded to J.C. is generously supported by The Royal Children’s Hospital Foundation. Research conducted at the Murdoch Children’s Research Institute was supported by the Victorian Government’s Operational Infrastructure Support Program. We are grateful to all of the families in SPARK, the SPARK clinical sites and SPARK staff, and we appreciate obtaining access to the data on SFARI Base. We gratefully acknowledge Sarah Calvo for providing advice on the constraint model and its application to gnomAD. We also acknowledge the contributions of the broader team involved in curation of disease-associated variation at the Victorian Clinical Genetics Service, including Belinda Chong and Sebastian Lunke. We also acknowledge The Islet, Oxygen consumption, Mass Isotopomer flux Core (IOMIC) at Yale, which assisted with oxygen consumption measurement, and the Yale Center for Genome Analysis for providing the PacBio sequencing services, which is funded in part by the National Institutes of Health instrument grant 1S10OD028669–01.
Footnotes
Competing interests
The authors declare no competing interests.
Additional Information
Supplementary Information is available for this paper. This contains Supplementary Methods, Supplementary Figures 1–10 and Supplementary Tables 1–7 which pertain to the Supplementary Methods, Supplementary Discussion, Supplementary References, and descriptions of Supplementary Datasets and the Supplementary Video.
Code availability
The code used for analyses and figure generation is available at https://github.com/leklab/mitochondrial_constraint.
Data availability
Data analyzed or generated during this study are included in this article and its Supplementary files, and available via https://github.com/leklab/mitochondrial_constraint. Constraint metrics are provided in the Supplementary Datasets, and will also be available via http://gnomad.broadinstitute.org. Publicly-available datasets used in this study are available from the following sources: ClinVar, https://www.ncbi.nlm.nih.gov/clinvar/; DECIPHER, https://www.deciphergenomics.org/ddd/ddgenes (developmental disorder genes); gnomAD, http://gnomad.broadinstitute.org; HelixMTdb, https://www.helix.com/mitochondrial-variant-database; HmtVar, https://www.hmtvar.uniba.it/; IMPC, https://www.ebi.ac.uk/mi/impc/essential-genes-search/ (essential genes); MitImpact, https://mitimpact.css-mendel.it/ (APOGEE predictions); MITOMAP, https://www.mitomap.org/MITOMAP; NCBI Genome, https://www.ncbi.nlm.nih.gov/datasets/genome/; Phylotree, https://www.phylotree.org/ (haplogroup variants); Protein Data Bank, https://www.rcsb.org/; UCSC, https://genome.ucsc.edu/ (phyloP scores); UniProt, https://www.uniprot.org/. A detailed description of these datasets and their application is also provided at https://github.com/leklab/mitochondrial_constraint/tree/main/required_files.
Main references
- 1. Karczewski KJ et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443, doi: 10.1038/s41586-020-2308-7 (2020).
- 2. Havrilla JM, Pedersen BS, Layer RM & Quinlan AR A map of constrained coding regions in the human genome. Nat. Genet. 51, 88–95, doi: 10.1038/s41588-018-0294-6 (2019).
- 3. Samocha KE et al. Regional missense constraint improves variant deleteriousness prediction. BioRxiv, doi: 10.1101/148353 (2017).
- 4. Petrovski S, Wang Q, Heinzen EL, Allen AS & Goldstein DB Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709, doi: 10.1371/journal.pgen.1003709 (2013).
- 5. Laricchia KM et al. Mitochondrial DNA variation across 56,434 individuals in gnomAD. Genome Res. 32, 569–582, doi: 10.1101/gr.276013.121 (2022).
- 6. McBride HM, Neuspiel M & Wasiak S Mitochondria: more than just a powerhouse. Curr. Biol. 16, R551–560, doi: 10.1016/j.cub.2006.06.054 (2006).
- 7. Stewart JB & Chinnery PF Extreme heterogeneity of human mitochondrial DNA from organelles to populations. Nature reviews. Genetics 22, 106–118, doi: 10.1038/s41576-020-00284-x (2021).
- 8. Chen Y, Zhou Z & Min W Mitochondria, oxidative stress and innate immunity. Front. Physiol. 9, 1487, doi: 10.3389/fphys.2018.01487 (2018).
- 9. Gray MW Mitochondrial evolution. Cold Spring Harb. Perspect. Biol. 4, a011403, doi: 10.1101/cshperspect.a011403 (2012).
- 10. Anderson S et al. Sequence and organization of the human mitochondrial genome. Nature 290, 457–465, doi: 10.1038/290457a0 (1981).
- 11. Gorman GS et al. Mitochondrial diseases. Nature reviews. Disease primers 2, 16080, doi: 10.1038/nrdp.2016.80 (2016).
- 12. McCormick EM et al. Specifications of the ACMG/AMP standards and guidelines for mitochondrial DNA variant interpretation. Hum. Mutat. 41, 2028–2057, doi: 10.1002/humu.24107 (2020).
- 13. Wang Y et al. Association of mitochondrial DNA content, heteroplasmies and inter-generational transmission with autism. Nat Commun 13, 3790, doi: 10.1038/s41467-022-30805-7 (2022).
- 14. Gorelick AN et al. Respiratory complex and tissue lineage drive recurrent mutations in tumour mtDNA. Nat Metab 3, 558–570, doi: 10.1038/s42255-021-00378-8 (2021).
- 15. Gopal RK et al. Early loss of mitochondrial complex I and rewiring of glutathione metabolism in renal oncocytoma. Proc. Natl. Acad. Sci. U. S. A. 115, E6283–E6290, doi: 10.1073/pnas.1711888115 (2018).
- 16. Kim M, Mahmood M, Reznik E & Gammage PA Mitochondrial DNA is a major source of driver mutations in cancer. Trends Cancer, doi: 10.1016/j.trecan.2022.08.001 (2022).
- 17. Keogh MJ & Chinnery PF Mitochondrial DNA mutations in neurodegeneration. Biochim. Biophys. Acta 1847, 1401–1411, doi: 10.1016/j.bbabio.2015.05.015 (2015).
- 18. Yonova-Doing E et al. An atlas of mitochondrial DNA genotype-phenotype associations in the UK Biobank. Nat. Genet. 53, 982–993, doi: 10.1038/s41588-021-00868-1 (2021).
- 19. Kraja AT et al. Associations of Mitochondrial and Nuclear Mitochondrial Variants and Genes with Seven Metabolic Traits. American Journal of Human Genetics 104, 112–138, doi: 10.1016/j.ajhg.2018.12.001 (2019).
- 20. Yamamoto K et al. Genetic and phenotypic landscape of the mitochondrial genome in the Japanese population. Commun Biol 3, 104, doi: 10.1038/s42003-020-0812-9 (2020).
- 21. Stewart JB et al. Strong purifying selection in transmission of mammalian mitochondrial DNA. PLoS Biol. 6, e10, doi: 10.1371/journal.pbio.0060010 (2008).
- 22. Voets AM et al. Large scale mtDNA sequencing reveals sequence and functional conservation as major determinants of homoplasmic mtDNA variant distribution. Mitochondrion 11, 964–972, doi: 10.1016/j.mito.2011.09.003 (2011).
- 23. Elson JL, Turnbull DM & Howell N Comparative genomics and the evolution of human mitochondrial DNA: assessing the effects of selection. American Journal of Human Genetics 74, 229–238, doi: 10.1086/381505 (2004).
- 24. Kivisild T et al. The role of selection in the evolution of human mitochondrial genomes. Genetics 172, 373–387, doi: 10.1534/genetics.105.043901 (2006).
- 25. Wei W et al. Germline selection shapes human mitochondrial DNA diversity. Science 364, eaau6520, doi: 10.1126/science.aau6520 (2019).
- 26. Ju YS et al. Origins and functional consequences of somatic mitochondrial DNA mutations in human cancer. eLife 3, doi: 10.7554/eLife.02935 (2014).
- 27. Dietlein F et al. Identification of cancer driver genes based on nucleotide context. Nat. Genet. 52, 208–218, doi: 10.1038/s41588-019-0572-y (2020).
- 28. Bolze A et al. A catalog of homoplasmic and heteroplasmic mitochondrial DNA variants in humans. BioRxiv, doi: 10.1101/798264 (2020).
- 29. Lott MT et al. mtDNA Variation and Analysis Using MITOMAP and MITOMASTER. Current protocols in bioinformatics 1, 1.23.21–21.23.26, doi: 10.1002/0471250953.bi0123s44 (2013).
- 30. Lake NJ, Compton AG, Rahman S & Thorburn DR Leigh syndrome: One disorder, more than 75 monogenic causes. Ann. Neurol. 79, 190–203, doi: 10.1002/ana.24551 (2016).
- 31. McFarland R, Elson JL, Taylor RW, Howell N & Turnbull DM Assigning pathogenicity to mitochondrial tRNA mutations: when “definitely maybe” is not good enough. Trends Genet. 20, 591–596, doi: 10.1016/j.tig.2004.09.014 (2004).
- 32. Rebelo-Guiomar P, Powell CA, Van Haute L & Minczuk M The mammalian mitochondrial epitranscriptome. Biochim Biophys Acta Gene Regul Mech 1862, 429–446, doi: 10.1016/j.bbagrm.2018.11.005 (2019).
- 33. Helm M et al. Search for characteristic structural features of mammalian mitochondrial tRNAs. RNA (New York) 6, 1356–1379, doi: 10.1017/s1355838200001047 (2000).
- 34. Wong L-JC et al. Interpretation of mitochondrial tRNA variants. Genet. Med. 22, 917–926, doi: 10.1038/s41436-019-0746-0 (2020).
- 35. Amunts A, Brown A, Toots J, Scheres SH & Ramakrishnan V Ribosome. The structure of the human mitochondrial ribosome. Science 348, 95–98, doi: 10.1126/science.aaa1193 (2015).
- 36. Zhao H et al. Maternally inherited aminoglycoside-induced and nonsyndromic deafness is associated with the novel C1494T mutation in the mitochondrial 12S rRNA gene in a large Chinese family. American Journal of Human Genetics 74, 139–152, doi: 10.1086/381133 (2004).
- 37. Nicholls TJ & Minczuk M In D-loop: 40 years of mitochondrial 7S DNA. Exp. Gerontol. 56, 175–181, doi: 10.1016/j.exger.2014.03.027 (2014).
- 38. Horn D & Barrientos A Mitochondrial copper metabolism and delivery to cytochrome c oxidase. IUBMB life 60, 421–429, doi: 10.1002/iub.50 (2008).
- 39. Kampjut D & Sazanov LA The coupling mechanism of mammalian respiratory complex I. Science 370, abc4209, doi: 10.1126/science.abc4209 (2020).
- 40. Koripella RK, Sharma MR, Risteff P, Keshavan P & Agrawal RK Structural insights into unique features of the human mitochondrial ribosome recycling. Proc. Natl. Acad. Sci. U. S. A. 116, 8283–8288, doi: 10.1073/pnas.1815675116 (2019).
- 41. Hong YS et al. Deleterious heteroplasmic mitochondrial mutations are associated with an increased risk of overall and cancer-specific mortality. Nat Commun 14, 6113, doi: 10.1038/s41467-023-41785-7 (2023).
- 42. Mok BY et al. CRISPR-free base editors with enhanced activity and expanded targeting scope in mitochondrial and nuclear DNA. Nat. Biotechnol. 40, 1378–1387, doi: 10.1038/s41587-022-01256-8 (2022).
- 43. Rajasimha HK, Chinnery PF & Samuels DC Selection against pathogenic mtDNA mutations in a stem cell population leads to the loss of the 3243A-->G mutation in blood. American Journal of Human Genetics 82, 333–343, doi: 10.1016/j.ajhg.2007.10.007 (2008).
- 44. Floros VI et al. Segregation of mitochondrial DNA heteroplasmy through a developmental genetic bottleneck in human embryos. Nat. Cell Biol. 20, 144–151, doi: 10.1038/s41556-017-0017-8 (2018).
- 45. Zaidi AA et al. Bottleneck and selection in the germline and maternal age influence transmission of mitochondrial DNA in human pedigrees. Proc. Natl. Acad. Sci. U. S. A. 116, 25172–25178, doi: 10.1073/pnas.1906331116 (2019).
- 46. Schaefer PM et al. Combination of common mtDNA variants results in mitochondrial dysfunction and a connective tissue dysregulation. Proc. Natl. Acad. Sci. U. S. A. 119, e2212417119, doi: 10.1073/pnas.2212417119 (2022).
Methods References
- 47. Kennedy SR, Salk JJ, Schmitt MW & Loeb LA Ultra-sensitive sequencing reveals an age-related increase in somatic mitochondrial mutations that are inconsistent with oxidative damage. PLoS Genet. 9, e1003794, doi: 10.1371/journal.pgen.1003794 (2013).
- 48. Ludwig LS et al. Lineage Tracing in Humans Enabled by Mitochondrial Mutations and Single-Cell Genomics. Cell 176, 1325–1339.e1322, doi: 10.1016/j.cell.2019.01.022 (2019).
- 49. Rebolledo-Jaramillo B et al. Maternal age effect and severe germ-line bottleneck in the inheritance of human mitochondrial DNA. Proc. Natl. Acad. Sci. U. S. A. 111, 15474–15479, doi: 10.1073/pnas.1409328111 (2014).
- 50. Li M et al. Transmission of human mtDNA heteroplasmy in the Genome of the Netherlands families: support for a variable-size bottleneck. Genome Res. 26, 417–426, doi: 10.1101/gr.203216.115 (2016).
- 51. Yuan Y et al. Comprehensive molecular characterization of mitochondrial genomes in human cancers. Nat. Genet. 52, 342–352, doi: 10.1038/s41588-019-0557-x (2020).
- 52. SPARK Consortium. SPARK: A US cohort of 50,000 families to accelerate autism research. Neuron 97, 488–493, doi: 10.1016/j.neuron.2018.01.015 (2018).
- 53. Colnaghi M, Pomiankowski A & Lane N The need for high-quality oocyte mitochondria at extreme ploidy dictates mammalian germline development. eLife 10, doi: 10.7554/eLife.69344 (2021).
- 54. Van Oven M PhyloTree Build 17: Growing the human mitochondrial DNA tree. Forensic Science International: Genetics Supplement Series 5, e392–e394 (2015).
- 55. Lake NJ, Zhou L, Xu J & Lek M MitoVisualize: a resource for analysis of variants in human mitochondrial RNAs and DNA. Bioinformatics 38, 2967–2969, doi: 10.1093/bioinformatics/btac216 (2022).
- 56. Bodenhofer U, Bonatesta E, Horejs-Kainrath C & Hochreiter S msa: an R package for multiple sequence alignment. Bioinformatics 31, 3997–3999, doi: 10.1093/bioinformatics/btv494 (2015).
- 57. UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515, doi: 10.1093/nar/gky1049 (2019).
- 58. Sonney S et al. Predicting the pathogenicity of novel variants in mitochondrial tRNA with MitoTIP. PLoS computational biology 13, e1005867, doi: 10.1371/journal.pcbi.1005867 (2017).
- 59. Landrum MJ et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067, doi: 10.1093/nar/gkx1153 (2018).
- 60. Akesson LS et al. Early diagnosis of Pearson syndrome in neonatal intensive care following rapid mitochondrial genome sequencing in tandem with exome sequencing. European Journal of Human Genetics 27, 1821–1826, doi: 10.1038/s41431-019-0477-3 (2019).
- 61. Davydov EV et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS computational biology 6, e1001025, doi: 10.1371/journal.pcbi.1001025 (2010).
- 62. Hamelryck T & Manderick B PDB file parser and structure class implemented in Python. Bioinformatics 19, 2308–2310, doi: 10.1093/bioinformatics/btg299 (2003).
- 63. Guo R, Zong S, Wu M, Gu J & Yang M Architecture of human mitochondrial respiratory megacomplex I2III2IV2. Cell 170, 1247–1257.e1212, doi: 10.1016/j.cell.2017.07.050 (2017).
- 64. Zong S et al. Structure of the intact 14-subunit human cytochrome c oxidase. Cell Res. 28, 1026–1034, doi: 10.1038/s41422-018-0071-1 (2018).
- 65. Aibara S, Singh V, Modelska A & Amunts A Structural basis of mitochondrial translation. eLife 9, doi: 10.7554/eLife.58362 (2020).
- 66. Jumper J et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589, doi: 10.1038/s41586-021-03819-2 (2021).
- 67. Soltanikazemi E, Quadir F, Roy RS, Guo Z & Cheng J Distance-based reconstruction of protein quaternary structures from inter-chain contacts. Proteins 90, 720–731, doi: 10.1002/prot.26269 (2022).
- 68. Pettersen EF et al. UCSF ChimeraX: structure visualization for researchers, educators, and developers. Protein Sci. 30, 70–82, doi: 10.1002/pro.3943 (2021).
- 69. Sudlow C et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779, doi: 10.1371/journal.pmed.1001779 (2015).
- 70. Battle SL et al. A bioinformatics pipeline for estimating mitochondrial DNA copy number and heteroplasmy levels from whole genome sequencing data. NAR Genom Bioinform 4, lqac034, doi: 10.1093/nargab/lqac034 (2022).
- 71. Cacheiro P et al. Human and mouse essentiality screens as a resource for disease gene discovery. Nat Commun 11, 655, doi: 10.1038/s41467-020-14284-2 (2020).
- 72. Firth HV et al. DECIPHER: Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources. Am. J. Hum. Genet. 84, 524–533, doi: 10.1016/j.ajhg.2009.03.010 (2009).
- 73. Nicole Lake KM, Cohen Justin, Lek Monkol. Mitochondrial DNA base editing in HEK293T cells. protocols.io. 10.17504/protocols.io.yxmvm3rnol3p/v1. (2024).
- 74. Kluesner MG et al. EditR: A Method to Quantify Base Editing from Sanger Sequencing. CRISPR J 1, 239–250, doi: 10.1089/crispr.2018.0014 (2018).
- 75. Mok BY et al. A bacterial cytidine deaminase toxin enables CRISPR-free mitochondrial base editing. Nature 583, 631–637, doi: 10.1038/s41586-020-2477-4 (2020).