Author manuscript; available in PMC: 2021 Aug 10.
Published in final edited form as: Mitochondrion. 2021 Jan 26;58:303–310. doi: 10.1016/j.mito.2021.01.006

Comparison of whole genome sequencing and targeted sequencing for Mitochondrial DNA

Ruoying Chen 1, Micheala A Aldred 2, Weiling Xu 3, Joe Zein 4, Peter Bazeley 1, Suzy AA Comhai 3, Deborah A Meyers 5, Eugene R Bleecker 5, Chunyu Liu 6, Serpil C Erzurum 3,4,*, Bo Hu 1,*; NHLBI Severe Asthma Research Program (SARP)
PMCID: PMC8354572  NIHMSID: NIHMS1716219  PMID: 33513442

Abstract

Mitochondrial dysfunction has been associated with a broad spectrum of diseases, and there is an increasing demand for accurate detection of mitochondrial DNA (mtDNA) variants. Whole genome sequencing (WGS) has been the dominant sequencing approach to identify genetic variants in recent decades, but most studies focus on variants in the nuclear genome. Whole genome sequencing is also costly and time consuming. Sequencing specifically targeted for mtDNA is commonly used in diagnostic settings and has lower costs. However, there is a lack of pairwise comparisons between these two sequencing approaches for calling mtDNA variants on a population basis. In this study, we compared WGS and mtDNA-targeted sequencing (targeted-seq) in analyzing mitochondrial DNA from 1499 participants recruited into the Severe Asthma Research Program (SARP). Our study reveals that targeted sequencing and WGS have comparable capacity to determine genotypes and to call haplogroups and homoplasmies on mtDNA. However, there is large variability in calling heteroplasmies, especially low-frequency heteroplasmies, which indicates that investigators should be cautious about heteroplasmies acquired from different sequencing methods. Further research is needed to improve variant detection methods for mitochondrial DNA.

Keywords: Mitochondrial DNA, Asthma, Targeted sequencing, Whole genome sequencing

1. Introduction

Human mitochondrial DNA is inherited almost exclusively from the mother. It is a circular double-stranded DNA molecule. Mitochondrial DNA was the first significant part of the human genome to be sequenced (1), which revealed that it consists of 16,569 bases and encodes 37 genes, including 13 key protein-coding genes, 22 transfer RNAs and 2 ribosomal RNAs. The displacement loop (D-loop), located in the main non-coding region of mtDNA, acts as a promoter for both strands and is a hot spot for mutations.

Depending on the energy demands, oxidative stress and pathological conditions, the number of mtDNA copies per cell varies greatly (2,3). Each human cell can contain thousands of mtDNA copies, compared with only two copies of the nuclear DNA. The mtDNA copies may carry different sequences as a result of inherited or somatic mutations. Mitochondrial mutations are defined according to the presence of mutated mtDNA molecules in a single cell or among cells within an individual: the co-existence of wild-type and mutated sequences is termed heteroplasmy, and the condition in which all mutated sequences are identical is termed homoplasmy. In contrast to the nuclear genome, the mitochondrial genome is more likely to mutate (4) and mutations accumulate with aging (5,6). Increasing evidence suggests that mtDNA mutations are associated with various diseases, including not only mitochondrial genetic disorders but also type-2 diabetes, cardiovascular disease, cancer and asthma (7–15).

Prior to the emergence of next generation sequencing technology, detection of mtDNA variants was mainly restricted to the D-loop region and was achieved mostly by Sanger sequencing and PCR-based methods (16). Both approaches have major limitations due to the detection limit of variant allele frequency and the inability to cover the whole mitochondrial genome (17,18). During the past two decades, next generation sequencing has become the most common technology to identify genetic variants given its ability to achieve deep genome coverage and to detect low-frequency variants. Whole genome sequencing (WGS) allows for comprehensive discovery of single nucleotide polymorphisms and de novo mutations along the entire human genome. While most WGS studies focus on nuclear DNA, the resulting sequencing data can be readily adapted to the discovery of mtDNA variants (9,19). Applications of WGS to identify mitochondrial variants can be found in some recent studies (15,20–23). However, WGS is generally expensive and time consuming for studies with large sample sizes, especially when the focus is mitochondrial DNA rather than nuclear DNA.

Due to the small size of the mitochondrial genome, sequencing specifically targeted for mtDNA is a more affordable approach in practice (24–27). To perform targeted sequencing, samples can be prepared by enriching mtDNA through PCR amplification or isolating mtDNA prior to sequencing. Methods of isolation focus on the separation of organelles using centrifugation or capture array and obtain intact mitochondria at an early stage, while amplification methods focus on increasing the proportion of mitochondrial DNA by polymerase chain reaction (PCR) (17).

Although both WGS and targeted sequencing have been successfully used for analyzing the mitochondrial genome in the literature, there are no direct comparisons of their performance in a paired manner based on large population-based cohort studies. In this study, we applied both WGS and targeted sequencing to the same set of whole blood samples collected from participants enrolled in the Severe Asthma Research Program (SARP) (28–31). Complete mtDNA genomes were analyzed using the same bioinformatics pipeline to call mtDNA variants, and the results from the two sequencing datasets were compared systematically.

2. Materials and methods

2.1. Study cohort and sequencing

The Severe Asthma Research Program, funded by the National Heart, Lung and Blood Institute, is a long-term cohort study of severe asthma. The study protocol and procedures were approved by the Institutional Review Board (IRB) at each participating center and an independent Data Safety Monitoring Board. All subjects provided written informed consent and/or assent. The study is registered on Clinicaltrials.gov [NCT01750411]. For the SARP participants, asthma was verified based upon American Thoracic Society guidelines, which include positive methacholine challenge test and/or reversible airflow obstruction. Healthy controls had normal spirometry and negative methacholine challenge. Exclusion criteria included any of the following: current smoking (more than 5–10 pack years depending on age), other respiratory diseases (e.g., CF or COPD), premature birth before 35 weeks’ gestation, clinically relevant or untreated gastroesophageal reflux, recurrent sinopulmonary infections or obstructive sleep apnea, and cancer diagnosis in the past five years.

Whole genome Sequencing.

DNA was extracted from whole blood samples collected from the SARP participants. A total of 1882 samples underwent whole genome sequencing at the New York Genome Center of Trans-Omics for Precision Medicine (TOPMed) (www.nhlbiWGS.org). These samples are from 1881 distinct participants (one participant has a duplicate), of which only five are healthy controls. Sequencing libraries were prepared with 500 ng DNA input, using the KAPA Hyper Library Preparation Kit (PCR-free) (Roche Sequencing and Life Science, Indianapolis, IN). Sequencing data were generated on the Illumina HiSeq X with paired-end reads of 150 bp, using V3 sequencing chemistry and Illumina HiSeq Control Software (HCS) v3.3.39 (Illumina, San Diego, CA).

Targeted Sequencing.

Remaining DNA from the blood samples was sent to Cleveland Clinic (kindly provided by Dr. Meyers). We applied a mtDNA sequencing method in which samples underwent nuclear DNA digestion, whole mitochondrial genome amplification, DNA library preparation, and sequencing, in that order. Briefly, 20 ng of DNA from each sample was digested with the enzymes Exonuclease V, DraIII, PshAI and XmaI at 37°C for 4 hours. The whole mitochondrial genome was then amplified using the REPLI-g Mitochondrial DNA Kit (QIAGEN, Germantown, MD) for 8 hours or overnight, and dsDNA was quantitated with the Quant-iT PicoGreen Kit (ThermoFisher Scientific, Waltham, MA) and a FlexStation3 Reader (Molecular Devices, San Jose, CA). Then, 2 ng of the mtDNA-enriched sample was used for Nextera XT DNA library preparation with 384-plex dual index barcoding (Illumina, San Diego, CA). Samples were pooled in equimolar amounts and sequenced using the Illumina MiSeq System at the Genomics Core of Cleveland Clinic. The reads are paired-end with a length of 151 bp. A total of 1960 samples underwent mtDNA-targeted sequencing.

To compare WGS and targeted sequencing, participants with only WGS data or only targeted-seq data were excluded. We also removed participants with mean read depth lower than twenty across the mtDNA genome. A total of 1499 asthmatic participants with paired WGS and targeted-seq data were included in the following analyses.

2.2. Data analysis

Bioinformatics Analysis.

Raw sequencing data were aligned to the revised Cambridge Reference Sequence (rCRS) using BWA (v0.7.12)(32). The mtDNA genome by rCRS consists of 16569 base positions, where position 3107 has a base of “N” representing a historical sequencing error. The WGS data were further downsampled using SAMtools (v1.10) to achieve similar coverage as the targeted-seq data in a sensitivity analysis. In order to call mtDNA variants, only uniquely mapped reads with quality scores of at least twenty were kept. More specifically, the mtDNA variants (heteroplasmies and homoplasmies) were called using MitoCaller based on the BAM files produced by read alignment.

MitoCaller is a likelihood-based method that takes into account sequencing error rate and the circularity of the mtDNA genome (20), which predicts genotypes based on reads mapped to both strands of the mtDNA genome. Unlike the nuclear DNA, mtDNA can have 15 distinct genotypes (A,C,G,T,A/C, A/G, A/T, C/G, C/T, G/T, A/C/G, A/C/T, A/G/T, C/G/T, A/C/G/T) at each genome position since there could be multiple mtDNA copies. At each position, the alternative allele frequency (AAF) is defined as the proportion of the major non-reference allele among all qualified reads covering this position. Homoplasmy and heteroplasmy are then called according to the alternative allele frequencies. A threshold of 5%−95% for AAF was applied in this study. More specifically, a homoplasmy is called if the alternative allele frequency is greater than 95%; a heteroplasmy is called if the alternative allele frequency is between 5% and 95%; and a reference site is called if the alternative allele frequency is lower than 5%. Different lower thresholds of AAF were applied in multiple sensitivity analyses. For each position of each participant, the two sequencing datasets are considered as having the same heteroplasmy (or homoplasmy) if the two heteroplasmies (or homoplasmies) have the same mutant allele.
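The thresholding step described above can be sketched in a few lines of Python. This is only an illustration of the 5%–95% rule applied to per-site allele counts; it does not reproduce MitoCaller's likelihood model or its strand requirements. The function names, the input layout (a dict of read counts per base) and the treatment of values exactly equal to 5% or 95% as heteroplasmies are assumptions made for this sketch.

```python
from itertools import combinations

BASES = "ACGT"

# Every non-empty subset of the four bases is a possible mtDNA genotype.
GENOTYPES = ["/".join(c) for k in range(1, 5) for c in combinations(BASES, k)]


def alt_allele_frequency(allele_counts, ref_base):
    """Return (major non-reference allele, AAF) for one site.

    allele_counts: dict mapping base -> number of qualified reads
    (an assumed input layout, not MitoCaller's file format).
    Returns (None, None) when no reads cover the site.
    """
    depth = sum(allele_counts.values())
    if depth == 0:
        return None, None
    alts = {b: n for b, n in allele_counts.items() if b != ref_base and n > 0}
    if not alts:
        return None, 0.0
    major_alt = max(alts, key=alts.get)
    return major_alt, alts[major_alt] / depth


def call_plasmy(aaf, lower=0.05, upper=0.95):
    """Apply the AAF thresholds; boundary values count as heteroplasmy (assumption)."""
    if aaf is None:
        return "no coverage"
    if aaf > upper:
        return "homoplasmy"
    if aaf >= lower:
        return "heteroplasmy"
    return "reference"


if __name__ == "__main__":
    print(len(GENOTYPES), "possible genotypes")
    # Hypothetical site: reference base A, 930 reads with A and 70 with G.
    allele, aaf = alt_allele_frequency({"A": 930, "G": 70}, ref_base="A")
    print(allele, aaf, call_plasmy(aaf))
```

Passing a different `lower` value mimics the sensitivity analyses in which the lower AAF threshold is raised.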

The mtDNA haplogroups were obtained using HaploGrep2 (33).

Statistical Analysis.

In general, categorical variables were summarized as frequencies and percentages and continuous variables were summarized as means and standard deviations (SDs) or medians and ranges as appropriate. The distributions of each categorical variable were compared between groups using the chi-squared test, and ANOVA was used for continuous variables. All statistical analyses were performed using R (version 3.6.1, cran-project.org).

3. Results

Among the 1499 subjects included in this study, 579 (38.6%) are male and 920 (61.4%) are female. 907 (60.5%) are Caucasian, 459 (30.6%) are African American and 133 (8.9%) are other races. 792 (52.8%) have severe asthma and 707 (47.2%) have non-severe asthma. Adult subjects account for 81.4% of the cohort. The mean BMI is 29.4 (SD=8.7).

Figure 1 shows the mean read depths across the mtDNA genome. For the WGS data, the median is 1176 (range=[68, 10004]); for the targeted-seq data, the median is 454 (range=[20, 4297]). Both sequencing datasets have significantly lower coverage in the D-loop region (i.e., the beginning and tail ends of the mtDNA genome) than other regions (p<0.001). While the two sequencing datasets have different read coverages for mtDNA, their coverages are both sufficient for sensitive variant detection (34,35).

Figure 1. Mean read depth across mtDNA genome by WGS and targeted sequencing

We first compared haplogroups from the two sequencing datasets. Among all the participants, the haplogroups are different for only 6 (0.4%) participants (Supp. Table 1). These six subjects were then excluded from the downstream analyses of mtDNA variants.

3.1. Genotype

Since the mtDNA has 16569 positions, there are a total of 24,737,517 mtDNA sites from 1493 subjects. Excluding the artificial position of 3107 and 1134 sites with no read coverage in the targeted-seq data, we were able to estimate the genotypes at 24,734,890 sites, of which, 23,662,867 (95.7%) sites have identical genotypes obtained from the two sequencing datasets. Among the 1,072,023 (4.3%) sites with different genotypes, 85% of them are the sites where the WGS data only have the reference allele but the targeted-seq data have one or more alternative alleles. The sites with different genotypes have greater read depths than those with identical genotypes (p<0.001 for both WGS and targeted-seq). The overall genotype agreement rate declines slightly by restricting the comparison to the sites with greater read depths (Figure 2a). The genotype agreement rate is still 93.3% at 9,338,029 sites with depths greater than 500 in both datasets.
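The site counts in this paragraph can be reconciled with a short calculation. The sketch below uses only figures reported above; the one assumption is that position 3107 is excluded once for each of the 1493 subjects, which is what makes the totals add up.

```python
MTDNA_LENGTH = 16569      # positions in rCRS
N_SUBJECTS = 1493         # subjects remaining after haplogroup comparison

total_sites = MTDNA_LENGTH * N_SUBJECTS

# Assumption: the artificial position 3107 is dropped for every subject.
excluded_3107 = N_SUBJECTS
no_coverage_sites = 1134  # sites with no reads in the targeted-seq data

comparable_sites = total_sites - excluded_3107 - no_coverage_sites
identical_sites = 23_662_867
different_sites = comparable_sites - identical_sites
agreement_rate = identical_sites / comparable_sites

print(f"total sites:      {total_sites:,}")
print(f"comparable sites: {comparable_sites:,}")
print(f"different sites:  {different_sites:,}")
print(f"agreement rate:   {agreement_rate:.1%}")
```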

Figure 2. (a) Relationship between proportion of having identical genotypes (genotype agreement rate) and minimum depth threshold. (b) First track (blue color): proportions (out of 1493 subjects) of having different genotypes from the WGS and targeted-seq datasets; second track (red color): 80 mtDNA positions with 50% or higher proportions of having different genotypes.

Across all mtDNA positions, the genotype agreement rates have a median of 97.7% (IQR=[95.7%, 98.7%], range=[13.9%, 100%]; see Figure 2b). At six mtDNA positions, the genotypes from the two sequencing datasets are identical for all 1493 subjects. Only 80 positions have an agreement rate lower than 50%, where 34 are in the D-loop region, 43 on protein coding genes and 3 on ribosomal RNA (Figure 2b). At the subject level, the genotype agreement rate averaged over all positions ranges from 89.7% to 99.4%. In Supp. Table 2, the study subjects are divided into four groups according to the quartiles of their genotype agreement rates. The results show no significant differences in terms of the participants’ demographics and other characteristics among these four groups.

3.2. Alternative allele frequency

At each mtDNA genome position, if all mapped reads contain the reference allele, the AAF is defined as zero. 95.6% of the sites have no alternative alleles from either sequencing dataset, which are then deemed as reference sites. Only 0.01% of the sites have different alternative alleles from the two datasets. Figure 3a compares the AAFs from the targeted-seq and WGS data, excluding the sites with no alternative alleles (i.e., AAF=0) or different alternative alleles in the two sequencing datasets. The overall correlation coefficient is very close to one (r=0.996). The D-loop region has slightly lower correlation (r=0.993) in comparison to that of the coding genes (r=0.999).

Figure 3. (a) Alternative allele frequencies (AAFs) from WGS and targeted-seq. The sites with zero AAFs in both datasets and those with different alternative alleles are excluded. (b) Numbers of homoplasmies across the mtDNA genome from WGS and targeted-seq (transformed as log2(x+1)). Positions with no homoplasmies called from both datasets are excluded.

3.3. Homoplasmy and heteroplasmy

Based on the WGS data, 47053 homoplasmies and 10223 heteroplasmies are called from 1493 subjects. The numbers of homoplasmies and heteroplasmies are 49117 and 4946, respectively, at the same AAF threshold based on the targeted-seq data. The reference and homoplasmic sites have similar read depths, but the heteroplasmic sites have significantly lower depths. For the targeted-seq data, the mean read depth is 110 at the sites with heteroplasmies in comparison to 473 and 459, respectively, at the reference and homoplasmic sites (p<0.001). For the WGS data, the mean depth is 656 at the sites with heteroplasmies, compared with 1633 and 1646 at the reference and homoplasmic sites, respectively (p<0.001).

Table 1 shows the numbers of reference sites, heteroplasmies and homoplasmies by sequencing dataset. There are a total of 49497 sites with homoplasmies called from at least one sequencing dataset, of which 46673 (94.3%) sites have identical homoplasmies called from both datasets (Supp. Figure 1a). The high overlap rate holds across the mtDNA genome. Moreover, the correlation coefficient is 0.987 between the (log2-transformed) numbers of homoplasmies of each subject in the two sequencing datasets (Figure 3b).

Table 1. Distribution of the numbers of reference sites, homoplasmies and heteroplasmies determined by targeted-seq and WGS. Among the 1720 sites determined as heteroplasmies by both datasets, 1715 have the same variants (identical mutant alleles) and 5 have different mutant alleles.

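| WGS call \ Targeted-seq call | Reference | Heteroplasmy | Homoplasmy |
|---|---|---|---|
| Reference | 24674560 | 3049 | 5 |
| Heteroplasmy | 6064 | 1715 (same mutant allele); 5 (different mutant alleles) | 2439 |
| Homoplasmy | 203 | 177 | 46673 |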
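The marginal totals and overlap rates quoted in the text follow directly from the cells of Table 1. The sketch below recomputes them as a consistency check; the variable names and the dictionary layout are illustrative choices, and the cell values are copied from the table.

```python
LABELS = ("reference", "heteroplasmy", "homoplasmy")

# Rows: WGS call; columns: targeted-seq call (values copied from Table 1).
# The heteroplasmy/heteroplasmy cell combines same-allele and
# different-allele sites.
SAME_ALLELE_HET = 1715
DIFF_ALLELE_HET = 5
TABLE1 = {
    "reference":    (24_674_560, 3_049, 5),
    "heteroplasmy": (6_064, SAME_ALLELE_HET + DIFF_ALLELE_HET, 2_439),
    "homoplasmy":   (203, 177, 46_673),
}

# Row sums give WGS totals; column sums give targeted-seq totals.
wgs_totals = {row: sum(cells) for row, cells in TABLE1.items()}
tseq_totals = {
    label: sum(TABLE1[row][j] for row in LABELS)
    for j, label in enumerate(LABELS)
}

shared_hom = TABLE1["homoplasmy"][2]
union_hom = wgs_totals["homoplasmy"] + tseq_totals["homoplasmy"] - shared_hom

print("WGS totals:         ", wgs_totals)
print("Targeted-seq totals:", tseq_totals)
print(f"homoplasmy overlap: {shared_hom / union_hom:.1%} of {union_hom} sites")
print(f"identical heteroplasmies: "
      f"{SAME_ALLELE_HET / tseq_totals['heteroplasmy']:.1%} of targeted-seq calls, "
      f"{SAME_ALLELE_HET / wgs_totals['heteroplasmy']:.1%} of WGS calls")
```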

For the heteroplasmies called from the WGS data, 6778 (66.3%) are located in the D-loop region, 302 (3.0%) are on ribosomal RNAs and 2906 (28.4%) are on protein-coding genes. For the heteroplasmies called from the targeted-seq data, 2041 (41.2%) of the heteroplasmies are in the D-loop region, 988 (20%) on ribosomal RNAs and 1653 (33.4%) on protein coding genes (Figure 4a). The heteroplasmies from WGS data are distributed at 1140 genome positions, of which 613 (53.8%) positions are singletons (i.e., only one subject has the heteroplasmy), 337 (29.6%) positions have 2 to 5 subjects with the heteroplasmy and 190 (16.7%) positions have more than 5 heteroplasmies (Figure 4b). From the targeted-seq data, heteroplasmies are distributed at 2287 positions, of which 1815 (79.4%) are singletons, 428 (18.7%) have 2 to 5 subjects with the heteroplasmies and only 44 (1.9%) positions have more than 5 heteroplasmies. Therefore, the targeted-seq data have heteroplasmies called at more positions (p<0.001) than the WGS data, but a larger proportion of its heteroplasmies are singletons (p<0.001).

Figure 4. (a) Numbers of heteroplasmies across the mtDNA genome from WGS and targeted-seq (transformed as log2(x+1)). First track: WGS (blue color); second track: targeted-seq (black color); (b) mtDNA positions categorized by the number of heteroplasmies; (c) and (d) Alternative allele frequencies for sites called as heteroplasmies by one sequencing dataset but as homoplasmies by the other. The dashed red line represents a proportion of 0.9 and the solid red line is for a proportion of 0.95.

For the pairwise comparison of the heteroplasmies at the site level, 1715 sites have identical heteroplasmies determined from both datasets (Supp. Figure 1a), which account for 34.7% of the heteroplasmies from the targeted-seq data and only 16.8% of the heteroplasmies from the WGS data. The details of these 1715 heteroplasmies are listed in Supp. Table 3. Regarding the sites with inconsistent calling results, 6064 heteroplasmies in the WGS data are determined as reference sites by targeted-seq, of which 95.7% have no mutant alleles (i.e., AAF=0) in the targeted-seq data. On the other hand, 3049 heteroplasmies called using the targeted-seq data are determined as reference sites in the WGS data (96% have no mutant alleles). For the sites called as heteroplasmies in one sequencing dataset but as homoplasmies in the other, Figures 4c and 4d compare their alternative allele frequencies. In Figure 4c, the AAFs are all greater than 0.95 from WGS but are smaller than 0.95 from targeted-seq, leading to different plasmies called. However, the AAFs at 67.2% of these sites are between 90% and 95% in the targeted-seq data, indicating differences of less than 5% from the threshold of 95%. Similarly, the AAFs at 71.6% of the sites called as heteroplasmies from WGS but as homoplasmies from targeted-seq differ by less than 5% (Figure 4d). The most extreme cases are the 12 sites with AAFs between 95% and 96% in the targeted-seq data but between 94% and 95% in the WGS data, where such small differences in the AAFs lead to different plasmy calls.

Table 2 shows the plasmies called at several mtDNA locations known to carry classic pathogenic mutations. For mutation 1555A>G, which is associated with deafness, 3243A>G, associated with mitochondrial myopathy, encephalopathy, lactic acidosis, and stroke-like episodes (MELAS), and 14484T>C, associated with Leber hereditary optic neuropathy (LHON), consistent heteroplasmies are called from the two sequencing datasets. The homoplasmies are also consistent. However, for mutation 3460G>A, associated with LHON, three participants have heteroplasmies identified from the targeted-seq data, but no mutations are called from the WGS data. The inconsistency may be due to the low read depths (50 or lower) at these three heteroplasmies in the targeted-seq data.

Table 2. mtDNA variants called at positions known with classic pathogenic mutations (*TOPMed subject IDs).

| Mutation | Position | Gene | ID* | Targeted-seq AAF | Targeted-seq Depth | Targeted-seq Plasmy | WGS AAF | WGS Depth | WGS Plasmy |
|---|---|---|---|---|---|---|---|---|---|
| 1555A>G | 1555 | RNR1 | NWD373937 | 100.0% | 423 | Homoplasmy | 99.7% | 3940 | Homoplasmy |
| 1555A>G | 1555 | RNR1 | NWD810582 | 12.9% | 387 | Heteroplasmy | 12.7% | 1325 | Heteroplasmy |
| 1555A>G | 1555 | RNR1 | NWD857530 | 7.3% | 191 | Heteroplasmy | 7.2% | 736 | Heteroplasmy |
| 3243A>G | 3243 | TL1 | NWD190277 | 7.5% | 1061 | Heteroplasmy | 7.7% | 1047 | Heteroplasmy |
| 3243A>G | 3243 | TL1 | NWD418976 | 9.7% | 907 | Heteroplasmy | 7.7% | 586 | Heteroplasmy |
| 3460G>A | 3460 | ND1 | NWD362792 | 6.0% | 50 | Heteroplasmy | 0.0% | 1311 | Reference |
| 3460G>A | 3460 | ND1 | NWD643458 | 6.2% | 48 | Heteroplasmy | 0.0% | 612 | Reference |
| 3460G>A | 3460 | ND1 | NWD675926 | 6.7% | 30 | Heteroplasmy | 0.0% | 973 | Reference |
| 11778G>A | 11778 | ND4 | NWD743129 | 100.0% | 451 | Homoplasmy | 99.9% | 2381 | Homoplasmy |
| 14484T>C | 14484 | ND6 | NWD759129 | 32.0% | 1011 | Heteroplasmy | 32.0% | 1347 | Heteroplasmy |

We further performed multiple sensitivity analyses of the comparisons of heteroplasmies. Firstly, we increased the lower threshold of AAF in calling heteroplasmies, that is, comparing more common variants. As the threshold increases from 5% to 20%, the number of heteroplasmies decreases as expected and the number from the WGS dataset decreases faster than that from the targeted-seq dataset (Supp. Figure 1b). At a 10% threshold, the numbers of heteroplasmies are 3753 and 3428 from the WGS and targeted-seq datasets, respectively, among which 1399 were found in both datasets. At a 20% threshold, the targeted-seq dataset yielded more heteroplasmies than the WGS dataset (1837 vs. 1315). Their intersection contains 668 heteroplasmies, which account for 50.8% and 36.4% of the heteroplasmies in the WGS and targeted-seq datasets, respectively. Therefore, the overlap rate becomes greater for more common heteroplasmies.
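The trend in overlap across thresholds can be tabulated from the counts reported in the text and in Table 1. The following sketch is only a bookkeeping aid: it expresses the shared heteroplasmies as a share of each dataset's calls at the 5%, 10% and 20% lower AAF thresholds. The tuple layout and function name are illustrative.

```python
# (lower AAF threshold, WGS calls, targeted-seq calls, calls shared by both),
# all taken from the counts reported in the Results.
REPORTED = [
    (0.05, 10_223, 4_946, 1_715),
    (0.10, 3_753, 3_428, 1_399),
    (0.20, 1_315, 1_837, 668),
]


def overlap_shares(n_wgs, n_tseq, n_shared):
    """Shared calls as a fraction of WGS calls and of targeted-seq calls."""
    return n_shared / n_wgs, n_shared / n_tseq


for threshold, n_wgs, n_tseq, n_shared in REPORTED:
    share_wgs, share_tseq = overlap_shares(n_wgs, n_tseq, n_shared)
    print(f"threshold {threshold:.0%}: "
          f"{share_wgs:.1%} of WGS calls, "
          f"{share_tseq:.1%} of targeted-seq calls are shared")
```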

Secondly, we only kept the sites with read depth greater than 100 in both datasets. About 11% of the sites were excluded. The genotypes are identical at 94.9% of the sites with depths greater than 100. A total of 5992 heteroplasmies are called from the WGS data, which contain 513 (56.3%) of the 911 heteroplasmies called from the targeted-seq data (Supp. Figure 2a). When we further increased the AAF threshold to 20% for common heteroplasmies, both datasets yield similar numbers of heteroplasmies (275 vs. 257) and 172 (66.9%) heteroplasmies called from the targeted-seq data overlap with those called from the WGS data (Supp. Figure 2b).

Thirdly, the WGS data were downsampled at a fraction of 33% to achieve coverage across mtDNA comparable to that of the targeted-seq data (Supp. Figure 3a). In comparing the results from the downsampled WGS data to those from the targeted-seq data, we found that the genotypes are identical at 95.6% of the sites. Meanwhile, 47419 homoplasmies and 10125 heteroplasmies are called from the downsampled WGS data, where 46541 (98.2%) of the homoplasmies overlap with the homoplasmies from the targeted-seq data and 1671 (16.5%) of the heteroplasmies overlap with those from the targeted-seq data (Supp. Figure 3b). For common heteroplasmies called at the cut-off of 20%, 670 (50%) from the downsampled data are also identified from the targeted-seq data. These findings are very similar to those based on the original WGS data without downsampling, which indicates that the different depths in the WGS and targeted-seq data have minimal effects on the study findings.
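A rough check shows why a 33% fraction brings the WGS depth into the range of the targeted-seq data: the median WGS depth of 1176 scales to about 388, compared with a targeted-seq median of 454. The sketch below computes that expectation and also simulates random thinning of a single site. It assumes each read is retained independently with the given probability, which is a simplification and not a description of how SAMtools selects reads; the seed and function name are arbitrary.

```python
import random


def thin_depth(depth, fraction, rng):
    """Number of reads kept when each of `depth` reads is retained
    independently with probability `fraction` (simplifying assumption)."""
    return sum(1 for _ in range(depth) if rng.random() < fraction)


WGS_MEDIAN_DEPTH = 1176    # reported median of mean read depth, WGS
TSEQ_MEDIAN_DEPTH = 454    # reported median of mean read depth, targeted-seq
FRACTION = 0.33            # downsampling fraction used in the sensitivity analysis

expected_depth = WGS_MEDIAN_DEPTH * FRACTION

rng = random.Random(2021)  # fixed seed so the simulation is repeatable
simulated_depth = thin_depth(WGS_MEDIAN_DEPTH, FRACTION, rng)

print(f"expected depth after downsampling: {expected_depth:.0f}")
print(f"one simulated site:                {simulated_depth}")
print(f"targeted-seq median depth:         {TSEQ_MEDIAN_DEPTH}")
```

Because the retained depth at any one site varies around its expectation, low-frequency alleles supported by only a few reads can appear or vanish after thinning.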

4. Discussion

In this study, we compared whole genome sequencing and targeted sequencing in analyzing mitochondrial genomes based on the participants enrolled in the SARP program. Targeted sequencing shows capacity comparable to whole genome sequencing in determining genotypes and allele frequencies across the mitochondrial genome. The two sequencing approaches also yield highly consistent results for homoplasmies and haplogroups. Regarding heteroplasmies, the WGS data called more heteroplasmies than the targeted-seq data, while the targeted-seq data yielded more singletons. Despite the highly correlated alternative allele frequencies, the heteroplasmies called from the two datasets are considerably variable, with only about 17% of the heteroplasmies from WGS data also identified from the targeted-seq data. However, the overlap rate increases to 30%−50% for more common heteroplasmies (e.g., AAF at 10% or 20%). Furthermore, more consistent results are found at positions known to carry classic mtDNA mutations.

One possible reason for inconsistent calling of heteroplasmies is the application of a single dichotomous cut-off to the alternative allele frequencies. We used a cut-off of 5–95% to call heteroplasmies, which is the same threshold used in Liu et al. (2018). Different cut-offs were used in other studies (20,36). While such variant-calling criteria are straightforward and easy to interpret in practice, there is obvious ambiguity when the allele frequencies are close to the pre-selected cut-offs. In particular, allele frequencies with small differences but on different sides of the cut-offs will lead to completely different calling results. Moreover, there is no established strategy to select an optimal cut-off. The higher the cut-off, the lower the prevalence of heteroplasmies. In addition to MitoCaller used in our analyses, existing packages also call heteroplasmies by dichotomizing other measures representing heteroplasmy level. A similar algorithm was proposed in Ye et al. (2014), which calls heteroplasmies by categorizing the log likelihood ratio, a metric also adopted in mtDNA-Server (37). The package MToolBox relies on heteroplasmic fraction and the recent NOVOPlasty package uses minor allele frequency (34,38). Therefore, it is of great interest to further refine and improve variant calling algorithms tailored for mtDNA.

Nuclear copies of mitochondrial DNA segments (NUMTs) could lead to false positives in calling heteroplasmies, especially for WGS data, since NUMTs can be mistaken for real mtDNA sequences (34). In our analysis, we only included reads uniquely mapped to mtDNA to minimize possible interference of NUMTs, which is also the approach used in Ye et al. (2014) and Ding et al. (2015) (15,20). Furthermore, 1) our reads are paired-end and have a long length of 150 bp or 151 bp; 2) our analysis is restricted to reads with mapping quality score greater than 20; and 3) the MitoCaller algorithm used in our analysis requires all alleles of a heteroplasmy to be observed at least once on both strands. To further quantify the impact of NUMTs in calling heteroplasmies for the WGS data, we conducted a realignment experiment by mapping the WGS reads to the reference genome GRCh38 that contains the NUMTs. We found that only 3.6% of the reads mapped to rCRS are re-mapped to NUMTs. Among all the heteroplasmies identified using rCRS, only 56 (0.5%) are not identified using GRCh38 (Supp. Figure 4a). While there are some (4.4%) new heteroplasmies identified using GRCh38, the overlap rate is still reasonably high at 95.1%. Also, the AAFs at the heteroplasmic sites obtained with the two reference genomes are highly correlated (r=0.996, Supp. Figure 4b). Therefore, the impact of NUMTs is minimal in our study, although our analysis cannot completely exclude the interference of NUMTs.

To our knowledge, our study is the only large-cohort study to compare whole genome sequencing and targeted sequencing for detecting mtDNA variants. The paired data obtained from different portions of the same DNA sample of each participant enable a direct and informative comparison. Furthermore, we applied a common analysis pipeline to both sequencing datasets. Our study has several limitations. First, there are no gold standard results for the SARP participants against which to benchmark either WGS or targeted sequencing and so provide more definitive conclusions about their respective accuracies and drawbacks. Deep sequencing (e.g., 5000~10000 fold coverage for mtDNA) data from family pedigrees or mother-child pairs were used as true signals in other studies (20). However, such pedigrees are not present in our study cohort. Secondly, the targeted sequencing approach was originally designed for high-throughput haplogrouping. As such, the read depth is considerably lower than for WGS, which limits the detection of heteroplasmies. In addition, biases in amplification that skew allelic ratios cannot be excluded. The Qiagen Repli-G system is an isothermal rolling circle amplification using high fidelity phi-29 polymerase, and some artefactual errors could be introduced during this initial step and masquerade as heteroplasmy. Thirdly, copy number variation of mtDNA was not examined and compared since mtDNA copy numbers cannot be estimated from the targeted sequencing data. Copy number of mtDNA is typically calculated under the assumption that sequencing coverages are proportional to the copies of autosomal and mtDNA. Since autosomal DNA has two copies in each cell, the copy number of mtDNA can be estimated as twice the ratio of mtDNA coverage over autosomal DNA coverage. Therefore, one advantage of whole genome sequencing is the availability of the autosomal data that provide direct estimation of mtDNA copy numbers. Lastly, insertions and deletions (INDELs) were not assessed in our study since MitoCaller does not estimate these structural variants. Future research will be required to compare INDELs from WGS and targeted-seq.
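The copy-number relationship described in the third limitation above can be written as a one-line estimator. The sketch below is illustrative only: the function name is hypothetical, the example coverages are invented for demonstration, and the calculation relies on the stated assumption that coverage is proportional to copy number.

```python
def estimate_mtdna_copy_number(mt_coverage, autosomal_coverage, autosomal_copies=2):
    """Estimate mtDNA copies per cell as
    autosomal_copies * (mtDNA coverage / autosomal coverage)."""
    if autosomal_coverage <= 0:
        raise ValueError("autosomal coverage must be positive")
    if mt_coverage < 0:
        raise ValueError("mtDNA coverage cannot be negative")
    return autosomal_copies * mt_coverage / autosomal_coverage


if __name__ == "__main__":
    # Hypothetical coverages: 3000x over mtDNA and 30x over the autosomes.
    print(estimate_mtdna_copy_number(3000.0, 30.0))
```

Targeted-seq data lack the autosomal coverage needed for the denominator, which is why this estimate is available only from WGS.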

Despite these limitations, our study demonstrates comparable performance of whole genome sequencing and targeted sequencing in detecting mtDNA homoplasmies and haplogroups using a large cohort with well-characterized phenotypes. Large variability exists in the heteroplasmies detected, which implies that while NGS technologies are able to provide high throughput data for heteroplasmy detection, caution should be taken in interpreting heteroplasmies detected from different sequencing approaches. Moreover, enrichment strategy and selection of the reference genome can also lead to very different results for heteroplasmies (39). Given the increasing interest in mitochondrial pathogenicity in the scientific community, future research is needed to derive accurate detection methods for mitochondrial variants.

Supplementary Material

Supplementary Table 3

Acknowledgements

This work was supported by awards from the National Heart, Lung, and Blood Institute (HL081064, HL103453, and HL109250 to SCE, and R35HL140019 to MAA); this project was also supported in part by the National Center for Advancing Translational Sciences (ULTR000439). S.C. Erzurum is supported in part by the Alfred Lerner Chair for Biomedical Research.

SARP was supported by awards from National Heart, Lung, and Blood Institute (U10 HL109172, U10 HL109168, U10 HL109152, U10 HL109257, U10 HL109046, U10 HL109250, U10 HL109164, U10 HL109086). TOPMed WGS data was supported by the National Heart, Lung, and Blood Institute. Core support including centralized genomic read mapping and genotype calling, along with variant quality metrics and filtering were provided by the TOPMed Informatics Research Center (3R01HL-117626–02S1; contract HHSN268201800002I). Core support including phenotype harmonization, data management, sample-identity QC, and general program coordination were provided by the TOPMed Data Coordinating Center (R01HL-120393; U01HL-120393; contract HHSN268201800001I). We gratefully acknowledge the studies and participants who provided biological samples and data.

The authors would like to thank the editor and the reviewers for their insightful comments, which significantly improved the quality of this manuscript.

REFERENCES

  • 1. Anderson S, Bankier AT, Barrell BG, de Bruijn MH, Coulson AR, Drouin J, Eperon IC, Nierlich DP, Roe BA, Sanger F et al. (1981) Sequence and organization of the human mitochondrial genome. Nature, 290, 457–465.
  • 2. Clay Montier LL, Deng JJ and Bai Y (2009) Number matters: control of mammalian mitochondrial DNA copy number. J Genet Genomics, 36, 125–131.
  • 3. Fazzini F, Schopf B, Blatzer M, Coassin S, Hicks AA, Kronenberg F and Fendt L (2018) Plasmid-normalized quantification of relative mitochondrial DNA copy number. Sci Rep, 8, 15347.
  • 4. Wallace DC (1994) Mitochondrial DNA sequence variation in human evolution and disease. Proc Natl Acad Sci U S A, 91, 8739–8746.
  • 5. Elson JL, Samuels DC, Turnbull DM and Chinnery PF (2001) Random intracellular drift explains the clonal expansion of mitochondrial DNA mutations with age. Am J Hum Genet, 68, 802–806.
  • 6. Nekhaeva E, Bodyak ND, Kraytsberg Y, McGrath SB, Van Orsouw NJ, Pluzhnikov A, Wei JY, Vijg J and Khrapko K (2002) Clonally expanded mtDNA point mutations are abundant in individual cells of human tissues. Proc Natl Acad Sci U S A, 99, 5521–5526.
  • 7. Gorman GS, Schaefer AM, Ng Y, Gomez N, Blakely EL, Alston CL, Feeney C, Horvath R, Yu-Wai-Man P, Chinnery PF et al. (2015) Prevalence of nuclear and mitochondrial DNA mutations related to adult mitochondrial disease. Ann Neurol, 77, 753–759.
  • 8. Kang E, Wu J, Gutierrez NM, Koski A, Tippner-Hedges R, Agaronyan K, Platero-Luengo A, Martinez-Redondo P, Ma H, Lee Y et al. (2016) Mitochondrial replacement in human oocytes carrying pathogenic mitochondrial DNA mutations. Nature, 540, 270–275.
  • 9. Tang S and Huang T (2010) Characterization of mitochondrial DNA heteroplasmy using a parallel sequencing system. Biotechniques, 48, 287–296.
  • 10. Taylor RW, Pyle A, Griffin H, Blakely EL, Duff J, He L, Smertenko T, Alston CL, Neeve VC, Best A et al. (2014) Use of whole-exome sequencing to determine the genetic basis of multiple mitochondrial respiratory chain complex deficiencies. JAMA, 312, 68–77.
  • 11. Taylor RW and Turnbull DM (2005) Mitochondrial DNA mutations in human disease. Nat Rev Genet, 6, 389–402.
  • 12. Tuppen HA, Blakely EL, Turnbull DM and Taylor RW (2010) Mitochondrial DNA mutations and human disease. Biochim Biophys Acta, 1797, 113–128.
  • 13. Wallace DC (2005) A mitochondrial paradigm of metabolic and degenerative diseases, aging, and cancer: a dawn for evolutionary medicine. Annu Rev Genet, 39, 359–407.
  • 14. Xu W, Ghosh S, Comhair SA, Asosingh K, Janocha AJ, Mavrakis DA, Bennett CD, Gruca LL, Graham BB, Queisser KA et al. (2016) Increased mitochondrial arginine metabolism supports bioenergetics in asthma. J Clin Invest, 126, 2465–2481.
  • 15. Ye K, Lu J, Ma F, Keinan A and Gu Z (2014) Extensive pathogenicity of mitochondrial heteroplasmy in healthy human individuals. Proc Natl Acad Sci U S A, 111, 10654–10659.
  • 16. Ramos A, Santos C, Mateiu L, Gonzalez Mdel M, Alvarez L, Azevedo L, Amorim A and Aluja MP (2013) Frequency and pattern of heteroplasmy in the complete human mitochondrial genome. PLoS One, 8, e74636.
  • 17. Duan M, Tu J and Lu Z (2018) Recent Advances in Detecting Mitochondrial DNA Heteroplasmic Variations. Molecules, 23.
  • 18. Just RS, Irwin JA and Parson W (2015) Mitochondrial DNA heteroplasmy in the emerging field of massively parallel sequencing. Forensic Sci Int Genet, 18, 131–139.
  • 19. Huang T (2011) Next generation sequencing to characterize mitochondrial genomic DNA heteroplasmy. Curr Protoc Hum Genet, Chapter 19, Unit19 18.
  • 20. Ding J, Sidore C, Butler TJ, Wing MK, Qian Y, Meirelles O, Busonero F, Tsoi LC, Maschio A, Angius A et al. (2015) Assessing Mitochondrial DNA Variation and Copy Number in Lymphocytes of ~2,000 Sardinians Using Tailored Sequencing Analysis Tools. PLoS Genet, 11, e1005306.
  • 21. Duan M, Chen L, Ge Q, Lu N, Li J, Pan X, Qiao Y, Tu J and Lu Z (2019) Evaluating heteroplasmic variations of the mitochondrial genome from whole genome sequencing data. Gene, 699, 145–154.
  • 22. Nemeth K, Darvasi O, Liko I, Szucs N, Czirjak S, Reiniger L, Szabo B, Kurucz PA, Krokker L, Igaz P et al. (2019) Next-generation sequencing identifies novel mitochondrial variants in pituitary adenomas. J Endocrinol Invest, 42, 931–940.
  • 23. Payne BA, Wilson IJ, Yu-Wai-Man P, Coxhead J, Deehan D, Horvath R, Taylor RW, Samuels DC, Santibanez-Koref M and Chinnery PF (2013) Universal heteroplasmy of human mitochondrial DNA. Hum Mol Genet, 22, 384–390.
  • 24. Dolle C, Flones I, Nido GS, Miletic H, Osuagwu N, Kristoffersen S, Lilleng PK, Larsen JP, Tysnes OB, Haugarvoll K et al. (2016) Defective mitochondrial DNA homeostasis in the substantia nigra in Parkinson disease. Nat Commun, 7, 13548.
  • 25. Kang E, Wang X, Tippner-Hedges R, Ma H, Folmes CD, Gutierrez NM, Lee Y, Van Dyken C, Ahmed R, Li Y et al. (2016) Age-Related Accumulation of Somatic Mitochondrial DNA Mutations in Adult-Derived Human iPSCs. Cell Stem Cell, 18, 625–636.
  • 26. Liu C, Fetterman JL, Liu P, Luo Y, Larson MG, Vasan RS, Zhu J and Levy D (2018) Deep sequencing of the mitochondrial genome reveals common heteroplasmic sites in NADH dehydrogenase genes. Hum Genet, 137, 203–213.
  • 27. Rygiel KA, Tuppen HA, Grady JP, Vincent A, Blakely EL, Reeve AK, Taylor RW, Picard M, Miller J and Turnbull DM (2016) Complex mitochondrial DNA rearrangements in individual cells from patients with sporadic inclusion body myositis. Nucleic Acids Res, 44, 5313–5329.
  • 28. Jarjour NN, Erzurum SC, Bleecker ER, Calhoun WJ, Castro M, Comhair SA, Chung KF, Curran-Everett D, Dweik RA, Fain SB et al. (2012) Severe asthma: lessons learned from the National Heart, Lung, and Blood Institute Severe Asthma Research Program. Am J Respir Crit Care Med, 185, 356–362.
  • 29. Moore WC, Bleecker ER, Curran-Everett D, Erzurum SC, Ameredes BT, Bacharier L, Calhoun WJ, Castro M, Chung KF, Clark MP et al. (2007) Characterization of the severe asthma phenotype by the National Heart, Lung, and Blood Institute’s Severe Asthma Research Program. J Allergy Clin Immunol, 119, 405–413.
  • 30. Teague WG, Phillips BR, Fahy JV, Wenzel SE, Fitzpatrick AM, Moore WC, Hastie AT, Bleecker ER, Meyers DA, Peters SP et al. (2018) Baseline Features of the Severe Asthma Research Program (SARP III) Cohort: Differences with Age. J Allergy Clin Immunol Pract, 6, 545–554 e544.
  • 31. Zein J, Gaston B, Bazeley P, DeBoer MD, Igo RP Jr., Bleecker ER, Meyers D, Comhair S, Marozkina NV, Cotton C et al. (2020) HSD3B1 genotype identifies glucocorticoid responsiveness in severe asthma. Proc Natl Acad Sci U S A, 117, 2187–2193.
  • 32. Li H and Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760.
  • 33. Weissensteiner H, Pacher D, Kloss-Brandstatter A, Forer L, Specht G, Bandelt HJ, Kronenberg F, Salas A and Schonherr S (2016) HaploGrep 2: mitochondrial haplogroup classification in the era of high-throughput sequencing. Nucleic Acids Res, 44, W58–63.
  • 34. Dierckxsens N, Mardulyn P and Smits G (2019) Unraveling heteroplasmy patterns with NOVOPlasty. NAR Genomics and Bioinformatics, 2.
  • 35. Griffin HR, Pyle A, Blakely EL, Alston CL, Duff J, Hudson G, Horvath R, Wilson IJ, Santibanez-Koref M, Taylor RW et al. (2014) Accurate mitochondrial DNA sequencing using off-target reads provides a single test to identify pathogenic point mutations. Genet Med, 16, 962–971.
  • 36. Li M, Schonberg A, Schaefer M, Schroeder R, Nasidze I and Stoneking M (2010) Detecting heteroplasmy from high-throughput sequencing of complete human mitochondrial DNA genomes. Am J Hum Genet, 87, 237–249.
  • 37. Weissensteiner H, Forer L, Fuchsberger C, Schopf B, Kloss-Brandstatter A, Specht G, Kronenberg F and Schonherr S (2016) mtDNA-Server: next-generation sequencing data analysis of human mitochondrial DNA in the cloud. Nucleic Acids Res, 44, W64–69.
  • 38. Calabrese C, Simone D, Diroma MA, Santorsola M, Gutta C, Gasparre G, Picardi E, Pesole G and Attimonelli M (2014) MToolBox: a highly automated pipeline for heteroplasmy annotation and prioritization analysis of human mitochondrial variants in high-throughput sequencing. Bioinformatics, 30, 3115–3117.
  • 39. Santibanez-Koref M, Griffin H, Turnbull DM, Chinnery PF, Herbert M and Hudson G (2019) Assessing mitochondrial heteroplasmy using next generation sequencing: A note of caution. Mitochondrion, 46, 302–306.
