PLOS Computational Biology. 2021 Nov 11;17(11):e1009594. doi: 10.1371/journal.pcbi.1009594

MitoScape: A big-data, machine-learning platform for obtaining mitochondrial DNA from next-generation sequencing data

Larry N Singh 1,*, Brian Ennis 2, Bryn Loneragan 3, Noah L Tsao 4, M Isabel G Lopez Sanchez 3, Jianping Li 5, Patrick Acheampong 1, Oanh Tran 6, Ian A Trounce 3, Yuankun Zhu 2, Prasanth Potluri 1; Regeneron Genetics Center7, Beverly S Emanuel 6, Daniel J Rader 8, Zoltan Arany 9, Scott M Damrauer 4, Adam C Resnick 2, Stewart A Anderson 5, Douglas C Wallace 1
Editor: Manja Marz10
PMCID: PMC8610268  PMID: 34762648

Abstract

The growing volume of next-generation sequencing (NGS) data presents a unique opportunity to study the combined impact of mitochondrial and nuclear-encoded genetic variation in complex disease. Mitochondrial DNA variants, and in particular heteroplasmic variants, are critical for determining human disease severity. While there are approaches for obtaining mitochondrial DNA variants from NGS data, these tools do not account for the unique characteristics of mitochondrial genetics and can be inaccurate even for homoplasmic variants. We introduce MitoScape, a novel, big-data software platform for extracting mitochondrial DNA sequences from NGS. MitoScape departs from other algorithms by using machine learning to model the unique characteristics of mitochondrial genetics. We also employ a novel approach of using rho-zero (mitochondrial DNA-depleted) data to model nuclear-encoded mitochondrial sequences. We showed that MitoScape produces accurate heteroplasmy estimates using gold-standard mitochondrial DNA data. We provide a comprehensive comparison of the most common tools for obtaining mtDNA variants from NGS and showed that MitoScape had superior performance to the compared tools in every statistical category we compared, including false positives and false negatives. By applying MitoScape to common disease examples, we illustrate how MitoScape facilitates important heteroplasmy-disease association discoveries by expanding upon a reported association between hypertrophic cardiomyopathy and mitochondrial haplogroup T in men (adjusted p-value = 0.003). The improved accuracy of mitochondrial DNA variants produced by MitoScape will be instrumental in diagnosing disease in the context of personalized medicine and clinical diagnostics.

Author summary

Recent studies have highlighted the importance of mitochondrial DNA variation in both primary mitochondrial disease and complex human pathology, including COVID-19 and space-flight stress. The vast amount of existing next-generation sequencing (NGS) data can be leveraged to interrogate both nuclear and mitochondrial DNA (mtDNA) sequence simultaneously, allowing for analysis of the interplay between mitochondrial and nuclear encoded genes in mitochondrial function. Identifying mtDNA sequence accurately is complicated by the presence of nuclear encoded mitochondrial sequences (NUMTs), which are homologous to mtDNA. Current software for analyzing mtDNA from NGS does not accurately model the unique characteristics of mitochondrial genetics. We introduce MitoScape, a novel, big-data software platform that models mitochondrial genetics through machine learning to accurately identify mtDNA sequence from NGS data. MitoScape takes advantage of rho-zero cell data to model the characteristics of NUMTs. We show that MitoScape produces more accurate heteroplasmy estimates compared to published software. We provide an example of applying MitoScape in replicating an association between hypertrophic cardiomyopathy and mitochondrial haplogroup T in men. MitoScape is an important contribution to mitochondrial genomics, allowing for accurate mtDNA variants and the ability to tailor mtDNA analysis to different population and disease contexts, which is not available in other software.


This is a PLOS Computational Biology Software paper.

Introduction

Both mitochondrial DNA (mtDNA) and nuclear DNA (nDNA) variants are known to impair the function and structure of mitochondria, leading to primary mitochondrial disease [1]. But studies have also implicated mtDNA variants in a myriad of common, complex, human diseases, including cancer, cardiovascular disease, diabetes and neurodegenerative disease [2–6]. More recently, mtDNA variation and damage have been implicated in COVID-19 [7,8] and even spaceflight [9]. Thus, there is a need to interrogate both mtDNA and nDNA variants simultaneously in both primary mitochondrial and complex disease. Existing, large-scale, next-generation sequencing (NGS) datasets are a valuable resource for retrospectively analyzing both mtDNA and nDNA variation in an array of common diseases. Today, such large datasets are both abundant and necessary in genetic association studies for overcoming biases and false negatives due to a lack of statistical power. For example, The Cancer Mitochondrial Atlas (TCMA) identified signatures of mtDNA variation in different forms of cancer, using data from thousands of whole genome sequencing (WGS) samples [5].

Due to fundamental differences between Mendelian and mitochondrial genetics, erroneous interpretation and poor data analysis are common in analyzing mtDNA [10]. Inherent complexities of mitochondrial genetics include heteroplasmy, nuclear-encoded mtDNA sequences (NUMTs), and low-complexity regions. Heteroplasmy is the presence of multiple copies of mtDNA with differing nucleotide sequences in a single cell or population of cells and tissues. Heteroplasmy arises since each human nucleated cell typically contains hundreds to thousands of mitochondria, with each mitochondrion containing 2–8 copies of mtDNA, and each copy accumulating independent variants. Low and high percentages of mtDNA variants give rise to low and high heteroplasmy, respectively. While nDNA variants follow the laws of Mendelian genetics, mtDNA variants abide by the principles of population genetics [6]. The prevailing hypothesis is that cells are resilient to low levels of mtDNA having genetic defects, and biochemical defects only occur once the levels of defective mtDNA exceed a critical threshold, a phenomenon termed the threshold effect [11]. Since the level of heteroplasmy appears to be positively correlated with disease severity, the threshold effect suggests that high percentages of an mtDNA variant are required to produce a functional effect. But detecting low-level heteroplasmy is essential for two reasons: first, since heteroplasmy varies by tissue, low heteroplasmy in blood may indicate high heteroplasmy in internal organs; and second, low-level heteroplasmy variants appear to be widespread and present in all humans, and can be heritable and functional [12]. Low-level heteroplasmy also increases with age and hence may contribute to common late-onset diseases [12]. Consequently, the primary question in mitochondrial genetics is not whether a variant exists, but at what heteroplasmy level. Thus, accurate computation of mtDNA heteroplasmy, especially at low levels, is crucial for understanding and diagnosing complex disease.

Contributing to the complexity of obtaining heteroplasmy levels, are NUMTs, which result from mtDNA fragments that were transferred into the nucleus and incorporated into the nuclear genome [13]. The formation of NUMTs, termed numtogenesis, is a dynamic, on-going, and evolutionarily-conserved process [14]. Human NUMTs range from 64–100% sequence identity to mtDNA and vary in length from approximately 40bp to almost the entire mtDNA [15]. Reads from standard NGS are shorter than many NUMTs, which means that using sequence alignments alone to distinguish mtDNA from NUMTs is prone to error. Alignment of a NUMT to the mtDNA, results in a false positive and inflation of heteroplasmy. Conversely, alignment of mtDNA to NUMTs, results in false negatives and an underestimate of heteroplasmy. The effects of NUMT variants are often underappreciated in studies of mtDNA heteroplasmy [16], and the effect of NUMTs on mtDNA heteroplasmy is perceived to be negligible. This assumption is flawed, however, since the human genome contains over 700 germline NUMTs and multiple NUMTs correspond to the same mtDNA region [17].

Current computational methods for obtaining mtDNA variants from NGS data can be broadly classified into two categories: 1) those that rely on unique alignments of mtDNA (unique alignment approach), and 2) those that rely on post-filtering of mtDNA variants (post-filtering approach). Unique alignment approaches such as MToolBox [18] contend with NUMTs by discarding sequence reads that do not uniquely map to mtDNA. Such approaches that solely rely on sequence alignment are likely to result in genuine mtDNA being discarded and both over- and underestimates of heteroplasmy [6,19,20]. Post-filtering approaches such as mtDNA-Server [21] limit the influence of NUMTs on mtDNA variants by discarding or flagging variants that flank published NUMT regions [5]. Post-filtering has several drawbacks compared to the unique alignment approach. First, filtering can result in potentially important variants being overlooked. Second, the composition of NUMTs varies depending on the disease, population and even individual [15,22], making a complete list of NUMT regions impractical to formulate. Furthermore, the NUMTs can also vary by tissue or cells collected, meaning that post-filtering is not generalizable to different samples collected. Third, post-filtering of mtDNA variants suspected to be from NUMTs overlooks the fact that mtDNA variants are often in high linkage-disequilibrium and cannot be treated independently of each other. Post-filtering loses information about which variants occur on the same read, an important factor in determining legitimate mtDNA variants. Fourth, mtDNA copy number estimation, another important measure of mtDNA variability, is not possible from post-filtering methods. Fifth, post-filtering does not accommodate retrospective quality-control analysis of NGS reads. Sixth, from a software engineering perspective, post-filtering systems are tightly coupled, inflexible to changes in individual components (for example, changes to the alignment software), and do not scale or generalize well. In summary, accurate mtDNA variants are only possible from accurate mtDNA alignments. Low complexity regions suffer similar consequences as NUMTs, as nucleotides that flank these regions are commonly filtered [21,23].

Published methods for obtaining mtDNA variants are based on some combination of the aforementioned techniques and assumptions. These approaches rely on rigid and often arbitrary thresholds for filtering NUMTs, and do not adequately model the variable nature of mtDNA. Sequence alignment alone is insufficient to distinguish mtDNA from NUMTs, and filtering of NUMT and low-complexity regions is restrictive. We present MitoScape, a novel, machine learning-based software to align mtDNA from NGS data (Fig 1 and Design and Implementation). MitoScape incorporates two novel advancements: first, we use machine learning to model and learn the unique characteristics of mtDNA and NUMTs; and second, we use rho-zero cells for the first time as a source of NUMTs for training the classifier. An advantage of a machine learning classifier is that mtDNA sequence alignments are assigned probabilities of being from mtDNA as opposed to being discarded, unknown to the user. Furthermore, we take advantage of the fact that machine learning can learn the salient characteristics of NGS to discriminate mtDNA reads from NUMT reads, without making unnecessary and arbitrary assumptions. The training datasets can also be altered to accommodate different populations or diseases, for example. MitoScape is a big-data, cloud-based software system and is scalable to virtually any number of NGS samples. The main application of MitoScape is to retrospectively analyze mtDNA variation in existing NGS data in common disease contexts. We tested MitoScape on a novel, gold-standard benchmark dataset comprised of mtDNA-enriched data.

Fig 1. Overview of MitoScape algorithm.

Fig 1

WGS data containing total DNA includes both mtDNA and NUMTs. After alignment to the reference genome, some NUMTs will erroneously align to mtDNA, and some mtDNA will erroneously align to NUMTs. To correct these alignment errors, we use a random forest classifier. The classifier is trained on positive, mtDNA-enriched alignments, and negative mitochondria-depleted alignments. We also use linkage disequilibrium r² scores and common NUMT locations to determine the probability that an ambiguous read is truly from mtDNA.

Design and implementation

Ethics statement

The IBBC study was approved by “The Committees for the Protection of Human Subjects (IRB)” at the Children’s Hospital of Philadelphia, under protocol “Genetic Modifiers of 22q11.2 Abnormalities”, IRB 07–005352. Each participant and his or her caregiver, when appropriate, provided informed written consent/assent to participate prior to recruitment.

Training data

Aligning NGS reads to the human genome reference sequence results in a subset of reads that align ambiguously to both the mtDNA (revised Cambridge Reference Sequence or rCRS [24]) and the nDNA (Fig 1). Sequence alignments alone cannot discriminate mtDNA from NUMTs. Rather than rely on restrictive filters that use hard thresholds, we developed a novel approach using a machine learning classifier to compute the probability that a sequencing read is from mtDNA. Our classifier automatically learns characteristics of both mtDNA and NUMT reads to better align mtDNA sequences. We use a positive training set comprised of authentic mtDNA reads, and a negative training set comprised of NUMT sequences for supervised learning. For the negative training set, we sequenced both wild-type and mtDNA-depleted (rho-zero or ρ0) WAL2A lymphoblastoid and 143B osteosarcoma cell lines [25], each in duplicate, for a total of eight samples. The use of rho-zero cells to model NGS characteristics of NUMT sequences is a novel departure from other computational approaches for aligning mtDNA. To obtain NUMT sequences from each sample:

  1. Align reads to the rCRS.

  2. Re-align the aligned reads from step 1 to the nuclear genome, i.e. all of the human reference genome except for the rCRS.

  3. These aligned reads comprise the negative training set of NUMTs.

For the positive training set, we sequenced ten samples of mtDNA-enriched samples generated by amplifying mtDNA in two overlapping long-range PCR fragments of about 8,500bp each from WAL2A cell lines. The resulting amplified DNA sequences were then aligned to the rCRS to create our positive training set of mtDNA reads.

Feature selection and machine learning classifier

We chose a random forest classifier [26] for resolving ambiguities in aligning reads to the rCRS, due to this classifier’s simplicity and resistance to overfitting when the number of training samples is small, as is our case. We also tested gradient boosted trees as a classifier but found that random forest classifiers had a lower test error. Training of the classifier was performed using k-fold cross-validation whereby 80% of all reads from all samples were chosen at random for training and the remaining 20% used for validation. An 80%-20% split corresponds to k = 5, with four groups for training and one group for testing. Defined as 2 × (precision × recall) / (precision + recall), the resulting F1 score from k-fold cross-validation was 0.81.
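The F1 definition and the 80%-20% split described above can be illustrated with a short sketch. This is not MitoScape's own code (which is written in Scala on Apache Spark); the function names `f1_score` and `k_fold_splits` are hypothetical and chosen only for illustration.

```python
import random

def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2 * P * R / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds.

    With k = 5, each fold trains on 80% of the items and validates
    on the remaining 20%.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# Example: 100 reads split into five folds of 80 training / 20 validation reads.
splits = list(k_fold_splits(100, k=5))
```

Each read appears in exactly one validation fold, so every read is used for validation once and for training four times.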

To train the classifier, we required quantities that are measurable and informative of whether a read is a NUMT or mtDNA, termed features. Several features were tested for model selection using the random forest classifier (Table 1). We discuss those features here. According to data in our human mitochondrial genome database, MITOMAP (www.mitomap.org), approximately 55% of mtDNA loci have been reported as mtDNA variants. Furthermore, the frequency of each mtDNA variant is dependent on the population of interest. Therefore, relying solely on mtDNA variant frequency for filtering is not informative and would lead to overfitting of the machine learning algorithm. Instead, we developed a novel solution as follows. We observe that due to the high linkage disequilibrium (LD) among mtDNA variants, many mtDNA variants are inherited as a haplotype or haplogroup [3]. Therefore, we computed LD r² scores for all mtDNA variants from 45,494 GenBank hand-curated mtDNA sequences from MITOMAP, using Plink version 1.9 (atgu.mgh.harvard.edu/plinkseq/). These LD scores were used to compute the probability of two variants on a paired-end read appearing together on the same mtDNA sequence. A low probability suggests that the read is from a NUMT rather than mtDNA. To obtain an initial set of variants in each read, we developed a basic variant caller in MitoScape, which called variants based on the mismatching positions (MD) tags in the sequence alignment/map format (SAM) fields of an aligned paired-end read.
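For readers unfamiliar with the LD r² statistic, the following minimal sketch computes it for one pair of variant sites using the standard definition r² = D² / (pA(1 − pA) pB(1 − pB)), where D = pAB − pA·pB. The paper computed these scores with Plink; this function, its name `ld_r2`, and the toy input are assumptions for illustration only. How MitoScape converts r² scores into a per-read probability is not specified here and is not modeled.

```python
def ld_r2(haplotypes, site_a, site_b):
    """Return LD r^2 between two variant sites, or None if undefined.

    `haplotypes` is a list of sets; each set holds the variant positions
    carried by one mtDNA sequence (a hypothetical, simplified encoding).
    """
    n = len(haplotypes)
    if n == 0:
        return None
    p_a = sum(site_a in h for h in haplotypes) / n
    p_b = sum(site_b in h for h in haplotypes) / n
    p_ab = sum(site_a in h and site_b in h for h in haplotypes) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        # At least one site is monomorphic, so r^2 is undefined.
        return None
    d = p_ab - p_a * p_b
    return d * d / denominator

# Toy data: two sites that always occur together are in perfect LD.
linked = [{73, 263}, {73, 263}, set(), set()]
independent = [{73}, {263}, {73, 263}, set()]
```

Here `ld_r2(linked, 73, 263)` evaluates to 1.0 and `ld_r2(independent, 73, 263)` to 0.0.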

Table 1. Summary of features considered for random forest classifier.

Each feature is considered for determining whether the alignment of the read in SAM format corresponds to mtDNA or a NUMT. The SAM Tag field indicates the corresponding field in the SAM alignment format specification. Features marked with an asterisk (*) were used in the final model for the random forest classifier.

Feature | Description | SAM Tag
mtDNA edit distance* | Edit distance between the read and aligned mtDNA sequence. A lower edit distance indicates that there are fewer differences between the read sequence and the reference sequence. | NM
Nuclear edit distance* | Edit distance between the read and aligned nuclear genomic sequence. | NM
mtDNA alignments | Number of alignments to the mtDNA sequence | NH
Nuclear alignments | Number of alignments to the nuclear genomic sequences | NH
Mapping score | Non-normalized mapping quality score | XQ
Mapping quality | Mapping quality field of SAM entry |
MT LD* | Linkage disequilibrium scores of variants within the paired alignment to mtDNA |
NUMT Overlap* | Percentage of overlap of the paired read with a known, validated NUMT region. |

The composition of NUMTs in a genome is highly variable and depends on population, disease, tissue, and even individual. Consequently, providing an exhaustive list of NUMTs is counter-productive and will lead to over-fitting of the classifier. In MitoScape, we allow the user to select a list of known NUMTs as a parameter to the software. We provide a generic list of common, experimentally validated NUMTs [17] based on the most common tissue used in NGS studies: blood. Using an input list of NUMTs, MitoScape calculates the fraction of an ambiguous paired-end read that overlaps with a known NUMT region. This score, referred to as NUMT overlap (Table 1), is used as a feature. Based on variable or feature importance and model accuracy (S1 Fig), the final model used mtDNA edit distance, nuclear edit distance, mtDNA LD scores and NUMT overlap, using 128 trees in the random forest classifier to obtain the probability that an NGS read is from mtDNA.
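The NUMT overlap feature can be pictured as a simple interval calculation. The sketch below is one plausible way to compute it, not MitoScape's implementation: the function names, the half-open `(start, end)` coordinate convention, and the assumption that both mates and the NUMT list share one coordinate system are all illustrative choices.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def numt_overlap_fraction(mate_intervals, numt_regions):
    """Fraction of the paired-end read's aligned bases inside known NUMTs.

    `mate_intervals` holds one (start, end) interval per mate;
    `numt_regions` is the user-supplied list of known NUMT intervals.
    """
    read = merge_intervals(mate_intervals)
    numts = merge_intervals(numt_regions)
    read_length = sum(end - start for start, end in read)
    if read_length == 0:
        return 0.0
    covered = sum(
        max(0, min(r_end, n_end) - max(r_start, n_start))
        for r_start, r_end in read
        for n_start, n_end in numts
    )
    return covered / read_length

# Two 150 bp mates; a single NUMT region covers 50 bp of each mate.
fraction = numt_overlap_fraction([(100, 250), (300, 450)], [(200, 350)])
```

Merging the intervals first avoids double-counting bases when the two mates overlap each other or when several listed NUMT regions overlap.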

The following is a summary of the workflow for calling mtDNA variants using MitoScape:

  1. Align the WGS sample to the mtDNA reference sequence.

  2. Re-align the aligned sequences from step 1 to the non-mtDNA reference sequence.

  3. Call MitoScape with the outputs from steps 1 and 2 to classify ambiguous mtDNA reads.

  4. Call variants on the output mtDNA sequences from step 3.

Several design choices were made to ensure that MitoScape employed a flexible and decoupled architecture. Every major software component can be replaced without changing any code. For example, gsnap [27] has the unique ability to align circular DNA such as the rCRS; however, another short-read sequence aligner could be used. For calling variants we utilized Mutect2, which was originally designed for calling cancer variants. All of the software developed was designed using Scala, a modern, scalable, functional and object-oriented programming language which runs on the Java Virtual Machine (JVM). For processing of the aligned reads, we used ADAM version 0.32.0 [28,29], a library designed for big-data, genomic analysis using Apache Spark. For machine learning paradigms, Apache Spark version 3.0.0, a fast, unified analytics engine for big-data processing was chosen. Apache Spark also improves on many of the shortcomings of Apache Hadoop, including improved performance and flexibility.

Customization of MitoScape

The performance of a machine learning classifier is dependent on the quality of the training data. MitoScape was designed to be scalable and flexible, and so it is a trivial matter to add more or different training data to the model, a feature that is unique to MitoScape and not present in other tools. For instance, if studying cancer, a training dataset composed of cancer samples would be more appropriate. The majority of NGS studies contain lymphocyte DNA and hence, the lymphoblastoid cell lines are a suitable model. Similarly, both the list of NUMT scores and the linkage disequilibrium scores could be customized by adding more data samples or, for example, restricting samples based on population or tissue. The performance of MitoScape will improve as more data is generated and analyzed, and the system is designed to accommodate more data. The ability to select mtDNA variants based on sensitivity and specificity via the prediction probability parameter of MitoScape is another useful enhancement that is unavailable in other software.

Results

Benchmark dataset

The current gold standard used in clinical genetic labs for obtaining mtDNA variants is long-range PCR amplification of mtDNA followed by NGS and variant calling [30]. Furthermore, since the vast majority of existing NGS data is from Illumina paired-end reads, and MitoScape was designed to obtain mtDNA sequences from NGS, we tested MitoScape using this approach in the context of a complex disease: schizophrenia. We obtained nine blood samples from the 22q11IBBC study, which comprised subjects having 22q11.2 deletion syndrome (22q11DS) and schizophrenia [31]. These nine samples are completely different from the ten used in the positive training set, and hence are an appropriate and valid test set. To obtain gold-standard mtDNA sequences, we amplified mtDNA from these samples in two long-range overlapping PCR fragments (S1 Supplementary Methods). The purity of amplified sequence varies with different PCR primers. Hence, we designed, developed and tested novel PCR primers for human blood cells (S1 Supplementary Methods and S2 and S3 Figs). We then sequenced the amplicons on an Illumina MiSeq sequencer using 2x300bp paired-end reads, twice as long as the common 2x150bp reads, to produce more accurate alignments. The reads were then aligned to the human genome reference sequence (GRCh38). To maximize accuracy of alignments, we adopted a stringent, conservative approach in which only reads having at least 270bp (≥90% of the maximum read length) aligning to the rCRS were retained. At least 90% of all sequenced reads (median = 92%) from all nine samples aligned to GRCh38. Of all the aligned reads, at least 87% (median = 90%) aligned to the rCRS (S2 and S3 Figs) from each sample. These resulting sequences represent pure mtDNA sequences, free of NUMTs, and comprise our “Benchmark” mtDNA sequences (Fig 2). Our benchmark mtDNA sequences offer many improvements compared to validation samples used in previously published software for NGS analysis of human mtDNA, such as mtDNA-Server [21] (S1 Supplementary Methods). Following standard practice for machine learning training, the Benchmark dataset is a completely different set of samples from the training datasets to reduce bias.

Fig 2. Outline of testing scheme for MitoScape.

Fig 2

Nine different 22q11.2 deletion syndrome (DS) samples were chosen for performance testing. For each sample, we performed both 1) PCR amplification to enrich mtDNA, and 2) whole genome sequencing (WGS). MitoScape was applied to the WGS samples to obtain accurate mtDNA alignments. Variants were then called from the resulting mtDNA from both mtDNA enrichment (Benchmark mtDNA) and WGS (test mtDNA). The Benchmark mtDNA variants represent the gold-standard variants from the nine samples. The test mtDNA variants were then compared to the Benchmark set for evaluation of the performance of MitoScape. Heteroplasmy values of the test mtDNA variants that are similar to those of the Benchmark variants indicate that MitoScape is performing well, and vice versa.

Model testing and evaluation

MitoScape distills mtDNA sequences from NGS data. Our goal is to compare the mtDNA sequences from MitoScape to the Benchmark mtDNA. We measured the performance of MitoScape by comparing the heteroplasmy levels of variants from MitoScape mtDNA to the heteroplasmy levels of the same variants from the Benchmark mtDNA. To obtain variants from the aligned mtDNA sequences, we selected a cancer variant caller with the ability to handle the polyploid genome of mitochondria: Mutect2 v4.1.9.0 [32], with mitochondrial-mode set to true, followed by FilterMutectCalls with default options. Only variants from the Benchmark mtDNA sequences that passed FilterMutectCalls filtering were considered as part of our Benchmark mtDNA variant set. Note that this test dataset is completely independent of the training data, and hence is a legitimate test for performance. We refer to the mtDNA variants called from the Benchmark mtDNA sequences as the Benchmark variants.

For the same nine 22q11DS samples used in the Benchmark mtDNA sequences, we performed WGS. WGS data are not enriched for mtDNA and hence also contain NUMTs. Therefore, MitoScape is required to discriminate mtDNA from NUMTs (Fig 2). We obtained mtDNA sequences from the WGS data using MitoScape with a prediction probability of 0.5 followed by Mutect2 to call variants. It is important to note that we do not use FilterMutectCalls in our tests of MitoScape, so we are not using Mutect2’s filters. We then computed the difference in heteroplasmy levels between the Benchmark variants and the MitoScape mtDNA variants at each mtDNA variant locus for each sample. We defined heteroplasmy error of MitoScape in a given sample and specific mtDNA locus as the heteroplasmy in the Benchmark mtDNA variants minus the heteroplasmy computed using MitoScape, i.e., heteroplasmy error = Benchmark heteroplasmy − MitoScape heteroplasmy. Hence, the closer the heteroplasmy error is to zero, the more the heteroplasmy results of MitoScape match the Benchmark heteroplasmy values. Positive heteroplasmy error indicates that the heteroplasmy estimate of a given variant was higher in the Benchmark variant set than that of the same variant in the MitoScape variant set. Therefore, increased positive heteroplasmy error suggests an increased chance of being a false negative. Any variants that were not called in a variant set are regarded as having heteroplasmy equal to 0.
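The heteroplasmy error definition, including the rule that uncalled variants count as heteroplasmy 0, can be sketched as follows. The function name and the `(position, alternate allele)` keys are hypothetical conventions chosen for this example, and the heteroplasmy values are made up for illustration.

```python
def heteroplasmy_errors(benchmark, test):
    """Per-variant heteroplasmy error = Benchmark minus test heteroplasmy.

    Both arguments map a variant key, here (position, alt_allele), to a
    heteroplasmy fraction in [0, 1]. A variant missing from either call
    set is treated as having heteroplasmy 0.
    """
    variants = set(benchmark) | set(test)
    return {
        variant: benchmark.get(variant, 0.0) - test.get(variant, 0.0)
        for variant in sorted(variants)
    }

# Illustrative values only, not data from the study.
benchmark_calls = {(73, "G"): 1.0, (3243, "G"): 0.30}
test_calls = {(3243, "G"): 0.25, (152, "C"): 0.10}
errors = heteroplasmy_errors(benchmark_calls, test_calls)
```

In this toy example the variant at position 73 is missing from the test set and receives an error of 1.0, while the variant at position 152, called only in the test set, receives a negative error of −0.10.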

Overall performance of MitoScape

The Benchmark variants consisted of low-level heteroplasmy variants at almost every locus in the mtDNA, and thus were a comprehensive test dataset (Figs 3 and S4). The nine test samples used here are substantially more than the two samples used for testing in mtDNA-Server. Based on our variant calls, we have captured low heteroplasmy variants across the entire mtDNA genome; thus, adding substantially more samples is unlikely to add significantly more information. Since each test sample requires both mtDNA-enriched and (2x300 Illumina) WGS data, adding significantly more samples was also cost-prohibitive. MitoScape had heteroplasmy error of approximately zero for most variants except for two mtDNA regions at m.303 and m.16184 (Fig 3). Closer inspection reveals that these loci are within low-complexity regions consisting of homopolymer runs of almost exclusively cytosines. The low read depth of the variants in these two regions (Fig 3) and the high GC content render these two regions difficult to sequence and emphasize that care should be taken when analyzing variants in these two regions. Due to sampling error, higher levels of heteroplasmy are likely to have higher variances than lower levels of heteroplasmy. Therefore, to account for differences in variances, we also scaled the heteroplasmy error by p(1-p), where p is the benchmark heteroplasmy level. We have defined this measure as the scaled heteroplasmy error. The trends are similar between the raw heteroplasmy errors (Fig 3A) and the scaled heteroplasmy errors (Fig 3B).

Fig 3. Plot of heteroplasmy error between Benchmark variants and MitoScape variants, for each variant in each sample.

Fig 3

The x-axis represents the position in the rCRS. Benchmark read depth represents the read depth of the variant from the Benchmark dataset. Heteroplasmy error in a given sample and mtDNA locus is defined as the heteroplasmy value from the Benchmark variant set minus the heteroplasmy computed using MitoScape. Note that heteroplasmy error is a difference in fractions or percentages, not the percentage error. A. Raw Heteroplasmy Error. B. Scaled Heteroplasmy Error: Heteroplasmy error is scaled by p(1-p) where p is the benchmark heteroplasmy.

The mean estimated heteroplasmy of variants from MitoScape was within 1% of that computed from the Benchmark. The standard deviation of heteroplasmy error approached zero as the read depth of the variant increased (Fig 4), emphasizing the importance of read depth in obtaining accurate heteroplasmy. Moreover, as the read depth of the WGS data increased, the standard deviation in heteroplasmy error decreased (Fig 4).

Fig 4. Summary statistics of heteroplasmy error for MitoScape, MToolBox, and mtDNA-Server (Mutserve).

Fig 4

Heteroplasmy error in each sample and mtDNA locus is defined as the heteroplasmy value from the Benchmark variant set minus the heteroplasmy computed using MitoScape, MToolBox, or mtDNA-Server. A. Raw Heteroplasmy Error. B. Scaled Heteroplasmy Error: heteroplasmy error is scaled by p(1-p) where p is the benchmark heteroplasmy.

Mitochondrial DNA (mtDNA) copy number—defined as the ratio of mtDNA copies to nDNA copies—potentially impacts the effect of NUMTs on heteroplasmy detection. It is conceivable that the samples with lower mtDNA copy numbers would have greater error in mtDNA heteroplasmy levels. Hence, we examined the relationship between mtDNA copy number and MitoScape heteroplasmy error. We computed mtDNA copy number as the ratio of number of mtDNA reads to nDNA reads in chromosomes 1–22 for each of the nine test samples. We found no obvious correlation between mtDNA copy number and MitoScape heteroplasmy error (S5 Fig). These results are inconclusive, however, likely given the small N—nine samples—and the low amount of variation in the mtDNA copy number (mean = 90, standard deviation = 25).

Comparison of MitoScape with standard tools

We compared MitoScape to two common tools for obtaining mtDNA variants from NGS data: MToolBox [18] and mtDNA-Server [21]. MToolBox is the standard tool used by the MSeqDR consortium, a large, global consortium for mitochondrial disease research consisting of a team of over 100 mitochondrial disease experts [33]. MtDNA-Server is a scalable NGS analysis workflow based on Apache Hadoop, for obtaining variants from human mtDNA data. MtDNA-Server achieved similar or superior performance to several other computational tools for mtDNA analysis and ultra-sensitive variant detection [21]. To handle NUMTs and low-complexity regions, mtDNA-Server adopts a post-filtering approach of tagging and filtering variants that are in these regions, or those variants that meet various thresholds. The schemes used in both MToolBox and mtDNA-Server are representative of all computational methods used to analyze mtDNA variants. We used MToolBox and the standalone version of mtDNA-Server (Mutserve v2.0.0rc10) to determine mtDNA variants from the same 22q11.2DS samples as used in the Benchmark and MitoScape mtDNA variant sets. We then determined the number of false negative and false positive variants detected by MitoScape, MToolBox, and mtDNA-Server, using the Benchmark variant set as our gold standard. Hence, the heteroplasmy values from the Benchmark variant set represent the actual heteroplasmy.

We next compared misclassifications in MitoScape, MToolBox, and mtDNA-Server. We defined any variant to be a false negative, or missing, if the heteroplasmy error is less than -0.2. In other words, we allow the computed heteroplasmy to be up to 0.2 less than the Benchmark estimate of heteroplasmy, but no more. For example, if the actual heteroplasmy is 0.5, then the corresponding heteroplasmy from the tested software would have to be greater than 0.3 for a match, and any heteroplasmy less than 0.3 is a false negative. We defined a false positive as any variant having an estimated heteroplasmy level greater than 0.2 plus the actual heteroplasmy. We used the same criteria for misclassifications to compare MitoScape, MToolBox, Mutect2, and mtDNA-Server. MitoScape was the most accurate in making heteroplasmic variant calls, producing only one misclassification in the nine test samples, compared with 125 for MToolBox and 21 for mtDNA-Server (Table 2). MitoScape’s sole misclassification was a false negative, whereas the other two tools produced both false negatives and false positives, although mtDNA-Server was more accurate than MToolBox (Table 2 and S6 Fig). Surprisingly, both MToolBox and mtDNA-Server produced errors in calling homoplasmic variants (defined as having heteroplasmy greater than 50%), with 118 and 14 homoplasmic variant miscalls, respectively. MitoScape produced zero homoplasmic variant miscalls. The maximum absolute heteroplasmy error is the largest error in any heteroplasmy call made by each tool. By definition, this measure ranges from zero in the best case to one in the worst case. The maximum absolute heteroplasmy error produced by MitoScape was 0.36, compared with 1.0 for both MToolBox and mtDNA-Server (Table 2). Thus, MitoScape produced the most accurate heteroplasmy estimates of the three tools on every measure reported.
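The misclassification criteria above can be written as a small decision rule. This is an illustrative Python sketch rather than the authors' code: the function names and the example values are hypothetical, and the heteroplasmy error is taken to be the estimated value minus the Benchmark value, as the worked example in the text implies.

```python
def classify_call(benchmark, estimated, tolerance=0.2):
    """Label one variant call against its Benchmark heteroplasmy.

    error < -tolerance  -> false negative (call is too low or missing)
    error > +tolerance  -> false positive (call is too high or spurious)
    otherwise           -> match
    """
    error = estimated - benchmark
    if error < -tolerance:
        return "false_negative"
    if error > tolerance:
        return "false_positive"
    return "match"


def max_absolute_error(pairs):
    """Largest |estimated - benchmark| over (benchmark, estimated) pairs."""
    return max(abs(estimated - benchmark) for benchmark, estimated in pairs)


# Hypothetical (benchmark, estimated) heteroplasmy pairs.
pairs = [(0.5, 0.25), (0.5, 0.35), (0.1, 0.4)]
labels = [classify_call(b, e) for b, e in pairs]
```

With these values, the first call falls more than 0.2 below the Benchmark and is a false negative, the second is within tolerance, and the third exceeds the Benchmark by more than 0.2 and is a false positive.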

Table 2. Comparison of errors in variant calling among MitoScape, MToolBox, and mtDNA-Server.

False negatives are variants that are in the Benchmark mtDNA variant set but not in the corresponding tool (MitoScape, MToolBox, or mtDNA-Server) mtDNA variant set. Conversely, false positives are not in the Benchmark mtDNA variant set but were called by the corresponding tool (MitoScape, MToolBox, or mtDNA-Server). A variant is regarded as not detected if the heteroplasmy error exceeds 0.2. The maximum absolute heteroplasmy error ranges from 0.0 (best possible) to 1.0 (worst possible).

Statistic                             | MitoScape | MToolBox | mtDNA-Server
Number of misclassifications          | 1         | 125      | 21
Number of false positives             | 0         | 5        | 1
Number of false negatives             | 1         | 120      | 20
Number of homoplasmic variant errors  | 0         | 118      | 14
Maximum absolute heteroplasmy error   | 0.36      | 1.0      | 1.0

We also investigated the relationship between read depth and heteroplasmy error in calling heteroplasmic variants. We found that, on average, the heteroplasmy error for MitoScape was consistently smaller in magnitude than that for MToolBox: -0.5 to -1% versus -1 to -4% (Fig 4). Also, the standard deviations of heteroplasmy error from both MToolBox and mtDNA-Server were greater than that of MitoScape, indicating that MToolBox and mtDNA-Server had variants with much larger errors in heteroplasmy levels than MitoScape. The differences in errors are most dramatic for low read depths, indicating that MitoScape is the best of the three tools for calling variants at low read depths. These results also suggest that, on average, MitoScape can reliably detect mtDNA variants as low as 0.005–0.01 heteroplasmy. The average lower limit for MToolBox is at least three-fold higher, at 0.03–0.04. These results also demonstrate, however, that determining an absolute lower limit of detection of heteroplasmy is not possible, as the detection limit depends on the variant and read depth.

These results remained consistent when we used the scaled heteroplasmy error as opposed to the raw heteroplasmy error. MitoScape was the only tool for which the scaled heteroplasmy error was zero regardless of read depth, and its standard deviation of scaled heteroplasmy error was consistently lower than those of both MToolBox and Mutserve up to a read depth of approximately 1200. For read depths greater than 1200, the standard deviation for Mutserve was lower than those for both MitoScape and MToolBox. For all tools, however, the standard deviation of the scaled heteroplasmy error at read depths greater than 1200 lies within (-0.001, 0.001) and is effectively zero, being below measurement error. Thus, at read depths greater than 1200 the difference in performance among the tools is negligible. At read depths lower than 1200, MitoScape performs best in terms of heteroplasmy error, both raw and scaled.
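The exact definition of the scaled heteroplasmy error is not given in this section. As an illustration only, the Python sketch below assumes one plausible scaling, dividing the raw error by p(1 − p), where p is the Benchmark heteroplasmy; this is an assumption for the example and not a statement of the authors' method. The function name is hypothetical.

```python
def scaled_heteroplasmy_error(benchmark, estimated):
    """Raw error divided by p * (1 - p), with p the Benchmark heteroplasmy.

    Assumed scaling for illustration. Returns None when p is exactly
    0 or 1, where the scaling term vanishes and the ratio is undefined.
    """
    scale = benchmark * (1.0 - benchmark)
    if scale == 0.0:
        return None
    return (estimated - benchmark) / scale


# Hypothetical values: the same raw error of -0.1 at two Benchmark levels.
mid_frequency = scaled_heteroplasmy_error(0.5, 0.4)
low_frequency = scaled_heteroplasmy_error(0.2, 0.1)
```

Under this assumed scaling, the same raw error counts for more at a Benchmark heteroplasmy of 0.2 than at 0.5, because p(1 − p) is largest at p = 0.5.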

While accurate heteroplasmy level measurement is critical in diagnosing mitochondrial disease [34], there are other important metrics for evaluating the performance of MitoScape and related tools. Accordingly, we sought to determine the detection error of benchmark mtDNA variants. We define detection error as the fraction of mtDNA variants from the benchmark dataset that can be detected by MitoScape and related software tools at different heteroplasmy thresholds. We found that MitoScape consistently detected a larger fraction of mtDNA variants than either MToolBox or Mutserve for all minor allele frequency thresholds (Fig 5). Moreover, MitoScape detected almost 100% of the benchmark mtDNA variants that were above 0.1 heteroplasmy. In contrast, MToolBox detected at most half of the mtDNA variants regardless of heteroplasmy threshold, and Mutserve detected a maximum of 94% of the mtDNA variants.

Fig 5. Comparison of the fraction of benchmark variants detected (y-axis) versus the heteroplasmy threshold for detection (x-axis) for MitoScape, MToolBox, and Mutserve. The number of heteroplasmic mtDNA variants is shown in parentheses.
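The fraction of benchmark variants detected at a given threshold can be sketched as follows. This Python example is a hedged illustration: the function name, the variant labels, and the heteroplasmy values are all hypothetical, and treating a variant as eligible when its benchmark heteroplasmy is strictly above the threshold is an assumption of the sketch.

```python
def detection_fraction(benchmark, called, threshold):
    """Fraction of benchmark variants above `threshold` that a tool called.

    benchmark: dict mapping a variant label to its benchmark heteroplasmy
    called:    set of variant labels reported by the tool under test
    Returns None when no benchmark variant exceeds the threshold.
    """
    eligible = [v for v, level in benchmark.items() if level > threshold]
    if not eligible:
        return None
    detected = sum(1 for v in eligible if v in called)
    return detected / len(eligible)


# Hypothetical benchmark set and tool output.
benchmark = {"v1": 0.98, "v2": 0.15, "v3": 0.04, "v4": 0.30}
called = {"v1", "v3", "v4"}
fractions = {t: detection_fraction(benchmark, called, t) for t in (0.0, 0.1, 0.2)}
```

Evaluating the function over a range of thresholds yields one curve per tool of the kind plotted in Fig 5.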

Application to complex human disease: Hypertrophic cardiomyopathy

We provide an example of the application of MitoScape to complex human disease where variants from both nuclear DNA and mtDNA are required. Our results have shown that both MToolBox and Mutserve can lead to erroneous homoplasmic mtDNA variant calls (Table 2), which in turn potentially lead to erroneous mtDNA haplogroup calls. Thus, even for haplogroup calls, MitoScape adds value and improves performance. Hypertrophic cardiomyopathy (HCM) is the most common genetic disorder of the heart and is characterized by left ventricular hypertrophy [35]. HCM is thought to be associated primarily with mutations in 11 or more genes, but genotype-phenotype associations have been inconsistent, suggesting that other genetic or environmental factors are at play. In particular, studies have shown an association between HCM and mitochondrial haplogroups in European populations [36,37]. We tested the association between HCM and mitochondrial haplogroups in an American population, using 7,184 whole exome sequences from the Penn Medicine Biobank (S1 and S2 Tables). These haplogroup calls were based on the homoplasmic mtDNA variant calls generated by the MitoScape workflow. We found that men, but not women, in haplogroup T were at 3.52 times higher risk for HCM than those in the most common European haplogroup, R0 (S3 Table, adjusted p-value = 0.003), thus corroborating reported associations [36]. In contrast to the Castro et al. study [36], MitoScape produces the complete mtDNA sequence as opposed to a small subset of single nucleotide changes, and hence more precise mitochondrial haplogroup information.

Summary

We have presented MitoScape, a novel, machine learning-based software to align mtDNA from NGS data. We have also demonstrated the superior performance of MitoScape compared to two common tools for obtaining mtDNA variants from NGS. MToolBox and mtDNA-Server produced 125 and 21 misclassifications, respectively, compared with one for MitoScape (Table 2). Importantly, MitoScape has several additional advantages over post-filtering approaches as described in the Introduction. First, alignment tools, variant callers and sequencing technology are all likely to improve over time. Unlike other tools, the design of MitoScape allows these components to be changed without modification to the software. For instance, different sequence aligners and variant callers can be readily used in the MitoScape framework depending on the research problem. Second, MitoScape can attenuate the prediction probability to vary the proportion of false positives and false negatives according to the user’s needs, a powerful feature that is unique to MitoScape. Third, with MitoScape the resulting classified sequence alignments can be used to determine mtDNA copy number. MtDNA copy number is an important source of mitochondrial variation and plays an important role in the pathophysiology of certain human diseases, especially mtDNA depletion syndromes [38]. Computing mtDNA copy number is not possible in post-filtering approaches such as mtDNA-Server.
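The second advantage above, trading false positives against false negatives through the prediction probability, can be illustrated with a generic probability cutoff. The Python sketch below is not MitoScape's interface, which is not described in this section; the function name, the cutoff parameter, and the scored reads are all hypothetical.

```python
def count_read_errors(scored_reads, cutoff):
    """Count false positives and false negatives at a probability cutoff.

    scored_reads: iterable of (probability_mtdna, is_truly_mtdna) pairs
    A read is accepted as mtDNA when its probability is >= cutoff.
    Returns (false_positives, false_negatives).
    """
    false_positives = sum(
        1 for prob, is_mt in scored_reads if prob >= cutoff and not is_mt
    )
    false_negatives = sum(
        1 for prob, is_mt in scored_reads if prob < cutoff and is_mt
    )
    return false_positives, false_negatives


# Hypothetical classifier scores with known origin (True = mtDNA, False = NUMT).
reads = [(0.95, True), (0.70, True), (0.55, False), (0.40, True), (0.10, False)]
errors_by_cutoff = {c: count_read_errors(reads, c) for c in (0.3, 0.5, 0.8)}
```

In this toy example, raising the cutoff removes the NUMT read that was wrongly accepted but discards more true mtDNA reads, which is the trade-off a user would tune.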

An important goal of mitochondrial genomics is to identify mtDNA variation, including mtDNA copy number, that alters mitochondrial function. MitoScape goes beyond other software by not only accurately identifying mtDNA sequences, but also modeling mitochondrial genetics through machine learning. Unique alignment and post-filtering approaches do not and cannot model salient aspects of mitochondrial biology. MitoScape takes a novel departure in identifying mtDNA sequence by using, for the first time, rho-zero cells to model NGS sequencing of NUMTs. MitoScape also incorporates a highly flexible framework allowing different NUMTs and even training sets to be modified to account for mtDNA variation in different tissues, diseases and populations. For instance, if we are studying cancer, using both rho-zero cells and mtDNA-enriched data from cancer cells will provide more accurate estimates of heteroplasmy than general approaches. This flexible approach guards against over-fitting and permits analysis in a context-specific manner, which is critical for studying mitochondrial genetics. No other software offers this flexibility. Another advantage of our novel approach is that once new variants are discovered and more training data are produced, these data can be used to continually update and improve classification. These design choices allow high-precision, accurate mtDNA variants to be obtained from NGS data, which will be vital in the diagnosis of both primary mitochondrial and complex human disease.

Availability and future directions

Project name: MitoScape.

Project home page: https://cavatica.sbgenomics.com/public/apps#d3b-bixu/app-publisher/mitoscape-wf/ including full instructions on how to run MitoScape on the Seven Bridges Cavatica platform.

Source home page: https://github.com/larryns/MitoScape.

Operating System: Platform independent.

Programming Language: Scala.

Data specific to HCM analysis are available from the Penn Medicine Biobank (https://pmbb.med.upenn.edu/). All other data, including Benchmark data, are available via authorized access from https://cavatica.sbgenomics.com/u/cavatica/22q11-deletion-syndrome-project/.

A strength of MitoScape is the ability to add more data, including positive (mtDNA-enriched) data and negative (rho-zero) data. Additional data in the form of NUMT locations can also be added to improve, adapt or extend MitoScape to specific datasets, including non-human data.

Supporting information

S1 Supplementary Methods. Details of supplementary methods not covered in main text.

(DOCX)

S1 Fig

Plot of relative feature (variable) importance scores (y-axis) after training of the random forest classifier in MitoScape. Each feature is displayed on the x-axis. The feature importance scores of all variables sum to one, and the higher the relative importance score, the more important the feature was in the classification procedure.

(TIF)

S2 Fig. Summary of unaligned sequencing reads from nine 22QDS samples to both rCRS and RSRS.

(TIF)

S3 Fig. Summary of reads aligning to mitochondrial DNA reference (rCRS or RSRS) from nine 22QDS samples.

(TIF)

S4 Fig. Summary statistics of number and frequency of heteroplasmic mtDNA variants identified in the 9 benchmark test data samples.

(TIF)

S5 Fig. Relationship between mitochondrial DNA copy number and heteroplasmy error from MitoScape. The red diamonds represent the mean heteroplasmy error for a given mitochondrial copy number.

Each blue circle represents the heteroplasmy error and mitochondrial copy number for a single variant.

(TIF)

S6 Fig. Summary of false negative misclassifications for MitoScape, MToolBox, and Mutserve (mtDNA-Server).

The y-axis represents the cumulative number of false negatives where the corresponding actual heteroplasmy is less than the value on the x-axis. The x-axis represents minor allele frequency, and therefore, is between 0 and 0.5. Minor allele frequency is equal to actual heteroplasmy if actual heteroplasmy is < 0.5, and equal to 1-actual heteroplasmy, otherwise.

(TIF)

S1 Table. Penn Medicine Biobank Participant Characteristics.

(DOCX)

S2 Table. Haplogroup demographics of subjects used in hypertrophic cardiomyopathy-mitochondrial haplogroup association from Penn Biobank data.

(DOCX)

S3 Table. Logistic regression analysis with HCM as dependent variable and mitochondrial haplogroups, age, and the first five principal components of the nuclear genetic variants PCA analysis as covariates, for men only.

Reference haplogroup is R0. Adjustment for multiple testing was done by Bonferroni correction. Logistic regression was performed using R.

(DOCX)

Acknowledgments

The authors thank the Regeneron Genetics Center for supplying sequencing and genetic variant data on hypertrophic cardiomyopathy.

Funding Statement

This work was supported by grants awarded to SA (National Institute of Mental Health: MH110185) and DCW (National Institute of Neurological Disorders and Stroke: NS021328, National Institute of Mental Health: MH108592, and Office of the Director: OD010944). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Parikh S, Goldstein A, Karaa A, Koenig MK, Anselm I, Brunel-Guitton C, et al. Patient care standards for primary mitochondrial disease: a consensus statement from the Mitochondrial Medicine Society. Genetics in Medicine. 2017;19: 1380–1380. doi: 10.1038/gim.2017.107
  • 2. Singh LN, Crowston JG, Lopez Sanchez MIG, Van Bergen NJ, Kearns LS, Hewitt AW, et al. Mitochondrial DNA Variation and Disease Susceptibility in Primary Open-Angle Glaucoma. Invest Ophthalmol Vis Sci. 2018;59: 4598–4602. doi: 10.1167/iovs.18-25085
  • 3. Chalkia D, Singh LN, Leipzig J, Lvova M, Derbeneva O, Lakatos A, et al. Association Between Mitochondrial DNA Haplogroup Variation and Autism Spectrum Disorders. JAMA Psychiatry. 2017;74: 1161–1168. doi: 10.1001/jamapsychiatry.2017.2604
  • 4. Wallace DC. Mitochondrial genetic medicine. Nature Genetics. 2018;50: 1642. doi: 10.1038/s41588-018-0264-z
  • 5. Yuan Y, Ju YS, Kim Y, Li J, Wang Y, Yoon CJ, et al. Comprehensive molecular characterization of mitochondrial genomes in human cancers. Nat Genet. 2020;52: 342–352. doi: 10.1038/s41588-019-0557-x
  • 6. Schon EA, DiMauro S, Hirano M. Human mitochondrial DNA: roles of inherited and somatic mutations. Nat Rev Genet. 2012;13: 878–890. doi: 10.1038/nrg3275
  • 7. Singh KK, Chaubey G, Chen JY, Suravajhala P. Decoding SARS-CoV-2 hijacking of host mitochondria in COVID-19 pathogenesis. American Journal of Physiology-Cell Physiology. 2020;319: C258–C267. doi: 10.1152/ajpcell.00224.2020
  • 8. Codo AC, Davanzo GG, Monteiro L de B, Souza GF de, Muraro SP, Virgilio-da-Silva JV, et al. Elevated Glucose Levels Favor SARS-CoV-2 Infection and Monocyte Response through a HIF-1α/Glycolysis-Dependent Axis. Cell Metabolism. 2020;32: 498–499. doi: 10.1016/j.cmet.2020.07.015
  • 9. da Silveira WA, Fazelinia H, Rosenthal SB, Laiakis EC, Kim MS, Meydan C, et al. Comprehensive Multi-omics Analysis Reveals Mitochondrial Stress as a Central Biological Hub for Spaceflight Impact. Cell. 2020;183: 1185–1201.e20. doi: 10.1016/j.cell.2020.11.002
  • 10. Verschoor ML, Ungard R, Harbottle A, Jakupciak JP, Parr RL, Singh G. Mitochondria and cancer: past, present, and future. Biomed Res Int. 2013;2013: 612369. doi: 10.1155/2013/612369
  • 11. Craven L, Alston CL, Taylor RW, Turnbull DM. Recent Advances in Mitochondrial Disease. Annu Rev Genomics Hum Genet. 2017;18: 257–275. doi: 10.1146/annurev-genom-091416-035426
  • 12. Stewart JB, Chinnery PF. Extreme heterogeneity of human mitochondrial DNA from organelles to populations. Nature Reviews Genetics. 2020; 1–13. doi: 10.1038/s41576-019-0192-5
  • 13. Tsuji J, Frith MC, Tomii K, Horton P. Mammalian NUMT insertion is non-random. Nucleic Acids Res. 2012;40: 9073–9088. doi: 10.1093/nar/gks424
  • 14. Ramos A, Barbena E, Mateiu L, del Mar González M, Mairal Q, Lima M, et al. Nuclear insertions of mitochondrial origin: Database updating and usefulness in cancer studies. Mitochondrion. 2011;11: 946–953. doi: 10.1016/j.mito.2011.08.009
  • 15. Choudhury AR, Singh KK. Mitochondrial Determinants of Cancer Health Disparities. Semin Cancer Biol. 2017;47: 125–146. doi: 10.1016/j.semcancer.2017.05.001
  • 16. Singh KK, Choudhury AR, Tiwari HK. Numtogenesis as a mechanism for development of cancer. Semin Cancer Biol. 2017;47: 101–109. doi: 10.1016/j.semcancer.2017.05.003
  • 17. Dayama G, Emery SB, Kidd JM, Mills RE. The genomic landscape of polymorphic human nuclear mitochondrial insertions. Nucleic Acids Res. 2014;42: 12640–12649. doi: 10.1093/nar/gku1038
  • 18. Calabrese C, Simone D, Diroma MA, Santorsola M, Guttà C, Gasparre G, et al. MToolBox: a highly automated pipeline for heteroplasmy annotation and prioritization analysis of human mitochondrial variants in high-throughput sequencing. Bioinformatics. 2014;30: 3115–3117. doi: 10.1093/bioinformatics/btu483
  • 19. Hazkani-Covo E, Zeller RM, Martin W. Molecular Poltergeists: Mitochondrial DNA Copies (numts) in Sequenced Nuclear Genomes. PLOS Genetics. 2010;6: e1000834. doi: 10.1371/journal.pgen.1000834
  • 20. Li M, Schroeder R, Ko A, Stoneking M. Fidelity of capture-enrichment for mtDNA genome sequencing: influence of NUMTs. Nucleic Acids Res. 2012;40: e137. doi: 10.1093/nar/gks499
  • 21. Weissensteiner H, Forer L, Fuchsberger C, Schöpf B, Kloss-Brandstätter A, Specht G, et al. mtDNA-Server: next-generation sequencing data analysis of human mitochondrial DNA in the cloud. Nucleic Acids Res. 2016;44: W64–W69. doi: 10.1093/nar/gkw247
  • 22. Wei W, Pagnamenta AT, Gleadall N, Sanchis-Juan A, Stephens J, Broxholme J, et al. Nuclear-mitochondrial DNA segments resemble paternally inherited mitochondrial DNA in humans. Nature Communications. 2020;11: 1740. doi: 10.1038/s41467-020-15336-3
  • 23. Zhidkov I, Nagar T, Mishmar D, Rubin E. MitoBamAnnotator: A web-based tool for detecting and annotating heteroplasmy in human mitochondrial DNA sequences. Mitochondrion. 2011;11: 924–928. doi: 10.1016/j.mito.2011.08.005
  • 24. Andrews RM, Kubacka I, Chinnery PF, Lightowlers RN, Turnbull DM, Howell N. Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat Genet. 1999;23: 147. doi: 10.1038/13779
  • 25. Trounce IA, Kim YL, Jun AS, Wallace DC. Assessment of mitochondrial oxidative phosphorylation in patient muscle biopsies, lymphoblasts, and transmitochondrial cell lines. Methods Enzymol. 1996;264: 484–509. doi: 10.1016/s0076-6879(96)64044-0
  • 26. Hastie T, Tibshirani R, Friedman J. Chapter 15: Random Forests. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition. New York, NY: Springer; 2016. pp. 587–603.
  • 27. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26: 873–881. doi: 10.1093/bioinformatics/btq057
  • 28. Massie M, Nothaft F, Hartl C, Kozanitis C, Schumacher A, Joseph AD, et al. ADAM: Genomics Formats and Processing Patterns for Cloud Scale Computing. UCB/EECS-2013-207, EECS Department, University of California, Berkeley; 2013.
  • 29. Nothaft FA, Massie M, Danford T, Zhang Z, Laserson U, Yeksigian C, et al. Rethinking Data-Intensive Science Using Scalable Analytics Systems. Proceedings of the 2015 International Conference on Management of Data (SIGMOD ‘15). ACM; 2015.
  • 30. Cui H, Li F, Chen D, Wang G, Truong CK, Enns GM, et al. Comprehensive next-generation sequence analyses of the entire mitochondrial genome reveal new insights into the molecular diagnosis of mitochondrial DNA disorders. Genet Med. 2013;15: 388–394. doi: 10.1038/gim.2012.144
  • 31. Schneider M, Debbané M, Bassett AS, Chow EWC, Fung WLA, van den Bree M, et al. Psychiatric disorders from childhood to adulthood in 22q11.2 deletion syndrome: results from the International Consortium on Brain and Behavior in 22q11.2 Deletion Syndrome. Am J Psychiatry. 2014;171: 627–639. doi: 10.1176/appi.ajp.2013.13070864
  • 32. Benjamin D, Sato T, Cibulskis K, Getz G, Stewart C, Lichtenstein L. Calling Somatic SNVs and Indels with Mutect2. bioRxiv. 2019; 861054. doi: 10.1101/861054
  • 33. Shen L, Diroma MA, Gonzalez M, Navarro-Gomez D, Leipzig J, Lott MT, et al. MSeqDR: A Centralized Knowledge Repository and Bioinformatics Web Resource to Facilitate Genomic Investigations in Mitochondrial Disease. Hum Mutat. 2016;37: 540–548. doi: 10.1002/humu.22974
  • 34. McCormick EM, Lott MT, Dulik MC, Shen L, Attimonelli M, Vitale O, et al. Specifications of the ACMG/AMP standards and guidelines for mitochondrial DNA variant interpretation. Hum Mutat. 2020. doi: 10.1002/humu.24107
  • 35. Maron BJ. Clinical Course and Management of Hypertrophic Cardiomyopathy. New England Journal of Medicine. 2018;379: 655–668. doi: 10.1056/NEJMra1710575
  • 36. Castro MG, Huerta C, Reguero JR, Soto MI, Doménech E, Alvarez V, et al. Mitochondrial DNA haplogroups in Spanish patients with hypertrophic cardiomyopathy. International Journal of Cardiology. 2006;112: 202–206. doi: 10.1016/j.ijcard.2005.09.008
  • 37. Hagen CM, Aidt FH, Hedley PL, Jensen MK, Havndrup O, Kanters JK, et al. Mitochondrial Haplogroups Modify the Risk of Developing Hypertrophic Cardiomyopathy in a Danish Population. PLoS ONE. 2013;8: e71904. doi: 10.1371/journal.pone.0071904
  • 38. Filograna R, Mennuni M, Alsina D, Larsson N-G. Mitochondrial DNA copy number in human disease: the more the better? FEBS Letters. n/a. doi: 10.1002/1873-3468.14021
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009594.r001

Decision Letter 0

Manja Marz

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

24 Jul 2021

Dear Dr. Singh,

Thank you very much for submitting your manuscript "MitoScape: A big-data, machine-learning platform for obtaining mitochondrial DNA from next-generation sequencing data" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response that, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Manja Marz

Software Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: Comments on “MitoScape: A big-data, machine-learning platform for obtaining mitochondrial DNA from next-generation sequencing data”

Larry N. Singh et al. present in this manuscript a novel software, MitoScape, developed for analyzing human mitochondrial DNA. MitoScape leverages mtDNA reads produced from next-generation sequencing experiments to infer mtDNA variants, including homoplasmies and heteroplasmies, and has the potential to assess mtDNA copy numbers. An elegant design of MitoScape is that it can accurately classify reads from mtDNA and reads from NUMTs based on a random forest classifier trained using features of alignments from mtDNA-enriched libraries and alignments from mtDNA-depleted libraries. The authors show significantly improved accuracy of MitoScape in identifying mtDNA heteroplasmies and homoplasmies as compared to two other mtDNA analytical tools, MToolBox and mtDNA-Server. The authors further exemplify the use of MitoScape in a clinical study of mtDNA which enables inference of the full-length human mtDNA sequence.

The manuscript on MitoScape is well-written, accompanied by detailed online instructions on how to integrate MitoScape into the Seven Bridges Cavatica platform, a secure, cloud-based environment for analyzing large-scale genomics data. The source code of MitoScape is written in Scala and has been made available on GitHub.

In light of increasing availability of high-quality next-generation sequencing data, an improved analytical tool like MitoScape will definitely accelerate the discovery of mtDNA’s roles in human diseases. Thus, I recommend this study for publication in PLOS Computational Biology.

I only have one minor suggestion.

An important factor influencing mtDNA variant detection is mtDNA copy number. Normally, samples with lower mtDNA copy numbers would be more significantly affected by reads from NUMTs. I wonder what the distribution of mtDNA copy numbers in the nine testing blood samples is, given the corresponding data were generated by WGS. Does mtDNA copy number affect performance of mtDNA variant identification, especially identification of heteroplasmies? Providing copy number data could also illustrate another important use of MitoScape, as mentioned in Discussion.

Reviewer #2: The accurate identification of mitochondrial DNA (mtDNA) variants, especially low-frequency heteroplasmies, in sequencing data is an important problem in mitochondrial genetics. This problem is especially acute due to the presence of nuclear DNA fragments masquerading as mitochondrial DNA in sequencing data, commonly known as “NUMTs”. There are many approaches available to minimize NUMT contamination in sequencing data. However, these are largely heuristic, rely on the use of arbitrary filters, and there is often no clear measure of their efficacy/accuracy. Singh et al. provide an effective, scalable method to deal with these problems. I have a number of questions and comments to which I would like to hear the authors’ response:

The authors mentioned 10 samples enriched for mtDNA sequences as positive control. Am I correct in thinking that these are independent from the 9 samples used as the test set? Perhaps make this clearer in the main text.

They mention using k-fold cross-validation with 80% of the training set and 20% as validation. Could the authors describe this in more detail? There are 8 samples in the negative training set and 10 in the positive set. How were they split? Or was the separation into training and validation sets carried out on the read level? How many iterations of the k-fold validation were carried out? Could the authors provide performance metrics from the k-fold validations? Can they also provide the weights that the classifier assigns to each of the features used? Additionally, can they also generate a plot of the probability of correctly assigning a read to mt/NUMT sequences as a function of these features in their validation? These metrics will be useful in understanding the classifier and how important each is in discriminating mitochondrial and NUMT sequences.

I am not sure using the absolute difference in heteroplasmy frequency between the benchmark and test set is the appropriate way to measure the performance of the method. Higher frequency heteroplasmies are more likely to vary in frequency just by sampling error. Perhaps more accurate would be to scale the difference by p(1-p) where p is the heteroplasmy fraction in the benchmark dataset. It would also be helpful to see this difference with respect to the frequency of the heteroplasmy. Please also provide similar metrics or a figure (e.g. as in Fig 3) for the other methods for a direct comparison with Mitoscape.

I noticed in Fig. 4 that Mutserve matches or performs better than Mitoscape at higher sequencing depth (>1200). Would this be a fair assessment? If so, perhaps mention this as it does not take away from the fact that Mitoscape performs well at lower sequencing depths.

I do not think an absolute difference of 0.2 is appropriate to define false negative or positive heteroplasmies. One reason for this is that a difference of 0.2 means different things at low- and high-frequency heteroplasmies. For example, at a frequency of 0.2, a difference of -0.2 genuinely indicates a failure to detect the heteroplasmy whereas the same difference at a frequency of 0.5 does not. Additionally, I feel that heteroplasmy classification errors (as defined here) should be measured by detection (or failure of detection) of heteroplasmies, and not based on the difference in frequency of known heteroplasmies. For example, a false positive could be detection of a heteroplasmy above some frequency threshold (e.g. 0.05) which did not exist in the benchmark data.
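The detection-based definition proposed in this comment could be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in the manuscript: the function name `classify_detection`, the dictionary layout (variant label mapped to heteroplasmy fraction), and the toy site labels are hypothetical, and 0.05 is only the example threshold mentioned above.

```python
def classify_detection(benchmark, called, threshold=0.05):
    """Classify heteroplasmies by detection rather than by frequency difference.

    benchmark and called map a variant label to its heteroplasmy fraction.
    A variant counts as detected when its fraction is at or above threshold.
    """
    truth = {v for v, f in benchmark.items() if f >= threshold}
    calls = {v for v, f in called.items() if f >= threshold}
    return {
        "true_positive": sorted(truth & calls),
        "false_positive": sorted(calls - truth),  # called, absent from benchmark
        "false_negative": sorted(truth - calls),  # in benchmark, not called
    }


# Toy values for illustration only.
benchmark = {"site_a": 0.20, "site_b": 0.50, "site_c": 0.01}
called = {"site_b": 0.30, "site_c": 0.08, "site_d": 0.10}
print(classify_detection(benchmark, called))
```

In this toy input, `site_b` differs from the benchmark by 0.2 in frequency but still counts as detected, whereas `site_a` is a false negative because it was not called at all.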

I did not understand the purpose of the section “Application to complex human disease: hypertrophic cardiomyopathy”. Was Mitoscape used to call mtDNA reads for haplogroup calling? How much more accurate was this process compared to just calling haplogroups from the sequence data (without any filters) using Haplogrep (for example)? Alternatively, how much more accurate was Mitoscape compared to other methods described in the paper? In other words, I see the benefit of using Mitoscape to call low-frequency heteroplasmies, but I don’t see how beneficial it is for haplogroup calling.

Please provide details of the number and frequency distribution of heteroplasmies identified in the 9 test samples.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009594.r003

Decision Letter 1

Manja Marz

27 Oct 2021

Dear Dr. Singh,

We are pleased to inform you that your manuscript 'MitoScape: A big-data, machine-learning platform for obtaining mitochondrial DNA from next-generation sequencing data' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Manja Marz

Software Editor

PLOS Computational Biology

Jason A. Papin

Editor-in-Chief

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: I see the authors' point about how Mitoscape improves the calling of homoplasmic variants, which could lead to improvement in haplogroup calling. However, I still find the section "Application to complex human disease: hypertrophic cardiomyopathy" odd because it does not test the accuracy of haplogroup calling directly. Instead, it discusses an association result, which is a downstream application but is not very relevant to the performance of Mitoscape. I understand the authors' argument about not having haplogroup calls from the other tools to compare. I think this may be a personal choice and I will leave the decision to keep this section or move it to the supplement to the authors.

The authors satisfactorily addressed my questions and I have no further questions or comments.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009594.r004

Acceptance letter

Manja Marz

5 Nov 2021

PCOMPBIOL-D-21-00835R1

MitoScape: A big-data, machine-learning platform for obtaining mitochondrial DNA from next-generation sequencing data

Dear Dr Singh,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Olena Szabo

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data


    Supplementary Materials

    S1 Supplementary Methods. Details of supplementary methods not covered in main text.

    (DOCX)

    S1 Fig. Plot of relative feature (variable) importance scores (y-axis) after training of the random forest classifier in MitoScape.

    Each feature is displayed on the x-axis. The feature importance scores of all variables sum to one; the higher the relative variable importance score, the more important the feature was in the classification procedure.

    (TIF)

    S2 Fig. Summary of unaligned sequencing reads from nine 22QDS samples to both rCRS and RSRS.

    (TIF)

    S3 Fig. Summary of reads aligning to mitochondrial DNA reference (rCRS or RSRS) from nine 22QDS samples.

    (TIF)

    S4 Fig. Summary statistics of number and frequency of heteroplasmic mtDNA variants identified in the 9 benchmark test data samples.

    (TIF)

    S5 Fig. Relationship between mitochondrial DNA copy number and heteroplasmy error from MitoScape. The red diamonds represent the mean heteroplasmy error for a given mitochondrial copy number.

    Each blue circle represents the heteroplasmy error and mitochondrial copy number for a single variant.

    (TIF)

    S6 Fig. Summary of false negative misclassifications for MitoScape, MToolBox, and Mutserve (mtDNA-Server).

    The y-axis represents the cumulative number of false negatives where the corresponding actual heteroplasmy is less than the value on the x-axis. The x-axis represents minor allele frequency, and therefore, is between 0 and 0.5. Minor allele frequency is equal to actual heteroplasmy if actual heteroplasmy is < 0.5, and equal to 1-actual heteroplasmy, otherwise.

    (TIF)
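The quantities plotted in S6 Fig can be sketched from the caption's definitions. The snippet below is a minimal illustration, not the code used to produce the figure: the function names and the toy input values are hypothetical. It converts actual heteroplasmy to minor allele frequency and counts, for each x-axis value, the false negatives whose minor allele frequency is below it.

```python
def minor_allele_frequency(heteroplasmy):
    """Minor allele frequency as defined in the caption: h if h < 0.5, else 1 - h."""
    return heteroplasmy if heteroplasmy < 0.5 else 1.0 - heteroplasmy


def cumulative_false_negatives(false_negative_heteroplasmies, x_values):
    """For each x value, count false negatives with minor allele frequency below x."""
    mafs = [minor_allele_frequency(h) for h in false_negative_heteroplasmies]
    return [sum(1 for m in mafs if m < x) for x in x_values]


# Toy actual heteroplasmy values for missed variants (illustration only).
missed = [0.1, 0.9, 0.4, 0.7]
print(cumulative_false_negatives(missed, [0.2, 0.35, 0.5]))
```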

    S1 Table. Penn Medicine Biobank Participant Characteristics.

    (DOCX)

    S2 Table. Haplogroup demographics of subjects used in hypertrophic cardiomyopathy-mitochondrial haplogroup association from Penn Biobank data.

    (DOCX)

    S3 Table. Logistic regression analysis with HCM as dependent variable and mitochondrial haplogroups, age, and the first five principal components from a PCA of the nuclear genetic variants as covariates, for men only.

    Reference haplogroup is R0. Adjustment for multiple testing was done by Bonferroni correction. Logistic regression was performed using R.

    (DOCX)

    Attachment

    Submitted filename: ResponseToReviewers.docx

    Data Availability Statement

    Data specific to HCM analysis are available from the Penn Medicine Biobank (https://pmbb.med.upenn.edu). All other data, including Benchmark data, are available via authorized access from https://cavatica.sbgenomics.com/u/cavatica/22q11-deletion-syndrome-project/.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
