Concept for estimating mitochondrial DNA haplogroups using a maximum likelihood approach (EMMA)

Alexander W Röck; Arne Dür; Mannis van Oven; Walther Parson

doi:10.1016/j.fsigen.2013.07.005

Forensic Sci Int Genet. 2013 Dec;7(6):601–609. doi: 10.1016/j.fsigen.2013.07.005

PMCID: PMC3819997 PMID: 23948335

Abstract

The assignment of haplogroups to mitochondrial DNA haplotypes contributes substantial value for quality control, not only in forensic genetics but also in population and medical genetics. The availability of Phylotree, a widely accepted phylogenetic tree of human mitochondrial DNA lineages, led to the development of several (semi-)automated software solutions for haplogrouping. However, currently existing haplogrouping tools only make use of haplogroup-defining mutations, whereas private mutations (beyond the haplogroup level) can be additionally informative, allowing for enhanced haplogroup assignment. This is especially relevant in the case of (partial) control region sequences, which are mainly used in forensics. The present study makes three major contributions toward a more reliable, semi-automated estimation of mitochondrial haplogroups. First, a quality-controlled database consisting of 14,990 full mtGenomes downloaded from GenBank was compiled. Together with Phylotree, these mtGenomes serve as a reference database for haplogroup estimates. Second, the concept of fluctuation rates, i.e. a maximum likelihood estimation of the stability of mutations based on 19,171 full control region haplotypes for which raw lane data are available, is presented. Finally, an algorithm for estimating the haplogroup of an mtDNA sequence based on the combined database of full mtGenomes and Phylotree, which also incorporates the empirically determined fluctuation rates, is brought forward. On the basis of examples from the literature and EMPOP, the algorithm is validated, and both the strength of this approach and its utility for quality control of mitochondrial haplotypes are demonstrated.

Keywords: mtDNA, Haplogroup, EMPOP, Fluctuation rates, Phylotree

1. Introduction

Human mitochondrial (mt)DNA is passed from mother to offspring and therefore inherited along a phylogeny. The first human mitochondrial genome (mtGenome) was sequenced in the early 1980s [1] and revised 18 years later [2], serving as a reference sequence (rCRS) relative to which other mtDNA sequences have been reported in a difference-coded format. A plethora of partial as well as complete mtGenomes has been produced since, permitting an increased understanding of the evolution of this molecule. Its dispersal through human migration left characteristic footprints induced by mutations that have been used to assign sequences to haplogroups [3]. The growing collection of established clades has meanwhile reached 3925 discernible haplogroups based on 16,810 full mtGenomes [Phylotree (www.phylotree.org) Build 15 [4]].

The understanding of sequences from the standpoint of their haplogroup affiliation has become increasingly valuable in studies of human mtDNA. Not only do haplogroup nomenclature and assignment facilitate comparison and communication of genetic variability, but they are also employed to characterize mitochondrial lineages for population [5], medical [6] and forensic genetic [7] purposes. Most importantly for forensics, haplogroup assignment has proven to be an important tool for sequence data quality control [8]. Haplogrouping of mtDNA sequences has been greatly simplified with the provision of Phylotree, which is widely accepted as the “mitochondrial haplogroup dictionary” in the scientific community. The haplogroup-defining mutations listed in Phylotree not only facilitate manual haplogroup assignment, but they also serve as the basis for a number of software applications that perform the task (e.g. MitoTool [9,10], HmtDB [11], HaploGrep [12], and mtDNAoffice [13]). To date, however, none of the available automated solutions provides reliable and unbiased haplogroup estimates, especially in the case of partial mtDNA sequences [7]. The major limitation of existing tools is that they base their haplogroup assignment solely on defined motifs of diagnostic mutations (virtual haplotypes). The remaining “private” mutations of a sequence, which can be additionally informative for the haplogroup status, are not considered. As a consequence, the haplogroup assignments are often incorrect or too coarse.

Consider, for example, the following control region haplotype from Argentina, 16189C 16292T 16519C 71A 153G 204C 207A 263G 315.1C 373G, which harbors no characteristic mutation to match a haplogroup in Phylotree's motif list (Build 15) but nevertheless seems to fall within superhaplogroup R0. This sequence was assigned to haplogroup H by mtDNAmanager [14] and to haplogroup H1 + 16189 by HaploGrep [12]. Additional coding region sequencing of this sample revealed haplogroup H55 status (4769G 10464A). This conclusion could also have been drawn from the control region haplotype alone, if its near match with the complete sequence JQ705203, known to belong to haplogroup H55, had been considered. In this study, we offer new software (EMMA) that bases haplogroup estimation on Phylotree's list of virtual haplotypes and a database of 14,990 quality-controlled full mtGenomes, and that employs a maximum likelihood approach. We demonstrate by comparative analysis that our tool yields more precise haplogroup assignments than other available software. For the Argentinean control region haplotype, EMMA correctly estimated haplogroup H55 even without coding region information.

2. Materials and methods

2.1. Virtual Phylotree haplotypes and full mtGenome database

The phylogenetic tree in Phylotree [4] represents known global mtDNA variation by defining haplogroups and their signature mutations. This tree is regularly updated incorporating newly available mtGenomes and is made available to the user in HTML format. All results presented in this study use Phylotree Build 15 (September 30, 2012) as the reference tree. For our purposes, an R [15] script was developed that transforms the tree into a list of hypothetical haplotypes carrying the signature mutations of the respective haplogroups (tree nodes) as differences to the rCRS [2]. As these haplotypes are inferred rather than observed in the real world they are herein referred to as virtual haplotypes.
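The transformation of the tree into virtual haplotypes can be pictured as accumulating the signature mutations along the path from the root to each node. The following Python sketch is illustrative only: the original script was written in R and is not reproduced here. The toy tree, the `TREE` structure and the function name `virtual_haplotypes` are hypothetical, the node names and mutations are placeholders rather than real Phylotree content, the root is assumed to carry no differences to the rCRS, and back mutations are ignored for simplicity.

```python
# Toy tree: node -> (parent, signature mutations on the branch leading to it).
# Node names and mutations are placeholders, not real Phylotree content.
TREE = {
    "root": (None, []),
    "X": ("root", ["73G", "263G"]),
    "X1": ("X", ["16189C"]),
    "X1a": ("X1", ["152C", "16519C"]),
}


def virtual_haplotypes(tree):
    """Return node -> sorted list of mutations accumulated from the root."""
    cache = {}

    def collect(node):
        # Memoized walk towards the root; each node inherits its parent's set.
        if node not in cache:
            parent, own = tree[node]
            inherited = collect(parent) if parent is not None else frozenset()
            cache[node] = inherited | frozenset(own)
        return cache[node]

    return {node: sorted(collect(node)) for node in tree}


haplotypes = virtual_haplotypes(TREE)
```

In this sketch each node's virtual haplotype is simply the union of all branch mutations above it; a faithful implementation would additionally have to resolve back mutations and the optional (bracketed) mutations that Phylotree marks.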

Recently, a new reference sequence for mtDNA, the so-called Reconstructed Sapiens Reference Sequence (RSRS), has been proposed [16]. Instead of using a contemporary European mtGenome as reference sequence the authors suggest switching to a reconstructed ancestral sequence that is allocated between haplogroups L0 and L1′2′3′4′5′6′. A switch to the RSRS in the forensic field, however, is not expected to occur soon [17]. The software presented here is primarily designed for forensic purposes, thus input of mtDNA data is currently based on the rCRS.

The defined haplogroup motifs in Phylotree are based on a database of published mtGenomes (http://www.phylotree.org/mtDNA_seqs.htm) that were downloaded from GenBank and evaluated for their application within EMMA. Some sequences were incomplete (e.g. [18], accession number EF657231, lack of control region; [19], accession number EF661002, sequencing frame 1–4167 4434–5483 5785–8314 8566–10683 10749–16548) and were therefore excluded from this study. MtGenomes highlighted as problematic in Appendix S1 of Ref. [20] were also removed. Additionally, mtGenomes that were generated in the course of second generation sequencing attempts in Refs. [21,22], and flawed data published in Ref. [23] and by I.P. Maksum (unpublished) and V.C. Phan (unpublished), were excluded from the database. Of the remaining mtGenomes, those containing ten or more ‘N’ designations in the FASTA string were excluded because of insufficient sequence quality.
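The last filtering step, exclusion of sequences with ten or more ‘N’ designations, can be sketched in a few lines of Python. The function name `passes_n_filter` and its parameter are hypothetical, and counting lower-case `n` as well is an assumption not stated in the text.

```python
def passes_n_filter(fasta_sequence, max_n=9):
    """Return True if the sequence has fewer than ten 'N' designations.

    Assumption: lower-case 'n' is counted like 'N'.
    """
    return fasta_sequence.upper().count("N") <= max_n


kept = passes_n_filter("ACGT" * 10 + "N" * 9)
dropped = passes_n_filter("ACGT" * 10 + "N" * 10)
```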

Subsequent analysis of our quality-controlled mtGenome database revealed that 20 haplogroups of Phylotree 15 were not represented by complete sequences anymore due to the strict policy applied above. To ensure maximum coverage of haplogroups, 29 mtGenomes rejected in the first step were reincluded in the database. With the expected future availability of more reliable mtGenomes those questionable haplotypes will be replaced. A single haplogroup was considered unreliable and therefore not reincluded: according to Phylotree haplogroup M25 is represented by two mtGenomes from [24] (accession numbers DQ246830, DQ246833). Due to the lack of the first 250 bases in these mtGenomes and the presence of several doubtful variants, both sequences were not considered for the database, thus haplogroup M25 is the only haplogroup that is not represented by any mtGenome in the database for EMMA. See ESM1 for the list of genomes added and the reason for initial exclusion.

ESM1

29 mtGenomes rejected in the first step that were again added to the EMMA database.

Finally, all FASTA strings were translated into rCRS-coded haplotypes using SAM [25] and subsequently checked with in-house software to harmonize alignment. The quality filtering of the sequences finally resulted in a database of 14,990 full mtGenomes stored with their accession numbers and version. In conjunction with the 3925 virtual Phylotree motifs, these 18,915 virtual and real mtGenomes form the basis for haplogroup estimation.

2.2. Fluctuation rates

Haplogrouping of mtDNA data in rCRS-based format requires consistent alignment and notation of sequences following a phylogenetic approach [26] in order to assess the stability of mutations in defined haplogroups. Here, we refer to this mutational (in)stability as a fluctuation rate. The weighting scheme presented for the string-search method in Ref. [25] was updated by assessing the stability of mutations within the mtDNA control region among 19,171 full control region haplotypes for which raw lane data were available. Haplogroups were manually assigned to all sequences in this dataset between November 2011 and September 2012 following the classification outlined in Phylotree Builds 12 through 15. Consequently, the sequences were grouped into discernable control region haplogroup clusters (CR-HGs), i.e. clusters of haplogroups that can be confidently determined based on control region motifs. We set a minimum of four available sequences to define a CR-HG with the exception of CR-HGs L0, L2, L6, U4′9, K3, and P9 for which only one or two sequences were available. In these cases, merging the sequences with their parent haplogroups L and R would have rendered the resolution too coarse. For a list of CR-HGs based on Phylotree Build 15 and the number of samples for each cluster see ESM2. Samples that were assigned to multiple haplogroups due to uncertainty were split equally into the respective CR-HGs.

ESM2

Summary of discernable control region haplogroup clusters (CR-HGs) based on 19,171 full control region haplotypes.

Assuming independent positions we estimated the fluctuation rate by

r_{αβ} = \frac{\sum_{γ} \min(n(α, γ), n(β, γ))}{\sum_{γ} n(γ)}

where α, β are elements of the set {A, C, G, T, –} with α ≠ β, γ runs over all CR-HGs in which α or β is dominant, n(x,γ) denotes the number of samples in CR-HG γ with symbol x, and n(γ) denotes the total number of samples in CR-HG γ.

Heteroplasmies were split equally into the represented bases and each base adds its fraction to n(x,γ). If the estimate for the fluctuation rate is zero, a minimum value of 10⁻⁶ for transitions and 10⁻⁹ for transversions or indels is assigned. Finally, we compute the diagonal rates as $r_{α α} = 1 - \sum_{β \neq α} r_{α β}$ .
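The estimator can be sketched in Python for a single position. This is a minimal illustration under stated assumptions, not the implementation used for EMMA: the function name `fluctuation_rate`, the `clusters` data layout and the toy counts are hypothetical, "dominant" is taken to mean the symbol with the highest count in a CR-HG, and ties are broken arbitrarily. Fractional counts stand for heteroplasmies split equally between their bases.

```python
def fluctuation_rate(alpha, beta, clusters, is_transition):
    """Estimate r_{alpha beta} at one position.

    clusters maps a CR-HG name to {symbol: count}; counts may be fractional.
    """
    num = 0.0
    den = 0.0
    for counts in clusters.values():
        dominant = max(counts, key=counts.get)
        if dominant not in (alpha, beta):
            continue  # gamma only runs over CR-HGs where alpha or beta dominates
        num += min(counts.get(alpha, 0.0), counts.get(beta, 0.0))
        den += sum(counts.values())  # n(gamma): all samples in the cluster
    rate = num / den if den > 0 else 0.0
    if rate == 0.0:
        # Minimum values from the text: 1e-6 (transitions), 1e-9 (transversions, indels).
        rate = 1e-6 if is_transition else 1e-9
    return rate


# Toy counts at one position (hypothetical, for illustration only).
clusters = {
    "hgA": {"T": 98.0, "C": 2.0},
    "hgB": {"C": 49.5, "T": 0.5},  # one T/C heteroplasmy adds 0.5 to each base
    "hgC": {"G": 10.0},
}
r_tc = fluctuation_rate("T", "C", clusters, is_transition=True)
```

With these toy counts the numerator is 2 + 0.5 and the denominator is 100 + 50; the cluster dominated by G does not enter the T/C estimate.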

Zero weight was assigned to insertions in the C-stretches of HVS-I (around 16,191 and 16,193) and HVS-II (around 309) because they carry no phylogenetic signature. Furthermore, zero weight was applied to deletions at positions 521 and 522 as they are sometimes the result of a 5′ alignment of indels in the AC-stretch. For multiple insertions at positions 315 (C-insertions), 455 (T-insertions), 524 (AC-insertions), and 573 (C-insertions) only the first insertion was weighted. Additional insertions were assigned weight zero. Two unique duplication events, a 15 base pair insertion at position 16,032 found in one haplotype from the USA and a 204 base pair insertion at position 563 found in two haplotypes from Morocco and the USA, respectively, were interpreted as single events. Thus, the weight of the single insertion of the first base was taken as measure for the whole duplication event. The same logic was applied to deletions at positions 105–111 relative to the rCRS.

The resulting fluctuation rates for control region mutations were expanded to the coding region using the number of occurrences of coding region mutations following [27]. Taking T16519C as the most frequent mutation with 209 occurrences according to Ref. [27], we derived fluctuation rates for all other mutations following the formula r = r(T16519C) × n/209. Additionally, multiple C-insertions at positions 960, 965, 5899, 8276, 8278 and insertions of the string CCCCCTCTA at position 8289 were treated as a single event. For the comprehensive list of fluctuation rates see ESM3.
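The extrapolation to the coding region is a simple proportional scaling, shown here as a short Python sketch. The function and parameter names are hypothetical, and the value passed for r(T16519C) in the example is a placeholder because the actual rate is only given in ESM3.

```python
def coding_region_rate(occurrences, r_t16519c, reference_occurrences=209):
    """Scale the T16519C rate by the relative number of occurrences n/209."""
    return r_t16519c * occurrences / reference_occurrences


# Placeholder rate for T16519C; not the value from ESM3.
placeholder_rate = 0.1
half_as_frequent = coding_region_rate(104.5, placeholder_rate)
```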

ESM3

Table of positional fluctuation rates.

Some mutations were not considered for tree reconstruction by Phylotree. To reflect this in the collection of virtual haplotypes, we introduced an extended nomenclature for denoting the simultaneous appearance of mutations and deletions. In addition to the character set defined by the IUPAC code we used corresponding small letters that represent the same nucleotides as the respective capital letter plus the deletion. For example, the annotation “523a” represents both the rCRS nucleotide and the deletion at position 523. In the same manner, “524.1m” represents an insertion of an A or C at position 524.1 as well as the non-insertion. Using this extended character set in EMMA we applied pattern match rules between nucleotides as presented in Ref. [25]. Mutations 309.1c 309.2c 309.3c 315.1c 315.2c 315.3c 521a 522c 523a 524c 524.1m 524.2m 524.3m 524.4m 524.5m 524.6m 524.7m 524.8m 16181m 16182m 16183m 16191.1c 16191.2c 16193.1c 16193.2c 16193.3c 16519Y were artificially added to all virtual haplotypes. Consequently, any mismatch between a test profile and a virtual database profile for these mutations does not influence the haplogroup estimate since they were not treated as differences when applying pattern match rules. The same holds true for mutations set in brackets by Phylotree. These were considered optional for the affected haplogroup status.
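The extended character set can be illustrated with a small Python sketch. It assumes that two symbols match whenever the sets of states they represent overlap; the actual pattern match rules are those of Ref. [25] and may differ in detail. The names `IUPAC`, `states` and `symbols_match` are hypothetical.

```python
# Standard IUPAC nucleotide codes and the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
DELETION = "-"


def states(symbol):
    """Return the set of states a symbol stands for.

    Lower-case letters represent the same nucleotides as the capital
    letter plus the deletion (or, for insertions, the non-insertion).
    """
    if symbol == DELETION:
        return {DELETION}
    allowed = set(IUPAC[symbol.upper()])
    if symbol.islower():
        allowed.add(DELETION)
    return allowed


def symbols_match(base_symbol, test_symbol):
    """Assumed rule: symbols match if their state sets overlap."""
    return bool(states(base_symbol) & states(test_symbol))
```

Under this rule "523a" in a virtual haplotype is compatible with both an A and a deletion at position 523 in the test profile, and "524.1m" is compatible with an inserted A, an inserted C, or no insertion.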

2.3. Algorithm

A test profile is compared to every database profile with the same or a larger reading frame (sequence span) than the test profile. Differences between two profiles (costs) are computed as detailed below. The basic idea is, for a test profile t, to maximize the likelihood of the base profile b, L_t(b) = ∏_i r(b_i → t_i), where the product is taken over all positions i of the common reading frame and r denotes the fluctuation rate. As the number of positions can be large (about 16.6 thousand for full genomes), evaluating the full product for every base profile would be computationally too expensive. The ranking of the base profiles is therefore determined as follows: instead of maximizing the likelihood function, we equivalently minimize the cost function C_t(b) = lg(∏_i r(t_i → t_i)/L_t(b)), where lg(x) = log_{10}(x)/3 is the scaled decimal logarithm. The numerator depends only on the test profile, so minimizing C_t(b) yields the same ranking as maximizing L_t(b). If the test profile and the base profile are encoded by short lists of differences relative to a reference sequence such as the rCRS, the cost function can be evaluated efficiently by the formula:

C_{t} (b) = \sum_{i} c (b_{i}, t_{i})

where i runs over all positions where the test and the base profile differ, and c(b_i,t_i) = lg(r(t_i → t_i)/r(b_i → t_i)) are real numbers termed positional costs for the change from the base profile symbol to the test profile symbol. Positions at which both profiles agree contribute lg(1) = 0 and can be skipped. The scaling of the logarithm was chosen so that an average mutation has costs of approximately 1.0. For ambiguous symbols in base or test profiles, the maximum rates of matching nucleotides are used. Therefore, the ranking of the base profiles by their total costs equals the maximum likelihood ranking. In the output of the algorithm, only base profiles with the lowest and second lowest costs are presented, where a tolerance value (default 0.3) is used to cluster the optimal and suboptimal profiles.
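The cost evaluation and the clustering of results can be sketched as follows. This is a simplified Python illustration, not EMMA's code: the function names, the dictionary layout of profiles and rates, and the example rates are hypothetical; ambiguous symbols are not handled; and the reading of the tolerance (rank 1 holds every profile within the tolerance of the lowest costs, rank 2 every profile within the tolerance of the next lowest costs) is an assumption based on the description above.

```python
import math


def lg(x):
    """Scaled decimal logarithm, lg(x) = log10(x) / 3."""
    return math.log10(x) / 3.0


def total_cost(base, test, rates):
    """Sum positional costs over positions where base and test differ.

    base, test: {position: symbol} over the common reading frame.
    rates: {position: {(from_symbol, to_symbol): rate}}.
    """
    cost = 0.0
    for pos, t in test.items():
        b = base[pos]
        if b != t:
            r = rates[pos]
            cost += lg(r[(t, t)] / r[(b, t)])  # c(b_i, t_i)
    return cost


def rank_profiles(costs, tolerance=0.3):
    """Cluster profiles into rank 1 and rank 2 by total costs."""
    ordered = sorted(costs.items(), key=lambda kv: kv[1])
    if not ordered:
        return [], []
    best = ordered[0][1]
    rank1 = [name for name, c in ordered if c <= best + tolerance]
    rest = [(name, c) for name, c in ordered if c > best + tolerance]
    if not rest:
        return rank1, []
    second = rest[0][1]
    rank2 = [name for name, c in rest if c <= second + tolerance]
    return rank1, rank2


# Hypothetical rates: an off-diagonal rate of 0.001 gives costs close to 1.0.
rates = {16519: {("T", "T"): 0.999, ("C", "T"): 0.001}}
example_cost = total_cost({16519: "C"}, {16519: "T"}, rates)

# Costs 0.00 and 0.25 as for sample no. 23 in Table 1; the third entry is made up.
rank1, rank2 = rank_profiles(
    {"V-@72": 0.00, "AY495306": 0.25, "hypothetical_profile": 1.00}
)
```

The example shows why the logarithm is divided by three: a rate of about one in a thousand translates into costs of about 1.0.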

3. Results and discussion

3.1. Validation of the concept

3.1.1. Preparation and validation of the mtGenome database

Any estimate of the haplogroup status of an mtDNA sequence can only be as good as the underlying database of sequences used to derive the estimate. EMMA uses two databases, one consisting of the 3925 virtual haplotypes of Phylotree Build 15 and one comprised of 14,990 full mtGenomes downloaded from GenBank. Because the latter come without haplogroup information, we used the database of virtual haplotypes from Phylotree Build 15 to estimate the haplogroup of the downloaded mtGenomes. This procedure was straightforward in most cases, as the presented coding region information in the virtual haplotypes was sufficient for unambiguous haplogroup determination of the GenBank sequences. To validate this procedure we extracted the 5920 mtGenomes that are cited as haplogroup-specific example sequences in Phylotree Build 15, serving as the basis for the virtual haplotypes. Of these, 140 were excluded for quality and coverage reasons. For the remaining 5780 genomes (38.6% of the 14,990 mtGenomes in the total database), the haplogroup status was determined with EMMA. We found full concordance between the Phylotree haplogroup labels and EMMA's haplogroup estimates in 5774 cases (99.9% of 5780). The different results observed in six cases could be attributed to two issues. First, a missing signature mutation caused a more conservative estimate (more ancestral haplogroup) by EMMA in three cases (AY922257, HQ873519, JQ797975). For example, AY922257 (haplogroup M30c1a1 in Phylotree Build 15) lacked the C16069T required for haplogroup M30c1a, and although the subhaplogroup M30c1a1 diagnostic G9966A was present, EMMA estimated haplogroup M30c1. In the other three cases (EU219921, EU545420, JQ704528) two different haplogroups were compatible with the sequence motif of the mtGenome, thus the haplogroup assignment relied on the costs of the differing mutations. 
EU545420 (haplogroup HV9-152 in Phylotree Build 15), for example, carried G8994A (costs of 0.79) and T152C (costs of 0.29), which together suggested haplogroup HV9-152. However, mutation C13449T (costs of 2.00), which indicated haplogroup HV10, was also present in this haplotype. EMMA favored haplogroup HV10 because the mismatch with the two HV9-152 diagnostic mutations resulted in lower overall costs than the mismatch with C13449T. For details on all six examples, see ESM4. This validation process revealed one error in mtGenome AY882385 of GenBank (haplogroup U3b1a), where position 3546 carried the transition C3546T instead of the U3b1 characteristic transversion C3546A. After this finding was communicated to the authors of this mtGenome, the sequence data were corrected and updated in GenBank.
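The comparison for EU545420 can be written out explicitly. The following worked arithmetic assumes that no other differences contribute to either total; the notation C(·) for the total costs against a virtual haplotype is introduced here for illustration only.

```latex
\begin{align*}
C(\text{HV10})    &= c(\text{G8994A}) + c(\text{T152C}) = 0.79 + 0.29 = 1.08,\\
C(\text{HV9-152}) &= c(\text{C13449T}) = 2.00,\\
C(\text{HV10})    &< C(\text{HV9-152}).
\end{align*}
```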

ESM4

Example genomes from Phylotree Build 15 where the haplogroup estimate of EMMA differed from Phylotree's haplogroup assignment.

The same procedure was employed for the example genomes of Phylotree Build 14, before Build 15 was released. In addition to high concordance of haplogroups, the EMMA haplogroup assignment process unveiled the following details. FJ383192 (haplogroup D4b2b5) lacked the diagnostic polymorphism 9296 for haplogroup D4b2b. DQ112751 (coding region only) listed as haplogroup S1, lacked marker 14384C that is characteristic for haplogroup S1. Thus, both mtGenomes chosen for Phylotree Build 14 did not constitute ideal representatives for these haplogroups and were replaced in Build 15. Furthermore, our analysis showed that the sequence motifs for haplogroups K1a8b and K1a22 were identical in Build 14. Haplogroup K1a22 was therefore deleted in Build 15. Finally, polymorphism 1718 for haplogroup C5b1a1 should have read 1719, which was also corrected in Build 15.

3.1.2. Validation using external/literature HVS-I/HVS-II and CR profiles

To further evaluate EMMA, we processed all 26 CR haplotypes from Table 2 in Ref. [7]. In general, estimates from EMMA were the same as or a refinement of the most likely haplogroup assigned by Bandelt et al. based on Phylotree 13. This was also true for the related GenBank entries found by EMMA. A summary of the haplotypes and haplogroup assignments of Table 2 in Ref. [7] is reproduced in Table 1, but modified to include results obtained with EMMA and updated haplogroup estimates from HaploGrep based on Phylotree Build 15.

Table 1.

Samples from Table 2 of Ref. [7] with updated haplogroup classification based on Phylotree Build 15 according to HaploGrep and EMMA.

No.	Related GenBank sample in Ref. [7]	(Updated) haplogroup of related GenBank sample from Ref. [7]	Likely haplogroup [7]^a	HaploGrep [12]^b	QV^c	EMMA^d	Costs^e	Rank 1 results EMMA^f	Missing mutations^g	Private mutations^h
1	HM030505	M20	M20	M20	90.8	M20	0.72	HM030505	T16362C T16519C	None
2	HM030542	N10a	N10a	N10a	86.6	N10a	3.06	HM030542	C16111T T16224C T16519C	A16258T T16311C
3	HM030500	N10b	N10b	N10b	100.0	N10b	0.00	N10b	None	None
							0.00	HM030500 (N10b)	None	−309.2C
4	EF093557	E1a1a	M52a	M52a	74.9	M52a	4.62	M52a	None	C16114A T16126C C16218T C16291T T16356C G16391A
5	EF093544	E1a1a	E1a1a	E1a1a	100.0	E1a1a, E1a1a1	0.00	E1a1a	None	None
							0.00	E1a1a1	None	None
							0.00	EF093544 (E1a1a)	None	−309.1C
6	GU296545	U5b2a1b	U5b2a1b	U5b2a1b	93.5	U5b2a1b	0.61	U5b2a1b, GU296545	None	G16319A
7	EU979418	H1a3c	H1a3	H1a3c	90.4	H1a3c	0.39	JQ703419	None	T146C
8	GU903270	J2a1a1a	J2a1a1	J2a1a1a	100.0	J2a1a1a, J2a1a1a2	0.00	J2a1a1a	None	None
							0.00	J2a1a1a2	None	None
							0.00	GU903270 (J2a1a1a)	None	None
9	EU770310	K2b1a1	K2b1a	K2b1a1	94.1	K2b1a1	0.00	EU770310	None	None
10	GU296544	U5b2a1a1	U5b2a1a1	U5b2a1a2	75.1	U5b2a1a1	0.00	GU296544	None	None
11	EU597496	K1a4a1e	K1a(K1a4a1)	K1a4a1e	94.4	K1a4a1e	0.52	K1a4a1e	None	T204C
							0.52	EU597496 (K1a4a1e)	−524.3A −524.4C	T204C
12	AY495209	J1c6	J1c6	J1c + 16261 + 189	92.2	J1c-16261-189, J1c12	0.63	J1c-16261-189	T16126C	None
							0.63	J1c12	T16126C	None
13	EU597496	K1a4a1e	K1a(K1a4a1)	K1a4a1e	94.4	K1a4a1e	1.51	K1a4a1e	None	T204C A272G
							1.51	EU597496 (K1a4a1e)	−524.3A −524.4C	T204C A272G
14	AY882396	U1a1	U1a1	U1a1b	95.1	U1a1b	0.00	JX289842	None	−16193.1C
15	FJ348157	J2a1a1a2	J2a1a	J2a1a1	86.0	J2a1a1	0.00	HQ699438	None	None
16	AY882396	U1a1	U1a1	U1a1b	95.1	U1a1b	0.00	JX289842	−309.2C	−16193.1C
17	AF347006	V7a1, excluded for EMMA [20]	V7a	V7a	76.1	V7a, V7a1	1.55	V7a	None	A73G A95C
							1.55	V7a1	None	A73G A95C
							1.55	JX171078 (V7a1)	−309.2C	A73G A95C
18	AF347006	V7a1, excluded for EMMA [20]	V7a	V7a	85.8	V7a, V7a1	1.55	V7a	None	A73G A95C
							1.55	V7a1	None	A73G A95C
							1.55	JX171078 (V7a1)	−309.2C	A73G A95C
19	EU095550	B2d	B4b(B2d)	B4b	89.0	B2d	0.00	EU095550	None	−16193.1C −309.2C −309.3C
20	FJ770954	M43a	M(M43a)	D4e2a	103.3	D4e2a, M10, M10a, M74, M74b, M74b2, D4j-16311, D4j11, M43a	0.38	D4e2a	None	T16311C −573.2C −573.3C
							0.47	M10	None	T16362C −573.2C −573.3C
							0.47	M10a	None	T16362C −573.2C −573.3C
							0.53	M74	None	−573.1C −573.2C −573.3C
							0.53	M74b	None	−573.1C −573.2C −573.3C
							0.53	M74b2	None	−573.1C −573.2C −573.3C
							0.53	D4j-16311	None	−573.1C −573.2C −573.3C
							0.53	D4j11	None	−573.1C −573.2C −573.3C
							0.53	FJ770954 (M43a)	−309.1C	−573.1C −573.2C −573.3C
							0.53	JQ702232 (M74b2)	−309.1C	−573.1C −573.2C −573.3C
							0.63	AP008834 (D4e2a)	None	T16311C T16519C −573.2C −573.3C
							0.63	AP008468 (D4e2a)	−573.4C	T16311C T16519C
21	HM030520	M74b	M74	D4j1b2	79.8	M74b	2.71	HM030520	C16214T	T16093C T16172C T16297C
22	AF347006	V7a1, excluded for EMMA [20]	V7a	V7a	90.7	V7a, V7a1	0.80	V7a	None	A95C
							0.80	V7a1	None	A95C
23	AY495306	V-@72	HV(V)	V + @72	100.0	V-@72, V1a1, V1a, V1a1b	0.00	V-@72	None	None
							0.25	AY495306	−309.1C	T16519C
24	HQ165756	sample not in GenBank anymore	HV1(HV1b)	HV1b3	96.0	HV1b3	0.00	JQ704284	None	None
25	AY339515	Z1a1a	Z1a1a	Z1a	100.0	Z1a, Z1a1, Z1a1a, Z1a2	0.00	Z1a	None	None
							0.00	AY339515 (Z1a1a)	−309.1C	None
26	EF660967	J2a2a, excluded for EMMA [20]	J2a2a	J2a2a	86.8	J2a2a	0.00	JQ797922	None	None

^a Likely haplogroup assigned in Ref. [7] based on Phylotree Build 13.

^b Haplogroup classification by HaploGrep using Phylotree Build 15.

^c Quality value assigned by HaploGrep.

^d Haplogroups of rank 1 results estimated by EMMA.

^e Estimated costs by EMMA; the default cost range of 0.3 was applied for all calculations.

^f Source for rank 1 haplogroups by EMMA; haplogroup labels indicate virtual haplotypes from Phylotree, GenBank accession numbers indicate mtGenomes.

^g Mutations within the test profile's range that are not present in the test profile but in the database profile.

^h Mutations within the test profile's range that are present in the test profile but absent in the database profile.

The nearest GenBank samples found by EMMA with respect to fluctuation rate-induced costs coincided with those listed in the original table for 15 of the 26 samples. Example no. 4 lists sample EF093557 together with the most likely haplogroup M52a in the original table. However, taking Phylotree Build 15 into account, sample EF093557 is of haplogroup E1a1a status, not M52a as stated in Ref. [7]. EMMA correctly estimated the haplogroup for example no. 4 based on the virtual haplotype M52a. The nearest mtGenome according to EMMA was JQ703445 (haplogroup M52a) with costs of 5.74.

For example no. 7, GenBank entry JQ703419 with costs of 0.39 seemed more closely related to the profile in question than EU979418 (indicated in Ref. [7]) with costs of 1.10. Both samples however belong to haplogroup H1a3c. Example no. 12 was assigned haplogroup J1c6 status based on GenBank entry AY495209 [7]. In addition to the virtual haplotype J1c-16261-189, which was also returned by HaploGrep, EMMA included haplogroup J1c12 in the top ranks. If only the full mtGenomes were searched, samples JQ702857 (haplogroup J1c7a) and JQ705489 (haplogroup J1c-16261) at costs of 1.14 would have been returned. We therefore suggest J1c12 (not yet defined in Phylotree Build 13) as most likely haplogroup as this clade is a subbranch of the intermediate branch J1c-16261-189.

For the examples nos. 14–16 EMMA returned different GenBank samples than those indicated in the original table [7]. However, the most likely haplogroups according to EMMA were not distinct from the originals. Instead, they were refinements of those given by Bandelt et al. [7]. Differences between the EMMA and Bandelt et al. [7] analyses were also observed for examples 17, 18, 22, 24, and 26. In these cases, the most likely haplogroup was confirmed, but the GenBank entries favored by EMMA were different from those found by Bandelt et al. The reason for this can be explained by the underlying database: samples AF347006 and EF660967 were excluded from our database for quality reasons (see Ref. [20]). Sample HQ165756 is no longer available at GenBank and has apparently been removed (“this record was removed at the submitter's request”).

In Ref. [7], haplotype no. 20 was assigned to haplogroup M43a as the most likely haplogroup, based on GenBank entry FJ770954. For this profile, HaploGrep reported haplogroup D4e2a as best choice with a quality value of 103.3%. This CR profile cannot be unambiguously assigned to a single haplogroup without additional information from the coding region. Two equally matching GenBank samples of different haplogroups were found by EMMA: FJ770954 (haplogroup M43a) and JQ702232 (haplogroup M74b2), both at costs of 0.53, that only differed in a C-insertion at 309 and three C-insertions at 573. A search along Phylotree added haplogroups D4e2a, M10, M74, D4j-16311 and subhaplogroups thereof that only differed in the coding region. For haplotype no. 25, several different genomes that all perfectly matched the test profile (zero costs) were reported by EMMA. Among those, sample AY339515 (haplogroup Z1a1a) was found, but haplogroups Z1a (FJ147318), Z1a1 (FJ493513), and Z1a2 (AY195761) were also presented. In summary, this compilation shows that EMMA correctly classified all of the 26 examples.

3.2. Improved haplogroup assignment with full mtGenome sequences

Haplogrouping of partial mtDNA sequences such as CR sequences, when restricted to virtual haplotypes, may lead to imprecise or even inaccurate results, as demonstrated below. Taking a closer look at the Argentinean sample introduced earlier, its CR haplotype 16189C 16292T 16519C 71A 153G 204C 207A 263G 315.1C 373G (ABS133 [28]) was assigned to haplogroup H by mtDNAmanager and to haplogroup H1 + 16189 by HaploGrep with the quality value of 64.7% reflecting the low quality of the estimate. Applying EMMA together with the collection of 14,990 full mtGenomes, the best match was JQ705203 (haplogroup H55) with costs of 3.39 (ESM5b). The second best estimate was also a member of haplogroup H55 (JQ704460) with costs of 3.91. Additional sequencing of the coding region segments 3042–3549, 4270–4859, and 10,184–11,000 yielded the variants 4769G 10646A confirming haplogroup H55 status (Phylotree, Build 15).

ESM5

Results obtained by EMMA for the examples discussed.

The CR haplotype of sample “stain 3, GEDNAP 44” 16248T 146C 263G 309.1C 309.2C 315.1C [29] suggested haplogroup H status. The transition C16248T is only found in haplogroups H1ah1, H3w and H4c1 within haplogroup H in Phylotree (Build 15), with haplogroup H4c1 status also requiring a transition at 73. MtDNAmanager assigned this profile to haplogroup H, while the query in HaploGrep confirmed the manual search in Phylotree with quality values of 82.7% for H1ah1 and 94.2% for H3w. EMMA found two perfectly matching (costs of 0.00) GenBank entries belonging to haplogroup HV (GU592048, GU592033). Haplogroups H1ah1 and H3w followed with costs of 0.39 resulting from the private mutation at 146 (ESM5a). Sequencing of the coding region segments 3280–3513 6700–7140 7226–7749 9110–9360 14,520–14,950 yielded 7028T as single difference to the rCRS, therefore excluding haplogroup H status (thus rejecting haplogroups H1ah1 and H3w) and confirming haplogroup status HV. The two previous examples demonstrate that browsing virtual haplotypes in motif lists is not always sufficient to determine reliable haplogroup status. Private mutations that can only be matched with full mtGenome sequences provide relevant information when searching for matching haplotypes and nearest neighbors to support haplogroup assignment.

We noted that in some cases virtual haplotypes lead to better estimates (lower costs) of the haplogroup than the relevant full mtGenome, especially when only a few distant mtGenome sequences are available for defining a haplogroup, or when private mutations put an mtGenome at further distance than the virtual haplotype. This is evident, for example, in CR haplotype no. 4 (Table 1), which was assigned to haplogroup M52a both by the virtual Phylotree haplotypes (costs of 4.62–4.92) and by the database of full mtGenomes. However, the estimate produced with the full genomes (JQ703445) came with higher costs of 5.74. Overall, our analyses confirmed the advantage gained by including both full mtGenomes and virtual haplotypes as primary sources for estimating mtDNA haplogroups using EMMA.

3.3. EMMA for QC purposes

In the course of developing and refining the software EMMA, all haplotypes stored in the EMPOP database were tested. Since the algorithm applied by EMMA bases haplogroup assignment on costs (fluctuation rates), high-cost profiles may also pinpoint errors in sequence data. The following example describes this particular application of the software in more detail. The CR profile of sample FRE228 from southwest Germany [EMPOP, unpublished], 16192T 16223T 16356C 16519C 73G 195C 263G 315.1C, resulted in a best match with mtGenome JQ704890 (haplogroup H7b) at costs of 1.23 for two private mutations, 16192T and 73G. Further estimates suggested FJ467950 (R8a1b, costs of 1.83, private mutations 16192T 16223T 16356C) and several samples from haplogroup U4b1a (and subhaplogroups thereof) at costs of 1.86 (see ESM5b). All of these haplotypes harbored differences from the profile in question due to the private mutations 16192T 16223T, as well as 499A, a polymorphism absent in the query sequence but an otherwise stable signature mutation for haplogroup U4′9.

To clarify the haplogroup status of this sample, the coding region segments 4402–5448 and 11,996–12,860 were sequenced. The variants thus observed, 4646C 4769G 12308G 12372A, confirmed haplogroup U4 status. The relatively high weight of mutation G499A (costs of 0.77), which could have back-mutated within haplogroup U4, put all other haplogroup U4 database haplotypes at higher total costs and thus lower ranks. A more detailed analysis of the EMMA output gave rise to the assumption that the sample could belong to haplogroup U4a2b, which is defined by CR mutations 310C and 16223T on top of the haplogroup U4 motif. The lack of the transition at 310, however, led to inconclusive results, with EMMA suggesting various subclades within U4 as best estimates. Re-examination of the CR raw data revealed that, despite redundant sequence coverage (4/5 sequences), an interpretation error had occurred in the primary analysis of this sample.

The sequence contig of sample FRE228 mistakenly contained a forward sequencing read of sample FRE208 with the HVS-II motif 204C 263G 315.1C, which led to an artificial recombinant in the consensus sequence reported for FRE228. For a screenshot of the sequence contig see ESM6. The corrected consensus haplotype for sample FRE228 reads 16192T 16223T 16356C 16519C 73G 195C 263G 310C 499A 524.1A 524.2C 524.3A 524.4C 524.5A 524.6C, which was fully compatible with haplogroup status U4a2b. This dataset was analyzed at a very early stage of the EMPOP database project and loaded with Release 1. Based on this finding, the entire dataset from southwest Germany was inspected again, and one other similar instance was observed where a wrongly imported sequence read caused an incorrect consensus haplotype. Upon reanalysis, it was established that the CR profile of sample FRE237, 16069T 16126C 16189C 16519C 73G 185A 188G 228A 263G 295T 315.1C 462T 489C, did not carry 309.1C. This was corrected in EMPOP Release 9 (see also sample history at EMPOP). This example demonstrates the power of haplogrouping as a tool for quality control by highlighting missing and private mutations of a sample in question.
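The correction of FRE228 described above can be made explicit by comparing the reported and the corrected profile as sets of variants. The following is a minimal sketch in Python and not part of EMMA; the helper names parse_profile and position_key are hypothetical, and only the two haplotype strings are taken from the text.

```python
def parse_profile(text):
    """Split a whitespace-separated haplotype string into a set of variants."""
    return set(text.split())


def position_key(variant):
    """Sort key: numeric position including insertion index, e.g. '524.1A' -> 524.1."""
    digits = "".join(ch for ch in variant if ch.isdigit() or ch == ".")
    return float(digits)


# Profiles copied from the text (originally reported vs. corrected consensus).
reported = parse_profile("16192T 16223T 16356C 16519C 73G 195C 263G 315.1C")
corrected = parse_profile(
    "16192T 16223T 16356C 16519C 73G 195C 263G 310C 499A "
    "524.1A 524.2C 524.3A 524.4C 524.5A 524.6C"
)

# Variants that differ between the two versions of the profile.
gained = sorted(corrected - reported, key=position_key)
lost = sorted(reported - corrected, key=position_key)

print("only in corrected profile:", gained)
print("only in reported profile: ", lost)
```

As listed in the text, the corrected profile gains 310C, 499A and the insertions at 524 and no longer contains 315.1C, which are exactly the positions in the HVS-II segment affected by the wrongly imported read.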

ESM6

Screenshot of sequence contig of sample FRE228.

mmc6.pdf (751KB, pdf)

Reported instances of artificial recombination (Table 3 in Ref. [7]) were further tested with EMMA following the same logic (ESM7). In all but one sample (#7, ESM7), the haplogroups identified by EMMA were identical to one of the two reported components in Ref. [7]. For sample #7, haplogroups C4c1b and C7a2 were proposed by EMMA since the haplogroup C1-specific transition T16325C and the tandem deletion at positions 290 and 291 were missing in the given example. More interestingly, all haplogroup estimates were flagged with rather high costs (ESM7), indicating that the haplotypes may need to be checked with the submitting authors in a QC format. In our experience, cost values exceeding 2.00 were generally correlated with error or misalignment, except for a few unusual distant haplotypes. Therefore, we suggest prioritizing samples with such cost values for data review.
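The review rule suggested above amounts to a simple filter on the lowest cost reported for each sample. The sketch below illustrates this in Python under stated assumptions: the function name flag_for_review is hypothetical, "exceeding 2.00" is read as strictly greater than 2.00, and the input values are the lowest-cost estimates quoted in Sections 3.3 and 3.4 of this article.

```python
REVIEW_THRESHOLD = 2.00  # value named in the text; read here as "strictly greater than"

# Lowest-cost estimates quoted in Sections 3.3 and 3.4 (sample ID -> cost).
best_costs = {
    "FRE228": 1.23,
    "Z034": 3.41,
    "VEC033": 2.00,
    "I_070": 1.59,
}


def flag_for_review(costs, threshold=REVIEW_THRESHOLD):
    """Return sample IDs whose best estimate costs more than the threshold, highest first."""
    flagged = [sample for sample, cost in costs.items() if cost > threshold]
    return sorted(flagged, key=lambda sample: costs[sample], reverse=True)


print(flag_for_review(best_costs))
```

With these values only Z034 is flagged. The originally reported FRE228 profile, which did contain an error, had a best cost of 1.23 and would not be flagged, so such a threshold is better seen as a way to prioritize review than as a complete error detector.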

ESM7

Results obtained by EMMA for artificial recombination described in Table 3 of Ref. [7].

mmc7.txt (560.2KB, txt)

3.4. Limitations – critical samples

The full mtGenome sequence from Spain [Sample ID Z034, unpublished, personal communication] 73G 195C 263G 315.1C 497T 524.1A 524.2C 750G 1189C 1438G 1811G 2706G 3480G 4769G 5460A 5655C 7028T 7521A 8860G 9055A 9698C 10398G 10550G 11299C 11467G 11719A 11914A 12308G 12372A 14167T 14766T 14798C 15326G 16093C 16189C 16224C 16311C 16519C was initially assigned to haplogroup K1a12 by the authors. EMMA's lowest cost estimate (costs of 3.41) was produced with a virtual haplotype for haplogroup K1a1 with private mutations 16189C 195C 5460A 5655C 7521A. Haplogroup K1a12 was ranked second with costs of 3.82 resulting from private mutations 16093C 16189C 195C 5655C 7521A 11914A. The costs for coding region mutations 5460A and 11914A were equal (costs of 1.05 each) and the two remaining private coding region mutations (5655C and 7521A) were found for both results. The additional costs of 0.41 for K1a12 were contributed by T16093C (see ESM5a). The best estimate based on the mtGenome database was found with JQ703323 (haplogroup K1a12) with costs of 3.97 for four private mutations (16189C 5655C 7521A 11914A) and additional costs of 0.94 for the missing mutation 16343G (ESM5b). Although in virtually all cases full mtGenomes can be assigned unambiguously to a haplogroup, this example highlights an exception to that general rule. In this case, primarily the resolution of haplogroups K1a1 and K1a12 by full mtGenomes and the subsequent definition of the mitochondrial phylogeny caused difficulties in haplogrouping of this sample, which can be attributed to the relatively high mutability (and hence weak diagnostic value) of sites 11,914 and 5460 defining haplogroups K1a1 and K1a12, respectively.
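The cost difference between the two top-ranked estimates for Z034 can be retraced from the numbers given above. The following Python sketch assumes that total costs are the sum of per-mutation costs, which is consistent with the figures in the text but is not a restatement of the EMMA implementation; the variable names are hypothetical, and only the three per-mutation costs stated explicitly in the text are used.

```python
# Private mutations reported by EMMA for the two top-ranked estimates of Z034.
private_k1a1 = {"16189C", "195C", "5460A", "5655C", "7521A"}
private_k1a12 = {"16093C", "16189C", "195C", "5655C", "7521A", "11914A"}

# Per-mutation costs that are stated explicitly in the text.
known_cost = {"5460A": 1.05, "11914A": 1.05, "16093C": 0.41}

shared = private_k1a1 & private_k1a12      # contribute equally to both totals
only_k1a1 = private_k1a1 - private_k1a12   # counted only for K1a1
only_k1a12 = private_k1a12 - private_k1a1  # counted only for K1a12

# Shared private mutations cancel, so only the unshared ones matter.
difference = (sum(known_cost[m] for m in only_k1a12)
              - sum(known_cost[m] for m in only_k1a1))

print(sorted(shared))
print(round(difference, 2), round(3.82 - 3.41, 2))
```

Because 5460A and 11914A carry the same cost, they cancel, and the remaining difference equals the cost of 16093C, in agreement with the reported totals of 3.41 and 3.82.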

Recurrent and unstable mutations are set in brackets by Phylotree [4]. These mutations are interpreted as optional for the respective clade of the mitochondrial phylogeny by EMMA. Thus, the absence of such a mutation in a test profile is not penalized with additional costs. This is different from HaploGrep [12] where brackets in Phylotree are ignored and the respective mutations treated in the same manner as other mutations. Consider the CR haplotype VEC033 from Venezuela 16223T 16241G 16325C 16362C 16519C 73G 94A 204C 263G 315.1C 489C in Ref. [30]. The best estimate by EMMA was haplogroup D4e1a with costs of 2.00. Haplogroup D1 and subhaplogroups thereof followed with costs of 2.26 (see ESM5a). Using HaploGrep the best estimate was haplogroup D1 with a quality value of 83.8%. D4e1a ranked fourth with a quality value of 79.7% since mutation 16092C for haplogroup D4e1 was missing in this haplotype. However, this mutation was stated in brackets in Phylotree which is disregarded by HaploGrep, resulting in the reduced rank for haplogroup D4e1a.
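The different treatment of bracketed mutations can be illustrated with a short Python sketch. It is a simplified illustration, not the EMMA or HaploGrep implementation: the functions parse_motif and missing_cost are hypothetical, brackets are assumed to be written as parentheses, the motif is a hypothetical two-mutation fragment in which only the bracketed 16092C is taken from the text, and the cost values are placeholders rather than fluctuation rates.

```python
def parse_motif(tokens):
    """Map each motif mutation to True if it is bracketed (optional), else False."""
    motif = {}
    for token in tokens:
        optional = token.startswith("(") and token.endswith(")")
        motif[token.strip("()")] = optional
    return motif


def missing_cost(query, motif, cost, honour_brackets=True):
    """Sum the costs of motif mutations that are absent from the query.

    With honour_brackets=True, bracketed mutations are optional and never penalized.
    """
    total = 0.0
    for mutation, optional in motif.items():
        if mutation in query:
            continue  # mutation observed, nothing missing
        if optional and honour_brackets:
            continue  # bracketed mutation: absence is tolerated
        total += cost[mutation]
    return total


# VEC033 profile from the text.
query = set("16223T 16241G 16325C 16362C 16519C 73G 94A 204C 263G 315.1C 489C".split())

# Hypothetical motif fragment; only the bracketed 16092C is taken from the text.
motif = parse_motif(["(16092C)", "16362C"])
cost = {"16092C": 0.5, "16362C": 0.5}  # placeholder costs, not EMMA fluctuation rates

print(missing_cost(query, motif, cost, honour_brackets=True))
print(missing_cost(query, motif, cost, honour_brackets=False))
```

When brackets are honoured, the absent 16092C adds nothing to the total; when they are ignored, it is penalized like any other missing mutation, which is the kind of behavior that lowers the rank of D4e1a in the example above.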

The northern Indian haplotype 16223T 16240G 16298C 16311C 16327T 16357C 73G 249DEL 263G 310C 315.1C 489C [Sample ID I_070, unpublished, range 16,024–16,365 73–340 438–576] was assigned to haplogroup C4a3b by EMMA (minimal costs of 1.59 toward mtGenome NA18547, see ESM5a). HaploGrep reported haplogroup D1 with a quality value of 88.7%. The correct haplogroup C4a3b ranked second with a quality value of 87.0%. Again, this was due to the fact that 16278T for haplogroup C4a3 was reported as missing in the haplotype even though this mutation was set in brackets in Phylotree. These examples demonstrate that unstable/recurrent mutations can play a critical role in the assignment of haplogroups and therefore must be properly handled in automated software solutions.

Besides mutation stability, the amount of sequence information available is vital for reliable haplogroup estimation. The lack of information resulting from sequence stretches omitted from analyses causes ambiguous haplogroup estimates. In ESM8 and ESM9 we present four control region profiles of the EMPOP database for which the full mtGenome was analyzed, providing the maximum resolution possible. They demonstrate that haplogrouping cannot be fully automated and that manual inspection is required, especially for partial profiles.

ESM8

Practical guide for using EMMA.

mmc8.doc (31KB, doc)

ESM9

Detailed result obtained using EMMA for the examples presented in ESM8.

mmc9.txt (1.2MB, txt)

4. Conclusions

With the public availability of Phylotree [4], which incorporates data from myriad sources and ultimately serves as the most up-to-date and comprehensive mtDNA phylogeny, the assignment of haplogroups to haplotypes has been greatly facilitated. Nevertheless, and despite the broad acceptance of Phylotree in population, medical, and forensic genetics, the task of manually classifying mitochondrial haplotypes into haplogroups remains tedious and error-prone. The manual evaluation of the mutation rates/weights – an important factor in haplogroup assignment – can be biased, and the instability of particular mutations within haplogroups (as opposed to across the entire tree) is difficult to incorporate manually.

Though Phylotree gave rise to (semi-)automated tools for haplogrouping, these approaches have been shown to be restricted in their capacity to overcome the deficiencies of manual haplogrouping [7]. The main limitations of these tools stem from the fact that they rely solely on haplogroup-defining patterns of mutations derived from the phylogenetic tree and they lack a database of real haplotypes that takes private mutations into account.

To improve haplogrouping of mtDNA sequences we propose EMMA, a concept for estimating mitochondrial DNA haplogroups using a maximum likelihood approach. The concept is based on two pillars: a comprehensive and curated database of 14,990 full mtGenomes and 3925 virtual haplotypes of Phylotree Build 15 compiled to represent the backbone for drawing haplogroup estimates, and a database of 19,171 manually haplogrouped CR haplotypes that were used to determine the fluctuation rate of mutations using a maximum likelihood approach. Based on these fluctuation rates, cost values resulting from differences between the mtDNA profiles are calculated by EMMA. Lowest cost profiles within a defined tolerance are reported in the output.
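The reporting step mentioned above, in which the lowest-cost profiles within a defined tolerance are returned, can be sketched as follows. This is an illustration under assumptions rather than the EMMA implementation: the function report_within_tolerance is hypothetical, the tolerance is assumed to be an absolute difference from the minimal cost, and the tolerance value used is arbitrary. The candidate costs are those quoted in Section 3.3 for the originally reported FRE228 profile.

```python
def report_within_tolerance(candidates, tolerance):
    """Return (name, cost) pairs whose cost lies within `tolerance` of the minimum.

    The result is sorted by increasing cost; an empty input gives an empty list.
    """
    if not candidates:
        return []
    best = min(candidates.values())
    kept = [(name, cost) for name, cost in candidates.items()
            if cost - best <= tolerance]
    return sorted(kept, key=lambda item: item[1])


# Costs quoted in Section 3.3 for the originally reported FRE228 profile.
fre228 = {"JQ704890 (H7b)": 1.23, "FJ467950 (R8a1b)": 1.83, "U4b1a": 1.86}

# The tolerance of 0.7 is an arbitrary value chosen for illustration.
for name, cost in report_within_tolerance(fre228, tolerance=0.7):
    print(f"{name}\t{cost:.2f}")
```

A wider tolerance reports more candidates, which helps to expose competing estimates such as the U4 haplotypes in the FRE228 example, whereas a narrow tolerance returns only the best match.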

The approach was validated using full mtGenomes cited in Phylotree as blind samples, along with literature data for which the haplogroup assignments had been previously challenged. EMMA will be made available with a new EMPOP version that is currently under construction, to be used in the evaluation of the ever growing number of quality-controlled CR sequences. In turn, these new CR data will further enhance the quality of EMMA's haplogroup estimates and improve this already useful tool for mtDNA analysis.

Acknowledgments

The authors would like to thank Martin Bodner, Liane Fendt, Gabriela Huber, Simone Nagl, Daniela Niederwieser, and Bettina Zimmermann for additional sequencing and haplogrouping of samples. The authors are grateful to Jodi Irwin for comments on the manuscript and fruitful discussions. The study was supported by the FWF Austrian Science Fund (TR397) and the National Institute of Justice (NIJ) grant 2011-MU-MU-K402 and has received funding from the European Union Seventh Framework Programme (FP7/2007–2013) under grant agreement no. 285487 (EuroForGen-NoE). MvO was supported in part by the Netherlands Forensic Institute (NFI) and by a grant from the Netherlands Genomics Initiative (NGI)/Netherlands Organization for Scientific Research (NWO) within the framework of the Forensic Genomics Consortium Netherlands (FGCN).

Footnotes

^☆

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike License, which permits non-commercial use, distribution, and reproduction in any medium, provided the original author and source are credited.

References

1. Anderson S., Bankier A.T., Barrell B.G., de Bruijn M.H., Coulson A.R., Drouin J., Eperon I.C., Nierlich D.P., Roe B.A., Sanger F., Schreier P.H., Smith A.J., Staden R., Young I.G. Sequence and organization of the human mitochondrial genome. Nature. 1981;290:457–465. doi: 10.1038/290457a0.
2. Andrews R.M., Kubacka I., Chinnery P.F., Lightowlers R.N., Turnbull D.M., Howell N. Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat. Genet. 1999;23:147. doi: 10.1038/13779.
3. Bandelt H.J., Kong Q.P., Richards M., Macaulay V. Estimation of mutation rates and coalescence times: some caveats. In: Bandelt H.J., Macaulay V., Richards M., editors. Human Mitochondrial DNA and the Evolution of Homo Sapiens. Springer-Verlag; Berlin, Germany: 2006. pp. 47–90.
4. van Oven M., Kayser M. Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum. Mutat. 2009;30:E386–E394. doi: 10.1002/humu.20921.
5. Bandelt H.J., Macaulay V., Richards M. Human Mitochondrial DNA and the Evolution of Homo Sapiens. Springer-Verlag; Berlin/Heidelberg: 2006.
6. Bandelt H.J., Achilli A., Kong Q.P., Salas A., Lutz-Bonengel S., Sun C., Zhang Y.P., Torroni A., Yao Y.G. Low penetrance of phylogenetic knowledge in mitochondrial disease studies. Biochem. Biophys. Res. Commun. 2005;333:122–130. doi: 10.1016/j.bbrc.2005.04.055.
7. Bandelt H.J., van Oven M., Salas A. Haplogrouping mitochondrial DNA sequences in legal medicine/forensic genetics. Int. J. Legal Med. 2012;126:901–916. doi: 10.1007/s00414-012-0762-y.
8. Salas A., Carracedo A., Macaulay V., Richards M., Bandelt H.J. A practical guide to mitochondrial DNA error prevention in clinical, forensic, and population genetics. Biochem. Biophys. Res. Commun. 2005;335:891–899. doi: 10.1016/j.bbrc.2005.07.161.
9. Fan L., Yao Y.G. MitoTool: a web server for the analysis and retrieval of human mitochondrial DNA sequence variations. Mitochondrion. 2011;11:351–356. doi: 10.1016/j.mito.2010.09.013.
10. Fan L., Yao Y.G. An update to MitoTool: using a new scoring system for faster mtDNA haplogroup determination. Mitochondrion. 2013;13:360–363. doi: 10.1016/j.mito.2013.04.011.
11. Rubino F., Piredda R., Calabrese F.M., Simone D., Lang M., Calabrese C., Petruzzella V., Tommaseo-Ponzetta M., Gasparre G., Attimonelli M. HmtDB, a genomic resource for mitochondrion-based human variability studies. Nucleic Acids Res. 2012;40:D1150–D1159. doi: 10.1093/nar/gkr1086.
12. Kloss-Brandstätter A., Pacher D., Schönherr S., Weissensteiner H., Binna R., Specht G., Kronenberg F. HaploGrep: a fast and reliable algorithm for automatic classification of mitochondrial DNA haplogroups. Hum. Mutat. 2011;32:25–32. doi: 10.1002/humu.21382.
13. Soares I., Amorim A., Goios A. mtDNAoffice: a software to assign human mtDNA macro haplogroups through automated analysis of the protein coding region. Mitochondrion. 2012;12:666–668. doi: 10.1016/j.mito.2012.08.003.
14. Lee H.Y., Song I., Ha E., Cho S.B., Yang W.I., Shin K.J. mtDNAmanager: a Web-based tool for the management and quality analysis of mitochondrial DNA control-region sequences. BMC Bioinformatics. 2008;9:483. doi: 10.1186/1471-2105-9-483.
15. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2012. ISBN 3-900051-07-0, http://www.R-project.org/
16. Behar D.M., van Oven M., Rosset S., Metspalu M., Loogvali E.L., Silva N.M., Kivisild T., Torroni A., Villems R.A. Copernican reassessment of the human mitochondrial DNA tree from its root. Am. J. Hum. Genet. 2012;90:675–684. doi: 10.1016/j.ajhg.2012.03.002.
17. Salas A., Coble M., Desmyter S., Grzybowski T., Gusmao L., Hohoff C., Holland M.M., Irwin J.A., Kupiec T., Lee H.Y., Ludes B., Lutz-Bonengel S., Melton T., Parsons T.J., Pfeiffer H., Prieto L., Tagliabracci A., Parson W. A cautionary note on switching mitochondrial DNA reference sequences in forensic genetics. Forensic Sci. Int. Genet. 2012;6:e182–e184. doi: 10.1016/j.fsigen.2012.06.015.
18. Herrnstadt C., Elson J.L., Fahy E., Preston G., Turnbull D.M., Anderson C., Ghosh S.S., Olefsky J.M., Beal M.F., Davis R.E., Howell N. Reduced-median-network analysis of complete mitochondrial DNA coding-region sequences for the major African, Asian, and European haplogroups. Am. J. Hum. Genet. 2002;70:1152–1171. doi: 10.1086/339933.
19. Gasparre G., Porcelli A.M., Bonora E., Pennisi L.F., Toller M., Iommarini L., Ghelli A., Moretti M., Betts C.M., Martinelli G.N., Ceroni A.R., Curcio F., Carelli V., Rugolo M., Tallini G., Romeo G. Disruptive mitochondrial DNA mutations in complex I subunits are markers of oncocytic phenotype in thyroid tumors. Proc. Natl. Acad. Sci. U.S.A. 2007;104:9001–9006. doi: 10.1073/pnas.0703056104.
20. Yao Y.G., Salas A., Logan I., Bandelt H.J. mtDNA data mining in GenBank needs surveying. Am. J. Hum. Genet. 2009;85:929–933. doi: 10.1016/j.ajhg.2009.10.023.
21. Gunnarsdottir E.D., Li M., Bauchet M., Finstermeier K., Stoneking M. High-throughput sequencing of complete human mtDNA genomes from the Philippines. Genome Res. 2011;21:1–11. doi: 10.1101/gr.107615.110.
22. Schönberg A., Theunert C., Li M., Stoneking M., Nasidze I. High-throughput sequencing of complete human mtDNA genomes from the Caucasus and West Asia: high diversity and demographic inferences. Eur. J. Hum. Genet. 2011;19:988–994. doi: 10.1038/ejhg.2011.62.
23. Gonder M.K., Mortensen H.M., Reed F.A., de Sousa A., Tishkoff S.A. Whole-mtDNA genome sequence analysis of ancient African lineages. Mol. Biol. Evol. 2007;24:757–768. doi: 10.1093/molbev/msl209.
24. Rajkumar R., Banerjee J., Gunturi H.B., Trivedi R., Kashyap V.K. Phylogeny and antiquity of M macrohaplogroup inferred from complete mt DNA sequence of Indian specific lineages. BMC Evol. Biol. 2005;5:26. doi: 10.1186/1471-2148-5-26.
25. Röck A., Irwin J., Dür A., Parsons T., Parson W. String-based sequence search algorithm for mitochondrial DNA database queries. Forensic Sci. Int. Genet. 2011;5:126–132. doi: 10.1016/j.fsigen.2010.10.006.
26. Bandelt H.J., Parson W. Consistent treatment of length variants in the human mtDNA control region: a reappraisal. Int. J. Legal Med. 2008;122:11–12. doi: 10.1007/s00414-006-0151-5.
27. Soares P., Ermini L., Thomson N., Mormina M., Rito R., Röhl A., Salas A., Oppenheimer S., Macaulay V., Richards M.B. Correcting for purifying selection: an improved human mitochondrial molecular clock. Am. J. Hum. Genet. 2009;84:740–759. doi: 10.1016/j.ajhg.2009.05.001.
28. Bobillo M.C., Zimmermann B., Sala A., Huber G., Röck A.W., Bandelt H.J., Corach D., Parson W. Amerindian mitochondrial DNA haplogroups predominate in the population of Argentina: towards a first nationwide forensic mitochondrial DNA sequence database. Int. J. Legal Med. 2010;124:263–268. doi: 10.1007/s00414-009-0366-3.
29. Rand S., Schurenkamp M., Hohoff C., Brinkmann B. The GEDNAP blind trial concept part II. Trends and developments. Int. J. Legal Med. 2004;118:83–89. doi: 10.1007/s00414-003-0421-4.
30. Castro de Guerra D., Figuera P.C., Bravi C.M., Saunier J., Scheible M., Irwin J., Coble M.D., Rodriguez-Larralde A. Sequence variation of mitochondrial DNA control region in North Central Venezuela. Forensic Sci. Int. Genet. 2012;6:e131–e133. doi: 10.1016/j.fsigen.2011.11.004.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

ESM1

29 mtGenomes rejected in the first step that were again added to the EMMA database.

mmc1.pdf (15.1KB, pdf)

ESM2

Summary of discernable control region haplogroup clusters (CR-HGs) based on 19,171 full control region haplotypes.

mmc2.pdf (23KB, pdf)

ESM3

Table of positional fluctuation rates.

mmc3.txt (146.8KB, txt)

ESM4

Example genomes from Phylotree Build 15 where the haplogroup estimate of EMMA differed from Phylotree's haplogroup assignment.

mmc4.pdf (16.4KB, pdf)

ESM5

Results obtained by EMMA for the examples discussed.

mmc5.txt (332.1KB, txt)


