Abstract
HvrBase++ is the improved and extended version of HvrBase. Extensions are made by adding more population-based sequence samples from all primates including humans. The current collection comprises 13 873 hypervariable region I (HVRI) sequences and 4940 hypervariable region II (HVRII) sequences. In addition, we included 1376 complete mitochondrial genomes, 205 sequences from X-chromosomal loci and 202 sequences from autosomal chromosomes 1, 8, 11 and 16. In order to reduce the introduction of erroneous data into HvrBase++, we have developed a procedure that monitors GenBank for new versions of the current data in HvrBase++ and automatically updates the collection if necessary. For the stored sequences, supplementary information such as geographic origin, population affiliation and language of the sequence donor can be retrieved. HvrBase++ is Oracle based and easily accessible by a web interface (http://www.hvrbase.org). As a new key feature, HvrBase++ provides an interactive graphical tool to easily access data from dynamically created geographical maps.
INTRODUCTION
HvrBase was originally started as a compilation of hypervariable region I (HVRI) and hypervariable region II (HVRII) mitochondrial sequences (1,2). These regions are situated in the non-coding mitochondrial control region and play an important role in population genetics (3,4). With some exceptions, mitochondrial DNA (mtDNA) follows a maternal clonal inheritance pattern without recombination (5–7). Population genetic analyses of mtDNA therefore make it possible to study the history of maternally inherited mitochondrial genomes. Furthermore, mtDNA variation correlates with the geographic origin of the population and has been linked to a wide range of degenerative diseases, preferentially affecting the central nervous system, heart, muscle, renal and endocrine systems; it is also widely used in forensic comparisons (8–10).
HvrBase++ focuses on aspects of population genetics and collects meta information, such as ethnic group and spoken language, for each individual. For this reason, sequences have only been included if a minimum of meta information was available. Moreover, meta information is linked to a geographical information system (GIS), which allows intuitive searches supported by geographical maps and provides additional information about countries. This interactive map search and the presence of meta information make HvrBase++ well suited as a database for population genetics analysis.
Among other databases, like MITOMAP (11) as a general resource for mtDNA-related data and the ‘mtDNA Population Database’ (12) for forensic studies, HvrBase++ contributes to the wide area of mtDNA analysis.
Mitochondrial data used in a phylogenetic analysis reflect, at best, matrilineal history. A closer look at nuclear DNA (nDNA) is therefore indispensable for answering questions about phylogenetic history in its entirety. Comparing the evolutionary pathways of the pyruvate dehydrogenase E1α subunit gene (PDHA1) and mtDNA, J. Hey showed that ‘variation at nuclear genes and mtDNA are not both consistent with a common demographic history’ (13).
While both hypervariable regions of mtDNA are commonly used for phylogenetic studies, no equivalent sequence markers exist when dealing with nuclear DNA. With HvrBase++, we introduce a set of ready-to-use nDNA sequence markers. Genes that code for the human immune defence, and non-coding regions around microsatellite DNA markers, are promising candidates for nDNA sequence markers owing to their mutation rate (3).
Moreover, the detection over the last decade of nuclear mitochondrial-like sequences, called ‘numts’, has cast doubt on whether mtDNA data have always been classified correctly (14–16). The concern arises because numt fragments can share up to 94% sequence similarity with mtDNA.
To meet these concerns, researchers have begun to incorporate nuclear markers in their studies. Accordingly, HvrBase++ now carries nuclear markers as well.
Compilation of sequences
In HvrBase++ the term ‘sequence’ represents a piece of DNA from one individual. A ‘lineage’ in contrast means a piece of DNA from possibly different individuals, which share the same nucleotide sequence. Meta information about the sampled individuals was collected from publications (supplemented information). If different sequence sources were available, they have been chosen in the following order: (i) public databases like GenBank (17), (ii) supplemental data from publications, (iii) data manually extracted from publications and (iv) data requested from authors.
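The distinction between sequences and lineages can be illustrated with a minimal sketch. It assumes that sequences are already aligned and compared as plain strings; the function name, the individual identifiers and the toy data are hypothetical and are not taken from HvrBase++.

```python
from collections import defaultdict


def collapse_into_lineages(sequences):
    """Group individual sequences that share the same nucleotide string.

    `sequences` maps a (hypothetical) individual identifier to its
    nucleotide sequence. The result maps each distinct sequence to the
    sorted list of individuals carrying it, i.e. one entry per lineage.
    """
    lineages = defaultdict(list)
    for individual_id, nucleotides in sequences.items():
        # Compare case-insensitively so 'acgt' and 'ACGT' form one lineage.
        lineages[nucleotides.upper()].append(individual_id)
    return {seq: sorted(ids) for seq, ids in lineages.items()}


# Made-up toy data for illustration only.
samples = {
    "ind1": "ACGTTA",
    "ind2": "ACGTTA",
    "ind3": "ACGCTA",
    "ind4": "acgtta",
}
lineages = collapse_into_lineages(samples)
print(len(samples), "sequences,", len(lineages), "lineages")
```

In this toy example four sequences collapse into two lineages, which mirrors the relation between the ‘Human samples’ and ‘Lineages’ columns reported for the HVRI dataset.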
After collecting and extracting sequences and meta information for a gene or region, a global nucleotide alignment was created. For the HVRI and HVRII regions, HvrBase++ carries a manual alignment (2) and an alignment generated with MAFFT (18). For complete mitochondrial genomes and nuclear sequences, automatically calculated alignments can be obtained from HvrBase++. A procedure checks GenBank for sequence alterations and updates the data in the next release. Every update step is logged in our database system and can be traced via the HvrBase++ web interface.
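The text does not describe how the update procedure is implemented. As a rough sketch of one possible approach, the following assumes that each stored record keeps its GenBank identifier in the usual ‘accession.version’ form and that a newer version number signals an altered sequence. The function names and the accession strings are hypothetical.

```python
def parse_accession_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version as int)."""
    accession, _, version = identifier.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return accession, int(version)


def find_outdated(stored, current):
    """Return {accession: (stored_version, current_version)} for every
    accession whose current version is newer than the stored one."""
    stored_versions = dict(parse_accession_version(s) for s in stored)
    outdated = {}
    for identifier in current:
        accession, version = parse_accession_version(identifier)
        if accession in stored_versions and version > stored_versions[accession]:
            outdated[accession] = (stored_versions[accession], version)
    return outdated


# Made-up identifiers; a real check would obtain `current` from GenBank.
stored_ids = ["XX000001.1", "XX000002.1"]
current_ids = ["XX000001.2", "XX000002.1"]
updates = find_outdated(stored_ids, current_ids)
print(updates)
```

Each entry of the returned dictionary corresponds to one update step that could then be logged and applied in the next release.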
Sequences enter HvrBase++ if meta information can be retrieved from the corresponding publication. Meta information must be attributable to each sequence in the paper and cover: (i) geographic origin, (ii) population, (iii) spoken language and (iv) bibliographic information. Owing to these filtering criteria, some data from publications and most of the forensic data, for example the comprehensive forensic dataset from the ‘mtDNA Population Database’ (12), could not be integrated into HvrBase++.
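A filter of this kind can be sketched as follows. The field names are hypothetical, and the exact minimum of meta information required for inclusion is not specified in the text; because Table 1 lists individuals without assignable language information, the sketch treats the language field as optional by default. This is an assumption, not the curated rule.

```python
META_FIELDS = ("geographic_origin", "population", "language", "reference")


def missing_meta(record, fields=META_FIELDS):
    """Return the meta fields that are absent or blank in `record`."""
    return [f for f in fields if not str(record.get(f) or "").strip()]


def accept(record, required=("geographic_origin", "population", "reference")):
    """Accept a record only if every required field is filled in.

    The default `required` tuple is an assumption made for this sketch.
    """
    return not missing_meta(record, required)


# Made-up records for illustration only.
records = [
    {"id": "s1", "geographic_origin": "Country A", "population": "Group A",
     "language": "Language A", "reference": "Publication 1"},
    {"id": "s2", "geographic_origin": "Country B", "population": "Group B",
     "language": "", "reference": "Publication 2"},
    {"id": "s3", "geographic_origin": "", "population": None,
     "language": "", "reference": "Publication 3"},
]
accepted = [r["id"] for r in records if accept(r)]
print(accepted)
```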
Since there is no unique way to gain the above named meta information either from publications or from sequence files, it is difficult to build a fully automated tool that identifies meta information that is located in different resources.
Synonyms and context-dependent meanings of a word pose a further challenge: associations that are easy for humans to make are hard for computers. For this reason, HvrBase++ categorizes ambiguous data to facilitate a broad range of complex search patterns. Bibliographic information, such as authors, publication date, journal and PubMed identifier, is standardized. Each country is assigned to exactly one continent; in HvrBase++, for example, Turkey is assigned to Asia, and the Canary Islands are treated as part of the sovereign territory of Spain.
All 258 language entries in HvrBase++ have been adapted to comply with the SIL (Summer Institute of Linguistics) and ISO/DIS 639-2 language code standards, taken from Ethnologue vol. 14 (19). To avoid information loss and to compensate for the incompleteness of either standard, it was necessary to integrate both language codes (Table 1). The following example illustrates the difficulty of associating the mother tongue of an individual, as deduced from a publication, with the SIL or ISO language codes.
Table 1.
SIL | ISO | No. of individuals | Language family or population |
---|---|---|---|
Yes | Yes | 7248 | English |
Yes | No | 41 | Mandenka (population from Senegal, ‘Mandinka’ in SIL) |
No | Yes | 1951 | Bantu (Africa's largest language family) |
No | No | 454 | Mbenzele (population from Central African Republic) |
– | – | 4611 | Language information missing or not assignable |
Total | | 14 305 | |
This year, the SIL and ISO/DIS 639-2 codes have converged. We will account for them in the next major release.
Suppose it is known only that a certain language belongs to the Niger-Kordofanian language family. Niger-Kordofanian is a collective language code used only in the ISO standard; its languages are found throughout Southern and Central Africa as well as in Sub-Saharan Western Africa. Since that language family does not have a SIL code, more in-depth knowledge about the specific language (e.g. language name and habitation of a tribe) would be essential to find a suitable SIL code.
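The four combinations of code availability in Table 1 can be represented by storing both codes as optional values. The sketch below is only an illustration of that idea: the class name is hypothetical, and the code strings are placeholders rather than real SIL or ISO codes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class LanguageEntry:
    name: str
    sil: Optional[str]  # placeholder strings, not real SIL codes
    iso: Optional[str]  # placeholder strings, not real ISO 639-2 codes


def coverage(entry):
    """Classify an entry by which of the two code systems covers it."""
    return ("Yes" if entry.sil else "No", "Yes" if entry.iso else "No")


entries = [
    LanguageEntry("English", "SIL_PLACEHOLDER_1", "ISO_PLACEHOLDER_1"),
    LanguageEntry("Mandenka", "SIL_PLACEHOLDER_2", None),
    LanguageEntry("Bantu", None, "ISO_PLACEHOLDER_2"),
    LanguageEntry("Mbenzele", None, None),
]
# Count entries per (SIL, ISO) availability class, as in Table 1.
summary = Counter(coverage(e) for e in entries)
print(summary)
```

Keeping both fields optional avoids discarding an entry merely because one of the two standards lacks a matching code.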
Technical organization
HvrBase++ is managed in an Oracle 10g relational database system. Sequence data and accompanying information are extracted and stored in HvrBase++ via Perl programs that use object-oriented modules from the BioPerl-Project (20) and the Perl DBI module.
The web client is based on the Apache web server technology. For the geographical interface, a map server (MapServer version 4.6 from the University of Minnesota) is integrated into the web client using geographical maps from publicly available resources.
UMN MapServer provides the core functionality of a GIS system for an intuitive data access from dynamically created geographical maps.
Description of the compilation
The HvrBase++ database now comprises not only HVRI and HVRII sequences but also mitochondrial genomes and nuclear sequences from several chromosomal loci. Not surprisingly, human sequences are overrepresented, with a total of 20 037 sequences (Table 2). Table 3 displays an excerpt from the human HVRI dataset, gathered from 103 publications, which encompasses sequences from 89 countries and 220 ethnic groups.
Table 2.
Sequence type (number of sequences) | Humans | Great Apes | Neanderthals |
---|---|---|---|
HVRI | 13 350 | 520 | 3 |
HVRII | 4925 | 13 | 2 |
Mitochondrial genomes | 1376 | 0 | 0 |
Nuclear sequences | 386 | 21 | 0 |
Total | 20 037 | 554 | 5 |
Table 3.
Continent | Lineages | Human samples | No. of countries | No. of references | No. of populations | No. of languages | No. of SIL codes | No. of ISO codes |
---|---|---|---|---|---|---|---|---|
Europe | 2033 | 4358 | 17 | 39 | 25 | 31 | 20 | 16 |
Africa | 1046 | 1680 | 25 | 22 | 47 | 47 | 22 | 25 |
North America | 824 | 1581 | 7 | 19 | 34 | 9 | 9 | 8 |
South America | 267 | 473 | 7 | 10 | 11 | 19 | 7 | 7 |
Asia | 2867 | 4778 | 23 | 49 | 102 | 67 | 31 | 47 |
Australia/Oceania | 224 | 473 | 10 | 10 | 12 | 28 | 9 | 16 |
World | 7036 | 13 343 | 89 | 103 | 220 | 194 | 81 | 118 |
Note that the last row does not depict the arithmetic sum in columns 2, 5–9 as some relevant subsets overlap across continents.
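The note can be checked directly against the figures in Table 3. The following short sketch sums each column over the six continents and compares the result with the ‘World’ row; the variable names are ours, the numbers are those of the table.

```python
COLUMNS = ["Lineages", "Human samples", "Countries", "References",
           "Populations", "Languages", "SIL", "ISO"]

# Rows of Table 3, in the column order given above.
CONTINENTS = {
    "Europe":            [2033, 4358, 17, 39, 25, 31, 20, 16],
    "Africa":            [1046, 1680, 25, 22, 47, 47, 22, 25],
    "North America":     [824, 1581, 7, 19, 34, 9, 9, 8],
    "South America":     [267, 473, 7, 10, 11, 19, 7, 7],
    "Asia":              [2867, 4778, 23, 49, 102, 67, 31, 47],
    "Australia/Oceania": [224, 473, 10, 10, 12, 28, 9, 16],
}
WORLD = [7036, 13343, 89, 103, 220, 194, 81, 118]

# Column-wise sums over all continents.
sums = [sum(column) for column in zip(*CONTINENTS.values())]

# Columns whose continental values add up exactly to the 'World' entry.
additive = [name for name, s, w in zip(COLUMNS, sums, WORLD) if s == w]
overlapping = [name for name in COLUMNS if name not in additive]
print("additive:", additive)
print("overlapping:", overlapping)
```

Only the human samples and the countries add up to the world total; in every other column the continental sum exceeds the world value, as expected when lineages, references, populations or languages are shared between continents.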
Table 4 describes the 10 loci for the 407 nuclear sequences. The number of nuclear markers in HvrBase++ is currently modest because the compilation is at an early stage. We are confident that sequencing and analyzing nuclear genes will become increasingly important for studies in population genetics, owing to possibly contradictory histories of nuclear genes and mtDNA (3,21). Figure 1 shows the increase in HVRI/II, mitochondrial genome and nuclear sequences across all available publications within the past 25 years; it can be assumed that this upward trend will continue.
Table 4.
Amount | Gene | Gene function | Chromosome | Length in bp |
---|---|---|---|---|
8 | pdh1 | Pyruvate dehydrogenase E1-α subunit gene, partial seq. | X | 1769 |
41 | Factor ix | Factor IX gene, intron 4 | X | 3740 |
42 | rrm2p4 | Ribonucleotide reductase M2 pseudogene 4, partial seq. | X | 2392 |
42 | tnfsf5 | Tumor necrosis factor ligand superfamily 5 gene, partial seq. | X | 5239 |
1 | amelx | Amelogenin X chromosome gene, complete seq. | X | 5323 |
71 | xq13.3 | Xq13.3 non-coding region | X | 10 178 |
56 | mc1r promoter | Melanocortin 1 receptor gene, promoter | 16 | 6599 |
1 | mc1r | Melanocortin 1 receptor gene | 16 | 953 |
8 | lpl | Lipoprotein lipase gene, partial seq. | 8 | 542–1636 |
61 | ch1 | Membrane protein CH1 gene, partial seq. | 1 | 9626 |
59 | β-globin | β-globin gene, complete seq. | 11 | 3008 |
17 | β-globin repl. init. reg. | β-globin gene, repl. ori. init. reg. and partial seq. | 11 | 1312 |
User interface
The new geographic map search is the centrepiece of the web interface; it provides an intuitive search method and presents the results in a clearly structured form. For more systematic searches, the well-tried form-based search function from HvrBase is recommended. Supported sequence output formats are Phylip, GenBank, XML and simple text files. In combination, the form-based and map searches make it possible to find any kind of sequence available from a sampled individual.
Figure 2 shows the geographic map functionality in HvrBase++. It is possible to search for all genes in countries and continents. A more sophisticated search can be obtained by specifying populations and language codes. Sequence patterns can be detected within genes for a whole sequence or a given range. Moreover, regular expressions allow for complex motif searches. Each country (or continent) is pictured in the world map and colour-coded, depending on the number of sequences from the respective country. More detailed information is displayed at the bottom of the world map after choosing a country from the map.
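A motif search with regular expressions over a whole sequence or a given range can be sketched as follows. The regular expression dialect used by the HvrBase++ web interface is not stated here; the sketch uses Python's `re` module, and the function name, the toy sequence and the 1-based inclusive coordinates are assumptions made for illustration.

```python
import re


def find_motif(sequence, pattern, start=1, end=None):
    """Return (position, match) pairs for `pattern` in `sequence`.

    Positions are 1-based. The search is restricted to the inclusive
    range [start, end]; by default the whole sequence is searched.
    Matches are non-overlapping, as returned by `re.finditer`.
    """
    end = len(sequence) if end is None else end
    window = sequence[start - 1:end]
    regex = re.compile(pattern, re.IGNORECASE)
    return [(m.start() + start, m.group()) for m in regex.finditer(window)]


# Made-up toy sequence: look for runs of at least four cytosines.
toy = "TTACCCCCTAACCCCTA"
whole = find_motif(toy, "C{4,}")
partial = find_motif(toy, "C{4,}", start=10, end=17)
print(whole)
print(partial)
```

Restricting the window before matching means that a motif straddling the range boundary is not reported, which is one of several reasonable conventions.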
Quality and completeness of the data and future directions
Although HvrBase++ represents a large compilation of HVRI and HVRII sequences, completeness cannot be claimed. The collection of mitochondrial genomes and nuclear genes will be extended, and gaps will have to be closed in future releases.
Therefore, we invite everyone to submit new sequences and the corresponding information by electronic mail. We would also be grateful to receive already published sequences that are missing from our collection.
This database gives easy access to freely available sequences without altering them in any way. That is, we have not checked the data for typos or any other kind of sequence error that might have occurred between acquisition and publication (22–24). It is not our intention to correct putative errors in other publications and thereby create yet another dataset, since using sequences from two differing sources in comparative analyses could cause confusion.
We recommend that colleagues check their datasets carefully and follow the instructions proposed by Bandelt et al. (25) to detect suspicious sequence positions.
New sequence versions in GenBank are detected automatically and continuously incorporated into HvrBase++. Since there is no common way to update sequences from non-public databases, we update these manually. As we aim for high data quality, we welcome any hints regarding programming bugs, misinterpretations or other discrepancies.
Acknowledgments
We thank all colleagues who have provided their sequence data in computer-readable format and have given us additional information when needed. Funding to pay the Open Access publication charges for this article was provided by the Deutsche Forschungsgemeinschaft (DFG).
Conflict of interest statement. None declared.
REFERENCES
- 1. Handt O., Meyer S., von Haeseler A. Compilation of human mtDNA control region sequences. Nucleic Acids Res. 1998;26:126–129. doi: 10.1093/nar/26.1.126.
- 2. Burckhardt F., von Haeseler A., Meyer S. HvrBase: compilation of mtDNA control region sequences from primates. Nucleic Acids Res. 1999;27:138–142. doi: 10.1093/nar/27.1.138.
- 3. Zhang D.X., Hewitt G.M. Nuclear DNA analyses in genetic studies of populations: practice, problems and prospects. Mol. Ecol. 2003;12:563–584. doi: 10.1046/j.1365-294x.2003.01773.x.
- 4. Avise J.C. The history and purview of phylogeography: a personal reflection. Mol. Ecol. 1998;7:371–379.
- 5. Kondo R., Satta Y., Matsuura E.T., Ishiwa H., Takahata N., Chigusa S.I. Incomplete maternal transmission of mitochondrial DNA in Drosophila. Genetics. 1990;126:657–663. doi: 10.1093/genetics/126.3.657.
- 6. Gyllensten U., Wharton D., Josefsson A., Wilson A.C. Paternal inheritance of mitochondrial DNA in mice. Nature. 1991;352:255–257. doi: 10.1038/352255a0.
- 7. Skibinski D.O.F., Gallagher C., Beynon C.M. Mitochondrial DNA inheritance. Nature. 1994;368:817–818. doi: 10.1038/368817b0.
- 8. Coskun P.E., Beal M.F., Wallace D.C. Alzheimer's brains harbor somatic mtDNA control-region mutations that suppress mitochondrial transcription and replication. Proc. Natl Acad. Sci. USA. 2004;101:10726–10731. doi: 10.1073/pnas.0403649101.
- 9. Sukernik R.I., Derbeneva O.A., Starikovskaya E.B., Volodko N.V., Mikhailovskaya I.E., Buychov I.Yu., Lott M., Brown M., Wallace D. The mitochondrial genome and human mitochondrial diseases. Russ. J. Genet. 2002;38:161–170.
- 10. Budowle B., Allard M.W., Wilson M.R., Chakraborty R. Forensics and mitochondrial DNA: applications, debates, and foundations. Annu. Rev. Genomics Hum. Genet. 2003;4:119–141. doi: 10.1146/annurev.genom.4.070802.110352.
- 11. Brandon M.C., Lott M.T., Nguyen K.C., Spolim S., Navathe S.B., Baldi P., Wallace D.C. MITOMAP: a human mitochondrial genome database – 2004 update. Nucleic Acids Res. 2004;33:D611–D613. doi: 10.1093/nar/gki079.
- 12. Monson K.L., Miller K.W.P., Wilson M.R., DiZinno J.A., Budowle B. The mtDNA population database: an integrated software and database resource for forensic comparison. Forensic Sci. Commun. 2002;4. Available at http://www.fbi.gov/hqlab/fsc/backissu/april2002/miller1.htm.
- 13. Hey J. Mitochondrial and nuclear genes present conflicting portraits of human origins. Mol. Biol. Evol. 1997;14:166–172. doi: 10.1093/oxfordjournals.molbev.a025749.
- 14. Mishmar D., Ruiz-Pesini E., Brandon M., Wallace D.C. Mitochondrial DNA-like sequences in the nucleus (NUMTs): insights into our African origins and the mechanism of foreign DNA integration. Hum. Mutat. 2004;23:125–133. doi: 10.1002/humu.10304.
- 15. Bensasson D., Zhang D.X., Hartl D.L., Hewitt G.M. Mitochondrial pseudogenes: evolution's misplaced witnesses. Trends Ecol. Evol. 2001;16:314–321. doi: 10.1016/s0169-5347(01)02151-6.
- 16. Thalmann O., Hebler J., Poinar H.N., Pääbo S., Vigilant L. Unreliable mtDNA data due to nuclear insertions: a cautionary tale from analysis of humans and other great apes. Mol. Ecol. 2004;13:321–335. doi: 10.1046/j.1365-294x.2003.02070.x.
- 17. Benson D.A., Karsch-Mizrachi I., Lipman D.J., Ostell J., Wheeler D.L. GenBank: update. Nucleic Acids Res. 2004;32:D23–D26. doi: 10.1093/nar/gkh045.
- 18. Katoh K., Kuma K., Toh H., Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005;33:511–518. doi: 10.1093/nar/gki198.
- 19. Grimes B.F. In: Grimes B.F., editor. 2000. Ethnologue: Volume 1 Languages of the World, (14th Edn) ISBN 1-55671-103-4.
- 20. Stajich J.E., Block D., Boulez K., Brenner S.E., Chervitz S.A., Dagdigian C., Fuellen G., Gilbert J.G.R., Korf I., Lapp H., et al. The Bioperl Toolkit: Perl modules for the life sciences. Genome Res. 2002;12:1611–1618. doi: 10.1101/gr.361602.
- 21. Petit R.J., Duminil J., Fineschi S., Hampe A., Salvini D., Vendramin G.G. Comparative organization of chloroplast, mitochondrial and nuclear diversity in plant populations. Mol. Ecol. 2005;14:689–701. doi: 10.1111/j.1365-294X.2004.02410.x.
- 22. Bandelt H.-J., Quintana-Murci L.L., Salas A., Macaulay V. The fingerprint of phantom mutations in mitochondrial DNA data. Am. J. Hum. Genet. 2002;71:1150–1160. doi: 10.1086/344397.
- 23. Herrnstadt C., Preston G., Howell N. Errors, phantom and otherwise, in human mtDNA sequences. Am. J. Hum. Genet. 2003;72:1585–1586. doi: 10.1086/375406.
- 24. Forster P. To err is human. Ann. Hum. Genet. 2003;67:2–4. doi: 10.1046/j.1469-1809.2003.00002.x.
- 25. Bandelt H.-J., Lahermo P., Richards M., Macaulay V. Detecting errors in mtDNA data by phylogenetic analysis. Int. J. Legal Med. 2001;115:64–69. doi: 10.1007/s004140100228.