Journal of Biomolecular Techniques : JBT. 2003 Mar;14(1):9–16.

Simple Tests to Detect Errors in High-Throughput Genotype Data in the Molecular Laboratory

David J Vandenbergh a,b,c, Kathrine Heron a, Ryan Peterson a, Karl B Shpargel a, Abigail Woodroffe a, David A Blizard a, Gerald E McClearn a,b, George P Vogler a,b
PMCID: PMC2279894  PMID: 12901607

Abstract

With the advent of high-density DNA marker data sets for the mouse and other model systems, 100 or more genotypes are routinely generated from large groups of mice. Issues of the accuracy and reliability of the genotyping are extremely important but often not addressed until genetic analysis is conducted. Simple tests that rely on the robust predictions arising from Mendelian genetics can be made quickly in the molecular laboratory as the data are generated, and require only a spreadsheet program. In this report, genotype data from 392 mice tested at 96 marker sites were analyzed for errors that are typical when handling large volumes of data generated in a repetitive process. The testing consisted of: (1) repeating the genotyping of approximately 1% of the samples; (2) examining the deviation from the expected segregation ratio (1:2:1) on a marker-by-marker basis; and (3) testing the correlation of the genotype at one marker with that at neighboring genetic markers on a chromosome. These three steps allowed analysis at the level of the microtiter plate, where errors are most likely to occur. A set of 96 dinucleotide repeat markers that are polymorphic between the C57BL/6J and DBA/2J mouse strains and can be multiplexed is reported for use in other genotyping projects.

Keywords: Hardy–Weinberg equilibrium, inbred lines, C57, DBA, multiplex marker set


Significant theoretical work is available in the literature on the problems created by mistyping or misclassification of genotypes for genetic analysis of both simple and complex traits. These efforts have generated analytical tools to estimate the effects of typing error on subsequent linkage analysis.1–5 Error filters can then be applied to the genotype data to minimize the interference with detecting linkage. However, these methods do not address errors in the data that can be corrected prior to linkage analysis to increase the accuracy of the genotypes obtained.

Genetic analysis based on the results of Mendel6 provides a framework with which to analyze genotypes when individuals in a large sample group are tested at multiple loci. Predictions based on Hardy–Weinberg equilibrium allow for a convenient method to compare genotype frequencies with the underlying allele frequencies7,8 but can become difficult in populations in which breeding is not controlled by the experimenter. In the case of crosses between two strains of inbred mice, Hardy–Weinberg equilibrium is at its simplest (1:2:1 ratio) because allele frequencies are expected to be 0.5 for each genetic marker. We have taken advantage of these laws to describe simple methods for testing the reliability of genotype data in mouse studies prior to genetic analysis. The work of Sturtevant demonstrated that the correlation values between a reference marker and a test marker would diminish with increasing map distance.9 The combination of these tests provides a practical method for examining data as they are being generated.
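As a brief worked illustration of the two expectations used here (not taken from the original report): in an F2 intercross both alleles have frequency 1/2, so the Hardy–Weinberg proportions reduce to 1:2:1. The second line assumes that genotypes are scored as the number of D alleles (0, 1, or 2) and writes r for the recombination fraction between two markers; the scoring scheme and the symbols p, q, g, and r are assumptions made for this sketch.

```latex
\begin{align}
p = q = \tfrac{1}{2} \;\Rightarrow\; p^2 : 2pq : q^2 &= \tfrac{1}{4} : \tfrac{1}{2} : \tfrac{1}{4} = 1 : 2 : 1 \\
\operatorname{corr}(g_1, g_2) &= 1 - 2r, \qquad 0 \le r \le \tfrac{1}{2}
\end{align}
```

For unlinked markers (r = 1/2) the expected correlation is zero, which is consistent with correlation values diminishing as map distance increases.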

METHODS

In connection with a study of age-related processes, genetic analysis was conducted on 392 F2 mice generated from a cross of C57BL/6J and DBA/2J parental strains. A small snip (approximately 2 mm) of the tail of each mouse was taken at weaning for genetic analysis. DNA was extracted by standard lysis with proteinase K digestion followed by phenol/Sevag extraction and ethanol precipitation.10 A portion of the purified DNA was diluted to 10 ng/μL into 96-well plates. Genotyping was carried out using markers from the MIT collection (http://www-genome.wi.mit.edu). Primers were purchased from Research Genetics, Inc. (Huntsville, AL), and the forward primer was fluorescently labeled for allele detection. See Table 1 for the markers used.

TABLE 1.

Markers for Multiplexed Genotyping to Distinguish C57BL/6J and DBA/2J Mouse Alleles (a)

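Columns, in order: Marker (b); cM (c); Sep (d); B6 allele (e), Obs (bp) and Publ (bp); DBA allele (e), Obs (bp) and Publ (bp); Dye label (f). The first marker on each chromosome has no Sep value. Each "With" row gives the paired chromosome (b) followed by the average separation for the chromosome (d).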
D1Mit68 9 176 172 180 176 HEX
D1Mit236 25.7 16.7 149 143 133 136 HEX
D1Mit46 43.1 17.4 254 256 259 264 HEX
D1Mit87 62.1 19 201 204 197 198 TET
D1Mit16 87.2 25.1 183 186 196 197 FAM
D1Mit221 102 14.8 114 119 120 125 TET
With 16 18.6
D2Mit149 7 199 195 201 197 HEX
D2Mit323 31.7 24.7 126 126 124 128 TET
D2Mit300 50.3 18.6 101 108 99 106 FAM
D2Mit304 73 22.7 118 120 196 186 FAM
D2Mit343 84.2 11.2 149 183 141 173 FAM
D2Mit266 109 24.8 126 130 135 138 HEX
With 7 20.4
D3Mit264 2.4 103 109 101 107 FAM
D3Mit46 13.8 11.4 160 166 155 160 HEX
D3Mit25 29.5 15.7 130 134 124 127 FAM
D3Mit29 45.2 15.7 145 150 182 184 FAM
D3Mit42 58.8 13.6 150 154 142 146 TET
D3Mit194 67.6 8.8 144 146 139 142 HEX
D3Mit59 84.1 16.5 203 208 201 206 FAM
With 18 13.6
D4Mit104 1.9 244 243 242 241 HEX
D4Mit214 17.9 16 121 126 137 152 HEX
D4Mit255 48.5 30.6 251 238 244 234 TET
D4Mit204 61.9 13.4 104 104 82 82 FAM
D4Mit190 79 17.1 143 145 150 151 FAM
With 13 19.3
D5Mit227 9 97 92 102 102 FAM
D5Mit81 28 19 206 210 192 194 TET
D5Mit10 54 26 189 196 197 203 FAM
D5Mit137 68 14 149 152 144 148 HEX
D5Mit167 78 10 120 121 116 125 HEX
With X 17.3
D6Mit116 5.5 117 123 110 114 FAM
D6Mit188 32.5 27 127 130 156 155 HEX
D6Mit149 46.3 13.8 199 189 208 199 TET
D6Mit111 63.7 17.4 139 146 142 148 FAM
D6Mit15 74 10.3 255 260 195 195 FAM
With 12 17.1
D7Mit76 3.4 226 228 223 226 FAM
D7Mit69 24.5 21.1 237 236 239 238 HEX
D7Mit253 52.8 28.3 81.4 90 83 92 TET
D7Mit12 66 13.2 199 197 206 206 TET
With 2 20.9
D8Mit281 11 122 123 115 117 FAM
D8Mit191 21 10 138 144 127 132 FAM
D8Mit249 37 16 145 148 187 172 HEX
D8Mit211 49 12 151 154 163 166 TET
D8Mit215 59 10 174 182 168 176 FAM
With 10 12
D9Mit64 7 187 190 183 184 FAM
D9Mit229 28 21 122 122 135 142 HEX
D9Mit196 48 20 143 148 150 156 FAM
D9Mit212 61 13 99 108 106 118 FAM
D9Mit52 72 11 170 172 174 176 HEX
With 19 16.3
D10Mit212 9 117 123 125 131 HEX
D10Mit36 29 20 140 146 142 148 FAM
D10Mit117 48 19 120 142 122 126 TET
D10Mit95 51 3 196 201 176 179 TET
D10Mit14 65 14 191 192 185 182 TET
With 8 14
D11Mit227 2 165 170 157 166 HEX
D11Mit86 28 26 119 126 126 134 HEX
D11Mit36 47.6 19.6 237 234 219 220 FAM
D11Mit99 59.5 11.9 121 124 105 108 TET
D11Mit291 70 10.5 92 94 95 96 TET
With 17 17
D12Mit190 28 122 128 109 114 TET
D12Mit194 45 17 104 108 96 100 HEX
D12Mit8 58 13 167 168 177 179 HEX
With 6 15
D13Mit198 16 132 136 134 140 TET
D13Mit125 44 28 188 191 183 185 FAM
D13Mit230 62 18 150 154 147 151 TET
D13Mit35 75 13 194 190 180 182 HEX
With 4 19.7
D14Mit11 0.7 151 152 159 158 FAM
D14Mit122 21.5 20.8 141 144 127 132 HEX
D14Mit203 28.3 6.8 151 158 169 176 HEX
D14Mit194 44.4 16.1 91 89 77 81 TET
D14Mit75 54 9.6 177 178 178 190 TET
With 15 13.3
D15Mit179 10.8 140 148 149 152 TET
D15Mit183 23 12.2 125 129 123 127 FAM
D15Mit105 47.9 24.9 123 125 110 113 HEX
D15Mit39 56.6 8.7 130 130 123 124 TET
D15Mit35 61.7 5.1 116 142 110 136 FAM
With 14 12.7
D16Mit131 4.3 141 144 181 180 TET
D16Mit4 27.3 23 128 132 118 123 HEX
D16Mit191 57.8 30.5 116 122 118 124 FAM
D16Mit71 70.7 12.9 156 159 162 163 FAM
With 1 22.1
D17Mit16 18.2 117 118 104 106 FAM
D17Mit177 24 5.8 107 110 113 116 HEX
D17Mit93 44.5 20.5 155 156 170 168 TET
D17Mit123 56.7 12.2 128 133 149 155 FAM
With 11 12.8
D18Mit60 16 203 186 211 194 TET
D18Mit123 31 15 117 116 123 122 HEX
D18Mit142 47 16 115 121 127 133 TET
D18Mit144 57 10 177 180 173 177 HEX
With 3 13.7
D19Mit68 6 131 136 127 132 FAM
D19Mit40 25 19 106 112 100 106 TET
D19Mit8 (g) 47 22 196 178 164 168 TET
D19Mit71 54 7 132 137 128 135 TET
With 9 16
DXMit89 3 146 149 153 157 FAM
DXMit166 15.5 12.5 113 114 126 126 FAM
DXMit95 43 27.5 139 151 133 137 TET
DXMit79 50.5 7.5 138 139 136 137 HEX
DXMit135 69 18.5 113 118 115 122 TET
With 5 16.5

(a) This table, as well as a template file to help in error detection, is also available on the World Wide Web at http://www.cdhg.psu.edu/GeneticMarkerAnalysis.

(b) Name of polymorphism used from Research Genetics (http://www.resgen.com). The markers are listed numerically by chromosome (the number following the “D” in each name). To make multiplex reactions, markers from each chromosome were paired with a second chromosome as indicated at the end of each chromosome set (i.e., chromosome 1 paired with 16, 2 with 7, etc.).

(c) CentiMorgan (cM) position on the chromosome based on consensus committee map (http://www.informatics.jax.org). (See reference 11 for detailed information about the database.)

(d) Distance in cM from previous marker. The average separation of markers on a chromosome is shown at the end of each chromosome list, after the number of the paired chromosome.

(e) Allele sizes in base pairs (bp) from C57BL/6J (B6) or DBA/2J (DBA) mice. Obs, observed; Publ, published, from the MIT website (http://www-genome.wi.mit.edu/cgi-bin/mouse/index).

(f) Dye labels: Three fluorescent dyes (FAM, HEX, and TET) from Applied Biosystems, Inc., were used to label forward primers. Primers were pooled, keeping dyes in separate multiplex PCR reactions.

(g) Marker D19Mit8 has been found not to be located on chromosome 19 using R/qtl software. This problem may be limited to the batch of primers supplied.
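The Sep column and the per-chromosome averages in Table 1 can be recomputed from the cM positions, which gives a quick check on the table itself. The following is a minimal sketch using the chromosome 1 positions; the function names are illustrative and not part of the published template file.

```python
def separations(positions_cm):
    """Return the distance in cM from each marker to the previous one."""
    return [round(b - a, 1) for a, b in zip(positions_cm, positions_cm[1:])]


def mean_separation(positions_cm):
    """Average marker-to-marker separation in cM, rounded to one decimal."""
    seps = separations(positions_cm)
    return round(sum(seps) / len(seps), 1)


# cM positions of the six chromosome 1 markers listed in Table 1
chr1_cm = [9, 25.7, 43.1, 62.1, 87.2, 102]
print(separations(chr1_cm))      # compare with the Sep column for chromosome 1
print(mean_separation(chr1_cm))  # compare with the average in the "With 16" row
```

For chromosome 1 this reproduces the tabulated separations (16.7, 17.4, 19, 25.1, 14.8) and the average of 18.6 cM.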

The polymerase chain reaction (PCR) was carried out in a total volume of 10 μL consisting of 10 ng of the template DNA, 2.5 mM MgCl2, 10 mM dNTPs, 0.04 mM spermidine, 0.5 U AmpliTaq Gold DNA polymerase (Applied Biosystems Inc., Foster City, CA), and the buffer supplied with the polymerase. Following a denaturation at 95°C for 2 min, 35 cycles of PCR were carried out (45 s at 95°C, 45 s at 59°C, 60 s at 72°C). The samples were electrophoresed on an ABI 310 Genetic Analyzer (Applied Biosystems). Allele fragment sizes were determined by GeneScan software (Applied Biosystems). These sizes were converted to allele calls (either B or D for C57BL/6J or DBA/2J, respectively) with ABI Genotyper software (Applied Biosystems) and exported into an Excel spreadsheet (Microsoft Corporation, Redmond, WA).
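To make the size-to-allele conversion concrete, here is a minimal sketch of how a fragment size could be mapped to a B or D call. It is not the Genotyper algorithm: the tolerance `tol_bp`, the function names, and the genotype labels BB, BD, and DD are assumptions chosen for illustration. The expected sizes are the observed values for two TET-labeled chromosome 10 markers in Table 1.

```python
# Observed allele sizes (bp) from Table 1; B = C57BL/6J, D = DBA/2J
EXPECTED_BP = {
    "D10Mit14": {"B": 191, "D": 185},
    "D10Mit95": {"B": 196, "D": 176},
}


def call_allele(marker, size_bp, tol_bp=1.5):
    """Return 'B' or 'D' if size_bp is within tol_bp of an expected allele, else None."""
    for allele, expected in EXPECTED_BP[marker].items():
        if abs(size_bp - expected) <= tol_bp:
            return allele
    return None


def call_genotype(marker, peak_sizes_bp, tol_bp=1.5):
    """Return 'BB', 'BD', or 'DD'; None if there are no peaks or any peak is unassigned."""
    alleles = {call_allele(marker, s, tol_bp) for s in peak_sizes_bp}
    if not alleles or None in alleles:
        return None
    return {("B",): "BB", ("B", "D"): "BD", ("D",): "DD"}[tuple(sorted(alleles))]


print(call_genotype("D10Mit14", [190.8, 185.3]))  # two peaks, one per parental allele
print(call_genotype("D10Mit14", [196.1]))         # a D10Mit95-sized peak is not called
```

Refusing to call a peak that lies outside every expected window is one way to keep a fragment from a neighboring marker in the same dye set from being assigned to the wrong marker.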

RESULTS

Initial testing and selection of markers from the MIT collection were performed on a small set of mouse DNA samples to determine each marker’s ability to fit into a multiplexed genotyping set. Eight samples (the two parental strains, an artificial heterozygote of parental DNA mixed in equal proportion, and five F2 mice selected from each of the 96-well plates) were used to establish the expected electrophoretic patterns of the homozygote and heterozygote alleles. The five mice from the F2 population served as the first test of the large-scale genotype production by comparing the genotypes generated in the marker selection phase with those generated during the high-throughput phase. The genotypes of all five mice were consistent in the two phases, indicating that there were no detectable shifts of the DNA samples during preparation of the 96-well trays. The final set of markers used is shown in Table 1 and is available on the website of the Center for Developmental and Health Genetics (CDHG) (http://www.cdhg.psu.edu/GeneticMarkerAnalysis).
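The repeat-genotyping comparison amounts to listing every sample and marker whose call differs between the two phases. A minimal sketch follows; the dictionary layout keyed by (sample, marker), the sample names, and the calls are hypothetical and serve only to show the comparison.

```python
def discordant_calls(pilot, production):
    """Return (sample, marker, pilot_call, production_call) for every mismatch.

    Both arguments map (sample, marker) -> genotype call. Pairs missing from
    the production data are skipped rather than counted as errors.
    """
    mismatches = []
    for (sample, marker), first in pilot.items():
        second = production.get((sample, marker))
        if second is not None and second != first:
            mismatches.append((sample, marker, first, second))
    return mismatches


# Hypothetical calls for two repeated samples at one marker
pilot = {("mouse_A", "D1Mit68"): "BD", ("mouse_B", "D1Mit68"): "BB"}
production = {("mouse_A", "D1Mit68"): "BD", ("mouse_B", "D1Mit68"): "DD"}
print(discordant_calls(pilot, production))
```

An empty list corresponds to the outcome reported above, in which the repeated samples agreed in both phases. A run of mismatches confined to one plate would instead suggest a shift or rotation of samples on that plate.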

Approximately 20% of the selected genetic markers could not be used owing to one of three problems: (1) the primers did not generate amplified product; (2) the observed allele sizes were different from those expected based on the MIT website and conflicted with the size of another marker in the same dye color; or (3) the primers worked poorly when amplified in the presence of two or three other markers in the multiplexed PCR reaction. The likelihood that a marker would fail was unpredictable, and replacement markers were chosen to maintain a map position and fit with the allele sizes of the other markers already in its set. The markers that could not be used are shown in Table 2 and on the CDHG website.
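The second failure mode, a size conflict with another marker in the same dye color, can be screened for before primers are pooled. The sketch below flags alleles of different markers that lie closer than a minimum gap. The 3-bp threshold, the function name, the `candidate` marker, and the assumption that these four TET-labeled markers from chromosomes 8 and 10 share one reaction are illustrative; the allele sizes are the observed values in Table 1.

```python
from itertools import combinations


def close_alleles(dye_set, min_gap_bp=3):
    """Return (marker1, size1, marker2, size2) for alleles of different
    markers in one dye set that are closer than min_gap_bp."""
    conflicts = []
    for (m1, sizes1), (m2, sizes2) in combinations(dye_set.items(), 2):
        for s1 in sizes1:
            for s2 in sizes2:
                if abs(s1 - s2) < min_gap_bp:
                    conflicts.append((m1, s1, m2, s2))
    return conflicts


# Observed (B6, DBA) allele sizes in bp for TET-labeled markers, from Table 1
tet_set = {
    "D10Mit117": (120, 122),
    "D10Mit95": (196, 176),
    "D10Mit14": (191, 185),
    "D8Mit211": (151, 163),
}
print(close_alleles(tet_set))

# A hypothetical replacement marker whose 150-bp allele crowds D8Mit211
print(close_alleles({**tet_set, "candidate": (150, 160)}))
```

Note that the alleles of D10Mit14 (191 and 185 bp) fall between those of D10Mit95 (196 and 176 bp), so a gap check of this kind passes the pair even though their size ranges interleave.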

TABLE 2.

Failed Markers and Their Replacements

Failed marker Replacement (a)
D10Mit150 D10Mit95
D10Mit20 D10Mit36
D10Mit83 D10Mit212
D11Mit152 D11Mit227
D11Mit180
D12Mit134 D12Mit153
D12Mit136 D12Mit8
D14Mit64 D14Mit122
D16Mit49 D16Mit191
D18Mit4 D18Mit144
D1Mit236
D2Mit297
D3Mit151
D5Mit136
D5Mit292
D5Mit61 D5Mit227
D6Mit105 D6Mit149
D6Mit105
D7Mit220
D7Mit250
D7Mit31 D7Mit253
D8Mit223 D8Mit281
D8Mit64
DXMit124
DXMit141
D12Mit153

(a) Blanks indicate that no marker was found that was appropriate either because of position on the chromosome or because allele size overlapped with other marker alleles.

The second test examined the deviation of the genotype frequencies from the expected Mendelian proportions (1:2:1). A chi-square test of each marker was conducted after all mice were genotyped for that marker. The five markers on the X chromosome are hemizygous in the males (no heterozygous males) and were tested only in the females. Of all 96 markers tested, only two deviated from the expected proportions with a P value < 0.05. This number is no greater than the four or five deviations expected by chance at this threshold (96 × 0.05 ≈ 4.8), and uncorrected values were used to minimize false negatives. Examination of the Genotyper files for these markers, D10Mit14 and D10Mit95, revealed that alleles were misgrouped during the allele-calling step. Marker D10Mit14 consists of 191- and 185-bp alleles, and D10Mit95 has alleles of 196 and 176 bp. The two larger alleles (196 and 191 bp) were accidentally grouped as one marker, and the two smaller alleles (185 and 176 bp) were grouped as the second marker. Assignment of the alleles to their proper marker restored correct allele frequencies for each of the suspect markers.
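The segregation test for a single autosomal marker can be sketched as a chi-square goodness-of-fit test against 1:2:1 with 2 degrees of freedom. The function name and the genotype counts below are hypothetical, and the sketch does not cover the X-linked markers. For 2 degrees of freedom the upper-tail probability has the closed form exp(−χ²/2), so only the standard library is needed.

```python
import math


def chi_square_121(n_bb, n_bd, n_dd):
    """Chi-square test of genotype counts against a 1:2:1 ratio (2 df).

    Returns (chi2, p_value). For 2 degrees of freedom the upper-tail
    probability of the chi-square distribution is exactly exp(-chi2 / 2).
    """
    n = n_bb + n_bd + n_dd
    observed = (n_bb, n_bd, n_dd)
    expected = (0.25 * n, 0.5 * n, 0.25 * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, math.exp(-chi2 / 2)


# Hypothetical counts for 392 mice: a well-behaved marker and a skewed one
for counts in [(98, 196, 98), (150, 196, 46)]:
    chi2, p = chi_square_121(*counts)
    flag = "check allele calls" if p < 0.05 else "ok"
    print(counts, round(chi2, 2), flag)
```

With 96 markers tested at an uncorrected threshold of 0.05, a handful of flags are expected by chance alone, so a flagged marker is a prompt to inspect the allele calls rather than proof of an error.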

The third test examined the correlation of allele status between markers on a chromosome. A detailed protocol is shown in Figure 1. The test was conducted by comparing the allele status of one marker with those of the other markers on a chromosome. Mouse chromosomes all have centromeres at one end (acrocentric), and the most centromeric marker was chosen as the reference. For most chromosomes the correlation declined asymptotically to zero as the map distance in centiMorgans increased between the reference and the test markers. This test was repeated for each chromosome using the marker most distal from the centromere as the reference marker because the correlation between two markers at opposite ends of the chromosome was already approaching zero (random segregation), especially for the larger chromosomes, and thus was uninformative. The uncorrected allele correlations for chromosomes 2, 7, 8, and 10 are shown in Figure 2A. For two of the chromosomes (chromosomes 2 and 8), the alleles were negatively correlated, as indicated by values less than zero (Fig. 2A). The negatively correlated markers were examined in the original genotype data and were discovered to have reversed allele calls based on the allele sizes of the parental strains. The entire set of values for chromosome 8 was below the abscissa because the reference marker was the miscalled marker. When corrected, the markers were all positively correlated, as shown in Figure 2B.
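The correlation test can be sketched by scoring each genotype as the number of D alleles and computing a Pearson correlation between the reference marker and a test marker. The numeric scoring, the function names, and the example calls are assumptions for illustration; the original analysis was carried out in a spreadsheet, and the exact coding used there is not stated.

```python
import math

SCORE = {"BB": 0, "BD": 1, "DD": 2}  # assumed coding: number of D alleles


def marker_correlation(calls_ref, calls_test):
    """Pearson correlation of two markers, skipping mice with a missing call.

    Raises ZeroDivisionError if either marker shows no variation.
    """
    pairs = [(SCORE[a], SCORE[b]) for a, b in zip(calls_ref, calls_test)
             if a in SCORE and b in SCORE]
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def swap_calls(calls):
    """Exchange BB and DD, simulating a reversed B/D allele assignment."""
    flip = {"BB": "DD", "DD": "BB"}
    return [flip.get(c, c) for c in calls]


# Hypothetical calls for eight mice at a reference marker and a nearby marker
ref = ["BB", "BB", "BD", "BD", "DD", "DD", "BD", "BB"]
near = ["BB", "BD", "BD", "BD", "DD", "DD", "BD", "BB"]
print(round(marker_correlation(ref, near), 3))              # positive: linked
print(round(marker_correlation(ref, swap_calls(near)), 3))  # same size, negative
```

Reversing the calls of one marker changes only the sign of its correlation with every other marker. A single negative value therefore points to the test marker, whereas a whole chromosome of negative values points to the reference marker, as seen for chromosome 8.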

FIGURE 1.

Stepwise error analysis from allele calls made by automated DNA fragment size detection software.

FIGURE 2.

A: Correlation values between the most centromeric marker and the other markers on chromosomes 2, 7, 8, and 10 before correction. All markers on chromosome 8 appear to be negatively correlated with the most centromeric marker, whereas only the second marker on chromosome 2 shows a negative correlation. The first reference marker is not tested for correlation to itself, so the first data point on the graph is the value for the second marker on the chromosome. B: Correlation values for chromosomes 2, 7, 8, and 10 after correcting the allele calls that were erroneous for the most centromeric marker (the reference marker) on chromosome 8 and the second marker on chromosome 2. The correlations are positive and decrease with increasing distance from the most centromeric marker as expected.

DISCUSSION

The techniques described herein can be used to detect gross errors in high-throughput genotyping, particularly in cases of simple expectations of genotype frequencies such as F2 populations generated from two inbred lines of the organism under analysis. The errors found in this study were due to two types of problems. First, misassignment of a DNA fragment (size in base pairs) during allele calling as an allele of one of the other markers in a dye set caused a skew in the distribution of alleles from the expected 1:2:1 ratio for the erroneous marker. Second, switching the assignment of the genotype (B or D) for a pair of alleles caused markers on a chromosome to appear to be negatively correlated. Both of these mistakes were easily detected and corrected before any genetic analysis of phenotypes was conducted, thus increasing the likelihood that high-quality data were available for analysis. It is important to conduct these tests at the level at which errors are likely to be made—in this case, the microwell plate level. These tests are easy to conduct, requiring widely available software, and can be carried out in the molecular laboratory as data from each chromosome are collected, rather than waiting until the entire data set is generated.

Acknowledgments

We thank Kate Anthony for expert technical assistance and Jeanne Spicer for website support. This study was supported by the National Institute on Aging (grant AG14731) of the National Institutes of Health. There are no known conflicts of interest on the part of any of the authors.

REFERENCES

  • 1. Freimer NB, Sandkuijl LA, Blower SM. Incorrect specification of marker allele frequencies: effects on linkage analysis. Am J Hum Genet 1993;52:1102–1110.
  • 2. Goring HH, Terwilliger JD. Linkage analysis in the presence of errors I: complex-valued recombination fractions and complex phenotypes. Am J Hum Genet 2000;66:1095–1106.
  • 3. Goring HH, Terwilliger JD. Linkage analysis in the presence of errors II: marker-locus genotyping errors modeled with hypercomplex recombination fractions. Am J Hum Genet 2000;66:1107–1118.
  • 4. Goring HH, Terwilliger JD. Linkage analysis in the presence of errors III: marker loci and their map as nuisance parameters. Am J Hum Genet 2000;66:1298–1309.
  • 5. Goring HH, Terwilliger JD. Linkage analysis in the presence of errors IV: joint pseudomarker analysis of linkage and/or linkage disequilibrium on a mixture of pedigrees and singletons when the mode of inheritance cannot be accurately specified. Am J Hum Genet 2000;66:1310–1327.
  • 6. Mendel G. Versuche über Pflanzen-hybriden [Experiments in plant hybridization]. Verh Naturforsch Ver Abh Brunn 1865;IV:3–47 (http://www.esp.org).
  • 7. Hardy GH. Mendelian proportions in a mixed population. Science 1908;28:49–50.
  • 8. Weinberg W. Über den Nachweis der Vererbung beim Menschen. Jahresh Ver Vaterl Naturkd Wuerttemb 1908;64:368–382.
  • 9. Sturtevant A. The linear arrangement of six sex-linked factors in Drosophila as shown by their mode of association. J Exp Zool 1913;14:43–59.
  • 10. Sambrook J, Fritsch E, Maniatis T. Molecular Cloning: A Laboratory Manual. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press, 1989.
  • 11. Blake JA, Richardson JE, Bult CJ, Kadin JA, Eppig JT, Mouse Genome Database Group. The Mouse Genome Database (MGD): the model organism database for the laboratory mouse. Nucleic Acids Res 2002;30:113–115.

Articles from Journal of Biomolecular Techniques : JBT are provided here courtesy of The Association of Biomolecular Resource Facilities
