Simple Tests to Detect Errors in High-Throughput Genotype Data in the Molecular Laboratory

David J Vandenbergh; Kathrine Heron; Ryan Peterson; Karl B Shpargel; Abigail Woodroffe; David A Blizard; Gerald E McClearn; George P Vogler

2003 Mar;14(1):9–16.

PMCID: PMC2279894 PMID: 12901607

Abstract

With the advent of high-density DNA marker data sets for the mouse and other model systems, 100 or more genotypes are routinely generated from large groups of mice. Issues of the accuracy and reliability of the genotyping are extremely important but often not addressed until genetic analysis is conducted. Simple tests that rely on the robust predictions arising from Mendelian genetics can be made quickly in the molecular laboratory as the data are generated, and require only a spreadsheet program. In this report, genotype data from 392 mice tested at 96 marker sites were analyzed for errors that are typical when handling large volumes of data generated in a repetitive process. The testing consisted of: (1) repeating the genotyping of approximately 1% of the samples; (2) examining the deviation from the expected segregation ratio (1:2:1) on a marker-by-marker basis; and (3) testing the correlation of the genotype at one marker with that at neighboring genetic markers on a chromosome. These three steps allowed analysis at the level of the microtiter plate, where errors are most likely to occur. A set of 96 dinucleotide repeat markers that are polymorphic between the C57BL/6J and DBA/2J mouse strains and can be multiplexed is reported for use in other genotyping projects.

Keywords: Hardy–Weinberg equilibrium, inbred lines, C57, DBA, multiplex marker set

Significant theoretical work is available in the literature on the problems created by mistyping or misclassification of genotypes for genetic analysis of both simple and complex traits. These efforts have generated analytical tools to estimate the effects of typing error on subsequent linkage analysis.^1–5 Error filters can then be applied to the genotype data to minimize the interference with detecting linkage. These methods do not address errors in the data that can be corrected prior to linkage analysis to increase the accuracy of the genotypes obtained.

Genetic analysis based on the results of Mendel⁶ provides a framework with which to analyze genotypes when individuals in a large sample group are tested at multiple loci. Predictions based on Hardy–Weinberg equilibrium allow for a convenient method to compare genotype frequencies with the underlying allele frequencies^7,8 but can become difficult in populations in which breeding is not controlled by the experimenter. In the case of crosses between two strains of inbred mice, Hardy–Weinberg equilibrium is at its simplest (1:2:1 ratio) because allele frequencies are expected to be 0.5 for each genetic marker. We have taken advantage of these laws to describe simple methods for testing the reliability of genotype data in mouse studies prior to genetic analysis. The work of Sturtevant demonstrated that the correlation values between a reference marker and a test marker would diminish with increasing map distance.⁹ The combination of these tests provides a practical method for examining data as they are being generated.
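The expected decline of the correlation with map distance can be written out explicitly. The derivation below is a sketch under assumptions that are not stated in the article: autosomal markers, genotypes coded as the number of DBA/2J alleles (0, 1, or 2), no genotyping error, and the Haldane map function (no crossover interference). The symbols g, g', r, and d are introduced here for illustration only.

```latex
% g_i and g_i' indicate whether the maternal and paternal F1 gametes carry the
% DBA/2J allele at locus i (i = 1, 2); r is the recombination fraction between
% the loci and d is their map distance in Morgans.
\begin{align}
X_1 &= g_1 + g_1', \qquad X_2 = g_2 + g_2' \\
\operatorname{Cov}(g_1, g_2) &= \tfrac{1}{2}(1 - r) - \tfrac{1}{4}
  = \tfrac{1}{4}(1 - 2r), \qquad \operatorname{Var}(g_i) = \tfrac{1}{4} \\
\rho(X_1, X_2) &= \frac{2 \cdot \tfrac{1}{4}(1 - 2r)}{2 \cdot \tfrac{1}{4}} = 1 - 2r \\
r &= \tfrac{1}{2}\left(1 - e^{-2d}\right)
  \quad\Longrightarrow\quad \rho = e^{-2d} = e^{-d_{\mathrm{cM}}/50}
\end{align}
```

Under these assumptions, markers 20 cM apart would be expected to correlate at about 0.67, and markers 100 cM apart at about 0.14, which is consistent with a correlation that approaches zero toward the far end of a large chromosome.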

METHODS

In connection with a study of age-related processes, genetic analysis was conducted on 392 F2 mice generated from a cross of C57BL/6J and DBA/2J parental strains. A small snip (approximately 2 mm) of the tail of each mouse was taken at weaning for genetic analysis. DNA was extracted by standard lysis with proteinase K digestion followed by phenol/Sevag extraction and ethanol precipitation.¹⁰ A portion of the purified DNA was diluted to 10 ng/μL into 96-well plates. Genotyping was carried out using markers from the MIT collection (http://www-genome.wi.mit.edu). Primers were purchased from Research Genetics, Inc. (Huntsville, AL), and the forward primer was fluorescently labeled for allele detection. See Table 1 for the markers used.

TABLE 1.

Markers for Multiplexed Genotyping to Distinguish C57BL/6J and DBA/2J Mouse Alleles^a

			B6 allele^e		DBA allele^e
Marker^b	cM^c	Sep^d	Obs (bp)	Publ (bp)	Obs (bp)	Publ (bp)	Dye label^f
D1Mit68	9		176	172	180	176	HEX
D1Mit236	25.7	16.7	149	143	133	136	HEX
D1Mit46	43.1	17.4	254	256	259	264	HEX
D1Mit87	62.1	19	201	204	197	198	TET
D1Mit16	87.2	25.1	183	186	196	197	FAM
D1Mit221	102	14.8	114	119	120	125	TET
With 16		18.6
D2Mit149	7		199	195	201	197	HEX
D2Mit323	31.7	24.7	126	126	124	128	TET
D2Mit300	50.3	18.6	101	108	99	106	FAM
D2Mit304	73	22.7	118	120	196	186	FAM
D2Mit343	84.2	11.2	149	183	141	173	FAM
D2Mit266	109	24.8	126	130	135	138	HEX
With 7		20.4
D3Mit264	2.4		103	109	101	107	FAM
D3Mit46	13.8	11.4	160	166	155	160	HEX
D3Mit25	29.5	15.7	130	134	124	127	FAM
D3Mit29	45.2	15.7	145	150	182	184	FAM
D3Mit42	58.8	13.6	150	154	142	146	TET
D3Mit194	67.6	8.8	144	146	139	142	HEX
D3Mit59	84.1	16.5	203	208	201	206	FAM
With 18		13.6
D4Mit104	1.9		244	243	242	241	HEX
D4Mit214	17.9	16	121	126	137	152	HEX
D4Mit255	48.5	30.6	251	238	244	234	TET
D4Mit204	61.9	13.4	104	104	82	82	FAM
D4Mit190	79	17.1	143	145	150	151	FAM
With 13		19.3
D5Mit227	9		97	92	102	102	FAM
D5Mit81	28	19	206	210	192	194	TET
D5Mit10	54	26	189	196	197	203	FAM
D5Mit137	68	14	149	152	144	148	HEX
D5Mit167	78	10	120	121	116	125	HEX
With X		17.3
D6Mit116	5.5		117	123	110	114	FAM
D6Mit188	32.5	27	127	130	156	155	HEX
D6Mit149	46.3	13.8	199	189	208	199	TET
D6Mit111	63.7	17.4	139	146	142	148	FAM
D6Mit15	74	10.3	255	260	195	195	FAM
With 12		17.1
D7Mit76	3.4		226	228	223	226	FAM
D7Mit69	24.5	21.1	237	236	239	238	HEX
D7Mit253	52.8	28.3	81.4	90	83	92	TET
D7Mit12	66	13.2	199	197	206	206	TET
With 2		20.9
D8Mit281	11		122	123	115	117	FAM
D8Mit191	21	10	138	144	127	132	FAM
D8Mit249	37	16	145	148	187	172	HEX
D8Mit211	49	12	151	154	163	166	TET
D8Mit215	59	10	174	182	168	176	FAM
With 10		12
D9Mit64	7		187	190	183	184	FAM
D9Mit229	28	21	122	122	135	142	HEX
D9Mit196	48	20	143	148	150	156	FAM
D9Mit212	61	13	99	108	106	118	FAM
D9Mit52	72	11	170	172	174	176	HEX
With 19		16.3
D10Mit212	9		117	123	125	131	HEX
D10Mit36	29	20	140	146	142	148	FAM
D10Mit117	48	19	120	142	122	126	TET
D10Mit95	51	3	196	201	176	179	TET
D10Mit14	65	14	191	192	185	182	TET
With 8		14
D11Mit227	2		165	170	157	166	HEX
D11Mit86	28	26	119	126	126	134	HEX
D11Mit36	47.6	19.6	237	234	219	220	FAM
D11Mit99	59.5	11.9	121	124	105	108	TET
D11Mit291	70	10.5	92	94	95	96	TET
With 17		17
D12Mit190	28		122	128	109	114	TET
D12Mit194	45	17	104	108	96	100	HEX
D12Mit8	58	13	167	168	177	179	HEX
With 6		15
D13Mit198	16		132	136	134	140	TET
D13Mit125	44	28	188	191	183	185	FAM
D13Mit230	62	18	150	154	147	151	TET
D13Mit35	75	13	194	190	180	182	HEX
With 4		19.7
D14Mit11	0.7		151	152	159	158	FAM
D14Mit122	21.5	20.8	141	144	127	132	HEX
D14Mit203	28.3	6.8	151	158	169	176	HEX
D14Mit194	44.4	16.1	91	89	77	81	TET
D14Mit75	54	9.6	177	178	178	190	TET
With 15		13.3
D15Mit179	10.8		140	148	149	152	TET
D15Mit183	23	12.2	125	129	123	127	FAM
D15Mit105	47.9	24.9	123	125	110	113	HEX
D15Mit39	56.6	8.7	130	130	123	124	TET
D15Mit35	61.7	5.1	116	142	110	136	FAM
With 14		12.7
D16Mit131	4.3		141	144	181	180	TET
D16Mit4	27.3	23	128	132	118	123	HEX
D16Mit191	57.8	30.5	116	122	118	124	FAM
D16Mit71	70.7	12.9	156	159	162	163	FAM
With 1		22.1
D17Mit16	18.2		117	118	104	106	FAM
D17Mit177	24	5.8	107	110	113	116	HEX
D17Mit93	44.5	20.5	155	156	170	168	TET
D17Mit123	56.7	12.2	128	133	149	155	FAM
With 11		12.8
D18Mit60	16		203	186	211	194	TET
D18Mit123	31	15	117	116	123	122	HEX
D18Mit142	47	16	115	121	127	133	TET
D18Mit144	57	10	177	180	173	177	HEX
With 3		13.7
D19Mit68	6		131	136	127	132	FAM
D19Mit40	25	19	106	112	100	106	TET
D19Mit8^g	47	22	196	178	164	168	TET
D19Mit71	54	7	132	137	128	135	TET
With 9		16
DXMit89	3		146	149	153	157	FAM
DXMit166	15.5	12.5	113	114	126	126	FAM
DXMit95	43	27.5	139	151	133	137	TET
DXMit79	50.5	7.5	138	139	136	137	HEX
DXMit135	69	18.5	113	118	115	122	TET
With 5		16.5


^aThis table, as well as a template file to help in error detection, is also available on the World Wide Web at http://www.cdhg.psu.edu/GeneticMarkerAnalysis.

^bName of polymorphism used from Research Genetics (http://www.resgen.com). The markers are listed numerically by chromosome (the number following the “D” in each name). To make multiplex reactions, markers from each chromosome were paired with a second chromosome as indicated at the end of each chromosome set (i.e., chromosome 1 paired with 16, 2 with 7, etc.).

^cCentiMorgan (cM) position on the chromosome based on consensus committee map (http://www.informatics.jax.org). (See reference ¹¹ for detailed information about the database.)

^dDistance in cM from previous marker. The average separation of markers on a chromosome is shown in bold at the end of each chromosome list.

^eAllele sizes in base pairs (bp) from C57BL/6J (B6) or DBA/2J (DBA) mice. Obs, observed; Publ, published, from the MIT website (http://www-genome.wi.mit.edu/cgi-bin/mouse/index).

^fDye labels: Three fluorescent dyes (FAM, HEX, and TET) from Applied Biosystems, Inc., were used to label forward primers. Primers were pooled, keeping dyes in separate multiplex PCR reactions.

^gMarker D19Mit8 has been found not to be located on chromosome 19 using R/qtl software. This problem may be limited to the batch of primers supplied.

The polymerase chain reaction (PCR) was carried out in a total volume of 10 μL consisting of 10 ng of the template DNA, 2.5 mM MgCl₂, 10 mM dNTPs, 0.04 mM spermidine, 0.5 U AmpliTaq Gold DNA polymerase (Applied Biosystems Inc., Foster City, CA), and the buffer supplied with the polymerase. Following denaturation at 95°C for 2 min, 35 cycles of PCR were carried out (45 s at 95°C, 45 s at 59°C, 60 s at 72°C). The samples were electrophoresed on an ABI 310 Genetic Analyzer (Applied Biosystems). Allele fragment sizes were determined by GeneScan software (Applied Biosystems). These sizes were converted to allele calls (either B or D for C57BL/6J or DBA/2J, respectively) with ABI Genotyper software (Applied Biosystems) and exported into an Excel spreadsheet (Microsoft Corporation, Redmond, WA).
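The conversion from fragment size to allele call can be illustrated with a small sketch. This is not how Genotyper works internally; the function name `call_genotype`, the 1.5-bp matching window `tol`, and the example peak sizes are all hypothetical. Only the expected allele sizes are taken from the observed values in Table 1.

```python
def call_genotype(peaks, b_size, d_size, tol=1.5):
    """Convert fragment sizes (bp) for one sample at one marker to a genotype.

    peaks  : iterable of fragment sizes seen in the marker's dye lane
    b_size : observed C57BL/6J allele size for the marker (Table 1)
    d_size : observed DBA/2J allele size for the marker (Table 1)
    tol    : hypothetical matching window in bp (an assumption)
    Returns 'BB', 'BD', 'DD', or None when no peak matches either allele.
    """
    alleles = set()
    for peak in peaks:
        # Pick whichever expected allele is closer to this peak.
        label, size = min((("B", b_size), ("D", d_size)),
                          key=lambda allele: abs(peak - allele[1]))
        # Ignore peaks outside the window; they belong to another marker.
        if abs(peak - size) <= tol:
            alleles.add(label)
    if alleles == {"B"}:
        return "BB"
    if alleles == {"D"}:
        return "DD"
    if alleles == {"B", "D"}:
        return "BD"
    return None


# Two TET-labeled chromosome 10 markers: (B6 size, DBA size) from Table 1.
MARKERS = {"D10Mit14": (191, 185), "D10Mit95": (196, 176)}

# Hypothetical peaks for one mouse in the TET lane.
peaks = [176.2, 190.8, 196.1]
calls = {name: call_genotype(peaks, b, d) for name, (b, d) in MARKERS.items()}
print(calls)
```

Matching each peak against the two expected sizes of a single marker keeps the 191-bp and 196-bp fragments apart even though they lie close together in the same dye lane.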

RESULTS

Initial testing and selection of markers from the MIT collection were performed on a small set of mouse DNA samples to determine the marker’s ability to fit into a multiplexed genotyping set. Eight samples—the two parental strains, an artificial heterozygote of parental DNA mixed in equal proportion, and five F2 mice selected from each of the 96-well plates—were used to establish the expected electrophoretic patterns of the homozygote and heterozygote alleles. The five mice from the F2 population served as the first test of the large-scale genotype production by comparing the genotypes generated in the marker selection phase with those generated during the high-throughput phase. The genotypes of all five mice were consistent in the two phases, indicating that there were no detectable shifts of the DNA samples during preparation of the 96-well trays. The final set of markers used is shown in Table 1 and is available on the website of the Center for Developmental and Health Genetics (CDHG) (http://www.cdhg.psu.edu/GeneticMarkerAnalysis).
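The comparison of repeated genotypes with the original calls is easy to automate. The sketch below assumes a data layout that the article does not specify: each data set is a dictionary keyed by (mouse ID, marker name) with calls written as 'BB', 'BD', or 'DD'. The function name and the sample values are hypothetical.

```python
def concordance(original, repeat):
    """Compare repeat genotype calls against the original calls.

    Both arguments map (mouse_id, marker) -> call ('BB', 'BD', or 'DD').
    Only keys present in both data sets are compared.
    Returns (number compared, sorted list of discordant keys).
    """
    shared = sorted(set(original) & set(repeat))
    discordant = [key for key in shared if original[key] != repeat[key]]
    return len(shared), discordant


# Hypothetical calls for illustration only.
original = {
    ("m1", "D1Mit68"): "BB",
    ("m1", "D1Mit236"): "BD",
    ("m2", "D1Mit68"): "DD",
}
repeat = {
    ("m1", "D1Mit68"): "BB",
    ("m1", "D1Mit236"): "DD",
}

n_compared, mismatches = concordance(original, repeat)
print(n_compared, mismatches)
```

A run of mismatches confined to one plate, or offset by one well, would point to a sample shift during plate preparation rather than to isolated allele-calling errors.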

Approximately 20% of the selected genetic markers could not be used owing to one of three problems: (1) the primers did not generate amplified product; (2) the observed allele sizes were different from those expected based on the MIT website and conflicted with the size of another marker in the same dye color; or (3) the primers worked poorly when amplified in the presence of two or three other markers in the multiplexed PCR reaction. The likelihood that a marker would fail was unpredictable, and replacement markers were chosen to maintain a map position and fit with the allele sizes of the other markers already in its set. The markers that could not be used are shown in Table 2 and on the CDHG website.

TABLE 2.

Failed Markers and Their Replacements

Failed marker	Replacement^a
D10Mit150	D10Mit95
D10Mit20	D10Mit36
D10Mit83	D10Mit212
D11Mit152	D11Mit227
D11Mit180
D12Mit134	D12Mit153
D12Mit136	D12Mit8
D14Mit64	D14Mit122
D16Mit49	D16Mit191
D18Mit4	D18Mit144
D1Mit236
D2Mit297
D3Mit151
D5Mit136
D5Mit292
D5Mit61	D5Mit227
D6Mit105	D6Mit149
D6Mit105
D7Mit220
D7Mit250
D7Mit31	D7Mit253
D8Mit223	D8Mit281
D8Mit64
DXMit124
DXMit141
D12Mit153


^aBlanks indicate that no marker was found that was appropriate either because of position on the chromosome or because allele size overlapped with other marker alleles.

The second test examined the deviation of the genotype frequencies from the expected Mendelian proportions (1:2:1). A chi-square test was conducted for each marker after all mice had been genotyped for that marker. The five markers on the X chromosome are hemizygous in males (no heterozygous males) and were tested only in females. Of the 96 markers tested, only two deviated from the expected proportions with a P value < 0.05. This number does not exceed the four or five deviations expected by chance alone at that threshold (96 × 0.05 ≈ 4.8), and uncorrected P values were used to minimize false negatives. Examination of the Genotyper files for these markers, D10Mit14 and D10Mit95, revealed that alleles were misgrouped during the allele-calling step. Marker D10Mit14 consists of 191- and 185-bp alleles, and D10Mit95 has alleles of 196 and 176 bp. The two larger alleles (196 and 191 bp) were accidentally grouped as one marker, and the two smaller alleles (185 and 176 bp) were grouped as the second marker. Assignment of the alleles to their proper marker restored correct allele frequencies for each of the suspect markers.
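The segregation test can be sketched in a few lines. The article used a spreadsheet; the Python below is an equivalent illustration, and the function name and genotype counts are hypothetical. With three genotype classes and fully specified expected proportions, the statistic has 2 degrees of freedom, for which the chi-square tail probability is exactly exp(-x/2), so no statistics library is needed.

```python
import math


def chi_square_121(n_bb, n_bd, n_dd):
    """Chi-square goodness-of-fit test against a 1:2:1 segregation ratio.

    Returns (statistic, p_value). The p-value uses the closed form of the
    chi-square survival function for 2 degrees of freedom, exp(-x / 2).
    """
    n = n_bb + n_bd + n_dd
    observed = (n_bb, n_bd, n_dd)
    expected = (n / 4, n / 2, n / 4)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, math.exp(-stat / 2)


# Hypothetical counts for 392 F2 mice at two markers.
good_stat, good_p = chi_square_121(98, 196, 98)      # perfect 1:2:1
skewed_stat, skewed_p = chi_square_121(150, 196, 46)  # excess of BB calls

for name, p in (("balanced", good_p), ("skewed", skewed_p)):
    flag = "CHECK ALLELE CALLS" if p < 0.05 else "ok"
    print(f"{name}: P = {p:.3g} ({flag})")
```

Running the test marker by marker, without correction for multiple testing, follows the article's choice to favor sensitivity: a flagged marker only triggers a second look at the allele calls.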

The third test examined the correlation of allele status between markers on a chromosome. A detailed protocol is shown in Figure 1. The test was conducted by comparing the allele status of one marker with those of the other markers on a chromosome. Mouse chromosomes all have centromeres at one end (acrocentric), and the most centromeric marker was chosen as the reference. For most chromosomes the correlation declined asymptotically to zero as the map distance in centiMorgans increased between the reference and the test markers. This test was repeated for each chromosome using the marker most distal from the centromere as the reference marker because the correlation between two markers at opposite ends of the chromosome was already approaching zero (random segregation), especially for the larger chromosomes, and thus was uninformative. The uncorrected allele correlations for chromosomes 2, 7, 8, and 10 are shown in Figure 2A. For two of the chromosomes (chromosomes 2 and 8), the alleles were negatively correlated, as indicated by values less than zero (Fig. 2A). The negatively correlated markers were examined in the original genotype data and were discovered to have reversed allele calls based on the allele sizes of the parental strains. The entire set of values for chromosome 8 was below the abscissa because the reference marker was the miscalled marker. When corrected, the allele markers were all positively correlated, as shown in Figure 2B.
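The marker-to-marker correlation test can be sketched as follows. The numeric coding (BB = 0, BD = 1, DD = 2), the use of the Pearson coefficient, the function names, and the sample genotypes are assumptions made for illustration; the article does not state which coding or coefficient its spreadsheet used.

```python
import math

CODE = {"BB": 0, "BD": 1, "DD": 2}  # assumed additive coding


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes both sequences vary; a constant sequence would divide by zero.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlations_to_reference(calls, reference):
    """Correlate every marker on a chromosome with the reference marker.

    calls: dict of marker -> list of calls in the same mouse order,
           with None for a missing genotype (such pairs are skipped).
    """
    result = {}
    for marker, genotypes in calls.items():
        if marker == reference:
            continue
        pairs = [(CODE[a], CODE[b])
                 for a, b in zip(calls[reference], genotypes)
                 if a is not None and b is not None]
        result[marker] = pearson([p[0] for p in pairs], [p[1] for p in pairs])
    return result


# Hypothetical genotypes for six mice; "swapped" has B and D reversed.
calls = {
    "ref":     ["BB", "BD", "DD", "BD", "BB", "DD"],
    "near":    ["BB", "BD", "DD", "BD", "BD", "DD"],
    "swapped": ["DD", "BD", "BB", "BD", "DD", "BB"],
}
result = correlations_to_reference(calls, "ref")
flagged = [marker for marker, r in result.items() if r < 0]
print(result, flagged)
```

A negative value for a single marker suggests reversed calls at that marker, whereas negative values for every marker on a chromosome suggest that the reference marker itself is the one that was reversed.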

FIGURE 1.

Stepwise error analysis from allele calls made by automated DNA fragment size detection software.

FIGURE 2.

A: Correlation values between the most centromeric marker and the other markers on chromosomes 2, 7, 8, and 10 before correction. All markers on chromosome 8 appear to be negatively correlated with the most centromeric marker, whereas only the second marker on chromosome 2 shows a negative correlation. The first reference marker is not tested for correlation to itself, so the first data point on the graph is the value for the second marker on the chromosome. B: Correlation values for chromosomes 2, 7, 8, and 10 after correcting the allele calls that were erroneous for the most centromeric marker (the reference marker) on chromosome 8 and the second marker on chromosome 2. The correlations are positive and decrease with increasing distance from the most centromeric marker as expected.

DISCUSSION

The techniques described herein can be used to detect gross errors in high-throughput genotyping, particularly in cases of simple expectations of genotype frequencies such as F2 populations generated from two inbred lines of the organism under analysis. The errors found in this study were due to two types of problems. First, misassignment of a DNA fragment (size in base pairs) during allele calling as an allele of one of the other markers in a dye set caused a skew in the distribution of alleles from the expected 1:2:1 ratio for the erroneous marker. Second, switching the assignment of the genotype (B or D) for a pair of alleles caused markers on a chromosome to appear to be negatively correlated. Both of these mistakes were easily detected and corrected before any genetic analysis of phenotypes was conducted, thus increasing the likelihood that high-quality data were available for analysis. It is important to conduct these tests at the level at which errors are likely to be made—in this case, the microwell plate level. These tests are easy to conduct, requiring widely available software, and can be carried out in the molecular laboratory as data from each chromosome are collected, rather than waiting until the entire data set is generated.

Acknowledgments

We thank Kate Anthony for expert technical assistance and Jeanne Spicer for website support. This study was supported by the National Institute on Aging (grant AG14731) of the National Institutes of Health. There are no known conflicts of interest on the part of any of the authors.

REFERENCES

1. Freimer NB, Sandkuijl LA, Blower SM. Incorrect specification of marker allele frequencies: effects on linkage analysis. Am J Hum Genet 1993;52:1102–1110.
2. Goring HH, Terwilliger JD. Linkage analysis in the presence of errors I: complex-valued recombination fractions and complex phenotypes. Am J Hum Genet 2000;66:1095–1106.
3. Goring HH, Terwilliger JD. Linkage analysis in the presence of errors II: marker-locus genotyping errors modeled with hypercomplex recombination fractions. Am J Hum Genet 2000;66:1107–1118.
4. Goring HH, Terwilliger JD. Linkage analysis in the presence of errors III: marker loci and their map as nuisance parameters. Am J Hum Genet 2000;66:1298–1309.
5. Goring HH, Terwilliger JD. Linkage analysis in the presence of errors IV: joint pseudomarker analysis of linkage and/or linkage disequilibrium on a mixture of pedigrees and singletons when the mode of inheritance cannot be accurately specified. Am J Hum Genet 2000;66:1310–1327.
6. Mendel G. Versuche über Pflanzen-hybriden [Experiments in plant hybridization]. Verh Naturforsch Ver Abh Brunn 1865;IV:3–47 (http://www.esp.org).
7. Hardy GH. Mendelian proportions in a mixed population. Science 1908;28:49–50.
8. Weinberg W. Über den Nachweis der Vererbung beim Menschen. Jahresh Ver Vaterl Naturkd Wuerttemb 1908;64:368–382.
9. Sturtevant A. The linear arrangement of six sex-linked factors in Drosophila as shown by their mode of association. J Exp Zool 1913;14:43–59.
10. Sambrook J, Fritsch E, Maniatis T. Molecular Cloning: A Laboratory Manual. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press, 1989.
11. Blake JA, Richardson JE, Bult CJ, Kadin JA, Eppig JT, Mouse Genome Database Group. The Mouse Genome Database (MGD): the model organism database for the laboratory mouse. Nucleic Acids Res 2002;30:113–115.
