Using Quality Measures to Facilitate Allele Calling in High-Throughput Genotyping

Birgir Pálsson; Frosti Pálsson; Mark Perlin; Hákon Gudbjartsson; Kári Stefánsson; Jeffrey Gulcher

doi:10.1101/gr.9.10.1002

1999 Oct;9(10):1002–1012. doi: 10.1101/gr.9.10.1002

PMCID: PMC310819 PMID: 10523529

Abstract

Currently, the main limitation in high-throughput microsatellite genotyping is the required manual editing of allele calls. Even though programs for automated allele calling have been available for several years, they have limited capability because accurate data could only be assured by manual inspection of the electropherograms for confirmation. Here we describe the development of a parametric approach to allele call quality control that eliminates much of the time required for manual editing of the data. This approach was implemented in an editing tool, Decode-GT, that works downstream of the allele calling program, TrueAllele (TA). Decode-GT reads the output data from TA, displays the underlying electropherograms for the genotypes, and sorts the allele calls into three categories: good, bad, and ambiguous. It discards the bad calls, accepts the good calls, and suggests that the user inspect the ambiguous calls, thereby reducing dependence on manual editing. For the categorization we use the following parameters: (1) the quality value for each allele call from TrueAllele; (2) the peak height of the alleles; and (3) the size of the peak shift needed to move peaks into the nearest bin. Here we report how we optimized the parameters such that the size of the ambiguous category was minimized, and both the number of miscalled genotypes in the good category and the useable genotypes in the bad category were negligible. This approach reduces the manual editing time and results in <1% miscalls.

Dissection of the major and minor genetic factors important in complex genetic diseases requires the ability to generate an enormous amount of genotypic information. Because these diseases tend to skip one or more generations, one can choose for study either many large extended families with multiple patients separated by many meiotic events or an even greater number of sib-pairs (Lander and Schork 1994). Regardless of the approach used, at least a half million microsatellite genotypes may be necessary for any given project. For example, when using 1000 microsatellite markers to type 1000 DNA samples, a total of 1 million genotypes must be determined.

SNP (single nucleotide polymorphism) genotyping may be used in the future for such studies, but higher density SNP maps and cheaper genotyping platforms are prerequisites. In addition, because the heterozygosity rates of SNPs are so low compared with microsatellites, at least 2 to 5 times more SNPs will be required to achieve the same power as microsatellites in pedigree based studies (Kruglyak 1997). Another disadvantage is that the accuracy of SNP genotyping is less easily determined through inheritance checking than microsatellites. Furthermore, this higher density of markers will also require a very high resolution physical map to assure proper order of markers and will probably need to await the full sequence of the human genome. Although some have hoped that genome-wide SNP association studies may replace family-based linkage studies, the required number of SNP markers has been estimated to be about 500,000 (Kruglyak 1999). For these reasons, microsatellite genotyping will probably continue to be the method of choice for genome-wide linkage studies in the near future. To achieve this scale of genotyping within a reasonable time, a high-throughput approach is needed at every step (Hall et al. 1996).

As robot technology and more sophisticated sequencers have increased the throughput in microsatellite genotyping dramatically, the editing of the data has become a bottleneck, limiting throughput. The software used for allele calling has not evolved at the same pace as the robotic and sequencing technologies, and manual editing of the data is both costly and time consuming. The main limitation of software that is currently available has been the lack of quality measures for the allele calls made by the automated programs, which could help sort out accurate calls from inaccurate ones. Hence, if accurate allele calling is desired, a human eye must check all the automated calls by inspecting the electropherograms. Furthermore, many programs have not been tailored for high-throughput genotyping, lacking features such as batch processing of gel files.

We hypothesized that there must be a set of parameters that could be used to fractionate the allele calls from an allele-calling program according to quality in a manner that decreases the user's editing time without compromising accuracy. The use of such quality measures in genotyping would perhaps be analogous to their use in sequencing (Ewing and Green 1998; Ewing et al. 1998). Here we describe a set of parameters that we optimized according to efficiency and accuracy of allele calls.

TrueAllele

We chose to work with an allele-calling program called TrueAllele (TA), commercially available from Cybergenetics, Inc. It uses quantitation and deconvolution algorithms for allele calling. TA is written in Matlab and currently runs under MacOS, Windows NT/95/98, and Unix-based systems (Perlin et al. 1994, 1995). At deCODE genetics we run TA 1.02b.1 on 400 MHz Pentium II work stations, running the Linux operating system accessed via ReflectionX from PC/NT computers.

Compared to Genotyper 2.0 (GT) from Applied Biosystems (1994), TA has three main advantages. It provides a quality measure for every allele call; it allows for batch processing of gel files; and it performs an efficient tracking of the gel files. The main limitation of the program is that the interface is not as user friendly as that of GT, and the manual editing of the allele calls can be just as time consuming in a high-throughput project.

To streamline the process of genotype analysis and to make it as user friendly as possible, we developed two programs that handle the preparation and management of the batch runs. One program gathers all files that are required for a given batch run on TA into a single folder. The other program extracts and prepares the files after each batch run, preparing the results and quality measures for allele calls, as well as the electropherograms for our editing program called Decode-GT.

Decode-GT

Next, we created a program, Decode-GT 1.0, that incorporates a set of parameters that can be optimized for the most efficient and accurate allele calls. It is a PC program that runs under Windows NT and has three main functions. First, it sorts the allele calls according to quality measures and can display the electropherograms on which they are based. Second, it checks the allele calls of CEPH control samples to ensure that the gel is properly calibrated. Third, it performs an inheritance check on the results using pedigree information. Decode-GT reads the combined results file from TA and sorts the data into three categories: bad allele calls, good allele calls, and ambiguous allele calls. Sorting is based on a TA quality measure, the peak heights, and the peak shifts. The aim is that only calls in the ambiguous category need be inspected by the user.

Defining Criteria for Categorization

Our goal was to set the criteria used by Decode-GT such that the ambiguous category would include relatively few allele calls, without discarding too many useable allele calls in the bad category or including false calls in the good category. To find the optimal settings, we performed a study in which we compared TA results with results of manual editing using GT. We systematically examined how various settings affected: (1) the number of miscalls that were captured into the ambiguous category and (2) the size of the ambiguous category. We then incorporated the criteria found to be optimal in Decode-GT and tested them on a new data set examining (1) the number of miscalls in the good category prior to editing (i.e., prior to inspection of the CEPH control and ambiguous genotypes and the inheritance check); (2) the remaining miscalls after editing; and (3) the average size of the ambiguous category.

We independently processed 7595 genotypes from 80 markers using both TA and GT in our first study. Of those, there were 719 discrepancies between the two methods; these we refer to as miscalls by TA, since all genotypes from GT had been manually inspected and edited as necessary. The main reasons for miscalls were the following: (1) The signal (peak height) was very low; (2) there was contamination or PCR artifacts that gave additional peaks; (3) TA had shifted the size calibration to fit the peaks into the binning library; (4) heterozygous genotypes were called as homozygous due to insufficient amplification of the larger allele, and therefore low peak height; (5) TA called a stutter peak as an allele; and (6) TA called a homozygous genotype as heterozygous by assigning an allele to a small peak in the electropherogram noise.
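The comparison between the two methods can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the dictionary layout, the function name `find_discrepancies`, and the choice to count a genotype that the manual editor discarded as a discrepancy are all assumptions.

```python
from typing import Dict, Optional, Tuple

Genotype = Tuple[int, int]   # two allele sizes in bp
Key = Tuple[str, str]        # (sample id, marker name), hypothetical layout


def find_discrepancies(
    ta_calls: Dict[Key, Genotype],
    gt_calls: Dict[Key, Genotype],
) -> Dict[Key, Tuple[Genotype, Optional[Genotype]]]:
    """Return every automated call that differs from the manually edited call.

    Assumption: a genotype missing from gt_calls was discarded by the
    editor and therefore counts as a discrepancy.
    """
    diffs = {}
    for key, ta in ta_calls.items():
        gt = gt_calls.get(key)  # None means the editor discarded it
        if gt is None or sorted(ta) != sorted(gt):
            diffs[key] = (ta, gt)
    return diffs
```

Allele order within a genotype is ignored, so (104, 100) and (100, 104) are treated as the same call.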

Bad Calls

Allele calls that fall under this category are discarded and electropherograms for the alleles are not inspected. We used the peak height of allele 1 (the smaller fragment by molecular weight) to find a threshold value that would discard as many unusable allele calls as possible, without discarding a large fraction of allele calls that were useable (that is, were used when inspected by a user in GT). The peak height value is assigned by TA on a similar scale as the value given in GT.

Figure 1 shows the effect of increasing the height threshold from 0 to 100 on the total number of discarded genotypes and on the fraction of the discarded genotypes that are usable (that is, were called by a user in GT). The number of discarded genotypes increases rapidly as the height threshold rises from 0 to 45; after that, the rate of increase lessens. The number of potentially usable genotypes that are discarded starts rising at height ∼35 and rises steadily thereafter. At a height threshold <40, the discarded allele calls are therefore primarily unreadable calls, and at a threshold of 50 only 0.3% of the potentially usable data are discarded, whereas 403 discrepancies are moved to the bad category. Therefore, using a height threshold of 50 is optimal.

Figure 1. The peak height of allele 1 (smaller fragment) is used to categorize allele calls as bad calls that are automatically discarded. The graph shows how the total number of discarded genotypes (red line) and the number of usable genotypes that are discarded (black line) increase as the height threshold goes from 0 to 100. Based on this graph the height threshold for the bad category was set at 50, where the rate of increase in discarded genotypes lessens and the usable genotypes that are discarded amount to ∼0.3% of the total number of genotypes.
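A threshold sweep of the kind summarized in Figure 1 could be computed along the following lines. This is a sketch under stated assumptions: the `Call` record, the `usable` flag (whether a manual editor kept the genotype), and the strict "height below threshold" comparison are illustrative choices, not details given in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Call:
    height1: float   # peak height of allele 1 as reported by the caller
    usable: bool     # assumed flag: True if a manual editor kept this genotype


def sweep_bad_threshold(
    calls: List[Call], thresholds: List[float]
) -> List[Tuple[float, int, float]]:
    """For each threshold return (threshold, number discarded,
    usable genotypes discarded as a percentage of all genotypes)."""
    total = len(calls)
    rows = []
    for t in thresholds:
        discarded = [c for c in calls if c.height1 < t]
        usable_lost = sum(1 for c in discarded if c.usable)
        rows.append((t, len(discarded), 100.0 * usable_lost / total))
    return rows
```

Plotting the second and third columns against the threshold gives the two curves described above, from which a cutoff such as 50 can be read off.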

Ambiguous Calls

As the peak height of allele 1 decreases, the risk of a miscall tends to increase. To determine the optimum height threshold for the ambiguous category, we inspected the effect of increasing the peak height threshold from 50 to 150 on the total number of genotypes placed in the ambiguous category and the number of miscalls included in the good category (Fig. 2). As the peak height threshold reaches 100, the decrease in miscalls in the good category levels off. The size of the ambiguous category reaches 10% at that value, which is acceptable. Therefore, genotypes with peak heights between 50 and 100 are placed in the ambiguous category. Using just this criterion, the fraction of miscalled genotypes that remains in the good category is ∼2.75%.

Figure 2. The peak height of allele 1 (smaller fragment) is used to categorize allele calls as ambiguous. The graph shows how the size of the ambiguous category (red line) increases and how the number of miscalls that are not listed as ambiguous (black line) decreases as the peak height threshold for the ambiguous category goes from 50 to 150. At peak height threshold 100, the ambiguous category is ∼10% of the total number of genotypes, and the decrease in miscalls in the good category levels off. We therefore use peak height threshold 100 for the ambiguous category.
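The trade-off shown in Figure 2 pairs two quantities for each candidate threshold: the share of genotypes sent for inspection and the share of miscalls that slip through as good. A minimal sketch, assuming a hypothetical `EditedCall` record with a `miscall` flag derived from manual editing, and assuming half-open intervals for the category boundaries:

```python
from typing import List, NamedTuple, Tuple


class EditedCall(NamedTuple):
    height1: float   # peak height of allele 1
    miscall: bool    # assumed flag: automated call disagreed with manual editing


def ambiguous_tradeoff(
    calls: List[EditedCall], upper: float, lower: float = 50.0
) -> Tuple[float, float]:
    """Return (% of all genotypes flagged ambiguous,
    % of all genotypes that are miscalls left in the good category)
    for one upper height threshold. Calls below `lower` are bad calls."""
    total = len(calls)
    ambiguous = sum(1 for c in calls if lower <= c.height1 < upper)
    missed = sum(1 for c in calls if c.height1 >= upper and c.miscall)
    return 100.0 * ambiguous / total, 100.0 * missed / total
```

Evaluating this function for thresholds from 50 to 150 yields both curves; the same pattern applies to the quality and shift thresholds with a different field and comparison.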

The quality value assigned by TA ranges between 0.0 and 1.0. It reflects the peak height, the shape, and stutter pattern for each marker. Because peak height has an effect on the quality value, the majority of allele calls with a low-quality value are already included in the ambiguous category by our peak-height threshold criteria or have been discarded into the bad category. PCR artifacts can produce a strong signal, but those peaks usually lack the shape and stutter pattern stored by TA in its library and so usually result in a very low quality value. Figure 3 shows how the ambiguous category expands and how the portion of miscalled genotypes in the good category decreases, as the quality threshold increases from 0.7 to 1.0. As the quality threshold reaches 0.8, the reduction in the number of miscalled genotypes in the good category, due to classification into the ambiguous category, levels off. However, miscalls are rapidly removed again after the quality threshold reaches 0.9. When the quality threshold reaches 0.8, the ambiguous category contains ∼13% of the total number of genotypes, but it expands rapidly after that. It is therefore optimal to use 0.8 as quality threshold.

Figure 3. The quality value provided by TA is used to categorize genotypes in the ambiguous category. The graph shows how the size of the ambiguous category (red line) increases and how the number of miscalls that are not listed as ambiguous (black line) decreases as the quality threshold value changes from 0.7 to 1. At quality threshold value 0.8, the decrease in miscalls remaining in the good category levels off and the ambiguous category does not rise significantly. At quality threshold >0.9, the number of miscalls starts to decrease rapidly but the ambiguous category expands just as rapidly. On this basis, the quality threshold value was set at 0.8.

When peaks do not fit into a bin defined by the binning library that TA has made from past experience, TA shifts the peak to the nearest defined bin. In some cases, it is correct, but sometimes peaks are miscalled by incorrect shifting. Because TA records the size of the shift for each allele call, we could study the effect of the degree of shifting on the ambiguous category and number of miscalls. Figure 4 shows how the ambiguous category increases and how the portion of remaining miscalled genotypes not included in the ambiguous category decreases as the shift threshold goes from 1.0 to 0.0 bp. When the shift value for an allele call equals or exceeds the shift threshold, the allele is assigned to the ambiguous category. For example, at a shift threshold of zero all genotypes would be included in the ambiguous category. By setting the shift threshold at 0.3, the miscalled genotypes that are still in the good category are down to 1.05% of total genotypes. The ambiguous category is then up to 15%. By using a higher threshold value, the percent of miscalls increases rapidly, but the number of genotypes in the ambiguous category does not decrease significantly. By lowering the threshold value, the number of miscalls does not decrease significantly, but the ambiguous category expands steadily. Therefore, we use 0.3 and higher as the peak shift criteria for the ambiguous category.

Figure 4. When TA shifts peaks to fit them into its binning library for a given marker, it sometimes makes mistakes. Because the size of the shift is recorded, we looked at how incorporating a peak shift threshold (listing genotypes that have a peak shift equal to or above the threshold value as ambiguous) affects the number of miscalls (black line) in the good category as well as the size of the ambiguous category (red line). At peak shift threshold 0.3, the number of miscalls incorporated into the ambiguous category levels off and the ambiguous category expands only moderately. Therefore, the peak shift threshold for the ambiguous category was set at 0.3.

As described previously, TA tends to call heterozygote genotypes as homozygous when the larger allele is poorly amplified with respect to the smaller allele. To capture those miscalled alleles in the ambiguous category, a function was incorporated in Decode-GT that detects homozygous allele calls, reads the height of the signal upstream (higher molecular weight) from the called allele, and lists the genotype as ambiguous if the maximum height of the signal is above a defined threshold value. To avoid including broad homozygous peaks into the ambiguous category, the reading of the signal starts 4 bp upstream from the called allele (and ends at the upper boundaries of the marker window). Figure 5 shows how the ambiguous category expands and how the number of miscalled genotypes in the good category decreases as the threshold for the maximum height of the upstream signal is changed from 0% to 50% of the height of the called allele (allele 1). Based on this graph, we decided to use 10% as the threshold criteria for the highest upstream signal. At that point, the ambiguous category is ∼17% and the miscalled genotypes are down to 0.8%. By going lower, the ambiguous category rises rapidly as does the number of captured miscalls.

Figure 5. To capture genotypes that are erroneously called homozygous, but have an undetected peak upstream in the electropherogram, a function was incorporated in Decode-GT that detects homozygous allele calls, reads the intensity of the signal upstream from the called allele, and lists the genotype as ambiguous if the maximum value of the signal is above a defined value. The graph shows how the ambiguous category expands, and how the proportion of miscalled genotypes that are not included in the ambiguous category decreases as the threshold for the maximum height of the upstream signal is changed from zero to 50% of the height of the called allele. Based on this graph we decided to use 10% as the criterion for the highest signal. At that point, the ambiguous category is ∼17% and the miscalled genotypes are down to 0.8%. By going lower, the ambiguous category rises rapidly as does the number of captured miscalls.
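The upstream-signal check for homozygous calls can be illustrated as below. This is a sketch only: the trace is assumed to be available as parallel sequences of fragment sizes and signal heights, and the function and parameter names are hypothetical. The 4 bp offset and the 10% fraction are the values given in the text.

```python
from typing import Sequence


def upstream_signal_is_suspicious(
    sizes_bp: Sequence[float],
    heights: Sequence[float],
    called_size_bp: float,
    called_height: float,
    window_end_bp: float,
    offset_bp: float = 4.0,
    fraction: float = 0.10,
) -> bool:
    """Flag a homozygous call when the highest signal between
    called_size_bp + offset_bp and the upper edge of the marker window
    exceeds `fraction` of the called peak height."""
    start = called_size_bp + offset_bp
    upstream = [h for s, h in zip(sizes_bp, heights) if start <= s <= window_end_bp]
    return bool(upstream) and max(upstream) > fraction * called_height
```

Because scanning starts 4 bp above the called allele, a weak second allele only 2 bp away is not examined, which is consistent with the kind of miscall this criterion cannot catch.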

As described, some miscalls are due to TA identifying a homozygous sample as heterozygous by assigning an allele to a small peak in the electropherogram noise. By including in the ambiguous category those allele calls in which the height of peak 2 (the fragment of higher molecular weight) is lower than 40, we added 10 genotypes to the ambiguous category, or ∼0.1% (these had peak height 1 larger than 100). Of those, two genotypes were miscalls. When the height threshold for peak 2 was increased to 50, 31 calls were added to the ambiguous category, but there was no decrease of miscalls in the good category. Therefore, we use 40 as the height threshold for peak 2.

To catch the miscalls caused when TA calls a stutter peak, we defined as ambiguous those allele calls in which the peak height of the smaller allele (in molecular weight) is smaller than the peak height of the larger fragment. As a rule, the larger fragment is amplified to a lesser degree, so a large proportion of the allele calls that fulfill this criterion are miscalls. In our study we caught seven miscalls by imposing this criterion, and 59 genotypes were added to the ambiguous category.

In summary, we used these six criteria for the ambiguous category: (1) peak height of allele 1 lower than 100; (2) quality value <0.8; (3) shift value equal or higher than 0.3 bp; (4) the highest peak upstream from homozygous allele higher than 10% of the height of the called allele; (5) peak height of allele 2 lower than 40; and (6) peak height of allele 1 smaller than the peak height of allele 2. By using these six criteria simultaneously to define the ambiguous category, the number of the ambiguous genotypes was 1357, or 17.9%. The number of miscalled genotypes that had not been captured into the ambiguous category was 46, or 0.6%. Of those 46, 7 were from a marker that had alleles with only 1 bp difference (mononucleotide alleles). That marker has now been eliminated from our marker set, as well as other markers that have mononucleotide repeats. Of the remaining 39, 28 (including both control samples) belong to the same marker. All of those samples were called homozygous but were heterozygous with the second peak being very small and differing by only 2 or 4 bp from the first allele. These peaks were not detected in the highest signal function because they were so close to the called allele. To avoid incorporating broad homozygous allele peaks in the ambiguous category, the reading of the signal starts 4 bp upstream from the detected peak. Of the 11 remaining miscalls, 10 had more than two peaks due to spectral overlap or leakage between lanes and had been discarded when edited with GT. The only remaining genotype was actually not miscalled by TA but had been called incorrectly when edited in GT.
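The bad-call threshold and the six ambiguous criteria can be combined into a single decision rule, sketched below. The thresholds are those stated in the text; the `AlleleCall` record, its field names, the use of the absolute shift value, and the representation of a homozygous call as `height2 = None` are assumptions made for illustration and do not describe Decode-GT's internal data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlleleCall:
    height1: float                       # peak height of allele 1 (smaller fragment)
    height2: Optional[float]             # peak height of allele 2; None if homozygous
    quality: float                       # caller quality value, 0.0 to 1.0
    shift_bp: float                      # shift applied to reach the nearest bin
    max_upstream_fraction: float = 0.0   # highest upstream signal / height1


def categorize(call: AlleleCall) -> str:
    """Return 'bad', 'ambiguous', or 'good' for one genotype."""
    if call.height1 < 50:                       # bad-call height threshold
        return "bad"
    homozygous = call.height2 is None
    ambiguous = (
        call.height1 < 100                                         # criterion 1
        or call.quality < 0.8                                      # criterion 2
        or abs(call.shift_bp) >= 0.3                               # criterion 3
        or (homozygous and call.max_upstream_fraction > 0.10)      # criterion 4
        or (not homozygous and call.height2 < 40)                  # criterion 5
        or (not homozygous and call.height1 < call.height2)        # criterion 6
    )
    return "ambiguous" if ambiguous else "good"
```

A genotype is listed as ambiguous as soon as any one criterion fires, which is why the combined ambiguous category is larger than the category produced by any single criterion.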

In the second part of the study we set the optimal criteria described above to categorize the data in Decode-GT and inspected 6912 genotypes from 72 new markers and 96 new samples (including 2 CEPH control samples). Of those 6912, 1011 (14.0%) were listed as ambiguous, 95 (1.4%) were automatically discarded, and the remaining 5806 genotypes were in the good category. All of the allele calls in the good category were inspected and revealed 78 miscalled alleles or 1.12%.

A different person then edited the same data following the step-by-step procedure described below: sequential inspection of (1) control samples, (2) ambiguous genotypes, (3) allele ladder plots of the genotypes in the good category, and (4) inheritance errors. When this inspection revealed miscalled alleles that had not been placed into the ambiguous category, all allele calls for that particular marker were subsequently inspected. After editing, only 27 of the 78 miscalls that were in the good category prior to the editing had not been captured, or 0.4% of the total number of genotypes.

Using Decode-GT

To assist the user in editing and evaluating the quality of data, Decode-GT has six view modes: main-view, CEPH-view, inheritance check, ladder plots, allele histograms, and report. Figure 6 shows the Decode-GT main window and explains some features.

In the main view, the called genotypes are listed and the electropherogram for each selected genotype is shown. That graph can be expanded to allow the user to check for alleles outside the defined marker size window. The user can select to have all genotypes, the ambiguous genotypes, or homozygous genotypes displayed in the list box. In a separate graph the user can select to view the electropherograms of all colors for the selected genotype to detect spectral overlap, or have some or all electropherograms for that marker plotted simultaneously in one graph for inspection of the allelic ladder. There is also a window that allows the user to type in comments that will be incorporated into the report. The user can edit the selected genotype, discard it, or discard a whole marker.

CEPH View

In CEPH-view, the electropherograms for the CEPH-control samples are shown simultaneously in separate graphs (Fig. 7). The user can select a marker for which the electropherograms are to be inspected. The known genotypes for the selected marker are also shown. The user can then shift the alleles for the entire gel or marker to normalize according to the CEPH reference genotypes.

Figure 7. Decode-GT shows the electropherograms for the CEPH controls and also lists the known reference genotype for the selected marker.
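Normalizing a marker against a CEPH control amounts to finding one offset that maps the observed control genotype onto its reference genotype and applying it to every call. The sketch below is an assumption about how such a shift could be computed; in Decode-GT the shift is applied by the user, and the function names here are hypothetical.

```python
from typing import Dict, Optional, Tuple

Genotype = Tuple[int, int]   # two allele sizes in bp


def ceph_offset(observed: Genotype, reference: Genotype) -> Optional[int]:
    """Return the uniform bp offset that maps the observed control genotype
    onto the reference, or None if the two alleles disagree by different
    amounts (which suggests a miscall rather than a calibration shift)."""
    o1, o2 = sorted(observed)
    r1, r2 = sorted(reference)
    d1, d2 = r1 - o1, r2 - o2
    return d1 if d1 == d2 else None


def shift_marker(calls: Dict[str, Genotype], offset: int) -> Dict[str, Genotype]:
    """Apply the same offset to every genotype of one marker."""
    return {sample: (a + offset, b + offset) for sample, (a, b) in calls.items()}
```

Returning `None` when the two alleles need different offsets keeps a genuine miscall in the control from being hidden by a shift.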

Inheritance Check View

This view shows the results from the inheritance check (Fig. 8). There are two list boxes—one that shows the families who had inheritance errors and the other that lists the members of the selected family. Each family member can be successively selected and each corresponding electropherogram can be immediately inspected to resolve discrepancies. As in the main view, the user can edit the selected genotype, view the allelic ladder, and check for spectral overlap.
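The text does not spell out how the inheritance check is computed. As an illustration only, the standard Mendelian consistency test for a child with both parents genotyped can be written as follows; the function name and the tuple representation of a genotype are assumptions.

```python
from typing import Tuple

Genotype = Tuple[int, int]   # two allele sizes in bp


def mendelian_consistent(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """True if one child allele can come from the father and the other
    from the mother."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)
```

A family would be listed with an inheritance error when this test fails for any child; extended pedigrees with untyped parents need a more general algorithm than this sketch.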

Ladder Plot View

The ladder plot view shows the superposition of the electropherograms from all samples genotyped with the selected marker (Fig. 9). When more than one gel file is loaded into the program, this view allows comparison of allelic ladders if the same marker is on both gels.

Figure 9. The allelic ladder view shows all the electropherograms for one marker superimposed in one graph.

Allele Histogram View

The allele histogram view shows the number of occurrences for each allele for a selected marker (Fig. 10). This can be useful to compare allele frequencies for markers between gel files or sets of individuals.
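Counting allele occurrences for one marker is a simple tally. In this sketch each genotype contributes two alleles, so a homozygous genotype counts its allele twice; that convention is an assumption, since the text does not say how homozygotes are counted.

```python
from collections import Counter
from typing import Iterable, Tuple


def allele_histogram(genotypes: Iterable[Tuple[int, int]]) -> Counter:
    """Count how often each allele size occurs across the genotypes of one marker."""
    counts: Counter = Counter()
    for a, b in genotypes:
        counts[a] += 1
        counts[b] += 1
    return counts
```

Dividing each count by the total number of alleles gives allele frequencies that can be compared between gel files or sets of individuals.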

Report View

The report view shows the name of the user, the date, and the name of the gel file (Fig. 11). It also presents statistical information about the data, such as the number and percentage of discarded genotypes, ambiguous genotypes, and edited genotypes along with heterozygosity rate and inheritance errors for each marker.

Figure 11. The program creates a report of the data, including statistical information such as the heterozygosity for each marker, the number of discarded genotypes, and the average quality.
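One of the per-marker statistics in the report, the heterozygosity rate, can be computed as the fraction of called genotypes with two different alleles. The sketch assumes this observed-heterozygosity definition; the text does not state which definition Decode-GT uses.

```python
from typing import Iterable, Tuple


def observed_heterozygosity(genotypes: Iterable[Tuple[int, int]]) -> float:
    """Fraction of called genotypes whose two alleles differ."""
    called = list(genotypes)
    if not called:
        raise ValueError("no called genotypes for this marker")
    return sum(1 for a, b in called if a != b) / len(called)
```

A rate far below the value expected for a marker can hint at heterozygotes being called as homozygotes, the kind of miscall discussed earlier.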

Using Decode-GT

After the data has been loaded into the program, the user performs these tasks successively:

1. Inspects the CEPH-control samples to see if they match each other and the known genotypes.
2. Inspects the genotypes listed as ambiguous.
3. Inspects the allelic ladder plots of the good genotype category to look for unexpected peaks.
4. Performs an inheritance check and inspects the mismatches (if any).
5. Inspects all allele calls for that marker if the inspection reveals any errors made by TA that were not included in the ambiguous category.
6. Saves the edited results table and the report file.

DISCUSSION

We have described how an allele-calling program combined with quality measures and empirically derived criteria results in very accurate genotyping while limiting the user's energies to inspection of ambiguous calls. By discarding allele calls that do not meet the given criteria for quality value and peak height, some allele calls that could be used if inspected by eye are discarded. However, our tests showed that <0.5% of automatically discarded genotypes had been used when edited with GT. Prior to editing, the fraction of miscalled alleles falling into the good category was <1%. Using our defined inspection protocol this fraction drops to <0.4% in our study.

The total error rate in genotyping is composed of calling errors and other processing errors, such as those arising in PCR, DNA isolation, and electrophoresis. In this paper we address only the issue of calling errors, and how we tolerate a slight increase in error rate to increase throughput. Using this approach, the total error rate in our genotyping data is <1% and within acceptable limits. We believe that an unacceptable genotyping error rate for multipoint linkage studies is >4%. A calling error rate of 0.5% while inspecting <15% of the genotypes is then quite acceptable. Therefore, the main advantages of this approach are the batch-run feature and the dramatically reduced manual editing time. Our approach is similar to work that has been done to enhance the editing of sequences by using quality values with Phred/Phrap/Consed (Ewing et al. 1998; Ewing and Green 1998).

The hands-on time in preparing a TA run for a gel file is 5–15 min and the editing of the results in Decode-GT is 10–20 min, depending on the quality of the data—in total 15–35 min per gel file, averaging ∼25 min. When using Genescan and GT for processing gel files, the hands-on time averaged 2–3 h. The reduction in hands-on time compared to the previous method, when all allele calls were confirmed by inspection, is 80%–90%.
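The quoted reduction can be checked with a short calculation. The function name is hypothetical, and the inputs are the average of about 25 min per gel file and the 2 to 3 h quoted for the previous method.

```python
def reduction_pct(old_minutes: float, new_minutes: float) -> float:
    """Percentage reduction in hands-on time when going from old to new."""
    return 100.0 * (1.0 - new_minutes / old_minutes)


# Average of ~25 min per gel file against 2 h and 3 h for the previous method.
against_two_hours = reduction_pct(120, 25)
against_three_hours = reduction_pct(180, 25)
```

With these inputs the reduction is about 79% against 2 h and about 86% against 3 h, in line with the roughly 80%–90% stated above.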

Another time-saving feature of TA is the automatic binning for all markers that are processed. When using GT, the binning information must be typed manually into a template document when a marker is processed for the first time. Automatic binning allows rapid (even daily) alterations in marker panels without having to manually reset or redefine the expected bins. We routinely custom design panels to rerun markers that have failed in the multiplex runs on a particular set of samples.

At deCODE Genetics, we currently process ∼400,000 microsatellite genotypes per week using Perkin-Elmer-ABI 877 PCR robots and 377 XL Sequencers with 96 lane upgrades and are currently doing three- to fourfold multiplexing with 80%–85% efficiency. For our initial genome-wide screens we use the ABI Linkage Marker Set (v. 2) and the ABI intercalating set, for a total of 870 markers, along with additional sets to fill in the gaps. These are all dinucleotide markers that have been PIG–tailed to eliminate the plus A artifact (Brownstein et al. 1996).

The dream of modern human genetics is that we will soon be able to solve the common complex genetic diseases. This may come from the use of the most informative markers (microsatellites) applied to the most informative families or populations with extensive genealogy spanning centuries (Gulcher and Stefánsson 1998). But because several genes may together or in part contribute to each disease, the power to detect linkage must be further increased through the use of higher density marker sets, larger numbers of patients linked together over generations within a population, and robust multipoint identity by reliable statistical methods (Kruglyak et al. 1996; Kong and Cox 1997). The use of allele calling software together with optimized parameters that fractionate the data according to quality, as described here, may advance human genetics toward its destiny.

Availability of Programs

TA is available from Cybergenetics, Inc. (Pittsburgh, PA; www.cybgen.com). Decode-GT is free of charge and available to academic groups upon request from deCODE Genetics. To obtain a copy of the program, contact Birgir Pálsson, e-mail birgir@decode.is. A demonstration version is available at www.decode.is/company/index.html.

Acknowledgments

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

Footnotes

⁴ Corresponding author. E-MAIL jgulcher@decode.is; FAX 354 570 1903.

REFERENCES

Applied Biosystems. DNA fragment analysis software, user's manual set. Foster City, CA: PE Applied Biosystems; 1994.
Brownstein MJ, Carpten JD, Smith JR. Modulation of non-templated nucleotide addition by Taq polymerase: Primer modifications that facilitate genotyping. BioTechniques. 1996;20:1004–1010. doi: 10.2144/96206st01.
Ewing B, Hillier L, Wendl M, Green P. Base calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res. 1998;8:175–185. doi: 10.1101/gr.8.3.175.
Ewing B, Green P. Base calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res. 1998;8:186–194.
Gulcher JR, Stefánsson K. Population genomics: Laying the groundwork for genetic disease modeling and targeting. Clin Chem Lab Med. 1998;36:523–527. doi: 10.1515/CCLM.1998.089.
Hall JM, LeDuc CA, Watson AR, Roter AH. An Approach to High-throughput Genotyping. Genome Res. 1996;6:781–790. doi: 10.1101/gr.6.9.781.
Kong A, Cox NJ. Allele-sharing models: LOD scores and accurate linkage tests. Am J Hum Genet. 1997;61:1179–1188. doi: 10.1086/301592.
Kruglyak L. The use of a genetic map of biallelic markers in linkage studies. Nat Genet. 1997;17:21–24. doi: 10.1038/ng0997-21.
Kruglyak L. Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat Genet. 1999;22:139–144. doi: 10.1038/9642.
Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES. Parametric and nonparametric linkage analysis: A unified multipoint approach. Am J Hum Genet. 1996;58:1347–1363.
Lander ES, Schork NJ. Genetic dissection of complex traits. Science. 1994;265:2037–2048. doi: 10.1126/science.8091226.
Perlin MW, Burks MB, Hoop RC, Hoffman EP. Toward fully automated genotyping: Allele assignment, pedigree construction, phase determination, and recombination detection in Duchenne muscular dystrophy. Am J Hum Genet. 1994;55:777–787.
Perlin MW, Lancia G, Ng S-K. Toward fully automated genotyping: genotyping microsatellite markers by deconvolution. Am J Hum Genet. 1995;57:1199–1210.

