Automating HIV Drug Resistance Genotyping with RECall, a Freely Accessible Sequence Analysis Tool

Conan K Woods; Chanson J Brumme; Tommy F Liu; Celia K S Chui; Anna L Chu; Brian Wynhoven; Tom A Hall; Christina Trevino; Robert W Shafer; P Richard Harrigan

J Clin Microbiol. 2012 Jun;50(6):1936–1942. doi: 10.1128/JCM.06689-11

PMCID: PMC3372133 PMID: 22403431

Abstract

Genotypic HIV drug resistance testing is routinely used to guide clinical decisions. While genotyping methods can be standardized, a slow, labor-intensive, and subjective manual sequence interpretation step is required. We therefore performed external validation of our custom software RECall, a fully automated sequence analysis pipeline. HIV-1 drug resistance genotyping was performed on 981 clinical samples at the Stanford Diagnostic Virology Laboratory. Sequencing trace files were first interpreted manually by a laboratory technician and subsequently reanalyzed by RECall, without intervention. The relative performances of the two methods were assessed by determination of the concordance of nucleotide base calls, identification of key resistance-associated substitutions, and HIV drug resistance susceptibility scoring by the Stanford Sierra algorithm. RECall is freely available at http://pssm.cfenet.ubc.ca. In total, 875 of 981 sequences were analyzed by both human and RECall interpretation. RECall analysis required minimal hands-on time and resulted in a 25-fold improvement in processing speed (∼150 technician-hours versus ∼6 computation-hours). Excellent concordance was obtained between human and automated RECall interpretation (99.7% agreement for >1,000,000 bases compared). Nearly all discordances (99.4%) were due to nucleotide mixtures being called by one method but not the other. Similarly, 98.6% of key antiretroviral resistance-associated mutations observed were identified by both methods, resulting in 98.5% concordance of resistance susceptibility interpretations. This automated sequence analysis tool provides both standardization of analysis and a significant improvement in data workflow. The time-consuming, error-prone, and dreadfully boring manual sequence analysis step is replaced with a fully automated system without compromising the accuracy of reported HIV drug resistance data.

INTRODUCTION

Human immunodeficiency virus (HIV) drug resistance genotyping has been used in clinical practice for over 10 years to help guide and tailor highly active antiretroviral therapy (HAART) regimens (2). By identifying resistance mutations in the areas of the viral genome targeted by antiretroviral drugs, genotypic drug resistance testing allows physicians to optimize antiretroviral therapy regimens for each patient, increasing the chance of successful virological suppression (2, 6) and subsequently reducing the overall cost of treatment by minimizing the use of ineffective drugs and avoiding treatment failure-related inpatient care (20).

The predominant methodology used for HIV genotypic drug resistance testing involves reverse transcriptase PCR (RT-PCR) amplification of extracted viral RNA from plasma followed by population-based (bulk) sequencing (4). Since multiple sequencing primers are required to provide bidirectional coverage over the entire length of the amplicon, individual DNA sequence reads are then assembled into a contiguous consensus sequence by use of analysis software. Commercially available HIV drug resistance genotyping kits, such as the Trugene (Siemens, Deerfield, IL) (10, 13) and ViroSeq (Abbott Laboratories, Des Plaines, IL) (5) tests, are distributed with custom analysis software; however, simple software solutions do not exist for genotyping methods developed in-house. The available “generic” sequence analysis programs require considerable hands-on time; highly trained technicians must first inspect each trace file and trim out regions of problematic or low-quality sequence before manually specifying the sequence reads to assemble. The sequence assembly is subsequently verified by slow and labor-intensive human visual examination. A final consensus sequence is then exported and processed with a drug resistance interpretation algorithm.

Drug resistance mutation reporting often varies between laboratories, even when identical samples are tested (9, 17). While many interlaboratory discrepancies can result from differences in sample preparation (e.g., primer choice or stochastic variation), variation may be introduced by technicians as they subjectively review the assembled sequences (11). Since drug-resistant HIV variants may be present at low frequencies in clinical isolates, accurate identification of nucleotide “mixtures” (positions where two or more nucleotides are observed) is required. Differences in individual technicians' propensities to identify low-level nucleotide mixtures could result in clinically relevant drug resistance mutations being missed (10, 11). In order to minimize erroneous HIV drug resistance reporting and optimize genotyping protocols, clinical and research laboratories often participate in quality assurance programs (QAP), where identical samples are sent to multiple laboratories for independent analysis (9, 17). Unfortunately, due to the complications of subjective sequence interpretation, sources of any aberrant results can be difficult to ascertain.

The implementation of an automated sequence analysis tool would enable objective and consistent interpretation of HIV genotype data and provide considerable practical advantages, most notably improvements in processing speed and significantly decreased labor and software costs. At the British Columbia Centre for Excellence in HIV/AIDS (BCCfE), we have developed a bioinformatics tool, RECall, to address these challenges. RECall is a pipeline for assembling, aligning, analyzing, and finishing sequence chromatogram files and has been tailored specifically for HIV genotyping. While these steps are performed by most sequence assembly programs, RECall has been designed specifically to reproducibly call nucleotide mixtures. RECall is available for free as a Web application (http://pssm.cfenet.ubc.ca/).

Here we present the results of an external validation of automated RECall analysis of sequence data generated by an independent laboratory.

MATERIALS AND METHODS

Laboratory methods.

HIV genotyping was performed at the Stanford University Hospital Diagnostic Virology Laboratory (Stanford, CA). Clinical genotypic resistance testing was performed on 981 sequentially collected plasma samples by use of a previously described approach (17). Briefly, plasma virus extraction and purification were performed on Qiagen BioRobot M48 or QIAsymphony SP automated nucleic acid extraction instruments (Qiagen Inc., Valencia, CA), followed by one-step RT-PCR and a nested 2nd-round PCR. Direct bidirectional sequencing encompassing HIV-1 protease (PR) and the first 296 codons of reverse transcriptase (RT) was performed on an ABI 3730 sequencer (Life Technologies, Carlsbad, CA). Chromatograms were created using Sequencing Analysis v 5.2 (ABI). Nucleotide mixtures (positions containing two or more nucleotides, with the minor peak height being ≥20% of the major peak height) were marked with model 3730 Data Collection software v 3.30 (ABI).

A laboratory technologist assembled the sequence trace files for each sample and generated a consensus sequence by using Lasergene SeqMan v 8.0 (DNAStar, Madison, WI). SeqMan uses user-defined parameters to identify positions in the assembly with potential “conflicts.” Conflicts are any nucleotide positions with a mixture (20% threshold), positions where overlapping sequences do not have the same base call, and any “N” calls that the sequencer could not distinguish. The analyzing technologist visually inspected each sequence, stopping at each conflict and making manual edits where necessary. The edited sequence was then inspected by a second technologist, who verified the conflicts and any manual edits that were made.

Sequence analysis using RECall.

Raw ABI chromatograms were reanalyzed using RECall. The software requires a consistent file naming convention to automatically group multiple sequence reads (primers) belonging to the same sample into a single consensus sequence (contig).
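The grouping step can be sketched as follows. The naming convention itself is not specified here, so this minimal Python sketch assumes a hypothetical `<sample>_<primer>.ab1` pattern; the function name and the example file names are illustrative only and are not part of RECall.

```python
from collections import defaultdict
from pathlib import Path


def group_traces_by_sample(filenames):
    """Group trace files into per-sample lists keyed by sample ID.

    Assumes a hypothetical '<sample>_<primer>.ab1' convention; RECall's
    actual naming rules are not described in this paper.
    """
    groups = defaultdict(list)
    for name in filenames:
        path = Path(name)
        if path.suffix.lower() != ".ab1":
            continue  # ignore anything that is not an ABI trace file
        sample_id, sep, primer = path.stem.rpartition("_")
        if not sep or not sample_id or not primer:
            continue  # name does not follow the assumed convention
        groups[sample_id].append(path.name)
    return {sample: sorted(files) for sample, files in groups.items()}


traces = ["S001_F1.ab1", "S001_R1.ab1", "S002_F1.ab1", "notes.txt"]
grouped = group_traces_by_sample(traces)
```

Each value in `grouped` then corresponds to the set of reads that would be assembled into one contig.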

The sequencing trace files (.ab1) are first processed with the software package phred (7, 8), which calls bases and assigns quality scores to each nucleotide. In a trace file, a “mixed” or “ambiguous” base is represented by overlapping peaks. When calling bases, phred determines the location and area of the primary peak (“called base”) and the largest secondary peak (“uncalled base”) in the trace file. Since the primary and secondary peaks are often offset, RECall attempts to align the peak positions to their corresponding locations in the .ab1 files. The quality scores that phred assigns are a measure of the accuracy of the base call. Regions of poor sequence quality at the beginning and end of fragments are identified and trimmed automatically by RECall. phred quality scores are also used to identify low-quality regions (regions with phred scores of <20) within a fragment, which are also flagged and excluded from the final contig assembly. Grouped fragments are assembled and aligned to a user-supplied reference sequence (e.g., HIV-1 HXB2 [GenBank accession no. K03455]) by use of a modified Smith-Waterman algorithm (18). For this study, all chromatograms (.ab1) were submitted to RECall in a single batch and were processed without any human intervention, using a standard desktop PC (Intel Core-i5 660 3.33-GHz CPU, 3 GB RAM, Windows XP).
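As an illustration of quality-based end trimming, the sketch below uses the standard phred definition (a score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 20 means roughly 1 error in 100 calls). The windowed trimming rule and both function names are assumptions made for this example; the exact rule used by RECall is not described here.

```python
def error_probability(q):
    """Convert a phred quality score to the estimated base-call error probability."""
    return 10 ** (-q / 10)


def trim_low_quality_ends(quals, cutoff=20, window=5):
    """Return (start, end) of the slice kept after trimming both ends.

    Illustrative rule (an assumption, not RECall's documented method): each
    end is trimmed up to the first window of `window` consecutive bases whose
    scores are all >= cutoff.
    """
    n = len(quals)
    good = [q >= cutoff for q in quals]
    start = next((i for i in range(n - window + 1) if all(good[i:i + window])), None)
    if start is None:
        return (0, 0)  # no acceptable window: the whole read is discarded
    end = next(j for j in range(n, window - 1, -1) if all(good[j - window:j]))
    return (start, end)


quals = [5, 8, 12, 30, 35, 40, 38, 36, 41, 39, 15, 9, 4]
start, end = trim_low_quality_ends(quals, cutoff=20, window=3)
kept = quals[start:end]  # the high-quality core of the read
```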

RECall nucleotide mixture calling and “marking” of potentially problematic bases.

The most important feature of RECall is the process by which the software calls ambiguous nucleotides (mixtures). Following the assembly and alignment step, RECall identifies mixtures based on the quality and area under the curve of the called and uncalled bases as determined by phred. The RECall configuration variables for mixture calling for clinical drug resistance testing at the BCCfE are listed in Table 1. Each position in the sequence alignment is examined sequentially. For each position, a list is first generated by counting the frequency of each nucleotide that appears as either a called base or an uncalled base meeting the mixture area criterion. This list is then reduced to include only nucleotides that are observed in at least half of the sequence reads. The list is then ordered by frequency, and the most common (majority) and second most common (secondary) bases are retained. If the secondary base is called with more than half the frequency of the majority base, then a mixture is called. If two bases tie as the most common bases but no majority is achieved, then a mixture of those two bases is called. Finally, if none of these conditions is achieved, then the most commonly called base is used. Since phred is limited to calling a maximum of two nucleotides per chromatogram peak (“called” and “uncalled” bases), RECall does not call mixtures of three nucleotides, instead defaulting to the predominant two-base mixture.
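The decision rule described above can be sketched in Python as follows. This is one reading of the text, not the RECall source code: the tuple layout, the function name, and the restriction to A/C/G/T inputs are assumptions. Under this reading, a tie between the two most common bases is already covered by the "more than half the frequency" comparison.

```python
from collections import Counter

# IUPAC codes for two-base mixtures.
IUPAC_PAIR = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("AC"): "M", frozenset("CG"): "S", frozenset("AT"): "W",
}


def call_position(reads, mixture_area=0.20):
    """Call one alignment column.

    `reads` holds one (called, uncalled, called_area, uncalled_area) tuple
    per sequence read covering the position; bases are A, C, G, or T.
    """
    counts = Counter()       # called bases plus qualifying uncalled bases
    called_only = Counter()  # called bases only, used as the fallback
    for called, uncalled, called_area, uncalled_area in reads:
        counts[called] += 1
        called_only[called] += 1
        if uncalled and uncalled != called and uncalled_area >= mixture_area * called_area:
            counts[uncalled] += 1

    # Keep nucleotides seen in at least half of the reads, most frequent first.
    kept = [(base, n) for base, n in counts.most_common() if n >= len(reads) / 2]
    if len(kept) >= 2:
        (major, major_n), (minor, minor_n) = kept[0], kept[1]
        # Mixture if the secondary base has more than half the majority's
        # frequency; an exact tie between the top two bases also lands here.
        if minor_n > major_n / 2:
            return IUPAC_PAIR[frozenset(major + minor)]
    return called_only.most_common(1)[0][0]


three_of_four = [("C", "T", 1000, 350), ("C", "T", 900, 300),
                 ("C", "T", 1100, 250), ("C", None, 1000, 0)]
two_of_four = [("C", "T", 1000, 350), ("C", "T", 900, 300),
               ("C", "T", 1100, 150), ("C", None, 1000, 0)]
```

With these inputs, `call_position(three_of_four)` yields a C/T mixture (Y), whereas `call_position(two_of_four)` yields C, because a secondary peak in exactly half of the reads does not exceed half the frequency of the majority base.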

Table 1.

Configuration variables for nucleotide mixture calling and base “marking” for clinical drug resistance genotyping

Parameter	Value	Interpretation
Quality censoring cutoff phred score	<10	Bases with phred quality scores below the cutoff are excluded from the assembly.
Mixture area (%)	≥20	The area of the uncalled peak must be at least 20% of the called peak area. If >50% of the reads pass this threshold, then a mixture is called.
Mark area (%)	≥17.5	The area of the uncalled peak must be at least 17.5% of the called peak area. If ≥50% of the reads pass this threshold, then a mark is made.
Mark average quality cutoff phred score	<20	If the average quality of the base across all reads is below the cutoff, then a mark is made.
Additional marks		Insertions, deletions, and single primer coverage are also marked.

During the mixture calling step, RECall also “marks” potentially problematic sequences according to the parameters listed in Table 1. Insertions, deletions, and low-quality and problematic positions are flagged for optional confirmation by a human user. In addition, positions meeting the mark area criterion are flagged for review in a manner similar to the mixture calling procedure (Table 1). In this study, mixtures and marks were not subjected to human interpretation.
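A minimal sketch of the per-position marking rules, using the Table 1 thresholds, is given below. The function name, the tuple layout, and the reason strings are hypothetical, and marks for insertions and deletions are omitted for brevity.

```python
def should_mark(reads, mark_area=0.175, mark_quality=20):
    """Return the reasons a position would be flagged for optional review.

    `reads` holds one (called_area, uncalled_area, phred_quality) tuple per
    read covering the position. Thresholds follow Table 1.
    """
    if not reads:
        raise ValueError("position has no read coverage")
    reasons = []
    if len(reads) == 1:
        reasons.append("single primer coverage")
    passing = sum(1 for called, uncalled, _ in reads if uncalled >= mark_area * called)
    if passing >= len(reads) / 2:
        reasons.append("secondary peak above mark area")
    if sum(q for _, _, q in reads) / len(reads) < mark_quality:
        reasons.append("low average quality")
    return reasons


# A secondary peak at 18% of the primary peak is below the 20% mixture
# threshold but above the 17.5% mark threshold, so it is flagged only.
borderline = should_mark([(1000, 180, 40), (1000, 100, 38)])
```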

RECall pass-fail criteria.

Sequences were passed or failed based on criteria established in the BCCfE laboratory, which form the default parameters in RECall. Multiple quality checks were performed on every sample to ensure that the sequence was acceptable. Problems leading to sequence rejection by RECall are listed in Table 2. If desired, these parameters can be modified by the user. Sequences that pass internal quality control checks are exported automatically as plain text or FASTA-formatted files. Because RECall by default requires double primer coverage over the entire sequence length, some samples that the Stanford laboratory deemed acceptable by human interpretation were rejected by RECall. In the following analyses, we included only those sequences that passed RECall's default quality control criteria.

Table 2.

Criteria used by RECall for rejecting a sequence

Failure category	Description
Stop codon	Any unambiguous stop codon (TGA, TAA, or TAG)
Bad inserts	An insertion relative to the reference sequence that is not a multiple of three bases, resulting in a frameshift
Bad deletion	A deletion relative to the reference sequence that is not a multiple of three bases, resulting in a frameshift
Too many mixtures	>3.5% of nucleotides sequenced are called as mixtures
N count	≥5 Ns (any base) in the sequence
Mark count	≥100 positions marked as being potentially problematic
Single coverage	>3 consecutive bases of single-read coverage with phred scores of <40
Low quality	Any section where the quality of all coverage is too low to make a call
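The rejection criteria in Table 2 can be expressed as a simple checklist. The sketch below assumes that the per-contig summary values have already been computed; the `ContigSummary` class and its field names are hypothetical and are not RECall's internal data model.

```python
from dataclasses import dataclass

STOP_CODONS = {"TGA", "TAA", "TAG"}
MIXTURE_CODES = set("RYKMSW")  # two-base IUPAC ambiguity codes


@dataclass
class ContigSummary:
    sequence: str                      # consensus, in frame with the reference
    insertion_lengths: list            # lengths of insertions vs the reference
    deletion_lengths: list             # lengths of deletions vs the reference
    marked_positions: int              # number of positions marked for review
    longest_weak_single_coverage: int  # longest single-read run with phred < 40
    has_uncallable_region: bool        # some section too poor to call at all


def rejection_reasons(c):
    """Return the Table 2 failure categories that apply to one contig."""
    reasons = []
    codons = [c.sequence[i:i + 3] for i in range(0, len(c.sequence) - 2, 3)]
    if any(codon in STOP_CODONS for codon in codons):
        reasons.append("stop codon")
    if any(n % 3 for n in c.insertion_lengths):
        reasons.append("bad insertion")
    if any(n % 3 for n in c.deletion_lengths):
        reasons.append("bad deletion")
    if sum(b in MIXTURE_CODES for b in c.sequence) > 0.035 * len(c.sequence):
        reasons.append("too many mixtures")
    if c.sequence.count("N") >= 5:
        reasons.append("N count")
    if c.marked_positions >= 100:
        reasons.append("mark count")
    if c.longest_weak_single_coverage > 3:
        reasons.append("single coverage")
    if c.has_uncallable_region:
        reasons.append("low quality")
    return reasons


good = ContigSummary("ATGGCC" * 10, [], [3], 2, 0, False)
bad = ContigSummary("ATGTAA" + "N" * 6, [1], [], 120, 5, False)
```

A contig passes when `rejection_reasons` returns an empty list, as it does for `good`; an in-frame three-base deletion alone is not a reason for rejection.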

RECall Web application.

The RECall Web application includes personal password-protected user accounts that allow sequencing jobs to be saved and reanalyzed in the future without the need to upload files again. Two types of accounts are available, “User” and “SuperUser.” For traceability, operators with “User” access cannot change the program parameters and may process data only with the parameters set by the local “SuperUser.” Processed sequences are retained on the RECall server for a user-chosen period, after which they are automatically deleted. No submitted data are reprocessed, collected, analyzed, used for any purpose, or shared with anyone.

Data analyses.

The finished sequences generated by RECall were returned to the Stanford laboratory for comparison of these results with the results of conventional HIV genotypic drug resistance testing methods (henceforth referred to as “human” testing). The performance of RECall was measured by both the speed and concordance of base calls. A partial nucleotide discordance was considered to be present when one methodology reported a nucleotide mixture and the other reported one of the mixture's components (e.g., human-reported Y and RECall-reported C). A complete nucleotide discordance was considered to be present if each method reported a different unambiguous nucleotide at the same position for a sample (e.g., human-reported T and RECall-reported C) or if an unambiguous nucleotide called by one method was not contained in a mixture called by the other (e.g., human-reported G and RECall-reported Y).
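The discordance definitions above translate directly into set comparisons on IUPAC ambiguity codes. In the following sketch, the function name is illustrative, and the "unclassified" label for two different mixtures is an assumption, since that case is not covered by the definitions given.

```python
# Nucleotides represented by each IUPAC code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "M": "AC", "K": "GT", "S": "CG",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def classify_call(human, recall):
    """Classify one position as concordant, partial, or complete discordance."""
    h, r = set(IUPAC[human]), set(IUPAC[recall])
    if h == r:
        return "concordant"
    if len(h) == 1 and len(r) == 1:
        return "complete"  # two different unambiguous nucleotides
    if len(h) == 1 or len(r) == 1:
        single, mixture = (h, r) if len(h) == 1 else (r, h)
        # Partial if the single base is a component of the other call's mixture.
        return "partial" if single <= mixture else "complete"
    return "unclassified"  # two different mixtures: not defined in the text


examples = [classify_call("Y", "C"), classify_call("T", "C"), classify_call("G", "Y")]
```

The three examples correspond to the cases given in the text: human Y versus RECall C is partial, whereas T versus C and G versus Y are complete discordances.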

In addition to an analysis of the entire protease-RT sequence length, a specific analysis was performed to compare only antiretroviral drug resistance mutation positions for mutations defined as key resistance mutations by the International AIDS Society (USA table) (12). A drug resistance mutation was considered present if it was observed either alone or as part of an amino acid mixture. The Stanford HIV drug resistance genotyping Web service Sierra (algorithm version 6.0.1 [http://hivdb.stanford.edu/pages/algs/sierra_sequence.html]; Stanford University, Stanford, CA) (14) was used to infer antiretroviral drug susceptibilities from both human- and RECall-analyzed PR-RT nucleotide sequences.
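To illustrate how a mutation can be present "as part of an amino acid mixture," the sketch below expands a codon containing IUPAC ambiguity codes into every amino acid that it can encode under the standard genetic code. The function names are illustrative; the example codon RTG (ATG/GTG) encodes both methionine and valine, as would be seen for a mixture at RT position 184 (M184V).

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "M": "AC", "K": "GT", "S": "CG",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def amino_acids(codon):
    """Return the set of amino acids encoded by a possibly ambiguous codon."""
    expansions = product(*(IUPAC[base] for base in codon))
    return {CODON_TABLE["".join(bases)] for bases in expansions}


def has_mutation(codon, mutant_aa):
    """A mutation counts as present if seen alone or within a mixture."""
    return mutant_aa in amino_acids(codon)


mixed_184 = amino_acids("RTG")
```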

RESULTS

During software development at the BCCfE, RECall showed >99.5% agreement between human-reviewed and automated base calls when it was tested on in-house sequences (data not shown). We therefore wished to perform an external validation of the applicability of RECall to independently generated sequence data.

HIV protease-RT sequences and raw .ab1 sequence trace files were shared for 981 samples genotyped by the Stanford laboratory (with manual technician review) and reanalyzed by RECall. Of these, 875 (89.2%) met the default RECall acceptability criteria after automated processing. The primary reason for failure was a lack of double primer coverage over the entire sequence length. Using a standard desktop PC (Intel Core-i5 660 3.33-GHz CPU, 3 GB RAM, Windows XP), RECall completed base calling, assembly, and alignment in less than 6 h, with no hands-on analysis. In contrast, manual analysis required an estimated 150 h of technician time.

Nucleic acid sequence concordance between human and RECall interpretations.

There was 99.7% overall agreement in base calling between human and RECall over 1,036,875 analyzed bases. The rates of complete sequence concordance were 99.6% for 259,875 protease (PR) nucleotide positions and 99.7% for 777,000 reverse transcriptase (RT) nucleotide positions (Fig. 1). Of the 944 discordant PR nucleotides, 940 (99.6%) were “partially discordant” (mixtures called by one method but not the other), and 4 (0.4%) were completely discordant. Of the 2,535 discordant RT nucleotides, 2,517 (99.3%) were partially discordant, and 18 (0.7%) were completely discordant. Most of the partially discordant bases (2,530 of 3,457 bases [73.2%]) comprised nucleotide pairs resulting from transitions (R = A/G, Y = C/T) rather than transversions (K = G/T, M = A/C, S = C/G, W = A/T). The completely discordant positions were relatively equally distributed among transitions, transversions, and a combination of both (n = 11, 6, and 5, respectively) (Fig. 1). Nucleotide mixtures were detected at approximately 1.1% of all bases, corresponding to 12.5 mixtures per 1,185-bp PR-RT fragment. Overall, the human operator called a marginally larger number of mixtures (10,996 human-called mixtures [1.06%] and 10,921 RECall-called mixtures [1.05%]; P = 0.8). Positions with three-nucleotide mixtures (i.e., B, D, H, or V) (Fig. 1) were automatically discordant because phred (and therefore RECall) is not programmed to recognize these. A representative sample of nucleotide positions with discordant calls by RECall and the human operator is shown in Fig. 2.
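The concordance figures above follow from the reported counts, as the short calculation below shows. The per-sample lengths of 297 PR and 888 RT nucleotides are inferred from the reported totals (259,875 and 777,000 positions over 875 samples) and sum to the 1,185-bp fragment; the variable and function names are illustrative.

```python
samples = 875
pr_len, rt_len = 297, 888  # nucleotides per sample (inferred; sum is 1,185)

pr_bases = samples * pr_len
rt_bases = samples * rt_len
total_bases = pr_bases + rt_bases

pr_discordant, rt_discordant = 944, 2535
pr_partial, rt_partial = 940, 2517
total_discordant = pr_discordant + rt_discordant


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


pr_concordance = pct(pr_bases - pr_discordant, pr_bases)
rt_concordance = pct(rt_bases - rt_discordant, rt_bases)
overall_concordance = pct(total_bases - total_discordant, total_bases)
partial_share = pct(pr_partial + rt_partial, total_discordant)
```

The share of discordances that are partial (`partial_share`) is the figure reported in the abstract as "nearly all discordances."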

Fig. 1. Concordant and discordant nucleotide base calls in protease and reverse transcriptase sequences analyzed manually and by RECall. Matrices depict the frequencies of nucleotides in protease (A) and reverse transcriptase (B) called by human operators (vertical axis) and by RECall (horizontal axis). Concordant base calls are highlighted in green. Partially discordant base calls (mixtures called by one method but not the other) are highlighted in yellow. Entirely discordant base calls are highlighted in red. Blank cells represent zero. International Union of Biochemistry and Molecular Biology ambiguity codes are as follows: R = A/G, Y = C/T, W = A/T, M = A/C, K = G/T, S = G/C, B = C/G/T, D = A/G/T, H = A/C/T, and V = A/C/G. Columns for B, D, H, and V are not shown for RECall, as the software does not call three-base mixtures. Overall, 99.7% concordance was observed for more than 1 million bases compared.

Fig. 2. Chromatograms illustrating discordant base calls between human and RECall sequence interpretations. The majority of differences between the two analysis methods were due to partial discordances where one method called a nucleotide mixture and the other method called only one nucleotide component of a mixture. Depicted here are representative chromatogram traces for discordant mixture base calls. Panels A to C show positions called mixtures by human visual inspection but not by RECall. Panels D to F show positions called mixtures by RECall but not by human interpretation. In each panel, the top line of text contains the consensus human base calls, while the bottom line of text shows the consensus RECall base calls. The discordant mixtures are circled in orange.

Amino acid sequence concordance between human and RECall interpretations.

The 944 discordant PR nucleotide positions resulted in 904 discordant PR codons. Of these, 380 (42.0%) resulted in nonsynonymous discordances between the human and RECall interpretations when the sequences were translated to amino acids: 378 (99.5%) were partial amino acid discordances (where at least one amino acid was shared between the two interpretations), while only 2 (0.5%) were complete amino acid differences. In RT, the 2,535 discordant nucleotide positions occurred in 2,469 unique codons. When the sequences were translated to amino acids, 729 (29.5%) discordant substitutions were observed between the human and RECall interpretations: 724 (99.3%) were partial differences, and 5 (0.7%) were completely discordant.

Overall, human and RECall sequence review identified 1,096 (266 in PR and 830 in RT) and 1,098 (269 in PR and 829 in RT) “key” antiretroviral drug resistance mutations (12), respectively, either as complete amino acid substitutions or as part of mixtures. For PR, the two methods were in agreement for 265 cases (264 [99.6%] were in complete agreement). The human method identified 1 PR resistance mutation that RECall did not, while RECall identified 4 that the human method did not. Similarly, for RT, the two methods both identified resistance mutations in 824 cases (809 [98.2%] were in complete agreement). The human method identified 6 RT resistance mutations that RECall did not, while RECall identified 5 that the human method did not. In general, it was not obvious which method was “correct.”
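The agreement rates for key resistance mutations can be derived from these counts. The sketch below assumes that agreement is defined as the number of mutations identified by both methods divided by the number identified by either method; that definition and the function name are assumptions, although the resulting values match the percentages reported elsewhere in the text.

```python
def agreement(both, human_only, recall_only):
    """Percentage of mutations found by either method that were found by both."""
    either = both + human_only + recall_only
    return round(100 * both / either, 1)


# Counts from the text: (both methods, human only, RECall only).
pr = (265, 1, 4)
rt = (824, 6, 5)

pr_agreement = agreement(*pr)
rt_agreement = agreement(*rt)
overall_agreement = agreement(*(p + r for p, r in zip(pr, rt)))

# Totals per method, for comparison with the counts given above.
human_total = pr[0] + pr[1] + rt[0] + rt[1]
recall_total = pr[0] + pr[2] + rt[0] + rt[2]
```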

Antiretroviral susceptibility scoring.

All 875 PR-RT sequences interpreted by both methods were submitted to Sierra, the Stanford HIV Drug Resistance Database genotyping tool (algorithm version 6.0.1), and were scored for susceptibility to all currently available protease inhibitors (PI), nucleoside/nucleotide reverse transcriptase inhibitors (NRTI), and nonnucleoside reverse transcriptase inhibitors (NNRTI). Briefly, Sierra identifies documented resistance mutations in each sequence and uses a rules-based algorithm to generate a resistance score for 19 PI, NRTI, and NNRTI (14). A higher score indicates a greater probability of resistance. We calculated the susceptibility score differences between human- and RECall-interpreted sequences. In total, 34 samples (3.9%) had discordant scores for one or more antiretrovirals (median of 5 drugs). Of these, 17 samples had a score difference of ≥10, with the maximum difference being 72. However, small differences in susceptibility scores may not translate into clinically relevant differences in resistance. In addition to providing raw susceptibility scores, Sierra categorizes each sequence interpretation as susceptible (S; susceptibility score of <15), intermediate (I; score of 15 to 59), or resistant (R; score of ≥60). For simplicity, “I” and “R” interpretations were grouped together into a single “resistant” category. Only 13 samples (1.5%) had a discordant drug resistance interpretation for ≥1 drugs (median of 2 drugs). Among 16,625 drug resistance scores, only 35 (0.2%) had discordant resistance interpretations between human- and RECall-interpreted sequences (Table 3). Of these discordances, 25 (71.4%) were cases where human calls resulted in a “susceptible” interpretation while RECall did not. However, there was no statistically significant difference in the frequency of “resistant” interpretations between human- and RECall-analyzed sequences (13.2% versus 13.3%; P = 0.82).
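The categorization and grouping described above can be sketched as follows. The score cutoffs are those stated in the text; the function names and the example scores are hypothetical and do not come from the study data.

```python
def sierra_category(score):
    """Map a Sierra susceptibility score to S, I, or R using the stated cutoffs."""
    if score < 15:
        return "S"
    if score < 60:
        return "I"
    return "R"


def is_resistant(score):
    """Group intermediate and resistant calls into one 'resistant' category."""
    return sierra_category(score) != "S"


def discordant_drugs(human_scores, recall_scores):
    """Drugs whose grouped interpretation differs between the two analyses."""
    return sorted(
        drug for drug in human_scores
        if is_resistant(human_scores[drug]) != is_resistant(recall_scores[drug])
    )


# Hypothetical scores for one sample (not taken from the study).
human = {"EFV": 10, "3TC": 60, "AZT": 20}
recall = {"EFV": 20, "3TC": 75, "AZT": 30}
discordant = discordant_drugs(human, recall)
```

In this example all three drugs differ by at least 10 points, but only EFV crosses the susceptible/resistant boundary, which illustrates why score differences do not necessarily change the interpretation.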

Table 3.

Sierra drug resistance interpretation concordance between human- and RECall-analyzed sequences

Drug class	Drug (abbrev)^a	Sierra resistance interpretation by human analysis/RECall analysis (no. of samples)^b
		S/S	R/R	S/R	R/S
NNRTI	Delavirdine (DLV)	710	164	1	0
	Efavirenz (EFV)	706	168	1	0
	Etravirine (ETR)	768	105	0	2
	Nevirapine (NVP)	706	168	1	0
NRTI	Lamivudine (3TC)	709	163	2	1
	Abacavir (ABC)	735	138	1	1
	Zidovudine (AZT)	734	139	0	2
	Stavudine (D4T)	726	148	1	0
	Didanosine (DDI)	737	136	1	1
	Emtricitabine (FTC)	709	163	2	1
	Tenofovir (TDF)	762	111	1	1
PI	Atazanavir/r (ATV/r)	783	89	2	1
	Darunavir/r (DRV/r)	839	35	1	0
	Fosamprenavir/r (FPV/r)	792	81	2	0
	Indinavir/r (IDV/r)	792	82	1	0
	Lopinavir/r (LPV/r)	807	66	2	0
	Nelfinavir (NFV)	777	95	3	0
	Saquinavir/r (SQV/r)	794	78	3	0
	Tipranavir/r (TPV/r)	816	59	0	0
	Total	14,402	2,188	25	10

^a “/r” indicates a combination with ritonavir.

^b S/S, susceptible by both methods; R/R, resistant by both methods; S/R, susceptible by human interpretation but resistant by RECall; R/S, resistant by human interpretation but susceptible by RECall. “Resistant” interpretations included both intermediate (I) and resistant (R) Sierra calls.

DISCUSSION

This study evaluated the performance of RECall, an automated sequence analysis tool developed by the BC Centre for Excellence in HIV/AIDS, in quickly and accurately interpreting HIV genotypic data for drug resistance testing. RECall is available free of charge as a Web application (http://pssm.cfenet.ubc.ca). We compared the results generated by RECall to human-verified sequences from the Stanford University Hospital Diagnostic Virology Laboratory, a well-recognized institution that has conducted routine HIV genotypic drug resistance testing for over 10 years.

Using a set of 875 HIV-1 protease and reverse transcriptase sequences, we analyzed the concordance of detection of ambiguous nucleotides, amino acid changes, and drug resistance mutations between sequences interpreted manually by lab technicians or automatically by RECall. RECall showed excellent agreement with subjective human interpretation of HIV sequence data, with 99.7% concordance over more than 1 million bases compared. Similar degrees of agreement (>99.5%) were noted in previous analyses of smaller data sets from other independent laboratories (3, 19). Of the limited number of differences in base calling, the vast majority were due to partial nucleotide discordance, where one method detected a mixture and the other detected one component of the mixture. As a result, the majority of amino acid differences detected by human versus RECall interpretation were also due to partial discordances. Human and RECall reviews agreed on 98.1% of PI and 98.7% of NRTI/NNRTI resistance mutations identified by either method. For comparison, when identical sequence trace files are inspected and edited by multiple human operators, the rate of identification of resistance mutations can be <90% (11), depending on the samples tested. Although a very small number of key resistance mutations were identified by a single method only, these were all a result of partial mismatches due to differential detection of nucleotide mixtures.

Despite the extremely high concordance between methods, there may be several reasons for the observed discrepancies. First, RECall relies on phred peak areas to call mixtures and is therefore unable to call mixtures of three nucleotides. The impact of this shortcoming, however, is negligible, as 3-base mixtures were called exceedingly rarely by human interpretation (0.007% of bases called) and could simply represent technical artifacts (11, 17). Second, technicians, especially less-experienced ones, can arguably be biased during sequence interpretation: for example, at a single position, visibly “cleaner” chromatograms with taller peaks may be assigned more weight, mixtures within lower-quality areas of sequence may not be considered “true” mixtures, and frequently observed “patterns” of nucleotide mixtures may influence a person's decision to call mixtures with a borderline secondary peak area. In contrast, RECall is not programmed to weigh sequence reads based on peak height; determination of mixtures is based solely on peak area rather than any perception of shape, and the interpretation of data quality is strictly dependent upon phred scores. While the inflexibility of a fully automated system for sequence analysis and interpretation may appear to be a drawback, the results of this study show otherwise. RECall is configured to mark unusual sequence positions, including mixtures, which a technician could visually check. In this study, RECall was run without human intervention and still rapidly produced unbiased, consistent results for a data set generated by different methods in an external laboratory.

RECall did call marginally fewer mixtures overall than the human operator, but this difference was not statistically significant. Subjectively, these discordant bases could be considered “hard to call” by human operators, and the mixture calling frequency would be related directly to each individual's personal biases (Fig. 2). In a related experiment, eight BCCfE lab technicians were presented with a panel of chromatograms for which the two sequence interpretation methods produced discordant results, with one method calling a mixture and the other not. In general, the majority of operators preferentially called a mixture; 75% of those surveyed chose a mixed-base call over half the time. However, mixture calling frequencies varied widely among technicians, ranging from 25 to 75% (data not shown), illustrating the extremely subjective nature of calling mixed bases. The results of our human versus RECall analysis fall well within the range of interoperator mixture calling variability (11); the small difference in mixed-base frequencies may be as likely due to overcalling by the technicians as to undercalling by RECall. If this discrepancy is of concern, it can easily be lessened by modifying the mixture calling threshold (% uncalled base area) to more closely mimic a favored laboratory technician's tendencies. Regardless of the mixture calling parameters chosen, RECall provides standardization of base calling frequencies, an extremely important feature of a clinical reporting tool, and one that is clearly not achievable with solely human interpretation (11). Furthermore, good laboratory practice standards call for traceability of data; if manual edits are made to data generated by automated instruments, both the change and its justification should be robustly documented (1). In HIV drug resistance testing, the large number of manual changes required during the assembly and editing of a consensus sequence precludes this. RECall, however, provides a system for minimizing and tracking these manual edits.

Most importantly, RECall significantly improves the processing efficiency of HIV drug resistance genotyping sequence data. Specifically, RECall removes the need to perform several time-consuming and potentially error-prone manual analysis tasks, including identifying and grouping chromatograms from a single sample, trimming regions of low-quality data, aligning primer sequences to a reference standard, manually reviewing mixed bases, and exporting a finished FASTA file. Once the RECall program is initiated (a process requiring only a few mouse clicks), automated analysis requires no subsequent human intervention. Furthermore, additional efficiency gains are achieved by fully integrating RECall into the data processing pipeline; ideally, RECall is set to run as soon as chromatogram data are released from the sequencing instrument.

While the results presented here are limited to HIV drug resistance genotyping of protease and RT, RECall can easily be extended to analyze other regions of HIV or any protein-coding region that can be sequenced by population-based methods. At the BCCfE, RECall is currently the primary software used for drug resistance genotyping of protease-RT, integrase, and gp41, as well as for genotypic tropism testing of the V3 loop. RECall was used (without human review) to process sequences from several randomized clinical trials of HIV tropism, and the results were found to be predictive of virological outcomes (15, 16).

The results of our interlaboratory comparisons show that RECall can provide an objective, standardized protocol for HIV sequence interpretation in clinical and research laboratories. The speed and cost-effectiveness of using an automated tool for sequence analysis are the primary advantages. Standardizing sequence interpretation enables changes in laboratory procedures to be evaluated independent of the sequence interpretation steps. Furthermore, RECall enables unbiased sequence interpretation, and its internal parameters provide additional quality control mechanisms, both of which ensure that only consistent, high-quality data are reported.

ACKNOWLEDGMENTS

We thank the staff of the Stanford University School of Medicine Diagnostic Virology Laboratory for their assistance with HIV drug resistance genotyping. We appreciate the help of all current and former staff and students of the BCCfE who assisted with software testing and provided input during the development of RECall.

C.J.B. is supported by a Vanier Graduate Scholarship from the Canadian Institutes of Health Research (CIHR). P.R.H. holds a GlaxoSmithKline/CIHR Chair in Clinical Virology.

The funding sources played no role in the study design or in the collection, analysis, and interpretation of data.

Footnotes

Published ahead of print 7 March 2012

REFERENCES

1. Anonymous. 2011. Ground-truth data cannot do it alone. Nat. Methods 8:885.
2. Baxter JD, et al. 2000. A randomized study of antiretroviral management based on plasma genotypic antiretroviral resistance testing in patients failing therapy. AIDS 14:F83–F93.
3. Brooks JI, et al. 2009. Evaluation of an automated sequence analysis tool to standardize HIV genotyping results, abstr O061. 18th Annu. Can. Conf. HIV/AIDS Res., Vancouver, Canada.
4. Cockerill FR, 3rd. 1999. Genetic methods for assessing antimicrobial resistance. Antimicrob. Agents Chemother. 43:199–212.
5. Cunningham S, et al. 2001. Performance of the Applied Biosystems ViroSeq human immunodeficiency virus type 1 (HIV-1) genotyping system for sequence-based analysis of HIV-1 in pediatric plasma samples. J. Clin. Microbiol. 39:1254–1257.
6. DeGruttola V, et al. 2000. The relation between baseline HIV drug resistance and response to antiretroviral therapy: re-analysis of retrospective and prospective studies using a standardized data analysis plan. Antivir. Ther. 5:41–48.
7. Ewing B, Green P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8:186–194.
8. Ewing B, Hillier L, Wendl MC, Green P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175–185.
9. Galli RA, Sattha B, Wynhoven B, O'Shaughnessy MV, Harrigan PR. 2003. Sources and magnitude of intralaboratory variability in a sequence-based genotypic assay for human immunodeficiency virus type 1 drug resistance. J. Clin. Microbiol. 41:2900–2907.
10. Grant RM, et al. 2003. Accuracy of the TRUGENE HIV-1 genotyping kit. J. Clin. Microbiol. 41:1586–1593.
11. Huang DD, Eshleman SH, Brambilla DJ, Palumbo PE, Bremer JW. 2003. Evaluation of the editing process in human immunodeficiency virus type 1 genotyping. J. Clin. Microbiol. 41:3265–3272.
12. Johnson VA, et al. 2010. Update of the drug resistance mutations in HIV-1: December 2010. Top. HIV Med. 18:156–163.
13. Kuritzkes DR, et al. 2003. Performance characteristics of the TRUGENE HIV-1 genotyping kit and the Opengene DNA sequencing system. J. Clin. Microbiol. 41:1594–1599.
14. Liu TF, Shafer RW. 2006. Web resources for HIV type 1 genotypic-resistance test interpretation. Clin. Infect. Dis. 42:1608–1618.
15. McGovern RA, et al. 2010. Population-based sequencing of the V3-loop is comparable to the enhanced sensitivity Trofile assay (ESTA) in predicting virologic response to maraviroc (MVC) of treatment-naïve patients in the MERIT Trial, abstr 92. 17th Conf. Retrovir. Opportun. Infect., San Francisco, CA.
16. McGovern RA, et al. 2010. Population-based V3 genotypic tropism assay: a retrospective analysis using screening samples from the A4001029 and MOTIVATE studies. AIDS 24:2517–2525.
17. Shafer RW, et al. 2001. High degree of interlaboratory reproducibility of human immunodeficiency virus type 1 protease and reverse transcriptase sequencing of plasma samples from heavily treated patients. J. Clin. Microbiol. 39:1522–1529.
18. Smith TF, Waterman MS. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195–197.
19. Tilston P, et al. 2011. Evaluation of RECall automated basecalling software in HIV drug resistance testing, abstr P4. Eur. Soc. Clin. Virol. Winter Meet., London, United Kingdom.
20. Weinstein MC, et al. 2001. Use of genotypic resistance testing to guide HIV therapy: clinical impact and cost-effectiveness. Ann. Intern. Med. 134:440–450.

