Abstract
The Bio-V Suite is a collection of Python scripts designed specifically for bioinformatic research regarding transport protein evolution. The Bio-V Suite contains nine programs for Unix-based environments, each of which can be run as a standalone tool or accessed programmatically. These programs and their functions are as follows: TMStats generates topological statistics for transport proteins. GSAT performs shuffle-based binary alignments and is fully scalable. It can cross-compare two FASTA files or individual sequences. Protocol1 performs remote PSI-BLAST searches, filters redundant/similar sequences and annotates them. Protocol2 finds homologues between FASTA lists and generates graphical reports. TSSearch uses a rapid heuristic search algorithm to find distant homologues in FASTA files. SSearch is the exhaustive version of TSSearch. GBlast will identify potential transport proteins in any genome/proteome file, or find similar transport protein homologues between two different genomes/proteomes before generating a graphical report. AncientRep will find putative transmembrane repeat units using a list of homologues. DefineFamily will generate a FASTA list to represent an entire TC family. These nine programs are tabulated with descriptions of their capabilities in Table 1.
Keywords: transport proteins, protein evolution, transport classification database, homology, TMS repeats
Introduction
In the Saier Lab at UCSD, we have built, and continue to expand, the Transporter Classification Database (www.TCDB.org; [1, 2]). TCDB is a relational database containing sequence, structural, functional, pathological, physiological and evolutionary information about transport systems from a variety of living organisms, based on the International Union of Biochemistry and Molecular Biology (IUBMB)-approved transporter classification (TC) system. We have refined our techniques to further understand and interrelate existing transporter families, and to aid in the discovery of new ones. The programs presented in this paper (collectively called the ‘BioV Suite’) represent those refined techniques that have yielded the best results, as compared to other programs, in performing their designated tasks.
TMStats
The TMStats program (transmembrane alpha-helical statistical prediction tool) is a web-based tool that allows a user to generate topological data with respect to any Class, Subclass, Superfamily, Family, Subfamily, Cluster, or System within the Transporter Classification Database (TCDB). In addition, a Python wrapper is included which allows a user to analyze an entire FASTA file. TMStats accepts multiple inputs, so the user can enter several Transporter Classification (TC) numbers and produce a composite report consisting of all proteins within the chosen categories. TMStats produces a report consisting of tabulated topological data for each TCID: a histogram, a bar graph of topological types, the average transmembrane alpha-helical segment (TMS) count, and the standard deviation (SD) value. Many systems within TCDB include multiple non-transporter auxiliary units that function with their respective transporters. These auxiliary proteins can skew the overall results; hence, TMStats includes an option to include or exclude such proteins.
The TMStats histogram represents the probability of finding a protein with a particular number of TMSs within the selected TC hierarchy. To generate a smooth distribution in the histogram, TMStats magnifies the sample’s population to 10,000. This is achieved using the following procedure.
Each protein is represented by its total number of TMSs. ‘X’ represents a set of 10,000 proteins (TMS counts). TMS predictions are performed using HMMTOP [3, 4]. Next to the histogram is a bar graph that illustrates the actual distribution of predicted TMSs as found in the selected proteins derived from TCDB. A TMS average is displayed alongside these graphs, accompanied by a standard deviation (SD) score to represent the TMS distribution in the selected TC systems. The final items in the report are tabulated data containing a list of TCIDs sorted by their corresponding TMS count. Clicking on the corresponding member brings the user to the entry in TCDB for further analysis.
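The average, SD and per-count distribution reported by TMStats can be illustrated with a minimal sketch. It assumes the TMS counts have already been predicted (for example with HMMTOP) and are available as a list of integers; the function name `tms_summary` is hypothetical, the population SD is an assumption, and the magnification of the sample to 10,000 is not reproduced here.

```python
from collections import Counter
from statistics import mean, pstdev


def tms_summary(tms_counts):
    """Summarise predicted TMS counts (one integer per protein)."""
    distribution = Counter(tms_counts)
    total = len(tms_counts)
    # Fraction of proteins observed at each TMS count (bar-graph data).
    frequencies = {n: c / total for n, c in sorted(distribution.items())}
    return {
        "mean": mean(tms_counts),
        "sd": pstdev(tms_counts),  # population SD; an assumption
        "frequencies": frequencies,
    }


# Hypothetical TMS counts for eight proteins of one family.
summary = tms_summary([12, 12, 14, 12, 10, 12, 14, 6])
```

In this illustrative input, half of the proteins have 12 TMSs, and the single 6-TMS protein is the kind of outlier that widens the SD.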
Figure 1 provides an example of a TMStats histogram (A) and bar graph (B). In this example, the Major Facilitator Superfamily (MFS) members were selected from TCDB, and the corresponding graphs were generated in real time.
Figure 1.
Distribution histogram (A) and topological statistics (B) generated with the TMStats program for the Major Facilitator Superfamily (MFS; TC# 2.A.1), as of January 1, 2012.
GSAT
GSAT (Global Sequence Alignment Tool) is a binary alignment tool designed to evaluate homology between distantly related protein sequences. GSAT is similar to GAP [13], which has been shown to be superior to six other programs that are used to make sequence comparisons for enzymes in demonstrating homology [14, 15]. It is also quantitative, providing a comparison score in SD values for protein sequences of given lengths. It can be used for any set of proteins or nucleic acids. GSAT exists in three versions: GSAT-Cmd, GSAT-Web, and Scala-GSAT. GSAT-Cmd is the command line version of GSAT. It may be executed via any Unix shell and can be incorporated into other programs. GSAT-Web is identical to GSAT-Cmd, but it can be run over the web. GSAT-Web is currently implemented on the TCDB Analyze page. Scala-GSAT is a fully scalable version of GSAT that is capable of performing an exhaustive comparison of a complete FASTA file against another. All three versions of GSAT are written in Python and use an implementation of the EMBOSS Needleman-Wunsch (NW) algorithm [5, 6] to produce a global alignment. GSAT is ideal for comparing sequences for overall sequence similarity and will return a standard score based on 500 randomized shuffles. Standard scores are calculated as the difference between the alignment score of the two native sequences and the mean score of the shuffled alignments, divided by the standard deviation of the shuffled scores.
Figure 2 presents an example of a global binary alignment obtained using GSAT. Protein “A” is a homolog of the Connexin family (2.A.24), and protein “B” is a homolog of the Innexin family (2.A.25). In between the two sequences are ‘|’ and ‘:’ characters. A vertical bar means the two sequences share the same amino acid at a position. A colon denotes a similarity (NW), which means the two aligned amino acids are different, but similar in nature. Two Z-scores are given: a standard score and a precise score. The standard score is rounded off to the nearest integer, while the precise score is not.
Figure 2.
Output of the GSAT program showing an alignment of two sequences and presenting the relevant statistics. Vertical bars denote identities, colons denote similarities and a dash represents a gap in the alignment.
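The shuffle-based standard score can be sketched as follows. This is a simplified illustration rather than GSAT itself: it assumes a plain match/mismatch scoring scheme with a linear gap penalty instead of the EMBOSS implementation and substitution matrix, it shuffles only the second sequence, and the function names `nw_score` and `shuffle_z_score` are hypothetical.

```python
import random
from statistics import mean, pstdev


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_z_score(a, b, shuffles=500, seed=None):
    """Z-score of the native alignment against shuffled versions of b."""
    rng = random.Random(seed)
    native = nw_score(a, b)
    shuffled_scores = []
    for _ in range(shuffles):
        residues = list(b)
        rng.shuffle(residues)  # preserves composition, destroys order
        shuffled_scores.append(nw_score(a, "".join(residues)))
    sd = pstdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (native - mean(shuffled_scores)) / sd


# Hypothetical sequence compared with itself, using fewer shuffles for speed.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
precise = shuffle_z_score(seq, seq, shuffles=50, seed=1)
standard = round(precise)
```

Here `precise` plays the role of the unrounded score and `standard` the score rounded to the nearest integer, mirroring the two values reported in Figure 2.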
Protocol1
Protocol1 is a tool that simplifies the process of identifying distant homologues by automating repetitive tasks. Protocol1 accepts a gi number, accession number, or a FASTA sequence. The selected protein is searched against NCBI’s NR protein database using PSI-BLAST [7] with a default cutoff e-value of 0.003. The unique aspect of Protocol1 is that it does not require a local BLAST database to perform this search. Protocol1 interfaces directly with NCBI PSI-BLAST via the Internet, so no local database ever needs to be updated. This feature is unique and has not been replicated even in the BioPerl/BioPython packages.
Once the user has collected a list of homologues, an option is given to perform additional iterations to collect more distant homologues. Protocol1 then removes redundant sequences and abnormally short and/or long sequences as desired by the user. It then uses the CD-Hit program [8] to delete sequences that share a certain percent identity with another retrieved sequence. This threshold value is chosen by the user, thereby generating a FASTA list of representative proteins, none of which are more similar to each other than the selected threshold value. Protocol1 annotates each sequence based on its genus and species name and creates a FASTA file. In addition, Protocol1 creates a tabulated data sheet containing each sequence’s abbreviation, header description, protein length, GenInfo Identifier (gi) number, phylum, and organismal domain.
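The first two filtering steps, removal of redundant sequences and of abnormally short or long ones, can be sketched as below. This is an illustration only: the function name `filter_homologues` and the explicit `min_len`/`max_len` bounds are assumptions, redundancy is taken to mean exact sequence identity, and the CD-Hit clustering step is not reproduced.

```python
def filter_homologues(records, min_len, max_len):
    """Drop exact duplicates and sequences outside [min_len, max_len].

    records: list of (header, sequence) tuples, e.g. parsed from FASTA.
    """
    seen = set()
    kept = []
    for header, seq in records:
        if seq in seen:
            continue  # redundant: identical to a sequence already kept
        if not (min_len <= len(seq) <= max_len):
            continue  # abnormally short or long for this search
        seen.add(seq)
        kept.append((header, seq))
    return kept


# Hypothetical hits; bounds chosen only for the example.
hits = [
    ("hit1", "M" * 400),
    ("hit2", "M" * 400),   # exact duplicate of hit1
    ("hit3", "M" * 40),    # fragment
    ("hit4", "MK" * 210),
]
representatives = filter_homologues(hits, min_len=300, max_len=600)
```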
DefineFamily
The DefineFamily (DF) tool will build FASTA lists to represent any TC family in TCDB. The generated FASTA files can be used for a wide variety of purposes, including but not limited to TMStats, Protocol2, AncientRep, TSSearch, SSearch, Scala-GSAT and any other tool that accepts FASTA formatted files. DefineFamily works by selecting representatives of each subfamily within a family and performing a BlastP search against NCBI’s NR database. Alternatively, DF can retrieve results from the NR database using Protocol1’s remote PSI-BLAST feature. DF uses a default e-value cutoff of e−3, which can be changed by the user. e−3 is used as the cutoff because it has been found empirically to retrieve a minimal number of false positives while maximizing the number of true homologues. However, for sequences retrieved with values between e−6 and e−3, other methods such as GSAT or Protocol2 should be used to establish homology.
Protocol2
The Protocol2 program automates and simplifies the process of comparing distantly related families for homology and generates graphical reports. Protocol2 requires two FASTA files generated by Protocol1 to represent the two families that the user intends to compare. Protocol2 offers two options to compare the selected families: SSearch and TSSearch. SSearch is an exhaustive comparison method that uses the Smith-Waterman (SW) algorithm [9] to produce alignments and a standard score (similar to GSAT’s) for pairs of proteins. To clarify, TSSearch and SSearch are separate programs from Protocol2 and may be run independently. Protocol2 will use either one of these programs to find local alignments and generate graphical reports.
TSSearch is a heuristic version of the SSearch method and takes a fraction of the time to run. Both search methods will cross-compare the selected FASTA files and generate tabulated data containing standard scores and local regions of homology. Protocol2 selects all high-scoring pairs (HSP) of 10 standard deviations and up, and extracts local regions of homology. These extracted sequences are globally aligned using the GSAT program. An HTML report is generated containing tabulated data for each HSP. Each row contains the HSP’s local alignment score (SSearch or TSSearch), GSAT score, region of homology, and global alignment of HSP with transmembrane segments highlighted and numbered.
Figure 4 provides an example of a detailed result that will be displayed when the “view” link is clicked (see Table 2). Each detailed result contains both sequences’ best local alignment in FASTA format. GSAT will perform a global alignment on the best local alignment, and its result will be displayed in the black box. Finally, the newly aligned sequences are displayed side by side (with gaps), and their predicted TMSs are highlighted and numbered. This allows the user to easily see which TMSs align (if any). Above each TMS is a percent bar. This bar indicates how hydrophobic the peak of the selected TMS is relative to the most hydrophobic peak in the entire protein. These peaks are calculated using the Kyte-Doolittle scale [10]. The opacity of the TMS bars decreases as their percent values decrease. This allows the user to quickly gauge their confidence in a predicted TMS.
Figure 4.
Protocol2 binary alignment derived from GSAT.
Table 2.
List of programs with a few highlighting characteristics (excluding TSSearch & SSearch).
| Program | Unique Advantage |
|---|---|
| TMStats | The only existing program that generates graphical topology reports from a FASTA file or from TCDB’s live database. |
| GSAT | The only open source shuffle based alignment tool with programmatic access. It is fully scalable and has an online version. |
| Protocol1 | The only tool available to perform remote PSI-BLAST searches with iterations. |
| Protocol2 | The only fully featured tool for establishing homology between lists of transporters with graphical reports. |
| GBlast | The only tool for identifying transporters within entire genomes/proteomes with graphical reports. |
| AncientRep | The only shuffle based repeat analysis tool that can perform vertical and horizontal searches using all processors, and has alignment optimization features. |
| DefineFamily | The only tool available for building FASTA files to represent any TC family. |
The BioV suite is available for download on our website (http://biov.tcdb.org). Installation instructions and example usages are included in the documents link.
TSSearch
The Targeted Smith-Waterman Search (TSSearch) is a heuristic version of the SSearch program (see below) designed to find local areas of homology between distantly related full-length sequences. TSSearch establishes homology by generating a “Standard Score” which is calculated in the same way as for GSAT. The main difference is that TSSearch uses the Smith-Waterman algorithm as opposed to GSAT’s Needleman-Wunsch algorithm. TSSearch is faster than SSearch because it does not perform a shuffle-based comparison for every combination of subjects and targets. Instead, TSSearch first performs one local alignment for every pair of subjects and targets without shuffling. If the default setting is chosen, the three top target sequences are assigned to each subject sequence based on their raw bit-scores divided by their alignment lengths. A shuffle-based alignment is then performed between the subject sequence and each of its three assigned target sequences. A standard score is recorded for each combination, based on 500 shuffles each (1,500 alignments total). These ‘Standard Scores’ are calculated the same way as with GSAT. The following Python script demonstrates how these scores are retrieved.
    def Z(i):
        """Return one z-score per top target assigned to subject sequence i."""
        from random import sample
        from numpy import mean, std
        from smithwater import Sw, Sw_prime  # defined elsewhere

        answers = []
        for j in Sw_prime(i):
            R = Sw(i, j)  # score of the unshuffled subject/target pair
            shuffled_Rs = []  # start a fresh background for every target
            for _ in range(500):
                shuffled_j = ''.join(sample(j, len(j)))  # shuffled copy of j
                shuffled_Rs.append(Sw(i, shuffled_j))
            z_score = (R - mean(shuffled_Rs)) / std(shuffled_Rs)
            answers.append(z_score)
        return answers
In this Python script, Sw(a,b) and Sw_prime(a) are defined elsewhere. The Sw() function accepts an ‘a-sequence’ and a ‘b-sequence’ and returns a ‘bit-score’ determined by the Smith-Waterman algorithm. The Sw_prime() function accepts one subject sequence and returns a list of the 3 most similar target sequences. Using this heuristic technique, TSSearch can perform 250,000 comparisons in less than 10 minutes. TSSearch generates tabulated data containing the subject ID, target ID, raw bit score, area of highest homology and z-score, sorted by z-scores in descending order.
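The target-selection step performed by Sw_prime() can be sketched as a ranking of unshuffled hits by bit score per aligned position. This is a minimal illustration, assuming the local alignments have already been computed; the function name `top_targets` and the tuple layout of the hits are hypothetical.

```python
def top_targets(hits, n=3):
    """Pick the n targets with the highest bit score per aligned position.

    hits: list of (target_id, raw_bit_score, alignment_length) tuples
    obtained from one unshuffled local alignment per target.
    """
    usable = [h for h in hits if h[2] > 0]  # skip empty alignments
    ranked = sorted(usable, key=lambda h: h[1] / h[2], reverse=True)
    return [target_id for target_id, _, _ in ranked[:n]]


# Hypothetical hits for one subject sequence.
hits = [("t1", 50, 100), ("t2", 40, 50), ("t3", 90, 300), ("t4", 30, 40)]
chosen = top_targets(hits)
```

Note that `t3` has the highest raw score but is not chosen, because its score is spread over a much longer alignment.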
SSearch
The SSearch (Smith-Waterman Search) program has the same objective as TSSearch. SSearch is exhaustive, while TSSearch is heuristic. SSearch will perform a shuffle-based local alignment for every subject and target sequence combination and will generate a text file containing: subject ID, target ID, raw bit score, area of highest homology and z-score, sorted by z-scores in descending order. As a consequence of this exhaustive approach, SSearch takes roughly 166× longer than TSSearch to obtain results for a job containing 250,000 comparisons.
GBlast
Genome-Blast (GBlast) is designed to analyze entire genomes/proteomes and identify their potential transport proteins. GBlast locates transporters by BLAST searching each sequence in an organismal proteome against a specific TCDB database from which auxiliary proteins (non-transporter components of TC systems) are removed. GBlast generates an HTML report that contains tabulated data reporting: the best TCID match, TCID description, query TMS count, hit TMS count, match length, TMS-overlap score, and e-value. Results are sorted by e-value, TMS-overlap, and TCID. When viewing each result, GBlast displays a high-scoring pair (HSP) alignment and a binary hydropathy plot (see Figure 4).
The GBlast TMS-overlap score reveals the likelihood that a query sequence, like the hit sequence, is that of a transporter. Any score above “1” is highly suggestive of a transport protein. This conclusion is based on the fact that almost all proteins with TMSs similar to those in TCDB are transporters, with few exceptions [11]. When GBlast reports a ‘Null’ value, this informs the user that either the subject and/or the hit protein had zero predicted TMS regions. “Null” scores are displayed at the very end of the report. In these situations, the binary hydropathy plot is ideal for locating shared hydrophobic peaks that were not predicted to be TMSs. TMS-overlap scores are calculated by counting the total number of overlapping amino acyl residues belonging to a TMS in both sequences in an HSP. The overlapping residues do not have to be identical or similar. This count is normalized by dividing it by ‘20’ (the typical length of a TMS).
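The TMS-overlap calculation described above can be sketched as follows. This is an illustrative reading, not GBlast's code: it assumes the two HSP sequences are given as gapped strings of equal length, that TMS ranges are 1-based inclusive positions counted over the ungapped HSP sequence, and that a ‘Null’ result is represented by `None`. The function names are hypothetical.

```python
def tms_columns(aligned_seq, tms_ranges):
    """Alignment columns whose residue lies inside a predicted TMS."""
    columns = set()
    position = 0  # 1-based residue index within the ungapped sequence
    for column, char in enumerate(aligned_seq):
        if char == "-":
            continue  # gaps carry no residue
        position += 1
        if any(start <= position <= end for start, end in tms_ranges):
            columns.add(column)
    return columns


def tms_overlap_score(aligned_query, aligned_hit, query_tms, hit_tms,
                      tms_length=20):
    """Overlapping TMS residues in an HSP, normalised by a typical TMS length."""
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have the same length")
    if not query_tms or not hit_tms:
        return None  # corresponds to a 'Null' score
    shared = tms_columns(aligned_query, query_tms) & tms_columns(aligned_hit, hit_tms)
    return len(shared) / tms_length


# Hypothetical HSP: the query carries a two-residue gap.
query = "AA--" + "A" * 28
hit = "A" * 32
score = tms_overlap_score(query, hit, [(1, 20)], [(1, 20)])
```

Residue identity is deliberately ignored; only the positions of the predicted TMSs matter, as stated in the text.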
For every result generated by GBlast, a binary hydropathy plot is generated that can be viewed in the report by clicking on a result. Binary hydropathy plots are generated using a subject hit from the BLAST search’s HSP. Both sequences are plotted on the same axis using two different colours. Both sequences are the same length when including their gaps (if any). The generated binary hydropathy plots take these gaps into consideration and space each graph accordingly. These graphs contain the hydropathy index on the Y-axis and the residue position on the X-axis. Hydropathy values are calculated using the Kyte-Doolittle scale with a window size of 19 residues. Values over 1.6 are likely to be transmembrane regions. Figure 5 shows an example binary hydropathy plot as seen in GBlast’s report.
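The hydropathy values behind such a plot can be sketched with a sliding-window average over the Kyte-Doolittle scale. This is a minimal illustration for a single ungapped sequence of the 20 standard amino acids; the function name `hydropathy_profile` is hypothetical, and the gap-aware spacing and plotting used for binary plots are not reproduced.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean Kyte-Doolittle value for every full window along the sequence."""
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
profile = hydropathy_profile("K" * 10 + "L" * 19 + "D" * 10)
candidate_windows = [i for i, v in enumerate(profile) if v > 1.6]
```

Windows whose average exceeds 1.6 are collected in `candidate_windows` as likely transmembrane regions, following the threshold given above.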
Binary hydropathy plots are especially helpful when there are no predicted TMS regions. They allow the user to quickly identify shared potential TMSs that HMMTOP has missed. GBlast is not limited to locating transport proteins, but can also be used to find homologues between different genome/proteome files. A user only needs to change the target database setting to point to a genome/proteome file, and GBlast will generate a nearly identical report file.
AncientRep
AncientRep (AR) is a novel program for finding TMS repeats within transport proteins. AR has been shown to be successful in many cases. AR is written in Python and has integrated multithreading capabilities, allowing users to take full advantage of multicore computer systems. AR has many options, which allow for a wide variety of search techniques, ranging from exhaustive to heuristic algorithms. AR requires a FASTA file of homologues, a repeat size to test for, and an output directory.
AR offers two techniques for finding repeat units: ‘horizontal’ and ‘vertical’. Horizontal searching is a method in which AR attempts to find repeat TM units within a single protein. For example, AR may determine that TMSs 1 & 2 of protein “A” are homologous to TMSs 3 & 4 of protein “A”. Vertical TMS searching will locate TMS repeat units across a list of homologues, and is not restricted to a single protein. For example, AR may determine that TMSs 1 & 2 of protein “A” are homologous to TMSs 3 & 4 of protein “B”, where proteins “A” and “B” are homologous throughout their lengths. All comparisons are performed using GSAT, and scores are given in standard deviations. Horizontal comparisons are more convincing when several good scores are obtained (5 SDs and up, 60 or more residues in length, and under 15% gaps). Vertical searching is useful when horizontal scores are insufficient to establish homology. Good vertical results will reveal repeat units that are poorly conserved within a single protein, but are more easily recognized in one or more of its homologues. The vertical approach is valid only if homology is established between the pair in question by virtue of the superfamily principle [12].
When comparing potential transmembrane repeats (TMRs) during a vertical or horizontal search, AR selects only hydrophobic TMS regions as predicted by HMMTOP. In addition to this, AR offers a parameter called ‘flank’, which allows the user to specify how many hydrophilic residues should be included as ‘padding’ on either side of each TMS. By default, AR will select 10 hydrophilic residues on either side of a putative TMS. If two adjacent TMS units exist within 20 residues of each other, AR will select the midpoint between the two as the padding limit. This ensures that there are no overlapping sequences. These expanded TMS units are then spliced back together and saved in a temporary FASTA database. Each header contains the protein ID, TMS compositions, and total TMSs of the parental protein. The advantage of splicing TMSs is that every subject and target being compared will be roughly the same length. This eliminates skewed and misleading alignments due to large gaps. For each alignment, AR has a feature called ‘optimize’ which will attempt an alignment using zero flanking residues and progressively increase the padding until AR can achieve the highest z-score without involving residues belonging to another TMS.
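The flanking rule can be sketched as below. This is one possible reading of the description, not AR's implementation: TMS ranges are assumed to be 1-based inclusive and sorted, "within 20 residues" is interpreted as a loop shorter than twice the flank, and when a loop has an odd length the extra residue is given to the downstream TMS. The function name `expand_tms` is hypothetical.

```python
def expand_tms(tms_ranges, seq_length, flank=10):
    """Pad each TMS with up to `flank` residues per side, without overlaps.

    tms_ranges: sorted list of (start, end), 1-based inclusive.
    """
    expanded = []
    for k, (start, end) in enumerate(tms_ranges):
        left = max(1, start - flank)
        right = min(seq_length, end + flank)
        if k > 0:
            prev_end = tms_ranges[k - 1][1]
            loop = start - prev_end - 1  # residues between the two TMSs
            if loop < 2 * flank:
                left = prev_end + loop // 2 + 1  # start after the midpoint
        if k + 1 < len(tms_ranges):
            next_start = tms_ranges[k + 1][0]
            loop = next_start - end - 1
            if loop < 2 * flank:
                right = end + loop // 2  # stop at the midpoint
        expanded.append((left, right))
    return expanded


# Hypothetical 125-residue protein with three predicted TMSs.
units = expand_tms([(15, 35), (42, 62), (100, 120)], seq_length=125)
```

The first two TMSs are separated by only six residues, so they share that loop at its midpoint, while the third TMS receives the full flank on its N-terminal side and is clipped at the C-terminus.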
Figure 6 provides an example of the optimization process for two TMRs in the Major Facilitator Superfamily (MFS; TC# 2.A.1). Before optimization, the z-score for this repeat unit was 8 SDs, using a default flanking length of 10 (as seen in red). After optimization, a z-score of 9 SDs was achieved with a flanking length of 13 (as seen in green).
AR includes five parameters to designate a heuristic search. These options are “min”, “max”, “consecutive”, “VSrestrict” and “VTrestrict”. Setting a “min” value will require each subject and target protein to have a minimal number of TMSs. This saves processing time when using a list of distant homologues. The “max” setting may also be used along with the “min” setting. AR will restrict subject and target proteins to those containing a number of TMSs equal to or less than the “max” setting. The user can set the “min” and “max” settings to the same number to restrict subject and target proteins to a specific number of TMSs. For example, if both values are set to 12, then AR will ignore any protein that differs in its TMS count. Doing this will eliminate most false positives due to TMS deletions and insertions or to incorrect predictions of TMSs by HMMTOP, although top results must always be checked manually.
AR uses a sliding window or “frame” of size “r” to perform comparisons, where “r” refers to the suggested repeat size the user has selected. The subject frame selects the first “r” TMS units closest to the N-terminus, and the target frame selects the first “r” TMSs that have no overlap with the subject frame. The subject and target frames then produce one comparison using GSAT. The subject frame remains fixed, and the target frame is shifted over by one TMS toward the C-terminus. A GSAT comparison is performed and recorded for each frame shift. Once the target frame reaches the end of the sequence, the subject frame is shifted over one TMS toward the C-terminus, and the target frame is reset. This process is repeated until the subject frame reaches the end of the protein.
In a horizontal search, the analysis for the selected protein will end, and the next protein in the FASTA file will be loaded. In a vertical search, the subject frame will be reset, and the target frame will be reset to its appropriate starting position. However, this time, the target frame will be shifted down one protein. This process will be repeated until the subject frame has been compared to every target frame of every protein. This is an exhaustive search and will try every logical combination of TMR comparisons. The user may restrict the search using the “consecutive” flag. Enabling this will cause the target frame to move over “r” TMSs for every shift instead of just 1. For example, using an “r” value of “2” without the consecutive flag, AR will compare TMS (1,2) with (3,4), (4,5), (5,6), etc. Using the same “r” setting with the consecutive flag, AR will compare TMSs (1,2) with (3,4), (5,6), (7,8), etc.
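The frame enumeration for a horizontal search can be sketched as follows. This is a minimal illustration of the sliding-frame logic described above, with a hypothetical function name `horizontal_frames`; it only lists which TMS groups would be compared and does not run GSAT.

```python
def horizontal_frames(n_tms, r, consecutive=False):
    """List (subject, target) TMS frames for one protein with n_tms TMSs.

    r is the repeat size; with consecutive=True the target frame moves
    r TMSs per shift instead of one.
    """
    step = r if consecutive else 1
    pairs = []
    for s in range(1, n_tms - r + 2):        # subject frame start
        subject = tuple(range(s, s + r))
        t = s + r                            # first non-overlapping target
        while t + r - 1 <= n_tms:
            pairs.append((subject, tuple(range(t, t + r))))
            t += step
    return pairs


# Hypothetical 8-TMS protein searched with a repeat size of 2.
all_pairs = horizontal_frames(8, 2)
first_subject = [t for s, t in all_pairs if s == (1, 2)]
first_subject_consecutive = [
    t for s, t in horizontal_frames(8, 2, consecutive=True) if s == (1, 2)
]
```

For subject frame (1,2), the exhaustive setting yields targets (3,4), (4,5), (5,6) and so on, while the consecutive setting yields (3,4), (5,6), (7,8), matching the example in the text.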
Another option that AR offers is VSrestrict. VSrestrict is the Vertical Subject Restriction. A valid setting for VSrestrict would be: “(1,2), (4,5)”. With this setting, the subject frame will start at TMS (1,2), and then jump to (4,5). No other frames will be selected throughout the entire search. The target frame will not be restricted and will behave normally. The VTrestrict option is similar and will restrict the target frame. A valid entry for VTrestrict would be: “(5,6)”. With this setting, AR will compare all subject frames prior to the (5,6) position to TMR (5,6). The VSrestrict, VTrestrict, and consecutive options may be used in any combination to accommodate virtually any search technique.
Results are organized by TMR subjects. For example, folder “3–4” will contain all targets that were vertically compared with TMSs (3,4). Horizontal comparisons are saved in a single text file. All results are stored as a tab-separated value file and contain subject protein ID, subject TMSs, target protein ID, target TMSs, percent gaps, and z-scores. Percent gaps provide valuable information about the alignment. Often, alignments with high z-scores contain large gaps at the ends of the protein. This usually suggests that the actual repeat size is different from what was searched. Good repeat results have low percent gaps and high z-scores. Any result with a percent gap value less than 15%, an alignment of at least 60 residues and a z-score of 10 SDs or better is highly suggestive of a repeat.
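The three criteria for a convincing repeat can be expressed as a simple filter. This is a sketch only: the function name and argument names are hypothetical, and the column layout of AR's output files is not assumed, so the values are passed in directly.

```python
def is_suggestive_repeat(percent_gaps, alignment_length, z_score,
                         max_gaps=15.0, min_length=60, min_z=10.0):
    """Apply the thresholds stated above to one AR result."""
    return (
        percent_gaps < max_gaps
        and alignment_length >= min_length
        and z_score >= min_z
    )


# Hypothetical results: (percent gaps, alignment length, z-score).
results = [(8.0, 72, 12.3), (22.0, 80, 14.0), (5.0, 45, 11.0), (10.0, 60, 10.0)]
flags = [is_suggestive_repeat(*row) for row in results]
```

The second result is rejected despite its high z-score because of its gap content, which is the situation the text warns about.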
Discussion
Each tool included in the BioV suite can be used for various purposes, each having its own set of strengths and weaknesses. These purposes and their advantages, disadvantages and solutions (when available) will be discussed in detail in this section.
The TMStats program exists in two different platforms, for two slightly different purposes. The web-based TMStats program is available on TCDB.org under the “analyze” page. The web-based version will allow a user to view topological statistics for any TC formatted entry. This tool will allow a user to analyze transporters that have already been characterized in TCDB. On the other hand, the desktop version of TMStats will allow a user to perform the same analysis on any FASTA file. Ideally this FASTA file will be generated using a Blast search, and hence, it will have a lower standard deviation of TMS counts. When analyzing a TC family, a user will notice the standard deviation of TMS counts is usually higher. This is because TC families will be more diverse in their populations, while a Blastp search will not. The difference in standard deviations marks two distinct usages for the TMStats programs. The web-based version is ideal for identifying extreme outliers within a family. These extreme outliers are usually of interest to those who help maintain TCDB, and to anyone with an interest in a particular TC family. The desktop edition should be used as a guide when using other programs in the BioV suite. Tools like AncientRep and Protocol2 offer parameters to restrict analysis within a range of TMSs. TMStats will help a user decide how to use these restriction parameters. TMS counts rely entirely on HMMTOP. HMMTOP is a fairly reliable TMS prediction program, but is not free of errors. This is the only unavoidable drawback to TMStats. However, skewed results will be less misleading if a large sample is used (300+ homologues). Problems arising from incorrect predictions are less prevalent when using the web-based version. This is because the entries found in TCDB are ideal representatives and constantly curated. HMMTOP predictions are usually accurate for entries in TCDB.
The GSAT program comes in three different versions: GSAT-Cmd, GSAT-Web and Scala-GSAT. Most researchers will only use GSAT-Web. The other two versions are for more advanced programmatic access and are described in detail in the BioV documentation. The web-based GSAT program is available on TCDB, under the ‘analyze’ header. The GSAT tool was designed to be a replacement for the GCG GAP program [13], and is virtually identical in its algorithms. The primary difference between the two is that GSAT is written in Python and is open source. GSAT is fast enough to be run on the web, and is available for unrestricted public use via our servers.
Protocol1 simplifies the process of collecting distant homologues. Performing NCBI PSI-BLAST searches and collecting gi numbers to upload to protein Entrez can be tedious work if done repeatedly. Protocol1 was developed to save time and offer programmatic access to NCBI’s PSI-BLAST tools. Programmatic access is readily available to users who have a local copy of BLAST+ installed, along with the NR database. However, the NR database is frequently updated and is very large. It is not always practical to keep a local copy. Remote BlastP tools already exist and can be found in the BioPython and the BioPerl packages. However, these packages do not offer PSI-BLAST methods. Protocol1 also offers an option to see results after each iteration to ensure the user is on the right track before proceeding to download and curate the selected proteins. Alternatively, the DefineFamily tool can be used to generate a list of homologues to represent an entire TC family. Results are collected internally using Protocol1 (or BlastP), and DefineFamily combines several PSI-BLAST results from members of different subfamilies.
SSearch and TSSearch were both designed for the same purpose but work slightly differently. Both of these programs will perform shuffle-based comparisons using local alignments to find regions of homology between two different FASTA files. The difference between these two programs is that SSearch will perform a shuffle based alignment for every combination of sequences in the subject and target file, while TSSearch will only do shuffle based comparisons for the pairs that are most likely to be homologous. As a consequence, TSSearch is very fast compared to SSearch, and can cross compare two FASTA files with 500 proteins in each, in under 15 minutes. The only drawback is that TSSearch will produce fewer results than SSearch. However, the top scores are nearly always preserved regardless of whether TSSearch or SSearch is being used. Shuffle based scores are reported in standard deviations (z-scores). However, these z-scores alone are not sufficient to claim homology, regardless of how high. A z-score is only valid if it was obtained using a global alignment (as with GSAT). This is because not every amino acyl residue in an alignment partakes in the calculation of SW bit scores. These two programs are designed to find HSPs between distantly related FASTA files. The reported z-scores are used as an internal sorting mechanism within other programs, such as Protocol2. SSearch and TSSearch may be run as standalone tools, but have been integrated into Protocol2. It is unlikely that a researcher will ever need to run either of these two programs outside the context of Protocol2.
Protocol2 performs most of the work for the researcher by combining TSSearch, SSearch, GSAT and HMMTOP into one tool which produces graphical reports. When searching for homology between distantly related proteins, the researcher will be looking for good comparison scores (10 SDs +) and aligning TMSs. Protocol2 presents these data in an HTML report, allowing a researcher to quickly identify similar sequences.
The GBlast program will allow researchers to identify transporters within a genome or proteome file. Results are generated in HTML format and are sorted by their TMS-overlap scores and e-values. The advantage of this is that it allows the user to immediately recognize the likelihood that a sequence is a transporter by just skimming the results table. TMS-overlap scores become ambiguous when they are less than 1. In these scenarios, binary hydropathy plots are useful. A user should look for shared hydrophobic peaks to determine if a TMS is shared.
AncientRep (AR) is a powerful program that finds putative TMRs within a list of homologues. Once high sequence similarity has been established between two families using GSAT or Protocol2, the next step is to determine whether the two families arose via the same pathway. Establishing a common TMR between two families allows the researcher to claim homology with more confidence. Our former method for finding TMRs consisted of a series of steps. First, a multiple alignment is generated using a list of homologues. Next, an average hydropathy plot is produced using the multiple alignment. Using the average hydropathy plot, the user approximates two ranges to select and cut, which contain the TMSs to be compared. These two generated lists are finally cross compared using a program called Inter/Intra Compare (IC). This method is explained in detail in the paper by Zhai and Saier (2002) [16]. One advantage IC has over AR is that it uses a multiple alignment to correct for sequences that vary in their TMS count. Varying TMS counts can arise from sequence divergence or, more commonly, from incorrect predictions. Comparing sequences with different numbers of TMSs can give rise to false positives using AR. However, this is easily corrected by enabling the TMS restriction setting, which ensures that only TMSs belonging to sequences with the same total TMS count are compared. This eliminates most (but not all) false positives. For example, if a sequence has lost two TMSs through evolution and then gained two more in different positions, it would have the same total number of TMSs. This would offset the native TMS positions and can result in two of the same TMSs being compared. These situations do not happen often but are easy to recognize. AR scores over 20 SDs are usually artifacts arising from this scenario. They can be identified by generating a binary or multiple alignment (e.g., using Clustal X) (Ref).
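The idea behind the TMS restriction setting can be sketched in a few lines of Python. This is an illustration only, not AncientRep's implementation: the function names and the input format (a dictionary mapping each sequence identifier to its predicted TMS ranges, as a topology predictor such as HMMTOP might supply) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def group_by_tms_count(predictions):
    """Group sequence ids by their number of predicted TMSs.

    predictions: dict of sequence id -> list of (start, end) TMS ranges
    (a hypothetical input format assumed for this sketch).
    """
    groups = defaultdict(list)
    for seq_id, tms_ranges in predictions.items():
        groups[len(tms_ranges)].append(seq_id)
    return dict(groups)


def restricted_pairs(predictions):
    """Return only those sequence pairs whose total TMS counts are equal."""
    pairs = []
    for _count, ids in sorted(group_by_tms_count(predictions).items()):
        pairs.extend(combinations(sorted(ids), 2))
    return pairs
```

With such a filter, a 6-TMS sequence is never compared with a 5-TMS sequence. As noted above, it cannot detect a sequence that has both lost and gained TMSs while keeping the same total.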
AR has several advantages over IC. AR does not require the user to select which TMSs are to be compared; it tests every combination automatically. However, AR does give the option of selecting a specific subset of subject and target TMSs. For example, a user can compare just TMSs 1,2,3 with 4,5,6. AR requires the user to select the repeat size to search. A search performed using an incorrect repeat size can reveal the true repeat size and can guide the user in identifying the true TMR. For example, if a user searches for a 2TMS TMR and finds TMSs 1,2 aligning well with 4,5, this is suggestive of a 3TMS TMR unit. The user would then perform an AR search using a 3TMS TMR setting. Assuming the 3TMS TMR hypothesis is correct, the program should reveal TMSs 1,2,3 aligning with TMSs 4,5,6, and will usually give a better score. A worse score does not necessarily mean the 3TMS TMR hypothesis is incorrect; it could instead suggest that TMS 3 is poorly conserved, while TMSs 1–2 are better conserved. In this scenario, an AR search using the incorrect TMR size can be more revealing than one using the actual TMR size.
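The enumeration of TMS combinations for a given repeat size can be illustrated with a short Python sketch. It assumes that candidate repeat units are windows of consecutive TMSs and that only non-overlapping windows are compared. Both are assumptions made for illustration, and the function names are hypothetical rather than part of AR.

```python
from itertools import combinations


def tms_windows(tms_count, repeat_size):
    """All windows of consecutive TMS numbers (1-based) of the given size."""
    return [
        tuple(range(start, start + repeat_size))
        for start in range(1, tms_count - repeat_size + 2)
    ]


def window_pairs(tms_count, repeat_size):
    """All pairs of non-overlapping windows, earlier window first."""
    pairs = []
    for first, second in combinations(tms_windows(tms_count, repeat_size), 2):
        if first[-1] < second[0]:  # keep only windows that do not overlap
            pairs.append((first, second))
    return pairs
```

For a 6-TMS protein, `window_pairs(6, 3)` yields the single comparison of TMSs 1,2,3 with 4,5,6, while `window_pairs(6, 2)` yields six comparisons, including TMSs 1,2 with 4,5, the pairing that would hint at a 3TMS repeat unit in the example above.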
Another advantage of AR is its performance. AR is a multi-threaded program and will use all of the processors available, whereas IC works with only a single processor and can take much longer to complete its task. Finally, AR is superior in its simplicity. IC requires several steps involving a handful of other programs; AR requires only one FASTA file.
The BioV Suite contains programs to aid in each step required to understand transport protein evolution. Each tool has been carefully crafted to benefit both researchers and programmers. These programs are in frequent use in the Saier Lab and have proven to be most suitable for many research tasks. They render TCDB far more valuable to the average user. They will also be of value to researchers working on other classes of integral and peripheral membrane proteins as well as cytoplasmic constituents.
Figure 3.
Protocol2 results summary page.
Table 1.
Programs developed for studies of transport protein evolution but useful for the analysis of other protein (and nucleic acid) classes.
| Program | Task Performed |
|---|---|
| TMStats | Creates graphical charts to represent topological stats within any TC hierarchy or FASTA file. |
| GSAT | Provides a simple shuffle-based global alignment tool for detecting distant homologies. |
| Protocol1 | Performs remote PSI-BLAST searches with iterations using NCBI’s NR database. |
| Protocol2 | Compares two different families of transport proteins to find similarities. It uses TSSearch and SSearch and generates easy-to-read graphical reports. |
| TSSearch | Uses a heuristic local search algorithm to rapidly compare two different FASTA files and find similarities between them. |
| SSearch | Uses a simple exhaustive shuffle-based local search algorithm to compare two different FASTA files and find their similarities. |
| GBlast | Provides a search tool designed to identify potential transporters within entire genomes/proteomes. It generates easy-to-read graphical reports. |
| AncientRep | Provides a horizontal or vertical search approach to find transmembrane repeat units within a list of homologues. |
| DefineFamily | Generates a FASTA list to represent any TC family. |
Acknowledgments
This work was supported by Public Health Service grant GM077402 from the U.S. NIH.
References
1. Saier MH Jr, Tran CV, Barabote RD. TCDB: the Transporter Classification Database for membrane transport protein analyses and information. Nucleic Acids Res. 2006;34:D181–186. doi: 10.1093/nar/gkj001.
2. Saier MH Jr, Yen MR, Noto K, Tamang DG, Elkan C. The Transporter Classification Database: recent advances. Nucleic Acids Res. 2009;37:D274–278. doi: 10.1093/nar/gkn862.
3. Tusnady GE, Simon I. Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol. 1998;283:489–506. doi: 10.1006/jmbi.1998.2107.
4. Tusnady GE, Simon I. The HMMTOP transmembrane topology prediction server. Bioinformatics. 2001;17:849–850. doi: 10.1093/bioinformatics/17.9.849.
5. Rice P, Longden I, Bleasby A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000;16:276–277. doi: 10.1016/s0168-9525(00)02024-2.
6. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–453. doi: 10.1016/0022-2836(70)90057-4.
7. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
8. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–1659. doi: 10.1093/bioinformatics/btl158.
9. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147:195–197. doi: 10.1016/0022-2836(81)90087-5.
10. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982;157:105–132. doi: 10.1016/0022-2836(82)90515-0.
11. Saier MH Jr. Tracing pathways of transport protein evolution. Mol Microbiol. 2003;48:1145–1156. doi: 10.1046/j.1365-2958.2003.03499.x.
12. Doolittle RF. Of urfs and orfs: a primer on how to analyze derived amino acid sequences. University Science Books; Mill Valley, CA: 1986.
13. Devereux J, Haeberli P, Smithies O. A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 1984;12:387–395. doi: 10.1093/nar/12.1part1.387.
14. Matias MG, Gomolplitinant KM, Tamang DG, Saier MH Jr. Animal Ca2+ release-activated Ca2+ (CRAC) channels appear to be homologous to and derived from the ubiquitous cation diffusion facilitators. BMC Res Notes. 2010;3:158. doi: 10.1186/1756-0500-3-158.
15. Wang B, Dukarevich M, Sun EI, Yen MR, Saier MH. Membrane porters of ATP-binding cassette transport systems are polyphyletic. J Membr Biol. 2009;231(1):1–10. doi: 10.1007/s00232-009-9200-6.
16. Zhai Y, Saier MH Jr. A simple sensitive program for detecting internal repeats in sets of multiply aligned homologous proteins. J Mol Microbiol Biotechnol. 2002;4:375–377.





