Nucleic Acids Research. 2003 Apr 15;31(8):2217–2226. doi: 10.1093/nar/gkg313

Transcript identification by analysis of short sequence tags—influence of tag length, restriction site and transcript database

Per Unneberg, Anders Wennborg, Magnus Larsson
PMCID: PMC153741  PMID: 12682372

Abstract

There exist a number of gene expression profiling techniques that utilize restriction enzymes for generation of short expressed sequence tags. We have studied how the choice of restriction enzyme influences various characteristics of tags generated in an experiment. We have also investigated various aspects of in silico transcript identification that these profiling methods rely on. First, analysis of 14 248 mRNA sequences derived from the RefSeq transcript database showed that 1–30% of the sequences lack a given restriction enzyme recognition site. Moreover, 1–5% of the transcripts have recognition sites located less than 10 bases from the poly(A) tail. The uniqueness of 10 bp tags lies in the range 90–95%, which increases only slightly with longer tags, due to the existence of closely related transcripts. Furthermore, 3–30% of upstream 10 bp tags are identical to 3′ tags, introducing a risk of misclassification if upstream tags are present in a sample. Second, we found that a sequence length of 16–17 bp, including the recognition site, is sufficient for unique transcript identification by BLAST based sequence alignment to the UniGene Human non-redundant database. Third, we constructed a tag-to-gene mapping for UniGene and compared it to an existing mapping database. The mappings agreed to 79–83%, where the selection of representative sequences in the UniGene clusters is the main cause of the disagreement. The results of this study may serve to improve the interpretation of sequence-based expression studies and the design of hybridization arrays, by identifying short tags that have a high reliability and separating them from tags that carry an inherent ambiguity in their capacity to discriminate between genes. To this end, supplementary information in the form of a web companion to this paper is located at http://biobase.biotech.kth.se/tagseq.

INTRODUCTION

Hybridization array technology has emerged as the focus of interest in the field of gene expression analysis (1). However, cDNA-sequencing based techniques, e.g. expressed sequence tags (EST) sequencing (2), still provide an important source of gene expression information. A major advantage is that such methods can detect previously unknown transcripts, since they are based on randomly picking and sequencing the actual transcripts rather than monitoring a pre-selected set of probes. Furthermore, there is no limit to the sample size (number of sequenced tags) to be analyzed, so that sustained sequencing efforts can provide quantitative information even on very low-abundance transcripts that may be below the detection limit for hybridization array methods (3). The major drawback of sequencing methods, their low throughput, has been addressed by devising methods based on isolation of short sequence tags from each sampled mRNA, thus increasing the number of sampled mRNAs by one to two orders of magnitude. These methods include serial analysis of gene expression (SAGE) (4), tandem arrayed ligation of expressed sequence tags (TALEST) (5), and pyrosequencing (6). In general, the identification of the original mRNA is then based on a tag of 10–20 bases from a defined position in each mRNA species.

All methods based on polymerase-mediated reproduction of mRNA sequences will have a background of errors related to polymerase fidelity, contributed by the reverse transcriptase, the polymerase used in PCR and the sequencing enzyme. The SAGE method is based on PCR amplification of ditags and therefore has two additional potential sources of bias: exclusion of identical high-abundance ditags (7) and nucleotide content bias in ditag amplification (5). All these experimental method considerations were beyond the scope of the present investigation.

The aims of this study are two-fold. First, the sequencing based methods rely on initial cleavage by a given restriction enzyme. Therefore, the corresponding recognition site must be present in a transcript for tags to be generated. In addition, it is desirable that the generated tags are unique and that the tag length fulfills the requirements imposed by the experimental method. To address these issues, we studied the recognition site frequencies, tag length and tag uniqueness for different enzymes in the mRNA sequence pool. These analyses were performed with the human transcript database RefSeq (8).

Second, given a set of experimentally generated tags, one means of establishing transcript identity is sequence alignment to transcript databases. In large-scale automated applications of gene expression analysis, the performance of such unsupervised in silico transcript identification is of critical importance. The length of the tag and the choice of transcript database will affect the efficiency of identification. To assess the influence of tag sequence length on the reliability of transcript identification, the specificity of transcript identification by the sequence similarity search program BLAST was evaluated with different sequence tag lengths using the UniGene (9) Human non-redundant database. An alternative to transcript identification by sequence alignments is using predefined tag-to-gene mapping databases based on a set of criteria for circumventing problems inherent in representing a gene sequence by a cluster of partly overlapping subsequences. One such database is SAGEmap (10), which is constructed for UniGene. We tested the degree of differential classification of publicly available EST data, analyzed as full-length sequences or as ‘virtual SAGE’ tags, by constructing a tag-to-gene mapping for UniGene Human non-redundant and comparing it to SAGEmap.

MATERIALS AND METHODS

Data sources

The RefSeq library was obtained from the RefSeq repository at the National Center for Biotechnology Information (NCBI) (ftp://ncbi.nlm.nih.gov/refseq, date of download was October 31, 2001). BLAST program executions were performed on UniGene Human (non-redundant) version 142, which can be retrieved from ftp://ncbi.nlm.nih.gov/repository/unigene. The libraries used to construct the virtual tags were obtained from the Cancer Genome Anatomy Project site (http://cgap.nci.nih.gov/Tissues/LibraryFinder). We selected the largest non-normalized libraries. Moreover, we required that cloning had been done uni-directionally. Based on these criteria, the following libraries were selected: NCI_CGAP_Co14 (8282 sequences), NCI_CGAP_Gas4 (21 214 sequences), NCI_CGAP_Kid6 (6270 sequences), NCI_CGAP_Lym12 (10 857 sequences), NCI_CGAP_Pan1 (24 880 sequences) and NCI_CGAP_Ut1 (24 558 sequences). The SAGEmap file SAGEmap_tag_ug-rel-Nla3-Hs was downloaded from ftp://ncbi.nlm.nih.gov/pub/sage/map/Hs on October 31, 2001. The data used in this study can be downloaded at ftp://biobase.biotech.kth.se/pub/tagseq/data.

Extraction of reliable sequences

Although the CGAP libraries chosen consisted of 3′ ESTs that were uni-directionally cloned, after manual inspection it was obvious that not all sequences corresponded to 3′-end sequences. Moreover, many sequences were of poor quality, making it necessary to select reliable sequences based on poly(A) length and polyadenylation signals.

As a first criterion, we selected ESTs with a poly(A) extremity of length 8 or more, allowing 10 trailing bases that could be related to vector sequence. In addition to this, the existence of a polyadenylation signal (ATTAAA or AATAAA) located at least 10 and at most 30 bases from the first A in the poly(A) tail was required (with distances being related to the last base of the hexamer). That the choice of these values is reasonable has recently been shown by Beaudoing et al. (11). This data set represented a typical experimental sample, to be aligned to UniGene. In order to avoid alignments between identical sequences, we removed sequences representing UniGene clusters.
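The selection criteria above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `is_reliable_est` is hypothetical, and the distance is assumed to be the number of positions between the last base of the hexamer and the first A of the poly(A) run.

```python
import re

POLYA_MIN = 8        # minimum poly(A) length (from the text)
MAX_TRAILING = 10    # trailing bases tolerated after the poly(A) run (from the text)
SIGNALS = ("AATAAA", "ATTAAA")
MIN_DIST, MAX_DIST = 10, 30

# A run of >= POLYA_MIN A's followed by at most MAX_TRAILING bases up to the 3' end.
POLYA_RE = re.compile(r"A{%d,}(?=[A-Z]{0,%d}$)" % (POLYA_MIN, MAX_TRAILING))


def is_reliable_est(seq):
    """Return True if seq has a terminal poly(A) run and a well-placed signal."""
    seq = seq.upper()
    match = POLYA_RE.search(seq)
    if match is None:
        return False
    polya_start = match.start()  # index of the first A in the poly(A) run
    for signal in SIGNALS:
        pos = seq.find(signal)
        while pos != -1:
            last_base = pos + len(signal) - 1
            # Assumed distance convention: first A minus last base of the hexamer.
            if MIN_DIST <= polya_start - last_base <= MAX_DIST:
                return True
            pos = seq.find(signal, pos + 1)
    return False
```

A sequence passes only if both the poly(A) criterion and the signal criterion hold; a different distance convention would shift the accepted window by one or two bases.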

The refinement procedure considerably reduced the size of the libraries: Co14 (1135 sequences), Gas4 (3728 sequences), Kid6 (705 sequences), Lym12 (2083 sequences), Pan1 (5443 sequences) and Ut1 (5354 sequences). In other words, between 11% and 22% of the sequences in each library fulfilled the selection criteria.

In the analysis of RefSeq, the entries beginning with NM (denoting mRNA sequence) and described as Homo sapiens sequences were selected. This produced a total of 14 248 RefSeq sequences. In addition, these sequences went through another round of refinement, based on the existence of polyadenylation signals and poly(A) length described above, giving a smaller set consisting of 3493 sequences.

Construction of BLAST queries

After the CGAP libraries had been refined, they were used to create input sequences to the BLAST algorithm. The following steps describe the general procedure. (i) Each sequence was cut with a restriction enzyme and, provided a recognition site existed, the 3′-most sequence was selected for further processing. (ii) The selected sequences were used to construct queries of lengths 10, 11, 12, …, 50 bp, including the recognition site (i.e. a 10 bp query consisted of the recognition site plus a tag of six bases). (iii) Sequences having their 3′-most recognition site closer than 50 bp to the 3′ end were removed from the subsequent analysis, since we wanted to monitor the results of the BLAST execution in relation to the increase in length, from 10 to 50 bp, for each query. (iv) The remaining sequences were used to query the UniGene database with the BLAST algorithm.
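Steps (i) to (iii) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_queries` is hypothetical, the default site is that of NlaIII (CATG), and the distance to the 3′ end is assumed to be measured from the first base of the recognition site.

```python
def build_queries(seq, site="CATG", min_len=10, max_len=50):
    """Return queries of length min_len..max_len that start at the 3'-most site.

    An empty list is returned when the site is absent (step i) or lies
    closer than max_len bases to the 3' end (step iii).
    """
    seq = seq.upper()
    pos = seq.rfind(site)  # 3'-most occurrence of the recognition site
    if pos == -1 or len(seq) - pos < max_len:
        return []
    # Each query includes the recognition site itself (step ii).
    return [seq[pos:pos + n] for n in range(min_len, max_len + 1)]
```

With the defaults, every retained sequence yields 41 queries, the shortest consisting of the 4 bp site plus a six-base tag.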

We also queried the UniGene database using all RefSeq sequences defining a UniGene cluster, in total 11 691 sequences. These sequences also went through the steps described above.

Version 2.0.14 of BLAST was used, and program executions were performed with the parameter settings E = 1000, W = 7, v = 10, b = 10, and all other parameters default.

Analysis of tag uniqueness

We selected tags from RefSeq that were at least 30 bp long. Then, for tag lengths between 10 and 30 bp, we calculated the fraction of unique tags as the number of tags occurring only once, divided by the total set of tags (i.e. redundant tags were counted more than once).
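A minimal sketch of this calculation, assuming the tags have already been extracted and are all at least as long as the largest length examined (the function name `unique_fraction_by_length` is hypothetical):

```python
from collections import Counter


def unique_fraction_by_length(long_tags, lengths=range(10, 31)):
    """Map each tag length to the fraction of tags that occur exactly once.

    The denominator is the total number of tags, so redundant tags are
    counted every time they occur, as described in the text.
    """
    if not long_tags:
        raise ValueError("at least one tag is required")
    fractions = {}
    for n in lengths:
        counts = Counter(tag[:n] for tag in long_tags)  # truncate to n bases
        singletons = sum(1 for c in counts.values() if c == 1)
        fractions[n] = singletons / len(long_tags)
    return fractions
```

Because every tag is truncated from the same 30 bp sequence, the fraction can only stay constant or increase as the length grows.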

Tag overlap

In set theory, an element may occur only once in a set. In other words, since each tag sequence is an element, redundant sequences will be discarded when constructing the tag sets. Denote by S3 the set of 3′ tags and by S5 the set of 5′ tags. Definition 1 of the overlap (ρ) is then:

$$\rho_1 = \frac{|S_3 \cap S_5|}{|S_3|} \qquad (1)$$

where |S| is the cardinality (number of elements) of a set. Thus, ρ1 states how large a fraction of the 3′ tags can be found in the set of 5′ tags.

However, given that a 5′ tag is selected and sequenced, it might be of interest to know the probability that this tag defines a real 3′ tag. This probability increases if a 5′ tag that equals a 3′ tag is frequently present among the 5′ tags. Now, not only the presence of a sequence is of interest, but also the number of times the sequence occurs. Multiset theory allows multiple instances of an element. We want to find the cardinality of the multiset of 5′ tags that are present in the multiset of 3′ tags, and divide this number by the size of the total 5′-tag multiset. This can be accomplished as follows:

[Equation 2: image gkg313equ2.jpg]

Here, × denotes the Cartesian product of two multisets. Hence, ρ2 is the probability that a sequenced 5′ tag will correspond to a 3′ tag.
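The two overlap measures can be illustrated with a short Python sketch. It follows the verbal definitions only: ρ1 is computed on sets, and ρ2 counts every occurrence of a 5′ tag whose sequence also appears among the 3′ tags. The function names are hypothetical, and the ρ2 function is an interpretation of the description rather than a transcription of the original formula.

```python
def overlap_rho1(tags3, tags5):
    """Fraction of distinct 3' tags that also occur among the 5' tags."""
    s3, s5 = set(tags3), set(tags5)
    return len(s3 & s5) / len(s3)


def overlap_rho2(tags3, tags5):
    """Fraction of 5' tag occurrences whose sequence equals some 3' tag."""
    s3 = set(tags3)
    matching = sum(1 for tag in tags5 if tag in s3)  # multiplicities are kept
    return matching / len(tags5)
```

For example, with 3′ tags `["AAA", "CCC", "GGG"]` and 5′ tags `["AAA", "AAA", "TTT", "ACG", "CGT"]`, one of three distinct 3′ tags is found upstream (ρ1 = 1/3), while two of the five sequenced 5′ tags correspond to a 3′ tag (ρ2 = 0.4).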

Analysis of EST data

In addition to the input data with query lengths 10–50 bp, we performed BLAST using all EST sequences having a recognition site for NlaIII. We also required that the sequences were found in UniGene clusters by inspecting the Hs.seq.data file. Searches were performed in the UniGene database. For each EST sequence, a 10 bp tag was created as the first 10 bases following the 3′-most NlaIII recognition site. Then, each tag was given a UniGene identifier based on the results of the BLAST alignment with the corresponding EST sequence. We refer to this mapping as the BLAST mapping.

The total number of BLAST mappings is equal to the number of sequences. Note that this does not exclude the possibility of two sequences aligning to the same cluster and having the same 10 bp tag (i.e. the same mapping may occur more than once). For each 10 bp tag, we compared the BLAST mapping with the SAGE mapping. If the BLAST mapping was defined in the SAGE mapping, we increased the count of equal mappings by one. The fraction of equal mappings is defined as the number of equal mappings divided by the total number of mappings.
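The comparison can be sketched as below. The data structures are assumptions made for illustration: the BLAST mapping is taken to be a list of (tag, UniGene identifier) pairs, one per EST, and the SAGE mapping a dictionary from each tag to the set of identifiers it is assigned; the actual file layouts are not reproduced here.

```python
def fraction_equal_mappings(blast_mappings, sage_map):
    """Fraction of BLAST mappings that are also defined in the SAGE mapping.

    blast_mappings: list of (tag, cluster_id) pairs, one per EST sequence.
    sage_map: dict mapping a tag to the set of cluster ids assigned to it.
    """
    equal = sum(
        1 for tag, cluster in blast_mappings
        if cluster in sage_map.get(tag, set())
    )
    return equal / len(blast_mappings)
```

Because the list holds one entry per sequence, a mapping that occurs several times contributes several times to both the numerator and the denominator.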

RESULTS

Evaluation of RefSeq and UniGene for sequencing based gene expression

RefSeq and UniGene are two transcript databases developed and managed by the NCBI. These databases are both partial representations of the transcriptome. RefSeq is a highly curated database with 14 248 mRNA entries as of October 31, 2001. Recent estimates of the human gene number lie in the range 30 000–35 000, with an approximate average of three distinct transcripts per gene (12). With these estimates, RefSeq represents a 10–15% coverage of the human transcriptome. UniGene is composed of automatically generated clusters of EST sequences (13), seeded with well characterized genes and completed by alignments of residual ESTs. The number of distinct clusters was 96 327 in build version 142. The EST clustering process used to generate the UniGene database gives a significantly broader coverage than RefSeq. Each cluster corresponds to a unique transcript, and one member sequence is chosen to represent a cluster. UniGene is likely to cover the majority of human transcripts, albeit with considerably less accuracy in the description of the original mRNA sequences. We therefore used these complementary databases for assessing different aspects that are relevant for the ability to assign short sequence tags to their cognate gene/transcript.

Restriction enzyme recognition sites

The short-tag methods are based on an initial restriction enzyme cleavage in the cDNA, followed by capture of the 3′ situated 10–20 bases. In theory, a restriction enzyme with a four-base recognition sequence will cut on average once every 256 bases in a random sequence. Furthermore, a 10 bp sequence has ∼10⁶ possible permutations. Given that the average length of human mRNAs is 1.3 kb (12) and that there are expected to be 10⁵ different mRNA species in the human transcriptome, there appears to be a margin of safety for being able to unambiguously identify and analyze all mRNAs in the transcriptome. However, the transcriptome sequence content is not random and many mRNAs are shorter than 1 kb, so there will in fact exist a subset of mRNAs that cannot be analyzed with a given combination of restriction enzymes. This can occur because the mRNA does not have a particular restriction enzyme site. In addition, the identification strength of a tag (i.e. the ability of a tag to identify the original mRNA) may be reduced if the terminal recognition site is situated too close to the polyadenylation site. Tags that extend into the poly(A) tail will have a sequence composition consisting mainly of adenine residues, thereby reducing sequence variation.
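The arithmetic behind this apparent margin of safety can be written out as follows. It is only a back-of-the-envelope sketch that assumes uniformly random, independent bases, an assumption that does not hold for real transcripts.

```latex
\begin{align*}
\text{expected spacing of a 4 bp site} &= 4^{4} = 256 \text{ bases},\\
\text{expected sites per 1.3 kb mRNA} &\approx \frac{1300}{256} \approx 5,\\
\text{distinct 10 bp tags} &= 4^{10} = 1\,048\,576 \approx 10^{6},\\
\frac{\text{distinct tags}}{\text{mRNA species}} &\approx \frac{10^{6}}{10^{5}} = 10.
\end{align*}
```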

In order to assess the extent of this information loss and to identify some individual genes that may be affected, we have studied the occurrence of recognition sites in RefSeq. The known sequence orientation and lower error frequency in this database makes such an assessment more feasible than with UniGene.

We studied the occurrence of recognition sites in two data sets. First, we analyzed all 14 248 mRNA sequences in RefSeq. Second, we selected those sequences in RefSeq fulfilling certain criteria of poly(A) tail length and polyadenylation signal (see Materials and Methods) resulting in a refined set of 3493 sequences.

The occurrence of recognition sites was studied for all 256 possible 4 bp sequences, although restriction enzymes generally use palindromic 4 bp sequences. Figure 1 shows the relative abundance of recognition sites in the refined data set for 13 of these 4 bp sequences, which represent real 4-cutter enzymes. Similar results were obtained with the large data set (data not shown).

Figure 1. Recognition site frequencies for a set of common restriction enzymes in the refined data set of 3493 RefSeq sequences. The stacked bar plot shows the fraction of sequences having 0, 1 and ≥2 recognition sites, respectively.
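The three classes shown in Figure 1 could be computed along the following lines. This is a sketch with hypothetical function names; overlapping occurrences of a site are counted separately, which is an assumption.

```python
from collections import Counter


def count_sites(seq, site):
    """Count occurrences of site in seq, including overlapping ones."""
    count, pos = 0, seq.find(site)
    while pos != -1:
        count += 1
        pos = seq.find(site, pos + 1)
    return count


def site_class_fractions(seqs, site):
    """Fraction of sequences having 0, 1 and >=2 recognition sites."""
    # Counts of two or more are pooled into a single class.
    classes = Counter(min(count_sites(s.upper(), site), 2) for s in seqs)
    n = len(seqs)
    return {"0": classes[0] / n, "1": classes[1] / n, ">=2": classes[2] / n}
```

Sequences in the "0" class are inaccessible to methods that need one site, and those in the "0" and "1" classes to methods that need two.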

We have chosen to show the results for a set of naturally occurring enzymes, which have been widely used in gene expression analysis methods. NlaIII and Sau3A have been used as anchor enzymes in SAGE, MspI is utilized in TALEST, and DpnII is a commonly used enzyme in representational difference analysis (RDA) (14). The distribution of recognition site frequencies is similar in both data sets. Out of the sequences studied, ∼1% did not contain an NlaIII restriction site, thus making these sequences unavailable to SAGE using this enzyme. Assuming that this set of sequences is representative of the human transcriptome and that the latter consists of 10⁵ transcripts, some 1000 transcripts are not accessible to SAGE. For RDA, at least two restriction sites are needed, meaning that ∼10%, or 10⁴ transcripts, are out of reach of examination if DpnII is used as the only enzyme.

We extracted the RefSeq sequences with one or zero recognition sites for transcripts that are not accessible for analysis by RDA and/or SAGE, in order to obtain qualitative information about which genes cannot be studied under given experimental conditions. Some examples of genes not accessible for study by SAGE when using NlaIII as the anchoring enzyme are G antigen 2-7, several histone family members, RAB28 and RAB4 (which are members of the RAS oncogene family), and transcription factors such as DNA-binding transcriptional activator (NCYM). Similar examples for RDA (when DpnII is used) are several ribosomal proteins, cytochrome c oxidase subunits, and oncogenes GRO2 and GRO3. Complete listings of inaccessible genes for a given enzyme are available at the website companion.

Since a short tag of 10–20 bp, situated at the 3′ end of a sequence, is used to identify the original transcript, we wished to investigate the length of the sequence located between the terminal recognition site and the polyadenylation site for all restriction enzymes. It has been believed that for the majority of mammalian pre-mRNAs, the polyadenylation site is defined by site-specific endonucleolytic cleavage (15). Recently, however, it has been shown that up to 44% of human mRNA transcripts show cleavage site heterogeneity (16). In the RefSeq database, many of the sequences lack a distinct poly(A) tail, which makes the length of the 3′-most restriction fragment with variable sequence content uncertain. We therefore performed this analysis on the refined data set of 3493 mRNA sequences having well defined poly(A) tails. The result is shown in Figure 2.

Figure 2. Fraction of 3′ tags generated from the refined data set with lengths 0, ≤9, 10–15, 16–20 and ≥21 bp.

Apart from recognition sites AATT and TTAA, the frequency of transcripts generating 10 bp tags that extend into the poly(A) tail lies in the range 1–5%. For NlaIII, ∼4% of the 10 bp tags extend into the poly(A) tail, indicating that these tags may have reduced identification strength. Adding the fraction of tags inaccessible to SAGE according to the previous analysis, some 5000 transcripts may generate tags with reduced or zero identification strength, given the current estimate of the size of the human transcriptome. However, the numbers presented in the preceding paragraphs should only be regarded as crude estimates since they are based on a small sample of the transcriptome under the assumption that the refined data set is representative of the entire transcriptome.

Uniqueness among 3′ tags

The discriminatory power of short tags is dependent on the redundancy of transcript identification, where redundancy means that the same tag sequence will be derived from more than one mRNA species. Given the evolutionary process with gene duplication events resulting in gene families, it is likely that conservation of some tags will occur. In order to assess this possible bias, the uniqueness among the 3′ tags of 10 bp length was measured in the RefSeq data set of 14 248 sequences. The results are shown in Figure 3.

Figure 3. Percentage of tags identifying 1, 2, 3, 4 and ≥5 genes. The percentage of tags identifying one entry in RefSeq is shown after each bar plot.

For every recognition site, the fraction of tags identifying a single transcript was close to 95%, as has been estimated previously (4,17). A recent analysis of the entire UniGene Human data set has given an estimate of tag uniqueness of only 11% (18). However, the authors pointed out that tag uniqueness increases to 44% if a reference collection of well characterized sequences is used, and that gene identification with 10 bp tags likely will improve with better reference sequence databases. Consequently, the information provided by 10 bp tags seems to be sufficient in the majority of cases, although a more accurate determination of the fraction may have to await a more comprehensive transcript database. However, even if the remaining non-unique tags, corresponding to tags identifying more than one transcript, constitute only 5% of the total transcriptome, they could represent up to 5000 transcripts including transcript variants. It is therefore of importance to identify these transcripts in order to be able to treat them separately when interpreting data obtained by these experimental methods.

Many of the transcripts that were identified by the same tag produced by NlaIII were, as expected, transcript variants or splice variants, or members of gene families. An example of the former is annexin A6, splice variants 1 and 2, and an example of the latter is the histone gene family. The website companion to this article contains a complete listing, which may serve as a reference resource in the process of annotating SAGE data sets.

There is a possibility that alternative splicing may generate different tags, in which case it would be possible to distinguish between transcript variants. For this version of RefSeq, the sequences were assigned to 13 192 loci, out of which 618 contained more than one RefSeq sequence. We studied the tags produced by NlaIII, and found that 237, or 38%, of the loci with alternative transcripts displayed tag heterogeneity. This indicates that many alternative transcripts may be characterized individually using short tag methods.

Relation between 3′ and upstream tags

Methods based on capture of the sequence immediately following the most 3′ recognition site could be vulnerable to misclassification in cases where partial digestion of the cDNA instead releases upstream tags from the same mRNA. One publication (7) has described occurrences of SAGE tags that could possibly be derived from upstream cleavage sites in highly abundant mRNA. However, if properly identified, such tags could nevertheless be used to identify the mRNA, unless they happen to be identical to real 3′ tags from other mRNAs. To investigate this possibility, 10 base tags were obtained from all restriction sites in all mRNAs and coded as being 3′ or 5′ (upstream), as illustrated schematically in Figure 4. The degree of tag overlap between these groups of tags is shown in Table 1.

Figure 4. Schematic drawing of the overlap between 3′ and 5′ tags. The black bar represents the recognition site, whereas the patterned bars represent different tags. In this hypothetical case, there are five 5′ tags and three 3′ tags in total, where two out of the five 5′ tags are identical to 3′ tags.

Table 1. Degree of tag overlap for 13 recognition sites.

Site (enzyme)          ρ1 (%)    ρ2 (%)
AATT (Sse9I)           22.3      3.9
AGCT (AluI)            27.0      3.9
CATG (NlaIII)          22.1      3.5
CCGG (MspI)            18.4      4.3
CGCG (FnuDII)          13.5      5.2
CTAG (MthZI)           10.6      3.6
GATC (DpnII/Sau3A)     17.4      4.6
GCGC (HhaI)            16.4      4.5
GGCC (HaeIII)          28.1      3.7
GTAC (RsaI)            13.5      4.2
TCGA (TaqI)             8.8      4.8
TGCA (CviRI)           25.6      3.7
TTAA (MseI)            22.7      4.1

The first overlap definition is based on set theory, where an element may only occur once in a set. In practice, it states how large a fraction of the 3′ tags can be found among the 5′ tags. The second overlap definition is based on multiset theory, where each element of a set may occur multiple times. This interpretation states the probability that a 5′ tag defines a 3′ tag, given that a 5′ tag is selected and sequenced. See Materials and Methods for details.

We made two alternative interpretations of tag overlap. In the first interpretation, ρ1, we assessed the uniqueness of the 3′ tags, that is, how large a fraction of the set of 3′ sequences can be found in the set of 5′ sequences. ρ1 varies considerably among the recognition sites, from close to 9% for TCGA to almost 30% for GGCC. The values depend on the size of the 5′ sets; the larger the set of 5′ tags, the larger the overlap.

Large sets are generated by recognition sites that occur frequently in transcripts, which is the case for GGCC (Fig. 1). The site TCGA, on the other hand, is missing in almost 20% of the transcripts.

The second definition ρ2 states the probability that a 5′ tag defines a 3′ tag, given that a 5′ tag has been selected for sequencing. Interestingly, ρ2 varies very little among the recognition sites, regardless of the variations of ρ1. If a substantial fraction of tags in a given experiment are derived from upstream cleavage sites, a potential source of error may be introduced since there is a possibility that tag counts are shifted from their true values.

In general, 5% of the 5′ tags would correspond to real 3′ tags, thus increasing their counts. However, this estimate is based on a data set where every transcript occurs once. In a biological sample, a given transcript species may be present in thousands of copies. The impact of tag overlap will be related to the expression level of a given transcript, with an increased risk of misclassification if the gene contains a large number of upstream tags that are identical to real 3′ tags in other transcripts and if partial cleavage occurs to a high degree.

Relationship between tag length and uniqueness

Given the results from the analysis above, it was of interest to investigate the effect of tag length for gene identification, that is, how an incremental increase of the length affects the discriminatory power. Figure 5 shows the fraction of unique tag sequences for tag lengths from 10 to 30 bp. The tags were obtained by extracting the sequences following the 3′-most restriction site.

Figure 5. Fraction of tags that are unique for tag lengths 10, 15 and 30 bp in RefSeq. Note that the fraction of unique 10 bp tags is lower than in Figure 3 since we in this case selected sequences that produced tags at least 30 bases long.

For the 10 bp tags the fraction of unique tags lies close to 90%, increasing by 2–3% for tags of length 15 bp, with a smaller increase for longer tags. A recent publication (17) has shown that ∼94% of 10-base tags are expected to be unique, assuming a genome size of 15 720 genes. It should be pointed out that our way of calculating tag uniqueness differs from the method employed by Stollberg et al. (17). In our case, identical tags are counted the number of times they occur, thus increasing the total tag count in comparison with the unique tag count. Stollberg et al. counted redundant tags once, which increases the unique tag count in relation to the total tag count, giving a higher fraction of unique tags.
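The effect of the two counting conventions can be made concrete with a toy example. The sketch below is illustrative only; the function name and the flag `count_redundant_once` are hypothetical, and the second convention is modelled simply as dividing by the number of distinct tags.

```python
from collections import Counter


def unique_fraction(tags, count_redundant_once=False):
    """Fraction of tags occurring exactly once, under either convention."""
    counts = Counter(tags)
    singletons = sum(1 for c in counts.values() if c == 1)
    # Denominator: every occurrence (this study) or each distinct tag once.
    total = len(counts) if count_redundant_once else len(tags)
    return singletons / total


tags = ["AAAAAAAAAA", "AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"]
print(unique_fraction(tags))                             # 0.5
print(unique_fraction(tags, count_redundant_once=True))  # 0.666...
```

The same four tags give 50% uniqueness when every occurrence is counted and about 67% when the redundant tag is counted once, which shows the direction of the difference described above.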

For 30 bp tag sequences, there are still cases where tags are not unique to one gene. Manual inspection showed that such tags derive from closely related transcripts, e.g. alternatively spliced transcripts. For example, in the case of NlaIII, one tag occurred 17 times, where the RefSeq entries corresponded to transcript variants of the dystrophin gene (accession numbers NM_004006–NM_004023 and NM_000109).

BLAST for short tags

Identification of sequence similarity is usually performed with scoring algorithms that take regions of sequence difference into account and may allow the introduction of gaps to optimize alignments. For sequences as short as 10 bases, this approach is not feasible, since it will give too many possible hits in the sequence databases. With the purpose of investigating the performance of the BLAST algorithm (19) for automatic identification of transcripts from tags, we performed a systematic evaluation of BLAST queries with lengths from 10 to 50 bp derived from different types of data sets. In general, BLAST is not the algorithm of choice for searching short DNA sequences against DNA databases. Furthermore, the statistics produced when doing similarity searches using sequences shorter than 200 bases may not be reliable (20). The alternatives, however, would be too time consuming to meet the needs of automated large-scale gene expression analysis, and we concluded that BLAST would be the best heuristic to serve as a sufficient approximation.

In the forthcoming discussion we will be referring to the identification success as being related to the tag length of a query. The queries consist of the 4 bp recognition site plus the tag sequence. Therefore, the tag length is equal to the query length minus four bases. We made a case study of the RefSeq data set (minus the sequences not representing UniGene clusters) in order to get an overview of the behavior of a transcript set where each transcript species is unique (copy number = 1). Since all query sequences were represented in the database, one would also get an estimate of the identification resolution. Furthermore, in order to test a typical data set obtained in studies of gene expression by cDNA sequencing, we analyzed a publicly available EST library, where the copy number of a transcript species can be >>1.

An evaluation algorithm was designed to obtain a graphical illustration of the stability in transcript identification as a function of tag length. More specifically, we wanted to show the proportion of BLAST queries that change their transcript identification with increasing tag length. Intuitively, for short tag lengths, BLAST will report a larger number of equally good hits (alignments with equal search scores) than for longer tag lengths. In the following, we will only consider these top hits and refer to them as the number of hits for a given query. Moreover, since we were interested in sequence identity rather than similarity, we introduced a cut-off to remove alignments too poor to indicate identity. If the number of identical bases in an alignment was <90% of the number of bases in the query, all data generated by this query were removed from the analysis. In Figure 6A, a plot of the number of hits is shown as a function of tag length for the RefSeq data set, using restriction enzymes NlaIII and Sau3A to generate the queries.
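The selection of top hits and the identity cut-off might look as follows in code. This is a sketch under assumptions: each alignment is represented as a hypothetical (cluster identifier, score, number of identical bases) triple already parsed from the BLAST output, and the 90% cut-off is applied to the top-scoring alignments.

```python
def top_hits(alignments, query_len, min_identity=0.9):
    """Return the clusters sharing the best score, or None if the query is dropped.

    alignments: list of (cluster_id, score, n_identical) triples for one query.
    """
    if not alignments:
        return None
    best = max(score for _, score, _ in alignments)
    top = [(cid, ident) for cid, score, ident in alignments if score == best]
    # Assumed rule: discard the query if a top alignment is below the cut-off.
    if any(ident < min_identity * query_len for _, ident in top):
        return None
    return {cid for cid, _ in top}
```

The number of hits for a query is then the size of the returned set, and queries for which `None` is returned are excluded from the analysis.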

Figure 6. (A and B) Plot of the number of hits per number of queries as a function of tag length. (C and D) Plot of the percentage of queries that from a given length have a unique best hit versus tag length.

The results were similar for both restriction enzymes. As expected, the number of hits is large for short queries, and for longer queries the number of hits decreases. For tags shorter than 10 bp, the number of hits/query is >2. The number of hits/query decreases very rapidly with increasing tag length, and already at tag length 12–13, we observed ∼1.1 hits/query. For tags longer than 15 bp, the decrease in hits/query is slower. However, the number of hits/query does not reach unity, which may indicate the existence of similar UniGene reference sequences. For example, the tag generated from RefSeq entry NM_001016 aligned to UniGene clusters Hs.339696 [accession number NM_001016; gene H. sapiens ribosomal protein S12 (RPS12)] and Hs.288224 (AK025642; H. sapiens cDNA). Alignment of the reference sequences NM_001016 and AK025642 showed that they were 100% identical over 500 bases in the 3′ region, including the NlaIII site and the 46 bp tag. In summary, the results indicate an optimum tag length of 12–13 bp for the RefSeq data set.

Another way to analyze the effect of tag length is by studying the ‘decision length’ of a query. Unambiguous identification of a query implies that the query returns one hit. Consider the case where a query is unambiguously identified for a given tag length and all following tag lengths. We define the decision length of a query as the shortest tag length at which unambiguous identification occurs, with the identifier remaining the same for all following tag lengths. A query with a decision length <46 bp (the longest tag length) will be called a stable query, since the identification does not change for tag lengths longer than the decision length. In Figure 6B, the percentage of stable queries is plotted as a function of tag length. Since the number of hits/query is >1 for short queries, no stable queries will be found for small query lengths. As the tag length increases, the fraction of stable queries increases. As in Figure 6A, the slope of the curve changes at tag length 12–13 bp, where ∼90% of the queries are stable. The percentage of stable queries increases moderately up to tag length 45 bp, but never reaches 100%. Examination of the queries with a decision length of 46 bp revealed that this phenomenon is almost exclusively due to gene families. Consequently, there are always a number of transcripts that cannot be separated even with longer tag lengths.
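The definition of the decision length lends itself to a short sketch. The code below is an illustration only: the function names and the input layout (a dictionary from tag length to the set of top-hit identifiers for one query) are assumptions made here, and a query that never reaches an unambiguous, unchanging identification is given the value `None`.

```python
def decision_length(top_hits_by_length, max_length=46):
    """Return the decision length of one query, or None if it has none.

    top_hits_by_length maps tag length -> set of top-hit identifiers.
    The decision length is the shortest length at which the query has
    exactly one top hit and keeps that same hit at every longer length.
    """
    lengths = sorted(n for n in top_hits_by_length if n <= max_length)
    decision = None
    final_hit = None
    # Walk from the longest tag towards the shortest one.
    for length in reversed(lengths):
        hits = top_hits_by_length[length]
        if len(hits) != 1:
            break
        (hit,) = hits
        if final_hit is None:
            final_hit = hit
        elif hit != final_hit:
            break
        decision = length
    return decision


def stable_fraction(decisions, tag_length, max_length=46):
    """Fraction of queries that are stable at the given tag length.

    A query counts as stable at tag_length if its decision length is
    at most tag_length and shorter than the longest tag length.
    """
    if not decisions:
        return float("nan")
    stable = sum(
        1 for d in decisions
        if d is not None and d <= tag_length and d < max_length
    )
    return stable / len(decisions)
```

Evaluating `stable_fraction` over a range of tag lengths would produce a non-decreasing curve, since a query that is stable at one length remains stable at all longer lengths.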

In order to assess the influence of different tag lengths on data in a typical experimental mRNA sample, we analyzed a set of EST libraries (see Materials and Methods), in which a given transcript may have a copy number c > 1, giving c identical queries. Obviously, identical queries will not be distinguishable in the identification process. Suppose, for instance, that the queries generated from a transcript with a high copy number have a large decision length L. Then, the number of hits/query will be substantially elevated for all tag lengths <L. Moreover, given that the total number of queries is q, the fraction of stable queries increases by at least c / q at tag length L.
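The bound on the increase can be written compactly. The notation below is introduced here for illustration only and is not part of the original analysis: d_i denotes the decision length of query i, and S(ℓ) the fraction of stable queries at tag length ℓ, for ℓ shorter than the longest tag length.

```latex
S(\ell) = \frac{1}{q}\,\bigl|\{\, i : d_i \le \ell \,\}\bigr| ,
\qquad
S(L) - S(L-1) = \frac{\bigl|\{\, i : d_i = L \,\}\bigr|}{q} \;\ge\; \frac{c}{q} .
```

The inequality holds because the c identical queries all have decision length L, while other queries may reach their decision at the same length.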

First, all libraries were treated separately. The library Co14 originally consisted of 1135 sequences. This number was reduced to 379 queries after cutting with NlaIII and applying the similarity cut-off. In Figure 6C, the number of hits/query as a function of tag length is shown for this data set. As expected, the ratio hits/query decreases rapidly as the tag length increases, and at a tag length of ∼15 bases the slope changes, producing a ‘flattening out’ effect. Second, the libraries were pooled into one single data set for each enzyme, producing similar results (Fig. 6C). A total of 5280 queries were included in this analysis. For the pooled Sau3A data set, 4097 queries were retained.

The curves are similar to those in Figure 6A, although with a more ragged appearance. However, the number of hits/query is slightly larger for tag lengths 12–13 bp, and the stabilization at 1.1 hits/query is seen for tags longer than 25 bp.

The results of the stable query analysis are shown in Figure 6D for the Co14 NlaIII data set, the pooled NlaIII set and the pooled Sau3A set. In this case, there is a striking difference in the appearance of the curves as compared to Figure 6B. The curves are more irregular, displaying large jumps at numerous query length transitions. For all data sets, longer tag lengths are needed before the percentage of stable queries reaches 90%. For the smaller Co14 data set, only 85% of the queries are stable at tag length 35 bp, with a large increase at the next tag length.

It seemed likely that the irregularities in the latter case might be attributed to the high copy number of a given transcript. In order to investigate this, we examined the library Co14 more thoroughly and extracted all queries having a decision length of 36 bases (i.e. those queries producing a pronounced increase at the shift from 35 to 36 bases). There were 26 such queries, which accounts for the almost 6% increase in Figure 6D. All queries had two hits at query length 35. Moreover, the hits were identical for all queries, namely UniGene clusters Hs.274466 (NM_001403) and Hs.181165 (BC014023). At tag length 36, all queries resolve to the latter cluster. Consequently, the redundancies in the input data are a major cause of the irregularities. Furthermore, a BLAST alignment of the reference sequences for these two UniGene clusters showed that they were practically identical, indicating that effects in the UniGene clustering process are also responsible.

The differences in Figure 6C and D as compared to Figure 6A and B are thus related to redundancies in the EST library data sets. The last example also showed that near-identical UniGene clusters will keep the percentage of stable queries below 100%, and consequently, the number of hits/query larger than one, for longer queries.

In conclusion, a tag length of 12–13 bp, located downstream of a given restriction site, is required to uniquely identify 90% of the transcripts in the RefSeq transcript database, whereas increasing length only moderately affects this fraction. The quality of the sequence database that is used for the identification is also of importance, as indicated above.

Comparison between SAGEmap and BLAST for transcript identification

In order to test the degree of differential classification of EST data when coded as full-length ESTs or as SAGE tags, we constructed a BLAST mapping as outlined in Materials and Methods and compared this mapping with the standard SAGE mapping. The rationale for this analysis was the question of whether the sequences in a given EST library, obtained for the purpose of quantifying gene expression, will identify the same genes in a gene catalog (UniGene) if they are treated as full-length ESTs analyzed with BLAST or if they instead are treated as SAGE tags. The results from the comparison between the two mappings are shown in Table 2. The mappings agreed for ∼80% of the queries in all libraries.

Table 2. Comparison of SAGEmap and BLAST for transcript identification

Library    Percentage of identical mappings
Co14       79.8
Gas4       80.8
Kid6       83.4
Lym12      81.1
Pan1       81.2
Ut1        79.5
Total      80.8

We had required that the EST queries were members of UniGene clusters, and that they corresponded to 3′-end transcripts. For nearly all ESTs (>99%), the tag-to-gene mapping would be defined in SAGEmap if we looked only at cluster membership. Still, the BLAST mapping only agreed for 80% of the queries.
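The agreement figures can be thought of as a simple comparison of two dictionaries. The sketch below is illustrative only: the function name, the dictionary layout (query identifier to cluster identifier) and the identifiers in the usage example are hypothetical, and only queries present in both mappings are compared.

```python
def mapping_agreement(sage_mapping, blast_mapping):
    """Percentage of shared queries assigned to the same cluster.

    Both arguments map a query identifier to a cluster identifier.
    """
    shared = sage_mapping.keys() & blast_mapping.keys()
    if not shared:
        raise ValueError("the mappings share no queries")
    identical = sum(
        1 for query in shared
        if sage_mapping[query] == blast_mapping[query]
    )
    return 100.0 * identical / len(shared)


# Hypothetical toy data: four queries agree, one does not.
sage = {"est1": "Hs.1", "est2": "Hs.2", "est3": "Hs.3",
        "est4": "Hs.4", "est5": "Hs.5"}
blast = {"est1": "Hs.1", "est2": "Hs.2", "est3": "Hs.3",
         "est4": "Hs.4", "est5": "Hs.9"}
agreement = mapping_agreement(sage, blast)
```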

The representative sequence of a UniGene cluster is chosen as the member with the longest region of high quality sequence data. Therefore, the representative sequence may lack data related to the 3′ end of a transcript, i.e. lack sequence similarity to the ESTs used as queries in the alignment procedure. Consequently, an EST could align to the representative sequence of a UniGene cluster other than the one of which it is a member. We found that the discrepancies between the mappings were almost exclusively explained by this phenomenon. In fact, a recent publication has shown that 46% of the tags in UniGene (version 117) clusters differ from the tag of the representative sequence (21).

As an example case, we selected query AW778982 from library Co14. This EST is a member of UniGene cluster Hs.17240 (BG545471), with the tag sequence GAGAGATGAC, but aligned to the representative sequence of cluster Hs.229951 (AI678778). The FASTA header of BG545471 describes it as a 5′-end cDNA. Furthermore, this sequence did not contain an NlaIII site followed by the tag sequence of the query. In order to show that the representative sequence lacked 3′ sequence information, we made an assembly of all EST sequences belonging to the cluster (Fig. 7). The tag generated by the 3′-most NlaIII site in the contig is the same sequence as for the query, which is indicated by the vertical line in the figure. It is also evident that the representative sequence corresponds to an internal region of the transcript.

Figure 7.

Schematic overview of the shotgun assembly of the sequences in UniGene cluster Hs.17240. The reference sequence BG545471 is shown in red, whereas the query AW778982 is shown in blue. The sequence direction is from left to right (5′ to 3′), and the location of the 3′-most NlaIII site is displayed with a vertical line. The figure is based on the output generated by the Staden package (22).

In summary, the high degree of disagreement between the two mappings is mainly due to effects inherent in the UniGene clustering process. A consequence of this analysis is that caution must be taken when using the UniGene non-redundant database to identify transcripts with sequence information located at the 3′ end. This is the case for most of the described experimental technologies since the anchoring enzyme site closest to the 3′ end of the transcript is used to define the start of the extracted sequence.
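Extraction of the tag that follows the 3′-most anchoring enzyme site can be sketched as below. This is a simplified illustration rather than the procedure used in the study: the function name is hypothetical, the input is assumed to be an mRNA sequence written 5′ to 3′ with the poly(A) tail already removed, and returning `None` for a tag shorter than the requested length is a choice made here. NlaIII recognizes CATG, which is used as the default site.

```python
def three_prime_tag(transcript, site="CATG", tag_length=10):
    """Return the tag downstream of the 3'-most recognition site.

    Returns None if the site is absent or if fewer than tag_length
    bases follow the 3'-most site.
    """
    seq = transcript.upper()
    pos = seq.rfind(site)  # position of the 3'-most occurrence
    if pos == -1:
        return None
    start = pos + len(site)
    tag = seq[start:start + tag_length]
    return tag if len(tag) == tag_length else None


# Hypothetical toy transcript with two NlaIII sites.
toy = "AAACATGTTTTTTTTTTCATGGAGAGATGACAA"
tag = three_prime_tag(toy)
```

In this toy sequence, the tag of the upstream site is ignored and only the sequence following the second CATG is returned.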

DISCUSSION

In this study we have investigated a number of parameters that influence gene and transcript identification based on short sequence data. The results can be used to infer what effect the choice of restriction enzyme may have on short tag sequencing experiments and their outcome, and may also serve to identify ambiguities that arise in the transcript identification process.

The analysis of the occurrence of recognition sites provides an estimate of the fraction of genes that are unavailable for study, depending on the restriction enzyme used. An obvious way to circumvent this problem is to devise experiments where several restriction enzymes are used in order to monitor the entire transcriptome (5). For methods relying on restriction enzyme digestion of cDNA, and where short tag information is located at the 3′ end of the transcript, it is essential that the recognition site is not situated too close to the 3′ end. For most recognition sequences, a vast majority (>90%) of the produced tags are longer than 20 bases. Exceptions are recognition sites that contain the CG dinucleotide, which is under-represented in human DNA (12). Transcript accessibility and tag length distribution provide vital information for determining the restriction enzyme of choice.
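The two accessibility measures discussed above, the fraction of transcripts lacking a site and the fraction whose 3′-most site lies too close to the 3′ end, can be estimated with a few lines of code. The sketch below is illustrative: the function name and the `min_tag_length` parameter are assumptions made here, and the sequences are assumed to be given 5′ to 3′ without the poly(A) tail.

```python
def site_statistics(transcripts, site="CATG", min_tag_length=10):
    """Return (fraction lacking the site, fraction with a too-short tag).

    A tag is too short when fewer than min_tag_length bases follow the
    3'-most occurrence of the recognition site.
    """
    if not transcripts:
        raise ValueError("no transcripts given")
    lacking = 0
    too_close = 0
    for transcript in transcripts:
        seq = transcript.upper()
        pos = seq.rfind(site)
        if pos == -1:
            lacking += 1
        elif len(seq) - (pos + len(site)) < min_tag_length:
            too_close += 1
    total = len(transcripts)
    return lacking / total, too_close / total
```

Running the function once per candidate recognition sequence would allow enzymes to be compared on the same transcript set.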

It is also desirable that tag uniqueness is optimized in a short sequencing experiment. The success in identifying original transcripts by short tags will depend on the ability of the short tags to discriminate between different mRNA species. As one would expect, we have found that related genes and alternative splice variants often produce the same tags. Our analysis indicates that this applies to 5–10% of the transcripts. By doing a tag uniqueness study on RefSeq, we could extract the identities of the genes that produced the same tag. For NlaIII, we found that alternative transcripts and gene families were equally responsible for this effect. Even if tag length is increased to 30 bp, closely related transcripts are not easily distinguishable. The results for a given restriction enzyme and RefSeq locus can be found at our website resource.
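A tag uniqueness study of this kind amounts to grouping transcripts by their tag. The following sketch is an illustration under assumptions made here: the function name and input layout are hypothetical, and uniqueness is taken to mean the fraction of tagged transcripts whose tag is shared with no other transcript, which may differ from the exact measure used in the study.

```python
from collections import defaultdict


def tag_uniqueness(tag_by_transcript):
    """Return (fraction of tagged transcripts with a unique tag,
    {tag: sorted transcript ids} for tags shared by several transcripts).

    tag_by_transcript maps a transcript id to its tag, or to None when
    the transcript yields no tag.
    """
    transcripts_by_tag = defaultdict(list)
    for transcript_id, tag in tag_by_transcript.items():
        if tag is not None:
            transcripts_by_tag[tag].append(transcript_id)

    tagged = sum(len(ids) for ids in transcripts_by_tag.values())
    unique = sum(1 for ids in transcripts_by_tag.values() if len(ids) == 1)
    shared = {
        tag: sorted(ids)
        for tag, ids in transcripts_by_tag.items()
        if len(ids) > 1
    }
    fraction = unique / tagged if tagged else float("nan")
    return fraction, shared
```

The second return value lists the groups of transcripts that cannot be told apart by their tag, which is where gene families and alternative transcripts would be expected to appear.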

Partial digestion of mRNA sequences may produce tags that are related to recognition sites located upstream with respect to the 3′-most site. This could present a problem for any sequencing-based gene expression profiling method that relies on positional information relating to the generated tag. The analysis of the overlap between 3′ tags and upstream tags has shown that, if occurring at a significant rate, partial cleavage may alter the tag count by a substantial amount. Since the overlap is <20% for most enzymes, many low-count tags will be produced, which does not present a major problem since these tags are dismissed from analysis. However, the tag counts will increase for overlapping tags, thus indicating a higher level of expression than is actually the case. If a highly expressed transcript is much affected by partial cleavage, the generated upstream tag may substantially elevate the monitored expression level of a low-abundance mRNA generating the same tag.
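The overlap between upstream tags and 3′ tags can be estimated along the following lines. This is a sketch under assumptions made here, not the analysis code of the study: the function names are hypothetical, and the overlap is taken to be the fraction of distinct upstream tags that also occur as the 3′ tag of some transcript in the set.

```python
def site_tags(transcript, site="CATG", tag_length=10):
    """Return one entry per recognition site, ordered 5' to 3'.

    Each entry is the tag following the site, or None when fewer than
    tag_length bases follow it.
    """
    seq = transcript.upper()
    tags = []
    pos = seq.find(site)
    while pos != -1:
        start = pos + len(site)
        tag = seq[start:start + tag_length]
        tags.append(tag if len(tag) == tag_length else None)
        pos = seq.find(site, pos + 1)
    return tags


def upstream_overlap(transcripts, site="CATG", tag_length=10):
    """Fraction of distinct upstream tags identical to some 3' tag."""
    three_prime = set()
    upstream = set()
    for transcript in transcripts:
        tags = site_tags(transcript, site, tag_length)
        if not tags:
            continue
        if tags[-1] is not None:
            three_prime.add(tags[-1])  # tag of the 3'-most site
        upstream.update(t for t in tags[:-1] if t is not None)
    if not upstream:
        return 0.0
    return len(upstream & three_prime) / len(upstream)
```

Tags in the intersection are those whose counts could be inflated by partial cleavage of another transcript.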

The preceding paragraphs have discussed how various tag characteristics are influenced by the choice of restriction enzyme. The fidelity in the identification process will also depend on the choice of transcript index database. We found that 10 bp tags in most cases are sufficient to identify a transcript when using a high-quality reference sequence database, such as RefSeq. However, RefSeq only covers part of the transcriptome, and a tag uniqueness as high as 95% may be an overestimate for the entire transcriptome (18). Another factor is the ability to discriminate between splice variants. In UniGene, these tend to be assigned to the same clusters (13). We have identified certain problems with UniGene, such as the existence of distinct clusters that, by visual inspection, were found to represent the same transcript. The advantage of UniGene as compared to RefSeq is the broader coverage of the transcriptome. Therefore, we chose UniGene for doing automated transcript identification by sequence alignments. As mentioned previously, using 10 bp tags is tractable only when using high-quality databases, and we found that at least 12–13 bp (plus the 4 bp of the recognition site) was needed to minimize the risk of ambiguous classification.

We also examined the case where a predefined tag-to-gene mapping is used to identify transcripts, as opposed to identification by sequence alignment. The comparison of the virtual mapping with SAGEmap showed that the data produced by gene expression experiments using EST sequences versus SAGE may differ considerably. We noted that the majority of discrepancies were related to effects inherent in the UniGene clustering process. The dominant discrepancy was that the representative sequences in the UniGene cluster did not correspond to the sequences used to derive the SAGE tag. In effect, the UniGene non-redundant database is not effective for identifying 3′-end transcripts. Comparing data from these types of experiments may provide valuable information about the sequences that generate discordant results.

In conclusion, we have found that the analysis of gene expression by short tag methods is strongly influenced by the restriction enzyme used, and by the choice of transcript database. When relying on a single enzyme, short sequence tags, and a transcript cluster database, various sequence identification problems can affect 5–20% of the tags.

With increasingly accurate information accumulating on the transcriptome sequences from many organisms, it will be possible to improve the interpretation of sequence-based expression studies by identifying short tags that have a high reliability and separating them from tags that carry an inherent ambiguity. The present study is an effort in this direction for the human transcriptome, with the web supplement to this article serving as a resource to the scientific community.

ACKNOWLEDGEMENTS

This study was performed with support from the Foundation for Strategic Research, the Swedish Cancer Society, the Swedish Research Council, and the Knut and Alice Wallenberg Foundation.

REFERENCES

1. Lockhart,D.J. and Winzeler,E.A. (2000) Genomics, gene expression and DNA arrays. Nature, 405, 827–836.
2. Adams,M.D., Kelley,J.M., Gocayne,J.D., Dubnick,M., Polymeropoulos,M.H., Xiao,H., Merril,C.R., Wu,A., Olde,B., Moreno,R.F. et al. (1991) Complementary DNA sequencing: expressed sequence tags and human genome project. Science, 252, 1651–1656.
3. Velculescu,V.E. (1999) Essay: Amersham Pharmacia Biotech & Science prize. Tantalizing transcriptomes–SAGE and its use in global gene expression analysis [published erratum appears in Science (1999) 286, 2085]. Science, 286, 1491–1492.
4. Velculescu,V.E., Zhang,L., Vogelstein,B. and Kinzler,K.W. (1995) Serial analysis of gene expression. Science, 270, 484–487.
5. Spinella,D.G., Bernardino,A.K., Redding,A.C., Koutz,P., Wei,Y., Pratt,E.K., Myers,K.K., Chappell,G., Gerken,S. and McConnell,S.J. (1999) Tandem arrayed ligation of expressed sequence tags (TALEST): a new method for generating global gene expression profiles. Nucleic Acids Res., 27, e22.
6. Ronaghi,M., Karamohamed,S., Pettersson,B., Uhlen,M. and Nyren,P. (1996) Real-time DNA sequencing using detection of pyrophosphate release. Anal. Biochem., 242, 84–89.
7. Welle,S., Bhatt,K. and Thornton,C.A. (1999) Inventory of high-abundance mRNAs in skeletal muscle of normal men. Genome Res., 9, 506–513.
8. Maglott,D.R., Katz,K.S., Sicotte,H. and Pruitt,K.D. (2000) NCBI’s LocusLink and RefSeq. Nucleic Acids Res., 28, 126–128.
9. Schuler,G.D., Boguski,M.S., Stewart,E.A., Stein,L.D., Gyapay,G., Rice,K., White,R.E., Rodriguez-Tome,P., Aggarwal,A., Bajorek,E. et al. (1996) A gene map of the human genome. Science, 274, 540–546.
10. Lash,A.E., Tolstoshev,C.M., Wagner,L., Schuler,G.D., Strausberg,R.L., Riggins,G.J. and Altschul,S.F. (2000) SAGEmap: a public gene expression resource. Genome Res., 10, 1051–1060.
11. Beaudoing,E., Freier,S., Wyatt,J.R., Claverie,J.M. and Gautheret,D. (2000) Patterns of variant polyadenylation signal usage in human genes. Genome Res., 10, 1001–1010.
12. Lander,E.S., Linton,L.M., Birren,B., Nusbaum,C., Zody,M.C., Baldwin,J., Devon,K., Dewar,K., Doyle,M., FitzHugh,W. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
13. Bouck,J., Yu,W., Gibbs,R. and Worley,K. (1999) Comparison of gene indexing databases. Trends Genet., 15, 159–162.
14. Hubank,M. and Schatz,D.G. (1994) Identifying differences in mRNA expression by representational difference analysis of cDNA. Nucleic Acids Res., 22, 5640–5648.
15. Chen,F., MacDonald,C.C. and Wilusz,J. (1995) Cleavage site determinants in the mammalian polyadenylation signal. Nucleic Acids Res., 23, 2614–2620.
16. Pauws,E., van Kampen,A.H., van de Graaf,S.A., de Vijlder,J.J. and Ris-Stalpers,C. (2001) Heterogeneity in polyadenylation cleavage sites in mammalian mRNA sequences: implications for SAGE analysis. Nucleic Acids Res., 29, 1690–1694.
17. Stollberg,J., Urschitz,J., Urban,Z. and Boyd,C.D. (2000) A quantitative evaluation of SAGE. Genome Res., 10, 1241–1248.
18. Lee,S., Clark,T., Chen,J., Zhou,G., Scott,L.R., Rowley,J.D. and Wang,S.M. (2002) Correct identification of genes from serial analysis of gene expression tag sequences. Genomics, 79, 598–602.
19. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
20. Anderson,I. and Brass,A. (1998) Searching DNA databases for similarities to DNA sequences: when is a match significant? Bioinformatics, 14, 349–356.
21. Clark,T., Lee,S., Scott,L.R. and Wang,S.M. (2002) Computational analysis of gene identification with SAGE. J. Comput. Biol., 9, 513–526.
22. Bonfield,J.K., Smith,K.F. and Staden,R. (1995) A new DNA sequence assembly program. Nucleic Acids Res., 23, 4992–4999.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
