VNB

SUPPLEMENTAL MATERIAL

Methodology of the VNB Algorithm

Described here are more details of how VNB operates. The software is written in Java, version 1.6.0. The program uses BLAST by interfacing with the BLAST server at NCBI via a programming interface described here: http://www.ncbi.nlm.nih.gov/blast/Doc/urlapi.html. The sections below highlight the novel methods and explain how the user-defined options and parameters are used, how the different steps of VNB are linked to each other, and how the software accesses the necessary information. Each section corresponds to an action described in Fig. 1 (ovals in flow chart). For each action, an output file is created and made available to the user via a link in the returned email. These files allow the user to check or use the intermediate results generated by VNB.

Input and Alignment. Queries are submitted as full sequences rather than accession numbers. This allows the user to submit sequence variants, partial sequences, splice-variants, etc. and compare outputs. In addition, many accession numbers give sequences with poly-(A) included, which can lead to problems with false positives. Removal of these repetitive sequences prior to submission is suggested in the help manual (http://tlab.bu.edu/help.html). The program first requires an alignment of the input sequence to its paralogs, which can be supplied in two different ways; the justification for each is discussed in the text. The first method queries the cDNA sequence against the RefSeq database using BLAST. The other method allows the user to define the paralogs in the alignment using CLUSTALW (http://www.ebi.ac.uk/Tools/clustalw2/index.html). The user produces a “custom” alignment (using the default output format), saves it as a text file, and inputs this alignment into VNB.

AutoProbe. The alignment from either method is used by AutoProbe to generate gene-specific “probes.” As described in Methods, the alignment is converted to a matrix, and each position of the input sequence is scored based on how dissimilar the input sequence is to the paralogs in the alignment at that position. The position scores are summed over the length of each potential probe to give that probe's score. The most gene-specific probe is then chosen for each window along the full length of the cDNA.
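The scoring and selection steps can be illustrated with a minimal sketch. It is not the VNB implementation (which is in Java and is specified in Methods); it assumes that a position's score is simply the number of paralogs that differ from the query at that column, that the query row contains no gaps, and that each window is a block of consecutive probe start positions. The function names and the tie-breaking rule (earliest start wins) are hypothetical.

```python
from typing import List, Tuple


def position_scores(query: str, paralogs: List[str]) -> List[int]:
    """Score each aligned column by the number of paralogs that differ
    from the query there (assumed scoring rule, see lead-in)."""
    return [sum(1 for p in paralogs if p[i] != base)
            for i, base in enumerate(query)]


def best_probes(query: str, paralogs: List[str],
                window: int = 8, probe_len: int = 20) -> List[Tuple[int, str]]:
    """For each window of start positions, keep the probe whose summed
    position scores are highest. Returns (start, sequence) pairs."""
    scores = position_scores(query, paralogs)
    last_start = len(query) - probe_len
    probes = []
    for w in range(0, last_start + 1, window):
        starts = range(w, min(w + window, last_start + 1))
        # max() keeps the earliest start when several probes tie
        best = max(starts, key=lambda s: sum(scores[s:s + probe_len]))
        probes.append((best, query[best:best + probe_len]))
    return probes
```

The defaults of 8 and 20 mirror the window size and probe length defaults given in the help manual.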

ProbeChecker. If the user checks the option called "check probes for exact matches to the paralogs," VNB invokes a unique routine called ProbeChecker, which discards any probe that exactly matches any of the paralogs in the same position in the alignment.
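As a rough sketch of this filter, assuming probes are kept as (start, sequence) pairs in alignment coordinates and that the paralog rows share those coordinates (both assumptions, as is the function name):

```python
from typing import List, Tuple


def probe_checker(probes: List[Tuple[int, str]],
                  paralogs: List[str]) -> List[Tuple[int, str]]:
    """Discard any probe that is identical to a paralog over the same
    alignment positions; keep all others in their original order."""
    kept = []
    for start, seq in probes:
        end = start + len(seq)
        if any(p[start:end] == seq for p in paralogs):
            continue  # exact positional match to a paralog: not gene-specific
        kept.append((start, seq))
    return kept
```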

Query dbEST. The list of gene-specific probes is concatenated into a string S, where each probe is separated by characters considered invalid by BLAST. S is then queried against dbEST. The parameters of BLAST are chosen such that no alignment can involve more than one of the probes in S: this is achieved by setting a very high cost to open and extend a gap in the alignment. This unique feature saves computation time because separate queries are not required for each probe. The EST sequences that score the highest (which correspond to an exact match to one of the probes) are found in the information returned from BLAST and their accession numbers are stored.
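The bookkeeping behind the single concatenated query can be sketched as follows. The separator shown is only a placeholder, since the text does not say which characters VNB uses, and the helper names are hypothetical. Coordinates here are 0-based and half-open; BLAST reports 1-based inclusive query coordinates, so a caller would convert before the lookup.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

SEPARATOR = "#"  # placeholder only; the real separator is not given in the text


def build_query(probes: List[str], sep: str = SEPARATOR) -> Tuple[str, List[int]]:
    """Concatenate the probes into one string S and record where each starts."""
    offsets, pos = [], 0
    for p in probes:
        offsets.append(pos)
        pos += len(p) + len(sep)
    return sep.join(probes), offsets


def probe_for_hit(offsets: List[int], probes: List[str],
                  q_start: int, q_end: int) -> Optional[int]:
    """Return the index of the probe that the query interval [q_start, q_end)
    covers exactly, or None if the interval is not a full-length probe match."""
    i = bisect_right(offsets, q_start) - 1
    if i >= 0 and q_start == offsets[i] and q_end == offsets[i] + len(probes[i]):
        return i
    return None
```

Keeping the offsets alongside S lets each exact hit be attributed to one probe without issuing a separate query per probe.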

Getting Library IDs for EST Hits. VNB takes the accession numbers of the EST hits and retrieves annotation for them using the batch Entrez feature at NCBI. This is done by using the following URL, http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucest&id=inputString, where inputString is a list of comma-delimited accession numbers. The program then takes the returned annotation and repeatedly scans for the string "Lib name." VNB thereby collects the library identifier of each EST hit, which identifies the library from which the EST is derived.
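A small sketch of these two steps follows. The URL prefix is the one quoted above; the layout of the returned annotation is not described in the text, so the pattern below assumes that the library identifier is whatever follows "Lib name" (and an optional colon) on the same line. That layout and the function names are assumptions.

```python
import re
from typing import List

ENTREZ_URL = "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucest&id="

# Assumed record layout: "Lib name: <identifier>" on a single line
LIB_NAME = re.compile(r"Lib name:?[ \t]*([^\r\n]+)")


def batch_url(accessions: List[str]) -> str:
    """Build the batch request URL from a list of EST accession numbers."""
    return ENTREZ_URL + ",".join(accessions)


def library_ids(annotation: str) -> List[str]:
    """Scan the returned annotation for every 'Lib name' entry, in order."""
    return [m.strip() for m in LIB_NAME.findall(annotation)]
```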

Annotation for Library IDs and Generating Output. VNB then uses the EST library annotation available at CGAP to match each library ID to the tissue from which the library was derived. The following URL is used to query for information about all EST libraries: http://cgap.nci.nih.gov/Tissues/LibraryQuery?ORG=organism&SCOPE=est&PAGE=0&SORT=tissue. The following URL is used to query for information about only the quantitative libraries (which have not been normalized): http://cgap.nci.nih.gov/Tissues/LibraryQuery?ORG=organism&SCOPE=est&PROT=non-normalized&PAGE=0&SORT=tissue (here organism is either “Mm” or “Hs” based on whether the user selected “mouse” or “human”). CGAP returns a tab-delimited file in which each line corresponds to a different library, with fields describing the organism and tissue from which each library is derived. Once the relevant information has been obtained about the libraries from which the EST hits are derived, the quantitative and qualitative profiles are produced as described in Figure 1S. The counts for the top tissues (at most ten, for ease of presentation) in the qualitative and quantitative profiles are also displayed as graphs, using graphics software called chart2d (http://chart2d.sourceforge.net/). In addition, VNB uses a “web-crawler” to view detailed descriptions of the EST libraries given by CGAP on separate web pages. The description of each library has a field called "#Sequences Generated to Date," which gives the number of ESTs in each library. These numbers are summed for all libraries from each defined tissue and the sums are used to generate the absolute quantitative profile, as described in Methods. As before, the profile is also displayed as a graph using chart2d.
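The query construction and tallying can be sketched as below. The URL parameters are those quoted above. The column positions of the library ID and tissue in the tab-delimited file are not stated in the text, so they are hypothetical parameters here, as are the function names.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, Tuple

CGAP_QUERY = "http://cgap.nci.nih.gov/Tissues/LibraryQuery"


def cgap_url(organism: str, quantitative_only: bool = False) -> str:
    """Build the library query URL; organism is 'Mm' or 'Hs'."""
    params = [("ORG", organism), ("SCOPE", "est")]
    if quantitative_only:
        params.append(("PROT", "non-normalized"))
    params += [("PAGE", "0"), ("SORT", "tissue")]
    return CGAP_QUERY + "?" + "&".join(f"{k}={v}" for k, v in params)


def library_tissues(tsv_text: str, id_col: int = 0,
                    tissue_col: int = 1) -> Dict[str, str]:
    """Map library ID to tissue; the column indices are assumptions."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    needed = max(id_col, tissue_col)
    return {row[id_col]: row[tissue_col] for row in reader if len(row) > needed}


def top_tissues(hit_library_ids: List[str], lib_to_tissue: Dict[str, str],
                n: int = 10) -> List[Tuple[str, int]]:
    """Tally EST hits per tissue and return at most n tissues, highest first."""
    counts = Counter(lib_to_tissue[lib] for lib in hit_library_ids
                     if lib in lib_to_tissue)
    return counts.most_common(n)
```

Running the tally once against the full library list and once against the non-normalized list would give the inputs for the qualitative and quantitative profiles, respectively.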

Figure 1S. Explanation of How VNB Derives the Values in the Output File. A typical output (same as shown in Fig. 2) is shown in colored type; qualitative profile is in light brown and quantitative profile is in brown. Text beside each arrow explains how VNB derives each value.

Output Files. VNB generates an email that specifies the parameters chosen by the user and the query name. It then lists several output files that can be retrieved from the VNB server. To name the files, a queryId is generated using the name of the query as well as the date/time. The files can be accessed at the following URL: tlab.bu.edu/vnbOutput/queryId_output, where output is the specific name of each file. First, there is a text file containing the two profiles, called output.txt. This file is accompanied by several graphical representations of the top tissues in each profile. The first depicts the top tissues in the non-quantitative profile in a file called nonQuantProfileImage.jpg. The second depicts the top tissues in the quantitative profile in a file called quantProfileImage.jpg. The third depicts the top tissues expressed as a percentage of all expressed mRNAs in the absolute quantitative profile in a file called absQuantProfileImage.jpg.
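The naming scheme can be illustrated with a short sketch. The base URL and file names are those given above; the exact date/time format used in the queryId is not stated, so the format below is purely an assumption, as are the function names.

```python
from datetime import datetime
from typing import List

BASE = "tlab.bu.edu/vnbOutput/"
PROFILE_FILES = ["output.txt", "nonQuantProfileImage.jpg",
                 "quantProfileImage.jpg", "absQuantProfileImage.jpg"]


def query_id(name: str, when: datetime) -> str:
    """Combine the query name with a date and time stamp.
    The stamp format (YYYYMMDD_HHMMSS) is hypothetical."""
    if not (name.isascii() and name.isalnum()):
        raise ValueError("query name may contain only letters and digits")
    return f"{name}_{when:%Y%m%d}_{when:%H%M%S}"


def output_links(name: str, when: datetime) -> List[str]:
    """List the profile file locations for one query."""
    qid = query_id(name, when)
    return [f"{BASE}{qid}_{f}" for f in PROFILE_FILES]
```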

The email also provides links to several files that VNB generates in the course of its computations. These files are useful for analysis of the output, and include the alignment used by AutoProbe (in a file called alignment.txt), the list of probes generated by AutoProbe (in a file called probes.txt), and the hyperlinked list of accession numbers for the EST hits (in a file called ESThits.html). The hyperlinked list of accession numbers allows easy access to the Entrez entry for each EST for manual assessment.

Text of Email returned from VNB

From: vnb@bu.edu

Subject: vnb output for QueryName

Sent: Date, Time

The VNB program has completed the mining and analysis of the expression profile for your mouse/human gene query "QueryName." The program used the parameters of X bp window size and Y bp probe length. The profile excludes/includes ESTs from cancerous tissues and does/does not check probes for exact matches to the paralogs.

The complete qualitative (using all ESTs) and quantitative (only using ESTs from non-normalized libraries) profiles can be viewed here: tlab.bu.edu/vnbOutput/QueryName_Date_Time_output.txt.

A graph of the top tissues (at most 10) in the qualitative profile can be viewed here: tlab.bu.edu/vnbOutput/QueryName_Date_Time_nonQuantProfileImage.jpg.

A graph of the top tissues (at most 10) in the quantitative profile can be viewed here: tlab.bu.edu/vnbOutput/QueryName_Date_Time_quantProfileImage.jpg.

A graph of the top tissues (at most 10) in the absolute quantitative profile can be viewed here: tlab.bu.edu/vnbOutput/QueryName_Date_Time_absQuantProfileImage.jpg.

Sometimes it is useful to analyze how the VNB profile was obtained, in particular to check whether the alignment was correct and included or excluded the appropriate paralogs (alignment), to view the actual sequences of the probes that were used to query dbEST (probes), or to view the accession numbers of all the ESTs that were identified as belonging to the query gene (estHits). These accession numbers can be used to identify clones from which cDNA for the query gene can be obtained. These files (which will be available for a limited time) can be viewed using the following links:

alignment: tlab.bu.edu/vnbOutput/QueryName_Date_Time_alignment.txt

probes: tlab.bu.edu/vnbOutput/QueryName_Date_Time_probes.txt

EST hits: tlab.bu.edu/vnbOutput/QueryName_Date_Time_estHits.html

Qualitative Analysis of ESTs from UniGene

Figure 7 in the text shows the validity of VNB in obtaining qualitative or semi-quantitative gene-expression profiles using the actin and aldolase gene families. The performance of UniGene in the same validation is shown in Fig. 2S. As shown in Fig. 2S, left panel, UniGene does well, but it is clear that there are likely false positives in the data for smooth-muscle actin (C), where over 10% of the actin expressed in muscle is identified as this isoform, whereas it is known from experimental data that nearly all the actin in muscle is skeletal (S) (Gunning et al., 1983). In heart, smooth-muscle actin again appears to be overrepresented compared to experimental determinations. As shown in Fig. 2S, right panel, UniGene does much better for the aldolase isozymes. The pattern of expression of the genes for aldolases A, B, and C (aldoA, aldoB, and aldoC) in muscle, heart, brain, liver, and kidney matches that obtained by VNB. The possible exception is aldolase C, which is underrepresented in UniGene data in muscle (compared to VNB), heart (compared to VNB and experimental data (Lebherz and Rutter, 1969)), and brain (compared to experimental data). This could be because UniGene misidentifies many aldolase C ESTs as aldolase A ESTs (aldolase A and C are more similar to each other (85%) than they are to aldolase B).

Figure 2S. Assessment of Qualitative Expression Profiles from UniGene. Left panel, UniGene-generated expression (solid bar) of genes for mouse skeletal α-actin (acta1) (S) and smooth-muscle (aorta) α-actin (acta2) (C) in skeletal muscle and heart plotted as a percentage of the total for each isoform in a tissue. This was plotted similarly to experimentally determined expression (cross-hatched bars) of α-actin from the same tissues, which was quantified as described in Materials & Methods. Right panel, UniGene-generated expression pattern for mouse aldoA, aldoB, and aldoC denoted by letters A, B, and C, respectively. The number of ESTs for each aldolase found in muscle, heart, brain, liver, and kidney were normalized to the total aldolase ESTs in each tissue. Queries for mouse genes were: acta1 [GenBank:NM_009606], acta2 [GenBank:NM_007392], aldoA [GenBank:NM_007438], aldoB [GenBank:NM_144903], and aldoC [GenBank:NM_009657].

Text of VNB Help Manual

VIRTUAL NORTHERN BLOT (VNB)

GENERAL INFORMATION:

VNB is a program that will query the large collection of EST sequences in the public domain at NCBI's dbEST (http://www.ncbi.nlm.nih.gov/dbEST/) for tissue-specific gene expression for any mRNA (cDNA) sequence from mouse or human in which you are interested. The cDNA sequence in which you are interested (your query) should be provided as a plain text file. The program will use an alignment of your query with other genes in the gene family and, from that alignment, generate a set of gene-specific probes. VNB uses these probes to find exact matches in dbEST using the BLAST server at NCBI. The alignment is either generated from RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/) or provided by the user using the ClustalW site (http://www.ebi.ac.uk/clustalw/). The ESTs that are retrieved as "hits" are tallied according to tissue and two profiles are provided: a non-quantitative profile that categorizes all the hits by tissue-of-origin regardless of how the library was generated (listed in numerical order), and a quantitative profile that tallies only the hits from libraries that had not been manipulated in any way, such that the ESTs from these libraries reflect the mRNA population of the tissue of origin. The expression levels are calculated from these hits by dividing by the total number of such ESTs in dbEST from that tissue, and the standard deviation of the expression level is calculated using a Poisson distribution. All results are available after several minutes and an email is returned with links to these files.
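As a worked illustration of the expression-level calculation, the sketch below assumes the usual Poisson treatment of a count, in which the standard deviation of k hits is estimated as the square root of k, so that the expression level k/N has standard deviation sqrt(k)/N. The manual does not spell out the formula, so this is an assumption, and the function name is hypothetical.

```python
import math
from typing import Tuple


def expression_level(hits: int, total_ests: int) -> Tuple[float, float]:
    """Return (level, standard deviation) for one tissue.

    level = hits / total_ests
    sd    = sqrt(hits) / total_ests   (Poisson counting error, assumed)
    """
    if total_ests <= 0:
        raise ValueError("total_ests must be positive")
    if hits < 0:
        raise ValueError("hits must be non-negative")
    return hits / total_ests, math.sqrt(hits) / total_ests
```

For example, 25 hits among 10,000 ESTs from a tissue would give a level of 0.0025 with a standard deviation of 0.0005 under this assumption.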

SET YOUR INPUT:

Your query must be a nucleotide sequence in a plain text *file* containing only {G, A, T, or C}. All spaces and returns are ignored. You can NOT cut and paste the sequence in the window. No "word" files will work. Use the "Browse" button to upload a TEXT file in the above-mentioned format. Again, .doc files will NOT work. The name of your file will appear in the adjacent window (not visible with some browsers).

You can also enter your query in the form of an alignment generated using ClustalW. Your query sequence should be at the top, with all the paralogs from which you would like it distinguished in the tissue-specific gene expression profile listed below it. When you use ClustalW, do not change any of the default settings for the standard output. The plain text *file* should have the sequences aligned with the names of each on the left and the base numbers on the right. The positions in the alignment that are invariant are noted by an asterisk at the bottom of the alignment. This option is provided because the alignment generated from the sequences currently available in RefSeq may not contain all the paralogs due to the incomplete nature of RefSeq.

IMPORTANT: In either case, be certain to remove from your query any poly(A) sequences or other well-conserved repeat sequences that are in many mRNAs, as this will cause large numbers of meaningless hits.

SELECT YOUR PARAMETERS:

1. Organism; to use VNB, you must first select an organism. The only organisms that currently have all their libraries indexed at the CGAP site (http://cgap.nci.nih.gov/Tissues/LibraryFinder) are mouse and human. CGAP distinguishes libraries from normal tissues and cancerous tissues, and quantitative libraries from non-quantitative ones.

2. Exclude ESTs from cancerous tissue in profile; If this is checked, the results will only tally EST hits from libraries derived from normal non-diseased tissues. If you do not choose this option, cancerous tissue profiles, in addition to normal tissues, will be included in your profile.

3. Check probes for exact matches to paralogs; when this option is invoked by checking the box, an additional routine called /ProbeChecker/ is implemented. First, the BLAST alignment that is generated for your query is modified to remove any entries from RefSeq that are not genuine paralogs (but rather might be from alternative splicing, alternative poly-adenylation, or mistakes, etc.). It does this by asking whether a putative paralog has >95% identity with your query. If so, it is removed. Second, the set of probes generated from your query is matched against the remaining paralogs, and any probes that are 100% matches to any of the paralogs of your query are removed. The effect of this is to decrease the false positives in the results. This, however, will also decrease the sensitivity. Caution should be used when using ClustalW alignments that include alternative transcripts from the same gene. This may lead to the program considering one of them as a true paralog, and if "Check probes" is selected, most of the probes will be rejected, resulting in very few hits.

4. Window size; the default size is 8 nucleotide bases, which means that a "best-fit" probe is chosen from your query every 8 nucleotides. By decreasing this number you generate more probes, which will take longer to run, but will increase your sensitivity. Window sizes under 8 have shown an increase in false positives. By increasing this number, you generate fewer probes, which runs faster, but will decrease your sensitivity.

5. Probe length; the default size is 20 nucleotide bases, which means that the probes generated from your query for matching to entries in dbEST will be 20 nucleotides in length. Probe sizes longer than 20 usually give less sensitivity, but more specificity (fewer false positives). Probe sizes shorter than 20 usually give more sensitivity, but less specificity (more false positives). These defaults were judged optimal for queries that came from gene families that ranged from 45-90% identity for the closest paralog to the query.
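The first stage of /ProbeChecker/ described under parameter 3 (removing entries that are too similar to the query to be genuine paralogs) could look something like the sketch below. How VNB measures identity is not stated; here it is assumed to be the percentage of matching bases over alignment columns where neither sequence has a gap. The function names are hypothetical.

```python
from typing import Dict


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical bases over columns where both rows have a base
    (assumed definition of identity)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def drop_near_identical(query: str, candidates: Dict[str, str],
                        threshold: float = 95.0) -> Dict[str, str]:
    """Keep only candidates whose identity to the query does not exceed the
    threshold; entries above it are treated as variants of the query gene."""
    return {name: seq for name, seq in candidates.items()
            if percent_identity(query, seq) <= threshold}
```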

GETTING YOUR RESULTS:

1. Name your query; this will be used to identify your result file. The name you give to your query may only contain digits and letters because it will be part of a file name.

2. Email address; your results will be sent to you via email. The email will provide links to the files generated from your query. These files remain on the VNB server for a limited time. In addition to the qualitative and quantitative list of the number of EST hits for each tissue, graphic displays of each are included, as well as files for the intermediate results; the alignment used for generation of the probes, the sequences of the probes used to query dbEST, and a linked list of accession numbers for each EST hit.

A significant portion of the VNB program is a "web robot," which means that it uses other web sites to compile the information. It is therefore dependent on the "traffic" on those servers. Generally, VNB should return results within 10-90 minutes, unless the servers it uses are down or slower than normal. If an error message is returned in your email, you should try it at a later time because the problem is most likely due to problems with the BLAST server. This program works best at BLAST off-peak hours, which are typically 10 PM to 6 AM EST.