Abstract
Biomedical literature can offer valuable information for organizing genes associated with the etiology and pathogenesis of disease. In this study, we demonstrate the utility of existing phylogenetic methods for organizing 375 genes associated with Breast Cancer using the MeSH annotations from over 35,000 Medline articles. Specifically, we compare the clustering (using the Colless Imbalance Index, Ic) of distance-based methods, which are used by popular phylogenetic clustering algorithms, and a characteristic-based method (Maximum Parsimony) that is commonly used for phylogenetic studies. Focusing on genes that cluster around BRCA1 and BRCA2, we examine the relevance of the clustered genes proposed by the different clustering methods based on the number of exclusive MeSH terms. Our results indicate that existing phylogenetic methods and associated metrics can be used for organizing genes according to annotated knowledge in biomedical literature.
INTRODUCTION
An essential task in translational bioinformatics is to identify genes or groups of genes that may provide insight into the etiology and pathogenesis of complex genetic disorders1. Staying abreast of recent findings and identifying relevant correlations from rapidly growing corpora of knowledge is often intractable (e.g., Medline is growing by approximately 1 million articles per year). Automated tools and centralized databases offer some help in identifying putative linkages between genes and their products for specific biological domains2. However, much of this knowledge remains embedded within natural-language literature3, which motivates the development of clustering approaches that identify possible correlations between described entities (e.g., genes in the context of a given disease).
A number of knowledge bases have emerged to navigate pertinent gene knowledge, including information embedded in biomedical literature. PubGene4, for example, identifies co-occurrence of gene names within literature and then constructs a matrix of co-occurring genes with MeSH (Medical Subject Headings)5 annotation terms. To date, most gene correlation analyses are done by first assembling a matrix of genes and a set of scores denoting the ‘similarity’ between every pairwise combination. This matrix is then subjected to a clustering analysis, and the results are often shown in the form of a hierarchical tree. The resulting hierarchical organization represents the relationship of the taxa (singular taxon; in this case, the genes) relative to a statistically derived similarity metric of the characteristics (the units that are compared for arriving at the similarity scores; in this case, the MeSH terms). Therefore, the relationships represented between the taxa reflect the clustering according to the similarity criteria chosen. However, there is no means to ascertain which specific characteristics contribute to each resulting grouping.
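The similarity-matrix step described above can be illustrated with a minimal Python sketch. The Jaccard distance, the function names, and the toy annotations are assumptions chosen for illustration; they are not the metric or data used in this study.

```python
from itertools import combinations


def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(annotations: dict) -> dict:
    """Return {(gene_i, gene_j): distance} for every unordered pair of genes."""
    return {
        (g1, g2): jaccard_distance(annotations[g1], annotations[g2])
        for g1, g2 in combinations(sorted(annotations), 2)
    }


# Hypothetical toy annotations (not data from the study).
toy_annotations = {
    "GENE_A": {"Apoptosis", "DNA Repair", "Mutation"},
    "GENE_B": {"DNA Repair", "Mutation"},
    "GENE_C": {"Cell Cycle"},
}
toy_distances = distance_matrix(toy_annotations)
```

Each pair of genes is reduced to a single number, so the identity of the shared terms is no longer visible to whatever clustering step follows; this is the loss of characteristic information noted above.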
Similarity-based clustering methods are referred to in phylogenetics as ‘phenetic’ methods. An alternate class of methods to organize biological data (e.g., species according to molecular sequence) is termed ‘cladistic’. Cladistic methods organize taxa according to their shared characteristics6. Considering every characteristic as an independent entity, a matrix is first constructed where each row represents each taxon, and each column represents a characteristic. The particular value of each characteristic is referred to as the ‘characteristic state’ (e.g., 1 and 0 can respectively represent ‘present’ or ‘absent’). Possible tree topologies for organizing the taxa are then explored and the ‘best’ tree is chosen based on an optimality criterion (e.g., the minimum number of characteristic state changes that explain the tree topology). Therefore, like phenetic clustering methods, the result of how the taxa relate to each other is represented in a hierarchical fashion. However, unlike phenetic methods, examination of the resulting tree topology will reveal specific characteristics relative to each posited grouping within a philosophically consistent framework.
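To make the optimality criterion concrete, the following sketch counts the minimum number of characteristic state changes on a fixed tree using Fitch's small-parsimony algorithm. It assumes rooted, strictly bifurcating trees written as nested tuples and unordered states; the toy matrix and tree names are hypothetical, and a full analysis would additionally search over many candidate topologies.

```python
def fitch_changes(tree, states):
    """Return (state_set, changes) for one characteristic on a rooted binary tree.

    `tree` is a leaf name (str) or a 2-tuple of subtrees; `states` maps each
    leaf name to its observed state (e.g. 0 or 1).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch_changes(tree[0], states)
    right_set, right_changes = fitch_changes(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    # Disjoint state sets force one extra state change at this node.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, matrix):
    """Sum the Fitch changes over every column of a taxon -> state-list matrix."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch_changes(tree, {taxon: row[i] for taxon, row in matrix.items()})[1]
        for i in range(n_chars)
    )


# Hypothetical presence/absence matrix: rows are taxa, columns are characteristics.
toy_matrix = {"A": [1, 1, 0], "B": [1, 1, 0], "C": [0, 1, 1], "D": [0, 0, 1]}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
best_tree = min((tree_1, tree_2), key=lambda t: tree_length(t, toy_matrix))
```

Under this criterion the topology requiring fewer state changes is preferred, and the per-column counts show which characteristics support each grouping.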
In this study, we consider 375 genes that are associated with breast cancer. First, a presence/absence matrix of MeSH terms that are associated with each gene is crafted from over 35,000 relevant Medline citations. This matrix is then subjected to two phenetic methods, Unweighted Pair Group Method with Arithmetic Mean (UPGMA)7 and Neighbor-Joining (NJ)8, and a commonly used cladistic method, Maximum Parsimony (MP)6. We first compare the tree topologies inferred according to a metric of tree balance, the Colless Imbalance Index (Ic)9, to assess each respective method’s ability to cluster genes. We then assess the ability of each method to identify exclusive MeSH terms for describing 16 genes that cluster into 15 groups around the BRCA1 and BRCA2 genes.
MATERIALS & METHODS
Identification of Breast Cancer Genes and Relevant Medline Citations
A list of breast cancer genes and their associated gene symbols was identified from the Entrez Gene resource10. Gene aliases and gene symbols were reconciled using the HUGO Gene Nomenclature Committee list of accepted gene names11. This resulted in a list of 375 breast cancer genes, each of which was used in an individual PubMed query in combination with the disease phrase “breast cancer”.
To prevent automatic term expansion of query gene phrases, we used the [TW] tag, which forces PubMed to search for only the exact text phrase in the titles, abstracts, and MeSH terms. For example, to identify the breast cancer citations associated with the gene BRCA1, we performed the following PubMed query: “BRCA1[TW] AND breast cancer” – which results in 3,243 citations at the time of this writing. Using this method, we identified 36,584 citations associated with the aforementioned 375 breast cancer genes.
Assembly of Gene-MeSH Term Matrix
A Perl script was used to extract the MeSH terms from each set of citations associated with each breast cancer gene, based on a local version of Medline obtained by lease from the National Library of Medicine. The script then assembled the 375 breast cancer genes as taxa and the 7,248 MeSH terms as characteristics into a NEXUS file (a standard file format used by many phylogenetic applications). The characteristic states were set to either ‘1’ (present; at least one citation contained the MeSH term for a given gene) or ‘0’ (absent; no citations contained that particular MeSH term). This file is available upon request from the authors.
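The matrix assembly described above was done with a Perl script that is not reproduced here. As a rough Python sketch of the same idea, the code below builds presence/absence rows and writes a minimal NEXUS data block. The function names and toy input are hypothetical, and the sketch assumes taxon names contain no spaces or punctuation that would require quoting in NEXUS.

```python
def build_presence_absence(gene_terms: dict) -> tuple:
    """Return (sorted term list, {gene: '0/1' string}) from gene -> set of terms."""
    terms = sorted(set().union(*gene_terms.values()))
    rows = {
        gene: "".join("1" if term in present else "0" for term in terms)
        for gene, present in gene_terms.items()
    }
    return terms, rows


def to_nexus(rows: dict, n_chars: int) -> str:
    """Serialise the rows as a NEXUS DATA block with binary standard characters."""
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(rows)} NCHAR={n_chars};",
        '  FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=?;',
        "  MATRIX",
    ]
    lines += [f"    {gene}  {states}" for gene, states in rows.items()]
    lines += ["  ;", "END;"]
    return "\n".join(lines)


# Hypothetical toy input (not data from the study).
toy_gene_terms = {
    "BRCA1": {"DNA Repair", "Mutation"},
    "TP53": {"Apoptosis", "Mutation"},
}
toy_terms, toy_rows = build_presence_absence(toy_gene_terms)
nexus_text = to_nexus(toy_rows, len(toy_terms))
```

Sorting the terms fixes the column order, so the same input always yields the same matrix.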
Phylogenetic Analyses
The NEXUS file was loaded into PAUP* (v.4.0b10), a commonly used phylogenetic analysis package12. The UPGMA and NJ trees were determined using the default settings. The MP search was performed using the PaupRat Parsimony Ratchet as implemented in PAUP*, which ensures a more exhaustive search of different possible tree topologies13. While UPGMA and NJ each resulted in a single tree (as expected from the respective algorithms), MP resulted in three equally ‘best’ trees. These trees were combined into a single consensus tree using the strict consensus method, which results in a final tree topology based on concordant groups of taxa across the different tree topologies.
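The strict consensus step can be sketched as a set intersection: a group of taxa is retained only if it appears in every input tree. The nested-tuple tree encoding and the three toy trees below are assumptions for illustration and do not correspond to the three MP trees found in this study.

```python
def clades(tree) -> set:
    """Collect the leaf set of every internal node of a nested-tuple tree."""
    found = set()

    def leaves(node) -> frozenset:
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def strict_consensus_clades(trees) -> set:
    """Keep only the clades that appear in every input tree."""
    return set.intersection(*(clades(tree) for tree in trees))


# Three hypothetical, equally short trees over the same five taxa.
toy_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "B"), ("D", "E")), "C"),
    (("A", "B"), (("D", "E"), "C")),
]
consensus = strict_consensus_clades(toy_trees)
```

Here only the groups {A, B} and {D, E} (plus the group of all taxa) survive, because the placement of C differs among the input trees.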
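PAUP* was then used to assess the balance of each tree using the Colless Imbalance Index (Ic), which is reported as a value from 0 (low imbalance) to 1 (high imbalance) based on the number of tips in the tree (n), relative to the number and size of the clusters formed (r and s, the numbers of tips in the two clusters subtended by each interior node, where r ≥ s):

$$I_c = \frac{\sum_{\text{interior nodes}} (r - s)}{(n-1)(n-2)/2}$$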
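As a minimal sketch of how such an index can be computed, the code below sums the difference in tip counts between the two clusters at every interior node and divides by (n-1)(n-2)/2, the maximum possible for n tips. It assumes a rooted, strictly bifurcating tree encoded as nested tuples, so a consensus tree containing polytomies would need separate handling; the function name and toy trees are hypothetical.

```python
def colless_index(tree) -> float:
    """Normalised Colless imbalance of a rooted, strictly bifurcating tree."""
    total = 0

    def tips(node) -> int:
        nonlocal total
        if isinstance(node, str):
            return 1
        left, right = tips(node[0]), tips(node[1])
        total += abs(left - right)  # r - s at this interior node
        return left + right

    n = tips(tree)
    if n < 3:
        raise ValueError("index is undefined for fewer than three tips")
    return total / ((n - 1) * (n - 2) / 2)


# Hypothetical four-tip trees: fully pectinate versus fully balanced.
pectinate = ((("A", "B"), "C"), "D")
balanced = (("A", "B"), ("C", "D"))
pectinate_ic = colless_index(pectinate)
balanced_ic = colless_index(balanced)
```

A fully pectinate (comb-like) tree reaches the upper bound of the index, whereas a perfectly symmetric tree scores zero.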
The resulting trees were manually examined using MacClade (v.4.08), a graphical tool to view and analyze phylogenetic trees14. We zoomed in on the 15 hierarchical clusters determined by each method around the BRCA1 and BRCA2 genes. Based on the predicted clusters, we used a Perl script to tabulate the number of common MeSH terms and their respective Semantic Types (based on the 2006 release of MeSH) that were exclusive to each cluster (as determined from the tree nodes for just the BRCA genes).
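The tabulation script itself is not shown. One possible reading of "exclusive" is sketched below: a term counts for a cluster if every gene in the cluster carries it and no gene outside the cluster does. This interpretation, the function name, and the toy annotations are assumptions for illustration only.

```python
def exclusive_terms(gene_terms: dict, cluster: set) -> set:
    """Terms shared by every gene in `cluster` and absent from all other genes."""
    if not cluster:
        return set()
    shared = set.intersection(*(gene_terms[gene] for gene in cluster))
    outside = set().union(
        *(terms for gene, terms in gene_terms.items() if gene not in cluster)
    )
    return shared - outside


# Hypothetical toy annotations (not data from the study).
toy_cluster_terms = {
    "BRCA1": {"DNA Repair", "Mutation", "Ovarian Neoplasms"},
    "BRCA2": {"DNA Repair", "Mutation", "Ovarian Neoplasms", "Fanconi Anemia"},
    "TP53": {"Mutation", "Apoptosis"},
}
brca_only = exclusive_terms(toy_cluster_terms, {"BRCA1", "BRCA2"})
```

The comparison is limited to the genes supplied, which mirrors restricting the tabulation to the genes in the examined region of the tree.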
RESULTS
Overall Tree Topologies and Tree Balance
The three resulting tree topologies inferred in this study were both quantitatively and qualitatively different. The UPGMA tree was highly imbalanced (Ic=0.96), while the NJ and MP trees were more balanced (Ic values of 0.24 and 0.17, respectively). Based on examining the graphical representations of each tree (shown in Figure 1), the MP tree was the least pectinate (comb-like) in structure, whereas the NJ and UPGMA trees were increasingly pectinate, which indicates increasingly agglomerative groupings (a negative topological characteristic in phylogenetic studies).
Figure 1.
Graphical representations for the UPGMA (left), NJ (center), and MP (right) trees are shown along with their respective Ic values, demonstrating the varying levels of clustering. A box is drawn around the cluster of 16 genes that include the BRCA1 and BRCA2 genes, highlighted in Figure 2.
Clusters Around BRCA Genes
The 15 clusters around the BRCA1 and BRCA2 genes consisted of 16 genes in each of the inferred tree topologies. Between the UPGMA and NJ trees, 11 of the genes were the same; 13 of the genes were the same between both the UPGMA and MP trees as well as the NJ and MP trees.
Focusing on the 15 clusters around the BRCA1 and BRCA2 genes, we qualitatively observed that the MP tree had more distinct sub-groupings, while the UPGMA and NJ clusterings were more pectinate (Figure 2). Of the 15 clusters formed in each tree (enumerated as node labels in Figure 2), only the BRCA1 and BRCA2 genes were consistently grouped (nodes 13, 13, and 8 for UPGMA, NJ, and MP, respectively). The NJ and MP trees agreed on an additional three groupings (nodes 12, 14, and 15 from the NJ tree with nodes 7, 10, and 11 from the MP tree).
Figure 2.
Sixteen genes that formed 15 clusters around the BRCA genes, with each cluster shown as a node label on each respective tree.
MeSH Terms and Semantic Type Composition of Clusters Around BRCA Genes
We tabulated the number of exclusive MeSH terms and their respective UMLS Semantic Types across each of the clusters (as inferred from the nodes in each tree from Figure 2). As shown in Table 1, the average numbers of MeSH terms associated with each cluster in the UPGMA and NJ trees were 342 and 361, respectively; the average number of Semantic Types associated with each node was 64 for both the UPGMA and NJ trees. The average numbers of MeSH terms and Semantic Types associated with each cluster in the MP tree, on the other hand, were 440 and 59, respectively. We assessed the distinctness of the clusters using the ratio of MeSH terms relative to the respective number of Semantic Types. We observed that this ratio (M/S) increased from 5.3 to 5.5 between the phenetic methods, and to 7.3 for the cladistic method. Finally, to assess the likelihood of each node occurring by chance alone, we used PAUP* to perform a phylogenetic bootstrapping technique that randomly selected MeSH terms and assigned confidence scores to each node. This analysis identified four, six, and ten nodes with significant support (i.e., bootstrap scores >50%) for UPGMA, NJ, and MP, respectively.
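The resampling idea behind the bootstrap scores can be sketched as follows: characteristics (columns) are drawn with replacement to form a pseudo-replicate matrix, a tree is inferred from each replicate, and the support for a node is the percentage of replicates that recover it. The tree inference itself is omitted here, and the function names and toy data are hypothetical rather than a description of PAUP*'s internals.

```python
import random


def bootstrap_columns(rows: dict, rng: random.Random) -> dict:
    """Resample characteristics (columns) with replacement, keeping every taxon."""
    n_chars = len(next(iter(rows.values())))
    picks = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {taxon: [states[i] for i in picks] for taxon, states in rows.items()}


def support(replicate_clades: list, clade: frozenset) -> float:
    """Percentage of bootstrap replicates whose tree contains `clade`."""
    hits = sum(clade in found for found in replicate_clades)
    return 100.0 * hits / len(replicate_clades)


# Hypothetical toy matrix and a hand-written list of per-replicate clade sets.
toy_states = {"A": [1, 0, 1], "B": [0, 0, 1]}
replicate = bootstrap_columns(toy_states, random.Random(0))

ab = frozenset({"A", "B"})
cd = frozenset({"C", "D"})
ab_support = support([{ab}, {ab, cd}, {cd}, {ab}], ab)
```

Because the same column indices are applied to every taxon, each resampled column is still a genuine characteristic from the original matrix.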
Table 1.
MeSH terms, respective Semantic Types (STY), their ratio (M/S), and significant (>50%) Bootstrap scores (BS) exclusive to each of the 15 clusters examined around the BRCA genes.
| Node/Cluster | UPGMA MeSH | UPGMA STY | UPGMA M/S | UPGMA BS | NJ MeSH | NJ STY | NJ M/S | NJ BS | MP MeSH | MP STY | MP M/S | MP BS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 110 | 61 | 1.8 | - | 108 | 56 | 1.9 | - | 108 | 50 | 2.2 | - |
| 2 | 119 | 60 | 2.0 | - | 119 | 57 | 2.1 | - | 143 | 55 | 2.6 | 98 |
| 3 | 124 | 58 | 2.1 | - | 131 | 55 | 2.4 | - | 157 | 70 | 2.2 | 100 |
| 4 | 130 | 60 | 2.2 | - | 146 | 60 | 2.4 | - | 183 | 65 | 2.8 | 52 |
| 5 | 142 | 65 | 2.2 | - | 160 | 66 | 2.4 | - | 246 | 65 | 3.8 | 100 |
| 6 | 151 | 58 | 2.6 | - | 174 | 61 | 2.9 | - | 376 | 68 | 5.5 | 96 |
| 7 | 179 | 62 | 2.9 | - | 195 | 68 | 2.9 | - | 568 | 68 | 8.4 | 100 |
| 8 | 215 | 66 | 3.3 | - | 209 | 64 | 3.3 | - | 984 | 73 | 13.5 | 100 |
| 9 | 233 | 67 | 3.5 | 82 | 233 | 69 | 3.4 | - | 471 | 64 | 7.4 | 73 |
| 10 | 261 | 65 | 4.0 | - | 276 | 65 | 4.2 | 88 | 713 | 62 | 11.5 | 100 |
| 11 | 312 | 68 | 4.6 | - | 334 | 65 | 5.1 | 100 | 1067 | 68 | 15.7 | 99 |
| 12 | 478 | 74 | 6.5 | 99 | 568 | 68 | 8.4 | 100 | 476 | 45 | 10.6 | - |
| 13 | 984 | 73 | 13.5 | 100 | 984 | 73 | 13.5 | 100 | 502 | 52 | 9.7 | - |
| 14 | 911 | 63 | 14.5 | - | 713 | 62 | 11.5 | 100 | 222 | 42 | 5.3 | - |
| 15 | 785 | 59 | 13.3 | 99 | 1067 | 68 | 15.7 | 99 | 380 | 45 | 8.4 | - |
| Average | 342 | 64 | 5.3 | 25.3 | 361 | 64 | 5.5 | 39.1 | 440 | 59 | 7.3 | 61.2 |
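The Average row of the M/S columns appears to be the mean of the per-node ratios rather than the ratio of the column averages. The sketch below checks this for the NJ columns of Table 1; the variable names are illustrative, and this reading of the table is an inference from the reported numbers, not something stated in the Methods.

```python
# NJ columns of Table 1, nodes 1 through 15.
nj_mesh = [108, 119, 131, 146, 160, 174, 195, 209, 233, 276, 334, 568, 984, 713, 1067]
nj_sty = [56, 57, 55, 60, 66, 61, 68, 64, 69, 65, 65, 68, 73, 62, 68]

# Mean of the per-node M/S ratios.
mean_of_ratios = sum(m / s for m, s in zip(nj_mesh, nj_sty)) / len(nj_mesh)

# Ratio of the column averages, for comparison.
ratio_of_means = (sum(nj_mesh) / len(nj_mesh)) / (sum(nj_sty) / len(nj_sty))

print(round(mean_of_ratios, 1), round(ratio_of_means, 1))
```

Rounded to one decimal place, the mean of ratios matches the reported NJ average of 5.5, while the ratio of the column averages comes out slightly higher.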
DISCUSSION
Most, if not all, studies that involve hierarchical clustering of genes according to literature-based knowledge use distance-derived or phenetic methods to infer various clustering schemes. However, in the phylogenetic community, the disparity between hierarchical clusters formed using either a phenetic or cladistic method is a known issue16, 17.
Part of the difference between phenetic and cladistic methods can be attributed to a distinction between the goals of each respective method. While phenetic methods attempt to arrive at what are the clusterings based on characteristic data, cladistic methods strive to ascertain how characteristic information can be used to cluster the data. Thus, whilst both methods may arrive at a meaningful clustering, only cladistic methods enable one to reliably recover specific characteristics and their respective characteristic states associated with the resulting clustering17. This is akin to the differences between the information retrieval concepts of document classification and document clustering. While document classification aims to organize documents according to their textual content, document clustering strives to identify shared features that can be used for aggregation18.
In traditional phylogenetic studies, the aim is to infer the evolutionary history of a set of taxa (e.g., organisms) using molecular data, morphological data, or some combination thereof. It has been demonstrated within the systematics community that phenetic methods, while good at estimating general relationships, can lead to ad hoc assumptions that might also result in the loss of discrete characteristic and characteristic state information16. Furthermore, the recovery of characteristics and characteristic states that explain a clustering is not possible17. Cladistic methods, on the other hand, preserve all characteristic and characteristic state information while inferring relationships between the set of taxa. Cladistic methods, like document clustering algorithms, are built around the principle that more similar entities (in this study, genes) will share more features6 (in this study, the MeSH terms from associated publications).
Phylogenetic techniques have gained popularity in the study of microarray gene expression data19. While most phylogenetic applications to microarray data are generally phenetic (because gene expression data are generally reported as continuous characteristic states), there has been work to show that cladistic approaches (where the characteristic states are ‘binned’ into degrees of ‘on’ or ‘off’) can lead to meaningful diagnostic markers that can subsequently be used for classifiers20. Literature-based knowledge has been shown as a reasonable complement to gene expression data for clustering genes21. This study focused on demonstrating the utility of characteristic-based methods for clustering genes according to solely MeSH terms. Future work will aim to integrate literature-based knowledge with experimental data (e.g., bio-assay), leveraging knowledge about genes that have been annotated (e.g., using the Gene Ontology). Within a cladistic framework, methods and evaluation metrics have been developed for integrating heterogeneous data types into a ‘simultaneous’ analysis22.
The use of MeSH terms may not entirely reflect the complete content of a document. To address this, Natural Language Processing (NLP) techniques have been used by systems like MedScan23 to organize knowledge contained in Medline literature. However, the use of MeSH terms reflects the expertise associated with the manual annotation of Medline records. The utility of using only MeSH terms has been demonstrated for both classification and clustering knowledge relative to biomedical literature. To this end, various similarity metrics have been developed and employed for creating classification groups24, 25. Insightful studies have also compared the level of resolution between clustering documents using only MeSH terms versus terms that were mined from full text25. While there has been some research that is critical of using MeSH (arguing that terms may often be out of date)26, there have been studies that demonstrate that MeSH terms alone can be used to make meaningful concept profiles24. Regardless, we anticipate incorporating additional characteristics into our matrix from available text mining applications.
In published studies thus far, similarity scores are derived based on a range of metrics (e.g., Euclidean or cosine distance). Determining which metric is most appropriate for which data type can be difficult, and the choice of metric may drastically affect the resulting hierarchical clustering. In the present study, we employed two different similarity-based methods: UPGMA and NJ. The tree topologies predicted by these methods were both qualitatively and quantitatively (based on tree balance) different. We did not consider other similarity-based methods that have been examined for document clustering, as a major emphasis of this study was to demonstrate the utility of established phylogenetic tools (PAUP*) and methods (UPGMA and NJ) for clustering information. Future studies may include a direct comparison of tree topologies derived from a range of phylogenetic and non-phylogenetic phenetic methods.
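The sensitivity to the choice of metric can be seen in a small Python sketch in which Euclidean and cosine distance disagree about which profile is nearest to a query. The vectors and names are hypothetical presence/absence profiles, and the cosine function assumes neither vector is all zeros.

```python
import math


def euclidean(u, v) -> float:
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v) -> float:
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def nearest(query, candidates: dict, metric) -> str:
    """Name of the candidate profile closest to `query` under `metric`."""
    return min(candidates, key=lambda name: metric(query, candidates[name]))


# Hypothetical presence/absence profiles over eight terms.
query = [1, 1, 0, 0, 0, 0, 0, 0]
candidates = {
    "broad": [1, 1, 1, 1, 1, 1, 1, 1],   # shares both query terms, plus six more
    "sparse": [0, 0, 1, 0, 0, 0, 0, 0],  # shares no term with the query
}
by_euclidean = nearest(query, candidates, euclidean)
by_cosine = nearest(query, candidates, cosine_distance)
```

Euclidean distance penalises the heavily annotated profile for its many extra terms, whereas cosine distance rewards the overlap, so the two metrics select different neighbours.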
The clustering of genes around the BRCA genes in this study exemplifies the usability and potential value of cladistic methods. Because cladistic methods are optimized to group entities (in this case, genes) together based on shared characteristics (in this case, MeSH terms), one can reliably retrieve the list of MeSH terms that are used for each grouping, within a consistent (cladistic) framework. While not shown (due to space restrictions; the shortest list is 108 MeSH terms), we are able to specifically identify the characteristics that were used to create a cluster. Thus, while one can recover characteristics based on a given clustering, it is impossible to directly identify which keywords were used within a phenetic framework to create the resulting clustering. This was shown empirically in the present study based on the average increase in the consistency (M/S) scores between phenetic (UPGMA and NJ) and cladistic (MP) methods. While this evaluation is heuristic in nature, it warrants further investigation. Our future work will therefore include the examination of putative biological or clinical hypotheses based on clustered keywords and published studies. We anticipate the use of a Formal Concept Analysis15 framework that will incorporate relevant MeSH terms to infer specific relationships between genes.
While cladistic methods may be susceptible to misleading clustering due to noisy data, robustness tests exist for validating a particular tree topology27. Certainly, there is a trade-off between computational performance and tractability. Phenetic methods are computationally efficient; cladistic methods generally take much longer to perform. In this study, for example, the NJ and UPGMA trees were both generated in less than 10 seconds. The MP tree, on the other hand, took over 90 minutes on the same machine (Dual Processor 2.5GHz G5 PPC).
The cladistic method exemplified here was MP, which presumes that there are no relationships between the characteristics; however, there are characteristic-based methods (e.g., Maximum Likelihood28 and Bayesian29) that can incorporate relationship information (e.g., ontological relationships between MeSH terms). These methods are designed to embrace both cladistic principles as well as model-specific information within a statistical framework. Phylogeneticists have described and enumerated the differences and advantages of both non-statistical and statistical cladistic methods for organizing biological data16, 17. A notable complication with statistical or model-based cladistic methods, however, is that combining heterogeneous data models into a single simultaneous analysis may often be computationally prohibitive. It is our hope that this study promotes further exploration and evaluation of such established phylogenetic methods and principles for knowledge-based clustering based on discrete attributes.
CONCLUSION
Traditional clustering algorithms generally use a similarity-based optimality criterion. Here, we have demonstrated the utility of existing phylogenetic tools for performing cluster analyses of genes based on knowledge inferred from annotated published literature (Medline). Furthermore, we propose that characteristic-based methods (‘cladistic’) can offer further insight into which knowledge attributes (in this case, MeSH terms) result in meaningful clustering of data (in this case, genes).
ACKNOWLEDGMENTS
The authors are grateful for the insightful discussions with Drs. P.J. Planet and M.E. Siddall from AMNH. INS is funded in part by NSF-IIS-0241229, the D.A.B. Lindberg Research Fellowship from the MLA, and the Lewis B. and Dorothy Cullman Program for Molecular Systematics.
REFERENCES
- 1. Omenn GS, States DJ, Adamski M, Blackwell TW, Menon R, Hermjakob H, et al. Overview of the HUPO Plasma Proteome Project. Proteomics. 2005 Aug;5(13):3226–45. doi: 10.1002/pmic.200500186.
- 2. Mattes WB, Pettit SD, Sansone SA, Bushel PR, Waters MD. Database development in toxicogenomics: issues and efforts. Environ Health Perspect. 2004 Mar;112(4):495–505. doi: 10.1289/ehp.6697.
- 3. Andrade MA, Bork P. Automated extraction of information in molecular biology. FEBS Lett. 2000 Jun 30;476(1–2):12–7. doi: 10.1016/s0014-5793(00)01661-6.
- 4. Jenssen TK, Laegreid A, Komorowski J, Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001 May;28(1):21–8. doi: 10.1038/ng0501-21.
- 5. http://www.nlm.nih.gov/mesh/
- 6. Hennig W. Phylogenetic Systematics. Urbana: University of Illinois Press; 1966.
- 7. Sokal R, Sneath P. Principles of Numerical Taxonomy. San Francisco: WH Freeman; 1963.
- 8. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987 Jul;4(4):406–25. doi: 10.1093/oxfordjournals.molbev.a040454.
- 9. Heard SB. Patterns in tree balance among cladistic, phenetic and randomly generated phylogenetic trees. Evolution. 1992;50:2141–8. doi: 10.1111/j.1558-5646.1992.tb01171.x.
- 10. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
- 11. http://www.gene.ucl.ac.uk/nomenclature/
- 12. Swofford DL. PAUP*: Phylogenetic Analysis Using Parsimony (and Other Methods) 4.0 Beta. Sunderland, MA: Sinauer Associates, Inc; 2003.
- 13. Sikes DS, Lewis PO. PAUPRat: A tool to Implement Parsimony and Likelihood Ratchet Searches using PAUP*. Storrs, CT: University of Connecticut; 2001.
- 14. Maddison DR, Maddison W. MacClade: Analysis of Phylogeny and Character Evolution. Sunderland: Sinauer Associates; 1992.
- 15. Ganter B, Wille R. Formal Concept Analysis: Mathematical Foundations. Secaucus: Springer-Verlag; 1997.
- 16. Farris JS. The logical basis of phylogenetic analysis. In: Platnick N, Funk V, editors. Proceedings of the Second Meeting of the Willi Hennig Society. New York: Columbia University Press; 1983.
- 17. DeSalle R. What's in a character? J Biomed Inform. 2006 Feb;39(1):6–17. doi: 10.1016/j.jbi.2005.11.002.
- 18. Berry M. Survey of Text Mining: Clustering, Classification, and Retrieval. New York: Springer-Verlag; 2003.
- 19. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998 Dec 8;95(25):14863–8. doi: 10.1073/pnas.95.25.14863.
- 20. Sarkar IN, Planet PJ, Bael TE, Stanley SE, Siddall M, DeSalle R, et al. Characteristic attributes in cancer microarrays. J Biomed Inform. 2002 Apr;35(2):111–22. doi: 10.1016/s1532-0464(02)00504-x.
- 21. Korbel JO, Doerks T, Jensen LJ, Perez-Iratxeta C, Kaczanowski S, Hooper SD, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol. 2005 May;3(5):e134. doi: 10.1371/journal.pbio.0030134.
- 22. Rokas A, Carroll SB. More genes or more taxa? The relative contribution of gene number and taxon number to phylogenetic accuracy. Mol Biol Evol. 2005 May;22(5):1337–44. doi: 10.1093/molbev/msi121.
- 23. Novichkova S, Egorov S, Daraselia N. MedScan, a natural language processing engine for MEDLINE abstracts. Bioinformatics. 2003 Sep 1;19(13):1699–706. doi: 10.1093/bioinformatics/btg207.
- 24. Srinivasan P, Wedemeyer M. Mining Concept Profile with the Vector Model or Where on Earth are Diseases Being Studied? Third SIAM International Conference on Data Mining; 2003.
- 25. Struble CA, Dharmanolla C. Clustering MeSH Representations of Biomedical Literature. HLT-NAACL 2004 Workshop: Biolink 2004. 2004:41–8.
- 26. Iliopoulos I, Enright AJ, Ouzounis CA. Textquest: document clustering of Medline abstracts for concept discovery in molecular biology. Pac Symp Biocomput. 2001:384–95. doi: 10.1142/9789814447362_0038.
- 27. Egan MG. Support versus corroboration. J Biomed Inform. 2006 Feb;39(1):72–85. doi: 10.1016/j.jbi.2005.11.007.
- 28. Edwards AW. Likelihood. Cambridge: Cambridge University Press; 1972.
- 29. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003 Aug 12;19(12):1572–4. doi: 10.1093/bioinformatics/btg180.


