Abstract
Most of the knowledge regarding genes and proteins is stored in biomedical literature as free text. Extracting information from complex biomedical texts demands techniques capable of inferring biological concepts from local text regions and mapping them to controlled vocabularies. To this end, we present a sentence-based correspondence latent Dirichlet allocation (scLDA) model which, when trained with a corpus of PubMed documents with known GO annotations, performs the following tasks: 1) learning major biological concepts from the corpus, 2) inferring the biological concepts existing within text regions (sentences), and 3) identifying the text regions in a document that provide evidence for the observed annotations. When applied to new gene-related documents, a trained scLDA model is capable of predicting GO annotations and identifying text regions as textual evidence supporting the predicted annotations. This study uses GO annotation data as a testbed; the approach can be generalized to other annotated data, such as MeSH and MEDLINE documents.
Introduction
In order to tap into the wealth of knowledge in biomedical literature, it is imperative to develop computational methods to automatically extract information from free texts and encode the information in a computable form, so that a knowledgebase populated with such information can be utilized for reasoning and deriving new knowledge. In the bioinformatics domain, the controlled vocabulary developed by the Gene Ontology Consortium 1 is the de facto standard for representing the knowledge regarding genes and proteins from a molecular biology perspective. While the accurately curated knowledgebase from the GO consortium provides invaluable information, manual processes of encoding knowledge cannot keep up with the ever increasing rate of knowledge accumulation2; the need for computational approaches to address this challenge has led to recent advances in biomedical natural language processing (BioNLP) and text mining3–5. Many research efforts have been devoted to identifying biological concepts from free text and mapping them to GO terms, including the community-wide challenge tasks in BioCreative6–8 and the TREC Genomic Track9.
Often, the task of predicting GO annotations based on biomedical literature is cast as a text categorization problem, in which a text is transformed into an input vector for a classifier, which outputs a potential class label (a GO term). The classification approach has the following two shortcomings. First, biomedical literature documents inevitably contain multiple concepts. As a result, a document is often associated with multiple annotations, and thus it is assigned to multiple classes. Within such a document, the features that are useful in training the classifier for one of the classes (a GO term) may become “noise” for the classifiers for other classes, potentially leading to over-fitting problems when such features are inappropriately used by those classifiers. A second issue is that transforming documents into feature vectors discards the original structure of the text; a classifier thus becomes a “black box” that only outputs classification calls without providing textual evidence to support them.
In this study, we aim to develop a computational approach that is capable of identifying multiple biological concepts in a biomedical text, mapping the concepts to the controlled vocabulary of the GO, and further associating the predicted GO terms with text regions that provide textual evidence for the annotations. In order to deal with the multi-topical nature of biomedical texts, we have adopted the framework of probabilistic topic models, which treats a text as a mixture of topics10, 11.
In the framework of probabilistic topic models, a topic (or a concept) is represented as a multinomial distribution over a vocabulary, which reflects the word-usage profile when a concept is discussed, and major topics are usually learned from a corpus of text in an unsupervised manner. A trained model can then be applied to a new text document to infer what topics are discussed in the text, and, more importantly, it can then probabilistically label each word or text region of the document with a topic. Probabilistic topic models have been applied to semantic analysis of texts in a variety of domains10, 12–14. In the bioinformatics domain, the latent Dirichlet allocation (LDA) model has been applied to biomedical literature to infer the biological concepts, and semantic information has been further utilized to assess the functional coherence of proteins and to enhance text categorization15–17. In this study, we have developed a novel probabilistic topic model referred to as the sentence-based correspondence LDA (scLDA) which, when trained with a corpus of PubMed documents associated with known GO annotations, learns biological concepts from texts, labels the text regions (sentences) with the concepts, and further maps the learned biological concepts to the observed GO terms. When a trained scLDA model is applied to a set of new PubMed documents, it can predict probable GO annotations and, more interestingly, identify the text regions as textual evidence to support the annotations.
The scLDA is related to the correspondence LDA (CorrLDA) developed in a seminal work by Blei and Jordan12, which models the relationship between regions of an image and the annotations associated with the image. We extend the model to learn the correspondence between text regions and annotations. The basic assumptions of the scLDA model are as follows: 1) a document contains multiple biological concepts; 2) the words in a text and the observed GO terms are semantic tokens from two different languages, of which one is understandable to humans and the other is understandable to computers, but both convey the same set of biological concepts in the document; 3) sentences are natural semantic units to present such concepts. When the concepts discussed in a document are inferred using statistical modeling, the correspondence between the text regions and annotations can be learned. Thus the scLDA can be used to achieve both the goal of predicting GO annotations and of providing textual evidence for the annotations.
Methods
Data
The scLDA model was trained with PubMed documents (title and abstract) associated with GO annotations. We downloaded the GO association file from Uniprot, and extracted 33,479 tuples containing a gene name, a GO term and a PubMed identification number. A corpus of PubMed records was then retrieved from the NCBI (www.ncbi.nlm.nih.gov); sentences in a document were delimited using the Sentence module downloaded from the Comprehensive Perl Archive Network (www.cpan.org); tokens occurring fewer than 10 times in the corpus were discarded; words belonging to a list of stop words were removed; and words were stemmed using the Porter stemming algorithm18.
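As a concrete illustration, the token-filtering steps above can be sketched as follows. This is a minimal stand-in, not the scripts used in the study: the function name `preprocess` and the abbreviated stop-word list are our own, and the Porter stemming step is omitted for brevity.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; the study uses a full standard list.
STOP_WORDS = {"the", "of", "and", "in", "to", "a", "is", "that", "by", "with"}

def preprocess(docs, min_count=10):
    """Tokenize each document, drop stop words, and discard rare tokens.
    (Porter stemming, applied in the paper, is omitted from this sketch.)"""
    tokenized = [
        [t for t in re.findall(r"[a-z0-9]+", doc.lower()) if t not in STOP_WORDS]
        for doc in docs
    ]
    # Corpus-wide token counts determine which rare tokens to discard.
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]
```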
Model specification
The scLDA model is based on a simplified assumption about the process by which the words of a text document are “generated”, without considering syntax. The stochastic process of generating a document and its annotations is as follows:
1. Sample a topic vector θd ∼ Dir(θ | α), which determines the tendency of topics being discussed in the document d.
2. For each sentence s in the document, sample a topic indicator, zs ∼ Multi(z | θ), which indicates which topic the sentence is to discuss.
   a. For each word in the sentence, sample a word, w ∼ Multi(w | βzs), to convey the concept of the sentence.
3. For each annotation:
   a. Sample a sentence indicator, y ∼ Unif(1, …, Sd), indicating which sentence carries the concept to be annotated.
   b. Sample a GO term, g ∼ Multi(g | y, z, μ), which conveys the concept discussed in the sentence.
An intuitive explanation of the above process is as follows. When writing a paper, an author first decides what topics are to be included in the document by choosing the weight of each topic from a multi-topic space (Step 1), leading to a document-specific topic distribution vector θd determining the tendency (probability) that each topic will be discussed in the document. For each sentence in the document, the author chooses a topic zs (Step 2) to be discussed in the sentence, based on the overall topic distribution (θd). Then, the author repeatedly chooses words for the sentence from the word-usage profile associated with the topic (Step 2a) to complete the sentence. After “generating” all sentences, the author also chooses certain sentences as key statements for the topics discussed in the document (Step 3a), and chooses a matching GO term for each such sentence (Step 3b) to reflect the concept discussed in the sentence in the GO domain, so that the knowledge is in a computable form. Of course, this process only simulates how words and annotations are chosen from a stochastic point of view; it does not truly reflect the complex process of how a scientific article is written.
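The generative process above can be sketched in code. This is an illustrative simulation under assumed inputs: the function `generate_document` and its parameter names are ours, and β and μ are supplied as pre-specified topic-word and topic-GO probability tables rather than tables learned from data.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(alpha, beta, mu, n_sentences, words_per_sentence, n_annotations):
    """Simulate the scLDA generative process for a single document.
    alpha: Dirichlet hyperparameter vector (length T)
    beta:  T x V topic-word probability table
    mu:    T x G topic-GO-term probability table"""
    theta = rng.dirichlet(alpha)                     # Step 1: document topic vector
    z, sentences = [], []
    for _ in range(n_sentences):
        t = rng.choice(len(alpha), p=theta)          # Step 2: topic of the sentence
        z.append(t)
        # Step 2a: draw the sentence's words from the topic's word-usage profile
        sentences.append(list(rng.choice(beta.shape[1], size=words_per_sentence, p=beta[t])))
    annotations = []
    for _ in range(n_annotations):
        s = rng.integers(n_sentences)                # Step 3a: pick an evidence sentence
        g = rng.choice(mu.shape[1], p=mu[z[s]])      # Step 3b: GO term matching its topic
        annotations.append((s, g))
    return theta, z, sentences, annotations
```

Note that each annotation's GO term is drawn from the GO-usage profile of its evidence sentence's topic, which is exactly the coupling the inference algorithm later exploits.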
Inference algorithm
Learning of the scLDA model is performed in an unsupervised manner. In this framework, input data are text and associated GO annotations, and the task is to infer what topic each sentence is discussing (inferring the instantiation of the latent variable zs) and which sentence serves as evidence for an annotation (inferring ym). Once these latent variables are inferred, one can further learn which words are associated with a given topic and which GO terms are strongly associated with a topic. Inference of latent variables can be performed using a Markov chain Monte Carlo (MCMC) algorithm in the framework of expectation-maximization (EM). The formal representation of the model and inference procedure is as follows.
The MCMC algorithm infers the instantiation of latent variables through a sampling approach based on the conditional probability. In order to perform such computation, we compute the joint distribution of the observed and latent variables in the model as follows:
p(w, g, z, y, θ, β, μ | α, η, ν) = p(β | η) p(μ | ν) ∏_{d=1}^{D} p(θ_d | α) [ ∏_{s=1}^{S_d} p(z_s | θ_d) ∏_{n=1}^{N_s} p(w_n | β_{z_s}) ] [ ∏_{m=1}^{M_d} p(y_m | S_d) p(g_m | y_m, z, μ) ]    (1)
where w stands for all the observed words; z represents the latent topic indicators associated with all sentences; θ is a collection of document-specific topic distribution vectors; β is a conditional probability table defining word-usage profiles for topics; y indicates the correspondence between the GO terms and their supporting sentences; μ is a conditional probability table reflecting the GO-usage profile associated with all topics; T indicates the number of topics discussed in the corpus; D denotes the size of a corpus; Sd is the number of sentences within a document d; Ns represents the number of words in a sentence s; Md is the number of annotations associated with the document d; and α, η, and ν are hyperparameters.
Our MCMC algorithm iteratively updates the state of latent variables z and y, based on the conditional probability defined by the following equations:
p(z_{d,s} = t | z_{−s}, y, w, g) ∝ [(C^{DT}_{d,t,−s} + α) / (S_d − 1 + Tα)] × ∏_{n=1}^{N_s} [(C^{WT}_{w_n,t,−n} + η) / (Σ_v C^{WT}_{v,t,−n} + Vη)] × ∏_{m: y_m = s} [(C^{GT}_{g_m,t,−m} + ν) / (Σ_g C^{GT}_{g,t,−m} + Gν)]    (2)
p(y_m = s | z, g, y_{−m}) ∝ (C^{GT}_{g_m,z_s,−m} + ν) / (Σ_g C^{GT}_{g,z_s,−m} + Gν)    (3)
In Eq (2), the term (C^{DT}_{d,t,−s} + α) / (S_d − 1 + Tα) reflects the probability that topic t will be discussed based on the distribution of topics in the document d; the term ∏_n (C^{WT}_{w_n,t,−n} + η) / (Σ_v C^{WT}_{v,t,−n} + Vη) calculates the probability that topic t is being discussed based on the observed words of the sentence; and the term ∏_{m: y_m = s} (C^{GT}_{g_m,t,−m} + ν) / (Σ_g C^{GT}_{g,t,−m} + Gν) determines the probability that topic t is being discussed based on its associated GO annotations. In Eq (3), the terms calculate the probability of a sentence s serving as textual evidence for the observed GO term g_m. A more detailed description of the variables in the model is as follows. The event z_{d,s} = t states that the topic variable z for the sentence s in the document d takes the value t; C^{DT}_{d,t,−s} denotes the count of sentences, excluding sentence s, in the document d that belong to topic t; S_d − 1 is the count of sentences, excluding the sentence s, in the document d; C^{WT}_{w_n,t,−n} represents the count of the words, excluding w_n, in the corpus that take the same value as w_n and are assigned to topic t; Σ_v C^{WT}_{v,t,−n} is the count of all the words, excluding w_n, that are assigned to the topic t; C^{GT}_{g_m,t,−m} stands for the count of the observed GO terms, excluding g_m, that take the same value as g_m and are assigned to topic t; Σ_g C^{GT}_{g,t,−m} denotes the count of all GO terms, excluding g_m, that are assigned to the topic t; the event y_m = s states that the indicator variable y associates the mth observed GO term (g_m) in the document d with the sentence s; C^{GT}_{g_m,z_s,−m} represents the count of the GO terms, excluding g_m, that assume the same value as g_m and are assigned to the topic z_s of sentence s; I(y_m == s) is an indicator function which returns 1 when y_m equals s and 0 otherwise; V stands for the size of the vocabulary of the words; G is the number of distinct GO terms; and Σ_g C^{GT}_{g,z_s,−m} is the count of all GO terms that are assigned to the topic indicated by z_s.
While the equations may appear complicated, the implementation of the MCMC algorithm is relatively straightforward. When implementing the inference algorithm for the scLDA model, one can keep track of the counts in the above equations using three tables: a document-topic-count table (CDT) with dimension D×T, which keeps track of the topic assignments of the sentences in all documents; a topic-word-count table (CWT) with dimension V×T, which contains the word-usage profile for each topic; and a topic-GO-count table (CGT) with dimension G×T, which contains the count of GO terms associated with each topic. Note that the elements of these tables are updated each time a topic variable for a sentence or an association variable for a GO term is resampled during the process of Gibbs sampling. The inference algorithm is shown in Algorithm 1.
Algorithm 1.
Input:  A corpus of PubMed records with GO annotations
Output: An scLDA model and instantiated latent variables z and y; CDT, CWT and CGT
Init:   Randomly initialize the topic variables z for all sentences; randomly initialize the GO-sentence association variables y. Populate the tables CDT, CWT and CGT according to the assignments of z and y.
while not converged do
    for d in 1 to D
        for s in 1 to Sd
            remove the count of sentence s from CDT
            remove all the words in the sentence from CWT
            remove the associated GO terms, if any, from CGT
            sample zs according to Eq (2)
            assign s to the topic indicated by zs in CDT
            assign all words in the sentence to the topic indicated by zs in CWT
            assign the associated GO terms to the topic indicated by zs in CGT
        end for
        for m in 1 to Md
            remove the GO term gm from CGT
            sample ym according to Eq (3)
            assign gm to the topic of the sentence indicated by ym
        end for
    end for
end while
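A sketch of one sweep of Algorithm 1 may help make the bookkeeping concrete. This is our simplified rendering, not the authors' implementation: the per-word exclusion counts of Eq (2) are approximated by holding the count tables fixed while scoring a sentence, and all names (`gibbs_sweep`, `docs`, `annos`) are ours.

```python
import numpy as np

def gibbs_sweep(docs, annos, z, y, CDT, CWT, CGT, alpha, eta, nu, rng):
    """One sweep of Algorithm 1 over the corpus.
    docs[d][s]: list of word ids in sentence s of document d
    annos[d]:   list of GO-term ids observed for document d
    z[d][s]:    current topic of sentence s;  y[d][m]: evidence sentence of GO term m
    CDT, CWT, CGT: document-topic, word-topic and GO-topic count tables."""
    T = CDT.shape[1]
    V, G = CWT.shape[0], CGT.shape[0]
    for d, doc in enumerate(docs):
        for s, words in enumerate(doc):
            t_old = z[d][s]
            # Remove sentence s (and any GO terms linked to it) from the tables.
            CDT[d, t_old] -= 1
            for w in words:
                CWT[w, t_old] -= 1
            linked = [m for m, ym in enumerate(y[d]) if ym == s]
            for m in linked:
                CGT[annos[d][m], t_old] -= 1
            # Eq (2): topic prior x word likelihood x GO likelihood (in log space).
            logp = np.log(CDT[d] + alpha)
            for w in words:
                logp += np.log((CWT[w] + eta) / (CWT.sum(axis=0) + V * eta))
            for m in linked:
                logp += np.log((CGT[annos[d][m]] + nu) / (CGT.sum(axis=0) + G * nu))
            p = np.exp(logp - logp.max())
            p /= p.sum()
            t_new = rng.choice(T, p=p)
            # Put the counts back under the sampled topic.
            z[d][s] = t_new
            CDT[d, t_new] += 1
            for w in words:
                CWT[w, t_new] += 1
            for m in linked:
                CGT[annos[d][m], t_new] += 1
        for m, g in enumerate(annos[d]):
            CGT[g, z[d][y[d][m]]] -= 1          # remove gm from its current topic
            # Eq (3): score each sentence by how well its topic explains gm.
            q = np.array([(CGT[g, z[d][s]] + nu) / (CGT[:, z[d][s]].sum() + G * nu)
                          for s in range(len(doc))])
            q /= q.sum()
            y[d][m] = rng.choice(len(doc), p=q)
            CGT[g, z[d][y[d][m]]] += 1          # reassign gm to the chosen sentence's topic
    return z, y
```

A design point worth noting: removing a unit's own counts before scoring it, then re-adding them under the sampled value, is what keeps the three tables consistent across the whole chain.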
Evaluations
A trained scLDA model can be applied to a set of new texts to predict potential GO annotations, by first inferring the topics from the texts and then outputting the GO terms most strongly associated with these topics. The probability for a GO annotation to be associated with a text document can be calculated as follows,
p(g | d) = Σ_{t=1}^{T} p(g | z = t, μ) p(z = t | θ_d)    (4)
which allows one to rank GO terms and output the top ranking ones as predicted annotations. With the most likely GO terms determined, one can further infer the sentences that are most likely associated with them using Eq (3).
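Eq (4) amounts to a matrix-vector product followed by a ranking step. A minimal sketch (the function name and inputs are ours; μ is given as a topic-by-GO-term probability table):

```python
import numpy as np

def predict_go_terms(theta_d, mu, top_k=2):
    """Rank GO terms for a document via Eq (4): p(g|d) = sum_t p(g|t) p(t|d).
    theta_d: inferred topic distribution for the document (length T)
    mu:      T x G topic-GO-term probability table"""
    p_g = theta_d @ mu                     # marginalize over topics
    return np.argsort(p_g)[::-1][:top_k]   # GO-term ids, highest probability first
```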
Since our algorithm outputs multiple potential GO annotations for a document, evaluation of multi-label classification is different from that of conventional binary classification. In this study, we adopted the information retrieval metrics that were modified for evaluating multi-label classification19. Let D denote the test corpus and Yd and Zd be the true and predicted label sets, respectively, for document d. The precision, recall and F-score for a classification system are determined as follows,
Precision = (1 / |D|) Σ_{d ∈ D} |Y_d ∩ Z_d| / |Z_d|    (5)

Recall = (1 / |D|) Σ_{d ∈ D} |Y_d ∩ Z_d| / |Y_d|    (6)

F = (2 × Precision × Recall) / (Precision + Recall)    (7)
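Eqs (5)-(7) can be computed directly from the true and predicted label sets; a minimal sketch (the function name is ours):

```python
def multilabel_prf(true_sets, pred_sets):
    """Example-averaged precision, recall and F-score per Eqs (5)-(7).
    true_sets[d] and pred_sets[d] are the sets Y_d and Z_d for document d."""
    n = len(true_sets)
    precision = sum(len(Y & Z) / len(Z) for Y, Z in zip(true_sets, pred_sets)) / n
    recall = sum(len(Y & Z) / len(Y) for Y, Z in zip(true_sets, pred_sets)) / n
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score
```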
Results
Since it is difficult to model the correspondence of rarely observed GO terms, we preprocessed, trained and tested the scLDA model on two purpose-built data sets. For the first data set, we selected a subset of PubMed records in which each GO term is associated with at least 7 PubMed records. This data set consisted of 2,271 PubMed records and 362 unique GO terms. We used this data set to assess the overall capability of scLDA in learning biological concepts from text and its prediction accuracy; we refer to it as the general training set. For the second data set, we further selected a subset of PubMed records in which each document had at least 2 distinct GO terms associated with it, and each GO term was associated with at least 14 PubMed records. These selection criteria led to a corpus of 411 PubMed records and 40 unique GO terms. We chose these criteria so that each GO term has a sufficient number of training documents for learning the association between biological concepts and GO terms, and so that we can test the capability of the scLDA to identify different text regions supporting multiple distinct GO annotations. We refer to this data set as the multi-annotation training set.
Learning biological topics
Using the general training set, we trained multiple scLDA models with different parameter settings, e.g., models initialized with a different number of topics, and assessed the capability of the scLDA model in learning biological concepts and in mapping between learned concepts and the observed GO annotations.
A trained scLDA model returns two conditional probability tables: one table represents the word-usage distributions of the topics, and the other reflects the GO-term-usage distributions of the topics. By inspecting the top-ranked words and the GO terms associated with a topic, a biologist should be able to assess whether the topic is related to a biological concept; one can also evaluate whether the context reflected by the word-usage profile of a topic agrees with that of the GO-term-usage profile of the same topic. In addition to the qualitative and quantitative evaluations presented in later paragraphs, we have listed the topics returned by an scLDA model trained with 320 topics (a number close to the number of observed GO terms) in an HTML file in the Supplementary Material section, so that readers can assess the biological relevance themselves. In this file, we show the 30 top-ranked words and the 3 top-ranked GO terms associated with each topic, if the topic is associated with more than 3 terms. When inspecting the topic tables, one needs to note that, since the scLDA model is a generative model that simulates the process of “generating” all the words in the corpus, words that occur with high frequencies in the corpus are likely to rank at the top of many topics; therefore, some words that are specific to a topic can be ranked lower than the 30th position in their corresponding topic-word profile and thus not shown. It is also worth noting that the more words and GO terms the scLDA assigns to a topic, the more biologically relevant the topic tends to be.
After qualitatively inspecting the topics, we believe that overall the topics learned by the scLDA model are highly biologically relevant, and many of them are highly specific. Table 1 shows an example of such a topic. From this table, one can clearly tell that the context represented by the word-usage profile of the topic is closely related to the concepts conveyed by the top GO terms associated with the topic.
Table 1.
An example topic from a trained scLDA model. The top 30 words (stemmed) with high probabilities are listed in two columns; the total number of words assigned to the topic is shown in the header. Below the words, the top 3 GO terms, with their counts, names and definitions, are shown.

Topic_words (1054):
  silenc          0.0260    regul        0.0081
  gene            0.0217    repress      0.0078
  histon          0.0211    bind         0.0064
  protein         0.0185    chromosom    0.0063
  transcript      0.0155    mutat        0.0061
  centromer       0.0145    h3           0.0061
  chromatin       0.0137    region       0.0060
  yeast           0.0134    deacetylas   0.0060
  heterochromatin 0.0127    role         0.0060
  function        0.0112    depend       0.0060
  telomer         0.0109    type         0.0060
  requir          0.0093    interact     0.0058
  complex         0.0091    domain       0.0058
  activ           0.0089    factor       0.0057
  fission         0.0082    loci         0.0056

Top GO terms:
  GO:0006348  count: 32  name: chromatin silencing at telomere
    definition: Repression of transcription of telomeric DNA by altering the structure of chromatin.
  GO:0030466  count: 29  name: chromatin silencing at silent mating-type cassette
    definition: Repression of transcription at silent mating-type loci by alteration of the structure of chromatin.
  GO:0030702  count: 28  name: chromatin silencing at centromere
    definition: Repression of transcription of centromeric DNA by altering the structure of chromatin.
In addition to qualitatively assessing the biological relevance of the word-usage profiles and GO-term-usage profiles, we also quantitatively assessed the relatedness between the word-usage and GO-usage profiles associated with a common topic, as an indication that the scLDA model is capable of mapping a topic to the controlled vocabulary of the GO. For each topic, we used the word-usage profile to create a topic-word vector; we also collected the words from the descriptions (long and detailed descriptions of the meaning of a GO term) of the top 10 GO terms associated with the topic to create a GO-word vector. We then assessed the closeness of the two vectors using a linear algebra measure, the “dot product” of two normalized vectors. As a control, we also randomly drew 10 GO terms for a topic, constructed a random-GO-word vector, and calculated the dot product between this vector and the topic-word vector of the topic. We then set out to test the hypothesis that the word vectors based on scLDA results are more closely related to each other than randomly derived vectors. Indeed, a paired t-test between the scLDA-derived dot products and random-matched dot products led to a highly significant p-value, p < 0.0001, indicating that the word-usage profiles derived by the scLDA model are highly relevant to the words representing the concepts of GO terms, and the model is thus capable of mapping topics to GO terms.
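The vector comparison described above, the dot product of two normalized vectors, can be sketched as follows (the function name and inputs are ours; both vectors are assumed to be indexed over a shared vocabulary):

```python
import numpy as np

def profile_similarity(topic_word_probs, go_word_counts):
    """Dot product of two L2-normalized word vectors over a shared vocabulary.
    Returns a value in [0, 1] for non-negative inputs (cosine similarity)."""
    a = np.asarray(topic_word_probs, dtype=float)
    b = np.asarray(go_word_counts, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)
```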
Predicting GO annotations
With the knowledge that the scLDA can map topics to GO terms, we assessed the capability of the scLDA model in predicting GO annotations for newly observed text documents. We performed a validation experiment by dividing the 2,271 documents into 4 folds; we used 3 folds to train an scLDA model and then applied the trained model to the left-out fold to evaluate the accuracy of the GO annotations output by the model. Note that the large size of the validation set in our setting made it unnecessary to perform an all-fold cross validation. Given a set of new text documents, a trained scLDA model infers all possible topics for each document based on its words and calculates the marginal probabilities for GO terms by considering all topics existing in the document, as shown in Eq (4); these probability values can then be used to rank GO terms. On average, each document in our training set was associated with around two GO terms; therefore, we selected the top two GO terms predicted by scLDA as outputs to compare with the observed true GO annotations to assess the prediction accuracy. Since the prediction of GO terms is indirectly derived through the inferred topics, and as a result the granularity of topics can have an impact on the accuracy of predictions, we assessed the prediction performances of several scLDA models initialized with different numbers of topics. As a comparison, we also performed multi-label classification training and testing using a multinomial naïve Bayes classifier, which outputs the top 2 ranking classes for each document.
The results are shown in Figure 1, which indicates that, as the granularity of the models becomes finer (more topics per model), the prediction accuracy of the scLDA model increases but tends to plateau for models trained with over 300 topics. The results also show that, while the performance of the scLDA model is inferior to that of the naïve Bayes classifier, it is encouraging, because the latter is a generative text categorization classifier with accuracy comparable to other state-of-the-art text categorizers, such as support vector machines19. Overall, the performances of both naïve Bayes and scLDA are relatively poor; this is due to the large number of candidate classes that need to be predicted, in comparison to common text categorization tasks, which are usually binary or involve a small number of classes.
Figure 1.
Annotation performance. Results of scLDA models with different numbers of topics and that of a naïve Bayes classifier are shown.
Mapping textual evidence to annotations
The main purpose of developing the scLDA model was to capture the correspondence between the observed annotations and the textual regions that provide supporting evidence. Associating an annotation with its supporting text can be performed during both training and testing. By design, an scLDA model probabilistically associates each observed annotation with one or more sentences during training. During testing, we can first predict a few GO terms as described in the previous section, and then probabilistically associate each predicted annotation with the most likely supporting text using Eq (3).
Using the multi-annotation training data set, we trained an scLDA model with 70 topics; the results of mapping the observed GO annotations to their supporting evidence are shown as an HTML file in the Supplementary Materials. For each observed GO term, we have highlighted and color-coded the text region that has the strongest association (the highest probability) with the term. We provide these data so that readers can make their own assessment of the performance of the algorithm, because evaluation of the quality of the mapping between an annotation and its supporting evidence can be highly subjective. Figure 2 shows an example, in which the observed GO terms and their supporting sentences in the PubMed record are color-coded and highlighted. In this example, the mapping between the supporting sentences and the GO terms is apparently sensible.
Figure 2.
An example of mapping between text and annotations. For this PubMed abstract (PMID:10821758), the scLDA associates sentences that are most likely to support the two observed annotations, “Notch signaling pathway” (GO:0007219) and “Neuron migration” (GO:0001764). The GO terms and their supporting textual evidence are color-coded. The probability of each association determined by the model is shown next to the sentence.
Discussion
In this paper, we present the scLDA model, a member of the family of probabilistic topic models. Our results indicate that, by capturing the correspondences between sentences and GO annotations, the scLDA model is capable of achieving the following goals: 1) learning biological topics from a biomedical corpus, 2) learning a mapping between the biological topics and observed GO annotations, 3) associating observed GO annotations with potential supporting text regions (sentences) in a text document, 4) predicting GO annotations for a new biomedical text and providing potential supporting textual evidence.
It is worth noting that learning overall topics, inferring local semantic context and predicting the correspondence between annotations and textual evidence were all performed in a fully unsupervised manner in the scLDA model. While it is also possible for a supervised classifier to capture the local semantic information of a text by using local text regions as inputs to perform localized classification, such a classifier requires training data in which each text region is labeled with a class. This makes the approach of training a localized supervised classifier impractical, because manually labeling documents at the sentence level is extremely labor intensive. Thus, we believe that the capability of inferring local semantic context in an unsupervised/semi-supervised manner will become increasingly important in literature-based protein annotation and text mining when dealing with a large volume of text documents, particularly when working with full-text documents.
It should also be noted that the experiment of predicting GO annotations aims to demonstrate the relevance between the inferred topics and the predicted GO terms, not to present the scLDA as a state-of-the-art text categorizer, because it is well known that unsupervised learning is not adept at classification. The main merits of the scLDA model are capturing local semantic context and inferring a mapping between annotations and potential evidence, which give rise to new research directions for enhancing the capability of automatic annotation performed by supervised learning.
Capturing local semantic context
The key motivation for applying probabilistic topic models to biomedical literature is to investigate tools capable of capturing the multi-topical nature of biomedical texts and inferring the local semantic context of a text. The complexity of biomedical literature, particularly in full-text documents, will prevent the success of any approach that does not take into account the local semantic context of texts. Probabilistic topic models are a family of text-mining models that address this challenge. By representing a text document as a mixture of topics, these models can be used to infer the local semantic topics (context) of a text.
Our results demonstrate that the scLDA model is capable of learning topics from a corpus in an unsupervised manner and inferring the topic of a text region. Unlike other topic models, in which topics are assigned at the word level so that words within a sentence can belong to different topics, the scLDA constrains each sentence to serve as the semantic unit for topic labeling and topic inference. While this constraint renders the scLDA less flexible in terms of learning topics, it enables us to map GO annotations to supporting evidence in a more sensible way, in that sentences are natural semantic units. As a future direction, one could further combine formal BioNLP techniques to identify a block of text referring to a protein as a natural semantic unit discussing that protein.
Enhancing biological relevance of topics
The scLDA model considers both the words in a text and the GO annotations associated with it when inferring the topic of a sentence; thus GO terms have an impact on the scLDA’s learning of the word-usage profile for a topic. While the role of GO terms here is not exactly the same as in recently developed supervised topic models21, 22, in which the learning of topics is constrained to match labels, we believe that simultaneously learning topic words and topic GO terms helps the scLDA to learn topics more relevant to biological concepts.
Mapping annotations with supporting evidence
The major goal of developing the scLDA model is to develop the capability of simultaneously predicting annotations and providing supporting textual evidence, to overcome the “black box” approach adopted by most text categorization algorithms that have been studied in the field of literature-based protein annotation. Our results indicate that the scLDA model can achieve this goal. Given sufficient training cases, the prediction accuracy of the scLDA model is comparable to that of a naïve Bayes classifier. The results are encouraging in that the capability of providing textual evidence for annotations represents a significant advance over “black box” classifiers. Due to the complexity of human languages, it is difficult to envision that a computing system can be developed to predict functional annotations for proteins in a fully automatic manner. However, it is possible to develop a semi-automatic system that facilitates the manual annotation process by identifying potentially reliable annotations and presenting the predicted annotations together with their supporting evidence from the text, as a means to speed up the process. Under such a scenario, the capability of identifying potential textual evidence for predicted annotations will be critical to the success of such a system.
Future directions
The scLDA model is a novel text-mining model that explores a new way of modeling the local correspondence between text and annotations in the bioinformatics domain. The results presented in this paper are proof-of-concept in nature rather than state-of-the-art. We believe, however, that the model opens new directions that are worth further study and generalization to domains beyond GO annotation.
The first possible direction is to combine semantic modeling with discriminative text classification techniques. As a generative model, the scLDA aims to capture the statistical structure that “generates” the text and annotations, but it is not particularly adept at discriminating among candidate annotations when predicting GO terms. Another factor potentially limiting the accuracy of its predicted annotations is that prediction is an indirect process: the model first infers the topics present in a text and then “generates” annotations based on those topics. These shortcomings can be alleviated by employing a supervised classifier that, in discriminating among candidate classes, directly incorporates the information derived from local semantic modeling by the scLDA to achieve better prediction accuracy. Such an approach would take full advantage of both local semantic modeling and the power of discriminative classifiers, as shown in one of our previous studies17, in which semantic information enhanced the classification accuracy of supervised classifiers in text categorization.
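The hybrid scheme described above can be sketched as follows. This is a minimal, hypothetical illustration: the per-document topic proportions are simulated at random rather than inferred by the scLDA, and the GO label is a synthetic function of one topic, but it shows the key idea of feeding topic-model features into a discriminative classifier (here a small logistic regression trained by gradient descent).

```python
import numpy as np

# Hypothetical sketch: per-document topic proportions from a topic model
# (simulated here) reused as features for a discriminative classifier.
rng = np.random.default_rng(42)
n_docs, K = 200, 5
theta = rng.dirichlet(np.ones(K), size=n_docs)   # simulated topic mixtures
labels = (theta[:, 0] > 0.3).astype(float)       # synthetic GO label

# Logistic regression trained by batch gradient descent on topic features.
w, b = np.zeros(K), 0.0
lr = 2.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(theta @ w + b)))   # predicted probabilities
    w -= lr * (theta.T @ (p - labels)) / n_docs  # gradient of logistic loss
    b -= lr * (p - labels).mean()

preds = (1.0 / (1.0 + np.exp(-(theta @ w + b))) > 0.5).astype(float)
accuracy = (preds == labels).mean()
```

In a real system the classifier would be trained on held-out annotated documents, and the topic features would come from scLDA inference rather than simulation; the point is only that the discriminative step optimizes annotation prediction directly instead of relying on the generative model's indirect topic-to-annotation path.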
The second possible direction is to integrate the scLDA model into a semi-supervised learning framework in order to exploit its capability of mapping observed annotations to supporting textual evidence. As pointed out previously, many state-of-the-art supervised classifiers are superior at discriminating among candidate annotations, but their requirement for labeled training data at the level of local text regions makes fully supervised approaches impractical in this setting. In a semi-supervised learning framework17, 23, partial but reliable results from unsupervised models can be imputed as pseudo-labeled training cases to enhance the performance of a supervised learning algorithm. The mapping between text regions and observed GO annotations produced by the scLDA model thus serves as an excellent source of information in a semi-supervised learning environment.
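The pseudo-labeling step in such a framework can be outlined schematically. Everything in this sketch is hypothetical (the `confidence` stand-in, the example sentences, the GO id, and the threshold value); in a real pipeline the confidence score would come from the scLDA's posterior mapping between sentences and observed annotations.

```python
# Hypothetical sketch of the semi-supervised scheme described above:
# high-confidence sentence-to-annotation mappings from an unsupervised
# model are imputed as pseudo-labeled cases for a supervised learner.
labeled = [("kinase activity is required for signaling", "GO:0016301")]
unlabeled = ["the enzyme phosphorylates its substrate",
             "the factor binds DNA upstream of the gene"]

def confidence(text):
    # Stand-in for an scLDA-derived posterior score; a real system would
    # score each text region against each candidate GO annotation.
    return 0.9 if "phosphorylates" in text else 0.4

THRESHOLD = 0.8   # only sufficiently reliable mappings become pseudo-labels
pseudo_labeled = [(text, "GO:0016301")
                  for text in unlabeled
                  if confidence(text) >= THRESHOLD]

# The supervised classifier is then (re)trained on the augmented set.
training_set = labeled + pseudo_labeled
```

The threshold trades coverage against label noise: a stricter cutoff imputes fewer but more reliable pseudo-labeled cases, which is the usual design choice in such frameworks.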
Finally, another way to enhance the mapping between text regions and GO annotations is to combine the scLDA model with formal BioNLP. The current scLDA model relies mainly on statistical patterns of word and GO-term usage to infer the relationship between text regions and annotations, without considering the rich linguistic information in biomedical texts. Using BioNLP techniques, one can perform syntactic parsing and identify named entities corresponding to the genes of interest in the text; a supervised or semi-supervised learning technique can then combine the syntactic information derived from NLP with the semantic information derived from the scLDA to perform a more rigorous mapping between text regions and annotations.
Supplementary materials:
The source code in C++, experiment data and supplementary results are available for download at the following URL: http://carcweb.musc.edu/TextminingProjects/scLDA/
Acknowledgments
Funding: This study is partially supported by NIH grants: 5R01LM010144, 1R01LM011155 and 5R01LM009153.
References
1. Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29. doi: 10.1038/75556.
2. Baumgartner WA Jr, Cohen KB, Fox LM, Acquaah-Mensah G, Hunter L. Manual curation is not sufficient for annotation of genomic databases. Bioinformatics. 2007;23:i41–48. doi: 10.1093/bioinformatics/btm229.
3. Cohen KB, Hunter L. Getting started in text mining. PLoS Comput Biol. 2008;4:e20. doi: 10.1371/journal.pcbi.0040020.
4. Altman RB, et al. Text mining for biology--the way forward: opinions from leading scientists. Genome Biol. 2008;9(Suppl 2):S7. doi: 10.1186/gb-2008-9-s2-s7.
5. Wilbur WJ, Rzhetsky A, Shatkay H. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics. 2006;7:356. doi: 10.1186/1471-2105-7-356.
6. Blaschke C, Leon EA, Krallinger M, Valencia A. Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics. 2005;6(Suppl 1):S16. doi: 10.1186/1471-2105-6-S1-S16.
7. Camon EB, et al. An evaluation of GO annotation retrieval for BioCreAtIvE and GOA. BMC Bioinformatics. 2005;6(Suppl 1):S17. doi: 10.1186/1471-2105-6-S1-S17.
8. Hirschman L, Yeh A, Blaschke C, Valencia A. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics. 2005;6(Suppl 1):S1. doi: 10.1186/1471-2105-6-S1-S1.
9. Cohen AM, Hersh WR. The TREC 2004 genomics track categorization task: classifying full text biomedical documents. J Biomed Discov Collab. 2006;1:4. doi: 10.1186/1747-5333-1-4.
10. Blei D, Ng A, Jordan M. Latent Dirichlet allocation. Journal of Machine Learning Research. 2003;3:993–1022.
11. Griffiths TL, Steyvers M. Finding scientific topics. Proc Natl Acad Sci U S A. 2004;101(Suppl 1):5228–5235. doi: 10.1073/pnas.0307752101.
12. Blei DM, Jordan MI. Modeling annotated data. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2003. pp. 127–134.
13. Griffiths TL, Steyvers M, Tenenbaum JB. Topics in semantic representation. Psychol Rev. 2007;114:211–244. doi: 10.1037/0033-295X.114.2.211.
14. Teh Y, Jordan M, Beal M, Blei D. Hierarchical Dirichlet processes. J Am Stat Assoc. 2006;101:1566.
15. Zheng B, Lu X. Novel metrics for evaluating the functional coherence of protein groups via protein semantic network. Genome Biol. 2007;8:R153. doi: 10.1186/gb-2007-8-7-r153.
16. Zheng B, McLean DC Jr, Lu X. Identifying biological concepts from a protein-related corpus with a probabilistic topic model. BMC Bioinformatics. 2006;7:58. doi: 10.1186/1471-2105-7-58.
17. Lu X, Zheng B, Velivelli A, Zhai C. Enhancing text categorization with semantic-enriched representation and training data augmentation. J Am Med Inform Assoc. 2006;13:526–535. doi: 10.1197/jamia.M2051.
18. Porter MF. An algorithm for suffix stripping. Program. 1980;14:130–137.
19. Jin B, Muller B, Zhai C, Lu X. Multi-label literature classification based on the Gene Ontology graph. BMC Bioinformatics. 2008;9:525. doi: 10.1186/1471-2105-9-525.
20. Gruber A, Rosen-Zvi M, Weiss Y. Hidden topic Markov models. Proceedings of Artificial Intelligence and Statistics; San Juan, Puerto Rico. 2007.
21. Ramage D, Hall D, Nallapati R, Manning CD. Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. Proceedings of the Conference on Empirical Methods in Natural Language Processing; 2009. pp. 248–256.
22. Blei D, McAuliffe J. Supervised topic models. Neural Information Processing Systems. 2007.
23. Zhu X. Semi-Supervised Learning Literature Survey. Computer Sciences TR 1530, University of Wisconsin-Madison; 2007.