Journal of the American Medical Informatics Association (JAMIA). 2008 Jul-Aug;15(4):546–553. doi: 10.1197/jamia.M2431

Optimal Training Sets for Bayesian Prediction of MeSH® Assignment

Sunghwan Sohn 1,, Won Kim 1, Donald C Comeau 1, W John Wilbur 1
PMCID: PMC2442263  PMID: 18436913

Abstract

Objectives

The aim of this study was to improve naïve Bayes prediction of Medical Subject Headings (MeSH) assignment to documents using optimal training sets found by an active learning inspired method.

Design

The authors selected 20 MeSH terms whose occurrences cover a range of frequencies. For each MeSH term, they found an optimal training set, a subset of the whole training set. An optimal training set consists of all documents including a given MeSH term (C 1 class) and those documents not including a given MeSH term (C −1 class) that are closest to the C 1 class. These small sets were used to predict MeSH assignments in the MEDLINE® database.

Measurements

Average precision was used to compare MeSH assignment using the naïve Bayes learner trained on the whole training set, optimal sets, and random sets. The authors compared 95% lower confidence limits of average precisions of naïve Bayes with upper bounds for average precisions of a K-nearest neighbor (KNN) classifier.

Results

For all 20 MeSH assignments, the optimal training sets produced nearly 200% improvement over use of the whole training sets. In 17 of those MeSH assignments, naïve Bayes using optimal training sets was statistically better than a KNN. In 15 of those, optimal training sets performed better than optimized feature selection. Overall naïve Bayes averaged 14% better than a KNN for all 20 MeSH assignments. Using these optimal sets with another classifier, C-modified least squares (CMLS), produced an additional 6% improvement over naïve Bayes.

Conclusion

Using a smaller optimal training set greatly improved learning with naïve Bayes. The performance is superior to a KNN. The small training set can be used with other sophisticated learning methods, such as CMLS, where using the whole training set would not be feasible.

Introduction

MEDLINE is a large collection of bibliographic records of articles in the biomedical literature maintained by the National Library of Medicine (NLM). In late 2006, MEDLINE included about 16.5 million references, which have been processed by human indexing. Each MEDLINE reference is assigned a number of relevant medical subject headings (MeSH). MeSH is a controlled vocabulary produced by the NLM and used for indexing, cataloging, and searching biomedical and health-related information and documents (see http://www.nlm.nih.gov/mesh/ for details of MeSH).

Human indexing is costly and labor intensive. The indexing cost at the NLM consists of data entry, NLM staff indexing and revising, contract indexing, equipment, and telecommunication costs. 1 The annual budget for contracts to perform MEDLINE indexing, including purchase orders at four foreign centers, is several million dollars (James Marcetich, Head, NLM Index Section, personal communication, August 2007). NLM's indexers are highly trained in MEDLINE indexing practice as well as in one or more subject domains covered by the MEDLINE database. Since 1990, the MEDLINE database has grown more rapidly than before, with more documents available in electronic form. Because the cost of human indexing of the biomedical literature is high, many attempts have been made to provide automatic indexing. 1-8 The NLM Indexing Initiative is a research effort to explore indexing methodologies for semiautomated user-assisted indexing as well as fully automated indexing applications. 1,2 This project has produced a system for recommending indexing terms for arbitrary biomedical text, especially titles and abstracts of journal articles. The system has been in use by library indexers since September 2002. It consists of several methods of discovering MeSH terms that are combined to produce an ordered list of recommended indexing terms. K-nearest neighbor (KNN) is used as one method to rank the MeSH terms that are candidates for indexing a document. 5 This study investigated the naïve Bayes learner as an alternative to KNN.

For the MEDLINE database, all references have titles and about half have abstracts. This is more than 16.5 gigabytes of data. With this much data, it is not realistic to run the most sophisticated machine learning algorithms. Only the simplest and most efficient algorithms can process these data on high-end commodity servers (2 CPUs, 4 GB of memory). Naïve Bayes has the efficiency to work with all of MEDLINE. In this study, the performance of naïve Bayes for automatic MeSH assignment was investigated. One challenge is that many MeSH terms occur in only a very small portion of the MEDLINE database, and for them naïve Bayes performs poorly. This poor performance is related to a severely imbalanced class distribution—very few documents include a given MeSH term (C 1 class) and a very large number of documents do not include it (C −1 class). In Bayesian learning for binary classification, a large preponderance of C −1 documents dominates the decision process and degrades classification performance on the unseen test set. Careful example selection is needed to solve this problem and improve classification performance.

In this article, we perform example selection that starts from a small training set (STS) and iteratively adds informative examples selected from the whole training set (WTS) into the STS until an optimum is reached. Because a given MeSH term occurs in only a small portion of the WTS, all C 1 documents are placed in the initial STS. Then, the C −1 documents most similar to C 1 documents are iteratively added to the STS. As the size of the STS increases, the STS that produces the best result on the training set by a leave-one-out cross-validation method is selected as our optimal training set (OTS). Although we use the word optimal, the selected set might not be a global optimum because we find this set by a greedy approach. The detailed procedure will be explained under Example Selection in the Methods section. Naïve Bayes using this OTS provides a superior alternative to KNN, which is one method currently used in the NLM Indexing Initiative 1,2 for MeSH prediction.

Traditionally, example selection has been used for three major reasons. 9 The first reason is to control computational cost. Standard support vector machines (SVM) 10 require long training times that scale superlinearly with the number of features and become prohibitive with large data sets. Pavlov et al. 11 used boosting to combine multiple SVMs trained on small subsets. Boley and Cao 12 reduced the training data by partitioning the training set into disjoint clusters and replacing any cluster containing only nonsupport vectors with its representative. Quinlan 13 developed a windowing technique to reduce the time for constructing decision trees for very large training sets. A decision tree was built on a randomly selected subset from the training set and tested for the remaining examples that were not included in this subset. Then, the selected misclassified examples were added to the initial subset, and a tree was constructed from the enlarged training set and tested on the remaining set. This cycle was repeated until all the remaining examples were correctly classified. For us cost is not a concern because naïve Bayes can be trained efficiently on all of MEDLINE.

The second reason for example selection is to reduce labeling cost when labeling all examples is too expensive. Active learning is a way to deal with this problem. It uses current knowledge to predict the best choice of unknown data to label next in an attempt to improve the efficiency of learning. Active learning starts with a small number of labeled data as the initial training set. It then repeatedly cycles through learning from the training set, selecting the most informative data to be labeled, labeling them by a human expert, and adding the newly labeled data to the training set. These informative documents may be near the decision boundary 14,15 or they may be the documents producing maximal disagreement among a committee of classifiers. 16,17 The emphasis is on the best possible learning from the fewest possible documents. 14,18,19 While studying active learning methods, we saw instances where a small training set produced better results than the whole training set. Others have seen the same effect. 15,20,21 However, they observed this effect only in limited cases and the improvement was small (generally <10%). By contrast, we find a large improvement for all cases, nearly 200% on average. Also, we do not need active learning to avoid labeling because the entire training set is labeled. We use an active learning–inspired approach not to minimize labeling cost, but to maximize performance.

The third reason is to improve learning by focusing on the most relevant examples. Our example selection belongs to this category. Boosting 22 can also be implemented as a type of example selection. Wilbur 23 used staged Bayesian retrieval on a subset of MEDLINE, and it outperformed a more standard boosting approach. He initially trained naïve Bayes on the whole training set and used it to select examples that had a higher probability of belonging to a small specialty database. Both the selected examples and the small specialty data set were used as a training set for the second-stage classifier. He then combined the two classifiers to obtain the best performance. His method appears similar to our example selection, but he was unaware of the poor performance of the naïve Bayes classifier on the whole MEDLINE database. He used a relatively small subset of MEDLINE and saw only a small (<10%) improvement in performance. What he did resembles the first round of optimization used to obtain the OTS in our method. By contrast, we iteratively perform example selection to reach an optimum for a single classifier and find much greater improvement. Example selection to deal with the imbalanced class problem has previously been proposed as a method to improve learning. Various strategies have been suggested to tackle this problem. Sampling to balance examples between the majority and minority classes is a commonly used method. 24 Up-sampling randomly selects examples with replacement from the minority class until the size of the minority class matches that of the majority class. It does not gain information about the minority class, but increases the misclassification cost of the minority class. Alternatively, one can directly assign a larger misclassification cost to the minority class than to the majority class. 25,26 Down-sampling eliminates examples from the majority class until it balances with the minority class. Examples to eliminate can be selected randomly or focused on those farther from the minority class. Others, using clustering and various other algorithms, have attempted to reflect the character of the majority class in a fair manner. 27 Down-sampling may lose information from the majority class and risks harming performance. Our method is conceptually similar to focused down-sampling. In focused down-sampling the criteria are set a priori and then examples are selected to balance class size. However, we do not aim for a balanced class size in our OTS. We explicitly adjust the subset of focused examples from the majority set iteratively until the best training is achieved.

Methods

Data Preparation

MEDLINE is a collection of references to articles in the biomedical literature. We used the titles and, where available, the abstracts. At the time of our experiment, MEDLINE included 16,534,506 references. For each MeSH term, the WTS was created by randomly selecting two-thirds of the documents from the C 1 class and two-thirds from the C −1 class. This gives the WTS the same proportion of C 1 and C −1 documents as in all of MEDLINE. The remaining documents served as our test set. Stop words were removed, but no stemming was performed. Using all single words and two-word phrases in the titles and abstracts provided 56,194,161 features. However, we used feature selection (Appendix A, online supplement available at www.jamia.org) and so not all of them were used in a naïve Bayes classifier. The MeSH terms are not used as features in the actual training and test process, but are only used to define classes.
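For concreteness, the following minimal Python sketch (not the authors' code) illustrates the split just described: for a given MeSH term, two-thirds of the C 1 documents and two-thirds of the C −1 documents are sampled into the WTS, and the remainder forms the test set. The (pmid, text, has_term) tuple representation is a hypothetical stand-in for the MEDLINE records.

```python
# A minimal sketch, assuming documents are (pmid, text, has_term) tuples, where
# has_term indicates whether the record was indexed with the given MeSH term.
import random

def split_wts_test(docs, train_fraction=2/3, seed=0):
    rng = random.Random(seed)
    c1 = [d for d in docs if d[2]]           # C 1: documents with the MeSH term
    c_neg1 = [d for d in docs if not d[2]]   # C -1: documents without the term
    rng.shuffle(c1)
    rng.shuffle(c_neg1)
    n1 = int(len(c1) * train_fraction)
    n_neg = int(len(c_neg1) * train_fraction)
    wts = c1[:n1] + c_neg1[:n_neg]           # same C 1/C -1 proportion as all of MEDLINE
    test = c1[n1:] + c_neg1[n_neg:]          # remaining documents form the test set
    return wts, test
```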

Classification Tasks

Our classification task was to predict which documents were assigned a particular MeSH term (C 1 class) and which documents were not assigned that term (C −1 class). We selected 20 MeSH terms with the number of C 1 class articles covering a wide frequency range: approximately 100,000, 50,000, 30,000, 20,000, 10,000, 5,000, 4,000, 3,000, 2,000, and 1,000 C 1 class articles. All but one of these terms are leaf MeSH terms, and the appropriate documents can be searched for directly in PubMed. “Myocardial infarction” is an internal node of the MeSH hierarchy. The proper search is a union of the results of searching for it directly and searching for the terms below it in the hierarchy: “myocardial stunning” and “shock, cardiogenic.” A detailed explanation of MeSH can be found at http://www.nlm.nih.gov/mesh/.

Learning Methods

For our principal learner we used the naïve Bayes Binary Independence Model (BIM), 28 in which a document is represented by a vector of binary attributes indicating presence or absence of features in the document (for details refer to Appendix A, online supplement). We made this choice because BIM can be trained rapidly and can efficiently handle a large amount of data.
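As an illustration, a minimal sketch of a BIM-style naïve Bayes classifier follows. It uses the textbook log-odds weights with add-0.5 smoothing; the exact probability estimates and document score used in this study are defined in Appendix A and may differ in detail.

```python
# A minimal BIM sketch: documents are sets of binary features; a feature's weight
# is its log-odds of appearing in C 1 versus C -1, and a document's score is the
# sum of weights of the features it contains.
import math
from collections import Counter

def train_bim(pos_docs, neg_docs, smoothing=0.5):
    """pos_docs / neg_docs: lists of feature sets for C 1 / C -1 documents."""
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    pos_df, neg_df = Counter(), Counter()
    for d in pos_docs:
        pos_df.update(d)
    for d in neg_docs:
        neg_df.update(d)
    weights = {}
    for f in set(pos_df) | set(neg_df):
        p = (pos_df[f] + smoothing) / (n_pos + 2 * smoothing)  # P(f present | C 1)
        q = (neg_df[f] + smoothing) / (n_neg + 2 * smoothing)  # P(f present | C -1)
        weights[f] = math.log(p * (1 - q) / (q * (1 - p)))     # log-odds weight
    return weights

def score(doc_features, weights):
    # Sum of the weights of the features present in the document.
    return sum(weights.get(f, 0.0) for f in doc_features)
```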

C-modified least squares (CMLS) 29 is a wide-margin classifier that has many properties in common with SVMs. However, its smooth loss function allows us to apply a gradient search method to optimize rapidly, and thus it can be applied to larger data sets. Although CMLS can be trained faster than an SVM, it is still impractical to apply CMLS to the WTS. However, we can run CMLS on the smaller OTS. Typically CMLS performs better than Bayes. The question is, how will it perform on an example set optimized for Bayesian learning?
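CMLS itself is not reproduced here. The sketch below only illustrates the point about smooth losses: a linear classifier with a squared-hinge ("modified least squares"-style) loss can be optimized by simple gradient steps. The exact CMLS loss, regularization, and update schedule of Zhang and Oles differ from this toy version.

```python
# A toy gradient trainer for a linear classifier with a smooth squared-hinge loss,
# standing in for CMLS. X: list of feature-index sets; y: labels in {+1, -1}.
from collections import defaultdict

def train_smooth_margin(X, y, epochs=10, lr=0.01, reg=1e-4):
    w = defaultdict(float)
    for _ in range(epochs):
        for feats, label in zip(X, y):
            margin = label * sum(w[f] for f in feats)
            if margin < 1.0:
                # gradient of the loss (1 - margin)^2 with respect to each active weight
                g = -2.0 * (1.0 - margin) * label
                for f in feats:
                    # regularization applied lazily, only to active features, for simplicity
                    w[f] -= lr * (g + reg * w[f])
    return w
```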

Because the NLM Indexing Initiative 1,2 currently uses a KNN method to aid MeSH prediction, it is valuable to compare our Bayes results on the OTS with the same KNN method. The standard approach of a KNN classifier would compare all pairs of documents, one from the test set and one from the training set. This is a very expensive computation for a huge database such as MEDLINE. To reduce the computational cost we obtained the upper bounds for KNN average precision. For details refer to Appendix C (online supplement). The upper bounds of KNN were compared with 95% lower confidence limits of Bayes, which were obtained by Student's t-test. Because a higher average precision is better, if the lower bound of the naïve Bayes method using the OTS is higher than the upper bounds we found for the KNN method, we can safely conclude that the naïve Bayes method using OTS is better than the KNN method.
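The sketch below shows one way to compute a one-sided 95% lower confidence limit of a mean using Student's t distribution. How the per-sample average precision values are generated for each MeSH term (for example, over splits of the test set) is an assumption here and not taken from this section.

```python
# A minimal sketch of a one-sided 95% lower confidence limit for a mean,
# given a list of average precision measurements for one MeSH term.
import math
from scipy import stats

def lower_confidence_limit(samples, alpha=0.05):
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)  # sample variance
    t_crit = stats.t.ppf(1 - alpha, df=n - 1)              # one-sided critical value
    return mean - t_crit * math.sqrt(var / n)
```

If this lower limit for OTS Bayes exceeds the upper bound computed for KNN, Bayes can be declared better for that MeSH term.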

Example Selection

To identify the optimal training set (OTS) for Bayesian learning, we followed a procedure that simulates active learning (Figure 1). This is not active learning because it requires the entire training set to be labeled before this example selection is performed.

Figure 1. Example selection algorithm.

*A score is defined in equation (A.5) in Appendix A (online supplement).

†For M, we used 1% of the number of C 1 documents. For example, if the number of C 1 documents is 100,000, then we use M = 1,000. This percentage was determined experimentally to balance the number of iterations to find the OTS with the chance of missing the true peak of the curve. If M is too large, the number of iterations will be reduced, but performance might be slightly worse if the true peak is missed. If M is too small, the number of iterations will be increased without significant improvement in performance.

‡Using the same data set for both training and testing would generally lead to overtraining and give misleading results. This can be avoided by holding out a distinct validation set. Instead, we used leave-one-out cross-validation on the training set. Each training set document is scored as if the model were trained on all other training set documents. Because of the independence assumptions of Bayes, an efficient implementation of this scoring is possible. The details of this implementation are described in Appendix B (online supplement).
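Putting the figure and its footnotes together, a minimal sketch of the selection loop might look like the following. The ranking score, the BIM trainer, and the leave-one-out average precision are passed in as hypothetical hooks (equation A.5 and the Appendix B implementation are not reproduced), and the choice to rank the first batch with a model trained on the whole training set is an assumption.

```python
# A minimal sketch of the greedy search for the optimal training set (OTS).
def find_optimal_training_set(c1_docs, c_neg1_docs, train_bim, score,
                              loo_average_precision):
    m = max(1, len(c1_docs) // 100)             # M = 1% of the number of C 1 documents
    weights = train_bim(c1_docs, c_neg1_docs)   # assumption: initial ranking from a WTS-trained model
    remaining = sorted(c_neg1_docs, key=lambda d: score(d, weights), reverse=True)
    sts_neg, best_ap, best_size = [], -1.0, 0
    while remaining:
        sts_neg.extend(remaining[:m])           # add the M C -1 documents closest to C 1
        remaining = remaining[m:]
        ap = loo_average_precision(c1_docs, sts_neg)  # leave-one-out AP on the current STS
        if ap > best_ap:
            best_ap, best_size = ap, len(sts_neg)
        elif ap < best_ap:
            break                               # greedy stop once past the peak of the curve
        weights = train_bim(c1_docs, sts_neg)   # re-rank the remaining C -1 documents
        remaining.sort(key=lambda d: score(d, weights), reverse=True)
    return c1_docs, sts_neg[:best_size]         # OTS = all C 1 documents + selected C -1 documents
```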

Good results have been obtained from random down-sampling of examples in other domains. 24 To address questions of the size of our OTS versus the specific documents in that set, we used random sampling of the C −1 documents to create a random STS (Ran STS) with the same number of elements as the OTS. We then applied Bayes learning to these Ran STS.

Feature Selection

Proper feature selection often improves the performance of a machine learner. For the naïve Bayes classifier we implemented feature selection by setting a threshold and retaining only those features whose weights are in absolute value above that threshold. Previous research has shown this to be a highly effective feature selection method for naïve Bayes when class size is unbalanced. 30 This allows us to reduce the feature dimensionality and generally see a gain in performance. In most of our work with naïve Bayes reported here, we use a fixed threshold value of 1. This allows a large reduction in the number of features and generally does not degrade performance. To further investigate the effectiveness of feature selection, we estimated optimal threshold values for each classification task on the WTS (using leave-one-out) and tested them in order to compare the performance with our example selection method.
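A minimal sketch of this threshold rule, assuming a dictionary of naïve Bayes feature weights such as the one produced by the BIM sketch above:

```python
# Keep only features whose weight exceeds the threshold in absolute value
# (a fixed threshold of 1 is used for most experiments in this study).
def select_features(weights, threshold=1.0):
    return {f: w for f, w in weights.items() if abs(w) > threshold}
```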

Evaluation

To measure classification performance, we used average precision. This simple measure is well suited to our problem. To calculate average precision, 31 the 5,511,502 documents in the test set are ordered by the score from the machine learner. Precisions are calculated at each rank where a C 1 document is found and these precision values are averaged. A detailed definition of average precision is provided in Appendix D (online supplement). We also present precision-recall curves for a sample of the classification tasks.
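A minimal sketch of this measure, assuming every test document has been scored so that all C 1 documents appear somewhere in the ranking (the exact definition is in Appendix D):

```python
# Average precision: rank documents by score, take the precision at each rank
# where a C 1 document occurs, and average those precisions.
def average_precision(scored_docs):
    """scored_docs: list of (score, is_c1) pairs for the test set."""
    ranked = sorted(scored_docs, key=lambda x: x[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_c1) in enumerate(ranked, start=1):
        if is_c1:
            hits += 1
            precisions.append(hits / rank)   # precision at this C 1 document's rank
    return sum(precisions) / len(precisions) if precisions else 0.0
```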

Previous studies have shown limitations using accuracy and ROC for imbalanced data sets. 32 Accuracy is only meaningful when the cost of misclassification is the same for all documents. Because we are dealing with cases where the class C 1 is much smaller than C −1, the cost of misclassifying a C 1 document needs to be much higher. One could classify all documents as C −1 and obtain high accuracy if the cost is taken to be the same over all documents. For example, with our largest C 1 class, calling all documents C −1 gives an accuracy over 99%. For the smallest class, the accuracy would be more than 99.99%. Clearly a more sensitive measure is needed.

The challenge of our data set is measuring how highly the C 1 documents are ranked without being unduly influenced by the large number of low-ranking C −1 documents. Given a particular set of C 1 and C −1 documents and their associated ROC score, the ROC score can be increased simply by adding irrelevant C −1 documents with scores lower than any existing C 1 document. In fact, an ROC score arbitrarily close to perfect can be obtained by adding enough irrelevant low-scoring C −1 documents. In sharp contrast, adding C −1 documents that score below the lowest C 1 document has no effect on the average precision. Average precision is much more sensitive to retrieving C 1 documents at high ranks. Because we are only interested in such high-ranking C 1 documents, average precision is well suited to our purpose.

Results

The numerical results of these experiments appear in Table 1.

Table 1. Average Precisions for Prediction of MeSH Terms in MEDLINE Articles

MeSH Terms | Number of C1 | OTS Size | WTS Bayes | WTS Bayes OptCut | OTS Bayes | Ran STS Bayes | OTS CMLS
Rats, Wistar | 122,815 | 742,540 | 0.160 | 0.309 | 0.376 | 0.154 | 0.386
Myocardial infarction | 101,810 | 252,131 | 0.325 | 0.644 | 0.674 | 0.322 | 0.688
Blood platelets | 51,793 | 128,286 | 0.274 | 0.600 | 0.599 | 0.265 | 0.649
Serotonin | 50,522 | 124,581 | 0.175 | 0.564 | 0.578 | 0.163 | 0.626
State medicine | 31,338 | 357,993 | 0.096 | 0.215 | 0.262 | 0.091 | 0.244
Bladder | 30,572 | 154,715 | 0.231 | 0.474 | 0.481 | 0.219 | 0.514
Drosophila melanogaster | 21,695 | 53,740 | 0.243 | 0.684 | 0.688 | 0.204 | 0.689
Tryptophan | 20,391 | 194,950 | 0.108 | 0.514 | 0.500 | 0.094 | 0.557
Laparotomy | 10,284 | 173,304 | 0.043 | 0.218 | 0.209 | 0.040 | 0.289
Crowns | 10,152 | 51,138 | 0.178 | 0.501 | 0.551 | 0.168 | 0.581
Streptococcus mutans | 5,105 | 12,430 | 0.374 | 0.716 | 0.744 | 0.386 | 0.752
Infectious mononucleosis | 5,040 | 46,260 | 0.134 | 0.537 | 0.583 | 0.130 | 0.614
Blood banks | 4,076 | 39,494 | 0.109 | 0.256 | 0.315 | 0.107 | 0.345
Humeral fractures | 4,087 | 31,793 | 0.128 | 0.450 | 0.507 | 0.122 | 0.569
Tuberculosis, lymph node | 3,036 | 66,584 | 0.117 | 0.249 | 0.343 | 0.108 | 0.348
Mentors | 3,275 | 55,214 | 0.048 | 0.367 | 0.368 | 0.038 | 0.419
Tooth discoloration | 2,052 | 19,764 | 0.108 | 0.302 | 0.365 | 0.094 | 0.469
Pentazocine | 2,014 | 12,202 | 0.041 | 0.678 | 0.590 | 0.030 | 0.681
Hepatitis E | 1,032 | 2,508 | 0.309 | 0.611 | 0.675 | 0.290 | 0.629
Genes, p16 | 1,057 | 11,847 | 0.100 | 0.319 | 0.286 | 0.081 | 0.244
Average | | | 0.165 | 0.460 | 0.485 | 0.155 | 0.515

WTS Bayes OptCut used an optimal cutoff for feature selection in Bayes; the other Bayes columns used a cutoff value of 1.

CMLS = C-modified least squares; MeSH = medical subject headings; OTS = optimal training set; Ran STS = random small training set.

Whole training set (WTS) = 11,023,004 documents. Optimal training set (OTS) = C 1 documents + optimal C −1 documents (details in Figure 1).

For naïve Bayes we used a fixed weight threshold of 1 for feature selection except for WTS Bayes OptCut, where we used a customized threshold value for each classification task. Using a threshold of 1 allowed Bayes to use a much smaller number of features on the WTS, ranging from 15,261 to 1,080,168 features depending on the classification task (without a threshold there are 56,194,161 features). The OTS size varied from 2,508 to 742,540 documents for different MeSH assignments, which is 0.02% to 6.74% of the WTS size. The average precision using Bayes on the OTS (OTS Bayes) ranged from 0.209 to 0.744, with an average of 0.485. Compared to the overall average of 0.165 seen on the WTS, this is nearly a 200% improvement. Figure 2 shows precision-recall curves of WTS Bayes and OTS Bayes for some classification tasks. Here it is helpful to recall that the average precision is the area under the precision-recall curve. In all cases the area under the curve for OTS training was much larger than for WTS training. Also, the OTS curve was above the WTS curve at most recall levels, except for recall levels close to 1 in some cases. This is highly preferable in information retrieval, where relevant examples should appear in the top ranks.

Figure 2. A comparison of precision-recall curves of WTS Bayes and OTS Bayes: (A) Drosophila melanogaster, (B) Streptococcus mutans, (C) mentors, (D) pentazocine.

The overall average using Bayes on the WTS with an optimal threshold (WTS Bayes OptCut in Table 1) was 0.460. The threshold ranged from 3.6 to 9.4. The performance was much improved over the WTS with a fixed cutoff value of 1 (WTS Bayes), but not as good as using Bayes on the OTS. In 15 of 20 MeSH assignments, OTS Bayes performed better than WTS Bayes OptCut.

Although it is not feasible to train a complex machine learning algorithm on the WTS because of its huge size, the much smaller size of the OTS allows us to use a more sophisticated learner such as CMLS. Using CMLS on the OTS (OTS CMLS in Table 1) further improved the performance and produced results on average 6% better than Bayes.

To address the importance of the size of the OTS versus the specific documents in the OTS, we created a comparable random small training set (Ran STS in Table 1). It includes all of the C 1 documents, just as in the OTS, and a number of randomly selected C −1 documents equal to the number of C −1 documents in the OTS. Thus, the Ran STS has the same size as the OTS. For example, the MeSH term “Rats, Wistar” has 122,815 C 1 documents and 619,725 (= OTS size − C 1 size = 742,540 − 122,815) randomly selected C −1 documents from the WTS. When Bayesian learning was performed on the random STS (Ran STS Bayes), the average precisions were a little lower than on the WTS, and much lower than on the OTS.

The upper bounds of KNN were compared with the 95% lower confidence limits of Bayes. Table 2 shows the 95% lower confidence limits of the average precisions for Bayes trained on the OTS, obtained by Student's t-test, along with upper bounds for the KNN average precisions. In 17 of 20 MeSH assignments, Bayes using the OTS was statistically better than KNN. We also performed the Sign test; superiority in 17 of 20 cases yields a p-value of 0.00129. Therefore, we can safely conclude that naïve Bayes using the OTS is better than KNN.

Table 2. A Comparison of 95% Lower Confidence Limit of Average Precision for Bayes and Upper Bound to KNN Average Precision

MeSH Terms | OTS Bayes 95% Lower Confidence Limit | KNN Upper Bound
Rats, Wistar* | 0.374 | 0.414
Myocardial infarction | 0.671 | 0.623
Blood platelets | 0.595 | 0.521
Serotonin | 0.574 | 0.473
State medicine | 0.258 | 0.216
Bladder | 0.476 | 0.461
Drosophila melanogaster | 0.683 | 0.579
Tryptophan | 0.494 | 0.398
Laparotomy | 0.202 | 0.151
Crowns | 0.542 | 0.518
Streptococcus mutans | 0.733 | 0.674
Infectious mononucleosis | 0.571 | 0.506
Blood banks | 0.305 | 0.231
Humeral fractures* | 0.493 | 0.530
Tuberculosis, lymph node | 0.327 | 0.295
Mentors | 0.350 | 0.301
Tooth discoloration* | 0.348 | 0.366
Pentazocine | 0.565 | 0.436
Hepatitis E | 0.654 | 0.567
Genes, p16 | 0.271 | 0.267
Average | | 0.426

*The 95% lower confidence limit of OTS Bayes is less than the KNN upper bound.

KNN = K-nearest neighbor; MeSH = medical subject headings; OTS = optimal training set.

A plot of the average precision versus the size of the STS for three MeSH terms appears in Figure 3. The OTS occurs at the peak of each curve, at a much smaller size than the WTS.

Figure 3. Average precision versus number of documents for several MeSH terms.

Discussion

Using an optimal training set can greatly improve learning with naïve Bayes. Although 11 million training documents can easily be handled, using them all is not the best option. Much better results can be obtained using a carefully chosen smaller training set. In the smallest improvement seen, the average precision nearly doubled. In the best case, the improvement was by a factor of 14. Proper feature selection is another way to improve naïve Bayes performance. When using an optimized weight threshold (WTS Bayes OptCut) for each classification task, we saw much better performance than using a fixed weight threshold for all classification tasks (WTS Bayes). The overall performance, however, was not as good as using the OTS. As a further benefit of the OTS approach, the small size of the OTS allows using a more sophisticated machine learner such as CMLS. CMLS generally performs better than naïve Bayes. In our case, CMLS trained on the OTS showed a 6% improvement over naïve Bayes trained on the OTS. In practice, it would make sense to apply CMLS to the OTS, if affordable, because in most cases it was better than naïve Bayes. A KNN classifier, which is used as one method to rank MeSH terms by the Indexing Initiative at the NLM, 1,2 was also compared with naïve Bayes on the OTS. In most cases, naïve Bayes performed better than KNN. On average, CMLS on the OTS was 21% better than KNN.

An important question is why training on the relatively small OTS produces such a large improvement in the naïve Bayes classifier's performance when compared with training on the WTS. At the most basic level, if the naïve assumption that all features are independent of each other given the context of the class were true, arguably, naïve Bayes would be the ideal classifier. Because this naïve assumption is generally false, different more complex algorithms such as support vector machines, CMLS, and decision trees, and even more sophisticated Bayesian network approaches 33 have been developed to deal at some level with dependencies among features. These more complex algorithms often give an improvement over naïve Bayes but with a higher computational cost. Because our approach is successful in improving over naïve Bayes on the WTS, it must also be a way to deal with dependencies.

To see how our approach deals with dependencies, it is helpful to consider the following argument. Naïve Bayes on the WTS learns how to distinguish C 1 and C −1 documents. But C −1 consists of two types of documents, i.e.,

C −1 = B 1 ∪ B −1        (1)

where B 1 is a very small set of documents that are close in content to C 1 documents, whereas B −1 is most of C −1 and consists of documents that are distant from C 1 and unlikely to be confused with it. Now training naïve Bayes on the WTS is essentially teaching it to distinguish C 1 and B −1, as B 1 will have almost no influence on the probability estimates used to compute weights. Such training may be far from optimal in its ability to distinguish between C 1 and B 1. Our method is a way to determine B 1 so that OTS = C 1 ∪ B 1. Then training on the OTS is optimal naïve Bayesian training to distinguish C 1 and B 1, and because B −1 is already distant from C 1 we may expect to see a large improvement in performance. But the very existence of such a set as B 1 is only possible because features are not independently distributed in C −1. It is the co-occurrence of a number of features in a single document in B 1 above random that gives that document a particular flavor or topicality that can be very similar to and confused with a document in C 1. Removal of the B −1 set from the training process alleviates the feature dependency problem and improves the results. This solution is somewhat similar to the support vector machine, in which training generally focuses on a small set of support vectors that determine the final training result while a large part of the training set is ignored. However, our Bayesian approach is much more efficient to train than a support vector machine.

In the final analysis, whatever improvement one sees must be reflected in a difference in the weights computed. Equation (A.6) in Appendix A (online supplement) gives the definition of a weight for a Bayesian learner. A document's score is the sum of weights for features that appear in the document. A positive weight denotes that a feature is more likely to appear in C 1 documents, a negative weight denotes that a feature is more likely to appear in C −1 documents, and a weight close to zero denotes no preference for either class. Tables 3 and 4 show examples of features (words) from the “pentazocine” classification task with significant weight changes between the OTS and the WTS. The features in Table 3 appear about equally likely in C 1 and the nearby C −1 documents included in the OTS (set B 1). These are not useful for distinguishing the two sets and have OTS weights near zero. However, when using the WTS, C −1 includes many distant documents that do not include these features (set B −1), so the features now appear relatively more frequently in C 1 documents, leading to positive WTS weights. In Table 4 are features that are somewhat related to C 1 documents, but that appear more frequently in set B 1 documents, and so have negative OTS weights. They are useful for recognizing that a document is not a C 1 document. When the many distant B −1 documents are included in the WTS, these features also emerge as more common in C 1 compared to C −1 and receive a positive weight. In both tables, the more positive weights obtained with the WTS move the scores of documents in B 1 to more positive values, harming the precision. Careful example selection, eliminating irrelevant documents from the majority class, C −1, helps to alleviate this problem.

Table 3. “Pentazocine” Classification Task: Neutral Features Inappropriately Given Positive Weight by Training on the WTS

Feature Term | Weight Trained on OTS | Weight Trained on WTS
Antinociception | 0.0019 | 3.7516
Addictive | 0.0115 | 3.1606
Anesthetic agents | 0.0115 | 2.3108
Central action | 0.0115 | 3.4684

Abbreviations as in Table 1.

Table 4. “Pentazocine” Classification Task: Negative Weighted Features Inappropriately Given Positive Weight by Training on the WTS

Feature Term | Weight Trained on OTS | Weight Trained on WTS
Bupivacaine | −1.0459 | 1.2844
Thiopental | −1.0459 | 1.2598
Dynorphin | −1.0014 | 1.4213
Sphincter of Oddi | −1.0014 | 3.6554

Abbreviations as in Table 1.
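The weight shift seen in Tables 3 and 4 can be reproduced with a small numeric example. The sketch below uses the textbook BIM log-odds weight (which may differ from equation A.6) and made-up counts: a feature that is equally frequent in C 1 and the nearby B 1 documents has a weight near zero on the OTS, but becomes strongly positive once a large number of distant B −1 documents that rarely contain it are added to the negative class.

```python
import math

# Textbook BIM log-odds weight (an assumption; the study's equation A.6 may differ).
def bim_weight(df_pos, n_pos, df_neg, n_neg, s=0.5):
    p = (df_pos + s) / (n_pos + 2 * s)   # P(feature present | C 1)
    q = (df_neg + s) / (n_neg + 2 * s)   # P(feature present | C -1)
    return math.log(p * (1 - q) / (q * (1 - p)))

# Made-up counts: the feature occurs in 40% of 2,000 C 1 docs and 40% of 10,000
# nearby B 1 docs (OTS-like negatives), giving a weight near zero.
print(bim_weight(800, 2000, 4000, 10000))      # ~0.0
# Add 1,000,000 distant B -1 docs, of which only 100 contain the feature
# (WTS-like negatives): the weight becomes strongly positive (~+5).
print(bim_weight(800, 2000, 4100, 1010000))
```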

The poor results from the Ran STS show that random down-sampling of examples does not work for our data sets, even though others have observed better performance 24 in a different setting. The dramatically better results from the OTS, which is the same size as the Ran STS, demonstrate that the improvement is due to the particular documents selected, not just the small size. The nature of the OTS is more important than its size. In the OTS, C −1 documents are more likely to be close to C 1 documents—they lie near the decision boundary. This gives more discriminative learning and better results.

We believe there is a larger lesson in our experience estimating probabilities of word occurrences. In any endeavor in which probabilities must be estimated, the choice of training data can be crucial to success. A large number of training examples that are irrelevant to the issue can seriously dilute those relevant examples that would otherwise provide useful probability estimates. This phenomenon may be important not only for naïve Bayes learners, but also for Bayesian networks, Markov models, and decision trees, which all use probabilities.

There is a need for further work. Much work has gone into feature selection. More consideration should be given to example selection. This is especially true when the C 1 documents are a very small proportion of the available training set. However, we have seen similar results in a few cases of balanced training sets. Although we are not doing active learning, our identification of the optimal training set for Bayesian learning is still iterative. We would like to identify this set directly, possibly using information available from learning on the whole training set. Finally, we would like to investigate optimal training set creation for other learning methods, such as CMLS.

Note: References 34–36 are cited in the online data supplement to this article at www.jamia.org.

Footnotes

Supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine. The authors thank the reviewers for valuable feedback and suggestions.

References

  1. Aronson AR, Bodenreider O, Chang HF, et al. The NLM Indexing Initiative. Proc AMIA Symp 2000:17-21.
  2. Aronson AR, Mork JG, Gay CW, Humphrey SM, Rogers WJ. The NLM Indexing Initiative's Medical Text Indexer. Medinfo 2004:268-272.
  3. Cooper GF, Miller RA. An experiment comparing lexical and statistical methods for extracting MeSH terms from clinical free text. J Am Med Inform Assoc 1998;5:62-75.
  4. Fowler J, Maram S, Kouramajian V, Devadhar V. Automated MeSH indexing of the World-Wide Web. Proc Annu Symp Comput Appl Med Care 1995:893-897.
  5. Kim W, Aronson AR, Wilbur WJ. Automatic MeSH term assignment and quality assessment. Proc AMIA Symp 2001:319-323.
  6. Kim W, Wilbur WJ. A strategy for assigning new concepts in the MEDLINE database. AMIA 2005 Symp Proc 2005:395-399.
  7. Kouramajian V, Devadhar V, Fowler J, Maram S. Categorization by reference: A novel approach to MeSH term assignment. Proc Annu Symp Comput Appl Med Care 1995:878-882.
  8. Ruch P. Automatic assignment of biomedical categories: Toward a generic approach. Bioinformatics 2006;22:658-664.
  9. Blum AL, Langley P. Selection of relevant features and examples in machine learning. Artif Intell 1997;97:245-271.
  10. Burges CJC. A tutorial on support vector machines for pattern recognition. Bell Laboratories, Lucent Technologies; 1999. Available electronically from the author.
  11. Pavlov D, Mao J, Dom B. Scaling-up support vector machines using boosting algorithm. 15th International Conference on Pattern Recognition, Barcelona, Spain, September 3-8, 2000. Los Alamitos, CA: IEEE Computer Society; 2000. pp. 2219-2222. doi: 10.1109/ICPR.2000.906052.
  12. Boley D, Cao D. Training support vector machines using adaptive clustering. In: Berry M, Dayal U, Kamath C, Skillicorn D, editors. 4th SIAM International Conference on Data Mining, Lake Buena Vista, Florida, April 22-24, 2004. Philadelphia, PA: Society for Industrial and Applied Mathematics; 2004. pp. 126-137.
  13. Quinlan JR. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers; 1993.
  14. Lewis DD, Catlett J. Heterogeneous uncertainty sampling for supervised learning. In: Cohen WW, Hirsh H, editors. Eleventh International Conference on Machine Learning, New Brunswick, New Jersey, July 10-13, 1994. San Francisco, CA: Morgan Kaufmann Publishers; 1994. pp. 148-156.
  15. Lewis DD, Gale WA. A sequential algorithm for training text classifiers. 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, July 3-6, 1994. New York, NY: Springer-Verlag; 1994. pp. 3-12.
  16. Freund Y, Seung H, Shamir E, Tishby N. Selective sampling using the query by committee algorithm. Mach Learn 1997;28:133-168.
  17. Seung HS, Opper M, Sompolinsky H. Query by committee. Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, July 27-29, 1992. New York, NY: ACM Press; 1992. pp. 287-294. doi: 10.1145/130385.130417.
  18. Tong S, Koller D. Support vector machine active learning with applications to text classification. J Mach Learn Res 2001;2:45-66.
  19. Roy N, McCallum A. Toward optimal active learning through sampling estimation of error reduction. In: Brodley CE, Danyluk AP, editors. Eighteenth International Conference on Machine Learning, Williamstown, MA, June 28-July 1, 2001. San Francisco, CA: Morgan Kaufmann Publishers; 2001.
  20. Bordes A, Ertekin S, Weston J, Bottou L. Fast kernel classifiers with online and active learning. J Mach Learn Res 2005;6:1579-1619.
  21. Schohn M, Cohn D. Less is more: Active learning with support vector machines. In: Langley P, editor. Proceedings of the Seventeenth International Conference on Machine Learning, 2000. San Francisco, CA: Morgan Kaufmann; 2000.
  22. Schapire RE. The boosting approach to machine learning: An overview. MSRI Workshop on Nonlinear Estimation and Classification; 2002.
  23. Wilbur WJ. Boosting naive Bayesian learning on a large subset of MEDLINE. American Medical Informatics 2000 Annual Symposium, Los Angeles, CA. American Medical Informatics Association; 2000. pp. 918-922.
  24. Japkowicz N, Stephen S. The class imbalance problem: A systematic study. Intell Data Anal 2002;6:429-450.
  25. Domingos P. MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, August 15-18, 1999. New York, NY: ACM Press; 1999. pp. 155-164.
  26. Maloof M. Learning when data sets are imbalanced and when costs are unequal and unknown. Proceedings of the ICML-2003 Workshop: Learning with Imbalanced Data Sets II, August 21-24, 2003. Menlo Park, CA: AAAI Press; 2003. pp. 73-80.
  27. Nickerson AS, Japkowicz N, Milios E. Using unsupervised learning to guide resampling in imbalanced data sets. Proceedings of the Eighth International Workshop on AI and Statistics, January 4-7, 2001. London, UK: Gatsby Computational Neuroscience Unit; 2001. pp. 261-265.
  28. Lewis DD. Naive (Bayes) at forty: The independence assumption in information retrieval. ECML 1998:4-15.
  29. Zhang T, Oles FJ. Text categorization based on regularized linear classification methods. Inf Retrieval 2001;4:5-31.
  30. Mladenic D, Grobelnik M. Feature selection for unbalanced class distribution and naive Bayes. Sixteenth International Conference on Machine Learning, 1999. San Francisco, CA: Morgan Kaufmann; 1999. pp. 258-267.
  31. Manning CD, Schutze H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press; 1999.
  32. Visa S, Ralescu A. Issues in mining imbalanced data sets—A review paper. Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, April 16-17, 2005. Cincinnati, OH: University of Cincinnati; 2005. pp. 67-73.
  33. Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Mach Learn 1997;29(2-3):131-163.
  34. Madsen RE, Kauchak D, Elkan C. Modeling word burstiness using the Dirichlet distribution. 22nd International Conference on Machine Learning, Bonn, Germany. ACM Press; 2005. pp. 545-552.
  35. Witten IH, Moffat A, Bell TC. Managing Gigabytes. Second edition. San Francisco: Morgan Kaufmann; 1999.
  36. Salton G. Automatic Text Processing. Reading, MA: Addison-Wesley; 1989.
