AMIA Annual Symposium Proceedings. 2010 Nov 13;2010:537–541.

Author Keywords in Biomedical Journal Articles

Aurélie Névéol 1, Rezarta Islamaj Doğan 1, Zhiyong Lu 1
PMCID: PMC3041277  PMID: 21347036

Abstract

As an information retrieval system, PubMed® aims at providing efficient access to documents cited in MEDLINE®. For this purpose, it relies on matching representations of documents, as provided by authors and indexers, to user queries. In this paper, we describe the growth of author keywords in biomedical journal articles and present a comparative study of author keywords and MeSH® indexing terms assigned by MEDLINE indexers to PubMed Central Open Access articles. A similarity metric is used to automatically assess the relatedness between pairs of author keywords and indexing terms. A set of 300 pairs is manually reviewed to evaluate the metric and characterize the relationships between author keywords and indexing terms. Results show that author keywords are increasingly available in biomedical articles and that over 60% of author keywords can be linked to a closely related indexing term. Finally, we discuss the potential impact of this work on indexing and terminology development.

INTRODUCTION

Information retrieval via a search engine relies on the representation of meaning in texts such as user queries and document content. These representations of a user’s information need and an author’s information production are compared and matched to produce a result set of documents found to be relevant to the user’s query. In order to facilitate this process, an additional representation of the document content within the collection can be provided by an information specialist.

In the biomedical domain, MEDLINE® indexers use indexing terms from the Medical Subject Headings (MeSH®) thesaurus to summarize the content of articles in the MEDLINE database. These indexing terms consist of main headings (e.g., Asthma) or main heading/subheading combinations (e.g., Asthma/drug therapy). While about a dozen MeSH indexing terms are assigned to each article cited in MEDLINE (for brevity, we refer to those as MEDLINE indexing terms), MeSH covers topics in the entire biomedical domain with about 25,000 main headings. Meanwhile, many biomedical journals ask authors to provide a list of keywords along with their manuscript submission.

Although the assignment of keywords to an article by the author and the assignment of MeSH indexing terms by the indexer may seem like two very similar activities, there are significant differences in terms of form and perspective. Typically, authors are asked to choose a small number of keywords, without reference to a controlled vocabulary, whereas indexers are trained to select indexing terms from MeSH according to a specific protocol. Moreover, in addition to the subjectivity inherent in an indexing task [1–3], authors are focused on selecting keywords representing what they consider important to describe the content of their own article. In contrast, indexers consider the article in the larger scope of the collection.

Hence, for a given article, it is no surprise that the sets of author keywords and indexing terms can be substantially different. In fact, in the PubMed Central Open Access subset, authors provide an average of 5.3 (±1.9) keywords while indexers assign on average 13.0 (±11.9) indexing terms. Table 1 shows the 4 author keywords and 25 MEDLINE indexing terms assigned by indexers for the sample citation (Surgenor et al., 2009 - PMID 11299068) that we will use as a running example.

Table 1:

Author keywords and MeSH indexing terms assigned to a sample article indexed in MEDLINE.

Author keywords:
decision-making
rural health services
interhospital transport
survival analysis
MEDLINE Indexing Terms:
Adult
Aged
Cohort Studies
Decision Making
Diagnosis-Related Groups
Female
Health Services Accessibility
Hospital Mortality
Hospitals, Community/organization & administration
Hospitals, Rural/organization & administration
Humans
Intensive Care Units/utilization
Length of Stay
Male
Middle Aged
New Hampshire/epidemiology
Outcome Assessment (Health Care)
Patient Transfer/statistics & numerical data
Prospective Studies
Survival Analysis

Few studies have compared author keywords and MeSH indexing terms provided in MEDLINE citations. One study assessed the authors’ knowledge of MeSH through a manual comparison of author keywords and MEDLINE indexing on a set of 415 articles published by a Korean journal [4]. Similarly, another study compared author keywords from 706 articles published in a Spanish journal to the corresponding MEDLINE indexing [5]. It centered on evaluating the author’s description of an article content using keywords. It also aimed at assessing the evolution of the descriptions in a dozen subfields over the time period of the corpus (1994–2001).

The main goal of our work is to compare the article representations provided independently by authors and indexers, and to evaluate to what degree the indexers capture topics considered as important by the authors. It substantially differs from the two previous studies cited above. First, instead of relying on manual examination of the two lists, we propose an automatic method for comparison. Second, we conduct experiments for over 14,000 articles in 735 journals, as opposed to a few hundred articles from a single journal in the previous investigations. Finally, the underlying assumption in both past studies was that MEDLINE indexing was the gold standard that author keywords should compare to. In contrast, we aim at assessing whether topics covered by author keywords are covered by MeSH indexing terms. This may be useful for information retrieval as author keywords are currently not used in PubMed for either indexing or retrieving citations. We also aim at analyzing author topics that may not be covered by indexing terms as this may be useful for terminology development.

MATERIAL AND METHODS

PubMed Central Open Access subset

This study is based on the PubMed Central (PMC) Open Access subset, a document collection comprising full-length articles under specific copyrights in PMC. Out of 176,005 articles in PMC Open Access, a subset of 14,398 articles includes both author keywords and MEDLINE indexing terms and was used in this study.

Automatically Matching Representations

One of the biggest challenges in this study is to assess the overlap between author keywords and MeSH indexing terms using automatic tools. To this end, we propose to use exact match and a similarity measure to find closely related MEDLINE indexing terms for each author keyword provided for a given article.

Finding exact matches

To find exact matches between author keywords and MeSH indexing terms, we used two different versions of MeSH (2004 and 2010) to map author keywords to MeSH. Specifically, all main headings, subheadings and entry terms were mapped to their corresponding unique MeSH identifiers, to allow for term variant grouping. For example, the author keyword antibiotics could be matched to the MeSH main heading Anti-Bacterial Agents because antibiotics is an entry term for Anti-Bacterial Agents. Terms such as macrolide therapy that could not be automatically mapped to an exact MeSH equivalent were left verbatim.
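The mapping step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used in the study: the lookup table, the placeholder identifiers and the helper names (`normalize`, `map_to_mesh`, `exact_matches`) are all assumptions, and the only normalization applied is lowercasing and whitespace collapsing.

```python
import re

# Hypothetical miniature of a MeSH lookup table: every main heading and entry
# term points to the identifier of its heading. The identifiers below are
# placeholders, not real MeSH unique identifiers.
MESH_TERMS = {
    "Anti-Bacterial Agents": "ID_ANTIBACTERIAL",
    "Antibiotics": "ID_ANTIBACTERIAL",  # entry term for the heading above
    "Survival Analysis": "ID_SURVIVAL",
}


def normalize(term):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", term.strip().lower())


LOOKUP = {normalize(term): mesh_id for term, mesh_id in MESH_TERMS.items()}


def map_to_mesh(keyword):
    """Return the identifier for a keyword, or None if it cannot be mapped."""
    return LOOKUP.get(normalize(keyword))


def exact_matches(author_keywords, indexing_terms):
    """Split keywords into those sharing an identifier with an indexing term
    of the same article and those left unmatched."""
    indexed_ids = {map_to_mesh(term) for term in indexing_terms} - {None}
    matched, unmatched = [], []
    for keyword in author_keywords:
        mesh_id = map_to_mesh(keyword)
        (matched if mesh_id in indexed_ids else unmatched).append(keyword)
    return matched, unmatched


matched, unmatched = exact_matches(
    ["antibiotics", "macrolide therapy"], ["Anti-Bacterial Agents"]
)
print(matched, unmatched)
```

Grouping variants under one identifier is what lets antibiotics match Anti-Bacterial Agents, while a keyword absent from the table, such as macrolide therapy, is passed on unchanged to the similarity step.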

Measuring term similarity

For each author keyword in a given article that remains unmatched, we measured its relatedness to every MEDLINE indexing term of the same article in order to identify the most closely related indexing term. For this purpose, we relied on the “PubMed Distance” [6] – a metric that evaluates the semantic relatedness between two biomedical terms based on their separate and joint search result counts in PubMed. It is computed using the following formula:

$$PD(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log M - \min\{\log f(x), \log f(y)\}}$$

where M is the total number of articles indexed by PubMed. f(x) is defined as the count of articles a PubMed search returns for the search term x. f(x,y) is the number of articles PubMed returns for search x and y. The range of the distance is between zero and infinity. In this work, a distance greater than one is normalized to one. The smaller a PubMed distance is, the closer the two terms are, i.e., a score of 0 means the terms are identical. The advantage of the PubMed distance compared to other similarity measures for the biomedical domain (such as those reviewed in [7]) is that it allows comparisons between terms that are not part of a controlled vocabulary. This is the case for the majority of the author keywords in this work.
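As a rough sketch, the distance can be computed from three hit counts and the collection size. The code assumes the formula has the normalized form (max{log f(x), log f(y)} - log f(x,y)) / (log M - min{log f(x), log f(y)}). The function name, the argument names and the treatment of zero counts are assumptions made for illustration, and retrieving the counts from PubMed is left out.

```python
import math


def pubmed_distance(f_x, f_y, f_xy, total):
    """Distance between two terms from their hit counts.

    f_x, f_y: number of articles returned for each term alone
    f_xy:     number of articles returned for both terms together
    total:    M, the total number of articles indexed
    """
    if f_x <= 0 or f_y <= 0 or f_xy <= 0:
        # Assumption: a term without hits, or a pair that never co-occurs,
        # is treated as maximally distant.
        return 1.0
    log_x, log_y = math.log(f_x), math.log(f_y)
    denominator = math.log(total) - min(log_x, log_y)
    if denominator <= 0:
        # Assumption: degenerate case where a term matches every article.
        return 1.0
    distance = (max(log_x, log_y) - math.log(f_xy)) / denominator
    # Distances greater than one are normalized to one, as stated in the text.
    return min(max(distance, 0.0), 1.0)


# Two terms that always occur together are at distance 0.
print(pubmed_distance(1000, 1000, 1000, 20_000_000))
```

Only ratios of logarithms appear in the formula, so the base of the logarithm does not affect the result.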

After computing PubMed distance scores between a given author keyword and all the MEDLINE indexing terms in a given article, we identified the indexing term with the smallest score (i.e., the shortest distance) and paired it to the relevant author keyword. For example, in the article by Surgenor et al., only one author keyword survival analysis is an exact match to a MEDLINE indexing term. For the other 3 author keywords, their distances to each of the 25 MEDLINE indexing terms were computed in order to identify the most closely related indexing term. Table 2 shows the identified indexing terms with their respective shortest PubMed distances to the 3 author keywords.
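The pairing step amounts to taking a minimum over the indexing terms of the article. The sketch below is illustrative only: `closest_indexing_term` and `toy_distance` are hypothetical names, and in the toy scores only the value for Hospitals, Rural is taken from Table 2, while the other two values are invented placeholders.

```python
def closest_indexing_term(keyword, indexing_terms, distance):
    """Return (term, score) for the indexing term nearest to the keyword.

    `distance` is any callable taking (keyword, term) and returning a score
    where a smaller value means a closer relation.
    """
    if not indexing_terms:
        return None, 1.0  # assumption: nothing to pair with, maximal distance
    scored = [(distance(keyword, term), term) for term in indexing_terms]
    best_score, best_term = min(scored)  # ties broken alphabetically by term
    return best_term, best_score


# Toy scores for illustration: only the first value comes from Table 2,
# the other two are invented placeholders.
TOY_SCORES = {
    "Hospitals, Rural": 0.254,
    "Length of Stay": 0.70,
    "Humans": 0.90,
}


def toy_distance(keyword, term):
    """Stand-in for a real PubMed distance computation (hypothetical)."""
    return TOY_SCORES.get(term, 1.0)


best_term, best_score = closest_indexing_term(
    "rural health services", list(TOY_SCORES), toy_distance
)
print(best_term, best_score)
```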

Table 2:

Pairs of author keywords and MEDLINE indexing terms with shortest PubMed distances.

Author keyword | MEDLINE indexing term | PubMed distance
decision-making | Decision Making | 0
rural health services | Hospitals, Rural | 0.254
interhospital transport | Patient Transfer/statistics & numerical data | 0.361

Selecting Related Terms

Finally, in order to choose a cut-off value for deciding whether an author keyword is closely related to a MEDLINE indexing term, we assess the likelihood of observing a given PubMed distance score strictly by chance, based on results from randomly assembled pairs of keywords and indexing terms. After computing the distances for the randomly sampled pairs, we sort all pairs by distance score in ascending order and find the score that separates the 1% of pairs with the lowest distances from the rest of the pairs. A randomly assembled pair therefore scores at or below this value with a probability of only about 0.01, which gives us a significance level (p=0.01) for an observed PubMed distance. Hence, we can use this cut-off value as a threshold to determine whether an author keyword is truly related to the MEDLINE indexing term it was paired with.
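The cut-off selection can be illustrated as an empirical percentile over the scores of random pairs. This is a sketch under stated assumptions: the function names are hypothetical, the way the random pairs are sampled is not shown, and the exact percentile convention used in the study is not specified in the text.

```python
import math


def chance_threshold(random_scores, alpha=0.01):
    """Return the score at or below which a fraction `alpha` of the
    random-pair distances falls (simple empirical percentile)."""
    if not random_scores:
        raise ValueError("need at least one random-pair score")
    ordered = sorted(random_scores)  # ascending: lowest distances first
    # Index of the last score belonging to the lowest `alpha` fraction.
    cutoff_index = max(math.ceil(alpha * len(ordered)) - 1, 0)
    return ordered[cutoff_index]


def is_closely_related(score, threshold):
    """A pair counts as closely related when its distance is lower than or
    equal to the threshold."""
    return score <= threshold


# Synthetic scores for illustration only: 200 evenly spaced values.
synthetic_scores = [i / 200 for i in range(1, 201)]
threshold = chance_threshold(synthetic_scores, alpha=0.01)
print(threshold, is_closely_related(0.254, 0.38), is_closely_related(0.479, 0.38))
```

With real data, the threshold returned for random pairs would then be applied to the pairs produced by the matching step.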

Evaluation of the automatic representations

To assess the automatic methods and further characterize the relations between author keywords and indexing terms, 300 pairs of author keyword / MEDLINE indexing term were randomly selected to be manually reviewed and classified into the following six categories:

  1. Synonyms, acronyms or graphical variants. e.g., bmi/Body Mass Index

  2. Author keyword is more specific than MEDLINE indexing term. e.g., indole-3-acetic acid/Indoleacetic Acids

  3. MEDLINE indexing term is more specific than author keyword. e.g., neoplasms/Prostatic Neoplasms

  4. Author keyword and MEDLINE indexing term are closely related. e.g., ophthalmologist/Eye Abnormalities

  5. Author keyword and MEDLINE indexing term are loosely related. e.g., inflammation/Cytokines

  6. Author keyword and MEDLINE indexing term are unrelated. e.g., clinical signs/Turkey

For the manual review, the 300 pairs were equally split into 3 sets, each of which was first independently reviewed by two reviewers. In particular, the distance score was removed so that the reviewers were blind to the results of the automatic similarity. The classification results from two separate reviewers on the same set of 100 pairs were then compared for computing inter-annotator agreement. Then, the pairs classified differently by the two reviewers were discussed until an agreement was reached before including the consensus into the final classification results. Once the pairs were classified, we computed an average distance score for each of the six classes.

As an assessment of the similarity measure, our expectation was that smaller distance scores would correspond to categories 1 to 4 (which represent pairs of closely related terms) whereas larger scores would correspond to categories 5 and 6 (which represent loosely related or unrelated terms).
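The per-class averaging used in this assessment is straightforward, and a minimal sketch is given here for concreteness. The function name and the input layout (one `(class, distance)` tuple per reviewed pair) are assumptions, and the sample values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_distance_by_class(reviewed_pairs):
    """Average the distance scores of manually classified pairs.

    reviewed_pairs: iterable of (class_label, distance) tuples, where
    class_label is the consensus category (1 to 6) given by the reviewers.
    """
    by_class = defaultdict(list)
    for label, score in reviewed_pairs:
        by_class[label].append(score)
    return {label: mean(scores) for label, scores in sorted(by_class.items())}


# Invented sample: two pairs in class 1 and one pair in class 6.
sample = [(1, 0.0), (1, 0.2), (6, 0.5)]
print(average_distance_by_class(sample))
```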

Further characterization of author keywords

To further characterize those author keywords not having an exact match in the set of MeSH indexing terms assigned to MEDLINE and to identify potential candidates for inclusion in MeSH, the same 300 author keywords were reviewed and classified based on their coverage in MeSH. Author keywords refer to:

  1. A topic covered in MeSH but not selected by indexers. e.g., rural health services is actually a MeSH main heading but was not selected by MEDLINE indexers in our sample citation.

  2. A topic covered by a unique MeSH heading, but the author keyword is not linked to the corresponding MeSH heading e.g., decision-making (with dash “-”) is not an entry term for the main heading Decision Making (without dash).

  3. A topic covered in MeSH by multiple headings. e.g., chitosan-alginate is covered by the two main headings Chitosan and Alginates.

  4. A topic not covered in MeSH e.g., non-alcoholic steatohepatitis

RESULTS

The percentage of articles with author keywords has grown significantly over the past 10 years.

As seen in Figure 1, it increased from 0.6% in 2000 to 15% in 2009.

Figure 1: Cumulative percentage of articles with author keywords in the PMC Open Access set.

Results of the automatic matching between author keywords and MEDLINE indexing terms

As can be seen in Table 3, 46% (34,940/75,901) of the author keywords can be mapped to MeSH through exact matching. However, only 25% (18,885) also appear in MEDLINE indexing. The remaining 57,016 author keywords, which have no exact match among the MEDLINE indexing terms, either can be mapped to MeSH but were not selected for indexing in MEDLINE (e.g., rural health services in Table 2) or cannot be mapped to MeSH (e.g., interhospital transport).

Table 3:

Automatic matching of author keywords to MEDLINE indexing terms (0.38 similarity threshold).

Author keywords | MeSH | NOT MeSH | Row SUM
Exact match to indexing term (%) | 18,885 (25%) | - | 18,885 (25%)
Closely related to indexing term (%) | 7,026 (9%) | 21,059 (28%) | 28,085 (37%)
Not related to any indexing term (%) | 9,029 (12%) | 19,902 (26%) | 28,931 (38%)
Column SUM | 34,940 (46%) | 40,961 (54%) | 75,901 (100%)
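The margins and percentages of Table 3 can be re-derived from its six cell counts, which is a quick way to check the figures quoted in the text. In the sketch below, the counts are copied from the table, the dash in the first row is read as zero (an interpretation), and the variable and function names are illustrative.

```python
# Cell counts copied from Table 3 as (MeSH, NOT MeSH) per row.
# The "-" cell of the first row is interpreted as 0.
TABLE_3 = {
    "exact match": (18_885, 0),
    "closely related": (7_026, 21_059),
    "not related": (9_029, 19_902),
}

row_sums = {label: mesh + other for label, (mesh, other) in TABLE_3.items()}
column_sums = tuple(sum(column) for column in zip(*TABLE_3.values()))
total = sum(row_sums.values())


def percent(count, whole=total):
    """Percentage of all author keywords, rounded to the nearest integer."""
    return round(100 * count / whole)


unmatched = total - row_sums["exact match"]
covered = row_sums["exact match"] + row_sums["closely related"]
print(row_sums, column_sums, total)
print(unmatched, percent(column_sums[0]), percent(covered))
```

The 57,016 keywords without an exact match and the 62% of keywords with an exact or closely related indexing term both follow from these sums.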

Figure 2 shows the breakdown of 57,016 author keywords according to their respective distance scores. As can be seen, 400 author keywords have a distance score of 0, suggesting that these author keywords are semantically equivalent to their MeSH counterpart despite the fact that they could not be paired through exact matching.

Figure 2: Number of author keywords with a PubMed distance below different thresholds.

Based on the analysis of the random pairs of author keywords and indexing terms, we observed that only 1% of the pairs have a score below 0.38 while the remaining 99% of the random pairs have scores greater than 0.38. This result suggests that 0.38 is a reasonable threshold for measuring a statistically significant (p=0.01) relatedness between term pairs. The results of using 0.38 as the distance threshold are shown in Table 3. Of the author keywords, 28,085 were found to have a distance score lower than or equal to the threshold and were consequently considered closely related to a MEDLINE indexing term. Conversely, 28,931 author keywords had a distance score above the threshold and were considered to be unrelated.

Results of the evaluation of term relatedness

The overall classification results are displayed in Table 4. They show that pairs of author keywords and indexing terms in Class 1 have a very small average distance score. Moreover, the average distance scores of Classes 2, 3 and 4 are clearly smaller than those of Classes 5 and 6. This suggests that the PubMed distance metric succeeds in measuring different ranges of relatedness.

Table 4:

Classification of the relationships between keywords and indexing terms.

Class | Pairs (%) | Avg. PubMed Distance
1 | 17 (5.7%) | 0.099
2 | 29 (9.7%) | 0.236
3 | 30 (10.0%) | 0.297
4 | 102 (34.0%) | 0.323
5 | 67 (22.3%) | 0.479
6 | 55 (18.3%) | 0.596

In addition, results in Table 4 show that the distance cut-off value (0.38), chosen by the random sampling, seems to be an appropriate threshold for separating related and unrelated terms.

In order to determine the strength of inter-annotator agreement on the initial manual classification results, the Kappa coefficients were computed. The Kappa values for the three sets are 0.412, 0.388, and 0.413, indicating that the strength of inter-annotator agreements is either “fair” or “moderate” according to the classic Kappa coefficient classification.
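For readers who want to reproduce this kind of agreement figure, the sketch below computes a two-rater kappa. It assumes Cohen's kappa, since each set was reviewed by two reviewers, and it assumes that the "classic" classification refers to the Landis and Koch scale. The text confirms neither choice explicitly, and the function names are illustrative.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty set")
    n = len(labels_a)
    # Proportion of items on which the two reviewers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each reviewer's label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_label(kappa):
    """Strength of agreement on the Landis and Koch scale (assumed here)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


print(cohen_kappa([1, 1, 2, 2], [1, 2, 2, 2]))
print([agreement_label(k) for k in (0.412, 0.388, 0.413)])
```

Applied to the three reported values, this scale gives "moderate", "fair" and "moderate", in line with the description above.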

Results of the characterization of author keywords

Table 5 presents the results of the categorization of author keywords for which no corresponding MEDLINE indexing terms were found. It also shows that the majority of the author keywords refer to existing concepts in MeSH while another 33% of author keywords are not covered in MeSH.

Table 5:

Classification of the relationships between keywords and MeSH indexing terms.

Coverage of author keyword in MeSH | Number out of 300 (%)
Covered in MeSH but not selected by indexers | 108 (36%)
Covered in MeSH but not yet linked | 47 (16%)
Covered by multiple MeSH headings | 45 (15%)
Not covered in MeSH | 100 (33%)

DISCUSSION AND CONCLUSION

From the MEDLINE indexing perspective, our results (in Table 3) show that the majority (62%) of the author keywords are already covered by an exact or closely related MeSH indexing term in MEDLINE indexing. For example, when an author supplied the keyword thiamethoxam, which is a MeSH Supplementary Concept, the corresponding main heading Thiazoles was selected by MEDLINE indexers. Many other cases of divergence can be related to indexing inconsistencies observed in previous studies [2]. Previous research efforts involving an analysis of author annotations also pointed out that authors do not have the thorough knowledge of terminologies that indexers do, so that their indexing recommendations should be considered cautiously [8]. In light of the FEBS Letters experiment where protein/protein interactions provided by authors were not well received by curators, it was concluded that authors are not necessarily the right people to generate keyword summaries validating their own claims [9].

One major reason for author keywords to have no related equivalent in the MEDLINE indexing is that the topic referred to by the author keyword may not be specifically covered in MeSH. In fact, from the terminology development perspective, we show (in Table 5) that 49% of the author keywords are either not covered in MeSH (33%) or not linked to their equivalent in MeSH (16%), which suggests that author keywords may be helpful for terminology development. For instance, author keywords with very low distance scores to MeSH indexing terms could be considered as entry terms (e.g., CD8+ t-cells for the main heading CD8-Positive T-Lymphocytes). Similarly, keywords that are not covered in MeSH, (e.g., non-alcoholic steatohepatitis) could be considered for the creation of new main headings. However, other terms that are not currently covered in MeSH, (e.g., robustness in the context of computer simulation) may be out of scope for a vocabulary focusing on the biomedical literature. Another argument for considering author keywords for terminology development is that authors often act as users of information retrieval systems themselves. As such, keywords they choose are likely to be query search terms as well. Thus, including some author keywords in a terminology such as MeSH could ultimately help users retrieve relevant information.

In conclusion, we have shown that author keywords—which are increasingly available in PubMed Central—can be advantageously used for terminology development. The similarity measure used in this study was shown to be efficient in assessing the relatedness between biomedical terms and useful in selecting author keywords for review by terminology developers.

Acknowledgments

This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. The authors would like to thank W. John Wilbur and Lou Knecht for useful discussions, and G. Craig Murray for his editorial assistance.

References

  1. Morris J. Individual differences in the interpretation of text: Implications for information science. J Am Soc Inf Sci Technol. 2009;60(12):141–9.
  2. Funk ME, Reid CA. Indexing consistency in MEDLINE. Bull Med Libr Assoc. 1983 Apr;71(2):176–83.
  3. Lancaster FW. Indexing and abstracting in theory and practice. University of Illinois; Champaign, IL: 1991.
  4. Lee SH, Moon HW. A Comparison Study of Subject Words of Korean Medical Journal Papers: Author Keywords vs MeSH Terms Assigned by MEDLINE. Journal of the Korean society for information management. 2000 Sep;17(3):109–124.
  5. de Granda Orive JI, García Río F, Roig Vázquez F, Escobar Sacristán J, Gutiérrez Jiménez T, Callol Sánchez L. Key words, essential tools for bibliographic research: analysis of usage in Archivos de Bronconeumología for respiratory system knowledge areas. Arch Bronconeumol. 2005 Feb;41(2):78–83. doi: 10.1016/s1579-2129(06)60401-1.
  6. Lu Z, Wilbur WJ. Improving accuracy for identifying related PubMed queries by an integrated approach. J Biomed Inform. 2009 Oct;42(5):831–8. doi: 10.1016/j.jbi.2008.12.006.
  7. Pedersen T, Pakhomov SV, Patwardhan S, Chute CG. Measures of semantic similarity and relatedness in the biomedical domain. J Biomed Inform. 2007 Jun;40(3):288–99. doi: 10.1016/j.jbi.2006.06.004.
  8. Hahn U, Wermter J, Blasczyk R, Horn PA. Text mining: powering the database revolution. Nature. 2007 Jul 12;448(7150):130. doi: 10.1038/448130b.
  9. Lok C. Literature mining: Speed reading. Nature. 2010 Jan 28;463(7280):416–8. doi: 10.1038/463416a.
