Author manuscript; available in PMC: 2011 Sep 1.
Published in final edited form as: Health Info Libr J. 2010 Sep;27(3):235–243. doi: 10.1111/j.1471-1842.2010.00897.x

Caption-based topical descriptors for microscopic images of breast neoplasms as published in academic papers

Sujin Kim 1, Shannon Lamkin 2, Pam Duncan 3
PMCID: PMC2933792  NIHMSID: NIHMS227102  PMID: 20712718

Introduction

With the advancement of biomedical imaging technologies, the visual findings of scientific discoveries are increasingly available through published academic papers. As is evident from most biomedical publishers, the figures accompanying articles are stored separately in literature databases. However, the stored images are not easily retrievable because they are not indexed with the descriptors necessary to facilitate retrieval. Compared with conventional indexing and abstracting services, which exist mostly for text-based information retrieval, very little attention has been given to image retrieval in medicine. For instance, there is no way in the current PubMed system to retrieve breast cancer images of middle-aged women scanned from immunohistochemistry staining slides; one can only gain access to published images via hyperlinks within the main document. More importantly, the visual findings in papers are usually described in accompanying texts such as captions. Captions often contain the detailed experimental procedures and results corresponding to a specific image, so a reader who wants to grasp the core findings reported in an image should not overlook them. At present, however, there is only limited support for searching published images by the words in their captions.

Pathology is among the most image-intensive biomedical disciplines. A growing number of digitized pathology images are used for clinical diagnosis, scientific research, and biomedical education. Digital archiving of scanned pathologic images allows them to be stored without risk of breakage, fading, or scratching. However, the archived images are difficult to search because of inadequate descriptions: they are normally labeled only with a sequential file name or an abbreviated diagnostic name, such as liver01.jpeg. Moreover, no comprehensive or standard vocabulary specific to image retrieval is applied to published articles in online literature databases such as PubMed Central.

As the basic link between an image and the content of the published work, captions can be the best source of topical descriptors for the non-text information contained in a scholarly paper. However, only a limited number of studies have assessed topical descriptors extracted from the captions attached to biomedical images, and none has attempted to characterize keywords from microscopic image captions. As a preliminary step toward understanding the topical description of published microscopic images, the current study assessed the major characteristics of a set of topical descriptors produced by an automatic keyword finder and by human indexers from selected captions downloaded from the PubMed Central database. To promote greater accessibility of published microscopic images, the study then discusses the major findings on the core descriptors and the results of mapping them into vocabularies of the Unified Medical Language System (UMLS) through the MetaMap Transfer engine (MMTx) developed by the U.S. National Library of Medicine (NLM).

Background

Image indexing and retrieval in general

Digitized visual images have become popular, but very little attention has been given to the study of image description for retrieval. Several studies using art images sought to frame theoretical backgrounds for image description, and it is not surprising that many recent image information systems are designed with the needs of art historians in mind [1-2]. Discussions of indexable units, image attributes, subject matter, and query types have been popular research topics. However, these are not easily applicable to medical and scientific images: science and medicine report visual information in a standardized way, which is uncommon for art images. In this regard, identifying what to index can be achieved by studying the standard practice of medical literature indexing. For instance, Bidgood (1999) suggests that diagnostic interpretation attributes might be useful in conjunction with procedure description attributes as indexing keys for microscopic images [3]. For radiologic images, Lowe suggests that useful image attributes include image modality, image type, anatomical field of view, major anatomic segments, comparison to an internal norm, cause of finding, and historical data to filter output [4].

The aforementioned text-based image retrieval exhibits several limitations. The contents of the images are interpreted by people and then described in text, and in this description process subjective interpretation can cause inconsistent indexing results. Researchers in machine processing have therefore been working with several of the more tractable, low-level attributes of images, such as color, shape, texture, and spatial similarity. Content-based image retrieval (CBIR) was introduced to use image features such as color, shape, intensity, and distance directly from digitized images [5]. In biomedical imaging, CBIR has been used to develop computerized diagnostic applications that examine distinct histological image features to detect the malignancy level of cancer [6]. As many researchers have recognized, indexing images is neither an easy task nor an economical operation. Increasingly, image indexing studies focus on developing CBIR algorithms and applications to make these methods more effective; however, this is only one piece of the greater puzzle of image description.
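As a concrete illustration of the kind of low-level feature matching CBIR relies on, the sketch below compares two tiny grey-level "images" by histogram intersection. It is a minimal, hypothetical example (the pixel values and bin count are invented for illustration), not a description of any system cited above.

```python
from collections import Counter

def color_histogram(pixels, bins=4):
    # Quantize 8-bit grey values (0-255) into a few coarse bins and count them
    counts = Counter(p * bins // 256 for p in pixels)
    return [counts.get(b, 0) for b in range(bins)]

def histogram_intersection(h1, h2):
    # Similarity in [0, 1]: shared mass of two histograms over equal-sized images
    return sum(min(a, b) for a, b in zip(h1, h2)) / sum(h1)

# Two toy 'images' as flat lists of grey-level pixel values
img_a = [10, 20, 30, 200, 210, 220, 240, 250]
img_b = [15, 25, 35, 45, 205, 215, 225, 235]
similarity = histogram_intersection(color_histogram(img_a), color_histogram(img_b))
```

Real CBIR systems use far richer features (shape, texture, spatial layout), but the principle is the same: images are matched on computed features rather than on human-written descriptions.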

Digital Pathology and Image Description

Increasingly, publishers require the figures accompanying published articles to be submitted in digital format for their archives. Digital pathology, also called virtual microscopy, is accomplished by “creating a digital replica of the content of a glass microscope slide and displaying and manipulating it on the computer, so that it closely emulates looking at a slide with a traditional microscope” [7]. Because microscopic images are often used by pathology researchers, as well as other scientists, to report scientific evidence, searching for digital images will become increasingly important. Traditional glass slide collections are not only breakable and difficult to transport; they can also fade and cannot be shared readily over distance. Digitized slides have become a useful alternative for clinical consultation in rural areas with few specialists [8]. Pathology educators also rate the image quality of digitized slides as equal or superior to that of traditional glass slides [9-11]. Whole-slide images are becoming more popular in digital pathology because they allow an entire slide to be digitized rather than requiring the capture of individual or sequential images for viewing [12-13]. In addition, because digitization can minimize the size of single images, these whole-slide images can be annotated with relevant information. In this way, annotated regions of interest (ROI), along with their descriptions, can be located easily.

Representing archived pathology images has not gained much attention in the biomedical imaging and informatics community; most attention has been directed at indexing textual information, not images. For instance, PubMed does not designate a search field for images in MEDLINE records, even though the images accompanying these records “serve as a valuable source of medical education and clinical decision support” [14]. Furthermore, Medical Subject Headings (MeSH) are assigned to describe the content of an article as a whole rather than providing granular information related specifically to image content. Several articles have discussed the problems confronting microscopic image description. There is no coherent metadata standard for describing something even as elemental as the image-capturing system (e.g., the digital microscope) [12, 13]. Microscopic systems do not usually record clinically relevant information such as histologic grading, cells, genes, or patient follow-up data, and no single, unified storage file format currently exists for such data. Even when metadata do exist, data sharing or submission to a journal requires conversion to a simple two-dimensional image format such as TIFF, leaving the metadata unsupported and lost.

Caption-based indexing and vocabulary mappings

Captions are the textual descriptions accompanying images: concise summaries of the important research findings contained in the figures of published articles [14]. Research findings on the usefulness of caption-based indexing have been mixed. Several studies confirmed that keywords found in captions are an extremely effective way to index and retrieve biomedical journal articles compared with the usual method of searching by title and abstract [15-19]. Hearst et al. found that captions contain important information about experimental methods [20]: for instance, searching for “Western Blot” returned more than a thousand results in a caption search, while very few results were returned when the search was run only against title and abstract text. The usefulness of captions for categorizing biomedical documents has also been reported [16]. A few prototype search engines have been developed to retrieve published figures and tables, but these applications remain at the prototype stage. Despite much optimism among researchers about the potential of captions for image indexing, contradictory findings have also been reported. In an examination of texts associated with images, Yu and colleagues found that 67% of abstract sentences correspond to images in full-text articles [21]. Although these findings do not directly compare caption-based and abstract-based descriptors, they imply that abstract sentences may also be a sufficient source for image indexing.

Caption-based indexing is generated from the author-given words in an article, and those words differ from article to article for the same topical concept. That is why controlled vocabularies such as MeSH are used by professional indexers to maintain consistent indexing results, and why researchers frequently study mappings between author-given words and controlled vocabularies. Sneiderman et al. (2008) discuss a pilot study of a system that automatically indexes biomedical images using terms extracted from dermatology image captions and the portions of the article that pertain to the images [22]. In their usability assessment, the results of the automated extraction system were somewhat disappointing, with only 26% of the exact UMLS matches contained in the caption considered useful for indexing the images. Gay (2005) reported promising results on the usefulness of captions for automatic biomedical literature indexing using the Medical Text Indexer (MTI) [23]. For radiologic images, Kahn (2008) goes further with his caption analysis, using the information contained therein to filter retrieval by age, gender, and image modality in his image database [24]. The findings discussed in this section suggest that microscopic imaging descriptors collected from different texts should be evaluated in order to develop better image search mechanisms. Additionally, no academic studies to date focus on the use of captions as a source of subject indexing terms for pathology images. This study therefore addresses these needs by assessing captions used in published biomedical articles for their utility in subject indexing.

Method

Research questions

The study examined three research questions:

  • RQ1: What are the major topical descriptors for microscopic images generated by human indexers and a computer keyword finder?

  • RQ2: Are there differences between the topical descriptors identified by humans and those identified by a computer keyword finder?

  • RQ3: What are the mapping results of the core descriptors into the vocabularies in the UMLS Metathesaurus?

Caption collection and indexers

Captions associated with microscopic images published in academic papers were identified by searching the National Library of Medicine's (NLM) PubMed Central database using the search statement: Figure[Body - All Words] AND ‘breast neoplasms’[MeSH Major Topic] AND ‘pathology’[Subheading] AND 1997/01/01[PubDate] : 2008/08/31[PubDate]. The search was limited to full-text articles because the study required full access to the articles containing images and captions, as well as the associated MEDLINE records. In addition, “breast neoplasms” was selected as the disease category of interest. A total of 828 records were initially downloaded into EndNote (v.X2), a bibliographic record management program.
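For readers who want to reproduce a search of this kind programmatically, the sketch below composes the study's search statement as an NCBI E-utilities request. This is an assumption-laden sketch: the authors downloaded records into EndNote rather than scripting the search, and the E-utilities interface shown here is simply the standard present-day way to query PubMed Central.

```python
from urllib.parse import urlencode

# NCBI E-utilities esearch endpoint for the PubMed Central ("pmc") database
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Search statement from the paper, reproduced as a single query string
term = (
    'Figure[Body - All Words] AND "breast neoplasms"[MeSH Major Topic] '
    'AND "pathology"[Subheading] '
    'AND 1997/01/01[PubDate] : 2008/08/31[PubDate]'
)
params = {"db": "pmc", "term": term, "retmax": 1000}
url = ESEARCH + "?" + urlencode(params)  # fetching this URL returns matching PMC IDs as XML
```

Fetching `url` (e.g., with `urllib.request.urlopen`) would return the list of matching record identifiers, which could then be retrieved in full via the companion `efetch` endpoint.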

Through manual screening, the study found that 42.75 per cent of the downloaded records (N=354) contained microscopic images eligible for inclusion in the study (records bearing no images, or only images other than microscopic images, were removed). The study also found that a search using different terms, such as Images, would produce a different set of microscopic image-bearing articles. Therefore, this study by no means yielded a comprehensive listing of all articles containing microscopic images; rather, it was intended to collect a set of closely related images for the purpose of caption analysis.

For human indexing, two students enrolled in the School of Library and Information Science, University of Kentucky (Lexington, Kentucky, USA) produced a set of descriptors based solely on the captions. Cross validation was performed to reduce inter-indexer variability, and only descriptors agreed upon by both indexers were included in the study. To obtain machine-generated descriptors, the collected captions were also processed through an online text analysis tool, the Text Analysis Portal for Research (TAPoR)1. The TAPoR module used in this study was KeywordFinder, “which tries to find keywords or key phrases of a source text and recommends them to the user.” The study used only the 20 top-frequency words. The identified core concepts were then processed through the MetaMap Transfer (MMTx)2 engine to map them to the UMLS Metathesaurus.
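To make the automatic-keyword step concrete, the sketch below shows a crude frequency-based keyword finder applied to an invented caption. This is not TAPoR's actual algorithm (which, per the footnote, restricts candidates to nouns and noun phrases); it is a minimal stand-in illustrating the general idea of ranking caption tokens by frequency.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real tool would use a much larger one
STOPWORDS = {"the", "of", "and", "in", "a", "with", "is", "are", "was", "were",
             "to", "for", "by", "at", "on", "an", "as", "or", "show", "shows"}

def top_keywords(caption, n=20):
    # Tokenize to lowercase words, drop stopwords, rank by frequency
    tokens = re.findall(r"[a-z][a-z&-]+", caption.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

# Hypothetical caption in the style of those collected by the study
caption = ("Immunohistochemical staining of ductal carcinoma in situ. "
           "Tumor cells show strong cytoplasmic staining; original magnification 200x. "
           "Haematoxylin and eosin staining of adjacent normal breast tissue.")
keywords = top_keywords(caption, n=5)
```

Note that, exactly as the Results section reports for TAPoR, a purely frequency-based approach surfaces staining terms prominently but cannot reconcile variants (H&E vs. haematoxylin and eosin) or recognize multi-word phrases.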

Results

RQ1: Major topical descriptors for microscopic images by human and computer

The first research question sought to describe the general characteristics of the collected captions. A total of 354 captions were manually copied into individual text files and used for the remaining analyses. The captions came from 54 different journals, and more than 76 per cent of them (N=272) were taken from seven journals: Breast Cancer Research (N=107), American Journal of Pathology (N=40), Journal of Clinical Pathology (N=34), BMC Cancer (N=33), Annals of Surgery (N=33), Proceedings of the National Academy of Sciences USA (N=20), and Neoplasia (N=17). All MeSH subheadings for Breast Neoplasms were identified. Appendix A shows the MeSH terms for the caption source articles ranked by assignment frequency. MeSH terms such as Humans, Female, and Middle Aged occurred frequently, as did genomic- and proteomic-related terms such as Tumor Cells, Cultured and Immunohistochemistry.

Descriptive characteristics of the caption collection are shown in Table 1. Individual captions were small both in file size and in average total words (N=81.71). This meant a fairly small number of words to be read and processed in both human and automatic indexing. Human indexers assigned fewer keywords per caption (AVG=5.41) than TAPoR did (AVG=19.68). Based on the variance of the number of keywords assigned, TAPoR generated a fairly consistent number of keywords throughout the caption collection (AVG=19.68, StdDev=6.57 vs. AVG=5.41, StdDev=3.61). This result supports other research findings of better indexing consistency, in terms of the number of keywords generated, through automatic indexing. Further comparison between human-assigned and TAPoR-generated descriptors follows.

Table 1. Descriptive Characteristics of Caption Collection.

Sum Mean StdDev Median Mode Min Max
#Filesize 195,933 553.48 393.92 475 127 36 2350
#TotalWords 28,927 81.71 60.04 69 54 6 371
#UniqueWords 9,994 28 17 26 31 2 102
#MeSH 5,398 15.25 5.67 15 15 1 38
#TAPoR 6,966 19.68 6.57 22 25 0 48
#HumanInd 1,915 5.41 3.61 5 3 0 19
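The summary columns of Table 1 are standard descriptive statistics and can be reproduced with Python's `statistics` module. The sketch below applies them to an invented list of per-caption keyword counts, not to the study's actual data.

```python
import statistics

def describe(values):
    # Compute the same summary columns used in Table 1 for one variable
    return {
        "Sum": sum(values),
        "Mean": round(statistics.mean(values), 2),
        "StdDev": round(statistics.stdev(values), 2),   # sample standard deviation
        "Median": statistics.median(values),
        "Mode": statistics.mode(values),
        "Min": min(values),
        "Max": max(values),
    }

# Hypothetical per-caption keyword counts for eight captions
keyword_counts = [3, 5, 5, 8, 2, 11, 5, 4]
summary = describe(keyword_counts)
```

A comparison like the paper's would compute this summary once for the human-assigned counts and once for the TAPoR counts, then compare the StdDev values as a consistency measure.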

RQ2: Comparison between human- and computer-assigned descriptors

For the same captions, humans assigned only one-third as many keywords as TAPoR. This implies that human indexing is less productive at identifying a large number of keywords than an automatic keyword generator such as TAPoR. In caption-based indexing, the general topical matter of the most frequently found keyword groups is varied. The groups include disease/diagnostic names (e.g., ductal carcinoma in situ, breast cancer, invasive ductal carcinoma, ductal hyperplasia, etc.), laboratory techniques and procedures (e.g., immunohistochemical staining, haematoxylin, eosin, photomicrograph types, magnification, etc.), and cells and biomarkers (e.g., apoptosis, epithelial cells, cytoplasm, acini, etc.). Appendix B shows the top 30 most frequently assigned topical descriptors. Although not listed in Appendix B, some keywords (e.g., 100×, 200×, .005μm, 40%, 70%, etc.) are part of the quantitative measurements of specific study results. These are unconventional subject headings, not used in any controlled vocabulary, but they can be useful for refining microscopic image retrieval. TAPoR also identified unusual and less meaningful words (e.g., arrows, observed, red, situ, note, sections, strong, etc.) that would not otherwise be assigned as MeSH (or any subject) headings for article descriptions. Additionally, neither the human indexers nor TAPoR reconciled word variants such as acronyms (e.g., H&E vs. hematoxylin and eosin, ductal carcinoma in situ vs. DCIS, etc.) or plural forms (e.g., cell vs. cells, lobule vs. lobules vs. lobe, etc.). Phrases were more frequently identified in human keywords (e.g., ductal carcinoma in situ vs. ductal carcinoma; in situ, etc.).
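The variant-reconciliation problem noted above (acronyms, spellings, plurals) is straightforward to sketch in code. The acronym map, spelling map, and plural rule below are all hypothetical illustrations, not part of the study's method.

```python
# Illustrative maps only; a real system would draw these from a lexicon such as UMLS
ACRONYMS = {"dcis": "ductal carcinoma in situ", "h&e": "hematoxylin and eosin"}
SPELLINGS = {"tumour": "tumor", "haematoxylin": "hematoxylin"}

def normalize(term):
    # Lowercase, expand acronyms, unify spellings, then naively de-pluralize
    term = ACRONYMS.get(term.lower().strip(), term.lower().strip())
    words = [SPELLINGS.get(w, w) for w in term.split()]
    # naive rule: strip a trailing 's' from words of 4+ letters (not '-ss' words)
    words = [w[:-1] if w.endswith("s") and len(w) > 3 and not w.endswith("ss") else w
             for w in words]
    return " ".join(words)

# Hypothetical descriptor sets in the style of Appendix B
human = {"DCIS", "tumour cells", "Immunohistochemical staining"}
tapor = {"ductal carcinoma in situ", "tumor cells", "immunohistochemical"}
overlap = {normalize(h) for h in human} & {normalize(t) for t in tapor}
```

Even this crude normalization recovers matches (DCIS with ductal carcinoma in situ, tumour cells with tumor cells) that raw string comparison would miss, which is precisely the gap the paper observes in both indexing methods.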

RQ 3: Mapping the core descriptors into the UMLS Metathesaurus

The study formed 79 core descriptors by combining the top 20 human-assigned descriptors and the top 20 TAPoR-assigned descriptors and eliminating duplicates. The number of duplicates indicates that automatically generated keywords potentially produce many false hits in retrieval. It is therefore imperative to map or validate automatically generated terms against controlled vocabularies, such as the UMLS Metathesaurus, to improve retrieval.
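The merge-and-deduplicate step can be sketched as follows; the two short ranked lists are invented examples drawn from the style of Appendix B, not the study's full top-20 lists.

```python
def core_descriptors(human_top, tapor_top):
    # Merge two ranked keyword lists, keeping the first occurrence of each term
    # (case-insensitive comparison, original casing preserved)
    seen, merged = set(), []
    for term in human_top + tapor_top:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            merged.append(term)
    return merged

# Hypothetical abbreviated ranked lists
human_top = ["ductal carcinoma in situ", "Immunohistochemical staining", "breast cancer"]
tapor_top = ["ductal carcinoma", "original magnification", "Breast cancer"]
core = core_descriptors(human_top, tapor_top)
```

Here "breast cancer" appears in both lists and is kept only once, mirroring how the study's combined lists shrank to 79 unique core descriptors after duplicate elimination.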

The study mapped both the full sets of human- and TAPoR-assigned keywords and the set of 79 core descriptors to the UMLS Metathesaurus. For the full sets, human-assigned descriptors outperformed TAPoR-assigned keywords. Approximately 41% of the total descriptors assigned by humans (N=533) were “fully matched,” while only 35.49% of TAPoR keywords were “full matches.” Human-generated keywords also outperformed TAPoR keywords on “partial match” (human=48.44% vs. TAPoR=38.66%) and “mismatch” (human=10.97% vs. TAPoR=25.85%). This implies that human-assigned keywords are better source terms than automatically generated keywords for mapping to the UMLS vocabularies through MMTx. Several numerical descriptions, such as 100×, 200mb, 30%, etc., failed to map to the Metathesaurus. Because numerical notations for quantitative assessment are frequently reported in scientific observations, supporting them is a challenge the Metathesaurus must meet to improve Meta Mapping.

Of the 79 core descriptors, 45 (56.96%) were “fully matched” (F), 25 (31.65%) were “partially matched” (P), and nine (11.39%) were “not found” (M) in the UMLS. The study also compared the mapping results of the human- and TAPoR-assigned core descriptors. Appendix C shows the selected core descriptors in the first column (Core Descriptors) mapped to the UMLS Metathesaurus in the next column (UMLS Mapping). The third column (Match) lists matching results based on the mapped words in the second column: F refers to Full Match, P to Partial Match, and M to Mismatch. The frequency with which each core descriptor appeared in both keyword sets is listed in the last column (Freq). The most frequently identified descriptor, tumor cells (N=61), was mapped to Unspecified tumor cell NOS, shown in parentheses; its semantic type, Cell, is enclosed in square brackets in the second column.

This study identified several useful types of terms in the core descriptor set. Histologic diagnostic terms such as ductal carcinoma in situ, infiltrating ductal carcinoma, invasive ductal carcinoma, atypical ductal hyperplasia, lobular carcinoma in situ, etc., can be helpful in identifying microscopic images in academic papers since histology relies mainly on visual findings. Descriptors such as expression, endothelial cells, myoepithelial cells, stromal cells, and cytoplasm are useful in indicating cellular components expressed in microscopic images. Terms such as normal, abnormal, invasive, infiltrating, positive, and negative are used for qualitative assessments, while terms such as immunohistochemistry and haematoxylin and eosin relate to laboratory procedures and techniques, especially histologic staining. In addition to anatomical vocabularies, the language of clinical therapies and disease processes proved highly relevant to clinical vocabularies in general.

Discussion

The study assessed some general characteristics of the collected captions for use as microscopic image descriptors (RQ1), compared the topical descriptors assigned by human indexers with those assigned by an automatic indexing engine (RQ2), and identified a list of core descriptors and their mapping results in the UMLS Metathesaurus (RQ3). First, the findings suggest that caption-based descriptors can complement title- or abstract-based literature indexing for figure image retrieval. Identified keywords such as immunohistochemistry, H&E staining, magnification, and 100× describe laboratory procedures and methods that are not accessible through the MeSH descriptors assigned to the articles. This finding is supported by several works that implemented prototype systems for searching captions and figures to locate biomedical images published in open access journals [20]. The finding is important because captions are granular information specific to figures, delivering the core context of scientific evidence in an abridged format. With the growth of full-text retrieval, the availability of caption searching for figure images will improve topical access to published biomedical images.

Second, humans assigned fewer keywords than TAPoR but produced better matches to the UMLS Metathesaurus. This result corresponds to a previous study, in which Sneiderman et al. found that search words provided by humans were effective in matching concepts in the UMLS Metathesaurus, with 98% exact matching, whereas only 26% of the exact UMLS matches contained in the caption were considered useful for indexing the matched image [22]. This finding implies that terms generated by human indexers for image indexing have a high probability of matching the Meta concepts present in the texts that reference those images, while current automated algorithms generate many matches not useful for indexing. It would be important to discover whether experienced human indexers provide different indexing results from inexperienced indexers.

Third, the study found that several dimensions of the imaging descriptors in the caption-based keywords could be implemented as search filters for improved image retrieval. Semantic types identified through Meta Mapping are highly likely to be useful in filtering image retrieval results. Such types include Indicator, Reagent, or Diagnostic Aid; Organic Chemical; Laboratory Procedure; Spatial Concept; Qualitative Concept; and Quantitative Concept. Furthermore, combining such filters with the MeSH terms describing the overall subject matter of an article should improve the relevance of retrieved microscopic images. With respect to forming a metadata framework for online microscopic image description, the semantic types can be used as a core metadata set. In this regard, the finding can inform a standardized microscopic image description protocol for training medical students.
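A semantic-type filter of the kind proposed above can be sketched as a simple lookup. The descriptor-to-type pairs below are taken from Appendix C (simplified to one type each); the filter logic itself is an illustrative assumption, not the study's implementation.

```python
# Mapped descriptors paired with a UMLS semantic type (simplified from Appendix C)
MAPPED = {
    "immunohistochemistry": "Laboratory Procedure",
    "haematoxylin": "Indicator, Reagent, or Diagnostic Aid",
    "breast cancer": "Neoplastic Process",
    "normal": "Qualitative Concept",
    "cytoplasm": "Cell Component",
}

def filter_by_semantic_type(descriptors, wanted_types):
    # Keep only descriptors whose UMLS semantic type is in the wanted set
    return [d for d in descriptors if MAPPED.get(d) in wanted_types]

# e.g., restrict a result set to staining/assessment terms
hits = filter_by_semantic_type(MAPPED, {"Laboratory Procedure", "Qualitative Concept"})
```

In a retrieval system, such a filter would sit on top of caption search: the user's keyword query returns candidate images, and semantic types then narrow the results to, say, laboratory procedures or qualitative assessments.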

Fourth, biomedical informaticians as well as medical librarians will benefit from this study by being able to identify core search keywords for microscopic image retrieval in published academic papers. System librarians, text-mining researchers, and imaging specialists can also use the study findings to improve their understanding of biomedical imaging descriptions for more efficient system design. End users will be able to use the core descriptors identified in this study to facilitate PubMed or GoogleImage searches. As shown by previous studies on the effectiveness of mapping, the MMTx engine can also enhance searching [23, 24].

However, the study also has some limitations. The caption collection was limited to breast neoplasms as the disease of interest, so the findings may not apply to other disease categories, and different search strategies might yield different caption collections. As previously noted, the human indexing was performed by non-expert indexers. In future studies, the research team plans to compare trained and untrained indexers; we will also formally assess retrieval effectiveness and relevance by measuring similarities among images retrieved by different imaging descriptors. Finally, TAPoR was the only text analysis tool used to generate keywords from the caption collection; using multiple lexical analysis tools could have minimized algorithm-specific variation.

Conclusion

To provide greater accessibility to non-textual information such as images, the textual descriptions associated with visual findings need to be carefully studied. The findings suggest that caption-based descriptors can contribute to systematic access to published images. Additionally, this study aids understanding of the bibliographic control of visual materials for better retrieval by providing a semantic mapping mechanism. Future research should examine the effectiveness of diverse description sources for the relevance of image search results. In the field of information organization and retrieval, methods and tools for non-textual information retrieval have garnered increasing attention as the digital world expands. It is incumbent upon libraries and other information agencies to promote and maintain an interest in the opportunities and challenges associated with biomedical imaging.

Key Messages.

Implications for Policy:

  • Caption-based descriptors produced several aspects of microscopic image descriptions which could be used as complementary sources for title or abstract-based literature indexing.

  • Terms suggested by human indexers for image indexing have a high probability of matching the Meta concepts present in the texts that reference those images, while the automated algorithm generated many matches that were not useful for indexing.

  • Semantic types identified through Meta Mapping are highly likely to be useful in filtering image retrieval results. Such types include Indicator, Reagent, or Diagnostic Aid; Organic Chemical; Laboratory Procedure; Spatial Concept; Qualitative Concept; and Quantitative Concept.

Implications for Practice:

  • Medical librarians as well as biomedical informaticians will benefit from this study in terms of identifying a core set of search keywords for microscopic image retrieval in published academic papers.

  • System librarians, text-mining researchers, and imaging specialists can use the study findings to improve their understanding of biomedical imaging description for a more efficient system design.

  • End-users may be able to use the core descriptors identified in this study to facilitate PubMed or GoogleImage searches.

Acknowledgments

Grant Acknowledgement: This project was supported in part by a grant (RE-04-08-0069-08) from the Institute of Museum and Library Services (IMLS). In addition, this publication was made possible by Grant Number P20RR-16481 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH).

Appendix A: MeSHs for Caption Source Articles by Frequency Rank.

MeSH Freq MeSH Freq
Humans 352 Cell Proliferation 17
Female 287 Neoplasm Metastasis 17
Middle Aged 108 Reverse Transcriptase Polymerase Chain Reaction 17
Adult 85 Blotting, Western 16
Aged 83 Breast Neoplasms/ metabolism/pathology 16
Tumor Cells, Cultured 73 Gene Expression 16
Animals 54 Prospective Studies 16
Immunohistochemistry 49 Signal Transduction 16
Aged, 80 and over 47 Treatment Outcome 16
Mice 47 Axilla 15
Neoplasm Invasiveness 47 Phenotype 15
Breast Neoplasms/ pathology 45 Predictive Value of Tests 15
Cell Line, Tumor 45 Oligonucleotide Array Sequence Analysis 14
Prognosis 43 Breast Neoplasms/ genetics/ pathology 13
Breast Neoplasms/ genetics/pathology 36 Cohort Studies 13
Lymphatic Metastasis 31 Diagnosis, Differential 13
Survival Analysis 30 Neoplasm Transplantation 13
Male 29 Risk Factors 13
Neoplasm Staging 29 Transplantation, Heterologous 13
Gene Expression Regulation, Neoplastic 26 Tumor Markers, Biological/ analysis 13
Disease Progression 21 Breast Neoplasms/genetics/ metabolism/pathology 12
Follow-Up Studies 21 Time Factors 12
Retrospective Studies 21 Transcription, Genetic 12
Transfection 21 Tumor Markers, Biological/ metabolism 12
Gene Expression Profiling 20 Drug Resistance, Neoplasm 11
Sensitivity and Specificity 20 Hyperplasia 11
Mice, Nude 19
Cell Division 17

Appendix B: Top 30 Most Frequently Assigned Topical Descriptors.

Rank Human-assigned descriptors Freq TAPoR-generated descriptors Freq
1 ductal carcinoma in situ 24 ductal carcinoma 43
2 Immunohistochemical staining 24 original magnification 42
3 breast cancer 21 breast cancer 35
4 Invasive ductal carcinoma 21 normal breast 35
5 DCIS 19 Expression 26
6 tumor cells 18 magnification 25
7 immunohistochemistry 17 Cells 24
8 apoptosis 14 invasive ductal carcinoma 23
9 Epithelial cells 13 immunohistochemical 21
10 cytoplasm 12 epithelial cells 20
11 Immunostaining 12 breast carcinoma 19
12 apoptotic cells 10 haematoxylin 19
13 breast carcinoma 10 invasive ductal 19
14 ductal hyperplasia 10 tumor cells 19
15 Photomicrograph 9 Arrows 18
16 stromal cells 9 Eosin 18
17 fibroblasts 8 mda mb 231 18
18 Immunohistochemical analysis 8 Note 18
19 invasive breast cancer 8 Situ 18
20 tumour cells 8 carcinoma cells 16
21 hematoxylin 7 mda-mb-231 16
22 human breast 7 Negative 16
23 infiltrating ductal carcinoma 7 tumour cells 16
24 myoepithelial cells 7 Cytoplasm 15
25 tumour 7 h&e 15
26 acini 6 Positive 15
27 atypical ductal hyperplasia 6 Arrow 14
28 breast epithelium 6 Panel 14
29 cancer cells 6 breast tissue 13
30 cytoplasmic staining 6 cancer cells 13

Appendix C: Selected Core Descriptors Mapped into UMLS Vocabularies through MMTx.

Core Descriptors | UMLS Mapping | Match | Freq
tumor cells | Tumor cells ([M]Unspecified tumor cell NOS) [Cell] | F | 61
breast cancer | Breast Cancer (Breast Carcinoma) [Neoplastic Process]; Breast Cancer (Malignant neoplasm of breast) [Neoplastic Process] | F | 56
ductal carcinoma | ductal carcinoma (Ductal Breast Carcinoma) [Neoplastic Process]; Ductal Carcinoma [Neoplastic Process] | F | 49
invasive ductal carcinoma | Invasive Ductal Carcinoma (Carcinoma, Ductal, Breast) [Neoplastic Process] | F | 44
original magnification | Original [Idea or Concept] | P | 42
normal breast | Normal [Qualitative Concept]; Breast [Body Part, Organ, or Organ Component]; Normal [Qualitative Concept]; Breast (Entire breast) [Body Part, Organ, or Organ Component] | P | 35
breast carcinoma | Breast Carcinoma [Neoplastic Process] | F | 34
mda mb 231 | MDA [Organic Chemical, Pharmacologic Substance]; mb (Megabase) [Quantitative Concept] | P | 34
epithelial cells | Epithelial Cells [Cell] | F | 33
arrow | Arrow [Manufactured Object] | F | 32
DCIS | DCIS (Carcinoma, Intraductal) [Neoplastic Process] | F | 31
immunohistochemistry | Immunohistochemistry [Laboratory Procedure] | F | 29
cytoplasm | Cytoplasm [Cell Component] | F | 27
expression | Expression (Expression procedure) [Therapeutic or Preventive Procedure]; Expression (Gene Expression) [Genetic Function] | F | 26
magnification | Meta Mappings: <none> | M | 25
cells | Cells [Cell] | F | 24
ductal carcinoma in situ | Ductal Carcinoma In Situ (Carcinoma, Intraductal) [Neoplastic Process] | F | 24
haematoxylin | Haematoxylin (Hematoxylin) [Indicator, Reagent, or Diagnostic Aid, Organic Chemical] | F | 24
Immunohistochemical staining | Immunohistochemical [Laboratory Procedure]; Staining (Staining method) [Laboratory Procedure] | P | 24
immunohistochemical | Immunohistochemical [Laboratory Procedure] | F | 21

Footnotes

1

TAPoR is also a research project consisting of six leading humanities computing centers in Canada. More detailed information about individual projects and products can be found at http://tada.mcmaster.ca/Main/TAPoRwareKeywordsFinder. Only nouns and noun phrases are considered keywords; verbs, adverbs, and other parts of speech are excluded. In the result page, TAPoR “lists 20 top frequency words, 10 top frequency word pairs and word triplets respectively.”
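The frequency counting TAPoR performs over caption text can be sketched as below. This is an illustrative simplification, not TAPoR's implementation: instead of part-of-speech tagging to keep only nouns and noun phrases, it drops any n-gram containing a common function word, and the caption string is invented for the example.

```python
from collections import Counter
import re

# Crude stand-in for noun/noun-phrase filtering via POS tagging.
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "with", "and", "is", "are"})

def top_ngrams(text, n, k=20):
    """Return the k most frequent n-grams, skipping any containing a stopword."""
    tokens = re.findall(r"[a-z0-9&-]+", text.lower())
    grams = [
        " ".join(tokens[i:i + n])
        for i in range(len(tokens) - n + 1)
        if not any(t in STOPWORDS for t in tokens[i:i + n])
    ]
    return Counter(grams).most_common(k)

# Hypothetical caption text, for illustration only.
caption = ("Invasive ductal carcinoma of the breast. "
           "Immunohistochemical staining of tumor cells; "
           "tumor cells are positive. Original magnification x200.")
print(top_ngrams(caption, 1, k=5))  # top single words
print(top_ngrams(caption, 2, k=5))  # top word pairs, e.g. ('tumor cells', 2)
```

Running the same counter with n=1, 2, and 3 mirrors TAPoR's words, word pairs, and word triplets lists.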

2

Built on natural language processing and computational linguistic techniques, MMTx uses the UMLS Metathesaurus provided by the NLM to map user-supplied terminology to controlled vocabularies [14]. Because the Metathesaurus includes MeSH, MMTx returns MeSH descriptors among its mappings; the study did not attempt to distinguish MeSH descriptors from the other vocabularies included in the UMLS. The study customized MMTx to run from the Windows command line.
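The full (F), partial (P), and no-mapping (M) outcomes reported in Appendix C can be illustrated with a toy sketch. This is not MMTx (a Java tool backed by the entire Metathesaurus, with variant generation and candidate scoring); a small hand-made lexicon stands in for the UMLS here, and the fallback to word-by-word lookup is a deliberate simplification.

```python
# Tiny stand-in for the UMLS Metathesaurus: phrase -> (concept, semantic type).
LEXICON = {
    "breast carcinoma": ("Breast Carcinoma", "Neoplastic Process"),
    "epithelial cells": ("Epithelial Cells", "Cell"),
    "normal": ("Normal", "Qualitative Concept"),
    "breast": ("Breast", "Body Part, Organ, or Organ Component"),
}

def map_term(term):
    """Map a descriptor to concepts: 'F' if the whole phrase matches,
    'P' if only sub-parts match, 'M' if nothing matches."""
    term = term.lower()
    if term in LEXICON:
        return "F", [LEXICON[term]]
    # Word-by-word fallback (a crude stand-in for MMTx's candidate retrieval).
    hits = [LEXICON[w] for w in term.split() if w in LEXICON]
    return ("P", hits) if hits else ("M", [])

print(map_term("breast carcinoma"))  # full match -> F
print(map_term("normal breast"))     # each word maps separately -> P
print(map_term("magnification"))     # nothing found -> M
```

These three cases correspond to rows in Appendix C such as "breast carcinoma" (F), "normal breast" (P), and "magnification" (M).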

No conflicts of interest have been declared.

Contributor Information

Sujin Kim, Email: sujinkim@uky.edu, School of Library and Information Science and Department of Pathology and Laboratory Medicine, 339 Lucille Little Building, University of Kentucky, Lexington, KY 40506-0224.

Shannon Lamkin, Email: shannonlamkin@yahoo.com, University of Kentucky Libraries, Lexington, KY 40506.

Pam Duncan, Email: psdunc2@uky.edu, University of Kentucky, Lexington, KY 40506-0224.

References

1. Jorgensen C. Indexing images: testing an image description template. Proceedings of the 59th Annual Meeting of the American Society for Information Science; 1996. pp. 209–213.
2. Panofsky E. Meaning in the Visual Arts. Penguin; London: 1993.
3. Bidgood WD, et al. Image acquisition context: procedure description attributes for clinically relevant indexing and selective retrieval of biomedical images. JAMIA. 1999;6(1):61–75. doi: 10.1136/jamia.1999.0060061.
4. Lowe HJ, et al. Towards knowledge-based retrieval of medical images: the role of semantic indexing, image content representation and knowledge-based retrieval. Proceedings/AMIA Annual Fall Symposium; 1998. pp. 882–886.
5. Kim S, Jeong HK, Choi HJ, Kim D. Automatic histologic grading for lobular carcinoma in situ. World Congress on Biomedical Physics and Medical Engineering; Munich, Germany; 7–13 September 2009. To be indexed in the Springer Proceedings.
6. Rasmussen EM. Indexing images. Annual Review of Information Science and Technology. 1997;32:169–196.
7. Dee FR. Virtual microscopy for comparative pathology. Toxicologic Pathology. 2006;34(7):966–7. doi: 10.1080/01926230601123062.
8. Li XX, et al. A feasibility study of virtual slides in surgical pathology in China. Human Pathology. 2007;38(12):1842–1848. doi: 10.1016/j.humpath.2007.04.019.
9. Leong FJ, Leong AS. Digital imaging applications in anatomic pathology. Advances in Anatomic Pathology. 2003;10(2):88–95. doi: 10.1097/00125480-200303000-00003.
10. Pritt BS, Gibson PC, Cooper K. Digital imaging guidelines for pathology: a proposal for general and academic use. Advances in Anatomic Pathology. 2003;10(2):96–100. doi: 10.1097/00125480-200303000-00004.
11. Montalto MC. Pathology RE-imagined: the history of digital radiology and the future of anatomic pathology. Archives of Pathology and Laboratory Medicine. 2008;132(5):764–5. doi: 10.5858/2008-132-764-PRTHOD.
12. Ho J, et al. Use of whole slide imaging in surgical pathology quality assurance: design and pilot validation studies. Human Pathology. 2006;37(3):322–31. doi: 10.1016/j.humpath.2005.11.005.
13. Yagi Y, Gilbertson JR. Digital imaging in pathology: the case for standardization. Journal of Telemedicine and Telecare. 2005;11(3):109–16. doi: 10.1258/1357633053688705.
14. Kahn CE, Thao C. GoldMiner: a radiology image search engine. American Journal of Roentgenology. 2007;188:1475–78. doi: 10.2214/AJR.06.1740.
15. Yeh AS, Hirschman L, Morgan AA. Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. Bioinformatics. 2003;19:i331–9. doi: 10.1093/bioinformatics/btg1046.
16. Shatkay H, Chen N, Blostein D. Integrating image data into biomedical text categorization. Bioinformatics. 2006;22(14):e446–53. doi: 10.1093/bioinformatics/btl235.
17. Choi Y, Rasmussen EM. Users' relevance criteria in image retrieval in American history. Information Processing and Management. 2002;38:695–726.
18. Xu S, McCusker J, Krauthammer M. Yale Image Finder (YIF): a new search engine for retrieving biomedical images. Bioinformatics. 2008;24(17):1968. doi: 10.1093/bioinformatics/btn340.
19. Hua J, et al. Identifying fluorescence microscope images in online journal articles using both image and text features. Proceedings of the 2007 IEEE International Symposium on Biomedical Imaging; 2007. pp. 1224–1227.
20. Hearst MA, Divoli A, Guturu H, Ksikes A, Nakov P, Wooldridge MA, Ye J. BioText Search Engine: beyond abstract search. Bioinformatics. 2007;23(16):2196–2197. doi: 10.1093/bioinformatics/btm301.
21. Yu H. Towards answering biological questions with experimental evidence: automatically identifying text that summarizes image content in full-text articles. AMIA Annu Symp Proc; 2006. pp. 834–838.
22. Sneiderman CA, et al. UMLS-based automatic image indexing. AMIA Annual Symposium Proceedings; 2008 [cited 20 February 2009]. Available from: http://archive.nlm.nih.gov/pubs/pubPDFs/Sneiderman_et_al_AMIA_2008.pdf.
23. Gay CW, Kayaalp M, Aronson AR. Semi-automatic indexing of full text biomedical articles. AMIA Annual Symposium Proceedings; 2005 [cited 20 February 2009]. Available from: http://ii.nlm.nih.gov/resources/amia05.fulltext.w.footer.pdf.
24. Kahn CE. Effective metadata discovery for dynamic filtering of queries to a radiology image search engine. Journal of Digital Imaging. 2008;21(3):269–73. doi: 10.1007/s10278-007-9036-5.
