Author manuscript; available in PMC: 2011 Sep 1.
Published in final edited form as: Health Info Libr J. 2010 Sep;27(3):235–243. doi: 10.1111/j.1471-1842.2010.00897.x

Caption-based topical descriptors for microscopic images of breast neoplasms as published in academic papers

Sujin Kim 1, Shannon Lamkin 2, Pam Duncan 3
PMCID: PMC2933792  NIHMSID: NIHMS227102  PMID: 20712718

Introduction

With the advancement of biomedical imaging technologies, the visual findings of scientific discoveries are increasingly available through published academic papers. As is evident from most biomedical publishers, the figures accompanying articles are stored separately in literature databases. However, the stored images are not easily retrievable because they are not indexed with the descriptors necessary to facilitate retrieval. Compared with conventional indexing and abstracting services, which exist mostly for text-based information retrieval, very little attention has been given to image retrieval in medicine. For instance, there is no way in the current PubMed system to retrieve breast cancer images of middle-aged women scanned from immunohistochemistry staining slides; one can only gain access to published images via hyperlinks within the main document. More importantly, the visual findings in papers are usually described in accompanying texts such as captions. Captions often contain the detailed experimental procedures and results corresponding to a specific image, so a reader who wants to grasp the core findings reported in an image should not overlook them. At present, however, there is only limited support for searching published images by the words in their captions.

Pathology is among the most image-intensive biomedical disciplines. A growing number of digitized pathology images are used for clinical diagnosis, scientific research, and biomedical education. Digital archiving of scanned pathologic images allows them to be stored without risk of breakage, fading, or scratching. However, the archived images are difficult to search because of inadequate descriptions: they are normally labeled only with a sequential file name or an abbreviated diagnostic name, such as liver01.jpeg. Moreover, no comprehensive or standard vocabulary specific to image retrieval is applied to published articles in online literature databases such as PubMed Central.

As the basic link between an image and the content of the published work, captions can be the best source of topical descriptors for the non-text information contained in a scholarly paper. However, only a limited number of studies have assessed topical descriptors extracted from the captions attached to biomedical images, and none has attempted to characterize keywords from microscopic image captions. As a preliminary step toward understanding the topical description of published microscopic images, the current study assessed the major characteristics of a set of topical descriptors produced by an automatic keyword finder and by human indexers from selected captions downloaded from the PubMed Central database. To promote greater accessibility of published microscopic images, the study then discusses the major findings on the core descriptors and the results of mapping them into vocabularies of the Unified Medical Language System (UMLS) through the MetaMap Transfer engine (MMTx) developed by the U.S. National Library of Medicine (NLM).

Background

Image indexing and retrieval in general

Digitized visual images have become popular, but very little attention has been given to the study of image description for retrieval. Several studies using art images sought to frame theoretical backgrounds for image description, and it is not surprising that many recent image information systems are designed with the needs of art historians in mind [1-2]. Discussions of indexable units, image attributes, subject matter, and query types have been popular research topics. However, these are not easily applicable to medical and scientific images: science and medicine report visual information in a standardized way, which is uncommon for art images. In this regard, identifying what to index can be achieved by studying the standard practice of medical literature indexing. For instance, Bidgood (1999) suggests that diagnostic interpretation attributes might be useful in conjunction with procedure description attributes as indexing keys for microscopic images [3]. For radiologic images, Lowe suggests that useful image attributes include image modality, image type, anatomical field of view, major anatomic segments, comparison to an internal norm, cause of finding, and historical data to filter output [4].

The aforementioned text-based image retrieval exhibits several limitations. The contents of the images are interpreted by people and then described in text, and in this description process subjective interpretation can cause inconsistent indexing results. Researchers in machine processing have therefore been working with several of the more tractable, low-level attributes of images, such as color, shape, texture, and spatial similarity. Content-based image retrieval (CBIR) was introduced to use image features such as color, shape, intensity, and distance directly from digitized images [5]. In biomedical imaging, CBIR has been used to develop computerized diagnostic applications that examine distinct histological image features to detect the malignancy level of cancer [6]. As many researchers have recognized, indexing images is neither an easy task nor an economical operation. Increasingly, image indexing studies focus on developing CBIR algorithms and applications to make these methods more effective; however, this is only one piece of the greater puzzle of image description.
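As a concrete illustration of the kind of low-level feature matching CBIR relies on, the sketch below compares two tiny grey-level "images" by histogram intersection. It is a minimal, hypothetical example (the pixel values and bin count are invented for illustration), not a description of any system cited above.

```python
from collections import Counter

def color_histogram(pixels, bins=4):
    # Quantize 8-bit grey values (0-255) into a few coarse bins and count them
    counts = Counter(p * bins // 256 for p in pixels)
    return [counts.get(b, 0) for b in range(bins)]

def histogram_intersection(h1, h2):
    # Similarity in [0, 1]: shared mass of two histograms over equal-sized images
    return sum(min(a, b) for a, b in zip(h1, h2)) / sum(h1)

# Two toy 'images' as flat lists of grey-level pixel values
img_a = [10, 20, 30, 200, 210, 220, 240, 250]
img_b = [15, 25, 35, 45, 205, 215, 225, 235]
similarity = histogram_intersection(color_histogram(img_a), color_histogram(img_b))
```

Real CBIR systems use far richer features (shape, texture, spatial layout), but the principle is the same: images are matched on computed features rather than on human-written descriptions.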

Digital Pathology and Image Description

Increasingly, publishers require the figures accompanying published articles to be submitted in digital format for their archives. Digital pathology, also called virtual microscopy, is accomplished by “creating a digital replica of the content of a glass microscope slide and displaying and manipulating it on the computer, so that it closely emulates looking at a slide with a traditional microscope” [7]. Because microscopic images are often used by pathology researchers, as well as other scientists, to report scientific evidence, searching for digital images will become increasingly important. Traditional glass slide collections are not only breakable and difficult to transport; they can also fade and cannot be shared readily over distance. Digitized slides have become a useful alternative for clinical consultation in rural areas with few specialists [8]. Pathology educators also rate the image quality of digitized slides as equal or superior to that of traditional glass slides [9-11]. Whole-slide images are becoming more popular in digital pathology because they allow an entire slide to be digitized rather than requiring the capture of individual or sequential images for viewing [12-13]. In addition, because digitization can minimize the size of single images, these whole-slide images can be annotated with relevant information. In this way, annotated regions of interest (ROI), along with their descriptions, can be located easily.

Representing archived pathology images has not gained much attention in the biomedical imaging and informatics community; most attention has been directed at indexing textual information, not images. For instance, PubMed does not designate a search field for images in MEDLINE records, even though the images accompanying these records “serve as a valuable source of medical education and clinical decision support” [14]. Furthermore, Medical Subject Headings (MeSH) are assigned to describe the content of an article as a whole rather than providing granular information related specifically to image content. Several articles have discussed the problems confronting microscopic image description. There is no coherent metadata standard for describing something even as elemental as the image-capturing system (e.g., the digital microscope) [12, 13]. Microscopic systems do not usually record clinically relevant information such as histologic grading, cells, genes, or patient follow-up data, and no single, unified storage file format currently exists for such data. Even when metadata do exist, data sharing or submission to a journal requires conversion to a simple two-dimensional image format such as TIFF, leaving the metadata unsupported and lost.

Caption-based indexing and vocabulary mappings

Captions are the textual descriptions accompanying images: concise summaries of the important research findings contained in the figures of published articles [14]. Research findings on the usefulness of caption-based indexing have been mixed. Several studies confirmed that keywords found in captions are an extremely effective way to index and retrieve biomedical journal articles compared with the usual method of searching by title and abstract [15-19]. Hearst et al. found that captions contain important information about experimental methods [20]: for instance, searching for “Western Blot” returned more than a thousand results in a caption search, while very few results were returned when the search was run only against title and abstract text. The usefulness of captions for categorizing biomedical documents has also been reported [16]. A few prototype search engines have been developed to retrieve published figures and tables, but these applications remain at the prototype stage. Despite much optimism among researchers about the potential of captions for image indexing, contradictory findings have also been reported. In an examination of texts associated with images, Yu and colleagues found that 67% of abstract sentences correspond to images in full-text articles [21]. Although these findings do not directly compare caption-based and abstract-based descriptors, they imply that abstract sentences may also be a sufficient source for image indexing.

Caption-based indexing is generated from the author-given words in an article, and those words differ from article to article for the same topical concept. That is why controlled vocabularies such as MeSH are used by professional indexers to maintain consistent indexing results, and why researchers frequently study mappings between author-given words and controlled vocabularies. Sneiderman et al. (2008) discuss a pilot study of a system that automatically indexes biomedical images using terms extracted from dermatology image captions and the portions of the article that pertain to the images [22]. In their usability assessment, the results of the automated extraction system were somewhat disappointing, with only 26% of the exact UMLS matches contained in the caption considered useful for indexing the images. Gay (2005) reported promising results on the usefulness of captions for automatic biomedical literature indexing using the Medical Text Indexer (MTI) [23]. For radiologic images, Kahn (2008) goes further with his caption analysis, using the information contained therein to filter retrieval by age, gender, and image modality in his image database [24]. The findings discussed in this section suggest that microscopic imaging descriptors collected from different texts should be evaluated in order to develop better image search mechanisms. Additionally, no academic studies to date focus on the use of captions as a source of subject indexing terms for pathology images. This study therefore addresses these needs by assessing captions used in published biomedical articles for their utility in subject indexing.

Method

Research questions

The study examined three research questions:

  • RQ1: What are the major topical descriptors for microscopic images generated by human indexers and a computer keyword finder?

  • RQ2: Are there differences between the topical descriptors identified by humans and those identified by a computer keyword finder?

  • RQ3: What are the mapping results of the core descriptors into the vocabularies in the UMLS Metathesaurus?

Caption collection and indexers

Captions associated with microscopic images published in academic papers were identified by searching the National Library of Medicine's (NLM) PubMed Central database using the search statement: Figure[Body - All Words] AND ‘breast neoplasms’[MeSH Major Topic] AND ‘pathology’[Subheading] AND 1997/01/01[PubDate] : 2008/08/31[PubDate]. The search was limited to full-text articles because the study required full access to the articles containing images and captions, as well as the associated MEDLINE records. In addition, “breast neoplasms” was selected as the disease category of interest. A total of 828 records were initially downloaded into EndNote (v.X2), a bibliographic record management program.
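For readers who want to reproduce a search of this kind programmatically, the sketch below composes the study's search statement as an NCBI E-utilities request. This is an assumption-laden sketch: the authors downloaded records into EndNote rather than scripting the search, and the E-utilities interface shown here is simply the standard present-day way to query PubMed Central.

```python
from urllib.parse import urlencode

# NCBI E-utilities esearch endpoint for the PubMed Central ("pmc") database
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Search statement from the paper, reproduced as a single query string
term = (
    'Figure[Body - All Words] AND "breast neoplasms"[MeSH Major Topic] '
    'AND "pathology"[Subheading] '
    'AND 1997/01/01[PubDate] : 2008/08/31[PubDate]'
)
params = {"db": "pmc", "term": term, "retmax": 1000}
url = ESEARCH + "?" + urlencode(params)  # fetching this URL returns matching PMC IDs as XML
```

Fetching `url` (e.g., with `urllib.request.urlopen`) would return the list of matching record identifiers, which could then be retrieved in full via the companion `efetch` endpoint.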

Through manual screening, the study found that 42.75 per cent of the downloaded records (N=354) contained microscopic images eligible for inclusion in the study (records bearing no images, or only images other than microscopic images, were removed). The study also found that a search using different terms, such as Images, would produce a different set of microscopic image-bearing articles. Therefore, this study by no means yielded a comprehensive listing of all articles containing microscopic images; rather, it was intended to collect a set of closely related images for the purpose of caption analysis.

For human indexing, two students enrolled in the School of Library and Information Science, University of Kentucky (Lexington, Kentucky, USA) produced a set of descriptors based solely on the captions. Cross validation was performed to reduce inter-indexer variability, and only descriptors agreed upon by both indexers were included in the study. To obtain machine-generated descriptors, the collected captions were also processed through an online text analysis tool, the Text Analysis Portal for Research (TAPoR)1. The TAPoR module used in this study was KeywordFinder, “which tries to find keywords or key phrases of a source text and recommends them to the user.” The study used only the 20 top-frequency words. The identified core concepts were then processed through the MetaMap Transfer (MMTx)2 engine to map them to the UMLS Metathesaurus.
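To make the automatic-keyword step concrete, the sketch below shows a crude frequency-based keyword finder applied to an invented caption. This is not TAPoR's actual algorithm (which, per the footnote, restricts candidates to nouns and noun phrases); it is a minimal stand-in illustrating the general idea of ranking caption tokens by frequency.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real tool would use a much larger one
STOPWORDS = {"the", "of", "and", "in", "a", "with", "is", "are", "was", "were",
             "to", "for", "by", "at", "on", "an", "as", "or", "show", "shows"}

def top_keywords(caption, n=20):
    # Tokenize to lowercase words, drop stopwords, rank by frequency
    tokens = re.findall(r"[a-z][a-z&-]+", caption.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

# Hypothetical caption in the style of those collected by the study
caption = ("Immunohistochemical staining of ductal carcinoma in situ. "
           "Tumor cells show strong cytoplasmic staining; original magnification 200x. "
           "Haematoxylin and eosin staining of adjacent normal breast tissue.")
keywords = top_keywords(caption, n=5)
```

Note that, exactly as the Results section reports for TAPoR, a purely frequency-based approach surfaces staining terms prominently but cannot reconcile variants (H&E vs. haematoxylin and eosin) or recognize multi-word phrases.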

Results

RQ1: Major topical descriptors for microscopic images by human and computer

The first research question sought to describe the general characteristics of the collected captions. A total of 354 captions were manually copied into individual text files and used for the remaining analyses. The captions came from 54 different journals, and more than 76 per cent of them (N=272) were taken from seven journals: Breast Cancer Research (N=107), American Journal of Pathology (N=40), Journal of Clinical Pathology (N=34), BMC Cancer (N=33), Annals of Surgery (N=33), Proceedings of the National Academy of Sciences USA (N=20), and Neoplasia (N=17). All MeSH subheadings for Breast Neoplasms were identified. Appendix A shows the MeSH terms for the caption source articles ranked by assignment frequency. MeSH terms such as Humans, Female, and Middle Aged occurred frequently, as did genomic- and proteomic-related terms such as Tumor Cells, Cultured and Immunohistochemistry.

Descriptive characteristics of the caption collection are shown in Table 1. Individual captions were small both in file size and in average total words (N=81.71). This meant a fairly small number of words to be read and processed in both human and automatic indexing. Human indexers assigned fewer keywords per caption (AVG=5.41) than TAPoR did (AVG=19.68). Based on the variance of the number of keywords assigned, TAPoR generated a fairly consistent number of keywords throughout the caption collection (AVG=19.68, StdDev=6.57 vs. AVG=5.41, StdDev=3.61). This result supports other research findings of better indexing consistency, in terms of the number of keywords generated, through automatic indexing. Further comparison between human-assigned and TAPoR-generated descriptors follows.

Table 1. Descriptive Characteristics of Caption Collection.

Sum Mean StdDev Median Mode Min Max
#Filesize 195,933 553.48 393.92 475 127 36 2350
#TotalWords 28,927 81.71 60.04 69 54 6 371
#UniqueWords 9,994 28 17 26 31 2 102
#MeSH 5,398 15.25 5.67 15 15 1 38
#TAPoR 6,966 19.68 6.57 22 25 0 48
#HumanInd 1,915 5.41 3.61 5 3 0 19
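The summary columns of Table 1 are standard descriptive statistics and can be reproduced with Python's `statistics` module. The sketch below applies them to an invented list of per-caption keyword counts, not to the study's actual data.

```python
import statistics

def describe(values):
    # Compute the same summary columns used in Table 1 for one variable
    return {
        "Sum": sum(values),
        "Mean": round(statistics.mean(values), 2),
        "StdDev": round(statistics.stdev(values), 2),   # sample standard deviation
        "Median": statistics.median(values),
        "Mode": statistics.mode(values),
        "Min": min(values),
        "Max": max(values),
    }

# Hypothetical per-caption keyword counts for eight captions
keyword_counts = [3, 5, 5, 8, 2, 11, 5, 4]
summary = describe(keyword_counts)
```

A comparison like the paper's would compute this summary once for the human-assigned counts and once for the TAPoR counts, then compare the StdDev values as a consistency measure.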

RQ2: Comparison between human- and computer-assigned descriptors

For the same captions, humans assigned only one-third as many keywords as TAPoR. This implies that human indexing is less productive at identifying a large number of keywords than an automatic keyword generator such as TAPoR. In caption-based indexing, the general topical matter of the most frequently found keyword groups is varied. The groups include disease/diagnostic names (e.g., ductal carcinoma in situ, breast cancer, invasive ductal carcinoma, ductal hyperplasia, etc.), laboratory techniques and procedures (e.g., immunohistochemical staining, haematoxylin, eosin, photomicrograph types, magnification, etc.), and cells and biomarkers (e.g., apoptosis, epithelial cells, cytoplasm, acini, etc.). Appendix B shows the top 30 most frequently assigned topical descriptors. Although not listed in Appendix B, some keywords (e.g., 100×, 200×, .005μm, 40%, 70%, etc.) are part of the quantitative measurements of specific study results. These are unconventional subject headings, not used in any controlled vocabulary, but they can be useful for refining microscopic image retrieval. TAPoR also identified unusual and less meaningful words (e.g., arrows, observed, red, situ, note, sections, strong, etc.) that would not otherwise be assigned as MeSH (or any subject) headings for article descriptions. Additionally, neither the human indexers nor TAPoR reconciled word variants such as acronyms (e.g., H&E vs. hematoxylin and eosin, ductal carcinoma in situ vs. DCIS, etc.) or plural forms (e.g., cell vs. cells, lobule vs. lobules vs. lobe, etc.). Phrases were more frequently identified in human keywords (e.g., ductal carcinoma in situ vs. ductal carcinoma; in situ, etc.).
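The variant-reconciliation problem noted above (acronyms, spellings, plurals) is straightforward to sketch in code. The acronym map, spelling map, and plural rule below are all hypothetical illustrations, not part of the study's method.

```python
# Illustrative maps only; a real system would draw these from a lexicon such as UMLS
ACRONYMS = {"dcis": "ductal carcinoma in situ", "h&e": "hematoxylin and eosin"}
SPELLINGS = {"tumour": "tumor", "haematoxylin": "hematoxylin"}

def normalize(term):
    # Lowercase, expand acronyms, unify spellings, then naively de-pluralize
    term = ACRONYMS.get(term.lower().strip(), term.lower().strip())
    words = [SPELLINGS.get(w, w) for w in term.split()]
    # naive rule: strip a trailing 's' from words of 4+ letters (not '-ss' words)
    words = [w[:-1] if w.endswith("s") and len(w) > 3 and not w.endswith("ss") else w
             for w in words]
    return " ".join(words)

# Hypothetical descriptor sets in the style of Appendix B
human = {"DCIS", "tumour cells", "Immunohistochemical staining"}
tapor = {"ductal carcinoma in situ", "tumor cells", "immunohistochemical"}
overlap = {normalize(h) for h in human} & {normalize(t) for t in tapor}
```

Even this crude normalization recovers matches (DCIS with ductal carcinoma in situ, tumour cells with tumor cells) that raw string comparison would miss, which is precisely the gap the paper observes in both indexing methods.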

RQ 3: Mapping the core descriptors into the UMLS Metathesaurus

The study formed 79 core descriptors by combining the top 20 human-assigned descriptors and the top 20 TAPoR-assigned descriptors and eliminating duplicates. The number of duplicates indicates that automatically generated keywords potentially produce many false hits in retrieval. It is therefore imperative to map or validate automatically generated terms against controlled vocabularies, such as the UMLS Metathesaurus, to improve retrieval.
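The merge-and-deduplicate step can be sketched as follows; the two short ranked lists are invented examples drawn from the style of Appendix B, not the study's full top-20 lists.

```python
def core_descriptors(human_top, tapor_top):
    # Merge two ranked keyword lists, keeping the first occurrence of each term
    # (case-insensitive comparison, original casing preserved)
    seen, merged = set(), []
    for term in human_top + tapor_top:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            merged.append(term)
    return merged

# Hypothetical abbreviated ranked lists
human_top = ["ductal carcinoma in situ", "Immunohistochemical staining", "breast cancer"]
tapor_top = ["ductal carcinoma", "original magnification", "Breast cancer"]
core = core_descriptors(human_top, tapor_top)
```

Here "breast cancer" appears in both lists and is kept only once, mirroring how the study's combined lists shrank to 79 unique core descriptors after duplicate elimination.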

The study mapped both the full sets of human- and TAPoR-assigned keywords and the set of 79 core descriptors to the UMLS Metathesaurus. For the full sets, human-assigned descriptors outperformed TAPoR-assigned keywords. Approximately 41% of the total descriptors assigned by humans (N=533) were “fully matched,” while only 35.49% of TAPoR keywords were “full matches.” Human-generated keywords also outperformed TAPoR keywords on “partial match” (human=48.44% vs. TAPoR=38.66%) and “mismatch” (human=10.97% vs. TAPoR=25.85%). This implies that human-assigned keywords are better source terms than automatically generated keywords for mapping to the UMLS vocabularies through MMTx. Several numerical descriptions, such as 100×, 200mb, 30%, etc., failed to map to the Metathesaurus. Because numerical notations for quantitative assessment are frequently reported in scientific observations, supporting them is a challenge the Metathesaurus must meet to improve Meta Mapping.

Of the 79 core descriptors, 45 (56.96%) were “fully matched” (F), 25 (31.65%) were “partially matched” (P), and nine (11.39%) were “not found” (M) in the UMLS. The study also compared the mapping results of the human- and TAPoR-assigned core descriptors. Appendix C shows the selected core descriptors in the first column (Core Descriptors) mapped to the UMLS Metathesaurus in the next column (UMLS Mapping). The third column (Match) lists matching results based on the mapped words in the second column: F refers to Full Match, P to Partial Match, and M to Mismatch. The frequency with which each core descriptor appeared in both keyword sets is listed in the last column (Freq). The most frequently identified descriptor, tumor cells (N=61), was mapped to Unspecified tumor cell NOS, shown in parentheses; its semantic type, Cell, is enclosed in square brackets in the second column.

This study identified several useful types of terms in the core descriptor set. Histologic diagnostic terms such as ductal carcinoma in situ, infiltrating ductal carcinoma, invasive ductal carcinoma, atypical ductal hyperplasia, lobular carcinoma in situ, etc., can be helpful in identifying microscopic images in academic papers since histology relies mainly on visual findings. Descriptors such as expression, endothelial cells, myoepithelial cells, stromal cells, and cytoplasm are useful in indicating cellular components expressed in microscopic images. Terms such as normal, abnormal, invasive, infiltrating, positive, and negative are used for qualitative assessments, while terms such as immunohistochemistry and haematoxylin and eosin relate to laboratory procedures and techniques, especially histologic staining. In addition to anatomical vocabularies, the language of clinical therapies and disease processes proved highly relevant to clinical vocabularies in general.

Discussion

The study assessed some general characteristics of the collected captions for use as microscopic image descriptors (RQ1), compared the topical descriptors assigned by human indexers with those assigned by an automatic indexing engine (RQ2), and identified a list of core descriptors and their mapping results in the UMLS Metathesaurus (RQ3). First, the findings suggest that caption-based descriptors can complement title- or abstract-based literature indexing for figure image retrieval. Identified keywords such as immunohistochemistry, H&E staining, magnification, and 100× describe laboratory procedures and methods that are not accessible through the MeSH descriptors assigned to the articles. This finding is supported by several works that implemented prototype systems for searching captions and figures to locate biomedical images published in open access journals [20]. The finding is important because captions are granular information specific to figures, delivering the core context of scientific evidence in an abridged format. With the growth of full-text retrieval, the availability of caption searching for figure images will improve topical access to published biomedical images.

Second, humans assigned fewer keywords than TAPoR but produced better matches to the UMLS Metathesaurus. This result corresponds to a previous study, in which Sneiderman et al. found that search words provided by humans were effective in matching concepts in the UMLS Metathesaurus, with 98% exact matching, whereas only 26% of the exact UMLS matches contained in the caption were considered useful for indexing the matched image [22]. This finding implies that terms generated by human indexers for image indexing have a high probability of matching the Meta concepts present in the texts that reference those images, while current automated algorithms generate many matches not useful for indexing. It would be important to discover whether experienced human indexers provide different indexing results from inexperienced indexers.

Third, the study found that several dimensions of the imaging descriptors in the caption-based keywords could be implemented as search filters for improved image retrieval. Semantic types identified through Meta Mapping are highly likely to be useful in filtering image retrieval results. Such types include Indicator, Reagent, or Diagnostic Aid; Organic Chemical; Laboratory Procedure; Spatial Concept; Qualitative Concept; and Quantitative Concept. Furthermore, combining such filters with the MeSH terms describing the overall subject matter of an article should improve the relevance of retrieved microscopic images. With respect to forming a metadata framework for online microscopic image description, the semantic types can be used as a core metadata set. In this regard, the finding can inform a standardized microscopic image description protocol for training medical students.
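A semantic-type filter of the kind proposed above can be sketched as a simple lookup. The descriptor-to-type pairs below are taken from Appendix C (simplified to one type each); the filter logic itself is an illustrative assumption, not the study's implementation.

```python
# Mapped descriptors paired with a UMLS semantic type (simplified from Appendix C)
MAPPED = {
    "immunohistochemistry": "Laboratory Procedure",
    "haematoxylin": "Indicator, Reagent, or Diagnostic Aid",
    "breast cancer": "Neoplastic Process",
    "normal": "Qualitative Concept",
    "cytoplasm": "Cell Component",
}

def filter_by_semantic_type(descriptors, wanted_types):
    # Keep only descriptors whose UMLS semantic type is in the wanted set
    return [d for d in descriptors if MAPPED.get(d) in wanted_types]

# e.g., restrict a result set to staining/assessment terms
hits = filter_by_semantic_type(MAPPED, {"Laboratory Procedure", "Qualitative Concept"})
```

In a retrieval system, such a filter would sit on top of caption search: the user's keyword query returns candidate images, and semantic types then narrow the results to, say, laboratory procedures or qualitative assessments.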

Fourth, biomedical informaticians as well as medical librarians will benefit from this study by being able to identify core search keywords for microscopic image retrieval in published academic papers. System librarians, text-mining researchers, and imaging specialists can also use the study findings to improve their understanding of biomedical imaging descriptions for more efficient system design. End users will be able to use the core descriptors identified in this study to facilitate PubMed or GoogleImage searches. As shown by previous studies on the effectiveness of mapping, the MMTx engine can also enhance searching [23, 24].

However, the study also has some limitations. The caption collection was limited to breast neoplasms as the disease of interest, so the findings may not apply to other disease categories, and different search strategies might yield different caption collections. As previously noted, the human indexing was performed by non-expert indexers. In future studies, the research team plans to compare trained and untrained indexers; we will also formally assess retrieval effectiveness and relevance by measuring similarities among images retrieved by different imaging descriptors. Finally, TAPoR was the only text analysis tool used to generate keywords from the caption collection; using multiple lexical analysis tools could have minimized algorithm-specific variation.

Conclusion

To provide greater accessibility to non-textual information such as images, the textual descriptions associated with visual findings need to be carefully studied. The findings suggest that caption-based descriptors can contribute to systematic access to published images. Additionally, this study aids understanding of the bibliographic control of visual materials for better retrieval by providing a semantic mapping mechanism. Future research should examine the effectiveness of diverse description sources for the relevance of image search results. In the field of information organization and retrieval, methods and tools for non-textual information retrieval have garnered increasing attention as the digital world expands. It is incumbent upon libraries and other information agencies to promote and maintain an interest in the opportunities and challenges associated with biomedical imaging.

Key Messages.

Implications for Policy:

  • Caption-based descriptors produced several aspects of microscopic image descriptions which could be used as complementary sources for title or abstract-based literature indexing.

  • Terms suggested by human indexers for image indexing have a high probability of matching the Meta concepts present in the texts that reference those images, while the automated algorithm generated many matches that were not useful for indexing.

  • Semantic types identified through Meta Mapping are highly likely to be useful in filtering image retrieval results. Such types include Indicator, Reagent, or Diagnostic Aid; Organic Chemical; Laboratory Procedure; Spatial Concept; Qualitative Concept; and Quantitative Concept.

Implications for Practice:

  • Medical librarians as well as biomedical informaticians will benefit from this study in terms of identifying a core set of search keywords for microscopic image retrieval in published academic papers.

  • System librarians, text-mining researchers, and imaging specialists can use the study findings to improve their understanding of biomedical imaging description for a more efficient system design.

  • End-users may be able to use the core descriptors identified in this study to facilitate PubMed or GoogleImage searches.

Acknowledgments

Grant Acknowledgement: This project was supported in part by a grant (RE-04-08-0069-08) from the Institute of Museum and Library Services (IMLS). In addition, this publication was made possible by Grant Number P20RR-16481 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH).

Appendix A: MeSHs for Caption Source Articles by Frequency Rank.

MeSH Freq MeSH Freq
Humans 352 Cell Proliferation 17
Female 287 Neoplasm Metastasis 17
Middle Aged 108 Reverse Transcriptase Polymerase Chain Reaction 17
Adult 85 Blotting, Western 16
Aged 83 Breast Neoplasms/ metabolism/pathology 16
Tumor Cells, Cultured 73 Gene Expression 16
Animals 54 Prospective Studies 16
Immunohistochemistry 49 Signal Transduction 16
Aged, 80 and over 47 Treatment Outcome 16
Mice 47 Axilla 15
Neoplasm Invasiveness 47 Phenotype 15
Breast Neoplasms/ pathology 45 Predictive Value of Tests 15
Cell Line, Tumor 45 Oligonucleotide Array Sequence Analysis 14
Prognosis 43 Breast Neoplasms/ genetics/ pathology 13
Breast Neoplasms/ genetics/pathology 36 Cohort Studies 13
Lymphatic Metastasis 31 Diagnosis, Differential 13
Survival Analysis 30 Neoplasm Transplantation 13
Male 29 Risk Factors 13
Neoplasm Staging 29 Transplantation, Heterologous 13
Gene Expression Regulation, Neoplastic 26 Tumor Markers, Biological/ analysis 13
Disease Progression 21 Breast Neoplasms/genetics/ metabolism/pathology 12
Follow-Up Studies 21 Time Factors 12
Retrospective Studies 21 Transcription, Genetic 12
Transfection 21 Tumor Markers, Biological/ metabolism 12
Gene Expression Profiling 20 Drug Resistance, Neoplasm 11
Sensitivity and Specificity 20 Hyperplasia 11
Mice, Nude 19
Cell Division 17

Appendix B: Top 30 Most Frequently Assigned Topical Descriptors.

Rank Human-assigned descriptors Freq TAPoR-generated descriptors Freq
1 ductal carcinoma in situ 24 ductal carcinoma 43
2 Immunohistochemical staining 24 original magnification 42
3 breast cancer 21 breast cancer 35
4 Invasive ductal carcinoma 21 normal breast 35
5 DCIS 19 Expression 26
6 tumor cells 18 magnification 25
7 immunohistochemistry 17 Cells 24
8 apoptosis 14 invasive ductal carcinoma 23
9 Epithelial cells 13 immunohistochemical 21
10 cytoplasm 12 epithelial cells 20
11 Immunostaining 12 breast carcinoma 19
12 apoptotic cells 10 haematoxylin 19
13 breast carcinoma 10 invasive ductal 19
14 ductal hyperplasia 10 tumor cells 19
15 Photomicrograph 9 Arrows 18
16 stromal cells 9 Eosin 18
17 fibroblasts 8 mda mb 231 18
18 Immunohistochemical analysis 8 Note 18
19 invasive breast cancer 8 Situ 18
20 tumour cells 8 carcinoma cells 16
21 hematoxylin 7 mda-mb-231 16
22 human breast 7 Negative 16
23 infiltrating ductal carcinoma 7 tumour cells 16
24 myoepithelial cells 7 Cytoplasm 15
25 tumour 7 h&e 15
26 acini 6 Positive 15
27 atypical ductal hyperplasia 6 Arrow 14
28 breast epithelium 6 Panel 14
29 cancer cells 6 breast tissue 13
30 cytoplasmic staining 6 cancer cells 13

Appendix C: Selected Core Descriptors Mapped into UMLS Vocabularies through MMTx.

Core Descriptors | UMLS Mapping | Match | Freq
tumor cells | Tumor cells ([M]Unspecified tumor cell NOS) [Cell] | F | 61
breast cancer | Breast Cancer (Breast Carcinoma) [Neoplastic Process]; Breast Cancer (Malignant neoplasm of breast) [Neoplastic Process] | F | 56
ductal carcinoma | ductal carcinoma (Ductal Breast Carcinoma) [Neoplastic Process]; Ductal Carcinoma [Neoplastic Process] | F | 49
invasive ductal carcinoma | Invasive Ductal Carcinoma (Carcinoma, Ductal, Breast) [Neoplastic Process] | F | 44
original magnification | Original [Idea or Concept] | P | 42
normal breast | Normal [Qualitative Concept]; Breast [Body Part, Organ, or Organ Component]; Normal [Qualitative Concept]; Breast (Entire breast) [Body Part, Organ, or Organ Component] | P | 35
breast carcinoma | Breast Carcinoma [Neoplastic Process] | F | 34
mda mb 231 | MDA [Organic Chemical, Pharmacologic Substance]; mb (Megabase) [Quantitative Concept] | P | 34
epithelial cells | Epithelial Cells [Cell] | F | 33
arrow | Arrow [Manufactured Object] | F | 32
DCIS | DCIS (Carcinoma, Intraductal) [Neoplastic Process] | F | 31
immunohistochemistry | Immunohistochemistry [Laboratory Procedure] | F | 29
cytoplasm | Cytoplasm [Cell Component] | F | 27
expression | Expression (Expression procedure) [Therapeutic or Preventive Procedure]; Expression (Gene Expression) [Genetic Function] | F | 26
magnification | Meta Mappings: <none> | M | 25
cells | Cells [Cell] | F | 24
ductal carcinoma in situ | Ductal Carcinoma In Situ (Carcinoma, Intraductal) [Neoplastic Process] | F | 24
haematoxylin | Haematoxylin (Hematoxylin) [Indicator, Reagent, or Diagnostic Aid, Organic Chemical] | F | 24
Immunohistochemical staining | Immunohistochemical [Laboratory Procedure]; Staining (Staining method) [Laboratory Procedure] | P | 24
immunohistochemical | Immunohistochemical [Laboratory Procedure] | F | 21

Footnotes

1

TAPoR is also a research project consisting of six leading humanities computing centers in Canada. More detailed information about individual projects and products can be found at http://tada.mcmaster.ca/Main/TAPoRwareKeywordsFinder. Only nouns and noun phrases are considered keywords; verbs, adverbs, and other parts of speech are excluded. In the result page, TAPoR “lists 20 top frequency words, 10 top frequency word pairs and word triplets respectively.”
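The frequency counting TAPoR performs over caption text can be sketched as below. This is an illustrative simplification, not TAPoR's implementation: instead of part-of-speech tagging to keep only nouns and noun phrases, it drops any n-gram containing a common function word, and the caption string is invented for the example.

```python
from collections import Counter
import re

# Crude stand-in for noun/noun-phrase filtering via POS tagging.
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "with", "and", "is", "are"})

def top_ngrams(text, n, k=20):
    """Return the k most frequent n-grams, skipping any containing a stopword."""
    tokens = re.findall(r"[a-z0-9&-]+", text.lower())
    grams = [
        " ".join(tokens[i:i + n])
        for i in range(len(tokens) - n + 1)
        if not any(t in STOPWORDS for t in tokens[i:i + n])
    ]
    return Counter(grams).most_common(k)

# Hypothetical caption text, for illustration only.
caption = ("Invasive ductal carcinoma of the breast. "
           "Immunohistochemical staining of tumor cells; "
           "tumor cells are positive. Original magnification x200.")
print(top_ngrams(caption, 1, k=5))  # top single words
print(top_ngrams(caption, 2, k=5))  # top word pairs, e.g. ('tumor cells', 2)
```

Running the same counter with n=1, 2, and 3 mirrors TAPoR's words, word pairs, and word triplets lists.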

2

Built on natural language processing and computational linguistic techniques, MMTx uses the UMLS Metathesaurus provided by the NLM to map user-supplied terminology to controlled vocabularies [14]. Because the Metathesaurus includes MeSH, MMTx returns MeSH descriptors among its mappings; the study did not attempt to distinguish MeSH descriptors from the other vocabularies included in the UMLS. The study customized MMTx to run from the Windows command line.
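The full (F), partial (P), and no-mapping (M) outcomes reported in Appendix C can be illustrated with a toy sketch. This is not MMTx (a Java tool backed by the entire Metathesaurus, with variant generation and candidate scoring); a small hand-made lexicon stands in for the UMLS here, and the fallback to word-by-word lookup is a deliberate simplification.

```python
# Tiny stand-in for the UMLS Metathesaurus: phrase -> (concept, semantic type).
LEXICON = {
    "breast carcinoma": ("Breast Carcinoma", "Neoplastic Process"),
    "epithelial cells": ("Epithelial Cells", "Cell"),
    "normal": ("Normal", "Qualitative Concept"),
    "breast": ("Breast", "Body Part, Organ, or Organ Component"),
}

def map_term(term):
    """Map a descriptor to concepts: 'F' if the whole phrase matches,
    'P' if only sub-parts match, 'M' if nothing matches."""
    term = term.lower()
    if term in LEXICON:
        return "F", [LEXICON[term]]
    # Word-by-word fallback (a crude stand-in for MMTx's candidate retrieval).
    hits = [LEXICON[w] for w in term.split() if w in LEXICON]
    return ("P", hits) if hits else ("M", [])

print(map_term("breast carcinoma"))  # full match -> F
print(map_term("normal breast"))     # each word maps separately -> P
print(map_term("magnification"))     # nothing found -> M
```

These three cases correspond to rows in Appendix C such as "breast carcinoma" (F), "normal breast" (P), and "magnification" (M).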

No conflicts of interest have been declared.

Contributor Information

Sujin Kim, Email: sujinkim@uky.edu, School of Library and Information Science and Department of Pathology and Laboratory Medicine, 339 Lucille Little Building, University of Kentucky, Lexington, KY 40506-0224.

Shannon Lamkin, Email: shannonlamkin@yahoo.com, University of Kentucky Libraries, Lexington, KY 40506.

Pam Duncan, Email: psdunc2@uky.edu, University of Kentucky, Lexington, KY 40506-0224.

References

1. Jorgensen C. Indexing images: testing an image description template. Proceedings of the 59th Annual Meeting of the American Society for Information Science; 1996. pp. 209–213.
2. Panofsky E. Meaning in the Visual Arts. Penguin; London: 1993.
3. Bidgood WD, et al. Image acquisition context: procedure description attributes for clinically relevant indexing and selective retrieval of biomedical images. JAMIA. 1999;6(1):61–75. doi: 10.1136/jamia.1999.0060061.
4. Lowe HJ, et al. Towards knowledge-based retrieval of medical images: the role of semantic indexing, image content representation and knowledge-based retrieval. Proceedings/AMIA Annual Fall Symposium; 1998. pp. 882–886.
5. Kim S, Jeong HK, Choi HJ, Kim D. Automatic histologic grading for lobular carcinoma in situ. World Congress on Biomedical Physics and Medical Engineering; Munich, Germany; 7–13 September 2009. To be indexed in the Springer Proceedings.
6. Rasmussen EM. Indexing images. Annual Review of Information Science and Technology. 1997;32:169–196.
7. Dee FR. Virtual microscopy for comparative pathology. Toxicologic Pathology. 2006;34(7):966–7. doi: 10.1080/01926230601123062.
8. Li XX, et al. A feasibility study of virtual slides in surgical pathology in China. Human Pathology. 2007;38(12):1842–1848. doi: 10.1016/j.humpath.2007.04.019.
9. Leong FJ, Leong AS. Digital imaging applications in anatomic pathology. Advances in Anatomic Pathology. 2003;10(2):88–95. doi: 10.1097/00125480-200303000-00003.
10. Pritt BS, Gibson PC, Cooper K. Digital imaging guidelines for pathology: a proposal for general and academic use. Advances in Anatomic Pathology. 2003;10(2):96–100. doi: 10.1097/00125480-200303000-00004.
11. Montalto MC. Pathology RE-imagined: the history of digital radiology and the future of anatomic pathology. Archives of Pathology and Laboratory Medicine. 2008;132(5):764–5. doi: 10.5858/2008-132-764-PRTHOD.
12. Ho J, et al. Use of whole slide imaging in surgical pathology quality assurance: design and pilot validation studies. Human Pathology. 2006;37(3):322–31. doi: 10.1016/j.humpath.2005.11.005.
13. Yagi Y, Gilbertson JR. Digital imaging in pathology: the case for standardization. Journal of Telemedicine and Telecare. 2005;11(3):109–16. doi: 10.1258/1357633053688705.
14. Kahn CE, Thao C. GoldMiner: a radiology image search engine. American Journal of Roentgenology. 2007;188:1475–78. doi: 10.2214/AJR.06.1740.
15. Yeh AS, Hirschman L, Morgan AA. Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. Bioinformatics. 2003;19:i331–9. doi: 10.1093/bioinformatics/btg1046.
16. Shatkay H, Chen N, Blostein D. Integrating image data into biomedical text categorization. Bioinformatics. 2006;22(14):e446–53. doi: 10.1093/bioinformatics/btl235.
17. Choi Y, Rasmussen EM. Users' relevance criteria in image retrieval in American history. Information Processing and Management. 2002;38:695–726.
18. Xu S, McCusker J, Krauthammer M. Yale Image Finder (YIF): a new search engine for retrieving biomedical images. Bioinformatics. 2008;24(17):1968. doi: 10.1093/bioinformatics/btn340.
19. Hua J, et al. Identifying fluorescence microscope images in online journal articles using both image and text features. Proceedings of the 2007 IEEE International Symposium on Biomedical Imaging; 2007. pp. 1224–1227.
20. Hearst MA, Divoli A, Guturu H, Ksikes A, Nakov P, Wooldridge MA, Ye J. BioText Search Engine: beyond abstract search. Bioinformatics. 2007;23(16):2196–2197. doi: 10.1093/bioinformatics/btm301.
21. Yu H. Towards answering biological questions with experimental evidence: automatically identifying text that summarizes image content in full-text articles. AMIA Annu Symp Proc; 2006. pp. 834–838.
22. Sneiderman CA, et al. UMLS-based automatic image indexing. AMIA Annual Symposium Proceedings; 2008 [cited 20 February 2009]. Available from: http://archive.nlm.nih.gov/pubs/pubPDFs/Sneiderman_et_al_AMIA_2008.pdf.
23. Gay CW, Kayaalp M, Aronson AR. Semi-automatic indexing of full text biomedical articles. AMIA Annual Symposium Proceedings; 2005 [cited 20 February 2009]. Available from: http://ii.nlm.nih.gov/resources/amia05.fulltext.w.footer.pdf.
24. Kahn CE. Effective metadata discovery for dynamic filtering of queries to a radiology image search engine. Journal of Digital Imaging. 2008;21(3):269–73. doi: 10.1007/s10278-007-9036-5.
