AMIA Annual Symposium Proceedings. 2010 Nov 13;2010:41–45.

Using Ontology Network Structure in Text Mining

Donald J Berndt 1,2, James A McCart 1, Stephen L Luther 1
PMCID: PMC3041319  PMID: 21346937

Abstract

Statistical text mining treats documents as bags of words, with a focus on term frequencies within documents and across document collections. Unlike natural language processing (NLP) techniques that rely on an engineered vocabulary or a full-featured ontology, statistical approaches do not make use of domain-specific knowledge. The freedom from biases can be an advantage, but at the cost of ignoring potentially valuable knowledge. The approach proposed here investigates a hybrid strategy based on computing graph measures of term importance over an entire ontology and injecting the measures into the statistical text mining process. As a starting point, we adapt existing search engine algorithms such as PageRank and HITS to determine term importance within an ontology graph. The graph-theoretic approach is evaluated using a smoking data set from the i2b2 National Center for Biomedical Computing, cast as a simple binary classification task for categorizing smoking-related documents, demonstrating consistent improvements in accuracy.

Introduction

Natural language processing, like other more deductive approaches to pattern discovery, relies on existing ontologies to look up terms and map them to concepts. These methods typically use localized knowledge in the ontology: looking up terms, handling synonyms, and better understanding the context surrounding a specific concept. This research pursues a very different, more macro-ontological approach, computing measures of concept importance from the entire ontological network structure, thereby quantifying domain knowledge in a form that can be used in statistical text mining and other inductive data mining techniques. This macro-ontological approach contrasts with, and in some ways complements, the micro-ontological perspective used in many existing methods. The large-scale ontologies being constructed for use with semantic web methods, as well as the sustained investments in medical ontologies, provide an opportunity to test more global views and computational algorithms for quantifying different aspects of domain knowledge by utilizing network structure. If successful, such methods offer a new pathway for introducing domain knowledge into machine learning techniques.

This study brings ontological knowledge into the statistical text mining process by viewing the ontology as a very large-scale graph, much like the Internet. Doing so involves adapting search engine algorithms for use on the ontological network. Algorithms such as Hyperlink-Induced Topic Search (HITS) [1] and Google’s PageRank [2] compute measures based on the network structure, estimating the importance of individual pages or nodes. By adapting these algorithms for use on a medical ontology such as the Unified Medical Language System (UMLS), different importance measures can be computed for each concept. These individual concept weightings can then be imported into the statistical text mining process at the term or phrase level [3].

The philosophy behind this approach is that a large group of contributors incrementally construct any ontology over an extended time period and that there is wisdom in that crowd. The objective is to extract some of the domain-specific wisdom from the ontology in the large, and inject it into subsequent machine learning efforts. In other words, can we capture and use human crowd-sourced insights to leverage machine learning?

Adapting Search Engine Algorithms

Search engine algorithms such as PageRank consider Internet structure at a page or node level, using factors like the number of in-edges and out-edges to assign each page an “objective measure of its citation importance” [2]. Other algorithms such as Hyperlink-Induced Topic Search (HITS) use the notion of “authoritative pages” and hub structures to generate a link analysis-based page importance [1]. These specific algorithms and subsequent extensions have worked very well in the context of search engines and information retrieval. Even though medical text mining is a very different domain, many aspects of these Web analytic approaches seem applicable. In fact, one recent conference paper looked at using ontologies to improve document classification in the national security domain [4]. While somewhat preliminary in nature, the study found that ontologically derived data combined with vector space methods improved classification performance. Though the approach being pursued here differs in several respects, this early result highlights the potential of using an ontological network for deriving domain knowledge that can be fed back into the data mining process.
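To make the node-level idea concrete, the following is a minimal sketch of PageRank computed by power iteration on a toy graph. It is not the JUNG implementation used in this study. The concept names, the child-to-parent edge direction, the damping factor of 0.85, and the even redistribution of rank from nodes without out-edges are all illustrative assumptions.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-12):
    """Power-iteration PageRank over a directed edge list [(src, dst), ...]."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-edges is spread evenly over all nodes
        # (one common convention; an assumption here).
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy child -> parent edges with hypothetical concept names (not SNOMED CT data).
toy_edges = [
    ("wrist fracture", "fracture"),
    ("leg fracture", "fracture"),
    ("fracture", "injury"),
    ("fall", "injury"),
]
ranks = pagerank(toy_edges)
for concept, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{concept:15s} {score:.4f}")
```

In this toy graph the concept that collects the most incoming paths receives the highest score, which is the kind of structural signal the study seeks to extract from an ontology.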

Ontology Term Ranking Calculations

To begin investigating graph or network algorithms in the medical text mining context, terms and relationships from the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT) portion of the UMLS were used to create a very large graph with roughly 300K nodes and 450K edges. The graph was created using the Java Universal Network/Graph Framework (JUNG) and stored as an XML file for processing with pre-existing graph algorithm packages, as well as in a MySQL database (for both processing and fast retrieval within the text mining process). Several existing Web analytic measures, such as HITS and PageRank, were computed using the JUNG framework for all connected SNOMED CT concepts within the graph.

This preliminary investigation provided experience with many aspects of this approach, including interacting with the UMLS, very large scale graph creation and processing, using search engine algorithms, and the implementation of a database for storing the final results. The preliminary computations focused on the parent/child and synonym relationships, though additional edge types will be considered later.

A total of 318,106 SNOMED concepts were obtained from the 2009AB version of the UMLS. Seven concepts were removed from further processing because they were classified under the “inactive concept” sub-hierarchy of SNOMED (IHTSDO 2010; see footnote 1). Of the remaining 318,099 concepts, 4,304 (1.35%) were found to have a synonymous relation with at least one other concept (see footnote 2). A representative concept was selected for each of the 2,049 independent groups of synonyms.
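One way to form independent synonym groups and pick a representative for each is a union-find pass over the synonym pairs. The sketch below is an illustration under assumptions: the concept identifiers are made up, and choosing the lexicographically smallest identifier as the representative is an arbitrary rule, since the selection rule actually used is not described.

```python
def group_synonyms(concepts, synonym_pairs):
    """Collapse synonymous concepts into groups; return {concept: representative}."""
    parent = {c: c for c in concepts}

    def find(c):
        # Walk up to the group root, shortening the path along the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in synonym_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Arbitrary rule (assumption): keep the smallest id as representative.
            keep, drop = sorted((root_a, root_b))
            parent[drop] = keep

    return {c: find(c) for c in concepts}


# Hypothetical identifiers, not real UMLS concept ids.
concepts = ["C1", "C2", "C3", "C4", "C5", "C6"]
pairs = [("C1", "C2"), ("C2", "C3"), ("C4", "C5")]
representative = group_synonyms(concepts, pairs)
remaining = len(set(representative.values()))
print(representative, remaining)
```

The same bookkeeping accounts for the concept count carried forward in the paper: 318,099 − 4,304 + 2,049 = 315,844.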

The parent/child relationships between the remaining 315,844 concepts (and their synonyms) were then obtained. A total of 430,645 edges between 288,179 concepts (91.24%) were found. The 27,665 concepts (8.76%) without any parent/child relationships were removed from the graph. Tables 1 and 2 present basic descriptive statistics about the graph, and Figure 1 displays a portion of the graph (depicting 18,652 nodes and 35,193 edges). Both the tables and the figure demonstrate that the graph is large and that the degree of connectedness varies, which is important for search-engine-like measures.

Table 1. Ontology graph components and counts.

| Component | Count |
|---|---|
| Nodes (Concepts) | 288,179 |
| Edges (Relationships) | 430,645 |
| Synonyms | 2,049 |

Table 2. Degree of node interconnectedness.

| Degree | Avg | Std Dev | Min | Max |
|---|---|---|---|---|
| Total | 2.99 | 9.35 | 1 | 2,517 |
| In-Degree | 1.49 | 0.87 | 0 | 16 |
| Out-Degree | 1.49 | 9.27 | 0 | 2,516 |
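Degree statistics of the kind reported in Table 2 can be derived directly from an edge list. The following sketch uses a toy graph with hypothetical concept names and a parent-to-child edge direction chosen for illustration. The use of the population standard deviation is an assumption, since the paper does not say which form was computed.

```python
from collections import Counter
from statistics import mean, pstdev


def degree_summary(edges):
    """Return {row: (avg, std_dev, min, max)} for total, in- and out-degree."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    # Only nodes that appear in at least one edge are counted.
    nodes = set(in_deg) | set(out_deg)
    rows = {
        "Total": [in_deg[n] + out_deg[n] for n in nodes],
        "In-Degree": [in_deg[n] for n in nodes],
        "Out-Degree": [out_deg[n] for n in nodes],
    }
    return {name: (mean(v), pstdev(v), min(v), max(v)) for name, v in rows.items()}


# Toy parent -> child edges with hypothetical concept names.
toy_edges = [
    ("injury", "fracture"),
    ("injury", "fall"),
    ("fracture", "wrist fracture"),
    ("fracture", "leg fracture"),
]
summary = degree_summary(toy_edges)
for name, (avg, std, lo, hi) in summary.items():
    print(f"{name:10s} avg={avg:.2f} std={std:.2f} min={lo} max={hi}")
```

Because every edge contributes one in-edge and one out-edge, the average in-degree and out-degree must both equal edges divided by nodes. For the graph described here that is 430,645 / 288,179 ≈ 1.49, which agrees with Table 2.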

Figure 1. A portion of the SNOMED CT graph.

Table 3 presents some initial computations drawn from the ontological network. The PageRank and HITS algorithms mentioned above were calculated for all 288,179 SNOMED CT concepts in the graph (using a restricted linkage structure). These are the same algorithms used for search engine tasks, without any adaptations for ontological computations in the medical domain. It is certainly worth exploring further refinements that take advantage of the unique structure implemented within the UMLS, but for now the focus is to validate our basic approach.

Table 3. PageRank and HITS network measures.

| Measure | Avg | Std Dev | Min | Max |
|---|---|---|---|---|
| PageRank | 3.47E-06 | 9.93E-07 | 2.62E-06 | 1.81E-05 |
| HITS–Hub | 1.07E-05 | 1.86E-03 | 3.07E-06 | 1.00 |
| HITS–Authority | 1.74E-04 | 1.85E-03 | 2.15E-13 | 0.02 |
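For readers unfamiliar with the hub and authority scores summarized in Table 3, the sketch below shows the basic HITS iteration on a toy graph. It is an illustration only and not the JUNG implementation used in this study. The node names are hypothetical, and the Euclidean normalization applied at each step is an assumption, so the resulting values are not on the same scale as those in Table 3.

```python
from math import sqrt


def hits(edges, iterations=50):
    """Basic HITS: return (hub, authority) score dictionaries."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the nodes pointing in.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}

        # Hub score: sum of authority scores of the nodes pointed to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return hub, auth


# Hypothetical nodes: two hub-like nodes pointing at two authority-like nodes.
toy_edges = [("hub_a", "x"), ("hub_a", "y"), ("hub_b", "x")]
hub, auth = hits(toy_edges)
print("hub:", hub)
print("authority:", auth)
```

A node pointed to by more (and better) hubs gains authority, and a node pointing to more (and better) authorities gains hub weight.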

Integrating with Statistical Text Mining

The aim of this research is to evaluate existing search engine algorithms and other Web analytic methods and adapt them for use on portions of the UMLS. The newly derived concept weightings will then be injected into the statistical text mining process. While there are several ways to combine the ontological information with the text mining process, a good starting point is to use concept weights from the ontology to adjust the term-by-document weights that are calculated as part of the statistical text mining process. This is a fairly straightforward integration point that should allow for evaluation of these new techniques on some existing use cases.

Table 4 outlines the major steps used in a typical statistical text mining process. The highlighted steps dealing with the term-by-document matrix generation and decomposition are the points where macro-ontological domain knowledge can be easily introduced. The term-by-document matrix includes weightings derived by some statistical method, such as information entropy or inverse document frequency. At this point, adjusted weights can be produced by combining the network calculations with the existing weights. Of course, the general methods developed here for computing concept importance from ontology structure can be used with many other techniques such as natural language processing, structured data and rule mining, as well as statistical text mining.

Table 4. Statistical text mining processes.

| Process | Description |
|---|---|
| Parsing | Separating words and phrases in the document |
| Stemming | Locating roots |
| Synonym Identification | Mapping roots in the document to synonyms |
| Stop list analysis | Elimination of stop words |
| Term-by-document matrix generation | Mathematical representation of each document (document counts, term counts, term weights) |
| Singular value decomposition | Statistical process that enables fast processing of numerical representation of documents |
| Clustering | A process that groups objects that are in some way related |
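The term-by-document matrix step can be illustrated with a small TF-IDF computation. This is a sketch under assumptions: whitespace tokenization stands in for the parsing, stemming, synonym, and stop list steps; the weighting is one common TF-IDF variant (relative term frequency times the natural log of inverse document frequency); and the three documents are invented examples, not clinical data.

```python
from collections import Counter
from math import log


def tfidf_matrix(documents):
    """Return (terms, matrix); matrix[i][j] is the weight of term i in document j."""
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({t for doc in tokenized for t in doc})
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(t for doc in tokenized for t in set(doc))

    matrix = []
    for term in terms:
        idf = log(n_docs / doc_freq[term])
        # Assumes every document has at least one token.
        row = [doc.count(term) / len(doc) * idf for doc in tokenized]
        matrix.append(row)
    return terms, matrix


# Invented example documents.
docs = [
    "patient denies smoking",
    "patient had a fall",
    "patient smokes daily",
]
terms, matrix = tfidf_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:8s}", [round(w, 3) for w in row])
```

A term that appears in every document, such as "patient" here, receives a weight of zero, which is the purely statistical notion of importance that ontology-derived weights are meant to supplement.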

There are two macro-level perspectives on the ontological network: a structural perspective and a contextual perspective that incorporates clinician expertise related to a specific task. The research outlined here pursues the more structural approach. However, future work will explore methods for adding task specific clinical knowledge that could be merged with the structural measures to produce more refined domain knowledge.

Integrating the graph-oriented measures from the ontology with the statistical text mining process includes several challenges.

  1. Where in the text mining process should macro-ontological measures be inserted?

    While there are different pathways to integrate the ontological network computations with the statistical text mining process, one straightforward approach is to simply adjust the weights in the existing term-by-document matrix already used for text mining (as described above).

  2. Which measures are reasonable to use as adjustments to term weights?

    The goal of this research is to adapt some search engine measures for use with ontological networks. Both the PageRank and HITS algorithms seem reasonable, along with other measures such as the in-degree or out-degree of nodes. The PageRank algorithm is used throughout the remainder of the paper.

  3. How should terms be matched to ontological concepts?

    There are several different methods for matching or looking up terms in the ontology (or derived network database). One could find all entries that contain a term or look for a more exact match [5,6]. The approach here is to use an exact match, though several concepts can still be retrieved since the same term can have different meanings. A more refined approach that makes use of the semantic types available in the UMLS and other systems is certainly worth exploring in the future.

  4. What should happen if a term maps to multiple ontological concepts?

    Even after an approach for term lookup is adopted, there may still be multiple matches in the ontology for a particular term. If a group of concepts with varying network measures are retrieved, there needs to be a mechanism for selecting or combining them for any subsequent term weight adjustments. A simple approach of taking the maximum valued measure is being used for these examples. However, other approaches should be explored.

  5. How should the measures be combined with term weights?

    After computing measures, looking up terms, and combining multiple matches, any network measure must be combined with an existing term weight assigned by the text mining process. This is a critical step that certainly needs further investigation. Essentially, this step determines whether the statistical text mining weights or the ontology graph weights exert more or less influence on the outcome. It is likely that selecting the appropriate mixture is somewhat task dependent. A couple of example functions are used in the preliminary analyses that follow.
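The lookup choices in items 3 and 4 can be sketched as a small function: an exact match of the term against concept names, followed by taking the maximum measure when several concepts are retrieved. The index structure, the concept identifiers, and the scores below are all hypothetical. The actual prototype retrieves these values from its network database, whose schema is not described here.

```python
def lookup_weight(term, concept_index, scores, default=None):
    """Exact-match a term to concepts and return the maximum network measure.

    concept_index maps a lower-cased concept name to a list of concept ids;
    scores maps a concept id to its network measure (for example PageRank).
    """
    concept_ids = concept_index.get(term.lower(), [])
    matched = [scores[cid] for cid in concept_ids if cid in scores]
    # Several concepts may share a name; keep the largest measure.
    return max(matched) if matched else default


# Hypothetical index and scores for illustration only.
concept_index = {
    "leg": ["X1", "X2", "X3", "X4"],
    "fall": ["X5"],
}
scores = {"X1": 0.20, "X2": 0.50, "X3": 0.10, "X4": 0.30, "X5": 0.40}

print(lookup_weight("Leg", concept_index, scores))
print(lookup_weight("fall", concept_index, scores))
print(lookup_weight("umbrella", concept_index, scores))
```

Swapping `max` for a mean or a semantic-type filter would be a natural place to try the alternative combination strategies mentioned in item 4.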

Example Term Lookups

To partially illustrate the process, several terms (derived in part from a prior text mining study) are used for example term lookups. The prior study was aimed at identifying fall-related injuries using the EMR progress notes. Terms such as fall, fracture, pain, wrist, and leg are used (along with some unrelated terms) to retrieve concepts and the associated search engine measures from the prototype ontological network database. Table 5 shows the results, including the number of matches and the PageRank measure (using the maximum), as well as the statistics for the sub-graphs surrounding each term (nodes and edges).

Table 5. Term lookups and sub-graph statistics.

| Term | Matches | Nodes | Edges | PageRank |
|---|---|---|---|---|
| Fall | 1 | 372 | 385 | 3.78E-06 |
| Fracture | 1 | 2,463 | 4,090 | 2.89E-06 |
| Pain | 1 | 1,030 | 1,287 | 3.60E-06 |
| Wrist | 2 | 448 | 558 | 3.55E-06 |
| Leg | 4 | 1,101 | 1,602 | 7.04E-06 |

The sub-graphs were created by selecting the term as a central point and then radiating outward three hops. For terms with multiple matches, the resulting subgraphs of each concept were combined together into a single graph. Figures 2 and 3 illustrate the subgraphs (using Cytoscape) for “smoking” and “fall,” with the main concept highlighted in yellow (or lighter shading). Again, the macro-ontological approach being investigated in this paper assumes that the network structure is varied enough in terms of connectivity that some terms or concepts (or even types of terms) are far more connected than others. A second fundamental assumption is that there is some notion of term importance embodied in the network structure, which can be uncovered using algorithms akin to those used to rank pages on the Web. The topologies rather than the details of Figures 1, 2, and 3 are intended to present a visualization of this variation, whereas Figure 4 depicts the variation in the quantitative term importance measures (based on network connectivity).
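The three-hop sub-graph extraction can be sketched as a breadth-first search from the matched concept or concepts. The sketch makes two assumptions that the text does not settle: hops are counted while ignoring edge direction, and the combined sub-graph for a term with several matches is taken as all edges among the nodes reached from any match. The node names are placeholders.

```python
from collections import deque


def subgraph_within_hops(edges, seeds, max_hops=3):
    """Return (nodes, edges) within max_hops of any seed, ignoring direction."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # Do not expand past the hop limit.
        for nxt in neighbors.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    kept = set(distance)
    sub_edges = [(s, d) for s, d in edges if s in kept and d in kept]
    return kept, sub_edges


# Placeholder chain a - b - c - d - e - f.
chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
nodes, sub_edges = subgraph_within_hops(chain, ["a"])
print(sorted(nodes), len(sub_edges))
```

Counting the returned nodes and edges gives the kind of sub-graph statistics listed for each term in Table 5.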

Figure 2. Sub-graph for “smoking” (three hops).

Figure 3. Sub-graph for “fall” (three hops).

Figure 4. PageRank box and whisker chart.

Figure 4 provides a box and whisker chart of the PageRank graph measure for each sub-graph’s network. In addition, the figure displays the average value of all concepts from the sub-graph along with the average value obtained for each term lookup.

Preliminary Results

As a preliminary test of the integrated text mining approach, a benchmark smoking-related data set [7] was selected from the i2b2 National Center for Biomedical Computing. The data set includes 502 discharge summaries with smoking-related annotations, including a target outcome label for each document. A subset of 140 records (27.89%) consisting of all “current smokers” (n = 58; 41.43%) and “non-smokers” (n = 82; 58.57%) was used in this research. The integrated approach included precomputed macro-ontological term rankings based on PageRank and the injection of the rankings into a latent semantic analysis-based text mining algorithm. The task was a straightforward binary classification of the clinical notes as either smoking related or not.

RapidMiner, an open source data mining tool (rapid-i.com), was modified to implement the graph-theoretic approach. The text mining approach uses latent semantic analysis (LSA), with weights based on TF-IDF, and singular value decomposition (SVD) for dimension reduction. An improved term-by-document matrix manipulation module was created that allows for the introduction of term importance weights based on ontology graph computations. The SVD vectors were used in a series of logistic regression models to perform the classification task. Two different functions for combining the TF-IDF and ontology graph weights were used: F1 [d + (d(1−d) * w(1−w) * s) ] and F2 [d + (d(1+d) * w(1+w) * s) ], where d is the weight in the term-by-document matrix, w is the ontology graph weight, and s is a scaling value.
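The two combining functions can be sketched in code, with one caveat: the notation d(1−d) is ambiguous in this text, since it could denote the product d·(1−d) or, if a superscript was lost in formatting, the power d^(1−d). The sketch below implements the product reading by default and offers the power reading as an option. Both readings, the function name, and the sample values are assumptions made for illustration.

```python
def combine(d, w, s, variant="F1", exponent_reading=False):
    """Adjust a term-by-document weight d using an ontology graph weight w.

    variant "F1" uses (1 - d) and (1 - w); "F2" uses (1 + d) and (1 + w).
    exponent_reading=False treats d(1-d) as the product d * (1 - d);
    exponent_reading=True treats it as the power d ** (1 - d).
    Both readings are assumptions about the notation.
    """
    if variant not in ("F1", "F2"):
        raise ValueError("variant must be 'F1' or 'F2'")
    sign = -1.0 if variant == "F1" else 1.0
    if exponent_reading:
        adjustment = (d ** (1 + sign * d)) * (w ** (1 + sign * w))
    else:
        adjustment = d * (1 + sign * d) * w * (1 + sign * w)
    # The scaling value s controls how much the graph weight shifts d.
    return d + adjustment * s


# Invented sample values, not weights from the study.
d, w = 0.5, 0.5
for s in (0.0, 1.0, 2.0):
    print(s, combine(d, w, s, "F1"), combine(d, w, s, "F2"))
```

Under either reading, a scaling value of zero leaves the text mining weight unchanged, so s acts as the dial between a purely statistical model and one influenced by the ontology graph.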

Figure 5 summarizes the performance of the two functions over a range of scaling factors, with a benchmark performance based on only text mining weights. The baseline model achieved an accuracy of 58.56% while functions F1 and F2 obtained values ranging from 60.50% to 68.43% and 59.21% to 64.00%, respectively, where accuracy is based on the number of correct classifications.

Figure 5. Performance Results.

The goal was not to construct the most accurate model possible, but rather to compare an integrated graph-theoretic strategy with an unassisted text mining strategy. As shown in Figure 5, there was a consistent relative improvement in accuracy over the baseline, indicating that using ontology graph structure is a promising approach.

Conclusion

This work has focused on using search engine computations over ontological graphs to derive structural information for use in text mining. This structural information is essentially a proxy for the domain knowledge embedded in the ontology by countless contributors over long periods of time. Currently, this work is unidirectional. That is, the computations occur within the ontology and are then injected into the text mining process. However, it may be advantageous to feed back into the ontology any results from the text mining process itself, thereby supporting an iterative refinement of any text mining models. In fact, the feedback could be solicited from clinician experts or automatically generated from the accuracy of the text mining results (or both). Clinician feedback would provide a mechanism for capturing task-specific domain expertise for use in combination with the more generic domain knowledge embodied in the ontology. More automated feedback could be generated by looking at the correlation between terms and outcomes or rewarding the terms that are associated with the successful cases in any evaluation data sets. All of these potential feedback loops could lead to improvements in the macro-ontological approach pursued in this study and will be among the next steps in this research stream.

Acknowledgments

Funding for this research came from the Consortium for Healthcare Informatics Research, [HIR-09-002] HSR&D/RR&D Center of Excellence, James A. Haley Veterans Hospital, Tampa, FL. This research presents the findings and conclusions of the authors and does not necessarily represent the Department of Veterans Affairs (VA) or the Health Services Research and Development Service. De-identified clinical records used in this research were provided by the i2b2 National Center for Biomedical Computing funded by U54LM008748 and were originally prepared for the Shared Tasks for Challenges in NLP for Clinical Data organized by Dr. Ozlem Uzuner, i2b2 and SUNY.

Footnotes

  1. Examples of concepts classified in the “inactive” sub-hierarchy are ambiguous, outdated, and erroneous concepts.

  2. Concepts may have multiple relationship types between one another. Concepts were considered synonymous if a synonymous but not a parent/child relationship existed between them. A limitation is that concepts may be considered synonymous even though they have some relationship other than parent/child (e.g., broad/narrow).

References

  1. Kleinberg J. Authoritative sources in a hyperlinked environment. J ACM. 1999;46(5):604–632.
  2. Brin S, Page L. The anatomy of a large-scale hypertextual Web search engine. Comput Netw ISDN Syst. 1998;30(1–7):107–117.
  3. Manning C, Schütze H. Foundations of statistical natural language processing. MIT Press; 1999.
  4. Nagarajan M, Sheth A, Aguilera M, Keeton K, Merchant A, Uysal M. Altering document term vectors for classification – ontologies as expectations of co-occurrence. Proc World Wide Web Conf (WWW); May 8–12; Alberta, Canada. 2007.
  5. Liu H, Johnson S, Friedman C. Automatic resolution of ambiguous terms based on machine learning and conceptual relations in the UMLS. J Am Med Inform Assoc. 2002;9(6):621–636. doi: 10.1197/jamia.M1101.
  6. Shu J, Clifford GD, Long WJ, Moody GB, Szolovits P, Mark RG. An open-source, interactive Java-based system for rapid encoding of significant events in the ICU using the Unified Medical Language System. Comput Cardiol. 2004;31:197–200.
  7. Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc. 2008;15(1):15–24. doi: 10.1197/jamia.M2408.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
