AMIA Annual Symposium Proceedings. 2011 Oct 22;2011:181–188.

Can SNOMED CT fulfill the vision of a compositional terminology? Analyzing the use case for Problem List

James R Campbell, Junchuan Xu, Kin Wah Fung
PMCID: PMC3243203  PMID: 22195069

Abstract

We analyzed 598 of 63,952 terms employed in problem list entries from seven major healthcare institutions that were not mapped with UMLS to SNOMED CT when preparing the NLM UMLS-CORE problem list subset. We intended to determine whether published or post-coordinated SNOMED concepts could accurately capture the problems as stated by the clinician and to characterize the workload for the local terminology manager. From the terms we analyzed, we estimate that 7.5% of the total terms represent ambiguous statements that require clarification. Of those terms which were unambiguous, we estimate that 38.1% could be encoded using the SNOMED CT January 2011 pre-coordinated (published core) content. 60.4% of unambiguous terms required post-coordination to capture the term meaning within the SNOMED model. Approximately 28.5% of post-coordinated content could not be fully defined and required primitive forms. This left 1.5% of unambiguous terms which were expressed with meaning which could not be represented in SNOMED CT. We estimate from our study that 98.5% of clinical terms unambiguously suggested for the problem list can be equated to published concepts or can be modeled with SNOMED CT but that roughly one in four SNOMED modeled expressions fail to represent the full meaning of the term. Implications for the business model of the local terminology manager and the development of SNOMED CT are discussed.

Introduction

Studies of clinical vocabulary requirements [1,2] have demonstrated that controlled vocabularies in general, and classifications created for statistical reporting in particular, fail to support the expressive needs of clinicians. The need for scalable content, coupled with institutional vagaries of term use, has led to proposals for terminology with compositional features based upon the design features of ontologies [3,4]. Such principled terminologies propose a model of meaning which constrains the attributes (relationships) and values (concept references) that may be employed in developing an expression to represent conceptual meaning not found in the published terminology. The process of creating an expression representing non-core content is called post-coordination.

SNOMED CT has grown to be the largest compositional reference terminology with the objective of capturing the expressive content required for the clinical electronic health record. In response to studies [2,5] which have identified problems of poor reproducibility of SNOMED post-coordinated expressions, the IHTSDO has developed specifications for attribute and value sets to be utilized in post-coordinated content development [6]. Knowledge of this proposed model of meaning is not widespread. Few organizations worldwide develop or maintain repositories of post-coordinated SNOMED data. Yet the vision of a fully expressive and locally responsive vocabulary resource cannot be met unless the resources to manage composition of conceptual content not in the published core become better understood and routinely employed. Can the current SNOMED CT model of meaning faithfully record the expressive needs of clinicians? What would be required of the terminology manager who must provide the resources necessary to manage these vocabularies?

The problem list is a summarization of diagnoses, symptoms and care management issues that has long been promoted as a central component of the EHR [7,8]. Studies have evaluated SNOMED CT as a vocabulary resource for the problem list [9,10] and have concluded that SNOMED is the most clinically comprehensive, but not necessarily complete. These historical studies have not employed the compositional features of SNOMED CT to capture meaning, largely because of lack of guidance regarding standard methods to employ post-coordination and criticisms of lack of clarity [2]. We undertook an evaluation of terms employed in healthcare institutions for EHR problem list entries to determine whether SNOMED CT could capture the meanings of the concepts proposed by those terms clearly, reproducibly and comprehensively.

Methods:

Methods for development of the NLM CORE SNOMED problem list subset were reported in 2010 [11]. The subset is published and made available by the NLM for public use [12]. The term sets that were used to develop this subset were updated to include a seventh healthcare institution in 2010. The revised statistics and procedures for our report are summarized in the data management algorithm of Figure 1. The expanded problem set now included 77116 terms (set A in Figure 1) arising in the problem lists of the contributing institutions. Based upon frequency of use statistics provided by the sources, those terms representing 95% of problem instances recorded in the electronic health records (EHRs) were identified. Following the procedures reported earlier [11], all terms were lexically (termwise) cross-referenced with the UMLS, identifying a concept unique identifier (CUI) for the concept which the term was judged to represent. Those CUIs with co-occurrences to SNOMED CT core concepts were identified and the source term was then mapped to the SNOMED CT concept identifier. This set of source terms and SNOMED CT concept identifiers became the 201008 release of the CORE problem list (set B). The analyses and identification of the core problem subset were accomplished initially with UMLS 2008AC and SNOMED CT July 2008.

Figure 1.

Data management procedures

We analyzed convenience samples drawn from two subsets of the terms that were not reflected in the CORE SNOMED CT problem list:

  1. 348 high frequency terms: a) without a UMLS mapping or b) with a UMLS map but no cooccurrence in SNOMED CT (set C; N=1641)

  2. 250 low frequency terms, drawn from those which occurred in less than 5% of problem instances at the source institutions and for which NLM had identified no UMLS map (set E; N=18952)

For the terms in these two samples one investigator (JRC) analyzed the semantics of the term statement within the context of the problem list, made a judgment whether the term was clear and unambiguous, and attempted to capture the conceptual meaning with SNOMED CT. He first searched for pre-coordinated concepts from the January 2011 SNOMED CT release and proceeded to development of a post-coordinated SNOMED CT expression only when this failed. Assignments were reviewed by a second investigator (JX) and discrepant opinions as to the encoding were decided by consensus of all authors.

Some terms were judged to be ambiguous as problem statements because they were vague, incomplete or employed conflicting descriptors. These were flagged and excluded from further semantic analysis. Examples from the two sets included:

  • “Weight concern adult assymptomatic”

  • “Trochlea tendon inflammation”

  • “Vaginal hydrocoele, male”

  • “Anomaly of the genital tract for maternal care”

  • “Medial internal derangement knee”

  • “DOPL report is not suspicious”

Some terms were judged to be accurately represented by pre-coordinated (published core content) SNOMED CT concepts employed within the default context proposed for SNOMED [6] (SNOMED CT User Guide, page 56). Following assumptions used to develop the CORE problem list [11], we restricted valid choices from SNOMED CT to include only 404684003|Clinical findings| and its children, 272379006|Events|, 243796009|Situations with explicit context| and 71388002|Procedures|. Examples of pre-coordinated concepts identified for problem list terms include:

  • “Panniculitis alpha 1 antitrypsin” = 403415009 |Panniculitis due to alpha 1 antitrypsin deficiency (disorder)|

  • “Hx of pertussis” = 161422003 |History of – pertussis (situation)|

  • “Anterior posterior repair cystocele s/p” = 13910004 |Combined anteroposterior colporrhaphy (procedure)|

  • “History of renal carcinoma” = 415081006 |History of malignant neoplasm of kidney (situation)|

  • “Post radiation neuropathy” = 445339002 |Neuropathy due to ionizing radiation (disorder)|

When a pre-coordinated concept could not be identified, we attempted to create a post-coordinated expression to completely encode the meaning which we understood from the source term. We followed the guidance for allowable attributes and values published in the SNOMED User Guide [6] (pages 28–61). Examples of the stated forms of post-coordinated expressions include:

  • “Leg joint pain” = (116680003 |is a(attribute)| = 57676002 |Joint pain(finding)|): (363698007 |Finding site(attribute)| = 4527007 |joint of lower extremity(body structure)|)

  • “Walking difficulty Ortho knee cause” = (116680003 |is a(attribute)| = 228158008 |Difficulty walking(finding)|): (42752001 |Due to(attribute)| = 428724006 |Disorder of knee joint(disorder)|)

  • “Renal stone, left” = (116680003 |is a(attribute)| = 95570007 |Kidney stone|): (363698007 |finding site(attribute)| = 64033007 |kidney structure(body structure)|): (272741003 |laterality (attribute)| = 7771000 |Left(qualifier)|)

  • “Cat bite infected” = (116680003 |is a(attribute)| = 76844004 |infected wound (disorder)|): (42752001 |Due to(attribute)| = 283782004 |cat bite (disorder)|)

  • “Dendrite, s/p penetrating keratoplasty” = (116680003 |is a(attribute)| = 423903007 |Corneal dendrite (finding)|): (255234002 |After (attribute)| = 42101009 |Penetrating keratoplasty (procedure)|)
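
The stated forms listed here share one shape: a focus concept asserted through 116680003 |is a(attribute)|, followed by zero or more attribute = value refinements. A minimal Python sketch of that shape follows. The class names `Concept` and `Expression` and the `render` method are hypothetical, introduced only for illustration. The output imitates the notation used in this paper, not any official SNOMED CT syntax, and nothing is validated against the allowable attributes and values of the User Guide.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A SNOMED CT concept reference: identifier plus display term."""
    sctid: str
    term: str

    def render(self) -> str:
        return f"{self.sctid} |{self.term}|"


# The subsumption attribute that heads every expression quoted in the paper.
IS_A = Concept("116680003", "is a(attribute)")


@dataclass
class Expression:
    """Hypothetical container: a focus concept plus attribute/value refinements."""
    focus: Concept
    refinements: list[tuple[Concept, Concept]] = field(default_factory=list)

    def render(self) -> str:
        # First group states the is-a parent; each later group is one refinement.
        parts = [f"({IS_A.render()} = {self.focus.render()})"]
        parts += [f"({attr.render()} = {value.render()})"
                  for attr, value in self.refinements]
        return ": ".join(parts)


# "Leg joint pain", using the identifiers quoted in the example list.
leg_joint_pain = Expression(
    focus=Concept("57676002", "Joint pain(finding)"),
    refinements=[(
        Concept("363698007", "Finding site(attribute)"),
        Concept("4527007", "joint of lower extremity(body structure)"),
    )],
)

print(leg_joint_pain.render())
```

An expression with an empty refinement list renders as the is-a group alone, which is the form taken by primitives that carry no further defining attributes.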

The authors judged for each post-coordinated concept whether SNOMED attributes and values could capture the complete and comprehensive conceptual meaning represented by the term within the defining attributes and values. Such concepts are described in the SNOMED CT model of meaning to be fully defined. We identified some that required elements of meaning which were not supported by the authorized SNOMED relationship set (attributes) or concept targets (values) and modeled those as primitives in compliance with SNOMED formalisms, if they were otherwise of an appropriate semantic class. We tabulated those cases and categorized them by use case in order to characterize the limitations of the SNOMED model of meaning. Examples included:

  • “Occlusive disease of distal artery of upper limb” = (116680003 |is a(attribute)| = 431706008 |Occlusion of artery of upper extremity (disorder)|) (Code value missing for artery of forearm)

  • “Sensory problem with feet, low risk” = (116680003 |is a(attribute)| = 398026008 |reduced sensation of skin|): (363698007 |Finding site| = 60496002 |Skin structure of foot|) (Attribute missing for modeling risk)

  • “Pain > 3 months, wrist” = (116680003 |is a(attribute)| = 56608008 |Wrist pain|): (288524001 |Courses (qualifier)| = 90734009 |Chronic(qualifier)|) (Requires numbers in value statement for duration of symptoms)

  • “Edema, non-diabetic, macular” = (116680003 |is a(attribute)| = 37231002 |Macular edema (disorder)|) (Exclusionary concept required to model non-diabetic etiology)

Finally we identified those few terms which expressed meaning within a different semantic context or employed features that could not be modeled with SNOMED CT as currently documented. Examples included:

  • “Anticoagulation monitoring: INR 2.0–3.0” (Goal statement is not a valid problem)

  • “Sepsis r/o” (Employs uncertainty)

  • “Shared MR#” (Administrative concept is not a valid problem)

  • “COPD action plan” (Care plan statement is not a valid problem)

From the summary statistics of this post-coordination analysis, we calculated estimates of the utility of SNOMED CT January 2011 to fully capture the meaning of the terms that were recorded by clinicians on their problem list. We surveyed those that required primitive forms for patterns of limited expressiveness in the SNOMED model of meaning. We further prepared estimates of the workload volume incumbent upon a local terminology manager who must maintain the term lists (entrance vocabulary), SNOMED coding tables and extension concepts required to completely encode the problem list for their institution.

Results:

Table 1 summarizes the semantic analysis of the two term subsets and lists, for each, the raw frequency of occurrence of the five major categories of semantics assessed by the authors (Ambiguous; Pre-coordinated; Post-coordinated and defined; Post-coordinated requiring primitive; and Outside the model of meaning). The frequency of occurrence within each category is expressed as a percentage of the sample subset, and 95% confidence intervals have been included to indicate the predicted statistical variance of the observation. These observed percentages were employed in Table 2 to estimate the occurrence of semantic categories within the larger population.

Table 2.

Estimation of SNOMED coverage of term sets

| Semantic category | CORE problem subset terms (Set B) | High frequency UMLS mapped (Set C) | Low frequency UMLS mapped (Set D) | Low frequency no UMLS map (Set E) | Estimated term occurrence (A=B+C+D+E) |
| --- | --- | --- | --- | --- | --- |
| Ambiguous | | 146 | 3871 | 1744 | 5761 (7.5%) |
| Pre-coordinated (core publication) | 13164 | 246 | 6485 | 7277 | 27172 (35.2%) |
| Post-coordinated and fully defined | | 905 | 23915 | 5988 | 30808 (40.0%) |
| Post-coordinated requires primitive | | 321 | 8485 | 3488 | 12294 (15.9%) |
| Outside problem model of meaning | | 23 | 603 | 455 | 1081 (1.4%) |
| Total terms in Set | 13164 | 1641 | 43359 | 18952 | 77116 |

Table 2 summarizes the calculations which employ the observations of semantic categorization frequency in order to arrive at an estimate of the characteristics of the entire population of terms. The five categories of semantic analysis are listed in column one. Column two details the numbers of terms found by NLM to be represented by pre-coordinated (published core content) SNOMED CT concepts which were included in the CORE Problem List. Columns three and four employ the estimates of occurrence in the high frequency term set (subset C from Table 1) to estimate the semantic categorization of sets C and D. We assumed that existence of UMLS mapping for both sets allowed us to employ frequency of occurrence observations relevant to set C for set D. Column five employs the observations from Table 1 for subset E to estimate the semantic categorization of set E. The final column summarizes sets B through E to estimate the semantic categorization of the original set A of 77,116 terms assembled by the NLM. Estimated frequency of occurrence of each category in set A is included in parentheses.
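
The extrapolation described here amounts to scaling each sample proportion by the size of the target set. The Python sketch below illustrates that step with the sample counts of Table 1 and the set sizes of Table 2; the function name `extrapolate` and the dictionary layout are illustrative assumptions. Raw proportions land close to, but not exactly on, the published Table 2 counts (within about 20 terms for set D). The paper does not describe its rounding procedure, so this is an approximation and not a reproduction of the authors' calculation.

```python
# Sample counts from Table 1.
SAMPLE_HIGH = {  # 348 terms sampled from set C
    "Ambiguous": 31,
    "Pre-coordinated": 52,
    "Post-coordinated, fully defined": 192,
    "Post-coordinated, primitive": 68,
    "Outside model of meaning": 5,
}
SAMPLE_LOW = {  # 250 terms sampled from set E
    "Ambiguous": 23,
    "Pre-coordinated": 96,
    "Post-coordinated, fully defined": 79,
    "Post-coordinated, primitive": 46,
    "Outside model of meaning": 6,
}

# Set sizes from the "Total terms in Set" row of Table 2.
SET_SIZES = {"C": 1641, "D": 43359, "E": 18952}


def extrapolate(sample, population_size):
    """Scale sample category counts to a population of the given size."""
    n = sum(sample.values())
    return {category: population_size * count / n
            for category, count in sample.items()}


estimates = {
    "C": extrapolate(SAMPLE_HIGH, SET_SIZES["C"]),
    # Assumption stated in the text: set D is taken to behave like set C.
    "D": extrapolate(SAMPLE_HIGH, SET_SIZES["D"]),
    "E": extrapolate(SAMPLE_LOW, SET_SIZES["E"]),
}

for name, by_category in estimates.items():
    for category, value in by_category.items():
        print(f"Set {name}  {category:32} {value:9.1f}")
```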

Table 1.

SNOMED CT coding analysis: Semantic categorization of term subsets

| Semantic category | Sample high frequency terms (Subset of C) | Sample low frequency terms (Subset of E) |
| --- | --- | --- |
| Analytical sample N | 348 | 250 |
| Ambiguous | 31 (8.9%; CI 5.9 – 11.9%) | 23 (9.2%; CI 5.6 – 12.8%) |
| Pre-coordinated (core publication) | 52 (14.9%; CI 11.2 – 18.7%) | 96 (38.4%; CI 32.4 – 44.4%) |
| Post-coordinated and fully defined | 192 (55%; CI 49.9 – 60.4%) | 79 (31.6%; CI 25.8 – 37.4%) |
| Post-coordinated requires primitive | 68 (19.5%; CI 15.4 – 23.7%) | 46 (18.4%; CI 13.6 – 23.2%) |
| Outside problem model of meaning | 5 (1.4%; CI 0.2 – 2.7%) | 6 (2.4%; CI 0.5 – 4.3%) |
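
The paper does not name the method behind the 95% confidence intervals in Table 1. The reported bounds are consistent with the normal approximation (Wald) interval for a binomial proportion, p ± 1.96·sqrt(p(1−p)/n), and the sketch below works under that assumption. The function name `wald_ci` is hypothetical.

```python
import math


def wald_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a binomial proportion.

    Assumed method: the paper reports intervals without naming how they
    were computed.
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# (category, count, sample size) taken from Table 1.
ROWS = [
    ("Ambiguous, high frequency sample", 31, 348),
    ("Pre-coordinated, low frequency sample", 96, 250),
    ("Outside model, high frequency sample", 5, 348),
]

for label, count, n in ROWS:
    low, high = wald_ci(count, n)
    print(f"{label}: {count / n:.1%} (CI {low:.1%} to {high:.1%})")
```

For small counts such as 5 of 348 the Wald interval is known to behave poorly, so a Wilson or exact interval may be preferable when similar estimates are prepared locally.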

We further reviewed 114 cases (Table 1: terms identified as Post-coordinated but requiring primitive) where the conceptual meaning proposed by the clinical term could not be fully defined in SNOMED CT employing the published model of meaning [6]. This analysis sometimes required judgment as to the best proposal when extension of the meaning model was required, so we collapsed those categories. We found that:

  • 72 cases (75.5% of primitive assignments) required attributes (relationships) or values (codes) not supported by SNOMED

  • 19 cases (14.9%) required numeric values to accurately represent the meaning

  • 23 cases (9.6%) required negation or exclusionary values

Discussion

In this analysis of a large corpus of problem list terms recorded by clinicians at diverse healthcare institutions, we found that the compositional model developed for SNOMED CT was likely capable of structured recording of those clinical terms judged to be unambiguous in (27172+30808+12294) / (77116-5761) = 98.5% of terms. In the high frequency term set which encompassed 95% of recorded problem instances, SNOMED CT pre-coordinated concepts encoded (13164+((14.9%)*1641))/(13164+1641) = 90.6% of terms and post-coordination captured an additional 8.2% for a total lexical (term-based) encoding rate of 98.8%, suggesting that an expanded CORE Problem List can serve a broad community of clinicians.
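
The coverage figures quoted in this paragraph and in the abstract follow directly from the final column of Table 2. The short Python sketch below recomputes them as a check on the arithmetic; the variable names are illustrative only.

```python
# Estimated term counts for set A, from the final column of Table 2.
ESTIMATED = {
    "ambiguous": 5761,
    "pre_coordinated": 27172,
    "post_defined": 30808,
    "post_primitive": 12294,
    "outside_model": 1081,
}

total = sum(ESTIMATED.values())                # all terms in set A
unambiguous = total - ESTIMATED["ambiguous"]   # denominator used in the text
post = ESTIMATED["post_defined"] + ESTIMATED["post_primitive"]

representable = (ESTIMATED["pre_coordinated"] + post) / unambiguous
pre_share = ESTIMATED["pre_coordinated"] / unambiguous
post_share = post / unambiguous
primitive_share = ESTIMATED["post_primitive"] / post

# High frequency terms: set B plus the 14.9% of set C found pre-coordinated.
set_b, set_c = 13164, 1641
high_freq_pre = (set_b + 0.149 * set_c) / (set_b + set_c)

for label, value in [
    ("Unambiguous terms representable in SNOMED CT", representable),
    ("Unambiguous terms with a pre-coordinated concept", pre_share),
    ("Unambiguous terms requiring post-coordination", post_share),
    ("Post-coordinated terms requiring a primitive", primitive_share),
    ("High frequency terms with a pre-coordinated concept", high_freq_pre),
]:
    print(f"{label}: {value:.1%}")
```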

Compositional encoding of problem concepts creates an opportunity for the EHR. Relationships which are included in defining problem concept meaning can be employed in vendor decision support and query software. These ontologic features of SNOMED CT can support advanced functionality including aggregation of patients with related problems and identification of disease cases by affected organ system or shared etiology. Arguably the most important feature of the relationships defining SNOMED concepts is the hierarchical subsumption relationship: 116680003 |is a(attribute)|. This supports aggregation of SNOMED CT problem instances into groups of more specialized (or more general) concepts and is a required element of post-coordinated expressions including those which cannot be fully defined. In this analysis we estimate that 15.9% of the 77116 terms would be implemented as post-coordinated concepts without fully defining the conceptual meaning of the source statement due to limitations of SNOMED content or the model of meaning. In this large set of problem interface terms, we estimate that 9282 terms (75.5%) would require consideration of new SNOMED attributes or values for full definition, 1832 terms (14.9%) would require modeling of numeric values in the expression and 1180 terms (9.6%) would require that the SNOMED model be extended to include negation and exclusionary concepts. The importance of these editorial extensions to the successful use of SNOMED CT in the EHR is a matter of reasonable debate when considering the complexity and size of the terminology as it exists today.
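
As a rough illustration of how defining relationships could drive such queries, the Python sketch below stores four of the post-coordinated examples from the Methods section and selects them by attribute or by attribute value. The data layout and the function names `with_attribute` and `with_value` are hypothetical. Exact identifier matching stands in for true subsumption testing, which would require the SNOMED CT hierarchy and a description logic classifier, neither of which is modeled here.

```python
# Attribute identifiers quoted in the Methods examples.
FINDING_SITE = "363698007"
DUE_TO = "42752001"
LATERALITY = "272741003"

# (source term, is-a focus concept id, {attribute id: value id}),
# copied from the post-coordinated examples in the Methods section.
PROBLEMS = [
    ("Leg joint pain", "57676002", {FINDING_SITE: "4527007"}),
    ("Walking difficulty Ortho knee cause", "228158008", {DUE_TO: "428724006"}),
    ("Renal stone, left", "95570007",
     {FINDING_SITE: "64033007", LATERALITY: "7771000"}),
    ("Cat bite infected", "76844004", {DUE_TO: "283782004"}),
]


def with_attribute(problems, attribute_id):
    """Terms whose expression uses the given attribute at all."""
    return [term for term, _focus, refinements in problems
            if attribute_id in refinements]


def with_value(problems, attribute_id, value_id):
    """Terms whose expression pairs the attribute with exactly this value."""
    return [term for term, _focus, refinements in problems
            if refinements.get(attribute_id) == value_id]


# Problems that record an etiology through |Due to|.
print(with_attribute(PROBLEMS, DUE_TO))

# Problems located at 64033007 |kidney structure|.
print(with_value(PROBLEMS, FINDING_SITE, "64033007"))
```

A production query would replace the equality test in `with_value` with a subsumption test, so that a search on a general body structure also returns problems sited at its more specific parts.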

Much of the labor in maintaining a terminology resource for accurate capture of clinical problems at the point of care resides in the rendition of the infrequent concept or the rarely used term. Compositional terminology models are meant to allow for just-in-time concept modeling and expression recording but they do little to tailor the entrance term set to local needs. It is important to note that the entrance term to concept ratio of the CORE problem list subset was 2.3, indicating that clinicians often employed differences in term (lexical) expression even for frequently recorded problem concepts. In the low frequency term set (E) we have not undertaken a concurrence analysis to assess how often concepts found in the high frequency set recur as low frequency cases with different entrance terms. Speaking from experience with managing a SNOMED CT problem list subset over the past 14 years, at Nebraska we have noted that term requests continue to grow at a rate approximately 2.4 times the new concepts requested, remarkably similar to the ratio in the NLM CORE problem list subset. The implication for the local terminology manager is that effective strategies for gathering and analyzing local language requirements should be considered important, preferably with support from a classifier to reconcile new content with core SNOMED CT and to avoid concept duplication.

We recognize several important limitations in procedures we used to prepare this data. They include:

  1. Semantic analysis and post-coordination accomplished primarily by a single author

  2. Failure to use a description logic classifier for comparison of post-coordinated to the SNOMED core for duplication and to search for recurrences across term sets

  3. Low rate of sampling (1.4% of terms) in the low frequency term set E with accompanying uncertainty in the validity of the estimations

  4. Failure to analyze all sets of terms for semantics, in particular - set D

Recent studies [5] have recognized that requirements for post-coordination are likely to be domain specific and that variability in the applications of post-coordination procedures between editors can be substantial. The availability of public domain tooling which would support computational validation of post-coordination activities (in this instance, for the problem list) is one possible step towards more reproducible and uniform use of SNOMED CT.

In summary, we analyzed a very large set of terms generated by clinicians for use in the problem list of the EHR. We found that SNOMED CT could scale to compositionally manage the meaning expressed by those clinicians but that fully defining the conceptual space was not possible in roughly one in six clinical statements. The trade-off between expanding the SNOMED CT model in order to capture more nuance and making an already complex system more confusing should be resolved in the crucible of EHR deployment and assessment of benefit for clinical decision making. The site terminologist who will manage local extensions to SNOMED CT should be prepared to listen to the words their clinicians speak and deploy them in an effective user interface with terminology management services that include tooling for concept modeling and classification. Given the complexity and expense of such undertakings, terminology services shared within consortia of users or provided by middle-ware vendors may well be a necessary approach.

References

  1. Chute CG, Cohn SP, Campbell KE, Oliver DE, Campbell JR. The content coverage of clinical classifications. For The Computer-Based Patient Record Institute’s Work Group on Codes & Structures. J Am Med Inform Assoc. 1996 May-Jun;3(3):224–33. doi: 10.1136/jamia.1996.96310636.
  2. Campbell JR, Carpenter P, Sneiderman C, Chute CG, Warren JJ. Phase II Evaluation of Clinical Coding Schemes: Completeness, Taxonomy, Mapping, Definitions, and Clarity, for the CPRI Workgroup on Codes and Structures. JAMIA. 1997 May-Jun;4(3):238–251. doi: 10.1136/jamia.1997.0040238.
  3. Chute CG, Cohn SP, Campbell JR. A framework for comprehensive health terminology systems in the United States. JAMIA. 1998 Nov-Dec;5(6):503–510. doi: 10.1136/jamia.1998.0050503.
  4. Cimino JJ. Coding systems in health care. Methods of Information in Medicine. 1996;35(4–5):273–284.
  5. Richesson RL, Andrews JE, Krischer JP. Use of SNOMED CT to represent clinical research data: Semantic characterization of data items on case report forms. JAMIA. 2006;13(5):536–546. doi: 10.1197/jamia.M2093.
  6. SNOMED CT User Guide January 2011 Release ©2011 International Health Terminology Standards Development Organization.
  7. Weed L. The problem oriented record - its organizing principles and its structure. League Exchange. 1975;103:3–6.
  8. Dick RS, Steen EB. The computer-based patient record: an essential technology for patient care. National Academy Press; Washington DC: 1991.
  9. Wasserman H, Wang J. An applied evaluation of SNOMED CT as a clinical vocabulary for the computerized diagnosis and problem list. AMIA Annual Symposium Proceedings; 2003. pp. 699–703.
  10. Elkin PL, Brown SH, Husser CS, Bauer BA, Wahner-Roedler D, Rosenbloom ST, Speroff T. Evaluation of the content coverage of SNOMED CT: ability of SNOMED clinical terms to represent clinical problem lists. Mayo Clin Proc. 2006 Jun;81(6):741–8. doi: 10.4065/81.6.741.
  11. Fung KW, McDonald C, Srinivasan S. The UMLS-CORE project: a study of the problem list terminologies used in large healthcare institutions. J Am Med Inform Assoc. 2010;17:675–680. doi: 10.1136/jamia.2010.007047.
  12. UMLS CORE Problem List Subset http://www.nlm.nih.gov/research/umls/Snomed/core_subset.html

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
