Journal of the American Medical Informatics Association (JAMIA). 1997 May-Jun;4(3):238–250. doi: 10.1136/jamia.1997.0040238

Phase II Evaluation of Clinical Coding Schemes

Completeness, Taxonomy, Mapping, Definitions, and Clarity

James R Campbell, Paul Carpenter, Charles Sneiderman, Simon Cohn, Christopher G Chute, Judith Warren; CPRI Work Group on Codes and Structures
PMCID: PMC61239  PMID: 9147343

Abstract

Objective: To compare three potential sources of controlled clinical terminology (READ codes version 3.1, SNOMED International, and Unified Medical Language System (UMLS) version 1.6) relative to attributes of completeness, clinical taxonomy, administrative mapping, term definitions and clarity (duplicate coding rate).

Methods: The authors assembled 1929 source concept records from a variety of clinical information taken from four medical centers across the United States. The source data included medical as well as ample nursing terminology. The source records were coded in each scheme by an investigator and checked by the coding scheme owner. The codings were then scored by an independent panel of clinicians for acceptability. Codes were checked for definitions provided with the scheme. Codes for a random sample of source records were analyzed by an investigator for “parent” and “child” codes within the scheme. Parent and child pairs were scored by an independent panel of medical informatics specialists for clinical acceptability. Administrative and billing code mapping from the published scheme were reviewed for all coded records and analyzed by independent reviewers for accuracy. The investigator for each scheme exhaustively searched a sample of coded records for duplications.

Results: SNOMED was judged to be significantly more complete in coding the source material than the other schemes (SNOMED* 70%; READ 57%; UMLS 50%; *p <.00001). SNOMED also had a richer clinical taxonomy judged by the number of acceptable first-degree relatives per coded concept (SNOMED* 4.56; UMLS 3.17; READ 2.14, *p <.005). Only the UMLS provided any definitions; these were found for 49% of records which had a coding assignment. READ and UMLS had better administrative mappings (composite score: READ* 40.6%; UMLS* 36.1%; SNOMED 20.7%, *p <.00001), and SNOMED had substantially more duplications of coding assignments (duplication rate: READ 0%; UMLS 4.2%; SNOMED* 13.9%, *p <.004) associated with a loss of clarity.

Conclusion: No major terminology source can lay claim to being the ideal resource for a computer-based patient record. However, based upon this analysis of releases for April 1995, SNOMED International is considerably more complete and has a compositional nature and a richer taxonomy. It suffers from poorer clarity, resulting from a lack of syntax and from evolutionary changes in its coding scheme. READ has greater clarity and better mapping to administrative schemes (ICD-10 and OPCS-4), but is rapidly changing and is less complete. UMLS is a rich lexical resource, with mappings to many source vocabularies, and it provides definitions for many of its terms. However, due to the varying granularities and purposes of its source schemes, it has limitations for representation of clinical concepts within a computer-based patient record.


A computer-based patient record (CPR) is an electronic patient record that supports users by providing accessibility to complete and accurate data, alerts, reminders, clinical decision support and links to medical knowledge.1 It is supported by a system of programs which provide useful access to the data, allow entry and collation of information, and ensure secure storage. In 1992 the Institute of Medicine treatise on the CPR1 clearly pointed to the importance of a central data dictionary and industry coding standards as key elements. Methods for achieving those goals have remained elusive, in part because of historical emphasis by software developers on billing and epidemiology rather than patient care, and also due to intense debate as to the merits of standardized schemes for classification. There has been no general agreement on the attributes of such systems, much less whether a particular scheme does the job.

Although some will argue the pragmatics of individual attributes, building upon the work of others within the informatics community,2,3 we maintain that a classification scheme for implementation within a CPR should have the following features:

  • Complete and comprehensive2—The classification scheme should cover the entire clinical spectrum, including all component disciplines involved in patient care at sufficient granularity (depth and level of detail) to depict the care process.

  • Clarity (clear and non-redundant)2—A code is an assignment of an identifier within the scheme to represent a concept from clinical practice. A term4 is a natural language phrase associated with and representing that concept in clinical parlance. Although synonym terms are desirable since they allow a clinician reasonable variation in expression, a concept should be neither vague nor ambiguous, and should not have overlapping meanings within the scheme. This means that a single concept should not have many code representations within the scheme.

  • Mapping (administrative cross references)—To be useful, a clinical scheme must point to related entities in widely used administrative and epidemiologic reporting systems.

  • Atomic and compositional character—Clinical classifications that break findings and events (concepts) into basic component pieces have substantial practical advantages by avoiding an explosion of terms with the addition of new knowledge. Such a scheme is multiaxial and compositional, as opposed to a precoordinated scheme wherein each concept—no matter how complex—has a single representational code. To illustrate, contrast the two representations of the concept “back pain” (see the sketch following this list):

    UMLS (precoordinated):

    C0004604 BACK PAIN

    SNOMED International (compositional):

    T-D2100 BACK, F-A26000 PAIN

  • Synonyms2—The scheme must support alternate terminology as required by clinicians.

  • Attributes—The scheme should support a mechanism to modify or qualify the meaning of the core term.

  • Uncertainty—The scheme must support a graduated record of certainty for findings and assessments.

  • Hierarchies and inheritance2—A hierarchical organization of concepts, linking logically more general and more specific terms, facilitates the use of classifications within an “intelligent” record by supporting inductive reasoning. A single term should be allowed many parents or children as clinically appropriate.

  • Context-free identifiers—The codes themselves must be devoid of meaning to avoid assignment conflicts as the body of clinical knowledge evolves.

  • Unique identifiers—A code must not be re-used when it is declared obsolete.

  • Definitions—Concepts should be associated with concise explanations of their meaning. Publication of definitions does not guarantee clarity, but it promotes the development of a clear schema.

  • Language independence—The scheme should be freely translated across the human languages in use by patients and caregivers.

  • Syntax and grammar—Compositional schemes must be accompanied by a set of rules that define logical and clinically relevant constructions of the codes. Pre-coordinated schemes do not have this requirement.
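To make the atomic/compositional contrast above concrete, the following is a minimal sketch (hypothetical Python structures of ours, not from the paper) of the two representation styles, using the “back pain” codes shown in the list:

    # A precoordinated scheme assigns one opaque code per concept, however
    # complex; a compositional scheme assembles the concept from atomic
    # codes drawn from separate axes.
    precoordinated = {
        "back pain": "C0004604",    # UMLS: a single concept identifier
    }

    compositional = {
        "back pain": ("T-D2100",    # SNOMED Topography axis: BACK
                      "F-A26000"),  # SNOMED Function axis: PAIN
    }

Under the compositional arrangement, qualifying a concept means appending another atomic code; a precoordinated scheme must mint a new identifier for every combination.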

In an earlier publication5 we studied seven major systems for the attribute of completeness. This work drew on other projects which had previously evaluated one or a few systems in limited application areas.6,7,8,9,10,11,12,13,14 Our conclusion was that no scheme available in the English-speaking world was sufficiently comprehensive to be readily implemented within a CPR. However, three candidate systems were substantially better than competitors: the READ classification15,16,4 of the National Health Service of Great Britain; SNOMED International,17,18,19 published by the College of American Pathologists (CAP); and the Unified Medical Language System (Metathesaurus)20,21,22,23,24 of the National Library of Medicine (NLM). Our study was limited in scope, especially within the domain of nursing practice. Furthermore, coding staff scored their own evaluations and we did not study any attributes other than completeness. It is our purpose in this paper to extend our initial work, and evaluate the 1995 release of these three schemes relative to the attributes of completeness, clarity, definitions, inheritance and administrative mapping.

Methods

Evaluation Set

As part of an initial examination of coding schemes previously reported,5 we took source material from clinical records found in four medical centers across the United States. Our first report details the methods whereby source material was prepared for analysis. The data originated with both textual and flowcharted information found in active clinical charts. Dr. Chute reviewed this material for concepts and formed an initial evaluation set of 3061 records. These records were hierarchically linked constructions of sometimes complex conceptual entities. Retrospectively analyzing our evaluation set, we identified two limitations of the source material: (1) comprehensiveness and (2) granularity. In particular, we determined that nursing care was one portion of the clinical record that was underrepresented. Furthermore, the level of detail of the evaluation records differed from item to item. We therefore revised our initial evaluation set. We first expanded our phase one material with additional concepts taken from nursing care plans, flowcharts and notes. We used the same methods to gather and prepare this data as we had previously described.

In order to identify the universe of conceptual information to be found in the CPR, and to better define its granular nature (i.e., level of detail), we organized a consensus discussion within the Computer-based Patient Record Institute (CPRI) workgroup on codes. From this discussion, we formulated a list of the conceptual domains of the CPR. This is included in abbreviated form in the Appendix. We used the domain definitions from this discussion and reorganized the original evaluation records so that they followed this categorization. We eliminated all duplicates, since some records appeared more than once, and assigned a domain to each record based upon the clinical context of the material from its source.

By way of example, the source record detailed in our earlier paper5 was modified as follows:

Source text: “... it was identified as a superficial spreading melanoma Clark's level 2, with depth of invasion 0.84 mm.”

Phase I Evaluation Records:

  • <Diagnosis>:melanoma

  • <Extent>:Clark's level 2

  • <Quantitative>:0.84 mm [depth of invasion]

  • <Mode>:superficial spreading

Phase II Evaluation Records:

  • <6.1.1|Medical diagnosis>:superficial spreading melanoma

  • <3.|Attributes>:Clark's level 2

  • <3.|Attributes>:depth of invasion [0.84 mm]

In this particular case, the concept “superficial spreading melanoma” was judged to fit within the definition of a medical diagnosis more appropriately than an attribute/diagnosis pair. Examples of material added to the evaluation set from nursing documentation included:

Phase II Nursing Data:

  • <6.2.1|Nursing diagnosis>:ineffective individual coping

  • <4.4|Educational intervention>:explain lab values to patient

  • <5.1.1|Symptom>:fear of suffocation

Completeness

We gave the evaluation records to three coding team leaders: Drs. Sneiderman (READ), Warren (SNOMED), and Cohn (UMLS). We asked the owners of the three systems to give us copies that would be current in April, 1995. We received copies of the three schemes as follows:

  • SNOMED International version 3.1; publication April 1995; delivered March 1995

  • READ Version 3.1; publication May 1995; delivered August 1995

  • UMLS Version 1.6; publication January 1995; delivered July 1995

Each coding team leader commented upon problems with coding of the source records using public or commercial browsing tools. This necessitated review by the scheme owner to assure fairness. The READ browser, published by Computer Aided Medical Systems, frequently missed terms—many times due to cultural differences in phrasing and spelling. It did not employ translation techniques to assist with differences such as that between “NEVUS” and “NAEVUS.” The COACH browser distributed by the National Library of Medicine often buried the uniquely matching term at the bottom of a list of hundreds of marginal choices; in its effort to be linguistically complete, it was often misleading or troublesome to use. SNOMED did not have an adequate public-domain browser, and we were forced to purchase a tool from an independent developer (Medsight Informatique Inc., 1801 McTavish, St-Bruno, Quebec, Canada) that was functionally superior to many others but not well suited to high volumes of coding.

Using these tools, each coding team leader led the effort to classify the source record from the phase II evaluation set within the scheme provided. The best mapping was recorded by code and full text coding record. The full text record was stripped of technical verbiage such as “NOS” and “NEC” for readability by a clinician whose assignment would be to judge meaning and relevance, not totality of the scheme. We took this action so as not to bias the scorers (described below—a team of clinician experts) at a time when they would have no knowledge of the source scheme. Take for example our full text coding of the record shown in Figure 1.

Figure 1. Sample full-text codings for clinical scoring.
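As an illustration of the stripping step described before Figure 1, the following minimal sketch (our assumption about the exact rule, not the authors' procedure) removes coding verbiage such as “NOS” and “NEC” from a full-text code record before clinician review:

    import re

    def strip_verbiage(term: str) -> str:
        # Remove "NOS"/"NEC" tokens and any attached comma or parentheses,
        # then tidy the leftover whitespace and punctuation.
        cleaned = re.sub(r"[,(]?\s*\b(NOS|NEC)\b\)?", "", term)
        return re.sub(r"\s{2,}", " ", cleaned).strip(" ,")

    print(strip_verbiage("MELANOMA, NOS"))  # -> "MELANOMA"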

We made every effort when using the compositional features of SNOMED to employ sensible and logical constructions since some “nonsense” constructions were possible. The mapping was reviewed by a second author and then passed to the publisher of the scheme for comment and correction. Based upon this response, a fraction of the original coding was revised. Most changes occurred in the READ scheme, where unfamiliarity with differences in culture and administrative systems caused some confusion.

The scoring team leader, Dr. Carpenter, assembled a team of nine clinicians (six physicians and three nurses) from the Mayo Clinic for scoring of the matched sets. Each scorer had a minimum of ten years of staff clinical experience. The nurses had primarily inpatient experience. Physicians were trained in the disciplines of Endocrinology, Internal Medicine, Orthopedic Surgery, Cardiology and Pediatrics and practiced within a variety of settings. Five individuals from this group (three physicians and two nurses) scored each set of records on a five-point Likert scale, rating the acceptability of the match between source concept and coded result. (All evaluation steps used the same Likert rating: 1 = strongly disagree, 3 = neutral, 5 = strongly agree.) Only the full text coding was reviewed by the clinical scoring team. In cases where their assessment score was 3 or less, we also asked them to rate whether the offered term was too specific, too general, or was unrelated to the source term. We scored exact lexical matches, excluding issues of number and minor changes in word order, as six points without clinician review. We scored all records without a match as zero. Scoring sheets assembled by the scoring team leader were collated, checked for accuracy, and entered into a SAS database for analysis. Four percent of records were found to be in error (for example, the scoring sheet contained the wrong candidate term) and were reanalyzed by two team members and scored using the same Likert scale. We used an average of the scores for the final analyses.

We used the SAS procedure25 FREQ to calculate descriptive statistics of the clinician scoring by scheme. For calculation of a completeness score, we accepted only code matches with a Likert score of 4 or higher. Using this categorization, we computed the Chi-Square statistic to analyze for statistical differences between the schemes in aggregate. To compare coding schemes by information domain, we employed the SAS procedure GLM and used Bonferroni adjusted t-tests to compare mean scores by domain and compute confidence intervals.
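The aggregate chi-square can be reproduced from the success counts later reported in Table 2. The following is a minimal sketch in Python (a re-computation of ours, not the authors' SAS code), assuming scipy is available:

    from scipy.stats import chi2_contingency

    # Records scoring 4 or higher ("complete") vs. the rest, per scheme,
    # taken from Table 2 (N = 1929 source records per scheme).
    complete = {
        "READ":   406 + 202 + 491,   # 1099 (57.0%)
        "SNOMED": 383 + 530 + 430,   # 1343 (69.7%)
        "UMLS":   238 + 140 + 578,   #  956 (49.6%)
    }
    table = [[n, 1929 - n] for n in complete.values()]

    chi2, p, df, _ = chi2_contingency(table)
    print(f"chi2 = {chi2:.0f}, df = {df}, p = {p:.1e}")  # chi2 = 164, df = 2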

In order to assess variability in rater scoring, we randomly selected 85 cases from all source material which had coding matches in all three schemes, but exact lexical matches in none of them. We did this several months after the original scoring step. These cases were prepared on a new scoring sheet in which the matches for the three schemes were displayed side-by-side, but still blinded as to source. We then asked three of the original clinician-scorers to rate this new set, and compared their scores to the original ratings using simple descriptive statistics. A sample record from the scoring sheet is shown in Figure 2.

Figure 2.

Figure 2

Sample from clinical scoring sheet.

Definitions

Neither SNOMED nor READ provides definitions apart from those implied by the code terms. In the case of UMLS, which provides a separate file (MRDEF) of definitions, we searched the file for definitions of all the UMLS codes selected by the coding team. A raw score was computed as the frequency with which definitions were available.
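A minimal sketch of the definition lookup (assuming the pipe-delimited MRDEF layout with the concept identifier in the first field; the path and variable names are ours):

    def fraction_defined(mrdef_path, selected_cuis):
        """Fraction of the selected concept codes having at least one
        definition row in the UMLS MRDEF file."""
        defined = set()
        with open(mrdef_path) as f:
            for line in f:
                cui = line.split("|", 1)[0]
                if cui in selected_cuis:
                    defined.add(cui)
        return len(defined) / len(selected_cuis)

    # e.g., fraction_defined("MRDEF", coded_cuis) -> 0.491 in this study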

Taxonomy

From the source data covering the domains 4.X-Interventions, 5.X-Findings, 6.X-Diagnoses and impressions, 10.X-Anatomy and 11.X-Etiology, we selected a pragmatic sample of 10% of source codes at random. We used the computer files provided by the owner of each scheme to analyze the codes within this sample of records for all “parents” and “children.” This was subject to some interpretation in the case of UMLS, and we chose to analyze only the “context relationships” from data file MRCXT. This file lists hierarchical relationships between UMLS concepts as taken from the UMLS source vocabularies. UMLS also has a semantic network which defines a set of hierarchical relationships but we did not analyze these.

We compiled only the first-generation relatives for codes from all schemes and arranged them in a pair-wise fashion for analysis. Six clinician-informatics specialists were recruited and generously volunteered their time for the analysis. These reviewers are all experienced practicing clinicians, who are also active in systems development at their home institutions. All are active members of the American Medical Informatics Association (AMIA). Via Internet mail, they reviewed random samples of 125 parent-child pairs for each coding scheme. They used a five-point Likert scale (1 = extremely dissatisfied with pairing, 3 = neutral, 5 = extremely satisfied with pairing) to rate the clinical utility of each pairing. Each reviewer was blinded to the source scheme they were rating. We asked them to score the pair following these directions:

“We ask you to rate the clinical and programmatic utility of this pairing by choosing a score for each pair based upon your judgement of their appropriateness as a CLINICALLY USEFUL classification hierarchy. We ask you to judge this pair based upon the following criteria:

  1. the pairing is clinically relevant and sound for support of clinical reminders and other features of the computerized record

  2. the pairing is sensible and appropriate when viewed as an informatics structural element of the computerized record”

A sample from the coding sheet for SNOMED appears in Figure 3.

Figure 3. Sample taxonomic scoring sheet.

Mapping

Using the computer files provided by the scheme owners, we evaluated the mapping of codes we found to the billing and administrative schemes in use in the host countries, whenever these cross references were supplied in the published scheme. This meant that we reviewed codes for records from source domains 6.X (diagnoses) and 5.1.X (symptoms and reports) for mappings to ICD-9-CM for SNOMED and UMLS, and to ICD-10 for READ. For procedures, we also studied records from domains 4.X (procedures) excluding domain 4.2.X (medications) for mappings to CPT or ICD-9-CM for SNOMED and UMLS, and to OPCS-4 for READ. At two separate sites, we assembled a team of medical encoding specialists who reviewed the mapped codes for accuracy.

Clarity (Coding Duplications)

Each coding team leader further analyzed the 10% random sample of records mentioned above to search the scheme for duplicate coding. In this analysis, we specifically looked for clinical concepts that had more than one coding assignment (ignoring the differences in terms that might be available). For this sample of 190 records, the team leaders did an exhaustive analysis of the published scheme using the browser, looking for additional coding representations of the same (source) concept. We assembled all possible representations and then a second member of the study team evaluated them to judge if they were true duplications of meaning or were representational variants. A duplication was judged to be present if the coding scheme had a second identifier for the source concept. For precoordinated schemes, any second coding would be a duplication. For SNOMED the issue is more complex, and we looked carefully at the semantics (coding axes and code combinations) of each coding assignment: we judged a second representation to be a duplicate only if it used the same semantic structure, and classified it as a representational variant if its semantics differed. For example,

[T-D9510 RIGHT ANKLE] and [T-D9500, G-A100 ANKLE, RIGHT]

was judged to be a duplication since both codings are rooted in the topography (T-axis) semantic. However, for the source record “S2-second heart sound,”

[F-35040 SECOND HEART SOUND] and [G-A702 SECOND; T-32000 HEART; A-25100 SOUND]

were judged to be representational variants. Although variants might be confusing or misleading to the untrained user, an experienced developer could possibly exploit such richness to their advantage. We summarized the frequency of both events as an indication of duplications in the published schema.
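Our reading of the SNOMED rule above can be sketched as follows (a simplification assuming the deciding factor is the axis of the root code; the function names are ours):

    def root_axis(coding):
        """Leading axis letter (T, F, G, A, ...) of the first code."""
        return coding[0][0]

    def classify(original, second):
        # Same root semantic -> true duplication; otherwise the second
        # coding is only a representational variant.
        if root_axis(original) == root_axis(second):
            return "duplication"
        return "representational variant"

    # Both rooted in topography (T axis): a duplication.
    print(classify(["T-D9510"], ["T-D9500", "G-A100"]))
    # Function (F) root vs. general-modifier (G) root: a variant.
    print(classify(["F-35040"], ["G-A702", "T-32000", "A-25100"]))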

Results

Evaluation Set

Once we edited the source material for content, added nursing data, and eliminated duplications, the evaluation set consisted of 1,929 records. Table 1 summarizes the number of records by information domain and clinical source. Nursing documents, history and physicals and progress notes were taken from both inpatient and outpatient care environments. The majority of source records were classified as attributes, interventions, findings and diagnoses. Within each of these major categories, the subsets most heavily represented were laboratory tests, therapeutic procedures, symptoms, physical findings, medical diagnoses and body parts or organs. The distribution of records by information domain was driven entirely by clinical record content and therefore represented the concerns and focus of the health care providers who recorded the source clinical documents.

Table 1.

Phase II Source Records: Text Source by Information Domain

Source Documents 2 Demo 3 Attrib 4 Intrvtn 5 Finding 6 Diagns 7 Plans 8 Equip 9 Events 10 Anatmy 11 Etiol 14 Agents Totals
Consults 4 23 13 29 21 4 1 1 11 0 0 107
Discharge summaries 5 30 111 94 46 11 3 2 19 1 0 322
Nursing documents 2 5 90 130 65 95 2 0 1 0 0 390
History and physicals 7 90 129 248 102 9 6 0 42 3 4 640
Operative notes 0 18 85 29 22 2 17 0 54 0 0 227
Progress notes 0 20 24 37 16 2 1 0 10 3 0 113
X-ray reports 0 20 13 46 17 2 0 0 31 1 0 130
Totals 18 206 465 613 289 125 30 3 168 8 4 1929

Completeness

Table 2 summarizes the scoring results of the best coding matches by scheme. The upper portion of the table summarizes scoring frequency and mean scores for each scheme. At the bottom, we used a minimum score of four as the definition of success (4 = agree, 5 = strongly agree, 6 = lexical match). By this definition, the completeness scores by scheme were READ—57.0%, SNOMED—69.7% and UMLS—49.6% (χ2 = 164, 2 df, p <.00001). We found that coding assignments scored at 3 or less were judged similarly for specificity across all schemes. In particular, when a rater was dissatisfied with the coding match, they rated it as too general approximately 80% of the time, too specific 10% of the time, and as unrelated 10% of the time. This varied only slightly between READ, SNOMED and UMLS.

Table 2.

Average Scoring Frequency by Scheme: Number of Scores (Percent of total)

Score READ SNOMED UMLS
0 (no match) 80 (4.1) 103 (5.3) 312 (16.2)
1-1.99 51 (2.6) 23 (1.2) 65 (3.4)
2-2.99 378 (19.6) 215 (11.1) 375 (19.4)
3-3.99 321 (16.6) 245 (12.7) 221 (11.5)
4-4.99 406 (21.0) 383 (19.9) 238 (12.3)
5 202 (10.5) 530 (27.5) 140 (7.3)
6 (exact match) 491 (25.5) 430 (22.3) 578 (30.0)
Mean score 3.96* (95% CI 3.89-4.04) 4.31* (4.24-4.38) 3.63* (3.54-3.73)
Completeness (score 4 or greater) 57.0%** 69.7%** 49.6%**
* p <.05 for each pair-wise comparison of mean scores.

** χ2 = 164, df = 2 for comparison of completeness scores, p <.00001.

Table 3 lists the average score between systems by information domain. Mean scores are accompanied by Bonferroni confidence intervals for the thirty-three subsets analyzed. For readability, the final column summarizes statistically significant differences between systems within the domain.

Table 3.

Average Score by Domain and Scheme (Confidence Intervals Employing Bonferroni Correction)

Domain N READ SNOMED UMLS Statistical Summary
Demographics 18 3.70 (2.38-5.02) 3.39 (2.07-4.71) 3.02 (1.70-4.34) R = S = U
Attributes 206 3.43 (3.04-3.81) 3.97 (3.59-4.36) 2.52 (2.13-2.91) S > R > U
Interventions 465 3.91 (3.65-4.17) 4.36 (4.10-4.62) 4.06 (3.80-4.31) S > U, R
Findings 613 3.79 (3.56-4.02) 4.31 (4.08-4.53) 3.44 (3.22-3.67) S > R > U
Diagnoses/impressions 289 4.04 (3.70-4.36) 4.62 (4.29-4.95) 4.15 (3.82-4.48) S > U, R
Plans 125 2.93 (2.43-3.43) 3.58 (3.08-4.08) 2.88 (2.38-3.38) S > R, U
Equipment/devices 30 2.49 (1.47-3.51) 3.25 (2.23-4.27) 3.39 (2.37-4.41) U, S > R
Events 3 2.93 (-.29-6.16) 3.20 (-.03-6.43) 2.53 (-.69-5.76) S = R = U
Human anatomy 168 4.60 (4.17-5.03) 4.88 (4.44-5.31) 4.33 (3.90-4.76) S, R > U
Etiologic agents 8 3.93 (1.95-5.90) 4.78 (2.80-6.75) 4.55 (2.57-6.53) S = U = R
Agents 4 3.05 (.25-5.85) 4.05 (1.25-6.84) 1.55 (-1.25-4.35) S = R = U

Intrarater Variability

When we compared the average scores of the three raters for the 85 cases, 67% of cases were assigned the same rating, 15.5% were rated up one category, 0.4% were rated up two categories, 13.9% were rated down one category and 3.2% were rated down two categories. We analyzed this across all three coding schemes and found a tendency to rate SNOMED records downward, and an opposing change upward for UMLS. Mean scores for this set were:

SCHEME Mean SD Differential from Original Score
READ 3.28 .91 (-.18)
SNOMED 3.48 .86 (-.74)
UMLS 3.57 .80 (+.20)

Reviewing the source records and the repeat scoring, there seemed to be a tendency to assign a “blanket” score on the cross-record rating. In fact, when linear regression was used to evaluate interdependency of scores for the validation set, 50-60% of variance in the score for one scheme could be explained based upon either of the other sets of scores. Therefore the differential observed between SNOMED and UMLS may represent a testing sequence bias, but seemed more likely to represent a regression towards the mean based upon the high degree of correlation among scores in this small validation set.
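The interdependency check mentioned above can be sketched as an ordinary regression of one scheme's validation scores on another's (hypothetical score vectors; the study data are not reproduced here):

    from scipy.stats import linregress

    snomed = [3, 4, 2, 5, 3, 4, 3, 2, 4, 5]   # hypothetical rater scores
    umls   = [3, 3, 2, 4, 4, 4, 3, 2, 3, 5]   # hypothetical rater scores

    fit = linregress(snomed, umls)
    print(f"r^2 = {fit.rvalue ** 2:.2f}")  # 50-60% of variance in the study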

Definitions

Definitions were published in the UMLS source file MRDEF for 49.1% of the unique concept codes identified during the completeness analysis.

Taxonomy

The sampling procedure identified 165 records for taxonomic analysis. Table 4 summarizes the results of that examination. Column one lists the total number of records for which a match was found within the scheme. Column two lists the total number of first degree relatives (immediate parents or children) found for the terms that had a match. Columns three and four summarize the scoring of the relatives by domain experts, using an average score of four or better as acceptable. Finally, column five lists the average number of relatives scored as acceptable, indexed against the number of codes with matches from column one.

Table 4.

Summary of Taxonomy Analysis for 165 Source Records

System Source Matches Number of First-Degree Relatives Score 1-3 (%) Score 4-5 (%) Average Acceptable Relatives per Code
READ 158 516 34.6 65.4 2.14*
SNOMED 158 1065 32.3 67.7 4.56*
UMLS 144 616 25.8 74.2 3.17*
* Pair-wise comparison of the number of acceptable relatives shows significance for all pairings: READ-SNOMED χ2 = 34.51, df = 2, p <.00001; SNOMED-UMLS χ2 = 7.87, df = 2, p <.005; UMLS-READ χ2 = 8.49, df = 2, p <.004.
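Column five of Table 4 follows arithmetically from the other columns: acceptable relatives per code = (first-degree relatives × fraction scored 4-5) / source matches. A minimal check:

    rows = {  # scheme: (source matches, first-degree relatives, % scored 4-5)
        "READ":   (158,  516, 65.4),
        "SNOMED": (158, 1065, 67.7),
        "UMLS":   (144,  616, 74.2),
    }
    for scheme, (matches, relatives, pct_ok) in rows.items():
        print(scheme, round(relatives * pct_ok / 100 / matches, 2))
    # READ 2.14, SNOMED 4.56, UMLS 3.17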

Mapping

Table 5 summarizes the analysis of administrative cross references published with each scheme. The READ publication consistently did the best job of recognizing that mapping from a clinical scheme to an epidemiologic system may not always be one-to-one. In cases of ambiguity or overlap, they published a list of related codes and indicated where manual review would be necessary based upon context. Unfortunately, the best coding matches were not always listed first or flagged as primary.

Table 5.

Summary of Administrative Mapping: Symptoms (Domains 5.1.X); Diagnoses (Domains 6.X); Procedures (Domains 4.X excluding 4.2.X)

System Symptoms Diagnoses Procedures Composite Score
READ (ICD-10, OPCS-4) 35/207 138/290 107/192 40.6%
SNOMED (ICD-9-CM) 42/207 79/290 22/192 20.7%*
UMLS (ICD-9-CM, CPT-IV) 66/207 117/290 66/192 (ICD-9-CM); 5/192 (CPT-IV) 36.1%
* Pair-wise comparison with READ and UMLS shows significant differences: χ2 = 64.02 and 43.59 respectively, df = 2, p <.00001.

Because of difficulty in obtaining a copy of OPCS-4 for our reviewing team, procedural cross mapping for READ is listed as unreviewed raw scores. Otherwise, Table 5 lists only cross reference mappings found acceptable by reviewers. Subjectively, SNOMED had a higher error rate for ICD-9-CM mapping and seemed to be out of date in greater proportion, suggesting that editorial work may not be current in this area. In particular, neither SNOMED nor UMLS reflected many recent five-digit changes to ICD-9-CM. UMLS, of course, because of its origin and design, had many cross references to other schemes not analyzed in this paper. In general, Table 5 demonstrates that use of any of these three schemes will require substantial manual editing before they might be used to represent core clinical content and still accomplish administrative reporting for billing or epidemiologic purposes.
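The composite scores in Table 5 appear consistent, to within rounding, with pooling the three numerators and denominators for each scheme (using the ICD-9-CM figure for UMLS procedures); the sketch below is our reconstruction, not the authors' code:

    accepted = {  # scheme: accepted mappings (symptoms, diagnoses, procedures)
        "READ":   (35, 138, 107),
        "SNOMED": (42,  79,  22),
        "UMLS":   (66, 117,  66),   # ICD-9-CM procedure mappings
    }
    reviewed = 207 + 290 + 192      # records reviewed per category, total 689
    for scheme, counts in accepted.items():
        print(scheme, f"{100 * sum(counts) / reviewed:.1f}%")
    # READ 40.6%, SNOMED 20.8% (reported as 20.7%), UMLS 36.1%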

Clarity (Coding Duplications)

Table 6 summarizes the results of the search for duplication of coding within each scheme. Of the 165 records randomly selected for review, column one lists the number of records for which a match was found in the scheme. This serves as a denominator for the frequency calculations in columns two and three. Column two lists the number of duplicate codes found by reviewers and the relative frequency. Column three provides a similar tally of representational (semantic) variants—which were only relevant to the case of SNOMED.

Table 6.

Summary of Duplications Found in 165 Random Records

System Records with Matches Duplications (Rate) Representational Variants (Rate)
READ 158 0 (0.0%)* N/A
SNOMED 158 22 (13.9%)* 8 (5.1%)
UMLS 144 6 (4.2%)* N/A
* Two-by-two comparison of duplications shows significance for all pairings: READ-SNOMED χ2 = 23.65, df = 2, p <.00001; SNOMED-UMLS χ2 = 8.53, df = 2, p <.004; UMLS-READ χ2 = 6.72, df = 2, p <.01.

Discussion

Methods and Study Limitations

Reviewing the conduct of this study, we have attempted to create a set of protocols for evaluating a clinical classification scheme and for providing feedback to scheme developers. Earlier projects6,7,8,9,10,11,12,13,14 have studied a single feature (usually completeness) and have generally employed highly focused evaluation sets. Limitations of those projects have grown out of finite resources, confusion regarding study objectives, and a lack of documented and validated methods. We believe that we have addressed some, but not all, of these concerns.

In particular, we have begun our project with a proposal for the attributes of an ideal classification scheme. We do this as an outgrowth of a discussion within CPRI and on the basis of our own personal study, but also because the discussion of these features must begin in earnest in a broader forum. We hope that this paper will help to fuel and focus that discussion. Although some of the attributes we propose rest on common sense alone, others clearly require CPR systems research for validation of utility.

To address the nature of completeness, we have begun by expanding our source material to include previously unstudied areas—especially in nursing. In a sense, a complete evaluation set cannot yet be collected since we do not yet know for certain which aspects of the CPR will benefit objectively from coding. Also, the use of material from an earlier project5 exposes our methods to criticisms of contamination (allowing code vendors to ‘tailor’ their schemes). Furthermore, biases in selection of source material by the authors can create unfair comparisons of systems. To address these concerns we can only note once more5 that we selected documents across geographical regions and from many clinical viewpoints. The segregation of the source material into study records clearly represents the editorial opinion of the authors (JRC and CGC) and cannot be considered universal or standardized. In an effort to be fair, we believe that we more often overstated system capabilities than denied them. Our greatest concern in this regard is that limited browser function may have posed the most serious challenge to the validity of our data, since every tool we used had some limitations.

The bias (or error) created by the browsers that we employed was a serious concern relative to an accurate estimate of duplication in the coding schemes. We are not aware of any earlier studies that have defined this problem or evaluated this issue in a quantitative manner. We are concerned that “exhaustive search” is an imprecise and non-reproducible proposal for study methods, but we lack an acceptable alternative at this time. We also found that our goals for duplication analysis were less well-defined in the setting of a compositional system such as SNOMED. This aside, we believe that the quality (if not precision) of the information presented does raise real issues for implementation of clinical coding, since in a certain sense duplication rate (lack of clarity) can be a greater limitation than lack of completeness when using a system clinically.

From our study of rater team scoring validity, we are concerned that we may have introduced a “testing-sequence” bias. We cannot completely resolve this question with our current data set. Our data does not include rater identification that would permit analysis of variance—a better method for testing the question of bias. A better study plan would have mixed codes from the three source vocabularies when presenting them to the clinical judges, thus eliminating any tendency for scoring “drift” to impact upon scores by scheme. Since we received material from the scheme publishers over a six month period, and since all coding was done with donated time, this was not technically possible without delaying the whole project substantially. From experience, we observe that studies such as this require some haste in order to make observations that are not immediately discounted by vendors for being out-of-date.

As we have tested new study methods, we have also raised questions regarding alternative strategies. In particular, in our study of taxonomy we studied “the nearest neighbors” of each randomly selected code element for clinical utility of the taxonomic link. This method ignores “taxonomic depth” (Is the granularity of the hierarchy and depth of the hierarchy best?) and may not properly measure multiple inheritance as a scheme feature. An alternative method would ask domain experts to rate every member of the full-depth taxonomy surrounding a randomly selected code, and possibly to score the frequency and types of semantic links to the code.

Classification Systems: Observations

The structure and utility of each coding system that we studied fell short of the ideals that we propose, but we also encountered many strengths that offer potential use for the CPR system developer. A summary view of the three schemes we studied is provided in Table 7, referencing the features of an ideal scheme that we introduced above.

Table 7.

Summary of System Features by Scheme

Feature READ SNOMED UMLS
Complete .57 .70 .50
Clear1 1.00 .86 .96
Mapping2 .41 .21 .36
Compositional Partial Yes No
Synonyms Yes Yes Yes
Attributes and uncertainty Yes (qualifiers) Yes (modifier axis) No
Taxonomy3 2.14 4.56 3.17
Meaningless identifiers Yes No Yes
Unique identifiers Yes Yes Yes
Definitions4 0.00 0.00 .49
Language independence English only English, French, Chinese and Portuguese (English, French, German, Portuguese, Spanish5)
Syntax/grammar Yes (qualifier mapping) No N/A (precoordinated)
1 Score for clarity = (1 - duplication rate)

2 Fraction of candidate concepts with administrative mapping

3 Number of acceptable first-degree relatives per concept

4 Fraction of concepts with definitions provided

5 MeSH terms only

We delayed our study of READ for the release of version 3, which includes nursing terms and attempts to coordinate efforts at clinical classification beyond primary care. The National Health Service (NHS) of Great Britain has numerous clinical teams revising READ, with many differing clinical emphases. Comparing versions 2 and 3 we note progress toward greater completeness. However, in the course of our efforts we encountered codes marked as “obsolete” which often seemed to represent sound clinical terms. These terms represent less than 3% of all records coded. Publications from NHS for April 1995 did not discuss the purpose of such “obsolete” markings, although NHS personnel insist that these codes are permissible for use. They emphasize that, for reasons of historical release management, no READ code is ever eliminated from the scheme. Based upon this feedback from NHS, all codes flagged as “obsolete” were included in the study report, since a sensitivity analysis demonstrated that their inclusion did not materially affect the results.

READ has excellent clarity based upon our limited sampling. The administrative mapping of READ was superior to others we studied, and the coding of a source term into READ was generally straightforward when compared with SNOMED, although ICD-10 and OPCS-4 may have limited utility for investigators in the United States. Synonymy is well supported; the READ scheme employs multiple inheritance, which is superior to SNOMED.

READ has changed structure with the release of version 3, and it now includes meaningless concept identifiers. It also employs a limited compositional scheme featuring “qualifiers,” which are linked in an organized list to parent concepts. “Qualifiers” generally provide quantification, anatomical location or specification for the terms to which they are linked. Although READ cannot be described as a fully compositional scheme on this basis, this change is clearly a decisive step in that direction. Furthermore, it has a clearer structure than the SNOMED compositional model since it has a defined syntax for permissible combinations. READ provides no definitions, commenting in its release notes that the preferred term for each code defines the concept. The scheme is available only in English.

Our initial analysis5 of coding schemes suggested that SNOMED was the most complete classification system available today. This second phase analysis continues to support that observation, documenting the steady clinical evolution of SNOMED. The compositional design of SNOMED is the best we have evaluated of those proposed for clinical medicine. Although SNOMED poses implementation problems for the CPR developer, it offers clear advantages for exhaustively describing a rapidly evolving clinical care environment. READ has recognized the soundness of a compositional structure in its latest version, but still lags well behind SNOMED in that regard.

A problem with SNOMED that is clearly highlighted by our analysis is a high rate of duplicate codes—it lacks clarity. This problem is created in part by the compositional nature of the scheme, coupled with an evolution that has not proffered definitions, sometimes ignores orthogonality (non-overlapping construction), and has not developed a coding syntax. Investigators have previously suggested procedures for such steps2,26 and reports indicate that the SNOMED editorial board is considering these options. Ultimately, a compositional scheme must develop these features in order to achieve its full potential and utility.

Contrasting READ and UMLS to SNOMED relative to the features of completeness and clarity highlights the potential—and the pitfalls—of a compositional approach to clinical coding. To create an extreme example, random assembly of entries from a very large dictionary could cover all terms in use for clinical care, but it would also introduce all the ambiguity and duplication that controlled vocabularies are intended to minimize. By defining component features of Diagnoses, Function, Topography, Morphology, Living Organisms, Procedures, etc., SNOMED has attempted to organize clinical practice into a set of characteristics that define an element while they also dissect out its nature. Thus, in a real sense, the better completeness of SNOMED might be intuitive to some. The question that emerges from this analysis, however, is whether such an approach can also be definitive and non-ambiguous. The pre-coordinated scheme formulated by READ clearly enhances the latter attributes. Challenges remain both for the systems designer and the scheme developer to creatively exploit and manage these issues.

Mapping within SNOMED is weak—if the systems implementor wishes to develop links to billing systems, additional work will be required. It supports ample synonyms within the publication and the addition of a “General Modifier” axis has created a structure to handle uncertainty and quantification. SNOMED has done more than any other publication to become an international scheme. It now supports four major languages: English, French, Chinese and Portuguese.

SNOMED falls short through its continued ties to a rigid hierarchy in which identifiers are also taxonomic links. Clinical taxonomies must support multiple inheritance to be useful for decision logic.2 For example, rheumatoid arthritis is both an arthritis and an auto-immune disease. The numerical identifier assignment currently employed by SNOMED ties the classification to the hierarchy and makes this impossible.

The UMLS Metathesaurus is not a classification system by design; rather, it is an inter-lingua or translation tool primarily designed for information retrieval. In that sense, we wish to make it clear that this study is not a proper evaluation of UMLS relative to the principles for which it was designed. Nonetheless, we included the Metathesaurus in our study because of the substantial national investment in this product and the clinical classification interest in UMLS voiced by CPR developers.7,10,27 Furthermore, for translating from the clinical arena to the medical literature, investigators have maintained that the UMLS is critical to workstation development. In this sense, one can argue that the Metathesaurus should be at least as complete as that clinical environment it proposes to serve.

However, based upon our observations, the Metathesaurus is neither sufficiently complete nor organized in such a way as to serve as a controlled terminology within a CPR. Comparing acceptability scores across schemes, it is notable that UMLS was much more dichotomous (a clean hit or a clean miss) than SNOMED, with substantially less completeness—due in large part to its precoordinated paradigm. It publishes terms from both compositional and precoordinated schemes that may overlap, without a definition of a canonical or preferred concept. It remains focused on the content of the source vocabularies that it connects, and that material is not chosen primarily for clinical descriptive purposes. Although NLM plans to subsume the entirety of SNOMED, UMLS does not exploit the richness of composition that makes SNOMED as complete as we found it. In the process of including SNOMED elements (for example), UMLS does not publish the minimal syntactical and compositional guidelines provided by CAP in SNOMED.

In order to exploit such richness from its source vocabularies, UMLS would have to include such compositional guidelines and promote their use in some fashion within its publication.

UMLS is well organized and clear, with only rare duplications. Mapping is a fundamental purpose of the publication, and it includes links to some thirty-four different source schemes. Some investigators have reported upon its utility for translation between coding schemes.28,29 In keeping with the general results of that work, the CPR developer will find that creation of direct links to ICD-9-CM and CPT-IV will require additional translation steps.

UMLS has an abundant lexicon of terms from which to choose, rich synonymy, and more definitions than any other scheme—suggesting that UMLS may aid the CPR developer most in creation of a local lexicon that will serve as the language interface with the clinician. Furthermore, the additional semantic and lexical features of UMLS make it a good resource for natural language analysis and building of the human interface.

Although UMLS maintains both a semantic network and the inheritance of terms from its source vocabularies, these taxonomies are clinical only in part. In fact, the UMLS taxonomies are incomplete from a clinical standpoint because the component vocabularies are of different granularity, they differ in their capacity to support multiple inheritance, and they are derived from different compositional semantics. UMLS takes attribute terms from its source schemes, but offers no guidelines for composing complex elements, and therefore we analyzed it as a pre-coordinated (one concept = one code) system. It maintains a system of unique identifiers for its concepts and terms. The Unified Medical Language System also supports five languages—but only for Medical Subject Heading (MeSH) terminology.

In conclusion, we cannot claim any universal solution to the coding problem of the CPR based upon our data. We do believe that the analyses we have made suggest strategies that are reasonable for CPR developers in the United States and Europe and we welcome the consideration of those questions in the critical light of our data. We also believe that our material points out important priorities for classification scheme developers to pursue in order to improve their product. On the national scene, the board of directors of AMIA30 has suggested coding sets for use in selected domains—especially medications and observations. This analysis suggests the prudence of some of their recommendations and perhaps urges action in other areas. We welcome the debate that will inevitably follow the publication of this paper.

Acknowledgments

We thank the clinicians of the Mayo Foundation who donated tremendous amounts of their time to score coding assignments: Sheryl Ness, RN, Julia Behrenbeck, MS, RN, Donald C. Purnell, MD, Lawrence H. Lee, MD, Robert C. Northcutt, MD, Vahab Fatourechi, MD, Sean F. Dinneen, MD, Maria Collazo-Clavell, MD, Robert A. Narotzky, MD.

We owe special thanks to the members of the informatics community who evaluated and scored hierarchical relationships: James J. Cimino, MD, Columbia University, New York NY; George Hripcsak, MD, Columbia University, New York NY; J. Marc Overhage, MD, PhD, Regenstrief Institute, Indianapolis IN; Thomas Payne, MD, Group Health Cooperative of Puget Sound, Seattle WA; William Hersh, MD, Oregon Health Sciences University, Portland OR; Patricia Flatley Brennan, RN, PhD, University of Wisconsin, Madison WI.

Members of the Codes and Structures Workgroup of the Computer-based Patient Record Institute donated time and their thoughts to development of the domains descriptions. We thank them.

Kathy Brouch organized a team of coding specialists to review administrative mapping data. We thank them for their assistance.

Thomas G. Tape, MD, kindly provided guidance with statistical analysis. We thank him.

Papers authored under the auspices of CPRI represent the consensus of the membership, but do not necessarily represent the views of individual CPRI member organizations.

Appendix

Concept Domains of the CPR

  1. NAME: Administrative Concepts

    DEFINITION: Administrative concepts are attributes of the CPR that are properties of the health care system (the environment and circumstances of the health care delivery process) and are necessary data elements within the CPR.

    • 1.1 NAME: Facilities and institutions
    • 1.2 NAME: Practitioners and care givers
    • 1.3 NAME: Patients and clients
    • 1.4 NAME: Payers and reimbursement sources; financial information
  2. NAME: Demographics

    • DEFINITION: Demographic attributes/concepts are descriptors of living situations, major ethnic/racial categories, social or behavioral characteristics, or other properties of health care clients that identify them as individuals or quantify clinical risk.
      • 2.1 NAME: Address
      • 2.2 NAME: Telephone
      • 2.3 NAME: Ethnicity
      • 2.4 NAME: Religion
      • 2.5 NAME: Occupation
      • 2.6 NAME: Date of birth
      • 2.7 NAME: Marital status
      • 2.8 NAME: Race
      • 2.9 NAME: Language
      • 2.10 NAME: Educational level
  3. NAME: Attributes

    • DEFINITION: Attributes are features that change the meaning or enhance the description of an event or concept.
    • EXAMPLE: Classes of attributes might include: Topography (excludes anatomy), Site, Negation, Severity, Stage, Clinical scoring, Disease activity, Time, Interval, Baseline, Trend
  4. NAME: Interventions

    • DEFINITION: Activities used to alter, modify or enhance the condition of a patient in order to achieve a goal of better health, cure of disease, or optimal life style.
      • 4.1 NAME: Diagnostic Procedure
      • 4.1.1 NAME: Laboratory procedure or test
      • 4.1.2 NAME: Procedure for functional testing or assessment
      • 4.1.3 NAME: Radiographic procedure or test
      • 4.2 NAME: Therapeutic interventions
      • 4.2.1 NAME: Medication
      • 4.2.2 NAME: Therapeutic procedure
      • 4.2.3 NAME: Physical therapeutic intervention
      • 4.3 NAME: Environmental interventions
      • 4.4 NAME: Educational interventions
      • 4.5 NAME: Behavioral and perceptual interventions
  5. NAME: Finding

    • DEFINITION: A finding is an observation regarding a patient. It may be discovered by inquiry of a patient or patient's close contact. It may be directly measured by observing the patient during function or by stimulating the patient and assessing response. It may be the measurement of an attribute of the patient, a physical feature, or determination of a bodily function.
      • 5.1 NAME: History and patient centered reports
      • 5.1.1 NAME: Symptoms (and reports of disease or abnormal function)
      • 5.1.2 NAME: Personal habits and functional reports
      • 5.1.3 NAME: Reason for Encounter
      • 5.1.4 NAME: Family history
      • 5.2 NAME: Physical Exam
      • 5.3 NAME: Laboratory and Testing Result
      • 5.4 NAME: Educational or Psychiatric Testing Result
      • 5.5 NAME: Functional Assessment Result
  6. NAME: Diagnoses and Impressions

    • DEFINITION: A diagnosis is the determination or description of the nature of a problem or disease; a concise technical description of the cause, nature, or manifestations of a condition, situation or problem.
      • 6.1 NAME: Disease-focused diagnosis
      • 6.1.1 NAME: Medical diagnosis
      • 6.1.2 NAME: Testing diagnosis
      • 6.2 NAME: Function-focused diagnosis
      • 6.2.1 NAME: Nursing diagnosis
      • 6.2.2 NAME: Disability assessment
  7. NAME: Plans

    • DEFINITION: A method or proposed procedure, documented in the CPR, for achieving a patient/client goal or outcome.
      • 7.1 NAME: Referrals
      • 7.2 NAME: Patient intervention contracts
      • 7.3 NAME: Order
      • 7.4 NAME: Appointments
      • 7.5 NAME: Disposition
      • 7.6 NAME: Nursing Intervention
  8. NAME: Equipment and devices

    • DEFINITION: Objects used by providers or client/patients during the provision of health care services, in the pursuit of wellness, or to educate and instruct.
      • 8.1 NAME: Medical Device
      • 8.2 NAME: Biomedical or Dental Material
      • 8.2.1 NAME: Biomedical supply
  9. NAME: Event

    • DEFINITION: A broad attribute type used for grouping activities, processes and states into recognizable associations. (UMLS) A noteworthy occurrence or happening (Webster 3rd Int Dict).
      • 9.1 NAME: Encounter
      • 9.2 NAME: Patient life event
      • 9.3 NAME: Episode of care
  10. NAME: Human anatomy

    • DEFINITION: A set of concepts relating to components or regions of the human body, used in the description of procedures, findings, and diagnoses.
      • 10.1 NAME: Body Location or Region
      • 10.2 NAME: Body Part, Organ, or Organ Component
      • 10.3 NAME: Body Space or Junction
      • 10.4 NAME: Body Substance
      • 10.5 NAME: Body System
      • 10.6 NAME: Hormone
  11. NAME: Etiologic agents

    • DEFINITION: Forces, situations, occurrences, living organisms, or other elements that may be instrumental or causative in the pathogenesis of human illness or suffering.
      • 11.1 NAME: Infectious Agent
      • 11.2 NAME: Trauma
  12. NAME: Documents

    • DEFINITION: A writing, as a book, report or letter, conveying information about a patient, event, or procedure.
  13. NAME: Legal agreements

    • DEFINITION: Contractual and other legal documents, made by or on behalf of the patient/client, in order to document patient wishes, enforce or empower patient priorities, or to assure legal resolution of issues in a manner in keeping with the patient's personal choices.
  14. NAME: Agents

    DEFINITION: Agents are other individuals, possibly themselves clients or patients, who must be referenced in the CPR because of important family or personal relationships to the client/patient.

References

  • 1. Dick RS, Steen EB. The Computer-based Patient Record: An Essential Technology for Health Care. Washington, DC: National Academy Press, 1991.
  • 2. Cimino JJ, Clayton PD, Hripcsak G, Johnson S. Knowledge-based approaches to the maintenance of a large controlled medical terminology. JAMIA. 1994;1:35-50.
  • 3. International Standards Organization. International Standard ISO 1087: Terminology—Vocabulary. Geneva, Switzerland: International Standards Organization, 1990.
  • 4. Read JD, Sanderson HF, Drennan YM. Terming, encoding and grouping. In: Greenes RA (ed). MEDINFO 1995: Proceedings of the Eighth World Congress on Medical Informatics. North Holland Publishing, 1995:56-59.
  • 5. Chute C, Cohn S, Campbell K, Oliver D, Campbell JR. The content coverage of clinical classifications. JAMIA. 1996;3:224-33.
  • 6. Campbell JR, Payne TH. A comparison of four schemes for codification of problem lists. Proc Annu Symp Comput Appl Med Care. 1994:201-5.
  • 7. Rosenberg KM, Coultas DB. Acceptability of Unified Medical Language System terms as substitute for natural language general medicine clinic diagnoses. Proc Annu Symp Comput Appl Med Care. 1994:193-7.
  • 8. Henry SB, Campbell KE, Holzemer WL. Representation of nursing terms for the description of patient problems using SNOMED III. Proc Annu Symp Comput Appl Med Care. 1993:700-4.
  • 9. Campbell JR, Kallenberg GA, Sherrick RC. The clinical utility of META: an analysis for hypertension. Proc Annu Symp Comput Appl Med Care. 1992:397-401.
  • 10. Payne TH, Martin DR. How useful is the UMLS Metathesaurus in developing a controlled vocabulary for an automated problem list? Proc Annu Symp Comput Appl Med Care. 1993:705-9.
  • 11. Payne TH, Murphy GR, Salazar AA. How well does ICD9 represent phrases used in the medical record problem list? Proc Annu Symp Comput Appl Med Care. 1992:205-9.
  • 12. Friedman C. The UMLS coverage of clinical radiology. Proc Annu Symp Comput Appl Med Care. 1992:309-13.
  • 13. Henry SB, Holzemer WL, Reilly CA, Campbell KE. Terms used by nurses to describe patient problems. JAMIA. 1994;1:61-74.
  • 14. Zielstorff RD, Cimino C, Barnett GO, Hassan L, Blewett DR. Representation of nursing terminology in the UMLS Metathesaurus: a pilot study. Proc Annu Symp Comput Appl Med Care. 1992:392-6.
  • 15. Chisholm J. The Read Classification. BMJ. 1990;300:1092.
  • 16. Read Codes File Structure Version 3: Overview and Technical Description. Woodgate, Leicestershire, UK: NHS Centre for Coding and Classification, 1994.
  • 17. Cote RA, Rothwell DJ, Palotay JL, Beckett RS, Brochu L (eds). The Systematized Nomenclature of Human and Veterinary Medicine: SNOMED International. College of American Pathologists, 1993.
  • 18. Cote RA, Robboy S. Progress in medical information management. Systematized nomenclature of medicine (SNOMED). JAMA. 1980;243:756-62.
  • 19. Evans DA, Rothwell DJ, Monarch IA, et al. Toward representations for medical concepts. Med Decis Making. 1991;11(4 Suppl):S102-8.
  • 20. Lindberg DA, Humphreys BL, McCray AT. The Unified Medical Language System. Methods Inf Med. 1993;32:281-91.
  • 21. Lindberg DAB, Humphreys BL, McCray AT. The Unified Medical Language System. Yearbook of Medical Informatics. 1993:41-51.
  • 22. Humphreys BL, Lindberg DA. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc. 1993;81:170-7.
  • 23. McCray AT, Razi A. The UMLS Knowledge Source Server. MEDINFO. 1995:144-7.
  • 24. McCray AT, Divita G. ASN.1: defining a grammar for the UMLS Knowledge Sources. Proc Annu Symp Comput Appl Med Care. 1995:868-72.
  • 25. SAS Institute Inc. Overview of the SAS System. Cary, NC: SAS Institute Inc., 1990.
  • 26. Campbell KE, Das AK, Musen MA. A logical foundation for the representation of clinical data. JAMIA. 1994;1:218-32.
  • 27. Stitt FW. A standards-based clinical information system for HIV/AIDS. Medinfo. 1995;8 Pt 1:402.
  • 28. Cimino JJ. From ICD-9-CM to MeSH using the UMLS: a how-to guide. Proc Annu Symp Comput Appl Med Care. 1993:730-4.
  • 29. Cimino JJ. Representation of clinical laboratory terminology in the Unified Medical Language System. Proc Annu Symp Comput Appl Med Care. 1991:199-203.
  • 30. Board of Directors of the American Medical Informatics Association. Standards for medical identifiers, codes and messages needed to create an efficient computer-stored medical record. JAMIA. 1994;1:1-7.
