Abstract
SNOMED CT’s content undergoes many changes from one release to the next. Over the last year SNOMED CT’s Bacterial infectious disease subhierarchy has undergone significant editing to bring consistent modeling to its concepts. In this paper we analyze the stated and inferred structural modifications that affected the Bacterial infectious disease subhierarchy between the Jan 2015 and Jan 2016 SNOMED CT releases using a two-phased approach. First, we introduce a methodology for creating a human readable list of changes. Next, we utilize partial-area taxonomies, which are compact summaries of SNOMED CT’s content and structure, to identify the “big picture” changes that occurred in the subhierarchy. We illustrate how partial-area taxonomies can be used to help identify groups of concepts that were affected by these editing operations and the nature of these changes. Modeling issues identified using our two-phase methodology are discussed.
Introduction
SNOMED CT [1] is a large and complex medical terminology that is maintained by the International Health Terminology Standards Development Organization (IHTSDO) [2]. Extensive editorial work goes into each SNOMED CT release. Additional content, in the form of new concepts and new relationships, is added and existing content is revised to address errors and inconsistencies, while content that is no longer valid is retired.
The IHTSDO sometimes focuses editorial resources on a specific portion of SNOMED CT’s content. The Bacterial infectious disease subhierarchy, with over 2000 concepts, underwent significant remodeling in the July 2015 and Jan 2016 releases. The editorial staff at the IHTSDO determined that the concepts in this subhierarchy were modeled inconsistently. The goal of the remodeling is to make stated changes that enable SNOMED CT’s classifier to better infer correct hierarchical relationships. The still ongoing revision has affected hundreds of the Bacterial infectious disease concepts, requiring thousands of individual editing operations. SNOMED CT’s editors and end users need to be aware of the overall impact of these modifications on the content of the terminology. Editors need to ensure that all of their modifications were correct and end users need to know that their systems will still work correctly.
In this paper we analyze how the current, but incomplete, modifications to the concepts in the Bacterial infectious disease subhierarchy affected that subhierarchy. However, the task of analyzing changes across multiple SNOMED CT releases is not straightforward. The IHTSDO does not provide a “user friendly” change log. A list of individual changes (a “delta”) is provided with each release, but one editing operation may result in multiple entries. The delta provides no additional information about the change (e.g., is the new target of a relationship more specific, less specific, or unrelated to the old target?), or what motivated it. Additionally, a list of changes, as provided by the delta, does not provide the “big picture” of what changed. When hundreds of concepts are modified it is difficult to review each individual change. As a result, an editor or end user may decide to avoid an in-depth review of the changes. However, such a review is critical, not just to ensure the correctness of each change, but also because some changes result in unintended consequences, causing new errors and inconsistencies, as demonstrated in Wang et al. [3].
To support the identification and analysis of structural changes between SNOMED CT releases we introduce a two-phase summarization-based approach. First, we introduce a method to create a human-readable SNOMED CT change log (based on the delta files provided with each release) that captures and summarizes the individual structural changes that occurred in the stated and inferred versions of SNOMED CT. Second, we utilize partial-area taxonomies [4], which summarize structurally and semantically similar concepts, to capture a summary of the major changes that occurred to groups of these similar concepts. We use this two-phase approach to identify significant structural changes in the Bacterial infectious disease subhierarchy. Between the July 2015 and Jan 2016 releases we identified thousands of editing operations applied to hundreds of concepts. We illustrate how the remodeling significantly affected, and continues to affect, the content of the subhierarchy.
Background
SNOMED CT
The Jan 2016 release of SNOMED CT consists of over 300,000 active concepts and over 1.5 million relationships among these concepts. SNOMED CT is released as a series of tab-delimited files that represent the concepts, relationships, and descriptions of the terminology. SNOMED CT is based on Description Logics and it is distributed with both the stated relationships, which were explicitly defined by SNOMED CT editors, and inferred relationships, which were obtained by applying a reasoner to the stated relationships. Each SNOMED CT release comes with a “snapshot” of the terminology, which contains all the content for a given release, and with a “delta,” which identifies the individual changes that occurred between the last two releases. A full history of changes for every concept is also provided.
Various studies have looked at SNOMED CT’s content. Ceusters [5] analyzed changes in 18 releases of SNOMED CT and identified the need for better documentation of changes. Specifically, documentation on what motivated SNOMED CT authors to introduce changes of a certain type was identified as important. Rector et al. [6] and Mortensen et al. [7] looked at quality issues in SNOMED CT content.
Partial-area Taxonomies
We define an Abstraction Network [8] as a hierarchy of nodes, where each node represents a set of structurally similar concepts. It serves as a compact summary of a terminology’s content and structure. In previous work [4, 9–11], we have developed different Abstraction Networks to summarize biomedical terminologies and ontologies, such as the National Cancer Institute thesaurus (NCIt) [12], the Gene Ontology (GO) [13], and SNOMED CT. We have shown that Abstraction Networks support various SNOMED CT use cases, such as quality assurance (QA) [4], characterizing the modeling complexity of SNOMED CT [14], and observing changes to content due to QA [15].
One Abstraction Network we developed to support SNOMED CT QA is the partial-area taxonomy [4, 11], which summarizes groups of structurally and semantically similar concepts. We define a subject subtaxonomy as a partial-area taxonomy rooted at a concept that represents a subject area (e.g., Bacterial infectious disease). In Ochs et al. [4] we looked at the properties of subject subtaxonomies for ten important Clinical finding subhierarchies (e.g., Cancer and Heart disease). We will now illustrate the derivation of a subject subtaxonomy using an excerpt from the Jan 2016 release of SNOMED CT, with Bacterial infectious disease as the selected subject concept.
Given a SNOMED CT subhierarchy rooted at a concept c, we define an area as the set of all concepts in the subhierarchy with the same set of attribute relationship types. Only the types of the relationships are considered; target concepts are disregarded. Areas are named after their set of attribute relationship types (e.g., {Causative agent, Pathological process}). We say that the set of concepts summarized by an area are structurally similar, as the concepts are modeled using the same set of attribute relationship types. All areas are disjoint in terms of the concepts they summarize. We define an area taxonomy as an Abstraction Network composed of areas that are connected by hierarchy child-of links based on the underlying IS-A relationships. We define a root concept of an area as a concept that has no parent concepts in its area. An area A is child-of another area B if a root concept in A has a parent concept in B. Figure 1 (b) shows the area taxonomy derived from the concepts in Figure 1 (a).
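The area definitions above can be illustrated with a minimal sketch. The concept-to-attribute assignments and IS-A links below are hypothetical toy data chosen for illustration, not an extract of Figure 1 or of a SNOMED CT release, and the function names are our own; this is not the BLUSNO implementation.

```python
from collections import defaultdict

# Hypothetical toy data: concept -> set of attribute relationship TYPES.
# Targets are omitted on purpose, since areas disregard them.
attribute_types = {
    "Bacterial infectious disease": {"Causative agent", "Pathological process"},
    "Bacterial sepsis": {"Causative agent", "Pathological process"},
    "Bacterial infection by site": {"Causative agent", "Finding site", "Pathological process"},
    "Bacterial pneumonia": {"Causative agent", "Finding site", "Pathological process"},
}

# Hypothetical IS-A links: concept -> set of parent concepts in the subhierarchy.
parents = {
    "Bacterial infectious disease": set(),
    "Bacterial sepsis": {"Bacterial infectious disease"},
    "Bacterial infection by site": {"Bacterial infectious disease"},
    "Bacterial pneumonia": {"Bacterial infection by site"},
}


def derive_areas(attribute_types):
    """Group concepts by their exact set of attribute relationship types."""
    areas = defaultdict(set)
    for concept, types in attribute_types.items():
        areas[frozenset(types)].add(concept)
    return dict(areas)


def area_roots(area_concepts, parents):
    """A root of an area has no parent inside the same area."""
    return {c for c in area_concepts if not (parents.get(c, set()) & area_concepts)}


def area_child_of(areas, parents):
    """Area A is child-of area B if a root of A has a parent in B."""
    area_of = {c: name for name, concepts in areas.items() for c in concepts}
    links = set()
    for name, concepts in areas.items():
        for root in area_roots(concepts, parents):
            for parent in parents.get(root, set()):
                parent_area = area_of.get(parent)  # None if outside the subhierarchy
                if parent_area is not None and parent_area != name:
                    links.add((name, parent_area))
    return links


areas = derive_areas(attribute_types)
child_of_links = area_child_of(areas, parents)
```

Because each concept has exactly one set of attribute relationship types, keying the areas by that set makes the disjointness of areas hold by construction.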
Figure 1.
(a) An excerpt of 19 concepts from the Bacterial infectious disease subhierarchy. Concepts are shown as white boxes. Arrows represent IS-A relationships. Colored, dashed bubbles indicate that a set of concepts has the same types of attribute relationships. (b) The area taxonomy for the concepts in (a). Areas are shown as colored boxes and labeled with their set of attribute relationships and number of concepts. Upward arrows represent hierarchical child-of links between areas. (c) The subject subtaxonomy for the concepts in (a). Partial-areas are shown as white boxes in their areas and are labeled with the name of their root and their number of concepts. Upward directed arrows indicate child-of links between partial-areas.
A partial-area consists of a root concept and all of its descendant concepts in its area. Partial-areas are named after their root and labeled with the total number of concepts summarized (this will be written using parentheses, e.g., Bacterial sepsis (2)). Partial-areas summarize subhierarchies of semantically similar concepts within each area, since all of the concepts in the partial-area are descendants of the root. Multiple partial-areas may summarize the same concept. A partial-area taxonomy is an Abstraction Network, consisting of partial-area nodes, that refines an area taxonomy by explicitly identifying these subhierarchies of semantically similar concepts. Figure 1 (c) shows the partial-area taxonomy (subject subtaxonomy, since it is rooted at a chosen subject concept) for the concepts in Figure 1 (a). We note that partial-area taxonomies can be derived using inferred or stated relationships. To derive partial-area taxonomies we developed a software tool called BLUSNO [16], which provides a user interface for exploring partial-area taxonomies across multiple SNOMED CT releases.
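As a rough illustration of the partial-area definition, the sketch below assigns each concept of one area to every root it can reach by following IS-A links that stay inside the area. The concept names (A, B, C, D, X, Y) are placeholders, and restricting the walk to in-area parents is our simplifying assumption; it is not taken from the BLUSNO source.

```python
def derive_partial_areas(area_concepts, parents):
    """Return {root: set of concepts} for one area.

    area_concepts: set of concepts sharing the same attribute relationship types.
    parents: dict mapping each concept to its set of IS-A parents.
    """
    # Roots are concepts with no parent inside the area.
    roots = {c for c in area_concepts if not (parents.get(c, set()) & area_concepts)}
    partial_areas = {root: {root} for root in roots}

    for concept in area_concepts:
        # Walk upward through parents that remain inside the area.
        stack, seen = [concept], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in roots:
                partial_areas[current].add(concept)
            stack.extend(parents.get(current, set()) & area_concepts)
    return partial_areas


# Placeholder hierarchy: X and Y lie outside the area; D has two in-area parents.
toy_parents = {"A": {"X"}, "B": {"Y"}, "C": {"A"}, "D": {"A", "B"}}
toy_area = {"A", "B", "C", "D"}
toy_partial_areas = derive_partial_areas(toy_area, toy_parents)
```

In this toy case D is summarized by both partial-areas, which mirrors the observation that multiple partial-areas may summarize the same concept.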
Terminology Diffs
We note that there has been extensive work in the area of detecting changes between terminology releases (“diffs,” e.g., [17, 18]). As noted earlier, SNOMED CT provides pre-computed diff information with each release in the form of delta files. In [19] we observe that terminology diff tools often output an overwhelming amount of information (i.e., tens of thousands of identified changes) when many concepts are modified. The two-phase approach described in this paper visually identifies changes to sets of similar concepts, reducing the amount of information one has to view, and provides a more descriptive list of changes than is provided by many diff tools.
Methods
To determine how SNOMED CT’s content changed it is necessary to have a list of the editing operations that were applied to the concepts. SNOMED CT’s editors do not maintain such a list and there is no record of justifications for changes. We look at the stated changes SNOMED CT’s editors applied and their effect on the inferred version of the terminology. The delta files provided with each release are not “human readable” and one editing operation may be represented by multiple entries. For example, in the stated version of the July 2015 release, the concept Bacterial infection due to Bacillus had a parent Disease due to Gram-positive bacteria that was replaced by a more general parent, Disease. In the delta file this is expressed as two changes: the removal of the IS-A relationship to Disease due to Gram-positive bacteria and the addition of the IS-A relationship to Disease. There are thousands of entries in the delta file due to the changes in the Bacterial infectious disease subhierarchy. While a list of changes will provide a glimpse at the structural editing operations that were applied, such a list does not give the “big picture” of the overall impact of the changes. Thus, a methodology for detecting and visualizing the overall impact of the changes in the inferred version of SNOMED CT is also necessary.
Editing Operation Detection and Summarization
A relationship can be represented as a tuple in the form (source concept, type, destination concept, group). For a relationship ri we abbreviate this as (si, ti, di, gi). In the following we refer to a relationship that no longer exists in a release as an inactive relationship and we refer to a new relationship in a release as an activated relationship. Given the set of activated and inactive relationships, as expressed in one release’s delta files, we can infer the editing operations that were applied and create a “human readable” log of changes that also reduces the overall complexity of the change information. Table 1 identifies five kinds of editing operations that can be detected in the delta file.
Table 1.
Five kinds of editing operations which can be identified from SNOMED CT’s relationship delta file
| # | Editing Operation | Description | Example (from stated relationships delta file) |
|---|---|---|---|
| 1 | Added Relationship | Given an activated relationship r1 there exists no inactive relationship r2 where s1 = s2, t1 = t2, and d1 = d2. d1 is also not an ancestor or descendant of d2. | Carbuncle of ear had a Causative agent relationship added with a target Superkingdom Bacteria. |
| 2 | Removed Relationship | Given an inactive relationship r1 there exists no activated relationship r2 where s1 = s2, t1 = t2, and d1 = d2. d1 is also not an ancestor or descendant of d2. | The Finding site relationship to Finger structure was removed from Tuberculous dactylitis and replaced with “Entire digit”. |
| 3 | More Refined Target | Given an inactive relationship r1 there exists an activated relationship r2 where s1 = s2, t1 = t2, and d2 is a descendant of d1. | The target of Staphylococcal enterocolitis’s Finding site relationship changed from Intestinal structure to Colon structure and Small intestinal structure. |
| 4 | Less Refined Target | Given an inactive relationship r1 there exists an activated relationship r2 where s1 = s2, t1 = t2, and d2 is an ancestor of d1. | Bacterial infection due to Bacillus’s stated parent changed from Disease due to Gram-positive bacteria to Disease. |
| 5 | Relationship Group Changed | Given an activated relationship r1 there exists an inactive relationship r2 where s1 = s2, t1 = t2, and d1 = d2 but g1 does not equal g2. | The stated Causative agent relationship from Plague meningitis to Yersinia pestis was moved from group 1 to group 2. |
We distinguish between hierarchical changes and changes to attribute relationships. We separate editing operations 1 – 4 in Table 1 into operations applied to IS-A relationships and operations applied to attribute relationships (e.g., added parent and added attribute relationship). These editing operations can be computed using the delta for the stated relationships (i.e., editing operations applied by a SNOMED CT editor) or from the delta for the inferred relationships (i.e., the implicit changes that were inferred by the classifier).
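The rules in Table 1 can be sketched as a small classifier over (source, type, destination, group) tuples. This is an illustrative reading of the table rather than the BLUSNO module: the `Rel` type, the `classify` function, and the `ancestors` lookup (destination concept mapped to its set of ancestors) are hypothetical names, and the ancestor map below is supplied by hand instead of being computed from a release.

```python
from typing import NamedTuple


class Rel(NamedTuple):
    source: str
    type: str
    destination: str
    group: int


def classify(activated, inactive, ancestors):
    """Pair activated and inactive relationships into Table 1 editing operations."""
    operations = []
    matched_old = set()
    for new in activated:
        label = None
        for old in inactive:
            # Only relationships with the same source and type can be paired.
            if (old.source, old.type) != (new.source, new.type):
                continue
            if old.destination == new.destination:
                if old.group != new.group:
                    label = "Relationship group changed"
            elif old.destination in ancestors.get(new.destination, set()):
                label = "More refined target"   # new target is a descendant of the old one
            elif new.destination in ancestors.get(old.destination, set()):
                label = "Less refined target"   # new target is an ancestor of the old one
            if label is not None:
                matched_old.add(old)
                operations.append((label, old, new))
                break
        if label is None:
            operations.append(("Added relationship", None, new))
    for old in inactive:
        if old not in matched_old:
            operations.append(("Removed relationship", old, None))
    return operations


# Stated delta entries for Tuberculoma of brain (416265003), type IS-A (116680003).
IS_A = "116680003"
old_parent = Rel("416265003", IS_A, "15202009", 0)   # Tuberculoma
new_parent = Rel("416265003", IS_A, "64572001", 0)   # Disease
# Assumed ancestor lookup: Disease is an ancestor of Tuberculoma.
toy_ancestors = {"15202009": {"64572001"}}
operations = classify([new_parent], [old_parent], toy_ancestors)
```

Splitting the resulting labels into IS-A and attribute relationship variants then only requires checking whether the relationship type is IS-A.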
We note that it is possible to identify other kinds of editing operations using the approach described in Table 1. For example, Tuberculous dactylitis has a removed Finding site relationship to Finger structure and an added Finding site relationship to Entire digit. This could be considered a “changed attribute relationship target” operation. In this study we found relatively few removed relationship editing operations. Thus, for this study, we do not explicitly identify this kind of operation. From the delta files we can also infer why some changes occurred. For example, if a destination concept in a relationship is deactivated then it would necessitate that the target change, if the kind of relationship is still needed to model the source concept.
Below is an example of converting two stated relationship delta entries (left) into a summarized editing operation.
| Active | Source | Type | Destination | Group | Summarized editing operation |
|---|---|---|---|---|---|
| No | 416265003 | 116680003 | 15202009 | 0 | Tuberculoma of brain: Less refined parent Tuberculoma => Disease |
| Yes | 416265003 | 116680003 | 64572001 | 0 | |
To enable our editing operation analysis we created a module for the BLUSNO system that is able to process the delta files and output the editing operations. Output is provided as tab-delimited text.
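The BLUSNO module itself is not reproduced here. As an independent sketch of the first step, the code below splits the rows of an RF2 relationship delta into activated and inactive tuples using the standard RF2 relationship column names. The two rows correspond to the Tuberculoma of brain example; the `id`, `moduleId`, `characteristicTypeId`, and `modifierId` values are placeholders, not real identifiers. Treating every row with `active = 1` as a newly activated relationship is a simplification.

```python
import csv
import io

HEADER = ["id", "effectiveTime", "active", "moduleId", "sourceId",
          "destinationId", "relationshipGroup", "typeId",
          "characteristicTypeId", "modifierId"]

# Placeholder rows: only sourceId, destinationId, typeId, group, and active
# reflect the example in the text; the remaining values are dummies.
ROWS = [
    ["r1", "20150731", "0", "m", "416265003", "15202009", "0", "116680003", "c", "x"],
    ["r2", "20150731", "1", "m", "416265003", "64572001", "0", "116680003", "c", "x"],
]
delta_text = "\n".join("\t".join(row) for row in [HEADER] + ROWS) + "\n"


def split_delta(text):
    """Return (activated, inactive) lists of (source, type, destination, group)."""
    activated, inactive = [], []
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        rel = (row["sourceId"], row["typeId"], row["destinationId"],
               int(row["relationshipGroup"]))
        if row["active"] == "1":
            activated.append(rel)
        else:
            inactive.append(rel)
    return activated, inactive


activated, inactive = split_delta(delta_text)
```

The two resulting lists are the inputs that a Table 1 style classification would consume.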
Reflecting on Change Using Subject Subtaxonomies
Subject subtaxonomies summarize groups of structurally and semantically similar concepts, all of which are related to a chosen subject concept (e.g., Bacterial infectious diseases). By comparing the areas and partial-areas in the partial-area subtaxonomies derived for multiple SNOMED CT releases, one obtains a summary of the major changes that occurred. Consider the examples in Figure 2, which capture how many of the 372 concepts in the Jan 2015 Bacterial infection by site (372) partial-area, in the {Causative agent, Finding site, Pathological process} area, changed in July 2015 and Jan 2016. We identify several significant changes based on the subject subtaxonomies in Figure 2. These changes result in modifications to the areas and partial-areas in the subject subtaxonomy.
Figure 2.
An example of how the changes to the concepts in the Bacterial infection by site (372) partial-area are captured in subject subtaxonomies derived for three SNOMED CT versions. (a) The Jan 2015 {Causative agent, Finding site, Pathological process} area with its single partial-area. (b) The July 2015 version of the area from (a). The partial-area Bacterial infection by site lost 196 concepts and a total of 110 new partial-areas (19 shown) appeared due to changes in the underlying concept hierarchy. (c) Jan 2016 version of the area from (a). The blue partial-areas in (b) were merged back into the Bacterial infection by site partial-area. New partial-areas (e.g., Tetanus (9)) also appeared, indicating a change in their concepts’ semantics. Several partial-areas (e.g., Lepromatous leprosy (2)) still exist in the area. Other partial-areas moved to areas with more relationships (e.g., Tuberculosis of genitourinary system (6) moved to the area {Associated morphology, Causative agent, Finding site, Pathological process} and grew to 19 concepts). The concepts in Tuberculosis of urinary organs (4) moved to this partial-area.
Resulting introduced partial-areas. One major change in July 2015 was the introduction of 110 partial-areas to the {Causative agent, Finding site, Pathological process} area (19 are shown in Figure 2(b)). The concepts in these 110 partial-areas were summarized by Bacterial infection by site (372) in Jan 2015. These concepts (e.g., the root of the Tuberculosis of gastrointestinal tract (9) partial-area in Figure 2(b)) did not change structurally (i.e., they have the same set of attribute relationships in both releases) but they did change semantically (i.e., the hierarchy changed and they are no longer descendants of Bacterial infection by site).
There were no stated editing operations applied to Tuberculosis of gastrointestinal tract. Thus, one needs to look at the inferred changes. Three such changes occurred: the classifier inferred a less refined parent (Disorder of gastrointestinal tract), it inferred a new Causative agent attribute relationship to Genus Mycobacterium, and a Pathological process attribute relationship was moved to a relationship group with the inherited Causative agent relationship. The stated editing operations that led to this change were applied to Mycobacteriosis, the parent of Tuberculosis.
By investigating the root concepts of the other new partial-areas in Figure 2(b) we observe a similar phenomenon. All of the (former) descendant concepts of Bacterial infection by site that did not have the same relationship group organization (i.e., the Finding site attribute was not assigned to a relationship group with Causative agent and Pathological process attribute relationships), were no longer classified under Bacterial infection by site after July 2015, and thus, appeared as new partial-areas. Looking at the Jan 2016 version of this area in Figure 2(c), one observes that some of these partial-areas are still in the area (e.g., Lepromatous leprosy (2)). One can also see that several partial-areas were added (e.g., Tetanus (9)), again indicating that sets of concepts are no longer being inferred as descendants of Bacterial infection by site.
Resulting removed partial-areas. Some of the partial-areas introduced in the July 2015 {Causative agent, Finding site, Pathological process} area no longer exist in the area in Jan 2016. The concepts in certain partial-areas (colored blue in Figure 2(b)) returned to Bacterial infection by site in Jan 2016 due to some combination of editing operations. Other partial-areas, however, moved to entirely different areas, indicating a change in structure.
Resulting changes to areas. When a concept moves from one area to another it indicates a change in the structure of the concept. Specifically, it means additional (or fewer) types of attribute relationships are used to define the concept. For example, the partial-area Tuberculosis of bones and/or joints (20), in {Causative agent, Finding site, Pathological process} in July 2015, moved to {Associated morphology, Causative agent, Finding site, Pathological process} in Jan 2016. Additionally, eight concepts (e.g., Tuberculosis of hip and Tuberculosis synovitis) were inferred as belonging to this partial-area’s subhierarchy. In July 2015, four of the eight concepts were summarized by only Bacterial arthritis (33) in {Associated morphology, Causative agent, Finding site, Pathological process} (not shown in Figure 2(b)). In Jan 2016 the concepts are summarized by both the Tuberculosis of bones and/or joints (28) and Bacterial arthritis (37) partial-areas. Other concepts (e.g., Tuberculous necrosis of bone) were each summarized by a singleton partial-area (i.e., one with only one concept) in {Associated morphology, Causative agent, Finding site, Pathological process} in July 2015 but were absorbed into Tuberculosis of bones and/or joints (28) in Jan 2016. Additionally, the concepts from Tuberculosis of gastrointestinal tract (9) in July 2015 are in Bacterial gastroenteritis (24) in Jan 2016, indicating a change in both structure and semantics.
From these examples one can see that the differences between subject subtaxonomies can be used to identify major changes to the hierarchy (via the addition and removal of partial-areas, indicating significant modifications and disruptions) and structure of the concepts (via sets of concepts moving between areas). One does not need to review individual concepts. There are typically significantly fewer partial-areas than concepts; thus, one can review significantly less information, saving editorial resources and enabling a better orientation to the impact of the remodeling. Only when a significant change is identified via the subject subtaxonomy would an editor need to review the individual changes that occurred. With a better orientation into the global impact of the changes one will have a greater chance of discovering unwanted side effects of the remodeling and can focus future editorial work on correcting the incorrect inferences.
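A comparison of two subject subtaxonomies along these lines can be sketched as a diff over partial-areas. The representation (a dictionary keyed by area name and root name) and the function name are our own assumptions. The partial-area names and sizes are taken from the examples above, but the member concepts are numbered placeholders and the two toy subtaxonomies are heavily abbreviated.

```python
def diff_subtaxonomies(old, new):
    """Compare {(area, root): set of concepts} mappings for two releases.

    A root concept belongs to exactly one area, so roots identify partial-areas.
    """
    old_area = {root: area for area, root in old}
    new_area = {root: area for area, root in new}
    shared = set(old_area) & set(new_area)

    introduced = sorted(set(new_area) - set(old_area))
    removed = sorted(set(old_area) - set(new_area))
    moved = {r: (old_area[r], new_area[r]) for r in shared if old_area[r] != new_area[r]}
    resized = {}
    for r in shared:
        before, after = len(old[(old_area[r], r)]), len(new[(new_area[r], r)])
        if before != after:
            resized[r] = (before, after)
    return introduced, removed, moved, resized


def members(n):
    """Placeholder concept identifiers; real members would come from a release."""
    return {f"concept-{i}" for i in range(n)}


CFP = "{Causative agent, Finding site, Pathological process}"
ACFP = "{Associated morphology, Causative agent, Finding site, Pathological process}"

july_2015 = {
    (CFP, "Lepromatous leprosy"): members(2),
    (CFP, "Tuberculosis of bones and/or joints"): members(20),
    (CFP, "Tuberculosis of gastrointestinal tract"): members(9),
}
jan_2016 = {
    (CFP, "Lepromatous leprosy"): members(2),
    (CFP, "Tetanus"): members(9),
    (ACFP, "Tuberculosis of bones and/or joints"): members(28),
}

introduced, removed, moved, resized = diff_subtaxonomies(july_2015, jan_2016)
```

An editor would then inspect only the partial-areas reported in these four results, rather than every concept in the subhierarchy.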
Principles Guiding the Remodeling of Bacterial Infectious Diseases
The above methodologies, implemented in the BLUSNO software tool, allow a user to identify what changed in a summarized manner. However, to understand why the content changed, one needs to look at and understand some of the modeling principles followed by IHTSDO’s editors. The end goal of the remodeling effort was to enable SNOMED CT’s classifier to infer most of the hierarchical connections between concepts in the subhierarchy. This will save SNOMED CT’s editors significant time and effort. Furthermore, the prior modeling of the Bacterial infectious disease concepts led to incorrect inferences that affected the quality of the content.
No proximal primitive parent: One common design pattern in SNOMED CT is the use of a closest proximal primitive parent in the stated modeling of a concept. A proximal primitive parent is the closest parent in the hierarchy to a concept being modeled that is not fully defined [20]. For example, in the Jan 2016 stated version of SNOMED CT, the closest proximal primitive parent of the concept Furuncle of face is Disease. By modeling concepts using the closest proximal primitive parent design pattern, the SNOMED CT classifier can more accurately infer parents and subtypes. In Jan 2015 few of the concepts in the Bacterial infectious disease hierarchy followed this pattern. The major focus of the hierarchy redesign has been to consistently apply this pattern to the content.
Many of the concepts in the Bacterial infectious disease subhierarchy (1404, 68.9% in Jan 2015) that did not follow the proximal primitive parent pattern had one or more stated parents that were fully defined. These parents are unnecessary and often cause incorrect inferences. When these concepts are modeled with a Pathological process relationship with a target of Infectious process and a Causative agent with a target of some kind of bacteria, the classifier auto-classifies the concepts into the Bacterial infectious disease subhierarchy.
Inconsistent relationship grouping: A concept may have multiple relationships of the same type (e.g., multiple Finding site relationships, like Boil of lower limb in Figure 3(b)). To correctly associate sets of related relationships SNOMED CT includes a mechanism to organize relationships into sets called relationship groups [21]. In the Bacterial infectious disease subhierarchy the attribute relationships for many of the concepts were not consistently organized into relationship groups.
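The two modeling issues described above could be checked mechanically along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, relationship group 0 is taken to mean "ungrouped" (the RF2 convention), and the before and after models are invented placeholders rather than the actual modeling of any SNOMED CT concept.

```python
def has_fully_defined_stated_parent(stated_parents, fully_defined):
    """True if any stated parent is fully defined, i.e. the concept does not
    follow the proximal primitive parent pattern."""
    return any(parent in fully_defined for parent in stated_parents)


def has_no_relationship_group(attribute_rels):
    """True if every attribute relationship is ungrouped (group 0).
    Note: a concept with no attribute relationships also returns True."""
    return all(group == 0 for _type, _target, group in attribute_rels)


def grouped_together(attribute_rels, required_types):
    """True if some non-zero group contains all of the required types."""
    types_by_group = {}
    for rel_type, _target, group in attribute_rels:
        if group != 0:
            types_by_group.setdefault(group, set()).add(rel_type)
    return any(required_types <= types for types in types_by_group.values())


REQUIRED = {"Causative agent", "Finding site", "Pathological process"}

# Placeholder models: (type, target, group). Targets are illustrative only.
before = [("Causative agent", "some bacteria", 0),
          ("Finding site", "some body structure", 0)]
after = [("Causative agent", "some bacteria", 1),
         ("Finding site", "some body structure", 1),
         ("Pathological process", "Infectious process", 1)]

fully_defined_concepts = {"Fully defined parent (placeholder)"}
```

For example, `grouped_together(after, REQUIRED)` holds for the remodeled placeholder, whereas the ungrouped `before` model fails the same check.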
Figure 3.
Examples of stated modeling for concepts remodeled between Jan 2015 and Jan 2016. (a-d) Show the modeling of Boil of lower limb and (e-h) show the modeling of Staphylococcal infection of skin. (a) Stated, Jan 2015. (b) Stated, Jan 2016. (c) Inferred, Jan 2015. (d) Inferred, Jan 2016. (e) Stated, Jan 2015. (f) Stated, Jan 2016. (g) Inferred, Jan 2015. (h) Inferred, Jan 2016.
A large portion of the subhierarchy’s concepts (1857, 91.2% of the hierarchy) had at least one of the above issues and were considered to be inconsistent. The remodeling process was carried out manually by one editor (JTC), who modified each concept to fit the model. Several editing operations may be required to remodel a concept. Figure 3 (a) and (b) illustrate the stated remodeling of Boil of lower limb. This concept required a closest proximal primitive parent (Disease) and several stated attribute relationships. Figure 3 (e) and (f) illustrate the remodeling of Staphylococcal infection of skin. The kind of stated remodeling shown in Figure 3 was applied to nearly 1000 concepts between the Jan 2015 and Jan 2016 releases.
Consider again Tuberculosis of gastrointestinal tract. The parent that was removed in July 2015, Bacterial gastrointestinal infectious disease, underwent remodeling to fit the closest proximal primitive parent concept model and all of its attribute relationships were assigned the same relationship group. However, in July 2015 Tuberculosis of gastrointestinal tract was not yet remodeled to fit the closest proximal primitive parent model. Thus, its Finding site attribute relationship was not in the same relationship group as its Causative agent and Pathological process attribute relationships. Due to this difference, SNOMED CT’s classifier could not infer Bacterial gastrointestinal infectious disease as a parent of Tuberculosis of gastrointestinal tract in July 2015.
In Jan 2016 Tuberculosis of gastrointestinal tract was remodeled to fit the closest proximal primitive parent model. The classifier inferred a more refined parent, Bacterial gastroenteritis, a child of Bacterial gastrointestinal infectious disease. Thus, in Jan 2016, Tuberculosis of gastrointestinal tract was inferred with a more specific classification.
By reviewing the partial-areas that appeared in {Causative agent, Finding site, Pathological process} in July 2015 and then “merged back” into the Bacterial infection by site partial-area in Jan 2016, we observe that their root concepts all underwent remodeling to fit the closest proximal primitive parent model. Once this remodeling was complete the concepts summarized by these partial-areas were correctly inferred as being descendants of Bacterial infection by site. As a larger proportion of the hierarchy is remodeled in future releases, we expect that the concepts in the remaining partial-areas (e.g., Lepromatous leprosy (2)) will eventually “merge” back into the Bacterial infection by site partial-area or move to another area, indicating that their remodeling has been completed.
Results
We obtained the Jan 2015, July 2015, and Jan 2016 international releases of SNOMED CT in RF2 format [22]. For each release we used our BLUSNO tool to create a subject subtaxonomy rooted at Bacterial infectious disease using each concept’s inferred relationships. For each subject subtaxonomy we determined the area and partial-area(s) that summarized each concept. For the July 2015 and Jan 2016 releases we identified the stated editing operations and inferred changes that affected the subhierarchy’s active concepts. Table 2 provides metrics for the subhierarchy and subject subtaxonomies for each release and illustrates the desired improvements that were expected from the remodeling effort (e.g., a decrease in primitive concepts and a decrease in the use of stated fully defined parents).
Table 2.
Bacterial infectious disease subhierarchy and subject subtaxonomy metrics across the three releases
| Metric | Jan 2015 | July 2015 | Jan 2016 |
|---|---|---|---|
| # Concepts | 2037 | 2006 | 2066 |
| # Primitive concepts | 847 | 685 | 519 |
| # Concepts with a Fully Defined Stated Parent | 1404 | 893 | 407 |
| # Concepts with No Relationship Group | 1368 | 949 | 508 |
| # Concepts that Don’t Follow Closest Proximal Primitive Parent Model (approximately) | 1857 | 1214 | 622 |
| # Areas | 26 | 27 | 26 |
| # Partial-areas | 431 | 573 | 514 |
In the July 2015 release we identified 2218 editing operations that affected 541 (/2006 = 27.0%) active concepts in the subhierarchy. In the Jan 2016 release there were 2215 editing operations that affected 560 (/2066 = 27.1%) concepts. There were 17 concepts that had editing operations applied in both releases. In the inferred version of the subhierarchy the stated editing operations led to many inferred changes. In the July 2015 release there were 7209 implicit changes that affected 1791 (/2066 = 86.7%) concepts. In the Jan 2016 release there were 4543 implicit changes that affected 1016 (/2066 = 49.2%) concepts. A total of 742 concepts were affected in both releases. Table 3 lists the various editing operations that were applied to the subhierarchy, and their frequency for each release.
Table 3.
Number of concepts with a stated editing operation or affected inferentially during the reasoning process
| Editing operation | Stated editing operations: # concepts in July 2015 | Stated editing operations: # concepts in Jan 2016 | Inferred changes: # concepts in July 2015 | Inferred changes: # concepts in Jan 2016 |
|---|---|---|---|---|
| Added parent | 19 | 5 | 147 | 198 |
| Added attribute relationship | 531 | 536 | 1052 | 150 |
| Less refined parent | 462 | 503 | 265 | 107 |
| Less refined attribute relationship | 14 | 15 | 16 | 22 |
| More refined parent | 4 | 7 | 59 | 174 |
| More refined attribute relationship | 9 | 17 | 20 | 85 |
| Relationship group changed | 141 | 138 | 1727 | 822 |
| Removed parent | 98 | 44 | 336 | 165 |
| Removed attribute relationship | 7 | 12 | 43 | 625 |
Several editing operations were typically applied to each remodeled concept, with several instances of a given type of editing operation typically applied. For example, Salmonella infection had one instance of a “less refined stated parent” editing operation and two instances of an “added attribute relationship” editing operation. Most concepts (346 in July 2015, 365 in Jan 2016) had two kinds of editing operations applied. The most common pair of editing operations was making a stated parent less refined and adding one or more attribute relationships (e.g., the examples in Figure 3). The most common less refined parent was Disease, with 494 instances in July 2015 and 553 in Jan 2016. Many concepts had several stated fully defined parents replaced by a single stated Disease parent. The second most common stated parent was Sexually transmitted infectious disease (a primitive concept).
From Table 2 we observe two important aspects of the changes to the subhierarchy. First, the percentage of primitive concepts decreased significantly between the releases. A higher proportion of fully defined concepts is desirable because it indicates that more concepts are sufficiently defined and are distinguished from their parent concepts. Fully defined concepts can also be used by SNOMED CT’s classifier to infer additional relationships.
Second, we observed that the size of the Bacterial infectious disease subhierarchy changed with each release. In the July 2015 release 136 concepts were removed from the subhierarchy and 105 concepts were added. Some of this change was the result of deactivating concepts. However, many of the removed concepts (97 total) were still active in July 2015 but were no longer inferred as belonging to the subhierarchy (e.g., Bacterial tonsillitis, Brodie’s abscess, and Traveler’s diarrhea). The continued remodeling for the Jan 2016 release saw only 34 concepts removed from the subhierarchy and the addition of 94 concepts. Out of the 94 added concepts, 63 had been removed in the July 2015 release and are now correctly inferred as belonging to the subhierarchy (e.g., Traveler’s diarrhea). A total of 71 concepts that were in the subhierarchy in Jan 2015, and are still active in Jan 2016, have not returned to the Bacterial infectious disease subhierarchy (e.g., Brodie’s abscess and Tuberculosis of abdomen).
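The membership bookkeeping described above amounts to set differences between the inferred subhierarchies of consecutive releases. The sketch below uses toy membership sets; the variable names, the helper `membership_changes`, and the "Toy new concept" entry are assumptions for illustration. The final lines check that the reported counts are mutually consistent, assuming that 2006 and 2066 are the July 2015 and Jan 2016 subhierarchy sizes.

```python
def membership_changes(before, after):
    """Return (removed, added) concept sets between two releases."""
    return before - after, after - before


# Toy membership sets; the real subhierarchy has roughly 2000 concepts.
jan_2015 = {"Tetanus", "Traveler's diarrhea", "Brodie's abscess"}
july_2015 = {"Tetanus", "Toy new concept"}
jan_2016 = {"Tetanus", "Toy new concept", "Traveler's diarrhea"}
# Concepts that are still active in Jan 2016, inside or outside the subhierarchy.
active_jan_2016 = jan_2016 | {"Brodie's abscess"}

removed_july, added_july = membership_changes(jan_2015, july_2015)
removed_jan, added_jan = membership_changes(july_2015, jan_2016)

# Concepts dropped in July 2015 that were inferred back in Jan 2016.
returned = added_jan & removed_july
# Concepts in the Jan 2015 subhierarchy that are still active but have not returned.
not_returned = (jan_2015 - jan_2016) & active_jan_2016

# Consistency check on the reported counts (assumed to be subhierarchy sizes).
size_july_2015, removed_count, added_count = 2006, 34, 94
size_jan_2016 = size_july_2015 - removed_count + added_count
```

With the toy sets, `returned` contains Traveler's diarrhea and `not_returned` contains Brodie's abscess, matching the roles these concepts play in the examples above.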
Discussion
The Bacterial infectious disease subhierarchy has undergone significant remodeling over the last year. The same effort is planned for the entire Infectious disease subhierarchy, with 6,239 concepts in Jan 2016. A similar effort is ongoing for the Congenital disease subhierarchy. The benefit of the remodeling effort is that the concepts will follow a consistent model, enabling the classifier to infer most of the hierarchical connections between concepts. The remodeling of the Bacterial infectious disease subhierarchy will reduce the amount of future work needed to maintain the subhierarchy in later releases.
The human readable editing operation summary and the partial-area-taxonomy-based methodology for identifying important changes allow us to identify only what changed and how it changed, not why it changed. However, one can infer some of the reasons. For example, one can see that the majority of stated editing operations were making a stated parent less specific, adding new stated attribute relationships, and assigning attribute relationships to relationship groups. Most of the modified concepts had some combination of these three editing operations applied. Indeed, the remodeling of the Bacterial infectious disease subhierarchy was initiated to address such an issue. Specifically, many Bacterial infectious disease concepts did not follow a standardized model, which sometimes yielded erroneous and spurious inferences after classification.
The significant amount of change from this remodeling effort is reflected in the Bacterial infectious disease subject subtaxonomies. From Table 2, we observe that the number of partial-areas increased greatly between the Jan 2015 and July 2015 releases, while the number of areas largely stayed the same and the number of concepts slightly decreased. The number of concepts in each area also did not change significantly. This indicates a significant change in semantics. In the July 2015 subtaxonomy we observed the addition of 214 partial-areas (110 of which were in the {Causative agent, Finding site, Pathological process} area) and the removal of 72 partial-areas. In total, 287 concepts were summarized by new partial-areas while still being in the same area and only 17 concepts were summarized by a new partial-area in a different area, indicating that the amount of structural change was minimal.
Jan 2016 saw fewer added partial-areas (76). A total of 74 concepts were in the same area but in new partial-areas (e.g., Tetanus in Figure 2(c)). In Jan 2016 there were significantly more concepts summarized by new partial-areas in different areas (60, e.g., Tuberculosis of bones and/or joints). In comparison to the July 2015 subject subtaxonomy, there were relatively many removed partial-areas (135 total), summarizing 159 concepts. A total of 106 of these concepts returned to their original Jan 2015 partial-areas (e.g., the blue partial-areas in Figure 2(b)). A total of 93 of the partial-areas removed in Jan 2016 only existed in July 2015. The Jan 2016 release thus saw a greater change in attribute relationship structure. For example, 130 concepts moved to the {Associated morphology, Causative agent, Finding site, Pathological process} area, either as part of their partial-area (e.g., Tuberculosis of bones and/or joints) or to a new partial-area (e.g., Tuberculosis of gastrointestinal tract). These concepts were thus modeled with one or more additional kinds of attribute relationships versus their Jan 2015 modeling.
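The two kinds of movement counted above (a new partial-area within the same area versus a new partial-area in a different area) can be sketched as a comparison of two concept-to-partial-area maps. This is an illustrative simplification, not the implementation used in this study: it assumes each concept is summarized by exactly one partial-area, identified by an (area, root) pair, and the root names and the placement of the example concepts are hypothetical.

```python
# Areas are identified by their set of attribute relationship types.
CA_FS_PP = frozenset({"Causative agent", "Finding site", "Pathological process"})
AM_CA_FS_PP = CA_FS_PP | {"Associated morphology"}

# concept -> (area, partial-area root). Roots and placements are hypothetical.
july_2015 = {
    "Tetanus": (CA_FS_PP, "Toy root X"),
    "Tuberculosis of bones and/or joints": (CA_FS_PP, "Toy root Y"),
    "Toy unchanged concept": (CA_FS_PP, "Toy root X"),
}
jan_2016 = {
    "Tetanus": (CA_FS_PP, "Toy root Z"),
    "Tuberculosis of bones and/or joints": (AM_CA_FS_PP, "Toy root W"),
    "Toy unchanged concept": (CA_FS_PP, "Toy root X"),
}


def classify_moves(prev, curr):
    """Group concepts that are summarized by a partial-area new to `curr`."""
    prev_partial_areas = set(prev.values())
    moves = {"same_area": [], "different_area": []}
    for concept, (area, root) in curr.items():
        if concept not in prev:
            continue  # concept was not in the previous subtaxonomy
        if (area, root) in prev_partial_areas:
            continue  # partial-area already existed in the previous release
        prev_area, _ = prev[concept]
        key = "same_area" if area == prev_area else "different_area"
        moves[key].append(concept)
    return {key: sorted(names) for key, names in moves.items()}


moves = classify_moves(july_2015, jan_2016)
```

A fuller treatment would allow a concept to belong to several partial-areas and would also track concepts that move into partial-areas that already existed.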
One surprising aspect of the remodeling is how disruptive the process has been. As highlighted by the addition and removal of hundreds of partial-areas over the two releases, there has been significant change to these concepts. The concepts in the 110 partial-areas that “emerged” from the Bacterial infection by site partial-area in the July 2015 release were no longer classified as bacterial infections by site. As additional concepts undergo remodeling they will be reclassified into the Bacterial infection by site subhierarchy; however, this process is incomplete, and this subhierarchy is expected to remain inconsistent for several more SNOMED CT releases.
The editing tool that was used to remodel the content, the IHTSDO Workbench [23], does not support certain collaborative terminology development functionality, such as branching. Typically, when undertaking a significant development project, a terminology will be “branched” and all modifications are made to a separate copy of the terminology. Only after the entire task has been completed will the changes be integrated into the release version of the terminology. In contrast, for the Bacterial infectious disease remodeling project, incomplete remodeling was released in the July 2015 and Jan 2016 releases.
Conclusions
In this paper we described a two-phase methodology to track important changes in a SNOMED CT subhierarchy. We introduced a method of summarizing the editing operations that were applied to a subhierarchy based on the delta files provided in each SNOMED CT release. We also illustrated how partial-area taxonomies highlight the major changes that occurred in the Bacterial infectious disease subhierarchy. We utilized this methodology to analyze the impact of remodeling the Bacterial infectious disease subhierarchy, whose concepts are being modified to fit a standardized concept model. This methodology can help track remodeling changes in other subhierarchies.
Acknowledgement
Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number R01 CA190779. The content is solely the responsibility of the authors and does not necessarily represent the views of the National Institutes of Health. This work was supported in part by the Intramural Research Program of the NIH, National Library of Medicine.
References
- 1. Stearns M. Q, Price C, Spackman K. A, et al. SNOMED clinical terms: overview of the development process and project status. Proc AMIA Annu Symp. 2001:662–6.
- 2. IHTSDO. International Health Terminology Standards Development Organization (IHTSDO) 2012 [cited 2012 4 August 2012]. Available from: http://www.ihtsdo.org/
- 3. Wang Y, Halper M, Wei D, et al. Auditing complex concepts of SNOMED using a refined hierarchical abstraction network. J Biomed Inform. 2012;45(1):1–14. doi: 10.1016/j.jbi.2011.08.016.
- 4. Ochs C, Geller J, Perl Y, et al. Scalable Quality Assurance for Large SNOMED CT Hierarchies Using Subject-based Subtaxonomies. J Am Med Inform Assoc. 2014;22(3):507–18. doi: 10.1136/amiajnl-2014-003151.
- 5. Ceusters W. Applying Evolutionary Terminology Auditing to SNOMED CT. AMIA Annu Symp Proc. 2010;2010:96–100.
- 6. Rector A. L, Brandt S, Schneider T. Getting the foot out of the pelvis: modeling problems affecting use of SNOMED CT hierarchies in practical applications. J Am Med Inform Assoc. 2011;18(4):432–40. doi: 10.1136/amiajnl-2010-000045.
- 7. Mortensen J. M, Minty E. P, Januszyk M, et al. Using the wisdom of the crowds to find critical errors in biomedical ontologies: a study of SNOMED CT. J Am Med Inform Assoc. 2015;22(3):640–8. doi: 10.1136/amiajnl-2014-002901.
- 8. Halper M, Gu H, Perl Y, et al. Abstraction Networks for Terminologies: Supporting Management of “Big Knowledge”. Artificial intelligence in medicine. 2015;64(1):1–16. doi: 10.1016/j.artmed.2015.03.005.
- 9. Min H, Perl Y, Chen Y, et al. Auditing as part of the terminology design life cycle. J Am Med Inform Assoc. 2006;13(6):676–90. doi: 10.1197/jamia.M2036.
- 10. Ochs C, Perl Y, Geller J, et al. Quality Assurance of the Gene Ontology Using Abstraction Networks. Journal of Bioinformatics and Computational Biology. 2015 doi: 10.1142/S0219720016420014. In press.
- 11. Wang Y, Halper M, Min H, et al. Structural methodologies for auditing SNOMED. J Biomed Inform. 2007;40(5):561–81. doi: 10.1016/j.jbi.2006.12.003.
- 12. Fragoso G, de Coronado S, Haber M, et al. Overview and utilization of the NCI thesaurus. Comp Funct Genomics. 2004;5(8):648–54. doi: 10.1002/cfg.445.
- 13. Ashburner M, Ball C. A, Blake J. A, et al. Gene Ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25–9. doi: 10.1038/75556.
- 14. Wei D, Gu H, Perl Y, et al. Structural measures to track the evolution of SNOMED CT hierarchies. J Biomed Inform. 2015;57:278–87. doi: 10.1016/j.jbi.2015.08.001.
- 15. Ochs C, Perl Y, Geller J, et al. Using Aggregate Taxonomies to Summarize SNOMED CT Evolution. International Workshop on Biomedical and Health Informatics. 2015:1008–15.
- 16. Geller J, Ochs C, Perl Y, et al. New Abstraction Networks and a New Visualization Tool in Support of Auditing the SNOMED CT Content. AMIA Annu Symp Proc. 2012:237–46.
- 17. Hartung M, Groß A, Rahm E. COnto–Diff: generation of complex evolution mappings for life science ontologies. J Biomed Inform. 2013;46(1):15–32. doi: 10.1016/j.jbi.2012.04.009.
- 18. Noy N. F, Musen M. Promptdiff: A fixed-point algorithm for comparing ontology versions. AAAI/IAAI 2002. 2002:744–50.
- 19. Ochs C, Perl Y, Geller J, et al. Summarizing and Visualizing Structural Changes during the Evolution of Biomedical Ontologies Using a Diff Abstraction Network. J Biomed Inform. 2015;56:127–44. doi: 10.1016/j.jbi.2015.05.018.
- 20. Hernandez P, Richardson C. 2014 [26 February 2016]. Authoring SNOMED CT: Generic Authoring Principles SNOMED CT Implementation Showcase: IHTSDO. Available from: http://ihtsdo.org/fileadmin/user_upload/doc/showcase/show14/SnomedCtShowcase2014_Present_14107.pdf.
- 21. Cornet R, Schulz S. Relationship groups in SNOMED CT. Studies in health technology and informatics. 2009;150:223–7.
- 22. Ceusters W. SNOMED CT’s RF2: Is the Future Bright? Studies in health technology and informatics. 2011;169:829–33.
- 23. IHTSDO Workbench. Available from: http://www.ihtsdo.org/news/article/view/ihtsdo-launches-global-health-terminology-workbench/



