Abstract
Domain-specific common data elements (CDEs) are emerging as an effective approach to standards-based clinical research data storage and retrieval. A limiting factor, however, is the lack of robust automated quality assurance (QA) tools for the CDEs in clinical study domains. The objectives of the present study are to prototype and evaluate a QA tool for the study of cancer CDEs using a post-coordination approach. The study starts by integrating the NCI caDSR CDEs and The Cancer Genome Atlas (TCGA) data dictionaries in a single Resource Description Framework (RDF) data store. We designed a compositional expression pattern based on the Data Element Concept model structure informed by ISO/IEC 11179, and developed a transformation tool that converts the pattern-based compositional expressions into the Web Ontology Language (OWL) syntax. Invoking reasoning and explanation services, we tested the system utilizing the CDEs extracted from two TCGA clinical cancer study domains. The system could automatically identify duplicate CDEs, and detect CDE modeling errors. In conclusion, compositional expressions not only enable reuse of existing ontology codes to define new domain concepts, but also provide an automated mechanism for QA of terminological annotations for CDEs.
Introduction
Domain-specific common data elements (CDEs) are emerging as an effective approach to standards-based clinical research data storage and retrieval. Notably, the National Cancer Institute (NCI) has implemented the Cancer Data Standards repository (caDSR) that adopted the ISO/IEC 11179 Metadata Registry (MDR) standard1, 2. ISO/IEC 11179 defines a standard model of a meta-data registry - a registry of information about data models*. Along with the expected provenance and workflow information such as the creator, owner, workflow state, etc., part 3 of the ISO/IEC 11179 model also describes a model for formally associating data model elements with their intended meaning. It is the common meaning aspect of the 11179 standard that allows one to determine that two data elements from two different models are alternative representations of the same real world entity.
The caDSR has attempted to record the intended meaning of many of the data elements that support cancer study data collection and reporting. These elements, in turn, are used to identify data elements in relevant domains. One example is The Cancer Genome Atlas (TCGA) Biospecimen Core Resource (BCR) data dictionary, which is used to create clinical data collection forms for a number of clinical cancer genome study domains3. The viability and usefulness of this approach depend on the quality of the reference caDSR elements themselves. Earlier studies4, 5 have uncovered serious issues with at least some of the caDSR definitions and have highlighted a need for robust, principled and automated quality assurance (QA) tools for the CDEs in cancer study domains.
Data element meanings, as recorded in the caDSR, frequently use a simple form of “post-coordination”, where a primary (focus) concept is identified along with one or more secondary identifiers that modify or qualify the intended target meaning. Taken on its own, this approach does not lend itself to automated validation and consistency checking. If, however, the constellation of identifiers were assembled into a compound expression of the sort used in SNOMED CT and OWL, it would be possible to use a description logic (DL) classifier to evaluate the various definitions for completeness and consistency. As an example, consider a data element about “joint pain in the hands”. This could be assembled as a simple collection, with a focus concept of “pain” juxtaposed with “joint” and “hand”, or it could be constructed as a formal DL construct of “pain AND finding_site SOME hand AND finding_site SOME joint” (meaning the set of all things that simultaneously (1) are instances of pain, (2) have a finding_site of a hand and (3) have a finding_site of a joint). The latter form allows an automated reasoner to determine the relationship of the target data element with similar constructs. A DL reasoner would be able to determine that this data element described both “hand pain” and “joint pain”, and that it was related to, but more specific than, a data element that was just about “joint pain” (which could be describing ankle pain, knee pain, etc). Given sufficient information†, a DL reasoner would also be able to determine that a data element about, say, “joint pain in the eyebrow” was most likely nonsensical, as no possible instance could simultaneously be both. While the above example may seem contrived, mistakes of this sort can and do occur in data element definitions. 
As an example, a data element that confuses “apoptosis” as a process with “apoptosis” as a morphological abnormality‡ may, in turn, confuse data elements that describe genetic processes with descriptions of dead cells themselves.
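The subsumption reasoning in the “joint pain in the hands” example can be sketched in a few lines of Python. This is only a toy structural check over conjunctive expressions, not a DL reasoner. The `PARENT` hierarchy, the `Expression` type and the `subsumes` function are hypothetical names introduced for illustration and are not taken from SNOMED CT, NCIt or the tooling used in this study.

```python
from typing import FrozenSet, NamedTuple, Optional, Tuple

# Toy is-a hierarchy (child -> parent). Hypothetical, for illustration only.
PARENT = {"hand": "body part", "joint": "body part", "pain": "finding"}


def is_a(child: Optional[str], ancestor: str) -> bool:
    """True if `child` equals `ancestor` or lies below it in PARENT."""
    while child is not None:
        if child == ancestor:
            return True
        child = PARENT.get(child)
    return False


class Expression(NamedTuple):
    """A focus concept plus a set of (role, filler) existential restrictions."""
    focus: str
    restrictions: FrozenSet[Tuple[str, str]]


def subsumes(general: Expression, specific: Expression) -> bool:
    """Structural check: every constraint in `general` is met by `specific`."""
    if not is_a(specific.focus, general.focus):
        return False
    return all(
        any(role == g_role and is_a(filler, g_filler)
            for role, filler in specific.restrictions)
        for g_role, g_filler in general.restrictions
    )


joint_pain = Expression("pain", frozenset({("finding_site", "joint")}))
hand_joint_pain = Expression(
    "pain", frozenset({("finding_site", "hand"), ("finding_site", "joint")})
)

print(subsumes(joint_pain, hand_joint_pain))  # True: the hand case is more specific
print(subsumes(hand_joint_pain, joint_pain))  # False: "joint pain" need not be in the hand
```

A real reasoner works over the full axiom set (including negation and disjointness), so it can also detect unsatisfiable definitions, which this sketch cannot.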
DL-based mechanisms allow ontology curators to formally and unambiguously represent concept meanings and relationships, and to use off the shelf reasoning tools such as HermiT6 to automate the computation of the relationship between two class expressions and consistency checks. Rector, et al. developed an effective quality assurance mechanism using reasoners to incorporate qualifiers (e.g., acute or chronic) in the post-coordination in SNOMED CT7. The editors of the National Cancer Institute Thesaurus (NCIt) use a DL reasoner to check the terminology completeness and consistency8, and the Thesaurus is distributed as a DL based terminology. Horridge, et al. demonstrated how automated DL reasoning, along with a Justification Finding Service can be used as a QA technique for the development of large and complex ontologies such as ICD-119. The post-coordination approach using compositional expressions has been used to build a common ontology to harmonize ICD-11 and SNOMED CT10, 11. In an earlier study, we audited the semantic completeness of the SNOMED CT content using a formal model of the normal forms of SNOMED CT expressions12.
The objective of the present study is to design, develop and evaluate a quality evaluation tool for cancer study CDEs using a post-coordination approach. We propose a compositional expression pattern based on a modified SNOMED CT observable model, which renders the Data Element Concept model structure informed by ISO/IEC 11179. We then developed a transformation tool to convert the pattern-based compositional expressions into the OWL syntax. OWL-based constraints are designed, and the reasoning and explanation services are invoked to detect CDE modeling errors and duplicate CDEs. We then test the system utilizing CDEs from two TCGA clinical cancer study domains.
Materials and Methods
Materials
NCI caDSR CDEs, TCGA Data Dictionary and NCI Thesaurus (NCIt)
We downloaded an XML rendering of all non-retired production CDEs as of August 7, 2014 from the NCI caDSR website13 and an XML rendering of the snapshot of the publicly available TCGA BCR data dictionary from the TCGA website3. We also downloaded the asserted version (i.e., the inputs to a DL reasoner rather than its conclusions) of NCIt 15.01d in OWL format from the NCI Enterprise Vocabulary Services (EVS) website14.
Semantic Web Applications and Tools
We used 4store15, an open source RDF triple store, as the back end for data integration and query, and Protégé, an open source editor and knowledge acquisition system developed at Stanford University16, in combination with the DL reasoner HermiT6, version 1.3.8.3.
Methods
Figure 1 shows the architecture used in our evaluation. Module 1, Data Integration and Services combines the information from the caDSR Common Data Elements and the TCGA data dictionary as a cohesive unit, which allows the SPARQL query services to access the contents of both resources as a single unit. Module 2, Compositional Expression Transformation converts the data element meaning definitions recorded in the caDSR elements into DL expressions which become the inputs to Module 3, OWL-based Quality Assurance which uses the combination of the NCI Thesaurus and additional disjointness axioms to detect potential errors and duplications in the data element definitions. Each of these modules is described in more detail in the following sections.
Figure 1.

System Architecture.
Module 1: Data Integration and Services
The caDSR CDEs and TCGA data dictionary were converted from XML into RDF using the XML2RDF tool that was developed for the Redefer project17. The output was loaded into a 4store RDF triple store instance, and a SPARQL endpoint was added to provide the SPARQL query services. The original XML files and transformed RDF files can be found on the project GitHub site (https://github.com/caCDE-QA/owl-qa). Figure 2 shows a SPARQL query that retrieves all CDEs from the domain “clinical pharmaceutical” together with the data element concept, object class and property data recorded for them in caDSR.
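The general idea of turning XML records into RDF-style triples can be illustrated with a minimal sketch. This is not the XML2RDF tool used in the study. The element names in `SAMPLE_XML`, the `BASE` namespace and the `xml_to_triples` function are assumptions made purely for illustration, and the real caDSR export has a richer structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a caDSR XML export.
SAMPLE_XML = """
<DataElementsList>
  <DataElement>
    <PUBLICID>3378323</PUBLICID>
    <LONGNAME>Clinical Trial Drug Classification Name</LONGNAME>
  </DataElement>
</DataElementsList>
"""

BASE = "http://example.org/cde/"  # hypothetical namespace


def xml_to_triples(xml_text):
    """Emit one (subject, predicate, literal) triple per child element."""
    root = ET.fromstring(xml_text.strip())
    triples = []
    for element in root.findall("DataElement"):
        # The public id identifies the subject resource.
        subject = BASE + element.findtext("PUBLICID")
        for child in element:
            if child.tag != "PUBLICID" and child.text:
                triples.append((subject, BASE + child.tag, child.text.strip()))
    return triples


for triple in xml_to_triples(SAMPLE_XML):
    print(triple)
```

Once both sources are expressed as triples in one store, a single SPARQL query can join across them, which is the property Module 1 relies on.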
Figure 2.

SPARQL query for the “clinical pharmaceutical” domain.
Module 2: Compositional Expression Transformation
ISO/IEC 11179 identifies a basic model of a definition of a Data Element Concept: “A concept that is an association of a property (“a quality common to all members of an object class”) with an object class (“set of ideas, abstractions or things in the real world that are identified with explicit boundaries and meaning and whose properties follow the same rules”)”2. The data element definitions in the NCI caDSR separate the property and object class concept references, an aspect that we were able to take advantage of in an earlier study4. One of the defining concepts in each of the property and object class definitions is identified as the “focus concept” (i.e., one focus property and one focus object class). The relationship between the remaining concepts and the focus is left unspecified. Figure 3 shows the 11179 Data Element Concept model region and Table 1 shows two sample Data Element Concepts as defined in the NCI caDSR, with the primary or focus concept of each definition in bold. The first definition asserts that the object of the data element is “Internal Radiation Therapy” and the property being measured is “Technique”, qualified by “Delivery”. In the second example, the object is asserted to be “Blood Pressure”, qualified by “Person”, and the property of the blood pressure is “Assessment”.
Figure 3.

Data Element Concept Definition
Table 1.
Sample Data Element Concept Definitions
| Data Element Id | Name | Object Class Concepts | Object Class Concept Meanings | Property Concepts | Property Concept Meanings |
|---|---|---|---|---|---|
| 2201422 | Brachytherapy Delivery Technique | C15195 | Internal Radiation Therapy | C16847 C61560 | Technique Delivery |
| 2004291 | Diastolic Blood Pressure | C54706 C25190 | Blood Pressure Person | C25367 | Assessment |
It became apparent that we were dealing with two levels of modeling. On the most fundamental level, we were working with data elements as defined in ISO 11179 (“unit of data that is considered in context to be indivisible”)2. In this context, the data elements “Adjuvant Postoperative Pharmaceutical Therapy Administered Indicator” (3397567), “Year of Death” (2897030) and “Initial Pathology Diagnosis Method” (2757948) all have the same object class/property definitional structure. Considerable value could be added, however, if the data elements could be correctly positioned in a more sophisticated model such as the proposed SNOMED CT observables model18, shown in Figure 4, which would allow the various property and object class modifiers to be correctly identified by the role they play in the context of the data element itself. One of the challenges in doing this, however, is that not all of the data elements in the caDSR can be treated as observation results. As an example, “Tumor Tissue Site” (CDE 3427536) is not an observation unto itself but would instead serve as a direct site or (inherent) location in a more complete location.
Figure 4.

Proposed SNOMED CT Observation Result Model
The remainder of this paper focuses on the first and more abstract model level, where every CDE is treated as a complete “observation” unto itself, either as a property of an object (independent continuant) or as characterizing a process.
Figure 5 shows a diagram illustrating the mappings between a modified version of the SNOMED CT observable model and the ISO/IEC 11179 model implemented in the NCI caDSR. As illustrated in the figure, a data element concept is mapped to an observable entity; a data element concept property is mapped to the target (i.e., range class) of the predicate “is about”; a primary property is mapped to the target (i.e., range class) of the predicate “property type”; and a data element object class is mapped to the target of the predicate “inheres in or characterizes”. Note that we combined the two predicates “inheres in” and “characterizes” into a single predicate “inheres in or characterizes”, as NCI caDSR does not directly provide the distinction between independent continuants and processes (see the Discussion section for more details). Within the NCI caDSR model, a primary object class is the target of the predicate “object class type” and a primary property is the target of the predicate “property type”.
Figure 5.

Mappings between SNOMED CT Observable Model and ISO/IEC 11179
We created a transformation tool using the Jena Java API19 that takes as input the results of a SPARQL query described in the previous section and renders the data recorded for the data element concept of a CDE into an OWL-based compositional expression. Table 2 shows two example compositional expressions transformed to represent the semantics of the data element concept, object class and property of the CDEs “Clinical Trial Drug Classification Name” and “Agent Administration Total Dose Code”.
Table 2.
Compositional Expression in caDSR and enhanced OWL representation in Manchester OWL Syntax
| Original Data Recorded in caDSR | Transformed Compositional Expression |
|---|---|
| Public Id: 3378323; CDE Name: Clinical Trial Drug Classification Name; Property Code: C25161; Property Name: Classification; Primary Property: C25161; Object Class code: C71104:C1708; Object Class Name: Clinical Trial Agent; Primary Object Class: C1708 | Class: ‘Clinical Trial Drug Classification Name’ Annotations: label “Clinical Trial Drug Classification Name” EquivalentTo: ‘Observable Entity’ and (‘is about’ some (Classification and (‘inheres in or characterizes’ some (‘Clinical Trial Agent’ and (‘object class type’ some Agent))) and (‘property type’ some Classification))) SubClassOf: ‘Observable Entity’ |
| Public Id: 3088785; CDE Name: Agent Administration Total Dose Code; Property Code: C25304:C25488:C25709:C25162; Property Name: Total Dose Unit of Measure Code; Primary Property Name: C25162; Object Class Code: C70962; Object Class Name: Agent Administration; Primary Object Class: C70962 | Class: ‘Agent Administration Total Dose Code’ Annotations: label “Agent Administration Total Dose Code” EquivalentTo: ‘Observable Entity’ and (‘is about’ some (Quantity and (‘inheres in or characterizes’ some (‘Prescription Agent’ and (‘object class type’ some Agent))) and (‘property type’ some Quantity))) SubClassOf: ‘Observable Entity’ |
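The pattern in Table 2 can be sketched as a small string-rendering function. The tool described in this paper was built with the Jena Java API, so the Python below is only an illustrative approximation of the output pattern. The function names `quote` and `to_manchester` are hypothetical, and concept labels are used directly as class names, as they are in Table 2.

```python
def quote(name: str) -> str:
    """Wrap multi-word names in single quotes, as Manchester syntax requires."""
    return f"'{name}'" if " " in name else name


def to_manchester(cde_name, prop, primary_prop, obj_class, primary_obj_class):
    """Render one CDE's data element concept using the Table 2 pattern."""
    expression = (
        f"'Observable Entity' and ('is about' some ({quote(prop)} "
        f"and ('inheres in or characterizes' some ({quote(obj_class)} "
        f"and ('object class type' some {quote(primary_obj_class)}))) "
        f"and ('property type' some {quote(primary_prop)})))"
    )
    return "\n".join([
        f"Class: {quote(cde_name)}",
        f'    Annotations: label "{cde_name}"',
        f"    EquivalentTo: {expression}",
        "    SubClassOf: 'Observable Entity'",
    ])


# Values taken from the first row of Table 2 (CDE 3378323).
print(to_manchester(
    cde_name="Clinical Trial Drug Classification Name",
    prop="Classification",
    primary_prop="Classification",
    obj_class="Clinical Trial Agent",
    primary_obj_class="Agent",
))
```

The sketch does not escape apostrophes inside names or resolve labels to IRIs, both of which a production transformation would need to handle.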
Module 3: OWL-based Quality Assurance
Module 2 transforms the CDE definitions into OWL compositional expressions. We can now use a DL reasoner to validate the expressions against the underlying ontology.
Checking for duplicate data elements
A DL reasoner can find equivalent (duplicate) definitions. This provides an automated mechanism for checking potential duplicate data elements that have the same meanings.
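As a rough intuition for what duplicate detection means here, the sketch below groups CDEs whose definitional tuples are identical. This is a purely syntactic approximation: a DL reasoner establishes equivalence semantically, against the ontology, and can therefore also find duplicates whose expressions differ in form. The CDE identifiers and concept names in the example are hypothetical placeholders, not caDSR records.

```python
from collections import defaultdict


def group_duplicates(cdes):
    """Group CDE ids that share exactly the same definition tuple.

    `cdes` maps a CDE id to a tuple of
    (property, primary property, object class, primary object class).
    """
    groups = defaultdict(list)
    for cde_id, definition in cdes.items():
        groups[definition].append(cde_id)
    # Only groups with more than one member are potential duplicates.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical records for illustration only.
example = {
    "CDE-A": ("History", "History", "Patient", "Person"),
    "CDE-B": ("History", "History", "Patient", "Person"),
    "CDE-C": ("Date", "Date", "Patient", "Person"),
}

print(group_duplicates(example))  # [['CDE-A', 'CDE-B']]
```

As in the reasoner-based approach, a reported group is only a candidate: two CDEs can collapse into one definition because they truly mean the same thing or because one of them was annotated incorrectly.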
Mechanism for detecting data element modeling errors
With the OWL rendering of compositional expression patterns based on the SNOMED CT observable model, we can add constraints to the patterns that will allow a DL reasoner to detect potential data element modeling errors. The first constraint is to assert that instances of Object Class and Property are disjoint – that nothing can simultaneously be a thing or process and attribute or characteristic of the thing or process.
A second constraint is an assertion about the subjects (domain) and objects (range) of the :is_about, :property_type, :measures, and :object_class_type predicates. The domain/range constraint restricts each object property linking between the instances of asserted classes. Table 3 shows two types of the OWL constraints asserted in the compositional patterns.
Table 3.
Two types of the OWL constraints asserted in the compositional patterns in Manchester OWL syntax
| disjointness constraints | domain/range constraints |
|---|---|
| Class: ‘Observable Entity Property’ DisjointWith: ‘Observable Entity Object Class’, ‘Primary Object Class’. Class: ‘Primary Object Class’ DisjointWith: ‘Observable Entity Property’, ‘Primary Property’. Class: ‘Primary Property’ DisjointWith: ‘Observable Entity Object Class’, ‘Primary Object Class’. | ObjectProperty: ‘is about’ Domain: ‘Observable Entity’ Range: ‘Observable Entity Property’ ObjectProperty: ‘property type’ Domain: ‘Observable Entity Property’ Range: ‘Primary Property’ ObjectProperty: ‘inheres in or characterizes’ Domain: ‘Observable Entity Property’ Range: ‘Observable Entity Object Class’ ObjectProperty: ‘object class type’ Domain: ‘Observable Entity Object Class’ Range: ‘Primary Object Class’ |
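The effect of these constraints can be approximated with a simplified role-usage check: a concept code that appears as a primary object class in one CDE and as a primary property in another is a candidate violation of the disjointness between the two roles. This sketch is a stand-in for intuition only and is not how HermiT derives its results. The function name `find_role_conflicts` is hypothetical, and the object class of CDE 2975232 is not given in this paper, so a labeled placeholder is used for it.

```python
def find_role_conflicts(cdes):
    """Return codes used as a primary object class AND as a primary property.

    `cdes` maps a CDE id to (primary object class code, primary property code).
    """
    as_object, as_property = {}, {}
    for cde_id, (object_code, property_code) in cdes.items():
        as_object.setdefault(object_code, []).append(cde_id)
        as_property.setdefault(property_code, []).append(cde_id)
    conflicts = as_object.keys() & as_property.keys()
    return {code: (as_object[code], as_property[code]) for code in conflicts}


cdes = {
    # From Table 2: primary object class C1708 (Agent), primary property C25161.
    "3378323": ("C1708", "C25161"),
    # From Table 5: asserted primary property C1708 (Agent).
    # "UNKNOWN_OBJECT_CLASS" is a placeholder; the real code is not listed here.
    "2975232": ("UNKNOWN_OBJECT_CLASS", "C1708"),
}

print(find_role_conflicts(cdes))  # {'C1708': (['3378323'], ['2975232'])}
```

Unlike this lookup, the OWL constraints also take the class hierarchy of the underlying ontology into account, so a reasoner can flag conflicts between a concept and its ancestors or descendants.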
Explanation of detected inconsistencies or errors
We used the Protégé 5.0 beta version16 and the HermiT reasoner6, which can explain the possible reasons for inconsistencies or errors. Figure 6 shows a screenshot illustrating the explanations for the class “C1708|Agent” violating the disjointness constraint asserted between the classes “Observable Entity Object Class” and “Observable Entity Property”.
Figure 6.

Example Explanation of an Inconsistency
Case study of CDEs from two TCGA cancer genome study domains
We used the configuration above to perform a quality evaluation of the CDEs from two clinical cancer domains in the TCGA data dictionary: Clinical Pharmaceutical and Clinical Shared. CDEs from each domain, along with their data element concept, object class and property assertions, were retrieved using the SPARQL query in Figure 2. The results were converted into compositional expressions and classified using the Protégé-based DL reasoner. We were able to identify the equivalent CDEs and the CDEs violating the reasoning constraints. To verify the modeling errors, four co-authors (GJ, HS, CT, CW) reviewed the detected CDEs, checking the definitions recorded in the NCI caDSR to determine whether each CDE had a modeling error.
Results
In total, the TCGA data dictionary contains 775 CDEs for 38 clinical cancer domains, which cover 21 cancer types. In the present study, we performed a case study of two clinical cancer domains: Clinical Pharmaceutical and Clinical Shared, which contain 18 and 98 CDEs respectively.
The reasoning services identified 6 CDEs with equivalent CDEs in the domain Clinical Pharmaceutical and 29 CDEs with equivalent CDEs in the domain Clinical Shared. In total, there are 12 groups of equivalent CDEs. Manual review showed that, among the 12 groups identified, the CDEs in 2 groups had modeling errors, indicating that they should not be considered equivalent. Table 4 shows the equivalent CDE groups identified with errors from the domain Clinical Shared.
Table 4.
The equivalent CDEs identified with errors from the domain Clinical Shared
| Domain | Equivalent CDEs | Errors |
|---|---|---|
| Clinical Shared | 2181650 Patient Smoking History Category; 2228604 Started Smoking Year | Yes |
| Clinical Shared | 88 Performance Status Assessment Eastern Cooperative Oncology Group Scale; 2792763 Performance Status Assessment Timepoint Category; 2003853 Karnofsky Performance Status Score | Yes |
Table 5 shows the CDEs from the two TCGA domains that the reasoning services identified as violating constraints. In total, 19 CDEs (out of 116) were identified with constraint violations. Manual review showed that all 19 CDEs had modeling errors in their asserted primary properties. Table 5 also provides our suggested primary properties for the 19 CDEs.
Table 5.
CDEs violating constraints identified from two TCGA domains by reasoning services.
| Domain | CDEs Violating Constraints | Asserted Primary Property | Semantic Type | Our Suggested Primary Property | Semantic Type |
|---|---|---|---|---|---|
| Clinical Pharmaceutical | 2975232 Prior Therapy Regimen Text | C1708\|Agent | Chemical Viewed Functionally | C25365\|Description | Intellectual Product |
| Clinical Shared | 2791194 First Disease Recurrence Disease Extent Category | C13717\|Anatomic Site | Body Location or Region | C25372\|Category | Classification |
| Clinical Shared | 3108203 Neoplasm Anatomic Subdivision Name | C13717\|Anatomic Site | Body Location or Region | C42614\|Name | Conceptual Entity |
| Clinical Shared | 3124503 First Recurrent Non-Nodal Metastatic Anatomic Site Descriptive Text | C13717\|Anatomic Site | Body Location or Region | C25365\|Description | Intellectual Product |
| Clinical Shared | 3427536 Tumor Disease Anatomic Site | C13717\|Anatomic Site | Body Location or Region | C25365\|Description | Intellectual Product |
| Clinical Shared | 2006657 Diagnosis Age | C15220\|Diagnosis | Diagnostic Procedure | C2515\|Age | Organism Attribute |
| Clinical Shared | 2896956 Month Cancer Initial Diagnosis Number | C15220\|Diagnosis | Diagnostic Procedure | C25164\|Date | Temporal Concept |
| Clinical Shared | 2896958 Day Cancer Initial Diagnosis Number | C15220\|Diagnosis | Diagnostic Procedure | C25164\|Date | Temporal Concept |
| Clinical Shared | 2896960 Year Cancer Initial Diagnosis Number | C15220\|Diagnosis | Diagnostic Procedure | C25164\|Date | Temporal Concept |
| Clinical Shared | 2896991 Month Tumor Recurrence After Initial Treatment Number | C15220\|Diagnosis | Diagnostic Procedure | C25164\|Date | Temporal Concept |
| Clinical Shared | 2897006 Day Tumor Recurrence After Initial Treatment Number | C15220\|Diagnosis | Diagnostic Procedure | C25164\|Date | Temporal Concept |
| Clinical Shared | 2897008 Year Tumor Recurrence After Initial Treatment Number | C15220\|Diagnosis | Diagnostic Procedure | C25164\|Date | Temporal Concept |
| Clinical Shared | 3382736 Prior Cancer Diagnosis Occurrence Description Text | C15220\|Diagnosis | Diagnostic Procedure | C25365\|Description | Intellectual Product |
| Clinical Shared | 2897014 Month Tumor Progression After Initial Treatment Number | C15368\|Treatment | Health Care Activity | C25164\|Date | Temporal Concept |
| Clinical Shared | 2897016 Day Tumor Progression After Initial Treatment Number | C15368\|Treatment | Health Care Activity | C25164\|Date | Temporal Concept |
| Clinical Shared | 2897018 Year Tumor Progression After Initial Treatment Number | C15368\|Treatment | Health Care Activity | C25164\|Date | Temporal Concept |
| Clinical Shared | 2178045 Age Began Smoking in Years | C18270\|Cigarette Smoking | Individual Behavior | C2515\|Age | Organism Attribute |
| Clinical Shared | 2554674 Patient Death Reason | C28554\|Death | Finding | C25365\|Description | Intellectual Product |
| Clinical Shared | 2584114 Primary Other Site of Disease Name | C2991\|Diseases and Disorders | Disease or Syndrome | C25365\|Description | Intellectual Product |
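The counts in Table 5 can be tallied with a short script, which makes it easy to see which asserted primary properties account for most violations. The list below is transcribed from the “Asserted Primary Property” column of Table 5, and the domain sizes (18 and 98 CDEs) come from the Results text. The variable names are ours.

```python
from collections import Counter

# Asserted primary properties, one entry per row of Table 5.
asserted = (
    ["Agent"]
    + ["Anatomic Site"] * 4
    + ["Diagnosis"] * 8
    + ["Treatment"] * 3
    + ["Cigarette Smoking", "Death", "Diseases and Disorders"]
)

counts = Counter(asserted)
total_cdes = 18 + 98  # Clinical Pharmaceutical + Clinical Shared
violation_rate = len(asserted) / total_cdes

print(counts.most_common(3))
# [('Diagnosis', 8), ('Anatomic Site', 4), ('Treatment', 3)]
print(f"{len(asserted)} of {total_cdes} CDEs ({violation_rate:.1%})")
# 19 of 116 CDEs (16.4%)
```

Three asserted concepts (Diagnosis, Anatomic Site and Treatment) account for 15 of the 19 flagged CDEs.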
Discussion
In a previous study, we analyzed the terminological concepts associated with the standard structure of the caDSR CDEs using the UMLS Semantic Network4. We found that a CDE whose semantic annotations did not observe the overall pattern of disjointness between the dominant semantic types of the primary object class and the primary property had a high probability of containing modeling errors. Although the ISO/IEC 11179 specification states that it provides a semantically precise structure for data elements, the standard does not specify disjointness constraints between the object class concept and the property concept. In the present study, we designed a compositional expression pattern to post-coordinate the data element concept of a CDE and transformed the resulting expressions into OWL. The transformation allows us to take advantage of the built-in support in OWL for expressing disjointness and domain/range restrictions among a set of OWL classes, and subsequently to invoke the reasoning services provided by existing OWL-DL reasoning tools (e.g., the HermiT reasoner used in this study) to check for inconsistencies and violations. This approach proved useful in identifying potential modeling errors. In the case study of the CDEs from two TCGA clinical cancer study domains, we reviewed the CDEs violating the constraints (n=19) and found that all of them had incorrect property concepts asserted.
While domain expertise is clearly helpful in evaluating whether an object class or property concept is correctly asserted, the challenging question is how to make the system determine automatically that a terminological code can only be used to annotate an object class and not a property. We consider that approaches leveraging upper level ontologies such as the UMLS Semantic Network, the Basic Formal Ontology (BFO)20, 21 or the BioTop ontology (a top-domain ontology for the life sciences)22, 23 could provide a formal way to define the constraints needed to support CDE modeling applications. Linked to the ISO/IEC 11179 model, a constraint could be stated as “an object class concept has to be an independent continuant or processual entity, whereas a property concept cannot be such an entity”. The SNOMED observable model we referenced in this study has in fact been developed using such principles. As we are using a modified version of the SNOMED observable model to post-coordinate the data element concepts, we plan to look further into upper level ontology-based approaches in our future work.
With compositional expressions coded for the data element concepts that represent the meaning of CDEs, the Semantic Web reasoning services can be invoked to automatically identify semantically equivalent CDEs. On the one hand, this feature can be used to detect potential duplicate CDEs as well as modeling errors. As demonstrated in our case study (Table 4), among the 12 groups of equivalent CDEs identified, the CDEs in 2 groups had modeling errors, meaning that they should not be considered equivalent. For example, on reviewing the definitions of the CDEs “2181650|Patient Smoking History Category” and “2228604|Started Smoking Year”, we found that the primary property concept of both is assigned as “Patient Medical History”, which is not correct. If we correct the primary property concepts of the two CDEs to “C25372|Category” and “C25164|Date” respectively, they are no longer inferred to be equivalent. On the other hand, the post-coordination approach could be used to harmonize and standardize the data elements collected from different study groups or sites. Specifically, if we used the compositional expression pattern to post-coordinate these data elements using standard ontologies such as NCIt, we would be able to accurately identify data elements with the same meaning, harmonize them, and enable data integration. This is precisely the goal of the ISO/IEC 11179 standard. Leveraging the SNOMED CT observable model would also help make the compositional expression pattern more user-friendly. For example, the notion of “Observable Entity” is more familiar to clinical study researchers than the notion of “Data Element Concept”.
We have demonstrated the value of using Semantic Web technologies in building our QA tools for cancer study CDEs. First, we developed a semantic metadata repository using an open source RDF triple store. The RDF-based data model not only provides a scalable and powerful framework for data integration, but also provides standard SPARQL query services that allow us to easily retrieve the preferred set of CDEs in a particular TCGA clinical cancer study domain. In a separate study, we demonstrated that such a Semantic Web-based metadata repository could also be used to support authoring detailed clinical models (DCMs) in clinical cancer study domains24. In the present study, the framework enables us to build a domain-specific QA mechanism. Although we focused on two TCGA domains as a case study, the approach could readily be generalized to other TCGA domains. Second, we developed a compositional expression transformation tool that transforms the data structure of a data element concept informed by ISO/IEC 11179 into an OWL-based representation. The compositional expression is based on a modified version of the SNOMED CT observable model. We consider that the syntax of SNOMED CT’s compositional expression grammar25 may be useful as an intermediate layer in our transformation tool, providing a human-readable format and potentially improving the usability of authoring a post-coordinated expression. Third, as the NCI Thesaurus is distributed in the OWL format, we can easily integrate the compositional expressions generated for the CDEs in each domain with the NCIt using the “owl:imports” mechanism. As the latest version of the NCIt marks all its retired concepts with the annotation “owl:deprecated”, we were able to detect retired concepts in the compositional expressions, which are based on the original annotations using NCIt codes. Fourth, Semantic Web reasoning services are a critical component of our QA tool. 
We used the latest version of the HermiT reasoning plugin in the Protégé 5 environment in this study. We found the incremental reasoning and explanation services very helpful in detecting and explaining CDE modeling errors and duplicate CDEs. However, further work is needed to transform the reasoning results into a well-formed report that can inform the decision-making of CDE curators. In addition, integrating the compositional expressions with a large-scale ontology such as NCIt makes the reasoning more complex and demands higher performance. A number of high-performance reasoners such as Snorocket26 are already available in the Semantic Web community, and we will compare and test such reasoners in the future.
There are a number of limitations in the study. First, our QA tool is based on the CDEs from a particular domain. While the domain-specific approach has the advantage of making the auditing results more interpretable, some modeling errors may go undetected. Systematic solutions would include 1) enabling violation checking globally on CDEs across domains; and 2) implementing the upper level ontology-based approach discussed above. Second, we used a simplified compositional expression model that captures the main constructs of a data element concept informed by the ISO/IEC 11179 standard. For example, the object class concept “Adjuvant Hormone Therapy” is recorded as “C2140:C15445”. The primary object class “C15445|Endocrine Therapy” has been captured in our model, but its semantic relationship with “C2140|Adjuvant” has not. We plan to look into approaches for post-coordinating such expressions in the future. Third, the QA of the data structure describing the meaning of a value domain is out of scope for the present paper and will be conducted in a separate study.
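The second limitation can be made concrete with a small sketch that splits a colon-delimited caDSR concept string into its primary concept and the remaining qualifier codes. It shows what the simplified model keeps (the primary concept) and what it leaves without an explicit relationship (the qualifiers). The function name `split_concepts` is hypothetical. The primary code is passed in explicitly because caDSR records it separately, and the sketch does not assume it occupies a fixed position in the string.

```python
def split_concepts(code_string, primary_code):
    """Split a colon-delimited concept string into (primary, qualifiers)."""
    codes = code_string.split(":")
    if primary_code not in codes:
        raise ValueError(f"{primary_code} does not occur in {code_string}")
    # Qualifiers are kept in their recorded order; their relationship to the
    # primary concept is not specified by the source data.
    qualifiers = [code for code in codes if code != primary_code]
    return primary_code, qualifiers


# "Adjuvant Hormone Therapy" as recorded in caDSR.
print(split_concepts("C2140:C15445", "C15445"))
# ('C15445', ['C2140'])

# Property codes of CDE 3088785 from Table 2.
print(split_concepts("C25304:C25488:C25709:C25162", "C25162"))
# ('C25162', ['C25304', 'C25488', 'C25709'])
```

A fuller post-coordination model would attach each qualifier to the primary concept through a named role instead of leaving it as an unlabeled list.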
Conclusion
In this study, we developed and evaluated a QA tool for cancer study CDEs using a post-coordination approach. We designed a compositional expression pattern based on a modified version of the SNOMED CT observable model, which is used to represent the data structure of a data element concept (i.e., the meaning of a data element) informed by the ISO/IEC 11179 metadata standard. Leveraging existing Semantic Web tools, we demonstrated that the post-coordination approach can provide an effective and automated mechanism for detecting potential CDE modeling errors and duplicate CDEs. Future work will focus on 1) developing a systematic QA approach leveraging upper level ontologies; 2) refining the compositional expression model by aligning it with the SNOMED observable model; 3) making the reasoning-based explanations more user-friendly; and 4) incorporating the value domain into the scope.
Acknowledgments
The study is supported in part by an NCI U01 Project – caCDE-QA (1U01CA180940–01A1).
Footnotes
The 11179 standard is about metadata, not the actual data models. This is sometimes a source of confusion, as, in order to record metadata about an artifact, you also need a good idea of the type and purpose of the artifact itself. 11179 is about how one describes the provenance and purpose of a given element of a given type in a given data model and how it relates to similar elements with different types in different models.
“Sufficient information” in this context assumes that the reasoner would be able to process statements about negation and/or disjointness and that its set of known facts included an axiom that explicitly or implicitly asserted that the set of things that are parts of the ear are disjoint (no pun intended) from the set of things that are joints.
See: 20663007 | Apoptosis (morphological abnormality) | in the January 2014 edition of SNOMED CT.
References
- 1. Warzel DB, Andonaydis C, McCurry B, Chilukuri R, Ishmukhamedov S, Covitz P. Common data element (CDE) management and deployment in clinical trials. AMIA Annual Symposium proceedings / AMIA Symposium. 2003:1048. Epub 2004/01/20.
- 2. ISO/IEC 11179 Specification. 2015. [February 17, 2015]; Available from: http://standards.iso.org/ittf/PubliclyAvailableStandards/c050340_ISO_IEC_11179-3_2013.zip.
- 3. TCGA BCR Data Dictionary. 2015. [February 17, 2015]; Available from: https://tcga-data.nci.nih.gov/docs/dictionary/
- 4. Jiang G, Solbrig HR, Chute CG. Quality evaluation of cancer study Common Data Elements using the UMLS Semantic Network. Journal of biomedical informatics. 2011;44(Suppl 1):S78–85. doi: 10.1016/j.jbi.2011.08.001. Epub 2011/08/16.
- 5. Jiang G, Solbrig HR, Chute CG. Quality evaluation of value sets from cancer study common data elements using the UMLS semantic groups. Journal of the American Medical Informatics Association : JAMIA. 2012;19(e1):e129–36. doi: 10.1136/amiajnl-2011-000739. Epub 2012/04/19.
- 6. HermiT OWL Reasoner. 2015. [February 20, 2015]; Available from: http://hermit-reasoner.com/
- 7. Rector A, Iannone L. Lexically suggest, logically define: quality assurance of the use of qualifiers and expected results of post-coordination in SNOMED CT. Journal of biomedical informatics. 2012;45(2):199–209. doi: 10.1016/j.jbi.2011.10.002. Epub 2011/10/26.
- 8. de Coronado S, Wright LW, Fragoso G, Haber MW, Hahn-Dantona EA, Hartel FW, et al. The NCI Thesaurus quality assurance life cycle. Journal of biomedical informatics. 2009;42(3):530–9. doi: 10.1016/j.jbi.2009.01.003. Epub 2009/05/29.
- 9. Horridge M, Parsia B, Noy NF, Musen MA. Reasoning Based Quality Assurance of Medical Ontologies: A Case Study. AMIA Annu Symp Proc. 2014;2014:671–80.
- 10. Rodrigues JM, Schulz S, Rector A, Spackman K, Millar J, Campbell J, et al. ICD-11 and SNOMED CT Common Ontology: circulatory system. Studies in health technology and informatics. 2014;205:1043–7. Epub 2014/08/28.
- 11. Schulz S, Rodrigues JM, Rector A, Spackman K, Campbell J, Ustun B, et al. What’s in a class? Lessons learnt from the ICD - SNOMED CT harmonisation. Studies in health technology and informatics. 2014;205:1038–42. Epub 2014/08/28.
- 12. Jiang G, Chute CG. Auditing the semantic completeness of SNOMED CT using formal concept analysis. Journal of the American Medical Informatics Association : JAMIA. 2009;16(1):89–102. doi: 10.1197/jamia.M2541. Epub 2008/10/28.
- 13. caDSR downloads. 2015. [February 17, 2015]; Available from: https://wiki.nci.nih.gov/display/caDSR/caDSR+Downloads.
- 14. NCI Thesaurus Download. 2015. [February 17, 2015]; Available from: http://cbiit.nci.nih.gov/evs-download/thesaurus-downloads.
- 15. 4store. 2015. [February 20, 2015]; Available from: http://4store.org/
- 16. Protege Ontology Environment. 2015. [February 20, 2015]; Available from: http://protege.stanford.edu/
- 17. Redefer Project. 2015. [March 8, 2015]; Available from: http://rhizomik.net/html/redefer/
- 18. IHTSDO Observable and Investigation Model Project. 2015. [February 17, 2015]; Available from: https://csfe.aceworkspace.net/sf/projects/observable_and_investigation_mod/
- 19. Apache Jena. 2015. [March 8, 2015]; Available from: https://jena.apache.org/
- 20. Smith B, Ceusters W, Klagges B, Kohler J, Kumar A, Lomax J, et al. Relations in biomedical ontologies. Genome biology. 2005;6(5):R46. doi: 10.1186/gb-2005-6-5-r46. Epub 2005/05/17.
- 21. BFO - Basic Formal Ontology. 2015. [February 14, 2015]; Available from: http://ifomis.uni-saarland.de/bfo/
- 22. BioTop - a top-domain ontology for the life science. 2015. [February 14, 2015]; Available from: http://purl.org/biotop.
- 23. Schulz S, Beisswanger E, Wermter J, Hahn U. Towards an upper level ontology for molecular biology. AMIA Annual Symposium proceedings / AMIA Symposium. 2006:694–8. Epub 2007/01/24.
- 24. Jiang G, Sharma DK, Solbrig HR, Tao C, Weng C, Chute CG, editors. International SWAT4LS Workshop: Semantic Web Applications and Tools for Life Sciences. Berlin, Germany: 2014. Dec, 2014. Building A Semantic Web-based Metadata Repository for Facilitating Detailed Clinical Modeling in Cancer Genome Studies; pp. 9–11.
- 25. Compositional Grammar for SNOMED CT Expressions in HL7 Version 3. Technical Report. Copenhagen, Denmark: The International Health Terminology Standards Development Organization; 2008. External Draft for Trial Use Version 0.006, Dec 2008.
- 26. Snorocket. 2015. [February 14, 2015]; Available from: http://aehrc.com/research/health-data-management-and-semantics/clinical-terminology-tools/snorocket.
