Author manuscript; available in PMC: 2017 Oct 1.
Published in final edited form as: J Biomed Inform. 2016 Jul 19;63:11–21. doi: 10.1016/j.jbi.2016.07.016

A Computational Framework for Converting Textual Clinical Diagnostic Criteria into the Quality Data Model

Na Hong 1,2,¥, Dingcheng Li 1,¥, Yue Yu 3,¥, Qiongying Xiu 4, Hongfang Liu 1, Guoqian Jiang 1,§
PMCID: PMC5077690  NIHMSID: NIHMS806369  PMID: 27444185

Abstract

Background

Constructing standard and computable clinical diagnostic criteria is an important but challenging research field in the clinical informatics community. The Quality Data Model (QDM) is emerging as a promising information model for standardizing clinical diagnostic criteria.

Objective

To develop and evaluate automated methods for converting textual clinical diagnostic criteria into a structured format using QDM.

Methods

We used a clinical Natural Language Processing (NLP) tool known as cTAKES to detect sentences and annotate events in diagnostic criteria. We developed a rule-based approach for assigning the QDM datatype(s) to an individual criterion, whereas we invoked a machine learning algorithm based on Conditional Random Fields (CRFs) for annotating attributes belonging to each particular QDM datatype. We manually developed an annotated corpus as the gold standard and used standard measures (precision, recall, and f-measure) for the performance evaluation.

Results

We harvested 267 individual criteria with the datatypes of Symptom and Laboratory Test from 63 textual diagnostic criteria. We manually annotated attributes and values in 142 individual Laboratory Test criteria. Our rule-based approach achieved an average precision of 0.84, recall of 0.86, and f-measure of 0.85; the CRFs-based classification achieved a precision of 0.95, recall of 0.88, and f-measure of 0.91. We also implemented a web-based tool that automatically translates textual Laboratory Test criteria into the QDM XML template format. The results indicated that our approaches leveraging cTAKES and CRFs are effective in facilitating diagnostic criteria annotation and classification.

Conclusion

Our NLP-based computational framework is a feasible and useful solution in developing diagnostic criteria representation and computerization.

Keywords: Diagnostic Criteria, Quality Data Model, Natural Language Processing, cTAKES, Conditional Random Fields


1 Introduction

The term “diagnostic criteria” designates the specific combination of signs, symptoms, and test results that clinicians use to determine a correct diagnosis [1]. Diagnostic criteria are one of the most valuable sources of knowledge for supporting clinical decision making and improving patient care [2]. However, existing diagnostic criteria are scattered over different media such as medical textbooks, literature, and clinical practice guidelines, and they are usually described in unstructured free text without a uniform standard. This situation hinders the efficient use of diagnostic criteria for supporting contemporary clinical decision making, which requires an integrated system with interoperable and computable processes.

One solution to better support clinical decision making is to computerize these diagnostic criteria; however, it is costly and time-consuming for experts and clinicians to complete all of the tasks manually. To this end, on the one hand, Natural Language Processing (NLP) technology can be used to automatically or semi-automatically transform diagnostic criteria into a computable format. On the other hand, a data model for representing diagnostic criteria is equally essential for their computerized implementation. Such a data model would enable the representation of diagnostic criteria in a structured, standard, and encoded framework to support many clinical applications in a scalable fashion. In order to investigate the model's adaptability to diagnostic criteria, we previously evaluated the feasibility of applying the National Quality Forum (NQF) Quality Data Model (QDM) [3] through a data-driven approach, in which we manually analyzed the distribution and coverage of the data elements extracted from a collection of diagnostic criteria in QDM. The results demonstrated that the use of QDM is feasible for building a standards-based information model to represent computable diagnostic criteria [4].

The objective of the present study is to develop and evaluate automated methods for converting textual clinical diagnostic criteria into a structured format using QDM. We leverage clinical NLP tools to facilitate the computerization and standardization of diagnostic criteria. Specifically, we use a combination of Clinical Text Analysis and Knowledge Extraction System (cTAKES)-supported and rule-based methods for extracting individual diagnostic criteria from full-text clinical diagnostic criteria. We also develop a machine learning algorithm based on Conditional Random Fields (CRFs) to automatically annotate and classify the attributes of diagnosis events. Finally, we develop an integrated web-based system that automatically transforms textual diagnostic criteria into a standard QDM template by implementing these algorithms.

2 Background

2.1 Clinical NLP Tools and Information Models

A number of tools and methods based on NLP technology have been reported and used for structuring free-text clinical documents, such as clinical guidelines, clinical notes, and electronic health records (EHRs) [5, 6]. Typical clinical NLP tools that support term recognition and text annotation from clinical text include the Health Information Text Extraction tool (HITex) [7], MetaMap [8], OpenNLP [9], and cTAKES [10]. Several studies have compared the performance of these frequently used NLP tools, and cTAKES shows satisfactory performance and usability [11, 12]. cTAKES is an open-source Apache NLP system designed to extract information from EHR-based clinical free text. cTAKES is built on the Unstructured Information Management Architecture (UIMA), an open-source framework originally designed by IBM, and incorporates a series of comprehensive NLP methods [13]. Its modular architecture is composed of pipelined components combining rule-based and machine learning techniques [10]. These components exchange data using a standard data structure known as the Common Analysis System (CAS). A CAS contains the original document together with the annotated results and a powerful index system. The components of cTAKES are specifically trained for use in the clinical domain, and they create rich linguistic and semantic annotations that can be utilized by clinical decision support systems, as well as in clinical and translational research [14].

Beyond these common tools for clinical NLP tasks, there are also many successful applications of machine learning algorithms [15-18] that are customized to support different scenarios and use cases. In recent years, Conditional Random Fields (CRFs) algorithms have demonstrated strong performance in the clinical NLP field in comparison with other machine learning algorithms [19-21]. In CRFs applications, such as entity extraction and text classification, we usually wish to predict a vector Y = [y0; y1; …; ym] of random variables given an observed feature vector X = [x0; x1; …; xn], which requires us to label the words in a sentence with their corresponding features (i.e., contextual information), which are subsequently used for training. The features can be part-of-speech (POS) tags, neighboring words and word bigrams, prefixes and suffixes, capitalization, membership in domain-specific lexicons, semantic information of words, etc. Considering the advantage of CRFs in capturing contextual information and their solid NLP performance, we leveraged CRFs for attribute extraction and classification and for performance tuning in the present study.

Current efforts to develop internationally recommended standard models in clinical domains have laid the foundation for modeling and representing computable diagnostic criteria. A number of clinical data models have been developed in related fields (e.g., QDM, Clinical Element Models [CEMs] [22], and HL7 Fast Healthcare Interoperability Resources [FHIR] [23]). QDM is designed to allow EHRs and other clinical electronic systems to share a common understanding and interpretation of clinical data. It allows quality measure developers and clinical researchers to clearly and unambiguously describe the data required to calculate a quality measure. Unlike CEM and FHIR, QDM contains both a data model module and a logic module; the latter handles logic expressions elegantly with a collection of functions, logic operators, and temporal operators. Therefore, we chose QDM as the information model for the standard representation of diagnostic criteria in the present study.

2.2 Clinical Text Computerization and Standardization

The related studies on clinical text computerization mainly include the following three aspects.

  1. Clinical guideline computerization and Computer Interpretable Guideline (CIG) systems. Various computerized clinical guidelines and decision support systems that incorporate the guidelines have been developed. Researchers have tried different approaches to computerizing clinical practice guidelines [24-27], but those guidelines cover many complex medical procedures; thus, the application of these studies in real-world clinical practice is still very limited. However, the methods used to computerize guidelines are valuable in addressing the issues in diagnostic criteria computerization.

  2. Clinical NLP technologies. Unstructured clinical text mainly exists in the form of clinical notes, eligibility criteria, and clinical guidelines. There is much prior work on clinical NLP applications using machine learning, rule-based methods, and other novel methods [19, 28, 29]. These studies offer valuable contributions in exploring different methods to automatically process information in clinical text.

  3. Formalization method studies on clinical research data. Some previous studies investigated the eligibility criteria in clinical trial protocols, developed approaches for eligibility criteria extraction and semantic representation, and used hierarchical clustering for dynamic categorization of such criteria [28, 30]. For example, EliXR provided a corpus-based knowledge acquisition framework that uses the Unified Medical Language System (UMLS) to standardize eligibility-concept encoding and to enrich eligibility-concept relations for clinical research eligibility criteria in text. QDM-based phenotyping methods used for the identification of patient cohorts from EHR data also provide a valuable reference for our work [31].

Although current studies on the computerization and standardization of diagnostic criteria are still immature, some studies have begun to address diagnostic criteria computerization, albeit focusing only on particular diseases. Examples include the computerized diagnostic criteria for inclusion body myositis [32] and for Brugada-type electrocardiograms [33]. However, few current studies take the perspective of using a standard information model to build generalizable and scalable methods for diagnostic criteria computerization.

3 Materials and Methods

3.1 Materials

3.1.1 Data Collection

In this study, our diagnostic criteria were collected from a number of sources, including medical textbooks, journal papers, documents issued by professional organizations (such as the World Health Organization [WHO] [34]), and the Internet. Table 1 shows an example of textual diagnostic criteria for diabetes mellitus [35].

Table 1.

An Example of Textual Diagnostic Criteria for Diabetes Mellitus

Diagnostic Criteria for Diabetes Mellitus
A1C ≥ 6.5%. The test should be performed in a laboratory using a method that is NGSP certified and standardized to the DCCT assay.
OR
FPG ≥ 126 mg/dl (7.0 mmol/l). Fasting is defined as no caloric intake for at least 8 h.
OR
2-h plasma glucose ≥ 200 mg/dl (11.1 mmol/l) during an OGTT. The test should be performed as described by the World Health Organization, using a glucose load containing the equivalent of 75 g anhydrous glucose dissolved in water.
OR
In a patient with classic symptoms of hyperglycemia or hyperglycemic crisis, a random plasma glucose ≥ 200 mg/dl (11.1 mmol/l).
In the absence of unequivocal hyperglycemia, criteria 1–3 should be confirmed by repeat testing.

In this study, we collected 237 textual diagnostic criteria using a web crawler and manual retrieval. According to our previous study [36], the most commonly used datatypes in diagnostic criteria include laboratory test, symptom, diagnostic study, diagnostic activity, physical exam, medication, recommended procedure, and patient characteristic. For the experiments, we randomly selected 63 of these 237 textual diagnostic criteria and focused on the computational methods for two typical datatypes: Symptom and Laboratory Test. Table 2 shows the number and source of the selected diagnostic criteria across six clinical topics.

Table 2.

Clinical Topics Distribution of 63 Diagnostic Criteria

Clinical Topic Number of Diagnostic Criteria Sources
Cardiology 32 Internet, Text book
Critical Care 16 Internet
Diabetes 9 Internet, WHO document
Dermatology 4 Internet
Allergy 1 Internet
Anaesthesiology 1 Internet

3.1.2 The Multi-purpose Annotation Environment Annotator

We used the Multi-purpose Annotation Environment (MAE) toolkit [37] for manually annotating attributes of individual diagnostic criteria [38]. The MAE tool allows flexible annotation objects to be defined in a document type definition (DTD) file according to the annotation task, and it also enables easy annotation by simply highlighting a word and selecting the defined entity tags. The information about a tag is added to the appropriate tab at the bottom of the tool display window. The id, start, end, and text features are automatically generated by MAE, which makes MAE fit our task well. Figure 1 shows our manual annotation example for the attributes of a laboratory test in an individual criterion: Renal, Creatinine (mg/dL): >5.0 or <200.

Fig. 1. A Screenshot of MAE Illustrating Attributes Annotation in the Criterion ‘Renal, Creatinine (mg/dL): >5.0 or <200.’

3.1.3 cTAKES

Built on the UIMA framework, cTAKES provides a pipeline for selecting which descriptors are used together and for determining the order of the descriptors (see details in the cTAKES 3.2 Component Use Guide) [39]. The type system based on Intermountain Healthcare’s Clinical Element Models (CEMs) has been implemented in UIMA and is fully functional in cTAKES versions 2.0 and later [40]. Dictionaries such as UMLS, SNOMED CT, and RxNorm are integrated into the cTAKES clinical pipeline. cTAKES discovers clinical named entities and clinical events using a dictionary lookup algorithm and a subset of the UMLS [41]. In addition, cTAKES extracts named entity attributes and assigns values for the attributes such as UMLS concept unique identifiers (CUIs) and SNOMED CT codes, polarity, uncertainty, condition, etc. In this study, we used the cTAKES (V.3.2.1) NLP analysis engine and its integrated version, MedTagger [42], to process the diagnostic criteria.

3.1.4 NQF, QDM, and HQMF

QDM consists of criteria for data elements, relationships for relating data element criteria to each other, and functions for filtering criteria to the subset of data elements that are of interest [43]. The basic components of QDM include: category (e.g., Symptom), datatype (e.g., Symptom, Active), attribute (e.g., information about Severity, Start Datetime, Stop Datetime, and Ordinality), and value set, comprising concept codes from one or more code systems. In this study, we used QDM version 4.0 to represent the diagnostic criteria. As a standard and computable format of QDM, the Health Quality Measure Format (HQMF) formally defines a quality measure (data elements, logic, definitions, etc.) to support consistent and unambiguous interpretation [44]. HQMF, as an HL7 standard, has been accepted as a format to define clinical quality measures (cQMs). In our QDM representation implementation, we used the latest HQMF templates (2013) issued by NQF [45].
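To make the QDM component structure concrete, the following minimal Python sketch models a data element with the category, datatype, attribute, and value set components described above. The class and field names are our own illustrative assumptions, not part of the QDM specification, and the value set entry is a placeholder rather than a real concept code.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class QDMDataElement:
    """Illustrative container for the basic QDM components; not part of the QDM spec."""
    category: str                                              # e.g., "Laboratory Test"
    datatype: str                                              # e.g., "Laboratory Test, Result"
    attributes: Dict[str, str] = field(default_factory=dict)   # e.g., {"result": "< 100 mg/dL"}
    value_set: List[str] = field(default_factory=list)         # concept codes from one or more code systems

# The criterion used later in the paper: "Laboratory Test, Result: LDL-c (result < 100 mg/dL)"
ldl_criterion = QDMDataElement(
    category="Laboratory Test",
    datatype="Laboratory Test, Result",
    attributes={"result": "< 100 mg/dL"},
    value_set=["UMLS:C0000000"],  # placeholder only; a real value set would hold actual CUIs
)
print(ldl_criterion.datatype)
```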

3.2 Methods

Figure 2 shows the integrated framework we designed for the NLP-based QDM modeling of diagnostic criteria. The framework of our study comprises three modules. The first module is an individual criterion extraction and classification module, in which we use cTAKES to detect sentences and annotate events (using its built-in UMLS dictionaries) in diagnostic criteria. The second module is an attribute extraction and classification module, in which the CRFs algorithms were chosen to annotate attributes belonging to each particular event (e.g., a laboratory test). The third module is a QDM/HQMF transformation and representation module. Since our annotation work is based on the UIMA framework, which stores the NLP-annotated elements in a CAS, we created a mapping from the CAS to our target data model, QDM/HQMF. The overall goal of the integrated framework is to transform the textual diagnostic criteria into the QDM/HQMF-based representation, including all data elements, value sets, and the logic composition of diagnostic criteria.

Fig. 2. An Integrated Framework of Representing Diagnostic Criteria in QDM

We describe our methods in detail as follows, including the methods for individual criterion extraction and classification, attribute extraction and classification, the QDM/HQMF representation implementation, and the performance evaluation.

3.2.1 Individual Criterion Extraction and Classification

We used cTAKES to annotate diagnostic criteria under the UIMA framework. To extract individual criteria, each describing a different diagnosis event, from full-text diagnostic criteria, we implemented two processing steps: a) sentence detection and b) diagnosis event annotation. All individual sentences automatically extracted from the textual diagnostic criteria need to be further classified using the event annotations. Mapping rules are created between the cTAKES data annotation model (CAS, based on the type system) and the QDM model at the datatype level to support the generation of the QDM/HQMF representation (Figure 3).

Fig. 3. The Process Flow of Individual Criterion Extraction and Classification

(1) Data pre-processing

Our first step in structuring textual diagnostic criteria is to pre-process the original data; this is an essential step before the data are ready for NLP tools to process. Since our diagnostic criteria data are collected from different sources, including websites, medical books, and other documents, the data are noisy in nature; hence, we removed HTML characters to clean the web data, re-encoded the data using a unified character set, and replaced some special characters, such as non-English characters used in disease names.
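A minimal sketch of this kind of cleanup, assuming standard-library tools only; the function name and the exact replacements are illustrative, not the actual pipeline code.

```python
import html
import re
import unicodedata

def preprocess_criteria_text(raw: str) -> str:
    """Illustrative cleanup of crawled diagnostic-criteria text."""
    text = html.unescape(raw)                    # decode HTML entities such as &ge;
    text = re.sub(r"<[^>]+>", " ", text)         # drop residual HTML tags
    text = unicodedata.normalize("NFKC", text)   # unify character representations
    # Replace a few special characters (examples only)
    text = text.replace("\u2019", "'").replace("\u00a0", " ")
    return re.sub(r"\s+", " ", text).strip()

print(preprocess_criteria_text("A1C &ge; 6.5%<br/> FPG \u2265 126 mg/dl"))
```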

(2) Sentence detection using cTAKES

Detecting the sentences that describe specific diagnosis evidence in diagnostic criteria text is an essential step when extracting individual diagnostic criteria. cTAKES wraps a sentence detector from the open-source tool OpenNLP, which marks each individual sentence in the text using end-of-sentence terminating characters. The individual sentences detected are used for the classification in the next step.

(3) Event annotations using cTAKES

We used cTAKES to perform the event annotations on textual diagnostic criteria. cTAKES supports the automatic annotation of the following EventMentions: disease/disorder, sign/symptom, medication, medication event, and procedure. In our study, we focused on the event annotations of sign/symptom mentions and procedure mentions.

(4) Rule-based Sentence Classification

Considering that an EventMention is the core element of each individual diagnostic criterion, we created event-based rules to classify the detected sentences. A sentence is classified into a QDM category (e.g., Symptom, Laboratory Test) using the core event identified in each sentence. Each sentence has zero or more mentions recognized by cTAKES, and each of these mentions is mapped to a QDM category. The sentence classification rules are illustrated in Figure 4, in which we suppose that each individual Sentence(X) is annotated by cTAKES, and each EventMention has an event Type(Y). When one Sentence(X) is annotated with only one type of EventMention (n=1), it is classified into one Category(Y1); when one Sentence(X) contains more than one type of EventMention (n>1), it is classified into the corresponding categories: Category(Y1) to Category(Yn).

Fig. 4. Sentence Classification Rules

For example:

Sentence1 “Serum triglycerides ≥150 mg/dL (≥1.7 mmol/L) or HDL cholesterol <36 mg/dL (<0.9 mmol/l) in men and <40 mg/dL (<1.0 mmol/L) in women” is one of the individual sentences of the diagnostic criterion “Criteria for the Metabolic Syndrome.” cTAKES annotated two terms, “Serum triglycerides” and “HDL cholesterol,” with the same eventType (Procedure/Laboratory Procedure), so we classified Sentence1 into the category (Laboratory Test, Result).

Sentence2 “Clinical deterioration (worsening pain, increasing white cell count, worsening vital signs)” is one of the individual sentences of the diagnostic criterion “Indications and Contraindications for ERCP in Patients With Acute Biliary Pancreatitis.” cTAKES annotated the term “white cell count” as eventType (Procedure/Laboratory Procedure), and the terms “pain” and “vital signs” as eventType (Sign/Symptom). So we classified Sentence2 into the categories (Laboratory Test, Result) and (Symptom, Active).
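The rule in Figure 4 can be sketched as a small lookup from cTAKES event-mention types to QDM categories. The two mention types and their target categories come from the examples above; the function and dictionary names are illustrative assumptions.

```python
# Illustrative mapping from cTAKES EventMention types to QDM categories (see Figure 4).
EVENT_TO_QDM = {
    "Procedure/Laboratory Procedure": "Laboratory Test, Result",
    "Sign/Symptom": "Symptom, Active",
}

def classify_sentence(event_types):
    """Return the QDM categories for one sentence, given its annotated event types."""
    categories = {EVENT_TO_QDM[t] for t in event_types if t in EVENT_TO_QDM}
    return sorted(categories)

# Sentence2 above: "pain" and "vital signs" (Sign/Symptom) plus
# "white cell count" (Laboratory Procedure) -> two QDM categories.
print(classify_sentence(["Sign/Symptom", "Procedure/Laboratory Procedure", "Sign/Symptom"]))
# ['Laboratory Test, Result', 'Symptom, Active']
```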

(5) Data post-processing

Based on the content analysis and individual criterion classification of diagnostic criteria text, we further created a set of rules for data post-processing to refine the classification results.

First, UMLS semantic types are used to filter out unrelated event subtypes. In cTAKES event annotations, EventMention: sign/symptom and EventMention: procedure cover the mentions of symptoms and laboratory tests, respectively, that exist in the diagnostic criteria text. However, these two EventMentions in cTAKES have broader semantics than the corresponding datatypes in diagnostic criteria, so they can be further qualified using UMLS semantic types. We filtered out the unrelated semantic types and used more specific diagnostic sub-events to classify the sentences. Table 3 shows the filtering rules we used for the diagnostic criteria classification after running cTAKES.

Table 3.

UMLS Semantic Type Filter for cTAKES Event Annotation Results (Procedure and Sign/Symptom)

Related/In
  • Procedure mention: PROC|Procedures|T059|Laboratory Procedure
  • Sign/Symptom mention: DISO|Disorders|T184|Sign or Symptom
Unrelated/Out
  • PROC|Procedures|T060|Diagnostic Procedure
  • PROC|Procedures|T061|Therapeutic or Preventive Procedure
  • DISO|Disorders|T033|Finding
  • DISO|Disorders|T048|Mental or Behavioral Dysfunction
  • PHYS|Physiology|T042|Organ or Tissue Function
  • PHYS|Physiology|T041|Mental Process
  • PHYS|Physiology|T040|Organism Function
  • ACTI|Activities & Behaviors|T056|Daily or Recreational Activity
  • PHEN|Phenomena|T034|Laboratory or Test Result

In addition, there still exist some general concept annotations after the UMLS Semantic Type–based filtering. We further created the filtering rules to exclude these general concepts from the classification collection of the Laboratory Procedure (T059) and Sign or Symptom (T184). The excluded terms include “findings,” “symptoms,” “tests,” “analysis,” etc.
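A sketch of this post-processing step, combining the semantic-type filter of Table 3 with the general-term exclusion list above; the UMLS type identifiers and excluded terms come from the text, while the function and variable names are illustrative.

```python
# Semantic types retained after cTAKES event annotation (Table 3); all others are filtered out.
RELATED_SEMANTIC_TYPES = {
    "T059",  # Laboratory Procedure -> QDM "Laboratory Test, Result"
    "T184",  # Sign or Symptom      -> QDM "Symptom, Active"
}
# Overly general concepts excluded even when their semantic type is related.
GENERAL_TERMS = {"findings", "symptoms", "tests", "analysis"}

def keep_mention(term: str, semantic_type: str) -> bool:
    """Illustrative filter: keep only related semantic types and non-generic terms."""
    return semantic_type in RELATED_SEMANTIC_TYPES and term.lower() not in GENERAL_TERMS

print(keep_mention("white cell count", "T059"))  # True
print(keep_mention("findings", "T184"))          # False
```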

3.2.2 Attributes Extraction and Classification

For acquiring detailed attributes of each individual diagnostic criterion, we used a CRFs algorithm to classify the attributes of a particular diagnostic event, such as measure unit, value, severity, temporal constraints, etc. We then assigned standard concept codes to these attributes using UMLS CUIs or SNOMED CT codes. The attributes extraction and classification process is shown in Figure 5.

Fig. 5. The Process of Attributes Extraction and Classification using CRFs

(1) Training and testing data preparation

We defined a set of attribute types using the MAE annotator to prepare our training and testing data. For example, for the individual diagnostic criteria in the Laboratory Test category, we annotated them with the attribute types defined in Table 4. Note that “body&organ” is treated as one of the attributes of the EventMention type “Laboratory Test,” and the anatomical site annotation returned from cTAKES is used as one of the source features to train the CRF model for classification.

Table 4.

Attribute Types Definition of Laboratory Test Diagnostic Criteria

Type Label Type Definition
labtest The name of lab test
resultString The lab test result represented in text narration, such as increase, decrease, elevate, positive, etc.
resultValue Lab test result value is a number
symbol Sign before a value in a math notation representation, such as >,<, etc.
unit The unit used for the value in a unit symbol, usually after a value
method The method of lab test
body&organ The body or organ at which this lab test is conducted, such as serum, liver
demographic Only the sex demographic is annotated (women, men, female, male)
boolean and, or; used for connecting different single QDM elements
exclusion Exclusion criterion
(2) CRFs algorithm design

As illustrated in Figure 6, our CRF model consists of a vector of feature functions, f(y, O), with f = <f1, …, fk>, and a vector of weights θ = <θ1, …, θk>. O refers to the input sentences and the diverse features we deploy. A refers to the attribute categories we want to predict (namely, the attributes defined in Table 4, which we aim to recognize and classify). Formally, the model is represented as

Fig. 6. CRF Model for Attributes Classifications

P(A = y | O) = exp(θ · f(y, O)) / Z(O)

where y refers to the possible values of A, k refers to the number of features, and Z(O) = Σy′ exp(θ · f(y′, O)) is a normalization factor. For each position i in the observation feature sequence o, we define the |A| × |A| matrix random variable Mi(o) = [Mi(y′, y | o)] by

Mi(y′, y | o) = exp(Λi(y′, y | o))
Λi(y′, y | o) = Σk λk fk(ei, Y|ei = (y′, y), x) + Σk μk gk(vi, Y|vi = y, x)

where ei is the edge with labels (Yi−1, Yi) and vi is the vertex with label Yi. Since we utilized CRF++ for training and testing, L-BFGS (the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm) was used for parameter estimation.
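The study itself uses CRF++; the sketch below shows an equivalent setup with the sklearn-crfsuite Python wrapper as a stand-in (L-BFGS training over per-token feature dictionaries and per-sentence label sequences). The toy sentence, the tiny feature set, and the package choice are assumptions for illustration only, not the authors' CRF++ configuration.

```python
import sklearn_crfsuite  # stand-in for CRF++; trained with L-BFGS as in the paper

# One training sentence: "platelets < 100,000 cells/mm3"
# Each token is a dict of features (a tiny subset of the features listed in the next section).
X_train = [[
    {"word": "platelets", "has_digit": False, "suffix3": "ets"},
    {"word": "<",         "has_digit": False, "suffix3": "<"},
    {"word": "100,000",   "has_digit": True,  "suffix3": "000"},
    {"word": "cells/mm3", "has_digit": True,  "suffix3": "mm3"},
]]
y_train = [["labtest", "symbol", "resultValue", "unit"]]  # attribute labels from Table 4

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # expected to reproduce the training labels on this toy example
```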

(3) CRFs features definition

In order to apply the CRF framework, the appropriate feature vectors f(y, O) used to train the model must be defined. The features help discriminate attributes in the diagnostic criteria text. In our application, we built a list of features using the cTAKES aggregate annotation engine, and the feature extractor is in charge of extracting the feature vectors for the CRFs classification task from those cTAKES annotators, including the POS (Part-of-Speech) Tagger Annotator, LVG (Lexical Variant Generation) Annotator, Syntactic Parser Annotator, Semantic Role Labeling Annotator, Constituency Parser Annotator, Dependency Parser Annotator, and UMLS Concept Annotator. The features we extracted for our CRFs algorithms include (a minimal code sketch of a few of these features follows the list):

  • Bag-of-words: The unigrams, bigrams, and trigrams of words (i.e., tokens) within a window of [−2, 2].

  • POS tags: The POS unigrams, bigrams, and trigrams within a window of [−2, 2]. We use the combination of cTAKES and MedTagger POS Tagger.

  • Sentence information: The length of the sentence containing the word, whether there is an end mark at the end of the sentence such as ‘.’, ‘?’, or ‘!’, or whether there is any bracket matched in the sentence.

  • Affixes: All prefixes and suffixes of length from 1 to 5.

  • Orthographical features: The form information about a word (whether the word is uppercase, contains a digit or not, has uppercase characters inside, has punctuation marks inside, has a digit inside, or whether the numeral is Roman or Arabic, etc.)

  • Word shapes: There are two typical types of word shapes: one is generated by mapping any uppercase character, lowercase character, digit, and other character in the word to ‘A’, ‘a’, ‘#’, and ‘-’, respectively, and the other one is generated by mapping consecutive uppercase characters, lowercase characters, digits, and other characters to ‘A’, ‘a’, ‘#’, and ‘-’, respectively. For instance, the two types of word shapes of “PO/5mg” are “AA-#aa” and “A-#a”.

  • Rules generated by regular expressions. Currently, rules were created for two types of attributes: Unit and Symbol. Units in diagnostic criteria usually have a regular format and follow a number, for example, 100 g, 2 mm3, 6.5 mmol/L, 13 g/dL, 200 pg/mL, 20 mmol per liter, etc. We built a set of unit rules that cover most of the common units [46]. Symbols, such as <, >, <=, >=, ≤, ≥, and =, are also processed by particular rules to improve the performance.

  • Topic features: We generate topic features for each word using GibbsLDA++ [47].

  • UMLS Concept features: We extract concept features from the UMLS dictionary lookup analysis engine of MedTagger.
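A minimal sketch of a few of the listed features (word shapes, affixes, an orthographic flag, and the unit/symbol regular-expression rules). The regular expressions cover only the example units quoted above, and the cTAKES/MedTagger-derived features (POS, UMLS concepts, topics) are omitted; everything here is an illustrative assumption rather than the actual feature extractor.

```python
import re

UNIT_RE = re.compile(r"^(mg/dl|mmol/l|g/dl|pg/ml|mm3|%|g|mm)$", re.IGNORECASE)
SYMBOL_RE = re.compile(r"^(<=|>=|=|<|>|≤|≥)$")

def word_shape(token: str, collapse: bool = False) -> str:
    """Map characters to A/a/#/-; optionally collapse runs (the two shapes described above)."""
    shape = "".join("A" if c.isupper() else "a" if c.islower() else "#" if c.isdigit() else "-"
                    for c in token)
    return re.sub(r"(.)\1+", r"\1", shape) if collapse else shape

def token_features(token: str) -> dict:
    """Illustrative per-token feature dict for the CRF."""
    return {
        "shape": word_shape(token),
        "shape_collapsed": word_shape(token, collapse=True),
        "prefix2": token[:2], "suffix2": token[-2:],    # affixes (lengths 1-5 in the paper)
        "has_digit": any(c.isdigit() for c in token),   # orthographic feature
        "is_unit": bool(UNIT_RE.match(token)),          # regular-expression unit rule
        "is_symbol": bool(SYMBOL_RE.match(token)),      # regular-expression symbol rule
    }

print(word_shape("PO/5mg"))                  # AA-#aa
print(word_shape("PO/5mg", collapse=True))   # A-#a
print(token_features("mg/dL")["is_unit"])    # True
```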

(4) Training and classification

Using the training data and features, we trained our CRFs model. Following convention, we split the dataset into training and testing sets to better model the underlying distribution; in this study, the dataset was divided into training and testing sets with a 70:30 split. For training, we ran 5-fold cross-validation to obtain the optimal model. The input of the training model is:

  1. An individual criteria sentence S

  2. The position i of a word in the sentence

  3. The label li of the current word

  4. The feature vector X=[x1; x2; x3; … xn] of the current word

After the CRFs training, a classifier was generated to annotate and classify the attributes of individual criteria with the same datatype as the training data. Currently, our experiments mainly focus on the attribute annotation of the datatype “Laboratory Test, Result”; therefore, the output categories of the classifier are the attribute types we defined in Table 4.
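A sketch of the 70:30 split and the 5-fold cross-validation over the training portion, using scikit-learn utilities as a stand-in (the study itself used CRF++). The toy sequences and all names are illustrative assumptions.

```python
from sklearn.model_selection import KFold, train_test_split

# Toy stand-ins for annotated Laboratory Test criteria: one feature-dict sequence and
# one attribute-label sequence (labels from Table 4) per individual criterion.
criteria_X = [[{"word": f"token{i}"}] for i in range(10)]
criteria_y = [["labtest"] for _ in range(10)]

# 70:30 split into training and testing data, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    criteria_X, criteria_y, test_size=0.30, random_state=42)

# 5-fold cross-validation over the training portion to tune the CRF model.
for fold, (tr_idx, val_idx) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=42).split(X_train)):
    train_fold = [X_train[i] for i in tr_idx]
    val_fold = [X_train[i] for i in val_idx]
    print(f"fold {fold}: {len(train_fold)} train sequences, {len(val_fold)} validation sequences")
```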

3.2.3 QDM/HQMF XML rendering

For automatically generating a standard QDM/HQMF XML for each textual diagnostic criterion, we built the mapping rules on two levels: datatype level and data level.

(1) Datatype-level QDM/HQMF mapping

We created the datatype-level mappings between the cTAKES UIMA-CAS type system and the QDM datatypes (Figure 7) through analyzing the textual definitions of data types in both models. Datatype-level mappings are mainly focused on 7 cTAKES types and 8 selected QDM datatypes that frequently appear in diagnostic criteria.

Fig. 7. Datatype-level Mappings between the cTAKES Type System and the QDM Datatypes

Since the cTAKES type system and the QDM model are defined for distinct application objectives, semantic conflict is always a challenge when implementing the model mapping. On the one hand, a type from one model may clearly cover the same semantic scope as a type from the other model (a 1:1 mapping), such as the mapping from cTAKES:MedicationMention to “QDM: Medication, Active”; on the other hand, one type from one model sometimes covers the semantic scope of multiple types in the other model (a 1:n or n:1 mapping). For example, cTAKES:SignSymptomMention covers the scope of both the signs and the symptoms of a disease diagnosis, whereas “QDM: Symptom, Active” covers only the scope of the symptom, and “QDM: Physical Exam, Performed” actually reflects the sign for a disease diagnosis. Therefore, we built a 1:2 mapping between these three types.
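The datatype-level mapping can be sketched as a dictionary from cTAKES mention types to one or more QDM datatypes. The 1:1 and 1:2 cases are taken from the text above; the remaining entries of Figure 7 are omitted, and the structure itself is an illustrative assumption.

```python
# Datatype-level mapping sketch (subset of Figure 7): cTAKES type -> QDM datatype(s).
CTAKES_TO_QDM = {
    "MedicationMention":  ["Medication, Active"],                            # 1:1 mapping
    "SignSymptomMention": ["Symptom, Active", "Physical Exam, Performed"],   # 1:2 mapping
    "ProcedureMention":   ["Laboratory Test, Result"],  # after UMLS semantic-type filtering
}

for ctakes_type, qdm_datatypes in CTAKES_TO_QDM.items():
    print(f"{ctakes_type} -> {qdm_datatypes}")
```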

(2) Data-level QDM/HQMF transformation

More specific than the datatype-level mappings, the data-level mappings are executed on the instance details to support data model and format transformation. The mapping rules were created between the defined CRFs annotation types and QDM elements. All predicted attributes of diagnostic criteria are mapped into the HQMF XML template using the data-level mapping rules. Figure 8 illustrates a data-level mapping example of the QDM datatype ‘Laboratory Test, Result’. Specifically, we tagged the CRFs-based classifier output in the IOB (Inside, Outside, Beginning) format. Next, a mapping was made between the output of the CRFs and QDM/HQMF so that a complete diagnostic criterion could be displayed in HQMF XML format.
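A sketch of this data-level step: collapse the IOB-tagged CRF output into attribute spans keyed by the Table 4 label names, which can then be slotted into the ‘Laboratory Test, Result’ element. The label names come from Table 4; the function and its exact behavior are illustrative assumptions.

```python
def iob_to_attributes(tokens, iob_labels):
    """Collapse IOB-tagged CRF output into {attribute_type: text span} (illustrative)."""
    spans, current_type, current_tokens = {}, None, []
    for token, label in zip(tokens, iob_labels):
        if label.startswith("B-"):
            if current_type:
                spans[current_type] = " ".join(current_tokens)
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            if current_type:
                spans[current_type] = " ".join(current_tokens)
            current_type, current_tokens = None, []
    if current_type:
        spans[current_type] = " ".join(current_tokens)
    return spans

tokens = ["platelets", "<", "100,000", "cells/mm3"]
labels = ["B-labtest", "B-symbol", "B-resultValue", "B-unit"]
print(iob_to_attributes(tokens, labels))
# {'labtest': 'platelets', 'symbol': '<', 'resultValue': '100,000', 'unit': 'cells/mm3'}
```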

Fig. 8. Data-level Mappings Between CRFs Annotation Types and QDM HQMF Elements

The QDM representation is standardized and modularized; each object representation is composed of structured templates. There are three kinds of “result comparison” templates in QDM (HQMF R1 templates 2.16.840.1.113883.3.560.1.1019.1, 2.16.840.1.113883.3.560.1.1019.2, and 2.16.840.1.113883.3.560.1.1019.3) [45], which are combined with the template “Laboratory Test, Result” to represent one particular kind of laboratory test result. Table 5 shows the differences among the three comparison templates.

Table 5.

The 3 Comparison Templates Related to Laboratory Test Result

Template Type Description
2.16.840.1.113883.3.560.1.1019.1 Value Type=“ANYNonNULL” Result is not null
2.16.840.1.113883.3.560.1.1019.2 Value Type=“CD” Result is literals
2.16.840.1.113883.3.560.1.1019.3 Value Type=“IVL_PQ” Result is value

The supplementary document S1 shows an example of the HQMF XML representation of the QDM-based criterion “Laboratory Test, Result: LDL-c (result < 100 mg/dL)”, with the target elements highlighted. The criterion is composed of two templates: the HQMF “Laboratory Test, Result” template (HQMF R1 template 2.16.840.1.113883.3.560.1.12) and the HQMF “result comparison” template (HQMF R1 comparison template 2.16.840.1.113883.3.560.1.1019.3).
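To give a feel for the rendering step, the following deliberately simplified sketch assembles an XML fragment carrying the two template identifiers mentioned above. It is not the actual HQMF R1 schema (element names, nesting, and attributes are hypothetical); it only illustrates how the extracted attributes could be slotted into a templated structure.

```python
import xml.etree.ElementTree as ET

def render_lab_test_result(name, symbol, value, unit):
    """Highly simplified, hypothetical XML rendering; not the real HQMF R1 template structure."""
    entry = ET.Element("entry")
    ET.SubElement(entry, "templateId", root="2.16.840.1.113883.3.560.1.12")      # Laboratory Test, Result
    ET.SubElement(entry, "templateId", root="2.16.840.1.113883.3.560.1.1019.3")  # result comparison (IVL_PQ)
    ET.SubElement(entry, "code", displayName=name)
    # The real HQMF uses xsi:type="IVL_PQ"; a plain attribute is used here for simplicity.
    ET.SubElement(entry, "value", {"type": "IVL_PQ", "comparator": symbol,
                                   "value": value, "unit": unit})
    return ET.tostring(entry, encoding="unicode")

print(render_lab_test_result("LDL-c", "<", "100", "mg/dL"))
```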

3.2.4 Evaluation Methods

We evaluated the overall performance of the developed classification algorithms for the QDM representation using the manual annotations as the gold standard. We manually annotated individual criterion sentences (n=267), covering two types of diagnostic evidence, Symptom (n=125) and Laboratory Test (n=142), from a collection of 63 diagnostic criteria. Furthermore, we manually annotated the laboratory test attributes of the individual Laboratory Test criteria. We used 100 of the 142 individual diagnostic criteria as the training data and 42 of the 142 as the testing data to evaluate the performance of the CRFs-based algorithm.

Specifically, two authors (N.H., Y.Y.) annotated the diagnostic criteria, and we used the kappa statistic to evaluate the inter-annotator agreement. In formula (1), P(A) is the annotator agreement, and P(E) is the expected agreement. After our two-round annotation, the inter-annotator agreement K reached 88.44%. According to common interpretations of the kappa statistic, kappas over 0.75 are regarded as excellent, and interobserver agreement of 0.81–0.99 indicates “almost perfect agreement” [48, 49].

K = (P(A) − P(E)) / (1 − P(E)) (1)

We found that most of the inconsistent annotations were caused by different annotation styles between the two annotators. For example, in the sentence “Exclusion of active viral infection both in pericardial effusion and endomyocardial/epimyocardial biopsies (no virus isolation, no IgM-titer against cardiotropic viruses in pericardial effusion, and negative PCR for major cardiotropic viruses)”, one annotator annotated the term “IgM” as a laboratory test, whereas the other annotated the term “IgM-titer” as a laboratory test. These inconsistent annotations were resolved through a group discussion, and the final consensus was achieved and used as the gold standard. Three standard measures were used to measure the performance of the classifications: precision (2), recall (3), and F-measure (4). In the following equations, TP stands for true positive, FN stands for false negative, and FP stands for false positive.

Precision = TP / (TP + FP) (2)
Recall = TP / (TP + FN) (3)
F-measure = (2 × Precision × Recall) / (Precision + Recall) (4)
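Equations (1)–(4) translate directly into code; a minimal sketch follows, where the counts used in the example call are illustrative rather than study data.

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Equations (2)-(4): precision, recall, and F-measure from TP/FP/FN counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

def cohen_kappa(p_a: float, p_e: float) -> float:
    """Equation (1): inter-annotator agreement from observed and expected agreement."""
    return (p_a - p_e) / (1 - p_e)

print(precision_recall_f(tp=84, fp=16, fn=14))  # illustrative counts, not study data
```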

4 Results

4.1 Evaluation on Individual Criterion Extraction and Classification

In our experiment, we tested the results of cTAKES event annotations that fall into the QDM categories of Laboratory Test and Symptom. Here are two examples of criterion sentences automatically extracted from the original diagnostic criteria text by the sentence detector:

  • Laboratory Test, Result Criterion1: Thrombocytopenia (platelets <100,000 cells/mm3) (According to Sentence Classification Rules in Figure 4)

  • Symptom, Active Criterion2: Gastrointestinal-hepatic dysfunction: Moderate (diarrhea, nausea, vomiting, abdominal pain) (According to Sentence Classification Rules in Figure 4)

Based on the automatic sentence identification results and the rule-based sentence classification, we evaluated the performance of the individual criterion classification for Laboratory Test and Symptom. Table 6 shows the evaluation results in terms of whether the mapping rules correctly allocate a QDM datatype to an individual criterion.

Table 6.

Performance of the Individual Diagnostic Criterion Classification Results

QDM Datatype Laboratory Test, Result Symptom, Active Average
Precision 0.84 0.84 0.84
Recall 0.92 0.79 0.86
F-score 0.87 0.82 0.84

4.2 Evaluation on Attribute Extraction and Classification

Here is an example of the attribute annotations in an individual laboratory test criterion:

  • Thrombocytopenia (platelets[labtest] <[symbol] 100,000[value] cells/mm3[unit]) (Value Type=“IVL_PQ”, see Table 5)

The evaluation mainly focused on the classification performance of four kinds of attribute elements; some attribute types were combined when the evaluation was performed, including LabTest Event (laboratory test EventMention), LabTest Value (test result as a value with unit), LabTest Symbol (the symbol related to the value), and LabTest Result Description (result in a literal description, such as ‘Positive’). Table 7 shows the classification evaluation results for the elements of the QDM datatype Laboratory Test, Result.

Table 7.

Performance of the Attributes Extraction and Classification Results (QDM: Laboratory Test, Result)

Attributes LabTest Event LabTest Result Value (Value/Unit) LabTest Result Value (Symbol) LabTest Result Description Average
Precision 0.94 0.98 0.96 0.90 0.95
Recall 0.91 0.90 0.87 0.84 0.88
F-score 0.93 0.94 0.91 0.87 0.91

4.3 Implementation on QDM/HQMF XML Rendering

Using the algorithms described above, we assembled a computational pipeline that transforms diagnostic criteria in free text into the QDM/HQMF XML structure. Our framework realizes an automated, web-based online transformation pipeline via Apache Tomcat. Namely, after users upload a free-text file or type free-text criteria into the web dialog box, the transformation into the QDM/HQMF XML structure is executed by the backend processing, including extraction of features and concepts and classification of attributes. An HQMF XML rendering of the structured diagnostic criteria is displayed in the corresponding dialog box (see the supplementary document S2).

5 Discussion

Previous studies using clinical NLP and standard clinical data models laid the foundation for the method and tool development in the present study, although few studies have directly focused on using NLP-based approaches to support the formalization of free-text diagnostic criteria, and no unified data model has been developed specifically for diagnostic criteria representation.

In this study, we developed an NLP-based computational framework to support the representation of clinical diagnostic criteria in a QDM-based structure, leveraging both machine learning and rule-based NLP technologies. Under this framework, we also carried out a prototype implementation and performance evaluation. Our experiment was implemented within the standard UIMA architecture, which effectively facilitates further component extension. As a basic NLP tool, cTAKES supports many computational processes within our overall workflow, including sentence detection, event annotation, and feature generation for the CRFs-based machine learning algorithm. With the goal of translating unstructured diagnostic criteria into a structured QDM representation, the model mapping between the CAS and QDM is another significant piece of work after the NLP-supported extraction and classification tasks. From the standardization perspective, our framework supports the reuse of existing standards-based information models, terminology services, and technical interfaces. From the implementation perspective, we chose the two most common types of diagnosis evidence, Symptom and Laboratory Test, for our experiment in this study. The evaluation results indicated that the framework and methods we designed and developed are feasible for representing diagnostic criteria in a standard and computable way.

For sentence identification and classification on the diagnostic criteria, refining the filtering rules with UMLS semantic types and concept exclusions improved the accuracy of sentence classification. Using the filtering rules, 95% of the event-annotated sentences could be accurately classified into the categories of Symptom or Laboratory Test. We noticed that in some cases, one sentence could be automatically classified into more than one category because the original sentence combined multiple events. We are working on further dividing the semantic groups of the combined events in a complex sentence, to support more accurate annotation in the next step of attribute classification.

For the attribute extraction and classification of individual criteria, our CRFs training model adapted well to the particular category of criteria: Laboratory Test. We found that, in addition to the standard features, the rules generated by regular expressions, the UMLS concepts, and the topic features boost the performance by about 10%. Following the same principles, we consider that the CRFs algorithms could achieve similar performance on other categories of individual diagnostic criteria, such as physical examination, medication, and procedure. These categories involve different rules and patterns, which will require us to further analyze the data in detail, optimize the training features, and build new rules to adapt to their particularities.

For the implementation of the transformation tool, we have prototyped a web-based application that enables the transformation from an individual criterion to the QDM/HQMF XML structure. In the prototype implementation, we focused only on laboratory test diagnostic criteria. The prototype provides batch real-time processing and immediate visualization in an effective way, demonstrating that our framework is feasible and practical.

There are a number of limitations in this study. First, we mainly focused on two QDM categories, Symptom and Laboratory Test, for the experiments. We plan to further analyze the common characteristics and particularities of other types of individual diagnostic criteria. In addition, our manually annotated training data are still limited in terms of the number of individual criteria annotated, resulting in some attributes (e.g., negation rationale, or reason) occurring only a few times, which is not sufficient for learning patterns from them. However, given their self-learning nature and minimal need for manual intervention, the CRFs machine learning algorithms can be adapted and extended easily to other types of criteria as training data accumulate. Note that the semantic filter is currently designed for general clinical domains; it does not cover specific domains, such as mental disorders, that would need specialized vocabularies and customized semantic filters.

Second, our topic modeling method is based on heuristic configurations. We followed previous studies to obtain the topic distribution in this study, but we postulate that the configurations can be enhanced in the future [50]. In addition, we used the cTAKES tool with its default configuration, the performance of which may be improved in the future.

Third, there is another challenge related to value set standardization. As previously described, diagnostic criteria can be well transformed and represented in the standard structure of QDM/HQMF; however, the value set representation in the structured QDM/HQMF needs to be standardized. Although QDM is designed to support the use of value sets, there is currently no authoritative value set database developed for the purpose of standardizing clinical diagnostic criteria. In our implementation, we temporarily created local value sets and used UMLS codes in the QDM/HQMF renderings of diagnostic criteria. In the future, we plan to leverage the National Library of Medicine Value Set Authority Center (VSAC) [51] and the Object Management Group (OMG) standard Common Terminology Services 2 (CTS2) [52] to establish value set definition services.

Finally, the present study mainly focuses on building the methodology for converting textual diagnostic criteria into a standard format based on QDM. On the one hand, we believe that the methods we developed could be generalized to other datatypes specified in QDM or to other types of criteria (e.g., clinical trial eligibility criteria); on the other hand, our implementation prototype is still preliminary. In the future, we plan to improve the performance of our tools by taking practical workflows into account. For example, we will develop a user-friendly interface that allows human experts to use the results produced by our methods to author structured and computable diagnostic criteria. We will also conduct rigorous evaluation to assess how our methods can be used in a realistic setting to improve efficiency.

6 Conclusion

We developed a framework with methods and tools that transform free-text diagnostic criteria into a structured QDM representation in a semi-automatic way. Clinical NLP tools (e.g., cTAKES) and machine learning algorithms (e.g., CRFs) successfully supported the annotation, classification, and representation of textual diagnostic criteria in QDM. We hypothesize that our approach could also be generalized to clinical knowledge representation and formalization using other information models, such as HL7 FHIR. In the future, we plan to further improve the implementation algorithms, for example by improving the sentence detector to consider sentence content and by optimizing and extending the CRFs features for multiple types of diagnostic criteria, and to make our web-based application more user-friendly. We also plan to extend our web-based prototype to support more QDM datatypes and to develop tools that support the editing and authoring of structured diagnostic criteria by clinicians and domain experts.

Supplementary Material


Highlights.

  • A computational framework for modeling textual diagnostic criteria is developed.

  • A mechanism for classifying individual diagnostic criterion is created.

  • A machine-learning algorithm for classifying criterion attributes is created.

  • A computational pipeline prototype is developed.

  • The tool performance is evaluated with satisfactory results.

Acknowledgments

This project was supported in part by the caCDE-QA project (U01 CA180940). This paper is expanded from a short paper presented at the BioNLP 2015 Workshop.

Footnotes

Competing interests

The authors declare that they have no competing interests.

Contributors

G.J. and N.H. conceived and designed the study; N.H., D.L., and G.J. drafted the manuscript; N.H., Y.Y., and D.L. led the data collection and individual criteria extraction and classification; D.L., Q.X., Y.Y., and N.H. led the implementation of CRFs-based algorithms and web-based interface. H.L. and G.J. provided leadership for the project; all authors contributed expertise and edits.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References
