Abstract
Background:
A significant amount of clinical information captured as free-text narratives could be better utilized for several applications, such as clinical-decision support, ontology development, evidence-based practice, and research. The Human Phenotype Ontology (HPO) is specifically used for semantic comparisons for diagnostic purposes. All these functions require quality coverage of the domain of interest. In this study, we used natural language processing to capture craniofacial and oral phenotype signatures from electronic health records and then used these signatures for evaluation of existing oral phenotype ontology coverage.
Methods:
We applied a text-processing pipeline based on clinical Text Analysis and Knowledge Extraction System (cTAKES) to annotate the clinical notes with Unified Medical Language System (UMLS) codes. We extracted the disease/disorder phenotype terms, which were then compared against Human Phenotype Ontology (HPO) terms and their synonyms.
Results:
We retrieved 2,329 de-identified clinical notes from 558 patients. In total, 2416 unique disease/disorder phenotype terms were extracted, of which 210 were craniofacial or oral phenotype terms. Twenty-six of these phenotypes were not found in the HPO.
Conclusions:
In this paper, we demonstrated that natural language processing tools could extract relevant phenotype terms from clinical narratives, which could help identify gaps in existing ontologies and enhance craniofacial and dental phenotyping vocabularies.
Practical implications:
The expansion of terms in the dental, oral, and craniofacial domains in HPO is particularly important as the dental community moves toward electronic health records.
Keywords: natural language processing, evidence-based dentistry, craniofacial and oral phenotypes, ontology
1. Introduction
In recent years, there has been an explosion of available data as the medical field has moved from paper-based to electronic health records (EHRs), along with other big data sources such as digital imaging, genomics, proteomics, and metabolomics. To make this vast amount of data clinically useful and to support in-depth analysis to understand the molecular basis of diseases, methods and tools are required that accurately integrate and link -omics data with clinical information1.
A phenotype is defined as the morphological, physiological and behavioral characteristics of an individual, while a genotype is an individual’s entire genetic makeup2,3,4. A phenotype terminology is a catalog of specific signs, symptoms, imaging findings, and other abnormalities seen in clinical practice. A concept in an ontology denotes a single meaning, including all variations from any source that express that same meaning. Each of these concepts is assigned at least one semantic type and a Concept Unique Identifier, or CUI, which represents that single meaning2. Individual clinical concepts (e.g. “class II malocclusion”) are referred to as terms. Ontologies provide helpful definitions of these terms as well as the relationships between them. When compiled together, these terms form a foundation for computational searching and analysis and, most importantly, provide the ability to create inferences about information. For example, Crouzon syndrome is a rare craniofacial abnormality whose phenotype includes craniosynostosis, prominent forehead, curved nose, midface hypoplasia, short upper lip, mandibular prognathism with class III malocclusion, crowded teeth, and anterior open bite. However, patients with Crouzon syndrome may exhibit a wide spectrum of these features. To better understand the different sub-phenotypes in craniofacial conditions, precise and detailed disease-phenotype relationships found in ontologies are needed.
One approach to capture phenotypic information is the use of biomedical ontologies that are consensus-based vocabularies. Gruber defines ontology as “the specification of conceptualizations, used to help programs and humans share knowledge”5. The highly structured vocabulary and relationships within ontologies help to optimize the exchange of information in and across domains without an uncontrolled explosion due to excess information. The Human Phenotype Ontology (HPO) facilitates the study of disease-phenotype relationships and is developed using medical literature. HPO is a part of the Monarch Initiative, an NIH-supported international consortium dedicated to semantic integration of biomedical and model organism data with the ultimate goal of improving biomedical research6–8. Human phenotype ontologies are especially important in cases of rare and undiagnosed diseases where clinical experts may use different terms to describe similar clinical phenotypes9.
The construction of ontologies requires content expertise and continuous accrual of new knowledge, which is challenging to incorporate in a timely manner. The comprehensiveness of an ontology in a given domain is crucial for its usefulness. Thus, there is a need for systems that can effectively and efficiently provide an assessment of domain coverage and potential new content for inclusion. One such mechanism is to exploit the information found in biomedical resources like published literature and EHRs. Electronic health records have several advantages for use in phenotyping, such as cost efficiency and the availability of large amounts of clinical and temporal information10. Although structured data, such as International Classification of Disease (ICD) codes, are useful controlled vocabularies, they are limited in detail as they are designed to facilitate medical billing. On the other hand, unstructured clinical data (e.g. consult notes, history and physical notes, discharge summaries, and pathology reports) are rich in clinical detail of patient conditions, including diagnosis, signs and symptoms, family history, onset, severity and other clinical findings. Unfortunately, the unstructured nature of this free-text information makes it challenging to extract and interpret the phenotypic information within it. Natural language processing (NLP) can support analysis of data derived from EHRs by converting clinical and genetic information from unstructured text to a computer-accessible form11.
In this study, we used existing NLP tools to capture different terms used for describing clinical phenotypes in the craniofacial, dental and oral domains. This information was further used to evaluate the completeness of existing phenotype ontologies. We utilized an open source NLP tool known as clinical Text Analysis and Knowledge Extraction System (cTAKES), an application first developed at the Mayo Clinic12. Our aim was to extract craniofacial, dental and oral phenotype terms from EHRs at the NIH Clinical Center and to identify terms that may be missing from the HPO without any feature engineering or retraining of machine learning models. Such successful extraction will enhance the coverage of craniofacial, dental and oral terms in HPO and demonstrate the use of a data-driven methodology utilizing an NLP tool within a diverse clinical source.
2. Methods
Our methodology included the following steps: 1) data collection, 2) natural language processing, 3) identifying craniofacial and dental phenotypes, and 4) comparison with the HPO. Figure 1 shows the workflow of the text-mining pipeline that extracts the phenotype terms from clinical documents.
2.1. Dataset
Research subjects who are admitted to the NIH Clinical Center (CC) have their clinical documentation archived in the NIH CC data warehouse system known as the Biomedical Translational Research Information System (BTRIS)13. BTRIS incorporates data from the CC EHRs and research systems from the 27 NIH institutes and centers (ICs). This includes data on common and rare conditions that are routinely studied at the NIH. Of the 27 institutes, clinical data are collected from only the 18 that have ongoing human subject protocols. These data include structured as well as unstructured clinical texts, available to researchers in a de-identified, secure and controlled manner. As a first step, a query of BTRIS was performed to search data from 2007–2015. The search criteria were based on clinical terms related to the dental, oral, and craniofacial region (e.g., hemifacial microsomia) and ICD-9 codes (e.g., 756.0); examples are listed in Table 1. The full search criteria are in supplemental material Table 1.
Table 1:
Terms | ICD-9 codes |
---|---|
Abnormalities, Craniofacial | 376.44 |
Craniofacial Abnormality | 351.8 |
Craniofacial and skeletal abnormality | 351.9 |
Craniomandibular Disorder | 438.83 |
Craniomandibular Diseases | |
The search was done against the BTRIS database containing 30 million clinical encounter notes. Table 2 lists the distribution of the types of notes including first registration reports (first admission summary for all new human subjects at the NIH CC), dental consults, all other consults (e.g. dermatology, genetics, pediatrics), discharge summaries and outpatient notes.
Table 2:
Types of notes | Count |
---|---|
First registration reports | 254 |
Dental consults | 521 |
Other consults | 1108 |
Discharge summaries/outpatient notes | 446 |
2.2. Text Mining Pipeline
Establishment of a reference standard
To capture oral and craniofacial terms, a reference standard was created using 200 documents randomly selected from the data set of 2329 notes. From this corpus, 112 sentences were selected and independently annotated by two clinicians (BG, RM) for disease/disorder terms. In the first round, with no annotation guidelines, an inter-rater agreement (linear weighted kappa) of 0.62 was achieved14. Disagreements were reconciled through consensus and the annotation schema was refined. In the second round, the same two clinicians rated the documents independently using the established guidelines (linear weighted kappa = 0.82). An example is depicted in Fig 3 in supplemental materials. Given the high inter-rater reliability of the annotation, one clinician (RM) extracted the remaining terms from the corpus.
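The agreement statistic reported above can be illustrated with a short sketch. This is a minimal, assumption-laden example: the function name, the ordered category labels, and the toy ratings are hypothetical and are not the study's annotation data or annotation schema. It computes Cohen's kappa with linear disagreement weights for two raters.

```python
from collections import Counter


def linear_weighted_kappa(rater_a, rater_b, categories):
    """Cohen's kappa with linear weights for two raters.

    `categories` lists the possible labels in their ordinal order
    (hypothetical here; the study's annotation schema is not reproduced).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")

    index = {label: i for i, label in enumerate(categories)}
    n = len(rater_a)

    # Joint and marginal counts of the category indices.
    joint = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # linear disagreement weight
            observed += weight * joint[(i, j)] / n
            expected += weight * (marg_a[i] / n) * (marg_b[j] / n)

    if expected == 0.0:
        # Both raters used one identical category throughout.
        return 1.0
    return 1.0 - observed / expected
```

With two categories the linear weights reduce to ordinary (unweighted) Cohen's kappa; the weighting only matters when there are three or more ordered categories.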
Note parsing
We installed Oracle VirtualBox 4.3.10 r93012 to host a virtual machine running the Ubuntu 14 operating system, which included the open source NLP tool cTAKES 3.2.2. Our application was based on various components adapted from the existing cTAKES and on scripts written in Python 3.5.1. Because cTAKES offers several components, we explored different analysis engines for various categories of information extraction and selected the aggregate plain text Unified Medical Language System (UMLS) processor for our final output. UMLS, developed by the National Library of Medicine (NLM), lists comprehensive biomedical and healthcare terms for developing computer systems capable of understanding the specialized vocabulary2. The clinical notes were preprocessed into the cTAKES input format with Python scripts. Modules such as the sentence boundary detector, tokenizer, normalizer, part-of-speech tagger, shallow parser and named entity recognizer were used in our pipeline. The text analysis continued with dictionary-based lookup using the UMLS Metathesaurus module to detect disease/disorder terms. Detection was performed at the token level first and then expanded to match at the noun phrase level. All detected concepts were extracted along with their concept unique identifiers (CUIs), preferred terms, and coding schema using a Python script. Finally, a custom script was used to analyze and filter all unique disease/disorder terms by CUI. Please see supplemental table 4 for more information.
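The dictionary lookup and the final filtering by CUI can be sketched as follows. This is a simplified illustration and not the cTAKES implementation or the authors' scripts: the greedy longest-match strategy, the function names, and the two-entry toy dictionary are assumptions made for the example (the two CUIs and preferred terms are the ones shown in Table 3).

```python
import re

# Toy stand-in for the UMLS Metathesaurus dictionary (hypothetical subset).
# Keys are lower-cased token tuples; values are (CUI, preferred term).
TOY_DICTIONARY = {
    ("overbite",): ("C0266063", "Deep Overbite"),
    ("epidermal", "nevus"): ("C0334082", "Epidermal Nevus"),
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lookup_concepts(text, dictionary, max_len=3):
    """Greedy longest-match lookup of token spans in the dictionary."""
    tokens = tokenize(text)
    hits = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + span])
            if key in dictionary:
                cui, preferred = dictionary[key]
                hits.append({"matched": " ".join(key),
                             "cui": cui,
                             "preferred_text": preferred})
                i += span
                break
        else:
            i += 1  # no entry starts at this token
    return hits


def unique_by_cui(hits):
    """Keep the first hit for each CUI, preserving order of appearance."""
    first_seen = {}
    for hit in hits:
        first_seen.setdefault(hit["cui"], hit)
    return list(first_seen.values())
```

For instance, calling `unique_by_cui(lookup_concepts(note_text, TOY_DICTIONARY))` on a note that mentions "overbite" twice and "epidermal nevus" once would keep one record per CUI, which mirrors the filtering step described above.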
Data-driven concept screening and comparison with Human Phenotype Ontology
A team of clinical experts (AB, BG, ME, JL) further evaluated the disease/disorder phenotype terms obtained by processing the entire corpus. They screened the craniofacial and dental phenotypes from the list, and these phenotype terms were then manually compared against the Human Phenotype Ontology database (November 2016 version).
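In this study the comparison against HPO was performed manually. An automated first pass over labels and synonyms might look like the sketch below, which is offered only as an illustration: the function names, the normalization rules, and the small HPO-like dictionary (labels with made-up synonym lists) are assumptions, not the HPO release used by the authors.

```python
def normalize(term):
    """Lower-case, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(term.lower().replace("-", " ").split())


def build_hpo_index(hpo_entries):
    """Build a set of normalized labels and synonyms.

    `hpo_entries` maps a label to a list of synonyms (toy data here).
    """
    index = set()
    for label, synonyms in hpo_entries.items():
        index.add(normalize(label))
        index.update(normalize(s) for s in synonyms)
    return index


def find_missing_terms(extracted_terms, hpo_index):
    """Return extracted terms with no exact normalized match in the index."""
    return [t for t in extracted_terms if normalize(t) not in hpo_index]


# Hypothetical miniature of an HPO label/synonym table.
TOY_HPO = {
    "Cleft palate": ["Cleft of palate"],
    "Narrow palate": ["Narrow roof of mouth"],
}

# Hypothetical extracted terms, with case and spacing variation.
EXTRACTED = ["Cleft Palate", "narrow  palate", "Posterior crossbite"]

MISSING = find_missing_terms(EXTRACTED, build_hpo_index(TOY_HPO))
```

Exact string matching of this kind only produces candidates. Terms it flags as missing would still need expert review, since a concept may be present in the ontology under wording that simple normalization does not catch.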
3. Results
We retrieved 2329 de-identified notes from 558 patients based on our search criteria. After running our pipeline on the raw de-identified free-text clinical notes, we extracted 2416 disease/disorder terms. An example of raw input and the output produced by cTAKES and our customized Python script is presented in Table 3. The application output was then compared with the reference standard, and the extracted information was labeled as true positive (TP), false positive (FP), true negative (TN) or false negative (FN). To test the accuracy of our method, these labels were used to compute three outcome measures: precision, recall, and F-score. As shown in Table 4, we achieved an overall F-score of 0.81 with high precision and moderate recall.
Table 3:
Raw input: FIRST_NAME i=120] who comes for an evaluation and possible treatment options for his Overbite. There is an epidermal nevus of the right face extending onto the neck.
Processed output:
Normalized form | UMLS CUI | Preferred text | Coding schema | Code |
---|---|---|---|---|
Overbite | C0266063 | Deep Overbite | SNOMED-CT | 60476005 |
Epidermal | C0334082 | Epidermal Nevus | SNOMED-CT | 25201003 |
Nevus | C0334082 | Epidermal Nevus | SNOMED-CT | 25201003 |
Table 4:
Performance Measure | Disease/disorder terms |
---|---|
Precision = TP/(TP+FP) | 0.91 |
Recall = TP/(TP+FN) | 0.74 |
F measure = 2*(Precision*Recall)/(Precision+Recall) | 0.81 |
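The formulas in Table 4 translate directly into code. The sketch below is illustrative only: the function names are ours, and the underlying TP, FP and FN counts are not reported in the paper, so no counts from the study are used.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F measure from TP, FP and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f_measure(precision, recall)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F measure in Table 4)."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


# F measure recomputed from the rounded precision and recall in Table 4.
F_FROM_REPORTED = f_measure(0.91, 0.74)
```

Recomputing from the rounded values 0.91 and 0.74 gives approximately 0.816, which agrees with the reported 0.81 to within the rounding of the inputs.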
Of the 2416 disease/disorder terms, 210 belonged to the craniofacial and dental domain. Comparison with HPO (November 2016 version) identified 26 terms that were not present in the database. Overall, we found that 11% of the extracted craniofacial and dental phenotype terms were absent from HPO. The final list of missing terms can be found in Supplemental Material 2.
4. Discussion
Use of phenotype sets for clinical research depends on many factors. First, the presence of all relevant and necessary terms in an ontology is of utmost importance. Such terms should also map to the medical jargon and terminology found in clinical notes. Second, the presence of a model organism phenotype that is compatible and mapped to the human phenotype ontology is necessary to facilitate effective matching based on the phenotype similarities between species. This trans-species mapping is critical to further research in developmental variation, evolutionary modifications, and genetic etiology of normal and diseased development. Additionally, the presence of searchable central databases or registries that contain de-identified but coded phenotypes of patients should support semantic matching of similar cases15–17. This is particularly important in cases of rare or undiagnosed diseases where only a handful of individuals around the world may have a similar constellation of features and symptoms. Finally, scientific publications should be annotated with explicit phenotype terms to facilitate identification of phenotypes and information for clinicians and researchers15. The power of identifying similar phenotypes in patients, in addition to linking to comprehensive biomedical databases that utilize extensive ontologies and trans-species phenotype alignment, offers the ability to coordinate and identify phenotype-gene relationships and also gene-gene, gene-small molecule relationships in a systems biology context in order to advance the understanding of disease mechanisms.
The results from our study showed that existing free-text electronic medical records can be mined with the cTAKES tool and that the output can aid ontology development and maintenance. The precision and recall are comparable to the extraction of phenotype terms by a domain expert (in our case, dental and craniofacial professionals). Reasonable accuracy was achieved in extracting various disease/disorder terms using our method for detecting disease/disorder phenotypes. The data-driven concept screening reduced the number of phenotypes from hundreds to a few dozen, which made manual comparison against the existing ontology feasible. Thus, natural language processing can support consistent, efficient and timely updating of existing ontologies through incorporation of new terms.
Overall, we found 26 craniofacial and dental terms that were not a part of the HPO. Only terms present in UMLS were compared against HPO, so the missing terms are those found in UMLS but not in HPO. In an evidence-based manner, we showed that there are terms that physicians and craniofacial experts regularly use for the dental, oral, and craniofacial region that are absent from the current versions of ontologies. The absence of these terms from current ontologies may limit the use of any ontology. For example, one of the terms missing in HPO is “posterior crossbite.” A posterior crossbite is a dental malocclusion indicating an abnormal buccolingual relationship of the teeth. It is a deviation from ideal occlusion in the transverse plane of space in the posterior segments. This can involve a single posterior tooth or a group of teeth, unilaterally or bilaterally. This anomaly is often a part of the phenotypic description of patients with developmental or congenital conditions, such as cleft palate, narrow palate, and asymmetric growth of the maxilla or the mandible, as well as some pathologic conditions, such as acromegaly, muscular dystrophy, condylar hypoplasia or hyperplasia, and osteochondroma. Posterior crossbites are not self-correcting and require early detection and intervention, as they can worsen over time. Here, we have shown that HPO may be incomplete in the important subdomains of craniofacial, dental and oral phenotypes, which can hinder advances in craniofacial and oral disease identification and treatment. An updated HPO may also help in translational research on craniofacial disorders. For example, a clinician may describe a craniofacial phenotype as “narrow maxilla” with similarities to the phenotype of a knockout mouse. The model organism will have a specific gene knocked out, and the gene in question may be part of a recognized pathway. This relationship may not be found through simple text searches but may become evident via graph-based searches using ontological relationships.
We were able to achieve reasonable accuracy in extracting various disease/disorder terms using the dictionary look-up method. The precision rate was high (91%) with a moderate recall rate (74%). The moderate recall rate may be due to limitations in abbreviation extraction, which produced both false positive and false negative cases. For example, “bacterial occ,” a regularly used abbreviation for occasional amounts of bacteria in a sample (urine or mucus), was extracted as “osteochondritis dissecans.” This clearly points to the necessity of integrating abbreviation recognition and disambiguation components into clinical NLP systems to improve performance in future work. Lexical and spelling variations and misspellings in the clinical notes in our dataset also contributed to the moderate recall. One major limitation of this study was the relatively small dataset related to craniofacial disease, resulting in fewer clinical notes in our domain of interest. As the craniofacial research team and program at NIH is being developed, we will be able to test this method in a larger database. Validation of the algorithm in different patient populations or EHR systems from other institutions will be necessary for reproducibility and further interrogation of the method. Finally, manual screening of the concepts was a time-consuming process; however, we are exploring other methodologies that may automate the screening procedure in the future.
Extracting phenotype terms through NLP is restricted, particularly if the descriptors are contained in phrases using polysemous words, such as “restriction in mouth opening.” Precise boundary identification is needed to enhance concept identification, which will improve the sensitivity and specificity of the task. Another important area where NLP could improve phenotype ontologies is the extraction of temporal relations from EHRs. For example, “alopecia totalis” itself does not impart full information, as age of onset is an integral component of the disease description that further delineates the importance of the finding (early versus late alopecia totalis may align with a rare condition or with natural aging progression, respectively). As such, the treatment plan would be completely different for alopecia totalis at age 10 versus age 60. Therefore, for a more complete phenotype description, the methodology must be able to capture other clinical information such as phenotype severity, age of onset, and progression over time. Integration of this information can advance the understanding of disease mechanisms and accelerate development of targeted therapies.
5. Conclusions and Future Direction
As the use of electronic clinical notes in dentistry increases, extracting relevant information from those records using natural language processing could be very beneficial in dental research. As dentistry moves towards precision medicine, we need ontologies to capture comprehensive relationships between genes and diseases. Our data-driven method using NLP can help keep existing ontologies such as HPO up to date by incorporating new terms in a timely and efficient manner. This research can improve, in an evidence-based manner, HPO coverage of the dental, oral and craniofacial terms that clinicians and researchers use. Using NLP and BTRIS data, we discovered phenotype terms that were not included in HPO, and this identification method can be applied by other researchers in their own domains. An important application would be the use and adoption of these natural language processing tools to automate phenotyping algorithms. Future work includes use of more heterogeneous data and exploration of advanced techniques such as long parsing with deeper contextual information extraction to encourage the development of robust clinical phenotype extraction from EHRs.
6. Acknowledgements
This work was carried out with funding from the NIDCR/NIH Intramural Research Program. We would like to thank the BTRIS staff at NIH who assisted with the data retrieval. We would also like to thank Dr. Leslie Biesecker of NHGRI/NIH for his thorough review and expert comments in preparation of the manuscript. We would also like to thank the lab members of the Craniofacial Anomalies and Regeneration Section (CARS) for their insightful input on the research methodology.
Footnotes
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Conflict of Interest:
The authors have no conflicts to declare.
Disclosure. None of the authors reported any disclosures.
7. References
- 1. Du P, Feng G, Flatow J, et al. From disease ontology to disease-ontology lite: statistical methods to adapt a general-purpose ontology for the test of gene-ontology associations. Bioinformatics. 2009;25:i63–68.
- 2. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32:D267–270.
- 3. Robinson PN, Mundlos S. The human phenotype ontology. Clinical genetics. 2010;77:525–534.
- 4. Smith CL, Eppig JT. The Mammalian Phenotype Ontology as a unifying standard for experimental and high-throughput phenotyping data. Mammalian genome. 2012;23:653–668.
- 5. Guarino N, Oberle D, Staab S. What is an Ontology? Handbook on ontologies: Springer; 2009:1–17.
- 6. Robinson PN, Köhler S, Bauer S, Seelow D, Horn D, Mundlos S. The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. The American Journal of Human Genetics. 2008;83:610–615.
- 7. Kohler S, Doelken SC, Mungall CJ, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 2014;42:D966–974.
- 8. Robinson PN. Deep phenotyping for precision medicine. Hum Mutat. 2012;33:777–780.
- 9. Posey JE, Harel T, Liu P, et al. Resolution of Disease Phenotypes Resulting from Multilocus Genomic Variation. N Engl J Med. 2017;376:21–31.
- 10. Yu S, Liao KP, Shaw SY, et al. Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. J Am Med Inform Assoc. 2015.
- 11. Meystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform. 2008;35:128–144.
- 12. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17:507–513.
- 13. Cimino JJ, Ayres EJ. The clinical research data repository of the US National Institutes of Health. Stud Health Technol Inform. 2010;160:1299–1303.
- 14. Deleger L, Li Q, Lingren T, et al. Building gold standard corpora for medical natural language processing tasks. AMIA 2012.
- 15. Robinson PN, Mungall CJ, Haendel M. Capturing phenotypes for precision medicine. Cold Spring Harb Mol Case Stud. 2015;1:a000372.
- 16. Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association. 2013;20:117–121.
- 17. Zemojtel T, Kohler S, Mackenroth L, et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci Transl Med. 2014;6:252ra123.