Implications of mappings between International Classification of Diseases clinical diagnosis codes and Human Phenotype Ontology terms

Amelia L M Tan; Rafael S Gonçalves; William Yuan; Gabriel A Brat; Robert Gentleman; Isaac S Kohane; The Consortium for Clinical Characterization of COVID-19 by EHR (4CE)

doi:10.1093/jamiaopen/ooae118

. 2024 Nov 18;7(4):ooae118. doi: 10.1093/jamiaopen/ooae118

Implications of mappings between International Classification of Diseases clinical diagnosis codes and Human Phenotype Ontology terms

Amelia L M Tan ^1,^†, Rafael S Gonçalves ^2,^†,^✉, William Yuan ³, Gabriel A Brat ⁴, Robert Gentleman ⁵, Isaac S Kohane ⁶; The Consortium for Clinical Characterization of COVID-19 by EHR (4CE)

PMCID: PMC11570992 PMID: 39559493

Abstract

Objective

Integrating electronic health record (EHR) data with other resources is essential in rare disease research due to low disease prevalence. Such integration is dependent on the alignment of ontologies used for data annotation. The international classification of diseases (ICD) is used to annotate clinical diagnoses, while the human phenotype ontology (HPO) is used to annotate phenotypes. Although these ontologies overlap in the biomedical entities they describe, the extent to which they are interoperable is unknown. We investigate how well aligned these ontologies are and whether such alignments facilitate EHR data integration.

Materials and Methods

We conducted an empirical analysis of the coverage of mappings between ICD and HPO. We interpret this mapping coverage as a proxy for how easily clinical data can be integrated with research ontologies such as HPO. We quantify how exhaustively ICD codes are mapped to HPO by analyzing mappings in the unified medical language system (UMLS) Metathesaurus. We analyze the proportion of ICD codes mapped to HPO within a real-world EHR dataset.

Results and Discussion

Our analysis revealed that only 2.2% of ICD codes have direct mappings to HPO in UMLS. Within our EHR dataset, less than 50% of ICD codes have mappings to HPO terms. ICD codes that are used frequently in EHR data tend to have mappings to HPO; ICD codes that represent rarer medical conditions are seldom mapped.

Conclusion

We find that interoperability between ICD and HPO via UMLS is limited. While other mapping sources could be incorporated, there are no established conventions for what resources should be used to complement UMLS.

Keywords: ontology, data interoperability, ontology interoperability

Background and significance

The analysis of clinical patient data together with experimental data is important in biomedical research and requires principled integration of these data. The ability to integrate real-world data is especially imperative in rare disease research, where disease prevalence is very low and therefore little data are available. The emergence of natural language processing (NLP) tools has made it easier to structure free-text data from the EHR and to map (some) extracted data to the unified medical language system (UMLS), which can then be used to traverse different ontologies.¹ Ontologies such as the human phenotype ontology (HPO)² and WHO’s international classification of diseases (ICD)³ are often used to label patient data. The motivations or use cases for these 2 ontologies are distinct. In the United States, ICD is used in EHRs primarily for reimbursement purposes, while HPO is used for biological characterization in research. These ontologies were developed independently, without coordination to date, and typically have limited interoperability between them. However, with the increasing use and usefulness of electronic health record (EHR) data, which contributes to the “real-world data” characterization of patient populations and diseases, as well as the interest in more detailed characterizations for EHR and cohort databases, the uses of these two ontologies increasingly overlap. This has created an urgent motivation to improve interoperability between ICD and HPO.

While the 2 ontologies do fundamentally cover different scopes, the links between diseases and phenotypes are important, as they allow for the transformation of data from the EHR to the research context where phenotypes are essential in the characterization of diseases in research settings. While there are other ontologies that might be better suited for mapping from diseases in ICD codes to diseases in respective ontologies, like the Monarch Disease Ontology (MONDO), these are not as useful toward disease characterization and clinical diagnostics as some conditions do not have a disease label yet. There is a certain gray area between phenotypic features and diseases. Some diseases can still present themselves in HPO as features of other diseases (eg, diabetes mellitus as a feature of Bardet Biedl syndrome). The extent of overlap is unknown. This exercise quantifies the overlap between these 2 domains of phenotypic features and disease.

To illustrate the lack of interoperability, consider the concept “acute bronchitis,” which is represented in ICD and HPO with different identifiers—J20 in ICD for “acute bronchitis,” and HP:0012388 in HPO. Without a mapping between these 2 symbols to establish that they are conceptually equivalent, and should be interpreted as the same thing, data that rely on only one of the codes cannot be easily cross-analyzed. Although there have been recent developments in NLP tools for ontology mapping purposes, there still lacks a consensus on how to use these tools across research groups to achieve accurate and reusable mappings between ontology terms. For this reason, many researchers rely on the UMLS Metathesaurus, which provides mappings between ontologies derived through expert curation, to attain partial interoperability among the ontologies in UMLS.⁴ The UMLS is widely used for mapping across ontologies with approximately 50% of users using it for that purpose.⁵

A survey we conducted of recent research published in the last 2 years (see Table 1) demonstrates that researchers rely on UMLS as the primary source of mappings between HPO and ICD. Schofield et al developed a method (reported in paper #1 in Table 1) to identify and gather disease-phenotype associations by leveraging the UMLS mappings of diseases represented in ICD to phenotypes from the HPO or mammalian phenotype (MP) ontologies.⁶ In papers #2-#4, the authors used semi-automated mapping tools (specifically MetaMap, CLAMP, and cTakes) to identify HPO terms in free-text descriptions of phenotypes from EHR clinical notes. In the setting up of the open annotation for rare diseases (OARD)—a data resource containing annotations for rare-disease-related phenotypes, described in paper #4—the authors used the NLP tool cTakes to harmonize claim codes, lab procedures, and clinical notes with UMLS concept unique identifiers (CUIs) before traversing to HPO and MONDO ontologies. In paper #5, the authors used existing mappings of ICD to HPO obtained from UMLS and BioPortal, and then a partial-logical mapping strategy that uses the HPO structure in order to obtain better mapping coverage of ICD to HPO terms. In paper #6, the authors applied automated string matching using the bidirectional encoder representations from transformers (BERT) NLP model followed by manual verification of disease name mappings. Paper #7 provides a Phecode-HPO mapping set generated by using mappings in UMLS, the Phecode map of ICD codes to Phecodes, and tool-generated mappings using different approaches. From this short survey, it is evident that recent research heavily depends on UMLS as the main source of mappings, highlighting the need to quantify the biases and implications of these mappings.

Table 1.

Articles published in 2021-2022 that used mappings between ICD and HPO.

#	Title	Mapping reference(s)	Mapping tool(s)	Year
1	Linking common human diseases to their phenotypes; development of a resource for human phenomics⁶	UMLS	None: employed newly introduced methods involving NLP + UMLS	2021
2	Clinical phenotypic spectrum of 4095 individuals with down syndrome from text mining of electronic health records⁷	UMLS	MetaMap	2021
3	Development of a phenotype ontology for autism spectrum disorder by natural language processing on electronic health records⁸	UMLS	CLAMP	2022
4	OARD: open annotations for rare diseases and their phenotypes based on real-world data⁹	UMLS	cTakes	2022
5	Common genetic variation associated with Mendelian disease severity revealed through cryptic phenotype analysis¹⁰	BioPortal + UMLS	None: employed a partial logical mapping strategy	2022
6	Building a knowledge graph to enable precision medicine¹¹	UMLS	BERT	2022
7	Linking rare and common disease vocabularies by mapping between the human phenotype ontology and phecodes¹²	UMLS, PheMap	SORTA, string-matching, and WikiMedMap	2023

Open in a new tab

Abbreviations: BERT, bidirectional encoder representations from transformers; HPO, human phenotype ontology; ICD, international classification of diseases; UMLS, unified medical language system.

In this paper, we quantify the limitations and biases of using UMLS mappings of ICD to HPO and how they may affect research findings. Specifically, we set out to determine the coverage of mappings between ICD10-CM—the finer-grained variant of ICD10 used in the United States—and HPO, as this indicates how much information in the EHR is “covered” when transiting between EHR coding (ICD) and research-oriented ontologies (such as HPO), and to discuss what that coverage implies for the (potential) integration and analysis of EHR data with experimental data. (From here on out, any mention of “ICD” is meant as a shorthand for “ICD10-CM.”) We quantify how exhaustively the popular ICD classification has been mapped to HPO by analyzing mappings derived from the widely used source of mappings between biomedical ontologies—the UMLS Metathesaurus. The purpose of our work is not to specify new mappings, nor to identify erroneous mappings; rather we provide a synchronic analysis of the mappings that are publicly accessible and currently used to convert from ICD to HPO for such goals as rare disease analysis. We also discuss the implications of those decisions for secondary research and analysis in the context of real-world EHR data from Beth Israel Deaconess Medical Center (BIDMC).

Methods and materials

Ontologies analyzed

The HPO ontology logically and systematically describes phenotypic traits in human diseases. Many public disease knowledge databases, such as MedGen¹³ and Orphanet,¹⁴ as well as consortiums like the undiagnosed disease network (UDN) use HPO as the controlled vocabulary to annotate phenotypes for diseases. Various algorithms and tools also leverage HPO to support phenotype-based differential diagnostics, gene-disease discovery, and genomic diagnostics, among others. For example, the Phen2Gene tool uses a database of weighted and ranked gene lists for every HPO term, and then, given a patient-specific list of HPO terms, the tool calculates a prioritized gene list based on a probabilistic model and outputs gene-disease relationships.¹⁵

Real-world patient data from hospitals and health insurance claims data, on the other hand, are often stored using codes from WHO’s ICD, which is primarily designed and used as a billing instrument. There is a vast amount of patient data encoded using these ICD annotations, including signs and symptoms, diseases, external causes of injury or diseases, and abnormal findings.¹⁶ Currently the most widely used version of ICD in healthcare settings throughout the United States is ICD10-CM, which replaces the older ICD9-CM.

Sources of ontology mappings

As we mentioned, UMLS appears to be the most popular and comprehensive source of mappings. However, there are other potential sources of mappings that researchers can consider when integrating data annotated with different ontologies. BioMappings¹⁷ provides a community-contributed table of mappings—some verified by humans, while others simply predicted using software tools. At the time of writing, there were zero mappings between ICD and HPO in the BioMappings repository. Ontologies themselves often include mappings between the terms they define and terms in other ontologies or databases (so-called “database cross-references”). The ICD classification itself does not contain any such mappings; however, HPO does. There are 39 mappings between HPO and ICD terms, with 20 of those only present in HPO, and not in UMLS. These “missing” mappings could be due to UMLS not being up-to-date with the latest release of HPO. We also extracted 632 ICD mappings from the MONDO disease ontology, and which have an HPO mapping. We determined that all 632 mappings are already contained in UMLS.

UMLS as a current mapping source

The UMLS Metathesaurus unifies concepts across ontologies via an assignment of CUIs. A CUI is assigned to each unique collection of terms that are conceptually equivalent. For our study, we used the ICD CUIs and searched for HPO terms with identical CUIs in the May 2022 version of the UMLS tables. More specifically, after downloading the UMLS archive (umls-2022AA-mrconso.zip), we use in our analysis the file MRCONSO.RRF (of 2GB in size) contained therein—this is a table containing all UMLS CUIs and their associated labels, synonyms, and mappings to different ontologies. The table comes without column names; in our analysis, we primarily use column 1 (CUI), column 12 (SAB—abbreviated source ontology), column 13 (CODE—ontology term identifier), and column 15 (STR—string label). For example, the following sample of the UMLS table contains 2 mappings: (1) between ICD10CM code K85 and HPO term HP:0001735 and (2) between K85.9 and HP:0001735, because they share the same CUI. When a CUI has both an ICD code (I) and an HPO term (H) associated with it, we say that I is mapped to H.

CUI | SAB | CODE | STR

C0001339 | ICD10CM | K85 | Acute pancreatitis

C0001339 | ICD10CM | K85.9 | Acute pancreatitis, unspecified

C0001339 | HPO | HP:0001735 | Acute pancreatitis

C0001339 | HPO | HP:0001735 | Acute pancreatic inflammation

In 2014, Bodenreider et al analyzed the coverage of disease phenotypes in standard biomedical ontologies to determine which phenotypes have a mapping in UMLS, and to which specific ontology(ies) in UMLS they map to.¹⁸ A mapping to UMLS was found for 54% of disease phenotypes, with the best coverage by a single ontology being provided by SNOMED CT, which covers 30% of phenotypes in HPO. According to the study, at the time, ICD10-CM provided coverage for only 15% of HPO phenotype terms.¹⁸

Mapping coverage of codes with different levels of usage in BIDMC data

Our BIDMC EHR dataset was extracted in February 2022, and is a subset of the data submitted to the 4CE Consortium.¹⁹ To understand the mapping coverage of codes with different levels of usage, we split the ICD codes into 3 usage groups. Codes that are used across more than 1% of patients are referred to as common codes; Codes that are used in 1%-0.1% of patients are categorized as infrequently used, and codes that are used in less than 0.1% of patients are grouped as rare ICD codes. For each of these groups, we computed the proportion of mapped and unmapped codes as well as the proportion of codes that fell into the “others” category. Codes in the “others” category are those found in the ICD code list in our EHR dataset, but which did not match existing ICD10-CM codes. Throughout our analyses, we stratified the setting of care into 3 groups: intensive care unit (ICU) patients, admitted patients, and outpatients. The rationale for this stratification is that the setting of care might confound the mapping coverage, since codes used in outpatients could be very different from codes used in ICU patients. Furthermore, use cases for different care settings may differ, so it would be informative to characterize mapping coverage across these patient groups. For example, researchers working with ICU data who want to convert their ICD codes to HPO terms might face a different challenge compared to researchers working with outpatients when performing the same task.

Mappability of the most used ICD codes

To survey the mappability of the most used ICD codes, we retrieved the 10 ICD codes with the highest number of patient counts without an associated HPO term and those with a mapped HPO term. For each of these codes, we extracted the number of times the codes were used (counts) and the number of patients that have been assigned these codes (Patient Counts). For those with a matched HPO term, the matching HPO term and label are also presented.

Mappability of ICD code categories by top-level ICD categories

We stratified our next analyses by top-level ICD categories to better understand if certain disease groups are better mapped. To further differentiate the codes, we split them into 2 groups: 1 for less-used codes which are used in less than 100 patients (inclusive), and codes that are more commonly used in more than 100 patients.

We then calculate the PatientCoverage for each of these groups of ICD codes. The purpose of calculating PatientCoverage was to quantify the proportion of patients with ICD codes from each ICD category that have HPO mappings. To calculate this, we take the summation of the total number of patients with mapped ICD codes as a proportion of the total number of patients with codes assigned in the ICD category. For example, consider the codes used in <100 patients from the category “Q: Congenital malformations, deformations and chromosomal abnormalities”—there are 309 codes used a total of 3506 times. Of these usage counts, 1260 involve mapped terms and hence the proportion mapped for this ICD category in terms of usage counts is 1260/3506, resulting in a patient coverage proportion of 0.359.

Patient Coverage = \frac{Σ Patient Counts for matched codes}{Σ Patient Counts for all codes in category}

(1)

Results

Dictionary level mapping

We first collated the dictionary-level mapping of all ICD codes to HPO, which is composed of all mappings between ICD codes and HPO terms in UMLS. The results from this analysis show that 2.2% of ICD codes are mappable to HPO via the UMLS Metathesaurus (Table 2). Although the proportion mappable is low, the distribution of mapped codes within real-world hospital ICD usage would provide a practical estimate of the consequences of the sparse mapping.

Table 2.

Mappings between ICD and HPO in UMLS.

Ontology	# Codes	# Unique HPO codes mapped to ICD	% HPO codes mapped to ICD	# Unique ICD codes mapped to HPO	% ICD mapped to HPO (#unique codes mapped/total ICD codes)
HPO	16 366	1870	11.4%	–	–
ICD10-CM	95 847	–	–	2066	2.2%

Open in a new tab

ICD10-CM is the coding system used at BIDMC.

Abbreviations: BIDMC, Beth Israel Deaconess Medical Center; HPO, human phenotype ontology; ICD, international classification of diseases; UMLS, unified medical language system.

Mapping coverage of codes with different levels of usage

Therefore, we investigated the mappability of codes across different levels of usage with the categories for commonly used ICD codes (>1%), codes that are infrequently used (0.1%-1%), and codes that are rarely used (<0.1%). For the commonly used codes (>1%), the ICU patients have the highest proportion of mapped codes (33.3%) as compared to the admitted patients and outpatients which have 24.4% and 25.8% of the ICD codes matched to a corresponding matched HPO term, respectively (Figure 1). ICU patients also show a higher proportion mapped for the infrequently used ICD codes as compared to the admitted and outpatient cohorts possibly due to better-defined codes that are used in the ICU (Figure 1). Surprisingly, the outpatient group also does marginally better than the admitted patients for the commonly used codes. Overall, although less than half of the codes used are mappable to HPO via UMLS, it is reassuring that codes that are used more frequently tend to have a higher proportion mapped to HPO.

Graphs depicting the proportion of ICD/diagnosis codes that are matched to an HPO term or, unmatched to any HPO terms. — The proportion of ICD/diagnosis codes that are matched to an HPO term, unmatched to any HPO terms or others (do not have a corresponding ICD code in the UMLS dictionary). The proportions were calculated for common codes that are used in >1% of the cohort, infrequently used codes that are attributed to 0.1%-1% of the cohort, and rare codes that are assigned to <0.1% of the patient cohort. These were calculated separately for the (A) admitted patients, (B) ICU patients, and (C) outpatients. HPO, human phenotype ontology; ICD, international classification of diseases; UMLS, unified medical language system.

Mappability of the most used ICD codes

To survey the mappability of the most used ICD codes, we retrieved the 10 ICD codes with the highest number of patient counts without (Table 3) and with an associated HPO term (Table 4). For example, Table 3 presents some ICD codes that are used in roughly 2000 patients or more, and which do not have a corresponding HPO term, hence representing the greatest loss of information when converting between ICD and HPO terms. Of the codes that are not mapped, there were many (more than 7 out of 10 codes) that were from the ICD category “Z” that covers codes that are “factors influencing health status and contact with health services.” These codes mainly cover encounters with the health system and hence the absence of mappings is expected since the scope of HPO is to describe phenotypes. While we could choose not to interpret these as missing mappings, as we expect them to be missing, we point out that there are some Z codes in ICD that are mapped to HPO—for example, Z67.1 (Type A blood) is mapped to HP:0032370 (Blood group A). So to emulate the process of converting EHR data, we did not remove Z codes from our analysis since we want to characterize the overall mappability of ICD to HPO. Also, this category of codes only accounted for 5.7% of all codes used in admitted patients and does not significantly affect overall mappability.

Table 3.

The 10 most frequently used codes in admitted patients, ICU patients, and outpatients (not admitted) that do not have a matched HPO term.

Cohort	Counts	Patient counts	CUI	ICD ID	Label
Admitted	15 543	11 671	C5539297	Z20.822	Contact with and (suspected) exposure to COVID-19
	12 544	7184	C0085580	Z87.891	Personal history of nicotine dependence
	11 625	6659	C2911355	Z23	Encounter for immunization
	6283	6219	C0341102	Z37.0	Single live birth
	7006	5750	C2910668	Z20.828	Contact with and (suspected) exposure to other viral communicable diseases
	10 348	5620	C1313895	F41.9	Anxiety NOS
	6721	5155	C0022660	Z11.59	Encounter for screening for other viral diseases
	8740	5042	C2910658	Z00.00	Encounter for adult health check-up NOS
	8073	4467	C0003469	F32.9	Major depressive disorder, single episode, and unspecified
	11 461	4405	C2910579	Z79.01	Long-term (current) use of anticoagulants
ICU	5305	3506	C0085580	Z20.822	Contact with and (suspected) exposure to COVID-19
	6029	2946	C5539297	Z87.891	Personal history of nicotine dependence
	3441	2845	C0022660	D62	Acute posthemorrhagic anemia
	3052	2514	C2911355	J96.01	Acute respiratory failure with hypoxia
	2831	2149	C0154298	Z20.828	Contact with and (suspected) exposure to other viral communicable diseases
	5752	2078	C0341102	Z79.01	Long-term (current) use of anticoagulants
	2512	2010	C2977065	Z66	DNR status
	3910	1960	C2882161	F41.9	Anxiety NOS
	2733	1897	C2910658	Z11.59	Encounter for screening for other viral diseases
	3769	1828	C2911178	Z00.00	Encounter for adult health check-up NOS
Outpatients	1 16 491	78 109	C2910579	Z11.59	Encounter for screening for other viral diseases
	57 453	37 259	C5203670	U07.1	COVID-19
	52 548	33 996	C2910447	Z00.00	Encounter for adult health check-up NOS
	39 471	30 929	C2910668	Z20.822	Contact with and (suspected) exposure to COVID-19
	57 339	29 087	C0085580	Z23	Encounter for immunization
	26 199	20 805	C5539297	Z20.828	Contact with and (suspected) exposure to other viral communicable diseases
	24 703	18 863	C5539296	Z11.52	Encounter for screening for COVID-19
	28 360	13 064	C2910658	F41.9	Anxiety NOS
	16 779	12 076	C0341102	Z00.01	Encounter for general adult medical examination with abnormal findings
	18 014	11 573	C2910448	Z87.891	Personal history of nicotine dependence

Open in a new tab

Abbreviations: CUI, concept unique identifier; HPO, human phenotype ontology; ICD, international classification of diseases.

Table 4.

The 10 most frequently used codes in admitted patients, ICU patients, and outpatients (not admitted) that have a matched HPO term.

Cohort	Counts	Patient counts	ICD ID	ICD label	HPO	HPO label
Admitted	36 074	11 273	I10	High blood pressure	HP:0000822	High blood pressure
	23 666	10 802	E78.5	Hyperlipidemia, unspecified	HP:0003077	Hyperlipidemia
	13 486	6791	K21.9	Esophageal reflux NOS	HP:0002020	Gastroesophageal reflux disease
	10 538	6010	N17.9	Acute kidney failure, unspecified	HP:0001919	Acute kidney failure
	14 376	4954	I25.10	Atherosclerotic heart disease NOS	HP:0001677	Coronary atherosclerosis
	10 429	4834	D64.9	Anemia, unspecified	HP:0001903	Anemia
	7986	3908	E66.9	Obesity NOS	HP:0001513	Obesity
	5464	3530	E87.1	Sodium [Na] deficiency	HP:0002902	Hyponatremia
	4358	3386	E87.2	Acidosis	HP:0001941	Acidosis
	9044	3375	E03.9	Hypothyroidism, unspecified	HP:0000821	Hypothyroidism
ICU	10 776	4494	E78.5	Hyperlipidemia, unspecified	HP:0003077	Hyperlipidemia
	14 953	4431	I10	High blood pressure	HP:0000822	High blood pressure
	5596	3045	N17.9	Acute kidney failure, unspecified	HP:0001919	Acute kidney failure
	6049	2655	K21.9	Esophageal reflux NOS	HP:0002020	Gastroesophageal reflux disease
	7573	2422	I25.10	Atherosclerotic heart disease NOS	HP:0001677	Coronary atherosclerosis
	2938	2313	E87.2	Acidosis	HP:0001941	Acidosis
	4597	2032	D64.9	Anemia, unspecified	HP:0001903	Anemia
	3238	1997	E87.1	Sodium [Na] deficiency	HP:0002902	Hyponatremia
	3187	1398	E66.9	Obesity NOS	HP:0001513	Obesity
	1711	1385	I95.9	Hypotension, unspecified	HP:0002615	Hypotension
Outpatients	85 112	23 465	I10	High blood pressure	HP:0000822	High blood pressure
	44 506	18 526	E78.5	Hyperlipidemia, unspecified	HP:0003077	Hyperlipidemia
	30 216	14 182	K21.9	Esophageal reflux NOS	HP:0002020	Gastroesophageal reflux disease
	15 559	9204	R05	Cough	HP:0012735	Coughing
	19 115	8927	E66.9	Obesity NOS	HP:0001513	Obesity
	13 624	7858	R07.9	Chest pain, unspecified	HP:0100749	Chest pain
	13 525	7396	R53.83	Fatigue NOS	HP:0012378	Fatigue
	18 379	6982	G47.33	Obstructive sleep apnea (adult) (pediatric)	HP:0002870	Obstructive sleep apnea
	9565	6947	R51.9	Headache, unspecified	HP:0002315	Headache
	13 015	6575	R06.02	Shortness of breath	HP:0002094	Dyspnea

Open in a new tab

Abbreviations: HPO, human phenotype ontology; ICD, international classification of diseases.

The other codes that are poorly mapped range from ones like “anxiety NOS” (where NOS stands for “not otherwise specified”) and “major depressive disorder, single episode, unspecified,” “acute posthemorrhagic anemia,” to “acute respiratory failure with hypoxia” which are common diseases with multiple possible causes. Of these unmapped codes, for example, “acute respiratory failure with hypoxia,” a manual lookup on the HPO website using keywords “respiratory failure” yielded a close match “Respiratory failure HP:0002878” which could potentially be mapped to the ICD code and hence better capturing the biology of the 2514 patients associated with the ICD code “J96.01 acute respiratory failure with hypoxia.” This suggests that such trivial manual improvements to mapping bring about a very favorable effort to pay-off ratio in bringing about a much-improved patient representation. If capturing the overall population diagnosis is the goal, then these codes represent the lowest hanging fruits for the highest improvement in patient diagnosis representation. It also points the way to effective use of machine learning techniques for such mappings.

The most frequently used codes which are mappable to HPO terms via UMLS represent the common diagnoses that are most well represented. When studying common diagnoses, one can reasonably assume that the mapping between ICD and HPO would sufficiently capture the prevalence of those diagnoses in the EHR data. However, it should be cautioned that studies looking into diagnoses across diseases would have an overrepresentation of diagnoses in Table 4 since they are better mapped than those in Table 3.

Mappability of ICD code categories by top-level ICD categories

As the top-level ICD categories represent different major disease groups (eg, category I is for diseases of the circulatory system, category J is for diseases of the respiratory system), the next analysis is stratified by these categories to better understand any biases in the mapping of these disease groups (Figure 2). The outcome from this analysis shows that while codes that are used in less than 100 patients generally have a lower proportion mapped of less than 40% across all ICD categories, codes that are used in >100 patients have a much higher representation in the proportion of patients with a mapped code although the patient coverage does vary substantially depending on the ICD category (Figure 2). We conducted the same analysis of outpatient and ICU patient data, and the results are similar and in line with those of admitted patients that we just discussed (results not shown).

Graphs depicting patient coverage of mapped and unmapped terms across different ICD categories. — Patient coverage of mapped and unmapped terms across different ICD categories. The left column specifies codes that are used in less than 100 patients, while the right column specifies codes that are used in more than 100 patients in our dataset. The figure depicts in yellow the frequency of usage of ICD codes that have an HPO mapping, in green the usage of ICD codes that do not have an HPO mapping, and in purple the codes in the “others” category which are in the ICD code list of our EHR dataset, but which did not match existing ICD10-CM codes. The numbers on the right of each column specify the number of times that codes from that category are used. HPO, human phenotype ontology; ICD, international classification of diseases.

Discussion

To leverage the vast amount of real-world patient data with the tools developed to work with ontologies, an accurate mapping between the different systems used for annotating these datasets is necessary. In this paper, we have analyzed how well aligned the ICD is with the HPO. The rationale for moving from ICD codes to HPO terms is that, for example, genomics-based diagnostic pipelines rely on standardized phenotype and disease ontology terms as the input, such as terms from HPO. Only with a mapping from ICD-coded patient data to HPO can researchers accurately integrate and cross-analyze data originating from EHRs together with public research data beyond their study. For example, tools like PheRS, which measures the similarity between an individual’s diagnosis codes and phenotypic features of known genetic disorders, also require mappings that link ICD codes to HPO terms, which most EHRs do not contain.²⁰ In such scenarios, unless the researcher is an expert in ontology, they will most likely turn to resources like the UMLS tables for these mappings; hence, determining the coverage of ICD10-CM to HPO mappings with the UMLS table is imperative.

We set out to analyze the extent to which ICD codes—either specified in ICD10-CM (ie, our “dictionary level” analysis) or used in our EHR data—were mapped to the widely used phenotype ontology HPO. In our analysis of the mapping coverage, we found that the proportion of terms mapped between ICD and HPO are low both at the dictionary level and in real-world EHR data. That said, the proportion of mapped terms in our EHR data is higher than that found at the dictionary level. However, to enable the integration and cross-analysis of clinical and experimental data at scale, we argue that there is an urgent need to improve mappability across these ontologies, and to make such mappings available to the research community using such vehicles as the UMLS or other appropriate mechanisms. We also found that certain terms have much higher leverage in improving patient coverage and should be looked into first. For example, such ICD codes as those related to COVID-19 infection or other viral diseases, nicotine dependence, and anxiety are each used to annotate over 10 000 patients in our dataset. It should also be cautioned that the proportion of ICD codes mapped to HPO is especially low for rarer conditions (eg, ICD codes from the neoplasm category are considerably less mapped than codes from the nervous system category) and could implicate biases if the data are used as is after conversion.

Our study indicates that the coverage of ICD to HPO mappings has not increased over the years; in fact, the opposite—coverage decreased. This likely occurred because ICD and HPO grew over time (ie ontology terms were added) while the mapping set between the 2 artifacts has seemingly not kept up with their evolution. In the last analysis of mapping coverage reported in the literature in 2014, 15% of HPO codes were mapped to ICD10-CM, whereas currently our analysis indicates that the percentage of HPO terms mapped to ICD10-CM decreased to only 11% in the 2022 version of the UMLS Metathesaurus. The UMLS website provides a similar statistic that roughly confirms our empirical finding, albeit UMLS’s statistics are likely more up-to-date—according to the UMLS website, in November 2023 a total of 10.2% of HPO terms are mapped to ICD10-CM.²¹

It should be noted that there is a mismatch in granularity between the 2 artifacts under analysis, with ICD10-CM being a more coarse-grained representation than HPO. So understandably the proportion of exact mappings between terms is rather low. In certain circumstances, one could consider non-exact mappings between terms in these ontologies—for example, broad and narrow mappings between terms that take into account the hierarchical structures of ICD and HPO—which can still be useful for large data analytics or other data integration purposes. However, to our knowledge, there are no off-the-shelf tools that can identify such broad or narrow mappings. Furthermore, currently, there are no benchmarks or means by which we could ascertain the validity of such mappings, given the lack of publicly available broad/narrow mapping sets between ontologies that researchers could use. In future work, we intend to broaden our analysis to consider other mapping resources (eg, broad/narrow mappings, concept sets) and to determine the extent to which differences in granularity between ICD and HPO can be bridged using such resources.

The scientific value of having a go-to resource for expert-verified mappings (such as UMLS) must not be underestimated. Such a resource allows for turnkey conversion between metadata that describe scientific data using different ontologies or coding mechanisms. It is essential to have appropriate tooling to allow researchers to easily leverage these mappings for their purposes, without having to go through literature, broken links, tables with various formats, and so on. UMLS is a good starting source for mappings, but clearly, there is a need for more comprehensive mapping between ontologies that can complement the current ones. When ontology mappings for entities of interest are not available in a trusted source like UMLS, researchers tend to rely on NLP tools—some of which are based on modern, neural network-based language models—to generate mappings between their codes or ontology terms of interest. Language models are statistical methods that learn the probability of word occurrences based on examples of text in order to make predictions involving those words, and they are used to power NLP applications. For example, the popular language model BERT generates numeric (vector) representations of strings, which can then be compared in vector space using measures such as cosine distance. A comparison between 2 such entities will yield a distance score that researchers can use to determine how similar the 2 entities are in their meaning. In a similar vein, large language models (LLMs) can be helpful in generating tentative mappings between ontologies. While the effectiveness of such approaches has yet to be evaluated, their outputs will certainly require review by qualified human curators and should not be used out-of-the-box. The reproducibility of mappings produced using generative artificial intelligence (AI) models also needs to be considered. While we observed in our survey that such automated approaches are often used in practice, the mappings that researchers generate using these tools and use in their studies are rarely made publicly available—and so other scientists cannot reuse them, or even begin to try to replicate the original experiments that were done using those mappings. Even when researchers share their mappings, they are rarely (if ever) contributed to some centralized repository or registry of mappings that the wider community can leverage, such as UMLS.

Conclusion

The low number of direct mappings between ICD and HPO implies these are unlikely to be sufficient for fine-grained phenotyping using EHR data. Our results also show that ICD terms with lower usage are less well mapped; hence, the UMLS mappings are less likely to cover any existing ICD codes that might be a component of a rare disease. At best, the current state of mapping resources allows researchers to analyze relatively small cohorts whose diagnoses fall under the popular ICD categories that have been mapped.

Contributor Information

Amelia L M Tan, Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States.

Rafael S Gonçalves, Center for Computational Biomedicine, Harvard Medical School, Boston, MA 02115, United States.

William Yuan, Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States.

Gabriel A Brat, Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States.

Robert Gentleman, Center for Computational Biomedicine, Harvard Medical School, Boston, MA 02115, United States.

Isaac S Kohane, Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States.

The Consortium for Clinical Characterization of COVID-19 by EHR (4CE):

Aaron J Masino, Adeline Makoudjou, Adem Albayrak, Alba Gutiérrez-Sacristán, Alberto Zambelli, Alberto Malovini, Aldo Carmona, Alexander Hoffmann, Alexandre Gramfort, Alon Geva, Alvar Blanco-Martínez, Amelia L M Tan, Ana I Terriza-Torres, Anastasia Spiridou, Andrea Prunotto, Andrew M South, Andrew K Vallejos, Andrew Atz, Anita Burgun, Anna Alloni, Anna Maria Cattelan, Anne Sophie Jannot, Antoine Neuraz, Antonio Bellasi, Anupama Maram, Arianna Dagliati, Arnaud Sandrin, Arnaud Serret-Larmande, Arthur Mensch, Ashley C Pfaff, Ashley Batugo, Ashok K Krishnamurthy, Atif Adam, Audrey Dionne, Batsal Devkota, Bertrand Moal, Bing He, Brendin R Beaulieu-Jones, Brett K Beaulieu-Jones, Brian D Ostasiewski, Bruce J Aronow, Bryce W Q Tan, Byorn W L Tan, Carlo Torti, Carlos Sáez, Carlos Tadeu Breda Neto, Charles Sonday, Charlotte Caucheteux, Chengsheng Mao, Chiara Zucco, Christel Daniel, Christian Haverkamp, Chuan Hong, Clara-Lea Bonzel, Cinta Moraleda, Damien Leprovost, Daniel A Key, Daniela Zöller, Danielle Pillion, Danielle L Mowery, Danilo F Amendola, Darren W Henderson, David A Hanauer, Deanne M Taylor, Demian Wassermann, Derek Y Hazard, Detlef Kraska, Diego R Mazzotti, Domenick Silvio, Douglas S Bell, Douglas A Murad, Elisa Salamanca, Emily Bucholz, Emily J Getzen, Emily R Pfaff, Emily R Schriver, Emma M S Toh, Enea Parimbelli, Enrico M Trecarichi, Fatima Ashraf, Fernando J Sanz Vidorreta, Florence T Bourgeois, Francesca Sperotto, François Angoulvant, Gabriel A Brat, Gael Varoquaux, Gilbert S Omenn, Giuseppe Agapito, Giuseppe Albi, Griffin M Weber, Guillaume Verdy, Guillaume Lemaitre, Gustavo Roig-Domínguez, Hans U Prokosch, Harrison G Zhang, Hossein Estiri, Ian D Krantz, Isaac S Kohane, Jacqueline P Honerlaw, Jaime Cruz-Rojo, James B Norman, James Balshi, James J Cimino, James R Aaron, Janaina C C Santos, Jane W Newburger, Janet J Zahner, Jason H Moore, Jayson S Marwaha, Jean B Craig, Jeffrey G Klann, Jeffrey S Morris, Jihad Obeid, Jill-Jênn Vie, Jin Chen, Jiyeon Son, Joany M Zachariasse, John Booth, John H Holmes, José Luis Bernal-Sobrino, Juan Luis Cruz-Bermúdez, Judith Leblanc, Juergen Schuettler, Julien Dubiel, Julien Champ, Karen L Olson, Karyn L Moshal, Kate F Kernan, Katie Kirchoff, Kavishwar B Wagholikar, Kee Yuan Ngiam, Kelly Cho, Kenneth D Mandl, Kenneth M Huling, Krista Y Chen, Kristine E Lynch, L Nelson Sanchez-Pinto, Lana X Garmire, Larry Han, Lav P Patel, Lemuel R Waitman, Leslie Lenert, Li L L J Anthony, Loic Esteve, Lorenzo Chiudinelli, Luca Chiovato, Luigia Scudeller, Malarkodi Jebathilagam Samayamuthu, Marcelo R Martins, Marcos F Minicucci, Maria Clara Saad Menezes, Margaret E Vella, Maria Mazzitelli, Maria Savino, Marianna Milano, Marina P Okoshi, Mario Cannataro, Mario Alessiani, Mark S Keller, Martin Hilka, Martin Wolkewitz, Martin Boeker, Maryna Raskin, Mauro Bucalo, Meghan R Hutch, Mélodie Bernaux, Michele Beraghi, Michele Morris, Michele Vitacca, Miguel Pedrera-Jiménez, Mohamad Daniar, Mohsin A Shah, Molei Liu, Monika Maripuri, Mundeep K Kainth, Nadir Yehya, Nandhini Santhanam, Nathan P Palmer, Ne Hooi Will Loh, Neil J Sebire, Nekane Romero-Garcia, Nicholas W Brown, Nicolas Paris, Nicolas Griffon, Nils Gehlenborg, Nina Orlova, Noelia García-Barrio, Olivier Grisel, Pablo Rojo, Pablo Serrano-Balazote, Paolo Sacchi, Patric Tippmann, Patricia Martel, Patricia Serre, Paul Avillach, Paula S Azevedo, Paula Rubio-Mayo, Petra Schubert, Pietro H Guzzi, Piotr Sliz, Priyam Das, Qi Long, Rachel B Ramoni, Rachel S J Goh, Rafael Badenes, Raffaele Bruno, Ramakanth Kavuluru, Riccardo Bellazzi, Richard W Issitt, Robert W Follett, Robert L Bradford, Robson A Prudente, Romain Bey, Romain Griffier, Rui Duan, Sadiqa Mahmood, Sajad Mousavi, Sara Lozano-Zahonero, Sara Pizzimenti, Sarah E Maidlow, Scott Wong, Scott L DuVall, Sébastien Cossin, Sehi L'Yi, Shawn N Murphy, Shirley Fan, Shyam Visweswaran, Siegbert Rieg, Silvano Bosari, Simran Makwana, Stéphane Bréant, Surbhi Bhatnagar, Suzana E Tanni, Sylvie Cormont, Taha Mohseni Ahooyi, Tanu Priya, Thomas P Naughton, Thomas Ganslandt, Tiago K Colicchio, Tianxi Cai, Tobias Gradinger, Tomás González González, Valentina Zuccaro, Valentina Tibollo, Vianney Jouhet, Víctor Quirós-González, Vidul Ayakulangara Panickan, Vincent Benoit, Wanjiku F M Njoroge, William A Bryant, William Yuan, Xin Xiong, Xuan Wang, Ye Ye, Yuan Luo, Yuk-Lam Ho, Zachary H Strasser, Zahra Shakeri Hossein Abad, Zongqi Xia, Kernan F Kate, Alejandro Hernández-Arango, and Eli L Schwamm

Author contributions

Conceptualization: Amelia L.M. Tan, Rafael S. Gonçalves, and Isaac S. Kohane. Data curation: William Yuan and Gabriel A. Brat. Methodology & Investigation: Amelia L.M. Tan, Rafael S. Gonçalves, and Isaac S. Kohane. Writing – original draft: Amelia L.M. Tan and Rafael S. Gonçalves. Writing – review & editing: Amelia L.M. Tan, Rafael S. Gonçalves, William Yuan, Gabriel A. Brat, Robert Gentleman, and Isaac S. Kohane.

Funding

The authors have no sources of funding to declare.

Conflicts of interest

R.G. owns shares, stocks, or consults for: 23andMe, Scorpion Tx, BioMap, Myia Labs, Pheno.AI, and intrECate.

Data availability

To replicate this analysis it is necessary to have (1) the UMLS Metathesaurus (version 2022AA) and (2) the BIDMC patient dataset:

UMLS is freely available upon registering for a UMLS Terminology Services (UTS) account (https://uts.nlm.nih.gov/uts/signup-login). The UMLS Metathesaurus version 2022AA that we used in our study can be obtained from: https://tinyurl.com/m7976xcz.
The BIDMC dataset contains confidential patient electronic health data and cannot be shared outside of the institution. Interested parties may contact authors for potential collaboration or contact BIDMC’s Institutional Review Board (IRB) to request access to the data through: https://www.bidmc.org/research/research-and-academic-affairs/clinical-research-at-bidmc/committee-on-clinical-investigation-irb.

Code availability

The code to replicate the analysis (tables, figures, etc) is available on GitHub at: https://github.com/hms-dbmi/ICDtoHPOmappings.

Ethical considerations

IRB approval was obtained at the BIDMC (#2020P000565). Participant informed consent was waived by each IRB because the study involved only retrospective data and no individually identifiable data were shared outside of each site’s local study team.

References

1. Garcelon N, Burgun A, Salomon R, et al. Electronic health records for the diagnosis of rare diseases. Kidney Int. 2020;97:676-686. [DOI] [PubMed] [Google Scholar]
2. Köhler S, Gargano M, Matentzoglu N, et al. The human phenotype ontology in 2021. Nucleic Acids Res. 2021;49:D1207-D1217. [DOI] [PMC free article] [PubMed] [Google Scholar]
3. Organisation mondiale de la santé, World Health Organization, WHO. The ICD-10 Classification of Mental and Behavioural Disorders: Diagnostic Criteria for Research. World Health Organization; 1993. [Google Scholar]
4. Lindberg DAB, Humphreys BL, McCray AT.. The unified medical language system. Yearb Med Inform. 1993;02:41-51. [DOI] [PMC free article] [PubMed] [Google Scholar]
5. Amos L, Anderson D, Brody S, et al. UMLS users and uses: a current overview. J Am Med Inform Assoc. 2020;27:1606-1611. 10.1093/jamia/ocaa084 [DOI] [PMC free article] [PubMed] [Google Scholar]
6. Kafkas Ş, Althubaiti S, Gkoutos GV, et al. Linking common human diseases to their phenotypes; development of a resource for human phenomics. J Biomed Semant. 2021;12:1-15. [DOI] [PMC free article] [PubMed] [Google Scholar]
7. Havrilla JM, Zhao M, Liu C, et al. Clinical phenotypic spectrum of 4095 individuals with down syndrome from text mining of electronic health records. Genes (Basel). 2021;12:1159. [DOI] [PMC free article] [PubMed] [Google Scholar]
8. Zhao M, Havrilla J, Peng J, et al. Development of a phenotype ontology for autism spectrum disorder by natural language processing on electronic health records. J Neurodev Disord. 2022;14:32. [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Liu C, Ta CN, Havrilla JM, et al. OARD: open annotations for rare diseases and their phenotypes based on real-world data. Am J Hum Genet. 2022;109:1591-1604. [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Blair DR, Hoffmann TJ, Shieh JT.. Common genetic variation associated with Mendelian disease severity revealed through cryptic phenotype analysis. Nat Commun. 2022;13:3675. [DOI] [PMC free article] [PubMed] [Google Scholar]
11. Chandak P, Huang K, Zitnik M. Building a knowledge graph to enable precision medicine. bioRxiv, 10.1101/2022.05.01.489928, 2022, preprint: not peer reviewed. [DOI] [PMC free article] [PubMed]
12. McArthur E, Bastarache L, Capra JA.. Linking rare and common disease vocabularies by mapping between the human phenotype ontology and phecodes. JAMIA Open. 2023;6:ooad007. [DOI] [PMC free article] [PubMed] [Google Scholar]
13. Louden DN. MedGen: NCBI’s portal to information on medical conditions with a genetic component. Med Ref Serv Q. 2020;39:183-191. 10.1080/02763869.2020.1726152 [DOI] [PubMed] [Google Scholar]
14. Weinreich SS, Mangon R, Sikkens JJ, et al. [Orphanet: a European database for rare diseases]. Ned Tijdschr Geneeskd. 2008;152:518-519. [PubMed] [Google Scholar]
15. Zhao M, Havrilla JM, Fang L, et al. Phen2Gene: rapid phenotype-driven gene prioritization for rare diseases. NAR Genom Bioinform. 2020;2:lqaa032. [DOI] [PMC free article] [PubMed] [Google Scholar]
16. Organisation mondiale de la santé, World Health Organization, WHO, et al. The ICD-10 Classification of Mental and Behavioural Disorders: Clinical Descriptions and Diagnostic Guidelines. World Health Organization; 1992. [Google Scholar]
17. Hoyt CT, Hoyt AL, Gyori BM.. Prediction and curation of missing biomedical identifier mappings with biomappings. Bioinformatics. 2023;39:btad130. 10.1093/bioinformatics/btad130 [DOI] [PMC free article] [PubMed] [Google Scholar]
18. LHNCBC Abstract. Coverage of phenotypes in standard terminologies. - LHNCBC Abstract. Accessed November 7, 2022. https://lhncbc.nlm.nih.gov/LHC-publications/pubs/Coverageofphenotypesinstandardterminologies.html
19. Weber GM, Hong C, Xia Z, et al. ; The Consortium for Clinical Characterization of COVID-19 by EHR (4CE). International comparisons of laboratory values from the 4CE collaborative to predict COVID-19 mortality. NPJ Digit Med. 2022;5:74. [DOI] [PMC free article] [PubMed] [Google Scholar]
20. Callahan TJ, Stefanski AL, Wyrwa JM, et al. Ontologizing health systems data at scale: making translational discovery a reality. arXiv, http://arxiv.org/abs/2209.04732, 2022, preprint: not peer reviewed. Date accessed June 7, 2023. [DOI] [PMC free article] [PubMed]
21. U.S. National Library of Medicine. UMLS Metathesaurus—HPO (human phenotype ontology) - statistics. Accessed June 6, 2023. https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/HPO/stats.html

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

To replicate this analysis it is necessary to have (1) the UMLS Metathesaurus (version 2022AA) and (2) the BIDMC patient dataset:

UMLS is freely available upon registering for a UMLS Terminology Services (UTS) account (https://uts.nlm.nih.gov/uts/signup-login). The UMLS Metathesaurus version 2022AA that we used in our study can be obtained from: https://tinyurl.com/m7976xcz.
The BIDMC dataset contains confidential patient electronic health data and cannot be shared outside of the institution. Interested parties may contact authors for potential collaboration or contact BIDMC’s Institutional Review Board (IRB) to request access to the data through: https://www.bidmc.org/research/research-and-academic-affairs/clinical-research-at-bidmc/committee-on-clinical-investigation-irb.

[ooae118-B1] 1. Garcelon N, Burgun A, Salomon R, et al. Electronic health records for the diagnosis of rare diseases. Kidney Int. 2020;97:676-686. [DOI] [PubMed] [Google Scholar]

[ooae118-B2] 2. Köhler S, Gargano M, Matentzoglu N, et al. The human phenotype ontology in 2021. Nucleic Acids Res. 2021;49:D1207-D1217. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ooae118-B3] 3. Organisation mondiale de la santé, World Health Organization, WHO. The ICD-10 Classification of Mental and Behavioural Disorders: Diagnostic Criteria for Research. World Health Organization; 1993. [Google Scholar]

[ooae118-B4] 4. Lindberg DAB, Humphreys BL, McCray AT.. The unified medical language system. Yearb Med Inform. 1993;02:41-51. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ooae118-B5] 5. Amos L, Anderson D, Brody S, et al. UMLS users and uses: a current overview. J Am Med Inform Assoc. 2020;27:1606-1611. 10.1093/jamia/ocaa084 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ooae118-B6] 6. Kafkas Ş, Althubaiti S, Gkoutos GV, et al. Linking common human diseases to their phenotypes; development of a resource for human phenomics. J Biomed Semant. 2021;12:1-15. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ooae118-B7] 7. Havrilla JM, Zhao M, Liu C, et al. Clinical phenotypic spectrum of 4095 individuals with down syndrome from text mining of electronic health records. Genes (Basel). 2021;12:1159. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ooae118-B8] 8. Zhao M, Havrilla J, Peng J, et al. Development of a phenotype ontology for autism spectrum disorder by natural language processing on electronic health records. J Neurodev Disord. 2022;14:32. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ooae118-B9] 9. Liu C, Ta CN, Havrilla JM, et al. OARD: open annotations for rare diseases and their phenotypes based on real-world data. Am J Hum Genet. 2022;109:1591-1604. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ooae118-B10] 10. Blair DR, Hoffmann TJ, Shieh JT.. Common genetic variation associated with Mendelian disease severity revealed through cryptic phenotype analysis. Nat Commun. 2022;13:3675. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ooae118-B11] 11. Chandak P, Huang K, Zitnik M. Building a knowledge graph to enable precision medicine. bioRxiv, 10.1101/2022.05.01.489928, 2022, preprint: not peer reviewed. [DOI] [PMC free article] [PubMed]

[ooae118-B12] 12. McArthur E, Bastarache L, Capra JA.. Linking rare and common disease vocabularies by mapping between the human phenotype ontology and phecodes. JAMIA Open. 2023;6:ooad007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ooae118-B13] 13. Louden DN. MedGen: NCBI’s portal to information on medical conditions with a genetic component. Med Ref Serv Q. 2020;39:183-191. 10.1080/02763869.2020.1726152 [DOI] [PubMed] [Google Scholar]

[ooae118-B14] 14. Weinreich SS, Mangon R, Sikkens JJ, et al. [Orphanet: a European database for rare diseases]. Ned Tijdschr Geneeskd. 2008;152:518-519. [PubMed] [Google Scholar]

[ooae118-B15] 15. Zhao M, Havrilla JM, Fang L, et al. Phen2Gene: rapid phenotype-driven gene prioritization for rare diseases. NAR Genom Bioinform. 2020;2:lqaa032. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ooae118-B16] 16. Organisation mondiale de la santé, World Health Organization, WHO, et al. The ICD-10 Classification of Mental and Behavioural Disorders: Clinical Descriptions and Diagnostic Guidelines. World Health Organization; 1992. [Google Scholar]

[ooae118-B17] 17. Hoyt CT, Hoyt AL, Gyori BM.. Prediction and curation of missing biomedical identifier mappings with biomappings. Bioinformatics. 2023;39:btad130. 10.1093/bioinformatics/btad130 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ooae118-B18] 18. LHNCBC Abstract. Coverage of phenotypes in standard terminologies. - LHNCBC Abstract. Accessed November 7, 2022. https://lhncbc.nlm.nih.gov/LHC-publications/pubs/Coverageofphenotypesinstandardterminologies.html

[ooae118-B19] 19. Weber GM, Hong C, Xia Z, et al. ; The Consortium for Clinical Characterization of COVID-19 by EHR (4CE). International comparisons of laboratory values from the 4CE collaborative to predict COVID-19 mortality. NPJ Digit Med. 2022;5:74. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ooae118-B20] 20. Callahan TJ, Stefanski AL, Wyrwa JM, et al. Ontologizing health systems data at scale: making translational discovery a reality. arXiv, http://arxiv.org/abs/2209.04732, 2022, preprint: not peer reviewed. Date accessed June 7, 2023. [DOI] [PMC free article] [PubMed]

[ooae118-B21] 21. U.S. National Library of Medicine. UMLS Metathesaurus—HPO (human phenotype ontology) - statistics. Accessed June 6, 2023. https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/HPO/stats.html

PERMALINK

Implications of mappings between International Classification of Diseases clinical diagnosis codes and Human Phenotype Ontology terms

Amelia L M Tan, PhD

Rafael S Gonçalves, PhD

William Yuan, PhD

Gabriel A Brat, MD

Robert Gentleman, PhD

Isaac S Kohane, MD, PhD

Abstract

Objective

Materials and Methods

Results and Discussion

Conclusion

Background and significance

Table 1.

Methods and materials

Ontologies analyzed

Sources of ontology mappings

UMLS as a current mapping source

Mapping coverage of codes with different levels of usage in BIDMC data

Mappability of the most used ICD codes

Mappability of ICD code categories by top-level ICD categories

Results

Dictionary level mapping

Table 2.

Mapping coverage of codes with different levels of usage

Figure 1.

Mappability of the most used ICD codes

Table 3.

Table 4.

Mappability of ICD code categories by top-level ICD categories

Figure 2.

Discussion

Conclusion

Contributor Information

Author contributions

Funding

Conflicts of interest

Data availability

Code availability

Ethical considerations

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases