Abstract
Objective
In recent years, electronic health record systems have been widely implemented in China, making clinical data available electronically. However, little effort has been devoted to making drug information exchangeable among these systems. This study aimed to build a Normalized Chinese Clinical Drug (NCCD) knowledge base, by applying and extending the information model of RxNorm to Chinese clinical drugs.
Methods
Chinese drugs were collected from 4 major resources—China Food and Drug Administration, China Health Insurance Systems, Hospital Pharmacy Systems, and China Pharmacopoeia—for integration and normalization in NCCD. Chemical drugs were normalized using the information model in RxNorm without much change. Chinese patent drugs (i.e., Chinese herbal extracts), however, were represented using an expanded RxNorm model to incorporate the unique characteristics of these drugs. A hybrid approach combining automated natural language processing technologies and manual review by domain experts was then applied to drug attribute extraction, normalization, and further generation of drug names at different specification levels. Lastly, we reported the statistics of NCCD, as well as the evaluation results using several sets of randomly selected Chinese drugs.
Results
The current version of NCCD contains 16 976 chemical drugs and 2663 Chinese patent medicines, resulting in 19 639 clinical drugs, 250 267 unique concepts, and 2 602 760 relations. By manual review of 1700 chemical drugs and 250 Chinese patent drugs randomly selected from NCCD (about 10%), we showed that the hybrid approach could achieve an accuracy of 98.60% for drug name extraction and normalization. Using a collection of 500 chemical drugs and 500 Chinese patent drugs from other resources, we showed that NCCD achieved coverages of 97.0% and 90.0% for chemical drugs and Chinese patent drugs, respectively.
Conclusion
Evaluation results demonstrated the potential to improve interoperability across various electronic drug systems in China.
Keywords: Chinese Clinical Drug, Drug normalization, Chinese patent drug, Drug knowledge base, RxNorm
Introduction
Recently, the Chinese government has devoted significant resources to developing the healthcare sector, owing to a growing population, continuing urbanization, and an increasing disease burden. Health information technologies have been increasingly developed and widely deployed among hospitals and other healthcare providers (e.g., pharmacy stores). As of 2015, 90% of tertiary hospitals in China had implemented electronic health record (EHR) systems.1 More recently, the Chinese government also released a plan to establish health information exchange platforms at different levels such as city, province, and nationwide.2,3
Drug information is an important part of healthcare data. However, different drug terminologies (e.g., by government or EHR vendors) have been developed in China, which makes it difficult to exchange drug information across different systems. For example, 2 hospitals with different EHR vendors may store drug data in different terminologies, which makes the data difficult to exchange. Even within a hospital, there might be different systems using different drug codes, e.g., one system for physician orders and another for inventory management. From the clinical practice aspect, a consistent and unified information model is essential to establish a reliable medication history of patients across the continuum of care. The lack of interoperability among different drug terminologies may cause safety issues, e.g., increasing errors in prescribing, dispensing, and administering drugs. Moreover, health information technology (IT) vendors also face the challenge of mapping among different drug terminologies (e.g., from hospital drug systems to insurance drug lists) when building clinical information systems. To address these challenges, it is necessary to develop a unified and comprehensive information model of clinical drugs in China, in order to support seamless information exchange in clinical decision support, quality assurance, health service research, reimbursement, and mandatory reporting.4
Some initial efforts have been made by stakeholders in China to address this challenge. For example, the China Food and Drug Administration (CFDA) has created and maintained a comprehensive list of drugs approved on the Chinese market.5 The China Health Insurance System also releases a catalog of drugs used in the insurance system, called the National Basic Medical Insurance Drug List.6 There are also several drug dictionaries such as the Chinese Pharmacopoeia (CP,《中国药典》),7 the Chinese Pharmacy Dictionary (《中国药学大辞典》),8 and the Contemporary Drug’s Names and Trade Names Dictionary (《当代药品商品名与别名辞典》).9 The Institute of Information on Traditional Chinese Medicine also established a unified Traditional Chinese Medical Language System, which provides terminologies and ontologies for traditional Chinese medicine.10
Despite the above efforts and resources, several major challenges remain for normalizing Chinese drug names. First, current Chinese drug terminologies usually use string expressions to represent drug attributes such as ingredient (IN), dose form (DF), and strength, instead of normalized concept identifiers. The free-text variations of drug names require additional processes (e.g., natural language processing – NLP) before they can be normalized and used for further computational applications. Moreover, the semantic relationships between the drug and its core attributes are not well defined and represented in current terminologies. There is no formal information model to represent diverse Chinese drugs, and there is no comprehensive drug knowledge base that can make different Chinese drug terminologies interoperable using the information model. Moreover, one major type of drugs widely distributed in China is Chinese patent drugs, whose main INs are herbal extracts. Chinese patent drugs (Figure 1) differ markedly from chemical drugs in terms of the formation of drug names, the vocabulary of and hierarchical relations between DFs, and the definitions and expressions of drug strength information, which place further challenges on the normalization of drug names.
Figure 1.
Examples of Chinese patent drugs distributed in China.
On the other hand, the United States has carried out much work in biomedical concept normalization, including clinical drugs.11–13 A prominent example is RxNorm, a standard nomenclature of clinical drugs developed by the National Library of Medicine (NLM).4 The RxNorm project started in 2002. It represents medications at the level of “clinical drug,” which stands for pharmaceutical products given to (or taken by) a patient with therapeutic or diagnostic intent and is defined as IN(s), strength(s), and DF. Specifically, the information model of RxNorm represents a clinical drug and its attributes in a normalized fashion, which then can be used as an “Interlingua” to link drug names in source vocabularies. By using RxNorm, computer systems can communicate drug-related information efficiently and unambiguously.14,15 Currently, RxNorm is the most comprehensive and widely used English drug name knowledge base.15,16
Motivated by RxNorm, our ultimate goal is to build a normalized knowledge base for Chinese clinical drugs. As an initial step, in this study, we collected Chinese drugs (including both chemical drugs and Chinese patent drugs) from 4 major sources and normalized them into an RxNorm-like representation. We call this knowledge base Normalized Chinese Clinical Drugs (NCCD) (NCCD can be accessed at https://sbmi.uth.edu/ccb/resources/nccd.htm). The process of constructing NCCD consists of 2 steps: (1) we assess and extend the RxNorm information model to represent Chinese drugs and (2) we develop a hybrid approach that combines NLP and manual review to normalize Chinese drugs to the extended information model. Our evaluation shows that NCCD can represent normalized drug information precisely with good coverage of popular Chinese drugs, demonstrating its potential to improve interoperability across various electronic drug systems in China.
Background
Information Model of RxNorm
RxNorm is a normalized naming system for clinical drugs produced by the US NLM. It serves as a resource to support semantic interoperability between drug terminologies and pharmacy knowledge bases. RxNorm contains the names of prescription and many nonprescription drugs in the United States. A clinical drug in RxNorm was originally defined as the combination of active INs, strengths, and DF.13 Later, the attributes of quantity factor and qualitative distinction were added as optional attributes for defining a clinical drug.17 When any of these attributes varies, a new RxNorm concept is created with a concept unique identifier (RxCUI). The same drug in different string forms is mapped to the same RxCUI, and the normalized drug name is labeled as the preferred form of the name. Specifically, generic and branded drug concepts of different levels of specificity are defined and represented by semantic types, so as to meet the needs of various scenarios.13 In RxNorm, both term types and semantic types are denoted as Term types.
The major categories for generic drug concepts include: IN (an active compound or moiety of the drug), Precise ingredient (a specified form of the IN such as a salt or isomer form), Semantic Clinical Drug Component (SCDC, IN + Strength), Semantic Clinical Drug Form (SCDF, IN + DF), and Semantic Clinical Drug (SCD, IN + Strength + DF). Similarly, the main categories of branded drug concepts include: Brand Name (BN), Semantic Branded Drug Component (SBDC, BN + Strength), Semantic Branded Drug Form (SBDF, BN + DF), and Semantic Branded Drug (SBD, BN + Strength + DF). The various kinds of drug entities are linked to each other via reciprocal relationships. For example, SBDC is the trade name of SCDC, and SCDC has the trade name of SBDC. Together, concepts of different semantic types for the same drug form a semantic network.
The international drug concept models
In addition to RxNorm, a number of other drug ontologies have been used in the United States and worldwide. For example, the National Drug Concept Model in Systematized Nomenclature of Medicine–Clinical Terms (SNOMED CT) is designed for use by member countries of SNOMED International to create their own national drug extensions. Similar to RxNorm, it aims to facilitate information exchange among different national drug terminologies by using consistent naming of drug-related concepts. It also includes recommended mappings to other international standards, including the Anatomical Therapeutic Chemical classification18 and the ISO Identification of Medicinal Products.19 Existing extensions of the SNOMED CT Drug Model include the Australian Medicines Terminology in Australia20 and dm+d (the National Health Service [NHS] Dictionary of Medicines and Devices) in the UK.21 Furthermore, the National Drug File - Reference Terminology (NDF-RT),22 produced by the US Department of Veterans Affairs, Veterans Health Administration, is used for modeling drug characteristics including INs, chemical structure, DF, physiologic effect, mechanism of action, pharmacokinetics, and related diseases. It also provides a multi-axial hierarchical knowledge structure that classifies various INs and drug products.22,23
In this study, we selected RxNorm as the basis model for Chinese clinical drugs for several reasons. First, RxNorm is designed specifically for normalizing clinical drugs, with a clear representation for different semantic types of drugs. Second, over the past decade, RxNorm has been extensively developed into a comprehensive resource with links to rich external resources including the SNOMED CT drug model and the NDF-RT. Third, it is currently widely used for clinical practice and health service research in the United States. For example, RxNorm has been used as a standard drug vocabulary in the Common Data Model of the Observational Health Data Sciences and Informatics (OHDSI) consortium, which aims to build an international data network for enabling large-scale observational studies.24
Major Types of Drugs in Chinese Clinical Information Systems
There are several major types of drugs present in Chinese clinical information systems: chemical drugs, biological drugs, Chinese herbal drugs, and Chinese patent drugs. Among them, chemical drugs and biological drugs are similar to those in Western medicine. In contrast, Chinese herbal drugs and Chinese patent drugs (i.e., herbal extracts) are used mainly in China. The current version of NCCD collects and normalizes chemical drugs and Chinese patent drugs only. Although chemical drugs in China can be easily mapped to the RxNorm representation, Chinese patent drugs have several unique characteristics that cannot be accommodated by the information model of RxNorm.
First, the names of Chinese patent drugs (e.g., 六味地黄丸, tablets of 6 drugs with Rehmannia; 八珍丸, tablets of 8 precious INs) are neither IN names nor BNs. We refer to such a name as a new semantic type called patent name (PN). For a given PN, different manufacturers are required to produce the drug with the same portions of fixed INs, in accordance with the CP’s monograph.25 Therefore, a PN often stands for an ensemble of multiple INs (MIN) in 1 drug preparation, which is similar to the term type MIN in RxNorm. However, unlike MIN, PNs do not provide details about the INs. Furthermore, there is no formal BN for Chinese patent drugs. The combination of a PN and a manufacturer name could serve as a surrogate of BNs in RxNorm.
Another major difference concerns the definition and representation of strength information in Chinese patent drugs. Chemical drugs and Chinese patent drugs differ in the basis of strength substance as defined in RxNorm, i.e., the IN (generally the active IN or active moiety, or another substance altogether) in reference to which strength is defined. The strength of chemical drugs is defined as the amount of strength substance, which can be the active IN or the active moiety (e.g., 500 mg capsule). In contrast, since Chinese patent drugs are extracted and condensed from the original Chinese herbals, their strength is defined as the original amount of Chinese herbals used to generate 1 unit of the DF. For example, in “每8丸重1.44克(每8丸相当于饮片3克)” (every set of 8 tablets weighs 1.44 g; every 8 tablets are equivalent to 3 g of prepared herbals in small pieces ready for decoction), the strength should not be calculated as the weight of each tablet (1.44 g/8 = 0.18 g). It should be defined as the weight of original herbal products in each tablet (i.e., 3 g/8 = 375 mg). These differences indicate the need to define the calculation method of strength specifically for Chinese patent drugs.
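The strength calculation described above can be illustrated with a minimal sketch. The regular expressions and function names below are hypothetical and cover only labels phrased like the example; they are not the rules used to build NCCD.

```python
import re
from decimal import Decimal

# Hypothetical patterns for labels phrased like the example in the text.
# "每8丸相当于饮片3克": every 8 pills are equivalent to 3 g of prepared herbals.
EQUIVALENT = re.compile(r"每(\d+)丸相当于饮片(\d+(?:\.\d+)?)克")
# "每8丸重1.44克": every 8 pills weigh 1.44 g.
WEIGHT = re.compile(r"每(\d+)丸重(\d+(?:\.\d+)?)克")


def _per_unit_mg(match):
    """Convert 'N units correspond to G grams' into mg per single unit."""
    units = int(match.group(1))
    grams = Decimal(match.group(2))
    return grams * 1000 / units


def patent_strength_mg(label):
    """Strength as mg of original herbal material per unit, or None."""
    match = EQUIVALENT.search(label)
    return _per_unit_mg(match) if match else None


def pill_weight_mg(label):
    """Physical weight of one unit in mg (not the strength), or None."""
    match = WEIGHT.search(label)
    return _per_unit_mg(match) if match else None


label = "每8丸重1.44克(每8丸相当于饮片3克)"
print(patent_strength_mg(label))  # herbal-equivalent strength per pill, in mg
print(pill_weight_mg(label))      # pill weight, in mg
```

Keeping the two quantities in separate functions makes the distinction explicit: only the herbal-equivalent amount is treated as the strength.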
Methods
In this study, we collected Chinese drugs from 4 major resources (the China Food and Drug Administration, China Health Insurance Systems, Hospital Pharmacy Systems, and the China Pharmacopoeia) and integrated and normalized them into NCCD. From these sources, we aimed to create both generic drug names and branded drug names of different specification levels, if applicable. Chemical drugs were normalized using the original information model in RxNorm without much change. On the other hand, Chinese patent drugs were represented using an expanded RxNorm model to incorporate the unique characteristics of these drugs. A hybrid approach combining automated NLP technologies and manual review by domain experts was then applied to drug attribute extraction, normalization, and further generation of drug names at different specification levels. Lastly, we evaluated the automatic parsing performance, data quality, and coverage of common drugs in NCCD.
Data Sources
The current version of NCCD is built upon drug vocabularies from the following 4 sources:
Approved drugs from CFDA: CFDA is an agency for streamlining food and drug safety regulation processes in China.5 It maintains a comprehensive archive of drugs available in the Chinese market, including Chinese patent drugs, Chinese herbal medicine, chemical drugs, and biological drugs. As of November 2016, the drug vocabulary from CFDA had 170 600 drug entries.
Drugs from China Health Insurance System (CHIS): CHIS maintains a catalog of drugs used in the China National Basic Medical Insurance.6 This catalog is issued by the Ministry of Labor and Social Security and the State Planning Commission, with the goal of strengthening the basic medical insurance administration with a reasonable control of medical costs. The CHIS catalog also covers Chinese patent drugs, Chinese herbal medicine, chemical drugs, and biological drugs. The version of CHIS used in this study contains 8530 drugs.
Drugs used in hospital information systems (HIS): Through collaboration with hospitals, we have collected a list of drugs used in several HIS systems in China. The HIS drug list contains 28 071 entries.
Drugs in China Pharmacopoeia (CP): CP is an official compendium of drugs compiled by the Pharmacopoeia Commission of the Ministry of Health of China. It is recognized by the World Health Organization as the official Chinese pharmacopoeia.6 CP covers not only names of traditional Chinese medicine and chemical drugs, but also information about purity, precautions, storage, etc. Importantly, it contains the detailed INs and strength information for traditional Chinese medicine. The current version of CP contains 936 drugs.
As described above, different drug sources contain different numbers of drugs. Moreover, they also contain different types of drug information. The detailed data fields in each drug source are listed in the Supplementary Table S1. Among them, CFDA contains the most representative and comprehensive drug attributes, including typical attributes present in RxNorm, such as INs, strength, and DFs. Therefore, in this study, CFDA served as the backbone of NCCD, augmented by information from the other data sources.
Information Model Design for NCCD
A brief description of the information model of NCCD
As an initial step, we follow the original information model of RxNorm4 to organize drugs in NCCD. An RxNorm concept is formed by different drug attributes or their combinations. For example, for a chemical drug, its semantic clinical drug is defined as IN + DF + strength. Moreover, clinical/branded drug names are defined at different levels of specificity and with different attribute combinations, so as to meet the needs of various scenarios.13 When any of the attributes varies, a new concept is added to the nomenclature. An NCCD concept unique identifier (CRxCUI) is generated and assigned to each concept. The same drug name in different string forms is mapped to the same CRxCUI, and the normalized drug name is labeled as the preferred form of the name.
We follow the original information model of RxNorm to represent chemical drugs in China. In total, NCCD defined 20 types of relationships among the 10 semantic types for chemical drugs. Considering the unique characteristics of Chinese patent drugs, it is necessary to adapt and extend the original information model of RxNorm for Chinese patent drugs. Following the common naming conventions of RxNorm, we defined 10 semantic types for the representation of Chinese patent drugs (Table 1), based on the following 4 types of attributes:
Table 1.
Ten Semantic Types Defined for the Representation of Chinese Patent Drugs in NCCD
| Semantic types | Name | Description |
|---|---|---|
| PN | Patent name | Generic name of a Chinese patent drug |
| IN | Ingredient | |
| DF | Dose Form | |
| BN | Branded Name | Patent name + Product manufacturer |
| SCD | Semantic Clinical Drug | Patent name + Strength + Dose Form |
| SBD | Semantic Branded Drug | Patent name + Strength + Dose Form + Product manufacturer |
| SCDC | Semantic Clinical Drug Component | Patent name + Strength |
| SCDF | Semantic Clinical Drug Form | Patent name + Dose Form |
| SBDC | Semantic Branded Drug Component | Patent name + Strength + Product manufacturer |
| SBDF | Semantic Branded Drug Form | Patent name + Dose Form + Product manufacturer |
The 4 types of attributes are product name (商品名称), dose form (剂型), strength (单位剂量), and product manufacturer (生产厂家).
As illustrated in Table 1, a new semantic type of PN is defined, which is the generic name of a Chinese patent drug. Because only a small portion of Chinese patent drugs (3.1%) have detailed IN information, PN is considered as an ensemble of multiple active INs and a replacement of the semantic type of MIN. It is used as the IN information in the formation of concepts for clinical/branded drugs, instead of the original semantic type of IN. Additionally, the attribute of product manufacturer is incorporated into the concept formation for branded names, in order to differentiate among Chinese patent drugs of the same PN. As a result, only one new semantic type of PN is added into NCCD.
In addition to the above semantic types, 1 attribute of Chinese drugs, the National Drug Approval Number (国药准字), which is similar to the New Drug Approval identifier for drugs in the United States, was added into NCCD. It is the approval number assigned by the CFDA for new drug production. It has a standard format: 国药准字 + 1 letter + 1 digit + 8 digits, in which the letter is “H” for chemicals and “Z” for Chinese herbal drugs. Only drugs with this approval number can be produced and distributed in China.
Process of NCCD Construction
As introduced previously, the drug records from 4 data sources were integrated into NCCD. Among them, the CFDA data has the largest scale and contains the most comprehensive set of data fields. Therefore, it was chosen as the backbone of NCCD and was processed first. After that, data records from the other 3 sources were integrated into NCCD. The attributes of IN, DF, strength, and the National Drug Approval Number were extracted for both chemical and Chinese patent drugs. Branded names of chemical drugs were also extracted. Moreover, the attributes of PN and manufacturer were extracted for each Chinese patent drug, in order to form semantic concepts related to its branded name.
As illustrated in Figure 2, the workflow for building NCCD mainly consisted of 3 steps: (1) Extract candidate strings of attributes by parsing the original data records using NLP based methods; (2) Normalize candidate attributes to concepts in NCCD based on semantic similarity and domain knowledge; and (3) Generate possible clinical and branded drug names leveraging original relations and construct the semantic network among concepts. Detailed description of each step is as follows:
Figure 2.
Workflow of building NCCD using a specific Chinese patent drug as an example.
Step 1. Candidate attribute string extraction using NLP-based methods
Because drug attributes such as IN and DF may not be provided explicitly in the data records from each source, this step used NLP-based methods to parse the content of related data fields and extract drug attributes from the original records.
For example, if the drug contains only one IN, the IN term is usually included in its drug name, such as the IN “Aciclovir (阿昔洛韦)” in Aciclovir Capsules (阿昔洛韦胶囊). On the other hand, when the drug contains multiple INs, they are usually described in one continuous string without obvious segmentation. Furthermore, DF and strength information is often present together with the INs and needs to be split and extracted. As an example, the original data record “Injection 250 ml: 12.5 g glucose and 2.25 g sodium chloride” is split into 2 parts: “glucose | 12.5 g/250 ml | Injection” and “sodium chloride | 2.25 g/250 ml | Injection.”
In order to parse the raw data records and extract different drug attributes, dictionaries/keyword lists of INs, DFs, and strengths were collected from multiple resources and enriched semi-automatically: new terms were extracted automatically from the drug data sources and added after a manual check. Pattern-based parsing rules were designed based on observation of each data source to extract drug attributes automatically. The list of frequent patterns can be found in Supplementary Table S2. If the parser was unable to interpret the data format, domain experts were involved to extract such information from the data records.
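As an illustration of pattern-based parsing, the following minimal sketch splits the injection record quoted above into one row per IN. The dose-form keyword list, the two regular expressions, and the function name are assumptions made for this example; the actual NCCD patterns are those listed in Supplementary Table S2.

```python
import re

# Hypothetical keyword list; the real dictionaries were collected from
# multiple resources and enriched semi-automatically.
DOSE_FORMS = {"Injection", "Capsules", "Tablets"}

# Assumed record layout: "<dose form> <volume> ml: <components>".
RECORD = re.compile(r"^(?P<df>\w+)\s+(?P<vol>[\d.]+\s*ml):\s*(?P<body>.+)$")
# One component: "<amount> <unit> <ingredient name>", joined by "and".
COMPONENT = re.compile(
    r"(?P<amount>[\d.]+\s*(?:g|mg))\s+(?P<name>[a-z ]+?)(?=\s+and\s+|$)"
)


def parse_record(record):
    """Return (ingredient, strength, dose form) rows, or None if unparsed."""
    match = RECORD.match(record.strip())
    if match is None or match.group("df") not in DOSE_FORMS:
        return None  # no pattern matched: hand over to manual review
    rows = []
    for comp in COMPONENT.finditer(match.group("body")):
        strength = f"{comp.group('amount')}/{match.group('vol')}"
        rows.append((comp.group("name").strip(), strength, match.group("df")))
    return rows


for row in parse_record(
    "Injection 250 ml: 12.5 g glucose and 2.25 g sodium chloride"
):
    print(" | ".join(row))
```

Returning `None` for unrecognized formats mirrors the fallback described in the text, where records the parser cannot interpret are passed to domain experts.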
Step 2. Normalization of candidate attributes to NCCD concepts based on semantic similarity and domain knowledge
After parsing the raw data record, the extracted drug attributes were normalized to concepts in NCCD. If no existing concept matched, the attribute was added into NCCD as a new concept and assigned a unique identifier (CRxCUI). Two major approaches were used for concept normalization, based on the characteristics of different drug attributes:
Semantic similarity-based concept normalization: In order to map attributes of diverse string forms such as IN names to NCCD, the semantic similarity between the new IN and existing IN terms in NCCD was calculated automatically based on the vector-space-model.26 Specifically, cosine similarity was used in a vector-space-model for semantic similarity measurement after normalizing the case of words. The new IN was mapped to the concept with the most semantically similar IN terms.
Concept normalization aligned with RxNorm editorial conventions: The strength information of some drugs had to be calculated based on the text information, especially for Chinese patent drugs and drugs of the injection DF. Furthermore, the units of strength also needed to be normalized. Various rules were developed to normalize such concepts. Taking future cross-linking with other drug ontologies including RxNorm into account, NCCD attempted to follow the editorial conventions in RxNorm for concept normalization. For example, the strength of glucose in “250 ml: glucose | 12.5 g” was converted to “glucose | 50.0 mg/ml.”
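The glucose conversion above is a unit normalization followed by a division. The sketch below reproduces that arithmetic; the unit table, the function name, and the one-decimal output format are assumptions chosen to match the example, not the documented NCCD rules.

```python
from decimal import Decimal

# Assumed conversion factors to milligrams.
TO_MG = {"g": Decimal(1000), "mg": Decimal(1)}


def concentration_mg_per_ml(amount, unit, volume_ml):
    """Express 'amount unit in volume_ml ml' as a mg/ml strength string."""
    milligrams = Decimal(amount) * TO_MG[unit]
    value = milligrams / Decimal(volume_ml)
    return f"{value:.1f} mg/ml"  # one decimal place, as in the example


print("glucose |", concentration_mg_per_ml("12.5", "g", "250"))
```

Passing numbers as strings into `Decimal` avoids binary floating-point rounding when strengths are later compared as text.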
Furthermore, to ensure the normalization quality, manual checks were conducted over the outputs of the automatic process, especially when the highest IN similarity did not reach a certain threshold or if no pattern can be matched.
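A minimal sketch of similarity-based mapping with a review threshold follows. The text does not specify how terms were vectorized or which threshold was used, so the character-bigram vectors, the 0.8 cutoff, and all names here are assumptions for illustration only.

```python
import math
from collections import Counter


def vectorize(term):
    """Case-normalized character-bigram counts (an assumed representation)."""
    text = term.lower()
    if len(text) < 2:
        return Counter([text])
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(
        sum(v * v for v in a.values()) * sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def normalize_ingredient(term, known_terms, threshold=0.8):
    """Return (best match, score, needs_manual_review)."""
    vector = vectorize(term)
    best = max(known_terms, key=lambda k: cosine(vector, vectorize(k)))
    score = cosine(vector, vectorize(best))
    return best, score, score < threshold  # hypothetical cutoff


known = ["Aciclovir", "Glucose", "Sodium Chloride"]
print(normalize_ingredient("ACICLOVIR", known))
print(normalize_ingredient("xyz", known))
```

The third element of the result flags low-similarity matches, corresponding to the cases that were routed to manual checking.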
Step 3. Generate possible clinical and branded drug names leveraging original relations
Once attributes of a drug were normalized, possible clinical and branded drug names of different specification levels as listed in Table 1 were generated by leveraging the original relations between attributes and the drug name. Finally, a semantic network was formed by drug entities such as PNs, INs, DFs, clinical drug names, branded drug names, SCDC, SCDF, SBDC, and SBDF, and their relationships according to the information model of NCCD.
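The attribute combinations in Table 1 can be read as a lookup table from semantic type to required attributes. The sketch below generates names for a Chinese patent drug under that reading; the space-separated name format, the function name, and the sample attribute values are illustrative assumptions, not NCCD's actual naming convention.

```python
# Attribute combinations taken from Table 1 (Chinese patent drugs).
COMPOSITION = {
    "BN": ("pn", "manufacturer"),
    "SCD": ("pn", "strength", "df"),
    "SBD": ("pn", "strength", "df", "manufacturer"),
    "SCDC": ("pn", "strength"),
    "SCDF": ("pn", "df"),
    "SBDC": ("pn", "strength", "manufacturer"),
    "SBDF": ("pn", "df", "manufacturer"),
}


def generate_names(attrs):
    """Build a name for every semantic type whose attributes are all present."""
    names = {}
    for semantic_type, parts in COMPOSITION.items():
        if all(attrs.get(part) for part in parts):
            names[semantic_type] = " ".join(attrs[part] for part in parts)
    return names


# Illustrative values only; the manufacturer is a placeholder.
drug = {
    "pn": "六味地黄丸",
    "df": "丸剂",
    "strength": "375 mg",
    "manufacturer": "Example Pharma",
}
for semantic_type, name in generate_names(drug).items():
    print(semantic_type, "->", name)
```

Semantic types that need a missing attribute are simply skipped, which suits records without strength information: only BN, SCDF, and SBDF would be generated for them.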
Evaluation
To evaluate the performance of the NLP tool for drug attribute parsing and extraction, we randomly selected 300 records of chemical drugs and 300 records of Chinese patent drugs and manually reviewed their parsing outputs. The precision, recall, and F-measure of each attribute were reported on these 2 datasets. Specifically, precision is measured as the percentage of correctly parsed attributes among all the attributes automatically parsed; recall is measured as the percentage of correctly parsed attributes among all the attributes present in drugs; and F-measure is the harmonic mean of precision and recall.
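The three metrics defined above can be computed from simple counts, as in this minimal sketch. The function name and the counts in the usage line are made up for illustration and are not results from this study.

```python
def precision_recall_f(correct, parsed, present):
    """Compute precision, recall, and F-measure from attribute counts.

    correct: attributes parsed correctly
    parsed:  all attributes the parser produced
    present: all attributes actually present in the drug records
    """
    precision = correct / parsed if parsed else 0.0
    recall = correct / present if present else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Made-up counts, for illustration only.
p, r, f = precision_recall_f(correct=90, parsed=100, present=120)
print(f"precision={p:.2%} recall={r:.2%} F-measure={f:.2%}")
```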
Next, the data quality of NCCD was evaluated from 2 aspects:
To validate the accuracy of drug information present in NCCD, 10% of chemical drugs (1700) and 10% of Chinese patent drugs (250) were randomly selected for manual review. For each specific drug, attributes such as INs, DFs, and strengths, and the semantic relations between these attributes, were manually checked for correctness. Accuracy is defined as the percentage of attributes/relations correctly present in NCCD.
To examine the coverage of NCCD in terms of common drugs, 500 chemical drugs were randomly collected from a popular drug information search tool in China, named Clove Park Drug Assistant-丁香园用药助手.27 In addition, 500 Chinese patent drugs were collected from Clove Park Drug Assistant and the National guidelines for clinical use of essential drugs (Chinese patent medicine)-国家基本药物临床应用指南 (中成药).28 Coverage is defined as the percentage of INs, DFs, strengths, and semantic clinical drugs that are correctly covered by NCCD among the 500 drugs in each set. Since drug types such as SCDC, SCDF, SBDC, and SBDF are populated automatically from IN, DF, and strength, they do not need to be formally evaluated. Moreover, given that a majority of collected Chinese patent drugs do not provide strength information, the coverage of semantic clinical drug forms is also reported for Chinese patent drugs.
Results
Information Model of NCCD
We follow the original information model of RxNorm to represent chemical drugs in China (Figure 3[A]). The information model for Chinese patent drugs is illustrated in Figure 3(B). It is presented as a semantic network formed by concepts of clinical/branded drugs at different levels of specificity with distinctive relationships. In comparison with RxNorm, the only new semantic type in the information model for Chinese patent drugs is PN. As the generic name of a Chinese patent drug, PN is considered as an ensemble of multiple active INs and is used in the formation of concepts for clinical/branded drugs, due to the limited IN information provided in the current data sources. Additionally, the attribute of product manufacturer is incorporated into the concept formation for branded names, in order to differentiate among Chinese patent drugs of the same PN. Moreover, the National Drug Approval Number (国药准字) is also added into NCCD as a drug attribute, which stands for the approval number assigned by the CFDA for new drug production. In total, 114 995 concepts and 1 042 624 relations are defined for Chinese patent drugs in NCCD.
Figure 3.
Examples of semantic relations of chemical drug (A) and Chinese patent drug (B) concepts in NCCD. For chemical drugs, it is almost identical to the original RxNorm model. For Chinese patent drugs, the RxNorm model was extended to accommodate additional attributes and relations.
Statistics of NCCD
Table 2 lists the statistics of concepts for chemical drugs and Chinese patent drugs, respectively. The contributions from each data source and their integration are listed. CFDA contributes the most to NCCD, together with the other 3 sources. As of November 10th, 2017, for chemical drugs, NCCD contains 5546 INs, 16 976 clinical drugs, and 7895 branded drugs in total. For Chinese patent drugs, NCCD contains 10 908 PNs, 3286 INs, 2663 clinical drugs, and 9169 branded drugs.
Table 2.
Statistics of Concepts for Chemical Drugs and Chinese Patent Drugs in Each of the Four Data Sources, NCCD, and RxNorm
| | CFDA | CHIS | HIS | CP | NCCD | RxNorm^a |
|---|---|---|---|---|---|---|
| Chemical drug | | | | | | |
| Ingredient | 2876 | 2699 | 2731 | 47 | 5546 | 10 715 |
| Multiple Ingredients | 1037 | 0 | 789 | 0 | 1492 | 3977 |
| Dose Form | 385 | 224 | 235 | 16 | 786 | 166 |
| Brand Name | 4898 | 0 | 0 | 0 | 4858 | 15 691 |
| Semantic Clinical Drug Component | 10 715 | 0 | 2804 | 0 | 12 097 | 27 104 |
| Semantic Clinical Drug Form | 9322 | 3903 | 4222 | 90 | 13 279 | 14 556 |
| Semantic Clinical Drug | 16 128 | 0 | 5227 | 0 | 16 976 | 36 131 |
| Semantic Branded Drug Component | 7700 | 0 | 0 | 0 | 7701 | 21 853 |
| Semantic Branded Drug Form | 5984 | 0 | 0 | 0 | 5989 | 14 593 |
| Semantic Branded Drug | 7868 | 0 | 0 | 0 | 7895 | 21 732 |
| Chinese patent drug | | | | | | |
| Patent name | 10 052 | 25 | 2908 | 245 | 10 908 | NA |
| Ingredient | 0 | 0 | 3036 | 258 | 3286 | NA |
| Dose Form | 471 | 9 | 134 | 0 | 504 | NA |
| Brand Name | 3026 | 2548 | 0 | 3271 | NA | |
| Semantic Clinical Drug Component | 1714 | 0 | 750 | 0 | 2314 | NA |
| Semantic Clinical Drug Form | 14 395 | 28 | 2214 | 0 | 15 821 | NA |
| Semantic Clinical Drug | 2096 | 0 | 939 | 0 | 2663 | NA |
| Semantic Branded Drug Component | 8370 | 0 | 849 | 0 | 9019 | NA |
| Semantic Branded Drug Form | 60 680 | 0 | 2854 | 0 | 61 090 | NA |
| Semantic Branded Drug | 8591 | 0 | 978 | 0 | 9169 | NA |
^a The statistics of RxNorm are provided by the 2015 UMLS (https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/RXNORM/stats.html).
In addition, NCCD currently contains 250 267 unique concepts and 2 602 760 relations, of which 114 995 (45.95%) concepts and 1 042 624 (40.06%) relations are defined for Chinese patent drugs. Figure 4 further illustrates the concept coverage and the overlap among the 4 data sources.
Figure 4.

Concept overlap among 4 drug data sources in NCCD.
CFDA: China Food and Drug Administration; CP: Chinese Pharmacopoeia; HIS: Hospital Information System; CHIS: Chinese Health Insurance System.
Performance
As shown in Table 3, the NLP-based semantic parsing model obtained an F-measure of 87.20% for the SCDs of chemical drugs and 86.00% for the SCDs of Chinese patent drugs. For chemical drugs, the automatic extraction of strength information had the lowest F-measure (84.40%). Because only a small proportion of Chinese patent drugs have strength information in the data sources, we obtained this information by manually checking the data. The extraction of DFs performed well for both chemical drugs (F-measure 94.90%, the highest among the chemical drug attributes) and Chinese patent drugs (F-measure 92.70%). Notably, as illustrated in Table 4, high accuracy (at least 98.58% for every semantic type) was obtained for drug concepts in NCCD, demonstrating that NCCD precisely represents the drug information from the 4 data sources. Additionally, 97.00% of randomly selected common chemical drugs and 90.00% of randomly selected common Chinese patent drugs were covered in NCCD, which indicates that NCCD has comprehensive coverage for practical applications.
Table 3.
Performance of the NLP Tool for Automatic Parsing of Drug Information from Drug Resources (%)
| Semantic type | Precision | Recall | F-measure | |
|---|---|---|---|---|
| Chemical drug | ||||
| Ingredient | 88.00 | 95.00 | 91.37 | |
| Strength | 78.00 | 92.00 | 84.40 | |
| Dose form | 98.00 | 93.00 | 94.90 | |
| Semantic clinical drug | 82.00 | 93.00 | 87.20 | |
| Chinese patent drug | ||||
| Patent name | 98.00 | 98.00 | 98.00 | |
| Ingredient | 80.00 | 96.00 | 87.30 | |
| Dose form | 88.00 | 98.00 | 92.70 | |
| Semantic clinical drug form | 80.00 | 96.00 | 87.20 | |
| Semantic clinical drug | 78.00 | 96.00 | 86.00 | |
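As a reading aid for Table 3, the sketch below computes the balanced F-measure as the harmonic mean of precision and recall. This is an assumption: the text does not state which F-measure variant was used. Values recomputed from the table can also differ slightly from the reported ones, because the precision and recall in the table are themselves rounded.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall.

    Both inputs use the same scale (here, percentages), and so does the result.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Ingredient row for chemical drugs in Table 3: precision 88.00, recall 95.00.
ingredient_f = round(f_measure(88.00, 95.00), 2)
print(ingredient_f)  # 91.37
```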
Table 4.
Accuracy and Coverage of Concepts of Chemical Drugs and Chinese Patent Drugs in NCCD (%)
| Semantic type | Accuracy | Coverage |
|---|---|---|
| Chemical drug | ||
| Ingredient | 99.30 | 98.00 |
| Strength | 98.85 | 96.00 |
| Dose form | 100.00 | 98.00 |
| Semantic Clinical drug | 98.60 | 97.00 |
| Chinese patent drug | ||
| Patent name | 100.00 | 96.00 |
| Ingredient | 99.12 | 98.00 |
| Strength | 100.00 | 98.00 |
| Dose form | 100.00 | 92.00 |
| Semantic Clinical Drug Form | 98.90 | 90.00 |
| Semantic Clinical Drug | 98.58 | 90.00 |
Discussion
As the first comprehensive knowledge base of normalized Chinese clinical drugs, NCCD leverages the information model in RxNorm with further expansion to accommodate the unique characteristics of Chinese patent medicines. The evaluation demonstrates that NCCD can represent normalized drug information accurately. It also has good coverage of common clinical drugs, covering 97.00% of the sampled chemical drugs and 90.00% of the sampled Chinese patent drugs. The promising performance of NCCD demonstrates its readiness to facilitate interoperability across various electronic drug systems and pharmaceutical applications in China.
Error Analysis
The errors in automatic parsing were mainly caused by rare or inconsistent patterns in the drug data sources, particularly HIS, which collects drug information from multiple drug manufacturers. For example, the strength information in the string “40 mg以泮托拉唑计)或(以泮托拉唑钠计42.3 mg)” (“40 mg [calculated as pantoprazole] or [42.3 mg calculated as pantoprazole sodium]”) was not correctly recognized. In addition, although NCCD has good data quality with high accuracy, some errors occurred in the manual checking process, especially when a large amount of drug information was provided in one record. For example, some Chinese patent drugs contain more than 30 INs. Some drug strength information was also normalized incorrectly during manual checking, for instance in the unit conversion from “g” to “mg.” As a next step, unit standards such as the Unified Code for Units of Measure could be leveraged to represent units and convert between them, which would help avoid errors related to unit conversion.
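The unit-conversion errors mentioned above are the kind of mistake that a small, table-driven conversion step can help prevent. The following sketch is only an illustration and is not part of NCCD: it uses a hand-written lookup table keyed by a few unit codes ("g", "mg", "ug") rather than a full implementation of the Unified Code for Units of Measure, and the function name `to_mg` is hypothetical.

```python
from decimal import Decimal

# Conversion factors to milligrams for a few mass units (illustrative subset).
TO_MG = {
    "g": Decimal("1000"),
    "mg": Decimal("1"),
    "ug": Decimal("0.001"),
}


def to_mg(value: str, unit: str) -> Decimal:
    """Convert a strength value given as text into milligrams.

    Decimal arithmetic avoids binary floating-point artifacts, and unknown
    units raise an error instead of being passed through silently.
    """
    factor = TO_MG.get(unit.strip().lower())
    if factor is None:
        raise ValueError(f"unsupported unit: {unit!r}")
    return Decimal(value) * factor


strength_mg = to_mg("0.25", "g")
print(strength_mg)
```

Rejecting unsupported units is a deliberate choice here: a record with an unexpected unit is routed to manual review rather than normalized incorrectly.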
We also analyzed the drugs not currently covered in NCCD. The major cause was that some drug name variations used in clinical settings could not be exactly matched to entries in NCCD. Another reason is that tools for mapping formulary names to drug terminologies are not perfect, so errors occur during such automated mappings. For instance, “红霉素过氧苯甲酰凝胶” (erythromycin benzoyl peroxide gel) and “醋酸洗必泰溶液” (chlorhexidine acetate solution) are drug names commonly used in clinics, but they are not included in the drug sources currently used for NCCD. The potential influence on applications would be use-case specific. For instance, users who try to normalize these terms with NCCD may not find corresponding entries, which hinders interoperability among drug systems because these term variations are missing from NCCD.
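Misses caused by purely superficial variation (spacing, full-width characters, letter case) can be reduced by normalizing names before exact matching. The sketch below illustrates that idea together with a simple coverage calculation. It is not the matching procedure used for NCCD, the function names are hypothetical, and genuine synonyms would still require a curated variant table.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Fold full-width characters (NFKC), remove whitespace, and lowercase."""
    folded = unicodedata.normalize("NFKC", name)
    return "".join(folded.split()).lower()


def coverage(query_names, kb_names) -> float:
    """Percentage of query names found in the knowledge base after normalization."""
    if not query_names:
        raise ValueError("query_names must not be empty")
    kb = {normalize_name(n) for n in kb_names}
    hits = sum(1 for q in query_names if normalize_name(q) in kb)
    return 100.0 * hits / len(query_names)


# Toy data: one entry in the knowledge base, two clinical names to look up.
kb_names = ["阿司匹林肠溶片"]
queries = ["阿司匹林 肠溶片", "醋酸洗必泰溶液"]
toy_coverage = coverage(queries, kb_names)
print(toy_coverage)  # 50.0
```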
Challenges of Representing Chinese Patent Drugs
Given the quality of drug information provided in the major data sources, the representation of Chinese patent medicines faces several challenges. First, a large portion of Chinese patent medicines (78.79%) do not have specific strength information in the major data sources. Second, the DFs of Chinese patent drugs have hierarchical relations. Taking the DFs of the drug in Figure 1 as an example, the DF 丸 (pill) has several sub-categories, including 浓缩丸 (concentrated pill), 水蜜丸 (water-honeyed pill), and 大蜜丸 (big honeyed pill). The same drug with different DFs may actually have different therapeutic effectiveness.28 Currently, we only use the level of 丸 (pill) for DF normalization, which could be further refined by considering the specific sub-categories, similar to the organization of DF groups in RxNorm.17 In the future, we could also use the optional attribute of qualitative distinction, as in RxNorm, to incorporate such information into drug representations.17
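The DF hierarchy described above can be pictured as a lookup from each sub-category to its parent DF. The sketch below is a simplified illustration, not the NCCD implementation: it contains only the three sub-categories named in the text, and it keeps the original sub-category next to the normalized DF so that the finer distinction is not lost.

```python
# Sub-category -> parent DF, limited to the examples named in the text.
DF_PARENT = {
    "浓缩丸": "丸",  # concentrated pill -> pill
    "水蜜丸": "丸",  # water-honeyed pill -> pill
    "大蜜丸": "丸",  # big honeyed pill -> pill
}


def normalize_dose_form(dose_form: str):
    """Return (normalized DF, original DF).

    DFs without a known parent are returned unchanged as their own
    normalized form.
    """
    return DF_PARENT.get(dose_form, dose_form), dose_form


normalized, original = normalize_dose_form("浓缩丸")
print(normalized, original)
```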
Analysis of Difference Among Four Major Drug Information Sources
As illustrated in Tables 2 and 3, the 4 drug sources provide different types of drug information. Among them, CFDA contains the most comprehensive drug attribute information. In contrast, CHIS mainly provides general drug information such as product names and DFs, without detailed INs, BNs, and strengths of drugs. In addition, only HIS and CP provide detailed IN and strength information for some Chinese patent drugs (Table 2). Although CP also provides strength information for chemical drugs, it is expressed as the percentage of the IN in the drug. For example, “the percentage of C12H13N2O2 should not be <98.5% in Isocarboxazid.”7 This is probably related to differences in the original purposes of these drug sources. CFDA is in charge of organizing the formulation and publication of the national pharmacopeia.5 CHIS, on the other hand, provides a standard of medication reimbursement amounts for basic medical insurance, work-related injury insurance, and maternity insurance.6 HIS mainly uses drug information for prescription and administration, while CP provides the standards and regulations that serve as a reference for the production, supply, application, and supervision of drugs in China.7
Maintenance and Quality Assurance
Quality assurance is a major concern in the production of NCCD releases. Minimizing the need for keyboard entry during NCCD data production keeps typographical errors to a minimum, while reviews of the consistency of relationships and other internal checks help assure the quality of each release. Most of the information for changes to NCCD comes from new releases of data sources from national standards (e.g., CFDA and CHIS). A subgroup under the OHDSI China Workgroup is responsible for reviewing and reconciling potential errors in each release, which provides an important check on the consistent creation of NCCD names and updates of the NCCD model. Releases of the OHDSI Chinese Vocabularies, including NCCD, are scheduled to occur simultaneously twice a year; in each release, the content of NCCD is coded and included in the OHDSI Vocabulary CDM.
Distribution, Copyright, and User Support
The NCCD file can be obtained at no cost from the OHDSI China Workgroup website (http://www.ohdsichina.org/nccd). The downloadable zip file contains several NCCD content files which are bar-delimited text files, as well as load scripts that can be used to import the content files into MySQL or Oracle databases. At the same web site, users can find related information, including the NCCD overview and technical documentation. The content of NCCD is freely available, including NCCD names and codes. No copyrighted proprietary information is included in the current release. We do ask users to fill in a simple registration form with their contact information so that they can be informed about the upcoming NCCD changes. A GitHub site about NCCD is also planned to provide a platform for technical support and feedback collection.
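For users who prefer to inspect the content files before running the database load scripts, a bar-delimited text file can be read with a standard CSV parser. The sketch below is illustrative only. The column names and the two sample rows are hypothetical, because the actual file layout is defined in the NCCD technical documentation rather than in this article.

```python
import csv
import io

# Hypothetical column layout; consult the NCCD documentation for the real one.
COLUMNS = ["concept_id", "semantic_type", "name"]


def read_bar_delimited(handle, columns):
    """Parse bar-delimited rows into dictionaries keyed by column name."""
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader if row]


# An in-memory stand-in for a content file; with a real file, use
# open(path, encoding="utf-8", newline="") instead.
sample = io.StringIO("C0001|SCD|阿司匹林 100 mg 肠溶片\nC0002|DF|肠溶片\n")
rows = read_bar_delimited(sample, COLUMNS)
print(rows[0]["name"])
```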
Limitations and Future Work
This work has several limitations. First, the current information model does not accommodate packages of multiple drugs or traditional Chinese herbal medicines without standard monographs. An appropriate expansion of the information model will be required when we include these drugs. Second, the data sources included in this study are not complete. Additional sources of drug information, such as those from individual hospitals, pharmaceutical companies, and health information technology companies, should be continuously incorporated into NCCD in the future. It would also be useful to generate cross-references between NCCD and other resources: RxNorm, to link chemical drugs distributed in China and the United States; a Traditional Chinese Medical Language System,10 to obtain additional information such as indications; and drug classification systems such as the Anatomical Therapeutic Chemical classification18 and NDF-RT,22 as in RxNorm, so that drugs can be organized into groups according to the organ or system on which they act and/or their therapeutic and chemical characteristics.
Finally, we expect that the construction of NCCD would promote better regulations by CFDA and other stakeholders to improve the data quality for Chinese patent drugs. More detailed information of IN and strength would facilitate the normalization, interoperability, as well as the development, distribution, application, and post-marketing surveillance of Chinese patent drugs.
Conclusion
As the first normalized Chinese clinical drug knowledge base, NCCD leverages the information model in RxNorm with further expansion to accommodate the unique characteristics of Chinese patent medicines. Evaluation results show that NCCD can represent normalized drug information precisely with a good coverage of popular drug resources, demonstrating its capability of facilitating the interoperability across various electronic drug systems and pharmaceutical applications in China.
Competing interests
None.
Contributors
The work presented here was carried out in collaboration among all authors. LW, YZ, MJ, and HX designed methods and experiments. LW, MJ, and JW carried out the experiments. LW and YZ analyzed the data, interpreted the results, and drafted the paper. All authors have contributed to editing, reviewing, and approving the manuscript.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.
ACKNOWLEDGMENTS
This study was supported in part by National Science Foundation of China (81271668), Technology Foundation Projects of Nantong (MS12015112), Jiangsu Government Scholarship for Overseas Studies (2014), National Key Research and Development programs (2016YFC0901602), NSFC and Guangdong Provincial Center for Big Data Science Joint Fund Project (No. U1611261), Major Project of Frontier and Key Technical Innovation of Guangdong Province in 2014 (Science and Technology Major Project) (No. 2014B010118003), the Frontier and Key Technology Innovation Project of Guangdong Province (2014B010118003), the 2014 information projects of Jiangsu Province Health and Life Committee (X201401), Jiangsu Province’s Key Provincial Talents Program (ZDRCA2016005), the 2016 industry prospecting and common key technology key projects of Jiangsu Province Science and Technology Department (BE2016002-4), and the 2016 projects of Nanjing Science Bureau (201608003).
References
- 1. Li X, Meng Y, Liu L, Li J, Rao K. Application of electronic medical records in China. Chin J Med Libr Inf Sci 2016;25(8):15–19.
- 2. The State Council of the People's Republic of China. China Details Major Tasks in Healthcare Reform 2017. http://english.gov.cn/policies/latest_releases/2017/05/05/content_281475646380958.htm. Accessed April 3, 2017.
- 3. Lei J, Wen D, Zhang X et al. Enabling health reform through regional health information exchange: a Model Study from China. J Healthcare Eng 2017;2017:1–9.
- 4. Liu S, Ma W, Moore R, Ganesan V, Nelson S. RxNorm: prescription for electronic drug information exchange. IT Professional 2005;7(5):17–23.
- 5. Chinese Food and Drug Administration. 2017. http://www.sfda.gov.cn/WS01/CL0412/. Accessed April 3, 2017.
- 6. Yan W, Jingrui L. Challenges in the Implementation of Essential Medicine Drug List and National Basic Medical Insurance Drug Catalogue. China Health Insurance 2010;4.
- 7. National Pharmacopoeia Committee. Pharmacopoeia of the People's Republic of China. Beijing: China Medical Science and Technology Press; 2017.
- 8. Chinese Pharmacy Editorial Committee. Chinese Pharmacy Dictionary. Beijing: People’s Medical Publishing House; 2010.
- 9. Chinese Pharmaceutical Association. Contemporary Drug’s Names and Trade Names Dictionary. Beijing: Chemical Industry Press; 2006.
- 10. Mao Y, Yin A, ed. Ontology modeling and development for Traditional Chinese Medicine. Biomedical Engineering and Informatics, 2009. BMEI'09. 2nd International Conference on; 2009:1–5; IEEE.
- 11. Doğan RI, Leaman R, Lu Z. NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inform 2014;47:1–10.
- 12. Aronson AR, ed. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings of the AMIA Symposium; Washington, DC: American Medical Informatics Association; 2001.
- 13. Nelson SJ, Zeng K, Kilbourne J, Powell T, Moore R. Normalized names for clinical drugs: RxNorm at 6 years. J Am Med Inform Assoc 2011;18(4):441–448.
- 14. Dhavle AA, Ward-Charlerie S, Rupp MT, Kilbourne J, Amin VP, Ruiz J. Evaluating the implementation of RxNorm in ambulatory electronic prescriptions. J Am Med Inform Assoc 2015;23:e99–e107.
- 15. Bennett CC. Utilizing RxNorm to support practical computing applications: capturing medication history in live electronic health records. J Biomed Inform 2012;45(4):634–641.
- 16. National Library of Medicine. RxNorm Overview 2017. https://www.nlm.nih.gov/research/umls/rxnorm/overview.html. Accessed April 3, 2017.
- 17. National Library of Medicine. RxNorm Documentation. 2017. https://www.nlm.nih.gov/research/umls/rxnorm/docs/2010/rxnorm_doco_full_2010-3.html. Accessed April 3, 2017.
- 18. World Health Organization. The Anatomical Therapeutic Chemical Classification System with Defined Daily Doses (ATC/DDD). Norway: WHO; 2006.
- 19. Milmo S. The Complexity of IDMP. Pharmaceutical Technol 2017;41(5):6–8.
- 20. McBride S, Lawley M, Leroux H, Gibson S, ed. Using Australian Medicines Terminology (AMT) and SNOMED CT-AU to Better Support Clinical Research. HIC; 2012:144–149.
- 21. Datapharm Communications Ltd. Dictionary of Medicines and Devices Browser 2017. http://dmd.medicines.org.uk/DesktopDefault.aspx?tabid=1. Accessed April 3, 2017.
- 22. Blach C, Del Fiol G, Dundee C et al. Use of RxNorm and NDF-RT to normalize and characterize participant-reported medications in an i2b2-based research repository. AMIA Summits on Translational Science Proceedings 2014;2014:35.
- 23. Palchuk MB, Klumpenaar M, Jatkar T, Zottola RJ, Adams WG, Abend AH, ed. Enabling hierarchical view of RxNorm with NDF-RT drug classes. AMIA Annual Symposium Proceedings; 2010:577–581; American Medical Informatics Association.
- 24. Hripcsak G, Duke JD, Shah NH et al. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers. Stud Health Technol Inform 2015;216:574.
- 25. Taylor M. Chinese patent medicines. A Beginner's Guide. Santa Cruz, CA: Global Eyes International Press; 1998.
- 26. Tang B, Zhang Y, Wang J et al. UTH_CCB: a report for semeval 2014–task 7 analysis of clinical text. SemEval 2014;2014:802.
- 27. Clove Park. Clove Park Drug Assistant. 2017. http://drugs.dxy.cn/. Accessed April 3, 2017.
- 28. State Administration of Traditional Chinese Medicine of the People’s Republic of China. National Guidelines for Clinical Use of Essential Drugs (Chinese Patent Medicine). People's Medical Publishing House; 2009.