The British Journal of Radiology. 2019 Sep 4;92(1102):20190255. doi: 10.1259/bjr.20190255

Development and implementation of a dynamically updated big data intelligence platform from electronic health records for nasopharyngeal carcinoma research

Li Lin 1, Wei Liang 2, Chao-Feng Li 3, Xiao-Dan Huang 1, Jia-Wei Lv 1, Hao Peng 1, Bing-Yi Wang 2, Bo-Wei Zhu 2, Ying Sun 1
PMCID: PMC6774598  PMID: 31430186

Abstract

Objective:

To develop a big data intelligence platform for secondary use of electronic health records (EHRs) data to facilitate research for nasopharyngeal cancer (NPC).

Methods:

This project was launched in 2015 as a collaboration between an academic cancer centre and a technology company. Patients diagnosed with NPC at Sun Yat-sen University Cancer Centre since January 2008 were included in the platform. Standard data elements were established to define 981 variables for the platform. For each patient, data from 13 EHR systems were extracted, integrated, structurized and normalized. Eight functional modules were constructed to help investigators identify eligible patients, establish research projects, conduct statistical analysis, track follow-up, search literature, etc.

Results:

From January 2008 to December 2018, 54,703 patients diagnosed with NPC were included. Of these patients, 39,058 (71.4%) were male, and 15,645 (28.6%) were female; median age was 47 (interquartile range, 39–55) years. Of 981 variables, 341 were obtained from data structurization and normalization, of which 68 were generated by interacting multiple data sources via well-defined logical rules. The average precision rate, recall rate and F-measure for 341 variables were 0.97 ± 0.024, 0.92 ± 0.030, and 0.94 ± 0.027 respectively. The platform is regularly updated every seven days to include new patients and add new data for existing patients. Up to now, eight big data-driven retrospective studies have been published from the platform.

Conclusion:

Our big data intelligence platform demonstrates the feasibility of integrating EHR data from routine healthcare, and offers an important perspective on real-world study of NPC. Continued efforts may focus on data sharing among multiple hospitals and public release of data files.

Advances in knowledge:

Our big data intelligence platform is the first disease-specific data platform for NPC research. It incorporates comprehensive EHRs data from routine healthcare, which can facilitate real-world study of NPC in risk stratification, decision-making and comorbidities management.

Introduction

With the widespread adoption of electronic health records (EHRs), the volume of data captured from routine healthcare procedures is growing at an unprecedented pace.1 However, the potential of big data from EHRs to translate into knowledge and clinical practice has not yet been realized, largely because of two factors: (a) routine clinical practice data are captured in various non-interoperable EHR systems, and the majority are documented as free text rather than structured data,2,3 which makes manual data acquisition inefficient and automated data extraction difficult; (b) it is generally acknowledged that data from routine clinical practice are less convincing than those from randomized controlled trials (RCTs) because of sampling bias and confounding variables.4 Nevertheless, in current cancer management, patients’ age and the complexity of comorbidities are increasing, making it more and more difficult to justify clinical decisions based on RCT outcomes;5 in parallel, evidence generated from real-world data has been described as having the potential to be more generalisable than RCT data.6 Hence, improving the efficiency of data acquisition from EHRs, so that big data from routine clinical practice can be explored, would improve the capacity of real-world data to address clinical questions regarding cancer management.

Nasopharyngeal cancer (NPC) is a tumour with an extremely unbalanced endemic distribution: incidence is lower than 1 per 100,000 (1/10^5) in most areas, whereas endemic areas are concentrated in China, which accounted for 47% of newly diagnosed patients in 2018.7 Owing to the high clinical service volume, RCTs from China have repeatedly changed clinical practice in NPC management during the past decade.8–10 However, these RCTs excluded elderly patients and patients with severe comorbidities or organ dysfunction. Hence, conducting studies within the routine clinical setting is indispensable for providing evidence to guide treatment decisions for those particular patient populations.

Advances in computational technologies have enabled effective and efficient use of EHRs through integration, structurization and normalization of the data, and several successful data platforms have been developed to facilitate clinical research and health surveillance.11,12 In this context, we launched a project in 2015 to develop a big data intelligence platform that integrates data from 13 EHR systems and provides structured and normalised data to promote big data real-world studies in NPC. Here, we describe in depth the development, performance evaluation and implementation of this platform.

Methods

Overview

In 2015, a project was launched to develop a big data intelligence platform by integrating EHRs to provide high-quality real-world data for NPC research. The project comprised four procedures: (a) development of NPC-specific standard data elements; (b) data extraction and integration; (c) data structurization and normalization; and (d) construction of functional modules on the platform. The framework of this project is shown in Figure 1. NPC oncologists from Sun Yat-sen University Cancer Centre, and informatics technicians and software engineers from YiduCloud Technology Ltd (Beijing, China) were involved in this project. This data platform construction study was approved by our Institutional Review Board (Approval No. 863-2015-001) in 2015, and the need to obtain written informed consent from the patients was waived.

Figure 1.


Framework of the project to develop the data platform. Patient data were extracted from 13 EHR systems, including HIS, EMR, RIS, LIS and others. Each gray box in the blue square denotes a data processing procedure. First, standard data elements were established to form the standard data layer of the platform; then, secure data storage and data quality control were addressed; finally, data structurization, normalization and calculation were performed using the big data engine, which incorporates natural language processing and machine learning models. To facilitate practical use, eight functional modules, shown in the yellow square, were developed for the platform. HIS, Hospital Information System; EMR, Electronic Medical Record; RIS, Radiology Information System; LIS, Laboratory Information System; eCRF, electronic case report form.

Establishment of NPC specific standard data elements

Standard data elements were established to define all the variables for the platform. The established standard data elements comprise 259 data elements in 18 modules, covering a total of 981 common and NPC-specific variables. Example modules and data elements are described in Table 1. The standard data elements were built with reference to the National Cancer Data Base (NCDB), the Surveillance, Epidemiology, and End Results (SEER) database, local databases and terminology standards, including the Health Level Seven (HL7) China clinical document architecture, the International Classification of Diseases, 10th revision (ICD-10), Logical Observation Identifiers Names and Codes (LOINC) for laboratory terminology, and the seventh edition of the American Joint Committee on Cancer/International Union Against Cancer staging system (Table 2).

Table 1.

Modules and standard data elements

| Modules | Elements no. | Content | Label |
| --- | --- | --- | --- |
| Coding | 3 | Coding no., coding version, coding | IT |
| Source | 4 | Patient unique identification information from institution | IT |
| Demographic | 24 | Name, gender, blood type, ethnicity | Demographic |
| Basic medical information | 43 | Chief complaint, past history, present history, family history, physical examination, KPS, ECOG, etc. | Baseline |
| Laboratory | 11 | Test method name, list of test items, test results, doctor, description and diagnosis | Test and examination |
| Imaging | 30 | Examination method name, list of items, results, doctor, description and diagnosis | Test and examination |
| Pathology | 13 | Pathological diagnosis, biopsy description, number of lymph nodes, lymph node metastasis, doctor, etc. | Test and examination |
| Staging | 8 | Standard, clinical and pathological staging | Baseline |
| Plan | 6 | The purpose of the treatment plan, the planning cycle, including the number of treatment methods | Treatment |
| Surgery | 18 | Surgery time, purpose of surgery, anesthesia, surgery site, surgery name, accompanying surgery, bleeding, first aid, surgeon, etc. | Treatment |
| Chemotherapy | 12 | Time, drug, dose, way, cycle number, effect, adverse effects, etc. | Treatment |
| Radiotherapy | 13 | Time, site, technique, dosage, frequency, effect, adverse effects, etc. | Treatment |
| Immunotherapy | 11 | Time, drug, dose, way, cycle number, effect, adverse effects, etc. | Treatment |
| Endocrine therapy | 11 | Time, drug, dose, way, cycle number, effect, adverse effects, etc. | Treatment |
| Target therapy | 11 | Time, drug, dose, way, cycle number, effect, adverse effects, etc. | Treatment |
| Discharge note | 13 | Admission/discharge time, admission/discharge diagnosis, second surgery (Y/N) | Treatment |
| Expense information | 8 | Outpatient classification fees, hospitalization classification fees, payment methods, the total amount | Cost |
| Follow-up | 20 | Survival, metastasis, relapse, secondary malignancies, KPS/ECOG, RTOG/EORTC late toxicities | Follow-up |

No., number; IT, information technology; KPS, Karnofsky performance scale index; ECOG, Eastern Cooperative Oncology Group; RTOG, Radiation Therapy Oncology Group; EORTC, European Organisation for Research and Treatment of Cancer.

Table 2.

International database and terminology standards for generating standard data elements

| Category | Standards |
| --- | --- |
| National health and family planning commission, PRC | National health key value code: WS364.X-2011 (X: 1–17, 17 parts) |
| | EMR common data element (X: 1–17, 17 parts) |
| HL7 CDA | HL7 China CDA Ver. 2013 (5 parts) |
| International cancer database structure | NCDB PUF-2015 |
| | National Cancer Institute SEER program coding and staging manual 2015-surveillance |
| | FORDS v. 2015 |
| | NHS data model and dictionary NCDR_Ver.5.2 |
| | ASCO treatment plan final (Online) |
| | National radiotherapy data set-RTDS-(NCASAT) Ver. 4 |
| Guidelines | AJCC/UICC staging system ver. 7 |
| | CTCAE v. 4 |
| | Toxicity criteria of the RTOG |
| | RECIST V1.1 (US, UK, Canada, Europe etc.) |
| | NCCN guidelines |
| Terminology | ICD-9-CM-3 |
| | ICD-10 |
| | ICD-O-3 |
| | LOINC Ver. 2.42 |
| | Karnofsky performance scale index (Online) |
| National coding standard | GB/T 2261.1–2003 Personal basic information classification and code part 1: gender |
| | GB/T 4671–2008 family relation coding |

GB/T is the recommended national standard of China. AJCC, American Joint Committee on Cancer; ASCO, American Society of Clinical Oncology; CDA, clinical document architecture; CTCAE, Common Terminology Criteria for Adverse Events; EMR, Electronic Medical Record; FORDS, Facility Oncology Registry Data Standards; HL7, Health Level Seven; ICD, International Classification of Diseases; LOINC, Logical Observation Identifiers Names and Codes; NCASAT, National Clinical Analysis and Specialised Applications Team; NCCN, National Comprehensive Cancer Network; NCDB, National Cancer Database; NCDR, National Cancer Data Repository; PUF, Participant User File; RECIST, Response Evaluation Criteria in Solid Tumours; RTDS, National Radiotherapy Data Set; RTOG, Radiation Therapy Oncology Group; UICC, International Union Against Cancer; UK, United Kingdom; US, United States.

Data extraction and integration

To date, patients diagnosed with nasopharyngeal cancer from January 2008 to December 2018 have been included in the platform. Because of the diversity of diagnostic terms for “nasopharyngeal carcinoma” used in Chinese EHRs, we took all variations of the term into account when searching for patients to enroll, in order to include all NPC patients. During this process, we excluded patients diagnosed with lymphoma, sarcoma or adenoid cystic carcinoma of the nasopharynx. For each patient, data were extracted from 13 EHR systems: the Hospital Information System (HIS), Electronic Medical Record (EMR), Radiology Information System (RIS), Laboratory Information System (LIS), pathology system, ultrasound system, electrocardiogram system, endoscopy system, anaesthesia information management system, MOSAIQ radiotherapy management system, follow-up system, physical examination information system, and the tumour bio-bank.

Data extraction was performed using Extract-Transform-Load (ETL) processes according to the pre-defined standard data elements. The ETL processes included data extraction, cleaning, transformation and loading. Through ETL, data from multiple non-interoperable systems were consolidated into a data warehouse. Then, all records of the same patient from the various systems were integrated into one record using a unique identification number provided by the enterprise patient master index (EPMI).
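
The integration step can be illustrated with a minimal sketch. It is not the platform's actual ETL code: the function name `integrate_records`, the `local_id` field, and the shape of the EPMI lookup table are all assumptions made for illustration. The sketch groups rows from several source systems under a single patient identifier and reports rows that cannot be matched.

```python
from collections import defaultdict


def integrate_records(system_extracts, epmi_map):
    """Merge per-system rows into one record per patient.

    system_extracts: {system_name: [row_dict, ...]}; each row carries a
        hypothetical 'local_id' assigned by its source system.
    epmi_map: {(system_name, local_id): epmi_id}, a hypothetical lookup
        standing in for the enterprise patient master index.
    Returns (patients, unmatched).
    """
    patients = defaultdict(dict)
    unmatched = []
    for system, rows in system_extracts.items():
        for row in rows:
            epmi_id = epmi_map.get((system, row["local_id"]))
            if epmi_id is None:
                # Keep track of rows with no master-index entry for review.
                unmatched.append((system, row["local_id"]))
                continue
            fields = {k: v for k, v in row.items() if k != "local_id"}
            patients[epmi_id].setdefault(system, []).append(fields)
    return dict(patients), unmatched


extracts = {
    "HIS": [{"local_id": "h1", "gender": "M"}],
    "LIS": [{"local_id": "l9", "hb": 130}, {"local_id": "l0", "hb": 120}],
}
epmi = {("HIS", "h1"): "P001", ("LIS", "l9"): "P001"}
merged, unmatched = integrate_records(extracts, epmi)
print(merged)
print(unmatched)
```

Keeping the source system as a key inside each patient record preserves provenance, which matters when data are later traced back for monitoring.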

Data structurization and normalization

Of 981 variables, 640 were first-level variables that could be directly extracted from the original data, and 341 were second-level variables obtained from structurization and normalization of the original data; of the latter, 68 were third-level variables generated by combining multiple data sources via well-defined logical rules. In general, free-text data from the EMR, radiology/pathology/ultrasound/electrocardiography reports and the anaesthesia information management system were transformed into structured data, and different terms referring to the same concept were identified and normalized to the standard terminology. Natural language processing (NLP) and multiple machine learning models were used for Chinese word segmentation and part-of-speech tagging,13 and a hybrid approach was created to improve the performance of data structurization and normalization.14 Detailed processes of applying NLP and machine learning models have been previously reported by our technical team.15,16

More specifically, EHR data from 200 randomly sampled patients were manually structurized and normalized by NPC oncologists according to the standard data elements and used as the training data set to develop the machine learning models. These models were then applied to all patients to perform structurization and normalization. Thereafter, another 100 patients were randomly selected as the testing data set to evaluate the performance of the automated processes. Evaluation criteria included precision (P), recall (R) and F-measure (F), where P = TP/(TP + FP), R = TP/(TP + FN) and F = 2PR/(P + R). TP stands for true positive, indicating a correct structurization and normalization result; FP stands for false positive, indicating an incorrect structurization and normalization result; and FN stands for false negative, indicating that no structurization result was obtained. Precision, recall and F-measure range from 0 to 1; a higher value indicates better performance.
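
As a worked illustration of these criteria, the sketch below computes precision as TP/(TP + FP), recall as TP/(TP + FN) and the F-measure as their harmonic mean. The function name and the example counts are hypothetical and are not taken from the study data; returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the paper.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from raw counts.

    tp: correct structurization/normalization results
    fp: incorrect results
    fn: cases where no result was obtained
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical counts for one variable in a 100-patient test set.
p, r, f = precision_recall_f(tp=90, fp=5, fn=10)
print(round(p, 3), round(r, 3), round(f, 3))
```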

To ensure the reliability of the big data platform, the precision and recall of each variable were required to be higher than 0.9. If not, the structurization and normalization results of the 100 testing patients were corrected by NPC oncologists and added to the training data set to optimize the models. These processes were repeated until the required precision and recall were reached.
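
The acceptance check that drives this loop can be sketched as follows. This is only an illustration of the stated rule (precision and recall both higher than 0.9 for each variable); the function name, the variable names and the metric values are hypothetical.

```python
def failing_variables(metrics, threshold=0.9):
    """List variables whose precision or recall is not above the threshold.

    metrics: {variable_name: (precision, recall)}
    A non-empty result would trigger another round of correction by
    oncologists and retraining, as described above.
    """
    return sorted(
        name
        for name, (precision, recall) in metrics.items()
        if not (precision > threshold and recall > threshold)
    )


# Hypothetical per-variable results from one evaluation round.
round_metrics = {
    "tumour_site": (0.97, 0.93),
    "metastasis_site": (0.95, 0.88),
    "t_category": (0.90, 0.95),  # exactly 0.9 is not "higher than 0.9"
}
to_fix = failing_variables(round_metrics)
print(to_fix)
```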

Construction of functional modules on the platform

To assist investigators, eight functional modules were designed for the platform: dashboard, search engine, patient timeline, research project, electronic case report forms, data dictionary, statistical analysis and literature library (Table 3). All the functional modules were designed following agile principles, which allow iteration over time to optimize functions and repair defects.

Table 3.

Description of the eight modules on the platform

| Modules | Description |
| --- | --- |
| Dashboard | Displays an overview of the characteristics of the database, including the distribution of patients according to gender, age, tumor-node-metastasis stage, treatment strategy, etc. |
| Search engine | Designed for users to search for eligible patients. Both simple search (key words) and advanced search (logical rules) are provided. The search results are presented with one line per patient. Selected variables of eligible patients can be exported from the platform as Excel sheets. |
| Patient timeline | A visualization feature that shows the diagnosis, treatment and follow-up processes of each patient. Timelines of diagnosis history, laboratory results, examination results, treatments and important events (first visit, last follow-up, first recurrence and/or metastasis date, death date) can be displayed together on a single screen. The patient timeline enables logical rules to be applied across multiple data sources. |
| Research project | Designed for users to create research projects that include eligible patients and variables of interest for study management, including statistical analysis, follow-up or data export. The projects can be saved in users’ accounts and updated at any time. |
| eCRF | eCRFs can be designed on the platform for data collection in prospective clinical trials. |
| Data dictionary | A traceable dictionary describing what type of data is collected within the database, its format and structure, and the completeness of different fields. |
| Statistical analysis | Designed for users to perform simple statistical analyses, including descriptive statistics, univariate analysis and correlation analysis. |
| Literature library | An open link access to scholarly resources across literature databases, guidelines and dispensatory. |

eCRF, electronic case report form.

Patient privacy and platform security

Patient privacy and platform security were the primary concerns in designing the platform. For each patient, personal identification data (name and national identification number) and personal sensitive data (addresses, phone numbers, contact person information, etc.) were de-identified in accordance with the Health Insurance Portability and Accountability Act. A security framework was designed to provide authentication, authorization and audit for the systems. Different permissions are configured for specific users according to their needs and hospital policies.

Results

Overview of the platform

From January 2008 to December 2018, a total of 54,703 patients diagnosed with NPC were included in the platform. Patient demographic characteristics are shown in Table 4. Of these patients, 39,058 (71.4%) were male, and 15,645 (28.6%) were female; median age was 47 (interquartile range, 39–55) years. The platform is updated every seven days to include new patients and add new data for existing patients.

Table 4.

Patient characteristics of the platform

Characteristics No. (%)
Year distribution
 2008 3852 (7.0)
 2009 5036 (9.2)
 2010 4682 (8.6)
 2011 5071 (9.3)
 2012 4978 (9.1)
 2013 5166 (9.4)
 2014 4894 (8.9)
 2015 5355 (9.8)
 2016 5533 (10.1)
 2017 5238 (9.6)
 2018 4898 (9.0)
Age distribution
 2–5 years 20 (0.04)
 6–17 years 415 (0.76)
 18–40 years 14,223 (26.0)
 41–64 years 35,824 (65.5)
 ≥65 years 4212 (7.7)
Gender
 Male 39,058 (71.4)
 Female 15,645 (28.6)

Examples of data structurization and normalization

Of 981 variables, 341 were obtained from structurization and normalization. Taking “MRI reports” as an example, six first-level variables (examination identification, date, name, site, description of findings, and conclusions) were directly extracted from the RIS. The free-text report was then structurized and normalized into an additional 63 second-level variables, including tumour site, tumour maximum diameter, involvement status of 38 nasopharynx-associated anatomic structures, cervical node involvement, recurrence (yes/no), metastasis (yes/no), metastasis site, uncertain metastasis (yes/no), uncertain metastasis site, etc.

Of these 341 variables, 68 were generated by combining multiple data sources via well-defined logical rules. Taking the variable “chemotherapy aim” as an example, the objectives of chemotherapy were classified into four categories: (1) neoadjuvant chemotherapy, defined as chemotherapy administered at least 2 weeks before definitive radiotherapy; (2) concurrent chemotherapy, defined as chemotherapy administered within 2 weeks before or after definitive radiotherapy; (3) adjuvant chemotherapy, defined as chemotherapy administered more than 2 weeks after definitive radiotherapy; and (4) palliative chemotherapy, defined as chemotherapy administered to patients who were diagnosed with recurrent or metastatic disease at first diagnosis or during follow-up. Generally, it takes several minutes per patient to determine by manual review of the corresponding EHRs whether a patient has received neoadjuvant chemotherapy before definitive radiotherapy, whereas the platform took only a few hours to process the data and apply the rules to thousands of patients.
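
The four-category rule for “chemotherapy aim” can be written as a small function. This is a sketch under stated assumptions rather than the platform's implementation: the function and parameter names are hypothetical, "before" is measured from the radiotherapy start date and "after" from its end date, a chemotherapy given exactly 2 weeks before the start counts as neoadjuvant, and the palliative category takes precedence over the timing rules.

```python
from datetime import date, timedelta

TWO_WEEKS = timedelta(days=14)


def chemotherapy_aim(chemo_date, rt_start, rt_end, recurrent_or_metastatic=False):
    """Classify one chemotherapy administration relative to radiotherapy."""
    if recurrent_or_metastatic:
        # Assumption: recurrent/metastatic status overrides timing.
        return "palliative"
    if chemo_date <= rt_start - TWO_WEEKS:
        # At least 2 weeks before radiotherapy starts.
        return "neoadjuvant"
    if chemo_date > rt_end + TWO_WEEKS:
        # More than 2 weeks after radiotherapy ends.
        return "adjuvant"
    # Within 2 weeks before the start through 2 weeks after the end.
    return "concurrent"


rt_start, rt_end = date(2016, 3, 1), date(2016, 4, 15)
for d in (date(2016, 2, 1), date(2016, 3, 10), date(2016, 5, 20)):
    print(d, chemotherapy_aim(d, rt_start, rt_end))
```

Encoding the rule once and applying it to every chemotherapy record is what allows the classification to scale to thousands of patients, in contrast to manual chart review.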

Besides determining chemotherapy aim, logical rules that combine different data sources are mainly applied in two circumstances: (a) to classify laboratory tests, examinations and patient measures (such as weight) into the pre-treatment, during-treatment or post-treatment period, or more finely into post-neoadjuvant chemotherapy or during concurrent chemotherapy; (b) to define first diagnosis, first chemotherapy, first radiotherapy, and first recurrence or metastasis in order to calculate relapse-free survival time, distant metastasis-free survival time or progression-free survival time.
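
Both circumstances reduce to comparing dates from different sources, as the sketch below illustrates. The function names are hypothetical, and two conventions are assumptions not specified in the text: the treatment start and end dates are treated as part of the during-treatment period, and a patient without an event is censored at the last follow-up date.

```python
from datetime import date


def treatment_period(event_date, treatment_start, treatment_end):
    """Assign a test, examination or measurement to a treatment period."""
    if event_date < treatment_start:
        return "pre-treatment"
    if event_date <= treatment_end:
        return "during treatment"
    return "post-treatment"


def time_to_event(origin, event_date, last_follow_up):
    """Return (days, event_observed) for a survival endpoint.

    origin: start of the interval (for example, first diagnosis).
    event_date: date of first recurrence/metastasis, or None if none occurred.
    last_follow_up: censoring date used when no event was recorded.
    """
    if event_date is not None:
        return (event_date - origin).days, True
    return (last_follow_up - origin).days, False


print(treatment_period(date(2016, 2, 1), date(2016, 3, 1), date(2016, 4, 15)))
print(time_to_event(date(2010, 1, 1), date(2011, 1, 1), date(2015, 1, 1)))
print(time_to_event(date(2010, 1, 1), None, date(2010, 1, 31)))
```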

Data structurization and normalization accuracy

Performance of data structurization and normalization was evaluated for the final version of the platform in 100 randomly selected patients. For the 341 variables, the average precision of structurization and normalization was 0.97 ± 0.024, recall was 0.92 ± 0.030, and F-measure was 0.94 ± 0.027. For the 68 variables generated by combining multiple data sources, the average precision was 0.92 ± 0.12, recall was 0.87 ± 0.12, and F-measure was 0.89 ± 0.12. Errors were mainly attributable to the quality of the original data.

Patient timeline

Six patient timelines covering basic information, visits, diagnosis, treatment, examinations and laboratory tests were established. Investigators can set up an individualized patient timeline for their study project by selecting data elements or variables of interest. Figure 2 shows an example of the patient timelines.

Figure 2.


Patient timelines. Each dot on the timelines indicates a visit, examination, test or treatment. For visits, green dots denote out-patient visits, while blue dots denote in-patient care. For examinations and tests, blue or green dots denote normal results, and red dots denote abnormal results. Hovering the mouse over a dot displays the chemotherapy drugs, radiotherapy dosage or examination/test results. DMFS, distant metastasis-free survival; ECT, emission computed tomography; HB, haemoglobin; LRFS, local relapse-free survival; NPC, nasopharyngeal cancer; OS, overall survival; PFS, progression-free survival; PLT, platelet; RBC, red blood cell; WBC, white blood cell.

Platform implementation status and examples

The platform was put into practice in May 2016 and has been adopted by 3 research groups comprising approximately 20 NPC oncologists and investigators. Only the necessary permissions are granted to specific users according to their needs, guaranteeing the availability and confidentiality of information, as well as the security of the processes. After the variables of interest for eligible patients are extracted and exported from the platform, the structured data are carefully reviewed before analysis, and all data are traceable for monitoring. To date, more than 30 big data-driven retrospective clinical studies are running or have been carried out based on the platform, and 8 of them have been published, involving prognostication and risk stratification of patients,17–19 decision-making20–23 and comorbidities management.24 The numbers of patients included in these eight studies ranged from 269 to 46,919.

Discussion

In oncology, studies exploring big data of routine healthcare from EHRs are vital for obtaining reliable real-world evidence to guide clinical practice.4 In this study, we developed the first big data intelligence platform for NPC research, in which comprehensive data of routine healthcare from EHRs have been successfully integrated, structured and normalized with satisfactory precision and recall (both >0.90). Our intelligent platform supports several modules that enable investigators to identify eligible patients and view patient information, establish research projects, conduct statistical analysis, track follow-up, search literature and even share study management. To date, eight big data-driven retrospective studies have been published from this platform to address clinical problems.18–25

At present, most EHR data remain in unstructured free-text format, and interoperability among the various EHR systems is lacking; these are the greatest challenges in using EHRs for real-world data collection.26 In this context, our data platform extracted and integrated data on clinical diagnosis, treatment and follow-up activities from 13 EHR systems, and natural language processing and machine learning models were applied to transform the free-text data into discrete data that can easily be extracted automatically. Moreover, we implemented logical rules that combine data from multiple sources to define variables (such as chemotherapy aim) that cannot be directly extracted or structured from a single data source. Importantly, our data platform incorporates comprehensive and detailed data from all aspects of the routine healthcare procedure.

Databases derived from routine clinical practice data for research purposes are not new; several successful databases have been deployed around the world and have changed the study of cancer care,8,9,27 and these served as references for our data platform. One prominent example is the NCDB, which currently captures approximately 70% of all patients newly diagnosed with cancer from approximately 30% of US hospitals (more than 34 million records as of 2016).27 To facilitate cancer research, a publicly shared subset of the NCDB, known as the Participant User File, was made available to researchers in 2013 through an application process and is updated annually. As a result, hundreds of studies have been published based on NCDB data. Other databases include the SEER database and SEER-Medicare.8,9 However, no disease-specific database has been reported, and our big data intelligence platform is therefore the first disease-specific platform for cancer research.

As a specialized big data platform for NPC research, our platform has several advantages compared with currently available public databases. First, the data platform incorporates comprehensive and detailed data of routine healthcare, including chemotherapeutic regimens (time, drug, dose, way, cycle number, effect, adverse effects, etc.) and radiotherapy information (time, site, technique, dosage, frequency, effect, adverse effects, etc.), whereas the widely used NCDB and SEER databases have limited data on these points.8,9,27 Second, the data platform provides information on most of the NPC-relevant clinical prognostic factors and their dynamic changes, such as tobacco exposure, family history, EBV DNA status, concurrent comorbidities, and socioeconomic status. Third, information on the involvement status of 38 nasopharynx-associated anatomic structures and on cervical lymph node involvement is provided, which can be used to restage patients if the American Joint Committee on Cancer staging system is updated. Finally, the platform is dynamically updated every seven days to include new patients and new data for existing patients.

Despite these strengths, several limitations of the big data platform need to be acknowledged. First, because an EHR is primarily designed for documenting patient care rather than for capturing information for research purposes, data quality and missing data remain problems; this is the main reason for the suboptimal quality of some processed data on the platform (average recall and F-measure were less than 0.9 for the 68 variables generated by combining multiple data sources). Hence, next-generation EHRs using a unified medical language system are expected to improve data quality.10 Second, it is currently a local platform containing data from a single institution and serving a limited number of investigators. Moving forward, data sharing among multiple hospitals and public release of data files from the platform are planned.

In summary, our big data intelligence platform demonstrates the feasibility of developing a data platform by integrating EHR data from routine clinical practice, and offers an important perspective on real-world study of nasopharyngeal cancer. Continued efforts may focus on data sharing among multiple hospitals and public release of data files.

Footnotes

Funding: Special Support Program of Sun Yat-sen University Cancer Center (16zxtzlc06); the Planned Science and Technology Project of Guangdong Province (2019B020230002); Natural Science Foundation of Guangdong Province (2017A030312003); Health & Medical Collaborative Innovation Project of Guangzhou City, China (201803040003); Innovation Team Development Plan of the Ministry of Education (No. IRT_17R110), Overseas Expertise Introduction Project for Discipline Innovation (111 Project, B14035).

Contributor Information

Li Lin, Email: linli@sysucc.org.cn.

Wei Liang, Email: wei.liang@yiducloud.cn.

Chao-Feng Li, Email: lichf@sysucc.org.cn.

Xiao-Dan Huang, Email: huangxd1@sysucc.org.cn.

Jia-Wei Lv, Email: lvjw@sysucc.org.cn.

Hao Peng, Email: penghao@sysucc.org.cn.

Bing-Yi Wang, Email: bingyi.wang@yiducloud.cn.

Bo-Wei Zhu, Email: bowei.zhu@yiducloud.cn.

Ying Sun, Email: sunying@sysucc.org.cn.

References

  • 1.Cowie MR, Blomster JI, Curtis LH, Duclaux S, Ford I, Fritz F, et al. Electronic health records to facilitate clinical research. Clin Res Cardiol 2017; 106: 1–9. doi: 10.1007/s00392-016-1025-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Joe P. Natural language processing in electronic health records. 2011.
  • 3.ENRICH, Cl The application of CNLP (clinical natural language processing) for improved analytics.. White Pap 2014. [Google Scholar]
  • 4. Khozin S, Blumenthal GM, Pazdur R. Real-world data for clinical evidence generation in oncology. J Natl Cancer Inst 2017; 109. doi: 10.1093/jnci/djx187
  • 5. Jennens RR, Giles GG, Fox RM. Increasing underrepresentation of elderly patients with advanced colorectal or non-small-cell lung cancer in chemotherapy trials. Intern Med J 2006; 36: 216–20. doi: 10.1111/j.1445-5994.2006.01033.x
  • 6. Sherman RE, Anderson SA, Dal Pan GJ, Gray GW, Gross T, Hunter NL, et al. Real-World Evidence - What Is It and What Can It Tell Us? N Engl J Med 2016; 375: 2293–7. doi: 10.1056/NEJMsb1609216
  • 7. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2018; 68: 394–424. doi: 10.3322/caac.21492
  • 8. Chen L, Hu C-S, Chen X-Z, Hu G-Q, Cheng Z-B, Sun Y, et al. Concurrent chemoradiotherapy plus adjuvant chemotherapy versus concurrent chemoradiotherapy alone in patients with locoregionally advanced nasopharyngeal carcinoma: a phase 3 multicentre randomised controlled trial. Lancet Oncol 2012; 13: 163–71. doi: 10.1016/S1470-2045(11)70320-5
  • 9. Zhang L, Huang Y, Hong S, Yang Y, Yu G, Jia J, et al. Gemcitabine plus cisplatin versus fluorouracil plus cisplatin in recurrent or metastatic nasopharyngeal carcinoma: a multicentre, randomised, open-label, phase 3 trial. Lancet 2016; 388: 1883–92. doi: 10.1016/S0140-6736(16)31388-5
  • 10. Sun Y, Li W-F, Chen N-Y, Zhang N, Hu G-Q, Xie F-Y, et al. Induction chemotherapy plus concurrent chemoradiotherapy versus concurrent chemoradiotherapy alone in locoregionally advanced nasopharyngeal carcinoma: a phase 3, multicentre, randomised controlled trial. Lancet Oncol 2016; 17: 1509–20. doi: 10.1016/S1470-2045(16)30410-7
  • 11. Johnson AEW, Pollard TJ, Shen L, Lehman L-WH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016; 3: 160035. doi: 10.1038/sdata.2016.35
  • 12. Vogel J, Brown JS, Land T, Platt R, Klompas M. MDPHnet: secure, distributed sharing of electronic health record data for public health surveillance, evaluation, and planning. Am J Public Health 2014; 104: 2265–70. doi: 10.2105/AJPH.2014.302103
  • 13. Xiong Y, Wang Z, Jiang D, Wang X, Chen Q, Xu H, et al. A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text. BMC Med Inform Decis Mak 2019; 19(Suppl 2): 66. doi: 10.1186/s12911-019-0770-7
  • 14. Ji B, Liu R, Li S, Yu J, Wu Q, Tan Y, et al. A hybrid approach for named entity recognition in Chinese electronic medical record. BMC Med Inform Decis Mak 2019; 19(Suppl 2): 64. doi: 10.1186/s12911-019-0767-2
  • 15. Peng H, Chen B-B, Tang L-L, Chen L, Li W-F, Zhang Y, et al. Prognostic value of nutritional risk screening 2002 scale in nasopharyngeal carcinoma: a large-scale cohort study. Cancer Sci 2018; 109: 1909–19. doi: 10.1111/cas.13603
  • 16. Yao J-J, Zhang F, Gao T-S, Zhang W-J, Lawrence WR, Zhu B-T, et al. Survival impact of radiotherapy interruption in nasopharyngeal carcinoma in the intensity-modulated radiotherapy era: a big-data intelligence platform-based analysis. Radiother Oncol 2019; 132: 178–87. doi: 10.1016/j.radonc.2018.10.018
  • 17. Zhang Y, Tang L-L, Li Y-Q, Liu X, Liu Q, Ma J. Spontaneous remission of residual post-therapy plasma Epstein-Barr virus DNA and its prognostic implication in nasopharyngeal carcinoma: a large-scale, big-data intelligence platform-based analysis. Int J Cancer 2019; 144: 2313–9. doi: 10.1002/ijc.32021
  • 18. Lv J-W, Qi Z-Y, Zhou G-Q, He X-J, Chen Y-P, Mao Y-P, et al. Optimal cumulative cisplatin dose in nasopharyngeal carcinoma patients receiving additional induction chemotherapy. Cancer Sci 2018; 109: 751–63. doi: 10.1111/cas.13474
  • 19. Peng H, Tang L-L, Chen B-B, Chen L, Li W-F, Mao Y-P, et al. Optimizing the induction chemotherapy regimen for patients with locoregionally advanced nasopharyngeal carcinoma: a big-data intelligence platform-based analysis. Oral Oncol 2018; 79: 40–6. doi: 10.1016/j.oraloncology.2018.02.011
  • 20. Peng H, Tang L-L, Liu X, Chen L, Li W-F, Mao Y-P, et al. Anti-EGFR targeted therapy delivered before versus during radiotherapy in locoregionally advanced nasopharyngeal carcinoma: a big-data, intelligence platform-based analysis. BMC Cancer 2018; 18: 323. doi: 10.1186/s12885-018-4268-y
  • 21. Huang X-D, Zhou G-Q, Lv J-W, Zhou H-Q, Zhong C-W, Wu C-F, et al. Competing risk nomograms for nasopharyngeal carcinoma in the intensity-modulated radiotherapy era: a big-data, intelligence platform-based analysis. Radiother Oncol 2018; 129: 389–95. doi: 10.1016/j.radonc.2018.09.004
  • 22. Lv J-W, Chen Y-P, Huang X-D, Zhou G-Q, Chen L, Li W-F, et al. Hepatitis B virus screening and reactivation and management of patients with nasopharyngeal carcinoma: a large-scale, big-data intelligence platform-based analysis from an endemic area. Cancer 2017; 123: 3540–9. doi: 10.1002/cncr.30775
  • 23. Evans RS. Electronic health records: then, now, and in the future. Yearb Med Inform 2016; Suppl 1: S48–S61. doi: 10.15265/IYS-2016-s006
  • 24. Cronin KA, Ries LAG, Edwards BK. The Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute. Cancer 2014; 120(Suppl 23): 3755–7. doi: 10.1002/cncr.29049
  • 25. Boffa DJ, Rosen JE, Mallin K, Loomis A, Gay G, Palis B, et al. Using the National Cancer Database for outcomes research: a review. JAMA Oncol 2017; 3: 1722–8. doi: 10.1001/jamaoncol.2016.6905
  • 26. Daly MC, Paquette IM. Surveillance, Epidemiology, and End Results (SEER) and SEER-Medicare databases: use in clinical research for improving colorectal cancer outcomes. Clin Colon Rectal Surg 2019; 32: 061–8. doi: 10.1055/s-0038-1673355
  • 27. Adamusiak T, Shimoyama N, Shimoyama M. Next generation phenotyping using the unified medical language system. JMIR Med Inform 2014; 2: e5. doi: 10.2196/medinform.3172

Articles from The British Journal of Radiology are provided here courtesy of Oxford University Press
