Abstract
Almost every Korean (97%) is enrolled in the National Health Insurance program, and most receive medical treatment at least once a year. Data are collected by the Health Insurance Review and Assessment Service (HIRA), and the results of the review are sent to the National Health Insurance Service (NHIS). The data handled by NHIS and HIRA cover almost the entire population and can be used for various research purposes. NHIS and HIRA support research by making these data available to researchers. The greatest advantage of these data is that they are the only data that include virtually the entire population. Both HIRA and NHIS data are provided in the form of sample data and all (customized) data. NHIS and HIRA data are similar but exhibit minor differences. HIRA data consist of five tables: general specification details, in-hospital treatment details, disease details, out-of-hospital prescription details, and nursing institution information. NHIS data include death records (including cause of death), some medical examination records, and the socio-economic variables of the subject, such as income, in addition to all the HIRA data. Clinical results of treatments are not recorded in NHIS or HIRA data. Moreover, because these public data are generated for billing purposes, their use in actual research has thus far been limited. Researchers must therefore develop a study design that minimizes the errors and bias that can occur during the course of a study, and they need to clearly understand the structure and characteristics of NHIS and HIRA data when initiating research.
Keywords: Big data, Data science, Cohort study
INTRODUCTION
In recent years, various studies have been reported in Korea pertaining to big data from the medical field.1,2,3,4 This has been a focus because big data presents the advantage of securing the data of a large number of patients within a short time period and at a low cost, unlike the randomized controlled trial approach.4,5
Medical big data, which have recently attracted widespread interest in Korea, can be classified as public and private medical big data. Public medical big data are represented by the National Health Insurance Service (NHIS) data6 and the Health Insurance Review and Assessment Service (HIRA) data.7 The electronic medical records (EMR) data used in hospitals, on which various studies have been based,8,9 are a representative example of private medical big data (Table 1). Public medical big data are provided to researchers, within the scope of the research topic, after an application process through each institution's system.
Table 1. Claims data versus EMR data.
| Characteristic | Claims data | EMR data |
|---|---|---|
| Example | NHIS data, HIRA data | Medical chart data |
| Subject | Includes all medical use of the total national population | If a patient visits a hospital other than a specific hospital, tracking is not possible. |
| Data structure | Consists of the same data structure | The data structure of each hospital is different |
| DQM | No DQM required | DQM required |
| Purpose of data accumulation | - Not “medical” data; rather, these data are a bill for medical treatment. | - Detailed records of events occurring during the course of patient care. |
| | - Not data from a medical perspective in nature. | |
| | - Data for billing to the country. | |
| Analysis of medical practice | There is a limit to the interpretation of the results of medical practice. | All analysis from a medical perspective, such as patient symptoms, examination, diagnosis, treatment, and prognosis, is possible. |
| Data characteristics | - Only data related to insured health benefits items are recorded. | - Any data related to insured or non-insured health benefits items are recorded. |
| | - It is impossible to analyze non-insured health benefits items. | - “Medical” data based on actual clinical practice. |
| | - In many cases, the “outcome” of treatment is not reflected. | - All test results and treatment results are reflected. |
| | - Most “test results” do not need to be recorded. | - In some cases, there is no indication for actions taken. |
| | - There are reasons for specific medical actions that were taken (ICD-10 code, etc.). | |
EMR, electronic medical records; NHIS, National Health Insurance Service; HIRA, Health Insurance Review and Assessment Service; DQM, data quality management; ICD, International Classification of Diseases.
The NHIS and HIRA public medical big data are generated by the Korean health insurance system and are also known as “claims data” or “public data” because they are accumulated for the purpose of claims by medical institutions. Both public and private medical big data have various advantages and disadvantages. Therefore, researchers must clearly understand the data's characteristics to obtain accurate research results corresponding to their aims.10 This study focuses on the NHIS and HIRA data, which are the most widely used public health big data.
CLAIMS DATA REPRESENTED BY NHIS DATA AND HIRA DATA
The NHIS and HIRA data are similar but exhibit minor differences; therefore, it is essential to accurately understand these differences. Essentially, the NHIS data11 include death records (including cause of death), some medical examination records, and socio-economic variables of subjects such as income, in addition to all the contents of the HIRA data.12 Therefore, among the claims data, NHIS data are employed in more studies than HIRA data (Table 2).
Table 2. NHIS data versus HIRA data.
| Characteristic | NHIS data | HIRA data |
|---|---|---|
| SAMPLE data§ | | |
| Components | 2% of the total population and 4 other types | 3% of the total population and 3 other types |
| Acquisition (purchase) of sample data | IRB-approved research protocol is required | No research protocol required* |
| Ownership | No ownership of data | Provide ownership of data |
| Type of preferred study | Longitudinal study preferred | Cross-sectional data preferred |
| ALL (customized) data‖ | | |
| Components | Provides all national data that meet the conditions requested by the researcher | Provides all national data that meet the conditions requested by the researcher |
| Ownership | No ownership of data | No ownership of data |
| Death records | Includes death records (except cause of death) | Cannot be confirmed† |
| Center visit | Researchers must visit the center for use | Researchers must visit the center for use‡ |
Both NHIS and HIRA provide “ALL (customized)” and “SAMPLE” data. For “ALL (customized)” data, neither NHIS nor HIRA provides ownership of the data (the data are used by direct visit to the center or by remote access). HIRA's “SAMPLE” data are the only data that can be directly owned and processed by medical personnel or companies.
NHIS, National Health Insurance Service; HIRA, Health Insurance Review and Assessment Service; IRB, Institutional Review Board.
*Acquisition (purchase) of the sample data provided by HIRA does not require a research protocol or IRB approval. However, research using the sample data can begin after the research protocol has been prepared and IRB approval has been obtained; †In the HIRA data, a death can be identified only when it occurred in the hospital and the medical treatment termination code is recorded as death. However, not all people die in the hospital, and even for in-hospital deaths there are cases where the treatment termination code is omitted; therefore, death records cannot be confirmed with 100% accuracy; ‡Remote access to HIRA is generally possible, but in some cases it is restricted, for example when the amount of data is too large or when a private company participates in the research and conducts the analysis; §Only some patients are represented as a sample; ‖All the eligible citizens of the country.
The greatest advantage of the NHIS and HIRA data is that they are the only data that include nearly the entire population.4,5 These data are the closest to real-world data (RWD), which are often referred to as big data in the medical field. They enable researchers to analyze and observe all medical activities, such as prescriptions, procedures, and surgeries, performed by domestic medical institutions within the scope of reimbursement.1,2,6,7 This helps immensely in reflecting trends in the medical field, and various clinical studies have been conducted based on this approach.
NHIS and HIRA data are inherently limited because they are generated for billing and not for clinical research.4,5 Clinical outcomes are not recorded in the data, because these outcomes do not lie within the scope of the claims. For example, if a patient is prescribed antihypertensive or diabetes medication, the resulting drop in blood pressure or glucose level is not recorded. Additionally, records of non-reimbursed services are not provided, because they also fall outside the scope of the claims. Therefore, researchers must obtain a comprehensive understanding of the characteristics of claims data in order to obtain results that are appropriate for clinical research purposes. Such an understanding also reduces the trial and error and the biases that arise during the research process.
COMPONENTS OF NHIS DATA AND HIRA DATA
HIRA data consist of 5 tables: general specification details, in-hospital treatment details, disease details, out-of-hospital prescription details, and nursing institution information.13 HIRA provides a guide file to help researchers identify the variables in each table. The general specification details contain information such as the patient's age, gender, department, date of visit, and whether the visit was for an injury or a disease. The in-hospital treatment details include records of medical expenses, prescription fees, examination fees, procedure/operation codes, etc. The disease details include all recorded injuries and diseases rather than only the major ones. The out-of-hospital prescription details contain the out-of-hospital prescription drugs, the number of prescription days, and the quantities prescribed. Lastly, the nursing institution information records the type of institution (clinic, hospital, tertiary general hospital, public health center, etc.), its city and province, and whether it is equipped with computerized tomography/magnetic resonance imaging. The personal information in the HIRA data is de-identified so that individuals cannot be identified, and information on non-reimbursed items is not recorded.
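The relationship among these tables can be sketched in Python. This is only an illustration: the field names (`claim_id`, `icd10`, `procedure_code`, `drug_code`, `institution_id`) and the assumption that detail rows are attached to a claim through a shared key are hypothetical, and the actual variable names must be taken from the guide file provided by HIRA.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Claim:
    """One row of the general specification table (hypothetical fields)."""
    claim_id: str
    patient_id: str        # de-identified identifier
    age: int
    sex: str
    visit_date: str        # ISO date string
    institution_id: str    # hypothetical key into the institution table
    diagnoses: List[str] = field(default_factory=list)      # disease details
    treatments: List[str] = field(default_factory=list)     # in-hospital treatment details
    prescriptions: List[str] = field(default_factory=list)  # out-of-hospital prescriptions


def link_tables(general, diseases, treatments, prescriptions) -> Dict[str, Claim]:
    """Attach detail rows to their claim using the shared claim_id key."""
    claims = {row["claim_id"]: Claim(**row) for row in general}
    for row in diseases:
        claims[row["claim_id"]].diagnoses.append(row["icd10"])
    for row in treatments:
        claims[row["claim_id"]].treatments.append(row["procedure_code"])
    for row in prescriptions:
        claims[row["claim_id"]].prescriptions.append(row["drug_code"])
    return claims


# Toy rows with made-up values, for illustration only.
general = [{"claim_id": "C1", "patient_id": "P1", "age": 58, "sex": "F",
            "visit_date": "2018-03-02", "institution_id": "H9"}]
diseases = [{"claim_id": "C1", "icd10": "E11"}, {"claim_id": "C1", "icd10": "I10"}]
treatments = [{"claim_id": "C1", "procedure_code": "PROC-001"}]
prescriptions = [{"claim_id": "C1", "drug_code": "DRUG-001"}]

linked = link_tables(general, diseases, treatments, prescriptions)
print(linked["C1"].diagnoses)
```

Keeping all diagnoses per claim, rather than only the first, mirrors the point that the disease details table records every injury and disease on the claim.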
As mentioned above, the NHIS data include death records, medical examination records (adult screenings for people over the age of 40, screenings for working women, and screenings for infants), and income decile. The death records held by NHIS do not themselves contain the cause of death. For research related to the cause of death, a researcher applying to NHIS for data can simultaneously apply to the Korea Statistics Promotion Institute for death data (information on cause of death). NHIS then provides its data combined with the death data (cause-of-death information) supplied by the Korea Statistics Promotion Institute.14
PROVISION OF NHIS DATA AND HIRA DATA
Both the HIRA and NHIS data are provided in the form of sample data and all (customized) data. In the case of all (customized) data, the variables provided are limited to the research purpose, and only the results corresponding to the research design are provided, not the raw data. Ultimately, ownership of the data is not recognized. Customized data can be analyzed by visiting the analysis center operated by each institution, or in some cases, by remotely accessing the data and analyzing it in a private laboratory. The sample data provided by the NHIS comprise a total of five sample cohorts: the total national 2% sample cohort, the adult medical check-up cohort, the elderly cohort (over the age of 60), the working women cohort, and the infant medical cohort (Fig. 1).15 The sample cohort data of the NHIS cannot be owned; they can be used for research by visiting the analysis center or by remotely accessing and analyzing the data and exporting the results. However, the research schedule must take into account that several months pass between the application and the data extraction, and again before a seat at the analysis center is assigned.
Fig. 1. Selection of appropriate claims data according to research purpose.
NHIS, National Health Insurance Service; HIRA, Health Insurance Review and Assessment Service.
The sample data provided by HIRA from 2009 to 2018 (provision for 2019 has not been confirmed) are divided into 4 types for each year.13 The total patient dataset is 3% (approximately 1.4 million people), the in-patient dataset is 13% (approximately 1 million people), the pediatric patient dataset (under 20 years old) is 10% (approximately 1.1 million people), and the elderly patient dataset (65 years and older) is 20% of the total patient population in Korea. However, from 2017, the 3% extraction rate of the total patient dataset was maintained, and the remaining datasets were provided with a unified extraction rate of 10%. The HIRA patient dataset is advantageous because it is the only claims dataset that can be owned by an individual. When the researcher pays a set amount for each dataset according to the year, the dataset is provided on a USB drive and can be used for research. When a researcher requests data from HIRA for research purposes, a maximum of 250 GB is provided. Therefore, for diseases with many patients, the number of years of data provided may be limited. However, if the amount of data requested by the researcher does not exceed 250 GB, data for more than 5 years can be secured.
HIRA and NHIS data differ in the characteristics of their variables, in how the data are accessed (analysis at the center or remote analysis), and in whether the data can be owned; the cost and time required to use the data also vary. Therefore, it is essential for researchers to determine which data suit their research purposes. NHIS data must be used if the research requires death records (including the cause of death), some medical examination records, and socio-economic variables such as income. Additionally, among the NHIS data, if a disease or drug is rare or if it is necessary to analyze a large amount of data, it is best to use all (customized) data; otherwise, a sample cohort is better. If the research involves cost analysis that does not require patient or disease continuity data, or if it focuses on variations in a disease or prescription trend, it is best to purchase and study a patient dataset from HIRA (Fig. 1).
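The rules of thumb in this paragraph can be written as a small decision function. The sketch below is a simplification and not a reproduction of Fig. 1: the function name and its three yes/no inputs are hypothetical, and routing studies that need patient continuity to NHIS is an assumption based on the study types listed in Table 2.

```python
def choose_claims_data(needs_nhis_only_variables: bool,
                       rare_or_large_scale: bool,
                       needs_patient_continuity: bool) -> str:
    """Suggest a data source from three yes/no questions (simplified sketch).

    needs_nhis_only_variables: death records, examination records, or income.
    rare_or_large_scale: rare disease/drug, or a very large amount of data.
    needs_patient_continuity: longitudinal follow-up of the same patients.
    """
    if needs_nhis_only_variables or needs_patient_continuity:
        # Within NHIS, rare conditions or large analyses favor customized data.
        if rare_or_large_scale:
            return "NHIS all (customized) data"
        return "NHIS sample cohort"
    # Cost analyses or trend studies that do not need patient continuity.
    return "HIRA patient sample dataset"


print(choose_claims_data(needs_nhis_only_variables=True,
                         rare_or_large_scale=False,
                         needs_patient_continuity=True))
```

Combinations that the paragraph does not address, such as a rare disease studied cross-sectionally without NHIS-only variables, are not resolved by this sketch and would need a case-by-case decision.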
CONSIDERATIONS TO BE TAKEN INTO ACCOUNT WHEN CONDUCTING RESEARCH USING CLAIMS DATA
The definition of a patient or disease using the claims data in big data research differs from traditional research methods.4,5 Unlike the conceptual definition (CP) for various diseases, an operational definition (OP) is required when using the claims data.16 In traditional studies, blood glucose levels must be checked or laboratory tests must be performed to define diabetes mellitus.17 However, a study using claims data defines diabetes based on the International Classification of Diseases-10 (ICD-10) codes (E10-E14),18,19 the presence of a current prescription for oral hypoglycemic agents or insulin, or the prescription period (for example, prescriptions covering more than a reference number of days during the year). Similarly, patients with excessive bleeding may be defined as those who received more than a certain amount of transfused blood, and patients with gastrointestinal bleeding as those who underwent endoscopic hemostasis and were prescribed anti-ulcer medication. Depending on the characteristics of the disease, there are cases where a patient is defined by the disease code alone. However, in some cases the accuracy of the patient definition can be improved by using the available information on drugs, prescription period, examinations, procedures, and surgery. Importantly, the expert opinion of clinicians is absolutely necessary for the OP. In contrast to EMR research, data quality management (DQM) is not required. However, it remains to be determined whether this is an advantage or a disadvantage.16
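An operational definition of this kind can be expressed directly as code. The following Python sketch combines the ICD-10 range E10-E14 with a minimum number of prescription days for antidiabetic drugs. The row layout, the field names, and the 30-day threshold are hypothetical placeholders; the actual criteria should be set together with clinicians.

```python
from collections import defaultdict

DIABETES_ICD10_PREFIXES = ("E10", "E11", "E12", "E13", "E14")
MIN_PRESCRIPTION_DAYS = 30  # hypothetical threshold, not taken from the source


def has_diabetes_code(icd10_codes):
    """True if any diagnosis code on the claim falls in E10-E14."""
    return any(code.upper().startswith(DIABETES_ICD10_PREFIXES)
               for code in icd10_codes)


def define_diabetes_patients(claims, min_days=MIN_PRESCRIPTION_DAYS):
    """Patients with a diabetes code AND enough antidiabetic prescription days."""
    coded = set()
    drug_days = defaultdict(int)
    for claim in claims:
        pid = claim["patient_id"]
        if has_diabetes_code(claim["icd10"]):
            coded.add(pid)
        if claim["antidiabetic_drug"]:
            drug_days[pid] += claim["days_supplied"]
    return {pid for pid in coded if drug_days[pid] >= min_days}


# Toy claims for one year, with made-up values.
claims = [
    {"patient_id": "P1", "icd10": ["E119"], "antidiabetic_drug": True, "days_supplied": 60},
    {"patient_id": "P2", "icd10": ["E11"], "antidiabetic_drug": False, "days_supplied": 0},
    {"patient_id": "P3", "icd10": ["I10"], "antidiabetic_drug": True, "days_supplied": 90},
    {"patient_id": "P4", "icd10": ["E10"], "antidiabetic_drug": True, "days_supplied": 14},
    {"patient_id": "P4", "icd10": ["E10"], "antidiabetic_drug": True, "days_supplied": 20},
]

print(sorted(define_diabetes_patients(claims)))
```

Here P2 is excluded for lacking a prescription and P3 for lacking a diagnosis code, which illustrates how adding drug information narrows a definition based on the disease code alone.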
Lastly, the biggest drawback is that neither the NHIS data nor the HIRA data are linked with the EMR of each hospital, and therefore accurate diagnoses and actual prescriptions are not recorded. Because the data cover only insured health benefits items, there are no data on non-insured health benefits items, which leads to inaccuracies. Additionally, disease severity may be overstated in some claims, because diagnoses are entered by medical staff in connection with insured health benefits items. Conversely, even a serious disease may not be coded accurately if it is not related to an insured health benefits item.
DATA APPLICATION PROCESS
First, researchers must obtain Institutional Review Board (IRB) approval or exemption and then submit an application, based on the research plan, to HIRA or NHIS. The researcher is guided through the application procedure over the helpline; the most important step is requesting the variables that are suitable for the study. The research process is simplified if the variables are chosen carefully so that they are sufficient for the purpose of the study. All necessary variables must be included in the initial application, because adding a variable later means that the application and data extraction must be repeated from the beginning, which takes additional time (Fig. 1).
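A simple pre-submission check can catch variables that the study design needs but the application omits. The Python sketch below is only an illustration; the variable names are hypothetical and do not correspond to actual NHIS or HIRA variable names.

```python
def missing_variables(required, requested):
    """Return the variables the study needs that the application does not request."""
    return sorted(set(required) - set(requested))


# Hypothetical variable lists for illustration.
required = {"sex", "age", "icd10_code", "drug_code", "days_supplied", "income_decile"}
requested = {"sex", "age", "icd10_code", "drug_code"}

gaps = missing_variables(required, requested)
if gaps:
    print("Add before submitting:", ", ".join(gaps))
```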
UNMET REQUIREMENT OF CLAIMS DATA AND PREPARATION FOR ANALYSIS USING CLAIMS DATA
Researchers must verify certain details before conducting their research. First, to analyze the claims data and to determine a research topic that can be implemented, it is necessary to clearly understand the characteristics of the variables in each type of data. It is essential to express the operational definitions of the inclusion and exclusion criteria, the outcomes, and the patient groups that fit the topic in terms of codes. The disease and injury code,20 the fee schedule and treatment material code,21 and the drug code22 must be reviewed before proceeding with an operational definition that is suitable for the topic.
After the research topic is determined, it must be approved or exempted by the IRB, and analysts must then be secured to perform the data analysis. In some cases the researchers themselves handle statistical analysis tools such as SAS or R; in general, however, securing a data scientist is essential. After selecting the data suitable for the topic, the researcher can apply to the NHIS or HIRA for the data according to the application procedure. The analysis can be performed once the data are secured. Researchers must consider the control group when applying for the data; there have been cases in which an application was submitted without considering the control group, and only a limited analysis was possible or a re-application was needed. Initiating the analysis also incurs costs: an analysis fee must be paid for medical research purposes.
A relatively long waiting time is required before accessing the NHIS or HIRA data because of the research protocol review and IRB approval. Understanding the variable structure of the HIRA data and proficiency in statistical programs (SAS, R) are important in the data extraction and research process. Trial and error is involved in receiving and extracting the data, and it is somewhat difficult to set hypotheses and predict the research results. A separate mapping operation is also required, because most of the data are coded.
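The mapping operation mentioned here amounts to translating coded values into readable labels while keeping track of codes that have no known label. The Python sketch below shows one way to do this; the institution-type codes `T1` to `T3` are invented for illustration, and real code lists must come from the official code books.

```python
# Hypothetical code-to-label table; real codes come from the official code books.
INSTITUTION_TYPE_LABELS = {
    "T1": "tertiary general hospital",
    "T2": "hospital",
    "T3": "clinic",
}


def decode_column(rows, column, labels, unknown="UNMAPPED"):
    """Add a '<column>_label' field to each row and collect unmapped codes."""
    decoded, unmapped = [], set()
    for row in rows:
        code = str(row[column])
        if code not in labels:
            unmapped.add(code)
        decoded.append({**row, column + "_label": labels.get(code, unknown)})
    return decoded, unmapped


rows = [
    {"claim_id": "C1", "institution_type": "T1"},
    {"claim_id": "C2", "institution_type": "T3"},
    {"claim_id": "C3", "institution_type": "T9"},  # not in the label table
]

decoded, unmapped = decode_column(rows, "institution_type", INSTITUTION_TYPE_LABELS)
print([r["institution_type_label"] for r in decoded], unmapped)
```

Returning the unmapped codes separately, instead of silently dropping rows, makes gaps in the code table visible before any analysis is run.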
Smooth communication with data scientists is crucial to successfully complete the research,16 and for this reason every part of the research must be communicated as “code.” Lastly, it is helpful to search PubMed for the same or similar research topics prior to the study.23 Papers based on claims data can be found with keywords such as NHIS, HIRA, Korean claims, nationwide data, population-based data, and national health insurance data; searching with these keywords makes related research easier to find. It is also very helpful to refer to the various OPs presented in previously published papers.24,25,26
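A search combining a research topic with these claims-data keywords can be assembled programmatically. The Python sketch below builds a Boolean query string and a PubMed search link; the helper names are hypothetical, the topic is only an example, and the `term` URL parameter is assumed to behave as it does in the PubMed web interface.

```python
from urllib.parse import urlencode

# Keywords listed in the text for finding claims-data papers.
CLAIMS_KEYWORDS = ["NHIS", "HIRA", "Korean claims", "nationwide data",
                   "population-based data", "national health insurance data"]


def build_pubmed_query(topic, keywords=CLAIMS_KEYWORDS):
    """Combine a topic with the claims-data keywords using AND/OR."""
    # Multi-word keywords are quoted so that they are searched as phrases.
    quoted = [f'"{k}"' if " " in k else k for k in keywords]
    return f"({topic}) AND ({' OR '.join(quoted)})"


def pubmed_search_url(query):
    """Build a PubMed web search link for the query (assumed URL format)."""
    return "https://pubmed.ncbi.nlm.nih.gov/?" + urlencode({"term": query})


query = build_pubmed_query("diabetes mellitus")
print(query)
print(pubmed_search_url(query))
```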
CONCLUSION
NHIS and HIRA provide data to researchers for conducting research, requiring only a research protocol and minimal fees. Claims data have clear limitations in research owing to the nature of the data. However, these data can be highly beneficial for domestic researchers if they are used appropriately. They are also the only option for research focused on the entire country. Understanding the form and structure of claims data is the first step in research. Claims data can prove to be a useful research resource in the healthcare field, provided that researchers accurately understand the characteristics of the NHIS and HIRA data and use them to derive useful research results.
Footnotes
Funding: None.
Conflict of Interest: The authors have no conflicts of interest to declare.
Author Contributions:
- Conceptualization: Kim HS.
- Data curation: Kyoung DS, Kim HS.
- Formal analysis: Kyoung DS, Kim HS.
- Writing - original draft: Kyoung DS, Kim HS.
- Writing - review & editing: Kim HS.
References
- 1. Jung I, Kwon H, Park SE, Han KD, Park YG, Kim YH, et al. Increased risk of cardiovascular disease and mortality in patients with diabetes and coexisting depression: a nationwide population-based cohort study. Diabetes Metab J. 2021;45:379–389. doi: 10.4093/dmj.2020.0008.
- 2. Ahn HY, Chae JE, Moon H, Noh J, Park YJ, Kim SG. Trends in the diagnosis and treatment of patients with medullary thyroid carcinoma in Korea. Endocrinol Metab (Seoul). 2020;35:811–819. doi: 10.3803/EnM.2020.709.
- 3. Kim H, Lee H, Yim HW, Kim HS. Association of serum 25-hydroxyvitamin D and diabetes-related factors in Korean adults without diabetes: The Fifth Korea National Health and Nutrition Examination Survey 2010–2012. Prim Care Diabetes. 2018;12:59–65. doi: 10.1016/j.pcd.2017.07.002.
- 4. Kim HS, Kim JH. Proceed with caution when using real world data and real world evidence. J Korean Med Sci. 2019;34:e28. doi: 10.3346/jkms.2019.34.e28.
- 5. Kim HS, Lee S, Kim JH. Real-world evidence versus randomized controlled trial: clinical research based on electronic medical records. J Korean Med Sci. 2018;33:e213. doi: 10.3346/jkms.2018.33.e213.
- 6. Seong SC, Kim YY, Khang YH, Park JH, Kang HJ, Lee H, et al. Data resource profile: the national health information database of the National Health Insurance Service in South Korea. Int J Epidemiol. 2017;46:799–800. doi: 10.1093/ije/dyw253.
- 7. Kim JA, Yoon S, Kim LY, Kim DS. Towards actualizing the value potential of Korea Health Insurance Review and Assessment (HIRA) data as a resource for health research: strengths, limitations, applications, and strategies for optimal use of HIRA data. J Korean Med Sci. 2017;32:718–728. doi: 10.3346/jkms.2017.32.5.718.
- 8. Choe S, Shinn J, Kim HS, Kim JH. Changes in target achievement rates after statin prescription changes at a single university hospital. Cardiovasc Prev Pharmacother. 2020;2:103–111.
- 9. Kim H, Lee H, Kim TM, Yang SJ, Baik SY, Lee SH, et al. Change in ALT levels after administration of HMG-CoA reductase inhibitors to subjects with pretreatment levels three times the upper normal limit in clinical practice. Cardiovasc Ther. 2018;36:e12324. doi: 10.1111/1755-5922.12324.
- 10. Choi EK. Cardiovascular research using the Korean National Health Information Database. Korean Circ J. 2020;50:754–772. doi: 10.4070/kcj.2020.0171.
- 11. National Health Insurance Sharing Service. Data provision guide [Internet] Wonju: National Health Insurance Service; c2019. [cited 22 Jun 2021]. https://nhiss.nhis.or.kr/bd/ab/bdaba001cv.do
- 12. Health Insurance Review & Assessment Service. Healthcare bigdata hub [Internet] Wonju: Health Insurance Review & Assessment Service; c2015. [cited 22 Jun 2021]. https://opendata.hira.or.kr/home.do [Korean]
- 13. Health Insurance Review & Assessment Service. Patients sample data application guide [Internet] Wonju: Health Insurance Review & Assessment Service; c2015. [cited 22 Jun 2021]. https://opendata.hira.or.kr/op/opc/selectPatDataAplInfoView.do [Korean]
- 14. MicroData Integrated Service. Use microdata [Internet] Daejeon: Statistics Korea; 2019. [cited 20 Aug 2021]. https://mdis.kostat.go.kr/consign/consignDthRequestList.do?curMenuNo=UI_POR_P9017 [Korean]
- 15. National Health Insurance Sharing Service. Sample cohort DB guide [Internet] Wonju: National Health Insurance Service; c2019. [cited 22 Jun 2021]. https://nhiss.nhis.or.kr/bd/ab/bdaba002cv.do [Korean]
- 16. Kim HS, Kim DJ, Yoon KH. Medical big data is not yet available: why we need realism rather than exaggeration. Endocrinol Metab (Seoul). 2019;34:349–354. doi: 10.3803/EnM.2019.34.4.349.
- 17. Kim MK, Ko SH, Kim BY, Kang ES, Noh J, Kim SK, et al. 2019 Clinical practice guidelines for type 2 diabetes mellitus in Korea. Diabetes Metab J. 2019;43:398–406. doi: 10.4093/dmj.2019.0137.
- 18. International Expert Committee. International Expert Committee report on the role of the A1C assay in the diagnosis of diabetes. Diabetes Care. 2009;32:1327–1334. doi: 10.2337/dc09-9033.
- 19. Kim TM, Kim H, Jeong YJ, Baik SJ, Yang SJ, Lee SH, et al. The differences in the incidence of diabetes mellitus and prediabetes according to the type of HMG-CoA reductase inhibitors prescribed in Korean patients. Pharmacoepidemiol Drug Saf. 2017;26:1156–1163. doi: 10.1002/pds.4237.
- 20. Korea Informative Classification of Diseases. Korea Classification Disease (KCD8) [Internet] Daejeon: Korea Informative Classification of Diseases; c2020. [cited 23 Jun 2021]. https://koicd.kr/main.do [Korean]
- 21. Health Insurance Review & Assessment Service. Criteria for insurance recognition [Internet] Wonju: Health Insurance Review & Assessment Service; c2017. [cited 22 Jun 2021]. http://www.hira.or.kr/rd/insuadtcrtr/InsuAdtCrtrList.do?pgmid=HIRAA030069000400 [Korean]
- 22. Health Insurance Review & Assessment Service. Drug reimbursement list [Internet] Wonju: Health Insurance Review & Assessment Service; c2017. [cited 22 Jun 2021]. http://www.hira.or.kr/bbsDummy.do?pgmid=HIRAA030014050000 [Korean]
- 23. National Library of Medicine. PubMed [Internet] Bethesda (MD): National Library of Medicine; [cited 22 Jun 2021]. https://pubmed.ncbi.nlm.nih.gov/
- 24. Tang KL, Quan H, Rabi DM. Measuring medication adherence in patients with incident hypertension: a retrospective cohort study. BMC Health Serv Res. 2017;17:135. doi: 10.1186/s12913-017-2073-y.
- 25. Lee YH, Han K, Ko SH, Ko KS, Lee KU, Taskforce Team of Diabetes Fact Sheet of the Korean Diabetes Association. Data analytic process of a nationwide population-based study using national health information database established by National Health Insurance Service. Diabetes Metab J. 2016;40:79–82. doi: 10.4093/dmj.2016.40.1.79.
- 26. Kim HK, Song SO, Noh J, Jeong IK, Lee BW. Data configuration and publication trends for the Korean National Health Insurance and Health Insurance Review & Assessment Database. Diabetes Metab J. 2020;44:671–678. doi: 10.4093/dmj.2020.0207.