Abstract
There has been a surge in academic publications based on secondary analyses of data from Taiwan's National Health Insurance claim records, and it has become a challenge to comprehend such a rapidly expanding literature. Therefore, this study aimed to explore the conceptual content of National Health Insurance Research Database (NHIRD)-based cancer research, using the abstracts of articles retrieved from PubMed. The search terms “National Health Insurance Research Database AND Taiwan,” “Taiwan AND population-based,” and “Taiwan AND nationwide” were used to search PubMed, with the publication date limited to between 1997 and 2015. The retrieved articles were manually screened to retain only those that were cancer-related and based on secondary data analysis of the NHIRD. A total of 589 articles, published between 2002 and 2015, were selected for subsequent text mining using the R software. Among the 589 articles, the top 5 most studied cancer types were breast (16.3%), lung (11.4%), colorectal (10.4%), liver (8.3%), and prostate (7.5%). The article that received the highest number of citations by PubMed Central articles was cited 92 times. The top 3 most frequently occurring keywords in the abstracts of the 589 articles were cancer, patient, and risk, occurring 3670, 2535, and 1652 times, respectively. Analysis of key conceptions indicated that the most common conceptions were diabetes, survival, breast cancer, lung cancer, and colorectal cancer. In conclusion, in this study of 589 published articles on secondary data analysis of the NHIRD, indexed by PubMed between 2002 and 2015, we found that while the risk factors of cancer, treatment of cancer, and survival of cancer patients were popular research topics, end-of-life cancer care issues were less studied. Further studies should explore these areas since they are as important as treatment of the disease itself for many patients.
Keywords: bibliometric analysis, cancer, National Health Insurance Research Database, PubMed
1. INTRODUCTION
Despite advances in diagnosis and treatment, cancer is still a leading cause of death worldwide. There were 14.1 million new cases of cancer and 8.2 million cancer deaths worldwide in 2012.[1] According to the World Health Organization estimates for 2011, cancer causes more deaths than coronary heart disease or stroke.[2] In addition to primary studies designed specifically for cancer research, secondary analysis of existing data collected for nonresearch purposes has been used as a cost-effective approach to complement findings from primary studies and to help explore new research hypotheses.[3]
In Taiwan, a government-run, single-payer National Health Insurance (NHI) scheme was established in 1995. It covers over 99% of Taiwan's 23 million residents. More than 20,000 medical care facilities, including hospitals, clinics, pharmacies, and medical laboratories, which represent over 93% of all healthcare facilities in Taiwan, are contracted by the NHI scheme.[4] Under the universal health coverage scheme, virtually all healthcare services are covered, including consulting and treatment expenses for inpatient and ambulatory care, dental services, traditional Chinese medicine therapies, physical rehabilitation, and home care. Enrollees of the scheme are issued an integrated circuit-embedded smart card that is used to obtain medical services. To receive reimbursement, contracted healthcare organizations must submit relevant claim records to the NHI Administration (the former Bureau of National Health Insurance). Claims are then reviewed by a panel of medical experts for the type, volume, quality, and appropriateness of medical services provided under the NHI program. Diagnoses or uses of medical services that do not conform to the NHI fee schedule, drug list, clinical guidelines, and patient conditions can result in a severe penalty. Under the NHI system, certain diseases or injuries are classified as catastrophic illnesses. Patients with these illnesses can apply for a certificate, which allows them to waive outpatient and inpatient copayments. For example, the issuance of the certificate for cancer requires a diagnosis by physicians with pathological reports and a formal review by the NHI Administration.
The claim records of the NHI are maintained as a database, the National Health Insurance Research Database (NHIRD), which has been available on application to eligible scientists in Taiwan for research purposes since 1998. The database consists of the original claim records and a number of linkable registration files. The registration files contain information on contracted medical facilities, medical personnel, and drug prescriptions. Researchers can apply for specific subject datasets, such as the cancer dataset or the catastrophic illness dataset, as well as a longitudinal dataset (Longitudinal Health Insurance Database) containing a random sample of 1 million NHI enrollees.[5] The availability of such a population-based database has stimulated and facilitated academic research in various scientific disciplines, especially in the area of health research.[6] The exceedingly rapid expansion of the literature based on the NHIRD has made it a challenge to comprehend what has already been done, particularly in broad subject areas such as cancer research. Therefore, the aim of this study was to explore the conceptual content of NHIRD-based cancer research based on the abstracts of articles retrieved from PubMed. Findings from this study may be used to identify gaps in cancer research based on the NHIRD. In addition, the broader scientific community may be able to gain ideas and insights from the existing NHIRD studies for the development of their own primary studies.
2. MATERIALS AND METHODS
2.1. Data source
Data were retrieved and downloaded from PubMed (http://www.ncbi.nlm.nih.gov/pubmed/), a website that provides free access to biomedical journal citations and abstracts, mainly those indexed by Medline.[7,8] The service is administered by the National Center for Biotechnology Information of the United States National Library of Medicine. The search terms “National Health Insurance Research Database AND Taiwan,” “Taiwan AND population-based,” and “Taiwan AND nationwide” were used in the search strategies, and the publication date was limited to the years 1997 to 2015. The search was conducted on July 30, 2016, and a total of 4586 articles were retrieved. Two authors (Y-HK and MK) manually screened the retrieved articles in 2 steps. First, articles that were not based on data from Taiwan's NHIRD were eliminated (1008 articles excluded, 3578 remaining). Second, articles that were not on the topic of cancer were eliminated (2989 articles excluded). The resulting 589 articles were included in subsequent analyses.
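The search and screening steps described above can be sketched in a few lines of Python. This is only an illustration, not the authors' procedure: the text does not say how the three search terms were combined, so joining them with OR is an assumption, and the function names `build_query` and `screening_flow` are hypothetical. The date filter uses PubMed's `[Date - Publication]` field tag, and the counts are the ones reported in the text.

```python
# The three search terms reported in the text.
SEARCH_TERMS = [
    "National Health Insurance Research Database AND Taiwan",
    "Taiwan AND population-based",
    "Taiwan AND nationwide",
]


def build_query(terms, start_year, end_year):
    """Join the terms with OR (an assumption) and add a publication-date range."""
    combined = " OR ".join(f"({term})" for term in terms)
    date_range = (
        f'("{start_year}/01/01"[Date - Publication] : '
        f'"{end_year}/12/31"[Date - Publication])'
    )
    return f"({combined}) AND {date_range}"


def screening_flow(retrieved, not_nhird, not_cancer):
    """Return the number of articles remaining after each screening step."""
    nhird_based = retrieved - not_nhird   # step 1: keep NHIRD-based articles
    included = nhird_based - not_cancer   # step 2: keep cancer-related articles
    return nhird_based, included


query = build_query(SEARCH_TERMS, 1997, 2015)
nhird_based, included = screening_flow(4586, 1008, 2989)
cancer_share = round(100 * included / nhird_based, 1)

print(query)
print(nhird_based, included, cancer_share)
```

Running the sketch reproduces the screening arithmetic: 3578 NHIRD-based articles, of which 589 were on cancer.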
2.2. Text mining and data analysis
Text mining was performed using the R 3.0.2 software (R Foundation for Statistical Computing, Vienna, Austria). Four packages for R (https://cran.r-project.org/web/packages/) were used: RISmed for extracting bibliographic content from PubMed, SnowballC for collapsing words to a common root to aid comparison of vocabulary, tm for text mining, and rentrez for processing the results of PubMed searches. Descriptive analyses were conducted to calculate the frequencies of published articles by journal and by cancer site. Citation frequency by PubMed Central articles was obtained on July 30, 2016. We also calculated the citation frequency within 2 years of publication for the 2 top-ranking articles. The date of publication was defined as either the publication date or the Epub date, whichever was earlier. In addition, we reported the frequency of author self-citation, defined as a citation in which the citing and the cited papers have at least 1 author in common. Furthermore, word cloud diagrams were generated to visualize the word frequencies among the abstracts of the 589 articles. The co-occurrence of keywords (keyword association) in the text of the articles’ abstracts was evaluated using correlation analysis. Two keywords occurring together in the same abstract were considered associated, and the strength of their association was quantified with Pearson correlation coefficients. The top 15 most frequently occurring keywords (primary keywords) in the abstracts were analyzed for their correlation with other keywords (secondary keywords) in the abstracts. Next, individual keywords with similar concepts were combined into key conceptions to simplify subsequent visualization using network plots.
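To make the keyword association step concrete, the following Python sketch computes a Pearson correlation between the per-abstract counts of two keywords. It is an illustration under stated assumptions, not the authors' R code: the toy abstracts are invented, no stemming or stop-word removal is applied, and the helper names (`tokenize`, `term_counts`, `pearson`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts(abstracts, term):
    """Count how often `term` occurs in each abstract (one value per abstract)."""
    return [Counter(tokenize(abstract))[term] for abstract in abstracts]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when a keyword's count never varies
    return sxy / math.sqrt(sxx * syy)


# Toy abstracts, invented purely for illustration.
abstracts = [
    "Head and neck cancer patients had a higher risk of stroke.",
    "Survival of head and neck cancer after radiotherapy.",
    "Diabetes and the risk of breast cancer in women.",
    "Breast cancer survival in women with diabetes.",
]

r_head_neck = pearson(term_counts(abstracts, "head"), term_counts(abstracts, "neck"))
r_head_breast = pearson(term_counts(abstracts, "head"), term_counts(abstracts, "breast"))
print(r_head_neck, r_head_breast)
```

In this toy corpus "head" and "neck" always appear together, so their correlation is at its maximum, whereas "head" and "breast" never share an abstract and are negatively correlated.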
The study protocol was reviewed and approved by the institutional review board of Dalin Tzu Chi Hospital, Buddhist Tzu Chi Medical Foundation, Taiwan (No. B10501012).
3. RESULTS
Of the 3578 publications based on data from the NHIRD published between 1997 and 2015, 589 (16.5%) were on the topic of cancer. The earliest article on cancer based on the NHIRD appeared in 2002, and the number of articles grew from 2 in 2002 to 160 in 2015 (Fig. 1). Table 1 lists the journals that published more than 4 NHIRD-based cancer studies between 2002 and 2015. The top 3 journals with the most NHIRD cancer articles were PLoS ONE, Medicine, and BMC Cancer, with 64, 37, and 19 articles, respectively. The 5-year impact factors of the journals ranged from 0.91 to 9.02, with a median of 3.70. The top 5 most studied cancer types were breast (16.3%), lung (11.4%), colorectal (10.4%), liver (8.3%), and prostate (7.5%) (Table 2).
Table 1.
Table 2.
3.1. Citation frequency by PubMed central articles
Of the 589 studies, the one that received the highest number of citations by PubMed Central articles investigated the association between nucleoside analog use and the risk of tumor recurrence in patients with hepatitis B virus-related hepatocellular carcinoma after curative surgery. It had been cited 92 times as of July 30, 2016, including 37 times within 2 years of publication (i.e., up to November 14, 2014), of which 1 was an author self-citation. The study, published in the Journal of the American Medical Association in 2012, used a cohort design and Cox regression analysis to calculate hazard ratios (HRs) for 518 patients treated with nucleoside analogs compared with 4051 patients without the treatment. The results showed that nucleoside analog use was independently associated with a reduced risk of hepatocellular carcinoma recurrence (HR = 0.67, P < 0.001).[9]
The study that received the second highest number of citations by PubMed Central articles was published in 2009 by the same group of investigators as the above study. It had been cited 43 times as of July 30, 2016, including 8 times within 2 years of publication (i.e., up to August 5, 2011), of which 2 were author self-citations. The cohort study, published in Gastroenterology, found that early Helicobacter pylori eradication was associated with a lower gastric cancer risk (HR = 0.77) in 80,255 patients with peptic ulcer disease.[10]
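The 2-year citation counts and self-citation counts reported for these two articles follow from the definitions given in the Methods. A minimal Python sketch of those definitions is shown below; the function names, the author labels, and the toy citation records are hypothetical, and the start date of November 14, 2012 is inferred here from the window end date stated in the text rather than taken from the source.

```python
from datetime import date


def publication_date(print_date, epub_date=None):
    """Use the earlier of the print publication date and the Epub date."""
    if epub_date is None:
        return print_date
    return min(print_date, epub_date)


def window_end(start, years=2):
    """Same calendar day `years` later (February 29 falls back to February 28)."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def is_self_citation(citing_authors, cited_authors):
    """A self-citation shares at least 1 author between citing and cited paper."""
    return bool(set(citing_authors) & set(cited_authors))


def count_citations(cited_authors, start, citations, years=2):
    """Count citations inside the window, and how many of them are self-citations.

    `citations` is a list of (citation_date, citing_authors) pairs.
    """
    end = window_end(start, years)
    in_window = [c for c in citations if start <= c[0] <= end]
    self_cites = sum(is_self_citation(authors, cited_authors) for _, authors in in_window)
    return len(in_window), self_cites


# Toy records with placeholder author labels, invented for illustration.
cited_authors = ["Author A", "Author B"]
start = publication_date(date(2012, 11, 14))
citations = [
    (date(2013, 3, 1), ["Author C"]),
    (date(2014, 6, 20), ["Author A", "Author D"]),  # shares Author A
    (date(2015, 1, 5), ["Author E"]),               # outside the 2-year window
]
total, self_cites = count_citations(cited_authors, start, citations)
print(window_end(start), total, self_cites)
```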
A few studies focusing on medication use had received a relatively high number of citations. A 2010 study, published in the Journal of the National Cancer Institute, reported that the consumption of aristolochic acid-containing Chinese herbal products (e.g., Mu Tong) was associated with an increased risk of cancer of the urinary tract in a dose-dependent manner that was independent of arsenic exposure. This study had been cited 36 times.[11] Another study published in 2013, which had been cited 24 times, found that statin use was associated with a reduced risk of hepatocellular carcinoma among patients with chronic hepatitis C virus infection.[12] In addition, a 2011 study based on a case–control design reported that statins might reduce the risk of liver cancer (adjusted odds ratio [OR] = 0.62, 95% confidence interval [95% CI] = 0.42–0.91). This study had been cited 22 times.[13]
Furthermore, we noted that NHIRD cancer studies related to diabetes received a relatively high number of citations. A secondary cohort study of 472,979 adult patients with type 2 diabetes suggested that diabetes was associated with an increased cancer risk. This study, published in 2014, had been cited 8 times by PubMed Central articles.[14] On the other hand, another secondary cohort study published in 2012 found that patients with diabetes were not at increased risk for the development of lung cancer, but the use of antidiabetes drugs could decrease the risk by up to 45%. This study had been cited 30 times.[15] A secondary case–control study published in 2012 did not find a significant association between pioglitazone and bladder cancer in 54,928 patients with type 2 diabetes (adjusted HR = 1.31, 95% CI = 0.66–2.58). This study had been cited 27 times.[16] Another secondary case–control study on 606,583 type 2 diabetic patients published in 2012 found that the use of pioglitazone (OR = 0.73, 95% CI = 0.65–0.81) and rosiglitazone (OR = 0.83, 95% CI = 0.72–0.95) was associated with a decreased liver cancer incidence in diabetic patients. This study had been cited 25 times.[17] Moreover, a secondary cohort study designed to examine cancer incidence associated with the use of insulin glargine versus intermediate/long-acting human insulin showed that insulin glargine use did not increase the risk of overall cancer incidence, but it was positively associated with both pancreatic cancer (adjusted HR = 2.15, 95% CI = 1.01–4.59) and prostate cancer (adjusted HR = 2.42, 95% CI = 1.50–8.40) in men. This study, published in 2011, had been cited 21 times.[18]
3.2. Word cloud
The word frequencies in the abstracts of the 589 articles were visualized as word clouds, with a larger word size representing a higher frequency of appearance among the articles (Fig. 2). The top 3 words with the highest frequency, both within each period and over the entire period of 2002 to 2015, were cancer, patients, and risk. These 3 words were suppressed in the display of the word cloud to allow a better visualization of the remaining words. Words ranked fourth to sixth in frequency were care, survival, and lung in 2002 to 2010; age, breast, and women in 2011 to 2012; age, diabetes, and breast in 2013; age, breast, and women in 2014; and breast, age, and care in 2015. Over the entire period between 2002 and 2015, age, breast, and women were the words ranked fourth to sixth in frequency.
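The ranking behind such a word cloud amounts to counting words across abstracts, suppressing the dominant ones, and ranking the rest. The Python sketch below illustrates this with invented toy abstracts and a tiny, illustrative stop-word list; it is not the authors' R code, and the names `word_frequencies` and `words_for_display` are hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"a", "among", "and", "by", "in", "of", "with"}


def word_frequencies(abstracts, stopwords=STOPWORDS):
    """Count every non-stop-word across all abstracts."""
    counts = Counter()
    for text in abstracts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


def words_for_display(counts, suppressed, top_n=3):
    """Drop the suppressed words, then return the `top_n` most frequent words."""
    kept = Counter({w: c for w, c in counts.items() if w not in suppressed})
    return kept.most_common(top_n)


# Toy abstracts, invented purely for illustration.
abstracts = [
    "Risk of breast cancer in women by age among cancer patients.",
    "Age and risk of breast cancer in women: a study of patients.",
    "Breast cancer risk and age in patients with diabetes.",
    "Cancer risk and survival of patients by age.",
    "Risk in cancer patients.",
    "Cancer patients.",
]

counts = word_frequencies(abstracts)
suppressed = {word for word, _ in counts.most_common(3)}  # the 3 dominant words
display = words_for_display(counts, suppressed)
print(sorted(suppressed), display)
```

Suppressing the dominant words before plotting keeps them from crowding out the remaining vocabulary, which is the reason given in the text for hiding cancer, patients, and risk.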
3.3. Analysis of keyword association
Table 3 shows the results of the analysis of keyword association for the top 15 most frequently occurring keywords in the articles’ abstracts. The top 3 most frequently occurring keywords in the abstracts of the 589 articles were cancer, patient, and risk, occurring 3670, 2535, and 1652 times, respectively. The remaining 12 keywords had frequencies of occurrence ranging from 692 times for “age” to 131 times for “hospice.” Each of the primary keywords was evaluated for its correlations with other keywords in the abstracts (secondary keywords). Overall, the correlation coefficients ranged from the highest at 0.68 for “neck” and “head” to the lowest at 0.12 for “cancer” and “cervix.”
Table 3.
3.4. Analysis of key conception
A list of key conceptions was created from the keywords of the 589 articles (Table 4). To facilitate comprehension of the subsequent plots of key conceptions, only those conceptions (nodes in Fig. 3A–E) with at least 4 connecting lines to other conceptions are shown in Table 4 and Fig. 3A–E. The most common conception was diabetes, which appeared in 20% of the articles. Next, the conception of survival appeared in 19.2% of the articles. Breast cancer, lung cancer, and colorectal cancer were the next 3 most common conceptions, each appearing in over 10% of the articles. Figure 3A–E shows the network plots for the key conceptions listed in Table 4. The data are displayed separately for 5 periods, (A) 2002 to 2010, (B) 2011 to 2012, (C) 2013, (D) 2014, and (E) 2015, to provide a clearer view of the associations among the different conceptions. For the years 2002 to 2010 (64 articles), the associations among key conceptions concerned the consequences of cancer (particularly breast cancer), such as patient survival, healthcare costs, chemotherapy, hospice, and palliative care. For the years 2011 to 2012 (133 articles), the associations concerned the risk factors of cancer (particularly breast cancer, prostate cancer, hepatocellular carcinoma, and lung cancer), such as diabetes, hypertension, dyslipidemia, stroke, infarction, medications for diabetes and hypertension, and statin use. For the year 2013 (105 articles), the associations were also related to the risk factors of cancer, such as diabetes, hypertension, dyslipidemia, medications for diabetes and hypertension, and statin use. For the year 2014 (127 articles), the key conceptions were related to the risk factors of cancer (particularly lung cancer and breast cancer), such as diabetes, hypertension, and stroke; cancer treatments, such as surgery, chemotherapy, and radiotherapy; and patient survival. For the year 2015 (160 articles), the associations concerned the risk factors for a wider range of cancers (breast cancer, hepatocellular carcinoma, lung cancer, colorectal cancer, lymphoma, and gastric cancer), such as diabetes, hypertension, dyslipidemia, and stroke. Cancer treatments (such as surgery, chemotherapy, and radiotherapy), patient survival, and cancer-related healthcare costs also received a large number of associations.
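The rule of plotting only conceptions with at least 4 connecting lines can be illustrated with a short Python sketch. Several details are assumptions, because the text does not specify them: two conceptions are treated as connected when they occur in the same article, the keyword-to-conception mapping `CONCEPT_OF` is hypothetical, and the toy articles are invented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical mapping from individual keywords to key conceptions.
CONCEPT_OF = {
    "diabetes": "diabetes", "diabetic": "diabetes",
    "survival": "survival", "mortality": "survival",
    "breast": "breast cancer", "lung": "lung cancer",
    "statin": "statin use", "hypertension": "hypertension",
    "stroke": "stroke", "hospice": "hospice care",
    "palliative": "palliative care",
}


def concepts_in(keywords):
    """Map an article's keywords to the set of conceptions they represent."""
    return {CONCEPT_OF[k] for k in keywords if k in CONCEPT_OF}


def build_edges(articles):
    """Connect two conceptions when they occur in the same article (assumption)."""
    edges = set()
    for keywords in articles:
        for a, b in combinations(sorted(concepts_in(keywords)), 2):
            edges.add((a, b))
    return edges


def degrees(edges):
    """Number of connecting lines attached to each conception."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)


def nodes_with_min_degree(edges, min_degree=4):
    """Conceptions that have at least `min_degree` connecting lines."""
    return {node for node, d in degrees(edges).items() if d >= min_degree}


# Toy articles (lists of keywords), invented for illustration.
articles = [
    ["diabetes", "breast", "survival"],
    ["diabetic", "lung", "statin"],
    ["diabetes", "hypertension", "stroke"],
    ["hospice", "palliative"],
]
edges = build_edges(articles)
nodes_to_plot = nodes_with_min_degree(edges, min_degree=4)
print(degrees(edges)["diabetes"], nodes_to_plot)
```

In the toy data only diabetes reaches the threshold, which shows how a degree filter keeps the hubs of the network and hides sparsely connected conceptions such as hospice care.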
Table 4.
4. DISCUSSION
The present study is the first to conduct a computational text analysis and visualization of PubMed-indexed cancer research articles based on secondary data analyses of Taiwan's NHIRD. We used various approaches, including word clouds, token association, and conception networks, to visualize the content of 589 NHIRD-based articles published between 2002 and 2015.
We found that breast cancer, lung cancer, colorectal cancer, liver cancer, and prostate cancer were the top 5 most studied cancers based on the NHIRD data. This is not surprising because these are the most common cancer types in Taiwan. The 5 invasive cancers with the highest incidence in 2013 were cancers of the female breast, colorectum, liver, lung, and prostate.[19] The large number of cases of these cancers provides sufficient sample sizes for various analyses of their risk factors, survival, and treatment. Conversely, less common cancer types do not have enough cases for investigating their associations with other disorders, except when the latter have a high prevalence, such as diabetes, hypertension, and hyperlipidemia.
In the word cloud visualization, we found that words related to survival of patients, hospice care, and end-of-life care appeared with gradually increasing frequency from 2012 to 2015. Words related to cancer treatments, such as surgery, chemotherapy, and radiotherapy, appeared less frequently in most of the study periods, but their frequency increased mildly from 2014 to 2015.
In the visualization of the associations of key conceptions, the most frequent associations were generally related to cancer risk factors, followed by survival, cancer therapy, and end-of-life care. The associations of the conceptions of patient survival, cancer-related surgery, radiotherapy, and chemotherapy appeared to increase gradually from 2013 to 2015. This observation may be explained by the need for a longer follow-up period in studies on survival: only recently has the longitudinal dataset of the NHIRD accumulated a sufficient number of cancer cases for such evaluation. Rare cancers and risk factors with long induction periods will become feasible for investigation as the length of follow-up of the NHIRD cohort increases over time.
Our analyses of citation frequency indicated that the most cited study (92 times by PubMed Central articles as of July 30, 2016) was the one on the association between nucleoside analog (e.g., lamivudine, adefovir, entecavir) use and a lower risk of hepatocellular carcinoma recurrence among patients with hepatitis B virus-related hepatocellular carcinoma after liver resection.[9] A possible explanation for its high citation count is that hepatocellular carcinoma is one of the leading causes of death among cancer patients, so its treatment is widely studied, which leads to citation by other research articles. It should be noted that the citation frequency obtained in this study reflects only citations by PubMed Central articles, which are all full-text articles freely accessible to the public (https://www.ncbi.nlm.nih.gov/pmc/), rather than citations by all articles indexed by PubMed or by other databases such as Web of Science, Scopus, and Google Scholar.[20]
Although the NHIRD is a nationwide, population-based dataset, it has a number of inherent limitations that hinder its use. While the longitudinal dataset of the NHIRD represents a cohort of approximately 1 million enrollees, there still may not be enough cases for the study of certain rare cancers and their associations with other diseases of low prevalence or incidence. In addition, no information on cancer stage is available from the dataset. Important potential confounding variables, such as body mass index, smoking, and alcohol intake, are also not available from the dataset for statistical adjustment. Moreover, the study of cancer medications is impeded by the lack of information on dosage and medication adherence. Furthermore, patients’ use of self-paid medications or procedures is not recorded in the NHIRD.
A few limitations of the present study should be mentioned. First, only articles written in English and indexed by PubMed were included in the study. Nevertheless, most medical research articles based on the NHIRD should have been identified, since researchers’ performance in Taiwan is generally evaluated based on articles published in journals indexed in the Science Citation Index,[21] most of which are likely to be covered by Medline. Second, articles with no abstract could not be analyzed. Third, variations in the length of abstracts and inaccuracies in their content might have influenced our results.[22]
In conclusion, in this study of 589 published articles on secondary data analysis of the NHIRD, indexed by PubMed between 2002 and 2015, we found that the top 5 most studied cancers were breast, lung, colorectal, liver, and prostate. Articles generally focused on the association between cancer and its possible risk factors, such as diabetes, hypertension, dyslipidemia, and statin use. The conceptions of patient survival, cancer-related surgery, radiotherapy, and chemotherapy increased gradually from 2013 to 2015. Overall, there were more articles focusing on the risk factors of cancer, treatment of cancer, and survival of cancer patients, but relatively few articles on end-of-life cancer care, including hospice care, palliative care, and home palliative care. These neglected areas should be explored further using the NHIRD, as they are as important as treatment of the disease itself for many patients.
Footnotes
Abbreviations: CI = confidence interval, HR = hazard ratio, NHI = National Health Insurance, NHIRD = National Health Insurance Research Database, OR = odds ratio.
J-KC, C-WL, and C-LW have contributed equally to this work.
The authors have no funding and conflicts of interest to disclose.
References
- [1].Ferlay J, Soerjomataram I, Dikshit R, et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int J Cancer 2015;136:E359–86.
- [2].Global Health Observatory data repository. Number of deaths (World) by cause. Estimates for 2000–2012. Available from: http://apps.who.int/gho/data/node.main.CODWORLD?lang=en. Accessed February 8, 2017.
- [3].Cheng HG, Phillips MR. Secondary analysis of existing data: opportunities and implementation. Shanghai Arch Psychiatry 2014;26:371–5.
- [4].National Health Insurance Administration, Ministry of Health and Welfare. 2015–2016 National Health Insurance Annual Report; 2015. Available from: http://www.nhi.gov.tw/Resource/webdata/13767_1_2015-2016%20NHI%20ANNUAL%20REPORT.pdf. Accessed February 8, 2017.
- [5].National Health Research Institutes. National Health Insurance Research Database, data subsets. Available from: http://nhird.nhri.org.tw/en/Data_Subsets.html. Accessed November 9, 2016.
- [6].Chen YC, Yeh HY, Wu JC, et al. Taiwan's National Health Insurance Research Database: administrative health care database as study object in bibliometrics. Scientometrics 2011;86:365–80.
- [7].U.S. National Library of Medicine. MEDLINE, PubMed, and PMC (PubMed Central): How are they different? 2016. Available from: https://www.nlm.nih.gov/pubs/factsheets/dif_med_pub.html. Accessed February 8, 2017.
- [8].Castillo M. Is your journal indexed in MEDLINE? Am J Neuroradiol 2011;32:1–2.
- [9].Wu CY, Chen YJ, Ho HJ, et al. Association between nucleoside analogues and risk of hepatitis B virus-related hepatocellular carcinoma recurrence following liver resection. JAMA 2012;308:1906–14.
- [10].Wu CY, Kuo KN, Wu MS, et al. Early Helicobacter pylori eradication decreases risk of gastric cancer in patients with peptic ulcer disease. Gastroenterology 2009;137:1641–8.
- [11].Lai MN, Wang SM, Chen PC, et al. Population-based case-control study of Chinese herbal products containing aristolochic acid and urinary tract cancer risk. J Natl Cancer Inst 2010;102:179–86.
- [12].Tsan YT, Lee CH, Ho WC, et al. Statins and the risk of hepatocellular carcinoma in patients with hepatitis C virus infection. J Clin Oncol 2013;31:1514–21.
- [13].Chiu HF, Ho SC, Chen CC, et al. Statin use and the risk of liver cancer: a population-based case-control study. Am J Gastroenterol 2011;106:894–8.
- [14].Lin CC, Chiang JH, Li CI, et al. Cancer risks among patients with type 2 diabetes: a 10-year follow-up study of a nationwide population-based cohort in Taiwan. BMC Cancer 2014;14:381.
- [15].Lai SW, Liao KF, Chen PC, et al. Antidiabetes drugs correlate with decreased risk of lung cancer: a population-based observation in Taiwan. Clin Lung Cancer 2012;13:143–8.
- [16].Tseng CH. Pioglitazone and bladder cancer: a population-based study of Taiwanese. Diabetes Care 2012;35:278–80.
- [17].Chang CH, Lin JW, Wu LC, et al. Association of thiazolidinediones with liver cancer and colorectal cancer in type 2 diabetes mellitus. Hepatology 2012;55:1462–72.
- [18].Chang CH, Toh S, Lin JW, et al. Cancer risk associated with insulin glargine among adult type 2 diabetes patients – a nationwide cohort study. PLoS ONE 2011;6:e21368.
- [19].Taiwan Cancer Registry. Cancer Incidence and Mortality Rates in Taiwan. Available from: http://tcr.cph.ntu.edu.tw/main.php?Page=N2. Accessed February 14, 2017.
- [20].Kulkarni AV, Aziz B, Shams I, et al. Comparisons of citations in Web of Science, Scopus, and Google Scholar for articles published in general medical journals. JAMA 2009;302:1092–6.
- [21].Wu YT, Lee HY. National Health Insurance database in Taiwan: a resource or obstacle for health research? Eur J Intern Med 2016;31:e9–10.
- [22].Fontelo P, Gavino A, Sarmiento RF. Comparing data accuracy between structured abstracts and full-text journal articles: implications in their use for informing clinical decisions. Evid Based Med 2013;18:207–11.