Abstract
Purpose
The present study aimed to predict a physician's specialty from the two-medication combinations most frequently prescribed together. The results could be used to impute missing specialty data in similar databases.
Patients and Methods
The analysis used KAy-means for MIxed LArge datasets (KAMILA) clustering and a random forest (RF) model. The data were retrieved from outpatient prescriptions in the second most populous province of Iran (Khorasan Razavi) from April 2015 to March 2017.
Results
The main findings describe the importance of each combination in predicting the specialty. The combination of amoxicillin and metronidazole had the highest importance for accurate prediction. The findings are provided in a user-friendly R Shiny web application, which can be applied to any similar medical prescription database.
Conclusion
Nowadays, a huge amount of data is produced in the field of medical prescriptions, and the physician's specialty is missing from a significant share of it. Thus, imputing the missing values can lead to valuable results for planning higher-quality medication, improving healthcare quality, and decreasing expenses.
1. Introduction
Population aging, urbanization, and the resulting prevalence of noncommunicable diseases (NCDs) have led to an escalation of medication consumption per capita globally [1]. According to the United Nations Office on Drugs and Crime (UNODC) report in 2018, about 80% of the population between the ages of 15 and 64 in Asia are taking medications [2]. Considering the prevalence of NCDs and their associated prescribed healthcare services, insurance claims data, even over a short period, provide an invaluable source of big data.
Heterogeneity, one of the characteristics of big data, can lead to biased or even erroneous results [3, 4]. Clustering big data into homogeneous clusters is an essential step for subsequent analyses [5]; it also produces intermediate results that researchers can use to interpret the relations of variables within clusters [6, 7]. In addition, missing data may affect the valence of the data in big data applications [8]. Therefore, imputing missing data can lead to more reasonable, valuable, and unbiased results. Among the numerous studies that investigated medical prescriptions in various ways [9–11], a considerable proportion reported that the physician's specialty was missing at a significant rate, up to one-third [12–17]. Although studying prescriptions has remarkably increased the accuracy of disease labeling of patients, predicting the physician's specialty is one of its overlooked applications.
To the best of our knowledge, few studies have predicted the specialty of physicians using machine learning methods on big data. The present study aimed to predict a physician's specialty from the two-medication combinations most frequently prescribed together. The results could be used to impute missing specialty data in similar databases.
2. Materials and Methods
2.1. Data Source
The data used in the study were registered in the second most populous province of Iran (Khorasan Razavi) from April 2015 to March 2017. The data included the fields of prescription ID, patient ID, medication, and prescription date, as well as the physician's specialty, gender, and age. Each record represents a single medication within a prescription.
2.2. Data Manipulation and Analysis
An essential first step of the analysis is data wrangling. In this study, data were retrieved from four SQL tables and preprocessed in R. Because the variables contained both quantitative and qualitative measures, KAy-means for MIxed LArge datasets (KAMILA) clustering, which is suitable for mixed-type data, was applied [18]. The variables included in the clustering were age of physician, medication, insurance company, prescription month, sex of physician, physician specialty, corticosteroid, and antibiotic. Only physicians aged between 22 and 85 years were included in the study, and prescriptions with only one medication were excluded from further analysis. Based on expert opinion, the number of clusters was set to four.
After evaluating the results of the first step, all combinations of two medications (2-combs) in every prescription were derived, and a new data set was developed.
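As an illustration of how 2-combs can be derived, the following minimal Python sketch enumerates every unordered pair of distinct medications in a prescription and counts the pairs. The study itself used R and Spark; the function name `two_combs`, the toy prescriptions, and the choice to deduplicate and sort medications within a prescription are assumptions made for this example, not details taken from the study.

```python
from collections import Counter
from itertools import combinations


def two_combs(medications):
    """Return all unordered pairs of distinct medications in one prescription."""
    # Assumption: duplicates are dropped and names are sorted so that
    # (a, b) and (b, a) always map to the same key.
    unique = sorted(set(medications))
    return list(combinations(unique, 2))


# Hypothetical toy prescriptions: prescription ID -> medications on it.
prescriptions = {
    "rx1": ["amoxicillin", "metronidazole", "ibuprofen"],
    "rx2": ["ASA", "atorvastatin"],
    "rx3": ["acetaminophen"],  # a single medication yields no pair
}

# Count how often each 2-comb occurs across all prescriptions.
pair_counts = Counter()
for meds in prescriptions.values():
    pair_counts.update(two_combs(meds))
```

A prescription with k distinct medications contributes k(k-1)/2 pairs, so single-medication prescriptions contribute none, which is consistent with their exclusion from the analysis.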
In similar studies on disease prediction, the random forest (RF) model was reported to outperform other well-known predictive models such as the support vector machine (SVM), Naïve Bayes, and logistic regression [19–22]. Furthermore, RF conveniently handles problems with more than two classes [23]. When specialties are missing from a database, no information about the specialist's characteristics is available. The study therefore proposed an RF model whose predictors, the 2-combs, do not depend on physician characteristics.
Finally, to predict the specialty from prescription patterns, the RF model was fitted in the following phases:
- The 2-combs were counted and sorted in descending order
- For each specialty, the proportions of the selected prescribed 2-combs were calculated
- The data were restructured into wide format
- To select appropriate predictors, several RF models were fitted using the first n most frequent 2-combs (n = 1, 2, ⋯, 40)
- The variables that achieved appropriate accuracy were selected
- The model was validated using the accuracy index
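The first three phases above can be sketched in Python as follows. This is only an illustration under stated assumptions: it treats each physician as one row of the wide table and divides by the physician's total number of 2-combs, neither of which the text specifies exactly. The names `top_n_pairs` and `wide_proportions` and the toy records are hypothetical.

```python
from collections import Counter


def top_n_pairs(records, n):
    """records: iterable of (physician_id, pair). Return the n most frequent pairs."""
    counts = Counter(pair for _, pair in records)
    # Sort by descending count; ties are broken by the pair itself for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pair for pair, _ in ranked[:n]]


def wide_proportions(records, selected):
    """Build one row per physician with one column per selected pair.

    Each cell holds the share of that physician's 2-combs that equal the pair
    (assumption: the denominator is all 2-combs of the physician).
    """
    per_physician = {}
    for physician, pair in records:
        per_physician.setdefault(physician, Counter())[pair] += 1
    rows = {}
    for physician, counts in per_physician.items():
        total = sum(counts.values())
        rows[physician] = [counts[pair] / total for pair in selected]
    return rows


# Hypothetical toy records: (physician ID, 2-comb).
records = [
    ("d1", ("amoxicillin", "metronidazole")),
    ("d1", ("amoxicillin", "metronidazole")),
    ("d1", ("ASA", "atorvastatin")),
    ("d2", ("ASA", "atorvastatin")),
    ("d2", ("ASA", "atorvastatin")),
    ("d2", ("acetaminophen", "ibuprofen")),
]

selected = top_n_pairs(records, 2)
features = wide_proportions(records, selected)
```

Each row of `features` could then serve as the predictor vector for one observation, with the specialty as the response.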
In this approach, the 25 most frequent combinations were used as predictors, and the physician's specialty was the response. The 7354 observations were split into training and validation datasets in a 70/30 proportion: 5160 and 2194 observations were randomly allocated to the training and validation subsets, respectively.
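A random 70/30 split of this kind can be sketched as below. The text does not state whether the split was stratified by specialty; this example assumes that it was, so that every specialty appears in both subsets in roughly the same proportion. The function name `stratified_split`, the seed, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, valid_idx), splitting each class by train_fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, valid = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random allocation within the class
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        valid.extend(indices[cut:])
    return sorted(train), sorted(valid)


# Hypothetical toy labels: 10 observations in each of two specialties.
labels = ["GP"] * 10 + ["cardiology"] * 10
train_idx, valid_idx = stratified_split(labels)
```

Because the cut point is rounded within each class, the overall training share can differ slightly from exactly 70%. For reference, the reported subset sizes give 5160/7354 ≈ 0.702.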
The analysis code was written in R 4.1.2 and executed in the RStudio environment. The Apache Spark framework 3.1.0, accessed through the “sparklyr” R library, was used together with the “tidyverse” and “vroom” libraries for data wrangling. The “kamila” and “randomForest” libraries were used for the analyses, and “ggplot2” and “ggtext” were used for visualization. A computer with 64 GB of RAM and a 32-core Intel CPU was used.
3. Results
The study results were obtained from 17,137,949 prescriptions issued by 30,544 physicians, of whom 18,620 (61%) were male. The mean physician age was 51.9 years for males and 42.1 years for females. General practitioners (GPs) constituted 20,081 (65.7%) of all physicians.
In the first step, KAMILA clustering was performed on the prescription records. Figure 1 provides an overview of the distribution of each specialty across the four clusters. For example, endocrinology and nephrology were concentrated in cluster 1.
Figure 1. The heatmap plot of the specialty variable in four clusters.
As shown in Figure 2, the most frequently prescribed antibiotics (i.e., ciprofloxacin, azithromycin, cephalexin, chloramphenicol, and doxycycline) were concentrated in cluster 2. The results indicated that GPs prescribed the majority of the antibiotics. In addition, ibuprofen and diclofenac, two well-known nonsteroidal anti-inflammatory drugs (NSAIDs), were primarily categorized in clusters 1 and 4, respectively. Furthermore, dexamethasone was mainly categorized in clusters 3 and 4.
Figure 2. The five most frequent medications for each specialty in four clusters.
The variations and overlaps of medications across specialties motivated a separate evaluation of the 2-combs. Figure 3 shows the five most frequent 2-combs in each specialty.
Figure 3. The five most frequent 2-combs in each specialty.
The results of the RF classification model are presented as follows.
According to Figure 4, accuracy levelled off at about 25 2-combs. The final RF model was therefore fitted using the 25 most frequent 2-combs and reached an accuracy of 0.74 in the validation dataset.
Figure 4. The accuracy of the RF models against the number of most frequent 2-combs used as predictors.
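One way to turn the visual judgment that accuracy levels off into an explicit rule is sketched below. The text does not describe a formal stopping criterion, so the rule, its `tolerance` and `window` parameters, and the accuracy values are hypothetical and are not the study's results.

```python
def plateau_point(accuracies, tolerance=0.005, window=3):
    """Return the smallest n after which accuracy stops changing materially.

    accuracies[i] is the validation accuracy of the model fitted with the
    first i + 1 predictors. n is returned when each of the next `window`
    changes in accuracy is smaller than `tolerance` in absolute value.
    If no such n exists, the total number of fitted models is returned.
    """
    last_start = len(accuracies) - window
    for n in range(1, last_start + 1):
        # Changes when going from n to n + 1, n + 1 to n + 2, and so on.
        gains = [accuracies[n + k] - accuracies[n + k - 1] for k in range(window)]
        if all(abs(g) < tolerance for g in gains):
            return n
    return len(accuracies)


# Hypothetical accuracy curve for models with 1, 2, ..., 8 predictors.
hypothetical_accuracies = [0.50, 0.60, 0.68, 0.72, 0.722, 0.721, 0.723, 0.722]
chosen_n = plateau_point(hypothetical_accuracies)
```

With these made-up values the rule selects four predictors, because adding a fifth, sixth, or seventh changes accuracy by less than 0.005 each time.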
Figure 5 presents the mean decrease in accuracy when each predictor is removed from the model, which reflects each predictor's importance. Among the first 25 combinations, amoxicillin and metronidazole contributed the most to accuracy. The combination of ASA and atorvastatin, which appeared in Figure 3 among the most frequent 2-combs of several specialties, ranked fourth in importance in Figure 5. The findings are provided in a user-friendly R Shiny web application (https://mahboube-akhlaghi.shinyapps.io/SpecialtyPrediction), which can be applied to any similar medical prescription database to predict the specialty.
Figure 5. The mean decreased accuracy by deleting each predictor of the model.
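The idea behind a mean decrease in accuracy can be sketched with a generic permutation-based version. In the R “randomForest” package, this measure is obtained by permuting each predictor in the out-of-bag samples of each tree rather than by refitting the model without it. The simplified example below instead permutes each column of a held-out set for an arbitrary classifier, so it is an illustration of the concept rather than the study's procedure. The toy classifier, data, and function names are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows whose predicted label equals the true label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each predictor column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)  # break the link between column j and the label
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy classifier: uses only the first predictor.
def predict(row):
    return "dentistry" if row[0] > 0.5 else "GP"


rows = [[0.9, 0.1], [0.8, 0.7], [0.1, 0.2], [0.2, 0.9], [0.7, 0.3], [0.3, 0.6]]
labels = ["dentistry", "dentistry", "GP", "GP", "dentistry", "GP"]
importances = permutation_importance(predict, rows, labels)
```

In this toy setting the second predictor has an importance of exactly zero because the classifier ignores it, while shuffling the first predictor lowers accuracy.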
4. Discussion
The present study focused on predicting specialty with an RF model applied to 17,265,238 prescriptions. The method is in line with numerous previous studies that focused on disease prediction [19, 24], and many studies have reported the RF model's superiority over other machine learning methods [19].
Shirazi et al. used a community detection algorithm to identify the real specialty of physicians in prescription databases [25], which was a subjective method. The present study took a different approach to predicting missing specialties: the 2-combs were used as predictors of the physician's specialty in an RF model.
The main findings showed that the RF model could predict the physician's specialty from the proportions of their most frequent 2-comb prescriptions, with an accuracy of 0.74 in the validation dataset. In addition, knowledge about patterns in prescription data can support better treatment and the investigation of the side effects of 2-combs. The combination of amoxicillin and metronidazole ranked 21st in frequency among the 25 selected 2-combs, yet it had the highest importance for accurate prediction. A recent systematic review and meta-analysis of concurrent amoxicillin/metronidazole use in periodontitis treatment showed positive short-term effects [26]. The results also showed that acetaminophen appeared in several 2-combs. Several studies have presented evidence that acetaminophen has adverse effects on certain human organs (for example, the liver) [27–29]. Moreover, some recent studies have examined the simultaneous prescription of ASA and atorvastatin. He et al. demonstrated a strong effect of the ASA and atorvastatin combination on inhibiting the growth of prostate cancer cells [30]. Other studies provided evidence of the merit of this combination in preventing and treating cardiovascular disease and severe sepsis [31–34]. In many 2-combs, one of the medications was an antibiotic; adverse events and antibiotic resistance are a well-acknowledged global problem [35, 36].
Finally, we found that general practitioners had the most outpatients. One possible reason relates to family physician issues [37]. In addition, the services of general practitioners impose a smaller economic burden on patients than those of specialists. Further, the prescribed medications were mostly related to conditions such as colds and allergies, for which patients often visit a general practitioner. The observations in this study are limited to the insurance claims database registry. In Iran, however, many prescriptions are not covered by insurance companies, which is a specific limitation of the study.
5. Conclusion
In this study, we predicted the physician's specialty using the proportions of their most frequent 2-comb prescriptions. Today, a huge amount of data is produced in the field of medical prescriptions, and the physician's specialty is missing from a significant share of it [12–17]. Imputing the missing values can therefore lead to valuable results for planning higher-quality medication, improving healthcare quality, and decreasing expenses [38–40].
Acknowledgments
This study was funded partially by the Isfahan University of Medical Sciences (IUMS).
Abbreviations
- NCDs: Noncommunicable diseases
- UNODC: The United Nations Office on Drugs and Crime
- KAMILA: KAy-means for MIxed LArge datasets clustering
- RF: Random forest
- SVM: Support vector machine
- GPs: General practitioners
- NSAIDs: Nonsteroidal anti-inflammatory drugs
Data Availability
The raw data used to support the findings of this study are restricted in order to protect patient privacy, but the aggregated data are available from the first author upon request.
Disclosure
This current study is a part of a Biostatistics PhD thesis at the School of Health, Isfahan University of Medical Sciences, with project number 397804.
Conflicts of Interest
The authors report no conflicts of interest in this work.
References
- 1. Yazdi Feyzabadi V., Mehrolhassani M., Iranmanesh M. Evaluation of medication consumption indices in Iran from 2012 to 2015: a descriptive study. Iranian Journal of Epidemiology. 2019;14:72–81.
- 2. Collaborator WDR. Executive Summary Policy Implications. The United Nations Office on Drugs and Crime (UNODC); 2021.
- 3. Cappa F., Oriani R., Peruffo E., McCarthy I. Big data for creating and capturing value in the digitalized environment: unpacking the effects of volume, variety, and veracity on firm performance. Journal of Product Innovation Management. 2021;38(1):49–67. doi: 10.1111/jpim.12545.
- 4. Björkdahl J., Holmén M. Exploiting the control revolution by means of digitalization: value creation, value capture, and downstream movements. Industrial and Corporate Change. 2019;28(3):423–436. doi: 10.1093/icc/dty022.
- 5. Zerhari B., Lahcen A. A., Mouline S. Big data clustering: algorithms and challenges. International Conference on Big Data, Cloud and Applications BDCA’15; 2015; Tetuan, Morocco. pp. 1–6.
- 6. Li Y., Deng X., Ba S., et al. Cluster-based data filtering for manufacturing big data systems. Journal of Quality Technology. 2021;54(3):1–13.
- 7. Chitta R. Kernel-Based Clustering of Big Data. Michigan State University; 2015.
- 8. Jeon G., Sangaiah A. K., Chen Y.-S., Paul A. Special issue on machine learning approaches and challenges of missing data in the era of big data. International Journal of Machine Learning and Cybernetics. 2019;10(10):2589–2591. doi: 10.1007/s13042-019-01010-8.
- 9. Barbieri E., Liberati C., Cantarutti A., et al. Antibiotic prescription patterns in the paediatric primary care setting before and after the COVID-19 pandemic in Italy: an analysis using the AWaRe metrics. Antibiotics. 2022;11(4):p. 457. doi: 10.3390/antibiotics11040457.
- 10. Wondmkun Y. T., Ayele A. G. Assessment of prescription pattern of systemic steroidal drugs in the outpatient department of Menelik II Referral Hospital, Addis Ababa, Ethiopia, 2019. Patient Preference and Adherence. 2021;15:9–14. doi: 10.2147/PPA.S285064.
- 11. Kurata K., Taniai E., Nishimura K., Fujita K., Dobashi A. A prescription survey about combined use of acetylcholinesterase inhibitors and anticholinergic medicines in the dementia outpatient using electronic medication history data from community pharmacies. Integrated Pharmacy Research & Practice. 2015;4:p. 133. doi: 10.2147/IPRP.S86661.
- 12. Patel A., Medhekar R., Ochoa-Perez M., et al. Care provision and prescribing practices of physicians treating children and adolescents with ADHD. Psychiatric Services. 2017;68(7):681–688. doi: 10.1176/appi.ps.201600130.
- 13. Ringwalt C., Gugelmann H., Garrettson M., et al. Differential prescribing of opioid analgesics according to physician specialty for Medicaid patients with chronic noncancer pain diagnoses. Pain Research and Management. 2014;19(4). doi: 10.1155/2014/857952.
- 14. Weiner S. G., Baker O., Rodgers A. F., et al. Opioid prescriptions by specialty in Ohio, 2010–2014. Pain Medicine. 2018;19(5):978–989. doi: 10.1093/pm/pnx027.
- 15. Weiner S. G., Chou S. C., Chang C. Y., et al. Prescription and prescriber specialty characteristics of initial opioid prescriptions associated with chronic use. Pain Medicine. 2020;21(12):3669–3678. doi: 10.1093/pm/pnaa293.
- 16. Ringwalt C., Roberts A. W., Gugelmann H., Skinner A. C. Racial disparities across provider specialties in opioid prescriptions dispensed to Medicaid beneficiaries with chronic noncancer pain. Pain Medicine. 2015;16(4):633–640. doi: 10.1111/pme.12555.
- 17. Sun B. C., Lupulescu-Mann N., Charlesworth C. J., et al. Variations in prescription drug monitoring program use by prescriber specialty. Journal of Substance Abuse Treatment. 2018;94:35–40. doi: 10.1016/j.jsat.2018.08.006.
- 18. Foss A., Markatou M., Ray B., Heching A. A semiparametric method for clustering mixed data. Machine Learning. 2016;105(3):419–458. doi: 10.1007/s10994-016-5575-7.
- 19. Uddin S., Khan A., Hossain M. E., Moni M. A. Comparing different supervised machine learning algorithms for disease prediction. BMC Medical Informatics and Decision Making. 2019;19(1):1–16. doi: 10.1186/s12911-019-1004-8.
- 20. Shah A. D., Bartlett J. W., Carpenter J., Nicholas O., Hemingway H. Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American Journal of Epidemiology. 2014;179(6):764–774. doi: 10.1093/aje/kwt312.
- 21. Rakhra M., Soniya P., Tanwar D., et al. Crop price prediction using random forest and decision tree regression:-a review. Materials Today: Proceedings. 2021. doi: 10.1016/j.matpr.2021.03.261.
- 22. Yang L., Wu H., Jin X., et al. Study of cardiovascular disease prediction model based on random forest in eastern China. Scientific Reports. 2020;10(1):1–8. doi: 10.1038/s41598-020-62133-5.
- 23. Díaz-Uriarte R., Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7(1):1–13. doi: 10.1186/1471-2105-7-3.
- 24. Jiang Y., Zhang X., Ma R., et al. Cardiovascular disease prediction by machine learning algorithms based on cytokines in Kazakhs of China. Clinical Epidemiology. 2021;13:417–428. doi: 10.2147/CLEP.S313343.
- 25. Shirazi S., Albadvi A., Akhondzadeh E., Farzadfar F., Teimourpour B. A new application of community detection for identifying the real specialty of physicians. International Journal of Medical Informatics. 2020;140:p. 104161. doi: 10.1016/j.ijmedinf.2020.104161.
- 26. Karrabi M., Baghani Z. Amoxicillin/metronidazole dose impact as an adjunctive therapy for stage II-III grade C periodontitis (aggressive periodontitis) at 3-and 6-month follow-ups: a systematic review and meta-analysis. Journal of Oral & Maxillofacial Research. 2022;13(1). doi: 10.5037/jomr.2022.13102.
- 27. Tsuji Y., Kuramochi M., Golbar H. M., Izawa T., Kuwamura M., Yamate J. Acetaminophen-induced rat hepatotoxicity based on M1/M2-macrophage polarization, in possible relation to damage-associated molecular patterns and autophagy. International Journal of Molecular Sciences. 2020;21(23):p. 8998. doi: 10.3390/ijms21238998.
- 28. Kim D., Moon B. S., Park S. M., et al. Feasibility of TSPO-specific positron emission tomography radiotracer for evaluating paracetamol-induced liver injury. Diagnostics. 2021;11(9):p. 1661. doi: 10.3390/diagnostics11091661.
- 29. Herndon C. M., Dankenbring D. M. Patient perception and knowledge of acetaminophen in a large family medicine service. Journal of Pain & Palliative Care Pharmacotherapy. 2014;28(2):109–116. doi: 10.3109/15360288.2014.908993.
- 30. He Y., Huang H., Farischon C., et al. Combined effects of atorvastatin and aspirin on growth and apoptosis in human prostate cancer cells. Oncology Reports. 2017;37(2):953–960. doi: 10.3892/or.2017.5353.
- 31. Mahtta D., Ramsey D. J., Al Rifai M., et al. Evaluation of aspirin and statin therapy use and adherence in patients with premature atherosclerotic cardiovascular disease. JAMA Network Open. 2020;3(8, article e2011051). doi: 10.1001/jamanetworkopen.2020.11051.
- 32. Hennekens C. H., Schneider W. R. The need for wider and appropriate utilization of aspirin and statins in the treatment and prevention of cardiovascular disease. Expert Review of Cardiovascular Therapy. 2008;6(1):95–107. doi: 10.1586/14779072.6.1.95.
- 33. Sanchez M. A., Thomas C. B., O’Neal H. R. Do aspirin and statins prevent severe sepsis? Current Opinion in Infectious Diseases. 2012;25(3):345–350. doi: 10.1097/QCO.0b013e3283520ed7.
- 34. Patel S. S., Guzman L. A., Lin F. P., et al. Utilization of aspirin and statin in management of coronary artery disease in patients with cirrhosis undergoing liver transplant evaluation. Liver Transplantation. 2018;24(7):872–880. doi: 10.1002/lt.25067.
- 35. Lund B., Cederlund A., Hultin M., Lundgren F. Effect of governmental strategies on antibiotic prescription in dentistry. Acta Odontologica Scandinavica. 2020;78(7):529–534. doi: 10.1080/00016357.2020.1751273.
- 36. King L. M., Bartoces M., Fleming-Dutra K. E., Roberts R. M., Hicks L. A. Changes in US outpatient antibiotic prescriptions from 2011–2016. Clinical Infectious Diseases. 2020;70(3):370–377. doi: 10.1093/cid/ciz225.
- 37. Rakel R. E., Rakel R. E. Textbook of Family Practice. Wb Saunders; 1984.
- 38. Duggal P. S., Paul S. Big data analysis: challenges and solutions. International Conference on Cloud, Big Data and Trust. 2013;15:269–276.
- 39. Mathew P. S., Pillai A. S. Big data solutions in healthcare: problems and perspectives. 2015 International conference on innovations in information, embedded and communication systems (ICIIECS); 2015; Coimbatore, India. pp. 1–6.
- 40. Heinrich A., Lojo A., Xu F. Big data technologies in healthcare: needs, opportunities and challenges. White Paper, Big Data Value Association, TF7 Healthcare Subgroup. 2016;21.
