AMIA Summits on Translational Science Proceedings. 2022 May 23;2022:293–302.

Clinical Data Cohort Quality Improvement: The Case of the Medication Data in The University of Minnesota’s Clinical Data Repository

Sebastien D Kiogou 1, Chih-Lin Chi 1, Rui Zhang 1,2, Sisi Ma 1, Terrence J Adam 1,2
PMCID: PMC9285162  PMID: 35854717

Abstract

Clinical and translational research centers (CTRCs) have emerged as key centers for electronic medical record related research through integrated data repositories (IDRs) and the ‘secondary use’ of clinical data. Researchers who access and pre-process ever-increasing amounts of electronic medical records for data mining tasks have a growing need for best practice approaches to clinical data quality assessment and improvement. This project focused on a large data extract of prescriptions for 7 statin medications in patients with cardiovascular disease. After the initial data extraction, we analyzed the data for completeness, correctness, currency, and percentage populated using established data quality frameworks. The data were assessed through medication possession ratios, medication discontinuation reasons, and drug dosages. When we compared distributions of data elements such as drug dosage before and after the changes introduced by our pre-processing protocols, we found only minimal differences: the clinical data cohort quality assessment and pre-processing were completed without substantially altering the original data structure. Our study demonstrates practical steps for clinical data cohort quality improvement using medication data and illustrates a best practice approach to clinical data cohort quality improvement for any data mining task.

Introduction

Clinical and translational research centers (CTRCs) provide resources to support researchers in applying scientific efforts to patient care. Integrated data repositories (IDRs) provide key support of clinical and translational research by providing data sources for retrospective analytic research and the identification and recruitment of prospective research subjects. IDRs are often the byproducts of research collaborations between academic medical centers and healthcare delivery organizations; thus, IDRs are integrated data sources fed with millions of heterogeneous electronic health records (EHRs) from clinics and hospitals. The emergence of CTRCs as research resources and the increasing availability of electronic medical records through IDRs have enhanced the opportunity for ‘secondary use’ of such data for research16.

The University of Minnesota’s Clinical & Translational Science Institute (CTSI) is a member of the CTSA consortium. Part of the CTSI’s role is to support the Best Practices Integrated Informatics Core (BPIC), which provides centralized informatics services and collaborative science opportunities for the university7. In the same vein, the clinical data repository (CDR) created by the CTSI grants researchers access to millions of electronic health records. The CDR is an integrated medical record repository comprised of patient records from 8 hospitals and more than 40 clinics of Fairview Health Services and the University of Minnesota Physicians. This information-rich resource gives researchers the ability to study medical conditions, analyze patient outcomes, and identify best practices across large populations, all while protecting patients’ privacy. On December 6, 2019, the University of Minnesota’s Institutional Review Board (IRB), under the human research protection program, approved our submission for a study that uses machine learning algorithms to address a critical gap: the lack of a standardized data approach for the detection and surveillance of statin related adverse drug events (ADEs).

Cardiovascular diseases (CVD) are the leading cause of death both in the United States (U.S.) and worldwide; about 655 thousand Americans and 17.9 million individuals worldwide die from heart disease each year5,18. In the U.S., CVD claims more lives each year than all forms of cancer and chronic lower respiratory disease combined15. In developed countries, cholesterol levels above 3.8 mmol/liter are responsible for more than 50% of CVD events9. Statin drugs, or HMG-CoA reductase inhibitors, are a class of lipid-lowering drugs used for both primary and secondary prevention of cardiovascular disease1,4. According to the American Heart Association2, statin drugs are the most common cholesterol-lowering drugs. Statin drugs can lower LDL cholesterol concentration by an average of 1.8 mmol/l, thereby reducing the risk of cardiac events (heart attack, sudden cardiac death) by about 60% and that of stroke by 17% after long-term treatment12. However, statin therapies have been associated with ADEs such as muscle damage, increased risk of diabetes mellitus, liver damage, neuropathy, and pancreatic and liver dysfunction. These morbidities are also major causes of statin medication non-adherence; in a survey by Cohen, Brinton, Ito, & Jacobson8, the primary reason for statin therapy discontinuation was ADEs (62%), which according to Bates, Connaughton, & Watts3 is a major challenge for preventive cardiology.

To address this critical gap, our proposed research elaborated three specific aims, one of which is discussed in this article. Big data technology and advanced data mining algorithms including machine learning, deep learning, and natural language processing have proven optimal in handling high dimensional data. However, the success of data-driven algorithms is influenced by both the algorithm and the data used for modeling. For any data mining task, early steps such as data profiling, data extraction, data cleaning, data pre-processing, and data quality improvement are critical to model success6. The focus of this article is to develop practical steps for clinical data cohort quality improvement using the CDR medication data. Though this approach was used for a particular clinical data cohort quality improvement, it illustrates a best practice approach in clinical data cohort quality improvement for any data mining task.

Materials and Methods

The University of Minnesota’s Clinical Data Repository (CDR): The CDR is an electronic medical record repository comprised of hospital and clinic patient records from Fairview Health Services and the University of Minnesota Physicians. The data in the CDR come from the electronic health records of 2.5 million patients at 8 hospitals and more than 40 clinics. Available hospital data started to flow in from year 2011 and clinic data from 2000 with access to the data being granted via the Academic Health Center Information Exchange (AHC-IE) Data Shelter which provides a variety of analytic tools for data analysis. The data in the CDR is housed in patient and encounter centered tables within a SQL Server Management Studio (SSMS) inside the AHC-IE data shelter; the latter data shelter is a collaborative agreement between the University of Minnesota Academic Health Center and Fairview Health Services to support the joint mission of improving patient care and supporting healthcare research and education7. The data currently available in the CDR can be classified into the broad categories of demographic, administrative, and procedural data detailing patients’ problem list, vitals, medication, diagnosis, etc.

Data Quality Assessment: For ease of researcher use and data delivery, tables in the AHC-IE data shelter are combined and standardized through a variety of quality checks and then made available in the clinical data repository7. To better evaluate the quality of the medical data, we assessed the tables and data elements of interest for completeness, correctness, and currency, following Weiskopf & Weng17. According to these authors, EHR data are correct if the information they contain is true; completeness refers to whether a truth about a patient is present in the EHR; and EHR data are current if they were recorded in the EHR within a reasonable period of time or, alternatively, if they are representative of the patient state at a desired time of interest. To evaluate whether the data were complete enough for our specific purpose, we looked at the presence or absence of certain data elements, sought agreement between data elements from different tables, or compared distributions / occurrences of certain data elements with nationally recorded rates, including those from the American Heart Association (AHA) and the Centers for Disease Control and Prevention (CDC). Data correctness was assessed by evaluating the agreement between data elements in the data shelter and their expected physiological value ranges. For currency evaluation, we checked whether data were entered into the EHR within a set time limit or whether the data element was measured recently enough to be considered medically relevant. The core data elements in our investigation came from seven tables; in addition to the medication table, we utilized the patient, patient coverage, social history, basic vital, diagnoses, and reason for visit tables. Among the data quality assessment methods noted above, we used the most appropriate or feasible method when assessing a particular data element.
After assessing data quality for all data elements in the tables of interest, we filtered the data and joined data elements from these seven tables to obtain the analysis dataset. In the medication table, the following data elements were assessed: patient_id and service_id are identification features necessary to join tables; drug_name_orig and ingredient_name specified the medications used, such as Simvastatin; order date, discontinued date, start_date, and end_date were utilized for filtering and matching; dose_amt along with dose_unit were used as predictors of our outcome of interest; medication_order_status as well as active_order status were used to implement exclusion / inclusion criteria; discontinued_reason served as predictor and categorization factor. For all these data elements, data quality was assessed through the dimensions of correctness, completeness, and currency. As an example, for dose_amt, which is defined as the quantity of medication to be administered at a time, we started by checking percent populated; then, to assess correctness of dose_amt, we compared its values with expected value ranges based on ingredient_name and the clinical expertise / experience of the authors. Although the presence of dose_amt added to the completeness of the medication table, we further checked completeness by comparing the distribution of this data element to what is specified in the Grundy et al. Clinical Practice Guidelines11. Additionally, medication active_order status, order date, discontinued date, start_date, and end_date were used as references to ascertain dose_amt currency.
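The two dose_amt checks described above (percent populated, then agreement with an expected value range) can be illustrated with a minimal Python sketch. It is not the authors' code: the row layout (one dict per medication order), the `EXPECTED_MG_RANGE` values, and the sample rows are assumptions made up for illustration, and the real ranges were set by ingredient and clinical expertise.

```python
# Illustrative mg ranges only (assumption); not the ranges used in the study.
EXPECTED_MG_RANGE = {"simvastatin": (5.0, 80.0), "pitavastatin": (1.0, 4.0)}


def percent_populated(rows, column):
    """Percentage of rows whose `column` value is neither None nor an empty string."""
    rows = list(rows)
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return 100.0 * filled / len(rows)


def out_of_range(rows):
    """Return mg orders whose dose_amt lies outside the expected range for the ingredient."""
    flagged = []
    for row in rows:
        bounds = EXPECTED_MG_RANGE.get((row.get("ingredient_name") or "").lower())
        if bounds is None or row.get("dose_unit") != "mg":
            continue
        try:
            dose = float(row["dose_amt"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric doses (e.g. "10-20") need separate handling
        if not bounds[0] <= dose <= bounds[1]:
            flagged.append(row)
    return flagged


# Made-up rows for illustration.
sample = [
    {"ingredient_name": "Simvastatin", "dose_amt": "40", "dose_unit": "mg"},
    {"ingredient_name": "Simvastatin", "dose_amt": "800", "dose_unit": "mg"},
    {"ingredient_name": "Simvastatin", "dose_amt": None, "dose_unit": None},
    {"ingredient_name": "Pitavastatin", "dose_amt": "2", "dose_unit": "mg"},
]
print(percent_populated(sample, "dose_amt"))  # 3 of 4 rows are populated
print(out_of_range(sample))                   # only the 800 mg order is flagged
```

Flagged rows are candidates for review rather than automatic deletion, since an out-of-range value may reflect a unit problem instead of a wrong dose.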

Data Quality Improvement: The Medication table included 109,870 individual patients and 31,264,618 rows; each row of the table represents a medication order for a particular patient, and a patient may have more than one order. We were interested in seven particular statin drugs: Rosuvastatin, Atorvastatin, Pitavastatin, Pravastatin, Simvastatin, Lovastatin, and Fluvastatin; these agents are the most commonly prescribed statin drugs as attested by the literature19 and the extracted dataset. The study period was defined as January 1st, 2000 to December 31st, 2019, reflecting the availability of data in the CDR and the timing of our IRB approval. To ensure we included only orders that were created and medications that were taken, we focused our study on Active or Completed medications whose Order Status was Completed, Dispensed, Sent, or Verified. The study was limited to subjects 18 years or older to focus on adult medication dosages and adverse effect reactions. After implementing all study related inclusion/exclusion criteria, we were left with 56,658 individual patients and 422,138 patient drug orders. The quality improvement effort focused on medication dosage, pseudo medication possession ratio, and medication discontinuation.
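The inclusion/exclusion criteria above can be read as a single predicate applied to each medication order. The sketch below is a hedged illustration: the field `age_at_order` is hypothetical (the article does not say how age was derived), the exact spelling of the status values is assumed from the examples in Table 1.1, and comparisons are made case-insensitively as a precaution.

```python
from datetime import date

STATINS = {"rosuvastatin", "atorvastatin", "pitavastatin", "pravastatin",
           "simvastatin", "lovastatin", "fluvastatin"}
# Assumed spellings of the accepted values, compared in lower case.
ACTIVE_ORDER_OK = {"active medication", "completed medication"}
ORDER_STATUS_OK = {"completed", "dispensed", "sent", "verified"}
STUDY_START, STUDY_END = date(2000, 1, 1), date(2019, 12, 31)


def is_eligible(order):
    """True if one medication order (a dict) meets every inclusion criterion."""
    # Combination names such as "ezetimibe-Simvastatin" are split on the dash.
    ingredients = {part.strip().lower() for part in order["ingredient_name"].split("-")}
    return (
        bool(ingredients & STATINS)
        and STUDY_START <= order["order_date"] <= STUDY_END
        and order["active_order"].lower() in ACTIVE_ORDER_OK
        and order["medication_order_status"].lower() in ORDER_STATUS_OK
        and order["age_at_order"] >= 18  # hypothetical field name
    )


# Made-up order for illustration.
base_order = {
    "ingredient_name": "ezetimibe-Simvastatin",
    "order_date": date(2015, 6, 1),
    "active_order": "Completed Medication",
    "medication_order_status": "Sent",
    "age_at_order": 54,
}
print(is_eligible(base_order))
print(is_eligible(dict(base_order, medication_order_status="Canceled")))
```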

Statin Drug Dosage Distribution: While identifying the seven statin drugs in the medication data, we encountered five additional statin related combination drugs: ezetimibe-Simvastatin, Amlodipine-atorvastatin, Lovastatin-Niacin, atorvastatin-ezetimibe, and Niacin-Simvastatin. Dosage distributions for the seven statin drugs and the combination drugs showed dosage amounts / units that could not be used in our planned data mining task. To improve the quality of the dosage data and make it suitable for our purpose, we made the following changes. First, we converted every combination drug prescription into a single statin prescription, keeping rosuvastatin, atorvastatin, pravastatin, simvastatin, lovastatin, fluvastatin, or pitavastatin as the drug, since statin exposure was a key information component. Most notations for the dosage amount of statin combination drugs consist of two numbers separated by a dash (-), so we used expert knowledge to assign the single statin dose amount appropriately. For the combination drug Lovastatin-Niacin, for instance, we had a dosage of 1000-20 mg; we chose to keep 20 mg as the dosage and Lovastatin as the drug name. Furthermore, if the dosage unit of any statin combination drug was tablet, we changed the combination drug name to that of the statin drug to provide a consistent statin exposure representation; the name of the other component of the combination drug was not retained as it was not needed for later analyses. This information could have been retained if we were seeking more specific information on drug exposures and interactions. Nonetheless, we created an additional column to indicate that these were combination drugs. Similarly, all dosage units of capsule were renamed to tablet.
Finally, if the dosage unit was tablet, we converted the dosage to mg as follows: we multiplied the number of tablet(s) by the dosage amount of the most recent mg prescription of the same drug for the same patient; if there was no mg prescription of the same drug for that patient, we used the most frequent dosage amount in mg for that drug in the whole table.
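The two conversions described above can be sketched in Python under stated assumptions. For combination drugs, the sketch assumes the statin strength is the last number of the pair, which matches the one worked example in the text (1000-20 mg becomes 20 mg Lovastatin); the authors relied on expert knowledge rather than a fixed rule. For tablets, "most recent" is taken to mean the latest `order_date` among the patient's mg orders of the same drug. The dict layout and sample orders are made up.

```python
from collections import Counter
from datetime import date

STATINS = {"rosuvastatin", "atorvastatin", "pitavastatin", "pravastatin",
           "simvastatin", "lovastatin", "fluvastatin"}


def statin_from_combination(drug_name, dose_amt):
    """Reduce e.g. ('Lovastatin-Niacin', '1000-20') to ('lovastatin', 20.0)."""
    parts = [part.strip().lower() for part in drug_name.split("-")]
    statin = next(part for part in parts if part in STATINS)
    # Assumption: the statin strength is the last number in the pair.
    return statin, float(dose_amt.split("-")[-1])


def tablets_to_mg(order, all_orders):
    """Convert a tablet-count order to mg using a reference mg strength."""
    same_drug_mg = [o for o in all_orders
                    if o["drug"] == order["drug"] and o["dose_unit"] == "mg"]
    if not same_drug_mg:
        raise ValueError("no mg prescription available for " + order["drug"])
    own = [o for o in same_drug_mg if o["patient_id"] == order["patient_id"]]
    if own:
        # Most recent mg prescription of the same drug for the same patient.
        reference = max(own, key=lambda o: o["order_date"])["dose_amt"]
    else:
        # Fallback: most frequent mg strength for this drug in the whole table.
        reference = Counter(o["dose_amt"] for o in same_drug_mg).most_common(1)[0][0]
    return order["dose_amt"] * reference


def mg_order(patient_id, dose, when):
    """Build a made-up simvastatin mg order for the example."""
    return {"patient_id": patient_id, "drug": "simvastatin", "dose_amt": dose,
            "dose_unit": "mg", "order_date": when}


orders = [mg_order(1, 20.0, date(2015, 1, 1)), mg_order(1, 40.0, date(2016, 1, 1)),
          mg_order(2, 20.0, date(2014, 6, 1)), mg_order(3, 20.0, date(2017, 3, 1))]
half_tablet = {"patient_id": 1, "drug": "simvastatin", "dose_amt": 0.5, "dose_unit": "tablet"}
no_history = {"patient_id": 9, "drug": "simvastatin", "dose_amt": 1.0, "dose_unit": "tablet"}

print(statin_from_combination("Lovastatin-Niacin", "1000-20"))
print(tablets_to_mg(half_tablet, orders))  # 0.5 tablet x 40 mg (patient's latest mg order)
print(tablets_to_mg(no_history, orders))   # 1 tablet x 20 mg (most frequent strength)
```

A combination flag column, as the text describes, would be set before the name is overwritten so that the information is not lost.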

Medication Possession Ratio (MPR): To ensure minimum adherence to the statin drugs in the study population, a pseudo medication possession ratio (pMPR) was calculated, as MPR remains one of the most referenced measures of medication adherence in the health care industry13,14. Since we could not compute a true MPR in our setting due to limits on available prescription fill data, we computed a medication possession ratio based on medication start date and end date while accounting for overlapping dates; thus, if two date ranges overlapped, we did not count the overlapping days twice before dividing the days on medication by the period the patient was on the stated drug. An upper bound was also set for the pMPR such that any value greater than 1 was assigned a value of 1. To improve data cohort quality, we then sought to understand why some patients had a pMPR below 0.8, which is the threshold for an acceptable MPR13. In the pool of patients with a pMPR below 0.8, we compared the group with pMPR below 0.4 to the rest of the pool. We looked at different avenues to identify differences between these groups, including time spent on medication, number of medications ordered, and medication discontinuation reason, with the most evident difference being revealed through discontinuation reason.
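One way to realize the pMPR described above is to merge overlapping start/end intervals before counting days. The sketch below is an assumption-laden illustration, not the authors' implementation: days are counted inclusively, and the denominator is taken as the span from the first start date to the last end date. With that denominator the ratio cannot exceed 1, so the cap is kept only to mirror the text; the authors' denominator may have differed.

```python
from datetime import date


def pseudo_mpr(intervals):
    """Pseudo-MPR for one patient and drug from (start_date, end_date) pairs."""
    if not intervals:
        return 0.0
    ordered = sorted(intervals)
    covered_days = 0
    block_start, block_end = ordered[0]
    for start, end in ordered[1:]:
        if start <= block_end:
            # Overlap: extend the current block so shared days are counted once.
            block_end = max(block_end, end)
        else:
            covered_days += (block_end - block_start).days + 1
            block_start, block_end = start, end
    covered_days += (block_end - block_start).days + 1
    # Assumed denominator: first start date to last end date, inclusive.
    period_days = (max(end for _, end in ordered) - ordered[0][0]).days + 1
    return min(1.0, covered_days / period_days)


# Made-up orders: the first two overlap by ten days, then a gap, then a third.
example = [
    (date(2019, 1, 1), date(2019, 1, 30)),
    (date(2019, 1, 21), date(2019, 2, 9)),
    (date(2019, 3, 2), date(2019, 3, 11)),
]
print(pseudo_mpr(example))  # 50 covered days over a 70 day period, about 0.714
```

Patients could then be grouped by comparing the returned value with the 0.8 and 0.4 thresholds named in the text.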

Results

Several tables were instrumental in elucidating the methods we set forth for data quality assessment and improvement. Table 1.1 and table 1.2 below contain detailed data quality assessment of all core data elements used in the medication table.

Table 1.1:

Data quality assessment: Data element population

Column-Description | % Populated (Original table) | % Populated (Studied drugs)
Patient_id: System generated patient ID number | 100.00 | 100.00
Service_id: Medical service ID associated with patient problem | 100.00 | 100.00
Drug_name_orig: Medication name assigned by data source (e.g. FUROSEMIDE 10 MG/ML IJ SOLN) | 100.00 | 100.00
Ingredient_name: Medication ingredient name(s) (may contain > 1 ingredient) | 89.31 | 100.00
Order_datetime: Date and time order was placed | 100.00 | 100.00
Start_date: Date when administration of the medication should begin | 94.28 | 99.97
End_date: Date when the administration of the medication should end | 86.84 | 89.71
Dose_amt: Quantity of medication administered (e.g. 100-1000, 12.5-25) | 71.77 | 75.00
Dose_unit: Units used for the dose amount (e.g. Mg, bottle) | 71.77 | 75.00
Medication_order_status: The current status of the medication order (e.g. sent, canceled, completed) | 91.28 | 100.00
Active_order: Indicates active order (Active Medication, Completed Medication, Discontinued Medication) | 90.86 | 100.00
Discontinued_reason: Reason for medication discontinuation (e.g. Allergic Response, Dose Adjustment) | 62.72 | 84.45
Discontinued_datetime: The date and time the medication was discontinued | 74.41 | 99.64

Table 1.2:

Medication Table Data Quality Assessment

Column-Description | Correctness | Completeness | Currency
Patient id: System generated patient ID number | System generated | System generated | System generated
Service id: Medical service ID associated with patient problem | System generated | System generated | System generated
Drug name orig: Medication name assigned by data source (e.g. FUROSEMIDE 10 MG/ML IJ SOLN) | Checked agreement between drug name, drug name orig, ingredient name, generic name, brand name | Checked CDR presence and % populated | Assessed date features: order date, start date, end date, and discontinued date
Ingredient name: Medication ingredient name(s) (may contain > 1 ingredient) | Checked association of diagnosis in diagnosis table and drug name in medication table | Checked CDR presence and % populated | Checked date features: order date, start date, end date, and discontinued date
Order datetime: Date and time order was placed | Checked agreement between order date, start date, end date | Checked CDR presence and % populated | Checked date features: start date, end date
Start date: Date when administration of the medication should begin | Checked agreement between order date, start date, end date | Checked CDR presence and % populated | Checked date features: order date, end date
End date: Date when the administration of the medication should end | Checked agreement between order date, start date, end date | Checked CDR presence and % populated | Checked through date features: order date, start date
Dose amt: Quantity of medication administered at a time (e.g. 100-1000, 12.5-25) | Checked expected value ranges and agreement of ingredient name, frequency, strength amount, route, form, instruction | Checked CDR presence and % populated | Checked date features or EHR data entry within study period
Dose unit: Units used for the dose amount (e.g. Mg, bottle) | Checked agreement of dose amount, dose unit, ingredient name, frequency, strength amount, route, form, instruction | Checked CDR presence and % populated | Checked date features or EHR data entry within study period
Medication order status: The current status of the medication order (e.g. sent, canceled, completed) | Checked for agreement between medication order status, order date, active order | Checked CDR presence and % populated | Checked date features or EHR data entry within study period
Active order: Indicates if an order is active (Active Medication, Completed Medication, Discontinued Medication) | Checked for agreement between medication order status, order date, active order | Checked CDR presence and % populated | Checked date features or EHR data entry within study period
Discontinued reason: Reason for medication discontinuation (e.g. Allergic Response, Dose Adjustment) | Checked for agreement between discontinue reason, service id, diagnosis code, diagnosis date, discontinue date | Checked CDR presence and % populated | Checked date features or EHR data entry within study period
Discontinued datetime: The date and time the medication was discontinued | Assessed agreement between discontinued reason, diagnostic datetime, order date, start date, end date | Checked CDR presence and % populated | Checked date features: order date, diagnostic datetime, start date, end date

Knowing the percentage populated, correctness, completeness, and currency of data elements early in the process guides the planning of which data elements to use in the data mining task. The majority of our core data elements were reasonably populated, with discontinuation reason being the least populated data element at 62%; once we filtered the data down to the study related criteria, the data elements were populated enough to be kept in the data mining task, with dose_amt and dose_unit having the lowest percentage at 75%. We eventually used other data elements, including drug_name_orig, to raise the percentage populated of dose_amt and dose_unit in the data quality improvement process. As can be seen in table 1.2, the correctness, completeness, and currency of all core data elements were evaluated. Although these tasks seemed tedious at times, the time spent on this process proved valuable during the analysis phase.

The data quality improvement process assessed whether patients were adherent to their medications and also pre-processed the medication table features in ways that make them amenable to model building. Medication dosage was perhaps the feature most in need of pre-processing, since our initial study aimed to identify statin users at risk of developing adverse drug events. Following the pre-processing protocol described earlier, we converted all dosage amounts and units to a standard milligram (mg) dosage. Furthermore, for each statin drug we discarded all dosages that accounted for less than 0.1% of that drug’s total prescriptions; we judged these prescriptions to be non-standard, unlikely to be part of a general trend our models would capture, and likely to add noise to the models.
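The rare-dosage rule above (drop any dosage that makes up less than 0.1% of a drug's prescriptions) can be sketched as a per-drug frequency filter. The field names `drug` and `dose_mg` and the sample data are assumptions made for illustration.

```python
from collections import Counter


def drop_rare_doses(orders, min_share=0.001):
    """Keep orders whose (drug, dose) pair reaches `min_share` of that drug's orders."""
    orders = list(orders)
    pair_counts = Counter((o["drug"], o["dose_mg"]) for o in orders)
    drug_totals = Counter(o["drug"] for o in orders)
    return [
        o for o in orders
        if pair_counts[(o["drug"], o["dose_mg"])] / drug_totals[o["drug"]] >= min_share
    ]


# Made-up data: one 79 mg order among 2,000 simvastatin orders (0.05% share).
simvastatin_orders = [{"drug": "simvastatin", "dose_mg": 40.0} for _ in range(1999)]
simvastatin_orders.append({"drug": "simvastatin", "dose_mg": 79.0})
kept = drop_rare_doses(simvastatin_orders)
print(len(kept))  # the single 79 mg order falls below 0.1% and is dropped
```

Computing the share per drug rather than over the whole table matters here, because a dosage that is common for one statin may be rare for another.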

Initial dosage distributions for the seven statin drugs and the combination drugs can be seen in table 2.1 and table 2.2. These summary tables show distributions of dosage amount / unit prescribed for all the patients. Each cell shows the dosage amount, dosage unit, and the percentage of time it was prescribed for that specific statin drug; for instance, in the very first cell of the table 2.1 we can see that 40 mg was prescribed 39.51% of the time for all Simvastatin drug prescriptions. The latter tables informed us about the changes that needed to be made to the dose amount and dosage unit in order to fit our data mining algorithms. Instances of dosage with whole numbers and mg as unit were the most appropriate to our data mining task; we then proceeded to convert / change all other values and units.

Table 2.1:

Dosage Distribution for Statin Drugs

Simvastatin Atorvastatin Lovastatin Pravastatin Rosuvastatin Fluvastatin Pitavastatin
40 mg 39.508% 40 mg 40.49% 40 mg 48.69% 40 mg 39.62% 40 mg 34.01% 40 mg 42.51% 2 mg 49.54%
20 mg 34.05% 20 mg 22.15% 20 mg 33.89% 20 mg 26.45% 20 mg 28.80% 20 mg 28.50% 4 mg 31.98%
80 mg 9.93% 80 mg 21.89% 10 mg 7.76% 80 mg 16.90% 10 mg 20.39% 1 tablet 19.32% 1 mg 17.56%
10 mg 9.59% 10 mg 13.40% 1 tablet 3.58% 10 mg 12.12% 5 mg 11.58% 80 mg 9.17% 1 tablet <1%
1 tablet 5.37% 1 tablet 1.14% 60 mg 2.69% 1 tablet 3.22% 1 tablet 3.78% .5 tablet <1%
5 mg 0.90% 5 mg 0.39% 80 mg 2.59% 5 mg 0.79% 2.5 mg 0.87%
60 mg 0.31% 60 mg 0.25% 2 tablet 0.54% 60 mg 0.47% 30 mg 0.17%
30 mg 0.11% 0.5 tablet 0.10% 5 mg 0.10% 30 mg 0.15% 0.5 tablet 0.15%
0.5 tablet 0.10% 30 mg 0.10% 30 mg <0.1% 2 tablet <0.1% 15 mg <0.1%
2 tablet <0.1% .5 tablet <0.1% 5-10 mg <0.1% 0.5 tablet <0.1% 7.5 mg <0.1%
1.5 tablet <0.1% 15 mg <0.1% 1.5 tablet <0.1% .5 tablet <0.1% 60 mg <0.1%
.5 tablet <0.1% 40-80 mg <0.1% 3 tablet <0.1% 15 mg <0.1% .5 tablet <0.1%
15 mg <0.1% 1.5 tablet <0.1% 0.5 tablet <0.1% 1.5 tablet <0.1% 80 mg <0.1%
.5 mg <0.1% 2 tablet <0.1% 1 capsule <0.1% 3 tablet <0.1% 2 tablet <0.1%
40-80 mg <0.1% 10-20 mg <0.1% 65 mg <0.1% 10-20 mg <0.1% 10-20 mg <0.1%
79 mg <0.1% 90 mg <0.1% 5-10 mg <0.1% 11 mg <0.1%
1 mg <0.1% 3 mg <0.1% 1 mg <0.1% 2.545 mg <0.1%
3 tablet <0.1% 50 mg <0.1% 40-80 mg <0.1% 4 mg <0.1%
2 mg <0.1% .5 mg <0.1% 1 mg <0.1%
2.5 mg <0.1% .5-1 tablet <0.1% 1.5 tablet <0.1%
20 tablet <0.1% 1 mg <0.1% 2 mg <0.1%
20-40 mg <0.1% 20 tablet <0.1% 20-40 mg <0.1%
.5-1 tablet <0.1% 20-40 mg <0.1% 4 tablet <0.1%
3 mg <0.1% 5-10 mg <0.1% 90 mg <0.1%
4 mg <0.1% 80 tablet <0.1%
50 mg <0.1% 9 mg <0.1%
7 mg <0.1%

Table 2.2:

Dosage Distribution of Statin Combination Drugs

Ezetimibe_Simvastatin Amlodipine_Atorvastatin Lovastatin_Niacin Niacin_Simvastatin
1 tablet 92.26% 1 tablet 77.33% 1 tablet 57.14% 1 tablet 68.18%
10-40 mg 2.90% 10-20 mg 8.78% 1000-20 mg <15% 2 tablet <40%
10-80 mg 1.85% 10-40 mg 3.11% .5 tablet <15%
.5 tablet 1.39% 5-10 mg 3.11% 500-20 mg <15%
10-20 mg 0.96% 5-40 mg <3% 2 tablet <15%
10-10 mg 0.46% 5-20 mg <3%
30 tablet <0.4% .5 tablet <3%
10 tablet <0.4% 10-10 mg <3%
20 tablet <0.4% 5-20 capsule <3% 80 mg <3% 10-80 mg <3%
80 mg <3%
10-80 mg <3%

Following the protocol described in the methods section, we made all necessary changes and were left with usable dosage amounts, all in mg. We then evaluated the changes by comparing the distribution of drug dosages before and after they were introduced by our pre-processing protocol; table 8 below presents the new dosage distribution after implementing the pre-processing protocol, whereas table 7 shows the dosage distributions from the initial medication table.

Table 8:

Statin drug dosage distribution after pre-processing

Simvastatin | Atorvastatin | Lovastatin | Pravastatin | Rosuvastatin | Fluvastatin | Pitavastatin
40 mg 43.13% | 40 mg 41.15% | 40 mg 50.93% | 40 mg 41.59% | 40 mg 36.41% | 40 mg 57.97% | 2 mg 50%
20 mg 35.19% | 20 mg 22.45% | 20 mg 35.04% | 20 mg 27.17% | 20 mg 29.77% | 20 mg 28.99% | 4 mg 32.43%
80 mg 10.52% | 80 mg 22.06% | 10 mg 8.01% | 80 mg 17.52% | 10 mg 20.83% | 80 mg 13.04% | 1 mg 17.57%
10 mg 9.82% | 10 mg 13.58% | 80 mg 3.11% | 10 mg 12.28% | 5 mg 11.88% | |
5 mg 0.92% | 5 mg 0.41% | 60 mg 2.81% | 5 mg 0.8% | 2.5 mg 0.9% | |
60 mg 0.31% | 60 mg 0.25% | 5 mg 0.1% | 60 mg 0.49% | 30 mg 0.18% | |
30 mg 0.11% | 30 mg 0.1% | | 30 mg 0.16% | | |

Table 7:

Initial Statin Drug Dosage Distributions

Simvastatin | Atorvastatin | Lovastatin | Pravastatin | Rosuvastatin | Fluvastatin | Pitavastatin
40 mg 41.84% | 40 mg 41.03% | 40 mg 50.85% | 40 mg 41.05% | 40 mg 35.48% | 40 mg 53.01% | 2 mg 50%
20 mg 36.06% | 20 mg 22.45% | 20 mg 35.40% | 20 mg 27.40% | 20 mg 30.05% | 20 mg 35.54% | 4 mg 32.27%
80 mg 10.52% | 80 mg 22.18% | 10 mg 8.11% | 80 mg 17.50% | 10 mg 21.28% | 80 mg 11.44% | 1 mg 17.72%
10 mg 10.16% | 10 mg 13.58% | 60 mg 2.81% | 10 mg 12.56% | 5 mg 12.08% | |
5 mg 0.96% | 5 mg 0.39% | 80 mg 2.71% | 5 mg 0.82% | 2.5 mg 0.91% | |
60 mg 0.33% | 60 mg 0.25% | 5 mg 0.10% | 60 mg 0.49% | 30 mg 0.17% | |
30 mg 0.11% | 30 mg 0.10% | | 30 mg 0.16% | | |

The biggest difference between the two tables is found in the distribution of the statin drug Fluvastatin, with a difference margin of about 6 percentage points; otherwise, the distribution of dosage amounts for every other statin drug is similar. This indicates that the pre-processing protocol did not alter the internal distribution of medication dosage while improving the quality of the medication dosage data for our data mining purpose. We also checked the validity of the pre-processing protocol by categorizing medication dosage based on the guideline in Grundy et al.11. The guideline categorizes statin drug dosages into low, moderate, and high intensity for treatment consideration. Accordingly, we converted statin drug dosages for the initial and new tables into low, moderate, and high intensity categories, as shown in table 9 below, and compared their distributions. The only noticeable difference was found with Fluvastatin, with a difference margin of less than 2%, confirming no structural change in dosage distribution after the data quality improvement steps.

Table 9:

Statin Drug Dose Pre- and Post-Standardization

Drugs | Original Dosage | New Dosage
Simvastatin | moderate 78.02% | moderate 78.42%
Simvastatin | low 11.12% | low 10.74%
Simvastatin | high 10.85% | high 10.83%
Atorvastatin | moderate 36.13% | moderate 36.12%
Atorvastatin | low 0.39% | low 0.40%
Atorvastatin | high 63.46% | high 63.46%
Lovastatin | moderate 56.37% | moderate 56.84%
Lovastatin | low 43.62% | low 43.15%
Pravastatin | moderate 59.21% | moderate 59.75%
Pravastatin | low 40.78% | low 40.24%
Rosuvastatin | moderate 33.36% | moderate 32.72%
Rosuvastatin | low 0.91% | low 0.90%
Rosuvastatin | high 65.71% | high 66.37%
Fluvastatin | moderate 11.44% | moderate 13.043%
Fluvastatin | low 88.55% | low 86.95%
Pitavastatin | moderate 100% | moderate 100%
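The intensity comparison can be illustrated with a short Python sketch. The cutoffs in `INTENSITY_CUTOFFS` are an assumption: they were inferred by matching the mg shares in table 7 against the intensity shares in table 9, and are not quoted from the Grundy et al. guideline. The Fluvastatin shares used in the example are taken from tables 7 and 8.

```python
INF = float("inf")

# (highest mg counted as low, highest mg counted as moderate); inferred, not authoritative.
INTENSITY_CUTOFFS = {
    "simvastatin": (10, 40),
    "atorvastatin": (5, 30),
    "rosuvastatin": (2.5, 10),
    "pravastatin": (20, INF),
    "lovastatin": (20, INF),
    "fluvastatin": (40, INF),
    "pitavastatin": (0, INF),
}


def intensity(drug, dose_mg):
    """Map a mg dose to 'low', 'moderate', or 'high' for the given statin."""
    low_max, moderate_max = INTENSITY_CUTOFFS[drug]
    if dose_mg <= low_max:
        return "low"
    if dose_mg <= moderate_max:
        return "moderate"
    return "high"


def intensity_shares(dose_shares, drug):
    """Collapse {dose_mg: percent} into {intensity: percent}."""
    shares = {}
    for dose_mg, percent in dose_shares.items():
        level = intensity(drug, dose_mg)
        shares[level] = shares.get(level, 0.0) + percent
    return shares


def max_shift(before, after):
    """Largest absolute change, in percentage points, across intensity levels."""
    levels = set(before) | set(after)
    return max(abs(before.get(level, 0.0) - after.get(level, 0.0)) for level in levels)


fluvastatin_before = {40: 53.01, 20: 35.54, 80: 11.44}  # table 7
fluvastatin_after = {40: 57.97, 20: 28.99, 80: 13.04}   # table 8
before = intensity_shares(fluvastatin_before, "fluvastatin")
after = intensity_shares(fluvastatin_after, "fluvastatin")
print(before, after)
print(max_shift(before, after))  # roughly 1.6 percentage points
```

Under these inferred cutoffs the Fluvastatin shift stays below 2 percentage points, which is consistent with the margin reported in the text.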

In the same vein, pMPR were computed for all patients in the study population to assess medication adherence. 87% of the study population had pMPR above 0.8. These metrics suggested that the study population was adherent to the statin drugs studied. However, the computed pMPR was likely higher than typical MPR since the only available data was on medication orders and did not reflect actual prescription fill data typically used for MPR calculations.

To understand why some patients had pMPR below the threshold, we compared the group of pMPR below .4 to the rest of patients with pMPR below 0.8. We created five (5) different categories of medication discontinuation reasons based on fifty different medication discontinuation reasons in the study population charted data. This categorization which was based on discontinuity similarity as attested by the present authors, was a way to group medication discontinuation reasons into categories that would better explain medication adverse events occurrence. In table 4, we show the percentage of occurrence of each category as well as the percentage of occurrences within each category of medication discontinuation reason. In more than 87% of the cases, the medication was discontinued for reordering reasons; stopped drugs, administrative, and adjustment were the next most frequent.

Table 4:

Discontinuation Categories Percentage Totals with Discontinuation Reasons

Discontinue Category | Discontinuation Reason | % (total) | Discontinue Category | Discontinuation Reason | % (total)
Reorder | | 87.61% | Admin | | 3.29%
 | Reorder | 100% | | Medication Reconciliation Clean Up | 31.10%
StoppedDrug | | 5.99% | | Duplicate | 23.45%
 | Stopped by Patient | 30.22% | | Stopped Pre-Admission or entry em | 15.29%
 | Alternate therapy | 27.17% | | Error | 8.74%
 | Therapy completed | 16.76% | | Formulary change | 5.96%
 | Discontinued by encounter’s provider | 9.76% | | Cost of medication | 5.42%
 | Discontinued by other Health Provider | 8.82% | | Erroneous Entry | 2.83%
 | Not Needed | 2.93% | | Not Covered by Insurance | 2.80%
 | Rx not filled by Patient | 1.61% | | Cost/Formulary change | 1.90%
 | Medication Failed | 1.52% | | Pharmacy Medication Reconciliation | 0.90%
 | Not filled/taken by Patient | 0.91% | | Appointment needed for refills | 0.74%
 | Not Effective | 0.25% | | Unavailable | 0.35%
 | OTHER | <0.2% | | Per Nursing Home MAR | 0.24%
 | Med D/C’d | <0.2% | | Availability | 0.15%
 | Paradoxical response | <0.2% | | Stopped during discharge readmit | 0.09%
 | Med Changed | <0.2% | | Contact Move - Error | <0.09%
SideEffect | | 0.90% | | Out of medication | <0.09%
 | Side effects | 87.85% | Adjust | | 2.09%
 | Allergic response | 6.73% | | Dose adjustment | 99.99%
 | Contraindicated | 4.23% | | zRx Age adjusted dose change | <.1%
 | Contraindicated/pregnancy | 0.97% | Discharge | | 0.12%
 | Medication Recalled | <0.9% | | Stop at Discharge | 99.77%
 | Pregnancy | <0.9% | | Patient Discharge | <.1%

The investigation of medication discontinuation reasons was instrumental in showing that a low pseudo medication possession ratio occurs as a result of medication non-adherence. First, we computed the percentage of occurrence of discontinuation categories within each pMPR group, as illustrated in table 5 (pMPR <= 0.4 vs. 0.6 < pMPR <= 0.8). The main differences appear in the discontinuation categories Reorder and StoppedDrug. More than 88% of high pMPR patients reordered their medications, as opposed to 60% of low pMPR patients.

Table 5:

Drug discontinuation in high versus low MPR subjects

Reason for Discontinuation   High MPR Subjects   Low MPR Subjects
Reorder                      88.63%              60.62%
StoppedDrug                  5.37%               22.87%
Admin                        3.15%               6.91%
Adjust                       2.01%               3.58%
SideEffect                   0.75%               5.58%
Discharge                    0.09%               0.44%
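
The per-group breakdown in Table 5 can be sketched as below. This is an illustrative sketch under stated assumptions: the cut-offs are the ones quoted in the text (pMPR <= 0.4 and 0.6 < pMPR <= 0.8), the labels "low" and "high" and the function names `pmpr_group` and `category_share_by_group` are hypothetical, and the records are toy values rather than study data.

```python
from collections import Counter, defaultdict


def pmpr_group(pmpr):
    """Assign a comparison group using the cut-offs quoted in the text."""
    if pmpr <= 0.4:
        return "low"
    if 0.6 < pmpr <= 0.8:
        return "high"
    return None  # outside both comparison groups


def category_share_by_group(records):
    """records: iterable of (pmpr, category) pairs, one per discontinuation.

    Returns {group: {category: percentage of that group's records}}.
    """
    counts = defaultdict(Counter)
    for pmpr, category in records:
        group = pmpr_group(pmpr)
        if group is not None:
            counts[group][category] += 1
    return {
        group: {c: 100.0 * n / sum(cats.values()) for c, n in cats.items()}
        for group, cats in counts.items()
    }


# Toy records for illustration only (not study data).
toy_records = [
    (0.75, "Reorder"), (0.70, "Reorder"), (0.65, "Reorder"), (0.80, "StoppedDrug"),
    (0.30, "Reorder"), (0.10, "StoppedDrug"), (0.40, "StoppedDrug"), (0.25, "SideEffect"),
    (0.50, "Reorder"),  # falls between the two groups and is ignored
]
shares = category_share_by_group(toy_records)
```

Each group's percentages are normalized by that group's own record count, so the columns can be compared directly even when the groups differ in size.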

Additionally, only about 5% of high pMPR patients stopped their drugs versus more than 22% of low pMPR patients. Non-adherence therefore appears to be the main difference between the discontinuation category percentages of the two groups. We then analyzed the StoppedDrug discontinuation category to find which discontinuation reasons accounted for this discrepancy.

As shown in Table 6 below, 28% of high pMPR patients used alternative therapy while only 14% of low pMPR patients did; conversely, more than 43% of low pMPR patients stopped their medication on their own compared to 28% of high pMPR patients. Thus, high pMPR patients more often moved to alternative therapies, whereas low pMPR patients more often stopped using their statin medication unilaterally.

Table 6:

Drug Discontinuation Categories by Percent of High and Low MPR Subjects

Discontinuation Reason Category              % of High MPR Subjects   % of Low MPR Subjects
Alternate therapy                            23.93%                   14.34%
Stopped by Patient                           28.03%                   43.67%
Therapy completed                            16.80%                   16.49%
Discontinued by encounter’s provider         10.26%                   6.27%
Discontinued by other Health Care Provider   8.89%                    8.64%
Not Needed                                   2.37%                    3.23%
Medication Failed                            1.61%                    <1%
Rx not filled by Patient                     1.28%                    4.35%
Not filled/taken by Patient                  0.36%                    1.72%
Not Effective                                0.28%                    <1%
OTHER                                        <0.28%
Med D/C’d                                    <0.28%
Med Changed                                  <0.28%
Paradoxical response                         <0.28%                   <1%
Overall Percentage who stopped drug          5.40%                    22.90%
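
As a worked piece of arithmetic, the within-category percentages in Table 6 can be combined with the overall StoppedDrug share to estimate how much of each group's total discontinuations were patient-initiated stops. This sketch assumes that the Table 6 percentages are shares within the StoppedDrug category and that the "Overall Percentage who stopped drug" row is the category's share of all discontinuations in that group; the variable and function names are hypothetical.

```python
# Percentages as printed in Table 6.
stopped_drug_share = {"high": 5.40, "low": 22.90}   # % of all discontinuations
stopped_by_patient = {"high": 28.03, "low": 43.67}  # % within StoppedDrug


def overall_share(category_pct, within_pct):
    """Share of all discontinuations, given a category share and a within-category share."""
    return category_pct * within_pct / 100.0


patient_stop_overall = {
    group: overall_share(stopped_drug_share[group], stopped_by_patient[group])
    for group in stopped_drug_share
}
ratio_low_to_high = patient_stop_overall["low"] / patient_stop_overall["high"]

for group, pct in patient_stop_overall.items():
    print(f"{group} pMPR: {pct:.2f}% of all discontinuations were stopped by the patient")
print(f"low-to-high ratio: {ratio_low_to_high:.1f}")
```

Under these assumptions, patient-initiated stops make up roughly 1.5% of all discontinuations in the high pMPR group and roughly 10% in the low pMPR group.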

Discussion

Integrated data repositories housed in academic research centers can hold millions of patient records from diverse sources, providing an information-rich resource for researchers. However, this opportunity has given rise to another challenge: extracting and preparing data from IDRs for research purposes. Data cleaning and quality improvement are often more time consuming than the later analysis and modeling phases. This article uses the medication data in the CDR as a use case to develop practical steps for clinical data cohort quality improvement. We demonstrated ways to perform data cohort quality checks based on the concepts of correctness, completeness, and currency, while also improving data quality when pre-processing medication dose amounts and units to fit our modeling scheme. Quality improvement of the medication data was achieved by applying key concepts such as the medication possession ratio, medication discontinuation reasons, and medication dosage categorization. The quality of each data element in the medication table that was useful to our modeling purpose was assessed, including the percentage of data populated in each data column as well as correctness, completeness, and currency. The medication dosage information was also pre-processed to ensure that the dosage format is amenable to modeling paradigms; we created dosage categories and compared the distribution of dosages before and after modifications were made to the dosage amounts and units.

However, this effort has a number of limitations. It is focused on electronic medical record data for medication related information; as a result, the observations do not necessarily extend to other areas of medical information. In addition, the data for the project are derived from a single health system with local and regional practice components whose medication and documentation patterns may be unique. The focus of the project on statin medications also limits the types of medications that were reviewed, and the findings may not apply to other therapeutic or drug classes that do not have similar dosage forms, drug combinations, or patterns of use.

Nonetheless, data quality assessment and data source integration have emerged as major topics in the health care literature. Weiskopf & Weng17 reviewed the clinical research literature on data quality assessment methodology for the reuse of electronic health record (EHR) data in research. They concluded that if the reuse of EHR data for clinical research is to become accepted, researchers should adopt validated, systematic methods of EHR data quality assessment. Their conclusion speaks to the importance of developing a systematic and accepted methodology for assessing and improving EHR data quality; the current paper incorporates these recommendations. Clinical data cohort pre-processing and quality improvement are equally important. Chi et al.6 discussed data cohort extraction from IDRs to facilitate machine learning; while they mentioned the large public and private databases available to researchers, they elaborated more on emerging models based on partnerships that make “primary” use corporate data available for secondary data analysis. The overall approach to clinical data pre-processing using the medication use case is summarized in Appendix I, which can provide a framework for similar future efforts.

The AHC-IE data shelter rests on a collaborative agreement between the University of Minnesota Academic Health Center and Fairview Health Services to support the joint mission of improving patient care and supporting health care research and education. This collaborative work reflects the current trend of aggregating data sources through system interoperability, integrated data sources, and collaboration between healthcare organizations and research institutions.

Future work from this study will expand on clinical data cohort preparation for machine learning tasks that predict adverse event outcomes and will continue to develop best practice approaches in clinical data cohort quality improvement.

Conclusion

Our study demonstrated practical steps for clinical data cohort quality improvement using the medication table in the AHC-IE data shelter within the University of Minnesota’s Clinical Data Repository. In many regards, we sought to illustrate a best practice approach in clinical data cohort quality improvement for any data mining task.

Figures & Table

Appendix A:

References

  • 1. Aguilar-Salinas C, Zubirán R. Faculty opinions recommendation of interpretation of the evidence for the efficacy and safety of statin therapy. Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature. 2016.
  • 2. American Heart Association. Cholesterol medications [Internet]. www.heart.org. AHA; 2020 [cited 2021 Dec 31]. Available from: https://www.heart.org/en/health-topics/cholesterol/prevention-and-treatment-of-high-cholesterol-hyperlipidemia/cholesterol-medications
  • 3. Bates TR, Connaughton VM, Watts GF. Non-adherence to statin therapy: A major challenge for Preventive Cardiology. Expert Opinion on Pharmacotherapy. 2009;10(18):2973–85. doi: 10.1517/14656560903376186.
  • 4. Bibbins-Domingo K, Grossman DC, Curry SJ, Davidson KW, Epling JW, García FA, et al. Statin use for the primary prevention of cardiovascular disease in adults. JAMA. 2016;316(19):1997. doi: 10.1001/jama.2016.15450.
  • 5. CDC. Heart disease facts [Internet]. Centers for Disease Control and Prevention; 2021 [cited 2021 Dec 31]. Available from: https://www.cdc.gov/heartdisease/facts.htm
  • 6. Chi C-L, Wang J, Clancy TR, Robinson JG, Tonellato PJ, Adam TJ. Big data cohort extraction to facilitate machine learning to improve statin treatment. Western Journal of Nursing Research. 2016;39(1):42–62. doi: 10.1177/0193945916673059.
  • 7. Clinical & Translational Science Institute U of M. Clinical Data Repository [Internet]. Clinical and Translational Science Institute - University of Minnesota. 2021 [cited 2021 Dec 31]. Available from: https://ctsi.umn.edu/services/data-informatics/clinical-data-repository
  • 8. Cohen JD, Brinton EA, Ito MK, Jacobson TA. Understanding statin use in America and gaps in patient education (usage): An internet-based survey of 10,138 current and former statin users. Journal of Clinical Lipidology. 2012;6(3):208–15. doi: 10.1016/j.jacl.2012.03.003.
  • 9. Croom K. A summary of the National Institute for Health and Clinical Excellence (NICE) guidelines on lipid modification. Drugs in Context. 2008;4:1–8.
  • 10. Grundy SM, Stone NJ, Bailey AL, Beam C, Birtcher KK, Blumenthal RS, et al. 2018 AHA/ACC/AACVPR/AAPA/ABC/ACPM/ADA/AGS/APHA/ASPC/NLA/PCNA guideline on the management of blood cholesterol: A report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines [Internet]. Circulation. U.S. National Library of Medicine; 2019 [cited 2021 Dec 31]. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7403606/
  • 11. Grundy SM, Stone NJ, Bailey AL, Beam C, Birtcher KK, Blumenthal RS, et al. 2018 AHA/ACC/AACVPR/AAPA/ABC/ACPM/ADA/AGS/APHA/ASPC/NLA/PCNA guideline on the management of blood cholesterol: A report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. Circulation. 2019;139(25). doi: 10.1161/CIR.0000000000000625.
  • 12. Law MR. Quantifying effect of statins on low density lipoprotein cholesterol, ischaemic heart disease, and stroke: Systematic review and meta-analysis. BMJ. 2003;326(7404):1423. doi: 10.1136/bmj.326.7404.1423.
  • 13. Rolnick SJ, Pawloski PA, Hedblom BD, Asche SE, Bruzek RJ. Patient characteristics associated with medication adherence. Clinical Medicine & Research. 2013;11(2):54–65. doi: 10.3121/cmr.2013.1113.
  • 14. Sperber C, Samarasinghe SR, Lomax GP. An upper and lower bound of the medication possession ratio. Patient Preference and Adherence. 2017;11:1469–78. doi: 10.2147/PPA.S136890.
  • 15. Virani SS, Alonso A, Aparicio HJ, Benjamin EJ, Bittencourt MS, Callaway CW, et al. Heart disease and stroke statistics-2021 update. Circulation. 2021;143(8). doi: 10.1161/CIR.0000000000000950.
  • 16. Weiner MG. Toward reuse of clinical data for research and Quality Improvement: The end of the beginning? Annals of Internal Medicine. 2009;151(5):359. doi: 10.7326/0003-4819-151-5-200909010-00141.
  • 17. Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: Enabling Reuse for Clinical Research. Journal of the American Medical Informatics Association. 2013;20(1):144–51. doi: 10.1136/amiajnl-2011-000681.
  • 18. WHO. Cardiovascular diseases (heart attack, stroke) [Internet]. World Health Organization; 2021 [cited 2021 Dec 31]. Available from: https://www.who.int/westernpacific/health-topics/cardiovascular-diseases
  • 19. Wisher D. Martindale: The Complete Drug Reference. 37th ed. Journal of the Medical Library Association: JMLA. 2012;100(1):75–6.

Articles from AMIA Summits on Translational Science Proceedings are provided here courtesy of American Medical Informatics Association
