Author manuscript; available in PMC: 2022 May 23.
Published in final edited form as: J Clin Monit Comput. 2021 Feb 8;36(2):397–405. doi: 10.1007/s10877-021-00664-6

Accuracy of identifying hospital acquired venous thromboembolism by administrative coding: implications for big data and machine learning research

Tiffany Pellathy 1, Melissa Saul 2, Gilles Clermont 2, Artur W Dubrawski 3, Michael R Pinsky 2, Marilyn Hravnak 1
PMCID: PMC8349368  NIHMSID: NIHMS1685896  PMID: 33558981

Abstract

Big data analytics research using heterogeneous electronic health record (EHR) data requires accurate identification of disease phenotype cases and controls. Overreliance on ground truth determination based on administrative data can lead to biased and inaccurate findings. Hospital-acquired venous thromboembolism (HA-VTE) is challenging to identify due to its temporal evolution and variable EHR documentation. To establish ground truth for machine learning modeling, we compared accuracy of HA-VTE diagnoses made by administrative coding to manual review of gold standard diagnostic test results. We performed retrospective analysis of EHR data on 3680 adult stepdown unit patients identifying HA-VTE. International Classification of Diseases, Ninth Revision (ICD-9-CM) codes for VTE were identified. 4544 radiology reports associated with VTE diagnostic tests were screened using terminology extraction and then manually reviewed by a clinical expert to confirm diagnosis. Of 415 cases with ICD-9-CM codes for VTE, 219 were identified with acute onset type codes. Test report review identified 158 new-onset HA-VTE cases. Only 40% of ICD-9-CM coded cases (n = 87) were confirmed by a positive diagnostic test report, leaving the majority of administratively coded cases unsubstantiated by confirmatory diagnostic test. Additionally, 45% of diagnostic test confirmed HA-VTE cases lacked corresponding ICD codes. ICD-9-CM coding missed diagnostic test-confirmed HA-VTE cases and inaccurately assigned cases without confirmed VTE, suggesting dependence on administrative coding leads to inaccurate HA-VTE phenotyping. Alternative methods to develop more sensitive and specific VTE phenotype solutions portable across EHR vendor data are needed to support case-finding in big-data analytics.

Keywords: Administrative coding, Venous thromboembolism, Big data analytics, Electronic health record data, Phenotyping, Machine learning

1. Introduction and purpose

Healthcare big data refers to electronic health record (EHR) clinical and administrative data that are massive in volume, in the velocity at which they are generated, and the variety (structured, semi-structured, and unstructured) of data types and sources of collection [1]. Clinical databases are valuable in that they contain longitudinal patient data, collected under natural (non-experimental) circumstances, which are rich in information and can be iterated using computer-based mass screening approaches. These data, through the application of data science and machine learning (ML) approaches, hold promise for cost-effectively and efficiently evaluating, analyzing, and advancing practices and care protocols on a larger scale than can be conducted in traditional prospective clinical studies and trials [2]. ML methodologies have the ability to scale up correlational analyses to highly multivariate, high-frequency data to discern emerging complex patterns and relationships associated with disease evolution or clinical deterioration [3]. These methods hold potential for breakthroughs in our ability to (1) identify complex mechanisms underlying disease; (2) inform improvements to existing explanatory models; and (3) generate new hypotheses for research [4]. Furthermore, the application of ML approaches can support better characterization and understanding of disease phenotypes, subgroups within a disease associated with heterogeneity of clinical presentations and/or treatment effects, enabling the data driven discovery of clinical and biologic features and patterns of disease to augment expert clinician knowledge [5, 6].

These advantages, however, must be balanced with the limitations and inherent complexities of big data. It is well established, and imperative to note, that data originally obtained for clinical purposes are not recorded with the same attention to precision and structure as research data [7]. A common challenge with research use of EHR data is the complexity of ensuring data veracity, especially with regard to accurately identifying patients and conditions of interest to create methodologically rigorous and reproducible cohorts [8].

Quality EHR-based research is predicated on the ability to efficiently identify and accurately annotate disease cases and controls from heterogeneous clinical data in big data repositories [9]. Because they will be used to train ML models and tools for clinical application, these annotations, referred to as ground truth, must accurately reflect clinical reality [8]. EHR data, however, are optimized for individual patients and healthcare reimbursement rather than aggregation across cohorts [10]. Although administrative diagnosis codes, International Classification of Diseases, Ninth and Tenth Revision (ICD-9-CM/ICD-10-CM), have been commonly used to identify specific disease phenotypes in clinical data, these codes were designed for billing purposes, are subject to coding bias, and have variable reported accuracy in identifying diagnoses of interest (e.g., cardiovascular, renal, obesity, orthopedic, hematologic, and infectious) in clinical data [11–18], rendering them unreliable for establishing ground truth [19–21].

New-onset, hospital acquired venous thromboembolism (HA-VTE), manifesting as deep vein thrombosis (DVT) or pulmonary embolism (PE), is a complex and potentially lethal disease process involving interactions between inherited risk factors and clinical and/or hospital acquired susceptibilities to thrombosis, such as immobility and procedures that activate the inflammatory response system [22, 23]. Symptoms of HA-VTE are often non-specific, occur gradually over a period of hours to days, and can vary across individuals, making it a difficult condition to diagnose, even for experienced clinicians [24, 25]. A leading cause of preventable hospital death, carrying a high mortality risk and a national cost burden exceeding $7 billion annually [26], HA-VTE is recognized as a leading failure to rescue (FTR) complication [27, 28] and a major public health issue subject to surveillance and reporting by multiple government agencies (i.e., Agency for Healthcare Research and Quality [AHRQ], the Joint Commission, and the Centers for Medicare and Medicaid Services [CMS]). Accurate quality reporting and research to improve VTE identification and management are needed; however, new-onset HA-VTE has been identified as a challenging disease phenotype to identify and extract from clinical databases due to its temporal nature and variety of presentations [9]. Timely identification and intervention are critical to FTR complication prevention [27, 29, 30]; however, in a substantial proportion of patients, accurate identification of HA-VTE is delayed, leading to increased rates of morbidity and mortality [31, 32]. The potential for using ML approaches to plumb the EHR for cases allowing characterization of the prodrome and presenting signatures of HA-VTE is attractive.

To establish ground truth HA-VTE case outcomes for ML modeling, we compared the accuracy of identifying cases of HA-VTE in adult step-down unit (SDU) clinical data by ICD-9-CM codes to identification by manual review of results of gold standard tests for diagnosing VTE: lower extremity Doppler ultrasound (LEDUS), computed tomographic angiography (CTA), ventilation-perfusion scan (VQ) and/or magnetic resonance angiography (MRA) [32, 33]. Such information should help to inform accurate case-finding in research involving clinical data.

2. Methods

A retrospective analysis of adult SDU clinical data was conducted.

2.1. Setting and sample

The study sample was obtained from data previously collected for the Predicting Patient Instability Non-Invasively for Nursing Care (PPINNC) study (R01 NR01391). All were adult patients admitted with a need for a monitored bed in the study SDU of the University of Pittsburgh Medical Center between November 2006 and September 2008. Clinical and administrative data from all patients in this Level-1 trauma center adult medical-surgical-trauma SDU were extracted from the hospital Medical Archival Retrieval System (MARS). Information from the hospital’s electronic clinical, administrative, and financial databases is forwarded and stored in the MARS repository [34].

Under Institutional Review Board approval for waiver for informed consent, every patient, age ≥ 21 years, admitted to the 22-bed SDU during the study timeframe contributed to the original data without exclusion. No special classes of patients were excluded. These practices yielded a total census convenience sample of 3680 patient admissions.

2.2. Variables and procedure

Variables included VTE cases, either PE and/or DVT. We defined DVT as a venous thrombosis in the internal jugular, superior or inferior vena cava, iliac, femoral or popliteal, and/or calf (posterior and anterior tibial, peroneal, soleal, or gastrocnemius) veins. We defined PE as a clot in a main, lobar, segmental and/or sub-segmental pulmonary artery. Our procedure for identifying HA-VTE cases in the sample is detailed below.

2.3. Administrative coding

ICD-9-CM administrative coding for this sample was reviewed to identify codes associated with VTE (PE or DVT) (Table 1). As we aimed to define a cohort of patients who developed VTE acutely during hospitalization, cases with VTE codes reflective of chronic venous thrombotic disease were then excluded for presence of pre-existing disease, and those with acute disease not present on admission were retained for further analysis. Although only ICD-9-CM codes were collected in this sample, the corresponding, more contemporary ICD-10-CM codes for each diagnosis are also listed in Table 1 for more recent understanding.

Table 1.

ICD-9-CM codes associated with new-onset and chronic venous thromboembolism (VTE)

| Diagnosis | ICD-9-CM codes [64–67] | ICD-10-CM codes [68] |
| --- | --- | --- |
| Acute event | | |
| Pulmonary embolism (PE) | 415.1x | I26.0x, I26.9x |
| Lower extremity deep vein thrombosis (DVT) | 451.1x, 451.2, 451.81, 451.9, 453.4x, 453.8, 453.9 | I80.1x, I80.2x, I82.4x |
| Venous embolism and thromboembolism of vena cava | 453.2 | I82.210, I82.220 |
| Venous embolism and thrombosis of internal jugular | 453.86 | I82.C1x |
| Chronic disease | | |
| PE | 416.2 | I27.82, I27.24 |
| Lower extremity DVT | 453.3x, 453.5x | I82.5x |
| Venous embolism and thromboembolism of vena cava | 453.77 | I82.210, I82.221 |
| Venous embolism and thrombosis of internal jugular | 453.76 | I82.C2x |

Corresponding ICD-10 codes are also provided for contemporary understanding and application
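
The code lists in Table 1 lend themselves to a simple lookup that sorts coded diagnoses into acute and chronic VTE. The Python sketch below is illustrative only and is not the procedure used in this study. It assumes that a trailing "x" in Table 1 stands for any further digits and that codes without an "x" must match exactly; the function name `classify_icd9` is hypothetical. Present-on-admission status is not carried by these codes and would need to be handled separately.

```python
import re

# Code patterns transcribed from Table 1 (ICD-9-CM column).
ACUTE_PATTERNS = [
    "415.1x",                               # pulmonary embolism
    "451.1x", "451.2", "451.81", "451.9",   # lower extremity DVT
    "453.4x", "453.8", "453.9",
    "453.2",                                # vena cava
    "453.86",                               # internal jugular
]
CHRONIC_PATTERNS = ["416.2", "453.3x", "453.5x", "453.77", "453.76"]


def _compile(pattern):
    """Assumption: a trailing 'x' means any further digits; otherwise exact match."""
    if pattern.endswith("x"):
        return re.compile(re.escape(pattern[:-1]) + r"\d*$")
    return re.compile(re.escape(pattern) + r"$")


ACUTE = [_compile(p) for p in ACUTE_PATTERNS]
CHRONIC = [_compile(p) for p in CHRONIC_PATTERNS]


def classify_icd9(code):
    """Return 'acute', 'chronic', or 'not_vte' for one ICD-9-CM code string."""
    code = code.strip()
    if any(rx.match(code) for rx in ACUTE):
        return "acute"
    if any(rx.match(code) for rx in CHRONIC):
        return "chronic"
    return "not_vte"


for example in ["415.11", "453.40", "453.50", "428.0"]:
    print(example, classify_icd9(example))
```

Exact matching for codes without an "x" keeps 453.8 (acute) distinct from 453.86 (acute, internal jugular) and from the chronic 453.7x codes.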

2.4. Radiology reports

All radiologic reports for the gold standard VTE diagnostic tests (LEDUS, CTA, VQ scan, and/or MRA) performed at any point during the participants’ hospital stay were extracted from the hospital’s MARS. The unstructured (free text) “Impression” section of each report was reviewed to determine diagnosis of acute, HA-VTE cases, diagnosis of chronic, pre-existing VTE disease cases, as well as to identify cases in which the diagnosis of acute, HA-VTE was definitively ruled out. Our protocol consisted of preliminary screening of the diagnostic test reports by a bioinformatics expert (MS), using terminology extraction to search for the following basic terms/word strings determined based on clinical domain expertise: acute, embolism, emboli, embolus, pulmonary embolism, PE, DVT, thrombus, thrombosis, thromboembolism, thromboemboli, thrombotic disease, deep venous thrombosis, deep vein thrombosis, clot, and results called. This review was helpful in sorting the large volume of report results into the following categories: (1) positive VTE test report, (2) negative VTE test report, (3) test report indeterminate for VTE and (4) test report unrelated to VTE. This step was followed by manual expert review by one investigator (TP) of all test reports (from all the initially sorted categories), to identify positive and negative HA-VTE diagnostic test results. To clarify indeterminate radiologic test results, unstructured progress notes were additionally extracted and then manually reviewed by the same clinical expert (TP) as needed to derive a positive or negative diagnosis. At each screening stage, the primary investigator reviewed and discussed findings with a team of clinical experts (MH, MRP, GC). Participants with radiologic test data negative for VTE or indicative of chronic VTE were set aside, and only gold-standard diagnostic test cases positive for HA-VTE were used for comparison with positive ICD-9-CM coded cases. 
Radiology report data included the date and time the test was conducted, affording the ability for precise temporal annotation of HA-VTE diagnosis for each case.
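
The preliminary screening step described above can be approximated with a keyword triage over the "Impression" text. The following Python sketch is a hedged illustration, not the tool used in the study. The screening terms are those listed in this section and the cue phrases are drawn from Table 2, but the sorting rules, the cue lists as used here, and the function name `triage` are assumptions. In the study, every report was still reviewed manually regardless of its preliminary category.

```python
import re

# Screening terms listed in Section 2.4.
SCREEN_TERMS = [
    "acute", "embolism", "emboli", "embolus", "pulmonary embolism", "PE",
    "DVT", "thrombus", "thrombosis", "thromboembolism", "thromboemboli",
    "thrombotic disease", "deep venous thrombosis", "deep vein thrombosis",
    "clot", "results called",
]
# Phrases reported in Table 2; using them as sorting cues is an assumption.
NEGATIVE_CUES = ["no evidence of", "negative evaluation for", "appeared patent"]
INDETERMINATE_CUES = [
    "cannot be excluded", "cannot be evaluated", "cannot be reliably ruled out",
    "non-diagnostic study", "repeat study suggested",
    "suboptimal vessel opacification", "respiratory motion artifact",
    "could not be visualized", "could not be evaluated",
]


def _term_regex(term):
    # Abbreviations (PE, DVT) are matched case-sensitively to limit false hits.
    flags = 0 if term.isupper() else re.IGNORECASE
    return re.compile(r"\b" + re.escape(term) + r"\b", flags)


TERM_REGEXES = [_term_regex(t) for t in SCREEN_TERMS]


def triage(impression):
    """Sort one Impression text into a preliminary category for manual review."""
    if not any(rx.search(impression) for rx in TERM_REGEXES):
        return "unrelated"
    text = impression.lower()
    if any(cue in text for cue in INDETERMINATE_CUES):
        return "indeterminate"
    if any(cue in text for cue in NEGATIVE_CUES):
        return "negative"
    return "positive"


print(triage("Positive for acute pulmonary embolism. Results were called to the team."))
print(triage("No evidence of deep venous thrombosis."))
```

Checking indeterminate cues before negative cues is a deliberate choice in this sketch: a report that hedges is routed to closer review rather than being labeled negative.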

3. Analysis

The number of new-onset HA-VTE cases identified by ICD-9-CM codes was compared with the number of cases of new-onset HA-VTE identified by definitive diagnostic test (VTE true positives). VTE ICD-9-CM codes without a corroborating definitive diagnostic test were identified as false positives. The positive predictive value (PPV) was then derived as true positives/(true positives + false positives) [35].

4. Results

As shown in Fig. 1, from the total sample, 415 participants (11%) with the relevant VTE ICD-9-CM codes were identified. After eliminating participants with present-on-admission VTE and chronic VTE identified by ICD-9-CM coding (n = 196), there remained 219 participants with ICD-9-CM codes for new-onset VTE. From 4544 VTE diagnostic test reports, 4386 did not contain language associated with new-onset VTE and were eliminated following review. Forty test reports were indeterminate for VTE. Manual review of progress and discharge notes confirmed 5 positive cases and 35 negative cases, leaving 158 diagnostic tests with language confirming a definitive diagnosis of new-onset VTE. Of the 219 ICD-9-CM new-onset VTE cases, only 87 (40%) were also confirmed by a true positive diagnostic test, leaving 132 (60%) positively coded cases never corroborated by testing (Fig. 1). Furthermore, there were 71 VTE positive cases confirmed by a gold standard diagnostic test alone that had no associated ICD-9-CM codes. ICD-9-CM coding for HA-VTE therefore had a positive predictive value (PPV) of 40% (87/219) and a sensitivity of only 55% (87/158).
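
The PPV and sensitivity reported above follow directly from the three counts in this section. The short Python sketch below reproduces that arithmetic; the variable and function names are illustrative, and treating test-confirmed cases without a code as false negatives is the reading implied by the reported sensitivity.

```python
def ppv(true_pos, false_pos):
    """Positive predictive value: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos, false_neg):
    """Sensitivity: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


coded_new_onset = 219   # ICD-9-CM coded new-onset VTE cases
test_confirmed = 158    # cases confirmed by gold standard diagnostic test
both = 87               # coded AND test-confirmed (true positives)

false_pos = coded_new_onset - both   # coded but never corroborated by testing
false_neg = test_confirmed - both    # test-confirmed but not coded

print(f"False positives: {false_pos}, false negatives: {false_neg}")
print(f"PPV: {ppv(both, false_pos):.1%}")
print(f"Sensitivity: {sensitivity(both, false_neg):.1%}")
```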

Fig. 1.

New-onset, hospital acquired venous thromboembolism (HA-VTE) case identification process

Manual review of radiology reports generated a list of additional key terms frequently found to be associated with HA-VTE positive, negative, and indeterminate diagnoses, and those are listed in Table 2.

Table 2.

Terminology associated with hospital acquired venous thromboembolism (HA-VTE) diagnosis in radiology reports

Radiology test results Preceding terms Common diagnosis terms Subsequent terms
Hospital acquired pulmonary embolism (PE)
 Positive identifiers Definitive segmental Pulmonary Clinical service was notified
Positive examination for PE were communicated to
Positive for Embolism, embolus, emboli were called to
Filling defect(s) consistent with Occlusive/Occluded Thrombus, thromboembolism, thromboemboli
Sub-occlusive Acute
Multiple Clot
 Negative identifiers Negative evaluation for
No large central or main
 Indeterminate identifiers Questionable Cannot be excluded
Non-diagnostic study Cannot be evaluated
Residual Suboptimal vessel opacification
Evidence of chronic Respiratory motion artifact
Cannot be reliably ruled out
Repeat study suggested
Hospital acquired deep vein thrombosis (DVT)
 Positive identifiers Positive for extensive DVT With associated edema
Thrombus Enlargement of the entire extremity
Thrombotic disease Was visualized
Venous Within
 Negative identifiers No evidence of appeared patent Deep venous thrombosis
Deep vein thrombosis
Evidence of chronic Occlusive disease
 Indeterminate identifiers Evidence of chronic Could not be visualized
Could not be evaluated

Italicized terms were included in initial terminology extraction screening process

5. Discussion

We conducted a retrospective analysis of EHR data on 3680 adult SDU patients to identify and annotate HA-VTE (either DVT or PE) cases and compared the accuracy of new-onset HA-VTE diagnosis by ICD-9-CM codes with gold-standard diagnoses manually extracted by chart review. Our findings confirm that ICD-9-CM coding of EHR data missed proven HA-VTE cases and inaccurately assigned cases without VTE, and they quantify the resulting error: only 40% of coded cases were confirmed by a diagnostic test, and 45% of test-confirmed cases had no corresponding code. These findings emphasize the implications of continued use of administrative coding for HA-VTE identification in data science research, quality measurement/reporting, and outcomes research, in spite of known inaccuracy. In particular, data science and machine learning methods presume validity of ground truth case determination in primary data [36]. For these methods to yield accurate predictive models, they require time intensive annotation by clinical domain experts and rigorous data processing to ensure the veracity of compiled clinical data and prevent threats to research validity.

Although administrative diagnosis codes (ICD-9-CM/ICD-10-CM) are commonly used to identify disease cohorts in clinical databases, our findings align with a substantial number of studies that have highlighted their inaccuracy across multiple disease processes and clinical specialties. We found that the sensitivity of ICD-9-CM codes alone for detection of HA-VTE was marginal at 55%, and our findings are comparable with previous attempts to identify VTE in hospitalized patients. National quality metrics, such as the AHRQ patient safety indicators and the American College of Surgeons’ National Surgical Quality Improvement Program (NSQIP), rely heavily on administrative coding to identify post-operative VTE from healthcare data repositories and have accuracy rates which vary from 40 to 80% [37–39].

Our results are also qualitatively similar to those of other investigators with respect to disease phenotyping by ICD codes, albeit not specifically VTE. Baldereschi et al. compared administrative coding against medical record review with clinical adjudication to identify acute ischemic stroke events and found administrative coding had a positive predictive value of 85.7%, providing misleading indications about both the quantity and quality of acute ischemic stroke hospital care [11]. Ko et al. compared the accuracy of administrative coding with the Kidney Disease: Improving Global Outcomes (KDIGO) criteria in identifying acute kidney injury (AKI) and chronic kidney disease (CKD). Even with the increased detail of ICD-10-CM coding, their analysis showed administrative coding failed to identify almost half of the patients with AKI (40.5%) and CKD (45.9%) in their study cohort [14]. Similar findings have been reported in diseases such as preoperative anemia, post-partum hemorrhage, warfarin complications, aortic stenosis, coronary artery disease, congenital heart disease and hip fractures [12, 13, 16–18, 40].

Though it may seem intuitive that seemingly uncomplicated diagnoses are easier to identify through administrative coding, the available evidence fails to support that claim. The frequent and common diagnosis of obesity in a discharge abstract database of 17,380 patients was examined by Martin et al., who found that obesity was poorly coded by ICD-10-CM codes with a PPV of only 65.94% [15]. Furthermore, surgical coding, considered to be the most straightforward category of medical coding, was found by Nouraei et al. to be highly inaccurate with a needed post-audit coding change occurring in 51% of 30,127 patients [41].

Burles et al. examined the accuracy of administrative coding for PE (one type of VTE) in emergency department data and found the PPV of ICD-10-CM codes to be 82.3% [42], which seems high compared to the 40% found in this study. However, their team was examining PE as a present-on-admission diagnosis, a much more objective and easier to define diagnosis than HA-VTE, which evolves temporally over the time course of hospitalization. Fang et al. confirmed the PPV of VTE by administrative coding varied based on clinical setting, with a primary diagnosis of VTE made on hospital admission or during an emergency department encounter having an accuracy of almost 90% [43]. New-onset HA-VTE develops after admission, and thus is more challenging both for clinicians to recognize clinically and to capture accurately with administrative coding. Clinical signs of VTE (tachycardia, tachypnea, dyspnea, low grade fever, swelling) are common in hospitalized patients and are often attributed either to the patient’s admission diagnosis or coded as a separate diagnosis (e.g., tachycardia, fever). The accuracy of inpatient administrative coding depends on information transfer between clinicians and coders, which is prone to subjectivity, variability and error [41]. Coding practices can vary across institutions, and diagnoses can be missed by administrative coders who lack clinical insight, as common symptoms are often coded without close review of free text medical notes or diagnostic test result reports [44, 45]. Finally, codes associated with lower reimbursement potential may be up-coded or not evaluated as assiduously as those leading to more lucrative billing [46], and CMS has designated post-operative VTE as a diagnosis that reduces hospital reimbursement payments [47].

The use of more than one type of clinical data has been shown to significantly improve case identification sensitivity [48, 49]. Using data from a Canadian provincial VTE database, Alotaibi et al. demonstrated that ICD discharge diagnoses coupled with imaging procedure codes provided a significant improvement in the accuracy of VTE phenotype identification compared with ICD coding alone, with a sensitivity exceeding 84% [50]. Although this study did not differentiate between VTE type (chronic, present-on-admission, hospital acquired) and the sample included only patients who had undergone a VTE diagnostic test (indicating a priori bias), the use of a search algorithm that includes both VTE diagnostic codes and relevant VTE diagnostic imaging procedure codes could be a relatively efficient initial step for HA-VTE phenotype identification in structured data, before moving on to review of unstructured data.
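
A combined search of the kind described above, diagnosis codes plus imaging procedure codes, might be sketched in Python as follows. This is an assumption-laden illustration rather than the algorithm of Alotaibi et al. or of this study: the `Admission` record, the tier labels, and the function `tier_admissions` are hypothetical, and the imaging identifiers are placeholders, not real procedure codes.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Admission:
    admission_id: str
    diagnosis_codes: Set[str]   # e.g., ICD-9-CM codes on the admission
    procedure_codes: Set[str]   # e.g., imaging procedure identifiers


def tier_admissions(admissions: List[Admission],
                    vte_dx_codes: Set[str],
                    vte_imaging_codes: Set[str]) -> Dict[str, str]:
    """Label each admission by which structured signals for VTE are present."""
    tiers = {}
    for adm in admissions:
        has_dx = bool(adm.diagnosis_codes & vte_dx_codes)
        has_img = bool(adm.procedure_codes & vte_imaging_codes)
        if has_dx and has_img:
            tiers[adm.admission_id] = "dx_and_imaging"
        elif has_dx:
            tiers[adm.admission_id] = "dx_only"
        elif has_img:
            tiers[adm.admission_id] = "imaging_only"
        else:
            tiers[adm.admission_id] = "neither"
    return tiers


# Placeholder code sets for illustration only.
VTE_DX = {"415.11", "453.40"}
VTE_IMAGING = {"IMG_CTA_CHEST", "IMG_LE_DUPLEX"}

sample = [
    Admission("A1", {"415.11"}, {"IMG_CTA_CHEST"}),
    Admission("A2", {"453.40"}, set()),
    Admission("A3", {"428.0"}, {"IMG_LE_DUPLEX"}),
]
result = tier_admissions(sample, VTE_DX, VTE_IMAGING)
print(result)
```

Tiers like these only prioritize records: a code without imaging, or imaging without a code, would still need review of the unstructured report text to establish a diagnosis.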

Natural language processing (NLP) is another approach that holds promise for HA-VTE case identification in unstructured data. Accuracy of NLP identification of postoperative VTE varies from 15% to as high as 89% depending on the sample population and type of data reviewed (radiology reports, progress notes, discharge summaries) [51–54]. Our manual review of unstructured radiology report data identified key phrases (Table 2) frequently associated with HA-VTE positive, negative, and indeterminate diagnoses which further contribute to the growing body of literature that can better inform NLP semantic analysis techniques for specific conditions [55].

While clinical domain expertise-informed terms were helpful in guiding manual review of radiology reports, we found word strings associated with immediate verbal communication of critical results, such as, “…clinical service was notified…”, “…were communicated to….”, and “… were called to…,” were almost always present on positive HA-VTE diagnostic reports. “Suboptimal vessel opacification, respiratory motion artifact, and repeat study suggested,” were present, almost exclusively, on diagnostic studies with indeterminate results that required additional diagnostic work up. These common phrases could inform more standardized terminology for EHR reporting across institutions and/or be coupled with clinical terms in NLP approaches to improve contextual understanding and accuracy of case identification. Natural language processing approaches hold promise, but validating findings across multiple healthcare organizations and clinical databases remains a challenge for researchers.
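
As a hedged illustration of how such phrases might be coupled with clinical terms, the Python sketch below derives two boolean features from a report. The phrases are those quoted in the paragraph above; the feature names and the function `report_features` are hypothetical, and this was not part of the study's method.

```python
import re

# Phrases signalling verbal communication of a critical result.
COMMUNICATION_RX = re.compile(
    r"clinical service was notified|were communicated to|were called to",
    re.IGNORECASE,
)
# Phrases seen almost exclusively on indeterminate studies.
INDETERMINATE_RX = re.compile(
    r"suboptimal vessel opacification|respiratory motion artifact|"
    r"repeat study suggested",
    re.IGNORECASE,
)


def report_features(text):
    """Return contextual features that could accompany clinical-term matches."""
    return {
        "critical_result_communicated": bool(COMMUNICATION_RX.search(text)),
        "indeterminate_language": bool(INDETERMINATE_RX.search(text)),
    }


print(report_features("Findings were called to Dr. Smith at 14:05."))
print(report_features("Limited by respiratory motion artifact."))
```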

A dearth of annotated data sets and of automated methods to convert imperfect clinical data into quality computable phenotypes are recognized barriers to progress in ML research and data science applications for health care [21, 48]. Computable phenotypes, computerized definitions of clinical conditions using specified and standardized EHR and supplementary data elements (without the need for chart review or interpretation by a clinician), offer a more rigorous approach to identifying patient records in big data repositories [19, 56]. While computable phenotypes hold promise for improved feasibility of time and cost-efficient reproducible queries in big data repositories, development remains a significant informatics challenge. As outlined in this paper, the heterogeneity and dynamic nature of clinical data and the process of first identifying and defining a phenotype is a labor intensive, highly manual process conducted by domain experts. Then, once the EHR data elements associated with a phenotype are defined (i.e., lab data, medication data, diagnostic data), translating human readable phenotype information into a computable format that can be used across data sets and institutions requires a multidisciplinary team that includes clinical experts, biostatisticians, EHR informaticians, NLP experts, and computer scientists working in close collaboration [49, 57]. Inherent in the complexities associated with computable phenotype development to identify disease/condition outcomes of interest is the concurrent challenge of accurately identifying the data elements and clinical features that define the outcome.

5.1. Limitations

Data analyzed for this study were obtained exclusively from patients admitted to one SDU of a single Level-1 Trauma Center and, therefore, may be biased by the coding practices inherent to that facility and the specialty population. Many HA-VTE diagnoses are not identified until after hospital discharge, and this study was limited to a review of records and test reports from admission until discharge only. The use of ICD-9-CM codes (2006–2008) could also be considered a potential limitation. Implementation of the ICD-10-CM coding system in October 2015 was promoted by CMS as having the potential to improve the quality of patient care via more accurate data collection, clearer documentation of diagnoses and procedures, and more accurate claims processing [58, 59]. However, several studies comparing ICD-9-CM and ICD-10-CM classifications have shown similar rates of validity in capturing clinical condition information [60].

5.2. Strengths

Venous thromboembolism can be a chronic or an acute condition, and acute VTE can be further characterized as present-on-admission or hospital acquired. Most studies examining the administrative coding accuracy of VTE do not differentiate between these types. This study specifically sought to identify new-onset HA-VTE, and patients with VTE present before hospital admission (chronic or present-on-admission) were eliminated from consideration of the acute disease phenotype. Our study is likely to have identified all diagnosed new-onset HA-VTE cases, as well as the date and time of diagnosis, as the gold standard diagnostic test reports reviewed to identify VTE diagnosis included a time stamp [61]. Even in cases when administrative diagnosis codes can reliably represent a clinical diagnosis, their inability to provide a precise time of diagnosis is a limitation that impacts precision modeling research [45]. This is significant as accurate case identification is needed for big data analytics and can result in more sensitive and specific predictive algorithms, which could help to inform VTE identification earlier in routine clinical care, and perhaps targeted prophylactic interventions. Additionally, this study aimed to identify HA-VTE in a general SDU population, with a mixture of medical and surgical patients whose risk for VTE development varies and presents a greater diagnostic challenge for clinicians. Many studies looking at VTE identification through coding have focused on populations with an extremely high incidence of VTE, such as patients undergoing major orthopedic surgery and/or with advanced malignancy, for whom clinicians and coders already have a high index of suspicion for VTE development [54, 62].

6. Conclusions

Methodology and validity of data are critical when working with EHR and/or clinical database data to ensure study cohorts accurately represent the disease phenotype of interest. Our study corroborates that administrative coding alone is inadequate for ensuring data veracity and quantifies the degree to which this occurs in the case of HA-VTE. Further review of confirmatory clinical data points such as procedure codes, progress notes, discharge notes or diagnostic test results is needed; however, this process is not standardized and requires an extensive time investment by clinical experts. These findings support the need for alternative methods to develop more sensitive and specific VTE phenotype solutions portable across EHR vendor data as well as the development of computable phenotypes that can support reproducible queries of EHR data across healthcare organizations and efficiently identify patients with specific conditions or events for research studies and disease management registries [19, 63].

Acknowledgements

This research was supported by National Institutes of Health R01NR01391 (all), F31NR01810 (TP, MH).

Footnotes

Data availability Data availability in tables and figures.

Conflict of interest The authors have no commercial conflicts of interest with this work.

Compliance with ethical standards

Ethical approval The study was approved by the institutional review boards of the University of Pittsburgh and Carnegie Mellon University.

