Author manuscript; available in PMC: 2020 Nov 13.
Published in final edited form as: Anesthesiology. 2020 Apr;132(4):738–749. doi: 10.1097/ALN.0000000000003150

Classification of Current Procedural Terminology Codes from Electronic Health Record Data Using Machine Learning

Michael L Burns 1, Michael R Mathis 1, John Vandervest 1, Xinyu Tan 1, Bo Lu 1, Douglas A Colquhoun 1, Nirav Shah 1, Sachin Kheterpal 1, Leif Saager 1,2
PMCID: PMC7665375  NIHMSID: NIHMS1644311  PMID: 32028374

Abstract

Background:

Accurate anesthesiology procedure code data is essential to quality improvement, research, and reimbursement tasks within anesthesiology practices. Advanced data science techniques including machine learning and natural language processing offer opportunities to develop classification tools for Current Procedural Terminology codes across anesthesia procedures.

Methods:

Models were created using a Train/Test dataset including 1,164,343 procedures from 16 academic and private hospitals. Five supervised machine learning models were created to classify anesthesiology Current Procedural Terminology codes, with accuracy defined as first choice classification matching the institutional-assigned code existing in the perioperative database. The two best performing models were further refined and tested on a Holdout dataset from a single institution distinct from Train/Test. A tunable confidence parameter was created to identify cases for which models were highly accurate, with the goal of ≥95% accuracy, above the reported 2018 Centers for Medicare and Medicaid Services fee-for-service accuracy. Actual submitted claim data from billing specialists was used as a reference standard.

Results:

Support vector machine and neural network label-embedding attentive models were the best performing models, respectively demonstrating overall accuracies of 87.9% and 84.2% (single best code), and 96.8% and 94.0% (within top three). Classification accuracy was 96.4% in 47.0% of cases using support vector machine and 94.4% in 62.2% of cases using label-embedding attentive model within the Train/Test dataset. In the Holdout dataset, respective classification accuracies were 93.1% in 58.0% of cases and 95.0% among 62.0%. The most important feature in model training was procedure text.

Conclusions:

Through application of machine learning and natural language processing techniques, highly accurate real-time models were created for anesthesiology Current Procedural Terminology code classification. The increased processing speed and a priori targeted accuracy of this classification approach may provide performance optimization and cost reduction for quality improvement, research, and reimbursement tasks reliant on anesthesiology procedure codes.

Introduction

Anesthesiology professional fee billing is a complex process requiring accurate documentation by clinical providers and timely coordination among administrative personnel. Billing staff are responsible for selecting Current Procedural Terminology (CPT) codes to describe anesthesia care provided during each procedure and enable reimbursement using relative base unit values.1–3 Anesthesiology base CPT codes are determined by the surgical procedures performed. The process of assigning CPT codes is complex and labor-intensive, requiring various resources including specialized coding personnel for health record data extraction, transcription, translation, assignment, validation, and auditing.4 The process can be costly: professional billing costs are estimated to represent 13.4% of professional revenue for ambulatory surgical procedures and 3.1% for inpatient surgical procedures, equating to an estimated $170-$215 per case for billing and insurance-related activities.5 Error rates in medical coding can be high: even with specialized teams, error rates as high as 38% for standard CPT coding in anesthesia have been described,6 well above the 2018 overall fee-for-service error rate reported from Comprehensive Error Rate Testing by the Centers for Medicare and Medicaid Services (8.1%).7 Modest gains in process efficiency can have large effects on revenue: a decrease of 10 days in accounts receivable resulted in a 3.0% revenue gain for a single academic anesthesiology practice.8 While a crosswalk from surgical to anesthesia CPTs exists, surgical CPT data are frequently unavailable in real time due to business, political, or technical obstacles; when available, surgical CPT data have lag times similar to anesthesia CPT generation. Efficient billing processes are key to maintaining financial viability within departments. Additionally, billing data are vitally important in quality improvement and research projects to allow reproducible case inclusion, exclusion, and risk adjustment.9

As electronic health record adoption has increased, healthcare data has become more available. Data science techniques have also advanced, including methods for creating classification models using machine learning, and processing and analyzing human language using natural language processing. Machine learning and natural language processing have been applied to a variety of clinical applications including disease prediction,10 gene expression profiling,11 and medical imaging.12 Such techniques are beginning to be applied within clinical anesthesiology13 and intensive care,14 including applications predicting bispectral index,15 hypotension,16 and postoperative mortality.17 While applications exist in medical coding, including assignment of International Classification of Diseases diagnostic codes,18,19 there remains a paucity of work to apply these techniques to anesthesia billing. Anesthesia billing is a classification problem in which text and other variables are translated into a single numerical code from a limited set of choices. Natural language processing and machine learning tools excel at these tasks.

Using data science techniques applied to perioperative electronic health record data across multiple centers, anesthesia CPT code classification models were developed via multiple machine learning methods and evaluated. We hypothesized that machine learning and natural language processing could be used to develop an automated system capable of classifying anesthesia CPT codes with accuracy exceeding current benchmarks. This classification modeling could prove beneficial in efforts to optimize performance and reduce costs for research, quality improvement, and reimbursement tasks reliant on such codes.

Materials and Methods

Study Design

Institutional Review Board approval was obtained for this multicenter, retrospective observational study (HUM00152875, Ann Arbor, Michigan), which followed multidisciplinary guidelines for reporting machine learning-based classification models in biomedical research.20 The study design was presented, approved, and registered with the multicenter research committee on August 14, 2017, prior to accessing the data.21 This design included study outcomes, data collection, and statistical analyses.

Case Selection

This study included all patients, adult and pediatric, undergoing elective or emergent procedures with an institution-assigned valid anesthesia CPT code and an operative date between January 1st, 2014 and December 31st, 2016 from 16 contributing centers in the Multicenter Perioperative Outcomes Group database. This data set includes both academic hospitals and community-based practices across the United States. Methods for data collection, validation, and multicenter integration within the Multicenter Perioperative Outcomes Group have been previously described,22,23 and data from this group have been used in multiple published studies.24–26 All sites submitting valid data were eligible for inclusion; cases with missing procedure text were excluded. No additional exclusion criteria were applied. This data set is called “Train/Test”.

A second, distinct data set was created using cases from patients undergoing elective or urgent procedures with a valid institution-assigned CPT code between October 1st, 2015 and November 1st, 2016 from a single Multicenter Perioperative Outcomes Group institution not included in the Train/Test data set. This “Holdout” data set was used for external validation of the models created in this study. Figure 1 shows a flow diagram of the data sets used and the experimental design of this study.

Figure 1: Machine learning study design flowchart.

Flow diagram of the experimental design of this study. The Train/Test data set is used to create each model, while the Holdout data set is used for external validation. Each model is trained using 5-fold cross validation, with parameter tuning occurring in each of the 20 iterations of model training. The single institution in the Holdout data set was not among the 16 institutions in the Train/Test data set.

Model Features

Features are model inputs, while labels are outputs. To maximize the number of cases included in the study and allow for broad and easy application of the models, the features used in each model were limited to perioperative electronic health record data commonly found in anesthesia records: age, sex, American Society of Anesthesiologists (ASA) physical status, emergent status, procedure text, procedure duration, and the derived procedure text length (number of words in procedure text). Institution-assigned anesthesia CPT codes were used as labels, and each case represents an instance for machine learning modeling. Continuous features were scaled through normalization to a standard normal distribution with a mean of zero and a standard deviation of one.
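As an illustration of this feature preparation, the following is a minimal sketch (not the authors' production code) of assembling such a feature set and standardizing the continuous features with scikit-learn; the column names and toy values are hypothetical.

```python
# Minimal sketch of the feature set described above; column names and values
# are hypothetical stand-ins, not the study's actual schema.
import pandas as pd
from sklearn.preprocessing import StandardScaler

cases = pd.DataFrame({
    "age": [64, 7, 51],
    "sex": [1, 0, 1],                    # encoded 0/1
    "asa_status": [3, 2, 3],
    "emergent": [0, 0, 1],
    "duration_min": [142.0, 35.0, 88.0],
    "procedure_text": ["lap chole", "tonsillectomy", "total knee arthroplasty"],
    "cpt": ["00790", "00170", "01402"],  # institution-assigned labels
})
# Derived feature: number of words in the procedure text.
cases["text_length"] = cases["procedure_text"].str.split().str.len()

# Scale continuous features to zero mean and unit standard deviation.
continuous = ["age", "duration_min", "text_length"]
cases[continuous] = StandardScaler().fit_transform(cases[continuous])
```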

Primary Outcome

Submitted claim data from billing specialists was used as the reference standard to train and test models. The primary outcome of this study is classification accuracy of the institution-assigned anesthesia CPT code, defined as (Number of Correct Anesthesia CPT Classifications) / (Total Number of Anesthesia CPT Classifications). To measure the quality of the reference standard, 500 cases were randomly selected from the Train/Test data set and adjudicated by manual review of operative notes and anesthesia records, performed by an anesthesiologist domain expert (MB); from this set, a sample of 50 cases was reviewed by the University of Michigan departmental billing manager.

Data Preparation and Natural Language Processing

Procedure text is the short text assigned to each case describing the procedure(s) carried out. Natural language processing techniques were used to process text data into forms usable by machine learning models. As procedure text is typically hand entered, it is subject to misspellings and frequently contains medical abbreviations and acronyms. The most frequently misspelled words were hand audited for validity by a physician and placed into a dictionary used for text processing. To aid in processing and decrease vocabulary size, procedure text was standardized through removal of numbers, punctuation, and common English stop words (e.g. “a”, “an”, “the”). Common medical abbreviations and acronyms were expanded using domain knowledge from an anesthesiologist (MB), and a unique spelling correction library was created using approximate string distance and co-occurrence algorithms; the library was then manually adjudicated by an anesthesiologist (MB). Following text processing, term matrices were created with single and multi-word phrases using n-grams.27,28 Steps to transform text into numerical values used in machine learning models included term frequency-inverse document frequency and word2vec.29–31 Details of natural language processing and text transformation can be found in Supplemental Digital Content 1.
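The following is a minimal sketch of this style of text pipeline, assuming scikit-learn; the abbreviation map and spelling dictionary are toy stand-ins for the manually adjudicated resources described above.

```python
# Sketch of the text standardization and n-gram/TF-IDF steps described above.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the manually adjudicated abbreviation and spelling resources.
ABBREVIATIONS = {"lap": "laparoscopic", "chole": "cholecystectomy"}
SPELL_FIX = {"colecystectomy": "cholecystectomy"}

def standardize(text: str) -> str:
    # Remove numbers and punctuation, lowercase, then expand abbreviations
    # and apply spelling corrections word by word.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    words = []
    for w in text.split():
        w = ABBREVIATIONS.get(w, w)
        w = SPELL_FIX.get(w, w)
        words.append(w)
    return " ".join(words)

texts = ["Lap chole w/ cholangiogram", "Total knee arthroplasty, left"]
clean = [standardize(t) for t in texts]

# Uni- and bi-gram term matrix with TF-IDF weighting; English stop words removed.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
term_matrix = vectorizer.fit_transform(clean)
```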

Supervised Machine Learning Methods

In supervised machine learning, all data used in training are labeled, meaning each training case has both inputs and outputs. Five supervised machine learning classification models were compared: random forest,32 long short-term memory,33 extreme gradient boosting,34 support vector machine,35 and the label-embedding attentive model.18 Each model was chosen for potentially advantageous properties, including ease of implementation and interpretation (random forest and support vector machine), reduction of bias via weighting of low sample observations (extreme gradient boosting), and ease of handling text and language inputs (long short-term memory and the label-embedding attentive model). Random forest was implemented using R, while long short-term memory, extreme gradient boosting, support vector machine, and the label-embedding attentive model were implemented in Python using TensorFlow and trained on Amazon Web Services graphics processing units. After initial hyper-parameter tuning, all models were trained and tested 20 times using 5-fold cross validation: 80% of data for training and the remaining 20% for testing. Further details of the machine learning packages used and their hyper-parameter tuning can be found in Supplemental Digital Content 2.
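As a schematic of this train/test procedure (not the authors' code), a 5-fold cross-validation run of a linear support vector machine over TF-IDF text features might look like the following; the toy data and hyper-parameters are illustrative only.

```python
# Sketch: 5-fold cross-validation of a linear SVM text classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy procedure texts and CPT labels; the study trained on >1.1 million cases.
texts = ["laparoscopic cholecystectomy", "tonsillectomy", "total knee arthroplasty",
         "laparoscopic appendectomy", "knee arthroscopy"] * 10
cpts = ["00790", "00170", "01402", "00840", "01382"] * 10

# Pipeline: uni/bi-gram TF-IDF features feeding a linear SVM classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC(C=1.0))

# 5-fold cross-validation: 80% train / 20% test per fold, as in the study.
scores = cross_val_score(model, texts, cpts, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```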

The deep learning methods in this study were the label-embedding attentive model18 and long short-term memory. Procedure texts for these models were encoded into vectors using word2vec embedding31 as input. The label-embedding attentive model encoded the descriptions for each anesthesia CPT from the CPT Professional Edition medical code set maintained by the American Medical Association2. Most deep learning models for text classification only embed input (feature) text.36 A “compatibility matrix” was computed between embedded words and labels via cosine similarity. From this matrix, an attention score was calculated for each word and the entire procedural text sequence was then derived as the average of embedded words, weighted by the attention scores. This score was used for CPT classification.
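A simplified numpy sketch of this attention mechanism follows. The random vectors stand in for word2vec embeddings of the procedure-text words and the CPT descriptions; the learned layers and pooling details of the published model are omitted.

```python
# Sketch of label-embedding attention: compatibility via cosine similarity,
# per-word attention scores, and an attention-weighted text representation.
import numpy as np

rng = np.random.default_rng(0)
n_words, n_labels, d = 6, 4, 50
V = rng.normal(size=(n_words, d))   # embedded procedure-text words
C = rng.normal(size=(n_labels, d))  # embedded CPT descriptions

# Compatibility matrix G: cosine similarity between each word and each label.
Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
G = Vn @ Cn.T                       # shape: n_words x n_labels

# Attention score per word: softmax over each word's best label compatibility.
m = G.max(axis=1)
beta = np.exp(m) / np.exp(m).sum()

# Text representation: attention-weighted average of the word embeddings,
# which is then scored against each label for CPT classification.
z = beta @ V
scores = z @ C.T                    # one illustrative score per CPT label
```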

Feature Importance

Within the support vector machine model, linear coefficients were used to investigate which features were most important for machine learning decisions. The higher the weight of the input feature, the more important the feature is to CPT classification. Within procedure text, weights were used to compare feature importance of individual words as well as the overall importance of the entire procedure text as the sum of the weights of individual words.
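For a linear model, such weights can be read directly from the fitted coefficients. A minimal sketch, assuming scikit-learn's LinearSVC (the exact package is not named in this section):

```python
# Sketch: ranking words by summed absolute SVM coefficient magnitude.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["laparoscopic cholecystectomy", "tonsillectomy", "knee arthroplasty"] * 10
cpts = ["00790", "00170", "01402"] * 10

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
svm = LinearSVC().fit(X, cpts)

# coef_ has one row per CPT class; sum absolute values across classes to
# obtain an overall importance weight for each word.
weights = np.abs(svm.coef_).sum(axis=0)
words = vec.get_feature_names_out()
for i in np.argsort(weights)[::-1][:5]:
    print(words[i], round(float(weights[i]), 2))
```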

Confidence Parameter

To identify specific cases for which the machine learning models demonstrated a pre-specified level of accuracy, an adjustable confidence parameter was created as a model output for each case using methods similar to previous statistical studies such as density ratio estimations.37,38 Importantly, following machine learning model training, the confidence parameter is calculable for each case, prior to accessing the institution-assigned CPT code to ascertain classification accuracy. Support vector machine and the label-embedding attentive model were the selected machine learning methods to calculate the confidence parameter, given their relative amenability to handling procedure text, compared to other machine learning methods studied.

The confidence parameter was created by comparing the top two primary anesthesia CPT codes for each case, using CPT probabilities in the support vector machine model and CPT scores in the label-embedding attentive model. For the support vector machine model, the confidence parameter is calculated for each case as:

confidence parameter = P_CPT1 / P_CPT2

where P_CPT1 and P_CPT2 are the highest and second highest probabilities across all CPTs for that case. For the label-embedding attentive model, the confidence parameter was calculated as:

confidence parameter = score_CPT1 / score_CPT2

where score_CPT1 and score_CPT2 are the highest and second highest scores across all CPTs for that case. Cases were stratified into three confidence parameter ranges to differentiate cases with high versus low classification confidence: “High” (confidence parameter ≥1.6), “Medium” (1.2≤confidence parameter<1.6), and “Low” (<1.2) (fig. 2 and fig. 3). The “High” category was targeted to return ≥95% accuracy (i.e., <5.0% misclassification rate), the goal for this study; the “Medium” and “Low” categories were targeted to achieve balanced classes. Although these strata were chosen for reporting purposes, any confidence parameter threshold can be selected based on the desired accuracy.
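Taking the ratio form of the formulas above, a minimal sketch of computing and stratifying the confidence parameter from a per-case probability (or score) matrix is:

```python
# Sketch: confidence parameter as the ratio of the two highest per-case
# CPT probabilities (or scores), with the study's reporting strata.
import numpy as np

def confidence_parameter(probs: np.ndarray) -> np.ndarray:
    """probs: n_cases x n_cpts matrix of per-case CPT probabilities/scores."""
    top2 = np.sort(probs, axis=1)[:, -2:]   # [second highest, highest] per case
    return top2[:, 1] / top2[:, 0]

def stratify(cp: np.ndarray) -> np.ndarray:
    # "High" >= 1.6, "Medium" in [1.2, 1.6), "Low" < 1.2, as in the study.
    return np.select([cp >= 1.6, cp >= 1.2], ["High", "Medium"], default="Low")

probs = np.array([[0.70, 0.20, 0.10],   # ratio 3.50 -> High
                  [0.45, 0.40, 0.15],   # ratio 1.12 -> Low
                  [0.55, 0.40, 0.05]])  # ratio 1.38 -> Medium
cp = confidence_parameter(probs)
print(list(zip(cp.round(2), stratify(cp))))
```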

Figure 2: Accuracy of Current Procedural Terminology code assignment as a function of confidence parameter.

This graph shows the percentage accuracy of model CPT classification (y-axis) for a given cutoff of confidence parameter (x-axis) for the support vector machine model. The Train/Test and Holdout data set accuracies are plotted for both the first assigned CPT code (“top 1 CPT code”) and the top three assigned CPT codes (“top 3 CPT codes”). High (confidence parameter ≥1.6), Medium (1.2≤confidence parameter<1.6), and Low (<1.2) regions are labeled above the figure. The confidence parameter is a derived measure of relative probability between the best-fit and second best-fit CPT classifications. Current Procedural Terminology (CPT).

Figure 3: Percentage case inclusion as a function of confidence parameter.

This graph shows the percentage of cases included for model CPT classification (y-axis) for a given cutoff of confidence parameter (x-axis) for the support vector machine model. The Train/Test and Holdout data set inclusion rates are plotted. High (confidence parameter ≥1.6), Medium (1.2≤confidence parameter<1.6), and Low (<1.2) regions are labeled above the figure. The confidence parameter is a derived measure of relative probability between the best-fit and second best-fit CPT classifications. Current Procedural Terminology (CPT).

Testing Generalizability, Calibration, and Model Processing Speed

To determine the generalizability of the models in classifying anesthesia CPT codes, select models were tested on the Holdout data set (data from a distinct institution not represented in the Train/Test data set). For ease of assessing model calibration (described further under statistical analysis), CPT codes were transformed to a continuous variable via CPT-specific anesthesia base unit values, as currently used for anesthesiology reimbursement;39 the higher the assigned base unit value, the higher the reimbursement. Of note, each CPT code has a single base unit value, but multiple CPT codes may share the same base unit value. Finally, to assess the feasibility of deploying an automated CPT classification model in real time, potentially embedded into the perioperative electronic health record, the Holdout data set was processed ten times with the support vector machine model while measuring processing time (in seconds).
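A minimal timing sketch of this kind of assessment is shown below; the model and feature matrix names are placeholders for a fitted classifier and the prepared Holdout features.

```python
# Sketch: wall-clock timing of batch CPT classification over repeated passes.
import time

def time_classification(model, X, repeats: int = 10) -> list:
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        model.predict(X)                    # batch CPT classification
        times.append(time.perf_counter() - start)
    return times

# Usage (fitted_svm and holdout_matrix are placeholders):
# times = time_classification(fitted_svm, holdout_matrix)
# print(f"{sum(times)/len(times):.2f} s per pass")
```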

Statistical Analysis

Exploratory data analysis techniques, including histograms, QQ-plots, box plots, scatterplots, and basic descriptive statistics (means, medians, interquartile ranges), were used to assess the distributions of measures and to explore the most informative transformations, extreme values of the covariates, confounders, and relevant predictors considered in the analysis. These analyses were performed within the Train/Test and Holdout data sets separately. Standardized differences were used to compare summary statistics across the two data sets. To reduce the dimensionality of the classification model and to facilitate comparisons across clusters of CPT codes, a clinical approach was adopted in which CPT codes were grouped by anatomical region of the surgical procedure;40 these are referred to as “CPT categories” in the text. Model performance was analyzed by assessing accuracy, defined as a first choice CPT classification matching the institution-assigned CPT code in the Multicenter Perioperative Outcomes Group database. Accuracy within the top three is defined as one of the top three CPT classifications from the model matching the institution-assigned CPT code; narrowing a billing specialist’s classification task from 285 possible CPT codes to only 3 may yield efficiency gains. In response to peer review, other metrics of classification, namely the net reclassification index and calibration, were also used to assess the quality of the classification models; both were appropriately modified from the classical binary classification setting to the multiclass case. The following statistic was considered:

net reclassification index = (p̂_up − p̂_down) / (p̂_up + p̂_down)

where p̂_up and p̂_down are the averages of the probability estimates for CPT codes for which the base unit value of the model-classified CPT code went up or down, respectively, with respect to the original CPT code, and n is the total number of CPT codes classified. The net reclassification index is then interpreted as the net change in base unit value of CPT codes reclassified by both models. Calibration plots were constructed using z-scores for base unit values from reference standard CPT codes as well as base unit values from model-classified CPT codes for both models.
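Under this reconstruction of the formula, which is an assumption (only garbled notation survives in the source), a sketch of the multiclass net reclassification index might be:

```python
# Sketch of a multiclass net reclassification index over base unit values;
# the exact normalization is an assumption based on the reconstructed formula.
import numpy as np

def net_reclassification_index(base_orig, base_model, prob_top):
    """base_orig/base_model: base unit values of original and model-classified
    CPTs per case; prob_top: model probability of the top CPT per case."""
    base_orig = np.asarray(base_orig)
    base_model = np.asarray(base_model)
    prob_top = np.asarray(prob_top)
    up = base_model > base_orig      # reclassified to a higher base unit value
    down = base_model < base_orig    # reclassified to a lower base unit value
    p_up = prob_top[up].mean() if up.any() else 0.0
    p_down = prob_top[down].mean() if down.any() else 0.0
    return (p_up - p_down) / (p_up + p_down)

print(net_reclassification_index([5, 5, 7, 3], [6, 5, 6, 4], [0.9, 0.8, 0.6, 0.7]))
```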

Results

The Train/Test data set comprised 1,164,343 unique cases across 16 institutions, spanning 262 anesthesia CPT codes (table 1). The 2018 anesthesia CPT catalog consists of 285 unique codes.1,39 The Holdout data set comprised 58,510 cases from a single institution and spanned 232 anesthesia CPT codes. In the Train/Test data set, 36,356 cases were excluded for missing procedure text, representing 3.0% of the data; the Holdout data set had 17 such cases (<0.1% of the data). The Train/Test data set included 227 of the 232 codes contained in the Holdout data set; the five anesthesia CPT codes unique to the Holdout data set are described in Supplemental Digital Content 3. Fifty-seven percent of patients were female, the mean age was 50 years, and 8.5% were pediatric (age <18 years). Cases were primarily ASA physical status 2 (46.5%) and 3 (37.1%), and 4.5% were emergent.

Table 1.

Key metrics and comparisons of the two data sets used in this study (Train/Test and Holdout).

Category Train/Test Holdout OR P-Value
Case Demographics
Unique Anesthesia Cases 1,164,343 58,510
Unique Anesthesia CPTs 262 232
CPT Categories
Head 00100–00222 156,017 (13.4%) 14,934 (25.5%) 0.5 < 0.0001
Neck 00300–00352 53,302 (4.6%) 3,238 (5.5%) 0.8 < 0.0001
Thorax (chest, shoulder) 00400–00474 58,001 (5.0%) 2,746 (4.7%) 1.1 0.0061
Intrathoracic 00500–00580 57,908 (5.0%) 3,778 (6.5%) 0.8 < 0.0001
Spine and Spinal Cord 00600–00670 36,520 (3.1%) 1,291 (2.2%) 1.4 < 0.0001
Upper Abdomen 00700–00797 170,005 (14.6%) 7,646 (13.1%) 1.1 < 0.0001
Lower Abdomen 00800–00882 227,202 (19.5%) 7,658 (13.1%) 1.6 < 0.0001
Perineum 00902–00952 105,208 (9.0%) 3,584 (6.1%) 1.5 < 0.0001
Pelvis (except hip) 01112–01190 4,904 (0.4%) 227 (0.4%) 1.0 0.9999
Upper Leg (except knee) 01200–01274 35,094 (3.0%) 1,162 (2.0%) 1.5 < 0.0001
Knee and Popliteal Area 01320–01444 45,967 (3.9%) 1,502 (2.6%) 1.5 < 0.0001
Lower Leg (below knee) 01462–01522 37,350 (3.2%) 1,217 (2.1%) 1.5 < 0.0001
Shoulder and Axilla 01610–01682 24,076 (2.1%) 907 (1.6%) 1.3 < 0.0001
Upper Arm and Elbow 01710–01782 7,110 (0.6%) 408 (0.7%) 0.9 0.0129
Forearm, Wrist and Hand 01810–01860 38,149 (3.3%) 1,269 (2.2%) 1.5 < 0.0001
Radiological Procedure 01916–01936 45,378 (3.9%) 3,329 (5.7%) 0.7 < 0.0001
Burn Debridement 01951–01953 1,054 (0.1%) 1 (<0.1%) 1.0 < 0.0001
Obstetric 01958–01969 61,098 (5.2%) 3,516 (6.0%) 0.9 < 0.0001
Other Procedure 01990–01999 0 (0.0%) 97 (0.2%) 0.0 N/A
Patient Demographics
Female 659,272 (56.6%) 32,078 (54.8%) 1.1 < 0.0001
Age (years) 51 (22) 50 (23) 0.09a
Pediatric (age <18 years) 98,778 (8.5%) 6,549 (5.6%) 1.6 < 0.0001
ASA 1 111,269 (9.6%) 6,307 (10.8%) 0.9 < 0.0001
ASA 2 536,752 (46.5%) 25,998 (44.4%) 1.1 < 0.0001
ASA 3 428,397 (37.1%) 23,095 (39.5%) 0.9 < 0.0001
ASA 4 75,230 (6.5%) 2,969 (5.1%) 1.3 < 0.0001
ASA 5 1,600 (0.1%) 132 (0.2%) 0.5 < 0.0001
ASA 6 16 (<0.1%) 9 (<0.1%) 1.1 < 0.0001

Frequencies are displayed as percentages or means with standard deviation, as appropriate. P-values are calculated to evaluate differences between groups using the chi-squared test for categorical features and Student’s t-test for continuous features. CPT categories are defined by body region. Odds ratio (OR) thresholds for determining effect size were: Small (OR≤1.5); Medium (1.5<OR≤3); Large (OR>3). For standardized differences (SD): Small (SD≤0.2); Medium (0.2<SD≤0.5); Large (0.5<SD≤0.8); Very Large (SD>0.8).

a Standardized mean difference. Current Procedural Terminology (CPT), odds ratio (OR), standardized difference (SD), American Society of Anesthesiologists physical status classification (ASA).

Using CPT categories, codes were unevenly distributed between the data sets. Case distributions across CPT groupings varied between individual institutions, but the overall distribution reflects the content of the Multicenter Perioperative Outcomes Group database. The Holdout data set was similar to the Train/Test data set (table 1); because sample sizes were large, even small differences between data sets were statistically significant. Two body regions showed relative sparsity: Burn Debridement (1,054 cases versus 1 in the Train/Test and Holdout data sets, respectively) and Other Procedure (0 versus 97).

Primary Outcome Adjudication

Institution-assigned primary anesthesia CPTs were used as the reference standard labels when developing the models. Among the 500 cases from the Train/Test data set adjudicated by anesthesiologist manual review, 25 of 500 (5.0%) cases were found to be misclassified by primary anesthesia CPT in the source data set. Nine of the 25 errors would have been correctly classified by the support vector machine model. A sample of 50 cases, including all 25 for which the institution-assigned CPT code was in error (per anesthesiologist review) and a random 25 for which the institution-assigned CPT code was correct, was validated by the University of Michigan Anesthesiology Department billing manager. The billing manager’s review showed agreement in 22 of the 25 cases found to be incorrect and 25 of the 25 cases found to be correct, for an overall 94% concordance with the anesthesiologist review.

Procedure Text and Natural Language Processing

Feature importance was used to gain insight into model classifications and potential improvements, but not used to evaluate model error. Procedure text was the most important feature used to classify anesthesia CPT codes. This text had an average word count of 10 words per case. The vocabulary size across all cases was 25,098 unique words. Most individual words were rare, occurring in <10 cases across both data sets, accounting for 19,159 (76.3%) of the vocabulary size. Unique medical word misspellings totaled 8,353. The top misspelled medical terms included “discectomy”, “dilatation”, “curettage”, and “excision”, along with longer terms such as “esophagogastroduodenoscopy” and “cholangiopancreatography”. In all, 21.3% of cases contained at least one misspelled word that was subsequently corrected.

Machine Learning Model Parameters

In the support vector machine model, the average weight for each individual word in the procedure text was 7.9 whereas the average combined weight of all words within the procedure text was 337.5. Weights for other features were considerably lower than the combined procedural text weight: ASA physical status classification (6.1), text length (4.3), age (3.2), sex (2.1), emergent status (1.5), and case duration (1.5).

Train/Test data set

The highest overall accuracy was found with the support vector machine model (87.9%, CI 87.6–88.2%) (table 2). Extreme gradient boosting (87.9%, CI 87.5–88.3%), long short-term memory (86.4%, CI 83.5–89.3%), and the label-embedding attentive model (84.2%, CI 84.1–84.3%) were all more accurate than random forest modeling (82.0%, CI 68.1–95.9%). When CPT categories were used to identify differential performance of the random forest model, accuracy ranged from a low of 70.7% for radiological procedures to a high of 92.0% for shoulder procedures. There was a positive relationship between the number of cases comprising a specific CPT code and model accuracy for that code, with a Pearson correlation of 0.72. Overall accuracy within the top three was 96.8% for the support vector machine model and 94.0% for the label-embedding attentive model.

Table 2.

Results of five machine learning models on the Train/Test data set.

Machine Learning Model Average Accuracy for Top CPT Code (95% CI)
random forest 82.0% (68.1–95.9%)
support vector machine 87.9% (87.6–88.2%)
extreme gradient boosting 87.9% (87.5–88.3%)
long short-term memory 86.4% (83.5–89.3%)
label-embedding attentive model 84.2% (84.1–84.3%)

Accuracies of the five machine learning models calculated from the Train/Test data set, with each model trained and tested 20 times using 5-fold cross validation; shown with 95% confidence intervals. Current Procedural Terminology (CPT).

Confidence Parameters

The best performing model in testing was the support vector machine model at 87.9% (CI 87.6–88.2%), a misclassification rate of 12%. However, through the use of confidence parameters assigned along with CPT code output, results were partitioned into identifiable groups, and higher confidence parameters correlated with higher CPT classification accuracy (Pearson correlations >0.97, fig. 2). Cases within the “High” (confidence parameter ≥1.6) category represented 47% of the data in testing (fig. 3) and yielded 96.4% accuracy (fig. 2). At a more stringent confidence parameter of ≥2.0, first choice CPT classification accuracy increased to 97.1% while encompassing 39.3% of cases, and accuracy within the top three at this confidence was 99.1%. For the label-embedding attentive model, there was 94.4% accuracy in 62.2% of cases, with a 98.2% top three accuracy.

Holdout data set performance metrics

Accuracy.

The best performing machine learning model by overall accuracy in the Holdout data set was the support vector machine model (81.2%). When stratifying by confidence parameter, there was 93.1% accuracy (fig. 2) in the high confidence parameter (≥1.6) stratum, encompassing 58.0% of the data (fig. 3). At the more stringent confidence parameter (≥2.0), accuracy was 94.7% with 48.0% data set coverage. Accuracy within the support vector machine model top three was 96.3% for the Holdout data set. The overall accuracy of the label-embedding attentive model was 82.1% for the Holdout data set, and accuracy within the top three was 94.6%. For the label-embedding attentive model, accuracy improved to 95.0% for cases with confidence parameter ≥1.6, encompassing 62.0% of the data set; that is, the label-embedding attentive model classified CPTs within the study’s desired threshold (≥95% accuracy) on 62.0% of the data. At the more stringent confidence parameter (≥2.0), accuracy improved to 96.9% with a data set coverage of 48.3%.

When CPT codes were grouped by body region, we found that the label-embedding attentive model correctly identified the proper body region in 91.4% of its first-choice CPT classifications, while the support vector machine model correctly identified 93.1%. Furthermore, the label-embedding attentive model and support vector machine models correctly identified the proper body region in 97.5% and 97.7% of top-three choices, respectively.

Net Reclassification Index.

Following transformation of CPT codes to anesthesiology base unit values, the support vector machine model net reclassification index was 0.294 (95% CI: 0.270–0.318), indicating the support vector machine model led to a 29.4% excess proportion of increased anesthesiology base unit values compared to original CPT code base unit values. Using a similar approach, the label-embedding attentive model net reclassification index was 0.343 (95% CI: 0.342–0.344).

Calibration.

Following transformation of CPT codes to anesthesiology base unit values, calibration plot intercepts were -0.00 (p=0.855) and -0.00 (p=0.995) while calibration plot slopes were 0.849 (p<0.001) and 0.845 (p<0.001) for the support vector machine and the label-embedding attentive models, respectively. Details can be found in Supplemental Digital Content 4.

Processing Time

The processing speed of the support vector machine model on the Holdout data set (58,510 cases) was 1.09 ± 0.05 seconds. Processing speeds were equivalent across all models.

Discussion

In this retrospective multicenter study, a machine learning-based approach to CPT code classification is described using commonly available perioperative electronic health record data. This study found important differences in accuracy between five machine learning techniques. Within training, the models studied showed a range of classification accuracy from 82–88%, a 50% difference in misclassification rate between the worst and best performing models. Within validation, the best performing model (the label-embedding attentive model) had an overall accuracy of 82.1% in the Holdout data set. When restricted to high confidence cases (confidence parameter ≥1.6), comprising 62% of cases within the data set, accuracy rose to 95.0%, meeting the quality target for this study and eclipsing the most recently reported accuracy for fee-for-service payment within the Centers for Medicare and Medicaid Services (91.9%).7 The models developed in this study may offer a reduction in processing time and personnel resources required to perform these administrative tasks.

This approach is different from traditional computer-aided coding in the medical space: whereas traditional approaches focus on automating transcription tasks, this study focuses on classification capabilities. The confidence parameter was created to stratify cases into groups to improve model utility. This allowed identification of cases with high classification confidence, which may enable re-allocation of administrative or auditing resources to review cases for which ambiguity exists.

To investigate external validation, the support vector machine and label-embedding attentive models were tested on the Holdout data. Both models yielded lower overall accuracies (81.2–82.1%) for the Holdout data set relative to Train/Test, yet through the use of a confidence parameter, an identifiable 58.0%-62.0% of cases with a confidence parameter ≥1.6 demonstrated overall accuracies of 93.1%-95.0%. These results were encouraging for the generalizability of the models - potentially due to use of data from both academic and private hospitals across 16 medical centers. The machine learning models developed proved robust to unseen data at the holdout institution, with a broadly similar case mix, yet with some site-specific, idiosyncratic documentation and practice patterns.

For the remaining lower-confidence cases, the models can narrow assignment choices for medical billing specialists to the top three candidates, reducing the classification task from 285 possible codes to a short, prioritized list of 3. Top three accuracies were 94.6%-96.8%. Thus, these models could aid coding personnel by providing a smaller subset of CPT assignment choices. In other instances, the models created in this study could be used for post-assignment analysis, as auditing tools to identify discrepancies and potential coding errors by comparing manual and automated assignments. There is often a window for resubmission during which automation could help target efforts to reclaim lost revenue; by attaching base unit values to the CPT codes, these models can identify regions of over- and under-billing, aiding auditing and evaluation processes.

Given the promising results of this study, models developed from this work have been directly incorporated into the billing workflow at the University of Michigan for auditing and resubmission purposes. Beyond use at our single center, the CPT classification tool developed in this study has substantial applicability in the broader business practices of anesthesia care. Billing departments and vendors spend a considerable amount of time processing information for reimbursement and are slowed in an environment where documentation errors are common. An estimated 15.7% of anesthesia cases contain at least one documentation error after the first billing attempt, and the median time to correct documentation errors was 33 days.41 Furthermore, 1.3% of all anesthetic cases went without reimbursement due to improper documentation and failure to correct errors. Within this study, medical misspellings accounted for 33.3% of the procedure text vocabulary, and 21.3% of cases contained at least one misspelled term. These tools could be used to refocus resources away from routine, high-confidence CPT assignment and toward areas of more complex processing and auditing, further improving the speed and accuracy of the overall billing process. When deployed as a web application, the models are able to process over one million cases in under 10 minutes. In the context of studies demonstrating anesthesiology practices gaining revenue by decreasing charge lag,8 and reports demonstrating hospital operating margins between 2–3%, a machine learning classification approach represents an opportunity to reduce costs without compromising patient care.42–44

Additionally, the methods developed in this study may expedite CPT assignment for use in research and quality improvement projects. The classification models created enable near real-time anesthesia CPT assignment upon upload of core electronic health record data to a research or quality improvement coordinating center, freeing researchers and quality improvement champions from a dependency on billing data which may not be available in a timely manner.

Work remains to develop the full potential of billing aides like the CPT classification models created in this study. These tools require continued retraining as new information becomes available and updating when medical coding changes occur; without such updates, their accuracy will gradually degrade. One implementation concept is to train the models on historical data when a new center adopts them and then retrain periodically as new data become available, so that existing and new centers benefit from novel data inclusion.

Study Limitations

This study has several important limitations which must be further explored.

  1. Surgical CPTs were frequently unavailable from the contributing institutions, and among those providing surgical CPTs there was a similar or longer delay in availability from the procedure date compared to anesthesia CPT codes. Such lag times in surgical CPT coding preclude early crosswalking to anesthesia CPT codes and justify the procedure text-based approach used in this study.

  2. Training sets derived from manual CPT assignment contain errors6 and a model trained on errors will invariably reproduce similar errors. In this study, through physician validation, there was a manual CPT assignment error rate of 5.0%; thus models created in this study would benefit from audited and validated data sets to increase model assignment accuracy.

  3. Bias from overfitting to individual and/or institution-specific procedure text assignments and billing practices may have existed within this study. To alleviate this bias, data from multiple centers was used in training, and an external validation was conducted on a Holdout data set.

  4. While natural language processing was used to correct many of the spelling and formatting errors in procedure text, creating this feature required manual physician review and there remained several additional instances that went uncorrected. Further text processing and expansion of acronyms can help align similar cases, improving model accuracy.

  5. Among CPT codes for which the machine learning models demonstrate low or medium confidence, accuracy is not yet comparable to current standards.

  6. While some level of site-specific text characterization was required due to local site acronyms, standard lexicon tools for natural language processing were not used in this study, such as those available through the Unified Medical Language System,45 potentially limiting the reproducibility of the models.

  7. As base unit values were not unique to each CPT code, it was possible for model outputs to yield an incorrect CPT yet correct base unit value, thus limiting the fidelity of net reclassification index and model calibration assessments.

  8. While this study demonstrates rapid data processing and has potential for real-time classification of anesthesia CPT codes, these models have not been thoroughly analyzed in practice. The group plans to test these capabilities through prospective application of CPT classification to anesthesia quality improvement measures reliant on CPT codes. Sparsity remains an issue with large data predictive modeling. In cases that were not well represented in Train/Test data set, the models demonstrated decreased accuracy. The data sets used to create these models contained sparse procedural information and it is likely that accuracy would improve with inclusion of additional data, such as operative notes.

Conclusion and Future Directions

In summary, this study describes a rapid automated classification model for anesthesia CPT codes, with an accuracy comparable to current standards in a high-confidence subset of cases, and processing time far eclipsing current billing practices. These findings may serve to reduce the burden of manual coding of more common cases, and may increase efficiency within the billing cycle and aid processes that rely on billing data. These results broadly demonstrate the potential for machine learning and natural language processing-based classification models in healthcare operations.

Future applications include automation of high confidence CPT assignment to enable redistribution of manual efforts, and workflow integration for classification decision support. As there exist similar difficulties in reimbursement processes throughout the hospital, the methods to create these models could be used for classification of other medical billing codes such as surgical CPT and International Classification of Diseases. In classifying surgical and anesthesia CPT as well as International Classification of Diseases, it is conceivable to create a system automating the majority of the procedural billing process.

Supplementary Material

Supp 1
Supp 2
Supp 3
Supp 4

Acknowledgments:

The authors gratefully acknowledge the valuable contribution to protocol development and final article review by the Multicenter Perioperative Outcomes Group Perioperative Clinical Research Committee.

Funding Statement:

All work and partial funding can be attributed to the Department of Anesthesiology, University of Michigan Medical School (Ann Arbor, Michigan, USA).

Research reported in this publication was supported by the National Institute for General Medical Sciences of the National Institutes of Health under award number T32GM103730 (MLB and DAC) and by the National Heart, Lung and Blood Institute of the National Institutes of Health under award number K01HL141701 (MRM). The content of this study is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This project received additional funding through the Michigan Translational Research and Commercialization (MTRAC) for Life Sciences Innovation Hub, a partnership between the University of Michigan’s Fast Forward Medical Innovation and Office of Technology Transfer, and the State of Michigan.

Support for underlying electronic health record data collection was provided, in part, by Blue Cross Blue Shield of Michigan (BCBSM) and Blue Care Network as part of the BCBSM Value Partnerships program for contributing hospitals in the state of Michigan. Although BCBSM and the Multicenter Perioperative Outcomes Group work collaboratively, the opinions, beliefs, and viewpoints expressed by the authors do not necessarily reflect the opinions, beliefs, and viewpoints of BCBSM or any of its employees.

Non-standard Abbreviations

CPT

Current Procedural Terminology

Footnotes

Clinical trial number and registry URL: Not applicable

Prior Presentations: This study was presented in part at the 2017 Annual Meeting of the American Society of Anesthesiologists (Boston, MA).

Summary Statement: Not applicable

Conflicts of Interest:

This work has been declared through the University of Michigan Office of Tech Transfer and a provisional patent (U.S. Provisional Application No.: 62/791,257) has been filed related to the work presented in this study.

References

  • 1. 2018 CROSSWALK Book: A Guide for Surgery/Anesthesia CPT Codes. American Society of Anesthesiologists; 2017.
  • 2. CPT 2018, Current Procedural Terminology 2018: Professional Edition. American Medical Association; 2017.
  • 3. Polsky D, Candon M, Saloner B, et al. Changes in primary care access between 2012 and 2016 for new patients with Medicaid and private coverage. JAMA Internal Medicine. 2017;177(4):588–590.
  • 4. Holt J, Warsy A, Wright P. Medical decision making: guide to improved CPT coding. South Med J. 2010;103(4):316–322.
  • 5. Tseng P, Kaplan RS, Richman BD, Shah MA, Schulman KA. Administrative costs associated with physician billing and insurance-related activities at an academic health care system. JAMA. 2018;319(7):691–697.
  • 6. Henderson R, Nielsen KC, Klein SM, Pietrobon R. Miscoding rates for professional anesthesia billing: trial results - software solution. electronic Journal of Health Informatics. 2010;5(2).
  • 7. Comprehensive Error Rate Testing (CERT). 2018 Medicare Fee-for-Service Supplemental Improper Payment Data. 2018. https://www.cms.gov/Research-Statistics-Data-and-Systems/Monitoring-Programs/Medicare-FFS-Compliance-Programs/CERT/Downloads/2018MedicareFFSSuplementalImproperPaymentData.pdf. Accessed March 14, 2019.
  • 8. Reich DL, Kahn RA, Wax D, Palvia T, Galati M, Krol M. Development of a module for point-of-care charge capture and submission using an anesthesia information management system. Anesthesiology. 2006;105(1):179–186.
  • 9. Liu JB, Liu Y, Cohen ME, Ko CY, Sweitzer BJ. Defining the intrinsic cardiac risks of operations to improve preoperative cardiac risk assessments. Anesthesiology. 2018;128(2):283–292.
  • 10. Oh J, Makar M, Fusco C, et al. A generalizable, data-driven approach to predict daily risk of Clostridium difficile infection at two large academic health centers. Infect Control Hosp Epidemiol. 2018;39(4):425–433.
  • 11. Zhao S, Dong X, Shen W, Ye Z, Xiang R. Machine learning-based classification of diffuse large B-cell lymphoma patients by eight gene expression profiles. Cancer Med. 2016;5(5):837–852.
  • 12. Erickson BJ, Korfiatis P, Akkus Z, Kline TL. Machine learning for medical imaging. Radiographics. 2017;37(2):505–515.
  • 13. Connor CW. Artificial intelligence and machine learning in anesthesiology. Anesthesiology. 2019.
  • 14. Mathur P, Burns ML. Artificial intelligence in critical care. Int Anesthesiol Clin. 2019;57(2):89–102.
  • 15. Lee HC, Ryu HG, Chung EJ, Jung CW. Prediction of bispectral index during target-controlled infusion of propofol and remifentanil: a deep learning approach. Anesthesiology. 2018;128(3):492–501.
  • 16. Hatib F, Jian Z, Buddi S, et al. Machine-learning algorithm to predict hypotension based on high-fidelity arterial pressure waveform analysis. Anesthesiology. 2018.
  • 17. Lee CK, Hofer I, Gabel E, Baldi P, Cannesson M. Development and validation of a deep neural network model for prediction of postoperative in-hospital mortality. Anesthesiology. 2018;129(4):649–662.
  • 18. Wang G, Li C, Wang W, et al. Joint embedding of words and labels for text classification. ArXiv e-prints. 2018. https://ui.adsabs.harvard.edu/#abs/2018arXiv180504174W. Accessed May 1, 2018.
  • 19. Shi H, Xie P, Hu Z, Zhang M, Xing EP. Towards automated ICD coding using deep learning. ArXiv e-prints. 2017. https://ui.adsabs.harvard.edu/#abs/2017arXiv171104075S. Accessed November 1, 2017.
  • 20. Luo W, Phung D, Tran T, et al. Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view. J Med Internet Res. 2016;18(12):e323.
  • 21. Multicenter Perioperative Outcomes Group (MPOG). Perioperative Clinical Research Committee (PCRC). https://mpog.org/pcrc/. Accessed May 21, 2019.
  • 22. Freundlich RE, Kheterpal S. Perioperative effectiveness research using large databases. Best Pract Res Clin Anaesthesiol. 2011;25(4):489–498.
  • 23. Kheterpal S. Clinical research using an information system: the Multicenter Perioperative Outcomes Group. Anesthesiol Clin. 2011;29(3):377–388.
  • 24. Sun E, Mello MM, Rishel CA, et al. Association of overlapping surgery with perioperative outcomes. JAMA. 2019;321(8):762–772.
  • 25. Lee LO, Bateman BT, Kheterpal S, et al. Risk of epidural hematoma after neuraxial techniques in thrombocytopenic parturients: a report from the Multicenter Perioperative Outcomes Group. Anesthesiology. 2017;126(6):1053–1063.
  • 26. Larach MG, Klumpner TT, Brandom BW, et al. Succinylcholine use and dantrolene availability for malignant hyperthermia treatment: database analyses and systematic review. Anesthesiology. 2019;130(1):41–54.
  • 27. Cavnar WB, Trenkle JM. N-gram-based text categorization. In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval; 1994; Las Vegas, NV. 161–175.
  • 28. Wang S, Manning CD. Baselines and bigrams: simple, good sentiment and topic classification. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2; 2012; Jeju Island, Korea.
  • 29. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. ArXiv e-prints. 2013. https://ui.adsabs.harvard.edu/#abs/2013arXiv1301.3781M. Accessed January 1, 2013.
  • 30. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. 2013:3111–3119.
  • 31. Pyysalo S, Ginter F, Moen H, Salakoski T, Ananiadou S. Distributional semantics resources for biomedical text processing. 2013.
  • 32. Ho TK. Random decision forests. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition; 1995.
  • 33. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997;9(8):1735–1780.
  • 34. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016.
  • 35. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–297.
  • 36. Kim Y. Convolutional neural networks for sentence classification. CoRR. 2014;abs/1408.5882.
  • 37. Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology. 2010;21(1):128.
  • 38. Sugiyama M, Suzuki T, Kanamori T. Density Ratio Estimation in Machine Learning. Cambridge University Press; 2012.
  • 39. Relative Value Guide Book: A Guide for Anesthesia Values. American Society of Anesthesiologists; 2016, 2017, 2018.
  • 40. Anesthesia Business Consultants. Anesthesia CPT code ranges, by area of the body. 2016. http://www.anesthesiallc.com/publications/cpt-codes-for-anesthesia-procedures-services. Accessed December 17, 2018.
  • 41. Spring SF, Sandberg WS, Anupama S, Walsh JL, Driscoll WD, Raines DE. Automated documentation error detection and notification improves anesthesia billing performance. Anesthesiology. 2007;106(1):157–163.
  • 42. Advisory Board. Hospital profit margins declined from 2015 to 2016, Moody’s finds. 2017. https://www.advisory.com/daily-briefing/2017/05/18/moodys-report. Accessed March 9, 2019.
  • 43. Becker’s Hospital Review: 230 hospital benchmarks 2017. https://www.beckershospitalreview.com/lists/230-hospital-benchmarks-2017. Accessed May 3, 2017.
  • 44. Health Catalyst. How hospital financial transparency drives operational and bottom line improvements. https://www.healthcatalyst.com/success_stories/improved-hospital-profit-margins. Accessed November 12, 2018.
  • 45. National Library of Medicine (US). UMLS Reference Manual. SPECIALIST Lexicon and Lexical Tools. September 2009. https://www.ncbi.nlm.nih.gov/books/NBK9676/ and https://www.nlm.nih.gov/research/umls/about_umls.html.
