American Journal of Epidemiology. 2023 Aug 30;193(1):203–213. doi: 10.1093/aje/kwad178

Development and Validation of a Claims-Based Model to Predict Categories of Obesity

Karine Suissa, Richard Wyss, Zhigang Lu, Lily G Bessette, Cassandra York, Theodore N Tsacogianis, Kueiyu Joshua Lin
PMCID: PMC11484604  PMID: 37650647

Abstract

We developed and validated a claims-based algorithm that classifies patients into obesity categories. Using Medicare (2007–2017) and Medicaid (2000–2014) claims data linked to 2 electronic health record (EHR) systems in Boston, Massachusetts, we identified a cohort of patients with an EHR-based body mass index (BMI) measurement (calculated as weight (kg)/height (m)²). We used regularized regression to select from 137 variables and built generalized linear models to classify patients with BMIs of ≥25, ≥30, and ≥40. We developed the prediction model using EHR system 1 (training set) and validated it in EHR system 2 (validation set). The cohort contained 123,432 patients in the Medicare population and 40,736 patients in the Medicaid population. The model comprised 97 variables in the Medicare set and 95 in the Medicaid set, including BMI-related diagnosis codes, cardiovascular and antidiabetic drugs, and obesity-related comorbidities. The areas under the receiver-operating-characteristic curve in the validation set were 0.72, 0.75, and 0.83 (Medicare) and 0.66, 0.66, and 0.70 (Medicaid) for BMIs of ≥25, ≥30, and ≥40, respectively. The positive predictive values were 81.5%, 80.6%, and 64.7% (Medicare) and 81.6%, 77.5%, and 62.5% (Medicaid), for BMIs of ≥25, ≥30, and ≥40, respectively. The proposed model can identify obesity categories in claims databases when BMI measurements are missing and can be used for confounding adjustment, defining subgroups, or probabilistic bias analysis.

Keywords: body mass index, machine learning, missing data, obesity, pharmacoepidemiology, prediction modeling

Abbreviations

AUC, area under the receiver-operating-characteristic curve
BMI, body mass index
CI, confidence interval
EHR, electronic health record
ICD, International Classification of Diseases
LASSO, least absolute shrinkage and selection operator
PPV, positive predictive value

The prevalence of obesity in the United States increased from 30.5% to 42.4% between 1999–2000 and 2017–2018, and that of class 3 obesity increased from 4.7% to 9.2% (1). Similarly, the prevalence of obesity in 2007–2010 among older adults was 40.8% and 27.8%, respectively, for those aged 65–74 years and 75 years or over (2). Obesity is associated with other diseases including type 2 diabetes, cardiovascular disease, and cancer, making patients with and without obesity inherently different (3, 4). For this reason, most epidemiologic studies require a measurement of adiposity or weight status to control for imbalances in levels of obesity. However, real-world evidence studies often rely on administrative claims databases, which do not contain direct measures of obesity such as body mass index (BMI) or waist circumference. Instead, these databases record obesity through International Classification of Diseases (ICD) diagnostic codes, which tend to be differentially underrecorded (5).

Previous validation studies have reported accuracy measures for BMI-related ICD codes in administrative claims, consistently showing low sensitivity but high specificity (5–9): patients with an obesity-related ICD code were classified correctly, but only 22% of patients with obesity had such a code in the linked claims data. This high level of missingness can result in residual confounding when adjustment for BMI is required (5). Studies have proposed prediction models to address missing BMI information in claims data, specifically in commercially insured younger adults (10) and pediatric populations (11). Given different risk factors for obesity across populations, these models are not generalizable to patients with a different demographic background. Moreover, to our knowledge, no BMI phenotyping algorithms have been developed for older adults and patients of lower socioeconomic status.

Given the level of missingness of obesity-related ICD codes in claims data, and the lack of BMI prediction models available for older populations and populations of lower socioeconomic status, we sought to develop and validate a claims-based algorithm that classifies patients into obesity categories.

METHODS

Data source

This study utilized data from the Research Patient Data Repository (12) linked to Medicare fee-for-service Parts A (inpatient coverage), B (outpatient coverage), and D (prescription benefits) claims data and Medicaid claims data. The Research Patient Data Repository includes longitudinal electronic health record (EHR) data from 2 networks in the Boston metropolitan area. The first network (EHR system 1) consists of 1 tertiary hospital, 2 community hospitals, and 19 primary care centers. The second network (EHR system 2) includes 1 tertiary hospital, 1 community hospital, and 18 primary care centers. Consistent with our prior study, EHR system 1 was used for training and system 2 for validating the prediction model (13). This data set includes information on BMI, blood pressure, smoking status, and laboratory and radiology test results. Medicare is a US federal health insurance program providing medical and prescription drug coverage to individuals aged 65 years or older and to younger individuals with disabilities; Medicare currently covers approximately 50 million Americans. The Medicare fee-for-service claims database contains longitudinal, individual-level data on health-care utilization, inpatient and outpatient diagnoses, diagnostic tests and procedures, and pharmacy-filled prescriptions. These data are commonly used in real-world drug effectiveness and safety studies (14–16). Medicaid is a joint federal and state program designed for low- to moderate-income legal residents in the United States. The Medicaid claims data include enrollment files, inpatient claims, outpatient claims, prescription drug claims, and claims data for other services and long-term care.

The linked claims data spanned from January 1, 2007, to December 31, 2017, for Medicare and January 1, 2000, to December 31, 2014, for Medicaid. Approximately 550,000 Medicare beneficiaries in the EHR data set were linked with Medicare claims via the beneficiary numbers, date of birth, and sex, with a linkage success rate of 98.7% (14). Similarly, the EHRs of all 470,563 Medicaid beneficiaries identified in the EHR were linked to Medicaid claims data, with a linkage success rate of 98.5%.

Study population

Using the EHR-Medicare and EHR-Medicaid linked databases, we identified cohorts of patients within each database with an available BMI measurement (measurement date = cohort entry date) from January 1, 2007, to December 31, 2017 (Medicare), and January 1, 2000, to December 31, 2014 (Medicaid). The cohort included patients aged 65 years or older with at least 90 days of continuous enrollment in claims for Parts A, B, and D before and after the recorded BMI measure in the EHR data set. Continuous enrollment was defined as having no enrollment gap of 32 days or more. The baseline and covariate assessment period was a 180-day window that ranged from 90 days prior to 90 days after the cohort entry date. Patients were excluded if they had an implausible measure of BMI (<12 or >70) (17) or had missing information about age or gender. Patients were further excluded for conditions that can alter the BMI or its validity, including a previous bariatric surgery, limb amputation, or cancer.
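The enrollment and plausibility criteria above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's code (the analyses were run in SAS): the function names, the representation of enrollment as inclusive (first day, last day) spans, and the handling of uncovered days at the window edges are all assumptions made for the example.

```python
from datetime import date, timedelta

MAX_GAP_DAYS = 32              # a gap shorter than this does not break continuity
WINDOW_DAYS = 90               # days required before and after cohort entry
BMI_MIN, BMI_MAX = 12.0, 70.0  # values outside this range are treated as implausible


def continuously_enrolled(spans, start, end):
    """Return True if [start, end] is covered with every gap shorter than MAX_GAP_DAYS.

    spans is a list of (first_day, last_day) enrollment periods, both inclusive.
    """
    covered_until = start - timedelta(days=1)  # last day known to be covered
    for span_start, span_end in sorted(spans):
        if span_end <= covered_until:
            continue  # this span adds no new coverage
        if span_start > end:
            break  # later spans fall entirely after the window
        uncovered = (span_start - covered_until).days - 1
        if uncovered >= MAX_GAP_DAYS:
            return False
        covered_until = span_end
        if covered_until >= end:
            return True
    # Days left uncovered at the end of the window are treated as a gap too.
    return (end - covered_until).days < MAX_GAP_DAYS


def eligible(bmi, entry_date, spans):
    """Apply the BMI plausibility and enrollment criteria around cohort entry."""
    window = timedelta(days=WINDOW_DAYS)
    plausible = BMI_MIN <= bmi <= BMI_MAX
    return plausible and continuously_enrolled(
        spans, entry_date - window, entry_date + window
    )
```

A patient whose coverage lapses for two weeks inside the window would still qualify under this sketch, whereas a lapse of 32 days or more would not.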

Predictor variables

A total of 137 predetermined variables within the claims data were included as potential predictors of obesity. These variables included demographic information (age, sex, race), obesity indicators and related comorbidities (including conditions related to cardiovascular disease, type 2 diabetes, kidney disease, liver disease, and cancer), CHADS2-VASc score (18), combined comorbidity score (19), and frailty. Also included were medication use related to obesity and associated comorbidities, health-care utilization (emergency department visits, office visits, inpatient days, and days in a nursing facility), and screening test utilization (abdominal ultrasound, liver function test, electrocardiogram, flu vaccination). The full list of included variables and definitions is available in Web Table 1 (available at https://doi.org/10.1093/aje/kwad178).

BMI measurement

The BMI is an indicator of body fatness, calculated as the ratio of weight in kilograms to height in meters squared. The EHR data contain information on weight, height, and BMI. For this study, we used the recorded BMI measures, which we validated against a calculation using the height and weight values to rule out calculation errors. We also excluded implausible BMI measures that fell outside the predetermined plausible range (<12 or >70) (17). The BMI measurement at cohort entry in the EHR data was used as the outcome variable of the prediction model.
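As a rough illustration of the consistency check and the outcome definitions, the sketch below recomputes BMI from weight and height and derives the three binary outcomes used later. The tolerance of 0.5 BMI units and all function names are assumptions for the example; the paper does not report how close the recorded and recomputed values had to be.

```python
def bmi_from_measurements(weight_kg, height_m):
    """BMI as weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


def recorded_bmi_is_consistent(recorded_bmi, weight_kg, height_m, tolerance=0.5):
    """Compare a recorded BMI with the value recomputed from weight and height.

    The tolerance is an illustrative choice, not a value taken from the study.
    """
    return abs(recorded_bmi - bmi_from_measurements(weight_kg, height_m)) <= tolerance


def outcome_flags(bmi):
    """Binary outcomes for the three BMI thresholds (>=25, >=30, >=40)."""
    return {"bmi_ge_25": bmi >= 25, "bmi_ge_30": bmi >= 30, "bmi_ge_40": bmi >= 40}
```

Note that the three flags are nested: a patient with a BMI of ≥40 is also counted in the ≥30 and ≥25 categories, which matches how the categories are reported in Table 1.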

Model development

Models were developed using data from the training set, separately for Medicare and Medicaid data. First, we applied the least absolute shrinkage and selection operator (LASSO) regression with the Bayesian information criterion to select the most important predictors among the 137 preselected variables. Second, we fitted a linear regression model predicting continuous BMI as the dependent variable and the selected predictor variables as independent variables. Third, we used the predicted BMI from the second step to predict the binary dependent variables representing overweight status (BMI of ≥25 vs. <25), obese weight status (BMI of ≥30 vs. <30), and class 3 obesity weight status (BMI of ≥40 vs. <40) in logistic regression. Prior studies comparing different machine learning approaches in prediction modeling using claims data have shown comparable results (20–23). We therefore chose a more interpretable model using linear and logistic regression with LASSO where the coefficients of each variable are readily interpretable and the LASSO selection process is explicit. Because LASSO is a regularized regression method, it operates under the principle of “bias-variance trade-off” (24), and its coefficients can be biased; we used LASSO primarily as a feature selection method and reported the coefficients without shrinkage. We developed the model predicting the continuous measure because the categorization of a continuous variable will inevitably lose information. Predicting the original continuous BMI enabled us to use the full variability of the outcome of interest.
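The three modeling steps can be written compactly as follows. The notation is introduced here for illustration and does not appear in the paper: x_ij is predictor j for patient i, λ is the LASSO penalty, S is the selected set, and c is a BMI threshold. The BIC expression is one common form for Gaussian models; the exact criterion implemented by the software may differ.

```latex
% Step 1: LASSO selection among the p = 137 candidate predictors,
% with the penalty chosen by the Bayesian information criterion.
\hat{\beta}(\lambda) = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\Bigl(\mathrm{BMI}_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
  + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert

\mathrm{BIC}(\lambda) = n\,\ln\!\bigl(\mathrm{RSS}(\lambda)/n\bigr) + k(\lambda)\,\ln n,
\qquad
S = \bigl\{\, j : \hat{\beta}_j(\lambda^{*}) \neq 0 \,\bigr\},
\quad \lambda^{*} = \arg\min_{\lambda}\mathrm{BIC}(\lambda)

% Step 2: unpenalized linear regression on the selected predictors.
\widehat{\mathrm{BMI}}_i = \hat{\alpha}_0 + \sum_{j \in S}\hat{\alpha}_j\,x_{ij}

% Step 3: one logistic model per threshold, with predicted BMI as the predictor.
\operatorname{logit}\Pr\bigl(\mathrm{BMI}_i \ge c\bigr)
  = \gamma_{0c} + \gamma_{1c}\,\widehat{\mathrm{BMI}}_i,
\qquad c \in \{25,\,30,\,40\}
```

Here RSS(λ) is the residual sum of squares of the LASSO fit and k(λ) is its number of nonzero coefficients. Because step 2 refits the selected predictors without a penalty, the reported coefficients are not shrunk, which is the sense in which LASSO serves only as a feature selection method.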

Model performance and statistical analysis

We applied the regression coefficients estimated in the training set to the corresponding predictor variables to calculate predicted BMI and the probability of each BMI category in the validation set. The performance of the models in the training set was assessed using the R² for linear models and the area under the receiver-operating-characteristic curve (AUC) for the logistic models. The accuracy of the models was assessed within the validation set with measures of positive predictive value (PPV), sensitivity, and overall accuracy (calculated as the number of true positives and true negatives divided by the number of total predictions), with associated 95% confidence intervals (CIs). The 95% CIs were calculated using the standard formula for proportions (25). All the analyses were conducted using SAS, version 9.4 (SAS Institute, Cary, North Carolina). The study protocol and study design diagram are presented in the Web Material (Web Appendix 1, Web Figure 1).
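The validation metrics can be computed directly from the cross-classification of observed and predicted categories. The sketch below assumes that the "standard formula for proportions" is the Wald (normal-approximation) interval; that reading, along with the function names, is an assumption of this example and not a detail confirmed by the paper.

```python
import math


def wald_ci(successes, total, z=1.96):
    """Proportion with a Wald (normal-approximation) 95% CI, clipped to [0, 1]."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def validation_metrics(observed, predicted):
    """PPV, sensitivity, and overall accuracy from paired 0/1 labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)
    fp = sum(1 for o, p in pairs if not o and p)
    fn = sum(1 for o, p in pairs if o and not p)
    tn = sum(1 for o, p in pairs if not o and not p)
    return {
        # PPV is undefined when nothing is predicted positive.
        "ppv": wald_ci(tp, tp + fp) if tp + fp else None,
        # Sensitivity is undefined when there are no observed positives.
        "sensitivity": wald_ci(tp, tp + fn) if tp + fn else None,
        "accuracy": wald_ci(tp + tn, len(pairs)),
    }
```

With hypothetical counts of 6 true positives among 10 predicted positives, `wald_ci(6, 10)` returns a PPV of 0.600 with an interval of roughly 0.296 to 0.904, which shows how wide the interval becomes when few patients are predicted positive.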

Missing data

We used the missing indicator method for missing race/ethnicity values. The comorbidity and medication variables are based on administrative claims data, in which absence of a code is assumed to indicate absence of the medical condition; if this assumption is violated, the result is misclassification of the condition rather than missing data. To enhance the likelihood that such an assumption would hold, we required our cohort to have at least 180 days of baseline enrollment. During the active enrollment period, it is more reasonable to assume the claims code would be recorded in the claims database if the patient has a medical condition.
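A minimal sketch of the two coding conventions described above follows. The category labels mirror Table 1, but the choice of White as the reference level, the column names, and the example diagnosis code are assumptions for illustration; the study's actual definitions are given in its Web Table 1.

```python
RACE_LEVELS = ("Black", "Hispanic", "Other")  # "White" is the assumed reference level


def encode_race(race):
    """Dummy-code race/ethnicity, adding an explicit indicator for missing values."""
    row = {f"race_{level.lower()}": int(race == level) for level in RACE_LEVELS}
    row["race_missing"] = int(race is None)
    return row


def has_condition(claim_codes, condition_codes):
    """Claims-based flag: no matching code is coded as 0 (condition assumed absent)."""
    return int(any(code in condition_codes for code in claim_codes))
```

Under this coding, a patient with no recorded race contributes to the model through the `race_missing` indicator instead of being dropped, while a patient with no obesity code simply receives a 0 for that predictor.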

Sensitivity analyses

We conducted sensitivity analyses to assess the robustness of our models in different populations in which BMI can differ. First, because patients with a BMI of ≥40 represent an infrequent phenotype that may be associated with different predictors, we used LASSO to select predictors of BMIs of ≥40 and assessed model performance using logistic regression composed of the LASSO-selected variables. Second, we retrained the models in the Medicare and Medicaid sets using only patients of Black race. Third, we compared LASSO regression with alternative prediction algorithms in terms of their ability to predict the different BMI categories, using cross-validated measures of discrimination (C-statistic) and calibration (negative log-likelihood) within the training data set. The prediction algorithms included 3 regularized regression models (LASSO, Ridge, and Elastic Net) and a nonparametric tree-based method (XGBoost). For XGBoost, the tuning parameters for the learning rate, the number of boosted trees, and the tree depth were selected through a grid search using cross-validation.
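The two comparison measures named above can be computed from held-out predictions as sketched below. This is a plain-Python illustration with assumed function names; the C-statistic uses a direct pairwise comparison, which is easy to read but slow for large cohorts, where a rank-based calculation would be preferable.

```python
import math


def c_statistic(y, p):
    """Share of (case, non-case) pairs in which the case has the higher prediction.

    Ties count as half a concordant pair.
    """
    cases = [pi for yi, pi in zip(y, p) if yi == 1]
    controls = [pi for yi, pi in zip(y, p) if yi == 0]
    if not cases or not controls:
        raise ValueError("both outcome classes are required")
    concordant = sum(
        1.0 if a > b else 0.5 if a == b else 0.0 for a in cases for b in controls
    )
    return concordant / (len(cases) * len(controls))


def negative_log_likelihood(y, p, eps=1e-12):
    """Mean negative Bernoulli log-likelihood; lower values indicate better calibration."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1 - eps)  # guard against log(0)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total / len(y)
```

In a cross-validation loop, both functions would be applied to the predictions for each held-out fold and then averaged across folds for each algorithm.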

RESULTS

We identified a cohort of 123,432 patients from January 1, 2007, to December 31, 2017 (Medicare), and 40,736 patients from January 1, 2000, to December 31, 2014 (Medicaid), who had an available BMI measurement and met all cohort entry criteria, including 73,717 in the Medicare training set and 20,681 in the Medicaid training set (Figure 1A–B). Population characteristics in the validation and training sets from Medicare and Medicaid are shown in Table 1 and Web Table 2. Briefly, the mean age was 73 years in the Medicare training set and 39 years in the Medicaid training set. The cohort was predominantly female (62% in the Medicare set and 67% in the Medicaid set) and White (87% in the Medicare set and 51% in the Medicaid set), and most patients had a BMI of ≥25 (69% in the Medicare set and 67% in the Medicaid set). The average BMI was 28.5 and 29.1 in the Medicare and Medicaid databases, respectively.

Figure 1.


Cohort flow chart for Medicare, 2007–2017 (A), and Medicaid, 2000–2014 (B), samples, Boston, Massachusetts. BMI, body mass index.

Table 1.

Characteristics of the Population in the Training Sets From Medicare (2007–2017) and Medicaid (2000–2014) Cohorts in Boston, Massachusetts

| Patient Characteristic | Medicare Training Set (n = 73,717), No. | Medicare Training Set, % | Medicaid Training Set (n = 20,681), No. | Medicaid Training Set, % |
| --- | --- | --- | --- | --- |
| Age, years, mean (SD)^a | 73.1 (6.8) | | 38.7 (14.7) | |
| Female sex | 31,030 | 62.4 | 13,770 | 66.6 |
| BMI^b, mean (SD)^a | 28.5 (5.9) | | 29.1 (7.6) | |
| BMI category | | | | |
|   ≥25 | 34,365 | 69.1 | 13,876 | 67.1 |
|   ≥30 | 16,011 | 32.2 | 7,603 | 36.8 |
|   ≥40 | 2,035 | 4.1 | 1,796 | 8.7 |
| Obesity identified with ICD codes in claims | 4,469 | 9.0 | 1,200 | 5.8 |
| Race/ethnicity | | | | |
|   White | 43,077 | 86.6 | 10,506 | 50.8 |
|   Black | 2,755 | 5.5 | 1,877 | 9.1 |
|   Hispanic | 1,069 | 2.2 | 3,486 | 16.9 |
|   Other | 1,677 | 3.4 | 1,052 | 5.1 |
|   Missing | 1,137 | 2.3 | 3,760 | 18.2 |
| BMI-related comorbidities | | | | |
|   Obesity | 4,469 | 9.0 | 1,200 | 5.8 |
|   Sleep apnea | 3,478 | 7.0 | 396 | 1.9 |
|   Type 1 diabetes mellitus | 1,297 | 2.6 | 248 | 1.2 |
|   Type 2 diabetes mellitus | 11,381 | 22.9 | 1,382 | 6.7 |
|   Hyperlipidemia | 27,602 | 55.5 | 1,535 | 7.4 |
|   Hypertension | 28,056 | 56.4 | 2,253 | 10.9 |
|   Ischemic heart | 9,546 | 19.2 | 436 | 2.1 |
|   Ischemic stroke | 3,697 | 7.4 | 272 | 1.3 |
|   Nonalcoholic fatty liver disease | 604 | 1.2 | 196 | 0.9 |
|   Gastroesophageal reflux disease | 9,097 | 18.3 | 803 | 3.9 |
|   Depression | 7,069 | 14.2 | 1,863 | 9.0 |
| Other comorbidities | | | | |
|   Atrial fibrillation | 7,132 | 14.3 | 191 | 0.9 |
|   Anemia | 7,270 | 14.6 | 791 | 3.8 |
|   Any gastrointestinal bleed | 1,788 | 3.6 | 325 | 1.6 |
|   Cancer | 7,407 | 14.9 | 432 | 2.1 |
|   Chronic kidney disease | 5,130 | 10.3 | 278 | 1.3 |
|   Chronic obstructive pulmonary disease | 4,944 | 9.9 | 456 | 2.2 |
|   Dorsopathies | 15,577 | 31.3 | 2,865 | 13.9 |
|   Drug use disorder | 1,322 | 2.7 | 1,171 | 5.7 |
|   Deep vein thrombosis | 1,161 | 2.3 | 153 | 0.7 |
|   End stage renal disease | 588 | 1.2 | 71 | 0.3 |
|   Heart failure | 5,227 | 10.5 | 338 | 1.6 |
|   Hyperthyroid | 592 | 1.2 | 115 | 0.6 |
|   Hypoglycemia | 5,052 | 10.2 | 922 | 4.5 |
|   Joint back pain | 13,927 | 28.0 | 3,138 | 15.2 |
|   Kidney stones | 1,292 | 2.6 | 211 | 1.0 |
|   Liver disease | 1,469 | 3.0 | 735 | 3.6 |
|   Lower gastrointestinal bleed | 1,748 | 3.5 | 322 | 1.6 |
|   Peptic ulcer disease | 10,775 | 21.7 | 1,176 | 5.7 |
|   Peripheral vascular disease | 4,679 | 9.4 | 128 | 0.6 |
|   Shortness of breath | 8,385 | 16.9 | 1,115 | 5.4 |
|   Frailty score^c | | | | |
|     ≥0.35 | 366 | 0.7 | 6 | 0.0 |
|     0.25–0.34 | 2,830 | 5.7 | 140 | 0.7 |
|     0.15–0.24 | 19,760 | 39.7 | 2,201 | 10.6 |
|     <0.15 | 26,759 | 53.8 | 18,334 | 88.7 |
|   Combined comorbidity score of ≥1 | 23,957 | 48.2 | 4,786 | 23.1 |
|   CHADS2-VASc score^c | | | | |
|     ≥4 | 22,302 | 44.9 | 537 | 2.6 |
|     3 | 13,740 | 27.6 | 943 | 4.6 |
|     2 | 11,128 | 22.4 | 2,205 | 10.7 |
|     <2 | 2,545 | 5.1 | 16,996 | 82.2 |
| Medications | | | | |
|   Angiotensin-converting enzyme inhibitors | 13,800 | 27.8 | 1,696 | 8.2 |
|   Angiotensin II receptor blockers | 2,542 | 5.1 | 133 | 0.6 |
|   Antibiotics | 19,872 | 40.0 | 5,465 | 26.4 |
|   Antiplatelets | 3,779 | 7.6 | 526 | 2.5 |
|   Antiarrhythmics | 1,266 | 2.5 | 42 | 0.2 |
|   Anti-obesity medications | 101 | 0.2 | 9 | 0.0 |
|   Betablockers | 19,805 | 39.8 | 1,721 | 8.3 |
|   Calcium channel blockers | 4,391 | 8.8 | 376 | 1.8 |
|   COX-2 inhibitors | 960 | 1.9 | 139 | 0.7 |
|   Histamine H2-receptor antagonists | 2,492 | 5.0 | 840 | 4.1 |
|   Insulin | 1,426 | 2.9 | 602 | 2.9 |
|   Loop diuretics | 6,333 | 12.7 | 535 | 2.6 |
|   Noninsulin antidiabetic medications | 6,396 | 12.9 | 1,024 | 5.0 |
|   Nonsteroidal antiinflammatory drugs | 7,443 | 15.0 | 3,768 | 18.2 |
|   Nonselective β blockers | 1,528 | 3.1 | 140 | 0.7 |
|   Opioids | 7,816 | 15.7 | 1,930 | 9.3 |
|   Proton pump inhibitors | 12,648 | 25.4 | 2,599 | 12.6 |
|   Statin | 25,967 | 52.2 | 1,758 | 8.5 |
|   Warfarin | 5,071 | 10.2 | 332 | 1.6 |

Abbreviations: BMI, body mass index; COX-2, cyclooxygenase-2; ICD, International Classification of Diseases; SD, standard deviation.

a Values are expressed as mean (SD).

b Weight (kg)/height (m)².

c Frailty was scored according to Kim et al. (37), and CHADS2-VASc was scored according to Lip et al. (18).

The final model in the Medicare set comprised 97 variables, and that in the Medicaid set comprised 95 variables (Table 2, Web Tables 3–4). The main predictors of BMI were BMI-related diagnosis codes, age, sex, race, cardiovascular and antidiabetic drugs, and obesity-related comorbidities (e.g., type 2 diabetes, cancer, hypertension) (Web Tables 3–4). The predicted BMI was well correlated with the observed BMI in both training and validation sets from Medicare and Medicaid (Figure 2). The performance metrics of each model are shown in Table 2. In the Medicare model, the R² for continuous BMI was 0.32 in the training set. The AUC in the training set was 0.72, 0.76, and 0.88 for BMIs of ≥25, ≥30, and ≥40, respectively (Figure 3). In the validation set, the AUC was 0.72, 0.75, and 0.83 and the PPV was 81.5% (95% CI: 81.0, 81.9), 80.6% (95% CI: 79.6, 81.6), and 60.0% (95% CI: 29.6, 90.4) for BMIs of ≥25, ≥30, and ≥40, respectively. The sensitivity was 65.0% (95% CI: 64.5, 65.5), 30.8% (95% CI: 30.1, 31.5), and 0.3% (95% CI: 0.1, 0.5) for BMIs of ≥25, ≥30, and ≥40, respectively. The continuous BMI model correctly classified 65.6% (95% CI: 65.2, 66.0) of patients with BMIs of ≥25, 75.3% (95% CI: 74.9, 75.7) with BMIs of ≥30, and 95.9% (95% CI: 95.7, 96.1) with BMIs of ≥40 (Table 2).

Table 2.

Performance of the Continuous Body Mass Index Prediction Score and the Binary (Body Mass Index of ≥40) Model in Medicare (2007–2017) and Medicaid (2000–2014) Cohorts in Boston, Massachusetts

| Cohort and BMI^a Prediction Category | No. of Selected Variables | AUC in Training Set | AUC in Validation Set | Overall Accuracy^b in Validation Set (95% CI) | PPV for Specific Category in Validation Set (95% CI) | Sensitivity for Specific Category in Validation Set (95% CI) |
| --- | --- | --- | --- | --- | --- | --- |
| Medicare model | | | | | | |
|   ≥25^c | 97 | 0.724 | 0.724 | 0.656 (0.652, 0.660) | 0.815 (0.810, 0.819) | 0.650 (0.645, 0.655) |
|   ≥30^c | 97 | 0.76 | 0.748 | 0.753 (0.749, 0.757) | 0.806 (0.796, 0.816) | 0.308 (0.301, 0.315) |
|   ≥40^c | 97 | 0.863 | 0.831 | 0.959 (0.957, 0.961) | 0.600 (0.296, 0.904) | 0.003 (0.001, 0.005) |
|   ≥40^d | 44 | 0.877 | 0.848 | 0.959 (0.958, 0.961) | 0.647 (0.486, 0.808) | 0.011 (0.006, 0.015) |
| Medicaid model | | | | | | |
|   ≥25^c | 95 | 0.667 | 0.656 | 0.546 (0.539, 0.553) | 0.816 (0.808, 0.825) | 0.444 (0.435, 0.452) |
|   ≥30^c | 95 | 0.681 | 0.662 | 0.672 (0.665, 0.678) | 0.775 (0.757, 0.793) | 0.207 (0.198, 0.216) |
|   ≥40^c | 95 | 0.747 | 0.702 | 0.916 (0.912, 0.920) | 0.625 (0.535, 0.714) | 0.041 (0.031, 0.050) |
|   ≥40^d | 47 | 0.753 | 0.712 | 0.915 (0.911, 0.919) | 0.629 (0.447, 0.811) | 0.010 (0.005, 0.014) |

Abbreviations: AUC, area under the receiver-operating-characteristic curve; BMI, body mass index; CI, confidence interval; PPV, positive predictive value.

a Weight (kg)/height (m)².

b Number of true positives and true negatives divided by number of total predictions.

c Describes results from the continuous model.

d Describes results from the binomial logistic model.

Figure 2.


Mean measured body mass index (BMI) within deciles of predicted BMI in Medicare (2007–2017) and Medicaid (2000–2014) patients, in Boston, Massachusetts.

Figure 3.


Area under the receiver-operating-characteristic curves (AUCs) (blue line) of the performance of the body mass index (BMI) prediction tool in the Medicare (2007–2017) training set, for BMIs of ≥25 (A), ≥30 (B), and ≥40 (C), Boston, Massachusetts. The red line is the reference line.

In the Medicaid model, the R² for continuous BMI was 0.26 in the training set. The AUC in the validation set was 0.66, 0.66, and 0.70 for BMIs of ≥25, ≥30, and ≥40, respectively. In the validation set, the PPV was 81.6% (95% CI: 80.8, 82.5), 77.5% (95% CI: 75.7, 79.3), and 62.5% (95% CI: 53.5, 71.4) for BMIs of ≥25, ≥30, and ≥40, respectively. The sensitivity was 44.4% (95% CI: 43.5, 45.2), 20.7% (95% CI: 19.8, 21.6), and 4.1% (95% CI: 3.1, 5.0) for BMIs of ≥25, ≥30, and ≥40, respectively. The continuous BMI model in Medicaid correctly classified 54.6% (95% CI: 53.9, 55.3) of patients with BMIs of ≥25, 67.2% (95% CI: 66.5, 67.8) with BMIs of ≥30, and 91.6% (95% CI: 91.2, 92.0) with BMIs of ≥40 (Table 2).

Based on the Hosmer-Lemeshow goodness-of-fit test, the models were well-calibrated when predicting BMIs of ≥25 or ≥30 in either the Medicare or Medicaid population but tended to overestimate the risk of BMIs of ≥40 in the low-risk groups in both Medicare and Medicaid populations (Web Tables 5–7 (Medicare), Web Tables 8–10 (Medicaid)).
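For readers unfamiliar with the Hosmer-Lemeshow approach, the sketch below shows how the statistic is typically assembled: patients are sorted by predicted probability, split into equal-sized risk groups, and observed and expected event counts are compared within each group. The function name and grouping details are assumptions for illustration, and the p-value step (a chi-square comparison, conventionally with the number of groups minus 2 degrees of freedom) is omitted.

```python
def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow statistic plus the per-group (size, observed, expected) table."""
    pairs = sorted(zip(p, y))  # order patients by predicted probability
    n = len(pairs)
    statistic = 0.0
    table = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue  # can happen when there are fewer patients than groups
        size = len(chunk)
        observed = sum(yi for _, yi in chunk)
        expected = sum(pi for pi, _ in chunk)
        mean_p = expected / size
        if 0 < mean_p < 1:  # skip groups where the variance term would be zero
            statistic += (observed - expected) ** 2 / (size * mean_p * (1 - mean_p))
        table.append((size, observed, expected))
    return statistic, table
```

Overestimation in low-risk groups, as reported for BMIs of ≥40, would appear in the returned table as expected counts that exceed observed counts in the first few groups.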

For the sensitivity analysis, the model developed specifically for BMIs of ≥40 comprised 44 variables in the Medicare set and 47 variables in the Medicaid set (Table 2, Web Tables 11–12). The models for BMIs of ≥40 had an overall accuracy of 95.9% (95% CI: 95.8, 96.1), PPV of 64.7% (95% CI: 48.6, 80.8), and sensitivity of 1.1% (95% CI: 0.6, 1.5) in the Medicare validation set. In the Medicaid set, the overall accuracy was 91.5% (95% CI: 91.1, 91.9), PPV was 62.9% (95% CI: 44.7, 81.1), and sensitivity was 1.0% (95% CI: 0.5, 1.4) (Table 2).

The models developed for Black individuals comprised 54 variables in the Medicare set and 33 variables in the Medicaid set (Web Tables 13–15). The model in the Medicare set had an AUC of 0.72, 0.74, and 0.82 for BMIs of ≥25, ≥30, and ≥40, respectively. In the validation set, the PPV was 82.8% (95% CI: 81.3, 84.3), 79.9% (95% CI: 76.8, 83.1), and 68.7% (95% CI: 46.0, 91.4) for BMIs of ≥25, ≥30, and ≥40, respectively. The sensitivity was 89.7% (95% CI: 88.4, 91.0), 39.4% (95% CI: 36.7, 42.1), and 4.4% (95% CI: 1.8, 6.9) for BMIs of ≥25, ≥30, and ≥40, respectively. The model developed for Black individuals in the Medicaid set had a lower performance than that of the Medicare model (Web Table 15).

The results of the cross-validated predictive performance of various prediction algorithms showed similar performance in terms of both discrimination and calibration in data from Medicare (Web Table 16) and Medicaid (Web Table 17).

DISCUSSION

We used a machine learning approach to develop algorithms that predict BMI in administrative claims data, known to have a high level of underrecording of BMI-related diagnosis codes and typically not containing BMI values. Overall, the prediction models had reasonable levels of accuracy at predicting BMIs of ≥25 or ≥30. In contrast, the algorithm predicting BMIs of ≥40 had suboptimal PPV and sensitivity in the validation set.

The models developed in this study included a variety of predictors ranging from demographic characteristics and health-care utilization to comorbidities and prescription medication use. Wu et al. (10) developed BMI prediction models using Optum claims–EHR linked data, comparing various machine learning approaches. Specifically, they predicted binary obesity classifications of ≥30 and ≥40 and found that the Super Learner machine learning algorithm utilizing baseline BMI yielded the best performance. Because BMI values are not available in claims data, and since the primary objective of this study was to develop a claims-based phenotyping algorithm to be used when BMI information is not available, we did not use BMI values from EHRs as predictors in our model; instead we used obesity-related diagnosis codes that are available in claims data. Wu et al.’s model excluding baseline BMI had an accuracy of 71.6% and 72.8% and an AUC of 71.4% and 78.7% for BMIs of ≥30 and ≥40, respectively, in their validation set (10). While these estimates were lower than our performance in the validation set, we caution that the 2 sets of estimates are not directly comparable due to significant differences in the study populations and their characteristics. Wu et al.’s models were developed using data from a commercially insured, younger population, which is different from our Medicare (older) and Medicaid (younger adults of lower socioeconomic status) populations. For example, elderly patients tend to have more comorbidities associated with frailty and age-related disability (26), which were identified in our BMI prediction model in Medicare. Also, the different estimates of sensitivity and PPV may be explained by different prevalences of obesity in various populations: The prevalence of patients with BMIs of ≥40 was 4.1%, 8.7%, and 16% in our Medicare set, our Medicaid set, and Wu et al.’s Optum population, respectively. It is also important to note that Wu et al.’s selected model based on the Super Learner approach required implementing 4 complex machine learning models, each of which may have data environment–specific tuning parameters, and taking the weighted average of these algorithms. This approach makes interpretability impossible and makes equivalent implementation in a different data environment challenging. In contrast, our models, based on LASSO, have readily interpretable coefficients, and the application of this model in external data sets can be easily implemented. Furthermore, we observed similar performance when comparing various machine learning approaches, including Ridge, Elastic Net, and XGBoost, to our LASSO models in both Medicare and Medicaid data sets.
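The dependence of PPV on prevalence follows from Bayes' theorem and can be illustrated numerically. In the sketch below the three prevalences are those quoted above, while the sensitivity and specificity values are hypothetical placeholders chosen only to show the direction of the effect; they are not estimates from either study.

```python
def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """PPV implied by sensitivity, specificity, and prevalence (Bayes' theorem)."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Prevalence of BMI >= 40 in the Medicare, Medicaid, and Optum populations.
PREVALENCES = {"Medicare": 0.041, "Medicaid": 0.087, "Optum": 0.16}

# Hypothetical classifier characteristics, held fixed across populations.
ppv_by_population = {
    name: ppv_from_prevalence(sensitivity=0.05, specificity=0.998, prevalence=prev)
    for name, prev in PREVALENCES.items()
}
```

Holding sensitivity and specificity fixed, the implied PPV rises with prevalence, so a classifier applied to a population with more class 3 obesity would show a higher PPV even if it were no more accurate.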

Another study, by Dugan et al. (11), developed a prediction model in younger children to identify the risk of obesity in adulthood. They identified strong childhood predictors of adult obesity, including overweight status before the age of 24 months, short stature, minority race, and parental depression. These different findings emphasize the importance of developing models for different age groups, as factors associated with obesity tend to vary substantially by age and health status. For this reason, we developed demographic-specific prediction models in the Medicare (mean age 73) and Medicaid (mean age 39) populations.

The accuracy metrics obtained using the continuous model to predict weight status categories were reasonable for BMI categories of ≥25 or ≥30; however, the model did not perform as well at predicting class 3 obesity (BMIs of ≥40). The small sample size within that category might explain this difference in performance. In fact, only 4.1% and 3.7% of patients in the training and validation sets, respectively, had class 3 obesity in our sample of older adults, which is low compared with the estimate for the US general population (9.2% in 2017–2018) (1). This might affect the power to predict BMIs of ≥40. In addition, it may be that class 3 obesity has a different phenotype, and thus has its own set of predictor variables that are different from those that predict BMIs of ≥25 or ≥30. In a sensitivity analysis, we created a separate model for BMIs of ≥40, but the performance was only slightly improved. Therefore, this prediction model can be used to predict BMIs of ≥25 or ≥30 but should be used with caution when predicting BMIs of ≥40.

Furthermore, given the potential racial differences in predictors of BMI, as a sensitivity analysis, we retrained the model only for patients of Black race. While these predictions were slightly underpowered given the small sample size of patients of Black race in our cohort (5.5% to 9.1%), we showed that the models specific to Black race did not perform significantly better than the overall model in the race-specific population in the validation set. This is likely explained by the fact that race is accounted for in the overall model, and there is no significant interaction between race and other predictors in terms of their association with BMI.

Given the high prevalence of obesity worldwide, and the established association with several outcomes, including type 2 diabetes, cardiovascular disease, and mortality (27, 28), the correct identification of obese patients in nonrandomized studies is critical to establish the study population or for confounding adjustment. Insurance claims–based databases are often utilized in pharmacoepidemiology because of their reliable and longitudinal capture of prescription drugs; however, they rely on ICD-9 or ICD-10 codes for the identification of obesity. Yet, obesity, as measured by ICD codes, tends to be significantly underreported (6), and the accuracy of ICD-defined BMI categories is not ideal. This poor capture of weight status measurements, resulting from the high proportion (78%) of underreporting of obesity-related ICD codes and differential reporting, can result in strong residual confounding and misclassification.

A potential concern with imputing missing data is that it can introduce bias if the data are not “missing at random” (i.e., the probability of missingness depends only on observed variables) (29). In our research context, we developed a claims-based prediction model to be applied to a claims database that typically does not have information on BMI, with the missing mechanism being administrative (BMI information is not required for reimbursement purposes). Therefore, we believe the assumption of missing at random is not violated in this scenario, and adjusting for a proxy confounder (predicted BMI) is better than not adjusting for BMI at all in studies in which obesity is a potential confounder (e.g., a comparative effectiveness study comparing surgical vs. medical treatment for obesity). There is a growing body of evidence showing that algorithms for proxy adjustment can improve confounding control when compared with adjustment based solely on predefined measured covariates in a wide range of research questions (30–33).

The model proposed in this study provides the possibility of identifying, with an acceptable level of precision, individual weight status of patients in administrative claims databases, particularly for the BMI categories of ≥25 or ≥30. Given the high proportion of missing information on weight in these databases, the proposed model can improve the quality of obesity and obesity-related research by estimating obesity weight status at a reasonable level. This will enhance the ability of researchers to adjust for confounding by BMI when it is not directly available in the database. They can either use the predicted BMI based on our model as a proxy confounder or use our phenotyping model to develop simulated estimates in a probabilistic bias analysis (34). In addition, researchers can also consider using the predicted BMI categories based on our model to conduct treatment-effect heterogeneity evaluation in obesity-related comparative effectiveness and safety research.
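One simple way to feed model output into a probabilistic bias analysis is to treat each patient's predicted probability as the parameter of a Bernoulli draw and to repeat the draw many times. The sketch below illustrates only that sampling step, under assumed function names and a fixed random seed; a full bias analysis would re-estimate the exposure-outcome association within each simulated data set and summarize the resulting distribution.

```python
import random


def simulate_obesity_status(predicted_probs, n_draws=1000, seed=0):
    """Draw n_draws simulated 0/1 obesity indicators per patient from predicted probabilities."""
    rng = random.Random(seed)
    return [
        [1 if rng.random() < prob else 0 for prob in predicted_probs]
        for _ in range(n_draws)
    ]


def simulated_prevalence_interval(predicted_probs, n_draws=1000, seed=0):
    """Approximate 2.5th and 97.5th percentiles of simulated obesity prevalence."""
    draws = simulate_obesity_status(predicted_probs, n_draws, seed)
    prevalences = sorted(sum(draw) / len(draw) for draw in draws)
    lower = prevalences[n_draws * 25 // 1000]
    upper = prevalences[max(n_draws * 975 // 1000 - 1, 0)]
    return lower, upper
```

The spread of results across draws reflects the uncertainty in each patient's obesity status that remains after prediction, which a single hard classification would hide.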

This study has some limitations. First, the Medicare model is applicable only to an older population, and the Medicaid model only to younger individuals of low socioeconomic status. In addition, BMI classification may vary in older populations, and thus misclassification is possible. Second, the completeness of the ICD codes in claims data may be questionable; however, by requiring continuous enrollment of 180 days, we likely improved the capture of ICD codes in our claims data. Third, the choice to preselect candidate predictors was made to avoid random selection of clinically implausible predictors. Although this may result in the omission of certain predictors, the preselection process was carefully performed by expert consensus. Fourth, the BMI prediction models proposed in this study are intended for populations with demographic characteristics similar to those of our study cohorts. Fifth, the prediction of continuous BMI is susceptible to the influence of outliers or extreme values. We therefore excluded extreme and implausible values from our data set to reduce measurement error. We also conducted sensitivity analyses to predict BMIs of ≥40 as a binary variable and observed comparable results. Sixth, the LASSO selects variables by using regularization to shrink coefficients toward zero, retaining those most strongly associated with the outcome variable and setting the rest to exactly zero. As a result, there is no discrimination between causally associated variables and strong correlates among selected variables, and therefore the variables in the models should not be interpreted as causal. Seventh, given that the LASSO model performed as well as other machine learning approaches, we proceeded with LASSO results for predictive modeling and variable importance. Future research could explore more exhaustive evaluations of other modeling approaches. Eighth, while the BMI is indicative of weight status in most individuals, it may overestimate body fat in athletes with high muscle mass and underestimate body fat in older and more frail individuals (35). However, it remains the most commonly used measure of body fatness in clinical practice, and its ease of use and low cost make it a widely used measure (36).
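The selection behavior described in the sixth limitation can be illustrated with the soft-thresholding operator. The sketch below covers only the textbook special case of a linear model with orthonormal predictors, where the LASSO solution for the objective (1/2)‖y − Xβ‖² + λ‖β‖₁ is the soft-thresholded least-squares coefficient. It is not the study's fitting procedure, and the function names are hypothetical.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_orthonormal(ols_coefs, lam):
    """LASSO coefficients when predictor columns are orthonormal (X'X = I).

    ols_coefs: dict mapping a predictor name to its least-squares coefficient.
    Each coefficient is shrunk by lam; those within lam of zero are dropped.
    """
    return {name: soft_threshold(b, lam) for name, b in ols_coefs.items()}
```

For example, `lasso_orthonormal({"a": 1.5, "b": -0.2, "c": -2.0}, 0.5)` keeps `a` and `c` at reduced magnitude and sets `b` to zero. The cut is made purely on the strength of association, which is why a retained variable cannot be read as causal.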

CONCLUSION

We developed phenotyping models to predict BMI in claims-based observational studies. The predicted BMI phenotypes can be used in studies using administrative claims data for proxy confounding adjustment, for the identification of obesity-specific subgroups for treatment effect heterogeneity analyses, and for probabilistic bias analyses in pharmacoepidemiology studies.

Supplementary Material

Web_Material_kwad178

ACKNOWLEDGMENTS

Author affiliations: Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, United States (Karine Suissa, Richard Wyss, Zhigang Lu, Lily G. Bessette, Cassandra York, Theodore N. Tsacogianis, Kueiyu Joshua Lin); and Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, United States (Kueiyu Joshua Lin).

This project was supported by the National Institutes of Health (grants R01LM013204 and R01AG075335 to K.J.L.).

Due to ethical and legal restrictions, the administrative health-care data that were used in the analysis and that support the findings of this study are not available for sharing.

Presented at the 2022 International Conference on Pharmacoepidemiology and Therapeutic Risk Management, August 24–28, 2022, Copenhagen, Denmark.

Disclaimer: The views expressed in this article are those of the authors and do not reflect those of the National Institutes of Health (NIH)/National Library of Medicine.

Conflict of interest: none declared.

REFERENCES

  • 1. Hales CH, Carroll MD, Fryar CD, et al. Prevalence of obesity and severe obesity among adults: United States, 2017–2018. National Center for Health Statistics Data Brief No. 360. Hyattsville, MD: National Center for Health Statistics; 2020. https://www.cdc.gov/nchs/data/databriefs/db360-h.pdf. Accessed October 2, 2022.
  • 2. Fakhouri THI, Ogden CL, Carroll MD, et al. Prevalence of obesity among older adults in the United States, 2007–2010. National Center for Health Statistics Data Brief No. 106. Hyattsville, MD: National Center for Health Statistics; 2012. https://www.cdc.gov/nchs/data/databriefs/db106.pdf. Accessed October 2, 2022.
  • 3. Guh DP, Zhang W, Bansback N, et al. The incidence of co-morbidities related to obesity and overweight: a systematic review and meta-analysis. BMC Public Health. 2009;9:88.
  • 4. Apovian CM, Okemah J, O'Neil PM. Body weight considerations in the management of type 2 diabetes. Adv Ther. 2019;36(1):44–58.
  • 5. Suissa K, Schneeweiss S, Lin KJ, et al. Validation of obesity-related diagnosis codes in claims data. Diabetes Obes Metab. 2021;23(12):2623–2631.
  • 6. Ammann EM, Kalsekar I, Yoo A, et al. Validation of body mass index (BMI)-related ICD-9-CM and ICD-10-CM administrative diagnosis codes recorded in US claims data. Pharmacoepidemiol Drug Saf. 2018;27(10):1092–1100.
  • 7. Ammann EM, Kalsekar I, Yoo A, et al. Assessment of obesity prevalence and validity of obesity diagnoses coded in claims data for selected surgical populations: a retrospective, observational study. Medicine (Baltimore). 2019;98(29):e16438.
  • 8. Gribsholt SB, Pedersen L, Richelsen B, et al. Validity of ICD-10 diagnoses of overweight and obesity in Danish hospitals. Clin Epidemiol. 2019;11:845–854.
  • 9. Lloyd JT, Blackwell SA, Wei II, et al. Validity of a claims-based diagnosis of obesity among Medicare beneficiaries. Eval Health Prof. 2015;38(4):508–517.
  • 10. Wu B, Chow W, Sakthivel M, et al. Body mass index variable interpolation to expand the utility of real-world administrative healthcare claims database analyses. Adv Ther. 2021;38(2):1314–1327.
  • 11. Dugan TM, Mukhopadhyay S, Carroll A, et al. Machine learning techniques for prediction of early childhood obesity. Appl Clin Inform. 2015;6(3):506–520.
  • 12. Nalichowski R, Keogh D, Chueh HC, et al. Calculating the benefits of a research patient data repository. AMIA Annu Symp Proc. 2006;2006:1044.
  • 13. Lin KJ, Singer DE, Glynn RJ, et al. Identifying patients with high data completeness to improve validity of comparative effectiveness research in electronic health records data. Clin Pharmacol Ther. 2018;103(5):899–905.
  • 14. Lin KJ, Singer DE, Glynn RJ, et al. Prediction score for anticoagulation control quality among older adults. J Am Heart Assoc. 2017;6(10):e006814.
  • 15. Patorno E, Najafzadeh M, Pawar A, et al. The EMPagliflozin compaRative effectIveness and SafEty (EMPRISE) study programme: design and exposure accrual for an evaluation of empagliflozin in routine clinical care. Endocrinol Diabetes Metab. 2020;3(1):e00103.
  • 16. Patorno E, Pawar A, Franklin JM, et al. Empagliflozin and the risk of heart failure hospitalization in routine clinical care. Circulation. 2019;139(25):2822–2830.
  • 17. Li W, Kelsey JL, Zhang Z, et al. Small-area estimation and prioritizing communities for obesity control in Massachusetts. Am J Public Health. 2009;99(3):511–519.
  • 18. Lip GY, Nieuwlaat R, Pisters R, et al. Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the euro heart survey on atrial fibrillation. Chest. 2010;137(2):263–272.
  • 19. Gagne JJ, Glynn RJ, Avorn J, et al. A combined comorbidity score predicted mortality in elderly patients better than existing scores. J Clin Epidemiol. 2011;64(7):749–759.
  • 20. Austin PC, Harrell FE, Lee DS, et al. Empirical analyses and simulations showed that different machine and statistical learning methods had differing performance for predicting blood pressure. Sci Rep. 2022;12(1):9312.
  • 21. Desai RJ, Wang SV, Vaduganathan M, et al. Comparison of machine learning methods with traditional models for use of administrative claims with electronic medical records to predict heart failure outcomes. JAMA Netw Open. 2020;3(1):e1918962.
  • 22. Hu P, Liu Y, Li Y, et al. A comparison of LASSO regression and tree-based models for delayed cerebral ischemia in elderly patients with subarachnoid hemorrhage. Front Neurol. 2022;13:791547.
  • 23. König S, Pellissier V, Hohenstein S, et al. Machine learning algorithms for claims data-based prediction of in-hospital mortality in patients with heart failure. ESC Heart Failure. 2021;8(4):3026–3036.
  • 24. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2nd ed. New York, NY: Springer; 2009.
  • 25. Baldi B, Moore DS. The Practice of Statistics in the Life Sciences. New York, NY: W. H. Freeman & Company; 2017.
  • 26. Espinoza SE, Quiben M, Hazuda HP. Distinguishing comorbidity, disability, and frailty. Curr Geriatr Rep. 2018;7(4):201–209.
  • 27. Pani LN, Nathan DM, Grant RW. Clinical predictors of disease progression and medication initiation in untreated patients with type 2 diabetes and A1C less than 7%. Diabetes Care. 2008;31(3):386–390.
  • 28. Khan SS, Ning H, Wilkins JT, et al. Association of body mass index with lifetime risk of cardiovascular disease and compression of morbidity. JAMA Cardiol. 2018;3(4):280–287.
  • 29. Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol. 1995;142(12):1255–1264.
  • 30. Le HV, Poole C, Brookhart MA, et al. Effects of aggregation of drug and diagnostic codes on the performance of the high-dimensional propensity score algorithm: an empirical example. BMC Med Res Methodol. 2013;13:142.
  • 31. Hallas J, Pottegard A. Performance of the high-dimensional propensity score in a Nordic healthcare model. Basic Clin Pharmacol Toxicol. 2017;120(3):312–317.
  • 32. Schneeweiss S, Rassen JA, Glynn RJ, et al. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20(4):512–522.
  • 33. Garbe E, Kloss S, Suling M, et al. High-dimensional versus conventional propensity scores in a comparative effectiveness study of coxibs and reduced upper gastrointestinal complications. Eur J Clin Pharmacol. 2013;69(3):549–557.
  • 34. Hunnicutt JN, Ulbricht CM, Chrysanthopoulou SA, et al. Probabilistic bias analysis in pharmacoepidemiology and comparative effectiveness research: a systematic review. Pharmacoepidemiol Drug Saf. 2016;25(12):1343–1353.
  • 35. Nuttall FQ. Body mass index: obesity, BMI, and health: a critical review. Nutr Today. 2015;50(3):117–128.
  • 36. Cornier MA, Despres JP, Davis N, et al. Assessing adiposity: a scientific statement from the American Heart Association. Circulation. 2011;124(18):1996–2019.
  • 37. Kim DH, Schneeweiss S, Glynn RJ, et al. Measuring frailty in Medicare data: development and validation of a claims-based frailty index. J Gerontol A Biol Sci Med Sci. 2018;73(7):980–987.

Articles from American Journal of Epidemiology are provided here courtesy of Oxford University Press
