Author manuscript; available in PMC: 2023 Nov 28.
Published in final edited form as: J Biomed Inform. 2023 Jan 27;139:104295. doi: 10.1016/j.jbi.2023.104295

Table 1:

The variables in the Wong et al. dataset [15], their types, and their representation in the logistic regression and Cox survival models. Numeric variables (age, BMI, Hba1c) were first grouped into intervals and then one-hot-encoded, while categorical variables (Gender, Ethnicity, and Race) were one-hot-encoded directly. To avoid collinearity, when one-hot-encoding a predictor variable, the binary predictor representing the largest group was left out and serves as the reference (marked with “used for reference” in the table). For each predictor group, the table also reports the percentage of missing cases, if any.

| Predictor group and predictor type | Predictor | Percentage of missing values | All cases |
|---|---|---|---|
| Number of cases (%) | | | 56123 (100%) |
| Gender: one-hot-encoded categorical variable | Male | | 49% |
| | Female (used for reference) | | 51% |
| Age: grouped and one-hot-encoded numeric variable | | | 61.88 ± 0.06 [18,89] |
| | age < 40 | | 7% |
| | 40 ≤ age < 50 | | 11% |
| | 50 ≤ age < 60 | | 22% |
| | 60 ≤ age < 70 (used for reference) | | 28% |
| | 70 ≤ age < 80 | | 22% |
| | age ≥ 80 | | 10% |
| BMI: grouped and one-hot-encoded numeric variable | | 29% | 33.25 ± 0.04 [12.13,79.73] |
| | BMI < 20 | | 1% |
| | 20 ≤ BMI < 25 | | 8% |
| | 25 ≤ BMI < 30 | | 18% |
| | 30 ≤ BMI < 35 (used for reference) | | 18% |
| | 35 ≤ BMI < 40 | | 12% |
| | BMI ≥ 40 | | 13% |
| Race: one-hot-encoded categorical variable | White (used for reference) | 15% | 55% |
| | Other | | 1% |
| | Black | | 26% |
| | Asian | | 3% |
| Ethnicity: one-hot-encoded categorical variable | Hispanic | 12% | 16% |
| | Not Hispanic (used for reference) | | 73% |
| Hba1c: grouped and one-hot-encoded numeric variable | | | 7.58 ± 0.01 [4.1,19.3] |
| | Hba1c < 6 | | 17% |
| | 6 ≤ Hba1c < 7 (used for reference) | | 30% |
| | 7 ≤ Hba1c < 8 | | 21% |
| | 8 ≤ Hba1c < 9 | | 12% |
| | 9 ≤ Hba1c < 10 | | 7% |
| | Hba1c ≥ 10 | | 12% |
| Comorbidities: binary variables; 1 = has comorbidity, 0 = does not have comorbidity | MI | | 13% |
| | CHF | | 23% |
| | PVD | | 21% |
| | Stroke | | 17% |
| | Dementia | | 5% |
| | Pulmonary | | 31% |
| | liver mild | | 16% |
| | liver severe | | 3% |
| | Renal | | 30% |
| | Cancer | | 14% |
| | Hiv | | 1% |
| Treatments: binary variables; 1 = receives treatment, 0 = does not receive treatment | Metformin | | 26% |
| | dpp4 | | 5% |
| | sglt2 | | 5% |
| | Glp | | 7% |
| | Tzd | | 1% |
| | Insulin | | 25% |
| | Sulfonylurea | | 9% |
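
The grouping and reference-dropping scheme described in the caption can be illustrated with a minimal sketch for the age predictor. This is not the authors' code: the function names (`age_group`, `one_hot`, `encode_age`) and the ASCII labels are hypothetical, and only the bin edges and the choice of reference group are taken from the table. How missing values were handled is not stated in the table, so the sketch does not cover them.

```python
from bisect import bisect_right

# Bin edges and labels follow the age rows of Table 1 (half-open intervals).
AGE_EDGES = [40, 50, 60, 70, 80]
AGE_LABELS = [
    "age < 40",
    "40 <= age < 50",
    "50 <= age < 60",
    "60 <= age < 70",
    "70 <= age < 80",
    "age >= 80",
]
# The largest group in Table 1 is left out and acts as the reference.
AGE_REFERENCE = "60 <= age < 70"


def age_group(age):
    """Return the label of the interval [lower, upper) that contains age."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


def one_hot(label, labels, reference):
    """One-hot-encode label, omitting the column of the reference group."""
    if label not in labels:
        raise ValueError(f"unknown group: {label!r}")
    return {name: int(name == label) for name in labels if name != reference}


def encode_age(age):
    """Group a numeric age and one-hot-encode it (hypothetical helper)."""
    return one_hot(age_group(age), AGE_LABELS, AGE_REFERENCE)


if __name__ == "__main__":
    print(encode_age(45))  # exactly one indicator is set to 1
    print(encode_age(65))  # reference group: every indicator is 0
```

Because the reference column is omitted, a case in the reference group is represented by all indicators being zero, so the indicator columns never sum to a constant across cases and are not collinear with a model intercept.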