2020 Feb 4;10:1776. doi: 10.1038/s41598-020-58601-7

Table 3.

AUROC measures for each prediction model’s best parameterization. We applied one reference model and three register-based models to fifteen years of health register data, comprising hospital diagnoses, hospital procedures, drug prescriptions and interactions with primary care contractors, to predict five-year risk for five T2D comorbidities. For each comorbidity, prediction was performed on a T2D population free of that comorbidity at the date of prediction (the date of the individual’s first T2D diagnosis). The reference model was a logistic ridge regression based on canonical features (age, sex, country or region of birth and date of first T2D diagnosis, as well as their interactions), while the register-based models were logistic ridge regression, random forest and gradient boosting based on the canonical features as well as hospital diagnoses, hospital procedures, drug prescriptions and interactions with primary care extracted from Danish health registers. Incidences are the proportions of cases within each comorbidity’s sub-population at the end of the prediction horizon. Value ranges in brackets are 95% confidence intervals based on bootstrap sampling. For heart failure, myocardial infarction, cardiovascular disease and chronic kidney disease, the gradient boosting model outperformed the reference model. AUROC, area under the receiver operating characteristic curve.

	Heart failure (incidence: 0.04)
	AUROC	ΔAUROC_RLR	ΔAUROC_LR	ΔAUROC_RF
Reference, logistic regression (RLR)	0.74 (0.72–0.75)
Logistic regression (LR)	0.77 (0.76–0.79)	0.04 (0.02–0.05)
Random forest (RF)	0.77 (0.75–0.78)	0.03 (−0.01)	−0.01 (−0.02–0.01)
Gradient boosting (GB)	0.80 (0.78–0.81)	0.06 (0.05–0.07)	0.02 (0.01–0.03)	0.03 (0.02–0.04)
	Myocardial infarction (incidence: 0.02)
	AUROC	ΔAUROC_RLR	ΔAUROC_LR	ΔAUROC_RF
Reference, logistic regression (RLR)	0.68 (0.65–0.70)
Logistic regression (LR)	0.70 (0.68–0.73)	0.03 (0.01–0.04)
Random forest (RF)	0.67 (0.64–0.69)	−0.01 (−0.03–0.01)	−0.04 (−0.06–−0.02)
Gradient boosting (GB)	0.71 (0.69–0.73)	0.03 (0.02–0.05)	0.01 (0.00–0.02)	0.04 (0.03–0.06)
	Stroke (incidence: 0.03)
	AUROC	ΔAUROC_RLR	ΔAUROC_LR	ΔAUROC_RF
Reference, logistic regression (RLR)	0.71 (0.69–0.73)
Logistic regression (LR)	0.72 (0.70–0.74)	0.01 (0.00–0.01)
Random forest (RF)	0.69 (0.67–0.71)	−0.02 (−0.04–−0.01)	−0.03 (−0.04–−0.01)
Gradient boosting (GB)	0.72 (0.70–0.74)	0.01 (0.00–0.02)	0.01 (0.00–0.02)	0.03 (0.02–0.05)
	Cardiovascular disease (incidence: 0.25)
	AUROC	ΔAUROC_RLR	ΔAUROC_LR	ΔAUROC_RF
Reference, logistic regression (RLR)	0.66 (0.64–0.67)
Logistic regression (LR)	0.68 (0.67–0.69)	0.02 (0.02–0.03)
Random forest (RF)	0.68 (0.67–0.69)	0.02 (0.02–0.03)	0.00 (0.00–0.01)
Gradient boosting (GB)	0.69 (0.68–0.70)	0.04 (0.03–0.05)	0.02 (0.01–0.02)	0.01 (0.01–0.02)
	Chronic kidney disease (incidence: 0.03)
	AUROC	ΔAUROC_RLR	ΔAUROC_LR	ΔAUROC_RF
Reference, logistic regression (RLR)	0.71 (0.69–0.73)
Logistic regression (LR)	0.74 (0.72–0.76)	0.04 (0.02–0.05)
Random forest (RF)	0.74 (0.72–0.76)	0.03 (0.01–0.05)	0.00 (−0.02–0.01)
Gradient boosting (GB)	0.77 (0.76–0.79)	0.07 (0.05–0.08)	0.03 (0.02–0.04)	0.04 (0.02–0.05)
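
The bracketed ranges in the table are described only as 95% confidence intervals based on bootstrap sampling. As a minimal sketch of how such an interval for an AUROC difference could be obtained, the Python example below resamples individuals with replacement and takes percentile bounds. The percentile method, the number of resamples and the function names `auroc` and `bootstrap_delta_auroc` are assumptions made for illustration; the caption does not state the exact procedure used. Reading each ΔAUROC column as the row model's AUROC minus that of the model in the subscript is likewise an interpretation of the table.

```python
import random


def auroc(y_true, scores):
    """AUROC via the Mann-Whitney U statistic; tied scores get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one case and one non-case")
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_delta_auroc(y_true, scores_a, scores_b, n_boot=1000, seed=0):
    """Point estimate and percentile 95% CI for AUROC(a) - AUROC(b).

    Assumption: a paired percentile bootstrap over individuals, so both
    models are always scored on the same resample.
    """
    rng = random.Random(seed)
    n = len(y_true)
    point = auroc(y_true, scores_a) - auroc(y_true, scores_b)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        if sum(y) in (0, n):
            continue  # resample is missing one class; draw again
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        deltas.append(auroc(y, a) - auroc(y, b))
    deltas.sort()
    lower = deltas[int(0.025 * n_boot)]
    upper = deltas[int(0.975 * n_boot) - 1]
    return point, lower, upper


if __name__ == "__main__":
    # Synthetic labels and scores, purely illustrative (not study data).
    rng = random.Random(42)
    y = [1 if rng.random() < 0.2 else 0 for _ in range(500)]
    model_a = [0.6 * label + rng.random() for label in y]  # stronger signal
    model_b = [0.3 * label + rng.random() for label in y]  # weaker signal
    point, lower, upper = bootstrap_delta_auroc(y, model_a, model_b, n_boot=500)
    print(f"delta AUROC = {point:.2f} ({lower:.2f} to {upper:.2f})")
```

Resampling the two models' predictions together keeps the comparison paired, which generally gives a tighter interval for the difference than bootstrapping each model separately. At low incidences such as 0.02, a resample can occasionally contain no cases, which is why the sketch redraws those samples.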