2026 Feb 6;13:e84318. doi: 10.2196/84318

Table 2.

Performance comparison of 4 machine learning models evaluated using area under the receiver operating characteristic curve (AUC) and recall. Metrics were averaged across multiple runs with different random seeds to ensure robustness.

Category and subcategory			AUC	Recall
Model selection, mean (SD)
	Held-out patients
		LightGBM^a	0.51 (0.004)	0.02 (0.009)
		FNN^b	0.52 (0.005)	0.05 (0.011)
		LSTM^c	0.87 (0.002)	0.75 (0.02)
		LSTM+attention	0.87 (0.002)	0.74 (0.04)
	Out-of-time patients
		LightGBM	0.52 (0.002)	0.04 (0.003)
		FNN	0.54 (0.01)	0.08 (0.03)
		LSTM	0.84 (0.01)	0.72 (0.01)
		LSTM+attention	0.85 (0.01)	0.75 (0.02)
PRIME’s^d performance
	Sex
		Male	0.83	0.36
		Female	0.84	0.29
		Intersex	0.87	0.23
	Race
		Black	0.69^e	0.16
		First Nations	0.80	0.16
		White	0.84	0.38
		Other racial identities	0.81	0.27
	Sexual orientation
		Heterosexual	0.82	0.34
		Other	0.84	0.34
	Program type
		Regional (nonforensic)	0.83	0.34
		Provincial (forensic)	0.80	0.27
	Age group (years)
		18-65	0.81	0.32
		≥65	0.81	0.38
	All		0.81	0.30

^aLightGBM: light gradient boosting machine.

^bFNN: feedforward neural network.

^cLSTM: long short-term memory.

^dPRIME: Predictive Risk Identification for Mental Health Events.

^eItalicization indicates significance.
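
The mean (SD) entries in the model selection rows summarize AUC and recall over repeated runs with different random seeds. The sketch below illustrates how such a summary could be computed; it is not the authors' pipeline. The function simulate_run is a hypothetical stand-in for training and scoring a model, and the 0.5 decision threshold, the seed list, and the use of the sample standard deviation are all assumptions made for illustration.

```python
import random
import statistics


def auc(y_true, scores):
    """AUC as the probability that a random positive case is scored above
    a random negative case, with ties counted as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall(y_true, scores, threshold=0.5):
    """Recall = TP / (TP + FN) at an assumed decision threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    return tp / (tp + fn)


def simulate_run(seed, n=200):
    """Hypothetical stand-in for one seeded train-and-evaluate run.
    Returns synthetic labels and scores; not real study data."""
    rng = random.Random(seed)
    y = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
    scores = [
        min(1.0, max(0.0, 0.35 + 0.3 * label + rng.gauss(0, 0.15)))
        for label in y
    ]
    return y, scores


def summarize(seeds):
    """Mean and sample SD of AUC and recall across seeded runs."""
    aucs, recalls = [], []
    for seed in seeds:
        y, s = simulate_run(seed)
        aucs.append(auc(y, s))
        recalls.append(recall(y, s))
    return {
        "auc": (statistics.mean(aucs), statistics.stdev(aucs)),
        "recall": (statistics.mean(recalls), statistics.stdev(recalls)),
    }


if __name__ == "__main__":
    # Seed list is an assumption; the source does not state how many runs were used.
    for name, (mean, sd) in summarize(seeds=[0, 1, 2, 3, 4]).items():
        print(f"{name}: {mean:.2f} ({sd:.3f})")
```

Because recall depends on the chosen threshold while AUC does not, a model can show a high AUC alongside a low recall, as in several subgroup rows of the table.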