Author manuscript; available in PMC: 2025 Feb 8.

Published in final edited form as: Proc ACM Interact Mob Wearable Ubiquitous Technol. 2024 Mar 6;8(1):31. doi: 10.1145/3643540

Table 8.

Balanced Accuracy Cross-Dataset Performance Summary of Mental-Alpaca Finetuning on Single Dataset.

Test Dataset	Dreaddit	DepSeverity	DepSeverity	SDCNL	CSSRS-Suicide	CSSRS-Suicide
Finetune Dataset	Task #1	Task #2	Task #3	Task #4	Task #5	Task #6
Dreaddit	0.823	↑ 0.720	↑ 0.623	↓ 0.474	↑ 0.720	↓ 0.156
DepSeverity	↑ 0.618	0.733	0.769	| 0.493	↑ 0.753	↓ 0.156
SDCNL	↓ 0.468	↓ 0.461	↑ 0.623	0.730	↑ 0.573	↓ 0.156
CSSRS-Suicide	↓ 0.500	↓ 0.500	↑ 0.622	↑ 0.500	0.753	0.578
Reference:
$\mathrm{Alpaca}_{ZS}$	0.593	0.522	0.431	0.493	0.518	0.232
Mental-Alpaca	0.816	0.775	0.746	0.724	0.730	0.403

Unmarked numbers in the finetuning rows are the results of the model finetuned and tested on the same dataset. The bottom two rows report related Alpaca versions for reference. ↑/↓ marks cross-dataset results that are better/worse than the zero-shot version $\mathrm{Alpaca}_{ZS}$. The one entry that ties the zero-shot value (DepSeverity on Task #4, 0.493) carries a bar instead of an arrow.
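
The marking rule in the note above can be reproduced from the table values alone. The following is a minimal sketch, not the authors' code: the names `ZERO_SHOT`, `FINETUNED`, `TASKS_OF`, and `marker` are hypothetical, the numbers are copied from Table 8, and the `=` symbol for a tie is an assumption introduced here for illustration.

```python
# Zero-shot Alpaca balanced accuracy for Tasks #1..#6 (reference row of Table 8).
ZERO_SHOT = [0.593, 0.522, 0.431, 0.493, 0.518, 0.232]

# Finetuning dataset -> balanced accuracy on Tasks #1..#6 (rows of Table 8).
FINETUNED = {
    "Dreaddit": [0.823, 0.720, 0.623, 0.474, 0.720, 0.156],
    "DepSeverity": [0.618, 0.733, 0.769, 0.493, 0.753, 0.156],
    "SDCNL": [0.468, 0.461, 0.623, 0.730, 0.573, 0.156],
    "CSSRS-Suicide": [0.500, 0.500, 0.622, 0.500, 0.753, 0.578],
}

# 0-based task columns that belong to each dataset, read off the table header.
TASKS_OF = {
    "Dreaddit": {0},
    "DepSeverity": {1, 2},
    "SDCNL": {3},
    "CSSRS-Suicide": {4, 5},
}


def marker(finetune_dataset: str, task_index: int) -> str:
    """Return the marker for one cell: '' for same-dataset cells,
    otherwise an arrow relative to the zero-shot baseline ('=' on a tie,
    which is a symbol assumed here, not taken from the paper)."""
    if task_index in TASKS_OF[finetune_dataset]:
        return ""  # finetuned and tested on the same dataset
    score = FINETUNED[finetune_dataset][task_index]
    baseline = ZERO_SHOT[task_index]
    if score > baseline:
        return "↑"
    if score < baseline:
        return "↓"
    return "="


if __name__ == "__main__":
    for name, scores in FINETUNED.items():
        cells = [f"{marker(name, i)} {s:.3f}".strip() for i, s in enumerate(scores)]
        print(name, *cells, sep="\t")
```

Running the script prints one tab-separated row per finetuning dataset, with each cross-dataset cell prefixed by its marker, so the output can be checked cell by cell against the table.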