. Author manuscript; available in PMC: 2025 Feb 8.

Published in final edited form as: Proc ACM Interact Mob Wearable Ubiquitous Technol. 2024 Mar 6;8(1):31. doi: 10.1145/3643540

Table 4.

Balanced Accuracy Performance Summary of Zero-shot, Few-shot and Instruction Finetuning on LLMs. $Z S_{best}$ highlights the best performance among zero-shot prompt designs, including context enhancement, mental health enhancement, and their combination (see Table. 2). Detailed results can be found in Table 10 in Appendix. Small numbers represent standard deviation across different designs of ${Prompt}_{Part1-S}$ and ${Prompt}_{Part2-Q}$ . The baselines at the bottom rows do not have standard deviation as the task-specific output is static, and prompt designs do not apply. Due to the maximum token size limit, we only conduct few-shot prompting on a subset of datasets and mark other infeasible datasets as “−”. For each column, the best result is bolded, and the second best is underlined.

	Dataset	Dreaddit	DepSeverity		SDCNL	CSSRS-Suicide
Category	Model	Task #1	Task #2	Task #3	Task #4	Task #5	Task #6
Zero-shot Prompting	${Alpaca}_{Z S}$	0.593_±0.039	0.522_±0.022	0.431_±0.050	0.493_±0.007	0.518_±0.037	0.232_±0.076
	${Alpaca}_{Z S_best}$	0.612_±0.065	0.577_±0.028	0.454_±0.143	0.532_±0.005	0.532_±0.033	0.250_±0.060
	${Alpaca-LoRA}_{Z S}$	0.571_±0.043	0.548_±0.027	0.437_±0.044	0.502_±0.011	0.540_±0.012	0.187_±0.053
	${Alpaca-LoRA}_{Z S_best}$	0.571_±0.043	0.548_±0.027	0.437_±0.044	0.502_±0.011	0.567_±0.038	0.224_±0.049
	${FLAN-T 5}_{Z S}$	0.659±_0.086	0.664_±0.011	0.396_±0.006	0.643_±0.021	0.667_±0.023	0.418_±0.012
	${FLAN-T 5}_{Z S_best}$	0.663±_0.079	0.674_±0.014	0.396_±0.006	0.653_±0.011	0.667_±0.023	0.418_±0.012
	${LLaMA 2}_{Z S}$	0.720±_0.012	0.693_±0.034	0.429_±0.013	0.589_±0.010	0.691_±0.014	0.261_±0.018
	${LLaMA 2}_{Z S_best}$	0.720±_0.012	0.711_±0.033	0.444_±0.021	0.643_±0.014	0.722_±0.039	0.367_±0.043
	${GPT- 3.5}_{Z S}$	0.685±_0.024	0.642_±0.017	0.603_±0.017	0.460_±0.163	0.570_±0.118	0.233_±0.009
	${GPT- 3.5}_{Z S_best}$	0.688±_0.045	0.653_±0.020	0.642_±0.034	0.632_±0.020	0.617_±0.033	0.310_±0.015
	${GPT- 4}_{Z S}$	0.700±_0.001	0.719_±0.013	0.588_±0.010	0.644_±0.007	0.760_±0.009	0.418_±0.009
	${GPT- 4}_{Z S_best}$	0.725±_0.009	0.719_±0.013	0.656_±0.001	0.647_±0.014	0.760_±0.009	0.441_±0.057
Few-shot Prompting	${Alpaca}_{F S}$	0.632±_0.030	0.529_±0.017	0.628_±0.005	—	—	—
	${FLAN-T 5}_{F S}$	0.786±_0.006	0.678_±0.009	0.432_±0.009	—	—	—
	${GPT- 3.5}_{F S}$	0.721±_0.010	0.665_±0.015	0.580_±0.002	—	—	—
	${GPT- 4}_{F S}$	0.698±_0.009	0.724_±0.005	0.613_±0.001	—	—	—
Instructional Finetuning	Mental-Alpaca	0.816±_0.006	0.775 _±0.006	0.746 _±0.005	0.724 _±0.004	0.730_±0.048	0.403_±0.029
Instructional Finetuning	Mental-FLAN-T5	0.802±_0.002	0.759_±0.003	0.756 _±0.001	0.677_±0.005	0.868 _±0.006	0.481 _±0.006
Baseline	Majority	0.500_±–––	0.500_±–––	0.250_±–––	0.500_±–––	0.500_±–––	0.200_±–––
	BERT	0.783_±–––	0.763_±–––	0.690_±–––	0.678_±–––	0.500_±–––	0.332_±–––
	Mental-RoBERTa	0.831 _±–––	0.790 _±–––	0.736_±–––	0.723 _±–––	0.853 _±–––	0.373_±–––