Author manuscript; available in PMC: 2025 Feb 8.
Published in final edited form as: Proc ACM Interact Mob Wearable Ubiquitous Technol. 2024 Mar 6;8(1):31. doi: 10.1145/3643540

Table 4.

Balanced Accuracy Performance Summary of Zero-shot, Few-shot, and Instruction Finetuning on LLMs. ZS_best highlights the best performance among zero-shot prompt designs, including context enhancement, mental health enhancement, and their combination (see Table 2). Detailed results can be found in Table 10 in the Appendix. The small numbers are standard deviations across different designs of PromptPart1-S and PromptPart2-Q. The baseline rows at the bottom have no standard deviation because the task-specific output is static and prompt designs do not apply. Due to the maximum token size limit, we conduct few-shot prompting only on a subset of datasets and mark the infeasible datasets as "−". For each column, the best result is bolded, and the second best is underlined.

| Category | Model | Task #1 (Dreaddit) | Task #2 (DepSeverity) | Task #3 (DepSeverity) | Task #4 (SDCNL) | Task #5 (CSSRS-Suicide) | Task #6 (CSSRS-Suicide) |
|---|---|---|---|---|---|---|---|
| Zero-shot Prompting | Alpaca_ZS | 0.593±0.039 | 0.522±0.022 | 0.431±0.050 | 0.493±0.007 | 0.518±0.037 | 0.232±0.076 |
| | Alpaca_ZS_best | 0.612±0.065 | 0.577±0.028 | 0.454±0.143 | 0.532±0.005 | 0.532±0.033 | 0.250±0.060 |
| | Alpaca-LoRA_ZS | 0.571±0.043 | 0.548±0.027 | 0.437±0.044 | 0.502±0.011 | 0.540±0.012 | 0.187±0.053 |
| | Alpaca-LoRA_ZS_best | 0.571±0.043 | 0.548±0.027 | 0.437±0.044 | 0.502±0.011 | 0.567±0.038 | 0.224±0.049 |
| | FLAN-T5_ZS | 0.659±0.086 | 0.664±0.011 | 0.396±0.006 | 0.643±0.021 | 0.667±0.023 | 0.418±0.012 |
| | FLAN-T5_ZS_best | 0.663±0.079 | 0.674±0.014 | 0.396±0.006 | 0.653±0.011 | 0.667±0.023 | 0.418±0.012 |
| | LLaMA2_ZS | 0.720±0.012 | 0.693±0.034 | 0.429±0.013 | 0.589±0.010 | 0.691±0.014 | 0.261±0.018 |
| | LLaMA2_ZS_best | 0.720±0.012 | 0.711±0.033 | 0.444±0.021 | 0.643±0.014 | 0.722±0.039 | 0.367±0.043 |
| | GPT-3.5_ZS | 0.685±0.024 | 0.642±0.017 | 0.603±0.017 | 0.460±0.163 | 0.570±0.118 | 0.233±0.009 |
| | GPT-3.5_ZS_best | 0.688±0.045 | 0.653±0.020 | 0.642±0.034 | 0.632±0.020 | 0.617±0.033 | 0.310±0.015 |
| | GPT-4_ZS | 0.700±0.001 | 0.719±0.013 | 0.588±0.010 | 0.644±0.007 | 0.760±0.009 | 0.418±0.009 |
| | GPT-4_ZS_best | 0.725±0.009 | 0.719±0.013 | 0.656±0.001 | 0.647±0.014 | 0.760±0.009 | <u>0.441±0.057</u> |
| Few-shot Prompting | Alpaca_FS | 0.632±0.030 | 0.529±0.017 | 0.628±0.005 | − | − | − |
| | FLAN-T5_FS | 0.786±0.006 | 0.678±0.009 | 0.432±0.009 | − | − | − |
| | GPT-3.5_FS | 0.721±0.010 | 0.665±0.015 | 0.580±0.002 | − | − | − |
| | GPT-4_FS | 0.698±0.009 | 0.724±0.005 | 0.613±0.001 | − | − | − |
| Instruction Finetuning | Mental-Alpaca | <u>0.816±0.006</u> | <u>0.775±0.006</u> | <u>0.746±0.005</u> | **0.724±0.004** | 0.730±0.048 | 0.403±0.029 |
| | Mental-FLAN-T5 | 0.802±0.002 | 0.759±0.003 | **0.756±0.001** | 0.677±0.005 | **0.868±0.006** | **0.481±0.006** |
| Baseline | Majority | 0.500 | 0.500 | 0.250 | 0.500 | 0.500 | 0.200 |
| | BERT | 0.783 | 0.763 | 0.690 | 0.678 | 0.500 | 0.332 |
| | Mental-RoBERTa | **0.831** | **0.790** | 0.736 | <u>0.723</u> | <u>0.853</u> | 0.373 |
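Balanced accuracy, the metric reported throughout, is standardly defined as the macro-average of per-class recall, so a constant majority-class predictor scores 1/K on a K-class task — consistent with the Majority baseline row (0.500 for the binary tasks, 0.250 for 4-class Task #3, 0.200 for 5-class Task #6). A minimal sketch of this standard definition (the example labels are illustrative, not from the datasets):

```python
def balanced_accuracy(y_true, y_pred):
    # Macro-average of per-class recall: every class contributes equally,
    # regardless of how frequent it is in the dataset.
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# A majority-class predictor on an (illustrative) 4-class label set:
y_true = [0, 0, 0, 1, 1, 2, 3]
y_pred = [0] * len(y_true)                # always predict the majority class
print(balanced_accuracy(y_true, y_pred))  # 0.25, i.e. 1/K with K = 4
```

This matches the behavior of scikit-learn's `balanced_accuracy_score` and explains why the chance level differs across task columns.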