Table 4. Balanced Accuracy Performance Summary of Zero-shot, Few-shot, and Instruction Finetuning on LLMs. Each zero-shot cell reports the best performance among the zero-shot prompt designs, including context enhancement, mental health enhancement, and their combination (see Table 2); detailed results can be found in Table 10 in the Appendix. The small numbers are standard deviations across the different prompt designs. The baselines in the bottom rows have no standard deviation because their task-specific outputs are static and prompt designs do not apply. Due to the maximum token size limit, we conduct few-shot prompting only on a subset of datasets and mark the infeasible datasets with "—". In each column, the best result is bolded and the second best is underlined.
| Category | Model | Task #1 (Dreaddit) | Task #2 (DepSeverity) | Task #3 (DepSeverity) | Task #4 (SDCNL) | Task #5 (CSSRS-Suicide) | Task #6 (CSSRS-Suicide) |
|---|---|---|---|---|---|---|---|
| Zero-shot Prompting | | 0.593±0.039 | 0.522±0.022 | 0.431±0.050 | 0.493±0.007 | 0.518±0.037 | 0.232±0.076 |
| | | 0.612±0.065 | 0.577±0.028 | 0.454±0.143 | 0.532±0.005 | 0.532±0.033 | 0.250±0.060 |
| | | 0.571±0.043 | 0.548±0.027 | 0.437±0.044 | 0.502±0.011 | 0.540±0.012 | 0.187±0.053 |
| | | 0.571±0.043 | 0.548±0.027 | 0.437±0.044 | 0.502±0.011 | 0.567±0.038 | 0.224±0.049 |
| | | 0.659±0.086 | 0.664±0.011 | 0.396±0.006 | 0.643±0.021 | 0.667±0.023 | 0.418±0.012 |
| | | 0.663±0.079 | 0.674±0.014 | 0.396±0.006 | 0.653±0.011 | 0.667±0.023 | 0.418±0.012 |
| | | 0.720±0.012 | 0.693±0.034 | 0.429±0.013 | 0.589±0.010 | 0.691±0.014 | 0.261±0.018 |
| | | 0.720±0.012 | 0.711±0.033 | 0.444±0.021 | 0.643±0.014 | 0.722±0.039 | 0.367±0.043 |
| | | 0.685±0.024 | 0.642±0.017 | 0.603±0.017 | 0.460±0.163 | 0.570±0.118 | 0.233±0.009 |
| | | 0.688±0.045 | 0.653±0.020 | 0.642±0.034 | 0.632±0.020 | 0.617±0.033 | 0.310±0.015 |
| | | 0.700±0.001 | 0.719±0.013 | 0.588±0.010 | 0.644±0.007 | 0.760±0.009 | 0.418±0.009 |
| | | 0.725±0.009 | 0.719±0.013 | 0.656±0.001 | 0.647±0.014 | 0.760±0.009 | <u>0.441±0.057</u> |
| Few-shot Prompting | | 0.632±0.030 | 0.529±0.017 | 0.628±0.005 | — | — | — |
| | | 0.786±0.006 | 0.678±0.009 | 0.432±0.009 | — | — | — |
| | | 0.721±0.010 | 0.665±0.015 | 0.580±0.002 | — | — | — |
| | | 0.698±0.009 | 0.724±0.005 | 0.613±0.001 | — | — | — |
| Instruction Finetuning | Mental-Alpaca | <u>0.816±0.006</u> | <u>0.775±0.006</u> | <u>0.746±0.005</u> | **0.724±0.004** | 0.730±0.048 | 0.403±0.029 |
| | Mental-FLAN-T5 | 0.802±0.002 | 0.759±0.003 | **0.756±0.001** | 0.677±0.005 | **0.868±0.006** | **0.481±0.006** |
| Baseline | Majority | 0.500 | 0.500 | 0.250 | 0.500 | 0.500 | 0.200 |
| | BERT | 0.783 | 0.763 | 0.690 | 0.678 | 0.500 | 0.332 |
| | Mental-RoBERTa | **0.831** | **0.790** | 0.736 | <u>0.723</u> | <u>0.853</u> | 0.373 |
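For reference, balanced accuracy is the mean of the per-class recalls, which is why the Majority baseline scores exactly 1/K on a K-class task (0.500 for the binary tasks, 0.250 for the 4-level Task #3, 0.200 for the 5-level Task #6). A minimal sketch of the metric (illustrative only, not the paper's evaluation code; the class counts below are made up for the example):

```python
# Balanced accuracy = macro-averaged recall: average the recall of each
# true class, so class imbalance cannot inflate the score.
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes present in y_true."""
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # examples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# A majority-class predictor on an imbalanced 4-class task (cf. Task #3):
# recall is 1.0 for the majority class and 0.0 for the other three.
y_true = [0] * 70 + [1] * 15 + [2] * 10 + [3] * 5
y_pred = [0] * 100  # always predict the majority class
print(balanced_accuracy(y_true, y_pred))  # → 0.25, matching the Majority row
```

The same quantity is available as `sklearn.metrics.balanced_accuracy_score`; the manual version above only makes the 1/K behavior of the Majority baseline explicit.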