Proceedings of the National Academy of Sciences of the United States of America
. 2024 Jul 25;121(31):e2410449121. doi: 10.1073/pnas.2410449121

Reply to Wang: Clarifying model performance and language markers of depression across races

Sunny Rai a,b,1, Elizabeth C Stade c, Salvatore Giorgi a,d, Ashley Francisco a, Adithya V Ganesan e, Lyle H Ungar a, Brenda Curtis d, Sharath C Guntuku a,b
PMCID: PMC11295057  PMID: 39052830

We thank Wang for commenting on our recent paper investigating the relationship between race and language markers of depression (1). The letter raises two critiques of our prediction analysis.

First, the letter suggests that reviewing details of the analyses could shed light on “why [the] models trained on the language of Black individuals perform better in the White test set than in the Black test set” (2).

The details of our analyses, which the letter requests, are included in SI Appendix. We report sample sizes at each step of dataset processing (Section 1A) as well as validation metrics and training and test set sizes (Section 1C). While it seems counterintuitive that a model trained on Black individuals would perform better on White individuals, this is mathematically possible if MBlack has a larger error term than MWhite. Put differently, the language of Black individuals explains less of the variance in depression scores than the language of White individuals (Table 1). Notwithstanding platform effects, why this is the case remains an open clinical question. Prior work has also reported attenuated model performance when testing on data from Persons of Color compared with data from White individuals, even with the same amount of training data (3). It is plausible that, among Black individuals, depression is associated with language effects that occur too rarely to be included as features in our models, or that Black individuals use Facebook differently than White individuals do. See Data, Materials, and Software Availability for mean feature scores for Linguistic Inquiry and Word Count (LIWC) (4) and Topic distributions; we did not share raw social media data in order to protect participants' privacy. We hope these details help readers interpret our results.
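A toy calculation makes the error-term point concrete. Pearson r measures how tightly a model's predictions track observed scores, and r squared is the share of variance explained, so a model fit to noisier data yields a lower r on any test set. The numbers below are illustrative placeholders, not study data:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical depression scores and two models' predictions
# (illustrative values only, not data from the study).
y_true = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 1.5, 4.5]
pred_a = [2.1, 3.2, 1.4, 3.8, 2.6, 2.9, 1.7, 4.2]  # tight fit: small errors
pred_b = [2.8, 2.9, 2.2, 3.4, 2.4, 3.3, 2.6, 3.5]  # loose fit: larger errors

for name, pred in [("model A", pred_a), ("model B", pred_b)]:
    r = pearson_r(y_true, pred)
    # r**2 is the fraction of variance in the observed scores
    # explained by the predictions.
    print(f"{name}: r = {r:.3f}, variance explained = {r**2:.3f}")
```

Model B, whose predictions carry a larger error term, attains a lower r than model A on the very same test targets, mirroring how a model trained on noisier data performs worse wherever it is evaluated.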

Table 1.

Comparison of the performance of different feature sets at predicting depression scores, reported as Pearson r (P value in parentheses)

Feature set                                       Model    White test set    Black test set
1–3grams, LIWC, and Topic distributions           MWhite   0.392 (<0.001)    0.132 (0.006)
                                                  MBlack   0.204 (<0.001)    0.126 (0.009)
Pretrained BERT embeddings (layers 11 and 12)     MWhite   0.347 (<0.001)    0.104 (0.031)
                                                  MBlack   0.161 (0.001)     0.058 (0.225)
Pretrained RoBERTa embeddings (layers 11 and 12)  MWhite   0.345 (<0.001)    0.142 (0.003)
                                                  MBlack   0.235 (<0.001)    0.128 (0.007)
RoBERTa embeddings fine-tuned on Facebook         MWhite   0.353 (<0.001)    0.179 (<0.001)
                                                  MBlack   0.288 (<0.001)    0.095 (0.048)

Second, rather than using Bidirectional Encoder Representations from Transformers (BERT) (5) as a frozen embedding model, the letter suggests fine-tuning BERT or using a model such as Robustly Optimized BERT Pre-training Approach (RoBERTa).

To study whether and how known language markers of depression (i.e., first-person pronouns and negative emotions) vary by race, we prioritized interpretable feature sets (i.e., LIWC and topics). We considered BERT-base to evaluate the predictive performance of one class of large language models. In Table 1, we report the evaluation of pretrained RoBERTa embeddings (layers 11 and 12) and of a fine-tuned RoBERTa model in which all 12 layers were unfrozen to minimize the masked language modeling loss (6), with masking probability set to 0.15, trained for one epoch on a dataset of 36 million Facebook messages from over 60 thousand individuals (7). While such domain-adaptive training aligned the model to Facebook language, the impact of the racial distribution of the training data on the fine-tuned model remains an open question.
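The masking step of this kind of domain-adaptive training can be sketched in plain Python. The 15% masking probability matches the setting above; the 80/10/10 replacement split and the -100 "ignore" label follow the standard BERT masked-language-modeling recipe (5, 6). The token ids are placeholders, and this sketch is for illustration only, not the training code used in our analysis:

```python
import random

MASK_ID = 103       # id of the [MASK] token in BERT's vocabulary
VOCAB_SIZE = 30522  # BERT-base vocabulary size
MASK_PROB = 0.15    # masking probability used in our fine-tuning

def mask_tokens(token_ids, rng):
    """BERT-style masking: select ~15% of positions as MLM targets;
    of those, 80% become [MASK], 10% a random token, 10% unchanged.
    Returns (masked inputs, labels); -100 marks positions excluded
    from the masked-language-modeling loss."""
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < MASK_PROB:
            labels.append(tok)  # this position contributes to the loss
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK_ID)
            elif roll < 0.9:
                inputs.append(rng.randrange(VOCAB_SIZE))
            else:
                inputs.append(tok)
        else:
            labels.append(-100)
            inputs.append(tok)
    return inputs, labels

rng = random.Random(0)
ids = list(range(1000, 1050))  # 50 stand-in token ids
masked, labels = mask_tokens(ids, rng)
```

The model is then trained to recover the original token at every position whose label is not -100, which is what adapts its representations to the domain's language.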

RoBERTa-base outperformed BERT-base, and fine-tuned RoBERTa obtained results comparable to the pretrained models (Table 1), as also noted in other work (8, 9). All models performed worse on data from Black individuals than on data from White individuals, which aligns with our paper's results (1).

Acknowledgments

Author contributions

S.R., S.G., and S.C.G. designed research; S.R., E.C.S., A.F., and S.C.G. performed research; S.R., E.C.S., S.G., A.F., A.V.G., and S.C.G. contributed new reagents/analytic tools; S.R., E.C.S., S.G., A.F., and S.C.G. analyzed data; and S.R., E.C.S., S.G., L.H.U., B.C., and S.C.G. wrote the paper.

Competing interests

The authors declare no competing interest.

References

  • 1. Rai S., et al., Key language markers of depression on social media depend on race. Proc. Natl. Acad. Sci. U.S.A. 121, e2319837121 (2024), 10.1073/pnas.2319837121.
  • 2. Wang Y., Large language models for depression prediction. Proc. Natl. Acad. Sci. U.S.A. 121, e2409757121 (2024).
  • 3. Aguirre C., Harrigian K., Dredze M., "Gender and racial fairness in depression research using social media" in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Merlo P., Tiedemann J., Tsarfaty R., Eds. (Association for Computational Linguistics, Online, 2021), pp. 2932–2949, 10.18653/v1/2021.eacl-main.256.
  • 4. Boyd R. L., Ashokkumar A., Seraj S., Pennebaker J. W., The Development and Psychometric Properties of LIWC-22 (University of Texas at Austin, Austin, TX, 2022).
  • 5. Devlin J., Chang M. W., Lee K., Toutanova K., BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv [Preprint] (2018). https://arxiv.org/abs/1810.04805.
  • 6. Gururangan S., et al., "Don't stop pretraining: Adapt language models to domains and tasks" in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Jurafsky D., Chai J., Schluter N., Tetreault J., Eds. (Association for Computational Linguistics, Online, 2020), pp. 8342–8360, 10.18653/v1/2020.acl-main.740.
  • 7. Eichstaedt J. C., et al., Closed- and open-vocabulary approaches to text analysis: A review, quantitative comparison, and recommendations. Psychol. Methods 26, 398 (2021).
  • 8. Liu Y., et al., RoBERTa: A robustly optimized BERT pretraining approach. arXiv [Preprint] (2019). https://arxiv.org/abs/1907.11692 (Accessed 5 June 2024).
  • 9. Guo Y., Sarker A., "SocBERT: A pretrained model for social media text" in Proceedings of the Fourth Workshop on Insights from Negative Results in NLP, Tafreshi S., et al., Eds. (Association for Computational Linguistics, Dubrovnik, Croatia, 2023), pp. 45–52.

