Author manuscript; available in PMC: 2025 Mar 16.
Published in final edited form as: Psychiatry Res. 2023 Dec 10;333:115667. doi: 10.1016/j.psychres.2023.115667

Beyond rating scales: With targeted evaluation, large language models are poised for psychological assessment

Oscar NE Kjell a,b,#,*, Katarina Kjell a, H Andrew Schwartz a,b,#
PMCID: PMC11911012  NIHMSID: NIHMS2062240  PMID: 38290286

Abstract

In this narrative review, we survey recent empirical evaluations of AI-based language assessments and present a case that the technology of large language models is poised to change standardized psychological assessment. Artificial intelligence has been undergoing a purported “paradigm shift” initiated by new machine learning models, large language models (e.g., BERT, LLaMA, and the model behind ChatGPT). These models have led to unprecedented accuracy over most computerized language processing tasks, from web searches to automatic machine translation and question answering, while their dialogue-based forms, like ChatGPT, have captured the interest of over a million users. The success of the large language model is mostly attributed to its capability to numerically represent words in their context, long a weakness of previous attempts to automate psychological assessment from language. While potential applications for automated therapy are beginning to be studied on the heels of ChatGPT’s success, here we present evidence suggesting that, with thorough validation of targeted deployment scenarios, AI’s newest technology can move mental health assessment away from rating scales and toward how people naturally communicate: in language.

Keywords: Large language models, Transformers, Artificial intelligence, Psychology, Assessment

1. Introduction

Recently, artificial intelligence-based (AI-based) language analysis has undergone a “paradigm shift” that is fundamentally changing how systems are developed in the field (Bommasani et al., 2021). Just a few years ago, natural language systems were primarily purpose-built – statistically optimized for a particular task such that, for example, systems for answering natural language questions (i.e., question answering) used a different model than those for sentiment analysis (scoring the positivity or negativity of a text) or paraphrasing (producing alternative phrases for a small section of text). Now, nearly all AI language systems are built on a large language model base or “foundational” model. The state-of-the-art systems for sentiment analysis, question answering, paraphrasing, and dozens of other language tasks are based on the same underlying statistical deep learning model, which only needs to be “fine-tuned” or adapted to perform particular tasks. This technology now touches the daily life of nearly everyone with a smartphone, as it has quickly become the basis for modern Web search (Nayak, 2019), digital assistants’ language (Alexa, Siri, etc.), machine translation, and keypad autocompletion. In fact, base models for other domains of AI (vision, speech) are now benefiting from integrating language models (e.g., Gao et al., 2023; Radford et al., 2023).

The transformer-based Large Language Model is the technology enabling this purported paradigm shift (Bommasani et al., 2021; Devlin et al., 2019). Large language models owe their success largely to their ability to statistically model words in a large context with which they occur by using the transformer, a particular deep learning technique (Devlin et al., 2019; Vaswani et al., 2017). Bringing such context to psychological text analysis, large language models can more precisely quantify the specific meaning of language and yield a truer understanding of the person behind the words.

The link between language and psychological phenomena has long been known (Boyd and Schwartz, 2021; Pennebaker et al., 2003; Tausczik and Pennebaker, 2010), and while the use of AI in psychology is not yet widespread, it has been used to successfully gain insights into, e.g., who we are (Argamon et al., 2007; Berger and Packard, 2021; Kwantes et al., 2016; Schwartz et al., 2013), how we feel (De Bruyne et al., 2022; Eichstaedt et al., 2018; Sun et al., 2020), our behaviors (Curtis et al., 2018; Kjell et al., 2021; Macavaney et al., 2021), and other topics (Eichstaedt et al., 2020; Iliev et al., 2015; Jackson et al., 2021; Schwartz and Ungar, 2015). However, quantitative assessment of the primary way humans communicate (language) has yet to reach wide-spread adoption (Boyd and Schwartz, 2021).1

Even without large language models, using AI-based language analysis of probed language (i.e., open-ended responses to survey questions), it is possible to derive a quantified score of a psychological construct with moderately high convergence (r = 0.72) with rating scales (Kjell et al., 2019a). Large language models push the accuracy toward the theoretical upper limit of predicting rating scales, with correlations upwards of r = 0.85 (Kjell et al., 2022). Hence, early empirical successes in using large language models for mental health assessments (e.g., Kjell et al., 2022; Matero et al., 2019; Mohammadi et al., 2019; Zirikly et al., 2019) suggest that this technique is not only changing AI (Bommasani et al., 2021) but is also essential for improving the psychological assessment of mental health (Fig. 1). This suggests that the technique has the potential to modernize assessment methods, from the reliance on closed-ended rating scale responses to more accurate, fine-grained, and ecologically valid assessments of individuals’ state of mind. By fully leveraging individuals’ personal descriptions of their mental state in their own words, the technique has the potential not only to improve current assessments incrementally but also to change the very nature of how individuals’ states of mind are both measured and described, and ultimately to increase our understanding of mental health.

Fig. 1.


The goal of assessments is to match the true psychological traits and states (gold arrow) but most work thus far evaluates against rating scales (red arrow).

To examine whether large language models can modernize psychological assessments of subjective states of mind and experiences beyond rating scales, we (1) review the intrinsic advantages of natural language in communicating mental health and show how language has favorable measurement characteristics, (2) describe how word context matters in mental health and review how the unique contributions of large language models may realize the measurement precision of language, and (3) provide evidence indicating how these advantages of large language models can make the long-held goal of grounding psychological mental health assessment in natural language a reality. Lastly, we discuss biases, risks, and ethical considerations related to using large language models for psychological assessments in research and clinical settings.

Large language models, specifically the application ChatGPT, have received much attention in the last year. This narrative review is predominantly not about ChatGPT but rather about the AI technology behind it, the transformer. The evidence presented in this review concerns the readiness of transformer-based large language models for psychological assessment. In fact, the transformer language models behind ChatGPT (i.e., GPT-3.5 and GPT-4) are not appropriate for most psychological assessment research because they are closed (not accessible) and often changed (Chen et al., 2023). On the other hand, the majority of transformers we discuss below are open (available for download) and static. Our focus is on reviewing evidence on the application of the technique for assessment rather than other psychological tasks, such as the delivery of automated psychotherapy, which likely needs much more development and has not been empirically validated for safety and efficacy (Stade et al., 2023).

This narrative review focuses on psychological assessment from participant language use (as opposed to wearable or smartphone sensor data; e.g., see D’Alfonso, 2020; Melcher et al., 2020; Seppälä et al., 2019). While there is also much interest in large language models in the delivery of interventions or therapy through specialized chatbots (e.g., see Boucher et al., 2021; Parmar et al., 2022; Torous et al., 2021), the technology is much further established empirically for psychological assessment. We further focus on the behavior of natural language use and the subjective experiences language expresses (as opposed to neuropsychological tests; e.g., Chandler et al., 2020). The term language-based assessments, introduced by Park et al. (2015), refers to the automatic generation of a score for a given psychological construct from observed language use patterns. Importantly, language-based assessments can integrate additional data beyond language (e.g., participant age; Son et al., 2022), and for the scope of this article, we consider it a language-based assessment as long as the language is used in the psychological construct score calculation. Beyond this quantitative assessment, data for language-based assessment lend themselves to additional analyses such as data-driven language-based summaries/depictions of psychological states or traits (e.g., Schwartz et al., 2014; Kjell et al., 2019).

2. Intrinsic advantages of natural language

Imagine that a new patient is in the midst of their intake assessments. However, instead of being presented with a slew of questionnaires, they are presented with a prompt for a natural conversation and start talking with AssessmentTransformer (AT), an AI-based social dialog agent.

AT starts: How are you feeling?

Patient: Eh. A bit down. Just having a rest day.

AT: Please elaborate; how are you feeling a bit down?

Patient: I’m tired and not motivated to do anything today. Work is not improving despite raising my concerns. I feel trapped, but I’m putting on a brave face for my family.

An engaging interaction follows, where the patient expresses their psychological traits and states in their natural communication medium. AT has a decent grasp of the psychological meaning of the words being expressed because it understands them in their context – that they have concerns with work, that feeling trapped refers to things not improving, and that putting on a brave face does not mean they are literally putting on a mask. After the exchange, AT produces robust quantitative scores for the patient’s mental health (e.g., level of depression, anxiety, stress). In addition, qualitative explanations and descriptions of such assessments provide context to the quantitative scores in the patient’s own words (e.g., the level of depression relates to feeling down, tired, not motivated, and trapped at work).

Accurate mental health assessments are central to ensuring great care. It is a prerequisite for precision in healthcare: to provide the right treatment for the right person at the right time (i.e., precision mental health; Delgadillo and Lutz, 2020). In addition, accurate assessment is the core for improving care: It is the foundation to systematically develop and secure the quality of care. Intrinsically, natural language as a response format has ecological validity –it has long been considered a “window” into psychological states and is our natural way of communicating inner experiences and states of mind (e.g., Tausczik and Pennebaker 2010).

Quantitatively assessing language in an accurate manner has previously been difficult. Likert’s (1932) popular closed-ended rating scale format side-steps this by attempting to capture a one-dimensional latent variable of individuals’ attitudes (which has been generalized to assess psychiatric disorders and experiences more broadly). However, besides this being an unnatural way for people to communicate their states of mind, the scale reduces possible responses to a relatively narrow, fixed range and a constrained resolution –there is only a finite number of possible responses and scores. With sufficient development cycles, advances from Classical Test Theory and Item Response Theory have introduced methods to better select items and aggregate closed-ended responses into latent variables thought to represent true states and traits better (Lord, 2012; Thomas, 2011). However, the closed-ended nature necessitated by these methods does not allow respondents any flexibility in expressing a state of mind that deviates from the posed items –complex or unusual views are lost. They are also still limited to the inherent information loss of the item-response format. Compared to open-ended natural language, the rating scale method is overly reductionist.

2.1. Information-rich language

The idea of capturing more information can be formalized via information theory (Shannon, 1948), whereby a mathematical concept of self-information can be measured as the amount of diversity that can be represented in a dataset. Self-information is a key measure in machine learning as it shows the amount of information that algorithms have at their disposal to learn from. In general, the greater the self-information, the greater the expected ability to predict variables from a given dataset (MacKay, 2023). For example, a yes/no item that is answered 50 % yes and 50 % no in a dataset will have more information than one that is answered 90 % yes (because the latter does not distinguish patients as well). Data from more item responses yield more information only if answered in a way that cannot be mapped from the other items. Similarly, more words in natural language responses do not automatically yield more information; in fact, it turns out that individuals’ descriptions of their harmony in life using ten descriptive words (mean = 9.8 words/response) yield more information than their corresponding text responses (mean = 69 words/response; Kjell et al., 2022). However, to our knowledge, a comparison between natural language and rating scale responses has not been done.
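The yes/no example can be checked numerically. The sketch below is illustrative only: it assumes Shannon entropy (the expected self-information of a response) as the measure, which may differ from the Diversity Index used in the comparison that follows, and the function name `entropy_bits` is our own.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy (in bits) of a discrete response distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

balanced_item = entropy_bits([0.5, 0.5])  # item answered 50 % yes, 50 % no
skewed_item = entropy_bits([0.9, 0.1])    # item answered 90 % yes, 10 % no

print(f"balanced item: {balanced_item:.3f} bits")  # 1.000 bits
print(f"skewed item:   {skewed_item:.3f} bits")    # 0.469 bits
```

Under this assumption, the evenly split item carries about twice as much information per response as the skewed one, which is the sense in which it distinguishes patients better.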

To examine the difference in self-information of natural language versus rating scale responses assessing general affect, we asked 100 participants2 the open-ended question How are you feeling? and provided an open response box. This was followed by the commonly used closed-ended rating scale, the Positive and Negative Affect Schedule (PANAS), asking respondents to describe their “feelings and emotions” using 20 affect-related items rated from 1 = “very slightly or not at all” to 5 = “extremely”.

Applying a self-information measure (the Diversity index3) to both response formats (Fig. 2A) demonstrates 4.8 times more self-information from the natural language responses (Diversity Index = 366) as compared with the rating scale responses (Diversity Index = 77). Thus, the natural language response tells us more, yielding a greater ability to distinguish responses than the full PANAS item-response scale. As such, language comprises many favorable measurement characteristics, including high range, resolution, dimensionality (Fig. 2B, C), and openness.

Fig. 2.


A. Comparison of self-information between response formats for affect. Self-information (also known as the Diversity Index) indicates a theoretical minimum number of bits necessary to represent the data. Natural language responses contain more than four times as much information as rating scales. B. Depiction of characteristics of assessment response data that can lead to greater information content. The assessment characteristics include range (i.e., the lower and upper limits), resolution (i.e., the smallest measurable interval) and multi-dimensionality (i.e., including several dimensions). Language has the ability to be greater in all three characteristics. C. Illustration of standard versus high resolution, range and dimensionality measurement in coordinate space. The standard resolution demonstrates 2 dimensions with a range of 5 values, while the high resolution demonstrates a 3rd dimension (in reality, it may have many more) with greater range and resolution.

The range of language enables us to describe the extremes (absolutely loves and hates), while its resolution yields nuanced differences between (cherishes, loves, adores, likes). The multi-dimensionality of language affords efficient and detailed communication of complex states of mind (love, excitement, joy, awe), which are not constrained to one dimension. The openness of natural language also enables us to creatively and fittingly construct personal ways of describing our state of mind (e.g., choosing adoration or despise rather than love or hate, or communicating multi-faceted descriptions for a situation that researchers or clinicians may not have anticipated: it was rough but it’s over now).

Of course, that natural language responses comprise more information than rating scales, does not necessarily mean they yield better measurement of any particular construct –the information may not be relevant to the psychological construct. Next, we focus on how large language models also deliver on psychological construct-relevant information.

3. Context matters

An important aspect of language is that words take on different meanings depending on context. Understanding words in their context computationally (word sense disambiguation) has been considered “an AI-complete problem, that is, a task whose solution is at least as hard as the most difficult problems in artificial intelligence” (Navigli, 2009). The ability to contextualize word sense is essential for capturing different psychological dimensions. For example, consider the italicized words in the responses from Person A and B:

| | Person A | Person B |
|---|---|---|
| How are you? | I feel fine – even great! | My life is a great mess! I’m having a very hard time being happy. |
| What is going on? | Earlier, I played the game Yahtzee with my partner. I could not get that die to roll a 1! Now I’m lying on my bed for a rest. | My business partner was lying to me. He was trying to game the system and played me. I think I am going to die – he left and now I have to pay the rest of his fine. |

The meaning of play differs in the examples, from amusing recreational activity (Person A) to being taken advantage of (Person B). One would not be hard-pressed to be convinced that each has a different psychological meaning –the affective valence ranges from likely positive (Person A) to likely negative (Person B). In fact, according to the popular dictionary WordNet (Miller, 1995), play has at least 52 senses. When analyzing words outside of context, they are strikingly ambiguous, which is especially true for frequently used words that tend to have considerably more senses (Resnik, 1995).4

The different word senses brought out from the contexts have important connotations for the meaning and thus also for psychological insights. The word order within a context is also important; for example, consider the difference in meaning between “the patient loves the therapy session with the therapist” versus “the therapist loves the patient in the therapy session”; as ChatGPT notes: “In the first, … the focus is on the patient’s feelings and their positive experience of the therapy session”, while “In the second statement, … the focus is on the therapist’s feelings and their positive connection with the patient” with the additional context “it is not appropriate for a therapist to have romantic or sexual feelings for their patient”. While the goal of integrating context into language for psychological analysis has been sought previously (Landauer, 1999; Schwartz et al., 2013; Tausczik and Pennebaker, 2010), it was not possible to achieve effectively for every mention of a word prior to the large language model.

3.1. The development of contextual word representations

Word embeddings are needed to turn language data (i.e., lists of letters) into a quantitative form that captures word meaning and to which statistical techniques can be directly applied. The fundamental approach to word representations is to map each word to a list of numbers (i.e., word vectors). The idea of numerically capturing words’ meaning took off in the 1950s with the convergence of ideas from psychology, linguistics, and computer science (Fig. 3A; Jurafsky and Martin, 2020). Until recently, most methods utilized a bag-of-words approach whereby the order of the words in context is not taken into account (an obvious oversimplification that is nevertheless often useful). This method can be contrasted with ordered context from positional embeddings, which enable encoding implicit syntactic structure. Large language models are the product of a long-term goal within AI to go beyond “bag-of-words” (e.g., see the predecessor to BERT called ELMo; Peters et al., 2018).
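To make the bag-of-words limitation concrete, the following minimal sketch counts words in two shortened variants of the patient/therapist sentences discussed earlier. The helper `bag_of_words` and the shortened sentences are our own illustration, not part of any cited system.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, discarding word order."""
    return Counter(text.lower().split())

sentence_a = "the patient loves the therapist"
sentence_b = "the therapist loves the patient"

# Identical word counts, even though who loves whom is reversed.
same_bag = bag_of_words(sentence_a) == bag_of_words(sentence_b)
same_order = sentence_a.split() == sentence_b.split()

print(same_bag)    # True
print(same_order)  # False
```

Any model that only sees the counts cannot tell the two sentences apart, which is why ordered, contextual representations matter for psychological meaning.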

Fig. 3.


A. The timeline of the development of transformer-based language models. Transformer-based language models came out of work on language modeling (i.e., the task of estimating the probability of words in their context) and vector semantics (i.e., an approach to representing the meaning of words or phrases as an array of continuously valued numbers). Transformer language models represent words in their context as continuously valued numbers optimized for the task of language modeling. B. Depiction of embedding the sentence “My life is a great mess” using traditional word-type embeddings. Word-type embeddings always represent the same word with the same vector. In this sense, when they are applied, they have no notion of context. C. Depiction of contextual word embeddings from transformers for the same sentence. Such embeddings start with static representations from word-type embeddings but then proceed to produce new embeddings that consider the other words in context. For example, the representation of “great” depends not only on itself but also on what comes before and after, such as “mess”. Contextual word embeddings utilize a different representation for each instance of a word depending on its context, and therefore they can encode the more precise semantics or meaning necessary for language modeling.

Note: Refs. “Bengio et al. (2003), Blei et al. (2003), Brown et al. (1992), Collobert and Weston (2008), Jelinek et al. (1975), Markov (1913), Osgood (1952), Mikolov et al. (2013), Switzer (1964).”

Large language models’ success in going beyond bag-of-words approaches is attributed to deep neural architectures/algorithms, advances in specialized hardware, large language datasets, and their algorithms enabling large training sizes.5 The technique enables capturing non-linear relationships of how words relate and interact with each other. An important part of the training process of large language models involves predicting a missing (masked) word within a given sequence of words. To succeed with the prediction, the model needs to learn syntax and associations among words; the model thus learns from how the words are used in the training dataset. In addition, the large language model algorithm relies largely on attention, a mechanism that weights the effect of context words on a target word in a given sequence. Hence, attention enables the representation of the relationship between words in a sequence, which can capture long-term information, dependencies, and interactions of words in a text (Fig. 3B, C).
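The contrast between word-type and contextual embeddings can be sketched with a single, untrained self-attention step. This is a toy illustration under strong assumptions: the vectors and projection matrices are random rather than learned, positional embeddings and multiple layers are omitted, and all names (`static`, `W_q`, `contextual_embeddings`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
vocab = ["i", "feel", "great", "my", "life", "is", "a", "mess"]

# Word-type (static) embeddings: one fixed vector per word (random here).
static = {word: rng.normal(size=dim) for word in vocab}
# Hypothetical projection matrices; in a real model these are learned.
W_q, W_k, W_v = (rng.normal(size=(dim, dim)) for _ in range(3))

def embed_static(tokens):
    """Look up the fixed vector of each token, ignoring context."""
    return np.stack([static[t] for t in tokens])

def contextual_embeddings(tokens):
    """One self-attention pass: each output row mixes in the other tokens."""
    X = embed_static(tokens)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(dim)               # relevance of each context word
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V, weights

sent_a = ["i", "feel", "great"]
sent_b = ["my", "life", "is", "a", "great", "mess"]
ia, ib = sent_a.index("great"), sent_b.index("great")

static_same = np.array_equal(embed_static(sent_a)[ia], embed_static(sent_b)[ib])
ctx_a, weights_a = contextual_embeddings(sent_a)
ctx_b, weights_b = contextual_embeddings(sent_b)
contextual_same = np.allclose(ctx_a[ia], ctx_b[ib])

print("static 'great' identical across sentences:    ", static_same)
print("contextual 'great' identical across sentences:", contextual_same)
```

The static lookup returns the same vector for “great” in both sentences, whereas the attention output for “great” changes with its neighbors, which is the property the figure caption describes.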

3.2. Large language models

The first widely adopted large language model is called BERT (short for Bidirectional Encoder Representations from Transformers; Devlin et al., 2019). Released for open use by Google in 2018, BERT has been followed by a family of large language models, including RoBERTa (Liu et al., 2019), GPT3 (Brown et al., 2020), and XLNet (Yang et al., 2019).

Large language models brought about unprecedented accuracy increases across a wide range of standardized Natural Language Processing tasks (NLP; AI’s subfield on language analysis), even surpassing the non-expert human baseline. Language models are typically evaluated and compared on a variety of different tasks, where two of the most common collections of standardized tests and benchmarks include: The General Language Understanding Evaluation (GLUE; Wang et al., 2018) and the SuperGLUE (Wang et al., 2019).

The GLUE suite comprises nine, and the SuperGLUE eight, carefully selected, standardized language understanding tasks.6 The tests include diverse tasks such as sentiment prediction, paraphrasing, similarity, grammar control, word sense disambiguation, causal reasoning, common sense reasoning, reading comprehension, natural language inference, and question answering. With the diverse set of tasks, GLUE and SuperGLUE favor language models that demonstrate “general-purpose language understanding” (Wang et al., 2019).

At the time of writing (September 2022), 20 different models surpass human performance. Table 1 presents examples of standardized NLP tasks from GLUE along with person-level language tasks (such as assessing depression and suicide risk, as presented later), describing top-performing approaches and their performance. All top-performing approaches include large language models for both standard NLP tasks and person-level tasks.

Table 1.

Examples of Standard NLP and Person-Level Language Tasks: Top Performing Systems and their Performance.

Standard Document-Level NLP tasks (GLUE)

| Task | Top performing approach [A] | Performance |
|---|---|---|
| Sentiment (SST-2): Is a review of a movie positive or negative? | Large language models-architecture for large-scale knowledge enhanced pretraining (ERNIE [1]) | Accuracy = 0.978; GLUE Human baseline [B] = 0.978 |
| Paraphrase (MRPC): Is sentence B a paraphrase of sentence A? | Large language model decoding-enhanced BERT with disentangled attention (DeBERTa / Turing NLR v4 [2]) | F1 = 0.940; Accuracy = 0.920; GLUE Human baseline [B] = 0.863/0.808 |
| Similarity (STS-B): How similar are the two sentences A and B? | Large language models with efficient denoising pretraining (METRO / Turing NLR v5 [3]) | Pearson r = 0.937; GLUE Human baseline [B] = 0.927 |
| Acceptability (CoLA): Is a sentence grammatical or ungrammatical? | Large language models for large-scale knowledge enhanced pre-training (ERNIE [1]) | Matthews correlation = 0.738; GLUE Human baseline [B] = 0.664 |

Person-Level Psychological Tasks

| Task | Top performing approach | Performance |
|---|---|---|
| Assessing depression using Twitter data from a shared task (CLPsych 2015; Coppersmith et al., 2015); N = 327, test set n = 150 | Large language models (MentalRoBERTa [4], see also [5]); RoBERTa fine-tuned on mental health related Reddit data | F1 = 0.697 |
| Assessing suicide risk using Reddit data from the SuicideWatch online forum (Suicide forums only), and users’ other Reddit posts (Suicide + all forums); N = 621 | Suicide forums only: Large language models (Multifeature Fusion Attention Network) [6]; Suicide + all forums: Large language models with multi-level dual-context language and BERT [7] | Suicide forums only: F1 = 0.514; Suicide + all forums: F1 = 0.457 |
| Assessing personality from social media (Facebook) language; N = 68,687, test set n = 1943 | Large language models (word- and message-level attention in combination with past approaches) [9] | Disattenuated Pearson r = 0.54–0.66 |
| Assessing well-being (harmony in life) from probed language; N = 608 | Large language models (BERT) [8] | Pearson r = 0.85; Disattenuated Pearson r = 1.00 |

Notes. SST-2 = The Stanford Sentiment Treebank; MRPC = The Microsoft Research Paraphrase Corpus; STS-B = The Semantic Textual Similarity Benchmark; CoLA = The Corpus of Linguistic Acceptability.

[A] Top performing systems are selected from the GLUE leaderboard (https://gluebenchmark.com/leaderboard), where the system needs to be in the top 50 overall, be described with a URL, and be accompanied by a manuscript describing the system.

[B] GLUE Human baseline = “a conservative estimate of human performance”, where the participants/annotators were non-experts recruited through crowdsourcing (Nangia and Bowman, 2019).

[4] MentalRoBERTa (Ji et al., 2021).

[5] Matero et al. (2021) show that RoBERTa performs more accurately than non-transformers.
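For readers less familiar with the classification metrics reported in Table 1, the sketch below computes accuracy, F1, and the Matthews correlation from a binary confusion matrix. The counts are hypothetical and chosen only for illustration; they do not correspond to any system in the table.

```python
import math

def accuracy(tp, fp, fn, tn):
    """Share of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    return 2 * tp / (2 * tp + fp + fn)

def matthews_corrcoef(tp, fp, fn, tn):
    """Correlation between predicted and true binary labels (-1 to 1)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical confusion-matrix counts for 100 texts.
tp, fp, fn, tn = 40, 10, 20, 30

acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
mcc = matthews_corrcoef(tp, fp, fn, tn)
print(f"accuracy = {acc:.3f}, F1 = {f1:.3f}, Matthews = {mcc:.3f}")
# accuracy = 0.700, F1 = 0.727, Matthews = 0.408
```

The three metrics can diverge noticeably on the same predictions, which is one reason the tasks in Table 1 should be compared within, rather than across, rows.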

3.3. Leveraging big data information for small samples

The typical focus in NLP is to model language itself using huge amounts of data and employ these language models to solve language tasks (e.g., GLUE tasks). However, human-level AI models the individual behind the language (Ganesan et al., 2021; Soni et al., 2022). Modeling a person behind a text may include assessing their depression or suicide risk (Ganesan et al., 2021). Importantly, a huge amount of participant-generated data is not required to apply large language models in clinical sciences. First, small language samples can be submitted to a pre-trained model that is based on a large language model and trained to produce a psychological score. Existing large language models can be downloaded and applied using both Python (e.g., see DLATK; Schwartz et al., 2017) and R (e.g., see the text-package; Kjell et al., 2023). Second, developing predictive models with relatively small sample sizes can be achieved through dimensionality reduction of word embeddings (Ganesan et al., 2021); this has, for example, been done when predicting demographics (age, gender), personality (extraversion, openness), and mental health (suicide risk), with as few as 50 participants (and results approaching large-scale model accuracies with as few as 500 participants).
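The second route, reducing embedding dimensions before fitting a predictive model on a small sample, might look roughly like the following. This is a sketch under stated assumptions rather than the procedure of the cited work: the data are synthetic, PCA and ridge regression are illustrative choices, the 768 dimensions mirror the hidden size of BERT-base, and the 16 components and penalty of 1.0 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n_participants, embed_dim, n_components = 50, 768, 16

# Synthetic stand-ins: one text embedding and one rating-scale score per participant.
embeddings = rng.normal(size=(n_participants, embed_dim))
scores = embeddings[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n_participants)

train, test = np.arange(40), np.arange(40, 50)

# Dimensionality reduction (PCA via SVD), fitted on the training participants only.
mean = embeddings[train].mean(axis=0)
_, _, components = np.linalg.svd(embeddings[train] - mean, full_matrices=False)

def reduce(X):
    """Project embeddings onto the leading principal components."""
    return (X - mean) @ components[:n_components].T

X_train, X_test = reduce(embeddings[train]), reduce(embeddings[test])

# Ridge regression in closed form on the reduced features.
penalty = 1.0  # arbitrary regularization strength
y_mean = scores[train].mean()
weights = np.linalg.solve(
    X_train.T @ X_train + penalty * np.eye(n_components),
    X_train.T @ (scores[train] - y_mean),
)

predicted = X_test @ weights + y_mean
r = np.corrcoef(predicted, scores[test])[0, 1]
print(f"held-out Pearson r on synthetic data: {r:.2f}")
```

The point of the reduction step is that 16 predictors can be estimated from 40 training participants, whereas 768 cannot; the resulting correlation on random data carries no substantive meaning.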

Further, large language models can handle multiple languages; for example, multilingual BERT (mBERT) was trained on the top 104 languages on Wikipedia, from English and Mandarin Chinese to Aragonese and Tagalog. For mental health researchers and practitioners, this not only opens up the possibility to use the techniques in many different languages but also the potential for new types of cross-cultural research; for example, mBERT has been employed to study misinformation about COVID-19 and health on social media in English, Arabic, and Bulgarian (Panda and Levitan, 2021).

Pre-trained language models can also be “fine-tuned” by continuing to train them on domain-specific language. For example, clinicalBERT (Alsentzer et al., 2019) is based on BERT and bioBERT (trained on biomedical text; Lee et al., 2020) with additional fine-tuning on clinical health text data such as notes from clinicians and discharge summaries. As a result, clinicalBERT provides word embeddings that perform more accurately in several tasks related to mental health.

4. Psychological insights through contextualized language

Contextualized language is the most common way of expressing and understanding complex psychological phenomena. Language plays a central role in processing and structuring emotions and thoughts introspectively and in communicating them to others. Language helps us sort memories to remember the past and plan the future. We cooperate, learn, and teach through language. Language can also be destructive and violent: A tool in arguments, manipulation, and deceit. These are all central parts of human life –so ignoring the context and structure of language data misses vital information and reduces ecological validity.

4.1. Ecological contextualized language

Large language models have been instrumental in recent AI models predicting clinically relevant outcomes from individuals’ naturally occurring text. This is demonstrated in the recent shared task of predicting suicide risk (N = 621) reported in the Proceedings of Computational Linguistics and Clinical Psychology (CLPsych). The computer science research community has a tradition of arranging shared tasks, where researchers work on common tasks with the same datasets to develop competing methods to identify mental health disorders and related issues; the winning model(s) of a shared task is typically considered state-of-the-art. Prior to large language models, deep learning techniques were not able to benefit language-based predictions of mental health tasks (Lynn et al., 2018). In contrast, the two top-performing models in the CLPsych shared task in 2019 included contextual word embeddings (Matero et al., 2019; Mohammadi et al., 2019; Zirikly et al., 2019). The shared task included 15 research teams and focused on the extremely difficult task of predicting individual suicide risk from (de-identified) Reddit data, where large language model-based techniques were able to reduce error by 12.7–56.6 % over strong baselines.

The modeling of emotional processes in psychotherapies has also been improved by using large language models to assess the dynamics of valence in transcripts (human ratings from a database of 97,497 utterances). The large language model BERT was trained to predict the mean valence of transcripts rated by experts. The model’s inter-rater reliability with the rated mean (kappa = 0.48) surpasses previous state-of-the-art sentiment models (kappa = 0.31), LIWC (kappa = 0.25), and even the average human performance (kappa = 0.42; Tanana et al., 2021). This is further evidence of a shift in which the latest AI techniques (i.e., deep learning) have started to yield gains in NLP for psychology.
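The kappa values above are chance-corrected agreement statistics. As a minimal sketch of how such a value is computed, the following assumes an unweighted Cohen's kappa over categorical valence labels; the cited study may have used a different variant, and the labels below are hypothetical, not data from that study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items where both sources give the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each source's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both sources use a single label")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical utterance-level valence labels (illustration only).
human = ["neg", "neg", "neu", "pos", "pos", "neu", "neg", "pos"]
model = ["neg", "neu", "neu", "pos", "pos", "neu", "neg", "neg"]
kappa = cohens_kappa(human, model)
```

Because kappa subtracts the agreement expected by chance, it is stricter than raw percent agreement, which is why values in the 0.4 to 0.5 range can still reflect human-level performance on a subjective rating task.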

4.2. Probed contextualized language

It might not come as a surprise, but none of the participants in our study communicated their response to the question: How are you feeling? with a numeric response (e.g., I’m a 7 on a scale from 1 to 10). Neither did anyone only use descriptive words meant to be interpreted without any context (e.g., a list of words: happy, excited, balanced). All respondents used natural language to describe their state of mind, and in fact, only 13 of the 100 participants used at least one of the words from the items of the PANAS rating scale commonly used to measure feelings. The main focus of self-report assessments is to measure/quantify the degree of a psychological construct. The typical rating scales of self-report measures comprising frequency labels (e.g., 0=Not at all to 4=Nearly every day; Kroenke and Spitzer, 2002) or agreement labels (e.g., Strongly disagree to Strongly agree) can be replaced by probes for language. Kjell et al. (2019a) found that probed language-based assessments of well-being and mental health can produce scores correlating with closed-ended rating scales upwards of Pearson r = 0.72 (N = 477).

Using large language models, language-based assessments have been shown to approach the theoretical upper limits in convergence with standard psychological assessments of well-being –the measures’ own reliability. Kjell et al. (2022) achieved a Pearson r = 0.85 (N = 608) for language-based assessment with corresponding rating scale scores for the harmony in life scale. This correlation is stronger than the scale’s own average inter-item correlation (r = 0.76) and test-retest reliability (r = 0.71–0.77), and it is in line with the average item-total correlation (r = 0.84). Using large language models particularly improved the prediction from text responses rather than descriptive word responses: AI language analyses are now at the point where a quantitative score can be derived from natural language responses without sacrificing accuracy as measured by rating scales. The large language models behind this advance can become a widespread alternative in digital mental health, an avenue for modernizing the self-report of mental health assessments and ultimately improving our understanding of psychiatric conditions and human experiences more broadly.
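Convergence figures of this kind are typically obtained by predicting rating scale scores from text embeddings and correlating the out-of-sample predictions with the observed scores. The sketch below illustrates that evaluation logic under stated assumptions: ridge regression and 10-fold cross-validation are choices made here for illustration (the text above does not specify the model), and the embeddings and scores are synthetic stand-ins rather than real language data.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    weights = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return weights, y_mean - x_mean @ weights


def cross_validated_predictions(X, y, alpha=1.0, n_folds=10, seed=0):
    """Predict each person's score from a model trained on the other folds."""
    order = np.random.default_rng(seed).permutation(len(y))
    predictions = np.empty(len(y))
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        weights, intercept = ridge_fit(X[train_idx], y[train_idx], alpha)
        predictions[test_idx] = X[test_idx] @ weights + intercept
    return predictions


# Synthetic stand-ins: 200 "respondents" with 16-dimensional "embeddings".
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16))
scale_scores = embeddings @ rng.normal(size=16) + rng.normal(scale=0.5, size=200)

predicted = cross_validated_predictions(embeddings, scale_scores)
# Convergence: Pearson r between held-out predictions and observed scale scores.
convergence_r = np.corrcoef(predicted, scale_scores)[0, 1]
```

Correlating only held-out predictions matters: a model evaluated on the same responses it was trained on would overstate convergence, especially when embeddings have many dimensions relative to the sample size.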

Contextualized word embeddings have also achieved higher accuracy than trained clinicians and non-contextualized embeddings in classifying individuals diagnosed with schizophrenia (n = 30) and healthy individuals (n = 30; Sarzynska-Wawer et al., 2021). All three methods assessed transcribed interview answers in Polish to six questions about the participants’ lives and thoughts. The contextualized word embeddings (based on ELMo) achieved an accuracy of 80 % in distinguishing patients from healthy individuals, whereas clinicians assessing the same text only obtained 74 % accuracy. The model based on non-contextualized embeddings only achieved an accuracy of 70 %, which was significantly lower than the contextualized embeddings in a post hoc pairwise comparison (p = .03).

4.3. Beyond rating scales

We have discussed how natural language possesses high ecological validity (is the natural way of communicating complex psychological constructs), is information-rich (provides more information than rating scales), and comprises many favorable measurement characteristics (high range, resolution, dimensionality, and openness). Next, we describe three additional aspects central to how language-based assessments have the potential to move beyond rating scales, including i) being validated beyond convergence to rating scales, ii) providing descriptions to contextualize scores, and iii) better understanding response contexts (broadened to encompass who says what to whom, where, and how).

4.3.1. Beyond rating scales: true scores

Language responses contain valuable information that large language models can extract into scores converging with validated rating scales (Kjell et al., 2022). However, rating scales themselves are only an observable “proxy” and not a perfect true score (Fig. 1). Most psychometric theories, such as classical test theory (e.g., Novick, 1966) or item-response theory (Reise and Waller, 2009), view self-report responses as an approximation of the true latent variable that is sought. Therefore, the validity of language-based assessments and rating scales should go beyond evaluating their convergence.
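The classical test theory view can be written out to show why a measure's reliability caps its convergence with another measure. The following is the standard textbook formulation, not a derivation taken from the cited works; the symbols ($T$ for true score, $E$ for error) are notation introduced here.

```latex
\begin{align*}
% An observed score is a true score plus random error.
X &= T_X + E_X, \qquad Y = T_Y + E_Y \\
% Reliability: the share of observed-score variance that is true-score variance.
\rho_{XX'} &= \frac{\sigma^2_{T_X}}{\sigma^2_{X}}, \qquad
\rho_{YY'} = \frac{\sigma^2_{T_Y}}{\sigma^2_{Y}} \\
% With errors uncorrelated with true scores and with each other:
\rho_{XY} &= \rho_{T_X T_Y}\,\sqrt{\rho_{XX'}\,\rho_{YY'}}
\quad\Longrightarrow\quad
|\rho_{XY}| \le \sqrt{\rho_{XX'}\,\rho_{YY'}}
\end{align*}
```

As a hypothetical illustration, if both measures had a reliability of 0.80, their expected observed correlation could not exceed 0.80 even if the underlying true scores were perfectly correlated. Once convergence approaches this ceiling, further validation has to come from criteria other than the rating scale itself.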

So far, only a few studies have examined this. One such study compared the two methods’ ability to accurately categorize external stimuli: pictures depicting facial expressions including sad, happy, and contemptuous (Kjell et al., 2019a). It found that language-based assessment (based on the bag-of-words approach) categorized facial expressions more accurately than rating scales. Further, a study focusing on cooperation, a behavior theoretically relevant to harmony in life, showed that language-based assessments significantly correlated (Pearson’s r = 0.18, N = 181; and r = 0.35 in individuals categorized as prosocials) with cooperative behaviors, whereas the corresponding harmony in life rating scale (Kjell et al., 2016) did not (Kjell et al., 2021b).

The intrinsic advantages of natural language and its favorable measurement characteristics are suitable for data-driven insights – however, the assessments need to be accurately grounded (i.e., models need to be trained on accurate assessments). Accurate assessments of psychiatric symptoms and diagnoses are important but often difficult to achieve. A method to improve assessments when there is no single, error-free measure is to involve experts who assess multiple types of (longitudinal) data to attain increased assessment accuracy (i.e., a best-estimate assessment; Eijsbroek et al., 2023; Leckman et al., 1982; Spitzer, 1983). The potential of this method for training language-based models on best-estimate assessments led us to develop a reporting guideline with the aim of helping researchers plan, report, and evaluate such studies (Eijsbroek et al., 2023).

Natural language responses can also be used directly for precision in healthcare: to predict the likelihood of intervention success for a person in time without first predicting a psychiatric condition (e.g., DeRubeis et al., 2014). They can also be used for data-driven insights related to biological markers such as cortisol or relevant behaviors such as sleep patterns.

4.3.2. Beyond rating scales: descriptions more than a score

Language-based assessments also have the ability to be self-descriptive, moving beyond mere scores as the output. For example, statistically significant descriptive words and key phrases can be visualized based on their underlying meaning along relevant dimensions such as low versus high scores of personality traits (Schwartz et al., 2013), depression (Kjell et al., 2021a), or in relation to behaviors such as cooperation (Kjell et al., 2021b). Fig. 4A shows AI-generated summaries of the ten most negative and positive answers to How are you feeling? from our study (see footnote 7). Fig. 4B demonstrates large language models’ power to understand contextualized language and produce psychologically nuanced content.

Fig. 4. Beyond Rating Scales.

A. Large language model summary of language associated with positive and negative affect scores. This is a demonstration of how large language models can be used for differential language analysis. We see that the summaries include both broad statements (“feeling very relaxed”) as well as specific examples of life events (an ill cat passing away). To produce these, the affective valence of each response to How are you feeling? was estimated using an AI valence estimator, and then the groups of responses classified as positive and negative were fed separately to the large language model t5-large (see footnote 7) to be summarized. B. Depiction of output from an early chat system versus a modern large language model-based chat system (ChatGPT3.5, from January 2023). The left column shows an interaction between Sheldon and the computer software ELIZA from the TV series Young Sheldon (see footnote 8). The right column shows responses to the same first two questions by ChatGPT3.5. ELIZA, designed by Weizenbaum (1966), is a real computer program that, despite not processing language beyond looking for simple patterns, is quite effective at evoking responses. This demonstrates the ability of large language models to interact with psychologically relevant output.

4.3.3. Beyond rating scales: the expansion of contexts

The meaning (of words) depends on contexts: We have provided examples of how the surrounding words in a text define the meaning of words. Nevertheless, language is contextual beyond itself: “Language arises in the life of the individual through an ongoing exchange of meanings with significant others” (Halliday, 1978). Whereas rating scales offer a restricted response format, consisting of closed-ended options that limit the range of expression, natural language can provide broader contextual information. These contexts can, for example, be broadened to encompass who says what to whom, where, and how.

The who may involve psychological contexts and demographic variables. Extraverts, for example, use language differently from introverts (Schwartz et al., 2013). The individual level has been modeled by fine-tuning large language models to keep track of what a specific individual has written previously, which results in word embeddings that more accurately predict human-level variables (Soni et al., 2022).

The whom may involve social contexts such as personal, professional, or healthcare settings that prompt individuals to describe and express themselves differently. Note also that response interface contexts, such as a questionnaire, virtual (e.g., a chatbot), or physical (e.g., a robot) interfaces may influence responses. The where may involve situational contexts such as how being in a waiting room, on a bus, or at home may spark different answers. Research comparing individuals’ willingness to disclose in health-screening interviews found that individuals led to believe a Virtual Human was computer-controlled, as compared to human-controlled by an operator, showed a higher willingness to disclose (Gratch et al., 2014). The participants in the computer-controlled context also displayed more intense feelings of sadness and lower impression management.

The how may consider the language-eliciting context, such as passive or prompted language, and the medium including written or spoken language. In addition, physical contexts, including facial expressions and body language, may significantly interact with the meaning of words.

Many of these contexts are potentially important to consider when collecting and analyzing data, and when judging how well specific models generalize to other contexts. More research is needed on best practices and the optimal way of taking advantage of contextual factors in assessing psychiatric conditions.

5. Targeted validation for deployment scenarios

While considerable evidence now exists that the technology is capable of strong validity (convergent, discriminant, external criteria; Kjell et al., 2021b, 2019, 2022; Oltmanns et al., 2021; Son et al., 2021), there are currently no “one-size-fits-all” models that have been validated across multiple populations or conditions, and only a few models have been tested for particular clinical deployment situations (Eichstaedt et al., 2018; Kelly et al., 2022; Son et al., 2021). Deployments of language model software for specific clinical use cases require supporting evidence for validity and reliability (as is needed for rating scales too). Hence, it is important to note that this narrative review provides support for large language models for assessment as a class of techniques or algorithms. It does not advocate that all instances of such models should be trusted for assessment.

As a class of techniques, large language model assessments have demonstrated validity and reliability on par with or better than rating scales, but any particular instance of such an assessment should go through a thorough evaluation for validity (including bias) and reliability before use in research or clinical practice, just as any rating scale should. This involves validating models for target populations and use-contexts (for example, see studies analyzing language from clinical settings such as therapy sessions (Tanana et al., 2021), general online surveys (Kjell et al., 2022), and social media such as Facebook (Lynn et al., 2020), Reddit (Zirikly et al., 2019), and Twitter (Coppersmith et al., 2015)) as well as use-cases (such as screening (Sawhney et al., 2020) or diagnostics (Eichstaedt et al., 2018)). Assessment models also need to be validated beyond cross-sectional contexts; studies to date have analyzed language use over time for assessment (Eichstaedt et al., 2018), captured dynamic changes (Schwartz et al., 2014; Tsakalidis et al., 2022), and predicted future symptom trajectories (Son et al., 2021).

It is also important to control for demographic variables to understand the validity of language-based assessments beyond variables such as age, gender, and socio-economic status. It has, for example, been demonstrated that Facebook language (AUC = 0.69) predicts depression in medical records more accurately than demographic characteristics (age, sex, and race, AUC = 0.57, Eichstaedt et al., 2018). Further, studies have demonstrated that language-based assessments provide improved accuracy when controlling for occupation, age, and gender (Son et al., 2021), seven socio-economic variables such as household income and education (Matero et al., 2023), and gender and social class (Lynn et al., 2018).
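The AUC values above can be read as the probability that a randomly chosen person with the condition receives a higher model score than a randomly chosen person without it. The sketch below computes AUC directly from that definition and compares two predictors on the same cases. All records and score names are hypothetical and chosen only for illustration; this shows a side-by-side comparison, not a statistical adjustment for demographic covariates.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical records: 1 = condition recorded, 0 = not recorded.
diagnosed = [1, 1, 1, 0, 0, 0, 0, 1]
# Hypothetical risk scores from a language-based model and a demographics-only model.
language_score = [0.9, 0.7, 0.4, 0.3, 0.5, 0.2, 0.1, 0.6]
demographic_score = [0.6, 0.4, 0.5, 0.5, 0.6, 0.3, 0.4, 0.5]

auc_language = roc_auc(diagnosed, language_score)
auc_demographic = roc_auc(diagnosed, demographic_score)
```

An AUC of 0.5 corresponds to chance-level ranking, so the gap between a demographics-only baseline and a language-based model indicates how much the language adds beyond who the person is.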

6. Biases, risks, and ethical considerations

The weaknesses and strengths of large language models come with ethical considerations and responsibilities. Since this research is often interdisciplinary and involves complex methods, it is essential to develop rigorous frameworks guiding scientific evaluations of such studies (Chandler et al., 2020). Chandler et al. (2020) provide a framework for addressing common issues in integrating and evaluating the potential use of AI in psychiatry; the framework emphasizes i) the importance of evaluating the explainability of models, ii) assessing the transparency of the method, and iii) ensuring that the AI model generalizes. These issues apply to language-based assessments; it is, for example, important to explain how the model weights different parts of a text, to be transparent about model architecture, training data, and performance, and to ensure generalizability by testing models on large representative datasets.

The section on using large language models for psychological assessments references results based on open large language models, meaning that a model comes with detailed documentation, can be downloaded, shared with others, and run offline (e.g., BERT, RoBERTa, and BLOOM). Providing sufficient information about the models, including the exact version, where it can be downloaded, the number of layers that were used, and how, will enable others to reproduce the results (which, of course, is facilitated if combined with open code). Openness, however, is not always the case with recent state-of-the-art closed models. ChatGPT, for example, cannot be downloaded and used offline, and it is only partly documented: the exact architecture and training data are not revealed, and independent investigations suggest that the model is often updated without the possibility of accessing previous models (or the exact differences; Chen et al., 2023). The closedness comes with scientific and ethical concerns. Closed models result in issues relating to replicability (i.e., a model version update can even prevent researchers in charge of the original analyses from replicating their results). From an ethical perspective, when a model cannot be downloaded and run offline, it requires users to share their data with the model host. Hence, researchers must be very cautious in sharing sensitive data with companies hosting a closed model.

Leidner and Plachouras (2017) specifically discuss ethical challenges to NLP, where they emphasize that ethical values should be incorporated in both the development and the application of NLP; this, for example, includes considering biases in large-scale language models (Kurita et al., 2019; Shah et al., 2020) and privacy in regard to the open-ended language format and increased predictive power. Thus, it is important to undertake extra privacy-preservation steps, including extra data access restrictions and thorough checks for identity if seeking to share data (see Lison et al., 2021, for a thorough discussion).

There is a growing interest in ethical principles and frameworks for developing and deploying AI (Jobin et al., 2019; Peters et al., 2020), with an active debate about best practices. An extensive review of over 80 international guidelines on ethical AI revealed five global ethical principles: transparency, justice and fairness, non-maleficence, responsibility, and privacy. However, there are currently substantial differences in how they are conceptualized and how they should be applied (Jobin et al., 2019).

There are also legal and regulatory frameworks to consider when developing and implementing AI techniques in clinical settings. The European Commission (2023) typically requires AI solutions for clinical settings to be CE-marked (a certification declaring high safety and health requirements); it has also proposed regulations specifically concerning artificial intelligence –the AI Act– seeking to harmonize rules for the development and use of safe AI (see also Veale and Zuiderveen Borgesius 2021, Hauglid and Mahler 2023). In the United States, the Food and Drug Administration (FDA, US Food and Drug Administration, 2021) provides regulations for the application of AI in clinical settings, and the White House Office of Science and Technology Policy (White House Office of Science and Technology Policy, 2022) released the Blueprint for an AI Bill of Rights. This blueprint proposes five principles to guide the use of AI, including i) safe and effective systems, ii) algorithmic discrimination protections, iii) data privacy, iv) notice and explanation (i.e., informing consumers when and how AI is being used), and v) option to opt-out, with human alternatives, consideration, and fallback. These regulations and guides are quickly being updated, and it is out of the scope of this article to describe these resources at length.

7. Conclusions

This narrative review suggests that the more precise language scores provided by large language models can change how mental health is assessed by enabling patients and study participants to respond in their own words, resulting in large improvements in accuracy as well as an expanded scope of insights. Many studies already collect open-ended responses for qualitative review, and such techniques can be used to complement traditional rating scales while they are being established. Further, several resources support applying these methods in mainstream mental health research (Python library DLATK; Schwartz et al., 2017; R-package text; Kjell et al., 2021). This body of work suggests that AI’s paradigm shift to large language models (Bommasani et al., 2021) can lend itself to a change in psychology from the near-ubiquitous reliance on rating scale responses to a more accurate, fine-grained, and ecologically grounded assessment that fully leverages participants’ own words.

Funding

Oscar Kjell received funding from the Swedish Research Council (2019-06305), Katarina Kjell from FORTE (2022-01022) and Andrew Schwartz from DARPA Young Faculty Award (W911NF-20-1-0206), and the NSF/NIH Smart and Connected Health (R01 MH125702-01).

Footnotes

CRediT authorship contribution statement

Oscar N.E. Kjell: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. Katarina Kjell: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Writing – original draft, Writing – review & editing. H. Andrew Schwartz: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing.

1

Still, Latent Semantic Analysis (Deerwester et al., 1990), a bag-of-words approach representing words numerically, was developed by psychology researchers in the 1990s.

2

Participants were recruited online from Prolific, an online platform for recruiting participants that are being paid to take part in the research. Age: M = 41.3, SD = 14.6, range = 18 – 88 years; Gender: females = 58, males = 42; Nationality = U.K.

3

The Self-Information Diversity Index is a measure of diversity or entropy in a set of numeric values or probabilistic events (i.e., possible responses on a scale or by words). It is $2^{H}$, where $H$ is the Shannon entropy, mathematically defined as $H = -\sum_{x} p(x) \log_2 p(x)$, where $p$ is the probability mass function and $x$ ranges over the set of numeric values or probabilistic events.
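The index described in this footnote can be sketched in a few lines. The function name is hypothetical, probabilities are estimated from observed response frequencies, and the logarithm is taken base 2 so that it matches the base of the exponent; these are assumptions made for the illustration rather than details confirmed by the source.

```python
import math
from collections import Counter


def self_information_diversity(responses):
    """Return 2 ** H, where H is the Shannon entropy (in bits) of the responses."""
    if not responses:
        raise ValueError("at least one response is required")
    counts = Counter(responses)
    total = sum(counts.values())
    # Shannon entropy from the empirical probability of each distinct response.
    entropy_bits = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2 ** entropy_bits


# Hypothetical responses: four equally used scale points versus a single repeated word.
uniform_scale = self_information_diversity([1, 2, 3, 4])
single_word = self_information_diversity(["happy", "happy", "happy"])
```

Read this way, the index is the effective number of equally likely responses: four evenly used options give a value of 4, while a sample in which everyone gives the same answer gives a value of 1.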

4

Exactly how to partition words into discrete senses is hotly debated in linguistics, but fundamentally, the context of a word is essential to its meaning. For more on the debate, see Navigli (2009). Modern AI has cleverly side-stepped the sense-partitioning debate by representing meaning in a latent semantic space – similar to factors derived from factor analysis, as we discuss in the next section.

5

A recurrent neural network (RNN) might be able to get the same performance if it could parallelize its training routine in the same way transformers do.

6

In these tasks, the test answers are not available to the public, so researchers submit their models’ answers for testing, and the results are then presented publicly on leaderboards (https://gluebenchmark.com/leaderboard).

7

The NLP analyses were done in R using the text-package. First, individuals’ text responses to how they were feeling were transformed into word embeddings using the large language model roberta-large. Then, the affective valence of each response was estimated using an AI valence estimator trained on an open affect dataset (Preoţiuc-Pietro et al., 2016), with our estimator having a cross-validated accuracy with raters of Pearson r = 0.74. Lastly, the responses were divided into those that were positive and those that were negative and then summarized using the large language model t5-large. See the open code for details.

8

From season 1, episode 12.

Declaration of competing interest

Oscar Kjell and Katarina Kjell have co-founded and hold shares in a start-up using computational language assessments to diagnose mental health problems.

References

  1. Alsentzer E, Murphy JR, Boag W, Weng WH, Jin D, Naumann T, & McDermott M (2019). Publicly available clinical BERT embeddings. arXiv Preprint arXiv:1904.03323
  2. Argamon S, Koppel M, Pennebaker JW, & Schler J (2007). Mining the blogosphere: age, gender and the varieties of self-expression. First Monday.
  3. Bajaj P, Xiong C, Ke G, Liu X, He D, Tiwary S, Liu TY, Bennett P, Song X, & Gao J (2022). METRO: efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals (arXiv:2204.06644). arXiv. 10.48550/arXiv.2204.06644.
  4. Bengio Y, Ducharme R, Vincent P, Jauvin C, 2003. A neural probabilistic language model. J. Mach. Learn Res 3 (Feb), 1137–1155.
  5. Berger J, Packard G, 2021. Using natural language processing to understand people and culture. Am. Psychol 77 (4), 525.
  6. Blei DM, Ng AY, Jordan MI, 2003. Latent dirichlet allocation. J. Mach. Learn Res 3 (Jan), 993–1022.
  7. Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, Bernstein MS, Bohg J, Bosselut A, & Brunskill E (2021). On the opportunities and risks of foundation models. arXiv Preprint arXiv:2108.07258
  8. Boucher EM, Harake NR, Ward HE, Stoeckl SE, Vargas J, Minkel J, Parks AC, Zilca R, 2021. Artificially intelligent chatbots in digital mental health interventions: a review. Expert Rev. Med. Devices 18 (sup1), 37–49.
  9. Boyd RL, Schwartz HA, 2021. Natural language analysis and the psychology of verbal behavior: the past, present, and future states of the field. J. Lang. Soc. Psychol 40 (1), 21–41.
  10. Brown PF, Della Pietra VJ, Desouza PV, Lai JC, Mercer RL, 1992. Class-based n-gram models of natural language. Comput. Linguist 18 (4), 467–480.
  11. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, & Askell A (2020). Language models are few-shot learners. arXiv Preprint arXiv:2005.14165
  12. Chandler C, Foltz PW, Cohen AS, Holmlund TB, Cheng J, Bernstein JC, Rosenfeld EP, Elvevåg B, 2020a. Machine learning for ambulatory applications of neuropsychological testing. Intell. Based Med 1, 100006.
  13. Chandler C, Foltz PW, Elvevåg B, 2020b. Using machine learning in psychiatry: the need to establish a framework that nurtures trustworthiness. Schizophr. Bull 46 (1), 11–14.
  14. Chen L, Zaharia M, & Zou J (2023). How is ChatGPT’s behavior changing over time? (arXiv:2307.09009). arXiv. http://arxiv.org/abs/2307.09009.
  15. Collobert R, Weston J, 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning, pp. 160–167.
  16. Coppersmith G, Dredze M, Harman C, Hollingshead K, Mitchell M, 2015. CLPsych 2015 shared task: depression and PTSD on Twitter. In: Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pp. 31–39.
  17. Curtis B, Giorgi S, Buffone AEK, Ungar LH, Ashford RD, Hemmons J, Summers D, Hamilton C, Schwartz HA, 2018. Can Twitter be used to predict county excessive alcohol consumption rates? PLoS ONE 13 (4), e0194290. 10.1371/journal.pone.0194290.
  18. D’Alfonso S., 2020. AI in mental health. Curr. Opin. Psychol 36, 112–117.
  19. De Bruyne L, Atanasova P, Augenstein I, 2022. Joint emotion label space modeling for affect lexica. Comput. Speech Lang 71, 101257.
  20. Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R, 1990. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci 41 (6), 391.
  21. Delgadillo J, Lutz W, 2020. A development pathway towards precision mental health care. JAMA Psychiatry 77 (9), 889–890.
  22. DeRubeis RJ, Cohen ZD, Forand NR, Fournier JC, Gelfand LA, Lorenzo-Luaces L, 2014. The personalized advantage index: translating research on prediction into individualized treatment recommendations. A demonstration. PLoS ONE 9 (1), e83875.
  23. Devlin J, Chang MW, Lee K, Toutanova K, 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. 10.18653/v1/N19-1423.
  24. Eichstaedt JC, Kern ML, Yaden DB, Schwartz HA, Giorgi S, Park G, Hagan CA, Tobolsky V, Smith LK, Buffone A, 2020. Closed-and open-vocabulary approaches to text analysis: a review, quantitative comparison, and recommendations. Psychol. Methods 26 (4), 398.
  25. Eichstaedt JC, Smith RJ, Merchant RM, Ungar LH, Crutchley P, Preoţiuc-Pietro D, Asch DA, Schwartz HA, 2018. Facebook language predicts depression in medical records. Proc. Natl. Acad. Sci 115 (44), 11203–11208.
  26. Eijsbroek V, Kjell K, Schwartz HA, Boehnke J, Fried EI, Klein DN, Gustafsson P, Augenstein I, Bossuyt PM, & Kjell O (2023). The LEADING Statement Reporting Guidelines for Expert Panel, Best Estimate Diagnosis, and Longitudinal Expert All Data (LEAD) Studies.
  27. European Commission (2023). CE marking. CE Marking. https://single-market-economy.ec.europa.eu/single-market/ce-marking_en.
  28. Ganesan AV, Matero M, Ravula AR, Vu H, Schwartz HA, 2021. Empirical evaluation of pre-trained transformers for human-level nlp: the role of sample size and dimensionality. In: Proceedings of the Conference. Association for Computational Linguistics. North American Chapter. Meeting, 2021, p. 4515.
  29. Gao P, Geng S, Zhang R, Ma T, Fang R, Zhang Y, Li H, Qiao Y, 2023. Clip-adapter: better vision-language models with feature adapters. Int. J. Comput. Vis 1–15.
  30. Gratch J, Lucas GM, King AA, Morency LP, 2014. It’s only a computer: the impact of human-agent interaction in clinical interviews. In: Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems, pp. 85–92.
  31. Halliday MAK, 1978. Language As Social semiotic: The Social Interpretation of Language and Meaning, 42. Edward Arnold, London.
  32. Hauglid MK, Mahler T, 2023. Doctor Chatbot: the EU’s regulatory prescription for generative medical AI. Oslo Law Rev. 10 (1), 1–23. 10.18261/olr.10.1.1.
  33. He P, Liu X, Gao J, & Chen W (2021). DeBERTa: decoding-enhanced BERT with Disentangled Attention (arXiv:2006.03654). arXiv. 10.48550/arXiv.2006.03654.
  34. Iliev R, Dehghani M, Sagi E, 2015. Automated text analysis in psychology: methods, applications, and future developments. Lang. Cogn 7 (2), 265–290.
  35. Jackson JC, Watts J, List JM, Puryear C, Drabble R, Lindquist KA, 2021. From text to thought: how analyzing language can advance psychological Science. Perspect. Psychol. Sci 17 (3), 805–826.
  36. Jelinek F, Bahl L, Mercer R, 1975. Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Trans. Inf. Theory 21 (3), 250–256.
  37. Ji S, Zhang T, Ansari L, Fu J, Tiwari P, & Cambria E (2021). Mentalbert: publicly available pretrained language models for mental healthcare. arXiv Preprint arXiv:2110.15621
  38. Jobin A, Ienca M, Vayena E, 2019. The global landscape of AI ethics guidelines. Nat. Mach. Intell 1 (9), 389–399.
  39. Jurafsky D, & Martin JH (2020). Speech and Language Processing: an Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf.
  40. Kelly D, Coppersmith G, Dickerson J, Espy-Wilson C, Michel H, Resnik P, 2022. Computationally scalable and clinically sound: laying the groundwork to use machine learning techniques for social media and language data in predicting psychiatric symptoms. Biol. Psychiatry 91 (9), S50.
  41. Kjell K, Johnsson P, Sikström S, 2021a. Freely generated word responses analyzed with artificial intelligence predict self-reported symptoms of depression, anxiety, and worry. Front. Psychol 12, 602581 10.3389/fpsyg.2021.602581.
  42. Kjell O, Daukantaitė D, Hefferon K, Sikström S, 2016. The harmony in life scale complements the satisfaction with life scale: expanding the conceptualization of the cognitive component of subjective well-being. Soc. Indic. Res 126 (2), 893–919. 10.1007/s11205-015-0903-z.
  43. Kjell O, Daukantaitė D, Sikström S, 2021b. Computational language assessments of harmony in life—not satisfaction with life or rating scales—correlate with cooperative behaviors. Front. Psychol 12, 601679.
  44. Kjell O, Kjell K, Garcia D, Sikström S, 2019a. Semantic measures: using natural language processing to measure, differentiate, and describe psychological constructs. Psychol. Methods 24 (1), 92.
  45. Kjell O, Kjell K, Garcia D, & Sikström S (2019). Semantic measures: using natural language processing to measure, differentiate, and describe psychological constructs. Psychol. Methods, 24(1), 92.
  46. Kjell O, Giorgi S, & Schwartz HA (2023). The Text-Package: An R-Package for Analyzing and Visualizing Human Language Using Natural Language Processing and Transformers. Psychological Methods. Advance online publication. 10.1037/met0000542.
  47. Kjell O, Sikström S, Kjell K, Schwartz HA, 2022. Natural language analyzed with AI-based transformers predict traditional subjective well-being measures approaching the theoretical upper limits in accuracy. Sci. Rep 12 (1) 10.1038/s41598-022-07520-w. Article 1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Kroenke K, Spitzer RL, 2002. The PHQ-9: a new depression diagnostic and severity measure. Psychiatr. Ann 32 (9), 1–7. [Google Scholar]
  49. Kurita K, Vyas N, Pareek A, Black AW, Tsvetkov Y, 2019. Measuring bias in contextualized word representations. In: Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pp. 166–172. [Google Scholar]
  50. Kwantes PJ, Derbentseva N, Lam Q, Vartanian O, Marmurek HHC, 2016. Assessing the Big Five personality traits with latent semantic analysis. Personal. Individ. Differ 102, 229–233. 10.1016/j.paid.2016.07.010. [DOI] [Google Scholar]
  51. Landauer TK, 1999. Latent semantic analysis: a theory of the psychology of language and mind. Discourse Process 27 (3), 303–310. [Google Scholar]
  52. Leckman JF, Sholomskas D, Thompson D, Belanger A, Weissman MM, 1982. Best estimate of lifetime psychiatric diagnosis: a methodological study. Arch. Gen. Psychiatry 39 (8), 879–883. [DOI] [PubMed] [Google Scholar]
  53. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J, 2020. BioBERT: a pretrained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), 1234–1240. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Leidner JL, Plachouras V, 2017. Ethical by design: ethics best practices for natural language processing. In: Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pp. 30–40. [Google Scholar]
  55. Li J, Zhang S, Zhang Y, Lin H, Wang J, 2021. Multifeature fusion attention network for suicide risk assessment based on social media: algorithm development and validation. JMIR Med. Inform 9 (7), e28227. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Likert R, 1932. A technique for the measurement of attitudes. Arch. Psychol 22 (140), 55. [Google Scholar]
  57. Lison P, Pilán I, Sanchez D, Batet M, Øvrelid L, 2021. Anonymisation Models for Text Data: state of the art, Challenges and Future Directions. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4188–4203. 10.18653/v1/2021.acl-long.323. [DOI] [Google Scholar]
  58. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V, 2019. RoBERTa: a robustly optimized BERT pretraining approach. arXiv Preprint arXiv:1907.11692 [Google Scholar]
  59. Lord FM, 2012. Applications of Item Response Theory to Practical Testing Problems. Routledge. [Google Scholar]
  60. Lynn V, Balasubramanian N, Schwartz HA, 2020. Hierarchical modeling for user personality prediction: the role of message-level attention. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5306–5316. [Google Scholar]
  61. Lynn V, Goodman A, Niederhoffer K, Loveys K, Resnik P, Schwartz HA, 2018. CLPsych 2018 shared task: predicting current and future psychological health from childhood essays. In: Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pp. 37–46. [Google Scholar]
  62. Macavaney S, Mittu A, Coppersmith G, Leintz J, Resnik P, 2021. Community-level research on suicidality prediction in a secure environment: overview of the CLPsych 2021 shared task. In: Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access, pp. 70–80. [Google Scholar]
  63. MacKay DJC, 2023. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, p. 640. [Google Scholar]
  64. Markov AA, 1913. Essai d’une recherche statistique sur le texte du roman “Eugene Onegin” illustrant la liaison des épreuves en chaîne (“Example of a statistical investigation of the text of ‘Eugene Onegin’ illustrating the dependence between samples in chain”). Izvestia Imperatorskoi Akademii Nauk (Bulletin de l’Académie Impériale des Sciences de St.-Pétersbourg), 6th ser. 7, pp. 153–162. [Google Scholar]
  65. Matero M, Giorgi S, Curtis B, Ungar LH, Schwartz HA, 2023. Opioid death projections with AI-based forecasts using social media language. NPJ Digit. Med 6 (1), 35. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Matero M, Hung A, & Schwartz HA (2021). Understanding RoBERTa’s Mood: the Role of Contextual-Embeddings as User-Representations for Depression Prediction. arXiv Preprint arXiv:2112.13795 [Google Scholar]
  67. Matero M, Idnani A, Son Y, Giorgi S, Vu H, Zamani M, Limbachiya P, Guntuku SC, Schwartz HA, 2019. Suicide risk assessment with multi-level dual-context language and BERT. In: Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pp. 39–44. [Google Scholar]
  68. Melcher J, Hays R, Torous J, 2020. Digital phenotyping for mental health of college students: a clinical review. BMJ Ment. Health 23 (4), 161–166. [DOI] [PMC free article] [PubMed] [Google Scholar]
  69. Mikolov T, Sutskever I, Chen K, Corrado GS, & Dean J (2013). Distributed representations of words and phrases and their compositionality. 3111–3119. [Google Scholar]
  70. Miller GA, 1995. WordNet: a lexical database for English. Commun. ACM 38 (11), 39–41. [Google Scholar]
  71. Mohammadi E, Amini H, Kosseim L, 2019. CLaC at CLPsych 2019: fusion of neural features and predicted class probabilities for suicide risk assessment based on online posts. In: Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pp. 34–38. [Google Scholar]
  72. Nangia N, & Bowman SR (2019). Human vs. muppet: a conservative estimate of human performance on the GLUE benchmark. arXiv Preprint arXiv:1905.10425 [Google Scholar]
  73. Navigli R., 2009. Word sense disambiguation: a survey. ACM Comput. Surv. (CSUR) 41 (2), 1–69. [Google Scholar]
  74. Nayak P., 2019. Understanding Searches Better Than Ever Before. October 25. Google. https://blog.google/products/search/search-language-understanding-bert/. [Google Scholar]
  75. Novick MR, 1966. The axioms and principal results of classical test theory. J. Math. Psychol 3 (1), 1–18. [Google Scholar]
  76. Oltmanns JR, Schwartz HA, Ruggero C, Son Y, Miao J, Waszczuk M, Clouston SA, Bromet EJ, Luft BJ, Kotov R, 2021. Artificial intelligence language predictors of two-year trauma-related outcomes. J. Psychiatr. Res 143, 239–245. [DOI] [PMC free article] [PubMed] [Google Scholar]
  77. Osgood CE, 1952. The nature and measurement of meaning. Psychol. Bull 49 (3), 197–237. 10.1037/h0055737. [DOI] [PubMed] [Google Scholar]
  78. Panda S, Levitan SI, 2021. Detecting multilingual COVID-19 misinformation on social media via contextualized embeddings. In: Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda, pp. 125–129. [Google Scholar]
  79. Park G, Schwartz HA, Eichstaedt JC, Kern ML, Kosinski M, Stillwell DJ, Ungar LH, Seligman ME, 2015. Automatic personality assessment through social media language. J. Personal. Soc. Psychol 108 (6), 934. [DOI] [PubMed] [Google Scholar]
  80. Parmar P, Ryu J, Pandya S, Sedoc J, Agarwal S, 2022. Health-focused conversational agents in person-centered care: a review of apps. NPJ Digit. Med 5 (1), 21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  81. Pennebaker JW, Mehl MR, Niederhoffer KG, 2003. Psychological aspects of natural language use: our words, our selves. Annu. Rev. Psychol 54 (1), 547–577. [DOI] [PubMed] [Google Scholar]
  82. Peters D, Vold K, Robinson D, Calvo RA, 2020. Responsible AI–two frameworks for ethical design practice. IEEE Trans. Technol. Soc 1 (1), 34–47. [Google Scholar]
  83. Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L, 2018. Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237. 10.18653/v1/N18-1202. [DOI] [Google Scholar]
  84. Preoţiuc-Pietro D, Schwartz HA, Park G, Eichstaedt J, Kern M, Ungar L, Shulman E, 2016. Modelling valence and arousal in Facebook posts. In: Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 9–15. [Google Scholar]
  85. Radford A, Kim JW, Xu T, Brockman G, McLeavey C, Sutskever I, 2023. Robust speech recognition via large-scale weak supervision. In: Proceedings of the International Conference on Machine Learning, pp. 28492–28518. https://proceedings.mlr.press/v202/radford23a.html. [Google Scholar]
  86. Reise SP, Waller NG, 2009. Item response theory and clinical measurement. Annu. Rev. Clin. Psychol 5 (1), 27–48. [DOI] [PubMed] [Google Scholar]
  87. Resnik P. (1995). Using information content to evaluate semantic similarity in a taxonomy. arXiv Preprint Cmp-Lg/9511007. [Google Scholar]
  88. Sarzynska-Wawer J, Wawer A, Pawlak A, Szymanowska J, Stefaniak I, Jarkiewicz M, Okruszek L, 2021. Detecting formal thought disorder by deep contextualized word representations. Psychiatry Res. 304, 114135. [DOI] [PubMed] [Google Scholar]
  89. Sawhney R, Joshi H, Gandhi S, Shah R, 2020. A time-aware transformer based model for suicide ideation detection on social media. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7685–7697. [Google Scholar]
  90. Schwartz HA, Eichstaedt JC, Kern ML, Dziurzynski L, Ramones SM, Agrawal M, Shah A, Kosinski M, Stillwell D, Seligman ME, 2013. Personality, gender, and age in the language of social media: the open-vocabulary approach. PLoS ONE 8 (9), e73791. [DOI] [PMC free article] [PubMed] [Google Scholar]
  91. Schwartz HA, Eichstaedt J, Kern ML, Park G, Sap M, Stillwell D, Kosinski M, & Ungar L (2014). Towards Assessing Changes in Degree of Depression Through Facebook. 118–125. [Google Scholar]
  92. Schwartz HA, Giorgi S, Sap M, Crutchley P, Ungar L, Eichstaedt J, 2017. DLATK: Differential language analysis toolkit. 55–60. [Google Scholar]
  93. Schwartz HA, Ungar LH, 2015. Data-driven content analysis of social media: a systematic overview of automated methods. Ann. Am. Acad. Pol. Soc. Sci 659 (1), 78–94. [Google Scholar]
  94. Seppälä J, De Vita I, Jämsä T, Miettunen J, Isohanni M, Rubinstein K, Feldman Y, Grasa E, Corripio I, Berdun J, 2019. Mobile phone and wearable sensor-based mHealth approaches for psychiatric disorders and symptoms: systematic review. JMIR Ment. Health 6 (2), e9819. [DOI] [PMC free article] [PubMed] [Google Scholar]
  95. Shah DS, Schwartz HA, Hovy D, 2020. Predictive biases in natural language processing models: a conceptual framework and overview. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5248–5264. 10.18653/v1/2020.acl-main.468. [DOI] [Google Scholar]
  96. Shannon CE, 1948. A mathematical theory of communication. Bell Syst. Tech. J 27 (3), 379–423. [Google Scholar]
  97. Son Y, Clouston SA, Kotov R, Eichstaedt JC, Bromet EJ, Luft BJ, Schwartz HA, 2021. World Trade Center responders in their own words: predicting PTSD symptom trajectories with AI-based language analyses of interviews. Psychol. Med 53 (3), 918–926. [DOI] [PMC free article] [PubMed] [Google Scholar]
  98. Soni N, Matero M, Balasubramanian N, & Schwartz HA (2022). Human Language Modeling. arXiv Preprint arXiv:2205.05128 [Google Scholar]
  99. Spitzer RL, 1983. Psychiatric diagnosis: are clinicians still necessary? Compr. Psychiatry [DOI] [PubMed] [Google Scholar]
  100. Stade E, Stirman SW, Ungar LH, Yaden DB, Schwartz HA, Sedoc J, … & DeRubeis R (2023). Artificial Intelligence Will Change the Future of Psychotherapy: A Proposal for Responsible, Psychologist-led Development. [Google Scholar]
  101. Sun J, Schwartz HA, Son Y, Kern ML, Vazire S, 2020. The language of well-being: tracking fluctuations in emotion experience through everyday speech. J. Personal. Soc. Psychol 118 (2), 364. [DOI] [PubMed] [Google Scholar]
  102. Sun Y, Wang S, Feng S, Ding S, Pang C, Shang J, Liu J, Chen X, Zhao Y, Lu Y, Liu W, Wu Z, Gong W, Liang J, Shang Z, Sun P, Liu W, Ouyang X, Yu D, Wang H (2021). ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation (arXiv:2107.02137). arXiv. 10.48550/arXiv.2107.02137. [DOI] [Google Scholar]
  103. Switzer P. (1964). Vector images in document retrieval. Statistical Association Methods for Mechanized Documentation, 163–171. [Google Scholar]
  104. Tanana MJ, Soma CS, Kuo PB, Bertagnolli NM, Dembe A, Pace BT, Srikumar V, Atkins DC, Imel ZE, 2021. How do you feel? Using natural language processing to automatically rate emotion in psychotherapy. Behav. Res. Methods 1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  105. Tausczik YR, Pennebaker JW, 2010. The psychological meaning of words: LIWC and computerized text analysis methods. J. Lang. Soc. Psychol 29 (1), 24–54. [Google Scholar]
  106. Thomas ML, 2011. The value of item response theory in clinical assessment: a review. Assessment 18 (3), 291–307. [DOI] [PubMed] [Google Scholar]
  107. Torous J, Bucci S, Bell IH, Kessing LV, Faurholt-Jepsen M, Whelan P, Carvalho AF, Keshavan M, Linardon J, Firth J, 2021. The growing field of digital psychiatry: current evidence and the future of apps, social media, chatbots, and virtual reality. World Psychiatry 20 (3), 318–335. [DOI] [PMC free article] [PubMed] [Google Scholar]
  108. Tsakalidis A, Chim J, Bilal IM, Zirikly A, Atzil-Slonim D, Nanni F, Resnik P, Gaur M, Roy K, Inkster B, 2022. Overview of the CLPsych 2022 shared task: capturing moments of change in longitudinal user posts. In: Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology, pp. 184–198. [Google Scholar]
  109. US Food and Drug Administration (FDA), 2021. Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan. US Food and Drug Administration (FDA) [Tech. Rep, 1.]. [Google Scholar]
  110. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I, 2017. Attention is all you need. Adv. Neural Inf. Process. Syst 30, 5998–6008. [Google Scholar]
  111. Veale M, Zuiderveen Borgesius F, 2021. Demystifying the Draft EU artificial intelligence act—analysing the good, the bad, and the unclear elements of the proposed approach. Comput. Law Rev. Int 22 (4), 97–112. [Google Scholar]
  112. Wang A, Pruksachatkun Y, Nangia N, Singh A, Michael J, Hill F, Levy O, Bowman S, 2019. SuperGLUE: a stickier benchmark for general-purpose language understanding systems. Adv. Neural Inf. Process. Syst 32. [Google Scholar]
  113. Wang A, Singh A, Michael J, Hill F, Levy O, & Bowman SR (2018). GLUE: a multi-task benchmark and analysis platform for natural language understanding. arXiv Preprint arXiv:1804.07461 [Google Scholar]
  114. Weizenbaum J, 1966. ELIZA – a computer program for the study of natural language communication between man and machine. Commun. ACM 9 (1), 36–45. [Google Scholar]
  115. White House Office of Science and Technology Policy. (2022). Blueprint For an AI Bill of Rights Making Automated Systems Work for the American People. https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf. [Google Scholar]
  116. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV, 2019. XLNet: generalized autoregressive pretraining for language understanding. Adv. Neural Inf. Process. Syst 5754–5764. [Google Scholar]
  117. Zirikly A, Resnik P, Uzuner O, Hollingshead K, 2019. CLPsych 2019 shared task: predicting the degree of suicide risk in Reddit posts. In: Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pp. 24–33. [Google Scholar]
