Communications Medicine
2025 Sep 29;5:404. doi: 10.1038/s43856-025-01114-z

A multi-stage large language model framework for extracting suicide-related social determinants of health

Song Wang 1, Yishu Wei 2, Haotian Ma 2, Max Lovitt 2, Kelly Deng 2, Yuan Meng 2, Zihan Xu 2, Jingze Zhang 2, Yunyu Xiao 2, Ying Ding 3, Xuhai Xu 4, Joydeep Ghosh 1, Yifan Peng 2
PMCID: PMC12480878  PMID: 41023090

Abstract

Background

Understanding social determinants of health (SDoH) factors contributing to suicide incidents is crucial for early intervention and prevention. However, data-driven approaches to this goal face challenges such as long-tailed factor distributions, analyzing pivotal stressors preceding suicide incidents, and limited model explainability.

Methods

We present a multi-stage large language model framework to enhance SDoH factor extraction from unstructured text. Our approach was compared to other state-of-the-art language models (i.e., pre-trained BioBERT and GPT-3.5-turbo) and reasoning models (i.e., DeepSeek-R1). We also evaluated how the model’s explanations help people annotate SDoH factors more quickly and accurately. The analysis included both automated comparisons and a pilot user study.

Results

We show that our proposed framework demonstrates performance boosts in the overarching task of extracting SDoH factors and in the finer-grained tasks of retrieving relevant context. Additionally, we show that fine-tuning a smaller, task-specific model achieves comparable or better performance with reduced inference costs. The multi-stage design not only enhances extraction but also provides intermediate explanations, improving model explainability.

Conclusions

Our approach improves both the accuracy and transparency of extracting suicide-related SDoH from unstructured texts. These advancements have the potential to support early identification of individuals at risk and inform more effective prevention strategies.

Subject terms: Public health, Health services

Plain language summary

Social determinants of health (SDoH) are the circumstances in which people are born, grow, live, work, and age that can have an impact on their health and well-being. We aimed to improve how SDoH factors that contribute to suicide incidents are identified. We developed a computational large language model framework that can extract suicide-related SDoH factors from unstructured text. Our approach was evaluated against other advanced language models and found to perform better in extracting SDoH factors and explaining its decisions. By making the SDoH factor extraction process more accurate and transparent, this work can help experts more quickly recognize individuals at risk of suicide and support better prevention strategies.


Wang et al. present a multi-stage large language model framework designed to improve the extraction of social determinants of health (SDoH) factors from unstructured text. It results in more accurate, transparent, and efficient annotation as demonstrated through comparative analyses and user studies.

Introduction

Suicide remains a major global public health concern, presenting significant challenges for intervention and prevention. The underlying causes of suicidal thoughts and behaviors stem from a complex interplay of social, economic, political, and physical factors1, all of which fall under the Social Determinants of Health (SDoH) framework2,3. A deep understanding of these factors is essential for developing targeted, evidence-based strategies for suicide prevention and early intervention. While there is growing interest in integrating suicide-related SDoH factors into structured electronic health records (EHRs), much of this information remains incomplete or inaccessible, as it is often embedded within unstructured text narratives4.

In recent years, advancements in natural language processing (NLP) have opened new opportunities for modeling, extracting, and analyzing information from clinical notes, including suicide-related SDoH factors4,5. However, three major challenges remain. First, suicide-related SDoH factors follow a long-tailed distribution, where a small subset of factors appear frequently while most are rare. Conventional NLP models, which excel at extracting common factors, often struggle to capture these rare yet critical ones6–8. Second, understanding SDoH factors occurring in the period preceding a suicide incident is crucial, as this period often represents a critical window in which acute stressors and life events may directly influence an individual’s decision7,9. However, many models struggle to capture temporal context7. Finally, deep learning models are often criticized for their “black-box” nature10–13, making it difficult to interpret their reasoning. This lack of explainability is particularly problematic in suicide studies, where trustworthiness and explainability are essential.

To address these challenges, we developed a multi-stage zero-shot large language model (LLM) framework for extracting suicide-related SDoH factors from unstructured narratives in a scalable and generalizable manner. This study focuses on SDoH factors occurring within the two weeks preceding suicide incidents. By leveraging zero-shot learning and self-explanation in LLMs14, our approach captures nuanced suicide-related SDoH factors from limited labeled data and generalizes effectively to new, unseen cases in complex clinical text15. We evaluated our framework against state-of-the-art baselines (i.e., pre-trained BioBERT and GPT-3.5-turbo) and reasoning models (i.e., DeepSeek-R1), demonstrating its effectiveness. The multi-stage design also provides intermediate explanations of model decisions, enhancing interpretability and aiding in the identification of actionable insights for future interventions. A pilot user study further validated that experts using these explanations annotated suicide-related SDoH factors more quickly and accurately than through manual efforts alone.

Methods

Framework architecture

We present a multi-stage LLM framework designed to extract suicide-related SDoH factors from de-identified death investigation notes. As shown in Fig. 1, the framework consists of two intermediate stages—context retrieval and relevance verification—and one final decision-making stage, which is SDoH factor extraction.

Fig. 1.

Fig. 1

Overview of the proposed multi-stage large language model framework for extracting suicide-related SDoH factors.

Context retrieval

This module retrieves sentences relevant to the target SDoH factor from the pre-processed text. First, the Natural Language Toolkit (NLTK) library is used to split the input text into individual sentences16. This pre-processing step is essential, as it allows subsequent modules to focus on smaller, more manageable text segments. Once the input text is split into sentences, a pre-trained LLM (e.g., GPT-3.5-turbo) is used to identify which sentences contain the SDoH factor of interest.

To achieve this, we crafted a prompt with clear instructions and criteria to evaluate each sentence’s relevance to the target SDoH factor (Box 1). The LLM assigns a relevance label (i.e., Relevant) to sentences that pertain to the factor, and only those deemed relevant are selected for further processing. This step ensures that only the most contextually pertinent sentences are passed on, significantly reducing noise and improving the efficiency of the framework.

In developing an effective prompt, we tested several variations and evaluated their performance on a curated gold standard development set, using F-1 scores (details of the curated gold standard development set can be found in the section “Data preparation and curation”). The prompt that achieved the highest F-1 score was chosen for use in downstream experiments.
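As an illustrative sketch (not the study's exact client code), the retrieval stage can be expressed in a few lines of Python. The regex-based splitter below stands in for NLTK's `sent_tokenize`, and the `llm` callable is a placeholder for the GPT-3.5-turbo API call:

```python
import json
import re

def split_sentences(text):
    # Simple rule-based splitter, used here as a stand-in for NLTK's
    # sent_tokenize from the pre-processing step described above.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_retrieval_prompt(report, factor_definition):
    # Mirrors the Box 1 template: instructions first, then the report and
    # the target factor definition wrapped in delimiter tags.
    return (
        "[INST]You are a helpful assistant. Read the given context below and "
        "find all the sentences that are relevant to 'TARGET_SOCIAL_FACTOR' "
        "based on the provided definition. Format your output (a list of "
        "relevant sentences) in a valid JSON payload with one key 'Relevant'."
        "[/INST]Here is your input:"
        f"[CONTEXT]{report}[/CONTEXT]"
        f"[TARGET_SOCIAL_FACTOR]{factor_definition}[/TARGET_SOCIAL_FACTOR]"
    )

def retrieve_relevant_sentences(report, factor_definition, llm):
    # `llm` is any callable mapping a prompt string to a JSON string;
    # in practice this would wrap the GPT-3.5-turbo API.
    prompt = build_retrieval_prompt(report, factor_definition)
    payload = json.loads(llm(prompt))
    return payload.get("Relevant", [])
```

Because the prompt requests a JSON payload with a single 'Relevant' key, parsing the model output reduces to one `json.loads` call.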

Relevance verification

Despite extensive prompt engineering, LLMs may occasionally select irrelevant sentences during the retrieval stage, yielding retrieved sentences that human judgment would deem irrelevant. Inspired by previous works17–19, we introduce an additional verification step using another LLM to re-assess the relevance of the retrieved context (Box 2). This process involves prompting the LLM with clear instructions to evaluate each sentence’s relevance with respect to the target SDoH factor, effectively acting as a human surrogate. We hypothesize that this approach minimizes the inclusion of incorrect or tangential information, ensuring a high-quality set of relevant sentences for downstream processing.

Despite the impressive performance of LLMs, they often fall short of specialized models in certain tasks. Small models, tailored for a specific task, can be more efficient and faster than LLMs20–22. To evaluate whether comparable relevance verification performance could be achieved with a smaller and more cost-effective model, we fine-tuned a FLAN-T5-base model23 for the binary relevance verification task. Given the definition of an SDoH factor and a sentence, the model is tasked with assessing the sentence’s factual relevance to the factor.

For fine-tuning, we used a learning rate of 3E-4, a batch size of 8, and conducted 3 epochs. The model was evaluated on the curated test set using F-1 scores, and the fine-tuned model with the highest F-1 score on the test set was selected as the examiner model.
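Because FLAN-T5 is a text-to-text model, fine-tuning reduces to supplying (input, target) string pairs. The template wording in the sketch below is a hypothetical stand-in, since the exact fine-tuning template is not published here; only the task format and hyperparameters above come from the study:

```python
def make_verification_example(factor_definition, sentence, label):
    # Build one (input, target) string pair for FLAN-T5-style
    # text-to-text fine-tuning of the binary relevance task.
    # The phrasing is illustrative; the study reports only the task
    # setup plus a 3e-4 learning rate, batch size 8, and 3 epochs,
    # which would be passed to the training loop, not encoded here.
    source = (
        f"Definition: {factor_definition}\n"
        f"Sentence: {sentence}\n"
        "Is the sentence relevant to the factor? Answer True or False."
    )
    target = "True" if label else "False"
    return source, target
```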

SDoH factor extraction

In the final decision-making step, an LLM is used to extract SDoH factors from the verified relevant sentences. At this stage, the focus shifts from identifying relevant context to performing fine-grained information extraction. The verified relevant sentences are passed into the LLM, which identifies and extracts specific SDoH factors occurring within the two weeks preceding the incident (Box 3), based on the pre-defined definitions outlined in the web coding manual of the National Violent Death Reporting System (Table 1)24.

Table 1.

Suicide-related SDoH factor definitions

Infrequent factor Definition
Adverse Childhood Experience Stressful or traumatic events that occurred during childhood and adolescence.
Civil Legal Problem Civil legal (non-criminal) problem(s) appear to have contributed to the death.
Eviction or Loss of Home A recent eviction or other loss of the victim’s housing, or the threat of it, appears to have contributed to the death.
Exposure to Disaster Exposure to a disaster was perceived as a contributing factor in the incident.
Financial Problem Financial problems appear to have contributed to the death.
Other Addiction Problem Person has an addiction other than alcohol or other substance abuse, such as gambling, sexual, etc., that appears to have contributed to the death.
Other Relationship Problem Problems with a friend or associate (other than an intimate partner or family member) appear to have contributed to the death.
Other Substance Abuse Person has a non-alcohol-related substance abuse problem.
Recent Suicide of Friend or Family Suicide of a family member or friend appears to have contributed to the death.
School Problem Problems at or related to school appear to have contributed to the death.
Frequent factor Definition
Alcohol Problem Person has alcohol dependence or alcohol problems.
Criminal Legal Problem Criminal legal problem(s) appear to have contributed to the death.
Family Relationship Problem Victim had relationship problems with a family member (other than an intimate partner) that appear to have contributed to the death.
Intimate Partner Problem Problems with a current or former intimate partner appear to have contributed to the suicide or undetermined death.
Job Problem Job problem(s) appear to have contributed to the death.
Mental Health Problem Current mental health problem.
Physical Health Problem Victim’s physical health problem(s) appear to have contributed to the death.
Suicide Disclosure Victims had a history of disclosure of suicidal thoughts or plan. Victim disclosed to another person their thoughts and/or plans to die by suicide.

Box 1 Prompting for context retrieval.

[INST]You are a helpful assistant. Read the given context below and find all the sentences that are relevant to 'TARGET_SOCIAL_FACTOR' based on the provided definition. Format your output (a list of relevant sentences) in a valid JSON payload with one key 'Relevant'.[/INST]Here is your input:[CONTEXT]INPUT_REPORT[/CONTEXT][TARGET_SOCIAL_FACTOR]FACTOR_DEFINITION[/TARGET_SOCIAL_FACTOR]

Box 2 Prompting for relevance verification.

[INST]You are a helpful assistant. Verify if the given sentence is relevant to 'TARGET_SOCIAL_FACTOR' based on the provided definition. Format your output (True or False) in a valid JSON payload with one key 'Answer'.[/INST]Here is your input:[SENTENCE]TARGET_SENTENCE[/SENTENCE][TARGET_SOCIAL_FACTOR]FACTOR_DEFINITION[/TARGET_SOCIAL_FACTOR]

Box 3 Prompting for SDoH factor extraction.

[INST]Read the descriptions and code 'True' if any description mentioned 'TARGET_SOCIAL_FACTOR' within the two weeks before the suicide incident. Format your output (True or False) in a valid JSON payload with one key 'Happened within two weeks'.[/INST]Here is your input:[CONTEXT]RELEVANT_DESCRIPTIONS[/CONTEXT][TARGET_SOCIAL_FACTOR]FACTOR_DEFINITION[/TARGET_SOCIAL_FACTOR]
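Taken together, the three stages chain into a single pipeline. The sketch below uses schematic prompt strings and a pluggable `llm` callable in place of the full Box 1–3 prompts and the GPT-3.5-turbo API; only the control flow reflects the framework:

```python
import json

def extract_sdoh_factor(report_sentences, factor_definition, llm):
    # Stage 1: context retrieval over the full report (cf. Box 1).
    retrieved = json.loads(
        llm(f"[RETRIEVE]{' '.join(report_sentences)}[FACTOR]{factor_definition}")
    )["Relevant"]
    # Stage 2: per-sentence relevance verification (cf. Box 2).
    verified = [
        s for s in retrieved
        if json.loads(llm(f"[VERIFY]{s}[FACTOR]{factor_definition}"))["Answer"]
    ]
    if not verified:
        # No verified evidence survives, so the factor is coded as absent.
        return False, []
    # Stage 3: final decision on the verified context only (cf. Box 3).
    decision = json.loads(
        llm(f"[EXTRACT]{' '.join(verified)}[FACTOR]{factor_definition}")
    )
    return decision["Happened within two weeks"], verified
```

Returning the verified sentences alongside the label is what makes the intermediate explanation available to annotators.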

Dataset description

The National Violent Death Reporting System (NVDRS) is a population-based active surveillance system that collects data on violent deaths that occur among both residents and non-residents of U.S. states, the District of Columbia, and Puerto Rico9. Each incident record includes two death investigation reports: one from a coroner or medical examiner (CME) and another from a law enforcement (LE) reporter. The NVDRS codes over 600 unique data elements for each incident, including SDoH factors that may have affected the victims. These data elements are manually abstracted from the death investigation notes.

For this study, we use the NVDRS 2020 version, which includes 267,804 suicide death incidents, spanning from 2003 to 2019. We focus on 10 infrequent SDoH factors that occurred within the two weeks preceding the suicide incidents (Table 1): Adverse Childhood Experience, Civil Legal Problem, Eviction or Loss of Home, Exposure to Disaster, Financial Problem, Other Addiction Problem, Other Relationship Problem, Other Substance Abuse, Recent Suicide of Friend or Family, School Problem. Additionally, to assess whether our framework can match or outperform fine-tuned models on well-represented data, we examine 8 frequent suicide-related SDoH factors occurring within the two weeks leading up to the incident (Alcohol Problem, Criminal Legal Problem, Family Relationship Problem, Intimate Partner Problem, Job Problem, Mental Health Problem, Physical Health Problem, Suicide Disclosure).

We accessed the NVDRS Restricted Access Database (RAD) through the Centers for Disease Control and Prevention (CDC) after submitting a research proposal and obtaining approval for a data use agreement, which included agreeing to comply with NVDRS RAD security, confidentiality, and data protection requirements. All data were de-identified prior to our analysis. Please refer to the “NVDRS Data Sharing Agreement for Restricted Access Dataset Release” for more details.

Data preparation and curation

To ensure a fair comparison with the baseline pre-trained model, we adopted the same training and test set split used by a previous work7. To conduct the model evaluation in a cost-effective way, we curated a balanced test set by sampling a balanced number of positive and negative instances for each SDoH factor, and we further sampled a subset for evaluating the DeepSeek-R1 reasoning model (the detailed statistics can be found in Supplementary Table S1).
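The balanced sampling step can be sketched as follows; the per-class sample size and fixed seed are illustrative assumptions, not the study's actual values (those are given in Supplementary Table S1):

```python
import random

def balanced_sample(instances, labels, n_per_class, seed=42):
    # Draw an equal number of positive and negative instances for one
    # SDoH factor. `n_per_class` and `seed` are illustrative only.
    rng = random.Random(seed)
    pos = [x for x, y in zip(instances, labels) if y]
    neg = [x for x, y in zip(instances, labels) if not y]
    return rng.sample(pos, n_per_class) + rng.sample(neg, n_per_class)
```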

The Institutional Review Board (IRB) at Cornell University determined that the proposed activity is not research involving human subjects per the Code of Federal Regulations on the Protection of Human Subjects (45 CFR 46, 21 CFR 56). IRB review and approval are not required (protocol number: 23-12026810-01).

To assess the effectiveness of the relevance verification module, we curated a gold-standard dataset comprising SDoH factors of interest together with human-verified relevant and irrelevant sentences. Two annotators (S.W. and M.L.) independently assigned binary relevance labels (‘Relevant’ and ‘Not Relevant’) to a total of 655 sentences sampled from 160 death investigation notes. These sentences covered 16 of the 18 suicide-related SDoH factors (all except Adverse Childhood Experience and Suicide Disclosure). Annotators were provided with clear definitions for each SDoH factor to guide their labeling. In cases of disagreement, the annotators discussed and reconciled their differences to reach consensus; if consensus could not be reached, a third annotator (Y.P.) was consulted for final adjudication. This process ensured the consistent application of the guidelines and the production of high-quality annotations. To evaluate the reliability of the annotation process, we calculated the Inter-Annotator Agreement (IAA) between the annotators. This curated test set was then used to evaluate the effectiveness of both the context retrieval and relevance verification modules.
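The IAA statistic used in this study is Cohen's Kappa, which for two raters and binary labels can be computed directly:

```python
def cohens_kappa(labels_a, labels_b):
    # Two-rater Cohen's Kappa for binary relevance labels (0/1).
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement under independent raters with these marginals.
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        return 1.0  # degenerate case: both raters use a single label
    return (observed - expected) / (1 - expected)
```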

Baselines

We compared our proposed framework against several baseline methods: a fine-tuned BioBERT model7, a GPT-3.5-turbo End-to-End (End2End) model25, and a GPT-3.5-turbo Chain-of-Thought (CoT) model25,26.

Fine-tuned BioBERT

Following a previous work7, we treated SDoH factor extraction as a multi-label text classification task and fine-tuned a BioBERT27 model using the NVDRS data. For each victim, we concatenated the CME and LE reports, placing the shorter report first, as input. The model was trained using the Adam optimizer and binary Cross-Entropy loss for parameter optimization. Fine-tuning was conducted with a learning rate of 10E−6, a batch size of 12, and 20 epochs, with early stopping (patience = 5) to prevent overfitting.

The work was performed on an Intel Core i9-9960X 16-core processor, an NVIDIA Quadro RTX 5000 GPU, and 128 GB of memory.

GPT-3.5-turbo End2End prompting

This baseline employs the GPT-3.5-turbo model25 in an end-to-end fashion to extract suicide-related SDoH factors from free text: the system directly maps input narratives to structured SDoH labels without any intermediate steps. GPT-3.5-turbo, a large language model developed by OpenAI, is capable of understanding and generating human-like text due to its extensive pre-training on diverse datasets. In this approach, we query GPT-3.5-turbo directly with the input text to extract SDoH factors, without additional fine-tuning or few-shot examples (Box 4). The model processed each input report to generate structured outputs containing the identified SDoH factors. The prompts are carefully crafted with specific instructions to guide the model in focusing on the relevant information within the input text.

GPT-3.5-turbo CoT prompting

This baseline utilizes the GPT-3.5-turbo model with a chain-of-thought prompting approach, where LLMs are encouraged to generate intermediate reasoning steps before final prediction (Box 5). The CoT methodology26 enables the model to generate intermediate reasoning steps, enhancing its ability to handle complex extraction tasks that require multi-step reasoning. The model was instructed to first identify the relevant context related to the target SDoH factor in the input text, and then assess the potential SDoH factor indicators before generating the final answer. The key difference between the CoT approach and the End2End approach lies in the reasoning process. While the End2End method directly produces the final results, the CoT approach explicitly breaks down the reasoning into a chain of thought.

Box 5 GPT-3.5-turbo CoT prompting.

[INST]Read the given context below and code 'True' if 'TARGET_SOCIAL_FACTOR' occurred within two weeks before the suicide incident. Answer the following questions one by one:
1. Is the TARGET_SOCIAL_FACTOR mentioned in the context? If yes, please find the relevant sentences. Otherwise, answer 'False'.
2. If the TARGET_SOCIAL_FACTOR is mentioned, answer 'True' if it happened within two weeks before the suicide incident, answer 'False' otherwise.
Format your output in a valid JSON payload with keys 'Mentioned or Not', and 'Within Two Weeks or Not'.[/INST]Here is your input:[CONTEXT]INPUT_REPORT[/CONTEXT][TARGET_SOCIAL_FACTOR]FACTOR_DEFINITION[/TARGET_SOCIAL_FACTOR]

Comparison with the reasoning model

We also compared extraction performance against the DeepSeek-R1 reasoning model28 using an end-to-end approach. Reasoning models, such as DeepSeek-R1, generate extended internal chains of thought, thinking through the problem before providing a final response. The DeepSeek-R1 model was trained via large-scale reinforcement learning without supervised fine-tuning as a preliminary step and showed a remarkable reasoning performance. For this comparison, DeepSeek-R1 was prompted with the input text to extract suicide-related SDoH factors, without additional fine-tuning or provision of few-shot examples (Box 6). Given the high computational demand for reasoning models, we conducted these comparisons using a subset of the test set, as detailed in the section “Data preparation and curation”.

Box 6 Prompting for DeepSeek-R1 reasoning.

[INST]You will be given a report narrative, and the definition of a social determinant of health factor. Please read the given report, and answer if the factor of interest contributed to the suicide incident within two weeks preceding the suicide incident. Output your reasoning process and your final answer to whether the factor of interest contributed to the suicide incident within two weeks preceding the suicide incident (True or False).[/INST]Here is your input:[CONTEXT]INPUT_REPORT[/CONTEXT][TARGET_SOCIAL_FACTOR]FACTOR_DEFINITION[/TARGET_SOCIAL_FACTOR]

Evaluating the explainability of the multi-stage design

In this work, we define explainability as the model’s ability to provide explicit, human-interpretable rationales that justify its predictions. More specifically, explainability is characterized by the quality and correctness of the intermediate outputs produced by our framework, namely, the sentences identified by the model as relevant to the SDoH factor being extracted. These intermediate outputs not only reflect the model’s reasoning process but also offer transparent evidence supporting its final decision.

To evaluate the explainability of our framework, we focus on two intermediate stages: context retrieval and relevance verification. In the context retrieval stage, the model selects candidate sentences from the incident narrative, while the relevance verification module further filters these candidates to retain only those most pertinent to the queried SDoH factor. We evaluated both stages using our curated gold-standard test set, which includes not only ground-truth SDoH factor labels but also human-verified relevant sentences for each instance. We measured the explainability accuracy by computing the proportion of correctly identified relevant sentences that matched those annotated by human experts.
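As a minimal sketch of this measurement (one possible reading of the metric, under the assumption of exact sentence matching), the explainability accuracy is the fraction of model-selected sentences that appear in the human-annotated gold set:

```python
def explanation_accuracy(model_sentences, gold_sentences):
    # Fraction of model-identified relevant sentences that match the
    # human-verified gold annotations; exact string matching is an
    # assumption of this sketch.
    if not model_sentences:
        return 0.0
    gold = set(gold_sentences)
    return sum(s in gold for s in model_sentences) / len(model_sentences)
```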

Details of the pilot user study

In this work, we conducted a pilot user study to evaluate the practical value of our system for annotating suicide-related SDoH factors. The goal was to assess whether providing human annotators with the framework’s intermediate output, highlighting the relevant context, could lead to faster and more accurate annotations.

To prepare the data, we selected 2 frequently occurring suicide-related SDoH factors (Alcohol Problem and Mental Health Problem) and 2 infrequent factors (Adverse Childhood Experience and Exposure to Disaster). For each factor, we randomly sampled 7 incidents. Participants (n = 6) were tasked with annotating these 4 factors across the selected incidents twice: once without the relevant context highlighted by our framework (Control arm) and once with the relevant context as AI assistance (Intervention arm).

We assessed human perceptions by monitoring annotation accuracy and time. These metrics were captured through two anonymous annotation sessions: the first conducted at baseline (Control), and the second conducted one day later (Intervention). This design allowed us to evaluate changes in performance and perceptions over a short time period, while minimizing the impact of task familiarity, providing insights into the effectiveness of our framework.

Following the sessions, participants were asked to complete a questionnaire about their annotation experience and provide feedback for each arm29–32. The specific questions can be found in Supplementary Note 1. For Likert scales, we computed the quantitative distribution of responses from the participants for each question. All participants in the pilot study provided written informed consent before participation.

Statistics and reproducibility

All statistical analyses in this study were performed to evaluate the performance of the proposed multi-stage LLM framework and baseline models for suicide-related SDoH factor extraction. Model performance was assessed using standard metrics, including precision, recall, and F-1 score, computed on balanced test sets comprising both positive and negative instances for each SDoH factor. For evaluation of the context retrieval and relevance verification modules, accuracy was calculated against a curated gold-standard set of 655 sentences, with inter-annotator agreement assessed using Cohen’s Kappa statistic (IAA: 0.918). The test set sampling, model evaluation, and annotation procedures were predefined and uniformly applied across all experiments to ensure reproducibility. Detailed statistics for dataset splits and sample sizes can be found in Supplementary Table S1.
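For reference, the three reported metrics follow the standard binary-classification definitions, sketched here from scratch:

```python
def precision_recall_f1(y_true, y_pred):
    # Standard binary-classification metrics over 0/1 label lists.
    tp = sum(t and p for t, p in zip(y_true, y_pred))          # true positives
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))    # false positives
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```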

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results

Overview of the framework and data sources

Our framework consists of three steps: context retrieval, relevance verification, and SDoH factor extraction (Fig. 1; detailed in the section “Framework architecture”). First, the context retrieval stage extracts sentences relevant to the target SDoH factor from the input text. Next, a trained relevance verification module confirms the retrieved context’s relevance. Finally, the SDoH factor extraction module determines whether the suicide-related SDoH factor occurred within the two weeks preceding the suicide incident.

We evaluated our framework using the National Violent Death Reporting System (NVDRS) dataset9, which contains 267,804 reported suicide death incidents across U.S. states, Puerto Rico, and the District of Columbia from 2003 to 2019 (Table 2 and detailed in the section “Dataset description”). The NVDRS provides free-text death investigation notes describing SDoH factors contributing to suicide incidents. While precipitating SDoH factors provide insights into systemic and cumulative risks, immediate pre-incident factors offer a unique lens into the specific triggers and pressures that heighten vulnerability. Although NVDRS documents over 600 structured data elements, including SDoH-related fields9, our research focuses on 10 infrequent and 8 frequent suicide-related SDoH factors occurring within the two weeks before suicide (Fig. 2).

Table 2.

Overview of the NVDRS Data

Characteristics NVDRS
Total number of subjects 267,804
Age, mean (SD), year 46.3 (18.2)
Race (%)
American Indian/Alaska Native 1.3
Asian/Pacific Islander 2.3
Black or African American 6.5
Other/Unspecified 0.9
Two or more races 1.2
Unknown 0.2
White 87.6
Sex (%)
Female 22.1
Male 77.9
Unknown 0.0

Fig. 2.

Fig. 2

Frequency distribution of suicide-related SDoH factors, with the frequent factors shown in red and infrequent factors in blue.

We compared our framework against several baseline models: (1) a fine-tuned BioBERT model7, (2) a GPT-3.5-turbo model25 in an End-to-End (End2End) approach where the system directly maps input narratives to SDoH labels without intermediate reasoning steps, and (3) a GPT-3.5-turbo model using Chain-of-Thought (CoT) prompting25,26, which encourages the LLM to generate explicit intermediate reasoning steps before final prediction (see the section “Baselines” for details). All models were evaluated using a balanced test set for SDoH factor extraction. Additionally, we evaluated our framework against a reasoning model, DeepSeek-R128, using a subset of the test set (see the section “Comparison with the reasoning model”). Model performance was assessed using three metrics: precision, recall (sensitivity), and F-1 score (the harmonic mean of the precision and recall).

To further validate our framework, we measured the accuracy of the retrieved relevant context using a newly curated test set of 160 cases. During curation of the test set, Cohen’s Kappa was used to assess annotation agreement, yielding an Inter-Annotator Agreement (IAA) of 0.918, indicating near-perfect agreement based on McHugh’s stricter thresholds for healthcare research33.

Comparison with baseline models

Extracting infrequent suicide-related SDoH factors

We first evaluated the performance of our framework in extracting 10 infrequent suicide-related SDoH factors (Fig. 3). Our framework outperformed all baseline methods in 9 out of 10 infrequent factors. When comparing F-1 scores (Fig. 3a), our approach showed an average improvement of 17.7% over the fine-tuned BioBERT model, 4.8% over the GPT-3.5-turbo End2End model, and 8.8% over the GPT-3.5-turbo CoT model.

Fig. 3. Performance comparisons between our proposed multi-stage framework and three baseline models in extracting the 10 infrequent suicide-related SDoH factors from death investigation notes.

Fig. 3

a F-1 score comparisons. b Precision score comparisons. c Recall score comparisons. CoT Chain of Thought. Detailed scores can be found in Supplementary Table S2.

In terms of precision (Fig. 3b), our framework achieved an overall increase of 15.6% compared to the fine-tuned BioBERT model, 2.0% compared to the GPT-3.5-turbo End2End model, and 7.3% compared to the GPT-3.5-turbo CoT model. However, the GPT-3.5-turbo End2End model outperformed our approach in extracting three specific factors: Exposure to Disaster, Other Substance Abuse, and School Problem.

A similar pattern was observed for recall (Fig. 3c). Our framework achieved an average recall improvement of 14.9% over the fine-tuned BioBERT model, 3.9% over the GPT-3.5-turbo End2End model, and 7.8% over the GPT-3.5-turbo CoT model. However, the GPT-3.5-turbo End2End model demonstrated superior recall in extracting Adverse Childhood Experience and Exposure to Disaster from the death investigation notes.

Extracting frequent suicide-related SDoH factors

We then compared our multi-stage framework with baselines for extracting 8 frequent suicide-related SDoH factors from death investigation notes (Fig. 4). Our framework consistently improved F-1 scores for 5 out of 8 factors (Fig. 4a). Notably, despite operating in a zero-shot setting without fine-tuning, our framework achieved an average F-1 score improvement of 4.0% over the fine-tuned BioBERT model. It also outperformed the zero-shot GPT-3.5-turbo End2End and CoT models by 7.4% and 4.1%, respectively.

Fig. 4. Performance comparisons between our proposed multi-stage framework and three baseline models in extracting the 8 frequent suicide-related SDoH factors from death investigation notes.

Fig. 4

a F-1 score comparisons. b Precision score comparisons. c Recall score comparisons. CoT Chain of Thought. Detailed scores can be found in Supplementary Table S3.

Similar trends were observed in precision and recall scores (Fig. 4b and c). Our framework improved precision by 4.0% over the fine-tuned BioBERT model, 0.4% over the GPT-3.5-turbo End2End model, and 5.4% over the GPT-3.5-turbo CoT model. Additionally, it enhanced recall by 3.7% compared to BioBERT, 5.3% compared to GPT-3.5-turbo End2End, and 4.3% compared to GPT-3.5-turbo CoT.

Comparison with the reasoning model

We compared the performance of our proposed framework with that of the DeepSeek-R1 reasoning model on a subset of the SDoH factor extraction test set (Fig. 5), owing to the higher computational demands of reasoning models. As shown in Fig. 5a, DeepSeek-R1 outperformed the other models on 7 of the 16 evaluated SDoH factors. Notably, it achieved an F-1 score of 66.7% on the infrequent Exposure to Disaster factor, which is 35.9% higher than our framework, 31.9% higher than the GPT-3.5-turbo End2End model, and 61.3% higher than the fine-tuned BioBERT model. In terms of precision (Fig. 5b), DeepSeek-R1 achieved the highest precision on 14 of the 16 factors. However, Fig. 5c shows that DeepSeek-R1 achieved the highest recall on only 3 of the 16 factors, while our proposed framework outperformed it on recall for 11 factors.

Fig. 5. Performance comparisons between our proposed multi-stage framework, baseline models, and the DeepSeek-R1 reasoning model in extracting suicide-related SDoH factors from death investigation notes.

Fig. 5

a F-1 score comparisons. b Precision score comparisons. c Recall score comparisons. CoT Chain of Thought. Detailed scores can be found in Supplementary Table S4.

Evaluating the explainability of the multi-stage design

We evaluated the context retrieval and relevance verification stages of our framework using the curated test set, as these intermediate outputs directly inform the model's final decisions. Crucially, both stages produce interpretable outputs: the selected contextual evidence serves as an explicit intermediate explanation for each prediction. This multi-layered decision-making process enhances the framework's explainability by transparently revealing how the model reasons over retrieved evidence before arriving at a final decision.

To assess the effectiveness of this pipeline, we measured the accuracy of the retrieved context at each stage. As shown in Fig. 6, our relevance verification process improved overall relevant context retrieval accuracy from 59.3% to 73.1%, representing a 13.8 percentage point improvement over the single-stage context retrieval approach. Additionally, we assessed the FLAN-T5-base model’s performance in the relevance verification task. Without fine-tuning, it achieved an average accuracy of 55.8%. After fine-tuning, the accuracy increased by 30.8 percentage points, reaching 86.6%.
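The retrieve-then-verify idea behind these accuracy numbers can be sketched as below. Everything here is a hypothetical stand-in: the keyword rule merely substitutes for the fine-tuned FLAN-T5-base verifier so the example runs, and the accuracy definition (fraction of candidates whose final keep/drop decision matches the gold relevance label) is one plausible reading of the metric.

```python
# Toy sketch of the stage-2 relevance verification step. The keyword
# rule is a hypothetical stand-in for the fine-tuned FLAN-T5-base
# verifier; real candidates would come from stage-1 LLM retrieval.
from typing import Callable, List, Tuple

def retrieval_accuracy(
    candidates: List[Tuple[str, bool]],   # (sentence, gold_relevant)
    verify: Callable[[str], bool],        # stage-2 relevance verifier
) -> float:
    """Fraction of candidates whose final keep/drop decision matches
    the gold relevance label."""
    correct = sum(1 for sent, gold in candidates if verify(sent) == gold)
    return correct / len(candidates)

def toy_verifier(sentence: str) -> bool:
    # Hypothetical rule standing in for the fine-tuned verifier model.
    return "alcohol" in sentence.lower() or "depress" in sentence.lower()

toy_candidates = [
    ("He had been drinking alcohol heavily.", True),
    ("The weather was sunny.", False),
    ("She seemed depressed.", True),
    ("He owned a car.", True),   # relevant but missed by the toy rule
]
accuracy = retrieval_accuracy(toy_candidates, toy_verifier)
```

Filtering stage-1 candidates through a verifier in this way is what lifts the overall retrieval accuracy reported above, at the cost of a second model call per candidate.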

Fig. 6. Accuracy score comparisons between the context retrieval in stage 1 and the relevance verification in stage 2, to examine the effectiveness of our multi-stage design in the proposed framework.

Fig. 6

Detailed scores can be found in Supplementary Table S5.

Pilot user study

We conducted a pilot user study to assess the practical value of our system for annotating SDoH factors. The primary objective was to validate that experts using our proposed method could annotate suicide-related SDoH factors more quickly and accurately than through manual efforts alone. To test this claim, we employed a two-arm design: one involving experts working independently (Control) and the other combining expert efforts with AI assistance (Intervention). We recruited 6 human experts to annotate 4 suicide-related SDoH factors (Adverse Childhood Experience, Alcohol Problem, Exposure to Disaster, Mental Health Problem) from 28 reports. Initially, the experts annotated the SDoH factors independently without AI assistance (Control). After at least 24 h, they annotated the same set of factors with AI assistance (Intervention) to minimize the effects of task familiarity. After completing the annotations, experts were asked to fill out a survey to provide feedback.

Figure 7a illustrates that experts experienced lower mental and/or physical stress when annotating with AI assistance than when working independently (Q1–Q4). They also reported higher satisfaction with their annotation performance (Q5). When asked about system reliability and their interest in using the system for decision-making support (Q7–Q9), all participants expressed a preference for the AI-assisted method. Furthermore, participants noted that the AI assistance resulted in quicker, more accurate, and more confident annotations compared to working independently (Q10–Q12).

Fig. 7. Pilot user studies for suicide-related SDoH factor annotation.

Fig. 7

a User study survey quantitative response distributions. b Annotation accuracy comparison between the control arm and the intervention arm (p = 0.666). c Annotation time comparison between the control arm and the intervention arm (p = 0.001).

Figure 7b shows that experts who collaborated with AI assistance achieved an annotation accuracy of 83.33%, slightly surpassing the 81.55% accuracy achieved by experts working independently. Similarly, Fig. 7c shows that annotating with AI assistance required less time, saving an average of 62.39 s per incident compared to independent annotation.
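The paper does not state which statistical test produced the p-values in Fig. 7b, c. For within-subject timing comparisons like this one, a paired t-test is one plausible choice; the t-statistic can be sketched in pure Python (the per-annotator timings below are made up for illustration, and converting t to a p-value would additionally need a t-distribution CDF, e.g. `scipy.stats.ttest_rel`).

```python
# Sketch of a paired t-statistic for within-subject timing data.
# The timings are hypothetical; the paper's raw data are not public,
# and the paper does not state which test it used.
import math
from statistics import mean, stdev

def paired_t(control: list, intervention: list) -> tuple:
    """Return (t statistic, degrees of freedom) for paired samples."""
    diffs = [c - i for c, i in zip(control, intervention)]
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)   # standard error of the mean diff
    return mean(diffs) / se, n - 1

# Hypothetical per-annotator mean times (seconds per incident).
control_times     = [210.0, 195.5, 230.2, 205.1, 188.7, 220.4]
ai_assisted_times = [150.3, 140.2, 160.8, 138.9, 130.5, 158.6]
t_stat, dof = paired_t(control_times, ai_assisted_times)
```

A large positive t with 5 degrees of freedom (6 annotators) would be consistent with the significant time saving (p = 0.001), while a small t would be consistent with the non-significant accuracy difference (p = 0.666).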

Discussion

Our work presents an approach using LLMs to extract suicide-related SDoH factors from unstructured text. The framework breaks down the extraction task into three key steps: context retrieval, relevance verification, and SDoH factor extraction. This multi-stage approach not only outperforms baseline models but also enhances transparency by providing intermediate outputs that shed light on the decision-making process.
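The three-stage decomposition described above can be sketched as a pipeline with pluggable model calls. In the sketch, `retrieve`, `verify`, and `extract` are hypothetical stand-ins for the actual LLM prompts (GPT-3.5-turbo or a fine-tuned FLAN-T5); the toy rules exist only so the example runs end to end.

```python
# Hypothetical skeleton of the three-stage design: (1) retrieve candidate
# context for a factor, (2) verify each candidate's relevance, (3) decide
# presence/absence from verified context only. The stand-in callables are
# not the paper's prompts.
from typing import Callable, List

def multi_stage_extract(
    note: str,
    factor: str,
    retrieve: Callable[[str, str], List[str]],
    verify: Callable[[str, str], bool],
    extract: Callable[[List[str], str], bool],
) -> dict:
    candidates = retrieve(note, factor)                      # stage 1: context retrieval
    evidence = [s for s in candidates if verify(s, factor)]  # stage 2: relevance verification
    present = bool(evidence) and extract(evidence, factor)   # stage 3: factor extraction
    # The verified evidence doubles as a human-readable explanation.
    return {"factor": factor, "present": present, "evidence": evidence}

# Toy stand-ins so the sketch is executable.
def toy_retrieve(note: str, factor: str) -> List[str]:
    return [s.strip() + "." for s in note.split(".") if s.strip()]

def toy_verify(sentence: str, factor: str) -> bool:
    return "drinking" in sentence.lower()

def toy_extract(evidence: List[str], factor: str) -> bool:
    return len(evidence) > 0

note = ("He lost his job last month. "
        "He had been drinking daily. No prior attempts.")
result = multi_stage_extract(note, "Alcohol Problem",
                             toy_retrieve, toy_verify, toy_extract)
```

Because each stage's output is explicit, the returned `evidence` list is exactly the kind of intermediate explanation shown to annotators in the user study.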

Comparisons with baseline methods underscore the effectiveness of our proposed framework. Our approach consistently improved F-1 scores for 9 out of 10 infrequent suicide-related SDoH factors and 5 out of 8 frequent suicide-related SDoH factors occurring within the two weeks prior to the suicide incidents. Specifically, for infrequent factors, the average F-1 score of our framework exceeded the fine-tuned BioBERT model by 17.7%, the GPT-3.5-turbo End2End model by 4.8%, and the GPT-3.5-turbo CoT model by 8.8%. For frequent factors, the average F-1 score of our zero-shot multi-stage framework was 4.0% higher than the fine-tuned BioBERT model, 7.4% higher than the GPT-3.5-turbo End2End model, and 4.1% higher than the GPT-3.5-turbo CoT model. Additionally, we demonstrated that the relevance verification step improved the overall relevant context retrieval accuracy, increasing it from 59.3% to 73.1%, marking a 13.8 percentage point improvement over the single-stage context retrieval.

Our study highlights the distinct ways in which different modeling techniques address the challenge of long-tailed SDoH distributions. The fine-tuned BioBERT model, though specifically optimized for extracting SDoH factors from text, showed limited generalizability to low-frequency SDoH factors. This limitation reflects the model's reliance on the empirical distribution of the training data and its inherent sensitivity to class imbalance. In contrast, LLM-based approaches, which utilize instruction prompting and benefit from large-scale pretraining, demonstrated better robustness. These models were more effective in extracting low-frequency SDoH factors, even in zero-shot settings.

The differences in performance improvements observed across our model comparisons, particularly the more substantial gains over baseline models compared to reasoning models, highlight important considerations regarding model capabilities and practical deployment. Our framework demonstrated strong improvements over baseline models across multiple SDoH factors, especially in handling infrequent factors. When compared to reasoning models, the gains were more modest and nuanced: our model achieved higher recall, while the reasoning model exhibited higher precision. These contrasting patterns suggest that the two approaches suit different application needs: our framework is better suited to high-coverage extraction tasks, whereas reasoning models may be preferable when minimizing false positives is critical. It is also important to note that the evaluation sets used in each comparison differed: a smaller, more controlled set was used for the reasoning model because of its higher computational demands. These evaluation constraints likely contribute to the observed differences in performance improvements and should be considered when interpreting the results.

The pilot user study further validated that human experts using the intermediate explanations provided by our framework reduced annotation time by an average of 62.39 seconds without sacrificing accuracy, maintaining an annotation accuracy of 83.33%.

However, this study has several limitations. First, as shown in the section “Evaluating the explainability of the multi-stage design”, fine-tuning a small FLAN-T5-base model for relevance verification improved the average accuracy of relevant context retrieval by 13.8 percentage points over the single-stage approach. Unlike GPT-3.5-turbo, however, the fine-tuned FLAN-T5 model does not inherently produce reasoning behind its predictions, limiting its explainability. To enable such reasoning capability, human-generated rationales would need to be incorporated during fine-tuning, which was not implemented in this study.

Second, we evaluated our model on a selection of 10 infrequent and 8 frequent SDoH factors, which represent only a subset of the full range of factors defined in the NVDRS coding manual9. By including both high-frequency and low-frequency factors, we aimed to evaluate the model’s performance under varying data availability scenarios and to assess its potential generalizability to the more imbalanced distributions often encountered in real-world settings. The selection of these 18 factors was guided by both their observed frequency in the NVDRS data and their relevance as identified in the existing literature on suicide risk7,34. The NVDRS, however, encompasses over 600 unique SDoH-related factors, many of which may hold clinical or contextual significance. Expanding this framework in future research to incorporate a broader spectrum of SDoH factors would further enhance the comprehensiveness of information extraction and better support a wide range of downstream public health applications.

Finally, while the NVDRS provides a rich and detailed source of information on violent deaths, including suicide, certain SDoH factors may be more prevalent or more frequently documented among specific subpopulations due to underlying societal and structural influences. Such differences may introduce variability in model performance across demographic groups. Moreover, LLMs are trained on datasets that may not equally represent the full range of linguistic, cultural, or socioeconomic contexts, which could further affect the consistency and fairness of their outputs. These considerations highlight the need for future research to systematically evaluate model fairness across subpopulations and assess performance disparities.

Despite growing interest in explainable NLP, there is currently no universal definition or standard evaluation framework for explainability that applies across diverse tasks and model architectures35,36. In practice, explainability is often approximated through proxy measures such as feature importance, attention weights, or the presence of interpretable intermediate outputs. However, these proxies do not consistently align with human reasoning or trust, and their reliability as true explanations remains debated. This challenge is particularly true for LLMs, whose internal decision-making processes are largely opaque and highly contextual. In this work, we address these challenges by introducing a multi-stage framework where intermediate outputs, such as retrieved and verified context, serve as explicit and human-interpretable rationales. While our approach enhances transparency in machine-learning models, we recognize that explainability remains inherently task-specific and subjective. Further efforts are needed to develop standardized, user-centered frameworks for evaluating and comparing explanation quality across different models and domains.

Although our pilot user study demonstrated reduced annotation time when LLM-generated intermediate explanations were provided, the within-subject design may introduce potential confounding effects related to task familiarity. Annotators completed the intervention task one day after the control task. While sessions were separated by 24 h to minimize carryover effects37, increased familiarity with the annotation task may still have contributed to the observed efficiency gain. To more rigorously evaluate the causal impact of AI assistance on annotation quality and efficiency, future studies should adopt a between-subjects design (e.g., comparing a group exposed to AI assistance with a separate group completing repeated control tasks). Therefore, the findings of this study should be interpreted as exploratory rather than conclusive.

Supplementary information

43856_2025_1114_MOESM2_ESM.pdf (69.7KB, pdf)

Description of Additional Supplementary files

Supplementary Data 1 (10.1KB, xlsx)
Supplementary Data 2 (9.9KB, xlsx)
Supplementary Data 3 (11KB, xlsx)
Supplementary Data 4 (9.6KB, xlsx)
Reporting Summary (1.5MB, pdf)

Acknowledgements

This study was supported by the AIM-AHEAD Consortium Development Program of NIH under grant number OT2OD032581, and the National Institute of Mental Health (NIMH) under grant number RF1MH134649. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or NIMH. The National Violent Death Reporting System (NVDRS) is administered by the Centers for Disease Control and Prevention (CDC) in participating NVDRS states. The findings and conclusions of this study are those of the authors alone and do not necessarily represent the official position of the CDC or of participating NVDRS states.

Author contributions

S.W. and Y.P. contributed to the conception of the study and study design; S.W., K.D., M.L., H.M., Y.M., Y.W., Z.X., J.Z., Y.X., Y.P. contributed to the acquisition of the data; S.W. and Y.P. contributed to the analysis and interpretation of the data; Y.X., Y.D., X.X., J.G., and Y.P. provided strategic guidance; S.W. and Y.P. contributed to the paper organization and team logistics; S.W., K.D., M.L., H.M., Y.M., Y.W., Z.X., J.Z., Y.X., Y.D., X.X., J.G., and Y.P. contributed to drafting and revising the manuscript.

Peer review

Peer review information

Communications Medicine thanks the anonymous reviewers for their contribution to the peer review of this work.

Data availability

The dataset analyzed in this study, the NVDRS Restricted Access Database (RAD), is accessible upon request to researchers who meet specific eligibility criteria and take steps to ensure confidentiality and data security. Our research was approved by the NVDRS RAD proposal, which gave us the required permissions to access the data and undertake the work described here. This restricted access is in place due to the confidential nature of the NVDRS data, which includes sensitive information that could potentially lead to the unintended disclosure of the identities of victims. To safeguard this data, the CDC protects it by requiring users to fulfill certain eligibility requirements and implement the necessary measures to ensure the security of data, preserve confidentiality, and prevent unauthorized access. Researchers interested in accessing the NVDRS data can apply per the instructions provided at https://www.cdc.gov/nvdrs. Source data for Fig. 3 can be found in Supplementary Data 1. Source data for Fig. 4 can be found in Supplementary Data 2. Source data for Fig. 5 can be found in Supplementary Data 3.

Code availability

Our code and implementations have been made publicly available at https://github.com/bionlplab/2025_multistage_llm_sdoh.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s43856-025-01114-z.

References

1. World Health Organization & Calouste Gulbenkian Foundation. Social Determinants of Mental Health (World Health Organization, 2014).
2. Hacker, K., Auerbach, J., Ikeda, R., Philip, C. & Houry, D. Social determinants of health: an approach taken at CDC. J. Public Health Manag. Pract. 28, 589–594 (2022).
3. Liu, S. et al. Social vulnerability and risk of suicide in US adults, 2016–2020. JAMA Netw. Open 6, e239995 (2023).
4. Patra, B. G. et al. Extracting social determinants of health from electronic health records using natural language processing: a systematic review. J. Am. Med. Inform. Assoc. 28, 2716–2727 (2021).
5. Magoc, T. et al. Generalizability and portability of natural language processing system to extract individual social risk factors. Int. J. Med. Inform. 177, 105115 (2023).
6. Han, S. et al. Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing. J. Biomed. Inform. 127, 103984 (2022).
7. Wang, S. et al. An NLP approach to identify SDOH-related circumstance and suicide crisis from death investigation narratives. J. Am. Med. Inform. Assoc. 30, 1408–1417 (2023).
8. Gabriel, R. A. et al. On the development and validation of large language model-based classifiers for identifying social determinants of health. Proc. Natl Acad. Sci. USA 121, e2320716121 (2024).
9. Liu, G. S. et al. Surveillance for violent deaths—National Violent Death Reporting System, 48 states, the District of Columbia, and Puerto Rico, 2020. MMWR Surveill. Summ. 72, 1–38 (2023).
10. Bai, X. et al. Explainable deep learning for efficient and robust pattern recognition: a survey of recent developments. Pattern Recognit. 120, 108102 (2021).
11. Adadi, A. & Berrada, M. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160 (2018).
12. Hassija, V. et al. Interpreting black-box models: a review on explainable artificial intelligence. Cogn. Comput. 16, 45–74 (2023).
13. Singh, C., Inala, J. P., Galley, M., Caruana, R. & Gao, J. Rethinking interpretability in the era of large language models. arXiv:2402.01761 (2024).
14. Guevara, M. et al. Large language models to identify social determinants of health in electronic health records. NPJ Digit. Med. 7, 6 (2024).
15. Consoli, B. et al. SDOH-GPT: using large language models to extract social determinants of health (SDOH). J. Am. Med. Inform. Assoc. 10.1093/jamia/ocaf094 (2025), online ahead of print.
16. Bird, S. NLTK: the Natural Language Toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions 214–217 (Association for Computational Linguistics, 2004).
17. Marasovic, A., Beltagy, I., Downey, D. & Peters, M. Few-shot self-rationalization with natural language prompts. In Findings of the Association for Computational Linguistics: NAACL 2022 (eds Carpuat, M., de Marneffe, M.-C. & Meza Ruiz, I. V.) 410–424 (Association for Computational Linguistics, 2022).
18. Weng, Y. et al. Large language models are better reasoners with self-verification. In Findings of the Association for Computational Linguistics: EMNLP 2023 2550–2575 (2023).
19. Gero, Z. et al. Self-verification improves few-shot clinical information extraction. In Proceedings of the ICML 3rd Workshop on Interpretable Machine Learning in Healthcare (IMLH) (2023).
20. Bucher, M. J. J. & Martini, M. Fine-tuned ‘small’ LLMs (still) significantly outperform zero-shot generative AI models in text classification. arXiv:2406.08660 (2024).
21. Lu, Y. et al. Human still wins over LLM: an empirical study of active learning on domain-specific annotation tasks. arXiv:2311.09825 (2023).
22. Edwards, A. & Camacho-Collados, J. Language models for text classification: is in-context learning enough? In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) 10058–10072 (2024).
23. Chung, H. W. et al. Scaling instruction-finetuned language models. J. Mach. Learn. Res. 25, 1–53 (2022).
24. CDC. National Violent Death Reporting System: Web Coding Manual, Version 6.0 (CDC, 2022).
25. OpenAI. GPT-3.5-turbo model. https://platform.openai.com/docs/models/gpt-3-5 (2023).
26. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th Conference on Neural Information Processing Systems (2022).
27. Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2019).
28. DeepSeek-AI et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948 (2025).
29. Hart, S. G. NASA-Task Load Index (NASA-TLX); 20 years later. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 50, 904–908 (2006).
30. Bangor, A., Kortum, P. T. & Miller, J. T. An empirical evaluation of the System Usability Scale. Int. J. Hum.–Comput. Interact. 24, 574–594 (2008).
31. Hoffman, R. R., Mueller, S. T., Klein, G. & Litman, J. Measures for explainable AI: explanation goodness, user satisfaction, mental models, curiosity, trust, and human–AI performance. Front. Comput. Sci. 5, 1096257 (2023).
32. Jian, J.-Y., Bisantz, A. M. & Drury, C. G. Foundations for an empirically determined scale of trust in automated systems. Int. J. Cogn. Ergon. 4, 53–71 (2000).
33. McHugh, M. L. Interrater reliability: the kappa statistic. Biochem. Med. 22, 276–282 (2012).
34. Dang, Y. et al. Systematic design and data-driven evaluation of social determinants of health ontology (SDOHO). J. Am. Med. Inform. Assoc. 30, 1465–1473 (2023).
35. Lyu, Q., Apidianaki, M. & Callison-Burch, C. Towards faithful model explanation in NLP: a survey. Comput. Linguist. 50, 657–723 (2024).
36. Zhao, H. et al. Explainability for large language models: a survey. ACM Trans. Intell. Syst. Technol. 15, 20 (2024).
37. Charness, G., Gneezy, U. & Kuhn, M. A. Experimental methods: between-subject and within-subject design. J. Econ. Behav. Organ. 81, 1–8 (2012).
