JCO Clinical Cancer Informatics. 2025 Jul 23;9:e2400317. doi: 10.1200/CCI-24-00317

Extraction of Social Determinants of Health From Electronic Health Records Using Natural Language Processing

Zhenghua Chen 1,2, Patricia Lasserre 2, Angela Lin 1,3, Rasika Rajapakshe 1,2,3
PMCID: PMC12309507  PMID: 40700678

Abstract

PURPOSE

Social Determinants of Health (SDoH) have a significant effect on health outcomes and inequalities. SDoH can be extracted from electronic health records (EHR) to aid policy development and research to improve population health. Automated extraction using artificial intelligence (AI) can improve efficiency and cost-effectiveness. The focus of this study was to autonomously extract comprehensive SDoH details from EHR using a natural language processing (NLP)–based AI pipeline.

MATERIALS AND METHODS

A curated set of 1,000 BC Cancer clinical documents with concentrated SDoH information served as the reference standard for training and evaluating NLP models. Two pipelines were used: an open-source pipeline trained on the annotated medical documents and an industrial pretrained solution used as a benchmark. Three experiments optimized the first pipeline's performance, assessing the effect of including subtype word positions during training. The superior open-source pipeline was then used to extract SDoH information from 13,258 oncology documents.

RESULTS

The open-source pipeline achieved an average F1 score of 0.88 on the validation data set for extracting 13 SDoH factors, surpassing the benchmark by 5%. It excelled in detailed subtype extraction, whereas the benchmark performed better in identifying rarely annotated SDoH information in the BC Cancer data set. Overall, 60,717 SDoH factors and associated details were extracted from BC Cancer EHR oncology documents. The most frequently extracted SDoH factors included tobacco use, employment status, marital status, alcohol consumption, and living status, each occurring between 8,000 and 12,000 times.

CONCLUSION

This study demonstrates the potential of an NLP pipeline to extract SDoH factors from clinical notes, with strong performance on limited data, although data set–specific adjustments are needed for broader application across institutions.

INTRODUCTION

Human health is influenced by many factors such as eating habits, disease exposure, and genetic inheritance. These factors are complex and interconnected and can be categorized into five determinants of health (DoH): genetics, behaviors, environmental and physical influences, medical care, and social factors.1 Among all the DoH, the social determinants of health (SDoH) play a crucial role in human well-being. The WHO defines SDoH as nonmedical factors that encompass the conditions in which people are born, live, work, and play2 and considers them as root social causes that undermine people's health.

CONTEXT

  • Key Objective

  • Can social determinants of health (SDoH) information be systematically and efficiently extracted from large-scale unstructured oncology consultation reports using machine learning (ML), offering an alternative to manual review?

  • Knowledge Generated

  • An open-source natural language processing (NLP) pipeline trained on a small, annotated data set achieved an average F1 score of 0.88 and extracted over 60,000 SDoH factors across 13 categories and attributes from 13,258 BC Cancer EHR documents. Extraction was completed substantially faster than manual review, demonstrating the feasibility of ML for scalable SDoH information structuring.

  • Relevance

  • The automated extraction of SDoH will reduce manual chart review burden and enable scalable research and clinical care across institutions.

SDoH often reflect an individual's social status and significantly affect health outcomes and inequalities. Factors like poverty, limited education, neighborhood disadvantage, and social isolation play significant roles in determining cancer stage and survival rates.3-7 In September 2020, the British Columbia Office of the Human Rights Commissioner8 published a report advocating for the collection of disaggregated data, closely related to SDoH data, to develop effective policies addressing systemic inequalities and safeguarding human rights. Furthermore, in July 2023, the Canadian Cancer Society and Canadian Partnership Against Cancer9 released the Pan-Canadian Cancer Data Strategy which advocated for SDoH data collection such as patient race, ethnicity, and other equity-focused characteristics.

However, legislating the collection, use, and disclosure of data, while indispensable, is an onerous process that could delay actual collection for years. Fortunately, within existing electronic health records (EHR) at many health care facilities, there resides valuable SDoH information. Identifying efficient methods to extract this information would significantly expedite the overall SDoH data collection process. Nevertheless, the SDoH information embedded in EHR typically manifests as unstructured text, presented in narrative paragraphs within the broader context of medical documents. Given the variability in content across the records, this unstructured text may encompass diverse types and degrees of SDoH, making large-scale extraction a nontrivial endeavor. Current academic research attempts to extract SDoH from medical records but typically only categorizes the information in general terms, lacking complex details such as status, duration, type, and related history. The goal of this research was to develop a natural language processing (NLP) pipeline capable of autonomously extracting exceptionally detailed SDoH information, covering 17 distinct types along with various subtypes, from the EHR at BC Cancer.

MATERIALS AND METHODS

Data

The data for this project were accessed under BC Cancer-UBC Research Ethics Board approval (Certificate no. H18-00172). The full data set used in this study includes 413,648 documents from 9,057 patients with breast cancer screened at BC Cancer during 2023-2024. A study data set of 1,000 documents from 966 patients was selected from the full data set because it contained rich SDoH content. The study data set (966 patients) has a median age of 63 years and a mean of 63.44 years, whereas the full data set (9,057 patients) has a median of 64 years and a mean of 64.08 years. A Welch two-sample t-test showed no significant difference in mean ages (P = .09116), indicating comparable age distributions. The age distribution plot of the study data set is provided in the Appendix. Owing to document complexity, only oncology consultation (ONC) reports were selected. These documents are generated by oncologists after the initial evaluation of a patient and are typically the most comprehensive source of patient information. ONC documents usually include standard sections such as diagnosis, reason for referral, history of present illness, medical history, medication list, allergies, family history, social history, investigations, physical examination, and impression and plan. The social history section is where information related to SDoH is typically documented. The social history sections were extracted from the ONC documents to form the SDoH-sections documents. Finally, 1,000 documents were randomly selected from the SDoH-sections documents for annotation and pipeline development (Fig 1).

FIG 1.


The data flow to decide the study data set: SDoH-sections documents are the data to extract SDoH from ultimately, and the study data set was used for model development to achieve this extraction. ONC, oncology consultation; SDOH, social determinants of health.

Annotation

The SDoH factors for extraction were selected through a rigorous review of four sources: WHO SDoH criteria,2 the Government of Canada's DoH,10 UW BioNLP Lab guidelines,11 and SDoH: The Canadian Facts.12 The final list includes employment, income, education, food insecurity, environment exposure, living conditions, childhood experiences, social exclusion, health access, race, immigration status, geographical location, indigenous ancestry, social safety nets, health behaviors, sexual orientation, gender identity, country of origin, and cultural influence.

Two annotators, the first author and a BC Cancer clinical researcher, labeled the 1,000 documents. A preliminary draft of the annotation guidelines guided the process, and ambiguities prompted discussion and refinement. Of the 1,000 documents, 965 contained at least one SDoH annotation, totaling 14,734 annotations (4,835 events, 6,388 labeled arguments, and 3,516 span-only arguments). The most common event types were tobacco, employment, marital status, alcohol, and living status; less frequent types were drug, health service, and social support. The annotation guidelines are available online.13

mSpERT Pipeline

The mSpERT pipeline is an open-source NLP pipeline previously used by Dr. Lybarger14 for SDoH extraction from the SHAC data set. In this work, the mSpERT pipeline was applied to BC Cancer documents to detect a more extensive list of SDoH factors and associated details. The pipeline comprises a sequence of modules: BERT embeddings, an entity detector that classifies the SDoH type, a subtype classifier that extracts related details for the detected SDoH entities, and a binary relation classifier that makes a prediction for each pair of detected entities. Several modifications were made to the pipeline's data preprocessing methods to accommodate special cases in the study data and fit them into the mSpERT training process.

Experiments

Experiments tested different structured data representations for training the pipeline. Documents were split into sentences, with each sentence represented as a structured unit containing its document ID, sentence index, word tokens, entity details, and relations.

When triggers and labeled arguments overlapped or diverged, three experiments tested different span treatments. In the first experiment, the span of a labeled-argument entity was specified independently of the span of the trigger entity, even when they overlapped. The second experiment retained the spans of nonoverlapping labeled arguments while consolidating the spans of overlapping triggers and labeled arguments: the entity list was created as in experiment 1, but labeled arguments that overlapped their corresponding triggers were excluded. The structured data in the third experiment followed the original mSpERT framework, in which all labeled arguments associated with a trigger were assimilated into the subtype object of that trigger.
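As an illustration, the per-sentence structured unit and the experiment-2 style span treatment (dropping labeled arguments that overlap their trigger) could be sketched as below. The field names and dictionary layout are assumptions, not the exact mSpERT schema.

```python
def overlaps(a, b):
    """True if token spans a = (start, end) and b = (start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def build_unit(doc_id, sent_idx, tokens, triggers, labeled_args, relations,
               drop_overlapping=False):
    """Build one structured sentence unit: document ID, sentence index,
    word tokens, entity details, and relations. With drop_overlapping=True,
    labeled arguments overlapping their triggers are consolidated into the
    trigger span (the experiment-2 treatment)."""
    entities = list(triggers)
    for arg in labeled_args:
        if drop_overlapping and any(overlaps(arg["span"], t["span"]) for t in triggers):
            continue  # overlapping span is absorbed by the trigger
        entities.append(arg)
    return {"doc_id": doc_id, "sentence": sent_idx, "tokens": tokens,
            "entities": entities, "relations": relations}
```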

The best-performing model was fine-tuned using four clinical pretrained embeddings: Bio ClinicalBERT, Bio Discharge Summary BERT, MedBERT, and bert-base-NER. Hyperparameters were selected via grid search, focusing on relation filter threshold, dropout, and epochs.

Training

The classes of the entity classifier are the SDoH categories, and those of the subtype classifiers are the SDoH attributes, where the labels for each classifier are the subtypes. For each classifier, a none label was added to indicate a negative prediction. The classes of relationships between entities are positive and negative. The 1,000 annotated documents were split into 650 training, 150 validation, and 200 test documents. The training data were used for fine-tuning the BERT embeddings and learning the width embeddings and the classification layers' parameters. The mSpERT model was trained separately on the training data for each of the three experiments, and performance was evaluated on the validation data. All programs were executed on an Ubuntu 22.04.1 LTS machine with an NVIDIA GeForce RTX 3060 GPU and CUDA version 12.2.
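The 650/150/200 document split described above could be produced, for example, as follows; the fixed random seed is an assumption added here for reproducibility and is not specified in the study.

```python
import random

def split_documents(doc_ids, n_train=650, n_val=150, n_test=200, seed=42):
    """Shuffle document IDs and partition them into training, validation,
    and test sets of the stated sizes."""
    assert len(doc_ids) == n_train + n_val + n_test
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # seeded shuffle for reproducibility
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])
```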

Spark NLP Pipeline

Spark NLP15 is a commercial library of pretrained NLP models developed by John Snow Labs16 that has achieved state-of-the-art extraction performance on large public clinical data sets. A pipeline of Spark NLP pretrained models was tailored to this study's task of extracting comprehensive SDoH details and served as a baseline against which the mSpERT pipeline was compared. A matching scheme was also created to evaluate the performance of Spark NLP on the reference-standard data.

The Spark NLP pipeline consists of an SDoH NER model, which detects over 40 SDoH factors such as education, population group, and quality of life with F1 scores ranging from 0.65 to 0.98, and an SDoH Mentions model, which captures six general types of SDoH such as SDoH economics, SDoH education, and SDoH environment with F1 scores ranging from 0.67 to 0.97. The pipeline also incorporates the JSL Assertion model as a subtype classifier. It can predict none, absent, present, someone else, past, planned, hypothetical, possible, and family with F1 scores ranging from 0.45 to 0.96, whereas the subtypes in this task were expected to convey more detail and varied between entity types. For example, a StatusTime entity has current, past, and other time-related subtypes, and StatusEmploy has subtypes like retired and unemployed. Therefore, the useful subtypes in the JSL Assertion model mostly align with the StatusTime subtypes. However, the other Assertion labels, such as someone else and family, could help reduce false-positive entity predictions and would be interesting to incorporate in the future.
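To illustrate, aligning the JSL Assertion labels with the StatusTime subtypes used in this study's annotations might look like the mapping below. The mapping itself is an assumption made for comparison purposes, not an official Spark NLP artifact.

```python
# Hypothetical alignment of JSL Assertion labels to this study's StatusTime
# subtypes; labels with no time meaning (eg, someone else, family) map to
# nothing and could instead be used to filter false-positive entities.
ASSERTION_TO_STATUSTIME = {
    "present": "current",
    "past": "past",
    "planned": "future",
    "absent": "none",
}

def map_assertion(label):
    """Return the StatusTime subtype for a JSL Assertion label, or None if
    the label carries no time information."""
    return ASSERTION_TO_STATUSTIME.get(label)
```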

Scoring

The same assessment methodology used in the 2022 n2c2/UW SDoH Challenge17 was employed to score the performance of the mSpERT pipeline on the BC Cancer data set.
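In the spirit of that methodology, a minimal span-level micro precision/recall/F1 computation (exact span and label match) can be sketched as follows. The full challenge scorer also handles arguments and relations, which this sketch omits.

```python
def score(gold, pred):
    """Compute micro precision, recall, and F1 over entity predictions.
    gold, pred: sets of (start, end, label) tuples; a prediction counts as a
    true positive only on an exact span-and-label match."""
    tp = len(gold & pred)                       # exact matches
    p = tp / len(pred) if pred else 0.0         # precision
    r = tp / len(gold) if gold else 0.0         # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean
    return p, r, f1
```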

RESULTS

Figure 2 shows F1 scores for trigger extraction across the three experiments. Note that only experiment 3 made predictions on environmental exposure; conversely, it was also the only experiment that failed to detect race. In summary, experiment 3 exhibited the most robust trigger-detection performance, achieving an overall F1 score of 0.91, followed closely by experiment 2 (0.90). Experiment 1 had the lowest overall F1 score of 0.77, despite achieving precision comparable with the other two experiments.

FIG 2.


F1 scores of trigger extraction.

Table 1 provides an overview of the n2c2 scores, the general scores used in the 2022 SDoH n2c2 challenge to measure overall extraction performance across trigger, argument, and relation detection. On the basis of the n2c2 results of the three tested data structures on the validation set, data structure 3 was selected for the final mSpERT pipeline. The best settings for fine-tuning were found to be Bio ClinicalBERT embeddings, relation filter threshold = 0.4, dropout = 0.1, and epochs = 20. With these fine-tuned hyperparameters, an overall validation F1 score of 0.88 was obtained. The results of SDoH extraction using the Spark NLP pipeline are compared with the fine-tuned mSpERT pipeline's results in Table 2. The F1 scores of the mSpERT pipeline were higher for the majority of entity detections, whereas the Spark NLP pipeline achieved better scores in predicting education, tobacco, and amount. Spark NLP did not predict the indigenous, country of origin, environmental exposure, duration, degree, history, and type entities, and instances of the race factor were too few in the validation data set for either pipeline to predict. One advantage Spark NLP had over mSpERT was its large training corpus: for factors well represented there, such as education, it could still make good predictions even with few test cases. Averaged over all entities, the mSpERT pipeline nevertheless outperformed Spark NLP by 5% in F1 score.

TABLE 1.

Overall Extraction Performance Including Triggers, Arguments, and Relations

Test Case P R F1
Experiment 1 0.87 0.57 0.69
Experiment 2 0.89 0.78 0.83
Experiment 3 0.90 0.79 0.84

TABLE 2.

Entity Extraction Performance Comparing the Two Pipelines

Entities Spark NLP SDoH mSpERT No. of Gold
P R F1 P R F1
Alcohol 0.87 0.95 0.90 0.94 0.91 0.93 115
Country of origin 0.0 0.0 0.0 0.8 0.67 0.73 12
Culture 1.0 0.2 0.33 1.0 0.2 0.33 5
Drug 0.82 0.93 0.88 0.96 0.83 0.89 30
Education 1.0 0.5 0.67 0.0 0.0 0.0 2
Employment 0.93 0.88 0.90 0.92 0.88 0.90 134
Environ expo 0.0 0.0 0.0 0.5 0.5 0.5 2
Health behavior 1.0 0.14 0.25 0.86 0.43 0.57 14
Health service 0.4 0.25 0.31 0.93 0.81 0.87 16
Indigenous 0.0 0.0 0.0 0.0 0.0 0.0 2
Living status 1.0 0.73 0.85 0.96 0.93 0.95 107
Marital status 0.87 0.99 0.93 0.99 0.91 0.95 141
Race 0.0 0.0 0.0 0.0 0.0 0.0 5
Social support 1.0 0.67 0.80 0.89 0.81 0.85 21
Tobacco 1.0 0.94 0.97 0.97 0.87 0.92 141
Frequency 0.65 0.81 0.72 0.85 0.74 0.79 53
Amount 0.98 0.73 0.84 0.77 0.73 0.75 59
Location 0.43 0.45 0.44 0.84 0.76 0.80 89
Overall 0.88 0.76 0.82 0.92 0.84 0.88 948

Abbreviations: NLP, natural language processing; SDOH, social determinants of health.

The results for subtypes differed far more between the pipelines than the results for entities. As shown in Table 3, the average F1 score of mSpERT in subtype prediction was 0.88, whereas that of Spark NLP was 0.55. This large discrepancy was caused partly by the pipelines' different structures and partly by differences in their training data. In general, and especially for the annotations used in this study, mSpERT performed better at SDoH detection with a larger range of details and at relation extraction. The best-performing pipeline over all the experiments, the fine-tuned mSpERT pipeline, was applied to the test data set to evaluate its generalization performance. It extracted 955 SDoH events encompassing a total of 2,912 associated entities, achieving an overall F1 score of 0.86.

TABLE 3.

Labeled Arguments StatusTime Extraction Performance Comparing the Two Pipelines

Labeled Arguments Subtypes Spark NLP SDoH mSpERT No. of Gold
P R F1 P R F1
Status time Current 1.0 0.18 0.31 0.94 0.84 0.89 347
Future 0.0 0.0 2
None 0.29 0.22 0.25 0.93 0.88 0.9 160
Past 1.0 0.49 0.66 0.88 0.77 0.82 87
Overall 0.77 0.37 0.55 0.93 0.84 0.88 596

NOTE. Other labeled arguments are not predictable in Spark NLP.

Abbreviations: NLP, natural language processing; SDOH, social determinants of health.

The fine-tuned mSpERT model was then used to extract 60,717 SDoH factors from the keyword-detected SDoH sections of the 13,258 ONC documents. The statistics of the extracted SDoH are reported in Figure 3. The distribution of predicted events in the ONC documents aligns with the distribution of annotated events in the 1,000 study documents. The most frequently predicted SDoH factors in the ONC documents are tobacco, employment, marital status, alcohol, and living status, each occurring between 8,000 and 12,000 times, whereas less frequent factors such as drug, health service, social support, country of origin, health behavior, environmental exposure, and culture were predicted to appear between roughly 200 and 2,000 times. The least-predicted events are race, education, indigenous, sexual orientation, and gender identity.

FIG 3.


The SDoH predictions of events on the SDoH-sections documents using the fine-tuned SDoH mSpERT pipeline. SDOH, social determinants of health.

DISCUSSION

Given the complexity and extensive nature of extracting SDoH information, artificial intelligence (AI) has become the primary method for efficient large-scale information extraction. Although numerous machine learning models have been explored for this purpose, they often have limitations, such as focusing on a limited selection of SDoH factors and lacking comprehensive national coverage. Additionally, these models may not capture detailed information critical for understanding socioeconomic status and changes in patients.

In this study, we addressed these research deficiencies by compiling a detailed list of SDoH factors within the context of Canada, which was used to annotate documents from BC Cancer. For each SDoH factor, a detailed set of arguments was incorporated to provide more insights for downstream research. Our annotation guidelines offer a comprehensive framework for North American health care institutions to annotate their training data sets with consistency and detail. Adhering to these guidelines enhances the likelihood of achieving superior results in the extraction of SDoH factors using our pipeline. Although the data set used for our training was relatively small, its high-quality annotations made it well-suited for training and evaluating the extraction pipeline.

Additionally, although we used data exclusively from BC Cancer, the structure of oncologists' consultation reports is generally consistent across North America. Most include a narrative section under the heading Social History where SDoH information was documented. However, we acknowledge that the generalizability of our method to other data sets or health care settings may be a potential limitation. Given the lack of standardization in health care documentation, applying this model to other data sets will require users to follow the annotation guidelines outlined in this study and make minor adjustments to align with their specific data.

Extracting SDoH information enhances clinical care by uncovering nonmedical factors that can significantly influence diagnosis, treatment, and patient outcomes. Incorporating SDoH data helps health care providers personalize care by addressing factors like food insecurity, transportation issues, and health literacy, which may influence treatment adherence and follow-ups.22 Early recognition of these factors facilitates timely targeted interventions, such as connecting patients to social services, adjusting treatment plans, or offering telehealth options.23 Moreover, identifying trends in SDoH can promote health equity by directing targeted resources to vulnerable populations. Ultimately, understanding SDoH data leads to more effective, patient-centered care and improved health outcomes.

In conclusion, rather than aiming to develop a one-size-fits-all pipeline for SDoH extraction across all health care institutions, our work is intended to demonstrate that AI models are more capable than often assumed when it comes to extracting diverse types and detailed SDoH information from medical documents.

ACKNOWLEDGMENT

Z.C. would like to thank John Snow Labs for providing an educational license for use in this study.


SUPPORT

Supported in part by the BC Cancer Foundation.

AUTHOR CONTRIBUTIONS

Conception and design: Rasika Rajapakshe

Collection and assembly of data: Rasika Rajapakshe

Data analysis and interpretation: All authors

Manuscript writing: All authors

Final approval of manuscript: All authors

Accountable for all aspects of the work: All authors

AUTHORS' DISCLOSURES OF POTENTIAL CONFLICTS OF INTEREST

The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/cci/author-center.

Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians (Open Payments).

Zhenghua Chen

Employment: BC Cancer Agency

Research Funding: BC Cancer Agency

Rasika Rajapakshe

Research Funding: Kheiron Medical Technologies (Inst)

No other potential conflicts of interest were reported.

REFERENCES


Articles from JCO Clinical Cancer Informatics are provided here courtesy of Wolters Kluwer Health
