MIMIC-SBDH: A Dataset for Social and Behavioral Determinants of Health

Hiba Ahsan; Emmie Ohnuki; Avijit Mitra; Hong Yu

Author manuscript; available in PMC: 2022 Jan 6.

Published in final edited form as: Proc Mach Learn Res. 2021 Aug;149:391–413.

PMCID: PMC8734043 NIHMSID: NIHMS1767978 PMID: 35005628

Abstract

Social and Behavioral Determinants of Health (SBDHs) are environmental and behavioral factors that have a profound impact on health and related outcomes. Given their importance, physicians document SBDHs of their patients in Electronic Health Records (EHRs). However, SBDHs are mostly documented in unstructured EHR notes. Determining the status of the SBDHs requires manually reviewing the notes, which can be a tedious process. Therefore, there is a need to automate the identification of patients’ SBDH status in EHR notes. In this work, we created MIMIC-SBDH¹, the first publicly available dataset of EHR notes annotated for patients’ SBDH status. Specifically, we annotated 7,025 discharge summary notes for the status of 7 SBDHs and marked SBDH-related keywords. Using this annotated data for training and evaluation, we evaluated the performance of three machine learning models (Random Forest, XGBoost, and Bio-ClinicalBERT) on the task of identifying SBDH status in EHR notes. The performance ranged from the lowest 0.69 F1 score for Drug Use to the highest 0.96 F1 score for Community-Present. In addition to standard evaluation metrics such as the F1 score, we used the CheckList tool (Ribeiro et al., 2020) to evaluate four capabilities that a model must possess to perform well on the task. The results revealed several shortcomings of the models and highlighted the need to perform more capability-centric evaluations in addition to standard metric comparisons.

1. Introduction

Social determinants of health (SDOH) are “the conditions in which people are born, grow, live, work and age,” which are “shaped by the distribution of money, power and resources” (WHO, 2008). They include factors such as socio-economic status, education, neighborhood and physical environment, employment, social support networks, and access to health care. Behavioral determinants of health include tobacco use, drug use, alcohol consumption, physical activity and diet. Together, social and behavioral determinants of health (SBDHs) are environmental and behavioral factors that impact health in significant ways.

Several studies have shown the impact of SBDHs on health outcomes. Nijhawan et al., 2019 showed that variables derived from SBDHs, such as substance use and access to food, contributed to the 30-day readmission prediction task. Takahashi et al., 2015 concluded that education level, unemployment status, and alcohol use significantly impacted the risk of hospitalization. Zheng et al., 2020 showed that substance abuse, low income, and unemployment could be used to identify patients at high risk of suicide attempt.

Given their importance, physicians document information about their patients’ SBDHs in Electronic Health Records (EHRs). Information about SBDHs is useful to both researchers and clinicians. Researchers study how SBDHs impact health outcomes whereas clinicians can identify the social and behavioral risks of their patients and consequently provide tailored care such as counselling or therapy (Haas et al., 2015) and social service (Hamilton et al., 2012). The challenge for both clinicians and researchers is that SBDHs are mostly recorded in unstructured EHR notes. A study has shown that EHR notes contain 90 times more information about SBDHs than structured EHR data (Dorr et al., 2019). However, there is no globally accepted format to consistently record SBDHs in EHR notes and manually reviewing the notes for SBDHs is a time-intensive process. Therefore, there is a need to automate identifying the patient’s SBDH status in EHR notes. Although natural language processing (NLP) approaches have been developed to automatically identify the status of several SBDHs in EHR notes (Gundlapalli et al., 2013; Alzoubi et al., 2018; Feller et al., 2020), the notes and associated annotations used to develop the approaches are not publicly available.

In this work, we release MIMIC-SBDH, the first publicly available dataset of EHR notes annotated for patients’ SBDH status. For this, we annotated 7,025 discharge summaries randomly selected from the MIMIC III (Johnson et al., 2016) dataset for the following SBDHs: Community, Economics, Education, Environment, Alcohol Use, Tobacco Use, and Drug Use. In addition, we marked SBDH-related keywords to better understand the language used to discuss them. We treated the task of identifying an SBDH’s status as a classification task and studied three baseline models for each SBDH: two tree-based approaches (Random Forest Classifier and XGBoost Classifier) and a neural network (Bio-ClinicalBERT). While we evaluated the models in terms of macro-F1, this statistic alone is not enough to understand whether the models possess certain desirable capabilities or where they fail. We therefore used the CheckList tool (Ribeiro et al., 2020) to perform behavioral testing of the models for the following capabilities: (1) Negation, (2) Attribution, (3) Historical Phrases, and (4) Robustness to Misspellings.

Our contributions are three-fold: (1) we created a publicly available dataset of EHR notes with annotations for the status of 7 SBDHs; (2) we evaluated three baseline machine learning models on the task of identifying an SBDH’s status in EHR notes; and (3) we presented four capabilities that a model must possess to perform well on the task and performed behavioral testing of the models for these capabilities. The results revealed several shortcomings of the models and highlighted the need to perform more capability-centric evaluations in addition to standard metric comparisons.

Generalizable Insights about Machine Learning in the Context of Healthcare

By publicly releasing our annotated dataset, we hope to create a benchmark dataset to promote research in SBDHs. An accurate identification of the SBDHs’ status can catalyze studies about how these determinants impact health outcomes. We present four capabilities that a model must possess to do well on the task of predicting an SBDH’s status using EHR notes. We believe these capabilities will be helpful when designing a machine learning model for the task, regardless of which medical institution the data is sourced from. We go beyond using standard metrics to evaluate models by performing behavioral testing. The results provide us with actionable insights about a model’s shortcomings, and we hope that this observation will encourage researchers to incorporate more capability-centric evaluations in their experiments.

2. Related Work

Past works attempting to identify an SBDH’s status in EHR notes have predominantly focused on detecting a single SBDH using various techniques. As part of the i2b2 Smoking Challenge (Uzuner et al., 2008), an EHR was classified into one of five categories: Current smoker, Past smoker, Past or Current smoker, Non-smoker or Unknown smoker. Wicentowski and Sydes, 2008 used a rule-based model comprising rules based on lexical and syntactic properties to solve the classification task. Carrero et al., 2006 experimented with supervised approaches such as Decision Trees and Naive Bayes. Pedersen, 2006 used supervised methods such as Support Vector Machines, Decision Trees and Naive Bayes, as well as unsupervised approaches such as k-means clustering. Aramaki et al., 2006 experimented with breaking down the document into sentences and then combining sentence-based classification results to get the overall result. Jonnagaddala et al., 2015 trained a linear SVM to perform multiclass classification. They used a feature set comprising traditional features such as unigrams and bigrams as well as topics generated using Latent Dirichlet Allocation (LDA) and Gibbs Sampling.

Similar approaches have been used to detect alcohol consumption. Alzoubi et al., 2018 adopted a multi-stage process. They identified alcohol-related sentences using a rule-based approach, performed supervised classification of the sentences into Current drinker, Past drinker, Non-drinker and Unknown, and combined the sentence-level classifications into a document-level classification based on a set of rules. Topaz et al., 2019 used neural embeddings such as word2vec (Mikolov et al., 2013) and phrase2vec (Wu et al., 2020) as features and Random Forest as a classifier to identify alcohol abuse in MIMIC II notes. Work has also been done on detecting homelessness using EHRs. Gundlapalli et al., 2013 investigated detecting homelessness among US Veterans in clinical text using an NLP tool developed by the US Department of Veterans Affairs.

Earlier studies to predict the status of multiple SBDHs have employed various techniques. Yetisgen and Vanderwende, 2017 and Lybarger et al., 2018 looked into automatically detecting substance abuse in clinical notes using multi-task learning. Feller et al., 2020 inferred the presence of 11 SBDH categories including alcohol abuse, homelessness, and sexual orientation. They used structured as well as unstructured data and adopted a two-stage process of first identifying notes with related discussions through binary classification, followed by running another set of classifiers to predict the SBDH status.

In our work, in addition to tree-based approaches, we explore the application of transfer learning by fine-tuning Bio-ClinicalBERT for the downstream task of identifying an SBDH’s status. Previous work evaluates models in terms of the F1 metric. We additionally evaluate models by performing behavioral testing to understand whether the models possess certain capabilities that are useful for the task.

3. MIMIC-SBDH

In this section, we describe the data and the SBDHs, the annotation guidelines and the characteristics of the annotated dataset.

3.1. EHR Data

To create our dataset, we used intensive care unit (ICU) patient notes of type ‘Discharge summary’ from the MIMIC III dataset. We considered discharge summaries as they contain the most comprehensive clinical information, including information about a patient’s social history. We extracted the social history section from each discharge summary using MedSpaCy’s clinical sectionizer, which performs pattern-based section extraction. We specifically extracted sections with the header “social history” and randomly selected 7,025 discharge summaries that contained this section to annotate. Two individuals annotated the sections under the supervision of a senior physician. In case of an annotation disagreement, the physician was consulted and their decision prevailed. An inter-annotator agreement (κ coefficient) of 0.898 was observed across all SBDHs on a subset of 1,000 summaries.
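To make the agreement statistic concrete, the following is a minimal sketch of Cohen's κ for two annotators. The text above does not say which κ variant was used or how agreement was aggregated across SBDHs, so both the choice of Cohen's κ and the toy labels are assumptions made purely for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy Alcohol Use labels (illustrative only, not taken from the dataset).
annotator_1 = ["Present", "Never", "Past", "None", "Never", "Present", "None", "Never"]
annotator_2 = ["Present", "Never", "Never", "None", "Never", "Present", "None", "Past"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

With eight toy items and two disagreements, raw agreement is 0.75 but κ is lower because part of that agreement is expected by chance.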

3.2. Social and Behavioral Determinants of Health

We annotated the discharge summaries based on the following SDOHs provided by the Kaiser Family Foundation (KFF) (Artiga and Hinton, 2019): (1) Community (2) Economics (3) Education (4) Environment (5) Food and (6) Healthcare. In addition to the above SDOHs, we annotated for any relevance to the following substances to capture behavioral determinants of health: (1) Alcohol Use (2) Tobacco Use and (3) Drug Use.

3.3. Annotation Guidelines

We performed two annotation tasks: (1) We labeled the discharge summaries for each SBDH based on the extracted social history section (2) We marked keywords relevant to the SBDH in the extracted section. Passages related to social support such as a family member or friend were considered relevant to the SBDH Community. Since it was possible for a patient to have active social support and to have lost social support (due to death, separation or divorce) simultaneously, we split Community further to Community-Present and Community-Absent. A discharge summary was annotated True for Community-Present if the discharge summary had passages related to active social support and False if such passages were not found in the discharge summary. A discharge summary was annotated True for Community-Absent if the discharge summary had passages related to the loss of social support and False if such passages were not found in the discharge summary. Similarly, a discharge summary was annotated as True for the SBDH Education if there was any passage related to the patient’s education such as schooling, college or degree attained and False if there was no related passage.

A discharge summary was annotated as True for the SBDH Economics if the patient was currently employed, False if the patient was unemployed (including retirement) and None if there was no related passage. A discharge summary was annotated as True for the SBDH Environment if there was any indication of housing, False if the patient was homeless and None if there was no related passage.

For the behavioral factor Alcohol Use, a discharge summary was annotated as Present if the patient was a current consumer of alcohol, Past if the patient was a consumer in the past and had quit, Never if the patient had never consumed alcohol, Unsure if the discharge summary had an ambiguous passage related to alcohol consumption and None if there was no related text. The same rules applied to Tobacco Use and Drug Use. We did not find any passage relevant to the SBDHs Food and Healthcare in the discharge summaries and hence dropped the two categories. Table 1 shows examples of annotations from the dataset. Examples of each class for every SBDH and more annotation remarks are in Appendix A.1.
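The label sets described in this section can be summarized as a small schema. The sketch below encodes them as a Python dictionary with a validation helper; the string label names follow the guidelines above, but how labels are actually stored in the released dataset is not stated here, so the representation and the `validate_annotation` helper are hypothetical.

```python
# Allowed classes per SBDH, as described in the annotation guidelines.
SUBSTANCE_CLASSES = ("Present", "Past", "Never", "Unsure", "None")

LABEL_SCHEMA = {
    "Community-Present": ("True", "False"),
    "Community-Absent": ("True", "False"),
    "Education": ("True", "False"),
    "Economics": ("True", "False", "None"),
    "Environment": ("True", "False", "None"),
    "Alcohol Use": SUBSTANCE_CLASSES,
    "Tobacco Use": SUBSTANCE_CLASSES,
    "Drug Use": SUBSTANCE_CLASSES,
}


def validate_annotation(annotation):
    """Return a list of problems; an empty list means the annotation is valid."""
    problems = []
    for sbdh, allowed in LABEL_SCHEMA.items():
        if sbdh not in annotation:
            problems.append(f"missing label for {sbdh}")
        elif annotation[sbdh] not in allowed:
            problems.append(f"{sbdh}: {annotation[sbdh]!r} not in {allowed}")
    for sbdh in annotation:
        if sbdh not in LABEL_SCHEMA:
            problems.append(f"unknown SBDH {sbdh!r}")
    return problems


# First example row of Table 1:
# "married never smoked 2 glasses wine/day no drug history"
example = {
    "Community-Present": "True",
    "Community-Absent": "False",
    "Education": "False",
    "Economics": "None",
    "Environment": "None",
    "Alcohol Use": "Present",
    "Tobacco Use": "Never",
    "Drug Use": "Never",
}

print(validate_annotation(example))
```

Keeping Community-Present and Community-Absent as separate binary labels mirrors the guideline that a patient can have active and lost social support at the same time.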

Table 1: Examples of annotations from the dataset. The colored text indicates SBDH-related keywords. CP: Community-Present, CA: Community-Absent, EC: Economics, ED: Education, EN: Environment, AU: Alcohol Use, TU: Tobacco Use, DU: Drug Use.

Social History	CP	CA	ED	EC	EN	AU	TU	DU
married never smoked 2 glasses wine/day no drug history	True	False	False	None	None	Present	Never	Never
Past smoker, stopped >1year ago. He is retired.	False	False	False	False	None	None	Past	None
Pt is homeless. Denies IVDA. Smokes 2 packs a day.	False	False	False	None	False	None	Present	Absent

SBDH	True	False	None
Community-Present	4463	2562	N/A
Community-Absent	784	6241	N/A
Education	210	6815	N/A
Economics	988	1742	4295
Environment	4357	63	2605

SBDH	Present	Past	Never	Unsure	None
Alcohol Use	2077	515	2444	332	1657
Tobacco Use	1006	2121	2252	355	1291
Drug Use	207	221	2136	144	4317
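Each row in the two tables above sums to 7,025, the number of annotated summaries, so the values read as per-class counts. As a rough illustration of how skewed some label distributions are, the sketch below computes class shares and a simple largest-to-smallest ratio for two rows; the helper names are hypothetical and the ratio is only one of several possible imbalance measures.

```python
# Counts copied from the tables above.
ENVIRONMENT_COUNTS = {"True": 4357, "False": 63, "None": 2605}
DRUG_USE_COUNTS = {"Present": 207, "Past": 221, "Never": 2136, "Unsure": 144, "None": 4317}


def class_shares(counts):
    """Fraction of all summaries that fall into each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def imbalance_ratio(counts):
    """Size of the largest class divided by the size of the smallest class."""
    return max(counts.values()) / min(counts.values())


for name, counts in [("Environment", ENVIRONMENT_COUNTS), ("Drug Use", DRUG_USE_COUNTS)]:
    shares = class_shares(counts)
    summary = ", ".join(f"{label}: {share:.1%}" for label, share in shares.items())
    print(f"{name}: {summary} (ratio {imbalance_ratio(counts):.1f})")
```

For Environment, the False class (homeless patients) accounts for fewer than 1% of the summaries.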

SBDH	RF	XGBoost	Bio-ClinicalBERT
Community-Present	0.9325 ± 0.0085	0.9314 ± 0.0034	0.9578 ± 0.0027*
Community-Absent	0.8686 ± 0.0199	0.8631 ± 0.0180	0.9068 ± 0.0206*
Education	0.8249 ± 0.0323	0.8242 ± 0.0231	0.8386 ± 0.0424
Economics	0.8886 ± 0.0143	0.8879 ± 0.0154	0.8964 ± 0.0073
Environment	0.9069 ± 0.0284*	0.8881 ± 0.0340	0.7969 ± 0.0703
Alcohol Use	0.7071 ± 0.0173	0.7348 ± 0.0148*	0.7049 ± 0.022
Tobacco Use	0.7368 ± 0.0176	0.7399 ± 0.0195	0.7450 ± 0.0222
Drug Use	0.6697 ± 0.0075	0.6942 ± 0.0195*	0.5812 ± 0.0501
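The introduction names macro-F1 as the evaluation metric. As a reminder of what that statistic measures, here is a minimal pure-Python sketch that computes per-class F1 and their unweighted mean; the toy labels are invented for illustration and the function names are not from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy Alcohol Use labels (illustrative only).
y_true = ["Present", "Past", "Never", "Never", "None", "Unsure"]
y_pred = ["Present", "Never", "Never", "Never", "None", "None"]

print(per_class_f1(y_true, y_pred))
print(f"macro-F1 = {macro_f1(y_true, y_pred):.4f}")
```

Because every class contributes equally to the mean, a model that misses a small class entirely (here Past and Unsure) is penalized heavily even when its overall accuracy looks acceptable.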

SBDH	Test	RF failure rate	XGBoost failure rate	Bio-ClinicalBERT failure rate	Example test case, label
Community-Present	N	40.5	27.2	44.5	Husband died recently. False
	M	1.9	3.8	1.9	Patient ilves with her grandson. True
Community-Absent	N	11.0	14.5	13.3	Patient lost her child in an accident. True
	M	2.5	1.7	2.5	Patient is separtaed from his wife. True
Education	A	83.7	73.5	70.9	Daughter is a medical student. False
	M	6.5	7.7	13.1	Son is a honors high school studnet. False
Economics	N	11.6	15.2	8.0	Former ICU nurse. False
	A	76.5	71.5	9.0	He is not working. Daughter-in-law works as chemist. False
	M	11.2	11.2	12.2	He is a retierd dentist. False
Environment	N	24.7	15.6	1.3	Intermittently lives with friends but currently homeless. False
	A	31.2	11.7	1.3	Homeless last 3 months. His family lives nearby. False
	M	9.1	10.4	7.8	Homeelss, searching for apt. False
Alcohol Use	N	28.0	11.1	27.1	Negative for alcohol use. Never
	A	48.0	53.5	43.0	Negative for ETOH. Son has history of alcohol abuse. Never
	HP	68.3	68.3	77.8	Was sober for a few months, but started drinking again. Present
	M	20.0	22.0	18.0	He dneies any history of alcohol intake. Never
Tobacco Use	N	10.0	11.7	0.6	Cigs: none ETOH: none. Never
	A	27.5	39.0	17.5	Denied smoking, EtOH or drugs. Wife is a smoker. Never
	HP	73.9	49.7	46.4	Denies recent tobacco (history of abuse but none in years) Past
	M	8.5	19.0	12.4	Unclear when he started msoking again. Never
Drug Use	N	0.6	0.6	4.3	Negative for smoking and drug. Never
	A	87.0	88.0	6.0	Denied smoking, EtOH or drugs. Wife uses marijuana. Never
	HP	58.8	68.6	98.7	Had stopped opiates. Started again 1 month ago. Present
	M	9.2	17.0	3.3	Hx of IVDU abuse ubt quit. Past
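The misspelling examples in the table above all look like swaps of two adjacent letters. The sketch below shows one way such a perturbation and a failure rate could be computed. It does not use the CheckList library; `keyword_model` is a deliberately naive, hypothetical stand-in for a classifier, and reporting the failure rate as a percentage is an assumption about the table.

```python
import random


def swap_adjacent_chars(text, rng):
    """Introduce one typo by swapping a random pair of adjacent, distinct letters."""
    candidates = [
        i for i in range(len(text) - 1)
        if text[i].isalpha() and text[i + 1].isalpha() and text[i] != text[i + 1]
    ]
    if not candidates:
        return text
    i = rng.choice(candidates)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def failure_rate(predict, test_cases):
    """Percentage of (text, expected_label) pairs that the model gets wrong."""
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    failures = sum(predict(text) != expected for text, expected in test_cases)
    return 100.0 * failures / len(test_cases)


def keyword_model(text):
    """Hypothetical Environment classifier based on exact keyword matches."""
    lowered = text.lower()
    if "homeless" in lowered:
        return "False"
    if "lives" in lowered:
        return "True"
    return "None"


# Environment test cases taken from the table above (text, expected label).
ENVIRONMENT_CASES = [
    ("Intermittently lives with friends but currently homeless.", "False"),
    ("Homeless last 3 months. His family lives nearby.", "False"),
    ("Homeelss, searching for apt.", "False"),
]

print(swap_adjacent_chars("Patient is homeless.", random.Random(0)))
print(f"failure rate: {failure_rate(keyword_model, ENVIRONMENT_CASES):.1f}")
```

The exact-match model handles the first two cases but fails on the misspelled "Homeelss", which is the kind of brittleness a misspelling test is meant to expose.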

SBDH	RF	XGBoost	Bio-ClinicalBERT
Community-Present	0.9245 ± 0.0073	0.9329 ± 0.0067	0.9568 ± 0.0039
Community-Absent	0.8775 ± 0.022	0.8657 ± 0.0173	0.9186 ± 0.019
Education	0.8399 ± 0.0322	0.8266 ± 0.0211	0.8201 ± 0.0214
Economics	0.8904 ± 0.0191	0.8876 ± 0.0171	0.8941 ± 0.0157
Environment	0.8861 ± 0.0384	0.8881 ± 0.034	0.6386 ± 0.0034
Alcohol Use	0.6694 ± 0.022	0.6954 ± 0.0187	0.6275 ± 0.0345
Tobacco Use	0.6726 ± 0.0198	0.6438 ± 0.015	0.6787 ± 0.0232
Drug Use	0.6187 ± 0.0302	0.6606 ± 0.0094	0.4828 ± 0.0407

SBDH	Class	Discussion and Examples
Community-Present	True	Presence of social support. Examples: “…Lives with his wife.”
Community-Present	False	No passages about presence of social support. Examples: “No tobacco, ethanol or drug use.”
Community-Absent	True	Lack/loss of social support. Examples: “Widowed with three children…”.
Community-Absent	False	No passages about loss/lack of social support. Examples: “Former smoker, no EtOH.”
Education	True	Passages about patient’s education. Examples: “…finished law school…”
Education	False	No passages about patient’s education. Examples: “Former smoker, no EtOH. Lives with his wife.”
Economics	True	Patient is employed. Examples: “…,a technician at…”
	False	Not currently employed (including retirement). Examples: “Retired teacher…” “Formerly worked in…not working currently…”
	None	No passages about patient’s employment. Examples: “Former smoker, no EtOH. Lives with his wife.”
Environment	True	Indication of housing. Examples: “…, lives at home with mother… ”
	False	Lack of housing. Examples: “Homeless… ”
	None	No passages about patient’s housing. Examples: “Former smoker, no EtOH”.
Alcohol Use	Present	Patient is a current consumer of alcohol. Examples: “…,rare ETOH use…” “glass of wine everyday.”
	Past	Past consumer of alcohol. Examples: “…,no alcohol use since…” “quit alcohol [de-ID] weeks ago…”
	Never	Has never consumed alcohol. Examples: “…,Alcohol: none…” “…denies alcohol.. ”
	Unsure	Ambiguous passages about patient’s consumption. Examples: “No ETOH abuse.” “…questionable history of ETOH use…”
	None	No passages about patient’s alcohol consumption. Examples: “Retired. Lives with his wife.”
Tobacco Use	Present	Patient is a current consumer of tobacco. Examples: “…,she is still smoking…” “smokes a pack per day… ”
	Past	Past consumer of tobacco. Examples: “…,quit smoking 20 years ago…” “former smoker… ”
	Never	Has never consumed tobacco. Examples: “…,no tobacco use…” “…denies tobacco..”
	Unsure	Ambiguous passages about patient’s consumption. Examples: “…[de-ID] tobacco…” “…denied tobacco abuse…”
	None	No passages about patient’s tobacco consumption. Examples: “Retired. Lives with his wife.”
Drug Use	Present	Patient is a current consumer of a drug. Examples: “…,positive for cocaine…” “intravenous drug abuse…”
	Past	Past consumer and does not consume any drug anymore. Examples: “…,minimal marijuana years ago… ” “remote cocaine use in the past…”
	Never	Has never consumed a drug. Examples: “…,no illicit drug use…” “…denies recreation drug use..”
	Unsure	Ambiguous passages about patient’s drug consumption. Examples: “…no cocaine abuse…” “…no IV drug abuse…”
	None	No passages about patient’s drug consumption. Examples: “Retired. Lives with his wife.”

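Class (Community-Present)	RF (with oversampling)	XGBoost (with oversampling)	Bio-ClinicalBERT (with oversampling)	RF (without oversampling)	XGBoost (without oversampling)	Bio-ClinicalBERT (without oversampling)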
True	0.9513 ± 0.0058	0.9489 ± 0.0016	0.9688 ± 0.0011	0.9464 ± 0.0052	0.9504 ± 0.004	0.9681 ± 0.0019
False	0.9137 ± 0.0118	0.9138 ± 0.0066	0.9468 ± 0.0046	0.9026 ± 0.0104	0.9153 ± 0.01	0.9456 ± 0.0061

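Class (Community-Absent)	RF (with oversampling)	XGBoost (with oversampling)	Bio-ClinicalBERT (with oversampling)	RF (without oversampling)	XGBoost (without oversampling)	Bio-ClinicalBERT (without oversampling)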
True	0.7634 ± 0.0357	0.754 ± 0.033	0.8343 ± 0.0374	0.7779 ± 0.0406	0.7569 ± 0.0327	0.8536 ± 0.0337
False	0.9738 ± 0.0045	0.9722 ± 0.0036	0.9792 ± 0.0041	0.977 ± 0.004	0.9746 ± 0.0026	0.9834 ± 0.0045

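Class (Education)	RF (with oversampling)	XGBoost (with oversampling)	Bio-ClinicalBERT (with oversampling)	RF (without oversampling)	XGBoost (without oversampling)	Bio-ClinicalBERT (without oversampling)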
True	0.6601 ± 0.0657	0.658 ± 0.0476	0.6861 ± 0.0841	0.6878 ± 0.0657	0.662 ± 0.0406	0.6486 ± 0.0434
False	0.9898 ± 0.0019	0.9905 ± 0.0021	0.991 ± 0.0025	0.9919 ± 0.0019	0.9913 ± 0.0034	0.9915 ± 0.0028

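Class (Economics)	RF (with oversampling)	XGBoost (with oversampling)	Bio-ClinicalBERT (with oversampling)	RF (without oversampling)	XGBoost (without oversampling)	Bio-ClinicalBERT (without oversampling)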
True	0.8081 ± 0.0245	0.8064 ± 0.0286	0.8166 ± 0.0124	0.8124 ± 0.0343	0.8117 ± 0.0312	0.8134 ± 0.0307
False	0.9036 ± 0.0155	0.9006 ± 0.0181	0.9208 ± 0.0142	0.9032 ± 0.0192	0.8971 ± 0.0177	0.9134 ± 0.0171
None	0.9541 ± 0.0068	0.9568 ± 0.0045	0.9519 ± 0.0066	0.9556 ± 0.0069	0.9541 ± 0.0055	0.9555 ± 0.0044

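Class (Environment)	RF (with oversampling)	XGBoost (with oversampling)	Bio-ClinicalBERT (with oversampling)	RF (without oversampling)	XGBoost (without oversampling)	Bio-ClinicalBERT (without oversampling)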
True	0.9734 ± 0.0026	0.969 ± 0.0036	0.9597 ± 0.0037	0.9732 ± 0.0034	0.9708 ± 0.0027	0.969 ± 0.0031
False	0.7883 ± 0.0831	0.7421 ± 0.0943	0.496 ± 0.2014	0.7258 ± 0.1112	0.7948 ± 0.0774	0.0 ± 0.0
None	0.959 ± 0.0042	0.9531 ± 0.005	0.9351 ± 0.0091	0.9592 ± 0.0042	0.955 ± 0.004	0.9468 ± 0.0078
