Abstract
Cancer caregivers are often informal family members who may not be prepared to adequately meet the needs of patients and often experience high stress along with significant physical, emotional, and financial burdens. Accurate prediction of a caregiver’s burden level is highly valuable for early intervention and support. In this study, we used several machine learning approaches to build prediction models from the National Alliance for Caregiving/AARP dataset. We performed data cleansing and imputation on the raw data to obtain a working dataset of cancer caregivers. We then used a series of feature selection methods to identify predictive risk factors for burden level. Using supervised machine learning classifiers, we achieved reasonably good prediction performance (Accuracy ∼ 0.94; AUC ∼ 0.97; F1 ∼ 0.93). We identified a small set of 15 features that are strong predictors of burden and can be used to build Clinical Decision Support Systems.
Introduction
Informal caregivers provide substantial practical and emotional support for people living with cancer and other chronic diseases. The National Alliance for Caregiving (NAC) and AARP report in 2020 reveals that there are 53 million Americans providing care [1]. Informal caregivers are individuals who provide unpaid care to a spouse, family member, or friend who has been diagnosed with cancer and needs assistance with daily activities. Informal caregivers now encompass more than one in five Americans. The report also reveals that family caregivers are in worse health compared to five years ago. Due to changes in the healthcare system and an increasingly older population, the burden of cancer caregiving is increasingly shifting to informal caregivers. Current evidence suggests that cancer caregiving is intense, episodic, and has a profound impact on the caregiver’s well-being and quality of life [2, 3].
Previous studies have outlined the negative impacts associated with being a caregiver, including depression [4], burden [5], social isolation [6], loss of self‐identity [7], sleep deprivation [8], financial burden [9] and significant changes to their lives [10]. The role they take on in caring for the person with cancer is extensive, demanding, and often without training or resources. Often the demands become so burdensome that the psychological and physiological well-being of the caregiver declines, leading to negative social consequences, including disruption of their routines, sense of interpersonal loss, taking time off from work, or leaving work altogether. Caregiver burden is a measure of the perceived impact on the emotional and physical health, social life, and financial status of caregivers resulting from their responsibilities in providing care to patients [20].
Given the complexity of the above issues, this paper tackles three important questions: 1) Who are cancer caregivers, and what can we know about them? 2) Can we identify those at the highest risk for deteriorating health and illness? 3) Can we predict who among the caregivers will have the highest burden? The caregiver's burden level refers to the physical, emotional, and financial strain experienced by individuals who provide care to cancer patients.
It is evident from the literature that cancer caregivers need assistance, help, and, most likely, interventions. However, first, we must identify who is more vulnerable and experiences a high burden. How can we easily assess that? Once identified, the caregivers can then be provided some assistance. Our focus here is not on caregiver interventions [11]. Rather we are focused on knowing and predicting who needs those interventions. Hence, this study aims to identify and characterize the level of burden cancer caregivers endure by applying machine learning (ML) approaches, which have yet to be used in previous studies, to derive new insights and identify patterns that may contribute to reducing the caregivers’ burden on an individual level.
Background and Related Work
Caring for cancer patients can have a significant impact on the quality of life of informal caregivers, making it essential to provide support for them in their daily lives. These caregivers often experience physical, psychological, and financial stress, as well as obstacles in managing their own day-to-day activities while assisting the patients in managing their illnesses. To determine the degree to which the well-being of cancer caregivers should be improved, it is necessary to understand the extent to which they are affected by their caregiving responsibilities and the burden they bear in supporting cancer patients within their current environment.
This study aims to examine secondary data from existing studies (jointly conducted by the National Alliance for Caregiving and AARP Public Policy Institute) to analyze the roles, needs, and burdens of cancer caregivers and improve the well-being of those who dedicate most of their time and resources to support their cancer patients at home [1]. The study will utilize novel ML approaches to identify and characterize the burden experienced by informal cancer caregivers, with the goal of deriving new insights and patterns that can reduce the burden at an individual level. Various ML algorithms will be applied to predict significant factors that contribute to the burden experienced by informal caregivers. By identifying these factors, healthcare providers can develop targeted interventions to alleviate the burden and improve the well-being of informal cancer caregivers.
According to a 2015 AARP report published by the National Alliance for Caregiving (NAC) and the AARP Public Policy Institute, cancer caregivers come from diverse backgrounds and have varied characteristics. The study found that most cancer caregivers provide care to a relative who is 65 years or older. On average, cancer caregivers spend 32.9 hours per week caring for their loved ones, with one-third providing 41 or more hours of care per week. Cancer caregivers also assist patients with both Activities of Daily Living (ADLs) such as bathing, eating, toileting, and Instrumental Activities of Daily Living (IADLs) such as shopping, driving, managing finances, and medical/nursing tasks. The research also shows that one out of six employees in the US spends more than 20 hours a week helping close friends and relatives in need of care. The caregiver may look after a sick spouse, drive an older person to doctor's appointments, or help a child with a school-related difficulty that requires taking time away from the office. Approximately 66 million Americans provided unpaid care as informal caregivers to their beloved relatives in the household [16], which is equivalent to 3 in 10 US households, indicating that at least 30% of the household population is involved in cancer-related support to serve the lives of cancer patients. This underscores the importance of further research into caregivers' propensity to burnout, not only for their well-being but also for the health of the patients they care for.
Kent et al. examined the clinical priorities of informal cancer caregiving [17]. The authors found that informal or family caregivers are a critical source of care for cancer patients in the United States. However, the authors concluded that the population of caregivers, their tasks, psychosocial needs, and health outcomes are still not well understood, despite the crucial roles they play in society. In essence, the authors assert that ongoing research aimed at improving the well-being of caregivers is a significant step toward enhancing the overall well-being of cancer patients.
According to Glajchen, including caregivers in the routine assessment of cancer patients is crucial due to their vital role in the patient's life [18]. In this research, the author reported discovering new dimensions regarding the challenges of caregiving. For instance, immigrants have lower rates of service use after hospitalization, possibly due to language barriers that make communication difficult and limit their ability to receive the attention they need during moments of hardship. Additionally, cultural differences between professionals and patients have been identified as a factor that may lead to symptom underestimation and exacerbate the severity of cancer.
Chappell and Reid examined the overall quality of life of caregivers using a path model that conceptualized burden as distinct from well-being [19]. They found that several factors, including perceived social support, burden, self-esteem, and hours of informal care, can directly affect well-being. In contrast, they also found that behavioral problems can affect burden. These findings led to an interesting discussion on the lack of evidence regarding the causality of wellbeing and burden, suggesting that the well-being of caregivers can be improved regardless of the burden of the work. What matters most is the improvement of their lifestyle. We agree with this conclusion in the sense that there is no uniform way of increasing the well-being of caregivers. Instead, a combination of multiple approaches and mechanisms that depend on prevailing situations is needed.
In a recent publication, Antoniadi et al. conducted a systematic review of predicting caregiver burden for amyotrophic lateral sclerosis (ALS) patients and identified related features using ML techniques to build a clinical decision support system (CDSS) that predicts caregiver burden, which can facilitate efficient resource management [14]. The research assessed the effect of demographic and socioeconomic information, quality of life, anxiety, and depression questionnaires for patients and caregivers. One of the major focuses of the study was to determine risk factors for caregiver burden by classifying caregivers into high or low burden groups using ML. The authors used the random forest algorithm to model the most important predictive features, which were the caregivers’ depression score, the amount of control they feel they have over their lives, and their overall perception of their quality of life. The authors also found that specific attributes of the caregiver’s quality of life assessment and the patient’s physical dysfunctions were predictive of caregiver burden.
Yoon et al. recognize that caregiver stress negatively affects caregivers and that identifying burden factors may provide crucial insights for providers to prioritize those with the highest risk of stress [15]. To develop a prediction model of caregiver difficulty, they applied several classification algorithms, such as Random Forest, Multilayer Perceptron, and AdaBoostM1, to discover new behavioral risk knowledge and to visualize predictors of caregiver stress from a multidimensional behavioral dataset. The authors emphasized the importance of limiting care hours, preventing burnout, and delegating caregiving tasks. These findings agree with previous studies that have found an association between the caregiver’s burden and the hours of weekly care they provide, their quality of life, and the psychological distress they bear.
Caring for cancer patients can have a significant impact on the quality of life of informal caregivers. These caregivers face physical, psychological, and financial stress, along with difficulties in managing their own daily activities while providing care to cancer patients. It is necessary to understand the extent of their burden and the challenges they face to provide support for them in their daily lives. Several studies have examined the roles, needs, and burdens of cancer caregivers, highlighting the need for ongoing research aimed at improving their well-being. ML techniques can be used to identify predictive factors that contribute to the burden experienced by caregivers, enabling targeted interventions to alleviate their stress and improve their quality of life. Understanding caregiver burden factors is crucial, and research in this area will provide valuable insights for providers to prioritize those with the highest risk of stress. Overall, improving the well-being of caregivers is essential for enhancing the overall well-being of cancer patients.
While predicting cancer caregiver burnout is a complex task, this paper proposes using ML techniques to develop predictive models. By incorporating these findings into a CDSS, healthcare professionals can more easily identify individuals who may be at risk of burnout and provide appropriate support. With the help of these predictive models, caregivers can receive timely assistance, ultimately improving the quality of care provided to cancer patients and their families.
Methods
Machine learning helps us find patterns in complex datasets and make predictions, such as predicting caregiver burden levels. To identify a subset of features that are the strongest predictors and relevant for use in our ML prediction models, we employed three different feature selection approaches: Filter, Wrapper, and Embedded methods. The purpose of curating a subset of features is to improve the generalizability of ML models by reducing overfitting. Fewer variables can also reduce model training times and lead to simpler models. Each feature selection method may identify a different subset of features, and a given subset may yield the best modelling performance for some ML algorithms but not for others. Once we had our small composite set of significant features, we ran several supervised classification algorithms to predict burden levels.
Data Curation
Our study uses a publicly available dataset from the Caregiving in the U.S. 2015 study, which was jointly conducted by the National Alliance for Caregiving and AARP Public Policy Institute. This dataset was made possible by generous sponsorship from AARP, Archstone Foundation, Eli Lilly, Home Instead Senior Care, MetLife Foundation, Family Support Research and Training Center (FSRTC) at the University of Illinois-Chicago, Pfizer, and UnitedHealthcare [1]. The data primarily consists of quantitative online interviews with 1,248 caregivers aged 18 or older, who provided unpaid care for someone of any age within both the U.S. population and households. To qualify for the study, respondents had to self-identify as an unpaid caregiver of an adult either currently or at some point in the twelve months prior to the survey. As we are interested in understanding caregiving for cancer patients, we filtered the dataset to include only those who completed the survey and self-identified as cancer caregivers, resulting in 111 instances (participants).
Data Cleansing
The original dataset consisted of 7,975 records and 214 features. However, many records were either incomplete or contained data from paid caregivers and thus were excluded. After this exclusion, we obtained a dataset of 1,248 records of unpaid informal caregivers with various diseases. Since our study focuses on cancer caregivers, we selected a subset of the original dataset where the main illness of the care recipient was labeled "Cancer" (i.e., Q18 = 14). This resulted in a total of 116 samples, of which 5 had incomplete data, reducing the final number of records to 111. Next, we focused on feature selection and reduction. The original dataset contained a wide array of features, totaling 214. However, not all these features were relevant or complete. We first excluded features with no values or only one level where the variance was zero, which reduced the total number of features to 176. We further eliminated the initial screening questions labeled as "sc" and created a subset with 152 features. Ultimately, our cleaned informal cancer caregiver dataset had 111 records and 152 feature columns. This dataset, while significantly reduced from the original, still provided a comprehensive view of the various factors that could influence caregiver burden. Our subsequent analysis and machine learning models were based on this refined dataset.
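The cleansing steps above can be sketched in a few lines of pandas. This is a minimal illustration on a toy frame, not the code used in the study: the column name Q18 and the code 14 come from the text, while the `complete` flag, the `sc` prefix test, and the helper name `clean_caregiver_data` are assumptions introduced here for illustration.

```python
import pandas as pd


def clean_caregiver_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the cleansing steps on a raw survey frame (hypothetical helper)."""
    # Keep only records whose care recipient's main illness is cancer (Q18 = 14).
    cancer = raw[raw["Q18"] == 14]
    # Drop incomplete records; "complete" is a hypothetical flag column.
    cancer = cancer[cancer["complete"] == 1]
    # Drop features that are empty or have a single level (zero variance).
    informative = [c for c in cancer.columns if cancer[c].nunique(dropna=True) > 1]
    cancer = cancer[informative]
    # Drop the initial screening questions, assumed here to share an "sc" prefix.
    keep = [c for c in cancer.columns if not c.lower().startswith("sc")]
    return cancer[keep]


# Toy stand-in for the raw survey data.
toy = pd.DataFrame({
    "Q18": [14, 14, 3, 14],
    "complete": [1, 1, 1, 0],
    "sc1": [1, 2, 1, 2],
    "hourscat": [1, 4, 2, 3],
    "empty": [None, None, None, None],
    "constant": [5, 5, 5, 5],
})
cleaned = clean_caregiver_data(toy)
print(cleaned)
```

Note that Q18 itself is dropped by the zero-variance step, because every remaining record carries the same value after filtering.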
Missing Data Imputation
In the next step of the data preparation process, we evaluated the percentage of missing values in each feature. We used an imputation strategy to improve the quality of the dataset. We removed features with a missing rate of over 20% (i.e., where more than 20% of the values were missing). The dataset contained a mix of categorical and interval data types. For the interval features, we imputed missing values by replacing them with the mean value of the respective columns. For the categorical features, we treated missing values as another category coded with “777”. The final dataset included 111 features, including the burden feature, which was our target (dependent) variable. The target feature, burden, had three categorical levels: low, medium, and high burden.
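The imputation rule described above can be illustrated with the following sketch. The 20% threshold, mean imputation for interval features, and the code 777 for missing categorical values come from the text; the function name `impute`, the toy columns, and the way interval columns are passed in are assumptions made for this example.

```python
import pandas as pd


def impute(df: pd.DataFrame, interval_cols: list[str],
           max_missing: float = 0.20, missing_code: int = 777) -> pd.DataFrame:
    """Sketch of the imputation strategy (hypothetical helper)."""
    # Drop features whose share of missing values exceeds the threshold.
    kept = df.loc[:, df.isna().mean() <= max_missing].copy()
    for col in kept.columns:
        if col in interval_cols:
            # Interval features: replace missing values with the column mean.
            kept[col] = kept[col].fillna(kept[col].mean())
        else:
            # Categorical features: treat "missing" as its own category.
            kept[col] = kept[col].fillna(missing_code)
    return kept


toy = pd.DataFrame({
    "agecr": [70.0, None, 80.0, 60.0, 90.0, 75.0],   # interval, 1 of 6 missing
    "marital": [1, 2, None, 1, 3, 2],                # categorical, 1 of 6 missing
    "sparse": [None, None, None, 1, 2, None],        # 4 of 6 missing, dropped
})
imputed = impute(toy, interval_cols=["agecr"])
print(imputed)
```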
Basic Data Analytics
From the cleaned dataset (111 rows x 111 columns), we conducted some basic analytics to gain a deeper understanding of cancer caregivers. Table 1 summarizes the overall information.
Table 1.
Basic information about cancer caregivers
| Questions/Options | Value | Percentage (%) |
|---|---|---|
| Average age of caregiver | 57.74 | N/A |
| Ratio of men to women | ||
| Female | 60 | 54.05 |
| Male | 51 | 45.95 |
| Caregiver marital status | ||
| Married | 65 | 58.56 |
| Living with a partner | 4 | 3.60 |
| Widowed | 10 | 9.01 |
| Separated | 0 | 0 |
| Divorced | 13 | 11.71 |
| Single, never married | 15 | 13.51 |
| Don’t know | 4 | 3.60 |
| Race/ethnicity of caregiver | ||
| White | 69 | 62.16 |
| Black | 12 | 10.81 |
| Asian | 16 | 14.41 |
| Hispanic | 14 | 12.61 |
| Categorical hours of care /week provided | ||
| 0 to 8 hours | 33 | 29.73 |
| 9 to 20 hours | 25 | 22.52 |
| 21 to 40 hours | 13 | 11.71 |
| More than 40 hours | 40 | 36.04 |
| How long have you been/did you provide care to recipient? | ||
| Six months to one year | 40 | 36.04 |
| Less than six months | 35 | 31.53 |
| Others | 36 | 32.43 |
| Caregiver Household Income | ||
| Under $15,000 | 10 | 9.01 |
| $15,000 to $29,999 | 14 | 12.61 |
| $30,000 to $49,999 | 22 | 19.82 |
| $50,000 to $74,999 | 23 | 20.72 |
| $75,000 to $99,999 | 17 | 15.32 |
| $100,000 or more | 24 | 21.62 |
| Don’t know | 1 | 0.90 |
| Caregiver education (highest grade completed) | ||
| Less than high school | 5 | 4.50 |
| High school grad/GED | 31 | 27.93 |
| Some college | 18 | 16.22 |
| Technical school | 6 | 5.41 |
| College grad | 30 | 27.03 |
| Graduate school/Grad work | 21 | 18.92 |
We next explored certain demographic issues related to burden experienced by caregivers.
As seen in Fig. 1, every age group includes caregivers who experience high burden. Seniors aged 65 and older experienced the most burden, followed by the 18-49 age group.
Figure 1.
Age groups that experience high burden
Physical strain was rated from 1 (no physical strain at all) to 5 (constant physical strain). As shown in Figure 2, the largest group of caregivers (31%) rated the strain of caring for a cancer patient at level 2. The second-largest group (24%) reported that caregiving was not a physical strain, while 9% rated caregiving at the highest physical strain level of 5.
Figure 2.
Physical strain affecting the caregiver population
Figure 3 shows the distribution of caregivers by the emotional strain their caregiving caused them. The largest group of caregivers (31%) rated caring for a cancer patient as moderately stressful, while 1 in 5 caregivers reported their caregiving as a very stressful situation.
Figure 3.
Population distribution of caregivers based on emotionally stressful situation
As shown in Figure 4, 45% of caregivers faced no financial difficulties while caring for their loved ones. The remaining felt some financial strain while they were caregiving.
Figure 4.
Distribution of population affected by financial strains during caregiving
The two most important Activities of Daily Living (ADLs) and Instrumental ADLs are summarized in Table 2. As shown, over 50% of caregivers provided assistance with two critical ADLs: getting in and out of beds and chairs, as well as getting to and from the toilet. Moreover, the table demonstrates that more than 80% of caregivers helped patients with two vital IADLs: grocery or other shopping and transportation.
Table 2.
Two major ADL and IADL activities for caregivers
| Type | Activity | Population (Response Yes) |
|---|---|---|
| Activities of Daily Living (ADLs) | Get in and out of beds and chairs | 64 |
| Activities of Daily Living (ADLs) | Get to and from toilet | 53 |
| Instrumental Activities of Daily Living (IADLs) | Grocery or other shopping | 83 |
| Instrumental Activities of Daily Living (IADLs) | Transportation | 93 |
The above analysis gives us a deeper view of who the cancer caregivers are and what burden they are experiencing.
Feature Selection Methods
We were interested in finding out which set of features was a strong predictor of caregiver burden. To that end, we worked with a number of feature selection methods in machine learning.
• Filter Methods
In the filter feature selection method, we rely on the characteristics of the data. Filter methods use feature scores based on various statistical tests to rank and select the highest-ranking features. We used the chi-squared test in this study, which compares the chi-squared statistic against a known distribution. We used the chi2 scoring function from the feature selection module of the Scikit-learn library to compute chi-squared statistics between each independent feature and the burden level as the dependent feature. The smaller the p-value, the stronger the evidence that the independent feature is associated with the dependent feature. In our study, we selected the 15 features with the smallest p-values in relation to the burden category (dependent variable). We chose 15 as our threshold for the number of features in the filter feature selection method to balance the complexity of the model with its predictive power. This decision was based on our preliminary analysis, which suggested that including more than 15 features did not significantly improve the model's performance. However, future studies could explore the impact of using different thresholds. The list of selected variables is provided in Table 3.
Table 3.
Selected features using the chi-squared filter method
| # | Variable Name | Description |
|---|---|---|
| 1 | HHer | Householder status of Initial respondent |
| 2 | hhtype | Type of household family or not |
| 3 | racehh | Race/ethnicity of householder single punch |
| 4 | q7 | Relationship of recipient to caregiver |
| 5 | lives | Where recipient lives |
| 6 | Q15B | Does recipient live in a rural area? |
| 7 | q21 | How long have you been/did you provide care to recipient? |
| 8 | q21x | Length of time provided care all life replaced with recipient age |
| 9 | q22a | Get in and out of beds and chairs - Help with ADL |
| 10 | hourscat | Categorical hours of care provided |
| 11 | q35 | How much of a physical strain caring for recipient is/was? |
| 12 | q33 | Have/Were employed any time in last year while caregiving? |
| 13 | marital | Caregiver marital status |
| 14 | INTNET | Has Internet Access in home? - online respondents only |
| 15 | LGBT | Identifies as Lesbian, Gay, Bisexual, and/or Transgender |
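The chi-squared filter step can be sketched with Scikit-learn as follows. The data here are synthetic stand-ins, not the survey data; the array shapes, the random seed, and the deliberately informative column 0 are assumptions made so that the example is self-contained. Note that Scikit-learn's chi2 expects non-negative feature values, which holds for category-coded survey answers.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n_samples, n_features, k = 111, 40, 15

# Synthetic, non-negative, category-coded features (stand-in for the survey).
X = rng.integers(0, 5, size=(n_samples, n_features))
# Three-level burden label: 1 = low, 2 = medium, 3 = high.
y = rng.integers(1, 4, size=n_samples)
# Make column 0 strongly associated with the label for illustration.
X[:, 0] = y

# Score every feature against the burden label with the chi-squared test.
selector = SelectKBest(score_func=chi2, k=k).fit(X, y)

# Rank features by p-value, smallest first, and keep the top k.
ranked = np.argsort(selector.pvalues_)[:k]
print("Top features by p-value:", ranked)
```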
• Wrapper
Wrapper methods are greedy search algorithms that use a specific ML classifier to select the optimal set of features. We used the step forward wrapper method, where we started with no features and sequentially added one feature at a time until a desired number of features were selected. In this study, we use the MLxtend sequential forward feature selector (SFS) implementation using a Random Forest classifier with 2-fold cross-validation at 70 and 30 percent data split for model training and testing, respectively. We used accuracy as the scoring measure for sequentially adding relevant features to the list of features. The list of the top 15 selected variables that yielded the highest classifier performance accuracy is presented in Table 4.
Table 4.
Selected features using the wrapper method
| # | Variable Name | Description |
|---|---|---|
| 1 | HHer | Householder status of Initial respondent |
| 2 | hhtype | Type of household family or not |
| 3 | hisphh | Hispanic status of householder |
| 4 | agehh | Age of householder |
| 5 | q7 | Relationship of recipient to caregiver |
| 6 | q17ctn | Count of conditions selected |
| 7 | q20 | Recipient has Alzheimer’s or other mental confusion? |
| 8 | q21 | How long have you been/did you provide care to recipient? |
| 9 | q22a | Get in and out of beds and chairs - Help with ADL |
| 10 | q22c | Get to and from toilet - Help with ADL |
| 11 | q22d | Bathe or shower - Help with ADL |
| 12 | q23b | Grocery or other shopping - Help with IADL |
| 13 | q23j | Communicating with care professionals - Key activity |
| 14 | hourscat | Categorical hours of care provided |
| 15 | q48d | Managing incontinence or toileting problems - need more help/info |
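Forward sequential selection of this kind can be sketched as follows. The study used the MLxtend implementation; this sketch swaps in Scikit-learn's SequentialFeatureSelector, which follows the same greedy add-one-feature-at-a-time idea. The data are synthetic, and the problem is shrunk (8 candidate features, 3 selected, 20 trees) so the example runs quickly; the study selected 15 features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(0)

# Synthetic stand-in: 111 caregivers, 8 category-coded candidate features.
X = rng.integers(0, 5, size=(111, 8))
y = rng.integers(1, 4, size=111)  # burden: 1 = low, 2 = medium, 3 = high
X[:, 2] = y  # make column 2 informative for illustration

forest = RandomForestClassifier(n_estimators=20, random_state=0)

# Start from no features and greedily add the one that most improves
# 2-fold cross-validated accuracy, until 3 features are selected.
sfs = SequentialFeatureSelector(
    forest,
    n_features_to_select=3,
    direction="forward",
    scoring="accuracy",
    cv=2,
)
sfs.fit(X, y)

selected = np.flatnonzero(sfs.get_support())
print("Selected feature indices:", selected)
```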
• Embedded
The final feature selection method we chose in this study is the embedded method. This method relies on the feature selection capabilities built into a learning algorithm, so that selection happens while the model is being trained. We employed a random forest algorithm and ranked features by Gini importance (mean decrease in impurity), that is, the reduction in node impurity contributed by splits on a feature. The decrease in node impurity is averaged across all trees of the ensemble. Consistent with the previous methods, we used a 70 and 30 percent data split for training and testing our model, respectively. The top 15 important features from the embedded method form the last feature set for this study. The list of selected features is provided in Table 5.
Table 5.
Selected features using the embedded method
| # | Variable Name | Description |
|---|---|---|
| 1 | agecr | Age of Care Recipient |
| 2 | q7 | Relationship of recipient to caregiver |
| 3 | q22a | Get in and out of beds and chairs - Help with ADL |
| 4 | q22b | Get dressed - Help with ADL |
| 5 | q22c | Get to and from toilet - Help with ADL |
| 6 | q22d | Bathe or shower - Help with ADL |
| 7 | adls | Count of ADLs performed - created |
| 8 | q22g | Giving medicines - Help with IADL |
| 9 | q23c | Housework - Help with IADL |
| 10 | iadls | Count of IADLs performed |
| 11 | hourscat | Categorical hours of care provided |
| 12 | n3 | Helped recipient with medical/nursing tasks? |
| 13 | n9 | How many times was recipient hospitalized overnight past year? |
| 14 | q32a | Are you currently employed? |
| 15 | HH14WGT | Household Weight - Use for household prevalence only |
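The embedded ranking can be sketched with a random forest's impurity-based importances. As before, the data are synthetic; the number of trees, the random seed, and the informative column 5 are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 111 caregivers, 40 category-coded features.
X = rng.integers(0, 5, size=(111, 40))
y = rng.integers(1, 4, size=111)  # burden: 1 = low, 2 = medium, 3 = high
X[:, 5] = y  # make column 5 informative for illustration

# 70/30 split, as in the text; stratification is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# feature_importances_ holds the mean decrease in impurity per feature,
# averaged over all trees and normalized to sum to 1.
importances = forest.feature_importances_
top15 = np.argsort(importances)[::-1][:15]
print("Top 15 features by Gini importance:", top15)
```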
• Creating the composite set of features
Each method gives us a unique subset of features. The features that are repeatedly identified as part of the feature list by multiple feature selection methods can further enhance the modeling performance. So we created a composite set of 15 features that appeared more frequently (in Tables 3, 4 and 5). The approach is shown in Table 6.
Table 6.
Frequency of identified feature in each feature selection method
| Feature | Chi-squared Method | Wrapper Method | Embedded Method | Overall Score | Composite Set |
|---|---|---|---|---|---|
| lives - Where recipient lives | 1 | 0 | 0 | 1 | ✕ |
| HHer - Caregiver is householder | 1 | 1 | 0 | 2 | ✓ |
| q7 - Relationship of recipient to caregiver | 1 | 1 | 1 | 3 | ✓ |
| ... | ... | ... | ... | ... | ... |
The composite set represents a list of the most frequently selected features (those that scored 2 or higher) from other feature selection methods. This feature set aims to further improve the predictive performance of the classification models. The list of most frequently repeated variables is provided in Table 7.
Table 7.
Composite features representing the top 15 most frequently selected features
| Number | Variable Name | Description |
|---|---|---|
| 1 | q22a | Get in and out of beds and chairs - Help with ADL |
| 2 | q7 | Relationship of recipient to caregiver |
| 3 | hourscat | Categorical hours of care provided - created |
| 4 | HHer | Householder status of Initial respondent - created |
| 5 | hhtype | Type of household family or not - created |
| 6 | q22d | Bathe or shower - Help with ADL |
| 7 | q22c | Get to and from toilet - Help with ADL |
| 8 | q21 | How long have you been/did you provide care to recipient? |
| 9 | agecr | Age of Care Recipient - created |
| 10 | q22b | Get dressed - Help with ADL |
| 11 | adls | Count of ADLs performed - created |
| 12 | q22g | Giving medicines - Help with IADL |
| 13 | q23c | Housework - Help with IADL |
| 14 | iadls | Count of IADLs performed - created |
| 15 | n3 | Helped recipient with medical/nursing tasks? |
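The vote counting behind Tables 6 and 7 can be sketched as follows, with the three lists copied from Tables 3, 4, and 5. Counted this way, eight features are selected by at least two methods, and the remaining seven entries of Table 7 coincide with the first seven single-vote features in Table 5 order. The fill rule in the code (pad with embedded-set features) is therefore an assumption made for this sketch, not a documented step.

```python
from collections import Counter

chi2_set = ["HHer", "hhtype", "racehh", "q7", "lives", "Q15B", "q21", "q21x",
            "q22a", "hourscat", "q35", "q33", "marital", "INTNET", "LGBT"]
wrapper_set = ["HHer", "hhtype", "hisphh", "agehh", "q7", "q17ctn", "q20", "q21",
               "q22a", "q22c", "q22d", "q23b", "q23j", "hourscat", "q48d"]
embedded_set = ["agecr", "q7", "q22a", "q22b", "q22c", "q22d", "adls", "q22g",
                "q23c", "iadls", "hourscat", "n3", "n9", "q32a", "HH14WGT"]

# Overall score: number of selection methods that picked each feature.
votes = Counter(chi2_set + wrapper_set + embedded_set)

# Features chosen by at least two methods, highest score first.
consensus = [name for name, score in votes.most_common() if score >= 2]

# Assumed fill rule: pad up to 15 with embedded-set features in Table 5 order.
fill = [name for name in embedded_set if name not in consensus]
composite = (consensus + fill)[:15]

print("Score >= 2:", consensus)
print("Composite set:", composite)
```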
Predicting Caregiver burnout using various classifiers
After this, we built predictive caregiver burnout (PCB) models using eight machine learning classifiers. The modelling task is a multi-class classification of the level of burden on the caregiver. The burden variable is a three-level categorical variable where one represents a low level, two a medium level, and three a high level of burden on the caregiver. Therefore, we selected algorithms that are inherently capable of handling a multi-class target variable. These algorithms include Multi-layer Perceptron (MLP), Logistic Regression, Gaussian Naïve Bayes, instance-based K-nearest neighbor, Decision Tree (i.e., CART), Random Forest, Gradient Boosting, and AdaBoost.
We used the embedded and our composite feature sets as input variables in the selected ML classifiers. We used 70% of the data on training the models and the remaining 30% as test data to evaluate the performance of the ML models. Area under receiver operator characteristics curve (AUC), accuracy, precision, recall, and F1 were used as performance measures for the predictive models. A summary of the performance of every machine learning model is presented in Table 8.
Table 8.
Accuracy, AUC, Precision, Recall and F1 score. Best performance is shown in bold
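| Classifier | Accuracy (Embedded) | AUC (Embedded) | Precision (Embedded) | Recall (Embedded) | F1 Score (Embedded) | Accuracy (Composite) | AUC (Composite) | Precision (Composite) | Recall (Composite) | F1 Score (Composite) |
|---|---|---|---|---|---|---|---|---|---|---|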
| MLP | 0.71 | 0.50 | 0.50 | 0.71 | 0.56 | 0.71 | 0.52 | 0.50 | 0.71 | 0.56 |
| Logistic Regression | 0.94 | 0.90 | 0.95 | 0.94 | 0.71 | 0.91 | 0.89 | 0.91 | 0.91 | 0.71 |
| K-Neighbors | 0.71 | 0.74 | 0.68 | 0.71 | 0.56 | 0.62 | 0.52 | 0.62 | 0.62 | 0.56 |
| Decision Tree | 0.94 | 0.86 | 0.95 | 0.94 | 0.81 | 0.94 | 0.86 | 0.95 | 0.94 | 0.92 |
| Random Forest | 0.82 | 0.84 | 0.86 | 0.82 | 0.76 | 0.85 | 0.97 | 0.88 | 0.85 | 0.76 |
| Ada Boost | 0.94 | 0.83 | 0.95 | 0.94 | 0.93 | 0.94 | 0.83 | 0.95 | 0.94 | 0.93 |
| Gradient Boosting | 0.94 | 0.78 | 0.95 | 0.94 | 0.81 | 0.94 | 0.78 | 0.95 | 0.94 | 0.81 |
| Gaussian NB | 0.79 | 0.82 | 0.85 | 0.79 | 0.65 | 0.79 | 0.82 | 0.85 | 0.79 | 0.65 |
As is evident from Table 8, the composite set yields a higher best AUC (Random Forest, 0.97) than the embedded set (Logistic Regression, 0.90). Hence, we consider the composite set of features the better choice for predicting cancer caregiver burden levels.
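The evaluation protocol (70/30 split, then accuracy, AUC, precision, recall, and F1 per classifier) can be sketched as follows for two of the eight classifiers. The data are synthetic, and the averaging choices are assumptions: the text does not state how the multi-class metrics were averaged, so this sketch uses one-vs-rest macro AUC and weighted precision, recall, and F1. With weighted averaging, recall equals accuracy by construction, which is the pattern seen in the Recall columns of Table 8.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 111 caregivers, 15 features, three burden levels (1..3).
X = rng.normal(size=(111, 15))
signal = X[:, 0] + X[:, 1]
y = np.digitize(signal, np.quantile(signal, [1 / 3, 2 / 3])) + 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)  # class probabilities for the AUC
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "auc": roc_auc_score(y_test, proba, multi_class="ovr"),
        "precision": precision_score(y_test, pred, average="weighted", zero_division=0),
        "recall": recall_score(y_test, pred, average="weighted"),
        "f1": f1_score(y_test, pred, average="weighted"),
    }

for name, metrics in results.items():
    print(name, {k: round(float(v), 2) for k, v in metrics.items()})
```

With only about 34 test records, each metric is sensitive to a handful of predictions, so repeated splits or cross-validation would give a more stable picture than a single split.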
Discussion
Our analysis and chosen methods were driven by our research questions. We started by trying to gain a better understanding of who the cancer caregivers are. There are more women caregivers than men, and the average caregiver age is 57.7 years (see Table 1). A majority of our caregivers are married, but single, never-married people (13.51%) also give care to a loved one. Most caregivers in our dataset are White, but Black, Asian, and Hispanic caregivers are also represented. The most common duration of care is six months to one year. Caregivers' household incomes span all income categories. A high school diploma or GED is the most common level of education, but every educational background is represented.
Further analysis also revealed that a majority of our caregivers experience high emotional stress, high financial stress and some form of physical strain while giving care to their loved ones.
Using popular feature selection methods in machine learning, we created subsets of features that are highly predictive of caregiver burden level. Looking closely at Tables 3, 4, and 5, we notice a few significant features that are strongly related to burden level. Some of those features are related to ADLs, and some to IADLs. This suggests that stress rises with assistance that demands both time and physical effort from the caregiver. Other significant features are related to the type of household and the relationship to the patient. The remaining features include categorical hours of care provided, employment status, and age of the patient.
We created a composite set of 15 features (see Table 7) by selecting the features that appeared in most of the methods we used. We believe that 15 is a reasonably small number of features with which to build a Clinical DSS, or even a simple survey, that can quickly identify which caregivers are likely to burn out or are about to experience a high burden level.
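Selecting features that appear in most of the selection methods amounts to a vote across methods. The following is a minimal sketch of that idea, not the exact procedure used in this study: the method names, the feature names, and the `min_votes` threshold are all hypothetical placeholders.

```python
from collections import Counter

def composite_features(selections, min_votes):
    """Keep the features chosen by at least `min_votes` selection methods."""
    votes = Counter()
    for features in selections.values():
        votes.update(set(features))  # each method votes once per feature
    return sorted(name for name, count in votes.items() if count >= min_votes)

# Hypothetical outputs of three feature selection methods.
selections = {
    "filter": ["hours_of_care", "adl_count", "relationship"],
    "wrapper": ["hours_of_care", "adl_count", "employment"],
    "embedded": ["hours_of_care", "iadl_count", "adl_count"],
}

# Features selected by at least two of the three methods.
print(composite_features(selections, min_votes=2))
# → ['adl_count', 'hours_of_care']
```

Raising `min_votes` shrinks the composite set toward features every method agrees on, while lowering it admits features supported by only one method.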
Our second research question was to predict caregiver burnout. For that, we ran 8 different machine learning classifiers on both the feature set from the embedded method and our composite feature set. For the embedded feature set, Decision Tree and Logistic Regression were among the classifiers with the highest accuracy (94%), and Logistic Regression yielded the highest AUC (0.90). For the composite feature set, the highest accuracy (94%) was shared by several classifiers, including Ada Boost and Gradient Boosting, while the highest AUC (0.97) was obtained by the Random Forest classifier (see Table 8). We recommend using the composite set when attempting to predict caregiver burden, as it provides sufficient accuracy and the highest AUC in our models. These 15 features can therefore be used to build a Clinical DSS and/or a simple survey that can be periodically administered to caregivers.
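The AUC comparison can be reproduced directly from Table 8. The short sketch below copies the AUC values from the table and picks the best classifier for each feature set. It assumes the first group of metric columns belongs to the embedded set and the second to the composite set, which is the reading consistent with the values quoted in the text; the function name `best_by_auc` is illustrative.

```python
# AUC values copied from Table 8 as (embedded set, composite set).
# The column-to-set mapping is an assumption inferred from the text.
auc = {
    "MLP": (0.50, 0.52),
    "Logistic Regression": (0.90, 0.89),
    "K-Neighbors": (0.74, 0.52),
    "Decision Tree": (0.86, 0.86),
    "Random Forest": (0.84, 0.97),
    "Ada Boost": (0.83, 0.83),
    "Gradient Boosting": (0.78, 0.78),
    "Gaussian NB": (0.82, 0.82),
}

def best_by_auc(table, column):
    """Return (classifier, AUC) with the highest AUC in the given column."""
    name = max(table, key=lambda clf: table[clf][column])
    return name, table[name][column]

embedded_best = best_by_auc(auc, 0)
composite_best = best_by_auc(auc, 1)

print("Embedded set: ", embedded_best)   # → ('Logistic Regression', 0.9)
print("Composite set:", composite_best)  # → ('Random Forest', 0.97)
```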
Conclusion
High stress and burden among cancer caregivers can cause significant damage to their physiological and emotional wellbeing. Identifying who is likely to succumb to burnout is critical for early intervention and support. In this study, we used machine learning approaches, including data curation, feature selection, and supervised learning, to build prediction models for cancer caregiver burden. We demonstrated that highly accurate predictions are possible from a small subset of 15 features. These features can be used to build a Clinical DSS in the future. Our work has improved the current state-of-the-art in estimating the prevalence and burden of informal caregivers. It lays the groundwork for theoreticians to build new theories of caregiving and for policy makers to formulate necessary guidelines. Our future work will deal with designing new interventions and treatments for cancer caregivers.
References
- 1. National Alliance for Caregiving (NAC) and AARP Report. Caregiving in the U.S. 2020.
- 2. Girgis A, Lambert S, Johnson C, Waller A, Currow D. Physical, psychosocial, relationship, and economic burden of caring for people with cancer: a review. Journal of Oncology Practice. 2013;9(4):197–202. doi: 10.1200/JOP.2012.000690.
- 3. National Alliance for Caregiving. Cancer Caregiving in the U.S. 2016.
- 4. Braun M, Mikulincer M, Rydall A, Walsh A, Rodin G. Hidden morbidity in cancer: spouse caregivers. J Clin Oncol. 2007;25(30):4829–4834. doi: 10.1200/JCO.2006.10.0909.
- 5. Grunfeld E, Coyle D, Whelan T, et al. Family caregiver burden: results of a longitudinal study of breast cancer patients and their principal caregivers. CMAJ. 2004;170(12):1795–1801. doi: 10.1503/cmaj.1031205.
- 6. Goldstein NE, Concato J, Fried TR, Kasl SV. Factors associated with caregiver burden among caregivers of terminally ill patients with cancer. J Palliat Care. 2004;20:38.
- 7. Ugalde A, Krishnasamy M, Schofield P. Role recognition and changes to self‐identity in family caregivers of people with advanced cancer: a qualitative study. Support Care Cancer. 2012;20(6):1175–1181. doi: 10.1007/s00520-011-1194-9.
- 8. Carter PA. Caregivers’ descriptions of sleep changes and depressive symptoms. Oncol Nurs Forum. 2002;29(9):1277–1283. doi: 10.1188/02.ONF.1277-1283.
- 9. Carrera PM, Kantarjian HM, Blinder VS. The financial burden and distress of patients with cancer: understanding and stepping‐up action on the financial toxicity of cancer treatment. CA Cancer J Clin. 2018;68(2):153–165. doi: 10.3322/caac.21443.
- 10. van Ryn M, Sanders S, Kahn K, et al. Objective burden, resources, and other stressors among informal cancer caregivers: a hidden quality issue? Psychooncology. 2011;20(1):44–52. doi: 10.1002/pon.1703.
- 11. Ugalde A, Gaskin CJ, Rankin NM, Schofield P, Boltong A, Aranda S, Chambers S, Krishnasamy M, Livingston PM. A systematic review of cancer caregiver interventions: Appraising the potential for implementation of evidence into practice. Psycho-oncology. 2019;28(4):687–701. doi: 10.1002/pon.5018.
- 12. Mani S, Chen Y, Elasy T, Clayton W, Denny J. Type 2 diabetes risk forecasting from EMR data using machine learning. AMIA Annual Symposium Proceedings. Vol. 2012. American Medical Informatics Association; 2012. p. 606.
- 13. Li X, Liu H, Du X, Zhang P, Hu G, Xie G, Xie X. Integrated machine learning approaches for predicting ischemic stroke and thromboembolism in atrial fibrillation. AMIA Annual Symposium Proceedings. Vol. 2016. American Medical Informatics Association; 2016. p. 799.
- 14. Antoniadi AM, Galvin M, Heverin M, et al. Prediction of caregiver burden in amyotrophic lateral sclerosis: a machine learning approach using random forests applied to a cohort study. BMJ Open. 2020;10:e033109. doi: 10.1136/bmjopen-2019-033109.
- 15. Yoon S, et al. Prediction models for burden of caregivers applying data mining techniques. Big Data & Information Analytics. 2017.
- 16. Collins L, Swartz K. Caregiver Care. Work Paper, Jefferson Medical College, Thomas Jefferson University, Philadelphia, Pennsylvania. 2011.
- 17. Kent E, et al. Caring for Caregivers and Patients: Research and Clinical Priorities for Informal Cancer Caregiving. Cancer. 2016.
- 18. Glajchen M. Physical Well-Being of Oncology Caregivers: An Important Quality-of-Life Domain. Seminars in Oncology Nursing. 2012;28(4):226–235. doi: 10.1016/j.soncn.2012.09.005.
- 19. Chappell N, Reid C. Burden and Well-Being Among Caregivers: Examining the Distinction. The Gerontologist. 2002;42(6):772–780. doi: 10.1093/geront/42.6.772.
- 20. Zarit SH, Reever KE, Bach-Peterson J. Relatives of the impaired elderly: correlates of feelings of burden. The Gerontologist. 1980;20(6):649–655. doi: 10.1093/geront/20.6.649.




