Abstract
Background:
Third-variable effect refers to the effect from a third-variable that explains an observed relationship between an exposure and an outcome. Depending on whether there is a causal relationship from the exposure to the third variable, the third-variable is called a mediator or a confounder. The multilevel mediation analysis is used to differentiate third-variable effects from data of hierarchical structures.
Data Collection and Analysis:
We developed a multilevel mediation analysis method to deal with time-to-event outcomes and implemented the method in the mlma R package. With the method, third-variable effects from different levels of data can be estimated. The method uses multilevel additive models that allow for transformations of variables to take into account potential nonlinear relationships among variables in the mediation analysis. We apply the proposed method to explore the racial/ethnic disparities in survival among patients diagnosed with breast cancer in California between 2006 and 2017, using both individual risk factors and census tract level environmental factors. The individual risk factors are collected by cancer registries and the census tract level factors are collected by the Public Health Alliance of Southern California in partnership with the Virginia Commonwealth University’s Center on Society and Health. The National Cancer Institute work group linked variables at the census tract level with each patient and performed the analysis for this study.
Results:
We found that the racial disparity in survival was mostly explained at the census tract level and partially explained at the individual level. The associations among variables were depicted.
Conclusion:
The multilevel mediation analysis method can be used to differentiate mediation/confounding effects for factors originating from different levels. The method is implemented in the R package mlma.
Keywords: Confounding/mediation effect, health inequality, multilevel additive models, racial/ethnic disparity, third-variable analysis
Background and Introduction
Health disparities exist widely in the United States (US). One example lies in breast cancer outcomes. Owing to advanced screening methods that detect breast cancer at an early stage and to improved treatments, the overall death rate of breast cancer in the US has decreased in recent years. However, compared with White women, African American women diagnosed with breast cancer have higher recurrence and death rates despite a lower incidence rate.1-15 Understanding the factors that account for these disparities is imperative to inform regulations, interventions, and treatments to reduce them.
There is consensus that individual behaviors and physical and social environments collectively contribute to disparities in breast cancer outcomes. However, little is known about the relative contribution of behavioral factors (e.g., smoking status) or of any specific neighborhood or community context to the disparities. This is due to the lack of comprehensive data sets that include both individual-level and environmental risk factors and, more importantly, the lack of a statistical modeling method that can differentiate the intermediate effects on different paths across multiple levels (e.g., environment and individual risk factors) to explain the observed disparities.
Mediation analysis is used to differentiate a third-variable (e.g., mediator or confounder) effect that intermediates an observed relationship between an exposure variable and an outcome variable.15-25 In mediation analysis, besides the pathway that directly connects the exposure variable with the outcome, we explore the exposure(X) − third variable(M) → response(Y) pathways. Mediation analysis is widely used in health disparities research to quantify the effects of contributing factors that explain the observed disparities. For example, our previous research has shown that among all cancer patients, non-Hispanic White persons have lower average anxiety than Hispanic White persons. We found that a higher education level is related to a lower anxiety score. Moreover, the average education level is higher among non-Hispanic White cancer survivors than among their Hispanic counterparts, and this difference explained 21% of the difference in anxiety scores between non-Hispanic White and Hispanic White patients.15 Mediation analysis identifies attributable factors (e.g., the education level), and differentiates and ranks their effects in explaining the paths that connect the exposure/predictor (e.g., ethnicity) with the outcome (e.g., anxiety score). The attributable factors, also called third-variables, contribute to changes in outcomes. A third-variable is called a mediator if the predictor is causally prior to it, that is, if the predictor has a causal relationship with it. If the predictor and the third-variable are associated but no causal relationship is established, the third-variable is called a confounder. In the previous example, education level is a confounder but not a mediator, since ethnicity is not the cause of education level. Despite the differences in interpreting the inference results, mediation analysis can be used to make inferences on both mediation and confounding effects.
In research on racial/ethnic disparities in breast cancer survival, risk factors are collected hierarchically at both the individual and the residential neighborhood levels. In such a situation, since patients living in the same neighborhood cannot be considered independent of each other, mediation analysis based on generalized linear models, where all patients are assumed to be independent, cannot be used directly to fit relationships among variables. Multilevel or mixed-effect models are more appropriate since these models can account for dependencies among nested observations. Much research has been done on multilevel mediation analysis. Ref. 26 studied the bias introduced by using single-level models when data are hierarchical. Refs. 27-33 proposed mediation analysis methods for different types of multilevel models. In addition, Ref. 34 proposed to use Bayesian mediation analysis to deal with hierarchical databases. Yu and Li35 extended the definitions of third-variable effects (mediation or confounding effects) to multilevel data structures and proposed to use multilevel additive models to fit variable relationships. Using their method, the effects from multilevel paths relating exposure(s) through third-variable(s) to outcome(s) can be estimated. However, all these methods deal with continuous or binary outcomes only. In this paper, we extend the multilevel mediation analysis to deal with time-to-event outcomes. The extended method is implemented in the R package mlma (version 6.0-0 and after) to help practitioners apply the method in research. In the rest of the paper, we discuss the multilevel mediation analysis method with survival outcomes in the Multilevel mediation analysis with time-to-event outcomes section, in which we discuss the models used to fit relationships among variables, the inference and interpretation methods, and the R package mlma. In the SEER data to explore the racial/ethnic disparity in breast cancer survival section, the method is illustrated on a real data set to explore the racial/ethnic disparity in breast cancer survival through both individual and environmental risk factors. The analysis is based on the breast cancer cases diagnosed in California between 2006 and 2017. Finally, we discuss the results and future research in the Conclusions section.
Multilevel mediation analysis with time-to-event outcomes
In mediation analysis, besides the pathway that directly connects the predictor variable with the outcome (direct effect), we explore the X − M → Y pathways (indirect effect between X and Y through M). Since in general there are multiple third-variables, mediation analysis will differentiate the effect from each pathway. When mediation analysis is performed on multilevel data, the level of a variable on the left side should be higher than or equal to that of the variables on its right side in the grouping hierarchy, since a group-level variable may affect an individual-level variable but not the reverse.27 In this paper, we restrict our discussion to two-level data; the method can be straightforwardly extended to more levels. Denote the higher level as level 2 and the lower level as level 1. In such a setting, only the 2 – 2 → 2, 2 – 2 → 1, 2 – 1 → 1, and 1 – 1 → 1 relationships are eligible. Since a 2 – 2 → 2 relationship can be explored with a single-level mediation analysis method, we restrict our outcome to be at level 1. In addition, if there is a level-2 third-variable, at least one level-2 predictor should be specified. A simplified conceptual model with a single outcome and one predictor at each level is presented in Figure 1. In Figure 1, there are K level-2 third-variables (M21, …, M2K) and L level-1 third-variables (M11, …, M1L). Multivariate outcomes and multiple predictors are allowed in the multilevel mediation analysis. The purpose of multilevel mediation analysis is to estimate the effect from each path that connects each pair of the predictor-outcome relationship. In the mediation analysis, we need to estimate relationships in two sets: one is the relationship of the outcome with all other variables and the other is the association between the predictor(s) and each third-variable. In this section, we first discuss the fits of the two sets of relationships separately and then discuss the estimation of third-variable effects for time-to-event outcomes.
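The level constraint on X – M → Y paths can be checked mechanically. The following Python sketch enumerates the level combinations in which each variable is at the same or a higher level than the variable to its right. The function name `eligible_paths` is hypothetical and is not part of the mlma package.

```python
from itertools import product


def eligible_paths(levels=(1, 2), outcome_levels=(1, 2)):
    """Return (x_level, m_level, y_level) triples that satisfy x >= m >= y."""
    return sorted(
        (x, m, y)
        for x, m, y in product(levels, levels, outcome_levels)
        if x >= m >= y
    )


# All eligible relationships for two-level data.
all_paths = eligible_paths()

# Relationships that remain when the outcome is restricted to level 1.
level1_outcome_paths = eligible_paths(outcome_levels=(1,))

for x, m, y in all_paths:
    print(f"{x} - {m} -> {y}")
```

Restricting the outcome to level 1 removes only the 2 – 2 → 2 relationship, which is the case handled by single-level methods.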
Figure 1.

Conceptual model for multilevel mediation analysis.
Multilevel additive models on contributing third-variables
We first model the relationship between predictors and third-variables. We use the multilevel additive model, a nonlinear regression method first proposed by Ref. 36, to build relationships among variables. A multilevel additive model can deal with both nonlinear associations and cluster-specific heterogeneity.37 Assume that we have K level-2 and L level-1 third-variables as illustrated in Figure 1. In addition, assume that there are E1 level-1 and E2 level-2 predictors, denoted as Xije (e = 1, …, E1) and X.je (e = 1, …, E2), respectively, where the subscripts indicate the ith subject in the jth group. We propose the following linear additive multilevel models for multilevel third-variable analysis. A boldfaced letter indicates a potential vector of functions or numbers.
For level-2 third-variables, M.jk, k = 1, …, K
| (1) |
For level-1 third-variables, Mijl, l = 1, …, L
| (2) |
In Equations 1 and 2, f(·) denotes a transformation function vector of ·. The transformation enables modeling nonlinear relationships among variables. We assume that all the transformation functions are first-order differentiable. α and β are coefficient vectors for transformed variables in predicting the third-variable on the left side of each equation. In addition, the g(·)s are the link functions that link the expected third-variable with the right-hand side of each equation, the systematic component of a generalized linear model. For example, a count variable Mijk may have a log link function, g1k(E(Mijk)) = log(E(Mijk)). With the link function, we can deal with various types (e.g., categorical or count) of third-variables.
The multilevel additive proportional hazard model
In our proposed multilevel mediation analysis, the multilevel proportional hazard function is used to fit the relationship between a time-to-event outcome and all other variables (e.g., the predictors, third-variables, and covariates). A multilevel proportional hazard model has the following format
| $\lambda(t) = \lambda_0(t)\, e^{X\beta + Zb}, \quad b \sim N\big(0, \Sigma(\theta)\big)$ | (3) |
where λ0 is an unspecified baseline hazard function, and X and Z are the risk factor matrices with fixed and random effects, respectively. β is the vector of fixed-effect coefficients, and b is the vector of random-effect coefficients. The random effects b have a multivariate normal distribution with mean zero and a variance–covariance matrix ∑, which depends on the vector of parameters θ. For a comparison of multilevel survival analysis methods, readers are referred to Ref. 38.
Assume there are J groups at the second level. When a random intercept model is fitted, Z is the design matrix whose (i, j)th element, Zij, is 1 if subject i is a member of the jth group and 0 otherwise. The variance of the random effects in this case is ∑ = θI, where I is the identity matrix. The random intercept model is formulated as λ(t) = λ0(t) exp(αj) exp(Xβ), where αj denotes the random effect associated with the jth group. exp(αj) is called the “shared frailty,”39 which has a log-normal distribution in (3) but can also be assigned other distributions such as the gamma distribution. The variance of the random intercept indicates the intra-group association among subjects. With the multilevel hazard model, subjects in the same group are assumed to be correlated.
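As a numerical illustration of the random intercept model, the sketch below evaluates the hazard multiplier exp(αj) exp(xβ) relative to the baseline hazard for two subjects with identical covariates who live in different groups. All coefficient and random-effect values are invented for illustration and are not estimates from this study; the function name `relative_hazard` is hypothetical.

```python
import math


def relative_hazard(x, beta, alpha_j):
    """Hazard relative to the baseline: exp(alpha_j) * exp(x . beta)."""
    linear_predictor = sum(xi * bi for xi, bi in zip(x, beta))
    return math.exp(alpha_j) * math.exp(linear_predictor)


# Hypothetical fixed effects for two risk factors (assumed values).
beta = [0.03, 0.5]

# Hypothetical random intercepts for two groups (assumed values).
alpha = {"group_A": 0.2, "group_B": -0.1}

# Two subjects with the same covariates but in different groups.
covariates = [60, 1]
hazard_a = relative_hazard(covariates, beta, alpha["group_A"])
hazard_b = relative_hazard(covariates, beta, alpha["group_B"])

# The fixed-effect part cancels, leaving the ratio of the shared frailties.
frailty_ratio = hazard_a / hazard_b
print(round(frailty_ratio, 4))
```

Because the fixed-effect part cancels, the ratio equals exp(0.2 − (−0.1)); every subject in a group shares the same multiplier, which is what induces the within-group correlation.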
There can also be random slopes for risk factors. In the mediation analysis, a random slope for a risk factor means that the indirect effect through the risk factor can differ among groups. That is, the effect of the risk factor on the hazard rate is moderated by group. Here, we restrict the multilevel mediation analysis to random intercept models since the purpose is to account for correlations among subjects.
Using the notations in the Multilevel additive models on contributing third-variables subsection, the multilevel hazard model has the following format
| (4) |
Multilevel third-variable effects inference and interpretation
Based on the definitions of third-variable effects by Ref. 35, we derive the direct and indirect effects based on models (1), (2), and (4). In the following, f′(x) denotes the first derivative of function f with respect to the random variable X, evaluated at X = x. In addition, g−1 denotes the inverse function of g. We further denote μijk = E(Mijk) and μ.jk = E(M.jk).
With the relationships among variables built by models (1), (2), and (4), the derived third-variable effects for level-1 exposure variable Xije, e = 1, …, E1 on the outcome variable Y are
| (5) |
where IE indicates the indirect effect; DE, the direct effect; and TE, the total effect. The subscript 1 indicates that the effect is at level 1, and l is for the lth mediator. For level-1 predictors, there can only be level-1 third-variables. The total effect is interpreted as follows: when xije increases by 1 unit (or when xije changes from 0 to 1 for binary predictors), the hazard rate becomes exp(TE) times the original hazard rate. The direct and indirect effects can be interpreted in terms of relative effects, defined as the (in)direct effect divided by the total effect. The relative direct effect is the proportion of the total effect that remains between xije and the hazard rate after adjusting for other third-variables. The relative indirect effect of M is the proportion of the total effect that is through the third-variable M.
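The two interpretation rules above reduce to simple arithmetic. The sketch below is illustrative only: the effect values are hypothetical numbers on the log-hazard scale, and the helper names `hazard_ratio` and `relative_effect` are not functions of the mlma package.

```python
import math


def hazard_ratio(total_effect):
    """Multiplicative change in the hazard rate for a one-unit increase in x."""
    return math.exp(total_effect)


def relative_effect(effect, total_effect):
    """(In)direct effect expressed as a proportion of the total effect."""
    if total_effect == 0:
        raise ValueError("relative effects are undefined when the total effect is 0")
    return effect / total_effect


# Hypothetical estimates on the log-hazard scale (assumed values).
total_effect = 0.5
direct_effect = 0.4
indirect_effect = 0.1

hr = hazard_ratio(total_effect)
rel_direct = relative_effect(direct_effect, total_effect)
rel_indirect = relative_effect(indirect_effect, total_effect)

print(f"hazard ratio: {hr:.3f}")
print(f"relative direct effect: {rel_direct:.0%}")
print(f"relative indirect effect: {rel_indirect:.0%}")
```

With these assumed values, 80% of the total effect remains as a direct effect and 20% passes through the third-variable.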
A level-2 predictor can have both level-1 and level-2 third-variables. The derived third-variable effects for level-2 predictor X.je, e = 1, …, E2 on the outcome variable Y are
| (6) |
The interpretations of third-variable effects are similar to those for the level-1 predictors. For each pair of the predictor-outcome relationship, there is a set of total effect, direct effect, and indirect effects.
Finally, to calculate the variances of the estimated third-variable effects, two methods are used: (1) the Delta method based on the normal approximation of the estimates and (2) the nonparametric bootstrap method. Both methods are implemented in the R package mlma.
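The two variance approaches can be sketched generically. The code below is a minimal illustration, not the implementation in mlma: it applies the first-order Delta method to a ratio such as a relative effect (IE/TE) and computes a percentile bootstrap interval for an arbitrary statistic. The function names, the input values, and the simple resampling of individual observations (which ignores the multilevel structure) are all assumptions made for illustration.

```python
import random
import statistics


def delta_ratio_variance(ie, te, var_ie, var_te, cov_ie_te):
    """First-order Delta-method variance of the ratio ie / te."""
    d_ie = 1.0 / te           # partial derivative of ie/te with respect to ie
    d_te = -ie / te ** 2      # partial derivative of ie/te with respect to te
    return (d_ie ** 2 * var_ie
            + d_te ** 2 * var_te
            + 2.0 * d_ie * d_te * cov_ie_te)


def bootstrap_percentile_ci(data, statistic, n_boot=2000, level=0.95, seed=0):
    """Percentile confidence interval from nonparametric resampling."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower_index = round((1 - level) / 2 * n_boot)
    upper_index = min(n_boot - 1, round((1 + level) / 2 * n_boot) - 1)
    return estimates[lower_index], estimates[upper_index]


# Hypothetical estimates and variances (assumed values).
ratio_variance = delta_ratio_variance(1.0, 2.0, 0.04, 0.09, 0.0)

# Toy data set: percentile interval for the sample mean.
toy_data = list(range(1, 101))
ci_low, ci_high = bootstrap_percentile_ci(toy_data, statistics.mean, n_boot=500)
```

The Delta method relies on the normal approximation and is fast, whereas the bootstrap makes fewer distributional assumptions at the cost of refitting the models on every resample.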
The R package mlma
The R package mlma was developed by Yu et al.35 and has been updated to deal with time-to-event outcomes (versions 6.0-0 and after). In the functions provided by the package, users need to specify the potential mediator(s), predictor(s), outcome(s), covariate(s), and their transformation functions. Data are first read into the function data.org, which transforms variables into the desired formats and organizes the data set into an analytic format. Since a predictor needs to be at the same or a higher hierarchical level than its third-variables, if there are level-2 third-variables but no level-2 predictor, the level-1 predictor(s) will be aggregated into level-2 predictor(s).
In the second step, two tests are performed: (1) a check of the importance of each potential third-variable in estimating the outcome(s) when all transformed variables are used for the estimation and (2) a test of the association between each potential mediator/confounder and the predictor(s). The function mlma takes the output from data.org, performs the two tests, and estimates third-variable effects and their variances by the Delta method. The test results are presented by summarizing the output of mlma with summary(output). Researchers are advised to check the results and adjust the set of potential third-variables: those not significant in test (1) may be removed from further analysis, and those significant in test (1) but not in test (2) may be used as covariates. Note that the pre-screening tests are not formally used for third-variable selection. After the adjustment, the function data.org can be called again to organize the data and the function mlma to estimate the third-variable effects. In addition to the estimates of third-variable effects, the mlma function also calculates the variances of these estimates based on the normal approximation and the Delta method.
Finally, the function boot.mlma uses the bootstrap method in the multilevel third-variable analysis to generate the variance and confidence interval estimates.35 The generic functions summary and plot help summarize the estimation results (estimates, variances, and confidence intervals) and depict relationships among variables.
SEER data to explore the racial/ethnic disparity in breast cancer survival
We applied the proposed method to explore the racial/ethnic disparity in breast cancer survival, taking into account tumor characteristics, individual demographics, and census tract level residential environmental factors.
Data sources
For the individual-level dataset for breast cancer patients, we use data from the California population-based cancer registries of the Surveillance, Epidemiology, and End Results (SEER) program of the National Cancer Institute (NCI). Patients diagnosed from the Alaska Native Tumor Registry are excluded because additional confidentiality constraints are in place; those data cannot be used in any analyses that involve census tracts. The SEER Program registries routinely collect patient-level data from medical records, including patient demographics, tumor characteristics, cancer stage at diagnosis, the first course of treatment, and follow-up results (vital status, date of the last contact, and cause of death). The residential census tracts for patients at diagnosis can provide important information for exploring area attributes in cancer research. However, because of concerns about disclosing patient privacy, the publicly available research data usually do not include geographic locations of patient residential areas that are more specific than counties. Ref. 40 developed a method to provide multiply imputed, synthetic census tracts as a supplement to cancer registry data. The synthetic census tract identifier has been shown to produce similar cancer statistics by census tract based socioeconomic variables. To evaluate the usefulness of cancer registry data with synthetic census tracts in preserving statistical validity for more complex analyses, and to explore ways of safely releasing confidential data, the NCI funded a validation project for researchers to propose useful analyses of cancer registry data with census tracts. Selected researchers develop analysis plans and write statistical programming code using the synthetic census tracts. The NCI then runs the researcher-provided code on the real census tract data behind the firewall and returns the real data results to the researchers after the results are cleared by the NCI disclosure avoidance review. All analysis results presented hereafter in this paper are based on real cancer registry data and are cleared to be published.
To explore the racial/ethnic disparity in survival among breast cancer patients, this study includes all females diagnosed with primary invasive breast cancer between 2006 and 2017, excluding those diagnosed through autopsy or death certificate. Out of 237,167 cases, 77.95% were White patients, 6.49% were Black patients, and the remainder were of Other races/ethnicities. Among these patients, the all-cause death rate is 17.83% for White patients, 25.54% for Black patients, and 11.81% for Other race/ethnicity.
For the census tract level environmental data, we downloaded variables from the California Healthy Places Index (HPI). The HPI was developed by the Public Health Alliance of Southern California (Alliance) in partnership with the Virginia Commonwealth University’s Center on Society and Health. In addition to the overall HPI score, the index also contains eight sub-scores in the areas of economic, education, housing, health care access, neighborhood, clean environment, transportation, and social factors. The measurements are standardized to percentiles so that census tracts in California are comparable to each other. Readers are referred to the website http://healthyplacesindex.org for details and data downloading. The NCI work group linked the HPI variables at the census tract level with each patient and performed the analysis for this study. The R code for all analyses is provided in Section 1 of the Supplementary Material.
Descriptive analysis
To explore factors contributing to racial/ethnic disparities in breast cancer mortality, we first applied the mediation analysis by Ref. 15 to identify important third-variables that may explain the observed disparities. Multiple additive regression trees (MART) were used to build relationships among variables. MART is a tree-based ensemble method of data mining.41 In the mediation analysis, MART is used for exploratory and inference purposes. We benefit from the following properties of MART. First, MART can model nonlinear relationships between the dependent and independent variables. Second, owing to the hierarchical splitting scheme in regression trees, MART naturally captures multilevel data structure. Third, there are established tools for the tree-based method to help depict relationships among variables.42 Fourth, MART can handle different types of outcomes. Yu et al.15 used MART in mediation analysis to explore time-to-event outcomes. Here, we use their method to identify important third-variables at both the individual and environmental levels. The results are then used to guide the variable selection and transformation in the multilevel additive models (1) and (4).
We use the mma R package23 to do the third-variable analysis using MART. With the package, two conditions are tested to screen for third-variables: (a) that the potential third-variable is significantly related to the outcome (survival time) after adjusting for other variables and (b) that the third-variable is significantly related to the predictor (the race/ethnicity). If condition (a) is not satisfied, the potential third-variable is removed from further analysis. Otherwise, condition (b) is tested. If (b) is satisfied, the variable is handled as a third-variable; otherwise, it is treated as a covariate to the outcome only. The two tests are informal for identifying important third-variables since both tests check linear relationships only and the two tests are performed separately. Even if neither test is significant, a variable can still be forced into the analysis as a third-variable or a covariate. The formal identification of important third-variables is by the estimated confidence intervals of third-variable effects. If the confidence interval for the estimated indirect effect of the potential third-variable does not include 0, the variable is formally identified as a third-variable. Figure 2 shows the estimated third-variable effects and their confidence intervals from the mma package. White patients are the reference group. The left panel of Figure 2 shows the estimation results when Black patients are compared with White patients, and the right panel shows the Other race/ethnicity group compared with White patients. If the indirect effect is significant in either panel, the corresponding variable is used further in the multilevel models to explain the racial/ethnic differences. The census tract level variable names all end with “pctile,” indicating the ranking percentile of the census tract in California. The values are between 0 and 100.
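The screening rule described above can be written as a small decision function. This is a sketch of the logic as stated in the text, not code from the mma package; the function name `screen_variable` and the `force_as` argument are hypothetical.

```python
def screen_variable(related_to_outcome, related_to_predictor, force_as=None):
    """Classify a candidate variable using conditions (a) and (b).

    related_to_outcome   -- condition (a): significant relation with the outcome
    related_to_predictor -- condition (b): significant relation with the predictor
    force_as             -- optional override, since a variable can be forced
                            into the analysis regardless of the test results
    """
    if force_as is not None:
        return force_as
    if not related_to_outcome:
        return "removed"             # condition (a) fails
    if related_to_predictor:
        return "third-variable"      # conditions (a) and (b) both hold
    return "covariate"               # (a) holds but (b) does not


decisions = {
    "passes both tests": screen_variable(True, True),
    "passes (a) only": screen_variable(True, False),
    "fails (a)": screen_variable(False, True),
    "forced in": screen_variable(False, False, force_as="covariate"),
}
for label, decision in decisions.items():
    print(f"{label}: {decision}")
```

The override reflects that the two tests are informal; formal identification still rests on the confidence interval of the estimated indirect effect.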
Figure 2.
Mediation analysis results based on multiple additive regression trees.
As a result, all eleven individual variables and two environmental level variables are selected as potential third-variables. The individual level variable, year of diagnosis, is included as a covariate. The two environmental factors are bachelorsed_pctile and LEB_pctile. Both factors are standardized to the scale of percentile rankings in California. The bachelorsed_pctile is based on the proportion of the population over age 25 in a census tract with a bachelor’s education or higher. For example, a bachelorsed_pctile of 20 means that the census tract ranks above 20% of all tracts in California in terms of education at the bachelor’s level or higher. LEB_pctile is the census tract ranking percentile for the residents’ average life expectancy at birth.
We then check how the selected variables relate to the hazard rate and how they are distributed across race/ethnicity groups. Based on the fitted graphs from MART, we decide how to transform the third-variables so that the transformed variables are roughly linearly related to the hazard rate. For example, Figure 3 shows the variable associations of bachelorsed_pctile. The left panel shows the association between bachelorsed_pctile and the hazard rate of breast cancer death. We found that the hazard rate decreases with the percentile ranking of bachelor’s degrees. That is, the higher the proportion of people living in the census tract who have bachelor’s or higher degrees, the smaller the hazard rate for people living in the census tract. As the decreasing relationship is not linear, we transformed the variable bachelorsed_pctile with b-splines of two degrees of freedom, allowing a quadratic relationship to be fitted between bachelorsed_pctile and the hazard rate. The right panel shows the distributions of bachelorsed_pctile among White, Black, and Other race/ethnicity breast cancer patients. Compared to White patients (upper panel), Black patients (middle panel) were more likely to live in census tracts that have a low proportion of residents with bachelor’s or higher degrees, while the Other race/ethnicity patients (lower panel) were more likely to live in census tracts with a high percentile ranking of highly educated population. A variable transformation search is performed for all continuous third-variables. The graphs are provided in Section 2 of the Supplementary Materials.
Figure 3.
The associations of bachelorsed_pctile with other variables. The left panel is the association between bachelorsed_pctile and the hazard rate of breast cancer death. The right panel shows the distributions of bachelorsed_pctile among three racial/ethnic groups.
The multilevel mediation analysis
As a result of the mediation analysis based on MART, we select both individual and environmental level risk factors. Continuous variables are transformed according to their relationship with the hazard rate, and the categorical variables are binarized so that a K category variable is transformed into K – 1 dummy variables to be included in the analysis. The function joint.effect in the mlma package can return the inference results for a group of third-variables. We use the function to calculate the indirect effect of categorical third-variables, which have been decomposed into multiple dummy variables.
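The binarization of a K category variable into K – 1 dummy variables can be sketched as follows. This is a generic illustration rather than the transformation code in mlma; the function name `dummy_code` and the toy data are hypothetical.

```python
def dummy_code(values, reference):
    """Turn a categorical variable into K - 1 indicator columns.

    The reference category is dropped, so a row of zeros identifies it.
    """
    categories = sorted(set(values))
    if reference not in categories:
        raise ValueError(f"reference category {reference!r} not found")
    kept = [c for c in categories if c != reference]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


# Toy example with three categories and White as the reference group.
race = ["White", "Black", "Other", "White"]
columns, rows = dummy_code(race, reference="White")

print(columns)
for row in rows:
    print(row)
```

Because the effect of a categorical third-variable is spread over its dummy variables, a joint effect over the group of dummies is the natural summary.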
We input only one exposure variable—the race/ethnicity of the individual patient. There are three racial/ethnic groups; therefore, two dummy variables are created as the level one exposures: one is 1 for Black patients and the other is 1 for Other race/ethnicity. Since there are level two (environmental level) third-variables, level two exposure variables are automatically created in the mlma package by aggregating the level one exposure variables at the census tract level. Since no weight is used in the aggregation, the created exposure variables indicate the proportions of Black patients and Other race/ethnicity patients respectively of all breast cancer patients diagnosed in the census tract.
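The unweighted aggregation described above amounts to taking the mean of each level-one indicator within a census tract. The following sketch illustrates that calculation on toy data; it mirrors the description in the text and is not the code used inside mlma. The function name `aggregate_to_group` and the tract labels are hypothetical.

```python
from collections import defaultdict


def aggregate_to_group(indicator, group):
    """Unweighted group mean of a 0/1 indicator, i.e. the group proportion."""
    totals = defaultdict(lambda: [0, 0])   # group -> [sum of indicator, count]
    for value, g in zip(indicator, group):
        totals[g][0] += value
        totals[g][1] += 1
    return {g: s / n for g, (s, n) in totals.items()}


# Toy data: indicator is 1 for Black patients, 0 otherwise.
black = [1, 0, 0, 1, 1]
tract = ["t1", "t1", "t1", "t2", "t2"]

prop_black = aggregate_to_group(black, tract)

# Level-two exposure value attached back to each patient.
level2_exposure = [prop_black[g] for g in tract]
print(prop_black)
```

Each patient in a tract receives the same level-two exposure value, the proportion of Black patients among the breast cancer patients diagnosed in that tract.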
Table 1 shows the estimated third-variable effects along with the estimated confidence intervals for each pair of the exposure-outcome relationship. The outcome is the hazard rate of all-cause death. There are two level-one exposure variables and two level-two exposure variables; therefore, a total of four groups of third-variable effects are estimated. For each level-one exposure variable, there are only level-one third-variables, while for level-two exposure variables, there can be both level-one and level-two third-variables.
Table 1.
Estimation results with the multilevel mediation analysis. The effects are relative effects except for the total effects.
| | Blacks versus Whites: level one | Blacks versus Whites: level two | Other race/ethnicity versus Whites: level one | Other race/ethnicity versus Whites: level two |
|---|---|---|---|---|
| Total effect | 0.396 (0.338, 0.455) | 0.481 (0.353, 0.607) | −0.418 (−0.328, −0.235) | −0.400 (−0.510, −0.287) |
| Direct effect | 64.08 (57.50, 69.54) | −49.00 (−93.81, −18.51) | 77.88 (72.62, 82.54) | 1.07 (−33.57, 25.12) |
| Age | −10.02 (−12.92, −7.39) | −20.31 (−30.08, −13.30) | 18.82 (15.16, 22.58) | 18.00 (12.73, 24.83) |
| Insurance | 4.68 (3.76, 5.76) | 22.99 (16.85, 31.80) | 0.29 (−0.84, 0.16) | 6.50 (1.77, 12.82) |
| Marital status | 6.22 (5.06, 7.65) | 11.68 (8.86, 15.73) | 1.50 (1.04, 2.07) | −1.68 (−2.92, −0.67) |
| Size | 6.47 (5.33, 7.88) | 13.63 (10.29, 18.42) | −0.90 (−1.78, −0.07) | −1.35 (−3.32, 0.31) |
| Nodes | 3.45 (2.62, 4.49) | 8.75 (6.33, 11.78) | −0.07 (−0.78, 0.68) | −0.37 (−2.22, 1.10) |
| Grade | 6.48 (5.35, 7.91) | 16.03 (11.46, 23.24) | −1.32 (−2.23, −0.49) | 0.01 (−3.16, 3.40) |
| Stage | 9.18 (7.37, 11.41) | 21.62 (15.61, 29.10) | 1.82 (0.11, 3.54) | 9.16 (3.95, 14.38) |
| Subtype | 5.21 (4.12, 6.46) | 2.38 (−1.91, 6.64) | 1.11 (0.33, 1.92) | −2.05 (−8.31, 3.17) |
| Surgery | 3.53 (2.70, 4.57) | 6.77 (4.87, 9.30) | 0.80 (0.25, 1.42) | 1.51 (0.19, 3.32) |
| Radiation | 1.97 (1.48, 2.56) | 6.14 (4.57, 8.50) | −0.51 (−0.97, −0.10) | 0.06 (−0.95, 0.96) |
| Chemotherapy | −1.37 (−1.83, −0.95) | −1.23 (−2.22, −0.45) | 0.97 (0.60, 1.41) | 0.81 (0.16, 1.67) |
| Bachelorsed_pctile | — | 37.28 (25.60, 52.81) | — | 35.85 (24.48, 53.60) |
| LEB_pctile | — | 30.18 (19.51, 44.13) | — | 31.37 (20.49, 46.87) |
From Table 1, we see that on average, the hazard rate for individual Black patients is 48.6% (= exp(0.396) − 1) higher than that for White patients. At the census tract level, the hazard rate is higher when the proportion of Black patients is higher (the confidence interval (0.353, 0.607) lies entirely to the right of 0). The Other race/ethnicity group has an average hazard rate that is 65.84% (= exp(−0.418)) of that of White patients. The hazard rate is lower for census tracts with a higher proportion of Other race/ethnicity patients.
All other estimates are in terms of the relative effect, which is defined as the estimated (in)direct effect divided by the total effect. A confidence interval including 0 means that the (in)direct effect is not significant after adjusting for other variables. A relative effect can be negative, which indicates that the estimated effect is in the opposite direction to the total effect.
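Because a relative effect is the (in)direct effect divided by the total effect, multiplying it by the total effect recovers the effect on the log-hazard scale. The short sketch below does this for two level-one entries of Table 1 (Blacks versus Whites); the variable names are illustrative, and only point estimates are converted, not confidence intervals.

```python
# Level-one total effect for Blacks versus Whites (Table 1, log-hazard scale).
total_effect = 0.396

# Relative effects in percent, taken from the same column of Table 1.
relative_effects_pct = {"direct": 64.08, "age": -10.02}

# Convert each relative effect back to the log-hazard scale.
absolute_effects = {
    name: pct / 100.0 * total_effect
    for name, pct in relative_effects_pct.items()
}

for name, effect in absolute_effects.items():
    direction = "same as" if effect * total_effect > 0 else "opposite to"
    print(f"{name}: {effect:.4f} ({direction} the total effect)")
```

The negative relative effect of age yields an effect whose sign is opposite to that of the total effect, which is how the negative entries in Table 1 should be read.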
Conclusions
We explain the results for Black versus White patients and Other race/ethnicity group versus White patients separately.
Comparing Black with White patients
Comparing Black with White patients at the individual level, the direct effect is estimated at 64.08%, implying that after adjusting for the individual risk factors, 64.08% of the racial/ethnic difference in hazard rate remains unexplained. The individual demographic variables and cancer characteristics explained 35.92% of the difference, but other unmeasured variables should be considered to explain the rest of the racial/ethnic difference.
Age at diagnosis had a negative relative effect, contributing −10% to the total effect. This implies that if age at diagnosis were distributed equally between Black and White patients, the racial/ethnic disparity in survival would increase by 10%. Figure 4 shows the reason. On average, the hazard rate for Black patients is larger than that for White patients. The left panel of Figure 4 is the boxplot of the coefficients for age in model (4) across bootstrap samples. Age was included in the multilevel model without transformation. The coefficients are positive, meaning that the hazard rate increases with age: diagnosis at younger ages is associated with a lower hazard of death. The right panel gives the coefficients of Black, aggregated-Black, others, and aggregated-others across the bootstrap samples for model (2). The coefficient for the indicator Black (1 for Black patients and 0 for White patients) is mostly negative, meaning that on average, Black patients are more likely to be diagnosed at a younger age. Since younger age is related to a lower hazard, adjusting for age would actually enlarge the disparity in the hazard of death between Black and White breast cancer patients.
Figure 4.

The associations of age with other variables. The left panel shows the coefficient of age in model (4) across bootstrap samples, and the right panel shows the coefficients of the predictors (Black, aggregated-Black, others, and aggregated-others) in model (2).
The effects of other third-variables can be explained similarly. Figures similar to Figure 4 are provided for each potential third-variable in the online Supplementary Materials.
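The sign argument used for age can be sketched in the simplest linear setting, where an indirect effect is approximated by the product of the exposure-to-third-variable coefficient and the third-variable-to-outcome coefficient. This product-of-coefficients form is a simplifying assumption for illustration only; the method in this paper uses multilevel additive models. All numbers below are hypothetical and were chosen only to reproduce a relative effect of −10%.

```python
def indirect_effect(exposure_to_mediator, mediator_to_outcome):
    """Product-of-coefficients indirect effect (linear simplification)."""
    return exposure_to_mediator * mediator_to_outcome


def disparity_change_if_equalized(indirect, total):
    """Relative change in the disparity if the third variable were
    distributed equally across groups: the indirect path is removed,
    so the disparity changes by -indirect / total."""
    return -indirect / total


# Hypothetical coefficients mirroring the signs described for age:
# Black patients are diagnosed younger (negative coefficient), and
# older age raises the log-hazard (positive coefficient).
a = -2.0      # exposure -> age at diagnosis, in years (hypothetical)
b = 0.02      # age -> log-hazard per year (hypothetical)
total = 0.40  # total effect on the log-hazard scale (hypothetical)

ie = indirect_effect(a, b)                          # negative indirect effect
change = disparity_change_if_equalized(ie, total)   # positive: disparity grows
```

A negative exposure coefficient combined with a positive outcome coefficient gives a negative indirect effect, so removing that path increases the disparity, which is the pattern described for age at diagnosis.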
For the census tract level variables, the estimated total effect is also positive, which implies that an increased proportion of Black breast cancer patients in the census tract was related to an increased hazard rate. The direct effect at the census tract level becomes significantly negative, which means that after adjusting for all other factors, a higher proportion of Black patients is associated with a decreased mortality rate. An interesting factor is the ranking percentile of bachelor’s degrees. The variable was transformed using a B-spline with two degrees of freedom. Therefore, two coefficients are fitted, one for each of the transformed variables bachelorsed_pctile and bachelorsed_pctile.2, where the former captures a linear relationship and the latter a quadratic relationship. The left panel of Figure 5 shows that the fitted coefficients are both negative, indicating a decreasing relationship between the percentile ranking of bachelor’s degrees and the hazard rate. The right panel shows that the coefficient for the proportion of Black patients is negative in the model for the ranking percentile of bachelor’s degrees. Therefore, the variable bachelorsed_pctile explained 37% of the racial/ethnic difference in survival at the census tract level.
Figure 5.

The associations of transformed bachelorsed_pctile with other variables. The left panel shows the coefficients of the transformed bachelorsed_pctile in model (4) across bootstrap samples, and the right panel shows the coefficients of the predictors (aggregated-Black and aggregated-others) in model (1).
Comparing patients of Other race/ethnicity with White patients
At both the individual and the environmental levels, Other race/ethnicity breast cancer patients had a lower hazard rate compared with White patients. At the individual level, the average hazard rate for patients of Other race/ethnicity is 65.83% (exp(−0.418)) of that for White patients. At the census tract level, a 1% increase in the proportion of Other race/ethnicity patients is related to a 23% (1 − exp(−0.4)) decrease in the hazard rate overall.
After adjusting for other factors, 77.88% of the disparity at the individual level remained unexplained. Nearly 19% of the difference was explained by age at diagnosis. Figure 4 shows that patients of Other race/ethnicity were much more likely to be diagnosed at younger ages, at which the hazard rate is smaller. Other factors such as marital status, cancer stage and subtype, and chemotherapy each played a small role in explaining the difference in survival between Other race/ethnicity and White patients.
Other factors explained almost all of the differences in survival at the census tract level. The relative direct effect is only 1.07%, with a confidence interval that includes 0. Both bachelorsed_pctile and LEB_pctile are important in explaining the differences. Figure 6 shows the associations of LEB_pctile with the hazard rate (left panel) and with race/ethnicity (right panel). Again, we transformed LEB_pctile to allow a quadratic relationship with the hazard rate. The coefficients for both transformed variables are negative, indicating that the hazard rate decreases when LEB_pctile increases. A better average life expectancy at birth in the census tract is related to a lower hazard of death from breast cancer. The right panel shows that LEB_pctile is higher when the proportion of Other race/ethnicity patients is higher in the census tract. The effects of other variables in explaining the difference in hazard rate between Other race/ethnicity and White patients can be explained similarly.
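The conversion from a log-hazard coefficient to a hazard ratio, as used for the −0.418 coefficient above, can be sketched as follows. The helper names `hazard_ratio` and `percent_change_in_hazard` are hypothetical and are not functions of the mlma package.

```python
import math


def hazard_ratio(log_hazard_coef):
    """Hazard ratio implied by a log-hazard coefficient."""
    return math.exp(log_hazard_coef)


def percent_change_in_hazard(log_hazard_coef):
    """Percent change in hazard relative to the reference group."""
    return 100.0 * (math.exp(log_hazard_coef) - 1.0)


# Individual-level coefficient for Other race/ethnicity, as quoted in the text.
hr_other = hazard_ratio(-0.418)  # roughly 0.658, i.e. about two thirds of the reference hazard
```

A negative coefficient gives a hazard ratio below 1, which corresponds to a lower hazard than in the reference group (White patients in this analysis).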
Figure 6.

The associations of transformed LEB_pctile with other variables. The left panel shows the coefficients of the transformed LEB_pctile in model (4) across bootstrap samples, and the right panel shows the coefficients of the predictors (aggregated-Black and aggregated-others) in model (1).
Conclusions and future research
In this paper, we develop a multilevel mediation analysis method for time-to-event outcomes. Frailty models are used to account for correlations among patients living in the same residential environment. With the proposed method, data with a hierarchical structure can be analyzed, and third-variable effects (confounding or mediating effects) are differentiated across multiple levels. We also expand a previously developed R package, mlma, to implement the proposed method. The method is used to explain the racial/ethnic disparities in breast cancer survival, taking into account both individual level and census tract level risk factors. In this application, a large proportion of the racial/ethnic differences at the individual level remained unexplained. As a future research direction, we would like to collect genetic data and individual level behavioral data (e.g., smoking and physical activity) among breast cancer patients to check whether those variables can help further explain the observed racial/ethnic disparities at the individual level. In comparison, most of the differences at the census tract level were explained. Both the average educational level (proportion of bachelor’s or higher degrees) and the life expectancy at birth played important roles in explaining the racial/ethnic disparities in breast cancer survival at the census tract level.
The multilevel mediation analysis method works well in this application. Since generalized linear regression models are used for the analysis and many risk factors are highly correlated with each other, we plan to deal with the potential collinearities by implementing regularized regression methods in the multilevel model fitting. We have successfully used elastic-net regularized regressions in the single-level mediation method.43 As a next step, we will further develop a regularized multilevel mediation analysis to deal with high-dimensional and potentially highly correlated third-variables.
In addition, the frailty models used in this paper are random intercept models only. In future research, we will extend the multilevel mediation analysis method to handle random slopes, so that heterogeneous third-variable effects are allowed at higher levels.
Acknowledgments
We acknowledge the National Cancer Institute’s Surveillance, Epidemiology, and End Results Program for the research award (75N91020P00728) and for linking and providing data for this study. Part of this research was conducted with high performance computational resources provided by the Louisiana Optical Network Infrastructure.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was partially supported by the National Institute On Minority Health And Health Disparities of the National Institutes of Health under Award Number R15MD012387, and the National Institute of Environmental Health Sciences under the Award Number P42ES013648 and its administrative supplement P42ES013648-09S2.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental Material
Supplementary Material for this article is available online.
References
- 1. Lewis DR, Chen H-S, Midthune DN, et al. Early estimates of SEER cancer incidence for 2012: Approaches, opportunities, and cautions for obtaining preliminary estimates of cancer incidence. Cancer 2015; 121: 2053–2062.
- 2. Tarver T. Cancer Facts & Figures 2012. American Cancer Society (ACS). J Consumer Health On Internet 2012; 16: 366–367.
- 3. Elmore JG, Nakano CY, Linden HM, et al. Racial inequities in the timing of breast cancer detection, diagnosis, and initiation of treatment. Med Care 2005; 43: 141–148.
- 4. Lyman GH, Kuderer NM, Lyman SL, et al. Importance of race on breast cancer survival. Ann Surg Oncol 1997; 4: 80–87.
- 5. Moran MS, Yang Q, Harris LN, et al. Long-term outcomes and clinicopathologic differences of African-American versus white patients treated with breast conservation therapy for early-stage breast cancer. Cancer 2008; 113: 2565–2574.
- 6. Muss HB, Berry DA, Cirrincione C, et al. Toxicity of older and younger patients treated with adjuvant chemotherapy for node-positive breast cancer: the Cancer and Leukemia Group B experience. J Clin Oncol 2007; 25: 3699–3704.
- 7. Warner ET, Tamimi RM, Hughes ME, et al. Racial and ethnic differences in breast cancer survival: mediating effect of tumor characteristics and sociodemographic and treatment factors. J Clin Oncol 2015; 33: 2254–2261.
- 8. Bain RP, Greenberg RS and Whitaker JP. Racial differences in survival of women with breast cancer. J Chronic Dis 1986; 39: 631–642.
- 9. Eley JW. Racial differences in survival from breast cancer. Results of the National Cancer Institute Black/White Cancer Survival Study. JAMA: J Am Med Assoc 1994; 272: 947–954.
- 10. Li CI, Malone KE and Daling JR. Differences in breast cancer stage, treatment, and survival by race and ethnicity. Arch Intern Med 2003; 163(49): 49–56.
- 11. Lu Y, Ma H, Malone KE, et al. Obesity and survival among black women and white women 35 to 64 years of age at diagnosis with invasive breast cancer. J Clin Oncol 2011; 29: 3358–3365.
- 12. O’Malley CD, Le GM, Glaser SL, et al. Socioeconomic status and breast carcinoma survival in four racial/ethnic groups. Cancer 2003; 97: 1303–1311.
- 13. Russell E, Kramer MR, Cooper HLF, et al. Residential racial composition, spatial access to care, and breast cancer mortality among women in Georgia. J Urban Health 2011; 88: 1117–1129.
- 14. Fan XWY. General multiple mediation analysis with an application to explore racial disparities in breast cancer survival. J Biom Biostat 2013; 5(2): 1–9.
- 15. Yu Q, Wu X, Li B, et al. Multiple mediation analysis with survival outcomes: With an application to explore racial disparity in breast cancer survival. Stat Med 2018; 38: 398–412.
- 16. Alwin DF and Hauser RM. The decomposition of effects in path analysis. Am Sociol Rev 1975; 40(1): 37–47.
- 17. Judd CM and Kenny DA. Process analysis: estimating mediation in treatment evaluations. Eval Rev 1981; 5: 602–619.
- 18. Robins JM and Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology 1992; 3: 143–155.
- 19. Mackinnon DP and Dwyer JH. Estimating mediated effects in prevention studies. Eval Rev 1993; 17: 144–158.
- 20. Ten Have TR, Joffe MM, Lynch KG, et al. Causal mediation analyses with rank preserving models. Biometrics 2007; 63: 926–934.
- 21. Vanderweele TJ and Vansteelandt S. Conceptual issues concerning mediation, interventions and composition. Stat Its Interf 2009; 2(4): 457–468.
- 22. VanderWeele TJ. Marginal structural models for the estimation of direct and indirect effects. Epidemiology 2009; 20: 18–26.
- 23. Yu Q and Li B. mma: An R package for multiple mediation analysis. J Open Res Softw 2017; 5: 11.
- 24. Yu Q, Medeiros KL, Wu X, et al. Nonlinear predictive models for multiple mediation analysis: with an application to explore ethnic disparities in anxiety and depression among cancer survivors. Psychometrika 2018; 83: 991–1006.
- 25. Yu Q and Li B. A multivariate multiple third-variable effect analysis with an application to explore racial and ethnic disparities in obesity. J Appl Stat 2020; 48(4): 750–764.
- 26. Tofighi D, West SG and MacKinnon DP. Multilevel mediation analysis: the effects of omitted variables in the 1-1-1 model. Br J Math Stat Psychol 2012; 66(2): 290–307.
- 27. Krull JL and MacKinnon DP. Multilevel modeling of individual and group level mediated effects. Multivar Behav Res 2001; 36(2): 249–277.
- 28. Bauer DJ. Estimating multilevel linear models as structural equation models. J Educ Behav Stat 2003; 28(2): 135–167.
- 29. Kenny DA, Korchmaros JD and Bolger N. Lower level mediation in multilevel models. Psychol Methods 2003; 8(2): 115–128.
- 30. Gitelman AI. Estimating causal effects from multilevel group-allocation data. J Educ Behav Stat 2005; 30(4): 397–412.
- 31. Pituch KA, Whittaker TA and Stapleton LM. A comparison of methods to test for mediation in multisite experiments. Multivar Behav Res 2005; 40(1): 1–23.
- 32. Bauer DJ, Preacher KJ and Gil KM. Conceptualizing and testing random indirect effects and moderated mediation in multilevel models: new procedures and recommendations. Psychol Methods 2006; 11(2): 142–163.
- 33. Zhang Z, Zyphur MJ and Preacher KJ. Testing multilevel mediation using hierarchical linear models, problems and solutions. Organ Res Methods 2009; 12(4): 695–719.
- 34. Yuan Y and MacKinnon DP. Bayesian mediation analysis. Psychol Methods 2009; 14(4): 301–322.
- 35. Yu Q and Li B. Third-variable effect analysis with multilevel additive models. PLoS One 2021; 15(1–17): e0241072.
- 36. Friedman JH and Stuetzle W. Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 1981; 76: 817–823.
- 37. Lang S, Umlauf N, Wechselberger P, et al. Multilevel structured additive regression. Stat Comput 2012; 24(2): 223–238.
- 38. Austin PC. A tutorial on multilevel survival analysis: methods, models and applications. Int Stat Rev 2017; 85: 185–203.
- 39. Sophia R-H and Skrondal A. Multilevel and longitudinal modeling using Stata, volume i: continuous responses. Stat Methods Med Res 2016; 25: 3069.
- 40. Yu M, Reiter JP, Zhu L, et al. Protecting confidentiality in cancer registry data with geographic identifiers. Am J Epidemiol 2017; 186: 83–91.
- 41. Friedman JH and Meulman JJ. Multiple additive regression trees with application in epidemiology. Stat Med 2003; 22(9): 1365–1381.
- 42. Yu Q, Li B and Scribner RA. Hierarchical additive modeling of nonlinear association with spatial correlations-an application to relate alcohol outlet density and neighborhood assault rates. Stat Med 2009; 28: 1896–1912.
- 43. Li B, Yu Q, Zhang L, et al. Regularized multiple mediation analysis. Stat Interf 2021; 14(4): 449–458. in press.