Abstract
Background:
Binary endpoints measured at two timepoints—such as pre- and post-treatment—are common in biomedical and healthcare research. The Generalized Bivariate Bernoulli Model (GBBM) provides a specialized framework for analyzing such bivariate binary data, allowing for formal tests of covariate-dependent associations conditional on baseline outcomes. Despite its potential utility, the GBBM remains underutilized due to the lack of direct implementation in standard statistical software. Moreover, we contend that the comparison made in the original publication between the GBBM dependency test and the regressive logistic regression model has shortcomings and does not provide an ideal basis for evaluating the model's performance.
Methods:
In this paper, we propose a standard logistic regression model with an interaction term and demonstrate that it yields a dependency test equivalent to that of the GBBM approach. This equivalence is established conceptually, theoretically, and empirically. Extensive simulations compared the power of the GBBM dependency test with that of: a) the dependency test from the regressive logistic model; b) the test derived from the logistic regression model with interaction; and c) the Pearson Chi-square test. We also applied these methods to infant mortality data from the Bangladesh Demographic & Health Survey (BDHS).
Results:
The power of the GBBM dependency test differs from that of the regressive logistic regression model used as a benchmark in the original paper that introduced the GBBM methodology. In contrast, the power and type I error rate of the GBBM dependency test and of the logistic regression model with interaction described herein are equivalent across varying effect sizes and sample sizes.
Conclusion:
Our work reveals that a widely available and flexible logistic regression model can serve as a practical alternative to the GBBM dependency test, enhancing accessibility for researchers. Moreover, this approach provides a foundation for extending dependency analyses to more complex longitudinal binary data structures, broadening its applicability in biomedical research.
Keywords: Longitudinal binary endpoints, generalized linear models, repeated measures
1. INTRODUCTION
Dependence in outcome variables frequently arises in various fields. This is especially true for studies involving repeated measures or observations on the same unit/subject over time, particularly when binary outcomes are measured at two time points, such as pre- and post-intervention. For example, in genomics, genetic association studies often assess disease status at baseline and follow-up to examine how genetic variants affect disease progression, highlighting the need to model dependencies between time points1. In behavioral health, smoking cessation studies evaluate smoking status before and after interventions to understand the impact of personal habits2. Clinical research frequently employs randomized pre-post designs to compare treatment efficacy, with disease status measured before and after treatment3. Similarly, in psychological research, assessments of stress levels pre- and post-intervention have been employed to evaluate mindfulness therapies4. Vaccine efficacy studies assess infection status before and after vaccination to determine how well the vaccine works5. Recognizing the types of studies that are prone to dependency between repeated measurements of the same endpoint(s) and accurately modeling these dependencies are crucial both for valid interpretation and for avoiding misleading conclusions.
Several marginal approaches have been employed to model repeated binary outcomes arising from longitudinal studies6. Noteworthy among these are Generalized Estimating Equations (GEE), introduced and developed by Liang and Zeger7, Lipsitz et al.8, and others9–11. Additional approaches include the use of dependence measures for binary data with logistic regression12, the quadratic exponential form model13,14, and the generalized multivariate logistic model15. Several studies, including those by Wakefield16, have explored the limitations of marginal models, particularly in relation to Simpson’s paradox17. In contrast to marginal approaches, fewer studies have focused on conditional approaches. We briefly note that a conditional model describes the probability of an outcome given both the covariates and a previous outcome, capturing subject-specific dependence, whereas a marginal model describes the average relationship between the covariates and the outcome across the population, without conditioning on prior outcomes. Key contributions in this area include the development of Markov models for covariate dependence18–20 17. However, Islam et al.6 later argued that marginal or conditional models alone are insufficient to fully address the dependence in correlated outcome variables without incorporating a joint model. To address this issue, Islam et al. proposed a joint model called the Generalized Bivariate Bernoulli Model (GBBM). The GBBM integrates both marginal and conditional probabilities of correlated binary events, allowing the joint function to be fully specified. Furthermore, Islam et al. introduced a test for association, called the “dependency test”, which examines whether the relationship between the covariate(s) and the odds of the event depends on the outcome of the event at the earlier time point.
While the GBBM model exclusively addresses repeated measures of binary outcomes collected at only two time points, it is essential to acknowledge its broader applicability in various fields.
While working with the joint model proposed by Islam et al., we identified a significant gap in the availability of packages for established statistical software such as R, SAS, and Stata that support the analysis of data via the GBBM. Upon further investigation, we realized that the parameterization of the GBBM and the dependency test introduced by Islam et al. can be estimated using a more straightforward and easily implementable marginal logistic regression model that includes interaction effects. The latter allows for parameter estimation and the dependency test to be carried out using existing packages in established statistical software. Additionally, we identified certain points of contention in the original paper regarding the authors’ comparison of their proposed dependency test with what they referred to as the regressive logistic model. We find it valuable to further explore and clarify these differences, particularly in relation to various types of dependency tests for binary outcomes.
The primary objective of this study is to provide an alternative to the joint model proposed by Islam et al. that can be readily implemented using existing software and packages. To achieve this objective, we provide both a theoretical/conceptual comparison of the proposed logistic regression model with interaction effects and the GBBM dependency test, as well as an empirical comparison of parameter estimates and test results between the two approaches. Our applied comparison involved extensive simulation studies as well as an analysis of infant mortality data across consecutive births from the same mother using the Bangladesh Demographic & Health Survey (BDHS) Data21. Finally, we demonstrate that the comparison of the GBBM dependency test to the regressive logistic model reported in Islam et al. has shortcomings and, as a consequence, does not allow for a comprehensive comparison of the two approaches.
2. METHODS
2.1. Generalized Bivariate Bernoulli Model (GBBM)
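To address within-subject dependence in repeated measures of binary outcomes, Islam et al.6 proposed the Generalized Bivariate Bernoulli Model (GBBM). The GBBM uses a marginal-conditional approach to construct a joint model. Specifically, the joint distribution of two binary outcomes $Y_1$ and $Y_2$ is modeled using the Bivariate Bernoulli distribution (Eq. 1), which effectively captures the joint behavior of the two binary variables:
| $P(Y_1 = y_1, Y_2 = y_2) = p_{00}^{(1-y_1)(1-y_2)}\, p_{01}^{(1-y_1)y_2}\, p_{10}^{y_1(1-y_2)}\, p_{11}^{y_1 y_2}$ | (Eq. 1) |
where $p_{jk} = P(Y_1 = j, Y_2 = k)$, $j, k \in \{0, 1\}$, are the joint probabilities corresponding to the different combinations of $Y_1$ and $Y_2$. The joint probability mass function can be represented in the exponential family form, as indicated in Eq.2:
| $P(Y_1 = y_1, Y_2 = y_2) = \exp\{\eta_0 + \eta_1 y_1 + \eta_2 y_2 + \eta_3 y_1 y_2\}$ | (Eq.2) |
Following from Eq.2 and assuming independent samples, the log-likelihood function for the Bivariate Bernoulli distribution is given by:
| $\ell = \sum_{i=1}^{n} \left[\eta_{0i} + \eta_{1i} y_{1i} + \eta_{2i} y_{2i} + \eta_{3i} y_{1i} y_{2i}\right]$ | (Eq.3) |
where $\eta_0 = \log p_{00}$, $\eta_1 = \log(p_{10}/p_{00})$, $\eta_2 = \log(p_{01}/p_{00})$, and $\eta_3 = \log\{p_{00}\,p_{11}/(p_{01}\,p_{10})\}$.
As defined above, $\eta_1$, $\eta_2$, and $\eta_3$ represent the link functions used to relate the probabilities to covariates. Following from the relationship between joint, conditional, and marginal probabilities, the joint probability of $Y_1$ and $Y_2$ given in Eq. 1 can be expressed as:
| $P(Y_1 = y_1, Y_2 = y_2) = P(Y_2 = y_2 \mid Y_1 = y_1)\, P(Y_1 = y_1)$ | (Eq.4) |
which allows one to model the dependence between $Y_1$ and $Y_2$ using conditional relationships. When covariates $X = x$ are introduced, Eq.4 becomes:
| $P(Y_1 = y_1, Y_2 = y_2 \mid x) = P(Y_2 = y_2 \mid Y_1 = y_1, x)\, P(Y_1 = y_1 \mid x)$ | (Eq.5) |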
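As per Islam et al., the conditional probabilities in Eq.5 are modeled assuming a logit link function, resulting in the following expressions:
| $P(Y_2 = 1 \mid Y_1 = 0, x) = \dfrac{\exp(\beta_0^{(0)} + \beta_1^{(0)} x)}{1 + \exp(\beta_0^{(0)} + \beta_1^{(0)} x)}$ | (Eq.6) |
| $P(Y_2 = 1 \mid Y_1 = 1, x) = \dfrac{\exp(\beta_0^{(1)} + \beta_1^{(1)} x)}{1 + \exp(\beta_0^{(1)} + \beta_1^{(1)} x)}$ | (Eq.7) |
where $\beta_1^{(0)}$ is the effect of covariate $X$ on the probability that $Y_2 = 1$ conditional on $Y_1 = 0$, $\beta_1^{(1)}$ is the effect of covariate $X$ on the probability that $Y_2 = 1$ conditional on $Y_1 = 1$, and ($\beta_0^{(0)}$, $\beta_0^{(1)}$) represent the corresponding intercept parameters. Similarly, the marginal probability of $Y_1$ can be written as a function of the covariates. Again, assuming a logit link, we have:
| $P(Y_1 = 1 \mid x) = \dfrac{\exp(\alpha_0 + \alpha_1 x)}{1 + \exp(\alpha_0 + \alpha_1 x)}$ | (Eq.8) |
where $\alpha_1$ is the effect of covariate $X$ on the probability of $Y_1 = 1$ and $\alpha_0$ is the intercept parameter. Given the above, the joint probabilities of $Y_1$ and $Y_2$ conditional on covariates $X = x$ can be written as follows:
| $P(Y_1 = 0, Y_2 = 0 \mid x) = \dfrac{1}{1 + e^{\alpha_0 + \alpha_1 x}} \cdot \dfrac{1}{1 + e^{\beta_0^{(0)} + \beta_1^{(0)} x}}$ | (Eq.9) |
| $P(Y_1 = 0, Y_2 = 1 \mid x) = \dfrac{1}{1 + e^{\alpha_0 + \alpha_1 x}} \cdot \dfrac{e^{\beta_0^{(0)} + \beta_1^{(0)} x}}{1 + e^{\beta_0^{(0)} + \beta_1^{(0)} x}}$ | (Eq.10) |
| $P(Y_1 = 1, Y_2 = 0 \mid x) = \dfrac{e^{\alpha_0 + \alpha_1 x}}{1 + e^{\alpha_0 + \alpha_1 x}} \cdot \dfrac{1}{1 + e^{\beta_0^{(1)} + \beta_1^{(1)} x}}$ | (Eq.11) |
| $P(Y_1 = 1, Y_2 = 1 \mid x) = \dfrac{e^{\alpha_0 + \alpha_1 x}}{1 + e^{\alpha_0 + \alpha_1 x}} \cdot \dfrac{e^{\beta_0^{(1)} + \beta_1^{(1)} x}}{1 + e^{\beta_0^{(1)} + \beta_1^{(1)} x}}$ | (Eq.12) |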
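As described in Islam et al., estimating equations for the parameters of the GBBM are derived from the log-likelihood function (Eq.3) using the above joint probabilities (Eq.9–Eq.12). Because there is no closed-form analytic solution for the model parameters, Newton-Raphson is used to estimate the parameters $\alpha = (\alpha_0, \alpha_1)'$, $\beta^{(0)} = (\beta_0^{(0)}, \beta_1^{(0)})'$, and $\beta^{(1)} = (\beta_0^{(1)}, \beta_1^{(1)})'$, that is, the intercept and slope of the marginal model for $Y_1$ and of the conditional models for $Y_2$ given $Y_1 = 0$ and $Y_1 = 1$, respectively. We draw the reader’s attention to the above expressions for the link functions and note that the dependency between $Y_1$ and $Y_2$ is captured by the log odds ratio $\eta_3 = \log\{p_{00}\,p_{11}/(p_{01}\,p_{10})\}$. Substituting the expressions given for the joint probabilities (Eq.9–Eq.12) and writing $\eta_3$ in terms of $\beta^{(0)}$ and $\beta^{(1)}$, the marginal terms cancel and we obtain the following expression:
$\eta_3 = (\beta_0^{(1)} - \beta_0^{(0)}) + (\beta_1^{(1)} - \beta_1^{(0)})\,x = \tilde{x}'(\beta^{(1)} - \beta^{(0)})$
where $\tilde{x} = (1, x)'$. When $\eta_3 = 0$ for every value of $x$, this implies that $\beta^{(1)} - \beta^{(0)} = 0$ or, equivalently, that $\beta^{(0)} = \beta^{(1)}$. That is, the relationship between $X$ and $Y_2$ is the same irrespective of the outcome $Y_1$. In other words, the relationship between $X$ and $Y_2$ is not dependent on $Y_1$ (i.e., no dependency). Islam et al. formally assess the dependency of $Y_2$ on $Y_1$ by developing a dependency test using a Wald test (Eq. 13) of the null hypothesis $H_0: \beta^{(0)} = \beta^{(1)}$.
| $W = (\hat{\beta}^{(0)} - \hat{\beta}^{(1)})' \left[\widehat{\mathrm{Var}}(\hat{\beta}^{(0)} - \hat{\beta}^{(1)})\right]^{-1} (\hat{\beta}^{(0)} - \hat{\beta}^{(1)})$ | (Eq.13) |
Under the null hypothesis, the test statistic (Eq. 13) follows a central $\chi^2$ distribution with 2 degrees of freedom.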
2.2. Regressive Logistic Regression Model
Islam et al. compared the performance of the proposed test for dependence (Eq.13) with an alternative test derived from a “regressive logistic model”. The regressive model represents only the conditional part of the full model, wherein the probability of $Y_2 = 1$ is modeled conditional on the previous outcome, $Y_1 = y_1$, and the explanatory variable, $X = x$, as follows:
| $\operatorname{logit} P(Y_2 = 1 \mid y_1, x) = \gamma_0 + \gamma_1 x + \gamma_2 y_1$ | (Eq.14) |
where $\gamma_0$, $\gamma_1$ and $\gamma_2$ are parameters. It is important to note that while $\gamma_2 = 0$ suggests no dependence between $Y_1$ and $Y_2$ based on the parameterization given in Eq. 14, the GBBM dependency test (Eq.13) captures the dependence between the covariate $X$ and $Y_2$ conditional on the outcome at time-point 1, $Y_1$. Therefore, we argue that this comparison is not ideal, as the two tests address fundamentally different questions regarding dependency.
2.3. Logistic Regression Model with Interaction
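In the present paper, we compare the GBBM dependency test (Eq.13) with a test resulting from a subtle change to the regressive logistic regression model described above (Eq. 14). In Eq. 14, the coefficient of $y_1$ was treated as a scalar, $\gamma_2$. However, by allowing this coefficient to be a function of the covariate, $\gamma(x) = \gamma_2 + \gamma_3 x$, we derive a dependency test equivalent to the GBBM dependency test (Eq.13). This functional parameter introduces an interaction term between $Y_1$ and covariate $X$ into the regressive logistic regression model. The addition of this interaction term allows one to distinguish how the effect of $X$ on $Y_2$ varies depending on the value of $Y_1$. This is reflected through the parameter $\gamma_3$ in the model below:
| $\operatorname{logit} P(Y_2 = 1 \mid y_1, x) = \gamma_0 + \gamma_1 x + \gamma_2 y_1 + \gamma_3\, x\, y_1$ | (Eq.15) |
When $y_1 = 0$ we have $\operatorname{logit} P(Y_2 = 1 \mid y_1 = 0, x) = \gamma_0 + \gamma_1 x$, and when $y_1 = 1$ we have $\operatorname{logit} P(Y_2 = 1 \mid y_1 = 1, x) = (\gamma_0 + \gamma_2) + (\gamma_1 + \gamma_3) x$. Comparing these results to Eq. 6 and Eq.7, and writing $(\beta_0^{(j)}, \beta_1^{(j)})$ for the intercept and slope of the GBBM conditional model given $Y_1 = j$, we observe that $\beta_0^{(0)} = \gamma_0$, $\beta_1^{(0)} = \gamma_1$, $\beta_0^{(1)} = \gamma_0 + \gamma_2$, and $\beta_1^{(1)} = \gamma_1 + \gamma_3$. Since the GBBM dependency test is based on the differences ($\beta_0^{(1)} - \beta_0^{(0)}$) and ($\beta_1^{(1)} - \beta_1^{(0)}$), and given that $\beta_0^{(1)} - \beta_0^{(0)} = \gamma_2$ and $\beta_1^{(1)} - \beta_1^{(0)} = \gamma_3$, the equivalent Wald test for testing $H_0: \gamma_2 = \gamma_3 = 0$ from the logistic regression model is:
| $W = \hat{\gamma}' \left[\widehat{\mathrm{Var}}(\hat{\gamma})\right]^{-1} \hat{\gamma}$ | (Eq.16) |
where $\hat{\gamma} = (\hat{\gamma}_2, \hat{\gamma}_3)'$ and $\widehat{\mathrm{Var}}(\hat{\gamma})$ is the corresponding 2 × 2 block of the estimated covariance matrix.
Under the null hypothesis, the aforementioned test statistic follows a central $\chi^2$ distribution with 2 degrees of freedom.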
Further, we note that the likelihood of the joint GBBM model is proportional to that of the conditional logistic regression model with interaction. Specifically, by expressing the joint distribution in exponential family form and conditioning on , the resulting conditional distribution of follows a logistic form consistent with Eq. 15. This proportionality implies that the two models share the same score functions and Fisher Information and thus yield equivalent asymptotic variances for corresponding parameters.
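As an illustration of how the interaction-model Wald test can be carried out, the following minimal sketch fits the logistic regression model with interaction by Newton-Raphson and computes a 2-degree-of-freedom Wald statistic for the coefficients of the previous outcome and of the interaction. It is not the authors' implementation: the function names, the simulated data, and the true coefficient values are hypothetical, and only the Python standard library is used. The p-value relies on the fact that the upper tail of a chi-square distribution with 2 degrees of freedom is exp(-w/2).

```python
import math
import random


def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_logistic(rows, y, max_iter=50, tol=1e-10):
    """Newton-Raphson logistic regression; returns (coefficients, information)."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for r, yi in zip(rows, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, r))))
            for i in range(k):
                score[i] += (yi - p) * r[i]
                for j in range(k):
                    info[i][j] += p * (1.0 - p) * r[i] * r[j]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    # info was evaluated one (negligible) Newton step before the final beta.
    return beta, info


def wald_dependency_test(x, y1, y2):
    """Wald test of H0: coefficient of y1 = coefficient of x*y1 = 0 (2 df)."""
    rows = [[1.0, xi, float(a), xi * a] for xi, a in zip(x, y1)]
    beta, info = fit_logistic(rows, y2)
    k = len(beta)
    # Columns 2 and 3 of the inverse information (estimated covariance matrix).
    cols = [solve(info, [float(i == j) for i in range(k)]) for j in (2, 3)]
    cov = [[cols[c][r] for c in (0, 1)] for r in (2, 3)]
    b = beta[2:]
    w = sum(bi * zi for bi, zi in zip(b, solve(cov, b)))
    return beta, w, math.exp(-w / 2.0)  # chi-square upper tail with 2 df


# Hypothetical data: the slope of x is 0.2 when y1 = 0 and 1.2 when y1 = 1.
rng = random.Random(2024)
x, y1, y2 = [], [], []
for _ in range(1000):
    xi = rng.gauss(0.0, 1.0)
    a = int(rng.random() < 0.5)
    eta = 0.2 * xi + 1.0 * xi * a
    x.append(xi)
    y1.append(a)
    y2.append(int(rng.random() < 1.0 / (1.0 + math.exp(-eta))))

beta, w, p_value = wald_dependency_test(x, y1, y2)
print("coefficients:", beta)
print("Wald statistic:", w, "p-value:", p_value)
```

In practice the same fit and joint Wald test would normally be obtained from an existing logistic regression routine in standard statistical software; the hand-written fitter above only keeps the sketch self-contained.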
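The logistic regression model with interaction can be extended to include multiple covariates. For example, if we consider two covariates, $X_1$ and $X_2$, in the analysis, the model can be expressed as:
$\operatorname{logit} P(Y_2 = 1 \mid y_1, x_1, x_2) = \gamma_0 + \gamma_1 x_1 + \gamma_2 x_2 + \delta_0 y_1 + \delta_1\, x_1 y_1 + \delta_2\, x_2 y_1$
In this model, $\gamma_0, \gamma_1, \gamma_2$ and $\delta_0, \delta_1, \delta_2$ are the model parameters. To test the null hypothesis $H_0: \delta_0 = \delta_1 = \delta_2 = 0$, the Wald test statistic is $W = \hat{\delta}' [\widehat{\mathrm{Var}}(\hat{\delta})]^{-1} \hat{\delta}$, where $\hat{\delta} = (\hat{\delta}_0, \hat{\delta}_1, \hat{\delta}_2)'$ and $\widehat{\mathrm{Var}}(\hat{\delta})$ is the corresponding 3 × 3 block of the estimated covariance matrix.
Under the null hypothesis, the test statistic follows a central chi-squared distribution with 3 degrees of freedom. In a similar manner, this approach can be extended to accommodate $p$ covariates in the model.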
2.4. Simulation Studies
The objective of our simulation studies was to compare the power of the GBBM dependency test (Eq.13) with: a) the Pearson Chi-square test of dependency; b) a dependency test based on the regressive logistic regression model (Eq.14); and c) a test derived from the logistic regression model with interaction (Eq.15). Simulations were conducted assuming different study sample sizes and effect sizes; for each setting of the simulation parameters, we used a total of 1000 Monte Carlo iterations. The data generation process involved several steps. First, a covariate $X$ was drawn from a standard normal distribution, $X \sim N(0, 1)$. Next, using predetermined parameters of the GBBM, we generated conditional (Eq. 6–7) and marginal (Eq. 8) probabilities as defined in Section 2.1. The conditional and marginal probabilities were then multiplied to calculate joint probabilities (Eq. 9–12). Using the joint probabilities, we generated the response variables from a multinomial distribution with a single trial and cell probabilities $(p_{00}, p_{01}, p_{10}, p_{11})$, where $p_{jk}$ represents the joint probability of $Y_1 = j$ and $Y_2 = k$. Finally, values of $Y_1$ and $Y_2$ were determined based on the cell generated from the multinomial distribution.
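The data generation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used for the reported simulations: the function names and all parameter values passed to `simulate` are hypothetical, and each of `marg`, `cond0`, and `cond1` is an (intercept, slope) pair for the marginal model of the first outcome and the two conditional models of the second outcome.

```python
import math
import random


def expit(v):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-v))


def joint_probabilities(x, marg, cond0, cond1):
    """Return (p00, p01, p10, p11) for covariate value x."""
    p1 = expit(marg[0] + marg[1] * x)      # P(Y1 = 1 | x)
    q0 = expit(cond0[0] + cond0[1] * x)    # P(Y2 = 1 | Y1 = 0, x)
    q1 = expit(cond1[0] + cond1[1] * x)    # P(Y2 = 1 | Y1 = 1, x)
    return ((1 - p1) * (1 - q0), (1 - p1) * q0, p1 * (1 - q1), p1 * q1)


def simulate(n, marg, cond0, cond1, seed=0):
    """Draw n triples (x, y1, y2) via a single-trial multinomial over the four cells."""
    rng = random.Random(seed)
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                       # standard normal covariate
        probs = joint_probabilities(x, marg, cond0, cond1)
        y1, y2 = rng.choices(cells, weights=probs, k=1)[0]
        data.append((x, y1, y2))
    return data


# Hypothetical parameter values; the slopes differ by 0.5 across levels of Y1.
data = simulate(5000, marg=(0.0, 0.5), cond0=(0.0, 0.2), cond1=(0.0, 0.7), seed=1)
share_y1 = sum(y1 for _, y1, _ in data) / len(data)
print("share with Y1 = 1:", share_y1)
```

Drawing one of the four cells and then reading off the two outcomes mirrors the multinomial step in the text; drawing the first outcome from its marginal and the second from the matching conditional would give the same joint distribution.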
Each simulation involved data generation based on the following fixed parameters: , , , , , and . We considered the following sample sizes (n) and effect sizes (): n = 500, 1000, 1500, 2000 and effect sizes of 0.2, 0.4, and 0.5. Each test was conducted using a predetermined Type I error rate of . Power was calculated as the proportion of Monte Carlo simulations in which the null hypothesis was correctly rejected ().
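For readers reproducing the power calculation, a minimal sketch of the rejection-rate computation is given below. The helper name and the toy p-values are hypothetical, and the default threshold of 0.05 is an assumption based on the approximately 5% type I error rate reported in the results. The Monte Carlo standard error is the usual binomial one and gives a sense of how much two estimated powers can differ by chance alone.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Proportion of Monte Carlo replicates with p < alpha, with its binomial standard error."""
    n = len(p_values)
    rate = sum(1 for p in p_values if p < alpha) / n
    mc_se = math.sqrt(rate * (1.0 - rate) / n)
    return rate, mc_se


# Toy p-values standing in for the output of one simulation setting.
toy = [0.01, 0.20, 0.03, 0.50, 0.04, 0.60, 0.001, 0.07, 0.049, 0.90]
rate, mc_se = rejection_rate(toy)
print(rate, mc_se)

# Standard error of an estimated power of 0.590 based on 1000 iterations.
se_1000 = math.sqrt(0.590 * (1.0 - 0.590) / 1000)
print(se_1000)
```

With 1000 iterations, an estimated power near 0.59 has a Monte Carlo standard error of roughly 0.016.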
One of the objectives of this simulation study was to compare the dependency test of the GBBM with Pearson’s Chi-square test of dependency. For each dataset generated, we performed two tests: the GBBM test of dependency (Eq.13) and a Pearson’s Chi-square test using the 2 × 2 contingency table of $Y_1$ and $Y_2$.
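For reference, the Pearson test on the 2 × 2 table can be sketched as below. The function name and the toy counts are hypothetical, and no continuity correction is applied; the text does not state whether one was used, and some software applies one by default for 2 × 2 tables. The p-value uses the identity that the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(sqrt(stat / 2)).

```python
import math


def pearson_chi2_2x2(y1, y2):
    """Pearson chi-square statistic and p-value for two binary sequences (1 df)."""
    n = len(y1)
    counts = [[0, 0], [0, 0]]
    for a, b in zip(y1, y2):
        counts[a][b] += 1
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n  # assumes no empty row or column
            stat += (counts[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy table with cell counts 10, 20 (Y1 = 0) and 30, 40 (Y1 = 1).
y1 = [0] * 30 + [1] * 70
y2 = [0] * 10 + [1] * 20 + [0] * 30 + [1] * 40
stat, p_value = pearson_chi2_2x2(y1, y2)
print(stat, p_value)
```

Because this test uses only the cross-tabulation of the two outcomes, it cannot detect a dependency that operates through the covariate, which is the pattern targeted by the GBBM dependency test.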
Another comparison focused on evaluating the GBBM dependency test against the regressive logistic regression model (Eq.14) used by Islam et al. to benchmark the performance of the GBBM dependency test. The data generation process followed the same approach as in the previous case. For the regressive model, the null hypothesis that the coefficient of $Y_1$ equals zero was tested, and the results were compared with those of the GBBM dependency test, which evaluates the null hypothesis that the conditional intercepts and slopes are equal across the two levels of $Y_1$.
Additionally, the performance of the GBBM dependency test was examined in relation to the dependency test derived from the logistic regression model with interaction (Eq.15). This comparison was intended to showcase the performance of these two dependency tests across varying sample sizes and effect sizes. For each generated dataset, we conducted the dependency test using the GBBM method (Eq.13), a dependency test based on the logistic regression model with interaction (Eq.16), and a likelihood ratio test based on the logistic regression model with an interaction.
We also calculated the type I error rates for all competing models to assess their ability to maintain the nominal significance level under the null hypothesis. For this, we generated data using the following fixed parameters: , , , and . As & in this simulation, the effect of covariate on does not depend on .
2.5. Analyzing the Dependence of Neonatal Mortality Status Across Consecutive Births
To compare GBBM and the logistic regression model with an interaction term in a real-world context, we utilized data collected from the 2014 Bangladesh Demographic and Health Survey (BDHS). This survey selected 18,000 residential households and interviewed 17,862 ever-married women, comprising 6,167 urban and 11,696 rural residents. For the purposes of the present study, we focused on a subset of 11,951 women who had their first two children between 1991 and 2014. Our analysis examined whether the relationship between the neonatal mortality status of the second child and the mother's education level depends on the neonatal mortality status of the first child. The binary outcome of interest was neonatal survival (0 = alive, 1 = deceased) for both the first and second births. To maintain comparability with our simulation studies, we considered a single explanatory factor: the mother's education level, categorized as secondary education or beyond or less than secondary education (treated as the referent group). Along with the GBBM and the logistic regression model with interaction, we also performed dependency tests using the regressive logistic model and the Pearson Chi-square test to evaluate and compare the performance of these methods in real data.
3. RESULTS
3.1. Comparison of the Operating Characteristics of the Dependency Tests in Simulated Data
Table 1 presents the results comparing the power of the different dependency tests. As expected, the statistical power of the GBBM dependency test increases as both the sample size and effect size increase. Conversely, for the Pearson Chi-square test there was a much less pronounced relationship between power and effect size, demonstrating that a distinct pattern of dependency between $Y_1$ and $Y_2$ via the covariate $X$ is not captured by the Pearson Chi-square test. For example, at a sample size of 1000 and an effect size of 0.5, the power was 0.867 for the GBBM and 0.055 for Pearson’s Test.
Table 1:
Statistical power across varied sample size & effect size for the GBBM, Logistic Regression Model with Interaction, Regressive Logistic Model & Pearson’s Test.
| Sample size | GBBM (0.2) | GBBM (0.4) | GBBM (0.5) | Interaction, Wald (0.2) | Interaction, Wald (0.4) | Interaction, Wald (0.5) | Interaction, LRT (0.2) | Interaction, LRT (0.4) | Interaction, LRT (0.5) | Regressive (0.2) | Regressive (0.4) | Regressive (0.5) | Pearson (0.2) | Pearson (0.4) | Pearson (0.5) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500 | 0.138 | 0.423 | 0.590 | 0.138 | 0.423 | 0.590 | 0.141 | 0.427 | 0.591 | 0.046 | 0.063 | 0.074 | 0.053 | 0.049 | 0.050 |
| 1000 | 0.231 | 0.708 | 0.867 | 0.231 | 0.708 | 0.867 | 0.233 | 0.708 | 0.866 | 0.061 | 0.059 | 0.090 | 0.065 | 0.046 | 0.055 |
| 1500 | 0.338 | 0.874 | 0.972 | 0.338 | 0.874 | 0.972 | 0.337 | 0.873 | 0.972 | 0.064 | 0.090 | 0.094 | 0.070 | 0.062 | 0.056 |
| 2000 | 0.418 | 0.941 | 0.992 | 0.418 | 0.941 | 0.992 | 0.416 | 0.941 | 0.992 | 0.056 | 0.103 | 0.117 | 0.087 | 0.071 | 0.069 |
As a reminder, another objective was to compare the GBBM dependency test with the regressive logistic regression model, which was used as a performance benchmark by Islam et al. We observed distinct differences between the power of the GBBM dependency test and that of the dependency test based on the regressive logistic regression model (Table 1). The GBBM demonstrated a clear increase in power with both increasing sample size and effect size, aligning with expected theoretical behavior. In contrast, the regressive logistic regression model did not exhibit a consistent pattern, indicating that the underlying dependencies captured by the GBBM are not fully captured by the regressive logistic regression approach. For example, at a sample size of 500 and an effect size of 0.5, power was 0.590 for the GBBM dependency test and 0.074 for the regressive logistic regression model. Furthermore, the power of the regressive logistic regression model was comparable to that of the Pearson Chi-square test, which is unsurprising given that the latter is focused on the marginal dependency of $Y_1$ and $Y_2$. This suggests that although the regressive logistic regression model shows similar power to the Pearson Chi-square test, it fails to capture the more detailed and complex dependency patterns that the GBBM identifies.
Table 1 also presents the power comparison results between the GBBM dependency test and the logistic regression model with an interaction, using both a Wald test and a likelihood ratio test (LRT). The results show that the GBBM and the logistic regression model with interaction using the Wald test produced identical estimates of power across all effect sizes and sample sizes, indicating that both tests perform equivalently in terms of power. In contrast, the power values obtained from the LRT showed slight differences compared to those of the GBBM dependency test, although these differences were minimal and within a comparable range. For instance, at a sample size of 500 and an effect size of 0.5, the power was 0.590 for both the GBBM dependency test and the Wald test, and 0.591 for the LRT. These minor variations suggest that while the LRT yields power values very similar to those of the GBBM, slight discrepancies exist due to the distinct statistical frameworks of the LRT and Wald tests.
Table 2 shows the type I error rate across all approaches. We observe that for the GBBM dependency test, the type I error rate is controlled at approximately 5%, as expected given our predetermined significance threshold. The type I error rate is also controlled at approximately 5% for the regressive logistic regression model and for logistic regression with interaction effects, regardless of whether the Wald or likelihood ratio test is used. In contrast, for the Pearson Chi-square test we observe a slight increase in the type I error rate as a function of increasing study sample size.
Table 2:
Type I error rates for the GBBM, Logistic Regression Model with Interaction, Regressive Logistic Model & Pearson’s Test.
| Model/Sample Size | 500 | 1000 | 1500 | 2000 |
|---|---|---|---|---|
| GBBM | 0.049 | 0.055 | 0.043 | 0.053 |
| Logistic Regression Model with interaction (Wald) | 0.049 | 0.055 | 0.043 | 0.053 |
| Logistic Regression Model with interaction (LRT) | 0.053 | 0.051 | 0.045 | 0.050 |
| Regressive Logistic Model | 0.042 | 0.047 | 0.041 | 0.052 |
| Pearson’s Test | 0.052 | 0.071 | 0.084 | 0.086 |
3.2. Comparison of Parameter Estimates Between GBBM and Logistic Regression Model with Interaction
Here, we compare the parameter estimates and their variances between the GBBM and the logistic regression model with interaction based on a sample size of 2000. As noted in Table 3, for the GBBM the estimated intercept given $Y_1 = 0$ was 0.445, with an estimated variance of 0.0058. The estimated slope of $X$ given $Y_1 = 0$ was 0.212, with an estimated variance of 0.0060. The intercept given $Y_1 = 1$ was estimated to be 0.518, with an estimated variance of 0.0039, while the slope of $X$ given $Y_1 = 1$ was estimated to be 0.795, with an estimated variance of 0.0048. The logistic regression model with interaction provides estimates of its intercept and of the coefficient of $X$ that are identical to the GBBM estimates given $Y_1 = 0$ (Table 3). The logistic regression model with interaction also captures the differences between parameters from the GBBM. The difference between the two GBBM intercepts was 0.073, with a combined variance of 0.0096, which is identical to the estimate and variance of the coefficient of $Y_1$ in the logistic regression model with interaction. Similarly, the difference between the two GBBM slopes was 0.583, with a combined variance of 0.0108, which is identical to the estimate and variance of the interaction coefficient in the logistic regression model with interaction. We direct readers to the Appendix for the corresponding proof. Finally, the dependency test statistic is the same for the GBBM and the logistic regression model with interaction (31.523 for both; Table 3).
Table 3:
Comparison of parameter estimates from GBBM and Logistic Regression Model with Interaction.
| Parameter | Estimate | Variance |
|---|---|---|
| Generalized Bivariate Bernoulli Model (GBBM) | | |
| Intercept, given Y1 = 0 | 0.4454 | 0.0058 |
| Slope of X, given Y1 = 0 | 0.2116 | 0.0060 |
| Intercept, given Y1 = 1 | 0.5183 | 0.0039 |
| Slope of X, given Y1 = 1 | 0.7946 | 0.0048 |
| Dependency test statistic | 31.523 | |
| Logistic Regression Model with Interaction | | |
| Intercept | 0.4454 | 0.0058 |
| X | 0.2116 | 0.0060 |
| Y1 | 0.0729 | 0.0096 |
| Intercept + Y1 | 0.5183 | 0.0039 |
| X × Y1 (interaction) | 0.5830 | 0.0108 |
| X + interaction | 0.7946 | 0.0048 |
| Dependency test statistic | 31.523 | |
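The correspondence between the two sets of estimates in Table 3 can be checked with simple arithmetic, as in the short sketch below. The dictionary keys are descriptive labels introduced here, and the assignment of rows to intercepts and slopes is inferred from the text. The variance of each difference is taken as the sum of the two variances, which assumes the estimates from the two strata of the first outcome are independent; that is reasonable because they are based on disjoint sets of subjects.

```python
# (estimate, variance) pairs from the GBBM block of Table 3.
gbbm = {
    "intercept_y1_0": (0.4454, 0.0058),
    "slope_y1_0": (0.2116, 0.0060),
    "intercept_y1_1": (0.5183, 0.0039),
    "slope_y1_1": (0.7946, 0.0048),
}


def difference(a, b):
    """Return (b - a, Var(a) + Var(b)), assuming independent estimates."""
    return round(b[0] - a[0], 4), round(a[1] + b[1], 4)


main_effect_y1 = difference(gbbm["intercept_y1_0"], gbbm["intercept_y1_1"])
interaction = difference(gbbm["slope_y1_0"], gbbm["slope_y1_1"])
print(main_effect_y1)
print(interaction)
```

The differences reproduce the reported coefficients 0.0729 and 0.5830. The summed variance for the first difference comes out as 0.0097 rather than the tabulated 0.0096, which is consistent with rounding of the four-decimal inputs.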
3.3. Analyzing the Dependence of Neonatal Mortality Status Across Consecutive Births
Table 4 presents the frequency distribution of neonatal survival (Death or Alive) by birth order (first and second birth) and maternal education levels. In this study, we included 11,951 mothers who had their first two children between 1991 and 2014. The neonatal mortality rate for first births is 8% (960 out of 11,951), which is higher than the 4.6% (555 out of 11,951) observed for second births. Additionally, among these mothers, 34% (4,068) had attained at least a secondary level of education (≥6 years), while 66% (7,883) had less than six years of education.
Table 4:
Frequency distribution of death status for first and second births and maternal educational attainment in the 2014 Bangladesh Demographic and Health Survey data set.
| | First Birth: Alive | First Birth: Dead | Second Birth: Alive | Second Birth: Dead | Mother’s Education ≥ Secondary (6 Years): Yes | Mother’s Education ≥ Secondary (6 Years): No |
|---|---|---|---|---|---|---|
| # | 10991 | 960 | 11396 | 555 | 4068 | 7883 |
| % | 92.00 | 8.00 | 95.40 | 4.60 | 34.00 | 66.00 |
Table 5 presents a comparative analysis of parameter estimates between the Generalized Bivariate Bernoulli Model (GBBM), the logistic regression model with an interaction term, the regressive logistic regression model, and the Pearson Chi-square test. The focus is on assessing the neonatal mortality of the second birth, conditional on the neonatal survival status of the first birth. The GBBM provides the following interpretations: the first intercept corresponds to the mortality of the second child ($Y_2 = 1$) when the first child survived ($Y_1 = 0$), and the second intercept applies when the first child died ($Y_1 = 1$). The education coefficient given $Y_1 = 0$ indicates that if the first child survived, women with education at the secondary level or higher have approximately 52.61% lower odds of experiencing neonatal death of their second child compared to women with less educational attainment. Similarly, the education coefficient given $Y_1 = 1$ suggests that if the first child died, women with education at the secondary level or higher have about 49.62% lower odds of experiencing neonatal death of their second child compared to women with lower educational attainment. Our findings also indicate that the relationship between the neonatal mortality of the second child and the mother’s educational attainment is dependent on the neonatal mortality of the first child (dependency test statistic 74.118; Table 5). Table 5 also provides the estimated parameters and their standard errors for the logistic regression model with interaction term. Here, the first parameter represents the intercept of the model. The education coefficient indicates the change in the log-odds of the second child’s neonatal death for mothers with education at the secondary level or higher ($X = 1$) compared to those with lower educational attainment ($X = 0$) when the first child survived ($Y_1 = 0$).
The coefficient of $Y_1$ reflects the change in the log-odds of the mortality of the second child when the first child died ($Y_1 = 1$) compared to when the first child survived ($Y_1 = 0$) among women with educational attainment below the secondary level. The positive sign suggests that if the first child died, the likelihood of the second child’s death increases among women with educational attainment below the secondary level. The interaction term reflects how the effect of the first child’s mortality status on the likelihood of the second child’s survival varies based on the mother’s education level. The positive coefficient indicates that the positive effect of the first child’s survival on the second child’s survival probability is more pronounced for mothers with higher education compared to those with less educational attainment. Similar to the GBBM dependency test, our findings show that the relationship between the neonatal mortality of the second child and the mother’s educational attainment is dependent on the neonatal mortality of the first child (test statistic 74.118; Table 5). Consistent with the results from Section 3.2, we observe identical estimates for the corresponding parameters of the GBBM and the logistic regression model with interaction term in Table 5. This consistency indicates that both models equivalently capture the dependency structure between the mortality outcomes of the two births, regardless of the additional terms introduced in the logistic regression model with interaction term.
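The percentages quoted above follow from exponentiating the education coefficients in Table 5, and the coefficients of the interaction model follow from differencing the GBBM estimates. A minimal sketch of both calculations is given below; the helper name is hypothetical, and the percentages are reductions in odds rather than in probability.

```python
import math


def percent_lower_odds(log_odds_ratio):
    """Percent reduction in odds implied by a (negative) log odds ratio."""
    return 100.0 * (1.0 - math.exp(log_odds_ratio))


# Education coefficients from the GBBM block of Table 5.
first_survived = percent_lower_odds(-0.7468)
first_died = percent_lower_odds(-0.6856)
print(round(first_survived, 2), round(first_died, 2))

# Differences between the GBBM estimates recover the interaction-model coefficients.
coef_first_child_died = round(-1.9668 - (-2.9453), 4)
coef_interaction = round(-0.6856 - (-0.7468), 4)
print(coef_first_child_died, coef_interaction)
```

The two differences match the tabulated coefficients of the first child’s mortality status and of the interaction term.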
Table 5:
Comparison of parameter estimates and dependency tests between the GBBM Dependency Test, Logistic Regression Model with Interaction, Regressive Logistic Regression Model, and Pearson Chi-square Test for neonatal death of 2nd birth | 1st birth.
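| Parameter | Estimate | SE |
|---|---|---|
| Generalized Bivariate Bernoulli Model | | |
| Intercept, first child survived | −2.9453* | 0.0542 |
| Mother’s education, first child survived | −0.7468* | 0.1182 |
| Intercept, first child died | −1.9668* | 0.1138 |
| Mother’s education, first child died | −0.6856* | 0.2826 |
| Dependency test statistic | 74.1180* | |
| Logistic Regression Model with Interaction | | |
| Intercept | −2.9453* | 0.0542 |
| Mother’s education | −0.7468* | 0.1182 |
| First child died | 0.9785* | 0.1261 |
| Intercept + first child died | −1.9668 | 0.1138 |
| Mother’s education × first child died (interaction) | 0.0612 | 0.3063 |
| Mother’s education + interaction | −0.6856* | 0.2826 |
| Dependency test statistic | 74.1180* | |
| Regressive Logistic Regression Model | | |
| Intercept | −2.9472 | 0.0534 |
| Mother’s education | −0.7378 | 0.1090 |
| First child died | 0.9887 | 0.1150 |
| Test statistic | 73.96* | |
| Pearson Chi-square Test | | |
| Test statistic | 88.7880* | |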
Significant at
We have also included test results from the regressive logistic model and the Pearson Chi-square test (Table 5). Both tests indicate a significant marginal dependency of the survival of the second child on the survival of the first child during the neonatal stage.
4. DISCUSSION
The objectives of this paper were two-fold: (1) to demonstrate that the regressive logistic regression model does not provide an appropriate comparison with the GBBM dependency test and (2) to show that the latter can be effectively conducted using a logistic regression model with an interaction. This work is significant because there is no existing software package that directly supports the GBBM dependency test, so custom code is required for implementation. In contrast, logistic regression models can be fit using various software platforms (e.g., R, SAS, Stata, etc.). Because the GBBM dependency test offers valuable insights into the relationship between two binary outcomes with respect to covariates, conducting the dependency test using a logistic regression model benefits the research community by making this analysis more accessible. The logistic regression model with interaction may serve as a valuable alternative for outcomes measured at more than two time points, as it allows for the inclusion of additional interaction terms. This flexibility opens the door to promising future research opportunities.
Our results show that the power of the GBBM dependency test and that of the dependency test from the regressive logistic regression model differ substantially. This contrasts with the findings of Islam et al., where both tests produced comparable results. This discrepancy arises from differences in the data generation process: in our setting, we assume no direct effect of $Y_1$ on $Y_2$ in order to examine the dependency of $Y_2$ on $Y_1$ through $X$, thereby eliminating any unmediated dependency of $Y_2$ on $Y_1$. Our results also indicate that the power of the marginal dependency test (e.g., Pearson’s Chi-square test) and that of the regressive logistic regression model are comparable; however, both differ markedly from that of the GBBM dependency test. In contrast, the power of the GBBM dependency test and that of a dependency test from the logistic regression model with an interaction are equivalent across varying effect sizes and sample sizes. Both the GBBM dependency test and the logistic regression model with interaction effectively captured the effect of mother’s education on neonatal mortality for the second birth, conditional on the neonatal survival status of the first birth.
The current model assumes that the dependency between the two outcomes varies linearly with the covariate(s). A more general formulation could replace the linear term with an arbitrary function of the covariate(s), which may be linear or nonlinear (e.g., polynomial or spline), to better accommodate complex patterns in how the covariate(s) influence the dependency between the two outcomes. However, our focus in this paper is to demonstrate that a standard logistic regression model with an interaction term yields results equivalent to the GBBM. Thus, we prioritize interpretability and accessibility, leaving exploration of more flexible functional forms for future research. Additionally, while this work focuses on binary outcomes observed at two time points, extending this framework to accommodate more than two time points would require the development of a Generalized Trivariate or Multivariate Bernoulli model. These represent important directions for future research.
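The generalization described above can be written out as follows. The notation is assumed for illustration and is not taken from the paper: $Y_1$ and $Y_2$ denote the outcomes at the first and second time points, $x$ a single covariate, and $f$ the hypothetical flexible function.

```latex
\begin{align*}
\text{Linear dependency:}\quad
  \operatorname{logit} P(Y_2 = 1 \mid Y_1 = y_1, X = x)
    &= \beta_0 + \beta_1 x + y_1\,(\gamma_0 + \gamma_1 x), \\
\text{Flexible dependency:}\quad
  \operatorname{logit} P(Y_2 = 1 \mid Y_1 = y_1, X = x)
    &= \beta_0 + \beta_1 x + y_1\,\bigl(\gamma_0 + f(x)\bigr).
\end{align*}
```

Under this sketch, the absence of dependency corresponds to $\gamma_0 = \gamma_1 = 0$ in the linear form and to $\gamma_0 = 0$, $f \equiv 0$ in the flexible form. If $f$ is expanded in a spline or polynomial basis, the test remains a comparison of nested logistic regression models, with degrees of freedom equal to one plus the number of basis terms.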
While our primary focus is on demonstrating the equivalence between the GBBM and the logistic model with interaction, we acknowledge that many other frameworks exist for modeling dependency in binary outcomes. Generalized estimating equations (GEEs)22 are widely used for correlated binary data and provide population-averaged estimates. Copula-based models23 offer flexibility in modeling joint distributions with specified marginals and dependence structures. Although a detailed comparison of these methods is beyond the scope of this study, future research could explore their use in dependency testing and assess their comparative performance in practical biomedical contexts.
This paper contributes by clarifying, through the lens of the GBBM dependency test, how different modelling approaches capture dependency between two binary outcomes at consecutive time points. It also offers a practical alternative to the GBBM test that relies only on standard logistic regression, bringing dependency analysis within reach of the wider scientific community.
Supplementary Material
FUNDING
Research reported was supported by: the National Cancer Institute (NCI) Cancer Center Support Grant P30 CA168524; the Kansas IDeA Network of Biomedical Research Excellence Bioinformatics Core, supported by the National Institute of General Medical Science award P20 GM103418; the Kansas Institute for Precision Medicine COBRE, supported by the National Institute of General Medical Science award P20 GM130423.
ABBREVIATIONS
- GBBM: Generalized Bivariate Bernoulli Model
- BDHS: Bangladesh Demographic and Health Survey
- LRT: Likelihood Ratio Test
Footnotes
DISCLOSURES
The authors have no conflicts of interest to disclose.
DATA AVAILABILITY
Code for the simulation can be found here: https://github.com/kmahmud01/GBBM
REFERENCES
- 1. Patron J, Serra-Cayuela A, Han B, Li C, Wishart DS. Assessing the performance of genome-wide association studies for predicting disease risk. PLoS One. 2019;14(12):e0220215.
- 2. West R, McEwen A, Bolling K, Owen L. Smoking cessation and smoking patterns in the general population: a 1-year follow-up. Addiction. 2001;96(6):891–902.
- 3. Wan F. Statistical analysis of two arm randomized pre-post designs with one post-treatment measurement. BMC Medical Research Methodology. 2021;21:1–16.
- 4. Goyal M, Singh S, Sibinga EM, et al. Meditation programs for psychological stress and well-being: a systematic review and meta-analysis. JAMA Internal Medicine. 2014;174(3):357–368.
- 5. Ssentongo P, Ssentongo AE, Voleti N, et al. SARS-CoV-2 vaccine effectiveness against infection, symptomatic and severe COVID-19: a systematic review and meta-analysis. BMC Infectious Diseases. 2022;22(1):439.
- 6. Islam MA, Alzaid AA, Chowdhury RI, Sultan KS. A generalized bivariate Bernoulli model with covariate dependence. Journal of Applied Statistics. 2013;40(5):1064–1075.
- 7. Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.
- 8. Lipsitz SR, Laird NM, Harrington DP. Generalized estimating equations for correlated binary data: using the odds ratio as a measure of association. Biometrika. 1991;78(1):153–160.
- 9. Guo X, Qi H, Verfaillie CM, Pan W. Statistical significance analysis of longitudinal gene expression data. Bioinformatics. 2003;19(13):1628–1635.
- 10. Prentice RL. Correlated binary regression with covariates specific to each binary observation. Biometrics. 1988:1033–1048.
- 11. Liang KY, Zeger SL, Qaqish B. Multivariate regression analyses for categorical data. Journal of the Royal Statistical Society: Series B (Methodological). 1992;54(1):3–24.
- 12. Le Cessie S, Van Houwelingen JC. Logistic regression for correlated binary data. Journal of the Royal Statistical Society: Series C (Applied Statistics). 1994;43(1):95–108.
- 13. Bahadur RR. A representation of the joint distribution of responses to n dichotomous items. Studies in Item Analysis and Prediction. 1961:158–168.
- 14. Cox DR, Wermuth N. A note on the quadratic exponential binary distribution. Biometrika. 1994;81(2):403–408.
- 15. Glonek GF, McCullagh P. Multivariate logistic models. Journal of the Royal Statistical Society: Series B (Methodological). 1995;57(3):533–546.
- 16. Wakefield J. Ecological inference for 2 × 2 tables (with discussion). Journal of the Royal Statistical Society Series A: Statistics in Society. 2004;167(3):385–445.
- 17. Simpson EH. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society: Series B (Methodological). 1951;13(2):238–241.
- 18. Muenz LR, Rubinstein LV. Markov models for covariate dependence of binary sequences. Biometrics. 1985:91–101.
- 19. Bonney GE. Logistic regression for dependent binary observations. Biometrics. 1987:951–973.
- 20. Islam MA, Chowdhury RI, Huda S. Markov models with covariate dependence for repeated measures. 2009.
- 21. National Institute of Population Research and Training (NIPORT), Mitra and Associates, ICF International. Bangladesh Demographic and Health Survey Report. 2014.
- 22. Zeger SL, Liang K-Y. Longitudinal data analysis for discrete and continuous outcomes. Biometrics. 1986:121–130.
- 23. de Leon AR, Wu B. Copula-based regression models for a bivariate mixed discrete and continuous outcome. Statistics in Medicine. 2011;30(2):175–185.
