Abstract
Diabetes and hypertension commonly occur together in a patient. The two diseases share a number of risk factors and are therefore usually modelled jointly using bivariate logistic regression. However, postestimation assessment of the model, such as analysis of outlier observations, is seldom carried out. In this article, we apply outlier detection methods for multivariate data models to study the characteristics of cancer patients with jointly outlying diabetes and hypertension outcomes, observed among 398 randomly selected cancer patients at Queen Elizabeth and Kamuzu Central Hospitals in Malawi. We used R software version 4.2.2 to perform the analyses and Stata version 12 for data cleaning. The results showed that one patient was an outlier to the bivariate diabetes and hypertension logit model. The patient had both diabetes and hypertension and lived in a rural area of the study population, where comorbidity of the two diseases was observed to be uncommon. We recommend thorough analysis of patients who are outliers to comorbid diabetes and hypertension before rolling out interventions for managing the two diseases in cancer patients, to avoid misaligned interventions. Future research could apply these diagnostic assessments for the bivariate logit model to a larger and more diverse dataset on the two diseases.
Subject terms: Health care, Medical research, Risk factors
Introduction
Diabetes and hypertension often occur concurrently in a patient1–3. This is usually because the two conditions share risk factors, which include old age, obesity, genetic factors, and dyslipidaemia3. Living with both conditions generally affects a patient’s immune system and mental health. In particular, co-existence of type 2 diabetes mellitus and hypertension is associated with poor cognitive functioning, ischaemic cerebrovascular disease, retinopathy, heart disease, sexual inactivity, and kidney problems4–7. For patients with other chronic diseases such as cancer, the presence of diabetes and hypertension is associated with poor prognosis8–10. Therefore, more ways of managing the effects of the joint occurrence of diabetes and hypertension in cancer patients have to be explored11.
Since the data on the presence or absence of diabetes or hypertension in a cancer patient are binary, risk factors for the comorbidity of the two diseases can be assessed using bivariate logistic regression, a model that quantifies the effects of covariates on the joint binary outcomes while accounting for the correlation between the two outcome variables12,13. Although this method is popular for modelling comorbidity data, researchers rarely carry out follow-up residual assessments, mainly because diagnostic statistics for nonlinear multivariate models are not well developed14. This leaves a gap in understanding how well the model fits the bivariate data15. Among cancer patients with comorbid diabetes and hypertension, there could be some outlier patients with special characteristics, knowledge of which can help in managing the joint adverse effects of the two diseases. This study therefore analyses the characteristics of outlier cancer patients with comorbid diabetes and hypertension, after fitting a bivariate logistic regression model to data on cancer patients that were randomly collected at Queen Elizabeth and Kamuzu Central Hospitals in Malawi. An outlier observation is one whose measurement does not conform with the rest of the data points in the fitted model15. This may be due to natural causes, which, in the case of comorbid diabetes and hypertension data, may help reveal ways of managing the impact of the two diseases in a cancer patient. At times, outliers result from data collection and handling errors. Whatever the reason for outlierness, the presence of outliers in a model may lead to biased estimates of the effects of risk factors, inaccurate conclusions, and compromised policy directions drawn from the fitted model.
Understanding management approaches for cancer and allied diseases is an ongoing process that requires continuous research11. Knowing the characteristics of outlier cancer patients with comorbid diabetes and hypertension contributes to that process. The “Methods” section that follows presents the data and statistical methods used in this study. Results are presented in the “Results” section, and the discussion and conclusion are given in the “Discussion” and “Conclusion” sections, respectively.
Methods
Data
The study used data on adult cancer patients aged 18 years and above, collected at the two major referral hospitals of Queen Elizabeth and Kamuzu Central in Malawi by the cancer research group of Kamuzu University of Health Sciences. The project engaged a random cross-sectional sample of 398 cancer patients who presented for oncological services at the two referral hospitals between 13th January and 23rd March 2021. Study participants were selected using simple random sampling, and face-to-face interviews were conducted to collect the data. In addition, the patients’ health passports, registers, and files were reviewed to obtain their histories. The data included each patient’s cancer diagnosis, its stage and treatment options, as well as some known behavioural risk factors and socio-demographic characteristics. Further details about the project can be found in the article11.
In the present study, interest was in identifying the characteristics of cancer patients who presented outlying diabetes and hypertension measurements after accounting for possible risk factors of both diseases in a regression model. Therefore, the outcome variables were whether or not a cancer patient suffered from diabetes, hypertension, or both at the time of the survey. The two conditions were selected for study because they are known to occur jointly in a patient in most cases3. Some of the factors associated with comorbid diabetes and hypertension are old age, elevated cholesterol, cigarette smoking, physical activity, wealth index, occupation, alcohol consumption status, marital status, place of residence, education level, and sex of an individual1,16,17. This study included these factors during model fitting. The research team obtained ethical clearance and approval from the College of Medicine Research and Ethics Committee (COMREC), certificate number P.07/20/3085. In addition, the directors of the two referral hospitals provided separate approvals to allow the team to collect the data. During data collection and analysis, patients’ identities were anonymised to protect their confidentiality and privacy. Informed consent to participate in this project was obtained from all participants and/or their legal guardian(s). All methods in this study were performed in accordance with the relevant guidelines and regulations11.
Bivariate logistic regression model and its estimation
Let be a pair of jointly occurring total number of diabetes and hypertension cases observed at a health facility during some period of time, with respective paired measurements , where . Let 1 denotes presence of either disease in a patient and 0 absence of the disease. Then, in one such observation, the following ordered pairs are the possible outcomes: , with respective joint probabilities of occurrence denoted by: , , , and , in which . If and are marginal probabilities of presence of the respective disease outcomes in one observation, and is the covariance of the two disease outcomes, then has a bivariate binomial distribution in n observations, expressed as follows:
1 |
where is the number of times the experiment was jointly performed to observe the two outcomes. Therefore, the probability mass function (pmf) of the pair , expressed as is given by:
2 |
where 18. As it can be appreciated in Eq. (2), the bivariate binomial distribution is in canonical form, and two natural parameters come out clearly, these are: and . Hence, a simultaneous logistic regression model will have to be defined in terms of these two link functions and solved simultaneously to find effects of covariates on the bivariate response .
Let be a vector of explanatory variables’ values observed on the i-th patient of both diabetes and hypertension, where . Then, the following expression characterises the bivariate logistic regression model:
3 |
where is the bivariate binary response variable, i.e. ; is the marginal probability of success for or given covariates, with ; is the covariance term capturing dependence between and , which is a function of some covariates , usually estimated by the odds ratio ; and is the model’s error term19. Assuming has mean zero, then the conditional expectation of the response given covariates, i.e. is a function of marginal probabilities of success only, , which is the part that relates or links the model with the covariates19. This leads to the explicit definition of the bivariate logit model in Eq. (3) through the link functions in Eq. (2), given by:
4 |
where is a vector of model parameters and is a vector of explanatory variables measured on i-th patient, while , , and are the model’s linear operators, respectively, associated with first marginal model, the second marginal model, and the covariance model for odds ratio. In this scenario, the effects of covariates on the marginal outcomes and covariance term are said to be non-exchangeable or parallel. Alternatively, the same covariates may be used for the tripartite model in Eq. (4), in which case the effects of covariates on the marginal outcomes and covariance term are said to be exchangeable, where the same fixed-effects apply to both marginal outcomes and the covariance term. In this study, both non-exchangeable (parallel) and exchangeable types of the bivariate logit model were fitted to the data. The two outcome variables, and are independent if and only if or the odds ratio . Through algebra, one can derive an alternative form of the bivariate logistic regression model in Eq. (4) in terms of the two marginal probabilities of the success for the two response variables, i.e., and , and the odds ratio term, as follows:
5 |
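To illustrate how two marginal probabilities and an odds ratio together determine the four joint probabilities of a bivariate binary outcome, the following Python sketch uses the standard closed-form solution under an odds-ratio parameterisation. It is an assumption about the general form of such models, not a transcription of Eq. (5); the function names `expit` and `joint_probs` and the linear predictor values are hypothetical, and the analyses in this study were run in R, not with this code.

```python
import math


def expit(eta):
    """Inverse logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def joint_probs(p1, p2, psi):
    """Joint probabilities (p00, p01, p10, p11) of two binary outcomes.

    p1, p2 : marginal probabilities of the first and second outcome
    psi    : odds ratio p11 * p00 / (p10 * p01); psi == 1 means independence
    """
    if abs(psi - 1.0) < 1e-12:
        p11 = p1 * p2  # independence case
    else:
        a = 1.0 + (p1 + p2) * (psi - 1.0)
        disc = a * a - 4.0 * psi * (psi - 1.0) * p1 * p2
        p11 = (a - math.sqrt(disc)) / (2.0 * (psi - 1.0))
    p10 = p1 - p11            # first outcome only
    p01 = p2 - p11            # second outcome only
    p00 = 1.0 - p1 - p2 + p11  # neither outcome
    return p00, p01, p10, p11


# Hypothetical linear predictors for one patient (illustrative values only).
eta1, eta2, eta3 = -2.0, -1.0, 0.5
probs = joint_probs(expit(eta1), expit(eta2), math.exp(eta3))
print(probs)
```

The four returned probabilities sum to one, and their cross-product ratio recovers the odds ratio that was supplied, which is a convenient check when experimenting with different values.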
The likelihood function for the bivariate logit model in Eqs. (4) or (5) is constructed by taking the product of probabilities of number of events in Eq. (2) associated with each i-th observation as follows:
6 |
where . Taking partial derivatives of the log-likelihood function with respect to the model parameters gives the score vectors; equating these to zero and solving for the parameters yields the maximum likelihood (ML) estimates of the regression coefficients. The exponentiated fixed-effect ML estimates have the usual interpretation as marginal odds ratios of success in the respective outcome, comparing one level of a covariate with its reference level. The covariance term was estimated as in the third line of the model in Eq. (5), and it was interpreted as the degree of dependence between the two response variables. A positive logarithm of the odds ratio implied that the first disease outcome was more likely to occur in a person than the second disease, while a negative value meant the first disease was less likely to occur than the second. A value of zero meant that there was no association between the two disease outcomes. The model estimates were calculated using the R package VGAM, which fits vector generalised linear and additive models20, while the remaining computations of residuals and graphs were also performed in R software version 4.2.2 using appropriate packages.
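As a small worked example of the odds-ratio interpretation, the sketch below exponentiates one reported log-odds estimate, the value 0.967 for female versus male patients in Model 4 of Table 3. The helper name `odds_ratio` is hypothetical and the snippet is purely arithmetic; it does not re-fit any model.

```python
import math


def odds_ratio(log_odds):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(log_odds)


# Log-odds for Female vs Male, taken from Table 3 (Model 4).
female_log_odds = 0.967
print(round(odds_ratio(female_log_odds), 2))  # prints 2.63
```

Read this way, the estimate corresponds to odds of the outcome about 2.6 times as high for female as for male patients, although the accompanying p-value (0.141) should be kept in mind when interpreting it.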
Before analysing outlier observations to the fitted model, the available explanatory variables were reviewed so that only those that resulted in a better model according to the deviance statistic, i.e. $D = -2\,[\ell(\hat{\beta}_{R}) - \ell(\hat{\beta}_{F})]$, were included in the final analyses, where $\hat{\beta}_{R}$ stands for the ML estimates in the reduced model, $\hat{\beta}_{F}$ the estimates in the saturated or full model, and $\ell(\cdot)$ an estimate of the log-likelihood. The first model had all available covariates given in the “Data” section, while in the second model a few were dropped on the basis of large p-values. The larger the value of the deviance statistic, the better the fit of the model to the data was taken to be.
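The deviance computation can be sketched for a single binary outcome as follows. This is a simplified, univariate illustration under the assumption of ungrouped binary data, for which the saturated log-likelihood is zero; the bivariate model's deviance would instead be built from the joint probabilities. The function names and the toy data are hypothetical.

```python
import math


def binary_log_likelihood(y, p):
    """Bernoulli log-likelihood for outcomes y (0/1) and fitted probabilities p in (0, 1)."""
    return sum(
        yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
        for yi, pi in zip(y, p)
    )


def deviance(y, p, saturated_loglik=0.0):
    """Deviance: -2 * (log-likelihood of fitted model - saturated log-likelihood)."""
    return -2.0 * (binary_log_likelihood(y, p) - saturated_loglik)


# Toy data (illustrative only, not from the study).
y = [1, 0, 0, 1]
p = [0.8, 0.1, 0.3, 0.6]
print(deviance(y, p))
```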
Model’s outlier residual analysis
To analyse joint outlier observations to the model in Eqs. (3)–(5), we first defined a deviance residual for tracking outliers to the marginal models in the bivariate logit model as follows:
$$r_{ij} = \operatorname{sgn}(y_{ij} - \hat{p}_{ij})\sqrt{-2\left[y_{ij}\log \hat{p}_{ij} + (1 - y_{ij})\log(1 - \hat{p}_{ij})\right]} \tag{7}$$
where $y_{ij}$ is the i-th observation for the j-th response, $\hat{p}_{ij}$ is the fitted marginal probability of the event in the j-th outcome for the i-th subject, sgn(.) is the signum function of the residual $y_{ij} - \hat{p}_{ij}$, which was $+1$ if the residual was above zero, $-1$ when the residual was negative, and 0 if the residual was zero, $i = 1, \dots, n$, and $j = 1, 2$. The deviance residual in Eq. (7) is assumed to follow a normal distribution, such that its extreme values will correspond to outlier observations to the fitted marginal model19. We assessed this graphically by plotting the marginal deviance residuals in Eq. (7) against fitted marginal probabilities of success, $\hat{p}_{ij}$. The overall joint outlier observations to the entire bivariate logit model were analysed by taking the average of the marginal deviance residuals in Eq. (7), as though each column of the bivariate binary outcome was a level in a multinomial random variable21–23. Therefore, the statistic for assessing these overall outliers is defined as follows:
$$r_{i} = \frac{r_{i1} + r_{i2}}{2} \tag{8}$$
where $r_{i1}$ and $r_{i2}$ are the respective marginal deviance residuals obtained in Eq. (7). Large absolute values of the overall deviance residual in Eq. (8) correspond to joint outlier observations to the fitted bivariate logit model. Graphical aids were also used for this analysis, by plotting the residual in Eq. (8) against fitted marginal probabilities of success, $\hat{p}_{ij}$, as well as against marginal deviance residuals in Eq. (7). Upon identifying outlier observations to the fitted bivariate logit model, we performed back-inspection in the dataset to trace the characteristics of the identified outlier patients.
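The residual-based screening described in this subsection can be sketched in Python as below. The sketch assumes the usual binary deviance residual for each marginal outcome and a simple average for the overall residual; the function names, the toy records, and the cutoff of 2.0 are hypothetical choices for illustration and are not taken from the study.

```python
import math


def deviance_residual(y, p):
    """Signed deviance residual for one binary observation y with fitted probability p."""
    sign = (y > p) - (y < p)  # +1, -1 or 0, i.e. sgn(y - p)
    loglik = y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return sign * math.sqrt(-2.0 * loglik)


def overall_residual(y1, p1, y2, p2):
    """Average of the two marginal deviance residuals for one subject."""
    return 0.5 * (deviance_residual(y1, p1) + deviance_residual(y2, p2))


def flag_outliers(records, cutoff):
    """Return ids whose absolute overall residual exceeds the cutoff.

    records: iterable of (id, y1, p1, y2, p2) tuples.
    cutoff : assumed threshold; the value used here is illustrative only.
    """
    return [
        rid
        for rid, y1, p1, y2, p2 in records
        if abs(overall_residual(y1, p1, y2, p2)) > cutoff
    ]


# Toy records: (id, diabetes, fitted p, hypertension, fitted p); not study data.
toy = [("A", 1, 0.05, 1, 0.05), ("B", 0, 0.05, 0, 0.05), ("C", 0, 0.05, 1, 0.10)]
print(flag_outliers(toy, cutoff=2.0))
```

A subject with both conditions but small fitted probabilities for each gets a large positive overall residual, which is the under-predicted pattern that this kind of screen is designed to surface.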
Ethical approval and consent to participate
The study was part of the Kamuzu University of Health Sciences cancer project, whose data were collected with approval of the College of Medicine Research and Ethics Committee (COMREC). The ethical approval certificate is numbered P.07/20/3085. Subsequent approvals were also given by the respective directors at Queen Elizabeth and Kamuzu Central hospitals. The three approval letters have been submitted together with this manuscript. All ethical considerations for the patients were adhered to during data collection, analysis and reporting of this study. Informed consent to participate in this project was obtained from all participants and/or their legal guardian(s). All methods in this study were performed in accordance with the relevant guidelines and regulations. More details on ethical clearance for the project are provided in the article11.
Results
Bivariate logit model estimates
The summary of the data is given in Table 1. Diabetes and hypertension were present in 0.51% and 8.10% of the studied cancer patients, respectively. Diabetes cases were concentrated among female, unmarried, employed or retired, rural-based, non-smoking, and non-alcohol-drinking patients, those aged above 30 years, and those with primary education or above. Hypertension cases were more common among female, married, unemployed or student, urban, and non-alcohol-drinking patients, and among those aged above 30 years. The chi-square test of association provided evidence that age category was associated with hypertension (p-value = 0.030), while sex had a marginal association with hypertension (p-value = 0.083). At the 5% significance level, there was no evidence of association between any of the studied variables and diabetes.
Table 1.
Distribution of diabetes and hypertension cases by socio-demographic characteristics of a patient.
Characteristic | n (%) | Diabetes cases (%) | Diabetes p-value | Hypertension cases (%) | Hypertension p-value |
---|---|---|---|---|---|
Overall sample | 395 (100) | 2 (0.51) | | 32 (8.10) | |
Sex | | | 0.288 | | 0.083 |
Male | 142 (35.95) | 0 (0.00) | | 7 (4.93) | |
Female | 253 (64.05) | 2 (0.79) | | 25 (9.88) | |
Marital status | | | 0.636 | | 0.451 |
Unmarried | 135 (34.18) | 1 (0.74) | | 9 (6.67) | |
Married | 260 (65.82) | 1 (0.38) | | 23 (8.85) | |
Occupation | | | 0.173 | | 0.543 |
Unemployed/student | 189 (47.85) | 0 (0.00) | | 17 (8.99) | |
Employed/retired | 205 (51.90) | 2 (0.98) | | 15 (7.32) | |
Residential place | | | 0.259 | | 0.552 |
Urban | 153 (38.73) | 0 (0.00) | | 14 (9.15) | |
Rural | 241 (61.01) | 2 (0.83) | | 18 (7.47) | |
Highest education | | | 0.495 | | 0.996 |
None | 74 (18.73) | 0 (0.00) | | 6 (8.11) | |
Primary and above | 320 (81.01) | 2 (0.63) | | 26 (8.13) | |
Age category | | | 0.602 | | 0.030 |
18–30 | 47 (11.90) | 0 (0.00) | | 0 (0.00) | |
31+ | 348 (88.10) | 2 (0.57) | | 32 (9.20) | |
Ever smoked cigarette | | | 0.585 | | 0.998 |
No | 337 (85.32) | 2 (0.59) | | 27 (8.01) | |
Yes | 50 (12.66) | 0 (0.00) | | 4 (8.00) | |
Ever drank alcohol | | | 0.496 | | 0.636 |
No | 312 (78.99) | 2 (0.64) | | 27 (8.65) | |
Yes | 72 (18.23) | 0 (0.00) | | 5 (6.94) | |
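The chi-square p-value reported in Table 1 for age category and hypertension can be approximately reproduced from the tabulated counts with the short Python sketch below. It assumes a Pearson chi-square test on the 2×2 table without continuity correction, which the source does not state explicitly; the function name `pearson_chi2_2x2` is hypothetical. Non-case counts are obtained by subtracting cases from the group totals (47 − 0 and 348 − 32).

```python
import math


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 df) for the table [[a, b], [c, d]].

    No continuity correction is applied (an assumption of this sketch).
    """
    observed = ((a, b), (c, d))
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Age 18-30: 0 hypertensive, 47 not; age 31+: 32 hypertensive, 316 not.
chi2, p_value = pearson_chi2_2x2(0, 47, 32, 316)
print(round(chi2, 2), round(p_value, 3))
```

Under these assumptions the statistic is about 4.70 and the p-value about 0.030, in line with the tabulated value. One expected cell count in this table is below 5, so an exact test would be a reasonable sensitivity check.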
The estimates from a bivariate logit model with non-exchangeable covariate effects on diabetes and hypertension outcomes are given in Table 2. Model 2, which excluded the occupation and age variables, had a higher deviance statistic than Model 1; hence it was judged to have a relatively better fit and was used for subsequent outlier analyses. The intercept estimates in Model 2 showed that, without considering the covariates, the log-odds of suffering from diabetes or hypertension were negative, i.e. a patient in the studied population was less likely to suffer from either condition than not. The intercept of the covariance term was larger than 0, indicating that, holding the factors constant, suffering from diabetes was more likely than hypertension. The marginal ML estimates in Model 2 showed that, adjusting for the other factors, the log-odds of suffering from either diabetes or hypertension were higher in females than in males, in the educated than in the uneducated, and in persons with a history of alcohol drinking. The log-odds of diabetes were lower in married persons, whereas those of hypertension were higher in married than in unmarried persons. Similarly, the log-odds of diabetes were lower in persons with a history of smoking, whereas those of hypertension were higher in persons who had ever smoked tobacco than in those who had never smoked. Finally, diabetes was more likely, but hypertension less likely, in rural residents than in urban dwellers. The p-values for the estimates in the non-exchangeable bivariate logit model in Model 2 were still large even after dropping the occupation and age group variables, and some could not be computed.
Table 2.
Effect of patient characteristics on diabetes and hypertension outcomes upon fitting bivariate logit model to the data, with non-exchangeable effects.
Variable | Model 1: Diabetes, log-odds (p-val) | Model 1: Hypertension, log-odds (p-val) | Model 1: Covariance, log-OR (p-val) | Model 2: Diabetes, log-odds (p-val) | Model 2: Hypertension, log-odds (p-val) | Model 2: Covariance, log-OR (p-val) |
---|---|---|---|---|---|---|
Intercept | − 547.037 (0.996) | − 272.044 (0.998) | 9.165 (1.000) | − 28.632 (NA) | − 3.454 (0.895) | 6.377 (NA) |
Sex | ||||||
Male* | ||||||
Female | 5.513 (0.728) | 0.799 (0.449) | 6.165 (NA) | 6.506 (0.808) | 0.937 (0.488) | 9.867 (0.958) |
Marital status | ||||||
Unmarried* | ||||||
Married | − 0.444 (0.748) | 0.308 (0.752) | 13.603 (NA) | − 0.541 (0.704) | 0.387 (0.592) | 17.347 (0.903) |
Occupation | ||||||
Unemployed/student* | ||||||
Employed/retired | 9.492 (0.945) | − 0.248 (0.939) | 1.294 (0.998) | |||
Residential place | ||||||
Urban* | ||||||
Rural | 9.289 (0.955) | − 0.305 (0.848) | 9.453 (0.969) | 9.211 (0.953) | − 0.320 (0.906) | − 8.026 (0.982) |
Highest education | ||||||
None* | ||||||
Primary and above | 1.418 (0.660) | 0.419 (0.807) | − 12.433 (0.845) | 9.230 (0.998) | 0.265 (0.992) | − 15.495 (0.993) |
Age category | ||||||
18–30* | ||||||
31+ | 518.411 (0.996) | 268.885 (0.998) | − 16.885 (1.000) | |||
Ever smoked cigarette | ||||||
No* | ||||||
Yes | 0.379 (0.956) | 0.351 (0.882) | 6.871 (0.970) | − 2.959 (NA) | 0.348 (0.860) | 13.021 (NA) |
Ever drank alcohol | ||||||
No* | ||||||
Yes | 1.932 (0.913) | 0.103 (0.945) | 3.894 (NA) | 2.932 (0.906) | 0.165 (0.925) | 7.184 (0.973) |
Deviance | 213.4181 | | | 224.9023 | | |
OR, odds ratio; NA, not applicable (the p-value could not be used as a basis for variable selection because the estimate was affected by the Hauck–Donner effect24; see the “Discussion” section for details).
*Reference level.
The results in Table 3 are for the case of exchangeable covariate effects on diabetes and hypertension in the fitted bivariate logit model. Model 4, which excluded occupation and age group, had a better fit based on the value of the deviance statistic. The ML estimates for the intercept showed that, disregarding the covariates, a person was less likely to suffer from diabetes or hypertension than not to suffer from either disease. The covariance estimate showed that, holding the covariates constant, diabetes was less likely to occur in the study population than hypertension, which agreed with the summary data in Table 1. The marginal ML estimates showed that suffering from diabetes or hypertension was more likely in females than in males, in married than in unmarried persons, in persons with primary education and above than in the uneducated, and in persons with a history of smoking or alcohol drinking. The two diseases were less likely to occur in persons who resided in rural areas. The p-values of the estimates in the exchangeable bivariate logit model in Table 3 were generally lower than those of the non-exchangeable model estimates in Table 2, and the overall deviance statistics were higher than in the non-exchangeable model. This indicated that the exchangeable model fitted the data better.
Table 3.
Effect of patient characteristics on diabetes and hypertension outcomes upon fitting bivariate logit model to the data, with exchangeable effects.
Variable | Model 3: Diabetes, log-odds (p-val) | Model 3: Hypertension, log-odds (p-val) | Model 3: Covariance, log-OR (p-val) | Model 4: Diabetes, log-odds (p-val) | Model 4: Hypertension, log-odds (p-val) | Model 4: Covariance, log-OR (p-val) |
---|---|---|---|---|---|---|
Intercept | − 503.552 (0.995) | − 503.552 (0.995) | 2.557 (1.000) | − 4.344 (NA) | − 4.344 (NA) | − 54.216 (NA) |
Sex | ||||||
Male* | ||||||
Female | 0.878 (0.202) | 0.878 (0.202) | 3.054 (0.583) | 0.967 (0.141) | 0.967 (0.141) | 5.809 (0.768) |
Marital status | ||||||
Unmarried* | ||||||
Married | 0.258 (0.682) | 0.258 (0.682) | 5.919 (0.474) | 0.336 (0.533) | 0.336 (0.533) | 9.663 (0.669) |
Occupation | ||||||
Unemployed/student* | ||||||
Employed/retired | 0.042 (0.990) | 0.042 (0.990) | 32.753 (0.999) | |||
Residential place | ||||||
Urban* | ||||||
Rural | − 0.111 (0.838) | − 0.111 (0.838) | 11.456 (0.840) | − 0.129 (0.776) | − 0.129 (0.776) | 8.857 (0.654) |
Highest education | ||||||
None* | ||||||
Primary and above | 0.323 (0.734) | 0.323 (0.734) | − 7.054 (NA) | 0.411 (0.971) | 0.411 (0.971) | 31.799 (1.000) |
Age category | ||||||
18–30* | ||||||
31+ | 499.532 (0.995) | 499.532 (0.995) | − 45.182 (1.000) | |||
Ever smoked cigarette | ||||||
No* | ||||||
Yes | 0.339 (0.731) | 0.339 (0.731) | 0.260 (0.969) | 0.289 (0.712) | 0.289 (0.712) | − 1.0181 (0.884) |
Ever drank alcohol | ||||||
No* | ||||||
Yes | 0.159 (0.867) | 0.159 (0.867) | 4.116 (0.570) | 0.294 (0.720) | 0.294 (0.720) | 9.700 (0.628) |
Deviance | 250.9608 | | | 262.4044 | | |
OR, odds ratio; NA, not applicable (the p-value could not be used as a basis for variable selection because the estimate was affected by the Hauck–Donner effect24; see the “Discussion” section for details).
*Reference level.
Outlier cancer patients and their characteristics
The plots of marginal deviance residuals against respective fitted probabilities in Fig. 1 showed that the non-exchangeable bivariate logit model was poorly fitted to the data. Using a cutoff of , most residual values for analysing the fit of subjects to diabetes outcomes in Fig. 1a were outside this margin, which was also the case with those for assessing the fit to hypertension outcomes in Fig. 1b. The data also showed that the diabetes model over-predicted majority of the observations, i.e. their real measurements were lower than those estimated by the model, see Fig. 1a. While with hypertension marginal model, the majority of the observations were under-predicted by the model as in Fig. 1b, suggesting that their real values were higher than those estimated by the model.
Figure 1.
Outliers to marginal outcomes in non-exchangeable bivariate logit model, Malawi cancer patients data. Source Researcher.
Unlike with the non-exchangeable marginal models, the residual results for exchangeable marginal models in Fig. 2 showed that majority of the observations were well-fitted by the model. Most estimates for the fit of observations to diabetes marginal outcomes in Fig. 2a or hypertension outcomes in Fig. 2b were close to the zero line. It was further shown that at a cutoff of few observations in both marginal models, such as patients number 151 and 172 were under-predicted by the model, as their plots were outside these margins, indicating that these were possible candidates for outliers.
Figure 2.
Outliers to marginal outcomes in exchangeable bivariate logit model, Malawi cancer patients data. Source Researcher.
The plots of overall deviance residual against marginal fitted probabilities for non-exchangeable bivariate logit model in Fig. 3 showed that the observation with identity number 172 was an outlier to the model, on the under-predicted (positive) side, while several others were also outliers but on the over-predicted (negative) side using a cutoff of , when the overall residual was plotted against marginal probability of diabetes, Fig. 3a or marginal probability of hypertension, Fig. 3b. This meant that, although the non-exchangeable model had poor fit as reported in previous paragraph, the applied method still detected few outlier observations to the model like number 172, that were not conforming to the pattern of the other data points in the model.
Figure 3.
Joint outliers to the non-exchangeable bivariate logit model, Malawi cancer patients data. Source Researcher.
Similarly, when the overall deviance residual for the non-exchangeable model was plotted against marginal deviance residuals in Fig. 4, it was shown that observation number 172 was an outlier on a positive side and few others on the negative side, using the same cutoff of . This result was observed in both Fig. 4a that plotted overall residual against marginal deviance to diabetes and Fig. 4b against hypertension outcomes.
Figure 4.
Joint outliers to the non-exchangeable bivariate logit model, Malawi cancer patients data. Source Researcher.
Now, using the cutoff the plots of overall deviance residual for the exchangeable bivariate logit model against marginal fitted probability of being diabetic or hypertensive in Fig. 5a, as well as against marginal deviance residual for diabetes in Fig. 5b identified observation number 172 as an outright outlier to the bivariate logit model. This observation was under-predicted by the model, i.e. its real outcome measurement was higher than what was predicted by the model. The rest of the plots spanned the zero line, indicating that their corresponding observations were well-fitted by the exchangeable bivariate logit model.
Figure 5.
Joint outliers to the exchangeable bivariate logit model, Malawi cancer patients data. Source Researcher.
Finally, index plots of the overall residual in both non-exchangeable, Fig. 6a and exchangeable, Fig. 6b models showed that observation number 172 was an outlier to both models. This was the only outlier detected by the residual to the exchangeable model in Fig. 6b at cutoff , indicating that the exchangeable model fitted the data well. However, the residual plots for the non-exchangeable model in Fig. 6a identified several other outliers at cutoff , apart from observation number 172, indicating poor fit of the non-exchangeable model to the data compared to the exchangeable model.
Figure 6.
Joint outliers to the non-exchangeable and exchangeable bivariate logit models, Malawi cancer patients data. Source Researcher.
Upon tracing the identified overall outlier, observation number 172, in the main dataset, the results showed that this was a female aged 43 years, who weighed 61 kg and resided in a rural area in Chikwawa district. The outlier patient was a married Christian with a middle wealth index and at least secondary education, and was a retired employee at the time of the survey. She suffered from cervical cancer, which was diagnosed in 2018 as a localised disease. She had since received chemotherapy and undergone surgery. The outlier cancer patient suffered from both diabetes and hypertension, as well as HIV/AIDS, but had no heart, liver, or respiratory complications. She also had no history of tobacco smoking or alcohol drinking. The data showed that she had problems with walking about, felt moderate pain and discomfort, and had problems performing usual activities, such as household chores or study.
Discussion
In this article, postestimation techniques for the bivariate logistic regression model were applied to analyse outlier cancer patients with comorbid diabetes and hypertension at Queen Elizabeth and Kamuzu Central Hospitals in Malawi. The study identified one overall outlier patient to the two-disease model. The detected outlier was a female patient with cervical cancer, based in a rural area, who suffered from both diabetes and hypertension. She was married and educated, aged 43 years, with a moderate weight of 61 kg, and had no history of smoking or alcohol drinking. Since this study observed that co-occurrence of diabetes and hypertension was rare in rural populations, it was not surprising that the diagnostic statistics for the bivariate logit model detected this rural resident as an outlier7. In most cases, the next step after identifying outliers is to assess their influence on the model’s estimates. This is usually done either by re-fitting the model to the data without the outlier observations and observing the changes in the ML estimates, or by applying appropriate influence statistics to examine the influence of each observation on the ML estimates14,19,25. We did not perform such an analysis, as the goal of the paper was to detect and understand the characteristics of cancer patients who are outliers to comorbid diabetes and hypertension, so as to establish the reasons for outlierness and recommend appropriate ways of managing the two diseases in such patients, without necessarily improving the model.
This study observed that the exchangeable bivariate logit model fitted the data better than the non-exchangeable model. The data provided evidence of dependence between diabetes and hypertension in a person. Based on the exchangeable model, suffering from diabetes was less likely than suffering from hypertension. This may reflect the nature of the sample used, which had too few diabetes cases. The Wald test for evaluating the significance of each covariate in the fitted bivariate logit model suffered from the Hauck–Donner effect (HDE), in which the test statistic tends to zero as the distance between the ML estimate and the hypothesised value widens. This rendered the p-values for the ML estimates unrealistically large, or not applicable for variable selection24,26. This could be a result of the small sample size used. The HDE usually masks the true effect of a covariate on a binary outcome as though it does not exist. It can be corrected by using alternative tests for assessing the usefulness of covariates in a logit model, such as the likelihood-ratio test instead of the Wald test24. The likelihood-ratio test assesses the usefulness of a covariate by comparing the likelihoods of the models with and without that covariate. However, this method requires re-fitting the model after removing the covariate concerned, which can be cumbersome for non-linear models such as the bivariate logistic regression model, which use iterative numerical techniques like the Newton–Raphson method to estimate parameters27. Hence, this study used the Wald test to analyse the significance of covariates, owing to its computational efficiency.
Although the p-values realised from the models were large, this study proceeded with joint outlier analysis to the bivariate logit model because outlierness of an observation is relative to others in the fitted model25, no matter the level of fit of the model at hand.
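The behaviour of the Wald test under the Hauck–Donner effect described above can be illustrated with a minimal Python sketch. It is not the bivariate logit model fitted in this study: it assumes the simplest possible case, an intercept-only logit model for a single binary outcome, and tests the null hypothesis that the log-odds equal zero. The function names `wald_statistic` and `lr_statistic` and the counts used are hypothetical and chosen only for illustration.

```python
import math


def wald_statistic(successes: int, trials: int) -> float:
    """Wald chi-square statistic for H0: log-odds = 0 (intercept-only logit model).

    Assumes 0 < successes < trials so that the ML estimate is finite.
    """
    failures = trials - successes
    beta_hat = math.log(successes / failures)          # ML estimate of the log-odds
    std_err = math.sqrt(1 / successes + 1 / failures)  # asymptotic standard error
    return (beta_hat / std_err) ** 2


def lr_statistic(successes: int, trials: int) -> float:
    """Likelihood-ratio statistic for the same hypothesis (null probability 0.5)."""
    failures = trials - successes
    p_hat = successes / trials
    loglik_full = successes * math.log(p_hat) + failures * math.log(1 - p_hat)
    loglik_null = trials * math.log(0.5)
    return 2 * (loglik_full - loglik_null)


if __name__ == "__main__":
    trials = 100  # hypothetical sample size, not from the study data
    for successes in (75, 90, 99):
        print(
            successes,
            round(wald_statistic(successes, trials), 1),
            round(lr_statistic(successes, trials), 1),
        )
```

In this toy setting the likelihood-ratio statistic keeps growing as the estimated log-odds move away from zero, whereas the Wald statistic is smaller at 99 successes than at 90, because the standard error inflates faster than the estimate. This is the non-monotone pattern that can make Wald p-values misleadingly large.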
The small sample size used in this study might have also affected the Chi-Square test results in the cross-classification analysis, where some variables were observed to have an insignificant association, at the 5% significance level, with either diabetes or hypertension. However, the Chi-Square p-values for the affected variables were still far lower than 1, which indicated existence of some association, at significance levels higher than 5%, between the concerned variables and diabetes or hypertension. These results add to the debate on whether researchers should prioritise statistical significance over clinical significance in studies that involve medical data28–30. The general consensus in the literature is that statistical insignificance should not override clinical significance31. It is for this reason that this study included all the variables that were insignificant in the Chi-Square tests when fitting the bivariate logit model to the data to analyse outlier cancer patients.
Finally, the study observed that females, the married, the educated, and persons with a history of smoking and alcohol drinking had a high likelihood of suffering from either diabetes or hypertension. The diseases were less likely to affect rural than urban populations. These results were consistent with the literature, which attributes the increased risk to the affluent lifestyle of the indicated populations3.
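As a rough illustration of the Chi-Square tests of association discussed above, the following Python sketch computes the Pearson Chi-Square statistic and its p-value for a 2×2 cross-classification. The counts are hypothetical and are not taken from the study data, the function name `chi_square_2x2` is introduced here only for illustration, and no continuity correction is applied.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square statistic and p-value for the 2x2 table [[a, b], [c, d]].

    Assumes every row and column total is positive.
    """
    n = a + b + c + d
    observed = ((a, b), (c, d))
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            statistic += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical counts: rows = rural / urban, columns = hypertensive / not hypertensive.
    statistic, p_value = chi_square_2x2(12, 48, 20, 40)
    print(round(statistic, 3), round(p_value, 3))
```

With small cell counts the chi-square approximation can be poor, so a sketch like this is only a starting point for the kind of cross-classification analysis described in the text.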
Conclusion
This study sought to assess joint outliers to comorbid diabetes and hypertension by fitting a bivariate logistic regression model to cancer medical data in Malawi. By applying diagnostic statistics for multivariate data models, we examined outliers to the fitted bivariate logit model. The methods identified one outlier cancer patient to the diabetes and hypertension outcomes, whose main distinguishing feature was being based in a rural area, where the two studied diseases were rare. We recommend that careful analysis of outlying patients to comorbid diabetes and hypertension, through regression methods, should precede the drafting of policies for managing the two diseases in cancer patients, in order to avoid misaligned interventions.
Further, the model estimates in this study confirmed previously observed risk factors for comorbid diabetes and hypertension. The most vulnerable groups were females, the educated, the married, and those with a history of smoking and alcohol drinking. The exchangeable bivariate logit model produced a better fit than the non-exchangeable model. Therefore, assuming that the available covariates affect either marginal outcome in the bivariate logit model could give the researcher accurate estimates when using limited data. A number of risk factors for comorbid diabetes and hypertension that are reported in the literature were not part of the dataset used in this study. Moreover, the sample was small, which affected convergence of estimates such as the p-values. Future research could apply these statistical methods to a larger and wider dataset.
Acknowledgements
The authors are sincerely thankful to the staff and management of Queen Elizabeth and Kamuzu Central Hospitals for the assistance rendered during data collection for this study.
Author contributions
T.K. conceived the initial research problem for this study, provided suggestions for statistical methods, performed the data analysis, and drafted the initial manuscript. A.M. provided technical epidemiological advice on the analysed disease outcomes. J.C.B. designed the data collection tool and supervised the data collection process. G.H. performed the data cleaning. All authors read and approved the final manuscript.
Data availability
The dataset that was used and analysed for this study is available from the corresponding author upon request.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Cheung BMY. The hypertension-diabetes continuum. J. Cardiovasc. Pharmacol. 2010;55(4):333–339. doi: 10.1097/FJC.0b013e3181d26430.
- 2. Long AN, Dagogo-Jack S. Comorbidities of diabetes and hypertension: Mechanisms and approach to target organ protection. J. Clin. Hypertens. 2011;13(4):244–251. doi: 10.1111/j.1751-7176.2011.00434.x.
- 3. Tripathy JP, Thakur JS, Jeet G, Jain S. Prevalence and determinants of comorbid diabetes and hypertension: Evidence from non communicable disease risk factor STEPS survey, India. Diabetes Metab. Syndr. Clin. Res. Rev. 2017;11:S459–S465. doi: 10.1016/j.dsx.2017.03.036.
- 4. Peña JE, Rascón-Pacheco RA, Ascencio-Montiel IJ, González-Figueroa E, Fernández-Gárate JE, Medina-Gómez OS, Borja-Bustamante P, Santillán-Oropeza JA, Borja-Aburto VH. Hypertension, diabetes and obesity, major risk factors for death in patients with COVID-19 in Mexico. Arch. Med. Res. 2021;52(4):443–449. doi: 10.1016/j.arcmed.2020.12.002.
- 5. Hassing LB, Hofer SM, Nilsson SE, Berg S, Pedersen NL, McClearn G, Johansson B. Comorbid type 2 diabetes mellitus and hypertension exacerbates cognitive decline: Evidence from a longitudinal study. Age Ageing. 2004;33(4):355–361. doi: 10.1093/ageing/afh100.
- 6. Lago RM, Singh PP, Nesto RW. Diabetes and hypertension. Nat. Clin. Pract. Endocrinol. Metab. 2007;3(10):667. doi: 10.1038/ncpendmet0638.
- 7. Lea JP, Nicholas SB. Diabetes mellitus and hypertension: Key risk factors for kidney disease. J. Natl. Med. Assoc. 2002;94(8 Suppl):7S.
- 8. Poljičanin T, Ajduković D, Šekerija M, Pibernik-Okanović M, Metelko Ž, Vuletić Mavrinac G. Diabetes mellitus and hypertension have comparable adverse effects on health-related quality of life. BMC Public Health. 2010;10(1):1–6. doi: 10.1186/1471-2458-10-12.
- 9. Xiu W, Huang Y, Li Y, Yu M, Gong Y. Comorbidities and mortality risk among extensive-stage small-cell lung cancer patients in mainland China: Impacts of hypertension, type 2 diabetes mellitus, and chronic hepatitis B virus infection. Anticancer Drugs. 2022;33(1):80. doi: 10.1097/CAD.0000000000001133.
- 10. Zaki N, Alashwal H, Ibrahim S. Association of hypertension, diabetes, stroke, cancer, kidney disease, and high-cholesterol with COVID-19 disease severity and fatality: A systematic review. Diabetes Metab. Syndr. Clin. Res. Rev. 2020;14(5):1133–1142. doi: 10.1016/j.dsx.2020.07.005.
- 11. Banda JC, Sinjani MA. Burden of chronic disease comorbidities among cancer patients at Queen Elizabeth and Kamuzu Central Hospitals in Malawi: An exploratory cross-sectional study. Pan Afr. Med. J. 2021;40:167. doi: 10.11604/pamj.2021.40.167.31069.
- 12. McWilliams LA, Bailey SJ. Associations between adult attachment ratings and health conditions: Evidence from the National Comorbidity Survey Replication. Health Psychol. 2010;29(4):446. doi: 10.1037/a0020061.
- 13. O’Brien SM, Dunson DB. Bayesian multivariate logistic regression. Biometrics. 2004;60(3):739–746. doi: 10.1111/j.0006-341X.2004.00224.x.
- 14. Kaombe TM, Manda SOM. Detecting influential data in multivariate survival models. Commun. Stat.-Theory Methods. 2023;52(11):3910–3926. doi: 10.1080/03610926.2021.1982983.
- 15. Kaombe TM, Manda SOM. A novel outlier statistic in multivariate survival models and its application to identify unusual under-five mortality sub-districts in Malawi. J. Appl. Stat. 2022. doi: 10.1080/02664763.2022.2043255.
- 16. Diouf A, Cournil A, Ba-Fall K, Ngom-Guèye NF, Eymard-Duvernay S, Ndiaye I, Batista G, Guèye PM, Bâ PS, Taverne B, et al. Diabetes and hypertension among patients receiving antiretroviral treatment since 1998 in Senegal: Prevalence and associated factors. Int. Schol. Res. Not. 2012;2012:621565. doi: 10.5402/2012/621565.
- 17. Fukui M, Tanaka M, Toda H, Senmaru T, Sakabe K, Ushigome E, Asano M, Yamazaki M, Hasegawa G, Imai S, et al. Risk factors for development of diabetes mellitus, hypertension and dyslipidemia. Diabetes Res. Clin. Pract. 2011;94(1):e15–e18. doi: 10.1016/j.diabres.2011.07.006.
- 18. Wagner W. Characteristics of bivariate binomial distribution. Acta Univ. Lodziensis Folia Oecon. 2011;255:1.
- 19. Sarkar SK, Midi H, Rana S. Detection of outliers and influential observations in binary logistic regression: An empirical study. J. Appl. Sci. 2011;11(1):26–35. doi: 10.3923/jas.2011.26.35.
- 20. Yee TW. VGAM family functions for bivariate binomial responses (2008).
- 21. Gupta AK, Nguyen T, Pardo L. Residuals for polytomous logistic regression models based on φ-divergences test statistics. Statistics. 2008;42(6):495–514. doi: 10.1080/02331880701819345.
- 22. Jennings DE. Outliers and residual distributions in logistic regression. J. Am. Stat. Assoc. 1986;81(396):987–990. doi: 10.1080/01621459.1986.10478362.
- 23. Sakala N, Kaombe TM. Analysing outlier communities to child birth weight outcomes in Malawi: Application of multinomial logistic regression model diagnostics. BMC Pediatr. 2022;22(1):1–8. doi: 10.1186/s12887-022-03742-z.
- 24. Hauck WW Jr, Donner A. Wald’s test as applied to hypotheses in logit analysis. J. Am. Stat. Assoc. 1977;72(360a):851–853. doi: 10.1080/01621459.1977.10479969.
- 25. Zewotir T, Galpin JS. A unified approach on residuals, leverages and outliers in the linear mixed model. TEST. 2007;16(1):58–75. doi: 10.1007/s11749-006-0001-2.
- 26. Yee TW. On the Hauck–Donner effect in Wald tests: Detection, tipping points, and parameter space characterization. J. Am. Stat. Assoc. 2022;117(540):1763–1774. doi: 10.1080/01621459.2021.1886936.
- 27. Molenberghs G, Verbeke G. Likelihood ratio, score, and Wald tests in a constrained parameter space. Am. Stat. 2007;61(1):22–27. doi: 10.1198/000313007X171322.
- 28. Kraemer HC, Morgan GA, Leech NL, Gliner JA, Vaske JJ, Harmon RJ. Measures of clinical significance. J. Am. Acad. Child Adoles. Psychiatry. 2003;42(12):1524–1529. doi: 10.1097/00004583-200312000-00022.
- 29. LeFort SM. The statistical versus clinical significance debate. Image J. Nurs. Scholarsh. 1993;25(1):57–62. doi: 10.1111/j.1547-5069.1993.tb00754.x.
- 30. Page P. Beyond statistical significance: Clinical interpretation of rehabilitation research literature. Int. J. Sports Phys. Ther. 2014;9(5):726.
- 31. Ranganathan P, Pramesh CS, Buyse M. Common pitfalls in statistical analysis: Clinical versus statistical significance. Perspect. Clin. Res. 2015;6(3):169. doi: 10.4103/2229-3485.159943.