Journal of Applied Statistics. 2023 Jan 4;51(5):826–844. doi: 10.1080/02664763.2022.2163229

A new flexible regression model with application to recovery probability Covid-19 patients

F Prataviera a, E M Hashimoto b, E M M Ortega c, G M Cordeiro d, V G Cancho e, R Vila f
PMCID: PMC10956937  PMID: 38524797

Abstract

The aim of this study is to propose a generalized odd log-logistic Maxwell mixture model to analyze the effect of gender and age group on the lifetimes and recovery probabilities of Chinese individuals with COVID-19. We add new properties of the generalized Maxwell model. The regression coefficients and the recovered fraction are estimated by maximum likelihood and Bayesian methods. Further, some simulation studies are done to compare the regressions for different scenarios. Model-checking techniques based on the quantile residuals are addressed. The estimated survival functions for the patients are reported by age range and sex. The simulation study showed that the mean squared errors decay toward zero and the average estimates converge to the true parameters when the sample size increases. According to the fitted model, only the age group has a significant effect on the lifetime of individuals with COVID-19. Women have a higher probability of recovering than men, and individuals aged ≥60 years have lower recovery probabilities than those aged <60 years. The findings suggest that the proposed model could be a good alternative to analyze censored lifetimes of individuals with COVID-19.

Keywords: Censored data, COVID-19, Maxwell distribution, mixture model, quantile residuals

1. Introduction

The coronavirus 2019 (COVID-19) disease was first identified in Wuhan (China) in December 2019. The most common symptoms of the disease are fever, coughing, sore throat, gastrointestinal disturbances, breathing difficulty, and in serious cases it can evolve to pneumonia [6,23].

According to information from Johns Hopkins University updated to 22 April 2021, more than 144 million people had tested positive for COVID-19, and more than 3.062 million deaths had occurred worldwide [9]. According to the coronavirus pandemic site (on 22 April 2021), more than 123 million patients had recovered, and almost 19 million continued as active cases (0.6% in serious condition). The world mortality rate is 395 per 1 million inhabitants. For this reason, several studies try to investigate the behavior of the disease according to demographic characteristics and comorbidities [2,14,16,22,34].

Furthermore, specifically in China, the number of deaths caused by the disease (on 22 April 2021) is around 4636, out of a total of 90,507 confirmed cases, according to Johns Hopkins University. Xie et al. [33] studied the effect of oxygen saturation and other measures on the lifetime of COVID-19 patients suffering from pneumonia admitted to Union Hospital of Wuhan. An interesting characteristic that can be noted in the lifetime of these individuals is the presence of a plateau in the survival curve. Figure 1(a) displays a survival curve with plateau at 0.76 of the lifetime of a sample of patients suffering from COVID-19 residing in China. Figure 1(b), in turn, depicts the empirical risk initially increasing and then diminishing after medical care [19].

Figure 1. Plots for COVID-19 data: (a) empirical survival function and (b) empirical hazard function.

Morena et al. [24], Yang et al. [35] and Yan et al. [34] also presented a survival curve like the one described in Figure 1(a). However, those works did not take into account the information of a plateau (or asymptote) in the survival curve and used other statistical analyses. In situations like this, mixture models can be used to consider this information [18], and in this case it is possible to interpret the plateau as the proportion of the patients who recovered. The WHO-China Joint Mission on Coronavirus Disease 2019 Report published by WHO says that the recovery time depends on the age, gender and any other underlying health issues.

Some studies also have been published taking into account the recovery time variables such as sex and age. For example, Voinsky et al. [32] assessed the effects of the age and sex of 5769 Israeli coronavirus patients on their recovery rate. The time from infection to recovery is the number of days from the first positive to the first negative result of the SARS-CoV-2 PCR test. Al-Rousan and Al-Najjar [3] presented some statistical analysis of the effects of sex, region, reasons for infection, age and date of discharge or illness on the rates of recovered cases and deaths.

In this context, we construct a mixed regression based on the Generalized Odd Log-logistic Maxwell (GOLLMax) family of distributions to estimate the effects of group age and sex variables on the recovery probabilities of COVID-19 patients residing in China. The GOLLMax family was recently pioneered by Prataviera et al. [28] for applications in various fields to some well-known distributions. We adopt maximum likelihood and Bayesian methods to estimate the parameters of this family and its adequacy is confirmed by residual analysis.

This paper is structured as follows. Section 2 addresses some structural properties of the new family. The GOLLMax mixture regression and the estimation of its parameters are discussed in Section 3. Residual analysis is addressed in Section 4. The utility of the new regression is proved by means of coronavirus lifetimes in Section 5. Finally, this paper is closed in Section 6 with some remarks.

2. New properties of the GOLLMax distribution

The cumulative distribution function (cdf) of the generalized odd log-logistic-G (‘GOLL-G’) family (from a baseline G with unknown parameters in γ) is given by ([7])

$$F(t)=\frac{G(t)^{\sigma\nu}}{G(t)^{\sigma\nu}+[1-G(t)^{\sigma}]^{\nu}}, \tag{1}$$

where σ>0 and ν>0 are two extra shape parameters. The odd log-logistic-G (OLL-G) [13] and exponentiated-G (exp-G) classes correspond to σ=1 and ν=1, respectively.

The Maxwell baseline cdf has the form (for t>0)

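$$G(t)=\gamma_1\!\left(\tfrac{3}{2},\tfrac{t^2}{\mu^2}\right), \tag{2}$$

where $\mu>0$ is a scale parameter, $\gamma_1(p,y)=\gamma(p,y)/\Gamma(p)$, $\gamma(p,y)=\int_0^{y}w^{p-1}e^{-w}\,dw$ is the lower incomplete gamma function, and $\Gamma(p)=\int_0^{\infty}w^{p-1}e^{-w}\,dw$ is the gamma function.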

The cdf and probability density function (pdf) of the GOLLMax family were defined by Prataviera et al. [28] by inserting (2) in Equation (1)

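$$F(t)=\frac{\gamma_1^{\sigma\nu}(3/2,t^2/\mu^2)}{\gamma_1^{\sigma\nu}(3/2,t^2/\mu^2)+[1-\gamma_1^{\sigma}(3/2,t^2/\mu^2)]^{\nu}},\quad t>0, \tag{3}$$

and

$$f(t)=\frac{4\sigma\nu}{\sqrt{\pi}\,\mu^{3}}\,t^{2}\exp\!\left(-\frac{t^{2}}{\mu^{2}}\right)\frac{\gamma_1^{\sigma\nu-1}(3/2,t^2/\mu^2)\,[1-\gamma_1^{\sigma}(3/2,t^2/\mu^2)]^{\nu-1}}{\{\gamma_1^{\sigma\nu}(3/2,t^2/\mu^2)+[1-\gamma_1^{\sigma}(3/2,t^2/\mu^2)]^{\nu}\}^{2}}, \tag{4}$$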

respectively.
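As a minimal sketch of Equations (2)–(4), the functions below evaluate the Maxwell cdf and the GOLLMax cdf and pdf using only the Python standard library. The function names are hypothetical, and the sketch relies on the standard identity $\gamma_1(3/2,x)=\operatorname{erf}(\sqrt{x})-2\sqrt{x/\pi}\,e^{-x}$ instead of a general incomplete gamma routine; it is not the authors' gamlss implementation.

```python
import math


def maxwell_cdf(t, mu):
    """Maxwell cdf G(t) = gamma_1(3/2, t^2/mu^2), Equation (2), written with erf."""
    if t <= 0.0:
        return 0.0
    z = t / mu
    g = math.erf(z) - 2.0 * z * math.exp(-z * z) / math.sqrt(math.pi)
    return max(0.0, g)  # guard against tiny negative values from cancellation


def gollmax_cdf(t, mu, sigma, nu):
    """GOLLMax cdf, Equation (3)."""
    g = maxwell_cdf(t, mu)
    a = g ** (sigma * nu)
    b = (1.0 - g ** sigma) ** nu
    return a / (a + b)


def gollmax_pdf(t, mu, sigma, nu):
    """GOLLMax pdf, Equation (4); returns 0 where G(t) is numerically 0 or 1."""
    g = maxwell_cdf(t, mu)
    if g <= 0.0 or g >= 1.0:
        return 0.0
    base = 4.0 * t * t * math.exp(-(t / mu) ** 2) / (math.sqrt(math.pi) * mu ** 3)
    num = sigma * nu * base * g ** (sigma * nu - 1.0) * (1.0 - g ** sigma) ** (nu - 1.0)
    den = (g ** (sigma * nu) + (1.0 - g ** sigma) ** nu) ** 2
    return num / den
```

Setting `sigma = nu = 1` recovers the Maxwell baseline, which gives a quick sanity check of the implementation.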

We have

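$$\lim_{t\to\infty}f(t)=0\quad\text{and}\quad\lim_{t\to 0^{+}}f(t)=\infty\cdot\mathbb{1}_{(0,1)}(\nu\sigma)+0\cdot\mathbb{1}_{[0,\infty)}(\nu\sigma), \tag{5}$$

where $\mathbb{1}_{A}$ is the indicator function of the set $A$.

Further, if $h(t)$ denotes the hazard function corresponding to (3), then $\lim_{t\to\infty}h(t)=\infty$ and $\lim_{t\to 0^{+}}h(t)=\lim_{t\to 0^{+}}f(t)$.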

Prataviera et al. [28] showed that the GOLLMax family allows analyzing data whose hazard function has unimodal and bathtub bimodal shapes. Further, it has as special cases the OLLMax ( σ=1), exponentiated-Maxwell (EMax) ( ν=1) and Maxwell ( σ=ν=1) distributions. So the GOLLMax family is much more flexible and consequently becomes very competitive to many other lifetime models.

We present below new structural properties of the GOLLMax model, which are completely different from those reported by Prataviera et al. [28].

Henceforth, let $T\sim\text{GOLLMax}(\mu,\sigma,\nu)$ have the GOLLMax distribution with parameter vector $(\mu,\sigma,\nu)^{\top}$, and let $Y\sim\text{LL}(1,\nu)$ be the log-logistic random variable with unit scale and shape $\nu$.

Some properties of the GOLLMax distribution are reported below:

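  (a) The cdf of $T$ can be written as
    $$F(t)=P(Y\le A(t)),$$
    where $A(t)=A(t;\mu,\sigma)=G(t)^{\sigma}/[1-G(t)^{\sigma}]$ and $G(t)$ is as in (2).
  (b) As a consequence of Item a:

    (1) If $T\sim\text{GOLLMax}(\mu,\sigma,\nu)$, then $Y=A(T)\sim\text{LL}(1,\nu)$.
    (2) If $Y\sim\text{LL}(1,\nu)$, then $T=A^{-1}(Y)\sim\text{GOLLMax}(\mu,\sigma,\nu)$.
    Hence, the random variable $T$ admits the stochastic representation (see [8]):
    $$T=\mu\sqrt{\gamma_1^{-1}\!\left(\tfrac{3}{2},\left(\tfrac{Y}{1+Y}\right)^{1/\sigma}\right)},$$
    where $\gamma_1^{-1}(3/2,\cdot)$ denotes the inverse function of $\gamma_1(3/2,\cdot)$.
  (c) By applying Item b,
    $$E\!\left[\left\{\frac{\gamma_1^{\sigma}(3/2,T^2/\mu^2)}{1-\gamma_1^{\sigma}(3/2,T^2/\mu^2)}\right\}^{k}\right]=E(Y^{k})=\frac{k\pi/\nu}{\sin(k\pi/\nu)},\quad k<\nu.$$
    Again, by using Item b with $\nu=1$, we obtain
    $$E\!\left[\gamma_1^{\sigma}\!\left(\tfrac{3}{2},\tfrac{T^2}{\mu^2}\right)\right]=E\!\left(\frac{Y}{1+Y}\right)=\int_0^{\infty}\frac{y}{(1+y)^{3}}\,dy=\frac{1}{2}.$$
  (d) Since $A(t/k;\mu,\sigma)=A(t;k\mu,\sigma)$, $k>0$, by Item a, the following holds (see [8]):

    If $T\sim\text{GOLLMax}(\mu,\sigma,\nu)$, then $kT\sim\text{GOLLMax}(k\mu,\sigma,\nu)$. That is, the GOLLMax family is closed under changes of scale.

  (e) A critical point of the GOLLMax density (4) satisfies (see [8])
    $$\frac{y''}{(y')^{2}}+\frac{(\sigma+1)y^{\sigma}[y^{\nu\sigma}+(1-y^{\sigma})^{\nu}]-(\nu\sigma+1)y^{\nu\sigma}+(\nu\sigma-1)(1-y^{\sigma})^{\nu}}{y(1-y^{\sigma})[y^{\nu\sigma}+(1-y^{\sigma})^{\nu}]}=0, \tag{6}$$
    where $y=y(t)=G(t)$, and $G(t)$ is as in (2). Equation (6) implies that the modality of the GOLLMax density is independent of the parameter $\mu$.
  (f) By using the limit in (5) and the number of critical points of the GOLLMax pdf of $T$, we obtain that the GOLLMax pdf is decreasing, decreasing–increasing–decreasing, unimodal or bimodal (see [8]).

  (g) For any $\nu\ge 1$ (or for any $\sigma\ge 1$), the GOLLMax distribution has thinner tails than an exponential distribution (light-tailed distribution) (see [8]).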
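The stochastic representation in Item b can be sketched as an inverse-transform generator: a uniform draw is mapped to a log-logistic variable $Y$, and $T=A^{-1}(Y)$ is obtained by inverting the Maxwell cdf. The helper names below are hypothetical, and the bisection inverse is an assumption made to stay within the Python standard library (a library routine for the inverse incomplete gamma function would serve the same purpose).

```python
import math


def maxwell_cdf(t, mu):
    """Maxwell cdf G(t) = gamma_1(3/2, t^2/mu^2), written with erf."""
    if t <= 0.0:
        return 0.0
    z = t / mu
    return max(0.0, math.erf(z) - 2.0 * z * math.exp(-z * z) / math.sqrt(math.pi))


def maxwell_quantile(p, mu):
    """Solve G(t) = p by bisection (G is continuous and increasing)."""
    if p <= 0.0:
        return 0.0
    lo, hi = 0.0, mu
    while maxwell_cdf(hi, mu) < p:  # expand until the root is bracketed
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if maxwell_cdf(mid, mu) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gollmax_from_uniform(u, mu, sigma, nu):
    """Map u in (0, 1) to T = A^{-1}(Y), with Y = (u / (1 - u))^(1/nu) ~ LL(1, nu)."""
    y = (u / (1.0 - u)) ** (1.0 / nu)
    p = (y / (1.0 + y)) ** (1.0 / sigma)  # value of G(T)
    return maxwell_quantile(p, mu)


def gollmax_cdf(t, mu, sigma, nu):
    """GOLLMax cdf, Equation (3), used here only to check the generator."""
    g = maxwell_cdf(t, mu)
    a = g ** (sigma * nu)
    return a / (a + (1.0 - g ** sigma) ** nu)
```

Feeding `random.random()` into `gollmax_from_uniform` yields GOLLMax draws; evaluating `gollmax_cdf` at the result should return the uniform input, and multiplying `mu` by a constant `k` should scale the draw by `k`, in line with Item d.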

The following two results show convergence in law involving the minimum and maximum of a sequence of random variables with the GOLLMax distribution.

Proposition 2.1

There is a sequence of independent, identically distributed (iid) random variables $T_n\sim\text{GOLLMax}(\mu,\sigma,\nu_n)$ so that

$$T_n\xrightarrow{D}U,$$

where $U$ has cdf $F_U(u)=e^{-[A(u)]^{-p}}$, $u\ge 0$, $A$ is as in Item a, and '$\xrightarrow{D}$' denotes convergence in distribution.

Proof.

Henceforth, let $Y_1,\ldots,Y_n$ be iid random variables from $Y\sim\text{LL}(n^{1/p},\nu_n)$, $p>0$. Let $Y_{1,n}\le Y_{2,n}\le\cdots\le Y_{n,n}$ be their order statistics.

Define $Z_n=Y_{n,n}/n^{1/p}$. By Theorem 4.3 of [1], $Z_n\xrightarrow{D}X$, where $X$ has cdf $F_X(x)=\exp(-x^{-p})$, $x\ge 0$. By applying the continuous mapping theorem, since $A^{-1}$ (the inverse function of $A$) is a continuous map, we have

$$T_n=A^{-1}(Z_n)\xrightarrow{D}U=A^{-1}(X),$$

where $U$ has cdf $F_U(u)=e^{-[A(u)]^{-p}}$, $u\ge 0$. Further, since $Z_n\sim\text{LL}(1,\nu_n)$, by Item b-(2), $T_n\sim\text{GOLLMax}(\mu,\sigma,\nu_n)$.

Proposition 2.2

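There is a sequence of iid random variables $\tilde{T}_n\sim\text{GOLLMax}(\mu,\sigma,\nu_n)$ such that

$$\tilde{T}_n\xrightarrow{D}V,$$

where $V$ has cdf $F_V(v)=1-e^{-[A(v)]^{p}}$, $v\ge 0$, and $A$ is as in Item a.

Proof.

The proof is similar to that of the previous proposition. For the convenience of the reader, we show the details.

By defining $\tilde{Z}_n=Y_{1,n}/n^{1/p}$, from Theorem 4.4 of [1] we have $\tilde{Z}_n\xrightarrow{D}\tilde{X}\sim\text{Weibull}(1,p)$, $p>1$. By applying the continuous mapping theorem, we can write

$$\tilde{T}_n=A^{-1}(\tilde{Z}_n)\xrightarrow{D}V=A^{-1}(\tilde{X}),$$

where $V$ has cdf $F_V(v)=1-e^{-[A(v)]^{p}}$, $v\ge 0$. Since $\tilde{Z}_n\sim\text{LL}(1,\nu_n)$, by Item b-(2), $\tilde{T}_n\sim\text{GOLLMax}(\mu,\sigma,\nu_n)$.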

The following proposition gives other stochastic representations for the GOLLMax distribution and some related distributions.

Proposition 2.3

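Let $A$ be as in Item a. The following hold:

  1. If $X\sim U(0,1)$, then $A^{-1}\!\left(X^{1/\nu}/(1-X)^{1/\nu}\right)\sim\text{GOLLMax}(\mu,\sigma,\nu)$.

  2. If $T\sim\text{GOLLMax}(\mu,\sigma,\nu)$, then $A^{\nu}(T)/[1+A^{\nu}(T)]\sim U(0,1)$.

  3. If $T\sim\text{GOLLMax}(\mu,\sigma,\nu)$, then $k\log(A(T))+\ell\sim\text{Logistic}(\ell,|k|/\nu)$, $k,\ell\in\mathbb{R}$.

  4. If $X$ and $Y\sim\text{Exponential}(1)$ are independent, then $A^{-1}\!\left((X/Y)^{1/\nu}\right)\sim\text{GOLLMax}(\mu,\sigma,\nu)$.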

Proof.

For $X\sim U(0,1)$, it is well known that $a+[\log(X)-\log(1-X)]/\nu\sim\text{Logistic}(a,1/\nu)$. So, $e^{a}X^{1/\nu}/(1-X)^{1/\nu}\sim\text{LL}(e^{a},\nu)$. Applying properties of the log-logistic distribution, we have $X^{1/\nu}/(1-X)^{1/\nu}\sim\text{LL}(1,\nu)$. Hence, by Item b-(2), $A^{-1}\!\left(X^{1/\nu}/(1-X)^{1/\nu}\right)\sim\text{GOLLMax}(\mu,\sigma,\nu)$. This proves Item (1). Analogously, the proof of the second item follows.

If $T\sim\text{GOLLMax}(\mu,\sigma,\nu)$ then, by Item b-(1), $A(T)\sim\text{LL}(1,\nu)$. So, it is well known that $\log(A(T))\sim\text{Logistic}(0,1/\nu)$. Further, by applying properties of the logistic distribution, $k\log(A(T))+\ell\sim\text{Logistic}(\ell,|k|/\nu)$, thus proving Item (3).

For $X,Y\sim\text{Exponential}(1)$ independent, a well-known property is that $a+\log(X/Y)/\nu\sim\text{Logistic}(a,1/\nu)$. So, $e^{a}(X/Y)^{1/\nu}\sim\text{LL}(e^{a},\nu)$. Then, $(X/Y)^{1/\nu}\sim\text{LL}(1,\nu)$. Hence, by Item b-(2), $A^{-1}\!\left((X/Y)^{1/\nu}\right)\sim\text{GOLLMax}(\mu,\sigma,\nu)$. This completes the proof of the fourth item.

3. The GOLLMax mixture regression

The GOLLMax mixture model is described as follows: let $c>0$ be the fixed censoring time and $T$ be the lifetime independent of $c$. Then the observed time $t=\min(T,c)$ defines the Type I censoring mechanism [18]. Moreover, the current population is considered to be a mixture of susceptible individuals and recovered individuals. Let $N_i$ denote the indicator that the $i$th individual is susceptible ($N_i=1$) or recovered ($N_i=0$) (for $i=1,\ldots,n$). The mixture model [21,25] takes the form

$$S_{\text{pop}}(t_i)=\pi_0+(1-\pi_0)\,S(t_i\mid N_i=1),$$

where $S_{\text{pop}}(t_i)$ is the (unconditional) population survival function of $T_i$, $\pi_0=P(N_i=0)$ is the recovery probability, and the survival function for the susceptible individuals follows from (3)

$$S(t_i\mid N_i=1)=1-\frac{\gamma_1^{\sigma\nu}(3/2,t_i^2/\mu^2)}{\gamma_1^{\sigma\nu}(3/2,t_i^2/\mu^2)+[1-\gamma_1^{\sigma}(3/2,t_i^2/\mu^2)]^{\nu}}.$$

The improper population density function [21] can be expressed as

$$f_{\text{pop}}(t_i)=(1-\pi_0)\,f(t_i),$$

where $f(t_i)$ is the density function (4). The hazard rate function (hrf) of $T_i$ is $h_{\text{pop}}(t_i)=f_{\text{pop}}(t_i)/S_{\text{pop}}(t_i)$.
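The population functions of the mixture model can be sketched as follows. The function names and the parameter values used in the usage note are illustrative assumptions; the point of the sketch is that $S_{\text{pop}}(t)$ starts at 1 and levels off at the recovery probability $\pi_0$, which is the plateau seen in Figure 1(a).

```python
import math


def maxwell_cdf(t, mu):
    """Maxwell cdf G(t) = gamma_1(3/2, t^2/mu^2), written with erf."""
    if t <= 0.0:
        return 0.0
    z = t / mu
    return max(0.0, math.erf(z) - 2.0 * z * math.exp(-z * z) / math.sqrt(math.pi))


def gollmax_cdf(t, mu, sigma, nu):
    """GOLLMax cdf, Equation (3)."""
    g = maxwell_cdf(t, mu)
    a = g ** (sigma * nu)
    return a / (a + (1.0 - g ** sigma) ** nu)


def gollmax_pdf(t, mu, sigma, nu):
    """GOLLMax pdf, Equation (4); returns 0 where G(t) is numerically 0 or 1."""
    g = maxwell_cdf(t, mu)
    if g <= 0.0 or g >= 1.0:
        return 0.0
    base = 4.0 * t * t * math.exp(-(t / mu) ** 2) / (math.sqrt(math.pi) * mu ** 3)
    num = sigma * nu * base * g ** (sigma * nu - 1.0) * (1.0 - g ** sigma) ** (nu - 1.0)
    return num / (g ** (sigma * nu) + (1.0 - g ** sigma) ** nu) ** 2


def pop_survival(t, pi0, mu, sigma, nu):
    """S_pop(t) = pi0 + (1 - pi0) * S(t | N = 1)."""
    return pi0 + (1.0 - pi0) * (1.0 - gollmax_cdf(t, mu, sigma, nu))


def pop_density(t, pi0, mu, sigma, nu):
    """Improper density f_pop(t) = (1 - pi0) * f(t)."""
    return (1.0 - pi0) * gollmax_pdf(t, mu, sigma, nu)


def pop_hazard(t, pi0, mu, sigma, nu):
    """h_pop(t) = f_pop(t) / S_pop(t)."""
    return pop_density(t, pi0, mu, sigma, nu) / pop_survival(t, pi0, mu, sigma, nu)
```

For example, `pop_survival(200.0, 0.76, 3.0, 0.8, 1.2)` is numerically equal to the assumed plateau 0.76, while `pop_hazard` at the same time point is 0 because the improper density has no mass left there.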

Recently, Ortega et al. [26] and Prataviera et al. [27] developed some extended regressions for lifetime data. In a similar manner, the GOLLMax mixture regression is constructed for the response variable $T_i$ (for $i=1,\ldots,n$) having density (4) with associated vector $x_i=(1,x_{i1},\ldots,x_{ip})^{\top}$ of the explanatory variables, and the systematic component

$$\mu_i=\exp(x_i^{\top}\beta)\quad\text{and}\quad\pi_{0i}=\frac{\exp(x_i^{\top}\gamma)}{1+\exp(x_i^{\top}\gamma)}, \tag{7}$$

where $\beta=(\beta_0,\ldots,\beta_p)^{\top}$ and $\gamma=(\gamma_0,\ldots,\gamma_p)^{\top}$ are unknown parameter vectors. Note that the logit link function is used to model the proportion of individuals recovered.

Equation (7) is only identifiable when π0(x) is modeled by a logistic regression with non-constant covariates [20].

3.1. Estimation

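Let $(t_1,x_1),\ldots,(t_n,x_n)$ be a sample from the GOLLMax distribution (4), and let $\theta=(\sigma,\nu,\beta^{\top},\gamma^{\top})^{\top}$ be the vector of unknown parameters. Write $G_i=\gamma_1(3/2,t_i^2/\mu_i^2)$ for brevity. An observed lifetime $t_i$ contributes to the likelihood function with

$$f_{\text{pop}}(t_i\mid x_i)=\frac{4(1-\pi_{0i})\sigma\nu}{\sqrt{\pi}\,\mu_i^{3}}\,t_i^{2}\exp\!\left(-\frac{t_i^{2}}{\mu_i^{2}}\right)\frac{G_i^{\sigma\nu-1}\,[1-G_i^{\sigma}]^{\nu-1}}{\{G_i^{\sigma\nu}+[1-G_i^{\sigma}]^{\nu}\}^{2}},$$

whereas an observation censored at $t_i$ contributes with

$$S_{\text{pop}}(t_i\mid x_i)=\pi_{0i}+(1-\pi_{0i})\left\{1-\frac{G_i^{\sigma\nu}}{G_i^{\sigma\nu}+[1-G_i^{\sigma}]^{\nu}}\right\}.$$

Let $L$ and $C$ be the sets of elements for the lifetimes and censoring times, respectively, and $r$ be the number of uncensored observations. The log-likelihood function for $\theta$ under uninformative censoring has the form

$$\begin{aligned}\ell(\theta)={}&r\log\!\left(\frac{4\sigma\nu}{\sqrt{\pi}}\right)+\sum_{i\in L}\log(1-\pi_{0i})+\sum_{i\in L}\log\!\left(\frac{t_i^{2}}{\mu_i^{3}}\right)-\sum_{i\in L}\frac{t_i^{2}}{\mu_i^{2}}\\&+\sum_{i\in L}\log\!\left\{\frac{G_i^{\sigma\nu-1}\,[1-G_i^{\sigma}]^{\nu-1}}{\{G_i^{\sigma\nu}+[1-G_i^{\sigma}]^{\nu}\}^{2}}\right\}+\sum_{i\in C}\log\!\left\{\pi_{0i}+(1-\pi_{0i})\left[1-\frac{G_i^{\sigma\nu}}{G_i^{\sigma\nu}+[1-G_i^{\sigma}]^{\nu}}\right]\right\}.\end{aligned} \tag{8}$$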

We use the gamlss package of the R software [30,31] to maximize (8) and find the MLE $\hat{\theta}$. The computational program is available at https://github.com/fabiopviera/GOLLMax.Mix. The global deviance $\text{GD}=-2\ell(\hat{\theta})$, Akaike information criterion (AIC), and Bayesian information criterion (BIC) are adopted to select the best regression.
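For readers who want to see the log-likelihood (8) in computational form, here is a minimal sketch that sums $\log f_{\text{pop}}$ over observed lifetimes and $\log S_{\text{pop}}$ over censored ones, with $\mu_i$ and $\pi_{0i}$ given by (7). The function names and the data layout (tuples of time, censoring indicator and covariate vector with a leading 1) are assumptions for illustration; the authors' own fitting code is the gamlss program cited above.

```python
import math


def maxwell_cdf(t, mu):
    """Maxwell cdf G(t) = gamma_1(3/2, t^2/mu^2), written with erf."""
    if t <= 0.0:
        return 0.0
    z = t / mu
    return max(0.0, math.erf(z) - 2.0 * z * math.exp(-z * z) / math.sqrt(math.pi))


def gollmax_cdf(t, mu, sigma, nu):
    """GOLLMax cdf, Equation (3)."""
    g = maxwell_cdf(t, mu)
    a = g ** (sigma * nu)
    return a / (a + (1.0 - g ** sigma) ** nu)


def gollmax_pdf(t, mu, sigma, nu):
    """GOLLMax pdf, Equation (4), for t with 0 < G(t) < 1."""
    g = maxwell_cdf(t, mu)
    base = 4.0 * t * t * math.exp(-(t / mu) ** 2) / (math.sqrt(math.pi) * mu ** 3)
    num = sigma * nu * base * g ** (sigma * nu - 1.0) * (1.0 - g ** sigma) ** (nu - 1.0)
    return num / (g ** (sigma * nu) + (1.0 - g ** sigma) ** nu) ** 2


def expit(z):
    """Inverse logit link used for pi_0i in Equation (7)."""
    return 1.0 / (1.0 + math.exp(-z))


def gollmax_mix_loglik(beta, gamma, sigma, nu, data):
    """Log-likelihood (8); data holds (t, delta, x), delta = 1 for an observed lifetime."""
    total = 0.0
    for t, delta, x in data:
        mu = math.exp(sum(b * xj for b, xj in zip(beta, x)))
        pi0 = expit(sum(g * xj for g, xj in zip(gamma, x)))
        if delta == 1:
            total += math.log((1.0 - pi0) * gollmax_pdf(t, mu, sigma, nu))
        else:
            surv = 1.0 - gollmax_cdf(t, mu, sigma, nu)
            total += math.log(pi0 + (1.0 - pi0) * surv)
    return total
```

A general-purpose optimizer applied to the negative of this function (with $\sigma$ and $\nu$ on the log scale) would be one way to obtain estimates outside gamlss, although that route is not described in the paper.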

3.2. A Bayesian analysis

We can use a Markov Chain Monte Carlo (MCMC) algorithm to obtain posterior inference for the parameters. We consider independent prior densities for the parameters, $\pi(\beta,\gamma,\sigma,\nu)=\pi(\beta)\pi(\gamma)\pi(\sigma)\pi(\nu)$, where $\beta_j\sim N(0,\tau_j)$, $\gamma_j\sim N(0,\rho_j)$ ($j=0,1,\ldots,p$), $\sigma\sim G(a,b)$, $\nu\sim G(c,d)$, $G(a,b)$ is a gamma distribution, and $N(\mu_0,\tau_0)$ is a normal distribution. Combining these prior densities with the likelihood function obtained from Equation (8), the posterior density for the parameters becomes

$$\pi(\beta,\gamma,\sigma,\nu\mid D)\propto L(\beta,\gamma,\sigma,\nu\mid D)\,\pi(\beta)\pi(\gamma)\pi(\sigma)\pi(\nu). \tag{9}$$

The MCMC algorithm can be used since the joint posterior density (9) is analytically intractable. However, we set $\sigma=\exp(\sigma^{*})$ and $\nu=\exp(\nu^{*})$ to obtain numerical stability, thus implying a new parameter space $\Theta=\{\vartheta:\vartheta=(\beta,\gamma,\sigma^{*},\nu^{*})\}\subseteq\mathbb{R}^{2p+4}$. Using the Jacobian of this transformation, the posterior density follows as

$$\pi(\beta,\gamma,\sigma^{*},\nu^{*}\mid D)\propto L(\beta,\gamma,\sigma,\nu\mid D)\,\pi(\beta)\pi(\gamma)\pi(\sigma)\pi(\nu)\,e^{\sigma^{*}+\nu^{*}}.$$

We implement the Metropolis–Hastings algorithm in Cancho et al. [5] which operates as follows:

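  1. Start with any point $\vartheta^{(0)}$ and stage indicator $j=0$.

  2. Generate a candidate point $\vartheta^{*}$ according to the transition kernel $q(\vartheta^{*},\vartheta^{(j)})=N_{2p+4}(\vartheta^{(j)},\tilde{\Sigma})$, where $\tilde{\Sigma}$ is the covariance matrix of $\vartheta$, which is the same at every stage.

  3. Update $\vartheta^{(j)}$ to $\vartheta^{(j+1)}=\vartheta^{*}$ with probability $p_j=\min\{1,\pi(\vartheta^{*}\mid D)/\pi(\vartheta^{(j)}\mid D)\}$, or keep $\vartheta^{(j)}$ with probability $1-p_j$.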

  4. Repeat steps (2) and (3) by increasing the stage indicator until the process reaches a stationary distribution.
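The four steps above can be sketched as a generic random-walk Metropolis sampler. This is an illustration under stated assumptions, not the authors' program: the function name is hypothetical, the proposal uses independent normal increments (a diagonal covariance matrix) rather than a full covariance matrix, and the acceptance ratio is computed on the log scale to avoid underflow.

```python
import math
import random


def metropolis(log_post, start, step_sd, n_iter, seed=0):
    """Random-walk Metropolis with independent normal proposals (diagonal covariance).

    log_post: function returning the log posterior density up to a constant.
    start:    initial point (step 1).
    step_sd:  proposal standard deviation for each coordinate.
    """
    rng = random.Random(seed)
    current = list(start)
    current_lp = log_post(current)
    chain = []
    for _ in range(n_iter):
        # Step 2: draw a candidate centered at the current state.
        cand = [c + rng.gauss(0.0, s) for c, s in zip(current, step_sd)]
        cand_lp = log_post(cand)
        # Step 3: accept with probability min{1, posterior ratio}.
        if math.log(1.0 - rng.random()) < cand_lp - current_lp:
            current, current_lp = cand, cand_lp
        chain.append(list(current))
    return chain
```

For the regression, `log_post` would be the log-likelihood (8) plus the log priors and the log-Jacobian term of the reparametrized posterior; burn-in and thinning are then applied to the returned chain.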

The computational program is available from the authors upon request.

3.3. Simulation study

We examine the performance of the MLEs from the above regression by means of Monte Carlo simulations for sample sizes (n = 80, 250, 500) using the gamlss package in R and vector operations related to the RS method.

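We consider a systematic component defined from Equation (7) as $\mu_i=\exp(\beta_0+\beta_1x_{1i}+\beta_2x_{2i}+\beta_3x_{3i})$ and $\tau_i=\text{logit}(\gamma_0+\gamma_1x_{4i}+\gamma_2x_{3i})$, whose coefficients are $\beta_0=1.20$, $\beta_1=-0.55$, $\beta_2=0.20$, $\beta_3=-0.35$, $\sigma=0.35$, $\nu=0.85$, $\gamma_0=-0.95$, $\gamma_1=0.30$ and $\gamma_2=1.50$ (the signs agree with the average estimates and biases in Table 1). The explanatory variables are taken as $x_{1i}\sim\text{Uniform}(0,2.5)$, $x_{2i}\sim\text{Normal}(5,0.50)$, $x_{3i}\sim\text{Uniform}(0,1)$ and $x_{4i}\sim\text{Binomial}(1,0.5)$.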

Further, the percentage of cured individuals is assumed approximately 66%. We present a brief script to generate the random values of the proposed regression with cured proportion:

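  1. Calculate $\tau_i$ such that $\tau_i=\text{logit}(\gamma_0+\gamma_1x_{4i}+\gamma_2x_{3i})$

  2. Let $M_i\sim\text{Bernoulli}(\tau_i)$

  3. If $M_i=0$, set $y_i=\infty$; otherwise ($M_i=1$), draw $y_i\sim\text{GOLLMax}(\mu_i,\sigma,\nu)$ from Equation (3)

  4. Generate the censoring time $t_{ci}\sim\text{Uniform}(0,\xi)$ for $\xi=15$

  5. The observed time $t_i$ for the $i$th individual is $t_i=\min(y_i,t_{ci})$

  6. Create a censoring indicator vector $\delta_i$: if $y_i\le t_{ci}$ set $\delta_i=1$, otherwise $\delta_i=0$.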
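The six steps above can be sketched in Python as follows. Several points are assumptions rather than facts from the paper: the function names are hypothetical, the "logit" in step 1 is read as the inverse logit so that $\tau_i$ is a probability, the second argument of Normal(5, 0.50) is treated as a standard deviation, the coefficient signs are those implied by Table 1 (average estimate minus bias), and GOLLMax draws use the stochastic representation of Section 2 with a bisection inverse of the Maxwell cdf.

```python
import math
import random


def maxwell_cdf(t, mu):
    """Maxwell cdf G(t) = gamma_1(3/2, t^2/mu^2), written with erf."""
    if t <= 0.0:
        return 0.0
    z = t / mu
    return max(0.0, math.erf(z) - 2.0 * z * math.exp(-z * z) / math.sqrt(math.pi))


def maxwell_quantile(p, mu):
    """Solve G(t) = p by bisection."""
    if p <= 0.0:
        return 0.0
    lo, hi = 0.0, mu
    while maxwell_cdf(hi, mu) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if maxwell_cdf(mid, mu) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def rgollmax(rng, mu, sigma, nu):
    """One GOLLMax draw via T = A^{-1}(Y), Y ~ LL(1, nu)."""
    u = rng.random()
    y = (u / (1.0 - u)) ** (1.0 / nu)
    return maxwell_quantile((y / (1.0 + y)) ** (1.0 / sigma), mu)


def simulate(n, seed=0, xi=15.0):
    """Return rows (t, delta, m, x1, x2, x3, x4) following steps 1-6."""
    rng = random.Random(seed)
    beta = (1.20, -0.55, 0.20, -0.35)
    gamma = (-0.95, 0.30, 1.50)
    sigma, nu = 0.35, 0.85
    rows = []
    for _ in range(n):
        x1 = rng.uniform(0.0, 2.5)
        x2 = rng.gauss(5.0, 0.5)  # assumption: 0.50 is the standard deviation
        x3 = rng.random()
        x4 = 1 if rng.random() < 0.5 else 0
        mu = math.exp(beta[0] + beta[1] * x1 + beta[2] * x2 + beta[3] * x3)
        # Step 1 (assumption: inverse logit of the linear predictor).
        tau = 1.0 / (1.0 + math.exp(-(gamma[0] + gamma[1] * x4 + gamma[2] * x3)))
        m = 1 if rng.random() < tau else 0                           # step 2
        y = rgollmax(rng, mu, sigma, nu) if m == 1 else math.inf     # step 3
        tc = rng.uniform(0.0, xi)                                    # step 4
        t = min(y, tc)                                               # step 5
        delta = 1 if y <= tc else 0                                  # step 6
        rows.append((t, delta, m, x1, x2, x3, x4))
    return rows
```

Calling `simulate(250, seed=1)` gives one replicate of the n = 250 scenario; repeating this 1000 times and refitting the model each time would mimic the Monte Carlo design described in the text.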

For each of the 1000 simulations, we obtain the average estimates (AEs), biases and mean squared errors (MSEs) of the estimates. The figures reported in Table 1 indicate that the MSEs decay toward zero and the AEs converge to the true parameters when n increases. More details on the simulations of this regression and other scenarios, showing that the asymptotic properties of the estimators are satisfied are addressed in Prataviera et al. [28] and Prataviera et al. [29].

Table 1.

Findings from the simulated GOLLMax mixture regression.

        n = 80                     n = 250                    n = 500
θ       AE       Bias     MSE      AE       Bias     MSE      AE       Bias     MSE
β0      1.363    0.163    0.161    1.305    0.105    0.058    1.290    0.091    0.008
β1      −0.536   0.014    0.003    −0.540   0.010    0.001    −0.538   0.008    0.001
β2      0.183    −0.017   0.005    0.194    −0.006   0.002    0.190    −0.010   0.001
β3      −0.310   0.040    0.021    −0.320   0.030    0.006    −0.321   0.022    0.004
σ       0.224    −0.126   0.028    0.230    −0.120   0.019    0.227    −0.123   0.017
ν       1.017    0.167    0.051    0.965    0.115    0.020    0.905    0.068    0.003
γ0      −1.069   −0.119   3.084    −0.952   −0.002   0.164    −0.954   −0.002   0.026
γ1      0.363    0.063    2.982    0.307    0.007    0.116    0.286    −0.014   0.095
γ2      1.623    0.123    1.305    1.514    0.014    0.376    1.551    0.011    0.027

4. Checking model

The adequacy of a regression model fitted to data can be carried out by analyzing the residuals to identify discrepant observations and if there are serious departures from the model assumptions. If the model is suitable, the residual plots versus the order of the observations or the predicted values should behave randomly around zero.

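We consider the quantile residuals (qrs) [12] for the fitted GOLLMax mixture regression (for $i=1,\ldots,n$), namely

$$qr_i=\Phi^{-1}\!\left\{1-\left[\frac{\hat{S}_i}{\hat{\pi}_{0i}+(1-\hat{\pi}_{0i})\,\hat{S}_i}\right]\right\},\qquad \hat{S}_i=1-\frac{\hat{G}_i^{\hat{\sigma}\hat{\nu}}}{\hat{G}_i^{\hat{\sigma}\hat{\nu}}+[1-\hat{G}_i^{\hat{\sigma}}]^{\hat{\nu}}},$$

where $\hat{G}_i=\gamma_1(3/2,t_i^2/\hat{\mu}_i^2)$,

$$\hat{\mu}_i=\exp(x_i^{\top}\hat{\beta}),\qquad\hat{\pi}_{0i}=\frac{\exp(x_i^{\top}\hat{\gamma})}{1+\exp(x_i^{\top}\hat{\gamma})},$$

$\hat{\sigma}$, $\hat{\nu}$, $\hat{\beta}$, $\hat{\gamma}$ are the MLEs and $\Phi^{-1}(\cdot)$ is the inverse standard normal cdf.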

We also adopt the Worm Plots (WP) of the residuals to check the quality of the fitted regression [4].

5. Application

Equations (4) and (7) are used to model the probability that symptomatic patients recover from COVID-19. For this purpose, a data set was obtained from Dong et al. [11]. The sample consists of 139 individuals of Chinese nationality diagnosed with coronavirus according to the WHO guidance; of these individuals, 52 are women and 87 are men. The response variable $T$ is the time in days from the onset of COVID-19 symptoms to the individual's death. As the last database update was on 03/13/2020, the censoring date is $c=$ 03/13/2020. In addition, demographic characteristics (age and gender) were also observed and then, for each individual ($i=1,\ldots,139$), the following variables are obtained:

  • $t_i$: lifetime (in days),

  • $x_{i1}$: gender (female = 0, male = 1),

  • $x_{i2}$: age group (0 = age < 60 years, 1 = age ≥ 60 years),

where the reference level for $x_{i1}$ is female ($x_{i1}=0$) and for $x_{i2}$ is age < 60 years ($x_{i2}=0$).

Table 2 reports the counts of individuals in relation to the current variables. Regardless of gender, 66% of the individuals were aged <60 years and had not died from the disease by the last update, and 20% were aged ≥60 years and died of COVID-19. Figure 2 displays the Kaplan–Meier survival curves [17]. Figure 2(a) provides plots by age, Figure 2(b) by gender, and Figure 2(c) by age versus gender. For all plots, there is a high percentage of recovered individuals, mainly for individuals aged <60 years.

Table 2.

Distribution of COVID-19 patients by gender and age.

                      Age group
Status      Gender    < 60 years   ≥ 60 years
Died        Female    1            7
            Male      4            21
Recovered   Female    35           9
            Male      57           5

Figure 2. Plots of Kaplan–Meier survival functions for COVID-19 data: (a) by age group, (b) by gender, (c) by age group versus gender.

First, consider the data analysis after fitting the Weibull and GOLLMax distributions, and the special cases OLLMax and EMax of the latter, to the patients' lifetimes without explanatory variables. The MLEs and standard errors (SEs) (in parentheses) are given in Table 3 for these data.

Table 3.

Results from some fitted distributions to coronavirus lifetimes.

Regression log(μ) log(σ) log(ν) logit(π0)
GOLLMax 3.320 −1.261 0.896 1.163
  (0.005) (0.036) (0.009) (0.111)
OLLMax 2.657   −0.067 1.166
  (0.003)   (0.008) (0.111)
EMax 2.724 −0.195   1.166
  (0.004) (0.090)   (0.111)
Weibull 2.846 0.770   1.164
  (0.039) (0.062)   (0.111)

Based on the estimates in Table 3, the empirical and their estimated survival functions for some distributions are reported in Figure 3(a). The empirical and estimated hazard functions are displayed in Figure 3(b), which reveal that the GOLLMax distribution provides the most appropriate fit to the COVID-19 lifetimes based on the risk function. This fact is not visible in the survival function plots.

Figure 3. Plots of some fitted distributions to COVID-19 data: (a) empirical and estimated survival functions and (b) empirical and estimated hazard functions.

However, the data set presents some characteristics of individuals that can explain the life span of COVID-19 patients. For this reason, we will check the adequacy of the models considering the effects of the age group and gender.

5.1. The GOLLMax mixture regression

The GOLLMax mixture regression is considered with the systematic components:

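$$\begin{cases}M_1:\ \log(\mu_i)=\beta_0+\beta_1\,\text{Gender}+\beta_2\,\text{Age}\ \text{ and }\ \text{logit}(\pi_{0i})=\gamma_0,\\ M_2:\ \log(\mu_i)=\beta_0\ \text{ and }\ \text{logit}(\pi_{0i})=\gamma_0+\gamma_1\,\text{Gender}+\gamma_2\,\text{Age},\\ M_3:\ \log(\mu_i)=\beta_0+\beta_1\,\text{Gender}+\beta_2\,\text{Age}\ \text{ and }\ \text{logit}(\pi_{0i})=\gamma_0+\gamma_1\,\text{Gender}+\gamma_2\,\text{Age}.\end{cases}$$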

The values of the GD and AIC statistics to compare the GOLLMax, OLLMax, EMax and Weibull regressions are reported in Table 4 under different systematic components. The EMax mixture regression under $M_3$ has the lowest GD and AIC among all fitted regressions, and so it can be used effectively to explain the COVID-19 survival times.

Table 4.

Information criteria for mixture regressions with different systematic components.

Model M GD AIC
GOLLMax M1 306.90 318.90
  M2 301.63 313.63
  M3 292.45 308.45
OLLMax M1 308.78 318.78
  M2 305.86 315.86
  M3 292.57 306.57
EMax M1 308.36 318.36
  M2 305.85 315.85
  M3 292.39 306.39
Weibull M1 308.01 318.01
  M2 306.87 316.87
  M3 300.76 314.76

Table 5 reports the MLEs and their SEs for four fitted mixture regressions under the structure $M_3$ to the current data. For the GOLLMax, OLLMax and EMax regressions, all covariates are significant at the 5% significance level except for gender in the scale component ($\beta_1$). Some conclusions are addressed at the end of this section.

Table 5.

Findings from four fitted regressions under M3 to coronavirus data.

                  GOLLMax                      OLLMax
θ                 θ^       SE      p-value     θ^       SE      p-value
β0                3.563    0.015   <0.001      3.133    0.011   <0.001
β1 (male)         −0.017   0.012   0.190       −0.006   0.007   0.441
β2 (≥60 years)    −0.632   0.014   <0.001      −0.632   0.010   <0.001
γ0                4.045    0.167   <0.001      4.046    0.167   <0.001
γ1 (male)         −1.495   0.213   <0.001      −1.495   0.213   <0.001
γ2 (≥60 years)    −3.891   0.439   <0.001      −3.893   0.439   <0.001
log(σ)            −0.861   0.041
log(ν)            0.718    0.012               0.143    0.010

                  EMax                         Weibull
θ                 θ^       SE      p-value     θ^       SE      p-value
β0                3.038    0.011   <0.001      4.818    0.062   <0.001
β1 (male)         −0.007   0.009   0.452       0.147    0.072   0.045
β2 (≥60 years)    −0.623   0.010   <0.001      −2.227   0.076   <0.001
γ0                4.047    0.167   <0.001      1.073    0.207   <0.001
γ1 (male)         −1.495   0.213   <0.001      −1.728   0.407   <0.001
γ2 (≥60 years)    −3.894   0.439   <0.001      −0.801   0.450   0.077
log(σ)            0.254    0.089               0.880    0.062

In addition, the qrs for the fitted EMax mixture regression in Figure 4(a) show that the residuals behave randomly in the interval $(-3,3)$. There is no evidence that the model assumptions do not hold, and there are no influential observations.

Figure 4. Plots for the EMax mixture regression fitted to coronavirus data under structure M3: (a) qrs versus index, (b) qq-plot for qrs, and (c) worm plot for qrs.

Figures 4(b) and 4(c) display the qq-plot and Worm plot for the qrs to assess possible departures from the distribution response in the fitted EMax mixture regression under structure M3. They reveal that this regression is suitable for these data.

5.2. Findings

The estimated total recovered fraction is

$$\hat{\pi}_0=\frac{1}{139}\sum_{i=1}^{139}\frac{\exp(x_i^{\top}\hat{\gamma})}{1+\exp(x_i^{\top}\hat{\gamma})}=0.762,$$

where $x_i^{\top}\hat{\gamma}=4.047-1.495\,x_{i1}-3.894\,x_{i2}$. This estimate indicates that approximately 76% of the individuals diagnosed with COVID-19 recovered. The 95% asymptotic confidence interval is approximately (67.0%, 87.5%), which includes the overall rate of 80% for recovered patients in several studies presented in the pandemic literature.
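The averaged recovered fraction can be checked with a short calculation. The sketch below plugs the EMax estimates $\hat{\gamma}=(4.047,-1.495,-3.894)$ from Table 5 into the logistic link and weights the four gender-by-age cells by the counts in Table 2 (died plus recovered); the function names are hypothetical.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


# EMax estimates of (gamma0, gamma1, gamma2) from Table 5.
gamma_hat = (4.047, -1.495, -3.894)

# Cell sizes from Table 2, keyed by (gender, age group): died + recovered.
counts = {
    (0, 0): 1 + 35,   # female, < 60 years
    (1, 0): 4 + 57,   # male,   < 60 years
    (0, 1): 7 + 9,    # female, >= 60 years
    (1, 1): 21 + 5,   # male,   >= 60 years
}


def pi0(gender, age):
    """Estimated recovery probability for one covariate combination."""
    return expit(gamma_hat[0] + gamma_hat[1] * gender + gamma_hat[2] * age)


def average_pi0(keep=lambda gender, age: True):
    """Average of pi0 over the individuals in the selected cells."""
    cells = [(k, n) for k, n in counts.items() if keep(*k)]
    total = sum(n for _, n in cells)
    return sum(n * pi0(*k) for k, n in cells) / total


overall = average_pi0()
female = average_pi0(lambda gender, age: gender == 0)
male = average_pi0(lambda gender, age: gender == 1)
```

With these inputs, `overall` rounds to 0.762, and `female` and `male` round to 0.846 and 0.712, matching the values reported in this section.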

Moreover, we can obtain the following interpretations for the EMax regression from Table 5 at the 5% significance level:

Findings for the scale parameter μ

  • Based on the fitted regression, there is no evidence (p-value = 0.452) of a significant effect of the gender variable. However, in relation to the survival times, the plot in Figure 5(a) shows that women have a higher survival curve compared to men regardless of age.

  • There is a significant difference between individuals aged less than 60 years and older than or equal to 60 years in relation to the survival times regardless of the sex. This interpretation can also be seen in Figure 5(b).

Figure 5. Plots of Kaplan–Meier and estimated survival functions from the EMax regression: (a) by gender and (b) by age group.

Findings for the recovery probability π0

  • Women have a higher probability of recovering than men, since the estimate of $\gamma_1$ is negative. The estimated overall proportion of recovered women is $\hat{\pi}_{\text{Female}}=0.846$, and for men it is approximately $\hat{\pi}_{\text{Male}}=0.712$, regardless of age. This can be seen graphically in Figure 5(a).

  • In relation to age, the estimate of $\gamma_2$ is also negative, which indicates that individuals aged ≥60 years have lower recovery probabilities than those aged <60 years. So, the proportion of recovered individuals under 60 years is $\hat{\pi}_{\text{age}<60}=0.948$, and for individuals aged 60 years or over it is only $\hat{\pi}_{\text{age}\ge 60}=0.336$. This fact can be seen in Figure 5(b). Age is an aggravating factor in the recovery of coronavirus patients.

  • The results for the stratified model in relation to sex by age are reported in Figure 6. In this case, female and male individuals under 60 years old have recovered proportions equal to $\hat{\pi}_{\text{Female},\,\text{age}<60}=0.982$ and $\hat{\pi}_{\text{Male},\,\text{age}<60}=0.927$, respectively. For individuals aged 60 and over, the proportions of recovered individuals are equal to $\hat{\pi}_{\text{Female},\,\text{age}\ge 60}=0.538$ and $\hat{\pi}_{\text{Male},\,\text{age}\ge 60}=0.207$. In addition, regardless of gender, the estimated survival function of patients under 60 stabilizes at around 40 days. For women over 60, the estimated survival function stabilizes at approximately 20 days, while for men over 60, this plateau occurs around 30 days.

Figure 6. Plots of Kaplan–Meier and estimated survival function from the EMax regression stratified by gender and age group.

Note that the estimated probabilities of recovered individuals by the EMax regression using the likelihood and Bayesian methods are approximately equal to the fraction of recovered individuals observed in the Kaplan–Meier plots (Figures 5 and 6).

5.3. Bayesian analysis

The informative priors $\beta_j\sim N(0,100)$ (for $j=0,1,2$), $\gamma_j\sim N(0,100)$, $\sigma\sim G(1,0.1)$, and $\nu\sim G(1,0.1)$ are adopted for the fitted regression models. We perform 35,000 MCMC iterations, discard the first 5000 as burn-in, and thin the remainder to every tenth draw. Posterior results are therefore based on 3000 iterations of the Markov chains, with convergence monitored under the methods of Cowles and Carlin [10]. We calculate the posterior means, standard deviations (SDs), 95% highest posterior density (HPD) intervals, and the logarithm of the pseudo marginal likelihood (LPML) statistic [15].

Table 6 gives the posterior means, SDs, 95% HPD intervals for the parameters, and the LPML statistics for all regressions. All covariates are statistically significant at the 5% significance level for all regressions except for $\beta_1$ (gender), whose HPD intervals contain zero, which is the same result obtained before for the MLEs. The EMax regression is the best model under the LPML criterion, in agreement with the previous results in Table 4.

Table 6.

Findings for the GOLLMax, OLLMax, EMax, and Weibull regression models.

(L and U are the lower and upper limits of the 95% HPD interval.)

          GOLLMax                               OLLMax
θ         Mean      SD      L        U          Mean      SD      L        U
β0        3.193     0.407   2.487    4.010      3.281     0.390   2.659    4.206
β1        −0.002    0.179   −0.363   0.345      0.031     0.173   −0.331   0.339
β2        −0.738    0.297   −1.335   −0.183     −0.816    0.418   −1.864   −0.253
γ0        4.096     0.784   2.570    5.657      4.043     0.977   1.999    6.027
γ1        −1.543    0.627   −2.721   −0.272     −1.596    0.652   −2.837   −0.298
γ2        −3.951    0.711   −5.460   −2.625     −3.889    0.904   −5.518   −1.896
log ν     −0.050    0.469   −0.881   0.881      0.068     0.173   −0.281   0.371
log σ     0.233     0.749   −1.197   1.524
LPML      −154.298                              −155.487

          EMax                                  Weibull
θ         Mean      SD      L        U          Mean      SD      L        U
β0        3.135     0.321   2.655    3.732      3.550     0.492   2.903    4.814
β1        0.003     0.162   −0.312   0.331      0.026     0.178   −0.343   0.359
β2        −0.718    0.305   −1.277   −0.250     −0.842    0.544   −2.291   −0.208
γ0        4.048     0.737   2.600    5.447      3.806     1.308   1.300    6.294
γ1        −1.543    0.610   −2.795   −0.425     −1.572    0.626   −2.798   −0.429
γ2        −3.903    0.679   −5.227   −2.557     −3.680    1.313   −5.904   −0.942
log σ     0.224     0.270   −0.312   0.742      0.912     0.131   0.636    1.167
LPML      −153.903                              −157.206

The probability that a patient is disease-free after a time $t>0$ (days) of follow-up, for all combinations of covariates ($x_0$), can be expressed as

$$p(t)=\Pr(N=0\mid T>t)=\frac{\pi_0(x_0)}{1-[1-\pi_0(x_0)]\,F(t)},$$

where $\mu$ and $\pi_0$ are given by (7) at $x_0$, and $F(\cdot)$ is the EMax cdf.
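The conditional recovery probability $p(t)$ is easy to evaluate once $\pi_0$, $\mu$ and $\sigma$ are fixed. The sketch below uses the EMax cdf $F(t)=G(t)^{\sigma}$ (the GOLLMax cdf with $\nu=1$); the function names are hypothetical, and the plug-in values in the usage note are the maximum likelihood estimates of Table 5, not the posterior draws used for the Bayesian summaries, so the numbers need not coincide with those in the tables of this section.

```python
import math


def maxwell_cdf(t, mu):
    """Maxwell cdf G(t) = gamma_1(3/2, t^2/mu^2), written with erf."""
    if t <= 0.0:
        return 0.0
    z = t / mu
    return max(0.0, math.erf(z) - 2.0 * z * math.exp(-z * z) / math.sqrt(math.pi))


def emax_cdf(t, mu, sigma):
    """Exponentiated-Maxwell cdf: the GOLLMax cdf with nu = 1."""
    return maxwell_cdf(t, mu) ** sigma


def recovery_prob(t, pi0, mu, sigma):
    """p(t) = Pr(N = 0 | T > t) = pi0 / (1 - (1 - pi0) * F(t))."""
    return pi0 / (1.0 - (1.0 - pi0) * emax_cdf(t, mu, sigma))


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


# Plug-in illustration for a male patient aged >= 60 (EMax MLEs, Table 5).
mu_d = math.exp(3.038 - 0.007 - 0.623)
sigma_d = math.exp(0.254)
pi0_d = expit(4.047 - 1.495 - 3.894)
p20_d = recovery_prob(20.0, pi0_d, mu_d, sigma_d)
```

By construction `recovery_prob(0.0, ...)` equals `pi0`, and the value increases toward 1 as `t` grows, because surviving longer makes it more likely that the patient belongs to the recovered group.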

Note that p(0)=π0(x0) is the proportion of recovered patients by the end of follow-up. We determine the posterior distribution of the recovered patients from the posterior sample of the EMax regression. Table 7 provides the posterior means and 95% HPD intervals for the proportions of hypothetical recovered patients after 20 days of follow-up.

Table 7.

Posterior means, 95% HPD intervals for the recovered rate by the end of follow-up ( π0), and the proportion of recovered after 20 days of follow-up.

                                   π0                         p(20)
Patients  Gender  Age (years)      Mean    L       U          Mean    L       U
A         female  <60              0.981   0.956   0.999      0.986   0.966   1.000
B         male    <60              0.924   0.862   0.983      0.944   0.887   0.992
C         female  ≥60              0.546   0.322   0.754      0.897   0.736   0.997
D         male    ≥60              0.208   0.072   0.352      0.669   0.411   0.909

For female patients under 60 years, the recovered proportion is greater than for any other class of patients. For male patients over 60 years, the recovered proportion is lower compared to other classes. The posterior probability distribution of the recovered patients is represented in Figure 7(a). It is clear that COVID-19 female patients under 60 years have less variability. Figure 7(b) displays the probability of recovered after the follow-up period. For example, the probability of patients disease-free after 20 days of the follow-up period for the patient A is 0.986, and for the patient D is 0.669.

Figure 7. (a) The posterior distribution of the recovered fraction of patients and (b) recovery probability for four patients.

6. Concluding remarks

More flexible distributions in terms of the shape of the density function and/or the hazard function can be interesting alternatives to the analysis of COVID-19 data. We proposed the Generalized Odd Log-logistic Maxwell (GOLLMax) mixture regression to model 139 lifetimes of symptomatic patients diagnosed with COVID-19 under right censoring. We used two estimation methods for the parameters of the proposed regression: maximum likelihood and Bayesian inference. Some Monte Carlo simulations investigated the precision of the maximum likelihood estimates (MLEs), and they indicated that the new regression is a good alternative for modeling censored COVID-19 lifetimes. The explanatory variables adopted to explain the lifetime are only the age group and sex of the patients, but other variables may be used to apply the proposed regression in other coronavirus data sets. We displayed the Kaplan–Meier and estimated survival functions from the new regression with their corresponding recovered proportions, thus indicating that it fits well to the current data. The main issue of this paper is to show empirically the importance of the new regression for analyzing COVID-19 data such as incubation times, lengths of staying in intensive care units, and survival and recovery times of hospitalized patients as functions of independent variables that can explain the variability of these times. The survival and recovery probabilities can be estimated accurately from Equations (4) and (7). Thus this issue can be considered as future research to study these problems in other countries.

Acknowledgements

We are very grateful to the referees and the associate editor for helpful comments that considerably improved the paper.

Funding Statement

We gratefully acknowledge financial support from CAPES and CNPq.

Notes

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1. Ahsanullah M. and Alzaatreh A., Some characterizations of the log-logistic distribution, Stoch. Qual. Control 33 (2018), pp. 23–29.
  • 2. Alkhouli M., Nanjundappa A., Annie F., Bates M.C., and Bhatt D.L., Sex differences in COVID-19 case fatality rate: Insights from a multinational registry, Mayo Clin. Proc. 29 (2020), pp. 1–22.
  • 3. Al-Rousan N. and Al-Najjar H., Data analysis of coronavirus COVID-19 epidemic in South Korea based on recovered and death cases, J. Med. Virol. 92 (2020), pp. 1603–1608.
  • 4. Buuren S.V. and Fredriks M., Worm plot: A simple diagnostic device for modelling growth reference curves, Stat. Med. 20 (2001), pp. 1259–1277.
  • 5. Cancho V.G., Rodrigues J., and de Castro M., A flexible model for survival data with a cure rate: A Bayesian approach, J. Appl. Stat. 38 (2011), pp. 57–70.
  • 6. Centers for Disease Control and Prevention, Coronavirus Disease 2019 (COVID-19) (2021). Accessed 2021, April 22. Available at https://www.cdc.gov/coronavirus/2019-ncov/symptoms-testing/symptoms.html.
  • 7. Cordeiro G.M., Alizadeh M., Ozel G., Hosseinl B., Ortega E.M.M., and Altun E., The generalized odd log-logistic family of distributions: Properties, regression models and applications, J. Stat. Comput. Simul. 87 (2017), pp. 908–932.
  • 8. Cordeiro G.M., Rodrigues G.M., Ortega E.M.M., de Santana L.H., and Vila R., An extended Rayleigh model: Properties, regression and COVID-19 application, preprint (2022). Available at https://arxiv.org/submit/4257234/view.
  • 9. COVID-19 Dashboard by the Center for Systems Science and Engineering (2021). Accessed 2021, April 22. Available at https://www.arcgis.com/apps/opsdashboard/index.html/bda7594740fd40299423467b48e9ecf6.
  • 10. Cowles M.K. and Carlin B.P., Markov chain Monte Carlo convergence diagnostics: A comparative review, J. Amer. Statist. Assoc. 91 (1996), pp. 883–904.
  • 11. Dong E., Du H., and Gardner L., An interactive web-based dashboard to track COVID-19 in real time, Lancet Infect. Dis. 20 (2020), pp. 533–534.
  • 12. Dunn P.K. and Smyth G.K., Randomized quantile residuals, J. Comput. Graph. Stat. 5 (1996), pp. 236–244.
  • 13. Gleaton J.U. and Lynch J.D., Properties of generalized log-logistic families of lifetime distributions, J. Probab. Statist. Sci. 4 (2006), pp. 51–64.
  • 14. Hewitt J., Carter B., Vilches-Moraga A., Quinn T.J., Braude P., Verduri A., Pearce L., Stechman M., Short R., Price A., Collins J.T., Bruce E., Einarsson A., Rickard F., Mitchell E., Holloway M., Hesford J., Barlow-Pay F., Clini E., Myint P.K., Moug S.J., and McCarthy K., COPE Study Collaborators, The effect of frailty on survival in patients with COVID-19 (COPE): A multicentre, European, observational cohort study, Lancet Public Health 5 (2020), pp. 444–451.
  • 15. Ibrahim J.G., Chen M.H., and Sinha D., Bayesian Survival Analysis, Springer, New York, 2001.
  • 16. Jin J.-M., Bai P., He W., Wu F., Liu X.-F., Han D.-M., Liu S., and Yang J.-K., Gender differences in patients with COVID-19: Focus on severity and mortality, Front. Public Health 8 (2020), pp. 1–6.
  • 17. Kaplan E.L. and Meier P., Nonparametric estimation from incomplete observations, J. Am. Stat. Assoc. 53 (1958), pp. 457–481.
  • 18. Lawless J.F., Statistical Models and Methods for Lifetime Data, John Wiley & Sons, New Jersey, 2003.
  • 19. Lee E.T. and Wang J.W., Statistical Methods for Survival Data Analysis, John Wiley & Sons, New Jersey, 2002.
  • 20. Li C.-S., Taylor J.M.G., and Sy J.P., Identifiability of cure models, Stat. Probab. Lett. 54 (2001), pp. 389–395.
  • 21. Maller R.A. and Zhou X., Survival Analysis with Long-term Survivors, John Wiley & Sons, New Jersey, 1996.
  • 22. Mehra M.R., Desai S.S., Kuy S., Henry T.D., and Patel A.N., Cardiovascular disease, drug therapy, and mortality in COVID-19, N. Engl. J. Med. 382 (2020), pp. 1–7. [Retracted]
  • 23. Ministério da Saúde – Brasil, Coronavírus: COVID-19 (2021). Accessed 2021, April 22. Available at https://coronavirus.saude.gov.br/sobre-a-doencasintomas.
  • 24. Morena V., Milazzo L., Oreni L., Bestetti G., Fossali T., Bassoli C., Torre A., Cossu M.V., Minari C., Ballone E., Perotti A., Mileto D., Niero F., Merli S., Foschi A., Vimercati S., Rizzardini G., Sollima S., Bradanini L., Galimberti L., Combo R., Micheli V., Negri C., Ridolfo A.L., Meroni L., Galli M., Antinori S., and Corbellino M., Off-label use of tocilizumab for the treatment of SARS-CoV-2 pneumonia in Milan, Italy, Eur. J. Intern. Med. 76 (2020), pp. 36–42.
  • 25. Ortega E.M.M., Cancho V.G., and Lachos V.H., A generalized log-gamma mixture model for cure rate: Estimation and sensitivity analysis, Sankhya 71 (2009), pp. 1–29.
  • 26. Ortega E.M.M., da Cruz J.N., and Cordeiro G.M., The log-odd logistic-Weibull regression model under informative censoring, Model Assist. Stat. Appl. 14 (2019), pp. 239–254.
  • 27. Prataviera F., Loibel S.M.C., Greco K.F., Ortega E.M.M., and Cordeiro G.M., Modelling non-proportional hazard for survival data with different systematic components, Environ. Ecol. Stat. 27 (2020), pp. 467–489.
  • 28. Prataviera F., Ortega E.M.M., and Cordeiro G.M., A new bimodal Maxwell regression model with engineering applications, Appl. Math. Inf. Sci. 14 (2020), pp. 817–831.
  • 29. Prataviera F., Silva A.M.M., Cardoso E.J.B.N., Cordeiro G.M., and Ortega E.M.M., A novel generalized odd log-logistic Maxwell-based regression with application to microbiology, Appl. Math. Model. 93 (2021), pp. 148–164.
  • 30. Stasinopoulos D.M., Rigby R.A., and Akantziliotou C., Instructions on How to Use the GAMLSS Package in R (2008). Available at http://www.gamlss.com/wp-content/uploads/2013/01/gamlss-manual.pdf.
  • 31. Stasinopoulos D.M., Rigby R.A., Heller G.Z., Voudouris V., and De Bastiani F., Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC, New York, 2017.
  • 32. Voinsky I., Baristaite G., and Gurwitz D., Effects of age and sex on recovery from COVID-19: Analysis of 5769 Israeli patients, J. Infect. 81 (2020), pp. 102–103.
  • 33. Xie J., Covassin N., Fan Z., Singh P., Gao W., Li G., Kara T., and Somers V.K., Association between hypoxemia and mortality in patients with COVID-19, Mayo Clin. Proc. 95 (2020), pp. 1138–1147.
  • 34. Yan Y., Yang Y., Wang F., Ren H., Zhang S., Shi X., Yu X., and Dong K., Clinical characteristics and outcomes of patients with severe Covid-19 with diabetes, BMJ Open Diabetes Res. Care 8 (2020), pp. 1–9.
  • 35. Yang A.-P., Liu J.-P., Tao W.-Q., and Li H.-M., The diagnostic and predictive role of NLR, d-NLR and PLR in COVID-19 patients, Int. Immunopharmacol. 84 (2020), pp. 1–7.

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis