MethodsX. 2025 Mar 28;14:103294. doi: 10.1016/j.mex.2025.103294

Estimation of parameters and hypothesis testing of multivariate spatial autoregressive model

Sutikno a, Purhadi a, Fachrunisah a, Fajar Dwi Cahyoko b
PMCID: PMC12001134  PMID: 40241707

Abstract

Spatial dependence plays a critical role in modeling multivariate response variables, particularly in fields such as epidemiology and environmental studies. However, existing spatial regression models, such as the Spatial Autoregressive (SAR) model, are designed for univariate responses and are insufficient when multiple response variables are influenced by spatial location. To address this gap, we introduce a Multivariate Spatial Autoregressive (MSAR) model. While previous research has focused primarily on parameter estimation for the proposed model, limited attention has been given to the statistical significance of these parameters. Moreover, existing estimation methods often rely on pseudo-distributions, which may not accurately reflect the underlying data characteristics. This study employs Maximum Likelihood Estimation (MLE), optimized using a concentrated log-likelihood approach, under the assumption of normally distributed data. To assess parameter significance, we apply both the Maximum Likelihood Ratio Test (MLRT) for joint hypotheses and the Wald Test for individual parameters. The findings confirm that the proposed model yields unbiased and consistent parameter estimates. Furthermore, the significance tests reveal key predictor variables associated with pneumonia and diarrhea cases among toddlers. The proposed model achieves a Root Mean Square Error of 5 and an R-squared value of 60 %, demonstrating its effectiveness in capturing spatial dependence in multivariate settings. The main contributions of this study include:

  • Development of an MSAR model estimated using MLE to capture spatial dependencies among multiple response variables.

  • Implementation of formal hypothesis testing procedures for model parameters using the Likelihood Ratio and Wald tests.

  • Application of the proposed model to spatial health data at the village level in Tuban District, East Java, Indonesia, focusing on health problems among children under five.

Keywords: Multivariate spatial autoregressive, Maximum likelihood estimation, Maximum likelihood ratio test, Wald test

Method name: Multivariate Spatial Autoregressive Model

Graphical abstract

Image, graphical abstract


Specifications table

Subject area: Mathematics and Statistics
More specific subject area: Spatial Statistics, Spatial dependency, Multivariate Spatial Linear Models
Name of your method: Multivariate Spatial Autoregressive Model
Name and reference of original method: Original method: Multivariate Linear Regression. References:
  • Christensen, R.: Linear Models for Multivariate, Time Series, and Spatial Data. New York: Springer Science+Business Media (1991).

  • Johnson, R.A., & Wichern, D.W.: Applied Multivariate Statistical Analysis. New Jersey: Pearson Education, Inc (2007).

Resource availability: None

Background

Spatial analysis has become a fundamental methodology in scientific research, particularly when the data exhibits a strong geographical component [1,2]. One widely used technique for addressing spatial dependence is the SAR model [[3], [4], [5]], which incorporates spatially lagged dependent variables to account for spatial autocorrelation. SAR models have been extensively studied with regard to parameter estimation and hypothesis testing. Among the available techniques, MLE is the most commonly used and yields consistent estimates [[6], [7], [8]]. However, a key challenge in estimating SAR model parameters is that the spatial effect parameters do not have closed-form solutions, so numerical iteration is required. In addition to estimation, hypothesis testing is typically conducted using the Likelihood Ratio Test (LRT) for joint significance and the Wald Test for individual parameters [[9], [10], [11]].

The SAR model has evolved into the MSAR model in econometric applications, with estimation methods including Quasi-Maximum Likelihood Estimation (QMLE) [12,13], Two-Stage Least Squares (2SLS) [14,15], and Three-Stage Least Squares (3SLS) [16,17]. However, when identification conditions are not met in complex spatial models, parameter estimates may become invalid, affecting the reliability of QMLE. Additionally, QMLE generally provides less efficient estimates than fully specified MLE when the model is correctly specified [18]. More recent work has incorporated MSAR within simultaneous equation models, using the FGLS-3SLS estimation approach and numerical approximation via the average concentrated log-likelihood [16]. While that approach allows spatial effects to be estimated, it is limited to univariate optimization. In this study, we extend it by applying multivariate optimization to the concentrated log-likelihood using the L-BFGS-B algorithm [[19], [20], [21]]. MSAR models have also been extended to network data. For example, Zhu and Huang (2020) compared QMLE and Least Squares Estimation (LSE) methods [22,23], but LSE remains less effective for parameter estimation in complex models. This highlights the need for an MLE-based methodology that provides accurate and consistent estimates. In addition, most existing studies have focused on parameter estimation without addressing hypothesis testing, which is essential for assessing the reliability of the model and for evaluating regression parameters that can vary geographically.

This study proposes an area-based MSAR model for multivariate responses, designed to capture the spatial interaction among two or more correlated response variables while accounting for spatial dependence across regions. Parameter estimation is carried out using MLE for the regression coefficients and the covariance matrix, with spatial parameters estimated via the concentrated log-likelihood function optimized using the L-BFGS-B algorithm [24]. In addition to estimation, hypothesis testing is performed using both the LRT and the Wald Test to evaluate spatially varying regression parameters.

Method details

The MSAR model is an extension of the SAR model that incorporates spatial dependencies into the analysis. In this model, the global structure is modeled using multivariate normal linear regression. Therefore, before discussing the MSAR model in detail, this section introduces the foundational concept of multivariate linear regression.

Multivariate normal linear regression

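A multivariate linear regression model describes the relationship among multiple response variables and their corresponding predictors. This model is used to determine the relationship between the response variables $Y_1, Y_2, \ldots, Y_p$ and the predictor variables $A_1, A_2, \ldots, A_q$. Given a sample of $n$ observations and $j = 1, 2, \ldots, p$, the multivariate normal linear regression model for the $i$-th observation, $i = 1, 2, \ldots, n$, is represented in Eq. (1).

$$
\begin{aligned}
Y_{1i} &= \beta_{01} + \beta_{11}A_{1i} + \ldots + \beta_{q1}A_{qi} + \varepsilon_{1i} \\
Y_{2i} &= \beta_{02} + \beta_{12}A_{1i} + \ldots + \beta_{q2}A_{qi} + \varepsilon_{2i} \\
&\;\;\vdots \\
Y_{ji} &= \beta_{0j} + \beta_{1j}A_{1i} + \ldots + \beta_{qj}A_{qi} + \varepsilon_{ji}
\end{aligned}
\tag{1}
$$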

The multivariate linear regression model can be expressed in matrix form as illustrated in Eq. (2).

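$$
Y_{(n \times p)} = A_{n \times (q+1)}\, B_{(q+1) \times p} + \Xi_{(n \times p)} \tag{2}
$$

with

$$
Y = \begin{bmatrix} y_1 & y_2 & \cdots & y_p \end{bmatrix},\quad
A = \begin{bmatrix}
1 & A_{11} & A_{21} & \cdots & A_{q1} \\
1 & A_{12} & A_{22} & \cdots & A_{q2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & A_{1n} & A_{2n} & \cdots & A_{qn}
\end{bmatrix},\quad
B = \begin{bmatrix} \beta_1 & \beta_2 & \cdots & \beta_p \end{bmatrix},\quad
\Xi = \begin{bmatrix} \varepsilon_1 & \varepsilon_2 & \cdots & \varepsilon_p \end{bmatrix}
$$

where $y_j = [Y_{j1}\; Y_{j2}\; \cdots\; Y_{jn}]^T$, $\beta_j = [\beta_{0j}\; \beta_{1j}\; \cdots\; \beta_{qj}]^T$, and $\varepsilon_j = [\varepsilon_{j1}\; \varepsilon_{j2}\; \cdots\; \varepsilon_{jn}]^T$ for $j = 1, 2, \ldots, p$. Furthermore, the multivariate linear regression model can be expressed in terms of the Vec operator and the Kronecker product, as shown in Eq. (3).

$$
\mathrm{Vec}(Y)_{pn \times 1} = (I_p \otimes A)_{pn \times p(q+1)}\, \mathrm{Vec}(B)_{p(q+1) \times 1} + \mathrm{Vec}(\Xi)_{pn \times 1} \tag{3}
$$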

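In Eq. (3), it is assumed that $\mathrm{Vec}(\Xi) \sim N(0, \Sigma_{p \times p} \otimes I_{n \times n})$, with $E(\mathrm{Vec}(\Xi)) = 0$ and $\mathrm{Cov}(\mathrm{Vec}(\Xi)) = \Sigma_{p \times p} \otimes I_{n \times n}$. Based on this assumption, $\mathrm{Vec}(Y) \sim N_{pn}\big((I_p \otimes A)\mathrm{Vec}(B),\, \Sigma \otimes I_n\big)$. The probability density function of $\mathrm{Vec}(Y)$ is therefore given by the following expression.

$$
f(\mathrm{Vec}(Y)) = (2\pi)^{-pn/2}\, |\Sigma|^{-n/2} \exp\left( -\frac{1}{2} \big(\mathrm{Vec}(Y) - (I_p \otimes A)\mathrm{Vec}(B)\big)^T (\Sigma \otimes I_n)^{-1} \big(\mathrm{Vec}(Y) - (I_p \otimes A)\mathrm{Vec}(B)\big) \right)
$$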

Further parameter estimation can be carried out using the MLE method, resulting in the following parameter estimators [25].

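$$
\mathrm{Vec}(\hat{B}) = \big(I \otimes (A^T A)^{-1} A^T\big)\, \mathrm{Vec}(Y)
$$

$$
\hat{\Sigma} = \frac{Y^T \big(I - A (A^T A)^{-1} A^T\big) Y}{n}
$$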
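As a minimal sketch of these two estimators, the NumPy code below fits the multivariate regression on simulated data. The function name `fit_multivariate_mle`, the simulated dimensions, and the true parameter values are illustrative assumptions and do not come from the study.

```python
import numpy as np


def fit_multivariate_mle(A, Y):
    """MLE for Y = A B + Xi, with rows of Xi i.i.d. N(0, Sigma).

    A: (n, q+1) design matrix that already contains a column of ones.
    Y: (n, p) response matrix.
    Returns B_hat with shape (q+1, p) and Sigma_hat with shape (p, p).
    """
    n = A.shape[0]
    # B_hat = (A^T A)^{-1} A^T Y, solved without forming the inverse explicitly
    B_hat = np.linalg.solve(A.T @ A, A.T @ Y)
    residuals = Y - A @ B_hat
    # Sigma_hat = Y^T (I - A (A^T A)^{-1} A^T) Y / n = residuals^T residuals / n
    Sigma_hat = residuals.T @ residuals / n
    return B_hat, Sigma_hat


# Simulated data (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
n, q, p = 200, 3, 2
A = np.column_stack([np.ones(n), rng.normal(size=(n, q))])
B_true = rng.normal(size=(q + 1, p))
Sigma_true = np.array([[1.0, 0.5], [0.5, 2.0]])
Y = A @ B_true + rng.multivariate_normal(np.zeros(p), Sigma_true, size=n)

B_hat, Sigma_hat = fit_multivariate_mle(A, Y)
```

Stacking the columns of `B_hat` reproduces the Vec form of the estimator, because $(I \otimes P)\mathrm{Vec}(Y) = \mathrm{Vec}(PY)$ for any conformable matrix $P$.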

Multivariate spatial autoregressive

The MSAR model is a further development of the SAR model. Consequently, the analogy of the MSAR model can be traced back to the univariate SAR model. The MSAR model is used to determine the relationship between the response variables and the predictor variables by considering the spatial effect of the lag of the response variables symbolized by ρ and the spatial weight symbolized by W. The MSAR model is mathematically illustrated by Eq. (4):

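$$
\begin{aligned}
Y_{1i} &= \rho_1 w_i Y_{1i} + a^T \beta_1 + \varepsilon_{1i} \\
Y_{2i} &= \rho_2 w_i Y_{2i} + a^T \beta_2 + \varepsilon_{2i} \\
&\;\;\vdots \\
Y_{pi} &= \rho_p w_i Y_{pi} + a^T \beta_p + \varepsilon_{pi}
\end{aligned}
\tag{4}
$$

Eq. (4) can be decomposed into the following equation:

$$
\begin{bmatrix} Y_{1i} \\ Y_{2i} \\ \vdots \\ Y_{pi} \end{bmatrix}
=
\begin{bmatrix}
\beta_{01} + \rho_1 \sum_{i^*=1,\, i \neq i^*}^{n} w_{ii^*}\, y_{1i^*} + a^T \beta_1 + \varepsilon_{1i} \\
\beta_{02} + \rho_2 \sum_{i^*=1,\, i \neq i^*}^{n} w_{ii^*}\, y_{2i^*} + a^T \beta_2 + \varepsilon_{2i} \\
\vdots \\
\beta_{0p} + \rho_p \sum_{i^*=1,\, i \neq i^*}^{n} w_{ii^*}\, y_{pi^*} + a^T \beta_p + \varepsilon_{pi}
\end{bmatrix}
$$

The MSAR model in matrix form can be written as Eq. (5), where $\rho$ is a diagonal matrix whose elements $\rho_1, \rho_2, \ldots, \rho_p$ represent the spatial effects on each of the response variables.

$$
Y_{n \times p} = W_{n \times n}\, Y_{n \times p}\, \rho_{p \times p} + A_{n \times (q+1)}\, B_{(q+1) \times p} + \Xi_{n \times p} \tag{5}
$$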

If the MSAR model is written in the form of a Vec operator and using a Kronecker product, the resulting equation is given by Eq. (6).

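$$
\begin{aligned}
\mathrm{Vec}(Y) &= (\rho^T \otimes W)\,\mathrm{Vec}(Y) + (I_p \otimes A)\,\mathrm{Vec}(B) + \mathrm{Vec}(\Xi) \\
\mathrm{Vec}(Y) &= \big(I - (\rho^T \otimes W)\big)^{-1} (I_p \otimes A)\,\mathrm{Vec}(B) + \big(I - (\rho^T \otimes W)\big)^{-1} \mathrm{Vec}(\Xi)
\end{aligned}
\tag{6}
$$

The MSAR model assumes that the error term follows a multivariate normal distribution with mean $E(\mathrm{Vec}(\Xi)) = 0$ and covariance matrix $\mathrm{Var}(\mathrm{Vec}(\Xi)) = \Sigma \otimes I_n$ [16]. The expectation and variance of $\mathrm{Vec}(Y)$ are shown in the following equations:

$$
E(\mathrm{Vec}(Y)) = (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes A)\,\mathrm{Vec}(B)
$$

$$
\mathrm{Var}(\mathrm{Vec}(Y)) = \big(I_{pn} - (\rho^T \otimes W)\big)^{-1} (\Sigma \otimes I_n) \big(I_{pn} - (\rho \otimes W^T)\big)^{-1}
$$

Once the expectation and variance of $\mathrm{Vec}(Y)$ are established, the distribution of $\mathrm{Vec}(Y)$ is given by:

$$
\mathrm{Vec}(Y) \sim N_{pn}\Big( (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes A)\,\mathrm{Vec}(B),\; (I_{pn} - \rho^T \otimes W)^{-1} (\Sigma \otimes I_n) (I_{pn} - \rho \otimes W^T)^{-1} \Big).
$$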
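The reduced form in Eq. (6) can be used to simulate data from an MSAR model, which is a convenient way to check an implementation. The sketch below is illustrative only: the function `simulate_msar`, the chain-shaped neighbor structure, and all parameter values are assumptions made for the example and are not taken from the study.

```python
import numpy as np


def simulate_msar(W, A, B, rho, Sigma, rng):
    """Draw Y from the reduced form Vec(Y) = S^{-1} ((I_p ⊗ A) Vec(B) + Vec(Xi)),
    where S = I_pn - rho^T ⊗ W and rho is passed as a length-p vector."""
    n, p = W.shape[0], B.shape[1]
    Xi = rng.multivariate_normal(np.zeros(p), Sigma, size=n)  # rows ~ N(0, Sigma)
    S = np.eye(n * p) - np.kron(np.diag(rho).T, W)
    # (I_p ⊗ A) Vec(B) = Vec(A B); Vec stacks columns, hence order="F"
    rhs = (A @ B + Xi).flatten(order="F")
    vec_y = np.linalg.solve(S, rhs)
    return vec_y.reshape((n, p), order="F"), Xi


# Hypothetical setting: 6 regions on a line, neighbors are adjacent regions
rng = np.random.default_rng(0)
n = 6
W = np.zeros((n, n))
for i in range(n):
    for k in (i - 1, i + 1):
        if 0 <= k < n:
            W[i, k] = 1.0
W /= W.sum(axis=1, keepdims=True)  # row-standardize (an assumption of this sketch)

A = np.column_stack([np.ones(n), rng.normal(size=n)])
B = np.array([[1.0, 2.0], [0.5, -0.5]])
rho = np.array([0.4, 0.3])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

Y, Xi = simulate_msar(W, A, B, rho, Sigma, rng)
```

The simulated `Y` satisfies the structural form in Eq. (5), which can be verified by comparing `Y` with `W @ Y @ np.diag(rho) + A @ B + Xi`.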

Parameter estimation of MSAR model

The MSAR parameters are estimated by MLE. The regression coefficients and the variance-covariance matrix are obtained from the likelihood equations, while the spatial effects are estimated by numerically maximizing the concentrated log-likelihood function with the L-BFGS-B optimization method. The first step is to determine the likelihood function of the model, which is presented in Eq. (7).

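$$
\begin{aligned}
L(\mathrm{Vec}(B), \Sigma, \rho) ={}& \prod_{i=1}^{n} f(Y_{1i}, Y_{2i}, \ldots, Y_{pi}) = f(\mathrm{Vec}(Y)) \\
={}& (2\pi)^{-pn/2}\, \left| (I_{pn} - \rho^T \otimes W)^{-1} (\Sigma \otimes I_n) (I_{pn} - \rho \otimes W^T)^{-1} \right|^{-1/2} \\
& \times \exp\Big( -\frac{1}{2} \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes A)\mathrm{Vec}(B)\big)^T \\
& \qquad \times \big( (I_{pn} - \rho^T \otimes W)^{-1} (\Sigma \otimes I_n) (I_{pn} - \rho \otimes W^T)^{-1} \big)^{-1} \\
& \qquad \times \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes A)\mathrm{Vec}(B)\big) \Big)
\end{aligned}
\tag{7}
$$

Furthermore, the likelihood function is expressed as a natural-logarithm likelihood, as shown in Eq. (8).

$$
\begin{aligned}
\ln L(\mathrm{Vec}(B), \Sigma, \rho) ={}& \ln\Big( \prod_{i=1}^{n} f(Y_{1i}, Y_{2i}, \ldots, Y_{pi}) \Big) = \ln f(\mathrm{Vec}(Y)) \\
={}& -\frac{pn}{2} \ln(2\pi) - \frac{1}{2} \ln\Big( \left| (I_{pn} - \rho^T \otimes W)^{-1} (\Sigma \otimes I_n) (I_{pn} - \rho \otimes W^T)^{-1} \right| \Big) \\
& - \frac{1}{2} \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes A)\mathrm{Vec}(B)\big)^T \\
& \quad \times \big( (I_{pn} - \rho \otimes W^T) (\Sigma \otimes I_n)^{-1} (I_{pn} - \rho^T \otimes W) \big) \\
& \quad \times \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes A)\mathrm{Vec}(B)\big)
\end{aligned}
\tag{8}
$$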

The subsequent stage in parameter estimation is to differentiate Eq. (8) with respect to the Vec(B) parameter and set it equal to zero, thereby obtaining an estimator for the Vec(B) parameter.

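$$
\begin{aligned}
\frac{\partial \ln L(\mathrm{Vec}(B), \Sigma, \rho)}{\partial\, \mathrm{Vec}(B)} ={}& -\frac{1}{2} \Big[ -2 \Big( (I_p \otimes A^T) (I_{pn} - \rho \otimes W^T)^{-1} (I_{pn} - \rho \otimes W^T) (\Sigma^{-1} \otimes I_n) \\
& \qquad \times (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y) \Big) \\
& + 2\, (I_p \otimes A^T) (I_{pn} - \rho \otimes W^T)^{-1} (I_{pn} - \rho \otimes W^T) (\Sigma^{-1} \otimes I_n) \\
& \qquad \times (I_{pn} - \rho^T \otimes W) \big( (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes A)\,\mathrm{Vec}(B) \big) \Big] = 0
\end{aligned}
$$

Setting this first derivative to zero and simplifying yields the estimator of $\mathrm{Vec}(B)$, denoted $\mathrm{Vec}(\hat{B})_{\mathrm{initial}}$, which is shown in Eq. (9).

$$
\mathrm{Vec}(\hat{B})_{\mathrm{initial}} = \big(I_p \otimes (A^T A)^{-1} A^T\big) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y) \tag{9}
$$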

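Once the $\mathrm{Vec}(B)$ estimator has been obtained, the estimator of $\Sigma$ can be calculated. The steps involved in obtaining the $\Sigma$ estimator are identical to those used for the $\mathrm{Vec}(B)$ estimator. The initial step is to substitute the estimate $\mathrm{Vec}(\hat{B})_{\mathrm{initial}}$ into the log-likelihood function in Eq. (8), which gives Eq. (10).

$$
\begin{aligned}
\ln L(\mathrm{Vec}(\hat{B})_{\mathrm{initial}}, \Sigma, \rho) ={}& -\frac{pn}{2} \ln(2\pi) - \frac{1}{2} \ln\Big( \left| (I_{pn} - \rho^T \otimes W)^{-1} (\Sigma \otimes I_n) (I_{pn} - \rho \otimes W^T)^{-1} \right| \Big) \\
& - \frac{1}{2} \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes A (A^T A)^{-1} A^T) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y)\big)^T \\
& \quad \times \big( (I_{pn} - \rho \otimes W^T) (\Sigma \otimes I_n)^{-1} (I_{pn} - \rho^T \otimes W) \big) \\
& \quad \times \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes A (A^T A)^{-1} A^T) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y)\big).
\end{aligned}
\tag{10}
$$

Letting $M = A (A^T A)^{-1} A^T$, Eq. (10) can be rewritten as Eq. (11).

$$
\begin{aligned}
\ln L(\mathrm{Vec}(\hat{B})_{\mathrm{initial}}, \Sigma, \rho) ={}& -\frac{pn}{2} \ln(2\pi) - \frac{1}{2} \ln\Big( \left| (I_{pn} - \rho^T \otimes W)^{-1} (\Sigma \otimes I_n) (I_{pn} - \rho \otimes W^T)^{-1} \right| \Big) \\
& - \frac{1}{2} \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes M) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y)\big)^T \\
& \quad \times \big( (I_{pn} - \rho \otimes W^T) (\Sigma \otimes I_n)^{-1} (I_{pn} - \rho^T \otimes W) \big) \\
& \quad \times \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes M) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y)\big).
\end{aligned}
\tag{11}
$$

In Eq. (11), the quadratic form in the final term is a scalar, which can be regarded as a $(1 \times 1)$ matrix [24]. Consequently, its trace is the scalar itself. Hence, Eq. (11) can be rewritten as Eq. (12) and simplified using the cyclic property of the trace.

$$
\begin{aligned}
\ln L(\Sigma, \rho) ={}& -\frac{pn}{2} \ln(2\pi) - \frac{1}{2} \ln\Big( \left| (I_{pn} - \rho^T \otimes W)^{-1} (\Sigma \otimes I_n) (I_{pn} - \rho \otimes W^T)^{-1} \right| \Big) \\
& - \frac{1}{2} \mathrm{tr}\Big[ \big( (I_{pn} - \rho \otimes W^T) (\Sigma \otimes I_n)^{-1} (I_{pn} - \rho^T \otimes W) \big) \\
& \quad \times \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes M) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y)\big) \\
& \quad \times \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes M) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y)\big)^T \Big]
\end{aligned}
\tag{12}
$$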

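Subsequently, the log-likelihood function in Eq. (12) is differentiated with respect to $\sigma_{jj^*}$, as shown in the following equation, where $T_{jj^*}$ is a $(p \times p)$ symmetric matrix with element 1 in positions $(j, j^*)$ and $(j^*, j)$ and element 0 in all other positions.

$$
\begin{aligned}
\frac{\partial \ln L(\Sigma, \rho)}{\partial \sigma_{jj^*}} ={}& -\frac{1}{2} \mathrm{tr}\big[ (\Sigma^{-1} T_{jj^*} \otimes I_n) \big] \\
& + \frac{1}{2} \mathrm{tr}\Big[ (I_{pn} - \rho \otimes W^T) (\Sigma^{-1} T_{jj^*} \Sigma^{-1} \otimes I_n) (I_{pn} - \rho^T \otimes W) \\
& \quad \times \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes M) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y)\big) \\
& \quad \times \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes M) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y)\big)^T \Big] = 0
\end{aligned}
$$

Setting the partial derivative to zero and equating the two trace terms gives Eq. (13), from which the estimator of $\Sigma$ is obtained.

$$
\begin{aligned}
\frac{1}{2} \mathrm{tr}\big[ (\Sigma^{-1} T_{jj^*} \otimes I_n) \big] ={}& \frac{1}{2} \mathrm{tr}\Big[ (\Sigma^{-1} T_{jj^*} \Sigma^{-1} \otimes I_n) (I_{pn} - \rho^T \otimes W) \\
& \quad \times \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes M) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y)\big) \\
& \quad \times \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes M) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y)\big)^T (I_{pn} - \rho \otimes W^T) \Big]
\end{aligned}
\tag{13}
$$

Let $\Phi_{\mathrm{initial}}$ be defined as follows.

$$
\begin{aligned}
\Phi_{\mathrm{initial}} ={}& (I_{pn} - \rho^T \otimes W) \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes M) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y)\big) \\
& \times \big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes M) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y)\big)^T (I_{pn} - \rho \otimes W^T)
\end{aligned}
$$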

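Given that $\Phi_{\mathrm{initial}}$ is a $(pn \times pn)$ matrix, the subsequent step is to express it as the Kronecker product of a $(p \times p)$ matrix, denoted $\hat{\Sigma}_{\mathrm{initial}}$, and the $(n \times n)$ identity matrix.

$$
\hat{\Phi}_{\mathrm{initial}\,(pn \times pn)} = \hat{\Sigma}_{\mathrm{initial}\,(p \times p)} \otimes I_{n\,(n \times n)} \tag{14}
$$

Substituting Eq. (14) into Eq. (13) gives the following equation.

$$
\begin{aligned}
\frac{1}{2} \mathrm{tr}\big[ (\hat{\Sigma}^{-1} T_{jj^*} \otimes I_n) \big] &= \frac{1}{2} \mathrm{tr}\big[ (\hat{\Sigma}^{-1} T_{jj^*} \hat{\Sigma}^{-1} \otimes I_n) (\hat{\Sigma}_{\mathrm{initial}} \otimes I_n) \big] \\
\mathrm{tr}\big[ (\hat{\Sigma}^{-1} T_{jj^*} \otimes I_n) \big] &= \mathrm{tr}\big[ (\hat{\Sigma}^{-1} T_{jj^*} \otimes I_n) \big]
\end{aligned}
$$

Based on this result, the estimator of $\Sigma$ can be approximated by Eq. (14). The $\hat{\Sigma}$ matrix can be formed from the $(n \times n)$ block elements of $\hat{\Phi}_{\mathrm{initial}}$. Suppose $\hat{\Phi}_{\mathrm{initial}}$ has the following block structure, where each $\hat{\Phi}_{jj^*}$ is an $(n \times n)$ matrix.

$$
\hat{\Phi}_{\mathrm{initial}} =
\begin{bmatrix}
\hat{\Phi}_{\mathrm{initial},11} & \hat{\Phi}_{\mathrm{initial},12} & \cdots & \hat{\Phi}_{\mathrm{initial},1p} \\
\hat{\Phi}_{\mathrm{initial},21} & \hat{\Phi}_{\mathrm{initial},22} & \cdots & \hat{\Phi}_{\mathrm{initial},2p} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\Phi}_{\mathrm{initial},p1} & \hat{\Phi}_{\mathrm{initial},p2} & \cdots & \hat{\Phi}_{\mathrm{initial},pp}
\end{bmatrix}
$$

Each element of $\hat{\Sigma}$ is obtained from the main diagonal elements of the corresponding block $\hat{\Phi}_{jj^*}$, namely its trace divided by $n$. Thus, the parameter estimator is given by Eq. (15).

$$
\hat{\Sigma}_{\mathrm{initial}} =
\begin{bmatrix}
\mathrm{tr}(\hat{\Phi}_{\mathrm{initial},11})/n & \mathrm{tr}(\hat{\Phi}_{\mathrm{initial},12})/n & \cdots & \mathrm{tr}(\hat{\Phi}_{\mathrm{initial},1p})/n \\
\mathrm{tr}(\hat{\Phi}_{\mathrm{initial},21})/n & \mathrm{tr}(\hat{\Phi}_{\mathrm{initial},22})/n & \cdots & \mathrm{tr}(\hat{\Phi}_{\mathrm{initial},2p})/n \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{tr}(\hat{\Phi}_{\mathrm{initial},p1})/n & \mathrm{tr}(\hat{\Phi}_{\mathrm{initial},p2})/n & \cdots & \mathrm{tr}(\hat{\Phi}_{\mathrm{initial},pp})/n
\end{bmatrix}
\tag{15}
$$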

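Eqs. (9) and (15) show that the estimators of $\mathrm{Vec}(B)$ and $\Sigma$ depend on $\rho$, and the likelihood equation for $\rho$ has no closed-form solution, so a numerical approach is required. The numerical approach used to estimate $\rho$ is the concentrated log-likelihood with the L-BFGS-B optimization method. The concentrated log-likelihood function for $\rho$ is obtained by substituting the estimates $\mathrm{Vec}(\hat{B})_{\mathrm{initial}}$ and $\hat{\Sigma}_{\mathrm{initial}}$ into the log-likelihood, as shown in Eq. (16).

$$
\begin{aligned}
\ln L_{\mathrm{con}}(\rho) ={}& \ln L(\mathrm{Vec}(\hat{B})_{\mathrm{MReg}}, \hat{B}_{WY}, \hat{\Sigma}_{\mathrm{con}}, \rho) \\
={}& -\frac{pn}{2} \ln(2\pi) - \frac{1}{2} \ln\Big( \left| (I_{pn} - \rho^T \otimes W)^{-1} (\hat{\Sigma}_{\mathrm{con}} \otimes I_n) (I_{pn} - \rho \otimes W^T)^{-1} \right| \Big) \\
& - \frac{1}{2} \Big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes A) \big(\mathrm{Vec}(\hat{B})_{\mathrm{MReg}} - (I_p \otimes \hat{B}_{WY})\,\mathrm{Vec}(\rho)\big)\Big)^T \\
& \quad \times \big( (I_{pn} - \rho \otimes W^T) (\hat{\Sigma}_{\mathrm{con}} \otimes I_n)^{-1} (I_{pn} - \rho^T \otimes W) \big) \\
& \quad \times \Big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes A) \big(\mathrm{Vec}(\hat{B})_{\mathrm{MReg}} - (I_p \otimes \hat{B}_{WY})\,\mathrm{Vec}(\rho)\big)\Big)
\end{aligned}
\tag{16}
$$

where

$$
\hat{\Sigma}_{\mathrm{con}} =
\begin{bmatrix}
\mathrm{tr}(\hat{\Phi}_{\mathrm{con},11})/n & \mathrm{tr}(\hat{\Phi}_{\mathrm{con},12})/n & \cdots & \mathrm{tr}(\hat{\Phi}_{\mathrm{con},1p})/n \\
\mathrm{tr}(\hat{\Phi}_{\mathrm{con},21})/n & \mathrm{tr}(\hat{\Phi}_{\mathrm{con},22})/n & \cdots & \mathrm{tr}(\hat{\Phi}_{\mathrm{con},2p})/n \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{tr}(\hat{\Phi}_{\mathrm{con},p1})/n & \mathrm{tr}(\hat{\Phi}_{\mathrm{con},p2})/n & \cdots & \mathrm{tr}(\hat{\Phi}_{\mathrm{con},pp})/n
\end{bmatrix}
$$

with

$$
\begin{aligned}
\hat{\Phi}_{\mathrm{con}} ={}& (I_{pn} - \rho^T \otimes W) \Big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes A) \big(\mathrm{Vec}(\hat{B})_{\mathrm{MReg}} - (I_p \otimes \hat{B}_{WY})\,\mathrm{Vec}(\rho)\big)\Big) \\
& \times \Big(\mathrm{Vec}(Y) - (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes A) \big(\mathrm{Vec}(\hat{B})_{\mathrm{MReg}} - (I_p \otimes \hat{B}_{WY})\,\mathrm{Vec}(\rho)\big)\Big)^T (I_{pn} - \rho \otimes W^T).
\end{aligned}
$$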

Eq. (16) represents the concentrated log-likelihood function. This function cannot be maximized analytically, so a numerical approach with the L-BFGS-B optimization method is needed. The following steps outline the numerical procedure for maximizing the concentrated log-likelihood, thereby obtaining the value of $\hat{\rho}$:

  • a. Generate a sequence of candidate values for each $\rho_j$, $j = 1, 2, \ldots, p$, where $\rho_j = \mathrm{seq}(\text{start value}, \text{end value}, \text{increment})$, and substitute each candidate into the matrix $\rho = \mathrm{diag}(\rho_{1k}, \rho_{2k}, \ldots, \rho_{pk})$, $k = 1, 2, \ldots, n(\mathrm{seq}(\rho_j))$, where $n(\mathrm{seq}(\rho_j))$ is the number of candidate values.

  • b. Perform a multivariate regression of $\mathrm{Vec}(Y)$ on $(I_p \otimes A)$ to obtain $\mathrm{Vec}(\hat{B})_{\mathrm{MReg}}$.

  • c. Regress $WY$ on $A$ to obtain $\hat{B}_{WY}$, which is a $(q+1) \times p$ matrix.

  • d. Substitute $\mathrm{Vec}(\hat{B})_{\mathrm{MReg}}$ and $\hat{B}_{WY}$ into the concentrated log-likelihood function.

  • e. Identify the value of $\rho$ that gives the maximum $\ln L_{\mathrm{con}}$; this value is $\hat{\rho}$.
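The steps above can be sketched in Python with SciPy's L-BFGS-B optimizer. This is an illustration under stated assumptions rather than the implementation used in the study: the function names, the ring-shaped neighbor structure, the box bounds of ±0.99 on each $\rho_j$, and all simulated values are hypothetical. The sketch also uses two simplifications that follow from Eqs. (15) and (16) when $\rho$ is diagonal: the quadratic form equals $np$ once $\hat{\Sigma}_{\mathrm{con}}$ is substituted, and $\ln|I_{pn} - \rho^T \otimes W| = \sum_j \ln|I_n - \rho_j W|$.

```python
import numpy as np
from scipy.optimize import minimize


def concentrate(rho, Y, A, W):
    """For a fixed rho (length-p vector), return B_hat(rho) and Sigma_hat(rho)."""
    n = Y.shape[0]
    Y_tilde = Y - (W @ Y) * rho  # column j is y_j - rho_j * W y_j
    # Equals B_MReg - B_WY diag(rho), i.e. steps b to d combined
    B_hat = np.linalg.lstsq(A, Y_tilde, rcond=None)[0]
    E = Y_tilde - A @ B_hat
    return B_hat, E.T @ E / n  # Sigma_hat[j, j*] = tr(Phi_jj*) / n


def neg_concentrated_loglik(rho, Y, A, W):
    """Negative of Eq. (16), simplified for diagonal rho."""
    n, p = Y.shape
    _, Sigma_hat = concentrate(rho, Y, A, W)
    logdet_S = sum(np.linalg.slogdet(np.eye(n) - r * W)[1] for r in rho)
    loglik = (-0.5 * n * p * (np.log(2.0 * np.pi) + 1.0)
              - 0.5 * n * np.linalg.slogdet(Sigma_hat)[1]
              + logdet_S)
    return -loglik


def fit_msar(Y, A, W, bound=0.99):
    """Maximize the concentrated log-likelihood over rho with L-BFGS-B."""
    p = Y.shape[1]
    result = minimize(neg_concentrated_loglik, x0=np.zeros(p), args=(Y, A, W),
                      method="L-BFGS-B", bounds=[(-bound, bound)] * p)
    B_hat, Sigma_hat = concentrate(result.x, Y, A, W)
    return result.x, B_hat, Sigma_hat


# Simulated example (hypothetical values): ring lattice, two neighbors per region
rng = np.random.default_rng(1)
n, q, p = 400, 2, 2
idx = np.arange(n)
W = np.zeros((n, n))
W[idx, (idx - 1) % n] = 0.5
W[idx, (idx + 1) % n] = 0.5
A = np.column_stack([np.ones(n), rng.normal(size=(n, q))])
B_true = np.array([[1.0, 2.0], [0.5, -1.0], [-0.7, 0.3]])
rho_true = np.array([0.4, 0.3])
Xi = rng.multivariate_normal(np.zeros(p), [[1.0, 0.5], [0.5, 2.0]], size=n)
rhs = A @ B_true + Xi
Y = np.column_stack([np.linalg.solve(np.eye(n) - rho_true[j] * W, rhs[:, j])
                     for j in range(p)])

rho_hat, B_hat, Sigma_hat = fit_msar(Y, A, W)
```

Replacing the call to `minimize` with an evaluation of `neg_concentrated_loglik` over a grid of candidate values corresponds more literally to steps a and e; the gradient-based search is used here because it scales better as $p$ grows.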

Properties of estimator

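The coefficient parameters in the MSAR model are estimated using Eq. (9). $\mathrm{Vec}(\hat{B})$ is shown to be both unbiased and consistent. An estimator is considered unbiased if its expected value equals the true parameter, and consistent if it converges to the true parameter as the sample size increases. The proof, which treats $\rho$ as known, is presented as follows.

$$
\begin{aligned}
E(\mathrm{Vec}(\hat{B})) &= E\big( (I_p \otimes (A^T A)^{-1} A^T) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y) \big) \\
&= (I_p \otimes (A^T A)^{-1} A^T) (I_{pn} - \rho^T \otimes W)\, E(\mathrm{Vec}(Y)) \\
&= (I_p \otimes (A^T A)^{-1} A^T) (I_{pn} - \rho^T \otimes W) (I_{pn} - \rho^T \otimes W)^{-1} (I_p \otimes A)\, \mathrm{Vec}(B) \\
&= (I_p \otimes (A^T A)^{-1} A^T) (I_p \otimes A)\, \mathrm{Vec}(B) \\
&= (I_p \otimes (A^T A)^{-1} A^T A)\, \mathrm{Vec}(B) \\
&= \mathrm{Vec}(B)
\end{aligned}
$$

Since the expectation of $\mathrm{Vec}(\hat{B})$ equals $\mathrm{Vec}(B)$, it follows that $\mathrm{Vec}(\hat{B})$ is an unbiased estimator of $\mathrm{Vec}(B)$.

Next, the consistency of $\mathrm{Vec}(\hat{B})$ is shown below.

$$
\begin{aligned}
\mathrm{Var}(\mathrm{Vec}(\hat{B})) &= \mathrm{Var}\big( (I_p \otimes (A^T A)^{-1} A^T) (I_{pn} - \rho^T \otimes W)\, \mathrm{Vec}(Y) \big) \\
&= \begin{bmatrix}
\mathrm{tr}(\hat{\Phi}_{11})/n & \mathrm{tr}(\hat{\Phi}_{12})/n & \cdots & \mathrm{tr}(\hat{\Phi}_{1p})/n \\
\mathrm{tr}(\hat{\Phi}_{21})/n & \mathrm{tr}(\hat{\Phi}_{22})/n & \cdots & \mathrm{tr}(\hat{\Phi}_{2p})/n \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{tr}(\hat{\Phi}_{p1})/n & \mathrm{tr}(\hat{\Phi}_{p2})/n & \cdots & \mathrm{tr}(\hat{\Phi}_{pp})/n
\end{bmatrix} \otimes (A^T A)^{-1}
\end{aligned}
$$

$$
\begin{aligned}
\lim_{n \to \infty} \mathrm{Var}(\mathrm{Vec}(\hat{B})) &= \lim_{n \to \infty}
\begin{bmatrix}
\mathrm{tr}(\hat{\Phi}_{11})/n & \cdots & \mathrm{tr}(\hat{\Phi}_{1p})/n \\
\vdots & \ddots & \vdots \\
\mathrm{tr}(\hat{\Phi}_{p1})/n & \cdots & \mathrm{tr}(\hat{\Phi}_{pp})/n
\end{bmatrix} \otimes \lim_{n \to \infty} (A^T A)^{-1} \\
&= 0 \otimes (A^T A)^{-1} = 0
\end{aligned}
$$

It can be concluded that $\mathrm{Vec}(\hat{B})$ is an unbiased and consistent estimator.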

Hypothesis testing of MSAR model

Hypothesis testing of the MSAR model parameters is conducted both simultaneously and partially. The MLRT is applied for simultaneous testing, while the Wald test is used for partial parameter testing [26,27]. The hypothesis for simultaneous testing of the model parameters is formulated as follows:

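$$
\begin{aligned}
H_0 &: \beta_{1j} = \beta_{2j} = \cdots = \beta_{qj} = 0,\quad j = 1, 2, \ldots, p \\
H_1 &: \text{at least one of } \beta_{kj} \neq 0,\quad k = 1, 2, \ldots, q,\; j = 1, 2, \ldots, p
\end{aligned}
$$

The set of parameters under the population, denoted by $\Omega_{\mathrm{MSAR}}$, is given by $\Omega_{\mathrm{MSAR}} = \{\mathrm{Vec}(B), \mathrm{Vec}(\Sigma), \rho\}$, while the set of parameters under $H_0$, denoted by $\omega_{\mathrm{MSAR}}$, is given by $\omega_{\mathrm{MSAR}} = \{\beta_{0\omega}, \mathrm{Vec}(\Sigma_{\omega}), \rho_{0\omega}\}$. The parameter estimators for the two sets, $\hat{\Omega}_{\mathrm{MSAR}}$ and $\hat{\omega}_{\mathrm{MSAR}}$, are obtained using the MLE method described in the previous section. The likelihood ratio is calculated using the formula presented in Eq. (17).

$$
LR = \frac{L(\hat{\omega}_{\mathrm{MSAR}})}{L(\hat{\Omega}_{\mathrm{MSAR}})} < LR_0 \tag{17}
$$

where $L(\hat{\omega}_{\mathrm{MSAR}})$ is the likelihood value of the MSAR model using the estimated parameters under $H_0$ and $L(\hat{\Omega}_{\mathrm{MSAR}})$ is the likelihood value of the MSAR model using the estimated parameters under the population. Consequently, the test statistic for testing the parameters simultaneously using the MLRT is presented in Eq. (18).

$$
G^2_{\mathrm{MSAR}} = -2\big( \ln L(\hat{\omega}_{\mathrm{MSAR}}) - \ln L(\hat{\Omega}_{\mathrm{MSAR}}) \big) \tag{18}
$$

The critical region for hypothesis testing is as follows:

$$
\begin{aligned}
\alpha &= P(LR < LR_0),\quad 0 < LR_0 \le 1 \\
&= P(LR^2 < LR_0^2) \\
&= P(\ln LR^2 < \ln LR_0^2) \\
&= P\big(G^2_{\mathrm{MSAR}} > \chi^2_{(\alpha, df)}\big)
\end{aligned}
$$

$G^2_{\mathrm{MSAR}}$ follows a chi-square distribution as $n \to \infty$, so the rejection region for $H_0$ is $G^2_{\mathrm{MSAR}} > \chi^2_{(\alpha, df)}$ or $p\text{-value} < \alpha$, with degrees of freedom (df) equal to the number of parameters under the population minus the number of parameters under $H_0$.

$$
df = n(\hat{\Omega}_{\mathrm{MSAR}}) - n(\hat{\omega}_{\mathrm{MSAR}}) = \big(p(q+1) + 2p + p\big) - (p + 2p + p) = pq
$$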
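A short sketch of the decision rule for the simultaneous test is given below. The helper `likelihood_ratio_test` and the two log-likelihood values are hypothetical placeholders; only the formula for $G^2$, the degrees of freedom $pq$, and the chi-square reference distribution follow the text.

```python
from scipy.stats import chi2


def likelihood_ratio_test(loglik_null, loglik_full, df, alpha=0.05):
    """Return G^2, the chi-square critical value, the p-value and the decision."""
    g2 = -2.0 * (loglik_null - loglik_full)
    critical = chi2.ppf(1.0 - alpha, df)  # upper-alpha quantile of chi-square(df)
    p_value = chi2.sf(g2, df)             # P(chi-square(df) > G^2)
    return g2, critical, p_value, g2 > critical


# p responses and q predictors as in the application (p = 2, q = 5)
p, q = 2, 5
df = (p * (q + 1) + 2 * p + p) - (p + 2 * p + p)  # simplifies to p * q

# Hypothetical maximized log-likelihoods under H0 and under the full model
g2, critical, p_value, reject_h0 = likelihood_ratio_test(-350.0, -330.0, df)
```

With $p = 2$ and $q = 5$ the degrees of freedom equal 10, and the 5 % critical value is approximately 18.307.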

Once the null hypothesis (H₀) is rejected in the simultaneous test, partial hypothesis testing is conducted to identify which predictor variables exert a statistically significant influence on the response variable. The first partial test focuses on the spatial autoregressive parameter ρ, formulated under the following hypothesis framework:

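$$
H_0: \rho_j = 0 \qquad H_1: \rho_j \neq 0;\quad j = 1, 2, \ldots, p
$$

The test statistic used for testing the above hypothesis with the Wald test is shown in Eq. (19).

$$
\mathrm{Wald}_{\rho_j} = \left( \frac{\hat{\rho}_j}{\widehat{se}(\hat{\rho}_j)} \right)^2 \sim \chi^2_1 \tag{19}
$$

where $\widehat{se}(\hat{\rho}_j)$ is the square root of $\widehat{\mathrm{Var}}(\hat{\rho}_j)$. The value $\widehat{\mathrm{Var}}(\hat{\rho}_j)$ is the main diagonal element of the inverse Hessian matrix, $(H(\hat{\rho}))^{-1}$, that corresponds to $\hat{\rho}_j$. The Wald test statistic in Eq. (19) is deemed statistically significant if $\mathrm{Wald}_{\rho_j} > \chi^2_{\alpha,1}$, in which case the null hypothesis ($H_0$) is rejected.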

Moreover, the partial testing of βkj parameters is conducted with the objective of identifying the parameters that exert a significant influence on the model. The following hypothesis is employed to test the partial βkj parameters:

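$$
H_0: \beta_{kj} = 0 \qquad H_1: \beta_{kj} \neq 0,\quad k = 1, 2, \ldots, q,\; j = 1, 2, \ldots, p
$$

The test statistic used for testing the partial $\beta_{kj}$ parameters with the Wald test is shown in Eq. (20).

$$
\mathrm{Wald}_{\beta_{kj}} = \left( \frac{\hat{\beta}_{kj}}{\widehat{se}(\hat{\beta}_{kj})} \right)^2 \sim \chi^2_1 \tag{20}
$$

In this context, $\widehat{se}(\hat{\beta}_{kj})$ is the standard error of $\hat{\beta}_{kj}$, obtained as the square root of $\widehat{\mathrm{Var}}(\hat{\beta}_{kj})$, which is the corresponding main diagonal element of the variance-covariance matrix $\widehat{\mathrm{Var}}(\mathrm{Vec}(\hat{B}))$. The null hypothesis ($H_0$) is rejected if $\mathrm{Wald}_{\beta_{kj}} > \chi^2_{\alpha,1}$.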
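The Wald statistics in Eqs. (19) and (20) share the same form, so a single helper covers both. In the sketch below, `wald_test` is an illustrative name, and the inputs are the rounded estimate and standard error reported for β11 in Table 3 (0.59 and 0.23), so the result only approximates the tabulated statistic of 6.59.

```python
from scipy.stats import chi2


def wald_test(estimate, std_error, alpha=0.05):
    """Squared z-ratio compared against a chi-square distribution with 1 df."""
    statistic = (estimate / std_error) ** 2
    p_value = chi2.sf(statistic, df=1)
    reject_h0 = statistic > chi2.ppf(1.0 - alpha, df=1)
    return statistic, p_value, reject_h0


# Rounded values for beta_11 as reported in Table 3
statistic, p_value, reject_h0 = wald_test(0.59, 0.23)
```

The statistic computed from the rounded inputs is about 6.58, which leads to the same decision (rejection at the 5 % level) as the value reported in Table 3.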

Measures of model fits

To select the most suitable regression model, two commonly used evaluation metrics are the Root Mean Square Error (RMSE) and the coefficient of determination (R²). RMSE represents the average prediction error of the model, expressed in the same unit as the response variable. Models with lower RMSE values are preferred, as they indicate predictions that are closer to the observed values, reflecting a better model fit. Meanwhile, R² measures the proportion of variance in the response variable that can be explained by the predictor variables [[28], [29], [30]]. A higher R² value signifies greater explanatory power and stronger predictive performance of the model. The formulas for RMSE and R² are provided in Eqs. (21) and (22) [[31], [32], [33]].

$$
RMSE = RMSE_1 + RMSE_2 = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_{1i} - \hat{Y}_{1i})^2} + \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_{2i} - \hat{Y}_{2i})^2} \tag{21}
$$

$$
R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n} (Y_{1i} - \hat{Y}_{1i})^2 + \sum_{i=1}^{n} (Y_{2i} - \hat{Y}_{2i})^2}{\sum_{i=1}^{n} (Y_{1i} - \bar{Y}_1)^2 + \sum_{i=1}^{n} (Y_{2i} - \bar{Y}_2)^2} \tag{22}
$$
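The two fit measures can be computed as in the following sketch, which takes each response-specific RMSE as the square root of the mean squared residual and pools the sums of squares across responses for R². The function names and the toy data are illustrative assumptions.

```python
import numpy as np


def rmse_total(Y, Y_hat):
    """Sum over responses of the per-response root mean square error (Eq. 21)."""
    per_response = np.sqrt(np.mean((Y - Y_hat) ** 2, axis=0))
    return float(per_response.sum())


def r_squared_pooled(Y, Y_hat):
    """1 - SSE / SST with both sums pooled over all responses (Eq. 22)."""
    sse = np.sum((Y - Y_hat) ** 2)
    sst = np.sum((Y - Y.mean(axis=0)) ** 2)
    return float(1.0 - sse / sst)


# Toy data: 3 observations, 2 responses (hypothetical values)
Y = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
Y_hat = np.array([[1.0, 3.0], [3.0, 4.0], [4.0, 8.0]])

rmse = rmse_total(Y, Y_hat)       # sqrt(1/3) + sqrt(2/3)
r2 = r_squared_pooled(Y, Y_hat)   # 1 - 3/34
```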

Data analysis procedure

The analysis was conducted through the following steps:

  • 1. Check the correlation between response variables.

  • 2. Test for multivariate normal distribution.

  • 3. Model the data using multivariate normal linear regression.

  • 4. Perform spatial weighting.

  • 5. Perform spatial dependency testing.

  • 6. Estimate the parameters of the MSAR model.

  • 7. Conduct simultaneous hypothesis testing using the test statistic in Eq. (18).

  • 8. Conduct partial hypothesis testing using the test statistics in Eqs. (19) and (20).

  • 9. Evaluate model fit using Eqs. (21) and (22).

  • 10. Interpret the results and draw conclusions.

Method validation

To validate the application of the MSAR method, we used a real-world dataset on health issues in children under five years old.

Data set

The dataset used in this study was obtained from the Center for the Study of Regional Resources and Community Empowerment at Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia. The data are secondary in nature and pertain to the year 2023. Observations were collected from 54 villages located across four sub-districts in Tuban Regency, East Java—namely Singgahan, Kerek, Montong, and Senori—as illustrated in Fig. 1.

Fig. 1. Administrative map of 54 villages (colored pink) in Tuban District.

The response variables selected for analysis are the percentage of cases of pneumonia and diarrhea in children under five years old. These two response variables were found to have a correlation coefficient of 0.585. The predictor variables used in this study include the percentage of infants who received exclusive breastfeeding, the percentage of children under five who received complete basic immunization, the percentage who received vitamin A supplementation, the percentage of pregnant women who attended government-sponsored prenatal classes, and the percentage of households with access to clean water. A summary of the research data is presented in Table 1.

Table 1. Descriptive statistics of research data.

| Variable | Description | Mean | SD | Min | Max |
|---|---|---|---|---|---|
| Response | Pneumonia in toddler (Y1) (%) | 4.82 | 4.91 | 0.00 | 16.68 |
| Response | Diarrhea in toddler (Y2) (%) | 12.99 | 8.10 | 0.83 | 30.61 |
| Predictor | Exclusive breastfeeding (X1) (10 %) | 2.53 | 2.40 | 0.00 | 12.22 |
| Predictor | Complete basic immunization (X2) (%) | 22.56 | 6.70 | 6.12 | 50.00 |
| Predictor | Toddlers who received vit. A (X3) (10 %) | 13.04 | 10.80 | 1.40 | 81.30 |
| Predictor | Pregnant women who attended pregnancy classes (X4) (10 %) | 5.09 | 5.99 | 0.00 | 33.33 |
| Predictor | Households with clean water coverage (X5) (%) | 98.35 | 4.22 | 79.94 | 100.00 |

Modelling child health problems using multivariate normal linear regression

Before conducting multivariate linear regression analysis, the distribution of the response variables was assessed for multivariate normality using a quantile-quantile (Q-Q) plot of the Mahalanobis distances. The proportion of Mahalanobis distances satisfying the normality criterion was 53.70 %, which exceeds the 50 % threshold and suggests that the two response variables follow a bivariate normal distribution. Subsequently, the parameters of the multivariate normal linear regression model were estimated, and the results are shown in Table 2. The table reveals that the predictor variables significantly influencing the prevalence of pneumonia (Y₁) in children under five are the percentage of infants who were exclusively breastfed (X₁) and the percentage of children who received complete basic immunization (X₂). Meanwhile, the variables influencing the prevalence of diarrhea (Y₂) include exclusive breastfeeding (X₁), complete basic immunization (X₂), and access to clean water (X₅).

Table 2. Estimated values of multivariate normal linear regression parameters.

| Parameter | Estimated value | Standard error | t | p-value |
|---|---|---|---|---|
| β01 | 16.6738 | 19.5660 | 0.8522 | 0.3961 |
| β11 | 0.7882 | 0.3516 | 2.2416 | 0.0272* |
| β21 | 0.3373 | 0.1099 | 3.0685 | 0.0027* |
| β31 | −0.0013 | 0.0669 | −0.0206 | 0.9835 |
| β41 | −0.1088 | 0.1251 | −0.8700 | 0.3863 |
| β51 | −0.2123 | 0.1954 | −1.0867 | 0.2797 |
| β02 | 67.2771 | 19.5660 | 3.4383 | 0.0008* |
| β12 | 0.7053 | 0.3516 | 2.0061 | 0.0475* |
| β22 | 0.5214 | 0.1099 | 4.7431 | 6.85 × 10⁻⁶* |
| β32 | −0.0748 | 0.0669 | −1.1187 | 0.2658 |
| β42 | 0.0664 | 0.1251 | 0.5311 | 0.5965 |
| β52 | −0.6832 | 0.1953 | −3.4966 | 0.0007* |

*: significant at 5 % alpha.

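The fitted multivariate normal linear regression model is given by the following equation.

$$
\begin{bmatrix} \hat{Y}_1 \\ \hat{Y}_2 \end{bmatrix}
=
\begin{bmatrix}
16.6738 + 0.7882 X_1 + 0.3373 X_2 - 0.0013 X_3 - 0.1088 X_4 - 0.2123 X_5 \\
67.2771 + 0.7053 X_1 + 0.5214 X_2 - 0.0748 X_3 + 0.0664 X_4 - 0.6832 X_5
\end{bmatrix}
$$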

Spatial weighting and testing for spatial dependence

The MSAR model was used to estimate the prevalence of pneumonia and diarrhea among children under five in southwestern Tuban Regency. This analysis employed a queen contiguity spatial weighting matrix, which accounts for the asymmetrical geographical layout of the region. The matrix was constructed based on shared boundaries between villages.
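To illustrate how a queen contiguity matrix is built, the sketch below constructs one for a regular grid of cells, where two cells are neighbors if they share an edge or a corner. This is a simplified stand-in: the study's matrix is based on the actual village boundaries, and both the grid layout and the row-standardization step are assumptions made for the example.

```python
import numpy as np


def queen_weights_grid(n_rows, n_cols):
    """Row-standardized queen contiguity matrix for an n_rows x n_cols grid."""
    n = n_rows * n_cols
    W = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            # Queen moves: all eight surrounding cells (edges and corners)
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < n_rows and 0 <= cc < n_cols
                    if (dr, dc) != (0, 0) and inside:
                        W[i, rr * n_cols + cc] = 1.0
    # Divide each row by its number of neighbors so that rows sum to one
    return W / W.sum(axis=1, keepdims=True)


W = queen_weights_grid(3, 3)
```

In this 3 × 3 example the center cell has eight neighbors with weight 1/8 each, while a corner cell has three neighbors with weight 1/3 each.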

Following the construction of the spatial weighting matrix, spatial dependence was assessed using the residuals from the multivariate normal linear regression model. The spatial dependence test was conducted in R using the Bivariate Moran's I statistic [[34], [35], [36]], which yielded a Moran's I value of 0.1101, with an expected value of –0.0073 and a variance of 0.0051. The resulting Z-score was 1.6513, which exceeds the critical value of Z₀.₀₅ = 1.64. Therefore, the null hypothesis (H₀) of no spatial dependence is rejected. This result indicates the presence of bivariate spatial dependence in the regression residuals, justifying further spatial analysis.
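The standardization behind the reported Z-score can be reproduced approximately from the rounded moments quoted above. This sketch assumes the usual form $Z = (I - E[I]) / \sqrt{\mathrm{Var}[I]}$; `moran_z` is an illustrative helper and not the R routine used in the study.

```python
import math


def moran_z(i_observed, i_expected, i_variance):
    """Standardize Moran's I using its expected value and variance."""
    return (i_observed - i_expected) / math.sqrt(i_variance)


# Rounded values as reported in the text
z = moran_z(0.1101, -0.0073, 0.0051)
```

With these rounded inputs the result is roughly 1.644, close to the reported 1.6513; the gap is presumably due to rounding of the reported moments.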

Modelling child health data using the MSAR model

In MSAR modelling, the model includes a spatial effect parameter, denoted as ρ, in addition to the regression coefficients. Estimating this parameter is therefore the first step, conducted using a numerical approximation method based on the concentrated log-likelihood function. Once the estimate $\hat{\rho}$ has been obtained, $\mathrm{Vec}(\hat{B})$ and $\hat{\Sigma}$ can be estimated. The results of the $\mathrm{Vec}(\hat{B})$ estimation are presented in Table 3, while the $\hat{\Sigma}$ value is as follows.

$$
\hat{\Sigma} = \begin{bmatrix} 11.82 & 6.36 \\ 5.30 & 33.48 \end{bmatrix}
$$

Table 3. Estimated values of multivariate spatial autoregressive parameters.

| Parameter | Estimated value | Standard error | Wald statistic | p-value |
|---|---|---|---|---|
| ρ1 | 0.42 | 0.01 | 2895.30 | 0.00* |
| β01 | 15.76 | 12.90 | 1.49 | 0.22 |
| β11 | 0.59 | 0.23 | 6.59 | 0.01* |
| β21 | 0.25 | 0.07 | 11.98 | 0.00* |
| β31 | 0.02 | 0.04 | 0.21 | 0.65 |
| β41 | −0.10 | 0.08 | 1.46 | 0.23 |
| β51 | −0.20 | 0.13 | 2.46 | 0.12 |
| ρ2 | 0.38 | 0.01 | 14,786.09 | 0.00* |
| β02 | 62.92 | 21.71 | 8.40 | 0.00* |
| β12 | 0.41 | 0.39 | 1.13 | 0.29 |
| β22 | 0.40 | 0.12 | 10.95 | 0.00* |
| β32 | −0.03 | 0.07 | 0.17 | 0.68 |
| β42 | 0.05 | 0.14 | 0.14 | 0.71 |
| β52 | −0.66 | 0.22 | 9.25 | 0.00* |

*: significant at 5 % alpha.

The initial step involves simultaneous hypothesis testing of all model parameters to determine whether they collectively have a significant influence. The value of the $G^2$ test statistic was 43,240.59, which is greater than $\chi^2_{(0.05;10)} = 18.307$. Accordingly, the null hypothesis (H₀) is rejected, indicating that at least one parameter significantly contributes to the model. This justifies proceeding with partial (individual) hypothesis tests to identify which specific parameters are influential in the MSAR model.

Table 3 indicates that the parameters ρ1 and ρ2 are significant to the model, thereby suggesting that spatial dependencies in the rates of pneumonia and diarrhea must be considered in the model. The MSAR model for the percentage of pneumonia cases (Y₁) identifies two significant predictor variables: the percentage of infants exclusively breastfed (X₁) and the percentage of children under five who received complete basic immunization (X₂). Meanwhile, for the percentage of diarrhea cases (Y₂), the significant predictors are X₂ (complete basic immunization) and X₅ (households with access to clean water).

As shown in Table 4, the MSAR model better captures the relationship between predictor variables and child health outcomes than the standard multivariate normal linear regression model. This is demonstrated by its lower Root Mean Square Error (RMSE) of 4.97 and a higher R-squared value of approximately 60 %. These findings support the conclusion that, when multivariate data exhibit spatial autocorrelation, the MSAR model provides a more accurate and reliable estimation framework.

Table 4.

Model comparison.

Model RMSE R-square
Multivariate Normal Linear Regression 5.22 55.21 %
Model MSAR 4.97 59.98 %

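In total, 54 distinct MSAR models were developed, one for each village. The model estimates for both pneumonia (Y₁) and diarrhea (Y₂) are summarized as $[\hat{Y}_{1i}\; \hat{Y}_{2i}]^T$ where:

$$
\begin{aligned}
\hat{Y}_{1i} &= 15.76 + 0.42 \sum_{i^*=1,\, i \neq i^*}^{54} w_{ii^*}\, y_{1i^*} + 0.59 X_{1i} + 0.25 X_{2i} + 0.02 X_{3i} - 0.10 X_{4i} - 0.20 X_{5i} \\
\hat{Y}_{2i} &= 62.92 + 0.38 \sum_{i^*=1,\, i \neq i^*}^{54} w_{ii^*}\, y_{2i^*} + 0.41 X_{1i} + 0.40 X_{2i} - 0.03 X_{3i} + 0.05 X_{4i} - 0.66 X_{5i}
\end{aligned}
$$

Taking Gemulung village as an example, the MSAR model for Gemulung village (code number 5) is $[\hat{Y}_{1,5}\; \hat{Y}_{2,5}]^T$ where:

$$
\begin{aligned}
\hat{Y}_{1,5} &= 15.76 + 0.11 Y_{1,28} + 0.11 Y_{1,38} + 0.11 Y_{1,49} + 0.11 Y_{1,53} + 0.59 X_{1,5} + 0.25 X_{2,5} + 0.02 X_{3,5} - 0.10 X_{4,5} - 0.20 X_{5,5} \\
\hat{Y}_{2,5} &= 62.92 + 0.09 Y_{2,28} + 0.09 Y_{2,38} + 0.09 Y_{2,49} + 0.09 Y_{2,53} + 0.41 X_{1,5} + 0.40 X_{2,5} - 0.03 X_{3,5} + 0.05 X_{4,5} - 0.66 X_{5,5}
\end{aligned}
$$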
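The neighbor coefficients in the Gemulung equations are consistent with a row-standardized weight matrix. Assuming (this is not stated explicitly in the text) that each of the four neighboring villages receives a weight of 1/4, a quick check with the rounded estimates of ρ1 and ρ2 from Table 3 gives:

```latex
\hat{\rho}_1 \, w_{5 i^*} = 0.42 \times \tfrac{1}{4} = 0.105 \approx 0.11,
\qquad
\hat{\rho}_2 \, w_{5 i^*} = 0.38 \times \tfrac{1}{4} = 0.095
```

The second product is reported as 0.09 in the fitted equation; the small difference presumably reflects rounding of the estimate of ρ2.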

The above MSAR model of Gemulung Village can be interpreted as follows:

  • 1. For every 100 children under five, approximately 10 to 11 are affected by pneumonia, and 9 to 10 by diarrhea in Gemulung Village. Similar patterns are likely present in the neighboring villages (Mulyoagung, Sidonganti, Trantang, and Wolutengah) due to spatial dependence.

  • 2. A 1 % increase in the proportion of exclusively breastfed infants is associated with a rise in pneumonia cases, which contradicts theoretical expectations. This may be due to the lagging effect of exclusive breastfeeding on pneumonia incidence. Additionally, pneumonia cases in Gemulung appear to influence similar increases in the four neighboring villages. No significant relationship was found between exclusive breastfeeding and diarrhea prevalence.

  • 3. A 1 % increase in the percentage of children receiving complete basic immunization is linked to higher pneumonia and diarrhea rates. This finding contradicts existing theory, likely due to a temporal lag in the variable's impact. Increases in pneumonia and diarrhea in Gemulung are associated with corresponding rises (10–11 and 9–10 cases per 100 children, respectively) in neighboring villages.

  • 4. The percentage of children under five who received vitamin A supplementation showed no significant effect on pneumonia or diarrhea incidence.

  • 5. The proportion of pregnant women attending pregnancy classes did not significantly influence pneumonia or diarrhea rates among children under five.

  • 6. A 1 % increase in household clean water coverage is associated with a decrease of approximately one diarrhea case per 100 children under five, but has no significant effect on pneumonia rates. Diarrhea cases in Gemulung also appear to influence similar increases (9–10 per 100) in the neighboring villages.

Conclusions

This study focused on area-based spatial modeling in the context of multivariate response regression, introducing the MSAR model as an extension of the conventional SAR model. The MSAR approach incorporates geographic weighting to account for spatial dependencies between neighboring regions. Parameter estimation was carried out using MLE via the concentrated log-likelihood, which resulted in unbiased and consistent estimates. The significance of model parameters was tested both simultaneously using the LRT and partially using the Wald test, which enabled the identification of influential predictor variables. The application of the MSAR model to data on pneumonia and diarrhea cases among children under five in Tuban Regency, East Java, demonstrated its effectiveness in handling spatial autocorrelation. Compared to the standard multivariate normal linear regression, the MSAR model showed better accuracy. The variables found to affect the incidence of pneumonia and diarrhea were the percentage of infants receiving exclusive breastfeeding, the percentage of toddlers receiving complete basic immunization, and the percentage of households with access to clean water. However, the current model is limited to multivariate normal data distributions. Future research should explore extensions of the MSAR framework that can accommodate non-normal data.

Limitations

The model assumes that the errors are normally distributed.

Ethics statements

The data used in this research has been approved by the Center for the Study of Regional Resources and Community Empowerment Institut Teknologi Sepuluh Nopember Surabaya, Indonesia.

Supplementary material and/or additional information

None

CRediT authorship contribution statement

Sutikno: Conceptualization, Methodology, Writing – original draft, Writing – review & editing, Validation. Purhadi: Methodology, Conceptualization. Fachrunisah: Visualization, Writing – review & editing, Software. Fajar Dwi Cahyoko: Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The first author would like to gratefully acknowledge the Government of Tuban Regency for providing funding for this research.

Footnotes

Related research article: None

For a published article: None

Appendix A

Fig. A1 and Table A1.

Table A1.

Village and neighbor codes.

| Village code | Village | Sub-district | Neighbor count | Neighbor codes |
|---|---|---|---|---|
| 1 | Banyuurip | Senori | 3 | 19, 51, 54 |
| 2 | Binangun | Singgahan | 6 | 34, 37, 45, 50, 51, 52 |
| 3 | Bringin | Montong | 3 | 20, 32, 40 |
| 4 | Gaji | Kerek | 7 | 7, 8, 16, 23, 26, 47, 53 |
| 5 | Gemulung | Kerek | 4 | 28, 38, 49, 53 |
| 6 | Guwoterus | Montong | 6 | 28, 30, 38, 41, 47, 48 |
| 7 | Hargoretno | Kerek | 8 | 4, 8, 27, 31, 33, 41, 46, 47 |
| 8 | Jarorejo | Kerek | 5 | 4, 7, 22, 23, 46 |
| 9 | Jatisari | Senori | 5 | 11, 19, 24, 36, 51 |
| 10 | Jetakss | Montong | 4 | 20, 33, 40, 42 |
| 11 | Kaligede | Senori | 2 | 9, 19 |
| 12 | Karanglo | Kerek | 3 | 22, 31, 39 |
| 13 | Kasiman | Kerek | 4 | 16, 23, 26, 39 |
| 14 | Katerban | Senori | 1 | 34 |
| 15 | Kedungjambe | Singgahan | 4 | 29, 35, 44, 50 |
| 16 | Kedungrejo | Kerek | 4 | 4, 13, 23, 26 |
| 17 | Lajo Kidul | Singgahan | 4 | 18, 36, 43, 45 |
| 18 | Lajo Lor | Singgahan | 3 | 17, 28, 43 |
| 19 | Leran | Senori | 4 | 1, 9, 11, 51 |
| 20 | Maindu | Montong | 3 | 3, 10, 40 |
| 21 | Manjung | Montong | 1 | 44 |
| 22 | Margomulyo | Kerek | 6 | 8, 12, 23, 31, 39, 46 |
| 23 | Margorejo | Kerek | 6 | 4, 8, 13, 16, 22, 39 |
| 24 | Medalem | Senori | 2 | 9, 36 |
| 25 | Mergosari | Singgahan | 5 | 28, 29, 43, 45, 50 |
| 26 | Mliwang | Kerek | 3 | 4, 13, 16 |
| 27 | Montongsekar | Montong | 4 | 7, 32, 33, 41 |
| 28 | Mulyoagung | Singgahan | 8 | 5, 6, 18, 25, 29, 38, 43, 48 |
| 29 | Mulyorejo | Singgahan | 7 | 15, 25, 28, 30, 44, 48, 50 |
| 30 | Nguluhan | Montong | 5 | 6, 29, 41, 44, 48 |
| 31 | Padasan | Kerek | 5 | 7, 12, 22, 33, 46 |
| 32 | Pakel | Montong | 6 | 3, 27, 33, 40, 41, 44 |
| 33 | Pucangan | Montong | 7 | 7, 10, 27, 31, 32, 40, 42 |
| 34 | Rayung | Senori | 6 | 2, 14, 35, 37, 50, 54 |
| 35 | Saringembat | Singgahan | 3 | 15, 34, 50 |
| 36 | Sendang | Senori | 5 | 9, 17, 24, 45, 51 |
| 37 | Sidoharjo | Senori | 4 | 2, 34, 52, 54 |
| 38 | Sidonganti | Kerek | 5 | 5, 6, 28, 47, 49 |
| 39 | Sumberarum | Kerek | 4 | 12, 13, 22, 23 |
| 40 | Sumurgung | Montong | 5 | 3, 10, 20, 32, 33 |
| 41 | Talangkembar | Montong | 7 | 6, 7, 27, 30, 32, 44, 47 |
| 42 | Talun | Montong | 2 | 10, 33 |
| 43 | Tanggir | Singgahan | 5 | 17, 18, 25, 28, 45 |
| 44 | Tanggulangin | Montong | 6 | 15, 21, 29, 30, 32, 41 |
| 45 | Tanjungrejo | Singgahan | 7 | 2, 17, 25, 36, 43, 50, 51 |
| 46 | Temayang | Kerek | 4 | 7, 8, 22, 31 |
| 47 | Tengger Wetan | Kerek | 7 | 4, 6, 7, 38, 41, 49, 53 |
| 48 | Tingkis | Singgahan | 4 | 6, 28, 29, 30 |
| 49 | Trantang | Kerek | 4 | 5, 38, 47, 53 |
| 50 | Tunggulrejo | Singgahan | 7 | 2, 15, 25, 29, 34, 35, 45 |
| 51 | Wanglu Kulon | Senori | 8 | 1, 2, 9, 19, 36, 45, 52, 54 |
| 52 | Wanglu Wetan | Senori | 4 | 2, 37, 51, 54 |
| 53 | Wolutengah | Kerek | 4 | 4, 5, 47, 49 |
| 54 | Wonosari | Senori | 5 | 1, 34, 37, 51, 52 |
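
Because the spatial weights are built from this neighbor table, it can be worth checking that the table is internally consistent before using it. The following Python sketch uses only a few rows copied from Table A1 and checks that each reported count matches the length of its neighbor list and that adjacency is mutual. The helper names are hypothetical, and these checks are a suggestion rather than a step described by the authors.

```python
# A few rows of Table A1: village code -> (reported count, neighbor codes).
TABLE_A1_SUBSET = {
    5: (4, [28, 38, 49, 53]),     # Gemulung
    38: (5, [5, 6, 28, 47, 49]),  # Sidonganti
    49: (4, [5, 38, 47, 53]),     # Trantang
    53: (4, [4, 5, 47, 49]),      # Wolutengah
}


def count_mismatches(table):
    """Return codes whose reported count differs from the list length."""
    return [code for code, (count, nbrs) in table.items() if count != len(nbrs)]


def non_mutual_pairs(table):
    """Return (i, j) where j is listed for i but i is not listed for j.

    Only pairs with both villages present in the table are checked, so a
    partial table does not raise false alarms.
    """
    bad = []
    for i, (_, nbrs) in table.items():
        for j in nbrs:
            if j in table and i not in table[j][1]:
                bad.append((i, j))
    return bad


print(count_mismatches(TABLE_A1_SUBSET))
print(non_mutual_pairs(TABLE_A1_SUBSET))
```

Running the same two checks on all 54 rows would be a quick way to catch transcription errors before constructing the weight matrix.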

Fig. A1.

Map of Tuban Regency village codes.

Data availability

Data will be made available on request.

References

  • 1. Mennis J., Guo D. Spatial data mining and geographic knowledge discovery-an introduction. Comput. Environ. Urban Syst. 2009;33:403–408. doi: 10.1016/j.compenvurbsys.2009.11.001.
  • 2. Charles A.C., Armstrong A., Nnamdi O.C., Innocent M.T., Obiageri N.J., Begianpuye A.F., Timothy E.E. Review of spatial analysis as a geographic information management tool. Am. J. Eng. Technol. Manag. 2024 doi: 10.11648/j.ajetm.20240901.12.
  • 3. Krisztin T., Piribauer P. A Bayesian approach for the estimation of weight matrices in spatial autoregressive models. Spat. Econ. Anal. 2023;18:44–63. doi: 10.1080/17421772.2022.2095426.
  • 4. Koley M., Bera A.K. Springer International Publishing; 2022. Testing For Spatial Dependence in a Spatial Autoregressive (SAR) Model in the Presence of Endogenous Regressors.
  • 5. Liu X., Chen J. Variable selection for the spatial autoregressive model with autoregressive disturbances. Mathematics. 2021;9 https://www.mdpi.com/2227-7390/9/12/1448
  • 6. LeSage J., Pace R.K. Chapman and Hall/CRC; New York: 2009. Introduction to Spatial Econometrics.
  • 7. Yokoi T. 50th Congr. Eur. Reg. Sci. Assoc. "Sustainable Reg. Growth Dev. Creat. Knowl. Econ. 2010. Efficient maximum likelihood estimation of spatial autoregressive models with normal but heteroskedastic disturbances.
  • 8. Jeong H., fei Lee L. Maximum likelihood estimation of a spatial autoregressive model for origin–destination flow variables. J. Econom. 2024;242 doi: 10.1016/j.jeconom.2024.105790.
  • 9. Anselin L. Springer Netherlands Dordrecht; 1988. Spatial Econometrics: Methods and Models.
  • 10. Yang H., Huang W., Ma X., Xu Y., Huang M. Proc. 2022 3rd Int. Conf. Big Data Soc. Sci. (ICBDSS 2022) Atlantis Press International BV; 2022. Research on the time-space impact paths of economic convergence-empirical evidence from 30 provinces in China; pp. 110–122.
  • 11. Liu T., Lee L. A likelihood ratio test for spatial model selection. J. Econom. 2019;213:434–458. doi: 10.1016/j.jeconom.2019.07.001.
  • 12. Yang K., fei Lee L. Identification and QML estimation of multivariate and simultaneous equations spatial autoregressive models. J. Econom. 2017;196:196–214. doi: 10.1016/j.jeconom.2016.04.019.
  • 13. Su L., Jin S. Profile quasi-maximum likelihood estimation of partially linear spatial autoregressive models. J. Econom. 2010;157:18–33. doi: 10.1016/j.jeconom.2009.10.033.
  • 14. Liu X., Lee L.F. Two-stage least squares estimation of spatial autoregressive models with endogenous regressors and many instruments. Econom. Rev. 2013;32:734–753. doi: 10.1080/07474938.2013.741018.
  • 15. Kelejian H.H., Prucha I.R. A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. J. Real Estate Financ. Econ. 1998;17:99–121. doi: 10.1023/A:1007707430416.
  • 16. Sirait T. Multivariate general spatial three-stage least squares fixed effect panel simultaneous models and estimation of their parameters. WSEAS Trans. Math. 2020;19:373–383. doi: 10.37394/23206.2020.19.38.
  • 17. Luo G., Wu M., Pang Z. Estimation of spatial autoregressive models with covariate measurement errors. J. Multivar. Anal. 2022 https://www.sciencedirect.com/science/article/pii/S0047259X22000872
  • 18. White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25. doi: 10.4337/9781035334926.00009.
  • 19. Nocedal J., Liu D.C. On the limited memory BFGS method for large scale optimization. Math. Program. 1989;45:503–528.
  • 20. Gerber F., Furrer R. OptimParallel: an R package providing a parallel version of the l-BFGS-B optimization method. R J. 2019:11. doi: 10.32614/rj-2019-030.
  • 21. Xiao Y., Wei Z., Wang Z. A limited memory BFGS-type method for large-scale unconstrained optimization. Comput. Math. with Appl. 2008;56:1001–1009. doi: 10.1016/j.camwa.2008.01.028.
  • 22. Hu W., Jing B., Zhang B., Huang D. Crawling subsampling for multivariate spatial autoregression model in large-scale networks. Electron. J. Stat. 2021;15:3678–3707. doi: 10.1214/21-EJS1872.
  • 23. Zhu X., Huang D., Pan R., Wang H. Multivariate spatial autoregressive model for large scale social networks. J. Econom. 2020;215:591–606. doi: 10.1016/j.jeconom.2018.11.018.
  • 24. Byrd R., Lu P., Nocedal J., Zhu C. A limited memory algorithm for bound constrained optimization. J. Sci. Comput. 1995;16:1190–1208.
  • 25. Christensen R. Springer; New York: 1991. Linear Models for Multivariate, Time Series, and Spatial Data.
  • 26. Yasin H., Purhadi A.Choiruddin. Spatial clustering based on geographically weighted multivariate generalized gamma regression. MethodsX. 2024;13 doi: 10.1016/j.mex.2024.102903.
  • 27. Fadmi F.R., Otok B.W., Kuntoro S.Melaniani, Sriningsih R. Segmentation of stunting, wasting, and underweight in Southeast Sulawesi using geographically weighted multivariate Poisson regression. MethodsX. 2024;12 doi: 10.1016/j.mex.2024.102736.
  • 28. D.N. Gujarati, D.C. Porter, Basic Econometrics, 5 ed, McGraw-Hill Education, 2008.
  • 29. Ozili P.K. The acceptable R-square in empirical modelling for social science research. Soc. Res. Methodol. Publ. Results. 2022.
  • 30. Chicco D., Warrens M.J., Jurman G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021;7:1–24. doi: 10.7717/PEERJ-CS.623.
  • 31. Johnson R.A., Wichern D.W. Pearson Prentice Hall; 2007. Applied Multivariate Statistical Analysis. 6 ed.
  • 32. E. Kasuya, On the use of r and r squared in correlation and regression, 2018. 10.1111/1440-1703.1011.
  • 33. Keer M., Lohiya H., Chouhan S. Goodness of Fit for Linear Regression using R squared and Adjusted R-Squared. Int. J. Res. Publ. Rev. J. Homepage. 2023;4:2431–2439.
  • 34. Yamada H. Moran's I for Multivariate Spatial Data. Mathematics. 2024;12:2746. doi: 10.3390/math12172746.
  • 35. Bivand R.S., Wong D.W.S. Comparing implementations of global and local indicators of spatial association. TEST An Off. J. Spanish Soc. Stat. Oper. Res. 2018;27:716–748. doi: 10.1007/s11749-018-0599-x.
  • 36. Cheng Z. The spatial correlation and interaction between manufacturing agglomeration and environmental pollution. Ecol. Indic. 2016;61:1024–1032. doi: 10.1016/j.ecolind.2015.10.060.

Articles from MethodsX are provided here courtesy of Elsevier
