Linear regression in ecological studies involving space: methodology and an application example in public health

Gleice Margarete de Souza Conceição; Patricia Marques Moralejo Bermudi; Raquel Gardini Sanches Palasio; Gerson Laurindo Barbosa; Camila Meireles Fernandes; Lidia Maria Reis Santana; Ligia Vizeu Barrozo; Daiane Leite da Roza; José Alberto Quintanilha; Francisco Chiaravalloti-Neto

Rev Bras Epidemiol. 2026 Apr 20;29:e260018. doi: 10.1590/1980-549720260018


Gleice Margarete de Souza Conceição ^I, Patricia Marques Moralejo Bermudi ^I, Raquel Gardini Sanches Palasio ^I, Gerson Laurindo Barbosa ^II, Camila Meireles Fernandes ^I, Lidia Maria Reis Santana ^I,^III, Ligia Vizeu Barrozo ^IV, Daiane Leite da Roza ^I, José Alberto Quintanilha ^V, Francisco Chiaravalloti-Neto ^I

^I Universidade de São Paulo, School of Public Health, Department of Epidemiology - São Paulo (SP), Brazil.

^II Instituto Pasteur, São Paulo State Health Secretariat - São Paulo (SP), Brazil.

^III Universidade Federal de São Paulo, Department of Medicine - São Paulo (SP), Brazil.

^IV Universidade de São Paulo, School of Philosophy, Languages and Human Sciences - São Paulo (SP), Brazil.

^V Universidade de São Paulo, Institute of Energy and Environment - São Paulo (SP), Brazil.

CORRESPONDING AUTHOR:

Patricia Marques Moralejo Bermudi. Avenida Doutor Arnaldo, 715, Cerqueira César, CEP 01246-904, São Paulo (SP), Brasil. E-mail: patricia.bermudi@usp.br

CONFLICT OF INTERESTS: nothing to declare.

ASSOCIATE EDITOR: Nelson da Cruz Gouveia http://orcid.org/0000-0003-0625-0265

SCIENTIFIC EDITOR: Maria Fernanda Tourinho Peres http://orcid.org/0000-0002-7049-905X

Roles

Gleice Margarete de Souza Conceição: Project administration, Formal Analysis, Conceptualization, Data curation, Writing - original draft, Writing - review & editing, Investigation, Methodology, Software, Supervision, Validation, Visualization

Patricia Marques Moralejo Bermudi: Formal Analysis, Writing - original draft, Writing - review & editing, Investigation, Methodology, Software, Validation, Visualization

Raquel Gardini Sanches Palasio: Formal Analysis, Writing - original draft, Writing - review & editing, Investigation, Methodology, Software, Validation, Visualization

Gerson Laurindo Barbosa: Formal Analysis, Data curation, Writing - original draft, Writing - review & editing, Investigation, Methodology, Validation

Camila Meireles Fernandes: Formal Analysis, Data curation, Writing - original draft, Writing - review & editing, Investigation, Methodology, Software, Validation, Visualization

Lidia Maria Reis Santana: Formal Analysis, Data curation, Writing - original draft, Writing - review & editing, Investigation, Methodology, Software, Validation, Visualization

Ligia Vizeu Barrozo: Formal Analysis, Conceptualization, Writing - original draft, Writing - review & editing, Investigation, Methodology, Software, Validation, Visualization

Daiane Leite da Roza: Formal Analysis, Writing - original draft, Writing - review & editing, Investigation, Methodology, Validation

José Alberto Quintanilha: Formal Analysis, Writing - original draft, Writing - review & editing, Investigation, Methodology, Validation

Francisco Chiaravalloti-Neto: Project administration, Formal Analysis, Conceptualization, Writing - original draft, Writing - review & editing, Investigation, Methodology, Software, Validation, Visualization

PMCID: PMC13101397 PMID: 42018834

ABSTRACT

Many health-related phenomena can be better understood when the geographic region in which they occur is taken into account. One of the most important aspects to consider in spatial study designs is the presence of autocorrelation in observations measured across space. If this spatial dependence is not properly modeled, the resulting statistics may be biased, compromising the validity of conclusions regarding the presence or absence of associations. Methodologies developed based on the linear regression model allow this dependence to be adequately accommodated, producing precise, robust, and unbiased estimates. With the aim of highlighting the applicability of spatial models and pointing out the necessary precautions in data analysis, this article describes, step by step, one of the most commonly used methodologies for spatial data analysis, as well as the measures to be taken to avoid modeling errors and distortion of results. The linear regression model is presented, along with procedures to evaluate model fit, the most commonly used measure to detect spatial dependence, and two autoregressive models frequently applied to model this dependence (SAR and SEM). An application example is provided using the GeoDa and R software.

Keywords: Linear regression, Ecological studies, Spatial dependence, Spatial analysis, Spatial autocorrelation, Autoregressive models

INTRODUCTION

The term “regression” was introduced in 1885 by Francis Galton. While studying the anthropometric characteristics of successive generations, Galton observed that children of parents who were taller than the average tended to be tall but shorter than their parents, whereas children of parents who were shorter than the average tended to be short but taller than their parents. Based on these observations, he postulated that human height tended to “regress” toward the mean ¹ .

The Least Squares Method, developed by Legendre and Gauss to determine the orbits of comets, made it possible to estimate the parameters of the regression proposed by Galton. Gauss noted that this method provides optimal estimates when the errors are assumed to be random, independent, and normally distributed ²,³ . This methodology was later refined and applied in the formulation of the theory of Linear Regression Models (LRMs). Technological advances have substantially reduced computational time and effort, enabling the technique to be applied to increasingly large datasets. Currently, LRMs are widely used for data analysis across a broad range of fields.

Based on this theory, new methodologies have been developed, including models for the analysis of spatial and temporal data. These approaches help establish relationships between variables of interest and interrelated factors across time and space, allowing for a deeper understanding of the phenomena under study and enabling the formulation of more effective public policies ⁴ .

It is common for observations measured across space and/or time to be correlated, making it necessary to account for this dependence. Spatial regression models allow this dependence to be incorporated, generating accurate, robust, and unbiased estimates, as well as enabling the identification and interpretation of the effects of spatial factors that influence the phenomenon under study.

On the other hand, disregarding the spatial dimension can compromise the validity of the conclusions. Among the main problems resulting from this are inadequate model specification, the omission of important variables, and the production of biased estimates and incorrect confidence intervals and p-values, which can lead to spurious associations and inaccurate or invalid conclusions ⁵ .

Thus, spatial regression models are powerful tools for analyzing data observed in space. Software such as GeoDa and the R language have facilitated their application. However, it is important to emphasize that the success of these applications depends on knowledge of the theoretical foundations and a clear understanding of the problem under investigation.

Although the international scientific literature includes studies using a spatial approach, this perspective is less frequent in Brazilian journals. A search in the Virtual Health Library (VHL), Public Health Brazil ⁶ , using the term “linear regression” identified 2,458 articles in Portuguese published up to August 2025. When the search was restricted to “linear regression” and “spatial,” the number was reduced to only 82.

According to Figueiredo Filho et al. ⁷ , supporting materials that contribute to the understanding of spatial and spatiotemporal models, highlighting appropriate procedures and practices to be avoided, are also scarce in Brazilian scientific journals. The present study aimed to address this gap by offering a comprehensive approach to the relevance and application of spatial models. The objective is to disseminate tools that assist in selecting the most appropriate method for the analysis of spatial data, as well as to discuss the biases that may arise when spatial dependence is ignored, thereby reducing the risk of erroneous conclusions and promoting the advancement of regression methods in spatial data analysis. To this end, LRM and its assumptions are presented, along with the most widely used measure for detecting spatial dependence in model residuals and two of the most commonly applied approaches for modeling this dependence. As an example, an application is presented using data from Fernandes et al. ⁸ , who evaluated the spatial distribution of the prevalence of adolescent mothers and its relationship with socioeconomic indicators in the municipality of Foz do Iguaçu, Paraná (PR).

As this was a methodological study, with no direct involvement of human participants or use of identifiable individual data, submission to a Research Ethics Committee was not required.

THE LINEAR REGRESSION MODEL

LRMs are used to describe and quantify linear or linearizable associations and to generate forecasts. They can be applied to assess trends in spatial, temporal, or spatiotemporal series. These models involve a quantitative response variable (Y), preferably continuous, and one or more explanatory variables (X₁, X₂, ..., X_p), also known as covariates, which may be quantitative or qualitative. The objective is to evaluate the joint effect of the covariates on the response variable (RV) and to approximate this relationship using a mathematical function, thereby facilitating its description and quantification and, eventually, enabling forecasting ⁹ .

Such models can be applied in spatial ecological designs, in which the geographic region where events occur is fundamental to understanding the phenomenon under study. Therefore, both the database structure and the adopted model must take into account the spatial distribution of the data. In this type of design, the study area is partitioned into Spatial Area Units (SAUs), which are generally based on pre-existing units, such as municipalities and census tracts.

The LRM for spatial data analysis is similar to the classical model, except that the variables are measured at the SAU level, and each row of the database refers to one of these units. The model can be written as ¹⁰ :

Y = X β + ε

Where:

Y = (y₁, y₂, ..., y_N) is the vector (N×1) containing the RV values for each SAU;

N is the number of SAUs;

X is the matrix (N×p) containing the values of the p covariates for each SAU;

β is the parameter vector (p×1) to be estimated, which will quantify the effect of each explanatory variable on the response;

ε is the vector (N×1) of random errors.

The model assumptions are that ε ~ N(0, σ²I_N), where I_N is the N×N identity matrix, and that (ε_i, ε_j) are independent for every pair of SAUs (i, j). These assumptions can be translated into four essential conditions: normality (random errors with a normal distribution, implying normality of the RV); homoscedasticity (constant variance of the RV across the values of the covariates); linearity (a linear relationship between the RV and the covariates); and independence (RV values that are independent of each other, that is, uncorrelated) ¹⁰ .

These assumptions must be verified before and after model fitting, through exploratory and residual analyses, respectively. If they are not satisfied, the results may be compromised. In the absence of linearity, the model will not adequately reflect the behavior of the data, and the estimates will be invalid. In the absence of normality, homoscedasticity, or independence, the variances of the estimators, the test statistics, and the p-values of the hypothesis tests will be biased, thereby compromising any conclusions regarding the presence or absence of associations. Finally, the presence of outliers may lead to biased estimators and variances if their influence is substantial ¹⁰ .
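As a reference point for the statements above, the least squares estimator and its variance can be written in matrix form. This is a standard textbook result rather than something specific to this article. It assumes that X has full column rank, so that XᵀX is invertible, and p here counts the columns of X (including the intercept column, if one is used). The variance expression also relies on the independence and constant-variance assumptions, which is why standard errors and p-values are affected when those assumptions fail.

```latex
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} Y,
\qquad
\operatorname{Var}(\hat{\beta}) = \sigma^{2} (X^{\top} X)^{-1},
\qquad
\hat{\sigma}^{2} = \frac{\sum_{i=1}^{N} (y_{i} - \hat{y}_{i})^{2}}{N - p}
```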

To address problems related to normality, linearity, and homoscedasticity, several measures may be adopted, including transformations of the RV (such as logarithmic transformation); or the fitting of generalized linear, generalized additive, or Bayesian models. These approaches allow the use of distributions other than the normal, such as Poisson and negative binomial, as well as alternative functional forms for the relationship between the RV and the covariates, beyond linearity ¹¹ . If the observations are not independent and the data do not permit adequate specification of the spatial dependence structure, little can be done, which compromises the performance of any analysis. Conversely, when such a structure exists (for example, in spatial contexts), models that explicitly incorporate this dependence should be employed. Models with spatial dependence fall within this category.

In spatial studies, the first law of Geography is commonly applied ¹² : in space, all things are related, but those that are spatially closer are more related than those that are farther apart. This phenomenon is referred to as spatial dependence or positive spatial autocorrelation ⁴ .

It is possible that deviations from the assumptions of normality and homoscedasticity, as well as the presence of spatial dependence in the residuals of the model, may be corrected after the inclusion of covariates. If the residuals of the fitted model are normally distributed, random, homoscedastic, present few outliers (provided they do not distort the coefficient estimates), and are free from spatial dependence, the model may be considered well fitted. An example is the study by Diniz et al. ¹³ , which spatially modeled breast cancer mortality in the municipalities of the state of São Paulo. Although the rates exhibited spatial dependence, this pattern was adequately accommodated by the covariates, and the residuals did not indicate any deviation from the assumptions of the LRM.

EXPLORATORY ANALYSIS

The first step in any statistical analysis is data exploration, which must be thorough and precede the application of any model. It is necessary to understand the relationship between the RV and each covariate, as well as the relationships among the covariates themselves. Below are the steps that should be considered in an exploratory analysis for the fitting of any LRM, in order to avoid commonly committed errors.

Initially, each variable should be described individually, according to its nature. Qualitative variables should be summarized using frequency tables and bar graphs. For quantitative variables, numerical summary measures (mean, median, standard deviation, etc.) and a set of graphical displays (histograms, box plots, and dot plots) should be used. These provide information on the range of values, their frequency and distribution, central tendency and variability, as well as the presence of asymmetry and tails. If outliers are present, it is essential to assess their influence on LRM estimates and, if necessary, apply appropriate control measures; they should be controlled for but never deleted. To evaluate the adherence of the data to a given distribution, in addition to histograms and normality tests, QQ-plots should be used. Only after understanding the behavior of each variable individually should their joint behavior be examined pairwise, and the presence of associations assessed, both between the RV and each covariate and among the covariates themselves.

To describe the relationship between RV and qualitative covariates, the response should be summarized according to the categories of the qualitative covariate, using the numerical summary measures and graphical displays mentioned previously. In particular, it is desirable that the variance of the RV be similar across the categories of the qualitative covariates. This can be assessed using tests for equality of variances ¹⁰ .

For quantitative covariates, the scatter plot and the correlation coefficient should be examined jointly. The former reveals the shape of the relationship and, if it is linear, the latter expresses its strength. Two points should be considered. The first concerns linearity, which is one of the assumptions of LRMs. If the shape of the relationship between the RV and the covariate is not linear, the LRM will not be appropriate. In such cases, it may be convenient to transform the quantitative variable into a qualitative one (for example, using age ranges instead of age). The values defining the categories depend on the researcher’s knowledge, and in the absence of prior knowledge, it is advisable to define them so that each category contains approximately the same number of observations. Another alternative is the use of polynomial terms or smoothers ¹⁰,¹⁴ . The second point refers to the fact that the correlation coefficient measures only linear associations. Therefore, it should always be interpreted in conjunction with the scatter plot, never in isolation. Furthermore, a high coefficient does not necessarily imply a linear association; similarly, a low coefficient does not necessarily imply the absence of association, since this may occur when the relationship is non-linear.
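The second point can be illustrated with a minimal sketch in Python (standard library only). The data and the helper `pearson_r` are hypothetical and serve only as an illustration: a response that depends perfectly on the covariate through a quadratic curve yields a correlation coefficient of zero, whereas a perfectly linear response yields a coefficient of one.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical covariate values, symmetric around zero.
x = [-3, -2, -1, 0, 1, 2, 3]

y_quadratic = [v ** 2 for v in x]   # perfect dependence, but non-linear
y_linear = [2 * v + 1 for v in x]   # perfect linear dependence

r_quadratic = pearson_r(x, y_quadratic)
r_linear = pearson_r(x, y_linear)

print(f"r (quadratic relationship) = {r_quadratic:.3f}")
print(f"r (linear relationship)    = {r_linear:.3f}")
```

In the quadratic case only the scatter plot would reveal the association, which is why the coefficient should not be read in isolation.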

It is also necessary to understand the relationships among the covariates (using the same tools described previously), as well as to assess the presence of collinearity, which occurs when covariates are strongly correlated. Collinear covariates make the coefficients unstable and lead to imprecise estimates. For example, income and education may be collinear: higher levels of education are generally associated with higher income. When one of these variables is known, the additional information provided by the other may be limited. One possible criterion is to consider two variables collinear when the correlation between them is greater than 0.70 (or less than -0.70) ¹⁵ and to discard one of them. Stepwise forward modeling procedures ¹⁶ provide robust criteria for the inclusion and exclusion of collinear covariates, provided they are applied appropriately. The criterion for inclusion/retention of variables should never be based exclusively on statistical significance, but rather on the researcher’s overall knowledge of conceptual relationships and biological plausibility. Furthermore, such decisions should never be delegated to the software but should always remain under the responsibility of the researcher.

It is also essential to assess the presence of interactions between covariates, a topic not addressed in this article. However, it is important to emphasize that omitting a relevant interaction will result in an incorrectly specified model, which may lead to erroneous conclusions.

Finally, it is necessary to assess whether the RV is randomly distributed in space or whether spatial dependence is present. This can be evaluated using indices that measure the degree of spatial autocorrelation, among which the global Moran’s I index, presented below, is the most widely used.

Zuur et al. ¹⁷ propose a protocol for exploratory analysis using R, which may be useful but is not sufficient on its own for conducting a comprehensive exploratory analysis.

RESIDUAL ANALYSIS

This procedure is performed after model fitting in order to verify whether the adopted model is appropriate for the data. To this end, it is necessary to examine the residuals. The residual (e_i) is defined as

$$e_{i} = Y_{i} - \hat{Y}_{i}$$

That is, the difference between the observed value (Y_i) and the corresponding fitted value (Ŷ_i).

The residual e_i can be interpreted as the observed error, in contrast to the true error ε_i, which is unknown and unobservable. The LRM assumes that the ε_i are independent and identically distributed as N(0, σ²). If the fitted model is appropriate, the e_i should reflect the properties assumed for the ε_i; that is, they should follow a normal distribution with mean zero and variance σ², present no outliers, and be independent (without autocorrelation). Residual analysis involves the construction of graphical and numerical diagnostic tools, fundamentally:

1. Scatter plot of residuals versus each covariate;
2. Scatter plot of residuals versus fitted values;
3. Histograms of residuals;
4. QQ-plot;
5. Moran’s I index.

Standardized residuals should be used, as they are more efficient than raw residuals, particularly for detecting outliers. In the diagrams listed as items 1 and 2, the residuals should be randomly dispersed around zero, with constant variance, no systematic patterns, and no influential outliers, with at least 95% of the values falling within ±1.96. Items 3 and 4 (histogram and QQ-plot) should indicate an approximately normal distribution. Moran’s I index, listed as item 5, should indicate the absence of spatial autocorrelation.
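The numerical part of this check can be sketched as follows, in Python with the standard library only. The data are hypothetical, the model has a single covariate to keep the algebra explicit, and the standardization used is the usual internally studentized form (residual divided by its estimated standard deviation, which depends on the leverage); other definitions of standardized residuals exist and software may differ.

```python
import math

# Hypothetical data: one covariate x and a response y for 10 area units.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least squares fit of y = intercept + slope * x.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * xi for xi in x]

# Raw residuals: observed value minus fitted value.
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Residual variance estimate with n - 2 degrees of freedom (two coefficients).
s2 = sum(e ** 2 for e in residuals) / (n - 2)

# Leverage of each observation in simple linear regression.
leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

# Standardized (internally studentized) residuals.
standardized = [e / math.sqrt(s2 * (1 - h)) for e, h in zip(residuals, leverage)]

# Share of standardized residuals inside the +/- 1.96 band.
share_within = sum(abs(r) <= 1.96 for r in standardized) / n
print(f"slope = {slope:.3f}, share within +/-1.96 = {share_within:.2f}")
```

The plots in items 1 to 4 would then be drawn from `standardized`, `fitted`, and the covariate values.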

GLOBAL MORAN INDEX (MORAN I)

Moran’s I ¹⁸ measures the correlation between the values of a variable and the values observed in its neighborhoods. Cliff and Ord ¹⁹ were the first to apply this index to regression residual analysis. To construct it, it is necessary to identify, for each SAU, its respective neighbors. The definition of neighborhood may be established by contiguity or by distance. The contiguity criterion, more commonly used in ecological designs, defines as neighbors the areas that share a common side (rook) or a common side or vertex (queen). The distance criterion considers the distance between the centroid of a SAU and those of neighboring areas and can be constructed in two ways: by defining as neighbors all units within a previously fixed maximum distance, or by defining as neighbors the k closest units. In both cases, the parameters are defined by the researcher.
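The difference between the rook and queen criteria can be made concrete with a small sketch. It assumes a hypothetical regular grid of square cells, numbered row by row, rather than real census tracts; the function `grid_neighbors` is illustrative and not part of GeoDa or of any R package.

```python
def grid_neighbors(n_rows, n_cols, criterion="queen"):
    """Return {cell index: sorted neighbor indices} for a regular grid.

    rook  -> cells sharing a side
    queen -> cells sharing a side or a vertex
    """
    if criterion not in ("rook", "queen"):
        raise ValueError("criterion must be 'rook' or 'queen'")
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            found = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # a cell is not its own neighbor
                    if criterion == "rook" and dr != 0 and dc != 0:
                        continue  # diagonal cells share only a vertex
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        found.append(rr * n_cols + cc)
            neighbors[r * n_cols + c] = sorted(found)
    return neighbors

rook = grid_neighbors(3, 3, "rook")
queen = grid_neighbors(3, 3, "queen")

print("center cell, rook: ", rook[4])
print("center cell, queen:", queen[4])
print("corner cell, rook: ", rook[0])
print("corner cell, queen:", queen[0])
```

On this grid the queen criterion always yields at least as many neighbors as the rook criterion, which is one reason the choice of neighborhood matrix can change the value of the index.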

The index can be constructed using the RV or the residuals of the fitted LRM and is calculated as:

$$I = \frac{N}{W} \frac{\sum_{i = 1}^{N} \sum_{j = 1}^{N} w_{ij} (y_{i} - \bar{Y}) (y_{j} - \bar{Y})}{\sum_{i = 1}^{N} (y_{i} - \bar{Y})^{2}}$$

Where:

N is the number of SAUs;

y_i and y_j are the values of the RV (or of the regression model residual) for the pair (i, j) of SAUs, i = 1, ..., N, j = 1, ..., N;

w_ij are the spatial weights, which take the value 1 if i and j are neighboring SAUs, and 0 otherwise;

$W = \sum_{i = 1}^{N} \sum_{j = 1}^{N} w_{ij}$, that is, the sum of the weights w_ij;

$\bar{Y} = \frac{\sum_{i = 1}^{N} y_{i}}{N}$ is the sample mean of Y.

Other weighting schemes for w_ij may also be used, for example, $w_{ij} = l_{ij} / l_{i}$, where l_ij represents the length of the boundary between areas A_i and A_j, and l_i represents the perimeter of A_i.

Moran’s I ranges from -1 to 1. Values close to zero suggest the absence of spatial autocorrelation, indicating that the RV (or residuals) are randomly distributed in space, without clustering patterns. Values greater than zero indicate the presence of positive spatial autocorrelation, meaning that areas with high residual (or RV) values are geographically close to each other, as are areas with low values. Values less than zero indicate negative spatial autocorrelation, suggesting that areas with similar values are spatially dispersed and distant from each other. The hypothesis test for the index (in which H₀ represents the absence of spatial autocorrelation) is performed using a Monte Carlo approach, based on a large number of simulations to obtain the p-value ⁴ . If Moran’s I applied to the LRM residuals indicates the presence of spatial dependence, it will be necessary to fit a model capable of accommodating this structure.

Furthermore, it is important to distinguish between the use of Moran’s I applied to the RV and that applied to the residuals. Moran’s I calculated for the response variable identifies spatial autocorrelation prior to adjustment for the covariates, whereas Moran’s I for the residuals assesses whether autocorrelation persists after adjustment. Thus, a significant Moran’s I in the residuals indicates that the LRM with covariates did not fully capture the spatial structure and, therefore, that a spatial model may be required.
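The calculation of the index and the Monte Carlo test can be sketched as follows, in Python with the standard library only. The layout (units arranged in a line, each neighboring the previous and next one), the values, and the function names are hypothetical. The pseudo p-value is one-sided (it looks only for positive autocorrelation) and is obtained by randomly reassigning the observed values to the units, which is one common way of implementing the Monte Carlo approach and not necessarily the exact procedure used by GeoDa or R.

```python
import random

def chain_weights(n):
    """Binary contiguity matrix for n units arranged in a line (hypothetical layout)."""
    w = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = 1
        w[i + 1][i] = 1
    return w

def morans_i(values, w):
    """Global Moran's I for a list of values and a weight matrix w."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_w = sum(sum(row) for row in w)
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / total_w) * (num / den)

def permutation_test(values, w, n_sim=999, seed=42):
    """One-sided pseudo p-value for positive spatial autocorrelation."""
    rng = random.Random(seed)
    observed = morans_i(values, w)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # break any link between values and locations
        if morans_i(shuffled, w) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_sim + 1)

w4 = chain_weights(4)
i_gradient = morans_i([1, 2, 3, 4], w4)      # similar values side by side
i_alternating = morans_i([1, 4, 1, 4], w4)   # dissimilar values side by side

observed, p_value = permutation_test(list(range(1, 11)), chain_weights(10))
print(f"gradient: {i_gradient:.3f}, alternating: {i_alternating:.3f}")
print(f"10-unit gradient: I = {observed:.3f}, pseudo p-value = {p_value:.3f}")
```

The same functions can be applied either to the RV or to the residuals of the fitted LRM, which is the distinction drawn above.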

LRM WITH GLOBAL SPATIAL DEPENDENCE

Spatial dependence may be local or global. If spatial autocorrelation is heterogeneous, that is, if it varies across space, the dependence is local and requires a multiparameter model to accommodate it. Geographically weighted regression ²⁰ is one possible approach. If, however, there is no significant heterogeneity, the dependence may be assumed to be global and can be accommodated by a single parameter added to the LRM. In this article, only models with global effects are considered. The two most commonly used are the Spatial Autoregression (SAR) model and the Spatial Error Model (SEM). In the former, dependence is explicitly modeled in the RV by incorporating a spatial lag term; in the latter, it is treated as noise to be removed and modeled as a component of the residuals ²¹,²² .

The SAR model is formulated as:

Y = ρ W Y + X β + ε

Where:

Y, X, β, ε are the same as previously defined;

W is the neighborhood matrix containing the weights w_ij defined previously (w_ij = 1 if i and j are neighboring SAUs, and w_ij = 0 otherwise);

ρ is the spatial autoregressive coefficient to be estimated.

Spatial dependence is accommodated by the term ρWY, where WY corresponds to the spatially lagged RV and ρ corresponds to the spatial autoregressive coefficient associated with the lag.
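Solving the SAR equation for Y helps to show what the spatial lag implies. The following is a standard algebraic rearrangement, sketched under the assumption that (I_N - ρW) is invertible; the series expansion additionally assumes a row-standardized W and |ρ| < 1, conditions that are not stated in the text above.

```latex
Y = \rho W Y + X\beta + \varepsilon
\;\Longrightarrow\;
(I_{N} - \rho W)\,Y = X\beta + \varepsilon
\;\Longrightarrow\;
Y = (I_{N} - \rho W)^{-1}(X\beta + \varepsilon),
\qquad
(I_{N} - \rho W)^{-1} = I_{N} + \rho W + \rho^{2} W^{2} + \cdots
```

Under these assumptions, a change in a covariate in one SAU also reaches its neighbors (through ρW), the neighbors of its neighbors (through ρ²W²), and so on, so that β alone no longer summarizes the total effect of a covariate.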

SEM is formulated as:

Y = X β + u

u = λ W u + ε

Where:

Y, X, β, W, ε are the same as previously defined;

λ is the spatial autoregressive coefficient to be estimated.

Here, spatial dependence is accommodated in the error term u, which is decomposed into λWu (spatially lagged error corresponding to the component of the residuals with spatial dependence) and ε (normally distributed, independent errors with constant variance).
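The same rearrangement can be applied to the error equation of the SEM, again as a sketch that assumes (I_N - λW) is invertible:

```latex
u = \lambda W u + \varepsilon
\;\Longrightarrow\;
u = (I_{N} - \lambda W)^{-1}\varepsilon,
\qquad
Y = X\beta + (I_{N} - \lambda W)^{-1}\varepsilon,
\qquad
\operatorname{Var}(u) = \sigma^{2}\left[(I_{N} - \lambda W)^{\top}(I_{N} - \lambda W)\right]^{-1}
```

Since the errors ε have mean zero, E(Y) = Xβ in this model, so the coefficients keep the interpretation they have in the LRM; what changes is the covariance structure of the errors and, with it, the standard errors.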

In these models, values of ρ or λ close to zero indicate the absence of spatial autocorrelation; conversely, positive and statistically significant values indicate the presence of spatial dependence.

The choice of the most appropriate model is made using Lagrange Multiplier (LM) diagnostics, which test the hypotheses H₀: ρ = 0 and H₀: λ = 0. The corresponding test statistics (LM_ρ and LM_λ, respectively) are obtained using Lagrange multipliers ²² . If both hypotheses are rejected, the robust versions of the statistics are used, and the most suitable model is the one with the highest value of the robust LM statistic (RLM_ρ or RLM_λ).
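This selection rule can be written as a small function. The sketch below is an illustration only: the function name, argument names, the 5% significance level, and the example values are hypothetical, and the branches for the cases in which only one hypothesis (or none) is rejected follow the commonly used decision rule rather than anything stated explicitly in the text above.

```python
def choose_spatial_model(lm_lag_p, lm_error_p, rlm_lag_stat, rlm_error_stat, alpha=0.05):
    """Suggest LRM, SAR, or SEM from Lagrange Multiplier diagnostics.

    lm_lag_p, lm_error_p         -> p-values of LM_rho and LM_lambda
    rlm_lag_stat, rlm_error_stat -> robust statistics RLM_rho and RLM_lambda
    """
    lag_rejected = lm_lag_p < alpha      # H0: rho = 0 rejected
    error_rejected = lm_error_p < alpha  # H0: lambda = 0 rejected

    if not lag_rejected and not error_rejected:
        return "LRM"  # no evidence of spatial dependence: keep the LRM
    if lag_rejected and not error_rejected:
        return "SAR"
    if error_rejected and not lag_rejected:
        return "SEM"
    # Both rejected: compare the robust statistics (ties go to SEM arbitrarily).
    return "SAR" if rlm_lag_stat > rlm_error_stat else "SEM"

# Hypothetical diagnostics, not taken from the application example.
print(choose_spatial_model(0.01, 0.03, rlm_lag_stat=6.2, rlm_error_stat=1.4))
```

Such a function only encodes the rule; the final choice should still take the subject-matter context and the residual analysis into account.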

Once the most appropriate model has been identified, a new residual analysis is required, following items 1 to 5 of the residual analysis. If Moran’s I still indicates the presence of spatial dependence in the residuals, this suggests that the parameters ρ or λ were not able to fully capture the dependence. Before considering alternative types of models, it is recommended to select a more appropriate neighborhood matrix or include covariates that were initially omitted.

These procedures are shown in Figure 1.

APPLICATION EXAMPLE

The study by Fernandes et al. ⁸ used the urban census tracts of the municipality of Foz do Iguaçu as SAUs (data available at https://doi.org/10.6084/m9.figshare.27310467.v2) ²³ . After exploratory analysis, an LRM was fitted, with the prevalence of adolescent mothers as the RV and the Brazilian Deprivation Index (Índice Brasileiro de Privação - IBP) and the proportion of women heads of household (PWHH) as covariates. Moran’s I (estimated using the Queen neighborhood matrix) for the residuals was 0.059 (p=0.025), indicating the presence of spatial dependence.

Based on the LRM diagnostics, a SAR model was selected. The new residual analysis indicated a well-fitted model, and Moran’s I (0.011, p=0.915) showed that spatial dependence was adequately modeled. The estimates for both models are presented in Table 1. Notably, in the SAR model, there was a decrease in the effect of IBP and an increase in the width of its confidence interval. The step-by-step modeling procedures, using GeoDa ²² and R code ²⁴ , are presented in the supplementary material (https://github.com/LAES-USP/material-supl-regressao-espacial.git).

Table 1. Estimates of the Linear Regression and Spatial Autoregression models for the prevalence of adolescent mothers. Foz do Iguaçu (PR), Brazil, 2013-2019.

Explanatory variable	LRM: β	LRM: 95% CI	LRM: p-value	SAR: β	SAR: 95% CI	SAR: p-value
Brazilian Deprivation Index	4.13	(3.58; 4.68)	0.000	3.77	(3.13; 4.43)	0.000
Proportion of women household heads	0.09	(0.03; 0.14)	0.001	0.09	(0.03; 0.14)	0.000
Spatial autoregressive coefficient (ρ)	-	-	-	0.14	(0.00; 0.28)	0.055

Measure	LRM: value	LRM: p-value	SAR: value	SAR: p-value
Moran’s I	0.059	0.032	0.011	0.915
R²	0.43	-	0.44	-
AIC	2057.4	-	2055.8	-

LRM: Linear Regression Model; SAR: Spatial Autoregression.
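The comparison between the two fits can be checked with simple arithmetic on the values reported in Table 1. The sketch below uses only those published numbers; the variable and function names are illustrative.

```python
# Estimates for the Brazilian Deprivation Index and AIC, as reported in Table 1.
lrm = {"beta": 4.13, "ci": (3.58, 4.68), "aic": 2057.4}
sar = {"beta": 3.77, "ci": (3.13, 4.43), "aic": 2055.8}

def ci_width(model):
    """Width of the 95% confidence interval of the coefficient."""
    lower, upper = model["ci"]
    return upper - lower

width_lrm = ci_width(lrm)
width_sar = ci_width(sar)

# Relative change in the coefficient when moving from the LRM to the SAR model.
relative_change = (sar["beta"] - lrm["beta"]) / lrm["beta"]

# Positive difference means the SAR model has the lower AIC.
aic_difference = lrm["aic"] - sar["aic"]

print(f"CI width: LRM = {width_lrm:.2f}, SAR = {width_sar:.2f}")
print(f"Relative change in the coefficient: {relative_change:.1%}")
print(f"AIC difference (LRM - SAR): {aic_difference:.1f}")
```

The interval widens from about 1.10 to about 1.30 and the coefficient falls by roughly 9%, which is the decrease in effect and loss of precision described in the text, while the AIC of the SAR model is lower by about 1.6.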

Data availability declaration

The data used in this study are public and can be accessed at the figshare repository (https://doi.org/10.6084/m9.figshare.27310467.v2).

The supplementary material for this article is available in the GitHub repository (https://github.com/LAES-USP/material-supl-regressao-espacial.git). The content includes a detailed description of the spatial modeling procedures, the commands in R (version 4.3), and the illustrations generated in GeoDa (version 22), as well as the files necessary for reproducing the analyses.

Supplementary Material

Supplementary PDF: 1980-5497-rbepid-29-e260018-sppl.pdf (6.1 MB)

Funding Statement

FCN and GLB received support through the Research Grant program (Process 2023/10080-3), PMMB received a doctoral scholarship (Process 2020/12371-7), RGSP received a postdoctoral fellowship (Process 2021/10212-1) and an International Fellowship (Process 2024/01315-0), and GLB received an International Fellowship (Process 2024/13664-9), all granted by FAPESP. FCN received a 1C Scientific Productivity Research Fellowship (Process 304391/2022-0) from CNPq.

Footnotes

HOW TO CITE THIS ARTICLE: Conceição GMS, Bermudi PMM, Palasio RGS, Barbosa GL, Fernandes CM, Santana LMR, et al. Linear regression in ecological studies involving space: methodology and an application example in public health. Rev Bras Epidemiol. 2026; 29: e260018. https://doi.org/10.1590/1980-549720260018


REFERENCES

1. Galton FRS. Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland. 1886;15:246–263. doi: 10.2307/2841583.
2. Rooney A. A história da matemática. 1st ed. São Paulo: M.Books; 2012.
3. Stigler SM. The history of statistics: the measurement of uncertainty before 1900. Harvard: Harvard University Press; 1986.
4. Druck S, Carvalho MS, Câmara G, Monteiro AVM. Análise espacial de dados geográficos. 1st ed. Embrapa Cerrados; 2004.
5. Dormann CF, McPherson JM, Araújo MB, Bivand R, Bolliger J, Carl G, et al. Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography. 2007;30(5):609–628. doi: 10.1111/j.2007.0906-7590.05171.x.
6. Biblioteca Virtual em Saúde. Portal Regional da BVS. 2024. [Oct 31, 2024]. [Internet] Available at: https://pesquisa.bvsalud.org.
7. Figueiredo Filho D, Nunes F, Rocha EC, Santos ML, Batista M, Silva Júnior JA. O que fazer e o que não fazer com a regressão: pressupostos e aplicações do modelo linear de Mínimos Quadrados Ordinários (MQO). Revista Política Hoje. 2011;20(1):44–99.
8. Fernandes CM, Conceição GMS, Silva ZP, Nampo FK, Chiaravalloti-Neto F. Fatores socioeconômicos aumentam o risco de gravidez na adolescência: análise espacial e temporal em um município brasileiro. Rev Bras Epidemiol. 2024;27:e240040. doi: 10.1590/1980-549720240040.2.
9. Conceição GMS, Latorre MRDO. Brasil. Ministério da Saúde. Secretaria de Ciência, Tecnologia, Inovação e Complexo da Saúde. Departamento de Ciência e Tecnologia. Avaliação de impacto das políticas de saúde: um guia para o SUS. Brasília: Ministério da Saúde; 2023. Análise de regressão; pp. 576–614. [Internet] [cited on MMM DD, YYYY] Available at: http://bvsms.saude.gov.br/bvs/publicacoes/avaliacao_impacto_politicas_saude_guia_sus.pdf
10. Kutner MH, Nachtsheim C, Neter J, Li W. Applied linear statistical models. 5th ed. Boston: McGraw-Hill Irwin; 2005.
11. Paula GA. Modelos de regressão com apoio computacional. São Paulo: Universidade de São Paulo; 2013. [May 29, 2025]. [Internet] Available at: https://www.ime.unicamp.br/~%20cnaber/Livro_MLG.pdf.
12. Tobler WR. A computer movie simulating urban growth in the Detroit region. Econ Geogr. 1970;46(sup1):234–240. doi: 10.2307/143141.
13. Diniz CSG, Pellini ACG, Ribeiro AG, Tedardi MV, Miranda MJ, Touso MM, et al. Breast cancer mortality and associated factors in São Paulo State, Brazil: an ecological analysis. BMJ Open. 2017;7(8):e016395. doi: 10.1136/bmjopen-2017-016395.
14. Hastie TJ, Tibshirani RJ. Generalized additive models. 1st ed. New York: Routledge; 2017.
15. Dormann CF, Elith J, Bacher S, Buchmann C, Carl G, Carré G, et al. Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography. 2013;36(1):27–46. doi: 10.1111/j.1600-0587.2012.07348.x.
16. Kleinbaum DG, Kupper LL, Nizam A, Rosenberg ES. Applied regression analysis and other multivariable methods. 5th ed. Boston: Cengage Learning; 2013.
17. Zuur AF, Ieno EN, Elphick CS. A protocol for data exploration to avoid common statistical problems. Methods Ecol Evol. 2010;1(1):3–14. doi: 10.1111/j.2041-210X.2009.00001.x.
18. Moran PAP. Notes on continuous stochastic phenomena. Biometrika. 1950;37(1/2):17–23. doi: 10.2307/2332142.
19. Cliff A, Ord K. Testing for spatial autocorrelation among regression residuals. Geogr Anal. 1972;4(3):267–284. doi: 10.1111/j.1538-4632.1972.tb00475.x.
20. Comber A, Brunsdon C, Charlton M, Dong G, Harris R, Lu B, et al. A route map for successful applications of geographically weighted regression. Geogr Anal. 2023;55(1):155–178. doi: 10.1111/gean.12316.
21. Loonis V, Bellefon MP. Handbook of spatial analysis: theory and practical application with R. Montrouge Cedex: Institut National de la Statistique et des Etudes Economiques; 2018.
22. Anselin L, Rey SJ. Modern spatial econometrics in practice: a guide to GeoDa, GeoDaSpace and PySAL. 2014. [Aug 14, 2025]. [Internet] Available at: https://geodacenter.github.io/GeoDaSpace/
23. Fernandes C, Conceição GMS, Silva ZP, Nampo FK, Chiaravalloti-Neto F. Geospatial database of teenage pregnancy in a Brazilian municipality (SHP). Figshare; 2024.
24. R Core Team. R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2025. [Aug 14, 2025]. [Internet] Available at: https://www.r-project.org/

Rev Bras Epidemiol. 2026 Apr 20;29:e260018. [Article in Portuguese] doi: 10.1590/1980-549720260018.2

Linear regression in ecological studies involving space: methodology and an application example in public health

^IUniversidade de São Paulo, School of Public Health, Department of Epidemiology - São Paulo (SP), Brazil.

^IIInstituto Pasteur, São Paulo State Health Secretariat - São Paulo (SP), Brazil.

^IIIUniversidade Federal de São Paulo, Department of Medicine - São Paulo (SP), Brazil.

^IVUniversidade de São Paulo, School of Philosophy, Languages and Human Sciences - São Paulo (SP), Brazil.

^VUniversidade de São Paulo, Institute of Energy and Environment - São Paulo (SP), Brazil.

^{CORRESPONDING AUTHOR:}

Patricia Marques Moralejo Bermudi. Avenida Doutor Arnaldo, 715, Cerqueira César, CEP 01246-904, São Paulo (SP), Brasil. E-mail: patricia.bermudi@usp.br

CONFLICT OF INTERESTS: nothing to declare.

ASSOCIATE EDITOR: Nelson da Cruz Gouveia http://orcid.org/0000-0003-0625-0265

SCIENTIFIC EDITOR: Maria Fernanda Tourinho Peres http://orcid.org/0000-0002-7049-905X

Roles

Gleice Margarete de Souza Conceição: Project administration, Formal Analysis, Conceptualization, Data curation, Writing - original draft, Writing - review & editing, Investigation, Methodology, Software, Supervision, Validation, Visualization

Patricia Marques Moralejo Bermudi: Formal Analysis, Writing - original draft, Writing - review & editing, Investigation, Methodology, Software, Validation, Visualization

Raquel Gardini Sanches Palasio: Formal Analysis, Writing - original draft, Writing - review & editing, Investigation, Methodology, Software, Validation, Visualization

Gerson Laurindo Barbosa: Formal Analysis, Data curation, Writing - original draft, Writing - review & editing, Investigation, Methodology, Validation

Camila Meireles Fernandes: Formal Analysis, Data curation, Writing - original draft, Writing - review & editing, Investigation, Methodology, Software, Validation, Visualization

Lidia Maria Reis Santana: Formal Analysis, Data curation, Writing - original draft, Writing - review & editing, Investigation, Methodology, Software, Validation, Visualization

Ligia Vizeu Barrozo: Formal Analysis, Conceptualization, Writing - original draft, Writing - review & editing, Investigation, Methodology, Software, Validation, Visualization

Daiane Leite da Roza: Formal Analysis, Writing - original draft, Writing - review & editing, Investigation, Methodology, Validation

José Alberto Quintanilha: Formal Analysis, Writing - original draft, Writing - review & editing, Investigation, Methodology, Validation

Francisco Chiaravalloti-Neto: Project administration, Formal Analysis, Conceptualization, Writing - original draft, Writing - review & editing, Investigation, Methodology, Software, Validation, Visualization

ABSTRACT

Many health phenomena can be better understood when the geographic region in which they occur is taken into account. One of the most important points to consider in spatial designs is the presence of autocorrelation in observations measured across space. If this dependence is not adequately modeled, the resulting statistics may be biased, compromising the validity of conclusions about the presence or absence of associations. Methodologies built on the linear regression model can accommodate this dependence adequately, yielding precise, robust, and unbiased estimates. To highlight the applicability of spatial models and the care required in data analysis, this article describes, step by step, one of the most commonly used methodologies for analyzing spatial data, as well as the precautions needed to avoid modeling errors and distorted results. We present the linear regression model, the procedures for assessing goodness of fit, the most widely used measure for detecting spatial dependence, and two autoregressive models commonly applied to model this dependence (SAR and SEM). An application example is presented using the GeoDa and R software.

Keywords: Linear regression, Ecological studies, Spatial dependence, Spatial analysis, Spatial autocorrelation, Autoregressive models

INTRODUCTION

The term "regression" was introduced in 1885 by Francis Galton. While studying anthropometric characteristics of successive generations, Galton observed that children of parents who are tall relative to the mean tend to be tall, but shorter than their parents, and that children of parents who are short relative to the mean tend to be short, but taller than their parents. He therefore postulated that human stature tended to "regress" toward the mean ¹.

The Method of Least Squares, developed by Legendre and Gauss to determine the orbits of comets, made it possible to estimate the parameters of the regression proposed by Galton. Gauss pointed out that the method yields the best estimates when the errors are assumed to be random, independent, and normally distributed ²,³. This methodology was later refined and used to formulate the theory of Linear Regression Models (LRM). Technological advances reduced computing time and effort, allowing the technique to be applied to ever larger datasets. Today, LRMs are used for data analysis in a wide range of fields.

Building on this theory, new methodologies were developed, such as models for the analysis of spatial and temporal data. These can help establish the relationship between variables of interest and interrelated factors over time and space, allowing a deeper understanding of the phenomena under study and supporting the formulation of more effective public policies ⁴.

Observations measured in space and/or time are commonly correlated, and this dependence must be taken into account. Spatial regression models incorporate this dependence, yielding precise, robust, and unbiased estimates, and make it possible to identify and interpret the effects of the spatial factors that influence the phenomenon under study.

Conversely, disregarding the spatial dimension can compromise the validity of the conclusions. The main resulting problems include model misspecification, omission of important variables, biased estimates, and incorrect confidence intervals and p-values, which can lead to spurious associations and to imprecise or invalid conclusions ⁵.

Spatial regression models are therefore powerful tools for analyzing data observed in space. Software such as GeoDa and the R language has made them easier to apply. However, the success of these applications depends on knowledge of the theoretical foundations and an understanding of the problem under investigation.

Although the international scientific literature contains studies with a spatial approach, such studies are less frequent in Brazilian journals. A search of the Virtual Health Library (Biblioteca Virtual em Saúde, BVS), Saúde Pública Brasil ⁶, using the term "regressão linear" (linear regression) returned 2,458 articles in Portuguese published up to August 2025. Restricting the search to "regressão linear" and "espacial" (spatial) reduces the number to only 82.

According to Figueiredo Filho et al. ⁷, supporting materials that help in understanding spatial and spatiotemporal models, highlighting appropriate procedures and practices to avoid, are likewise scarce in Brazilian scientific journals. Our study aims to fill this gap by offering a comprehensive treatment of the importance and application of spatial models. The objective is to disseminate tools that help in choosing the most appropriate method for analyzing spatial data and to describe the biases that can arise when spatial dependence is ignored, thereby reducing the risk of mistaken conclusions and advancing the use of regression with spatial data. To this end, we present the LRM and its assumptions, the most widely used measure for detecting spatial dependence in model residuals, and two of the most widely used approaches for modeling this dependence. As an example, we present an application using the data of Fernandes et al. ⁸, who evaluated the spatial distribution of the prevalence of adolescent mothers and its relationship with socioeconomic indicators in the municipality of Foz do Iguaçu (PR).

As this is a methodological study, with no direct involvement of human subjects or use of identifiable individual data, submission to a Research Ethics Committee was not required.

THE LINEAR REGRESSION MODEL

LRMs are used to describe and quantify linear or linearizable associations and also to compute predictions. They can be used to assess trends in spatial, temporal, or spatiotemporal series. They involve a quantitative response variable (Y), preferably continuous, and one or more explanatory variables (X₁, X₂, ..., X_p), also called covariates, which may be quantitative or qualitative. The aim is to assess the joint effect of the covariates on the Response Variable (RV) and to approximate it by a mathematical function, which facilitates its description and quantification and, eventually, allows predictions to be made ⁹.

These models can be applied in spatial ecological designs, in which the geographic region of the events is fundamental to understanding the phenomenon under study. Both the structure of the database and the model adopted must therefore consider the spatial distribution of the data. In this type of design, the study area is partitioned into Spatial Area Units (SAU), which are generally based on pre-existing units such as municipalities and census tracts.

The LRM for spatial data analysis is similar to the classical model, except that the variables are measured per SAU and each row of the database refers to one of them. The model can be written as ¹⁰

Y = X β + ε

Where:

Y = (y₁, y₂, ..., y_N) is the (N×1) vector containing the RV values in each SAU;

N is the number of SAUs;

X is the (N×p) matrix containing the values of the p covariates in each SAU;

β is the (p×1) vector of parameters to be estimated, which quantify the effect of each explanatory variable on the response;

ε is an (N×1) vector of random errors.

The model assumes that ε ~ N(0, σ²I_N), where I_N is the identity matrix and (ε_i, ε_j) are independent for every pair of SAUs (i, j). These assumptions translate into four essential conditions: normality: the random errors follow a Normal distribution, which implies normality of the RV; homoscedasticity: the variance of the RV is constant across the values of the covariates; linearity: the relationship between the RV and the covariates is linear; independence: the RV values are independent of one another (uncorrelated) ¹⁰.
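As a compact complement to the model above, the least-squares estimator and the quantities used later in the residual analysis can be written as follows. This is the standard textbook result, not something specific to this article. It assumes that X has full column rank and that X includes a column of ones when an intercept is wanted, so that p counts the columns of X.

```latex
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} Y, \qquad
\hat{Y} = X \hat{\beta}, \qquad
e = Y - \hat{Y}, \qquad
\hat{\sigma}^{2} = \frac{e^{\top} e}{N - p}, \qquad
\operatorname{Var}(\hat{\beta}) = \sigma^{2} (X^{\top} X)^{-1}
```

The last expression holds only under ε ~ N(0, σ²I_N); when the errors are spatially correlated, the usual standard errors, confidence intervals, and p-values built from it are no longer valid.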

These assumptions must be checked before and after fitting the model, in the exploratory and residual analyses, respectively. If they are not satisfied, the results will be compromised. In the absence of linearity, the model will not reflect the behavior of the data and the estimates will be invalid. In the absence of normality, homoscedasticity, or independence, the variances of the estimators, the test statistics, and the p-values of hypothesis tests will be biased, compromising any conclusion about the presence or absence of associations. Finally, the presence of outliers can result in biased estimators and variances if their influence is large ¹⁰.

To work around problems of normality, linearity, and homoscedasticity, the following measures can be used: transformations of the RV (such as the logarithmic transformation), or fitting generalized linear models, generalized additive models, or Bayesian models. These consider distributions other than the normal, such as the Poisson and negative binomial, and forms other than linear for the relationship between the RV and the covariates ¹¹. If the observations are not independent and the data do not allow the dependence structure to be adequately specified, little can be done, which compromises any analysis. On the other hand, when such a structure exists (for example, across space), models that incorporate this dependence must be used. Models with spatial dependence belong to this category.

In spatial studies, the first law of Geography ¹² commonly applies: in space, everything is related, but things that are spatially closer are more related than things that are farther apart. This phenomenon is called spatial dependence or positive spatial autocorrelation ⁴.

Departures from the assumptions of normality and homoscedasticity, as well as spatial dependence in the RV, may be corrected once the covariates are included. If the residual of the fitted model is normal, random, homoscedastic, has few outliers (provided they do not distort the coefficient estimates), and is free of spatial dependence, the model is well fitted. One example is the study by Diniz et al. ¹³, which spatially modeled breast cancer mortality in the municipalities of the state of São Paulo. Although the rate showed spatial dependence, it was adequately accommodated by the covariates, and the residuals showed no departure from the LRM assumptions.

EXPLORATORY ANALYSIS

The first step of any statistical analysis is data exploration; it must be thorough and precede the application of any model. It is necessary to understand the relationship of the RV with each covariate and the relationships among the covariates. Below, we present the steps that an exploratory analysis should cover before fitting any LRM, so as to avoid commonly made errors.

First, each variable is described individually, according to its nature. Qualitative variables should be described using frequency tables and bar charts. For quantitative variables, numerical summary measures (mean, median, standard deviation, etc.) and a set of plots (histograms, box plots, and dot plots) are used. These provide information on the range of values taken and their frequency, distribution, central tendency and variability, and the presence of skewness and tails. If there are aberrant values (outliers), it is essential to assess their influence on the LRM estimates and, if necessary, to use appropriate control measures; they should be controlled, but never deleted. To assess how well the data fit a given distribution, QQ-plots should be used in addition to the histogram and normality tests. Only after the behavior of each variable is understood should their joint behavior be described, two at a time, and the presence of associations assessed, both between the RV and each covariate and among the covariates.

To describe the relationship between the RV and qualitative covariates, the response should be described by category of the qualitative variable, using the numerical summary measures and plots mentioned above. In particular, it is desirable that the variance of the RV be similar across the categories of the qualitative covariates. This can be assessed using tests for equality of variances ¹⁰.

For quantitative covariates, the scatter plot and the correlation coefficient should be examined together. The former shows the form of the relationship and, if it is linear, the latter expresses its strength. Two points should be considered. The first is linearity, one of the LRM assumptions. If the form of the relationship between the RV and the covariate is not linear, the LRM will not be applicable. In that case, it may be convenient to convert the quantitative variable into a qualitative one (for example, using age groups instead of age). The cut-off values that define the categories depend on the researcher's knowledge; if there is no prior knowledge, it is convenient to define them so that each category contains roughly the same number of observations. Another alternative is to use polynomials or smoothers ¹⁰,¹⁴. The second point is that the correlation coefficient measures only linear associations. It should be examined together with the scatter plot, never in isolation. Moreover, a high coefficient does not imply a linear association; likewise, a low coefficient does not necessarily imply the absence of association, since it can occur when the association is not linear.

It is also necessary to understand the relationships among the covariates (using the same tools described above) and to assess the presence of collinearity, which occurs when they are strongly correlated. Collinear covariates make the coefficients unstable and produce estimates that are not robust. For example, information on income and education may be collinear: the higher the education, the higher the income. When one of them is known, the additional information provided by the other may be irrelevant. One possible criterion is to consider two variables collinear if the correlation between them is greater than 0.70 (or less than -0.70) ¹⁵ and to discard one of them. Forward stepwise modeling procedures ¹⁶ offer robust criteria for including and excluding collinear covariates, provided they are applied properly: the criterion for including or retaining variables should never be statistical significance alone, but should draw on the researcher's full knowledge of the conceptual relationships and biological plausibility. Furthermore, decisions should never be left to the software; they always belong to the researcher.
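The 0.70 screening criterion can be sketched in a few lines of Python. This is only an illustration: the covariate names and values below are hypothetical and are not taken from the article's dataset, and the function flags pairs for the researcher to review rather than dropping anything automatically.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def collinear_pairs(covariates, threshold=0.70):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for a, b in combinations(sorted(covariates), 2):
        r = pearson(covariates[a], covariates[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical values for five spatial units (illustration only).
covariates = {
    "education": [4.0, 6.0, 8.0, 10.0, 12.0],
    "income": [1.1, 1.9, 3.2, 3.9, 5.1],
    "women_head": [30.0, 22.0, 35.0, 25.0, 31.0],
}
flagged = collinear_pairs(covariates)
print(flagged)  # only the education/income pair exceeds the threshold
```

Which member of a flagged pair to keep remains a substantive decision, in line with the text's point that the choice belongs to the researcher and not to the software.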

It is also essential to assess the presence of interactions between covariates, a topic not covered in this article. Nevertheless, it must be stressed that omitting an important interaction will leave the model incorrectly specified, leading to mistaken conclusions.

Finally, one should assess whether the distribution of the RV is random in space or whether there is spatial dependence. This can be done using indices that measure the degree of spatial autocorrelation, the most widely used being the global Moran index, presented below.

Zuur et al. ¹⁷ propose a protocol for exploratory analysis using R, which can be useful but is not sufficient on its own for carrying out the exploratory analysis.

RESIDUAL ANALYSIS

Residual analysis is performed after the model is fitted to check whether the adopted model is appropriate for the data. To do so, its residuals must be investigated. The residual (e_i) is defined as

e_{i} = Y_{i} - {\hat{Y}}_{i}

That is, the difference between the observed value (Y_i) and the corresponding fitted value (Ŷ_i).

The residual e_i can be seen as the observed error, as opposed to the true error ε_i, which is unknown and unobservable. The LRM assumes that the ε_i are independent and identically distributed as N(0, σ²). If the fitted model is appropriate, the e_i should reflect the properties assumed for the ε_i, that is, they should follow a N(0, σ²) distribution, have no outliers, and be independent (no autocorrelation). Residual analysis involves constructing plots and numerical measures, chiefly:

1. Scatter plot of the residuals versus each covariate;
2. Scatter plot of the residuals versus the fitted values;
3. Histograms of the residuals;
4. QQ-plot;
5. Moran index.

Standardized residuals should be used, because they are more efficient than raw residuals, especially for detecting outliers. In the plots listed as items 1 and 2, the residuals should be randomly scattered around zero, with constant variance, no systematic patterns, and no influential outliers, with at least 95% of the values lying within ±1.96. Items 3 and 4 (histogram and QQ-plot) should show an approximately normal distribution. The Moran index, listed as item 5, should indicate the absence of spatial autocorrelation.
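A minimal sketch of the ±1.96 check, assuming a single covariate and the internally studentized form e_i / (s·sqrt(1 − h_ii)) as the definition of "standardized residual" (the article does not say which standardization it uses). The data are hypothetical.

```python
import math

def standardized_residuals(x, y):
    """Fit y = b0 + b1*x by least squares; return (b0, b1, standardized residuals)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    raw = [b - (b0 + b1 * a) for a, b in zip(x, y)]   # e_i = Y_i - fitted value
    s2 = sum(e ** 2 for e in raw) / (n - 2)            # residual variance (2 parameters)
    lev = [1 / n + (a - mx) ** 2 / sxx for a in x]     # leverage h_ii
    std = [e / math.sqrt(s2 * (1 - h)) for e, h in zip(raw, lev)]
    return b0, b1, std

# Hypothetical data for eight spatial units (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
b0, b1, std = standardized_residuals(x, y)

# Share of standardized residuals lying within +/-1.96.
share_inside = sum(abs(r) <= 1.96 for r in std) / len(std)
print(round(b1, 3), share_inside)
```

With several covariates the leverages come from the diagonal of the hat matrix X(XᵀX)⁻¹Xᵀ instead of the closed form used here.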

GLOBAL MORAN INDEX (MORAN'S I)

Moran's I ¹⁸ measures the correlation between the values of a variable and its values in neighboring areas. Cliff and Ord ¹⁹ were the first to apply it to the analysis of regression residuals. To construct it, the neighbors of each SAU must be identified. Neighborhood can be defined by contiguity or by distance. The former is more widely used in ecological designs and treats as neighbors of an SAU the areas that share a common side (rook) or a common side or vertex (queen). Distance-based neighborhood considers the distance between the centroid of an SAU and those of the neighboring areas and can be constructed in two ways: treating as neighbors all units within a previously fixed maximum distance, or treating as neighbors the k nearest units. Both parameters are defined by the researcher.
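The difference between the rook and queen criteria is easiest to see on a regular grid. The sketch below uses a grid of square cells as a stand-in for real SAUs; that is an assumption made for illustration, since census tracts are irregular polygons whose contiguity would normally be derived with GIS software such as GeoDa.

```python
def grid_neighbors(n_rows, n_cols, criterion="queen"):
    """Contiguity neighbors for the cells of a regular grid, keyed by (row, col)."""
    if criterion == "rook":
        # Cells sharing a side.
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif criterion == "queen":
        # Cells sharing a side or a vertex.
        steps = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
    else:
        raise ValueError("criterion must be 'rook' or 'queen'")
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            neighbors[(r, c)] = [
                (r + dr, c + dc)
                for dr, dc in steps
                if 0 <= r + dr < n_rows and 0 <= c + dc < n_cols
            ]
    return neighbors

rook = grid_neighbors(3, 3, "rook")
queen = grid_neighbors(3, 3, "queen")
print(len(rook[(1, 1)]), len(queen[(1, 1)]))  # centre cell: 4 8
print(len(rook[(0, 0)]), len(queen[(0, 0)]))  # corner cell: 2 3
```

Queen contiguity always yields at least as many neighbors as rook, so the choice changes the weight matrix and can change the value of Moran's I computed from it.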

The index can be constructed using the RV or the residuals of the fitted LRM, and is calculated as:

I = \frac{N}{W} \frac{\sum_{i = 1}^{N} \sum_{j = 1}^{N} w_{i j} (y_{i} - \bar{Y}) (y_{j} - \bar{Y})}{\sum_{i = 1}^{N} {(y_{i} - \bar{Y})}^{2}}

Where:

N is the number of SAUs;

y_i and y_j are the values of the RV (or of the regression model residual) for the pair (i, j) of SAUs, i=1,...,N, j=1,...,N;

w_ij are the spatial weights, which take the value 1 if i and j are neighboring SAUs, and 0 otherwise;

$W = \sum_{i = 1}^{N} \sum_{j = 1}^{N} w_{i j}$, that is, the sum of the weights w_ij;

$\bar{Y} = \frac{\sum_{i = 1}^{N} y_{i}}{N}$ is the sample mean of Y.

Other choices for the weights w_ij can be used, for example $w_{ij} = l_{ij} / l_{i}$, where l_ij is the length of the border between A_i and A_j and l_i is the perimeter of A_i.

Moran's I takes values between -1 and 1. Values close to zero suggest the absence of spatial autocorrelation, indicating that the RV (or the residual) is randomly distributed in space, with no clustering patterns. Values greater than zero point to positive spatial autocorrelation, indicating that areas with high values of the RV (or residual) are geographically close to one another, as are areas with low values. Values less than zero point to negative spatial autocorrelation, indicating that areas with similar values are dispersed, far from one another. The hypothesis test for the index (in which H₀ denotes the absence of spatial autocorrelation) is carried out using a Monte Carlo strategy, with a large number of simulations used to obtain the p-value ⁴. If the Moran index applied to the LRM residuals points to spatial dependence, a model that can accommodate it will have to be fitted.
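The index and its Monte Carlo test can be sketched as follows, using binary weights as in the definition above. The permutation scheme (randomly reassigning the observed values to the units) and the one-sided pseudo p-value are one common way to carry out the simulation and are assumptions here; the article does not spell out the exact procedure used by GeoDa or R. The six-unit "map" and its values are hypothetical.

```python
import random

def morans_i(values, weights):
    """Global Moran's I for a list of values and an N x N weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)   # W, the sum of all weights
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d ** 2 for d in dev)
    return (n / w_sum) * (num / den)

def permutation_p_value(values, weights, n_sim=999, seed=0):
    """Observed I and a one-sided pseudo p-value for positive autocorrelation."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)                  # break any spatial arrangement
        if morans_i(shuffled, weights) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_sim + 1)

# Six hypothetical units along a line; unit i neighbors units i-1 and i+1.
n = 6
weights = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
clustered = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]     # similar values sit side by side
alternating = [1.0, 9.0, 2.0, 8.0, 3.0, 7.0]   # high and low values alternate

i_clustered, p_clustered = permutation_p_value(clustered, weights)
i_alternating = morans_i(alternating, weights)
print(round(i_clustered, 3), round(i_alternating, 3))  # 0.662 -0.972
```

The same function applies unchanged to regression residuals: pass the residual vector as `values`.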

It is also important to distinguish the Moran index applied to the RV from the index applied to the residuals. Moran's I of the response variable identifies spatial autocorrelation prior to the multiple adjustment, whereas Moran's I of the residuals assesses whether autocorrelation remains after the adjustment. A significant Moran's I in the residuals thus indicates that the multiple model did not fully capture the spatial structure and that a spatial model may be needed.

LRM WITH GLOBAL SPATIAL DEPENDENCE

Spatial dependence can be local or global in nature. If spatial autocorrelation is heterogeneous, that is, varies from place to place, the dependence is local, and a model with multiple parameters must be fitted to accommodate it. Geographically weighted regression ²⁰ is one possible approach, but if there is no great heterogeneity the dependence can be assumed to be global and accommodated by a single parameter added to the LRM. In this article, we deal only with models with global effects. The two most commonly used are the autoregressive models SAR (Spatial Autoregression) and SEM (Spatial Error Model). In the former, the dependence is explicitly modeled in the RV by incorporating a spatial lag term; in the latter, it is regarded as noise to be removed and treated as a component of the residuals ²¹,²².

The SAR is formulated as:

Y = ρ W Y + X β + ε

Where:

Y, X, β, ε are as previously defined;

W is the neighborhood matrix containing the weights w_ij defined earlier (w_ij = 1 if i and j are neighboring SAUs, and w_ij = 0 otherwise);

ρ is the spatial autoregressive coefficient to be estimated.

The spatial dependence is accommodated by the term ρWY, where WY is the spatially lagged RV and ρ is the spatial autoregressive coefficient associated with the lag.
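Because Y appears on both sides of the SAR equation, it can help to solve for Y. The short derivation below assumes that the matrix I_N − ρW is invertible, a condition not stated in the text:

```latex
(I_N - \rho W)\, Y = X\beta + \varepsilon
\quad\Longrightarrow\quad
Y = (I_N - \rho W)^{-1} (X\beta + \varepsilon)
```

With ρ = 0 this reduces to the LRM, Y = Xβ + ε. For ρ ≠ 0, the response in each SAU depends on the covariates and errors of other SAUs through (I_N − ρW)⁻¹, which is why ordinary least squares is not appropriate for this model.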

The SEM is formulated as:

Y = X β + u

u = λ W u + ε

Where:

Y, X, β, W, ε are as previously defined;

λ is the spatial autoregressive coefficient to be estimated.

Here, the spatial dependence is accommodated in the error term u, which is decomposed into λWu (the spatially lagged error, representing the part of the residuals with spatial dependence) and ε (normally distributed errors, with constant variance and independent).

In these models, values of ρ or λ close to zero indicate the absence of spatial autocorrelation, whereas values greater than zero that are statistically significant indicate spatial dependence.

The most appropriate model is chosen using the Lagrange Multiplier (LM) diagnostics, testing the hypotheses H₀ρ: ρ=0 and H₀λ: λ=0. The test statistics (LM_ρ and LM_λ, respectively) are obtained by means of Lagrange multipliers ²². If both hypotheses are rejected, robust statistics are used, and the most appropriate model is the one with the largest value of the robust LM statistic (RLM_ρ or RLM_λ).
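The selection rule can be written as a small decision function. Only the branch in which both hypotheses are rejected comes from the text; the other three branches (one rejection, or none) are assumptions that follow the usual reading of the LM diagnostics and should be checked against the reference cited above. The function name, the 0.05 significance level, and the example numbers are all hypothetical.

```python
def choose_spatial_model(p_lm_rho, p_lm_lambda, rlm_rho, rlm_lambda, alpha=0.05):
    """Pick 'LRM', 'SAR' or 'SEM' from LM test p-values and robust LM statistics."""
    reject_rho = p_lm_rho < alpha
    reject_lambda = p_lm_lambda < alpha
    if reject_rho and reject_lambda:
        # Both rejected: compare the robust statistics, as described in the text.
        # A tie is resolved in favor of SEM here, which is an arbitrary choice.
        return "SAR" if rlm_rho > rlm_lambda else "SEM"
    if reject_rho:
        return "SAR"   # assumed branch: only H0: rho = 0 is rejected
    if reject_lambda:
        return "SEM"   # assumed branch: only H0: lambda = 0 is rejected
    return "LRM"       # assumed branch: neither is rejected, keep the non-spatial model

# Hypothetical diagnostics (not the values from the application example).
both = choose_spatial_model(0.01, 0.03, rlm_rho=5.2, rlm_lambda=1.1)
only_error = choose_spatial_model(0.40, 0.02, rlm_rho=0.3, rlm_lambda=4.0)
neither = choose_spatial_model(0.50, 0.60, rlm_rho=0.1, rlm_lambda=0.2)
print(both, only_error, neither)  # SAR SEM LRM
```

Whatever the function returns is only a starting point: the chosen model still needs the new residual analysis described next.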

Once the most appropriate model has been determined, a new residual analysis is required, following items 4 and 5. If Moran's I still points to spatial dependence in the residuals, then the parameter ρ or λ was unable to capture all of the dependence. Before turning to other types of modeling, it is advisable to choose a more suitable neighborhood matrix or to include covariates that were initially omitted.

These procedures are presented in Figure 1.

APPLICATION EXAMPLE

The study by Fernandes et al. ⁸ used the urban census tracts of the municipality of Foz do Iguaçu as SAUs (data available at https://doi.org/10.6084/m9.figshare.27310467.v2) ²³. After the exploratory analysis, an LRM was fitted with the prevalence of adolescent mothers as the RV and, as covariates, the Brazilian Deprivation Index (Índice Brasileiro de Privação, IBP) and the Proportion of Women Responsible for the Household (Proporção de Mulheres Responsáveis pelo Domicílio, PMRD). Moran's I for the residuals (estimated with a Queen neighborhood matrix) was 0.059 (p=0.025), indicating spatial dependence.

Based on the LM diagnostics, a SAR was fitted. The new residual analysis showed a well-fitted model, and Moran's I (0.011, p=0.915) indicated that the spatial dependence had been adequately modeled. The estimates of both models are given in Table 1. In the SAR, the effect of the IBP decreases and its confidence interval widens. The step-by-step modeling procedures using the GeoDa software ²² and the code used in R ²⁴ are presented in the supplementary material (https://github.com/LAES-USP/material-supl-regressao-espacial.git).

Table 1. Estimates of the Linear Regression and Spatial Autoregression models for the prevalence of adolescent mothers. Foz do Iguaçu (PR), Brazil, 2013-2019.

Explanatory variable	LRM: β	LRM: CI (β; 0.95)	LRM: p-value	SAR: β	SAR: CI (β; 0.95)	SAR: p-value
Brazilian Deprivation Index	4.13	(3.58; 4.68)	0.000	3.77	(3.13; 4.43)	0.000
Proportion of women responsible for the household	0.09	(0.03; 0.14)	0.001	0.09	(0.03; 0.14)	0.000
Spatial autoregressive coefficient (ρ)	-	-	-	0.14	(0.00; 0.28)	0.055
Measure	LRM: value	-	LRM: p-value	SAR: value	-	SAR: p-value
Moran's I	0.059	-	0.032	0.011	-	0.915
R²	0.43	-	-	0.44	-	-
AIC	2057.4	-	-	2055.8	-	-


LRM: Linear Regression Model; SAR: Spatial Autoregression.
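The comparison between the two models can be checked with simple arithmetic on the values reported in Table 1. The snippet below is only a reading aid: the dictionary layout and the variable names are hypothetical, and the numbers are transcribed from the IBP and AIC rows of the table.

```python
# Values transcribed from Table 1 (IBP coefficient, its 95% CI, and the AIC).
lrm = {"beta": 4.13, "ci": (3.58, 4.68), "aic": 2057.4}
sar = {"beta": 3.77, "ci": (3.13, 4.43), "aic": 2055.8}


def ci_width(model):
    """Width of the confidence interval stored in the model dictionary."""
    low, high = model["ci"]
    return high - low


beta_change_pct = 100 * (sar["beta"] - lrm["beta"]) / lrm["beta"]
width_lrm = ci_width(lrm)
width_sar = ci_width(sar)
delta_aic = lrm["aic"] - sar["aic"]

print(f"IBP effect change: {beta_change_pct:.1f}%")               # -8.7%
print(f"CI width: LRM {width_lrm:.2f}, SAR {width_sar:.2f}")      # 1.10 vs 1.30
print(f"AIC difference (LRM - SAR): {delta_aic:.1f}")             # 1.6
```

The IBP coefficient drops by about 8.7% and its interval widens from 1.10 to 1.30, which matches the description in the text, while the AIC of the SAR is 1.6 points lower.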

Data availability statement

The dataset supporting the results of this study is publicly available in the figshare repository (https://doi.org/10.6084/m9.figshare.27310467.v2).

The supplementary material of the article is available in the GitHub repository (https://github.com/LAES-USP/material-supl-regressao-espacial.git). It includes a detailed description of the spatial modeling procedures, the R commands (version 4.3), and the illustrations generated in GeoDa (version 22), as well as the files needed to reproduce the analyses.

Supplementary material

Supplementary PDF

1980-5497-rbepid-29-e260018-sppl.pdf (6.1 MB, pdf)

Footnotes

HOW TO CITE THIS ARTICLE: Conceição GMS, Bermudi PMM, Palasio RGS, Barbosa GL, Fernandes CM, Santana LMR, et al. Linear regression in ecological studies involving space: methodology and an application example in public health. Rev Bras Epidemiol. 2026; 29: e260018. https://doi.org/10.1590/1980-549720260018.2

FUNDING: FCN and GLB received support through a Research Grant (Process 2023/10080-3), PMMB received a doctoral scholarship (Process 2020/12371-7), RGSP received a postdoctoral fellowship (Process 2021/10212-1) and a Scholarship Abroad (Process 2024/01315-0), and GLB received a Scholarship Abroad (Process 2024/13664-9), all awarded by Fapesp. FCN received a Scientific Research Productivity fellowship, level 1C (Process 304391/2022-0), from CNPq.

REFERENCES

1. Galton FRS. Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland. 1886;15:246–263. doi: 10.2307/2841583.

2. Rooney A. A história da matemática. 1st ed. São Paulo: M.Books; 2012.

3. Stigler SM. The history of statistics: the measurement of uncertainty before 1900. Harvard: Harvard University Press; 1986.

4. Druck S, Carvalho MS, Câmara G, Monteiro AVM. Análise espacial de dados geográficos. 1st ed. Embrapa Cerrados; 2004.

5. Dormann CF, McPherson JM, Araújo MB, Bivand R, Bolliger J, Carl G, et al. Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography. 2007;30(5):609–628. doi: 10.1111/j.2007.0906-7590.05171.x.

6. Biblioteca Virtual em Saúde. Portal Regional da BVS [Internet]. 2024 [cited Oct 31, 2024]. Available at: https://pesquisa.bvsalud.org

7. Figueiredo Filho D, Nunes F, Rocha EC, Santos ML, Batista M, Silva Júnior JA. O que fazer e o que não fazer com a regressão: pressupostos e aplicações do modelo linear de Mínimos Quadrados Ordinários (MQO). Revista Política Hoje. 2011;20(1):44–99.

8. Fernandes CM, Conceição GMS, Silva ZP, Nampo FK, Chiaravalloti-Neto F. Fatores socioeconômicos aumentam o risco de gravidez na adolescência: análise espacial e temporal em um município brasileiro. Rev Bras Epidemiol. 2024;27:e240040. doi: 10.1590/1980-549720240040.2.

9. Conceição GMS, Latorre MRDO. Análise de regressão. In: Brasil. Ministério da Saúde. Secretaria de Ciência, Tecnologia, Inovação e Complexo da Saúde. Departamento de Ciência e Tecnologia. Avaliação de impacto das políticas de saúde: um guia para o SUS [Internet]. Brasília: Ministério da Saúde; 2023. pp. 576–614 [cited on MMM DD, YYYY]. Available at: http://bvsms.saude.gov.br/bvs/publicacoes/avaliacao_impacto_politicas_saude_guia_sus.pdf

10. Kutner MH, Nachtsheim C, Neter J, Li W. Applied linear statistical models. 5th ed. Boston: McGraw-Hill Irwin; 2005.

11. Paula GA. Modelos de regressão com apoio computacional [Internet]. São Paulo: Universidade de São Paulo; 2013 [cited May 29, 2025]. Available at: https://www.ime.unicamp.br/~%20cnaber/Livro_MLG.pdf

12. Tobler WR. A computer movie simulating urban growth in the Detroit region. Econ Geogr. 1970;46(sup1):234–240. doi: 10.2307/143141.

13. Diniz CSG, Pellini ACG, Ribeiro AG, Tedardi MV, Miranda MJ, Touso MM, et al. Breast cancer mortality and associated factors in São Paulo State, Brazil: an ecological analysis. BMJ Open. 2017;7(8):e016395. doi: 10.1136/bmjopen-2017-016395.

14. Hastie TJ, Tibshirani RJ. Generalized additive models. 1st ed. New York: Routledge; 2017.

15. Dormann CF, Elith J, Bacher S, Buchmann C, Carl G, Carré G, et al. Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography. 2013;36(1):27–46. doi: 10.1111/j.1600-0587.2012.07348.x.

16. Kleinbaum DG, Kupper LL, Nizam A, Rosenberg ES. Applied regression analysis and other multivariable methods. 5th ed. Boston: Cengage Learning; 2013.

17. Zuur AF, Ieno EN, Elphick CS. A protocol for data exploration to avoid common statistical problems. Methods Ecol Evol. 2010;1(1):3–14. doi: 10.1111/j.2041-210X.2009.00001.x.

18. Moran PAP. Notes on continuous stochastic phenomena. Biometrika. 1950;37(1/2):17–23. doi: 10.2307/2332142.

19. Cliff A, Ord K. Testing for spatial autocorrelation among regression residuals. Geogr Anal. 1972;4(3):267–284. doi: 10.1111/j.1538-4632.1972.tb00475.x.

20. Comber A, Brunsdon C, Charlton M, Dong G, Harris R, Lu B, et al. A route map for successful applications of geographically weighted regression. Geogr Anal. 2023;55(1):155–178. doi: 10.1111/gean.12316.

21. Loonis V, Bellefon MP. Handbook of spatial analysis: theory and practical application with R. Montrouge Cedex: Institut National de la Statistique et des Etudes Economiques; 2018.

22. Anselin L, Rey SJ. Modern spatial econometrics in practice: a guide to GeoDa, GeoDaSpace and PySAL [Internet]. 2014 [cited Aug 14, 2025]. Available at: https://geodacenter.github.io/GeoDaSpace/

23. Fernandes C, Conceição GMS, Silva ZP, Nampo FK, Chiaravalloti-Neto F. Geospatial database of teenage pregnancy in a Brazilian municipality (SHP). Figshare; 2024. doi: 10.6084/m9.figshare.27310467.v2.

24. R Core Team. R: A language and environment for statistical computing [Internet]. Vienna: R Foundation for Statistical Computing; 2025 [cited Aug 14, 2025]. Available at: https://www.r-project.org/

Figure 1. Workflow of procedures for fitting an LRM with a global spatial component ²² .
