Scientific Reports. 2026 Mar 13;16:11243. doi: 10.1038/s41598-026-42068-z

Long-term associations between animal-source food consumption and breast and prostate cancer incidence based on cointegration and ARIMAX models

Alessia Spada 1, Michele Tomaiuolo 2, Elisa Pia Amorusi 2, Nicholas Calà 2, Raffaele Ianzano 2, Pasquale Ieluzzi 2, Giovanni Emanuele Ricciardi 3,4, Antonio Tucci 2
PMCID: PMC13046939  PMID: 41826475

Abstract

Understanding whether and to what extent dietary habits influence the long-term incidence of hormone-sensitive cancers is undoubtedly a significant challenge. Conventional approaches, which are often limited to studies with short follow-up periods or methods that do not consider lag times, risk producing spurious associations and consequently lead to weak conclusions. In this study, we analyzed exceptionally long Italian national time series (1961–2020 for meat and dairy consumption; 1984–2020 for cancer incidence) to investigate the association between diet and the development of breast and prostate cancer. Initially, to avoid the risk of multicollinearity between variables, dairy and meat consumption data were summarized into a single index (PC1), obtained using principal components analysis (PCA). We then applied a rigorous econometric framework to investigate long-term dynamics. The first step consisted of testing for cointegration between PC1 and the cancer incidence series. The second step was ARIMAX modeling, with PC1 included as an exogenous variable at a lag that minimizes the AICc criterion. The study revealed evidence of cointegration between consumption and cancer incidence for both cancers, i.e., a long-term equilibrium. For breast cancer, the optimal ARIMAX model (0,0,1) identified a positive and highly significant association with PC1 at a lag of 18 years (β = 0.108, p < 0.001). For prostate cancer, an identically structured model (0,0,1) showed an even stronger and highly significant association at 15 years (β = 0.384, p < 0.001). Both models passed the diagnostic tests, confirming their validity and statistical robustness. These findings provide consistent quantitative evidence of long-term association between animal product consumption and hormone-sensitive cancers. 
More broadly, the study highlights the relevance of econometric methodologies in cancer epidemiology and emphasizes their potential to deepen our understanding of how cumulative dietary exposures influence population health.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-026-42068-z.

Keywords: Cointegration, ARIMAX model, Dairy, Meat, Breast cancer, Prostate cancer

Subject terms: Cancer, Computational biology and bioinformatics, Diseases, Oncology, Risk factors

Introduction

In recent decades, the incidence of certain hormone-sensitive malignancies, such as breast and prostate cancer, has shown an exponential rise, in sharp contrast with the more linear increase observed for most other cancer types1. This divergence raises a crucial question: which factors make these diseases distinct from an epidemiological and biological perspective? To answer it, their hormone-dependent nature cannot be ignored.

However, although it is known that the development and progression of these cancers are influenced by steroid hormones, it seems unlikely that the endogenous endocrine profile alone, which remains essentially stable across generations, can fully explain the observed exponential increase. Instead, it is more plausible to hypothesize a substantial additional role due to other exogenous sources of hormonal stimulation: namely, environmental or dietary substances capable of mimicking or amplifying physiological signals. Within this context, estrogens assume a central role. Their involvement in the etiopathogenesis of breast cancer is well documented2–4, whereas in prostate cancer they have not traditionally been regarded as determining factors. However, recent evidence suggests that estrogens may act as early promoters of the neoplastic process, modulating the cellular and stromal environment during the initial phases of transformation, before yielding to the proliferative drive sustained by androgens2,5,6. Diet represents a primary route of exposure to bioactive molecules. Numerous epidemiological studies have suggested a link between meat and dairy consumption and a higher incidence of hormone-sensitive cancers7–9. However, most available investigations rely on time-limited observations, which are unable to capture the historical and cumulative dimension of such exposures. Conversely, Italian dietary trends provide a valuable case study. Since the 1920s, ISTAT has produced time series showing a sharp increase in meat and dairy consumption in Italy, particularly during the economic boom. This shift from a plant-based Mediterranean diet to one increasingly rich in animal products occurred at a faster pace than in other countries. For this reason, Italy is an ideal setting for applying advanced econometric methods, such as cointegration and ARIMAX, to explore the long-term relationship between dietary changes and the incidence of hormone-sensitive cancers. 
Therefore, this study focuses on the Italian context to exploit its distinctive trends and test the proposed econometric framework. Within this framework, a major limitation of traditional studies on the diet–cancer relationship becomes evident: although they have highlighted important associations, they fail to adequately quantify the temporal dimension and the actual extent of latency, both crucial for correctly interpreting population-level phenomena. The analysis of aggregated time series could provide a privileged tool to bridge this gap, yet it is complicated by statistical challenges that may obscure or distort true relationships. The frequent non-stationarity of socio-economic and health data, together with biological delays that can extend over decades, creates conditions prone to spurious regressions: apparently significant associations between variables with similar trends but lacking any real causal link. This study therefore aims to overcome these limitations by applying a rigorous econometric model to test and quantify the dynamic relationship between animal product consumption in Italy (1961–2020) and the incidence of two major hormone-sensitive cancers (1984–2020), breast and prostate. After testing cointegration, demonstrating that trends in food consumption and cancer incidence are not independent but linked by a long-term equilibrium, the analysis sought to estimate both the magnitude of the association and the number of years of lag required for these relationships to be robust.

Results

Descriptive analysis

Annual time series for cancer incidence (Breast, Prostate) were considered from 1984 to 2020 (n = 37), while meat and dairy consumption were analysed over a longer period, from 1961 to 2020 (n = 60), in order to account for potential lagged associations between dietary consumption and cancer incidence.

Log-transformed per capita meat consumption showed a mean of 3.94 and a standard deviation of 0.27, with values ranging from 3.16 to 4.21 (Table 1). Dairy consumption had a higher mean (5.42) and lower variability (standard deviation 0.19), indicating greater temporal stability than meat consumption (Table 1).

Table 1.

Descriptive statistics of the time series.

Variable | Temporal range | n | Mean | SD | Min | Max
Meat | 1961–2020 | 60 | 3.94 | 0.27 | 3.16 | 4.21
Dairy | 1961–2020 | 60 | 5.42 | 0.19 | 4.98 | 5.63
Consumption index PC1 | 1961–2020 | 60 | 0.00 | 1.38 | −3.70 | 1.34
Breast | 1984–2020 | 37 | 4.90 | 0.11 | 4.58 | 5.16
Prostate | 1984–2020 | 37 | 4.71 | 0.31 | 3.91 | 5.07

Regarding outcome variables, breast cancer incidence had a mean of 4.90 and low variability (standard deviation 0.11), ranging from 4.58 to 5.16. The log incidence of prostate cancer was slightly lower (mean 4.71), but with more pronounced fluctuations than breast cancer incidence (standard deviation 0.31), with values ranging from 3.91 to 5.07.

The time series of meat and dairy consumption were found to be highly correlated at lag 0 (r = 0.9076), making the application of PCA necessary. As reported in Supplementary Table S1, PCA proved to be highly effective: the first principal component (PC1) explained more than 95% of the total variance of the original variables. The PC1 loadings, equal to 0.707 for both variables, were positive and of identical magnitude, confirming that PC1 represents meat and dairy consumption in a balanced way. This validates its interpretation as a composite index of animal product consumption, avoiding information redundancy in the models. By construction, the average of PC1 is zero, while the standard deviation is 1.38, with values ranging from −3.70 to 1.34 (Table 1).
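The reported PCA figures can be cross-checked with a short calculation. The sketch below assumes that the PCA was run on the two standardized (log) series, i.e., on their 2 × 2 correlation matrix; this is an assumption, although it is consistent with the equal loadings of 0.707 and with the standard deviation of 1.38 reported in Table 1.

```latex
R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad r = 0.9076

\lambda_1 = 1 + r = 1.9076, \qquad \lambda_2 = 1 - r = 0.0924

v_1 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix} \approx \begin{pmatrix} 0.707 \\ 0.707 \end{pmatrix}

\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{1.9076}{2} = 0.9538, \qquad \operatorname{sd}(\mathrm{PC1}) = \sqrt{\lambda_1} \approx 1.381
```

Under that assumption, the share of variance explained (about 95.4%) and the standard deviation of PC1 (about 1.38) follow directly from the lag-0 correlation.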

In Figure 1, all series have been standardized (z-scores) to allow a direct visual comparison between the temporal trends of the series under consideration and the principal component PC1. In particular, PC1 shows a marked increasing trend from the early 1960s to the late 1980s, which then stabilizes in a plateau in the 1990s, followed by a slight decline. This evolution mirrors that observed in the historical series of dairy and meat consumption.

Fig. 1.

Standardized trends (z-scores) of meat consumption, dairy consumption, animal product consumption index (PC1), breast cancer incidence, and prostate cancer incidence (1961–2020).

The incidence of prostate cancer increased until the early 2000s, then declined slightly and then stabilized. These phases appear to be synchronized with changes in PC1 but occur with a temporal lag of more than a decade. Breast cancer incidence, after a brief stabilization around 2010, showed instead a rapid and pronounced acceleration in the last decade, reaching an unusual peak around 2019-2020. Here too, the upward trend seems to follow that of PC1, with a substantial temporal lag.

In summary, the trajectory of the composite consumption index appears to visibly anticipate that of the cancer incidence curves. This visual evidence supports the hypothesis of a long-term relationship with structural lag, to be assessed through rigorous statistical tools (cointegration and ARIMAX) in order to avoid the risk of spurious regressions.

Step 1. Analysis of the order of integration and cointegration testing

The ADF tests indicated that all three series (Prostate, Breast, PC1) were non-stationary in levels and integrated of order one, I(1) (Table 2), thereby allowing the analysis to proceed to the subsequent steps.

Table 2.

Augmented Dickey–Fuller tests in levels and first differences.

Time series | ADF value (levels) | p-value (levels) | Implication (levels) | ADF value (first differences) | p-value (first differences) | Implication (first differences)
PC1 | −1.5562 | 0.7542 | Not stationary | −4.6175 | 0.0000 | Stationary I(1)
Prostate | −0.0550 | 0.9900 | Not stationary | −4.148 | 0.0000 | Stationary I(1)
Breast | −1.9958 | 0.5749 | Not stationary | −7.235 | 0.0000 | Stationary I(1)

Before proceeding with ARIMAX-based dynamic modelling, the hypothesis of a cointegration relationship (namely, a long-term equilibrium between cancer incidence and the consumption index PC1) was tested using two approaches, the Engle–Granger test and the ARDL Bounds Test, to assess whether modelling in levels would be appropriate within the ARIMAX framework. As reported in Table 3, the Engle–Granger test found no evidence of cointegration for either of the examined relationships (p > 0.05), a result that may reflect the limited suitability of this test for short time series such as the cancer incidence series (n = 37). To overcome this limitation, the ARDL Bounds Test was applied (Table 3). This test, proposed by Pesaran et al. (2001), compares the F statistic with lower and upper critical values at the 5% significance level, under the null hypothesis (H0) of no cointegration. For the Breast ~ PC1 relationship, the F statistic (4.31) exceeded the upper 5% critical value (4.16), leading to the rejection of H0 and providing evidence of cointegration at the 5% level. The evidence was stronger for the Prostate ~ PC1 relationship, where the F statistic (7.86) substantially exceeded the upper bound. Thus, for both neoplasms, the ARDL test supported the existence of a stable long-term relationship with the PC1 composite index. This result supports the use of ARIMAX models estimated in levels (d = 0).

Table 3.

Cointegration test results for breast and prostate cancer incidence relative to PC1 consumption index.

Engle–Granger cointegration test
Model relationship | Test statistic (τ) | p-value | Conclusion (α = 5%)
Breast ~ PC1 | −1.8959 | >0.05 | No cointegration detected
Prostate ~ PC1 | −0.14417 | >0.05 | No cointegration detected
ARDL bounds cointegration test
Model relationship | Selected ARDL model | F-statistic | 5% critical value, lower bound I(0) | 5% critical value, upper bound I(1) | Conclusion
Breast ~ PC1 | ARDL(1,0) | 4.31 | 4.02 | 4.16 | Cointegration confirmed
Prostate ~ PC1 | ARDL(1,0) | 8.26 | 4.02 | 4.16 | Cointegration confirmed
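The decision rule of the bounds test can be written in a few lines. The following is a minimal sketch, not the authors' code: the function name `bounds_decision` is hypothetical, and the F statistics and 5% bounds are simply those listed in Table 3.

```python
def bounds_decision(f_stat, lower, upper):
    """Classify a bounds-test F statistic against the I(0) and I(1) critical bounds."""
    if f_stat > upper:
        return "cointegration"       # H0 of no cointegration is rejected
    if f_stat < lower:
        return "no cointegration"    # H0 cannot be rejected
    return "inconclusive"            # F falls between the two bounds

# F statistics as listed in Table 3, with the 5% bounds 4.02 (lower) and 4.16 (upper).
table3_f = {"Breast ~ PC1": 4.31, "Prostate ~ PC1": 8.26}
decisions = {
    name: bounds_decision(f, lower=4.02, upper=4.16)
    for name, f in table3_f.items()
}
print(decisions)
```

A statistic lying between the two bounds would leave the test inconclusive, which is why the breast cancer result, only slightly above the upper bound, is the more fragile of the two.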

Step 2. ARIMAX framework

Step 2a. Determination of structural lag (lag L)

Before defining the optimal ARIMAX model, the temporal lag (L) of the association between PC1 and each outcome variable was estimated. In line with the biological plausibility of carcinogenesis processes, lags within the 8–20-year interval were considered, as this range was deemed reasonable for capturing potential long-term associations between dietary consumption and cancer incidence. To this end, preliminary ARIMAX models were estimated for each lag L within the defined interval. Supplementary Figure S1 shows the variation of AICc as a function of the structural lag: for each outcome, the minimum of the curve identifies the optimal lag, representing the best compromise between goodness of fit and parsimony. For breast cancer, the lowest AICc value was observed at an 18-year lag, whereas for prostate cancer the minimum was found at 15 years. This lag-selection procedure also served as a sensitivity analysis: adjacent lag values produced a worse fit (a larger AICc), which supports the consistency of the estimated delayed association. Consequently, lags of 18 years for breast cancer and 15 years for prostate cancer were retained in the final ARIMAX model specifications.
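The lag search can be illustrated with a simplified sketch. It is an assumption-laden stand-in for the procedure described above: it replaces the preliminary ARIMAX fits with plain OLS regressions, uses synthetic data with a known lag of 15, and the helper names `aicc_ols` and `select_lag` are hypothetical. Because the exposure series starts 23 years before the outcome series, every lag from 8 to 20 uses the same 37 outcome observations, so the AICc values are comparable.

```python
import numpy as np

def aicc_ols(y, x):
    """AICc of the Gaussian OLS fit y = a + b*x (k = 3: intercept, slope, variance)."""
    n, k = len(y), 3
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    aic = n * np.log(rss / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)

def select_lag(y, x, offset, lags):
    """Return (best_lag, {lag: AICc}); y[i] is aligned in time with x[offset + i]."""
    idx = offset + np.arange(len(y))
    scores = {lag: aicc_ols(y, x[idx - lag]) for lag in lags}
    return min(scores, key=scores.get), scores

# Synthetic stand-ins: 60 exposure years, 37 outcome years, true lag of 15 years.
rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=60))      # hypothetical exposure index
offset = 23                             # outcome series starts 23 years later
y = 4.6 + 0.4 * x[offset - 15 + np.arange(37)] + rng.normal(scale=0.01, size=37)

best_lag, scores = select_lag(y, x, offset, range(8, 21))
print(best_lag)
```

Plotting `scores` against the lag would give a curve analogous to the one described for Supplementary Figure S1.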

Step 2b. ARIMAX Model Specification and Selection

The selection process for the best-performing models regarding both malignancies is detailed in Table 4.

Table 4.

Comparative analysis of ARIMAX model specifications for breast and prostate, based on information criteria and residual validation tests.

Model | ARIMAX(p,d,q) specification | AICc | Ljung–Box Q statistic | Ljung–Box p-value | Shapiro–Wilk statistic | Shapiro–Wilk p-value
Breast
A | (2,0,0) | −78.08 | 5.480 | >0.05 | 0.950 | >0.05
B | (0,0,1) | −79.25 | 3.530 | >0.05 | 0.973 | >0.05
C | (1,0,1) | −75.30 | 3.700 | >0.05 | 0.972 | >0.05
Prostate
A | (1,0,0) | −48.17 | 11.050 | >0.05 | 0.957 | >0.05
B | (0,0,1) | −51.02 | 11.150 | >0.05 | 0.968 | >0.05
C | (1,0,1) | −48.65 | 7.180 | >0.05 | 0.965 | >0.05

The choice was based on two hierarchical criteria:

  1. Statistical validity, verified through the Ljung–Box test (H₀: no residual autocorrelation) and the Shapiro-Wilk test (H₀: residuals are normally distributed).

  2. Efficiency and parsimony, evaluated by minimizing the corrected Akaike Information Criterion (AICc).
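For reference, the Ljung–Box statistic used in the first criterion can be computed directly from the residual autocorrelations. This is a minimal sketch with a hypothetical helper name (`ljung_box_q`) and toy residuals; it returns only Q, which would then be compared with a chi-square distribution whose degrees of freedom equal the number of lags minus the number of fitted ARMA parameters.

```python
import numpy as np

def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q = n(n+2) * sum over k=1..h of rho_k^2 / (n - k)."""
    e = np.asarray(residuals, dtype=float)
    e = e - e.mean()
    n = len(e)
    denom = np.sum(e ** 2)
    total = 0.0
    for k in range(1, max_lag + 1):
        rho_k = np.sum(e[k:] * e[:-k]) / denom   # lag-k sample autocorrelation
        total += rho_k ** 2 / (n - k)
    return n * (n + 2) * total

# Toy residuals with strong negative lag-1 autocorrelation (rho_1 = -0.75).
q = ljung_box_q([1.0, -1.0, 1.0, -1.0], max_lag=1)
print(q)
```

A large Q relative to the chi-square reference indicates residual autocorrelation, i.e., a failure of the first criterion.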

For breast cancer, three models were compared: Model A (ARIMAX (2,0,0)) and Model B (ARIMAX (0,0,1)), automatically selected by the algorithm, and Model C (ARIMAX (1,0,1)), manually specified. All three passed the diagnostic tests, proving statistically valid, but Model B was identified as the best model, as it had the lowest AICc value (−79.25).

Similarly, for prostate cancer, three models were tested: Model A (ARIMAX (1,0,0)), Model B (ARIMAX (0,0,1)), and Model C (ARIMAX (1,0,1)). All satisfied the statistical validity criteria. Here again, the optimal model was Model B, ARIMAX (0,0,1), as it showed the lowest AICc value (−51.02).
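The two-stage selection rule (validity first, then AICc) can be reproduced from the values in Table 4. The sketch below is illustrative only; the data layout and the function name `select_model` are assumptions, and the pass/fail flags encode the p > 0.05 entries reported in the table.

```python
# (label, ARIMAX order, AICc, passes Ljung-Box, passes Shapiro-Wilk), from Table 4.
candidates = {
    "Breast": [
        ("A", (2, 0, 0), -78.08, True, True),
        ("B", (0, 0, 1), -79.25, True, True),
        ("C", (1, 0, 1), -75.30, True, True),
    ],
    "Prostate": [
        ("A", (1, 0, 0), -48.17, True, True),
        ("B", (0, 0, 1), -51.02, True, True),
        ("C", (1, 0, 1), -48.65, True, True),
    ],
}

def select_model(models):
    """Keep models that pass both residual tests, then take the lowest AICc."""
    valid = [m for m in models if m[3] and m[4]]
    return min(valid, key=lambda m: m[2])

best = {cancer: select_model(models) for cancer, models in candidates.items()}
for cancer, (label, order, aicc, _, _) in best.items():
    print(cancer, label, order, aicc)
```

For both cancers the rule returns Model B, ARIMAX (0,0,1), in agreement with the selection described above.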

The selected ARIMAX models revealed a significant long-term relationship between PC1 - representing animal-based food consumption - and cancer incidence, as indicated by the significance of the coefficients for both Breast and Prostate (Table 5).

Table 5.

Parameter estimates for breast ARIMAX(0,0,1) and prostate ARIMAX(0,0,1) models (best models).

Term | Coefficient (β) | Standard error | t-statistic | p-value
Breast model (Lag=18) ARIMA(0,0,1)
PC1_lagged (L=18) | 0.1084 | 0.0081 | 13.3318 | <0.001
MA(1) (Moving average) | 0.5493 | 0.1974 | 2.7833 | <0.01
Intercept | 4.8986 | 0.0099 | 496.6983 | <0.001
Prostate model (Lag=15) ARIMA(0,0,1)
PC1_lagged (L=15) | 0.3840 | 0.0203 | 18.8835 | <0.001
MA(1) (Moving average) | 0.7179 | 0.1366 | 2.8565 | <0.05
Intercept | 4.6131 | 0.0193 | 238.7648 | <0.001

For breast cancer, the optimal model was an ARIMAX (0,0,1) with a structural lag of 18 years. The coefficient of the lagged PC1 variable is positive and highly significant (β = 0.1084, SE = 0.0081, p < 0.001), indicating that a 1% increase in PC1 corresponds, on average, to a 0.108% increase in breast cancer incidence. In addition to the long-term association, the model also identified short-term dynamics, captured by a significant first order moving average (MA (1)) term (β = 0.5493, p < 0.01).

For prostate cancer, a model with the same structure, ARIMAX (0,0,1), was selected with an optimal lag of 15 years. Here too, the coefficient of the lagged PC1 variable was positive and highly significant (β = 0.3840, SE = 0.0203, p < 0.001), corresponding to an elasticity of about 0.384. This implies that a 1% increase in PC1 is associated with an average 0.384% increase in prostate cancer incidence after 15 years. As with breast cancer, the prostate cancer model also captured residual short-term dynamics through a significant MA(1) term (β = 0.7179, p < 0.05).

Overall, the results highlight a significant and temporally lagged association between animal-based food consumption and the incidence of both cancers analysed, with effects observable over a 15–18-year period. Importantly, for both models, the identification of a specification with no differencing (d = 0) was justified by the presence of cointegration, namely the existence of a long-term equilibrium between consumption and cancer incidence.
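Written out, the two selected specifications take the following form. This is a sketch under stated assumptions: the model is read as a regression on lagged PC1 with MA(1) errors, the MA coefficient follows the plus-sign convention (some software reports the opposite sign), and y denotes the log-transformed incidence series. The numerical values are those of Table 5.

```latex
y_t = c + \beta \, \mathrm{PC1}_{t-L} + \varepsilon_t + \theta \, \varepsilon_{t-1}, \qquad \varepsilon_t \sim \text{white noise}

\text{Breast } (L = 18): \quad y_t = 4.8986 + 0.1084 \, \mathrm{PC1}_{t-18} + \varepsilon_t + 0.5493 \, \varepsilon_{t-1}

\text{Prostate } (L = 15): \quad y_t = 4.6131 + 0.3840 \, \mathrm{PC1}_{t-15} + \varepsilon_t + 0.7179 \, \varepsilon_{t-1}
```

With d = 0 there is no differencing term, so the coefficient on lagged PC1 acts directly on the level of the log incidence series.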

For breast cancer, the ARIMAX (0,0,1) model with an 18-year lag relative to the animal consumption index shows good agreement between the observed and estimated series, particularly up to the early 2000s: the forecasts closely follow the actual trend, capturing both the progressive increase in incidence and the slight declines. Short-term fluctuations are also well reproduced, confirming the appropriateness of the level specification (Figure 2).

Fig. 2.

Fitted values for the final selected models: breast cancer ARIMAX(0,0,1) model with lag=18; prostate cancer ARIMAX (0,0,1) model with lag=15, vs observed values.

For prostate cancer, the ARIMAX (0,0,1) model with a 15-year lag provides an equally satisfactory fit, accurately reproducing the upward trend of the series. Discrepancies between observed and estimated values are minimal and show no systematic patterns, especially up to the early years of the new millennium (Figure 2).

Step 2c. Graphical diagnostics of model residuals

Finally, the statistical validity of both selected ARIMAX models was also confirmed through graphical analysis of the residuals (Supplementary Figure S2) for both breast and prostate cancer. The autocorrelation function (ACF) plots show that, in both cases, the residual correlations at different lags do not exceed the thresholds of statistical significance. In addition, the residuals fluctuate over time in a largely random manner around a zero mean, as illustrated by the time-series plot. Finally, the histograms of the residuals display unimodal and approximately symmetric frequency distributions, thereby visually supporting the assumption of normality of the errors, previously tested with the Shapiro–Wilk test (for both breast and prostate, p > 0.05).

Discussion

The results of this study show a significant long-term relationship between animal product consumption and the incidence of breast and prostate cancer in Italy, with estimated temporal delays of 18 and 15 years, respectively. These findings, obtained through validated ARIMAX models supported by cointegration tests, suggest that changes in dietary consumption tend to anticipate variations in cancer incidence.

The mere observation of visual parallels between epidemiological trends and dietary consumption is not sufficient to demonstrate causality, as it may conceal the risk of spurious regressions arising from the non-stationarity of time series. For this reason, in the present study, the use of a rigorous framework combining ARDL Bounds Testing with ARIMAX modeling allowed us to validate the relationship between animal product consumption (summarized in the PC1 index) and the incidence of breast and prostate cancer. This analytical strategy is in line with recent WHO comparative evaluations, which highlight the improved predictive and structural reliability of ARIMAX compared to conventional joinpoint-based approaches10,11.

For both cancers, the results revealed a cointegrating relationship, albeit with different time lags: 18 years for breast cancer and 15 years for prostate cancer. These latencies are compatible with the minimal latency periods reported for cancers linked to environmental exposures and with multistage models of carcinogenesis12,13.

Interestingly, despite similar latencies, the magnitude of the association between PC1 and cancer differs significantly: it is more modest for breast cancer and greater for prostate cancer. This difference may reflect a different biological sensitivity of the two malignancies to dietary hormone-mimetic stimuli, or interactions with disease-specific factors (such as the androgenic milieu for prostate cancer and estrogen-dependent proliferative mechanisms for breast cancer). The biological interpretation of these findings is consistent with the idea that hormones naturally present in meat and dairy products may contribute to conditions favorable to tumor development14. Such exposures could likely play a role in hormone-sensitive cancers; however, based on ecological data alone, it is not possible to confirm the existence of a causal mechanism. Furthermore, the model used does not allow the observed effects to be attributed to specific molecules or food components, nor does it allow the impact of food consumption to be separated from that of other concomitant factors.

The main limitations of the study are related to the use of data aggregated at the national level: the risk of ecological fallacy remains present, as associations observed at the population level are not immediately and necessarily transferable to the individual level. Therefore, our models indicate that dietary trends anticipate cancer incidence trends, but this is not sufficient to demonstrate biological causality, which would require further verification through cohort studies or individual-level data. Nonetheless, it has been observed that, for lifestyle-related cancers, aggregate and individual-level results tend to converge when exposure is relatively homogeneous across the population15. In addition, other potential confounding factors were not accounted for, such as changes in screening practices, therapeutic improvements, or demographic shifts, which may have influenced the observed trends. Although carcinogenesis is a multifactorial process, it was not possible to include additional exogenous regressors in this study. The cancer incidence series consisted of only 37 annual observations, which limited the degrees of freedom. Introducing additional variables such as smoking, alcohol consumption, or obesity into an extended ARDL or ARIMAX framework with series this short would have led to model saturation, unstable estimates, and an increased risk of overfitting, thus artificially inflating the R² values. For this reason, we opted for a more parsimonious, yet mathematically stable, specification using only PC1 as the exogenous regressor. Longer incidence series may in the future allow the use of multivariate ARDL or ARIMAX models that incorporate specific covariates without compromising their stability. However, it should be noted that the impact of omitted variables, particularly those with slow trends, would have manifested itself as autocorrelation in the error term. 
In our model, this autocorrelation is effectively captured and neutralized by the autoregressive (AR) and moving average (MA) components, ensuring the validity of the estimates even in the absence of these covariates16 (Hyndman & Athanasopoulos, 2018). The observed association between diet and cancer may therefore reflect the combined effect of several related behaviors, including rising obesity and sedentary lifestyles, which have accompanied the increase in meat and dairy consumption over the decades17 (Di Novi, Marenzi, & Zantomio, 2021). In this context, particular attention should be paid to the dynamics that occurred after 2000, in which deviations from the expected trend were observed. These deviations are likely the effect of unmodeled external factors that altered the natural history of the tumors considered, rather than a flaw in the model. For prostate cancer, the widespread use of the PSA diagnostic test in the early 2000s generated an artificial peak in the incidence curves, followed by a decline as clinical guidelines reduced its use. For breast cancer, the stabilization observed in the early 2000s coincided with the sharp decline in the use of hormone replacement therapy following the WHI study (2002), while the subsequent acceleration can be attributed to the strengthening of screening programs and the introduction of technologies such as tomosynthesis.

Additional emerging risk factors must also be considered, including the increasing prevalence of obesity and changes in dietary habits (increased consumption of ultra-processed foods, alcohol, and reduced fiber intake), which have likely reshaped the historical relationship between animal product consumption and cancer incidence.

Despite these complexities, the study offers robustness and originality that strengthen its impact. The breadth of the time series (60 years of consumption and 37 years of incidence) represents a rare strength in cancer epidemiology. Furthermore, the combined use of PCA, cointegration, and ARIMAX allowed us to address important statistical challenges, such as non-stationarity, multicollinearity, and autocorrelation, which often compromise long-term epidemiological analyses. The formal estimation of temporal lags (15–18 years) addresses a methodological gap in many traditional studies, while the comparative analysis of two hormone-sensitive cancers offers new insights into their differential biological vulnerability. Finally, the application of advanced econometric tools in the biomedical field represents an original methodological contribution capable of fostering interdisciplinary research. These findings also indicate that econometric approaches, although still little used in epidemiology, can help identify relationships between dietary habits and other social characteristics with respect to disease trends.

Overall, the results not only reinforce the evidence of a long-term relationship between animal product consumption and hormone-sensitive cancers but also demonstrate the ability of the econometric models to capture structural changes related to public health interventions and lifestyle transformations, opening new perspectives for epidemiological interpretation.

Conclusion

This study presents statistical evidence of a long-term association between animal-source food consumption and breast and prostate cancer incidence in Italy, with estimated latency periods of approximately 18 and 15 years, respectively. While these findings do not support causal interpretation, they provide a useful basis for future research using individual- or cohort-level data.

The analysis was based on Italian time series, adopting an econometric approach that integrates principal component analysis, cointegration methods, and ARIMAX models, techniques that are still relatively uncommon in epidemiological research.

Future work could benefit from longer cancer incidence series to further investigate these associations, using extended ARDL and ARIMAX models that allow for the inclusion of additional covariates. Furthermore, it would be important to apply this analytical framework to countries with different dietary and cancer dynamics to assess whether the associations observed in the Italian context are replicated elsewhere, thus offering a form of external validation. Overall, the study highlighted the strong contribution that econometric methods can have in complementing traditional epidemiological approaches to improve the understanding of long-term health dynamics at the population level.

Materials and methods

Data Sources

All data used in this study were obtained from authoritative sources. Information on annual per capita food consumption (in kilograms) in Italy was retrieved from the FAO (Food and Agriculture Organization) database18,19. Overall, the consumption dataset covers the period 1961–2020, including both dairy products and meat. Cancer incidence data for Italy were obtained from the ECIS (European Cancer Information System) platform20, the official European Union resource providing epidemiological data on cancer for research purposes. The data used represent age-standardized annual rates (ASR per 100,000 population). ASR were used because they remove the confounding effect of population aging, so that the observed trends in cancer incidence are not distorted by demographic trends. Overall, the incidence dataset covers the period 1984–2020.
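To make the role of age standardization concrete, the following sketch shows direct standardization as a weighted mean of age-specific rates. It is illustrative only: the three age bands, the rates, and the weights are hypothetical, the function name `age_standardized_rate` is an assumption, and the standard population actually used by ECIS is not specified here.

```python
def age_standardized_rate(age_specific_rates, standard_weights):
    """Direct standardization: weighted mean of age-specific rates (per 100,000)."""
    if len(age_specific_rates) != len(standard_weights):
        raise ValueError("rates and weights must have the same length")
    total_weight = sum(standard_weights)
    weighted_sum = sum(r * w for r, w in zip(age_specific_rates, standard_weights))
    return weighted_sum / total_weight

# Hypothetical example: three age bands, rates per 100,000,
# and standard-population weights summing to 100,000.
rates = [10.0, 100.0, 400.0]
weights = [50_000, 35_000, 15_000]
asr = age_standardized_rate(rates, weights)
print(asr)
```

Because the weights are fixed, a change in the real population's age structure does not move the ASR; only changes in the age-specific rates do.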

Data Processing and Variable Construction

To ensure that the data collected was consistent with the objectives of the study, specific statistical adjustments were performed. In particular, meat consumption was calculated by summing the FAO database categories “Bovine meat,” “Mutton and Goat meat,” and “Pigmeat.” Regarding cancer data, a challenge arose from the way records are archived. In Italy, the oncological information available in the ECIS (European Cancer Information System) derives exclusively from regional or provincial registries, in the absence of a centralized national database. Consequently, in order to obtain a value representative of national incidence, the average of the incidence rates reported by individual registries was calculated for each year.
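The registry-averaging step can be sketched as follows. This is an assumption-based illustration: it uses an unweighted mean over whichever registries report in a given year, and the registry names, rates, and the function name `national_series` are hypothetical.

```python
from statistics import mean

def national_series(registry_rates):
    """For each year, average the rates of all registries reporting that year."""
    years = sorted({year for rates in registry_rates.values() for year in rates})
    return {
        year: mean(rates[year] for rates in registry_rates.values() if year in rates)
        for year in years
    }

# Hypothetical registries with partially overlapping coverage (rates per 100,000).
registry_rates = {
    "registry_a": {1984: 90.0, 1985: 94.0},
    "registry_b": {1984: 110.0, 1985: 106.0, 1986: 108.0},
}
national = national_series(registry_rates)
print(national)
```

With this approach the national value for a year depends on which registries contribute to it, so changes in registry coverage over time can shift the series independently of true incidence.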

Annual incidence data for prostate and breast cancers were denoted as Prostate and Breast, respectively. These two time series were treated as outcome variables in the analysis. The primary explanatory variables were per capita meat consumption and per capita dairy consumption. To avoid problems of heteroscedasticity, all time series were transformed into natural logarithms, thereby stabilizing variance and expressing the variables on a comparable scale. Given the high correlation expected between Meat and Dairy consumption, including both as separate regressors in an econometric model could have introduced a serious problem of multicollinearity, making the coefficient estimates unstable and difficult to interpret. To address this issue, Principal Component Analysis (PCA) was employed - a dimensionality reduction technique that synthesizes the information contained in multiple correlated variables by using their mathematical transformations, while at the same time preserving the maximum possible variance, that is, the common information shared by the original variables21.

In particular, a high Pearson correlation coefficient (r) between variables (e.g., r > 0.7–0.8) represents a strong indicator of the need to apply PCA. After confirming the high correlation between Meat and Dairy, the first principal component (PC1) was extracted and interpreted as a composite index of animal-based food consumption, encompassing both dairy and meat, and was used as an explanatory variable in the econometric models analysed. A fragment of the dataset showing the observations from 1984 to 1989 is presented in Supplementary Table S2.
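A minimal sketch of how a PC1 index can be built from two log-transformed series is given below. It assumes the variables are standardized before PCA (the text does not say whether they were) and uses the closed-form result for two standardized variables: the correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 − r, so for r > 0 the first component is proportional to the sum of the two z-scores and explains (1 + r)/2 of the variance. The input numbers are invented placeholders, not FAO data.

```python
import math
from statistics import mean, stdev


def standardize(values):
    """Return z-scores using the sample standard deviation (n - 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def pc1_two_variables(meat, dairy):
    """Return (PC1 scores, share of variance explained, Pearson r) for two series."""
    z_meat = standardize([math.log(v) for v in meat])
    z_dairy = standardize([math.log(v) for v in dairy])
    n = len(z_meat)
    r = sum(a * b for a, b in zip(z_meat, z_dairy)) / (n - 1)
    # Eigenvectors of [[1, r], [r, 1]] are (1, 1)/sqrt(2) and (1, -1)/sqrt(2);
    # the one paired with the larger eigenvalue depends on the sign of r.
    sign = 1.0 if r >= 0 else -1.0
    scores = [(a + sign * b) / math.sqrt(2) for a, b in zip(z_meat, z_dairy)]
    explained = (1 + abs(r)) / 2
    return scores, explained, r


# Invented placeholder series (kg per capita per year), not FAO data.
meat = [30.0, 34.0, 39.0, 45.0, 50.0, 52.0]
dairy = [150.0, 160.0, 175.0, 190.0, 200.0, 205.0]
scores, explained, r = pc1_two_variables(meat, dairy)
```

With only two inputs, PC1 is essentially a rescaled sum of the standardized series, which is why it can be read as a single index of animal-source food consumption.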

Study design and econometric modelling strategy

To investigate the dynamic relationship between the variables while mitigating the risk of spurious estimates - often encountered when analysing non-stationary time series - a sequential and rigorous modelling strategy was adopted. The process, outlined in the flowchart shown in Figure 3, was applied identically and independently to both malignancies (Prostate and Breast) and was structured to ensure the robustness and statistical validity of the final model:

  • Step 1: Analysis of the Order of Integration and Cointegration Testing

  • Step 2: ARIMAX Framework

  • Step 2a: Determination of Structural Lag

  • Step 2b: Specification and Selection of the ARIMAX Model

  • Step 2c: Graphical Diagnostics of the Residuals of the Selected Model

Fig. 3. Methodological flowchart of the multi-step ARIMAX framework.

Step 1. Analysis of the order of integration and cointegration testing

As a preliminary step, the stationarity of the time series was assessed both in their original (level) form and after first differencing, in order to determine their order of integration, using the Augmented Dickey–Fuller (ADF) test22. The null hypothesis (H0) of the test corresponds to the presence of a unit root (i.e., non-stationarity), whereas the alternative hypothesis indicates the absence of a unit root (i.e., stationarity). A time series that is stationary in levels is classified as I(0), while a series that becomes stationary only after first-order differencing is classified as I(1). Identifying I(1) series is a necessary condition for testing cointegration between cancer incidence and dietary consumption. Cointegration, in fact, implies that although individual series are non-stationary, a long-term equilibrium exists that binds them together.
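For reference, the ADF test is usually based on a regression of the following form. The constant, the trend term, and the number of lagged differences k are assumptions here, since the text does not report which deterministic terms or lag order were used.

```latex
\Delta y_t = \alpha + \delta t + \gamma\, y_{t-1} + \sum_{i=1}^{k} \phi_i\, \Delta y_{t-i} + \varepsilon_t,
\qquad H_0:\ \gamma = 0 \ \text{(unit root)} \quad \text{vs.} \quad H_1:\ \gamma < 0 \ \text{(stationarity)}
```

A series is treated as I(1) when H0 is not rejected in levels but is rejected after the same test is applied to the first differences.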

As a preliminary cointegration test, the Engle–Granger approach was adopted, which evaluates the null hypothesis of no cointegration by testing the stationarity of the residuals from a static OLS (Ordinary Least Squares) regression between the variables23. However, given the low power of this test in small samples, the results were regarded as merely exploratory and not conclusive, particularly in view of the limited length of the available time series.
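The two steps of the Engle–Granger procedure can be written as follows, where y_t is the incidence series and x_t the regressor. The symbols a, b, ρ, c_i and the lag order k are notation introduced here for illustration.

```latex
\text{Step 1 (static OLS):}\quad y_t = a + b\,x_t + u_t
\qquad
\text{Step 2 (unit-root test on residuals):}\quad
\Delta \hat{u}_t = \rho\,\hat{u}_{t-1} + \sum_{i=1}^{k} c_i\,\Delta \hat{u}_{t-i} + e_t,
\qquad H_0:\ \rho = 0 \ \text{(no cointegration)}
```

Because the residuals are estimated rather than observed, the second step is judged against Engle–Granger critical values rather than the standard ADF tables.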

To overcome this limitation, the ARDL Bounds Testing approach of Pesaran, Shin, and Smith (2001) was applied as the main test for cointegration24. This method, which is particularly robust in the presence of small samples, allows for the simultaneous estimation of both short- and long-term dynamics. The procedure involves estimating an Unrestricted Error Correction Model (UECM) and performing an F-test (Wald test) of the null hypothesis of no long-term relationship. The F-statistic is then compared against two sets of critical values (bounds): if the observed value exceeds the upper bound, the presence of cointegration is concluded.
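The decision rule of the bounds test has three possible outcomes, which the following sketch makes explicit. The function name and the numbers in the usage lines are hypothetical; in practice the lower and upper bounds are read from the published critical-value tables for the chosen significance level and number of regressors.

```python
def bounds_test_decision(f_stat, lower_bound, upper_bound):
    """Classify an ARDL bounds-test F-statistic against the I(0) and I(1) critical bounds."""
    if lower_bound > upper_bound:
        raise ValueError("lower_bound must not exceed upper_bound")
    if f_stat > upper_bound:
        return "cointegration"      # H0 of no long-term relationship is rejected
    if f_stat < lower_bound:
        return "no cointegration"   # H0 cannot be rejected
    return "inconclusive"           # the statistic falls between the two bounds


# Placeholder numbers for illustration only, not values from the study or from any table.
outcome_high = bounds_test_decision(7.0, lower_bound=4.0, upper_bound=5.0)
outcome_mid = bounds_test_decision(4.5, lower_bound=4.0, upper_bound=5.0)
outcome_low = bounds_test_decision(2.0, lower_bound=4.0, upper_bound=5.0)
```

The intermediate case matters: an F-statistic between the bounds does not support either conclusion without further information on the order of integration of the series.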

Step 2. ARIMAX framework

To model the relationship between cancer incidence and the consumption index PC1, the ARIMAX framework (Autoregressive Integrated Moving Average with eXogenous variables) was adopted. This approach was chosen because it simultaneously addresses three main challenges of the data: the inclusion of an external regressor (Xt), the handling of non-stationarity in the series, and the presence of short-term autocorrelation. The general form of the model is as follows:

$$Y_t = \beta_0 + \beta X_t + \eta_t \qquad (1)$$

where ηt is an error term following an ARIMA(p, d, q) process, designed to capture both non-stationarity (parameter d) and short-term dynamics (parameters p and q). The model was applied separately to the time series of Prostate and Breast incidence, using the animal consumption index PC1 as the exogenous regressor.
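Written out for the setting used in this study (series in levels, d = 0, regressor entering at lag L), the model can be sketched as below. The intercept β0 and the lag-operator notation are assumptions made for illustration; B denotes the backshift operator and ε_t a white-noise term.

```latex
y_t = \beta_0 + \beta\,\mathrm{PC1}_{t-L} + \eta_t,
\qquad
\phi(B)\,\eta_t = \theta(B)\,\varepsilon_t,
\qquad
\varepsilon_t \sim \mathrm{WN}(0, \sigma^2)
```

For an ARIMAX(0,0,1) specification, the error equation reduces to a first-order moving average, η_t = ε_t + θ ε_{t−1}.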

The identification of the optimal model specification, i.e., the orders p, d, and q, was carried out through three sequential steps:

  • Step 2a: determination of the structural latency (lag L) with respect to the regressor PC1;

  • Step 2b: selection of the most appropriate ARIMAX model based on AICc minimization and diagnostic validity;

  • Step 2c: graphical diagnostics of the residuals of the selected model.

All models were fitted to the series in levels, setting d = 0. Differencing (d = 1) was not required, because the prior cointegration testing (see Step 1) justified the use of undifferenced data.

Step 2a. Determination of structural lag (lag L)

In the first stage, the optimal temporal lag (L) of the association between PC1 and each outcome variable was identified within the 8–20-year interval, which was chosen as the analysis window in acknowledgment of the long timeframes required for carcinogenesis. To this end, for each lag value a preliminary ARIMAX model was estimated using the auto.arima function from the forecast package. The optimal lag was the one whose model yielded the lowest Corrected Akaike Information Criterion (AICc) value. This criterion balances goodness of fit against model parsimony and is particularly suitable when dealing with small samples.
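The lag scan can be illustrated with a simplified sketch. It is not the authors' procedure: instead of auto.arima, each lag is scored with an ordinary least-squares regression with independent Gaussian errors, and the data are synthetic with a known lag of 12 years. What it shows is the alignment of y_t with x_{t−L} and the AICc formula, AICc = AIC + 2k(k + 1)/(n − k − 1), where k counts the estimated parameters.

```python
import math
import random


def aicc_ols(y, x):
    """AICc of y = a + b*x + e with i.i.d. Gaussian errors (k = 3: a, b, error variance)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    loglik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    k = 3
    aic = -2 * loglik + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)


def best_lag(x_by_year, y_by_year, lags):
    """Score every candidate lag and return (lag with lowest AICc, all scores)."""
    scores = {}
    for lag in lags:
        years = [t for t in sorted(y_by_year) if (t - lag) in x_by_year]
        y = [y_by_year[t] for t in years]
        x = [x_by_year[t - lag] for t in years]  # regressor observed `lag` years earlier
        scores[lag] = aicc_ols(y, x)
    return min(scores, key=scores.get), scores


# Synthetic data with a true lag of 12 years (not the study data).
rng = random.Random(0)
x_series = {year: rng.gauss(0, 1) for year in range(1961, 2021)}
y_series = {year: 2 + 0.5 * x_series[year - 12] + 0.01 * rng.gauss(0, 1)
            for year in range(1984, 2021)}
chosen_lag, lag_scores = best_lag(x_series, y_series, range(8, 21))
```

With a regressor starting in 1961 and an outcome starting in 1984, every lag up to 20 years keeps all 37 outcome years (1984 − 20 = 1964), so the AICc values are computed on the same sample and remain comparable across lags.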

Step 2b. Specification and selection of the ARIMAX model

Once the optimal lag L was established, the structure of the ARIMAX model for each cancer type was determined, that is, the identification of the most appropriate p and q parameters to describe the temporal dynamics of the incidence series. To ensure the robustness of this choice, three different computational search strategies were employed, in order to select a set of three candidate ARIMAX models (Models A, B, and C) to be compared on the non-differenced series (d = 0).

  • Model A: identified using the auto.arima algorithm from the forecast package through a stepwise search of the best p and q parameters. The algorithm explores the parameter space efficiently, starting from an initial set of models and iteratively modifying one parameter at a time. Each modification is accepted only if it reduces the AICc, and the process stops when no further improvement is possible. Although computationally efficient, this approach does not necessarily guarantee the global optimum, with the risk of stopping at a local minimum. To mitigate this risk, Model B was considered.

  • Model B: identified using auto.arima through an exhaustive grid search over all combinations of p and q up to a predefined maximum order. Although this approach is computationally intensive, it identifies the model with the lowest AICc within the searched grid, eliminating the risk of stopping at a local minimum.

  • Model C: This model was manually specified in the form ARIMAX(1,0,1), chosen for its flexibility in representing different short-term dynamics.
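The difference between the stepwise and exhaustive strategies can be sketched on a toy score table. This is a simplified illustration and not the auto.arima algorithm itself: the stepwise routine below starts from a single model and changes p or q by one at a time, and the AICc values are invented so that a local minimum exists.

```python
def grid_search(score, max_order):
    """Evaluate every (p, q) up to max_order and return the one with the lowest score."""
    candidates = [(p, q) for p in range(max_order + 1) for q in range(max_order + 1)]
    return min(candidates, key=score)


def stepwise_search(score, start, max_order):
    """Move to the best neighbouring (p, q) while the score improves, then stop."""
    current = start
    while True:
        p, q = current
        neighbours = [(p + dp, q + dq)
                      for dp, dq in ((1, 0), (-1, 0), (0, 1), (0, -1))
                      if 0 <= p + dp <= max_order and 0 <= q + dq <= max_order]
        best = min(neighbours, key=score)
        if score(best) >= score(current):
            return current  # no neighbour improves the score
        current = best


# Invented AICc values keyed by (p, q); lower is better.
toy_aicc = {
    (0, 0): 10.0, (0, 1): 11.0, (0, 2): 12.0,
    (1, 0): 9.0,  (1, 1): 12.0, (1, 2): 13.0,
    (2, 0): 9.5,  (2, 1): 13.0, (2, 2): 5.0,
}
score = toy_aicc.__getitem__
stepwise_choice = stepwise_search(score, start=(0, 0), max_order=2)
grid_choice = grid_search(score, max_order=2)
```

On this table the stepwise search stops at (1, 0), because none of its neighbours is better, while the grid search finds (2, 2). This is the situation that motivates comparing Model A with Model B.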

The three models thus obtained were then compared and subjected to a thorough diagnostic assessment (described below and in Step 2c) in order to evaluate their statistical consistency and select the final model.

Finally, the choice of the final model was made according to two hierarchical criteria:

  • i) Statistical validity, and

  • ii) Efficiency and parsimony.

With respect to statistical validity, the residuals (ϵt) of each model were required to satisfy the properties of a white noise process (zero mean, constant variance, and no autocorrelation). This condition was verified using the Ljung–Box test, whose null hypothesis assumes the absence of serial autocorrelation25. As an additional diagnostic check, residual normality was also assessed using the Shapiro–Wilk test26. Only models for which both tests returned a p-value > 0.05 were considered valid. The second criterion, related to efficiency and parsimony, required the selection—among statistically valid models—of the one with the lowest AICc value.
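The two hierarchical criteria amount to a filter followed by a minimum, as in the sketch below. The dictionary keys and all numbers are hypothetical placeholders chosen for illustration; they are not results from the study.

```python
def select_final_model(candidates, alpha=0.05):
    """Keep candidates whose residual tests both give p > alpha, then take the lowest AICc."""
    valid = [m for m in candidates
             if m["ljung_box_p"] > alpha and m["shapiro_p"] > alpha]
    if not valid:
        return None  # no candidate satisfies the validity criterion
    return min(valid, key=lambda m: m["aicc"])


# Invented placeholder diagnostics, not values reported in the paper.
candidates = [
    {"name": "A", "aicc": -50.0, "ljung_box_p": 0.02, "shapiro_p": 0.40},
    {"name": "B", "aicc": -48.0, "ljung_box_p": 0.30, "shapiro_p": 0.25},
    {"name": "C", "aicc": -45.0, "ljung_box_p": 0.60, "shapiro_p": 0.50},
]
final_model = select_final_model(candidates)
```

In this toy example Model A has the lowest AICc but fails the Ljung–Box check, so the rule returns Model B: validity is applied first and parsimony only among the valid models.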

Step 2c. Graphical diagnostics of the residuals of the selected model

After identifying the best ARIMAX model, its suitability was further assessed through a graphical analysis of the residuals. In particular, the following aspects were examined:

  • i) the time-series plot of the residuals, which, if the model fits well, should show a random pattern;

  • ii) the autocorrelation function (ACF), to confirm the absence of serial dependence of the residuals;

  • iii) the distribution of the residuals via a histogram, to assess their compatibility with the normal distribution.
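The quantities behind the ACF plot can be computed directly. The sketch below returns the sample autocorrelations of a residual series and flags lags that fall outside the approximate 95% band ±1.96/√n commonly drawn on such plots. The function names are hypothetical, and the alternating test series is an artificial example of strongly autocorrelated residuals.

```python
import math


def acf(residuals, max_lag):
    """Sample autocorrelations of the residuals for lags 1..max_lag."""
    n = len(residuals)
    m = sum(residuals) / n
    centered = [e - m for e in residuals]
    denom = sum(c * c for c in centered)
    return [sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
            for k in range(1, max_lag + 1)]


def lags_outside_band(residuals, max_lag):
    """Lags whose autocorrelation exceeds the approximate 95% white-noise band."""
    bound = 1.96 / math.sqrt(len(residuals))
    return [k for k, r in enumerate(acf(residuals, max_lag), start=1) if abs(r) > bound]


# Artificial residuals that alternate in sign, i.e. clearly not white noise.
alternating = [1.0 if t % 2 == 0 else -1.0 for t in range(20)]
acf_values = acf(alternating, 2)
flagged = lags_outside_band(alternating, 2)
```

For residuals that behave like white noise, the list of flagged lags should be empty or nearly so; isolated exceedances at a single lag are expected by chance about once in twenty lags.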

All statistical analyses were conducted in the R environment (version 4.4.2)27, using the forecast and tseries packages.

Supplementary Information

Acknowledgements

The authors thank Lucia Dentale for her support with the translation.

Author contributions

Conceptualization: all authors; Writing – original draft: A.S., M.T., N.C., G.E.R., and A.T.; Writing – review and editing: A.S., M.T., E.P.A., N.C., G.E.R., R.I., P.I., and A.T.; Visualization: A.S., E.P.A, and P.I.; Supervision: A.S., G.E.R., and A.T. All authors have read and approved the final version of the manuscript.

Funding

This research received no external funding.

Data availability

The datasets analyzed during the current study are publicly available. Historical food balance sheets were obtained from the Food and Agriculture Organization of the United Nations (FAOSTAT) and updated food balance sheets were accessed from FAOSTAT. Cancer incidence and mortality data were retrieved from the European Cancer Information System (ECIS) of the European Commission. All datasets are openly accessible through the respective repositories at the following links: a. FAOSTAT historical (https://www.fao.org/faostat/en/#data/FBSH?countries=106&elements=645&items=2731,2848,2732,2733&years=1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009&output_type=table&file_type=csv&submit=true) b. FAOSTAT food balance sheets (https://www.fao.org/faostat/en/#data/FBS?countries=106&elements=645&items=2731,2848,2732,2733&years=2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020&output_type=table&file_type=csv&submit=true) c. ECIS data female breast cancer (https://ecis.jrc.ec.europa.eu/data-explorer#/historical/incidence-mortality-trends-by-period?indicator=IN&registry=127%2C292%2C252%2C268%2C248%2C269%2C249%2C85%2C86%2C245%2C87%2C88%2C271%2C92%2C89%2C127%2C274%2C93%2C282%2C275%2C105%2C276%2C277%2C278%2C108%2C109%2C110%2C251%2C279%2C280%2C114%2C281%2C116%2C117%2C119%2C118%2C244%2C123%2C122%2C125%2C126%2C128&sex=2&cancerEntity=34&ageFrom=0&ageTo=85%2B&yearFrom=1976&yearTo=2021&groupSexes=I&groupCancers=I&statistic=ASR_EU_NEW&logarithmicScale=N) d. ECIS data prostate cancer (https://ecis.jrc.ec.europa.eu/data-explorer#/historical/incidence-mortality-trends-by-period?indicator=IN&registry=292%2C252%2C268%2C248%2C269%2C249%2C85%2C86%2C245%2C87%2C88%2C271%2C92%2C89%2C127%2C274%2C93%2C282%2C275%2C105%2C276%2C277%2C278%2C108%2C109%2C110%2C251%2C279%2C280%2C114%2C281%2C116%2C117%2C119%2C118%2C244%2C123%2C122%2C125%2C126%2C128&sex=1&cancerEntity=42&ageFrom=0&ageTo=85%2B&yearFrom=1976&yearTo=2021&groupSexes=I&groupCancers=I&statistic=ASR_EU_NEW&logarithmicScale=N).

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Online summary of trends in U.S. cancer control measures. National Cancer Institute Cancer Trends Progress Report. https://progressreport.cancer.gov/diagnosis/incidence
  • 2. Cavalieri, E. & Rogan, E. The 3,4-quinones of estrone and estradiol are the initiators of cancer whereas resveratrol and N-acetylcysteine are the preventers. Int. J. Mol. Sci. 22(15), 8238. 10.3390/ijms22158238 (2021) (PMID: 34361004; PMCID: PMC8347442).
  • 3. Yager, J. D. & Davidson, N. E. Estrogen carcinogenesis in breast cancer. N. Engl. J. Med. 354(3), 270–82. 10.1056/NEJMra050776 (2006) (PMID: 16421368).
  • 4. Liehr, J. G. Is estradiol a genotoxic mutagenic carcinogen? Endocr. Rev. 21(1), 40–54. 10.1210/edrv.21.1.0386 (2000) (PMID: 10696569).
  • 5. Ozten, N. et al. Role of estrogen in androgen-induced prostate carcinogenesis in NBL rats. Horm. Cancer 10(2–3), 77–88. 10.1007/s12672-019-00360-7 (2019) (Epub 2019 Mar 16. PMID: 30877616; PMCID: PMC6545235).
  • 6. Rahman, H. P., Hofland, J. & Foster, P. A. In touch with your feminine side: How oestrogen metabolism impacts prostate cancer. Endocr. Relat. Cancer 23(6), R249-66. 10.1530/ERC-16-0118 (2016) (Epub 2016 May 18. PMID: 27194038).
  • 7. Zhang, J. & Kesteloot, H. Milk consumption in relation to incidence of prostate, breast, colon, and rectal cancers: Is there an independent effect? Nutr. Cancer 53(1), 65–72. 10.1207/s15327914nc5301_8 (2005) (PMID: 16351508).
  • 8. Besson, H., Paccaud, F. & Marques-Vidal, P. Ecologic correlations of selected food groups with disease incidence and mortality in Switzerland. J. Epidemiol. 23(6), 466–73. 10.2188/jea.je20130029 (2013) (Epub 2013 Oct 19. PMID: 24140818; PMCID: PMC3834285).
  • 9. Grasgruber, P., Hrazdira, E., Sebera, M. & Kalina, T. Cancer incidence in Europe: An ecological analysis of nutritional and other environmental factors. Front. Oncol. 8, 151. 10.3389/fonc.2018.00151 (2018) (PMID: 29951370; PMCID: PMC6008386).
  • 10. Li, J., Chan, N. B., Xue, J. & Tsoi, C. K. K. Time series models show comparable projection performance with joinpoint regression: A comparison using historical cancer data from World Health Organization. Front. Public Health 10, 1003162 (2022).
  • 11. Trächsel, B., Rousson, V., Bulliard, J. L. & Locatelli, C. I. Comparison of statistical models to predict age-standardized cancer incidence in Switzerland. Biom. J. 65(7), e2200046. 10.1002/bimj.202200046 (2023).
  • 12. Nadler, D. L. & Zurbenko, C. I. G. Estimating cancer latency times using a Weibull model. Adv. Epidemiol. 2014(1), 746769 (2014).
  • 13. Little, M. P., Eidemüller, M., Kaiser, J. C. & Apostoaei, C. A. I. Minimum latency effects for cancer associated with exposures to radiation or other carcinogens. Br. J. Cancer 130(5), 819–829. 10.1038/s41416-023-02544-z (2024).
  • 14. Lyons, J. G. et al. Dietary estrogens and hormone-dependent cancers: mechanisms and evidence. Mol. Cell. Endocrinol. (2021).
  • 15. Lokar, K., Zagar, T. & Zadnik, C. V. Estimation of the ecological fallacy in the geographical analysis of the association of socio-economic deprivation and cancer incidence. Int. J. Environ. Res. Public Health 16(3), 296 (2019).
  • 16. Hyndman, R. J. & Athanasopoulos, G. G. Forecasting: Principles and Practice 2nd ed. (OTexts, 2018).
  • 17. Di Novi, C., Marenzi, A., G Zantomio, F. "Patterns of red and processed meat consumption across generations." Ca’ Foscari University of Venice, Working Paper No. 01/2021 (2021).
  • 18. FAOSTAT: food balance sheets historical. Food and Agriculture Organization of the United Nations. https://www.fao.org/faostat/en/#data/FBSH
  • 19. FAOSTAT: food balance sheets. Food and Agriculture Organization of the United Nations. https://www.fao.org/faostat/en/#data/FBS
  • 20. European Cancer Information System (ECIS). European Commission. https://ecis.jrc.ec.europa.eu/data-explorer#/historical/incidence-mortality-by-cancer?ageFrom=0CageTo=85%2BCindicator=INCsex=0CyearFrom=1976CyearTo=2015CcancerEntity=-1Cstatistic=ASR_EU_NEWCregistry=127
  • 21. Greenacre, M. et al. Principal component analysis. Nat. Rev. Methods Primers 2, 100. 10.1038/s43586-022-00184-w (2022).
  • 22. Paparoditis, E. & Politis, D. N. The asymptotic size and power of the augmented Dickey–Fuller test for a unit root. Econom. Rev. 37, 955–973 (2018).
  • 23. Engle, R. F. & Granger, C. W. J. Co-integration and error correction: Representation, estimation, and testing. Econometrica 55(2), 251–276. 10.2307/1913236 (1987).
  • 24. Pesaran, M. H., Shin, Y. & Smith, R. J. Bounds testing approaches to the analysis of level relationships. J. Appl. Econom. 16, 289–326. 10.1002/jae.61 (2001).
  • 25. Ljung, G. M. & Box, G. E. P. On a measure of lack of fit in time series models. Biometrika 65(2), 297–303. 10.1093/biomet/65.2.297 (1978).
  • 26. Shapiro, S. S. & Wilk, M. B. An analysis of variance test for normality (complete samples). Biometrika 52(3–4), 591–611. 10.1093/biomet/52.3-4.591 (1965).
  • 27. Posit team. RStudio: Integrated development environment for R. Posit Software (PBC, Boston, MA, 2025).


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
