Abstract
Maize yield was estimated at the vegetative, flowering and grain filling stages using four statistical modeling approaches: least absolute shrinkage and selection operator (LASSO), elastic net, stepwise multiple linear regression (SMLR) and principal component analysis in combination with SMLR (PCA-SMLR). The models were developed using maize yield data and daily weather parameters, including maximum and minimum temperatures, precipitation, morning and evening relative humidity, bright sunshine hours, and evaporation, recorded during the crop growing period from 1984 to 2021 at the ICAR-Indian Agricultural Research Institute (IARI), New Delhi. Yield estimation was carried out for the kharif seasons of 2020 and 2021 at each of the three stages. Among the evaluated models, the Root Mean Square Error (RMSE) and normalized RMSE (nRMSE) were lowest for the Elastic Net model, followed by LASSO and SMLR, indicating the superior performance of Elastic Net. The percentage deviation between estimated and observed yield ranged from 8.0 to 29.1%, 6.8–22.0%, and 4.8–18.2% at the vegetative, flowering, and grain filling stages, respectively. Overall, temperature and bright sunshine hours were found to be the most influential predictors of maize yield, and based on model accuracy, the Elastic Net model was identified as the most reliable, followed by LASSO and SMLR, for maize yield estimation at different growth stages.
Keywords: Weather variables, Maize, Least absolute shrinkage and selection operator, Elastic net, Stepwise multiple linear regression, Yield estimation
Subject terms: Environmental sciences, Plant sciences
Introduction
Maize (Zea mays L.) is one of the world’s most important cereal crops and plays a significant role in the global agricultural economy. In India it is grown widely across the states and is a major Kharif-season crop in northern India, contributing significantly to food security, agro-based industries, and national agricultural GDP. Maize productivity is governed mainly by the prevailing weather conditions, soil characteristics, and the genetic makeup of the crop variety1. Among these, weather fluctuations are the most critical source of yield uncertainty and can lead to yield losses. Hence, reliable pre-harvest yield forecasting based on weather parameters is essential for strategic planning of procurement, processing, storage, trade decisions, and policy formulation.
Previous studies have demonstrated the applicability of statistical and machine learning (ML) models2 to weather-based yield forecasting of rice3, wheat4, maize5, and mustard6. Penalized regression techniques such as LASSO and Elastic Net have gained particular attention because they handle multicollinearity and automatically select the most relevant predictors from a large set of weather indices. Recent studies have shown promising performance of these models for yield estimation in major food crops7,8. However, systematic comparisons of these penalized models with traditional approaches across multiple distinct growth stages of maize under semi-arid climatic conditions are limited. ML models can identify the most important parameters in the input dataset, and they can also use the outputs of other yield prediction techniques, such as statistical models, as features9,10. In one study, long-term weather data and six different statistical methods were used to predict rice yield11; based on the overall Friedman test ranking, LASSO (2.63) and Elastic Net (3.07) were reported to be the best models. Elastic Net and LASSO were also the most effective models for predicting wheat yields at different locations of northwest India, followed by PCA-SMLR, SMLR, Artificial Neural Network (ANN) and PCA-ANN12. Building on these findings, our research aims to comprehensively evaluate and compare the performance of SMLR, PCA-SMLR, LASSO, and Elastic Net models in estimating maize yields at various growth stages. The goal is to enhance the reliability and precision of crop yield estimation, which is crucial for improving food security and informing strategic decision-making in agricultural production and policy planning.
In addition, maize yield estimation at different growth stages using LASSO, Elastic Net and stepwise multiple linear regression models based on weather variables has not yet been explored for ICAR-IARI, New Delhi. The aim of this study was to build parsimonious models (i.e., achieving maximum predictive power with a minimum number of input parameters) to estimate in-season maize yield at three crop growth stages (vegetative, flowering, and grain filling). This helps to minimize modelling cost and complexity and to maximize applicability for potential users.
Therefore, the objective of this study is to evaluate and compare stepwise multiple linear regression (SMLR), PCA-SMLR, LASSO, and Elastic Net models for maize yield estimation at vegetative, flowering, and grain filling stages using long-term weather variables in New Delhi, India. The novelty of this research lies in integrating multistage yield prediction, penalized regularization techniques, and Leave-One-Year-Out (LOYO) cross-validation over a 38-year crop-weather time series, which enhances methodological robustness and improves generalization. The outcome of this study is expected to contribute to more reliable district-level pre-harvest maize yield forecasting under semi-arid environments, supporting climate-resilient agricultural decision-making and operational agro-advisory systems.
Materials and methods
The study utilized long-term maize yield data and daily weather observations, including maximum and minimum temperatures, precipitation, morning and evening relative humidity, bright sunshine hours, and evaporation, recorded at the ICAR–Indian Agricultural Research Institute (ICAR-IARI), New Delhi, India during the maize growing seasons (Kharif season) from 1984 to 2021 (38 years). For maize yield data, a linear detrending procedure was applied to remove long-term non-weather driven technological and varietal improvement effects. The detrended yield values were subsequently used for model development and evaluation.
The data were organized into three cumulative phenological windows: vegetative (26th to 34th Standard Meteorological Weeks; SMW), flowering (26th to 36th SMW), and grain filling (26th to 38th SMW). The variable time takes the values 1 to 38, representing the years 1984 to 2021.
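The linear detrending step described above can be illustrated with a minimal Python sketch. It assumes an ordinary least-squares line fitted to yield against the time index (1 for the first year, n for the last) and takes the residuals as the detrended series; the function name and the six example yields are hypothetical and are not data from the study.

```python
def linear_detrend(yields):
    """Remove a least-squares linear trend in time (t = 1..n) from a yield series.

    Returns (intercept, slope, residuals); the residuals are the detrended yields.
    """
    n = len(yields)
    t = list(range(1, n + 1))  # time index: 1 = first year, n = last year
    t_mean = sum(t) / n
    y_mean = sum(yields) / n
    # Ordinary least-squares slope and intercept of yield on time
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, yields))
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    residuals = [yi - (intercept + slope * ti) for ti, yi in zip(t, yields)]
    return intercept, slope, residuals


# Hypothetical yields (kg/ha) for six consecutive years; not data from the study
example_yields = [1500.0, 1640.0, 1580.0, 1790.0, 1850.0, 1900.0]
a, b, detrended = linear_detrend(example_yields)
print("slope (kg/ha per year):", round(b, 2))
print("detrended yields:", [round(r, 1) for r in detrended])
```

By construction the residuals of a least-squares line sum to zero, so the detrended series fluctuates around zero regardless of the original yield level.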
Weather indices formulation
Simple and composite weather indices were derived from daily weather data. The simple weather indices were derived by summing individual weather variables or their interactions across weeks. The weighted indices were calculated as the sum product of weather variables with their correlation coefficients with detrended yield. The exhaustive list of index notation (Z-variables) is provided in Table 1. The calculation of simple and weighted indices was undertaken according to a predefined statistical equation.
Table 1.
Simple and weighted weather indices used for developing model.
| | Simple weather indices | | | | | | | Weighted weather indices | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Tmax | Tmin | Rainfall | RH Morning | RH Evening | BSS | Evp | Tmax | Tmin | Rainfall | RH Morning | RH Evening | BSS | Evp |
| Tmax | Z10 | | | | | | | Z11 | | | | | | |
| Tmin | Z120 | Z20 | | | | | | Z121 | Z21 | | | | | |
| Rainfall | Z130 | Z230 | Z30 | | | | | Z131 | Z231 | Z31 | | | | |
| RH morning | Z140 | Z240 | Z340 | Z40 | | | | Z141 | Z241 | Z341 | Z41 | | | |
| RH evening | Z150 | Z250 | Z350 | Z450 | Z50 | | | Z151 | Z251 | Z351 | Z451 | Z51 | | |
| BSS | Z160 | Z260 | Z360 | Z460 | Z560 | Z60 | | Z161 | Z261 | Z361 | Z461 | Z561 | Z61 | |
| Evp | Z170 | Z270 | Z370 | Z470 | Z570 | Z670 | Z70 | Z171 | Z271 | Z371 | Z471 | Z571 | Z671 | Z71 |
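Simple weather indices (j = 0):

$$Z_{ij} = \sum_{w=1}^{m} X_{iw}, \qquad Z_{ii'j} = \sum_{w=1}^{m} X_{iw} X_{i'w}$$

Weighted weather indices (j = 1):

$$Z_{ij} = \sum_{w=1}^{m} r_{iw}^{j} X_{iw}$$

and

$$Z_{ii'j} = \sum_{w=1}^{m} r_{ii'w}^{j} X_{iw} X_{i'w}$$

Where,

$X_{iw}$ / $X_{i'w}$ = value of the $i$-th / $i'$-th weather variable in the $w$-th week.

$r_{iw}$ / $r_{ii'w}$ = correlation coefficient of yield with the $i$-th weather variable, or with the product of the $i$-th and $i'$-th weather variables, in the $w$-th week.

m = week at which the forecast is made.

p = number of weather variables ($i, i' = 1, \ldots, p$).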
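The index construction can be sketched in a few lines of Python. This is a minimal illustration, assuming weekly weather values arranged as one row per year and a matching list of detrended yields; the function names and the toy numbers are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def weather_indices(x_i, yields, x_k=None):
    """Simple (j = 0) and weighted (j = 1) indices for one variable or one interaction.

    x_i[year][week] holds weekly values of a weather variable. If x_k is given,
    the weekly product x_i * x_k is used, giving the interaction indices.
    Returns (simple, weighted), each with one value per year.
    """
    if x_k is None:
        series = x_i
    else:
        series = [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(x_i, x_k)]
    n_years, n_weeks = len(series), len(series[0])
    # Correlation of yield with each week's value, computed across years
    r = [pearson([series[y][w] for y in range(n_years)], yields)
         for w in range(n_weeks)]
    simple = [sum(row) for row in series]
    weighted = [sum(rw * v for rw, v in zip(r, row)) for row in series]
    return simple, weighted


# Toy data: 4 years x 3 weeks (hypothetical values)
tmax = [[34.0, 33.0, 32.0], [35.0, 34.5, 33.0], [33.0, 32.0, 31.5], [36.0, 35.0, 34.0]]
rh_morning = [[80.0, 82.0, 85.0], [78.0, 80.0, 79.0], [84.0, 86.0, 88.0], [75.0, 77.0, 78.0]]
detrended_yield = [120.0, -80.0, 200.0, -240.0]

z10, z11 = weather_indices(tmax, detrended_yield)                 # Tmax alone
z140, z141 = weather_indices(tmax, detrended_yield, rh_morning)   # Tmax x morning RH
print(z10, [round(v, 2) for v in z11])
```

The index names in the usage lines follow the notation of Table 1, where the last digit distinguishes simple (0) from weighted (1) indices.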
Model development
Four model types, Stepwise Multiple Linear Regression (SMLR), Principal Component Analysis combined with SMLR (PCA-SMLR), Least Absolute Shrinkage and Selection Operator (LASSO), and Elastic Net regression were used for maize yield estimation at vegetative, flowering, and grain-filling stages. PCA-SMLR was implemented by converting correlated weather inputs into principal components (PC scores) with eigenvalue > 1, and significant PCs were used in regression to minimize multicollinearity.
Least absolute shrinkage and selection operator (LASSO)
The Least Absolute Shrinkage and Selection Operator (LASSO) is a regression technique that performs model selection and coefficient estimation simultaneously. It was designed to address limitations of ordinary least squares (OLS) and ridge regression. OLS minimizes the residual sum of squares, but its estimates often have low bias and high variance, which can reduce prediction accuracy, especially with a large number of predictors. Selecting a smaller subset of predictors also makes the model easier to interpret. Conventional subset selection is a discrete process in which regressors are either retained or eliminated, whereas ridge regression shrinks coefficients continuously but retains all of them. LASSO imposes an L1 penalty that continuously shrinks some coefficients and sets others exactly to zero, thereby yielding stable regression coefficients and automatic variable selection. In this way LASSO retains the beneficial aspects of both subset selection and ridge regression, leading to a more parsimonious and interpretable model.
Consider a dataset of predictor variables $x_i = (x_{i1}, \ldots, x_{ip})^T$ and corresponding yield responses $y_i$, where $i = 1, \ldots, N$. As in the standard regression framework, the observations are assumed to be independent, or the responses $y_i$ are assumed to be conditionally independent given the predictors $x_{ij}$. The predictors are standardized so that each has mean 0 and variance 1 across observations, i.e., $\sum_i x_{ij}/N = 0$ and $\sum_i x_{ij}^2/N = 1$.

Letting $\beta = (\beta_1, \ldots, \beta_p)^T$, the lasso estimate $(\hat{\alpha}, \hat{\beta})$ is defined by

$$(\hat{\alpha}, \hat{\beta}) = \arg\min \left\{ \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t.$$

The non-negative parameter $t$ is a tuning parameter that controls the degree of shrinkage applied to the coefficient estimates. Let $\hat{\beta}_j^{o}$ denote the full least-squares estimates and let $t_0 = \sum_j |\hat{\beta}_j^{o}|$ be the sum of their absolute values. When $t < t_0$, the solution is shrunk towards zero and some coefficients may be set exactly to zero. This enables the LASSO method to generate a parsimonious and interpretable model with good predictive capability. The lasso can equivalently be written as a penalized least-squares problem,

$$\hat{\beta} = \arg\min \left\{ \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}.$$

The two formulations are mathematically equivalent: for any given λ ∈ (0, ∞), there exists t ≥ 0 such that the solutions to the two problems are the same, and vice versa.
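To make the zeroing of coefficients concrete, the penalized form can be solved by cyclic coordinate descent with soft-thresholding. The sketch below is an illustrative Python implementation under the standardization stated above (each predictor has mean 0 and mean square 1); it is not the glmnet routine used in the study, and the function names and toy data are assumptions made for the example.

```python
def soft_threshold(z, g):
    """Shrink z towards zero by g; values within [-g, g] become exactly 0."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def standardize(column):
    """Centre a column and scale it to unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / sd for v in column]


def lasso_cd(columns, y, lam, n_iter=200):
    """Minimize sum_i (y_i - a - sum_j b_j x_ij)^2 + lam * sum_j |b_j|.

    `columns` is a list of p standardized predictor columns.
    Returns (intercept, coefficients).
    """
    n = len(y)
    intercept = sum(y) / n  # predictors are centred, so the intercept is mean(y)
    beta = [0.0] * len(columns)
    resid = [yi - intercept for yi in y]
    for _ in range(n_iter):
        for j, xj in enumerate(columns):
            # Inner product of x_j with the partial residual (x_j's own fit added back)
            rho = sum(x * (r + beta[j] * x) for x, r in zip(xj, resid))
            new_b = soft_threshold(rho, lam / 2.0) / n  # uses sum_i x_ij^2 = n
            if new_b != beta[j]:
                diff = new_b - beta[j]
                resid = [r - diff * x for x, r in zip(xj, resid)]
                beta[j] = new_b
    return intercept, beta


# Toy data: y depends on x1 only; x2 is an irrelevant alternating series
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
y = [3.0 * v for v in x1]
columns = [standardize(x1), standardize(x2)]

intercept, beta = lasso_cd(columns, y, lam=20.0)
print(intercept, beta)  # the coefficient of x2 is exactly zero, that of x1 is shrunk
```

With λ = 0 the same routine converges to the least-squares fit, and for sufficiently large λ every coefficient is set to zero, which mirrors the role of t in the constrained form.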
Elastic net
The elastic net regularization technique penalizes the size of the regression coefficients using a combination of L1 (LASSO) and L2 (ridge) norm penalties. The L1 penalty induces a sparse model by driving some regression coefficients exactly to zero. The L2 penalty, in contrast, does not restrict the number of selected variables, encourages the joint selection of correlated predictors, and stabilizes the coefficient path estimated by the LASSO method. Consider a data set with n observations and p predictors, where the dependent variable (yield) is $y = (y_1, \ldots, y_n)^T$ and the predictors are

$$x_j = (x_{1j}, \ldots, x_{nj})^T, \qquad j = 1, \ldots, p.$$

For any fixed non-negative values of the regularization parameters λ₁ and λ₂, the elastic net criterion is

$$L(\lambda_1, \lambda_2, \beta) = \| y - X\beta \|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1,$$

where $X$ is the $n \times p$ matrix with columns $x_j$ and

$$\|\beta\|^2 = \sum_{j=1}^{p} \beta_j^2, \qquad \|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|.$$

The elastic net estimator $\hat{\beta}$ is the minimizer of this criterion, which is a penalized least squares problem:

$$\hat{\beta} = \arg\min_{\beta} L(\lambda_1, \lambda_2, \beta).$$

Defining α = λ₂ / (λ₁ + λ₂), this is equivalent to

$$\hat{\beta} = \arg\min_{\beta} \| y - X\beta \|^2$$

subject to $(1 - \alpha)\|\beta\|_1 + \alpha\|\beta\|^2 \le t$ for some value of t.

The function $(1 - \alpha)\|\beta\|_1 + \alpha\|\beta\|^2$ is called the elastic net penalty; it is a convex combination of the LASSO and ridge penalties. Note that α as defined here weights the ridge term, whereas the mixing parameter α of the glmnet package used for model fitting weights the L1 term.
Regularization model parameterization
LASSO and Elastic Net were developed using R (version 3.6.0) via the “glmnet” package. In penalized models, λ (lambda) represents the regularization penalty strength, and α (alpha) represents the mixing parameter (α = 1 for pure LASSO; α between 0 and 1 for Elastic Net). Optimal λ values were obtained using cross-validation minimizing Mean Squared Error criteria. For Elastic Net, α = 0.5 was used to balance L1 and L2 penalties.
Fitting the LASSO and Elastic Net models involves two parameters, lambda (λ) and alpha (α). The mixing parameter α was fixed at 1 for the LASSO model and at 0.5 for the Elastic Net model, reflecting different compromises between model complexity and model fit, while λ was selected by cross-validation to minimize the average mean squared error. The tuning parameter λ scales the magnitude of the regularization penalty over a broad range. When λ = 0, the regularization effect is eliminated and the objective function reverts to the ordinary least squares goal of minimizing the sum of squared residuals. The LASSO penalty follows the L1 norm, whereas the ridge penalty follows the squared L2 (Euclidean) norm, and the two give different model behaviour: only the LASSO penalty performs automated feature selection. When working with high-dimensional data, LASSO isolates the most influential and consistent variables as λ increases, while smaller λ values retain a broader set of variables, including less important ones. The Elastic Net combines both penalties, retaining the stabilizing regularization of the ridge component together with the feature selection capability of the LASSO. The balance between them is set by α, which varies from 0 to 1: α = 1 gives the pure LASSO penalty, α = 0.5 applies an equal balance of the two penalties, α < 0.5 emphasizes the ridge penalty, and α > 0.5 emphasizes the LASSO penalty.
LASSO and Elastic Net apply a regularization penalty to reduce overfitting and improve model generalization. The parameter λ controls the overall strength of regularization — higher λ values shrink coefficients more aggressively. In Elastic Net, an additional parameter α ∈ [0, 1] determines the type of penalty applied: α = 1 corresponds to pure LASSO (L1 penalty), α = 0 corresponds to Ridge (L2 penalty), and intermediate values combine both. This allows Elastic Net to handle correlated predictors more effectively and retain essential feature sparsity. These definitions follow the standard formulations commonly used in machine learning literature13.
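For reference, the roles of λ and α can be read directly from the Gaussian-family objective documented for glmnet. The mapping to the (λ₁, λ₂) form is a short derivation added here for illustration; it assumes the same N observations, standardized predictors and intercept in both forms.

$$\min_{\beta_0, \beta} \; \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^T \beta \right)^2 + \lambda \left[ \frac{1 - \alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right]$$

Multiplying through by $2N$ gives a residual sum of squares plus $N\lambda(1-\alpha)\|\beta\|_2^2 + 2N\lambda\alpha\|\beta\|_1$, so that

$$\lambda_1 = 2N\lambda\alpha, \qquad \lambda_2 = N\lambda(1 - \alpha).$$

With α = 1 the ridge term vanishes (pure LASSO); with α = 0.5, as used for the Elastic Net here, λ₁ = Nλ and λ₂ = Nλ/2.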
Stepwise multiple linear regression (SMLR)
Diverse weather indices, including those derived from various weather parameters, were utilized in the development of predictive models. The impact of these significant weather indices was ascertained through the application of SMLR analysis. Multiple linear regression forecasting models were developed, incorporating numerous simple and weighted weather indicators14. Additionally, stepwise multiple linear regression was employed for pre-harvest wheat yield estimation due to its superior consistency and suitability at regional or national levels15. Feature selection techniques facilitated the identification of the most relevant regression variables, enhancing the interpretability of the independent variables16.
Principal component analysis- stepwise multiple linear regression (PCA-SMLR)
This combined methodology aims to address multicollinearity and enhance the predictive performance of the regression model. The PCA-SMLR approach uses principal component analysis for dimension reduction and stepwise multiple linear regression for model development. Principal component analysis is a multivariate statistical method that converts the initial set of interrelated variables into a new set of orthogonal principal component (PC) scores. All weather indices, along with time, were used as inputs to the principal component analysis. The PCs with eigenvalues greater than 1 were retained; together they capture over 90% of the variability in the data. The retained PC scores were then used as input variables for the SMLR analysis.
The accuracy and robustness of the models were evaluated using root mean square error (RMSE), normalized root mean square error (nRMSE), and percent deviation during both the calibration and validation phases. The percentage deviation of the yields estimated at the vegetative, flowering, and grain filling stages by the SMLR, PCA-SMLR, LASSO, and Elastic Net models from the observed yields was calculated for the Kharif seasons of 2020 and 2021.
Software usage
LASSO and Elastic Net were implemented in R (3.6.0) and SMLR and PCA-SMLR were implemented in SPSS (Version 16.0). Performance metrics and yield estimation were generated separately for Kharif 2020 and 2021.
Cross-validation and model assessment
To avoid temporal bias, model generalization was evaluated using Leave-One-Year-Out Cross Validation (LOYO). In this approach, models were calibrated using 37 years and validated on the single left-out year, and this process was repeated 38 times. This approach ensures robust independent validation without random year leakage. Model performance was evaluated using Root Mean Square Error (RMSE), Normalized RMSE (nRMSE), and Percent Deviation between observed and predicted yields.
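The LOYO procedure can be sketched generically in Python. In this minimal illustration the `fit` and `predict` arguments stand for any of the four model types; the trend-only placeholder model, the function names and the toy data are assumptions for the example and are not the models fitted in the study.

```python
def loyo_cv(years, features, targets, fit, predict):
    """Leave-one-year-out cross-validation.

    Each year is held out once; the model is fitted on all remaining years
    and used to predict the held-out year.
    Returns a list of (year, observed, predicted) tuples.
    """
    results = []
    for k in range(len(years)):
        train_x = features[:k] + features[k + 1:]
        train_y = targets[:k] + targets[k + 1:]
        model = fit(train_x, train_y)
        results.append((years[k], targets[k], predict(model, features[k])))
    return results


def fit_trend(train_x, train_y):
    """Placeholder model: least-squares line of yield on the first feature only."""
    xs = [row[0] for row in train_x]
    n = len(xs)
    mx, my = sum(xs) / n, sum(train_y) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, train_y))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def predict_trend(model, row):
    intercept, slope = model
    return intercept + slope * row[0]


# Toy data: six years with a perfectly linear yield trend (hypothetical values)
years = list(range(1984, 1990))
features = [[float(t)] for t in range(1, 7)]  # one feature per year: the time index
targets = [1000.0 + 50.0 * t for t in range(1, 7)]

results = loyo_cv(years, features, targets, fit_trend, predict_trend)
for year, observed, predicted in results:
    print(year, observed, round(predicted, 1))
```

For the 38-year series described above the loop would run 38 times, each time fitting on 37 years, and the pooled held-out predictions would feed the error metrics.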
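Root mean square error (RMSE), normalized root mean square error (nRMSE) and percent deviation were calculated using the following formulae:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (P_i - O_i)^2}$$

$$\mathrm{nRMSE}\ (\%) = \frac{\mathrm{RMSE}}{M} \times 100$$

$$\text{Percent Deviation}\ (\%) = \frac{\text{Estimated yield} - \text{Observed yield}}{\text{Observed yield}} \times 100$$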
Here, “Pi” represents the predicted value, “Oi” represents the observed value, “N” is the number of observations, and “M” is the mean of the observed values. The prediction is considered excellent when the nRMSE is less than 10%, good when it falls between 10 and 20%, fair when it ranges from 20 to 30%, and poor when it exceeds 30%17.
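These metrics are simple to compute; the Python sketch below is illustrative, with hypothetical function names. The handling of values falling exactly on a class boundary (10, 20 or 30%) is an assumption, since the text does not specify it. The two yields in the usage line are the SMLR estimate and the observed yield for Kharif 2020 at the vegetative stage, as reported in Table 2.

```python
from math import sqrt


def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    n = len(observed)
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def nrmse(predicted, observed):
    """RMSE expressed as a percentage of the mean observed value."""
    mean_observed = sum(observed) / len(observed)
    return 100.0 * rmse(predicted, observed) / mean_observed


def percent_deviation(estimated, observed):
    """Signed deviation of an estimate from the observation, in percent."""
    return 100.0 * (estimated - observed) / observed


def rating(nrmse_percent):
    """Qualitative class for an nRMSE value (boundary handling is an assumption)."""
    if nrmse_percent < 10.0:
        return "excellent"
    if nrmse_percent <= 20.0:
        return "good"
    if nrmse_percent <= 30.0:
        return "fair"
    return "poor"


# SMLR, vegetative stage, Kharif 2020 (estimated and observed yield from Table 2)
print(round(percent_deviation(4734.5, 3666.5), 2))
print(rating(10.32), rating(7.46))
```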
Results and discussion
Maize yield estimation at vegetative stage
Models for estimating maize yields during the vegetative growth stage have been developed for ICAR-IARI, New Delhi. These models utilize historical maize yield records and daily meteorological measurements over an extended period from 26th to 34th standard meteorological weeks.
Model performance results for the vegetative stage are summarized in Table 2. All four modelling approaches fitted the calibration data well, with R² values ranging between 0.90 and 0.95. The Elastic Net yielded the lowest RMSE and nRMSE during both calibration and validation; in validation it was followed by LASSO, PCA-SMLR, and SMLR. During validation, the Elastic Net achieved an nRMSE of 10.32%, while LASSO showed 11.74%, indicating improved generalization capability of penalized regression approaches.
Table 2.
Maize yield estimation at vegetative stage by different model during Kharif 2020 and 2021.
| Model | Model equation | Model performance during calibration | | | Model performance during validation | | Estimated yield (kg/ha) | | Observed yield (kg/ha) | | Percentage deviation (%) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | R² | RMSE (kg/ha) | nRMSE (%) | RMSE (kg/ha) | nRMSE (%) | 2020 | 2021 | 2020 | 2021 | 2020 | 2021 |
| SMLR | y = 143.6 + 141.97*time + 5.16*Z571 + 2.1*Z461 | 0.93 | 258.5 | 16.83 | 703.5 | 21.05 | 4734.5 | 4682.9 | 3666.5 | 3780.3 | 29.13 | 23.88 |
| PCA-SMLR | y=-206.82 + 134.55*time + 329.5* PC4 -210.3*PC5 | 0.90 | 313.9 | 20.44 | 579.2 | 17.33 | 4644.4 | 4650.5 | 3666.5 | 3780.3 | 26.67 | 23.02 |
| LASSO | y=-669.2 + 123.85*time + 12.95* Z11 + 0.14*Z231 + 0.27*Z461 + 1.3*Z471 + 1.5*Z561 + 1.7*Z571 | 0.94 | 259.7 | 16.91 | 392.2 | 11.74 | 4085.6 | 4263.9 | 3666.5 | 3780.3 | 11.43 | 12.79 |
| Elastic Net | y = 310.6 + 95.5*time + 0.4*Z31 + 0.02*Z351 + 0.47 *Z361 + 0.5*Z461 + 1.06*Z471 + 0.76*Z561 | 0.95 | 256.5 | 16.70 | 344.9 | 10.32 | 3990.2 | 4082.8 | 3666.5 | 3780.3 | 8.83 | 8.00 |
The predictors retained by SMLR for maize yield estimation at the vegetative stage were time, Z571 (weighted evening RH*evaporation) and Z461 (weighted morning RH*BSS). For LASSO, the retained predictors were time, Z11 (weighted Tmax), Z231 (weighted Tmin*rainfall), Z461 (weighted morning RH*BSS), Z471 (weighted morning RH*evaporation), Z561 (weighted evening RH*BSS) and Z571 (weighted evening RH*evaporation). For Elastic Net, the retained predictors were time, Z31 (weighted rainfall), Z351 (weighted rainfall*evening RH), Z361 (weighted rainfall*BSS), Z461 (weighted morning RH*BSS), Z471 (weighted morning RH*evaporation) and Z561 (weighted evening RH*BSS). The regression equations developed using the SMLR, PCA-SMLR, LASSO, and Elastic Net models for estimating maize yield at the vegetative growth stage are presented in Table 2.
The percentage deviation of estimated maize yield from observed yield during the vegetative stage was analyzed for the Kharif seasons of 2020 and 2021 at ICAR-IARI, New Delhi. In 2020, the lowest deviation was observed for the Elastic Net model at 8.83%, followed by LASSO at 11.43%, PCA-SMLR at 26.67%, and SMLR at 29.13%. Similarly, in 2021, the Elastic Net model had the lowest deviation at 8.00%, followed by LASSO at 12.79%, PCA-SMLR at 23.02%, and SMLR at 23.88%. Percentage deviation analysis for Kharif 2020 and 2021 confirmed Elastic Net as the most accurate approach, with deviation ≤ 9% in both years.
Maize yield estimation at flowering stage
The flowering-stage models used long-term daily meteorological data spanning the 26th to 36th standard meteorological weeks, together with historical crop yield records from ICAR-IARI, New Delhi. Model performance statistics for estimating maize yield at the flowering stage are presented in Table 3. All four models performed better at this stage than at the vegetative stage, with validation nRMSE values below 14% for all approaches. The Elastic Net and LASSO models showed consistently superior predictive ability (nRMSE 8.87% and 9.36%, respectively), followed by PCA-SMLR and SMLR. This improvement at flowering reflects the increasing strength of the weather signal closer to final yield determination. The coefficient of determination (R²) during calibration was 0.97 for the PCA-SMLR model, 0.98 for the SMLR model, and 0.99 for the LASSO and Elastic Net models.
Table 3.
Maize yield estimation at flowering stage by different model during Kharif 2020 and 2021.
| Model | Model equation | Model performance during calibration | | | Model performance during validation | | Estimated yield (kg/ha) | | Observed yield (kg/ha) | | Percentage deviation (%) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | R² | RMSE (kg/ha) | nRMSE (%) | RMSE (kg/ha) | nRMSE (%) | 2020 | 2021 | 2020 | 2021 | 2020 | 2021 |
| SMLR | y = 419.37 + 133.61*time + 5.27*Z571 + 1.59*Z461 + 0.23*Z231 | 0.98 | 140.9 | 9.18 | 458.8 | 13.73 | 4473.5 | 4143.4 | 3666.5 | 3780.3 | 22.01 | 9.61 |
| PCA-SMLR | y = 1185.67 + time*53.84 + PC5*129.36 | 0.97 | 163.1 | 10.62 | 386.6 | 11.72 | 4163.2 | 4117.3 | 3666.5 | 3780.3 | 13.54 | 8.91 |
| LASSO | y=-1257.75 + 126.67*time + 2.68* Z11 + 18.23*Z21 + 0.21* Z231 + 0.12*Z371 + 1.92*Z561 + 4.0 *Z571 | 0.99 | 126.5 | 8.24 | 312.9 | 9.36 | 4172.2 | 4077.3 | 3666.5 | 3780.3 | 13.79 | 7.86 |
| Elastic Net | y=-1303.86 + 119.68* time + 21.84*Z21 + 1.04*Z31 + 0.05*Z231 + 0.01*Z341 + 0.02*Z351 + 0.06*Z361+ 0.17*Z371 + 1.89 *Z561 + 3.45*Z571 | 0.99 | 126.6 | 8.24 | 296.4 | 8.87 | 4105.9 | 4038.6 | 3666.5 | 3780.3 | 11.98 | 6.83 |
The predictors identified by the SMLR model as most influential for maize yield at the flowering stage were time, Z571 (weighted evening RH*evaporation), Z461 (weighted morning RH*BSS) and Z231 (weighted Tmin*rainfall). In the LASSO model, the most influential predictors were time, Z11 (weighted Tmax), Z21 (weighted Tmin), Z231 (weighted Tmin*rainfall), Z371 (weighted rainfall*evaporation), Z561 (weighted evening RH*BSS) and Z571 (weighted evening RH*evaporation). The Elastic Net model retained time, Z21 (weighted Tmin), Z31 (weighted rainfall), Z231 (weighted Tmin*rainfall), Z341 (weighted rainfall*morning RH), Z351 (weighted rainfall*evening RH), Z361 (weighted rainfall*BSS), Z371 (weighted rainfall*evaporation), Z561 (weighted evening RH*BSS) and Z571 (weighted evening RH*evaporation). The predictive equations for estimating maize yield at the flowering stage, derived using SMLR, PCA-SMLR, LASSO, and Elastic Net, are detailed in Table 3.
The data presented in Table 3 clarify the accuracy of various regression models in estimating maize yield at the flowering stage, as measured by the percentage deviation from the observed yields at ICAR-IARI, New Delhi during the Kharif seasons of 2020 and 2021. In the 2020 Kharif season, the Elastic Net model demonstrated the smallest percentage deviation of 11.98%, outperforming the LASSO, PCA-SMLR, and SMLR models, which had deviations of 13.79%, 13.54%, and 22.01%, respectively. This trend continued in the Kharif season 2021, where the Elastic Net model exhibited the lowest percentage deviation of 6.83%, followed by the LASSO, PCA-SMLR, and SMLR models at 7.86%, 8.91%, and 9.61%, respectively. Percentage deviation results for 2020 and 2021 again confirmed the better performance of Elastic Net, followed by LASSO, PCA-SMLR, and SMLR models.
Maize yield estimation at grain filling stage
The predictive performance of the models at the grain filling stage is summarized in Table 4. The models were developed using long-term daily meteorological data spanning the 26th to 38th SMW, together with historical maize yield records from the same location. All models showed very high goodness of fit in calibration, with R² = 0.98 across approaches. During validation, the Elastic Net model again performed best, recording the lowest RMSE (262.8 kg/ha) and nRMSE (7.46%), followed by LASSO, PCA-SMLR, and SMLR. This stage exhibited the lowest overall prediction error compared to the vegetative and flowering stages, indicating that the weather signal accumulated by late in the season more strongly governs final yield.
Table 4.
Maize yield estimation at grain filling stage by different model during Kharif 2020 and 2021.
| Model | Model equation | Model performance during calibration | | | Model performance during validation | | Estimated yield (kg/ha) | | Observed yield (kg/ha) | | Percentage deviation (%) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | R² | RMSE (kg/ha) | nRMSE (%) | RMSE (kg/ha) | nRMSE (%) | 2020 | 2021 | 2020 | 2021 | 2020 | 2021 |
| SMLR | y = 707.84 + 128.09*time + 4.94*Z571 + 0.23*Z231 + 1.73*Z561 | 0.98 | 137.1 | 8.92 | 359.0 | 10.74 | 4334.0 | 4057.5 | 3666.5 | 3780.3 | 18.21 | 7.33 |
| PCA-SMLR | y = 1168.98 + time*55.12 + PC5*142.85 | 0.98 | 144.7 | 9.42 | 411.1 | 10.46 | 4064.8 | 4033.4 | 3666.5 | 3780.3 | 10.86 | 6.69 |
| LASSO | Y = 336.17 + 120.54*time + 3.78* Z21 + 4.77*Z31 + 0.19* Z361 + 0.01*Z371 + 1.66*Z561 + 3.73*Z571 | 0.98 | 151.0 | 9.83 | 281.7 | 8.43 | 4174.2 | 4022.1 | 3666.5 | 3780.3 | 13.85 | 6.40 |
| Elastic net | y = 272.85 + 109.63*time + 5.0*Z21 + 1.37*Z31 + 0.01*Z231 + 0.02* Z341 + 0.01*Z351 + 0.34*Z361 + 0.01*Z371 + 0.57*Z461 + 0.03* Z471 + 1.04*Z561 + 2.76*Z571 | 0.98 | 139.3 | 9.07 | 262.8 | 7.46 | 4086.3 | 3961.6 | 3666.5 | 3780.3 | 11.45 | 4.80 |
The predictors identified by the SMLR model as most influential for maize yield at the grain filling stage were time, Z571 (weighted evening RH*evaporation), Z231 (weighted Tmin*rainfall) and Z561 (weighted evening RH*BSS). In the LASSO model, the most influential predictors were time, Z21 (weighted Tmin), Z31 (weighted rainfall), Z361 (weighted rainfall*BSS), Z371 (weighted rainfall*evaporation), Z561 (weighted evening RH*BSS) and Z571 (weighted evening RH*evaporation). The Elastic Net model retained time, Z21 (weighted Tmin), Z31 (weighted rainfall), Z231 (weighted Tmin*rainfall), Z341 (weighted rainfall*morning RH), Z351 (weighted rainfall*evening RH), Z361 (weighted rainfall*BSS), Z371 (weighted rainfall*evaporation), Z461 (weighted morning RH*BSS), Z471 (weighted morning RH*evaporation), Z561 (weighted evening RH*BSS) and Z571 (weighted evening RH*evaporation). The predictive equations for estimating maize yield at the grain filling stage, derived using SMLR, PCA-SMLR, LASSO, and Elastic Net, are summarized in Table 4.
The percentage deviations of the grain filling stage estimates from the observed yields at ICAR-IARI, New Delhi, for the Kharif seasons of 2020 and 2021 are given in Table 4. In Kharif 2020, PCA-SMLR showed the lowest percentage deviation (10.86%), closely followed by Elastic Net (11.45%), and then LASSO (13.85%) and SMLR (18.21%). In Kharif 2021, the Elastic Net model had the smallest deviation of 4.80%, while the LASSO, PCA-SMLR, and SMLR models had deviations of 6.40%, 6.69%, and 7.33%, respectively. Together with the validation errors, the 2021 result (4.8% deviation) indicates that penalized regression handled multicollinearity and variable redundancy more efficiently than conventional stepwise regression approaches during the most yield-decisive period.
Performance of the maize yield estimation at different growth stages by different models
The performance of the SMLR, PCA-SMLR, LASSO, and Elastic Net models varied systematically across crop growth stages, with predictive accuracy generally improving from the vegetative stage to grain filling (Table 5). At the vegetative stage, all models fitted the calibration data well (R² = 0.90–0.95), but validation errors were relatively high, reflecting greater uncertainty during early crop growth. Among the models, Elastic Net and LASSO performed better than SMLR and PCA-SMLR during validation, with lower RMSE (344.9–392.2 kg ha⁻¹) and nRMSE (10.32–11.74%). In contrast, SMLR exhibited the poorest validation performance at this stage (RMSE = 703.5 kg ha⁻¹; nRMSE = 21.05%), indicating limited robustness under early-season conditions. Model performance improved substantially at the flowering stage, with higher calibration R² values (0.97–0.99) and reduced validation errors across all approaches. Elastic Net and LASSO emerged as the most accurate models, achieving the lowest validation RMSE (296.4–312.9 kg ha⁻¹) and nRMSE (8.87–9.36%). PCA-SMLR and SMLR showed comparatively higher errors, although their performance was still markedly better than at the vegetative stage. The grain filling stage exhibited the highest overall predictability. Calibration R² values remained high (≈ 0.98) for all models, while validation RMSE declined further for all models except PCA-SMLR, ranging from 262.8 to 411.1 kg ha⁻¹. Elastic Net achieved the best validation performance at this stage (RMSE = 262.8 kg ha⁻¹; nRMSE = 7.46%), followed closely by LASSO (RMSE = 281.7 kg ha⁻¹; nRMSE = 8.43%). SMLR and PCA-SMLR showed higher validation errors, confirming the advantage of regularized regression techniques under late-season climatic control.
Table 5.
Performance of the maize yield estimation models at different growth stages (RMSE in kg ha⁻¹; nRMSE in %).
| Growth stage | Model | Calibration R² | Calibration RMSE | Calibration nRMSE | Validation RMSE | Validation nRMSE |
|---|---|---|---|---|---|---|
| Vegetative | SMLR | 0.93 | 258.5 | 16.83 | 703.5 | 21.05 |
| Vegetative | PCA-SMLR | 0.90 | 313.9 | 20.44 | 579.2 | 17.33 |
| Vegetative | LASSO | 0.94 | 259.7 | 16.91 | 392.2 | 11.74 |
| Vegetative | Elastic Net | 0.95 | 256.5 | 16.70 | 344.9 | 10.32 |
| Flowering | SMLR | 0.98 | 140.9 | 9.18 | 458.8 | 13.73 |
| Flowering | PCA-SMLR | 0.97 | 163.1 | 10.62 | 386.6 | 11.72 |
| Flowering | LASSO | 0.99 | 126.5 | 8.24 | 312.9 | 9.36 |
| Flowering | Elastic Net | 0.99 | 126.6 | 8.24 | 296.4 | 8.87 |
| Grain filling | SMLR | 0.98 | 137.1 | 8.92 | 359.0 | 10.74 |
| Grain filling | PCA-SMLR | 0.98 | 144.7 | 9.42 | 411.1 | 10.46 |
| Grain filling | LASSO | 0.98 | 151.0 | 9.83 | 281.7 | 8.43 |
| Grain filling | Elastic Net | 0.98 | 139.3 | 9.07 | 262.8 | 7.46 |
Overall, the results demonstrate that regularization-based models (Elastic Net and LASSO) consistently outperform conventional SMLR and PCA-SMLR, particularly during validation, and that grain filling is the most reliable stage for operational maize yield prediction, followed by flowering and vegetative stages.
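The error statistics used in this comparison can be illustrated with a minimal Python sketch. It assumes the usual definitions: RMSE as the root of the mean squared difference between estimated and observed yield, nRMSE as RMSE expressed as a percentage of the mean observed yield, and percentage deviation as the absolute difference relative to the observed yield. The function names and the yield values are hypothetical and are not taken from the study data.

```python
import math


def rmse(observed, estimated):
    """Root mean square error between observed and estimated yields."""
    if len(observed) != len(estimated) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - o) ** 2 for o, e in zip(observed, estimated)]
    return math.sqrt(sum(squared) / len(squared))


def nrmse_percent(observed, estimated):
    """RMSE as a percentage of the mean observed yield (assumed definition)."""
    mean_observed = sum(observed) / len(observed)
    return 100.0 * rmse(observed, estimated) / mean_observed


def percent_deviation(observed, estimated):
    """Absolute deviation of one estimate from the observed yield, in percent."""
    return 100.0 * abs(estimated - observed) / observed


# Hypothetical yields in kg/ha for two validation years (illustration only).
observed_yield = [3000.0, 3500.0]
estimated_yield = [3300.0, 3200.0]

print(rmse(observed_yield, estimated_yield))
print(nrmse_percent(observed_yield, estimated_yield))
print(percent_deviation(observed_yield[0], estimated_yield[0]))
```

Because nRMSE is scaled by the mean observed yield, it allows errors to be compared across datasets with different yield levels, whereas RMSE stays in yield units.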
Discussion
The present study demonstrated that weather-based statistical models can successfully estimate maize yield at different phenological stages using long-term meteorological information. Among all approaches evaluated, the Elastic Net model consistently delivered the highest predictive accuracy across the vegetative, flowering, and grain filling stages. This superior performance reflects the intrinsic advantage of combining L1 (LASSO) and L2 (Ridge) penalization, which enables Elastic Net to perform variable selection while managing multicollinearity among highly correlated weather indices. In contrast, SMLR is sensitive to redundant predictors, and PCA-SMLR, while reducing correlations, may remove physically meaningful weather signals. This methodological behaviour explains the clear ranking observed across all stages: Elastic Net > LASSO > PCA-SMLR > SMLR. The finding that model performance improved as the crop approached the grain filling phase aligns with biological yield formation processes, in which cumulative moisture balance and atmospheric energy exchange determine kernel filling and dry matter partitioning more strongly than in the early vegetative phases. Similar stage-wise improvement in predictability has been reported for wheat and maize in semi-arid regions4,6,18,19. Additionally, the importance of humidity-based interaction indices with sunshine and rainfall across all three stages demonstrates that weather interactions, rather than individual variables alone, govern yield response. These interaction-based composite weather indices effectively captured the functional stress environment, consistent with the penalized regression response mechanism, which prioritizes combined predictive signals over single-variable effects.
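The combined L1 and L2 penalization discussed above can be written out explicitly. The expression below is one common parameterization of the Elastic Net objective and is given as an illustration only; the symbols are assumptions introduced here ($y_i$ for observed yield in year $i$, $x_i$ for the vector of weather indices, $\lambda \ge 0$ for the overall penalty strength, $\alpha \in [0, 1]$ for the mixing weight), and the parameterization used in the study may differ.

```latex
$$
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
+ \lambda\left[\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right]
$$
```

Setting $\alpha = 1$ recovers LASSO and $\alpha = 0$ recovers ridge regression. For intermediate values, the L1 term can shrink some coefficients exactly to zero (variable selection), while the L2 term stabilizes the estimates when predictors are strongly correlated.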
The present findings are also consistent with previous studies that evaluated penalized regression approaches for weather-based yield forecasting in maize and rice. In-season weather data were reported to give reliable maize yield estimates in the U.S. Corn Belt7, and statistical approaches were reported to outperform process-based simulations under changing climate scenarios8. Similarly, Elastic Net and LASSO outperformed traditional regression approaches for rice yield estimation11. The results of this study therefore extend existing evidence by demonstrating comparable superiority of Elastic Net for maize forecasting under semi-arid Indian conditions and, importantly, across multiple crop growth stages using a LOYO cross-validation design.
The models showed reasonable accuracy across 38 independent years, indicating their capability to generalize under varying climatic conditions. Although machine-learning approaches are often prone to overfitting, multiple measures were incorporated to minimize this risk. First, the use of LOYO (Leave-One-Year-Out) cross-validation ensured that each year served as a completely independent test environment, preventing the model from learning year-specific noise. Second, the inclusion of regularization techniques within the model helped restrict excessive parameter fitting and improved the stability of predictions. The variability observed in year-wise error distributions further highlights that model performance is influenced by climatic fluctuations and data heterogeneity across years. Importantly, the aggregated error statistics (average RMSE and nRMSE across all years) confirm that the model maintains overall robustness without relying on a single favourable year. These findings collectively suggest that the proposed framework is not only accurate but also resilient to overfitting, making it suitable for reliable operational yield prediction applications.
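The LOYO procedure described above can be sketched in a few lines of Python. This is a simplified, dependency-free illustration, not the study's implementation: the Elastic Net model is replaced here by a one-predictor ridge regression (an L2 penalty on the slope only), and the years, weather index values, yields, and penalty value are all hypothetical.

```python
import math


def fit_ridge_1d(x, y, lam):
    """Fit y = a + b*x by least squares with penalty lam * b**2 on the slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / (sxx + lam)          # lam > 0 shrinks the slope toward zero
    intercept = mean_y - slope * mean_x
    return intercept, slope


def loyo_predictions(years, x, y, lam=1.0):
    """Predict each year from a model fitted on all the other years."""
    predictions = {}
    for i, year in enumerate(years):
        train_x = x[:i] + x[i + 1:]    # hold out year i entirely
        train_y = y[:i] + y[i + 1:]
        intercept, slope = fit_ridge_1d(train_x, train_y, lam)
        predictions[year] = intercept + slope * x[i]
    return predictions


def rmse(observed, estimated):
    """Root mean square error over all held-out years."""
    squared = [(e - o) ** 2 for o, e in zip(observed, estimated)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: one composite weather index and yield (kg/ha) per year.
years = [2016, 2017, 2018, 2019, 2020, 2021]
weather_index = [1.2, 2.1, 2.9, 4.2, 4.8, 6.1]
yield_kg_ha = [2150.0, 2180.0, 2330.0, 2390.0, 2520.0, 2590.0]

held_out = loyo_predictions(years, weather_index, yield_kg_ha, lam=1.0)
estimates = [held_out[year] for year in years]
overall_rmse = rmse(yield_kg_ha, estimates)
overall_nrmse = 100.0 * overall_rmse / (sum(yield_kg_ha) / len(yield_kg_ha))
print(held_out)
print(overall_rmse, overall_nrmse)
```

Each prediction comes from a model that never saw the held-out year, so the aggregated RMSE and nRMSE summarize out-of-sample rather than in-sample error.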
Overall, the combination of long-term time-series yield data, composite weather indices, penalized regression models, and a leave-one-year-out cross-validation strategy produced a robust and reproducible framework for pre-harvest yield forecasting. While Elastic Net achieved the highest performance across stages, both LASSO and PCA-SMLR remained useful alternatives for operational decision support under data availability constraints. The multi-stage yield estimation framework presented in this study can thus support pre-harvest yield forecasting systems under agro-climatic uncertainty and contribute to improved planning for procurement, policy formulation, and agro-industry decision making.
Conclusion
This study evaluated SMLR, PCA-SMLR, LASSO, and Elastic Net models for pre-harvest maize yield prediction at the vegetative, flowering, and grain filling stages using long-term weather data from semi-arid New Delhi. Among all models, Elastic Net consistently achieved the lowest RMSE and nRMSE and the smallest percentage deviation from observed yield, confirming its superior predictive ability. Prediction accuracy improved from the early vegetative to the grain filling stage, indicating that cumulative weather effects closer to maturity better explain yield variability.
The integration of multi-stage modelling with penalized regression and LOYO cross-validation across 38 years adds novelty and reliability to the findings. Overall, Elastic Net, followed by LASSO, PCA-SMLR, and SMLR, proves effective for district-level yield forecasting, with Elastic Net emerging as the most robust framework for supporting early and informed decision-making in semi-arid environments. The outcomes of this study may support advance planning of production, procurement, trade and risk mitigation strategies under increasing climate variability.
Author contributions
Ananta Vashisth wrote the main manuscript text, performed the analysis, and prepared the tables. Aravind K. S. reviewed the manuscript, collected data, and performed analysis.
Data availability
The datasets generated during this study are available from the corresponding author on reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Government of India, Ministry of Agriculture & Farmers Welfare. Agricultural Statistics at a Glance 2023. Directorate of Economics and Statistics, New Delhi (2023).
- 2. Tibshirani, R. J. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58, 267–288 (1996).
- 3. Agrawal, R., Aditya, K. & Chandrahas. Use of discriminant function analysis for forecasting crop yield. Mausam 36, 455–458 (2012).
- 4. Vashisth, A., Singh, R. & Choudhary, M. Crop yield forecast at different growth stages of wheat crop using statistical models under semi-arid region. J. Agroecol. Nat. Resour. Manag. 1, 1–3 (2014).
- 5. Vashisth, A., Goyal, A. & Roy, D. Pre-harvest maize crop yield forecast at different growth stages using different models under semi-arid region of India. Int. J. Trop. Agric. 36, 915–920 (2018).
- 6. Vashisth, A. & Aravind, K. S. Multistage mustard yield estimation based on weather variables using multiple linear, LASSO and elastic net models for semi-arid region of India. Indian J. Agric. Phys. 20, 213–223 (2020).
- 7. Joshi, V. R., Kazula, M. J., Coulter, J. A., Naeve, S. L. & Garcia, A. G. Y. In-season weather data provide reliable yield estimates of maize and soybean in the US central corn belt. Int. J. Biometeorol. 65, 489–502 (2021).
- 8. Rawat, M., Sharda, V., Lin, X. & Roozeboom, K. Climate change impacts on rainfed maize yields in Kansas: statistical vs. process-based models. Agronomy 13, 2571 (2023).
- 9. Gupta, S. et al. Multistage wheat yield prediction using hybrid machine learning techniques. J. Agrometeorol. 24, 373–379 (2022).
- 10. Paudel, D. et al. Machine learning for large-scale crop yield forecasting. Agric. Syst. 187, 103016 (2021).
- 11. Das, B., Nair, B., Reddy, V. K. & Venkatesh, P. Evaluation of multiple linear, neural network and penalised regression models for prediction of rice yield based on weather parameters for West Coast of India. Int. J. Biometeorol. 62, 1809–1822 (2018).
- 12. Aravind, K. S., Vashisth, A., Krishnan, P. & Das, B. Wheat yield prediction based on weather parameters using multiple linear, neural network and penalised regression models. J. Agrometeorol. 24, 18–25 (2022).
- 13. Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67, 301–320 (2005).
- 14. Kumar, R., Gupta, B. R. D., Athiyaman, B., Singh, K. K. & Shukla, R. K. Stepwise regression technique to predict pigeon pea yield in Varanasi district. J. Agrometeorol. 1, 183–186 (1999).
- 15. Garde, Y. A., Dhekale, B. S. & Singh, S. Different approaches on pre-harvest forecasting of wheat yield. J. Appl. Nat. Sci. 7, 839–843 (2015).
- 16. Singh, R. S., Patel, C., Yadav, M. K. & Singh, K. K. Yield forecasting of rice and wheat crops for Eastern Uttar Pradesh. J. Agrometeorol. 16, 199–202 (2014).
- 17. Jamieson, P. D., Porter, J. R. & Wilson, D. R. A test of the computer simulation model ARCWHEAT1 on wheat crops grown in New Zealand. Field Crops Res. 27, 337–350 (1991).
- 18. Dutta, S., Patel, N. K. & Srivastava, S. K. District-wise yield models of rice in Bihar based on water requirement and meteorological data. J. Indian Soc. Remote Sens. 29, 175–182 (2001).
- 19. Kumar, S., Attri, S. D. & Singh, K. K. Comparison of LASSO and stepwise regression technique for wheat yield prediction. J. Agrometeorol. 21, 188–192 (2019).