Optimizing the utilization of Metakaolin in pre-cured geopolymer concrete using ensemble and symbolic regressions

Kennedy C Onyelowe; Viroon Kamchoom; Ahmed M Ebid; Shadi Hanandeh; José Luis Llamuca Llamuca; Fabián Patricio Londo Yachambay; José Luis Allauca Palta; M Vishnupriyan; Siva Avudaiappan

doi:10.1038/s41598-025-91049-1

2025 Feb 26;15:6858. doi: 10.1038/s41598-025-91049-1


PMCID: PMC11865618 PMID: 40011548

Abstract

The optimization of metakaolin (MK) in pre-cured geopolymer concrete involves developing predictive models that capture the interplay of the influencing factors and guide mix design toward improved compressive strength and sustainability. Ensemble methods and symbolic regression are promising approaches for this task because their strengths are complementary and because they reduce the need for repeated laboratory experiments. Replacing repeated, expensive, and time-consuming experiments with machine learning predictions, as in optimizing the utilization of metakaolin in pre-cured geopolymer concrete, represents a shift toward data-driven material development. The integration of ensemble and symbolic regression models enables researchers to derive valuable predictions and optimize critical performance parameters efficiently. In this research work, 235 records of compressive strength were collected through an extensive literature search, covering different mixing ratios of pre-cured metakaolin-based geopolymer concrete at different ages. Each record contains MK: The content of metakaolin (kg/m³), SHS: Sodium hydroxide solution content (kg/m³), SHSM: Sodium hydroxide solution molarity (Mole), SSS: Sodium silicate solution content (kg/m³), W: Extra water content (not including the water in alkaline solutions) (kg/m³), W/S: Water to Solid ratio (Total water content / Solid part of activator solutions + MK), Na₂O/Al₂O₃: Sodium oxide to aluminium oxide ratio, SiO₂/Al₂O₃: Silicon oxide to aluminium oxide ratio, H₂O/Na₂O: Water to Sodium oxide ratio, CA/FA: Coarse to Fine aggregate ratio, CAg: The content of coarse aggregates (kg/m³), SP: The content of super-plasticizer (kg/m³), PCC: 0 for no pre-curing, 1 for pre-curing at 60 °C, and 2 for pre-curing at 80 °C, CT: Curing temperature (°C), Age: The concrete age at testing (days) and CS: Compressive strength (MPa). 
The collected records were partitioned into a training set (180 records ≈ 75%) and a validation set (55 records ≈ 25%) and modeled with ensemble and symbolic regression methods. Performance metrics were then used to evaluate the models’ predictive ability, and Hoffman and Gardener’s sensitivity analysis was used to evaluate the impact of the variables on the compressive strength of the pre-cured geopolymer concrete mixed with metakaolin. The GB and KNN models performed best and outclassed the others, and the sensitivity analysis indicated that SHSM, SSS, W/S, and Na₂O/Al₂O₃ are the most influential variables for the predicted compressive strength.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-91049-1.

Keywords: Geopolymer concrete, Metakaolin, Machine learning, Sensitivity analysis, Sustainable concrete, Compressive strength

Subject terms: Engineering, Materials science

Introduction

Background

Metakaolin, a highly reactive pozzolanic material, has gained attention for its use in pre-cured geopolymer concrete due to its unique chemical and physical properties^1–3. Derived from the thermal activation of kaolinite clay, metakaolin contributes significantly to the geopolymerization process, making it a valuable component in sustainable construction practices^4,5. Metakaolin contains a high content of amorphous silica and alumina, which enhances the geopolymerization process by reacting with the alkaline activator to form a stable aluminosilicate gel^6–9. This gel imparts strength and durability to the concrete. The fine particle size of metakaolin leads to better particle packing and increases the density of the matrix, contributing to early-age and long-term strength in pre-cured geopolymer concrete^10,11. Metakaolin accelerates the reaction rate, reducing the dependency on prolonged curing times^12–15. This makes it suitable for pre-cured applications where rapid strength gain is desirable¹⁶. The incorporation of metakaolin reduces drying shrinkage and enhances resistance to chemical attacks, such as sulfate and chloride ingress, thus improving the durability of geopolymer concrete^17–20. Metakaolin, often derived from natural or industrial by-products, reduces the environmental footprint of concrete production. Its use in geopolymer concrete aligns with sustainable construction practices by reducing reliance on Portland cement^21–23. Metakaolin-based geopolymer concrete is ideal for precast applications where controlled pre-curing ensures uniform strength and quality²⁴. It can be used for manufacturing structural components such as beams, columns, and slabs, as well as non-structural elements like pavers and tiles²⁵. Due to its rapid strength development and excellent bonding properties, metakaolin-enriched geopolymer concrete is suitable for repair and rehabilitation works²⁴. The high fineness of metakaolin can reduce workability. 
Proper mix design and the use of plasticizers may be necessary to achieve the desired consistency²². The type and concentration of alkaline activators (e.g., sodium hydroxide or sodium silicate) must be carefully optimized to ensure effective geopolymerization¹². Metakaolin may be more expensive compared to other supplementary cementitious materials like fly ash or slag⁵. However, its superior performance characteristics can justify the cost for specific applications. Overall, the utilization of metakaolin in pre-cured geopolymer concrete offers significant advantages in terms of strength, durability, and sustainability²⁴. Its unique properties make it a promising material for modern construction, particularly in applications requiring rapid strength gain and high durability^24–26.

Various published studies on the subject were reviewed in this research work. Zou et al.¹ utilized machine learning methods such as multi-expression programming (MEP) and gene expression programming (GEP) to forecast compressive strength (CS) and slump in AAC. MEP models outperformed GEP models, reaching R² values of 0.92 and 0.93 for slump and CS, respectively. Shi et al.² suggested a system to predict the compressive strength of fly ash-based geopolymer concrete (FAGC) utilizing soft computing techniques. A total of 162 compressive strength records were compiled from research publications published between 2000 and 2020. The model was analyzed using different soft computing techniques such as multi-layer perceptron neural network (MLPNN), Bayesian regularized neural network (BRNN), generalized feed-forward neural networks (GFNN), support vector regression (SVR), decision tree (DT), random forest (RF), and LSTM. Three primary changes were implemented: utilizing the LSTM model to forecast FAGC compressive strength, enhancing the LSTM model with the marine predators algorithm (MPA), and incorporating six additional inputs. The study demonstrated that the chemical compositions of fly ash and sodium silicate solution have a notable impact on the compressive strength of FAGC, reflecting changes in optimal mix designs, which agrees with results obtained in other published papers^27–31. This method has the potential to save time and money by precisely predicting the compressive strength of FAGC with reduced calcium content. Also, Nazar et al.³ utilized three artificial intelligence algorithms, namely adaptive neuro-fuzzy inference system (ANFIS), artificial neural networks (ANNs), and gene expression programming, to forecast compressive strength and slump values of fly ash-based geopolymer concrete. Geopolymer concrete has been a subject of serious research in intelligent infrastructure development^32–35. 
The GEP model demonstrated superior performance compared to the ANFIS and ANN models in terms of R-value, R², and RMSE for predicting both CS and slump. After intensive training and optimization of hyperparameters, the GEP model produced more precise predictions, showcasing its promise in forecasting geopolymer concrete properties. Qureshi et al.⁴ applied support vector regression (SVR) and modified bagging (RFR) to forecast autogenous shrinkage (AS) in high-performance concrete. The ensemble approaches are fine-tuned with twenty sub-models and different numbers of estimators to attain a strong R2 score. The data required for modeling AS consists of water-to-cement ratio, cement, silica fume, fly ash, slag, filler, metakaolin, super absorbent polymer, superplasticizer, size, curing time, and super absorbent polymer water input. The Support Vector Regression (SVR) models with AdaBoost and Random Forest Regression (RFR) provide high performance, highlighting the significant influence of powerful learners presented in other literatures^36–39. Liu et al.⁵ used ML technology to enhance engineering methods by analyzing and evaluating the effectiveness and precision of different algorithms. Advanced methods like Gradient Boosting Decision Tree, extreme Gradient Boosting, and Random Forest are employed to improve the accuracy of predictions^40,41. The GS-XGBoost model has superior prediction accuracy and generalization performance, achieving R² values exceeding 99% in both the training and test sets. A software platform with a graphical user interface has been created to improve the efficiency of material design and testing. Khan et al.⁶ utilized gene expression programming to predict the compressive strength of geopolymer concrete (GPC) produced using fly ash (FA) waste material. 
The model relies on 298 experimental outcomes, focusing on key parameters such as water addition, plasticizer %, initial curing temperature, specimen age, curing duration, and aggregate to total aggregate ratio. An empirical equation based on GEP is suggested to predict the compressive strength of GPC. The model’s correctness, generalization, and prediction capabilities are assessed through parametric analysis and statistical tests, and by comparing it with non-linear and linear regression equations. Al-Taai et al.⁷ developed an XGBoost prediction algorithm to forecast the compressive strength of environmentally friendly concrete. The model surpassed other machine learning models, such as Support Vector Regression and the K-nearest neighbors algorithm. The examination of Partial Dependence Plots showed that the water-to-binder ratio, concrete age, and GGBFS percentage have a significant effect on the compressive strength of eco-friendly concrete, in agreement with previous works^41,42. In other studies, Nafees et al.⁸ created predictive machine-learning models to estimate SF compressive and split tensile strengths. Multilayer perceptron neural networks (MLPNN), adaptive neuro-fuzzy inference systems (ANFIS), and genetic programming were employed. An ANFIS model outperformed an MLPNN model in predicting compressive strengths and split tensile strengths from a database of 283 and 149 values, respectively. GEP models accurately predicted values that aligned with experimental data, and cross-validation was employed to prevent overfitting. Hossain et al.⁹ investigated the compressive strength of fiber-reinforced geopolymer composites (FRGC) under different curing conditions. Gene expression programming is utilized to create an empirical equation to forecast FRGC compressive strength. 393 experimental datasets are utilized to train Gene Expression Programming models. The database contains input parameters such as fly ash, silica fume, metakaolin, and others. 
Empirical equations are generated using expression trees. Performance is validated by Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²) calculations, with sensitivity analysis used to evaluate the impact of each input parameter. Gupta et al.¹⁰ developed Python code to execute individual K-nearest neighbor (KNN), random forest regression (RFR), and linear regression (LR) models on geopolymer concrete (GPC). The RFR machine learning algorithm demonstrated superior performance on the dataset. The proposed model offers a satisfactory approach for FACC design and optimization, surpassing the individual KNN technique. The study determined that the compressive strength of FACC GPC is considerably influenced by curing temperature, curing hours, molarity of NaOH, and FACC ratio. The RFR approach surpassed the other three combinations of ML algorithms. Khan et al.¹¹ employed three machine learning models to forecast the compressive strength of fly ash-based geopolymer concrete, which is a sustainable alternative to Portland cement concrete with a reduced carbon footprint. The models were trained, validated, and tested on a dataset that included chemical composition, mix proportions, and pre-curing conditions. The BPNN model yielded the most favorable outcomes, with coarse aggregate content, SiO₂ content in FA, and NaOH concentration exerting the most significant influence on the strength. It is important to note that the latest published work on the application of MK in concrete used different machine learning techniques such as GBM, Compact-GBM, RF, Compact-RF, DT, ANN, and SVM, which obtained R² values of 0.20–0.98. In the present research work, techniques such as Gradient Boosting (GB), CN2 Rule Induction (CN2), Naive Bayes (NB), Support Vector Machine (SVM), Stochastic Gradient Descent (SGD), K-Nearest Neighbors (KNN), Tree Decision (Tree) and Random Forest (RF) are applied.

Novelty statement

The novelty of optimizing the utilization of metakaolin in pre-cured geopolymer concrete using ensemble and symbolic regression techniques lies in its comprehensive and data-driven approach to identifying the optimal conditions for achieving high compressive strength. By employing advanced machine learning models such as Gradient Boosting, CN2 Rule Induction, Naive Bayes, Support Vector Machine, Stochastic Gradient Descent, K-Nearest Neighbors, Tree Decision, Random Forest, and Response Surface Methodology (RSM), the study bridges the gap between conventional experimental approaches and modern predictive analytics. This research provides a unique contribution by integrating complex input parameters, including sodium hydroxide solution content, molarity, sodium silicate content, water-to-solid ratio, oxide ratios, aggregate content, super-plasticizer dosage, pre-curing conditions, curing temperature, and concrete age. The use of multiple machine learning models offers comparative insights into their predictive capabilities, highlighting the strengths and limitations of each technique. Additionally, symbolic regression and ensemble methods enable the discovery of hidden patterns and relationships within the data, leading to more accurate and generalized models for predicting compressive strength. By incorporating pre-curing conditions and age-specific evaluations, the study addresses critical factors often overlooked in traditional geopolymer research. The multi-objective optimization framework facilitated by RSM further enhances the novelty by identifying optimal parameter combinations for improved mechanical properties. This approach not only advances the understanding of metakaolin-based geopolymer concrete but also contributes to sustainable construction practices by providing a robust and scalable method for optimizing concrete formulations.

Methodology

Collected database preliminary study

After a deep study and curation of data, a total of 235 records of compressive strength were collected from the literature²³, covering different mixing ratios of pre-cured metakaolin-based geopolymer concrete at different ages. Each record contains the following data: MK: The content of metakaolin (kg/m³), SHS: Sodium hydroxide solution content (kg/m³), SHSM: Sodium hydroxide solution molarity (Mole), SSS: Sodium silicate solution content (kg/m³), W: Extra water content (not including the water in alkaline solutions) (kg/m³), W/S: Water to Solid ratio (Total water content / Solid part of activator solutions + MK), Na₂O/Al₂O₃: Sodium oxide to aluminium oxide ratio, SiO₂/Al₂O₃: Silicon oxide to aluminium oxide ratio, H₂O/Na₂O: Water to Sodium oxide ratio, CA/FA: Coarse to Fine aggregate ratio, CAg: The content of coarse aggregates (kg/m³), SP: The content of super-plasticizer (kg/m³), PCC: 0 for no pre-curing, 1 for pre-curing at 60 °C, and 2 for pre-curing at 80 °C, CT: Curing temperature (°C), Age: The concrete age at testing (days) and CS: Compressive strength (MPa). The collected records were divided into a training set (180 records ≈ 75%) and a validation set (55 records ≈ 25%), in line with a more sustainable partitioning of databases for enhanced intelligent models²⁵. The appendix includes the closed-form equation of the RSM model, while Table 1 summarizes the statistical characteristics of the collected records. Finally, Fig. 1 shows the Pearson correlation matrix, histograms, and the relations between variables. The chart combines correlation coefficients and histogram distributions, offering a comprehensive view of variable interactions and patterns in the dataset. The correlation matrix displays the relationships between parameters with color-coded intensity. Strong positive correlations are indicated by higher values and deeper colors (closer to red), while negative correlations lean toward green. 
Some key observations follow. Compressive strength (CS) correlates weakly with Age (0.23) and more strongly with PCC (0.68), suggesting the influence of these factors on strength development. Conversely, it correlates negatively with other variables, such as SiO₂/Al₂O₃ (−0.47) and H₂O/Na₂O (−0.43). Na₂O/Al₂O₃ and SiO₂/Al₂O₃ exhibit positive interdependency (0.69 correlation), implying a strong link between the compositions influencing mix properties. MK (metakaolin) shows a notable positive correlation with SHS (0.73), which may reflect a strong compositional association in the material formulation. The histograms illustrate the distribution of each variable. Parameters like Age and CS exhibit varied scatter patterns, while others, such as SHS and SiO₂/Al₂O₃, follow tighter clusters, implying different levels of variability across factors. The scatter plots offer insights into non-linear relationships for many variable pairs, particularly where interactions deviate from simple trends. The red fitted lines indicate potential curvature in the relationships between variables, which is useful for guiding further analysis. This combination of correlation analysis and distribution visualization provides valuable input for optimizing material formulations, predicting key outcomes, and refining experimental designs in geopolymer research.
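The partitioning and correlation analysis described above can be illustrated with a minimal Python sketch. It assumes a random shuffle before the 180/55 cut (the text does not state how records were assigned) and uses synthetic (age, strength) pairs as a stand-in for the collected database; the function names `pearson` and `split_records` are hypothetical and not part of the original workflow.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def split_records(records, n_train, seed=0):
    """Shuffle a copy of the records, then cut into training and validation parts.

    ASSUMPTION: a random split; the source does not describe the assignment rule.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Synthetic stand-in for the 235 records: (age in days, strength in MPa).
rng = random.Random(42)
ages = [rng.choice([3, 7, 14, 28, 56]) for _ in range(235)]
records = [(age, 20.0 + 0.3 * age + rng.gauss(0, 5)) for age in ages]

train, valid = split_records(records, n_train=180)
r_age_cs = pearson([a for a, _ in train], [c for _, c in train])
print(len(train), len(valid), round(r_age_cs, 2))
```

Repeating `pearson` over every pair of columns would give a matrix of the kind shown in Fig. 1.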

Table 1.

Statistical analysis of collected database.

Item	Unit	Training Max.	Training Min.	Training Avg.	Training SD	Training Var (SD/Avg)	Validation Max.	Validation Min.	Validation Avg.	Validation SD	Validation Var (SD/Avg)
MK	kg/m³	723.00	238.00	423.02	116.30	0.27	723.00	276.00	412.38	107.43	0.26
SHS	kg/m³	675.00	0.00	147.85	125.96	0.85	675.00	0.00	145.98	125.51	0.86
SHSM	Mole	20.00	0.00	12.08	4.30	0.36	16.00	0.00	11.78	3.98	0.34
SSS	kg/m³	403.78	0.00	239.30	105.13	0.44	402.64	0.00	239.76	101.94	0.43
W	kg/m³	111.60	0.00	22.41	28.88	1.29	91.00	0.00	22.31	28.71	1.29
W/S	–	0.60	0.20	0.42	0.08	0.19	0.60	0.32	0.43	0.08	0.18
Na₂O/Al₂O₃	–	1.49	0.30	0.79	0.23	0.30	1.19	0.42	0.79	0.23	0.29
SiO₂/Al₂O₃	–	5.16	2.02	3.08	0.64	0.21	4.12	2.02	3.03	0.57	0.19
H₂O/Na₂O	–	18.95	7.46	11.55	2.16	0.19	14.99	8.62	11.66	1.67	0.14
CA/FA	–	2.34	0.00	1.16	0.90	0.78	2.34	0.00	1.31	0.90	0.69
SP	kg/m³	16.80	0.00	4.17	6.35	1.52	16.60	0.00	3.10	5.42	1.75
PCC	–	2.00	0.00	0.35	0.75	2.14	2.00	0.00	0.35	0.69	2.01
CT	°C	40.00	20.00	23.07	3.68	0.16	40.00	20.00	23.16	3.35	0.14
Age	Day	56.00	3.00	19.57	11.96	0.61	56.00	3.00	17.11	11.15	0.65
CS	MPa	70.85	5.80	32.34	16.12	0.50	69.09	5.50	31.66	13.69	0.43
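As a quick consistency check on Table 1, the values in the Var column match the ratio SD/Avg (the coefficient of variation) rather than the squared standard deviation. The sketch below recomputes that ratio for a few training-set rows copied from the table; the helper name is hypothetical.

```python
# (Avg, SD, reported Var) for selected training-set rows of Table 1.
rows = {
    "MK": (423.02, 116.30, 0.27),
    "SHS": (147.85, 125.96, 0.85),
    "W": (22.41, 28.88, 1.29),
    "CS": (32.34, 16.12, 0.50),
}


def coefficient_of_variation(avg, sd):
    """Standard deviation divided by the mean (dimensionless)."""
    return sd / avg


for name, (avg, sd, reported) in rows.items():
    cv = coefficient_of_variation(avg, sd)
    print(f"{name}: SD/Avg = {cv:.2f}, reported Var = {reported:.2f}")
```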


Fig. 1 — Correlation, distribution and interpreting chart.

Research program

Eight different ML classification techniques and one symbolic regression method were used to predict the compressive strength of the metakaolin-based geopolymer concrete using the collected database. These techniques are “Gradient Boosting (GB)”, “CN2 Rule Induction (CN2)”, “Naive Bayes (NB)”, “Support Vector Machine (SVM)”, “Stochastic Gradient Descent (SGD)”, “K-Nearest Neighbors (KNN)”, “Tree Decision (Tree)”, “Random Forest (RF)” and the Response Surface Methodology (RSM). The developed models were used to predict CS from the concrete mixture contents, age, pre-curing condition and curing temperature. All the developed ensemble models were created using “Orange Data Mining” software version 3.36 (https://sourceforge.net/projects/orange-data-mining.mirror/files/3.36.1/Orange3-3.36.1-cp39-cp39-macosx_10_9_x86_64.whl/download). The considered data flow diagram is shown in Fig. 2. The following section discusses the results of each model. The accuracy and performance of the developed models were evaluated by comparing SSE, MAE, MSE, RMSE, Error (%), Accuracy (%) and R² between predicted and actual compressive strength values. The definition of each measurement used is presented in Eqs. 1 to 6.

Fig. 2 — The considered data flow in Orange software.

Where: N is the number of entries, Inline graphic number of appearances, is the actual value, is the predicted value, and is the mean value.
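For orientation, the listed performance measures can be computed as in the following sketch. SSE, MAE, MSE and RMSE follow their standard definitions. The forms used here for R² (one minus SSE over the total sum of squares), Error (%) (RMSE relative to the mean actual value) and Accuracy (%) (100 minus Error) are assumptions, since the original equations are not reproduced in this text; the function name is hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return a dict of error measures between actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e * e for e in errors)              # sum of squared errors
    mae = sum(abs(e) for e in errors) / n         # mean absolute error
    mse = sse / n                                 # mean squared error
    rmse = math.sqrt(mse)                         # root mean squared error
    mean_actual = sum(actual) / n
    sst = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - sse / sst                          # ASSUMPTION: 1 - SSE/SST form
    error_pct = 100.0 * rmse / mean_actual        # ASSUMPTION: RMSE / mean
    accuracy_pct = 100.0 - error_pct              # ASSUMPTION: complement of error
    return {"SSE": sse, "MAE": mae, "MSE": mse, "RMSE": rmse,
            "R2": r2, "Error%": error_pct, "Accuracy%": accuracy_pct}


# Hypothetical strengths in MPa, for illustration only.
metrics = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
for key, value in metrics.items():
    print(f"{key}: {value:.3f}")
```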

Theory of advanced machine learning methods

Gradient boosting (GB)

Gradient Boosting is a powerful machine learning technique used for both regression and classification tasks¹². It builds predictive models in an iterative fashion by combining the predictions of weak learners, typically decision trees, to create a strong learner (see Fig. 3). GB is highly flexible and has been widely used in applications such as ranking systems, fraud detection, and predictive analytics. The procedure runs as follows:

1. The process begins by fitting a simple model (e.g., a decision tree) to the data. This initial model provides the first set of predictions¹³.
2. The error or residual (difference between actual and predicted values) is computed for each data point.
3. A new weak learner (e.g., another decision tree) is trained to predict these residuals. The goal is to minimize the residuals, thereby improving the model’s performance.
4. The predictions from the new model are combined with the previous model’s predictions²³. This combination is controlled by a learning rate, which determines how much the new model’s predictions contribute to the final output.
5. Steps 2–4 are repeated iteratively, adding new models to correct the errors of the previous models until the desired level of accuracy is achieved or the maximum number of iterations is reached.

Formally, the algorithm starts with an initial model, f₀(x), often a simple prediction like the mean of the target values for regression or a uniform class probability for classification. In each step, the model is improved by training a new weak learner to focus on the remaining errors, or residuals, from previous predictions.

Fig. 3 — Schematic of gradient boosting.

At each iteration, the algorithm calculates the negative gradient of the loss function with respect to the model’s predictions, essentially finding the direction in which the model should adjust to reduce error. This gradient guides the training of a new weak learner, h_m(x), which is then added to the existing model. The updated model can be written as:

$$f_m(x) = f_{m-1}(x) + \alpha\, h_m(x)$$

Where: f_{m−1}(x) = the model after m − 1 iterations, and α = learning rate, which controls the influence of each weak learner on the overall model.
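The boosting loop described in this subsection can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the Orange implementation used in the study: it assumes a squared-error loss (so the negative gradient equals the residual), a single input feature, and one-split regression stumps as weak learners. The data and the names `fit_stump` and `gradient_boost` are hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump (the weak learner) by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, alpha=0.1):
    """Return a predictor equal to f0 plus alpha times the sum of weak learners."""
    f0 = sum(ys) / len(ys)  # initial model: mean of the targets
    learners = []
    preds = [f0] * len(xs)
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        learners.append(h)
        preds = [p + alpha * h(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + alpha * sum(h(x) for h in learners)


# Hypothetical toy data: age (days) against compressive strength (MPa).
ages = [3, 7, 14, 28, 56]
strengths = [12.0, 20.0, 27.0, 33.0, 36.0]
model = gradient_boost(ages, strengths)


def mse(predict):
    """Mean squared error of a predictor on the toy data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(ages, strengths)) / len(ages)


baseline = sum(strengths) / len(strengths)
print(mse(lambda x: baseline), mse(model))
```

A smaller `alpha` slows each correction and usually needs more rounds, which is the trade-off the learning rate controls.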

CN2 rule induction (CN2)

CN2 is a rule-based machine learning algorithm designed for classification tasks. It generates a set of interpretable IF-THEN rules from the given training data, making it useful for domains where understanding the decision process is critical, such as medicine and finance. CN2 was introduced by Peter Clark and Tim Niblett in the late 1980s as an enhancement of decision tree algorithms and coversets¹⁸. CN2 extracts rules in the form of IF condition THEN class, where the condition is based on feature values, and the class is the predicted outcome. CN2 follows a sequential covering approach, iteratively finding rules that correctly classify a subset of the data. Each rule “covers” a portion of the dataset, and covered examples are removed after a rule is generated. CN2 can generate probabilistic rules, where each rule assigns a probability distribution over the possible classes instead of a deterministic class label²¹. The algorithm includes mechanisms to handle noise and overlapping classes, making it robust in real-world datasets. CN2 remains a foundational algorithm in rule induction, offering a balance between simplicity, interpretability, and performance (see Fig. 4). While it may not always be the most scalable or accurate algorithm, its ease of use and comprehensibility make it valuable for many practical applications.

Each rule in CN2 takes the general form: IF Condition THEN Class, where the Condition is a conjunction of attribute-value pairs that define a subset of instances for which the rule is valid, and Class is the predicted class label for instances satisfying the condition¹³. CN2 evaluates candidate rules using a heuristic measure, commonly based on the entropy or likelihood ratio of the rule’s coverage and accuracy in differentiating a specific class. For a given rule R, the information gain can be calculated using entropy to measure the quality of the rule. The entropy H for a rule’s distribution over classes is defined as:

$$H(R) = -\sum_{c \in C} P(c \mid R)\, \log_2 P(c \mid R)$$

Where: P(c∣R) is the conditional probability of class c given that an instance matches the conditions of rule R. The information gain IG of a rule is then:

$$IG(R) = H(C) - H(R)$$

Where: H(C) represents the entropy of the class distribution in the dataset, and H(R) is the entropy of instances covered by the rule R. A higher information gain implies a more effective rule in separating instances of different classes. The coverage of a rule R, denoted as Cov(R), refers to the proportion of instances in the dataset that satisfy the rule’s conditions. This is mathematically represented as:

$$\mathrm{Cov}(R) = \frac{\left|\{x \in X : x \text{ satisfies } R\}\right|}{|X|}$$

Where: ∣X∣ is the total number of instances. Higher coverage indicates that the rule applies to a larger portion of the dataset, though there may be a trade-off between coverage and precision.
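The rule-quality measures described for CN2 (entropy of the covered instances, information gain relative to the whole dataset, and coverage) can be sketched as below. The instances, the strength classes and the example rule are hypothetical, and base-2 logarithms are assumed for the entropy.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def evaluate_rule(rule, instances, labels):
    """Return (coverage, information gain) of a rule given as a predicate."""
    covered = [y for x, y in zip(instances, labels) if rule(x)]
    coverage = len(covered) / len(instances)
    gain = entropy(labels) - entropy(covered) if covered else 0.0
    return coverage, gain


# Hypothetical instances: pre-curing code (PCC) and a strength class.
instances = [{"pcc": 0}, {"pcc": 0}, {"pcc": 0}, {"pcc": 0},
             {"pcc": 1}, {"pcc": 1}, {"pcc": 2}, {"pcc": 2}]
labels = ["low", "low", "low", "high", "high", "high", "high", "low"]


def pre_cured(x):
    """Condition of the rule: IF pcc >= 1 THEN high."""
    return x["pcc"] >= 1


coverage, gain = evaluate_rule(pre_cured, instances, labels)
print(f"coverage = {coverage:.2f}, information gain = {gain:.3f}")
```

Here the rule covers half of the instances, three of which are "high", so the covered subset is purer than the evenly split dataset and the gain is positive.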

Naive Bayes (NB)

Naive Bayes is a family of simple yet effective probabilistic machine learning algorithms based on applying Bayes’ theorem with the assumption of conditional independence between features²⁰. Despite the “naive” assumption, Naive Bayes performs surprisingly well in many real-world classification tasks, such as spam detection, sentiment analysis, and medical diagnosis, and it remains a highly effective algorithm especially when speed and interpretability are critical (see Fig. 5). Naive Bayes is a probabilistic classifier based on Bayes’ theorem that leverages the assumption that the features are conditionally independent given the class. For instance, given a class C and a feature vector X = (x₁, x₂, …, x_n), the posterior probability is as follows:

$$P(C \mid X) = \frac{P(C)\, \prod_{i=1}^{n} P(x_i \mid C)}{P(X)}$$

Fig. 5 — Layout of Naïve Bayes’ machine learning technique.

Thus, the class $\hat{C}$ that maximizes this posterior probability is predicted by the classifier, such that:

$$\hat{C} = \arg\max_{C}\; P(C) \prod_{i=1}^{n} P(x_i \mid C)$$

Where P(C) = prior probability of each class, estimated as the relative frequency of instances in that class. P(x_i∣C) = conditional probability of each feature given the class, which can be calculated as frequency counts for categorical features or approximated by a Gaussian distribution for continuous features. Layout of Naïve Bayes’ machine learning technique is shown in Fig. 5.
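A minimal Gaussian Naive Bayes sketch of the prediction rule described in this subsection follows. It assumes continuous features modelled by per-class Gaussians and works with log-probabilities for numerical stability; the feature values (SHSM, W/S), the class labels and the function names are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate class priors and per-feature (mean, variance) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)  # relative frequency of the class
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model


def predict(model, row):
    """Return the class with the largest log-posterior (up to a constant)."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


# Hypothetical records: (SHSM, W/S) with a strength class.
X = [(8, 0.50), (10, 0.48), (9, 0.52), (14, 0.38), (16, 0.36), (15, 0.40)]
y = ["low", "low", "low", "high", "high", "high"]
nb = fit_gaussian_nb(X, y)
print(predict(nb, (15, 0.37)), predict(nb, (9, 0.50)))
```

The evidence term P(X) is omitted because it is the same for every class and does not change which class attains the maximum.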

Support vector machine (SVM)

Support Vector Machine (SVM) is a supervised machine learning algorithm widely used for classification, regression, and outlier detection tasks¹⁵. SVM works by finding the hyperplane that best separates data points of different classes in a high-dimensional feature space while maximizing the margin between classes (see Fig. 6). A hyperplane is a decision boundary that separates data points belonging to different classes. In a two-dimensional space, it is a line, while in higher dimensions, it generalizes to a plane or a hyperplane. Support vectors are the data points closest to the hyperplane¹⁷. They are critical for defining the hyperplane and directly influence its position and orientation. The margin is the distance between the hyperplane and the nearest data points (support vectors) of any class. SVM aims to maximize this margin to improve generalization. SVM can handle non-linearly separable data by mapping the input features into a higher-dimensional space using kernel functions¹⁸. SVM is a versatile and powerful algorithm, particularly for complex and high-dimensional problems. Its adaptability with kernels and emphasis on boundary-based learning make it a reliable choice for many classification and regression tasks.

Fig. 6 — Sketch of support vector algorithm.

Considering a dataset of labeled instances (x_i, y_i), where x_i ∈ ℝⁿ and y_i ∈ {−1, 1}, the decision boundary is a hyperplane w⋅x + b = 0, where w = weight vector perpendicular to the hyperplane, and b = bias term. The optimization problem to maximize the margin is formulated as:

$$\min_{w,\, b}\; \frac{1}{2} \lVert w \rVert^{2}$$

Subject to the constraints:

$$y_i\,(w \cdot x_i + b) \ge 1, \quad i = 1, \dots, N$$

In the case of non-linearly separable data, SVM applies kernel functions to project the data into a higher-dimensional space, where a linear separation is possible. Common kernels include the linear, polynomial, and radial basis function (RBF) kernels. The decision function for classification is then:

$$f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{N} \alpha_i\, y_i\, K(x, x_i) + b\right)$$

Where: α_i = Lagrange multipliers, and K(x, x_i) = chosen kernel function.
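The kernel decision function described in this subsection can be evaluated as in the sketch below. It only illustrates the prediction step: the support vectors, labels, multipliers and bias are hand-picked hypothetical values, not the solution of the margin optimization problem, and the RBF width `gamma` is an arbitrary choice.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(x, support_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """Weighted sum of kernel similarities to the support vectors, plus bias."""
    return sum(alpha * label * kernel(x, sv)
               for alpha, label, sv in zip(alphas, labels, support_vectors)) + bias


def classify(x, support_vectors, labels, alphas, bias):
    """Sign of the decision value, returned as +1 or -1."""
    value = decision_value(x, support_vectors, labels, alphas, bias)
    return 1 if value >= 0 else -1


# ASSUMPTION: illustrative values only, not obtained by training.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
bias = 0.0

print(classify((1.8, 1.9), support_vectors, labels, alphas, bias))
print(classify((0.1, 0.2), support_vectors, labels, alphas, bias))
```

In a trained SVM most multipliers are zero, so only the support vectors contribute to this sum.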

Stochastic gradient descent (SGD)

Stochastic Gradient Descent (SGD) is an optimization algorithm used to minimize an objective function, often the loss function in machine learning models¹⁶. It is a variant of Gradient Descent that updates the model parameters iteratively by using only a single (or a small batch of) data point(s) at a time, making it more computationally efficient for large datasets. SGD remains a cornerstone of machine learning optimization due to its simplicity, efficiency, and effectiveness for large-scale problems²³. Its ability to be extended with enhancements like momentum and adaptive learning rates ensures its relevance across a wide range of applications. A schematic of SGD model is shown in Fig. 7. For instance, in high-dimensional spaces, SGD minimizes a given objective function, J(θ), which typically determines the model error with parameters θ.

In each iteration, SGD computes the gradient using a single randomly chosen instance or a small batch rather than computing the gradient over the entire dataset. Thus, this technique speeds up convergence by updating the parameters, such that:

$$\theta \leftarrow \theta - \eta\, \nabla_{\theta} J(\theta;\, x_i, y_i)$$

where η is the learning rate, controlling the step size, and $\nabla_{\theta} J(\theta;\, x_i, y_i)$ is the gradient of the objective function with respect to θ, evaluated at a training example (x_i, y_i).
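As a concrete instance of the per-example update described here, the sketch below fits a one-feature linear model with SGD. It assumes a squared-error objective and noiseless toy data generated from y = 3x + 1; the function name, learning rate and epoch count are hypothetical choices.

```python
import random


def sgd_linear_regression(xs, ys, eta=0.05, epochs=500, seed=0):
    """Fit y ~ w * x + b by updating on one randomly ordered example at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a fresh random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b for this example.
            w -= eta * error * xs[i]
            b -= eta * error
    return w, b


# Toy data lying exactly on the line y = 3x + 1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [3.0 * x + 1.0 for x in xs]
w, b = sgd_linear_regression(xs, ys)
print(round(w, 3), round(b, 3))
```

With noisy data a constant learning rate leaves the parameters fluctuating around the optimum, which is why decaying or adaptive rates are commonly used.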

K-Nearest neighbors (KNN)

K-Nearest Neighbors (KNN) is a simple and versatile supervised learning algorithm used for classification and regression tasks. It is a non-parametric, instance-based method, meaning it makes predictions based on the similarity between the input query and training examples, rather than learning a model explicitly. KNN is a robust, interpretable, and easy-to-implement algorithm that is often used as a baseline in machine learning tasks. However, it is best suited for smaller datasets and requires careful consideration of hyperparameters and preprocessing to achieve optimal results. Figure 8 shows the illustration of the K-nearest neighbours.

Fig. 8 — Illustration of the K-nearest neighbours.

It operates by computing the distance between the query instance and all other points in the dataset, commonly using the Euclidean distance for continuous variables:

$$d(x, x') = \sqrt{\sum_{i=1}^{n} (x_i - x'_i)^2}$$

where x and x′ are two instances in n-dimensional space.
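
As a minimal sketch of this procedure for regression, the following assumes inverse-distance weighting of the k nearest targets. The training points and the function names are hypothetical illustrations, not the study data.

```python
import math

def euclidean(x, x_prime):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_prime)))

def knn_predict(train_X, train_y, query, k=1):
    """Inverse-distance-weighted mean of the k nearest training targets."""
    pairs = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], query))
    weighted = []
    for x, y in pairs[:k]:
        d = euclidean(x, query)
        if d == 0.0:
            return y                      # exact match: return its target directly
        weighted.append((1.0 / d, y))
    total = sum(w for w, _ in weighted)
    return sum(w * y for w, y in weighted) / total

# Hypothetical two-feature training set with strength-like targets
train_X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
train_y = [10.0, 20.0, 30.0]
print(euclidean((0.0, 0.0), (3.0, 4.0)))
print(knn_predict(train_X, train_y, (1.5, 2.0), k=2))
```

In practice the features would be scaled first, since the Euclidean distance is dominated by whichever input has the largest numerical range.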

Tree decision (Tree)

Decision Trees are a popular supervised learning algorithm used for both classification and regression tasks. They use a tree-like structure to model decisions and their potential consequences by recursively splitting the data based on feature values¹⁵. Decision Trees are easy to understand, interpret, and implement, making them a widely used machine learning technique. Decision Trees can be visualized to provide interpretable decision-making processes. Tools like Graphviz or built-in functions in libraries like Scikit-learn are commonly used for visualization.

Decision Trees are foundational to many machine learning workflows due to their simplicity, interpretability, and effectiveness. While they may have limitations, combining them with ensemble techniques can significantly enhance their power and applicability. A general layout of the tree decision approach is shown in Fig. 9.

Fig. 9 — General layout of the tree decision approach.

For example, given a dataset D with classes C, the tree grows by selecting the features that maximize the information gain or minimize the impurity. The information gain IG for a split on feature X is expressed as:

$$IG(D, X) = H(D) - \sum_{v \in \mathrm{values}(X)} \frac{|D_v|}{|D|} \, H(D_v)$$

where H(D) is the entropy or impurity of dataset D, and D_v is the subset of D for each value v of feature X.
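
The information gain can be computed directly from label counts. The sketch below assumes Shannon entropy as the impurity measure and uses hypothetical categorical labels; for a regression target such as compressive strength, the variance of the target would typically take the place of entropy.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(D) in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(D, X) = H(D) - sum over v of |D_v|/|D| * H(D_v)."""
    n = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)   # group labels by feature value
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical example: one feature separates the classes perfectly, one not at all
labels = ["high", "high", "low", "low"]
print(information_gain(["yes", "yes", "no", "no"], labels))
print(information_gain(["a", "b", "a", "b"], labels))
```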

Random forest (RF)

Random Forest (RF) is a powerful and flexible ensemble learning method that combines multiple decision trees to improve predictive performance and reduce overfitting¹⁴. It operates by constructing a “forest” of decision trees during training and outputs either the majority class (for classification) or the average prediction (for regression) from all trees¹⁶. Random Forest is a robust and versatile algorithm that works well across a wide range of tasks. Its ability to reduce overfitting, handle high-dimensional data, and provide feature importance makes it a staple in machine learning. Figure 10 presents a schematic of the random forest algorithm. For a training dataset D with n samples, Random Forest constructs m decision trees T₁, T₂, …, T_m. Each tree is trained on a bootstrap sample D_i (a random sample with replacement) from D, and at each node a random subset of k features is selected to find the best split. For classification, the output is determined by a majority vote across all trees:

$$\hat{y} = \operatorname{mode}\{T_1(x), T_2(x), \ldots, T_m(x)\}$$

Fig. 10 — Schematic of the random forest.

For regression, the output is the average prediction from all trees:

$$\hat{y} = \frac{1}{m} \sum_{i=1}^{m} T_i(x)$$
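
The bootstrap sampling and the two aggregation rules can be sketched as follows. The helper names and the per-tree predictions are hypothetical, and the tree-fitting step itself is omitted.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_sample(data, seed):
    """Draw len(data) records from data with replacement (the D_i for one tree)."""
    rng = random.Random(seed)
    return rng.choices(data, k=len(data))

def aggregate_classification(votes):
    """Majority vote over the class predicted by each tree."""
    return Counter(votes).most_common(1)[0][0]

def aggregate_regression(predictions):
    """Average of the numeric predictions of all trees."""
    return mean(predictions)

records = list(range(10))                       # stand-in for the training records
sample = bootstrap_sample(records, seed=1)
print(sample)
print(aggregate_classification(["A", "B", "A"]))
print(aggregate_regression([30.0, 34.0, 32.0]))
```

Averaging over trees trained on different bootstrap samples is what reduces the variance of the individual trees.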

Response surface methodology (RSM)

Response Surface Methodology (RSM) is a statistical technique used for modeling and analyzing the relationships between one or more response variables and a set of independent variables (factors) (see Fig. 11). It is commonly employed in optimization problems where the goal is to find the best conditions for a process or system by experimenting with various factor levels and evaluating their effects on a response variable. RSM is widely used in various fields, including engineering, industrial design, product development, and quality improvement. It provides an efficient framework for designing experiments, estimating the relationships between factors, and optimizing processes. RSM usually approximates the underlying process with a second-order polynomial, which is suitable for capturing the curvature of the response surface. For a process with input variables x₁, x₂, …, x_k and response y, the second-order response surface model is:

$$y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \sum_{i=1}^{k} \beta_{ii} x_i^2 + \sum_{i<j} \beta_{ij} x_i x_j + \epsilon$$

Fig. 11 — Operation layout for RSM modelling.

Where: β₀ is the intercept; β_i, β_ii, and β_ij are the coefficients of the linear, quadratic, and interaction terms, respectively; and ϵ is the error term. RSM uses designs such as the Box-Behnken Design and the Central Composite Design (CCD) to gather data and fit the model efficiently. RSM also identifies the optimal settings of the input factors by analyzing the fitted response surface, often using gradient-based methods to locate the maximum or minimum response or a desired region.
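
A minimal sketch of how a second-order response surface is evaluated, and of how many coefficients it carries, is given below. The coefficient values are hypothetical; only the count for k = 12 factors is tied to this study, where it matches the 12 linear, 66 two-factor interaction, and 12 quadratic degrees of freedom of Table 3 plus the intercept.

```python
from itertools import combinations

def quadratic_response(x, b0, linear, quadratic, interaction):
    """Evaluate y = b0 + sum(b_i x_i) + sum(b_ii x_i^2) + sum_{i<j}(b_ij x_i x_j)."""
    y = b0
    y += sum(b * xi for b, xi in zip(linear, x))
    y += sum(b * xi ** 2 for b, xi in zip(quadratic, x))
    y += sum(interaction[(i, j)] * x[i] * x[j]
             for i, j in combinations(range(len(x)), 2))
    return y

def n_coefficients(k):
    """Intercept + k linear + k quadratic + k(k-1)/2 interaction terms."""
    return 1 + 2 * k + k * (k - 1) // 2

# Hypothetical two-factor surface
y = quadratic_response([1.0, 2.0], 1.0, [1.0, 1.0], [0.5, 0.5], {(0, 1): 2.0})
print(y)
print(n_coefficients(12))
```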

Sensitivity analysis

Sensitivity analysis is a technique used to assess how the variation in the output of a model or system can be attributed to changes in its input parameters²⁶. It helps to understand the impact of each input variable (or factor) on the outcome, often in the context of decision-making, risk management, or optimization. In simple terms, sensitivity analysis answers the question: how sensitive is the result of the model to changes in the input variables? It is an essential tool for understanding the relationships between input parameters and model outputs¹⁹. By assessing how changes in inputs affect outputs, sensitivity analysis helps identify key drivers, quantify uncertainty, and support better decision-making across various industries, from engineering to finance and environmental science. It is crucial for optimizing processes, reducing risks, and improving system performance²⁶. A preliminary sensitivity analysis was carried out on the collected database to estimate the impact of each input on the output (Y) values. For metakaolin in pre-cured geopolymer concrete, this means evaluating how variations in factors such as metakaolin content, curing conditions, and other mix parameters influence the properties and performance of the concrete, which is crucial for optimizing the mix and curing process to achieve the desired strength, durability, and other characteristics. The “single variable per time” technique is used to determine the “Sensitivity Index” (SI) for each input using the Hoffman & Gardner formula²⁶ as follows:

$$SI(X_n) = \frac{Y(X_{\max}) - Y(X_{\min})}{Y(X_{\max})}$$

The Hoffman & Gardner method is a sensitivity analysis technique that is often used in the context of nonlinear programming or simulation-based models²³. It evaluates how the variability in input parameters or factors influences the output of a model, particularly when the model is complex or involves nonlinear relationships between the variables. The method focuses on determining the relative influence of each input parameter on the model’s outcome, especially when the model is based on simulation or involves uncertainty in the inputs.
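
A minimal sketch of the single-variable-per-time procedure follows. It assumes the index is taken as SI = [Y(X_max) − Y(X_min)] / Y(X_max) with all other inputs held at a baseline; the linear `toy_model`, its coefficients, and the input ranges are hypothetical and only stand in for a fitted predictor.

```python
def sensitivity_index(model, baseline, name, low, high):
    """Vary one input between low and high while holding the others at baseline."""
    y_low = model({**baseline, name: low})
    y_high = model({**baseline, name: high})
    return (y_high - y_low) / y_high

def toy_model(p):
    """Hypothetical strength predictor (MPa); not a model from this study."""
    return 10.0 + 0.05 * p["MK"] - 20.0 * p["W/S"]

baseline = {"MK": 400.0, "W/S": 0.5}
si_mk = sensitivity_index(toy_model, baseline, "MK", 300.0, 500.0)
si_ws = sensitivity_index(toy_model, baseline, "W/S", 0.4, 0.6)
print(round(si_mk, 3), round(si_ws, 3))
```

With this sign convention a negative index indicates that the output falls as the input rises, as for the water-to-solid ratio in the toy model.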

Results and discussion

RSM model

The fit summary calculation was terminated early due to options set on the Transform tab, and the maximum model order for process factors was restricted to quadratic. The model selected on the Model tab may correspond to the design model or a lower-order model. Tables 2, 3, and 4 identify the highest-order polynomial for which the additional terms are significant and the model is not aliased. The model’s F-value of 95.67 indicates that the model is significant, with only a 0.01% likelihood that such a large F-value could arise from noise. P-values below 0.0500 denote significant model terms. In this analysis, the significant terms are: B, C, D, E, F, G, H, K, L, AB, AE, AF, AG, AH, AJ, AK, AL, BG, BH, BJ, BK, CD, CH, CJ, CK, CL, CM, DJ, DK, DL, EF, EG, EH, EJ, EK, EL, FH, FJ, FK, GH, GJ, GK, GL, HL, JL, KL, A², C², F², G², H², J², and K². Terms with P-values exceeding 0.1000 are considered insignificant. If numerous insignificant terms are present (excluding those needed to maintain model hierarchy), reducing the model may improve its performance. The Predicted R² indicates that the current model predicts the response more effectively than the overall mean. In some instances, a higher-order model may further enhance predictive accuracy. The Adequate Precision value, which measures the signal-to-noise ratio, should exceed the desirable threshold of 4; with a ratio of 40.539, this model has an adequate signal and is suitable for navigating the design space. The optimized RSM predictions of the compressive strength are presented graphically as follows: Fig. 12 shows the normal plot of residuals, residuals versus predicted values, and residuals versus runs; Fig. 13 shows Cook’s distance and the Box-Cox plot for power transforms; Fig. 14 shows the predicted versus actual values, leverage versus run, DFFITS, and DFBETAS for intercept versus run; Fig. 15 shows the desirability plot for the optimized variables; and Fig. 16 shows the 3D surface plots for the optimized CS with respect to the most impactful variable pairs, namely MK and SHS, MK and SHSM, MK and W, MK and W/S, MK and Na₂O/Al₂O₃, MK and SiO₂/Al₂O₃, W/S and SHSM, and SSS and SHSM. The CS has been optimized against pairs of the most influential parameters to show how its behavior corresponds to the measured variables. The equation (see Appendix), expressed in terms of actual factors, can be used to predict the response at specific factor levels. These levels should be stated in their original units. However, this equation should not be used to assess the relative impact of each factor, as the coefficients are scaled based on factor units, and the intercept does not lie at the center of the design space.

Table 2.

CS fit summary.

Source	Sequential p-value	Adjusted R²	Predicted R²	Remark
Linear	< 0.0001	0.5549	0.5139	Suggested
2FI	< 0.0001	0.9510	−1.5442	
Quadratic	< 0.0001	0.9733	0.9583	Suggested


Table 3.

CS sequential model sum of squares.

Source	Sum of squares	df	Mean square	F-value	p-value	Remark
Mean vs. total	2.433E+05	1	2.433E+05			
Linear vs. mean	32988.20	12	2749.02	25.31	< 0.0001	Suggested
2FI vs. linear	22248.11	66	337.09	28.20	< 0.0001	
Quadratic vs. 2FI	925.21	12	77.10	11.82	< 0.0001	Suggested
Residual	939.26	144	6.52			
Total	3.004E+05	235	1278.36			


Table 4.

Fit statistics.

Std. Dev.	2.55	R²	0.9836
Mean	32.18	Adjusted R²	0.9733
C.V. %	7.94	Predicted R²	0.9583
		Adeq Precision	40.5390
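
The fit statistics can be cross-checked against the sums of squares in Table 3. The arithmetic below is a sketch that assumes the standard definitions of R², adjusted R², and the residual standard deviation, with n = 235 records and p = 12 + 66 + 12 = 90 model terms (so that n − p − 1 = 144, the residual degrees of freedom).

```latex
\begin{aligned}
SS_{\text{total, corrected}} &= 32988.20 + 22248.11 + 925.21 + 939.26 = 57100.78 \\
R^2 &= 1 - \frac{SS_{\text{residual}}}{SS_{\text{total, corrected}}}
     = 1 - \frac{939.26}{57100.78} \approx 0.9836 \\
R^2_{\text{adj}} &= 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
     = 1 - 0.01645 \times \frac{234}{144} \approx 0.9733 \\
\text{Std. Dev.} &= \sqrt{MS_{\text{residual}}} = \sqrt{6.52} \approx 2.55
\end{aligned}
```

Under these assumptions the recomputed values agree with the tabulated R², adjusted R², and standard deviation.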


Fig. 12 — The optimized sketches for (a) normal plot of residuals, (b) residuals versus predicted values, and (c) residuals versus runs.

Fig. 13 — Sketches of (a) Cook’s distance and (b) Box-Cox plot for power transforms.

Fig. 14 — Sketches of the (a) predicted versus actual values, (b) leverage versus run, (c) DFFITS, and (d) DFBETAS for intercept versus run.

Fig. 15 — Desirability plot for the optimized variables.

Fig. 16 — 3D surface plots for the optimized CS with respect to the most impactful variables such as (a) MK and SHS, (b) MK and SHSM, (c) MK and W, (d) MK and W/S, (e) MK and Na₂O/Al₂O₃, (f) MK and SiO₂/Al₂O₃, (g) W/S and SHSM, and (h) SSS and SHSM.

GB model

The GB model was developed with Scikit-learn using a learning rate of 0.1 and a minimum splitting subset of 1. Nine trials were conducted, starting with two trees and four tree levels and increasing gradually to four trees and eight tree levels. The reduction in the prediction Error % for each trial is presented in Fig. 17. Accordingly, the models with two trees and three tree levels are considered the optimum ones. Performance metrics of the developed models for both the training and validation datasets are listed in Table 5, and the relations between calculated and predicted values are shown in Fig. 18. The GB model achieved an average accuracy of 95% and an R² of 0.99: its predictions closely match the measured compressive strengths, and it explains 99% of the variance in compressive strength. This makes it highly effective for predicting the compressive strength of geopolymer concrete incorporating metakaolin and a valuable tool for sustainable construction practices. The success of the GB model depends on its ability to identify and leverage the key input factors; it combines these variables, with Gradient Boosting effectively capturing their nonlinear interactions. Regarding the implications for sustainable construction, encouraging the use of metakaolin, a calcined product of kaolin clay, as an alternative to ordinary Portland cement (OPC) significantly reduces carbon emissions. Reliable prediction of compressive strength ensures that concrete mixes meet durability requirements, reducing maintenance and lifecycle costs.
In summary, the GB model for compressive strength prediction of metakaolin-based geopolymer concrete demonstrates excellent accuracy and reliability, making it a powerful tool for optimizing material design. Its ability to capture complex interactions between factors contributes to effective resource use, enhanced durability, and a reduced environmental footprint. Future enhancements in data diversity and validation will further solidify its role in sustainable infrastructure development.
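
The boosting idea, in which each new tree is fitted to the errors left by the trees before it, can be sketched in pure Python. This is an illustration only: it uses one-feature regression stumps rather than the Scikit-learn trees of the study, the data are hypothetical, and only the learning rate of 0.1 is taken from the setup described above.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump: one threshold, one mean on each side."""
    best = None
    for t in sorted(set(xs))[:-1]:        # candidate thresholds between observed values
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    trees, preds = [], [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]   # negative gradient of squared loss
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical nonlinear strength-versus-input data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5.0, 7.0, 12.0, 20.0, 26.0, 30.0, 31.0, 32.0]
mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted_mse = mse(gradient_boost(xs, ys), xs, ys)
print(baseline_mse, boosted_mse)
```

The learning rate shrinks each stump's contribution, so many small corrections accumulate instead of a single tree dominating the fit.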

Fig. 17 — Reduction in Error (%) with increasing the number of trees and levels.

Fig. 18 — Relation between predicted and calculated strength using (GB).

CN2 model

Similarly, five CN2 models were developed using “Laplace accuracy” as the evaluation measure, with a beam width of 1.0 and a minimum rule coverage of 1.0. The maximum rule length started at 5.0 and was increased up to 9.0. Figure 19 shows the reduction in Error % with increasing rule length. Accordingly, a rule length of 9.0 is adopted. The developed model contains 129 “IF condition” rules, some of which are presented in Fig. 20. Performance metrics of the developed model for both the training and validation datasets are listed in Table 5, and the relations between calculated and predicted values are shown in Fig. 21. The CN2 model achieved an average accuracy of 91% and an R² of 0.97, meaning it explains 97% of the variance in compressive strength and captures the underlying patterns in the data well enough to serve as a decision-support tool for sustainable construction applications. Although slightly below the performance of the Gradient Boosting (GB) model, this level of accuracy is sufficient for practical applications. The CN2 rule induction algorithm is designed to generate human-readable rules, which provide insight into how different factors affect compressive strength. However, it may not match the predictive power of more sophisticated models such as Gradient Boosting or Random Forest. For datasets with many variables or nonlinear relationships, the generated rules may become complex or less generalizable, and the model’s performance depends heavily on the diversity and quality of the training data.
In summary, the CN2 model provides a reliable and interpretable approach to predicting the compressive strength of geopolymer concrete containing metakaolin, and is a valuable tool for mix design and process optimization. By leveraging the model’s rule-based outputs, engineers can make informed decisions to enhance performance while minimizing environmental impact. Future improvements in data diversity and validation will further solidify the model’s practical applications.
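
The Laplace accuracy used to rank candidate rules can be sketched as follows, assuming the usual definition (n_c + 1) / (n + k), where n_c is the number of covered examples of the rule's target class, n the total number covered, and k the number of classes. The class names and counts are hypothetical.

```python
def laplace_accuracy(class_counts, target):
    """Laplace-corrected accuracy of a rule; class_counts must list every class."""
    n = sum(class_counts.values())     # examples covered by the rule
    k = len(class_counts)              # number of classes
    return (class_counts.get(target, 0) + 1) / (n + k)

# Hypothetical rules predicting the class "high"
well_supported = laplace_accuracy({"high": 9, "low": 1}, "high")
barely_supported = laplace_accuracy({"high": 1, "low": 0}, "high")
print(round(well_supported, 3), round(barely_supported, 3))
```

The second rule is 100% pure on the single example it covers, yet it scores lower than the first, which is how the correction discourages rules with very small coverage.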

Fig. 19 — Reduction in Error % with increasing the rule length.

Fig. 20 — Sample of the developed CN2 “If condition”.

Fig. 21 — Relation between predicted and calculated strength using (CN2).

NB model

The traditional Naive Bayes classifier, based on the concept of “maximum likelihood”, was used to develop the nine models. Although this type of classifier is highly scalable and is used in many applications, it showed very low performance, as shown in Table 5. The relations between calculated and predicted values are shown in Fig. 22. The achieved average accuracy was 28%. The NB model as presented in this research therefore shows no promise for application in GPC design and production, as its accuracy is not acceptable.

Fig. 22 — Relation between predicted and calculated strength using (NB).

SVM model

The SVM model was based on a “polynomial” kernel with a cost value of 100, a regression loss of 0.10, and a numerical tolerance of 1.0. The kernel started as a one-degree polynomial (linear) and was increased up to a four-degree polynomial (quartic). The reduction in Error % with increasing polynomial degree is illustrated in Fig. 23. Accordingly, the quartic kernel is adopted. Performance metrics of the developed models for both the training and validation datasets are listed in Table 5, and the relations between calculated and predicted values are shown in Fig. 24. The SVM model achieved an average accuracy of 91% and an R² of 0.97, meaning it explains 97% of the variance in compressive strength and successfully identifies the relationships between the input parameters, such as metakaolin content and curing conditions, and the compressive strength of the concrete. SVM models can be computationally intensive, especially with large datasets or nonlinear kernels, and hyperparameter tuning (e.g., kernel choice, regularization parameter C, and gamma) is critical to achieving optimal performance. The accuracy and generalizability of the SVM model also depend on the diversity and quality of the training data: if the dataset lacks representation across all relevant parameter ranges, the model may fail to predict accurately for untested scenarios.
In summary, the SVM model demonstrates strong predictive power for compressive strength in geopolymer concrete with metakaolin. Its ability to model nonlinear relationships makes it particularly suitable for this complex application, supporting sustainable construction by optimizing resource use and promoting eco-friendly materials. Future enhancements in data diversity, hyperparameter tuning, and validation will further strengthen the model’s utility and reliability.
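
The two ingredients named in the setup, a polynomial kernel and a regression loss, can be sketched as below. The sketch assumes the common form K(x, x′) = (γ x·x′ + c₀)^d and reads the reported regression loss of 0.10 as the ε of an ε-insensitive loss; γ, c₀, and the sample vectors are hypothetical, since they are not reported in this section.

```python
def polynomial_kernel(x, x_prime, degree=4, gamma=1.0, coef0=1.0):
    """K(x, x') = (gamma * <x, x'> + coef0) ** degree; gamma and coef0 are assumed."""
    dot = sum(a * b for a, b in zip(x, x_prime))
    return (gamma * dot + coef0) ** degree

def epsilon_insensitive(y_true, y_pred, epsilon=0.10):
    """Errors smaller than epsilon cost nothing; larger errors grow linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Degree 1 with coef0 = 0 reduces to the plain dot product (the linear kernel)
print(polynomial_kernel((1.0, 2.0), (3.0, 4.0), degree=1, coef0=0.0))
print(polynomial_kernel((1.0, 2.0), (3.0, 4.0), degree=2))
print(epsilon_insensitive(30.0, 30.05), epsilon_insensitive(30.0, 31.0))
```

Raising the degree lets the kernel represent higher-order interactions between inputs without constructing those product terms explicitly.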

Fig. 23 — Reduction in Error % with increasing the polynomial degree.

Table 5.

Performance measurements of developed ensemble models for (Fc).

Model	Dataset	SSE	MAE	MSE	RMSE	Error %	Accuracy %	R²
GB	Training	567	1.4	3.2	1.8	0.06	0.94	0.99
GB	Validation	131	1.4	2.4	1.5	0.05	0.95	0.99
CN2	Training	1741	2.2	9.7	3.1	0.10	0.90	0.97
CN2	Validation	353	2.0	6.4	2.5	0.08	0.92	0.97
NB	Training	37,909	8.9	210.6	14.5	0.45	0.55	0.53
NB	Validation	19,410	12.1	352.9	18.8	0.59	0.01	0.27
SVM	Training	1496	1.7	8.3	2.9	0.09	0.91	0.97
SVM	Validation	398	1.8	7.2	2.7	0.09	0.91	0.97
SGD	Training	32,854	9.2	182.5	13.5	0.42	0.58	0.41
SGD	Validation	9156	8.9	7.2	12.9	0.41	0.59	0.37
KNN	Training	399	1.3	2.2	1.5	0.05	0.95	0.99
KNN	Validation	131	1.4	2.4	1.5	0.05	0.95	0.99
Tree	Training	921	1.7	5.1	2.3	0.07	0.93	0.98
Tree	Validation	350	1.8	6.4	2.5	0.08	0.92	0.97
RF	Training	5500	3.0	30.6	5.5	0.17	0.83	0.88
RF	Validation	946	2.5	17.2	4.1	0.13	0.87	0.92
GB [25]	–	–	1.615	–	–	–	–	0.98
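
The measures reported in Table 5 can be computed as in the sketch below. It assumes that Error % is the RMSE divided by the mean of the measured values and that Accuracy % is its complement, which is consistent with the tabulated figures but is not stated explicitly here; R² is taken as the coefficient of determination. The measured and predicted values are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """SSE, MAE, MSE, RMSE, relative error, accuracy, and R^2 for paired values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    sse = sum(e ** 2 for e in errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sse / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    sst = sum((t - mean_true) ** 2 for t in y_true)
    error = rmse / mean_true              # assumed definition of "Error %"
    return {"SSE": sse, "MAE": mae, "MSE": mse, "RMSE": rmse,
            "Error": error, "Accuracy": 1.0 - error, "R2": 1.0 - sse / sst}

# Hypothetical measured and predicted strengths (MPa)
metrics = regression_metrics([20.0, 30.0, 40.0], [22.0, 29.0, 41.0])
for name, value in metrics.items():
    print(name, round(value, 4))
```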


Fig. 24 — Relation between predicted and calculated strength using (SVM).

SGD model

These three models were developed using the modified Huber classification function and “Elastic net” regularization with a mixing factor of 0.01 and a strength factor of 0.001. The learning rate started at 0.01 and was gradually decreased to 0.001. The reduction in Error % with reducing the learning rate is presented in Fig. 25. Performance metrics of the three developed models for both the training and validation datasets are listed in Table 5. The average achieved accuracy was 59%. The relations between calculated and predicted values are shown in Fig. 26. The SGD model results obtained in this research are therefore not acceptable.

Fig. 25 — Reduction in Error % with reducing the learning rate.

Fig. 26 — Relation between predicted and calculated strength using (SGD).

KNN model

The KNN models were developed with 1.0 neighbor, the Euclidean metric, and weights evaluated by distance. The KNN model showed the best performance, with an average accuracy of 95% and an R² of 0.99. The relations between calculated and predicted values are shown in Fig. 27. These metrics indicate that the model consistently predicts compressive strength values close to the observed data and explains 99% of the variance in compressive strength, suggesting a very strong relationship between the input features and the output. The kNN algorithm works by comparing new data points to the closest instances in the training dataset. By accurately predicting compressive strength, the model minimizes trial-and-error in mix design and reduces material waste. It supports the effective use of metakaolin, a calcined product of kaolin clay, as a sustainable alternative to cement, reducing carbon emissions, and it helps engineers optimize mix designs to balance strength, durability, and sustainability. Its ability to model nonlinear relationships and adapt to diverse datasets makes it a practical choice for sustainable construction. By optimizing hyperparameters and ensuring data diversity, the kNN model can become even more robust and reliable in real-world applications.

Fig. 27 — Relation between predicted and calculated strength using (KNN).

Tree model

These five models were developed with a minimum number of instances in leaves of 2.0 and a minimum split subset of 5.0. The models began with only one tree level and were gradually increased to 9.0 levels. Figure 28 illustrates the reduction in Error % with increasing number of layers. The layouts of the generated models are presented in Fig. 29. Performance metrics of the last developed model for both the training and validation datasets are listed in Table 5, and the relations between calculated and predicted values are shown in Fig. 30. The Decision Tree (Tree) model achieved an average accuracy of 92% and an R² of 0.975, meaning that 97.5% of the variance in compressive strength is explained by the model. This level of accuracy is suitable for practical applications where precision is essential for optimizing mix designs, and it suggests that the Decision Tree captures the key factors influencing strength while minimizing unexplained variability. One of the key advantages of the Decision Tree model is its interpretability: the model structure provides insight into the hierarchy of feature importance and the interactions between features. While 92% accuracy is commendable, other models such as Gradient Boosting or Random Forest may provide higher accuracy by combining multiple trees. The Decision Tree provides a structured visualization of decisions, whereas CN2 produces rule-based outputs, which may be easier to implement in certain scenarios.
In summary, the Decision Tree model demonstrates strong predictive power for compressive strength in metakaolin-based geopolymer concrete. Its transparency and ability to model nonlinear relationships make it a valuable tool for sustainable construction. However, regularization and validation are essential to ensure its reliability in real-world applications, and combining trees through ensemble methods can further enhance performance.

Fig. 28 — Reduction in Error % with increasing the No. of layers.

Fig. 29 — The layout of the developed (Tree).

Fig. 30 — Relation between predicted and calculated strength using (Tree).

RF model

Finally, nine RF models were generated. The models began with only two trees and four levels and were increased up to four trees and eight levels. Figure 31 shows the reduction in Error % with increasing number of trees and layers. Accordingly, the models with three trees and three layers are adopted. The developed models are graphically presented using a Pythagorean Forest in Fig. 32. These arrangements led to a good average accuracy of 85%. The relations between calculated and predicted values are shown in Fig. 33. The Random Forest (RF) model achieved an average accuracy of 85% and an R² of 0.90, which indicates acceptable but not exceptional predictions of compressive strength. The lower accuracy compared with models such as kNN or Gradient Boosting may point to a need for hyperparameter tuning (e.g., the number of trees or their depth). The R² of 0.90 implies that 90% of the variance in compressive strength is explained by the model, leaving 10% unaccounted for, possibly due to model limitations or missing influential factors. The achieved accuracy is also lower than what is typically expected of RF, suggesting potential issues with the model setup or data representation. These results indicate that while the model captures the key trends in the data, further optimization is required to enhance its predictive power.
By refining hyperparameters, improving data preprocessing, and incorporating additional features, the RF model could achieve higher accuracy and better support sustainable construction practices.

Fig. 31 — Reduction in Error % with increasing the No. of Trees and layers.

Fig. 32 — Pythagorean Forest diagram for the developed (RF) models.

Fig. 33 — Relation between predicted and calculated strength using (RF).

Overall, the evaluated techniques, namely Gradient Boosting (GB), CN2, SVM, kNN, Decision Tree, and Random Forest (RF), offer varying levels of accuracy and R² in predicting the compressive strength of geopolymer concrete utilizing metakaolin (see Table 5 and Fig. 34 for the summary and the model comparison using the Taylor chart). In addition, the RSM symbolic regression showed a predicted R² of 0.958 and an adequate precision (Adeq Precision) of 40.539. The comparison below is based on sustainability and performance criteria: the model should provide high accuracy and explain most of the variance in compressive strength, and it should reliably predict the effects of sustainable materials such as metakaolin to minimize overuse of resources. The GB and kNN models outperform the others with 95% accuracy and R² = 0.99, indicating that they capture most of the variability in compressive strength. The Decision Tree and CN2 models follow with slightly lower but acceptable performance, while the RF model lags with the lowest accuracy and R² among these six. CN2 and Decision Tree are the most interpretable models, offering rule-based or hierarchical outputs. GB, SVM, and kNN are less interpretable, with RF sitting in the middle owing to its feature importance insights. Models such as GB and kNN, with high accuracy and R², are better suited for reliable prediction of metakaolin-based concrete, reducing trial-and-error and material waste. RF may struggle to provide reliable predictions for sustainability-focused applications because of its lower accuracy. GB and RF are robust owing to their ensemble nature, which reduces overfitting compared with single models such as the Decision Tree. SVM can generalize well but requires careful tuning, while kNN is sensitive to noisy data and parameter selection. Decision Tree, CN2, and kNN are computationally efficient, whereas GB, SVM, and RF require more computational resources, especially with larger datasets.
Gradient Boosting (GB) and kNN emerge as the top contenders owing to their superior performance metrics. GB offers high accuracy, robustness, and excellent handling of nonlinearities, and is ideal for large-scale or complex datasets. kNN is simple, effective, and highly accurate, and is best for smaller datasets or applications requiring quick predictions. GB is recommended as the best choice for comprehensive prediction needs where robustness and precision are critical; it is slightly more computationally expensive but highly reliable for sustainable applications. kNN is the second choice, as it is simpler and computationally efficient, suitable for smaller-scale operations or initial exploratory modeling. Random Forest (RF) should be avoided for this application unless improved through hyperparameter tuning, as its lower accuracy and R² could hinder sustainable mix design efforts. For predicting compressive strength in geopolymer concrete with metakaolin, Gradient Boosting provides the most sustainable and reliable option. Its balance of accuracy, robustness, and adaptability ensures precise modeling for optimizing concrete designs while promoting sustainability. kNN and the RSM serve as viable alternatives when simplicity and speed are prioritized. GB and KNN outperform CN2 Rule Induction, Naive Bayes, Support Vector Machine, Stochastic Gradient Descent, Tree Decision, Random Forest, and Response Surface Methodology in this study because of their distinct strengths in handling complex relationships between features and optimizing prediction accuracy. GB excels by iteratively building a sequence of weak learners (typically decision trees) in which each successive learner corrects the errors made by its predecessor. This process enables GB to capture complex, non-linear relationships between variables such as SHS, SHSM, and curing conditions, leading to more accurate predictions of compressive strength.
Its ability to minimize residual errors through weighted adjustments makes it highly effective in scenarios with noisy or complex data. KNN, a simple yet powerful algorithm, performs well because it does not assume any parametric relationship between the inputs and output. By assigning predictions based on the similarity (distance) to neighboring data points, KNN effectively handles the non-linear interactions among input parameters such as Na₂O/Al₂O₃ and SiO₂/Al₂O₃ ratios, yielding accurate results when sufficient relevant data points are available. In contrast, CN2 Rule Induction and Naive Bayes have inherent limitations with capturing complex interactions and handling non-linear dependencies. SVM can suffer from performance issues in highly non-linear and noisy datasets unless carefully tuned. SGD and Tree Decision may underperform due to their sensitivity to noisy data and overfitting tendencies, respectively. While Random Forest often excels, it can be outperformed by Gradient Boosting when boosting correctly compensates for errors and optimizes further. Finally, Response Surface Methodology, a traditional statistical approach, may not effectively model the intricate and multi-dimensional relationships between input variables and concrete strength compared to modern machine learning methods. Furthermore, a previous research results presented in the literature²⁵ showed GB as a decisive model with R2 of 0.98 and MAE of 1.615 MPa, which compred to the best performed models of the present research paper i.e., GB and KNN, the present work best models outperformed the previous work. The Taylor charts for training and validation provide a comprehensive visualization of the predictive performance of different models in terms of correlation coefficient, standard deviation, and root mean square error (RMSE). In the training chart, the models generally cluster closer to the measured reference point, indicating a higher level of fit during training. 
The best-performing models, such as RF (Random Forest) and GB (Gradient Boosting), are positioned near the measured point with low RMSE and high correlation values, suggesting accurate predictions. Models like Naive Bayes (NB) show poor performance, being further away from the reference point with higher RMSE and lower correlation coefficients. The validation results show some spread compared to the training set, reflecting potential overfitting or challenges in generalizing to unseen data. Similar to the training chart, RF and GB remain among the best models with high correlation and low RMSE values, maintaining consistency between training and validation phases. The models with poorer performance during training, such as NB, continue to show weaker results in the validation phase, positioned farther from the optimal point. Overall, these charts highlight the strong generalization ability of ensemble methods like RF and GB, while simpler models, particularly NB, exhibit limited suitability for this research task.
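The boosting mechanism described above, in which each weak learner is fitted to the errors left by its predecessors, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes squared-error loss, one-split regression stumps as the weak learners, and a fixed learning rate, and the small `X`/`y` arrays (two hypothetical inputs standing in for SHSM and Age) are made-up illustrative values, not records from the study's database.

```python
from statistics import mean


def fit_stump(rows, residuals):
    """Return the one-split regression stump with the lowest squared error."""
    best = None
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds lie midway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [r for row, r in zip(rows, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[j] > threshold]
            left_mean, right_mean = mean(left), mean(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    return best[1:]  # (feature index, threshold, left value, right value)


def stump_predict(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_gb(rows, targets, n_rounds=50, learning_rate=0.1):
    """Boosting loop: every new stump is fitted to the current residuals."""
    base = mean(targets)
    preds = [base] * len(targets)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(targets, preds)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, rows)]
    return base, learning_rate, stumps


def predict_gb(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


def mse(model, rows, targets):
    return mean((t - predict_gb(model, r)) ** 2 for r, t in zip(rows, targets))


# Illustrative values only (hypothetical molarity, age in days -> strength in MPa).
X = [[8, 7], [8, 28], [10, 7], [10, 28], [12, 7], [12, 28], [14, 7], [14, 28]]
y = [20.0, 27.0, 26.0, 34.0, 31.0, 40.0, 29.0, 37.0]

model = fit_gb(X, y)
print("training MSE, mean-only baseline:", mse(fit_gb(X, y, n_rounds=0), X, y))
print("training MSE, 50 boosting rounds:", mse(model, X, y))
```

Under these assumptions the training error can only fall from round to round, which is also why boosted models need validation data to detect overfitting.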

Fig. 34 — Comparing the accuracies of the developed models for (CS) using Taylor charts, (a) Training dataset, (b) Validation dataset.
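For readers who want to reproduce the quantities a Taylor chart displays, the following is a minimal sketch under stated assumptions. It uses the standard Taylor construction, in which the plotted distance to the reference point is the centered root mean square difference; whether the charts in Fig. 34 use the centered or the plain RMSE is not stated here. The `measured` and `predicted` lists are illustrative values, not data from the study.

```python
import math


def taylor_stats(measured, predicted):
    """Standard deviation, correlation, and centered RMS difference."""
    n = len(measured)
    mean_ref = sum(measured) / n
    mean_mod = sum(predicted) / n
    # Population (divide-by-n) statistics, as in the usual Taylor construction.
    sd_ref = math.sqrt(sum((v - mean_ref) ** 2 for v in measured) / n)
    sd_mod = math.sqrt(sum((v - mean_mod) ** 2 for v in predicted) / n)
    cov = sum((a - mean_ref) * (b - mean_mod)
              for a, b in zip(measured, predicted)) / n
    r = cov / (sd_ref * sd_mod)
    crmsd = math.sqrt(sum(((b - mean_mod) - (a - mean_ref)) ** 2
                          for a, b in zip(measured, predicted)) / n)
    return {"sd_ref": sd_ref, "sd_model": sd_mod, "r": r, "crmsd": crmsd}


# Illustrative compressive strengths in MPa (not from the paper).
measured = [20.0, 27.0, 26.0, 34.0, 31.0, 40.0]
predicted = [22.0, 25.0, 27.0, 33.0, 30.0, 38.0]

stats = taylor_stats(measured, predicted)
print(stats)
```

The three statistics are tied together by the law of cosines, crmsd² = sd_ref² + sd_model² − 2·sd_ref·sd_model·r, which is what allows a single point on the chart to encode all of them.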

Sensitivity analysis

A sensitivity index of 1.0 indicates complete sensitivity, whereas a sensitivity index below 0.01 indicates that the model is insensitive to changes in the parameter. Figure 35 shows the sensitivity analysis with respect to CS. The Hoffman and Gardner sensitivity analysis of the compressive strength of pre-cured metakaolin-based geopolymer concrete yielded the following relative importance values: MK 5%, SHS 7%, SHSM 11%, SSS 11%, W 4%, W/S 12%, Na₂O/Al₂O₃ 11%, SiO₂/Al₂O₃ 8%, H₂O/Na₂O 4%, CA/FA 0%, SP 8%, PCC 7%, CT 7%, and Age 5%. The analysis thus provides a quantitative assessment of the factors influencing the compressive strength of geopolymer concrete that uses metakaolin (MK) as a key material. The water-to-solids ratio (W/S, 12%) is the most influential factor, indicating its critical role in determining workability and compressive strength; optimizing this ratio is vital for achieving high strength. SHSM, SSS, and Na₂O/Al₂O₃ (11% each) significantly affect the geopolymerization reactions, which directly control strength development. The silica-to-alumina ratio (SiO₂/Al₂O₃, 8%) indicates that the balance of silica and alumina in the mix is crucial for structural integrity and geopolymer network formation. Among the moderately influential factors, SHS, SP, PCC, CT, and SiO₂/Al₂O₃ (7–8%) affect compressive strength primarily by influencing geopolymerization kinetics and workability. Age (5%) has a smaller influence than the other factors, suggesting that the concrete reaches most of its compressive strength early in curing. Among the least influential factors, the extra water content (W, 4%) contributes little because it is already accounted for in the W/S ratio, and H₂O/Na₂O (4%) has limited influence on strength beyond optimizing alkali activation.
The CA/FA ratio (0%) makes no contribution to strength, which highlights that the geopolymerization reaction is independent of the aggregate ratio. For optimal design and sustainable production of the geopolymer concrete, the focus should be on optimizing the W/S ratio to obtain maximum compressive strength with minimal resource use. Adjusting SHSM, SSS, and the alkali-to-alumina ratio can enhance strength while reducing reliance on cement, in line with sustainability goals. The moderate influence of PCC (7%), which encodes the pre-curing condition, suggests that the pre-curing regime could be relaxed where possible without a large loss of strength. The insignificant role of CA/FA indicates freedom to use recycled or alternative aggregates without compromising strength. The low influence of Age suggests fast strength gain, making this material suitable for projects requiring quick turnaround times.

It is therefore recommended to prioritize precise control of the W/S, Na₂O/Al₂O₃, and SiO₂/Al₂O₃ ratios; to select appropriate molarities and quantities of the sodium hydroxide and sodium silicate solutions (SHSM and SSS) to enhance the geopolymerization process; to relax pre-curing (PCC) where possible to improve sustainability; and to explore alternative or recycled aggregates, as CA/FA does not affect strength. Given the low sensitivity to Age, techniques such as elevated curing temperatures (CT) can be used to speed up strength gain without compromising performance. The Hoffman and Gardner sensitivity analysis reveals that the chemical ratios, the W/S ratio, and the chemical activators are the dominant factors influencing the compressive strength of geopolymer concrete with metakaolin. By focusing on these elements, it is possible to optimize mix designs for sustainability, efficiency, and high strength while reducing dependency on traditional materials like cement.
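The sensitivity index discussed above can be sketched as follows. This assumes the commonly stated form of the Hoffman and Gardner index, SI = (D_max − D_min) / D_max, where D_max and D_min are the largest and smallest model outputs obtained while one input is swept over its range and the others are held at baseline values; the normalization of the indices to percentage shares is an assumption about how relative importance could be derived, not a confirmed detail of the study. The function `toy_strength`, the baseline, and the ranges are hypothetical stand-ins for a fitted model and the database limits.

```python
def sensitivity_index(model, baseline, name, low, high, steps=20):
    """Hoffman-Gardner style index for one input, others held at baseline."""
    outputs = []
    for i in range(steps + 1):
        inputs = dict(baseline)
        inputs[name] = low + (high - low) * i / steps
        outputs.append(model(inputs))
    d_max, d_min = max(outputs), min(outputs)
    return (d_max - d_min) / d_max  # assumes positive outputs such as strength


def relative_importance(indices):
    """Normalize the indices so that they sum to 100% (assumed convention)."""
    total = sum(indices.values())
    return {name: 100 * si / total for name, si in indices.items()}


def toy_strength(x):
    """Hypothetical response surface in MPa; it deliberately ignores CA/FA."""
    return 20.0 + 1.5 * x["SHSM"] - 60.0 * (x["W/S"] - 0.45) ** 2


# Hypothetical baseline mix and input ranges (not the study's values).
baseline = {"SHSM": 12.0, "W/S": 0.45, "CA/FA": 1.5}
ranges = {"SHSM": (8.0, 16.0), "W/S": (0.30, 0.70), "CA/FA": (1.0, 2.0)}

indices = {name: sensitivity_index(toy_strength, baseline, name, lo, hi)
           for name, (lo, hi) in ranges.items()}
shares = relative_importance(indices)
print(indices)
print(shares)
```

An input that the model never uses, like CA/FA in this toy function, yields an index of exactly zero, which mirrors the 0% reported for CA/FA.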

Fig. 35 — Sensitivity analysis with respect to CS.
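The relative importance values reported in this section can be tallied and grouped with a few lines of code. The percentages below are the ones quoted in the text; the tier thresholds (10% and 7%) are an illustrative choice made here to mirror the grouping used in the discussion and are not defined by the authors.

```python
# Relative importance (%) as reported in the sensitivity analysis text.
reported = {
    "MK": 5, "SHS": 7, "SHSM": 11, "SSS": 11, "W": 4, "W/S": 12,
    "Na2O/Al2O3": 11, "SiO2/Al2O3": 8, "H2O/Na2O": 4, "CA/FA": 0,
    "SP": 8, "PCC": 7, "CT": 7, "Age": 5,
}

total = sum(reported.values())
ranked = sorted(reported, key=reported.get, reverse=True)


def tier(share):
    """Illustrative thresholds (an assumption, not taken from the paper)."""
    if share >= 10:
        return "dominant"
    if share >= 7:
        return "moderate"
    return "minor"


tiers = {}
for name in ranked:
    tiers.setdefault(tier(reported[name]), []).append(name)

tier_shares = {label: sum(reported[n] for n in names)
               for label, names in tiers.items()}

print("total:", total)
print("tiers:", tiers)
print("share per tier:", tier_shares)
```

With these thresholds, the four dominant inputs (W/S, SHSM, SSS, and Na₂O/Al₂O₃) together account for 45% of the total, and the reported values sum to 100%.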

Conclusions

This research presents a comparative study of eight ML classification techniques, namely GB, CN2, NB, SVM, SGD, KNN, Tree, and RF, and one symbolic regression technique, RSM, for estimating the compressive strength of metakaolin-based geopolymer concrete from the mixture component contents, pretreatment conditions, and concrete age. The outcomes of this study can be summarized as follows:

The GB and KNN models showed excellent accuracy of about 95%; the Tree, CN2, and SVM models showed very good accuracy of about 91%; the RF model showed good accuracy of about 85%; and the NB and SGD models showed unacceptable accuracy (less than 60%).
The symbolic regression model, RSM, showed an R² of 0.958 and an adequate precision (Adeq Prec) of 40.539.
Both the correlation matrix and the sensitivity analysis indicated that SHSM, SSS, W/S, and Na₂O/Al₂O₃ are the most influential inputs, with a relative importance of about 11–12% each, followed by SHS, SiO₂/Al₂O₃, SP, PCC, and CT with about 7–8% each, and then MK, W, H₂O/Na₂O, and Age with about 4–5% each; CA/FA had no influence on the CS.
All the developed models except RSM are too complicated to be used manually, which may be considered the main disadvantage of the ML classification techniques compared with symbolic regression techniques such as RSM.
The developed models are valid within the considered range of parameter values; beyond this range, the prediction accuracy should be verified.
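The last conclusion, that predictions are trustworthy only inside the range of the training data, can be enforced with a simple guard placed in front of any fitted model. The sketch below is one possible way to do this and is not part of the study: it derives per-feature minimum and maximum values from the training records and lists the inputs of a candidate mix that fall outside them. The three `training` records and both candidate mixes are hypothetical placeholders, not values from the database.

```python
def feature_ranges(records):
    """Per-feature (min, max) over the training records."""
    names = records[0].keys()
    return {name: (min(r[name] for r in records), max(r[name] for r in records))
            for name in names}


def out_of_range(candidate, ranges):
    """Names of the inputs that fall outside the training range."""
    return [name for name, (lo, hi) in ranges.items()
            if not lo <= candidate[name] <= hi]


# Hypothetical training records (placeholders, not the study's data).
training = [
    {"SHSM": 8.0, "CT": 25.0, "Age": 7.0},
    {"SHSM": 12.0, "CT": 60.0, "Age": 28.0},
    {"SHSM": 16.0, "CT": 80.0, "Age": 90.0},
]
ranges = feature_ranges(training)

inside = {"SHSM": 10.0, "CT": 40.0, "Age": 14.0}
outside = {"SHSM": 18.0, "CT": 40.0, "Age": 180.0}

print(out_of_range(inside, ranges))   # no inputs flagged
print(out_of_range(outside, ranges))  # SHSM and Age flagged
```

A per-feature check of this kind is only a necessary condition: a mix can sit inside every individual range and still be an unusual combination, so flagged and borderline cases would still warrant experimental verification.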

Practical application

The research work on optimizing the utilization of metakaolin in pre-cured geopolymer concrete using ensemble and symbolic regression models has practical applications in advancing the development and deployment of sustainable construction materials. By leveraging models such as Gradient Boosting, Random Forest, Support Vector Machine, and Response Surface Methodology, the study provides a data-driven framework for predicting and optimizing key performance parameters like compressive strength from input factors including sodium hydroxide solution content, curing conditions, and aggregate ratios. One key application of this research lies in the efficient formulation of geopolymer concrete with desired strength properties at reduced experimental cost and time. The model predictions enable concrete manufacturers to tailor material compositions precisely, thus minimizing reliance on traditional trial-and-error methods. The incorporation of metakaolin, which is produced by calcining kaolin clay, promotes environmental sustainability by reducing dependence on Portland cement, which has a high carbon footprint. The study’s findings can be extended to large-scale construction projects where pre-cured geopolymer concrete may be preferred for its durability and early strength development. Additionally, the optimized mix designs generated through machine learning can guide engineers in selecting material combinations that balance strength, workability, and environmental impact. This research also has implications for the circular economy by fostering the use of industrial by-products in construction, thereby contributing to waste reduction and resource efficiency. Furthermore, the integration of machine learning in material optimization demonstrates the feasibility of adopting predictive analytics in civil engineering, paving the way for intelligent material design and quality control processes. 
This approach not only enhances productivity but also supports the creation of more sustainable and resilient infrastructure systems.

Future scope of research

The research on optimizing the utilization of metakaolin in pre-cured geopolymer concrete using ensemble and symbolic regression techniques offers promising directions for future exploration. The application of advanced machine learning models like Gradient Boosting, Random Forest, Support Vector Machines, and Response Surface Methodology demonstrates the potential for data-driven innovations in sustainable construction materials. Further work could expand upon the current findings by investigating additional variables influencing concrete performance, such as thermal resistance, shrinkage, and permeability, to create more comprehensive predictive models. Incorporating real-world conditions such as varying climate exposure and aggressive environmental factors could further validate and enhance the predictive models. This would ensure robustness and reliability in practical applications. Additionally, the integration of sensor technology for real-time monitoring of concrete properties could be explored, providing dynamic input for model optimization during construction processes. Another potential avenue involves scaling the research toward industrial-level production, assessing the economic feasibility and energy efficiency of using metakaolin-based geopolymer concrete in large-scale construction projects. Collaborative efforts with industry stakeholders could accelerate the adoption of these optimized formulations in infrastructure projects aimed at reducing carbon footprints. Machine learning techniques might also be extended to multi-objective optimization problems, balancing factors such as strength, cost, and environmental sustainability simultaneously. Exploring hybrid computational approaches that combine symbolic regressions with deep learning methods may yield more precise and adaptable models. 
Lastly, the development of user-friendly decision-support systems based on the research outcomes would empower engineers and researchers to easily access and apply optimized geopolymer mix designs, fostering a broader impact within the construction industry and contributing to a more sustainable built environment.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1 (13.5 KB, docx)

Author contributions

KCO, VK, and AME conceptualized the study; KCO, VK, AME, SH, JLLL, FPLY, JLAP, MV, and SA wrote the main manuscript text; KCO and AME prepared the figures; and JLLL, FPLY, and JLAP provided the revisions and the replies to comments. All authors reviewed the manuscript.

Data availability

The data supporting this research work will be made available on request from the corresponding author.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Kennedy C. Onyelowe, Email: kennedychibuzor@kiu.ac.ug, Email: konyelowe@mouau.edu.ng

Viroon Kamchoom, Email: viroon.ka@kmitl.ac.th.

Ahmed M. Ebid, Email: Ahmed.abdelkhaleq@fue.edu.eg

References

1. Zou, B. et al. Artificial intelligence-based optimized models for predicting the slump and compressive strength of sustainable alkali-derived concrete. Constr. Build. Mater. 409, 134092. 10.1016/j.conbuildmat.2023.134092 (2023).
2. Shi, X. et al. Mechanical framework for geopolymer gels construction: an optimized LSTM technique to predict compressive strength of fly ash-based geopolymer gels concrete. Gels 10. 10.3390/gels10020148 (2024).
3. Nazar, S. et al. Machine learning interpretable-prediction models to evaluate the slump and strength of fly ash-based geopolymer. J. Mater. Res. Technol. 24, 100–124. 10.1016/j.jmrt.2023.02.180 (2023).
4. Qureshi, H. J. et al. Prediction of autogenous shrinkage of concrete incorporating super absorbent polymer and waste materials through individual and ensemble machine learning approaches. Materials (Basel) 15. 10.3390/ma15217412 (2022).
5. Liu, K. et al. Development of compressive strength prediction platform for concrete materials based on machine learning techniques. J. Build. Eng. 80, 107977. 10.1016/j.jobe.2023.107977 (2023).
6. Ali Khan, M., Zafar, A., Akbar, A., Javed, M. F. & Mosavi, A. Application of gene expression programming (GEP) for the prediction of compressive strength of geopolymer concrete. Materials (Basel) 14. 10.3390/ma14051106 (2021).
7. Al-Taai, S. R. et al. XGBoost prediction model optimized with Bayesian for the compressive strength of eco-friendly concrete containing ground granulated blast furnace slag and recycled coarse aggregate. Appl. Sci. 13. 10.3390/app13158889 (2023).
8. Nafees, A. et al. Predictive modeling of mechanical properties of silica fume-based green concrete using artificial intelligence approaches: MLPNN, ANFIS, and GEP. Materials (Basel) 14. 10.3390/ma14247531 (2021).
9. Hossain, M. A. S., Uddin, M. N. & Hossain, M. M. Prediction of compressive strength fiber-reinforced geopolymer concrete (FRGC) using gene expression programming (GEP). Mater. Today Proc. 10.1016/j.matpr.2023.02.458 (2023).
10. Gupta, P., Gupta, N. & Saxena, K. K. Predicting compressive strength of geopolymer concrete using machine learning. Innov. Emerg. Technol. 10, 2350003 (2023).
11. Khan, A. Q., Naveed, M. H., Rasheed, M. D. & Miao, P. Prediction of compressive strength of fly ash-based geopolymer concrete using supervised machine learning methods. Arab. J. Sci. Eng. 10.1007/s13369-023-08283-w (2023).
12. Onyelowe, K. C. et al. Multi-objective optimization of sustainable concrete containing fly ash based on environmental and mechanical considerations. Buildings 12, 948. 10.3390/buildings12070948 (2022).
13. Onyelowe, K. C. et al. Evaluating the compressive strength of recycled aggregate concrete using novel artificial neural network. Civil Eng. J. 8 (8), 1679–1694. 10.28991/CEJ-2022-08-08-011 (2022).
14. Onyelowe, K. C. et al. Global warming potential-based life cycle assessment and optimization of the compressive strength of fly ash-silica fume concrete; environmental impact consideration. Front. Built Environ. 8, 992552. 10.3389/fbuil.2022.992552 (2022).
15. Onyelowe, K. C. et al. Optimization of green concrete containing fly ash and rice husk ash based on hydro-mechanical properties and life cycle assessment considerations. Civil Eng. J. 8 (12). 10.28991/CEJ-2022-08-12-018 (2022).
16. Onyelowe, K. C., Gnananandarao, T., Jagan, J., Ahmad, J. & Ebid, A. M. Innovative predictive model for flexural strength of recycled aggregate concrete from multiple datasets. Asian J. Civil Eng. 1–10. 10.1007/s42107-022-00558-1 (2022).
17. Onyelowe, K. C. et al. AI mix design of fly ash admixed concrete based on mechanical and environmental impact considerations. Civil Eng. J. 9 (Special Issue). 10.28991/CEJ-SP2023-09-03 (2023).
18. Onyelowe, K. C., Ebid, A. M. & Ghadikolaee, M. R. GRG-optimized response surface powered prediction of concrete mix design chart for the optimization of concrete compressive strength based on industrial waste precursor effect. Asian J. Civil Eng. 1–10. 10.1007/s42107-023-00827-7 (2023).
19. Onyelowe, K. C. & Ebid, A. M. The influence of fly ash and blast furnace slag on the compressive strength of high-performance concrete (HPC) for sustainable structures. Asian J. Civil Eng. 10.1007/s42107-023-00817-9 (2023).
20. Onyelowe, K. C., Ebid, A. M., Aneke, F. I. & Nwobia, L. I. Different AI predictive models for pavement subgrade stiffness and resilient deformation of geopolymer cement-treated lateritic soil with ordinary cement addition. Int. J. Pavement Res. Technol. 10.1007/s42947-022-00185-8 (2022).
21. Ebid, A. M., Onyelowe, K. C., Kontoni, D.-P. N., Gallardo, A. Q. & Hanandeh, S. Heat and mass transfer in different concrete structures: a study of self-compacting concrete and geopolymer concrete. Int. J. Low-Carbon Technol. 18, 404–411. 10.1093/ijlct/ctad022 (2023).
22. Onyelowe, K. C., Ebid, A. M. & Hanandeh, S. Advanced machine learning prediction of the unconfined compressive strength of geopolymer cement reconstituted granular sand for road and liner construction applications. Asian J. Civil Eng. 1–15. 10.1007/s42107-023-00829-5 (2023).
23. Eftekhar Afzali, S. A., Shayanfar, M. A., Ghanooni-Bagha, M., Golafshani, E. & Ngo, T. The use of machine learning techniques to investigate the properties of metakaolin-based geopolymer concrete. J. Clean. Prod. 10.1016/j.jclepro.2024.141305 (2024).
24. Onyelowe, K. C. et al. Characterization of net-zero pozzolanic potential of thermally-derived metakaolin samples for sustainable carbon neutrality construction. Sci. Rep. 13, 18901. 10.1038/s41598-023-46362-y (2023).
25. Ebid, A., Onyelowe, K. C. & Deifalla, A. F. Data utilization and partitioning for machine learning applications in civil engineering. In International Conference on Advanced Technologies for Humanity. In book: Industrial Innovations: New Technologies in Cities’ Digital Infrastructure. Springer. 10.1007/978-3-031-70992-0_8 (2023).
26. Hoffman, F. O. & Gardner, R. H. Evaluation of Uncertainties in Radiological Assessment Models. Chapter 11 of Radiological Assessment: A Textbook on Environmental Dose Analysis. Edited by Till, J. E. and Meyer, H. R. NRC Office of Nuclear Reactor Regulation, Washington, D.C. (1983).
27. Harith, I. K., Abdulhadi, A. M. & Hussien, M. L. Harnessing machine learning for accurate estimation of compressive strength of high-performance self-compacting concrete from non-destructive tests: a comparative study. Constr. Build. Mater. 451. 10.1016/j.conbuildmat.2024.138779 (2024).
28. Harith, I. K. et al. Estimating the joint shear strength of exterior beam–column joints using artificial neural networks via experimental results. Innov. Infrastruct. Solut. 9, 38. 10.1007/s41062-023-01351-y (2024).
29. Harith, I. K. et al. Prediction of high-performance concrete strength using machine learning with hierarchical regression. Multiscale Multidiscip. Model. Exp. Des. 7, 4911–4922. 10.1007/s41939-024-00467-7 (2024).
30. Harith, I. K. et al. Harnessing machine learning for accurate estimation of concrete strength using non-destructive tests: a comparative study. Multiscale Multidiscip. Model. Exp. Des. 8, 27. 10.1007/s41939-024-00605-1 (2025).
31. Harith, I. K. et al. Comparison of artificial neural network and hierarchical regression in prediction compressive strength of self-compacting concrete with fly ash. Innov. Infrastruct. Solut. 9, 62. 10.1007/s41062-024-01367-y (2024).
32. Parhi, S. K. & Patro, S. K. Prediction of compressive strength of geopolymer concrete using a hybrid ensemble of grey wolf optimized machine learning estimators. J. Build. Eng. 71. 10.1016/j.jobe.2023.106521 (2023).
33. Dash, P. K., Parhi, S. K., Patro, S. K. & Panigrahi, R. Efficient machine learning algorithm with enhanced cat swarm optimization for prediction of compressive strength of GGBS-based geopolymer concrete at elevated temperature. Constr. Build. Mater. 400. 10.1016/j.conbuildmat.2023.132814 (2023).
34. Dash, P. K., Parhi, S. K., Patro, S. K. & Panigrahi, R. Influence of chemical constituents of binder and activator in predicting compressive strength of fly ash-based geopolymer concrete using firefly-optimized hybrid ensemble machine learning model. Mater. Today Commun. 37. 10.1016/j.mtcomm.2023.107485 (2023).
35. Onyelowe, K. C. Sustainable intelligent infrastructure, inaugural editorial. Sustainable Intell. Infrastructure 1 (1), 1–3. 10.62762/SII.2025.187975 (2025).
36. Parhi, S. K. et al. Metaheuristic optimization of machine learning models for strength prediction of high-performance self-compacting alkali-activated slag concrete. Multiscale Multidiscip. Model. Exp. Des. 7, 2901–2928. 10.1007/s41939-023-00349-4 (2024).
37. Parhi, S. K. & Patro, S. K. Parametric analysis and prediction of geopolymerization process. Mater. Today Commun. 41. 10.1016/j.mtcomm.2024.111047 (2024).
38. Parhi, S. K., Dwibedy, S. & Panigrahi, S. K. AI-driven critical parameter optimization of sustainable self-compacting geopolymer concrete. J. Build. Eng. 86. 10.1016/j.jobe.2024.108923 (2024).
39. Parhi, S. K., Nanda, A. & Panigrahi, S. K. Multi-objective optimization and prediction of strength along with durability in acid-resistant self-compacting alkali-activated concrete. Constr. Build. Mater. 456. 10.1016/j.conbuildmat.2024.139235 (2024).
40. Singaram, K. K., Khan, M. A., Talakokula, V. & Gurnani, C. Expansion in low calcium fly ash-based geopolymer concrete: chemical factors influenced by silica fume and NaOH concentration. J. Sustainable Cement-Based Mater. 14 (1), 74–88. 10.1080/21650373.2024.2426687 (2024).
41. Mandala, V. & Khan, M. A. Experimental investigations on layered functionally graded fiber-reinforced concrete. Structures 70. 10.1016/j.istruc.2024.107679 (2024).
42. Singaram, K. K., Khan, M. A. & Talakokula, V. Review on compressive strength and durability of fly-ash-based geopolymers using characterization techniques. Arch. Civ. Mech. Eng. 25, 73. 10.1007/s43452-025-01116-7 (2025).

[CR1] 1.Zou, B. et al. Artificial intelligence-based optimized models for predicting the slump and compressive strength of sustainable alkali-derived concrete. Constr. Build. Mater.409, 134092. 10.1016/j.conbuildmat.2023.134092 (2023). [Google Scholar]

[CR2] 2.Shi, X. et al. Mechanical framework for geopolymer gels construction: an optimized LSTM technique to predict compressive strength of fly Ash-Based geopolymer gels concrete, gels. 10 (2024). 10.3390/gels10020148 [DOI] [PMC free article] [PubMed]

[CR3] 3.Nazar, S. et al. Machine learning interpretable-prediction models to evaluate the slump and strength of fly ash-based geopolymer. J. Mater. Res. Technol.24, 100–124. 10.1016/j.jmrt.2023.02.180 (2023). [Google Scholar]

[CR4] 4.Qureshi, H. J. et al. Prediction of autogenous shrinkage of concrete incorporating super absorbent polymer and waste materials through individual and ensemble machine learning approaches. Mater. (Basel). 1510.3390/ma15217412 (2022). [DOI] [PMC free article] [PubMed]

[CR5] 5.Liu, K. et al. Development of compressive strength prediction platform for concrete materials based on machine learning techniques. J. Build. Eng.80, 107977. 10.1016/j.jobe.2023.107977 (2023). [Google Scholar]

[CR6] 6.Ali Khan, M., Zafar, A., Akbar, A., Javed, M. F. & Mosavi, A. Application of gene expression programming (GEP) for the prediction of compressive strength of geopolymer concrete, materials (Basel). 14 (2021). 10.3390/ma14051106 [DOI] [PMC free article] [PubMed]

[CR7] 7.Al-Taai, S. R. et al. XGBoost prediction model optimized with bayesian for the compressive strength of Eco-Friendly concrete containing ground granulated blast furnace slag and recycled coarse aggregate. Appl. Sci.1310.3390/app13158889 (2023).

[CR8] 8.Nafees, A. et al. Predictive modeling of mechanical properties of silica Fume-Based green concrete using artificial intelligence approaches: MLPNN, ANFIS, and GEP, materials (Basel). 14 (2021). 10.3390/ma14247531 [DOI] [PMC free article] [PubMed]

[CR9] 9.HossainM.A.S., UddinM.N. & HossainM.M. Prediction of compressive strength fiber-reinforced geopolymer concrete (FRGC) using gene expression programming (GEP). Mater. Today Proc.10.1016/j.matpr.2023.02.458 (2023). [Google Scholar]

[CR10] 10.Gupta, P., Gupta, N. & Saxena, K. K. Predicting compressive strength of geopolymer concrete using machine learning. Innov. Emerg. Technol.10, 2350003 (2023). [Google Scholar]

[CR11] 11.Khan, A. Q., Naveed, M. H., Rasheed, M. D. & Miao, P. Prediction of compressive strength of fly Ash-Based geopolymer concrete using supervised machine learning methods. Arab. J. Sci. Eng.10.1007/s13369-023-08283-w (2023).37361464 [Google Scholar]

[CR12] 12.Onyelowe, K. C. et al. Multi-Objective Optimization of Sustainable Concrete Containing Fly Ash Based on Environmental and Mechanical Considerations, Buildings, 2022, 12, 948, (2022). 10.3390/buildings12070948

[CR13] 13.Onyelowe, K. C. et al. Evaluating the compressive strength of recycled aggregate concrete using novel artificial neural network. Civil Eng. J.8 (8), 1679–1694. 10.28991/CEJ-2022-08-08-011 (2022). [Google Scholar]

[CR14] 14.Onyelowe, K. C. et al. Global warming potential-based life cycle assessment and optimization of the compressive strength of fly ash-silica fume concrete; environmental impact consideration. Front. Built Environ.8, 992552. 10.3389/fbuil.2022.992552 (2022). [Google Scholar]

[CR15] 15.Kennedy, C. et al. Optimization of Green Concrete Containing Fly Ash and Rice Husk Ash Based on Hydro-Mechanical Properties and Life Cycle Assessment Considerations, Civil Eng. J., 8, 12, 10.28991/CEJ-2022-08-12-018 (2022).

[CR16] 16.Onyelowe, K. C., Gnananandarao, T., Jagan, J., Ahmad, J. & Ebid, A. M. Innovative predictive model for flexural strength of recycled aggregate concrete from multiple datasets. Asian J. Civil Eng. 1–10. 10.1007/s42107-022-00558-1 (2022).

[CR17] 17.Kennedy, C. et al. AI Mix Design of Fly Ash Admixed Concrete Based on Mechanical and Environmental Impact Considerations, Civil Engineering Journal, Vol. 9, Special Issue, 2023, (2023). 10.28991/CEJ-SP2023-09-03

[CR18] 18.Onyelowe, K. C., Ebid, A. M. & Ghadikolaee, M. R. GRG-optimized response surface powered prediction of concrete mix design chart for the optimization of concrete compressive strength based on industrial waste precursor effect. Asian J. Civil Eng. 1–10. 10.1007/s42107-023-00827-7 (2023).

[CR19] 19.Onyelowe, K. C. & Ebid, A. M. The influence of fly Ash and blast furnace slag on the compressive strength of high-performance concrete (HPC) for sustainable structures. Asian J. Civil Eng.10.1007/s42107-023-00817-9 (2023). [Google Scholar]

[CR20] 20.Kennedy, C., Onyelowe, A. M., Ebid, Frank, I., Aneke, Light, I. & Nwobia Different AI predictive models for pavement subgrade stiffness and resilient deformation of geopolymer cement–Treated lateritic soil with ordinary cement addition. Int. J. Pavement Res. Technol.10.1007/s42947-022-00185-8 (2022). [Google Scholar]

[CR21] 21.Ahmed, M., Ebid, K. C., Onyelowe, Denise Penelope, N., Kontoni, A. Q. & Gallardo ShadiHanandeh, Heat and mass transfer in different concrete structures: a study of self-compacting concrete and geopolymer concrete, International Journal of Low-Carbon Technologies 2023, 18, 404–411, (2023). 10.1093/ijlct/ctad022

[CR22] 22.Onyelowe, K. C., Ebid, A. M. & Hanandeh, S. Advanced machine learning prediction of the unconfined compressive strength of geopolymer cement reconstituted granular sand for road and liner construction applications. Asian J. Civil Eng. 1–15. 10.1007/s42107-023-00829-5 (2023).

[CR23] 23.Eftekhar Afzali, S. A., Shayanfar, M. A., Ghanooni-Bagha, M., Golafshani, E. & Ngo, T. The use of machine learning techniques to investigate the properties of metakaolin-based geopolymer concrete. J. Clean. Prod.10.1016/j.jclepro.2024.141305 (2024). [Google Scholar]

[CR24] 24.Onyelowe, K. C. et al. Characterization of net-zero pozzolanic potential of thermally-derived Metakaolin samples for sustainable carbon neutrality construction. Sci. Rep.13, 18901. 10.1038/s41598-023-46362-y (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR25] 25.Ebid, A., Onyelowe, K. C. & Deifalla, A. F. Data utilization and partitioning for machine learning applications in civil engineering. In International Conference on Advanced Technologies for Humanity. In book: Industrial Innovations: New Technologies in Cities’ Digital Infrastructure, Publisher: Springer. (2023). 10.1007/978-3-031-70992-0_8

[CR26] 26.Hoffman, F. O. & Gardner, R. H. Evaluation of Uncertainties in Radiological Assessment Models. Chapter 11 of Radiological Assessment: A textbook on Environmental Dose Analysis. Edited by Till, J. E. and Meyer, H. R. NRC Office of Nuclear Reactor Regulation, Washington, D. C. (1983).

[CR27] 27.Harith, I. K., Abdulhadi, A. M. & Hussien, M. L. Harnessing machine learning for accurate Estimation of compressive strength of high-performance self-compacting concrete from non-destructive tests: A comparative study, construction and Building materials, 451, (2024). 10.1016/j.conbuildmat.2024.138779

[CR28] 28.Harith, I. K. et al. Estimating the joint shear strength of exterior beam–column joints using artificial neural networks via experimental results. Innov. Infrastruct. Solut. 9, 38. 10.1007/s41062-023-01351-y (2024).

[CR29] 29.Harith, I. K. et al. Prediction of high-performance concrete strength using machine learning with hierarchical regression. Multiscale Multidiscip. Model. Exp. Des. 7, 4911–4922. 10.1007/s41939-024-00467-7 (2024).

[CR30] 30.Harith, I. K. et al. Harnessing machine learning for accurate estimation of concrete strength using non-destructive tests: a comparative study. Multiscale Multidiscip. Model. Exp. Des. 8, 27. 10.1007/s41939-024-00605-1 (2025).

[CR31] 31.Harith, I. K. et al. Comparison of artificial neural network and hierarchical regression in prediction compressive strength of self-compacting concrete with fly ash. Innov. Infrastruct. Solut. 9, 62. 10.1007/s41062-024-01367-y (2024).

[CR32] 32.Parhi, S. K. & Patro, S. K. Prediction of compressive strength of geopolymer concrete using a hybrid ensemble of grey wolf optimized machine learning estimators. J. Building Eng. 71. 10.1016/j.jobe.2023.106521 (2023).

[CR33] 33.Dash, P. K., Parhi, S. K., Patro, S. K. & Panigrahi, R. Efficient machine learning algorithm with enhanced cat swarm optimization for prediction of compressive strength of GGBS-based geopolymer concrete at elevated temperature. Constr. Build. Mater. 400. 10.1016/j.conbuildmat.2023.132814 (2023).

[CR34] 34.Dash, P. K., Parhi, S. K., Patro, S. K. & Panigrahi, R. Influence of chemical constituents of binder and activator in predicting compressive strength of fly ash-based geopolymer concrete using firefly-optimized hybrid ensemble machine learning model. Mater. Today Commun. 37. 10.1016/j.mtcomm.2023.107485 (2023).

[CR35] 35.Onyelowe, K. C. Sustainable intelligent infrastructure, inaugural editorial. Sustainable Intell. Infrastructure 1 (1), 1–3. 10.62762/SII.2025.187975 (2025).

[CR36] 36.Parhi, S. K. et al. Metaheuristic optimization of machine learning models for strength prediction of high-performance self-compacting alkali-activated slag concrete. Multiscale Multidiscip. Model. Exp. Des. 7, 2901–2928. 10.1007/s41939-023-00349-4 (2024).

[CR37] 37.Parhi, S. K. & Patro, S. K. Parametric analysis and prediction of geopolymerization process. Mater. Today Commun. 41. 10.1016/j.mtcomm.2024.111047 (2024).

[CR38] 38.Parhi, S. K., Dwibedy, S. & Panigrahi, S. K. AI-driven critical parameter optimization of sustainable self-compacting geopolymer concrete. J. Building Eng. 86. 10.1016/j.jobe.2024.108923 (2024).

[CR39] 39.Parhi, S. K., Nanda, A. & Panigrahi, S. K. Multi-objective optimization and prediction of strength along with durability in acid-resistant self-compacting alkali-activated concrete. Constr. Build. Mater. 456. 10.1016/j.conbuildmat.2024.139235 (2024).

[CR40] 40.Singaram, K. K., Khan, M. A., Talakokula, V. & Gurnani, C. Expansion in low calcium fly ash-based geopolymer concrete: chemical factors influenced by silica fume and NaOH concentration. J. Sustainable Cement-Based Mater. 14 (1), 74–88. 10.1080/21650373.2024.2426687 (2024).

[CR41] 41.Mandala, V. & Khan, M. A. Experimental investigations on layered functionally graded fiber-reinforced concrete. Structures 70. 10.1016/j.istruc.2024.107679 (2024).

[CR42] 42.Singaram, K. K., Khan, M. A. & Talakokula, V. Review on compressive strength and durability of fly-ash-based geopolymers using characterization techniques. Arch. Civ. Mech. Eng. 25, 73. 10.1007/s43452-025-01116-7 (2025).
