Scientific Reports. 2025 Oct 14;15:35797. doi: 10.1038/s41598-025-17802-8

Improved XGBoost model for predicting minimum miscibility pressure in CO2 flooding

Yuxin Yang 1,2, Yizhong Zhang 1,2, Bowen Qin 1, Jianhong Guo 3, Maolin Zhang 1,2
PMCID: PMC12521415  PMID: 41087412

Abstract

CO2-enhanced oil recovery (EOR) is a key technology to improve oil recovery rates and support carbon capture, utilization, and storage (CCUS). Injecting CO2 into reservoirs reduces crude oil viscosity, enhancing its mobility. A critical factor in CO2-EOR is determining the Minimum Miscibility Pressure (MMP). This study aims to construct an MMP prediction model for pure and impure CO2-EOR based on an improved eXtreme Gradient Boosting (XGBoost) algorithm, introducing the critical temperature of the injection gas (Tcm) as a new research variable to explore its impact on MMP. The data used in this study comprises 218 experimental datasets, totaling 2,398 samples, covering both pure and impure CO2-EOR scenarios. Before model construction, important features related to MMP were identified by combining reservoir physical theory with Pearson correlation analysis, and dimensionality reduction was performed using principal component analysis to eliminate redundant information. To optimize the hyperparameters of XGBoost, this study introduced the Particle Swarm Optimization (PSO) algorithm to ensure optimal model parameter configuration. Additionally, Shapley Additive Explanations (SHAP) analysis was employed to evaluate the model’s interpretability, resulting in a CO2-MMP prediction model with good explanatory capability. The results indicate that the proposed method achieved an RMSE of 0.2347 and an R2 of 0.9991 for the training set, with an RMSE of 1.0303 and an R2 of 0.9845 for the testing set, outperforming traditional MMP prediction models in various performance metrics. The proposed methodology enables a transparent, efficient, and generalizable approach to MMP prediction, offering valuable insights for CO2-EOR strategy design and supporting more cost-effective and data-driven reservoir development.

Keywords: Minimum miscibility pressure, Machine learning, EOR, Optimization, CCUS, Explainable machine learning

Subject terms: Environmental sciences, Energy science and technology, Engineering, Scientific data

Introduction

As an effective method for Enhanced Oil Recovery (EOR), CO2 flooding plays a critical role in oilfield development and is closely integrated with Carbon Capture, Utilization, and Storage (CCUS) technologies13. By capturing CO2 from industrial emissions and injecting it into underground reservoirs for oil recovery, this approach helps reduce atmospheric carbon levels and leverages the physical and chemical properties of CO2 to enhance oil recovery, achieving a win–win for environmental and economic benefits2. In CCUS projects, CO2 flooding is regarded as an effective method for carbon utilization and a key means to enhance oilfield production efficiency, garnering widespread international attention25. During CO2 flooding, the injected CO2 mixes with crude oil, reducing its viscosity and increasing its mobility, thereby improving oil recovery4. Minimum Miscibility Pressure (MMP) is the lowest pressure required for complete miscibility between CO2 and oil, serving as a critical parameter in the flooding process6. When injection pressure remains above MMP, the interfacial tension between oil and CO2 is minimized or even reduced to zero, improving displacement efficiency and achieving better oil recovery7. If the injection pressure falls below MMP, CO2 cannot thoroughly mix with oil, reducing displacement efficiency and compromising EOR effectiveness. Consequently, immiscible CO2 flooding is often required in field operations to enhance EOR performance810. Therefore, accurately predicting MMP is of substantial engineering significance for optimizing injection strategies, improving oil recovery rates, and ensuring the economic feasibility of CCUS projects68,11.

Methods for evaluating MMP can generally be divided into two main categories: laboratory measurements and theoretical calculations. Laboratory methods, such as the slim-tube (ST) test, rising bubble apparatus (RBA), and vanishing interfacial tension (VIT) techniques, are common experimental approaches for directly measuring MMP. While the RBA and VIT methods can measure MMP, they carry high levels of uncertainty12,13. The slim-tube method simulates in-situ reservoir conditions, gradually increasing injection pressure to observe CO2-oil miscibility and thereby determine MMP14. Despite its precision, the slim-tube approach involves a complex, time-intensive, and costly procedure, limiting its practical applicability in the field15–17. Researchers have developed various analytical and computational methods to streamline MMP prediction and reduce the associated time and cost. These methods enable rapid calculation and prediction of MMP through correlation analysis, minimizing resource consumption while ensuring efficient determination18–21. Holm and Josendal first noted that MMP correlations depend on the molecular weight of C5+ and reservoir temperature18. Lee et al. suggested estimating pure CO2 MMP from reservoir temperature19. Yellig and Metcalfe derived empirical correlations between oil composition, reservoir temperature (Tr), and MMP using sand-packed coil experiments20. Alston et al. linked MMP to temperature, C5+ molecular weight, volatile fractions, intermediate oil fractions, and CO2 stream composition21. In 1985, Glasø expanded on Benham et al.'s research to study the effects of hydrocarbon, CO2, or N2 in the injected gas on MMP, establishing an empirical correlation involving Tr, the molecular weight of heavy hydrocarbons, and the molar fraction of intermediate hydrocarbons22. More recently, Chen et al. reported that parameter correlations with MMP ranked as follows: Tr > XC5–C6 > MWC7+ > Xvol > YH2S > YHC > YCO2 > YC1 > YN2 > YC2–C4, proposing an empirical equation for both pure and impure CO2 conditions23. However, the accuracy of empirical formulas depends on their application range and is limited in complex reservoirs or with variable hydrocarbon compositions24,25. Numerical simulation methods, such as equation-of-state-based modeling, simulate CO2-oil phase behavior. Belhaj et al. used CMG software with the WINPROP module, tuning the PR-EOS model to predict MMP values for N2/HC-enriched CO2 floods26. MMP calculations are also feasible with ECLIPSE and other software27, though tuning physical properties is time-consuming, and the approach is costly, complex, and less economically viable for rapid MMP prediction1.

In recent years, with the development of data-driven technology, machine learning (ML) methods have increasingly been applied to MMP prediction. ML models can handle large, complex datasets and, by training on experimental data, learn the nonlinear relationships between reservoir parameters and MMP for fast, accurate predictions28,29. Unlike traditional methods, ML leverages statistical analysis to extract patterns automatically, showing better adaptability and predictive accuracy in complex scenarios30. Huang et al. pioneered MMP prediction using an artificial neural network (ANN) model for pure and impure CO₂25. Mousavi introduced genetic algorithms to optimize ANN model parameters, creating a hybrid genetic algorithm-ANN (GA-ANN) model for MMP prediction31. Wu et al. used an ensemble of seven ML models optimized with grid search, achieving an R2 of 0.96232. Al-Khafaji et al. evaluated five ML models with extensive experimental data, finding the decision tree (DT) model most effective for MMP prediction, with the best error metrics and goodness of fit.

Although ML shows promise in handling complex data and improving MMP prediction accuracy, challenges remain, such as limited performance on small datasets33. Additionally, many existing ML models lack interpretability, making it difficult to understand the role of each feature in MMP prediction, which limits their applicability in real-world production34. High reliability and decision transparency are crucial for AI applications in the oil and gas industry. However, a significant challenge in the large-scale deployment of ML models is the lack of decision transparency and result interpretability34. Consequently, interpretable ML methods are expected to become a focus in intelligent oilfield research.

XGBoost is a powerful ensemble learning method known for its efficiency, robustness, and interpretability, meeting the needs of machine learning applications that require explainability35. XGBoost performs well in multivariate regression problems, handles small datasets, and excels in nonlinear regression tasks by automating feature selection and optimization, thereby enhancing prediction accuracy36. This study therefore constructs a CO₂-MMP prediction model using XGBoost, integrating PCA-reduced features and Particle Swarm Optimization (PSO) for hyperparameter tuning.

In the data preprocessing phase, Pearson correlation and reservoir knowledge guide feature selection, followed by PCA for dimensionality reduction to ensure accuracy and efficiency. The PSO algorithm then optimizes hyperparameters, with SHAP values used to assess feature contributions, enhancing model transparency. Integrating interpretability at each modeling stage yields a reliable, visual CO₂-MMP prediction model, supporting optimized injection strategies in CO₂-EOR development.

Data analysis

Source of experimental data

The dataset used in this study is compiled from the literature18–21,37–50, totaling 218 sets of experimental data with 2398 data samples, comprising 51 sets of pure CO2 data and 167 sets of impure CO2 data. The dataset covers MMP measurements for both pure and impure CO2 flooding across different crude oil compositions and reservoir temperature ranges, with CO2 concentrations ranging from 0 to 100%. The corresponding MMP ranges from 6.55 to 41.47 MPa, with an average of 20.22 MPa. Table 1 presents the input data, parameter ranges, and statistics for the 11 experimental parameters, including reservoir temperature (T), the critical temperature of the injected gas (Tcm), the relative molecular weight of C5+ in crude oil (MWC5+), the mole fraction of volatile hydrocarbon components N2 + CH4 (Xvol), the mole fraction of intermediate hydrocarbon components CO2 + H2S + C2–C4 (Xint), the mole fraction of CO2 in the injected gas (YCO2), the mole fraction of C1 in the injected gas (YC1), the mole fraction of H2S in the injected gas (YH2S), the mole fraction of N2 in the injected gas (YN2), and the mole fraction of C2–C7 in the injected gas (YC2–C7); the dependent variable is MMP.

Table 1.

Parameters and ranges of MMP database collected in this study.

Property Parameter Unit Minimum Maximum Average
Reservoir temperature T °C 32.22 140.56 83.0404
Critical temperature of the injected gas Tcm °C − 107.00 80.23 22.7348
Composition of crude oil XVol mol% 0.00 53.36 20.2006
Xint mol% 1.20 46.39 25.5420
MWC5+ g/mol 131.46 302.50 203.0839
Composition of injected gas YCO2 mol% 0.00 100.00 61.7113
YH2S mol% 0.00 50.00 7.1200
YN2 mol% 0.00 80.10 3.7303
YC1 mol% 0.00 100.00 17.8589
YC2–C7 mol% 0.00 58.47 13.4746
Experimental MMP MMP MPa 6.55 41.47 20.2287

The data sources above are based primarily on experimental determination methods, which are the most direct and accurate means of determining MMP. The principal laboratory methods currently used to study MMP are the Slim Tube method, the Rising Bubble Apparatus (RBA), and the Vanishing Interfacial Tension (VIT) method. Of these, the Slim Tube method determines MMP directly by measuring the effect of injection pressure on recovery, whereas RBA and VIT determine MMP from the principle that the interfacial tension between crude oil and gas approaches zero under miscible conditions16.

The Slim Tube method is a commonly used experimental technique for determining MMP by simulating the gas injection displacement process in crude oil. To approximate the state of the porous media in the reservoir rock, the method uses a sand-packed slim tube that reproduces the actual conditions of the reservoir (Fig. 1). The injected gases (e.g., carbon dioxide, methane) mix in the slim tube and come into contact with the crude oil. Controlling different displacement pressures allows the mixing between the crude oil and the injected gas to be observed. When the crude oil and the injected gas reach a miscible state, the interfacial tension disappears, and the pressure at which oil recovery efficiency exceeds 90% is identified as the MMP. The experimental setup includes an injection pump system, slim tube, back pressure regulator, differential pressure gauge, temperature control system, visual window, liquid fraction collector, gas meter, and gas chromatograph.

Fig. 1. Slim-tube apparatus.

The slim-tube apparatus is a commonly used method for determining CO2-MMP, but it is time-consuming. Christiansen and Kim subsequently proposed a faster technique in 1983, the Rising Bubble Apparatus (RBA)42. In the RBA, the oil is confined within a slender glass tube inside a double-window pressure vessel, as shown in Fig. 2. The remainder of the vessel is filled with water, which applies pressure to the oil. A solvent-gas bubble is introduced into the water below the oil in the tube; the bubble rises through the water-oil interface into the oil.

Fig. 2. Schematic of the rising-bubble apparatus.

Dandina51 first proposed the VIT method for determining CO2-MMP in 1997. This method determines CO2-MMP by measuring the interfacial tension between the injected gas and the crude oil. Because the interfacial tension is pressure-dependent, it decreases almost linearly as pressure increases. When miscibility is reached, the interfacial tension becomes zero, allowing CO2-MMP to be obtained through extrapolation52.

The factors affecting the CO2-oil MMP

Correlation analysis grounded in reservoir physics theory can explain, from a physical standpoint, how machine learning methods reveal the relationships between reservoir development data and production data, while preserving the model's generalization ability. This approach combines the strengths of theory-driven and data-driven methods, keeping the model rational and transparent during prediction and thereby enhancing the scientific rigor and interpretability of the modeling34.

Temperature

Yellig and Metcalfe20 used a sand-packed slim tube to simulate the CO2 displacement process in a real reservoir and determine MMP. They examined the effects of reservoir temperature and crude oil composition on MMP and established a correlation for predicting MMP from their experimental results (Eq. 1). The study found that the influence of crude oil composition on MMP is minimal, while temperature has a significant impact: within the range of 92 to 192 °F, MMP increases with temperature at approximately 15 psi/°F.

$$\mathrm{MMP} = 1833.7217 + 2.2518055\,T + 0.01800674\,T^{2} - \frac{103949.93}{T} \tag{1}$$

In Eq. (1): T is the reservoir temperature, °F; MMP is the Minimum Miscibility Pressure, psia (the correlation is defined in field units and can be converted to °C and MPa as needed).
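For illustration, the temperature-only correlation can be evaluated directly in code. A minimal sketch, assuming the published field-unit form of the Yellig and Metcalfe correlation (T in °F, MMP in psia) and converting to the °C and MPa units used in this paper; the function name is ours:

```python
def yellig_metcalfe_mmp(t_celsius):
    """Estimate CO2 MMP (MPa) from reservoir temperature using the
    Yellig-Metcalfe correlation, defined in field units (degF, psia)."""
    t_f = t_celsius * 9.0 / 5.0 + 32.0  # degC -> degF
    mmp_psia = (1833.7217 + 2.2518055 * t_f
                + 0.01800674 * t_f ** 2 - 103949.93 / t_f)
    return mmp_psia * 0.00689476  # psia -> MPa

# a mid-range reservoir temperature from the dataset in Table 1
print(round(yellig_metcalfe_mmp(71.1), 2))  # roughly 13.8 MPa at ~160 degF
```

Note that the correlation ignores oil and gas composition entirely, which is why later models (and this study) add compositional features.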

Changes in reservoir temperature alter the PVT properties of crude oil (viscosity, density, gas-oil ratio, saturation pressure), among others. Numerous experiments have demonstrated that reservoir temperature correlates positively with MMP18–22. The impact of temperature on the MMP of CO2 flooding is significant: as temperature increases, the solubility of CO2 in crude oil decreases, reducing its volumetric expansion capacity. In addition, the supercritical character of CO2 weakens, primarily because its density decreases, which severely limits its ability to extract light hydrocarbons.

Previous studies on MMP prediction models have considered reservoir temperature but have overlooked how the critical temperature of the injected gas, together with the reservoir temperature, relates to MMP. The critical temperature is the highest temperature at which a substance can exist as a liquid. Each substance has a specific critical temperature, above which it cannot be liquefied even with increased pressure; the lower the critical temperature, the harder the substance is to liquefy34.

In conventional gas injection processes, the temperature at the reservoir depth remains relatively stable and does not undergo significant changes due to gas injection and production. However, the formation temperature near the injection well is lowered when injecting liquid gases. This low temperature can not only cause cold damage (such as affecting the mechanical properties of the formation and the wax precipitation phenomenon in crude oil) but also help reduce MMP. Therefore, to improve the completeness of the data for training the prediction model, the critical temperature of the injected gas (Tcm) is included as a feature parameter for discussion.

Crude oil composition

Due to the complexity of crude oil, the interactions between oil components and CO2 vary in magnitude, extent, and mechanism, so researchers often categorize crude oil components by carbon number9,10,21–25. Crude oil is divided into three pseudo-components: volatile components (CH4 and N2), intermediate components (C2–C4 + CO2 + H2S or C2–C6 + CO2 + H2S), and heavy components (C5+ or C7+). With other conditions held constant, the mole fractions of CH4 and N2 in crude oil are positively correlated with MMP: when they decrease, MMP also decreases, making it easier for CO2 to achieve miscibility with the crude oil.

Furthermore, during the extraction of crude oil components by CO2, the intermediate hydrocarbon components are extracted first, followed by the heavy components. Therefore, the higher the mole fraction of intermediate hydrocarbon components in the crude oil, the more favorable it is for CO2 dissolution and extraction, resulting in a lower MMP. Conversely, as molecular weight increases, the difficulty of extraction also rises, and CO2's solubility in the heavy components decreases. Thus, the heavier the crude oil, the higher the MMP18.

Composition of the injected gas

The composition of the injected gas is also one of the critical factors affecting MMP. In predicting CO2-flooding MMP, the injected gas is classified as pure CO2 or impure CO2. Most current MMP studies address pure CO2; however, injecting pure CO2 to enhance recovery is costly, and most field gas injection uses gas containing impurities. The MMP prediction model developed in this study therefore considers both pure and impure CO2 injection.

The presence of impurity gases in the injected gas, such as C1, H2S, and N2, or intermediate hydrocarbon components like C2, C3, and C4, can lead to significant variations in CO2-oil MMP, depending on the types of components53. Generally, the presence of H2S or intermediate hydrocarbon components in the injected gas reduces the CO2-oil MMP, while C1 or N2 in the injected gas significantly increases the CO2-oil MMP.

Selection of experimental parameters

There are over 20 published empirical prediction models for CO2-flooding MMP, and the parameter variables used differ between models. A comparative analysis of 17 existing prediction models, shown in Table 2, indicates that the influencing factors in the empirical formulas include reservoir temperature (T), the relative molecular weight of C5+ in crude oil (MWC5+), the relative molecular weight of C7+ in crude oil (MWC7+), the mole fraction of volatile hydrocarbon components N2 + CH4 in crude oil (Xvol), the mole fraction of intermediate hydrocarbon components CO2 + H2S + C2–C4 in crude oil (Xint), the mole fraction of C2–C4 in crude oil (XC2–C4), the mole fraction of C2–C6 in crude oil (XC2–C6), the mole fraction of CO2 in the injected gas (YCO2), the mole fraction of C1 in the injected gas (YC1), the mole fraction of N2 in the injected gas (YN2), the mole fraction of H2S in the injected gas (YH2S), and the mole fraction of C2–C4 in the injected gas (YC2–C4). From these influencing factors, the optimal input parameters for establishing the MMP prediction model are selected.

Table 2.

Parameters used to calculate the minimum miscibility pressure of CO2 flooding in different models.

Tr MWC5+ MWC7+ Xvol XC2–C4 XC2–C6 YCO2 YC1 YN2 YH2S
Holm and Josendal18
Cronquist (1978)54
Yellig and Metcalfe20
Orr and Jetsen (1984)55
Alston et al.21
Glasø (1985)22
Sebastian et al.49
Huang et al.25
Emera and Sarma (2004)56
Yuan et al. (2004)57
Shokir (2007)58
Li et al. (2012)43
Chen et al. (2013)9
Sayyad et al. (2014)59
Kamari (2015)10
Fathinasab (2016)60
Karkevandi-Talkhooncheh (2017)61

As the feature vector for the prediction model, the selection of experimental parameters is a crucial foundational task for screening critical feature data. Researchers in related fields mainly use Pearson correlation coefficients and Spearman rank correlation coefficients to select critical feature data62. This study employs the Pearson correlation coefficient in conjunction with the analysis of reservoir physical theories discussed in section “The factors affecting the CO2-oil MMP” to optimize the parameters. The results are shown in Fig. 3. By calculating and analyzing the correlation of each influencing factor, sensitive parameters were identified and ranked according to the absolute value of the Pearson correlation coefficient: Tcm > T > YCO2 > YN2 > YC1 > YC2–C7 > MWC5+ > Xint > Xvol > YH2S.
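The ranking step described above can be sketched in a few lines. A minimal pure-Python example on toy data (not the paper's dataset), ranking candidate features by the absolute value of the Pearson coefficient with MMP:

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# toy illustration: a strongly related feature vs. a weakly related one
features = {
    "T":    [40, 60, 80, 100, 120],   # nearly linear with MMP below
    "YH2S": [0, 2, 1, 3, 2],          # only loosely related
}
mmp = [10, 14, 19, 25, 30]
ranked = sorted(features, key=lambda f: abs(pearson(features[f], mmp)),
                reverse=True)
print(ranked)  # "T" ranks first by absolute Pearson coefficient
```

In the actual workflow, features whose absolute coefficient falls below a significance cutoff (here, YH2S at 0.048) are dropped before modeling.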

Fig. 3. Heat map showing the correlation between input and output variables for the dataset.

The sign of the Pearson index indicates that T, MWC5+, Xvol, Xint, YN2, YC1, and YC2–C7 are positively correlated with MMP, while the other parameters are negatively correlated63. Previous studies on MMP prediction models have considered reservoir temperature but overlooked the relationship between the critical temperature of the injected gas and MMP. In practical development processes, the complexity of the formation often leads to the neglect of the impact of the injected gas’s critical temperature on the gas drive. However, the results from the Pearson index analysis show that the absolute value of the Pearson index for Tcm and MMP reaches 0.678, indicating a significant correlation64.

Furthermore, as analyzed in section “Temperature”, when liquid gas is injected below the critical temperature into the formation, it lowers the temperature near the injection well. This low temperature helps reduce MMP, highlighting the engineering significance of Tcm for the study of the MMP prediction model. For the content of hydrogen sulfide in the injected gas (YH2S), its absolute Pearson index value is 0.048, indicating no significant correlation64. To ensure training accuracy while reducing the model’s training load and improving training efficiency, parameters that are not significantly correlated with MMP will be excluded from the feature parameters.

Based on the factors considered in the studied models, Pearson correlation analysis, and knowledge of reservoir physics, the factors selected for establishing the MMP prediction model are critical temperature of the injected gas (Tcm), reservoir temperature (T), the mole fraction of CO2 in the injected gas (YCO2), the mole fraction of N2 in the injected gas (YN2), the mole fraction of C1 in the injected gas (YC1), the mole fraction of C2–C7 in the injected gas (YC2–C7), the relative molecular weight of C5+ in crude oil (MWC5+), the mole fraction of intermediate hydrocarbon components CO2 + H2S + C2–C4 (Xint), and the mole fraction of volatile hydrocarbon components N2 + CH4 (Xvol).

Methodology

This study proposes a machine learning-based prediction model for CO2-MMP. To improve model construction efficiency, redundant information must be eliminated, so Principal Component Analysis (PCA) was used to group and reduce the dimensionality of certain feature vectors. The XGBoost algorithm was employed to build the MMP prediction model, and the Particle Swarm Optimization (PSO) algorithm was used to determine its hyperparameters efficiently and reasonably. After constructing the MMP prediction model, SHAP was applied to interpret it. This section introduces the principles of the methods used in this study.

Principal component analysis

Principal Component Analysis (PCA) is a technique for reducing the dimensionality of datasets, which can enhance the interpretability of the data while minimizing information loss63. PCA reduces dimensions and examines the dependencies within the original variables’ correlation matrix, condensing several intricately related variables into a few comprehensive factors65. The PCA process mainly consists of five steps: calculating the correlation coefficient matrix, obtaining the eigenvalues and corresponding orthogonal unit eigenvectors of the covariance matrix, selecting principal components, calculating principal component loadings, and calculating principal component score coefficients66.

The expression for the covariance matrix is:

$$\Sigma = \left(\sigma_{ij}\right)_{p \times p} \tag{2}$$

where:

$$\sigma_{ij} = \frac{1}{n-1}\sum_{k=1}^{n}\left(x_{ki}-\bar{x}_i\right)\left(x_{kj}-\bar{x}_j\right), \quad i, j = 1, 2, \ldots, p \tag{3}$$

In these equations: $\Sigma$ is the covariance matrix; $\sigma_{ij}$ is the entry in the i-th row and j-th column of the covariance matrix; i is the row index; j is the column index; p is the dimensionality of the matrix; n is the number of samples; $x_{ki}$ and $x_{kj}$ are the values of the i-th and j-th variables in the k-th sample; $\bar{x}_i$ and $\bar{x}_j$ are the averages of those variables across all samples.

The top m largest eigenvalues of the covariance matrix correspond to the variances of the first m principal components. The unit eigenvector corresponding to a specific eigenvalue serves as the coefficient of the principal component concerning the original variables. Thus, the i-th principal component can be expressed as:

$$F_i = a_i^{\mathrm{T}} X = a_{1i}X_1 + a_{2i}X_2 + \cdots + a_{pi}X_p \tag{4}$$

where: $F_i$ is the extracted principal component; $a_i$ is the unit eigenvector corresponding to the i-th eigenvalue; $X = (X_1, \ldots, X_p)$ is the vector of original variables.

The variance (information) contribution rate of the principal component reflects the amount of information it carries and is expressed as:

$$\eta_i = \frac{\lambda_i}{\sum_{k=1}^{p}\lambda_k} \tag{5}$$

where: $\eta_i$ is the variance contribution rate of the i-th principal component; $\lambda_i$ is its eigenvalue; p is the total number of eigenvalues of the covariance matrix.

The expression for the cumulative variance contribution rate of the principal components is:

$$\eta_{\Sigma}(m) = \frac{\sum_{k=1}^{m}\lambda_k}{\sum_{k=1}^{p}\lambda_k} \times 100\% \tag{6}$$

where: $\eta_{\Sigma}(m)$ is the cumulative variance contribution rate of the first m principal components (in %); $\lambda_k$ is the eigenvalue of the k-th principal component.

When the cumulative variance contribution rate exceeds 85%, it is considered sufficient to reflect the information of the original variables, and the corresponding m indicates the number of principal components to be extracted.
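The extraction rule above (retain components until the cumulative variance contribution rate exceeds 85%) can be sketched with NumPy. The data and function name are illustrative, not the paper's:

```python
import numpy as np

def pca_components(X, threshold=0.85):
    """Return the unit-eigenvector coefficients and contribution rates of
    the principal components needed to reach the cumulative threshold."""
    Xc = X - X.mean(axis=0)                 # centre each variable
    cov = np.cov(Xc, rowvar=False)          # covariance matrix (Eqs. 2-3)
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]       # sort eigenvalues descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = eigvals / eigvals.sum()        # contribution rates (Eq. 5)
    m = int(np.searchsorted(np.cumsum(ratios), threshold) + 1)
    return eigvecs[:, :m], ratios[:m]

# toy data: 3 variables, the 3rd nearly a copy of the 1st (redundant)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])
vecs, ratios = pca_components(X)
print(vecs.shape[1])  # 2 components already carry over 85% of the variance
```

The redundant third variable contributes almost no independent variance, so two components suffice, which mirrors how PCA strips redundancy from the correlated MMP features.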

The principal component loading reflects the degree of association between the principal components and the original variables and is expressed as:

$$l_{ij} = r\!\left(F_i, X_j\right) = \sqrt{\lambda_i}\, a_{ji} \tag{7}$$

where $l_{ij}$ is the loading of the j-th original variable $X_j$ on the i-th principal component, equal to the correlation coefficient between them.

The expression for calculating the principal component score coefficients is:

$$c_{ij} = \frac{l_{ij}}{\lambda_i} \tag{8}$$

where $c_{ij}$ is the score coefficient of the j-th variable for the i-th principal component.

Particle swarm optimization

The Particle Swarm Optimization (PSO) algorithm, first proposed by Eberhart and Kennedy, is a population-based optimization tool with global optimization capabilities67. PSO searches for the optimum iteratively, initializing the system with a population of random solutions; the particles (potential solutions) explore the solution space by following the best-performing particles.

Assume N particles in a D-dimensional objective search space; the i-th particle is represented as a D-dimensional position vector $X_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$, i = 1, 2, …, N. The position of each particle represents a potential solution. Substituting $X_i$ into the objective function yields its fitness value, which measures the quality of the position.

The “velocity” of the i-th particle is also a D-dimensional vector, denoted $V_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$, i = 1, 2, …, N. The best position found so far by the i-th particle is denoted $P_i$, and the best position found so far by the entire swarm is denoted $P_g$. The PSO algorithm updates the particles with the following formulas:

$$v_{id}^{t+1} = v_{id}^{t} + c_1 r_1\left(p_{id} - x_{id}^{t}\right) + c_2 r_2\left(p_{gd} - x_{id}^{t}\right), \qquad x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1} \tag{9}$$

where i = 1, 2, …, N; d = 1, 2, …, D; the learning factors $c_1$ and $c_2$ are non-negative constants; and $r_1$ and $r_2$ are random numbers uniformly distributed in [0, 1]. The termination condition is generally problem-specific, typically a maximum number of iterations or the swarm's best position reaching a predetermined fitness threshold. The algorithm proposed by Eberhart and Kennedy is often called the basic PSO algorithm; it requires very few user-set parameters and is easy to operate68.
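The velocity and position updates above can be sketched in a short Python routine. This is a minimal illustration on a test function, not the paper's hyperparameter-tuning setup; the inertia weight w is a common later refinement of the basic algorithm, added here for stable convergence:

```python
import random

def pso(f, dim, bounds, n_particles=30, iters=300,
        w=0.72, c1=1.49, c2=1.49, seed=1):
    """Minimal particle swarm optimiser (velocity/position updates)."""
    rnd = random.Random(seed)
    lo, hi = bounds
    X = [[rnd.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    P = [x[:] for x in X]                 # personal bests p_i
    pbest = [f(x) for x in X]
    g = P[min(range(n_particles), key=lambda i: pbest[i])][:]  # global best
    gbest = f(g)
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rnd.random(), rnd.random()
                V[i][d] = (w * V[i][d] + c1 * r1 * (P[i][d] - X[i][d])
                           + c2 * r2 * (g[d] - X[i][d]))
                X[i][d] += V[i][d]
            fx = f(X[i])
            if fx < pbest[i]:
                P[i], pbest[i] = X[i][:], fx
                if fx < gbest:
                    g, gbest = X[i][:], fx
    return g, gbest

best, val = pso(lambda x: sum(xi ** 2 for xi in x), dim=2, bounds=(-5, 5))
print(val < 1e-3)  # the swarm settles near the sphere function's minimum
```

In the paper's workflow, f would be a cross-validated XGBoost error and each particle dimension a hyperparameter such as tree depth or learning rate.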

Extreme gradient boosting

Extreme Gradient Boosting (XGBoost) algorithm, as a decision tree algorithm, has been applied to solve problems in Earth sciences69. Decision tree algorithms are a standard machine learning method often used for classification and regression tasks. They are widely utilized in data analysis and mining due to their advantages, such as low computational complexity, quickly interpretable output, insensitivity to missing intermediate values, and the ability to handle irrelevant feature data. However, decision tree methods have drawbacks, including poor stability, sensitivity to data distribution, susceptibility to overfitting, and unreliable generalization performance, which limit their application potential70.

With the development of artificial intelligence technology, researchers have proposed many methods to improve and optimize decision trees, using ensemble algorithms to combine weak learners into strong learners and thereby eliminate these shortcomings. The XGBoost model combines weak regressors or classifiers into a robust predictive model and supports parallel execution. Owing to its high performance on classification and regression problems, it is widely applied in data mining and intelligent forecasting. XGBoost establishes an objective function with added regularization terms, which mitigates overfitting and yields fast, reliable models for engineering simulation. The XGBoost model can be represented as71:

$$\hat{y}_i = \sum_{k=1}^{K} f_k\!\left(x_i\right), \qquad f_k \in F \tag{10}$$

In Eq. (10): $\hat{y}_i$ is the predicted value; K is the total number of trees; i indexes the samples, and $x_i$ denotes the feature vector of sample i; F is the space of all possible trees; $f_k$ is the tree generated in the k-th iteration, a member of F.

The objective function can be expressed as

$Obj^{(t)} = \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$  (11)

In the optimization process, an additive model and forward stagewise algorithm are employed: the prediction $\hat{y}_i^{(t)}$ for the i-th sample in the t-th iteration is the prediction $\hat{y}_i^{(t-1)}$ from iteration t − 1 plus a new tree $f_t(x_i)$. The term $l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big)$ represents the prediction error in the t-th iteration, and $\Omega(f_t)$ is the regularization term for the t-th iteration, measuring the complexity of the model. Let T be the number of leaf nodes of the tree $f_t$ and $w_j$ the score of the j-th leaf node; then:

$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$  (12)

For optimization, the objective function undergoes a second-order Taylor expansion. Let $g_i = \partial_{\hat{y}^{(t-1)}} l\big(y_i, \hat{y}_i^{(t-1)}\big)$ and $h_i = \partial^{2}_{\hat{y}^{(t-1)}} l\big(y_i, \hat{y}_i^{(t-1)}\big)$. Then:

$Obj^{(t)} \simeq \sum_{i=1}^{n} \big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \big] + \Omega(f_t)$  (13)

Further simplification yields:

$Obj^{(t)} = \sum_{j=1}^{T} \Big[ \big(\textstyle\sum_{i \in I_j} g_i\big) w_j + \tfrac{1}{2} \big(\textstyle\sum_{i \in I_j} h_i + \lambda\big) w_j^{2} \Big] + \gamma T$  (14)

In this expression, $I_j = \{\, i \mid q(x_i) = j \,\}$ denotes the set of samples assigned to the j-th leaf node of the tree $f_t$; $\gamma$ and $\lambda$ are preset hyperparameters.

Assuming the tree structure q(x) is known, let $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. Minimizing the loss function gives the optimal leaf weights $w_j^{*}$ and the optimal objective value $Obj^{*}$:

$w_j^{*} = -\dfrac{G_j}{H_j + \lambda}, \qquad Obj^{*} = -\dfrac{1}{2} \sum_{j=1}^{T} \dfrac{G_j^{2}}{H_j + \lambda} + \gamma T$  (15)
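For the squared-error loss, $g_i = \hat{y}_i - y_i$ and $h_i = 1$, so the optimal leaf weight in Eq. (15) reduces to a shrunken mean of the residuals routed to that leaf. A small numeric check with illustrative values (not taken from the paper):

```python
import numpy as np

lam, gamma = 1.0, 0.0                      # regularization hyperparameters
y = np.array([3.0, 3.5, 4.0])              # targets routed to one leaf
yhat = np.array([2.0, 2.0, 2.0])           # current ensemble prediction

# Squared loss l = (y - yhat)^2 / 2: g_i = yhat_i - y_i, h_i = 1
g, h = yhat - y, np.ones_like(y)
G, H = g.sum(), h.sum()

w_star = -G / (H + lam)                    # optimal leaf weight, Eq. (15)
obj_star = -0.5 * G ** 2 / (H + lam) + gamma * 1

# With lam = 0 the weight would be exactly the mean residual;
# lam > 0 shrinks it toward zero, which is the regularization effect.
mean_residual = float((y - yhat).mean())
```

Here `w_star` is 1.125 versus a raw mean residual of 1.5, showing how $\lambda$ damps each leaf's score.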

Shapley additive explanations

Machine learning has achieved significant success in many fields, but its lack of interpretability severely limits its application in practical engineering tasks34. The CO2 MMP prediction model has high demands for interpretability, especially in the context of oil field development. To address the issue of model interpretability, the SHAP framework is introduced to explain the model results, thereby providing support for the reliability of the model outcomes.

Shapley Additive Explanations (SHAP), based on game theory72 and local explanations73, is a classical post-hoc interpretation framework that assigns each feature a Shapley value quantifying its effect on the dependent variable. The method treats each feature as a contributor, calculates each feature's contribution value, and sums these contributions to obtain the model's final prediction74.

Compared to traditional feature importance methods (such as the built-in feature importance of XGBoost), SHAP offers better consistency and reveals the positive or negative relationship of each predictor to the target variable, enabling both local and global interpretation75. For local interpretability, each feature has its own set of Shapley values, so the contribution of each feature to the prediction can be explained for every individual sample, increasing transparency and enhancing the reliability of the CO2 MMP prediction model. For a global explanation, the average Shapley value of a variable across all samples is taken as that feature's importance.

Assuming g represents the interpretation model, M the number of features, $z_j \in \{0, 1\}$ the presence of feature j, and $\phi_j$ the Shapley value of feature j, the formula is:

$g(z) = \phi_0 + \sum_{j=1}^{M} \phi_j z_j$  (16)

The SHAP value of a feature represents the expected change in the model prediction when conditioning on that feature. For each feature, the SHAP value indicates its contribution to the prediction, i.e., the difference between the average model prediction and the actual prediction for the instance. A Shapley value $\phi_j > 0$ indicates that the feature increases the predicted value, while a negative value indicates that it decreases the prediction. The feature importance provided by the XGBoost model only indicates which features are important but not how they affect the prediction. The greatest advantage of the SHAP framework is its ability to show, for every sample, both the magnitude and the sign of each feature's influence on the final prediction.
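The additive attribution in Eq. (16) can be verified exactly on a tiny model by enumerating all feature coalitions. The model, baseline, and instance below are hypothetical, chosen only to make an interaction term visible:

```python
import numpy as np
from itertools import combinations
from math import factorial

# Toy model of 3 features (hypothetical, for illustration only)
def f(x):
    return x[0] + 2.0 * x[1] * x[2]

baseline = np.array([0.0, 0.0, 0.0])   # "feature absent" reference values
x = np.array([1.0, 2.0, 3.0])          # instance to explain
M = len(x)

def v(S):
    """Value of coalition S: model evaluated with features outside S
    held at their baseline values."""
    z = baseline.copy()
    z[list(S)] = x[list(S)]
    return f(z)

# Exact Shapley values: weighted average of marginal contributions
phi = np.zeros(M)
for j in range(M):
    others = [k for k in range(M) if k != j]
    for r in range(M):
        for S in combinations(others, r):
            weight = factorial(len(S)) * factorial(M - len(S) - 1) / factorial(M)
            phi[j] += weight * (v(S + (j,)) - v(S))

# Local accuracy: attributions plus the baseline prediction recover f(x).
assert np.isclose(phi.sum() + f(baseline), f(x))
```

The additive term gets attribution 1 for feature 0, while the interaction 2·x1·x2 = 12 is split equally (6 each) between features 1 and 2, illustrating how SHAP distributes joint effects.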

Evaluation metrics

To rigorously and comprehensively evaluate the performance of each model, we selected four statistical metrics: coefficient of determination (R2), root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE).

R2 represents the proportion of the variance in the dependent variable that can be explained by the regression relationship, with values ranging from 0 to 1. When using machine learning methods to predict MMP, an R2 value close to 1 indicates a better regression fit.

$R^{2} = 1 - \dfrac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{m} (y_i - \bar{y})^{2}}$  (17)

RMSE accurately reflects the deviation between the measured values and the true values. It is commonly used as a standard to measure the prediction results of intelligent models.

$RMSE = \sqrt{\dfrac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^{2}}$  (18)

MAPE indicates the average absolute percentage error between the predicted data and the actual data. Compared to RMSE, MAPE normalizes the errors for each sample point, reducing the impact of individual outliers on the absolute error.

$MAPE = \dfrac{100\%}{m} \sum_{i=1}^{m} \left| \dfrac{y_i - \hat{y}_i}{y_i} \right|$  (19)

MAE is the average of the absolute errors of the individual predictions. Because errors of opposite sign cannot cancel out, it accurately reflects the actual magnitude of the prediction error.

$MAE = \dfrac{1}{m} \sum_{i=1}^{m} \left| y_i - \hat{y}_i \right|$  (20)

In Eqs. (17)–(20), $\hat{y}_i$ is the prediction for the i-th sample, $y_i$ is the corresponding observed value, $\bar{y}$ is the mean of the observed values, and m is the number of samples.
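The four metrics of Eqs. (17)–(20) can be written directly in NumPy; the sample values below are made up for illustration and are not from the paper's dataset:

```python
import numpy as np

def r2(y, yhat):
    """Coefficient of determination, Eq. (17)."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(y, yhat):
    """Root mean square error, Eq. (18)."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mape(y, yhat):
    """Mean absolute percentage error in percent, Eq. (19)."""
    return float(np.mean(np.abs((y - yhat) / y)) * 100.0)

def mae(y, yhat):
    """Mean absolute error, Eq. (20)."""
    return float(np.mean(np.abs(y - yhat)))

# Example with made-up values
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([11.0, 19.0, 33.0, 38.0])
```

For these values, MAE = 1.75, MAPE = 7.5%, and R2 = 0.97, matching hand calculation.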

Research methodology

This study integrates multiple methods to form an interpretable prediction model for MMP based on PCA-PSO-XGBoost-SHAP. The flowchart is shown in Fig. 4, and the process mainly includes data preprocessing, hyperparameter tuning of the PSO algorithm, building the XGBoost prediction model, and establishing the SHAP interpretability model.

Fig. 4. Flowchart of the PSO-XGBoost model.

Correlation analysis was first performed in the preprocessing phase before model construction. By integrating reservoir physics theory, the effects of temperature, crude oil composition, and injection gas composition on MMP were analyzed in depth. The Pearson correlation coefficient was used to determine the correlation between each feature and MMP, allowing for the selection of important feature parameters for model training.

Based on this, PCA was employed to reduce dimensionality to retain critical information from the features while reducing redundant information, thereby improving training efficiency and ensuring model accuracy. To address the complex hyperparameters of XGBoost, the PSO algorithm was introduced for global optimization, achieving optimal parameter configuration. Subsequently, SHAP analysis was conducted to evaluate the model’s interpretability. By combining the Pearson analysis from the modelling phase and the variance proportion from PCA, the contributions of various features to the predictions were compared, enhancing the model’s transparency and credibility and constructing the CO2-MMP prediction model.

Model construction process

Data preprocessing

Correlation analysis was conducted in Section “Selection of experimental parameters”, utilizing the Pearson correlation coefficient to determine the correlation between each feature and MMP. The selected feature parameters for model training include Tcm, T, YCO2, YN2, YC1, YC2–C7, MWC5+, Xint, Xvol.

During the model’s training, the required number of samples increases exponentially with each additional dimension, potentially leading to significant “curse of dimensionality” issues. Data dimensionality reduction can make the dataset more manageable and reduce computational costs.

In previous dimensionality reduction methods, although global dimensionality reduction can reduce data dimensions, it overlooks the differences between features. The Pearson correlation analysis in section “Selection of experimental parameters” shows that the correlations between various feature parameters are insignificant. Using traditional global dimensionality reduction can weaken the physical characteristics of features by reducing all features together, thereby affecting model performance. Grouped dimensionality reduction can better retain different feature groups’ independence and essential information.

This study adopts a grouped dimensionality reduction strategy to address this issue. In the MMP prediction model, reservoir temperature (T) and critical temperature of injection gas (Tcm) are sensitive to MMP. Pearson correlation analysis confirms their significant impacts on MMP. Since they represent different physical meanings, both temperature features are retained.

Additionally, the interaction between crude oil composition and injection gas composition cannot be described by simple linear relationships with MMP. PCA is therefore applied to reduce the dimensionality of these two feature groups separately, selecting features that balance model accuracy and construction efficiency. The composition features of crude oil and injection gas carry conjugate feature information; independent dimensionality reduction better reflects the influence of each feature group during the oil–gas miscibility process while minimizing interference from irrelevant features.

The main advantage of grouped dimensionality reduction lies in its ability to reduce model complexity while maintaining the physicochemical properties of the data. Through this strategy, the model can better capture the independent contributions of different feature groups to MMP. Moreover, grouped dimensionality reduction avoids the information loss and feature interference associated with global dimensionality reduction, enabling the model to handle different data types more flexibly. In summary, grouped dimensionality reduction provides a more efficient and accurate solution for MMP prediction, optimizing model performance while ensuring the retention and interpretability of key features.

The specific steps for PCA dimensionality reduction include:

  1. Centering: Shift the original data distribution towards the center of the coordinate axes, keeping the shape of the data distribution unchanged. The mean vector of the centered data is a zero vector, ensuring that the variances of each feature can be compared on the same scale when calculating principal components. This prevents features with large variances from dominating the principal component calculations, allowing for a fairer assessment of each feature’s contribution to variance.

  2. Calculating the covariance matrix: Compute the covariance matrix of the standardized data; the features with the largest absolute covariances, T and YCO2, are retained directly. Given its physical importance, the critical temperature of the injection gas is also retained. In the grouped dimensionality reduction of the remaining features, the first principal component explained 85.30% of the variance for the crude oil components (MWC5+, Xvol, Xint) but only 66.78% for the injection gas components. To avoid adversely affecting model training, the molar content of CO2 (YCO2) was incorporated into the injection gas group, after which the principal component explained 83.13% of the variance. The covariance matrix heatmap for the crude oil components (MWC5+, Xvol, Xint) and injection gas components (YCO2, YN2, YC1, YC2–C7) is shown in Fig. 5, illustrating the correlations between the variables and their variances. Within the crude oil group, MWC5+ has the largest variance and thus contributes the most; within the injection gas group, YCO2 has the largest variance and is the main contributing parameter.

  3. Obtaining eigenvectors and eigenvalues: Decompose the covariance matrix to obtain its eigenvectors and eigenvalues.

Fig. 5. Covariance matrix heat map.
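The grouped strategy above, reducing the oil-composition and gas-composition blocks independently while passing the two temperature features through unchanged, can be sketched as follows. The data and column stand-ins are synthetic, not the study's dataset:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 218
# Hypothetical stand-ins for the paper's feature groups
temps = rng.uniform(40, 120, (n, 2))    # T, Tcm: retained as-is
oil = rng.normal(size=(n, 3))           # MW_C5+, X_vol, X_int
gas = rng.normal(size=(n, 4))           # Y_CO2, Y_N2, Y_C1, Y_C2-C7

def group_pca(block, n_components=1):
    """Center/scale one feature group, then reduce it independently."""
    scaled = StandardScaler().fit_transform(block)
    pca = PCA(n_components=n_components).fit(scaled)
    return pca.transform(scaled), float(pca.explained_variance_ratio_.sum())

oil_pc, oil_var = group_pca(oil)
gas_pc, gas_var = group_pca(gas)

# Final design matrix: raw temperatures + one PC per composition group
X = np.hstack([temps, oil_pc, gas_pc])
```

On the study's real data, `oil_var` and `gas_var` would correspond to the reported 85.30% and 83.13% explained-variance figures.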

Hyperparameter determination

Finding the optimal hyperparameter configuration for each machine learning model is essential to achieving the best predictive performance. In previous XGBoost model constructions, parameters such as max_depth, learning_rate, and n_estimators significantly impacted the model’s predictive performance. However, past research often selected only a limited number of hyperparameters due to efficiency and complexity considerations, which prevented the model from achieving optimal predictive performance1.

To address this issue, this study employs several search algorithms, including grid search, Bayesian optimization, PSO, and genetic algorithms (GA), for multi-scale hyperparameter optimization to ensure both efficient model construction and predictive accuracy. The hyperparameters optimized in this study are summarized in Table 3.

Table 3.

XGBoost model hyperparameters.

Parameter Meaning
max_depth Maximum depth of the tree
learning_rate Learning rate; determines the step size for weight adjustment during each iteration
n_estimators Number of weak learners; increases the model’s fitting ability
min_child_weight Minimum weight of child nodes; controls model complexity
subsample Subsampling rate; determines the proportion of data used for tree construction in each iteration
gamma Minimum loss reduction required for node splitting; controls tree splitting
colsample_bytree Proportion of features used for each tree; controls feature usage
reg_alpha & reg_lambda L1 and L2 regularization parameters; control model complexity

During the iterative selection of hyperparameters in each optimization algorithm, the model’s error changes continuously with the adjustment of hyperparameters. This change is illustrated through the error iteration curve of hyperparameter optimization. Five-fold cross-validation was introduced to reduce randomness during the model construction process. By combining the error iteration graphs, five-fold cross-validation results, and hyperparameter optimization iteration times, the model hyperparameters with the best validation performance were selected as the final hyperparameter optimization model.
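The five-fold cross-validation fitness used to score each hyperparameter candidate can be sketched as below. A scikit-learn gradient-boosting regressor stands in for XGBoost, and the candidate values and synthetic data are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

def cv_rmse(params):
    """Mean five-fold RMSE for one hyperparameter configuration --
    the fitness value an optimizer (grid, Bayesian, GA, or PSO) minimizes."""
    model = GradientBoostingRegressor(random_state=0, **params)
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    return float(-scores.mean())

candidate = {"max_depth": 3, "learning_rate": 0.1, "n_estimators": 200}
fitness = cv_rmse(candidate)
```

Each optimization algorithm differs only in how it proposes the next `candidate`; the cross-validated fitness function is shared.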

The advantage of grid search lies in its intuitiveness and ease of use, allowing optimal parameter configurations to be identified within predefined hyperparameter spaces76. However, when the hyperparameter space is large, the computational cost of grid search increases significantly because it exhaustively evaluates all possible parameter combinations. As shown in Fig. 6, the error iteration curve of grid search fluctuates around a certain value, indicating that model performance does not vary significantly across different hyperparameter combinations, leading to a minimal reduction in error. Additionally, due to its exhaustive nature, grid search is inefficient and costly when facing a large parameter space. This experiment's grid search took 914.09 s for 6000 iterations, confirming its inefficiency in large-scale parameter optimization tasks. The five-fold cross-validation graph for Grid-Search-XGB in Fig. 7c shows a high initial error, which gradually decreases but exhibits significant fluctuations, indicating that the model was not fully optimized.

Fig. 6. Iterative error graph.

Fig. 7. Five-fold cross-validation.

As seen in Fig. 6, the error in Bayesian optimization fluctuates between 1 and 4. This fluctuation primarily arises from the trade-off between exploration and exploitation inherent in Bayesian optimization77. During the search process, the algorithm may attempt points in uncertain regions. Furthermore, Bayesian optimization has a local optimum jumping characteristic, which can lead to short-term convergence to local optima; however, it can escape these regions through further exploration, resulting in periodic fluctuations in the error curve. Given the complex hyperparameter space of the XGBoost model, Bayesian optimization requires more exploration and computation time to accurately evaluate all possible parameter combinations, leading to instability in high-dimensional space and making rapid convergence of error difficult. In this experiment, 1000 iterations took 248.08 s. Although the time was short, the error did not converge, indicating that the stability of the optimization process still needs improvement. The cross-validation results of Bayesian optimization-XGB in Fig. 7b show a rapid initial decline but exhibit oscillations, indicating significant fluctuations during parameter optimization.

GA is a global metaheuristic optimization method based on biological evolution mechanisms, commonly employing elitism, in which a proportion of the best individuals in the current population is carried directly into the next generation at each iteration78. By retaining the best individuals, GA is more likely to converge toward a global optimum, improving convergence speed. Figure 6 illustrates the optimization results for each generation, with the RMSE stabilizing around the best solution value of 1 after the first iteration. However, the computation time for each generation is relatively long, with 500 iterations taking 1736.69 s in this experiment. Although the overall convergence trend is stable, slight fluctuations may still occur as it approaches the global optimum, which may relate to the complexity of the search space near the global optimum. The cross-validation graph of GA-XGB in Fig. 7d shows a relatively stable error that gradually decreases and stabilizes in later stages, demonstrating good global search capability.

Compared to other algorithms, PSO exhibits significant advantages when dealing with complex models, especially in high-dimensional problems, showing outstanding computational efficiency and parameter-adjustment precision79. As shown in Fig. 6, PSO rapidly converges the RMSE from 2.5 to 1, demonstrating high optimization efficiency. PSO dynamically balances the global best and individual best through particle position and velocity adjustments, gradually converging to the optimal solution. In this experiment, 2658 iterations took only 351.03 s; PSO effectively reduces redundant searches and significantly enhances the efficiency of XGBoost parameter optimization, making it suitable for high-dimensional parameter spaces. Moreover, the five-fold cross-validation graph of PSO-XGB in Fig. 7a shows that the error converges quickly during the parameter search, gradually approaching the optimal solution, indicating that PSO performs well and adjusts parameters efficiently.

In summary, the performance of PSO in the error iteration graphs and five-fold cross-validation, along with its running time, demonstrates higher efficiency and stability, especially in terms of rapid error convergence and time cost. PSO can find solutions that are close to optimal in a shorter time. Therefore, in this study, PSO-XGB achieves the best balance between efficiency, stability, and performance, making it the preferred algorithm for optimizing the MMP prediction model. The best hyperparameters optimized by PSO are shown in Table 4.

Table 4.

Optimal XGBoost hyperparameters for each optimization algorithm.

Algorithm XGBoost hyperparameters
Max_depth Learning_rate N_estimators Min_child_weight
Grid Search 10 0.4593 36 5
Bayesian 5 0.2224 210 6
GA 9 0.1677 402 4
PSO 10 0.49219232 138 4

Validation based on model proportion splitting mode

To further verify the reliability of the hyperparameter optimization results and avoid randomness caused by differences in data ratios, this study partitions the training and testing sets in various proportions for validation, ensuring that a ratio of 8:2 between the training and testing sets is reasonable within the same sample space.

By adjusting the data ratios of the training and testing sets (4:6, 5:5, 7:3, 8:2, 9:1) and combining them with the XGBoost model optimized by PSO, model prediction performance is compared. The prediction performance of the models is assessed using three evaluation metrics: RMSE, MAE, and R2. Figure 8 indicates that the model’s overall performance is best when the training and testing set ratio is 8:2, achieving an RMSE of 1.71 and an MAE of 1.19, both of which are the lowest values. At this ratio, the goodness of fit R2 reaches 0.951, indicating optimal model prediction performance. This finding is further supported by the five-fold cross-validation experiments discussed in section “Hyperparameter determination”, which reinforce the model’s robustness and stability across different datasets.

Fig. 8. Data proportional segmentation experiment.

Compared to other ratios, the 4:6 ratio yields an RMSE of 3.16 and an R2 of only 0.86, suggesting insufficient training data, which prevents the model from effectively capturing the complexity of the data. Conversely, at the 9:1 ratio, although the training data volume is the largest, the RMSE is 1.76, and R2 drops to 0.948, indicating that an insufficient testing set may lead to unstable model evaluations and the potential risk of overfitting. Therefore, it is concluded that a training-to-testing data ratio of 8:2 meets the requirements for constructing a machine-learning model.
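The split-ratio sweep can be reproduced in outline as follows. Synthetic data and a random-forest stand-in are used here; the RMSE and R2 values reported above come from the study's own dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)

# Test-set fractions corresponding to train:test ratios 4:6 ... 9:1
results = {}
for test_size in (0.6, 0.5, 0.3, 0.2, 0.1):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[test_size] = (float(np.sqrt(mean_squared_error(y_te, pred))),
                          float(r2_score(y_te, pred)))
```

Comparing the RMSE/R2 pairs in `results` across ratios mirrors the experiment in Fig. 8 used to justify the 8:2 split.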

Methodology and experimental environment

The entire process is illustrated in Fig. 9. First, data preprocessing is conducted, where feature parameters are selected based on theoretical knowledge and the Pearson correlation coefficient. The features that are significantly correlated with MMP are included in the training. PCA is employed for grouped dimensionality reduction to retain the independence and important information of different feature groups while enhancing training efficiency without compromising model accuracy.

Fig. 9. Flowchart of the research process.

In the model comparison, four representative machine learning algorithms, Support Vector Regression (SVR), K-Nearest Neighbors (KNN), Linear Regression (LR), and Random Forest (RF), were selected for evaluation alongside XGBoost. These methods cover a spectrum of regression approaches: SVR is known for modeling complex non-linear relationships; KNN offers simplicity and local pattern recognition; LR serves as a classical linear baseline; and RF, as a popular ensemble method, captures feature interactions effectively. This selection ensures a comprehensive and fair comparison of XGBoost's predictive performance. The final evaluation through RMSE, MAE, R2, and MAPE demonstrates that the XGBoost model provides the best predictions for CO2-MMP.

In the XGBoost hyperparameter optimization section, the error iteration graphs, five-fold cross-validation, and iteration times show that PSO-XGB achieves the best balance of efficiency, stability, and performance compared to Bayesian-XGB, GA-XGB, and Grid-Search-XGB, making it the preferred algorithm for MMP prediction. The optimized hyperparameters are shown in Table 4.

To further validate the reliability of the hyperparameter optimization results, the data ratios of the training and testing sets are adjusted (4:6, 5:5, 7:3, 8:2, 9:1) in conjunction with the XGBoost model optimized by PSO for a comparative analysis of model prediction performance. The results indicate that the model performs best when the training-to-testing set ratio is 8:2. This optimal prediction performance is reinforced by the five-fold cross-validation experiments detailed in section “Hyperparameter determination”, ensuring the robustness and stability of the PSO-XGB model across different datasets.

The SHAP framework is introduced to explain the model results and provide interpretability to the CO2 MMP prediction model. This helps support the reliability of the model outcomes and allows for a comparative analysis of feature contributions against the Pearson correlation analysis and PCA covariance matrix, enhancing the model’s transparency and credibility. This builds a transparent and reliable CO2 MMP prediction model.

This experiment was conducted using Python version 3.12.6 and Scikit-learn version 1.5.2. The detailed experimental configuration environment is shown in Table 5 below.

Table 5.

Experimental software and hardware configuration environment.

Item Configuration description Item Configuration description
Operating System Windows10 Compiler Environment PyCharm 2024.2.1 (Professional Edition)
Python Version 3.12.6 Memory 2048 M
Scikit-learn 1.5.2 Cores 8
Shap 0.46.0 Xgboost 2.1.1

Results and analysis

Model performance comparison and results analysis

Comparison of classical algorithms and XGBoost performance

In oil and gas field development and machine learning models, prediction accuracy and performance are key criteria for evaluating model effectiveness61. This section compares the performance of XGBoost with several classical machine learning algorithms (Support Vector Regression, SVR; K-Nearest Neighbors, KNN; Linear Regression, LR; and Random Forest, RF) to assess each model’s performance on the training and testing datasets.

Figure 10 illustrates the prediction results of XGB and the other machine-learning algorithms. The fitting results of XGB on the testing set are closest to the zero-error line, indicating that XGB's predicted values are closest to the observed results. As shown in Table 6, the XGB algorithm performs best on the training set, followed by the Random Forest algorithm. For the RMSE metric, only the XGB algorithm is below 1, and it also has the lowest MAE and MAPE. For goodness of fit, XGB achieves the highest value of 1.0000, Random Forest follows with 0.9833, and no other algorithm exceeds 0.9. This suggests that SVR, KNN, and linear regression are not well suited to this complex small-sample dataset: their testing metrics are similar, but none of them is the most effective method. On the testing set, all metrics of the XGB model outperform those of the RF model.

Fig. 10. Fitting results of XGB and other classic algorithms.

Table 6.

Evaluation metrics of the machine learning models.

Optimization Methods Machine learning
XGB SVR KNN LinearRegression RF
Dataset Train Test Train Test Train Test Train Test Train Test
RMSE 0.0229 1.2989 3.5607 3.7490 2.4944 2.6216 4.0403 3.8191 1.0093 1.8569
MAE 0.0150 1.0015 2.6856 3.0110 1.9377 2.2100 3.0724 2.8698 0.6972 1.5032
R2 1.0000 0.9754 0.7926 0.7951 0.8982 0.8998 0.7329 0.7873 0.9833 0.9497
MAPE(%) 0.0816 6.3060 15.2025 16.5603 11.1592 12.1849 16.9451 15.4962 3.8182 8.8366

From the evaluation indices in Table 6, it is evident that XGBoost has a very low RMSE on the training set (0.0229) and an R2 of 1.0000, indicating very high fitting accuracy on the training set. The RMSE on the testing set is 1.2989, with an R2 of 0.9754, showing a slight decline but still demonstrating strong generalization ability, with relatively small errors (MAE of 1.0015). In contrast, the SVR model performs poorly on both the training and testing sets, with an RMSE as high as 3.749 on the testing set and an R2 of only 0.7951, indicating that SVR does not perform well on this task and may fail to capture the complexity of the data effectively. KNN performs slightly better than SVR on both sets, but its testing-set RMSE still reaches 2.6216, with an R2 of 0.8998, suggesting some ability to capture local structure, though its overall precision is inferior to that of XGBoost. Linear regression performs similarly to SVR, showing significant errors on both sets, with a testing-set RMSE of 3.8191 and an R2 of 0.7873, indicating that it cannot adapt to non-linear data structures. Finally, the RF model is second only to XGBoost, with RMSE values of 1.0093 and 1.8569 for the training and testing sets, respectively; its testing-set R2 of 0.9497 indicates strong generalization ability, though still short of XGBoost.

Figure 11 shows a comparison of the residuals for different machine learning algorithms on the testing dataset, including XGBoost (XGB), Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Regression (SVR), and Linear Regression. The range of residuals, from largest to smallest, is as follows: Linear Regression > SVR > KNN > RF > XGB. This indicates that XGB consistently produces the lowest residuals, demonstrating higher accuracy and predictive capability than the other algorithms. Conversely, Linear Regression exhibits the largest residuals, suggesting it performs poorly in modeling the complexity of the dataset. This analysis reflects that XGB’s advanced gradient boosting algorithm effectively captures the underlying nonlinear relationships in the data, outperforming simpler models such as Linear Regression and Support Vector Regression.

Fig. 11. Comparison of residual graphs of classical algorithms.

In summary, XGBoost outperforms other machine learning algorithms. It exhibits higher fitting accuracy on both the training and testing datasets, particularly with smaller residuals on the testing dataset, indicating strong generalization ability and predictive accuracy. In contrast, Linear Regression and SVR models perform poorly, with a larger range of residuals, failing to capture the nonlinear relationships in complex data effectively. While RF and KNN also demonstrate good generalization capability, they are still slightly inferior in accuracy compared to XGBoost. This further confirms XGBoost’s advantages in handling complex data tasks, making it suitable for constructing MMP prediction models.

Comparison of the optimized XGBoost model with the traditional XGBoost

Figure 12 illustrates the prediction results of the traditional XGB model compared to the XGB models based on different optimization algorithms. It can be seen that the PSO-XGB model's fitting results on both the training and testing datasets are closer to the zero-error line than the other optimization results. Table 7 displays the performance metrics of the XGB models based on different optimization algorithms for both the training and testing datasets. By comparing these metrics, we can assess the impact of the optimization algorithms on model performance.

Fig. 12. Fitting results of classical XGB and optimized XGB.

Table 7.

Evaluation metrics of the XGBoost models under different optimization algorithms.

Optimization methods Xgboost Algorithm
PSO Bayes Grid-Search GA
Dataset Train Test Train Test Train Test Train Test
RMSE 0.23467 1.03035 0.401 1.2033 0.65338 1.71823 0.46399 1.42412
MAE 0.164343 0.839156 0.297667 0.952053 0.517933 1.362954 0.371879 1.144184
R2 0.9991 0.98452 0.99737 0.97889 0.99302 0.95695 0.99648 0.97043
MAPE(%) 0.908 5.104 1.64 5.808 2.885 8.281 2.068 7.139
Times(s) 132.02 248.08 139.32 1736.69

The optimized PSO-XGB model achieves a mean squared error (MSE) of 0.0551, a root mean squared error (RMSE) of 0.2347, a mean absolute error (MAE) of 0.1643, and an R2 of 0.9991 on the training dataset, indicating very high fitting accuracy. Compared to the unoptimized XGB model, the optimized model shows significant improvement, especially with an RMSE of 1.0303 on the testing dataset, demonstrating stronger generalization ability. The traditional XGB model has a higher RMSE on the testing dataset, suggesting that the unoptimized model may incur larger errors in real-world applications. In contrast, the PSO-optimized model maintains smaller errors while ensuring high accuracy.

By comparing the XGB models under different optimization algorithms, it is evident that the PSO-XGB model significantly outperforms other models on both the training and testing datasets, particularly in terms of generalization ability, as indicated by its higher R2 value. This suggests that the PSO algorithm has a distinct advantage in optimizing the XGB model, enhancing both the accuracy of the model and its training efficiency.
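As a rough illustration of how PSO tunes hyperparameters, the minimal sketch below optimizes two stand-in parameters (a learning rate and a tree depth) against a surrogate quadratic "validation error". A real run would instead train XGBoost and return its cross-validated RMSE; the objective, bounds, and swarm settings here are assumptions, not the study's configuration:

```python
import random

random.seed(0)

def validation_error(params):
    # Stand-in objective: pretend the best (learning_rate, max_depth)
    # are (0.1, 6); a real run would train XGBoost and return test RMSE.
    lr, depth = params
    return (lr - 0.1) ** 2 + 0.01 * (depth - 6.0) ** 2

def pso(objective, bounds, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5):
    """Classic global-best PSO with inertia weight w and learning factors c1, c2."""
    dim = len(bounds)
    pos = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp each coordinate to its search bounds
                pos[i][d] = min(max(pos[i][d] + vel[i][d], bounds[d][0]), bounds[d][1])
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Integer hyperparameters (e.g. max_depth) would be rounded before training
best, err = pso(validation_error, bounds=[(0.01, 0.5), (2, 12)])
print(best, err)
```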

In addition to prediction accuracy, computational efficiency is a crucial factor in evaluating hyperparameter optimization methods. As shown in Table 7, the time required to complete 1000 iterations varies significantly across the four algorithms. The PSO-XGB model completed the optimization in 132.02 s, achieving both high accuracy and the shortest runtime. The Bayesian-XGB model required 248.08 s, exhibiting moderate speed but less stable performance. The GridSearch-XGB method, although slightly faster at 139.32 s, yielded the worst performance due to its exhaustive and inefficient parameter search. In contrast, GA-XGB achieved reasonable prediction accuracy but incurred the highest time cost, requiring 1736.69 s due to the large population-based search and complex evolutionary operations.

The residuals of the traditional XGB model and the XGB models based on different optimization algorithms on the testing dataset are shown in Fig. 13. It can be seen that the range of residuals decreases in the following order: GridSearch-XGB > GA-XGB > Bayesian-XGB > PSO-XGB. This indicates that the PSO-XGB model has the smallest residuals and the highest prediction accuracy. Its optimization efficiency surpasses the other methods, allowing for better adjustment of model parameters and reduction of overfitting or underfitting. In contrast, GridSearch-XGB, due to its low efficiency in exhaustive search, struggles to quickly find the global optimal solution in high-dimensional parameter space, resulting in larger residuals.

Fig. 13. Comparison of residual plots for the XGB optimization algorithms.

In summary, by analyzing the impact of different optimization algorithms on the performance of the XGB model, it can be concluded that PSO-XGB demonstrates superior performance on both the training and testing sets. The PSO-optimized XGB model not only significantly outperforms the traditional XGB model in terms of fitting accuracy but also has the smallest residuals on the testing set, indicating the highest prediction accuracy. Compared to other optimization methods, the PSO algorithm is better at adjusting model parameters, enhancing generalization capability, and reducing overfitting and underfitting. Therefore, the PSO-optimized XGBoost model excels in the training phase and provides more reliable decision support for practical applications in oil and gas field development, especially in scenarios with complex data characteristics, effectively improving the model’s predictive ability and stability.

Model interpretation

The SHAP interpretability method can be used to analyze how the input features of a machine learning prediction model drive its outputs. Comparing the resulting explanations with physical principles and researchers' experience reveals where the model's learned relationships agree with, or depart from, domain knowledge, providing a basis for assessing the interpretability of the model.

As shown in Fig. 14, this study uses the MMP for CO2 miscible flooding as the output target, with four input parameters: critical temperature of the injected gas (Tcm), reservoir temperature (T), crude oil composition (MWC5+-Xvol-Xint), and gas injection composition (YCO2-YN2-YC1-YC2-C7). A global interpretability analysis is conducted using the SHAP method. The side color bar encodes feature values from high (red) to low (blue). For the critical temperature of the injected gas (Tcm), the red dots lie in the negative direction on the X-axis, indicating a negative relationship with MMP. Conversely, for reservoir temperature (T), the red dots lie in the positive direction on the X-axis, indicating a positive relationship. Furthermore, the ordering of the feature parameters on the Y-axis shows that the influence on MMP of the critical temperature of the injected gas (Tcm), the gas injection composition (YCO2-YN2-YC1-YC2-C7), and the crude oil composition (MWC5+-Xvol-Xint) decreases progressively from top to bottom.

Fig. 14. Feature scatter plot.
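For context, SHAP values are Shapley values from cooperative game theory; the study presumably computes them with the shap library's tree explainer, but for a tiny model they can be enumerated exactly over all feature coalitions. The sketch below uses a made-up linear surrogate of MMP in Tcm and T (the coefficients and baseline are illustrative, not the study's model):

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline, features):
    """Exact Shapley values by enumerating all feature coalitions.
    Features absent from a coalition are held at their baseline value."""
    n = len(features)
    phi = {f: 0.0 for f in features}
    def value(coalition):
        inp = {f: (x[f] if f in coalition else baseline[f]) for f in features}
        return model(inp)
    for f in features:
        others = [g for g in features if g != f]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                # Shapley weight |S|!(n-|S|-1)!/n! for coalition S of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[f] += weight * (value(set(subset) | {f}) - value(set(subset)))
    return phi

# Toy surrogate "MMP model": hypothetical linear response in Tcm and T
def toy_mmp(inp):
    return 30.0 - 0.05 * inp["Tcm"] + 0.08 * inp["T"]

x = {"Tcm": 304.2, "T": 370.0}          # instance being explained
baseline = {"Tcm": 290.0, "T": 350.0}   # reference (baseline) input
phi = shapley_values(toy_mmp, x, baseline, ["Tcm", "T"])
print(phi)
```

Consistent with Fig. 14, the Tcm contribution is negative (a higher Tcm lowers the predicted MMP) and the T contribution is positive, and the contributions sum exactly to the difference between the model output at x and at the baseline (the efficiency property).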

Figure 15 presents a SHAP heatmap designed to display the overall substructure of the dataset using supervised clustering and heatmaps. The SHAP value matrix is passed to the heatmap plotting function, with instances on the X-axis and model input parameters on the Y-axis. The input parameters are sorted based on hierarchical clustering and explanatory similarity. Above the heatmap matrix is the model output, with a gray dashed line indicating the baseline. The bar graph on the right side of the figure shows the global importance analysis of each model input, confirming that the influence of critical temperature of the injected gas (Tcm), gas injection composition (YCO2-YN2-YC1-YC2–C7), and crude oil composition (MWC5+-Xvol-Xint) on MMP decreases progressively.

Fig. 15. Feature importance SHAP values.

Therefore, the SHAP interpretability results presented in the figures align with the physical theoretical knowledge analysis in section “The factors affecting the CO2-oil MMP” and are also consistent with the Pearson index analysis and PCA covariance matrix results in section “Data preprocessing”. This indicates that the PSO-XGBoost model in this study for predicting CO2-MMP demonstrates good reliability, thereby providing a scientific basis for optimizing injection schemes in CO2-enhanced oil recovery in oil and gas development.

Discussion

Model advantages

The XGBoost model is widely used in the oil and gas industry due to its efficiency, accuracy, and good generalization capability. Compared to other traditional models, XGBoost can more effectively handle complex feature interactions and enhance model performance through adaptive learning. In this study, the XGBoost model, adjusted by Particle Swarm Optimization (PSO), exhibited superior results, demonstrating its efficiency and high precision. Specifically, the PSO-optimized XGBoost performed excellently in goodness-of-fit (R2), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE), further enhancing the model’s generalization ability and providing reliable data support for oil and gas field development. These advantages make PSO-XGB an important tool for prediction and decision-making in oil and gas development.

Model interpretability

In the oil and gas sector, artificial intelligence is rapidly being integrated, becoming a new engine for industry development. However, many existing machine learning models lack interpretability, making it difficult to explain the actual role of feature vectors in MMP prediction during practical applications and thereby limiting the models' broader adoption [34]. To address this issue, this study focuses on interpretability in pure and impure CO2 flooding development, emphasizing three aspects: pre-modeling, in-model, and post-modeling interpretability. First, by integrating reservoir physical theory, the study thoroughly analyzes the impact of temperature, crude oil composition, and gas injection composition on MMP. Using Pearson correlation coefficients, it identifies key feature parameters for model training, achieving pre-modeling interpretability through principal component analysis (PCA) and covariance matrix heatmap analysis.
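The Pearson screening step described above can be sketched as follows; the temperature/MMP pairs are illustrative values, not the study's dataset:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative: reservoir temperature (K) vs. MMP (MPa); theory predicts
# a strong positive correlation, which Pearson screening would retain.
temperature = [338, 347, 356, 366, 377]
mmp = [14.8, 16.1, 17.5, 19.2, 21.0]
print(round(pearson_r(temperature, mmp), 4))
```

Features whose |r| with MMP falls below a chosen threshold would be dropped before the PCA step.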

For the XGBoost algorithm, SHAP analysis evaluates model interpretability. By comparing feature contributions to predictions alongside pre-modeling Pearson analysis and PCA variance ratios, the transparency and credibility of the model are enhanced, achieving in-model and post-model interpretability and improving the model’s reliability for application.

Data errors

The sources of model error in this study are multifaceted. First, errors stemming from the experimental methods, such as differences in experimental conditions or equipment, may affect the reproducibility and reliability of the data, especially during laboratory measurement and data entry. Second, data processing errors may cause information loss or introduce bias. Errors associated with the model itself relate to the construction of the predictive model and the selection of its parameters; in particular, when physical principles are not adequately represented, the model's generalization ability and applicability may be limited. The effectiveness of artificial intelligence techniques in oil and gas relies on the quantity and quality of data, as data reliability directly affects the outputs of models that learn from it. However, combining relevant reservoir theory with causal analysis can effectively filter features and guide the choice of processing methods. Evaluating the model through interpretable methods allows the choice of models and features to be optimized, thereby enhancing the reliability and effectiveness of the research.

Discussion on application of Tcm

In past MMP prediction models, Tcm (the critical temperature of the injected gas) was often excluded from feature vector analysis, neglecting the relationship between the critical temperature of the injected gas, reservoir temperature, and MMP. The critical temperature is the highest temperature at which a substance can be liquefied; above it, no applied pressure can condense the gas. A gas must therefore be at or below its critical temperature before it can be liquefied, so gases with lower critical temperatures are harder to liquefy. The temperature in the reservoir's deeper layers remains relatively stable during conventional gas injection and does not change significantly with injection or production. However, injecting liquefied gas can lower the temperature of the formation near the injection well. Apart from cold damage caused by low temperatures (degraded mechanical properties of the formation and wax deposition in the crude oil), lower temperatures help reduce MMP. To enhance the completeness of the training data for the prediction model, this study includes the critical temperature of the injected gas as a feature parameter. Furthermore, Pearson correlation analysis of the relevant data revealed a significant correlation between the critical temperature of the injected gas and MMP, providing a basis for introducing Tcm as a new research variable.
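For an impure injection gas, a pseudo-critical temperature is commonly estimated with a mole-fraction-weighted mixing rule (Kay's rule). The study does not state its exact mixing rule, so the sketch below is an assumption, using standard handbook critical temperatures for the pure components:

```python
# Pure-component critical temperatures (K), standard handbook values
TC = {"CO2": 304.2, "N2": 126.2, "C1": 190.6, "C2": 305.3}

def tcm_kay(mole_fractions):
    """Pseudo-critical temperature of a gas mixture via Kay's mixing rule:
    Tcm = sum(y_i * Tc_i). Mole fractions must sum to 1."""
    assert abs(sum(mole_fractions.values()) - 1.0) < 1e-9
    return sum(y * TC[comp] for comp, y in mole_fractions.items())

# Illustrative impure CO2 stream: 90% CO2, 5% N2, 5% C1
print(tcm_kay({"CO2": 0.90, "N2": 0.05, "C1": 0.05}))
```

Note how N2 and C1 impurities, with their low critical temperatures, pull Tcm below that of pure CO2, which is consistent with the negative Tcm-MMP relationship observed in the SHAP analysis (a lower Tcm corresponds to a higher MMP).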

This finding fills a gap in previous research and reveals the connection between laboratory experiments and engineering practice. Therefore, optimizing monitoring and control strategies will be of significant importance. Future research can advance the application of this parameter in actual oil field development by devising more refined experimental plans, considering both reservoir temperature and the critical temperature of injected gas, thus enhancing CO2 flooding technology.

Conclusion

In this work, we developed a data-driven model for predicting the minimum miscibility pressure (MMP) of pure and impure CO2 flooding using an improved XGBoost algorithm. The study demonstrates that combining feature selection with Pearson correlation and dimensionality reduction through PCA can effectively improve model efficiency while maintaining high predictive accuracy. By introducing the particle swarm optimization (PSO) algorithm for hyperparameter tuning, the optimized XGBoost model showed clear advantages over traditional models, with better generalization ability and smaller prediction errors. The use of SHAP analysis further enhanced the transparency of the model, confirming the contribution of key variables and making the results more interpretable.

A particular contribution of this study is the introduction of the critical temperature of the injected gas (Tcm) as a new feature. The analysis verified its strong relevance to MMP, providing a fresh perspective that has not been widely considered in earlier studies. By incorporating both pure and impure CO2 compositions, the model reflects more realistic injection scenarios and therefore has greater potential for application in the field.

The findings suggest that the proposed model can be used as a cost-effective alternative to laboratory measurements, offering timely support for designing and optimizing CO2-EOR strategies. Looking ahead, further research should focus on enriching the dataset with more field-scale information and integrating PVT-related parameters. In addition, exploring hybrid approaches that combine physical models with data-driven methods could further enhance both accuracy and interpretability.

Overall, this study provides not only a reliable prediction tool for MMP but also new insights into the role of injection gas properties, supporting the development of more efficient and practical CO2-EOR schemes.

Acknowledgements

The authors would like to acknowledge the Cooperative Innovation Center of Unconventional Oil and Gas Resources of Yangtze University. We are grateful for the support from the National Natural Science Foundation of China (NSFC Grant No. 52004032).

Author contributions

Yuxin Yang (First Author): Conceptualization, Methodology, Software, Investigation, Formal Analysis, Writing—Original Draft, Data Curation. Yizhong Zhang (Corresponding Author): Conceptualization, Funding Acquisition, Resources, Supervision, Writing—Review & Editing. Bowen Qin: Data Curation, Formal Analysis, Software. Jianhong Guo: Resources, Visualization, Supervision, Writing—Review & Editing. Maolin Zhang: Visualization, Investigation.

Data availability

All data supporting the findings of this study are available within the paper as well as from the corresponding author.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Al-Khafaji, H. F. et al. Predicting minimum miscible pressure in pure CO2 flooding using machine learning: Method comparison and sensitivity analysis. Fuel354, 129263. 10.1016/j.fuel.2023.129263 (2023). [Google Scholar]
  • 2.Li, P. et al. Potential evaluation of CO2 EOR and storage in oilfields of the Pearl River Mouth Basin, northern South China Sea. Greenhouse Gases: Sci. Technol.8(5), 954–977. 10.1002/ghg.1808 (2018). [Google Scholar]
  • 3.Chen, G. et al. The genetic algorithm based back propagation neural network for MMP prediction in CO2-EOR process. Fuel126, 202–212. 10.1016/j.fuel.2014.02.034 (2014). [Google Scholar]
  • 4.Gozalpour, F., Ren, S. R. & Tohidi, B. CO2 EOR and storage in oil reservoir. Oil Gas Sci. Technol.60(3), 537–546. 10.2516/ogst:2005036 (2005). [Google Scholar]
  • 5.Fareed, A. G. et al. Underground geological sequestration of carbon dioxide (CO2) and its effect on possible enhanced gas and oil recovery in a fractured reservoir of Eastern Potwar Basin, Pakistan. Sci. Total Environ.905, 167124. 10.1016/j.scitotenv.2023.167124 (2023). [DOI] [PubMed] [Google Scholar]
  • 6.Choubineh, A., Helalizadeh, A. & Wood, D. A. Estimation of minimum miscibility pressure of varied gas compositions and reservoir crude oil over a wide range of conditions using an artificial neural network model. Adv. Geo-Energy Res.3(1), 52–66. 10.26804/ager.2019.01.04 (2019). [Google Scholar]
  • 7.Zhang, Z. et al. Recent advances in carbon dioxide utilization. Renew. Sustain. Energy Rev.125, 109799. 10.1016/j.rser.2020.109799 (2020). [Google Scholar]
  • 8.Taber, J. J., Martin, F. D. & Seright, R. S. EOR screening criteria revisited—part 2: applications and impact of oil prices. SPE Reserv. Eng.12(03), 199–206. 10.2118/39234-PA (1997). [Google Scholar]
  • 9.Chen, G. et al. Simulation of CO2-oil minimum miscibility pressure (MMP) for CO2 enhanced oil recovery (EOR) using neural networks. Energy Procedia37, 6877–6884. 10.1016/j.egypro.2013.06.620 (2013). [Google Scholar]
  • 10.Kamari, A., Arabloo, M., Shokrollahi, A., Gharagheizi, F. & Mohammadi, A. H. Rapid method to estimate the minimum miscibility pressure (MMP) in live reservoir oil systems during CO2 flooding. Fuel153, 310–319. 10.1016/j.fuel.2015.02.087 (2015). [Google Scholar]
  • 11.Zhao, Y. et al. The experimental research for reducing the minimum miscibility pressure of carbon dioxide miscible flooding. Renew. Sustain. Energy Rev.145, 111091. 10.1016/j.rser.2021.111091 (2021). [Google Scholar]
  • 12.Dong, M., Huang, S., Dyer, S. B. & Mourits, F. M. A comparison of CO2 minimum miscibility pressure determinations for Weyburn crude oil. J. Petrol. Sci. Eng.31(1), 13–22. 10.1016/s0920-4105(01)00135-8 (2001). [Google Scholar]
  • 13.Dindoruk, B., Johns, R. & Orr, F. M. Jr. Measurement and modeling of minimum miscibility pressure: A state-of-the-art review. SPE Reservoir Eval. Eng.24(02), 367–389. 10.2118/200462-PA (2021). [Google Scholar]
  • 14.Ahmad, W., Vakili-Nezhaad, G., Al-Bemani, A. S. & Al-Wahaibi, Y. Experimental determination of minimum miscibility pressure. Procedia Eng.148, 1191–1198. 10.1016/j.proeng.2016.06.629 (2016). [Google Scholar]
  • 15.Porter, R. T., Fairweather, M., Pourkashanian, M. & Woolley, R. M. The range and level of impurities in CO2 streams from different carbon capture sources. Int. J. Greenhouse Gas Control36, 161–174. 10.1016/j.ijggc.2015.02.016 (2015). [Google Scholar]
  • 16.Zhang, K., Jia, N., Zeng, F., Li, S. & Liu, L. A review of experimental methods for determining the oil-gas minimum miscibility pressures. J. Petrol. Sci. Eng.183, 106366. 10.1016/j.petrol.2019.106366 (2019). [Google Scholar]
  • 17.Jaubert, J. N., Avaullee, L. & Pierre, C. Is it still necessary to measure the minimum miscibility pressure?. Ind. Eng. Chem. Res.41(2), 303–310. 10.1021/ie010485f (2002). [Google Scholar]
  • 18.Holm, L. W. & Josendal, V. A. Mechanisms of oil displacement by carbon dioxide. J. Petrol. Technol.26(12), 1427–1438. 10.2118/4736-PA (1974). [Google Scholar]
  • 19.Lee, I. J. Effectiveness of carbon dioxide displacement under miscible and immiscible conditions (1979).
  • 20.Yellig, W. F. & Metcalfe, R. S. Determination and prediction of CO2 minimum miscibility pressures (includes associated paper 8876). J. Petrol. Technol.32(01), 160–168. 10.2118/7477-PA (1980). [Google Scholar]
  • 21.Alston, R. B., Kokolis, G. P. & James, C. F. CO2 minimum miscibility pressure: a correlation for impure CO2 streams and live oil systems. Soc. Petrol. Eng. J.25(02), 268–274. 10.2118/11959-PA (1985). [Google Scholar]
  • 22.Glasø, Ø. Generalized minimum miscibility pressure correlation. Soc. Petrol. Eng. J.25(06), 927–934. 10.2118/12893-PA (1985). [Google Scholar]
  • 23.Chen, G. et al. An improved correlation to determine minimum miscibility pressure of CO2–oil system. Green Energy Environ.5(1), 97–104. 10.1016/j.gee.2018.12.003 (2020). [Google Scholar]
  • 24.Li, D., Li, X., Zhang, Y., Sun, L. & Yuan, S. Four methods to estimate minimum miscibility pressure of CO2-Oil based on machine learning. Chin. J. Chem.37(12), 1271–1278. 10.1002/cjoc.201900337 (2019). [Google Scholar]
  • 25.Huang, Y. F., Huang, G. H., Dong, M. Z. & Feng, G. M. Development of an artificial neural network model for predicting minimum miscibility pressure in CO2 flooding. J. Petrol. Sci. Eng.37(1–2), 83–95. 10.1016/s0920-4105(02)00312-1 (2003). [Google Scholar]
  • 26.Belhaj, H., Abukhalifeh, H. & Javid, K. Miscible oil recovery utilizing N2 and/or HC gases in CO2 injection. J. Petrol. Sci. Eng.111, 144–152. 10.1016/j.petrol.2013.08.030 (2013). [Google Scholar]
  • 27.Montazeri, M., Kamari, E. & Namin, A. R. Minimum miscibility pressure by the vanishing interfacial tension method: Effect of pressure and composition by injection of gas cap into dead/live oil. J. Chem. Eng. Data67(10), 3077–3084. 10.1021/acs.jced.2c00494 (2022). [Google Scholar]
  • 28.Bon, J., Emera, M. K. & Sarma, H. K. An experimental study and genetic algorithm (GA) correlation to explore the effect of nC5 on impure CO2 minimum miscibility pressure (MMP). In SPE Asia Pacific Oil and Gas Conference and Exhibition, SPE-101036. (SPE, 2006). 10.2118/101036-MS.
  • 29.Qin, B. et al. Prediction of the minimum miscibility pressure for CO2 flooding based on a physical information neural network algorithm. Meas. Sci. Technol.35(12), 126010. 10.1088/1361-6501/ad6a77 (2024). [Google Scholar]
  • 30.Simovici, D. Intelligent data analysis techniques—Machine learning and data mining. Artif. Intell. Approach. Petrol. Geosci.10.1007/978-3-319-16531-8_1 (2015). [Google Scholar]
  • 31.Dehghani, S. M., Sefti, M. V., Ameri, A. & Kaveh, N. S. Minimum miscibility pressure prediction based on a hybrid neural genetic algorithm. Chem. Eng. Res. Design86(2), 173–185. 10.1016/j.cherd.2007.10.011 (2008). [Google Scholar]
  • 32.Wu, C. et al. Determination of Gas-Oil minimum miscibility pressure for impure CO2 through optimized machine learning models. Geoenergy Sci. Eng.242, 213216. 10.1016/j.geoen.2024.213216 (2024). [Google Scholar]
  • 33.Shen, B. et al. Interpretable knowledge-guided framework for modeling minimum miscible pressure of CO2-oil system in CO2-EOR projects. Eng. Appl. Artif. Intell.118, 105687. 10.1016/j.engappai.2022.105687 (2023). [Google Scholar]
  • 34.Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R. & Yu, B. Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci.116(44), 22071–22080. 10.1073/pnas.1900654116 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Su, J., Wang, Y., Niu, X., Sha, S. & Yu, J. Prediction of ground surface settlement by shield tunneling using XGBoost and Bayesian Optimization. Eng. Appl. Artif. Intell.114, 105020. 10.1016/j.engappai.2022.105020 (2022). [Google Scholar]
  • 36.Bentéjac, C., Csörgő, A. & Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev.54, 1937–1967. 10.1007/s10462-020-09896-5 (2021). [Google Scholar]
  • 37.Al-Ajmi, M., Alomair, O. & Elsharkawy, A. Planning miscibility tests and gas injection projects for four major Kuwaiti reservoirs. In SPE Kuwait International Petroleum Conference and Exhibition, SPE-127537. (SPE, 2009). 10.2118/127537-MS.
  • 38.Zuo, Y. X., Chu, J. Z., Ke, S. L. & Guo, T. M. A study on the minimum miscibility pressure for miscible flooding systems. J. Petrol. Sci. Eng.8(4), 315–328. 10.1016/0920-4105(93)90008-3 (1993). [Google Scholar]
  • 39.Thakur, G. C., Lin, C. J. & Patel, Y. R. CO2 minitest, little knife field, ND: a case history. In SPE Improved Oil Recovery Conference?, SPE-12704 (SPE, 1984). 10.2118/12704-MS
  • 40.Spence Jr, A. P. & Watkins, R. W. The effect of microscopic core heterogeneity on miscible flood residual oil saturation. In SPE Annual Technical Conference and Exhibition?, SPE-9229 (SPE 1980). 10.2118/9229-MS.
  • 41.Rathmell, J. J., Stalkup, F. I. & Hassinger, R. C. A laboratory investigation of miscible displacement by carbon dioxide. In SPE Annual Technical Conference and Exhibition?, SPE-3483 (SPE, 1971). 10.2118/3483-MS.
  • 42.Eakin, B. E., & Mitch, F. J. Measurement and correlation of miscibility pressures of reservoir oils. In SPE Annual Technical Conference and Exhibition? SPE-18065 (SPE, 1988). 10.2118/18065-MS
  • 43.Li, H., Qin, J. & Yang, D. An improved CO2–oil minimum miscibility pressure correlation for live and dead crude oils. Ind. Eng. Chem. Res.51(8), 3516–3523. 10.1021/ie202339g (2012). [Google Scholar]
  • 44.Graue, D. J. & Zana, E. T. Study of a possible CO2 flood in Rangely Field. J. Petrol. Technol.33(07), 1312–1318. 10.2118/7060-PA (1981). [Google Scholar]
  • 45.Dicharry, R. M., Perryman, T. L. & Ronquille, J. D. Evaluation and design of a CO2 miscible flood project-SACROC unit, Kelly-Snyder field. J. Petrol. Technol.25(11), 1309–1318. 10.2118/4083-PA (1973). [Google Scholar]
  • 46.Shelton, J. L. & Yarborough, L. Multiple phase behavior in porous media during CO2 or rich-gas flooding. J. Petrol. Technol.29(09), 1171–1178. 10.2118/5827-PA (1977). [Google Scholar]
  • 47.Henry, R. L. & Metcalfe, R. S. Multiple-phase generation during carbon dioxide flooding. Soc. Petrol. Eng. J.23(04), 595–601. 10.2118/8812-PA (1983). [Google Scholar]
  • 48.Cardenas, R. L., Alston, R. B., Nute, A. J. & Kokolis, G. P. Laboratory design of a gravity-stable miscible CO2 process. J. Petrol. Technol.36(01), 111–118. 10.2118/10270-PA (1984). [Google Scholar]
  • 49.Sebastian, H. M., Wenger, R. S. & Renner, T. A. Correlation of minimum miscibility pressure for impure CO2 streams. J. Petrol. Technol.37(11), 2076–2082. 10.2118/12648-PA (1985). [Google Scholar]
  • 50.Clark, P., Toulekima, S., & Sarma, H. A miscibility scoping study for gas injection into a high-temperature volatile oil reservoir in the Cooper Basin, Australia. In SPE Asia Pacific Oil and Gas Conference and Exhibition, SPE-116782 (SPE, 2008). 10.2118/116782-MS
  • 51.Rao, D. N. A new technique of vanishing interfacial tension for miscibility determination. Fluid Phase Equilib.139(1–2), 311–324. 10.1016/s0378-3812(97)00180-5 (1997). [Google Scholar]
  • 52.Ayirala, S. C. & Rao, D. N. Comparative evaluation of a new gas/oil miscibility-determination technique. J. Can. Pet. Technol.50(09), 71–81. 10.2118/99606-PA (2011). [Google Scholar]
  • 53.Bian, X. Q., Han, B., Du, Z. M., Jaubert, J. N. & Li, M. J. Integrating support vector regression with genetic algorithm for CO2-oil minimum miscibility pressure (MMP) in pure and impure CO2 streams. Fuel182, 550–557. 10.1016/j.fuel.2016.05.124 (2016). [Google Scholar]
  • 54.Cronquist, C. Carbon dioxide dynamic miscibility with light reservoir oils. In Proceedings of the Fourth Annual US DOE Symposium, Tulsa, Oklahoma, 28–30 (1978).
  • 55.Orr, F. M. & Jensen, C. M. Interpretation of pressure-composition phase diagrams for CO2/crude-oil systems. Soc. Petrol. Eng. J.24(05), 485–497. 10.2118/11125-PA (1984). [Google Scholar]
  • 56.Emera, M. K. & Sarma, H. K. Use of genetic algorithm to estimate CO2–oil minimum miscibility pressure—a key parameter in design of CO2 miscible flood. J. Petrol. Sci. Eng.46(1–2), 37–52. 10.1016/j.petrol.2004.10.001 (2005). [Google Scholar]
  • 57.Yuan, H., Johns, R. T., Egwuenu, A. M. & Dindoruk, B. Improved MMP correlations for CO2 floods using analytical gas flooding theory. In SPE Improved Oil Recovery Conference, SPE-89359-MS (SPE, 2004). 10.2118/89359-MS.
  • 58.Shokir, E. M. E. M. CO2–oil minimum miscibility pressure model for impure and pure CO2 streams. J. Petrol. Sci. Eng.58(1–2), 173–185. 10.1016/j.petrol.2006.12.001 (2007). [Google Scholar]
  • 59.Sayyad, H., Manshad, A. K. & Rostami, H. Application of hybrid neural particle swarm optimization algorithm for prediction of MMP. Fuel116, 625–633. 10.1016/j.fuel.2013.08.076 (2014). [Google Scholar]
  • 60.Fathinasab, M. & Ayatollahi, S. On the determination of CO2–crude oil minimum miscibility pressure using genetic programming combined with constrained multivariable search methods. Fuel173, 180–188. 10.1016/j.fuel.2016.01.009 (2016). [Google Scholar]
  • 61.Karkevandi-Talkhooncheh, A. et al. Application of adaptive neuro fuzzy interface system optimized with evolutionary algorithms for modeling CO2-crude oil minimum miscibility pressure. Fuel205, 34–45. 10.1016/j.fuel.2017.05.026 (2017). [Google Scholar]
  • 62.Qin, B. et al. ACS Omega, Article ASAP. 10.1021/acsomega.4c08313.
  • 63.Zhang, M. et al. A Pearson correlation-based adaptive variable grouping method for large-scale multi-objective optimization. Inf. Sci.639, 118737. 10.1016/j.ins.2023.02.055 (2023). [Google Scholar]
  • 64.Eisinga, R., Grotenhuis, M. T. & Pelzer, B. The reliability of a two-item scale: Pearson, Cronbach, or spearman-brown?. Int. J. Public Health58, 637–642. 10.1007/s00038-012-0416-3 (2013). [DOI] [PubMed] [Google Scholar]
  • 65.Abdi, H. & Williams, L. J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat.2(4), 433–459. 10.1016/0169-7439(87)80084-9 (2010). [Google Scholar]
  • 66.Jolliffe, I. T. & Cadima, J. Principal component analysis: a review and recent developments. Philos. Trans. R. Soc. Math. Phys. Eng. Sci.374(2065), 20150202. 10.1098/rsta.2015.0202 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Jain, M., Saihjpal, V., Singh, N. & Singh, S. B. An overview of variants and advancements of PSO algorithm. Appl. Sci.12(17), 8392. 10.3390/app12178392 (2022). [Google Scholar]
  • 68.Garcia-Gonzalo, E. & Fernandez-Martinez, J. L. A brief historical review of particle swarm optimization (PSO). J. Bioinform. Intell. Control1(1), 3–16. 10.3390/app12178392 (2012). [Google Scholar]
  • 69.Ogunleye, A. & Wang, Q. G. XGBoost model for chronic kidney disease diagnosis. IEEE/ACM Trans. Comput. Biol. Bioinf.17(6), 2131–2140. 10.1109/TCBB.2019.2911071 (2019). [DOI] [PubMed] [Google Scholar]
  • 70.Qiu, Y. et al. Performance evaluation of hybrid WOA-XGBoost, GWO-XGBoost and BO-XGBoost models to predict blast-induced ground vibration. Eng. Comput.38(Suppl 5), 4145–4162. 10.1007/s00366-021-01393-9 (2022). [Google Scholar]
  • 71.Chen, T. & Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794 (2016). 10.1145/2939672.2939785
  • 72.Štrumbelj, E. & Kononenko, I. Explaining prediction models and individual predictions with feature contributions. Knowl. Inf. Syst.41, 647–665. 10.1007/s10115-013-0679-x (2014). [Google Scholar]
  • 73.Ribeiro, M. T., Singh, S. & Guestrin, C. “Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 1135–1144 (2016). 10.1145/2939672.2939778
  • 74.Parsa, A. B., Movahedi, A., Taghipour, H., Derrible, S. & Mohammadian, A. K. Toward safer highways, application of XGBoost and SHAP for real-time accident detection and feature analysis. Accid. Anal. Prev.136, 105405. 10.1016/j.aap.2019.105405 (2020). [DOI] [PubMed] [Google Scholar]
  • 75.Lundberg, S. M. et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat. Biomed. Eng.2(10), 749–760. 10.1038/s41551-018-0304-0 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Syarif, I., Prugel-Bennett, A. & Wills, G. SVM parameter optimization using grid search and genetic algorithm to improve classification performance. TELKOMNIKA (Telecommun. Comput. Electron. Control)14(4), 1502–1509. 10.12928/telkomnika.v14i4.3956 (2016). [Google Scholar]
  • 77.Shahriari, B., Swersky, K., Wang, Z., Adams, R. P. & De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE104(1), 148–175. 10.1109/JPROC.2015.2494218 (2015). [Google Scholar]
  • 78.Coello, C. A. An updated survey of GA-based multiobjective optimization techniques. ACM Comput. Surv. (CSUR)32(2), 109–143. 10.1145/358923.358929 (2000). [Google Scholar]
  • 79.Wang, D., Tan, D. & Liu, L. Particle swarm optimization algorithm: an overview. Soft. Comput.22(2), 387–408. 10.1007/s00500-016-2474-6 (2018). [Google Scholar]



Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
