Development of new materials for electrothermal metals using data driven and machine learning

Chengqun Zhou; Muyang Pei; Chao Wu; Degang Xu; Qiang Peng; Guoai He

doi:10.1371/journal.pone.0297943

PLoS One. 2024 Apr 26;19(4):e0297943. doi: 10.1371/journal.pone.0297943

Chengqun Zhou ^1,^#, Muyang Pei ^2,^#, Chao Wu ¹, Degang Xu ^3,⁴, Qiang Peng ^2,^*, Guoai He ^2,^3,^*

Editor: Babatunde Abiodun Salami⁵

PMCID: PMC11051622 PMID: 38669274

Abstract

By combining data-driven methods with machine learning, the prediction of material performance and the optimization of composition design can significantly reduce the time and cost of materials development. In this research, we employed four machine learning algorithms, namely linear regression, ridge regression, support vector regression, and backpropagation neural networks, to develop predictive models for the electrical performance data of titanium alloys. Our focus was on two key objectives: resistivity and the temperature coefficient of resistance (TCR). Subsequently, leveraging the results of feature selection, we analyzed the impact of alloying elements on these two electrical properties. The prediction results indicate that for the resistivity prediction task, the radial basis function kernel-based support vector machine model performs best, with a correlation coefficient above 0.995 and a percentage error within 2%, demonstrating high predictive capability. For the TCR prediction task, the best-performing model is a backpropagation neural network with two hidden layers, also with a correlation coefficient above 0.995 and a percentage error within 3%, demonstrating good generalization ability. The feature selection results using random forest and Xgboost indicate that Al and Zr have a significant positive effect on resistivity, while Al, Zr, and V have a significant negative effect on TCR. The composition optimization design suggests that, to achieve both high resistivity and high TCR, the Al content should be set in the range of 1.5% to 2% and the Zr content in the range of 2.5% to 3%.

Introduction

In the realm of new material development, the challenge of achieving specific target performance has spurred a revolutionary transformation in material research, rendering traditional trial-and-error methods insufficient. These conventional approaches rely on limited experimental data to discretely adjust material compositions and process parameters in the quest for optimal material performance. Nevertheless, it is evident that this research and development methodology is time-consuming and costly. In an effort to expedite material development and introduce more efficient research approaches, the United States initiated the Materials Genome Initiative (MGI) in 2011[1], with the aim of significantly reducing the material development cycle.

In recent years, data-driven and machine learning methods have gradually emerged as prominent players in addressing real-world engineering challenges [2–10]. The application of artificial intelligence, machine learning, and deep learning technologies in the field of materials science has garnered significant attention, as these methods are widely adopted for material discovery and design [11]. Logan et al. [12] emphasized that the process of using machine learning methods in materials informatics comprises three key elements: material data, material descriptors, and machine learning algorithms suitable for prediction. Davoodi et al. [13] employed multiple machine learning models to accurately predict the hydrogen (H2) absorption percentage in porous carbon media (PCM), offering critical insights for efficient H2 storage. Bruno et al. [14] utilized a machine learning kernel regression model to predict the electronic properties and elastic performance of materials, along with proposing an Exhaustive Enumeration algorithm for material reverse design. The majority of the methods mentioned above have effectively utilized suitable machine learning algorithms, coupled with selective feature engineering, to achieve satisfactory prediction and design outcomes.

The objective of this study is to develop electrically conductive metallic materials with high resistivity and temperature coefficient of resistance (TCR). In the domain of alloy electrical performance, while research on resistivity is extensive, investigations into the temperature coefficient of resistance (TCR) of alloys remain relatively limited. However, both resistivity and TCR are critical indicators of alloy electrical performance. Faced with the vast parameter space of multicomponent alloys, traditional experimental methods are practically incapable of encompassing all conceivable alloy combinations. Consequently, this study introduces data-driven and machine learning approaches to address the challenges of compositional optimization design, thereby accelerating the development of new materials.

The remainder of this paper is organized as follows: Section 2 introduces the machine learning models used, covering predictive modeling and feature selection. Section 3 compares a series of experimental results to determine the optimal predictive model and then discusses and analyzes compositional optimization. Lastly, Section 4 summarizes the work and proposes future research directions.

Method

This paper presents a three-step data-driven and machine learning-based workflow for predicting the electrical performance of electrically conductive metal materials. As illustrated in Fig 1, in the first step the electrical performance data were obtained and cleaned, and the original dataset was then partitioned into a training set and a test set. The raw data were subsequently normalized or standardized to mitigate the impact of varying data scales on the prediction results. We employed K-fold cross-validation, a well-established strategy [15–17], for data partitioning and model training in order to reduce the influence of any particular data partition on the training process. We set K to 5, so each split has a 4:1 ratio: approximately 80% of the data are used for training and the remaining 20% for testing. Within the training data, a further 7:3 split creates a validation set, which is used to assess the model’s performance and its ability to generalize to unseen data. Separating the validation set from the training set allows for a more robust estimate of the model’s performance in practical applications. This approach retains sufficient data for performance evaluation while leaving ample training data to support the model’s learning, thereby safeguarding its generalization capabilities.

Machine learning methods exhibit varying sensitivities to material data within different ranges [18–21]. Hence, it is crucial to select suitable algorithms for the specific material data samples and to evaluate them using appropriate performance metrics. In this study, we evaluated the models using four metrics: root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and coefficient of determination (R²). Among these, R² serves as the principal performance metric for identifying the most effective predictive model for subsequent analysis.
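The partitioning described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names `k_fold_indices` and `train_val_split`, the random seed, and the use of 729 samples (the number of compositions) are all choices made here for the example.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def train_val_split(train_idx, val_fraction=0.3):
    """Hold out the last val_fraction of the (already shuffled) training indices."""
    n_val = round(len(train_idx) * val_fraction)
    return train_idx[:-n_val], train_idx[-n_val:]

# Assumed sample count: one row per composition.
folds = list(k_fold_indices(729, k=5))
for train_idx, test_idx in folds:
    fit_idx, val_idx = train_val_split(train_idx)
    print(len(fit_idx), len(val_idx), len(test_idx))
```

Each sample lands in exactly one test fold, so every composition is used for testing once and for training four times.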

Fig 1 — After the optimal model is obtained, the variation trend of resistivity and TCR can be analyzed by adjusting the content of important elements based on the feature screening results.

The second step involved feature importance selection. In this study, we utilized two different feature selection models and two evaluation methods to rank the importance of features. The intersection of the feature subsets obtained from both methods is selected as the final feature subset [22]. Since the feature data in this study comprised elemental composition data, the results of the feature importance selection represented the degree of influence that different elements have on the electrical performance of alloys. As our target performance included resistivity and TCR, separate feature selection was performed for each of these electrical performance indicators to identify the element subsets that have the greatest impacts on each of them.

The third step utilized the outcomes of the previous two steps for compositional optimization analysis. Due to the "trade-off" relationship between resistivity and TCR, achieving simultaneous positive gains in both electrical performance indicators cannot rely solely on blindly increasing or decreasing the content of certain elements based on feature selection results. In this study, by adjusting the element compositions obtained through feature selection, novel electrical performance data was calculated using the predictive model. The trends in resistivity and TCR changes were then comprehensively analyzed within the same range to explore the design principles for optimizing these two electrical performance indicators simultaneously. The details of each step will be discussed in the following subsections.

Data source

The material electrical performance data used in this study primarily originated from the JMatPro database and were acquired in four main steps. First, various types of alloy electrical performance data were collected through a literature review, and candidate alloys were screened by their resistivity and TCR values; titanium alloys were chosen as the initial research objects. Next, a program was written to generate 729 virtual sample points with titanium as the primary constituent element. These 729 titanium-based compositions comprised nine elements: Al, Si, Zr, Mo, V, Sn, Nb, Mn, and Ti. Subsequently, the JMatPro high-throughput module (the API), which allows batch calculation, was used to obtain resistivity data at 29 temperature points ranging from 25°C to 300°C for each virtual composition. In total, 21,141 resistivity data points (729 compositions × 29 temperatures) were obtained for the titanium alloys, constituting the resistivity dataset used in the following experiments. Lastly, using the relationship between resistivity and TCR, the TCR of each of the 729 compositions was calculated programmatically, forming the TCR dataset used in the following experiments.
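The sample counts above can be reproduced with a short sketch. The paper does not state how the 729 virtual compositions were generated; the code below assumes a three-level full factorial over the six varied elements (3^6 = 729), with evenly spaced levels spanning the minimum and maximum values later listed in Table 1 and with Ti as the balance. All names (`ranges_pct`, `fixed_pct`, `build_compositions`) are hypothetical.

```python
from itertools import product

# ASSUMPTION: three evenly spaced levels per varied element, spanning the
# min/max values of Table 1. The paper does not state the sampling scheme.
ranges_pct = {
    "Al": (0.0, 5.13), "Zr": (0.0, 6.63), "Mo": (0.0, 1.75),
    "V": (0.0, 3.00), "Sn": (0.0, 2.50), "Nb": (0.0, 13.25),
}
fixed_pct = {"Si": 0.08, "Mn": 1.40}  # constant across samples in Table 1

def levels(lo, hi, n=3):
    """Return n evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def build_compositions():
    names = list(ranges_pct)
    grids = [levels(*ranges_pct[name]) for name in names]
    compositions = []
    for combo in product(*grids):
        comp = dict(zip(names, combo))
        comp.update(fixed_pct)
        comp["Ti"] = 100.0 - sum(comp.values())  # Ti makes up the balance
        compositions.append(comp)
    return compositions

compositions = build_compositions()
n_temperatures = 29  # temperature points per composition, as stated in the text
print(len(compositions), len(compositions) * n_temperatures)
```

Under this assumption the grid yields 729 compositions and 729 × 29 = 21,141 resistivity points, matching the counts in the text.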

Calculation of TCR

With 25°C as the reference temperature, the relation between the resistance R_t of the alloy at t°C and the temperature can be expressed by the following formula:

R_{t} = R_{25} [1 + α (t - 25) + β {(t - 25)}^{2} + γ {(t - 25)}^{3} + \dots]

(1)

In Formula (1), R₂₅ is the resistance value at 25°C, in Ω; α, β and γ are the first-, second- and third-order temperature coefficients of resistance, respectively. Since the resistance of the titanium alloys in this paper varies approximately linearly with temperature in the range of 25°C to 300°C, Eq (1) can be simplified as follows:

R_{t} = R_{25} [1 + α (t - 25)]

(2)

From the relationship between resistance and resistivity, it can be seen that the change of resistance value of the same material at different temperatures essentially comes from the change of resistivity, so Formula (2) can be further simplified as:

ρ_{t} = ρ_{25} [1 + α (t - 25)]

(3)

In Formula (3), ρ_t and ρ₂₅ are the resistivity of the titanium alloy at t°C and 25°C respectively, in Ω⋅m. As shown in Fig 2, a linear fit is performed with t-25 on the x-axis and ρ_t on the y-axis. The fitted intercept is ρ₂₅ and the slope is ρ₂₅ α, so the temperature coefficient of resistance α can be calculated by dividing the slope by the intercept.
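The slope-over-intercept calculation can be sketched as follows. This is an illustrative implementation of an ordinary least-squares line fit, not the authors' code; the function name `fit_tcr`, the synthetic resistivity values, and the even spacing of the 29 temperature points are assumptions made for the example.

```python
def fit_tcr(temperatures_c, resistivities, t_ref=25.0):
    """Fit rho_t = intercept + slope * (t - t_ref) by least squares.

    Returns (rho_ref, alpha), where rho_ref is the intercept and
    alpha = slope / intercept, following Eq (3).
    """
    xs = [t - t_ref for t in temperatures_c]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(resistivities) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, resistivities))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope / intercept

# Synthetic check with made-up values (not data from the paper).
rho_25_true, alpha_true = 1.20, 1.3e-3
temps = [25.0 + 275.0 * i / 28 for i in range(29)]  # even spacing is assumed
rhos = [rho_25_true * (1 + alpha_true * (t - 25.0)) for t in temps]
rho_25_fit, alpha_fit = fit_tcr(temps, rhos)
print(rho_25_fit, alpha_fit)
```

On exactly linear data the fit recovers the generating ρ₂₅ and α up to floating-point error; on real data the quality of the estimate depends on how well the linear approximation of Eq (3) holds.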

Data normalization and standardization

Data normalization and standardization can also keep the original distribution of the data while eliminating the influence of the large difference in the order of magnitude of different features on the prediction model. Therefore, this study will normalize or standardize the original data before regression prediction and feature screening.

In normalization processing, each feature is transformed by the following formula:

x_{i}^{*} = \frac{x_{i} - \min (x)}{\max (x) - \min (x)}

(4)

Where: $x = (x_{1}, x_{2}, \dots, x_{m})$ represents the set consisting of the values of a feature over the m samples; x_i∈x; max(x) and min(x) are the maximum and minimum values in the feature set respectively. $x_{i}^{*}$ is the normalized feature value, that is, the value of x_i mapped from its original range into the interval [0,1].

In standardization, each feature is transformed by the following formula:

x_{i}^{*} = \frac{x_{i} - \mathrm{mean}(x)}{\mathrm{std}(x)}

(5)

Where mean(x) and std(x) are the mean and standard deviation of the feature, respectively. After this transformation, the feature has a mean of 0 and a standard deviation of 1. Both methods were compared, and standardization was found to give the better prediction results.
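Eqs (4) and (5) can be illustrated with a small sketch. The function names and sample values are made up for the example, and the use of the population standard deviation in Eq (5) is an assumption, since the text does not say which estimator was used.

```python
from statistics import fmean, pstdev

def min_max_normalize(values):
    """Eq (4): map a feature column into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Eq (5): shift to zero mean and scale to unit standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # ASSUMPTION: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

feature = [0.0, 2.0, 4.0]  # illustrative values only
normalized = min_max_normalize(feature)
standardized = standardize(feature)
print(normalized, standardized)
```

A constant column, such as an element held at a fixed content, carries no information for the model, which is why the sketch maps it to zeros instead of dividing by zero.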

Machine learning regression model

In predictive tasks, machine learning can be divided into classification and regression [23–25]. The prediction of electrical properties in this paper is a regression task, wherein, four commonly-used machine learning models were employed: linear regression, ridge regression, support vector regression and BP neural network. Among them, BP neural network showed strong nonlinear mapping ability and was widely used in classification, fitting, diagnosis, prediction and other fields [26].

Linear regression

The linear regression model plays an important role in machine learning; its advantage is that it is simple and easy to build. Its basic form is as follows. Given a data set $D = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{m}, y_{m})\}$, where $x_{i} = (x_{i1}, x_{i2}, \dots, x_{id})$, m is the number of samples, d is the feature dimension, and y_i∈R, linear regression tries to learn:

f (x_{i}) = ω^{T} x_{i} + b

(6)

so that f(x_i)≈y_i, where ω = (ω₁,ω₂,⋯,ω_d). The linear model has a certain interpretability: it can be seen from Eq (6) that ω_j represents the weight of the j-th feature in the prediction, and b is the predicted target value when all features are zero. Solving for ω and b with the least squares method gives:

(ω^{*}, b^{*}) = \arg \min_{(ω, b)} \sum_{i = 1}^{m} {(y_{i} - ω^{T} x_{i} - b)}^{2}

(7)

The optimal solution of ω and b can be obtained from Eq (7).

Ridge regression

Ridge regression is generally used in cases where the ratio of sample characteristic dimension to sample number is large, and it is a supplementary method to linear regression. The loss function of ridge regression model is:

\min_{ω} \sum_{i = 1}^{m} {(y_{i} - ω^{T} x_{i})}^{2} + λ {‖ ω ‖}_{2}^{2}

(8)

L2 norm regularization is used in the expression, and the regularization parameter λ>0.

Ridge regression can alleviate the overfitting problem by introducing a regularization term. In addition, from a mathematical point of view, ridge regression handles cases in which ordinary linear regression yields an unstable ω or cannot be solved at all.
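The stability argument can be made concrete with the closed-form minimizer of Eq (8). This is a standard result and is not stated in the paper; the symbols X (the m×d matrix whose rows are the x_i^T), y (the vector of targets) and I (the d×d identity matrix) are notation introduced here for illustration.

```latex
% Objective of Eq (8) in matrix form
J(ω) = {‖ y - X ω ‖}_{2}^{2} + λ {‖ ω ‖}_{2}^{2}

% Setting the gradient to zero
\nabla_{ω} J = -2 X^{T} (y - X ω) + 2 λ ω = 0

% Closed-form solution
\hat{ω} = {(X^{T} X + λ I)}^{-1} X^{T} y, \qquad λ > 0
```

Because X^T X is positive semidefinite, X^T X + λI is positive definite and therefore invertible for any λ > 0, even when X^T X itself is singular (for example with collinear features or d > m). When X^T X is invertible, letting λ approach 0 recovers the ordinary least squares solution.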

Support vector regression

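The principle of the support vector machine [27,28] is to find a hyperplane that separates samples of different classes. Support vector regression (SVR) applies the same idea to regression by fitting a function f(x) whose deviations from the targets are not penalized as long as they stay within a tolerance ε. The objective function of the SVR problem is as follows: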

\min \frac{1}{2} {‖ ω ‖}^{2} + C \sum_{i = 1}^{m} l_{ε} (f (x_{i}) - y_{i})

(9)

Where: ω = (ω₁,ω₂,⋯,ω_d) is the normal vector of the hyperplane; C is the regularization constant; l_ε is the ε-insensitive loss function:

l_{ε} (f (x) - y) = \begin{cases} 0, & | f (x) - y | \leq ε \\ | f (x) - y | - ε, & | f (x) - y | > ε \end{cases}

(10)

SVR can be expressed by the formula with kernel function as:

f (x) = \sum_{i = 1}^{m} ({\overset{\land}{α}}_{i} - α_{i}) κ (x, x_{i}) + b

(11)

Where κ(x,x_i) is the kernel function, and ${\overset{\land}{α}}_{i}$ and α_i are the Lagrangian multipliers of the i-th sample, with ${\overset{\land}{α}}_{i}$ ≥ 0 and α_i ≥ 0. The SVR kernels used in this study are the Gaussian radial basis function (RBF) kernel and the sigmoid kernel.
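Eqs (10) and (11) can be sketched directly. The paper does not write out the RBF kernel; the form exp(-γ‖x - x'‖²) used below is the common definition and is assumed here. The function names, the toy support vectors and the dual coefficients are all made up for illustration.

```python
import math

def epsilon_insensitive_loss(prediction, target, epsilon):
    """Eq (10): zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(prediction - target) - epsilon)

def rbf_kernel(x, x_prime, gamma):
    """Gaussian RBF kernel, assumed form: exp(-gamma * ||x - x'||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Eq (11): dual_coefs[i] stands for (alpha_hat_i - alpha_i)."""
    weighted = sum(c * rbf_kernel(x, sv, gamma)
                   for c, sv in zip(dual_coefs, support_vectors))
    return weighted + bias

# Toy values only; a real model obtains these by solving the dual problem.
support_vectors = [[0.0], [1.0]]
dual_coefs = [0.5, -0.25]
prediction = svr_predict([0.0], support_vectors, dual_coefs, bias=1.0, gamma=0.1)
print(prediction, epsilon_insensitive_loss(prediction, 1.0, epsilon=0.01))
```

Only samples with a nonzero coefficient (the support vectors) contribute to the sum in Eq (11), which is why the prediction function needs just those points.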

BP neural network

The BP neural network is an extension of the perceptron. Its main characteristics are the forward transmission of the working signal and the backward propagation of the error signal. The internal network structure can be divided into three parts, namely the input layer, the hidden layer and the output layer, as shown in Fig 3. Each layer of the network contains a number of neurons, and adjacent layers are fully connected: the output of one layer is the input of the next, so that the network can map features and learn the inherent regularities of the data.

Suppose that the number of neurons in layer l is n; then the output vector of layer l is a^l∈R^{n×1}. The output of layer l can be expressed by the formula:

a^{l} = σ (z^{l}) = σ (W^{l} a^{l - 1} + b^{l})

(12)

Where: z^l∈R^{n×1} is the input vector of layer l; $W^{l} \in R^{n \times n_{l - 1}}$ is the weight matrix from layer l-1 to layer l, with n_{l-1} the number of neurons in layer l-1; b^l∈R^{n×1} is the bias vector of layer l; σ(⋅) is the activation function. Common activation functions include ReLU, tanh, sigmoid and so on.
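The forward pass of Eq (12) can be sketched with plain lists. This is a toy illustration only: the layer sizes, weights and biases are made up, and the linear output layer is an assumption that is common for regression but not stated in the text.

```python
def relu(z):
    """ReLU activation applied element-wise."""
    return [max(0.0, v) for v in z]

def identity(z):
    """Linear activation, assumed here for the regression output layer."""
    return list(z)

def dense_forward(weights, bias, a_prev, activation):
    """Eq (12): a^l = sigma(W^l a^(l-1) + b^l).

    weights holds one row per neuron of layer l, so it has shape
    (neurons in layer l) x (neurons in layer l-1).
    """
    z = [sum(w * a for w, a in zip(row, a_prev)) + b
         for row, b in zip(weights, bias)]
    return activation(z)

# Toy network: 2 inputs -> 3 hidden neurons (ReLU) -> 1 output (linear).
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
b1 = [0.0, -2.0, 0.5]
W2 = [[1.0, 2.0, -1.0]]
b2 = [0.1]

x = [2.0, 1.0]
hidden = dense_forward(W1, b1, x, relu)
output = dense_forward(W2, b2, hidden, identity)
print(hidden, output)
```

Training by backpropagation would then adjust W and b along the gradient of the loss; that step is omitted from this sketch.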

Evaluation index

The root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) and coefficient of determination (R²) were used to evaluate the performance of the prediction models.

RMSE is calculated as follows:

\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i = 1}^{m} {(y_{i} - {\overset{\land}{y}}_{i})}^{2}}

(13)

MAE is calculated as follows:

\mathrm{MAE} = \frac{1}{m} \sum_{i = 1}^{m} | y_{i} - {\overset{\land}{y}}_{i} |

(14)

MAPE is calculated as follows:

\mathrm{MAPE} = \frac{1}{m} \sum_{i = 1}^{m} | \frac{y_{i} - {\overset{\land}{y}}_{i}}{y_{i}} |

(15)

The maximum value of R² is 1. The closer the value is to 1, the better the fitting degree is. The calculation method is as follows:

R^{2} = 1 - \frac{\sum_{i = 1}^{m} {(y_{i} - {\overset{\land}{y}}_{i})}^{2}}{\sum_{i = 1}^{m} {(y_{i} - \bar{y})}^{2}}

(16)

In the above formulas, y_i and ${\overset{\land}{y}}_{i}$ are the real and predicted values of the i-th sample respectively, m is the number of samples, and $\bar{y} = \frac{1}{m} \sum_{i = 1}^{m} y_{i}$ is the mean of the real values.
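The four metrics of Eqs (13) to (16) can be computed together, as in the following sketch. The function name and the sample values are made up for illustration; MAPE is returned as a fraction, so it would be multiplied by 100 to obtain a percentage.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (as a fraction) and R2 for paired values."""
    m = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    y_bar = sum(y_true) / m
    ss_res = sum(e ** 2 for e in errors)               # residual sum of squares
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)   # total sum of squares
    return {
        "RMSE": math.sqrt(ss_res / m),
        "MAE": sum(abs(e) for e in errors) / m,
        "MAPE": sum(abs(e / yt) for e, yt in zip(errors, y_true)) / m,
        "R2": 1.0 - ss_res / ss_tot,
    }

# Illustrative values only (not data from the paper).
y_true = [1.0, 2.0, 4.0]
y_pred = [1.0, 2.5, 3.5]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAPE is undefined when a real value is zero, which is not an issue for strictly positive quantities such as resistivity.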

Grid parameter optimization

In the realm of machine learning, models typically comprise numerous hyperparameters that require careful tuning to achieve optimal performance. These hyperparameters encompass factors such as learning rate, the number of trees, depth, and regularization strength. The selection of the right combination of hyperparameters is pivotal for attaining high-performance models.

The fundamental principle of grid search involves exploring a predefined grid of hyperparameter value combinations. Initially, a grid of hyperparameter combinations is constructed based on the hyperparameters and their candidate values. It is essential to exercise caution when choosing hyperparameters and candidate values to prevent the search space from becoming excessively large, consuming excessive computational resources. Subsequently, for each hyperparameter combination, models are trained on the training data, and their performance is evaluated using techniques such as cross-validation. Finally, based on the results of performance metrics, the hyperparameter combination exhibiting the best performance is selected. This combination typically exhibits the lowest error or the highest accuracy.

This study necessitates the use of three models for grid search hyperparameter optimization: BPNN, Support Vector Machine, and Ridge Regression. The grid parameter settings for these three models are as follows:

1. BPNN Grid Parameter Settings:
- Hidden layer sizes setting: [(36,), (50,), (100,)]
- Activation function setting: ['relu', 'tanh']
- Optimizer setting: ['adam', 'lbfgs']
- Maximum iteration setting: [100000, 200000]
2. Support Vector Machine Grid Parameter Settings:
- Regularization coefficient C setting: [0.01, 0.1, 1, 10, 50, 100, 150]
- Epsilon tolerance setting: [0.01, 0.1, 1, 10]
- Kernel function coefficient gamma setting: [0.01, 0.1, 1, 10]
3. Ridge Regression Grid Parameter Settings:
- Regularization parameter alpha setting: [0.001, 0.01, 0.1, 1, 10, 100]
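The size of each search space follows from the candidate lists above and can be enumerated as in the sketch below. The dictionary keys are names chosen here for illustration (the paper does not name the library or its parameter names), and `toy_score` is a stand-in for the mean cross-validation error that a real search would compute.

```python
from itertools import product

# Candidate values copied from the lists above; key names are hypothetical.
param_grids = {
    "BPNN": {
        "hidden_layer_sizes": [(36,), (50,), (100,)],
        "activation": ["relu", "tanh"],
        "solver": ["adam", "lbfgs"],
        "max_iter": [100000, 200000],
    },
    "SVR": {
        "C": [0.01, 0.1, 1, 10, 50, 100, 150],
        "epsilon": [0.01, 0.1, 1, 10],
        "gamma": [0.01, 0.1, 1, 10],
    },
    "Ridge": {"alpha": [0.001, 0.01, 0.1, 1, 10, 100]},
}

def expand_grid(grid):
    """List every combination of candidate values as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]

def grid_search(grid, score_fn):
    """Return the combination with the lowest score (e.g. mean CV error)."""
    return min(expand_grid(grid), key=score_fn)

for name, grid in param_grids.items():
    print(name, len(expand_grid(grid)))

# Stand-in score for demonstration; a real search would train and validate.
def toy_score(params):
    return abs(params["alpha"] - 0.1)

best_ridge = grid_search(param_grids["Ridge"], toy_score)
print(best_ridge)
```

With 5-fold cross-validation, each combination is trained five times, so the number of model fits is five times the grid size.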

Feature selection

In traditional regression models, adding more features does not necessarily improve accuracy; redundant or irrelevant features can make the predictions less accurate. Therefore, feature selection is necessary to improve the prediction results and to support the subsequent optimization.

Feature engineering is a crucial step in machine learning [29–31], and different feature choices can significantly impact the prediction outcomes. In the context of titanium alloy electrical performance prediction, when the prediction model utilized elemental features to forecast target electrical performance, each elemental feature became a factor influencing the prediction results. The task of feature selection is to identify highly correlated features with the target performance while filtering out irrelevant features, thereby improving prediction effectiveness and reducing computational complexity. The commonly used feature selection methods mainly include three types: filter-based, embedded, and wrapper-based approaches [32]. In this study, a wrapper-based feature selection method was employed, which combined feature subsets with machine learning models and utilized the predictive performance of the models as the criterion for selecting the feature subsets. Since the feature selection process involves multiple iterations of training the learner, wrapper-based feature selection entails significant computational complexity and time consumption. This paper employed two wrapper-based feature selection algorithms, namely random forest and Xgboost, with feature importance and SHAP (SHapley Additive exPlanations) values used to illustrate the feature selection results.

Random forest feature selection

Random forest adopts random feature selection method to reduce the overfitting risk and increase the generalization ability of the model by randomly selecting different feature subsets during the construction of each decision tree.

The random forest defines an importance measure for a feature X. The steps to calculate the importance of a feature are as follows:

1. The prediction error rate of the random forest on samples outside the bag is called the out-of-bag (OOB) error. For decision tree T_i in the random forest, the number of classification errors E_i on its OOB data is calculated.
2. The values of X are randomly permuted in the OOB data of the decision tree, and the number of classification errors $E_{i}^{X}$ is recalculated.
3. The above two steps are repeated for i = 1,2,…,n, where n is the number of decision trees contained in the random forest.
4. The importance of feature X is defined as:

I_{X} = \frac{1}{n} \sum_{i = 1}^{n} (E_{i}^{X} - E_{i})

(17)

The basis of this definition is that if the OOB error of the model is significantly increased after the addition of noise to a feature, it indicates that the feature has a greater impact on the prediction result, and thus has a higher importance.
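The permutation idea behind Eq (17) can be sketched for a single model. Several simplifications are assumed here: a hand-written function stands in for one fitted tree, mean squared error replaces the classification error count because the targets in this study are continuous, and the rows play the role of that tree's OOB data. All names and values are hypothetical.

```python
import random

def mean_squared_error(model, rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, feature, seed=0):
    """Increase in held-out error after shuffling one feature column."""
    baseline = mean_squared_error(model, rows, targets)
    shuffled_col = [r[feature] for r in rows]
    random.Random(seed).shuffle(shuffled_col)
    permuted = [r[:feature] + [v] + r[feature + 1:]
                for r, v in zip(rows, shuffled_col)]
    return mean_squared_error(model, permuted, targets) - baseline

# Toy stand-in for one fitted tree: it depends on feature 0 only.
def toy_model(row):
    return 3.0 * row[0]

rows = [[float(i), float((7 * i) % 10)] for i in range(10)]
targets = [toy_model(r) for r in rows]

importance_0 = permutation_importance(toy_model, rows, targets, feature=0)
importance_1 = permutation_importance(toy_model, rows, targets, feature=1)
print(importance_0, importance_1)
```

Shuffling the feature that the model ignores leaves the error unchanged, so its importance is zero. For a forest, Eq (17) averages this difference over all n trees, each evaluated on its own OOB samples.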

Xgboost feature selection

Xgboost uses incremental training and fine-tuning of feature split points, so that variable selection and weight parameter adjustment can be optimized in continuous iteration to improve model performance. The core idea of the Xgboost algorithm is to generate trees by constantly splitting features, and for each tree generated, a new function is generated to fit the residual of the last prediction. When we need to predict a sample, each sample will fall on the corresponding leaf node in each tree, and will correspond to a score, and finally add the scores corresponding to each tree to get the predicted value of the sample.

The objective function of Xgboost algorithm is:

\mathrm{Obj}^{(t)} = \sum_{i = 1}^{n} l (y_{i}, {\overset{\land}{y}}_{i}^{(t - 1)} + f_{t} (x_{i})) + Ω (f_{t}) + \mathrm{const}

(18)

In Eq (18), l is the loss function, Ω(f_t) is the regular term, and const is the constant term.

SHAP

SHAP is a widely recognized attribution method in the field of explainable machine learning, derived from the Shapley value in cooperative game theory [33]. The key to interpreting machine learning models using SHAP analysis is to calculate the Shapley value for each feature of every instance. In practical applications, the Shapley value is approximated, and SHAP is an optimization algorithm for estimating Shapley values.

This method possesses three desirable properties of Shapley value theory: local accuracy, missingness, and consistency. Local accuracy refers to the sum of feature attributions being equal to the difference between the expected output and the actual output of the explanatory model, enabling the explanatory model to capture the discrepancy between the desired output and the observed output of a given instance [34]. Missingness implies that the Shapley value for a feature that is absent is considered as 0, indicating no contribution to the prediction. Consistency means that if a change in the model leads to an increase or unchanged marginal contribution of a particular feature, the Shapley value for that feature will not decrease. In addition to these features, the greatest advantage of SHAP lies in its independence from the predictive model [35], allowing for the theoretical interpretation of any machine learning model.
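The local accuracy property can be checked on a small example by computing exact Shapley values, which is feasible only for a handful of features. The sketch below makes two simplifying assumptions: an absent feature is replaced by a single baseline value (practical SHAP estimators average over background data instead), and the model is a made-up function. All names are hypothetical.

```python
from itertools import permutations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = [0.0] * n_features
    for order in permutations(range(n_features)):
        present = set()
        prev = value_fn(frozenset(present))
        for j in order:
            present.add(j)
            curr = value_fn(frozenset(present))
            phi[j] += curr - prev
            prev = curr
    n_orders = factorial(n_features)
    return [p / n_orders for p in phi]

def make_value_fn(model, x, baseline):
    """Model output when only the features in the coalition take their real values."""
    def value(coalition):
        mixed = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
        return model(mixed)
    return value

# Toy model with an interaction between features 0 and 2.
def toy_model(f):
    return 2.0 * f[0] - 1.0 * f[1] + 0.5 * f[0] * f[2]

x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(make_value_fn(toy_model, x, baseline), n_features=3)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

The attributions sum to the difference between the model output for x and the baseline output, and the interaction term is shared equally between the two features involved. The sign of each attribution gives the direction of that feature's effect.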

Results and analysis

The target performance data and feature data for the experiment were obtained by fitting the 21,141 titanium alloy resistivity samples using Eq (3), resulting in 729 sets of TCR (Temperature Coefficient of Resistance) data. The resistivity data at 25°C for the 729 sets of titanium alloys, as well as the composition proportion data containing nine elements, constituted the experimental feature data. The target performance data and feature data together formed the experimental dataset, which is summarized in Table 1. The statistical analysis results for resistivity and TCR are presented in Fig 4(a) and (b), respectively.

Table 1. Training dataset description for predictive modeling: TCR, resistivity values, and concentrations of nine elemental characteristics.

	ρ ₂₅	TCR	Al	Si	Zr	Mo	V	Sn	Nb	Mn	Ti
Count	729	729	729	729	729	729	729	729	729	729	729
Mean	1.16	1284.18	2.56	0.08	3.31	0.86	1.50	1.25	6.63	1.40	82.40
Std	0.24	670.93	2.09	0.00	2.71	0.71	1.23	1.02	5.41	1.60	6.64
Min	0.66	348.81	0.00	0.08	0.00	0.00	0.00	0.00	0.00	1.40	66.28
Max	1.62	3194.46	5.13	0.08	6.63	1.75	3.00	2.50	13.25	1.40	98.53

Fig 4 — (a) presents the statistical analysis results for Resistivity;(b) displays the statistical analysis results for TCR.

In Table 1, "Count" represents the total number of data points for each data type, which is 729 sets in this case. "Mean" denotes the average value of each data type, while "Std" represents the standard deviation. "Min" and "Max" correspond to the minimum and maximum values, respectively. For elemental compositions, they also indicate the compositional range of each element. It is evident that the titanium (Ti) content, as the base element for titanium alloys, ranges from 66.28% to 98.53%, making it the most abundant element among all. In Fig 4(A), the histogram represents the distribution of resistivity values, with the vertical axis indicating the frequency of resistivity values. The purple curve represents the kernel density estimation curve. From the frequency histogram, it is evident that the primary range of resistivity values falls between 0.7 and 1.6. The kernel density estimation curve exhibits three peaks, corresponding to intervals with a higher concentration of data points, namely around 0.9, 1.2, and 1.4. In Fig 4(B), the histogram depicts the distribution of TCR (Temperature Coefficient of Resistance) values, with the vertical axis showing the frequency of TCR values. The purple curve represents the kernel density estimation curve. The frequency histogram reveals that the main range of TCR values lies between 500 and 3000. The kernel density estimation curve indicates a gradual decrease in the number of samples with high TCR values. This reduction can be attributed to the fact that, in existing alloy materials, obtaining high TCR values is relatively more challenging compared to high resistivity values.

After normalizing or standardizing the data in Table 1 through preprocessing operations, subsequent prediction modeling, feature selection, and optimization range analysis were performed using the resistivity and TCR as target performance data.

Results of model prediction

Predicted results of resistivity model

Table 2 presents the resistivity prediction results of five machine learning models; the values of the four evaluation metrics are averages over the K-fold cross-validation. For the BPNN (Backpropagation Neural Network) model and the SVR (Support Vector Regression) models with different kernel functions, the best results after parameter tuning are shown.

Table 2. Resistivity prediction results from 5 different models.

Resistivity Prediction Models	MAE	MAPE	RMSE	R²
BPNN	0.026	2.48%	0.034	0.981
Linear	0.028	2.68%	0.037	0.978
Ridge	0.028	2.68%	0.037	0.978
Rbf-SVR	0.014	1.35%	0.018	0.995
Sigmoid-SVR	0.031	2.98%	0.044	0.969

From the results shown in Table 2, it can be observed that the Rbf (Radial Basis Function) kernel SVR model performs the best among all models, with superior results in all four evaluation metrics and a coefficient of determination R² exceeding 0.99.

The parameters of the Rbf-SVR model for resistivity prediction are as follows: The Rbf kernel is chosen for SVR, with a regularization coefficient C of 150, a kernel coefficient gamma of 0.1, and ε set to 0.01.

In Fig 5(A) to 5(E), the regression performance of five machine learning models on the validation and test datasets is presented. It is evident that the SVR model based on the Rbf kernel demonstrates the best regression performance, with minimal discrepancies between the predicted and actual resistivity values, resulting in an overall excellent predictive performance. In conclusion, the Rbf-SVR model is selected as the predictive model for resistivity, the target property in this study.

Prediction results of TCR model

Table 3 presents the TCR prediction results of five machine learning models; the values of the four evaluation metrics are again averages over the K-fold cross-validation. As before, for the BPNN model and the SVR models with different kernel functions, the best results obtained during parameter tuning are shown.

Table 3. TCR prediction results from 5 different models.

TCR Prediction Models	MAE	MAPE	RMSE	R²
BPNN	27.32	2.29%	42.08	0.996
Linear	118.03	15.91%	154.80	0.951
Ridge	117.96	15.88%	154.78	0.951
Rbf-SVR	55.82	4.61%	93.31	0.981
Sigmoid-SVR	117.73	10.11%	182.99	0.932

From the results shown in Table 3, it can be observed that the BPNN model achieves the best prediction performance among all models, with superior results in all four evaluation metrics and, in particular, a coefficient of determination R² exceeding 0.99. The Rbf kernel SVR model ranks second, but its errors, as indicated by MAE and MAPE, are approximately twice those of the BPNN model. The linear regression, ridge regression, and sigmoid kernel SVR models exhibit similar prediction performance; however, their performance is significantly inferior to that of the BPNN model.

The parameters of the BPNN model for TCR prediction are as follows: the neural network has two hidden layers, each with 36 nodes, and the activation function is ReLU. The weight optimizer is stochastic gradient descent, with a regularization parameter of 0.0001, and the maximum number of iterations is set to 10,000.

In Fig 6(A) to 6(E), the regression performance of five machine learning models on the validation and test datasets is depicted. It is evident that the BPNN model demonstrates the best regression performance, with minimal discrepancies between the predicted and actual TCR values, resulting in an overall excellent predictive performance. In conclusion, the BPNN model is selected as the predictive model for TCR, the target property in this study.

Feature selection

Resistivity feature selection

Ti is the base element, and its content varies with the content of the other elements, so Ti is not involved in the feature selection process. Si and Mn are trace elements with fixed composition, so these two elements are also excluded. The impact of the remaining elements on the resistivity target performance was investigated using two different machine learning models, Random Forest and Xgboost, and the results are shown in Fig 7.

Fig 7(A) and 7(B) present the feature selection results based on feature importance. From Fig 7(A), it can be observed that the top three elements ranked by feature importance using the Random Forest model are Al, Zr, and Nb. On the other hand, Fig 7(B) shows that the top three elements based on feature importance using the Xgboost model are Al, Zr, and Sn. Therefore, the features selected for resistivity are the elements common to the top three of both machine learning methods: Al and Zr.
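The selection rule described above, keeping only the elements that appear in the top three of both rankings, can be sketched as follows. The function name `common_top_features` is illustrative. The two rankings are the top-three lists reported in the text, and no importance values are assumed.

```python
def common_top_features(ranking_a, ranking_b, k=3):
    """Return features in the top k of both rankings, in the order of ranking_a."""
    top_b = set(ranking_b[:k])
    return [feature for feature in ranking_a[:k] if feature in top_b]


# Top-three elements for resistivity as reported in the text.
rf_ranking = ["Al", "Zr", "Nb"]    # Random Forest
xgb_ranking = ["Al", "Zr", "Sn"]   # Xgboost

selected = common_top_features(rf_ranking, xgb_ranking)
print(selected)  # ['Al', 'Zr']
```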

Fig 7(C) and 7(D) depict feature contributions based on SHAP values. Compared with the feature importance method, which directly provides the importance ratio, the SHAP approach also reveals the polarity of the impact. The SHAP visualization results in Fig 7(C) and 7(D) are closely similar. In the SHAP plots, each row represents a feature, with the x-axis indicating the SHAP values. The features are sorted by the mean absolute SHAP value, highlighting the most significant features for the model. Each point represents an individual sample, and wider regions signify a concentration of samples. Warmer colors, such as red, indicate larger feature values, while cooler colors, like blue, correspond to smaller feature values.

It is evident that the content of the Al element is highly significant for the model. Samples with higher Al element content, represented by red points with high SHAP values (greater than 0), exert a positive influence. Conversely, for the blue-shaded portion where feature values are smaller, SHAP values are negative (less than 0), indicating a negative influence. From a horizontal perspective, the distribution of samples for the Al element feature is more scattered, signifying a greater impact of this feature. Additionally, for features like Mo element, most data points are dispersed around SHAP = 0, suggesting that the Mo element has a minimal impact within the normal range.
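The sign behavior described above can be reproduced on a toy model with an exact Shapley computation. This is a simplified sketch, not the procedure used in the paper: it uses a single baseline point instead of a background dataset, and the linear model `toy_resistivity` with its coefficients is hypothetical and not fitted to the paper's data.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x relative to a single baseline point.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


AL, ZR, MO = 0, 1, 2


def toy_resistivity(z):
    # Hypothetical coefficients chosen only to illustrate the sign of SHAP values.
    return 1.0 + 0.5 * z[AL] + 0.2 * z[ZR] + 0.0 * z[MO]


baseline = [2.0, 3.0, 1.0]  # hypothetical reference composition
high_al = shapley_values(toy_resistivity, [4.0, 3.0, 1.0], baseline)
low_al = shapley_values(toy_resistivity, [0.5, 3.0, 1.0], baseline)
```

In this toy setting, the sample with Al above the baseline receives a positive value for Al, the sample with Al below the baseline receives a negative one, and the feature with zero coefficient stays at zero, mirroring the red, blue, and near-zero patterns read off the plots.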

Through the SHAP plots, it becomes apparent that both Al and Zr have a similar impact on resistivity. Specifically, within a certain range, higher concentrations of Al and Zr lead to increased resistivity. This insight provides valuable guidance for subsequent compositional optimization analysis.

TCR feature selection

Similarly, Ti, Si, and Mn are not involved in the feature selection process for TCR. The impact of different elements on the TCR target performance was investigated using the Random Forest and Xgboost models, and the results are shown in Fig 8.

Fig 8(A) and 8(B) present the feature selection results based on feature importance. From Fig 8(A), it can be observed that the top three elements ranked by feature importance using the Random Forest model are Al, Zr, and V. Similarly, Fig 8(B) shows that the top three elements based on feature importance using the Xgboost model are Al, Zr, and V. Therefore, the feature selection for TCR includes the common elements of feature importance from both machine learning methods: Al, Zr, and V.

Fig 8(C) and 8(D) depict the feature contributions based on SHAP values. It can be observed that the impacts of Al, Zr, and V on TCR are opposite to those on resistivity. Within a certain range, increasing the content of Al, Zr, and V leads to a decrease in TCR, which is consistent with the reciprocal relationship between resistivity and TCR during compositional design.

Composition optimization analysis

Based on the feature selection results for resistivity and TCR, the variable elements chosen for compositional optimization are Al and Zr. When adjusting the content of Al and Zr, the content of other elements is set to the average values, as listed in Table 1. The newly generated virtual samples are used as the test set, and the resistivity is predicted using the SVR model with the Rbf kernel and the TCR is predicted using the BPNN model with the optimal parameters. The trends of resistivity and TCR variations are then analyzed in the same coordinate system.

Effect of Al on electrical properties

The average values for Si, Zr, Mo, V, Sn, Nb, and Mn are set to 7.50%, 3.31%, 0.86%, 1.50%, 1.25%, 6.63%, and 1.40%, respectively. According to Table 1, the composition range of Al is set to 0–5% with a step size of 0.1%. The content of Ti changes with the content of Al. In total, 51 virtual samples are generated.
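The construction of the virtual samples can be sketched as below. The fixed values are those quoted in the text. The sketch assumes that compositions are percentages summing to 100, so that Ti is the balance; the text only states that the Ti content changes with the Al content. The function name `make_virtual_samples` is illustrative.

```python
def make_virtual_samples(swept, upper, fixed, step=0.1):
    """Sweep one element from 0 to upper (inclusive) with the others held fixed.

    Assumption: compositions are percentages summing to 100, so Ti is the balance.
    """
    n_steps = round(upper / step)
    fixed_total = sum(fixed.values())
    samples = []
    for i in range(n_steps + 1):
        content = round(i * step, 10)  # avoid floating-point drift in the grid
        sample = dict(fixed)
        sample[swept] = content
        sample["Ti"] = round(100.0 - fixed_total - content, 10)
        samples.append(sample)
    return samples


# Fixed average contents (in %) quoted in the text for the Al sweep.
fixed_for_al = {"Si": 7.50, "Zr": 3.31, "Mo": 0.86, "V": 1.50,
                "Sn": 1.25, "Nb": 6.63, "Mn": 1.40}
al_samples = make_virtual_samples("Al", 5.0, fixed_for_al)
print(len(al_samples))  # 51
```

The same helper covers a sweep of any other element by changing `swept`, `upper`, and the dictionary of fixed contents.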

These 51 virtual samples form the test set for prediction with the Rbf-SVR and BPNN models. The resulting curves of predicted resistivity and TCR are shown in Fig 9. The red curve represents the resistivity variation as the Al content changes from 0% to 5%, while the green curve represents the TCR variation. It can be observed that as resistivity increases, TCR decreases significantly. Therefore, when designing the target electrical performance of titanium alloys, it is important to consider both resistivity and TCR variations. The yellow dashed line in Fig 9 corresponds to an Al content of 3.8%. The intersections of the dashed line with the two curves give the values of the two electrical properties. At this point, the resistivity is good, but the TCR is low. To improve the TCR of the titanium alloy, the dashed line needs to shift to the left, that is, the Al content must decrease. Following this approach, the intersection points of the two curves correspond to satisfactory levels of both resistivity and TCR. In this study, both high resistivity and high TCR are targeted. Therefore, the Al content can be chosen in the vicinity of the intersection points shown in Fig 9, approximately 1.5% to 2%.

Effect of Zr on electrical properties

The average values for Al, Si, Mo, V, Sn, Nb, and Mn are set to 2.56%, 7.50%, 0.86%, 1.50%, 1.25%, 6.63%, and 1.40%, respectively. According to Table 1, the composition range of Zr is set to 0–6.6% with a step size of 0.1%. The content of Ti changes with the content of Zr. In total, 67 virtual samples are generated.

These 67 virtual samples are the test set for prediction using the Rbf-SVR and BPNN models. The resulting curves of resistivity and TCR prediction data are shown in Fig 10. The red curve represents the resistivity variation caused by the change in Zr content from 0% to 6.6%, while the green curve represents the TCR variation. The yellow dashed line in Fig 10 corresponds to a Zr content of 4.8%, where the resistivity performance is good and the TCR is low. Therefore, when designing the Zr content for the titanium alloy, the range can be considered around the intersection points shown in Fig 10, approximately around 2.5% to 3%.
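One way to locate such intersection points numerically is sketched below. Because the two properties have different units, where the curves cross depends on how the two axes are scaled. The sketch assumes min-max scaling of each curve to [0, 1], which is a choice made here for illustration and not a procedure stated in the paper. The example curves are hypothetical and are not model predictions.

```python
def min_max(values):
    """Scale a sequence linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def crossing_points(x, curve_a, curve_b):
    """Return x positions where the min-max-scaled curves cross.

    Crossings between grid points are located by linear interpolation.
    """
    a, b = min_max(curve_a), min_max(curve_b)
    diff = [ai - bi for ai, bi in zip(a, b)]
    crossings = []
    for i in range(len(x) - 1):
        d0, d1 = diff[i], diff[i + 1]
        if d0 == 0.0:
            crossings.append(x[i])
        elif d0 * d1 < 0.0:
            t = d0 / (d0 - d1)  # fraction of the interval where the sign flips
            crossings.append(x[i] + t * (x[i + 1] - x[i]))
    if diff[-1] == 0.0:
        crossings.append(x[-1])
    return crossings


# Hypothetical curves: one property rising, the other falling with content.
content = [0.0, 1.0, 2.0, 3.0, 4.0]
rising = [1.0, 1.1, 1.2, 1.6, 2.0]
falling = [900.0, 700.0, 500.0, 300.0, 100.0]
points = crossing_points(content, rising, falling)
```

If one property matters more than the other, the scaled curves can be weighted before taking the difference, which shifts the crossing toward the favored property.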

If the target performance leans towards either resistivity or TCR, this optimization approach can still be used for design purposes.

Conclusions

The support vector machine model with the radial basis function kernel, Rbf-SVR, achieved the best performance in predicting the resistivity of titanium alloys, with a correlation coefficient of 0.995. The absolute percentage error between the true values and predicted values of the test samples is within 2%. The backpropagation neural network model with two hidden layers, BPNN, performed the best in predicting the TCR of titanium alloys, with a correlation coefficient of 0.996. The absolute percentage error between the true values and predicted values of the test samples is within 3%. Both prediction models exhibited high accuracy and good generalization ability.
Feature selection was performed using two different machine learning models, Random Forest and Xgboost, for resistivity and TCR. The feature importance intersection elements for resistivity are Al and Zr, with a positive effect on resistivity. The feature importance intersection elements for TCR are Al, Zr, and V, with a negative effect on TCR.
For titanium alloy electrothermal materials, the electrical performance of the alloy can be improved by adjusting the content of Al and Zr. However, given the trade-off between resistivity and TCR, to achieve high values of both, it is recommended to set the Al content at around 1.5% to 2% and the Zr content at around 2.5% to 3%.

Supporting information

S1 Data

(XLSX)

pone.0297943.s001.xlsx (61.9 KB, xlsx)

Data Availability

All relevant data are within the paper and its Supporting Information files.

Funding Statement

This work was funded by Shenzhen Zhuolineng Technology Co., Ltd (HKF202200086) and the National Natural Science Foundation of Hunan province (2022JJ30721).

References

1. National Science and Technology Council (US). Materials genome initiative for global competitiveness[M]. Executive Office of the President, National Science and Technology Council, 2011.
2. Davoodi S, Thanh H V, Wood D A, et al. Combined machine-learning and optimization models for predicting carbon dioxide trapping indexes in deep geological formations[J]. Applied Soft Computing, 2023, 143: 110408.
3. Mehrad M, Bajolvand M, Ramezanzadeh A, et al. Developing a new rigorous drilling rate prediction model using a machine learning technique[J]. Journal of Petroleum Science and Engineering, 2020, 192: 107338. doi: 10.1016/j.petrol.2020.107338
4. Davoodi S, Mehrad M, Wood D A, et al. Predicting uniaxial compressive strength from drilling variables aided by hybrid machine learning[J]. International Journal of Rock Mechanics and Mining Sciences, 2023, 170: 105546. doi: 10.1016/j.ijrmms.2023.105546
5. Anemangely M, Ramezanzadeh A, Tokhmechi B, et al. Drilling rate prediction from petrophysical logs and mud logging data using an optimized multilayer perceptron neural network[J]. Journal of Geophysics and Engineering, 2018(4): 15.
6. Davoodi S, Mehrad M, Wood D A, et al. Hybridized machine learning for prompt prediction of rheology and filtration properties of water-based drilling fluids[J]. Engineering Applications of Artificial Intelligence, 2023, 123: 106459.
7. Sabah M, Talebkeikhah M, Wood D A, et al. A machine learning approach to predict drilling rate using petrophysical and mud logging data[J]. Earth Science Informatics, 2019, 12: 319–339.
8. Matinkia M, Hashami R, Mehrad M, et al. Prediction of permeability from well logs using a new hybrid machine learning algorithm[J]. Petroleum, 2023, 9(1): 108–123.
9. Zamanzadeh Talkhouncheh M, Davoodi S, Larki B, et al. A new approach to mechanical brittleness index modeling based on conventional well logs using hybrid algorithms[J]. Earth Science Informatics, 2023: 1–30.
10. Mohamadian N, Ghorbani H, Wood D A, et al. A geomechanical approach to casing collapse prediction in oil and gas wells aided by machine learning[J]. Journal of Petroleum Science and Engineering, 2021, 196: 107811.
11. Kalogirou S A, Bojic M. Artificial neural networks for the prediction of the energy consumption of a passive solar building[J]. Energy, 2000, 25(5): 479–491.
12. Logan W, Chris W. Atomistic calculations and materials informatics: a review[J]. Current Opinion in Solid State and Materials Science, 2016, 21(3): 167–176.
13. Davoodi S, Thanh H V, Wood D A, et al. Machine-learning models to predict hydrogen uptake of porous carbon materials from influential variables[J]. Separation and Purification Technology, 2023, 316: 123807.
14. Bruno A, John R. Property prediction of crystalline solids from composition and crystal structure[J]. Inorganic Materials: Synthesis and Processing, 2016, 62(8): 2605–2613.
15. Wong T T, Yang N Y. Dependency analysis of accuracy estimates in k-fold cross validation[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(11): 2417–2427.
16. Leroy B, Meynard C N, Bellard C, et al. virtualspecies, an R package to generate virtual species distributions[J]. Ecography, 2016, 39(6): 599–607.
17. Rohani A, Taki M, Abdollahpour M. A novel soft computing model (Gaussian process regression with K-fold cross validation) for daily and monthly solar radiation forecasting (Part: I)[J]. Renewable Energy, 2018, 115: 411–422.
18. Liu X, Lu W, Jin S, Li Y, Chen N. Support vector regression applied to materials optimization of sialon ceramics[J]. Chemometrics & Intelligent Laboratory Systems, 2006, 82(1/2): 8–14.
19. Cootes T F, Ionita M C, Lindner C, Sauer P. Robust and accurate shape model fitting using random forest regression voting[C]//European Conference on Computer Vision. Heidelberg: Springer, 2012: 278–291.
20. Xue D, Xue D, Yuan R, Zhou Y, Balachandran P V, Ding X, Sun J, Lookman T. An informatics approach to transformation temperatures of NiTi-based shape memory alloys[J]. Acta Materialia, 2017, 125: 532–541.
21. Zio M D, Guarnera U. Semiparametric predictive mean matching[J]. AStA Advances in Statistical Analysis, 2009, 93(2): 175–186.
22. Guyon I, Elisseeff A. An introduction to variable and feature selection[J]. Journal of Machine Learning Research, 2003, 3(Mar): 1157–1182.
23. Davoodi S, Thanh H V, Wood D A, et al. Machine-learning predictions of solubility and residual trapping indexes of carbon dioxide from global geological storage sites[J]. Expert Systems with Applications, 2023, 222: 119796.
24. Anemangely M, Ramezanzadeh A, Behboud M M. Geomechanical parameter estimation from mechanical specific energy using artificial intelligence[J]. Journal of Petroleum Science and Engineering, 2019, 175: 407–429.
25. Ashrafi S B, Anemangely M, Sabah M, et al. Application of hybrid artificial neural networks for predicting rate of penetration (ROP): A case study from Marun oil field[J]. Journal of Petroleum Science and Engineering, 2019, 175: 604–623.
26. Jin Wen, Li Zhao jia, Wei Luo-si, et al. The Improvements of BP Neural Network Learning Algorithm[C]//WCC 2000-ICSP 2000. 2000 5th International Conference on Signal Processing Proceedings. 16th World Computer Congress. Beijing, China: IEEE, 2002: 1647–1649.
27. Alex Smola J, Bernhard Schölkopf. A tutorial on support vector regression[J]. Statistics and Computing, 2004, 14: 199–222.
28. Debasish B, Srimanta P, Dipak C. Support vector regression[J]. Neural Information Processing—Letters and Reviews, 2007, 11(10): 203–224.
29. Sabah M, Mehrad M, Ashrafi S B, et al. Hybrid machine learning algorithms to enhance lost-circulation prediction and management in the Marun oil field[J]. Journal of Petroleum Science and Engineering, 2021, 198: 108125.
30. Bajolvand M, Ramezanzadeh A, Mehrad M, et al. Optimization of controllable drilling parameters using a novel geomechanics-based workflow[J]. Journal of Petroleum Science and Engineering, 2022, 218: 111004.
31. Assari M, Anemangaly M, Ramezanzadeh A. Shear wave velocity prediction from petrophysical logs using MLP-PSO algorithm[C]//4th International Workshop on Rock Physics. 2017.
32. Isabelle Guyon, Elisseeff André. An introduction to variable and feature selection[J]. Journal of Machine Learning Research, 2003, 3: 1157–1182.
33. Shapley L S. A value for n-person games. Contributions to the Theory of Games[M]. Princeton: Princeton University Press, 1953: 307–317.
34. Moncade-Torres A, Maaren C M, Hendriks P M, et al. Explainable machine learning can outperform cox regression predictions and provide insights in breast cancer survival[J]. Scientific Reports, 2021, 11: 698.
35. Adadi A, Berrada M. Peeking inside the black-box: A survey on EXplainable Artificial Intelligence (XAI)[J]. IEEE Access, 2018, 6: 52138–52160.

PLoS One. doi: 10.1371/journal.pone.0297943.r001

Decision Letter 0

Babatunde Abiodun Salami

10 Oct 2023

PONE-D-23-26687

Development of new materials for electrothermal metals using data driven and machine learning

PLOS ONE

Dear Dr. He,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Reviewer #1

After conducting a thorough review of the manuscript, I would like to provide the following comments:

1. The abstract should include the specific names of the machine learning methods used in this study.

2. The method employed to adjust the control parameters in the neural network is clearly stated and analyzed. It is recommended to provide a similar level of detail for other algorithms utilized in this manuscript.

3. To enhance the machine learning section, I suggest considering the following articles as additional references:

- Drilling rate prediction from petrophysical logs and mud logging data using optimized Multi Layer Perceptron neural network

- Application of hybrid artificial neural networks for predicting rate of penetration (ROP): A case study from Marun oil field

- A machine learning approach to predict drilling rate using petrophysical and mud logging data

- A geomechanical approach to casing collapse prediction in oil and gas wells aided by machine learning

- Developing a new rigorous drilling rate prediction model using a machine learning technique

- Geomechanical parameter estimation from mechanical specific energy using artificial intelligence

- Prediction of permeability from well logs using a new hybrid machine learning algorithm

- Hybrid machine learning algorithms to enhance lost-circulation prediction and management in the Marun oil field

- Optimization of controllable drilling parameters using a novel geomechanics-based workflow

- Shear wave velocity prediction from petrophysical logs using MLP-PSO algorithm

- A new approach to mechanical brittleness index modeling based on conventional well logs using hybrid algorithms

4. It is important to clarify how the separation ratio for training and test data was determined. Please explain the rationale behind choosing this specific ratio.

5. In the captions of tables and figures, it should be explicitly specified for which category of data they are intended.

6. For further analysis, I recommend considering the following articles as valuable references:

- Combined machine-learning and optimization models for predicting carbon dioxide trapping indexes in deep geological formations

- Predicting uniaxial compressive strength from drilling variables aided by hybrid machine learning

- Hybridized machine-learning for prompt prediction of rheology and filtration properties of water-based drilling fluids

- Machine-learning models to predict hydrogen uptake of porous carbon materials from influential variables

- Machine-learning predictions of solubility and residual trapping indexes of carbon dioxide from global geological storage sites

7. Among the various criteria considered, it is essential to mention in the text which specific criterion was used to select the best model.

8. It would be beneficial to provide additional explanations regarding the figures related to SHAP (Shapley Additive exPlanations). Please elaborate on the specific details and insights that can be derived from these figures.

Reviewer #2

Please find the review below:

1) Statistical analysis is not present or is weak; it lacks plots (histograms, heat maps, etc.)

2) The entire work is not novel. Why is ML used for this? Simply adding ML to existing work adds no new science or value.

3) The entire document needs to be rewritten; the introduction lacks structure (research question, the knowledge gap, and how this method is the only one to fill that gap).

4) TCR normalized per unit area plotted against length is needed.

5) Not sure why Al and Zr were chosen. Did you measure composition changes of the device using XPS/EDX for materials analysis? Proof is needed for that.

6) Very weak electrothermal testing analysis; more results are required (e.g., R vs. temperature plots, thermal conductivity measurements, etc.)

7) Figures are not presented in a proper way, and the manuscript looks like a master's thesis or lab manual.

Please submit your revised manuscript by Nov 24 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Babatunde Abiodun Salami

Academic Editor

PLOS ONE

Journal Requirements:

1. When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

This work was Funded by Shenzhen Zhuolineng Technology Co., Ltd; National Natural Science Foundation of Hunan province (2022JJ30721).

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

The author(s) received no specific funding for this work.

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

4. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

5. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: After conducting a thorough review of the manuscript, I would like to provide the following comments: