Abstract
The presence of missing data is a significant data quality issue that degrades the accuracy and reliability of data analysis. The issue is especially relevant to accelerated tests, particularly step-stress accelerated degradation tests. While missing data can arise from objective factors or human error, a pattern with a high missing rate inevitably appears when accelerated test data are converted to a common stress level; this pattern manifests as a degradation dataset with unequal measuring intervals. Developing an imputation method suited to accelerated test data is therefore essential. In this study, we propose a novel hybrid imputation method that combines the LSSVM and RBF models to address missing data problems. The proposed model is compared with several traditional and machine learning imputation methods on simulated data to demonstrate its advantages over existing methods. Finally, the proposed model is applied to real degradation datasets of the super-luminescent diode (SLD) to validate its performance and effectiveness in handling missing data in step-stress accelerated degradation tests. Owing to its generalizability, the proposed method is also expected to be applicable in other scenarios with high missing data rates.
Keywords: Missing data imputation, Accelerated degradation test, Support vector machine, Radial basis function
1. Introduction
Accelerated degradation testing is a highly desirable measure for shortening new product introduction time under the pressures facing industry today, as it yields richer information for reliability assessment within limited time and cost by exposing products to harsher-than-normal stress [1]. In an accelerated test, product samples are usually subjected to successively higher stress levels in predetermined stages, thus following a time-varying stress profile [2]. The test is terminated when a certain number of failures or degradation observations have been recorded; the data collected at higher stresses must then be transformed into equivalent time-to-failure data at normal stress for lifetime distribution fitting and further reliability estimation [[3], [4], [5]]. However, missing data occurring anywhere in the test can adversely affect the performance of data analysis methods and bias the evaluation results, posing a significant challenge in accelerated test research.
Missing degradation data in accelerated tests primarily arise from objective and human factors, including sensor failures and errors made by testers. Gaps can occur at the beginning, middle, or end of a time-ordered dataset, producing the missing data patterns M1, M2 and M3 shown in Fig. 1. Additionally, the intervals of data transformed from higher stress levels may differ from those of the data observed at normal stress, producing regular yet unequal gaps throughout the degradation dataset, referred to as missing data pattern M4 in Fig. 1. Studies involving data processing for accelerated testing, evaluation of product degradation, and remaining lifetime prediction all require complete datasets as input. The presence of missing data considerably complicates the processing of performance degradation data, rendering many traditional methods incapable of statistical analysis on incomplete datasets. For instance, time-series-based remaining lifetime prediction methods usually require the dataset to be both complete and equidistant.
Fig. 1.
Four prototypical missing data patterns of degradation data. The shaded areas represent the location of the missing values in the dataset. M1, M2, and M3 represent missing data occurring in the front, middle and back parts of the dataset respectively, M4 represents missing data caused by transforming.
To address the issue of missing data, there are two potential approaches: enhancing traditional data processing methods so that they can handle degradation data with missing values, or transforming the data with missing values into complete datasets. The former poses implementation challenges, whereas the latter is regarded as more feasible and realistic. However, the practicability of imputing missing data often depends on the underlying missing mechanism, which is categorized as follows: MAR (Missing At Random), MCAR (Missing Completely At Random), and MNAR (Missing Not At Random) [6]. Rubin designed a probabilistic framework for missing data in which the complete data consist of two components, the observed data and the missing data (Yobs and Ymis, respectively), and a binary variable M denotes whether a value of a particular variable is observed or missing (i.e., M = 1 if a value is observed, and M = 0 if it is missing). A more precise description of each missing data category is listed in Table 1 [7]. Generally, if the missing mechanism is MNAR, meaning that the missingness is influenced by both the observed and the missing data, it is not feasible to interpolate the missing data. Fortunately, performance degradation data typically follow a continuous degrading process, which justifies the assumption that the missing degradation data depend solely on the observed data rather than on the missing portion. Consequently, imputation methods are a promising way to handle missing degradation data.
Table 1.
Description and mathematical expression of missing data types.
| Category | Precise Description | Mathematical Expression |
|---|---|---|
| MAR | Data are missing at random when the probability of missing data on a variable Y is related to some other measured variables in the analysis model but not to the values of Y itself. | Pr(M ∣ Yobs, Ymis, φ) = Pr(M ∣ Yobs, φ) |
| MCAR | Data are missing completely at random when the probability of missing data on a variable Y is unrelated to other measured variables and is unrelated to the values of Y itself. | Pr(M ∣ Yobs, Ymis, φ) = Pr(M ∣ φ) |
| MNAR | Data are missing not at random (MNAR) when the probability of missing data on a variable Y is related to the values of Y itself, even after controlling for other variables. | Pr(M ∣ Yobs, Ymis, φ) depends on Ymis and cannot be simplified |

φ is a parameter that describes the relationship between M and the data.
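As a concrete illustration of Rubin's categories, the following sketch (hypothetical variable names, illustrative probabilities) generates MCAR and MAR missingness masks: under MCAR the probability of missingness is a constant, while under MAR it depends only on a fully observed covariate, never on the masked variable itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)             # fully observed covariate
y = 2.0 * x + rng.normal(size=n)   # variable that will contain missing values

# MCAR: Pr(M) is constant, unrelated to x or y.
m_mcar = rng.random(n) < 0.2

# MAR: Pr(M) depends only on the observed covariate x, not on y itself.
p_mar = 1.0 / (1.0 + np.exp(-x))   # logistic link on x
m_mar = rng.random(n) < 0.3 * p_mar

y_mcar = np.where(m_mcar, np.nan, y)
y_mar = np.where(m_mar, np.nan, y)
print(np.isnan(y_mcar).mean(), np.isnan(y_mar).mean())
```

An MNAR mask would instead condition on y itself (e.g., masking large values of y), which is exactly the case that imputation cannot recover from the observed data alone.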
Missing data imputation can be performed by several models, such as the mean model [8], regression model [9,10], cluster model [11,12], hot-deck model [13] and nearest neighbor model [14,15]. According to the number of imputations, imputation methods can be classified into single and multiple imputation. Single imputation methods fill each missing value with either the average of its predicted distribution or a value drawn at random from that distribution, and various statistical analyses can then be performed on the completed dataset. However, few single imputation methods take uncertainty into account, and they share the drawback of distorting the sample data distribution. Multiple imputation entails generating several alternative values for each missing value, resulting in several complete datasets. These complete datasets are then analyzed with the same approach to yield multiple evaluations, whose results are finally combined into one estimate of the target variable. Multiple imputation methods handle the uncertainty associated with missing data better and preserve the distribution of the data samples; however, they are computationally intensive and often yield less accurate individual values.
During the last decades, machine learning techniques have been extensively employed for anomaly detection, assessment prediction, and missing data imputation, leveraging their excellent capability to extract valuable information from data; examples include the Neural Network (NN) model, Support Vector Machine (SVM) model, and Naive Bayes (NB) model [16]. Although NN models, such as the radial basis function (RBF) network, exploit their strong non-linear approximation capability to overcome insufficient data supply [[17], [18], [19]], they suffer from drawbacks such as high sample requirements, slow convergence, susceptibility to local minima, and limited generalization ability. These shortcomings can be mitigated by other machine learning techniques, such as SVM, which uses a kernel to map the data from the input space into a higher-dimensional feature space, making the problem linearly separable. As a result, SVM exhibits improved generalization capability and remains applicable to small sample sizes, in contrast to conventional neural network models [20]. Moreover, the least squares support vector machine (LSSVM) incorporates a sum-of-squared-errors term into the objective function of the optimization problem, which decreases the computation time of the convex optimization and improves performance, leading to high precision and fast convergence compared with other SVM models [21].
In practical applications, the prediction accuracy of LSSVM is affected by the kernel function and its parameters [22]. At present, there is no definitive theory or method for determining the kernel function and its parameters. Moreover, researchers have observed that the output residuals of SVM-like models contain valuable information that can be further learned. Based on this observation, hybrid models that retrain the residual terms with other machine learning techniques are a promising way to refine the predictions of the LSSVM [23].
Based on the above description, this paper proposes a hybrid intelligent model to impute missing data values in step-stress accelerated degradation tests. The proposed model combines the LSSVM with an RBF neural network. To the best of our knowledge, this is the first work that investigates the impact of missing data on degradation evaluation and explores the effectiveness of imputation techniques in the field of accelerated testing. The main innovations and research ideas of this paper are as follows.
•
The RBF and LSSVM methods are introduced respectively to train a model on the observed data and impute the missing data through prediction. A comparative analysis determines the applicability of the two models, on the basis of which the framework of the hybrid imputation model is proposed.

•
By integrating the LSSVM and RBF models, the proposed hybrid model achieves a more comprehensive and accurate imputation of the missing data. The LSSVM models the degradation trend, ensuring that the imputed data follow a trend consistent with the observed data, while the RBF estimates the residual series of the missing data, thereby maintaining a data dispersion similar to that of the observed data.

•
Imputation strategies and an improved process for missing data with unequal measuring intervals are specifically designed for the possible missing data patterns in accelerated degradation tests. The performance of the proposed model is validated by comparison with several traditional methods.
The paper is organized as follows. Section 2 illustrates the theoretical foundations and the detailed procedure of the proposed imputation method. Section 3 analyzes the influence of the missing data rate and measuring interval, comparing several traditional techniques with the proposed method. Section 4 presents an experimental case study. Finally, conclusions and future work directions are discussed in Section 5.
2. Methodology
2.1. Machine learning techniques
In recent years, machine learning techniques have been widely used for predictive analytics. These techniques also possess significant potential for effectively handling incomplete data.
2.1.1. Radial basis function (RBF) neural network
The radial basis function (RBF) neural network model possesses a high ability for function approximation, particularly in dealing with nonlinear data problems [24]. The model does not require a large number of samples and has a strong generalization ability for each input and output sample. The “basis” of the hidden layer neurons in the RBF neural network model is the radial basis function, which denotes a scalar function with radial symmetry. Its mathematical expression is R(||x-c||), which is a monotonic function of the Euclidean distance between any point in space (x) and the center (c), as expressed in Eq. (1) and Eq. (2).
R(||x - c||) = exp(-||x - c||^2 / (2σ^2))  (1)

σ = dmax / √(2h)  (2)
where c is the kernel function center, σ is the width parameter of the function, dmax is the maximum Euclidean distance between all center vectors, and h is the number of hidden layer neurons.
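A minimal sketch of the Gaussian radial basis of Eq. (1) and the width heuristic of Eq. (2); the center values are illustrative.

```python
import numpy as np

def rbf_activation(x, c, sigma):
    """Gaussian radial basis: R(||x - c||) = exp(-||x - c||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - c) ** 2 / (2.0 * sigma ** 2))

def width_from_centers(centers):
    """Eq. (2) heuristic: sigma = d_max / sqrt(2 h), where d_max is the
    largest pairwise distance among the h hidden-layer centers."""
    h = len(centers)
    d_max = max(np.linalg.norm(a - b) for a in centers for b in centers)
    return d_max / np.sqrt(2.0 * h)

centers = [np.array([0.0]), np.array([1.0]), np.array([3.0])]
sigma = width_from_centers(centers)  # d_max = 3, h = 3 -> sigma = 3/sqrt(6)
print(sigma, rbf_activation(np.array([0.0]), centers[0], sigma))
```

By construction the activation is 1 at the center and decays monotonically with the Euclidean distance ||x - c||, which is exactly the radial symmetry described above.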
2.1.2. Least squares support vector machine (LSSVM)
Least Squares Support Vector Machine (LSSVM) is an improved machine learning algorithm based on SVM, proposed by Suykens [25]. It offers fast learning speed and good generalization ability, while avoiding both the overfitting of other neural network models and the long training time of SVM. The regression function of the LSSVM can be represented as Eq. (3) [26]:
f(x) = ω^T φ(x) + b  (3)
where ω is the weight vector, b is the bias term, and φ(x) is the nonlinear mapping function. The optimization problem can be expressed as:
min J(ω, e) = (1/2) ω^T ω + (1/2) γ Σ_{i=1}^{n} e_i^2  (4)
with the constraints as Eq. (5),
y_i = ω^T φ(x_i) + b + e_i,  i = 1, …, n  (5)
where ω^Tω controls the complexity of the decision function, γ is the regularization parameter controlling the penalty on samples that exceed the error tolerance, and e is the error vector.
By introducing the Lagrange function, Eq. (4) can be transformed into Eq. (6):
f(x) = Σ_{i=1}^{n} δ_i K(x, x_i) + b  (6)
where δi is the Lagrangian multiplier and K(x, xi) is a symmetric kernel function satisfying the Mercer condition. If the RBF is chosen as the kernel function of the LSSVM, then K(x, xi) = R(||x - xi||).
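In the standard LSSVM formulation, training reduces to solving one linear system in the dual variables (b, δ). The following NumPy sketch implements that system under the formulation above; the test function, γ, and σ values are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x1, x2, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-(x1_i - x2_j)^2 / (2 sigma^2))."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(x, y, gamma, sigma):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; delta] = [0; y] for (b, delta)."""
    n = len(x)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(x, x, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]  # bias b, Lagrangian multipliers delta

def lssvm_predict(x_new, x_train, b, delta, sigma):
    """Eq. (6): f(x) = sum_i delta_i K(x, x_i) + b."""
    return rbf_kernel(x_new, x_train, sigma) @ delta + b

x = np.linspace(0.0, 10.0, 50)
y = np.sin(x)                       # illustrative smooth target
b, delta = lssvm_fit(x, y, gamma=1000.0, sigma=1.0)
y_hat = lssvm_predict(x, x, b, delta, sigma=1.0)
max_err = float(np.max(np.abs(y_hat - y)))
print(max_err)
```

With a small ridge term I/γ the system is well conditioned, and the fit at the training points is close to interpolation for a smooth target.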
2.1.3. Impact of missing degradation data on degradation predictions
Handling missing data properly is crucial to ensuring the robustness and accuracy of degradation behavior prediction. Therefore, prior to constructing the imputation model, a comparative analysis of the RBF and LSSVM models is conducted to assess the impact of missing data.
Observed data may be missing at any time during the whole accelerated test due to monitoring sensor failures or recording errors. The influence of missing data at the front, middle, and back of the dataset is compared and analyzed using four groups of simulated performance degradation data.
2.1.3.1. Supposing that the degradation model is an approximately linear, monotonically decreasing model as Eq. (7)
| (7) |
Random sampling is conducted to collect the complete degradation dataset Y0, whose size is 120. The missing datasets Y1, Y2, Y3 consist of the 90 data points remaining after removing 30 values from the front, middle, and back of Y0, respectively. The RBF and LSSVM methods are then used to predict the next 40 values, as shown in Fig. 2 (a)-(d).
Fig. 2.
Prediction results of RBF and LSSVM models for complete dataset and incomplete datasets with missing data in the front, middle and back of the dataset.
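The dataset construction just described can be sketched as follows; the linear model and noise level are stand-ins for Eq. (7), whose exact parameters are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(120)
# Hypothetical stand-in for Eq. (7): an approximately linear, monotonically
# decreasing path with additive noise (slope and noise scale are assumptions).
y0 = 100.0 - 0.1 * t + rng.normal(scale=0.5, size=120)

# Build Y1, Y2, Y3 by deleting 30 consecutive points from the front,
# middle and back of Y0, leaving 90 observations each.
Y1 = np.delete(y0, np.s_[0:30])
Y2 = np.delete(y0, np.s_[45:75])
Y3 = np.delete(y0, np.s_[90:120])
print(len(Y1), len(Y2), len(Y3))  # 90 90 90
```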
Mean Absolute Percentage Error (MAPE) measures the relative magnitude of deviations between predicted and true values. MAPE is often preferred over other error metrics, such as Mean Squared Error (MSE) or Root Mean Squared Error (RMSE), because it is less sensitive to the influence of extreme values. MAPE is computed as Eq. (8):
MAPE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i| / |y_i| × 100%  (8)
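Eq. (8) in code, with illustrative values:

```python
import numpy as np

def mape(y_true, y_pred):
    """Eq. (8): mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    return 100.0 * float(np.mean(np.abs((y_true - y_pred) / y_true)))

print(mape([100, 200, 400], [110, 190, 400]))  # (10% + 5% + 0%) / 3 = 5.0
```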
In practice, a smaller MAPE value indicates a higher level of accuracy in the prediction model. The MAPE values of the predictions generated by the RBF and LSSVM models are calculated for each dataset. The MAPE of the complete dataset (Y0) serves as the benchmark for assessing the decline in precision caused by the missing data situations in Y1, Y2, Y3. These MAPE values, along with the decline-in-precision analysis, are presented in Table 2, providing a comprehensive overview of the impact of the missing data scenarios on prediction accuracy.
Table 2.
MAPE of prediction result of RBF and LSSVM model.
| Dataset | RBF MAPE (%) | RBF Decline in Precision | LSSVM MAPE (%) | LSSVM Decline in Precision |
|---|---|---|---|---|
| Y0 | 0.75 | – | 0.60 | – |
| Y1 | 0.98 | 30.7% | 0.67 | 11.7% |
| Y2 | / | / | 0.61 | 1.67% |
| Y3 | 1.26 | 68.0% | 1.60 | 166.7% |
As the results show, missing data always reduce the precision of evaluation or prediction models. Specifically, when the missing data are located at the beginning of the dataset, the MAPE of the RBF-based time series prediction increased by 30.7% relative to the complete-data case, while the MAPE of the LSSVM prediction increased by only 11.7%. When the missing data are situated in the middle of the dataset, the impact on LSSVM prediction accuracy is relatively small. However, missing data in the middle make RBF-based time series prediction infeasible, because the missing data pattern in this scenario violates the equally spaced input requirement of this kind of prediction model. Missing data at the end of the dataset have the most significant impact on prediction accuracy: in this situation, the MAPE of the RBF and LSSVM predictions increased by 68.0% and 166.7%, respectively.
The results demonstrate that the LSSVM model is more successful in preserving the declining trend of the original data. However, it struggles to accurately reflect the uncertainty of the observed data, which may result in an underestimation of data uncertainty. On the other hand, while the RBF model requires strict adherence to equal interval inputs, it effectively maintains the discrete characteristics of the original data. It is worth noting that a relatively smaller SPREAD parameter in the RBF model can lead to improved data dispersion.
2.2. Hybrid imputation model based on LSSVM and RBF
According to the above analysis, this paper merges the LSSVM and RBF methods into an improved model: the LSSVM algorithm models the trend of the degradation data, while the random residual error is handled by the RBF algorithm, so that the imputation results reflect the uncertainty associated with the missing data. The framework of the proposed model is shown in Fig. 3.
Fig. 3.
The framework of hybrid imputation method based on LSSVM-RBF model and an illustrative example for each modeling step.
The detailed modeling procedure is as follows.
(1)
Collect the testing data to form the observed degradation dataset {Tobs, Yobs}, in which Tobs = {tobs_1, tobs_2, …, tobs_n} denotes the test timing points and Yobs = {yobs_1, yobs_2, …, yobs_n} denotes the observed degradation values.

(2)
Train the LSSVM model with Tobs as input and Yobs as output to derive the trend model f(t) of the degradation data as Eq. (9):

f(t) = Σ_{i=1}^{n} δ_i K(t, tobs_i) + b  (9)

(3)
Calculate the trend term series of the missing data (Qmis) by substituting the timing points of the missing data (Tmis) into Eq. (9), where Qmis = {qmis_1, qmis_2, …, qmis_m} and Tmis = {tmis_1, tmis_2, …, tmis_m}.

(4)
Obtain the trend term series of the observed data Qobs = {qobs_1, qobs_2, …, qobs_n} by substituting Tobs into Eq. (9); the residual term series of the observed data Eobs = {eobs_1, eobs_2, …, eobs_n} can then be calculated as Eq. (10):

eobs_i = yobs_i − qobs_i,  i = 1, …, n  (10)

(5)
Train the RBF model with Tobs as input and Eobs as output, then feed Tmis into the trained RBF model to predict the residual terms of the missing data, obtaining the residual term series Emis = {emis_1, emis_2, …, emis_m}.

(6)
The estimated trend and residual terms are continuously added as new data into the training set {Tobs, Eobs} to update the RBF model, in order to enhance the prediction precision.

(7)
Obtain the missing data (Ymis) by merging the trend and residual terms, ymis_j = qmis_j + emis_j. Finally, the complete degradation dataset is formed by combining the observed data series and the imputed missing data series as Eq. (11):

Y = {Tobs ∪ Tmis, Yobs ∪ Ymis}  (11)
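Steps (1)-(5) and (7) above can be sketched end to end as follows. This is a minimal NumPy sketch: the kernel-ridge stand-in for the RBF network, the toy degradation path, and all parameter values are illustrative assumptions, and the iterative update of step (6) is omitted for brevity.

```python
import numpy as np

def rbf_kernel(a, b, sigma):
    # Gaussian kernel K(t, t') = exp(-(t - t')^2 / (2 sigma^2))
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

def lssvm_fit(t, y, gamma, sigma):
    # LSSVM dual system [[0, 1^T], [1, K + I/gamma]] [b; delta] = [0; y]
    n = len(t)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(t, t, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]  # bias b and multipliers delta

def lssvm_predict(t_new, t, b, delta, sigma):
    return rbf_kernel(t_new, t, sigma) @ delta + b

def rbf_net_predict(t_new, t, e, sigma=1.0, lam=0.1):
    # Stand-in for the RBF network: kernel ridge regression on the residuals.
    w = np.linalg.solve(rbf_kernel(t, t, sigma) + lam * np.eye(len(t)), e)
    return rbf_kernel(t_new, t, sigma) @ w

def hybrid_impute(t_obs, y_obs, t_mis, gamma=100.0, sigma=20.0):
    b, delta = lssvm_fit(t_obs, y_obs, gamma, sigma)      # step (2): trend f(t)
    q_mis = lssvm_predict(t_mis, t_obs, b, delta, sigma)  # step (3)
    q_obs = lssvm_predict(t_obs, t_obs, b, delta, sigma)  # step (4)
    e_obs = y_obs - q_obs                                 # Eq. (10)
    e_mis = rbf_net_predict(t_mis, t_obs, e_obs)          # step (5)
    return q_mis + e_mis                                  # step (7): merge

rng = np.random.default_rng(1)
t = np.arange(100.0)
y = 50.0 - 0.2 * t + rng.normal(scale=0.3, size=100)  # toy degradation path
mis = np.arange(40, 60)                               # missing middle block
obs = np.setdiff1d(np.arange(100), mis)
y_hat = hybrid_impute(t[obs], y[obs], t[mis])
err = float(np.sqrt(np.mean((y_hat - y[mis]) ** 2)))
print(err)
```

The trend model carries the imputation inside the gap, while the residual model adds back a dispersion term learned from the observed residual series, matching the division of labor described above.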
3. Imputation strategies for missing data patterns
3.1. Strategies for missing data with different missing rate
The missing data rate is one of the most vital factors for evaluation as well as imputation. Datasets with high missing rates may cause instability in the imputation process, resulting in large errors in the final prediction and evaluation. In this section, the influence of the missing data rate at different levels is analyzed by several methods with simulated data.
Supposing that the degradation model as Eq. (12),
| (12) |
Random sampling is conducted to collect the complete degradation dataset Y0, whose size is 300. The missing datasets Yi, i = 1, …, 6, are obtained by removing contiguous data from the middle part of the dataset to achieve missing rates of 10%, 20%, 30%, 40%, 50% and 60%. The degradation datasets at the different missing rates are shown in Fig. 4 (a)-(f).
Fig. 4.
Degradation dataset at different missing rate.
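The construction of the six missing datasets can be sketched as follows; the linear path stands in for the simulated model of Eq. (12), whose parameters are not reproduced here.

```python
import numpy as np

def remove_middle(y, rate):
    """Remove a contiguous block from the middle of y to hit a missing rate."""
    n = len(y)
    k = int(round(n * rate))
    start = (n - k) // 2
    mask = np.ones(n, dtype=bool)
    mask[start:start + k] = False
    return np.where(mask, y, np.nan), mask

y0 = np.linspace(100.0, 70.0, 300)  # illustrative stand-in for Eq. (12)
for rate in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
    y_mis, mask = remove_middle(y0, rate)
    print(rate, float(np.isnan(y_mis).mean()))
```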
Imputation methods based on the mean model, regression model, EM algorithm, regression-RBF and LSSVM-RBF models are applied, and their imputation effects are evaluated using the root mean square error (RMSE) and the relative variance error (ησ²), as defined in Eq. (13) and Eq. (14) respectively.
RMSE = √( (1/|X|) Σ_{x∈X} (h(x) − y(x))² )  (13)

ησ² = |σimp² − σorig²| / σorig²  (14)
where X denotes the set of timepoints of the degradation data, h(x) denotes the imputed value given by one of the aforementioned models, y(x) denotes the original value, and σorig² and σimp² denote the variances of the original and imputed data, respectively. The RMSE quantifies the average magnitude of the discrepancies between imputed and actual values within a dataset, providing a valuable assessment of the alignment between a model's predictions and the observed data points. Additionally, ησ² serves as the evaluation metric for gauging how well the interpolated data represent the variability present in the original dataset. The evaluation results are shown in Fig. 5.
Fig. 5.
RMSE and ησ² evaluation of mean model, regression model, EM algorithm, regression-RBF and LSSVM-RBF models at different missing rate levels.
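The two metrics of Eq. (13) and Eq. (14) in code; the exact symbol and normalization of the relative variance error are an assumption, written here as |var(imputed) − var(original)| / var(original).

```python
import numpy as np

def rmse(y_true, y_pred):
    """Eq. (13): root mean square error between imputed and true values."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def relative_variance_error(y_orig, y_imp):
    """Eq. (14), assumed form: |var(imputed) - var(original)| / var(original)."""
    v0 = float(np.var(np.asarray(y_orig, float)))
    v1 = float(np.var(np.asarray(y_imp, float)))
    return abs(v1 - v0) / v0

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))         # sqrt(4/3) ≈ 1.155
print(relative_variance_error([1, 2, 3], [1, 2, 3]))  # 0.0
```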
From Fig. 5. (a), it is evident that the RMSE of each imputation method tends to increase as the missing rate gradually rises. Specifically, when the missing rate is below 20%, the regression-RBF model exhibits the smallest RMSE, followed by the EM algorithm and the LSSVM-RBF model. However, at missing rates of 30% and 40%, the EM algorithm achieves the smallest RMSE, followed by LSSVM-RBF and the regression-RBF. As the missing rate exceeds 50%, the RMSE of the EM algorithm experiences a notable increase. At the missing rate of 60%, the RMSE of the EM algorithm surpasses that of the regression model, which becomes the third smallest, closely resembling that of the regression-RBF as well as LSSVM-RBF. Notably, the mean imputation method consistently exhibits the largest RMSE.
The comparison of ησ² in Fig. 5 (b) reveals that, among the five methods, regression imputation demonstrates the closest resemblance to the original data variance, exhibiting remarkable stability across all missing rates. Following closely are the imputation methods based on regression-RBF and LSSVM-RBF, both displaying comparable proximity to the original data variance and a high level of stability. When the missing rate is below 40%, the interpolated data variance of the EM algorithm performs nearly as well as that of the regression method. However, as the missing rate surpasses 50%, the variance of the interpolated data from the EM algorithm deviates significantly. Notably, the mean imputation method shows the largest relative variance error.
Based on the analysis above, the effectiveness of the five models in handling different missing rates is summarized in Table 3.
Table 3.
Effects of 5 models in dealing with missing data at different rates.
| Evaluation metric | Missing rate | Mean Model | Regression Model | EM Model | Regression-RBF Model | LSSVM-RBF Model |
|---|---|---|---|---|---|---|
| RMSE | 10%–40% | Medium | Medium | Good | Good | Good |
| | 50% | Bad | Bad | Medium | Medium | Good |
| | 60% | Bad | Medium | Bad | Good | Good |
| ησ² | 10%–40% | Medium | Good | Good | Medium | Good |
| | 50% | Bad | Good | Bad | Medium | Medium |
| | 60% | Medium | Good | Bad | Good | Medium |
3.2. Strategies for missing data with unequal measuring intervals
Equally spaced data are a fundamental requirement for time series analysis. Consequently, imputing data with unequal measuring intervals into equally spaced data is a prerequisite for regression and prediction tasks. In a step-stress accelerated test, the degradation data inevitably become unequally spaced in time when folded to the same stress level, which may complicate subsequent data analysis. Essentially, this situation can be regarded as a special missing data problem, which means an imputation method can be applied to improve the dataset quality.
3.2.1. Determination of the imputation point
The primary goal of imputation for unequal measuring interval data is to keep the amount of interpolated data as small as possible, so that the proportion of original observed data remains large enough to preserve the statistical characteristics.
Suppose two datasets whose measuring intervals are d1 and d2 respectively, with d1 > d2; there are two possible cases.
•
Case 1: d1 is an integer multiple of d2, i.e., d1 = k·d2 with integer k ≥ 2.

In this case, only the first dataset needs interpolation: k − 1 values are inserted between every two adjacent data points, so that its measuring interval becomes d2, as Fig. 6 (a) shows. Since k ≥ 2, the first dataset should be treated as a missing dataset with a missing rate of 50% or more when the imputation is conducted.
•
Case 2: d1 is not an integer multiple of d2.
Fig. 6.
Diagram of determination progress of the imputation point.
In this case, the greatest common divisor m of d1 and d2 is calculated first; imputation is then performed on both datasets. d1/m − 1 and d2/m − 1 values are interpolated between every two adjacent data points of the first and second dataset respectively, unifying the measuring interval of the two datasets as m, as Fig. 6 (b) shows. Likewise, the missing rate of both datasets in this case is above 50%.
If more than two datasets are available, the imputation points can be determined by extending the two cases above.
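The determination rule above reduces to a greatest-common-divisor computation over the measuring intervals; a minimal sketch (function name and horizon are illustrative):

```python
import math
from functools import reduce

def imputation_timepoints(intervals, horizon):
    """Unify datasets measured at the given intervals onto the grid whose
    step is the gcd m of all intervals; for each dataset, return the
    timepoints that must be interpolated."""
    m = reduce(math.gcd, intervals)
    unified = set(range(0, horizon + 1, m))
    plan = {d: sorted(unified - set(range(0, horizon + 1, d)))
            for d in intervals}
    return m, plan

# Case 1: d1 = 4 is a multiple of d2 = 2 -> only the coarser set needs points.
m, plan = imputation_timepoints([4, 2], horizon=12)
print(m, len(plan[4]), len(plan[2]))  # grid step 2; 3 points for d1, 0 for d2

# Case 2: d1 = 3 is not a multiple of d2 = 2 -> both sets need points (m = 1).
m, plan = imputation_timepoints([3, 2], horizon=12)
print(m, len(plan[3]), len(plan[2]))
```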
3.2.2. Datasets with unequal measuring intervals
Random sampling is conducted according to Eq. (12) to collect the complete degradation dataset with 3 replications, each of size 300, namely Y1, Y2, Y3. Then, by removing the corresponding data, these three datasets are transformed into 'incomplete' datasets with measuring intervals of 2, 3 and 4 respectively, as shown in Fig. 7 (a)-(f).
Fig. 7.
The original datasets Y1, Y2, Y3 and missing datasets with measuring interval = 2,3,4.
Since the greatest common divisor of the three measuring intervals (2, 3, 4) is 1, imputation is required to convert them into datasets with a measuring interval of 1. The imputation information of the three datasets is given in Table 4.
Table 4.
Imputation data information.
| Dataset | Original Interval (hours) | Imputation Interval (hours) | Original Data | Imputation Data | Total | Percentage of Imputation Data |
|---|---|---|---|---|---|---|
| Y1 (incomplete) | 2 | 1 | 150 | 149 | 299 | 49.8% |
| Y2 (incomplete) | 3 | 1 | 100 | 198 | 298 | 66.4% |
| Y3 (incomplete) | 4 | 1 | 173 | 516 | 689 | 74.9% |
3.2.3. Imputation for incomplete dataset
The minimal data missing rate is close to 50%, as shown in Table 4, indicating that imputation methods must be chosen that still work under substantial data missingness. According to Table 3, the regression model, regression-RBF model, and LSSVM-RBF model exhibit superior interpolation performance, particularly at high missing rates. Consequently, the case in this section is used for a further comparison of the imputation performance of these three models.
The dataset with a measuring interval of 3 (the incomplete version of Y2) is taken as an example to illustrate the imputation effects, comparing the imputed data with the original data Y2 and the 'incomplete' data. The comparison results of the regression imputation method are shown in Fig. 8 (a) and (b).
Fig. 8.
Imputation results of regression method.
Similarly, the comparison results of the regression-RBF and LSSVM-RBF imputation methods are shown in Fig. 9 (a)-(c) and Fig. 10 (a)-(c), and the trend and residual terms of the imputation are presented as well.
Fig. 9.
Imputation results of regression-RBF method.
Fig. 10.
Imputation results of LSSVM-RBF method.
3.2.4. Evaluation of imputation methods
By repeating the simulation approach of Section 3.2.2, 10 sets each of data with measuring intervals of 2, 3 and 4 are obtained to evaluate the imputation methods, using RMSE and the relative variance error as metrics. The evaluation results are shown in Fig. 11 (a)-(f).
Fig. 11.
Evaluation on RMSE and relative variance error of three imputation method.
In terms of RMSE, the proposed LSSVM-RBF model consistently exhibits significantly smaller values than the other two methods. This indicates that the proposed model achieves a higher level of stability and has fewer imputation points with considerably larger errors. On the other hand, the Regression-RBF model shows a less substantial improvement over the regression model than in the equal-interval case discussed in Section 3.1. For the relative variance error (ησ²), the performance of the Regression-RBF model on data with unequal intervals under high missing rates is similar to that of the LSSVM-RBF. The Regression model maintains the smallest ησ², but this advantage is not highly pronounced.
In summary, for data with unequal intervals under high missing rates, the models can be ranked in descending order of performance as follows: LSSVM-RBF, Regression, and Regression-RBF.
4. Experiment and discussion
The super-luminescent diode (SLD) is a semiconductor optical component now widely used in the automotive, medical and radio fields due to its excellent characteristics of high output power, good stability, long life and high reliability. The stability of an SLD light source is mainly influenced by the injection current and core temperature; thus temperature is chosen as the applied stress for the accelerated test to verify the effectiveness of the imputation method proposed in this paper.
4.1. Experiment setup
A step-stress accelerated test is performed at 3 temperature levels (60 °C, 70 °C, 80 °C) with holding times of 1500 h, 1000 h and 500 h respectively. The output optical power of the SLD is recorded every 3 h during the test, as shown in Fig. 12 (a) and (b). The differences in the initial value of optical power at the 3 stress levels are the result of temperature drift, but they do not affect the subsequent data imputation.
Fig. 12.
Step stress accelerated degradation test of SLD.
4.2. Data processing
Severe stress conditions typically result in accelerated degradation of products, which can be quantified by an acceleration factor a. The acceleration factor represents the ratio by which degradation under accelerated stress increases relative to degradation under base stress over the same time period. Alternatively, the time required to reach the same extent of degradation under accelerated stress is 1/a times that under base stress. As a result, the measuring interval expands by a factor of a when data at an accelerated stress level are converted to the base stress level.
To determine the imputation timepoints for the transformed data at the base stress level, we first calculate the average degradation rate (ADR) by fitting a linear regression model to the performance (power) degradation data at each stress level on the basis of Fig. 12 (b), as shown in Table 5.
$$a_1 = \frac{\mathrm{ADR}_{70\,^\circ\mathrm{C}}}{\mathrm{ADR}_{60\,^\circ\mathrm{C}}}, \qquad a_2 = \frac{\mathrm{ADR}_{80\,^\circ\mathrm{C}}}{\mathrm{ADR}_{60\,^\circ\mathrm{C}}} \tag{15}$$
Table 5.
Average degradation rate of SLD at each stress level.
| Stress Level | 60 °C | 70 °C | 80 °C |
|---|---|---|---|
| ADR | 0.00052 | 0.0012 | 0.0032 |
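The ADR estimation step can be sketched with a first-order least-squares fit. The degradation trace below is synthetic (a noiseless linear decay at an assumed rate matching the 60 °C entry of Table 5), not the measured SLD data:

```python
import numpy as np

# Hypothetical degradation record at one stress level: optical power
# sampled every 3 h, decaying linearly at an assumed rate of 0.00052 per hour.
t = np.arange(0.0, 1500.0, 3.0)      # measuring timepoints (h)
power = 10.0 - 0.00052 * t           # synthetic noiseless power trace

# ADR = magnitude of the slope of a first-order linear fit.
slope, intercept = np.polyfit(t, power, 1)
adr = abs(slope)
print(round(adr, 5))                 # → 0.00052
```

On real measurements the fit would of course return a noisy slope estimate rather than the exact rate.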
Then the acceleration factors at the higher stress levels relative to 60 °C can be obtained from Eq. (15).
Here a1 and a2 denote the acceleration factors at 70 °C and 80 °C relative to 60 °C, respectively. Hence, the imputation timepoints obtained by transforming the 70 °C and 80 °C data to 60 °C are given by Eq. (16):
$$t'_{70} = a_1\,t_{70}, \qquad t'_{80} = a_2\,t_{80} \tag{16}$$
in which t70 and t80 denote the measuring timepoints at 70 °C and 80 °C, respectively, while t′70 and t′80 denote the corresponding timepoints at 60 °C after transformation; the transformed data at 60 °C are shown in Fig. 13. According to the acceleration factors a1 and a2, the measuring intervals of the transformed 70 °C and 80 °C data are approximately 6 h and 18 h, respectively, in which the 'missing' data need to be interpolated to obtain a complete dataset with equal intervals. The specific information on the imputation data is given in Table 6.
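The conversion above can be sketched numerically. Using the ADR values from Table 5, the acceleration factors come out near 2.3 and 6.2, so the 3 h measuring intervals stretch to roughly 6 h and 18 h after transformation, matching the intervals quoted in the text (the timepoint grids below are illustrative):

```python
import numpy as np

# ADR values from Table 5.
adr = {"60C": 0.00052, "70C": 0.0012, "80C": 0.0032}

# Acceleration factors relative to the 60 °C base level (Eq. (15)).
a1 = adr["70C"] / adr["60C"]   # ≈ 2.31
a2 = adr["80C"] / adr["60C"]   # ≈ 6.15

# Measuring timepoints at the higher stress levels (3 h interval).
t70 = np.arange(0.0, 1000.0, 3.0)
t80 = np.arange(0.0, 500.0, 3.0)

# Equivalent timepoints at 60 °C (Eq. (16)): intervals expand by a1, a2.
t70_eq = a1 * t70
t80_eq = a2 * t80
print(round(t70_eq[1] - t70_eq[0], 2))   # interval ≈ 6.92 h, i.e. roughly 6 h
print(round(t80_eq[1] - t80_eq[0], 2))   # interval ≈ 18.46 h, i.e. roughly 18 h
```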
Fig. 13.
Transformed degradation data at 60 °C of SLD output optical power.
Table 6.
Imputation data information.
| Stress Level | Transformed Interval (h) | Imputation Interval (h) | Transformed Data | Imputation Data | Total Data | Percentage of Imputation Data |
|---|---|---|---|---|---|---|
| 70 °C | 6 | 3 | 334 | 333 | 667 | 49.9% |
| 80 °C | 18 | 3 | 167 | 830 | 997 | 83.2% |
4.3. Results and discussion
Data imputation is performed with the LSSVM-RBF model to handle the high missing rates of the transformed degradation datasets; the results are shown in Fig. 14(a)-(f).
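The trend-fitting step of the model can be sketched with a minimal numpy implementation of LSSVM regression with a Gaussian (RBF) kernel, solving the standard LSSVM dual linear system. The degradation trace, kernel width, and regularization value below are illustrative assumptions, not the settings used in the paper:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0):
    """Gaussian kernel matrix between two 1-D sample arrays."""
    return np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2.0 * sigma**2))

def lssvm_fit(x, y, gamma=1e4, sigma=1.0):
    """Solve the LSSVM dual system [[0, 1^T], [1, K + I/gamma]] [b; a] = [0; y]."""
    n = len(x)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(x, x, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]              # alpha (dual weights), bias b

def lssvm_predict(x_train, alpha, b, x_new, sigma=1.0):
    """f(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(x_new, x_train, sigma) @ alpha + b

# Hypothetical degradation trend with timepoints missing from the record.
x_obs = np.linspace(0.0, 10.0, 30)
y_obs = 10.0 - 0.3 * x_obs + 0.2 * np.sin(x_obs)   # synthetic trend
alpha, b = lssvm_fit(x_obs, y_obs)
x_miss = np.array([2.5, 5.5, 8.5])                  # missing timepoints
y_hat = lssvm_predict(x_obs, alpha, b, x_miss)      # trend imputations
```

In the full method the imputed value is this trend estimate plus a residual term supplied by the RBF network.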
Fig. 14.
Imputation results for the data transformed from 70 °C and 80 °C.
The RMSEs for the transformed datasets at 70 °C and 80 °C are 4.66 and 5.45, respectively, and the corresponding values are 0.08 and 0.05. Fig. 14 shows that the degradation trends and residual terms of the complete transformed dataset closely align with the original observations. For the non-equally spaced data at 70 °C, the missing data rate is only about 50% according to Table 6, and the proposed model maintains its expected performance. For the 80 °C dataset, the missing data rate reaches 83.2%; however, the amount of valid data available for model training, combining sufficient data from the benchmark stress level at 60 °C with the transformed data at 70 °C, exceeds that of the cases discussed in Sections 2 and 3, so the proposed method still attains high accuracy. This demonstrates the strong performance of the proposed LSSVM-RBF model in imputing unequal-interval data from step-stress accelerated degradation tests. Furthermore, the complete, equally spaced SLD degradation dataset becomes more suitable for prediction after additional preprocessing, such as compensating for the effect of temperature drift.
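The residual-dispersion step can be sketched as a small RBF network fitted by least squares: Gaussian basis functions on a regular grid of centres approximate the structure left over after the trend fit, and the fitted network is then evaluated at the imputation timepoints. The residual signal, centre grid, and basis width below are illustrative assumptions:

```python
import numpy as np

def gaussian_design(t, centers, width):
    """Design matrix of Gaussian radial basis functions."""
    return np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2.0 * width**2))

# Hypothetical residuals left over after the LSSVM trend fit.
t_obs = np.linspace(0.0, 100.0, 40)
res_obs = 0.05 * np.cos(0.1 * t_obs)        # synthetic residual structure

centers = np.linspace(0.0, 100.0, 21)       # RBF centres on a regular grid
Phi = gaussian_design(t_obs, centers, width=5.0)
w, *_ = np.linalg.lstsq(Phi, res_obs, rcond=None)   # least-squares weights

t_miss = np.array([12.5, 47.5, 82.5])       # imputation timepoints
res_hat = gaussian_design(t_miss, centers, 5.0) @ w

# Final imputed value = LSSVM trend estimate + RBF residual estimate.
```

Evaluating RMSE between imputed and held-out true values, as reported above, then quantifies how closely the combined estimate reproduces the original data.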
5. Conclusions and future work
Incomplete data is one of the most significant problems in current data analysis research. Although it is difficult to solve completely, researchers continue to propose methods tailored to specific missing-data scenarios. This study proposes a hybrid imputation method combining the LSSVM model with an RBF neural network to address missing data in step-stress accelerated degradation tests.
In this paper, a two-step framework for the hybrid imputation method is constructed. First, the missing data are separated into trend and residual terms for separate imputation: the non-linear approximation capability of the LSSVM is leveraged to model the degradation trend and capture its underlying patterns. The RBF model is then used to extract the dispersion information remaining in the LSSVM residuals, which ensures that the imputed data maintain a level of dispersion similar to that of the original observations. The imputation strategy for accelerated degradation data with unequal intervals is improved by considering the acceleration factor, and the performance and validity of the proposed model are verified using real test data for SLDs. The findings of this study and directions for future work are as follows.
- According to the obtained results, missing data always reduce the precision of evaluation or prediction models. Missing data in the middle of the dataset has the least impact, although it may render some models unusable, whereas missing data in the latter part has the most severe impact on model performance.
- The comparison results reveal that, for missing data with equal measuring intervals, the hybrid models generally outperform the traditional models, particularly at high missing rates, although the traditional regression model excels at reproducing the dispersion of the original data. For missing data with unequal measuring intervals, the proposed model performs significantly better than the other models.
- To further optimize the proposed model, hybrid models integrating other intelligent learning methods could also be examined.
- This paper investigates strategies for handling missing data from accelerated tests with degradation trends; further research should address missing data in success/failure-type accelerated tests, which lack a degradation trend.
Data availability statement
Data will be made available on request.
CRediT authorship contribution statement
Yaqiu Li: Conceptualization, Data curation, Formal analysis, Investigation, Writing – original draft. Qijie Zhou: Formal analysis, Validation, Writing – original draft. Ye Fan: Data curation, Formal analysis. Guangze Pan: Conceptualization, Methodology, Resources, Writing – review & editing, Validation. Zongbei Dai: Investigation, Resources. Baimao Lei: Data curation, Formal analysis.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was partially funded by the Guangdong Basic and Applied Basic Research Foundation (2021A1515110679) and the science and technology program of Guangzhou, China (No.202201010303 and No. 202201010584), the ministry of industry and information technology project (TC210804R-1).
References
- 1.Si W., Shao Y., Wei W. Accelerated degradation testing with long-term memory effects. IEEE Trans. Reliab. 2020;69:1254–1266. doi: 10.1109/TR.2020.2997404. [DOI] [Google Scholar]
- 2.LuValle M. In: Recent Advances in Reliability Theory: Methodology, Practice, and Inference. Limnios N., Nikulin M., editors. Birkhäuser; Boston, MA: 2000. A theoretical framework for accelerated testing; pp. 419–433. [DOI] [Google Scholar]
- 3.Wang Z., Zhao L., Kong Z., Yu J., Yan C. Development of accelerated reliability test cycle for electric drive system based on vehicle operating data. Eng. Fail. Anal. 2022;141 doi: 10.1016/j.engfailanal.2022.106696. [DOI] [Google Scholar]
- 4.Ma Z., Wang S., Ruiz C., Zhang C., Liao H., Pohl E. Reliability estimation from two types of accelerated testing data considering measurement error. Reliab. Eng. Syst. Saf. 2020;193 doi: 10.1016/j.ress.2019.106610. [DOI] [Google Scholar]
- 5.Mehmood B., Akbar M., Ullah R. Accelerated aging effect on high temperature vulcanized silicone rubber composites under DC voltage with controlled environmental conditions. Eng. Fail. Anal. 2020;118 doi: 10.1016/j.engfailanal.2020.104870. [DOI] [Google Scholar]
- 6.Ngueilbaye A., Wang H., Mahamat D.A., Junaidu S.B. Modulo 9 model-based learning for missing data imputation. Appl. Soft Comput. 2021;103 doi: 10.1016/j.asoc.2021.107167. [DOI] [Google Scholar]
- 7.Heitjan D.F., Basu S. Distinguishing “missing at random” and “missing completely at random,”. Am. Statistician. 1996;50:207–213. doi: 10.1080/00031305.1996.10474381. [DOI] [Google Scholar]
- 8.Brown R.L. Efficacy of the indirect approach for estimating structural equation models with missing data: a comparison of five methods. Struct. Equ. Model. 1994;1:287–316. doi: 10.1080/10705519409539983. [DOI] [Google Scholar]
- 9.Prasad S. An exponential imputation in the case of missing data. J. Stat. Manag. Syst. 2017;20:1127–1140. doi: 10.1080/09720510.2017.1407515. [DOI] [Google Scholar]
- 10.Al-Omari A.I., Bouza C.N., Herrera C. Imputation methods of missing data for estimating the population mean using simple random sampling with known correlation coefficient. Qual. Quantity. 2013;47:353–365. doi: 10.1007/s11135-011-9522-1. [DOI] [Google Scholar]
- 11.Lin J., Li N., Alam M.A., Ma Y. Data-driven missing data imputation in cluster monitoring system based on deep neural network. Appl. Intell. 2020;50:860–877. doi: 10.1007/s10489-019-01560-y. [DOI] [Google Scholar]
- 12.Keller B.T., Du H. A fully conditional specification approach to multilevel multiple imputation with latent cluster means. Multivariate Behav. Res. 2019;54:149–150. doi: 10.1080/00273171.2018.1556085. [DOI] [PubMed] [Google Scholar]
- 13.Andridge R., Bechtel L., Thompson K.J. Finding a flexible hot-deck imputation method for multinomial data. J Surv Stat Methodol. 2021;9:789–809. doi: 10.1093/jssam/smaa005. [DOI] [Google Scholar]
- 14.Faisal S., Tutz G. Multiple imputation using nearest neighbor methods. Inf. Sci. 2021;570:500–516. doi: 10.1016/j.ins.2021.04.009. [DOI] [Google Scholar]
- 15.Quartagno M., Carpenter J.R. Multiple imputation for discrete data: evaluation of the joint latent normal model. Biom. J. 2019;61:1003–1019. doi: 10.1002/bimj.201800222. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Nd D., Mm R. Missing value imputation using stratified supervised learning for cardiovascular data. J Inform Data Min. 2016;1:1–10. doi: 10.4172/2229-8711.S1113. [DOI] [Google Scholar]
- 17.Gautam C., Ravi V. Data imputation via evolutionary computation, clustering and a neural network. Neurocomputing. 2015;156:134–142. doi: 10.1016/j.neucom.2014.12.073. [DOI] [Google Scholar]
- 18.Nishanth K.J., Ravi V. Probabilistic neural network based categorical data imputation. Neurocomputing. 2016;218:17–25. doi: 10.1016/j.neucom.2016.08.044. [DOI] [Google Scholar]
- 19.Shao J., Meng W., Sun G. Evaluation of missing value imputation methods for wireless soil datasets. Personal Ubiquitous Comput. 2017;21:113–123. doi: 10.1007/s00779-016-0978-9. [DOI] [Google Scholar]
- 20.Sharma G., Panwar A., Nasiruddin I., Bansal R.C. Non-linear LS-SVM with RBF-kernel-based approach for AGC of multi-area energy systems, IET Generation. Transm. Distrib. 2018;12:3510–3517. doi: 10.1049/iet-gtd.2017.1402. [DOI] [Google Scholar]
- 21.Sharma G., Nasiruddin I., Niazi K.R., Bansal R.C. Automatic generation control (AGC) of wind power system: an least squares-support vector machine (LS-SVM) radial basis function (RBF) kernel approach. Elec. Power Compon. Syst. 2018;46:1621–1633. doi: 10.1080/15325008.2018.1511003. [DOI] [Google Scholar]
- 22.Liu C., Niu P., Li G., You X., Ma Y., Zhang W. A hybrid heat rate forecasting model using optimized LSSVM based on improved GSA. Neural Process. Lett. 2017;45:299–318. doi: 10.1007/s11063-016-9523-0. [DOI] [Google Scholar]
- 23.Cornelis C., Jensen R., Hurtado G., Śle¸zak D. Attribute selection with fuzzy decision reducts. Inf. Sci. 2010;180:209–224. doi: 10.1016/j.ins.2009.09.008. [DOI] [Google Scholar]
- 24.Sarra S.A., Cogar S. An examination of evaluation algorithms for the RBF method. Eng. Anal. Bound. Elem. 2017;75:36–45. doi: 10.1016/j.enganabound.2016.11.006. [DOI] [Google Scholar]
- 25.Suykens J.A.K., De Brabanter J., Lukas L., Vandewalle J. Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing. 2002;48:85–105. doi: 10.1016/S0925-2312(01)00644-0. [DOI] [Google Scholar]
- 26.Zhang Y., Li R. Short term wind energy prediction model based on data decomposition and optimized LSSVM. Sustain Energy Techn. 2022;52 doi: 10.1016/j.seta.2022.102025. [DOI] [Google Scholar]