ABSTRACT
This work proposes a two-stage procedure for identifying outlying observations in a large-dimensional data set. In the first stage, an outlier identification measure is defined via a max-normal statistic, and a clean subset containing non-outliers is obtained. Since outlier identification can be cast as a multiple hypothesis testing problem, in the second stage we derive the asymptotic distribution of the proposed measure and obtain a threshold for flagging outlying observations. Furthermore, to improve identification power and better control the misjudgment rate, a one-step refined algorithm is proposed. Simulation results and two real data examples show that, compared with other methods, the proposed procedure has clear advantages in identifying outliers across a variety of data settings.
KEYWORDS: Outlier identification, large-dimensional statistics, multiple hypothesis testing, robust statistics, asymptotic distribution
1. Introduction
In statistics and data analysis, an outlier is a data point that differs significantly from other observations [9]. Similarly, Hawkins [11] defined an outlier as an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. Outliers therefore differ somewhat in structure from the majority of the data, and this difference may be caused by measurement problems, different data preprocessing, or inconsistencies among the observations.
Outliers can be viewed as rows of the data generated by a mechanism different from that of the other rows. It also often happens that most data cells in a row are regular and just a few of them are anomalous; that is, an outlier does not necessarily differ in all variable values but may differ in only a few variables. Outliers in a data set cannot be avoided, and even a regular data set may contain a nontrivial fraction of outliers, or far more in some specific applications [10]. The presence of outlying points may cause serious problems, such as model misspecification, biased parameter estimation, and other negative effects in statistical data analysis [4]. Especially in large-dimensional settings, the dimensionality and complexity of the data increase the chance that observations are outlying, which in turn can have a greater impact on the results of statistical analysis. Therefore, identifying and removing outliers from the reference data set is one of the crucial steps in data processing.
The outlier identification problem for large-dimensional data is a fundamental task in data analysis, and over the past ten years many researchers in statistics and computer science have studied it. Filzmoser et al. [7] proposed a computationally fast procedure that uses principal component analysis to identify outliers in a transformed space. Fritsch et al. [8] modified the classical minimum covariance determinant approach by adding a regularization term so that the estimation remains well-posed in the presence of many outliers in high-dimensional data. Ro et al. [15] proposed an outlier detection procedure that replaces the classical minimum covariance determinant estimator with a high-breakdown minimum diagonal product estimator, with the cut-off value obtained from the asymptotic distribution of the distance; their method, however, may not perform well in real-world situations with correlated variables. Yang et al. [19] proposed a procedure that combines the minimum diagonal product algorithm with a maximum thresholding method for detecting outlying points under sparsity, but simulation experiments show that their method does not control the Type I error rate well. Correspondingly, many methods for testing the mean of high-dimensional data have been proposed in recent years. For example, Srivastava and Du [18] considered mean vector testing of independent and identically distributed multivariate normal random vectors in the high-dimensional setting. Cai et al. [2] considered a max-normal test statistic for testing two sample means, which is useful for identifying sparse signals. These results on the high-dimensional mean testing problem help us extend methods for outlier detection, and it is worthwhile to explore more efficient procedures for identifying outlying points in large-dimensional data.
To this end, in this work we consider a large-dimensional data matrix X with n rows and p columns. Owing to various factors, the data matrix is assumed to inevitably contain some outlying rows. Motivated by Ro et al. [15] and Cai et al. [2], we propose a two-stage outlier identification procedure for large-dimensional data. In the first stage, we extend the well-known least trimmed squares (LTS) algorithm and divide the data matrix into two row-wise sub-matrices, such that one is a clean subset of size h that does not contain any outlying points, and the other contains all contaminated outliers. Based on the clean subset obtained in the first stage, we can compute reliable parameter estimates. Since outlier detection can be cast as a multiple hypothesis testing problem, in the second stage we construct an identification distance based on these reliable parameter estimates as the test statistic. To determine the threshold for outlying points, we derive the asymptotic distribution of the distance under the null hypothesis and use it to decide whether an observation is outlying. Furthermore, to enhance efficiency and control the Type I error, a one-step confirmation procedure is added to the procedure.
In addition, during multiple-outlier detection, masking and swamping effects become more common as the data set size or the variable dimensionality grows. Masking occurs when an outlier goes undetected, while swamping occurs when a non-outlying observation is deemed an outlier. It is therefore challenging to accurately identify outlying points while avoiding both effects. The main goal of this work is to propose an outlier identification procedure that controls the Type I error well while retaining good power for detecting outlying points. That is, on the one hand, we try to detect all hidden outlying points; on the other hand, we must control the proportion of normal observations misjudged as outlying. These two aims address the masking and swamping effects, respectively. The proposed procedure provides a new characterization and can thus reduce masking and swamping to a great extent when there are many explanatory variables.
The remainder of this paper is organized as follows. Section 2 introduces the outlier identification procedure in detail. Section 3 presents the simulation study, including performance comparisons with existing methods. We apply the proposed approach to the riboflavin data and the glass data in Section 4. Section 5 concludes the paper with discussion.
2. Methodology
In this section, we present the location-shift data generation model and existing work on large-dimensional outlier identification, and then propose our outlier identification method for large-dimensional data, describing its implementation in detail.
2.1. Problems and existing works
Suppose we observe an independent large-dimensional random sample X_1, …, X_n, where each observation X_i = (x_{i1}, …, x_{ip})^T ∈ ℝ^p for i = 1, …, n. These n observations are collected from the following model,
$$X_i = \mu\,\mathbb{1}\{i \notin O\} + \mu_O\,\mathbb{1}\{i \in O\} + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{1}$$
where O (a subset of {1, …, n}) indexes the outlying observations, both μ and μ_O are p-dimensional mean vectors, and ε_1, …, ε_n are random error vectors following a normal distribution with mean zero and covariance matrix Σ. Model (1) can be rewritten in the following form,
$$X_i \sim N_p(\mu, \Sigma) \ \text{for } i \notin O, \qquad X_i \sim N_p(\mu_O, \Sigma) \ \text{for } i \in O. \tag{2}$$
All the observations follow a multivariate normal distribution, and any observation X_i with i ∈ O, that is, with mean vector μ_O ≠ μ, is deemed an outlier. Here, we assume that the outlying set size satisfies |O| < ⌊n/2⌋, where |O| denotes the cardinality of O and ⌊n/2⌋ denotes the largest integer less than n/2. This assumption is valid in real applications: in general data analysis, outliers cannot exceed half of the data set; otherwise, the observations deemed outlying should in practice be suspected to be the normal data.
Any data set inevitably contains different types of outliers, and this is all the more true for large-dimensional data, whose structure is more complex. To overcome the difficulties of outlier identification in the large-dimensional setting, Ro et al. [15] modified the Mahalanobis distance so that it involves only the diagonal elements of the covariance matrix, and proposed the following distance-based measure,
$$D_i(\mu, D) = (X_i - \mu)^{\mathrm{T}} D^{-1} (X_i - \mu), \tag{3}$$
where μ is a p-dimensional mean vector and D = diag(d_1, …, d_p) is the diagonal matrix of Σ. The measure can be rewritten as D_i(μ, D) = Σ_{j=1}^{p} (x_{ij} − μ_j)²/d_j, so the information on outlyingness is extracted from each individual coordinate marginally. Given the true parameters μ and D, the distance-based measure is asymptotically normal after standardization. In practice, the measure requires robust estimates of μ and D, for which a clean subset of h non-outlying observations is obtained. Ro et al. [15] then calculated a location estimator μ̂ and a diagonal matrix estimator D̂, and showed that they are scalar equivariant but not affine equivariant. The robust version of the identification measure can be written as D_i(μ̂, D̂), and a thresholding rule based on asymptotic normal distribution theory determines whether an observation is outlying.
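To make the marginal form of (3) concrete, here is a minimal numpy sketch; the function name and the convention that rows of X are observations are our own choices, not part of Ro et al. [15].

```python
import numpy as np

def diagonal_distance(X, mu, d):
    """Diagonal distance of eq. (3): D_i = sum_j (x_ij - mu_j)^2 / d_j.

    X  : (n, p) data matrix, one observation per row
    mu : (p,) mean vector
    d  : (p,) diagonal elements of the covariance matrix
    """
    return (((X - mu) ** 2) / d).sum(axis=1)
```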
Ro et al. [15]'s procedure is effective for the high-dimensional outlier identification problem and is especially suitable for mean-dense outlier settings. However, for large-dimensional mean-sparse data, the means of the outlying observations are typically identical or quite similar to the null mean, possibly differing in only a small number of coordinates, which makes it difficult for the above methods to distinguish non-outlying from outlying observations. It is therefore particularly meaningful to explore a unified and effective method for identifying mean-dense and/or mean-sparse outlying points in large-dimensional complex data sets.
2.2. An efficient max-normal distance measure
In this subsection, following the ideas of Ro et al. [15] and Cai et al. [2] on high-dimensional statistical inference, we propose an outlier identification procedure for large-dimensional data. First, we define an oracle max-normal distance measure as
$$M_i(\mu, D) = \max_{1 \le j \le p} \frac{(x_{ij} - \mu_j)^2}{d_j}, \tag{4}$$
where μ_j is the oracle mean of the j-th predictor and d_j is the j-th diagonal element of D. Here, unlike Ro et al. [15], we consider the maximum over the variables instead of the sum. The advantage is that, in sparse settings, outlier identification becomes more effective.
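A corresponding sketch of the oracle max-normal measure (4) differs from the previous snippet only in replacing the sum by a maximum; again, the naming is ours.

```python
import numpy as np

def max_normal_distance(X, mu, d):
    """Oracle max-normal measure of eq. (4): M_i = max_j (x_ij - mu_j)^2 / d_j.
    Taking the maximum instead of the sum concentrates the statistic on the
    most deviant coordinate, which is what makes sparse shifts detectable."""
    return (((X - mu) ** 2) / d).max(axis=1)
```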
To determine a threshold for identifying outlying points by the max-normal measure, we first explore the distribution of M_i. Given the true parameters μ and Σ, let Z_i = X_i − μ; then, for i ∉ O, Z_i is a zero-mean multivariate normal random vector with covariance matrix Σ, whose diagonal is D.
Proposition 2.1 (Lemma 6 in Cai et al. [2]) —
Suppose that Z = (Z_1, …, Z_p)^T is a zero-mean multivariate normal random vector with correlation matrix R = (r_{jk})_{p×p}, and |r_{jk}| ≤ r < 1 for all j ≠ k, where r is a constant; then for any x ∈ ℝ, as p → ∞, we have

$$P\Big(\max_{1 \le j \le p} Z_j^2 - 2\log p + \log\log p \le x\Big) \longrightarrow \exp\big(-\pi^{-1/2} e^{-x/2}\big). \tag{5}$$
This is a well-known result, and more discussion can be found in Lemma 6 and Proposition A.1 of Cai et al. [2]. Under the above assumptions, if μ and Σ are the oracle mean vector and covariance matrix, then the proposed max-normal distance measure has the same limiting distribution, that is,
$$P\big(M_i(\mu, D) - 2\log p + \log\log p \le x\big) \longrightarrow \exp\big(-\pi^{-1/2} e^{-x/2}\big), \qquad i \notin O. \tag{6}$$
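The limit (6) is easy to check numerically in the simplest independent case (R = I_p); the following Monte Carlo sketch, with sizes chosen arbitrarily by us, compares empirical probabilities of the centered maximum with the limiting cdf.

```python
import numpy as np

# Monte Carlo check of (5)/(6) with R = I_p: the statistic
# max_j Z_j^2 - 2*log(p) + log(log(p)) should be approximately
# distributed with cdf F(x) = exp(-exp(-x/2)/sqrt(pi)) for large p.
rng = np.random.default_rng(0)
p, reps = 1000, 5000
M = (rng.standard_normal((reps, p)) ** 2).max(axis=1)
stat = M - 2 * np.log(p) + np.log(np.log(p))

for x in (-1.0, 0.0, 1.0, 2.0):
    empirical = (stat <= x).mean()
    limit = np.exp(-np.exp(-x / 2) / np.sqrt(np.pi))
    print(f"x = {x:+.1f}: empirical {empirical:.3f} vs limit {limit:.3f}")
```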
Next, we explore the asymptotic properties of a robust estimator of the measure. For this, it is necessary to obtain reliable estimators of μ and Σ and to determine a threshold for the measure, so as to decide whether an observation is an outlier. However, outliers present in the data set may distort the estimation of μ and Σ, and this in turn would affect the asymptotic distribution of the measure and the identification results. The max-normal distance measure (4) therefore requires consistent estimation of the mean and covariance, and we consider replacing μ and D by robust counterparts computed on a clean subset.
Let ℋ be the collection of all index subsets H ⊂ {1, …, n} of size h. The sample mean and sample covariance of {X_i : i ∈ H} are defined as X̄_H = h⁻¹ Σ_{i∈H} X_i and S_H = (h − 1)⁻¹ Σ_{i∈H} (X_i − X̄_H)(X_i − X̄_H)^T; the estimated max-normal measure can then be rewritten as
$$\hat{M}_i(H) = \max_{1 \le j \le p} \frac{(x_{ij} - \bar{x}_{H,j})^2}{\hat{d}_{H,j}}, \tag{7}$$
where x̄_{H,j} is the j-th element of X̄_H, D̂_H = diag(S_H) is the diagonal matrix of S_H, and d̂_{H,j} is its j-th main diagonal element. In the next step, we build a computational algorithm to find an optimal clean subset free of outlying points, which is then used to obtain reliable estimators of the mean and covariance matrix. To this end, we give the following definition.
Definition 2.1
The least trimmed max-normal distance (LTMD in short) measure is defined through the clean subset

$$\hat{H} = \arg\min_{H \in \mathcal{H}} \sum_{i=1}^{h} \hat{M}_{(i)}(H), \tag{8}$$

where M̂_{(1)}(H) ≤ ⋯ ≤ M̂_{(h)}(H) are the h smallest ordered values among M̂_1(H), …, M̂_n(H). The least trimmed max-normal distance measure is then defined as M̂_i(Ĥ).
Next, we develop a fast LTMD algorithm to obtain the clean subset Ĥ; the following proposition guarantees the decrease of the objective function.
Proposition 2.2
Suppose the index set H_1 ⊂ {1, …, n} has size h, and define M̂_i(H_1) for i = 1, …, n based on the sub-matrix {X_i : i ∈ H_1}. Then take H_2 such that {M̂_i(H_1) : i ∈ H_2} are the h smallest values among M̂_1(H_1), …, M̂_n(H_1), and compute M̂_i(H_2) based on the sub-matrix {X_i : i ∈ H_2}. We have Σ_{i∈H_2} M̂_i(H_2) ≤ Σ_{i∈H_1} M̂_i(H_1), and equality holds if and only if X̄_{H_2} = X̄_{H_1} and D̂_{H_2} = D̂_{H_1}.
The proof of Proposition 2.2 is similar to that of Property 1 in Rousseeuw and Van Driessen [16]. As H_2 corresponds to the h smallest distance measures computed from H_1, we have Σ_{i∈H_2} M̂_i(H_1) ≤ Σ_{i∈H_1} M̂_i(H_1). Moreover, since (X̄_{H_2}, D̂_{H_2}) is computed from the observations in H_2, it minimizes the sum of the distance measures over H_2, so Σ_{i∈H_2} M̂_i(H_2) ≤ Σ_{i∈H_2} M̂_i(H_1), and the claimed inequality follows.
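In code, this concentration step is a direct analogue of the C-step of Rousseeuw and Van Driessen [16]. The sketch below is ours; in particular, the use of the unbiased variance (ddof=1) and argsort tie-breaking are implementation assumptions.

```python
import numpy as np

def max_normal(X, mu, d):
    # Estimated max-normal measure (7) for every row of X.
    return (((X - mu) ** 2) / d).max(axis=1)

def c_step(X, H, h):
    """One concentration step: estimate (mean, diagonal variances) on the
    current subset H, then keep the h observations with the smallest
    measures. Proposition 2.2 says the trimmed objective cannot increase,
    so iterating this step converges in finitely many iterations."""
    sub = X[H]
    mu = sub.mean(axis=0)
    d = sub.var(axis=0, ddof=1)       # diagonal of the sample covariance
    dist = max_normal(X, mu, d)
    return np.argsort(dist)[:h]       # indices of the h smallest measures
```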
After obtaining the clean subset Ĥ and calculating the corresponding sample mean and sample covariance, the max-normal measure can be written as M̂_i = M̂_i(Ĥ). In the next subsection, we develop a threshold rule to determine whether a single observation is an outlier.
2.3. Threshold rule for identification
In this work, the outlier identification problem is treated as n hypothesis tests,
$$H_{0i}: i \notin O \quad \text{versus} \quad H_{1i}: i \in O, \qquad i = 1, \ldots, n, \tag{9}$$
where H_{0i} states that the i-th observation is not an outlier.
To determine the threshold value for outlying points, we need the distribution of the max-normal test statistic M̂_i. Proposition 2.1 gives the asymptotic distribution of the oracle statistic M_i, and X̄_Ĥ and D̂_Ĥ are, respectively, the sample estimates of μ and D based on the clean subset Ĥ; we therefore consider the distribution of M̂_i = M̂_i(Ĥ).
Proposition 2.3
Assume the null hypothesis that there are no outliers in the data set. Since X̄_Ĥ and D̂_Ĥ are consistent estimators of μ and D, by applying Slutsky's theorem we have

$$P\big(\hat{M}_i - 2\log p + \log\log p \le x\big) \longrightarrow \exp\big(-\pi^{-1/2} e^{-x/2}\big), \tag{10}$$

where X̄_Ĥ is the sample mean and D̂_Ĥ is the diagonal matrix of the sample covariance based on the clean subset Ĥ.
As a result of this proposition, combined with result (6) of Proposition 2.1, we can utilize the type I extreme value distribution to determine a threshold rule. Given a significance level α, using the asymptotic distribution (6) with the reliable estimators X̄_Ĥ and D̂_Ĥ plugged in, the i-th observation is considered outlying if
$$\hat{M}_i \ge 2\log p - \log\log p + q_\alpha, \tag{11}$$
where q_α is the (1 − α)-quantile of the type I extreme value distribution with cumulative distribution function F(x) = exp(−π^{−1/2} e^{−x/2}), i.e. q_α = −log π − 2 log log (1 − α)^{−1}.
According to this rule, we obtain a relatively reliable non-outlier index subset, denoted Î_0 (Î_0 ⊂ {1, …, n}), whose complement Î_1 = {1, …, n} \ Î_0 contains all the candidate outliers.
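The threshold (11) is straightforward to compute; a hedged sketch follows, with our own function names and with the quantile formula derived from the cdf F above.

```python
import numpy as np

def evd_quantile(alpha):
    """(1 - alpha)-quantile of F(x) = exp(-exp(-x/2)/sqrt(pi)):
    q_alpha = -log(pi) - 2*log(log(1/(1 - alpha)))."""
    return -np.log(np.pi) - 2 * np.log(np.log(1.0 / (1.0 - alpha)))

def flag_outliers(M_hat, p, alpha=0.05):
    """Threshold rule (11): flag observation i when
    M_hat[i] >= 2*log(p) - log(log(p)) + q_alpha."""
    cutoff = 2 * np.log(p) - np.log(np.log(p)) + evd_quantile(alpha)
    return M_hat >= cutoff            # boolean mask of candidate outliers
```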
2.4. A one-step reweighting scheme
To enhance efficiency, a one-step reweighting scheme is often used in practice [3]. In this subsection, based on the relatively reliable non-outlier subset Î_0, we compute the sample mean X̄_{Î_0} and sample covariance S_{Î_0} of {X_i : i ∈ Î_0}, and define the max-normal distance measure
$$\tilde{M}_i = \max_{1 \le j \le p} \frac{(x_{ij} - \bar{x}_{\hat{I}_0, j})^2}{\tilde{d}_j}, \tag{12}$$
where x̄_{Î_0,j} is the sample mean of the j-th predictor and d̃_j is the j-th diagonal element of the covariance matrix S_{Î_0}. It can be shown that, under certain conditions and under the null hypothesis, M̃_i still has the same asymptotic distribution as (6). We then use the multiple testing method once more to identify the real outliers, based on an appropriately chosen significance level δ. By the asymptotic distribution of M̃_i, the i-th point is deemed outlying if
$$\tilde{M}_i \ge 2\log p - \log\log p + q_\delta, \tag{13}$$
where q_δ is the (1 − δ)-quantile of the type I extreme value distribution with cumulative distribution function F(x) = exp(−π^{−1/2} e^{−x/2}).
2.5. Algorithm
The proposed procedure for outlier detection using the refined least trimmed max-normal distance (R-LTMD in short) is outlined in the following steps (a minimal code sketch of the full procedure is given after the remarks below):
Step 1. Let H_1 be an initial subset of {1, …, n} of size h. Compute the sample mean and sample covariance of {X_i : i ∈ H_1}, and obtain M̂_i(H_1) for i = 1, …, n;

Step 2. Sort the distances as M̂_{π(1)}(H_1) ≤ ⋯ ≤ M̂_{π(n)}(H_1), where π is a permutation of {1, …, n}. Denote H_2 = {π(1), …, π(h)} of size h;

Step 3. Compute the max-normal distance measure based on H_2 and apply Step 2 again, which yields an iterative process with subsets H_2, H_3, …, H_t, repeated until convergence, i.e. H_{t+1} = H_t. Then H_t is the ultimate clean subset, denoted Ĥ = H_t;

Step 4. Compute the max-normal distance measure M̂_i = M̂_i(Ĥ) for i = 1, …, n, set a significance level α, and apply the multiple testing procedure; a clean set Î_0 and a suspicious set Î_1 are obtained from the threshold rule of Equation (11);

Step 5. Calculate the sample mean and sample covariance matrix based on Î_0, and compute the max-normal distance measure M̃_i for i = 1, …, n;

Step 6. Apply the multiple testing procedure again at significance level δ, and identify outlying observations by the threshold rule of Equation (13).
Remark
In Step 1 of the algorithm, the initial index set H_1 is randomly selected from {1, …, n}. Simulation experiments and Proposition 2.2 show that the initial set does not affect the results, so in this work we randomly select an initial set of cardinality h;
In Step 3 of the algorithm, the iterative process converges, as guaranteed by Proposition 2.2. In our actual computations, the algorithm generally reaches convergence in no more than 10 iterations.
In Step 4 and Step 6, the max-normal distance measure is computed from the consistent mean and covariance matrix estimators of the clean subset, so the algorithm satisfies the consistency requirement of Proposition 2.3.
Steps 5 to 6 could be iterated until no observation in the clean set is flagged as outlying and no further observation in the suspicious set is flagged as non-outlying. Considering the computational cost, we perform the confirmation step only once, and the simulation results show that this one-step refined algorithm effectively controls the Type I error of identification.
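Putting Steps 1–6 together, the following self-contained Python sketch illustrates one way the R-LTMD procedure could be implemented. The subset size h, the number of random starts, and the default levels α and δ are our assumptions, not the paper's exact choices.

```python
import numpy as np

def r_ltmd(X, alpha=0.05, delta=0.05, n_starts=20, seed=0):
    """Hedged sketch of the R-LTMD procedure (Steps 1-6); defaults for h,
    delta and the number of random starts are assumptions."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = n // 2 + 1                        # clean-subset size (an assumption)

    def measure(mu, d):                   # max-normal measure (7)/(12)
        return (((X - mu) ** 2) / d).max(axis=1)

    def stats(idx):                       # subset mean and diagonal variances
        sub = X[idx]
        return sub.mean(axis=0), sub.var(axis=0, ddof=1)

    def cutoff(level):                    # threshold of (11)/(13)
        q = -np.log(np.pi) - 2 * np.log(np.log(1 / (1 - level)))
        return 2 * np.log(p) - np.log(np.log(p)) + q

    best_H, best_obj = None, np.inf
    for _ in range(n_starts):             # Steps 1-3: concentration steps
        H = rng.choice(n, size=h, replace=False)
        while True:
            dist = measure(*stats(H))
            H_new = np.argsort(dist)[:h]
            if set(H_new) == set(H):
                break
            H = H_new
        obj = np.sort(measure(*stats(H)))[:h].sum()   # objective (8)
        if obj < best_obj:
            best_H, best_obj = H, obj

    clean = measure(*stats(best_H)) < cutoff(alpha)   # Step 4: clean set
    M_tilde = measure(*stats(np.where(clean)[0]))     # Step 5: reweighting
    return M_tilde >= cutoff(delta)                   # Step 6: outlier flags
```

A call like `r_ltmd(X, alpha=0.05, delta=0.05)` returns a boolean vector flagging the detected outliers.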
3. Simulation
In this section, we generate p-dimensional data in which n − n_out observations are drawn from the null normal distribution N_p(0_p, Σ) and the remaining n_out observations are drawn from the contaminated distribution N_p(kγ, Σ), where k is a parameter reflecting the magnitude of the abnormality, 0_p is the p-dimensional zero vector, and γ is a normalized p-dimensional random vector with independent components. The following two cases of outliers are considered (a code sketch of this data generation follows the setup below).
- Case (a), mean dense: let γ be a normalized p-dimensional random vector with independent components drawn from a uniform distribution;
- Case (b), mean sparse: let γ be a normalized p-dimensional random vector in which ⌊mp⌋ randomly chosen components are drawn from a uniform distribution and the others are all zero, where m is a tuning parameter controlling sparsity.
Let D be a diagonal matrix with diagonal elements d_1, …, d_p, and write Σ = D^{1/2} R D^{1/2} with correlation matrix R = (r_{ij})_{p×p}. Following the covariance settings of Cai et al. [2] and Ro et al. [15], we consider four cases for Σ:

(i) autoregressive (AR) correlation, with r_{ij} = ρ^{|i−j|} for 1 ≤ i, j ≤ p and a fixed coefficient ρ;
(ii) sparse Σ, in which only a small proportion of the off-diagonal entries of R are nonzero;
(iii) non-sparse Σ, in which the off-diagonal entries of R do not vanish, so that R is not sparse;
(iv) moving average (MA) structure, in which each coordinate is a moving average of independent innovations with a fixed lag.
Suppose that among the n observations there are n_out outlying observations and n − n_out non-outlying observations, with n_out ranging over 2%–20% of all observations. The Type I error value is defined as the proportion of non-outlying observations incorrectly identified as outlying, and the power value as the proportion of outlying observations correctly identified. Under the same settings, we evaluate the performance of the proposed procedure and compare it with several existing methods: the principal component outlier detection (PCOUT) of Filzmoser et al. [7], the minimum diagonal product (RMDP) of Ro et al. [15], and the maximum thresholding method (SMDP) of Yang et al. [19]. All simulation results are obtained from 1,000 replications in Matlab 2019b.
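For reference, the following sketch generates one contaminated data set for Cases (a)/(b) under the AR covariance (i). The uniform range, the value of ρ, and the normalization of γ are placeholders for constants that the text above leaves unspecified.

```python
import numpy as np

def simulate(n=100, p=500, n_out=10, k=10, m=0.2, rho=0.5,
             sparse_mean=True, seed=0):
    """Generate one data set as in Section 3 (Cases (a)/(b), AR covariance);
    rho, the uniform range and the normalization are our assumptions."""
    rng = np.random.default_rng(seed)
    R = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1)
    L = np.linalg.cholesky(R)
    X = rng.standard_normal((n, p)) @ L.T           # null rows: N_p(0, Sigma)

    gamma = rng.uniform(0, 1, size=p)               # mean-shift direction
    if sparse_mean:                                 # Case (b): only ~m*p entries
        mask = np.zeros(p)
        mask[rng.choice(p, size=int(m * p), replace=False)] = 1.0
        gamma *= mask
    gamma /= np.linalg.norm(gamma)                  # normalize the direction
    X[:n_out] += k * gamma                          # contaminate first rows
    return X, np.arange(n) < n_out                  # data, true outlier labels
```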
Table 1 reports the average Type I error values of the proposed R-LTMD procedure on simulated data with n = 100, p = 200, 500, 1000, and nominal significance level α chosen as 0.01, 0.05, and 0.1. The results show that the empirical Type I error values are close to the nominal ones; in other words, the Type I error is well controlled in both sparse and dense cases. The standard deviations of the Type I error values are listed in parentheses.
Table 1.
Average percentage of Type I error values (%) of the R-LTMD procedure under different choices of the number of outliers, the dimension p, and the significance level α, when n = 100, with k = 3 in Case (a) and k = 10 in Case (b).
| Case | Cor | p | α=0.01 | α=0.05 | α=0.1 | α=0.01 | α=0.05 | α=0.1 |
|---|---|---|---|---|---|---|---|---|
| a | (i) | 200 | 0.9(0.9) | 5.0(2.4) | 9.9(3.6) | 1.2(1.2) | 5.5(2.5) | 10.8(3.9) |
| | | 500 | 1.2(1.1) | 5.6(2.9) | 11.1(3.8) | 1.5(1.3) | 6.0(3.0) | 12.3(3.6) |
| | | 1000 | 1.4(1.3) | 6.4(3.1) | 13.5(4.2) | 1.8(1.5) | 7.3(3.0) | 13.4(4.0) |
| | (ii) | 200 | 1.0(1.1) | 4.7(2.4) | 10.6(3.9) | 1.1(1.3) | 5.4(2.6) | 10.7(4.0) |
| | | 500 | 1.3(1.1) | 5.5(2.8) | 11.7(3.6) | 1.4(1.4) | 6.2(3.0) | 12.4(3.9) |
| | | 1000 | 1.3(1.0) | 6.1(2.9) | 13.2(4.0) | 1.6(1.4) | 7.3(3.4) | 14.2(4.1) |
| | (iii) | 200 | 0.9(0.9) | 5.0(2.7) | 9.7(3.4) | 1.2(1.2) | 5.1(2.4) | 10.7(3.8) |
| | | 500 | 1.1(1.0) | 5.8(2.5) | 11.8(3.8) | 1.2(1.3) | 6.3(3.0) | 12.1(4.5) |
| | | 1000 | 1.4(1.2) | 6.5(2.6) | 12.8(4.0) | 1.7(1.5) | 7.5(3.0) | 12.7(3.6) |
| | (iv) | 200 | 1.0(0.9) | 5.0(2.7) | 10.7(4.2) | 1.1(1.1) | 5.6(2.9) | 10.5(4.1) |
| | | 500 | 1.1(1.0) | 5.5(2.6) | 11.5(3.9) | 1.2(1.2) | 5.6(2.8) | 11.8(4.4) |
| | | 1000 | 1.1(1.0) | 6.0(3.1) | 12.0(3.9) | 1.3(1.3) | 6.7(3.1) | 12.4(4.4) |
| b | (i) | 200 | 0.9(1.0) | 5.1(2.5) | 10.0(3.8) | 1.0(1.0) | 5.0(2.7) | 9.6(3.7) |
| | | 500 | 1.2(1.2) | 5.0(2.7) | 11.5(3.7) | 1.2(1.4) | 5.3(2.6) | 11.0(3.8) |
| | | 1000 | 1.2(1.1) | 6.2(3.8) | 12.3(3.8) | 1.3(1.2) | 6.1(3.1) | 12.3(3.9) |
| | (ii) | 200 | 1.0(0.9) | 5.0(2.5) | 10.0(3.6) | 0.9(1.0) | 5.3(2.5) | 10.1(3.6) |
| | | 500 | 1.2(1.1) | 5.5(2.5) | 11.5(3.5) | 1.1(1.3) | 5.2(2.8) | 11.0(3.8) |
| | | 1000 | 1.4(1.2) | 5.9(2.8) | 12.3(3.8) | 1.3(1.2) | 5.9(2.9) | 12.1(3.7) |
| | (iii) | 200 | 1.0(1.0) | 4.9(2.3) | 10.2(3.3) | 1.0(1.0) | 4.5(2.6) | 9.9(3.8) |
| | | 500 | 1.1(1.2) | 5.7(2.6) | 11.0(3.5) | 1.1(1.3) | 5.3(2.7) | 11.0(4.2) |
| | | 1000 | 1.4(1.2) | 6.3(2.8) | 12.2(3.8) | 1.4(1.3) | 6.3(3.1) | 12.2(3.9) |
| | (iv) | 200 | 1.0(1.0) | 4.9(2.3) | 10.2(3.3) | 1.0(1.0) | 4.5(2.6) | 9.9(3.8) |
| | | 500 | 1.1(1.2) | 5.7(2.6) | 11.0(3.5) | 1.1(1.3) | 5.3(2.7) | 11.0(4.2) |
| | | 1000 | 1.4(1.2) | 6.3(2.8) | 12.2(3.8) | 1.4(1.3) | 6.3(3.1) | 12.2(3.9) |
Table 2 reports the average Type I error and power values for the four methods. In Case (a), the dense setting, the results show that, compared with the other three methods, our method controls the Type I error values better. Admittedly, in correlation cases (ii) and (iii), our method does not reach 100% power, but the values are very close to 100%. Among the competitors, the SMDP method does not control the Type I error well. In Case (b), the identification power of the proposed R-LTMD method is much better than that of the RMDP and PCOUT methods. As the dimension p increases from 100 to 500, the Type I error values of our method remain near 5% in most cases; the power values decrease with p, but in most cases remain higher than those of the other three methods.
Table 2.
Type I error values (%) and power values (%) of the various procedures when n = 100 and α = 0.05, with k = 3 in Case (a), and k = 10, m = 0.2 in Case (b).
| | | | R-LTMD | | RMDP | | SMDP | | PCOUT | |
|---|---|---|---|---|---|---|---|---|---|---|
| Case | Cor | p | Type I error | Power | Type I error | Power | Type I error | Power | Type I error | Power |
| a | (i) | 100 | 4.7 | 100.0 | 6.5 | 100.0 | 9.1 | 100.0 | 3.6 | 100.0 |
| | | 200 | 4.8 | 100.0 | 6.3 | 100.0 | 10.5 | 100.0 | 3.7 | 100.0 |
| | | 500 | 5.3 | 100.0 | 5.4 | 100.0 | 12.7 | 100.0 | 3.6 | 100.0 |
| | (ii) | 100 | 4.4 | 92.7 | 6.3 | 100.0 | 5.5 | 100.0 | 3.3 | 100.0 |
| | | 200 | 4.9 | 97.2 | 6.1 | 100.0 | 5.7 | 100.0 | 3.4 | 100.0 |
| | | 500 | 5.3 | 99.9 | 5.5 | 100.0 | 8.2 | 100.0 | 3.2 | 100.0 |
| | (iii) | 100 | 4.4 | 91.8 | 5.8 | 100.0 | 5.4 | 100.0 | 3.9 | 100.0 |
| | | 200 | 4.9 | 96.9 | 5.6 | 100.0 | 6.4 | 100.0 | 3.4 | 100.0 |
| | | 500 | 5.3 | 99.9 | 5.2 | 100.0 | 7.9 | 100.0 | 3.7 | 100.0 |
| | (iv) | 100 | 5.3 | 99.9 | 5.2 | 100.0 | 7.9 | 100.0 | 3.7 | 100.0 |
| | | 200 | 5.0 | 100.0 | 5.8 | 100.0 | 18.5 | 100.0 | 5.6 | 100.0 |
| | | 500 | 5.6 | 100.0 | 4.8 | 100.0 | 31.7 | 100.0 | 4.5 | 100.0 |
| b | (i) | 100 | 4.7 | 94.1 | 6.0 | 71.2 | 9.1 | 84.6 | 7.3 | 36.0 |
| | | 200 | 4.9 | 88.6 | 6.6 | 61.0 | 10.4 | 80.3 | 6.8 | 14.4 |
| | | 500 | 5.4 | 84.6 | 5.2 | 62.2 | 13.6 | 90.3 | 7.0 | 13.2 |
| | (ii) | 100 | 4.5 | 87.4 | 6.5 | 59.4 | 5.9 | 71.6 | 7.6 | 26.2 |
| | | 200 | 4.9 | 73.8 | 6.3 | 50.1 | 6.5 | 67.2 | 7.4 | 15.2 |
| | | 500 | 5.5 | 67.2 | 5.7 | 48.0 | 8.2 | 76.8 | 7.6 | 10.8 |
| | (iii) | 100 | 4.6 | 83.2 | 6.2 | 61.2 | 5.5 | 70.4 | 7.9 | 23.5 |
| | | 200 | 4.9 | 74.8 | 6.3 | 48.5 | 6.6 | 66.3 | 7.4 | 13.5 |
| | | 500 | 5.4 | 69.7 | 5.8 | 49.3 | 8.7 | 76.4 | 8.9 | 11.3 |
| | (iv) | 100 | 4.8 | 94.1 | 6.5 | 59.0 | 19.4 | 87.6 | 5.7 | 92.0 |
| | | 200 | 5.2 | 95.2 | 6.1 | 31.2 | 26.2 | 86.4 | 5.0 | 84.3 |
| | | 500 | 5.7 | 93.4 | 5.0 | 16.6 | 31.7 | 93.6 | 5.6 | 84.1 |
To illustrate the effect of the degree of abnormality on the identification results, we let the anomaly parameter k range from 1 to 15 in the mean sparse case. Figure 1 has eight sub-figures, denoted (a)–(h). The four covariance settings (i)–(iv) correspond to the four rows of Figure 1; the left sub-figures (a), (c), (e), (g) record the trend of the Type I error, and the right sub-figures (b), (d), (f), (h) record the corresponding power. The left sub-figures show that the Type I error values are well controlled by the proposed R-LTMD procedure, while none of the other three methods controls the Type I error as well, especially in sub-figures (a) and (g), where their Type I error is significantly higher than the nominal significance level α. The four right sub-figures show that, for all four methods, the power increases with k and tends to 1; as k increases from 1.0 to 15.0, the power values rise quickly and converge to 1 in most cases. In terms of power, SMDP appears to perform better than the other three methods, but note that its Type I error is not well controlled.
Figure 1.
The trends of the Type I error and power values (%) of the various procedures when n = 100 and p = 500, as the parameter k increases.
In addition, to further compare the performance of the different methods on more complex data sets, suppose that among the n p-dimensional samples, n − n_out observations are generated from the null distribution N_p(0_p, Σ), where Σ is a covariance matrix with an AR structure and a fixed autoregressive correlation coefficient. We further simulate four additional contamination models, described by the following scenarios (a generation sketch in code follows the discussion of these cases below).
- Case (c), radial contamination schemes (Cerioli [3]): the contaminated observations are drawn from N_p(0_p, R), where all the diagonal components of R equal ψ and the off-diagonal components are the same as those of Σ. For simplicity, here and in the following Case (d) the mean is taken as 0_p.
- Case (d), covariance sparse case: similar to Case (c), the contaminated observations are generated from N_p(0_p, R), but only ⌊mp⌋ of the diagonal components of R equal ψ, where m is a tuning parameter of sparsity; the off-diagonal components are the same as those of Σ.
- Case (e), symmetric heavy-tailed outliers: the contaminated observations follow a Student's t-distribution with v degrees of freedom.
- Case (f), skew-distributed outliers: the contaminated observations follow a chi-square distribution with v degrees of freedom.
In Case (c), compared with the normal observations, the covariance of the contaminated observations changes densely, whereas in Case (d) it changes sparsely. In Case (e), the contaminated observations come from a symmetric heavy-tailed distribution. Clearly, in Cases (c)–(e) the contaminated observations have the same means as the normal observations generated from the null model, so they are not mean-shift outliers. In Case (f), the contaminated observations are generated from a skew distribution.
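A sketch of how the contaminated rows in Cases (c)–(f) might be generated is given below; the default values of ψ, m, and v are placeholders rather than the paper's settings, and ψ > 1 is assumed so that R stays positive definite.

```python
import numpy as np

def contaminate(n_out, p, Sigma, case, psi=3.0, m=0.4, v=3, seed=0):
    """Hedged sketch of the contamination schemes (c)-(f);
    psi, m, v defaults are placeholder values, not the paper's."""
    rng = np.random.default_rng(seed)
    if case in ("c", "d"):                       # inflated-variance outliers
        R = Sigma.copy()
        idx = np.arange(p) if case == "c" else rng.choice(
            p, size=int(m * p), replace=False)   # (d): only ~m*p coordinates
        R[idx, idx] = psi                        # set chosen diagonals to psi
        return rng.multivariate_normal(np.zeros(p), R, size=n_out)
    if case == "e":                              # heavy tails: Student's t
        return rng.standard_t(v, size=(n_out, p))
    if case == "f":                              # skewed: chi-square
        return rng.chisquare(v, size=(n_out, p))
    raise ValueError(case)
```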
Table 3 reports the average Type I error and power values for Cases (c)–(f). The results show that all the methods are generally able to identify the outliers, but the proposed R-LTMD procedure outperforms the other three, especially in Case (d), where the covariance of the contaminated observations changes sparsely and the other three methods cannot detect the outliers well. Our method not only identifies the outliers effectively but also controls the Type I error values better. Table 3 also shows that, as the dimensionality of the data increases, the power of the R-LTMD method increases toward 1 while its Type I error remains well controlled at about 0.05, which indicates that R-LTMD is effective for outlier detection in large-dimensional data.
Table 3.
Type I error and power values (%) of the various procedures when n = 100, with given ψ in Cases (c) and (d), m = 0.4 in Case (d), v = 3 in Case (e), and v = 1 in Case (f).
| | | R-LTMD | | RMDP | | SMDP | | PCOUT | |
|---|---|---|---|---|---|---|---|---|---|
| Case | p | Type I error | Power | Type I error | Power | Type I error | Power | Type I error | Power |
| c | 100 | 5.8 | 100.0 | 6.3 | 99.2 | 9.0 | 100.0 | 5.1 | 95.0 |
| | 200 | 5.8 | 100.0 | 5.9 | 100.0 | 10.2 | 100.0 | 4.6 | 98.1 |
| | 500 | 5.6 | 100.0 | 5.1 | 100.0 | 12.6 | 100.0 | 4.2 | 100.0 |
| d | 100 | 4.8 | 88.2 | 6.5 | 67.1 | 9.2 | 81.2 | 7.7 | 26.3 |
| | 200 | 4.8 | 94.1 | 6.1 | 66.7 | 10.5 | 88.9 | 7.6 | 18.1 |
| | 500 | 5.4 | 96.2 | 5.1 | 64.2 | 12.6 | 92.8 | 7.3 | 12.7 |
| e | 100 | 5.0 | 99.6 | 6.3 | 98.4 | 9.0 | 99.5 | 6.7 | 73.0 |
| | 200 | 5.0 | 100.0 | 5.8 | 99.2 | 10.3 | 100.0 | 5.5 | 74.8 |
| | 500 | 5.8 | 100.0 | 4.9 | 100.0 | 12.6 | 100.0 | 4.8 | 92.1 |
| f | 100 | 4.8 | 99.5 | 6.5 | 99.2 | 9.2 | 99.1 | 4.8 | 98.3 |
| | 200 | 5.5 | 100.0 | 6.1 | 99.5 | 10.3 | 100.0 | 3.9 | 100.0 |
| | 500 | 5.5 | 100.0 | 5.0 | 100.0 | 12.6 | 100.0 | 3.5 | 100.0 |
To illustrate the effectiveness of the one-step reweighting scheme in the proposed procedure, we conducted simulations whose results are given in Table 4, which tabulates the Type I error and power values of LTMD and R-LTMD under Cases (a)–(f). The results demonstrate the importance of the one-step confirmation procedure: it greatly reduces the Type I error values while the detection power does not decrease significantly. Thus the proposed R-LTMD procedure identifies most of the outliers while reducing the false identification rate, so the one-step refinement is an essential part of the proposed identification procedure. Overall, the proposed R-LTMD method is a good choice for detecting outliers in large-dimensional data: across different forms of outliers, it effectively controls the Type I error values while attaining higher detection power in most cases.
Table 4.
Performance comparison of LTMD and R-LTMD in terms of Type I error (%) and power values (%) under different choices of k, ψ, or v, when n = 100 and p = 500, with m = 0.4 in Case (b) and Case (d).
| | | LTMD | | R-LTMD | | LTMD | | R-LTMD | |
|---|---|---|---|---|---|---|---|---|---|
| Case | k, ψ, v | Type I error | Power | Type I error | Power | Type I error | Power | Type I error | Power |
| a | k = 2 | 16.3 | 98.4 | 6.0 | 93.7 | 15.2 | 98.0 | 5.4 | 92.0 |
| | k = 3 | 15.6 | 100.0 | 5.5 | 100.0 | 15.4 | 100.0 | 5.6 | 100.0 |
| | k = 4 | 16.1 | 100.0 | 5.8 | 100.0 | 15.2 | 100.0 | 5.6 | 100.0 |
| b | k = 10 | 14.3 | 100.0 | 5.0 | 100.0 | 14.1 | 87.6 | 4.9 | 86.5 |
| | k = 15 | 14.7 | 100.0 | 5.3 | 100.0 | 14.4 | 83.4 | 5.0 | 81.8 |
| | k = 20 | 14.8 | 100.0 | 5.2 | 100.0 | 14.1 | 83.1 | 4.9 | 81.4 |
| c | | 15.8 | 98.6 | 5.5 | 94.4 | 16.0 | 99.0 | 6.1 | 93.0 |
| | | 15.6 | 100.0 | 5.6 | 100.0 | 15.3 | 100.0 | 6.0 | 100.0 |
| | | 16.0 | 100.0 | 5.9 | 100.0 | 15.8 | 100.0 | 5.8 | 100.0 |
| d | | 15.6 | 96.4 | 5.5 | 94.8 | 15.1 | 96.0 | 6.3 | 93.7 |
| | | 16.2 | 99.5 | 6.0 | 99.1 | 15.5 | 99.7 | 6.6 | 99.4 |
| | | 16.3 | 99.9 | 6.0 | 99.7 | 15.6 | 99.9 | 6.2 | 99.8 |
| e | v = 1 | 15.9 | 100.0 | 5.8 | 100.0 | 14.5 | 100.0 | 6.1 | 100.0 |
| | v = 3 | 15.6 | 100.0 | 5.5 | 100.0 | 15.3 | 100.0 | 6.5 | 100.0 |
| | v = 5 | 16.5 | 100.0 | 5.9 | 100.0 | 15.5 | 99.9 | 6.5 | 99.4 |
| f | v = 1 | 15.4 | 100.0 | 5.4 | 100.0 | 15.1 | 100.0 | 6.2 | 100.0 |
| | v = 3 | 16.0 | 100.0 | 5.7 | 100.0 | 15.0 | 100.0 | 6.3 | 100.0 |
| | v = 5 | 16.5 | 100.0 | 6.0 | 100.0 | 15.2 | 100.0 | 6.0 | 100.0 |
4. Real data examples
4.1. Riboflavin data
In this example, we apply the proposed R-LTMD procedure to identify outliers in the riboflavin data set. The data set contains 71 observations of 4088 predictors (gene expressions) and a response (riboflavin production). The data can be loaded from the 'hdi' package in R, and more information can be found at https://rdrr.io/rforge/hdi/man/riboflavin.html. The data set has been analyzed in much of the biostatistics literature, for example by Bühlmann et al. [1], Dezeure et al. [5], and Shi et al. [17].
Since outlying points inevitably exist in any data set, we assume that the riboflavin production data are contaminated with some outlying observations. Note that the dimension p = 4088 is much larger than the sample size n = 71. We follow the sure independence screening (SIS) procedure of Fan and Lv [6] and select the 500 gene expression variables most correlated with the riboflavin production. The analysis data thus contain n = 71 observations of p = 500 predictors, written as a 71 × 500 data matrix whose i-th row is the observation X_i for i = 1, …, 71.
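The screening step can be written in a few lines; this sketch ranks predictors by absolute marginal Pearson correlation with the response, which is the simplest form of SIS in Fan and Lv [6]. Variable names are ours; here `X` would be the 71 × 4088 expression matrix and `y` the riboflavin production.

```python
import numpy as np

def sis_top_k(X, y, k=500):
    """Keep the k predictors with the largest absolute marginal
    correlation with the response (sure independence screening)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    top = np.argsort(-np.abs(corr))[:k]   # indices of the k strongest signals
    return X[:, top], top
```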
Next, we apply the proposed R-LTMD procedure together with the RMDP procedure of Ro et al. [15] to identify the outlying gene expression observations at a given significance level α. R-LTMD detects a total of 14 suspected outlying points, and the RMDP procedure also detects 14 suspected outlying points. The union of the points detected by the two methods contains 17 observations, which we regard as suspected abnormal data; the intersection contains 11 data points, which we deem outliers. Figure 2 shows the distance measures of the R-LTMD and RMDP methods, respectively. The horizontal lines in Figure 2 represent the thresholds obtained from the given significance level and the asymptotic distributions under the null hypothesis. The larger measure values above the threshold line correspond to the suspected outlying observations, which we mark as black solid dots.
Figure 2.
The distance measures of R-LTMD (a) and RMDP (b).
To further evaluate the impact of the identified outliers, one could compare the Lasso estimates obtained with and without these points. In any case, before formal data analysis it is important to pay close attention to the observations flagged as outlying.
4.2. Glass data
The glass data [14] contain Electron Probe X-ray Microanalysis (EPXMA) intensities at different wavelengths for archaeological glass vessels from the sixteenth and seventeenth centuries. The data frame has 180 observations of 750 variables and can be regarded as a 180 × 750 data matrix, where the variables correspond to the EPXMA intensities at different wavelengths. Hubert et al. [13] analyzed the glass data with the RobPCA method and identified some row-wise outliers. Hubert et al. [12] also analyzed this data set with their MacroPCA procedure and flagged some of the glass samples (between rows 22 and 30, rows 57–63, and rows 74–76) as outliers, a result consistent with that of Hubert et al. [13].
In this work, we use the proposed R-LTMD method to perform outlier identification on this data set. First, we preprocess the data and select the 100 EPXMA intensity variables most related to the first variable, and we set a significance level α. Applying the R-LTMD algorithm to the processed data set gives the identification results shown in Figure 3, where the black solid points represent the distance measures of the abnormal points and the horizontal line represents the threshold calculated by R-LTMD. A total of 18 outliers are detected, namely the samples whose distance measure exceeds the threshold, and the identification result is consistent with the results of Hubert et al. [13] and Hubert et al. [12]. This example again illustrates that the proposed R-LTMD method is useful for detecting outliers in a large-dimensional data matrix.
Figure 3.
The distance measures of the R-LTMD procedure.
5. Discussion and conclusion
This work studied the outlier detection problem for large-dimensional data. In the first stage, we extended the well-known least trimmed squares algorithm to search for a clean subset Ĥ of size h and defined the max-normal distance measure M̂_i(Ĥ), i = 1, …, n. The outlier identification problem can be cast as a multiple hypothesis testing problem, and the asymptotic distribution of the distance measure was derived under the null hypothesis. Furthermore, a one-step confirmation procedure was applied to reduce the Type I error. Through simulation studies and real data analysis, we verified the effectiveness of the method: compared with other existing methods, R-LTMD has clear advantages in controlling the Type I error and also good power for identifying outlying points, especially for large-dimensional sparse data.
Acknowledgments
We would like to thank the editor, the associate editor and the two anonymous referees for their constructive comments and suggestions that have considerably improved this paper.
Funding Statement
This research was supported by the National Natural Science Foundation of China [grant numbers 12071233, 11926347, 11971247, 11771187], the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province, China [grant number 18KJB110003], the Jiangsu Provincial Government Scholarship Program [grant number 2019-43], and the University Nursing Program for Young Scholars with Creative Talents in Heilongjiang Province (UNPYSCT-2020011).
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Bühlmann P., Kalisch M., and Meier L., High-dimensional statistics with a view towards applications in biology, Annu. Rev. Stat. Appl. 1 (2014), pp. 255–278.
- 2. Cai T.T., Liu W., and Xia Y., Two-sample test of high dimensional means under dependence, J. R. Stat. Soc. Ser. B Stat. Methodol. 76 (2014), pp. 349–372.
- 3. Cerioli A., Multivariate outlier detection with high-breakdown estimators, J. Am. Stat. Assoc. 105 (2010), pp. 147–156.
- 4. Chatterjee S. and Hadi A.S., Sensitivity Analysis in Linear Regression, John Wiley and Sons, New York, 1988.
- 5. Dezeure R., Bühlmann P., Meier L., and Meinshausen N., High-dimensional inference: confidence intervals, p-values and R-software hdi, Stat. Sci. 30 (2015), pp. 533–558.
- 6. Fan J. and Lv J., Sure independence screening for ultrahigh dimensional feature space, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 70 (2008), pp. 849–911.
- 7. Filzmoser P., Maronna R., and Werner M., Outlier identification in high dimensions, Comput. Stat. Data Anal. 52 (2008), pp. 1694–1711.
- 8. Fritsch V., Varoquaux G., Thyreau B., Poline J.B., and Thirion B., Detecting outlying subjects in high-dimensional neuroimaging datasets with regularized minimum covariance determinant, in International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Berlin, Heidelberg, 2011, pp. 264–271.
- 9. Grubbs F.E., Procedures for detecting outlying observations in samples, Technometrics 11 (1969), pp. 1–21.
- 10. Hampel F.R., Ronchetti E.M., Rousseeuw P.J., and Stahel W.A., Robust Statistics: The Approach Based on Influence Functions, John Wiley and Sons, New York, 2005.
- 11. Hawkins D.M., Identification of Outliers, Chapman and Hall, London, 1980.
- 12. Hubert M., Rousseeuw P.J., and Vanden Bossche W., MacroPCA: An all-in-one PCA method allowing for missing values as well as cellwise and rowwise outliers, Technometrics 61 (2019), pp. 459–473.
- 13. Hubert M., Rousseeuw P.J., and Vanden Branden K., ROBPCA: A new approach to robust principal component analysis, Technometrics 47 (2005), pp. 64–79.
- 14. Lemberge P., De Raedt I., Janssens K., Wei F., and Van Espen P.J., Quantitative Z-analysis of 16th–17th century archaeological glass vessels using PLS regression of EPXMA and μ-XRF data, J. Chemom. 14 (2000), pp. 751–763.
- 15. Ro K., Zou C., Wang Z., and Yin G., Outlier detection for high-dimensional data, Biometrika 102 (2015), pp. 589–599.
- 16. Rousseeuw P.J. and Van Driessen K., Computing LTS regression for large data sets, Data Min. Knowl. Discov. 12 (2006), pp. 29–45.
- 17. Shi C., Song R., Lu W., and Li R., Statistical inference for high-dimensional models via recursive online-score estimation, J. Am. Stat. Assoc. 116 (2020), pp. 1–12.
- 18. Srivastava M.S. and Du M., A test for the mean vector with fewer observations than the dimension, J. Multivar. Anal. 99 (2008), pp. 386–402.
- 19. Yang X., Wang Z., and Zi X., Thresholding-based outlier detection for high-dimensional data, J. Stat. Comput. Simul. 88 (2018), pp. 2170–2184.