Abstract
Dynamic principal component analysis (DPCA), also known as frequency domain principal component analysis, was developed by Brillinger [Time Series: Data Analysis and Theory, Vol. 36, SIAM, 1981] to decompose multivariate time-series data into a few principal component series. A primary advantage of DPCA is its ability to extract essential components from the data by reflecting their serial dependence. It is also used to estimate the common component in a dynamic factor model, which is frequently used in econometrics. However, this beneficial property cannot be exploited when missing values are present, since they cannot simply be ignored when estimating the spectral density matrix in the DPCA procedure. Based on a novel combination of conventional DPCA and the self-consistency concept, we propose a DPCA method for data with missing values. We demonstrate the advantage of the proposed method over some existing imputation methods through Monte Carlo experiments and real data analysis.
Keywords: Dynamic principal component analysis, spectral density matrix, missing problem, frequency domain principal component analysis, dynamic factor model
1. Introduction
Dynamic principal component analysis (DPCA) was originally proposed by Brillinger [4,5]. It generalizes static PCA by accounting for the serial dependence of the data and reduces the dimension of time series in the frequency domain while preserving as much information as possible. One of its most common uses is the estimation of factors in the dynamic factor model (DFM). The DFM assumes that a multivariate time series is generated from a linear combination of common factors of a lower dimension, and these common factors, unlike the ones in the classical static factor model, have their own dynamics. Sargent and Sims [25] and Geweke [15] first extended the classical factor model to dynamic models, and several researchers developed various versions of the DFM [9,27,28].
Although DFMs and DPCAs differ when the number of variables is small, their differences diminish as the number of variables increases [1,2,29]. For example, Forni et al. [14] developed a non-parametric DFM procedure in the frequency domain, based on dynamic principal components, and Bai and Ng [3] provided a comprehensive review of large-sample results for high-dimensional factor models estimated via dynamic PCA. More recently, Doz et al. [8] proposed a two-step estimation method based on a DFM with the factor loadings set equal to the eigenvectors associated with a set of principal components, and Eichler et al. [12] estimated the common components of the DFM by a time-varying principal components approach in the frequency domain.
Not only has DPCA been widely used to estimate factors in DFMs, but it is also useful by itself. DPCA was developed to replace static PCA when analyzing dynamic data. One major shortcoming of static PCA is its lack of focus on time dependence, and several problems can arise when applying static PCA to dynamic data directly [7]. DPCA addresses the dynamic problem by modeling the auto-correlation structure present in the data [19]. Therefore, it can recover the dynamics of a group of time series efficiently and has become popular in various fields including macroeconomics and chemometrics [6,13]. Numerous methods have been derived from DPCA. Stoffer [31] proposed a spectral envelope using DPCA to detect common signals of multiple time series, and Hörmann et al. [16] suggested a functional version of DPCA to analyze functional time-series data such as particulate matter with an aerodynamic diameter of less than 10 µm (PM10). More recently, Peña and Yohai [24] discussed a new perspective and generalization of principal component series in DPCA and further developed a robust version of it.
DPCA is successful in analyzing time series; however, it loses its efficiency in the presence of missing values. For conventional PCA, there have been some studies on the missing-data problem [10,17,26], but there are few studies on DPCA for missing data [18,30]. In macroeconomic time-series data, missing values are common because some variables start being observed later or are observed at a different interval than the others.
Since the dynamic principal components are obtained from the eigenanalysis of the spectral density matrix, we may apply imputation methods for spectral density estimation [22,32], which enable DPCA in the presence of missing values. More recently, Lee and Zhu [21] proposed a spectrum estimation method based on the self-consistency concept, which imputes the missing data repeatedly until the estimate of the spectrum stabilizes. In this paper, we modify the spectrum estimation method of [21] and propose a new algorithm that enables DPCA for incompletely observed data.
We organize the rest of this paper as follows. Section 2 reviews DPCA, and in Section 3, we propose an imputation algorithm for DPCA, which is the main contribution of this study. Section 4 compares the proposed method with other existing imputation methods, such as linear interpolation and the Kalman smoothing of Durbin and Koopman [11], through Monte Carlo experiments. In Section 5, we apply the new method to two real datasets. Finally, concluding remarks are given in Section 6.
2. Dynamic principal component analysis
First of all, we briefly review traditional static PCA, which is performed in the time domain. Let $\mathbf{X}$ be a $p$-dimensional random vector with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Static PCA aims at reducing the dimension of the given data while preserving as much information as possible. That is, static PCA finds a projection of $\mathbf{X}$ onto a $q$-dimensional subspace ($q < p$) from which the original data can be reconstructed most closely. The projected and reconstructed data are expressed as $\boldsymbol{\xi} = A(\mathbf{X} - \boldsymbol{\mu})$ and $\hat{\mathbf{X}} = \boldsymbol{\mu} + B\boldsymbol{\xi}$, respectively, where $A$ is a $q \times p$ matrix and $B$ is a $p \times q$ matrix. Then, the reconstruction criterion is defined as

$$\mathbb{E}\,\big\| \mathbf{X} - \boldsymbol{\mu} - BA(\mathbf{X} - \boldsymbol{\mu}) \big\|^2 \qquad (1)$$

and the principal components are obtained by minimizing it. The solutions are given by

$$A = (\mathbf{v}_1, \ldots, \mathbf{v}_q)^\top, \qquad B = A^\top,$$

where $\mathbf{v}_j$ is the eigenvector of $\boldsymbol{\Sigma}$ corresponding to the $j$-th largest eigenvalue $\lambda_j$ ($\lambda_1 \ge \cdots \ge \lambda_p$). Hence, we obtain the $j$-th principal component of $\mathbf{X}$ as $\xi_j = \mathbf{v}_j^\top (\mathbf{X} - \boldsymbol{\mu})$, $j = 1, \ldots, q$.
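As a concrete illustration, the following minimal NumPy sketch (with simulated data; the dimensions and sample size are arbitrary choices, not from the paper) computes the static principal components from the eigenvectors of the sample covariance matrix and checks that the mean squared reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, n = 5, 2, 20000

# Sample data with correlated coordinates (illustrative only).
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

# Eigendecomposition of the covariance; reorder to descending eigenvalues.
lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]

A = V[:, :q].T                 # q x p projection matrix
B = V[:, :q]                   # p x q reconstruction matrix
xi = (X - mu) @ A.T            # principal component scores
X_hat = mu + xi @ B.T          # reconstruction from q components

# The mean squared reconstruction error equals (up to the 1/n vs 1/(n-1)
# normalization of the sample covariance) the sum of the discarded eigenvalues.
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
```

This makes the optimality of criterion (1) tangible: no other rank-$q$ linear reconstruction can achieve a smaller error than the trailing-eigenvalue sum.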
However, static PCA cannot capture the dependence structure across time points when the data have a time-dependent structure. Therefore, DPCA has been developed to analyze data that have a dynamic structure. To review DPCA, we summarize Chapter 9 of Brillinger [5] from the perspective of dimension reduction and reconstruction of time series, which is closely related to the static PCA criterion in (1).
Suppose that we have a $p$ vector-valued second-order stationary time series $\{\mathbf{X}_t\}$ with zero mean, absolutely summable autocovariance function $\Gamma(u)$, and the corresponding spectral density matrix
$$f(\omega) = \frac{1}{2\pi} \sum_{u=-\infty}^{\infty} \Gamma(u)\, e^{-iu\omega}, \qquad -\pi \le \omega \le \pi.$$
Similarly to the time-domain case, DPCA reduces the dimension of the given time-series data so that the original series can be reconstructed as closely as possible. Hence, we consider minimizing the least squares criterion of reconstruction error defined as

$$\mathbb{E}\,\Big\| \mathbf{X}_t - \sum_{u=-\infty}^{\infty} \mathbf{b}(t-u)\,\boldsymbol{\xi}_u \Big\|^2 \qquad (2)$$

where $\{\mathbf{b}(u)\}$ is a $p \times q$ filter and $\boldsymbol{\xi}_t = \sum_u \mathbf{c}(t-u)^\top \mathbf{X}_u$ is the $q$-dimensional principal component series with filter $\{\mathbf{c}(u)\}$. Then, the solution is given by

$$\mathbf{c}_j(u) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \overline{V_j(\omega)}\, e^{iu\omega}\, d\omega, \qquad \mathbf{b}_j(u) = \frac{1}{2\pi}\int_{-\pi}^{\pi} V_j(\omega)\, e^{iu\omega}\, d\omega,$$

where $V_j(\omega)$ is the eigenvector of $f(\omega)$ corresponding to the $j$-th largest eigenvalue $\lambda_j(\omega)$. Furthermore, the $j$-th PC series has $\lambda_j(\omega)$ as its spectrum, and it has zero coherence with the $k$-th PC series for $j \ne k$. A detailed derivation can be found in Brillinger [5].
DPCA is also closely related to the DFM [14]. Under the assumptions in Forni et al. [14], the DFM can be written as
$$\mathbf{X}_t = \mathbf{B}(L)\,\mathbf{u}_t + \boldsymbol{\epsilon}_t,$$
where $\mathbf{u}_t$ is a $q$-dimensional common shock (or common factor, in statistical terms), $\mathbf{B}(L)$ is a $p \times q$ filter in the lag operator $L$, and $\boldsymbol{\epsilon}_t$ is a zero-mean stationary process called the idiosyncratic component. Our main interest lies in estimating the common component $\boldsymbol{\chi}_t = \mathbf{B}(L)\mathbf{u}_t$, and its estimator is computed as the orthogonal projection of $\mathbf{X}_t$ on the minimal closed subspace of the Hilbert space containing the first $q$ PC series. It converges to $\boldsymbol{\chi}_t$ in mean square as $p$ tends to infinity [14].
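The whole pipeline — spectral density estimation, frequency-by-frequency eigenanalysis, and reconstruction of the common component — can be sketched as follows. This is a simplified illustration, not the authors' code: the data generating process, the flat smoothing kernel, and its bandwidth are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
T, p = 1024, 3

# A p-variate series driven by one common AR(1) factor plus noise
# (loadings and noise scale are illustrative assumptions).
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.8 * u[t - 1] + rng.standard_normal()
load = rng.uniform(0.5, 1.5, p)
X = np.outer(u, load) + 0.3 * rng.standard_normal((T, p))
X -= X.mean(axis=0)

# Smoothed-periodogram estimate of the p x p spectral density matrix
# at each nonnegative Fourier frequency (flat Daniell-type kernel).
F = np.fft.rfft(X, axis=0)
P = np.einsum('fi,fj->fij', F, np.conj(F)) / (2 * np.pi * T)
h = 8
Ppad = np.pad(P, ((h, h), (0, 0), (0, 0)), mode='edge')
S = np.stack([Ppad[k:k + 2 * h + 1].mean(axis=0) for k in range(F.shape[0])])

# Frequency-by-frequency eigenanalysis (eigh returns ascending eigenvalues).
w, V = np.linalg.eigh(S)
v1 = V[:, :, -1]                 # leading dynamic eigenvector per frequency

# Rank-one common component: project the DFT on v1(w) and invert.
# The projection v1 v1^H F is invariant to the arbitrary phase of v1.
score = np.einsum('fi,fi->f', np.conj(v1), F)    # first dynamic PC (freq. domain)
chi_hat = np.fft.irfft(v1 * score[:, None], n=T, axis=0)

# The estimate should track the true common part of each coordinate.
corr = np.corrcoef(chi_hat[:, 0], load[0] * u)[0, 1]
```

Note the projection at each frequency, rather than a single time-domain projection, is exactly what distinguishes DPCA from static PCA.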
3. Dynamic principal component analysis with missing values
Here we present a new approach for DPCA when the data contain missing values. As stated in the previous section, performing DPCA requires an estimate of the spectral density matrix; in the presence of missing values, this estimate may be distorted. To cope with this problem, we adopt an imputation method for estimating the spectral density matrix based on self-consistency [21] and then combine it with DPCA.
3.1. Review: Monte Carlo self-consistent imputation algorithm
It is important to start with the concept of self-consistency. When we approximate a random variable $X$ with another random variable $Y$, the approximation criterion $\mathbb{E}\|X - Y\|^2$ is reduced by replacing $Y$ with $\mathbb{E}(X \mid Y)$. Here, $\|\cdot\|$ is the $L_2$ norm. Therefore, the following inequality holds:
$$\mathbb{E}\,\|X - \mathbb{E}(X \mid Y)\|^2 \le \mathbb{E}\,\|X - Y\|^2,$$
and we say that $Y$ is self-consistent for $X$ if $\mathbb{E}(X \mid Y) = Y$ almost surely. This concept has been applied to find efficient estimators in various studies, such as wavelet image denoising with missing data [20] and the EM algorithm [23].
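A quick Monte Carlo check of this inequality, using a jointly normal pair $Y = X + \varepsilon$ for which $\mathbb{E}(X \mid Y) = Y/2$ (the example pair is ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
X = rng.standard_normal(n)        # target random variable
eps = rng.standard_normal(n)
Y = X + eps                       # a crude approximation of X

# For this jointly normal pair, E(X | Y) = Cov(X, Y) / Var(Y) * Y = Y / 2.
cond_mean = 0.5 * Y

mse_raw = np.mean((X - Y) ** 2)          # E||X - Y||^2, theoretically 1.0
mse_cond = np.mean((X - cond_mean) ** 2) # E||X - E(X|Y)||^2, theoretically 0.5
```

Replacing the approximation by its conditional expectation halves the error here, which is exactly the improvement the inequality guarantees.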
Recently, Lee and Zhu [21] developed a self-consistent estimation method of the spectral density function for missing data, termed the Monte Carlo Self-Consistent (MCSC) algorithm. The MCSC algorithm is a non-parametric, iterative spectrum estimation method. During the iteration, the missing values $\mathbf{X}^{\mathrm{miss}}$ are imputed with random numbers generated from the conditional normal distribution of the missing data given the observed data $\mathbf{X}^{\mathrm{obs}}$, and the spectrum is estimated from the imputed series. The iteration continues until the spectrum estimate converges, so that the self-consistency condition
$$\hat{f} = \mathbb{E}\big(\hat{f}^{\,c} \,\big|\, \mathbf{X}^{\mathrm{obs}}\big)$$
is satisfied, where $\hat{f}$ and $\hat{f}^{\,c}$ are the spectrum estimates obtained from the imputed and complete data, respectively.
3.2. Proposed DPCA
For a new DPCA method that handles missing values, we modify the MCSC algorithm, because uncertainty arises from the random numbers used to impute the unobserved values. This inherent randomness of the MCSC algorithm prevents the estimated spectrum from converging exactly. Therefore, we use the conditional expectation instead in our algorithm and apply DPCA to the converged estimate to obtain the dynamic principal components.
Here are the detailed implementation steps of the proposed DPCA method for missing data; we name the whole process the Iterative Self-Consistent (ISC) algorithm. Let $\mathbf{X}$ be the complete data, and let $\mathbf{X}^{\mathrm{obs}}$ and $\mathbf{X}^{\mathrm{miss}}$ be the observed and missing portions of $\mathbf{X}$, respectively. We assume that $\mathbf{X}$ has $M$ missing values, where $x^{\mathrm{miss}}_m$ denotes the $m$-th missing value, $m = 1, \ldots, M$.

- Step 1. To obtain the converged spectral density estimate of $\mathbf{X}$, repeat the following steps.
  - (a) Impute the missing data by linear interpolation and denote the imputed complete data by $\mathbf{X}^{(0)}$. Then, obtain the initial spectral density estimate $\hat{f}^{(0)}$ from $\mathbf{X}^{(0)}$.
  - (b) Iterate the following steps for $\ell = 1, 2, \ldots$, until $\| \hat{f}^{(\ell)} - \hat{f}^{(\ell-1)} \| < \epsilon$ for a pre-determined $\epsilon > 0$.
    - For $m = 1, \ldots, M$,
      - update the imputation of $x^{\mathrm{miss}}_m$ with the conditional mean $\mathbb{E}\big( x^{\mathrm{miss}}_m \mid \mathbf{X}^{\mathrm{obs}}; \hat{\gamma}^{(\ell-1)} \big)$, where $\gamma$ is the autocovariance function whose estimate $\hat{\gamma}^{(\ell-1)}$ is obtained by the inverse Fourier transform of $\hat{f}^{(\ell-1)}$;
      - obtain the $m$-th periodogram $I^{(\ell)}_m$ from the updated series;
      - apply kernel smoothing on $I^{(\ell)}_m$ and obtain $\hat{f}^{(\ell)}_m$.
    - Compute the $\ell$-th spectral density estimate $\hat{f}^{(\ell)}$ by averaging the smoothed periodograms $\hat{f}^{(\ell)}_m$, $m = 1, \ldots, M$.
- Step 2. Perform DPCA with the spectral density matrix obtained from the converged imputed complete data.
For an illustration of the above algorithm, Figure 1 shows the flow of estimating $\hat{f}^{(\ell)}$ in Step 1.

Figure 1.
Estimation procedure of $\hat{f}^{(\ell)}$ from $\hat{f}^{(\ell-1)}$.
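A univariate sketch of the iteration may clarify the idea. This is a simplified stand-in for the ISC algorithm, not the authors' implementation: it updates the whole missing set jointly rather than per missing value, and the kernel, bandwidth, tolerance, and ridge term are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
T, phi = 300, 0.9

# "Complete" data: an AR(1) path; 10% of it will be hidden.
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.standard_normal()
miss = np.sort(rng.choice(T, size=30, replace=False))
obs = np.setdiff1d(np.arange(T), miss)

def smoothed_spectrum(y, h=6):
    """Kernel-smoothed periodogram (flat kernel, circular boundary)."""
    per = np.abs(np.fft.fft(y - y.mean())) ** 2 / len(y)
    padded = np.concatenate([per[-h:], per, per[:h]])
    return np.convolve(padded, np.ones(2 * h + 1) / (2 * h + 1), 'valid')

# Step 0: initial imputation by linear interpolation.
z = x.copy()
z[miss] = np.interp(miss, obs, x[obs])
f_old = smoothed_spectrum(z)

converged = False
for _ in range(200):
    # Autocovariance implied by the current spectrum, used to build the
    # Gaussian conditional mean of the missing values given the observed.
    acov = np.real(np.fft.ifft(f_old))
    G = acov[np.abs(np.subtract.outer(np.arange(T), np.arange(T)))]
    Goo = G[np.ix_(obs, obs)] + 1e-8 * np.eye(len(obs))   # small ridge
    z[miss] = G[np.ix_(miss, obs)] @ np.linalg.solve(Goo, x[obs])
    f_new = smoothed_spectrum(z)
    if np.max(np.abs(f_new - f_old)) < 1e-3 * np.max(f_old):
        converged = True
        break
    f_old = f_new

mse_isc = np.mean((z[miss] - x[miss]) ** 2)
```

The deterministic conditional-mean update (rather than a random draw, as in MCSC) is what lets the spectrum estimate settle to a fixed point.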
4. Monte Carlo experiment
We use a DFM as the data generating process in the Monte Carlo experiments of this study. For a $p$-dimensional DFM, we estimate the common component by the orthogonal projection of $\mathbf{X}_t$ on the space spanned by the first $q$ PC series, as described in Section 2. We assume that the dimension of the PC series is half that of $\mathbf{X}_t$, i.e. $q = p/2$.
Simulation data with missing values are generated under various scenarios. A weighted sum of sinusoids or a vector MA(1) process is considered as the common component. Regarding the proportion of missing data, we consider four scenarios in which 5%, 10%, 15%, or 20% of the given data are missing completely at random. The missing values are then imputed by three different imputation methods: linear interpolation, the Kalman smoothing of Durbin and Koopman [11], and the proposed ISC algorithm. Kalman smoothing assumes that the incompletely observed process is a linearly transformed version of an unobservable state process plus noise,
$$\mathbf{X}_t = Z \boldsymbol{\alpha}_t + \boldsymbol{\varepsilon}_t,$$
where $\boldsymbol{\alpha}_t$ is a $q$-dimensional autoregressive state process and $Z$ is the observation matrix. For imputation, the state $\boldsymbol{\alpha}_t$ is first estimated by the Kalman smoother and then multiplied by the observation matrix $Z$.
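For a scalar state, Kalman-smoothing imputation can be sketched with a hand-rolled filter and a Rauch–Tung–Striebel smoother (the model parameters and missingness rate below are illustrative assumptions, not the settings used in the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
T, phi = 400, 0.9
q_var, r_var = 1.0, 0.25            # state and observation noise variances

# Simulate the state and noisy observations; hide ~20% of them.
alpha = np.zeros(T)
for t in range(1, T):
    alpha[t] = phi * alpha[t - 1] + np.sqrt(q_var) * rng.standard_normal()
y = alpha + np.sqrt(r_var) * rng.standard_normal(T)
observed = rng.random(T) > 0.2

# Kalman filter: the measurement update is skipped where y is missing.
a_pred = np.zeros(T); P_pred = np.zeros(T)
a_filt = np.zeros(T); P_filt = np.zeros(T)
a, P = 0.0, q_var / (1 - phi ** 2)  # stationary prior for the state
for t in range(T):
    a_pred[t], P_pred[t] = a, P
    if observed[t]:
        K = P / (P + r_var)
        a, P = a + K * (y[t] - a), (1 - K) * P
    a_filt[t], P_filt[t] = a, P
    a, P = phi * a, phi ** 2 * P + q_var   # time update

# Rauch-Tung-Striebel smoother (backward pass).
a_sm = a_filt.copy()
for t in range(T - 2, -1, -1):
    J = P_filt[t] * phi / P_pred[t + 1]
    a_sm[t] = a_filt[t] + J * (a_sm[t + 1] - a_pred[t + 1])

# Impute the missing observations with the smoothed state.
y_imputed = np.where(observed, y, a_sm)
mse_sm = np.mean((a_sm[~observed] - alpha[~observed]) ** 2)
```

Skipping the measurement update at missing time points is the standard state-space device for handling gaps, and the smoother then fills them using information from both directions in time.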
Using the imputed data, we perform DPCA and estimate the common components. As a measure of the empirical performance of the proposed method, we use the relative deviation criterion defined as
$$D = \frac{\sum_{t} \| \boldsymbol{\chi}_t - \hat{\boldsymbol{\chi}}_t \|^2}{\sum_{t} \| \boldsymbol{\chi}_t \|^2},$$
where $\boldsymbol{\chi}_t$ denotes the true common components and $\hat{\boldsymbol{\chi}}_t$ their estimate by $q$ PC series.
For the simulation study, we consider two different dimensions of observations, p = 20 and 40, and three different lengths of observations, T = 1000, 2000, 4000. For each combination of p, T, and missing proportion, the average values and standard deviations of the relative deviation over 100 simulations are computed.
4.1. Common component: sum of sinusoids
We consider a weighted sum of sinusoids as the common component and an MA(1) process as the idiosyncratic component. The simulation data are generated as the sum of these two parts: each series loads on a $q$-dimensional common factor, and the idiosyncratic part is MA(1) noise. The common factor has a sinusoidal form whose $j$-th frequency is determined by the $j$-th prime number, with the remaining amplitudes and phases generated at random.
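A data generating sketch of this kind of common component might look as follows; the amplitudes, phases, loadings, and MA coefficient below are our own assumptions, since the paper's exact constants are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(5)
T, p, q = 1000, 20, 10
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]   # first q = 10 primes

# Common factors: sinusoids whose periods are tied to the prime numbers.
t = np.arange(T)
U = np.stack([np.sin(2 * np.pi * t / pr + rng.uniform(0, 2 * np.pi))
              for pr in primes])                # q x T

# Common component: random loadings applied to the factors.
B = rng.uniform(-1, 1, (p, q))
chi = B @ U                                     # p x T

# Idiosyncratic component: an independent MA(1) process per series.
eps = rng.standard_normal((p, T + 1))
X = chi + eps[:, 1:] + 0.5 * eps[:, :-1]
```

Prime-based periods keep the factor frequencies distinct, which makes the common component easy for a frequency-domain method to isolate.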
Results are listed in Table 1. For all cases, the proposed method based on the ISC algorithm outperforms the other two methods. To determine whether one method is significantly better than another, we can construct an approximate 95% confidence interval of the difference between the two performance measures. For example, when T = 1000 and p = 20, the approximate 95% confidence interval for the difference between the average values of Kalman smoothing and the ISC algorithm does not contain zero, which implies that the ISC algorithm performs significantly better than Kalman smoothing. Furthermore, we observe that the superiority of the proposed method becomes more pronounced as p and the missing data rate increase, and as T decreases.
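The interval arithmetic can be reproduced from the Table 1 entries; for illustration we use the T = 1000, p = 20, 20%-missing cells (whether this is the exact cell used in the paper's example is an assumption), treating the two method averages as independent:

```python
import numpy as np

n = 100                          # simulation replicates per cell
mean_ks, sd_ks = 0.062, 0.015    # Kalman smoothing (Table 1: T=1000, p=20, 20%)
mean_isc, sd_isc = 0.057, 0.014  # ISC algorithm (same cell)

# Approximate 95% confidence interval for the difference of the two means.
diff = mean_ks - mean_isc
se = np.sqrt(sd_ks ** 2 / n + sd_isc ** 2 / n)
ci = (diff - 1.96 * se, diff + 1.96 * se)
```

With these numbers the lower endpoint stays above zero, consistent with the significance claim in the text.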
Table 1. Average and standard deviation (in parentheses) of the relative deviation over 100 simulations; the common component is a sum of sinusoids. Columns give the missing proportion.

| T | p | Method | 5% | 10% | 15% | 20% |
|---|---|---|---|---|---|---|
| T = 1000 | p = 20 | Linear interpolation | 0.057 (0.014) | 0.063 (0.019) | 0.065 (0.016) | 0.066 (0.016) |
| | | Kalman smoothing | 0.057 (0.016) | 0.060 (0.018) | 0.061 (0.015) | 0.062 (0.015) |
| | | ISC algorithm | 0.054 (0.012) | 0.055 (0.013) | 0.056 (0.015) | 0.057 (0.014) |
| | p = 40 | Linear interpolation | 0.083 (0.012) | 0.113 (0.015) | 0.142 (0.016) | 0.170 (0.016) |
| | | Kalman smoothing | 0.066 (0.012) | 0.081 (0.013) | 0.093 (0.014) | 0.110 (0.013) |
| | | ISC algorithm | 0.059 (0.011) | 0.066 (0.012) | 0.075 (0.013) | 0.085 (0.013) |
| T = 2000 | p = 20 | Linear interpolation | 0.059 (0.013) | 0.062 (0.013) | 0.064 (0.015) | 0.067 (0.015) |
| | | Kalman smoothing | 0.057 (0.012) | 0.059 (0.012) | 0.062 (0.015) | 0.063 (0.016) |
| | | ISC algorithm | 0.057 (0.013) | 0.056 (0.011) | 0.057 (0.014) | 0.060 (0.016) |
| | p = 40 | Linear interpolation | 0.076 (0.012) | 0.103 (0.015) | 0.127 (0.015) | 0.153 (0.015) |
| | | Kalman smoothing | 0.061 (0.012) | 0.073 (0.014) | 0.083 (0.013) | 0.097 (0.015) |
| | | ISC algorithm | 0.058 (0.014) | 0.063 (0.013) | 0.070 (0.014) | 0.079 (0.015) |
| T = 4000 | p = 20 | Linear interpolation | 0.062 (0.013) | 0.062 (0.014) | 0.063 (0.013) | 0.064 (0.014) |
| | | Kalman smoothing | 0.061 (0.013) | 0.062 (0.015) | 0.061 (0.014) | 0.063 (0.015) |
| | | ISC algorithm | 0.059 (0.012) | 0.060 (0.012) | 0.059 (0.013) | 0.060 (0.012) |
| | p = 40 | Linear interpolation | 0.075 (0.013) | 0.095 (0.013) | 0.116 (0.015) | 0.136 (0.015) |
| | | Kalman smoothing | 0.063 (0.011) | 0.071 (0.012) | 0.080 (0.013) | 0.089 (0.013) |
| | | ISC algorithm | 0.059 (0.012) | 0.064 (0.013) | 0.069 (0.012) | 0.073 (0.012) |
4.2. Common component: MA(1)
Here we consider a vector MA(1) process as the common component, following [14]. The MA coefficients are generated at random, and the other variables are generated in the same way as in the previous case of Section 4.1.
From Table 2, we observe that the performance of the proposed method is similar to that of Kalman smoothing when the missing proportion is equal to or less than 10%. Kalman smoothing works slightly better than the proposed method as the missing data rate increases and T decreases, but most of the differences are not significant.
Table 2. Average and standard deviation (in parentheses) of the relative deviation over 100 simulations; the common component is MA(1). Columns give the missing proportion.

| T | p | Method | 5% | 10% | 15% | 20% |
|---|---|---|---|---|---|---|
| T = 1000 | p = 20 | Linear interpolation | 0.171 (0.009) | 0.225 (0.012) | 0.282 (0.013) | 0.339 (0.015) |
| | | Kalman smoothing | 0.148 (0.009) | 0.179 (0.009) | 0.211 (0.009) | 0.245 (0.008) |
| | | ISC algorithm | 0.149 (0.009) | 0.180 (0.009) | 0.214 (0.009) | 0.255 (0.009) |
| | p = 40 | Linear interpolation | 0.137 (0.005) | 0.194 (0.006) | 0.253 (0.007) | 0.316 (0.007) |
| | | Kalman smoothing | 0.114 (0.006) | 0.148 (0.006) | 0.184 (0.005) | 0.223 (0.005) |
| | | ISC algorithm | 0.114 (0.005) | 0.149 (0.006) | 0.188 (0.005) | 0.232 (0.005) |
| T = 2000 | p = 20 | Linear interpolation | 0.167 (0.011) | 0.217 (0.013) | 0.270 (0.016) | 0.327 (0.018) |
| | | Kalman smoothing | 0.145 (0.010) | 0.173 (0.010) | 0.204 (0.010) | 0.236 (0.010) |
| | | ISC algorithm | 0.145 (0.010) | 0.174 (0.009) | 0.205 (0.010) | 0.242 (0.010) |
| | p = 40 | Linear interpolation | 0.133 (0.006) | 0.187 (0.006) | 0.243 (0.007) | 0.301 (0.009) |
| | | Kalman smoothing | 0.112 (0.005) | 0.144 (0.005) | 0.178 (0.005) | 0.213 (0.004) |
| | | ISC algorithm | 0.112 (0.005) | 0.145 (0.005) | 0.180 (0.005) | 0.218 (0.005) |
| T = 4000 | p = 20 | Linear interpolation | 0.161 (0.009) | 0.211 (0.011) | 0.263 (0.013) | 0.317 (0.015) |
| | | Kalman smoothing | 0.140 (0.008) | 0.168 (0.009) | 0.197 (0.008) | 0.228 (0.009) |
| | | ISC algorithm | 0.140 (0.008) | 0.168 (0.008) | 0.197 (0.008) | 0.230 (0.009) |
| | p = 40 | Linear interpolation | 0.129 (0.005) | 0.179 (0.005) | 0.232 (0.006) | 0.287 (0.007) |
| | | Kalman smoothing | 0.109 (0.005) | 0.139 (0.005) | 0.171 (0.005) | 0.204 (0.005) |
| | | ISC algorithm | 0.109 (0.005) | 0.139 (0.005) | 0.172 (0.005) | 0.207 (0.005) |
5. Empirical application
5.1. Macroeconomic data
We consider four yearly macroeconomic indicators: gross domestic product (GDP), gross fixed capital formation (GFCF), consumer price index (CPI), and industrial production (IP). GDP and GFCF are measured in US dollars, and the base year of CPI and IP is 2015. We use data for the period 1971–2017 from 15 countries in the Organization for Economic Co-operation and Development (OECD): Austria, Belgium, Canada, Finland, France, Germany, Greece, Italy, Luxembourg, Netherlands, Norway, Portugal, Spain, Sweden, and the United Kingdom. The dataset is available on the OECD website https://data.oecd.org.
To apply the proposed method, we artificially generate missing values completely at random at a rate of 10% or 20%, estimate the coincident indicator defined by Forni et al. [14], and compare this indicator with the one obtained from the missing-free data. The procedure is described as follows:
Construct a panel pooling the four yearly macroeconomic indicators for the 15 OECD countries from 1971 to 2017. Take the difference of the logarithm of each indicator for each country and normalize it by subtracting the mean and dividing by the standard deviation. Then, we have 46 observations for each of 60 variables.
Apply the DPCA with the proposed ISC algorithm. Here, the dimension of the common factor is set as q = 2 by following the suggestion of Forni et al. [14].
- Estimate the common component of GDP, and compute the coincident indicator, defined as a weighted average of the estimated common components of GDP over the countries, where the weight is the average GDP level of the i-th country for the period 1971–2017.
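The log-difference-and-standardize step above is elementary; a small sketch with a hypothetical level series (the numbers are made up):

```python
import numpy as np

# Hypothetical yearly level series for one indicator of one country.
level = np.array([100.0, 104.0, 109.2, 112.5, 118.1, 125.3])

# Log-difference: approximately the yearly growth rate.
growth = np.diff(np.log(level))

# Normalize: subtract the mean and divide by the standard deviation.
z = (growth - growth.mean()) / growth.std()
```

Each country-indicator pair is transformed this way before pooling, so the panel entries share a common scale.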
The left plot in Figure 2 shows the estimated coincident indicators obtained from the missing-free, 10% missing, and 20% missing data, and the right plot shows the deviation of the indicators obtained from the missing data from that of the missing-free data. We observe that the indicators obtained from the missing data do not deviate significantly from the missing-free result.
Figure 2.
(Left) Coincident indicators for the 15 OECD countries. (Right) Deviation of coincident indicators due to the missing observations.
5.2. Particulate matter concentration in South Korea
We analyze the daily particulate matter (PM10) concentration observed at 23 cities in South Korea for the years 2006–2015. The recorded series often have missing values and, sometimes, observation failures last for a long time due to facility maintenance.
Though we cannot compare the performance of our method with that of the other methods, since the data contain missing values in the first place, we can draw interesting and reasonable interpretations by comparing the estimated common component with the imputed complete data. We assume that the data are generated from the DFM with a two-dimensional common factor and perform DPCA with the proposed ISC algorithm. Figure 3 shows the estimated common components for four cities (Seoul, Suwon, Mokpo, and Taean) in South Korea for January–June 2009. Dots denote the imputed values, the solid line is the resulting imputed full data, and the dashed line is the estimated common component for each city. While the observations in Seoul behave very similarly to the common component, Mokpo generally shows higher concentrations than the common component. At some time points in Suwon and Taean, concentrations are markedly lower or higher than the common component. These features indicate short-term local events in each city and are explained by the idiosyncratic component in the DFM.
Figure 3.
PM10 concentration of four cities (Seoul, Suwon, Mokpo, and Taean) in South Korea between January and June 2009. Dots denote the imputed values, and the solid line is the resulting imputed full data. The dashed line is the estimated common component for each city.
6. Concluding remark
6.1. Application to non-stationary data
Recently, Eichler et al. [12] proposed a non-stationary DFM based on time-varying principal components. Since stationarity is assumed for both DPCA and the ISC algorithm, it is inappropriate to apply our approach directly to non-stationary multivariate time-series data. However, we expect that the proposed method may be further generalized to non-stationary processes. Here we present a toy simulation to show the applicability of the proposed method under a non-stationary process.
The simulation follows the study of Eichler et al. [12], which considers the sum of a time-varying AR(2) process and an ordinary AR(2) process as the common component; the coefficients of the time-varying part change slowly over time. We refer to Eichler et al. [12] for the specific form of the data generating process.
We set 20% of the data to be missing completely at random and impute them with the three imputation methods. Then, we estimate the common components using ten dynamic principal components. Average values and standard deviations of the relative deviation over 100 simulations are listed in Table 3. Even though the proposed method is valid only under stationarity, the proposed DPCA with the ISC algorithm provides the closest estimate to the true common components. We expect that the results may improve further if the proposed method is generalized to non-stationary data in future research.
Table 3. Non-stationary common component: averages and standard deviations of the relative deviation over 100 simulations.
| Method | Average | Standard deviation |
|---|---|---|
| Linear interpolation | 0.355 | 0.026 |
| Kalman smoothing | 0.148 | 0.016 |
| ISC algorithm | 0.133 | 0.010 |
6.2. Summary
DPCA is a popular statistical method for revealing the dynamics of the underlying data structure. However, the spectral density matrix, which is essential in DPCA, cannot be estimated in the usual way when missing values exist in the data. In this paper, we propose a new algorithm that imputes the missing data based on self-consistent inference on the frequency domain information.
Through the numerical experiments, including simulations and real data analysis, we observe that the proposed DPCA with the ISC algorithm works well compared to other methods such as linear interpolation and Kalman smoothing. When the data are a sum of sinusoids, the proposed method works best. However, when the common components are generated from a process with less distinctive frequencies, such as MA(1), Kalman smoothing works best while the proposed method gives comparable results.
We expect that the proposed method can be generalized further to the non-stationary case and that various kinds of real data containing missing values can be analyzed by the proposed method.
Funding Statement
This research was supported by the Seoul National University Research Grant in 2018 and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Korea government (NRF-2018R1D1A1B07042933, NRF-2019R1A2C4069453).
Disclosure statement
No potential conflict of interest was reported by the authors.
References
1. Bai J., Inferential theory for factor models of large dimensions, Econometrica 71 (2003), pp. 135–171. doi: 10.1111/1468-0262.00392
2. Bai J. and Ng S., Determining the number of factors in approximate factor models, Econometrica 70 (2002), pp. 191–221. doi: 10.1111/1468-0262.00273
3. Bai J. and Ng S., Large dimensional factor analysis, Econometrics 3 (2008), pp. 89–163.
4. Brillinger D.R., The canonical analysis of stationary time series, Multivariate Anal. 2 (1969), pp. 331–350.
5. Brillinger D.R., Time Series: Data Analysis and Theory, Vol. 36, SIAM, Philadelphia, 1981.
6. Chen J. and Liu K.-C., On-line batch process monitoring using dynamic PCA and dynamic PLS models, Chem. Eng. Sci. 57 (2002), pp. 63–75. doi: 10.1016/S0009-2509(01)00366-9
7. Dong Y. and Qin S.J., A novel dynamic PCA algorithm for dynamic data modeling and process monitoring, J. Process Control 67 (2018), pp. 1–11. doi: 10.1016/j.jprocont.2017.05.002
8. Doz C., Giannone D., and Reichlin L., A two-step estimator for large approximate dynamic factor models based on Kalman filtering, J. Econom. 164 (2011), pp. 188–205. doi: 10.1016/j.jeconom.2011.02.012
9. Doz C., Giannone D., and Reichlin L., A quasi-maximum likelihood approach for large, approximate dynamic factor models, Rev. Econ. Stat. 94 (2012), pp. 1014–1024. doi: 10.1162/REST_a_00225
10. Dray S. and Josse J., Principal component analysis with missing values: a comparative survey of methods, Plant Ecol. 216 (2015), pp. 657–667. doi: 10.1007/s11258-014-0406-z
11. Durbin J. and Koopman S.J., Time Series Analysis by State Space Methods, Oxford University Press, Oxford, 2012.
12. Eichler M., Motta G., and von Sachs R., Fitting dynamic factor models to non-stationary time series, J. Econom. 163 (2011), pp. 51–70. doi: 10.1016/j.jeconom.2010.11.007
13. Favero C.A., Marcellino M., and Neglia F., Principal components at work: the empirical analysis of monetary policy with large data sets, J. Appl. Econom. 20 (2005), pp. 603–620. doi: 10.1002/jae.815
14. Forni M., Hallin M., Lippi M., and Reichlin L., The generalized dynamic-factor model: identification and estimation, Rev. Econ. Stat. 82 (2000), pp. 540–554. doi: 10.1162/003465300559037
15. Geweke J., The dynamic factor analysis of economic time-series models, in Latent Variables in Socio-Economic Models, D. Aigner and A. Goldberger, eds., North-Holland, Amsterdam, 1977, pp. 365–383.
16. Hörmann S., Kidziński Ł., and Hallin M., Dynamic functional principal components, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 77 (2015), pp. 319–348. doi: 10.1111/rssb.12076
17. Ilin A. and Raiko T., Practical approaches to principal component analysis in the presence of missing values, J. Mach. Learn. Res. 11 (2010), pp. 1957–2000.
18. Jungbacker B., Koopman S.J., and van der Wel M., Dynamic factor analysis in the presence of missing data, Tinbergen Institute Discussion Paper, 2009.
19. Ku W., Storer R.H., and Georgakis C., Disturbance detection and isolation by dynamic principal component analysis, Chemometr. Intell. Lab. Syst. 30 (1995), pp. 179–196. doi: 10.1016/0169-7439(95)00076-3
20. Lee T.C. and Meng X.-L., A self-consistent wavelet method for denoising images with missing pixels, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), Vol. 2, IEEE, 2005, pp. ii–41.
21. Lee T.C. and Zhu Z., Nonparametric spectral density estimation with missing observations, in Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2009, pp. 3041–3044.
22. Lomb N.R., Least-squares frequency analysis of unequally spaced data, Astrophys. Space Sci. 39 (1976), pp. 447–462. doi: 10.1007/BF00648343
23. Meng X.-L. and Rubin D.B., Maximum likelihood estimation via the ECM algorithm: a general framework, Biometrika 80 (1993), pp. 267–278. doi: 10.1093/biomet/80.2.267
24. Peña D. and Yohai V.J., Generalized dynamic principal components, J. Am. Stat. Assoc. 111 (2016), pp. 1121–1131. doi: 10.1080/01621459.2015.1072542
25. Sargent T.J. and Sims C.A., Business cycle modeling without pretending to have too much a priori economic theory, New Methods Bus. Cycle Res. 1 (1977), pp. 145–168.
26. Shum H.-Y., Ikeuchi K., and Reddy R., Principal component analysis with missing data and its application to polyhedral object modeling, IEEE Trans. Pattern Anal. Mach. Intell. 17 (1995), pp. 854–867. doi: 10.1109/34.406651
27. Stock J.H. and Watson M.W., New indexes of coincident and leading economic indicators, NBER Macroecon. Annu. 4 (1989), pp. 351–394. doi: 10.1086/654119
28. Stock J.H. and Watson M.W., A probability model of the coincident economic indicators, in Leading Economic Indicators: New Approaches and Forecasting Records, K. Lahiri and G. Moore, eds., Cambridge University Press, 1991, pp. 63–90.
29. Stock J.H. and Watson M.W., Forecasting using principal components from a large number of predictors, J. Am. Stat. Assoc. 97 (2002), pp. 1167–1179. doi: 10.1198/016214502388618960
30. Stock J.H. and Watson M.W., Macroeconomic forecasting using diffusion indexes, J. Bus. Econ. Stat. 20 (2002), pp. 147–162. doi: 10.1198/073500102317351921
31. Stoffer D.S., Detecting common signals in multiple time series using the spectral envelope, J. Am. Stat. Assoc. 94 (1999), pp. 1341–1356. doi: 10.1080/01621459.1999.10473886
32. Wang Y., Stoica P., Li J., and Marzetta T.L., Nonparametric spectral analysis with missing data via the EM algorithm, Digit. Signal Process. 15 (2005), pp. 191–206. doi: 10.1016/j.dsp.2004.10.004



