Sequential linear regression with online standardized data

Kévin Duarte; Jean-Marie Monnez; Eliane Albuisson

doi:10.1371/journal.pone.0191186

2018 Jan 18;13(1):e0191186.

Editor: Chenping Hou

PMCID: PMC5773231 PMID: 29346392

Abstract

The present study addresses the problem of sequential least squares multidimensional linear regression, particularly in the case of a data stream, using a stochastic approximation process. To avoid the numerical explosion that can be encountered and to reduce computing time so that as many arriving observations as possible can be taken into account, we propose using a process with online standardized data instead of raw data, together with the use of several observations per step or of all observations until the current step. Herein, we define and study the almost sure convergence of three processes with online standardized data: a classical process with a variable step-size and use of a varying number of observations per step, an averaged process with a constant step-size and use of a varying number of observations per step, and a process with a variable or constant step-size and use of all observations until the current step. Their convergence is obtained under more general assumptions than classical ones. These processes are compared to classical processes on 11 datasets for a fixed total number of observations used and thereafter for a fixed processing time. Analyses indicate that the third-defined process typically yields the best results.

1 Introduction

In the present analysis, A′ denotes the transposed matrix of A while the abbreviation “a.s.” signifies almost surely.

Let R = (R¹,…,R^p) and S = (S¹,…,S^q) be random vectors in $\mathbb{R}^{p}$ and $\mathbb{R}^{q}$ respectively. Consider the least squares multidimensional linear regression of S with respect to R: the (p, q) matrix θ and the (q, 1) matrix η are estimated such that E[‖S − θ′ R − η‖²] is minimal.

Denote the covariance matrices

\begin{matrix} B & = & \mathrm{Covar}[R] = E [(R - E [R]) {(R - E [R])}^{'}], \\ F & = & \mathrm{Covar}[R, S] = E [(R - E [R]) {(S - E [S])}^{'}] . \end{matrix}

If we assume B is positive definite, i.e. there is no affine relation between the components of R, then

\begin{matrix} θ = B^{- 1} F, η = E [S] - θ^{'} E [R] . \end{matrix}

Note that, R₁ denoting the random vector in $\mathbb{R}^{p + 1}$ such that $R_{1}^{'} = \begin{pmatrix} R^{'} & 1 \end{pmatrix}$ , θ₁ the (p + 1, q) matrix such that $θ_{1}^{'} = \begin{pmatrix} θ^{'} & η \end{pmatrix}$ , $B_{1} = E [R_{1} R_{1}^{'}]$ and F₁ = E[R₁ S′], we obtain $θ_{1} = B_{1}^{- 1} F_{1}$ .
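The two closed forms above can be checked against each other on simulated data. The following is a minimal sketch, assuming expectations are replaced by sample means; the function name, the simulated coefficients and the sample size are illustrative choices, not taken from the article.

```python
import numpy as np

def regression_from_moments(R, S):
    """Return (theta, eta) with theta = B^{-1} F and eta = E[S] - theta' E[R],
    expectations being replaced by sample means over the rows of R and S."""
    R_mean, S_mean = R.mean(axis=0), S.mean(axis=0)
    Rc, Sc = R - R_mean, S - S_mean
    B = Rc.T @ Rc / len(R)             # empirical Covar[R], shape (p, p)
    F = Rc.T @ Sc / len(R)             # empirical Covar[R, S], shape (p, q)
    theta = np.linalg.solve(B, F)      # theta = B^{-1} F, shape (p, q)
    eta = S_mean - theta.T @ R_mean    # shape (q,)
    return theta, eta

# Hypothetical simulated data: p = 3 regressors, q = 2 responses.
rng = np.random.default_rng(0)
R = rng.normal(size=(500, 3))
coef = np.array([[1.0, -2.0], [0.5, 0.0], [3.0, 1.0]])
S = R @ coef + np.array([4.0, -1.0]) + 0.1 * rng.normal(size=(500, 2))

theta, eta = regression_from_moments(R, S)

# Augmented form: R1' = (R' 1), theta1 = B1^{-1} F1 with B1 = E[R1 R1'], F1 = E[R1 S'].
R1 = np.hstack([R, np.ones((len(R), 1))])
theta1 = np.linalg.solve(R1.T @ R1 / len(R), R1.T @ S / len(R))
```

Under these assumptions, the first p rows of `theta1` coincide with `theta` and its last row with `eta`, up to rounding error.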

In order to estimate θ (or θ₁), a stochastic approximation process (X_n) in $\mathbb{R}^{p \times q}$ (or $\mathbb{R}^{(p + 1) \times q}$ ) is recursively defined such that

\begin{matrix} X_{n + 1} = X_{n} - a_{n} (B_{n} X_{n} - F_{n}), \end{matrix}

where (a_n) is a sequence of positive real numbers, possibly constant, called step-sizes (or gains). Matrices B_n and F_n have the same dimensions as B and F, respectively. The convergence of (X_n) towards θ is studied under appropriate definitions and assumptions on B_n and F_n.

Suppose that ((R_{1n}, S_n), n ≥ 1) is an i.i.d. sample of (R₁, S). In the case where q = 1, $B_{n} = R_{1 n} R_{1 n}^{'}$ and $F_{n} = R_{1 n} S_{n}^{'}$ , several studies have been devoted to this stochastic gradient process (see for example Monnez [1], Ljung [2] and references hereafter). In order to accelerate general stochastic approximation procedures, Polyak [3] and Polyak and Juditsky [4] introduced the averaging technique. In the case of linear regression, Györfi and Walk [5] studied an averaged stochastic approximation process with a constant step-size. With the same type of process, Bach and Moulines [6] proved that the optimal convergence rate is achieved without a strong convexity assumption on the loss function.
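As an illustration of this stochastic gradient process with raw data, the following minimal sketch runs the recursion with $B_{n} = R_{1 n} R_{1 n}^{'}$ and $F_{n} = R_{1 n} S_{n}^{'}$ on simulated data. The step-size a_n = a/n^α with a = 0.2 and α = 0.75, the simulated parameter and the sample size are assumptions made for the example only.

```python
import numpy as np

def stochastic_gradient_regression(R1, S, a=0.2, alpha=0.75):
    """X_{n+1} = X_n - a_n (B_n X_n - F_n) with B_n = r r', F_n = r s'
    and a_n = a / n**alpha (hypothetical step-size choice)."""
    X = np.zeros((R1.shape[1], S.shape[1]))
    for n, (r, s) in enumerate(zip(R1, S), start=1):
        a_n = a / n**alpha
        # B_n X - F_n = r (r'X - s'), so B_n is never formed explicitly
        X = X - a_n * np.outer(r, r @ X - s)
    return X

# Hypothetical simulated stream: two well-scaled regressors plus a constant term.
rng = np.random.default_rng(0)
N = 20000
R1 = np.hstack([rng.normal(size=(N, 2)), np.ones((N, 1))])
theta1 = np.array([[1.0], [-2.0], [0.5]])
S = R1 @ theta1 + 0.1 * rng.normal(size=(N, 1))

X = stochastic_gradient_regression(R1, S)
```

The regressors here are deliberately well scaled; with components of very different magnitudes the same recursion may diverge during the first iterations, which is the issue discussed next.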

However, this type of process may be subject to the risk of numerical explosion when components of R or S exhibit great variances and may have very high values. For datasets used as test sets by Bach and Moulines [6], all sample points whose norm of R is fivefold greater than the average norm are removed. Moreover, generally only one observation of (R, S) is introduced at each step of the process. This may not be convenient when a large amount of data is generated, for example by a data stream.

Two modifications of this type of process are thus proposed in this article.

The first change, intended to avoid numerical explosion, is to use standardized components of R and S, i.e. components with zero mean and unit variance. In practice, the expectation and the variance of the components are usually unknown and are estimated online.

The parameter θ can be computed from the standardized components as follows. Let σ^j be the standard deviation of R^j for j = 1,…,p and $σ_{1}^{k}$ the standard deviation of S^k for k = 1,…,q. Define the following matrices

\begin{matrix} Γ = \begin{pmatrix} \frac{1}{σ^{1}} & \dots & 0 \\ ⋮ & ⋱ & ⋮ \\ 0 & \dots & \frac{1}{σ^{p}} \end{pmatrix}, \quad Γ^{1} = \begin{pmatrix} \frac{1}{σ_{1}^{1}} & \dots & 0 \\ ⋮ & ⋱ & ⋮ \\ 0 & \dots & \frac{1}{σ_{1}^{q}} \end{pmatrix} . \end{matrix}

Let S_c = Γ¹(S − E[S]) and R_c = Γ(R − E[R]). The least squares linear regression of S_c with respect to R_c is achieved by estimating the (p, q) matrix θ_c such that $E [{|| S_{c} - θ_{c}^{'} R_{c} ||}^{2}]$ is minimal. Then θ_c = Γ⁻¹(B⁻¹ F)Γ¹ ⇔ θ = B⁻¹ F = Γθ_c(Γ¹)⁻¹.
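The relation θ = Γθ_c(Γ¹)⁻¹ can be verified numerically. In the minimal sketch below, moments are replaced by their empirical counterparts on a simulated sample, for which the identity holds up to rounding; the scales and offsets of the simulated components are arbitrary illustrative values.

```python
import numpy as np

def cov(U, V):
    """Empirical covariance matrix between the columns of U and V."""
    return (U - U.mean(axis=0)).T @ (V - V.mean(axis=0)) / len(U)

# Hypothetical sample with components of very different scales.
rng = np.random.default_rng(1)
n, p, q = 400, 3, 2
R = rng.normal(size=(n, p)) * np.array([1.0, 50.0, 0.1]) + np.array([0.0, 1000.0, -3.0])
S = R @ rng.normal(size=(p, q)) + rng.normal(size=(n, q))

theta = np.linalg.solve(cov(R, R), cov(R, S))        # theta = B^{-1} F on raw data

Gamma = np.diag(1.0 / R.std(axis=0))                 # inverse standard deviations of R
Gamma1 = np.diag(1.0 / S.std(axis=0))                # inverse standard deviations of S
Rc = (R - R.mean(axis=0)) @ Gamma                    # rows are standardized observations of R
Sc = (S - S.mean(axis=0)) @ Gamma1                   # rows are standardized observations of S

theta_c = np.linalg.solve(cov(Rc, Rc), cov(Rc, Sc))  # regression on standardized data
theta_back = Gamma @ theta_c @ np.linalg.inv(Gamma1) # theta = Gamma theta_c (Gamma^1)^{-1}
```

The regression is therefore solved on the well-conditioned standardized scale, and `theta_back` recovers the coefficients on the original scale.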

The second change is to use, at each step of the process, several observations of (R, S) or an estimation of B and F computed recursively from all observations until the current step without storing them.

More precisely, the convergence of three processes with online standardized data is studied in sections 2, 3, 4 respectively.

First, in section 2, a process with a variable step-size a_n and use of several online standardized observations at each step is studied; note that the number of observations at each step may vary with n.

Secondly, in section 3, an averaged process with a constant step-size and use of a varying number of online standardized observations at each step is studied.

Thirdly, in section 4, a process with a constant or variable step-size and use of all online standardized observations until the current step to estimate B and F is studied.

These three processes are tested on several datasets when q = 1, S being a continuous or binary variable, and compared to existing processes in section 5. Note that when S is a binary variable, linear regression is equivalent to a linear discriminant analysis. It appears that the third-defined process most often yields the best results for the same number of observations used or for the same duration of computing time used.

These processes belong to the family of stochastic gradient processes and are adapted to data streams. Batch gradient and stochastic gradient methods are presented and compared in [7] and reviewed in [8], including noise reduction methods, like dynamic sample sizes methods, stochastic variance reduced gradient (also studied in [9]), second-order methods, ADAGRAD [10] and other methods. This work makes the following contributions to the variance reduction methods:

In [9], the authors proposed a modification of the classical stochastic gradient algorithm to reduce directly the gradient of the function to be optimized in order to obtain a faster convergence. It is proposed in this article to reduce this gradient by an online standardization of the data.
Gradient clipping [11] is another method to avoid a numerical explosion. The idea is to limit the norm of the gradient to a maximum value called the threshold. This threshold must be chosen, and a bad choice can affect the computing speed. Moreover, the norm of the gradient must then be compared to this threshold at each step. In our approach the limitation of the gradient is implicitly obtained by online standardization of the data.
If the expectation and the variance of the components of R and S were known, standardization of these variables could be made directly and convergence of the processes obtained using existing theorems. But these moments are unknown in the case of a data stream and are estimated online in this study. Thus the assumptions of the theorems of almost sure (a.s.) convergence of the processes studied in sections 2 and 3 and the corresponding proofs are more general than the classical ones in the linear regression case [1–5].
The process defined in section 4 is not a classical batch method. Indeed in this type of method (gradient descent), the whole set of data is known a priori and is used at each step of the process. In the present study, new data are supposed to arrive at each step, as in a data stream, and are added to the preceding set of data, thus reducing by averaging the variance. This process can be considered as a dynamic batch method.
A suitable choice of step-size is often crucial for obtaining good performance of a stochastic gradient process. If the step-size is too small, the convergence will be slower. Conversely, if the step-size is too large, a numerical explosion may occur during the first iterations. Following [6], a very simple choice of the step-size is proposed for the methods with a constant step-size.
Another objective is to reduce computing time in order to take into account a maximum of data in the case of a data stream. It appears in the experiments that using all observations until the current step without storing them, with several observations introduced at each step, generally yields the greatest increase in the convergence speed of the process. Moreover this can reduce the influence of outliers.

Overall, the major contributions of this work are to reduce gradient variance by online standardization of the data or use of a “dynamic” batch process, to avoid numerical explosions, to reduce computing time and consequently to better adapt the stochastic approximation processes used to the case of a data stream.

2 Convergence of a process with a variable step-size

Let (B_n, n ≥ 1) and (F_n, n ≥ 1) be two sequences of random matrices in $\mathbb{R}^{p \times p}$ and $\mathbb{R}^{p \times q}$ respectively. In this section, the convergence of the process (X_n, n ≥ 1) in $\mathbb{R}^{p \times q}$ recursively defined by

\begin{matrix} X_{n + 1} = X_{n} - a_{n} (B_{n} X_{n} - F_{n}) \end{matrix}

and its application to sequential linear regression are studied.

2.1 Theorem

Let X₁ be a random variable in $\mathbb{R}^{p \times q}$ independent of the sequence of random variables ((B_n, F_n), n ≥ 1) in $\mathbb{R}^{p \times p} \times \mathbb{R}^{p \times q}$ .

Denote by T_n the σ-field generated by X₁ and (B₁, F₁),…,(B_{n−1}, F_{n−1}). X₁, X₂,…,X_n are T_n-measurable.

Let (a_n) be a sequence of positive numbers.

Make the following assumptions:

(H1a) There exists a positive definite symmetrical matrix B such that a.s.

1) $\sum_{n = 1}^{\infty} a_{n} || E [B_{n} | T_{n}] - B || < \infty$

2) $\sum_{n = 1}^{\infty} a_{n}^{2} E [{|| B_{n} - B ||}^{2} | T_{n}] < \infty$ .

(H2a) There exists a matrix F such that a.s.

1) $\sum_{n = 1}^{\infty} a_{n} || E [F_{n} | T_{n}] - F || < \infty$

2) $\sum_{n = 1}^{\infty} a_{n}^{2} E [{|| F_{n} - F ||}^{2} | T_{n}] < \infty$ .

(H3a) $\sum_{n = 1}^{\infty} a_{n} = \infty, \sum_{n = 1}^{\infty} a_{n}^{2} < \infty$ .

Theorem 1 Suppose H1a, H2a and H3a hold. Then X_n converges to θ = B⁻¹ F a.s.

We first state the Robbins-Siegmund lemma [12], which is used in the proof.

Lemma 2 Let (Ω, A, P) be a probability space and (T_n) a non-decreasing sequence of sub-σ-fields of A. Suppose for all n, z_n, α_n, β_n and γ_n are four integrable non-negative T_n-measurable random variables defined on (Ω, A, P) such that:

\begin{matrix} E [z_{n + 1} | T_{n}] \leq z_{n} (1 + α_{n}) + β_{n} - γ_{n} \quad \text{a.s.} \end{matrix}

Then, in the set $\{\sum_{n = 1}^{\infty} α_{n} < \infty, \sum_{n = 1}^{\infty} β_{n} < \infty\}$ , (z_n) converges to a finite random variable and $\sum_{n = 1}^{\infty} γ_{n} < \infty$ a.s.

Proof of Theorem 1. The Frobenius norm ‖A‖ for a matrix A is used. Recall that, if ‖A‖₂ denotes the spectral norm of A, ‖AB‖ ≤ ‖A‖₂‖B‖.

\begin{matrix} X_{n + 1} - θ & = & X_{n} - θ - a_{n} (B_{n} X_{n} - F_{n}) \\ & = & (I - a_{n} B) (X_{n} - θ) - a_{n} ((B_{n} - B) X_{n} - (F_{n} - F)) . \end{matrix}

Denote Z_n = (B_n − B)X_n − (F_n − F) = (B_n − B)(X_n − θ) + (B_n − B)θ − (F_n − F) and $X_{n}^{1} = X_{n} - θ$ . Then:

\begin{matrix} X_{n + 1}^{1} & = & (I - a_{n} B) X_{n}^{1} - a_{n} Z_{n} \\ {|| X_{n + 1}^{1} ||}^{2} & = & {|| (I - a_{n} B) X_{n}^{1} ||}^{2} - 2 a_{n} ⟨ (I - a_{n} B) X_{n}^{1}, Z_{n} ⟩ + a_{n}^{2} {|| Z_{n} ||}^{2} . \end{matrix}

Denote λ the smallest eigenvalue of B. As a_n → 0, we have for n sufficiently large

\begin{matrix} {|| I - a_{n} B ||}_{2} = 1 - a_{n} λ < 1 . \end{matrix}

Then, taking the conditional expectation with respect to T_n yields almost surely:

\begin{matrix} E [{|| X_{n + 1}^{1} ||}^{2} | T_{n}] & \leq & {(1 - a_{n} λ)}^{2} {|| X_{n}^{1} ||}^{2} + 2 a_{n} | ⟨ (I - a_{n} B) X_{n}^{1}, E [Z_{n} | T_{n}] ⟩ | \\ & & + a_{n}^{2} E [{|| Z_{n} ||}^{2} | T_{n}], \\ E [Z_{n} | T_{n}] & = & (E [B_{n} | T_{n}] - B) X_{n}^{1} + (E [B_{n} | T_{n}] - B) θ - (E [F_{n} | T_{n}] - F) . \end{matrix}

Denoting

\begin{matrix} β_{n} & = & || E [B_{n} | T_{n}] - B ||, δ_{n} = || E [F_{n} | T_{n}] - F ||, \\ b_{n} & = & E [{|| B_{n} - B ||}^{2} | T_{n}], d_{n} = E [{|| F_{n} - F ||}^{2} | T_{n}], \end{matrix}

we obtain, as $|| X_{n}^{1} || \leq 1 + {|| X_{n}^{1} ||}^{2}$ :

\begin{matrix} | ⟨ (I - a_{n} B) X_{n}^{1}, E [Z_{n} | T_{n}] ⟩ | & \leq & || X_{n}^{1} || \, || E [Z_{n} | T_{n}] || \\ & \leq & {|| X_{n}^{1} ||}^{2} (β_{n} (1 + || θ ||) + δ_{n}) + β_{n} || θ || + δ_{n}, \\ E [{|| Z_{n} ||}^{2} | T_{n}] & \leq & 3 b_{n} {|| X_{n}^{1} ||}^{2} + 3 b_{n} {|| θ ||}^{2} + 3 d_{n}, \\ E [{|| X_{n + 1}^{1} ||}^{2} | T_{n}] & \leq & (1 + a_{n}^{2} λ^{2} + 2 (1 + || θ ||) a_{n} β_{n} + 2 a_{n} δ_{n} + 3 a_{n}^{2} b_{n}) {|| X_{n}^{1} ||}^{2} \\ & & + 2 || θ || a_{n} β_{n} + 2 a_{n} δ_{n} + 3 {|| θ ||}^{2} a_{n}^{2} b_{n} + 3 a_{n}^{2} d_{n} - 2 a_{n} λ {|| X_{n}^{1} ||}^{2} . \end{matrix}

Applying the Robbins-Siegmund lemma under assumptions H1a, H2a and H3a implies that there exists a non-negative random variable T such that a.s.

\begin{matrix} || X_{n}^{1} || ⟶ T, \sum_{n = 1}^{\infty} a_{n} {|| X_{n}^{1} ||}^{2} < \infty . \end{matrix}

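As $\sum_{n = 1}^{\infty} a_{n} = \infty$ , T = 0 a.s.: on the event {T > 0}, $|| X_{n}^{1} ||^{2}$ would stay bounded away from 0 for n large, so that $\sum_{n = 1}^{\infty} a_{n} {|| X_{n}^{1} ||}^{2}$ would diverge. ∎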

A particular case with the following assumptions is now studied.

(H1a’) There exist a positive definite symmetrical matrix B and a positive real number b such that a.s.

1) for all n, E[B_n|T_n] = B

2) $\sup_{n}$ E[‖B_n − B‖²|T_n] < b.

(H2a’) There exist a matrix F and a positive real number d such that a.s.

1) for all n, E[F_n|T_n] = F

2) $\sup_{n}$ E[‖F_n − F‖²|T_n] < d.

(H3a’) Denoting λ the smallest eigenvalue of B,

$(a_{n} = \frac{a}{n^{α}}, a > 0, \frac{1}{2} < α < 1)$ or $(a_{n} = \frac{a}{n}, a > \frac{1}{2 λ})$ .

Theorem 3 Suppose H1a’, H2a’ and H3a’ hold. Then X_n converges to θ almost surely and in quadratic mean. Moreover $\limsup_{n \to \infty} \frac{1}{a_{n}} E [{|| X_{n} - θ ||}^{2}] < \infty$ .

Proof of Theorem 3. In the proof of theorem 1, take β_n = 0, δ_n = 0, b_n < b, d_n < d; then a.s.:

\begin{matrix} E [{|| X_{n + 1}^{1} ||}^{2} | T_{n}] \leq (1 + λ^{2} a_{n}^{2} + 3 b a_{n}^{2}) {|| X_{n}^{1} ||}^{2} + 3 (b {|| θ ||}^{2} + d) a_{n}^{2} - 2 a_{n} λ {|| X_{n}^{1} ||}^{2} . \end{matrix}

Taking the mathematical expectation yields:

\begin{matrix} E [{|| X_{n + 1}^{1} ||}^{2}] \leq (1 + (λ^{2} + 3 b) a_{n}^{2}) E [{|| X_{n}^{1} ||}^{2}] + 3 (b {|| θ ||}^{2} + d) a_{n}^{2} - 2 a_{n} λ E [{|| X_{n}^{1} ||}^{2}] . \end{matrix}

By the Robbins-Siegmund lemma:

\begin{matrix} \exists t \geq 0 : E [{|| X_{n}^{1} ||}^{2}] ⟶ t; \sum_{n = 1}^{\infty} a_{n} E [{|| X_{n}^{1} ||}^{2}] < \infty . \end{matrix}

As $\sum_{n = 1}^{\infty} a_{n} = \infty$ , t = 0. Therefore, there exist $N \in \mathbb{N}$ and f > 0 such that for n > N:

\begin{matrix} E [{|| X_{n + 1}^{1} ||}^{2}] \leq (1 - 2 a_{n} λ) E [{|| X_{n}^{1} ||}^{2}] + f a_{n}^{2} . \end{matrix}

Applying a lemma of Schmetterer [13] for $a_{n} = \frac{a}{n^{α}}$ with $\frac{1}{2} < α < 1$ yields:

\begin{matrix} \limsup_{n \to \infty} n^{α} E [{|| X_{n}^{1} ||}^{2}] < \infty . \end{matrix}

Applying a lemma of Venter [14] for $a_{n} = \frac{a}{n}$ with $a > \frac{1}{2 λ}$ yields:

\begin{matrix} \limsup_{n \to \infty} n E [{|| X_{n}^{1} ||}^{2}] < \infty . ∎ \end{matrix}

2.2 Application to linear regression with online standardized data

Let (R₁, S₁),…,(R_n, S_n),… be an i.i.d. sample of a random vector (R, S) in $\mathbb{R}^{p} \times \mathbb{R}^{q}$ . Let Γ (respectively Γ¹) be the diagonal matrix of order p (respectively q) of the inverses of the standard deviations of the components of R (respectively S).

Define the correlation matrices

\begin{matrix} B = Γ E [(R - E [R]) {(R - E [R])}^{'}] Γ, \\ F = Γ E [(R - E [R]) {(S - E [S])}^{'}] Γ^{1} . \end{matrix}

Suppose that B⁻¹ exists. Let θ = B⁻¹ F.

Denote ${\bar{R}}_{n}$ (respectively ${\bar{S}}_{n}$ ) the mean of the n-sample (R₁, R₂,…,R_n) of R (respectively (S₁, S₂,…,S_n) of S).

Denote by ${(V_{n}^{j})}^{2}$ the variance of the n-sample $(R_{1}^{j}, R_{2}^{j}, \dots, R_{n}^{j})$ of the j-th component R^j of R, and by ${(V_{n}^{1 k})}^{2}$ the variance of the n-sample $(S_{1}^{k}, S_{2}^{k}, \dots, S_{n}^{k})$ of the k-th component S^k of S.

Denote Γ_n (respectively $Γ_{n}^{1}$ ) the diagonal matrix of order p (respectively q) whose element (j, j) (respectively (k, k)) is the inverse of $\sqrt{\frac{n}{n - 1}} V_{n}^{j}$ (respectively $\sqrt{\frac{n}{n - 1}} V_{n}^{1 k}$ ).
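The diagonal of Γ_n can be maintained without storing past observations. The sketch below uses a Welford-style recursive update, which is one possible choice and is not claimed to be the authors' implementation; the class and method names are hypothetical. Since $\sqrt{\frac{n}{n - 1}} V_{n}^{j}$ is the unbiased sample standard deviation, the result is compared with NumPy's `std(ddof=1)`.

```python
import numpy as np

class RunningMoments:
    """Running mean and variance of a vector stream (Welford-style update)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)   # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean = self.mean + delta / self.n
        self.m2 = self.m2 + delta * (x - self.mean)

    def gamma_diag(self):
        # V_n^2 = m2 / n, hence sqrt(n / (n - 1)) * V_n = sqrt(m2 / (n - 1));
        # requires at least two observations.
        return 1.0 / np.sqrt(self.m2 / (self.n - 1))

# Hypothetical stream with two components of very different scales.
rng = np.random.default_rng(2)
data = rng.normal(loc=[5.0, -100.0], scale=[0.1, 30.0], size=(1000, 2))

moments = RunningMoments(2)
for x in data:
    moments.update(x)
```

Only the count, the running mean and the running sum of squared deviations are kept in memory, which is the property needed for a data stream.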

Let (m_n, n ≥ 1) be a sequence of positive integers. Denote $M_{n} = \sum_{k = 1}^{n} m_{k}$ for n ≥ 1, M₀ = 0 and I_n = {M_{n−1}+1,…,M_n}.
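As a small illustration of this indexing, the sketch below builds the index sets I_n from given block sizes m_1,…,m_N; the function name is hypothetical and indices are 1-based as in the text.

```python
from itertools import accumulate

def block_indices(m):
    """Return the index sets I_n = {M_{n-1}+1, ..., M_n} for block sizes m_1, ..., m_N."""
    M = [0] + list(accumulate(m))   # M_0 = 0, M_n = m_1 + ... + m_n
    return [range(M[n - 1] + 1, M[n] + 1) for n in range(1, len(M))]

blocks = block_indices([1, 3, 2])   # I_1 = {1}, I_2 = {2, 3, 4}, I_3 = {5, 6}
```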

Define

\begin{matrix} B_{n} & = Γ_{M_{n - 1}} \frac{1}{m_{n}} \sum_{j \in I_{n}} (R_{j} - {\bar{R}}_{M_{n - 1}}) {(R_{j} - {\bar{R}}_{M_{n - 1}})}^{'} Γ_{M_{n - 1}}, \\ F_{n} & = Γ_{M_{n - 1}} \frac{1}{m_{n}} \sum_{j \in I_{n}} (R_{j} - {\bar{R}}_{M_{n - 1}}) {(S_{j} - {\bar{S}}_{M_{n - 1}})}^{'} Γ_{M_{n - 1}}^{1} . \end{matrix}

Define recursively the process (X_n, n ≥ 1) in $\mathbb{R}^{p \times q}$ by

\begin{matrix} X_{n + 1} = X_{n} - a_{n} (B_{n} X_{n} - F_{n}) . \end{matrix}
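The process defined above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the block size is constant (m_n = m), the step-size is a_n = a/n^α with α = 2/3 (which satisfies H3a”), and the first `n_init` observations are used only to initialise the running moments so that standardization is defined at the first step. The running variance is computed from sums of squares for brevity; a Welford-type update would be preferable for data with very large means.

```python
import numpy as np

def online_standardized_sgd(R, S, m=10, n_init=20, a=0.5, alpha=2 / 3):
    """Process with online standardized data, m observations per step.
    Assumptions of this sketch: constant block size m, a_n = a / n**alpha,
    and n_init warm-up observations used only to initialise the moments."""
    p, q = R.shape[1], S.shape[1]
    count = n_init
    sum_R, sum_S = R[:n_init].sum(axis=0), S[:n_init].sum(axis=0)
    sq_R, sq_S = (R[:n_init] ** 2).sum(axis=0), (S[:n_init] ** 2).sum(axis=0)
    X = np.zeros((p, q))
    n = 0
    for start in range(n_init, len(R) - m + 1, m):
        n += 1
        mean_R, mean_S = sum_R / count, sum_S / count
        # inverse unbiased standard deviations computed from past observations only
        g_R = 1.0 / np.sqrt((sq_R - count * mean_R**2) / (count - 1))
        g_S = 1.0 / np.sqrt((sq_S - count * mean_S**2) / (count - 1))
        Rb, Sb = R[start:start + m], S[start:start + m]
        Zr = (Rb - mean_R) * g_R          # block standardized with past moments
        Zs = (Sb - mean_S) * g_S
        B_n = Zr.T @ Zr / m
        F_n = Zr.T @ Zs / m
        X = X - (a / n**alpha) * (B_n @ X - F_n)
        # fold the new block into the running sums; no observation is stored
        count += m
        sum_R = sum_R + Rb.sum(axis=0)
        sum_S = sum_S + Sb.sum(axis=0)
        sq_R = sq_R + (Rb**2).sum(axis=0)
        sq_S = sq_S + (Sb**2).sum(axis=0)
    return X

# Hypothetical stream whose regressors have very different scales.
rng = np.random.default_rng(4)
N = 20020
R = rng.normal(size=(N, 3)) * np.array([1.0, 50.0, 0.1]) + np.array([0.0, 1000.0, -3.0])
S = R @ np.array([[2.0], [0.05], [10.0]]) + 0.5 * rng.normal(size=(N, 1))

X = online_standardized_sgd(R, S)
```

The limit of this process is the regression parameter on the standardized scale; the parameter on the original scale is then obtained through the relation θ = Γθ_c(Γ¹)⁻¹ given in the introduction.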

Corollary 4 Suppose there is no affine relation between the components of R and the moments of order 4 of (R, S) exist. Suppose moreover that assumption H3a” holds:

(H3a”) $a_{n} > 0, \sum_{n = 1}^{\infty} \frac{a_{n}}{\sqrt{n}} < \infty, \sum_{n = 1}^{\infty} a_{n}^{2} < \infty .$

Then X_n converges to θ a.s.

This process was tested on several datasets and some results are given in section 5 (process S11 for m_n = 1 and S12 for m_n = 10).

The following lemma is first proved.

Lemma 5 Suppose the moments of order 4 of R exist and a_n > 0, $\sum_{n = 1}^{\infty} \frac{a_{n}}{\sqrt{n}} < \infty$ . Then $\sum_{n = 1}^{\infty} a_{n} || {\bar{R}}_{M_{n - 1}} - E [R] || < \infty$ and $\sum_{n = 1}^{\infty} a_{n} || Γ_{M_{n - 1}} - Γ || < \infty$ a.s.

Proof of Lemma 5. The usual Euclidean norm for vectors and the spectral norm for matrices are used in the proof.

Step 1:

Denote $\mathrm{Var}[R] = E [{|| R - E [R] ||}^{2}] = \sum_{j = 1}^{p} \mathrm{Var}[R^{j}]$ .

\begin{matrix} E [{|| {\bar{R}}_{M_{n - 1}} - E [R] ||}^{2}] = \sum_{j = 1}^{p} \mathrm{Var}[{\bar{R}}_{M_{n - 1}}^{j}] = \sum_{j = 1}^{p} \frac{\mathrm{Var}[R^{j}]}{M_{n - 1}} \leq \frac{\mathrm{Var}[R]}{n - 1} . \end{matrix}

Then:

\begin{matrix} \sum_{n = 1}^{\infty} a_{n} E [|| {\bar{R}}_{M_{n - 1}} - E [R] ||] \leq \sqrt{\mathrm{Var}[R]} \sum_{n = 1}^{\infty} \frac{a_{n}}{\sqrt{n - 1}} < \infty \quad \text{by H3a”} . \end{matrix}

It follows that $\sum_{n = 1}^{\infty} a_{n} || {\bar{R}}_{M_{n - 1}} - E [R] || < \infty$ a.s.

Likewise $\sum_{n = 1}^{\infty} a_{n} || {\bar{S}}_{M_{n - 1}} - E [S] || < \infty$ a.s.

Step 2:

\begin{array}{lll} || Γ_{M_{n - 1}} - Γ || & = & \max_{j = 1, \dots, p} | \frac{1}{\sqrt{\frac{M_{n - 1}}{M_{n - 1} - 1}} V_{M_{n - 1}}^{j}} - \frac{1}{\sqrt{\mathrm{Var}[R^{j}]}} | \\ & \leq & \sum_{j = 1}^{p} \frac{| \sqrt{\frac{M_{n - 1}}{M_{n - 1} - 1}} V_{M_{n - 1}}^{j} - \sqrt{\mathrm{Var}[R^{j}]} |}{\sqrt{\frac{M_{n - 1}}{M_{n - 1} - 1}} V_{M_{n - 1}}^{j} \sqrt{\mathrm{Var}[R^{j}]}} \\ & = & \sum_{j = 1}^{p} \frac{| \frac{M_{n - 1}}{M_{n - 1} - 1} {(V_{M_{n - 1}}^{j})}^{2} - \mathrm{Var}[R^{j}] |}{\sqrt{\frac{M_{n - 1}}{M_{n - 1} - 1}} V_{M_{n - 1}}^{j} \sqrt{\mathrm{Var}[R^{j}]} (\sqrt{\frac{M_{n - 1}}{M_{n - 1} - 1}} V_{M_{n - 1}}^{j} + \sqrt{\mathrm{Var}[R^{j}]})} . \end{array}

Denote by $μ_{4}^{j}$ the centered moment of order 4 of R^j. We have:

\begin{matrix} E [| \frac{M_{n - 1}}{M_{n - 1} - 1} {(V_{M_{n - 1}}^{j})}^{2} - \mathrm{Var}[R^{j}] |] & \leq & \sqrt{\mathrm{Var}[\frac{M_{n - 1}}{M_{n - 1} - 1} {(V_{M_{n - 1}}^{j})}^{2}]} \\ & = & O (\sqrt{\frac{μ_{4}^{j} - {(\mathrm{Var}[R^{j}])}^{2}}{M_{n - 1}}}) . \end{matrix}

Then by H3a”, as M_{n−1} ≥ n − 1:

\begin{matrix} \sum_{n = 1}^{\infty} a_{n} \sum_{j = 1}^{p} E [| \frac{M_{n - 1}}{M_{n - 1} - 1} {(V_{M_{n - 1}}^{j})}^{2} - \mathrm{Var}[R^{j}] |] < \infty \\ \Rightarrow \sum_{n = 1}^{\infty} a_{n} \sum_{j = 1}^{p} | \frac{M_{n - 1}}{M_{n - 1} - 1} {(V_{M_{n - 1}}^{j})}^{2} - \mathrm{Var}[R^{j}] | < \infty \quad \text{a.s.} \end{matrix}

As ${(V_{M_{n - 1}}^{j})}^{2} \to \mathrm{Var}[R^{j}]$ a.s., j = 1,…,p, this implies:

\begin{matrix} \sum_{n = 1}^{\infty} a_{n} || Γ_{M_{n - 1}} - Γ || < \infty \quad \text{a.s.} ∎ \end{matrix}

Proof of Corollary 4.

Step 1: prove that assumption H1a1 of theorem 1 is verified.

Denote R^c = R − E[R], $R_{j}^{c} = R_{j} - E [R]$ , ${\bar{R}}_{j}^{c} = {\bar{R}}_{j} - E [R]$ .

\begin{matrix} B_{n} & = & Γ_{M_{n - 1}} \frac{1}{m_{n}} \sum_{j \in I_{n}} (R_{j}^{c} - {\bar{R}}_{M_{n - 1}}^{c}) {(R_{j}^{c} - {\bar{R}}_{M_{n - 1}}^{c})}^{'} Γ_{M_{n - 1}} \\ & = & Γ_{M_{n - 1}} \frac{1}{m_{n}} \sum_{j \in I_{n}} (R_{j}^{c} {R_{j}^{c}}^{'} - {\bar{R}}_{M_{n - 1}}^{c} {R_{j}^{c}}^{'} - R_{j}^{c} {({\bar{R}}_{M_{n - 1}}^{c})}^{'} + {\bar{R}}_{M_{n - 1}}^{c} {({\bar{R}}_{M_{n - 1}}^{c})}^{'}) Γ_{M_{n - 1}}, \\ B & = & Γ E [R^{c} {R^{c}}^{'}] Γ . \end{matrix}

As $Γ_{M_{n - 1}}$ and ${\bar{R}}_{M_{n - 1}}$ are T_n-measurable and $R_{j}^{c}$ , j ∈ I_n, is independent of T_n, with $E [R_{j}^{c}] = 0$ :

\begin{matrix} E [B_{n} | T_{n}] - B & = & Γ_{M_{n - 1}} (E [R^{c} {R^{c}}^{'}] + {\bar{R}}_{M_{n - 1}}^{c} {({\bar{R}}_{M_{n - 1}}^{c})}^{'}) Γ_{M_{n - 1}} - Γ E [R^{c} {R^{c}}^{'}] Γ \\ & = & (Γ_{M_{n - 1}} - Γ) E [R^{c} {R^{c}}^{'}] Γ_{M_{n - 1}} + Γ E [R^{c} {R^{c}}^{'}] (Γ_{M_{n - 1}} - Γ) \\ & & + Γ_{M_{n - 1}} {\bar{R}}_{M_{n - 1}}^{c} {({\bar{R}}_{M_{n - 1}}^{c})}^{'} Γ_{M_{n - 1}} \quad \text{a.s.} \end{matrix}

As $Γ_{M_{n - 1}}$ and ${\bar{R}}_{M_{n - 1}}^{c}$ converge respectively to Γ and 0 a.s. and by lemma 5, $\sum_{n = 1}^{\infty} a_{n} || Γ_{M_{n - 1}} - Γ || < \infty$ and $\sum_{n = 1}^{\infty} a_{n} || {\bar{R}}_{M_{n - 1}}^{c} || < \infty$ a.s., it follows that $\sum_{n = 1}^{\infty} a_{n} || E [B_{n} | T_{n}] - B || < \infty$ a.s.

Step 2: prove that assumption H1a2 of theorem 1 is verified.

\begin{array}{lll} {‖ B_{n} - B ‖}^{2} & \leq & 2 {‖ Γ_{M_{n - 1}} \frac{1}{m_{n}} \sum_{j \in I_{n}} (R_{j}^{c} - {\bar{R}}_{M_{n - 1}}^{c}) {(R_{j}^{c} - {\bar{R}}_{M_{n - 1}}^{c})}^{'} Γ_{M_{n - 1}} ‖}^{2} + 2 {‖ Γ E [R^{c} {R^{c}}^{'}] Γ ‖}^{2} \\ & \leq & 2 {‖ Γ_{M_{n - 1}} ‖}^{4} \frac{1}{m_{n}} \sum_{j \in I_{n}} {‖ R_{j}^{c} - {\bar{R}}_{M_{n - 1}}^{c} ‖}^{4} + 2 {‖ Γ E [R^{c} {R^{c}}^{'}] Γ ‖}^{2} \\ & \leq & 2 {‖ Γ_{M_{n - 1}} ‖}^{4} \frac{1}{m_{n}} \sum_{j \in I_{n}} 2^{3} ({‖ R_{j}^{c} ‖}^{4} + {‖ {\bar{R}}_{M_{n - 1}}^{c} ‖}^{4}) + 2 {‖ Γ E [R^{c} {R^{c}}^{'}] Γ ‖}^{2} . \end{array}

\begin{matrix} E [{|| B_{n} - B ||}^{2} | T_{n}] \leq 2^{4} {|| Γ_{M_{n - 1}} ||}^{4} (E [{|| R^{c} ||}^{4}] + {|| {\bar{R}}_{M_{n - 1}}^{c} ||}^{4}) + 2 {|| Γ E [R^{c} {R^{c}}^{'}] Γ ||}^{2} \quad \text{a.s.} \end{matrix}

As $Γ_{M_{n - 1}}$ and ${\bar{R}}_{M_{n - 1}}^{c}$ converge respectively to Γ and 0 a.s., and $\sum_{n = 1}^{\infty} a_{n}^{2} < \infty$ , it follows that $\sum_{n = 1}^{\infty} a_{n}^{2} E [{|| B_{n} - B ||}^{2} | T_{n}] < \infty$ a.s.

Step 3: the proofs of the verification of assumptions H2a1 and H2a2 of theorem 1 are similar to the previous ones, B_n and B being respectively replaced by

\begin{matrix} F_{n} & = & Γ_{M_{n - 1}} \frac{1}{m_{n}} \sum_{j \in I_{n}} (R_{j}^{c} - {\bar{R}}_{M_{n - 1}}^{c}) {(S_{j}^{c} - {\bar{S}}_{M_{n - 1}}^{c})}^{'} Γ_{M_{n - 1}}^{1}, \\ F & = & Γ E [R^{c} {S^{c}}^{'}] Γ^{1} ∎ \end{matrix}

3 Convergence of an averaged process with a constant step-size

In this section, the process (X_n, n ≥ 1) with a constant step-size a and the averaged process (Y_n, n ≥ 1) in $\mathbb{R}^{p \times q}$ are recursively defined by

\begin{matrix} X_{n + 1} & = & X_{n} - a (B_{n} X_{n} - F_{n}) \\ Y_{n + 1} & = & \frac{1}{n + 1} \sum_{j = 1}^{n + 1} X_{j} = Y_{n} - \frac{1}{n + 1} (Y_{n} - X_{n + 1}) . \end{matrix}
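The two recursions above can be sketched as follows, with Y_1 = X_1. The data, the constant step-size a = 0.1 and the choice B_n = R_n R_n′, F_n = R_n S_n′ on already standardized simulated regressors are assumptions of the example; the sketch also keeps the iterates X_j only to check that the recursive form of Y_n equals their arithmetic mean.

```python
import numpy as np

def averaged_sgd(B_seq, F_seq, a, X1):
    """Run X_{n+1} = X_n - a (B_n X_n - F_n) together with the recursive average
    Y_{n+1} = Y_n - (Y_n - X_{n+1}) / (n + 1), starting from Y_1 = X_1."""
    X, Y = X1.copy(), X1.copy()
    history = [X1.copy()]                 # kept only to check the averaging formula
    for n, (B_n, F_n) in enumerate(zip(B_seq, F_seq), start=1):
        X = X - a * (B_n @ X - F_n)
        Y = Y - (Y - X) / (n + 1)
        history.append(X.copy())
    return X, Y, history

# Hypothetical simulated data with zero-mean, unit-variance regressors.
rng = np.random.default_rng(3)
N = 5000
R = rng.normal(size=(N, 2))
theta = np.array([[1.0], [-2.0]])
S = R @ theta + 0.1 * rng.normal(size=(N, 1))
B_seq = [np.outer(r, r) for r in R]
F_seq = [np.outer(r, s) for r, s in zip(R, S)]

X, Y, history = averaged_sgd(B_seq, F_seq, a=0.1, X1=np.zeros((2, 1)))
```

In a streaming setting the list `history` would be dropped, since the recursion for Y_n only requires the current X_{n+1}.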

The a.s. convergence of (Y_n, n ≥ 1) and its application to sequential linear regression are studied.

3.1 Lemma

Lemma 6 Let (u_n), (v_n) and (a_n) be three real sequences, with u_n > 0 and a_n > 0 for all n, and let λ be a positive real number such that, for n ≥ 1,

\begin{matrix} u_{n + 1} & \leq & (1 - a_{n} λ) u_{n} + a_{n} v_{n} . \end{matrix}

Suppose:

1) v_n → 0
2) $(a_{n} = a < \frac{1}{λ})$ or $(a_{n} \to 0, \sum_{n = 1}^{\infty} a_{n} = \infty)$ .
Under assumptions 1 and 2, u_n → 0.
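The statement of Lemma 6 can be illustrated numerically on the extremal case where the inequality is an equality. The sequences below (v_n = n^{-1/2}, a constant step a = 1 with λ = 0.5, and a decreasing step a_n = n^{-1/2}) are hypothetical choices that satisfy assumptions 1 and 2.

```python
def iterate_bound(u1, a, v, lam):
    """Iterate u_{n+1} = (1 - a_n * lam) * u_n + a_n * v_n (equality case of Lemma 6)."""
    u = [u1]
    for a_n, v_n in zip(a, v):
        u.append((1.0 - a_n * lam) * u[-1] + a_n * v_n)
    return u

N, lam = 20000, 0.5
v = [n ** -0.5 for n in range(1, N + 1)]            # v_n -> 0

# Case a_n = a < 1 / lam
u_const = iterate_bound(10.0, [1.0] * N, v, lam)

# Case a_n -> 0 with sum a_n = infinity
a_decr = [n ** -0.5 for n in range(1, N + 1)]
u_decr = iterate_bound(10.0, a_decr, v, lam)
```

In both cases the final values are small compared with u_1 = 10, in line with u_n → 0.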

Proof of Lemma 6. In the case where a_n depends on n, as a_n → 0, we can suppose without loss of generality that 1 − a_n λ > 0 for n ≥ 1. We have:

\begin{matrix} u_{n + 1} \leq \prod_{i = 1}^{n} (1 - a_{i} λ) u_{1} + \sum_{i = 1}^{n} a_{i} \prod_{l = i + 1}^{n} (1 - a_{l} λ) v_{i}, \quad \text{with } \prod_{l = n + 1}^{n} (1 - a_{l} λ) = 1 . \end{matrix}

Now, for n₁ ≤ n₂ ≤ n and 0 < c_i < 1 with c_i = a_i λ for all i, we have:

\begin{matrix} \sum_{i = n_{1}}^{n_{2}} c_{i} \prod_{l = i + 1}^{n} (1 - c_{l}) & = & \sum_{i = n_{1}}^{n_{2}} (1 - (1 - c_{i})) \prod_{l = i + 1}^{n} (1 - c_{l}) \\ & = & \sum_{i = n_{1}}^{n_{2}} (\prod_{l = i + 1}^{n} (1 - c_{l}) - \prod_{l = i}^{n} (1 - c_{l})) \\ & = & \prod_{l = n_{2} + 1}^{n} (1 - c_{l}) - \prod_{l = n_{1}}^{n} (1 - c_{l}) \leq \prod_{l = n_{2} + 1}^{n} (1 - c_{l}) \leq 1 . \end{matrix}
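The telescoping identity above can be checked numerically. The sketch below evaluates both sides for an arbitrary sequence 0 < c_i < 1; the particular sequence and the indices n₁ = 3, n₂ = 12, n = 30 are illustrative.

```python
import math

def weighted_sum(c, n1, n2, n):
    """Left-hand side: sum_{i=n1}^{n2} c_i * prod_{l=i+1}^{n} (1 - c_l), 1-based indices."""
    return sum(
        c[i - 1] * math.prod(1.0 - c[l - 1] for l in range(i + 1, n + 1))
        for i in range(n1, n2 + 1)
    )

def telescoped(c, n1, n2, n):
    """Right-hand side: prod_{l=n2+1}^{n} (1 - c_l) - prod_{l=n1}^{n} (1 - c_l)."""
    return (
        math.prod(1.0 - c[l - 1] for l in range(n2 + 1, n + 1))
        - math.prod(1.0 - c[l - 1] for l in range(n1, n + 1))
    )

c = [0.9 / (i + 1) for i in range(1, 31)]   # 0 < c_i < 1 for i = 1, ..., 30
lhs = weighted_sum(c, 3, 12, 30)
rhs = telescoped(c, 3, 12, 30)
```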

Let ϵ > 0. There exists N such that for i > N, $| v_{i} | < \frac{ϵ}{3} λ$ . Then for n ≥ N, applying the previous inequality with c_i = a_i λ, n₁ = 1, n₂ = N, yields:

\begin{matrix} u_{n + 1} & \leq & \prod_{i = 1}^{n} (1 - a_{i} λ) u_{1} + \sum_{i = 1}^{N} a_{i} λ \prod_{l = i + 1}^{n} (1 - a_{l} λ) \frac{| v_{i} |}{λ} + \frac{ϵ}{3} \sum_{i = N + 1}^{n} a_{i} λ \prod_{l = i + 1}^{n} (1 - a_{l} λ) \\ & \leq & \prod_{i = 1}^{n} (1 - a_{i} λ) u_{1} + \frac{1}{λ} \max_{1 \leq i \leq N} | v_{i} | \prod_{l = N + 1}^{n} (1 - a_{l} λ) + \frac{ϵ}{3} . \end{matrix}

In the case where a_n depends on n, ln(1 − a_i λ) ∼ −a_i λ as a_i → 0 (i → ∞); then, as $\sum_{n = 1}^{\infty} a_{n} = \infty$ , $\prod_{l = N + 1}^{n} (1 - a_{l} λ) \to 0 \ (n \to \infty)$ .

In the case a_n = a, $\prod_{l = N + 1}^{n} (1 - a λ) = {(1 - a λ)}^{n - N} \to 0 \ (n \to \infty)$ as 0 < 1 − aλ < 1.

Thus there exists N₁ such that u_{n+1} < ϵ for n > N₁ ∎

3.2 Theorem

Make the following assumptions

(H1b) There exist a positive definite symmetrical matrix B in $\mathbb{R}^{p \times p}$ and a positive real number b such that a.s.

1) lim_{n → ∞}(E[B_n|T_n] − B) = 0

2) $\sum_{n = 1}^{\infty} \frac{1}{n} {(E [{|| E [B_{n} | T_{n}] - B ||}^{2}])}^{\frac{1}{2}} < \infty$

3) sup_n E[‖B_n−B‖²|T_n] ≤ b.

(H2b) There exist a matrix F in $\mathbb{R}^{p \times q}$ and a positive real number d such that a.s.

1) lim_{n → ∞}(E[F_n|T_n] − F) = 0

2) sup_n E [‖F_n − F‖²|T_n] ≤ d.

(H3b) λ and λ_max being respectively the smallest and the largest eigenvalue of B, $0 < a < min (\frac{1}{λ_{m a x}}, \frac{2 λ}{λ^{2} + b})$ .
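The admissible range for the constant step-size in H3b can be computed directly once B and b are given. In the sketch below the matrix B and the value b = 6 are hypothetical inputs chosen for illustration; the last lines check, for one admissible step-size, that the spectral norm of I − aB equals 1 − aλ.

```python
import numpy as np

def max_constant_step(B, b):
    """Upper bound of H3b: min(1 / lambda_max, 2 * lambda / (lambda**2 + b))."""
    eig = np.linalg.eigvalsh(B)           # eigenvalues of the symmetric matrix B, ascending
    lam, lam_max = eig[0], eig[-1]
    return min(1.0 / lam_max, 2.0 * lam / (lam**2 + b))

B = np.array([[1.0, 0.5], [0.5, 1.0]])    # correlation matrix with eigenvalues 0.5 and 1.5
bound = max_constant_step(B, b=6.0)       # min(1/1.5, 1/6.25) = 0.16

a = 0.1                                   # an admissible step-size, a < bound
spectral_norm = np.linalg.norm(np.eye(2) - a * B, 2)
```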

Theorem 7 Suppose H1b, H2b and H3b hold. Then Y_n converges to θ = B⁻¹ F a.s.

Remark 1 Györfi and Walk [5] proved that Y_n converges to θ a.s. and in quadratic mean under the assumptions E[B_n|T_n] = B, E[F_n|T_n] = F, H1b2 and H2b2. Theorem 7 is an extension of their a.s. convergence result when E[B_n|T_n] → B and E[F_n|T_n] → F a.s.

Remark 2 Define $R_{1} = \begin{pmatrix} R \\ 1 \end{pmatrix}$ , $B = E [R_{1} R_{1}^{'}]$ , F = E[R₁ S′]. If ((R_{1n}, S_n), n ≥ 1) is an i.i.d. sample of (R₁, S) whose moments of order 4 exist, assumptions H1b and H2b are verified for $B_{n} = R_{1 n} R_{1 n}^{'}$ and $F_{n} = R_{1 n} S_{n}^{'}$ as $E [R_{1 n} R_{1 n}^{'} | T_{n}] = E [R_{1} R_{1}^{'}] = B$ and $E [R_{1 n} S_{n}^{'} | T_{n}] = F$ .

Proof of Theorem 7. Denote

\begin{matrix} Z_{n} & = & (B_{n} - B) (X_{n} - θ) + (B_{n} - B) θ - (F_{n} - F), \\ X_{n}^{1} & = & X_{n} - θ, \\ Y_{n}^{1} & = & Y_{n} - θ = \frac{1}{n} \sum_{j = 1}^{n} X_{j}^{1} . \end{matrix}

Step 1: give a sufficient condition to have $Y_{n}^{1} \to 0$ a.s.

We have (cf. proof of theorem 1):

\begin{matrix} X_{n + 1}^{1} & = & (I - a B) X_{n}^{1} - a Z_{n}, \\ Y_{n + 1}^{1} & = & \frac{1}{n + 1} X_{1}^{1} + \frac{1}{n + 1} \sum_{j = 2}^{n + 1} X_{j}^{1} \\ & = & \frac{1}{n + 1} X_{1}^{1} + \frac{1}{n + 1} \sum_{j = 2}^{n + 1} (I - a B) X_{j - 1}^{1} - a \frac{1}{n + 1} \sum_{j = 2}^{n + 1} Z_{j - 1} \\ & = & \frac{1}{n + 1} X_{1}^{1} + \frac{n}{n + 1} (I - a B) Y_{n}^{1} - a \frac{1}{n + 1} \sum_{j = 1}^{n} Z_{j} . \end{matrix}

Take now the Frobenius norm of $Y_{n + 1}^{1}$ :

\begin{matrix} ‖ Y_{n + 1}^{1} ‖ & \leq & ‖ (I - a B) Y_{n}^{1} ‖ + a ‖ \frac{1}{n + 1} \sum_{j = 1}^{n} Z_{j} - \frac{1}{n + 1} \frac{1}{a} X_{1}^{1} ‖ . \end{matrix}

Under H3b, all the eigenvalues of I − aB are positive and the spectral norm of I − aB is equal to 1 − aλ. Then:

\begin{matrix} ‖ Y_{n + 1}^{1} ‖ & \leq & (1 - a λ) ‖ Y_{n}^{1} ‖ + a ‖ \frac{1}{n + 1} \sum_{j = 1}^{n} Z_{j} - \frac{1}{n + 1} \frac{1}{a} X_{1}^{1} ‖ . \end{matrix}

By lemma 6, it suffices to prove $\frac{1}{n} \sum_{j = 1}^{n} Z_{j} \to 0$ a.s. to conclude $Y_{n}^{1} \to 0$ a.s.

Step 2: prove that assumptions H1b and H2b imply respectively $\frac{1}{n} \sum_{j = 1}^{n} B_{j} \to B$ and $\frac{1}{n} \sum_{j = 1}^{n} F_{j} \to F$ a.s.

The proof is only given for (B_n), the other one being similar.

Assumption H1b3 implies sup_n E[‖B_n − B‖²] < ∞. It follows that, for each element $B_{n}^{k l}$ and B^kl of B_n and B respectively, $\sum_{n = 1}^{\infty} \frac{\mathrm{Var}[B_{n}^{k l} - B^{k l}]}{n^{2}} < \infty$ . Therefore, by the strong law of large numbers for martingale differences:

\begin{matrix} \frac{1}{n} \sum_{j = 1}^{n} (B_{j}^{k l} - B^{k l} - E [B_{j}^{k l} - B^{k l} | T_{j}]) \to 0 \quad \text{a.s.} \end{matrix}

As $E [B_{j}^{k l} - B^{k l} | T_{j}] \to 0$ a.s. by H1b1, we have for each (k, l)

\begin{matrix} \frac{1}{n} \sum_{j = 1}^{n} (B_{j}^{k l} - B^{k l}) ⟶ 0 \quad \text{a.s.} \end{matrix}

Then $\frac{1}{n} \sum_{j = 1}^{n} (B_{j} - B) \to 0$ a.s.

Step 3: prove now that $\frac{1}{n} \sum_{j = 1}^{n} (B_{j} - B) X_{j}^{1} \to 0$ a.s.

Denote β_n = ‖E[B_n|T_n] − B‖ and γ_n = ‖E[F_n|T_n] − F‖. β_n → 0 and γ_n → 0 a.s. under H1b1 and H2b1. Then: ∀δ > 0, ∀ε > 0, ∃N(δ, ε): ∀n ≥ N(δ, ε),

\begin{matrix} P (\{\sup_{j > n} (β_{j}) \leq δ\} ⋂ \{\sup_{j > n} (γ_{j}) \leq δ\}) > 1 - ε . \end{matrix}

As $a < \frac{2 λ}{λ^{2} + b}$ , choose η such that:

\begin{matrix} 0 < η < \frac{1}{b} (\frac{2 λ}{a} - (λ^{2} + b)) \Leftrightarrow λ > \frac{a}{2} (λ^{2} + b + η b) . \end{matrix}

Choose δ such that

\begin{matrix} 0 < δ < \frac{1}{(1 - a λ) (|| θ || + 2)} (λ - \frac{a}{2} (λ^{2} + b + η b)) . \end{matrix}

Let ε be fixed. Denote N₀ = N(δ, ε) and, for n > N₀,

\begin{matrix} G_{n} & = & \{\sup_{N_{0} < j \leq n} (β_{j}) \leq δ\} ⋂ \{\sup_{N_{0} < j \leq n} (γ_{j}) \leq δ\}, \\ G & = & \{\sup_{j > N_{0}} (β_{j}) \leq δ\} ⋂ \{\sup_{j > N_{0}} (γ_{j}) \leq δ\} = ⋂_{n > N_{0}} G_{n} . \end{matrix}

Remark that G_n is T_n-measurable and, I_G denoting the indicator of G,

\begin{matrix} G \subset G_{n + 1} \subset G_{n} \Leftrightarrow I_{G} \leq I_{G_{n + 1}} \leq I_{G_{n}} . \end{matrix}

Step 3a: prove that ${sup}_{n} E [{|| X_{n}^{1} ||}^{2} I_{G_{n}}] < \infty$ .

\begin{matrix} {|| X_{n + 1}^{1} ||}^{2} I_{G_{n + 1}} & \leq & {|| X_{n + 1}^{1} ||}^{2} I_{G_{n}} = {|| (I - a B) X_{n}^{1} I_{G_{n}} - a Z_{n} I_{G_{n}} ||}^{2} \\ & \leq & {|| (I - a B) X_{n}^{1} I_{G_{n}} ||}^{2} - 2 a ⟨ (I - a B) X_{n}^{1} I_{G_{n}}, Z_{n} I_{G_{n}} ⟩ + a^{2} {|| Z_{n} I_{G_{n}} ||}^{2} . \end{matrix}

As the spectral norm ‖I − aB‖ = 1 − aλ, taking the conditional expectation with respect to T_n yields a.s.

\begin{matrix} E [{|| X_{n + 1}^{1} ||}^{2} I_{G_{n + 1}} | T_{n}] & \leq & {(1 - a λ)}^{2} {|| X_{n}^{1} I_{G_{n}} ||}^{2} - 2 a ⟨ (I - a B) X_{n}^{1} I_{G_{n}}, E [Z_{n} | T_{n}] I_{G_{n}} ⟩ \\ & & + a^{2} E [{|| Z_{n} I_{G_{n}} ||}^{2} | T_{n}] . \end{matrix}

Now:

\begin{matrix} || E [Z_{n} | T_{n}] I_{G_{n}} || & = & || (E [B_{n} | T_{n}] - B) X_{n}^{1} I_{G_{n}} + (E [B_{n} | T_{n}] - B) θ I_{G_{n}} - (E [F_{n} | T_{n}] - F) I_{G_{n}} || \\ & \leq & δ || X_{n}^{1} I_{G_{n}} || + δ (|| θ || + 1) , \\ E [{|| Z_{n} I_{G_{n}} ||}^{2} | T_{n}] & \leq & (1 + η) E [{|| (B_{n} - B) X_{n}^{1} I_{G_{n}} ||}^{2} | T_{n}] \\ & & + (1 + \frac{1}{η}) E [{|| (B_{n} - B) θ I_{G_{n}} - (F_{n} - F) I_{G_{n}} ||}^{2} | T_{n}] \\ & \leq & (1 + η) b {|| X_{n}^{1} I_{G_{n}} ||}^{2} + 2 (1 + \frac{1}{η}) (b {|| θ ||}^{2} + d) . \end{matrix}

Therefore:

\begin{matrix} E [{|| X_{n + 1}^{1} ||}^{2} I_{G_{n + 1}} | T_{n}] & \leq & ({(1 - a λ)}^{2} + 2 a (1 - a λ) δ + a^{2} (1 + η) b) {|| X_{n}^{1} I_{G_{n}} ||}^{2} \\ & & + 2 a (1 - a λ) δ (|| θ || + 1) || X_{n}^{1} I_{G_{n}} || \\ & & + 2 a^{2} (1 + \frac{1}{η}) (b {|| θ ||}^{2} + d) . \end{matrix}

As $|| X_{n}^{1} I_{G_{n}} || \leq 1 + {|| X_{n}^{1} I_{G_{n}} ||}^{2}$ , taking mathematical expectation yields:

\begin{matrix} E [{|| X_{n + 1}^{1} ||}^{2} I_{G_{n + 1}}] & \leq & ρ E [{|| X_{n}^{1} I_{G_{n}} ||}^{2}] + e, \\ ρ & = & {(1 - a λ)}^{2} + 2 a (1 - a λ) δ (|| θ || + 2) + a^{2} (1 + η) b, \\ e & = & 2 a (1 - a λ) δ (|| θ || + 1) + 2 a^{2} (1 + \frac{1}{η}) (b {|| θ ||}^{2} + d) . \end{matrix}

As $ρ = 1 + 2 a ((1 - a λ) (|| θ || + 2) δ - λ + \frac{a}{2} (λ^{2} + b + η b)) < 1$ by the choice of δ, this implies $g = {sup}_{n} E [{|| X_{n}^{1} ||}^{2} I_{G_{n}}] < \infty$ .
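The choice of η and δ made in Step 3 can be checked numerically. The following Python sketch picks η and δ strictly inside their admissible intervals and evaluates ρ and e. The function name, the `eta_frac` and `delta_frac` parameters and the sample values in the usage note are illustrative assumptions and are not part of the proof.

```python
def contraction_constants(a, lam, b, d, theta_norm, eta_frac=0.5, delta_frac=0.5):
    """Return (eta, delta, rho, e) for the recursion u_{n+1} <= rho * u_n + e.

    a: constant step-size, lam: smallest eigenvalue of B,
    b, d: the bounds used in the estimate of E[||Z_n||^2 | T_n],
    theta_norm: ||theta||. eta_frac and delta_frac (both in (0, 1)) are
    hypothetical knobs selecting a point inside each admissible interval.
    """
    if not 0 < a < 2 * lam / (lam ** 2 + b):
        raise ValueError("need 0 < a < 2*lam / (lam**2 + b)")
    if a * lam >= 1:
        raise ValueError("need a*lam < 1 so that ||I - aB|| = 1 - a*lam")
    # 0 < eta < (2*lam/a - (lam^2 + b)) / b
    eta = eta_frac * (2 * lam / a - (lam ** 2 + b)) / b
    # slack = lam - (a/2)(lam^2 + b + eta*b) is positive by the choice of eta
    slack = lam - 0.5 * a * (lam ** 2 + b + eta * b)
    # 0 < delta < slack / ((1 - a*lam)(||theta|| + 2))
    delta = delta_frac * slack / ((1 - a * lam) * (theta_norm + 2))
    rho = ((1 - a * lam) ** 2
           + 2 * a * (1 - a * lam) * delta * (theta_norm + 2)
           + a ** 2 * (1 + eta) * b)
    e = (2 * a * (1 - a * lam) * delta * (theta_norm + 1)
         + 2 * a ** 2 * (1 + 1 / eta) * (b * theta_norm ** 2 + d))
    return eta, delta, rho, e
```

For instance, `contraction_constants(a=0.1, lam=0.5, b=1.0, d=1.0, theta_norm=2.0)` gives ρ ≈ 0.978 < 1. Iterating u_{n+1} ≤ ρ u_n + e with ρ < 1 keeps u_n below the larger of its starting value and e/(1 − ρ), which is the boundedness used in Step 3a.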

Step 3b: conclusion.

\begin{matrix} E [{|| (B_{n} - B) X_{n}^{1} I_{G_{n}} ||}^{2}] & = & E [E [{|| (B_{n} - B) X_{n}^{1} I_{G_{n}} ||}^{2} | T_{n}]] \\ & \leq & E [E [{|| B_{n} - B ||}^{2} | T_{n}] {|| X_{n}^{1} I_{G_{n}} ||}^{2}] \\ & \leq & b g . \end{matrix}

Then: $\sum_{n = 1}^{\infty} \frac{E [{|| (B_{n} - B) X_{n}^{1} I_{G_{n}} ||}^{2}]}{n^{2}} < \infty$ . Therefore a.s.:

\begin{matrix} \frac{1}{n} \sum_{j = 1}^{n} ((B_{j} - B) X_{j}^{1} I_{G_{j}} - E [(B_{j} - B) X_{j}^{1} I_{G_{j}} | T_{j}]) ⟶ 0 . \end{matrix}

Now:

\begin{matrix} \sum_{n = 1}^{\infty} \frac{1}{n} E [|| (E [B_{n} | T_{n}] - B) X_{n}^{1} I_{G_{n}} ||] & \leq & \sum_{n = 1}^{\infty} \frac{1}{n} E [|| E [B_{n} | T_{n}] - B || || X_{n}^{1} I_{G_{n}} ||] \\ & \leq & \sum_{n = 1}^{\infty} \frac{1}{n} {(E [{|| E [B_{n} | T_{n}] - B ||}^{2}])}^{\frac{1}{2}} {(E [{|| X_{n}^{1} I_{G_{n}} ||}^{2}])}^{\frac{1}{2}} \\ & \leq & g^{\frac{1}{2}} \sum_{n = 1}^{\infty} \frac{1}{n} {(E [{|| E [B_{n} | T_{n}] - B ||}^{2}])}^{\frac{1}{2}} < \infty \text{ by H1b2} . \end{matrix}

Then:

\begin{matrix} \sum_{n = 1}^{\infty} \frac{1}{n} || (E [B_{n} | T_{n}] - B) X_{n}^{1} I_{G_{n}} || < \infty \text{ a.s.} \end{matrix}

This implies by the Kronecker lemma:

\begin{matrix} \frac{1}{n} \sum_{j = 1}^{n} (E [B_{j} | T_{j}] - B) X_{j}^{1} I_{G_{j}} ⟶ 0 \text{ a.s.} \end{matrix}

Therefore:

\begin{matrix} \frac{1}{n} \sum_{j = 1}^{n} (B_{j} - B) X_{j}^{1} I_{G_{j}} ⟶ 0 \text{ a.s.} \end{matrix}

In G, I_{G_j} = 1 for all j > N₀, therefore $\frac{1}{n} \sum_{j = 1}^{n} (B_{j} - B) X_{j}^{1} ⟶ 0$ a.s. in G. Then: $P (\frac{1}{n} \sum_{j = 1}^{n} (B_{j} - B) X_{j}^{1} ⟶ 0) \geq P (G) > 1 - ε$ . This is true for every ε > 0. Thus:

\begin{matrix} \frac{1}{n} \sum_{j = 1}^{n} (B_{j} - B) X_{j}^{1} ⟶ 0 \text{ a.s.} \end{matrix}

Therefore by step 2 and step 1, we conclude that $\frac{1}{n} \sum_{j = 1}^{n} Z_{j} ⟶ 0$ and $Y_{n}^{1} ⟶ 0$ a.s. ∎

3.3 Application to linear regression with online standardized data

Define as in section 2:

\begin{matrix} B_{n} & = & Γ_{M_{n - 1}} \frac{1}{m_{n}} \sum_{j \in I_{n}} (R_{j} - {\bar{R}}_{M_{n - 1}}) {(R_{j} - {\bar{R}}_{M_{n - 1}})}^{'} Γ_{M_{n - 1}}, \\ F_{n} & = & Γ_{M_{n - 1}} \frac{1}{m_{n}} \sum_{j \in I_{n}} (R_{j} - {\bar{R}}_{M_{n - 1}}) {(S_{j} - {\bar{S}}_{M_{n - 1}})}^{'} Γ_{M_{n - 1}}^{1} . \end{matrix}

Denote U = (R − E[R])(R − E[R])′, B = ΓE[U]Γ the correlation matrix of R, λ and λ_max respectively the smallest and the largest eigenvalue of B, b₁ = E[‖ΓUΓ − B‖²], F = ΓE[(R − E[R])(S − E[S])′]Γ¹.

Corollary 8 Suppose there is no affine relation between the components of R and the moments of order 4 of (R,S) exist. Suppose H3b1 holds:

(H3b1) $0 < a < \min (\frac{1}{λ_{\max}}, \frac{2 λ}{λ^{2} + b_{1}})$ .

Then Y_n converges to θ = B⁻¹F a.s.

This process was tested on several datasets and some results are given in section 5 (process S21 for m_n = 1 and S22 for m_n = 10).
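To get a feel for assumption H3b1, the upper bound on a can be estimated from a sample. The Python sketch below is only illustrative: the function name `h3b1_bound` is hypothetical, the moments are replaced by empirical averages over a finite sample, and the spectral norm is assumed for b₁.

```python
import numpy as np

def h3b1_bound(R):
    """Estimate min(1/lam_max, 2*lam/(lam**2 + b1)) from a sample R of shape (N, p)."""
    R = np.asarray(R, dtype=float)
    Rc = R - R.mean(axis=0)                 # centered observations
    Z = Rc / Rc.std(axis=0)                 # rows are Gamma (R_j - mean)
    B = Z.T @ Z / len(Z)                    # empirical correlation matrix
    eigenvalues = np.linalg.eigvalsh(B)     # returned in ascending order
    lam, lam_max = eigenvalues[0], eigenvalues[-1]
    # b1 = E[||Gamma U Gamma - B||^2], with the spectral norm (assumption)
    b1 = np.mean([np.linalg.norm(np.outer(z, z) - B, 2) ** 2 for z in Z])
    return min(1.0 / lam_max, 2.0 * lam / (lam ** 2 + b1)), lam, lam_max, b1

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
R = rng.normal(size=(500, 3)) * np.array([1.0, 10.0, 100.0])
bound, lam, lam_max, b1 = h3b1_bound(R)
print(f"estimated bound on a: {bound:.4f}; 1/p = {1 / R.shape[1]:.4f}")
```

Comparing the estimated bound with 1/p indicates whether the simple choice a = 1/p is covered by Corollary 8 for a given dataset.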

Proof of Corollary 8.

Step 1: introduction.

Using the decomposition of E[B_n|T_n] − B established in the proof of corollary 4, as ${\bar{R}}_{M_{n - 1}} ⟶ E [R]$ and Γ_{M_{n − 1}} ⟶ Γ a.s., it is obvious that E[B_n|T_n] − B ⟶ 0 a.s. Likewise E[F_n|T_n] − F ⟶ 0 a.s. Thus assumptions H1b1 and H2b1 are verified.

Suppose that Y_n does not converge to θ almost surely.

Then there exists a set of probability ε₁ > 0 in which Y_n does not converge to θ.

Denote $σ^{j} = \sqrt{\operatorname{Var} [R^{j}]}$ , j = 1,…,p.

As ${\bar{R}}_{M_{n - 1}} - E [R] ⟶ 0$ , $\sqrt{\frac{M_{n - 1}}{M_{n - 1} - 1}} V_{M_{n - 1}}^{j} - σ^{j} ⟶ 0$ , j = 1,…,p and Γ_{M_{n − 1}} − Γ ⟶ 0 almost surely, there exists a set G of probability greater than $1 - \frac{ε_{1}}{2}$ in which these sequences of random variables converge uniformly to 0.

Step 2: prove that $\sum_{n = 1}^{\infty} \frac{1}{n} {(E [|| Γ_{M_{n - 1}} - Γ || I_{G}])}^{\frac{1}{2}} < \infty$ .

By step 2 of the proof of lemma 5, we have for n > N:

\begin{matrix} | | Γ_{M_{n - 1}} - Γ | | I_{G} & \leq & \sum_{j = 1}^{p} \frac{| \frac{M_{n - 1}}{M_{n - 1} - 1} {(V_{M_{n - 1}}^{j})}^{2} - {(σ^{j})}^{2} |}{\sqrt{\frac{M_{n - 1}}{M_{n - 1} - 1}} V_{M_{n - 1}}^{j} σ^{j} (\sqrt{\frac{M_{n - 1}}{M_{n - 1} - 1}} V_{M_{n - 1}}^{j} + σ^{j})} I_{G} . \end{matrix}

As in G, $\sqrt{\frac{M_{n - 1}}{M_{n - 1} - 1}} V_{M_{n - 1}}^{j}$ converges uniformly to σ^j for j = 1,…,p, there exists c > 0 such that

\begin{matrix} || Γ_{M_{n - 1}} - Γ || I_{G} & \leq & c \sum_{j = 1}^{p} | \frac{M_{n - 1}}{M_{n - 1} - 1} {(V_{M_{n - 1}}^{j})}^{2} - {(σ^{j})}^{2} | . \end{matrix}

Then there exists d > 0 such that

\begin{matrix} E [|| Γ_{M_{n - 1}} - Γ || I_{G}] & \leq & \frac{d}{\sqrt{M_{n - 1}}} \leq \frac{d}{\sqrt{n - 1}} . \end{matrix}

Therefore $\sum_{n = 1}^{\infty} \frac{1}{n} {(E [|| Γ_{M_{n - 1}} - Γ || I_{G}])}^{\frac{1}{2}} < \infty$ .

Step 3: prove that assumption H1b2 is verified in G.

Using the decomposition of E[B_n|T_n] − B given in step 1 of the proof of corollary 4, with R^c = R − E[R] and ${\bar{R}}_{M_{n - 1}}^{c} = {\bar{R}}_{M_{n - 1}} - E [R]$ yields a.s.:

\begin{matrix} (E [B_{n} | T_{n}] - B) I_{G} & = & ((Γ_{M_{n - 1}} - Γ) E [R^{c} {R^{c}}^{'}] Γ_{M_{n - 1}} + Γ E [R^{c} {R^{c}}^{'}] (Γ_{M_{n - 1}} - Γ) \\ & & + Γ_{M_{n - 1}} {\bar{R}}_{M_{n - 1}}^{c} {({\bar{R}}_{M_{n - 1}}^{c})}^{'} Γ_{M_{n - 1}}) I_{G} . \end{matrix}

As in G, Γ_{M_{n − 1}} − Γ and ${\bar{R}}_{M_{n - 1}}^{c}$ converge uniformly to 0, E[B_n|T_n] − B converges uniformly to 0. Moreover there exists c₁ > 0 such that

\begin{matrix} || E [B_{n} | T_{n}] - B || I_{G} & \leq & c_{1} (|| Γ_{M_{n - 1}} - Γ || I_{G} + || {\bar{R}}_{M_{n - 1}}^{c} ||) \text{ a.s.} \end{matrix}

By the proof of lemma 5: $E [|| {\bar{R}}_{M_{n - 1}}^{c} ||] \leq {(\frac{\operatorname{Var} [R]}{n - 1})}^{\frac{1}{2}}$ ; then $\sum_{n = 1}^{\infty} \frac{1}{n} {(E [|| {\bar{R}}_{M_{n - 1}}^{c} ||])}^{\frac{1}{2}} < \infty$ .

By step 2: $\sum_{n = 1}^{\infty} \frac{1}{n} {(E [|| Γ_{M_{n - 1}} - Γ || I_{G}])}^{\frac{1}{2}} < \infty$ .

Then: $\sum_{n = 1}^{\infty} \frac{1}{n} {(E [|| E [B_{n} | T_{n}] - B || I_{G}])}^{\frac{1}{2}} < \infty$ .

As E[B_n|T_n] − B converges uniformly to 0 on G, we obtain:

\begin{matrix} \sum_{n = 1}^{\infty} \frac{1}{n} {(E [{|| E [B_{n} | T_{n}] - B ||}^{2} I_{G}])}^{\frac{1}{2}} < \infty . \end{matrix}

Thus assumption H1b2 is verified in G.

Step 4: prove that assumption H1b3 is verified in G.

Denote R^c = R − E[R], $R_{j}^{c} = R_{j} - E [R]$ , ${\bar{R}}_{j}^{c} = {\bar{R}}_{j} - E [R]$ . Consider the decomposition:

\begin{matrix} B_{n} - B & = & Γ_{M_{n - 1}} \frac{1}{m_{n}} \sum_{j \in I_{n}} (R_{j}^{c} - {\bar{R}}_{M_{n - 1}}^{c}) {(R_{j}^{c} - {\bar{R}}_{M_{n - 1}}^{c})}^{'} Γ_{M_{n - 1}} - Γ E [R^{c} {R^{c}}^{'}] Γ \\ & = & α_{n} + β_{n} \end{matrix}

\begin{matrix} \text{with } α_{n} & = & Γ_{M_{n - 1}} \frac{1}{m_{n}} \sum_{j \in I_{n}} (R_{j}^{c} {R_{j}^{c}}^{'} - {\bar{R}}_{M_{n - 1}}^{c} {R_{j}^{c}}^{'} - R_{j}^{c} {({\bar{R}}_{M_{n - 1}}^{c})}^{'} + {\bar{R}}_{M_{n - 1}}^{c} {({\bar{R}}_{M_{n - 1}}^{c})}^{'}) Γ_{M_{n - 1}} \\ & & - Γ \frac{1}{m_{n}} \sum_{j \in I_{n}} R_{j}^{c} {R_{j}^{c}}^{'} Γ \\ & = & (Γ_{M_{n - 1}} - Γ) (\frac{1}{m_{n}} \sum_{j \in I_{n}} R_{j}^{c} {R_{j}^{c}}^{'}) Γ_{M_{n - 1}} + Γ (\frac{1}{m_{n}} \sum_{j \in I_{n}} R_{j}^{c} {R_{j}^{c}}^{'}) (Γ_{M_{n - 1}} - Γ) \\ & & - Γ_{M_{n - 1}} {\bar{R}}_{M_{n - 1}}^{c} \frac{1}{m_{n}} \sum_{j \in I_{n}} {R_{j}^{c}}^{'} Γ_{M_{n - 1}} - Γ_{M_{n - 1}} \frac{1}{m_{n}} \sum_{j \in I_{n}} R_{j}^{c} {({\bar{R}}_{M_{n - 1}}^{c})}^{'} Γ_{M_{n - 1}} \\ & & + Γ_{M_{n - 1}} {\bar{R}}_{M_{n - 1}}^{c} {({\bar{R}}_{M_{n - 1}}^{c})}^{'} Γ_{M_{n - 1}}, \\ β_{n} & = & Γ (\frac{1}{m_{n}} \sum_{j \in I_{n}} R_{j}^{c} {R_{j}^{c}}^{'} - E [R^{c} {R^{c}}^{'}]) Γ . \end{matrix}

Let η > 0.

\begin{matrix} E [{|| B_{n} - B ||}^{2} I_{G} | T_{n}] & = & E [{|| α_{n} + β_{n} ||}^{2} I_{G} | T_{n}] \\ & \leq & (1 + \frac{1}{η}) E [{|| α_{n} ||}^{2} I_{G} | T_{n}] + (1 + η) E [{|| β_{n} ||}^{2} I_{G} | T_{n}] \text{ a.s.} \end{matrix}

As random variables $R_{j}^{c}$ , j ∈ I_n, are independent of T_n, as Γ_{M_{n − 1}} and ${\bar{R}}_{M_{n - 1}}^{c}$ are T_n-measurable and converge uniformly respectively to Γ and 0 on G, E[‖α_n‖² I_G|T_n] converges uniformly to 0. Then, for δ > 0, there exists N₁ such that for n > N₁, E[‖α_n‖² I_G|T_n] ≤ δ a.s.

Moreover, denoting U = R^cR^c′ and $U_{j} = R_{j}^{c} {R_{j}^{c}}^{'}$ , we have, as the random variables U_j form an i.i.d. sample of U:

\begin{matrix} E [{|| β_{n} ||}^{2} | T_{n}] & = & E [{|| \frac{1}{m_{n}} \sum_{j \in I_{n}} Γ (U_{j} - E [U]) Γ ||}^{2} | T_{n}] \\ & \leq & E [{|| Γ (U - E [U]) Γ ||}^{2}] = E [{|| Γ U Γ - E [Γ U Γ] ||}^{2}] = b_{1} \text{ a.s.} \end{matrix}

Then:

\begin{matrix} E [{|| B_{n} - B ||}^{2} I_{G} | T_{n}] & \leq & (1 + \frac{1}{η}) δ + (1 + η) b_{1} = b \text{ a.s.} \end{matrix}

Thus assumption H1b3 is verified in G.

As ${\bar{S}}_{M_{n - 1}} - E [S] ⟶ 0$ and $Γ_{M_{n - 1}}^{1} - Γ^{1} ⟶ 0$ almost surely, it can be proved likewise that there exist a set H of probability greater than $1 - \frac{ε_{1}}{2}$ and d > 0 such that E[‖F_n − F‖² I_H|T_n] ≤ d a.s. Thus assumption H2b2 is verified in H.

Step 5: conclusion.

As $a < \min (\frac{1}{λ_{\max}}, \frac{2 λ}{λ^{2} + b_{1}})$ , $b_{1} < \frac{2 λ}{a} - λ^{2}$ .

Choose $0 < η < \frac{\frac{2 λ}{a} - λ^{2}}{b_{1}} - 1$ and $0 < δ < \frac{\frac{2 λ}{a} - λ^{2} - (1 + η) b_{1}}{1 + \frac{1}{η}}$ such that

\begin{matrix} b = (1 + \frac{1}{η}) δ + (1 + η) b_{1} < \frac{2 λ}{a} - λ^{2} ⟺ a < \frac{2 λ}{λ^{2} + b} . \end{matrix}

Thus assumption H3b is verified.

Applying theorem 7 implies that Y_n converges to θ almost surely in H ∩ G.

Therefore P(Y_n ⟶ θ) ≥ P(H ∩ G) > 1 − ε₁.

This is in contradiction with $P (Y_{n} ↛ θ) = ε_{1}$ . Thus Y_n converges to θ a.s. ∎

4 Convergence of a process with a variable or constant step-size and use of all observations until the current step

In this section, the convergence of the process (X_n, n ≥ 1) in $R^{p \times q}$ recursively defined by

\begin{matrix} X_{n + 1} = X_{n} - a_{n} (B_{n} X_{n} - F_{n}) \end{matrix}

and its application to sequential linear regression are studied.

4.1 Theorem

Make the following assumptions

(H1c) There exists a positive definite symmetrical matrix B such that B_n ⟶ B a.s.

(H2c) There exists a matrix F such that F_n ⟶ F a.s.

(H3c) λ_max denoting the largest eigenvalue of B,

$(a_{n} = a < \frac{1}{λ_{\max}})$ or $(a_{n} ⟶ 0, \sum_{n = 1}^{\infty} a_{n} = \infty)$ .

Theorem 9 Suppose H1c, H2c and H3c hold. Then X_n converges to B⁻¹F a.s.

Proof of Theorem 9.

Denote θ = B⁻¹F, $X_{n}^{1} = X_{n} - θ$ , Z_n = (B_n − B)θ − (F_n − F). Then:

\begin{matrix} X_{n + 1}^{1} = (I - a_{n} B_{n}) X_{n}^{1} - a_{n} Z_{n} . \end{matrix}

Let ω be fixed belonging to the intersection of the convergence sets {B_n ⟶ B} and {F_n ⟶ F}. The writing of ω is omitted in the following.

Denote ‖A‖ the spectral norm of a matrix A and λ the smallest eigenvalue of B.

In the case where a_n depends on n, as a_n ⟶ 0, we can suppose without loss of generality $a_{n} < \frac{1}{λ_{\max}}$ for all n. Then all the eigenvalues of I − a_nB are positive and ‖I − a_nB‖ = 1 − a_nλ.

Let 0 < ε < λ. As B_n − B ⟶ 0, we obtain for n sufficiently large:

\begin{matrix} || I - a_{n} B_{n} || & \leq & || I - a_{n} B || + a_{n} || B_{n} - B || \\ & \leq & 1 - a_{n} λ + a_{n} ε, \text{ with } a_{n} < \frac{1}{λ - ε} , \\ || X_{n + 1}^{1} || & \leq & (1 - a_{n} (λ - ε)) || X_{n}^{1} || + a_{n} || Z_{n} || . \end{matrix}

As Z_n ⟶ 0, applying lemma 6 yields $|| X_{n}^{1} || ⟶ 0$ .

Therefore X_n ⟶ B⁻¹F a.s. ∎
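The statement of Theorem 9 can be illustrated on a toy example. In the following Python sketch, the matrices B and F, the deterministic perturbations of order 1/n standing in for B_n − B and F_n − F, and the function name `run_process` are all assumptions chosen for illustration; they are not taken from the experiments of section 5.

```python
import numpy as np

def run_process(B, F, n_steps, step_size, perturbation=1.0):
    """Iterate X_{n+1} = X_n - a_n (B_n X_n - F_n) with B_n -> B and F_n -> F."""
    p, q = F.shape
    X = np.zeros((p, q))
    for n in range(1, n_steps + 1):
        # Hypothetical deterministic perturbations of order 1/n.
        B_n = B + (perturbation / n) * np.eye(p)
        F_n = F + (perturbation / n) * np.ones((p, q))
        X = X - step_size(n) * (B_n @ X - F_n)
    return X

B = np.array([[1.0, 0.5], [0.5, 1.0]])       # eigenvalues 0.5 and 1.5
F = np.array([[1.0], [2.0]])
theta = np.linalg.solve(B, F)                # limit B^{-1} F

# Constant step-size a = 0.5 < 1/lam_max = 2/3.
X_const = run_process(B, F, 5000, lambda n: 0.5)
# Variable step-size: a_n -> 0 and sum of a_n diverges.
X_var = run_process(B, F, 5000, lambda n: (1.0 + n) ** (-2.0 / 3.0))
print(np.abs(X_const - theta).max(), np.abs(X_var - theta).max())
```

In both cases the distance to B⁻¹F shrinks as n grows, at a speed governed here by how fast the perturbations vanish.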

4.2 Application to linear regression with online standardized data

Let (m_n, n ≥ 1) be a sequence of integers. Denote $M_{n} = \sum_{k = 1}^{n} m_{k}$ for n ≥ 1, M₀ = 0 and I_n = {M_{n − 1} + 1,…,M_n}.

Define

\begin{matrix} B_{n} & = & Γ_{M_{n}} (\frac{1}{M_{n}} \sum_{i = 1}^{n} \sum_{j \in I_{i}} R_{j} R_{j}^{'} - {\bar{R}}_{M_{n}} {\bar{R}}_{M_{n}}^{'}) Γ_{M_{n}}, \\ F_{n} & = & Γ_{M_{n}} (\frac{1}{M_{n}} \sum_{i = 1}^{n} \sum_{j \in I_{i}} R_{j} S_{j}^{'} - {\bar{R}}_{M_{n}} {\bar{S}}_{M_{n}}^{'}) Γ_{M_{n}}^{1} . \end{matrix}

As ((R_n, S_n), n ≥ 1) is an i.i.d. sample of (R, S), assumptions H1c and H2c are obviously verified with B = ΓE[(R − E[R])(R − E[R])′]Γ and F = ΓE[(R − E[R])(S − E[S])′]Γ¹. Then:

Corollary 10 Suppose there is no affine relation between the components of R and the moments of order 4 of (R, S) exist. Suppose H3c holds. Then X_n converges to B⁻¹F a.s.

Remark 3 B is the correlation matrix of R of dimension p. Then λ_max < Trace(B) = p. In the case of a constant step-size a, it suffices to take $a \leq \frac{1}{p}$ to verify H3c.
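The process of this section with a constant step-size a = 1/p can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the warm-up on the first `n_init` observations (included in the running sums) and the use of the corrected standard deviation for Γ_{M_n} and Γ¹_{M_n} are choices made for illustration.

```python
import numpy as np

def regress_all_observations(R, S, batch_size=10, n_init=100):
    """Sketch of X_{n+1} = X_n - a (B_n X_n - F_n) with a = 1/p.

    B_n and F_n are rebuilt at each step from running sums over all
    observations seen so far. The first n_init rows seed the sums
    (a hypothetical warm-up that avoids dividing by a zero variance).
    """
    R = np.asarray(R, dtype=float)
    S = np.asarray(S, dtype=float).reshape(len(R), -1)
    if len(R) <= n_init:
        raise ValueError("need more than n_init observations")
    p, q = R.shape[1], S.shape[1]
    a = 1.0 / p                                   # Remark 3: a <= 1/p
    M = n_init
    sum_R, sum_S = R[:M].sum(axis=0), S[:M].sum(axis=0)
    sum_RR, sum_RS = R[:M].T @ R[:M], R[:M].T @ S[:M]
    sum_SS = (S[:M] ** 2).sum(axis=0)
    X = np.zeros((p, q))
    for start in range(n_init, len(R), batch_size):
        Rb, Sb = R[start:start + batch_size], S[start:start + batch_size]
        M += len(Rb)
        sum_R += Rb.sum(axis=0)
        sum_S += Sb.sum(axis=0)
        sum_RR += Rb.T @ Rb
        sum_RS += Rb.T @ Sb
        sum_SS += (Sb ** 2).sum(axis=0)
        mean_R, mean_S = sum_R / M, sum_S / M
        # Diagonals of Gamma_{M_n} and Gamma^1_{M_n}: inverse standard deviations.
        gamma = 1.0 / np.sqrt((np.diag(sum_RR) / M - mean_R ** 2) * M / (M - 1))
        gamma1 = 1.0 / np.sqrt((sum_SS / M - mean_S ** 2) * M / (M - 1))
        B_n = gamma[:, None] * (sum_RR / M - np.outer(mean_R, mean_R)) * gamma[None, :]
        F_n = gamma[:, None] * (sum_RS / M - np.outer(mean_R, mean_S)) * gamma1[None, :]
        X = X - a * (B_n @ X - F_n)
    # Back to the raw scale: Theta = Gamma X (Gamma^1)^{-1}, H = mean_S - Theta' mean_R.
    Theta = gamma[:, None] * X / gamma1[None, :]
    H = mean_S - Theta.T @ mean_R
    return Theta, H

# Synthetic data with very different scales, for illustration only.
rng = np.random.default_rng(1)
R = rng.normal(size=(5000, 3)) * np.array([1.0, 10.0, 100.0]) + np.array([5.0, -3.0, 50.0])
S = R @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.01 * rng.normal(size=5000)
Theta, H = regress_all_observations(R, S)
print(Theta.ravel(), H)
```

With `batch_size=1` the sketch corresponds in spirit to S31 and with `batch_size=10` to S32. The step-size requires no knowledge of the data beyond the dimension p.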

Remark 4 In the definition of B_n and F_n, the R_j and the S_j are not directly pseudo-centered with respect to ${\bar{R}}_{M_{n}}$ and ${\bar{S}}_{M_{n}}$ respectively. Another equivalent definition of B_n and F_n can be used. It consists of replacing R_j by R_j − m, ${\bar{R}}_{M_{n}}$ by ${\bar{R}}_{M_{n}} - m$ , S_j by S_j − m₁, ${\bar{S}}_{M_{n}}$ by ${\bar{S}}_{M_{n}} - m_{1}$ , m and m₁ being respectively an estimation of E[R] and E[S] computed in a preliminary phase with a small number of observations. For example, at step n, $\sum_{j \in I_{n}} Γ_{M_{n}} (R_{j} - m) {(Γ_{M_{n}} (R_{j} - m))}^{'}$ is computed instead of $\sum_{j \in I_{n}} Γ_{M_{n}} R_{j} {(Γ_{M_{n}} R_{j})}^{'}$ . This limits the risk of numerical explosion.
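A scalar illustration of Remark 4 is given below. It shows one numerical reason why a preliminary shift by m can help: the two definitions are algebraically equivalent, but the sums of squares of raw data with a large mean lose precision in floating-point arithmetic. The data and the function name are hypothetical, and whether this is the exact effect the remark refers to is an assumption.

```python
import numpy as np

def variance_from_moments(x, m=0.0):
    """Variance computed as mean((x - m)^2) - (mean(x - m))^2."""
    y = np.asarray(x, dtype=float) - m
    return np.mean(y * y) - np.mean(y) ** 2

# Hypothetical data: large mean (1e8), small spread; exact variance is 2/3.
x = 1e8 + np.tile([-1.0, 0.0, 1.0], 1000)

naive = variance_from_moments(x)                      # raw second moments
shifted = variance_from_moments(x, m=x[:30].mean())   # m from a few first observations
print(naive, shifted)
```

The shifted version recovers 2/3, whereas the version based on raw second moments is dominated by rounding error, since squares of order 10¹⁶ cannot resolve differences of order 1 in double precision.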

This process was tested on several datasets and some results are given in section 5 (with a variable step-size: process S13 for m_n = 1 and S14 for m_n = 10; with a constant step-size: process S31 for m_n = 1 and S32 for m_n = 10).

5 Experiments

The three previously-defined processes of stochastic approximation with online standardized data were compared with the classical stochastic approximation and averaged stochastic approximation (or averaged stochastic gradient descent) processes with constant step-size (denoted ASGD) studied in [5] and [6]. A description of the methods along with abbreviations and parameters used is given in Table 1.

Table 1. Description of the methods.

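Method type	Abbreviation	Type of data	Number of observations used at each step of the process	Use of all the observations until the current step	Step-size	Use of the averaged process
Classic	C1	Raw data	1	No	variable	No
Classic	C2	Raw data	10	No	variable	No
Classic	C3	Raw data	1	Yes	variable	No
Classic	C4	Raw data	10	Yes	variable	No
ASGD	A1	Raw data	1	No	constant	Yes
ASGD	A2	Raw data	1	No	constant	Yes
Standardization 1	S11	Online standardized data	1	No	variable	No
Standardization 1	S12	Online standardized data	10	No	variable	No
Standardization 1	S13	Online standardized data	1	Yes	variable	No
Standardization 1	S14	Online standardized data	10	Yes	variable	No
Standardization 2	S21	Online standardized data	1	No	constant	Yes
Standardization 2	S22	Online standardized data	10	No	constant	Yes
Standardization 3	S31	Online standardized data	1	Yes	constant	No
Standardization 3	S32	Online standardized data	10	Yes	constant	No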

With the variable S set at dimension 1, 11 datasets were considered, some of which are available in free access on the Internet, while others were derived from the EPHESUS study [15]: 6 in regression (continuous dependent variable) and 5 in linear discriminant analysis (binary dependent variable). All datasets used in our experiments are presented in detail in Table 2, along with their download links. An a priori selection of variables was performed on each dataset using a stepwise procedure based on Fisher’s test with p-to-enter and p-to-remove fixed at 5 percent.

Table 2. Datasets used in our experiments.

Dataset name	N	p_a	p	Type of dependent variable	T²	Number of outliers	Source
CADATA	20640	8	8	Continuous	1.6x10⁶	122	www.dcc.fc.up.pt/∼ltorgo/Regression/DataSets.html
AILERONS	7154	40	9	Continuous	247.1	0	www.dcc.fc.up.pt/∼ltorgo/Regression/DataSets.html
ELEVATORS	8752	18	10	Continuous	7.7x10⁴	0	www.dcc.fc.up.pt/∼ltorgo/Regression/DataSets.html
POLY	5000	48	12	Continuous	4.1x10⁴	0	www.dcc.fc.up.pt/∼ltorgo/Regression/DataSets.html
eGFR	21382	31	15	Continuous	2.9x10⁴	0	derived from EPHESUS study [15]
HEMG	21382	31	17	Continuous	6.0x10⁴	0	derived from EPHESUS study [15]
QUANTUM	50000	78	14	Binary	22.5	1068	www.osmot.cs.cornell.edu/kddcup
ADULT	45222	97	95	Binary	4.7x10¹⁰	20	www.cs.toronto.edu/∼delve/data/datasets.html
RINGNORM	7400	20	20	Binary	52.8	0	www.cs.toronto.edu/∼delve/data/datasets.html
TWONORM	7400	20	20	Binary	24.9	0	www.cs.toronto.edu/∼delve/data/datasets.html
HOSPHF30D	21382	32	15	Binary	8.1x10⁵	0	derived from EPHESUS study [15]

N denotes the size of the global sample, p_a the number of parameters available, p the number of parameters selected and T² the trace of E[RR′]. An outlier is defined as an observation whose L2 norm is greater than five times the average norm.
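The outlier count reported in Table 2 can be reproduced along the following lines. This is a minimal Python sketch: the function name is hypothetical, and it is an assumption that the norm is taken over the rows of the explanatory variables only.

```python
import numpy as np

def count_outliers(R, factor=5.0):
    """Count rows whose L2 norm exceeds `factor` times the average L2 norm."""
    norms = np.linalg.norm(np.asarray(R, dtype=float), axis=1)
    return int(np.sum(norms > factor * norms.mean()))

# Hypothetical sample: 99 ordinary rows and one row with a very large norm.
R = np.vstack([np.tile([1.0, 0.0], (99, 1)), [[1000.0, 0.0]]])
print(count_outliers(R))
```

Here the average norm is 10.99, so only the last row exceeds five times the average.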

Let D = {(r_i, s_i), i = 1, 2,…,N} be the set of data in $R^{p} \times R$ . Assuming that it represents the set of realizations of a random vector (R, S) uniformly distributed in D, minimizing E[(S − θ′ R − η)²] is equivalent to minimizing $\frac{1}{N} \sum_{i = 1}^{N} {(s_{i} - θ^{'} r_{i} - η)}^{2}$ . One element of D (or several, depending on the process) is randomly drawn at each step to iterate the process.
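The empirical minimization and the random draw described above can be sketched as follows in Python. The function names are hypothetical, and drawing with replacement is an assumption consistent with (R, S) being uniformly distributed in D.

```python
import numpy as np

def least_squares_on_D(r, s):
    """Minimize (1/N) sum_i (s_i - theta' r_i - eta)^2 over (theta, eta)."""
    r = np.asarray(r, dtype=float)
    design = np.column_stack([r, np.ones(len(r))])    # last column carries eta
    coef, *_ = np.linalg.lstsq(design, np.asarray(s, dtype=float), rcond=None)
    return coef[:-1], coef[-1]

def draw(rng, r, s, m=1):
    """Draw m elements of D uniformly at random (with replacement: an assumption)."""
    idx = rng.integers(0, len(r), size=m)
    return r[idx], s[idx]

# Synthetic noise-free data, for illustration only.
rng = np.random.default_rng(3)
r = rng.normal(size=(200, 2))
s = r @ np.array([1.5, -2.0]) + 0.5
theta, eta = least_squares_on_D(r, s)
r_batch, s_batch = draw(rng, r, s, m=10)
print(theta, eta, r_batch.shape, s_batch.shape)
```

The least squares solution on D serves as the reference against which the sequential estimates are compared.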

To compare the methods, two different studies were performed: one by setting the total number of observations used, the other by setting the computing time.

The choice of step-size, the initialization for each method and the convergence criterion used are respectively presented and commented below.

Choice of step-size

In all methods of stochastic approximation, a suitable choice of step-size is often crucial for obtaining good performance of the process. If the step-size is too small, the convergence rate will be slower. Conversely, if the step-size is too large, a numerical explosion phenomenon may occur during the first iterations.

For the processes with a variable step-size (processes C1 to C4 and S11 to S14), we chose to use a_n of the following type:

\begin{matrix} a_{n} & = & \frac{c_{γ}}{{(b + n)}^{α}} . \end{matrix}

The constant $α = \frac{2}{3}$ was fixed, as suggested by Xu [16] in the case of stochastic approximation in linear regression, and b = 1. The results obtained for the choice $c_{γ} = \frac{1}{p}$ are presented although the latter does not correspond to the best choice for a classical method.
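The variable step-size described above can be written as a short Python function. The function name and the sample values of n and p are illustrative assumptions.

```python
def variable_step_size(n, p, alpha=2.0 / 3.0, b=1.0):
    """a_n = c_gamma / (b + n)**alpha with c_gamma = 1/p."""
    c_gamma = 1.0 / p
    return c_gamma / (b + n) ** alpha

# First values of the sequence for p = 8 selected parameters.
steps = [variable_step_size(n, p=8) for n in (1, 10, 100, 1000)]
print(steps)
```

Since 0 < α ≤ 1, this sequence tends to 0 while its sum diverges, which matches the variable step-size case of assumption H3c.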

For the ASGD method (A1, A2), two different constant step-sizes a as used in [6] were tested: $a = \frac{1}{T^{2}}$ and $a = \frac{1}{2 T^{2}}$ , T² denoting the trace of E[RR′]. Note that this choice of constant step-size assumes knowing a priori the dataset and is not suitable for a data stream.

For the methods with standardization and a constant step-size a (S21, S22, S31, S32), $a = \frac{1}{p}$ was chosen since the matrix E[RR′] is thus the correlation matrix of R, whose trace is equal to p, such that this choice corresponds to that of [6].
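The constant step-sizes discussed above can be compared on a sample. In this Python sketch the function name and the dictionary keys are hypothetical, and T² is estimated by the empirical second moment of the whole sample, which presupposes that the dataset is known in advance.

```python
import numpy as np

def constant_step_sizes(R):
    """Return the step-sizes 1/T^2, 1/(2 T^2) and the analogue for standardized data."""
    R = np.asarray(R, dtype=float)
    n, p = R.shape
    t2 = np.trace(R.T @ R) / n                    # estimate of T^2 = trace(E[RR'])
    Z = (R - R.mean(axis=0)) / R.std(axis=0)      # standardized data
    t2_std = np.trace(Z.T @ Z) / n                # equals p up to rounding
    return {"A1": 1.0 / t2, "A2": 1.0 / (2.0 * t2), "standardized": 1.0 / t2_std}

# Synthetic data with heterogeneous scales, for illustration only.
rng = np.random.default_rng(2)
R = rng.normal(size=(300, 4)) * np.array([1.0, 5.0, 20.0, 100.0]) + np.array([0.0, 1.0, 2.0, 3.0])
steps = constant_step_sizes(R)
print(steps)
```

For standardized data the trace is p whatever the original scales, so the step-size 1/p is available without a pass over the data.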

Initialization of processes

All processes (X_n) were initialized by $X_{1} = \underline{0}$ , the null vector. For the processes with standardization, a small number of observations (n = 1000) were taken into account in order to calculate an initial estimate of the means and standard deviations.

Convergence criterion

The “theoretical vector” θ¹ is assigned as that obtained by the least square method in D such that $θ^{1^{'}} = (\begin{matrix} θ^{'} & η \end{matrix})$ . Let $Θ_{n + 1}^{1}$ be the estimator of θ¹ obtained by stochastic approximation after n iterations.

In the case of a process (X_n) with standardized data, which yields an estimation of the vector denoted θ_c in section 1 as θ = Γθ_c(Γ¹)⁻¹ and η = E[S] − θ′ E[R], we can define:

\begin{matrix} {Θ_{n + 1}^{1}}^{'} & = & (\begin{matrix} Θ_{n + 1}^{'} & H_{n + 1} \end{matrix}) \\ \text{with } Θ_{n + 1} & = & Γ_{M_{n}} X_{n + 1} {(Γ_{M_{n}}^{1})}^{- 1} , \\ H_{n + 1} & = & {\bar{S}}_{M_{n}} - Θ_{n + 1}^{'} {\bar{R}}_{M_{n}} . \end{matrix}

To judge the convergence of the method, the cosine of the angle formed by the exact θ¹ and its estimate $Θ_{n + 1}^{1}$ was used as criterion:

\begin{matrix} cos (θ^{1}, Θ_{n + 1}^{1}) = \frac{{θ^{1}}^{'} Θ_{n + 1}^{1}}{{|| θ^{1} ||}_{2} {|| Θ_{n + 1}^{1} ||}_{2}} . \end{matrix}

Other criteria, such as $\frac{{|| θ^{1} - Θ_{n + 1}^{1} ||}_{2}}{{|| θ^{1} ||}_{2}}$ or $\frac{f (Θ_{n + 1}^{1}) - f (θ^{1})}{f (θ^{1})}$ , f being the loss function, were also tested, although the results are not presented in this article.
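The cosine criterion and the relative error can be computed as in the following Python sketch. The function names are hypothetical, and the vectors are assumed to stack the regression coefficients and the intercept.

```python
import numpy as np

def cosine_criterion(theta_ref, theta_est):
    """Cosine of the angle between the reference vector and its estimate."""
    u = np.ravel(theta_ref).astype(float)
    v = np.ravel(theta_est).astype(float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def relative_error(theta_ref, theta_est):
    """Relative L2 distance between the reference vector and its estimate."""
    u = np.ravel(theta_ref).astype(float)
    v = np.ravel(theta_est).astype(float)
    return float(np.linalg.norm(u - v) / np.linalg.norm(u))

print(cosine_criterion([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(relative_error([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The cosine is insensitive to the scale of the estimate: in this example the estimate is twice the reference, the cosine is 1 and only the relative error reveals the discrepancy. A value close to −1 corresponds to an estimate pointing in the opposite direction.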

5.1 Study for a fixed total number of observations used

After every N observations used by the algorithm (N being the size of D), up to a maximum of 100N observations, the criterion value associated with each method and each dataset was recorded. The results obtained after using 10N observations are provided in Table 3.

Table 3. Results after using 10N observations.

	CADATA	AILERONS	ELEVATORS	POLY	EGFR	HEMG	QUANTUM	ADULT	RINGNORM	TWONORM	HOSPHF30D	Mean rank
C1	Expl.	-0.0385	Expl.	Expl.	Expl.	Expl.	0.9252	Expl.	0.9998	1.0000	Expl.	11.6
C2	Expl.	0.0680	Expl.	Expl.	Expl.	Expl.	0.8551	Expl.	0.9976	0.9996	Expl.	12.2
C3	Expl.	0.0223	Expl.	Expl.	Expl.	Expl.	0.9262	Expl.	0.9999	1.0000	Expl.	9.9
C4	Expl.	-0.0100	Expl.	Expl.	Expl.	Expl.	0.8575	Expl.	0.9981	0.9996	Expl.	12.3
A1	-0.0013	0.4174	0.0005	0.3361	0.2786	0.2005	Expl.	0.0027	0.9998	1.0000	0.0264	9.2
A2	0.0039	0.2526	0.0004	0.1875	0.2375	0.1846	0.0000	0.0022	0.9999	1.0000	0.2047	8.8
S11	1.0000	0.9516	0.9298	1.0000	1.0000	0.9996	0.9999	0.7599	0.9999	1.0000	0.7723	5.2
S12	0.9999	0.9579	0.9311	1.0000	0.9999	0.9994	0.9991	0.6842	0.9999	1.0000	0.4566	6.1
S13	1.0000	0.9802	0.9306	1.0000	1.0000	0.9998	1.0000	0.7142	0.9999	1.0000	0.7754	3.7
S14	0.9999	0.9732	0.9303	1.0000	0.9999	0.9994	0.9991	0.6225	0.9998	1.0000	0.4551	6.9
S21	0.9993	0.6261	0.9935	Expl.	Expl.	Expl.	Expl.	Expl.	0.9998	1.0000	Expl.	10.5
S22	1.0000	0.9977	0.9900	1.0000	1.0000	0.9989	0.9999	-0.0094	0.9999	1.0000	0.9454	4.1
S31	1.0000	0.9988	0.9999	1.0000	1.0000	0.9992	0.9999	0.9907	0.9999	1.0000	0.9788	2.3
S32	1.0000	0.9991	0.9998	1.0000	1.0000	0.9992	0.9999	0.9867	0.9999	1.0000	0.9806	2.2

Expl. means numerical explosion.

As can be seen in Table 3, a numerical explosion occurred in most datasets using the classical methods with raw data and a variable step-size (C1 to C4). As noted in Table 2, these datasets had a high T² = Tr(E[RR′]). Corresponding methods S11 to S14 using the same variable step-size but with online standardized data quickly converged in most cases. However, classical methods with raw data can yield good results for a suitable choice of step-size, as demonstrated by the results obtained for the POLY dataset in Fig 1. The numerical explosion can arise from a step-size that is too large when n is small. This phenomenon can be avoided if the step-size is reduced, although if the latter is too small, the convergence rate will be slowed. Hence, the right balance must be found between step-size and convergence rate. Furthermore, the choice of this step-size generally depends on the dataset, which is not known a priori in the case of a data stream. In conclusion, methods with standardized data appear to be more robust to the choice of step-size.

The ASGD method (A1 with constant step-size $a = \frac{1}{T^{2}}$ and A2 with $a = \frac{1}{2 T^{2}}$ ) did not yield good results except for the RINGNORM and TWONORM datasets, which were obtained by simulation (note that all methods functioned very well for these two datasets). Of note, A1 exploded for the QUANTUM dataset containing 1068 observations (2.1%) whose L2 norm was more than five times the average norm (Table 2). The corresponding method S21 with online standardized data yielded several numerical explosions with the $a = \frac{1}{p}$ step-size; however, these explosions disappeared when using a smaller step-size (see Fig 1). Of note, it is assumed in corollary 8 that $0 < a < \min (\frac{1}{λ_{\max}}, \frac{2 λ}{λ^{2} + b_{1}})$ ; in the case of $a = \frac{1}{p}$ , only $a < \frac{1}{λ_{\max}}$ is certain.

Finally, for methods S31 and S32 with standardized data, the use of all observations until the current step and the very simple choice of the constant step-size $a = \frac{1}{p}$ uniformly yielded good results.

Thereafter, for each fixed number of observations used and for each dataset, the 14 methods were ranked from the best (the highest cosine) to the worst (the lowest cosine), with ranks from 1 to 14, after which the mean rank over the 11 datasets was calculated for each method. A total of 100 mean rank values were calculated for a number of observations used varying from N to 100N. The graph depicting the change in mean rank based on the number of observations used and the boxplot of the mean rank are shown in Fig 2.
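The ranking procedure can be sketched in Python as follows. The treatment of ties (average rank) and of numerical explosions (ranked last, tied with each other) is an assumption, since the text does not specify it; the function name and the small example are hypothetical.

```python
import numpy as np

def mean_ranks(cosines):
    """Mean rank of each method over datasets.

    cosines: array of shape (n_methods, n_datasets); np.nan marks a
    numerical explosion. Rank 1 is the highest cosine; tied methods
    share the average of the ranks they occupy (assumption).
    """
    c = np.asarray(cosines, dtype=float)
    c = np.where(np.isnan(c), -np.inf, c)         # explosions are ranked last
    ranks = np.empty_like(c)
    for k in range(c.shape[1]):
        col = c[:, k]
        better = (col[None, :] > col[:, None]).sum(axis=1)     # strictly better methods
        tied = (col[None, :] == col[:, None]).sum(axis=1) - 1  # other methods with same value
        ranks[:, k] = 1 + better + tied / 2
    return ranks.mean(axis=1)

# Hypothetical example: 3 methods, 2 datasets.
example = [[0.9, 1.0],
           [0.5, np.nan],
           [np.nan, np.nan]]
print(mean_ranks(example))
```

In this example the first method is ranked first on both datasets, while the two methods that exploded on the second dataset share rank 2.5 there.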

Overall, for these 11 datasets, a method with standardized data, a constant step-size and use of all observations until the current step (S31, S32) represented the best method when the total number of observations used was fixed.

5.2 Study for a fixed processing time

For every second up to a maximum of 2 minutes, the criterion value associated with each method and each dataset was recorded. The results obtained after a processing time of 1 minute are provided in Table 4.

Table 4. Results obtained after a fixed time of 1 minute.

	CADATA	AILERONS	ELEVATORS	POLY	EGFR	HEMG	QUANTUM	ADULT	RINGNORM	TWONORM	HOSPHF30D	Mean rank
C1	Expl.	-0.2486	Expl.	Expl.	Expl.	Expl.	0.9561	Expl.	1.0000	1.0000	Expl.	12.2
C2	Expl.	0.7719	Expl.	Expl.	Expl.	Expl.	0.9519	Expl.	1.0000	1.0000	Expl.	9.9
C3	Expl.	0.4206	Expl.	Expl.	Expl.	Expl.	0.9547	Expl.	1.0000	1.0000	Expl.	10.6
C4	Expl.	0.0504	Expl.	Expl.	Expl.	Expl.	0.9439	Expl.	1.0000	1.0000	Expl.	10.1
A1	-0.0067	0.8323	0.0022	0.9974	0.7049	0.2964	Expl.	0.0036	1.0000	1.0000	Expl.	9.0
A2	0.0131	0.8269	0.0015	0.9893	0.5100	0.2648	Expl.	0.0027	1.0000	1.0000	0.2521	8.6
S11	1.0000	0.9858	0.9305	1.0000	1.0000	1.0000	1.0000	0.6786	1.0000	1.0000	0.9686	5.8
S12	1.0000	0.9767	0.9276	1.0000	1.0000	0.9999	1.0000	0.6644	1.0000	1.0000	0.9112	5.8
S13	1.0000	0.9814	0.9299	1.0000	1.0000	0.9999	1.0000	0.4538	1.0000	1.0000	0.9329	6.1
S14	1.0000	0.9760	0.9274	1.0000	1.0000	1.0000	0.9999	0.5932	1.0000	1.0000	0.8801	6.1
S21	-0.9998	0.2424	0.6665	Expl.	Expl.	Expl.	Expl.	0.0000	1.0000	1.0000	Expl.	11.5
S22	1.0000	0.9999	1.0000	1.0000	1.0000	1.0000	1.0000	-0.0159	1.0000	1.0000	0.9995	3.1
S31	1.0000	0.9995	1.0000	1.0000	1.0000	0.9999	1.0000	0.9533	1.0000	1.0000	0.9997	4.5
S32	1.0000	0.9999	1.0000	1.0000	1.0000	1.0000	1.0000	0.9820	1.0000	1.0000	0.9999	1.5

Expl. means numerical explosion.

The same conclusions can be drawn as those described in section 5.1 for the classical methods and the ASGD method. The methods with online standardized data typically fared better.

As in the previous study in section 5.1, the 14 methods were ranked from the best to the worst on the basis of the mean rank for a fixed processing time. The graph depicting the change in mean rank based on the processing time varying from 1 second to 2 minutes as well as the boxplot of the mean rank are shown in Fig 3.

As can be seen, these methods with online standardized data using more than one observation per step yielded the best results (S32, S22). One explanation may be that the total number of observations used in a fixed processing time is higher when several observations are used per step rather than one observation per step. This can be verified in Table 5 in which the total number of observations used per second for each method and for each dataset during a processing time of 2 minutes is given. Of note, the number of observations used per second in a process with standardized data and one observation per step (S11, S13, S21, S31) was found to be generally lower than in a process with raw data and one observation per step (C1, C3, A1, A2), since a method with standardization requires the recursive estimation of means and variances at each step.

Table 5. Number of observations used after 2 minutes (expressed in number of observations per second).

	CADATA	AILERONS	ELEVATORS	POLY	EGFR	HEMG	QUANTUM	ADULT	RINGNORM	TWONORM	HOSPHF30D
C1	19843	33170	17133	14300	10979	9243	33021	476	31843	31677	10922
C2	166473	291558	159134	134249	104152	89485	281384	4565	262847	261881	102563
C3	17206	28985	16036	13449	10383	8878	28707	462	28123	28472	10404
C4	132088	194031	125880	106259	87844	76128	184386	4252	171711	166878	86895
A1	33622	35388	36540	35800	35280	34494	11815	15390	34898	34216	14049
A2	33317	32807	36271	35628	35314	34454	15439	16349	34401	34205	34890
S11	17174	17133	17166	16783	15648	14764	16296	1122	14067	13836	14334
S12	45717	47209	45893	43470	39937	37376	40943	4554	34799	34507	36389
S13	12062	12731	11888	12057	11211	10369	11466	620	9687	9526	10137
S14	43674	46080	43068	42123	38350	35338	39170	4512	33594	31333	32701
S21	15396	17997	16772	10265	8404	7238	9166	996	13942	13274	7672
S22	47156	47865	46318	43899	40325	37467	41320	4577	34478	31758	37418
S31	12495	12859	12775	12350	11495	10619	11608	621	9890	9694	10863
S32	44827	47035	45123	42398	38932	36288	39362	4532	33435	33385	35556

Of note, for the ADULT dataset, which has a large number of selected parameters (95), the only method yielding adequate results after a processing time of one minute was S32, whereas both S31 and S32 did so when 10N observations were used.

6 Conclusion

In the present study, three processes with online standardized data were defined and their a.s. convergence was proven.

A stochastic approximation method with standardized data appears to be advantageous compared to a method with raw data. First, it is easier to choose the step-size. For processes S31 and S32 for example, the definition of a constant step-size only requires knowing the number of parameters p. Secondly, the standardization usually allows avoiding the phenomenon of numerical explosion often obtained in the examples given with a classical method.

The use of all observations until the current step can reduce the influence of outliers and increase the convergence rate of a process. Moreover, this approach is particularly adapted to the case of a data stream.

Finally, among all processes tested on 11 different datasets (linear regression or linear discriminant analysis), the best was a method using standardization, a constant step-size equal to $\frac{1}{p}$ and all observations until the current step, and the use of several new observations at each step improved the convergence rate.

Data Availability

All datasets used in our experiments, except those derived from the EPHESUS study, are available online, and links to download these data appear in Table 2 of our article. Due to legal restrictions, data from the EPHESUS study are only available upon request. Interested researchers may request access to the data, subject to approval from the EPHESUS Executive Steering Committee. This committee can be reached through Pr Faiez Zannad (f.zannad@chu-nancy.fr), who is a member of this board.

Funding Statement

This work is supported by a public grant overseen by the French National Research Agency (ANR) as part of the second “Investissements d’Avenir” programme (reference: ANR-15-RHU-0004). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Monnez JM. Le processus d’approximation stochastique de Robbins-Monro: résultats théoriques; estimation séquentielle d’une espérance conditionnelle. Statistique et Analyse des Données. 1979;4(2):11–29.
2. Ljung L. Analysis of stochastic gradient algorithms for linear regression problems. IEEE Transactions on Information Theory. 1984;30(2):151–160. doi: 10.1109/TIT.1984.1056895
3. Polyak BT. New method of stochastic approximation type. Automation and Remote Control. 1990;51(7):937–946.
4. Polyak BT, Juditsky AB. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization. 1992;30(4):838–855. doi: 10.1137/0330046
5. Györfi L, Walk H. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization. 1996;34(1):31–61. doi: 10.1137/S0363012992226661
6. Bach F, Moulines E. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in Neural Information Processing Systems. 2013:773–781.
7. Bottou L, Le Cun Y. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry. 2005;21(2):137–151. doi: 10.1002/asmb.538
8. Bottou L, Curtis FE, Nocedal J. Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838v2. 2017.
9. Johnson R, Zhang T. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. Advances in Neural Information Processing Systems. 2013:315–323.
10. Duchi J, Hazan E, Singer Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research. 2011;12:2121–2159.
11. Pascanu R, Mikolov T, Bengio Y. Understanding the exploding gradient problem. arXiv:1211.5063v1. 2012.
12. Robbins H, Siegmund D. A convergence theorem for nonnegative almost supermartingales and some applications. In: Rustagi JS, editor. Optimizing Methods in Statistics. New York: Academic Press; 1971:233–257.
13. Schmetterer L. Multidimensional stochastic approximation. Multivariate Analysis II, Proc. 2nd Int. Symp., Dayton, Ohio, Academic Press. 1969:443–460.
14. Venter JH. On Dvoretzky stochastic approximation theorems. The Annals of Mathematical Statistics. 1966;37:1534–1544. doi: 10.1214/aoms/1177699145
15. Pitt B, Remme W, Zannad F, et al. Eplerenone, a selective aldosterone blocker, in patients with left ventricular dysfunction after myocardial infarction. New England Journal of Medicine. 2003;348(14):1309–1321. doi: 10.1056/NEJMoa030207
16. Xu W. Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv:1107.2490v2. 2011.
