Nearly assumptionless screening for the mutually-exciting multivariate Hawkes process

Shizhe Chen; Daniela Witten; Ali Shojaie

doi:10.1214/17-EJS1251

Author manuscript; available in PMC: 2017 Aug 24.

Published in final edited form as: Electron J Stat. 2017 Apr 11;11(1):1207–1234. doi: 10.1214/17-EJS1251

PMCID: PMC5570442 NIHMSID: NIHMS892541 PMID: 28845209

Abstract

We consider the task of learning the structure of the graph underlying a mutually-exciting multivariate Hawkes process in the high-dimensional setting. We propose a simple and computationally inexpensive edge screening approach. Under a subset of the assumptions required for penalized estimation approaches to recover the graph, this edge screening approach has the sure screening property: with high probability, the screened edge set is a superset of the true edge set. Furthermore, the screened edge set is relatively small. We illustrate the performance of this new edge screening approach in simulation studies.

Keywords and phrases: Hawkes process, screening, high-dimensionality

MSC 2010 subject classifications: Primary 60G55, secondary 62M10, 62H12

1. Introduction

1.1. Overview of the multivariate Hawkes process

In a seminal paper, Hawkes (1971) proposed the multivariate Hawkes process, a multivariate point process model in which a past event may trigger the occurrence of future events. The Hawkes process and its variants have been widely applied to model recurrent events, with notable applications in modeling earthquakes (Ogata, 1988), crime rates (Mohler et al., 2011), interactions in social networks (Simma and Jordan, 2012; Perry and Wolfe, 2013; Zhou, Zha and Song, 2013a,b), financial events (Chavez-Demoulin, Davison and McNeil, 2005; Bowsher, 2007; Aït-Sahalia, Cacho-Diaz and Laeven, 2015), and spiking histories of neurons (see e.g., Brillinger, 1988; Okatan, Wilson and Brown, 2005; Paninski, Pillow and Lewi, 2007; Pillow et al., 2008).

In this section, we provide a very brief review of the multivariate Hawkes process. A more comprehensive discussion can be found in Liniger (2009) and Zhu (2013).

Following Brémaud and Massoulié (1996), we define a simple point process N on ℝ⁺ as a family {N(A)}_{A ∈ ℬ(ℝ⁺)} taking integer values (including positive infinity), where ℬ(ℝ⁺) denotes the Borel σ-algebra of the positive half of the real line. Further let t₁, t₂, … ∈ ℝ⁺ be the event times of N. In this notation, N(A) = Σ_i 𝟙_{[t_i ∈ A]} for A ∈ ℬ(ℝ⁺). We write N([t, t + dt)) as dN(t), where dt denotes an arbitrarily small increment of t. Let ℋ_t be the history of N up to time t. Then the ℋ_t-predictable intensity process of N is defined as

λ (t) d t = ℙ (d N (t) = 1 ∣ ℋ_{t}) .

(1)

Now suppose that N is a marked point process, in which each event time t_i is associated with a mark m_i ∈ {1, …, p} (see e.g., Definition 6.4.I. in Daley and Vere-Jones, 2003). We can then view N as a multivariate point process (N_j)_{j=1,…,p}, of which the jth component process is given by N_j(A) = Σ_i 𝟙_{[t_i ∈ A, m_i = j]} for A ∈ ℬ(ℝ⁺). To simplify the notation, we let t_{j,1}, t_{j,2}, … ∈ ℝ⁺ denote the event times of N_j.

The intensity of the jth component process is

λ_{j} (t) d t = ℙ (d N_{j} (t) = 1 ∣ ℋ_{t}) .

In the case of the linear Hawkes process, this function takes the form (Brémaud and Massoulié, 1996; Hansen, Reynaud-Bouret and Rivoirard, 2015)

λ_{j} (t) = μ_{j} + \sum_{k = 1}^{p} (\sum_{i : t_{k, i} \leq t} ω_{j, k} (t - t_{k, i})) .

(2)

We refer to μ_j ∈ ℝ as the background intensity, and ω_j,k(·): ℝ⁺ ↦ ℝ as the transfer function.
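
To make (2) concrete, the following minimal sketch evaluates λ_j(t) from observed event times. It is an illustration only: the function names `transfer` and `intensity`, the representation of the edge set as a Python set of (j, k) pairs, and the use of a single shared transfer function (the one from the simulation study in Section 3.1) for every edge are all assumptions made for this example.

```python
import numpy as np

def transfer(delta):
    """Illustrative transfer function 2*d*exp(1 - 5*d) for d > 0, and 0 otherwise."""
    delta = np.asarray(delta, dtype=float)
    return np.where(delta > 0, 2.0 * delta * np.exp(1.0 - 5.0 * delta), 0.0)

def intensity(j, t, events, mu, edges):
    """Evaluate lambda_j(t) as in (2).

    events: list of sorted 1-D arrays; events[k] holds the event times of N_k.
    mu:     background intensities mu_1, ..., mu_p.
    edges:  set of (j, k) pairs whose transfer function omega_{j,k} is non-zero
            (hypothetical representation, chosen for this sketch).
    """
    total = mu[j]
    for k, times in enumerate(events):
        if (j, k) in edges:
            past = times[times <= t]           # events of N_k up to time t
            total += transfer(t - past).sum()  # sum of omega_{j,k}(t - t_{k,i})
    return total

events = [np.array([0.5, 1.0]), np.array([0.2])]
mu = [0.75, 0.75]
edges = {(0, 1)}  # node 1 excites node 0
print(intensity(0, 0.4, events, mu, edges))
```

In this toy input, only the event of N_1 at time 0.2 precedes t = 0.4, so the intensity of node 0 is the background rate plus a single transfer-function term, while node 1 stays at its background rate.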

For p fixed, Brémaud and Massoulié (1996) established that the linear Hawkes process with intensity function (2) is stationary given the following assumption.

Assumption 1

Let Ω be a p × p matrix whose entries are $Ω_{j, k} = \int_{0}^{\infty} ∣ ω_{j, k} (Δ) ∣ d Δ$ , for j, k = 1, …, p. We assume that the spectral norm of Ω is strictly less than 1, i.e., Γ_max(Ω) ≤ γ_Ω < 1, where γ_Ω is a generic constant.
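
As an illustration of how Assumption 1 can be checked numerically, the sketch below builds Ω for a small directed chain and computes its spectral norm. The 5-node chain, the helper names `omega_matrix` and `spectral_norm`, and the choice of the transfer function ω(t) = 2t exp(1 − 5t) from Section 3.1 (whose integral over (0, ∞) is 2e/25) are assumptions of this example.

```python
import numpy as np

def omega_matrix(edges, p, integral):
    """Omega[j, k] = integral of |omega_{j,k}| if (j, k) is an edge, else 0."""
    Omega = np.zeros((p, p))
    for j, k in edges:
        Omega[j, k] = integral
    return Omega

def spectral_norm(Omega):
    """Largest singular value of Omega."""
    return np.linalg.norm(Omega, 2)

# Hypothetical 5-node directed chain: node k excites node k + 1.
p = 5
edges = {(k + 1, k) for k in range(p - 1)}
# Closed form: int_0^inf 2 t exp(1 - 5 t) dt = 2e / 25.
integral = 2.0 * np.e / 25.0
Omega = omega_matrix(edges, p, integral)
gamma = spectral_norm(Omega)
print(gamma, gamma < 1.0)
```

For this chain, Ω is a scaled shift matrix, so its spectral norm equals the single-edge integral 2e/25 ≈ 0.217, comfortably below 1.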

We now define a directed graph with node set {1, …, p} and edge set

\mathcal{E} \equiv \{(j, k) : ω_{j, k} \neq 0, 1 \leq j, k \leq p\},

(3)

for ω_j,k given in (2). Let

s \equiv \max_{1 \leq j \leq p} \sum_{k = 1}^{p} 𝟙_{\{(j, k) \in \mathcal{E}\}}

(4)

denote the maximum in-degree of the nodes in the graph. In this paper, we propose a simple screening procedure that can be used to obtain a small superset of the edge set ℰ.
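
The quantity s in (4) is simple to compute from an edge list. The short sketch below assumes, purely for illustration, that the edge set is stored as a set of (j, k) pairs with zero-based node labels, where (j, k) means that node k excites node j; the function name `max_in_degree` is hypothetical.

```python
def max_in_degree(edges, p):
    """Maximum over j of the number of k with (j, k) in the edge set, as in (4)."""
    counts = [0] * p
    for j, k in edges:
        counts[j] += 1  # edge (j, k) points into node j
    return max(counts)

# Node 0 receives edges from nodes 1 and 2; node 1 receives one edge from node 0.
edges = {(0, 1), (0, 2), (1, 0)}
s = max_in_degree(edges, p=3)
print(s)
```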

1.2. Estimation and theory for the Hawkes process

We first consider the low-dimensional setting, in which the dimension of the process, p, is fixed, and T, the time period during which the point process is observed, is allowed to grow. In this setting, asymptotic properties such as the central limit theorem have been established; for instance, see Bacry et al. (2013) and Zhu (2013). Consequently, estimating the edge set ℰ is straightforward in low dimensions.

In high dimensions, when p might be large, we can fit the Hawkes process model using a penalized estimator of the form

\underset{ω_{j, k} \in F, 1 \leq j, k \leq p}{minimize} L (ω_{j, k}; {N_{j}}_{j = 1}^{p}) + λ \sum_{j, k} P (ω_{j, k}; {N_{j}}_{j = 1}^{p}),

(5)

where $L (\cdot; {N_{j}}_{j = 1}^{p})$ is a loss function, based on, e.g., the log-likelihood (Bacry, Gaïffas and Muzy, 2015) or least squares (Hansen, Reynaud-Bouret and Rivoirard, 2015); $P (\cdot; {N_{j}}_{j = 1}^{p})$ is a penalty function, such as the lasso (Hansen, Reynaud-Bouret and Rivoirard, 2015); λ is a nonnegative tuning parameter; and ℱ is a suitable function class. Then, a natural estimator for ℰ is {(j, k): ω̂_j,k ≠ 0}.

Recently, Reynaud-Bouret and Schbath (2010), Bacry, Gaïffas and Muzy (2015), and Hansen, Reynaud-Bouret and Rivoirard (2015) have established that under certain assumptions, penalized estimation approaches of the form (5) are consistent in high dimensions, provided that the edge set ℰ is sparse. For instance, Hansen, Reynaud-Bouret and Rivoirard (2015) establish the oracle inequality of the lasso estimator for the Hawkes process, given that certain conditions hold on the observed event times. However, to show that these conditions hold with high probability for arbitrary samples, these theoretical results require that the point process is mutually-exciting: that is, an event in one component process can increase, but cannot decrease, the probability of an event in another component process. This amounts to assuming that ω_{j,k}(Δ) ≥ 0 for all Δ ≥ 0, for ω_{j,k} defined in (2).

When the dimension p is large, penalized estimation procedures of the form (5) (Bacry, Gaïffas and Muzy, 2015; Hansen, Reynaud-Bouret and Rivoirard, 2015) become computationally expensive: they require 𝒪(Tp²) operations per iteration in an iterative algorithm. This is problematic in contemporary applications, in which p can be on the order of tens of thousands (Ahrens et al., 2013). These concerns motivate us to propose a simple and computationally-efficient edge screening procedure for estimating the true edge set ℰ in high dimensions. Under very few assumptions, our proposed screening procedure is guaranteed to select a small superset of the true edge set ℰ.

1.3. Organization of paper

The rest of this paper proceeds as follows. In Section 2, we introduce our screening procedure for estimating the edge set ℰ, and establish its theoretical properties. We present simulation results in support of our proposed procedure in Section 3. Proofs of theoretical results are presented in Section 4, and the Discussion is in Section 5.

2. An edge screening procedure

2.1. Approach

For j = 1, …, p, let Λ_j denote the mean intensity of the jth point process introduced in Section 1. That is,

Λ_{j} \equiv E [d N_{j} (t)] / d t .

(6)

Following Equation 5 of Hawkes (1971), for any Δ ∈ ℝ, the (infinitesimal) cross-covariance of the jth and kth processes is defined as

V_{j, k} (Δ) \equiv \begin{cases} E [d N_{j} (t) d N_{k} (t - Δ)] / \{d t d (t - Δ)\} - Λ_{j} Λ_{k} & j \neq k \\ E [d N_{k} (t) d N_{k} (t - Δ)] / \{d t d (t - Δ)\} - Λ_{k}^{2} - Λ_{k} δ (Δ) & j = k \end{cases},

(7)

where δ(·) is the Dirac delta function, which satisfies $\int_{- \infty}^{\infty} δ (x) d x = 1$ and δ(x) = 0 for x ≠ 0.

For a given value of Δ, we can estimate the cross-covariance function V_j,k(Δ) using kernel smoothing:

{\hat{V}}_{j, k} (Δ) = \begin{cases} \frac{1}{T h} \iint_{{[0, T]}^{2}} K (\frac{(t^{'} - t) + Δ}{h}) d N_{j} (t) d N_{k} (t^{'}) - \frac{1}{T^{2}} N_{j} ([0, T]) N_{k} ([0, T]) & j \neq k \\ \frac{1}{T h} \iint_{{[0, T]}^{2} \setminus \{t = t^{'}\}} K (\frac{(t^{'} - t) + Δ}{h}) d N_{k} (t) d N_{k} (t^{'}) - \frac{1}{T^{2}} N_{k}^{2} ([0, T]) & j = k \end{cases},

(8)

where K(·) is a kernel function with bandwidth h, and $\int_{0}^{T} f (t) d N_{j} (t)$ is the Stieltjes integral, defined as

\int_{0}^{T} f (t) d N_{j} (t) \equiv \sum_{i : t_{j, i} \in [0, T]} f (t_{j, i}) .

In this paper, we focus on kernel functions that are bounded by 1 and are defined on a bounded support, i.e., 0 ≤ K(x/h) ≤ 1 for x ∈ [−h, h], and K(x/h) = 0 for x ∉ [−h, h] (e.g., the Epanechnikov kernel).
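
The estimator (8) reduces to a double sum over pairs of event times, as the following minimal sketch shows. It is a naive pairwise implementation written for clarity, not a reproduction of the implementation used for the results in this paper; the function names `epanechnikov` and `cross_cov_hat` and the `same` flag (which marks the case j = k) are assumptions of this example.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], and 0 elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def cross_cov_hat(times_j, times_k, delta, T, h, same=False):
    """Kernel estimate (8) of V_{j,k}(delta) from event times observed on [0, T].

    Set same=True when j = k, so that the pairs with t = t' are excluded.
    """
    times_j = np.asarray(times_j, dtype=float)
    times_k = np.asarray(times_k, dtype=float)
    # Pairwise lags t' - t, with t from N_j (rows) and t' from N_k (columns).
    lags = times_k[None, :] - times_j[:, None]
    weights = epanechnikov((lags + delta) / h)
    if same:
        np.fill_diagonal(weights, 0.0)  # drop the pairs t = t'
    smooth = weights.sum() / (T * h)
    return smooth - len(times_j) * len(times_k) / T**2

# One event in each process; the lag t' - t = 0.5 is matched at delta = -0.5.
v = cross_cov_hat([1.0], [1.5], delta=-0.5, T=10.0, h=1.0)
print(v)
```

The pairwise-lag matrix costs memory proportional to the product of the two event counts, so in practice one would restrict attention to pairs of events that are within B + h of each other.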

Let B denote a tuning parameter that defines the time range of interest for V_j,k(Δ), i.e. Δ ∈ [−B, B]. For any ζ, we define the set of screened edges as

\hat{\mathcal{E}} (ζ) \equiv \{(j, k) : {‖ {\hat{V}}_{j, k} ‖}_{2, [- B, B]} > ζ\},

(9)

where ${‖ f ‖}_{2, [l, u]} \equiv \{\int_{l}^{u} f^{2} (t) d t\}^{1 / 2}$ is the ℓ₂-norm of a function f on the interval [l, u].

The screened edge set ℰ̂(ζ) in (9) can be calculated quickly: ‖V̂_{j,k}‖_{2,[−B,B]} can be calculated in 𝒪(T) computations, and so ℰ̂(ζ) can be calculated in 𝒪(Tp²) computations. The procedure can be easily parallelized.
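
The screening rule (9) can be sketched as follows. This is a simplified illustration under several assumptions: the ℓ₂-norm is approximated by a Riemann sum on an evenly spaced grid over [−B, B], the Epanechnikov kernel is used, and the double loop over node pairs is written for readability rather than for the 𝒪(T) per-pair cost discussed above. The names `v_hat_curve`, `screen_edges`, and `n_grid` are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], and 0 elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def v_hat_curve(times_j, times_k, grid, T, h, same):
    """Evaluate the kernel estimate (8) at every delta in grid."""
    lags = times_k[None, :] - times_j[:, None]
    if same:
        lags = lags[~np.eye(len(times_j), dtype=bool)]  # exclude t = t'
    lags = lags.ravel()
    # Only lags within h of -delta, for some delta in the grid, contribute.
    lags = lags[(lags >= -grid[-1] - h) & (lags <= -grid[0] + h)]
    weights = epanechnikov((lags[None, :] + grid[:, None]) / h)
    return weights.sum(axis=1) / (T * h) - len(times_j) * len(times_k) / T**2

def screen_edges(events, T, h, B, zeta, n_grid=201):
    """Return the screened edge set (9) and the matrix of estimated norms."""
    events = [np.asarray(ev, dtype=float) for ev in events]
    grid = np.linspace(-B, B, n_grid)
    step = grid[1] - grid[0]
    p = len(events)
    norms = np.zeros((p, p))
    for j in range(p):
        for k in range(p):
            curve = v_hat_curve(events[j], events[k], grid, T, h, j == k)
            norms[j, k] = np.sqrt(np.sum(curve**2) * step)  # Riemann sum
    edges = {(j, k) for j in range(p) for k in range(p) if norms[j, k] > zeta}
    return edges, norms

# Toy input: process 1 repeats process 0 with a delay of 0.3.
ev = [np.arange(1.0, 100.0), np.arange(1.0, 100.0) + 0.3]
edges_lo, norms = screen_edges(ev, T=100.0, h=0.2, B=1.0, zeta=0.0)
edges_hi, _ = screen_edges(ev, T=100.0, h=0.2, B=1.0, zeta=np.inf)
print(len(edges_lo), len(edges_hi))
```

Because V̂_{j,k}(Δ) = V̂_{k,j}(−Δ), the estimated norms are symmetric in (j, k) on a symmetric grid, so the screened set does not by itself determine edge direction.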

There are three tuning parameters in the procedure: the bandwidth h in (8), the range B in (9), and the screening threshold ζ in (9). The bandwidth h can be chosen by cross-validation. The range B can be selected based on the problem setting. For instance, when using the multivariate Hawkes process to model a spike train data set in neuroscience, we can set B to equal the maximum time gap between a spike and the spike it can possibly evoke. The choice of screening threshold ζ can be determined based on the sparsity level that we expect based on our prior knowledge. Alternatively, we may wish to use a small value of ζ in order to reduce the chance of false negative edges in ℰ̂(ζ), or a larger value due to limited computational resources in our downstream analysis.

2.2. Theoretical results

We consider the asymptotics of triangular arrays (Greenshtein and Ritov, 2004), where the dimension p is allowed to grow with T. When unrestricted, it is possible to cook up extreme networks, where, for instance, the mean intensity Λ_j in (6) diverges to infinity. To avoid such cases, we pose the following regularity assumption.

Assumption 2

There exist positive constants Λ_min, Λ_max, and V_max such that 0 < Λ_min ≤ Λ_j ≤ Λ_max and max_Δ∈ℝ |V_j,k(Δ)| ≤ V_max for all 1 ≤ j, k ≤ p, where Λ_j and V_j,k are defined in (6) and (7), respectively. Furthermore, Λ_min, Λ_max, and V_max are generic constants that do not depend on p.

Next, we make some standard assumptions on the transfer functions ω_j,k in (2).

Assumption 3

The following hold:

(a) The transfer functions are non-negative: ω_{j,k}(Δ) ≥ 0 for all Δ ≥ 0.
(b) There exists a positive constant β_min such that
$\min_{(j, k) : ω_{j, k} \neq 0} (\int_{0}^{\infty} ω_{j, k}^{2} (Δ) d Δ) \geq β_{min}^{2} .$
(c) There exist positive constants b, θ₀, and C such that, for all 1 ≤ j, k ≤ p, and for any Δ₁, Δ₂ ∈ ℝ, supp(ω_{j,k}) ⊂ (0, b], max_Δ |ω_{j,k}(Δ)| ≤ C, and |ω_{j,k}(Δ₁) − ω_{j,k}(Δ₂)| ≤ θ₀|Δ₁ − Δ₂|.

Assumption 3(a) guarantees that the multivariate Hawkes process is mutually-exciting: that is, an event may trigger (but cannot inhibit) future events. This assumption is shared by the original proposal of Hawkes (1971). Furthermore, existing theory for penalized estimators for the Hawkes process requires this assumption (Bacry, Gaïffas and Muzy, 2015; Hansen, Reynaud-Bouret and Rivoirard, 2015).

Assumption 3(b) guarantees that the non-zero transfer functions are nonnegligible. Such an assumption is needed in order to establish variable selection consistency (Bühlmann and van de Geer, 2011; Wainwright, 2009) for the penalized estimator (5).

Assumption 3(c) guarantees that the transfer functions are sufficiently smooth; this guarantees that the cross-covariances are smooth (see Section A.2 in Appendix), and hence can be estimated using a kernel smoother (8). Instead of Assumption 3(c), we could assume that ω_j,k is an exponential function (Bacry, Gaïffas and Muzy, 2015) or that it is well-approximated by a set of smooth basis functions (Hansen, Reynaud-Bouret and Rivoirard, 2015).

Recall that s was defined in (4). We now state our main result.

Theorem 1

Suppose that the Hawkes process (2) satisfies Assumptions 1–3. Let h = c₁s^{−1/2}T^{−1/6} in (8) and ζ = 2c₂s^{1/2}T^{−1/6} in (9) for some constants c₁ and c₂. Then, for some positive constants c₃ and c₄, with probability at least 1 − c₃T^{7/6}s^{1/2}p² exp(−c₄T^{1/6}),

(a) ℰ ⊂ ℰ̂(ζ);
(b) $card (\hat{E} (ζ)) = O (card (E) s^{- 1} T^{1 / 3} γ_{Ω} {(1 - γ_{Ω})}^{- 2} Λ_{max}^{2})$ .

Theorem 1(a) guarantees that, with high probability, the screened edge set ℰ̂(ζ) contains the true edge set ℰ. Therefore, screening does not result in false negatives. This is referred to as the sure screening property in the literature (Fan and Lv, 2008; Fan, Samworth and Wu, 2009; Fan and Song, 2010; Fan, Feng and Song, 2011; Fan, Ma and Dai, 2014; Liu, Li and Wu, 2014; Song et al., 2014; Luo, Song and Witten, 2014). Typically, establishing the sure screening property requires assuming that the marginal association between a pair of nodes in ℰ is sufficiently large; see e.g. Condition 3 in Fan and Lv (2008) and Condition C in Fan, Feng and Song (2011). In contrast, Theorem 1(a) requires only that the conditional association between a pair of nodes in ℰ is sufficiently large; see Assumption 3(b).

Theorem 1(b) guarantees that ℰ̂(ζ) is a relatively small set, on the order of 𝒪(card(ℰ)s⁻¹T^{1/3}). Suppose that p² ∝ s^{−1/2} exp(c₄T^{1/6−ε}) for some positive constant ε < 1/6; this is the high-dimensional regime, in which the probability statement in Theorem 1 converges to one. Then the size of ℰ̂(ζ), 𝒪(card(ℰ)s⁻¹T^{1/3}), can be much smaller than p², the total number of node pairs. We note that the rate of T^{1/3} is comparable to existing results for non-parametric screening in the literature (see e.g., Fan, Feng and Song 2011; Fan, Ma and Dai 2014).

To summarize, Theorem 1 guarantees that under a small subset of the assumptions required for penalized estimation methods to recover the edge set ℰ, the screened edge set ℰ̂(ζ) (9) is small and contains no false negatives. We note that this is not the case for other types of models. For instance, in the case of the Gaussian graphical model, Luo, Song and Witten (2014) considered estimating the conditional dependence graph by screening the marginal covariances. In order for this procedure to have the sure screening property, one must make an assumption on the minimum marginal covariance associated with an edge in the graph, which is not required for variable selection consistency of penalized estimators (Cai, Liu and Luo, 2011; Luo, Song and Witten, 2014; Ravikumar et al., 2011; Saegusa and Shojaie, 2016).

It is important to note that Theorem 1 considers an oracle procedure, where the tuning parameters depend on unknown parameters. The heuristic selection guidelines suggested at the end of Section 2.1 may not satisfy the requirements of Theorem 1. We leave the discussion of optimal tuning parameter selection criteria for future research. Also, note that the bandwidth h ∝ T^−1/6 is wider than the typical bandwidth for kernel smoothing, which is T^−1/3 (Tsybakov, 2009). This is because we aim to minimize a concentration bound on V̂_j,k − V_j,k (see the proof of Lemma 3 in the Appendix), rather than the usual mean integrated square error as in, e.g., Theorem 1.1 in Tsybakov (2009).
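
For reference, the oracle choices of h and ζ in Theorem 1 can be written as a small helper. The constants c₁ and c₂ are not specified by the theorem, so the default values of 1 below are placeholders chosen only for illustration, as is the function name `oracle_tuning`.

```python
def oracle_tuning(T, s, c1=1.0, c2=1.0):
    """Return (h, zeta) with h = c1 s^{-1/2} T^{-1/6} and zeta = 2 c2 s^{1/2} T^{-1/6}.

    c1 and c2 are unspecified constants in Theorem 1; the defaults are placeholders.
    """
    rate = T ** (-1.0 / 6.0)
    h = c1 * s ** (-0.5) * rate
    zeta = 2.0 * c2 * s ** 0.5 * rate
    return h, zeta

# With T = 64 and s = 4: T^{-1/6} = 1/2, s^{-1/2} = 1/2, s^{1/2} = 2.
h, zeta = oracle_tuning(T=64.0, s=4)
print(h, zeta)
```

Both quantities shrink at the rate T^{−1/6}, but they move in opposite directions in s: a larger maximum in-degree calls for a narrower bandwidth and a higher threshold.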

Remark 1

In light of Theorem 1, consider applying a constraint induced by ℰ̂(ζ) to (5):

\underset{ω_{j, k} \in F, 1 \leq j, k \leq p}{minimize} L (ω_{j, k}; {N_{j}}_{j = 1}^{p}) + λ \sum_{j, k} P (ω_{j, k}; {N_{j}}_{j = 1}^{p}) \text{ subject to } ω_{j, k} = 0 \text{ for } (j, k) \notin \hat{E} (ζ) .

(10)

Theorem 1 can be combined with existing results on consistency of penalized estimators of the Hawkes process (Bacry, Gaïffas and Muzy, 2015; Hansen, Reynaud-Bouret and Rivoirard, 2015) in order to establish that (10) results in consistent estimation of the transfer functions ω_{j,k}. As a concrete example, Hansen, Reynaud-Bouret and Rivoirard (2015) considered (10) with $L (ω_{j, k}; {N_{j}}_{j = 1}^{p})$ taken to be the least-squares loss, and $P (ω_{j, k}; {N_{j}}_{j = 1}^{p})$ a lasso-type penalty. Our simulation experiments in Section 3 indicate that in this setting, (10) can actually have better small-sample performance than (5) when p is very large. Furthermore, solving (10) can be much faster than solving (5): the former requires 𝒪(T^{4/3}s⁻¹card(ℰ)) computations per iteration, compared to 𝒪(Tp²) per iteration for the latter (using e.g. coordinate descent, Friedman, Hastie and Tibshirani, 2010). In the high-dimensional regime when p² ∝ s^{−1/2} exp(c₄T^{1/6−ε}) for some positive constant ε < 1/6, we have that T^{4/3}s⁻¹card(ℰ) ≪ Tp². We note that in order to solve (10), we must first compute ℰ̂(ζ), which requires an additional one-time computational cost of 𝒪(Tp²).

3. Simulation

3.1. Simulation set-up

In this section, we investigate the performance of our screening procedure in a simulation study with p = 100 point processes. Intensity functions are given by (2), with μ_j = 0.75 for j = 1, …, p, and ω_j,k(t) = 2t exp(1 − 5t) for (j, k) ∈ ℰ. By definition, ω_j,k = 0 for all (j, k) ∉ ℰ. We consider two settings for the edge set ℰ, Setting A and Setting B. These settings are displayed in Figure 1.

Fig. 1. *Left:* In Setting A, the edge set ℰ is composed of 5 connected components, each of which is a chain graph containing 20 nodes. *Right:* In Setting B, ℰ is composed of 10 connected components, each of which contains 10 nodes.
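
One way to generate data of this kind is through the branching (cluster) representation of the linear Hawkes process (Hawkes and Oakes, 1974). The sketch below is an assumption-laden illustration and not necessarily the simulation method used for the results reported here. It relies on the fact that ω(t) = 2t exp(1 − 5t) integrates to 2e/25 and, once normalized, is the Gamma density with shape 2 and rate 5. The process is started with an empty history at time 0, so no burn-in period is used, and the function name `simulate_hawkes_cluster` is hypothetical.

```python
import numpy as np

def simulate_hawkes_cluster(mu, edges, T, rng):
    """Simulate event times on [0, T] via the branching representation.

    Every event of node k spawns a Poisson(2e/25) number of children on each
    node j with (j, k) in edges, at delays drawn from Gamma(shape=2, scale=1/5).
    """
    p = len(mu)
    branching = 2.0 * np.e / 25.0  # integral of omega over (0, inf)
    children = {k: [j for (j, kk) in edges if kk == k] for k in range(p)}
    events = [[] for _ in range(p)]
    # Generation 0: immigrants from homogeneous Poisson processes.
    queue = []
    for j in range(p):
        n = rng.poisson(mu[j] * T)
        queue.extend((t, j) for t in rng.uniform(0.0, T, size=n))
    # Later generations: each event triggers offspring on its child nodes.
    while queue:
        t, k = queue.pop()
        events[k].append(t)
        for j in children[k]:
            n = rng.poisson(branching)
            for delay in rng.gamma(shape=2.0, scale=0.2, size=n):
                if t + delay <= T:
                    queue.append((t + delay, j))
    return [np.sort(np.array(ev)) for ev in events]

rng = np.random.default_rng(0)
sim = simulate_hawkes_cluster(mu=[0.75, 0.75], edges={(1, 0)}, T=2000.0, rng=rng)
print([len(ev) / 2000.0 for ev in sim])
```

In this two-node example, node 0 should have an empirical rate close to 0.75, while node 1 should be close to 0.75(1 + 2e/25) ≈ 0.91, which follows from the stationary relation Λ = μ + ΩΛ.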

In what follows, it will be useful to think about the (undirected) node pairs as belonging to three types. (i) We let

\tilde{E} \equiv {(j, k) : (j, k) \in E or (k, j) \in E} .

(11)

(ii) With a slight abuse of notation, we will use ℰ̃^c ∩ supp(V) to denote node pairs that are not in ℰ̃ with non-zero population cross-covariance, defined in (7). (iii) Continuing to slightly abuse notation, we will use ℰ̃^c\supp(V) to denote node pairs that are not in ℰ̃ and that have zero population cross-covariance.

Throughout the simulation, we set the bandwidth h in (8) to equal T^−1/6, and the range of interest B in (9) to equal 5. Thus, h satisfies the requirements of Theorem 1, and [−B, B] covers the majority of the mass of the transfer function ω_j,k. However, these simulation results are not sensitive to the particular choices of h or B.

3.2. Investigation of the estimated cross-covariances

In Setting A, within a single connected component, all of the node pairs that are not in ℰ̃ are in ℰ̃^c ∩ supp(V). However, for the most part, the population cross-covariances corresponding to node pairs in ℰ̃^c ∩ supp(V) are quite small, because they are induced by paths of length two and greater. This can be seen from the left-hand panel of Figure 2. Given the left-hand panel of Figure 2, we expect the proposed screening procedure to work very well in Setting A: for a sufficiently large value of the time period T, there exists a value of ζ such that, with high probability, ℰ̂(ζ) = ℰ̃.

Fig. 2. The quantiles of ‖V̂_{j,k}‖_{2,[−5,5]} are displayed, for node pairs in ℰ̃ (11), ℰ̃^c ∩ supp(V), and ℰ̃^c\supp(V), as a function of the time period *T*. *Left:* Results for Setting A. The estimated cross-covariances of node pairs in ℰ̃^c\supp(V) and ℰ̃^c ∩ supp(V) overlap. *Center:* Results for Setting B. The estimated cross-covariances of node pairs in ℰ̃ and ℰ̃^c ∩ supp(V) overlap. *Right:* The color legend is displayed.

In Setting B, six nodes receive directed edges from the same set of four nodes. Therefore, we expect the pairs among these six nodes to be in the set ℰ̃^c ∩ supp(V), and to have substantial population cross-covariances. This intuition is supported by the center panel of Figure 2, which indicates that the node pairs in ℰ̃^c ∩ supp(V) have relatively large estimated cross-covariances, on the same order as the node pairs in ℰ̃. In light of Figure 2, we anticipate that for a sufficiently large value of the time period T, the screened edge set ℰ̂(ζ) will contain the edges in ℰ̃ as well as many of the edges in ℰ̃^c ∩ supp(V).

3.3. Size of smallest screened edge set

We now define ζ^* ≡ max {ζ : ℰ ⊆ ℰ̂(ζ)}, and calculate card(ℰ̂(ζ^*)). This represents the size of the smallest screened edge set that contains the true edge set.
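
Given a matrix of estimated norms, the size of the smallest screened set containing the true edges can be computed as in the sketch below. One detail is an assumption of this example: because (9) uses a strict inequality, the threshold is taken to be the smallest norm among the true edges and pairs are counted with "≥", so that the weakest true edge is itself retained. The function name `smallest_superset_size` is hypothetical.

```python
import numpy as np

def smallest_superset_size(norms, true_edges):
    """Return (size, zeta_star) for the smallest thresholded set containing true_edges.

    norms[j, k] is the estimated l2-norm of the cross-covariance for the pair (j, k).
    """
    zeta_star = min(norms[j, k] for (j, k) in true_edges)
    # Count pairs at least as strong as the weakest true edge.
    size = int(np.sum(norms >= zeta_star))
    return size, zeta_star

norms = np.array([[0.1, 0.9],
                  [0.5, 0.2]])
size, zeta_star = smallest_superset_size(norms, {(0, 1), (1, 0)})
print(size, zeta_star)
```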

Results, averaged over 200 simulated data sets, are shown in Figure 3.

Fig. 3. For each of 200 simulated data sets, we calculated card(ℰ̂(ζ^*)), where ζ^* ≡ max {ζ : ℰ ⊆ ℰ̂(ζ)}, as a function of the time period T. The curves represent the mean of card(ℰ̂(ζ^*)); the 2.5% and 97.5% quantiles of card(ℰ̂(ζ^*)); card(ℰ̃); and card(supp(V)). *Left:* Data generated under Setting A. *Right:* Data generated under Setting B.

We see that in Setting A, for sufficiently large T, card(ℰ̂(ζ^*)) = card(ℰ̃), which implies that ℰ̂(ζ^*) = ℰ̃. In other words, in Setting A, the screening procedure yields perfect recovery of the set ℰ̃ (11). This is in line with our intuition based on the left-hand panel of Figure 2.

In contrast, in Setting B, even when T is very large, card(ℰ̂(ζ^*)) > card(ℰ̃), which implies that ℰ̂(ζ^*) ⊋ ℰ̃. This was expected based on the center panel of Figure 2.

3.4. Performance of constrained penalized estimation

We now consider the performance of the estimator (10), which we obtain by calculating the screened edge set ℰ̂(ζ), and then performing a penalized regression subject to the constraint that ω_{j,k} = 0 for (j, k) ∉ ℰ̂(ζ). Note that rather than assuming a specific functional form for ω_{j,k}, Hansen, Reynaud-Bouret and Rivoirard (2015) use a basis expansion to estimate ω_{j,k}. Following their lead, we use a basis of step functions, of the form 𝟙_{((m−1)/2, m/2]}(t) for m = 1, …, 6. Instead of applying a lasso penalty to the basis function coefficients (Hansen, Reynaud-Bouret and Rivoirard, 2015), we employ a group lasso penalty for every 1 ≤ j, k ≤ p (Yuan and Lin, 2006; Simon and Tibshirani, 2012). Thus, (10) consists of a squared error loss function and a group lasso penalty. We let

{\hat{E}}_{P} \equiv {(j, k) : \exists Δ s.t. {\hat{ω}}_{j, k} (Δ) \neq 0},

(12)

where ω̂_{j,k} solves (10).
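
With the step-function basis described above, the contribution of node k to an intensity at time t is a linear combination of event counts in six lagged windows. The sketch below computes those counts; it shows only this feature construction, not the group lasso fit, and the function name `step_basis_counts` and its arguments are assumptions of this example.

```python
import numpy as np

def step_basis_counts(times_k, t, n_basis=6, width=0.5):
    """Counts c_m = #{i : (m - 1) * width < t - t_{k,i} <= m * width}, m = 1..n_basis.

    If omega_{j,k} = sum_m a_m * 1{((m-1)/2, m/2]}, then the contribution of
    node k to lambda_j(t) in (2) equals sum_m a_m * c_m.
    """
    lags = t - np.asarray(times_k, dtype=float)
    counts = np.zeros(n_basis)
    for m in range(1, n_basis + 1):
        counts[m - 1] = np.sum((lags > (m - 1) * width) & (lags <= m * width))
    return counts

# Lags at t = 1.0 are 1.0, 0.6, 0.1, and -1.0 (a future event, which is ignored).
counts = step_basis_counts([0.0, 0.4, 0.9, 2.0], t=1.0)
print(counts)
```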

Results are shown in Figure 4. In Setting A, solving the constrained optimization problem (10) leads to substantially better performance than solving the unconstrained problem (5). The improvement is especially noticeable when T is small. In Setting B, solving the constrained optimization problem (10) leads to only a slight improvement in performance relative to solving the unconstrained problem (5), since, as we have learned from Figures 2 and 3, the screened set ℰ̂(ζ) contains many edges in ℰ̃^c ∩ supp(V). In both settings, solving the constrained optimization problem leads to substantial computational improvements.

Fig. 4. The constrained penalized optimization problem (10) was solved for a range of values of the tuning parameter λ. The x-axis displays the size of the estimated edge set ℰ̂_P (12), and the y-axis displays the number of true positives, averaged over 200 simulated data sets. The curves represent performance when ζ is chosen to yield card(ℰ̂(ζ)) = 4card(ℰ̃) (T = 300 and T = 600), and when ζ is chosen to yield card(ℰ̂(ζ)) = 8card(ℰ̃) (T = 300 and T = 600). We also display performance of the unconstrained penalized optimization problem (5) (T = 300 and T = 600).

4. Proofs of theoretical results

In this section, we prove Theorem 1. In Section 4.1, we review an important property of the Hawkes process, the Wiener-Hopf integral equation. In Section 4.2, we list three technical lemmas used in the proof of Theorem 1. Theorem 1 is proved in Section 4.3. Proofs of the technical lemmas are provided in the Appendix.

4.1. The Wiener-Hopf integral equation

Recall that the transfer functions ω = {ω_{j,k}}_{1≤j,k≤p} were defined in (2), the cross-covariances V = {V_{j,k}}_{1≤j,k≤p} were defined in (7), and the mean intensities Λ = (Λ₁, …, Λ_p)^T were defined in (6). If the Hawkes process defined in (2) is stationary, then for any Δ ∈ ℝ⁺,

V (Δ) = ω (Δ) diag (Λ) + (ω * V) (Δ),

(13)

where

{[ω * V]}_{j, k} (Δ) \equiv \sum_{l = 1}^{p} [ω_{j, l} * V_{l, k}] (Δ)

and

[ω_{j, l} * V_{l, k}] (Δ) \equiv \int_{0}^{\infty} ω_{j, l} (Δ^{'}) V_{l, k} (Δ - Δ^{'}) d Δ^{'} .

Equation (13) belongs to a class of integral equations known as the Wiener-Hopf integral equations.

4.2. Technical lemmas

We state three lemmas used to prove Theorem 1, and provide their proofs in the Appendix. The following lemma is a direct consequence of (13) and our assumptions. Recall that [0, b] is a superset of supp(ω_{j,k}) introduced in Assumption 3.

Lemma 1

Under Assumptions 1–3, for sufficiently large B such that B ≥ b, we have that ‖V_{j,k}‖_{2,[−B,B]} ≥ β_minΛ_min for (j, k) ∈ ℰ.

The next lemma shows that the cross-covariance is Lipschitz continuous given the smoothness assumption on ω_{j,k} (Assumption 3(c)). We will use this lemma in the proof of Theorem 1, in order to bound the bias of the kernel smoothing estimator (8). Recall that s, the maximum node in-degree, was defined in (4).

Lemma 2

Under Assumptions 1–3, the cross-covariance function is Lipschitz for 1 ≤ j, k ≤ p. More specifically, there exists some θ₁ > 0 such that |V_{j,k}(x) − V_{j,k}(y)| ≤ θ₁s|x − y| for any x, y ∈ ℝ.

Recall that the bandwidth h was defined in (8). The following concentration inequality holds on the estimated cross-covariance.

Lemma 3

Suppose that Assumptions 1–3 hold, and let h = c₁s^−1/2T^−1/6 for some constant c₁. Then

ℙ (\underset{1 \leq j, k \leq p}{\cap} [{‖ {\hat{V}}_{j, k} - V_{j, k} ‖}_{2, [- B, B]} \leq c_{2} s^{1 / 2} T^{- 1 / 6}]) \geq 1 - c_{3} s^{1 / 2} T^{7 / 6} p^{2} e^{- c_{4} T^{1 / 6}} .

4.3. Proof of Theorem 1

Proof

In what follows, we will consider the event

M \equiv {{‖ {\hat{V}}_{j, k} - V_{j, k} ‖}_{2, [- B, B]} \leq c_{2} s^{1 / 2} T^{- 1 / 6} for all 1 \leq j, k \leq p} .

We will first show that part (b) of Theorem 1 holds. From the Wiener-Hopf equation, (13), for each (j, k), we can write

V_{j, k} = ω_{j, k} Λ_{k} + ω_{j, \cdot} * V_{\cdot, k} .

(14)

We thus have

\begin{array}{l} {‖ V_{j, k} ‖}_{2, (- \infty, \infty)} \leq Λ_{k} {‖ ω_{j, k} ‖}_{2, (- \infty, \infty)} + {‖ ω_{j, \cdot} * V_{\cdot, k} ‖}_{2, (- \infty, \infty)} \\ \leq Λ_{k} {‖ ω_{j, k} ‖}_{2, (- \infty, \infty)} + \sum_{l = 1}^{p} {‖ ω_{j, l} * V_{l, k} ‖}_{2, (- \infty, \infty)} \\ \leq Λ_{k} {‖ ω_{j, k} ‖}_{2, (- \infty, \infty)} + \sum_{l = 1}^{p} (\int_{- \infty}^{\infty} ∣ ω_{j, l} (Δ) ∣ d Δ) {‖ V_{l, k} ‖}_{2, (- \infty, \infty)}, \end{array}

(15)

where the last inequality follows from Young’s inequality (see e.g., Theorem 3.9.4 in Bogachev (2007)), which takes the form

{‖ f * g ‖}_{r, (- \infty, \infty)} \leq {‖ f ‖}_{p, (- \infty, \infty)} {‖ g ‖}_{q, (- \infty, \infty)}, \frac{1}{p} + \frac{1}{q} = \frac{1}{r} + 1,

(16)

with ${‖ f ‖}_{p, (- \infty, \infty)} \equiv {[\int_{- \infty}^{\infty} {∣ f (x) ∣}^{p} d x]}^{1 / p}$ . Here, we let r = q = 2, p = 1, f = ω_{j,l}, and g = V_{l,k}.
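
The special case of Young's inequality used here, ‖f * g‖₂ ≤ ‖f‖₁‖g‖₂, can be checked numerically on a grid. The sketch below is only a sanity check under assumed inputs: the two test functions, the grid spacing `dx`, and the Riemann-sum approximation of the integrals are choices made for this example.

```python
import numpy as np

dx = 0.01
x = np.arange(0.0, 5.0, dx)
f = 2.0 * x * np.exp(1.0 - 5.0 * x)   # a non-negative, integrable function
g = np.exp(-(x - 2.0) ** 2)           # a square-integrable function

conv = np.convolve(f, g) * dx         # Riemann-sum approximation of f * g
lhs = np.sqrt(np.sum(conv**2) * dx)   # approximates ||f * g||_2
rhs = (np.sum(np.abs(f)) * dx) * np.sqrt(np.sum(g**2) * dx)  # ||f||_1 ||g||_2
print(lhs, rhs, lhs <= rhs)
```

The discrete analogue of Young's inequality holds exactly for the sampled sequences, so the comparison is not affected by the discretization error of the Riemann sums.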

From Assumption 3(c), we know that ω_{j,k} is bounded by C. Therefore, by the Cauchy-Schwarz inequality,

{‖ ω_{j, k} ‖}_{2, (- \infty, \infty)} = \{\int_{- \infty}^{\infty} ω_{j, k}^{2} (Δ) d Δ\}^{1 / 2} \leq \{\int_{- \infty}^{\infty} C ∣ ω_{j, k} (Δ) ∣ d Δ\}^{1 / 2} = C^{1 / 2} Ω_{j, k}^{1 / 2} .

Using (15) and letting V̄_{j,k} ≡ ‖V_{j,k}‖_{2,(−∞,∞)}, we get

{\bar{V}}_{j, k} \leq C^{1 / 2} Ω_{j, k}^{1 / 2} Λ_{k} + Ω_{j, \cdot} \cdot {\bar{V}}_{\cdot, k} .

(17)

The ℓ₂-norm of the vector V̄_{·,k} can then be bounded using the triangle inequality,

{‖ {\bar{V}}_{\cdot, k} ‖}_{2} \leq C^{1 / 2} Λ_{k} {[\sum_{j = 1}^{p} Ω_{j, k}]}^{1 / 2} + {‖ Ω {\bar{V}}_{\cdot, k} ‖}_{2} .

Thus, by Assumption 1,

{‖ {\bar{V}}_{\cdot, k} ‖}_{2} \leq C^{1 / 2} Λ_{k} {‖ Ω_{\cdot, k} ‖}_{1}^{1 / 2} + γ_{Ω} {‖ {\bar{V}}_{\cdot, k} ‖}_{2} .

Rearranging the terms, and using the fact that γ_Ω < 1, gives

{‖ {\bar{V}}_{\cdot, k} ‖}_{2} \leq C^{1 / 2} {(1 - γ_{Ω})}^{- 1} Λ_{max} {‖ Ω_{\cdot, k} ‖}_{1}^{1 / 2} .

(18)

Hence,

\sum_{j, k} {\bar{V}}_{j, k}^{2} = \sum_{k} {‖ {\bar{V}}_{\cdot, k} ‖}_{2}^{2} \leq C {(1 - γ_{Ω})}^{- 2} Λ_{max}^{2} \sum_{k = 1}^{p} {‖ Ω_{\cdot, k} ‖}_{1} .

(19)

Now, recall that the number of non-zero elements in Ω is card(ℰ), and Ω_{j,k} ≤ γ_Ω. Thus, the inequality (19) becomes

\sum_{j, k} {\bar{V}}_{j, k}^{2} \leq C {(1 - γ_{Ω})}^{- 2} Λ_{max}^{2} card (E) γ_{Ω} .

(20)

Hence, no more than $(C / c_{2}^{2}) card (E) s^{- 1} T^{1 / 3} γ_{Ω} {(1 - γ_{Ω})}^{- 2} Λ_{max}^{2}$ elements of V̄_{j,k} exceed c₂s^{1/2}T^{−1/6}. Recalling that V̄_{j,k} = ‖V_{j,k}‖_{2,(−∞,∞)} and that ‖V_{j,k}‖_{2,[−B,B]} ≤ ‖V_{j,k}‖_{2,(−∞,∞)}, this implies that no more than

(C / c_{2}^{2}) card (E) s^{- 1} T^{1 / 3} γ_{Ω} {(1 - γ_{Ω})}^{- 2} Λ_{max}^{2}

elements of ‖V_{j,k}‖_{2,[−B,B]} exceed c₂s^{1/2}T^{−1/6}.

Given the event ℳ, only edges in the set

{(j, k) : {‖ V_{j, k} ‖}_{2, [- B, B]} \geq c_{2} s^{1 / 2} T^{- 1 / 6}, 1 \leq j, k \leq p}

can be contained in ℰ̂(ζ) for ζ = 2c₂s^1/2T^−1/6. This implies that the size of ℰ̂(ζ) is on the order of $(C / c_{2}^{2}) card (E) s^{- 1} T^{1 / 3} γ_{Ω} {(1 - γ_{Ω})}^{- 2} Λ_{max}^{2}$ .

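We now proceed to prove part (a) of Theorem 1. Lemma 1 states that ‖V_{j,k}‖_{2,[−B,B]} ≥ β_minΛ_min for (j, k) ∈ ℰ. If the event ℳ holds, then by the triangle inequality, ‖V̂_{j,k}‖_{2,[−B,B]} ≥ ‖V_{j,k}‖_{2,[−B,B]} − ‖V̂_{j,k} − V_{j,k}‖_{2,[−B,B]} ≥ β_minΛ_min − c₂s^{1/2}T^{−1/6} for (j, k) ∈ ℰ. Hence, for T sufficiently large that 3c₂s^{1/2}T^{−1/6} < β_minΛ_min, we have ‖V̂_{j,k}‖_{2,[−B,B]} > 2c₂s^{1/2}T^{−1/6} = ζ for (j, k) ∈ ℰ. Therefore, ℰ ⊂ ℰ̂(ζ).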

Finally, Theorem 1 follows from the fact that, by Lemma 3, the event ℳ holds with probability at least 1 − c₃s^1/2T^7/6p² exp(−c₄T^1/6).

5. Discussion

In this paper, we have proposed a very simple procedure for screening the edge set in a multivariate Hawkes process. Provided that the process is mutually-exciting, we establish that this screening procedure can lead to a very small screened edge set, without incurring any false negatives. In fact, this result holds under a subset of the conditions required to establish model selection consistency of penalized regression estimators for the Hawkes process (Wainwright, 2009; Hansen, Reynaud-Bouret and Rivoirard, 2015). Therefore, this screening should always be performed whenever estimating the graph for a mutually-exciting Hawkes process.

The proposed screening procedure boils down to just screening pairs of nodes by thresholding an estimate of their cross-covariance. In fact, this approach is commonly taken within the neuroscience literature, with a goal of estimating the functional connectivity among a set of p neuronal spike trains (Okatan, Wilson and Brown, 2005; Pillow et al., 2008; Mishchenko, Vogelstein and Paninski, 2011; Berry et al., 2012). Therefore, this paper sheds light on the theoretical foundations for an approach that is often used in practice.

Appendix A: Technical proofs

A.1. Proof of Lemma 1

Proof

First, we observe that, if V_{j,k} is non-negative for all j and k, then ω_{j,l} * V_{l,k} is non-negative for any j, l, k. Under Assumption 1, we know that (13) holds. We can see from (13) that

V_{j, k} (Δ) \geq ω_{j, k} (Δ) Λ_{k} .

Therefore, we have

{‖ V_{j, k} (Δ) ‖}_{2, [- B, B]} \geq {‖ ω_{j, k} (Δ) ‖}_{2, [- B, B]} Λ_{min} = {‖ ω_{j, k} (Δ) ‖}_{2, [0, b]} Λ_{min},

(21)

where the inequality follows from Assumption 2 and the equality holds since

supp (ω_{j, k}) \subset (0, b] \subset [- B, B] .

From Assumption 3(b), we have that ||V_{j,k}(Δ)||_{2,[−B,B]} ≥ β_min Λ_min for (j, k) ∈ ℰ.

We now show that the elements of V are non-negative, i.e., V_{l,k}(Δ) ≥ 0 for 1 ≤ l, k ≤ p, and Δ ∈ ℝ. Recall from the definition (7) in the main paper that

\begin{array}{l} V_{l, k} (Δ) \equiv E [d N_{l} (t) d N_{k} (t - Δ)] / {d t d (t - Δ)} - Λ_{l} Λ_{k} \\ \equiv E [λ_{l} (t) d N_{k} (t - Δ)] / {d (t - Δ)} - Λ_{l} Λ_{k}, \end{array}

(22)

where the second equality follows from

E [d N_{l} (t) d N_{k} (t - Δ)] = E [E [d N_{l} (t) ∣ H_{t}] d N_{k} (t - Δ)] = E [λ_{l} (t) d N_{k} (t - Δ)] d t .

(23)

In this proof, we use the Stieltjes integral to rewrite λ_l(t) in (2) as

λ_{l} (t) = μ_{l} + \sum_{k = 1}^{p} (\sum_{i : t_{k, i} \leq t} ω_{l, k} (t - t_{k, i})) = μ_{l} + \sum_{k = 1}^{p} \int_{0}^{\infty} ω_{l, k} (Δ) d N_{k} (t - Δ) .

(24)
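The sum form of (24) can be evaluated directly from the event times. The sketch below is only illustrative: the helper `box_kernel` is a hypothetical transfer function supported on (0, b], chosen for simplicity and not taken from the paper, and `intensity` evaluates λ_l(t) for a single component l.

```python
import numpy as np

def box_kernel(height, b):
    """Hypothetical transfer function: equal to `height` on (0, b], zero elsewhere."""
    def omega(lag):
        lag = np.asarray(lag, dtype=float)
        return np.where((lag > 0) & (lag <= b), height, 0.0)
    return omega

def intensity(t, mu_l, event_times, omegas_l):
    """Evaluate lambda_l(t) = mu_l + sum_k sum_{i: t_{k,i} <= t} omega_{l,k}(t - t_{k,i}).

    event_times : list of p sequences, event_times[k] holds the events of component k
    omegas_l    : list of p callables, omegas_l[k] plays the role of omega_{l,k}
    """
    total = float(mu_l)
    for times_k, omega_lk in zip(event_times, omegas_l):
        lags = t - np.asarray(times_k, dtype=float)
        lags = lags[lags >= 0.0]               # keep only events with t_{k,i} <= t
        total += float(np.sum(omega_lk(lags)))
    return total

# Two components; both excite component l through box-shaped transfer functions.
events = [[0.2, 0.9, 1.5], [1.0]]
omegas = [box_kernel(0.3, 1.0), box_kernel(0.1, 1.0)]
print(intensity(1.6, 0.5, events, omegas))
```

Because each ω_{l,k} is non-negative here, the computed intensity is never below μ_l, which is the property used later in the proof.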

Plugging in λ_l(t) from (24) into (22) gives

\begin{array}{l} V_{l, k} (Δ) = - Λ_{l} Λ_{k} + E [μ_{l} d N_{k} (t - Δ)] / {d (t - Δ)} \\ + E [\sum_{m = 1}^{p} \int_{0}^{b} ω_{l, m} (Δ^{'}) d N_{m} (t - Δ^{'}) d N_{k} (t - Δ)] / {d (t - Δ)} \\ = \sum_{m = 1}^{p} \int_{0}^{b} ω_{l, m} (Δ^{'}) E [d N_{m} (t - Δ^{'}) d N_{k} (t - Δ)] / {d (t - Δ)} \\ + E [μ_{l} d N_{k} (t - Δ)] / {d (t - Δ)} - Λ_{l} E [d N_{k} (t - Δ)] / {d (t - Δ)}, \end{array}

where we use the definition Λ_k = 𝔼[dN_k(t − Δ)]/{d(t − Δ)}.

Using the fact that (see e.g., Hawkes and Oakes (1974))

Λ_{l} = μ_{l} + \sum_{m = 1}^{p} \int_{0}^{b} ω_{l, m} (Δ^{'}) d Δ^{'} μ_{m},

we have

\begin{array}{l} V_{l, k} (Δ) = \sum_{m = 1}^{p} \int_{0}^{b} ω_{l, m} (Δ^{'}) E [d N_{m} (t - Δ^{'}) d N_{k} (t - Δ)] / {d (t - Δ)} \\ + E [μ_{l} d N_{k} (t - Δ)] / {d (t - Δ)} \\ - {μ_{l} + \sum_{m = 1}^{p} \int_{0}^{b} ω_{l, m} (Δ^{'}) μ_{m} d Δ^{'}} E [d N_{k} (t - Δ)] / {d (t - Δ)} \\ = \sum_{m = 1}^{p} \int_{0}^{b} ω_{l, m} (Δ^{'}) E [d N_{m} (t - Δ^{'}) d N_{k} (t - Δ)] / {d (t - Δ)} \\ - \sum_{m = 1}^{p} \int_{0}^{b} ω_{l, m} (Δ^{'}) μ_{m} d Δ^{'} E [d N_{k} (t - Δ)] / {d (t - Δ)} . \end{array}

Rearranging the terms gives

V_{l, k} (Δ) = \sum_{m = 1}^{p} \int_{0}^{b} \frac{ω_{l, m} (Δ^{'})}{d (t - Δ)} {E [d N_{m} (t - Δ^{'}) d N_{k} (t - Δ)] - E [μ_{m} d Δ^{'} d N_{k} (t - Δ)]} .

(25)

Next, we will rewrite (25) by taking the conditional expectation of dN_k or dN_m as in (23). Note here that, when Δ′ < Δ, we condition dN_m on the history up to t − Δ′, i.e., ℋ_{t−Δ′}. Given ℋ_{t−Δ′}, dN_k(t − Δ) is fixed since t − Δ < t − Δ′. When Δ′ > Δ, we condition dN_k on the history up to t − Δ. These cases are discussed separately in the following.

When Δ′ < Δ, for each integral of the summation, it holds that

E {d N_{m} (t - Δ^{'}) d N_{k} (t - Δ)} = E {λ_{m} (t - Δ^{'}) d Δ^{'} d N_{k} (t - Δ)} .

From the definition of λ_m(t) in (2), we know that λ_m(t − Δ′) ≥ μ_m. Hence, in (25), if Δ′ < Δ, it holds that

E {d N_{m} (t - Δ^{'}) d N_{k} (t - Δ)} / {d (t - Δ)} - E {μ_{m} d Δ^{'} d N_{k} (t - Δ)} / {d (t - Δ)} \geq 0.

(26)

On the other hand, when Δ′ ≥ Δ, we have

\begin{array}{l} E {d N_{m} (t - Δ^{'}) d N_{k} (t - Δ)} / {d (t - Δ)} - E {μ_{m} d Δ^{'} d N_{k} (t - Δ)} / {d (t - Δ)} \\ = E {d N_{m} (t - Δ^{'}) λ_{k} (t - Δ)} - E {μ_{m} d Δ^{'} λ_{k} (t - Δ)} \\ = E {d N_{m} (t - Δ^{'}) λ_{k} (t - Δ)} - μ_{m} Λ_{k} d Δ^{'} . \end{array}

Expanding λ_k and Λ_k yields

\begin{array}{l} E {d N_{m} (t - Δ^{'}) λ_{k} (t - Δ)} - μ_{m} Λ_{k} d Δ^{'} \\ = E {d N_{m} (t - Δ^{'}) μ_{k}} + \sum_{i = 1}^{p} \int_{0}^{b} ω_{k, i} (Δ^{″}) E [d N_{m} (t - Δ^{'}) d N_{i} (t - Δ - Δ^{″})] \\ - μ_{m} μ_{k} d Δ^{'} - \sum_{i = 1}^{p} \int_{0}^{b} ω_{k, i} (Δ^{″}) d Δ^{″} μ_{i} μ_{m} d Δ^{'} \\ = (Λ_{m} - μ_{m}) μ_{k} d Δ^{'} \\ + \sum_{i = 1}^{p} \int_{0}^{b} ω_{k, i} (Δ^{″}) {E [d N_{m} (t - Δ^{'}) d N_{i} (t - Δ - Δ^{″})] - μ_{i} μ_{m} d Δ^{'} d Δ^{″}} . \end{array}

Now, observe that Λ_m ≥ μ_m and 𝔼{dN_i(t − Δ − Δ″) dN_m(t − Δ′)}/{dΔ′dΔ″} ≥ μ_iμ_m by the nature of the mutually-exciting process. Thus, for Δ′ ≥ Δ,

E {d N_{m} (t - Δ^{'}) d N_{k} (t - Δ)} / {d (t - Δ)} - E {μ_{m} d Δ^{'} d N_{k} (t - Δ)} / {d (t - Δ)} \geq 0.

(27)

Applying both (26) and (27) to (25) shows that V_{l,k}(Δ) ≥ 0.

A.2. Proof of Lemma 2

Proof

For any Δ ≥ 0, the integral equation (13) gives

V_{j, k} (Δ) = ω_{j, k} (Δ) Λ_{k} + (ω_{j, \cdot} * V_{\cdot, k}) (Δ) .

(28)

For any x, y ≥ 0, we can write

\begin{array}{l} ∣ V_{j, k} (x) - V_{j, k} (y) ∣ = ∣ {ω_{j, k} (x) - ω_{j, k} (y)} Λ_{k} + (ω_{j, \cdot} * V_{\cdot, k}) (x) - (ω_{j, \cdot} * V_{\cdot, k}) (y) ∣ \\ = | {ω_{j, k} (x) - ω_{j, k} (y)} Λ_{k} + \sum_{l = 1}^{p} {ω_{j, l} * V_{l, k} (x) - ω_{j, l} * V_{l, k} (y)} | \\ = | {ω_{j, k} (x) - ω_{j, k} (y)} Λ_{k} + \sum_{l \in E_{j}} {ω_{j, l} * V_{l, k} (x) - ω_{j, l} * V_{l, k} (y)} |, \end{array}

(29)

where the last equality holds since ω_{j,l} ≡ 0 for l ∉ ℰ_j. We then have

∣ V_{j, k} (x) - V_{j, k} (y) ∣ \leq \underset{I}{\underset{︸}{| {ω_{j, k} (x) - ω_{j, k} (y)} Λ_{k} |}} + \sum_{l \in E_{j}} \underset{{II}_{l}}{\underset{︸}{| ω_{j, l} * V_{l, k} (x) - ω_{j, l} * V_{l, k} (y) |}} .

(30)

For I, we know from Assumptions 2 and 3(c) that

I \equiv | {ω_{j, k} (x) - ω_{j, k} (y)} Λ_{k} | \leq θ_{0} Λ_{max} ∣ x - y ∣ .

(31)

For II_l, we can expand the convolution

\begin{array}{l} {II}_{l} = | \int_{0}^{b} ω_{j, l} (Δ) V_{l, k} (x - Δ) d Δ - \int_{0}^{b} ω_{j, l} (Δ) V_{l, k} (y - Δ) d Δ | \\ = | \int_{- x}^{b - x} ω_{j, l} (Δ^{'} + x) V_{l, k} (- Δ^{'}) d Δ^{'} - \int_{- y}^{b - y} ω_{j, l} (Δ^{'} + y) V_{l, k} (- Δ^{'}) d Δ^{'} | . \end{array}

Without loss of generality, we consider only the case that x ≥ y. We can decompose the integrals into parts on the intervals [−x, −y), [−y, b − x), and [b − x, b − y] as

\begin{array}{l} {II}_{l} \leq | \int_{- y}^{b - x} {ω_{j, l} (Δ^{'} + x) - ω_{j, l} (Δ^{'} + y)} V_{l, k} (- Δ^{'}) d Δ^{'} | \\ + | \int_{- x}^{- y} ω_{j, l} (Δ^{'} + x) V_{l, k} (- Δ^{'}) d Δ^{'} | + | \int_{b - x}^{b - y} ω_{j, l} (Δ^{'} + y) V_{l, k} (- Δ^{'}) d Δ^{'} | \\ \leq \int_{- y}^{b - x} θ_{0} ∣ Δ^{'} + x - Δ^{'} - y ∣ ∣ V_{l, k} (- Δ^{'}) ∣ d Δ^{'} \\ + \int_{- x}^{- y} ∣ ω_{j, l} (Δ^{'} + x) V_{l, k} (- Δ^{'}) ∣ d Δ^{'} + \int_{b - x}^{b - y} ∣ ω_{j, l} (Δ^{'} + y) V_{l, k} (- Δ^{'}) ∣ d Δ^{'} \\ \leq \int_{- y}^{b - x} θ_{0} ∣ x - y ∣ V_{max} d Δ^{'} + \int_{- x}^{- y} ∣ ω_{j, l} (Δ^{'} + x) ∣ V_{max} d Δ^{'} \\ + \int_{b - x}^{b - y} ∣ ω_{j, l} (Δ^{'} + y) ∣ V_{max} d Δ^{'} \\ \leq (b - x + y) θ_{0} ∣ x - y ∣ V_{max} + 2 C (x - y) V_{max}, \end{array}

where we use Assumption 3(c) in the second inequality, Assumption 2 in the third inequality, and the boundedness of ω_{j,l} from Assumption 3(c) in the last inequality. Recalling that x ≥ y, we have

{II}_{l} \leq (b θ_{0} V_{max} + 2 C V_{max}) ∣ x - y ∣ .

(32)

Finally, plugging (31) and (32) into (30) gives

∣ V_{j, k} (x) - V_{j, k} (y) ∣ \leq θ_{0} Λ_{max} ∣ x - y ∣ + s (b θ_{0} V_{max} + 2 C V_{max}) ∣ x - y ∣ \leq s θ_{1} ∣ x - y ∣,

(33)

where we set θ₁ ≡ θ₀Λ_max + bθ₀V_max + 2CV_max. Note that the last inequality holds as long as s ≥ 1. (The result also holds if s = 0: in this case, the second term in (30) is zero for every j and the bound (31) suffices.)

A.3. Proof of Lemma 3

Recall that the estimator of the cross-covariance (8) takes the form

\frac{1}{h} \underset{I_{j, k}}{\underset{︸}{\frac{1}{T} \iint_{{[0, T]}^{2}} K (\frac{t - t^{'} + Δ}{h}) d N_{j} (t^{'}) d N_{k} (t)}} - \underset{{II}_{j}}{\underset{︸}{[\frac{1}{T} \int_{0}^{T} d N_{j} (t)]}} \underset{{II}_{k}}{\underset{︸}{[\frac{1}{T} \int_{0}^{T} d N_{k} (t)]}} .
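For concreteness, the estimator displayed above can be evaluated at a single lag Δ by replacing the double integral with a double sum over observed event times. The sketch below makes two assumptions that are not stated in this section: it uses the Epanechnikov kernel as an example of a kernel supported on [−1, 1], and it applies no special handling to the diagonal terms when j = k. The function names are hypothetical.

```python
import numpy as np

def epanechnikov(x):
    """Example kernel supported on [-1, 1] (an assumption for this sketch)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1.0, 0.75 * (1.0 - x ** 2), 0.0)

def cross_cov_hat(times_j, times_k, delta, h, T, kernel=epanechnikov):
    """Kernel estimate of V_{j,k}(delta) from event times observed on [0, T].

    Computes (1/h) * I_{j,k} - II_j * II_k, where
      I_{j,k} = (1/T) * sum over events t' of j and t of k of K((t - t' + delta)/h),
      II_j    = N_j(T) / T and II_k = N_k(T) / T.
    """
    tj = np.asarray(times_j, dtype=float)          # event times t' of component j
    tk = np.asarray(times_k, dtype=float)          # event times t of component k
    diffs = tk[:, None] - tj[None, :] + delta      # t - t' + delta for every pair
    I_jk = kernel(diffs / h).sum() / T
    return I_jk / h - (tj.size / T) * (tk.size / T)

# One event per component, 0.5 time units apart, observed on [0, 10].
print(cross_cov_hat([1.0], [1.5], delta=-0.5, h=0.5, T=10.0))
```

The pairwise difference matrix costs O(N_j(T) N_k(T)) memory; for long recordings one would restrict attention to pairs of events closer than |Δ| + h.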

The proof of Lemma 3 uses the following result. Lemma 4 is based on Proposition 3 of Hansen, Reynaud-Bouret and Rivoirard (2015); for completeness, we provide its proof in Section A.4.

Lemma 4

Suppose that Assumption 1 holds. We have

ℙ (\underset{1 \leq j \leq k \leq p}{\cap} [∣ I_{j, k} - E I_{j, k} ∣ \geq c_{6} T^{- 1 / 3}]) \leq c_{5} p^{2} T exp (- c_{4} T^{1 / 6}),

(34)

ℙ (\underset{1 \leq j \leq p}{\cap} [∣ {II}_{j} - E {II}_{j} ∣ \geq c_{6} T^{- 1 / 3 + 1 / 18}]) \leq c_{5} p^{2} T exp (- c_{4} T^{1 / 6}),

(35)

where c₄, c₅, and c₆ are constants.

We are now ready to prove Lemma 3.

Proof

First, note that

\begin{array}{l} ∣ E I_{j, k} - h [V_{j, k} (Δ) + Λ_{j} Λ_{k}] ∣ \\ = | \frac{1}{T} \iint_{{[0, T]}^{2}} K (\frac{t - t^{'} + Δ}{h}) E [d N_{j} (t^{'}) d N_{k} (t)] \\ - \frac{1}{T} \iint_{{[0, T]}^{2}} K (\frac{t - t^{'} + Δ}{h}) [V_{j, k} (Δ) + Λ_{j} Λ_{k}] d t d t^{'} | \\ = | \frac{1}{T} \iint_{{[0, T]}^{2}} K (\frac{t - t^{'} + Δ}{h}) {E [d N_{j} (t^{'}) d N_{k} (t)] - Λ_{j} Λ_{k} d t d t^{'}} \\ - \frac{1}{T} \iint_{{[0, T]}^{2}} K (\frac{t - t^{'} + Δ}{h}) V_{j, k} (Δ) d t d t^{'} | \\ = | \frac{1}{T} \iint_{{[0, T]}^{2}} K (\frac{t - t^{'} + Δ}{h}) V_{j, k} (t^{'} - t) d t d t^{'} \\ - \frac{1}{T} \iint_{{[0, T]}^{2}} K (\frac{t - t^{'} + Δ}{h}) V_{j, k} (Δ) d t d t^{'} | \\ = | \frac{1}{T} \iint_{{[0, T]}^{2}} K (\frac{t - t^{'} + Δ}{h}) [V_{j, k} (t^{'} - t) - V_{j, k} (Δ)] d t d t^{'} |, \end{array}

(36)

where we use the definition of V in the third equality. Using the fact that the kernel K(x/h) is defined on [−h, h], we can write

\begin{array}{l} ∣ E I_{j, k} - h [V_{j, k} (Δ) + Λ_{j} Λ_{k}] ∣ \\ = | \frac{1}{T} \int_{0}^{T} \int_{max (0, t - Δ - h)}^{min (T, t - Δ + h)} K (\frac{t - t^{'} + Δ}{h}) [V_{j, k} (t^{'} - t) - V_{j, k} (Δ)] d t d t^{'} | \\ \leq \frac{1}{T} \int_{0}^{T} \int_{max (0, t - Δ - h)}^{min (T, t - Δ + h)} K (\frac{t - t^{'} + Δ}{h}) θ_{1} s ∣ t^{'} - t - Δ ∣ d t d t^{'} \\ \leq \frac{1}{T} \int_{0}^{T} \int_{max (0, t - Δ - h)}^{min (T, t - Δ + h)} K (\frac{t - t^{'} + Δ}{h}) θ_{1} h s d t d t^{'} \\ \leq \frac{1}{T} \int_{0}^{T} 2 θ_{1} s h^{2} d t \\ = 2 θ_{1} s h^{2}, \end{array}

(37)

where the first inequality follows from Lemma 2.

Recall that II_j ≡ T⁻¹N_j(T) and II_k ≡ T⁻¹N_k(T). Applying Lemma 4 and (37), we have, with probability at least 1 − 2c₅p²T exp(−c₄T^1/6),

\begin{array}{l} | {\hat{V}}_{j, k} (Δ) - V_{j, k} (Δ) | \\ \leq \frac{1}{h} ∣ I_{j, k} - E I_{j, k} ∣ + \frac{1}{h} ∣ E I_{j, k} - E [d N_{j} (t - Δ) d N_{k} (t)] / (d t d Δ) ∣ \\ + | \frac{1}{T^{2}} (N_{j} (T) - T Λ_{j}) N_{k} (T) | + | Λ_{j} \frac{1}{T} N_{k} (T) - Λ_{j} Λ_{k} | \\ \leq c_{6} T^{- 1 / 3} h^{- 1} + 2 θ_{1} h s \\ + (Λ_{max} + c_{6} T^{- 1 / 3 + 1 / 18}) c_{6} T^{- 1 / 3 + 1 / 18} + Λ_{max} c_{6} T^{- 1 / 3 + 1 / 18} . \end{array}

(38)

Letting h = c₁s^{−1/2}T^{−1/6}, (38) can be written as

| {\hat{V}}_{j, k} (Δ) - V_{j, k} (Δ) | \leq c_{2}^{'} s^{1 / 2} T^{- 1 / 6} .

(39)
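The rate in (39) can be understood by balancing the first two terms on the right-hand side of (38). As a sketch, assuming s ≥ 1 and T ≥ 1, and taking c₁ = (c₆/(2θ₁))^{1/2} (one admissible value; the constant c₁ is not pinned down in this section), we have

c_{6} T^{- 1 / 3} h^{- 1} = 2 θ_{1} s h \iff h^{2} = \frac{c_{6}}{2 θ_{1}} s^{- 1} T^{- 1 / 3} \iff h = c_{1} s^{- 1 / 2} T^{- 1 / 6},

so that each of the two terms equals

c_{6} T^{- 1 / 3} h^{- 1} = 2 θ_{1} s h = \sqrt{2 c_{6} θ_{1}} s^{1 / 2} T^{- 1 / 6} .

The remaining terms in (38) are of order T^{−1/3+1/18} = T^{−5/18}, which is of smaller order than s^{1/2}T^{−1/6}, so they can be absorbed into the constant c₂′.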

Lastly, we need a uniform bound on V̂_{j,k} − V_{j,k} on the region [−B,B]. We first see that the above probability statement holds for a grid of ⌈s^{1/2}T^{1/6}⌉ points on [−B,B], denoted as ${Δ_{i}}_{i = 1}^{⌈ s^{1 / 2} T^{1 / 6} ⌉}$ . The gap between adjacent points on this grid is bounded by 2Bs^{−1/2}T^{−1/6}. Furthermore, for any Δ ∈ [−B,B], we can find a point on the grid Δ_i such that |Δ − Δ_i| ≤ 2B/⌈s^{1/2}T^{1/6}⌉ ≤ 2Bs^{−1/2}T^{−1/6}. From basic calculus we get that, for all Δ ∈ [−B,B],

\begin{array}{l} | {\hat{V}}_{j, k} (Δ) - V_{j, k} (Δ) | \\ = | {\hat{V}}_{j, k} (Δ) - {\hat{V}}_{j, k} (Δ_{i}) + {\hat{V}}_{j, k} (Δ_{i}) - V_{j, k} (Δ_{i}) + V_{j, k} (Δ_{i}) - V_{j, k} (Δ) | \\ \leq 2 B s^{- 1 / 2} T^{- 1 / 6} + c_{2}^{'} s^{1 / 2} T^{- 1 / 6} + θ_{1} s s^{- 1 / 2} T^{- 1 / 6} \\ \leq c_{2} s^{1 / 2} T^{- 1 / 6}, \end{array}

(40)

for some constant c₂.

Therefore, with probability at least 1 − c₃s^{1/2}p²T^{7/6} exp(−c₄T^{1/6}),

{‖ {\hat{V}}_{j, k} - V_{j, k} ‖}_{2, [- B, B]} \leq c_{2} s^{1 / 2} T^{- 1 / 6} .

(41)

A.4. Proof of Lemma 4

Lemma 4 follows directly from the proof of Proposition 3 in Hansen, Reynaud-Bouret and Rivoirard (2015). The only difference is that we want a polynomial bound on the deviation, while Hansen, Reynaud-Bouret and Rivoirard (2015) consider a logarithmic bound. For completeness, we state the proof of Lemma 4 below, but note that the proof is almost identical to the proof of Proposition 3 in Hansen, Reynaud-Bouret and Rivoirard (2015). We refer the interested readers to the original proof in Section 7.4.3 of Hansen, Reynaud-Bouret and Rivoirard (2015) for more details.

Throughout this section, we assume that N ≡ (N₁, …, N_p)^T is defined on the full real line. We first state some notation that is only used in this section.

Following Hansen, Reynaud-Bouret and Rivoirard (2015), we use $C_{a_{1}, a_{2}, \dots}^{(i)}$ to denote a constant that depends only on a₁, a₂, …; and we use the superscript i to indicate that this is the ith constant appearing in the proof.
Without loss of generality, we assume that supp(ω_{j,k}) ⊂ (0, 1], as in Hansen, Reynaud-Bouret and Rivoirard (2015).
As in Hansen, Reynaud-Bouret and Rivoirard (2015), we introduce a function Z(N) such that Z(N) depends only on {dN(t′), t′∈ [−A, 0)}, and there exist two non-negative constants η and d such that
$∣ Z (N) ∣ \leq d {1 + {(\sum_{l = 1}^{p} N_{l} ([- A, 0)])}^{η}} .$ (42)
We also introduce the (time) shift operator S_t so that Z ○ S_t(N) depends only on {dN(t′), t′∈ [−A + t, t)}, in the same way as Z(N) depends on the points of N in [−A, 0).

We are now ready to prove the lemma. When proving the bound (34), we only discuss the case when j ≠ k. The proof for the case when j = k follows from the same argument and is thus omitted.

Proof

In this proof, we will consider a probability bound for the event ∫₀^T [Z ○ S_t(N) − 𝔼(Z)] dt ≥ u, where, for some κ ∈ (0, 1) to be specified later,

u = c_{6} T^{(1 - κ) (1 - η) + κ} .

(43)

Note that, by applying the bound to −Z(·), we can obtain a bound for |∫₀^T [Z ○ S_t(N) − 𝔼(Z)] dt|. To complete the proof, we will verify the statements (34) and (35) by considering some specific choices of Z(·).

For any positive integer k such that x ≡ T/(2k) > A, we have

\begin{array}{l} ℙ (\int_{0}^{T} [Z \circ S_{t} (N) - E (Z)] d t \geq u) \\ = ℙ (\sum_{q = 0}^{k - 1} \int_{2 q x}^{2 q x + x} [Z \circ S_{t} (N) - E (Z)] d t + \int_{2 q x + x}^{2 q x + 2 x} [Z \circ S_{t} (N) - E (Z)] d t \geq u) \\ \leq 2 ℙ (\sum_{q = 0}^{k - 1} \int_{2 q x}^{2 q x + x} [Z \circ S_{t} (N) - E (Z)] d t \geq \frac{u}{2}), \end{array}

where the inequality follows from the stationarity of N. As in Reynaud-Bouret and Roy (2006), let ${{\tilde{M}}_{q}^{x}}_{q = 1}^{\infty}$ be a sequence of independent Hawkes processes, each of which is stationary with intensities λ(t) ≡ (λ₁(t), …, λ_p(t))^T. See Section 3 of Reynaud-Bouret and Roy (2006) for more details on the construction of ${{\tilde{M}}_{q}^{x}}_{q = 1}^{\infty}$ . For each q, let $M_{q}^{x}$ be the truncated process associated with ${\tilde{M}}_{q}^{x}$ , where truncation means that we only consider the points in [2qx − A, 2qx + x]. Now, if we set

F_{q} = \int_{2 q x}^{2 q x + x} [Z \circ S_{t} (M_{q}^{x}) - E (Z)] d t,

(44)

then

ℙ (\int_{0}^{T} [Z \circ S_{t} (N) - E (Z)] d t \geq u) \leq 2 ℙ (\sum_{q = 0}^{k - 1} F_{q} \geq \frac{u}{2}) + 2 \sum_{q = 0}^{k - 1} ℙ (T_{e, q} > \frac{T}{2 k} - A),

(45)

where T_{e,q} is the time to extinction of the process $M_{q}^{x}$ . The extinction time T_{e,q} is introduced in Sections 2.2 and 3 in Reynaud-Bouret and Roy (2006). Roughly speaking, it is the last time when there is an event for the Hawkes process with intensity λ(t) of the form (2), with background intensity μ ≡ (μ₁, …, μ_p)^T set to 0 for t ≥ 0. Since T_{e,q} is identically distributed for all q, we can focus on one T_{e,q}. Denoting by a_l the ancestral points with marks l and by $H_{a_{l}}^{l}$ the length of the corresponding cluster whose origin is a_l, we have:

T_{e, q} = max_{l \in {1, \dots, p}} max_{a_{l}} {a_{l} + H_{a_{l}}^{l}} .

(46)

Then by the exact argument on page 48 of Hansen, Reynaud-Bouret and Rivoirard (2015), we have

ℙ (T_{e, q} \leq a) \geq 1 - \sum_{l = 1}^{p} μ^{(l)} c_{l} / ϑ_{l} exp (- ϑ_{l} a) .

(47)

Thus, there exists a constant $C_{A}^{(1)}$ depending on A such that if we take $k = ⌊ C_{A}^{(1)} T^{κ} ⌋$ , for some κ ∈ (0, 1) to be specified later, then

\sum_{q = 0}^{k - 1} ℙ (T_{e, q} > \frac{T}{2 k} - A) \leq T^{κ} p exp (- c_{4} T^{1 - κ}),

(48)

where c₄ is a constant. Note that x = T/(2k) ≈ T^{1−κ} is larger than A for T large enough (depending on A).

Now, note that the event 𝒯 ≡ {T_{e,q} ≤ T/(2k) − A, for all q = 0, …, k − 1} only depends on the process N. We will first find a probability bound for the first term in (45). In other words, we will show that, given the event 𝒯,

ℙ (\int_{0}^{T} [Z \circ S_{t} (N) - E (Z)] d t \geq u) \leq c_{5} T exp (- c_{4} T^{1 - κ}) .

(49)

Let

B = ℙ (\sum_{q = 0}^{k - 1} F_{q} \geq \frac{u}{2}) .

Consider the measurable events

Ω_{q} = {sup_{t} {M_{q}^{x} ∣_{[t - A, t)}} \leq \tilde{N}},

where 𝒩̃ is a constant that will be defined later and $M_{q}^{x} ∣_{[t - A, t)}$ represents the number of points of $M_{q}^{x}$ lying in [t − A, t). Let Ω = ∩_{0≤q≤k−1} Ω_q. Then

B \leq ℙ (\sum_{q = 0}^{k - 1} F_{q} \geq u / 2 and Ω) + ℙ (Ω^{c}) .

We have $ℙ (Ω^{c}) \leq \sum_{q} ℙ (Ω_{q}^{c})$ , where each $ℙ (Ω_{q}^{c})$ can be easily controlled. Indeed, it is sufficient to split [2qx − A, 2qx + x] into intervals of size A (there are about $C_{A}^{(2)} T^{1 - κ}$ of these) and require the number of points in each sub-interval to be smaller than 𝒩̃/2. By stationarity, we then obtain

ℙ (Ω_{q}^{c}) \leq C_{A}^{(2)} T^{1 - κ} ℙ (N_{[- A, 0]} > \tilde{N} / 2) .

Using Proposition 2 in Hansen, Reynaud-Bouret and Rivoirard (2015) with u = [𝒩̃/2] + 1/2, we obtain:

ℙ (Ω_{q}^{c}) \leq C_{A}^{(2)} T^{1 - κ} exp (- C_{A}^{(3)} \tilde{N})

and, thus, summing over the k ≤ $C_{A}^{(1)} T^{κ}$ values of q,

ℙ (Ω^{c}) \leq C_{A}^{(4)} T exp (- C_{A}^{(3)} \tilde{N}) .

Note that this control holds for any positive choice of 𝒩̃. Thus, for any 𝒩̃ > 0,

ℙ (\exists t \in [0, T] such that M_{q}^{x} ∣_{[t - A, t)} > \tilde{N}) \leq C_{A}^{(2)} T^{1 - κ} exp (- C_{A}^{(3)} \tilde{N}) .

(50)

Hence by taking $\tilde{N} = C_{A}^{(5)} T^{1 - κ}$ , for $C_{A}^{(5)}$ large enough, the right-hand side of (50) is smaller than $C_{A}^{(2)} T^{1 - κ} exp (- c_{4} T^{1 - κ})$ .

It remains to obtain the rate of D ≡ ℙ(Σ_q F_q ≥ u/2 and Ω). For any positive constant ε that will be chosen later, we have:

\begin{array}{l} D \leq e^{- ε u / 2} E (e^{ε \sum_{q} F_{q}} \prod_{q} 𝟙_{Ω_{q}}) \\ \leq e^{- ε u / 2} \prod_{q} E (e^{ε F_{q}} 𝟙_{Ω_{q}}), \end{array}

(51)

since the variables ${M_{q}^{x}}_{q}$ are independent. But,

E (e^{ε F_{q}} 𝟙_{Ω_{q}}) = 1 + ε E (F_{q} 𝟙_{Ω_{q}}) + \sum_{j \geq 2} \frac{ε^{j}}{j!} E (F_{q}^{j} 𝟙_{Ω_{q}})

and $E (F_{q} 𝟙_{Ω_{q}}) = E (F_{q}) - E (F_{q} 𝟙_{Ω_{q}^{c}}) = - E (F_{q} 𝟙_{Ω_{q}^{c}})$ .

Next note that if for any integer l,

l \tilde{N} < sup_{t} M_{q}^{x} ∣_{[t - A, t)} \leq (l + 1) \tilde{N},

then

∣ F_{q} ∣ \leq x d [{(l + 1)}^{η} {\tilde{N}}^{η} + 1] + x ∣ E (Z) ∣ .

Hence, cutting $Ω_{q}^{c}$ into slices of the type { $l \tilde{N} < {sup}_{t} M_{q}^{x} ∣_{[t - A, t)} \leq (l + 1) \tilde{N}$ } and using (50) with $\tilde{N} = C_{A}^{(5)} T^{1 - κ}$ for a large enough $C_{A}^{(5)}$ , we obtain

\begin{array}{l} | E (F_{q} 𝟙_{Ω_{q}}) | = | E (F_{q} 𝟙_{Ω_{q}^{c}}) | \\ \leq \sum_{l = 1}^{+ \infty} x (d [{(l + 1)}^{η} {\tilde{N}}^{η} + 1] + | E (Z) |) \\ \times ℙ (\exists t \in [0, T] such that M_{q}^{x} ∣_{[t - A, t)} > l \tilde{N}) \\ \leq C_{A}^{(2)} \sum_{l = 1}^{+ \infty} x (d [{(l + 1)}^{η} {\tilde{N}}^{η} + 1] + | E (Z) |) T^{1 - κ} exp (- c_{4} l \tilde{N}) \\ \leq C_{A}^{(6)} \sum_{l = 1}^{+ \infty} x (d {\tilde{N}}^{η} + | E (Z) |) T^{1 - κ} 2^{l η} exp (- c_{4} l \tilde{N}) \\ \leq C_{A}^{(7)} T^{2 - 2 κ} d {\tilde{N}}^{η} \frac{exp (- c_{4} \tilde{N})}{1 - 2^{η} exp (- c_{4} \tilde{N})}, \end{array}

where in the last inequality, we have used the fact that $∣ E (Z) ∣ \leq d E [N_{[- A, 0)}^{η}]$ by (42). Plugging $\tilde{N} = C_{A}^{(5)} T^{1 - κ}$ into the above equation gives

∣ E (F_{q} 𝟙_{Ω_{q}}) ∣ \leq z_{1} \equiv C_{A}^{(8)} d T^{2 - 2 κ} T^{(1 - κ) η} exp (- c_{4} T^{1 - κ}) .

In the same way, following Hansen, Reynaud-Bouret and Rivoirard (2015), we can write

E (F_{q}^{j} 𝟙_{Ω_{q}}) \leq E (F_{q}^{2} 𝟙_{Ω_{q}}) z_{b}^{j - 2},

(52)

where $z_{b} \equiv x d [{\tilde{N}}^{η} + 1] + x E (Z) = C_{η, A}^{(9)} d T^{(1 - κ) (1 + η)}$ . Then, by stationarity,

\begin{array}{l} E (F_{q}^{2} 𝟙_{Ω_{q}}) \leq x E [\int_{2 q x}^{2 q x + x} {[Z \circ S_{t^{'}} (M_{q}^{x}) - E (Z)]}^{2} 𝟙_{\cap_{τ \in ℝ} {M_{q}^{x} ∣_{[τ - A, τ)} \leq \tilde{N}}} d t^{'}] \\ \leq x E [\int_{2 q x}^{2 q x + x} {[Z \circ S_{t^{'}} (M_{q}^{x}) - E (Z)]}^{2} 𝟙_{{M_{q}^{x} ∣_{[t^{'} - A, t^{'})} \leq \tilde{N}}} d t^{'}] \\ \leq x^{2} E ({[Z (N) - E (Z)]}^{2} 𝟙_{N_{[- A, 0)} \leq \tilde{N}}) \\ \leq z_{v} \equiv C_{η, A}^{(10)} T^{2 - 2 κ} σ^{2}, \end{array}

where σ² ≡ 𝔼([Z(N) − 𝔼(Z)]² 𝟙_{N_{[−A,0)} ≤ 𝒩̃}). Going back to (51), by (52), we have

\begin{array}{l} D \leq exp [- \frac{ε u}{2} + k log (1 + ε z_{1} + \sum_{j \geq 2} z_{v} z_{b}^{j - 2} \frac{ε^{j}}{j!})] \\ \leq exp [- ε (\frac{u}{2} - k z_{1}) + k \sum_{j \geq 2} z_{v} z_{b}^{j - 2} \frac{ε^{j}}{j!}], \end{array}

using the fact that log(1 + u) ≤ u. Since

k z_{1} = C_{η}^{(10)} d T^{κ} T^{(2 + η) (1 - κ)} exp (- c_{4} T^{1 - κ}),

one can choose c₆ in the definition (43) of u (not depending on d) such that $u / 2 - k z_{1} \geq \sqrt{2 k z_{v} z} + \frac{1}{3} z_{b} z$ for some z = c₄T^{κ−2η(1−κ)}. Hence,

D \leq exp [- ε (\sqrt{2 k z_{v} z} + \frac{1}{3} z_{b} z) + k \sum_{j \geq 2} z_{v} z_{b}^{j - 2} \frac{ε^{j}}{j!}] .

One can choose ε (as in the proof of the Bernstein inequality in Massart (2007), page 25) to obtain a bound on the right-hand side in the form of e^{−z}. We can then choose c₄ large enough, and only depending on η and A, to guarantee that D ≤ e^{−z} ≤ c₅ exp(−c₄T^{1−κ}).

In summary, we have shown that, given the event 𝒯,

ℙ (\int_{0}^{T} [Z \circ S_{t} (N) - E (Z)] d t \geq u) \leq c_{5} exp (- c_{4} T^{1 - κ}) + C_{A}^{(4)} T exp (- c_{4} T^{1 - κ}) .

With a slight abuse of notation, letting $c_{5} = max (c_{5}, C_{A}^{(4)})$ gives (49).

To complete the proof, we apply the concentration inequality (49) with some specific choices of Z(·).

For each pair (j, k), let

Z \circ S_{t} (N) \equiv \int_{t - h}^{t + h} K (\frac{t^{'} - t + Δ}{h}) d N_{j} (t^{'}) d N_{k} (t) / d t .

We can check that d = 1 and η = 2 satisfy (42). Then with κ = 5/6 in (49), we get, given the event 𝒯,

ℙ (∣ I_{j, k} - E I_{j, k} ∣ \geq c_{6} T^{- 1 / 3}) \leq c_{5} T exp (- c_{4} T^{1 / 6}) .

Applying a union bound for all pairs (j, k), we have, given the event 𝒯,

ℙ (\underset{1 \leq j \leq k \leq p}{\cap} [∣ I_{j, k} - E I_{j, k} ∣ \geq c_{6} T^{- 1 / 3}]) \leq c_{5} T p^{2} exp (- c_{4} T^{1 / 6}) .

(53)

Recall from the concentration inequality (48) that the event 𝒯 holds with probability at least 1 − pT^{1/6} exp(−c₄T^{1/6}). Thus, given that pT^{1/6} exp(−c₄T^{1/6}) is dominated by the right-hand side of (53), it holds unconditionally that

ℙ (\underset{1 \leq j \leq k \leq p}{\cap} [∣ I_{j, k} - E I_{j, k} ∣ \geq c_{6} T^{- 1 / 3}]) \leq c_{5} T p^{2} exp (- c_{4} T^{1 / 6}),

which is the statement on I_{j,k} in (34).

The statement on II_l, l = j, k, in (35) can be shown in a similar manner by taking Z ○ S_t(N) ≡ dN_j(t)/dt, with η = 1, and κ = 13/18.
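The exponents quoted above can be checked with exact rational arithmetic. Since I_{j,k} and II_j are time averages (each carries a factor 1/T), the relevant deviation is u/T with u as in (43). The short sketch below, whose function names are hypothetical, computes the exponent of T in u/T and in the tail bound exp(−c₄T^{1−κ}) for the two choices of (η, κ).

```python
from fractions import Fraction

def deviation_exponent(kappa, eta):
    """Exponent of T in u / T, where u = c6 * T^{(1 - kappa)(1 - eta) + kappa}."""
    return (1 - kappa) * (1 - eta) + kappa - 1

def tail_exponent(kappa):
    """Exponent of T inside the tail bound exp(-c4 * T^{1 - kappa})."""
    return 1 - kappa

# Choice used for I_{j,k}: eta = 2, kappa = 5/6.
print(deviation_exponent(Fraction(5, 6), 2), tail_exponent(Fraction(5, 6)))
# Choice used for II_j: eta = 1, kappa = 13/18.
print(deviation_exponent(Fraction(13, 18), 1), tail_exponent(Fraction(13, 18)))
```

For η = 2 and κ = 5/6 the deviation exponent is −1/3 and the tail exponent is 1/6, matching (34). For η = 1 and κ = 13/18 the deviation exponent is −5/18 = −1/3 + 1/18, matching (35), and the tail exponent 5/18 exceeds 1/6, so for T ≥ 1 the resulting bound is no larger than one with T^{1/6} in the exponential.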

Contributor Information

Shizhe Chen, Department of Statistics, Columbia University, New York, NY 10027.

Daniela Witten, Department of Biostatistics and Statistics, University of Washington, Seattle, WA 98195.

Ali Shojaie, Department of Biostatistics and Statistics, University of Washington, Seattle, WA 98195.

References

Ahrens MB, Orger MB, Robson DN, Li JM, Keller PJ. Whole-brain functional imaging at cellular resolution using light-sheet microscopy. Nature Methods. 2013;10:413–420. doi: 10.1038/nmeth.2434.
Aït-Sahalia Y, Cacho-Diaz J, Laeven RJA. Modeling financial contagion using mutually exciting jump processes. Journal of Financial Economics. 2015;117:585–606.
Bacry E, Gaïffas S, Muzy J-F. A generalization error bound for sparse and low-rank multivariate Hawkes processes. 2015 arXiv preprint arXiv:1501.00725.
Bacry E, Delattre S, Hoffmann M, Muzy JF. Some limit theorems for Hawkes processes and application to financial statistics. Stochastic Process Appl. 2013;123:2475–2499.
Berry T, Hamilton F, Peixoto N, Sauer T. Detecting connectivity changes in neuronal networks. Journal of Neuroscience Methods. 2012;209:388–397. doi: 10.1016/j.jneumeth.2012.06.021.
Bogachev VI. Measure Theory. I, II. Springer-Verlag; Berlin: 2007.
Bowsher CG. Modelling security market events in continuous time: Intensity based, multivariate point process models. Journal of Econometrics. 2007;141:876–912.
Brémaud P, Massoulié L. Stability of nonlinear Hawkes processes. Ann Probab. 1996;24:1563–1588.
Brillinger DR. Maximum likelihood analysis of spike trains of interacting nerve cells. Biological Cybernetics. 1988;59:189–200. doi: 10.1007/BF00318010.
Bühlmann P, van de Geer S. Statistics for High-Dimensional Data. Springer Series in Statistics. Springer; Heidelberg: 2011. Methods, theory and applications.
Cai T, Liu W, Luo X. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association. 2011;106:594–607.
Chavez-Demoulin V, Davison AC, McNeil AJ. Estimating value-at-risk: a point process approach. Quantitative Finance. 2005;5:227–234.
Daley D, Vere-Jones D. An Introduction to the Theory of Point Processes, volume I: Elementary Theory and Methods of Probability and its Applications. Springer; 2003.
Fan J, Feng Y, Song R. Nonparametric independence screening in sparse ultra-high-dimensional additive models. J Amer Statist Assoc. 2011;106:544–557. doi: 10.1198/jasa.2011.tm09779.
Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B Stat Methodol. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
Fan J, Ma Y, Dai W. Nonparametric independence screening in sparse ultra-high-dimensional varying coefficient models. J Amer Statist Assoc. 2014;109:1270–1284. doi: 10.1080/01621459.2013.879828.
Fan J, Samworth R, Wu Y. Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res. 2009;10:2013–2038.
Fan J, Song R. Sure independence screening in generalized linear models with NP-dimensionality. Ann Statist. 2010;38:3567–3604.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33:1.
Greenshtein E, Ritov Y. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli. 2004;10:971–988.
Hansen NR, Reynaud-Bouret P, Rivoirard V. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli. 2015;21:83–143.
Hawkes AG. Spectra of some self-exciting and mutually exciting point processes. Biometrika. 1971;58:83–90.
Hawkes AG, Oakes D. A cluster process representation of a self-exciting process. J Appl Probability. 1974;11:493–503.
Liniger TJ. PhD thesis, Diss. 2009. Multivariate Hawkes processes. Eidgenössische Technische Hochschule ETH Zürich, Nr. 18403.
Liu J, Li R, Wu R. Feature selection for varying coefficient models with ultrahigh-dimensional covariates. J Amer Statist Assoc. 2014;109:266–274. doi: 10.1080/01621459.2013.850086.
Luo S, Song R, Witten D. Sure Screening for Gaussian Graphical Models. 2014 arXiv preprint arXiv:1407.7819.
Massart P. Concentration inequalities and model selection. Lecture Notes in Mathematics; Lectures from the 33rd Summer School on Probability Theory; Saint-Flour. July 6–23, 2003; Berlin: Springer; 2007. 1896. With a foreword by Jean Picard.
Mishchenko Y, Vogelstein JT, Paninski L. A Bayesian approach for inferring neuronal connectivity from calcium fluorescent imaging data. Ann Appl Stat. 2011;5:1229–1261.
Mohler GO, Short MB, Brantingham PJ, Schoenberg FP, Tita GE. Self-exciting point process modeling of crime. J Amer Statist Assoc. 2011;106:100–108.
Ogata Y. Statistical Models for Earthquake Occurrences and Residual Analysis for Point Processes. Journal of the American Statistical Association. 1988;83:9–27.
Okatan M, Wilson MA, Brown EN. Analyzing Functional Connectivity Using a Network Likelihood Model of Ensemble Neural Spiking Activity. Neural Comput. 2005;17:1927–1961. doi: 10.1162/0899766054322973.
Paninski L, Pillow J, Lewi J. Statistical models for neural encoding, decoding, and optimal stimulus design. Progress in Brain Research. 2007;165:493–507. doi: 10.1016/S0079-6123(06)65031-0.
Perry PO, Wolfe PJ. Point process modelling for directed interaction networks. J R Stat Soc Ser B Stat Methodol. 2013;75:821–849.
Pillow JW, Shlens J, Paninski L, Sher A, Litke AM, Chichilnisky E, Simoncelli EP. Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature. 2008;454:995–999. doi: 10.1038/nature07140.
Ravikumar P, Wainwright MJ, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron J Stat. 2011;5:935–980.
Reynaud-Bouret P, Roy E. Some non asymptotic tail estimates for Hawkes processes. Bull Belg Math Soc Simon Stevin. 2006;13:883–896.
Reynaud-Bouret P, Schbath S. Adaptive estimation for Hawkes processes; application to genome analysis. Ann Statist. 2010;38:2781–2822.
Saegusa T, Shojaie A. Joint estimation of precision matrices in heterogeneous populations. Electronic Journal of Statistics. 2016;10:1341–1392. doi: 10.1214/16-EJS1137.
Simma A, Jordan MI. Modeling events with cascades of Poisson processes. 2012 arXiv preprint arXiv:1203.3516.
Simon N, Tibshirani RJ. Standardization and the group lasso penalty. Statist Sinica. 2012;22:983–1001. doi: 10.5705/ss.2011.075.
Song R, Lu W, Ma S, Jeng XJ. Censored rank independence screening for high-dimensional survival data. Biometrika. 2014;101:799–814. doi: 10.1093/biomet/asu047.
Tsybakov AB. In: Introduction to nonparametric estimation. Zaiats Vladimir., translator. Springer; New York: 2009. Springer Series in Statistics. Revised and extended from the 2004 French original. (2011g:62006)
Wainwright MJ. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso) Information Theory, IEEE Transactions on. 2009;55:2183–2202.
Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B Stat Methodol. 2006;68:49–67.
Zhou K, Zha H, Song L. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics; 2013a. pp. 641–649.
Zhou K, Zha H, Song L. Learning triggering kernels for multi-dimensional Hawkes processes. Proceedings of the 30th International Conference on Machine Learning (ICML-13); 2013b. pp. 1301–1309.
Zhu L. Nonlinear Hawkes processes. 2013 arXiv preprint arXiv:1304.7531.
