Author manuscript; available in PMC: 2017 Aug 24.
Published in final edited form as: Electron J Stat. 2017 Apr 11;11(1):1207–1234. doi: 10.1214/17-EJS1251

Nearly assumptionless screening for the mutually-exciting multivariate Hawkes process

Shizhe Chen 1, Daniela Witten 2,*, Ali Shojaie 3,
PMCID: PMC5570442  NIHMSID: NIHMS892541  PMID: 28845209

Abstract

We consider the task of learning the structure of the graph underlying a mutually-exciting multivariate Hawkes process in the high-dimensional setting. We propose a simple and computationally inexpensive edge screening approach. Under a subset of the assumptions required for penalized estimation approaches to recover the graph, this edge screening approach has the sure screening property: with high probability, the screened edge set is a superset of the true edge set. Furthermore, the screened edge set is relatively small. We illustrate the performance of this new edge screening approach in simulation studies.

Keywords and phrases: Hawkes process, screening, high-dimensionality

MSC 2010 subject classifications: Primary 60G55, secondary 62M10, 62H12

1. Introduction

1.1. Overview of the multivariate Hawkes process

In a seminal paper, Hawkes (1971) proposed the multivariate Hawkes process, a multivariate point process model in which a past event may trigger the occurrence of future events. The Hawkes process and its variants have been widely applied to model recurrent events, with notable applications in modeling earthquakes (Ogata, 1988), crime rates (Mohler et al., 2011), interactions in social networks (Simma and Jordan, 2012; Perry and Wolfe, 2013; Zhou, Zha and Song, 2013a,b), financial events (Chavez-Demoulin, Davison and McNeil, 2005; Bowsher, 2007; Aït-Sahalia, Cacho-Diaz and Laeven, 2015), and spiking histories of neurons (see e.g., Brillinger, 1988; Okatan, Wilson and Brown, 2005; Paninski, Pillow and Lewi, 2007; Pillow et al., 2008).

In this section, we provide a very brief review of the multivariate Hawkes process. A more comprehensive discussion can be found in Liniger (2009) and Zhu (2013).

Following Brémaud and Massoulié (1996), we define a simple point process N on ℝ+ as a family {N(A)}_{A∈ℬ(ℝ+)} taking integer values (including positive infinity), where ℬ(ℝ+) denotes the Borel σ-algebra of the positive half of the real line. Further let t_1, t_2, … ∈ ℝ+ be the event times of N. In this notation, N(A) = Σ_i 𝟙[t_i ∈ A] for A ∈ ℬ(ℝ+). We write N([t, t + dt)) as dN(t), where dt denotes an arbitrarily small increment of t. Let ℋ_t be the history of N up to time t. Then the ℋ_t-predictable intensity process of N is defined as

λ(t)dt = ℙ(dN(t) = 1 | ℋ_t). (1)

Now suppose that N is a marked point process, in which each event time t_i is associated with a mark m_i ∈ {1, …, p} (see e.g., Definition 6.4.I. in Daley and Vere-Jones, 2003). We can then view N as a multivariate point process (N_j)_{j=1,…,p}, of which the jth component process is given by N_j(A) = Σ_i 𝟙[t_i ∈ A, m_i = j] for A ∈ ℬ(ℝ+). To simplify the notation, we let t_{j,1}, t_{j,2}, … ∈ ℝ+ denote the event times of N_j.

The intensity of the jth component process is

λ_j(t)dt = ℙ(dN_j(t) = 1 | ℋ_t).

In the case of the linear Hawkes process, this function takes the form (Brémaud and Massoulié, 1996; Hansen, Reynaud-Bouret and Rivoirard, 2015)

λ_j(t) = μ_j + Σ_{k=1}^p (Σ_{i: t_{k,i} ≤ t} ω_{j,k}(t − t_{k,i})). (2)

We refer to μj ∈ ℝ as the background intensity, and ωj,k(·): ℝ+ ↦ ℝ as the transfer function.
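To make the model (2) concrete, a multivariate Hawkes process can be simulated by Ogata's thinning algorithm. The sketch below is our own illustration, not code from the paper; it assumes non-increasing transfer functions (e.g., exponential decay), so that the intensity immediately after the most recent event bounds the intensity until the next one, and it is written for clarity rather than speed.

```python
import math
import random

def simulate_hawkes(mu, omega, T, rng):
    """Simulate a multivariate linear Hawkes process on [0, T] by
    Ogata's thinning algorithm.  mu[j] is the background intensity of
    component j, and omega[j][k] is the transfer function giving the
    excitation of component j by past events of component k.  The
    thinning bound assumes each omega[j][k] is non-increasing."""
    p = len(mu)
    events = [[] for _ in range(p)]   # event times, per component
    history = []                      # all events as (time, mark) pairs
    t = 0.0

    def intensity(j, s):
        # lambda_j(s) as in (2): background plus excitation from the past.
        return mu[j] + sum(omega[j][k](s - ti) for ti, k in history if ti < s)

    while True:
        # Upper bound on the total intensity until the next event.
        lam_bar = sum(intensity(j, t + 1e-12) for j in range(p))
        t += rng.expovariate(lam_bar)           # candidate event time
        if t >= T:
            break
        lams = [intensity(j, t) for j in range(p)]
        if rng.random() * lam_bar < sum(lams):  # accept the candidate
            # Assign the mark with probability proportional to intensity.
            u, acc = rng.random() * sum(lams), 0.0
            for j in range(p):
                acc += lams[j]
                if u < acc:
                    events[j].append(t)
                    history.append((t, j))
                    break
    return events
```

With exponential transfer functions ω_{j,k}(Δ) = a_{j,k} e^{−Δ}, the matrix Ω of Assumption 1 below has entries a_{j,k}, so stationarity is easy to verify before simulating.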

For p fixed, Brémaud and Massoulié (1996) established that the linear Hawkes process with intensity function (2) is stationary given the following assumption.

Assumption 1

Let Ω be the p × p matrix with entries Ω_{j,k} = ∫_0^∞ ω_{j,k}(Δ)dΔ, for j, k = 1, …, p. We assume that the spectral norm of Ω is bounded strictly below 1, i.e., Γ_max(Ω) ≤ γ_Ω < 1, where γ_Ω is a generic constant.
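Assumption 1 is easy to check numerically for a given Ω. Below is a small power-iteration sketch (our own illustration, not from the paper) that approximates the spectral norm Γ_max(Ω) using only the standard library:

```python
import math

def spectral_norm(M, iters=500):
    """Approximate the largest singular value of a square matrix M
    by power iteration on M^T M (stdlib only)."""
    n = len(M)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]  # w = M v
        u = [sum(M[i][j] * w[i] for i in range(n)) for j in range(n)]  # u = M^T w
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            return 0.0  # M is the zero matrix
        v = [x / norm for x in u]
    w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(x * x for x in w))

# A 2 x 2 example: the process is stationary since the norm is below 1.
Omega = [[0.3, 0.2], [0.0, 0.4]]
assert spectral_norm(Omega) < 1.0
```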

We now define a directed graph with node set {1, …, p} and edge set

ℰ ≡ {(j, k): ω_{j,k} ≢ 0, 1 ≤ j, k ≤ p}, (3)

for ωj,k given in (2). Let

s ≡ max_{1≤j≤p} Σ_{k=1}^p 𝟙{(j, k) ∈ ℰ} (4)

denote the maximum in-degree of the nodes in the graph. In this paper, we propose a simple screening procedure that can be used to obtain a small superset of the edge set ℰ.

1.2. Estimation and theory for the Hawkes process

We first consider the low-dimensional setting, in which the dimension of the process, p, is fixed, and T, the time period during which the point process is observed, is allowed to grow. In this setting, asymptotic properties such as the central limit theorem have been established; for instance, see Bacry et al. (2013) and Zhu (2013). Consequently, estimating the edge set ℰ is straightforward in low dimensions.

In high dimensions, when p might be large, we can fit the Hawkes process model using a penalized estimator of the form

minimize_{ω_{j,k}∈ℱ, 1≤j,k≤p} L(ω_{j,k}; {N_j}_{j=1}^p) + λ Σ_{j,k} P(ω_{j,k}; {N_j}_{j=1}^p), (5)

where L(·; {N_j}_{j=1}^p) is a loss function, based on, e.g., the log-likelihood (Bacry, Gaïffas and Muzy, 2015) or least squares (Hansen, Reynaud-Bouret and Rivoirard, 2015); P(·; {N_j}_{j=1}^p) is a penalty function, such as the lasso (Hansen, Reynaud-Bouret and Rivoirard, 2015); λ is a nonnegative tuning parameter; and ℱ is a suitable function class. Then, a natural estimator for ℰ is {(j, k): ω̂_{j,k} ≢ 0}.

Recently, Reynaud-Bouret and Schbath (2010), Bacry, Gaïffas and Muzy (2015), and Hansen, Reynaud-Bouret and Rivoirard (2015) have established that under certain assumptions, penalized estimation approaches of the form (5) are consistent in high dimensions, provided that the edge set ℰ is sparse. For instance, Hansen, Reynaud-Bouret and Rivoirard (2015) establish an oracle inequality for the lasso estimator of the Hawkes process, given that certain conditions hold on the observed event times. However, to show that these conditions hold with high probability for arbitrary samples, these theoretical results require that the point process is mutually-exciting — that is, an event in one component process can increase, but cannot decrease, the probability of an event in another component process. This amounts to assuming that ω_{j,k}(Δ) ≥ 0 for all Δ ≥ 0, for ω_{j,k} defined in (2).

When the dimension p is large, penalized estimation procedures of the form (5) (Bacry, Gaïffas and Muzy, 2015; Hansen, Reynaud-Bouret and Rivoirard, 2015) become computationally expensive: they require 𝒪(Tp2) operations per iteration in an iterative algorithm. This is problematic in contemporary applications, in which p can be on the order of tens of thousands (Ahrens et al., 2013). These concerns motivate us to propose a simple and computationally-efficient edge screening procedure for estimating the true edge set ℰ in high dimensions. Under very few assumptions, our proposed screening procedure is guaranteed to select a small superset of the true edge set ℰ.

1.3. Organization of paper

The rest of this paper proceeds as follows. In Section 2, we introduce our screening procedure for estimating the edge set ℰ, and establish its theoretical properties. We present simulation results in support of our proposed procedure in Section 3. Proofs of theoretical results are presented in Section 4, and the Discussion is in Section 5.

2. An edge screening procedure

2.1. Approach

For j = 1, …, p, let Λj denote the mean intensity of the jth point process introduced in Section 1. That is,

Λ_j ≡ 𝔼[dN_j(t)]/dt. (6)

Following Equation 5 of Hawkes (1971), for any Δ ∈ ℝ, the (infinitesimal) cross-covariance of the jth and kth processes is defined as

V_{j,k}(Δ) ≡ 𝔼[dN_j(t) dN_k(t − Δ)]/{dt d(t − Δ)} − Λ_j Λ_k if j ≠ k, and
V_{j,k}(Δ) ≡ 𝔼[dN_k(t) dN_k(t − Δ)]/{dt d(t − Δ)} − Λ_k² − Λ_k δ(Δ) if j = k, (7)

where δ(·) is the Dirac delta function, which satisfies ∫_{−∞}^{∞} δ(x)dx = 1 and δ(x) = 0 for x ≠ 0.

For a given value of Δ, we can estimate the cross-covariance function Vj,k(Δ) using kernel smoothing:

V̂_{j,k}(Δ) ≡ (1/(Th)) ∫_{[0,T]²} K(((t − t′) + Δ)/h) dN_j(t) dN_k(t′) − (1/T²) N_j([0, T]) N_k([0, T]) if j ≠ k, and
V̂_{j,k}(Δ) ≡ (1/(Th)) ∫_{[0,T]²\{t=t′}} K(((t − t′) + Δ)/h) dN_k(t) dN_k(t′) − (1/T²) N_k²([0, T]) if j = k, (8)

where K(·) is a kernel function with bandwidth h, and ∫_0^T f(t)dN_j(t) is the Stieltjes integral, defined as

∫_0^T f(t)dN_j(t) ≡ Σ_{i: t_{j,i}∈[0,T]} f(t_{j,i}).

In this paper, we focus on kernel functions that are bounded by 1 and are defined on a bounded support, i.e., 0 ≤ K(x/h) ≤ 1 for x ∈ [−h, h], and K(x/h) = 0 for x ∉ [−h, h] (e.g., the Epanechnikov kernel).
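As an illustration (not code from the paper), the j ≠ k case of (8) reduces to a double sum over event-time pairs, since the Stieltjes integrals are sums over observed events. The sketch below uses an Epanechnikov-shaped kernel rescaled to be bounded by 1, matching the convention above:

```python
def kernel(u):
    """Epanechnikov-shaped kernel, rescaled so that 0 <= K <= 1 on
    [-1, 1] and K = 0 outside, as required in the text."""
    return max(0.0, 1.0 - u * u)

def cross_cov_hat(times_j, times_k, delta, T, h):
    """Kernel estimate of V_{j,k}(delta) for j != k, following (8);
    each Stieltjes integral becomes a sum over observed event times."""
    smooth = sum(kernel((t - tp + delta) / h)
                 for t in times_j for tp in times_k)
    return smooth / (T * h) - len(times_j) * len(times_k) / T ** 2
```

Evaluating this on a grid of Δ values for every ordered pair (j, k) is the 𝒪(Tp²) computation discussed below.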

Let B denote a tuning parameter that defines the time range of interest for Vj,k(Δ), i.e. Δ ∈ [−B, B]. For any ζ, we define the set of screened edges as

ℰ̂(ζ) ≡ {(j, k): ||V̂_{j,k}||_{2,[−B,B]} > ζ}, (9)

where ||f||_{2,[l,u]} ≡ {∫_l^u f²(t)dt}^{1/2} is the ℓ2-norm of a function f on the interval [l, u].

The screened edge set ℰ̂(ζ) in (9) can be calculated quickly: ||V̂_{j,k}||_{2,[−B,B]} can be calculated in 𝒪(T) computations, and so ℰ̂(ζ) can be calculated in 𝒪(Tp²) computations. The procedure can also be easily parallelized.
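A sketch of the screening step itself (our own illustration): given any estimate V̂_{j,k} as a function of Δ, approximate its ℓ2 norm on [−B, B] by a midpoint rule, and keep the pairs whose norm exceeds ζ.

```python
import math

def l2_norm(f, lo, hi, n=2000):
    """Midpoint-rule approximation of {integral of f^2 on [lo, hi]}^(1/2)."""
    step = (hi - lo) / n
    return math.sqrt(sum(f(lo + (i + 0.5) * step) ** 2 for i in range(n)) * step)

def screen_edges(vhat, p, B, zeta):
    """Screened edge set (9): vhat(j, k) returns the estimated
    cross-covariance function of the pair (j, k)."""
    return {(j, k) for j in range(p) for k in range(p)
            if l2_norm(vhat(j, k), -B, B) > zeta}
```

Since each pair is screened independently, the double loop over (j, k) is trivially parallelizable.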

There are three tuning parameters in the procedure: the bandwidth h in (8), the range B in (9), and the screening threshold ζ in (9). The bandwidth h can be chosen by cross-validation. The range B can be selected based on the problem setting; for instance, when using the multivariate Hawkes process to model a spike train data set in neuroscience, we can set B to equal the maximum time gap between a spike and any spike it can possibly evoke. The screening threshold ζ can be guided by the sparsity level expected from prior knowledge. Alternatively, we may use a small value of ζ in order to reduce the chance of false negative edges in ℰ̂(ζ), or a larger value when computational resources for the downstream analysis are limited.

2.2. Theoretical results

We consider the asymptotics of triangular arrays (Greenshtein and Ritov, 2004), in which the dimension p is allowed to grow with T. When unrestricted, it is possible to construct extreme networks in which, for instance, the mean intensity Λ_j in (6) diverges to infinity. To rule out such cases, we impose the following regularity assumption.

Assumption 2

There exist positive constants Λ_min, Λ_max, and V_max such that 0 < Λ_min ≤ Λ_j ≤ Λ_max and max_{Δ∈ℝ} |V_{j,k}(Δ)| ≤ V_max for all 1 ≤ j, k ≤ p, where Λ_j and V_{j,k} are defined in (6) and (7), respectively. Furthermore, Λ_min, Λ_max, and V_max are generic constants that do not depend on p.

Next, we make some standard assumptions on the transfer functions ωj,k in (2).

Assumption 3

The following hold:

  (a) The transfer functions are non-negative: ω_{j,k}(Δ) ≥ 0 for all Δ ≥ 0.

  (b) There exists a positive constant β_min such that
    min_{(j,k): ω_{j,k} ≢ 0} (∫_0^∞ ω_{j,k}²(Δ)dΔ) ≥ β_min².
  (c) There exist positive constants b, θ_0, and C such that, for all 1 ≤ j, k ≤ p and any Δ_1, Δ_2 ∈ ℝ, supp(ω_{j,k}) ⊂ (0, b], max_Δ |ω_{j,k}(Δ)| ≤ C, and |ω_{j,k}(Δ_1) − ω_{j,k}(Δ_2)| ≤ θ_0|Δ_1 − Δ_2|.

Assumption 3(a) guarantees that the multivariate Hawkes process is mutually-exciting: that is, an event may trigger (but cannot inhibit) future events. This assumption is shared by the original proposal of Hawkes (1971). Furthermore, existing theory for penalized estimators for the Hawkes process requires this assumption (Bacry, Gaïffas and Muzy, 2015; Hansen, Reynaud-Bouret and Rivoirard, 2015).

Assumption 3(b) guarantees that the non-zero transfer functions are nonnegligible. Such an assumption is needed in order to establish variable selection consistency (Bühlmann and van de Geer, 2011; Wainwright, 2009) for the penalized estimator (5).

Assumption 3(c) guarantees that the transfer functions are sufficiently smooth; this guarantees that the cross-covariances are smooth (see Section A.2 in Appendix), and hence can be estimated using a kernel smoother (8). Instead of Assumption 3(c), we could assume that ωj,k is an exponential function (Bacry, Gaïffas and Muzy, 2015) or that it is well-approximated by a set of smooth basis functions (Hansen, Reynaud-Bouret and Rivoirard, 2015).

Recall that s was defined in (4). We now state our main result.

Theorem 1

Suppose that the Hawkes process (2) satisfies Assumptions 1–3. Let h = c_1 s^{−1/2} T^{−1/6} in (8) and ζ = 2c_2 s^{1/2} T^{−1/6} in (9) for some constants c_1 and c_2. Then, for some positive constants c_3 and c_4, with probability at least 1 − c_3 T^{7/6} s^{1/2} p² exp(−c_4 T^{1/6}),

  (a) ℰ ⊂ ℰ̂(ζ);

  (b) card(ℰ̂(ζ)) = 𝒪(card(ℰ) s^{−1} T^{1/3} γ_Ω (1 − γ_Ω)^{−2} Λ_max²).

Theorem 1(a) guarantees that, with high probability, the screened edge set ℰ̂(ζ) contains the true edge set ℰ. Therefore, screening does not result in false negatives. This is referred to as the sure screening property in the literature (Fan and Lv, 2008; Fan, Samworth and Wu, 2009; Fan and Song, 2010; Fan, Feng and Song, 2011; Fan, Ma and Dai, 2014; Liu, Li and Wu, 2014; Song et al., 2014; Luo, Song and Witten, 2014). Typically, establishing the sure screening property requires assuming that the marginal association between a pair of nodes in ℰ is sufficiently large; see e.g. Condition 3 in Fan and Lv (2008) and Condition C in Fan, Feng and Song (2011). In contrast, Theorem 1(a) requires only that the conditional association between a pair of nodes in ℰ is sufficiently large; see Assumption 3(b).

Theorem 1(b) guarantees that ℰ̂(ζ) is a relatively small set, of size 𝒪(card(ℰ) s^{−1} T^{1/3}). Suppose that p² ≤ s^{−1/2} exp(c_4 T^{1/6−ε}) for some positive constant ε < 1/6; this is the high-dimensional regime, in which the probability statement in Theorem 1 converges to one. Then the size of ℰ̂(ζ), 𝒪(card(ℰ) s^{−1} T^{1/3}), can be much smaller than p², the total number of node pairs. We note that the rate of T^{1/3} is comparable to existing results for non-parametric screening in the literature (see e.g., Fan, Feng and Song 2011; Fan, Ma and Dai 2014).

To summarize, Theorem 1 guarantees that under a small subset of the assumptions required for penalized estimation methods to recover the edge set ℰ, the screened edge set ℰ̂(ζ) (9) is small and contains no false negatives. We note that this is not the case for other types of models. For instance, in the case of the Gaussian graphical model, Luo, Song and Witten (2014) considered estimating the conditional dependence graph by screening the marginal covariances. In order for this procedure to have the sure screening property, one must make an assumption on the minimum marginal covariance associated with an edge in the graph, which is not required for variable selection consistency of penalized estimators (Cai, Liu and Luo, 2011; Luo, Song and Witten, 2014; Ravikumar et al., 2011; Saegusa and Shojaie, 2016).

It is important to note that Theorem 1 considers an oracle procedure, in which the tuning parameters depend on unknown parameters. The heuristic selection guidelines suggested at the end of Section 2.1 may not satisfy the requirements of Theorem 1; we leave the development of optimal tuning parameter selection criteria for future research. Also, note that the bandwidth h ∝ T^{−1/6} is wider than the typical bandwidth for kernel smoothing, which is of order T^{−1/3} (Tsybakov, 2009). This is because we aim to minimize a concentration bound on V̂_{j,k} − V_{j,k} (see the proof of Lemma 3 in the Appendix), rather than the usual mean integrated squared error as in, e.g., Theorem 1.1 in Tsybakov (2009).

Remark 1

In light of Theorem 1, consider applying a constraint induced by ℰ̂(ζ) to (5):

minimize_{ω_{j,k}∈ℱ, 1≤j,k≤p} L(ω_{j,k}; {N_j}_{j=1}^p) + λ Σ_{j,k} P(ω_{j,k}; {N_j}_{j=1}^p) subject to ω_{j,k} = 0 for (j, k) ∉ ℰ̂(ζ). (10)

Theorem 1 can be combined with existing results on consistency of penalized estimators of the Hawkes process (Bacry, Gaïffas and Muzy, 2015; Hansen, Reynaud-Bouret and Rivoirard, 2015) in order to establish that (10) results in consistent estimation of the transfer functions ω_{j,k}. As a concrete example, Hansen, Reynaud-Bouret and Rivoirard (2015) considered (10) with L(ω_{j,k}; {N_j}_{j=1}^p) taken to be the least-squares loss, and P(ω_{j,k}; {N_j}_{j=1}^p) a lasso-type penalty. Our simulation experiments in Section 3 indicate that in this setting, (10) can actually have better small-sample performance than (5) when p is very large. Furthermore, solving (10) can be much faster than solving (5): the former requires 𝒪(T^{4/3} s^{−1} card(ℰ)) computations per iteration, compared to 𝒪(Tp²) per iteration for the latter (using e.g. coordinate descent, Friedman, Hastie and Tibshirani, 2010). In the high-dimensional regime where p² ≤ s^{−1/2} exp(c_4 T^{1/6−ε}) for some positive constant ε < 1/6, we have that T^{4/3} s^{−1} card(ℰ) ≪ Tp². We note that in order to solve (10), we must first compute ℰ̂(ζ), which requires an additional one-time computational cost of 𝒪(Tp²).

3. Simulation

3.1. Simulation set-up

In this section, we investigate the performance of our screening procedure in a simulation study with p = 100 point processes. Intensity functions are given by (2), with μj = 0.75 for j = 1, …, p, and ωj,k(t) = 2t exp(1 − 5t) for (j, k) ∈ ℰ. By definition, ωj,k = 0 for all (j, k) ∉ ℰ. We consider two settings for the edge set ℰ, Setting A and Setting B. These settings are displayed in Figure 1.
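Under this design, each entry of the matrix Ω in Assumption 1 corresponding to an edge equals ∫_0^∞ 2t e^{1−5t} dt = 2e/25 ≈ 0.217, so the stationarity condition is comfortably satisfied. A quick numeric sanity check of this integral (our own illustration, not from the paper):

```python
import math

def transfer(t):
    """Transfer function used in the simulation: 2 t exp(1 - 5 t)."""
    return 2.0 * t * math.exp(1.0 - 5.0 * t)

def omega_entry(upper=20.0, n=200000):
    """Midpoint-rule approximation of the integral of the transfer
    function over [0, infinity); the closed form is 2e/25 ~ 0.2175.
    The tail beyond `upper` is negligible (order exp(-100))."""
    step = upper / n
    return sum(transfer((i + 0.5) * step) for i in range(n)) * step
```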

Fig 1.

Fig 1

Left: In Setting A, the edge set ℰ is composed of 5 connected components, each of which is a chain graph containing 20 nodes. Right: In Setting B, ℰ is composed of 10 connected components, each of which contains 10 nodes.

In what follows, it will be useful to think about the (undirected) node pairs as belonging to three types. (i) We let

ℰ̃ ≡ {(j, k): (j, k) ∈ ℰ or (k, j) ∈ ℰ}. (11)

(ii) With a slight abuse of notation, we will use ℰ̃^c ∩ supp(V) to denote node pairs that are not in ℰ̃ but have non-zero population cross-covariance, defined in (7). (iii) Continuing to slightly abuse notation, we will use ℰ̃^c\supp(V) to denote node pairs that are not in ℰ̃ and that have zero population cross-covariance.

Throughout the simulation, we set the bandwidth h in (8) to equal T−1/6, and the range of interest B in (9) to equal 5. Thus, h satisfies the requirements of Theorem 1, and [−B, B] covers the majority of the mass of the transfer function ωj,k. However, these simulation results are not sensitive to the particular choices of h or B.

3.2. Investigation of the estimated cross-covariances

In Setting A, within a single connected component, all of the node pairs that are not in ℰ̃ are in ℰ̃c ∩ supp(V). However, for the most part, the population cross-covariances corresponding to node pairs in ℰ̃c ∩ supp(V) are quite small, because they are induced by paths of length two and greater. This can be seen from the left-hand panel of Figure 2. Given the left-hand panel of Figure 2, we expect the proposed screening procedure to work very well in Setting A: for a sufficiently large value of the time period T, there exists a value of ζ such that, with high probability, ℰ̂(ζ) = ℰ̃.

Fig. 2.

Fig. 2

The quantiles of ||V̂_{j,k}||_{2,[−5,5]} are displayed, for node pairs in ℰ̃ (11), ℰ̃^c ∩ supp(V), and ℰ̃^c\supp(V), as a function of the time period T. Left: Results for Setting A. The estimated cross-covariances of node pairs in ℰ̃^c\supp(V) and ℰ̃^c ∩ supp(V) overlap. Center: Results for Setting B. The estimated cross-covariances of node pairs in ℰ̃ and ℰ̃^c ∩ supp(V) overlap. Right: The color legend is displayed.

In Setting B, six nodes receive directed edges from the same set of four nodes. Therefore, we expect the pairs among these six nodes to be in the set ℰ̃c ∩ supp(V), and to have substantial population cross-covariances. This intuition is supported by the center panel of Figure 2, which indicates that the node pairs in ℰ̃c ∩ supp(V) have relatively large estimated cross-covariances, on the same order as the node pairs in ℰ̃. In light of Figure 2, we anticipate that for a sufficiently large value of the time period T, the screened edge set ℰ̂(ζ) will contain the edges in ℰ̃ as well as many of the edges in ℰ̃c ∩ supp(V).

3.3. Size of smallest screened edge set

We now define ζ* ≡ max {ζ : ℰ ⊆ ℰ̂(ζ)}, and calculate card(ℰ̂(ζ*)). This represents the size of the smallest screened edge set that contains the true edge set.
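Given estimated norms for every pair, and knowledge of the true edge set (available in a simulation), ζ* and card(ℰ̂(ζ*)) can be computed directly. A minimal sketch, with a hypothetical `norms` input mapping pairs to their estimated ℓ2 norms:

```python
def smallest_sure_screen(norms, true_edges):
    """norms maps (j, k) to ||V-hat_{j,k}||_{2,[-B,B]}.  The largest
    threshold that keeps every true edge sits just below the weakest
    true edge's norm, so every pair whose norm is at least that value
    survives the screening."""
    cutoff = min(norms[e] for e in true_edges)
    kept = {pair for pair, v in norms.items() if v >= cutoff}
    return cutoff, kept
```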

Results, averaged over 200 simulated data sets, are shown in Figure 3.

Fig. 3.

Fig. 3

For each of 200 simulated data sets, we calculated card(ℰ̂(ζ*)), where ζ* ≡ max {ζ : ℰ ⊆ ℰ̂(ζ)}, as a function of the time period T. The curves represent the mean of card(ℰ̂(ζ*)); the 2.5% and 97.5% quantiles of card(ℰ̂(ζ*)); card(ℰ̃); and card(supp(V)). Left: Data generated under Setting A. Right: Data generated under Setting B.

We see that in Setting A, for sufficiently large T, card(ℰ̂(ζ*)) = card(ℰ̃), which implies that ℰ̂(ζ*) = ℰ̃. In other words, in Setting A, the screening procedure yields perfect recovery of the set ℰ̃ (11). This is in line with our intuition based on the left-hand panel of Figure 2.

In contrast, in Setting B, even when T is very large, card(ℰ̂(ζ*)) > card(ℰ̃), which implies that ℰ̂(ζ*) ⊋ ℰ̃. This was expected based on the center panel of Figure 2.

3.4. Performance of constrained penalized estimation

We now consider the performance of the estimator (10), which we obtain by calculating the screened edge set ℰ̂(ζ), and then performing a penalized regression subject to the constraint that ω_{j,k} = 0 for (j, k) ∉ ℰ̂(ζ). Note that rather than assuming a specific functional form for ω_{j,k}, Hansen, Reynaud-Bouret and Rivoirard (2015) use a basis expansion to estimate ω_{j,k}. Following their lead, we use a basis of step functions, of the form 𝟙((m−1)/2, m/2](t) for m = 1, …, 6. Instead of applying a lasso penalty to the basis function coefficients (Hansen, Reynaud-Bouret and Rivoirard, 2015), we employ a group lasso penalty for every 1 ≤ j, k ≤ p (Yuan and Lin, 2006; Simon and Tibshirani, 2012). Thus, (10) consists of a squared error loss function and a group lasso penalty. We let

ℰ̂^P ≡ {(j, k): ∃ Δ s.t. ω̂_{j,k}(Δ) ≠ 0}, (12)

where ω̂j,k solves (10).
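For concreteness, the step-function basis and the resulting estimate ω̂_{j,k} can be written as follows. This is an illustrative sketch only; in the actual procedure the coefficients come from solving (10).

```python
def step_basis(t, m):
    """m-th step basis function 1((m-1)/2, m/2](t), for m = 1, ..., 6."""
    return 1.0 if (m - 1) / 2.0 < t <= m / 2.0 else 0.0

def omega_hat(t, coefs):
    """Estimated transfer function as a step-function expansion;
    coefs[m-1] multiplies the m-th basis function."""
    return sum(c * step_basis(t, m) for m, c in enumerate(coefs, start=1))
```

The six basis functions tile (0, 3], so the fitted transfer function is supported on a bounded interval, consistent with Assumption 3(c).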

Results are shown in Figure 4. In Setting A, solving the constrained optimization problem (10) leads to substantially better performance than solving the unconstrained problem (5). The improvement is especially noticeable when T is small. In Setting B, solving the constrained optimization problem (10) leads to only a slight improvement in performance relative to solving the unconstrained problem (5), since, as we have learned from Figures 2 and 3, the screened set ℰ̂(ζ) contains many edges in ℰ̃c ∩ supp(V). In both settings, solving the constrained optimization problem leads to substantial computational improvements.

Fig. 4.

Fig. 4

The constrained penalized optimization problem (10) was solved for a range of values of the tuning parameter λ. The x-axis displays the size of the estimated edge set ℰ̂^P (12), and the y-axis displays the number of true positives, averaged over 200 simulated data sets. The curves represent performance when ζ is chosen to yield card(ℰ̂(ζ)) = 4card(ℰ̃) (T = 300 and T = 600), and when ζ is chosen to yield card(ℰ̂(ζ)) = 8card(ℰ̃) (T = 300 and T = 600). We also display the performance of the unconstrained penalized optimization problem (5) (T = 300 and T = 600).

4. Proofs of theoretical results

In this section, we prove Theorem 1. In Section 4.1, we review an important property of the Hawkes process, the Wiener-Hopf integral equation. In Section 4.2, we list three technical lemmas used in the proof of Theorem 1. Theorem 1 is proved in Section 4.3. Proofs of the technical lemmas are provided in the Appendix.

4.1. The Wiener-Hopf integral equation

Recall that the transfer functions ω = {ωj,k}1≤j,kp were defined in (2), the cross-covariances V = {Vj,k}1≤j,kp were defined in (7), and the mean intensities Λ = (Λ1, …, Λp)T were defined in (6). If the Hawkes process defined in (2) is stationary, then for any Δ ∈ ℝ+,

V(Δ) = ω(Δ) diag(Λ) + (ω * V)(Δ), (13)

where

[ω * V]_{j,k}(Δ) ≡ Σ_{l=1}^p [ω_{j,l} * V_{l,k}](Δ)

and

[ω_{j,l} * V_{l,k}](Δ) ≡ ∫_0^∞ ω_{j,l}(Δ′) V_{l,k}(Δ − Δ′) dΔ′.

Equation (13) belongs to a class of integral equations known as the Wiener-Hopf integral equations.

4.2. Technical lemmas

We state three lemmas used to prove Theorem 1, and provide their proofs in the Appendix. The following lemma is a direct consequence of (13) and our assumptions. Recall that [0, b] is a superset of supp(ωj,k) introduced in Assumption 3.

Lemma 1

Under Assumptions 1–3, for B sufficiently large that B ≥ b, we have that ||V_{j,k}||_{2,[−B,B]} ≥ β_min Λ_min for (j, k) ∈ ℰ.

The next lemma shows that the cross-covariance is Lipschitz continuous given the smoothness assumption on ωj,k (Assumption 3(c)). We will use this lemma in the proof of Theorem 1, in order to bound the bias of the kernel smoothing estimator (8). Recall that s, the maximum node in-degree, was defined in (4).

Lemma 2

Under Assumptions 1–3, the cross-covariance function is Lipschitz continuous for 1 ≤ j, k ≤ p. More specifically, there exists some θ_1 > 0 such that |V_{j,k}(x) − V_{j,k}(y)| ≤ θ_1 s |x − y| for any x, y ∈ ℝ.

Recall that the bandwidth h was defined in (8). The following concentration inequality holds on the estimated cross-covariance.

Lemma 3

Suppose that Assumptions 1–3 hold, and let h = c1s−1/2T−1/6 for some constant c1. Then

ℙ(for all 1 ≤ j, k ≤ p, ||V̂_{j,k} − V_{j,k}||_{2,[−B,B]} ≤ c_2 s^{1/2} T^{−1/6}) ≥ 1 − c_3 s^{1/2} T^{7/6} p² e^{−c_4 T^{1/6}}.

4.3. Proof of Theorem 1

Proof

In what follows, we will consider the event

ℳ ≡ {||V̂_{j,k} − V_{j,k}||_{2,[−B,B]} ≤ c_2 s^{1/2} T^{−1/6} for all 1 ≤ j, k ≤ p}.

We will first show that part (b) of Theorem 1 holds. From the Wiener-Hopf equation, (13), for each (j, k), we can write

V_{j,k} = ω_{j,k} Λ_k + ω_{j,·} * V_{·,k}. (14)

We thus have

||V_{j,k}||_{2,(−∞,∞)} ≤ Λ_k ||ω_{j,k}||_{2,(−∞,∞)} + ||ω_{j,·} * V_{·,k}||_{2,(−∞,∞)}
≤ Λ_k ||ω_{j,k}||_{2,(−∞,∞)} + Σ_{l=1}^p ||ω_{j,l} * V_{l,k}||_{2,(−∞,∞)}
≤ Λ_k ||ω_{j,k}||_{2,(−∞,∞)} + Σ_{l=1}^p (∫_{−∞}^{∞} ω_{j,l}(Δ) dΔ) ||V_{l,k}||_{2,(−∞,∞)}, (15)

where the last inequality follows from Young’s inequality (see e.g., Theorem 3.9.4 in Bogachev (2007)), which takes the form

||f * g||_{r,(−∞,∞)} ≤ ||f||_{p,(−∞,∞)} ||g||_{q,(−∞,∞)}, with 1/p + 1/q = 1/r + 1, (16)

with ||f||_{p,(−∞,∞)} ≡ [∫_{−∞}^{∞} |f(x)|^p dx]^{1/p}. Here, we let r = q = 2, p = 1, f = ω_{j,l}, and g = V_{l,k}.

From Assumption 3(c), we know that ω_{j,k} is bounded by C. Since ω_{j,k} is also non-negative,

||ω_{j,k}||_{2,(−∞,∞)} = {∫_{−∞}^{∞} ω_{j,k}²(Δ) dΔ}^{1/2} ≤ {∫_{−∞}^{∞} C ω_{j,k}(Δ) dΔ}^{1/2} = C^{1/2} Ω_{j,k}^{1/2}.

Using (15) and letting V̄_{j,k} ≡ ||V_{j,k}||_{2,(−∞,∞)}, we get

V̄_{j,k} ≤ C^{1/2} Ω_{j,k}^{1/2} Λ_k + Ω_{j,·} · V̄_{·,k}. (17)

The ℓ2-norm of the vector V̄_{·,k} can then be bounded using the triangle inequality,

||V̄_{·,k}||_2 ≤ C^{1/2} Λ_k [Σ_{j=1}^p Ω_{j,k}]^{1/2} + ||Ω V̄_{·,k}||_2.

Thus, by Assumption 1,

||V̄_{·,k}||_2 ≤ C^{1/2} Λ_k ||Ω_{·,k}||_1^{1/2} + γ_Ω ||V̄_{·,k}||_2.

Rearranging the terms, and using the fact that γΩ < 1, gives

||V̄_{·,k}||_2 ≤ C^{1/2} (1 − γ_Ω)^{−1} Λ_max ||Ω_{·,k}||_1^{1/2}. (18)

Hence,

Σ_{j,k} V̄_{j,k}² = Σ_k ||V̄_{·,k}||_2² ≤ C(1 − γ_Ω)^{−2} Λ_max² Σ_{k=1}^p ||Ω_{·,k}||_1. (19)

Now, recall that the number of non-zero elements of Ω is card(ℰ), and that Ω_{j,k} ≤ γ_Ω. Thus, the inequality becomes

Σ_{j,k} V̄_{j,k}² ≤ C(1 − γ_Ω)^{−2} Λ_max² card(ℰ) γ_Ω. (20)

Hence, no more than (C/c_2²) card(ℰ) s^{−1} T^{1/3} γ_Ω (1 − γ_Ω)^{−2} Λ_max² of the quantities V̄_{j,k} exceed c_2 s^{1/2} T^{−1/6}. Recalling that V̄_{j,k} = ||V_{j,k}||_{2,(−∞,∞)} ≥ ||V_{j,k}||_{2,[−B,B]}, this implies that no more than

(C/c_2²) card(ℰ) s^{−1} T^{1/3} γ_Ω (1 − γ_Ω)^{−2} Λ_max²

of the quantities ||V_{j,k}||_{2,[−B,B]} exceed c_2 s^{1/2} T^{−1/6}.

Given the event ℳ, only edges in the set

{(j, k): ||V_{j,k}||_{2,[−B,B]} ≥ c_2 s^{1/2} T^{−1/6}, 1 ≤ j, k ≤ p}

can be contained in ℰ̂(ζ) for ζ = 2c_2 s^{1/2} T^{−1/6}. This implies that the size of ℰ̂(ζ) is on the order of (C/c_2²) card(ℰ) s^{−1} T^{1/3} γ_Ω (1 − γ_Ω)^{−2} Λ_max².

We now proceed to prove part (a) of Theorem 1. Lemma 1 states that ||V_{j,k}||_{2,[−B,B]} ≥ β_min Λ_min for (j, k) ∈ ℰ. If the event ℳ holds, then for T sufficiently large, ||V̂_{j,k}||_{2,[−B,B]} > 2c_2 s^{1/2} T^{−1/6} = ζ for (j, k) ∈ ℰ. Therefore, ℰ ⊂ ℰ̂(ζ).

Finally, Theorem 1 follows from the fact that, by Lemma 3, the event ℳ holds with probability at least 1 − c3s1/2T7/6p2 exp(−c4T1/6).

5. Discussion

In this paper, we have proposed a very simple procedure for screening the edge set in a multivariate Hawkes process. Provided that the process is mutually-exciting, we establish that this screening procedure can lead to a very small screened edge set, without incurring any false negatives. In fact, this result holds under a subset of the conditions required to establish model selection consistency of penalized regression estimators for the Hawkes process (Wainwright, 2009; Hansen, Reynaud-Bouret and Rivoirard, 2015). Therefore, this screening should always be performed whenever estimating the graph for a mutually-exciting Hawkes process.

The proposed screening procedure boils down to just screening pairs of nodes by thresholding an estimate of their cross-covariance. In fact, this approach is commonly taken within the neuroscience literature, with a goal of estimating the functional connectivity among a set of p neuronal spike trains (Okatan, Wilson and Brown, 2005; Pillow et al., 2008; Mishchencko, Vogelstein and Paninski, 2011; Berry et al., 2012). Therefore, this paper sheds light on the theoretical foundations for an approach that is often used in practice.

Appendix A: Technical proofs

A.1. Proof of Lemma 1

Proof

First, we observe that, if Vj,k is non-negative for all j and k, then ωj,l*Vl,k is non-negative for any j, l, k. Under Assumption 1, we know that (13) holds. We can see from (13) that

V_{j,k}(Δ) ≥ ω_{j,k}(Δ) Λ_k.

Therefore, we have

||V_{j,k}||_{2,[−B,B]} ≥ ||ω_{j,k}||_{2,[−B,B]} Λ_min = ||ω_{j,k}||_{2,[0,b]} Λ_min, (21)

where the inequality follows from Assumption 2 and the equality holds since

supp(ω_{j,k}) ⊂ (0, b] ⊂ [−B, B].

From Assumption 3(b), we then have that ||V_{j,k}||_{2,[−B,B]} ≥ β_min Λ_min for (j, k) ∈ ℰ.

We now show that the elements of V are non-negative, i.e., V_{l,k}(Δ) ≥ 0 for 1 ≤ l, k ≤ p and Δ ∈ ℝ. Recall from the definition (7) in the main paper that

V_{l,k}(Δ) ≡ 𝔼[dN_l(t) dN_k(t − Δ)]/{dt d(t − Δ)} − Λ_l Λ_k = 𝔼[λ_l(t) dN_k(t − Δ)]/{d(t − Δ)} − Λ_l Λ_k, (22)

where the second equality follows from

𝔼[dN_l(t) dN_k(t − Δ)] = 𝔼[𝔼[dN_l(t) | ℋ_t] dN_k(t − Δ)] = 𝔼[λ_l(t) dN_k(t − Δ)] dt. (23)

In this proof, we use the Stieltjes integral to rewrite λl(t) in (2) as

λ_l(t) = μ_l + Σ_{k=1}^p (Σ_{i: t_{k,i} ≤ t} ω_{l,k}(t − t_{k,i})) = μ_l + Σ_{k=1}^p ∫_0^∞ ω_{l,k}(Δ′) dN_k(t − Δ′). (24)

Plugging in λl(t) from (24) into (22) gives

V_{l,k}(Δ) = −Λ_l Λ_k + 𝔼[μ_l dN_k(t − Δ)]/{d(t − Δ)} + 𝔼[Σ_{m=1}^p ∫_0^b ω_{l,m}(Δ′) dN_m(t − Δ′) dN_k(t − Δ)]/{d(t − Δ)}
= Σ_{m=1}^p ∫_0^b ω_{l,m}(Δ′) 𝔼[dN_m(t − Δ′) dN_k(t − Δ)]/{d(t − Δ)} + 𝔼[μ_l dN_k(t − Δ)]/{d(t − Δ)} − Λ_l 𝔼[dN_k(t − Δ)]/{d(t − Δ)},

where we use the definition Λk = 𝔼[dNk(t − Δ)]/{d(t − Δ)}.

Using the fact that (see e.g., Hawkes and Oakes (1974))

Λ_l = μ_l + Σ_{m=1}^p (∫_0^b ω_{l,m}(Δ′) dΔ′) μ_m,

we have

V_{l,k}(Δ) = Σ_{m=1}^p ∫_0^b ω_{l,m}(Δ′) 𝔼[dN_m(t − Δ′) dN_k(t − Δ)]/{d(t − Δ)} + 𝔼[μ_l dN_k(t − Δ)]/{d(t − Δ)} − {μ_l + Σ_{m=1}^p ∫_0^b ω_{l,m}(Δ′) μ_m dΔ′} 𝔼[dN_k(t − Δ)]/{d(t − Δ)}
= Σ_{m=1}^p ∫_0^b ω_{l,m}(Δ′) 𝔼[dN_m(t − Δ′) dN_k(t − Δ)]/{d(t − Δ)} − Σ_{m=1}^p ∫_0^b ω_{l,m}(Δ′) μ_m dΔ′ 𝔼[dN_k(t − Δ)]/{d(t − Δ)}.

Rearranging the terms gives

V_{l,k}(Δ) = Σ_{m=1}^p ∫_0^b (ω_{l,m}(Δ′)/{d(t − Δ)}) {𝔼[dN_m(t − Δ′) dN_k(t − Δ)] − 𝔼[μ_m dΔ′ dN_k(t − Δ)]}. (25)

Next, we will rewrite (25) by taking the conditional expectation of dNk or dNm as in (23). Note here that, when Δ′ < Δ, we condition dNm on the history up to t − Δ′, i.e., ℋt − Δ′. Given ℋt − Δ′, dNk(t − Δ) is fixed since t − Δ < t − Δ′. When Δ′ > Δ, we condition dNk on the history up to t − Δ. These cases are discussed separately in the following.

When Δ′ < Δ, for each integral of the summation, it holds that

𝔼{dN_m(t − Δ′) dN_k(t − Δ)} = 𝔼{λ_m(t − Δ′) dΔ′ dN_k(t − Δ)}.

From the definition of λm(t) in (2), we know that λm(t − Δ′) ≥ μm. Hence, in (25), if Δ′ < Δ, it holds that

𝔼{dN_m(t − Δ′) dN_k(t − Δ)}/{d(t − Δ)} − 𝔼{μ_m dΔ′ dN_k(t − Δ)}/{d(t − Δ)} ≥ 0. (26)

On the other hand, when Δ′ ≥ Δ, we have

𝔼{dN_m(t − Δ′) dN_k(t − Δ)}/{d(t − Δ)} − 𝔼{μ_m dΔ′ dN_k(t − Δ)}/{d(t − Δ)} = 𝔼{dN_m(t − Δ′) λ_k(t − Δ)} − 𝔼{μ_m dΔ′ λ_k(t − Δ)} = 𝔼{dN_m(t − Δ′) λ_k(t − Δ)} − μ_m Λ_k dΔ′.

Expanding λk and Λk yields

𝔼{dN_m(t − Δ′) λ_k(t − Δ)} − μ_m Λ_k dΔ′ = 𝔼{dN_m(t − Δ′) μ_k} + Σ_{i=1}^p ∫_0^b ω_{k,i}(Δ″) 𝔼[dN_m(t − Δ′) dN_i(t − Δ − Δ″)] − μ_m μ_k dΔ′ − Σ_{i=1}^p ∫_0^b ω_{k,i}(Δ″) dΔ″ μ_i μ_m dΔ′
= (Λ_m − μ_m) μ_k dΔ′ + Σ_{i=1}^p ∫_0^b ω_{k,i}(Δ″) {𝔼[dN_m(t − Δ′) dN_i(t − Δ − Δ″)] − μ_i μ_m dΔ′ dΔ″}.

Now, observe that Λ_m ≥ μ_m and 𝔼{dN_i(t − Δ − Δ″) dN_m(t − Δ′)}/{dΔ′dΔ″} ≥ μ_i μ_m, by the nature of the mutually-exciting process. Thus, for Δ′ ≥ Δ,

𝔼{dN_m(t − Δ′) dN_k(t − Δ)}/{d(t − Δ)} − 𝔼{μ_m dΔ′ dN_k(t − Δ)}/{d(t − Δ)} ≥ 0. (27)

Applying both (26) and (27) to (25) shows that Vl,k(Δ) ≥ 0.

A.2. Proof of Lemma 2

Proof

For any Δ ≥ 0, the integral equation (13) gives

V_{j,k}(Δ) = ω_{j,k}(Δ) Λ_k + (ω_{j,·} * V_{·,k})(Δ). (28)

For any x, y ≥ 0, we can write

|V_{j,k}(x) − V_{j,k}(y)| = |{ω_{j,k}(x) − ω_{j,k}(y)} Λ_k + (ω_{j,·} * V_{·,k})(x) − (ω_{j,·} * V_{·,k})(y)|
= |{ω_{j,k}(x) − ω_{j,k}(y)} Λ_k + Σ_{l=1}^p {ω_{j,l} * V_{l,k}(x) − ω_{j,l} * V_{l,k}(y)}|
= |{ω_{j,k}(x) − ω_{j,k}(y)} Λ_k + Σ_{l∈ℰ_j} {ω_{j,l} * V_{l,k}(x) − ω_{j,l} * V_{l,k}(y)}|, (29)

where the last equality holds since ω_{j,l} ≡ 0 for l ∉ ℰ_j ≡ {l: (j, l) ∈ ℰ}. We then have

|V_{j,k}(x) − V_{j,k}(y)| ≤ |{ω_{j,k}(x) − ω_{j,k}(y)} Λ_k| + Σ_{l∈ℰ_j} |ω_{j,l} * V_{l,k}(x) − ω_{j,l} * V_{l,k}(y)| ≡ I + Σ_{l∈ℰ_j} II_l. (30)

For I, we know from Assumptions 2 and 3(c) that

I = |{ω_{j,k}(x) − ω_{j,k}(y)} Λ_k| ≤ θ_0 Λ_max |x − y|. (31)

For IIl, we can expand the convolution

II_l = |∫_0^b ω_{j,l}(Δ) V_{l,k}(x − Δ) dΔ − ∫_0^b ω_{j,l}(Δ) V_{l,k}(y − Δ) dΔ| = |∫_{−x}^{b−x} ω_{j,l}(Δ + x) V_{l,k}(−Δ) dΔ − ∫_{−y}^{b−y} ω_{j,l}(Δ + y) V_{l,k}(−Δ) dΔ|.

Without loss of generality, we consider only the case x ≥ y. We can decompose the integrals into parts on the intervals [−x, −y), [−y, b − x), and [b − x, b − y] as

\[
\begin{aligned}
II_l &\le \Big|\int_{-y}^{b-x}\{\omega_{j,l}(\Delta+x)-\omega_{j,l}(\Delta+y)\}\,V_{l,k}(-\Delta)\,d\Delta\Big| + \Big|\int_{-x}^{-y}\omega_{j,l}(\Delta+x)\,V_{l,k}(-\Delta)\,d\Delta\Big|\\
&\qquad + \Big|\int_{b-x}^{b-y}\omega_{j,l}(\Delta+y)\,V_{l,k}(-\Delta)\,d\Delta\Big|\\
&\le \int_{-y}^{b-x}\theta_0\,|x-y|\,V_{l,k}(-\Delta)\,d\Delta + \int_{-x}^{-y}\omega_{j,l}(\Delta+x)\,V_{l,k}(-\Delta)\,d\Delta + \int_{b-x}^{b-y}\omega_{j,l}(\Delta+y)\,V_{l,k}(-\Delta)\,d\Delta\\
&\le \int_{-y}^{b-x}\theta_0\,|x-y|\,V_{\max}\,d\Delta + \int_{-x}^{-y}\omega_{j,l}(\Delta+x)\,V_{\max}\,d\Delta + \int_{b-x}^{b-y}\omega_{j,l}(\Delta+y)\,V_{\max}\,d\Delta\\
&\le (b-x+y)\,\theta_0\,|x-y|\,V_{\max} + 2C\,(x-y)\,V_{\max},
\end{aligned}
\]

where we use Assumption 3(c) in the second inequality, Assumption 2 in the third inequality, and the boundedness of ωj,l from Assumption 3(c) in the last inequality. Recalling that x ≥ y, we have

\[
II_l \le \big(b\,\theta_0\,V_{\max} + 2C\,V_{\max}\big)\,|x-y|. \tag{32}
\]

Finally, plugging (31) and (32) into (30) gives

\[
\big|V_{j,k}(x)-V_{j,k}(y)\big| \le \theta_0\Lambda_{\max}\,|x-y| + s\,\big(b\,\theta_0 V_{\max} + 2C\,V_{\max}\big)\,|x-y| \le s\,\theta_1\,|x-y|, \tag{33}
\]

where we set θ1 ≡ θ0Λmax + bθ0Vmax + 2CVmax. Note that the last inequality holds as long as s ≥ 1. (The result also holds if s = 0: in this case, the second term in (30) is zero for every j, and the bound (31) suffices.)
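The bound (32) can be checked numerically for concrete choices of the ingredients. In the sketch below (ours, not from the paper), ω(u) = (1 − u)₊ on [0, 1] has Lipschitz constant θ0 = 1 and supremum C = 1, and V(u) = cos(u) plays the role of a cross-covariance bounded by Vmax = 1.

```python
import math

b = 1.0
theta0, C, Vmax = 1.0, 1.0, 1.0

def omega(u):
    # triangular transfer function: Lipschitz constant theta0 = 1, sup C = 1
    return max(0.0, 1.0 - u) if 0.0 <= u <= b else 0.0

def V(u):
    # stand-in cross-covariance, |V| <= Vmax = 1
    return math.cos(u)

def conv(z, n=20000):
    # midpoint rule for the convolution (omega * V)(z) = int_0^b omega(d) V(z - d) dd
    step = b / n
    return sum(omega((i + 0.5) * step) * V(z - (i + 0.5) * step) for i in range(n)) * step

x, y = 0.8, 0.5
II = abs(conv(x) - conv(y))
bound = (b * theta0 * Vmax + 2 * C * Vmax) * abs(x - y)
assert II <= bound  # the Lipschitz bound (32)
```

The actual difference is far below the bound here, since (32) only needs the crude estimates in the display above.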

A.3. Proof of Lemma 3

Recall that the estimator of the cross-covariance (8) takes the form

\[
\hat V_{j,k}(\Delta) = \frac{1}{h}\,\underbrace{\frac{1}{T}\iint_{[0,T]^2} K\!\Big(\frac{t'-t+\Delta}{h}\Big)\,dN_j(t')\,dN_k(t)}_{I_{j,k}} \;-\; \underbrace{\Big[\frac{1}{T}\int_0^T dN_j(t)\Big]}_{II_j}\;\underbrace{\Big[\frac{1}{T}\int_0^T dN_k(t)\Big]}_{II_k}.
\]
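To make the pieces I_{j,k}, II_j, and II_k concrete, here is a direct implementation of this estimator for event-time data (our own sketch; the boxcar kernel K(u) = ½·1{|u| ≤ 1}, which integrates to one, is an illustrative choice rather than the paper's prescription):

```python
def hat_V(tj, tk, Delta, h, T):
    """Kernel estimator of the cross-covariance V_{j,k}(Delta), given the
    event times tj of process j and tk of process k observed on [0, T]."""
    K = lambda u: 0.5 if abs(u) <= 1.0 else 0.0  # boxcar kernel, integral 1
    # I_{j,k}: kernel-weighted count of pairs of events at lag close to Delta
    I_jk = sum(K((tp - t + Delta) / h) for tp in tj for t in tk) / T
    II_j = len(tj) / T  # empirical rate, estimating Lambda_j
    II_k = len(tk) / T  # empirical rate, estimating Lambda_k
    return I_jk / h - II_j * II_k

# a single pair at lag exactly Delta contributes K(0)/(h*T) = 0.5/(h*T) to I_jk/h
v = hat_V([1.0], [2.0], Delta=1.0, h=0.5, T=10.0)  # ≈ 0.1 - 0.01 = 0.09
```

The double sum makes the cost O(|tj|·|tk|); in practice one would sort the event times and only visit pairs with lag within h of Δ.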

The proof of Lemma 3 uses the following result. Lemma 4 is based on Proposition 3 of Hansen, Reynaud-Bouret and Rivoirard (2015); for completeness, we provide its proof in Section A.4.

Lemma 4

Suppose that Assumption 1 holds. We have

\[
\mathbb{P}\Big(\bigcup_{1\le j\le k\le p}\big[\,|I_{j,k}-\mathbb{E}I_{j,k}| \ge c_6 T^{-1/3}\,\big]\Big) \le c_5\,p^2\,T\exp(-c_4 T^{1/6}), \tag{34}
\]
\[
\mathbb{P}\Big(\bigcup_{1\le j\le p}\big[\,|II_j-\mathbb{E}II_j| \ge c_6 T^{-1/3+1/18}\,\big]\Big) \le c_5\,p^2\,T\exp(-c_4 T^{1/6}), \tag{35}
\]

where c4, c5, and c6 are constants.

We are now ready to prove Lemma 3.

Proof

First, note that

\[
\begin{aligned}
\big|\mathbb{E}I_{j,k} - h[V_{j,k}(\Delta)+\Lambda_j\Lambda_k]\big| &= \Big|\frac{1}{T}\iint_{[0,T]^2} K\!\Big(\frac{t'-t+\Delta}{h}\Big)\,\mathbb{E}\big[dN_j(t')\,dN_k(t)\big] - \frac{1}{T}\iint_{[0,T]^2} K\!\Big(\frac{t'-t+\Delta}{h}\Big)\big[V_{j,k}(\Delta)+\Lambda_j\Lambda_k\big]\,dt'\,dt\Big|\\
&= \Big|\frac{1}{T}\iint_{[0,T]^2} K\!\Big(\frac{t'-t+\Delta}{h}\Big)\Big\{\mathbb{E}\big[dN_j(t')\,dN_k(t)\big]-\Lambda_j\Lambda_k\,dt'\,dt\Big\} - \frac{1}{T}\iint_{[0,T]^2} K\!\Big(\frac{t'-t+\Delta}{h}\Big)\,V_{j,k}(\Delta)\,dt'\,dt\Big|\\
&= \Big|\frac{1}{T}\iint_{[0,T]^2} K\!\Big(\frac{t'-t+\Delta}{h}\Big)\,V_{j,k}(t-t')\,dt'\,dt - \frac{1}{T}\iint_{[0,T]^2} K\!\Big(\frac{t'-t+\Delta}{h}\Big)\,V_{j,k}(\Delta)\,dt'\,dt\Big|\\
&= \Big|\frac{1}{T}\iint_{[0,T]^2} K\!\Big(\frac{t'-t+\Delta}{h}\Big)\big[V_{j,k}(t-t')-V_{j,k}(\Delta)\big]\,dt'\,dt\Big|, \tag{36}
\end{aligned}
\]

where we use the definition of V in the third equality. Using the fact that the kernel K(·/h) is supported on [−h, h], we can write

\[
\begin{aligned}
\big|\mathbb{E}I_{j,k} - h[V_{j,k}(\Delta)+\Lambda_j\Lambda_k]\big| &= \Big|\frac{1}{T}\int_0^T\int_{\max(0,\,t-\Delta-h)}^{\min(T,\,t-\Delta+h)} K\!\Big(\frac{t'-t+\Delta}{h}\Big)\big[V_{j,k}(t-t')-V_{j,k}(\Delta)\big]\,dt'\,dt\Big|\\
&\le \frac{1}{T}\int_0^T\int_{\max(0,\,t-\Delta-h)}^{\min(T,\,t-\Delta+h)} K\!\Big(\frac{t'-t+\Delta}{h}\Big)\,\theta_1 s\,\big|t-t'-\Delta\big|\,dt'\,dt\\
&\le \frac{1}{T}\int_0^T\int_{\max(0,\,t-\Delta-h)}^{\min(T,\,t-\Delta+h)} K\!\Big(\frac{t'-t+\Delta}{h}\Big)\,\theta_1 h s\,dt'\,dt\\
&\le \frac{1}{T}\int_0^T 2\theta_1 s h^2\,dt = 2\theta_1 s h^2, \tag{37}
\end{aligned}
\]

where the first inequality follows from Lemma 2.

Recall that IIj ≡ T^{−1}Nj(T) and IIk ≡ T^{−1}Nk(T). Applying Lemma 4 and (37), we have, with probability at least 1 − 2c5p^2T exp(−c4T^{1/6}),

\[
\begin{aligned}
\big|\hat V_{j,k}(\Delta)-V_{j,k}(\Delta)\big| &\le \frac{1}{h}\big|I_{j,k}-\mathbb{E}I_{j,k}\big| + \frac{1}{h}\Big|\mathbb{E}I_{j,k}-h\,\mathbb{E}\big[dN_j(t-\Delta)\,dN_k(t)\big]/(dt\,d\Delta)\Big|\\
&\qquad + \Big|\frac{1}{T^2}\big(N_j(T)-T\Lambda_j\big)N_k(T)\Big| + \Big|\Lambda_j\frac{1}{T}N_k(T)-\Lambda_j\Lambda_k\Big|\\
&\le c_6 T^{-1/3}h^{-1} + 2\theta_1 h s + \big(\Lambda_{\max}+c_6T^{-1/3+1/18}\big)\,c_6T^{-1/3+1/18} + \Lambda_{\max}\,c_6T^{-1/3+1/18}. \tag{38}
\end{aligned}
\]

Letting h = c1s−1/2T−1/6, (38) can be written as

\[
\big|\hat V_{j,k}(\Delta) - V_{j,k}(\Delta)\big| \le c_2'\,s^{1/2}\,T^{-1/6}. \tag{39}
\]

Lastly, we need a uniform bound on |V̂j,k − Vj,k| on the region [−B, B]. We first see that the above probability statement holds simultaneously for a grid of ⌈s^{1/2}T^{1/6}⌉ points on [−B, B], denoted {Δi}, so the gap between adjacent grid points is bounded by 2Bs^{−1/2}T^{−1/6}. In particular, for any Δ ∈ [−B, B], we can find a grid point Δi such that |Δ − Δi| ≤ 2B/⌈s^{1/2}T^{1/6}⌉ ≤ 2Bs^{−1/2}T^{−1/6}. From basic calculus we get that, for all Δ ∈ [−B, B],

\[
\begin{aligned}
\big|\hat V_{j,k}(\Delta)-V_{j,k}(\Delta)\big| &= \big|\hat V_{j,k}(\Delta)-\hat V_{j,k}(\Delta_i)+\hat V_{j,k}(\Delta_i)-V_{j,k}(\Delta_i)+V_{j,k}(\Delta_i)-V_{j,k}(\Delta)\big|\\
&\le 2B\,s^{-1/2}T^{-1/6} + c_2'\,s^{1/2}T^{-1/6} + \theta_1 s\cdot 2B\,s^{-1/2}T^{-1/6}\\
&\le c_2\,s^{1/2}T^{-1/6}, \tag{40}
\end{aligned}
\]

for some constant c2.

Therefore, with probability at least 1 − c3s1/2p2T7/6 exp(−c4T1/6),

\[
\big\|\hat V_{j,k} - V_{j,k}\big\|_{\infty,[-B,B]} \le c_2\,s^{1/2}\,T^{-1/6}. \tag{41}
\]
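The covering step is elementary but easy to get wrong, so here is a small self-check (our own sketch, with illustrative values of B, s, and T): build the grid of ⌈s^{1/2}T^{1/6}⌉ points on [−B, B] and verify that every Δ is within 2Bs^{−1/2}T^{−1/6} of a grid point.

```python
import math

def grid(B, s, T):
    """Equally spaced grid of ceil(s^{1/2} T^{1/6}) points on [-B, B]."""
    m = math.ceil(math.sqrt(s) * T ** (1 / 6))
    if m == 1:
        return [0.0]
    return [-B + 2 * B * i / (m - 1) for i in range(m)]

B, s, T = 2.0, 4.0, 64.0
pts = grid(B, s, T)  # ceil(sqrt(4) * 64^{1/6}) = ceil(4) = 4 points
radius = 2 * B / (math.sqrt(s) * T ** (1 / 6))  # claimed radius 2B s^{-1/2} T^{-1/6}

# every Delta in [-B, B] has a grid point within `radius`
dense = [-B + 2 * B * i / 1000 for i in range(1001)]
worst = max(min(abs(d - p) for p in pts) for d in dense)
assert worst <= radius
```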

A.4. Proof of Lemma 4

Lemma 4 follows directly from the proof of Proposition 3 in Hansen, Reynaud-Bouret and Rivoirard (2015). The only difference is that we want a polynomial bound on the deviation, while Hansen, Reynaud-Bouret and Rivoirard (2015) consider a logarithmic bound. For completeness, we state the proof of Lemma 4 below, but note that the proof is almost identical to the proof of Proposition 3 in Hansen, Reynaud-Bouret and Rivoirard (2015). We refer the interested readers to the original proof in Section 7.4.3 of Hansen, Reynaud-Bouret and Rivoirard (2015) for more details.

Throughout this section, we assume that N ≡ (N1, …, Np)T is defined on the full real line. We first state some notation that is only used in this section.

  1. Following Hansen, Reynaud-Bouret and Rivoirard (2015), we use C^{(i)}_{a1,a2,…} to denote a constant that depends only on a1, a2, …; the superscript i indicates that this is the ith constant appearing in the proof.

  2. Without loss of generality, we assume that supp(ωj,k) ⊂ (0, 1], as in Hansen, Reynaud-Bouret and Rivoirard (2015).

  3. As in Hansen, Reynaud-Bouret and Rivoirard (2015), we introduce a function Z(N) such that Z(N) depends only on {dN(t′), t′∈ [−A, 0)}, and there exist two non-negative constants η and d such that
    Z(N) ≤ d{1 + (∑_{l=1}^{p} Nl([−A, 0)))^η}. (42)
  4. We also introduce the (time) shift operator St, so that Z∘St(N) depends only on {dN(t′), t′ ∈ [−A + t, t)}, in the same way as Z(N) depends on the points of N in [−A, 0).

We are now ready to prove the lemma. When proving the bound (34), we only discuss the case when jk. The proof for the case when j = k follows from the same argument and is thus omitted.

Proof

In this proof, we will derive a probability bound for ℘(∫_0^T [Z∘St(N) − 𝔼(Z)] dt ≥ u), where, for some κ ∈ (0, 1) to be specified later,

\[
u = c_6\,T^{(1-\kappa)(1-\eta)+\kappa}. \tag{43}
\]

Note that, by applying the same bound to −Z(·), we can obtain a bound for |∫_0^T [Z∘St(N) − 𝔼(Z)] dt|. To complete the proof, we will verify the statements (34) and (35) by considering some specific choices of Z(·).

For any positive integer k such that xT/(2k) > A, we have

\[
\begin{aligned}
\mathbb{P}\Big(\int_0^T\big[Z\circ S_t(N)-\mathbb{E}(Z)\big]\,dt \ge u\Big) &= \mathbb{P}\Big(\sum_{q=0}^{k-1}\int_{2qx}^{2qx+x}\big[Z\circ S_t(N)-\mathbb{E}(Z)\big]\,dt + \sum_{q=0}^{k-1}\int_{2qx+x}^{2qx+2x}\big[Z\circ S_t(N)-\mathbb{E}(Z)\big]\,dt \ge u\Big)\\
&\le 2\,\mathbb{P}\Big(\sum_{q=0}^{k-1}\int_{2qx}^{2qx+x}\big[Z\circ S_t(N)-\mathbb{E}(Z)\big]\,dt \ge \frac{u}{2}\Big),
\end{aligned}
\]

where the inequality follows from the stationarity of N. As in Reynaud-Bouret and Roy (2006), let {M̄_{2qx}}_{q≥0} be a sequence of independent Hawkes processes, each of which is stationary with intensities λ(t) ≡ (λ1(t), …, λp(t))^T. See Section 3 of Reynaud-Bouret and Roy (2006) for more details on the construction of {M̄_{2qx}}_{q≥0}. For each q, let M_{2qx} be the truncated process associated with M̄_{2qx}, where truncation means that we only consider the points in [2qx − A, 2qx + x]. Now, if we set

\[
F_q \equiv \int_{2qx}^{2qx+x}\big[Z\circ S_t(M_{2qx})-\mathbb{E}(Z)\big]\,dt, \tag{44}
\]

then

\[
\mathbb{P}\Big(\int_0^T\big[Z\circ S_t(N)-\mathbb{E}(Z)\big]\,dt \ge u\Big) \le 2\,\mathbb{P}\Big(\sum_{q=0}^{k-1}F_q \ge \frac{u}{2}\Big) + 2\sum_{q=0}^{k-1}\mathbb{P}\Big(T_{e,q} > \frac{T}{2k}-A\Big), \tag{45}
\]

where T_{e,q} is the time to extinction of the process M̄_{2qx}. The extinction time T_{e,q} is introduced in Sections 2.2 and 3 of Reynaud-Bouret and Roy (2006). Roughly speaking, it is the last time at which there is an event for the Hawkes process with intensity λ(t) of the form (2), with background intensity μ ≡ (μ1, …, μp)^T set to 0 for t ≥ 0. Since the T_{e,q} are identically distributed, we can focus on a single T_{e,q}. Denoting by a_l the ancestral points with mark l and by H^l_{a_l} the length of the corresponding cluster whose origin is a_l, we have:

\[
T_{e,q} = \max_{l\in\{1,\dots,p\}}\;\max_{a_l}\,\big\{a_l + H^{l}_{a_l}\big\}. \tag{46}
\]

Then by the exact argument on page 48 of Hansen, Reynaud-Bouret and Rivoirard (2015), we have

\[
\mathbb{P}(T_{e,q} \le a) \ge 1 - \sum_{l=1}^{p}\mu_l\,\frac{c_l}{\vartheta_l}\,\exp(-\vartheta_l a). \tag{47}
\]

Thus, there exists a constant C_A^{(1)} depending on A such that if we take k = ⌈C_A^{(1)}T^κ⌉, for some κ ∈ (0, 1) to be specified later, then

\[
\sum_{q=0}^{k-1}\mathbb{P}\Big(T_{e,q} > \frac{T}{2k}-A\Big) \le T^{\kappa}\,p\,\exp(-c_4 T^{1-\kappa}), \tag{48}
\]

where c4 is a constant. Note that x = T/(2k) ≍ T^{1−κ} is larger than A for T large enough (depending on A).
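The alternating-block construction used in this display can be sketched as follows (our own illustration of the indexing, with toy values of T and k):

```python
def odd_even_blocks(T, k):
    """Split [0, T] into 2k blocks of length x = T/(2k). The F_q are defined
    on the k 'even' blocks [2qx, 2qx + x]; the interleaved 'odd' blocks
    [2qx + x, 2qx + 2x] are handled by the same bound via stationarity."""
    x = T / (2 * k)
    even = [(2 * q * x, 2 * q * x + x) for q in range(k)]
    odd = [(2 * q * x + x, 2 * q * x + 2 * x) for q in range(k)]
    return even, odd

even, odd = odd_even_blocks(T=12.0, k=3)
# even -> [(0.0, 2.0), (4.0, 6.0), (8.0, 10.0)]
# consecutive even blocks are separated by a gap of length x, which is what
# makes the comparison with the independent copies M_{2qx} possible
```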

Now, note that the event 𝒯 ≡ {T_{e,q} ≤ T/(2k) − A, for all q = 0, …, k − 1} only depends on the process N. We will first find a probability bound for the first term in (45). In other words, we will show that, given the event 𝒯,

\[
\mathbb{P}\Big(\int_0^T\big[Z\circ S_t(N)-\mathbb{E}(Z)\big]\,dt \ge u\Big) \le c_5\,T\exp(-c_4 T^{1-\kappa}). \tag{49}
\]

Let

\[
B \equiv \mathbb{P}\Big(\sum_{q=0}^{k-1}F_q \ge \frac{u}{2}\Big).
\]

Consider the measurable events

\[
\Omega_q \equiv \Big\{\sup_t\, M_{2qx}[t-A,\,t) \le \tilde{\mathcal{N}}\Big\},
\]

where 𝒩̃ is a constant that will be defined later and M_{2qx}[t − A, t) represents the number of points of M_{2qx} lying in [t − A, t). Let Ω = ∩_{0≤q≤k−1} Ωq. Then

\[
B \le \mathbb{P}\Big(\sum_{q=0}^{k-1}F_q \ge \frac{u}{2}\ \text{and}\ \Omega\Big) + \mathbb{P}(\Omega^c).
\]

We have ℘(Ω^c) ≤ Σ_q ℘(Ωq^c), where each ℘(Ωq^c) can be easily controlled. Indeed, it is sufficient to split [2qx − A, 2qx + x] into intervals of size A (there are about C_A^{(2)}T^{1−κ} of these) and to require the number of points in each sub-interval to be smaller than 𝒩̃/2. By stationarity, we then obtain

\[
\mathbb{P}(\Omega_q^c) \le C_A^{(2)}\,T^{1-\kappa}\,\mathbb{P}\big(N[-A,0] > \tilde{\mathcal{N}}/2\big).
\]

Using Proposition 2 in Hansen, Reynaud-Bouret and Rivoirard (2015) with u = [𝒩̃/2] + 1/2, we obtain:

\[
\mathbb{P}(\Omega_q^c) \le C_A^{(2)}\,T^{1-\kappa}\exp\big(-C_A^{(3)}\tilde{\mathcal{N}}\big)
\]

and, thus,

\[
\mathbb{P}(\Omega^c) \le C_A^{(4)}\,T\exp\big(-C_A^{(3)}\tilde{\mathcal{N}}\big).
\]

Note that this control holds for any positive choice of 𝒩̃. Thus, for any 𝒩̃ > 0,

\[
\mathbb{P}\big(\exists\,t\in[0,T]\ \text{such that}\ M_{2qx}[t-A,\,t) > \tilde{\mathcal{N}}\big) \le C_A^{(2)}\,T^{1-\kappa}\exp\big(-C_A^{(3)}\tilde{\mathcal{N}}\big). \tag{50}
\]

Hence, by taking 𝒩̃ = C_A^{(5)}T^{1−κ} for C_A^{(5)} large enough, the right-hand side of (50) is smaller than C_A^{(2)}T^{1−κ} exp(−c4T^{1−κ}).

It remains to obtain the rate of D ≡ ℘(Σ_q Fq ≥ u/2 and Ω). For any positive constant ε, to be chosen later, we have:

\[
D \le e^{-\varepsilon u/2}\,\mathbb{E}\Big(e^{\varepsilon\sum_q F_q}\prod_q \mathbb{1}_{\Omega_q}\Big) \le e^{-\varepsilon u/2}\prod_q \mathbb{E}\big(e^{\varepsilon F_q}\,\mathbb{1}_{\Omega_q}\big), \tag{51}
\]

since the variables {M_{2qx}}_q are independent. But,

\[
\mathbb{E}\big(e^{\varepsilon F_q}\mathbb{1}_{\Omega_q}\big) \le 1 + \varepsilon\,\mathbb{E}\big(F_q\mathbb{1}_{\Omega_q}\big) + \sum_{j\ge 2}\frac{\varepsilon^j}{j!}\,\mathbb{E}\big(F_q^j\mathbb{1}_{\Omega_q}\big)
\]

and, since 𝔼(Fq) = 0, we have 𝔼(Fq𝟙_{Ωq}) = 𝔼(Fq) − 𝔼(Fq𝟙_{Ωq^c}) = −𝔼(Fq𝟙_{Ωq^c}).

Next, note that if, for some integer l,

\[
l\tilde{\mathcal{N}} < \sup_t\, M_{2qx}[t-A,\,t) \le (l+1)\tilde{\mathcal{N}},
\]

then

\[
|F_q| \le x\,d\big[(l+1)^{\eta}\tilde{\mathcal{N}}^{\eta}+1\big] + x\,\mathbb{E}(Z).
\]

Hence, cutting Ωq^c into slices of the type {l𝒩̃ < sup_t M_{2qx}[t − A, t) ≤ (l + 1)𝒩̃} and using (50) with 𝒩̃ = C_A^{(5)}T^{1−κ} for a large enough C_A^{(5)}, we obtain

\[
\begin{aligned}
\big|\mathbb{E}(F_q\mathbb{1}_{\Omega_q})\big| = \big|\mathbb{E}(F_q\mathbb{1}_{\Omega_q^c})\big| &\le \sum_{l=1}^{+\infty} x\big(d[(l+1)^{\eta}\tilde{\mathcal{N}}^{\eta}+1]+|\mathbb{E}(Z)|\big)\times\mathbb{P}\big(\exists\,t\in[0,T]\ \text{such that}\ M_{2qx}[t-A,t) > l\tilde{\mathcal{N}}\big)\\
&\le C_A^{(2)}\sum_{l=1}^{+\infty} x\big(d[(l+1)^{\eta}\tilde{\mathcal{N}}^{\eta}+1]+|\mathbb{E}(Z)|\big)\,T^{1-\kappa}\exp(-c_4\,l\tilde{\mathcal{N}})\\
&\le C_A^{(6)}\sum_{l=1}^{+\infty} x\big(d\tilde{\mathcal{N}}^{\eta}+|\mathbb{E}(Z)|\big)\,T^{1-\kappa}\,2^{l\eta}\exp(-c_4\,l\tilde{\mathcal{N}})\\
&\le C_A^{(7)}\,T^{2-2\kappa}\,d\,\tilde{\mathcal{N}}^{\eta}\,\frac{\exp(-c_4\tilde{\mathcal{N}})}{1-2^{\eta}\exp(-c_4\tilde{\mathcal{N}})},
\end{aligned}
\]

where in the last inequality we have used the fact that 𝔼(Z) ≤ d{1 + 𝔼[(N[−A, 0))^η]} by (42). Plugging 𝒩̃ = C_A^{(5)}T^{1−κ} into the above equation gives

\[
\big|\mathbb{E}(F_q\mathbb{1}_{\Omega_q})\big| \le z_1 \equiv C_A^{(8)}\,d\,T^{2-2\kappa}\,T^{(1-\kappa)\eta}\exp(-c_4 T^{1-\kappa}).
\]

In the same way, following Hansen, Reynaud-Bouret and Rivoirard (2015), we can write

\[
\mathbb{E}\big(F_q^j\,\mathbb{1}_{\Omega_q}\big) \le \mathbb{E}\big(F_q^2\,\mathbb{1}_{\Omega_q}\big)\,z_b^{\,j-2}, \tag{52}
\]

where z_b ≡ xd[𝒩̃^η + 1] + x𝔼(Z) ≤ C_{η,A}^{(9)} dT^{(1−κ)(1+η)}. Then, by stationarity,

\[
\begin{aligned}
\mathbb{E}\big(F_q^2\,\mathbb{1}_{\Omega_q}\big) &\le x\,\mathbb{E}\Big[\int_{2qx}^{2qx+x}\big[Z\circ S_t(M_{2qx})-\mathbb{E}(Z)\big]^2\,\mathbb{1}\big\{M_{2qx}[t-A,t)\le\tilde{\mathcal{N}}\big\}\,dt\Big]\\
&\le x^2\,\mathbb{E}\Big(\big[Z(N)-\mathbb{E}(Z)\big]^2\,\mathbb{1}\big\{N[-A,0)\le\tilde{\mathcal{N}}\big\}\Big) \le z_v \equiv C_{\eta,A}^{(10)}\,T^{2-2\kappa}\,\sigma^2,
\end{aligned}
\]

where σ^2 ≡ 𝔼{[Z(N) − 𝔼(Z)]^2}. Going back to (51), by (52), we have

\[
D \le \exp\Big[-\frac{\varepsilon u}{2} + k\log\Big(1+\varepsilon z_1+\sum_{j\ge 2}\frac{z_v z_b^{\,j-2}\varepsilon^j}{j!}\Big)\Big] \le \exp\Big[-\varepsilon\Big(\frac{u}{2}-k z_1\Big) + k\sum_{j\ge 2}\frac{z_v z_b^{\,j-2}\varepsilon^j}{j!}\Big],
\]

using the fact that log(1 + u) ≤ u. Since

\[
k z_1 \le C^{(11)}_{\eta,A}\,d\,T^{\kappa}\,T^{(2+\eta)(1-\kappa)}\exp(-c_4 T^{1-\kappa}),
\]

one can choose c6 in the definition (43) of u (not depending on d) such that u/2 − kz1 ≥ √(2kz_v z) + (1/3)z_b z for some z = c4T^{κ−2η(1−κ)}. Hence,

\[
D \le \exp\Big[-\varepsilon\Big(\sqrt{2k z_v z}+\frac{1}{3}z_b z\Big) + k\sum_{j\ge 2}\frac{z_v z_b^{\,j-2}\varepsilon^j}{j!}\Big].
\]

One can choose ε (as in the proof of the Bernstein inequality in Massart (2007), page 25) to obtain a bound on the right-hand side of the form e^{−z}. We can then choose c4 large enough, depending only on η and A, to guarantee that D ≤ e^{−z} ≤ c5 exp(−c4T^{1−κ}).
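For reference, the Bernstein-type step just invoked runs as follows (our own paraphrase of the standard argument in Massart (2007), not a display from this paper). Using 1/j! ≤ (1/2)·3^{-(j-2)} for j ≥ 2,

\[
k\sum_{j\ge 2}\frac{z_v z_b^{\,j-2}\varepsilon^j}{j!} \;\le\; \frac{k z_v\,\varepsilon^2}{2\,\big(1-z_b\varepsilon/3\big)}, \qquad 0<\varepsilon<3/z_b,
\]

and minimizing

\[
-\varepsilon\Big(\sqrt{2k z_v z}+\frac{1}{3}z_b z\Big) + \frac{k z_v\,\varepsilon^2}{2\,(1-z_b\varepsilon/3)}
\]

over admissible ε yields the value −z, so that D ≤ e^{−z}.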

In summary, we have shown that, given the event 𝒯,

\[
\mathbb{P}\Big(\int_0^T\big[Z\circ S_t(N)-\mathbb{E}(Z)\big]\,dt \ge u\Big) \le c_5\exp(-c_4 T^{1-\kappa}) + C_A^{(4)}\,T\exp(-c_4 T^{1-\kappa}).
\]

With a slight abuse of notation, letting c5 = max(c5, C_A^{(4)}) gives (49).

To complete the proof, we apply the concentration inequality (49) with some specific choices of Z(·).

For each pair (j, k), let

\[
Z\circ S_t(N) \equiv \int_{t-h}^{t+h}K\!\Big(\frac{t'-t+\Delta}{h}\Big)\,dN_j(t')\;\frac{dN_k(t)}{dt}.
\]

We can check that d = 1 and η = 2 satisfy (42). Then with κ = 5/6 in (49), we get, given the event 𝒯,

\[
\mathbb{P}\big(|I_{j,k}-\mathbb{E}I_{j,k}| \ge c_6 T^{-1/3}\big) \le c_5\,T\exp(-c_4 T^{1/6}).
\]

Applying a union bound for all pairs (j, k), we have, given the event 𝒯,

\[
\mathbb{P}\Big(\bigcup_{1\le j\le k\le p}\big[\,|I_{j,k}-\mathbb{E}I_{j,k}| \ge c_6 T^{-1/3}\,\big]\Big) \le c_5\,T\,p^2\exp(-c_4 T^{1/6}). \tag{53}
\]

Recall from the concentration inequality (48) that the event 𝒯 holds with probability at least 1 − pT^{5/6} exp(−c4T^{1/6}). Thus, given that pT^{5/6} exp(−c4T^{1/6}) is dominated by the right-hand side of (53), it holds unconditionally that

\[
\mathbb{P}\Big(\bigcup_{1\le j\le k\le p}\big[\,|I_{j,k}-\mathbb{E}I_{j,k}| \ge c_6 T^{-1/3}\,\big]\Big) \le c_5\,T\,p^2\exp(-c_4 T^{1/6}),
\]

which is the statement on Ij,k in (34).

The statement on IIj in (35) can be shown in a similar manner by taking Z∘St(N) ≡ dNj(t)/dt, with η = 1 and κ = 13/18.
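The exponent bookkeeping behind the two applications of (49) can be verified with exact rational arithmetic (our own sanity check; `Fraction` avoids floating-point noise):

```python
from fractions import Fraction as F

def u_exponent(eta, kappa):
    # exponent of T in u = c6 * T^{(1 - kappa)(1 - eta) + kappa}, cf. (43)
    return (1 - kappa) * (1 - eta) + kappa

# I_{j,k}: eta = 2, kappa = 5/6; the deviation of I_{j,k} is u / T.
assert u_exponent(2, F(5, 6)) == F(2, 3)      # u ~ T^{2/3}, so u/T ~ T^{-1/3}
# II_j: eta = 1, kappa = 13/18.
assert u_exponent(1, F(13, 18)) == F(13, 18)  # u/T ~ T^{-5/18}
assert F(13, 18) - 1 == F(-1, 3) + F(1, 18)   # -5/18 = -1/3 + 1/18

# z = c4 * T^{kappa - 2*eta*(1 - kappa)} gives the exponential rate; it equals
# T^{1/6} for both parameter choices, matching exp(-c4 T^{1/6}) in (34) and (35).
for eta, kappa in [(2, F(5, 6)), (1, F(13, 18))]:
    assert kappa - 2 * eta * (1 - kappa) == F(1, 6)
```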

Contributor Information

Shizhe Chen, Department of Statistics, Columbia University, New York, NY 10027.

Daniela Witten, Department of Biostatistics and Statistics, University of Washington, Seattle, WA 98195.

Ali Shojaie, Department of Biostatistics and Statistics, University of Washington, Seattle, WA 98195.

References

  1. Ahrens MB, Orger MB, Robson DN, Li JM, Keller PJ. Whole-brain functional imaging at cellular resolution using light-sheet microscopy. Nature Methods. 2013;10:413–420.
  2. Aït-Sahalia Y, Cacho-Diaz J, Laeven RJA. Modeling financial contagion using mutually exciting jump processes. Journal of Financial Economics. 2015;117:585–606.
  3. Bacry E, Gaïffas S, Muzy J-F. A generalization error bound for sparse and low-rank multivariate Hawkes processes. 2015. arXiv preprint arXiv:1501.00725.
  4. Bacry E, Delattre S, Hoffmann M, Muzy JF. Some limit theorems for Hawkes processes and application to financial statistics. Stochastic Process Appl. 2013;123:2475–2499.
  5. Berry T, Hamilton F, Peixoto N, Sauer T. Detecting connectivity changes in neuronal networks. Journal of Neuroscience Methods. 2012;209:388–397.
  6. Bogachev VI. Measure Theory. Vol. I, II. Springer-Verlag; Berlin: 2007.
  7. Bowsher CG. Modelling security market events in continuous time: Intensity based, multivariate point process models. Journal of Econometrics. 2007;141:876–912.
  8. Brémaud P, Massoulié L. Stability of nonlinear Hawkes processes. Ann Probab. 1996;24:1563–1588.
  9. Brillinger DR. Maximum likelihood analysis of spike trains of interacting nerve cells. Biological Cybernetics. 1988;59:189–200.
  10. Bühlmann P, van de Geer S. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer; Heidelberg: 2011.
  11. Cai T, Liu W, Luo X. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association. 2011;106:594–607.
  12. Chavez-Demoulin V, Davison AC, McNeil AJ. Estimating value-at-risk: a point process approach. Quantitative Finance. 2005;5:227–234.
  13. Daley D, Vere-Jones D. An Introduction to the Theory of Point Processes, Volume I: Elementary Theory and Methods. Springer; 2003.
  14. Fan J, Feng Y, Song R. Nonparametric independence screening in sparse ultra-high-dimensional additive models. J Amer Statist Assoc. 2011;106:544–557.
  15. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B Stat Methodol. 2008;70:849–911.
  16. Fan J, Ma Y, Dai W. Nonparametric independence screening in sparse ultra-high-dimensional varying coefficient models. J Amer Statist Assoc. 2014;109:1270–1284.
  17. Fan J, Samworth R, Wu Y. Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res. 2009;10:2013–2038.
  18. Fan J, Song R. Sure independence screening in generalized linear models with NP-dimensionality. Ann Statist. 2010;38:3567–3604.
  19. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33:1.
  20. Greenshtein E, Ritov Y. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli. 2004;10:971–988.
  21. Hansen NR, Reynaud-Bouret P, Rivoirard V. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli. 2015;21:83–143.
  22. Hawkes AG. Spectra of some self-exciting and mutually exciting point processes. Biometrika. 1971;58:83–90.
  23. Hawkes AG, Oakes D. A cluster process representation of a self-exciting process. J Appl Probability. 1974;11:493–503.
  24. Liniger TJ. Multivariate Hawkes processes. PhD thesis, Eidgenössische Technische Hochschule ETH Zürich, Nr. 18403; 2009.
  25. Liu J, Li R, Wu R. Feature selection for varying coefficient models with ultrahigh-dimensional covariates. J Amer Statist Assoc. 2014;109:266–274.
  26. Luo S, Song R, Witten D. Sure screening for Gaussian graphical models. 2014. arXiv preprint arXiv:1407.7819.
  27. Massart P. Concentration Inequalities and Model Selection. Lecture Notes in Mathematics 1896; lectures from the 33rd Summer School on Probability Theory, Saint-Flour, July 6–23, 2003. Springer; Berlin: 2007.
  28. Mishchencko Y, Vogelstein JT, Paninski L. A Bayesian approach for inferring neuronal connectivity from calcium fluorescent imaging data. Ann Appl Stat. 2011;5:1229–1261.
  29. Mohler GO, Short MB, Brantingham PJ, Schoenberg FP, Tita GE. Self-exciting point process modeling of crime. J Amer Statist Assoc. 2011;106:100–108.
  30. Ogata Y. Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical Association. 1988;83:9–27.
  31. Okatan M, Wilson MA, Brown EN. Analyzing functional connectivity using a network likelihood model of ensemble neural spiking activity. Neural Comput. 2005;17:1927–1961.
  32. Paninski L, Pillow J, Lewi J. Statistical models for neural encoding, decoding, and optimal stimulus design. Progress in Brain Research. 2007;165:493–507.
  33. Perry PO, Wolfe PJ. Point process modelling for directed interaction networks. J R Stat Soc Ser B Stat Methodol. 2013;75:821–849.
  34. Pillow JW, Shlens J, Paninski L, Sher A, Litke AM, Chichilnisky E, Simoncelli EP. Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature. 2008;454:995–999.
  35. Ravikumar P, Wainwright MJ, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron J Stat. 2011;5:935–980.
  36. Reynaud-Bouret P, Roy E. Some non asymptotic tail estimates for Hawkes processes. Bull Belg Math Soc Simon Stevin. 2006;13:883–896.
  37. Reynaud-Bouret P, Schbath S. Adaptive estimation for Hawkes processes; application to genome analysis. Ann Statist. 2010;38:2781–2822.
  38. Saegusa T, Shojaie A. Joint estimation of precision matrices in heterogeneous populations. Electronic Journal of Statistics. 2016;10:1341–1392.
  39. Simma A, Jordan MI. Modeling events with cascades of Poisson processes. 2012. arXiv preprint arXiv:1203.3516.
  40. Simon N, Tibshirani RJ. Standardization and the group lasso penalty. Statist Sinica. 2012;22:983–1001.
  41. Song R, Lu W, Ma S, Jeng XJ. Censored rank independence screening for high-dimensional survival data. Biometrika. 2014;101:799–814.
  42. Tsybakov AB. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer; New York: 2009. Translated by Vladimir Zaiats.
  43. Wainwright MJ. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory. 2009;55:2183–2202.
  44. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B Stat Methodol. 2006;68:49–67.
  45. Zhou K, Zha H, Song L. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics; 2013a. pp. 641–649.
  46. Zhou K, Zha H, Song L. Learning triggering kernels for multi-dimensional Hawkes processes. Proceedings of the 30th International Conference on Machine Learning (ICML-13); 2013b. pp. 1301–1309.
  47. Zhu L. Nonlinear Hawkes processes. 2013. arXiv preprint arXiv:1304.7531.
