Author manuscript; available in PMC: 2026 Feb 14.
Published in final edited form as: Int Conf Learn Represent. 2025 Apr;2025:36822–36850.

Identifiability for Gaussian Processes with Holomorphic Kernels

Ameer Qaqish 1, Didong Li 1
PMCID: PMC12904360  NIHMSID: NIHMS2119297  PMID: 41695776

Abstract

Gaussian processes (GPs) are widely recognized for their robustness and flexibility across various domains, including machine learning, analysis of time series, spatial statistics, and biomedicine. In addition to their common usage in regression tasks, GP kernel parameters are frequently interpreted in various applications. For example, in spatial transcriptomics, estimated kernel parameters are used to identify spatially variable genes, which exhibit significant expression patterns across different tissue locations. However, before these parameters can be meaningfully interpreted, it is essential to establish their identifiability. Existing studies of GP parameter identifiability have focused primarily on Matérn-type kernels, as their spectral densities allow for more established mathematical tools. In many real-world applications, particularly in time series analysis, other kernels such as the squared exponential, periodic, and rational quadratic kernels, as well as their combinations, are also widely used. These kernels share the property of being holomorphic around zero, and their parameter identifiability remains underexplored. In this paper, we bridge this gap by developing a novel theoretical framework for determining kernel parameter identifiability for kernels holomorphic near zero. Our findings enable practitioners to determine which parameters are identifiable in both existing and newly constructed kernels, supporting application-specific interpretation of the identifiable parameters, and highlighting non-identifiable parameters that require careful interpretation.

1. Introduction

Gaussian Processes (GPs) are powerful and flexible tools extensively used across multiple fields, such as machine learning (ML), geospatial and spatiotemporal analysis, biomedicine, finance, and environmental modeling (Rasmussen and Williams, 2006; Banerjee et al., 2014; Cressie and Wikle, 2015). They serve various purposes: as regression or classification methods through GP regression or GP classification; as priors over functions in Bayesian inference (Ghosal and van der Vaart, 2017); for modeling latent distributions via Gaussian Process Latent Variable Models (GPLVM, Lawrence (2003)); and in demonstrating equivalencies to deep neural networks with infinite width (Lee et al., 2018). The flexibility of GPs as universal approximators, their inherent interpretability – especially regarding kernel parameters – and their capability to quantify uncertainty, are among their key advantages.

The kernel function, also known as the covariance function or covariogram, which defines the covariance structure within a GP, is pivotal to application and effectiveness. Over recent decades, there has been a proliferation of research into developing specialized kernels tailored for specific data types including time-series, spatial, imaging, and spatiotemporal datasets. Popular choices such as the squared exponential (SE, also known as RBF or Gaussian), rational quadratic (RQ), periodic (Per), and Matérn kernels are frequently employed, often in innovative combinations that enhance model performance (Wang et al., 2018). These combinations involve operations such as summation, multiplication, and spectral mixtures (Duvenaud et al., 2011; 2013; Kronberger and Kommenda, 2013; Wilson and Adams, 2013; Samo and Roberts, 2015; Remes et al., 2017; Cheng et al., 2019; Verma and Engelhardt, 2020), which enable the leveraging of individual kernel strengths to better capture complex data patterns.

Despite the extensive literature on GP theory and its application to regression and prediction tasks, less attention has been paid to parameter inference, particularly the identifiability and interpretability of kernel parameters. Parameter inference is critical in applications that use estimated parameters in downstream tasks such as model comparison and problem-specific parameter interpretation.

One such application is in the study of spatial transcriptomics, which measures gene expression across different tissue locations to understand cellular and tissue-level biological processes (Marx, 2021). One important task within this field is identifying spatially variable genes (SVGs), which are genes that show significant changes in expression patterns across spatial locations, among tens of thousands of genes. Svensson et al. (2018) models gene expressions as a GP across spatial coordinates using a SE kernel with nuggets: K(x, x′) = σₛ² exp(−‖x − x′‖²/(2ℓ²)) + σₑ² 1{x = x′}. The kernel parameter σₛ², estimated by the Maximum Likelihood Estimator (MLE), was then interpreted as the magnitude of the spatial effects to identify SVGs. Other applications of GP parameter inference to spatial transcriptomics include Weber et al. (2023) and Sun et al. (2020).

Another example where kernel parameter estimates are interpreted is the decomposition of the Mauna Loa CO2 time series data (Tans and Keeling, 2023) into four kernel components in the impactful book Rasmussen and Williams (2006):

K(x, x′) = K₁(x, x′) + K₂(x, x′) + K₃(x, x′) + K₄(x, x′) = θ₁² exp(−(x − x′)²/(2θ₂²)) + θ₃² exp(−(x − x′)²/(2θ₄²) − 2 sin²(π(x − x′))/θ₅²) + θ₆² (1 + (x − x′)²/(2θ₈θ₇²))^(−θ₈) + θ₉² exp(−(x − x′)²/(2θ₁₀²)) + θ₁₁² 1{x = x′}   (1)

where K₁(x, x′) is a SE kernel that captures the long-term smooth rising trend; K₂(x, x′), called the damped periodic kernel, is the product of a SE kernel and a periodic kernel and accounts for seasonal variations; K₃(x, x′) is a rational quadratic kernel that models medium-term irregularities; and K₄(x, x′) is the sum of a SE kernel and a white noise kernel, which measure correlated and independent noise respectively. This kernel is also used as an example in the tutorial of the widely used Python package “sklearn.gaussian_process”, with detailed interpretation of all 11 parameters proposed by the authors in Section 3. Although the interpretation seems reasonable, a theoretical understanding with a rigorous proof has been missing.
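As a concrete companion to this decomposition, the four components of Equation (1) can be assembled from standard building blocks in sklearn.gaussian_process.kernels. This is an illustrative sketch only: the numeric values below are placeholders chosen in the spirit of the fit reported in Rasmussen and Williams (2006), not estimates produced by this paper.

```python
import numpy as np
from sklearn.gaussian_process.kernels import (
    RBF, ExpSineSquared, RationalQuadratic, WhiteKernel, ConstantKernel as C,
)

# The theta_i^2 amplitudes enter through ConstantKernel factors; all values
# here are placeholders for illustration.
k1 = C(66.0**2) * RBF(length_scale=67.0)                              # long-term trend
k2 = C(2.4**2) * RBF(length_scale=90.0) * ExpSineSquared(
    length_scale=1.3, periodicity=1.0)                                # seasonal component
k3 = C(0.66**2) * RationalQuadratic(length_scale=1.2, alpha=0.78)     # medium-term irregularities
k4 = C(0.18**2) * RBF(length_scale=1.6) + WhiteKernel(noise_level=0.19**2)  # noise
kernel = k1 + k2 + k3 + k4

X = np.linspace(0.0, 10.0, 20).reshape(-1, 1)
K = kernel(X)
# Any valid kernel must yield a symmetric positive semi-definite Gram matrix.
print(K.shape, np.allclose(K, K.T))
```

Here ExpSineSquared plays the role of the periodic factor in K₂, and WhiteKernel the role of the indicator term in K₄.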

Although parameter identifiability in a GP model might seem straightforward at first glance, it is a challenging and nuanced problem. In fact, not all parameters in widely used GP kernels are identifiable: if a parameter is not identifiable, consistent estimation and subsequent interpretation are impossible. For example, for the Matérn kernel in dimension p ≤ 3 with spatial variance σ², lengthscale ℓ, and known smoothness parameter ν, Zhang (2004) proved that neither σ² nor ℓ is identifiable or consistently estimable, no matter how sophisticated the estimator is. In fact, the only identifiable parameter in the Matérn kernel, termed the microergodic parameter, is σ²/ℓ^(2ν). Follow-up studies for a single Matérn kernel include Anderes (2010); Kaufman and Shaby (2013); Li (2022); Li et al. (2023), and Chen et al. (2024) for a linear combination of Matérns with different smoothness. Such negative results raise a natural question: are all parameters used in practice, including those in Equation (1), identifiable, so that their interpretations are justified?

As far as we are aware, identifiability of the parameters in Equation (1) has not been proven before. More importantly, there is still a gap in the literature for more complicated kernel combinations like those popularized in ML, especially when the combinations involve periodic kernels. The lack of theoretical examination is partly due to the failure of traditional methods used to study GP parameter identifiability, such as the integral test (Stein, 1999), which requires conditions on the spectral density not met by common kernels like SE, Per, and RQ, and even more so when these kernels are combined. This necessitates the development of new analytic tools to better understand kernel parameter identifiability and interpretability.

Motivated by these observations and challenges, this paper proves a general theorem (Theorem 3.4) that determines all the identifiable functions of the parameters in any family of stationary kernels holomorphic around 0. The result applies to complex combinations of kernels, particularly those common in the ML community, such as the one used in Equation (1). We demonstrate that all parameters in this kernel are identifiable under mild constraints, supporting the interpretation of kernel parameters in Rasmussen and Williams (2006) and the “sklearn.gaussian_process” Python package tutorial. Additionally, we establish a general result that is used to determine the identifiable functions of the parameters for a kernel that is a sum of products of other kernels.

The paper is organized as follows. Section 2 provides a comprehensive background, introduces the necessary notation and concepts, and reviews the relevant literature. Section 3 presents our main theoretical contributions. Section 4 contains simulation studies to support our theories, followed by Section 5 with a discussion of limitations and future work. A brief discussion of the connection between parameter identifiability and prediction is given in Appendix C.

2. Background

This section defines key concepts and notations and summarizes existing literature on GP kernel identifiability and interpretability. We begin with the definition of GPs.

2.1. Gaussian process

Definition 1 (GP).

A stochastic process Z is said to follow a GP in domain Ω with a mean function μ: Ω → ℝ and a positive definite covariance/kernel/covariogram function K: Ω × Ω → ℝ if for all x₁, …, xₙ ∈ Ω,

(Z(x₁), …, Z(xₙ)) ~ N(v, Σ),  v = (μ(x₁), …, μ(xₙ)),  Σᵢⱼ = K(xᵢ, xⱼ).

For our study, as well as for presentation simplicity, we assume μ = 0 without loss of generality (Stein, 1999). In this situation, since the distribution of Z is completely determined by K, we sometimes call K a GP, referring to the GP with covariance kernel K.
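To make the finite-dimensional definition concrete, here is a minimal sketch of drawing one GP marginal with a squared exponential kernel (introduced formally in Section 2.2); the values σ² = 1 and ℓ = 0.2 are arbitrary choices for illustration.

```python
import numpy as np

def se_kernel(x1, x2, sigma2=1.0, ell=0.2):
    """Squared exponential kernel K(x, x') = sigma2 * exp(-(x - x')^2 / (2 ell^2))."""
    return sigma2 * np.exp(-((x1 - x2) ** 2) / (2.0 * ell**2))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
K = se_kernel(x[:, None], x[None, :])          # covariance matrix Sigma_ij = K(x_i, x_j)
# A finite-dimensional GP marginal is just a multivariate normal N(0, Sigma);
# a small diagonal jitter keeps the factorization numerically stable.
z = rng.multivariate_normal(np.zeros(50), K + 1e-8 * np.eye(50))
print(z.shape)  # (50,)
```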

Throughout this paper, we focus on the infill domain (also known as fixed domain or interpolation), i.e., the domain Ω = [0, T]^p does not grow with sample size, a situation commonly considered in the literature (Stein, 1999). Next, we introduce the commonly accepted stationarity assumption:

Definition 2 (Stationarity).

K is called stationary if K(x, x′) = K(x + h, x′ + h) for all h ∈ ℝ^p.

For a stationary kernel, we can reformulate the kernel as a function on ℝ^p instead of ℝ^p × ℝ^p via K₀(x) := K(x, 0). Stationarity is a common assumption in the GP literature due to its satisfactory practical performance and its simplicity in both implementation and theoretical analysis. Throughout this paper, we focus only on stationary kernels, and we still denote the simplified kernel by K without causing any confusion.

2.2. Kernels

We first note that all kernel functions considered in this paper are continuous functions unless noted otherwise. Then we introduce the following commonly used kernels in Table 1.

Table 1:

Example kernels, parameters, and domain dimension

Name | K(x) | Parameters | Dimension
SE | σ² exp(−‖x‖²/(2ℓ²)) | σ², ℓ > 0 | p ≥ 1
Per | σ² exp(−2 sin²(πx/γ)/ℓ²) | σ², ℓ, γ > 0 | p = 1
RQ | σ² (1 + ‖x‖²/(2αℓ²))^(−α) | σ², ℓ, α > 0 | p ≥ 1
Matérn | σ² (2^(1−ν)/Γ(ν)) (√(2ν)‖x‖/ℓ)^ν K_ν(√(2ν)‖x‖/ℓ) | σ², ℓ, ν > 0 | p ≥ 1

In this table, σ² is called the spatial variance, or partial sill, and measures the point-wise variance; ℓ is called the length scale and measures the spatial dependency; γ > 0 is the period parameter; α is called the scale mixture parameter; ν is called the smoothness parameter. Among them, Per is well-defined only when p = 1, i.e., Ω = [0, T] is a closed interval, while the others are well-defined on ℝ^p for any p.

Each individual kernel in the above table captures some unique behavior in the process Z. However, when the process has complicated structure, a common approach is to combine some of these kernels to create a new one. Such a combination can be simply a sum of products of these kernels, which is guaranteed to be a positive definite function.
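The closure property can be checked numerically: a sum of products of the kernels in Table 1 still yields a positive semi-definite Gram matrix. The sketch below uses plain implementations with arbitrary parameter values.

```python
import numpy as np

# Plain implementations of three kernels from Table 1 (arbitrary parameters).
def se(d, sigma2, ell):
    return sigma2 * np.exp(-d**2 / (2 * ell**2))

def per(d, sigma2, ell, gamma):
    return sigma2 * np.exp(-2 * np.sin(np.pi * d / gamma) ** 2 / ell**2)

def rq(d, sigma2, ell, alpha):
    return sigma2 * (1 + d**2 / (2 * alpha * ell**2)) ** (-alpha)

x = np.linspace(0.0, 1.0, 40)
D = x[:, None] - x[None, :]
# A sum of products of valid kernels is again a valid kernel, so the Gram
# matrix below must be positive semi-definite (up to floating-point error).
K = se(D, 1.0, 0.5) * per(D, 1.0, 1.0, 0.3) + rq(D, 0.5, 0.2, 2.0)
print(np.linalg.eigvalsh(K).min() > -1e-8)  # True
```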

The following example extends Equation (1) used in Rasmussen and Williams (2006) to study the Mauna Loa CO2 time series data:

K_θ(x) = θ₁² exp(−x²/(2θ₂²)) + θ₃² exp(−x²/(2θ₄²) − 2 sin²(πx/γ)/θ₅²) + θ₆² (1 + x²/(2θ₈θ₇²))^(−θ₈) + θ₉² exp(−x²/(2θ₁₀²)) + θ₁₁² 1{x = 0},   (2)

where θ = (θ₁, …, θ₁₁, γ) is the vector of all parameters in the above kernel.

Note that this kernel is more flexible than the one in Equation (1), which assumes the period γ = 1. We adopt this more challenging modification because the period is sometimes unknown in practice, so practitioners have to estimate it from the data.

2.3. Identifiability

The study of identifiability of GP kernel parameters relies on the notion of equivalence of measures defined below:

Definition 3 (Equivalence of measures).

Two measures P₁ and P₂ are said to be equivalent if they are absolutely continuous with respect to each other, denoted by P₁ ≡ P₂. That is, P₁(A) = 0 ⟺ P₂(A) = 0. Two measures are said to be orthogonal, denoted by P₁ ⊥ P₂, if there exists a measurable set A such that P₁(A) = 0 but P₂(A) = 1.

Two GP laws P₁ and P₂ are either equivalent or orthogonal (Feldman, 1958); orthogonality means they assign probability 1 to disjoint sets: P₁(Aᶜ) = 1 and P₂(A) = 1. We define the identifiability of GP parameters as follows:

Definition 4 (Microergodicity).

Let {K_θ}_{θ∈Θ} be a family of covariance kernels of a GP. Then a function h = h(θ) of θ is said to be microergodic if K_{θ₁} ≡ K_{θ₂} ⟺ h(θ₁) = h(θ₂).

If h and h̃ are both microergodic, then h(θ₁) = h(θ₂) ⟺ h̃(θ₁) = h̃(θ₂), so h and h̃ are related by a bijection. Thus the microergodic function is unique up to a bijective transformation, and it makes sense to speak of ‘the’ microergodic function h.

Definition 5 (Identifiability).

Let {K_θ}_{θ∈Θ} be a family of covariance kernels of a GP. A function g = g(θ) of θ is said to be identifiable if K_{θ₁} ≡ K_{θ₂} ⟹ g(θ₁) = g(θ₂), or equivalently, if g is a function of the microergodic function h. We say that the family {K_θ}_{θ∈Θ} is identifiable if θ is identifiable.

Note that a consistent estimator of g(θ) can exist only when g(θ) is identifiable – when g(θ) is not identifiable, say K_{θ₁} ≡ K_{θ₂} with g(θ₁) ≠ g(θ₂), it is not possible to find a consistent estimator of g(θ), since there is no way to distinguish between data generated from K_{θ₁} and data generated from K_{θ₂} almost surely (see Stein (1999); Zhang (2004) for more detailed discussion). Thus anything that can be consistently estimated is identifiable. The microergodic function h(θ) is the maximal identifiable function, so knowing the microergodic function completely solves the identifiability problem for the family of kernels. However, in some cases it is difficult to fully determine the microergodic function h(θ), whereas it is easier to show that some specific function g(θ) is identifiable.

To study the identifiability of GP kernel parameters, it suffices to determine when two GPs in the same parametric family are equivalent. However, to determine whether two GPs are equivalent is not an easy task, and the methods for doing so highly depend on the form of the kernels. There is a rich literature focusing on identifiability for Matérn kernels, where it has been shown that

K_{σ₁², ℓ₁, ν₁} ≡ K_{σ₂², ℓ₂, ν₂} ⟺ (σ₁²/ℓ₁^(2ν₁), ν₁) = (σ₂²/ℓ₂^(2ν₂), ν₂) when 1 ≤ p ≤ 3, and ⟺ (σ₁², ℓ₁, ν₁) = (σ₂², ℓ₂, ν₂) when p ≥ 5.

That is, when the domain dimension is greater than or equal to 5, h(σ², ℓ, ν) = (σ², ℓ, ν), so all three parameters are identifiable (Anderes, 2010; Bolin and Kirchner, 2023); when the domain dimension is less than or equal to 3, h(σ², ℓ, ν) = (σ²/ℓ^(2ν), ν), so ν is identifiable (Loh et al., 2021) but σ² and ℓ are not (Zhang, 2004). As a result, there is no consistent estimator of σ² or ℓ; instead, a consistent estimator of σ²/ℓ^(2ν), called the microergodic parameter, does exist (Kaufman and Shaby, 2013; Loh et al., 2021), namely the MLE. The microergodic function for p = 4 is an open problem.
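A heuristic numerical illustration (not a proof, which requires comparing the GP laws) of why σ²/ℓ^(2ν) governs the fine-scale behavior: for ν = 1/2 the Matérn kernel reduces to σ² exp(−|x|/ℓ), and two parameter sets sharing σ²/ℓ = 2 produce the same local decay rate at the origin.

```python
import numpy as np

def matern_half(d, sigma2, ell):
    """Matérn kernel with nu = 1/2, which reduces to sigma2 * exp(-|d| / ell)."""
    return sigma2 * np.exp(-np.abs(d) / ell)

d = 1e-6
# Two parameter sets sharing the microergodic value sigma^2 / ell^(2 nu) = 2:
slope1 = (matern_half(0.0, 1.0, 0.5) - matern_half(d, 1.0, 0.5)) / d
slope2 = (matern_half(0.0, 2.0, 1.0) - matern_half(d, 2.0, 1.0)) / d
# The local decay rate near the origin, the quantity infill data can pin down, matches.
print(round(slope1, 3), round(slope2, 3))  # 2.0 2.0
```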

Although the identifiability of the Matérn kernel is now well understood, the study of other kernels, including Per and RQ, is much sparser. The key reason is that the tool used to study equivalence between Matérn kernels, known as the integral test (Stein, 1999), requires strong conditions on the spectral densities of the kernels, which are often not satisfied by other kernels. The spectral density is defined below:

Definition 6 (Spectral measure).

For a stationary kernel K, its spectral measure, denoted by F, is defined through

K(x) = ∫_{ℝ^p} e^{iω·x} F(dω).

Bochner’s theorem guarantees the existence and uniqueness of F. The density of F w.r.t. the Lebesgue measure dω, denoted by f, if it exists, is called the spectral density.

The condition to use the integral test is that f(ω)|ω|^α → 1 as |ω| → ∞ for some α > 0; that is, the spectral density is required to behave like |ω|^(−α) for some positive α. The spectral density of the Matérn kernel is f(ω) = σ² (2^p π^(p/2) Γ(ν + p/2) (2ν)^ν)/(Γ(ν) ℓ^(2ν)) (2ν/ℓ² + ‖ω‖²)^(−(ν + p/2)) ≍ ‖ω‖^(−2ν−p) (Rasmussen and Williams, 2006, p. 84) (note that we use a different Fourier transform convention than Rasmussen and Williams (2006)). However, this condition is not met by SE, Per, or RQ, as their spectral densities decay very rapidly due to the infinite differentiability of the kernels.
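As a quick numerical check of the spectral representation, consider the ν = 1/2, p = 1 case, where the Matérn kernel is σ² exp(−|x|/ℓ) and, under the convention K(x) = ∫ e^{iωx} f(ω) dω, the spectral density is the scaled Cauchy density f(ω) = (σ²/π) ℓ/(1 + ℓ²ω²); numerical integration recovers the kernel.

```python
import numpy as np

sigma2, ell, x = 1.0, 0.7, 0.4
w = np.linspace(-2000.0, 2000.0, 2_000_001)
dw = w[1] - w[0]
# Spectral density of the nu = 1/2 Matérn kernel in p = 1 under the convention
# K(x) = integral of e^{i w x} f(w) dw: a Cauchy density scaled by sigma^2.
f = (sigma2 / np.pi) * ell / (1.0 + ell**2 * w**2)
K_numeric = np.sum(f * np.cos(w * x)) * dw      # real part of the inverse transform
K_exact = sigma2 * np.exp(-abs(x) / ell)
print(abs(K_numeric - K_exact) < 1e-4)  # True
```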

Due to the popularity of these kernels in ML, we aim to address these challenges and study the equivalence of GPs with these kernels and their combinations. The next section provides theoretical support for the success of these kernels in terms of identifiability and interpretability.

3. Theory

In this section, we present our main theory regarding equivalence of GPs, as outlined in the previous sections. We first determine the identifiable parameters of the individual kernels used in Equation (2), i.e., SE, Per, and RQ, with some extensions.

Theorem 3.1.

The microergodic functions of five individual kernels, including all four components K₁, K₂, K₃, K₄ in Equation (2) and an additional kernel, Cosine, are summarized in Table 2.

Table 2:

Microergodic functions of five kernels

Name | K(x) | Parameters | Microergodic function | p
SE | σ² exp(−x²/(2ℓ²)) | σ², ℓ > 0 | (σ², ℓ) | ≥ 1
Per | σ² exp(−2 sin²(πx/γ)/ℓ²) | σ², ℓ, γ > 0 | (σ², ℓ, γ) | 1
Damped Per | σ² exp(−x²/(2ℓ₁²) − 2 sin²(πx/γ)/ℓ₂²) | σ², ℓ₁, ℓ₂, γ > 0 | (σ², ℓ₁, ℓ₂, γ) | 1
RQ | σ² (1 + x²/(2αℓ²))^(−α) | σ², α, ℓ > 0 | (σ², α, ℓ) | ≥ 1
Cosine | σ² cos(s·x) | σ² > 0, s ∈ ℝ^p, s₁ > 0 | s | ≥ 1

Theorem 3.1 supports the identifiability and interpretability of each kernel parameter in SE, Per and RQ, as discussed in Section 2.2. In addition, we include the cosine kernel, which will be revisited later in this section.

Then we consider the combination of SE, Per and RQ in Equation (2), an extension of the kernel used by the impactful book Rasmussen and Williams (2006) and the tutorial of the widely used Python package “sklearn.gaussian_process”.

Theorem 3.2.

All parameters in Equation (2) are identifiable provided θ₁₀, the length-scale of the SE component modeling the correlated noise, is less than θ₂, the length-scale of the SE component modeling the long-term trend.

Such a constraint is necessary, and not surprising: if θ₂ = θ₁₀, then we can merge the two SE components into a single SE, (θ₁² + θ₉²) exp(−x²/(2θ₂²)), making θ₁² + θ₉² identifiable instead of θ₁² and θ₉² individually. This distinction between the two SE components is also discussed in Section 5.4.3 of Rasmussen and Williams (2006). Excluding this trivial case, all parameters are identifiable. As a consequence, these parameters are interpretable, as discussed in the same section of Rasmussen and Williams (2006). For example, θ₁ measures the amplitude and θ₂ the characteristic length-scale of the long-term smooth rising trend; within the seasonal trend, θ₃ gives the magnitude, θ₄ the decay time of the periodic component, γ the period, and θ₅ the smoothness of the periodic component; for the (small) medium-term irregularities, θ₆ is the magnitude, θ₇ the typical length-scale, and θ₈ the shape parameter determining the diffuseness of the length-scales; θ₉ is the magnitude of the correlated noise component, θ₁₀ its length-scale, and θ₁₁ the magnitude of the independent noise component.
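The merging phenomenon behind the constraint θ₁₀ < θ₂ is easy to verify numerically: when the two SE length-scales coincide, only the sum θ₁² + θ₉² affects the kernel. A minimal sketch with arbitrary values:

```python
import numpy as np

def se(d, amp2, ell):
    return amp2 * np.exp(-d**2 / (2 * ell**2))

d = np.linspace(-3.0, 3.0, 200)
# With theta_2 = theta_10 = 2, two different splits of the total amplitude
# theta_1^2 + theta_9^2 = 4 yield exactly the same kernel:
K_a = se(d, 1.0, 2.0) + se(d, 3.0, 2.0)
K_b = se(d, 2.5, 2.0) + se(d, 1.5, 2.0)
print(np.allclose(K_a, K_b))  # True
```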

Now we would like to answer the following more challenging question with a broader implication: Given a new kernel, how do we determine the microergodic function? Specifically, if we combine a finite number of kernels, such as those in Table 2, by finite multiplication and addition like Equation (2), what is the microergodic function of the resulting kernel? To answer these questions, we need to introduce the following notions first.

Lemma 3.3 (Kernel decomposition).

For any stationary kernel K, K can be uniquely decomposed as K = Kᶜ + Kᵈ, where Kᶜ is a kernel with continuous spectral measure and Kᵈ is a kernel with discrete spectral measure.

A direct consequence is that if K admits a spectral density, then K = Kᶜ, while if K is periodic, then K = Kᵈ. We call Kᶜ the continuous component and Kᵈ the discrete component. Note that this notion differs from continuity of the kernel as a function; we do assume all kernels are continuous functions themselves. The continuous/discrete distinction here is at the level of the spectrum. For example, the Per kernel is continuous as a function but has a purely discrete spectrum. Moreover, we denote the spectral measure of Kᶜ by Fᶜ and the spectral measure of Kᵈ by Fᵈ, so that F = Fᶜ + Fᵈ is the spectral measure of K.
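The discrete spectrum of the Per kernel can be inspected numerically: since the kernel is γ-periodic, its spectral measure is supported on the frequencies 2πk/γ, with nonnegative weights given by the Fourier coefficients of one period. A sketch using an FFT of one period (arbitrary parameter values):

```python
import numpy as np

sigma2, ell, gamma = 1.0, 0.8, 1.0
n = 1024
x = np.arange(n) * gamma / n                      # one period of the kernel
K = sigma2 * np.exp(-2.0 * np.sin(np.pi * x / gamma) ** 2 / ell**2)
coef = np.real(np.fft.fft(K)) / n                 # discrete weights F_d({2*pi*k/gamma})
# A periodic positive definite function has nonnegative Fourier coefficients
# that sum to K(0), the total mass of the spectral measure.
print(coef.min() >= -1e-12, np.isclose(coef.sum(), K[0]))  # True True
```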

Such a decomposition offers deeper insight into different types of kernels. Moreover, to understand the equivalence of two GPs, it suffices to understand the equivalence of their continuous components and discrete components separately, as given by the following key theorem, which is the main result of the paper:

Theorem 3.4.

Given two kernels K₁ and K₂ with K₁ holomorphic on some ball around 0 in ℂ^p, the p-dimensional complex space, K₁ ≡ K₂ if and only if the following two conditions hold:

  1. K₁ᶜ(x) = K₂ᶜ(x) for every x ∈ ℝ^p.

  2. There exist c, C > 0 such that c F₁ᵈ({ω}) ≤ F₂ᵈ({ω}) ≤ C F₁ᵈ({ω}) for all ω ∈ ℝ^p, and Σ_{ω: F₁ᵈ({ω})>0} (1 − F₂ᵈ({ω})/F₁ᵈ({ω}))² < ∞.

Note that K: ℝ^p → ℝ is said to be holomorphic on a ball around 0 in ℂ^p if it has a holomorphic extension K̃ to some ball B ⊆ ℂ^p around 0 such that K̃ = K on B ∩ ℝ^p. While being holomorphic on a ball around 0 is a stronger condition than being infinitely differentiable, most infinitely differentiable kernels used in practice, including all those in Table 2, are holomorphic on a ball around 0. Condition 1 means the continuous components of F₁ and F₂ are the same, while Condition 2 means the discrete components of F₁ and F₂ have the same support and their relative difference, although allowed to be nonzero, must decay fast enough.

Notably, Theorem 3.4 provides a general pipeline to study the identifiability of kernel parameters, summarized in the following theorem:

Theorem 3.5.

Let {K_θ}_{θ∈Θ} be a family of stationary kernels on ℝ^p, each of which is holomorphic on some ball around 0 in ℂ^p. We have the following assertions regarding the microergodic function:

  1. If h(θ) is microergodic for the continuous component {K_θᶜ}_{θ∈Θ} and g(θ) is microergodic for the discrete component {K_θᵈ}_{θ∈Θ}, then (h(θ), g(θ)) is microergodic for {K_θ}_{θ∈Θ}.

  2. Moreover,
    1. h(θ) is microergodic for {K_θᶜ}_{θ∈Θ} if and only if
      K_{θ₁}ᶜ(x) = K_{θ₂}ᶜ(x) ∀x ∈ ℝ^p ⟺ h(θ₁) = h(θ₂).
    2. g(θ) is microergodic for {K_θᵈ}_{θ∈Θ} if and only if
      [∃ c, C > 0 s.t. c F_{θ₁}ᵈ({ω}) ≤ F_{θ₂}ᵈ({ω}) ≤ C F_{θ₁}ᵈ({ω}) ∀ω ∈ ℝ^p, and Σ_{ω: F_{θ₁}ᵈ({ω})>0} (1 − F_{θ₂}ᵈ({ω})/F_{θ₁}ᵈ({ω}))² < ∞] ⟺ g(θ₁) = g(θ₂).

That is, in order to find the microergodic function of a parametric family of kernels K_θ, it suffices to find the microergodic function h(θ) of the continuous component and g(θ) of the discrete component separately. Moreover, to find h(θ), it suffices to understand when two continuous components are equal everywhere; to find g(θ), we need to investigate condition 2b on the discrete measure Fᵈ.

Having established the foundational aspects of kernel identifiability, we now apply our results to determine the microergodic function of various combinations of kernels. Our general strategy is to use Fourier transform identities to compute the spectral measure of the combined kernel (see, for example, Theorem B.6) and then apply Theorem 3.4. These combinations not only illustrate the practical applications of our theoretical findings in Theorem 3.4, but also provide insights into designing new kernels with desired properties. We start with the squared exponential kernel with automatic relevance determination (ARD).

Theorem 3.6.

For the family

K_{σ,M}(x) = σ² exp(−½ xᵀMx),

where σ² > 0 and M is a positive-definite matrix, the microergodic function is (σ², M).

Next, we study the sum of cosine kernels:

Theorem 3.7.

For the family

K_{σ₁,…,σₘ,s₁,…,sₘ}(x) = σ₁² cos(s₁x) + σ₂² cos(s₂x) + ⋯ + σₘ² cos(sₘx),

where σ₁², …, σₘ² > 0 and s₁, …, sₘ ∈ ℝ, under the natural constraint 0 ≤ s₁ < s₂ < ⋯ < sₘ, the microergodic function is (s₁, …, sₘ).

Theorem 3.7 shows that when cosine kernels are combined linearly, their individual frequencies (or periods) remain identifiable, provided they are distinct. This scenario often arises in signal processing, where different periodic components need to be isolated and identified. Notably, the last kernel in Table 2 (Cosine) is a special case of the kernel in Theorem 3.7 with m = 1.

Next, we study the product of Cosine kernels:

Theorem 3.8.

For the family

K_{σ,s₁,…,sₘ}(x) = σ² cos(s₁x) cos(s₂x) ⋯ cos(sₘx),

where σ² > 0 and s₁, …, sₘ ∈ ℝ, under the natural constraint 0 < s₁ ≤ s₂ ≤ ⋯ ≤ sₘ, the microergodic function is {±s₁ ± s₂ ± ⋯ ± sₘ} := {a₁s₁ + a₂s₂ + ⋯ + aₘsₘ : a₁, a₂, …, aₘ ∈ {−1, 1}}. If m = 1, 2, 3, then the microergodic function simplifies to (s₁, …, sₘ).

In Theorem 3.8, when m ≥ 4, we do not have identifiability of (s₁, …, sₘ). For example, for m = 4, when (s₁, s₂, s₃, s₄) = (1, 2, 2, 3) and (s̃₁, s̃₂, s̃₃, s̃₄) = (1, 1, 3, 3), the values of the microergodic function coincide, that common value being {0, ±2, ±4, ±6, ±8}. Theorem 3.8 shows that for a product of discrete-spectrum kernels that are all functions of the same variable x, the parameters of each individual kernel may not be identifiable.
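The coincidence in this example is easy to verify by enumerating the sign set from Theorem 3.8:

```python
from itertools import product

def sign_set(freqs):
    """All values a_1 s_1 + ... + a_m s_m with a_i in {-1, +1} (Theorem 3.8)."""
    return {sum(a * s for a, s in zip(signs, freqs))
            for signs in product((-1, 1), repeat=len(freqs))}

# The two distinct frequency vectors from the example share the same sign set,
# so the corresponding product-of-cosine kernels cannot be told apart.
print(sign_set((1, 2, 2, 3)) == sign_set((1, 1, 3, 3)))  # True
print(sorted(sign_set((1, 2, 2, 3))))  # [-8, -6, -4, -2, 0, 2, 4, 6, 8]
```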

Finally, we explore the sum of periodic kernels as previously discussed.

Theorem 3.9.

Let K_{ℓ,γ} denote the periodic kernel with variance parameter 1, length-scale ℓ, and period γ. For the family

K_{σ₁,σ₂,γ₁,γ₂}(x) = σ₁² K_{ℓ₁,γ₁}(x) + σ₂² K_{ℓ₂,γ₂}(x),

where σ₁², σ₂², ℓ₁, ℓ₂ > 0 and γ₁ > γ₂ > 0, the microergodic function is (σ₁², ℓ₁, γ₁, σ₂², ℓ₂, γ₂); that is, all parameters are identifiable.

This result is crucial for scenarios where multiple periodic processes operate at different scales or periods, as often encountered in geospatial, financial, and environmental data analysis.

4. Simulation

In this section, we provide empirical support to our theoretical results on kernel parameter identifiability, presented in Section 3, by investigating the behavior of the maximum likelihood estimators (MLEs) as the sample size n increases.

Before moving to the simulation details, we would like to clarify the broader picture of parameter inference for GPs, which involves three steps: first, determining which parameters are identifiable; second, finding a consistent estimator of the identifiable parameters, such as the MLE or other estimators; and third, developing numerical methods to compute these estimators. While the second and third steps are crucial, they fall beyond the scope of this paper, which focuses solely on the first step – a theoretical framework to find all the identifiable parameters. In fact, even for simple kernels like the SE and Matérn kernels, whether the MLE is consistent remains open (Loh and Sun, 2023).

Despite these complexities, we use standard optimization packages commonly applied in the GP literature to find the MLEs. Our simulations are not intended to solve the open problem of MLE consistency or introduce new numerical techniques; rather, they serve to illustrate the theoretical results on identifiability through practical examples.

We start from individual kernels, followed by the combination in Equation (2).

4.1. Individual kernels

We consider the individual kernels: SE, Damped Per (DPer), Per, RQ, and Cosine. For the cosine kernel, we parameterize in terms of the period γ, so that s = 2π/γ. Input samples are generated by adding a Uniform(−1/(4n), 1/(4n)) random shift to n evenly spaced points in [1/(4n), 1 − 1/(4n)], where n ∈ {500, 1000, 2000, 5000}. After generating the outcomes by sampling a GP with the given kernel at the inputs, we added independent Gaussian noise from N(0, ε), ε = 0.01, to model measurement errors (see Section D of the appendix for the experiments repeated with ε = 0.1). All kernel parameters were estimated by MLE, with 100 replicates for each kernel configuration to assess the convergence of the MLEs. The results are summarized in Figure 1. These boxplots demonstrate that the MLEs of all parameters except σ² in the cosine kernel appear consistent, in agreement with their identifiability proved in Theorem 3.1.
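A sketch of the input-generation scheme described above; the jitter is smaller than half the grid spacing, so the points stay ordered and inside [0, 1].

```python
import numpy as np

def jittered_inputs(n, rng):
    """n evenly spaced points in [1/(4n), 1 - 1/(4n)] plus Uniform(-1/(4n), 1/(4n)) shifts."""
    base = np.linspace(1.0 / (4 * n), 1.0 - 1.0 / (4 * n), n)
    return base + rng.uniform(-1.0 / (4 * n), 1.0 / (4 * n), size=n)

rng = np.random.default_rng(1)
x = jittered_inputs(500, rng)
print(x.shape, bool(np.all(np.diff(x) > 0)), bool(x.min() >= 0.0 and x.max() <= 1.0))
```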

Figure 1: Simulation results for various kernel types. Each subfigure shows boxplots of the MLEs for the corresponding kernel, with the ground truth shown as a horizontal dashed line.

Some of the MLE standard deviations appear to plateau for large n. One explanation for this is numerical limitations – for our squared exponential simulation, where σ² = 1 and ℓ = 1/500, the condition numbers of the covariance matrix of the observations are 6.82×10¹, 1.80×10⁸, 1.89×10¹⁹, and 3.37×10²⁴ for sample sizes 500, 1000, 2000, and 5000, respectively.
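The conditioning issue can be reproduced approximately on a plain evenly spaced grid on [0, 1] (our simulations use jittered inputs, so the exact values differ from those quoted above):

```python
import numpy as np

def se_cov(n, sigma2=1.0, ell=1.0 / 500.0):
    x = np.linspace(0.0, 1.0, n)
    d = x[:, None] - x[None, :]
    return sigma2 * np.exp(-d**2 / (2.0 * ell**2))

# Condition numbers of the SE covariance matrix grow explosively as the grid
# densifies relative to the length-scale, limiting attainable numerical accuracy.
conds = [np.linalg.cond(se_cov(n)) for n in (500, 1000, 2000)]
print([f"{c:.1e}" for c in conds])
```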

The failure of the MLE of σ² in the cosine kernel to converge is in agreement with Theorem 3.1: only γ is microergodic for the cosine kernel, so σ² is not identifiable. In fact, if we treat γ as known and let the noise variance ε decrease to 0, then since the covariance matrix has rank 2 for all n ≥ 2, it can be shown that the MLE σ̂²(ε) converges to σ̂²(0) ~ σ²χ₂²/2.

4.2. The combined kernel in Equation (2)

Then, we study the combined kernel, one motivating example of this paper, defined in Equation (2). Since the kernel was proposed for forecasting the CO2 level in the Mauna Loa dataset, we set the time interval to [0, 45], representing a time span of 45 years. Input samples are generated by adding a Uniform(−45/(4n), 45/(4n)) random shift to n evenly spaced points in [45/(4n), 45 − 45/(4n)], where n ∈ {50, 100, 200, 500}. All kernel parameters were estimated by MLE, with 100 replicates to assess the convergence of the MLEs. Moreover, to further mimic this dataset, the ground-truth parameters and noise variance θ₁₁² are set to the MLEs learned by running the Gaussian process regression module of the scikit-learn Python package. All true parameter values are given in Table 3.

In Figure 2, we again observe that the MLEs are generally unbiased, but for some parameters the variance does not strictly decrease with sample size. This is likely due to the relatively large number of parameters (10) compared to the maximum sample size of 500.

Figure 2: MLEs of the parameters in Equation (2), with the ground truth shown as a horizontal dashed line.

5. Discussion

This paper has introduced a novel analytical framework that advances the theory of identifiability of kernel parameters in GPs for a large class of kernels, namely those holomorphic around 0. We have demonstrated that all parameters in certain combinations of kernels, such as the kernel employed for the Mauna Loa CO2 time series data, are indeed identifiable. This establishes a robust theoretical foundation for selecting or constructing GP kernels and determining the identifiable functions of the parameters in practical applications.

Looking ahead, several avenues of future research present themselves as particularly promising and interesting. First, while establishing the identifiability of kernel parameters is a critical step, it does not necessarily guarantee the consistency of the MLE. The analysis of MLEs is complicated due to the complex nature of the likelihood function involved, which is often multi-modal and difficult to handle. Second, extending our theoretical framework to encompass non-stationary kernels could enhance the flexibility of GPs in modeling data with evolving trends and dynamics. This area is notably challenging due to the current limitations in mathematical tools available, presenting a largely open problem in the field. Third, another intriguing direction for research involves extending our findings to infinitely differentiable kernels that are not holomorphic near 0, though most infinitely differentiable kernels used in applications are holomorphic near 0.

Supplementary Material

1

Acknowledgment:

AQ was supported by NIH grant R37 AI029168; DL was supported by NIH grants P30 ES010126, R01 HL149683, R01 HL173044, R01 LM014407, R56 LM013784, UM1 TR004406.

Footnotes

Reproducibility Statement: All code used to produce the results of this paper is provided in Appendix A. Complete proofs of all lemmas and theorems stated in the paper are provided in Appendix B.

Ethics Statement: Our paper does not deal with sensitive experiments, data, or any methods that can be expected to cause harm. We have no conflicts of interest and have no data privacy concerns.

