2026 Mar 7;93(2):48. doi: 10.1007/s00245-026-10392-5

Random Neural Networks for Rough Volatility

Antoine Jacquier 1,2, Žan Žurič 1
PMCID: PMC12967488  PMID: 41804376

Abstract

We construct a deep learning-based numerical algorithm to solve path-dependent partial differential equations arising in the context of rough volatility. Our approach builds on the representation of the PDE as the solution to a backward stochastic differential equation (BSDE), following recent insights by Bayer, Qiu and Yao, and on a neural network of reservoir type, as originally developed by Gonon, Grigoryeva and Ortega. The reservoir approach allows us to formulate the optimisation problem as a simple least-squares regression, for which we prove theoretical convergence properties.

Keywords: Rough volatility, SPDEs, Neural networks, Reservoir computing

Introduction

In recent years, a fundamental shift has taken place in quantitative finance, from classical modelling towards so-called rough stochastic volatility models. These “rough” models were first proposed by Gatheral, Jusselin, Rosenbaum [27] and by Bayer, Gatheral, Friz [4], and have since sparked a great deal of research, owing to their ability to capture stylised facts of volatility time series and of option prices more accurately, while remaining parsimonious. In essence, they form a class of continuous-path stochastic volatility models in which the instantaneous volatility is driven by a stochastic process with paths rougher than those of Brownian motion, typically modelled by a fractional Brownian motion [50] with Hurst parameter H ∈ (0,1). The reason for this drastic paradigm shift can be found not only under the historical measure, where the roughness of the time series of daily log-realised variance estimates suggests a Hölder regularity of H ≈ 0.1, but also under the pricing measure, where rough volatility models are able to reproduce the power-law behaviour of the at-the-money (ATM) volatility skew. Since then, a slew of papers has appeared, providing closed-form expressions for the characteristic function of the rough Heston model [21], machine learning techniques for calibration [39], microstructural foundations [20], and option pricing partial differential equation (PDE) solvers [6, 45], among others. A full overview can be found in the recent monograph [5].

Dating back to Black and Scholes [12], PDEs have been used to model the evolution of the prices of European-style options. However, rough volatility models give rise to a non-Markovian framework, in which the value function of a European option is no longer deterministic; instead, it is random and satisfies a backward stochastic partial differential equation (BSPDE), as shown in [6]. Moreover, even in classical diffusive models, the so-called curse of dimensionality poses a challenge when solving PDEs in high dimension; until recently, only the backward stochastic differential equation (BSDE) approach of [54] was available to tackle this, and it is hardly feasible beyond dimension six.

On a positive note, machine learning methods have spread throughout quantitative finance in recent years, and neural networks in particular have become a powerful tool to overcome problems in high-dimensional situations, because of their superior computational performance across a wide range of applications [15, 30, 59]; in the context of PDEs specifically, examples of such applications can be found in [7, 18, 35, 45, 60, 62]. For a more thorough literature review on the use of neural networks in finance and finance-related PDEs, we refer the reader to the surveys [8, 29].

In this paper, we focus on the works by Huré, Pham and Warin [41], and by Bayer, Qiu and Yao [6], where the classical backward resolution technique is combined with neural networks to estimate both the value function and its gradient. Not only does this approach successfully mitigate the curse of dimensionality, but it also appears more effective, in both accuracy and computational efficiency, than existing Euler-based approaches.

Besides research on numerical aspects, a lot of progress has been made on the theoretical foundations of neural network-based methods, in particular showing that they are able to approximate solutions of certain types of PDEs [19, 33, 43, 57]. These results are significant, as they show that deep neural networks can be used to solve complex problems previously thought intractable. However, in practice, the optimal parameters of any given neural network minimising a loss function ultimately have to be computed approximately. This is usually done through some variant of the stochastic gradient descent (SGD) algorithm, which inevitably introduces an optimisation error. Because of the non-convexity of the network's loss surface and the stochastic nature of SGD, the optimisation error is notoriously hard to treat rigorously. One way around this, proposed by Gonon [31], is to use neural networks in which only certain weights are trainable, while the remaining ones are randomly sampled and then fixed; his results show that such random-weight neural networks are, in fact, capable of learning non-degenerate Black-Scholes-type PDEs without succumbing to the curse of dimensionality. Following this, we combine the classical BSDE approach [14, 54] with random-weight neural networks (RWNNs) [40, 55, 56].

Our final algorithm then reduces to a least-squares Monte Carlo, introduced by Longstaff and Schwartz [48] (see also [1] for related applications), where the usually arbitrary choice of basis is ‘outsourced’ to the reservoir of the corresponding RWNN. The basis is computationally efficient and ultimately allows us to express the approximation error in terms of the number of stochastic nodes in the network. Moreover, vectorising the randomised least squares along the sampling direction allows us to evaluate the sum of outer tensor products using the einsum function (available in NumPy and PyTorch) and achieve an even greater speed-up. One word of caution though: in our numerical examples, for the rough Bergomi model and for basket options, computation time is still longer than with Monte Carlo methods, mostly because the scheme does require simulating sample paths. It does, however, open the gates to (more advanced) numerical schemes for path-dependent partial differential equations, which we plan to investigate in later projects.

To summarise, in contrast with Bayer-Qiu-Yao [6], our numerical scheme employs RWNNs instead of conventional feed-forward neural networks, resulting in significantly faster training times without sacrificing the accuracy of the scheme. Moreover, this structure allows us to provide error bounds in terms of the number of hidden nodes, granting additional insight into the network’s performance. Given the comparable performance of RWNNs and conventional feed-forward neural networks, we argue that this paper illuminates an essential lesson: the additional complexity of deep neural networks can sometimes be redundant, and forgoing it buys precise error bounds. We note in passing that RWNNs have already been used in finance to price American options [36], for financial data forecasting [47], and for PIDEs [33]; we refer the interested reader to [16] for a general overview of their applications in data science.

Moreover, in parallel to our work, Shang, Wang, and Sun [61] combined randomised neural networks with Petrov-Galerkin methods to solve linear and non-linear PDEs. Their method, similar to ours, uses randomly initialised neural networks with trainable linear readouts. Neufeld, Schmocker, and Wu [53] conducted a comprehensive error analysis of the random deep splitting method for non-linear parabolic PDEs and PIDEs, demonstrating high-dimensional problem-solving capabilities.

The paper is structured as follows: Section 2 provides a brief overview of Random-weight Neural Networks (RWNNs), including their key features and characteristics. In Section 3, we outline the scheme for the Markovian case and discuss the non-Markovian case in Section 4. The convergence analysis is presented in Section 5. Additionally, Section 6 presents numerical results, which highlight the practical relevance of the scheme and its performance for different models. Some of the technical proofs are postponed to Appendix B to ease the flow of the paper.

Notations: ℝ₊ := [0,∞); ℛ refers to a random neural network, defined in Section 2; for an open subset E ⊂ ℝ^d, 1 ≤ p ≤ ∞ and s ∈ ℕ, we define the Sobolev space

W^{s,p}(E, ℝ^m) := {f ∈ L^p(E, ℝ^m) : ∂_x^α f ∈ L^p(E, ℝ^m) for all |α| ≤ s},

where α = (α₁, …, α_d), |α| = α₁ + ⋯ + α_d, and the derivatives ∂_x^α f = ∂_{x₁}^{α₁} ⋯ ∂_{x_d}^{α_d} f are taken in the weak sense. To be consistent with probabilistic notations—although the machine learning literature tends to differ—we write E_Φ[·] := E[·|Φ] for the conditional expectation with respect to the random variable Φ.

Random-Weight Neural Network (RWNN)

Neural networks with random weights appeared in the seminal works of Barron [2, 3]; a modern version was proposed by Huang [40] under the name ‘extreme learning machine’, and such networks are known today under different names: reservoir networks, random feature networks or random-weight networks. We adopt the last, as it is the most explicit.

Definition 2.1

(Neural network) Let L, N₀, …, N_L ∈ ℕ, let ϱ: ℝ → ℝ, and for l = 1, …, L let w_l: ℝ^{N_{l−1}} → ℝ^{N_l} be an affine function. A function F: ℝ^{N₀} → ℝ^{N_L} defined as

F = w_L ∘ F_{L−1} ∘ ⋯ ∘ F₁, with F_l = ϱ ∘ w_l for l = 1, …, L−1,

is called a neural network, with the activation function ϱ applied component-wise. L denotes the total number of layers; N₁, …, N_{L−1} denote the dimensions of the hidden layers, and N₀ and N_L those of the input and output layers respectively. For each l ∈ {1, …, L}, the affine function w_l: ℝ^{N_{l−1}} → ℝ^{N_l} is given as w_l(x) = A^{(l)} x + b^{(l)} for x ∈ ℝ^{N_{l−1}}, with A^{(l)} ∈ ℝ^{N_l×N_{l−1}} and b^{(l)} ∈ ℝ^{N_l}. For any i ∈ {1, …, N_l} and j ∈ {1, …, N_{l−1}}, A^{(l)}_{ij} is interpreted as the weight of the edge connecting node j of layer l−1 to node i of layer l.

A random-weight neural network (RWNN) is a neural network whose hidden layers are randomly sampled from a given distribution and then fixed; consequently, only the last layer is trained: out of all the parameters (A^{(l)}, b^{(l)})_{l=1,…,L} of the L-layered neural network, the parameters (A^{(1)}, b^{(1)}, …, A^{(L−1)}, b^{(L−1)}) are randomly sampled and frozen, and only (A^{(L)}, b^{(L)}), from the last layer, are trained.

The training of such an RWNN thus simplifies to a convex optimisation problem, which makes it easier to manage and to understand, both practically and theoretically. However, by allowing only part of the parameters to be trained, the overall capacity and expressivity are possibly reduced. While it remains unclear whether random neural networks retain all of the powerful approximation properties of general deep neural networks, these questions have been addressed to some extent in e.g. [32, 52], where learning error bounds for RWNNs are proved.

Denote now by ℛ^ϱ(d₀, d₁) the set of random neural networks from ℝ^{d₀} to ℝ^{d₁} with activation function ϱ—we shall drop the explicit reference to the input and output dimensions whenever they are clear from the context. Moreover, for any L, K ∈ ℕ, ℛ^ϱ_{L,K} represents a random neural network with a fixed number L of hidden layers and fixed input and output dimension K for each hidden layer. We now give a precise definition of a single layer ℛ^ϱ_K := ℛ^ϱ_{1,K}, which we will use for our approximation.

Definition 2.2

(Single-layer RWNN) Let (Ω̃, F̃, P̃) be a probability space supporting the iid random variables A_k: Ω̃ → S ⊂ ℝ^d and b_k: Ω̃ → S₀ ⊂ ℝ, with S and S₀ bounded, respectively corresponding to weights and biases. Let ϕ = (ϕ_k)_{k≥1} denote a sequence of random basis functions, where each ϕ_k: ℝ^d → ℝ is of the form

ϕ_k(x) := ϱ(A_k^⊤ x + b_k), x ∈ ℝ^d,

with ϱ: ℝ → ℝ a Lipschitz continuous activation function. For an output dimension m and K hidden units, we define the reservoir, or random basis, as Φ_K := ϕ_{1:K} = (ϕ₁, …, ϕ_K)^⊤, and the random network ℛ^ϱ_K with parameter Θ = (θ₁, …, θ_m)^⊤ ∈ ℝ^{m×K} as the map

ℛ^ϱ_K: x ↦ Ψ_K(x; Θ) := Θ Φ_K(x).

Thus, for each output dimension j ∈ {1, …, m}, ℛ^ϱ_K produces a linear combination of the first K random basis functions, θ_j^⊤ ϕ_{1:K} := Σ_{k=1}^K θ_{j,k} ϕ_k.

Remark 2.3

In this paper, we make use of the more compact vector notation

Φ_K: ℝ^d ∋ x ↦ ϱ(Ax + b) ∈ ℝ^K,

where ϱ: ℝ^K → ℝ^K acts component-wise, ϱ(y) := (ϱ(y₁), …, ϱ(y_K)), and A: Ω̃ → ℝ^{K×d} and b: Ω̃ → ℝ^K are the random weight matrix and bias vector respectively.
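As a concrete illustration of this compact notation, the following sketch (ours, in NumPy; the dimensions, sampling distribution and seed are arbitrary choices for illustration) builds a reservoir Φ_K with frozen random weights and a trainable readout Θ:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, m = 2, 64, 1          # input dim, hidden nodes K, output dim m

# Reservoir: A and b are sampled once on a bounded domain, then frozen.
A = rng.uniform(-1.0, 1.0, size=(K, d))
b = rng.uniform(-1.0, 1.0, size=K)

def Phi_K(x):
    """Random basis Phi_K(x) = rho(Ax + b), here with ReLU activation rho."""
    return np.maximum(A @ x + b, 0.0)

def Psi_K(x, Theta):
    """Psi_K(x; Theta) = Theta Phi_K(x); only Theta (m x K) is trained."""
    return Theta @ Phi_K(x)

Theta = rng.normal(size=(m, K))
y = Psi_K(np.array([0.5, -0.3]), Theta)      # output in R^m
```

Training then amounts to fitting Θ alone, which is exactly the convex problem described above.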

Derivatives of ReLu-RWNN

In recent years, ReLU neural networks have been predominantly used in deep learning because of their simplicity, efficiency, and ability to address the so-called vanishing gradient problem [46]. In many ways, ReLU networks also give a more tractable structure to the optimisation problem than their smooth counterparts, such as tanh and sigmoid. Gonon, Grigoryeva and Ortega [32] derived error bounds for the convergence of a single-layer RWNN with ReLU activations. While ς(y) := max{y, 0} performs well numerically, it is not differentiable at zero (see [11] for a short exposition on the chain rule in ReLU networks). As ReLU-RWNNs will be used in our approach to approximate solutions of partial differential equations, a discussion of their derivatives is in order. To that end, we let ς(y) := (ς(y₁), …, ς(y_K)) and H(y) := 𝟙_{(0,∞)}(y) ∈ ℝ^K for y ∈ ℝ^K, where the indicator function is again applied component-wise.

Lemma 2.4

For any affine function ℓ(x) = Ax + b, with A ∈ ℝ^{K×d} and b ∈ ℝ^K, we have

∂_x(ς ∘ ℓ)(x) = diag(H(Ax + b)) A, for a.e. x ∈ ℝ^d.

Proof

Let 𝒜 := {x ∈ ℝ^d : (ς ∘ ℓ)(x) = 0} = {x ∈ ℝ^d : ℓ(x) ≤ 0}. Then (ς ∘ ℓ)(x) = ℓ(x) for all x ∈ ℝ^d ∖ 𝒜. Since ς ∘ ℓ is Lipschitz, differentiability on level sets [22, Section 3.1.2, Corollary I] implies that ∂_x(ς ∘ ℓ)(x) = 0 for almost every x ∈ 𝒜, and hence

∂_x(ς ∘ ℓ)(x) = diag(𝟙_{ℝ^d∖𝒜}(x)) ∂_x ℓ(x) = diag(𝟙_{(0,∞)}(ℓ(x))) ∂_x ℓ(x) = diag(H(Ax + b)) A.

Thus, by Lemma 2.4, the first derivative of Ψ_K(·; Θ) ∈ ℛ^ς_K is equal to

∂_x Ψ_K(x; Θ) = Θ diag(H(Ax + b)) A, for a.e. x ∈ ℝ^d. (2.1)
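A quick numerical sanity check of (2.1) (our own sketch; sizes and seed are arbitrary) compares the closed-form a.e. derivative Θ diag(H(Ax + b)) A with central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 3, 32
A = rng.uniform(-1, 1, (K, d))
b = rng.uniform(-1, 1, K)
Theta = rng.normal(size=(1, K))

def psi(x):
    # Psi_K(x; Theta) with ReLU activation
    return Theta @ np.maximum(A @ x + b, 0.0)

def grad_psi(x):
    # (2.1): Theta diag(H(Ax + b)) A, valid for a.e. x
    H = (A @ x + b > 0).astype(float)
    return Theta @ (H[:, None] * A)

x = rng.normal(size=d)
g = grad_psi(x)

eps = 1e-6   # central differences; agreement holds away from the kinks
fd = np.array([(psi(x + eps * e) - psi(x - eps * e)) / (2 * eps)
               for e in np.eye(d)]).T
```

The two computations agree as long as x does not sit within eps of one of the (measure-zero) kinks, which motivates the notion of approximate differentiability introduced next.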

The above statements only hold almost everywhere; it is thus appropriate to introduce a notion of approximate differentiability.

Definition 2.5

(Approximate limit, [22, Section 1.7.2]) Consider a Lebesgue-measurable set E ⊂ ℝ^d, a measurable function f: E → ℝ^m and a point x₀ ∈ E. We say that l ∈ ℝ^m is the approximate limit of f at x₀, and write ap lim_{x→x₀} f(x) = l, if for each ε > 0,

lim_{r↓0} λ(B_r(x₀) ∩ {x ∈ E : |f(x) − l| ≥ ε}) / λ(B_r(x₀)) = 0,

with λ the Lebesgue measure and B_r(x₀) the closed ball with radius r > 0 and centre x₀.

Definition 2.6

(Approximate differentiability, [22, Section 6.1.3]) Consider a measurable set E ⊂ ℝ^d, a measurable map f: E → ℝ^m and a point x₀ ∈ E. The map f is approximately differentiable at x₀ if there exists a linear map D_{x₀}: ℝ^d → ℝ^m such that

ap lim_{x→x₀} |f(x) − f(x₀) − D_{x₀}(x − x₀)| / |x − x₀| = 0.

Then D_{x₀} is called the approximate differential of f at x₀. We call f approximately differentiable almost everywhere if its approximate derivative exists almost everywhere.

Remark 2.7

The usual rules from classical derivatives, such as the uniqueness of the differential, and differentiability of sums, products and quotients, apply to approximately differentiable functions. Moreover, the chain rule applies to compositions φf when f is approximately differentiable at x0 and φ is classically differentiable at f(x0).

Remark 2.8

([22, Theorem 4, Section 6.1.3]) For f ∈ W^{1,p}_loc(ℝ^d) with 1 ≤ p ≤ ∞, f is approximately differentiable almost everywhere and its approximate derivative equals its weak derivative almost everywhere. We will thus use the operator D_x to denote the weak and approximate derivatives interchangeably, to distinguish them from the classical derivative, denoted by ∂.

Lemma 2.9

Let E ⊂ ℝ^d be a measurable set with finite measure, X: Ω → E a continuous random variable on some probability space (Ω, F, P), φ ∈ C¹(ℝ^m), Φ_ap: E → ℝ^m an approximately differentiable function, and Φ its C¹(ℝ^d; ℝ^m) extension to ℝ^d. Then E[φ(D_x Φ_ap(X))] = E[φ(∂_x Φ(X))].

Proof

By [23, Theorem 3.1.6], a function Φ_ap: E → ℝ^m is approximately differentiable almost everywhere if for every ε > 0 there is a compact set F ⊂ E such that the Lebesgue measure λ(E ∖ F) < ε, the restriction Φ_ap|_F is C¹, and there exists a C¹ extension to ℝ^d. Since φ is everywhere differentiable, it maps null sets to null sets [58, Lemma 7.25]. The claim follows since the law of X is absolutely continuous with respect to the Lebesgue measure λ, X being a continuous random variable.

Corollary 2.10

Let E ⊂ ℝ^d be a measurable set with finite measure, X: Ω → E a continuous random variable on some probability space (Ω, F, P), φ ∈ C¹(ℝ^m), Φ: E → ℝ^m an approximately differentiable function, and Ψ ∈ W^{1,p}(E, ℝ^m) for p ≥ 1 such that Φ = Ψ almost everywhere. Then E[φ(D_x Φ(X))] = E[φ(D_x Ψ(X))].

Proof

This is a direct consequence of Lemma 2.9, after noting that the two notions of derivatives are the same on W1,p(E,Rm) (see Remark 2.8).

From a practical perspective, the second-order derivative of the network with respect to the input is, for all intents and purposes, zero. However, as will become apparent in Lemma 5.7, we need to investigate it further, in particular on the measure-zero set of points where the ReLU-RWNN is not differentiable. Rewriting the diagonal operator in terms of the natural basis {e_j} and evaluating the function H component-wise yields

∂_x Ψ_K(x; Θ) = Θ Σ_{j=1}^K e_j e_j^⊤ H(e_j^⊤ A x + b_j) A.

The i-th component of the second derivative is thus

(∂_x² Ψ_K(x; Θ))_i = Θ Σ_{j=1}^K e_j e_j^⊤ a_{ji} H′(e_j^⊤ A x + b_j) A = Θ diag(a_i) diag(H′(Ax + b)) A,

where ai denotes the i-th column of the matrix A. Next, we let

δ₀^ε(x) := (H(x) − H(x − ε)) / ε

for x ∈ ℝ, and define the left derivative of H as H′ = lim_{ε↓0} δ₀^ε = δ₀ in the distributional sense. This finally gives the second derivative of the network:

(∂_x² Ψ_K(x; Θ))_i = Θ diag(a_i) diag(δ₀(Ax + b)) A, (2.2)

where δ0 denotes the vector function applying δ0 component-wise.

Randomised Least Squares (RLS)

Let Y ∈ ℝ^d and X ∈ ℝ^k be random variables on some probability space (Ω, F, P) and β ∈ ℝ^{d×k} a deterministic matrix. If the loss function is the mean squared error (MSE), the first-order condition for the randomised least squares estimator reads

∂_β E[|Y − βX|²] = ∂_β E[(Y − βX)^⊤(Y − βX)] = E[∂_β(Y^⊤Y − Y^⊤βX − X^⊤β^⊤Y + X^⊤β^⊤βX)] = E[2βXX^⊤ − 2YX^⊤] = 0,

which gives the minimiser β* = E[YX^⊤] E[XX^⊤]^{−1}, and its sample estimator

β̂ := (Σ_{j=1}^n Y_j X_j^⊤)(Σ_{j=1}^n X_j X_j^⊤)^{−1}. (2.3)

Depending on the realisation of the reservoir of the RWNN, the components of X may be collinear, so that the design matrix is close to rank-deficient. A standard remedy is to use the Ridge-regularised version [37] of the estimator,

β̂_R = (Σ_{j=1}^n Y_j X_j^⊤)(Σ_{j=1}^n X_j X_j^⊤ + λI)^{−1}, for λ > 0,

which results in a superior, more robust performance in our experiments.

Remark 2.11

The above derivation also holds for the approximate derivative D_x, since all the operations involved remain valid for approximately differentiable functions (Remark 2.7).

Remark 2.12

At first glance, the form of the RLS estimator (2.3) suggests that the sum of outer products over n ≫ 1 samples may be computationally expensive. In practice, however, this operation can be implemented efficiently by exploiting the tensor functionalities provided by libraries such as NumPy and PyTorch. In particular, the einsum function enables an efficient evaluation of the required sum of outer products, thereby further optimising the overall computation. Implementation details are provided in the accompanying code, available at ZuricZ/RWNN_PDE_solver.
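For concreteness, here is a minimal sketch (ours; the shapes, noise level and Ridge parameter are illustrative) of the einsum-based evaluation of the Ridge-regularised estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, d = 5000, 16, 2              # samples, regressor dim, output dim

X = rng.normal(size=(n, k))        # rows X_j (e.g. reservoir outputs)
beta_true = rng.normal(size=(d, k))
Y = X @ beta_true.T + 0.01 * rng.normal(size=(n, d))

# Sums of outer products, vectorised along the sampling direction:
#   sum_j Y_j X_j^T  and  sum_j X_j X_j^T
YX = np.einsum('nd,nk->dk', Y, X)
XX = np.einsum('nj,nk->jk', X, X)

lam = 1e-6                         # Ridge parameter
beta_hat = YX @ np.linalg.inv(XX + lam * np.eye(k))
```

The einsum calls avoid materialising the n individual outer products, which is where the speed-up mentioned in the introduction comes from.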

The Markovian Case

Let the process X of the traded and non-traded components of the underlying, under the risk-neutral measure Q, be given by the following d-dimensional SDE:

X_s^{t,x} = x + ∫_t^s μ(r, X_r^{t,x}) dr + ∫_t^s Σ(r, X_r^{t,x}) dW_r, (3.1)

where μ: [0,T] × ℝ^d → ℝ^d and Σ: [0,T] × ℝ^d → ℝ^{d×d} satisfy Assumption 3.1, and W is a standard d-dimensional Brownian motion on the probability space (Ω, F, Q), equipped with the natural filtration F = {F_t}_{0≤t≤T} of W. By the Feynman-Kac formula, options whose discounted expected payoff under Q can be represented as

u(t,x) = E[∫_t^T e^{−r(s−t)} f(s, X_s^{t,x}) ds + e^{−r(T−t)} g(X_T^{t,x})], for all (t,x) ∈ [0,T] × A,

for A ⊂ ℝ^d, with interest rate r ≥ 0 and continuous functions f: [0,T] × ℝ^d → ℝ and g: ℝ^d → ℝ, can be viewed as solutions to the linear parabolic Cauchy PDE

∂_t u + Lu + f − ru = 0 on [0,T) × A, with u(T,·) = g on A,

where

Lu := ½ Tr(ΣΣ^⊤ ∂_x² u) + (∂_x u)^⊤ μ, on [0,T) × A, (3.2)

is the infinitesimal generator associated with diffusion (3.1). In this Markovian setting, we thus adopt a setup similar to [41] and consider a slightly more general PDE

∂_t u(t,x) + Lu(t,x) + f(t, x, u(t,x), ∂_x u(t,x)^⊤ Σ(t,x)) = 0 on [0,T) × A, with u(T,·) = g on A, (3.3)

with f: [0,T] × ℝ^d × ℝ × ℝ^d → ℝ such that Assumption 3.1 is satisfied, which guarantees existence and uniqueness of the solution to the corresponding BSDE [54, Section 4].

Assumption 3.1

(Well-posedness of the FBSDE system (3.1)-(3.4)) The drift μ: [0,T] × ℝ^d → ℝ^d and the diffusion coefficient Σ: [0,T] × ℝ^d → ℝ^{d×d} satisfy global Lipschitz conditions. Moreover,

  • (i)
    there exists L_f > 0 such that sup_{0≤t≤T} |f(t,0,0,0)| < ∞ and, for all (t₁,x₁,y₁,z₁) and (t₂,x₂,y₂,z₂) in [0,T] × ℝ^d × ℝ × ℝ^d,
    |f(t₂,x₂,y₂,z₂) − f(t₁,x₁,y₁,z₁)| ≤ L_f(|t₂ − t₁| + |x₂ − x₁| + |y₂ − y₁| + |z₂ − z₁|);
  • (ii)
    the function g has at most linear growth.

The corresponding second-order generator is again given by (3.2). The following assumption is only required to cast the problem as a regression; otherwise, the optimisation (below) can still be solved by other methods, such as stochastic gradient descent. Another option would be to use the so-called splitting method to linearise the PDE (see e.g. [7] and the references therein).

Assumption 3.2

The function f: [0,T] × ℝ^d × ℝ × ℝ^d × ℝ^d → ℝ has an affine structure in y ∈ ℝ^m and in z, v ∈ ℝ^d:

f(t, x, y, z, v) = a(t,x) y + b(t,x)^⊤ z + c(t,x)^⊤ v + f̃(t,x),

for some real-valued functions a, b, c, f̃ on [0,T] × ℝ^d mapping to conformable dimensions.

Random Weighted Neural Network Scheme

The first step in so-called deep BSDE schemes [18, 35, 41] is to establish the BSDE associated with the PDE (3.3) and the process (3.1) through the non-linear Feynman-Kac formula. By [54], there exist F-adapted processes (Y, Z), which are the unique solutions to the BSDE

Y_t = g(X_T) + ∫_t^T f(s, X_s, Y_s, Z_s) ds − ∫_t^T Z_s^⊤ dW_s, for any t ∈ [0,T], (3.4)

and which are connected to the PDE (3.3), with terminal condition u(T,·) = g, via

Y_t = u(t, X_t) and Z_t = ∂_x u(t, X_t)^⊤ Σ(t, X_t).

Next, the BSDE (3.4) is rewritten in forward form

Y_t = Y₀ − ∫₀^t f(s, X_s, Y_s, Z_s) ds + ∫₀^t Z_s^⊤ dW_s, for any t ∈ [0,T],

and both processes are discretised according to the Euler-Maruyama scheme. To this end, let π := {0 = t₀ < t₁ < ⋯ < t_N = T} be a partition of the time interval [0,T] with modulus |π| := max_{i∈{0,1,…,N−1}} δ_i, where δ_i := t_{i+1} − t_i. The scheme is then given by

X_{t_{i+1}} = X_{t_i} + μ(t_i, X_{t_i}) δ_i + Σ(t_i, X_{t_i}) Δ_i W,   Y_{t_{i+1}} = Y_{t_i} − f(t_i, X_{t_i}, Y_{t_i}, Z_{t_i}) δ_i + Z_{t_i}^⊤ Δ_i W, (3.5)

where naturally Δ_i W := W_{t_{i+1}} − W_{t_i}. Then, for all i ∈ {N−1, …, 0}, we approximate u(t_i, ·) by U_i(·; Θ_i) ∈ ℛ^ϱ_K and Z_{t_i} as

u(t_i, X_{t_i}) = Y_{t_i} ≈ U_i(X_{t_i}; Θ_i) := Θ_i Φ_K^i(X_{t_i}),   Z_{t_i} ≈ Z_i(X_{t_i}) := D_x U_i(X_{t_i}; Θ_i) · Σ(t_i, X_{t_i}) = Θ_i D_x Φ_K^i(X_{t_i}) · Σ(t_i, X_{t_i}).

Recall that the derivative in Z_i(X_{t_i}) is the approximate derivative from Definition 2.6; formulating the loss function with this derivative is sensible by Lemma 2.9. Notice that the optimal approximation Û_{i+1}(X_{t_{i+1}}) := U_{i+1}(X_{t_{i+1}}; Θ_{i+1,*}), obtained with the optimal parameter Θ_{i+1,*} in step (i+1), does not depend on Θ_i; hence, under Assumption 3.2 with c = 0, the loss function at the i-th discretisation step reads

ℒ(Θ_i) := E_Φ[|Û_{i+1}(X_{t_{i+1}}) − {U_i(X_{t_i}; Θ_i) − f(t_i, X_{t_i}, U_i(X_{t_i}; Θ_i), Z_i(X_{t_i}; Θ_i)) δ_i + Z_i(X_{t_i}; Θ_i)^⊤ Δ_i W}|²]
= E_Φ[|Û_{i+1}(X_{t_{i+1}}) − {U_i(X_{t_i}; Θ_i) − (a_i U_i(X_{t_i}; Θ_i) + b_i^⊤ Z_i(X_{t_i}; Θ_i) + f̃_i) δ_i + Z_i(X_{t_i}; Θ_i)^⊤ Δ_i W}|²]
= E_Φ[|Û_{i+1}(X_{t_{i+1}}) + f̃_i δ_i − Θ_i {(1 − a_i δ_i) Φ_K^i(X_{t_i}) + D_x Φ_K^i(X_{t_i}) Σ_i (Δ_i W − b_i δ_i)}|²]
= E_Φ[|Y_i − Θ_i X_i|²],

where p_i := p(t_i, X_{t_i}) for p ∈ {a, b, f̃, Σ}, and the expectation E_Φ is of course conditional on the realisation of the random basis Φ_K^i, i.e. conditional on the random weights and biases of the RWNN. Furthermore, we used the notations

Y_i := Û_{i+1}(X_{t_{i+1}}) + f̃_i δ_i and X_i := (1 − a_i δ_i) Φ_K(X_{t_i}) + D_x Φ_K(X_{t_i}) Σ_i (Δ_i W − b_i δ_i).

The problem can now be solved via the least squares of Section 2.2, yielding the estimator

Θ_{i,*} = E_Φ[Y_i X_i^⊤] E_Φ[X_i X_i^⊤]^{−1}.

Algorithm

We now summarise the algorithmic procedure of our RWNN scheme. Note how, once the sample-estimator version of the RLS from Section 2.2 is plugged in, the algorithm resembles the least-squares Monte Carlo method of [48]:

Algorithm 1: RWNN scheme

Remark 3.3

We discuss the choice of R > 0 from Algorithm 1 in different practical scenarios in Section 6. We find that the scheme remains robust across different choices of support intervals, as long as they align with the magnitude of the expected output.

The Non-Markovian Case

We now consider a stochastic volatility model under a risk-neutral measure, so that X = (X, V), where the dynamics of the log-price process X are given by

dX_s^{t,x} = (r − V_s/2) ds + √V_s (ρ₁ dW_s¹ + ρ₂ dW_s²), 0 ≤ t ≤ s ≤ T, (4.1)

starting from X_t^{t,x} = x ∈ ℝ, with interest rate r ∈ ℝ and correlation ρ₁ ∈ [−1,1], where we denote ρ₂ := √(1 − ρ₁²), and W¹, W² are two independent Brownian motions. We allow for a general variance process V, satisfying the following:

Assumption 4.1

The process V has continuous trajectories, is non-negative almost surely, is adapted to the natural filtration of W¹, and E[∫₀^t V_s ds] is finite for all t ≥ 0.

By no-arbitrage, the fair price of a European option with payoff h: ℝ₊ → ℝ₊ reads

u(t,x) := E[e^{−r(T−t)} h(e^{X_T^{t,x} + rT}) | F_t], for all (t,x) ∈ [0,T] × ℝ,

subject to (4.1). Since X is not Markovian, one cannot characterise the value function u(t,x) via a deterministic PDE. Bayer, Qiu and Yao [6] proved that u can be viewed as a random field which, together with another random field ψ, satisfies the backward stochastic partial differential equation (BSPDE)

−du(t,x) = [(V_t/2) ∂_x² u(t,x) + ρ₁ √V_t ∂_x ψ(t,x) − (V_t/2) ∂_x u(t,x) − r u(t,x)] dt − ψ(t,x) dW_t¹, (4.2)

in a distributional sense for (t,x) ∈ [0,T) × ℝ, with boundary condition u(T,x) = h(e^{x+rT}), where the variance process (V_t)_{t≥0} is defined exogenously under Assumption 4.1. We in fact consider the slightly more general BSPDE

−du(t,x) = {(V_t/2) D_x² u(t,x) + ρ₁ √V_t D_x ψ(t,x) − (V_t/2) D_x u(t,x) + f(t, e^x, u(t,x), ρ₂ √V_t D_x u(t,x), ψ(t,x) + ρ₁ √V_t D_x u(t,x))} dt − ψ(t,x) dW_t¹, (t,x) ∈ [0,T) × ℝ, with u(T,x) = g(e^x), x ∈ ℝ. (4.3)

The following assumptions on f and g (from [6]) ensure well-posedness of the above BSPDE; we shall additionally require the existence of a weak Sobolev solution (Assumption 5.1) for the convergence analysis of our numerical scheme in Section 5.

Assumption 4.2

Let g: ℝ → ℝ and f: [0,T] × ℝ⁴ → ℝ be such that

  • (i)
    g admits at most linear growth;
  • (ii)
    f is L_f-Lipschitz in all space arguments and there exists L₀ > 0 such that
    |f(t,x,0,0,0)| ≤ L_f(1 + |x|) and |f(t,x,y,z,z̃) − f(t,x,y,0,0)| ≤ L₀.

Note that (4.2) is just the particular case of the general BSPDE (4.3) corresponding to the choice f(t,x,y,z,z̃) ≡ −ry and g(e^x) ≡ h(e^{x+rT}). Again, this general form is well-posed in the distributional sense under Assumption 4.2 (borrowed from [6]). By [14], the corresponding BSDE is then, for 0 ≤ t ≤ s < T,

−dY_s^{t,x} = f(s, e^{X_s^{t,x}}, Y_s^{t,x}, Z_s^{1;t,x}, Z_s^{2;t,x}) ds − Z_s^{1;t,x} dW_s¹ − Z_s^{2;t,x} dW_s²,   Y_T^{t,x} = g(e^{X_T^{t,x}}), (4.4)

where (Y_s^{t,x}, Z_s^{1;t,x}, Z_s^{2;t,x}) is defined as the solution to (4.4) in the weak sense.

Random Neural Network Scheme

Let the quadruple (X_s, Y_s, Z_s¹, Z_s²) be the solution to the FBSDE

−dY_s = f(s, e^{X_s}, Y_s, Z_s¹, Z_s²) ds − Z_s¹ dW_s¹ − Z_s² dW_s²,   dX_s = −(V_s/2) ds + √V_s (ρ₁ dW_s¹ + ρ₂ dW_s²),   V_s = ξ_s ℰ(η Ŵ_s), with Ŵ_s = ∫₀^s K(s,r) dW_r¹, (4.5)

for s ∈ [0,T), with terminal condition Y_T = g(e^{X_T}), initial condition X₀ = x, a locally square-integrable kernel K, and ξ_s := ξ(s) > 0 the forward variance curve. As before, ρ₂ := √(1 − ρ₁²) with ρ₁ ∈ [−1,1]. Here ℰ(·) denotes the Wick stochastic exponential, defined as ℰ(ζ) := exp(ζ − ½ E[|ζ|²]) for a centred Gaussian random variable ζ. Then, by [6, Theorem 2.4],

Y_t = u(t, X_t) for t ∈ [0,T],   Z_t¹ = ψ(t, X_t) + ρ₁ √V_t D_x u(t, X_t) for t ∈ [0,T),   Z_t² = ρ₂ √V_t D_x u(t, X_t) for t ∈ [0,T),

where (u,ψ) is the unique weak solution to (4.3). Accordingly, the forward equation reads

Y_t = Y₀ − ∫₀^t f(s, e^{X_s}, Y_s, Z_s¹, Z_s²) ds + ∫₀^t Z_s¹ dW_s¹ + ∫₀^t Z_s² dW_s², for t ∈ [0,T].

By simulating (W¹, W², V), the forward process X may be approximated by an Euler scheme—with the same notations as in the Markovian case—and the forward representation above yields the approximation

u(t_{i+1}, X_{t_{i+1}}) ≈ u(t_i, X_{t_i}) − f(t_i, e^{X_{t_i}}, u(t_i, X_{t_i}), Z_i¹, Z_i²) δ_i + Z_i¹ Δ_i¹ + Z_i² Δ_i²,

with

Z_i¹ = ρ₁ √V_{t_i} D_x u(t_i, X_{t_i}) + ψ(t_i, X_{t_i}) and Z_i² = ρ₂ √V_{t_i} D_x u(t_i, X_{t_i}).
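The model-specific ingredient here is the joint simulation of (W¹, V). The sketch below is ours and deliberately crude: it uses a left-point Riemann discretisation of the Volterra integral Ŵ with the rough Bergomi-type kernel K(s,r) = √(2H)(s−r)^{H−1/2}, all parameter values are illustrative, and a hybrid-type scheme would treat the kernel singularity more accurately:

```python
import numpy as np

rng = np.random.default_rng(5)
H, eta, xi = 0.1, 1.5, 0.04        # illustrative rough Bergomi parameters
T, N, n = 1.0, 200, 2000
dt = T / N

dW1 = rng.normal(scale=np.sqrt(dt), size=(n, N))

# left-point Riemann sum for W_hat_{t_i} = int_0^{t_i} K(t_i, r) dW_r^1
W_hat = np.zeros((n, N + 1))
for i in range(1, N + 1):
    j = np.arange(i)
    kern = np.sqrt(2 * H) * ((i - j) * dt) ** (H - 0.5)
    W_hat[:, i] = (kern * dW1[:, :i]).sum(axis=1)

# V_s = xi_s E(eta W_hat_s), with E(z) = exp(z - Var(z)/2); the variance
# of W_hat_s is estimated empirically across the n sample paths
var = W_hat.var(axis=0)
V = xi * np.exp(eta * W_hat - 0.5 * eta ** 2 * var)
```

The resulting paths of V can then be plugged into the Euler step for X above, with W² simulated independently.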

By Lemma A.2 we can, for each time step i ∈ {0, …, N−1}, approximate the solutions u(t_i, ·) and ψ(t_i, ·) by two separate networks U_i and Ψ_i in ℛ^ς_K:

Y_{t_i} ≈ U_i(X_{t_i}; Θ_i) = Θ_i Φ_K^{Θ,i}(X_{t_i}),   Z_i¹ ≈ Z_i¹(X_{t_i}; Θ_i, Ξ_i) = Θ_i D_x Φ_K^{Θ,i}(X_{t_i}) ρ₁ √V_{t_i} + Ξ_i Φ_K^{Ξ,i}(X_{t_i}),   Z_i² ≈ Z_i²(X_{t_i}; Θ_i, Ξ_i) = Θ_i D_x Φ_K^{Θ,i}(X_{t_i}) ρ₂ √V_{t_i}.

Here ΦKΞ and ΦKΘ are realisations of random bases (reservoirs) of the RWNNs with respective parameters Ξ and Θ. The next part relies on Assumption 3.2, namely

f(t_i, e^{X_{t_i}}, Y_{t_i}, Z_i¹, Z_i²) = a(t_i, X_{t_i}) Y_{t_i} + b(t_i, X_{t_i}) Z_i¹ + c(t_i, X_{t_i}) Z_i² + f̃(t_i, X_{t_i}),

for some functions a, b, c, f̃ mapping to ℝ, so that, as in the Markovian case, the minimisation of the expected quadratic loss at every time step i ∈ {N−1,…,0} reads

ℒ(Θ_i, Ξ_i) := E_Φ[|Û_{i+1}(X_{t_{i+1}}) − {U_i(X_{t_i}; Θ_i) − f(t_i, X_{t_i}, U_i(X_{t_i}; Θ_i), Z_i¹(X_{t_i}; Θ_i, Ξ_i), Z_i²(X_{t_i}; Θ_i, Ξ_i)) δ_i + Σ_{k=1,2} Z_i^k(X_{t_i}; Θ_i, Ξ_i) Δ_i^k}|²]
= E_Φ[|Û_{i+1}(X_{t_{i+1}}) − {U_i(X_{t_i}; Θ_i) − (a U_i(X_{t_i}; Θ_i) + b Z_i¹(X_{t_i}; Θ_i, Ξ_i) + c Z_i²(X_{t_i}; Θ_i, Ξ_i) + f̃) δ_i + Σ_{k=1,2} Z_i^k(X_{t_i}; Θ_i, Ξ_i) Δ_i^k}|²]
= E_Φ[|Û_{i+1}(X_{t_{i+1}}) + f̃ δ_i − {Ξ_i Φ_K^{Ξ,i}(X_{t_i}) (Δ_i¹ − b δ_i) + Θ_i [(1 − a δ_i) Φ_K^{Θ,i}(X_{t_i}) + D_x Φ_K^{Θ,i}(X_{t_i}) √V_{t_i} (Δ_i B − (bρ₁ + cρ₂) δ_i)]}|²]
= E_Φ[|Y_i − Ξ_i X_i¹ − Θ_i X_i²|²],

with Δ_i B := ρ₁ Δ_i¹ + ρ₂ Δ_i², and where Û_{i+1}(X_{t_{i+1}}) := U_{i+1}(X_{t_{i+1}}; Θ_{i+1,*}) was set in the previous time step and is now constant (with no dependence on Θ_i). We defined

Y_i := Û_{i+1}(X_{t_{i+1}}) + f̃ δ_i,   X_i¹ := Φ_K^Ξ(X_{t_i}) (Δ_i¹ − b δ_i),   X_i² := (1 − a δ_i) Φ_K^Θ(X_{t_i}) + D_x Φ_K^Θ(X_{t_i}) √V_{t_i} (Δ_i B − (bρ₁ + cρ₂) δ_i). (4.6)

In matrix form, this yields ℒ(Θ_i, Ξ_i) = E_Φ[|Y_i − β_i X_i|²], with β_i = (Ξ_i, Θ_i) and X_i = ((X_i¹)^⊤, (X_i²)^⊤)^⊤, for which the RLS from Section 2.2 yields the solution

β_i = E_Φ[(Y_i (X_i¹)^⊤, Y_i (X_i²)^⊤)] (E_Φ[X_i¹(X_i¹)^⊤, X_i¹(X_i²)^⊤; X_i²(X_i¹)^⊤, X_i²(X_i²)^⊤])^{−1}, (4.7)

where the semicolon separates the two block rows of the Gram matrix.
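In practice the block system (4.7) is solved by stacking the two regressor blocks and performing a single regression; a small sketch (ours, with synthetic Gaussian blocks standing in for the reservoir outputs of (4.6)):

```python
import numpy as np

rng = np.random.default_rng(6)
n, K = 10000, 16

# synthetic stand-ins for the regressor blocks X^1_i and X^2_i
X1 = rng.normal(size=(n, K))
X2 = rng.normal(size=(n, K))
Xi_true = rng.normal(size=K)
Th_true = rng.normal(size=K)
Y = X1 @ Xi_true + X2 @ Th_true + 0.01 * rng.normal(size=n)

# stack the blocks: beta_i = (Xi_i, Theta_i) solves one joint regression
X = np.concatenate([X1, X2], axis=1)              # (n, 2K)
YX = np.einsum('n,nk->k', Y, X)
XX = np.einsum('nj,nk->jk', X, X)
beta = YX @ np.linalg.inv(XX + 1e-8 * np.eye(2 * K))
Xi_hat, Theta_hat = beta[:K], beta[K:]
```

The Gram matrix of the stacked regressors is exactly the 2×2 block matrix appearing in (4.7).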

Algorithm

We summarise the steps of the algorithm below:

Algorithm 2: RWNN non-Markovian scheme

Convergence Analysis

In this section, whenever there is any ambiguity, we use the notation X^π to denote the discretised version of the solution process of (4.5) over the partition π = {0 = t₀ < t₁ < ⋯ < t_N = T} of the interval [0,T], with modulus |π| = max_{i∈{0,…,N−1}} δ_i, where δ_i = t_{i+1} − t_i. As mentioned just before Assumption 3.2, the linearity of f was only required to cast the optimisation in Algorithm 2 as a regression problem. It plays no role in the forthcoming convergence analysis, and we therefore allow for a more general function f.

Assumption 5.1

  • (i)
    There exists a unique weak solution to the BSPDE system (4.3), with u, ψ ∈ W^{3,2};
  • (ii)
    there is an increasing continuous function ω: ℝ₊ → ℝ₊ with ω(0) = 0 such that
    E[∫_{t₁}^{t₂} V_s ds] + E[(∫_{t₁}^{t₂} V_s ds)²] ≤ ω(|t₂ − t₁|), for any 0 ≤ t₁ ≤ t₂ ≤ T;
  • (iii)
    there exists L_f > 0 such that, for all (t, x, y, z₁, z₂) and (t̃, x̃, ỹ, z̃₁, z̃₂),
    |f(t, e^x, y, z₁, z₂) − f(t̃, e^{x̃}, ỹ, z̃₁, z̃₂)| ≤ L_f(ω(|t − t̃|)^{1/2} + |x − x̃| + |y − ỹ| + |z₁ − z̃₁| + |z₂ − z̃₂|).

Remark 5.2

Assumption 5.1(iii) may look unusual, but it arises because we are interested in evaluating options on the stock price, which is the exponential of the log-stock price. This condition (the same as in [6]) allows us to control the L² norm of the drift term in (4.5) (see also [13, Assumption 2.12(iii)]), and thus of the backward process Y therein.

Assumption 5.3

Given the partition π, sup_{i∈{0,…,N−1}} E[V_i^π] is finite.

Classical estimates (see Lemma C.1 for full details and proof) yield

E[sup_{0≤t≤T} |X_t|²] ≤ C(1 + |x₀|²),

as well as (Lemma C.2)

max_{i∈{0,…,N−1}} E[|X_{t_{i+1}} − X_{i+1}^π|² + sup_{t∈[t_i,t_{i+1}]} |X_t − X_i^π|²] ≤ C ω(|π|), (5.1)

for some C>0 independent of |π|, and we furthermore have [14]

E[∫₀^T |f(t, e^{X_t}, Y_t, Z_t¹, Z_t²)|² dt] < ∞, (5.2)

as well as the standard L2-regularity result on Y:

max_{i∈{0,…,N−1}} E[sup_{t∈[t_i,t_{i+1}]} |Y_t − Y_i^π|²] = O(|π|). (5.3)

For k ∈ {1,2}, define the errors

ε^k(π) := E[Σ_{i=0}^{N−1} ∫_{t_i}^{t_{i+1}} |Z_t^k − Z̄_i^k|² dt], with Z̄_i^k := (1/δ_i) E_i[∫_{t_i}^{t_{i+1}} Z_t^k dt], (5.4)

which measure the total variance of the processes Z^k along the partition π, and where E_i denotes the conditional expectation given F_{t_i}. We furthermore define, for i ∈ {0,…,N−1}, the auxiliary processes

V̂_{t_i} := E_i[Û_{i+1}(X_{i+1}^π)] + f(t_i, e^{X_i^π}, V̂_{t_i}, Ẑ̄_i¹, Ẑ̄_i²) δ_i,   Ẑ̄_i¹ := Ψ̂_i(X_i^π) + (1/δ_i) E_i[Û_{i+1}(X_{i+1}^π) Δ_i¹],   Ẑ̄_i² := (1/δ_i) E_i[Û_{i+1}(X_{i+1}^π) Δ_i²], (5.5)

with Û_i(x) := U_i(x; Θ_{i,*}) and Ψ̂_i(x) := Ψ_i(x; Ξ_{i,*}) as before. Observe that Ψ̂_{i+1} and Û_{i+1} do not depend on Θ_i, because their parameters were fixed at the (i+1)-th time step and are held constant at step i (see Algorithm 2). Next, notice that V̂ is well defined by a fixed-point argument, since f is Lipschitz. By Assumption 5.1(i), there exist functions v̂_i, ẑ̄_i^k for which

V̂_{t_i} = v̂_i(X_i^π) and Ẑ̄_i^k = ẑ̄_i^k(X_i^π), for i ∈ {0,…,N−1} and k ∈ {1,2}. (5.6)

By the martingale representation theorem, there exist integrable processes Ẑ¹, Ẑ² such that

Û_{i+1}(X_{i+1}^π) = V̂_{t_i} − f(t_i, e^{X_i^π}, V̂_{t_i}, Ẑ̄_i¹, Ẑ̄_i²) δ_i + ∫_{t_i}^{t_{i+1}} Ẑ_t¹ dW_t¹ + ∫_{t_i}^{t_{i+1}} Ẑ_t² dW_t², (5.7)

since the Ẑ_t^k are F_t^W-adapted, as asserted by the martingale representation theorem. From here, Itô's isometry yields

Ẑ̄_i¹ = Ψ̂_i(X_i^π) + (1/δ_i) E_i[Û_{i+1}(X_{i+1}^π) Δ_i¹] = (1/δ_i) ∫_{t_i}^{t_{i+1}} Ψ̂_i(X_i^π) dt + (1/δ_i) E_i[(V̂_{t_i} + ∫_{t_i}^{t_{i+1}} Ẑ_t¹ dW_t¹ + ∫_{t_i}^{t_{i+1}} Ẑ_t² dW_t²) ∫_{t_i}^{t_{i+1}} dW_t¹] = (1/δ_i) E_i[∫_{t_i}^{t_{i+1}} (Ψ̂_i(X_i^π) + Ẑ_t¹) dt],

and similarly,

Ẑ̄_i² = (1/δ_i) E_i[∫_{t_i}^{t_{i+1}} Ẑ_t² dt].

We consider convergence in terms of the following error:

ℰ(Û, Ψ̂) := max_{i∈{0,…,N−1}} E_Φ[|Y_{t_i} − Û_i(X_i^π)|²] + E_Φ[Σ_{i=0}^{N−1} ∫_{t_i}^{t_{i+1}} Σ_{k=1,2} |Z_t^k − Ẑ_i^k(X_i^π)|² dt],

with Z^i1,Z^i2 introduced before Lemma 5.8. We now state the main convergence result:

Theorem 5.4

Under Assumptions 4.1-4.2-5.1, there exists C>0 such that

$\mathcal{E}\big(\widehat{U},\widehat{\Psi}\big)\le C\Big(\omega(|\pi|)+\mathbb{E}\big[\big|g(X_T)-g(X_T^{\pi})\big|^2\big]+\sum_{k=1}^2\varepsilon^k(\pi)+\frac{C}{K}N+M|\pi|^2\Big),$

with C,M>0 given in Lemma 5.9 and the errors εk(π) defined in (5.4).

The following follows from (B.5), established in Part II of the proof of Theorem 5.4:

Corollary 5.5

Under Assumptions 4.1-4.2-5.1, there exists C>0 such that

$\max_{i\in\{0,\dots,N-1\}}\mathbb{E}_{\Phi}\Big[\big|Y_{t_i}-\widehat{U}_i(X_i^{\pi})\big|^2\Big]\le C\Big(\omega(|\pi|)+\mathbb{E}\big[\big|g(X_T)-g(X_T^{\pi})\big|^2\big]+\sum_{k=1}^2\varepsilon^k(\pi)+\frac{C}{K}N+M|\pi|^2\Big),$

with C,M>0 given in Lemma 5.9.

Remark 5.6

The second error term is the strong $L^2$ Monte-Carlo error, which is $O(N^{-H})$ for processes driven by an fBm with Hurst parameter $H\in(0,1)$. We refer the reader to [13, 26] for an exposition on strong versus weak error rates in rough volatility models.

To prove Theorem 5.4, the following bounds on the derivatives are key.

Lemma 5.7

Let $\Psi_K(\cdot;\Theta)\in\mathcal{N}_{\varsigma}^K$ and let $(X_i^{\pi},V_i^{\pi})_i$ denote the discretised versions of (4.5) over the partition $\pi$. Then there exist $L_1,L_2>0$ such that, for all $i\in\{0,\dots,N-1\}$,

$\mathbb{E}_i^{\Phi}\big[\big|D_x\Psi_K(X_{i+1}^{\pi};\Theta)\big|\big]\le L_1 \qquad\text{and}\qquad \mathbb{E}_i^{\Phi}\big[\big|D_x^2\Psi_K(X_{i+1}^{\pi};\Theta)\big|\big]\le L_2.$

Proof

We start with the first derivative. For all x,yRd,

$\big|\Psi_K(x;\Theta)-\Psi_K(y;\Theta)\big|=\big|\Theta\big(\varsigma(Ax+b)-\varsigma(Ay+b)\big)\big|\le\|\Theta\|_F\,\big\|\varsigma(Ax+b)-\varsigma(Ay+b)\big\|\le\|\Theta\|_F\,\|Ax-Ay\|\le\|\Theta\|_F\,\|A\|_F\,\|x-y\|\le L_1\,\|x-y\|,$

since $\varsigma$ is 1-Lipschitz. The estimator $\Theta$ has an explicit form (4.7) and its norm is finite; therefore $\Psi_K(\cdot;\Theta)$ is globally Lipschitz and its first derivative is bounded by some $L_1>0$. Next, without loss of generality, we can set $A=\mathrm{I}$ and $b=0$, since their supports are bounded. As in (2.2), for $j\in\{1,\dots,m\}$,

$\mathbb{E}_i^{\Phi}\big[D_x^2\Psi_K(X_{i+1}^{\pi};\Theta)\big]_j=\Theta\,\mathrm{diag}(e_j)\int\mathrm{diag}\Big(\delta_0\Big(x-\frac{1}{2}V_i^{\pi}\delta_i+\sqrt{V_{t_i}}\,w\Big)\Big)\,p_{\mathcal{N}}(w)\,\mathrm{d}w=\Theta\,\mathrm{diag}(e_j)\,\mathrm{diag}\Big(p_{\mathcal{N}}\big(0;\,x-\tfrac{1}{2}V_i^{\pi}\delta_i,\ V_i^{\pi}\delta_i\big)\Big),$

since $\Delta_iB\sim\mathcal{N}(0,\delta_i)$ and $p_{\mathcal{N}}$ is the Gaussian density applied component-wise. Since the weights are sampled on a compact set and $\Theta$ is finite, there exists $C>0$ such that

$\mathbb{E}_i^{\Phi}\big[\big|D_x^2\Psi_K(X_{i+1}^{\pi};\Theta)\big|\big]\le C\,\|\Theta\|_F=:L_2.$

From here the error bound of approximating V^ti, Z^¯i1 and Z^¯i2 with their RWNN approximators U^i, Z^i1 and Z^i2 (defined in the lemma below) can be obtained. For i{0,,N-1}, (Ui,Ψi)Kς, introduce

$Z_i^1(x):=\Psi_i(x)+\rho_1\sqrt{V_{t_i}}\,D_xU_i(x),\qquad Z_i^2(x):=\rho_2\sqrt{V_{t_i}}\,D_xU_i(x),\qquad \widehat{Z}_i^1(x):=\widehat{\Psi}_i(x)+\rho_1\sqrt{V_{t_i}}\,D_x\widehat{U}_i(x),\qquad \widehat{Z}_i^2(x):=\rho_2\sqrt{V_{t_i}}\,D_x\widehat{U}_i(x).$

Lemma 5.8

Under Assumptions 5.1-5.3, there exists M>0 such that

$\mathbb{E}_{\Phi}\Big[\big|Z_i^k(X_i^{\pi})-\widehat{\bar{Z}}_i^k\big|^2\Big]\le\rho_k^2\,|\pi|^2\,M,\qquad\text{for all } i\in\{0,\dots,N-1\},\ k\in\{1,2\}.$

Proof

From (5.5) and (5.6), we have, for $i\in\{0,\dots,N-1\}$ and $k\in\{1,2\}$,

$\widehat{v}_i(x)=\mathbb{E}_i^{\Phi}\big[\widehat{U}_{i+1}\big(X_{i+1}^{x,\pi}\big)\big]+f\big(t_i,e^{x},\widehat{v}_i(x),\widehat{\bar{z}}_i^1(x),\widehat{\bar{z}}_i^2(x)\big)\,\delta_i,\qquad \widehat{\bar{z}}_i^k(x)=\Psi_i\big(X_{i+1}^{x,\pi}\big)\,\mathbb{1}_{\{k=1\}}+\frac{1}{\delta_i}\,\mathbb{E}_i^{\Phi}\big[\widehat{U}_{i+1}\big(X_{i+1}^{x,\pi}\big)\,\Delta_i^k\big],$

where $X_{i+1}^{x,\pi}=x+\big(r-\tfrac{1}{2}V_{t_i}\big)\delta_i+\sqrt{V_{t_i}}\,\Delta_iB$ is the Euler discretisation of $\{X_t\}_{t\in[0,T]}$ over $\pi$ and $\{V_i^{\pi}\}_{i=0}^N$ is the appropriate discretisation of the volatility process over the same partition. For $\{R_k\}$ iid $\mathcal{N}(0,1)$, the two auxiliary processes can be written as

$\widehat{\bar{z}}_i^k(x)=\Psi_i\big(X_{i+1}^{x,\pi}\big)\,\mathbb{1}_{\{k=1\}}+\frac{1}{\delta_i}\,\mathbb{E}_i^{\Phi}\Big[\widehat{U}_{i+1}\Big(x-\frac{V_i^{x,\pi}}{2}\delta_i+\sqrt{V_i^{x,\pi}\delta_i}\,\big(\rho_1R_1+\rho_2R_2\big)\Big)\sqrt{\delta_i}\,R_k\Big].$

Notice that, while any sensible forward scheme for $\{V_t\}_{t\in[0,T]}$ depends on a series of Brownian increments, $V_i^{x,\pi}$ only depends on $\Delta_{i-1}W,\dots,\Delta_0W$, which are known at time $t_i$. Thus, since the usual derivative operations are available for approximately differentiable functions (Remark 2.7), multivariate integration by parts for Gaussian measures (a formulation of Isserlis' theorem [44]) yields

$\widehat{\bar{z}}_i^k(x)=\Psi_i\big(X_{i+1}^{x,\pi}\big)\,\mathbb{1}_{\{k=1\}}+\rho_k\sqrt{V_{t_i}}\;\mathbb{E}_{\Phi}\big[D_x\widehat{U}_{i+1}\big(X_{i+1}^{x,\pi}\big)\big],$

with corresponding derivatives

$D_x\widehat{\bar{z}}_i^k(x)=D_x\Psi_i\big(X_{i+1}^{x,\pi}\big)\,\mathbb{1}_{\{k=1\}}+\rho_k\sqrt{V_{t_i}}\;\mathbb{E}_{\Phi}\big[D_x^2\widehat{U}_{i+1}\big(X_{i+1}^{x,\pi}\big)\big],$

An application of the implicit function theorem then implies

$D_x\widehat{v}_i(x)=\mathbb{E}_i^{\Phi}\big[D_x\widehat{U}_{i+1}\big(X_{i+1}^{x,\pi}\big)\big]+\delta_i\Big(D_x\widehat{f}_i(x)+D_y\widehat{f}_i(x)\,D_x\widehat{v}_i(x)+\sum_{k=1}^2D_{z_k}\widehat{f}_i(x)\,D_x\widehat{\bar{z}}_i^k(x)\Big),$

where $\widehat{f}_i(x):=f\big(t_i,e^{x},\widehat{v}_i(x),\widehat{\bar{z}}_i^1(x),\widehat{\bar{z}}_i^2(x)\big)$ and

$\Psi_i\big(X_{i+1}^{x,\pi}\big)\,\mathbb{1}_{\{k=1\}}+\big(1-\delta_i\,D_y\widehat{f}_i(x)\big)\,\rho_k\sqrt{V_i^{\pi}}\,D_x\widehat{v}_i(x)=\widehat{\bar{z}}_i^k(x)+\rho_k\sqrt{V_i^{\pi}}\,\delta_i\Big(D_x\widehat{f}_i(x)+D_{z_1}\widehat{f}_i(x)\,D_x\widehat{\bar{z}}_i^1(x)+D_{z_2}\widehat{f}_i(x)\,D_x\widehat{\bar{z}}_i^2(x)\Big).$

Thus, for small enough |π|,

$\Psi_i\big(X_{i+1}^{x,\pi}\big)\,\mathbb{1}_{\{k=1\}}+\rho_k\sqrt{V_i^{\pi}}\,D_x\widehat{v}_i(x)\approx\widehat{\bar{z}}_i^k(x)+\rho_k\sqrt{V_i^{\pi}}\,\delta_i\Big(D_x\widehat{f}_i(x)+D_{z_1}\widehat{f}_i(x)\,D_x\widehat{\bar{z}}_i^1(x)+D_{z_2}\widehat{f}_i(x)\,D_x\widehat{\bar{z}}_i^2(x)\Big),$

and, since $f$ is Lipschitz by Assumption 5.1(iii), $\big|D_r\widehat{f}_i(x)\big|\le1$ for $r\in\{x,z_1,z_2\}$, so that, by Lemma 5.7 and the definition of $\widehat{\bar{z}}_i^k(x)$,

$\Big|\Psi_i\big(X_{i+1}^{x,\pi}\big)\,\mathbb{1}_{\{k=1\}}+\rho_k\sqrt{V_i^{\pi}}\,D_x\widehat{v}_i(x)\Big|\le\big|\widehat{\bar{z}}_i^k(x)\big|+\rho_k\,\delta_i\sqrt{V_i^{\pi}}\Big(1+\big|D_x\widehat{\bar{z}}_i^1(x)\big|+\big|D_x\widehat{\bar{z}}_i^2(x)\big|\Big)\le\big|\widehat{\bar{z}}_i^k(x)\big|+\rho_k\,\delta_i\sqrt{V_i^{\pi}}\Big(1+L_1+2L_2\sqrt{V_i^{\pi}}\Big).$

Therefore, using the above inequality

$\mathbb{E}_{\Phi}\Big[\big|\Psi_i(X_i^{\pi})+\rho_1\sqrt{V_i^{\pi}}\,D_xU_i(X_i^{\pi})-\widehat{\bar{Z}}_i^1\big|^2\Big]\le\mathbb{E}_{\Phi}\Big[\Big(\big|\widehat{\bar{z}}_i^1(X_i^{\pi})-\widehat{\bar{Z}}_i^1\big|+\rho_1\,\delta_i\sqrt{V_i^{\pi}}\big(1+L_1+2L_2\sqrt{V_i^{\pi}}\big)\Big)^2\Big]\le|\rho_1\delta_i|^2\,\mathbb{E}\Big[V_i^{\pi}\big(1+L_1+2L_2\sqrt{V_i^{\pi}}\big)^2\Big]\le\rho_1^2\,|\pi|^2\,M,$

relying on Corollary 2.10 and the fact that $\widehat{\bar{Z}}_i^k=\widehat{\bar{z}}_i^k(X_i^{\pi})$ (see (5.6)) in the second inequality, and on the boundedness of $\mathbb{E}[|V_i^{\pi}|]$ from Assumption 5.3 in the last one. The proof of the other bound is analogous.

Lemma 5.9

Under Assumptions 4.1-4.2-5.1, for sufficiently small |π| we have

$\mathbb{E}_{\Phi}\Big[\big|\widehat{V}_{t_i}-\widehat{U}_i(X_i^{\pi})\big|^2\Big]+\delta_i\,\mathbb{E}_{\Phi}\Big[\sum_{k=1}^2\big|\widehat{\bar{Z}}_i^k-\widehat{Z}_i^k(X_i^{\pi})\big|^2\Big]\le C\Big(\frac{C}{K}+M|\pi|^3\Big)$

for all $i\in\{0,\dots,N-1\}$ and $K$ hidden units, for some $C>0$, where $C$ is as in Proposition A.1 and $M$ as in Lemma 5.8.

Proof of Lemma 5.9

Fix $i\in\{0,\dots,N-1\}$. Relying on the martingale representation (5.7) and Lemma A.2, we define the following loss function for the pair $(U_i(\cdot;\Theta),\Psi_i(\cdot;\Xi))\in\mathcal{N}_{\varsigma}^K\times\mathcal{N}_{\varsigma}^K$ and their corresponding parameters $\Theta$ and $\Xi$:

$\widehat{L}_i(\Theta,\Xi):=\widetilde{L}_i(\Theta,\Xi)+\mathbb{E}_{\Phi}\int_{t_i}^{t_{i+1}}\sum_{k=1}^2\big|\widehat{Z}_t^k-\widehat{\bar{Z}}_i^k\big|^2\,\mathrm{d}t, \qquad (5.8)$

with

$\widetilde{L}_i(\Theta,\Xi):=\mathbb{E}_{\Phi}\Big[\Big|\widehat{V}_{t_i}-U_i(X_i^{\pi};\Theta)+\delta_i\Big\{f\big(t_i,e^{X_i^{\pi}},U_i(X_i^{\pi};\Theta),Z_i^1(X_i^{\pi};\Theta,\Xi),Z_i^2(X_i^{\pi};\Theta,\Xi)\big)-f\big(t_i,e^{X_i^{\pi}},\widehat{V}_{t_i},\widehat{\bar{Z}}_i^1,\widehat{\bar{Z}}_i^2\big)\Big\}\Big|^2\Big]+\delta_i\sum_{k=1}^2\mathbb{E}_{\Phi}\Big[\big|\widehat{\bar{Z}}_i^k-Z_i^k(X_i^{\pi};\Theta,\Xi)\big|^2\Big].$

Now recall the following useful inequality, valid for any $a,b\in\mathbb{R}$:

$(a+b)^2\le(1+\chi)\,a^2+\Big(1+\frac{1}{\chi}\Big)\,b^2,\qquad \chi>0. \qquad (5.9)$
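For completeness, (5.9) is a direct consequence of Young's inequality $2ab\le\chi a^2+\chi^{-1}b^2$, valid for any $\chi>0$ since $(\sqrt{\chi}\,a-b/\sqrt{\chi})^2\ge0$:

```latex
(a+b)^2 = a^2 + 2ab + b^2
        \le a^2 + \chi\, a^2 + \frac{1}{\chi}\, b^2 + b^2
        = (1+\chi)\, a^2 + \Big(1+\frac{1}{\chi}\Big)\, b^2 .
```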

Applying (5.9) yields

$\widetilde{L}_i(\Theta,\Xi)\le\delta_i\sum_{k=1}^2\mathbb{E}_{\Phi}\big[\big|\widehat{\bar{Z}}_i^k-Z_i^k(X_i^{\pi};\Theta,\Xi)\big|^2\big]+(1+C\delta_i)\,\mathbb{E}_{\Phi}\big[\big|\widehat{V}_{t_i}-U_i(X_i^{\pi};\Theta)\big|^2\big]+\Big(1+\frac{1}{C\delta_i}\Big)\delta_i^2\,\mathbb{E}_{\Phi}\Big[\big|f\big(t_i,e^{X_i^{\pi}},U_i(X_i^{\pi};\Theta),Z_i^1(X_i^{\pi};\Theta,\Xi),Z_i^2(X_i^{\pi};\Theta,\Xi)\big)-f\big(t_i,e^{X_i^{\pi}},\widehat{V}_{t_i},\widehat{\bar{Z}}_i^1,\widehat{\bar{Z}}_i^2\big)\big|^2\Big].$

Now, by the Lipschitz condition on $f$ from Assumption 5.1,

$\widetilde{L}_i(\Theta,\Xi)\le(1+C\delta_i)\,\mathbb{E}_{\Phi}\big[\big|\widehat{V}_{t_i}-U_i(X_i^{\pi};\Theta)\big|^2\big]+C\,\delta_i\sum_{k=1}^2\mathbb{E}_{\Phi}\big[\big|\widehat{\bar{Z}}_i^k-Z_i^k(X_i^{\pi};\Theta,\Xi)\big|^2\big].$

Inequality (5.9) also holds with the reverse sign: for any $a,b\in\mathbb{R}$,

$(a+b)^2\ge(1-\chi)\,a^2-\frac{1}{\chi}\,b^2,\qquad \chi>0. \qquad (5.10)$

Applying (5.10) with $\chi=\gamma\delta_i$, for $\gamma>0$, we have

$\widetilde{L}_i(\Theta,\Xi)\ge\delta_i\sum_{k=1}^2\mathbb{E}_{\Phi}\big[\big|\widehat{\bar{Z}}_i^k-Z_i^k(X_i^{\pi};\Theta,\Xi)\big|^2\big]+(1-\gamma\delta_i)\,\mathbb{E}_{\Phi}\big[\big|\widehat{V}_{t_i}-U_i(X_i^{\pi};\Theta)\big|^2\big]-\frac{\delta_i}{\gamma}\,\mathbb{E}_{\Phi}\Big[\big|f\big(t_i,e^{X_i^{\pi}},U_i(X_i^{\pi};\Theta),Z_i^1(X_i^{\pi};\Theta,\Xi),Z_i^2(X_i^{\pi};\Theta,\Xi)\big)-f\big(t_i,e^{X_i^{\pi}},\widehat{V}_{t_i},\widehat{\bar{Z}}_i^1,\widehat{\bar{Z}}_i^2\big)\big|^2\Big].$

Again since f is Lipschitz, the arithmetic-geometric inequality implies

$\widetilde{L}_i(\Theta,\Xi)\ge(1-\gamma\delta_i)\,\mathbb{E}_{\Phi}\big[\big|\widehat{V}_{t_i}-U_i(X_i^{\pi};\Theta)\big|^2\big]+\delta_i\sum_{k=1}^2\mathbb{E}_{\Phi}\big[\big|\widehat{\bar{Z}}_i^k-Z_i^k(X_i^{\pi};\Theta,\Xi)\big|^2\big]-\frac{\delta_iL_f^2}{\gamma}\,\mathbb{E}_{\Phi}\Big[\Big(\big|\widehat{V}_{t_i}-U_i(X_i^{\pi};\Theta)\big|+\sum_{k=1}^2\big|\widehat{\bar{Z}}_i^k-Z_i^k(X_i^{\pi};\Theta,\Xi)\big|\Big)^2\Big]\ge(1-\gamma\delta_i)\,\mathbb{E}_{\Phi}\big[\big|\widehat{V}_{t_i}-U_i(X_i^{\pi};\Theta)\big|^2\big]+\delta_i\sum_{k=1}^2\mathbb{E}_{\Phi}\big[\big|\widehat{\bar{Z}}_i^k-Z_i^k(X_i^{\pi};\Theta,\Xi)\big|^2\big]-\frac{3\delta_iL_f^2}{\gamma}\Big(\mathbb{E}_{\Phi}\big[\big|\widehat{V}_{t_i}-U_i(X_i^{\pi};\Theta)\big|^2\big]+\sum_{k=1}^2\mathbb{E}_{\Phi}\big[\big|\widehat{\bar{Z}}_i^k-Z_i^k(X_i^{\pi};\Theta,\Xi)\big|^2\big]\Big).$

Taking $\gamma=6L_f^2$ gives

$\widetilde{L}_i(\Theta,\Xi)\ge(1-C\delta_i)\,\mathbb{E}_{\Phi}\big[\big|\widehat{V}_{t_i}-U_i(X_i^{\pi};\Theta)\big|^2\big]+\frac{\delta_i}{2}\sum_{k=1}^2\mathbb{E}_{\Phi}\big[\big|\widehat{\bar{Z}}_i^k-Z_i^k(X_i^{\pi};\Theta,\Xi)\big|^2\big].$

For a given $i\in\{0,\dots,N-1\}$, take $(\Theta^*,\Xi^*)\in\arg\min_{\Theta,\Xi}\widehat{L}_i(\Theta,\Xi)$, so that $\widehat{U}_i=U_i(\cdot;\Theta^*)$ and $\widehat{Z}_i^k(\cdot):=Z_i^k(\cdot;\Theta^*,\Xi^*)$. From (5.8), $\widehat{L}_i$ and $\widetilde{L}_i$ have the same minimisers; thus, combining both bounds gives, for all $(\Theta,\Xi)\in\mathbb{R}^{m\times K}\times\mathbb{R}^{m\times K}$,

$\big(1-C\delta_i\big)\,\mathbb{E}_{\Phi}\big[\big|\widehat{V}_{t_i}-\widehat{U}_i(X_i^{\pi})\big|^2\big]+\frac{\delta_i}{2}\sum_{k=1}^2\mathbb{E}_{\Phi}\big[\big|\widehat{\bar{Z}}_i^k-\widehat{Z}_i^k\big|^2\big]\le\widetilde{L}_i(\Theta^*,\Xi^*)\le\widetilde{L}_i(\Theta,\Xi)\le\big(1+C\delta_i\big)\,\mathbb{E}_{\Phi}\big[\big|\widehat{V}_{t_i}-U_i(X_i^{\pi};\Theta)\big|^2\big]+C\delta_i\sum_{k=1}^2\mathbb{E}_{\Phi}\big[\big|\widehat{\bar{Z}}_i^k-Z_i^k(X_i^{\pi};\Theta,\Xi)\big|^2\big].$

For $|\pi|$ sufficiently small, this together with Lemma 5.8 gives

$\mathbb{E}_{\Phi}\big[\big|\widehat{V}_{t_i}-\widehat{U}_i(X_i^{\pi})\big|^2\big]+\delta_i\sum_{k=1}^2\mathbb{E}_{\Phi}\big[\big|\widehat{\bar{Z}}_i^k-Z_i^k(X_i^{\pi};\Theta,\Xi)\big|^2\big]\le C\Big(\inf_{\Theta}\mathbb{E}_{\Phi}\big[\big|\widehat{v}_i(X_i^{\pi})-U_i(X_i^{\pi};\Theta)\big|^2\big]+|\pi|^3\big(\rho_1^2+\rho_2^2\big)M\Big);$

therefore, using Proposition A.1, we obtain

$\mathbb{E}_{\Phi}\big[\big|\widehat{V}_{t_i}-\widehat{U}_i(X_i^{\pi})\big|^2\big]+\delta_i\,\mathbb{E}_{\Phi}\Big[\sum_{k=1}^2\big|\widehat{\bar{Z}}_i^k-\widehat{Z}_i^k(X_i^{\pi})\big|^2\Big]\le C\Big(\inf_{\Theta}\mathbb{E}_{\Phi}\big[\big|\widehat{v}_i(X_i^{\pi})-U_i(X_i^{\pi};\Theta)\big|^2\big]+M|\pi|^3\Big)\le C\Big(\frac{C}{K}+M|\pi|^3\Big).$

The rest of the proof is similar to those of [6, Theorem A.2] and [41, Theorem 4.1]; we include it in Appendix B for completeness.

Numerical Results

We now showcase the performance of the RWNN scheme on a representative model from each of the Markovian and non-Markovian classes. We first test our scheme in the multidimensional Black-Scholes (BS) setting [12] and then move to the non-Markovian setup with the rough Bergomi (rBergomi) model [4]. We develop numerical approximations to the European option prices given in (3.3) and (4.2), choosing

$f(t,x,y,z_1,z_2)=-ry \qquad\text{and}\qquad g_{\mathrm{call}}(x)=\big(e^{x}-K\big)^{+},$

and discretising over the partition $\pi=\{0=t_0,t_1,\dots,t_N=T\}$ for some $N\in\mathbb{N}$. The precise discretisation schemes of the individual processes are given in their corresponding sections below. We remark, however, that the approximated option price for a given Monte-Carlo sample can, by construction, become (slightly) negative, so we add an absorption feature for both models:

$Y_i^{\pi}:=\max\big(0,\widetilde{Y}_i^{\pi}\big),\qquad\text{for } i\in\{0,\dots,N-1\}, \qquad (6.1)$

where {Y~iπ}i=0N denotes the approximation obtained through the RWNN scheme.

Remark 6.1

This is a well-studied problem, especially prevalent in the simulation of square-root diffusions. We acknowledge that the absorption scheme possibly creates additional bias (see [49] for the case of the Heston model); however, a theoretical study in the case of the PDE-RWNN scheme is beyond the scope of this paper.

The reservoir used as a random basis for the RWNNs here is the classical linear reservoir from Definition 2.2. For numerical purposes, we introduce a so-called connectivity parameter, a measure of how interconnected the neurons in a network are: the higher the connectivity, the more interdependence between the neurons (see [17] for the effects of connectivity in different reservoir topologies). In practice, however, too high a connectivity can lead to overfitting and poor generalisation. Recall that our reservoir is given by

$\Phi_K:\mathbb{R}^d\to\mathbb{R}^K,\qquad x\mapsto\Phi_K(x):=\varrho(Ax+b),$

where only $A\in\mathbb{R}^{K\times d}$ is affected by the connectivity parameter $c\in(0,1]$. Mathematically, $A_{ij}=\widetilde{A}_{ij}\,\mathbb{1}_{\{Z_{ij}<c\}}$, where the $Z_{ij}$ are iid $\mathcal{U}[0,1]$ and $\widetilde{A}_{ij}$ is the original matrix entry, unaffected by the connectivity parameter. The value $c=1$ means that $A$ is dense and fully determined by the sampled weights. We find that the choice $c\approx0.5$ results in superior performance.
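As an illustration, the sparsification $A_{ij}=\widetilde{A}_{ij}\,\mathbb{1}_{\{Z_{ij}<c\}}$ takes only a few lines of NumPy; the function name below is our own choice for this sketch, and the weights are sampled uniformly over $[-R,R]$ as in the experiments that follow:

```python
import numpy as np

def sample_reservoir(K, d, c=0.5, R=0.5, seed=0):
    """Sample a random reservoir matrix A (K x d) and bias b (K,),
    then sparsify A with connectivity parameter c in (0, 1]."""
    rng = np.random.default_rng(seed)
    A_tilde = rng.uniform(-R, R, size=(K, d))  # dense weights, fixed forever
    b = rng.uniform(-R, R, size=K)
    mask = rng.uniform(size=(K, d)) < c        # keep entry iff Z_ij < c
    return A_tilde * mask, b

A, b = sample_reservoir(K=100, d=5, c=0.5)
```

With $c=1$ the mask keeps every entry and the dense matrix is recovered; smaller $c$ zeroes out roughly a fraction $1-c$ of the entries.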

In all our experiments, the reservoir weights are sampled uniformly over [-0.5,0.5] (i.e. with the choice R=0.5 in Algorithms 1 and 2). We experimented with sampling over alternative distributions and/or intervals, yet discovered that the scheme remains robust to the choice of support, provided it aligns with the magnitude of the expected output. All experiments below were run on a standard laptop with an AMD Ryzen 9 5900HX processor without any use of GPU, which would most certainly speed up the algorithms further. The code for both models is available at ZuricZ/RWNN_PDE_solver.

Example: Black-Scholes

The Black-Scholes model [12] is ubiquitous in mathematical finance, allowing for closed-form pricing and hedging of many financial contracts. Despite its well-known limitations, it remains a reference and is the first model to check before exploring more sophisticated ones. Since it offers an analytical pricing formula as a benchmark for numerical results, it will serve as a proof of concept for our numerical scheme. Under the pricing measure Q, the underlying assets S=(S1,,Sd) satisfy

$\mathrm{d}S_t^j=S_t^j\big(r\,\mathrm{d}t+\sigma^j\,\mathrm{d}W_t^j\big),\qquad\text{for } t\in[0,T],\ j\in\{1,\dots,d\},$

where the $\{W_t^j\}_{t\in[0,T]}$ are standard Brownian motions with $\mathrm{d}\langle W^i,W^j\rangle_t=\rho_{i,j}\,\mathrm{d}t$ for $\rho_{i,j}\in[-1,1]$, $r\ge0$ is the risk-free rate and $\sigma^j>0$ is the volatility coefficient. The corresponding $d$-dimensional option pricing PDE is then given by

$\frac{\partial u(t,S)}{\partial t}+\sum_{j=1}^d rS^j\,\frac{\partial u(t,S)}{\partial S^j}+\frac{1}{2}\sum_{j=1}^d\big(\sigma^jS^j\big)^2\,\frac{\partial^2u(t,S)}{\partial(S^j)^2}+\sum_{j=1}^{d-1}\sum_{k=j+1}^d\rho_{j,k}\,\sigma^j\sigma^k\,S^jS^k\,\frac{\partial^2u(t,S)}{\partial S^j\partial S^k}=r\,u(t,S),$

for $t\in[0,T)$, with terminal condition $u(T,S_T)=g(S_T^1,\dots,S_T^d)$. To use Algorithm 1, the process $S$ has to be discretised, for example with an Euler-Maruyama scheme: for each $j\in\{1,\dots,d\}$ and $i\in\{0,1,\dots,N-1\}$,

$X_{i+1}^{\pi,j}=X_i^{\pi,j}+\Big(r-\frac{(\sigma^j)^2}{2}\Big)\delta_i+\sigma^j\,\Delta_i^j,\qquad S_{i+1}^{\pi,j}=\exp\big(X_{i+1}^{\pi,j}\big),$

with initial value $X_0^{\pi,j}=\log(S_0^{\pi,j})$. Unless stated otherwise, we let $K=S_0=1$, $r=0.01$, $T=1$, and run the scheme with $N=21$ discretisation steps and $n_{\mathrm{MC}}=50{,}000$ Monte-Carlo samples. The reservoir has $K\in\{10,100,1000\}$ hidden nodes; in Sections 6.1.4 and 6.1.2 the connectivity parameter is set to $c=0.5$.
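As a minimal one-dimensional sketch of the scheme above (not the paper's vectorised implementation), the log-price can be simulated as follows; since the log-price Euler scheme is exact under Black-Scholes, $\mathbb{E}[S_T]=S_0e^{rT}$ provides a quick sanity check:

```python
import numpy as np

def simulate_bs_logpaths(S0, r, sigma, T, N, n_mc, seed=0):
    """Euler scheme for the Black-Scholes log-price X = log S on a uniform
    grid: X_{i+1} = X_i + (r - sigma^2/2) * dt + sigma * dW_i."""
    rng = np.random.default_rng(seed)
    dt = T / N
    dW = rng.standard_normal((n_mc, N)) * np.sqrt(dt)  # Brownian increments
    X = np.empty((n_mc, N + 1))
    X[:, 0] = np.log(S0)
    for i in range(N):
        X[:, i + 1] = X[:, i] + (r - 0.5 * sigma**2) * dt + sigma * dW[:, i]
    return np.exp(X)  # stock-price paths S = exp(X)

S = simulate_bs_logpaths(S0=1.0, r=0.01, sigma=0.1, T=1.0, N=21, n_mc=50_000)
```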

Convergence rate

We empirically analyse the error rate in terms of the number of hidden nodes K obtained in Corollary 5.5. To isolate the dependence on the number of nodes, we fix the discretisation grid and the number of MC samples. We then consider a single ATM vanilla Call, fix c=1, σ=0.1 and vary

$K\in\Big\{10^{1+\frac{2(i-1)}{9}}:i\in\{1,\dots,10\}\Big\},$

that is, over a set of 10 logarithmically spaced points between 10 and 1000. Due to our vectorised implementation of the algorithm, the reservoir basis tensor cannot fit into the random-access memory of a standard laptop for $K\ge10{,}000$. The results in Figure 1 are compared to the theoretical price computed using the Black-Scholes pricing formula. The absorption scheme (6.1) is applied.
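The grid of node counts can be generated, for instance, as follows (rounding to integers is our own choice for the sketch):

```python
import numpy as np

# 10 logarithmically spaced node counts between 10 and 1000:
# K = 10^(1 + 2(i-1)/9) for i = 1, ..., 10, rounded to integers
i = np.arange(1, 11)
K_grid = np.unique(np.round(10 ** (1 + 2 * (i - 1) / 9)).astype(int))
```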

Fig. 1.

Fig. 1

Empirical convergence of the MSE from Corollary 5.5 under Black-Scholes in terms of the number of hidden nodes (for a fixed grid and number of MC samples). Error bars mark 0.1 and 0.9 quantiles of 20 separate runs of the algorithm. The slope coefficient of the dashed line is obtained through regression of the means of individual runs, while the solid line represents 1/K convergence and is shown as a reference

ATM Call option

As a proof of concept, we first test Algorithm 1 with Call options written on $d\in\mathbb{N}$ independent assets, i.e. with $\rho_{j,k}=0$ for $j\ne k$ and $V_T=\big(g_{\mathrm{call}}(S_T^j)\big)_{j\in\{1,\dots,d\}}$. This is only done so that the results can be directly compared to the theoretical price computed using the Black-Scholes pricing formula and is, in effect, the same as pricing $d$ options on $d$ independent assets, each with its own volatility parameter $\sigma$. All the results listed in this section are for $K=100$ hidden nodes. In Table 1, results and relative errors are shown for $d=5$ and $\sigma:=(\sigma^1,\dots,\sigma^d)$ uniformly spaced over [0.05, 0.4]. Next, the effects of the absorption scheme (6.1) are investigated. Curiously, absorption performs noticeably worse than the basic scheme, in which one does not adjust for negative paths. This leads us to believe that absorption adds a substantial bias, similar to the Heston case (see Remark 6.1). Therefore, such a scheme should only be used when positivity of the option price paths is strictly necessary (e.g. when hedging). Finally, in Table 2, total MSE and computational times are given for different dimensions; the computational times are then plotted in Figure 2. It is important to note that our results do not allow us to make definitive claims about the computational times of the PDE-RWNN scheme across different dimensions. This was not the goal of our experiments, and further theoretical study and experiments would be necessary to draw more definitive conclusions regarding the efficiency of the scheme in various dimensions.

Table 1.

A single run for d=5 independent underlyings, where European Calls are compared to the price obtained through PDE-RWNN (with and without absorption) and the Monte Carlo methods along each dimension. Below, the relative errors of both methods are given. The MC method was run using the same paths as in the PDE-RWNN

Price
σ True PDE (w/ abs) PDE (w/o abs) MC
0.05 0.02521640 0.02960256 0.02531131 0.02574731
0.1 0.04485236 0.05523114 0.04467687 0.04547565
0.15 0.06459483 0.07719949 0.06477605 0.06520783
0.2 0.08433319 0.10307868 0.08443957 0.08484961
0.25 0.10403539 0.12660871 0.10412393 0.10513928
Rel. Error
σ PDE(w/ abs) PDE (w/o abs) MC
0.05 1.74e-01 -3.76e-03 -2.11e-02
0.1 2.31e-01  3.91e-03 -1.39e-02
0.15 1.95e-01 -2.81e-03 -9.49e-03
0.2 2.22e-01 -1.26e-03 -6.12e-03
0.25 2.17e-01 -8.51e-04 -1.06e-02
Table 2.

Total MSE of the option price calculated across all $d$ assets, and CPU training times for varying dimension $d$, where $\sigma$ is uniformly spaced over [0.05, 0.4]

d Total MSE (with abs) CPU Time (seconds)
5 3.482e-8 10.5
10 5.417e-8 16.0
25 4.901e-8 34.5
50 1.653e-7 65.0
100 2.534e-7 135.0
Fig. 2.

Fig. 2

Computational time vs number of dimensions, as in Table 2

Computational time

A key advantage of RWNNs is their fast training procedure, which in essence amounts to solving a linear regression problem. We now examine (2.3) to assess its computational complexity. First, computing the sum of outer products $\sum_{j=1}^nY_jX_j^{\top}$, where each $Y_j\in\mathbb{R}^d$ and $X_j\in\mathbb{R}^k$, requires forming $n$ matrices of size $d\times k$, at a total computational cost of $O(ndk)$. Similarly, for the sum $\sum_{j=1}^nX_jX_j^{\top}$, each outer product produces a matrix of size $k\times k$, yielding a cost of $O(nk^2)$. Multiplying the resulting matrices requires $O(dk^2)$ operations, and inverting the $k\times k$ matrix costs $O(k^3)$. The overall computational complexity is therefore

$O\big(ndk+nk^2+k^3\big),$

where, for large $n$, the sums over the $n$ outer products are typically dominant. The higher-order terms $O(nk^2)$ and $O(k^3)$ only become significant when $k$ is no longer negligible compared to $n$ and $d$. This estimate is consistent with the empirical results presented in Figure 2, which show a linear relationship between the dimension $d$ and the observed CPU time. In effect, the dominant $O(ndk)$ term shows that the scheme mitigates the curse of dimensionality, allowing high-dimensional problems to be tackled efficiently within this framework.
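A hedged sketch of the corresponding least-squares readout (the function name and the small ridge regulariser are our own choices; when $\sum_jX_jX_j^{\top}$ is singular the Moore-Penrose inverse applies instead, see Footnote 1):

```python
import numpy as np

def fit_readout(X, Y, reg=1e-8):
    """Least-squares readout: Theta = (sum_j Y_j X_j^T)(sum_j X_j X_j^T)^{-1},
    with a tiny ridge term for numerical stability.
    X: (n, k) reservoir features, Y: (n, d) targets.
    Cost O(ndk + nk^2 + k^3), matching the complexity estimate above."""
    k = X.shape[1]
    G = X.T @ X + reg * np.eye(k)       # (k, k) Gram matrix: O(n k^2)
    B = Y.T @ X                         # (d, k) cross term:  O(n d k)
    # Theta G = B  <=>  G^T Theta^T = B^T (G is symmetric): O(k^3)
    return np.linalg.solve(G.T, B.T).T  # Theta: (d, k)
```

Only `G` and `B` grow with the sample size `n`, which is why training time scales linearly in both `n` and `d`.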

Basket option

We consider an equally weighted basket Call option with a payoff

$g_{\mathrm{basket}}(S_T):=\Big(\frac{1}{d}\sum_{j=1}^dS_T^j-K\Big)^{+},$

where $K>0$ denotes the strike price. For simplicity, we consider $d=5$ and an ATM option with $K:=\frac{1}{d}\sum_{j=1}^dS_0^j$, setting $S_0^j=1$ for all $j\in\{1,\dots,5\}$. The volatilities $\sigma^j$ are uniformly spaced over [0.05, 0.25] and the correlation matrix is randomly chosen as

$\rho:=\begin{pmatrix}1&0.84&-0.51&-0.70&0.15\\0.84&1&-0.66&-0.85&0.41\\-0.51&-0.66&1&0.55&-0.82\\-0.70&-0.85&0.55&1&-0.51\\0.15&0.41&-0.82&-0.51&1\end{pmatrix},$

so that $\Sigma:=\mathrm{diag}(\sigma)\,\rho\,\mathrm{diag}(\sigma)$. Since the distribution of a sum of lognormal random variables is not known explicitly, no closed-form expression is available for the option price. Hence, the reference price is computed using Monte-Carlo with 100 time steps and 400,000 samples. In Table 3, we compare our scheme with a classical MC estimator in terms of relative error, for $K=100$ hidden nodes.
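For illustration, a reference price of this type can be reproduced in a few lines; note that, under Black-Scholes, the terminal values can be sampled exactly, so the sketch below (function name ours) skips the 100-step time grid used for the reference price in the text:

```python
import numpy as np

def basket_call_mc(S0, sigma, rho, r, T, K, n_mc=400_000, seed=0):
    """Monte-Carlo price of an equally weighted basket call under
    Black-Scholes with correlation matrix rho, using exact terminal
    sampling S_T = S_0 exp((r - sigma^2/2) T + sigma sqrt(T) Z)."""
    rng = np.random.default_rng(seed)
    d = len(S0)
    L = np.linalg.cholesky(rho)                   # correlate the Gaussians
    Z = rng.standard_normal((n_mc, d)) @ L.T
    ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
    payoff = np.maximum(ST.mean(axis=1) - K, 0.0)
    return np.exp(-r * T) * payoff.mean()
```

With $d=1$ this reduces to a plain vanilla Monte-Carlo Black-Scholes price, which is an easy consistency check against the closed-form formula.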

Table 3.

Comparison of prices, relative errors and CPU time of the Monte-Carlo estimator, PDE-RWNN scheme with and without absorption (using same sampled MC paths and K=100) and the reference price computed with 100 time steps and 400,000 samples

Reference PDE (with abs) PDE (without abs) MC
Price 0.01624 0.01822 0.01613 0.01625
Rel. error - -1.22e-01 -6.71e-03 -6.50e-04
Time (seconds) 12.8 9.7 9.8 0.3

Example: Rough Bergomi

The rough Bergomi model belongs to the recently developed class of rough stochastic volatility models, first proposed in [4, 27, 34], where the instantaneous variance is driven by a fractional Brownian motion (or, more generally, a continuous Gaussian process) with Hurst parameter $H<\frac{1}{2}$. As highlighted in many papers, these models are able to capture many features of market data (equities, commodities, ...), and clearly seem to outperform most classical models, with fewer parameters. Precise examples with real data can be found in [4] for SPX options, in [28, 38] for joint SPX-VIX options and in [9, 27] for estimation on historical time series, the latter being the state-of-the-art statistical analysis under the $\mathbb{P}$-measure. We consider here the price dynamics under $\mathbb{Q}$, with constant initial forward variance curve $\xi_0(t)>0$ for all $t\in[0,T]$:

$\frac{\mathrm{d}S_t}{S_t}=r\,\mathrm{d}t+\sqrt{V_t}\,\mathrm{d}\big(\rho\,W_t^1+\sqrt{1-\rho^2}\,W_t^2\big),\qquad V_t=\xi_0(t)\,\mathcal{E}\Big(\eta\sqrt{2H}\int_0^t(t-u)^{H-\frac{1}{2}}\,\mathrm{d}W_u^1\Big),$

where $\eta>0$, $\rho\in(-1,1)$, $H\in(0,1)$, and $\mathcal{E}$ denotes the stochastic exponential. The corresponding BSPDE reads

$-\mathrm{d}u(t,x)=\Big(\frac{V_t}{2}\,\partial_x^2u(t,x)+\rho\sqrt{V_t}\,\partial_x\psi(t,x)-\frac{V_t}{2}\,\partial_xu(t,x)-r\,u(t,x)\Big)\mathrm{d}t-\psi(t,x)\,\mathrm{d}W_t^1,$

with terminal condition $u(T,x)=g_{\mathrm{call}}\big(e^{x+rT}\big)$. While existence of the solution has only been proven in the distributional sense [6], we nevertheless apply our RWNN scheme. To test Algorithm 2, both the price and the volatility processes are discretised according to the Hybrid scheme developed in [10, 51]. We set the rBergomi parameters to $(H,\eta,\rho,r,T,S_0)=(0.3,1.9,-0.7,0.01,1,1)$ and choose the forward variance curve to be flat, $\xi_0(\cdot)=0.235^2$. Again, we price an ATM vanilla Call option with $K=S_0=1$. The number of discretisation steps is again $N=21$, the number of Monte-Carlo samples $n_{\mathrm{MC}}=50{,}000$, and the reservoir has $K\in\{10,100,1000\}$ nodes, with connectivity $c=0.5$ in Section 6.2.2.
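The Hybrid scheme of [10, 51] is the method actually used here; purely as a self-contained illustration, on a small grid the variance process can also be sampled exactly by a Cholesky factorisation of the covariance of the Riemann-Liouville kernel (the function name and the midpoint quadrature below are our own choices, not the paper's implementation):

```python
import numpy as np

def rbergomi_variance_paths(xi0, eta, H, T, N, n_mc, seed=0, n_quad=500):
    """Sample V_t = xi0 * exp(eta*sqrt(2H)*G_t - 0.5*eta^2*t^(2H)) on a grid,
    where G_t = int_0^t (t-u)^(H-1/2) dW_u is Gaussian.  Its covariance is
    built with exact variances t^(2H)/(2H) on the diagonal and midpoint
    quadrature off it, then sampled via Cholesky (cost O(N^3), fine for
    small N; the Hybrid scheme is far more efficient for large N)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(T / N, T, N)
    C = np.empty((N, N))
    for i in range(N):
        C[i, i] = t[i] ** (2 * H) / (2 * H)               # Var(G_{t_i})
        for j in range(i):
            u = (np.arange(n_quad) + 0.5) * t[j] / n_quad  # u in (0, t_j)
            C[i, j] = C[j, i] = np.sum(
                (t[i] - u) ** (H - 0.5) * (t[j] - u) ** (H - 0.5)
            ) * t[j] / n_quad
    L = np.linalg.cholesky(C + 1e-12 * np.eye(N))
    G = rng.standard_normal((n_mc, N)) @ L.T
    return xi0 * np.exp(eta * np.sqrt(2 * H) * G - 0.5 * eta**2 * t ** (2 * H))
```

Since the exponent is compensated by its exact variance, each marginal satisfies $\mathbb{E}[V_t]=\xi_0$, which gives a simple sanity check on the sampler.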

Convergence rate

As in Section 6.1.1, we conduct an empirical analysis of the convergence error from Corollary 5.5 for the same ATM Call. To isolate the dependence on the number of nodes we fix c=1, nMC=50,000 and vary

$K\in\Big\{10^{1+\frac{2(i-1)}{9}}:i\in\{1,\dots,10\}\Big\},$

over 10 logarithmically spaced points between 10 and 1000. The reference price is computed by Monte-Carlo with 100 time steps and 800,000 samples. The absorption scheme has been applied and the results are displayed in Figure 3. The same random seed was used as in Section 6.1.1, to ensure consistent results across different simulations.

Fig. 3.

Fig. 3

Empirical convergence of MSE under rBergomi in terms of the number of hidden nodes. Error bars mark 0.1 and 0.9 quantiles of 20 separate runs of the algorithm. The slope coefficient of the dashed line is obtained through regression of the means of individual runs, while the solid line represents 1/K convergence and is shown as a reference

ATM Call option

We now evaluate the performance of our PDE-RWNN method for option pricing in the rough Bergomi model using the market parameters listed above and compare the results to those obtained with the MC method over the same sample paths. We also investigate the effect of the absorption scheme (Table 4) and find that, interestingly, despite keeping the paths positive, the absorption scheme adds a noticeable bias. Nevertheless, the relative error of the proposed scheme with absorption is comparable to results obtained with standard artificial neural networks in the literature [6, Table 1], while our scheme trains orders of magnitude faster.

Table 4.

Prices, relative errors and CPU time of the Monte-Carlo estimator, PDE-RWNN scheme with absorption, PDE-RWNN scheme without absorption, both with K=100 and same sampled MC paths. Reference price computed with 100 time steps and 800,000 samples

Reference PDE (with abs) PDE (without abs) MC
Price 0.07993 0.081924 0.07973 0.080310
Rel. error - 24.9e-03 2.54e-03 -4.73e-03
Time (seconds) 10.1 7.4 7.5 0.4

Acknowledgements

AJ acknowledges financial support from the EPSRC/T032146 grant. ŽŽ was supported by the EPSRC/S023925 CDT in Mathematics of Random Systems. We would like to thank Lukas Gonon, Christian Bayer and Jinniao Qiu for helpful discussions. The python code is available at https://github.com/ZuricZ/RWNN_PDE_solver. The authors have no relevant financial or non-financial interests to disclose.

Appendix A Error Bounds for RWNN

Proposition A.1

(Proposition 3 in [32]) Suppose $\psi:\mathbb{R}^q\to\mathbb{R}$ can be represented as

$\psi(z)=\int_{\mathbb{R}^q}\mathrm{e}^{\mathrm{i}\theta\cdot z}\,g(\theta)\,\mathrm{d}\theta,$

for some complex-valued function $g$ on $\mathbb{R}^q$ and all $z\in\mathbb{R}^q$ with $\|z\|\le Q$. Assume that

$\int_{\mathbb{R}^q}\max\big(1,\|\theta\|^{2q+6}\big)\,|g(\theta)|^2\,\mathrm{d}\theta<\infty.$

For $R>0$, suppose the rows of the $\mathbb{R}^{K\times q}$-valued random matrix $A$ are iid uniform on $B_R\subset\mathbb{R}^q$ (the ball of radius $R$), the entries of $b\in\mathbb{R}^K$ are iid uniform on $[-\max(QR,1),\max(QR,1)]$, $A$ and $b$ are independent, and let $\varsigma(x):=\max(x,0)$ on $\mathbb{R}$. Let the pair $(A,b)$ characterise a random basis (reservoir) $\Phi$ and its corresponding network $\Psi\in\mathcal{N}_{\varsigma}^K$ in the sense of Definition 2.2. Then there exist an $\mathbb{R}^K$-valued random variable $\Theta$ and $C>0$ (given explicitly in [32, Equation (33)]) such that

$\mathbb{E}_{\Phi}\big[\big|\Psi(Z;\Theta)-\psi(Z)\big|^2\big]\le\frac{C}{K},$

and, for any $\delta\in(0,1)$, with probability at least $1-\delta$ the random neural network $\Psi(\cdot;\Theta)$ satisfies

$\Big(\int_{\mathbb{R}^q}\big|\Psi(z;\Theta)-\psi(z)\big|^2\,\mu_Z(\mathrm{d}z)\Big)^{1/2}\le\frac{C}{\sqrt{\delta K}}.$

Lemma A.2

For any $t_i\in\pi$, there exists $g$ as in Proposition A.1 such that the solutions $f\in\{u(t_i,\cdot),\psi(t_i,\cdot)\}$ of the BSPDE (4.3) can be represented as

$f(z)=\int_{\mathbb{R}^q}\mathrm{e}^{\mathrm{i}\theta\cdot z}\,g(\theta)\,\mathrm{d}\theta,\qquad\text{for all } z\in\mathbb{R}^q.$

Proof

A sufficient condition for this is that $f\in L^1(\mathbb{R}^q)$ has an integrable Fourier transform and belongs to the Sobolev space $W^{2,2}(\mathbb{R}^q)$ [24, Theorem 6.1]. In our case, for $q=1$ and with $f\in W^{3,2}(\mathbb{R})$ as enforced by Assumption 5.1(i), [25, Theorem 9.17] implies that $\|(D^{\alpha}f)^{\wedge}\|_{(1)}\le C\|f\|_{(3)}$ for all multi-indices $|\alpha|\le2$, where $\|f\|_{(s)}=\big[\int|\widehat{f}(\xi)|^2\big(1+|\xi|^2\big)^s\,\mathrm{d}\xi\big]^{1/2}$, $s\in\mathbb{R}$, denotes the Sobolev norm. Thus $\widehat{f}\in L^1(\mathbb{R})$ by [25, Corollary 2.52].

Appendix B Proof of Theorem 5.4

We prove the theorem in three steps. First, we derive an estimate for the $L^2$-distance between $\widehat{V}_{t_i}$ and the discretised $Y_{t_i}$. Next, we use this to estimate the $Y$-component, and finally the $Z$-components. In the first part, for convenience, we do not explicitly denote the conditioning of expectations on the realisation of the random bases of the corresponding RWNNs, but remark that expectations should be understood as such whenever applicable. We introduce the error notations

$\mathcal{E}(A,B):=\mathbb{E}\big[|A-B|^2\big],\qquad\mathcal{D}_i(A,B):=\mathbb{E}_i[A-B],\qquad\mathcal{I}_i\big(A(\cdot),B(t_i)\big):=\mathbb{E}\int_{t_i}^{t_{i+1}}\big|A(t)-B(t_i)\big|^2\,\mathrm{d}t,$

where the errors $\mathcal{D}$ and $\mathcal{I}$ are implicitly related to the partition $\pi$ through the index $i\in\{0,\dots,N-1\}$. We further denote $f_t:=f\big(t,X_t,Y_t,Z_t^1,Z_t^2\big)$ and note that the constant $C>0$ may change from line to line.

Part I. We start by showing that, for each $i\in\{0,\dots,N-1\}$,

$\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)\le(1+C|\pi|)\,\mathcal{E}\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)+C|\pi|\,\mathbb{E}\int_{t_i}^{t_{i+1}}|f_t|^2\,\mathrm{d}t+C\sum_{k=1}^2\mathcal{I}_i\big(Z^k,\bar{Z}_i^k\big)+C|\pi|\,\omega(|\pi|), \qquad (B.1)$

where $\omega$ is the modulus of continuity from Assumption 5.1. Writing the SPDE as the corresponding BSPDE as in (4.4) and using (5.5), we obtain

$Y_{t_i}-\widehat{V}_{t_i}=\mathbb{E}_i\big[Y_{t_{i+1}}-\widehat{U}_{i+1}\big]+\mathbb{E}_i\int_{t_i}^{t_{i+1}}\Big(f\big(t,e^{X_t},Y_t,Z_t^1,Z_t^2\big)-f\big(t_i,e^{X_i^{\pi}},\widehat{V}_{t_i},\widehat{\bar{Z}}_i^1,\widehat{\bar{Z}}_i^2\big)\Big)\,\mathrm{d}t.$

Then, Young’s inequality (5.9) with χ=γδi gives

$\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)\le(1+\gamma\delta_i)\,\mathbb{E}\big[\big|\mathcal{D}_i\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)\big|^2\big]+\Big(1+\frac{1}{\gamma\delta_i}\Big)\,\mathbb{E}\Big[\Big|\mathbb{E}_i\int_{t_i}^{t_{i+1}}\Big(f_t-f\big(t_i,e^{X_i^{\pi}},\widehat{V}_{t_i},\widehat{\bar{Z}}_i^1,\widehat{\bar{Z}}_i^2\big)\Big)\,\mathrm{d}t\Big|^2\Big],$

and Cauchy-Schwarz, Assumption 5.1 and (5.1) imply

$\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)\le(1+\gamma\delta_i)\,\mathbb{E}\big[\big|\mathcal{D}_i\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)\big|^2\big]+5\Big(1+\frac{1}{\gamma\delta_i}\Big)L_f^2\,\delta_i\Big\{C|\pi|\,\omega(|\pi|)+\mathcal{I}_i\big(Y,\widehat{V}_{t_i}\big)+\sum_{k=1}^2\mathcal{I}_i\big(Z^k,\widehat{\bar{Z}}_i^k\big)\Big\}.$

The standard inequality $(a+b)^2\le2(|a|^2+|b|^2)$ and the $L^2$-regularity of $Y$ in (5.3) yield

$\mathcal{I}_i\big(Y,\widehat{V}_{t_i}\big)=\mathbb{E}\int_{t_i}^{t_{i+1}}\big|Y_t-\widehat{V}_{t_i}\big|^2\,\mathrm{d}t=\mathbb{E}\int_{t_i}^{t_{i+1}}\big|Y_t-Y_{t_i}+Y_{t_i}-\widehat{V}_{t_i}\big|^2\,\mathrm{d}t\le2\,\mathbb{E}\int_{t_i}^{t_{i+1}}\Big(\big|Y_t-Y_{t_i}\big|^2+\big|Y_{t_i}-\widehat{V}_{t_i}\big|^2\Big)\,\mathrm{d}t\le2|\pi|^2+2\delta_i\,\mathbb{E}\big[\big|Y_{t_i}-\widehat{V}_{t_i}\big|^2\big]=2|\pi|^2+2\delta_i\,\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big),$

so that, after rearranging the constant term using $\big(1+\frac{1}{\gamma\delta_i}\big)L_f^2\delta_i=(1+\gamma\delta_i)\frac{L_f^2}{\gamma}$, we obtain

$\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)\le(1+\gamma\delta_i)\,\mathbb{E}\big[\big|\mathcal{D}_i\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)\big|^2\big]+5(1+\gamma\delta_i)\frac{L_f^2}{\gamma}\Big\{C|\pi|\,\omega(|\pi|)+2\delta_i\,\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)+\sum_{k=1}^2\mathcal{I}_i\big(Z^k,\widehat{\bar{Z}}_i^k\big)\Big\}. \qquad (B.2)$

Since the $\bar{Z}^k$ are defined as $L^2$-projections of $Z^k$ (see (5.4)), the last term reads

$\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)\le(1+\gamma\delta_i)\,\mathbb{E}\big[\big|\mathcal{D}_i\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)\big|^2\big]+5(1+\gamma\delta_i)\frac{L_f^2}{\gamma}\Big\{C|\pi|\,\omega(|\pi|)+2\delta_i\,\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)+\sum_{k=1}^2\Big(\mathcal{I}_i\big(Z^k,\bar{Z}_i^k\big)+\delta_i\,\mathcal{E}\big(\bar{Z}_i^k,\widehat{\bar{Z}}_i^k\big)\Big)\Big\}. \qquad (B.3)$

The rightmost term can be further expanded: integrating the BSDE (4.4) over [ti,ti+1], multiplying it by Δik for k{1,2} separately and using the definitions in (5.5) give

$\delta_i\big(\bar{Z}_i^k-\widehat{\bar{Z}}_i^k\big)=\mathbb{E}_i\Big[\Delta_i^k\Big(Y_{t_{i+1}}-\widehat{U}_{i+1}-\mathbb{E}_i\big[Y_{t_{i+1}}-\widehat{U}_{i+1}\big]\Big)\Big]+\mathbb{E}_i\Big[\Delta_i^k\int_{t_i}^{t_{i+1}}f_t\,\mathrm{d}t\Big].$

Next, taking the square and using the (conditional) Hölder inequality yield

$\frac{\delta_i^2}{2}\,\mathbb{E}_i\big[\big|\bar{Z}_i^k-\widehat{\bar{Z}}_i^k\big|^2\big]\le\Big|\mathbb{E}_i\Big[\Delta_i^k\Big(Y_{t_{i+1}}-\widehat{U}_{i+1}-\mathbb{E}_i\big[Y_{t_{i+1}}-\widehat{U}_{i+1}\big]\Big)\Big]\Big|^2+\Big|\mathbb{E}_i\Big[\Delta_i^k\int_{t_i}^{t_{i+1}}f_t\,\mathrm{d}t\Big]\Big|^2\le\mathbb{E}_i\big[|\Delta_i^k|^2\big]\,\mathbb{E}_i\Big[\Big|Y_{t_{i+1}}-\widehat{U}_{i+1}-\mathbb{E}_i\big[Y_{t_{i+1}}-\widehat{U}_{i+1}\big]\Big|^2\Big]+\mathbb{E}_i\big[|\Delta_i^k|^2\big]\,\mathbb{E}_i\Big[\Big|\int_{t_i}^{t_{i+1}}f_t\,\mathrm{d}t\Big|^2\Big]=\delta_i\Big(\mathbb{E}_i\big[\big|Y_{t_{i+1}}-\widehat{U}_{i+1}\big|^2\big]-\big|\mathbb{E}_i\big[Y_{t_{i+1}}-\widehat{U}_{i+1}\big]\big|^2\Big)+\delta_i\,\mathbb{E}_i\Big[\Big|\int_{t_i}^{t_{i+1}}f_t\,\mathrm{d}t\Big|^2\Big]\le\delta_i\Big(\mathbb{E}_i\big[\big|Y_{t_{i+1}}-\widehat{U}_{i+1}\big|^2\big]-\big|\mathbb{E}_i\big[Y_{t_{i+1}}-\widehat{U}_{i+1}\big]\big|^2\Big)+\delta_i^2\,\mathbb{E}_i\int_{t_i}^{t_{i+1}}|f_t|^2\,\mathrm{d}t.$

Finally, taking expectations and using the tower property of conditional expectations,

$\frac{\delta_i}{2}\,\mathbb{E}\big[\big|\bar{Z}_i^k-\widehat{\bar{Z}}_i^k\big|^2\big]\le\mathcal{E}\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)-\mathbb{E}\big[\big|\mathcal{D}_i\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)\big|^2\big]+\delta_i\,\mathbb{E}\int_{t_i}^{t_{i+1}}|f_t|^2\,\mathrm{d}t, \qquad (B.4)$

which can then be used in (B.3) to obtain

$\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)\le(1+\gamma\delta_i)\,\mathbb{E}\big[\big|\mathcal{D}_i\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)\big|^2\big]+5(1+\gamma\delta_i)\frac{L_f^2}{\gamma}\Big\{C|\pi|\,\omega(|\pi|)+2\delta_i\,\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)+\sum_{k=1}^2\mathcal{I}_i\big(Z^k,\bar{Z}_i^k\big)+4\,\mathcal{E}\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)-4\,\mathbb{E}\big[\big|\mathcal{D}_i\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)\big|^2\big]+4\delta_i\,\mathbb{E}\int_{t_i}^{t_{i+1}}|f_t|^2\,\mathrm{d}t\Big\}\le\big(1+20L_f^2\delta_i\big)\,\mathcal{E}\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)+C\Big(|\pi|\,\omega(|\pi|)+\delta_i\,\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)+\sum_{k=1}^2\mathcal{I}_i\big(Z^k,\bar{Z}_i^k\big)+\delta_i\,\mathbb{E}\int_{t_i}^{t_{i+1}}|f_t|^2\,\mathrm{d}t\Big),$

with $\gamma=20L_f^2$ in the second inequality; letting $|\pi|$ be sufficiently small concludes Part I.

Part II. We now prove an estimate for the $Y$-component, showing that

$\max_{i\in\{0,\dots,N-1\}}\mathcal{E}\big(Y_{t_i},\widehat{U}_i\big)\le C\Big\{\omega(|\pi|)+\mathbb{E}\big[\big|g(X_T)-g(X_T^{\pi})\big|^2\big]+\sum_{k=1}^2\varepsilon^k(\pi)+\frac{C}{K}N+M|\pi|^2\Big\}. \qquad (B.5)$

With Young's inequality in the form $(a+b)^2\ge(1-|\pi|)\,a^2-\frac{1}{|\pi|}\,b^2$, we have

$\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)=\mathbb{E}\big[\big|Y_{t_i}-\widehat{U}_i+\widehat{U}_i-\widehat{V}_{t_i}\big|^2\big]\ge(1-|\pi|)\,\mathcal{E}\big(Y_{t_i},\widehat{U}_i\big)-\frac{1}{|\pi|}\,\mathcal{E}\big(\widehat{V}_{t_i},\widehat{U}_i\big). \qquad (B.6)$

Since $\frac{1}{|\pi|}\le\frac{N}{T}=CN$, taking $|\pi|$ small enough,

$\mathcal{E}\big(Y_{t_i},\widehat{U}_i\big)\le\frac{\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)+CN\,\mathcal{E}\big(\widehat{V}_{t_i},\widehat{U}_i\big)}{1-|\pi|}\le C\,\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)+CN\,\mathcal{E}\big(\widehat{V}_{t_i},\widehat{U}_i\big), \qquad (B.7)$

and combining (B.1) from Part I with (B.7),

$\mathcal{E}\big(Y_{t_i},\widehat{U}_i\big)\le(1+C|\pi|)\,\mathcal{E}\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)+C\Big(|\pi|\,\mathbb{E}\int_{t_i}^{t_{i+1}}|f_t|^2\,\mathrm{d}t+\sum_{k=1}^2\mathcal{I}_i\big(Z^k,\bar{Z}_i^k\big)+|\pi|\,\omega(|\pi|)+N\,\mathcal{E}\big(\widehat{V}_{t_i},\widehat{U}_i\big)\Big).$

After noting that $Y_{t_N}=g(X_T)$ and $\widehat{U}_N=g(X_T^{\pi})$, recalling the definition of $\varepsilon^k(\pi)$ in (5.4) and the $L^2$-integrability of $f$ in (5.2), a straightforward induction implies

$\max_{i\in\{0,\dots,N-1\}}\mathcal{E}\big(Y_{t_i},\widehat{U}_i\big)\le C\Big\{\omega(|\pi|)+|\pi|+\mathbb{E}\big[\big|g(X_T)-g(X_T^{\pi})\big|^2\big]+\sum_{k=1}^2\varepsilon^k(\pi)+N\sum_{i=0}^{N-1}\mathcal{E}\big(\widehat{V}_{t_i},\widehat{U}_i\big)\Big\}.$

Finally, this, in conjunction with Lemma 5.9 (the $Z$-component on the left-hand side can safely be ignored, since it is positive), gives the desired result (B.5).

Part III. Finally, we prove the following bound on the $Z$-components:

$\mathbb{E}\sum_{i=0}^{N-1}\int_{t_i}^{t_{i+1}}\sum_{k=1}^2\big|Z_t^k-\widehat{Z}_i^k(X_i^{\pi})\big|^2\,\mathrm{d}t\le C\Big\{\omega(|\pi|)+\mathbb{E}\big[\big|g(X_T)-g(X_T^{\pi})\big|^2\big]+\sum_{k=1}^2\varepsilon^k(\pi)+\frac{C}{K}N+M|\pi|^2\Big\}.$

Since the $\bar{Z}^k$ are $L^2$-projections of $Z^k$,

$\mathcal{I}_i\big(Z^k,\widehat{\bar{Z}}_i^k\big)=\mathbb{E}\int_{t_i}^{t_{i+1}}\big|Z_t^k-\bar{Z}_i^k+\bar{Z}_i^k-\widehat{\bar{Z}}_i^k\big|^2\,\mathrm{d}t=\mathbb{E}\int_{t_i}^{t_{i+1}}\Big(\big|Z_t^k-\bar{Z}_i^k\big|^2+\big|\bar{Z}_i^k-\widehat{\bar{Z}}_i^k\big|^2+2\big(\bar{Z}_i^k-\widehat{\bar{Z}}_i^k\big)\big(Z_t^k-\bar{Z}_i^k\big)\Big)\,\mathrm{d}t=\mathcal{I}_i\big(Z^k,\bar{Z}_i^k\big)+\delta_i\,\mathcal{E}\big(\bar{Z}_i^k,\widehat{\bar{Z}}_i^k\big)+2\,\mathbb{E}\Big[\big(\bar{Z}_i^k-\widehat{\bar{Z}}_i^k\big)\int_{t_i}^{t_{i+1}}\Big(Z_t^k-\frac{1}{\delta_i}\,\mathbb{E}_i\int_{t_i}^{t_{i+1}}Z_s^k\,\mathrm{d}s\Big)\,\mathrm{d}t\Big],$

and the mixed term vanishes by the tower property. Using (B.4) then yields

$\mathcal{I}_i\big(Z^k,\widehat{\bar{Z}}_i^k\big)=\mathcal{I}_i\big(Z^k,\bar{Z}_i^k\big)+\delta_i\,\mathcal{E}\big(\bar{Z}_i^k,\widehat{\bar{Z}}_i^k\big)\le\mathcal{I}_i\big(Z^k,\bar{Z}_i^k\big)+2\,\mathcal{E}\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)-2\,\mathbb{E}\big[\big|\mathcal{D}_i\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)\big|^2\big]+2|\pi|\,\mathbb{E}\int_{t_i}^{t_{i+1}}|f_t|^2\,\mathrm{d}t,$

for $k\in\{1,2\}$. Summing over $i\in\{0,\dots,N-1\}$ together with (5.2) implies

$\sum_{i=0}^{N-1}\mathcal{I}_i\big(Z^k,\widehat{\bar{Z}}_i^k\big)\le\varepsilon^k(\pi)+2\,\mathbb{E}\big[\big|g(X_T)-g(X_T^{\pi})\big|^2\big]+2\sum_{i=0}^{N-1}\Big(\mathcal{E}\big(Y_{t_i},\widehat{U}_i\big)-\mathbb{E}\big[\big|\mathcal{D}_i\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)\big|^2\big]\Big)+C|\pi|. \qquad (B.8)$

The summation index was shifted in the first term in the curly braces in order to apply the terminal conditions $Y_{t_N}:=g(X_T)$ and $\widehat{U}_N(X_N^{\pi}):=g(X_T^{\pi})$. Rearranging (B.6) into $\mathcal{E}\big(Y_{t_i},\widehat{U}_i\big)\le\frac{1}{|\pi|(1-|\pi|)}\,\mathcal{E}\big(\widehat{V}_{t_i},\widehat{U}_i\big)+\frac{1}{1-|\pi|}\,\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)$ and combining it with (B.2) give

$2\,\mathcal{E}\big(Y_{t_i},\widehat{U}_i\big)-2\,\mathbb{E}\big[\big|\mathcal{D}_i\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)\big|^2\big]\le\frac{3\,\mathcal{E}\big(\widehat{V}_{t_i},\widehat{U}_i\big)}{|\pi|(1-|\pi|)}+\frac{2}{1-|\pi|}\Big\{(1+\gamma|\pi|)\,\mathbb{E}\big[\big|\mathcal{D}_i\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)\big|^2\big]+5(1+\gamma|\pi|)\frac{L_f^2}{\gamma}\Big(C|\pi|\,\omega(|\pi|)+2|\pi|\,\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)+\sum_{k=1}^2\mathcal{I}_i\big(Z^k,\widehat{\bar{Z}}_i^k\big)\Big)\Big\}.$

Plugging this back into (B.8) gives

$\sum_{i=0}^{N-1}\mathcal{I}_i\big(Z^k,\widehat{\bar{Z}}_i^k\big)\le\varepsilon^k(\pi)+2\,\mathbb{E}\big[\big|g(X_T)-g(X_T^{\pi})\big|^2\big]+\sum_{i=0}^{N-1}\frac{3\,\mathcal{E}\big(\widehat{V}_{t_i},\widehat{U}_i\big)}{|\pi|(1-|\pi|)}+\sum_{i=0}^{N-1}\Big\{\frac{2(1+\gamma|\pi|)}{1-|\pi|}\,\mathbb{E}\big[\big|\mathcal{D}_i\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)\big|^2\big]+\frac{1+\gamma|\pi|}{1-|\pi|}\,\frac{10L_f^2}{\gamma}\Big(C\,\omega(|\pi|)\,|\pi|+2|\pi|\,\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)+\sum_{k=1}^2\mathcal{I}_i\big(Z^k,\widehat{\bar{Z}}_i^k\big)\Big)\Big\}+C|\pi|.$

Furthermore, choose $\gamma=50L_f^2$, so that $\frac{1+\gamma|\pi|}{1-|\pi|}\,\frac{10L_f^2}{\gamma}\le\frac{1}{4}$ for small $|\pi|$; then

$\frac{1}{2}\sum_{i=0}^{N-1}\sum_{k=1}^2\mathcal{I}_i\big(Z^k,\widehat{\bar{Z}}_i^k\big)\le\sum_{k=1}^2\varepsilon^k(\pi)+C\Big\{\max_{i\in\{0,\dots,N\}}\mathcal{E}\big(Y_{t_i},\widehat{U}_i\big)+\omega(|\pi|)+|\pi|+\mathbb{E}\big[\big|g(X_T)-g(X_T^{\pi})\big|^2\big]+|\pi|\sum_{i=0}^{N-1}\mathcal{E}\big(Y_{t_i},\widehat{V}_{t_i}\big)+N\sum_{i=0}^{N-1}\mathcal{E}\big(\widehat{V}_{t_i},\widehat{U}_i\big)\Big\},$

which, in conjunction with Part I (B.1), gives

$\frac{1}{2}\sum_{i=0}^{N-1}\sum_{k=1}^2\mathcal{I}_i\big(Z^k,\widehat{\bar{Z}}_i^k\big)\le\sum_{k=1}^2\varepsilon^k(\pi)+C\Big\{\max_{i\in\{0,\dots,N\}}\mathcal{E}\big(Y_{t_i},\widehat{U}_i\big)+\omega(|\pi|)+|\pi|+\mathbb{E}\big[\big|g(X_T)-g(X_T^{\pi})\big|^2\big]+|\pi|\sum_{i=0}^{N-1}\Big(C|\pi|\,\omega(|\pi|)+C\sum_{k=1}^2\mathcal{I}_i\big(Z^k,\bar{Z}_i^k\big)+(1+C|\pi|)\,\mathcal{E}\big(Y_{t_{i+1}},\widehat{U}_{i+1}\big)+|\pi|\,\mathbb{E}\int_{t_i}^{t_{i+1}}|f_t|^2\,\mathrm{d}t\Big)+N\sum_{i=0}^{N-1}\mathcal{E}\big(\widehat{V}_{t_i},\widehat{U}_i\big)\Big\}.$

The $L^2$-integrability of $f$ in (5.2), Lemma 5.9 and Part II (B.5) then give

$\frac{1}{2}\sum_{i=0}^{N-1}\sum_{k=1}^2\mathcal{I}_i\big(Z^k,\widehat{\bar{Z}}_i^k\big)\le C\Big\{\sum_{k=1}^2\varepsilon^k(\pi)+\omega(|\pi|)+|\pi|+\mathbb{E}\big[\big|g(X_T)-g(X_T^{\pi})\big|^2\big]+\frac{C}{K}N+M|\pi|^2\Big\}.$

Ultimately, since $\bar{Z}$ is an $L^2$-projection, we have, for $k\in\{1,2\}$,

$\mathbb{E}\int_{t_i}^{t_{i+1}}\big|Z_t^k-\widehat{Z}_i^k(X_i^{\pi})\big|^2\,\mathrm{d}t\le2\,\mathcal{I}_i\big(Z^k,\widehat{\bar{Z}}_i^k\big)+2\,\delta_i\,\mathbb{E}\big[\big|\widehat{\bar{Z}}_i^k-\widehat{Z}_i^k(X_i^{\pi})\big|^2\big].$

Then Lemma 5.9 and summing over $i\in\{0,\dots,N-1\}$ yield Part III.

Appendix C Technical Lemmata

Lemma C.1

Under Assumptions 4.1 and 5.1(ii), the solution to the SDE in (4.1) starting at x0>0 satisfies the following growth inequality:

$\mathbb{E}\big[\sup_{0\le t\le T}\big|X_t^{0,x_0}\big|^2\big]\le C\big(1+|x_0|^2\big),\qquad\text{for some } C>0.$

Proof

For simplicity, we write $X=X^{0,x_0}$ in the proof. For any $t\ge0$, we can decompose $X_t$ into three parts:

$X_t=x_0+\underbrace{\int_0^t\Big(r-\frac{V_s}{2}\Big)\,\mathrm{d}s}_{=:D_t}+\underbrace{\int_0^t\sqrt{V_s}\,\big(\rho_1\,\mathrm{d}W_s^1+\rho_2\,\mathrm{d}W_s^2\big)}_{=:M_t}.$

We can bound the martingale term M using the BDG inequality and Assumption 5.1(ii) so that there exists CM>0 for which

$\mathbb{E}\big[\sup_{0\le t\le T}|M_t|^2\big]\le C_M\,\mathbb{E}\int_0^TV_s\,\mathrm{d}s\le C_M\,\omega(T).$

For the drift term D, we now have

$\sup_{0\le t\le T}|D_t|\le rT+\frac{1}{2}\int_0^TV_s\,\mathrm{d}s.$

Since $V\ge0$ almost surely by Assumption 4.1, we can square both sides, interchange the square and the supremum, and use $(a+b)^2\le2(a^2+b^2)$ together with Assumption 5.1(ii), so that

$\mathbb{E}\big[\sup_{0\le t\le T}|D_t|^2\big]\le\mathbb{E}\Big[2r^2T^2+\frac{1}{2}\Big(\int_0^TV_s\,\mathrm{d}s\Big)^2\Big]\le2r^2T^2+\frac{1}{2}\,\omega(T).$

Combining all terms and using $(a+b+c)^2\le3(a^2+b^2+c^2)$, we have

$\mathbb{E}\big[\sup_{0\le t\le T}|X_t|^2\big]=\mathbb{E}\big[\sup_{0\le t\le T}|x_0+D_t+M_t|^2\big]\le3\Big(|x_0|^2+\mathbb{E}\big[\sup_{0\le t\le T}|D_t|^2\big]+\mathbb{E}\big[\sup_{0\le t\le T}|M_t|^2\big]\Big)\le3|x_0|^2+6r^2T^2+\frac{3}{2}\,\omega(T)+3C_M\,\omega(T),$

and the lemma follows.

Lemma C.2

Under Assumptions 5.1(ii) and 5.3, the solution to (4.1) starting at x0>0 satisfies

$\max_{i\in\{0,\dots,N-1\}}\mathbb{E}\Big[\big|X_{t_{i+1}}-X_{t_{i+1}}^{\pi}\big|^2+\sup_{t\in[t_i,t_{i+1}]}\big|X_t-X_{t_i}^{\pi}\big|^2\Big]\le C\,\omega(|\pi|),$

where $|\pi|$ is the mesh size of the partition $\pi=\{t_0,t_1,\dots,t_N\}$.

Proof

Error at the grid points: $\big|X_{t_{i+1}}-X_{t_{i+1}}^{\pi}\big|^2$. The difference between the true solution and its Euler-Maruyama approximation at the grid points reads

$X_{t_{i+1}}-X_{i+1}^{\pi}=\int_{t_i}^{t_{i+1}}\Big(r-\frac{V_s}{2}\Big)\,\mathrm{d}s-\Big(r-\frac{V_i^{\pi}}{2}\Big)(t_{i+1}-t_i)+\int_{t_i}^{t_{i+1}}\sqrt{V_s}\,\mathrm{d}W_s-\sqrt{V_i^{\pi}}\,\big(W_{t_{i+1}}-W_{t_i}\big),$

where $W=\rho_1W^1+\rho_2W^2$. Since $\mathbb{E}[|V_i^{\pi}|]$ is finite for all $t_i\in\pi$ by Assumption 5.3, it is easy to see, using Assumption 5.1(ii), that the drift term contributes an error of order $O(\omega(|\pi|))$. For the diffusion term,

$\mathbb{E}\Big[\Big|\int_{t_i}^{t_{i+1}}\sqrt{V_s}\,\mathrm{d}W_s-\sqrt{V_{t_i}}\,\big(W_{t_{i+1}}-W_{t_i}\big)\Big|^2\Big]=\mathbb{E}\Big[\Big|\int_{t_i}^{t_{i+1}}\big(\sqrt{V_s}-\sqrt{V_{t_i}}\big)\,\mathrm{d}W_s\Big|^2\Big]=\mathbb{E}\int_{t_i}^{t_{i+1}}\big|\sqrt{V_s}-\sqrt{V_{t_i}}\big|^2\,\mathrm{d}s$

by the Itô isometry. Since $\big|\sqrt{V_s}-\sqrt{V_{t_i}}\big|^2\le|V_s-V_{t_i}|$, Assumption 5.1(ii) then bounds the diffusion term by $O(\omega(|\pi|))$.

Supremum error over subintervals: $\sup_{t\in[t_i,t_{i+1}]}\big|X_t-X_{t_i}^{\pi}\big|^2$. For any subinterval, $E_t:=X_t-X_{t_i}^{\pi}$ is the local error due to discretisation. Standard SDE arguments and properties of the Euler-Maruyama scheme (see for example [42]) again yield an upper bound of order $O(\omega(|\pi|))$. Since expectation is linear, the result follows.

Footnotes

1

The matrix $\mathbb{E}[XX^{\top}]$ may not be invertible, but its generalised Moore-Penrose inverse always exists.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Anker, F., Bayer, C., Eigel, M., Ladkau, M., Neumann, J., Schoenmakers, J.: SDE based regression for linear random PDEs. SIAM J. Sci. Comput. 39, 1168–1200 (2017)
  • 2.Barron, A.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39, 930–945 (1993)
  • 3.Barron, A.R.: Neural net approximation. In: Proc. 7th Yale Workshop on Adaptive and Learning Systems, vol. 1, pp. 69–72 (1992)
  • 4.Bayer, C., Friz, P., Gatheral, J.: Pricing under rough volatility. Quantitative Finance 16, 887–904 (2016)
  • 5.Bayer, C., Friz, P.K., Fukasawa, M., Gatheral, J., Jacquier, A., Rosenbaum, M.: Rough Volatility. SIAM (2023)
  • 6.Bayer, C., Qiu, J., Yao, Y.: Pricing options under rough volatility with backward SPDEs. SIAM J. Financial Math. 13, 179–212 (2022)
  • 7.Beck, C., Becker, S., Cheridito, P., Jentzen, A., Neufeld, A.: Deep splitting method for parabolic PDEs. SIAM J. Sci. Comput. 43, A3135–A3154 (2021)
  • 8.Beck, C., Hutzenthaler, M., Jentzen, A., Kuckuck, B.: An overview on deep learning-based approximation methods for partial differential equations. Discrete Contin. Dyn. Syst. Ser. B 28, 3697–3746 (2023)
  • 9.Bennedsen, M., Lunde, A., Pakkanen, M.: Decoupling the short- and long-term behavior of stochastic volatility. J. Financ. Economet. 20, 961–1006 (2022)
  • 10.Bennedsen, M., Lunde, A., Pakkanen, M.S.: Hybrid scheme for Brownian semistationary processes. Finance Stochast. 21, 931–965 (2017)
  • 11.Berner, J., Elbrächter, D., Grohs, P., Jentzen, A.: Towards a regularity theory for ReLU networks. In: 13th International Conference on Sampling Theory and Applications. IEEE (2019)
  • 12.Black, F., Scholes, M.: The pricing of options and corporate liabilities. J. Polit. Econ. 81, 637–654 (1973)
  • 13.Bonesini, O., Jacquier, A., Pannier, A.: Rough volatility, path-dependent PDEs and weak rates of convergence. arXiv:2304.03042 (2023)
  • 14.Briand, P., Delyon, B., Hu, Y., Pardoux, E., Stoica, L.: $L^p$ solutions of backward stochastic differential equations. Stoch. Process. Appl. 108, 109–129 (2003)
  • 15.Buehler, H., Gonon, L., Teichmann, J., Wood, B.: Deep hedging. Quantitative Finance 19, 1271–1291 (2019)
  • 16.Cao, W., Wang, X., Ming, Z., Gao, J.: A review on neural networks with random weights. Neurocomputing 275, 278–287 (2018)
  • 17.Dale, M., O’Keefe, S., Sebald, A., Stepney, S., Trefzer, M.A.: Reservoir computing quality: connectivity and topology. Nat. Comput. 20, 205–216 (2020)
  • 18.E, W., Han, J., Jentzen, A.: Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Commun. Math. Stat. 5, 349–380 (2017)
  • 19.Elbrächter, D., Grohs, P., Jentzen, A., Schwab, C.: DNN expression rate analysis of high-dimensional PDEs: application to option pricing. Constr. Approx. 55, 3–71 (2021)
  • 20.El Euch, O., Fukasawa, M., Rosenbaum, M.: The microstructural foundations of leverage effect and rough volatility. Finance Stochast. 22, 241–280 (2018)
  • 21.El Euch, O., Rosenbaum, M.: The characteristic function of rough Heston models. Math. Financ. 29, 3–38 (2018)
  • 22.Evans, L.C.: Measure Theory and Fine Properties of Functions. CRC Press, Boca Raton (1992)
  • 23.Federer, H.: Geometric Measure Theory. Springer-Verlag, Berlin (1996)
  • 24.Folland, G.B.: Introduction to Partial Differential Equations. Princeton University Press, Princeton (1995)
  • 25.Folland, G.B.: Real Analysis. Wiley & Sons, Hoboken (2013)
  • 26.Gassiat, P.: Weak error rates of numerical schemes for rough volatility. SIAM J. Financial Math. (2023)
  • 27.Gatheral, J., Jaisson, T., Rosenbaum, M.: Volatility is rough. Quantitative Finance 18, 933–949 (2018)
  • 28.Gatheral, J., Jusselin, P., Rosenbaum, M.: The quadratic rough Heston model and the joint S&P 500/VIX smile calibration problem. Risk (2020)
  • 29.Germain, M., Pham, H., Warin, X.: Neural networks-based algorithms for stochastic control and PDEs in finance. In: Capponi, A., Lehalle, C.-A. (eds.) Machine Learning and Data Sciences for Financial Markets: A Guide to Contemporary Practices, pp. 426–452. CUP (2023)
  • 30.Gierjatowicz, P., Sabate-Vidales, M., Šiška, D., Szpruch, Ł., Žurič, Ž.: Robust pricing and hedging via neural stochastic differential equations. J. Computational Finance 26, 1–32 (2022)
  • 31.Gonon, L.: Random feature neural networks learn Black-Scholes type PDEs without curse of dimensionality. J. Mach. Learn. Res. 24, 1–51 (2023)
  • 32.Gonon, L., Grigoryeva, L., Ortega, J.-P.: Approximation bounds for random neural networks and reservoir systems. Ann. Appl. Probab. 33, 28–69 (2023)
  • 33.Gonon, L., Schwab, C.: Deep ReLU neural networks overcome the curse of dimensionality for partial integrodifferential equations. Anal. Appl. 21, 1–47 (2023)
  • 34.Guennoun, H., Jacquier, A., Roome, P., Shi, F.: Asymptotic behavior of the fractional Heston model. SIAM J. Financial Math. 9, 1017–1045 (2018)
  • 35.Han, J., Jentzen, A., E, W.: Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad. Sci. 115, 8505–8510 (2018)
  • 36.Herrera, C., Krach, F., Ruyssen, P., Teichmann, J.: Optimal stopping via randomized neural networks. Frontiers of Mathematical Finance 3, 31–77 (2024)
  • 37.Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970)
  • 38.Horvath, B., Jacquier, A., Tankov, P.: Volatility options in rough volatility models. SIAM J. Financial Math. 11, 437–469 (2020)
  • 39.Horvath, B., Muguruza, A., Tomas, M.: Deep learning volatility: a deep neural network perspective on pricing and calibration in (rough) volatility models. Quantitative Finance 21, 11–27 (2020)
  • 40.Huang, G.-B., Chen, L., Siew, C.-K.: Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans. Neural Networks 17, 879–892 (2006)
  • 41.Huré, C., Pham, H., Warin, X.: Deep backward schemes for high-dimensional nonlinear PDEs. Math. Comput. 89, 1547–1579 (2020)
  • 42.Hutzenthaler, M., Jentzen, A., Kloeden, P.E.: Strong convergence of an explicit numerical method for SDEs with non-globally Lipschitz continuous coefficients. Ann. Appl. Probab. 22, 1611–1641 (2012)
  • 43.Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T.A.: A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations. SN Partial Differ. Equ. Appl. 1 (2020)
  • 44.Isserlis, L.: On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables. Biometrika 12, 134–139 (1918)
  • 45.Jacquier, A., Oumgari, M.: Deep curve-dependent PDEs for affine rough volatility. SIAM J. Financial Math. 14, 353–382 (2023)
  • 46.LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient BackProp. Springer, Berlin (2012)
  • 47.Liu, J., Sun, T., Luo, Y., Fu, Q., Cao, Y., Zhai, J., Ding, X.: Financial data forecasting using optimized echo state network. In: NeurIPS, pp. 138–149 (2018)
  • 48.Longstaff, F., Schwartz, E.: Valuing American options by simulation: a simple least-squares approach. The Review of Financial Studies 14, 113–147 (2001)
  • 49.Lord, R., Koekkoek, R., van Dijk, D.: A comparison of biased simulation schemes for stochastic volatility models. Quantitative Finance 10, 177–194 (2009)
  • 50.Mandelbrot, B.B., Van Ness, J.W.: Fractional Brownian motions, fractional noises and applications. SIAM Rev. 10, 422–437 (1968)
  • 51.McCrickerd, R., Pakkanen, M.S.: Turbocharging Monte Carlo pricing for the rough Bergomi model. Quantitative Finance 18, 1877–1886 (2018)
  • 52.Mei, S., Misiakiewicz, T., Montanari, A.: Generalization error of random feature and kernel methods: hypercontractivity and kernel matrix concentration. Appl. Comput. Harmon. Anal. 59, 3–84 (2022)
  • 53.Neufeld, A., Schmocker, P., Wu, S.: Full error analysis of the random deep splitting method for nonlinear parabolic PDEs and PIDEs. Commun. Nonlinear Sci. Numer. Simul. 143, 108556 (2025)
  • 54.Pardoux, E., Peng, S.: Adapted solution of a backward stochastic differential equation. Systems & Control Letters 14, 55–61 (1990)
  • 55.Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: NeurIPS 20 (2007)
  • 56.Rahimi, A., Recht, B.: Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In: NeurIPS 21 (2008)
  • 57.Reisinger, C., Zhang, Y.: Rectified deep neural networks overcome the curse of dimensionality for nonsmooth value functions in zero-sum games of nonlinear stiff systems. Anal. Appl. 18, 951–999 (2020)
  • 58.Rudin, W.: Real and Complex Analysis. McGraw-Hill, Columbus (1986)
  • 59.Ruf, J., Wang, W.: Neural networks for option pricing and hedging: a literature review. The J. Computational Finance 24, 1–46 (2020)
  • 60.Saporito, Y.F., Zhang, Z.: Path-dependent deep Galerkin method: a neural network approach to solve path-dependent PDEs. SIAM J. Financial Math. 12, 912–940 (2021)
  • 61.Shang, Y., Wang, F., Sun, J.: Randomized neural network with Petrov-Galerkin methods for solving linear and nonlinear partial differential equations. Commun. Nonlinear Sci. Numer. Simul. 127, 107518 (2023)
  • 62.Sirignano, J., Spiliopoulos, K.: DGM: a deep learning algorithm for solving partial differential equations. J. Comput. Phys. 375, 1339–1364 (2018)

Articles from Applied Mathematics and Optimization are provided here courtesy of Springer
