Royal Society Open Science. 2017 Nov 15;4(11):171377. doi: 10.1098/rsos.171377

Risk-aware multi-armed bandit problem with application to portfolio selection

Xiaoguang Huo, Feng Fu
PMCID: PMC5717697  PMID: 29291122

Abstract

Sequential portfolio selection has attracted increasing interest in the machine learning and quantitative finance communities in recent years. As a mathematical framework for reinforcement learning policies, the stochastic multi-armed bandit problem addresses the primary difficulty in sequential decision-making under uncertainty, namely the exploration versus exploitation dilemma, and therefore provides a natural connection to portfolio selection. In this paper, we incorporate risk awareness into the classic multi-armed bandit setting and introduce an algorithm to construct a portfolio. By filtering assets based on the topological structure of the financial market and combining the optimal multi-armed bandit policy with the minimization of a coherent risk measure, we achieve a balance between risk and return.

Keywords: multi-armed bandit, online learning, portfolio selection, graph theory, risk-awareness, conditional value-at-risk

1. Introduction

Portfolio selection is a popular area of study in the financial industry ranging from academic researchers to fund managers. The problem involves determining the best combination of assets to be held in the portfolio in order to achieve the investor’s objectives, such as maximizing the cumulative return relative to some risk measure. In the finance community, the traditional approach to this problem can be traced back to 1952 with Markowitz’s seminal paper [1], which introduces mean-variance analysis, also known as the modern portfolio theory (MPT), and suggests choosing the allocation that maximizes the expected return for a certain risk level quantified by variance. On the other hand, sequential portfolio selection models have been developed in the mathematics and computer science communities; for example, Cover’s universal portfolio strategy [2], Helmbold’s multiplicative update portfolio strategy [3] and also see Li & Hoi [4] for a comprehensive survey. In recent years, with the unprecedented success of AI and machine learning methods evidenced by AlphaGo defeating the world champion and OpenAI’s bot beating professional Dota players, more creative machine learning-based portfolio selection strategies also emerged [5,6].

Including portfolio selection, many practical problems such as clinical trials, online advertising and robotics can be modelled as sequential decision-making under uncertainty [7]. In such a process, at each trial the learner faces the trade-off between acting ambitiously to acquire new knowledge and acting conservatively to take advantage of current knowledge, which is commonly known as the exploration versus exploitation dilemma. Often understood as a single-state Markov decision process (MDP), the stochastic multi-armed bandit problem provides an extremely intuitive mathematical framework to study sequential decision-making.

An abstraction of this setting involves a set of $K$ slot machines and a sequence of $N$ trials. At each trial $t=1,\dots,N$, the learner chooses to play one of the machines $I_t\in\{1,\dots,K\}$ and receives a reward $R_{I_t,t}$ drawn randomly from the corresponding fixed but unknown probability distribution $\nu_{I_t}$, whose mean is $\mu_{I_t}$. In the classic setting, the random rewards of the same machine across time are assumed to be independent and identically distributed, and the rewards of different machines are also independent. The objective of the learner is to develop a policy, an algorithm that specifies which machine to play at each trial, to maximize cumulative rewards. A popular measure for the performance of a policy is the regret after some $n$ trials, which is defined to be

$$\xi(n) \overset{\text{def}}{=} \max_{i\in[1,K]} \sum_{t=1}^{n} R_{i,t} - \sum_{t=1}^{n} R_{I_t,t}. \tag{1.1}$$

However, in a stochastic model it is more intuitive to compare rewards in expectation and use pseudo-regret [8]. Let $T_i(n)$ be the number of times machine $i$ is played during the first $n$ trials and let $\mu^*=\max\{\mu_1,\dots,\mu_K\}$. Then,

$$\hat{\xi}(n) \overset{\text{def}}{=} n\mu^* - E\left[\sum_{t=1}^{n} R_{I_t,t}\right] = \sum_{1\le i\le K,\ \mu_i<\mu^*} (\mu^*-\mu_i)\,E[T_i(n)]. \tag{1.2}$$

Thus, the learner’s objective to maximize cumulative rewards is equivalent to minimizing regret. Lai & Robbins [9] prove an asymptotic lower bound on the best possible growth rate of total regret, which is of order $\log n$ with a coefficient determined by the suboptimality of each machine and the Kullback–Leibler divergence. Since then, various online learning policies have been proposed [10], among which the UCB1 policy developed in Auer et al. [11] is considered optimal and is introduced in detail in the Methods and model section.
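As a small illustration of the pseudo-regret in equation (1.2), the following sketch evaluates the sum over suboptimal machines for given means and expected play counts. The function name and the example numbers are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def pseudo_regret(means, expected_counts):
    """Pseudo-regret: sum over machines of (mu* - mu_i) * E[T_i(n)].

    Optimal machines contribute zero because mu* - mu_i = 0 for them.
    """
    if len(means) != len(expected_counts):
        raise ValueError("means and expected_counts must have the same length")
    best = max(means)
    return sum((best - mu) * count for mu, count in zip(means, expected_counts))


# Hypothetical example: three machines, 100 trials in total.
example_means = [0.5, 0.4, 0.1]
example_counts = [80, 15, 5]
example_regret = pseudo_regret(example_means, example_counts)
```

In this example only the 20 plays of the two suboptimal machines contribute, which is why a good policy keeps the expected counts of suboptimal machines growing slowly.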

Although the classic multi-armed bandit has been well studied in academia, a number of variants of this problem have been proposed to model different real-world scenarios. For example, Agrawal & Goyal [12] considers a contextual bandit with a linear reward function and analyses the performance of the Thompson sampling algorithm. Koulouriotis & Xanthopoulos [13] studies the non-stationary setting where the reward distributions of machines change at a fixed time. A more important variant is the risk-aware setting, where the learner considers risk in the objective instead of simply maximizing the cumulative reward. This variant is closely related to the portfolio selection problem, where risk management is an indispensable concern, and has been discussed in several papers. For example, Sani et al. [14] studies the problem where the learner’s objective is to minimize the mean variance defined as $\sigma^2-\rho\mu$ and proposes two algorithms, MV-LCB and ExpExp. In a similar setting, Vakili & Zhao [15] provides a finer analysis of the performance of algorithms proposed in Sani et al. [14]. In addition, Vakili & Zhao [16] extends this setting by considering the mean variance and value-at-risk of total rewards at the end of the time horizon. In a more generalized case, Zimin et al. [17] sets the objective to be a function of the mean and the variance $f(\mu,\sigma^2)$ and defines the φ-LCB algorithm that achieves desirable performance under certain conditions. Moreover, Galichet et al. [18] chooses the conditional value-at-risk to be the objective and proposes the MARAB algorithm.

These works serve as the inspiration for us to consider risk in the model, but they are not directly applicable to the portfolio selection problem, because these methods only choose the best single machine to play at each trial. To address this issue, a basket of candidate portfolios needs to be selected first, in a preliminary stage, in a strategic and logical way. For example, Shen et al. [19] uses principal component analysis (PCA) to select candidate portfolios, namely the normalized eigenvectors of the covariance matrix of asset returns.

In our model, we first take a graph theory approach to filter and select a basket of assets, which we use to construct the portfolio. Then, at each trial we combine the single-asset portfolio determined by the optimal multi-armed bandit algorithm with the portfolio that globally minimizes a coherent risk measure, the conditional value-at-risk. The rest of this paper is organized as follows. In the Methods and model section, we formulate the portfolio selection problem in the multi-armed bandit setting and describe our methodology in detail. In the Results section, we present our simulation results using the proposed method. In the Discussion and conclusion section, we discuss results and also provide directions for future research.

2. Methods and model

2.1. Problem formulation

In this section, we modify the classic multi-armed bandit setting to model portfolio selection. Consider a financial market with a large set of assets, from which the learner selects a basket of $K$ assets to invest in a sequence of $N$ trials. At each trial $t=1,\dots,N$, the learner chooses a portfolio $\omega_t=(\omega_{1,t},\dots,\omega_{K,t})^\top$ where $\omega_{i,t}$ is the weight of asset $i$. As we only consider long-only and self-financed trading, we must have $\omega_t\in W$ where $W=\{u\in\mathbb{R}_+^K : u^\top\mathbf{1}=1\}$ and $\mathbf{1}$ is a column vector of ones. The returns of assets are then revealed at trial $t+1$ and denoted by $R_t=(R_{1,t},\dots,R_{K,t})^\top$. In particular, the return for each asset $R_{i,t}$ is viewed as a random draw from the corresponding probability distribution $\nu_i$ with mean $\mu_i$ and can be simply defined as the log price ratio $R_{i,t}=\log(P_{i,t+1}/P_{i,t})$, where we use the natural log, and $P_{i,t}$, $P_{i,t+1}$ denote the prices at trial $t$ and $t+1$, respectively. For the trading period from $t$ to $t+1$, the learner receives $\omega_t^\top R_t$ as the reward for his portfolio. The investment strategy of the learner is thus a sequence of $N$ mappings from the accumulated knowledge to $W$.
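The quantities in this formulation can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, and the tolerance used to check the simplex constraint is an assumption.

```python
import math


def log_returns(prices):
    """R_t = log(P_{t+1} / P_t) for one asset's price series."""
    return [math.log(nxt / cur) for cur, nxt in zip(prices, prices[1:])]


def is_valid_portfolio(weights, tol=1e-9):
    """Long-only, self-financed: non-negative weights summing to one."""
    return all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) <= tol


def portfolio_reward(weights, returns):
    """Reward for one trading period: the inner product of weights and returns."""
    if not is_valid_portfolio(weights):
        raise ValueError("weights must be non-negative and sum to one")
    return sum(w * r for w, r in zip(weights, returns))
```

One convenient property of log price ratios is that they add up across periods: the sum of the per-period returns of an asset equals the log of its final price over its initial price.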

We make the following assumptions. First, we assume we always have access to historical returns $H_{i,t}$ of every asset $i$ in the market for $t=1,\dots,\delta$. The historical return is defined similarly to $R_{i,t}$ as the log price ratio but corresponds to the time horizon immediately before our investment period. They are only used to estimate the correlation structure and risk level. Second, we make no assumption on the dependency of returns either across time or across assets. We only assume that, for each trial $t$ and for all $i\in\{1,\dots,K\}$, $R_{i,t}\sim\nu_i$ and $H_{i,t}\sim\nu_i$ with a relatively small $\delta$. Note that the UCB1 algorithm we use later is proved to be optimal under a weaker assumption, $E[R_{i,t}\mid R_{i,1},\dots,R_{i,t-1}]=\mu_i$, allowing us to waive the assumptions in the classic setting [11]. Third, transaction costs and market liquidity will not be considered. See Model 1 for a summary of the problem.

[Model 1: summary of the portfolio selection problem; box not reproduced here.]

2.2. Portfolio construction by filtering assets

Graph theory has been widely applied in various disciplines to model networks, where the vertices represent individuals of interest and the edges represent their interactions. For example, in evolutionary game theory, graphs are used to analyse the dynamics of cooperation within different population structures [20–25]. In financial markets, the minimum spanning tree (MST) is accepted as a robust method to visualize the structure of assets [26], allowing one to capture different market sectors from empirical data [27–29].

For our purpose, as we have a large pool of assets, we first want to select a basket of $K$ to invest in. Recall that the return of each asset is $R_{i,t}=\log(P_{i,t+1}/P_{i,t})$, where $P_{i,t}$ and $P_{i,t+1}$ are the prices at trial $t$ and $t+1$. Following Mantegna [27] and Mantegna & Stanley [30], we use $\delta$ trials of historical returns to find the correlation matrix, whose entries are

$$\rho_{i,j} \overset{\text{def}}{=} \frac{\langle H_iH_j\rangle-\langle H_i\rangle\langle H_j\rangle}{\sqrt{(\langle H_i^2\rangle-\langle H_i\rangle^2)(\langle H_j^2\rangle-\langle H_j\rangle^2)}},$$

where $\langle\cdot\rangle$ is the historical mean, namely $\langle H_i\rangle=\frac{1}{\delta}\sum_{t=1}^{\delta}H_{i,t}$ for each asset $i$ in the market. For $\delta$ small, we can improve our estimation by taking advantage of the shrinkage method in Ledoit & Wolf [31]. We then define the metric distance between two vertices as $d_{i,j}\overset{\text{def}}{=}\sqrt{2(1-\rho_{i,j})}$. The Euclidean distance matrix $D$ whose entries are $d_{i,j}$ is then used to compute the undirected graph $G=\{V,E\}$, where $V$ is the set of vertices representing assets and $E$ is the set of weighted edges representing distance. To extract the most important edges from $G$, we construct the MST $T$. In particular, $T$ is the subgraph of $G$ that connects all vertices without cycles and minimizes the total edge weight.
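The pipeline from historical returns to the MST can be sketched as follows. This is a minimal standard-library illustration under stated assumptions: the function names are hypothetical, the shrinkage step of Ledoit & Wolf is omitted, and Prim's algorithm is used here as one standard way to build an MST (the text does not say which algorithm the authors used).

```python
import math


def correlation(x, y):
    """Correlation of two equally long return series, as in the formula for rho."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    mean_xy = sum(a * b for a, b in zip(x, y)) / n
    var_x = sum(a * a for a in x) / n - mean_x ** 2
    var_y = sum(b * b for b in y) / n - mean_y ** 2
    return (mean_xy - mean_x * mean_y) / math.sqrt(var_x * var_y)


def distance_matrix(returns):
    """d_ij = sqrt(2 (1 - rho_ij)) for a list of per-asset return series."""
    k = len(returns)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            rho = correlation(returns[i], returns[j])
            # Clamp at zero to guard against rho rounding slightly above 1.
            dist[i][j] = dist[j][i] = math.sqrt(max(0.0, 2.0 * (1.0 - rho)))
    return dist


def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense distance matrix; returns a list of edges (i, j)."""
    k = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < k:
        # Cheapest edge leaving the current tree.
        i, j = min(
            ((a, b) for a in in_tree for b in range(k) if b not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges
```

Perfectly correlated assets end up at distance 0 and perfectly anti-correlated assets at distance 2, so the tree links each asset to its most similar neighbours first.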

One way to classify vertices is based on their relative positions in the graph, central versus peripheral. In financial markets, this classification method turns out to have significant implications in systemic risk, which is the risk that an economic shock causes the collapse of a chain of institutions [32]. Several empirical studies suggest that such risk can be associated with certain characteristics of the correlation structure of the market. For example, Kritzman et al. [33] defines the absorption ratio as the fraction of total variances explained by a fixed number of principal components, namely the eigenvectors of the covariance matrix, and shows this ratio increased dramatically during both domestic and global financial crises including the housing bubble, dot-com bubble, the 1997 Asian financial crisis and so on. Drozdz et al. [34] finds a similar result and suggests that the maximum eigenvalue of the correlation matrix rises during crisis and exhausts the total variances. Hence, graph theory can be naturally applied to this setting and provides significant insights into managing systemic risk. In particular, Huang et al. [35] gives an intuitive simulation of the contagion process of systemic risk on a bipartite graph. Onnela et al. [36] shows that the MST of assets shrinks during a crisis, which supports the above arguments on the compactness of the eigenvalues of correlation matrix. More importantly, Onnela et al. [36], Pozzi et al. [37] and Ren et al. [38] suggest that investing in the assets located on the peripheral parts of the MST can facilitate diversification and reduce the exposure to systemic risk during a crisis.

For our study, we select 30 S&P 500 stocks, which consist of 15 financial institutions (JPM, WFC, BAC, C, GS, USB, MS, KEY, PNC, COF, AXP, PRU, SCHW, BBT, STI) and 15 randomly selected companies from other sectors (KR, PFE, XOM, WMT, DAL, CSCO, HCP, EQIX, DUK, NFLX, GE, APA, F, REGN, CMS). We use the daily close price of 44 trading days during the subprime mortgage crisis to construct the MST and investigate the advantage of investing in peripheral vertices using the equally weighted portfolio strategy. Although the number of stocks is small, our results similarly show that investing in peripheral vertices can reduce loss during financial crisis (figure 1). Figure 1a shows the complete graph of 30 stocks. Figure 1b is the MST we obtain following the above method. Observe that this tree has a total of 14 leaves (WFC, C, GS, KEY, PNC, SCHW, KR, DAL, HCP, EQIX, DUK, NFLX, GE, F), and selecting from these leaves to construct a portfolio almost always reduces the median daily loss compared with the portfolio with all vertices. For example, figure 1c provides the performance of the portfolio with 10 randomly selected vertices from the 14 leaves, which increases the median daily log price ratio from −0.0101 to −0.0079 and the median daily percentage return from −0.0095 to −0.0070. Furthermore, figure 1d shows that the eigenvalue spectrum of the covariance matrix becomes less compact. Finally, we acknowledge the dynamic nature of the market structure, but for simplicity this aspect will not be considered in our study.

Figure 1.

Portfolio selection based on the MST. (a) The complete graph and (b) the corresponding MST constructed from the 30 selected S&P 500 stocks during the period September 2008 to October 2008. (c) The performance of the portfolio of 10 randomly selected vertices from the 14 leaves shown in b. (d) The eigenvalue spectrum of the covariance matrix of the 30 selected S&P 500 stocks in a compared with that of the 10 stocks randomly chosen from the peripheral nodes of the MST in c.

Therefore, we select the K most peripheral vertices from the MST T as our basket of assets to invest in. We note that for any graph G with distinct edge weights, which is often the case for financial data with high precision, the MST T is unique. Our selection of vertices tends to lie on the leaves for a star-like graph, on the two ends of the longest edge for a cycle, and on the corners for a lattice. Among the numerous centrality measures discussed in graph theory [39], we use the most straightforward measure and select the K vertices with the least degree. The value of K is subjective and can be determined based on the learner’s view of the economic state. Assuming K assets are selected, we proceed to portfolio construction as described in what follows.
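The least-degree selection rule can be sketched as below. The function name is hypothetical, and breaking ties by vertex index is an arbitrary choice made for this illustration; the text does not specify a tie-breaking rule.

```python
def least_degree_vertices(num_vertices, edges, k):
    """Return the k vertices with the smallest degree in the tree.

    Ties are broken by vertex index, which is an arbitrary choice made here.
    """
    degree = [0] * num_vertices
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    ranked = sorted(range(num_vertices), key=lambda v: (degree[v], v))
    return ranked[:k]


# Hypothetical star-shaped tree: vertex 0 is the hub, vertices 1..4 are leaves.
star_edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
peripheral = least_degree_vertices(5, star_edges, 3)
```

For the star-shaped example the hub is never selected while leaves remain available, matching the remark that the selection tends to lie on the leaves of a star-like graph.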

2.3. Combined sequential portfolio selection algorithm

We design a sequential portfolio selection algorithm by combining the optimal multi-armed bandit policy, namely the UCB1 proposed in Auer et al. [11], with the minimization of a coherent risk measure, namely the conditional value-at-risk. Recall that the return $R_{i,t}$ of each asset $i$ is defined as the log price ratio, namely $R_{i,t}=\log(P_{i,t+1}/P_{i,t})$. The UCB1 policy is defined as follows. First, select each asset once and observe return during the first $K$ trials. Then, for each trial select the asset that maximizes an estimated upper confidence bound of return with a certain confidence level. Precisely, at each trial $t$ we select

$$I_t \overset{\text{def}}{=} \begin{cases} t & \text{if } t\le K,\\ \arg\max_{i\in\{1,\dots,K\}} \bar{R}_i(t)+\sqrt{\dfrac{2\log t}{T_i(t-1)}} & \text{otherwise,}\end{cases} \tag{2.1}$$

where $\bar{R}_i(t)$ is the empirical mean of return for asset $i$ and recall that $T_i(t-1)$ is the number of times asset $i$ has been selected during the past $t-1$ trials. Theorem 2.1 below, provided in Auer et al. [11], proves the optimality of UCB1.

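Theorem 2.1 [11]

For all $K>1$ assets whose returns have support in $[0,1]$, the regret of the UCB1 algorithm after any number $n$ of trials satisfies

$$\hat{\xi}(n)\le\left[8\sum_{i:\mu_i<\mu^*}\left(\frac{\log n}{\mu^*-\mu_i}\right)\right]+\left(1+\frac{\pi^2}{3}\right)\left[\sum_{i=1}^{K}(\mu^*-\mu_i)\right],$$

where $\mu_i$ is the mean return of asset $i$ and $\mu^*=\max\{\mu_1,\dots,\mu_K\}$.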

The proof makes no assumption on the dependency and distribution of asset returns besides $E[R_{i,t}\mid R_{i,1},\dots,R_{i,t-1}]=\mu_i$. Therefore, by scaling the returns to lie in $[0,1]$ we can achieve optimality. In addition, we can use historical returns and observed returns of unselected assets to further improve performance, but we do not discuss details here. Let $e_i\in\mathbb{R}^K$ be the vector with a single 1 on entry $i$ and 0 on the others. Our single-asset multi-armed bandit portfolio at $t$ chosen according to equation (2.1) is

$$\omega_t^M \overset{\text{def}}{=} e_{I_t}. \tag{2.2}$$
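The selection rule in equation (2.1) and the single-asset portfolio in equation (2.2) can be sketched as follows. The function names and the 0-based asset indexing are assumptions made for this illustration; the rescaling of returns to $[0,1]$ required by Theorem 2.1 is left to the caller.

```python
import math


def ucb1_select(t, mean_returns, play_counts):
    """Pick the asset index (0-based) for trial t (1-based) following equation (2.1).

    mean_returns[i] is the empirical mean return of asset i so far and
    play_counts[i] is the number of times asset i has been selected.
    """
    k = len(mean_returns)
    if t <= k:
        return t - 1  # initial sweep: select each asset once
    scores = [
        mean_returns[i] + math.sqrt(2.0 * math.log(t) / play_counts[i])
        for i in range(k)
    ]
    return max(range(k), key=lambda i: scores[i])


def update_mean(old_mean, old_count, new_return):
    """Incrementally update an empirical mean after one more observation."""
    return old_mean + (new_return - old_mean) / (old_count + 1)


def single_asset_portfolio(index, k):
    """Equation (2.2): the unit vector e_{I_t} as a list of weights."""
    return [1.0 if i == index else 0.0 for i in range(k)]
```

The square-root term shrinks as an asset is selected more often, so rarely selected assets receive a larger exploration bonus even when their empirical mean is no better.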

Now, let us incorporate risk awareness into our algorithm by finding the portfolio that achieves the global minimum of the conditional value-at-risk. We define risk measure and associated properties following Artzner et al. [40] and Bäuerle & Rieder [41].

Definition 2.2 —

Let $(\Omega,\mathcal{F},P)$ be a probability space and denote by $L(\Omega,\mathcal{F},P)$ the set of integrable random variables, where any element of $L(\Omega,\mathcal{F},P)$ represents portfolio return. A function $\Psi: L(\Omega,\mathcal{F},P)\to\mathbb{R}$ is called a risk measure.

Definition 2.3 —

Let $\Psi$ be a risk measure; we say $\Psi$ is a coherent risk measure if, for all $X_1,X_2\in L(\Omega,\mathcal{F},P)$, $c\in\mathbb{R}$ and $d\in\mathbb{R}_+\cup\{0\}$, it satisfies

  • translation invariance: $\Psi(X_1+c)=\Psi(X_1)-c$

  • subadditivity: $\Psi(X_1+X_2)\le\Psi(X_1)+\Psi(X_2)$

  • positive homogeneity: $\Psi(dX_1)=d\,\Psi(X_1)$

  • monotonicity: $X_1\le X_2\Rightarrow\Psi(X_1)\ge\Psi(X_2)$

Definition 2.4 —

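Let $X\in L(\Omega,\mathcal{F},P)$; the risk measure value-at-risk of $X$ at confidence level $\beta\in(0,1)$ is defined as

$$\mathrm{VaR}_\beta(X) \overset{\text{def}}{=} \inf\{x\in\mathbb{R} : P(x+X<0)\le 1-\beta\}.$$

In addition, the risk measure conditional value-at-risk at confidence level $\gamma\in(0,1)$ is defined as

$$\mathrm{CVaR}_\gamma(X) \overset{\text{def}}{=} \frac{1}{1-\gamma}\int_\gamma^1 \mathrm{VaR}_\beta(X)\,d\beta.$$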

In the literature, the above risk measures are sometimes expressed in terms of the portfolio loss variable, namely positive values represent loss and negative values represent gain. We note that these definitions are equivalent. Intuitively, the value-at-risk denotes the maximum threshold of loss under a certain confidence level, and conditional value-at-risk is the conditional expectation of loss given that it exceeds such a threshold. Although more widely used in practice, value-at-risk fails to satisfy certain mathematical properties such as subadditivity, which contradicts Markowitz’s MPT and implies that diversification may not reduce investment risk. As a result, it is not a coherent risk measure. On the other hand, Pflug [42] proves that conditional value-at-risk is coherent and satisfies some extra properties such as convexity, monotonicity with respect to first-order stochastic dominance (FSD) and second-order monotonic dominance.
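For a finite sample of returns, both risk measures can be evaluated directly. The sketch below treats the sample as an equally weighted empirical distribution, which is an assumption of this illustration; the function names are hypothetical. The value-at-risk follows the definition above applied to the losses $-X$, and the conditional value-at-risk uses the minimization formula of Rockafellar & Uryasev [43], whose minimum over the threshold is attained at one of the sample losses.

```python
def empirical_var(returns, beta):
    """Smallest x such that the fraction of losses exceeding x is at most 1 - beta."""
    losses = sorted(-r for r in returns)
    n = len(losses)
    for x in losses:
        exceed = sum(1 for loss in losses if loss > x)
        # Small tolerance guards against floating-point error in (1 - beta) * n.
        if exceed <= (1.0 - beta) * n + 1e-12:
            return x
    return losses[-1]


def empirical_cvar(returns, gamma):
    """min over alpha of alpha + mean([loss - alpha]^+) / (1 - gamma).

    The objective is piecewise linear and convex in alpha with breakpoints at
    the sample losses, so it is enough to check those points.
    """
    losses = [-r for r in returns]
    n = len(losses)

    def objective(alpha):
        excess = sum(max(loss - alpha, 0.0) for loss in losses)
        return alpha + excess / ((1.0 - gamma) * n)

    return min(objective(alpha) for alpha in losses)
```

With ten equally likely losses of 1 through 10, for instance, the conditional value-at-risk at level 0.8 is the average of the two largest losses, and it is never smaller than the value-at-risk at the same level.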

Theorem 2.5 [42]

The conditional value-at-risk is a coherent risk measure.

Therefore, we would like to minimize risk using the conditional value-at-risk at confidence level $\gamma$ as the risk measure. We recall that $W=\{u\in\mathbb{R}_+^K : u^\top\mathbf{1}=1\}$ is the set of possible portfolios. At each trial $t$, the learner would like to solve the following optimization problem:

$$\underset{u\in W}{\text{minimize}}\quad \mathrm{CVaR}_\gamma(u^\top R_t).$$

Note that as $\gamma\to 0$, the problem becomes minimizing expected loss, and as $\gamma\to 1$, it becomes minimizing the worst outcome. In this study, we use $\gamma=0.95$. Rockafellar & Uryasev [43] provides a convenient method to solve this problem. Recall that we assume that both historical returns and present returns follow the same distribution; let $p(R_t)$ be the density. Define the performance function as

$$F_\gamma(u,\alpha) \overset{\text{def}}{=} \alpha+\frac{1}{1-\gamma}\int_{R_t\in\mathbb{R}^K}\left[-u^\top R_t-\alpha\right]^+ p(R_t)\,dR_t,$$

where $[m]^+\overset{\text{def}}{=}\max\{m,0\}$. Then, we have the following theorem.

Theorem 2.6 [43]

The minimization of $\mathrm{CVaR}_\gamma(u^\top R_t)$ over $u\in W$ is equivalent to the minimization of $F_\gamma(u,\alpha)$ over all pairs of $(u,\alpha)\in W\times\mathbb{R}$. Moreover, as $F_\gamma(u,\alpha)$ is convex with respect to $(u,\alpha)$, the loss function $-u^\top R_t$ is convex with respect to $u$ and $W$ is a convex set due to linearity, the minimization of $F_\gamma(u,\alpha)$ is an instance of convex programming.

Moreover, as the density $p(R_t)$ is unknown, we would like to approximate the performance function using not only historical returns but also knowledge gained as we proceed in this learning process. From the received $H_{i,1},\dots,H_{i,\delta}$ for all $i$, we extract historical returns of our $K$ assets $H_1,\dots,H_\delta\in\mathbb{R}^K$. Let $R_1,\dots,R_{t-1}$ be the $t-1$ trials of returns observed so far. Then our approximation of $F_\gamma(u,\alpha)$ at trial $t$ is the following convex and piecewise linear function

$$\tilde{F}_\gamma(u,\alpha,t) \overset{\text{def}}{=} \alpha+\frac{1}{(\delta+t-1)(1-\gamma)}\left[\sum_{s=1}^{\delta}\left[-u^\top H_s-\alpha\right]^+ + \sum_{s=1}^{t-1}\left[-u^\top R_s-\alpha\right]^+\right]. \tag{2.3}$$

Note that the approximation function is implicitly also a function of the current trial $t$, hence we have added an extra parameter and denote it as $\tilde{F}_\gamma(u,\alpha,t)$. As the learner proceeds in time, she accumulates data and obtains an increasingly precise approximation. As a result, the minimization of conditional value-at-risk is solved by convex programming and generates the following optimal solution. At each trial $t$, the risk-aware portfolio constructed according to equation (2.3) is

$$\omega_t^C \overset{\text{def}}{=} \underset{(u,\alpha)\in W\times\mathbb{R}}{\arg\min}\ \tilde{F}_\gamma(u,\alpha,t). \tag{2.4}$$
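To make the approximation in equation (2.3) concrete, the sketch below evaluates it on a pooled list of historical and observed return vectors and then minimizes it for the special case $K=2$. The function names are hypothetical. The grid search over the simplex is a coarse stand-in used only to keep the example dependency-free; it is not the convex programming approach described in the text, and for general $K$ a linear or convex programming solver would be the natural tool.

```python
def approx_performance(u, alpha, samples, gamma):
    """Equation (2.3); samples holds the delta historical and t-1 observed return vectors."""
    total = 0.0
    for x in samples:
        loss = -sum(w * r for w, r in zip(u, x))
        total += max(loss - alpha, 0.0)
    return alpha + total / (len(samples) * (1.0 - gamma))


def min_over_alpha(u, samples, gamma):
    """For fixed u the function is piecewise linear in alpha, so a minimum sits at a sample loss."""
    losses = [-sum(w * r for w, r in zip(u, x)) for x in samples]
    return min(approx_performance(u, a, samples, gamma) for a in losses)


def grid_search_two_assets(samples, gamma, steps=100):
    """Coarse stand-in for convex programming when K = 2: scan u = (w, 1 - w)."""
    best_u, best_value = None, float("inf")
    for s in range(steps + 1):
        u = (s / steps, 1.0 - s / steps)
        value = min_over_alpha(u, samples, gamma)
        if value < best_value:
            best_u, best_value = u, value
    return best_u, best_value


# Hypothetical data: two assets whose returns are exact mirror images.
samples = [(0.02, -0.02), (-0.03, 0.03), (0.01, -0.01), (-0.01, 0.01)]
best_u, best_value = grid_search_two_assets(samples, 0.5)
```

In this mirrored example an equal split cancels every loss, so the search settles on equal weights with zero risk, whereas holding only the first asset gives a strictly positive value.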

Now, we have found both the single-asset multi-armed bandit portfolio by (2.2) and the risk-aware portfolio by (2.4). Note that they are dynamic and update based on the learner’s accumulated knowledge. For each trial $t$, the learner combines them with a factor $\lambda\in[0,1]$ to form the balanced portfolio

$$\omega_t \overset{\text{def}}{=} \lambda\,\omega_t^M+(1-\lambda)\,\omega_t^C. \tag{2.5}$$

In particular, λ is the proportion of wealth invested in the single-asset multi-armed bandit portfolio and 1−λ is the proportion invested in the risk-aware portfolio. The value of λ denotes the risk preference of the learner. As λ→1, our algorithm reverts to the UCB1 policy, whereas for λ→0, it becomes the minimization of conditional value-at-risk. Therefore, the commonly discussed trade-off between reward and risk is illustrated here in the choice of λ. Finally, the following algorithm summarizes our sequential portfolio selection algorithm.

[Algorithm 1: combined sequential portfolio selection algorithm; box not reproduced here.]
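The combination step in equation (2.5) is simple enough to sketch directly. The function name is hypothetical, and the risk-aware weights are assumed to have been computed beforehand by whatever solver is used for equation (2.4).

```python
def combine_portfolios(selected_asset, risk_aware_weights, lam):
    """Equation (2.5): lam * e_{I_t} + (1 - lam) * omega^C.

    selected_asset is the 0-based index chosen by the bandit policy.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    combined = [(1.0 - lam) * w for w in risk_aware_weights]
    combined[selected_asset] += lam
    return combined
```

Because both inputs lie in the set of admissible portfolios and the combination is convex, the result is again long-only with weights summing to one; setting the factor to 1 or 0 recovers the two extreme policies.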

3. Results

In this section, we design experiments and report the performance of the proposed algorithm (algorithm 1) in comparison with several benchmarks.

3.1. Monte Carlo simulation method

For simplicity, we consider stocks as our assets and adopt the Black–Scholes model [44] to simulate stock prices as geometric Brownian motion (GBM) paths. As a Nobel Prize-winning model, it provides a partial differential equation to price a European option by computing the initial wealth for perfectly hedging a short position in that option. The underlying asset, usually a stock, is modelled to follow a GBM. Although this assumption may not hold perfectly in reality, it provides an extremely convenient and popularly used method to simulate any number of stock paths. For our purpose, as we never make any assumption on the dependency of asset returns, we consider the general case where stock paths can be correlated as it is almost always the case in the financial market. We use definitions similar to ch. 4 of Shreve [45] and describe our method below.

Definition 3.1 —

Let $(\Omega,\mathcal{F},P)$ be a probability space. The stock price $P_i(t)$ is said to follow a GBM if it satisfies the following stochastic differential equation:

$$dP_i(t)=\alpha_iP_i(t)\,dt+\sigma_iP_i(t)\,dW_i(t),$$

where $W_i(t)$ is a Brownian motion, $\alpha_i$ is the drift and $\sigma_i$ is the volatility.

Definition 3.2 —

Two stock paths $P_i(t)$ and $P_j(t)$ modelled by GBMs are correlated if their associated Brownian motions satisfy

$$dW_i(t)\,dW_j(t)=\rho_{i,j}\,dt$$

for some non-zero constant $\rho_{i,j}\in[-1,1]$ where $\rho_{i,i}=\rho_{j,j}=1$.

Proposition 3.3 —

For two correlated stock prices $P_i(t)$ and $P_j(t)$ that satisfy $dW_i(t)\,dW_j(t)=\rho_{i,j}\,dt$, the following properties hold:

  • $E[W_i(t)W_j(t)]=\rho_{i,j}t$

  • $\mathrm{Cov}[W_i(t),W_j(t)]=\rho_{i,j}t$

  • $\mathrm{Cov}[\sigma_iW_i(t),\sigma_jW_j(t)]=\sigma_i\sigma_j\rho_{i,j}t$,

where $\sigma_i$ and $\sigma_j$ are volatility parameters of $P_i(t)$ and $P_j(t)$, respectively.

Proof. —

We prove the first claim and the rest follow immediately after some computations. By the Itô–Doeblin formula, which can be found in Shreve [45], we have

$$d(W_i(t)W_j(t))=W_i(t)\,dW_j(t)+W_j(t)\,dW_i(t)+\rho_{i,j}\,dt.$$

Integrating on both sides, we have

$$W_i(t)W_j(t)=\int_0^tW_i(s)\,dW_j(s)+\int_0^tW_j(s)\,dW_i(s)+\rho_{i,j}t.$$

By the martingale property of Itô integrals, we simply take the expectation on both sides to obtain $E[W_i(t)W_j(t)]=\rho_{i,j}t$. ▪
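For completeness, the remaining two claims of proposition 3.3 can be worked out as follows. The only additional fact used is that a Brownian motion started at zero has $E[W_i(t)]=0$, together with the bilinearity of covariance.

```latex
\begin{align*}
\operatorname{Cov}[W_i(t), W_j(t)]
  &= E[W_i(t) W_j(t)] - E[W_i(t)]\,E[W_j(t)]
   = \rho_{i,j} t - 0 \cdot 0
   = \rho_{i,j} t, \\
\operatorname{Cov}[\sigma_i W_i(t), \sigma_j W_j(t)]
  &= \sigma_i \sigma_j \operatorname{Cov}[W_i(t), W_j(t)]
   = \sigma_i \sigma_j \rho_{i,j} t.
\end{align*}
```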

Recall that we have $K$ stocks whose prices $P_1(t),\dots,P_K(t)$ are modelled by correlated GBMs. By definition, they must satisfy the following two equations:

$$\frac{dP_i(t)}{P_i(t)}=\alpha_i\,dt+\sigma_i\,dW_i(t) \tag{3.1}$$

and

$$dW_i(t)\,dW_j(t)=\rho_{i,j}\,dt. \tag{3.2}$$

In particular, the solution to equation (3.1) can be expressed as follows [46]. For any time $u<l$, we have

$$P_i(l)=P_i(u)\exp\left\{\left(\alpha_i-\tfrac{1}{2}\sigma_i^2\right)(l-u)+\sigma_i\bigl(W_i(l)-W_i(u)\bigr)\right\}. \tag{3.3}$$

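We would first like to express the scaled correlated Brownian motions $\sigma_iW_i(t)$ using independent ones. By proposition 3.3, we have the following instantaneous covariance matrix:

$$\Theta=\begin{bmatrix}\sigma_1^2 & \sigma_1\sigma_2\rho_{1,2} & \cdots & \sigma_1\sigma_K\rho_{1,K}\\ \sigma_2\sigma_1\rho_{2,1} & \sigma_2^2 & \cdots & \sigma_2\sigma_K\rho_{2,K}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_K\sigma_1\rho_{K,1} & \sigma_K\sigma_2\rho_{K,2} & \cdots & \sigma_K^2\end{bmatrix}.$$

As $\Theta$ has to be symmetric and positive definite, it has a square root and we apply Cholesky decomposition to find the matrix $A$ such that $AA^\top=\Theta$. By Shreve [45], there exist $K$ independent Brownian motions $X_1(t),\dots,X_K(t)$ such that

$$\sigma_iW_i(t)=\sum_{m=1}^{K}A_{i,m}X_m(t).$$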
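As an aside on the factorization step, the matrix $A$ can be computed with a textbook Cholesky routine. The sketch below uses plain Python lists and hypothetical function names; a numerical library would normally be used in practice.

```python
import math


def covariance_matrix(sigmas, rho):
    """Theta[i][j] = sigma_i * sigma_j * rho[i][j], with rho[i][i] = 1."""
    k = len(sigmas)
    return [[sigmas[i] * sigmas[j] * rho[i][j] for j in range(k)] for i in range(k)]


def cholesky(theta):
    """Lower-triangular A with A A^T = theta (theta symmetric positive definite)."""
    k = len(theta)
    a = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            partial = sum(a[i][m] * a[j][m] for m in range(j))
            if i == j:
                diag = theta[i][i] - partial
                if diag <= 0.0:
                    raise ValueError("matrix is not positive definite")
                a[i][j] = math.sqrt(diag)
            else:
                a[i][j] = (theta[i][j] - partial) / a[j][j]
    return a
```

Row $i$ of $A$ then gives the coefficients that mix the independent Brownian motions into $\sigma_iW_i(t)$, and the squared entries of each row sum to $\sigma_i^2$.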

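Then equation (3.1) becomes

$$\frac{dP_i(t)}{P_i(t)}=\alpha_i\,dt+\sum_{m=1}^{K}A_{i,m}\,dX_m(t), \tag{3.4}$$

and equation (3.3) becomes, for any time $u<l$,

$$P_i(l)=P_i(u)\exp\left\{\left(\alpha_i-\tfrac{1}{2}\sigma_i^2\right)(l-u)+\sum_{m=1}^{K}A_{i,m}\bigl(X_m(l)-X_m(u)\bigr)\right\}. \tag{3.5}$$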

As each Brownian motion $X_m(t)$ for $m\in[1,K]$ above is independent and the increment $X_m(l)-X_m(u)$ is Gaussian with mean 0 and variance $l-u$, we let $Z(t)=(Z_1(t),\dots,Z_K(t))$ be a standard multivariate Gaussian vector, so that equation (3.5) becomes

$$P_i(l)=P_i(u)\exp\left\{\left(\alpha_i-\tfrac{1}{2}\sigma_i^2\right)(l-u)+\sqrt{l-u}\sum_{m=1}^{K}A_{i,m}Z_m(l)\right\}. \tag{3.6}$$

Therefore, at each time we can conveniently generate a sample from $Z(t)$ to compute the price increment. Specifically, equation (3.6) leads to the following recursive algorithm that can also be found in Glasserman [46]. For $0=t_0<t_1<\cdots<t$, we have

$$P_i(t_{s+1})=P_i(t_s)\exp\left\{\left(\alpha_i-\tfrac{1}{2}\sigma_i^2\right)(t_{s+1}-t_s)+\sqrt{t_{s+1}-t_s}\sum_{m=1}^{K}A_{i,m}Z_m(t_{s+1})\right\}.$$

Also note that when the paths are independent, $dW_i(t)\,dW_j(t)=\delta_{i,j}\,dt$, where $\delta_{i,j}$ is the Kronecker delta function, and the covariance matrix $\Theta$ is diagonal. In this special case, it is equivalent to compute $K$ paths separately in the one-dimensional space. For our purpose, we first find some appropriate covariance matrix and generate $K$ price paths following the above algorithm. We then uniformly divide the total time horizon into $\delta+N$ trials and use the prices at the beginning and end of each trial to calculate return, which was defined earlier as the log price ratio. We run our sequential portfolio selection algorithm on these data and compare the performance with four benchmark portfolios, namely UCB1 (2.2), risk-aware portfolio (2.4), ϵ-greedy and the equally weighted portfolio.
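The recursion above can be sketched on a uniform time grid as follows. This is an illustrative standard-library version, not the code deposited by the authors: the function names, the uniform step `dt` and the use of a seeded `random.Random` instance are assumptions, and the lower-triangular factor `chol` is taken as given.

```python
import math
import random


def simulate_correlated_gbm(p0, drifts, sigmas, chol, dt, steps, rng):
    """Simulate K correlated GBM paths on a uniform time grid.

    chol is a lower-triangular factor A with A A^T = Theta; rng is a
    random.Random instance so that runs are reproducible.
    """
    k = len(p0)
    paths = [[p] for p in p0]
    for _ in range(steps):
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]  # independent standard normals
        for i in range(k):
            shock = sum(chol[i][m] * z[m] for m in range(k))
            growth = (drifts[i] - 0.5 * sigmas[i] ** 2) * dt + math.sqrt(dt) * shock
            paths[i].append(paths[i][-1] * math.exp(growth))
    return paths


def trial_log_returns(path):
    """Log price ratio between consecutive grid points of one path."""
    return [math.log(b / a) for a, b in zip(path, path[1:])]
```

Under this sketch, the first $\delta$ log returns of each path would play the role of the historical returns and the remaining $N$ the returns revealed during the investment period. With zero volatility the recursion reduces to deterministic exponential growth at the drift rate, which gives a quick sanity check.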

3.2. Simulation results

After we repeatedly generate price paths and compare the performance, we can see that the results agree well with our prediction (figure 2). The UCB1 portfolio almost always achieves the most cumulative wealth but has high variations in its path. On the other hand, the risk-aware portfolio achieves a relatively low cumulative wealth but also has low variations. As a result, our combined portfolio achieves a middle ground between the two extremes of maximizing reward and minimizing risk. For example, figure 2a–c illustrates a typical simulation, where figure 2a shows K=5 GBM paths, figure 2b shows the optimality of UCB1 compared to ϵ-greedy and figure 2c shows the cumulative wealth at the end of N=200 trials. With an initial wealth of 1 and λ=0.9, the cumulative wealth is 2.1615 for UCB1, 2.1024 for the combined portfolio, 1.9168 for ϵ-greedy, 1.6355 for the risk-aware portfolio and 1.4640 for the equally weighted portfolio.

Figure 2.

Combined sequential portfolio selection algorithm can achieve a balance between risk and return. (a,d) The simulated stock paths based on the GBM. (b,e) The performance of two portfolio selection algorithms, UCB1 versus ϵ-greedy. Panels (c,f) compare the cumulative wealth obtained with our sequential portfolio selection algorithm that combines the single-asset multi-armed bandit portfolio by (2.2) and the risk-aware portfolio by (2.4) with the other four benchmarks of portfolio selection algorithms. To quantify and compare the role of volatility in the performance of portfolio selection algorithms, we present the simulation results of low volatility in (a)–(c) and high volatility in (d)–(f). Parameters: the same vector (0.04,0.035,0.08,0.02,0.03) for drift terms αi is used for simulating the stock paths in (a) and (d). For each trial, the volatility terms σi are uniformly and randomly generated from the interval [0.02,0.025] in (a) and from the interval [0.03,0.035] in (d). λ=0.9.

In addition, we observe that when the market is volatile and when different stock paths are similar in expectation, it takes more trials for the UCB1 policy to reach optimality (figure 2d–f). In this case, the risk-aware portfolio achieves the most cumulative wealth with a similarly low variation in its path. Different from the simulation presented in figure 2a–c, where the volatility parameters of GBMs are bounded in the interval [0.02,0.025], we now choose values from the interval [0.03,0.035] for figure 2d–f. Specifically, figure 2d–f demonstrates such a simulation, where figure 2d shows the GBM paths, figure 2e shows the suboptimality of UCB1 and figure 2f shows the cumulative wealth at the end of 200 trials. With an initial wealth of 1 and λ=0.9, the cumulative wealth is 1.5412 for the risk-aware portfolio, 1.4409 for the combined portfolio, 1.4294 for UCB1, 1.4132 for the equally weighted portfolio and finally, 1.3298 for ϵ-greedy.

From the above discussion, it is evident that the value of λ is vital to the performance of our sequential portfolio selection algorithm and should be determined based on the market condition. In particular, Way et al. [47] discusses the trade-off between specialization to achieve high rewards and diversification to hedge against risk, and similarly shows that such choice depends on the underlying parameters and initial conditions.

4. Discussion and conclusion

In this paper, we have studied the multi-armed bandit problem as a mathematical model for sequential decision-making under uncertainty. In particular, we focus on its application in financial markets and construct a sequential portfolio selection algorithm. We first apply graph theory and select the peripheral assets from the market to invest. Then at each trial, we combine the optimal multi-armed bandit policy with the minimization of a coherent risk measure. By adjusting the parameter, we are able to achieve the balance between maximizing reward and minimizing risk. We adopt the Black–Scholes model to repeatedly simulate stock paths and observe the performance of our algorithm. We conclude that the results agree well with our prediction when the market is stable. In addition, when the market is volatile, risk awareness becomes more crucial to achieving high performance. Therefore, parameter selection should be based on the market condition.

For future research, one may consider the optimal selection of the parameter λ for combining the two portfolios. One may also consider portfolio selection strategies based on the MDP, which is a generalization of the multi-armed bandit to multiple states. In addition, one may pay more attention to a chaotic market environment where stock paths can be affected by various factors instead of simply following a stochastic process. For example, Junior & Mart [48] uses random matrix theory and transfer entropy to show that news articles can possibly affect the market. Finally, one may consider transaction costs and market liquidity. For example, Reiter et al. [49] illustrates the trade-off between reward and cost in a biological auction setting and might provide some important insights for the researcher.

Data accessibility

Our data and simulation codes are deposited at Dryad: https://doi.org/10.5061/dryad.h628h [50].

Authors' contributions

X.H. & F.F. conceived the project, X.H. performed analyses and simulations, X.H. & F.F. analysed results. X.H. wrote the first draft of the main text. Both the authors reviewed the manuscript.

Competing interests

The authors declare no competing financial interests.

Funding

Financial support came from the Dartmouth Faculty Startup Fund and Walter & Constance Burke Research Initiation Award. X.H. is thankful for financial support from the National Science Foundation and Dartmouth College. F.F. is grateful for support from the Dartmouth Faculty Startup Fund, Walter & Constance Burke Research Initiation Award, NIH under grant no. C16A12652 (A10712), and DARPA under grant no. D17PC00002-002.

Associated Data

Data Citations

  1. Huo X, Fu F. 2017. Data from: Risk-aware multi-armed bandit problem with application to portfolio selection. Dryad Digital Repository (doi:10.5061/dryad.h628h)

