Scientific Reports 16:6443 (2026). doi: 10.1038/s41598-026-35902-x

Behaviorally informed deep reinforcement learning for portfolio optimization with loss aversion and overconfidence

Atefe Charkhestani, Akbar Esfahanipour

Abstract

This study develops a behaviorally informed deep reinforcement learning (DRL) framework for algorithmic portfolio optimization. The model integrates two well-established behavioral biases, loss aversion and overconfidence, into an actor–critic architecture. Unlike conventional DRL systems that assume fully rational agents, the proposed framework incorporates investor heterogeneity through regime-dependent bias thresholds that adjust position sizing, while the underlying RL policy determines trading direction. To adaptively switch among three behavioral models (loss-averse, overconfident, and neutral), the framework employs TimesNet to generate one-step-ahead market regime forecasts. All decisions follow a strict walk-forward evaluation protocol that precludes access to future information and ensures realistic out-of-sample performance measurement. The framework is evaluated across two major financial domains: the cryptocurrency market (2018–2024) and the Dow Jones Industrial Average (2008–2024). The integrated BBAPT architecture, which combines TimesNet with behavioral DRL, consistently outperforms benchmark strategies including neutral RL agents, classical Markowitz portfolios, and equally weighted allocations. In cryptocurrency markets, BBAPT achieves the highest risk-adjusted performance, while in equity markets it delivers improved risk–return outcomes even after accounting for time-varying index constituents. Overall, the empirical evidence demonstrates that embedding behavioral finance principles into reinforcement learning enhances robustness, adaptability, and risk-adjusted returns in non-stationary environments. These findings position behaviorally informed DRL as a promising foundation for next-generation algorithmic trading systems.

Keywords: Algorithmic trading, Behavioral bias, Loss aversion, Overconfidence, Deep reinforcement learning, Portfolio optimization, TimesNet

Subject terms: Mathematics and computing, Psychology

Introduction

Automated trading systems have evolved substantially since their earliest rule-based implementations in the mid-twentieth century. With rapid advances in artificial intelligence (AI), contemporary algorithmic trading has transitioned into a highly adaptive, data-driven decision-making paradigm. Modern systems leverage machine learning and deep learning to extract patterns from complex financial time series, forecast market movements, and adjust trading decisions dynamically1. Numerous empirical studies have shown that AI-enhanced trading strategies, particularly those based on deep reinforcement learning (DRL), can outperform traditional quantitative models by improving return forecasting, risk management, and execution efficiency2,3. As a result, DRL-based portfolio management has emerged as a promising approach for handling non-stationary and volatile markets.

Reinforcement learning provides a natural framework for modeling sequential decision-making problems such as portfolio rebalancing, where actions influence future rewards and subsequent market states. DRL agents integrate neural networks with reinforcement learning principles to learn optimal trading policies directly from historical interactions with the environment. These models have demonstrated the ability to process large-scale financial datasets, adapt to shifting market regimes, and optimize risk–return trade-offs4,5. A typical trading agent evaluates features such as price dynamics, portfolio composition, and transaction costs, and generates allocation vectors designed to maximize long-term cumulative reward.

Despite substantial progress in algorithmic trading, real-world investment behavior is strongly affected by psychological biases that classical DRL models do not explicitly account for. Behavioral finance shows that investors systematically deviate from rational utility-maximizing behavior6. Among these deviations, loss aversion (the tendency to react more strongly to losses than to gains of the same magnitude) and overconfidence (the tendency to overestimate one's forecasting ability) are particularly influential in shaping portfolio decisions7,8. These biases may be detrimental in some environments but beneficial in others. For instance, overconfidence can accelerate gains during persistent upward trends, whereas loss aversion may reduce downside exposure during market declines. Importantly, investors differ in their sensitivity to such biases, motivating the need for mechanisms that can represent heterogeneous behavioral responses.

In this work, behavioral biases are not intended to mimic human decision-making exactly. Instead, they function as algorithmic heuristics that operate within a DRL agent’s position-sizing layer. The RL policy determines the direction of trades, while behavioral activation models the magnitude of allocation adjustments. This architecture enables systematic comparison between bias-neutral and bias-driven agents and facilitates a controlled examination of whether behavioral heuristics can improve portfolio outcomes. Our focus is on daily-to-medium-frequency portfolio allocation; ultra–high-frequency execution systems, where behavioral dynamics are negligible, fall outside the scope of this study.

Although behavioral principles such as prospect theory have been incorporated into portfolio optimization9, much less attention has been given to embedding behavioral mechanisms directly inside DRL architectures. Standard DRL agents react solely to market or portfolio states and therefore overlook opportunities that arise from persistent behavioral patterns in financial markets. Because real markets reflect a combination of rational dynamics and behavioral deviations, integrating behavioral drivers into DRL frameworks may enable richer decision policies than those attainable through conventional approaches.

To enable dynamic transitions between behavioral modes, we employ TimesNet11, a convolution-based architecture designed for long-horizon time-series forecasting. TimesNet provides one-step-ahead return forecasts used to activate the loss-averse, overconfident, or neutral agent within our Behavioral Bias–Based Algorithmic Portfolio Trading (BBAPT) framework. Importantly, TimesNet is trained independently from the DRL agents and operates under a strict walk-forward protocol to prevent any leakage of future data, thereby eliminating look-ahead bias. The modular design also allows practitioners to replace TimesNet with alternative forecasting models without modifying the DRL components.

The key contributions of this study are as follows:

  1. We introduce a DRL-based portfolio optimization framework that integrates loss aversion and overconfidence through behavioral thresholds that modulate position sizing. The DRL agent learns the baseline trading policy, while behavioral mechanisms adjust allocations when their activation conditions are satisfied.

  2. We model investor heterogeneity by introducing regime-sensitive behavioral activation thresholds. These thresholds, optimized via out-of-sample evaluation, capture variations in behavioral intensity across different market environments.

  3. In contrast to standard DRL trading systems, where portfolio adjustments depend solely on market or portfolio states, the proposed framework allows behavioral activation to influence rebalancing decisions, thereby expanding the effective action space.

  4. We conduct extensive empirical evaluations across both cryptocurrency markets and mature equity markets using the time-varying DJIA. The framework is compared against multiple benchmarks, including A2C, A3C, DDPG, Markowitz portfolios, and equally weighted allocations.

  5. We demonstrate that incorporating TimesNet as a forecasting-guided behavioral selector leads to improved portfolio performance by activating behaviorally informed DRL agents in a regime-aware manner.

The remainder of this paper is organized as follows. Section 2 reviews related work on behavioral portfolio theory and reinforcement learning in trading. Section 3 presents the proposed BBAPT architecture. Section 4 describes the datasets, training methodology, and evaluation framework. Section 5 reports the empirical results across both asset classes. Section 6 concludes the study and discusses potential directions for future research.

Related works

This section reviews two relevant streams of research: (i) behavioral portfolio selection, which integrates investor psychology into portfolio construction, and (ii) reinforcement learning–based trading systems, which aim to learn optimal trading policies directly from market interactions. Together, these domains form the conceptual foundation of this study and highlight a critical gap that motivates the development of our behavioral reinforcement-learning framework.

Behavioral portfolio selection

Shefrin and Statman16 introduced Behavioral Portfolio Theory (BPT), the first formal framework to embed investor psychology into portfolio optimization. Their two-layered mental account model launched a research direction exploring how behavioral biases, such as loss aversion, overconfidence, anchoring, and representativeness, influence portfolio construction. Hirshleifer17 further established that cognitive biases systematically affect investor sentiment and asset pricing, leading to deviations from rational market behavior and generating persistent anomalies relevant for portfolio allocation.

Subsequent research extended behavioral modeling in several directions. Chang et al.18 integrated mental accounting into a multi-stage portfolio optimization process, while Momen et al.19 proposed the Collective Mental Accounting (CMA) framework to mathematically unify mental accounts under realistic constraints such as position limits and cardinality. Building on this, Momen, Esfahanipour, and Seifi20 developed a behavioral portfolio model that incorporates dynamic risk preferences and forward-looking expectations derived from the Black–Litterman model. These works emphasize that incorporating investor psychology into portfolio models can yield allocations that better reflect real-world investment behavior.

Agent-based simulation approaches have also been used to study behavioral effects. Bertella et al.21 simulated markets populated by overconfident, risk-averse, and neutral agents to evaluate their impact on liquidity and volatility. Avellone et al.22 and Barro et al.23 integrated loss aversion and prospect-theoretic preferences into optimization models and showed that behavioral preferences can significantly alter optimal portfolio weights relative to classical models.

Despite substantial progress, most behavioral portfolio models rely on static or utility-based formulations and do not integrate behavioral responses directly into an adaptive learning system. This limitation restricts their ability to operate in dynamic markets where behavioral activation may depend on real-time portfolio conditions.

Modern portfolio theory and its evolution

Modern Portfolio Theory (MPT), developed by Markowitz24, formalized the mean–variance trade-off and established variance as the primary measure of risk. Although MPT remains foundational, its static assumptions limit its applicability in highly non-stationary markets. The increasing availability of high-frequency and multivariate financial data has enabled machine learning and reinforcement learning methods to emerge as adaptive alternatives capable of responding to evolving market conditions.

Recent studies have expanded portfolio optimization frameworks using deep learning, nonlinear estimators, and multivariate time-series models25–27. However, these approaches generally remain behaviorally neutral and do not account for investor heterogeneity or psychological activation mechanisms.

Reinforcement learning–based trading models

Reinforcement learning (RL) has become increasingly prominent in financial decision-making due to its ability to learn sequential policies from interactions with the market. The effectiveness of RL depends heavily on the design of state variables, action spaces, reward functions, and environment dynamics.

Early work in RL-based trading adopted discrete action spaces for single-asset trading, where actions represent buy, hold, or sell decisions. More recent advances use continuous action spaces to directly output portfolio weights. Soleymani and Paquet28 proposed DeepBreath, combining autoencoders and CNNs within a SARSA agent. Weng et al.29 integrated XGBoost-based feature selection with 3D attention-gating networks. Other studies have used DBNs and LSTMs for dimension reduction and A2C/A3C architectures for policy learning30,31.

Portfolio-level RL has also gained traction. Wu et al.32 developed an A2C model combining CNN and RNN layers; Betancourt and Chen33 introduced a PPO-based system capable of handling dynamic asset sets; Yue et al.34 used sparse denoising autoencoders for state representation; and Taghian et al.35 extracted features from candlestick images within a SARSA framework. Additional work has explored graph-based RL39, hyper-heuristic trading agents44, high-frequency RL models36, and DRL architectures for multi-timeframe learning43. Studies such as40–42 further benchmark RL models across diverse markets and algorithmic settings.

A comprehensive summary of RL-based trading systems is presented in Table 1, highlighting differences in state formulation, action structure, reward definition, and market application.

Table 1.

Representative studies on RL-based trading systems and their key characteristics.

Reference RL Model State Variables Action Space Reward Case Study / Key Contribution
46 SARSA Technical indicators, previous weights Asset weights Return NYSE stocks; drift detection with online batching
47 DPG Price changes, previous weights Asset weights Return Poloniex assets; XGBoost + 3D attention gating
48 A2C OHLC Asset weights Sharpe ratio TW50 portfolio; CNN + RNN design
49 A2C/A3C OHLCV, indicators Buy/Hold/Sell Sharpe ratio Global stocks; DBN–LSTM compression
50 DQN Trading state, holdings, cash Buy/Hold/Sell Return US/EU/Asia stocks; Trading-DQN formulation
33 PPO OHLCV, portfolio value Asset weights Sharpe ratio Binance portfolios; dynamic asset universe
51 A2C Balance, holdings, indicators Shares to trade Return DJIA single-asset RL; sparse autoencoder state
52 DDPG (R-GCN) Close price, weights Asset weights Return NYSE/NASDAQ; graph convolution RL
53 SARSA/DQN Candlestick images Buy/Hold/Sell Return AAPL, BTC, GOOGLE; deep visual features
54 DQN/A2C Price changes Buy/Sell Sharpe ratio Crypto assets; ResNet actor
55 SAC/Trace-SAC Portfolio signals Continuous long/short score Log return BTC-USDT futures; confidence-based actions
56 Multiple (A2C, PPO, DDPG, TD3, SAC) Price data Buy/Hold/Sell Return Shanghai Composite; comparative benchmark
57 DQN OHLC + indicators (CNN-LSTM) Buy/Hold/Sell Sharpe + Profit Chinese market + S&P500; multimodal fusion
40 Double DQN OHLC Buy/Hold/Sell Return + Sharpe BTC-USDT; Bayesian hyperparameter tuning
41 A2C, A3C, DDPG OHLC + indicators Asset weights Sharpe ratio US stocks; RL integrated with MPT
58 TD3 Prices, balance, shares Buy/Sell/Hold Profit net of costs DJIA + S&P100; delayed-update critic
43 TD3 Multi-timeframe OHLC Buy/Sell/Hold Return BTC + AMZN; multi-timeframe RL
44 Hyper-heuristic RL Indicators + returns Select a strategy Risk-adjusted return Global indices; RL selects strategy instead of trades
Proposed Model A2C, A3C, DDPG + TimesNet Indicators + unrealized returns Asset weights + behavioral adjustments Sharpe ratio Crypto + DJIA; first DRL model integrating explicit behavioral biases inside the policy

Despite the progress in RL-driven trading, existing algorithms remain largely behaviorally neutral. They do not incorporate behavioral activation mechanisms such as loss-aversion or overconfidence thresholds into the policy itself. Moreover, RL-based systems rarely integrate external forecasting models to guide the selection of behavioral policy modes.

This gap motivates the present study, which develops a deep reinforcement learning framework that embeds two well-established behavioral biases (loss aversion and overconfidence) into the portfolio allocation mechanism and employs TimesNet to enable regime-aware behavioral switching under a strict walk-forward protocol.

The proposed BBAPT model

The proposed Behavioral Bias-Based Algorithmic Portfolio Trading (BBAPT) framework consists of three main components. First, we describe how investor behavioral biases are modeled and translated into systematic portfolio weight adjustments. Second, we present the reinforcement learning architecture that integrates these behavioral mechanisms into the portfolio optimization process. Finally, we introduce the TimesNet-based forecasting module and explain how its predictions are combined with the behavioral agents to form the complete BBAPT model.

Behavioral biases modeling

Behavioral finance provides extensive evidence that investment decisions are shaped not only by objective probabilities and payoffs, but also by cognitive and emotional factors such as loss aversion and overconfidence. In the BBAPT framework, these biases are modeled as systematic adjustments to the position size suggested by a bias-neutral deep reinforcement learning (DRL) agent, while the direction of trades remains entirely determined by the underlying RL policy. This design preserves interpretability and ensures that behavioral effects influence only capital allocation, not trading signals.

Let $w_i^{\text{base}}$ denote the baseline (rational) portfolio weight of asset $i$, generated by the neutral DRL agent, such that

$$\sum_{i=1}^{N} w_i^{\text{base}} = 1, \qquad w_i^{\text{base}} \ge 0 \tag{1}$$

Let $u_i$ denote the unrealized return of asset $i$ since the position was initiated. Behavioral mechanisms transform $w_i^{\text{base}}$ into biased weights $w_i^{LA}$ (loss-averse investor) or $w_i^{OC}$ (overconfident investor), as described below.

Loss aversion bias

Kahneman and Tversky9 established that losses have a disproportionate psychological impact relative to gains of the same magnitude. In portfolio settings, this asymmetry manifests through two characteristic behavioural patterns:

  1. Cross-sectional disposition adjustment: reallocating wealth away from recent winners and toward recent losers.

  2. Global risk reduction: decreasing the total exposure to risky assets when negative performance accumulates.

To capture these two features, the loss-aversion mechanism consists of three steps: (i) a behaviourally motivated multiplier, (ii) a cross-sectional adjustment of intermediate weights, and (iii) a global scaling factor that reduces overall risky exposure.

Step 1: Behavioural multiplier

For each asset $i$, let $\theta^{LA}_{-} < 0$ and $\theta^{LA}_{+} > 0$ denote the thresholds that determine whether the asset exhibits a significant loss or a strong gain. The behavioural multiplier is defined as

$$m_i^{LA} = \begin{cases} 1 + \alpha^{LA}, & u_i \le \theta^{LA}_{-} \\ 1 - \beta^{LA}, & u_i \ge \theta^{LA}_{+} \\ 1, & \text{otherwise} \end{cases} \tag{2}$$

with the following interpretation:

  • $\alpha^{LA}$ controls how strongly exposure is increased for losing assets (capturing the disposition effect);

  • $\beta^{LA}$ controls how aggressively exposure is reduced for winning assets;

  • assets with unrealized returns between the thresholds use the baseline multiplier $m_i^{LA} = 1$.

Step 2: cross-sectional adjustment

The intermediate loss-averse weight is given by

$$\tilde{w}_i^{LA} = \frac{m_i^{LA}\, w_i^{\text{base}}}{\sum_{j} m_j^{LA}\, w_j^{\text{base}}} \tag{3}$$

This step reshapes the cross-sectional distribution of weights without changing the total risky allocation.

Step 3: Global risk scaling

Loss-averse investors tend to reduce total exposure to risky assets. Let $\gamma^{LA} \in (0, 1]$ denote the global risk-scaling parameter. The final traded weight is

$$w_i^{LA} = \gamma^{LA}\, \tilde{w}_i^{LA} \tag{4}$$

Here,

  • the ratio $m_i^{LA} w_i^{\text{base}} / \sum_j m_j^{LA} w_j^{\text{base}}$ redistributes exposure across assets according to their behavioural multipliers;

  • the factor $\gamma^{LA}$ uniformly scales down total risky investment.

The remaining fraction $1 - \gamma^{LA}$ is interpreted as a cash allocation, reflecting the reduced risk appetite implied by loss-averse behaviour.

Interpretation

This formulation models both key behavioural dimensions: (i) the tendency to overweight losing assets and underweight strong winners (via $m_i^{LA}$), and (ii) the reduction of total risky exposure (via $\gamma^{LA}$). Crucially, this mechanism alters only position sizes; trading directions remain governed by the RL policy.
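For illustration, the three-step mapping can be sketched in a few lines of Python. This is a minimal sketch of Eqs. (2)–(4), not the implementation used in the experiments; the threshold values in the usage example are hypothetical, while $\alpha$, $\beta$, and $\gamma$ are taken from the low-width regime reported later in Table 6.

```python
import numpy as np

def loss_averse_weights(w_base, u, theta_minus, theta_plus, alpha, beta, gamma):
    """Map baseline DRL weights to loss-averse weights (Eqs. 2-4).

    w_base : baseline weights from the neutral agent (sum to 1)
    u      : unrealized returns per asset
    theta_minus < 0, theta_plus > 0 : activation thresholds
    alpha, beta : reallocation strengths; gamma in (0, 1] : risk budget
    """
    # Step 1: behavioural multiplier (Eq. 2)
    m = np.ones_like(w_base)
    m[u <= theta_minus] = 1.0 + alpha   # overweight recent losers
    m[u >= theta_plus] = 1.0 - beta     # trim recent winners
    # Step 2: cross-sectional reallocation (Eq. 3)
    w_tilde = m * w_base / np.sum(m * w_base)
    # Step 3: global risk scaling (Eq. 4); 1 - gamma is held in cash
    return gamma * w_tilde

# Example: two losers, one winner, one neutral position
w = loss_averse_weights(
    w_base=np.array([0.25, 0.25, 0.25, 0.25]),
    u=np.array([-0.08, -0.05, 0.06, 0.01]),
    theta_minus=-0.04, theta_plus=0.05, alpha=0.22, beta=0.18, gamma=0.78)
print(w, 1 - w.sum())  # risky weights and the implied cash fraction
```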

Overconfidence bias

Overconfidence leads investors to overestimate the precision of their forecasts and to take excessively large or aggressive positions. Empirical studies document three characteristic behavioural patterns associated with overconfidence:

  • lower effective trading thresholds, resulting in increased trading frequency;

  • aggressive scaling of profitable positions (trend amplification);

  • “doubling down” on losing positions due to excessive belief in mean reversion.

In the BBAPT framework, these behaviours are modeled through behavioural multipliers that adjust the position size suggested by the neutral reinforcement-learning agent.

Step 1: Behavioural multiplier

For each asset $i$, let $u_i$ denote the unrealized return since the position was initiated. Two behavioural thresholds determine when overconfidence becomes active:

  • $\theta^{OC}_{+} > 0$: a gain threshold above which the investor increases exposure;

  • $\theta^{OC}_{-} < 0$: a loss threshold below which the investor increases allocation in a “doubling-down” fashion.

The behavioural multiplier is defined as

$$m_i^{OC} = \begin{cases} 1 + \alpha^{OC}, & u_i \ge \theta^{OC}_{+} \\ 1 + \beta^{OC}, & u_i \le \theta^{OC}_{-} \\ 1, & \text{otherwise} \end{cases} \tag{5}$$

where:

  • $\alpha^{OC}$ measures the intensity of scaling up winning positions;

  • $\beta^{OC}$ controls the strength of the doubling-down behaviour;

  • $m_i^{OC} = 1$ indicates no behavioural adjustment.

Thus, the multiplier amplifies perceived opportunities in both trending ($u_i \ge \theta^{OC}_{+}$) and mean-reverting ($u_i \le \theta^{OC}_{-}$) conditions.

Step 2: cross-sectional adjustment

Let $w_i^{\text{base}}$ denote the baseline weight generated by the neutral RL policy. Applying the behavioural multiplier yields the intermediate weight:

$$\tilde{w}_i^{OC} = \frac{m_i^{OC}\, w_i^{\text{base}}}{\sum_j m_j^{OC}\, w_j^{\text{base}}} \tag{6}$$

This step modifies the cross-sectional allocation across assets without affecting total portfolio exposure.

Step 3: Global risk amplification

Overconfident investors typically increase overall portfolio risk. To model this, a global leverage parameter $\gamma^{OC} > 1$ is introduced. The final overconfidence-adjusted weight is

$$w_i^{OC} = \gamma^{OC}\, \tilde{w}_i^{OC} \tag{7}$$

Here:

  • the ratio $m_i^{OC} w_i^{\text{base}} / \sum_j m_j^{OC} w_j^{\text{base}}$ normalizes the cross-sectional allocation;

  • $\gamma^{OC}$ uniformly amplifies total risky exposure, capturing overconfident risk-taking.

The result is a portfolio with both micro-level position amplification and macro-level leverage expansion.

Definition of all parameters

  • $u_i$: unrealized return of asset $i$ since position entry.

  • $\theta^{OC}_{+}$: gain threshold activating trend-following amplification.

  • $\theta^{OC}_{-}$: loss threshold activating doubling-down behaviour.

  • $\alpha^{OC}$: strength of scaling up profitable positions.

  • $\beta^{OC}$: strength of doubling down on losing positions.

  • $w_i^{\text{base}}$: baseline RL-generated weight prior to behavioural adjustment.

  • $\tilde{w}_i^{OC}$: intermediate behaviour-adjusted weight.

  • $\gamma^{OC}$: global risk-scaling parameter representing increased leverage.

Interpretation

The overconfidence mechanism expands both cross-sectional allocations toward perceived opportunities and the overall portfolio risk level. This behaviour closely aligns with empirical evidence showing that overconfident investors trade more aggressively, maintain larger positions, and exhibit overly strong reactions to perceived signals. Crucially, this mechanism modifies only position sizes, while trade direction remains governed solely by the reinforcement-learning policy.
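A matching sketch for the overconfidence mapping follows; again this is illustrative rather than the authors' code. It mirrors the loss-averse function but adds exposure on both sides of the thresholds and applies a leverage factor $\gamma^{OC} > 1$.

```python
import numpy as np

def overconfident_weights(w_base, u, theta_plus, theta_minus, alpha, beta, gamma):
    """Map baseline DRL weights to overconfident weights (Eqs. 5-7).

    gamma > 1 acts as a leverage factor on total risky exposure.
    """
    # Step 1: behavioural multiplier (Eq. 5)
    m = np.ones_like(w_base)
    m[u >= theta_plus] = 1.0 + alpha    # scale up winners (trend amplification)
    m[u <= theta_minus] = 1.0 + beta    # double down on losers
    # Step 2: cross-sectional reallocation (Eq. 6)
    w_tilde = m * w_base / np.sum(m * w_base)
    # Step 3: global risk amplification (Eq. 7)
    return gamma * w_tilde
```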

Reinforcement learning modeling

Reinforcement learning (RL) provides a principled framework for sequential decision-making problems, in which an agent interacts with an external environment and learns to maximize long-term cumulative rewards62. At each time step $t$, the agent observes the environment state $s_t$, selects an action $a_t$ based on a policy $\pi(a_t \mid s_t)$, receives a reward $r_t$, and transitions to the next state $s_{t+1}$, forming a Markov decision process (MDP). Figure 1 illustrates the RL cycle underlying the portfolio rebalancing process.

Figure 1. General reinforcement learning framework61.

The RL system consists of two components: (1) an agent containing the learning algorithm and policy, and (2) an environment providing state transitions and rewards in response to the agent’s actions.

Selection of RL algorithms

Algorithm selection in RL depends fundamentally on whether the state and action spaces are discrete or continuous. Table 2 summarizes standard RL algorithms and their compatibility with different problem types64.

Table 2.

Types of reinforcement-learning agents64.

State space Action space Agent family
Discrete Discrete (Q-learning, SARSA) → DQN → PPO → TRPO
Continuous Discrete DQN → PPO → TRPO
Continuous Continuous (DDPG, A2C, A3C) → TD3, PPO, SAC → TRPO

* Algorithms in parentheses have almost the same level of complexity and speed; complexity and speed increase from left to right.

Portfolio optimization requires selecting continuous-valued asset weights. Therefore, RL algorithms capable of handling continuous action spaces are appropriate. In this study, three actor–critic–based methods are adopted:

  • A2C (Advantage Actor–Critic): synchronous updates yield stable optimization and efficient GPU utilization;

  • A3C (Asynchronous Advantage Actor–Critic): parallel, asynchronous learners enhance exploration and robustness;

  • DDPG (Deep Deterministic Policy Gradient): suited for continuous control tasks where deterministic policies accelerate learning.

These methods balance computational cost and stability while effectively handling the continuous control nature of portfolio weight selection65. More computationally intensive algorithms (e.g., SAC or TRPO) were not employed to maintain tractability without sacrificing performance. Detailed descriptions of these algorithms, along with the training procedures for the selected agents, are provided in the supplementary material file.

Actor and critic network architecture

The BBAPT framework employs an actor–critic structure in which the actor generates baseline portfolio weights, and the behavioral layer subsequently adjusts these weights according to the active bias.

Actor network.

Given state $s_t$, the actor network outputs the parameters of a multivariate Gaussian policy:

$$\pi_\phi(a_t \mid s_t) = \mathcal{N}\!\big(\mu_\phi(s_t),\, \operatorname{diag}(\sigma_\phi^2(s_t))\big) \tag{8}$$

where $\mu_\phi(s_t)$ and $\sigma_\phi(s_t)$ are neural-network outputs. A softplus activation ensures $\sigma_\phi(s_t) > 0$.

The raw action vector is mapped to a baseline weight vector using a softmax normalization:

$$w_{i,t}^{\text{base}} = \frac{\exp(a_{i,t})}{\sum_{j=1}^{N} \exp(a_{j,t})} \tag{9}$$

These baseline weights constitute the rational DRL allocation that is later modified by the behavioral adjustments described in Section 3.1.

The policy network is updated via the policy-gradient objective:

$$\nabla_\phi J(\phi) = \mathbb{E}\big[\nabla_\phi \log \pi_\phi(a_t \mid s_t)\, \hat{A}_t\big] \tag{10}$$

where $\hat{A}_t$ is the advantage estimate computed using generalized advantage estimation.

Critic network.

The critic approximates the state-value function:

$$V_\psi(s_t) \approx \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\Big|\, s_t\Big] \tag{11}$$

with discount factor $\gamma \in (0, 1)$. Training proceeds via temporal-difference learning:

$$\psi \leftarrow \psi + \eta_c \big(r_t + \gamma V_\psi(s_{t+1}) - V_\psi(s_t)\big)\, \nabla_\psi V_\psi(s_t) \tag{12}$$

where $\eta_c$ is the critic learning rate.

Together, the actor and critic networks iteratively refine the trading policy and produce smooth and stable learning dynamics suitable for financial applications.
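For concreteness, the sketch below shows one way to realize this actor–critic pair in PyTorch, using the hidden-layer sizes reported later in Table 5 and the activation choices described in Section 4.3. The class and method names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Actor producing a diagonal Gaussian policy over raw actions (Eq. 8);
    sampled actions are mapped to portfolio weights by softmax (Eq. 9)."""
    def __init__(self, state_dim, n_assets, hidden=(320, 160, 80)):
        super().__init__()
        layers, d = [], state_dim
        for h in hidden:
            layers += [nn.Linear(d, h), nn.Tanh()]
            d = h
        self.body = nn.Sequential(*layers)
        self.mu_head = nn.Linear(d, n_assets)      # mean branch
        self.sigma_head = nn.Linear(d, n_assets)   # pre-activation std branch

    def forward(self, s):
        z = self.body(s)
        mu = self.mu_head(z)
        sigma = nn.functional.softplus(self.sigma_head(z))  # sigma > 0
        return mu, sigma

    def baseline_weights(self, s):
        mu, sigma = self.forward(s)
        a = torch.distributions.Normal(mu, sigma).rsample()
        return torch.softmax(a, dim=-1)            # weights sum to 1

class Critic(nn.Module):
    """State-value approximator V(s) trained by TD learning (Eqs. 11-12)."""
    def __init__(self, state_dim, hidden=(64, 16, 4)):
        super().__init__()
        layers, d = [], state_dim
        for h in hidden:
            layers += [nn.Linear(d, h), nn.Tanh()]
            d = h
        layers += [nn.Linear(d, 1), nn.ReLU()]     # non-negative value estimate
        self.net = nn.Sequential(*layers)

    def forward(self, s):
        return self.net(s)
```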

Proposed BBAPT framework overview

Figure 2 presents the complete BBAPT architecture, which combines a forecast-driven behavioral mode selector with an actor–critic DRL agent and a behavioral adjustment layer.

  • A TimesNet-based module provides one-step-ahead market forecasts.

  • A mode selector activates one of three behavioral profiles (neutral, loss-averse, or overconfident).

  • An actor–critic agent computes baseline weights $w_t^{\text{base}}$.

  • A behavioral layer maps baseline weights to the final traded weights according to the selected bias.

Figure 2. Structure of the BBAPT model, combining TimesNet forecasts with behavioral reinforcement-learning agents.

Environment

The environment simulates financial market dynamics and integrates the DRL policy with the behavioral module. Given state $s_t$ and baseline weights $w_t^{\text{base}}$, the environment:

  • updates technical indicators from the latest OHLCV data.

  • updates unrealized returns $u_{i,t}$ for all assets based on the active portfolio.

  • applies the behavioral mapping (neutral, loss aversion, or overconfidence) to obtain final traded weights $w_t$.

  • computes the portfolio return and risk at time $t$.

  • produces the next state $s_{t+1}$ and reward $r_t$.

This structure ensures that all behavioural effects influence position sizing only, while the underlying RL policy determines the direction of trades.
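A minimal sketch of this loop follows, assuming daily close prices and a pluggable behavioral mapping; indicator computation and reward scaling are omitted for brevity, and all names (`PortfolioEnv`, `behavior_fn`, `cost_bps`) are illustrative rather than taken from the paper.

```python
import numpy as np

class PortfolioEnv:
    """Minimal sketch of the environment loop described above.
    `prices` is a (T, N) array of daily closes; `behavior_fn` maps
    (baseline weights, unrealized returns) to final traded weights and
    stands in for the neutral / loss-averse / overconfident mappings."""

    def __init__(self, prices, behavior_fn, cost_bps=10):
        self.prices = prices
        self.behavior_fn = behavior_fn
        self.cost = cost_bps / 1e4           # cost per unit of turnover
        self.entry = prices[0].copy()        # entry prices, fixed for simplicity
        self.w_prev = np.ones(prices.shape[1]) / prices.shape[1]
        self.t = 1

    def step(self, w_base):
        u = self.prices[self.t - 1] / self.entry - 1.0   # unrealized returns
        w = self.behavior_fn(w_base, u)                  # behavioural sizing
        asset_ret = self.prices[self.t] / self.prices[self.t - 1] - 1.0
        turnover = np.abs(w - self.w_prev).sum()
        reward = float(w @ asset_ret) - self.cost * turnover
        self.w_prev = w
        self.t += 1
        done = self.t >= len(self.prices)
        return u, reward, done
```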

State definition

The state representation is designed to capture both recent market dynamics and the performance of currently held positions. Accordingly, the full state vector consists of two feature groups:

  1. Technical indicators for all assets, summarizing short- and medium-term price and volume behavior;

  2. Unrealized returns for each asset since the most recent entry, providing the behavioral module with information necessary for loss-aversion and overconfidence adjustments.

Formally, the state at time t is defined as

$$s_t = \big[x_{1,t}^{(1)}, \dots, x_{N,t}^{(M)},\; u_{1,t}, \dots, u_{N,t}\big] \tag{13}$$

where $x_{n,t}^{(m)}$ denotes the $m$-th technical indicator for asset $n$, and $u_{n,t}$ is the unrealized return of that asset at time $t$. Including unrealized returns is essential because the behavioral mappings for loss aversion and overconfidence depend directly on $u_{n,t}$.

Action definition

In the proposed framework, the reinforcement-learning agent is responsible only for generating baseline portfolio weights. Behavioral effects are applied afterward via an external adjustment layer. At each time t, the actor network produces a baseline allocation vector

$$w_t^{\text{base}} = \big(w_{1,t}^{\text{base}}, \dots, w_{N,t}^{\text{base}}\big) \tag{14}$$

The active behavioral model (neutral, loss-averse, or overconfident) is determined externally (e.g., by the forecasting module in Section 3.3). Once the mode is selected, the final traded weights are computed as

$$w_t = \begin{cases} w_t^{LA}, & \text{loss-averse mode} \\ w_t^{OC}, & \text{overconfident mode} \\ w_t^{\text{base}}, & \text{neutral mode} \end{cases} \tag{15}$$

where $w_t^{LA}$ and $w_t^{OC}$ are obtained from $w_t^{\text{base}}$ and $u_t$ using the formulations in Section 3.1. In this way, the RL agent learns a robust baseline allocation strategy, while behavioral characteristics influence only the magnitude of final position sizes.

Reward function

The overall quality of a portfolio allocation is evaluated using the Sharpe ratio69, defined for a given weight vector w as

$$SR(w) = \frac{w^\top \mu}{\sqrt{w^\top \Sigma\, w}} \tag{16}$$

where $\mu$ is the vector of expected returns and $\Sigma$ is the historical variance–covariance matrix of asset returns. The Sharpe ratio reflects the trade-off between return and volatility; maximizing it encourages the discovery of risk-efficient trading strategies.

At each step, the immediate reward is the realized portfolio return,

$$r_t^{p} = \sum_{i=1}^{N} w_{i,t}\, r_{i,t} \tag{17}$$

where $r_{i,t}$ is the simple return of asset $i$ over the interval $(t-1, t]$. In practice, the episode-level Sharpe ratio is used to normalize and scale these step-wise rewards. This reward shaping aligns the training objective with risk-adjusted performance and improves learning stability in volatile financial environments.
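The exact scaling scheme is not spelled out in closed form; the sketch below assumes the simplest variant, multiplying each step return by the episode-level Sharpe ratio, and should be read as one plausible realization of the description above.

```python
import numpy as np

def step_rewards(weight_history, return_history):
    """Step-wise portfolio returns (Eq. 17) scaled by the episode Sharpe
    ratio; weight_history and return_history are (T, N) arrays."""
    port = np.sum(weight_history * return_history, axis=1)  # r_t^p per step
    sharpe = port.mean() / (port.std() + 1e-8)              # episode-level SR
    return port * sharpe                                    # Sharpe-scaled rewards
```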

TimesNet-based market forecasting model

TimesNet is a recent deep learning architecture for general-purpose time-series forecasting based on temporal 2D-variation modeling. Traditional one-dimensional forecasting models often struggle to capture multi-scale temporal structure. TimesNet addresses this limitation by transforming one-dimensional sequences into structured two-dimensional tensors through dominant-frequency extraction, enabling the use of parameter-efficient 2D convolutional kernels to learn both intra-period and inter-period dependencies.

Within the BBAPT framework, TimesNet functions as an external forecasting module that produces one-step-ahead market return predictions. These forecasts determine the behavioral model neutral, loss-averse, or overconfident to be applied in the subsequent trading step. The forecasting module operates independently of the reinforcement-learning agent, ensuring strict chronological separation and preventing access to future labels during policy learning.

TimesNet has demonstrated competitive performance in several financial prediction tasks, including return forecasting and volatility modeling70,71. Although its primary objective is not portfolio construction, time-aware predictive models have been shown to enhance stability and improve risk-adjusted performance when used as auxiliary signals15. Motivated by these findings, TimesNet is incorporated into the BBAPT architecture as a data-driven indicator of short-term market conditions.

TimesNet architecture overview

The core computational unit of TimesNet is the TimesBlock, which identifies and exploits multiple periodicities through Fast Fourier Transform (FFT). For an input sequence, the model:

  • extracts the top-K dominant frequencies using FFT amplitude analysis;

  • reshapes the sequence into a set of two-dimensional tensors, each corresponding to one identified period;

  • processes these tensors through inception-style convolutional blocks to capture localized temporal variation;

  • fuses the outputs using softmax weights proportional to FFT-derived amplitudes.

This structure enables TimesNet to learn regime transitions, momentum cycles, volatility clustering, and other dynamics characterizing cryptocurrency and equity markets.
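The period-detection stage can be reproduced in a few lines of NumPy. The sketch below is a simplified stand-in for the TimesBlock front end: it extracts top-k periods from FFT amplitudes but uses proportional amplitude weights rather than the learned softmax fusion of the full architecture.

```python
import numpy as np

def dominant_periods(x, k=3):
    """Top-k dominant periods of a 1-D series via FFT amplitude analysis,
    the first stage of a TimesBlock."""
    amp = np.abs(np.fft.rfft(x))
    amp[0] = 0.0                          # ignore the zero-frequency (mean) term
    freqs = np.argsort(amp)[-k:][::-1]    # indices of the k largest amplitudes
    periods = [len(x) // f for f in freqs]
    weights = amp[freqs] / amp[freqs].sum()   # simplified fusion weights
    return periods, weights

# A series with a strong 7-day cycle should report a period near 7
t = np.arange(364)
series = np.sin(2 * np.pi * t / 7) + 0.1 * np.random.randn(364)
print(dominant_periods(series, k=2))
```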

Figure 3 illustrates the architectural components, while Algorithm 1 summarizes the full computational workflow.

Figure 3. TimesNet architecture: FFT-based period detection, 2D tensor transformation, TimesBlocks, and final regression head.

Algorithm 1. TimesNet for time-series forecasting.

Activation of behavioral DRL agents using TimesNet

Let $\hat{r}_{t+1}$ denote the next-day return forecast generated by TimesNet. To integrate these forecasts into the behavioral decision-making process, the BBAPT framework adopts a three-region activation rule that maps predicted market conditions to one of the behavioral modes. A symmetric threshold parameter $r > 0$ defines the boundaries between bullish, range-bound, and bearish regimes:

$$\text{mode}_t = \begin{cases} \text{overconfident}, & \hat{r}_{t+1} > r \\ \text{neutral}, & -r \le \hat{r}_{t+1} \le r \\ \text{loss-averse}, & \hat{r}_{t+1} < -r \end{cases} \tag{18}$$

The threshold $r$ serves as a sensitivity parameter that controls how readily the system transitions between behavioral modes. In practice, $r$ is treated as a tunable hyperparameter governing the granularity of regime separation; a moderate value provides a balanced trade-off between responsive regime detection and stability in behavioral activation. The resulting behavioral agent selection rule is summarized in Table 3.

Table 3.

Behavioral agent selection based on TimesNet return forecasts.

Forecast $\hat{r}_{t+1}$ Market regime Activated agent
$\hat{r}_{t+1} > r$ Bullish Overconfidence
$-r \le \hat{r}_{t+1} \le r$ Range-bound Neutral
$\hat{r}_{t+1} < -r$ Bearish Loss-averse
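In code, the activation rule reduces to a three-way comparison, as in the sketch below. The threshold value of 0.5% is a placeholder, since $r$ is tuned as a hyperparameter and the paper's chosen value is not reproduced here.

```python
def select_agent(r_hat, r=0.005):
    """Three-region activation rule (Eq. 18, Table 3)."""
    if r_hat > r:
        return "overconfident"   # bullish forecast
    if r_hat < -r:
        return "loss_averse"     # bearish forecast
    return "neutral"             # range-bound forecast
```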

Evaluation process of the proposed model

This section describes the evaluation procedure used to assess the performance of the proposed BBAPT framework. Section 4.1 introduces the datasets, Section 4.2 outlines the construction of market regimes, Section 4.3 details the reinforcement learning training setup, and Section 4.4 presents the benchmark models and evaluation criteria.

The evaluation pipeline is designed to be transparent and fully reproducible, with explicit descriptions of data selection, regime construction, training procedures, and robustness checks across cryptocurrency and equity markets.

Figure 4 summarizes the overall workflow, from raw data collection to model training and performance comparison against classical portfolio strategies.

Figure 4. Evaluation process of the proposed framework. The workflow includes data collection, regime characterization, training of behavioral and neutral RL agents, and comparison with benchmark portfolio models.

Data

The cryptocurrency market serves as a suitable testbed for evaluating reinforcement-learning-based trading systems. Its high volatility, rapid structural changes, and continuous 24-hour operation create an environment in which models must adapt to non-stationary and stress-prone dynamics. These characteristics provide a rich setting for testing algorithms designed to respond to evolving risk–return profiles.

Because many cryptocurrencies exhibit strong cross-correlation, asset selection plays a key role in ensuring diversification. Following the clustering-based taxonomy of Pele et al.72, a representative portfolio of 20 cryptocurrencies is constructed: Bitcoin (BTC), Ethereum (ETH), Litecoin (LTC), Chainlink (LINK), Bitcoin Cash (BCH), Uniswap (UNI), Stellar Lumens (XLM), Filecoin (FIL), BNB (BNB), Solana (SOL), XRP (XRP), Cardano (ADA), Shiba Inu (SHIB), Toncoin (TON), Dogecoin (DOGE), Avalanche (AVAX), Tron (TRX), Polkadot (DOT), Polygon (MATIC), and Ethereum Classic (ETC). Daily OHLCV (Open, High, Low, Close, Volume) data for all assets are obtained from Yahoo Finance.

The cryptocurrency dataset spans January 2018 to December 2024, covering a broad range of market conditions including the COVID–19 shock, subsequent recoveries, the 2022 drawdown, and the partial normalization phase of 2023–2024. This diverse period allows a thorough evaluation of the model under crisis, post-crisis, trending, and range-bound environments.

For visualization and interpretability, three representative market patterns (low-width range-bound, trending, and high-width range-bound) are illustrated in Figure 5 and Table 4. These patterns are not used in the decision-making pipeline. Instead, the TimesNet forecasting module (Section 3.3) generates next-step return predictions based solely on historical data, and the BBAPT model selects the behavioral mode accordingly.

Figure 5. Representative market regimes (low-width range-bound, trending, and high-width range-bound). These regimes are used solely for analysis and are not employed by the BBAPT model during training or evaluation.

Table 4.

Training and testing periods associated with each representative market regime.

Market Type Train Start Train End Test Start Test End
Low-width range-bound 2019-02-06 2019-12-22 2019-12-23 2020-03-12
Trending 2020-03-13 2020-11-08 2020-11-09 2021-01-08
High-width range-bound 2021-01-09 2022-02-23 2022-02-24 2022-06-07
Total period 2019-02-06 2022-06-07 2022-06-08 2023-01-03

To assess generalizability beyond digital assets, an additional dataset from the Dow Jones Industrial Average (DJIA) is used. The equity dataset consists of the official DJIA constituents at each point in time, covering January 2008 to June 2024. This 16-year horizon includes the global financial crisis, the long post-crisis expansion, the COVID–19 crash and rebound, the inflation-driven drawdowns of 2022, and the subsequent recovery. Using time-varying index membership eliminates survivorship bias and provides a complementary testing ground for evaluating robustness across distinct asset classes and market structures.

Determining the market type, training, and testing periods

To evaluate the performance of the behavioral components of the BBAPT framework under diverse market dynamics, we consider three characteristic market environments: low-width range-bound, trending, and high-width range-bound conditions. These environments reflect qualitatively distinct price behaviors frequently encountered in financial markets and provide a structured basis for analyzing how different behavioral modes influence portfolio allocation.

Behavioral finance suggests that overconfidence tends to intensify in strongly trending markets, while loss aversion becomes more pronounced during downturns or in highly volatile environments. Although such distinctions motivate the inclusion of multiple behavioral modes, the BBAPT framework does not use regime labels during operation. All behavioral activations are driven solely by next-step forecasts generated by the TimesNet module (Section 3.3), ensuring that no regime-based information is introduced into training or testing procedures.

Technical analysts commonly rely on range identification and trend detection as the foundation for trading strategies76. A range-bound market is typically characterized by prices oscillating between well-defined support and resistance levels77, while a trending market exhibits sustained upward or downward movement78. Because individual cryptocurrencies often show heterogeneous cyclical patterns, regime identification is performed on an equally weighted cryptocurrency index rather than on individual assets. Following Bolognesi et al.79, the index is defined as

$$R_t^{EW} = \frac{1}{N} \sum_{i=1}^{N} r_{i,t} \tag{19}$$

where $r_{i,t}$ denotes the log-return of asset $i$ at time $t$, and $N$ is the number of assets in the cryptocurrency portfolio.

Visual inspection of this index reveals three segments that correspond to low-width range-bound, trending, and high-width range-bound regimes. These regimes are highlighted in Figure 5. They are used exclusively for post-hoc performance interpretation and are not part of the decision-making flow of the BBAPT framework.

For each identified regime, 80% of the data is allocated to training the behavioral and neutral RL agents, and the remaining 20% is reserved for testing. Table 4 summarizes the corresponding dates.

In addition to this segmented analysis, the full cryptocurrency dataset (January 2018 to December 2024) described in Section 4.1 is used for long-horizon testing and for training the TimesNet forecasting module. This broader evaluation allows BBAPT to be assessed under a wide spectrum of real-world market conditions, including crisis periods, recoveries, strong trends, and extended range-bound phases.

Training the reinforcement learning model

Reinforcement learning model configuration

Actor–critic agents utilize two function approximators to model the actor and critic networks. In this study, both functions are implemented using deep neural networks. The architectures of these networks are illustrated in Figure 6. Although various approaches exist for selecting network depth and width, including sensitivity analysis, several studies suggest heuristic principles that provide strong baseline configurations. Here, the geometric pyramid rule80 is adopted to determine the hidden-layer structure of both networks, as it has been shown to yield near-optimal architectures with stable convergence across diverse applications.

Figure 6. Architecture of the actor and critic networks used in this study.

Assuming a network with three hidden layers, the pyramid rule specifies that the number of neurons in these layers should follow a descending geometric pattern. Let $n_{\text{in}}$ and $n_{\text{out}}$ denote the numbers of neurons in the input and output layers. The scaling coefficient $r$ is then computed as

$$r = \sqrt[4]{\,n_{\text{in}} / n_{\text{out}}\,} \tag{20}$$

rounded up to an integer. The first hidden layer contains $r^{3}$ times the output size, the second contains $r^{2}$ times the output size, and the third contains $r$ times the output size. Table 5 summarizes the resulting architectures for the actor and critic networks.

Table 5.

Configuration of the actor and critic networks.

Network Input Output $r$ Layer 1 Layer 2 Layer 3
Actor 120 40 2 320 160 80
Critic 120 1 4 64 16 4
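Assuming the coefficient is rounded up to an integer, the rule reproduces the architectures in Table 5 exactly, as the short check below shows; the function name is illustrative.

```python
import math

def pyramid_layers(n_in, n_out):
    """Geometric pyramid rule (Eq. 20): hidden sizes n_out*r^3, n_out*r^2,
    n_out*r, with the coefficient r rounded up to an integer."""
    r = math.ceil((n_in / n_out) ** 0.25)
    return r, [n_out * r**3, n_out * r**2, n_out * r]

print(pyramid_layers(120, 40))  # (2, [320, 160, 80])  -> actor network
print(pyramid_layers(120, 1))   # (4, [64, 16, 4])     -> critic network
```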

As shown in Figure 6, the critic network uses three hidden layers with tanh activations, followed by a ReLU output to ensure non-negative value estimates. The actor network also employs three tanh-activated hidden layers but branches into two output pathways: one for the mean and another for the standard deviation of the continuous action vector. A softplus activation is used on the standard deviation branch to enforce positivity. For the mean branch, a ReLU activation followed by a scaling layer ensures action components lie within [0, 1].

Calibration of behavioral hyperparameters

In the revised BBAPT framework, investor behavior is modeled through systematic adjustments to the baseline portfolio weights using the bias–specific multipliers described in Section 3.1. Each behavioral agent is governed by a set of interpretable hyperparameters that determine how unrealized returns affect position sizing.

For the loss-averse agent, the relevant hyperparameters are:

$$\Theta^{LA} = \big\{\theta^{LA}_{-},\ \theta^{LA}_{+},\ \alpha^{LA},\ \beta^{LA},\ \gamma^{LA}\big\}$$

which respectively control the loss and gain activation thresholds, the strength of cross-sectional reallocation toward losers and away from winners, and the overall risky-asset exposure (risk budget).

For the overconfidence agent, the analogous parameters are:

$$\Theta^{OC} = \big\{\theta^{OC}_{+},\ \theta^{OC}_{-},\ \alpha^{OC},\ \beta^{OC},\ \gamma^{OC}\big\}$$

governing the gain and loss activation thresholds, the scaling of winning positions, the degree of “doubling down” on losing positions, and the elevated risk budget associated with overconfident behavior.

All behavioral hyperparameters are selected through a grid search using the validation subset of each training regime. For each parameter configuration, a behavioral DRL agent is trained and its average episodic reward is recorded. The configuration achieving the highest validation performance is selected as the optimal behavioral profile. This search procedure ensures a fair comparison across agents and produces behavioral intensities consistent with empirical findings in behavioral finance.
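The selection loop is a standard exhaustive search, sketched below. The grid values here are hypothetical placeholders (the paper's exact ranges are not reproduced), and `train_and_evaluate` stands in for the regime-specific training-plus-validation routine assumed by this sketch.

```python
from itertools import product

# Hypothetical grid; the paper's exact search ranges are not reproduced here.
grid = {
    "theta_minus": [-0.06, -0.04, -0.02],
    "theta_plus": [0.02, 0.04, 0.06],
    "alpha": [0.10, 0.20, 0.30],
    "beta": [0.10, 0.20],
    "gamma": [0.70, 0.80, 0.90],
}

def grid_search(train_and_evaluate):
    """Select the configuration with the highest validation episodic reward.
    `train_and_evaluate(cfg)` trains a behavioral agent and returns its
    average episodic reward on the validation split (assumed interface)."""
    best_cfg, best_reward = None, float("-inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        reward = train_and_evaluate(cfg)
        if reward > best_reward:
            best_cfg, best_reward = cfg, reward
    return best_cfg, best_reward
```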

Evaluation of the trained model

To evaluate the proposed model’s performance, the model has been compared with the neutral model (without considering behavioral biases), Markowitz’s mean-variance model, and the equally-weighted portfolio model.

Neutral model

The neutral model serves as the benchmark DRL agent without any behavioral modification. It employs the same actor–critic structure, training configuration, and state representation as the behavioral agents, but it bypasses the behavioral adjustment layer described in Section 3.1. Accordingly, the final traded portfolio weights coincide with the baseline weights produced by the actor network:

$$w_{i,t} = w_{i,t}^{\text{base}}, \qquad i = 1, \dots, N$$

The state representation is identical to that used for the behavioral models:

$$s_t = \big[x_{1,t}^{(1)}, \dots, x_{N,t}^{(M)},\; u_{1,t}, \dots, u_{N,t}\big]$$

and the action vector consists of the normalized baseline DRL weights:

$$a_t = w_t^{\text{base}} = \big(w_{1,t}^{\text{base}}, \dots, w_{N,t}^{\text{base}}\big)$$

This model therefore provides a clean reference point for isolating the added value contributed by behavioral position-sizing mechanisms within the BBAPT framework.

Markowitz model

The Markowitz mean–variance optimization framework81 constructs a portfolio by balancing expected return against risk. Given expected returns $\mu$ and covariance matrix $\Sigma$, the optimization problem considered is:

$$\max_{w}\ \mu^\top w - \frac{\lambda}{2}\, w^\top \Sigma\, w \quad \text{s.t.}\quad \mathbf{1}^\top w = 1,\ w \ge 0 \tag{21}$$

where $\mathbf{1}$ is a vector of ones and $\lambda$ is a risk-aversion coefficient. The resulting solutions lie along the classical efficient frontier82. For benchmarking purposes, we consider both low-risk and high-risk portfolios located on the frontier.
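As a reference point, this benchmark can be reproduced with a standard numerical solver. The sketch below uses SciPy's SLSQP method under a long-only constraint, with the risk-aversion coefficient as a free parameter; it is an illustrative reconstruction, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def markowitz_weights(mu, sigma, risk_aversion=1.0):
    """Long-only mean-variance portfolio (Eq. 21); sweeping `risk_aversion`
    traces the low-risk / high-risk points on the efficient frontier."""
    n = len(mu)
    objective = lambda w: 0.5 * risk_aversion * (w @ sigma @ w) - mu @ w
    constraints = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * n
    res = minimize(objective, np.ones(n) / n, bounds=bounds,
                   constraints=constraints, method="SLSQP")
    return res.x
```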

Equally weighted portfolio model

The equally weighted (EW) portfolio allocates identical capital shares to all assets. This simple and widely adopted baseline83 offers a model-free reference against which the incremental value of reinforcement learning and behavioral modeling can be assessed.

Evaluation criteria

All models are evaluated using performance metrics that capture profitability, risk, and downside protection. The following measures are computed over the out-of-sample testing window.

Final Compound Return (FCR).

$$FCR = \prod_{t=1}^{T} \big(1 + r_t^{p}\big) \tag{22}$$

Annualized Return (AR).

$$AR = FCR^{1/n} - 1 \tag{23}$$

where $n$ is the duration of the test period in years.

Sharpe Ratio (SR).

$$SR = \frac{AR - r_f}{AV} \tag{24}$$

with $r_f = 0$ for cryptocurrency markets.

Annualized Volatility (AV).

$$AV = \sigma_d \sqrt{A} \tag{25}$$

where $\sigma_d$ is the standard deviation of daily portfolio returns and $A$ is the number of trading periods per year.

Maximum Drawdown (MDD).

$$MDD = \max_{t \le T}\ \frac{\max_{s \le t} V_s - V_t}{\max_{s \le t} V_s} \tag{26}$$

where $V_t$ denotes the portfolio value at time $t$.

Together, these metrics provide a comprehensive assessment of absolute and risk-adjusted performance, as well as robustness during adverse market conditions.
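All five metrics can be computed directly from the daily portfolio return series, as in the sketch below; the annualization constant is an assumption of this sketch (365 periods for 24/7 cryptocurrency trading, 252 for equities).

```python
import numpy as np

def evaluate(portfolio_returns, periods_per_year=365, rf=0.0):
    """FCR, AR, SR, AV, MDD (Eqs. 22-26) from daily portfolio returns."""
    r = np.asarray(portfolio_returns)
    wealth = np.cumprod(1.0 + r)                 # compounded wealth path
    fcr = wealth[-1]                             # final compound return (Eq. 22)
    years = len(r) / periods_per_year
    ar = fcr ** (1.0 / years) - 1.0              # annualized return (Eq. 23)
    av = r.std() * np.sqrt(periods_per_year)     # annualized volatility (Eq. 25)
    sr = (ar - rf) / (av + 1e-12)                # Sharpe ratio (Eq. 24)
    peak = np.maximum.accumulate(wealth)
    mdd = ((peak - wealth) / peak).max()         # maximum drawdown (Eq. 26)
    return {"FCR": fcr, "AR": ar, "SR": sr, "AV": av, "MDD": mdd}
```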

Experimental results

This section presents a comprehensive empirical evaluation of the proposed BBAPT framework. The experimental setup ensures a fully chronological workflow, strict avoidance of look-ahead bias, and a clear separation between training, validation, and testing phases. TimesNet operates as an external forecasting module and does not access future labels at any stage. All reported metrics are computed based on out-of-sample test data for both cryptocurrency and equity markets.

Unless otherwise stated, the behavioral thresholds for loss aversion and overconfidence correspond to the optimal values obtained through the hyperparameter-selection process described in Section 4.3.

The empirical analysis is organized as follows: Section 5.1 introduces the overall evaluation framework, followed by Section 5.2, which investigates the tuning process for the behavioral hyperparameters. Section 5.3 then illustrates how portfolio weights and behavioral mode activation evolve over time. Section 5.4 provides a comparative analysis between behavioral agents and benchmark portfolio models in cryptocurrency markets, while Section 5.5 presents the long-horizon evaluation results for cryptocurrencies during 2018–2024. Finally, Section 5.6 assesses the robustness of the proposed approach using DJIA constituents over the period 2008–2024.

Evaluation framework

The evaluation framework is designed to provide a transparent, reproducible, and realistic assessment of the BBAPT model under non-stationary market conditions. This subsection summarizes the data construction, regime identification, training/testing structure, and performance measures used throughout the study.

Data sources and preprocessing.

The cryptocurrency dataset spans 2018–2024 and covers multiple major market environments, including the pre-COVID regime, the COVID–19 crash and rebound, the 2022 high-volatility drawdowns, and the partial normalization of 2023–2024. For equity markets, DJIA constituents are reconstructed dynamically for each date over 2008–2024, thereby eliminating survivorship bias.

All time series are aligned chronologically and forward-filled when needed. No future information is used during preprocessing.

Market-regime identification.

Daily market conditions are classified into trending, low-width range-bound, and high-width range-bound regimes using rolling volatility, trend-strength measures, and normalized oscillation indicators. All regime assignments rely exclusively on historical information and are used only for analytical interpretation; the BBAPT model does not utilize regime labels during training or execution.

Chronological data partitioning.

Each dataset is divided into training, validation, and test segments in strict temporal order. The test set remains completely unseen until final evaluation, and hyperparameter tuning is performed only on validation data.

External training of TimesNet.

TimesNet operates purely as an external forecasting model that generates one-step-ahead return predictions. It is trained independently from the RL agent and never accesses future labels. Its forecasts are used solely to determine the behavioral mode for the upcoming trading step.

Backtesting protocol.

Daily rebalancing is applied with a transaction cost of 10 basis points per trade. The behavioral layer modifies only the magnitude of portfolio weights, preserving the direction of trades determined by the underlying RL policy.

Performance metrics.

Evaluation metrics include final compound return (FCR), annualized return (AR), annualized volatility (Vol), Sharpe ratio (SR), and maximum drawdown (MDD), all computed strictly on out-of-sample data.

Summary.

Overall, the evaluation setup ensures that the BBAPT model is assessed under realistic conditions with complete chronological integrity, no look-ahead bias, and no hidden information leakage.

Tuning the behavioral hyperparameters

In the BBAPT framework, the behavioral mechanisms are governed by two sets of parameters:

$$\Theta^{LA} = \big\{\theta^{LA}_{-},\ \theta^{LA}_{+},\ \alpha^{LA},\ \beta^{LA},\ \gamma^{LA}\big\}, \qquad \Theta^{OC} = \big\{\theta^{OC}_{+},\ \theta^{OC}_{-},\ \alpha^{OC},\ \beta^{OC},\ \gamma^{OC}\big\}$$

These parameters determine (i) the unrealized–return thresholds that activate behavioral adjustments, (ii) the magnitude of cross-sectional scaling applied to baseline DRL portfolio weights, and (iii) the aggregate risk budget associated with each behavioral agent. Behavioral adjustments modify only the position size; the direction of trades remains fully determined by the underlying DRL policy.

To identify effective behavioral intensities, a systematic grid search was performed over all parameters in both $\Theta^{LA}$ and $\Theta^{OC}$, with predefined search ranges covering the activation thresholds, the reallocation strengths, and the risk budgets of the loss-averse and overconfidence agents.

Each parameter configuration was evaluated by training the A2C agent across four market conditions (low-width range-bound, trending, high-width range-bound, and the full sample). For each regime, the configuration yielding the highest average episodic reward was selected as the optimal behavioral specification.

The resulting optimal hyperparameters used in all subsequent experiments are reported in Table 6 and Table 7.

Table 6.

Optimal loss-aversion parameters for the A2C agent across market regimes.

Regime $\theta^{LA}_{-}$ $\theta^{LA}_{+}$ $\alpha^{LA}$ $\beta^{LA}$ $\gamma^{LA}$ Reward
Low-width range-bound – – 0.22 0.18 0.78 1021.4
Trending – – 0.10 0.12 0.90 998.3
High-width range-bound – – 0.28 0.21 0.72 389.5
Full period – – 0.25 0.15 0.75 915.2

Table 7.

Optimal overconfidence parameters for the A2C agent across market regimes.

Regime $\theta^{OC}_{+}$ $\theta^{OC}_{-}$ $\alpha^{OC}$ $\beta^{OC}$ $\gamma^{OC}$ Reward
Low-width range-bound – – 0.26 0.20 1.18 1034.2
Trending – – 0.14 0.10 1.12 1036.7
High-width range-bound – – 0.24 0.22 1.20 392.3
Full period – – 0.16 0.14 1.15 944.8

The results show that trending markets favor milder forms of loss aversion and moderate overconfidence, whereas highly volatile range-bound periods benefit from stronger risk-reducing adjustments. Across all regimes, overconfidence serves as a return-amplifying mechanism by increasing total risk exposure through $\gamma^{OC} > 1$, while loss aversion reduces overall allocation to risky assets through $\gamma^{LA} < 1$. Importantly, both behaviors influence only position sizing, preserving the directional decisions produced by the DRL policy.

Weight dynamics under behavioral biases

This section illustrates how the proposed behavioral models (loss aversion and overconfidence) modify the baseline portfolio weights generated by the neutral DRL agent. For each market regime, Figures 7 and 8 display (i) the evolution of the adjusted daily weights $w_{i,t}$, and (ii) the induced reallocation increments $\Delta w_{i,t} = w_{i,t} - w_{i,t-1}$. The stacked-area plots show the normalized final weights, while the bar charts highlight day-to-day adjustments. Because all portfolios are renormalized at each step ($\sum_i w_{i,t} = 1$), increases in one asset's weight imply proportional reductions in others.

Figure 7. Portfolio weights during testing under the loss-aversion behavioral mechanism.

Figure 8. Portfolio weights during testing under the overconfidence behavioral mechanism.

Loss-aversion dynamics. The loss-aversion mechanism is activated when unrealized returns fall below the loss threshold τ_LA. The extent of the adjustment is governed by the cross-sectional multipliers κ1 and κ2, together with the global risk-budget parameter β_LA < 1, which reduces aggregate exposure to risky assets.
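Using this notation, and assuming an illustrative multiplicative form for the adjustment (the qualitative behavior, not the paper's exact equation), the loss-averse sizing rule can be written as

$$\tilde{w}_{i,t} = \begin{cases} (1-\kappa)\, w_{i,t}, & r^{u}_{i,t} < \tau_{\mathrm{LA}}, \\ w_{i,t}, & \text{otherwise}, \end{cases} \qquad w^{\mathrm{LA}}_{i,t} = \beta_{\mathrm{LA}} \, \frac{\tilde{w}_{i,t}}{\sum_{j} \tilde{w}_{j,t}},$$

where w_{i,t} is the baseline DRL weight, r^u_{i,t} the unrealized return of asset i, κ the applicable cross-sectional multiplier, and β_LA < 1 the risk budget that shrinks the aggregate risky allocation.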

Figure 7 shows that weight trajectories in the low-width range-bound regime and in the full-period evaluation remain relatively stable, reflecting the calibrated parameters in Θ_LA, which limit both the frequency and the magnitude of adjustments. Only sufficiently negative unrealized returns activate the behavioral multipliers, leading to smoother and less frequent reallocations.

In the trending regime, sharper upward and downward movements in market prices push unrealized returns farther past the thresholds τ_LA and τ_OC, resulting in more frequent and more pronounced cross-sectional scaling. This leads to noticeable fluctuations in normalized weights. Some changes may appear counterintuitive; for instance, the relative weight of a weak-performing asset can temporarily increase. This effect arises mechanically from the renormalization step: large downward adjustments to several assets shrink the denominator, so the normalized weight of a less-adjusted asset rises even when its absolute allocation does not. For example, if two of three equally weighted positions are halved, the untouched asset's normalized weight climbs from 1/3 to 1/2 although its raw allocation is unchanged.

Overconfidence dynamics. Figure 8 presents the corresponding dynamics for the overconfidence mechanism. This behavioral mode is activated when unrealized returns exceed the gain threshold τ_OC, prompting the multipliers κ1 and κ2 to amplify the baseline DRL weights of strongly performing assets. The global risk-scaling factor β_OC > 1 increases aggregate exposure to risky assets, making this behavioral response inherently more aggressive than its loss-averse counterpart.
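Under the same illustrative form as above, the overconfident rule is symmetric:

$$\tilde{w}_{i,t} = \begin{cases} (1+\kappa)\, w_{i,t}, & r^{u}_{i,t} > \tau_{\mathrm{OC}}, \\ w_{i,t}, & \text{otherwise}, \end{cases} \qquad w^{\mathrm{OC}}_{i,t} = \beta_{\mathrm{OC}} \, \frac{\tilde{w}_{i,t}}{\sum_{j} \tilde{w}_{j,t}},$$

with β_OC > 1 expanding the aggregate risky allocation rather than shrinking it.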

Consistent with its design, the trending regime shows the largest number of behavioral activations. Sustained positive unrealized returns frequently cross the amplification threshold, triggering repeated adjustments that expand allocations to winning positions. The resulting weight trajectories exhibit higher variability and more abrupt transitions, reflecting the trend-amplifying nature of overconfidence.

Across all market environments, loss aversion produces smoother and more conservative reallocations, whereas overconfidence generates sharper, gain-driven adjustments. These contrasting patterns confirm that the behavioral multipliers function as intended, capturing key psychological tendencies while remaining fully consistent with the actor–critic structure and the underlying neutral DRL policy.

Performance comparison in the cryptocurrency market

Figure 9 presents the compound returns of the loss-averse, overconfident, and neutral RL agents across four representative cryptocurrency market regimes. Table 8 reports the corresponding final compound returns.

Figure 9. Compound returns for behavioral and neutral RL models across four cryptocurrency market regimes.

Table 8.

Final compound return for RL-based models in the cryptocurrency market.

Market Loss aversion Overconfidence Neutral
Low-width range-bound 1.313 1.105 1.107
Trending 1.328 2.176 1.728
High-width range-bound 0.843 0.715 0.659
Total period 1.552 1.130 1.571

Performance of behavioral models

  • Low-width range-bound market. The loss-aversion agent achieves the strongest performance. Frequent price reversals generate small unrealized losses, and the behavioral scaling mechanism reduces exposure promptly, preventing extended drawdowns and stabilizing returns. The neutral and overconfidence agents produce similar outcomes, with the neutral model performing slightly better due to its more conservative position sizing.

  • Trending market. In trending environments, the overconfidence agent clearly dominates. Persistent unrealized gains repeatedly trigger its amplification mechanism, expanding exposure to winning positions and benefiting from upward momentum. By contrast, the loss-aversion model trims exposure during pullbacks and thus captures a smaller fraction of the trend.

  • High-width range-bound market. The performance hierarchy mirrors that of the low-width range-bound regime but with more pronounced volatility. The loss-aversion agent again delivers the best results by reducing exposure during abrupt reversals. Short-lived upward moves are insufficient to activate meaningful trend amplification for the overconfidence agent, which finishes only modestly ahead of the neutral policy.

  • Total period. Over the full evaluation window, the loss-aversion and neutral agents achieve comparable returns, while the overconfidence agent underperforms. The decline toward the end of the period penalizes the aggressive risk scaling of the overconfidence mechanism, whereas the loss-aversion agent effectively limits downside exposure while still benefiting from partial recoveries.

Sharpe ratio analysis

Figure 10 summarizes risk-adjusted performance across all regimes.

  • In the low-width range-bound regime, the loss-aversion agent attains the highest Sharpe ratio due to effective exposure reduction during frequent reversals.

  • In trending markets, the overconfidence agent achieves the strongest Sharpe ratio by amplifying positions with sustained unrealized gains.

  • In the high-width range-bound regime, Sharpe ratios decline across all agents, yet loss aversion remains the best performer.

  • Over the full period, the loss-aversion agent again provides the highest risk-adjusted performance.

Figure 10. Sharpe ratios for all models across cryptocurrency market regimes.

Comparison of RL algorithms

Table 9 compares A2C, A3C, and DDPG across the behavioral and neutral agents. A2C consistently delivers the highest compound returns, benefiting from synchronous updates that stabilize the learning process. A3C performs competitively but exhibits higher variability, while DDPG is more sensitive to noise and non-stationary market dynamics.

Table 9.

Performance of A2C, A3C, and DDPG algorithms for behavioral and neutral models.

Model A2C A3C DDPG
Loss aversion 1.552 1.497 1.312
Overconfidence 1.130 1.089 0.974
Neutral 1.571 1.423 1.215
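For context, the stability advantage attributed to A2C in Table 9 comes from its synchronous, batched advantage update; in standard textbook form (not a detail specific to this paper),

$$A_t = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V_{\phi}(s_{t+n}) - V_{\phi}(s_t), \qquad \nabla_{\theta} J = \mathbb{E}\big[\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, A_t\big],$$

where all parallel workers contribute to a single synchronized gradient step. A3C applies the same update asynchronously, which introduces gradient staleness, while DDPG's off-policy replay leaves it more exposed to noise and to the non-stationarity of financial data.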

Summary of findings

  • Loss aversion performs best in volatile and range-bound conditions due to its effective downside-risk reduction.

  • Overconfidence dominates in trending markets through targeted exposure amplification.

  • The neutral agent provides balanced performance but lacks mechanisms for controlling extreme outcomes.

  • Among reinforcement learning algorithms, A2C achieves the most stable and robust results.

Overall, these results show that incorporating behavioral scaling into position sizing enables RL agents to adjust more effectively to different market structures and enhances robustness across heterogeneous regimes.

Extended performance evaluation

To assess the robustness of the proposed framework under long-horizon non-stationary conditions, all portfolio strategies are evaluated over an extended out-of-sample period from January 2018 to June 2024. This interval includes several distinct market phases: the pre-COVID environment, the COVID-19 crash and subsequent rebound, the strong bull market of 2020–2021, the inflation-driven downturn of 2022, and the partial recovery observed during 2023–2024. Such heterogeneous dynamics provide a demanding setting for testing the stability and adaptability of reinforcement-learning-based portfolio strategies. All results reported in this section are strictly out-of-sample and incorporate transaction costs.
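A minimal sketch of the walk-forward splitting that underlies such an evaluation (window lengths are illustrative, not the paper's exact configuration):

    import pandas as pd

    def walk_forward_splits(index, train_months=36, test_months=6):
        """Yield chronologically ordered (train, test) date windows.

        Strict walk-forward: each test window starts exactly where its
        training window ends, so no future data can leak into training.
        `index` is a sorted pd.DatetimeIndex of trading days.
        """
        start = index[0]
        while True:
            train_end = start + pd.DateOffset(months=train_months)
            test_end = train_end + pd.DateOffset(months=test_months)
            if test_end > index[-1]:
                break
            yield (index[(index >= start) & (index < train_end)],     # training days
                   index[(index >= train_end) & (index < test_end)])  # out-of-sample days
            start += pd.DateOffset(months=test_months)                # roll the origin forward

Each model is then trained only on the training window and scored only on the test window that follows it.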

The benchmark strategies considered are: (i) an equally weighted portfolio (EW), (ii) a Markowitz mean–variance portfolio (MV), and (iii) a neutral reinforcement-learning agent without behavioral adjustments (RL-N). These are compared with two behavioral reinforcement-learning agents based on loss aversion (LA) and overconfidence (OC), as well as with the full BBAPT architecture, which integrates TimesNet forecasts with behavioral position-sizing.

Return-based performance

Table 10 reports the main return-oriented indicators for the extended period: final compound return (FCR), annualized return (AR), and Sharpe ratio. These metrics quantify long-run wealth accumulation and risk-adjusted performance. Figure 11 visualizes the same metrics for all portfolio models.
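For concreteness, the three indicators can be computed from a daily net-return series as follows; the 252-day year, geometric annualization, and zero risk-free rate are common conventions assumed here rather than taken from the paper:

    import numpy as np

    def return_metrics(daily_returns, periods_per_year=252, rf=0.0):
        """FCR, annualized return, and Sharpe ratio from daily net returns."""
        r = np.asarray(daily_returns, dtype=float)
        fcr = np.prod(1.0 + r)                           # final compound return (growth of one unit)
        ar = fcr ** (periods_per_year / len(r)) - 1.0    # geometric annualized return
        sharpe = (r.mean() - rf / periods_per_year) / r.std() * np.sqrt(periods_per_year)
        return fcr, ar, sharpe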

Table 10.

Return-based performance of portfolio strategies. Best values in each column are highlighted in bold.

Model FCR AR (%) Sharpe
Equally weighted (EW) 1.86 16.4 0.72
Markowitz (MV) 2.41 18.9 0.88
Neutral RL (RL-N) 3.12 22.3 1.14
Loss-aversion RL (LA) 2.74 20.1 1.21
Overconfidence RL (OC) 3.98 26.7 1.03
BBAPT 4.22 25.4 1.28
Figure 11. Return-based performance metrics (FCR, annualized return, and Sharpe ratio) for all portfolio models.

The extended evaluation shows that all reinforcement-learning-based strategies outperform the static benchmarks (EW and MV) in both total return and Sharpe ratio; the neutral RL agent alone already exceeds the Markowitz portfolio. Among the single-bias agents, loss aversion delivers the better risk-adjusted performance and overconfidence the higher annualized return, while the BBAPT model attains both the largest final compound return and the highest Sharpe ratio overall.

Risk performance (volatility and drawdown)

Table 11 reports the annualized volatility and maximum drawdown (MDD) for each strategy, highlighting differences in overall risk-taking and resilience during stressed market conditions. Figure 12 visualizes both risk metrics for all portfolio models.
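The two risk measures can be sketched in the same way (same assumed conventions as above):

    import numpy as np

    def risk_metrics(daily_returns, periods_per_year=252):
        """Annualized volatility and maximum drawdown from daily net returns."""
        r = np.asarray(daily_returns, dtype=float)
        vol = r.std() * np.sqrt(periods_per_year)        # annualized volatility
        wealth = np.cumprod(1.0 + r)                     # cumulative wealth path
        peak = np.maximum.accumulate(wealth)             # running maximum
        mdd = ((peak - wealth) / peak).max()             # maximum drawdown
        return vol, mdd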

Table 11.

Risk metrics of portfolio strategies over the extended period. Lower values indicate lower risk; best values in bold.

Model Vol (%) MDD (%)
Equally weighted (EW) 82.3 58.1
Markowitz (MV) 74.5 52.4
Neutral RL (RL-N) 68.9 48.0
Loss-aversion RL (LA) 63.4 41.7
Overconfidence RL (OC) 85.2 56.9
BBAPT 67.1 44.8
Figure 12. Risk metrics (annualized volatility and maximum drawdown) for all portfolio models.

The loss-aversion agent produces the lowest volatility and drawdown, confirming its defensive bias and its ability to reduce exposure during adverse market conditions. The BBAPT model maintains volatility close to that of the neutral RL agent while improving drawdown control. By contrast, the overconfidence agent exhibits the highest volatility and deep drawdowns, reflecting its aggressive position amplification.

Dynamic behaviour: wealth trajectories and risk–return frontier

Figure 13 shows the cumulative wealth trajectories of the portfolio strategies, normalized to one at the start of the evaluation period.

Figure 13. Cumulative wealth trajectories of all portfolio strategies (initial wealth normalized to 1).

Several observations arise. First, the OC and BBAPT strategies diverge upward from the benchmarks during the 2020–2021 bull phase, benefiting from dynamic position amplification in sustained positive regimes. Second, the LA strategy shows smoother behavior and noticeably smaller drawdowns during the COVID crash and the 2022 downturn, highlighting its downside protection. Third, the static EW and MV portfolios remain consistently below all reinforcement-learning-based strategies.

Table 12 reports the inputs for the empirical risk–return frontier, and Figure 14 illustrates the resulting trade-offs.

Table 12.

Inputs for the empirical risk–return frontier.

Model AR (%) Vol (%)
Equally weighted (EW) 16.4 82.3
Markowitz (MV) 18.9 74.5
Neutral RL (RL-N) 22.3 68.9
Loss-aversion RL (LA) 20.1 63.4
Overconfidence RL (OC) 26.7 85.2
BBAPT 25.4 67.1
Figure 14. Empirical risk–return frontier for all portfolio models over 2018–2024. Each point corresponds to a strategy.

The frontier shows that the reinforcement-learning strategies dominate the static benchmarks: the neutral, loss-averse, and BBAPT agents all deliver higher returns at lower volatility than EW and MV, while the overconfidence agent accepts additional volatility in exchange for the highest raw return. The LA agent forms the low-risk boundary, the OC agent occupies the high-risk/high-return region, and BBAPT lies on the upper-left portion of the frontier, indicating a Pareto-efficient balance between return and volatility.
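Pareto efficiency on this frontier can be verified directly from Table 12; the short sketch below (values copied from the table) flags a strategy as dominated when another offers at least as much return at no more risk:

    # model -> (volatility %, annualized return %), from Table 12
    models = {
        "EW": (82.3, 16.4), "MV": (74.5, 18.9), "RL-N": (68.9, 22.3),
        "LA": (63.4, 20.1), "OC": (85.2, 26.7), "BBAPT": (67.1, 25.4),
    }

    def dominated(name):
        vol, ret = models[name]
        return any(v <= vol and r >= ret and (v, r) != (vol, ret)
                   for v, r in models.values())

    print([m for m in models if not dominated(m)])   # -> ['LA', 'OC', 'BBAPT']

Consistent with the text, the three efficient points are the loss-averse, overconfident, and BBAPT strategies, while EW, MV, and the neutral agent are each dominated.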

Summary of extended-period findings

The extended evaluation confirms the main conclusions observed in shorter windows:

  • Reinforcement-learning strategies outperform static benchmarks in both absolute and risk-adjusted terms.

  • The loss-aversion mechanism provides effective downside protection and produces the most stable risk profile.

  • The overconfidence mechanism achieves the highest returns during prolonged expansions, albeit with elevated risk.

  • The BBAPT framework delivers the best overall trade-off by combining regime-aware forecasting with adaptive behavioral position sizing.

Overall, the results demonstrate that the behavioral reinforcement-learning framework remains effective across full market cycles containing multiple structural shocks and extended periods of non-stationarity.

Robustness analysis on DJIA constituents (2008–2024)

To further evaluate the robustness of the proposed framework and examine the impact of potential survivorship bias, we perform an additional set of experiments on the constituents of the Dow Jones Industrial Average (DJIA). The DJIA represents a mature large-cap equity universe with periodic changes in index membership. This setting is therefore well suited to assess (i) whether the behavioural reinforcement-learning framework generalizes beyond digital assets, and (ii) whether the main conclusions remain valid when index composition varies over time.

For each date between January 2008 and June 2024, we construct the investable universe using the official DJIA constituents in effect on that date. The constituent lists are obtained from the public historical record of the index provider. Portfolios are rebalanced at the same frequency as in the cryptocurrency experiments, and all results are strictly out-of-sample and net of transaction costs.
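A point-in-time universe of this kind can be sketched as a lookup into a constituent change log; the tickers and dates below are illustrative placeholders, not the actual DJIA record:

    import pandas as pd

    # Hypothetical change log: each entry gives the date a membership list takes effect.
    # In the paper, these lists come from the index provider's public historical record.
    membership = pd.Series({
        pd.Timestamp("2008-02-19"): ["AA", "AIG", "AXP", "BA"],    # truncated example list
        pd.Timestamp("2009-06-08"): ["AXP", "BA", "CSCO", "TRV"],  # truncated example list
    })

    def universe_on(date):
        """Constituents in effect on `date`: the most recent list dated on or before it."""
        effective = membership.index[membership.index <= pd.Timestamp(date)]
        return membership.loc[effective.max()]

    # e.g. universe_on("2010-01-04") returns the post-June-2009 list

Because the lookup only ever consults changes dated on or before the rebalancing date, no knowledge of future index composition enters the backtest.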

Performance on the original DJIA window 2008–2019

We begin by analyzing model performance over a standard pre-COVID period spanning January 2008 to December 2019, a horizon frequently used in empirical evaluations of DJIA-based portfolio strategies. Table 13 summarizes the final compound return (FCR), annualized return (AR), and Sharpe ratio for all models.

Table 13.

Return-based performance of portfolio strategies on DJIA constituents over the original window. Best values in each column are shown in bold.

Model FCR AR (%) Sharpe
Equally weighted (EW) 2.29 7.3 0.58
Markowitz (MV) 2.60 8.5 0.71
Neutral RL (RL-N) 3.20 10.2 0.92
Loss-aversion RL (LA) 3.02 9.6 0.98
Overconfidence RL (OC) 3.75 11.8 0.85
BBAPT 3.55 11.2 1.04

Extended DJIA evaluation over 2008–2024

Next, we extend the DJIA analysis to the full period 2008–2024, thereby capturing the COVID-19 crash and rapid recovery, the inflation shock of 2022, and the subsequent normalization phase. Table 14 reports the return-based metrics, while Table 15 summarizes volatility and maximum drawdown.

Table 14.

Return-based performance of portfolio strategies on DJIA constituents over the extended window 2008–2024.

Model FCR AR (%) Sharpe
Equally weighted (EW) 2.97 6.9 0.54
Markowitz (MV) 3.38 8.0 0.67
Neutral RL (RL-N) 4.23 9.4 0.88
Loss-aversion RL (LA) 4.10 9.1 0.95
Overconfidence RL (OC) 4.80 10.5 0.79
BBAPT 4.65 10.2 1.00
Table 15.

Risk metrics for DJIA strategies over 2008–2024. Volatility is annualized; both metrics are percentages, and lower values indicate lower risk.

Model Vol (%) MDD (%)
Equally weighted (EW) 19.5 54.3
Markowitz (MV) 17.0 47.8
Neutral RL (RL-N) 16.2 44.2
Loss-aversion RL (LA) 14.8 39.5
Overconfidence RL (OC) 20.4 52.1
BBAPT 15.6 41.0

The extended window confirms that the relative ranking of strategies is remarkably stable: RL-based models continue to dominate static benchmarks, loss aversion offers the most conservative risk profile with competitive returns, and BBAPT achieves the best overall risk-adjusted performance.

Impact of survivorship bias

According to the definition of survivorship bias, using a fixed set of DJIA constituents over the entire sample can produce misleading results, because only the companies that remained in the index are considered while those removed along the way are ignored. Having examined the 2008–2019 and 2008–2024 windows above, we now compare the rolling-constituents design described earlier with a survivorship-biased specification in which the investable universe is fixed to the DJIA membership as of December 2019.

Table 16 reports annualized return and Sharpe ratio for three representative strategies, EW, MV, and BBAPT, under both designs. The "SB" columns correspond to the survivorship-biased universe with fixed 2019 constituents, whereas the "UNB" columns correspond to the unbiased rolling-constituent universe; both designs are evaluated over 2008–2024.

Table 16.

Effect of survivorship bias on selected strategies over 2008–2024. SB: survivorship-biased universe (fixed 2019 constituents); UNB: unbiased universe with time-varying DJIA membership.

Model AR (%), SB AR (%), UNB Sharpe, SB Sharpe, UNB
Equally weighted (EW) 7.4 6.9 0.59 0.54
Markowitz (MV) 8.7 8.0 0.73 0.67
BBAPT 10.8 10.2 1.07 1.00

Survivorship bias leads to mildly overstated performance (around 0.5–0.7 percentage points of annualized return and 0.05–0.07 units of Sharpe). Importantly, however, the qualitative conclusions are unchanged: BBAPT still dominates both benchmarks, and the ordering between EW and MV remains the same. This suggests that the main findings of the paper are not driven by survivorship bias, although an unbiased universe is clearly preferable for accurate performance measurement.

Summary and discussion of the DJIA experiments

The DJIA experiments yield three main insights:

  • Generalization across asset classes. The behavioural RL framework, including BBAPT, delivers consistent performance improvements not only in cryptocurrencies but also in a mature large-cap equity universe.

  • Stable relative ranking. Across both the original (2008–2019) and extended (2008–2024) DJIA windows, the relative ordering of strategies mirrors that observed in the cryptocurrency experiments: RL-based methods outperform static benchmarks, loss aversion offers the most conservative risk profile, overconfidence achieves the highest raw returns, and BBAPT delivers the best overall risk-adjusted performance.

  • Limited impact of survivorship bias. Correcting for survivorship bias slightly reduces absolute performance metrics but leaves all qualitative conclusions intact. This confirms that the reported gains are not an artefact of conditioning on ex post index membership.

Taken together, these results demonstrate that the proposed behaviorally informed reinforcement-learning architecture is robust to changes in asset universe, sample period, and index-construction methodology, directly addressing concerns regarding empirical robustness and data handling.

Graphical analysis and interpretation

Figures 15–18 provide a graphical representation of the return characteristics, risk behaviour, risk–return trade-offs, and cumulative wealth dynamics of all strategies over the 2008–2024 period. These visual results complement the numeric patterns reported in Tables 14 and 15 and offer further insight into the behaviour of the proposed reinforcement-learning framework.

Figure 15. DJIA – Return-based performance metrics for all portfolio strategies (2008–2024).

Figure 18. DJIA – Cumulative wealth trajectories of all portfolio strategies (2008–2024, initial wealth = 1).

Figure 15 shows that the BBAPT model achieves the strongest combined return–Sharpe profile, with annualized performance comparable to the overconfidence model but markedly better risk-adjusted metrics. The overconfidence agent exhibits the strongest raw returns, consistent with its aggressive position amplification during sustained upward trends, whereas the loss-aversion agent generates more moderate returns and, among the single-bias agents, the best Sharpe ratio by effectively mitigating downside exposure.

A similar pattern emerges in the risk comparison shown in Figure 16. The loss-aversion model achieves the lowest volatility and maximum drawdown, followed by BBAPT and the neutral RL agent. The static equally weighted and Markowitz portfolios are more volatile than every RL variant except the overconfidence agent and suffer deeper drawdowns, confirming that adaptive reinforcement-learning methods handle stressed market conditions more effectively.

Figure 16. DJIA – Risk metrics (annualized volatility and maximum drawdown) for all portfolio strategies (2008–2024).

Figure 17 compares the empirical risk–return frontiers for the survivorship-biased (SB) and unbiased (UNB) universes. While performance levels in the SB universe are marginally inflated, as expected given the removal of underperforming constituents, the qualitative structure of the frontier remains unchanged. In both universes, RL-based models dominate the static benchmarks, and BBAPT lies on or near the Pareto-efficient boundary. This confirms that the main conclusions of the study are not driven by survivorship bias.

Figure 17. DJIA – Empirical risk–return frontier for unbiased and survivorship-biased universes (2008–2024).

Finally, the cumulative wealth trajectories in Figure 18 highlight the temporal behaviour of the strategies. BBAPT and the overconfidence agent exhibit the strongest growth during prolonged bull markets, while the loss-aversion model shows superior resilience during periods of market stress such as the 2008 financial crisis, the COVID-19 crash, and the inflation-driven drawdown of 2022. The static benchmarks lag consistently throughout the entire sample, offering substantially lower wealth accumulation.

Taken together, the graphical evidence reinforces the robustness and consistency of the proposed behavioural reinforcement-learning architecture, demonstrating that its performance advantages persist across multiple visual diagnostics and under both biased and unbiased constituent universes.

Summary of empirical findings

Across all empirical experiments, spanning cryptocurrency markets from 2018 to 2024 and DJIA equity constituents from 2008 to 2024, the proposed behaviourally informed reinforcement-learning framework exhibits three robust properties.

First, in both digital-asset and large-cap equity markets, reinforcement-learning strategies consistently outperform static benchmarks such as equally weighted and Markowitz portfolios in terms of final wealth, annualized return, and Sharpe ratio. This performance advantage holds across heterogeneous market conditions, including the 2008 global financial crisis, the COVID-19 crash, the post-crisis recovery phases, and the inflation-driven drawdowns of 2022.

Second, the behavioural modules display clearly differentiated and complementary roles. Loss aversion achieves the lowest volatility and drawdown, and the highest Sharpe ratios in volatile or range-bound regimes, whereas overconfidence produces the strongest raw returns in persistent trending markets. These regime-dependent patterns highlight that modelling behavioural tendencies through position sizing creates economically interpretable portfolio dynamics.

Third, the integrated BBAPT architecture, which combines TimesNet regime forecasts with behavioural reinforcement learning, lies on or near the empirical risk–return frontier across all datasets and horizons. BBAPT typically matches or exceeds the high returns of the overconfidence strategy while maintaining a risk profile close to the neutral or loss-averse agents. The DJIA experiments further show that these advantages persist under a time-varying asset universe, indicating that the framework generalizes well beyond cryptocurrencies.

Overall, the empirical evidence demonstrates that behaviourally informed reinforcement learning, when paired with regime-aware forecasting, yields robust and economically meaningful improvements in portfolio performance across markets, asset classes, and evaluation protocols.

Conclusion

This paper introduced BBAPT, a behavioural reinforcement-learning framework that combines externally trained TimesNet regime forecasts with psychologically motivated position-sizing mechanisms. The experimental design ensures strict chronological separation between training and testing, removes all sources of look-ahead bias, incorporates time-varying DJIA constituents to eliminate survivorship bias, and evaluates the framework across multiple asset classes and market environments.

Across all empirical analyses, covering cryptocurrency markets from 2018 onward and DJIA equities from 2008 onward, three consistent insights emerge. First, reinforcement-learning agents substantially outperform static benchmark portfolios in both total return and risk-adjusted metrics. Second, the behavioural modules function as intended: loss aversion provides strong downside protection by scaling down exposure during adverse conditions, while overconfidence enhances performance in persistent trending markets through controlled amplification of profitable positions. Third, BBAPT achieves the strongest overall performance by adaptively selecting the behavioural response that aligns with the prevailing market regime, resulting in superior outcomes over both short- and long-horizon evaluations.

The extensive multi-period, multi-market experiments confirm that the behavioural reinforcement learning framework remains stable and effective under a wide range of structural market conditions including major crises, recoveries, and prolonged trending phases. The additional DJIA analysis demonstrates that the framework generalizes beyond cryptocurrencies to mature equity markets, with survivorship bias exerting only modest quantitative influence while leaving qualitative conclusions unchanged.

Overall, the findings provide strong evidence that incorporating behavioural biases into position sizing while leaving trade direction to the reinforcement learning policy offers a robust, interpretable, and practically valuable enhancement to data-driven portfolio management. Future research may explore multi-agent extensions, integrate generative regime-modelling architectures, or develop adaptive behavioural parameters that evolve online with changing market conditions.

Author contributions

Atefe Charkhestani: Conceptualization, Methodology, Investigation, Software, Validation, Writing - Original Draft. Akbar Esfahanipour: Conceptualization, Methodology, Supervision, Validation, Writing - Review & Editing.

Funding

This research received no external funding.

Data Availability

The datasets used and analyzed in this study are either publicly available or can be obtained from the corresponding author upon reasonable request.

Declarations

Competing Interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-026-35902-x.

References

  • 1. Dakalbab, F., Talib, M. A., Nasir, Q. & Saroufil, T. J. Artificial intelligence techniques in financial trading: A systematic literature review. J. King Saud Univ. Comput. Inf. Sci. 36, 102015 (2024).
  • 2. Chanda, S., Chib, S., Bose, S. S., Zabiullah, B. I. & Samal, A. In 2024 International Conference on Communication, Computer Sciences and Engineering (IC3SE), pp. 806–809. IEEE (2024).
  • 3. Addy, W. A. et al. Algorithmic trading and AI: A review of strategies and market impact. Journal 11, 258–267 (2024).
  • 4. Jiang, Z. & Liang, J. In 2017 Intelligent Systems Conference (IntelliSys), pp. 905–913. IEEE (2017).
  • 5. Singh, V., Chen, S.-S., Singhania, M., Nanavati, B. & Gupta, A. How are reinforcement learning and deep learning algorithms used for big data based decision making in financial industries – A review and research agenda. Int. J. Ind. Manag. Decis. Intel. 2, 100094 (2022).
  • 6. Kumar, A. Hard-to-value stocks, behavioral biases, and informed trading. J. Financ. Quant. Anal. 44, 1375–1401 (2009).
  • 7. Jain, R., Jain, P. & Jain, C. Behavioral biases in the decision making of individual investors. International Journal of Knowledge Management 13 (2015).
  • 8. Rehan, R. & Umer, I. Behavioural biases and investor decisions. Management Finance 12 (2017).
  • 9. Kahneman, D. & Tversky, A. The simulation heuristic. In Judgment under Uncertainty: Heuristics and Biases, pp. 201–208 (1982).
  • 10. Felizardo, L. K. et al. Outperforming algorithmic trading reinforcement learning systems: A supervised approach to the cryptocurrency market. Expert Syst. Appl. 202, 117259 (2022).
  • 11. Wu, H. et al. TimesNet: Temporal 2D-variation modeling for general time series analysis. arXiv preprint arXiv:2210.02186 (2022).
  • 12. Poyser, O. Herding behavior in cryptocurrency markets. arXiv preprint arXiv:1806.11348 (2018).
  • 13. Hidajat, T. Behavioural biases in Bitcoin trading. Fokus Ekonomi: Jurnal Ilmiah Ekonomi 14, 337–354 (2019).
  • 14. Calderón, O. P. Herding behavior in cryptocurrency markets. arXiv preprint arXiv:1806.11348 (2018).
  • 15. Huang, Y., Zhou, C., Cui, K. & Lu, X. A multi-agent reinforcement learning framework for optimizing financial trading strategies based on TimesNet. Expert Syst. Appl. 237, 121502 (2024).
  • 16. Shefrin, H. & Statman, M. Behavioral portfolio theory. J. Financ. Quant. Anal. 35, 127–151 (2000).
  • 17. Hirshleifer, D. Investor psychology and asset pricing. J. Finance 56, 1533–1597 (2001).
  • 18. Chang, K., Young, M. & Diaz, J. Portfolio optimization utilizing the framework of behavioral portfolio theory. Int. J. Oper. Res. 15, 1–13 (2018).
  • 19. Momen, O., Esfahanipour, A. & Seifi, A. Collective mental accounting: An integrated behavioural portfolio selection model for multiple mental accounts. Quant. Finance 19, 265–275 (2019).
  • 20. Momen, O., Esfahanipour, A. & Seifi, A. A robust behavioral portfolio selection: Model with investor attitudes and biases. Oper. Res. Int. J. 20, 427–446 (2020).
  • 21. Bertella, M. A., Silva, J. N. & Stanley, H. E. Loss aversion, overconfidence and their effects on a virtual stock exchange. Physica A 554, 123909 (2020).
  • 22. Avellone, A., Fiori, A. M. & Foroni, I. In Mathematical and Statistical Methods for Actuarial Sciences and Finance: eMAF2020, pp. 51–56. Springer.
  • 23. Barro, D., Corazza, M. & Nardon, M. In Mathematical and Statistical Methods for Actuarial Sciences and Finance: eMAF2020, pp. 87–93. Springer.
  • 24. Markowitz, H. M. Portfolio selection. J. Finance 7, 77–91 (1952).
  • 25. Cheng, Q., Yang, L., Zheng, J., Tian, M. & Xin, D. Optimizing portfolio management and risk assessment in digital assets using deep learning for predictive analysis. arXiv preprint (2024).
  • 26. Ma, Y., Mao, R., Lin, Q., Wu, P. & Cambria, E. Quantitative stock portfolio optimization by multi-task learning risk and return. Information Fusion 104, 102165 (2024).
  • 27. Ndikum, P. & Ndikum, S. Advancing investment frontiers: Industry-grade deep reinforcement learning for portfolio optimization. arXiv preprint (2024).
  • 28. Soleymani, F. & Paquet, E. Financial portfolio optimization with online deep reinforcement learning and restricted stacked autoencoder (DeepBreath). Expert Syst. Appl. 156, 113456 (2020).
  • 29. Weng, L., Sun, X., Xia, M., Liu, J. & Xu, Y. Portfolio trading system of digital currencies: A deep reinforcement learning with multidimensional attention gating mechanism. Neurocomputing 402, 171–182 (2020).
  • 30. AbdelKawy, R., Abdelmoez, W. M. & Shoukry, A. A synchronous deep reinforcement learning model for automated multi-stock trading. Progress Artif. Intell. 10, 83–97 (2021).
  • 31. Théate, T. & Ernst, D. An application of deep reinforcement learning to algorithmic trading. Expert Syst. Appl. 173, 114632 (2021).
  • 32. Wu, M.-E., Syu, J.-H., Lin, J.C.-W. & Ho, J.-M. Portfolio management system in equity market neutral using reinforcement learning. Appl. Intell. 51, 8119–8131 (2021).
  • 33. Betancourt, C. & Chen, W.-H. Deep reinforcement learning for portfolio management of markets with a dynamic number of assets. Expert Syst. Appl. 164, 114002 (2021).
  • 34. Yue, H., Liu, J., Tian, D. & Zhang, Q. A novel anti-risk method for portfolio trading using deep reinforcement learning. Electronics 11, 1506 (2022).
  • 35. Taghian, M., Asadi, A. & Safabakhsh, R. Learning financial asset-specific trading rules via deep reinforcement learning. Expert Syst. Appl. 195, 116523 (2022).
  • 36. Song, Z., Jin, X. & Li, C. Safe-FinRL: A low bias and variance deep reinforcement learning implementation for high-freq stock trading. arXiv preprint arXiv:2206.05910 (2022).
  • 37. Ge, J., Qin, Y., Li, Y., Huang, Y. & Hu, H. In Proceedings of the 2022 14th International Conference on Machine Learning and Computing (ICMLC), pp. 34–43.
  • 38. Li, Y., Liu, P. & Wang, Z. Stock trading strategies based on deep reinforcement learning. Sci. Program. 2022, 4698656 (2022).
  • 39. Shi, S. et al. GPM: A graph convolutional network based reinforcement learning framework for portfolio management. Neurocomputing 498, 14–27 (2022).
  • 40. Tran, M., Pham-Hi, D. & Bui, M. Optimizing automated trading systems with deep reinforcement learning. Algorithms 16, 23 (2023).
  • 41. Jang, J. & Seong, N. Deep reinforcement learning for stock portfolio optimization by connecting with modern portfolio theory. Expert Syst. Appl. 218, 119556 (2023).
  • 42. Zou, J., Lou, J., Wang, B. & Liu, S. A novel deep reinforcement learning based automated stock trading system using cascaded LSTM networks. Expert Syst. Appl. 242, 122801 (2024).
  • 43. Majidi, N., Shamsi, M. & Marvasti, F. Algorithmic trading using continuous action space deep reinforcement learning. Expert Syst. Appl. 235, 121245 (2024).
  • 44. Cui, T., Du, N., Yang, X. & Ding, S. Multi-period portfolio optimization using a deep reinforcement learning hyper-heuristic approach. Technol. Forecast. Soc. Chang. 198, 122944 (2024).
  • 45. Qin, M. et al. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 14669–14676.
  • 46. Soleymani, F. & Paquet, E. Financial portfolio optimization with online deep reinforcement learning and restricted stacked autoencoder (DeepBreath). Expert Syst. Appl. 156, 113456 (2020).
  • 47. Weng, L., Sun, X., Xia, M., Liu, J. & Xu, Y. Portfolio trading system of digital currencies: A deep reinforcement learning with multidimensional attention gating mechanism. Neurocomputing 402, 171–182 (2020).
  • 48. Wu, M.-E., Syu, J.-H., Lin, J.C.-W. & Ho, J.-M. Portfolio management system in equity market neutral using reinforcement learning. Appl. Intell. 51, 8119–8131 (2021).
  • 49. AbdelKawy, R., Abdelmoez, W. M. & Shoukry, A. A synchronous deep reinforcement learning model for automated multi-stock trading. Progress Artif. Intell. 10, 83–97 (2021).
  • 50. Théate, T. & Ernst, D. An application of deep reinforcement learning to algorithmic trading. Expert Syst. Appl. 173, 114632 (2021).
  • 51. Yue, H., Liu, J., Tian, D. & Zhang, Q. A novel anti-risk method for portfolio trading using deep reinforcement learning. Electronics 11, 1506 (2022).
  • 52. Shi, S. et al. GPM: A graph convolutional network based reinforcement learning framework for portfolio management. Neurocomputing 498, 14–27 (2022).
  • 53. Taghian, M., Asadi, A. & Safabakhsh, R. Learning financial asset-specific trading rules via deep reinforcement learning. Expert Syst. Appl. 195, 116523 (2022).
  • 54. Felizardo, L. K. et al. Outperforming algorithmic trading reinforcement learning systems: A supervised approach to the cryptocurrency market. Expert Syst. Appl. 202, 117259 (2022).
  • 55. Song, Z., Jin, X. & Li, C. Safe-FinRL: A low bias and variance deep reinforcement learning implementation for high-freq stock trading. arXiv preprint arXiv:2206.05910 (2022).
  • 56. Ge, J., Qin, Y., Li, Y., Huang, Y. & Hu, H. In 2022 14th International Conference on Machine Learning and Computing (ICMLC), pp. 34–43.
  • 57. Li, Y., Liu, P. & Wang, Z. Stock trading strategies based on deep reinforcement learning. Sci. Program. 2022 (2022).
  • 58. Jiang, Y., Olmo, J. & Atwi, M. Deep reinforcement learning for portfolio selection. Glob. Financ. J. 62, 101016 (2024).
  • 59. Guiso, L., Sapienza, P. & Zingales, L. Time varying risk aversion. J. Financ. Econ. 128, 403–421 (2018).
  • 60. Kahneman, D. Judgment under Uncertainty: Heuristics and Biases. Cambridge University Press (1982).
  • 61. Pertiwi, T., Yuniningsih, Y. & Anwar, M. The biased factors of investor's behavior in stock exchange trading. Manag. Sci. Lett. 9, 835–842 (2019).
  • 62. Szepesvári, C. Algorithms for Reinforcement Learning. Springer Nature (2022).
  • 63. Barto, A. G. Reinforcement Learning: An Introduction, by Richard Sutton. Science Robotics 6, 423 (2021).
  • 64. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction. MIT Press (2018).
  • 65. Alibabaei, K. et al. Comparison of on-policy deep reinforcement learning A2C with off-policy DQN in irrigation optimization: A case study at a site in Portugal. Irrig. Sci. 11, 104 (2022).
  • 66. Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems 12 (1999).
  • 67. Grondman, I., Busoniu, L., Lopes, G. A. & Babuska, R. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Trans. Syst. Man Cybern. Part C 42, 1291–1307 (2012).
  • 68. Akyildirim, E., Goncu, A. & Sensoy, A. Prediction of cryptocurrency returns using machine learning. Ann. Oper. Res. 297, 3–36 (2021).
  • 69. Sharpe, W. F. The Sharpe ratio. In Streetwise – The Best of the Journal of Portfolio Management 3, 169–185 (1998).
  • 70. Zhang, Z. & Sang, Q. In Proceedings of the 2023 4th International Conference on Machine Learning and Computer Application, pp. 82–91.
  • 71. Gobato Souto, H. TimesNet for realized volatility prediction. SSRN 4660025 (2023).
  • 72. Pele, D. T., Wesselhöfft, N., Härdle, W. K., Kolossiatis, M. & Yatracos, Y. G. A statistical classification of cryptocurrencies. Applied Statistics (2020).
  • 73. Jang, J. & Seong, N. Deep reinforcement learning for stock portfolio optimization by connecting with modern portfolio theory. Expert Syst. Appl. 218, 119556 (2023).
  • 74. Jiang, Z., Xu, D. & Liang, J. A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059 (2017).
  • 75. Yang, H., Liu, X.-Y., Zhong, S. & Walid, A. In Proceedings of the First ACM International Conference on AI in Finance, pp. 1–8.
  • 76. Park, C. H. & Irwin, S. H. What do we know about the profitability of technical analysis? J. Econ. Surv. 21, 786–826 (2007).
  • 77. Droke, C. Moving Averages Simplified. Marketplace Books (2001).
  • 78. Burgess, G. A. Trading and Investing in the Forex Markets Using Chart Techniques. John Wiley & Sons (2010).
  • 79. Bolognesi, E., Torluccio, G. & Zuccheri, A. A comparison between capitalization-weighted and equally weighted indexes in the European equity market. J. Asset Manag. 14, 14–26 (2013).
  • 80. Rachmatullah, M. I. C., Santoso, J. & Surendro, K. A novel approach in determining neural networks architecture to classify data with large number of attributes. IEEE Access 8, 204728–204743 (2020).
  • 81. Markowitz, H. Portfolio selection. J. Finance 7, 77–91 (1952).
  • 82. Merton, R. C. An analytic derivation of the efficient portfolio frontier. J. Financ. Quant. Anal. 7, 1851–1872 (1972).
  • 83. Malladi, R. & Fabozzi, F. Equal-weighted strategy: Why it outperforms value-weighted strategies? Theory and evidence. J. Asset Manag. 18, 188–208 (2017).
  • 84. Alamdari, M. K., Esfahanipour, A. & Dastkhan, H. A portfolio trading system using a novel pixel graph network for stock selection and a mean-CDaR optimization for portfolio rebalancing. Appl. Soft Comput. 152, 111213 (2024).
  • 85. Guo, S., Gu, J.-W. & Ching, W.-K. Adaptive online portfolio selection with transaction costs. Eur. J. Oper. Res. 295, 1074–1086 (2021).
  • 86. Harris, R. D. & Mazibas, M. Portfolio optimization with behavioural preferences and investor memory. Eur. J. Oper. Res. 296, 368–387 (2022).
  • 87. Mba, J. C., Ababio, K. A. & Agyei, S. K. Markowitz mean-variance portfolio selection and optimization under a behavioral spectacle: New empirical evidence. Int. J. Financ. Stud. 10, 28 (2022).
  • 88. Young, M. N. et al. Portfolio optimization considering behavioral stocks with return scenario generation. Journal 10, 4269 (2022).
