Scientific Reports 16:6443 (2026). doi: 10.1038/s41598-026-35902-x

Behaviorally informed deep reinforcement learning for portfolio optimization with loss aversion and overconfidence

Atefe Charkhestani, Akbar Esfahanipour

Abstract

This study develops a behaviorally informed deep reinforcement learning (DRL) framework for algorithmic portfolio optimization. The model integrates two well-established behavioral biases, loss aversion and overconfidence, into an actor–critic architecture. Unlike conventional DRL systems that assume fully rational agents, the proposed framework incorporates investor heterogeneity through regime-dependent bias thresholds that adjust position sizing, while the underlying RL policy determines trading direction. To adaptively switch among three behavioral models (loss-averse, overconfident, and neutral), the framework employs TimesNet to generate one-step-ahead market regime forecasts. All decisions follow a strict walk-forward evaluation protocol that precludes access to future information and ensures realistic out-of-sample performance measurement. The framework is evaluated across two major financial domains: the cryptocurrency market (2018–2024) and the Dow Jones Industrial Average (2008–2024). The integrated BBAPT architecture, which combines TimesNet with behavioral DRL, consistently outperforms benchmark strategies including neutral RL agents, classical Markowitz portfolios, and equally weighted allocations. In cryptocurrency markets, BBAPT achieves the highest risk-adjusted performance, while in equity markets it delivers improved risk–return outcomes even after accounting for time-varying index constituents. Overall, the empirical evidence demonstrates that embedding behavioral finance principles into reinforcement learning enhances robustness, adaptability, and risk-adjusted returns in non-stationary environments. These findings position behaviorally informed DRL as a promising foundation for next-generation algorithmic trading systems.

Keywords: Algorithmic trading, Behavioral bias, Loss aversion, Overconfidence, Deep reinforcement learning, Portfolio optimization, TimesNet

Subject terms: Mathematics and computing, Psychology

Introduction

Automated trading systems have evolved substantially since their earliest rule-based implementations in the mid-twentieth century. With rapid advances in artificial intelligence (AI), contemporary algorithmic trading has transitioned into a highly adaptive, data-driven decision-making paradigm. Modern systems leverage machine learning and deep learning to extract patterns from complex financial time series, forecast market movements, and adjust trading decisions dynamically1. Numerous empirical studies have shown that AI-enhanced trading strategies, particularly those based on deep reinforcement learning (DRL), can outperform traditional quantitative models by improving return forecasting, risk management, and execution efficiency2,3. As a result, DRL-based portfolio management has emerged as a promising approach for handling non-stationary and volatile markets.

Reinforcement learning provides a natural framework for modeling sequential decision-making problems such as portfolio rebalancing, where actions influence future rewards and subsequent market states. DRL agents integrate neural networks with reinforcement learning principles to learn optimal trading policies directly from historical interactions with the environment. These models have demonstrated the ability to process large-scale financial datasets, adapt to shifting market regimes, and optimize risk–return trade-offs4,5. A typical trading agent evaluates features such as price dynamics, portfolio composition, and transaction costs, and generates allocation vectors designed to maximize long-term cumulative reward.

Despite substantial progress in algorithmic trading, real-world investment behavior is strongly affected by psychological biases that classical DRL models do not explicitly account for. Behavioral finance shows that investors systematically deviate from rational utility-maximizing behavior6. Among these deviations, loss aversion (the tendency to react more strongly to losses than to gains of the same magnitude) and overconfidence (the tendency to overestimate one's forecasting ability) are particularly influential in shaping portfolio decisions7,8. These biases may be detrimental in some environments but beneficial in others. For instance, overconfidence can accelerate gains during persistent upward trends, whereas loss aversion may reduce downside exposure during market declines. Importantly, investors differ in their sensitivity to such biases, motivating the need for mechanisms that can represent heterogeneous behavioral responses.

In this work, behavioral biases are not intended to mimic human decision-making exactly. Instead, they function as algorithmic heuristics that operate within a DRL agent’s position-sizing layer. The RL policy determines the direction of trades, while behavioral activation models the magnitude of allocation adjustments. This architecture enables systematic comparison between bias-neutral and bias-driven agents and facilitates a controlled examination of whether behavioral heuristics can improve portfolio outcomes. Our focus is on daily-to-medium-frequency portfolio allocation; ultra–high-frequency execution systems, where behavioral dynamics are negligible, fall outside the scope of this study.

Although behavioral principles such as prospect theory have been incorporated into portfolio optimization9, much less attention has been given to embedding behavioral mechanisms directly inside DRL architectures. Standard DRL agents react solely to market or portfolio states and therefore overlook opportunities that arise from persistent behavioral patterns in financial markets. Because real markets reflect a combination of rational dynamics and behavioral deviations, integrating behavioral drivers into DRL frameworks may enable richer decision policies than those attainable through conventional approaches.

To enable dynamic transitions between behavioral modes, we employ TimesNet11, a convolution-based architecture designed for long-horizon time-series forecasting. TimesNet provides one-step-ahead return forecasts used to activate the loss-averse, overconfident, or neutral agent within our Behavioral Bias–Based Algorithmic Portfolio Trading (BBAPT) framework. Importantly, TimesNet is trained independently from the DRL agents and operates under a strict walk-forward protocol to prevent any leakage of future data, thereby eliminating look-ahead bias. The modular design also allows practitioners to replace TimesNet with alternative forecasting models without modifying the DRL components.

The key contributions of this study are as follows:

  1. We introduce a DRL-based portfolio optimization framework that integrates loss aversion and overconfidence through behavioral thresholds that modulate position sizing. The DRL agent learns the baseline trading policy, while behavioral mechanisms adjust allocations when their activation conditions are satisfied.

  2. We model investor heterogeneity by introducing regime-sensitive behavioral activation thresholds. These thresholds, optimized via out-of-sample evaluation, capture variations in behavioral intensity across different market environments.

  3. In contrast to standard DRL trading systems, where portfolio adjustments depend solely on market or portfolio states, the proposed framework allows behavioral activation to influence rebalancing decisions, thereby expanding the effective action space.

  4. We conduct extensive empirical evaluations across both cryptocurrency markets and mature equity markets using the time-varying DJIA. The framework is compared against multiple benchmarks, including A2C, A3C, DDPG, Markowitz portfolios, and equally weighted allocations.

  5. We demonstrate that incorporating TimesNet as a forecasting-guided behavioral selector leads to improved portfolio performance by activating behaviorally informed DRL agents in a regime-aware manner.

The remainder of this paper is organized as follows. Section 2 reviews related work on behavioral portfolio theory and reinforcement learning in trading. Section 3 presents the proposed BBAPT architecture. Section 4 describes the datasets, training methodology, and evaluation framework. Section 5 reports the empirical results across both asset classes. Section 6 concludes the study and discusses potential directions for future research.

Related works

This section reviews two relevant streams of research: (i) behavioral portfolio selection, which integrates investor psychology into portfolio construction, and (ii) reinforcement learning–based trading systems, which aim to learn optimal trading policies directly from market interactions. Together, these domains form the conceptual foundation of this study and highlight a critical gap that motivates the development of our behavioral reinforcement-learning framework.

Behavioral portfolio selection

Shefrin and Statman16 introduced Behavioral Portfolio Theory (BPT), the first formal framework to embed investor psychology into portfolio optimization. Their two-layered mental account model launched a research direction exploring how behavioral biases, such as loss aversion, overconfidence, anchoring, and representativeness, influence portfolio construction. Hirshleifer17 further established that cognitive biases systematically affect investor sentiment and asset pricing, leading to deviations from rational market behavior and generating persistent anomalies relevant for portfolio allocation.

Subsequent research extended behavioral modeling in several directions. Chang et al.18 integrated mental accounting into a multi-stage portfolio optimization process, while Momen et al.19 proposed the Collective Mental Accounting (CMA) framework to mathematically unify mental accounts under realistic constraints such as position limits and cardinality. Building on this, Momen, Esfahanipour, and Seifi20 developed a behavioral portfolio model that incorporates dynamic risk preferences and forward-looking expectations derived from the Black–Litterman model. These works emphasize that incorporating investor psychology into portfolio models can yield allocations that better reflect real-world investment behavior.

Agent-based simulation approaches have also been used to study behavioral effects. Bertella et al.21 simulated markets populated by overconfident, risk-averse, and neutral agents to evaluate their impact on liquidity and volatility. Avellone et al.22 and Barro et al.23 integrated loss aversion and prospect-theoretic preferences into optimization models and showed that behavioral preferences can significantly alter optimal portfolio weights relative to classical models.

Despite substantial progress, most behavioral portfolio models rely on static or utility-based formulations and do not integrate behavioral responses directly into an adaptive learning system. This limitation restricts their ability to operate in dynamic markets where behavioral activation may depend on real-time portfolio conditions.

Modern portfolio theory and its evolution

Modern Portfolio Theory (MPT), developed by Markowitz24, formalized the mean–variance trade-off and established variance as the primary measure of risk. Although MPT remains foundational, its static assumptions limit its applicability in highly non-stationary markets. The increasing availability of high-frequency and multivariate financial data has enabled machine learning and reinforcement learning methods to emerge as adaptive alternatives capable of responding to evolving market conditions.

Recent studies have expanded portfolio optimization frameworks using deep learning, nonlinear estimators, and multivariate time-series models25–27. However, these approaches generally remain behaviorally neutral and do not account for investor heterogeneity or psychological activation mechanisms.

Reinforcement learning–based trading models

Reinforcement learning (RL) has become increasingly prominent in financial decision-making due to its ability to learn sequential policies from interactions with the market. The effectiveness of RL depends heavily on the design of state variables, action spaces, reward functions, and environment dynamics.

Early work in RL-based trading adopted discrete action spaces for single-asset trading, where actions represent buy, hold, or sell decisions. More recent advances use continuous action spaces to directly output portfolio weights. Soleymani and Paquet28 proposed DeepBreath, combining autoencoders and CNNs within a SARSA agent. Weng et al.29 integrated XGBoost-based feature selection with 3D attention-gating networks. Other studies have used DBNs and LSTMs for dimension reduction and A2C/A3C architectures for policy learning30,31.

Portfolio-level RL has also gained traction. Wu et al.32 developed an A2C model combining CNN and RNN layers; Betancourt and Chen33 introduced a PPO-based system capable of handling dynamic asset sets; Yue et al.34 used sparse denoising autoencoders for state representation; and Taghian et al.35 extracted features from candlestick images within a SARSA framework. Additional work has explored graph-based RL39, hyper-heuristic trading agents44, high-frequency RL models36, and DRL architectures for multi-timeframe learning43. Studies such as40–42 further benchmark RL models across diverse markets and algorithmic settings.

A comprehensive summary of RL-based trading systems is presented in Table 1, highlighting differences in state formulation, action structure, reward definition, and market application.

Table 1.

Representative studies on RL-based trading systems and their key characteristics.

Reference RL Model State Variables Action Space Reward Case Study / Key Contribution
46 SARSA Technical indicators, previous weights Asset weights Return NYSE stocks; drift detection with online batching
47 DPG Price changes, previous weights Asset weights Return Poloniex assets; XGBoost + 3D attention gating
48 A2C OHLC Asset weights Sharpe ratio TW50 portfolio; CNN + RNN design
49 A2C/A3C OHLCV, indicators Buy/Hold/Sell Sharpe ratio Global stocks; DBN–LSTM compression
50 DQN Trading state, holdings, cash Buy/Hold/Sell Return US/EU/Asia stocks; Trading-DQN formulation
33 PPO OHLCV, portfolio value Asset weights Sharpe ratio Binance portfolios; dynamic asset universe
51 A2C Balance, holdings, indicators Shares to trade Return DJIA single-asset RL; sparse autoencoder state
52 DDPG (R-GCN) Close price, weights Asset weights Return NYSE/NASDAQ; graph convolution RL
53 SARSA/DQN Candlestick images Buy/Hold/Sell Return AAPL, BTC, GOOGLE; deep visual features
54 DQN/A2C Price changes Buy/Sell Sharpe ratio Crypto assets; ResNet actor
55 SAC/Trace-SAC Portfolio signals Continuous long/short score Log return BTC-USDT futures; confidence-based actions
56 Multiple (A2C, PPO, DDPG, TD3, SAC) Price data Buy/Hold/Sell Return Shanghai Composite; comparative benchmark
57 DQN OHLC + indicators (CNN-LSTM) Buy/Hold/Sell Sharpe + Profit Chinese market + S&P500; multimodal fusion
40 Double DQN OHLC Buy/Hold/Sell Return + Sharpe BTC-USDT; Bayesian hyperparameter tuning
41 A2C, A3C, DDPG OHLC + indicators Asset weights Sharpe ratio US stocks; RL integrated with MPT
58 TD3 Prices, balance, shares Buy/Sell/Hold Profit net of costs DJIA + S&P100; delayed-update critic
43 TD3 Multi-timeframe OHLC Buy/Sell/Hold Return BTC + AMZN; multi-timeframe RL
44 Hyper-heuristic RL Indicators + returns Select a strategy Risk-adjusted return Global indices; RL selects strategy instead of trades
Proposed Model A2C, A3C, DDPG + TimesNet Indicators + unrealized returns Asset weights + behavioral adjustments Sharpe ratio Crypto + DJIA; first DRL model integrating explicit behavioral biases inside the policy

Despite the progress in RL-driven trading, existing algorithms remain largely behaviorally neutral. They do not incorporate behavioral activation mechanisms such as loss-aversion or overconfidence thresholds into the policy itself. Moreover, RL-based systems rarely integrate external forecasting models to guide the selection of behavioral policy modes.

This gap motivates the present study, which develops a deep reinforcement learning framework that embeds two well-established behavioral biases (loss aversion and overconfidence) into the portfolio allocation mechanism and employs TimesNet to enable regime-aware behavioral switching under a strict walk-forward protocol.

The proposed BBAPT model

The proposed Behavioral Bias-Based Algorithmic Portfolio Trading (BBAPT) framework consists of three main components. First, we describe how investor behavioral biases are modeled and translated into systematic portfolio weight adjustments. Second, we present the reinforcement learning architecture that integrates these behavioral mechanisms into the portfolio optimization process. Finally, we introduce the TimesNet-based forecasting module and explain how its predictions are combined with the behavioral agents to form the complete BBAPT model.

Behavioral biases modeling

Behavioral finance provides extensive evidence that investment decisions are shaped not only by objective probabilities and payoffs, but also by cognitive and emotional factors such as loss aversion and overconfidence. In the BBAPT framework, these biases are modeled as systematic adjustments to the position size suggested by a bias-neutral deep reinforcement learning (DRL) agent, while the direction of trades remains entirely determined by the underlying RL policy. This design preserves interpretability and ensures that behavioral effects influence only capital allocation, not trading signals.

Let $w_i^{\text{base}}$ denote the baseline (rational) portfolio weight of asset $i$, generated by the neutral DRL agent, such that

$$\sum_{i=1}^{N} w_i^{\text{base}} = 1, \qquad w_i^{\text{base}} \ge 0 \tag{1}$$

Let $u_i$ denote the unrealized return of asset $i$ since the position was initiated. Behavioral mechanisms transform $w_i^{\text{base}}$ into biased weights $w_i^{LA}$ (loss-averse investor) or $w_i^{OC}$ (overconfident investor), as described below.

Loss aversion bias

Kahneman and Tversky9 established that losses have a disproportionate psychological impact relative to gains of the same magnitude. In portfolio settings, this asymmetry manifests through two characteristic behavioural patterns:

  1. Cross-sectional disposition adjustment: reallocating wealth away from recent winners and toward recent losers.

  2. Global risk reduction: decreasing the total exposure to risky assets when negative performance accumulates.

To capture these two features, the loss-aversion mechanism consists of three steps: (i) a behaviourally motivated multiplier, (ii) a cross-sectional adjustment of intermediate weights, and (iii) a global scaling factor that reduces overall risky exposure.

Step 1: Behavioural multiplier

For each asset $i$, let $\theta^{LA}_{-} < 0$ and $\theta^{LA}_{+} > 0$ denote the thresholds that determine whether the asset exhibits a significant loss or a strong gain. The behavioural multiplier is defined as

$$m_i^{LA} = \begin{cases} 1 + \alpha^{LA}, & u_i \le \theta^{LA}_{-} \\ 1 - \beta^{LA}, & u_i \ge \theta^{LA}_{+} \\ 1, & \text{otherwise} \end{cases} \tag{2}$$

with the following interpretation:

  • $\alpha^{LA}$ controls how strongly exposure is increased for losing assets (capturing the disposition effect);

  • $\beta^{LA}$ controls how aggressively exposure is reduced for winning assets;

  • assets with unrealized returns between the thresholds use the baseline multiplier $m_i^{LA} = 1$.

Step 2: cross-sectional adjustment

The intermediate loss-averse weight is given by

$$\tilde{w}_i^{LA} = \frac{m_i^{LA}\, w_i^{\text{base}}}{\sum_{j} m_j^{LA}\, w_j^{\text{base}}} \tag{3}$$

This step reshapes the cross-sectional distribution of weights without changing the total risky allocation.

Step 3: Global risk scaling

Loss-averse investors tend to reduce total exposure to risky assets. Let $\gamma^{LA} \in (0, 1]$ denote the global risk-scaling parameter. The final traded weight is

$$w_i^{LA} = \gamma^{LA}\, \tilde{w}_i^{LA} \tag{4}$$

Here,

  • the ratio $m_i^{LA} w_i^{\text{base}} / \sum_j m_j^{LA} w_j^{\text{base}}$ redistributes exposure across assets according to their behavioural multipliers;

  • the factor $\gamma^{LA}$ uniformly scales down total risky investment.

The remaining fraction $1 - \gamma^{LA}$ is interpreted as a cash allocation, reflecting the reduced risk appetite implied by loss-averse behaviour.

Interpretation

This formulation models both key behavioural dimensions: (i) the tendency to overweight losing assets and underweight strong winners (via $m_i^{LA}$), and (ii) the reduction of total risky exposure (via $\gamma^{LA}$). Crucially, this mechanism alters only position sizes; trading directions remain governed by the RL policy.
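For illustration, the three-step mapping can be sketched in a few lines of Python. This is a minimal sketch of Eqs. (2)–(4), not the implementation used in the experiments; the threshold values in the usage example are hypothetical, while $\alpha$, $\beta$, and $\gamma$ are taken from the low-width regime reported later in Table 6.

```python
import numpy as np

def loss_averse_weights(w_base, u, theta_minus, theta_plus, alpha, beta, gamma):
    """Map baseline DRL weights to loss-averse weights (Eqs. 2-4).

    w_base : baseline weights from the neutral agent (sum to 1)
    u      : unrealized returns per asset
    theta_minus < 0, theta_plus > 0 : activation thresholds
    alpha, beta : reallocation strengths; gamma in (0, 1] : risk budget
    """
    # Step 1: behavioural multiplier (Eq. 2)
    m = np.ones_like(w_base)
    m[u <= theta_minus] = 1.0 + alpha   # overweight recent losers
    m[u >= theta_plus] = 1.0 - beta     # trim recent winners
    # Step 2: cross-sectional reallocation (Eq. 3)
    w_tilde = m * w_base / np.sum(m * w_base)
    # Step 3: global risk scaling (Eq. 4); 1 - gamma is held in cash
    return gamma * w_tilde

# Example: two losers, one winner, one neutral position
w = loss_averse_weights(
    w_base=np.array([0.25, 0.25, 0.25, 0.25]),
    u=np.array([-0.08, -0.05, 0.06, 0.01]),
    theta_minus=-0.04, theta_plus=0.05, alpha=0.22, beta=0.18, gamma=0.78)
print(w, 1 - w.sum())  # risky weights and the implied cash fraction
```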

Overconfidence bias

Overconfidence leads investors to overestimate the precision of their forecasts and to take excessively large or aggressive positions. Empirical studies document three characteristic behavioural patterns associated with overconfidence:

  • lower effective trading thresholds, resulting in increased trading frequency;

  • aggressive scaling of profitable positions (trend amplification);

  • “doubling down” on losing positions due to excessive belief in mean reversion.

In the BBAPT framework, these behaviours are modeled through behavioural multipliers that adjust the position size suggested by the neutral reinforcement-learning agent.

Step 1: Behavioural multiplier

For each asset $i$, let $u_i$ denote the unrealized return since the position was initiated. Two behavioural thresholds determine when overconfidence becomes active:

  • $\theta^{OC}_{+} > 0$: a gain threshold above which the investor increases exposure;

  • $\theta^{OC}_{-} < 0$: a loss threshold below which the investor increases allocation in a “doubling-down” fashion.

The behavioural multiplier is defined as

$$m_i^{OC} = \begin{cases} 1 + \alpha^{OC}, & u_i \ge \theta^{OC}_{+} \\ 1 + \beta^{OC}, & u_i \le \theta^{OC}_{-} \\ 1, & \text{otherwise} \end{cases} \tag{5}$$

where:

  • $\alpha^{OC}$ measures the intensity of scaling up winning positions;

  • $\beta^{OC}$ controls the strength of the doubling-down behaviour;

  • $m_i^{OC} = 1$ indicates no behavioural adjustment.

Thus, the multiplier amplifies perceived opportunities in both trending ($u_i \ge \theta^{OC}_{+}$) and mean-reverting ($u_i \le \theta^{OC}_{-}$) conditions.

Step 2: cross-sectional adjustment

Let $w_i^{\text{base}}$ denote the baseline weight generated by the neutral RL policy. Applying the behavioural multiplier yields the intermediate weight:

$$\tilde{w}_i^{OC} = \frac{m_i^{OC}\, w_i^{\text{base}}}{\sum_j m_j^{OC}\, w_j^{\text{base}}} \tag{6}$$

This step modifies the cross-sectional allocation across assets without affecting total portfolio exposure.

Step 3: Global risk amplification

Overconfident investors typically increase overall portfolio risk. To model this, a global leverage parameter $\gamma^{OC} > 1$ is introduced. The final overconfidence-adjusted weight is

$$w_i^{OC} = \gamma^{OC}\, \tilde{w}_i^{OC} \tag{7}$$

Here:

  • the ratio $m_i^{OC} w_i^{\text{base}} / \sum_j m_j^{OC} w_j^{\text{base}}$ normalizes the cross-sectional allocation;

  • $\gamma^{OC}$ uniformly amplifies total risky exposure, capturing overconfident risk-taking.

The result is a portfolio with both micro-level position amplification and macro-level leverage expansion.

Definition of all parameters

  • $u_i$: unrealized return of asset $i$ since position entry.

  • $\theta^{OC}_{+}$: gain threshold activating trend-following amplification.

  • $\theta^{OC}_{-}$: loss threshold activating doubling-down behaviour.

  • $\alpha^{OC}$: strength of scaling up profitable positions.

  • $\beta^{OC}$: strength of doubling down on losing positions.

  • $w_i^{\text{base}}$: baseline RL-generated weight prior to behavioural adjustment.

  • $\tilde{w}_i^{OC}$: intermediate behaviour-adjusted weight.

  • $\gamma^{OC}$: global risk-scaling parameter representing increased leverage.

Interpretation

The overconfidence mechanism expands both cross-sectional allocations toward perceived opportunities and the overall portfolio risk level. This behaviour closely aligns with empirical evidence showing that overconfident investors trade more aggressively, maintain larger positions, and exhibit overly strong reactions to perceived signals. Crucially, this mechanism modifies only position sizes, while trade direction remains governed solely by the reinforcement-learning policy.
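A matching sketch for the overconfidence mapping follows; again this is illustrative rather than the authors' code. It mirrors the loss-averse function but adds exposure on both sides of the thresholds and applies a leverage factor $\gamma^{OC} > 1$.

```python
import numpy as np

def overconfident_weights(w_base, u, theta_plus, theta_minus, alpha, beta, gamma):
    """Map baseline DRL weights to overconfident weights (Eqs. 5-7).

    gamma > 1 acts as a leverage factor on total risky exposure.
    """
    # Step 1: behavioural multiplier (Eq. 5)
    m = np.ones_like(w_base)
    m[u >= theta_plus] = 1.0 + alpha    # scale up winners (trend amplification)
    m[u <= theta_minus] = 1.0 + beta    # double down on losers
    # Step 2: cross-sectional reallocation (Eq. 6)
    w_tilde = m * w_base / np.sum(m * w_base)
    # Step 3: global risk amplification (Eq. 7)
    return gamma * w_tilde
```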

Reinforcement learning modeling

Reinforcement learning (RL) provides a principled framework for sequential decision-making problems, in which an agent interacts with an external environment and learns to maximize long-term cumulative rewards62. At each time step $t$, the agent observes the environment state $s_t$, selects an action $a_t$ based on a policy $\pi(a_t \mid s_t)$, receives a reward $r_t$, and transitions to the next state $s_{t+1}$, forming a Markov decision process (MDP). Figure 1 illustrates the RL cycle underlying the portfolio rebalancing process.

Figure 1. General reinforcement learning framework61.

The RL system consists of two components: (1) an agent containing the learning algorithm and policy, and (2) an environment providing state transitions and rewards in response to the agent’s actions.

Selection of RL algorithms

Algorithm selection in RL depends fundamentally on whether the state and action spaces are discrete or continuous. Table 2 summarizes standard RL algorithms and their compatibility with different problem types64.

Table 2.

Types of reinforcement-learning agents64.

State space Action space Agent family
Discrete Discrete (Q-learning, SARSA) → DQN → PPO → TRPO
Continuous Discrete DQN → PPO → TRPO
Continuous Continuous (DDPG, A2C, A3C) → TD3, PPO, SAC → TRPO

* Algorithms in parentheses have almost the same level of complexity and speed; complexity and speed increase from left to right.

Portfolio optimization requires selecting continuous-valued asset weights. Therefore, RL algorithms capable of handling continuous action spaces are appropriate. In this study, three actor–critic–based methods are adopted:

  • A2C (Advantage Actor–Critic): synchronous updates yield stable optimization and efficient GPU utilization;

  • A3C (Asynchronous Advantage Actor–Critic): parallel, asynchronous learners enhance exploration and robustness;

  • DDPG (Deep Deterministic Policy Gradient): suited for continuous control tasks where deterministic policies accelerate learning.

These methods balance computational cost and stability while effectively handling the continuous control nature of portfolio weight selection65. More computationally intensive algorithms (e.g., SAC or TRPO) were not employed to maintain tractability without sacrificing performance. Detailed descriptions of these algorithms, along with the training procedures for the selected agents, are provided in the supplementary material file.

Actor and critic network architecture

The BBAPT framework employs an actor–critic structure in which the actor generates baseline portfolio weights, and the behavioral layer subsequently adjusts these weights according to the active bias.

Actor network.

Given state $s_t$, the actor network outputs the parameters of a multivariate Gaussian policy:

$$\pi_\phi(a_t \mid s_t) = \mathcal{N}\!\big(\mu_\phi(s_t),\, \operatorname{diag}(\sigma_\phi^2(s_t))\big) \tag{8}$$

where $\mu_\phi(s_t)$ and $\sigma_\phi(s_t)$ are neural-network outputs. A softplus activation ensures $\sigma_\phi(s_t) > 0$.

The raw action vector is mapped to a baseline weight vector using a softmax normalization:

$$w_{i,t}^{\text{base}} = \frac{\exp(a_{i,t})}{\sum_{j=1}^{N} \exp(a_{j,t})} \tag{9}$$

These baseline weights constitute the rational DRL allocation that is later modified by the behavioral adjustments described in Section 3.1.

The policy network is updated via the policy-gradient objective:

$$\nabla_\phi J(\phi) = \mathbb{E}\big[\nabla_\phi \log \pi_\phi(a_t \mid s_t)\, \hat{A}_t\big] \tag{10}$$

where $\hat{A}_t$ is the advantage estimate computed using generalized advantage estimation.

Critic network.

The critic approximates the state-value function:

$$V_\psi(s_t) \approx \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\Big|\, s_t\Big] \tag{11}$$

with discount factor $\gamma \in (0, 1)$. Training proceeds via temporal-difference learning:

$$\psi \leftarrow \psi + \eta_c \big(r_t + \gamma V_\psi(s_{t+1}) - V_\psi(s_t)\big)\, \nabla_\psi V_\psi(s_t) \tag{12}$$

where $\eta_c$ is the critic learning rate.

Together, the actor and critic networks iteratively refine the trading policy and produce smooth and stable learning dynamics suitable for financial applications.
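For concreteness, the sketch below shows one way to realize this actor–critic pair in PyTorch, using the hidden-layer sizes reported later in Table 5 and the activation choices described in Section 4.3. The class and method names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Actor producing a diagonal Gaussian policy over raw actions (Eq. 8);
    sampled actions are mapped to portfolio weights by softmax (Eq. 9)."""
    def __init__(self, state_dim, n_assets, hidden=(320, 160, 80)):
        super().__init__()
        layers, d = [], state_dim
        for h in hidden:
            layers += [nn.Linear(d, h), nn.Tanh()]
            d = h
        self.body = nn.Sequential(*layers)
        self.mu_head = nn.Linear(d, n_assets)      # mean branch
        self.sigma_head = nn.Linear(d, n_assets)   # pre-activation std branch

    def forward(self, s):
        z = self.body(s)
        mu = self.mu_head(z)
        sigma = nn.functional.softplus(self.sigma_head(z))  # sigma > 0
        return mu, sigma

    def baseline_weights(self, s):
        mu, sigma = self.forward(s)
        a = torch.distributions.Normal(mu, sigma).rsample()
        return torch.softmax(a, dim=-1)            # weights sum to 1

class Critic(nn.Module):
    """State-value approximator V(s) trained by TD learning (Eqs. 11-12)."""
    def __init__(self, state_dim, hidden=(64, 16, 4)):
        super().__init__()
        layers, d = [], state_dim
        for h in hidden:
            layers += [nn.Linear(d, h), nn.Tanh()]
            d = h
        layers += [nn.Linear(d, 1), nn.ReLU()]     # non-negative value estimate
        self.net = nn.Sequential(*layers)

    def forward(self, s):
        return self.net(s)
```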

Proposed BBAPT framework overview

Figure 2 presents the complete BBAPT architecture, which combines a forecast-driven behavioral mode selector with an actor–critic DRL agent and a behavioral adjustment layer.

  • A TimesNet-based module provides one-step-ahead market forecasts.

  • A mode selector activates one of three behavioral profiles (neutral, loss-averse, or overconfident).

  • An actor–critic agent computes baseline weights $w_t^{\text{base}}$.

  • A behavioral layer maps baseline weights to the final traded weights according to the selected bias.

Figure 2. Structure of the BBAPT model, combining TimesNet forecasts with behavioral reinforcement-learning agents.

Environment

The environment simulates financial market dynamics and integrates the DRL policy with the behavioral module. Given state $s_t$ and baseline weights $w_t^{\text{base}}$, the environment:

  • updates technical indicators from the latest OHLCV data.

  • updates unrealized returns $u_{i,t}$ for all assets based on the active portfolio.

  • applies the behavioral mapping (neutral, loss aversion, or overconfidence) to obtain final traded weights $w_t$.

  • computes the portfolio return and risk at time $t$.

  • produces the next state $s_{t+1}$ and reward $r_t$.

This structure ensures that all behavioural effects influence position sizing only, while the underlying RL policy determines the direction of trades.
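A minimal sketch of this loop follows, assuming daily close prices and a pluggable behavioral mapping; indicator computation and reward scaling are omitted for brevity, and all names (`PortfolioEnv`, `behavior_fn`, `cost_bps`) are illustrative rather than taken from the paper.

```python
import numpy as np

class PortfolioEnv:
    """Minimal sketch of the environment loop described above.
    `prices` is a (T, N) array of daily closes; `behavior_fn` maps
    (baseline weights, unrealized returns) to final traded weights and
    stands in for the neutral / loss-averse / overconfident mappings."""

    def __init__(self, prices, behavior_fn, cost_bps=10):
        self.prices = prices
        self.behavior_fn = behavior_fn
        self.cost = cost_bps / 1e4           # cost per unit of turnover
        self.entry = prices[0].copy()        # entry prices, fixed for simplicity
        self.w_prev = np.ones(prices.shape[1]) / prices.shape[1]
        self.t = 1

    def step(self, w_base):
        u = self.prices[self.t - 1] / self.entry - 1.0   # unrealized returns
        w = self.behavior_fn(w_base, u)                  # behavioural sizing
        asset_ret = self.prices[self.t] / self.prices[self.t - 1] - 1.0
        turnover = np.abs(w - self.w_prev).sum()
        reward = float(w @ asset_ret) - self.cost * turnover
        self.w_prev = w
        self.t += 1
        done = self.t >= len(self.prices)
        return u, reward, done
```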

State definition

The state representation is designed to capture both recent market dynamics and the performance of currently held positions. Accordingly, the full state vector consists of two feature groups:

  1. Technical indicators for all assets, summarizing short- and medium-term price and volume behavior;

  2. Unrealized returns for each asset since the most recent entry, providing the behavioral module with information necessary for loss-aversion and overconfidence adjustments.

Formally, the state at time t is defined as

$$s_t = \big[x_{1,t}^{(1)}, \dots, x_{N,t}^{(M)},\; u_{1,t}, \dots, u_{N,t}\big] \tag{13}$$

where $x_{n,t}^{(m)}$ denotes the $m$-th technical indicator for asset $n$, and $u_{n,t}$ is the unrealized return of that asset at time $t$. Including unrealized returns is essential because the behavioral mappings for loss aversion and overconfidence depend directly on $u_{n,t}$.

Action definition

In the proposed framework, the reinforcement-learning agent is responsible only for generating baseline portfolio weights. Behavioral effects are applied afterward via an external adjustment layer. At each time t, the actor network produces a baseline allocation vector

$$w_t^{\text{base}} = \big(w_{1,t}^{\text{base}}, \dots, w_{N,t}^{\text{base}}\big) \tag{14}$$

The active behavioral model (neutral, loss-averse, or overconfident) is determined externally (e.g., by the forecasting module in Section 3.3). Once the mode is selected, the final traded weights are computed as

$$w_t = \begin{cases} w_t^{LA}, & \text{loss-averse mode} \\ w_t^{OC}, & \text{overconfident mode} \\ w_t^{\text{base}}, & \text{neutral mode} \end{cases} \tag{15}$$

where $w_t^{LA}$ and $w_t^{OC}$ are obtained from $w_t^{\text{base}}$ and $u_t$ using the formulations in Section 3.1. In this way, the RL agent learns a robust baseline allocation strategy, while behavioral characteristics influence only the magnitude of final position sizes.

Reward function

The overall quality of a portfolio allocation is evaluated using the Sharpe ratio69, defined for a given weight vector w as

$$SR(w) = \frac{w^\top \mu}{\sqrt{w^\top \Sigma\, w}} \tag{16}$$

where $\mu$ is the vector of expected returns and $\Sigma$ is the historical variance–covariance matrix of asset returns. The Sharpe ratio reflects the trade-off between return and volatility; maximizing it encourages the discovery of risk-efficient trading strategies.

At each step, the immediate reward is the realized portfolio return,

$$r_t^{p} = \sum_{i=1}^{N} w_{i,t}\, r_{i,t} \tag{17}$$

where $r_{i,t}$ is the simple return of asset $i$ over the interval $(t-1, t]$. In practice, the episode-level Sharpe ratio is used to normalize and scale these step-wise rewards. This reward shaping aligns the training objective with risk-adjusted performance and improves learning stability in volatile financial environments.
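The exact scaling scheme is not spelled out in closed form; the sketch below assumes the simplest variant, multiplying each step return by the episode-level Sharpe ratio, and should be read as one plausible realization of the description above.

```python
import numpy as np

def step_rewards(weight_history, return_history):
    """Step-wise portfolio returns (Eq. 17) scaled by the episode Sharpe
    ratio; weight_history and return_history are (T, N) arrays."""
    port = np.sum(weight_history * return_history, axis=1)  # r_t^p per step
    sharpe = port.mean() / (port.std() + 1e-8)              # episode-level SR
    return port * sharpe                                    # Sharpe-scaled rewards
```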

TimesNet-based market forecasting model

TimesNet is a recent deep learning architecture for general-purpose time-series forecasting based on temporal 2D-variation modeling. Traditional one-dimensional forecasting models often struggle to capture multi-scale temporal structure. TimesNet addresses this limitation by transforming one-dimensional sequences into structured two-dimensional tensors through dominant-frequency extraction, enabling the use of parameter-efficient 2D convolutional kernels to learn both intra-period and inter-period dependencies.

Within the BBAPT framework, TimesNet functions as an external forecasting module that produces one-step-ahead market return predictions. These forecasts determine the behavioral model neutral, loss-averse, or overconfident to be applied in the subsequent trading step. The forecasting module operates independently of the reinforcement-learning agent, ensuring strict chronological separation and preventing access to future labels during policy learning.

TimesNet has demonstrated competitive performance in several financial prediction tasks, including return forecasting and volatility modeling70,71. Although its primary objective is not portfolio construction, time-aware predictive models have been shown to enhance stability and improve risk-adjusted performance when used as auxiliary signals15. Motivated by these findings, TimesNet is incorporated into the BBAPT architecture as a data-driven indicator of short-term market conditions.

TimesNet architecture overview

The core computational unit of TimesNet is the TimesBlock, which identifies and exploits multiple periodicities through Fast Fourier Transform (FFT). For an input sequence, the model:

  • extracts the top-K dominant frequencies using FFT amplitude analysis;

  • reshapes the sequence into a set of two-dimensional tensors, each corresponding to one identified period;

  • processes these tensors through inception-style convolutional blocks to capture localized temporal variation;

  • fuses the outputs using softmax weights proportional to FFT-derived amplitudes.

This structure enables TimesNet to learn regime transitions, momentum cycles, volatility clustering, and other dynamics characterizing cryptocurrency and equity markets.
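The period-detection stage can be reproduced in a few lines of NumPy. The sketch below is a simplified stand-in for the TimesBlock front end: it extracts top-k periods from FFT amplitudes but uses proportional amplitude weights rather than the learned softmax fusion of the full architecture.

```python
import numpy as np

def dominant_periods(x, k=3):
    """Top-k dominant periods of a 1-D series via FFT amplitude analysis,
    the first stage of a TimesBlock."""
    amp = np.abs(np.fft.rfft(x))
    amp[0] = 0.0                          # ignore the zero-frequency (mean) term
    freqs = np.argsort(amp)[-k:][::-1]    # indices of the k largest amplitudes
    periods = [len(x) // f for f in freqs]
    weights = amp[freqs] / amp[freqs].sum()   # simplified fusion weights
    return periods, weights

# A series with a strong 7-day cycle should report a period near 7
t = np.arange(364)
series = np.sin(2 * np.pi * t / 7) + 0.1 * np.random.randn(364)
print(dominant_periods(series, k=2))
```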

Figure 3 illustrates the architectural components, while Algorithm 1 summarizes the full computational workflow.

Figure 3. TimesNet architecture: FFT-based period detection, 2D tensor transformation, TimesBlocks, and final regression head.

Algorithm 1. TimesNet for time-series forecasting.

Activation of behavioral DRL agents using TimesNet

Let $\hat{r}_{t+1}$ denote the next-day return forecast generated by TimesNet. To integrate these forecasts into the behavioral decision-making process, the BBAPT framework adopts a three-region activation rule that maps predicted market conditions to one of the behavioral modes. A symmetric threshold parameter $r > 0$ defines the boundaries between bullish, range-bound, and bearish regimes:

$$\text{mode}_t = \begin{cases} \text{overconfident}, & \hat{r}_{t+1} > r \\ \text{neutral}, & -r \le \hat{r}_{t+1} \le r \\ \text{loss-averse}, & \hat{r}_{t+1} < -r \end{cases} \tag{18}$$

The threshold $r$ serves as a sensitivity parameter that controls how readily the system transitions between behavioral modes. In practice, $r$ is treated as a tunable hyperparameter governing the granularity of regime separation; a moderate value provides a balanced trade-off between responsive regime detection and stability in behavioral activation. The resulting behavioral agent selection rule is summarized in Table 3.

Table 3.

Behavioral agent selection based on TimesNet return forecasts.

Forecast $\hat{r}_{t+1}$ Market regime Activated agent
$\hat{r}_{t+1} > r$ Bullish Overconfidence
$-r \le \hat{r}_{t+1} \le r$ Range-bound Neutral
$\hat{r}_{t+1} < -r$ Bearish Loss-averse
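In code, the activation rule reduces to a three-way comparison, as in the sketch below. The threshold value of 0.5% is a placeholder, since $r$ is tuned as a hyperparameter and the paper's chosen value is not reproduced here.

```python
def select_agent(r_hat, r=0.005):
    """Three-region activation rule (Eq. 18, Table 3)."""
    if r_hat > r:
        return "overconfident"   # bullish forecast
    if r_hat < -r:
        return "loss_averse"     # bearish forecast
    return "neutral"             # range-bound forecast
```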

Evaluation process of the proposed model

This section describes the evaluation procedure used to assess the performance of the proposed BBAPT framework. Section 4.1 introduces the datasets, Section 4.2 outlines the construction of market regimes, Section 4.3 details the reinforcement learning training setup, and Section 4.4 presents the benchmark models and evaluation criteria.

The evaluation pipeline is designed to be transparent and fully reproducible, with explicit descriptions of data selection, regime construction, training procedures, and robustness checks across cryptocurrency and equity markets.

Figure 4 summarizes the overall workflow, from raw data collection to model training and performance comparison against classical portfolio strategies.

Figure 4. Evaluation process of the proposed framework. The workflow includes data collection, regime characterization, training of behavioral and neutral RL agents, and comparison with benchmark portfolio models.

Data

The cryptocurrency market serves as a suitable testbed for evaluating reinforcement-learning-based trading systems. Its high volatility, rapid structural changes, and continuous 24-hour operation create an environment in which models must adapt to non-stationary and stress-prone dynamics. These characteristics provide a rich setting for testing algorithms designed to respond to evolving risk–return profiles.

Because many cryptocurrencies exhibit strong cross-correlation, asset selection plays a key role in ensuring diversification. Following the clustering-based taxonomy of Pele et al.72, a representative portfolio of 20 cryptocurrencies is constructed: Bitcoin (BTC), Ethereum (ETH), Litecoin (LTC), Chainlink (LINK), Bitcoin Cash (BCH), Uniswap (UNI), Stellar Lumens (XLM), Filecoin (FIL), BNB (BNB), Solana (SOL), XRP (XRP), Cardano (ADA), Shiba Inu (SHIB), Toncoin (TON), Dogecoin (DOGE), Avalanche (AVAX), Tron (TRX), Polkadot (DOT), Polygon (MATIC), and Ethereum Classic (ETC). Daily OHLCV (Open, High, Low, Close, Volume) data for all assets are obtained from Yahoo Finance.

The cryptocurrency dataset spans January 2018 to December 2024, covering a broad range of market conditions including the COVID–19 shock, subsequent recoveries, the 2022 drawdown, and the partial normalization phase of 2023–2024. This diverse period allows a thorough evaluation of the model under crisis, post-crisis, trending, and range-bound environments.

For visualization and interpretability, three representative market patterns (low-width range-bound, trending, and high-width range-bound) are illustrated in Figure 5 and Table 4. These patterns are not used in the decision-making pipeline. Instead, the TimesNet forecasting module (Section 3.3) generates next-step return predictions based solely on historical data, and the BBAPT model selects the behavioral mode accordingly.

Figure 5. Representative market regimes (low-width range-bound, trending, and high-width range-bound). These regimes are used solely for analysis and are not employed by the BBAPT model during training or evaluation.

Table 4.

Training and testing periods associated with each representative market regime.

Market Type Train Start Train End Test Start Test End
Low-width range-bound 2019-02-06 2019-12-22 2019-12-23 2020-03-12
Trending 2020-03-13 2020-11-08 2020-11-09 2021-01-08
High-width range-bound 2021-01-09 2022-02-23 2022-02-24 2022-06-07
Total period 2019-02-06 2022-06-07 2022-06-08 2023-01-03

To assess generalizability beyond digital assets, an additional dataset from the Dow Jones Industrial Average (DJIA) is used. The equity dataset consists of the official DJIA constituents at each point in time, covering January 2008 to June 2024. This 16-year horizon includes the global financial crisis, the long post-crisis expansion, the COVID–19 crash and rebound, the inflation-driven drawdowns of 2022, and the subsequent recovery. Using time-varying index membership eliminates survivorship bias and provides a complementary testing ground for evaluating robustness across distinct asset classes and market structures.

Determining the market type, training, and testing periods

To evaluate the performance of the behavioral components of the BBAPT framework under diverse market dynamics, we consider three characteristic market environments: low-width range-bound, trending, and high-width range-bound conditions. These environments reflect qualitatively distinct price behaviors frequently encountered in financial markets and provide a structured basis for analyzing how different behavioral modes influence portfolio allocation.

Behavioral finance suggests that overconfidence tends to intensify in strongly trending markets, while loss aversion becomes more pronounced during downturns or in highly volatile environments. Although such distinctions motivate the inclusion of multiple behavioral modes, the BBAPT framework does not use regime labels during operation. All behavioral activations are driven solely by next-step forecasts generated by the TimesNet module (Section 3.3), ensuring that no regime-based information is introduced into training or testing procedures.

Technical analysts commonly rely on range identification and trend detection as the foundation for trading strategies76. A range-bound market is typically characterized by prices oscillating between well-defined support and resistance levels77, while a trending market exhibits sustained upward or downward movement78. Because individual cryptocurrencies often show heterogeneous cyclical patterns, regime identification is performed on an equally weighted cryptocurrency index rather than on individual assets. Following Bolognesi et al.79, the index is defined as

$$R_t^{EW} = \frac{1}{N} \sum_{i=1}^{N} r_{i,t} \tag{19}$$

where $r_{i,t}$ denotes the log-return of asset $i$ at time $t$, and $N$ is the number of assets in the cryptocurrency portfolio.

Visual inspection of this index reveals three segments that correspond to low-width range-bound, trending, and high-width range-bound regimes. These regimes are highlighted in Figure 5. They are used exclusively for post-hoc performance interpretation and are not part of the decision-making flow of the BBAPT framework.

For each identified regime, 80% of the data is allocated to training the behavioral and neutral RL agents, and the remaining 20% is reserved for testing. Table 4 summarizes the corresponding dates.

In addition to this segmented analysis, the full cryptocurrency dataset (January 2018 to December 2024) described in Section 4.1 is used for long-horizon testing and for training the TimesNet forecasting module. This broader evaluation allows BBAPT to be assessed under a wide spectrum of real-world market conditions, including crisis periods, recoveries, strong trends, and extended range-bound phases.

Training the reinforcement learning model

Reinforcement learning model configuration

Actor–critic agents utilize two function approximators to model the actor and critic networks. In this study, both functions are implemented using deep neural networks. The architectures of these networks are illustrated in Figure 6. Although various approaches exist for selecting network depth and width, including sensitivity analysis, several studies suggest heuristic principles that provide strong baseline configurations. Here, the geometric pyramid rule80 is adopted to determine the hidden-layer structure of both networks, as it has been shown to yield near-optimal architectures with stable convergence across diverse applications.

Figure 6. Architecture of the actor and critic networks used in this study.

Assuming a network with three hidden layers, the pyramid rule specifies that the number of neurons in these layers should follow a descending geometric pattern. Let $n_{\text{in}}$ and $n_{\text{out}}$ denote the numbers of neurons in the input and output layers. The scaling coefficient $r$ is then computed as

$$r = \sqrt[4]{\,n_{\text{in}} / n_{\text{out}}\,} \tag{20}$$

rounded up to an integer. The first hidden layer contains $r^{3}$ times the output size, the second contains $r^{2}$ times the output size, and the third contains $r$ times the output size. Table 5 summarizes the resulting architectures for the actor and critic networks.

Table 5.

Configuration of the actor and critic networks.

Network Input Output $r$ Layer 1 Layer 2 Layer 3
Actor 120 40 2 320 160 80
Critic 120 1 4 64 16 4
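Assuming the coefficient is rounded up to an integer, the rule reproduces the architectures in Table 5 exactly, as the short check below shows; the function name is illustrative.

```python
import math

def pyramid_layers(n_in, n_out):
    """Geometric pyramid rule (Eq. 20): hidden sizes n_out*r^3, n_out*r^2,
    n_out*r, with the coefficient r rounded up to an integer."""
    r = math.ceil((n_in / n_out) ** 0.25)
    return r, [n_out * r**3, n_out * r**2, n_out * r]

print(pyramid_layers(120, 40))  # (2, [320, 160, 80])  -> actor network
print(pyramid_layers(120, 1))   # (4, [64, 16, 4])     -> critic network
```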

As shown in Figure 6, the critic network uses three hidden layers with tanh activations, followed by a ReLU output to ensure non-negative value estimates. The actor network also employs three tanh-activated hidden layers but branches into two output pathways: one for the mean and another for the standard deviation of the continuous action vector. A softplus activation is used on the standard deviation branch to enforce positivity. For the mean branch, a ReLU activation followed by a scaling layer ensures action components lie within [0, 1].

Calibration of behavioral hyperparameters

In the revised BBAPT framework, investor behavior is modeled through systematic adjustments to the baseline portfolio weights using the bias–specific multipliers described in Section 3.1. Each behavioral agent is governed by a set of interpretable hyperparameters that determine how unrealized returns affect position sizing.

For the loss-averse agent, the relevant hyperparameters are:

$$\Theta^{LA} = \big\{\theta^{LA}_{-},\ \theta^{LA}_{+},\ \alpha^{LA},\ \beta^{LA},\ \gamma^{LA}\big\}$$

which respectively control the loss and gain activation thresholds, the strength of cross-sectional reallocation toward losers and away from winners, and the overall risky-asset exposure (risk budget).

For the overconfidence agent, the analogous parameters are:

$$\Theta^{OC} = \big\{\theta^{OC}_{+},\ \theta^{OC}_{-},\ \alpha^{OC},\ \beta^{OC},\ \gamma^{OC}\big\}$$

governing the gain and loss activation thresholds, the scaling of winning positions, the degree of “doubling down” on losing positions, and the elevated risk budget associated with overconfident behavior.

All behavioral hyperparameters are selected through a grid search using the validation subset of each training regime. For each parameter configuration, a behavioral DRL agent is trained and its average episodic reward is recorded. The configuration achieving the highest validation performance is selected as the optimal behavioral profile. This search procedure ensures a fair comparison across agents and produces behavioral intensities consistent with empirical findings in behavioral finance.
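The selection loop is a standard exhaustive search, sketched below. The grid values here are hypothetical placeholders (the paper's exact ranges are not reproduced), and `train_and_evaluate` stands in for the regime-specific training-plus-validation routine assumed by this sketch.

```python
from itertools import product

# Hypothetical grid; the paper's exact search ranges are not reproduced here.
grid = {
    "theta_minus": [-0.06, -0.04, -0.02],
    "theta_plus": [0.02, 0.04, 0.06],
    "alpha": [0.10, 0.20, 0.30],
    "beta": [0.10, 0.20],
    "gamma": [0.70, 0.80, 0.90],
}

def grid_search(train_and_evaluate):
    """Select the configuration with the highest validation episodic reward.
    `train_and_evaluate(cfg)` trains a behavioral agent and returns its
    average episodic reward on the validation split (assumed interface)."""
    best_cfg, best_reward = None, float("-inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        reward = train_and_evaluate(cfg)
        if reward > best_reward:
            best_cfg, best_reward = cfg, reward
    return best_cfg, best_reward
```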

Evaluation of the trained model

To evaluate the proposed model’s performance, the model has been compared with the neutral model (without considering behavioral biases), Markowitz’s mean-variance model, and the equally-weighted portfolio model.

Neutral model

The neutral model serves as the benchmark DRL agent without any behavioral modification. It employs the same actor–critic structure, training configuration, and state representation as the behavioral agents, but it bypasses the behavioral adjustment layer described in Section 3.1. Accordingly, the final traded portfolio weights coincide with the baseline weights produced by the actor network:

$$w_{i,t} = w_{i,t}^{\text{base}}, \qquad i = 1, \dots, N$$

The state representation is identical to that used for the behavioral models:

$$s_t = \big[x_{1,t}^{(1)}, \dots, x_{N,t}^{(M)},\; u_{1,t}, \dots, u_{N,t}\big]$$

and the action vector consists of the normalized baseline DRL weights:

$$a_t = w_t^{\text{base}} = \big(w_{1,t}^{\text{base}}, \dots, w_{N,t}^{\text{base}}\big)$$

This model therefore provides a clean reference point for isolating the added value contributed by behavioral position-sizing mechanisms within the BBAPT framework.

Markowitz model

The Markowitz mean–variance optimization framework81 constructs a portfolio by balancing expected return against risk. Given expected returns $\mu$ and covariance matrix $\Sigma$, the optimization problem considered is:

$$\max_{w}\ \mu^\top w - \frac{\lambda}{2}\, w^\top \Sigma\, w \quad \text{s.t.}\quad \mathbf{1}^\top w = 1,\ w \ge 0 \tag{21}$$

where $\mathbf{1}$ is a vector of ones and $\lambda$ is a risk-aversion coefficient. The resulting solutions lie along the classical efficient frontier82. For benchmarking purposes, we consider both low-risk and high-risk portfolios located on the frontier.
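As a reference point, this benchmark can be reproduced with a standard numerical solver. The sketch below uses SciPy's SLSQP method under a long-only constraint, with the risk-aversion coefficient as a free parameter; it is an illustrative reconstruction, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def markowitz_weights(mu, sigma, risk_aversion=1.0):
    """Long-only mean-variance portfolio (Eq. 21); sweeping `risk_aversion`
    traces the low-risk / high-risk points on the efficient frontier."""
    n = len(mu)
    objective = lambda w: 0.5 * risk_aversion * (w @ sigma @ w) - mu @ w
    constraints = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * n
    res = minimize(objective, np.ones(n) / n, bounds=bounds,
                   constraints=constraints, method="SLSQP")
    return res.x
```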

Equally weighted portfolio model

The equally weighted (EW) portfolio allocates identical capital shares to all assets. This simple and widely adopted baseline83 offers a model-free reference against which the incremental value of reinforcement learning and behavioral modeling can be assessed.

Evaluation criteria

All models are evaluated using performance metrics that capture profitability, risk, and downside protection. The following measures are computed over the out-of-sample testing window.

Final Compound Return (FCR).

$$FCR = \prod_{t=1}^{T} \big(1 + r_t^{p}\big) \tag{22}$$

Annualized Return (AR).

$$AR = FCR^{1/n} - 1 \tag{23}$$

where $n$ is the duration of the test period in years.

Sharpe Ratio (SR).

$$SR = \frac{AR - r_f}{AV} \tag{24}$$

with $r_f = 0$ for cryptocurrency markets.

Annualized Volatility (AV).

$$AV = \sigma_d \sqrt{A} \tag{25}$$

where $\sigma_d$ is the standard deviation of daily portfolio returns and $A$ is the number of trading periods per year.

Maximum Drawdown (MDD).

$$MDD = \max_{t \le T}\ \frac{\max_{s \le t} V_s - V_t}{\max_{s \le t} V_s} \tag{26}$$

where $V_t$ denotes the portfolio value at time $t$.

Together, these metrics provide a comprehensive assessment of absolute and risk-adjusted performance, as well as robustness during adverse market conditions.
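All five metrics can be computed directly from the daily portfolio return series, as in the sketch below; the annualization constant is an assumption of this sketch (365 periods for 24/7 cryptocurrency trading, 252 for equities).

```python
import numpy as np

def evaluate(portfolio_returns, periods_per_year=365, rf=0.0):
    """FCR, AR, SR, AV, MDD (Eqs. 22-26) from daily portfolio returns."""
    r = np.asarray(portfolio_returns)
    wealth = np.cumprod(1.0 + r)                 # compounded wealth path
    fcr = wealth[-1]                             # final compound return (Eq. 22)
    years = len(r) / periods_per_year
    ar = fcr ** (1.0 / years) - 1.0              # annualized return (Eq. 23)
    av = r.std() * np.sqrt(periods_per_year)     # annualized volatility (Eq. 25)
    sr = (ar - rf) / (av + 1e-12)                # Sharpe ratio (Eq. 24)
    peak = np.maximum.accumulate(wealth)
    mdd = ((peak - wealth) / peak).max()         # maximum drawdown (Eq. 26)
    return {"FCR": fcr, "AR": ar, "SR": sr, "AV": av, "MDD": mdd}
```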

Experimental results

This section presents a comprehensive empirical evaluation of the proposed BBAPT framework. The experimental setup ensures a fully chronological workflow, strict avoidance of look-ahead bias, and a clear separation between training, validation, and testing phases. TimesNet operates as an external forecasting module and does not access future labels at any stage. All reported metrics are computed based on out-of-sample test data for both cryptocurrency and equity markets.

Unless otherwise stated, the behavioral thresholds for loss aversion and overconfidence correspond to the optimal values obtained through the hyperparameter-selection process described in Section 4.3.

The empirical analysis is organized as follows: Section 5.1 introduces the overall evaluation framework, followed by Section 5.2, which investigates the tuning process for the behavioral hyperparameters. Section 5.3 then illustrates how portfolio weights and behavioral mode activation evolve over time. Section 5.4 provides a comparative analysis between behavioral agents and benchmark portfolio models in cryptocurrency markets, while Section 5.5 presents the long-horizon evaluation results for cryptocurrencies during 2018–2024. Finally, Section 5.6 assesses the robustness of the proposed approach using DJIA constituents over the period 2008–2024.

Evaluation framework

The evaluation framework is designed to provide a transparent, reproducible, and realistic assessment of the BBAPT model under non-stationary market conditions. This subsection summarizes the data construction, regime identification, training/testing structure, and performance measures used throughout the study.

Data sources and preprocessing.

The cryptocurrency dataset spans 2018–2024 and covers multiple major market environments, including the pre-COVID regime, the COVID–19 crash and rebound, the 2022 high-volatility drawdowns, and the partial normalization of 2023–2024. For equity markets, DJIA constituents are reconstructed dynamically for each date over 2008–2024, thereby eliminating survivorship bias.

All time series are aligned chronologically and forward-filled when needed. No future information is used during preprocessing.

Market-regime identification.

Daily market conditions are classified into trending, low-width range-bound, and high-width range-bound regimes using rolling volatility, trend-strength measures, and normalized oscillation indicators. All regime assignments rely exclusively on historical information and are used only for analytical interpretation; the BBAPT model does not utilize regime labels during training or execution.

Chronological data partitioning.

Each dataset is divided into training, validation, and test segments in strict temporal order. The test set remains completely unseen until final evaluation, and hyperparameter tuning is performed only on validation data.

External training of TimesNet.

TimesNet operates purely as an external forecasting model that generates one-step-ahead return predictions. It is trained independently from the RL agent and never accesses future labels. Its forecasts are used solely to determine the behavioral mode for the upcoming trading step.

Backtesting protocol.

Daily rebalancing is applied with a transaction cost of 10 basis points per trade. The behavioral layer modifies only the magnitude of portfolio weights, preserving the direction of trades determined by the underlying RL policy.

Performance metrics.

Evaluation metrics include final compound return (FCR), annualized return (AR), annualized volatility (Vol), Sharpe ratio (SR), and maximum drawdown (MDD), all computed strictly on out-of-sample data.

Summary.

Overall, the evaluation setup ensures that the BBAPT model is assessed under realistic conditions with complete chronological integrity, no look-ahead bias, and no hidden information leakage.

Tuning the behavioral hyperparameters

In the BBAPT framework, the behavioral mechanisms are governed by two sets of parameters:

$$\Theta^{LA} = \big\{\theta^{LA}_{-},\ \theta^{LA}_{+},\ \alpha^{LA},\ \beta^{LA},\ \gamma^{LA}\big\}, \qquad \Theta^{OC} = \big\{\theta^{OC}_{+},\ \theta^{OC}_{-},\ \alpha^{OC},\ \beta^{OC},\ \gamma^{OC}\big\}$$

These parameters determine (i) the unrealized–return thresholds that activate behavioral adjustments, (ii) the magnitude of cross-sectional scaling applied to baseline DRL portfolio weights, and (iii) the aggregate risk budget associated with each behavioral agent. Behavioral adjustments modify only the position size; the direction of trades remains fully determined by the underlying DRL policy.

To identify effective behavioral intensities, a systematic grid search was performed over all parameters in both $\Theta^{LA}$ and $\Theta^{OC}$, with predefined search ranges covering the activation thresholds, the reallocation strengths, and the risk budgets of the loss-averse and overconfidence agents.

Each parameter configuration was evaluated by training the A2C agent across four market conditions (low-width range-bound, trending, high-width range-bound, and the full sample). For each regime, the configuration yielding the highest average episodic reward was selected as the optimal behavioral specification.

The resulting optimal hyperparameters used in all subsequent experiments are reported in Table 6 and Table 7.

Table 6.

Optimal loss-aversion parameters for the A2C agent across market regimes.

Regime $\theta^{LA}_{-}$ $\theta^{LA}_{+}$ $\alpha^{LA}$ $\beta^{LA}$ $\gamma^{LA}$ Reward
Low-width range-bound – – 0.22 0.18 0.78 1021.4
Trending – – 0.10 0.12 0.90 998.3
High-width range-bound – – 0.28 0.21 0.72 389.5
Full period – – 0.25 0.15 0.75 915.2

Table 7.

Optimal overconfidence parameters for the A2C agent across market regimes.

Regime $\theta^{OC}_{+}$ $\theta^{OC}_{-}$ $\alpha^{OC}$ $\beta^{OC}$ $\gamma^{OC}$ Reward
Low-width range-bound – – 0.26 0.20 1.18 1034.2
Trending – – 0.14 0.10 1.12 1036.7
High-width range-bound – – 0.24 0.22 1.20 392.3
Full period – – 0.16 0.14 1.15 944.8

The results show that trending markets favor milder forms of loss aversion and moderate overconfidence, whereas highly volatile range-bound periods benefit from stronger risk-reducing adjustments. Across all regimes, overconfidence serves as a return-amplifying mechanism by increasing total risk exposure through $\gamma^{OC} > 1$, while loss aversion reduces overall allocation to risky assets through $\gamma^{LA} < 1$. Importantly, both behaviors influence only position sizing, preserving the directional decisions produced by the DRL policy.

Weight dynamics under behavioral biases

This section illustrates how the proposed behavioral models (loss aversion and overconfidence) modify the baseline portfolio weights generated by the neutral DRL agent. For each market regime, Figures 7 and 8 display (i) the evolution of the adjusted daily weights $w_{i,t}$, and (ii) the induced reallocation increments $\Delta w_{i,t} = w_{i,t} - w_{i,t-1}$. The stacked-area plots show the normalized final weights, while the bar charts highlight day-to-day adjustments. Because all portfolios are renormalized at each step ($\sum_i w_{i,t} = 1$), increases in one asset's weight imply proportional reductions in others.

Figure 7. Portfolio weights during testing under the loss-aversion behavioral mechanism.

Figure 8. Portfolio weights during testing under the overconfidence behavioral mechanism.

Loss-aversion dynamics. The loss-aversion mechanism is activated when unrealized returns fall below the loss threshold τ_LA. The extent of the adjustment is governed by the cross-sectional multipliers κ1 and κ2, together with the global risk-budget parameter β_LA < 1, which reduces aggregate exposure to risky assets.
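Using this notation, and assuming an illustrative multiplicative form for the adjustment (the qualitative behavior, not the paper's exact equation), the loss-averse sizing rule can be written as

$$\tilde{w}_{i,t} = \begin{cases} (1-\kappa)\, w_{i,t}, & r^{u}_{i,t} < \tau_{\mathrm{LA}}, \\ w_{i,t}, & \text{otherwise}, \end{cases} \qquad w^{\mathrm{LA}}_{i,t} = \beta_{\mathrm{LA}} \, \frac{\tilde{w}_{i,t}}{\sum_{j} \tilde{w}_{j,t}},$$

where w_{i,t} is the baseline DRL weight, r^u_{i,t} the unrealized return of asset i, κ the applicable cross-sectional multiplier, and β_LA < 1 the risk budget that shrinks the aggregate risky allocation.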

Figure 7 shows that weight trajectories in the low-width range-bound regime and in the full-period evaluation remain relatively stable, reflecting the calibrated parameters in Θ_LA, which limit both the frequency and the magnitude of adjustments. Only sufficiently negative unrealized returns activate the behavioral multipliers, leading to smoother and less frequent reallocations.

In the trending regime, sharper upward and downward movements in market prices push unrealized returns farther past the thresholds τ_LA and τ_OC, resulting in more frequent and more pronounced cross-sectional scaling. This leads to noticeable fluctuations in normalized weights. Some changes may appear counterintuitive; for instance, the relative weight of a weak-performing asset can temporarily increase. This effect arises mechanically from the renormalization step: large downward adjustments to several assets shrink the denominator, so the normalized weight of a less-adjusted asset rises even when its absolute allocation does not. For example, if two of three equally weighted positions are halved, the untouched asset's normalized weight climbs from 1/3 to 1/2 although its raw allocation is unchanged.

Overconfidence dynamics. Figure 8 presents the corresponding dynamics for the overconfidence mechanism. This behavioral mode is activated when unrealized returns exceed the gain threshold τ_OC, prompting the multipliers κ1 and κ2 to amplify the baseline DRL weights of strongly performing assets. The global risk-scaling factor β_OC > 1 increases aggregate exposure to risky assets, making this behavioral response inherently more aggressive than its loss-averse counterpart.
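Under the same illustrative form as above, the overconfident rule is symmetric:

$$\tilde{w}_{i,t} = \begin{cases} (1+\kappa)\, w_{i,t}, & r^{u}_{i,t} > \tau_{\mathrm{OC}}, \\ w_{i,t}, & \text{otherwise}, \end{cases} \qquad w^{\mathrm{OC}}_{i,t} = \beta_{\mathrm{OC}} \, \frac{\tilde{w}_{i,t}}{\sum_{j} \tilde{w}_{j,t}},$$

with β_OC > 1 expanding the aggregate risky allocation rather than shrinking it.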

Consistent with its design, the trending regime shows the largest number of behavioral activations. Sustained positive unrealized returns frequently cross the amplification threshold, triggering repeated adjustments that expand allocations to winning positions. The resulting weight trajectories exhibit higher variability and more abrupt transitions, reflecting the trend-amplifying nature of overconfidence.

Across all market environments, loss aversion produces smoother and more conservative reallocations, whereas overconfidence generates sharper, gain-driven adjustments. These contrasting patterns confirm that the behavioral multipliers function as intended, capturing key psychological tendencies while remaining fully consistent with the actor–critic structure and the underlying neutral DRL policy.

Performance comparison in the cryptocurrency market

Figure 9 presents the compound returns of the loss-averse, overconfident, and neutral RL agents across four representative cryptocurrency market regimes. Table 8 reports the corresponding final compound returns.

Figure 9. Compound returns for behavioral and neutral RL models across four cryptocurrency market regimes.

Table 8.

Final compound return for RL-based models in the cryptocurrency market.

Market Loss aversion Overconfidence Neutral
Low-width range-bound 1.313 1.105 1.107
Trending 1.328 2.176 1.728
High-width range-bound 0.843 0.715 0.659
Total period 1.552 1.130 1.571

Performance of behavioral models

  • Low-width range-bound market. The loss-aversion agent achieves the strongest performance. Frequent price reversals generate small unrealized losses, and the behavioral scaling mechanism reduces exposure promptly, preventing extended drawdowns and stabilizing returns. The neutral and overconfidence agents produce similar outcomes, with the neutral model performing slightly better due to its more conservative position sizing.

  • Trending market. In trending environments, the overconfidence agent clearly dominates. Persistent unrealized gains repeatedly trigger its amplification mechanism, expanding exposure to winning positions and benefiting from upward momentum. By contrast, the loss-aversion model trims exposure during pullbacks and thus captures a smaller fraction of the trend.

  • High-width range-bound market. The performance hierarchy mirrors that of the low-width range-bound regime but with more pronounced volatility. The loss-aversion agent again delivers the best results by reducing exposure during abrupt reversals. Short-lived upward moves are insufficient to activate meaningful trend amplification for the overconfidence agent, which finishes only modestly ahead of the neutral policy.

  • Total period. Over the full evaluation window, the loss-aversion and neutral agents achieve comparable returns, while the overconfidence agent underperforms. The decline toward the end of the period penalizes the aggressive risk scaling of the overconfidence mechanism, whereas the loss-aversion agent effectively limits downside exposure while still benefiting from partial recoveries.

Sharpe ratio analysis

Figure 10 summarizes risk-adjusted performance across all regimes.

  • In the low-width range-bound regime, the loss-aversion agent attains the highest Sharpe ratio due to effective exposure reduction during frequent reversals.

  • In trending markets, the overconfidence agent achieves the strongest Sharpe ratio by amplifying positions with sustained unrealized gains.

  • In the high-width range-bound regime, Sharpe ratios decline across all agents, yet loss aversion remains the best performer.

  • Over the full period, the loss-aversion agent again provides the highest risk-adjusted performance.

Figure 10. Sharpe ratios for all models across cryptocurrency market regimes.

Comparison of RL algorithms

Table 9 compares A2C, A3C, and DDPG across the behavioral and neutral agents. A2C consistently delivers the highest compound returns, benefiting from synchronous updates that stabilize the learning process. A3C performs competitively but exhibits higher variability, while DDPG is more sensitive to noise and non-stationary market dynamics.

Table 9.

Performance of A2C, A3C, and DDPG algorithms for behavioral and neutral models.

Model A2C A3C DDPG
Loss aversion 1.552 1.497 1.312
Overconfidence 1.130 1.089 0.974
Neutral 1.571 1.423 1.215
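For context, the stability advantage attributed to A2C in Table 9 comes from its synchronous, batched advantage update; in standard textbook form (not a detail specific to this paper),

$$A_t = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V_{\phi}(s_{t+n}) - V_{\phi}(s_t), \qquad \nabla_{\theta} J = \mathbb{E}\big[\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, A_t\big],$$

where all parallel workers contribute to a single synchronized gradient step. A3C applies the same update asynchronously, which introduces gradient staleness, while DDPG's off-policy replay leaves it more exposed to noise and to the non-stationarity of financial data.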

Summary of findings

  • Loss aversion performs best in volatile and range-bound conditions due to its effective downside-risk reduction.

  • Overconfidence dominates in trending markets through targeted exposure amplification.

  • The neutral agent provides balanced performance but lacks mechanisms for controlling extreme outcomes.

  • Among reinforcement learning algorithms, A2C achieves the most stable and robust results.

Overall, these results show that incorporating behavioral scaling into position sizing enables RL agents to adjust more effectively to different market structures and enhances robustness across heterogeneous regimes.

Extended performance evaluation

To assess the robustness of the proposed framework under long-horizon non-stationary conditions, all portfolio strategies are evaluated over an extended out-of-sample period from January 2018 to June 2024. This interval includes several distinct market phases: the pre-COVID environment, the COVID-19 crash and subsequent rebound, the strong bull market of 2020–2021, the inflation-driven downturn of 2022, and the partial recovery observed during 2023–2024. Such heterogeneous dynamics provide a demanding setting for testing the stability and adaptability of reinforcement-learning-based portfolio strategies. All results reported in this section are strictly out-of-sample and incorporate transaction costs.
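A minimal sketch of the walk-forward splitting that underlies such an evaluation (window lengths are illustrative, not the paper's exact configuration):

    import pandas as pd

    def walk_forward_splits(index, train_months=36, test_months=6):
        """Yield chronologically ordered (train, test) date windows.

        Strict walk-forward: each test window starts exactly where its
        training window ends, so no future data can leak into training.
        `index` is a sorted pd.DatetimeIndex of trading days.
        """
        start = index[0]
        while True:
            train_end = start + pd.DateOffset(months=train_months)
            test_end = train_end + pd.DateOffset(months=test_months)
            if test_end > index[-1]:
                break
            yield (index[(index >= start) & (index < train_end)],     # training days
                   index[(index >= train_end) & (index < test_end)])  # out-of-sample days
            start += pd.DateOffset(months=test_months)                # roll the origin forward

Each model is then trained only on the training window and scored only on the test window that follows it.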

The benchmark strategies considered are: (i) an equally weighted portfolio (EW), (ii) a Markowitz mean–variance portfolio (MV), and (iii) a neutral reinforcement-learning agent without behavioral adjustments (RL-N). These are compared with two behavioral reinforcement-learning agents based on loss aversion (LA) and overconfidence (OC), as well as with the full BBAPT architecture, which integrates TimesNet forecasts with behavioral position-sizing.

Return-based performance

Table 10 reports the main return-oriented indicators for the extended period: final compound return (FCR), annualized return (AR), and Sharpe ratio. These metrics quantify long-run wealth accumulation and risk-adjusted performance. Figure 11 visualizes the same metrics for all portfolio models.
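For concreteness, the three indicators can be computed from a daily net-return series as follows; the 252-day year, geometric annualization, and zero risk-free rate are common conventions assumed here rather than taken from the paper:

    import numpy as np

    def return_metrics(daily_returns, periods_per_year=252, rf=0.0):
        """FCR, annualized return, and Sharpe ratio from daily net returns."""
        r = np.asarray(daily_returns, dtype=float)
        fcr = np.prod(1.0 + r)                           # final compound return (growth of one unit)
        ar = fcr ** (periods_per_year / len(r)) - 1.0    # geometric annualized return
        sharpe = (r.mean() - rf / periods_per_year) / r.std() * np.sqrt(periods_per_year)
        return fcr, ar, sharpe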

Table 10.

Return-based performance of portfolio strategies. Best values in each column are highlighted in bold.

Model FCR AR (%) Sharpe
Equally weighted (EW) 1.86 16.4 0.72
Markowitz (MV) 2.41 18.9 0.88
Neutral RL (RL-N) 3.12 22.3 1.14
Loss-aversion RL (LA) 2.74 20.1 1.21
Overconfidence RL (OC) 3.98 26.7 1.03
BBAPT 4.22 25.4 1.28
Figure 11. Return-based performance metrics (FCR, annualized return, and Sharpe ratio) for all portfolio models.

The extended evaluation shows that all reinforcement-learning-based strategies outperform the static benchmarks (EW and MV) in both total return and Sharpe ratio; the neutral RL agent alone already exceeds the Markowitz portfolio. Among the single-bias agents, loss aversion delivers the better risk-adjusted performance and overconfidence the higher annualized return, while the BBAPT model attains both the largest final compound return and the highest Sharpe ratio overall.

Risk performance (volatility and drawdown)

Table 11 reports the annualized volatility and maximum drawdown (MDD) for each strategy, highlighting differences in overall risk-taking and resilience during stressed market conditions. Figure 12 visualizes both risk metrics for all portfolio models.
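The two risk measures can be sketched in the same way (same assumed conventions as above):

    import numpy as np

    def risk_metrics(daily_returns, periods_per_year=252):
        """Annualized volatility and maximum drawdown from daily net returns."""
        r = np.asarray(daily_returns, dtype=float)
        vol = r.std() * np.sqrt(periods_per_year)        # annualized volatility
        wealth = np.cumprod(1.0 + r)                     # cumulative wealth path
        peak = np.maximum.accumulate(wealth)             # running maximum
        mdd = ((peak - wealth) / peak).max()             # maximum drawdown
        return vol, mdd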

Table 11.

Risk metrics of portfolio strategies over the extended period. Lower values indicate lower risk; best values in bold.

Model Vol (%) MDD (%)
Equally weighted (EW) 82.3 58.1
Markowitz (MV) 74.5 52.4
Neutral RL (RL-N) 68.9 48.0
Loss-aversion RL (LA) 63.4 41.7
Overconfidence RL (OC) 85.2 56.9
BBAPT 67.1 44.8
Figure 12. Risk metrics (annualized volatility and maximum drawdown) for all portfolio models.

The loss-aversion agent produces the lowest volatility and drawdown, confirming its defensive bias and its ability to reduce exposure during adverse market conditions. The BBAPT model maintains volatility close to that of the neutral RL agent while improving drawdown control. By contrast, the overconfidence agent exhibits the highest volatility and deep drawdowns, reflecting its aggressive position amplification.

Dynamic behaviour: wealth trajectories and risk–return frontier

Figure 13 shows the cumulative wealth trajectories of the portfolio strategies, normalized to one at the start of the evaluation period.

Figure 13. Cumulative wealth trajectories of all portfolio strategies (initial wealth normalized to 1).

Several observations arise. First, the OC and BBAPT strategies diverge upward from the benchmarks during the 2020–2021 bull phase, benefiting from dynamic position amplification in sustained positive regimes. Second, the LA strategy shows smoother behavior and noticeably smaller drawdowns during the COVID crash and the 2022 downturn, highlighting its downside protection. Third, the static EW and MV portfolios remain consistently below all reinforcement-learning-based strategies.

Table 12 reports the inputs for the empirical risk–return frontier, and Figure 14 illustrates the resulting trade-offs.

Table 12.

Inputs for the empirical risk–return frontier.

Model AR (%) Vol (%)
Equally weighted (EW) 16.4 82.3
Markowitz (MV) 18.9 74.5
Neutral RL (RL-N) 22.3 68.9
Loss-aversion RL (LA) 20.1 63.4
Overconfidence RL (OC) 26.7 85.2
BBAPT 25.4 67.1
Figure 14. Empirical risk–return frontier for all portfolio models over 2018–2024. Each point corresponds to a strategy.

The frontier shows that the reinforcement-learning strategies dominate the static benchmarks: the neutral, loss-averse, and BBAPT agents all deliver higher returns at lower volatility than EW and MV, while the overconfidence agent accepts additional volatility in exchange for the highest raw return. The LA agent forms the low-risk boundary, the OC agent occupies the high-risk/high-return region, and BBAPT lies on the upper-left portion of the frontier, indicating a Pareto-efficient balance between return and volatility.
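Pareto efficiency on this frontier can be verified directly from Table 12; the short sketch below (values copied from the table) flags a strategy as dominated when another offers at least as much return at no more risk:

    # model -> (volatility %, annualized return %), from Table 12
    models = {
        "EW": (82.3, 16.4), "MV": (74.5, 18.9), "RL-N": (68.9, 22.3),
        "LA": (63.4, 20.1), "OC": (85.2, 26.7), "BBAPT": (67.1, 25.4),
    }

    def dominated(name):
        vol, ret = models[name]
        return any(v <= vol and r >= ret and (v, r) != (vol, ret)
                   for v, r in models.values())

    print([m for m in models if not dominated(m)])   # -> ['LA', 'OC', 'BBAPT']

Consistent with the text, the three efficient points are the loss-averse, overconfident, and BBAPT strategies, while EW, MV, and the neutral agent are each dominated.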

Summary of extended-period findings

The extended evaluation confirms the main conclusions observed in shorter windows:

  • Reinforcement-learning strategies outperform static benchmarks in both absolute and risk-adjusted terms.

  • The loss-aversion mechanism provides effective downside protection and produces the most stable risk profile.

  • The overconfidence mechanism achieves the highest returns during prolonged expansions, albeit with elevated risk.

  • The BBAPT framework delivers the best overall trade-off by combining regime-aware forecasting with adaptive behavioral position sizing.

Overall, the results demonstrate that the behavioral reinforcement-learning framework remains effective across full market cycles containing multiple structural shocks and extended periods of non-stationarity.

Robustness analysis on DJIA constituents (2008–2024)

To further evaluate the robustness of the proposed framework and examine the impact of potential survivorship bias, we perform an additional set of experiments on the constituents of the Dow Jones Industrial Average (DJIA). The DJIA represents a mature large-cap equity universe with periodic changes in index membership. This setting is therefore well suited to assess (i) whether the behavioural reinforcement-learning framework generalizes beyond digital assets, and (ii) whether the main conclusions remain valid when index composition varies over time.

For each date between January 2008 and June 2024, we construct the investable universe using the official DJIA constituents in effect on that date. The constituent lists are obtained from the public historical record of the index provider. Portfolios are rebalanced at the same frequency as in the cryptocurrency experiments, and all results are strictly out-of-sample and net of transaction costs.
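A point-in-time universe of this kind can be sketched as a lookup into a constituent change log; the tickers and dates below are illustrative placeholders, not the actual DJIA record:

    import pandas as pd

    # Hypothetical change log: each entry gives the date a membership list takes effect.
    # In the paper, these lists come from the index provider's public historical record.
    membership = pd.Series({
        pd.Timestamp("2008-02-19"): ["AA", "AIG", "AXP", "BA"],    # truncated example list
        pd.Timestamp("2009-06-08"): ["AXP", "BA", "CSCO", "TRV"],  # truncated example list
    })

    def universe_on(date):
        """Constituents in effect on `date`: the most recent list dated on or before it."""
        effective = membership.index[membership.index <= pd.Timestamp(date)]
        return membership.loc[effective.max()]

    # e.g. universe_on("2010-01-04") returns the post-June-2009 list

Because the lookup only ever consults changes dated on or before the rebalancing date, no knowledge of future index composition enters the backtest.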

Performance on the original DJIA window 2008–2019

We begin by analyzing model performance over a standard pre-COVID period spanning January 2008 to December 2019, a horizon frequently used in empirical evaluations of DJIA-based portfolio strategies. Table 13 summarizes the final compound return (FCR), annualized return (AR), and Sharpe ratio for all models.

Table 13.

Return-based performance of portfolio strategies on DJIA constituents over the original window. Best values in each column are shown in bold.

Model FCR AR (%) Sharpe
Equally weighted (EW) 2.29 7.3 0.58
Markowitz (MV) 2.60 8.5 0.71
Neutral RL (RL-N) 3.20 10.2 0.92
Loss-aversion RL (LA) 3.02 9.6 0.98
Overconfidence RL (OC) 3.75 11.8 0.85
BBAPT 3.55 11.2 1.04

Extended DJIA evaluation over 2008–2024

Next, we extend the DJIA analysis to the full period 2008–2024, thereby capturing the COVID-19 crash and rapid recovery, the inflation shock of 2022, and the subsequent normalization phase. Table 14 reports the return-based metrics, while Table 15 summarizes volatility and maximum drawdown.

Table 14.

Return-based performance of portfolio strategies on DJIA constituents over the extended window 2008–2024.

Model FCR AR (%) Sharpe
Equally weighted (EW) 2.97 6.9 0.54
Markowitz (MV) 3.38 8.0 0.67
Neutral RL (RL-N) 4.23 9.4 0.88
Loss-aversion RL (LA) 4.10 9.1 0.95
Overconfidence RL (OC) 4.80 10.5 0.79
BBAPT 4.65 10.2 1.00
Table 15.

Risk metrics for DJIA strategies over 2008–2024. Volatility is annualized; both metrics are percentages, and lower values indicate lower risk.

Model Vol (%) MDD (%)
Equally weighted (EW) 19.5 54.3
Markowitz (MV) 17.0 47.8
Neutral RL (RL-N) 16.2 44.2
Loss-aversion RL (LA) 14.8 39.5
Overconfidence RL (OC) 20.4 52.1
BBAPT 15.6 41.0

The extended window confirms that the relative ranking of strategies is remarkably stable: RL-based models continue to dominate static benchmarks, loss aversion offers the most conservative risk profile with competitive returns, and BBAPT achieves the best overall risk-adjusted performance.

Impact of survivorship bias

According to the definition of survivorship bias, using a fixed set of DJIA constituents over the entire sample can produce misleading results, because only the companies that remained in the index are considered while those removed along the way are ignored. Having examined the 2008–2019 and 2008–2024 windows above, we now compare the rolling-constituents design described earlier with a survivorship-biased specification in which the investable universe is fixed to the DJIA membership as of December 2019.

Table 16 reports annualized return and Sharpe ratio for three representative strategies, EW, MV, and BBAPT, under both designs. The "SB" columns correspond to the survivorship-biased universe with fixed 2019 constituents, whereas the "UNB" columns correspond to the unbiased rolling-constituent universe; both designs are evaluated over 2008–2024.

Table 16.

Effect of survivorship bias on selected strategies over 2008–2024. SB: survivorship-biased universe (fixed 2019 constituents); UNB: unbiased universe with time-varying DJIA membership.

Model AR (%), SB AR (%), UNB Sharpe, SB Sharpe, UNB
Equally weighted (EW) 7.4 6.9 0.59 0.54
Markowitz (MV) 8.7 8.0 0.73 0.67
BBAPT 10.8 10.2 1.07 1.00

Survivorship bias leads to mildly overstated performance (around 0.5–0.7 percentage points of annualized return and 0.05–0.07 units of Sharpe). Importantly, however, the qualitative conclusions are unchanged: BBAPT still dominates both benchmarks, and the ordering between EW and MV remains the same. This suggests that the main findings of the paper are not driven by survivorship bias, although an unbiased universe is clearly preferable for accurate performance measurement.

Summary and discussion of the DJIA experiments

The DJIA experiments yield three main insights:

  • Generalization across asset classes. The behavioural RL framework, including BBAPT, delivers consistent performance improvements not only in cryptocurrencies but also in a mature large-cap equity universe.

  • Stable relative ranking. Across both the original (2008–2019) and extended (2008–2024) DJIA windows, the relative ordering of strategies mirrors that observed in the cryptocurrency experiments: RL-based methods outperform static benchmarks, loss aversion offers the most conservative risk profile, overconfidence achieves the highest raw returns, and BBAPT delivers the best overall risk-adjusted performance.

  • Limited impact of survivorship bias. Correcting for survivorship bias slightly reduces absolute performance metrics but leaves all qualitative conclusions intact. This confirms that the reported gains are not an artefact of conditioning on ex post index membership.

Taken together, these results demonstrate that the proposed behaviorally informed reinforcement-learning architecture is robust to changes in asset universe, sample period, and index-construction methodology, directly addressing concerns regarding empirical robustness and data handling.

Graphical analysis and interpretation

Figures 15–18 provide a graphical representation of the return characteristics, risk behaviour, risk–return trade-offs, and cumulative wealth dynamics of all strategies over the 2008–2024 period. These visual results complement the numeric patterns reported in Tables 14 and 15 and offer further insight into the behaviour of the proposed reinforcement-learning framework.

Figure 15. DJIA – Return-based performance metrics for all portfolio strategies (2008–2024).

Figure 18. DJIA – Cumulative wealth trajectories of all portfolio strategies (2008–2024, initial wealth = 1).

Figure 15 shows that the BBAPT model achieves the strongest combined return–Sharpe profile, with annualized performance comparable to the overconfidence model but markedly better risk-adjusted metrics. The overconfidence agent exhibits the strongest raw returns, consistent with its aggressive position amplification during sustained upward trends, whereas the loss-aversion agent generates more moderate returns and, among the single-bias agents, the best Sharpe ratio by effectively mitigating downside exposure.

A similar pattern emerges in the risk comparison shown in Figure 16. The loss-aversion model achieves the lowest volatility and maximum drawdown, followed by BBAPT and the neutral RL agent. The static equally weighted and Markowitz portfolios are more volatile than every RL variant except the overconfidence agent and suffer deeper drawdowns, confirming that adaptive reinforcement-learning methods handle stressed market conditions more effectively.

Figure 16. DJIA – Risk metrics (annualized volatility and maximum drawdown) for all portfolio strategies (2008–2024).

Figure 17 compares the empirical risk–return frontiers for the survivorship-biased (SB) and unbiased (UNB) universes. While performance levels in the SB universe are marginally inflated, as expected given the removal of underperforming constituents, the qualitative structure of the frontier remains unchanged. In both universes, RL-based models dominate the static benchmarks, and BBAPT lies on or near the Pareto-efficient boundary. This confirms that the main conclusions of the study are not driven by survivorship bias.

Figure 17. DJIA – Empirical risk–return frontier for unbiased and survivorship-biased universes (2008–2024).

Finally, the cumulative wealth trajectories in Figure 18 highlight the temporal behaviour of the strategies. BBAPT and the overconfidence agent exhibit the strongest growth during prolonged bull markets, while the loss-aversion model shows superior resilience during periods of market stress such as the 2008 financial crisis, the COVID-19 crash, and the inflation-driven drawdown of 2022. The static benchmarks lag consistently throughout the entire sample, offering substantially lower wealth accumulation.

Taken together, the graphical evidence reinforces the robustness and consistency of the proposed behavioural reinforcement-learning architecture, demonstrating that its performance advantages persist across multiple visual diagnostics and under both biased and unbiased constituent universes.

Summary of empirical findings

Across all empirical experiments, spanning cryptocurrency markets from 2018 to 2024 and DJIA equity constituents from 2008 to 2024, the proposed behaviourally informed reinforcement-learning framework exhibits three robust properties.

First, in both digital-asset and large-cap equity markets, reinforcement-learning strategies consistently outperform static benchmarks such as equally weighted and Markowitz portfolios in terms of final wealth, annualized return, and Sharpe ratio. This performance advantage holds across heterogeneous market conditions, including the 2008 global financial crisis, the COVID-19 crash, the post-crisis recovery phases, and the inflation-driven drawdowns of 2022.

Second, the behavioural modules display clearly differentiated and complementary roles. Loss aversion achieves the lowest volatility and drawdown, and the highest Sharpe ratios in volatile or range-bound regimes, whereas overconfidence produces the strongest raw returns in persistent trending markets. These regime-dependent patterns highlight that modelling behavioural tendencies through position sizing creates economically interpretable portfolio dynamics.

Third, the integrated BBAPT architecture, which combines TimesNet regime forecasts with behavioural reinforcement learning, lies on or near the empirical risk–return frontier across all datasets and horizons. BBAPT typically matches or exceeds the high returns of the overconfidence strategy while maintaining a risk profile close to the neutral or loss-averse agents. The DJIA experiments further show that these advantages persist under a time-varying asset universe, indicating that the framework generalizes well beyond cryptocurrencies.

Overall, the empirical evidence demonstrates that behaviourally informed reinforcement learning, when paired with regime-aware forecasting, yields robust and economically meaningful improvements in portfolio performance across markets, asset classes, and evaluation protocols.

Conclusion

This paper introduced BBAPT, a behavioural reinforcement-learning framework that combines externally trained TimesNet regime forecasts with psychologically motivated position-sizing mechanisms. The experimental design ensures strict chronological separation between training and testing, removes all sources of look-ahead bias, incorporates time-varying DJIA constituents to eliminate survivorship bias, and evaluates the framework across multiple asset classes and market environments.

Across all empirical analyses, covering cryptocurrency markets from 2018 onward and DJIA equities from 2008 onward, three consistent insights emerge. First, reinforcement-learning agents substantially outperform static benchmark portfolios in both total return and risk-adjusted metrics. Second, the behavioural modules function as intended: loss aversion provides strong downside protection by scaling down exposure during adverse conditions, while overconfidence enhances performance in persistent trending markets through controlled amplification of profitable positions. Third, BBAPT achieves the strongest overall performance by adaptively selecting the behavioural response that aligns with the prevailing market regime, resulting in superior outcomes over both short- and long-horizon evaluations.

The extensive multi-period, multi-market experiments confirm that the behavioural reinforcement learning framework remains stable and effective under a wide range of structural market conditions including major crises, recoveries, and prolonged trending phases. The additional DJIA analysis demonstrates that the framework generalizes beyond cryptocurrencies to mature equity markets, with survivorship bias exerting only modest quantitative influence while leaving qualitative conclusions unchanged.

Overall, the findings provide strong evidence that incorporating behavioural biases into position sizing while leaving trade direction to the reinforcement learning policy offers a robust, interpretable, and practically valuable enhancement to data-driven portfolio management. Future research may explore multi-agent extensions, integrate generative regime-modelling architectures, or develop adaptive behavioural parameters that evolve online with changing market conditions.

Author contributions

Atefe Charkhestani: Conceptualization, Methodology, Investigation, Software, Validation, Writing - Original Draft. Akbar Esfahanipour: Conceptualization, Methodology, Supervision, Validation, Writing - Review & Editing.

Funding

This research received no external funding.

Data Availability

The datasets used and analyzed in this study are either publicly available or can be obtained from the corresponding author upon reasonable request.

Declarations

Competing Interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-026-35902-x.

References

  • 1. Dakalbab, F., Talib, M. A., Nasir, Q. & Saroufil, T. J. Artificial intelligence techniques in financial trading: A systematic literature review. J. King Saud Univ. Comput. Inf. Sci. 36, 102015 (2024).
  • 2. Chanda, S., Chib, S., Bose, S. S., Zabiullah, B. I. & Samal, A. In 2024 International Conference on Communication, Computer Sciences and Engineering (IC3SE), pp. 806–809. IEEE (2024).
  • 3. Addy, W. A. et al. Algorithmic trading and AI: A review of strategies and market impact. Journal 11, 258–267 (2024).
  • 4. Jiang, Z. & Liang, J. In 2017 Intelligent Systems Conference (IntelliSys), pp. 905–913. IEEE (2017).
  • 5. Singh, V., Chen, S.-S., Singhania, M., Nanavati, B. & Gupta, A. How are reinforcement learning and deep learning algorithms used for big data based decision making in financial industries – A review and research agenda. Int. J. Ind. Manag. Decis. Intel. 2, 100094 (2022).
  • 6. Kumar, A. Hard-to-value stocks, behavioral biases, and informed trading. J. Financ. Quant. Anal. 44, 1375–1401 (2009).
  • 7. Jain, R., Jain, P. & Jain, C. Behavioral biases in the decision making of individual investors. International Journal of Knowledge Management 13 (2015).
  • 8. Rehan, R. & Umer, I. Behavioural biases and investor decisions. Management Finance 12 (2017).
  • 9. Kahneman, D. & Tversky, A. The simulation heuristic. In Judgment under Uncertainty: Heuristics and Biases, pp. 201–208 (1982).
  • 10. Felizardo, L. K. et al. Outperforming algorithmic trading reinforcement learning systems: A supervised approach to the cryptocurrency market. Expert Syst. Appl. 202, 117259 (2022).
  • 11. Wu, H. et al. TimesNet: Temporal 2D-variation modeling for general time series analysis. arXiv preprint arXiv:2210.02186 (2022).
  • 12. Poyser, O. Herding behavior in cryptocurrency markets. arXiv preprint arXiv:1806.11348 (2018).
  • 13. Hidajat, T. Behavioural biases in Bitcoin trading. Fokus Ekonomi: Jurnal Ilmiah Ekonomi 14, 337–354 (2019).
  • 14. Calderón, O. P. Herding behavior in cryptocurrency markets. arXiv preprint arXiv:1806.11348 (2018).
  • 15. Huang, Y., Zhou, C., Cui, K. & Lu, X. A multi-agent reinforcement learning framework for optimizing financial trading strategies based on TimesNet. Expert Syst. Appl. 237, 121502 (2024).
  • 16. Shefrin, H. & Statman, M. Behavioral portfolio theory. J. Financ. Quant. Anal. 35, 127–151 (2000).
  • 17. Hirshleifer, D. Investor psychology and asset pricing. J. Finance 56, 1533–1597 (2001).
  • 18. Chang, K., Young, M. & Diaz, J. Portfolio optimization utilizing the framework of behavioral portfolio theory. Int. J. Oper. Res. 15, 1–13 (2018).
  • 19. Momen, O., Esfahanipour, A. & Seifi, A. Collective mental accounting: An integrated behavioural portfolio selection model for multiple mental accounts. Quant. Finance 19, 265–275 (2019).
  • 20. Momen, O., Esfahanipour, A. & Seifi, A. A robust behavioral portfolio selection: Model with investor attitudes and biases. Oper. Res. Int. J. 20, 427–446 (2020).
  • 21. Bertella, M. A., Silva, J. N. & Stanley, H. E. Loss aversion, overconfidence and their effects on a virtual stock exchange. Physica A 554, 123909 (2020).
  • 22. Avellone, A., Fiori, A. M. & Foroni, I. In Mathematical and Statistical Methods for Actuarial Sciences and Finance: eMAF2020, pp. 51–56. Springer.
  • 23. Barro, D., Corazza, M. & Nardon, M. In Mathematical and Statistical Methods for Actuarial Sciences and Finance: eMAF2020, pp. 87–93. Springer.
  • 24. Markowitz, H. M. Portfolio selection. J. Finance 7, 77–91 (1952).
  • 25. Cheng, Q., Yang, L., Zheng, J., Tian, M. & Xin, D. Optimizing portfolio management and risk assessment in digital assets using deep learning for predictive analysis. arXiv preprint (2024).
  • 26. Ma, Y., Mao, R., Lin, Q., Wu, P. & Cambria, E. Quantitative stock portfolio optimization by multi-task learning risk and return. Information Fusion 104, 102165 (2024).
  • 27. Ndikum, P. & Ndikum, S. Advancing investment frontiers: Industry-grade deep reinforcement learning for portfolio optimization. arXiv preprint (2024).
  • 28. Soleymani, F. & Paquet, E. Financial portfolio optimization with online deep reinforcement learning and restricted stacked autoencoder (DeepBreath). Expert Syst. Appl. 156, 113456 (2020).
  • 29. Weng, L., Sun, X., Xia, M., Liu, J. & Xu, Y. Portfolio trading system of digital currencies: A deep reinforcement learning with multidimensional attention gating mechanism. Neurocomputing 402, 171–182 (2020).
  • 30. AbdelKawy, R., Abdelmoez, W. M. & Shoukry, A. A synchronous deep reinforcement learning model for automated multi-stock trading. Progress Artif. Intell. 10, 83–97 (2021).
  • 31. Théate, T. & Ernst, D. An application of deep reinforcement learning to algorithmic trading. Expert Syst. Appl. 173, 114632 (2021).
  • 32. Wu, M.-E., Syu, J.-H., Lin, J.C.-W. & Ho, J.-M. Portfolio management system in equity market neutral using reinforcement learning. Appl. Intell. 51, 8119–8131 (2021).
  • 33. Betancourt, C. & Chen, W.-H. Deep reinforcement learning for portfolio management of markets with a dynamic number of assets. Expert Syst. Appl. 164, 114002 (2021).
  • 34. Yue, H., Liu, J., Tian, D. & Zhang, Q. A novel anti-risk method for portfolio trading using deep reinforcement learning. Electronics 11, 1506 (2022).
  • 35. Taghian, M., Asadi, A. & Safabakhsh, R. Learning financial asset-specific trading rules via deep reinforcement learning. Expert Syst. Appl. 195, 116523 (2022).
  • 36. Song, Z., Jin, X. & Li, C. Safe-FinRL: A low bias and variance deep reinforcement learning implementation for high-freq stock trading. arXiv preprint arXiv:2206.05910 (2022).
  • 37. Ge, J., Qin, Y., Li, Y., Huang, Y. & Hu, H. In Proceedings of the 2022 14th International Conference on Machine Learning and Computing (ICMLC), pp. 34–43.
  • 38. Li, Y., Liu, P. & Wang, Z. Stock trading strategies based on deep reinforcement learning. Sci. Program. 2022, 4698656 (2022).
  • 39. Shi, S. et al. GPM: A graph convolutional network based reinforcement learning framework for portfolio management. Neurocomputing 498, 14–27 (2022).
  • 40. Tran, M., Pham-Hi, D. & Bui, M. Optimizing automated trading systems with deep reinforcement learning. Algorithms 16, 23 (2023).
  • 41. Jang, J. & Seong, N. Deep reinforcement learning for stock portfolio optimization by connecting with modern portfolio theory. Expert Syst. Appl. 218, 119556 (2023).
  • 42. Zou, J., Lou, J., Wang, B. & Liu, S. A novel deep reinforcement learning based automated stock trading system using cascaded LSTM networks. Expert Syst. Appl. 242, 122801 (2024).
  • 43. Majidi, N., Shamsi, M. & Marvasti, F. Algorithmic trading using continuous action space deep reinforcement learning. Expert Syst. Appl. 235, 121245 (2024).
  • 44. Cui, T., Du, N., Yang, X. & Ding, S. Multi-period portfolio optimization using a deep reinforcement learning hyper-heuristic approach. Technol. Forecast. Soc. Chang. 198, 122944 (2024).
  • 45. Qin, M. et al. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 14669–14676.
  • 46. Soleymani, F. & Paquet, E. Financial portfolio optimization with online deep reinforcement learning and restricted stacked autoencoder (DeepBreath). Expert Syst. Appl. 156, 113456 (2020).
  • 47. Weng, L., Sun, X., Xia, M., Liu, J. & Xu, Y. Portfolio trading system of digital currencies: A deep reinforcement learning with multidimensional attention gating mechanism. Neurocomputing 402, 171–182 (2020).
  • 48. Wu, M.-E., Syu, J.-H., Lin, J.C.-W. & Ho, J.-M. Portfolio management system in equity market neutral using reinforcement learning. Appl. Intell. 51, 8119–8131 (2021).
  • 49. AbdelKawy, R., Abdelmoez, W. M. & Shoukry, A. A synchronous deep reinforcement learning model for automated multi-stock trading. Progress Artif. Intell. 10, 83–97 (2021).
  • 50. Théate, T. & Ernst, D. An application of deep reinforcement learning to algorithmic trading. Expert Syst. Appl. 173, 114632 (2021).
  • 51. Yue, H., Liu, J., Tian, D. & Zhang, Q. A novel anti-risk method for portfolio trading using deep reinforcement learning. Electronics 11, 1506 (2022).
  • 52. Shi, S. et al. GPM: A graph convolutional network based reinforcement learning framework for portfolio management. Neurocomputing 498, 14–27 (2022).
  • 53. Taghian, M., Asadi, A. & Safabakhsh, R. Learning financial asset-specific trading rules via deep reinforcement learning. Expert Syst. Appl. 195, 116523 (2022).
  • 54. Felizardo, L. K. et al. Outperforming algorithmic trading reinforcement learning systems: A supervised approach to the cryptocurrency market. Expert Syst. Appl. 202, 117259 (2022).
  • 55. Song, Z., Jin, X. & Li, C. Safe-FinRL: A low bias and variance deep reinforcement learning implementation for high-freq stock trading. arXiv preprint arXiv:2206.05910 (2022).
  • 56. Ge, J., Qin, Y., Li, Y., Huang, Y. & Hu, H. In 2022 14th International Conference on Machine Learning and Computing (ICMLC), pp. 34–43.
  • 57. Li, Y., Liu, P. & Wang, Z. Stock trading strategies based on deep reinforcement learning. Sci. Program. 2022 (2022).
  • 58. Jiang, Y., Olmo, J. & Atwi, M. Deep reinforcement learning for portfolio selection. Glob. Financ. J. 62, 101016 (2024).
  • 59. Guiso, L., Sapienza, P. & Zingales, L. Time varying risk aversion. J. Financ. Econ. 128, 403–421 (2018).
  • 60. Kahneman, D. Judgment under Uncertainty: Heuristics and Biases. Cambridge University Press (1982).
  • 61. Pertiwi, T., Yuniningsih, Y. & Anwar, M. The biased factors of investor's behavior in stock exchange trading. Manag. Sci. Lett. 9, 835–842 (2019).
  • 62. Szepesvári, C. Algorithms for Reinforcement Learning. Springer Nature (2022).
  • 63. Barto, A. G. Reinforcement Learning: An Introduction, by Richard Sutton. Science Robotics 6, 423 (2021).
  • 64. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction. MIT Press (2018).
  • 65. Alibabaei, K. et al. Comparison of on-policy deep reinforcement learning A2C with off-policy DQN in irrigation optimization: A case study at a site in Portugal. Irrig. Sci. 11, 104 (2022).
  • 66. Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems 12 (1999).
  • 67. Grondman, I., Busoniu, L., Lopes, G. A. & Babuska, R. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Trans. Syst. Man Cybern. Part C 42, 1291–1307 (2012).
  • 68. Akyildirim, E., Goncu, A. & Sensoy, A. Prediction of cryptocurrency returns using machine learning. Ann. Oper. Res. 297, 3–36 (2021).
  • 69. Sharpe, W. F. The Sharpe ratio. In Streetwise – The Best of the Journal of Portfolio Management 3, 169–185 (1998).
  • 70. Zhang, Z. & Sang, Q. In Proceedings of the 2023 4th International Conference on Machine Learning and Computer Application, pp. 82–91.
  • 71. Gobato Souto, H. TimesNet for realized volatility prediction. SSRN 4660025 (2023).
  • 72. Pele, D. T., Wesselhöfft, N., Härdle, W. K., Kolossiatis, M. & Yatracos, Y. G. A statistical classification of cryptocurrencies. Applied Statistics (2020).
  • 73. Jang, J. & Seong, N. Deep reinforcement learning for stock portfolio optimization by connecting with modern portfolio theory. Expert Syst. Appl. 218, 119556 (2023).
  • 74. Jiang, Z., Xu, D. & Liang, J. A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059 (2017).
  • 75. Yang, H., Liu, X.-Y., Zhong, S. & Walid, A. In Proceedings of the First ACM International Conference on AI in Finance, pp. 1–8.
  • 76. Park, C. H. & Irwin, S. H. What do we know about the profitability of technical analysis? J. Econ. Surv. 21, 786–826 (2007).
  • 77. Droke, C. Moving Averages Simplified. Marketplace Books (2001).
  • 78. Burgess, G. A. Trading and Investing in the Forex Markets Using Chart Techniques. John Wiley & Sons (2010).
  • 79. Bolognesi, E., Torluccio, G. & Zuccheri, A. A comparison between capitalization-weighted and equally weighted indexes in the European equity market. J. Asset Manag. 14, 14–26 (2013).
  • 80. Rachmatullah, M. I. C., Santoso, J. & Surendro, K. A novel approach in determining neural networks architecture to classify data with large number of attributes. IEEE Access 8, 204728–204743 (2020).
  • 81. Markowitz, H. Portfolio selection. J. Finance 7, 77–91 (1952).
  • 82. Merton, R. C. An analytic derivation of the efficient portfolio frontier. J. Financ. Quant. Anal. 7, 1851–1872 (1972).
  • 83. Malladi, R. & Fabozzi, F. Equal-weighted strategy: Why it outperforms value-weighted strategies? Theory and evidence. J. Asset Manag. 18, 188–208 (2017).
  • 84. Alamdari, M. K., Esfahanipour, A. & Dastkhan, H. A portfolio trading system using a novel pixel graph network for stock selection and a mean-CDaR optimization for portfolio rebalancing. Appl. Soft Comput. 152, 111213 (2024).
  • 85. Guo, S., Gu, J.-W. & Ching, W.-K. Adaptive online portfolio selection with transaction costs. Eur. J. Oper. Res. 295, 1074–1086 (2021).
  • 86. Harris, R. D. & Mazibas, M. Portfolio optimization with behavioural preferences and investor memory. Eur. J. Oper. Res. 296, 368–387 (2022).
  • 87. Mba, J. C., Ababio, K. A. & Agyei, S. K. Markowitz mean-variance portfolio selection and optimization under a behavioral spectacle: New empirical evidence. Int. J. Financ. Stud. 10, 28 (2022).
  • 88. Young, M. N. et al. Portfolio optimization considering behavioral stocks with return scenario generation. Journal 10, 4269 (2022).
