[Figure: Side-by-side comparison of DRL algorithm drawdown profiles: a high-return algorithm with severe drawdown versus a lower-return algorithm with minimal drawdown]

Which DRL Algorithm for FX Trading — and Why the Max Drawdown Number Tells You Everything

DQN produced a higher annualized return than PPO in a commodity futures benchmark — 47.6% against 15.7%. It also produced a maximum drawdown of -16.6%, against PPO’s -0.75%. The second comparison is the only one that matters for a live account. A 16.6% drawdown would trigger a margin call at most retail brokers and eliminate a prop firm trader from the evaluation immediately.

The algorithm comparison question is not “which one backtests best?” It is “which one survives contact with a real account?” Most deep reinforcement learning (DRL) research papers optimise for the first question. This article answers the second.

Table of Contents

  • The Benchmark Number Most Traders Ignore
  • Why DQN Fails as a Live FX Algorithm
  • Why PPO Is the Starting Algorithm on Every DRL Project
  • When to Move from PPO to SAC — and What Changes
  • DDPG and TD3: No Recommended Use Case for New FX Projects
  • The Algorithm Decision Sequence

The Benchmark Number Most Traders Ignore

A published benchmark comparing DRL algorithms across commodity futures markets found:

Algorithm | Annualized Return | Sharpe Ratio | Max Drawdown
----------|-------------------|--------------|-------------
DQN       | 47.6%             | 0.81         | -16.6%
PPO       | 15.7%             | 2.04         | -0.75%
A2C       | —                 | 2.11         | -5.2%
DDPG      | —                 | 0.72         | High
[Figure: Horizontal bar chart comparing Sharpe ratio and max drawdown for DQN, PPO, A2C, and DDPG, showing DQN's high drawdown versus PPO's minimal drawdown despite lower return]

DQN’s annualized return looks compelling until you read the next column. A Sharpe of 0.81 against PPO’s 2.04 means DQN is generating less return per unit of risk. The drawdown figure shows where that excess risk lands: -16.6% at worst, against PPO’s -0.75%.

Prop firm evaluations impose 5–10% maximum drawdown limits. A 16.6% peak-to-trough decline during a test period does not end a strategy — it ends an account. DQN’s drawdown profile is structurally incompatible with live FX trading constraints, not because the algorithm is broken, but because it was designed for a different problem.

A2C’s Sharpe in the table (2.11) nominally exceeds PPO’s (2.04). It is still not the recommended starting algorithm, because its max drawdown (-5.2%) is roughly seven times worse. A2C uses synchronous parallel rollouts without PPO’s clipped update constraint, so a single high-volatility batch can shift the policy sharply. The Sharpe advantage is real but insufficient: a strategy generating a 2.11 Sharpe at -5.2% drawdown fails a prop firm evaluation; one generating a 2.04 Sharpe at -0.75% passes. The drawdown column is the constraint that determines which number is actionable.

The decision-relevant columns in any DRL algorithm comparison are Sharpe ratio and max drawdown. Annualized return tells you what happened during the best sequences in a controlled backtest. Sharpe and drawdown tell you what happens on the weeks that go wrong — and in FX, there are always weeks that go wrong.
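Both metrics are cheap to compute from any backtest equity curve, so there is no excuse for ignoring them. A minimal sketch in Python (NumPy only; the random returns at the end are illustrative placeholders, not data from the benchmark above):

```python
import numpy as np

def annualized_sharpe(returns: np.ndarray, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio from per-period returns (risk-free rate omitted)."""
    if returns.std() == 0:
        return 0.0
    return np.sqrt(periods_per_year) * returns.mean() / returns.std()

def max_drawdown(equity: np.ndarray) -> float:
    """Worst peak-to-trough decline of an equity curve, as a negative fraction."""
    running_peak = np.maximum.accumulate(equity)
    drawdowns = equity / running_peak - 1.0
    return drawdowns.min()

# Illustrative placeholder: random daily returns standing in for a backtest.
rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.01, size=252)
equity = np.cumprod(1.0 + returns)
print(f"Sharpe: {annualized_sharpe(returns):.2f}  MaxDD: {max_drawdown(equity):.2%}")
```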

Why DQN Fails as a Live FX Algorithm

DQN’s catastrophic drawdown is not bad luck. It is the structural consequence of two specific design choices.

The discrete action space. DQN learns a Q-function — an estimate of expected future reward for each available action in a given state. “Available actions” in a standard DQN implementation are fixed discrete choices: buy, sell, hold. The algorithm decides *whether* to trade but cannot decide *how much* to trade.

For a 15-minute EURUSD bar strategy, this is a fundamental limitation. Position sizing is one of the primary variables a sophisticated trading agent should optimise. An agent constrained to a fixed lot size on every signal cannot adjust its risk exposure to changing market conditions. It trades identically during a quiet Tuesday session and during an NFP release. That rigidity is where the drawdown spikes in the benchmark data come from.
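The difference shows up directly in the environment definition. A sketch using the Gymnasium spaces API; the action encodings here are illustrative choices, not a fixed standard:

```python
import numpy as np
from gymnasium import spaces

# What a standard DQN agent can express: a fixed menu of actions.
# 0 = sell a fixed lot, 1 = hold, 2 = buy a fixed lot. Size is frozen.
dqn_action_space = spaces.Discrete(3)

# What a continuous-action agent (PPO/SAC) can express: direction and
# size in one value. -1.0 = fully short, +1.0 = fully long; the env
# scales this to lots, e.g. 0.07 lots under current volatility.
continuous_action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
```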

Overestimation bias. The Bellman equation at the core of DQN training includes a maximisation step — at each update, the algorithm selects the action with the highest estimated Q-value. In noisy environments, this causes the algorithm to systematically overestimate the quality of certain actions. The agent learns to chase reward spikes that are artefacts of price noise, not genuine signal.
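In standard notation, with target-network parameters $\theta^{-}$ and discount factor $\gamma$, the DQN training target is:

$$y_t = r_t + \gamma \max_{a'} Q\left(s_{t+1}, a'; \theta^{-}\right)$$

Because the max is taken over noisy Q-estimates, upward estimation errors are selected preferentially; that is the overestimation bias described above.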

FX markets are a high-noise environment: the signal-to-noise ratio in any 15-minute price series is low. DQN’s overestimation bias means it is structurally inclined to be misled by that noise during training — and to carry those learned misbehaviours into the test period.

Architectural variants address both problems partially. Double DQN decouples action selection from action evaluation, which reduces overestimation. Dueling DQN separates state value estimation from action advantage estimation. In tick-data studies on NASDAQ market making, these variants produced more stable results than raw DQN. But neither variant solves the discrete action space problem for bar-based FX trading. If the strategy requires “buy 0.07 lots based on current volatility,” DQN cannot express that decision in any form.
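For reference, Double DQN's decoupling is a single change to the target above: the online network $\theta$ selects the action while the target network $\theta^{-}$ evaluates it:

$$y_t = r_t + \gamma \, Q\left(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta); \, \theta^{-}\right)$$

The action space, however, remains discrete in both variants.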

Why PPO Is the Starting Algorithm on Every DRL Project

PPO’s defining characteristic is a hard constraint on how much the policy can change per update step. The clipped surrogate objective limits each policy update to within roughly ±20% of the previous policy. A single bad batch of training data cannot cause a catastrophic revision.
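Concretely, with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and advantage estimate $\hat{A}_t$, PPO's clipped surrogate objective is:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

The default $\epsilon = 0.2$ is where the "roughly ±20%" figure comes from: once the ratio moves outside $[0.8, 1.2]$, it contributes no further gradient.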

[Figure: Conceptual diagram contrasting unconstrained policy updates (DQN, large erratic jumps) with PPO's clipped incremental updates, illustrating why PPO produces lower drawdown]

Why does this matter more in FX than in other domains? Because FX training data is full of regime segments — trending periods, ranges, and volatility shocks — that produce sharply different reward signals. Without clipping, a week of NFP-driven volatility could look to the algorithm like evidence that the entire previous policy was wrong — and trigger an update that overwrites months of stable learning.

With clipping, the policy improves incrementally. The agent cannot unlearn a working strategy in one bad batch. This is the mechanical reason PPO’s max drawdown is -0.75% in the benchmark data while DQN’s is -16.6%. The algorithm structure prevents the explosive behaviour that DQN’s update dynamics allow.

There is a second argument for PPO that is less obvious but matters at the early stages of any DRL project: input quality has more leverage than algorithm choice. A PPO agent trained on Kalman-filtered price data achieved a 27.1% CAGR on precious metals; the same PPO configuration trained on the unfiltered data achieved 3.46%. The algorithm did not change — only the noise characteristics of the input did. The improvement came from filtering, not from switching to a more complex algorithm.
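The filtering step itself is small. A minimal sketch of a scalar (constant-level) Kalman filter over a close-price series; the noise variances are illustrative placeholders, not the tuned values behind the cited result:

```python
import numpy as np

def kalman_filter_prices(prices: np.ndarray,
                         process_var: float = 1e-5,
                         measurement_var: float = 1e-3) -> np.ndarray:
    """Scalar Kalman filter: treat each price as a noisy observation of a latent level."""
    filtered = np.empty(len(prices), dtype=float)
    x, p = float(prices[0]), 1.0       # initial state estimate and its variance
    for i, z in enumerate(prices):
        p = p + process_var            # predict: uncertainty grows by process noise
        k = p / (p + measurement_var)  # Kalman gain: trust in the new observation
        x = x + k * (z - x)            # update: blend prediction with observation
        p = (1.0 - k) * p
        filtered[i] = x
    return filtered
```

The filtered series replaces the raw close in the feature pipeline; the agent itself is untouched.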

This points directly at how to sequence the work. While you are still validating the reward function and feature pipeline, use PPO. It is simple, stable, and — critically — it fails diagnostically. When PPO training curves oscillate without converging, the signal is clear: the reward function has a scaling problem, or a feature has lookahead bias. SAC’s additional complexity makes those signals harder to interpret. PPO gives you the clearest picture of whether the environment is sound before you add architectural sophistication.

Every DRL trading project I have taken on over the past three years starts with PPO. Not because it is theoretically optimal, but because it is the most informative instrument for diagnosing environment design problems.

When to Move from PPO to SAC — and What Changes

SAC adds one thing PPO does not have: an entropy term in the objective. SAC simultaneously maximises expected cumulative reward and the randomness of the policy. The agent is penalised for becoming too deterministic — for always selecting the same action in a given state.
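In standard form, SAC maximises expected reward plus policy entropy, weighted by a temperature coefficient $\alpha$:

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha\, \mathcal{H}\left(\pi(\cdot \mid s_t)\right)\right]$$

A higher $\alpha$ keeps the policy more stochastic; most modern implementations tune $\alpha$ automatically against a target entropy.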

This regularisation targets a specific failure mode: a policy that has overfit a single market regime. A PPO agent trained on 2017–2019 data may learn patterns that worked perfectly in that period and fail entirely when the volatility regime shifts. The policy has converged to a deterministic mapping it cannot adapt out of.

SAC’s entropy term prevents premature convergence. During the COVID-19 volatility shock in early 2020, SAC adapted to the new high-volatility regime faster than PPO in controlled studies — its stochastic policy continued exploring alternative action distributions rather than committing to patterns that had stopped working.

The practical question is when this matters for your specific project. The threshold I use (a runnable sketch follows the list):

  1. Run PPO across 5 different random seeds
  2. Count how many produce positive out-of-sample Sharpe
  3. If 3 or more pass: PPO is adequate — do not touch the algorithm
  4. If fewer than 3 pass: investigate the reward function and feature pipeline first
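A minimal sketch of that test using Stable-Baselines3. Here `make_train_env`, `make_test_env`, and `out_of_sample_sharpe` are hypothetical placeholders for your own environment factory and evaluation code:

```python
from stable_baselines3 import PPO

def seed_stability_test(n_seeds: int = 5, pass_threshold: int = 3) -> bool:
    """Train PPO under several seeds and count seeds with positive OOS Sharpe."""
    passing = 0
    for seed in range(n_seeds):
        env = make_train_env()  # hypothetical: your training environment factory
        model = PPO("MlpPolicy", env, seed=seed, verbose=0)
        model.learn(total_timesteps=500_000)
        # hypothetical: walk the model through held-out data, return its Sharpe
        if out_of_sample_sharpe(model, make_test_env()) > 0:
            passing += 1
    return passing >= pass_threshold  # True: PPO is adequate, keep it
```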

This sequence matters because PPO failure below the 3-seed threshold is almost always caused by a flawed reward function or feature engineering problems, not algorithm limitations. Switching to SAC will not fix lookahead bias in the feature pipeline. It will just make the lookahead bias harder to diagnose.

If you have validated the environment through PPO — 3+ seeds producing positive out-of-sample Sharpe — and there is genuine evidence of regime overfitting (strong in-sample performance that degrades sharply out-of-sample), then SAC is the appropriate upgrade. SAC is also the right choice when continuous position sizing is a genuine requirement and higher sample efficiency matters — for example, when historical data is limited or the strategy requires fine-grained lot sizing across multiple instruments.

DDPG and TD3: No Recommended Use Case for New FX Projects

DDPG benchmarks in FX research consistently show low Sharpe ratios and high drawdowns. The mechanism is identifiable: DDPG’s deterministic policy latches onto short-term patterns in training data with no entropy-based exploration to prevent it. The policy overfits a specific volatility regime, and when that regime ends, the agent either takes no meaningful action or takes the same action it learned in training regardless of current conditions.

TD3 addresses DDPG’s worst instabilities. Twin critics reduce overestimation bias. Delayed policy updates ensure value estimates stabilise before the strategy is revised. In controlled benchmarks, TD3 often matches SAC in raw performance metrics. The problem is that TD3 still produces a deterministic policy — which inherits DDPG’s brittleness in markets where the exact sequence of price observations never repeats.
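The twin-critic mechanism reduces to one min in the bootstrap target: with target critics $Q_{\theta'_1}, Q_{\theta'_2}$, target policy $\pi_{\phi'}$, and clipped smoothing noise $\epsilon$,

$$y = r + \gamma \min_{i=1,2} Q_{\theta'_i}\left(s',\, \pi_{\phi'}(s') + \epsilon\right)$$

Bootstrapping from the more pessimistic critic suppresses the value overestimation that destabilises DDPG, but the policy $\pi_{\phi}$ itself stays deterministic.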

At barmenteros FX, when we take on a DRL rescue project built on DDPG, the failure pattern is almost always the same: a policy that performed well during one specific volatility regime and went functionally dormant when that regime changed. The agent learned a lookup table for one market, not a generalised policy. Switching to PPO or SAC at that point usually requires a full environment redesign — the DDPG training regime produced a policy too regime-specific to be fine-tuned.

For new FX projects there is no scenario where DDPG is the right starting algorithm. TD3 is a reasonable choice if you are implementing a continuous-action agent without a library dependency and need a simpler codebase — but SAC is available in every major RL library and outperforms TD3 in most trading benchmarks.

The Algorithm Decision Sequence

[Figure: Decision flowchart for selecting a DRL trading algorithm: PPO as default starting point, SAC when regime overfitting is confirmed, DRQN for HFT, DDPG/TD3 not recommended for new projects]

The practical sequence for a new FX DRL project:

Step 1 — Start with PPO. It handles both discrete and continuous action spaces, tolerates hyperparameter variation, and fails diagnostically when the environment is misconfigured.

Step 2 — Run 5 seeds before drawing any conclusions. If fewer than 3 produce positive out-of-sample Sharpe, the problem is in the reward function or feature pipeline. Do not change the algorithm until the environment is validated.

Step 3 — Once PPO validates the environment, evaluate whether continuous position sizing is genuinely required by the strategy. If yes, benchmark SAC against the PPO baseline. If PPO with continuous actions meets performance targets, stay with it — the simpler architecture is easier to maintain and monitor in production.

Step 4 — HFT and limit order book applications: DRQN (Deep Recurrent Q-Network). The LSTM layer processes sequences of order book events that feedforward networks cannot capture. This is the one scenario where a DQN variant is the correct choice — and it requires tick-level data, not bar data.
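A minimal architecture sketch in PyTorch; the feature count, hidden size, and action count are illustrative assumptions, not tuned values:

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Recurrent Q-network: an LSTM over order book event sequences, one Q-value per action."""
    def __init__(self, n_features: int = 40, n_actions: int = 3, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, seq_len, n_features), a window of order book snapshots
        out, _ = self.lstm(obs_seq)
        return self.q_head(out[:, -1, :])  # Q-values from the final hidden state
```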

Step 5 — DDPG and TD3: No recommended use case for new bar-trading FX projects.

The research literature compares these algorithms on annualized return. Live accounts are constrained by drawdown limits, margin calls, and prop firm evaluation rules. Those constraints, not backtest return figures, determine which algorithm survives into production — and that is the only comparison that matters.

For full methodology coverage — reward function design, seed stability testing, state space engineering, and MetaTrader deployment architecture — see Deep Reinforcement Learning for Trading.

Written by: barmenteros FX
Published on: May 6, 2026
Last Updated: May 6, 2026

Categories: Blog, Technical Autopsy
Tags: algorithmic trading, backtest, deep-reinforcement-learning, Expert Advisors, machine-learning

