A PPO agent trained for 2 million steps on EURUSD 15-minute data showed flat reward curves across all 5 seeds. Switching to SAC changed nothing. Adjusting the learning rate changed nothing. In most DRL trading rescue projects I have worked on, the answer is in the observation vector: missing context, incorrect normalization, or lookahead bias. Here the problem was two absent features: no position state, no session encoding. The same RSI reading meant one thing at 3:00 AM and another at the 8:30 AM London open, and the agent had no way to distinguish them. Once both features were added, all 5 seeds converged within 800K steps.
These defects produce convincing backtest results and immediate live failures. Before adjusting a single hyperparameter, the observation vector needs to be right.
The Two Observations Most Implementations Omit
Position state is the most commonly omitted feature. A standard implementation feeds the agent a window of price and indicator data and expects it to learn when to buy, hold, or exit. But an RSI reading of 65 in an uptrend means three different things depending on the agent’s current position:
- Flat: Consider entering long
- Long: Consider holding or adding
- Short: Consider closing or reversing

An agent whose observation vector does not include position state receives all three scenarios with the same input tensor. The gradient signal at that observation is contradictory — update toward “enter long” from one batch, update away from it from another. The policy never converges to rational behavior because no consistent mapping exists from the observed state to the optimal action. This is called state aliasing: multiple distinct situations are represented by the same observation, so the agent cannot distinguish between them.
The fix is simple. The observation vector must include, at minimum: current position direction (encoded as -1 for short, 0 for flat, 1 for long), current unrealized P&L as a fraction of account equity, and time in trade (number of bars the current position has been open).
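A minimal sketch of that feature block in Python. The function signature is illustrative: the environment is assumed to track entry price, position size, and bar count, and conversion of P&L into the account currency is omitted for brevity.

```python
import numpy as np

def position_state_features(direction: int, entry_price: float,
                            current_price: float, units: float,
                            equity: float, bars_in_trade: int) -> np.ndarray:
    """Position-state block: direction (-1 short, 0 flat, +1 long),
    unrealized P&L as a fraction of equity, and bars in trade."""
    # Unrealized P&L in quote-currency terms; account-currency
    # conversion is omitted in this sketch.
    unrealized = direction * (current_price - entry_price) * units
    pnl_frac = unrealized / equity if direction != 0 else 0.0
    return np.array([float(direction), pnl_frac, float(bars_in_trade)])
```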
Session encoding is the second omission. EURUSD during the Asian session and EURUSD during the London open are not the same instrument from a reward distribution perspective. Spread, average true range, directional momentum behavior, and liquidity all differ materially by session. An agent trained on 3 years of EURUSD data without session encoding is averaging these different environments into a single undifferentiated training signal.
Encoding is straightforward: hour of day as a sine-cosine pair (to preserve the circular nature of time), plus binary indicators for the major session overlaps. These four additional features give the agent the context to learn session-conditional policies rather than a single blurred average policy.
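A sketch of those four features. The overlap windows below are illustrative UTC approximations; the exact hours shift with daylight saving time.

```python
import numpy as np

def session_features(hour: int) -> np.ndarray:
    """Cyclical hour-of-day encoding plus session-overlap flags."""
    angle = 2.0 * np.pi * hour / 24.0
    # Approximate UTC overlap windows; adjust for DST in production.
    london_ny = 1.0 if 13 <= hour < 16 else 0.0
    tokyo_london = 1.0 if 7 <= hour < 9 else 0.0
    return np.array([np.sin(angle), np.cos(angle), london_ny, tokyo_london])
```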
Observation Window Depth — How Far Back Is Far Enough
The instinct is to include as many historical bars as possible. More history means more context, and more context should produce better decisions. This logic has a ceiling that most practitioners hit without recognizing it.
For a 15-minute bar EURUSD strategy, the useful observation window is typically 10–20 bars. That represents 2.5 to 5 hours of market history — enough to capture current momentum, recent swing structure, and intraday session context. Beyond that, you are adding bars from a previous trading session that the strategy’s signal frequency does not actually need.
The research basis for this is precise. A 2026 study comparing PPO, TD3, and SAC on partially observable continuous control tasks found that multi-step bootstrapping with n=2–5 steps provides sufficient implicit state estimation for most real-world tasks without requiring LSTM layers or large observation windows. Multi-step returns aggregate reward signal across n time steps, giving the value function access to recent trajectory context — which is why a large observation window is not a substitute for multi-step bootstrapping, and why inflating window depth beyond the strategy’s natural lookback adds noise rather than temporal resolution.
The empirical protocol: test three window depths (10, 20, 30 bars) with identical hyperparameters, across 5 seeds each. Compute the median out-of-sample Sharpe across the 5 seeds for each depth. If the jump from 10 to 20 bars improves median Sharpe materially and the jump from 20 to 30 does not, 20 bars is the correct depth. Stop there.
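The protocol is mechanical enough to script. In the sketch below, `train_and_eval(depth, seed)` is a stand-in for your own train-plus-backtest pipeline returning out-of-sample Sharpe, and the materiality threshold is a placeholder to tune.

```python
import numpy as np

def select_window_depth(train_and_eval, depths=(10, 20, 30),
                        n_seeds=5, min_gain=0.1):
    """Pick the shallowest depth beyond which median OOS Sharpe
    stops improving materially. `train_and_eval` is hypothetical."""
    medians = {d: np.median([train_and_eval(d, seed) for seed in range(n_seeds)])
               for d in depths}
    best = depths[0]
    for deeper in depths[1:]:
        if medians[deeper] - medians[best] >= min_gain:
            best = deeper   # material improvement: accept the deeper window
        else:
            break           # diminishing returns: stop here
    return best, medians
```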
On DRL projects I have inherited for rescue work, I routinely find observation vectors with 50 to 100 historical bars. When I ask why, the answer is always the same: “more seemed safer.” It is not. A 100-bar window on 15-minute data feeds the network price action from 25 hours ago — prices from a full prior day’s session that contribute noise to the gradient without contributing signal. The network capacity spent modeling that noise is capacity not spent on the features that actually matter.
The practical limit is also computational. A 100-bar observation window with 15 features is a 1,500-dimensional input tensor. For a 20-bar window with the same features, it is 300. Training time scales with input dimensionality. If your observation window is inflated, your training budget is being spent on noise.
Normalization: The Choices and the Leakage Traps
Both the normalization scheme and the way its parameters are fitted introduce failure modes that most implementations miss.
Three normalization rules for FX DRL:
Rule 1 — Log returns for price features. Convert OHLC data to log return form: `log(close_t / close_{t-1})`. Log returns are stationary, magnitude-comparable across time periods, and unbounded in both directions. They allow the network to learn “this bar moved 0.3 standard deviations up” rather than “this bar closed at 1.08437.”
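In pandas this is one line, with the first (undefined) bar left as NaN rather than filled:

```python
import numpy as np
import pandas as pd

def log_returns(close: pd.Series) -> pd.Series:
    """log(close_t / close_{t-1}); bar 0 has no predecessor and
    stays NaN (drop it, never fill it)."""
    return np.log(close / close.shift(1))
```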
Rule 2 — Rolling Z-score for indicators. ATR, volume, spread, and other indicators drift in absolute terms over time. A 20-pip ATR in 2020 represents different market volatility than a 20-pip ATR in a post-rate-decision session in 2024. Normalize with a rolling Z-score: subtract the rolling mean and divide by the rolling standard deviation, using a window equal to your training lookback period. The key constraint: the rolling window must be causal, using only history available when the observation at bar t is built (bars through t-1), never the full dataset.
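A causal rolling Z-score in pandas; the `shift(1)` keeps bar t's own value out of the statistics used to normalize it:

```python
import pandas as pd

def causal_zscore(x: pd.Series, window: int) -> pd.Series:
    """Rolling Z-score using only bars through t-1."""
    mean = x.rolling(window).mean().shift(1)
    std = x.rolling(window).std().shift(1)
    return (x - mean) / std
```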
Rule 3 — Min-max for bounded features. RSI runs from 0 to 100. Position size as a fraction of account runs from 0 to 1. Session indicators are binary. These features have stable, known bounds and should be scaled linearly to [0, 1] or [-1, 1] using those bounds — not Z-scored, because Z-scoring a bounded feature distorts the natural distribution.
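The scaling itself is one line; the only decision is the target range:

```python
def minmax_scale(x, lo, hi, out_lo=-1.0, out_hi=1.0):
    """Linear scaling from known bounds [lo, hi] to [out_lo, out_hi].
    Example: minmax_scale(rsi, 0, 100) maps RSI into [-1, 1]."""
    return out_lo + (x - lo) * (out_hi - out_lo) / (hi - lo)
```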
The leakage trap: if you compute Z-score normalization parameters (mean, standard deviation) from the full training dataset before splitting into train and test, the normalization of the first training bar uses statistics derived from future data. The agent’s training environment has absorbed information about what happens later in the sample — not through the reward function, but through the feature scaling. The resulting training metrics are inflated.
The diagnostic: fit normalization parameters on the first 50% of training data only. Apply those same parameters to the full training period and recompute out-of-sample Sharpe. If Sharpe changes materially between the two normalization approaches, you have normalization leakage. The fix is to use an expanding window for normalization parameters anchored at the training start date — parameters grow as you move through the training period, but never reach beyond the current bar.
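In pandas, the expanding-window version looks like this; `min_periods` sets how much history must accumulate before the Z-score becomes valid, and the `shift(1)` keeps it strictly causal:

```python
import pandas as pd

def expanding_zscore(x: pd.Series, min_periods: int = 100) -> pd.Series:
    """Z-score with parameters from an expanding window anchored at
    the training start; statistics never reach past bar t-1."""
    mean = x.expanding(min_periods).mean().shift(1)
    std = x.expanding(min_periods).std().shift(1)
    return (x - mean) / std
```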
The Lookahead Audit — Five Checks Before Training
Lookahead bias in DRL is more destructive than in supervised ML. A gradient boosting model trained on leaky features will produce inflated accuracy scores. A DRL agent trained on a leaky environment will find and exploit the leak with superhuman precision — learning exactly which action sequence extracts maximum reward from the forward-looking signal. The policy looks like it works until the first live bar.
Run these five checks before beginning any training run:

1. Rolling indicator timing. At bar t, is any indicator using bar t’s own close in the same observation step it is fed to the agent? The observation at bar t must be computable using only data through bar t-1. If you compute ATR using the close at bar t and then use that ATR as a feature in the observation at bar t, the agent sees a feature derived from the same price it is about to act on — that is lookahead.
2. Lookback fill period. A 20-bar ATR requires 20 bars of history before it produces a valid value. What does your implementation use for bars 1 through 19 of the training period? If those positions are back-filled from bar 20's value (an easy mistake with pandas rolling output, which leaves them as NaN), you have leakage in the first 19 bars of every training run; see the sketch after this list.
3. Normalization parameter scope. Are rolling mean and standard deviation computed from a causal expanding window anchored at training start, or from a window that can reach into future data? Check the implementation — not the intent.
4. Reward function terms. Does any term in the reward function reference bar t+1? This is the most common and most destructive form of leakage. “The trade was profitable if the next bar’s open was above the entry price” sounds like a reasonable reward formulation. It is a direct lookahead. The agent learns to predict the next bar’s open with perfect accuracy from the leak — nothing you observe in training metrics tells you this is happening.
5. Regime labels and filters. If you include a volatility regime label or market state filter as a feature, verify it is computed using only past data at inference time. Many published implementations compute regime labels over the full dataset for convenience and then split the labeled data — the label at bar 500 was computed with knowledge of bar 5,000.
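To make check 2 concrete, a minimal pandas illustration on a synthetic series:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a real true-range series.
tr = pd.Series(np.random.default_rng(0).uniform(0.0005, 0.002, 500))

atr = tr.rolling(20).mean()   # first 19 positions are NaN: no valid ATR yet
leaky = atr.bfill()           # WRONG: copies the first valid ATR backward
                              # into the warm-up bars (future data leakage)
clean = atr.iloc[19:]         # RIGHT: start the training period after warm-up
```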
Verification test: once the audit is complete, deploy the trained policy on a synthetic random-walk price series with no actual market signal — generated as a cumulative sum of normally distributed random increments. If the policy produces above-random cumulative reward on the random series, the environment has leakage. A correctly specified environment produces near-zero or slightly negative cumulative reward on random data (due to transaction costs).
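Generating the series is a few lines. The environment and policy in the usage sketch are placeholders for your own, shown as comments because their interfaces vary:

```python
import numpy as np

def random_walk_prices(n_bars: int, start: float = 1.10,
                       sigma: float = 0.0004, seed: int = 0) -> np.ndarray:
    """Signal-free price series: cumulative sum of normal increments."""
    rng = np.random.default_rng(seed)
    return start + np.cumsum(rng.normal(0.0, sigma, n_bars))

# Hypothetical usage with a gym-style environment and trained policy:
# env = TradingEnv(prices=random_walk_prices(10_000))
# obs, _ = env.reset()
# total, done = 0.0, False
# while not done:
#     obs, reward, terminated, truncated, _ = env.step(policy(obs))
#     total += reward
#     done = terminated or truncated
# # Expect total near zero or slightly negative (transaction costs).
```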
Validating the Environment Before Touching Hyperparameters
When a DRL agent fails to converge — flat reward curves across all seeds, oscillating without trend — the standard response is to increase the learning rate, switch from Adam to RMSProp, or try a different algorithm. In the majority of cases I encounter, the problem is in the environment, not the optimizer.
Three checks identify environment defects before any hyperparameter experimentation begins:
Random agent test. Deploy a policy that selects actions uniformly at random (no network, no training) and record mean reward over 1,000 steps. The expected value should be close to zero, with a small negative offset from transaction costs. If mean random reward is substantially positive, the reward function has a systematic bias — it rewards the act of trading itself, regardless of market outcome. The agent does not need to learn anything to accumulate positive reward; it just needs to trade frequently. If mean random reward is substantially negative beyond what transaction costs explain, there is a directional bias against action — the agent will learn to do nothing.
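A sketch of the test, assuming a Gymnasium-style environment interface (`reset()` returning `(obs, info)` and a five-tuple from `step`):

```python
import numpy as np

def random_agent_mean_reward(env, n_steps: int = 1_000) -> float:
    """Mean per-step reward of a uniformly random policy: no network,
    no training, just env.action_space.sample()."""
    obs, _ = env.reset(seed=0)
    rewards = []
    for _ in range(n_steps):
        obs, reward, terminated, truncated, _ = env.step(env.action_space.sample())
        rewards.append(reward)
        if terminated or truncated:
            obs, _ = env.reset()
    return float(np.mean(rewards))
```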
Position state reconstruction. Take the observation vector at any bar t from the training environment and attempt to unambiguously determine the current position from that vector alone. If you cannot, position state is absent or insufficiently encoded. No external information should be required — the observation vector must be self-contained.
5-seed convergence shape. Train the agent across 5 random seeds and plot the training reward curves. In a correctly specified environment, all 5 curves should show at minimum a weak upward trend after 200K–300K steps — not necessarily convergence, but directional improvement. If all 5 curves are flat or oscillating without trend after 500K steps, the environment is the problem. A sound environment with the wrong algorithm still produces a weak signal. An unsound environment with any algorithm produces noise.
Before adjusting any hyperparameter on a DRL trading project, I run these three checks. On code rescue work at barmenteros FX, they identify the environment defect within the first 30 minutes of diagnosis — usually before I have looked at a single line of training code.
The Build Order for a Sound Observation Vector
To summarize the sequence:
- Start with what must be there: position state (direction, unrealized P&L, bars in trade) + session encoding (hour sine-cosine + session overlap flags)
- Add price features as log returns, not raw prices
- Add indicators normalized by rolling Z-score (causal, expanding window)
- Add bounded features normalized by min-max using their natural bounds
- Set window depth by empirical test (10/20/30 bars, 5 seeds each, stop at diminishing returns)
- Run the lookahead audit before first training run
- Run the three environment validation checks before touching hyperparameters
Once the observation vector is validated, the next engineering problem is reproducing it exactly in the MetaTrader execution bridge. Any divergence between training-time feature computation and inference-time feature computation is distribution shift — the agent receives observations in live markets that it never saw during training. That gap is where most DRL trading systems fail at deployment, even when the training environment was correctly specified. For the deployment architecture — named pipe vs. TCP socket bridge, feature parity validation, and inference latency requirements — see Deep Reinforcement Learning for Trading.