Question 1

What is the difference between overfitting and lookahead bias in ML trading?

Accepted Answer

They are separate failure modes. Lookahead bias means the model or feature pipeline used future information during training — information that does not exist at signal time in live trading. The most common form: normalization calculated on the full dataset, so a bar's scaled value reflects data from years later. Overfitting means the model memorized patterns in the training data that do not repeat out of sample. Both problems are tested differently and fixed differently.
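The normalization case can be checked with a minimal sketch. It assumes a synthetic random-walk price series and two hypothetical helpers, `full_sample_zscore` and `rolling_zscore`; neither comes from the original answer. The idea is to change only the last 100 bars and see whether any earlier scaled values move. If they do, the pipeline is reading the future.

```python
import numpy as np

def full_sample_zscore(x):
    """Leaky: every bar is scaled with the mean/std of the whole series."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def rolling_zscore(x, window):
    """Causal: bar t is scaled using only bars t-window .. t-1 (an assumed convention)."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    for t in range(window, len(x)):
        past = x[t - window:t]
        sd = past.std()
        if sd > 0:
            out[t] = (x[t] - past.mean()) / sd
    return out

# Synthetic prices (illustrative only, not market data).
rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(0, 1, 500))

# Alter only the "future": shift the last 100 bars.
shocked = prices.copy()
shocked[400:] += 50.0

# Did scaled values for bars before index 400 change?
leaky_changed = not np.allclose(
    full_sample_zscore(prices)[:400], full_sample_zscore(shocked)[:400]
)
causal_changed = not np.allclose(
    rolling_zscore(prices, 50)[50:400], rolling_zscore(shocked, 50)[50:400]
)
print("full-sample scaling affected by future data:", leaky_changed)
print("rolling scaling affected by future data:", causal_changed)
```

Overfitting would not show up in this check at all, which is one reason the two problems need separate tests.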

Question 2

Why does a single train/test split fail to validate an ML trading model?

Accepted Answer

Because it tests performance in one specific historical window, under one specific market regime. A model trained on 2018–2021 data and tested on 2022 data only tells you how the model performed during the transition from post-pandemic stimulus to rate-hiking conditions. That is one data point about the model’s ability to generalize.

Walk-forward validation solves this by running multiple train/test pairs across the full data history — each fold’s test period is different, covering different volatility regimes, trend environments, and liquidity conditions. The result is a distribution of performance metrics rather than a single figure. If a model holds up across 5–8 different out-of-sample periods, that is meaningful evidence of generalization.
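As a rough sketch of how the train/test pairs could be generated, the hypothetical helper below (`walk_forward_splits`, not from any library) rolls a fixed-size training window forward by one test period at a time. The window sizes are illustrative assumptions.

```python
def walk_forward_splits(n_samples, train_size, test_size, step=None):
    """Yield (train, test) index ranges for a rolling walk-forward.

    Each test range starts exactly where its training range ends,
    so no test bar is ever seen during training for that fold.
    """
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step

# Illustrative sizes: 2000 bars, 1000-bar training window, 200-bar test window.
folds = list(walk_forward_splits(n_samples=2000, train_size=1000, test_size=200))
for i, (train, test) in enumerate(folds):
    print(f"fold {i}: train {train.start}-{train.stop - 1}, test {test.start}-{test.stop - 1}")
```

With these sizes the generator produces five non-overlapping test periods. Each fold's metrics would then be collected into the distribution described above.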

Question 3

What is Combinatorial Purged Cross-Validation (CPCV) and when should it be used?

Accepted Answer

CPCV is a validation framework that addresses two problems standard cross-validation ignores in financial time series: leakage across train/test boundaries, which it handles by purging training bars whose labels overlap the test period and by applying an embargo to the bars immediately after a boundary (these carry contaminated information); and the shortage of independent test estimates, which it handles by combinatorial path generation (producing more test paths from the same dataset). Use CPCV when the dataset is too small for standard walk-forward to produce enough test periods, or when low signal frequency leaves standard folds short of the 30–50 out-of-sample trades needed per fold. For most FX strategies using daily or 4H bars, standard purged walk-forward with an embargo is sufficient.
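The purge and embargo steps can be sketched as follows. This is a simplified illustration, not a full CPCV implementation. The helper `purged_train_indices` and its parameters are hypothetical, and it assumes each label at bar t is built from prices up to bar t + label_horizon. The last lines show the split and path counts for one assumed configuration (6 groups, 2 held out per split).

```python
from math import comb

def purged_train_indices(n_samples, test_start, test_end, label_horizon, embargo):
    """Training indices around a test block [test_start, test_end).

    Purge: drop training bars before the block whose label window
    (t, t + label_horizon] reaches into the test block.
    Embargo: drop the first `embargo` bars right after the test block.
    In practice the embargo is often chosen at least as long as the
    label horizon (an assumption here, not a rule from the answer above).
    """
    train = []
    for t in range(n_samples):
        if test_start <= t < test_end:
            continue  # test bar
        if t < test_start and t + label_horizon >= test_start:
            continue  # purged: label overlaps the test block
        if test_end <= t < test_end + embargo:
            continue  # embargoed
        train.append(t)
    return train

train = purged_train_indices(
    n_samples=100, test_start=40, test_end=60, label_horizon=5, embargo=3
)
print(len(train), "training bars kept")

# Combinatorial part: with N groups and k test groups per split,
# there are C(N, k) splits, which recombine into k/N * C(N, k) backtest paths.
n_groups, k_test = 6, 2
n_splits = comb(n_groups, k_test)
n_paths = n_splits * k_test // n_groups
print(n_splits, "splits,", n_paths, "paths")
```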

Question 4

How do I know if an ML model is overfit to the training data?

Accepted Answer

The clearest sign: training Sharpe ratio and out-of-sample Sharpe ratio diverge significantly across walk-forward folds. Other indicators include: high feature count relative to sample size (more features than roughly 1/20th of the training sample count), performance collapse in the first test fold, in-sample performance that keeps improving monotonically as model complexity is added, and high variance in walk-forward test results (some folds work, others fail badly).
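A few of these checks are easy to automate once per-fold results exist. The sketch below is one possible way to do it. The function `overfit_report`, the rule that a first-fold Sharpe at or below zero counts as a "collapse", and all input numbers are illustrative assumptions, not measured results.

```python
import statistics

def overfit_report(train_sharpes, test_sharpes, n_features, n_train_samples):
    """Summarize simple overfitting warning signs from per-fold Sharpe ratios."""
    gaps = [tr - te for tr, te in zip(train_sharpes, test_sharpes)]
    return {
        # Average divergence between in-sample and out-of-sample Sharpe.
        "mean_gap": statistics.mean(gaps),
        # Dispersion of out-of-sample results across folds.
        "test_sharpe_std": statistics.stdev(test_sharpes),
        # Heuristic from the answer above: features > ~1/20th of training samples.
        "too_many_features": n_features > n_train_samples / 20,
        # Assumed definition of "collapse": first test fold has Sharpe <= 0.
        "first_fold_collapse": test_sharpes[0] <= 0,
    }

# Hypothetical per-fold values for illustration.
report = overfit_report(
    train_sharpes=[2.1, 2.4, 2.0, 2.3],
    test_sharpes=[0.3, -0.5, 1.1, -0.2],
    n_features=80,
    n_train_samples=1000,
)
for name, value in report.items():
    print(name, value)
```

No single flag is conclusive; the report is only meant to make the comparison across folds explicit.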

Question 5

What performance metrics actually matter for ML trading validation?

Accepted Answer

The most reliable metrics: out-of-sample Sharpe ratio per fold (risk-adjusted return), Calmar ratio (return relative to max drawdown), walk-forward consistency (standard deviation of Sharpe across folds — low variance means repeatable edge), information coefficient (correlation between predicted and actual returns), and profit factor. The one metric to distrust: maximum backtest return — it rewards overfitting more than any other figure.
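These metrics can be computed per fold with a few lines of NumPy. The sketch below makes several assumptions: returns are simple per-bar returns, annualization uses 252 periods per year, the information coefficient is taken as a Pearson correlation (rank correlation is a common alternative), and the sample inputs are made up for illustration.

```python
import numpy as np

def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    r = np.asarray(returns, dtype=float)
    return float(np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1))

def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    r = np.asarray(returns, dtype=float)
    equity = np.concatenate(([1.0], np.cumprod(1.0 + r)))  # start at 1.0
    peaks = np.maximum.accumulate(equity)
    return float(np.max(1.0 - equity / peaks))

def calmar_ratio(returns, periods_per_year=252):
    """Annualized compound return divided by maximum drawdown."""
    r = np.asarray(returns, dtype=float)
    years = len(r) / periods_per_year
    annual_return = np.prod(1.0 + r) ** (1.0 / years) - 1.0
    return float(annual_return / max_drawdown(r))

def profit_factor(trade_pnls):
    """Gross profit divided by gross loss."""
    p = np.asarray(trade_pnls, dtype=float)
    return float(p[p > 0].sum() / -p[p < 0].sum())

def information_coefficient(predicted, actual):
    """Pearson correlation between predicted and realized returns."""
    return float(np.corrcoef(predicted, actual)[0, 1])

def walk_forward_consistency(fold_sharpes):
    """Standard deviation of out-of-sample Sharpe across folds (lower is better)."""
    return float(np.std(fold_sharpes, ddof=1))

# Illustrative inputs only.
print(profit_factor([100, -50, 200, -50]))
print(max_drawdown([0.10, -0.50, 0.20]))
print(walk_forward_consistency([0.9, 1.1, 0.8, 1.2, 1.0]))
```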

Question 6

How many out-of-sample trades do I need before a result is statistically meaningful?

Accepted Answer

A minimum of 30 out-of-sample trades per fold is the practical floor. Below 30, confidence intervals on win rate and Sharpe ratio are so wide that positive results are indistinguishable from luck. For infrequent strategies (4 signals per month), this means over 7 months of out-of-sample data per fold. For 5-fold walk-forward, that is over 3 years of test data required.
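The arithmetic behind these figures, plus a rough sense of how wide the uncertainty is at 30 trades, can be worked through as below. The confidence interval uses the normal approximation for a proportion, and the 55% win rate is an assumed example value.

```python
import math

def months_needed(min_trades, signals_per_month):
    """Months of out-of-sample data needed to collect `min_trades` trades."""
    return min_trades / signals_per_month

def win_rate_ci_halfwidth(win_rate, n_trades, z=1.96):
    """Approximate 95% half-width for an observed win rate (normal approximation)."""
    return z * math.sqrt(win_rate * (1 - win_rate) / n_trades)

per_fold_months = months_needed(min_trades=30, signals_per_month=4)
total_years = per_fold_months * 5 / 12  # 5-fold walk-forward
halfwidth = win_rate_ci_halfwidth(win_rate=0.55, n_trades=30)

print(f"months per fold: {per_fold_months}")
print(f"years of test data for 5 folds: {total_years}")
print(f"55% win rate over 30 trades: +/- {halfwidth:.1%}")
```

At 30 trades the interval around a 55% win rate spans roughly 18 percentage points either side, so it still includes 50%. This is why 30 is described as a floor rather than a target.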

Question 7

What is the correct way to handle normalization in a backtested ML feature pipeline?

Accepted Answer

Normalization must be calculated only from data available at each signal timestamp, never from the full dataset. Use a rolling window for any statistic derived from price data. In scikit-learn: fit StandardScaler on the training fold only, then transform the test fold using those training-fold statistics, and repeat this separately for each walk-forward fold. Applying fit_transform() to the full dataset leaks test-period distribution information (mean and variance) into the training features and invalidates the backtest.
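A minimal sketch of the per-fold pattern in scikit-learn follows. It assumes synthetic feature data and uses `TimeSeriesSplit` (an expanding-window splitter) as a stand-in for whatever walk-forward scheme is actually in use. Model fitting is omitted.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

# Synthetic, trending features (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 3)).cumsum(axis=0)

splitter = TimeSeriesSplit(n_splits=5)
scaled_folds = []
for train_idx, test_idx in splitter.split(X):
    scaler = StandardScaler()                     # fresh scaler for every fold
    X_train = scaler.fit_transform(X[train_idx])  # statistics from the training fold only
    X_test = scaler.transform(X[test_idx])        # reuse training-fold statistics
    scaled_folds.append((X_train, X_test))
    print(f"train mean {X_train.mean():+.3f}, test mean {X_test.mean():+.3f}")
```

The scaled test fold will usually not have zero mean or unit variance. That is expected: it reflects how far the test period has drifted from the training period, which is exactly the information a live model would face.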

Metric	What it measures	Why it matters
Sharpe ratio (out-of-sample, per fold)	Risk-adjusted return	More comparable across strategies and periods than raw return
Calmar ratio	Return relative to maximum drawdown	Relevant for risk management — especially for leveraged FX accounts
Walk-forward consistency	Standard deviation of Sharpe across folds	Low variance means the edge is repeatable, not regime-specific
Information coefficient (IC)	Correlation between predicted and actual returns	Direct measure of model predictive accuracy, independent of execution
Profit factor	Gross profit / gross loss	Useful for assessing whether the edge is large enough to survive spread and slippage

ML Backtesting: Avoiding Overfitting and Lookahead Bias

Frequently Asked Questions
