A Taxonomy of Backtest Lies: Survival Bias, Lookahead Bias, and …

George Box told us all models are wrong, but some are useful. The same applies to backtests, except the failure mode is worse. A wrong model gives you a bad prediction. A wrong backtest gives you a bad prediction wrapped in the illusion of empirical evidence.

Every backtest I have ever written was biased. Every backtest you have ever written was biased. The question is never “is this backtest biased?” but rather: which biases are present, how large are they, and does any signal survive after accounting for them? That shift in framing, from “is my backtest good?” to “how bad is my backtest and in which direction?”, is the first step toward treating backtest results with the skepticism they deserve.

I built this taxonomy because I kept making the same mistakes. Not the obvious ones. Nobody with six months of experience runs a backtest without transaction costs and takes the result at face value. The dangerous biases are the subtle ones: a pandas index alignment that silently leaks one bar of future data into your signal, a data source that quietly drops delisted tickers, a parameter search over a thousand combinations without any correction for multiple comparisons. These are the biases that produce plausible-looking results. They don’t give you a Sharpe ratio of 10 that screams “something is wrong.” They give you a Sharpe of 1.4 that looks like a real edge, and you don’t discover the problem until you deploy capital.

This article catalogs the biases I’ve encountered, organized by source: data biases (what you feed the backtest), methodology biases (how you run the backtest), and reporting biases (how you interpret results). Each section follows a consistent structure: mechanism, direction, magnitude from the literature where available, detection methods, mitigation strategies, and code for automated detection. The goal is a reference you can consult when reviewing any backtest, whether your own or someone else’s, and a set of automated checks you can integrate into your validation pipeline.

Survivorship Bias

Survivorship bias is the one everyone learns first. It is also the one that most people think they’ve solved when they haven’t.

The mechanism is simple. You backtest on a universe of assets that only includes those that exist today. Companies that went bankrupt, were acquired, or delisted for any reason are excluded from the historical universe. Since delisted companies are disproportionately failures, excluding them inflates historical returns. Your backtest picks winners from a pool where the biggest losers have been silently removed.

How Large Is the Survivorship Effect?

The academic literature has quantified this consistently. Elton, Gruber, and Blake (1996) found that survivorship bias in mutual fund returns runs approximately 0.9% per year. That sounds small until you compound it. Over a ten-year backtest, 0.9% per year compounds to roughly 9.4% of cumulative overstatement, which can be the difference between a mediocre strategy and an apparently good one.

For hedge funds, the bias is worse. Rohleder, Scholz, and Wilkens (2011) estimated survivorship bias at 2-4% per year, because hedge fund attrition rates are higher and the losses before closure tend to be steeper. For individual equities, backtesting on the current S&P 500 constituents rather than historical point-in-time constituents can inflate annual returns by 1-2%.

The direction is almost always positive. I say “almost” because there are edge cases involving merger targets where the surviving entity underperforms the acquired one, but in practice, survivorship bias pushes your numbers up.

The Momentum Trap

Survivorship bias interacts badly with momentum strategies, and this combination deserves special attention. Momentum buys recent winners and sells recent losers. Survivorship bias removes the most extreme losers from the universe. The result: the strategy’s sell signals get artificially neutered because the stocks that would have generated the worst short-side performance have been deleted from history. Meanwhile, the buy signals look great because the winners that eventually got acquired at a premium are still in the data. Both sides of the ledger are distorted in the strategy’s favor.

I once ran a momentum strategy on a dataset I thought was survivorship-free because the vendor advertised it as such. When I cross-referenced the ticker count against known constituent histories, roughly 8% of tickers that should have been present at various historical dates were missing. These were small-cap names that had been delisted, and they would have been prime short candidates during their decline phase. The corrected backtest lost about 40% of its apparent alpha.

Detecting Survivorship in Your Data

The detection approach is straightforward. For each historical date in your backtest, compare the set of tickers you’re trading against the actual set of constituents that existed at that date:

from dataclasses import dataclass


@dataclass(frozen=True)
class SurvivorshipDiagnostic:
    """Results from comparing backtest universe against point-in-time constituents.

    Attributes:
        date: The historical date being checked.
        missing_tickers: Tickers that existed at this date but are absent
            from the backtest universe.
        extra_tickers: Tickers in the backtest universe that didn't exist
            at this date (possibly added later).
        missing_fraction: Fraction of actual constituents missing from
            the backtest universe.
    """
    date: str
    missing_tickers: frozenset[str]
    extra_tickers: frozenset[str]
    missing_fraction: float


def detect_survivorship_bias(
    backtest_tickers_by_date: dict[str, set[str]],
    pit_constituents_by_date: dict[str, set[str]],
) -> list[SurvivorshipDiagnostic]:
    """Compare backtest universe against point-in-time constituents.

    For each date, checks whether the backtest universe matches the actual
    constituents that existed at that date. Missing tickers suggest
    survivorship bias; extra tickers suggest lookahead in universe
    construction.

    Args:
        backtest_tickers_by_date: Mapping from date string to the set of
            tickers used in the backtest at that date.
        pit_constituents_by_date: Mapping from date string to the set of
            tickers that actually existed in the target universe at that
            date (e.g., actual S&P 500 membership).

    Returns:
        A list of SurvivorshipDiagnostic results, one per date, sorted
        by date. Only dates present in both inputs are checked.
    """
    diagnostics: list[SurvivorshipDiagnostic] = []
    common_dates = sorted(
        set(backtest_tickers_by_date.keys()) & set(pit_constituents_by_date.keys())
    )

    for date in common_dates:
        actual = pit_constituents_by_date[date]
        used = backtest_tickers_by_date[date]
        missing = actual - used
        extra = used - actual
        fraction = len(missing) / len(actual) if actual else 0.0

        diagnostics.append(SurvivorshipDiagnostic(
            date=date,
            missing_tickers=frozenset(missing),
            extra_tickers=frozenset(extra),
            missing_fraction=fraction,
        ))

    return diagnostics

The extra_tickers field catches a subtler problem: tickers in your backtest universe that didn’t exist at that historical date. These are often companies that were added to the index later. Including them early is a form of lookahead bias in universe construction.

Fixing Survivorship Bias

Use point-in-time constituent data. This means S&P 500 membership as of each rebalance date, not today’s membership applied retroactively. CRSP provides this. Most free data sources do not.

Include delisting returns. When a company gets delisted, there is typically a terminal return (positive for acquisitions, deeply negative for bankruptcies) that occurs over the final days of trading. CRSP tracks these. If your data ends the price series at the last regular trading day without the delisting return, you’re missing the event that caused the delisting, which is often the most informative data point.
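
To see how much the terminal event can matter, here is a minimal sketch that appends a delisting return to a single ticker's return series. The helper name append_delisting_return and the toy numbers are illustrative assumptions; nothing here mirrors a specific vendor's schema, CRSP's included.

```python
import pandas as pd


def append_delisting_return(
    returns: pd.Series,
    delisting_date: pd.Timestamp,
    delisting_return: float,
) -> pd.Series:
    """Append a terminal delisting return to one ticker's return series.

    Hypothetical helper: delisting_date and delisting_return are assumed
    to come from whatever delisting source you have available.
    """
    if not returns.empty and delisting_date <= returns.index[-1]:
        raise ValueError("delisting_date must fall after the last regular return")
    terminal = pd.Series([delisting_return], index=[delisting_date])
    return pd.concat([returns, terminal])


# Toy example (made-up numbers): a stock drifts down, then loses 85%
# on the delisting event itself.
regular = pd.Series(
    [-0.02, -0.05, -0.10],
    index=pd.to_datetime(["2020-03-02", "2020-03-03", "2020-03-04"]),
)
full = append_delisting_return(regular, pd.Timestamp("2020-03-05"), -0.85)

without_event = (1 + regular).prod() - 1
with_event = (1 + full).prod() - 1
print(f"Total return without delisting event: {without_event:.1%}")
print(f"Total return with delisting event:    {with_event:.1%}")
```

In this toy case the series that stops at the last regular trading day shows a moderate loss, while the series that includes the terminal return shows a near-total one.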

If point-in-time data isn’t available, estimate the bias: run your backtest on a known-survivorship-free subset (the largest 100 stocks, say, where attrition is rare) and compare the results against the full universe. The difference gives you a rough bound on the survivorship bias magnitude.

Lookahead Bias

Lookahead bias is the most dangerous bias in the taxonomy because it can silently produce catastrophic distortions. A single lookahead bug can transform a losing strategy into an apparent winner with a Sharpe ratio above 3. I have done this to myself. More than once.

The mechanism is using information at time t that was not available until time t+k. The backtest knows the future. There are three common forms, and they range from obvious to insidious.

Direct Lookahead

This is the simplest form: using tomorrow’s close to decide today’s trade. Easy to describe, easy to spot in theory, and yet it keeps happening in practice because of how pandas handles index alignment.

Here is the trap. You compute a signal from price data. You use that signal to generate trades. If the signal and the trade happen on the same bar, and the signal requires that bar’s close price, you’ve assumed you can compute a signal from a price that doesn’t exist until after the trading session ends. You cannot trade on information you don’t have yet.

Data Revision Lookahead

Economic data gets revised. Sometimes dramatically. Initial GDP releases can differ from final revisions by a full percentage point. Employment numbers get revised by tens of thousands of jobs. If your strategy uses macro data, you need to use the data as it was initially released, not the final revised version. The difference between the initial release and the final revision is future information.

This one bit me with a macro-overlay strategy that used employment data to tilt equity exposure. The backtest looked excellent. Then I discovered I was using revised employment figures from the FRED database, which reflects the latest revision, not the number that was published on the release date. The ALFRED database maintains historical vintages of each release, showing exactly what number was available on each date. When I switched to ALFRED data, the strategy’s Sharpe dropped by a third.
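
The vintage idea can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not ALFRED's actual schema: the column names observation_date, release_date, and value are hypothetical, and the figures are made up.

```python
import pandas as pd


def value_as_known_on(vintages: pd.DataFrame, as_of: str) -> pd.Series:
    """Return each observation's value as it was known on `as_of`.

    Assumes a long-format table with hypothetical columns:
    'observation_date' (the period the number describes),
    'release_date' (when that vintage was published), and 'value'.
    """
    as_of_ts = pd.Timestamp(as_of)
    # Discard every vintage published after the as-of date.
    known = vintages[vintages["release_date"] <= as_of_ts]
    # Among the remaining vintages, keep the most recent per observation.
    latest = (
        known.sort_values("release_date")
        .groupby("observation_date")["value"]
        .last()
    )
    return latest.sort_index()


# Made-up vintages: January is first released in February, then revised
# in March, when February's first release also appears.
vintages = pd.DataFrame({
    "observation_date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-02-01"]),
    "release_date": pd.to_datetime(["2020-02-07", "2020-03-06", "2020-03-06"]),
    "value": [200.0, 250.0, 180.0],
})

early = value_as_known_on(vintages, "2020-02-15")
late = value_as_known_on(vintages, "2020-03-10")
print(early)
print(late)
```

A backtest that makes its February decision with the revised January figure is using a number that did not exist yet; the as-of filter makes that impossible by construction.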

Knowledge Lookahead

The subtlest form. You know that a company will be added to the S&P 500 next month, so your strategy implicitly conditions on that knowledge. Or you filter your universe to “stocks that will have earnings surprises this quarter.” Or you know that a certain regime shift happened in 2020, so you design a strategy that conveniently adapts right before the shift.

Knowledge lookahead is hard to detect algorithmically because it lives in the researcher’s head, not in the code. The best defense is a strict event-time framework and a willingness to question every filtering step: “Would I have been able to apply this filter in real time?”

The Pandas Alignment Trap

This deserves its own section because it is the single most common source of lookahead bugs in Python backtests. I have seen it in my own code, in colleagues’ code, and in open-source backtesting libraries.

import pandas as pd

# WRONG: the position at bar t is set from bar t's close, yet it also
# earns bar t's return, so the signal sees the move it is trading
df["signal"] = df["close"].pct_change()
df["position"] = (df["signal"] > 0).astype(int)
df["strategy_return"] = df["position"] * df["close"].pct_change()

The problem: position at time t is based on the return from t-1 to t, which requires knowing the close at time t. But the strategy_return at time t also uses the close at time t. The signal and the return being captured overlap. The signal is computed from a price move that the position is simultaneously trying to profit from.

# RIGHT: signal computed from data through yesterday's close, position
# held today, today's return captured
df["signal"] = df["close"].pct_change()
df["position"] = (df["signal"].shift(1) > 0).astype(int)
df["strategy_return"] = df["position"] * df["close"].pct_change()

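The .shift(1) on the signal ensures that today’s position is based on yesterday’s signal. This reflects reality: you observe yesterday’s close, compute a signal, and hold the resulting position through today. The return you capture is today’s close-to-close price move, which implicitly assumes you could fill at or very near yesterday’s close. If you actually execute at today’s open, compute the captured return from the open instead.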

One extra .shift() seems like a small thing. It can change a backtest’s annualized return by hundreds of basis points.
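
A quick way to convince yourself is to run both versions on data where no edge can exist. The sketch below uses a simulated driftless random walk (an assumption chosen for illustration; the seed and volatility are arbitrary) and compares the unshifted and shifted positions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated driftless random walk in log prices: by construction,
# no strategy should show a persistent edge on this series.
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0.0, 0.01, 2_000))))
ret = close.pct_change()

biased_position = (ret > 0).astype(int)            # peeks at today's return
clean_position = (ret.shift(1) > 0).astype(int)    # uses yesterday's return

biased = (biased_position * ret).dropna()
clean = (clean_position * ret).dropna()

print(f"biased mean daily return: {biased.mean():.4%}")
print(f"clean mean daily return:  {clean.mean():.4%}")
```

The biased version is long exactly when the return is positive, so it never records a losing day. The clean version has nothing to exploit on a random walk, and its mean return hovers near zero.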

Detecting Lookahead: The Shift Test

The most reliable automated detection method is the shift test: lag all input data by one period and re-run the backtest. If performance drops dramatically, there is likely lookahead somewhere in the pipeline.

from dataclasses import dataclass
from typing import Protocol

import pandas as pd


class BacktestStrategy(Protocol):
    """Protocol for a backtestable strategy."""

    def backtest(self, data: pd.DataFrame) -> "BacktestResult": ...


@dataclass(frozen=True)
class BacktestResult:
    """Summary statistics from a single backtest run."""
    sharpe: float
    annual_return: float
    max_drawdown: float


@dataclass(frozen=True)
class LookaheadDiagnostic:
    """Results from the lookahead shift test.

    A large sharpe_drop (the difference between the original and lagged
    Sharpe ratios) suggests the strategy may be using future information.
    The appropriate threshold for 'suspicious' depends on your strategy
    and market context.

    Attributes:
        original_sharpe: Sharpe ratio from the unmodified backtest.
        lagged_sharpe: Sharpe ratio after lagging all input data.
        sharpe_drop: Difference between original and lagged Sharpe.
        lag_periods: Number of periods the data was lagged.
    """
    original_sharpe: float
    lagged_sharpe: float
    sharpe_drop: float
    lag_periods: int


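def detect_lookahead_shift_test(
    strategy: BacktestStrategy,
    data: pd.DataFrame,
    lag_periods: int = 1,
    signal_columns: list[str] | None = None,
) -> LookaheadDiagnostic:
    """Run a strategy with original and lagged data to detect lookahead.

    If the strategy's performance drops significantly when its input data
    is lagged by one period, the original result likely depends on future
    information. A clean strategy should be relatively insensitive to a
    one-bar lag because its signals are already based on past data.

    Caveat: lagging every column, including the prices the strategy uses
    to compute returns, only relabels the time axis and leaves the result
    essentially unchanged. When signals and returns come from the same
    DataFrame, pass signal_columns so only the signal inputs are lagged.

    Args:
        strategy: A backtestable strategy implementing the BacktestStrategy
            protocol.
        data: The input DataFrame containing price and signal data.
        lag_periods: Number of periods to lag the input data. Defaults to 1.
        signal_columns: Columns to lag. Defaults to None, which lags every
            column in data.

    Returns:
        A LookaheadDiagnostic with the original and lagged Sharpe ratios.
    """
    original_result = strategy.backtest(data)
    if signal_columns is None:
        lagged_data = data.shift(lag_periods)
    else:
        # Lag only the signal inputs; prices used for returns stay in place.
        lagged_data = data.copy()
        lagged_data[signal_columns] = data[signal_columns].shift(lag_periods)
    lagged_result = strategy.backtest(lagged_data)
    sharpe_diff = original_result.sharpe - lagged_result.sharpe

    return LookaheadDiagnostic(
        original_sharpe=original_result.sharpe,
        lagged_sharpe=lagged_result.sharpe,
        sharpe_drop=sharpe_diff,
        lag_periods=lag_periods,
    )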

A clean strategy should be relatively insensitive to a one-bar lag. If shifting the data by a single bar causes the Sharpe to collapse, something in the pipeline is peeking ahead. The threshold for “suspicious” is context-dependent: a high-frequency strategy is naturally more lag-sensitive than a monthly rebalancing strategy. You need to calibrate based on your holding period and signal decay.

Preventing Lookahead

The structural fix is an event-time framework where every data point carries two timestamps: the effective date (when the event occurred) and the availability date (when you could have known about it). Signals are computed strictly from data whose availability date precedes the signal computation time. This is the same “parse at the boundary” principle from software engineering: validate and tag your data at the point of ingestion, and enforce the temporal constraint everywhere downstream.
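
A minimal sketch of the two-timestamp idea, assuming a hypothetical TimedObservation record and a visible_at filter (both names are invented for illustration, as is the earnings example):

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimedObservation:
    """A data point tagged with both timestamps (hypothetical structure)."""
    name: str
    value: float
    effective_at: datetime   # when the event occurred
    available_at: datetime   # when it could first have been known


def visible_at(
    observations: list[TimedObservation],
    decision_time: datetime,
) -> list[TimedObservation]:
    """Keep only observations whose availability strictly precedes decision_time."""
    return [o for o in observations if o.available_at < decision_time]


# Made-up example: quarter-end earnings that are only filed weeks later.
observations = [
    TimedObservation(
        name="q1_eps",
        value=1.25,
        effective_at=datetime(2021, 3, 31),
        available_at=datetime(2021, 5, 5),
    ),
]

before_filing = visible_at(observations, datetime(2021, 4, 15))
after_filing = visible_at(observations, datetime(2021, 5, 6))
print(len(before_filing), len(after_filing))
```

Filtering on effective_at instead of available_at is the classic mistake: the quarter ended on March 31, but nobody could trade on the number until the filing.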

For macro data, use point-in-time databases like ALFRED rather than the latest-revised series from FRED. For corporate fundamentals, use as-reported data with filing date timestamps rather than as-restated data. For price data, be explicit about when you assume you can act on each bar’s information.

Time-Period Bias

Every strategy has market regimes where it thrives and regimes where it dies. Time-period bias occurs when your backtest window happens to be dominated by favorable regimes, making the strategy look better than it will perform across the full distribution of market conditions.

This is the bias I find hardest to guard against because it operates partly at the level of research design. We naturally gravitate toward backtest periods where our strategies work. Sometimes consciously: “let me start the backtest in 2009 because the data before that is from a different era.” Sometimes unconsciously: we develop the strategy during a period, and the development period is by definition a period where the strategy concept felt right.

Regime Dependence

I ran a momentum strategy that looked exceptional over 2012-2021. Low volatility, persistent trends, brief drawdowns. Then I extended the backtest back to 2000. The strategy lost money during 2000-2003 (dot-com crash), was mediocre during 2004-2007, and was catastrophic during 2008. The 2012-2021 window was not representative. It was the best possible window for that particular strategy.

The problem isn’t that the strategy failed in some periods. All strategies have drawdowns. The problem is that the original backtest window gave no indication that such periods existed. The Sharpe ratio was computed over a window that excluded the strategy’s worst environments.

Momentum strategies thrive in trending markets with moderate volatility. Mean-reversion strategies thrive in range-bound markets. Volatility-selling strategies thrive in calm markets. If your backtest window happens to be dominated by the regime your strategy is designed for, your performance estimate is biased upward. Harvey and Liu (2015) argue that at least a decade of data is necessary for equity strategies to span multiple regimes, and even that may not be enough if the decade you chose was unusually benign.

Detecting Regime Dependence

Sub-period analysis is the first tool. Split the backtest into non-overlapping windows of equal length and compute performance separately in each. If the Sharpe ratio varies wildly across sub-periods, the strategy is regime-dependent, and your full-period result is averaging over very different performance characteristics.

Rolling metrics make this visual. Plot the rolling one-year Sharpe ratio over the backtest period. A robust strategy shows a relatively stable line. A regime-dependent strategy shows extended periods of strong performance interspersed with extended periods of negative Sharpe.
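
Both diagnostics take only a few lines. The sketch below is one possible implementation, assuming daily returns and a zero risk-free rate; the helper names and the synthetic two-regime series are illustrative.

```python
import numpy as np
import pandas as pd


def annualized_sharpe(returns: pd.Series, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio with a zero risk-free rate (an assumption)."""
    vol = returns.std()
    if not vol > 0:  # also catches NaN from a too-short series
        return 0.0
    return float(returns.mean() / vol * np.sqrt(periods_per_year))


def subperiod_sharpes(
    returns: pd.Series, n_splits: int, periods_per_year: int = 252
) -> list[float]:
    """Sharpe ratio in each of n_splits consecutive, near-equal windows."""
    chunks = np.array_split(returns.to_numpy(), n_splits)
    return [annualized_sharpe(pd.Series(c), periods_per_year) for c in chunks]


def rolling_sharpe(
    returns: pd.Series, window: int = 252, periods_per_year: int = 252
) -> pd.Series:
    """Rolling annualized Sharpe; NaN until a full window is available."""
    mean = returns.rolling(window).mean()
    vol = returns.rolling(window).std()
    return mean / vol * np.sqrt(periods_per_year)


rng = np.random.default_rng(1)
# Synthetic returns: a favorable first half followed by an unfavorable one.
returns = pd.Series(np.concatenate([
    rng.normal(0.003, 0.01, 500),
    rng.normal(-0.003, 0.01, 500),
]))
halves = subperiod_sharpes(returns, n_splits=2)
rolling = rolling_sharpe(returns, window=100)
print("sub-period Sharpe ratios:", [round(s, 2) for s in halves])
print("rolling Sharpe range:", round(rolling.min(), 2), "to", round(rolling.max(), 2))
```

On the synthetic series the full-period Sharpe is close to zero, which hides the fact that the two halves look like entirely different strategies.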

from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass(frozen=True)
class RegimePerformance:
    """Performance metrics for a single market regime.

    Attributes:
        regime_label: Name of the regime (e.g., 'high_vol', 'low_vol').
        n_observations: Number of return observations in this regime.
        annualized_return: Annualized mean return in this regime.
        annualized_vol: Annualized volatility in this regime.
        sharpe: Sharpe ratio in this regime.
        max_drawdown: Maximum drawdown experienced in this regime.
    """
    regime_label: str
    n_observations: int
    annualized_return: float
    annualized_vol: float
    sharpe: float
    max_drawdown: float


def compute_max_drawdown(returns: pd.Series) -> float:
    """Compute the maximum drawdown from a return series.

    Args:
        returns: A pandas Series of period returns (not cumulative).

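    Returns:
        The maximum drawdown as a negative float (e.g., -0.25 for 25%),
        measured against the running peak including the starting capital.
        Returns 0.0 if the series is empty.
    """
    if returns.empty:
        return 0.0
    cumulative = (1 + returns).cumprod()
    # Floor the running peak at the starting capital of 1.0 so a loss on
    # the very first bar registers as a drawdown.
    running_max = cumulative.cummax().clip(lower=1.0)
    drawdown = (cumulative - running_max) / running_max
    return float(drawdown.min())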


def regime_conditional_performance(
    returns: pd.Series,
    regime_indicator: pd.Series,
    periods_per_year: int = 252,
) -> list[RegimePerformance]:
    """Compute strategy performance conditioned on market regime.

    Splits the return series by regime labels and computes annualized
    metrics for each regime independently. This reveals whether a
    strategy's aggregate performance hides regime-dependent fragility.

    Args:
        returns: Daily (or other frequency) strategy returns, indexed
            by date.
        regime_indicator: A Series of regime labels (e.g., 'high_vol',
            'low_vol') with the same index as returns.
        periods_per_year: Number of trading periods per year for
            annualization. Defaults to 252 (daily).

    Returns:
        A list of RegimePerformance results, one per unique regime.
    """
    aligned = pd.DataFrame({
        "returns": returns,
        "regime": regime_indicator,
    }).dropna()

    results: list[RegimePerformance] = []
    for label in sorted(aligned["regime"].unique()):
        mask = aligned["regime"] == label
        regime_returns = aligned.loc[mask, "returns"]
        n = len(regime_returns)

        mean_return = float(regime_returns.mean())
        vol = float(regime_returns.std())
        ann_return = mean_return * periods_per_year
        ann_vol = vol * np.sqrt(periods_per_year)
        sharpe = ann_return / ann_vol if ann_vol > 0 else 0.0
        mdd = compute_max_drawdown(regime_returns)

        results.append(RegimePerformance(
            regime_label=str(label),
            n_observations=n,
            annualized_return=ann_return,
            annualized_vol=ann_vol,
            sharpe=sharpe,
            max_drawdown=mdd,
        ))

    return results

The regime indicator can be anything that segments market conditions: a volatility measure split at its median, a trend indicator, a macro regime classifier. The point is to reveal whether your aggregate Sharpe is the average of consistent performance or the average of spectacular performance in one regime and terrible performance in another.
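
As one example of such an indicator, here is a sketch of the volatility-split-at-median idea. The function name, the 21-day window, and the simulated market series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def volatility_regime_labels(
    market_returns: pd.Series, window: int = 21
) -> pd.Series:
    """Label each date 'high_vol' or 'low_vol' by trailing realized volatility.

    The split uses the full-sample median, which is fine for an
    after-the-fact diagnostic but would itself be lookahead if these
    labels were fed into a trading signal.
    """
    trailing_vol = market_returns.rolling(window).std()
    median_vol = trailing_vol.median()
    labels = pd.Series(
        np.where(trailing_vol > median_vol, "high_vol", "low_vol"),
        index=market_returns.index,
        dtype=object,
    )
    # Dates without a full trailing window get no regime label.
    return labels.where(trailing_vol.notna())


rng = np.random.default_rng(2)
market = pd.Series(
    rng.normal(0.0, 0.01, 300),
    index=pd.bdate_range("2020-01-01", periods=300),
)
labels = volatility_regime_labels(market, window=21)
print(labels.value_counts())
```

The resulting Series can be passed as the regime labels to a regime-conditional breakdown; unlabeled dates at the start drop out once missing values are removed.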

Reducing Time-Period Sensitivity

Walk-forward validation is the primary defense. By forcing the strategy to perform out-of-sample across sequential windows, you get a more honest picture of how it behaves across different market conditions.
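
The windowing logic is simple enough to sketch directly. This is a rolling-window variant with fixed train and test sizes, one of several reasonable designs; the function name and parameters are illustrative.

```python
from collections.abc import Iterator


def walk_forward_splits(
    n_observations: int,
    train_size: int,
    test_size: int,
) -> Iterator[tuple[range, range]]:
    """Yield (train, test) index ranges for rolling walk-forward validation.

    Each test window starts where its training window ends, and the pair
    slides forward by test_size so test windows never overlap.
    """
    if train_size <= 0 or test_size <= 0:
        raise ValueError("train_size and test_size must be positive")
    start = 0
    while start + train_size + test_size <= n_observations:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size


splits = list(walk_forward_splits(n_observations=10, train_size=4, test_size=2))
for train, test in splits:
    print(f"train {train.start}-{train.stop - 1}, test {test.start}-{test.stop - 1}")
```

Parameters are fit on each train range only and evaluated on the test range that follows it; stitching the test-window results together gives the out-of-sample track record.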

Bootstrap methods provide another angle: resampling returns from different regimes lets you estimate the distribution of performance across the full range of market conditions, not just the ones that happened to dominate your backtest window.
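
One common way to do this is a block bootstrap, which resamples contiguous chunks of returns so that short-range dependence such as volatility clustering survives the resampling. The sketch below is that swapped-in technique, not necessarily the exact method meant above; the block size, sample count, and toy returns are assumptions.

```python
import numpy as np


def block_bootstrap_sharpe(
    returns: np.ndarray,
    block_size: int = 21,
    n_samples: int = 1_000,
    periods_per_year: int = 252,
    seed: int = 0,
) -> np.ndarray:
    """Bootstrap the annualized Sharpe ratio by resampling contiguous blocks."""
    returns = np.asarray(returns, dtype=float)
    n = len(returns)
    if n < block_size:
        raise ValueError("need at least one full block of returns")
    rng = np.random.default_rng(seed)
    n_blocks = int(np.ceil(n / block_size))
    sharpes = np.empty(n_samples)
    for i in range(n_samples):
        # Draw random block start positions, glue the blocks together,
        # and trim to the original series length.
        starts = rng.integers(0, n - block_size + 1, size=n_blocks)
        sample = np.concatenate([returns[s:s + block_size] for s in starts])[:n]
        vol = sample.std(ddof=1)
        sharpes[i] = (
            sample.mean() / vol * np.sqrt(periods_per_year) if vol > 0 else 0.0
        )
    return sharpes


rng = np.random.default_rng(3)
toy_returns = rng.normal(0.0005, 0.01, 1_000)  # made-up daily returns
dist = block_bootstrap_sharpe(toy_returns, n_samples=500)
low, mid, high = np.percentile(dist, [5, 50, 95])
print(f"Sharpe 5th/50th/95th percentiles: {low:.2f} / {mid:.2f} / {high:.2f}")
```

The width of the interval is the useful output: a point estimate of the Sharpe ratio says little until you see how far it moves when the mix of periods changes.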

The minimum viable backtest length depends on your asset class and strategy type. For equities, a decade is a reasonable floor. For macro strategies, you may need two decades to capture a full interest rate cycle. For crypto, the entire asset class history may not span a full regime cycle, and you need to acknowledge that limitation explicitly rather than pretending your three-year backtest is definitive.

Transaction Cost Bias

Transaction cost bias always pushes in one direction: it makes your strategy look better than it will perform in production. The magnitude depends on how sophisticated your cost model is and how frequently the strategy trades.

Three Levels of Cost Modeling

The first level is no transaction costs at all. This is inexcusable for any strategy that trades more than monthly, yet I still see it. A high-turnover strategy backtested without costs is meaningless. It’s not even a rough approximation. It is a fantasy.

The second level is a fixed cost per trade: some number of basis points per transaction, applied uniformly. This is better than nothing and sufficient for low-frequency strategies trading liquid assets. For a strategy that rebalances monthly on large-cap equities, a flat cost assumption gets you in the right neighborhood.

The third level models market impact as a function of order size relative to average daily volume, volatility, and urgency. Almgren and Chriss (2001) established the standard framework for optimal execution that accounts for the tradeoff between market impact (trading too fast) and timing risk (trading too slow). Their original formulation assumes linear impact; later empirical work and most practitioner models find that execution costs scale roughly with the square root of the fraction of daily volume you’re consuming. This matters enormously for strategies managing real capital, because a strategy that looks great at $1M may be physically impossible to execute at $100M.
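
The square-root scaling can be sketched as a one-line rule of thumb: impact is roughly a coefficient times daily volatility times the square root of the participation rate. The coefficient of 1.0 below is a placeholder assumption, not a calibrated value, and the function name is invented for illustration.

```python
import math


def sqrt_impact_cost_bps(
    order_value: float,
    avg_daily_volume_value: float,
    daily_vol: float,
    impact_coefficient: float = 1.0,
) -> float:
    """Rough market impact in basis points under a square-root rule.

    cost = coefficient * daily_vol * sqrt(order / ADV), expressed in bps.
    impact_coefficient is a placeholder; it must be calibrated to your
    own execution data before the output means anything.
    """
    if order_value < 0 or avg_daily_volume_value <= 0:
        raise ValueError("order_value must be >= 0 and ADV must be > 0")
    participation = order_value / avg_daily_volume_value
    return impact_coefficient * daily_vol * math.sqrt(participation) * 10_000


# Same stock (2% daily volatility, $10M ADV), two order sizes 100x apart.
small = sqrt_impact_cost_bps(100_000, 10_000_000, daily_vol=0.02)
large = sqrt_impact_cost_bps(10_000_000, 10_000_000, daily_vol=0.02)
print(f"small order: {small:.0f} bps, large order: {large:.0f} bps")
```

Under this rule a hundredfold increase in order size raises the per-dollar cost tenfold, so total dollar cost grows a thousandfold. That is how an edge that is comfortable at small size disappears at scale.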

The Capacity Question

Every strategy has a capacity limit beyond which market impact overwhelms the alpha. If your backtest assumes you can trade $50M per day in a stock that averages $10M in daily volume, you are assuming away the dominant cost. The backtest shows profits at a scale that is physically impossible to execute.

I think about capacity before I think about performance. A strategy with a Sharpe of 2.0 and a capacity of $500K is an intellectual exercise, not a trading strategy. A strategy with a Sharpe of 0.8 and a capacity of $50M is a business.

Measuring Your Cost Sensitivity

The break-even cost analysis is the most informative diagnostic: compute the per-trade cost at which the strategy’s Sharpe drops to zero. If the break-even cost is close to realistic transaction costs, the edge is thin and fragile.

from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass(frozen=True)
class TransactionCostSensitivity:
    """Results from sweeping transaction cost assumptions.

    Attributes:
        cost_bps: The cost assumption in basis points per trade.
        net_sharpe: The Sharpe ratio after subtracting this cost level.
        net_annual_return: Annualized return after costs.
        total_cost_drag: Total annualized cost as a fraction of capital.
    """
    cost_bps: float
    net_sharpe: float
    net_annual_return: float
    total_cost_drag: float


def transaction_cost_sweep(
    gross_returns: pd.Series,
    turnover: pd.Series,
    cost_levels_bps: list[float],
    periods_per_year: int = 252,
) -> list[TransactionCostSensitivity]:
    """Sweep across transaction cost levels and report net performance.

    For each cost level, subtracts the trading cost (per-trade cost times
    turnover) from gross returns and computes net Sharpe. This reveals
    how sensitive the strategy's edge is to execution cost assumptions.

    Args:
        gross_returns: Strategy returns before transaction costs.
        turnover: Daily portfolio turnover as a fraction (0.05 = 5%
            of portfolio traded that day).
        cost_levels_bps: List of per-trade costs in basis points to test.
        periods_per_year: Trading periods per year for annualization.

    Returns:
        A list of TransactionCostSensitivity results, one per cost level.
    """
    results: list[TransactionCostSensitivity] = []

    for cost_bps in cost_levels_bps:
        cost_fraction = cost_bps / 10_000
        daily_cost = turnover * cost_fraction
        net_returns = gross_returns - daily_cost

        mean_net = float(net_returns.mean())
        vol_net = float(net_returns.std())
        ann_return = mean_net * periods_per_year
        ann_vol = vol_net * np.sqrt(periods_per_year)
        sharpe = ann_return / ann_vol if ann_vol > 0 else 0.0
        total_drag = float(daily_cost.mean()) * periods_per_year

        results.append(TransactionCostSensitivity(
            cost_bps=cost_bps,
            net_sharpe=sharpe,
            net_annual_return=ann_return,
            total_cost_drag=total_drag,
        ))

    return results

If your strategy’s Sharpe at zero cost is 1.5 but drops to 0.3 at realistic cost assumptions, you don’t have an edge. You have an artifact. The function above makes this relationship explicit by sweeping across cost levels and showing exactly where the strategy’s performance degrades.
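
Under the same linear cost model (net return equals gross return minus turnover times cost), the break-even cost has a closed form: the Sharpe ratio hits zero when the mean net return does, which happens at mean gross return divided by mean turnover. A minimal sketch, with a hypothetical function name and made-up inputs:

```python
import pandas as pd


def break_even_cost_bps(gross_returns: pd.Series, turnover: pd.Series) -> float:
    """Per-trade cost (bps) at which the mean net return, and hence the
    Sharpe ratio, reaches zero.

    Assumes the linear model net = gross - turnover * cost. A negative
    result means the strategy loses money even before costs.
    """
    aligned = pd.DataFrame({"gross": gross_returns, "turnover": turnover}).dropna()
    mean_turnover = float(aligned["turnover"].mean())
    if not mean_turnover > 0:
        # No trading at all: costs can never erase the return.
        return float("inf")
    return float(aligned["gross"].mean()) / mean_turnover * 10_000


# Made-up example: 4 bps of gross return per day on 10% daily turnover.
gross = pd.Series([0.0004] * 5)
turn = pd.Series([0.10] * 5)
print(f"break-even cost: {break_even_cost_bps(gross, turn):.1f} bps per trade")
```

In this example the strategy breaks even at about 40 bps per trade. If your realistic all-in cost estimate sits anywhere near that figure, the edge has no margin for error.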

Realistic Cost Modeling

Use conservative cost estimates. If you think costs will be 5 basis points per trade, model them at 10. If the strategy survives at double your estimate, the edge is probably real. If it doesn’t, you’re relying on getting better execution than you expect, which is a bad place to start.

Model capacity explicitly. Estimate the dollar volume your strategy needs to trade on each rebalance day, divide by average daily volume for each asset, and flag any trade that exceeds a reasonable fraction of daily volume. What counts as “reasonable” depends on your market; you need to know your execution venue and its typical depth.
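
A sketch of that check, assuming trade sizes and average daily volume are both expressed in dollars and indexed by ticker. The 5% participation cap is a placeholder, not a recommendation, and the tickers and figures are invented.

```python
import pandas as pd


def flag_capacity_breaches(
    trade_values: pd.Series,
    adv_values: pd.Series,
    max_participation: float = 0.05,
) -> pd.DataFrame:
    """Flag trades whose size exceeds a fraction of average daily volume.

    max_participation is a placeholder threshold; pick one that reflects
    your own market and execution venue.
    """
    table = pd.DataFrame({
        "trade_value": trade_values.abs(),  # buys and sells both consume liquidity
        "adv_value": adv_values,
    }).dropna()
    table["participation"] = table["trade_value"] / table["adv_value"]
    table["breach"] = table["participation"] > max_participation
    return table


# Made-up rebalance: one comfortable trade, one that is 30% of daily volume.
trades = pd.Series({"AAA": 200_000.0, "BBB": -1_500_000.0})
adv = pd.Series({"AAA": 10_000_000.0, "BBB": 5_000_000.0})
report = flag_capacity_breaches(trades, adv)
print(report)
```

Running this on every rebalance date at several assumed portfolio sizes gives a rough capacity curve: the size at which breaches start to appear is where the flat-cost assumption stops being credible.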

Corporate Action Bias

Corporate action bias is less discussed than survivorship or lookahead, but it can produce errors of similar magnitude. The mechanism is straightforward: failing to properly adjust historical prices for splits, dividends, rights issues, spin-offs, and mergers creates phantom price moves that don’t correspond to real gains or losses.

The Adjustment Problem

A 2:1 stock split means yesterday’s $200 and today’s $100 represent the same value. In an unadjusted dataset, this looks like a 50% crash overnight. Every momentum signal, every mean-reversion signal, every volatility estimator will react to this phantom move. Split adjustment is the minimum; most data sources handle it correctly. The problems arise with less common corporate actions.

Cash dividends are trickier than they appear. On the ex-dividend date, the stock price drops by approximately the dividend amount. In adjusted price series, historical prices are retroactively reduced to make the total return series smooth through the ex-date. But different adjustment methods (proportional vs. additive) produce different historical prices, and the difference compounds over long histories with many dividend payments. For high-dividend stocks over multi-decade backtests, the choice of adjustment method can change historical prices by 20% or more.
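
The difference between the two methods is easiest to see on a toy series with a single dividend. The sketch below is an illustration only: the function name and prices are made up, and real adjustment pipelines chain many such events.

```python
import pandas as pd


def back_adjust(
    prices: pd.Series, ex_date: str, dividend: float, method: str
) -> pd.Series:
    """Back-adjust prices before ex_date for one cash dividend.

    'proportional' scales earlier prices by (1 - dividend / prior close);
    'additive' subtracts the dividend from earlier prices.
    """
    ex_ts = pd.Timestamp(ex_date)
    before = prices.index < ex_ts
    if not before.any():
        return prices.astype(float).copy()
    prior_close = prices[before].iloc[-1]  # last close before the ex-date
    adjusted = prices.astype(float).copy()
    if method == "proportional":
        adjusted[before] = prices[before] * (1 - dividend / prior_close)
    elif method == "additive":
        adjusted[before] = prices[before] - dividend
    else:
        raise ValueError("method must be 'proportional' or 'additive'")
    return adjusted


# Made-up prices: the stock doubles, then goes ex-dividend by $2.
prices = pd.Series(
    [50.0, 100.0, 98.0],
    index=pd.to_datetime(["2020-01-06", "2020-01-07", "2020-01-08"]),
)
prop = back_adjust(prices, "2020-01-08", dividend=2.0, method="proportional")
addi = back_adjust(prices, "2020-01-08", dividend=2.0, method="additive")
print("proportional:", prop.tolist())
print("additive:    ", addi.tolist())
```

Both methods remove the ex-date gap, but they disagree about the earliest price. The proportional series keeps the earlier percentage return intact (the stock still doubles), while the additive series inflates it, and the distortion grows the further back you go and the lower the historical price was.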

Special dividends are irregular and can be large. A $50 stock paying a $10 special dividend will gap down 20% on the ex-date. If your data source handles regular dividends but misses special dividends, you’ll see what looks like a crash.

Spin-offs are the hardest to get right. When a company spins off a subsidiary, the parent’s price drops by the value of the new entity. If your dataset doesn’t include the new ticker, you see a loss that didn’t actually happen. The parent lost value, but the investor received shares in the new entity that compensate. Missing the new ticker is simultaneously a corporate action error and a survivorship error.

Scanning for Adjustment Anomalies

The automated approach scans for anomalous single-day returns and cross-references them against known corporate actions:

from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class CorporateActionAnomaly:
    """A suspicious single-day return that may indicate a missing adjustment.

    Attributes:
        ticker: The ticker symbol.
        date: The date of the anomalous return.
        daily_return: The single-day return that triggered the flag.
        has_known_action: Whether a known corporate action exists on this
            date in the reference database.
        action_type: The type of corporate action, if known. None if no
            matching action was found.
    """
    ticker: str
    date: str
    daily_return: float
    has_known_action: bool
    action_type: str | None


def scan_corporate_action_anomalies(
    prices: pd.DataFrame,
    corporate_actions: pd.DataFrame,
    return_threshold: float = 0.20,
) -> list[CorporateActionAnomaly]:
    """Flag single-day returns that exceed a threshold and may indicate
    missing corporate action adjustments.

    Scans each ticker's price series for returns larger than the
    threshold (in absolute value) and checks whether a known corporate
    action exists on that date. Returns without a matching corporate
    action are the most suspicious, as they may represent unadjusted
    splits or special dividends.

    Args:
        prices: DataFrame with columns as tickers and a DatetimeIndex,
            containing adjusted close prices.
        corporate_actions: DataFrame with columns 'ticker', 'date', and
            'action_type' listing known corporate actions.
        return_threshold: Minimum absolute return to flag. Defaults to
            0.20 (20%).

    Returns:
        A list of CorporateActionAnomaly results for all flagged returns.
    """
    returns = prices.pct_change()
    # Normalize action dates to 'YYYY-MM-DD' so that Timestamp and string
    # inputs produce the same lookup keys as the price index below.
    actions_lookup: dict[tuple[str, str], str] = {}
    for _, row in corporate_actions.iterrows():
        key = (str(row["ticker"]), str(row["date"])[:10])
        actions_lookup[key] = str(row["action_type"])

    anomalies: list[CorporateActionAnomaly] = []
    for ticker in returns.columns:
        series = returns[ticker].dropna()
        extreme = series[series.abs() > return_threshold]

        for date, ret in extreme.items():
            date_str = str(date)[:10]
            key = (str(ticker), date_str)
            has_action = key in actions_lookup
            action_type = actions_lookup.get(key)

            anomalies.append(CorporateActionAnomaly(
                ticker=str(ticker),
                date=date_str,
                daily_return=float(ret),
                has_known_action=has_action,
                action_type=action_type,
            ))

    return anomalies

Returns that exceed the threshold but have no matching corporate action are the red flags. They may be genuine extreme moves (earnings surprises, flash crashes), or they may be unadjusted corporate actions. Either way, they deserve manual review before you trust a backtest that includes them.

Getting Adjustments Right

Use adjusted close prices from a reliable source. CRSP, Bloomberg, and Refinitiv all provide properly adjusted series. Free data sources vary in quality; some handle splits but miss special dividends, some miss spin-offs entirely.

The alternative is maintaining raw prices alongside a separate corporate actions table and applying adjustments in your own pipeline. This gives you more control and transparency at the cost of significant additional engineering work. I prefer this approach for my own pipeline because I can audit exactly what adjustments are being applied, but I recognize it’s not practical for everyone.

Always use total return data for strategy evaluation. Price return (ignoring dividends) understates buy-and-hold returns and distorts the comparison between strategies with different dividend exposures. A high-dividend strategy backtested on price returns alone looks worse than it actually performed.
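
As a small illustration of the gap between the two, the sketch below computes per-bar price return and total return from unadjusted closes and a dividend series. The function name, the column names, and the three-bar example are assumptions made for illustration.

```python
import pandas as pd


def price_and_total_returns(close: pd.Series, dividends: pd.Series) -> pd.DataFrame:
    """Compare per-bar price return with total return.

    Assumes `dividends` holds the cash amount on each ex-date (0 elsewhere)
    and that both series are unadjusted and share the same index.
    """
    prev_close = close.shift(1)
    return pd.DataFrame({
        "price_return": close / prev_close - 1.0,
        "total_return": (close + dividends) / prev_close - 1.0,
    })


# Toy data (hypothetical): a $1 dividend goes ex on the second bar.
close = pd.Series([100.0, 99.0, 100.0])
dividends = pd.Series([0.0, 1.0, 0.0])

returns = price_and_total_returns(close, dividends)
# Compound each column over the sample (the first row is NaN and is dropped).
cumulative = (1.0 + returns.dropna()).prod() - 1.0
```

In this toy path the price return over the sample is zero while the total return is about 1%, which is exactly the dividend that the price series ignores.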

Rebalance Timing Bias

Rebalance timing bias arises from assuming you can trade at a price that isn’t available at the moment you decide to trade. It sounds like a minor technical detail. For short-holding-period strategies, it can be the entire edge.

The Close-to-Close Impossibility

The most common version: your strategy computes a signal from today’s close and trades at today’s close. This is impossible. You cannot simultaneously observe a price and trade on it. The close print happens at 4:00 PM. You can’t compute a function of the close price and execute a trade at the close price in zero time.

Yet this is exactly what most simple backtests assume. In practice, “buy at the close if the signal is positive” can only mean one of two things: the signal was computed before the close (using information that didn’t include the close price), or the trade happened after the close (at a different price). Most backtesting frameworks default to this close-to-close assumption because it’s computationally convenient, and many users don’t realize the implication.

The MOC Nuance

You can submit market-on-close (MOC) orders before the close and receive the closing price. But there’s a catch: you must submit them before knowing the final close price. Most exchanges accept MOC orders until about ten minutes before the close. So you’re making a decision based on 3:50 PM information and getting filled at the 4:00 PM price. This is a real, executable approach, but your backtest needs to model it correctly: the signal is computed on information available at 3:50, not at 4:00.

Testing Execution Timing Sensitivity

Run the backtest under three execution assumptions and compare:

Signal at close, trade at close (the impossible but common assumption)
Signal at close, trade at next open
Signal at close, trade at next close

from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass(frozen=True)
class TimingSensitivity:
    """Performance under different execution timing assumptions.

    Attributes:
        execution_label: Description of the timing assumption.
        sharpe: Sharpe ratio under this assumption.
        annual_return: Annualized return under this assumption.
        correlation_with_base: Correlation of this return series with
            the base (same-close) return series.
    """
    execution_label: str
    sharpe: float
    annual_return: float
    correlation_with_base: float


def execution_timing_sensitivity(
    signal: pd.Series,
    close_prices: pd.Series,
    open_prices: pd.Series,
    periods_per_year: int = 252,
) -> list[TimingSensitivity]:
    """Test strategy performance under different execution timing assumptions.

    Computes returns assuming the same signal but different execution
    prices: same-bar close, next-bar open, and next-bar close. Large
    differences between these scenarios indicate that the strategy's
    edge is timing-sensitive and may not survive realistic execution.

    Args:
        signal: A Series of position signals (+1, 0, -1) indexed by date.
        close_prices: Close prices indexed by date.
        open_prices: Open prices indexed by date.
        periods_per_year: Trading periods per year for annualization.

    Returns:
        A list of TimingSensitivity results for each execution assumption.
    """
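    # Convention: signal[t] is computed from data up to and including the
    # close of bar t, so it can only earn returns that start at or after
    # that close. Multiplying signal[t] by the return ending at t would be
    # a one-bar lookahead, not an execution assumption.
    close_returns = close_prices.pct_change()  # close[t-1] -> close[t]
    open_to_next_open = open_prices.shift(-1) / open_prices - 1.0  # open[t] -> open[t+1]

    # Same-bar close (impossible but common assumption): signal[t-1] is
    # filled at close[t-1] and earns close[t-1] -> close[t].
    base_returns = signal.shift(1) * close_returns

    scenarios = {
        "signal_at_close_trade_at_close": base_returns,
        # signal[t-1] is filled at open[t] and held until open[t+1].
        "signal_at_close_trade_at_next_open": (
            signal.shift(1) * open_to_next_open
        ),
        # signal[t-2] is filled at close[t-1] and earns close[t-1] -> close[t].
        "signal_at_close_trade_at_next_close": (
            signal.shift(2) * close_returns
        ),
    }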

    results: list[TimingSensitivity] = []
    base_clean = base_returns.dropna()

    for label, ret_series in scenarios.items():
        clean = ret_series.dropna()
        mean_r = float(clean.mean())
        vol_r = float(clean.std())
        ann_ret = mean_r * periods_per_year
        ann_vol = vol_r * np.sqrt(periods_per_year)
        sharpe = ann_ret / ann_vol if ann_vol > 0 else 0.0

        common_idx = base_clean.index.intersection(clean.index)
        if len(common_idx) > 1:
            corr = float(base_clean.loc[common_idx].corr(clean.loc[common_idx]))
        else:
            corr = float("nan")

        results.append(TimingSensitivity(
            execution_label=label,
            sharpe=sharpe,
            annual_return=ann_ret,
            correlation_with_base=corr,
        ))

    return results

If the three scenarios produce materially different results, the strategy is exploiting the gap between signal time and execution time. For a strategy with a multi-week holding period, the difference between close-to-close and close-to-next-open is negligible. For a day-trading strategy, it can be everything.

Realistic Execution Assumptions

Be explicit about your execution timing assumption and verify it’s achievable in production. If you assume next-bar open execution, confirm that you can reliably get filled near the open. If you assume MOC execution, confirm that your signal can be computed before the MOC cutoff time.

For production, TWAP or VWAP execution over a window around your target time is more realistic than a single-point fill. Your backtest should use the same execution assumption, which means using the VWAP over the execution window rather than a single bar’s close price.
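
A minimal sketch of a VWAP fill over an execution window follows. It assumes intraday bars with a sorted DatetimeIndex and hypothetical 'price' and 'volume' columns; the bar values are made up for illustration.

```python
import pandas as pd


def vwap_fill_price(bars: pd.DataFrame, start: pd.Timestamp, end: pd.Timestamp) -> float:
    """Volume-weighted average price over [start, end], both ends inclusive.

    Assumes `bars` has a sorted DatetimeIndex and 'price' and 'volume' columns.
    """
    window = bars.loc[start:end]
    total_volume = float(window["volume"].sum())
    if total_volume == 0.0:
        return float("nan")  # nothing traded, so no fill is modeled
    return float((window["price"] * window["volume"]).sum() / total_volume)


# Toy intraday bars (hypothetical values) around a 16:00 close.
bars = pd.DataFrame(
    {"price": [100.0, 101.0, 102.0, 103.0], "volume": [100, 300, 600, 50]},
    index=pd.date_range("2024-01-02 15:50", periods=4, freq="5min"),
)
fill = vwap_fill_price(
    bars, pd.Timestamp("2024-01-02 15:50"), pd.Timestamp("2024-01-02 16:00")
)
```

Here the modeled fill is 101.5 rather than the 102.0 printed at 16:00. Using the window VWAP in both the backtest and production keeps the two execution assumptions aligned.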

Add random execution noise to your backtest as a robustness check. Instead of the exact close price, add uniform noise of a few basis points in either direction and re-run. If the strategy is sensitive to this noise, its edge is not robust to realistic execution uncertainty.
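
One way to sketch this check: perturb every price by independent uniform noise, recompute the strategy returns, and look at how widely the Sharpe ratio moves across re-runs. The function names, the daily-bar annualization, and the toy inputs are assumptions for illustration, and the sketch treats positions[t] as decided at bar t and held over the following bar.

```python
import numpy as np
import pandas as pd


def perturb_prices(
    prices: pd.Series, max_noise_bps: float, rng: np.random.Generator
) -> pd.Series:
    """Shift each price by independent uniform noise of up to max_noise_bps."""
    noise = rng.uniform(-max_noise_bps, max_noise_bps, size=len(prices)) / 1e4
    return prices * (1.0 + noise)


def sharpe_under_noise(
    prices: pd.Series,
    positions: pd.Series,
    max_noise_bps: float = 3.0,
    n_trials: int = 200,
    seed: int = 0,
) -> np.ndarray:
    """Annualized Sharpe (daily bars assumed) for each noisy re-run."""
    rng = np.random.default_rng(seed)
    sharpes = np.empty(n_trials)
    for i in range(n_trials):
        noisy = perturb_prices(prices, max_noise_bps, rng)
        bar_returns = noisy / noisy.shift(1) - 1.0
        strat = (positions.shift(1) * bar_returns).dropna()
        vol = strat.std()
        sharpes[i] = strat.mean() / vol * np.sqrt(252) if vol > 0 else 0.0
    return sharpes


# Toy inputs (hypothetical): a random-walk price and an alternating position.
toy_rng = np.random.default_rng(1)
prices = pd.Series(100.0 * np.exp(np.cumsum(toy_rng.normal(0.0, 0.01, 500))))
positions = pd.Series(np.where(np.arange(500) % 2 == 0, 1.0, -1.0))

sharpes = sharpe_under_noise(prices, positions)
spread = float(sharpes.std())  # dispersion of Sharpe across noisy re-runs
```

A wide spread relative to the unperturbed Sharpe suggests the result depends on fill precision the strategy will not get in production.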

Overfitting Bias

Overfitting is the most common and most damaging bias in the taxonomy. It is also the one that feels the least like a “bias” because you think you’re doing rigorous work while you’re doing it. You optimize parameters. You test combinations. You select the one that performed best. And in doing so, you capture noise that happened to look like signal in your sample.

The Deflated Sharpe Ratio provides a statistical correction for overfitting by adjusting for the number of strategies tested, and I cover the behavioral aspect in other articles. Here I want to focus on the detection mechanics: how to tell whether your backtest result reflects genuine signal or the inevitable consequence of searching over a large parameter space.

The Parameter Sensitivity Landscape

A well-identified signal produces a smooth performance landscape. If your strategy uses a 20-day moving average and the backtest Sharpe is 1.2, neighboring parameter values (18-day, 22-day) should produce similar results. A Sharpe of 1.2 at exactly 20 days surrounded by a Sharpe of 0.3 at 19 days and 0.4 at 21 days is not a signal. It is a noise spike that happened to land on your chosen parameter.

Plot the Sharpe ratio (or whatever performance metric you use) as a function of each parameter while holding the others fixed. A smooth, broad hill suggests signal. A jagged landscape of sharp peaks and valleys suggests noise. This is the most intuitive and most informative overfitting diagnostic, and it requires no fancy statistics.

Degrees of Freedom

The number of free parameters relative to the number of independent observations determines how much room the optimizer has to fit noise. A strategy with two parameters and ten years of daily data has enormous room for genuine identification. A strategy with twenty parameters and three years of weekly data is almost certainly overfit.

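Bailey, Borwein, Lopez de Prado, and Zhu (2014) formalize this argument. They show that the expected maximum Sharpe ratio from trying N independent strategies on the same data grows with the number of trials, bounded above by roughly sqrt(2 * ln(N)) standard errors of the Sharpe estimate. If you tested 1,000 parameter combinations, that bound is about 3.7 standard errors even with zero true signal. Because the standard error of an annualized Sharpe ratio is roughly 1 / sqrt(years of data) when the true Sharpe is zero, this corresponds to a best-of-1,000 Sharpe of up to about 3.7 on one year of data and about 1.7 on five years. This is the “pseudo-mathematics” that makes financial charlatanism possible: run enough backtests, pick the best one, and present it as if it were the only test you ran.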
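
To get a feel for the scale, the sketch below evaluates both the sqrt(2 * ln(N)) bound and the tighter extreme-value approximation given in the same paper. Both are measured in standard errors of the Sharpe estimate; converting to an annualized Sharpe by dividing by sqrt(years) assumes independent trials, zero true Sharpe, and roughly normal estimation error. The function names are illustrative.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def sqrt_log_bound(n_trials: int) -> float:
    """Upper bound on the expected maximum of n_trials standard normals."""
    return math.sqrt(2.0 * math.log(n_trials))


def expected_max_z(n_trials: int) -> float:
    """Approximate expected maximum of n_trials independent standard normals."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    inv_cdf = NormalDist().inv_cdf
    return (
        (1.0 - EULER_GAMMA) * inv_cdf(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    )


def expected_max_annual_sharpe(n_trials: int, years: float) -> float:
    """Scale by the approximate standard error of an annualized Sharpe."""
    return expected_max_z(n_trials) / math.sqrt(years)


bound = sqrt_log_bound(1000)
approx = expected_max_z(1000)
five_year = expected_max_annual_sharpe(1000, years=5.0)
```

For 1,000 trials the bound is about 3.7 and the approximation a little under 3.3, so even on five years of data the best of 1,000 skill-free strategies is expected to show an annualized Sharpe near 1.5.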

The remedy is to account for the total number of trials. The deflated Sharpe ratio adjusts for the number of strategies tested, the skewness and kurtosis of returns, and the backtest length. But even without the formal test, you can use the degrees-of-freedom heuristic: keep the parameter count low, keep the data long, and be suspicious of any result that requires precise parameter tuning.

In-Sample vs. Out-of-Sample Degradation

The ratio of out-of-sample performance to in-sample performance is a direct measure of overfitting. If your strategy has a Sharpe of 2.0 in-sample and 0.5 out-of-sample, the OOS/IS ratio is 0.25: roughly 75% of your in-sample result was noise.

Walk-forward validation provides this comparison automatically. The concatenated out-of-sample results give you the OOS estimate. The average in-sample result gives you the IS estimate. The ratio tells you how much you’re overfitting.

I have never seen an OOS/IS ratio of 1.0. Some degradation is inevitable because the in-sample optimizer has the advantage of fitting to the specific realization of noise in each training window. A ratio of 0.5 to 0.7 is typical for strategies with genuine signal. Below 0.3, you’re probably fitting noise.
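
A minimal sketch of the calculation, taking the ratio as out-of-sample Sharpe over mean in-sample Sharpe, which is the orientation the 0.5 to 0.7 and 0.3 bands imply. The function names are illustrative, and the "inconclusive" label for the range between 0.3 and 0.5 is my own placeholder, since the text gives no verdict there.

```python
def oos_is_ratio(in_sample_sharpes: list[float], oos_sharpe: float) -> float:
    """Out-of-sample Sharpe divided by the mean in-sample Sharpe."""
    mean_is = sum(in_sample_sharpes) / len(in_sample_sharpes)
    if mean_is <= 0:
        return float("nan")  # not meaningful without a positive IS Sharpe
    return oos_sharpe / mean_is


def interpret_ratio(ratio: float) -> str:
    """Map the ratio onto the rough bands described in the text."""
    if ratio != ratio:  # NaN compares unequal to itself
        return "undefined"
    if ratio < 0.3:
        return "probably fitting noise"
    if ratio >= 0.5:
        return "consistent with genuine signal"
    return "inconclusive"


# Three walk-forward windows averaging 2.0 in-sample, 0.5 out-of-sample.
ratio = oos_is_ratio([2.1, 1.9, 2.0], oos_sharpe=0.5)
verdict = interpret_ratio(ratio)
```

With the numbers from the example above, the ratio is 0.25 and lands in the "probably fitting noise" band.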

Detecting Overfitting: Parameter Sweeps

from collections.abc import Callable
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ParameterSensitivityResult:
    """Results from scanning performance across a parameter range.

    Attributes:
        parameter_name: The name of the parameter being varied.
        parameter_values: The values tested.
        sharpe_values: The Sharpe ratio at each parameter value.
        mean_sharpe: Mean Sharpe across all parameter values.
        std_sharpe: Standard deviation of Sharpe across parameter values.
        peak_sharpe: Maximum Sharpe observed.
        peak_parameter: Parameter value at which peak Sharpe occurred.
        smoothness: Ratio of mean to peak Sharpe. Values near 1.0
            indicate a smooth landscape (signal); values near 0 indicate
            a jagged landscape (noise).
    """
    parameter_name: str
    parameter_values: list[float]
    sharpe_values: list[float]
    mean_sharpe: float
    std_sharpe: float
    peak_sharpe: float
    peak_parameter: float
    smoothness: float


def parameter_sensitivity_scan(
    backtest_fn: "Callable[[float], float]",
    parameter_name: str,
    parameter_range: list[float],
) -> ParameterSensitivityResult:
    """Scan a parameter range and assess the smoothness of the performance landscape.

    Runs the backtest at each parameter value and computes the ratio of
    mean Sharpe to peak Sharpe. A ratio near 1.0 means performance is
    stable across parameter values (suggesting real signal). A ratio
    near 0 means performance is concentrated at a narrow peak
    (suggesting overfitting to noise).

    Args:
        backtest_fn: A callable that takes a parameter value and returns
            the Sharpe ratio for that parameter setting.
        parameter_name: Human-readable name of the parameter being scanned.
        parameter_range: List of parameter values to test.

    Returns:
        A ParameterSensitivityResult with the full scan and smoothness metric.
    """
    sharpe_values = [backtest_fn(p) for p in parameter_range]

    mean_s = float(np.mean(sharpe_values))
    std_s = float(np.std(sharpe_values))
    peak_s = float(np.max(sharpe_values))
    peak_p = parameter_range[int(np.argmax(sharpe_values))]
    smoothness = mean_s / peak_s if peak_s > 0 else 0.0

    return ParameterSensitivityResult(
        parameter_name=parameter_name,
        parameter_values=list(parameter_range),
        sharpe_values=sharpe_values,
        mean_sharpe=mean_s,
        std_sharpe=std_s,
        peak_sharpe=peak_s,
        peak_parameter=peak_p,
        smoothness=smoothness,
    )

The smoothness metric is the key output. A value above 0.7 or so suggests the strategy has genuine signal across a range of parameter values. Below 0.3, the performance is concentrated at a narrow parameter setting, which is the hallmark of overfitting. These thresholds are rough guidelines; calibrate them based on your experience with your specific market and strategy type.

Defending Against Overfitting

Fewer parameters. This is the single most effective defense against overfitting. A two-parameter strategy is harder to overfit than a twenty-parameter strategy. Every parameter you add gives the optimizer another degree of freedom to fit noise. If your strategy requires fifteen parameters to produce good results, it’s probably the parameters doing the work, not the signal.

Walk-forward validation, as covered in the dedicated article, is the primary structural defense. Combinatorial purged cross-validation (Lopez de Prado, 2018) is a more sophisticated variant that generates many train/test splits while respecting the temporal structure of financial data. It’s more data-efficient than standard walk-forward but also more complex to implement correctly.

Regularization, in the machine learning sense of penalizing model complexity, applies to any strategy with tunable parameters. Use the simplest model that captures the signal. Add complexity only when the out-of-sample performance improves, not just the in-sample performance.

Less Common But Important Biases

The biases above are the big ones. The ones below are less frequently discussed but can still corrupt your results.

Fill Assumption Bias

Most backtests assume that every order gets filled at the desired price. In reality, limit orders compete with other orders at the same price level. Mean-reversion strategies that enter on limit orders at the bid or ask are especially vulnerable to this bias: you get filled on the losing trades (the price moves through your limit, filling you, and keeps going against you) and miss the winning trades (the price touches your limit and reverses before you’re filled). This creates a selection bias in which fills you actually receive.

The direction is positive: assuming 100% fill rates overstates performance for any strategy that uses limit orders. The magnitude depends on how much of the strategy’s edge comes from limit-order entry timing. For strategies that trade exclusively with market orders, this bias is minimal. For strategies that rely on capturing the spread, it can be the entire edge.

The mitigation is to model partial fills and adverse selection. Assume that fills on limit orders only occur when the price moves through the limit level by some buffer amount, rather than just touching it. This is conservative but more realistic than assuming every resting order gets filled.
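
The trade-through rule can be sketched in a few lines. The function names and the one-cent buffer are illustrative assumptions; the right buffer depends on tick size and queue position in your market.

```python
def limit_buy_filled(limit_price: float, bar_low: float, buffer: float) -> bool:
    """A resting buy limit fills only if the bar trades through it by `buffer`."""
    return bar_low <= limit_price - buffer


def limit_sell_filled(limit_price: float, bar_high: float, buffer: float) -> bool:
    """A resting sell limit fills only if the bar trades through it by `buffer`."""
    return bar_high >= limit_price + buffer


# Hypothetical bars for a buy limit at 100.00 with a one-cent buffer.
touched_only = limit_buy_filled(100.00, bar_low=100.00, buffer=0.01)
traded_through = limit_buy_filled(100.00, bar_low=99.98, buffer=0.01)
```

A bar whose low merely touches the limit is treated as unfilled, while a bar that trades two cents through it is treated as filled. This deliberately keeps the adverse fills and drops the lucky ones.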

Overnight and Weekend Gap Bias

Strategies that are modeled as continuous-time processes but traded in discrete sessions miss gap risk. A strategy that’s nominally flat overnight but backtested on daily close-to-close returns includes the overnight gap as if it were intraday performance. If the strategy is actually flat overnight (closed out at the close and re-entered at the open), the overnight component of the close-to-close return is noise that doesn’t belong in the backtest.

The fix is to use open-to-close returns for intraday strategies and close-to-close returns only for strategies that hold positions overnight. Mixing the two inflates apparent returns by including moves during periods when the strategy has no exposure.
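
Splitting each close-to-close return into its overnight and intraday pieces makes the distinction explicit. This is a minimal sketch; the function name and toy prices are assumptions.

```python
import pandas as pd


def split_close_to_close(open_prices: pd.Series, close_prices: pd.Series) -> pd.DataFrame:
    """Decompose each bar's close-to-close return into overnight and intraday parts."""
    prev_close = close_prices.shift(1)
    return pd.DataFrame({
        "overnight": open_prices / prev_close - 1.0,   # prior close -> open
        "intraday": close_prices / open_prices - 1.0,  # open -> close
        "close_to_close": close_prices / prev_close - 1.0,
    })


# Toy prices (hypothetical).
opens = pd.Series([100.0, 102.0, 101.0])
closes = pd.Series([101.0, 103.0, 100.0])

parts = split_close_to_close(opens, closes)
# The two legs compound back to the close-to-close return.
recombined = (1.0 + parts["overnight"]) * (1.0 + parts["intraday"]) - 1.0
```

A strategy that is flat overnight should be credited only with the intraday column; the overnight column is return it never had exposure to.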

Data Frequency Bias

Using daily data for a strategy that would execute on intraday signals hides intraday drawdowns that would trigger risk limits in production. A daily bar that closes up 0.5% may have had an intraday drawdown of 3% that would have stopped you out. Your backtest shows a calm upward ride; the live experience would have been a drawdown followed by a recovery you never captured because your stop triggered.

If your strategy operates at intraday frequency, backtest on intraday data. Daily data can give you a rough first pass, but you must re-validate on appropriate-frequency data before committing capital.
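
Even before you have intraday data, daily OHLC bars let you run a crude first-pass check: did the bar's low breach the stop, regardless of where it closed? The sketch below assumes a long position and a percentage stop; the function name and the bar values are hypothetical.

```python
def long_stop_hit(entry_price: float, bar_low: float, stop_pct: float) -> bool:
    """True if the bar's low breached a stop placed stop_pct below entry."""
    return bar_low <= entry_price * (1.0 - stop_pct)


# Hypothetical daily bar: entered at 100, traded down to 97, closed at 100.5.
entry, low, close = 100.0, 97.0, 100.5

daily_return = close / entry - 1.0           # what the daily backtest records
stopped = long_stop_hit(entry, low, stop_pct=0.02)  # what a 2% stop would have done
```

The daily series records a gain of 0.5%, but a 2% stop would have closed the position near 98. The low tells you the stop was breached, not when or in what order relative to the high, which is why this check is a screen and not a substitute for intraday data.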

Staleness Bias

In illiquid markets, the “last trade” price may be hours or days old. Using stale prices to compute signals and mark positions creates phantom stability. Volatility looks lower than it is because the price series doesn’t update between trades. Correlations with other assets appear weaker because the stale price doesn’t co-move in real time.

This is particularly relevant for options strategies, small-cap equities, corporate bonds, and any OTC market. If the bid-ask spread is wide and trading frequency is low, the “price” in your data is not a price you could transact at. It’s a historical record of what someone paid at some point in the past.

Detection: check the timestamp of each price observation. If many observations share the same timestamp or if timestamps jump by large intervals, the data is stale. Mitigation: use mid-quote data rather than last-trade data, and flag any observation where the time since last trade exceeds a threshold appropriate for that asset.
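
A minimal sketch of the time-since-last-trade flag. It assumes each observation carries both the time it was recorded and the time of the last trade, in hypothetical columns 'observed_at' and 'last_trade_at'; the six-hour threshold is only an example.

```python
import pandas as pd


def flag_stale(quotes: pd.DataFrame, max_age: pd.Timedelta) -> pd.Series:
    """True where the last trade is older than max_age at observation time.

    Assumes datetime columns 'observed_at' and 'last_trade_at'.
    """
    age = quotes["observed_at"] - quotes["last_trade_at"]
    return age > max_age


# Toy data (hypothetical): the price does not update for two sessions.
quotes = pd.DataFrame({
    "observed_at": pd.to_datetime([
        "2024-01-02 16:00", "2024-01-03 16:00",
        "2024-01-04 16:00", "2024-01-05 16:00",
    ]),
    "last_trade_at": pd.to_datetime([
        "2024-01-02 15:30", "2024-01-02 15:30",
        "2024-01-02 15:30", "2024-01-05 15:59",
    ]),
    "price": [10.0, 10.0, 10.0, 10.2],
})

stale = flag_stale(quotes, max_age=pd.Timedelta(hours=6))
stale_fraction = float(stale.mean())
```

The middle two observations repeat a price from a trade that is one and two days old, so they are flagged. A high stale fraction for an asset is a signal to switch to quote data or to exclude it from volatility and correlation estimates.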

Index Reconstitution Bias

S&P 500 additions tend to outperform before they’re added. That’s part of why they were added: inclusion criteria favor large, growing companies. Using current index membership retroactively means your backtest includes these stocks during their pre-addition outperformance phase, performance that you couldn’t have captured because you wouldn’t have known they were going to be added.

The direction is positive: pre-addition outperformance inflates the index return estimate. The magnitude is modest for broad indices but can be meaningful for concentrated strategies that select within the index. This is really a special case of survivorship bias, but it’s worth calling out separately because it affects index-tracking strategies specifically.

The mitigation is point-in-time constituent data, the same remedy as for survivorship bias. Use the index membership that was in effect at each historical date, not today’s membership applied retroactively.

The Bias Checklist: A Validation Scorecard

I keep a checklist that I run through for every backtest. Not a mental checklist. A literal, documented checklist with code that runs automated checks where possible. The informal version looks like this:

Bias	Direction	Key Question
Survivorship	Positive	Does my universe include delisted assets with terminal returns?
Lookahead	Varies	Does the shift test produce a large Sharpe drop?
Time-period	Positive	Is performance consistent across sub-periods and regimes?
Transaction cost	Positive	What is the break-even cost per trade?
Corporate action	Varies	Are there anomalous single-day returns without matching actions?
Rebalance timing	Positive	Does the execution timing assumption change results materially?
Overfitting	Positive	Is the parameter sensitivity landscape smooth?
Fill assumption	Positive	Does the strategy rely on limit-order fills at exact prices?

The Bias Budget

The checklist becomes quantitative through what I call a bias budget. For each bias that’s present, estimate the magnitude and direction of its effect on the reported Sharpe ratio. Sum the magnitudes, and to stay conservative assume the biases reinforce one another rather than offset. Subtract the total from the backtest’s gross Sharpe. The residual is your de-biased Sharpe estimate.

If the de-biased Sharpe is still above your threshold, the strategy may have genuine signal. If it’s not, you need to either address the bias sources or accept that the apparent edge may be an artifact. The threshold itself is something you need to calibrate for your own context. It depends on your risk tolerance, your capital, your execution infrastructure, and how much effort you’re willing to invest in a strategy before concluding it’s real.

This is the formal version:

from dataclasses import dataclass
from enum import Enum


class BiasDirection(Enum):
    """Direction of bias effect on reported performance."""
    POSITIVE = "positive"
    NEGATIVE = "negative"
    AMBIGUOUS = "ambiguous"


class AuditRecommendation(Enum):
    """Overall recommendation from the bias audit."""
    PASS = "pass"
    REVIEW = "review"
    FAIL = "fail"


@dataclass(frozen=True)
class BiasCheck:
    """Assessment of a single bias source.

    Attributes:
        bias_name: Human-readable name of the bias.
        present: Whether this bias is detected or suspected in the backtest.
        direction: Which direction the bias pushes performance.
        estimated_sharpe_impact: Estimated effect on the Sharpe ratio.
            Positive values mean the bias inflates the reported Sharpe.
        notes: Free-text notes on the assessment.
        automated_check_passed: Whether the automated detection (if any)
            passed. None if no automated check is available.
    """
    bias_name: str
    present: bool
    direction: BiasDirection
    estimated_sharpe_impact: float
    notes: str
    automated_check_passed: bool | None


@dataclass(frozen=True)
class BiasAuditResult:
    """Full bias audit scorecard for a backtest.

    Attributes:
        reported_sharpe: The Sharpe ratio reported by the backtest.
        survivorship: Assessment of survivorship bias.
        lookahead: Assessment of lookahead bias.
        time_period: Assessment of time-period bias.
        transaction_cost: Assessment of transaction cost bias.
        corporate_action: Assessment of corporate action bias.
        rebalance_timing: Assessment of rebalance timing bias.
        overfitting: Assessment of overfitting bias.
        fill_assumption: Assessment of fill assumption bias.
        estimated_total_bias: Sum of all estimated Sharpe impacts.
        debiased_sharpe: Reported Sharpe minus estimated total bias.
        recommendation: Overall recommendation based on the de-biased Sharpe.
    """
    reported_sharpe: float
    survivorship: BiasCheck
    lookahead: BiasCheck
    time_period: BiasCheck
    transaction_cost: BiasCheck
    corporate_action: BiasCheck
    rebalance_timing: BiasCheck
    overfitting: BiasCheck
    fill_assumption: BiasCheck
    estimated_total_bias: float
    debiased_sharpe: float
    recommendation: AuditRecommendation


def run_bias_audit(
    reported_sharpe: float,
    checks: list[BiasCheck],
) -> BiasAuditResult:
    """Aggregate individual bias checks into a full audit scorecard.

    Sums the estimated Sharpe impacts of all detected biases, computes
    a de-biased Sharpe estimate, and produces a recommendation. The
    recommendation thresholds should be calibrated for your specific
    context; the defaults here are conservative starting points.

    Args:
        reported_sharpe: The Sharpe ratio from the backtest.
        checks: A list of exactly 8 BiasCheck objects, one per bias
            category, in the order: survivorship, lookahead, time_period,
            transaction_cost, corporate_action, rebalance_timing,
            overfitting, fill_assumption.

    Returns:
        A BiasAuditResult with the full scorecard and recommendation.

    Raises:
        ValueError: If exactly 8 bias checks are not provided.
    """
    if len(checks) != 8:
        raise ValueError(
            f"Expected 8 bias checks, got {len(checks)}. Provide one check "
            f"per bias category."
        )

    total_bias = sum(c.estimated_sharpe_impact for c in checks if c.present)
    debiased = reported_sharpe - total_bias

    any_failed = any(
        c.automated_check_passed is False for c in checks
    )

    if any_failed or debiased <= 0:
        recommendation = AuditRecommendation.FAIL
    elif debiased < reported_sharpe * 0.5:
        recommendation = AuditRecommendation.REVIEW
    else:
        recommendation = AuditRecommendation.PASS

    return BiasAuditResult(
        reported_sharpe=reported_sharpe,
        survivorship=checks[0],
        lookahead=checks[1],
        time_period=checks[2],
        transaction_cost=checks[3],
        corporate_action=checks[4],
        rebalance_timing=checks[5],
        overfitting=checks[6],
        fill_assumption=checks[7],
        estimated_total_bias=total_bias,
        debiased_sharpe=debiased,
        recommendation=recommendation,
    )

The recommendation logic is deliberately simple. If any automated check fails outright or the de-biased Sharpe drops to zero or below, the backtest fails. If the de-biased Sharpe loses more than half of the reported value, the result needs review. Otherwise, it passes. These thresholds are starting points. You should adjust them based on how conservative you want to be. I err on the conservative side because the cost of a false positive (deploying a strategy that looks good but isn’t) is much higher than the cost of a false negative (rejecting a strategy that is good but looks marginal).

Connection to the Full Pipeline

The bias audit sits between backtesting and statistical validation in my analysis pipeline. The sequence is:

Data ingestion with property-based validation to catch structural problems
Signal generation and backtesting
Bias audit (this article) to estimate and subtract systematic distortions
Statistical validation: Deflated Sharpe Ratio to correct for multiple testing, bootstrap methods to estimate performance distributions, Monte Carlo permutation tests to test against randomized baselines
Walk-forward validation for out-of-sample confirmation
Paper trading for live-data confirmation before deploying capital

The bias audit is not optional. Every step downstream assumes the backtest result is a reasonable starting point. If the backtest is distorted by undetected biases, every subsequent statistical test inherits the distortion. You’re computing the deflated Sharpe ratio of an inflated number. You’re bootstrapping from a biased distribution. And if you haven’t checked for autocorrelation in your strategy returns, even the “de-biased” Sharpe is still overstated by the annualization formula. The garbage-in-garbage-out principle applies with full force.

How Biases Compound

I want to emphasize one final point: these biases don’t operate in isolation. They compound. A backtest with survivorship bias, a one-bar lookahead, and no transaction costs doesn’t have three small biases. It has three biases whose effects multiply through the return series.

Survivorship bias removes the worst-performing names from the universe. Lookahead bias lets the strategy exploit future information to pick the best-performing names from the survivors. Zero transaction costs let the strategy trade as frequently as it wants to capture every fleeting signal. Each bias makes the next one more damaging.

This is why the bias budget should assume compounding rather than simple addition: treat the plain sum of estimated impacts as a lower bound on the total distortion, not as a best guess. And this is why I treat any backtest with more than one unaddressed bias as unreliable until proven otherwise, regardless of how attractive the headline numbers look.

The goal isn’t a bias-free backtest. That doesn’t exist. The goal is a backtest where you’ve identified every significant bias, estimated its magnitude, and demonstrated that the remaining signal exceeds the cumulative distortion. That’s not a guarantee of future performance. Nothing is. But it’s a far better foundation than a headline Sharpe ratio and a prayer.

References

Almgren, R., & Chriss, N. (2001). Optimal Execution of Portfolio Transactions. Journal of Risk, 3(2), 5-39.
Bailey, D. H., Borwein, J. M., Lopez de Prado, M., & Zhu, Q. J. (2014). Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance. Notices of the AMS, 61(5), 458-471.
Elton, E. J., Gruber, M. J., & Blake, C. R. (1996). Survivorship Bias and Mutual Fund Performance. The Review of Financial Studies, 9(4), 1097-1120.
Harvey, C. R., & Liu, Y. (2015). Backtesting. The Journal of Portfolio Management, 42(1), 13-28.
Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.

Susan Potter

Quant
