Susan Potter

Autocorrelation and What It Means for Your Backtest P&L

Here’s a scenario most traders have experienced: you backtest a strategy, the numbers look solid, you size up, and live performance disappoints. Not catastrophically, just consistently worse than the backtest promised. You check for bugs. You check your fill assumptions. Everything looks right. But the returns are softer than expected and the drawdowns are deeper.

One common cause that doesn’t show up in a standard backtest report: your daily returns aren’t independent of each other. Today’s return is influenced by yesterday’s. If you hold a position for five days, each of those five daily returns is driven by the same bet. If your signal updates weekly but you measure returns daily, five consecutive days of returns are all following the same instruction. These patterns mean your returns are “autocorrelated,” and when that’s the case, the standard formula for annualizing your Sharpe ratio quietly overstates your risk-adjusted performance.

The Sharpe ratio formula that every backtesting platform uses assumes each day’s return is a fresh, independent coin flip. When returns are correlated day-to-day, your actual risk over longer horizons is higher than the formula assumes, which means your actual Sharpe is lower than the number on screen. Not by a huge amount for most strategies (typically 5-15%), but enough to change allocation decisions, especially once you account for how much wider your confidence interval becomes.

I spent three days tuning a momentum strategy, watching the backtest Sharpe ratio climb to 1.8 annualized. That number felt good. Then I checked for autocorrelation in the daily returns and found a first-order coefficient of 0.12. After applying the Lo (2002) correction, the Sharpe dropped to about 1.6. Still tradeable, but a different conversation with my risk budget once I factored in how much less certain I could be about that number.

This article covers why strategy returns are autocorrelated, how to detect it, how to correct for it, and what honest performance reporting looks like once you take serial dependence seriously.

The Hidden Inflation in Your Sharpe Ratio

The standard Sharpe ratio annualization is the most commonly abused formula in quantitative finance:

SR_annual = SR_daily * sqrt(252)

That sqrt(252) scaling factor assumes i.i.d. returns. Specifically, it assumes that the variance of a multi-period return scales linearly with the number of periods. If daily returns are independent, then the variance of 252-day returns is exactly 252 times the variance of daily returns, so the standard deviation scales by sqrt(252), so the Sharpe ratio (mean over standard deviation) scales by sqrt(252).

But if returns are positively autocorrelated, variance grows faster than linearly with time. Positive autocorrelation means an up day is more likely to be followed by another up day (or more precisely, the conditional expectation of tomorrow’s return given today’s return is shifted in the same direction). Over longer horizons, these correlated shocks compound rather than cancel, making the true multi-period variance larger than the i.i.d. assumption predicts.

The result: the true annual standard deviation is larger than sqrt(252) * daily_std, which means the true annual Sharpe is smaller than SR_daily * sqrt(252).

Andrew Lo formalized this in his 2002 paper “The Statistics of Sharpe Ratios.” The corrected annualized Sharpe ratio accounts for the autocorrelation structure of returns. For a return series with autocorrelations rho_1, rho_2, ..., rho_K, the correction factor involves the sum of these autocorrelations weighted by their lag distance.
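For reference, the aggregation result for stationary returns can be written out as below. This is a transcription from memory rather than a quotation, so check the notation against the paper before relying on it. SR is the one-period Sharpe ratio, SR(q) is the q-period Sharpe ratio, and rho_k is the k-th order autocorrelation.

```latex
\mathrm{SR}(q) = \eta(q)\,\mathrm{SR},
\qquad
\eta(q) = \frac{q}{\sqrt{\,q + 2\sum_{k=1}^{q-1}(q-k)\,\rho_k\,}}
        = \sqrt{\frac{q}{1 + 2\sum_{k=1}^{q-1}\left(1 - \tfrac{k}{q}\right)\rho_k}}
```

With every rho_k equal to zero this collapses to eta(q) = sqrt(q), which is the familiar sqrt(252) rule for daily data. Note that the code later in this article stores the denominator term, 1 + 2 * sum((1 - k/q) * rho_k), under the name eta_factor, so that field is a variance ratio rather than Lo’s eta(q).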

Here is the magnitude of the problem for different levels of first-order autocorrelation, assuming an AR(1) process for daily returns:

rho = 0.05  →  ~5% Sharpe inflation
rho = 0.10  →  ~10% Sharpe inflation
rho = 0.15  →  ~16% Sharpe inflation
rho = 0.20  →  ~22% Sharpe inflation
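The figures above can be reproduced from the AR(1) assumption alone. The following is a minimal sketch, assuming rho_k = rho ** k and q = 252; the function name ar1_sharpe_inflation is made up for this illustration.

```python
import numpy as np


def ar1_sharpe_inflation(rho: float, q: int = 252) -> float:
    """Percentage by which sqrt(q) scaling overstates the Sharpe for AR(1) returns.

    Assumes the autocorrelation at lag k is rho ** k.
    """
    k = np.arange(1, q)
    # Variance of the q-period return relative to the i.i.d. value (q * daily variance).
    variance_ratio = 1.0 + 2.0 * np.sum((1.0 - k / q) * rho ** k)
    return float((np.sqrt(variance_ratio) - 1.0) * 100.0)


for rho in (0.05, 0.10, 0.15, 0.20):
    print(f"rho = {rho:.2f} -> {ar1_sharpe_inflation(rho):.1f}% Sharpe inflation")
```

Passing a negative rho returns a negative number, meaning the naive figure understates the Sharpe rather than overstating it.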

These numbers may look small, but they compound with other biases. A daily autocorrelation of 0.10 is unremarkable for a strategy that holds positions for a few days. But a 10% inflation on top of survivorship bias, on top of optimistic fill assumptions, on top of transaction cost underestimates, can turn a mediocre strategy into an apparently good one. And when you factor in the wider confidence intervals that autocorrelation produces (the bootstrap CI expands significantly), the practical impact on capital allocation decisions is larger than the point estimate suggests.

Let me show how to compute this inflation factor directly:

from dataclasses import dataclass
import numpy as np
import numpy.typing as npt


@dataclass(frozen=True)
class SharpeInflation:
    """Result of computing Sharpe ratio inflation from autocorrelation.

    Attributes:
        raw_sharpe: The naive annualized Sharpe ratio using sqrt(252).
        adjusted_sharpe: The Lo (2002) corrected annualized Sharpe.
        inflation_pct: How much the raw Sharpe overstates the adjusted, as a percentage.
        eta_factor: The variance ratio correction factor.
    """
    raw_sharpe: float
    adjusted_sharpe: float
    inflation_pct: float
    eta_factor: float


def compute_sharpe_inflation(
    daily_returns: npt.NDArray[np.float64],
    max_lag: int = 20,
    periods_per_year: int = 252,
) -> SharpeInflation:
    """Compute the Lo (2002) Sharpe ratio correction for autocorrelation.

    Estimates autocorrelations up to max_lag and computes the variance
    ratio that adjusts the annualized Sharpe downward for positive
    serial correlation (or upward for negative serial correlation).

    Args:
        daily_returns: Array of daily strategy returns.
        max_lag: Maximum lag for autocorrelation estimation.
        periods_per_year: Trading periods per year for annualization.

    Returns:
        SharpeInflation with raw, adjusted, and inflation metrics.

    Raises:
        ValueError: If daily_returns has fewer observations than max_lag + 1.
    """
    n = len(daily_returns)
    if n < max_lag + 1:
        raise ValueError(
            f"Need at least {max_lag + 1} observations, got {n}"
        )

    mean_return = np.mean(daily_returns)
    std_return = np.std(daily_returns, ddof=1)

    if std_return == 0:
        return SharpeInflation(
            raw_sharpe=0.0,
            adjusted_sharpe=0.0,
            inflation_pct=0.0,
            eta_factor=1.0,
        )

    daily_sharpe = mean_return / std_return

    # Compute autocorrelations with the (1 - k/q) weights used in Lo (2002).
    # Var[q-period return] = q * sigma^2 * (1 + 2 * sum_{k=1}^{q-1} (1 - k/q) * rho_k)
    # where q = periods_per_year; lags beyond max_lag are treated as zero.
    demeaned = daily_returns - mean_return
    gamma_0 = np.sum(demeaned ** 2) / n

    weighted_sum = 0.0
    q = periods_per_year
    for k in range(1, min(max_lag, q) + 1):
        gamma_k = np.sum(demeaned[k:] * demeaned[:-k]) / n
        rho_k = gamma_k / gamma_0
        weight = 1.0 - k / q
        weighted_sum += weight * rho_k

    # variance_inflation is the factor by which multi-period variance
    # exceeds what i.i.d. would predict. > 1 for positive autocorrelation.
    variance_inflation = 1.0 + 2.0 * weighted_sum
    variance_inflation = max(variance_inflation, 0.01)  # Floor to avoid division issues

    raw_sharpe = daily_sharpe * np.sqrt(q)
    # Correct formula: divide by sqrt of the variance inflation factor
    # Positive autocorrelation increases variance, which REDUCES the Sharpe
    adjusted_sharpe = raw_sharpe / np.sqrt(variance_inflation)

    if adjusted_sharpe == 0:
        inflation_pct = 0.0
    else:
        inflation_pct = (raw_sharpe / adjusted_sharpe - 1.0) * 100.0

    return SharpeInflation(
        raw_sharpe=float(raw_sharpe),
        adjusted_sharpe=float(adjusted_sharpe),
        inflation_pct=float(inflation_pct),
        eta_factor=float(variance_inflation),
    )

The eta_factor tells you how much larger the true multi-period variance is relative to what the i.i.d. assumption predicts. An eta_factor of 1.22 (typical for rho=0.10) means 22% more variance, which translates to about 10% Sharpe inflation (since Sharpe scales with the inverse of standard deviation, and sqrt(1.22) - 1 ≈ 0.10). The relationship is nonlinear and depends on the full autocorrelation structure, not just the first lag, though for AR(1) processes the first lag dominates.

Most practitioners don’t report this correction. I’ve read dozens of strategy writeups on QuantConnect, Quantopian (when it existed), and various quant blogs. Almost none include autocorrelation diagnostics. Some people don’t know about the correction. Others know but skip it because it makes their numbers less impressive. Both are problems, but the second one is worse.

Why Strategy Returns Are Autocorrelated

Here is the thing that confused me when I first encountered this topic: asset returns are approximately uncorrelated. The weak form of the efficient market hypothesis implies that past returns don’t predict future returns, and the empirical evidence broadly supports this at daily and longer frequencies. So why would strategy returns, which are just filtered versions of asset returns, be autocorrelated?

The answer is that the filter itself introduces serial dependence. Your trading rules transform a roughly uncorrelated input (asset returns) into a correlated output (strategy returns) through several mechanisms.

Holding period effects

This is the most mechanical and most common source. If your strategy enters a position on Monday and exits on Friday, every daily return during that holding period is driven by the same position. Today’s return is positive because you are long 100 shares of XYZ. Tomorrow’s return is also driven by being long 100 shares of XYZ. The returns are correlated because the position is persistent.

Think of it differently. A strategy that flips positions randomly each day would have uncorrelated returns (assuming the asset returns themselves are uncorrelated). The moment you hold for longer than one period, you introduce positive autocorrelation at lags up to the holding period length.

I measure this by computing the average holding period and then checking whether the ACF shows significant autocorrelation at lags 1 through the holding period. If it does, and the autocorrelation dies off beyond that lag, holding period effects explain most of the serial dependence.
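The first half of that check can be sketched as follows, assuming you have a per-period position series in which 0 means flat. The function name average_holding_period and the sample positions array are illustrative, and treating every change in position (including a resize or a flip) as the start of a new holding period is a simplifying assumption.

```python
import numpy as np
import numpy.typing as npt


def average_holding_period(positions: npt.NDArray[np.float64]) -> float:
    """Mean length, in periods, of runs where the position is constant and nonzero."""
    if len(positions) == 0:
        return 0.0
    # Indices where the position changes mark the start of a new run.
    change_points = np.flatnonzero(np.diff(positions) != 0) + 1
    run_starts = np.concatenate(([0], change_points))
    run_lengths = np.diff(np.concatenate((run_starts, [len(positions)])))
    in_market = positions[run_starts] != 0
    if not in_market.any():
        return 0.0
    return float(run_lengths[in_market].mean())


# Hypothetical position series: long for 3 days, flat for 2, short for 2, flat.
positions = np.array([0, 1, 1, 1, 0, 0, -1, -1, 0], dtype=float)
holding = average_holding_period(positions)
# Lags worth inspecting on the ACF plot for holding-period effects.
lags_to_inspect = range(1, int(np.ceil(holding)) + 1)
print(holding, list(lags_to_inspect))
```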

Slow signal updates

Many strategies update their signals at a lower frequency than their return measurement. A weekly momentum signal produces daily returns that are driven by the same signal for five consecutive days. A monthly rebalancing schedule produces daily returns from the same portfolio for approximately 21 days. The returns within each signal window are correlated because they come from the same allocation decision.

This is particularly insidious in backtests where you compute daily returns for reporting purposes but the strategy only makes decisions weekly or monthly. The daily Sharpe ratio gets the full sqrt(252) treatment even though there are really only 52 or 12 independent decision points per year. The autocorrelation inflates the annualized Sharpe, and the inflation is entirely a measurement artifact.
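One way to see the size of the artifact is to compute the Sharpe ratio at the decision frequency and compare it with the daily figure. A minimal sketch, assuming a weekly signal, five trading days per decision, and 52 decisions per year; the function name is made up, and summing daily returns within a window is an approximation that is exact only for log returns.

```python
import numpy as np
import numpy.typing as npt


def sharpe_at_decision_frequency(
    daily_returns: npt.NDArray[np.float64],
    days_per_decision: int = 5,
    decisions_per_year: int = 52,
) -> float:
    """Annualized Sharpe computed on returns aggregated to the decision frequency.

    Sums daily returns within non-overlapping windows and drops any
    incomplete final window.
    """
    n_windows = len(daily_returns) // days_per_decision
    if n_windows < 2:
        raise ValueError("Need at least two complete windows")
    trimmed = daily_returns[: n_windows * days_per_decision]
    window_returns = trimmed.reshape(n_windows, days_per_decision).sum(axis=1)
    std = np.std(window_returns, ddof=1)
    if std == 0:
        return 0.0
    return float(np.mean(window_returns) / std * np.sqrt(decisions_per_year))
```

If this number sits well below the sqrt(252)-annualized daily Sharpe, serial dependence within the decision window is a likely reason.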

Smoothing in underlying data

This one is more relevant for hedge fund returns and illiquid asset classes, but it matters wherever mark-to-model pricing enters the picture. Getmansky, Lo, and Makarov (2004) showed that hedge fund returns exhibit significant positive autocorrelation, not because of genuine serial dependence in the fund’s true returns, but because illiquid positions are marked using stale prices. The reported return in month t includes some of the true return from month t-1 because the positions weren’t repriced until t.

Their model decomposes observed returns into a moving average of true returns:

R_observed_t = theta_0 * R_true_t + theta_1 * R_true_{t-1} + ... + theta_k * R_true_{t-k}

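where the theta weights sum to 1. This smoothing creates positive autocorrelation in observed returns even if true returns are i.i.d. The effect can be substantial: smoothing leaves the mean unchanged but shrinks the observed volatility by a factor of sqrt(theta_0^2 + ... + theta_k^2), so the reported Sharpe ratio is overstated by the inverse of that factor. A fund with monthly autocorrelation of 0.3 has a reported Sharpe ratio about 26% above its true risk-adjusted performance under a two-term profile (theta_0 = 0.75, theta_1 = 0.25), and the gap can reach 50% when the smoothing is spread over more months.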
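The arithmetic behind the smoothing model is short enough to sketch. Assuming true returns are i.i.d. and the weights sum to 1, the observed mean is unchanged while the observed variance is scaled by the sum of squared weights. The function name smoothing_effects and the example weights are illustrative, not taken from the paper.

```python
import numpy as np


def smoothing_effects(theta: list[float]) -> tuple[float, float]:
    """First-order autocorrelation and Sharpe overstatement implied by smoothing weights.

    Assumes true returns are i.i.d. and observed returns are a moving
    average of them with weights theta that sum to 1.
    """
    w = np.asarray(theta, dtype=float)
    if not np.isclose(w.sum(), 1.0):
        raise ValueError("Smoothing weights must sum to 1")
    sum_sq = float(np.sum(w ** 2))
    # Lag-1 autocorrelation of a moving average of i.i.d. shocks.
    rho_1 = float(np.sum(w[:-1] * w[1:]) / sum_sq)
    # The mean is unchanged and volatility shrinks by sqrt(sum_sq).
    sharpe_multiplier = float(1.0 / np.sqrt(sum_sq))
    return rho_1, sharpe_multiplier


rho_1, multiplier = smoothing_effects([0.75, 0.25])
print(f"rho_1 = {rho_1:.2f}, reported Sharpe / true Sharpe = {multiplier:.3f}")
```

For this two-term profile the first-order autocorrelation is 0.3 and the overstatement is roughly 26%. Profiles that spread the smoothing over more lags can produce a larger overstatement for the same first-order autocorrelation.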

I don’t trade illiquid assets directly, but I’ve seen this effect in backtests that use end-of-day closing prices for instruments that trade infrequently. If a stock only trades a few times per day, the closing price may be stale, and your daily returns inherit that staleness as positive autocorrelation.

Transaction cost drag

Strategies with high turnover pay the bid-ask spread repeatedly. On days when the strategy trades, the realized return includes a negative component (the cost of crossing the spread). On days when the strategy holds, no spread is paid. This creates a pattern: trade-day returns are biased downward relative to hold-day returns.

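If trades are clustered (as they are in many rebalancing strategies), returns pick up a periodic pattern: a cluster of bad days (high-turnover) followed by a cluster of better days (low-turnover). Within each cluster, neighboring days look alike, which shows up as positive autocorrelation at short lags; across the switch from one cluster to the next, the sign flips, which shows up as negative autocorrelation at lags comparable to the cluster length. It’s a subtler effect than holding period autocorrelation, but it shows up in strategies with periodic rebalancing or in high-frequency strategies where costs are a large fraction of gross returns.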

Mean-reversion construction

Mean-reversion strategies buy after price drops and sell after price rises. By design, a down day in the asset leads to buying, which (if the mean-reversion works) leads to an up day in the strategy. Then the position is closed, and the next signal might be neutral or opposite. This creates negative autocorrelation in strategy returns: positive returns tend to follow negative returns and vice versa.

This is one of the cases where autocorrelation actually helps you. Negative autocorrelation means variance grows slower than linearly with time. The true annual variance is smaller than 252 * daily_variance, so the true annual Sharpe is higher than the naive sqrt(252) scaling suggests. Mean-reversion strategies look better at longer horizons because the negative serial dependence reduces long-term risk.

I still correct for it in my reporting, but the correction goes in the other direction: the adjusted Sharpe is higher than the raw Sharpe. It’s honest to report both, and it’s useful to know that the improved long-horizon performance is genuine rather than a statistical artifact.

The key insight

Autocorrelation in strategy returns tells you something about the strategy’s structure, not about market inefficiency. It’s an artifact of the trading process: how long you hold, how often you rebalance, how your signal is constructed. Recognizing this means you stop treating autocorrelation as a nuisance to be corrected and start treating it as a diagnostic that reveals the strategy’s mechanics.

Detecting Autocorrelation

Before you can correct for autocorrelation, you have to measure it. I use four tools, in order of increasing formality.

The ACF plot

Start here. Always. The ACF at lag k is the correlation between r_t and r_{t-k}, computed over the full sample. Plot it for lags 1 through 20 or so, along with the 95% confidence bands at +/- 1.96 / sqrt(T).

If you see significant bars at low lags (1 through 5), you probably have holding period effects or slow signal updates. If you see significant bars at a specific lag (say lag 20-22), suspect monthly rebalancing. If the ACF alternates in sign at low lags, you likely have a mean-reversion strategy.

The ACF is the fastest diagnostic. I glance at it after every backtest. It takes thirty seconds and tells me whether the Sharpe ratio I just computed is trustworthy.
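If you want the numbers without a plotting library, the same check fits in a few lines of NumPy. This is a sketch rather than a replacement for a proper ACF routine: sample_acf and significant_lags are made-up names, and the AR(1) series is synthetic data used only to exercise the functions.

```python
import numpy as np
import numpy.typing as npt


def sample_acf(
    returns: npt.NDArray[np.float64], max_lag: int = 20
) -> npt.NDArray[np.float64]:
    """Sample autocorrelations for lags 0..max_lag (biased estimator)."""
    demeaned = returns - np.mean(returns)
    gamma_0 = np.sum(demeaned ** 2)
    acf = np.ones(max_lag + 1)
    for k in range(1, max_lag + 1):
        acf[k] = np.sum(demeaned[k:] * demeaned[:-k]) / gamma_0
    return acf


def significant_lags(
    returns: npt.NDArray[np.float64], max_lag: int = 20
) -> list[int]:
    """Lags whose autocorrelation falls outside the +/- 1.96 / sqrt(T) band."""
    acf = sample_acf(returns, max_lag)
    band = 1.96 / np.sqrt(len(returns))
    return [k for k in range(1, max_lag + 1) if abs(acf[k]) > band]


# Synthetic AR(1) series with coefficient 0.1 (illustration only).
rng = np.random.default_rng(0)
shocks = rng.normal(0.0, 0.01, size=10_000)
returns = np.empty_like(shocks)
returns[0] = shocks[0]
for t in range(1, shocks.size):
    returns[t] = 0.1 * returns[t - 1] + shocks[t]

print(f"lag-1 ACF: {sample_acf(returns)[1]:.3f}")
print(f"significant lags: {significant_lags(returns)}")
```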

The PACF plot

The PACF at lag k is the correlation between r_t and r_{t-k} after removing the linear effects of intervening lags. It’s useful for identifying the order of an autoregressive process. If the ACF decays slowly but the PACF cuts off sharply after lag 2, an AR(2) model would capture most of the serial dependence.

I use the PACF less frequently than the ACF. It’s most useful when I want to parameterize the autocorrelation structure (for example, to choose the bandwidth for Newey-West standard errors).

The Ljung-Box test

This is the formal hypothesis test. The Ljung-Box Q statistic tests whether the first K autocorrelations are jointly zero:

Q = T(T+2) * sum_{k=1}^{K} rho_k^2 / (T-k)

Under the null hypothesis of no autocorrelation, Q follows a chi-squared distribution with K degrees of freedom. A small p-value means the returns are significantly autocorrelated.
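The Q statistic is simple enough to compute by hand, which helps when checking what a library is doing. A minimal sketch under stated assumptions: ljung_box_q is a made-up name, the two critical values are the standard 95% chi-squared quantiles for 10 and 20 degrees of freedom, and the return series is synthetic (a shared weekly component plus daily noise).

```python
import numpy as np
import numpy.typing as npt

# 95% critical values of the chi-squared distribution (10 and 20 degrees of freedom).
CHI2_CRITICAL_95 = {10: 18.307, 20: 31.410}


def ljung_box_q(returns: npt.NDArray[np.float64], n_lags: int) -> float:
    """Ljung-Box Q statistic for the first n_lags sample autocorrelations."""
    t = len(returns)
    demeaned = returns - np.mean(returns)
    gamma_0 = np.sum(demeaned ** 2)
    total = 0.0
    for k in range(1, n_lags + 1):
        rho_k = np.sum(demeaned[k:] * demeaned[:-k]) / gamma_0
        total += rho_k ** 2 / (t - k)
    return float(t * (t + 2) * total)


# Synthetic series: each week shares one component, plus independent daily noise.
rng = np.random.default_rng(1)
weekly = np.repeat(rng.normal(0.0, 0.01, size=400), 5)
returns = weekly + rng.normal(0.0, 0.01, size=weekly.size)

for n_lags, critical in CHI2_CRITICAL_95.items():
    q_stat = ljung_box_q(returns, n_lags)
    print(f"K={n_lags}: Q={q_stat:.1f}, reject at 5%: {q_stat > critical}")
```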

I test at lags 10 and 20. Lag 10 catches short-horizon effects (holding period, signal updates). Lag 20 catches monthly rebalancing effects. If either rejects at the 5% level, I flag the raw Sharpe as unreliable and compute the correction.

The Durbin-Watson statistic

The DW statistic is specifically designed for first-order autocorrelation:

DW ≈ 2(1 - rho_1)

DW = 2 means no autocorrelation. DW < 2 means positive autocorrelation. DW > 2 means negative autocorrelation. It’s a quick sanity check rather than a comprehensive test.
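The approximation is easy to verify numerically. The sketch below computes the statistic on demeaned returns and compares it with 2 * (1 - rho_1); the function names and the synthetic AR(1) series are assumptions made for the illustration.

```python
import numpy as np
import numpy.typing as npt


def durbin_watson_stat(returns: npt.NDArray[np.float64]) -> float:
    """Durbin-Watson statistic computed on demeaned returns."""
    resid = returns - np.mean(returns)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))


def lag1_autocorrelation(returns: npt.NDArray[np.float64]) -> float:
    """Sample first-order autocorrelation of the returns."""
    resid = returns - np.mean(returns)
    return float(np.sum(resid[1:] * resid[:-1]) / np.sum(resid ** 2))


# Synthetic AR(1) series with coefficient 0.3 (illustration only).
rng = np.random.default_rng(2)
shocks = rng.normal(0.0, 0.01, size=5000)
returns = np.empty_like(shocks)
returns[0] = shocks[0]
for t in range(1, shocks.size):
    returns[t] = 0.3 * returns[t - 1] + shocks[t]

dw = durbin_watson_stat(returns)
rho_1 = lag1_autocorrelation(returns)
print(f"DW = {dw:.3f}, 2 * (1 - rho_1) = {2 * (1 - rho_1):.3f}")
```

The two numbers differ only by a term involving the first and last residuals, which is negligible for samples of this length.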

I include it in my diagnostic output because it’s familiar to anyone with a regression background, but I rely on Ljung-Box for formal inference because it captures higher-order effects that DW misses.

Putting it all together in code

Here is the diagnostic class I use to assess autocorrelation in strategy returns:

from dataclasses import dataclass
import numpy as np
import numpy.typing as npt
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.stats.stattools import durbin_watson


@dataclass(frozen=True)
class LjungBoxResult:
    """Results from a Ljung-Box test at a specific lag.

    Attributes:
        lag: The number of lags tested.
        statistic: The Q statistic value.
        p_value: The p-value under the chi-squared null.
        significant: Whether the null of no autocorrelation is rejected at 5%.
    """
    lag: int
    statistic: float
    p_value: float
    significant: bool


@dataclass(frozen=True)
class AutocorrelationDiagnostic:
    """Complete autocorrelation diagnostic for a return series.

    Attributes:
        acf_values: Autocorrelation function values from lag 0 to max_lag.
        pacf_values: Partial autocorrelation values from lag 0 to max_lag.
        confidence_band: The 95% confidence threshold (1.96 / sqrt(T)).
        ljung_box_lag10: Ljung-Box test result at lag 10.
        ljung_box_lag20: Ljung-Box test result at lag 20.
        durbin_watson: The Durbin-Watson statistic.
        first_order_autocorrelation: The ACF at lag 1.
        has_significant_autocorrelation: True if any Ljung-Box test rejects.
    """
    acf_values: npt.NDArray[np.float64]
    pacf_values: npt.NDArray[np.float64]
    confidence_band: float
    ljung_box_lag10: LjungBoxResult
    ljung_box_lag20: LjungBoxResult
    durbin_watson: float
    first_order_autocorrelation: float
    has_significant_autocorrelation: bool


def diagnose_autocorrelation(
    returns: npt.NDArray[np.float64],
    max_lag: int = 20,
) -> AutocorrelationDiagnostic:
    """Run a complete autocorrelation diagnostic on strategy returns.

    Computes ACF, PACF, Ljung-Box tests at lags 10 and 20, and the
    Durbin-Watson statistic. Flags the series if any formal test
    rejects the null of no autocorrelation at the 5% level.

    Args:
        returns: Array of strategy returns (daily or at the strategy's
            natural frequency).
        max_lag: Maximum lag for ACF and PACF computation.

    Returns:
        AutocorrelationDiagnostic with all test results.

    Raises:
        ValueError: If the return series has fewer than max_lag + 1 observations.
    """
    n = len(returns)
    if n < max_lag + 1:
        raise ValueError(
            f"Need at least {max_lag + 1} observations, got {n}"
        )

    acf_vals = acf(returns, nlags=max_lag, fft=True)
    pacf_vals = pacf(returns, nlags=max_lag)

    conf_band = 1.96 / np.sqrt(n)

    lb_result = acorr_ljungbox(returns, lags=[10, 20], return_df=True)

    lb10 = LjungBoxResult(
        lag=10,
        statistic=float(lb_result.iloc[0]["lb_stat"]),
        p_value=float(lb_result.iloc[0]["lb_pvalue"]),
        significant=float(lb_result.iloc[0]["lb_pvalue"]) < 0.05,
    )
    lb20 = LjungBoxResult(
        lag=20,
        statistic=float(lb_result.iloc[1]["lb_stat"]),
        p_value=float(lb_result.iloc[1]["lb_pvalue"]),
        significant=float(lb_result.iloc[1]["lb_pvalue"]) < 0.05,
    )

    # Durbin-Watson is defined on residuals, so demean the returns first.
    dw = float(durbin_watson(returns - np.mean(returns)))

    return AutocorrelationDiagnostic(
        acf_values=acf_vals,
        pacf_values=pacf_vals,
        confidence_band=float(conf_band),
        ljung_box_lag10=lb10,
        ljung_box_lag20=lb20,
        durbin_watson=dw,
        first_order_autocorrelation=float(acf_vals[1]),
        has_significant_autocorrelation=lb10.significant or lb20.significant,
    )

My workflow after every backtest: call diagnose_autocorrelation, check has_significant_autocorrelation, and if it’s True, immediately compute the adjusted Sharpe before looking at any other performance metrics. The raw Sharpe goes in the report for completeness, but the adjusted number is what I use for decisions.

A practical note on lag selection for the Ljung-Box test. Some textbooks recommend sqrt(T) lags, which for a typical backtest of 2,500 daily observations gives about 50 lags. I’ve found this produces too many false positives. Autocorrelations at lag 40+ are almost never economically meaningful for the strategies I work with. Lags 10 and 20 capture the effects I care about (holding periods, signal frequency, monthly rebalancing) without testing noise at distant lags.

Correcting the Sharpe Ratio

Once you’ve detected autocorrelation, you have three options for producing an honest Sharpe ratio. They trade off simplicity against robustness.

Method 1: Lo (2002) analytical correction

This is the approach I showed in the first section. You compute the variance ratio eta from the sample autocorrelations and annualize with sqrt(252 / eta) instead of sqrt(252).

The correction is fast, requires no resampling, and gives a point estimate. It’s what I use for quick screening. The limitation is that it assumes the autocorrelation structure is well-estimated by the sample autocorrelations, which may not hold in short samples.

The code above (compute_sharpe_inflation) implements this directly.

Method 2: Newey-West significance test

This is a different tool solving a different problem, and I want to be precise about the distinction. The Lo correction adjusts the Sharpe ratio point estimate. Newey-West answers a different question: is the mean return statistically significantly different from zero, given the autocorrelation in the data?

The Newey-West (1987) HAC estimator corrects the standard error of a regression coefficient when the residuals are autocorrelated. For our purposes, we regress the returns on a constant (intercept-only model). The intercept is the mean return. The NW-corrected standard error of that intercept is wider than the naive standard error when returns are positively autocorrelated, which means the t-statistic (mean / standard error) is smaller, and the p-value for “is the mean return different from zero?” is larger.

This is an inference tool, not a Sharpe ratio correction. The t-statistic from a NW regression is not a “corrected Sharpe ratio.” The Sharpe ratio is mean / std (where std is the standard deviation of returns). The t-statistic is mean / SE (where SE is the standard error of the mean). These are different quantities with different units and different interpretations. Conflating them, which I’ve seen in multiple quant blog posts and which an earlier draft of this article did, is a conceptual error.

What NW does give you is a reliable answer to: “Given the autocorrelation in my strategy returns, can I reject the hypothesis that this strategy has zero expected return?” If the NW t-statistic is small (say, below 2.0), you cannot confidently claim the strategy makes money, regardless of what the Sharpe ratio says. This is useful as a sanity check alongside the Lo-corrected Sharpe.
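To make the Sharpe versus t-statistic distinction concrete, here is the relationship in the i.i.d. benchmark case, where the standard error of the mean is std / sqrt(n). This is a sketch of the naive case only (no Newey-West adjustment), and iid_t_statistic is a made-up helper name.

```python
import numpy as np


def iid_t_statistic(
    annual_sharpe: float, n_days: int, periods_per_year: int = 252
) -> float:
    """t-statistic of the mean implied by a Sharpe ratio when returns are i.i.d.

    With SE = std / sqrt(n), t = mean / SE = daily_sharpe * sqrt(n).
    """
    daily_sharpe = annual_sharpe / np.sqrt(periods_per_year)
    return float(daily_sharpe * np.sqrt(n_days))


# Same statistic, very different evidence:
print(iid_t_statistic(2.0, 63))    # Sharpe 2.0 observed over one quarter
print(iid_t_statistic(1.0, 1008))  # Sharpe 1.0 observed over four years
```

A Sharpe of 2.0 over a single quarter carries less statistical evidence than a Sharpe of 1.0 over four years. Positive autocorrelation widens the standard error further, so the Newey-West t-statistic will be smaller than this i.i.d. figure.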

import numpy as np
import numpy.typing as npt
import statsmodels.api as sm
from dataclasses import dataclass


@dataclass(frozen=True)
class NeweyWestSignificance:
    """Newey-West significance test for mean strategy return.

    Tests whether the mean return is significantly different from zero
    after accounting for autocorrelation and heteroskedasticity in the
    return series. This is an inference tool, not a Sharpe ratio correction.

    Attributes:
        mean_return: Annualized mean return.
        nw_std_error: Newey-West standard error of the mean (daily).
        t_statistic: mean / NW_std_error. Values above ~2.0 suggest
            the mean return is significantly different from zero.
        p_value: Two-sided p-value for the null of zero mean return.
        naive_sharpe: Standard annualized Sharpe ratio (for comparison).
        significant_at_5pct: Whether the null is rejected at 5% significance.
        bandwidth: Number of lags used in the HAC estimator.
    """
    mean_return: float
    nw_std_error: float
    t_statistic: float
    p_value: float
    naive_sharpe: float
    significant_at_5pct: bool
    bandwidth: int


def compute_newey_west_significance(
    daily_returns: npt.NDArray[np.float64],
    max_lags: int = 20,
    periods_per_year: int = 252,
) -> NeweyWestSignificance:
    """Test whether the mean strategy return is significantly nonzero.

    Fits an intercept-only OLS model with HAC covariance to get
    autocorrelation-robust standard errors for the mean return.
    The t-statistic tests whether the strategy has a statistically
    significant positive (or negative) expected return.

    This is NOT a Sharpe ratio correction. Use compute_sharpe_inflation
    for the Lo (2002) point estimate adjustment. This function answers
    a different question: can you reject the null that the strategy
    has zero expected return?

    Args:
        daily_returns: Array of daily strategy returns.
        max_lags: Bandwidth for the Newey-West estimator.
        periods_per_year: Trading days per year for annualization.

    Returns:
        NeweyWestSignificance with test statistic and p-value.
    """
    n = len(daily_returns)
    X = np.ones(n)
    model = sm.OLS(daily_returns, X)
    result = model.fit(cov_type="HAC", cov_kwds={"maxlags": max_lags})

    mean_ret = float(result.params[0])
    nw_se = float(result.bse[0])
    std_ret = float(np.std(daily_returns, ddof=1))

    naive_sharpe = (mean_ret / std_ret) * np.sqrt(periods_per_year) if std_ret > 0 else 0.0

    return NeweyWestSignificance(
        mean_return=float(mean_ret * periods_per_year),
        nw_std_error=float(nw_se),
        t_statistic=float(result.tvalues[0]),
        p_value=float(result.pvalues[0]),
        naive_sharpe=float(naive_sharpe),
        significant_at_5pct=float(result.pvalues[0]) < 0.05,
        bandwidth=max_lags,
    )

Newey-West is more general than Lo’s correction in one respect: it handles heteroskedasticity (volatility clustering) in addition to autocorrelation. The trade-off is that you need to choose a bandwidth (the max_lags parameter). Too few lags and you don’t capture all the serial dependence. Too many and the estimator becomes noisy. A common starting point is max_lags = int(4 * (n / 100) ** (2/9)), then check sensitivity with a few alternatives.
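The rule of thumb and a small sensitivity grid can be sketched as below. The halving and doubling used for the alternatives is an arbitrary choice for illustration, not an established convention, and both function names are made up.

```python
def newey_west_rule_of_thumb(n_obs: int) -> int:
    """Starting bandwidth: int(4 * (n / 100) ** (2 / 9))."""
    return int(4 * (n_obs / 100) ** (2 / 9))


def bandwidth_candidates(n_obs: int) -> list[int]:
    """Rule-of-thumb bandwidth plus half and double, for a sensitivity check."""
    base = max(1, newey_west_rule_of_thumb(n_obs))
    return sorted({max(1, base // 2), base, 2 * base})


print(newey_west_rule_of_thumb(2500))
print(bandwidth_candidates(2500))
```

If the t-statistic changes materially across the candidates, the conclusion is sensitive to the bandwidth and deserves a closer look.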

Method 3: Block bootstrap

The block bootstrap, which I cover in detail in my article on bootstrap methods, is the most robust approach. Instead of estimating autocorrelation parametrically and plugging it into a formula, you resample the return series in blocks that preserve the serial dependence, compute the Sharpe ratio on each bootstrap sample, and build a confidence interval from the distribution.

The key parameter is the block length. It needs to be long enough to capture the autocorrelation structure. I typically use a block length of 2 to 3 times the holding period of the strategy. For a strategy that holds positions for an average of 5 days, blocks of 10-15 days work well.

from dataclasses import dataclass
import numpy as np
import numpy.typing as npt


@dataclass(frozen=True)
class BlockBootstrapSharpe:
    """Block bootstrap confidence interval for the Sharpe ratio.

    Attributes:
        point_estimate: The sample Sharpe ratio.
        ci_lower: Lower bound of the 95% confidence interval.
        ci_upper: Upper bound of the 95% confidence interval.
        bootstrap_mean: Mean Sharpe across bootstrap samples.
        bootstrap_std: Standard deviation of the bootstrap Sharpe distribution.
        block_length: Block length used for resampling.
        n_samples: Number of bootstrap samples drawn.
    """
    point_estimate: float
    ci_lower: float
    ci_upper: float
    bootstrap_mean: float
    bootstrap_std: float
    block_length: int
    n_samples: int


def block_bootstrap_sharpe(
    daily_returns: npt.NDArray[np.float64],
    block_length: int = 10,
    n_bootstrap: int = 5000,
    periods_per_year: int = 252,
    seed: int = 42,
) -> BlockBootstrapSharpe:
    """Compute a block-bootstrap confidence interval for the Sharpe ratio.

    Resamples the return series in contiguous blocks of the given
    length, preserving within-block autocorrelation. Computes the
    annualized Sharpe on each bootstrap sample and returns percentile-
    based confidence intervals.

    Args:
        daily_returns: Array of daily strategy returns.
        block_length: Length of contiguous blocks for resampling.
        n_bootstrap: Number of bootstrap replications.
        periods_per_year: Trading days per year for annualization.
        seed: Random seed for reproducibility.

    Returns:
        BlockBootstrapSharpe with point estimate and confidence interval.

    Raises:
        ValueError: If block_length exceeds the length of the series.
    """
    n = len(daily_returns)
    if block_length > n:
        raise ValueError(
            f"Block length {block_length} exceeds series length {n}"
        )

    rng = np.random.default_rng(seed)
    n_blocks = int(np.ceil(n / block_length))

    # Number of valid block starting positions
    max_start = n - block_length

    sharpe_samples = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        starts = rng.integers(0, max_start + 1, size=n_blocks)
        blocks = [
            daily_returns[s : s + block_length] for s in starts
        ]
        boot_returns = np.concatenate(blocks)[:n]

        std = np.std(boot_returns, ddof=1)
        if std > 0:
            sharpe_samples[i] = (
                np.mean(boot_returns) / std * np.sqrt(periods_per_year)
            )
        else:
            sharpe_samples[i] = 0.0

    std_orig = np.std(daily_returns, ddof=1)
    point_est = (
        np.mean(daily_returns) / std_orig * np.sqrt(periods_per_year)
        if std_orig > 0
        else 0.0
    )

    return BlockBootstrapSharpe(
        point_estimate=float(point_est),
        ci_lower=float(np.percentile(sharpe_samples, 2.5)),
        ci_upper=float(np.percentile(sharpe_samples, 97.5)),
        bootstrap_mean=float(np.mean(sharpe_samples)),
        bootstrap_std=float(np.std(sharpe_samples)),
        block_length=block_length,
        n_samples=n_bootstrap,
    )

The bootstrap confidence interval automatically reflects serial dependence because the blocks preserve it. If the returns are strongly autocorrelated, the bootstrap Sharpe distribution will be wider, and the confidence interval will be wider, which correctly reflects the increased uncertainty.
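A quick way to see this is to run the same resampling with a block length of 1, which reduces to the ordinary i.i.d. bootstrap, and compare the spread. The sketch below is a vectorized variant written for this comparison, not the function above; the name bootstrap_sharpe_std is made up, and the AR(1) coefficient of 0.5 is deliberately exaggerated so the difference is easy to see.

```python
import numpy as np
import numpy.typing as npt


def bootstrap_sharpe_std(
    returns: npt.NDArray[np.float64],
    block_length: int,
    n_bootstrap: int = 500,
    seed: int = 0,
) -> float:
    """Std of the daily Sharpe across moving-block bootstrap resamples.

    block_length=1 reduces to the ordinary i.i.d. bootstrap.
    """
    rng = np.random.default_rng(seed)
    n = len(returns)
    n_blocks = int(np.ceil(n / block_length))
    starts = rng.integers(0, n - block_length + 1, size=(n_bootstrap, n_blocks))
    # Expand each block start into block_length consecutive indices.
    idx = (starts[:, :, None] + np.arange(block_length)).reshape(n_bootstrap, -1)
    samples = returns[idx[:, :n]]
    sharpes = samples.mean(axis=1) / samples.std(axis=1, ddof=1)
    return float(np.std(sharpes, ddof=1))


# Synthetic AR(1) series with a small positive drift (illustration only).
rng = np.random.default_rng(3)
shocks = rng.normal(0.0, 0.01, size=4000)
returns = np.empty_like(shocks)
returns[0] = shocks[0]
for t in range(1, shocks.size):
    returns[t] = 0.5 * returns[t - 1] + shocks[t]
returns += 0.0005

iid_spread = bootstrap_sharpe_std(returns, block_length=1)
block_spread = bootstrap_sharpe_std(returns, block_length=20)
print(f"i.i.d. bootstrap spread: {iid_spread:.4f}")
print(f"block bootstrap spread:  {block_spread:.4f}")
```

The i.i.d. bootstrap shuffles away the serial dependence and so understates the uncertainty; the block version keeps it.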

Comparing the three methods

These three tools answer different questions:

  • Lo correction: “What is the adjusted Sharpe ratio?” A point estimate correction. Fast, analytical, good for screening.
  • Newey-West: “Is the excess return itself significantly different from zero, after accounting for autocorrelation in the residuals?” This is a statement about whether the strategy makes money at all, not about the Sharpe ratio. A strategy can have a significant NW t-stat and a poor Sharpe (a small edge measured over a long enough sample), or a high Sharpe and an insignificant t-stat (good risk-adjusted returns but too few observations to be confident).
  • Block bootstrap: “What is the range of plausible Sharpe ratios?” A distribution-based approach. The most honest because it makes the fewest assumptions about the form of serial dependence. Use it for your final assessment.

Here is what the comparison looks like for a strategy with moderate positive autocorrelation:

Method                    Result    What it tells you
─────────────────────────────────────────────────────────────────
Naive Sharpe (sqrt(252))   1.82     Point estimate assuming i.i.d.
Lo (2002) corrected        1.64     Adjusted point estimate (~10% lower)
Newey-West t-stat          2.41     Excess return is significantly nonzero (p=0.016)
Block bootstrap median     1.58     95% CI: [0.74, 2.31]

The naive Sharpe suggests a strong strategy. The Lo correction brings the point estimate down modestly. The NW test confirms the mean return is statistically significant, which is reassuring but a lower bar than “the strategy is good.” The bootstrap confidence interval is the sobering part: it stretches from 0.74 (barely worth trading) to 2.31 (excellent). The uncertainty in the Sharpe ratio itself is large, and autocorrelation makes it larger.

The practical reality check

It’s worth stepping back from the statistics and acknowledging something: most working quants don’t rely on analytical Sharpe corrections as their primary defense against autocorrelation-inflated backtests. Walk-forward validation and out-of-sample holdout testing naturally reveal whether the backtest Sharpe was overstated, because the out-of-sample returns don’t benefit from in-sample fitting. And live trading will punish you regardless of what your corrected Sharpe said.

So why bother with these corrections at all? Two reasons. First, they’re cheap diagnostics that catch problems before you spend days on walk-forward testing or weeks on paper trading. If the Lo correction drops your Sharpe below your minimum threshold, you can kill the idea immediately instead of running it through the full validation pipeline. Second, understanding why your backtest Sharpe is inflated (holding period effects? slow signal updates? data smoothing?) tells you something about the strategy’s structure that walk-forward results alone don’t reveal. The correction is diagnostic, not just numerical.

The Annualization Problem

The sqrt(T) rule deserves its own section because it’s the place where autocorrelation causes the most practical damage, and it’s the place where the fix is simplest.

Let me be concrete. The standard annualization works like this:

SR_annual = SR_daily * sqrt(252)
           = SR_weekly * sqrt(52)
           = SR_monthly * sqrt(12)

Each of these is exact under the i.i.d. assumption. And each gives a different answer when the returns are autocorrelated, because the autocorrelation structure is different at different frequencies.

For positively autocorrelated returns, daily autocorrelation accumulates over the week: the variance of a weekly return is larger than five times the daily variance. Daily returns therefore look “smoother” than the weekly risk they add up to, which means the weekly Sharpe is lower than the daily Sharpe scaled by sqrt(5). The daily-to-annual path through sqrt(252) gives a higher number than the weekly-to-annual path through sqrt(52), and both differ from the monthly-to-annual path through sqrt(12).
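As a rough illustration of how large the gap between paths can be, assume the daily returns follow an AR(1) process with coefficient rho and that period returns aggregate additively (as log returns do). Under those assumptions the variance of a q-day return is the daily variance times q + 2 * sum over k of (q - k) * rho^k, which has the same form as the Lo (2002) scaling. The function name below is hypothetical.

```python
import numpy as np


def horizon_path_ratio(rho: float, q: int) -> float:
    """Sharpe annualized from q-day returns divided by Sharpe annualized from daily returns.

    Assumes AR(1) daily returns with coefficient rho and additive aggregation.
    """
    k = np.arange(1, q)
    # Variance of the q-day return relative to q independent days.
    variance_ratio = 1.0 + 2.0 * np.sum((1.0 - k / q) * rho**k)
    return float(1.0 / np.sqrt(variance_ratio))


for q in (1, 5, 21):
    print(f"q={q:>2}: ratio to daily path = {horizon_path_ratio(0.12, q):.2f}")
```

For rho = 0.12 this gives roughly 0.91 at the weekly horizon and 0.89 at the monthly horizon. Under these assumptions, a Sharpe of 1.8 on the daily path would come out near 1.64 on the weekly path from the same returns.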

This inconsistency is a diagnostic in itself. If your annualized Sharpe changes materially depending on which frequency you start from, by more than sampling noise would explain, your returns are very likely autocorrelated. The correct approach is one of two things.

Option 1: Compute at the natural frequency. If the strategy has an average holding period of 5 days, compute non-overlapping 5-day returns and calculate the Sharpe on those. Then annualize using sqrt(252/5). This largely sidesteps the within-holding-period autocorrelation, because each return observation corresponds roughly to one trade.

This is what I do most of the time. It aligns the measurement frequency with the decision frequency, which makes the i.i.d. assumption more defensible (though not guaranteed, since signals may themselves be correlated across trades).

Option 2: Use the Lo correction. Compute the daily Sharpe and apply the eta correction factor from the Lo (2002) formula. This gives you an annualized number that properly accounts for the autocorrelation.

The one thing you should never do is compute daily returns, multiply by sqrt(252), and report the result without checking for autocorrelation. That’s the default in most backtesting frameworks, and it’s wrong for most strategies.

from dataclasses import dataclass
import numpy as np
import numpy.typing as npt


@dataclass(frozen=True)
class MultiHorizonSharpe:
    """Sharpe ratio computed at multiple horizons for consistency checking.

    If the strategy returns are i.i.d., all annualized values should
    agree. Divergence indicates autocorrelation.

    Attributes:
        daily_annualized: Sharpe from daily returns, scaled by sqrt(252).
        weekly_annualized: Sharpe from non-overlapping weekly returns, scaled by sqrt(52).
        monthly_annualized: Sharpe from non-overlapping monthly returns, scaled by sqrt(12).
        daily_adjusted: Lo (2002) corrected Sharpe from daily returns.
        max_divergence_pct: Maximum percentage divergence across horizons.
        iid_consistent: True if all horizons agree within 10%.
    """
    daily_annualized: float
    weekly_annualized: float
    monthly_annualized: float
    daily_adjusted: float
    max_divergence_pct: float
    iid_consistent: bool


def _sharpe_from_returns(
    returns: npt.NDArray[np.float64],
    annualization_factor: float,
) -> float:
    """Compute annualized Sharpe from a return series.

    Args:
        returns: Array of period returns.
        annualization_factor: sqrt(periods_per_year) for this frequency.

    Returns:
        Annualized Sharpe ratio, or 0.0 if standard deviation is zero.
    """
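    # Check the length first: np.std with ddof=1 on fewer than two
    # observations returns NaN and emits a RuntimeWarning.
    if len(returns) < 2:
        return 0.0
    std = np.std(returns, ddof=1)
    if std == 0:
        return 0.0
    return float(np.mean(returns) / std * annualization_factor)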


def _aggregate_returns(
    daily_returns: npt.NDArray[np.float64],
    period_length: int,
) -> npt.NDArray[np.float64]:
    """Aggregate daily returns into non-overlapping period returns.

    Computes compound returns over non-overlapping blocks.
    Discards any incomplete final block.

    Args:
        daily_returns: Array of daily returns.
        period_length: Number of days per aggregation period.

    Returns:
        Array of compound period returns.
    """
    n_periods = len(daily_returns) // period_length
    truncated = daily_returns[: n_periods * period_length]
    reshaped = truncated.reshape(n_periods, period_length)
    # Compound returns: product of (1 + r) - 1
    return np.prod(1.0 + reshaped, axis=1) - 1.0


def compute_multi_horizon_sharpe(
    daily_returns: npt.NDArray[np.float64],
    max_lag: int = 20,
) -> MultiHorizonSharpe:
    """Compute Sharpe ratios at daily, weekly, and monthly horizons.

    Checks consistency across horizons. Divergence indicates the returns
    are not i.i.d. and the sqrt(T) annualization is unreliable.

    Args:
        daily_returns: Array of daily strategy returns.
        max_lag: Maximum lag for the Lo (2002) correction.

    Returns:
        MultiHorizonSharpe with all horizon estimates and consistency flag.
    """
    daily_sr = _sharpe_from_returns(daily_returns, np.sqrt(252))

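    weekly_returns = _aggregate_returns(daily_returns, 5)
    # 5-day blocks give 252 / 5 = 50.4 periods per year (close to the
    # conventional 52). Using the exact count keeps i.i.d. returns
    # consistent with the daily sqrt(252) path.
    weekly_sr = _sharpe_from_returns(weekly_returns, np.sqrt(252 / 5))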

    monthly_returns = _aggregate_returns(daily_returns, 21)
    monthly_sr = _sharpe_from_returns(monthly_returns, np.sqrt(12))

    # Lo (2002) correction
    adjusted = compute_sharpe_inflation(daily_returns, max_lag=max_lag)

    all_sr = [daily_sr, weekly_sr, monthly_sr]
    nonzero = [s for s in all_sr if s != 0]
    if len(nonzero) >= 2:
        mean_sr = np.mean(nonzero)
        if mean_sr != 0:
            max_div = max(abs(s - mean_sr) / abs(mean_sr) * 100 for s in nonzero)
        else:
            max_div = 0.0
    else:
        max_div = 0.0

    return MultiHorizonSharpe(
        daily_annualized=daily_sr,
        weekly_annualized=weekly_sr,
        monthly_annualized=monthly_sr,
        daily_adjusted=adjusted.adjusted_sharpe,
        max_divergence_pct=float(max_div),
        iid_consistent=float(max_div) < 10.0,
    )

I run this on every backtest. If iid_consistent comes back False, I know the sqrt(252) Sharpe is unreliable, and I use either the holding-period Sharpe or the Lo-corrected Sharpe for decision-making. When the three horizons agree within 10%, the autocorrelation is mild enough that the naive annualization is acceptable.

One pitfall with the aggregation approach: compounding daily returns into weekly or monthly returns can obscure transaction costs. If your backtest accounts for costs at the daily level, the aggregated weekly returns already include those costs, which is correct. But if costs are computed at the trade level and aren’t reflected in daily P&L, aggregation will miss them. Make sure costs flow through daily returns before you aggregate.

Strategy-Specific Autocorrelation Patterns

Different strategy types produce characteristic autocorrelation signatures. Learning to recognize these patterns helps you diagnose what’s happening in your backtest without needing to inspect the strategy logic directly.

Momentum and trend-following

Momentum strategies hold positions in the direction of recent price trends. A long position is entered after a price increase and held until the trend reverses or a trailing stop is hit. During the holding period, daily returns are driven by the same directional bet, creating positive autocorrelation at short lags.

The typical ACF shape: significant positive values at lags 1 through approximately the average holding period, then a gradual decay to zero. If the strategy has an average holding period of 10 days, expect positive autocorrelation at lags 1-10 with the strongest values at lags 1-3.

Sharpe inflation for momentum strategies is meaningful. Daily autocorrelations of 0.05 to 0.15 are common, translating to 5-16% inflation in the annualized Sharpe ratio. That may sound modest as a standalone correction, but it compounds with the wider confidence intervals that autocorrelation creates, making the true uncertainty around the Sharpe substantially larger than the naive formula suggests.
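The 5-16% range can be checked with a quick calculation. This is a sketch under one simplifying assumption: the autocorrelation decays geometrically, as in an AR(1) process, so the long-horizon overstatement of the sqrt(T)-scaled Sharpe is sqrt((1 + rho) / (1 - rho)). Real momentum ACFs rarely follow that shape exactly, and the function name is illustrative.

```python
import math


def ar1_sharpe_inflation_pct(rho: float) -> float:
    """Long-horizon overstatement of the sqrt(T)-scaled Sharpe under AR(1), in percent."""
    return (math.sqrt((1.0 + rho) / (1.0 - rho)) - 1.0) * 100.0


for rho in (0.05, 0.10, 0.15):
    print(f"rho={rho:.2f}: inflation = {ar1_sharpe_inflation_pct(rho):.1f}%")
# rho=0.05: inflation = 5.1%
# rho=0.10: inflation = 10.6%
# rho=0.15: inflation = 16.3%
```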

I’ve learned the hard way that a momentum strategy with a “great” backtest Sharpe needs immediate autocorrelation correction. The point estimate correction might be 10-15%, but the bootstrap confidence interval often expands enough to include zero, which is a very different risk allocation conversation.

Mean-reversion and statistical arbitrage

Mean-reversion strategies trade against recent price moves: buying after drops, selling after rises. By construction, a profitable trade (positive return) is often followed by a position exit or reversal, creating negative autocorrelation.

The typical ACF shape: significant negative value at lag 1 (and possibly lag 2), then rapid decay to zero. The magnitude depends on how quickly the strategy exits positions. Strategies that exit within 1-2 days show strong negative autocorrelation at lag 1. Strategies that hold for longer show weaker negative autocorrelation because the reversal is diluted by holding-period effects.

The interesting property of negative autocorrelation: it makes the sqrt(252) scaling understate the true annual Sharpe. Negative serial dependence means long-horizon variance grows slower than linearly with time, so the actual annual Sharpe is higher than the naive scaling suggests. Mean-reversion strategies genuinely have better risk-adjusted performance at longer horizons, and the Lo correction will show this.
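One way to see the slower-than-linear variance growth on an actual return series is a variance ratio: the variance of non-overlapping q-day returns divided by q times the daily variance. The sketch below runs it on simulated returns with an assumed lag-1 autocorrelation of -0.3. The simulation parameters and the `variance_ratio` helper are illustrative assumptions, and returns are summed as if they were log returns.

```python
import numpy as np


def variance_ratio(returns: np.ndarray, q: int) -> float:
    """Variance of non-overlapping q-day summed returns over q times the daily variance."""
    n_periods = len(returns) // q
    summed = returns[: n_periods * q].reshape(n_periods, q).sum(axis=1)
    return float(np.var(summed, ddof=1) / (q * np.var(returns, ddof=1)))


# Simulated mean-reverting returns: AR(1) with coefficient -0.3 (assumed).
rng = np.random.default_rng(7)
eps = rng.normal(0.0, 0.01, size=5000)
r = np.empty_like(eps)
r[0] = eps[0]
for t in range(1, len(eps)):
    r[t] = -0.3 * r[t - 1] + eps[t]

vr = variance_ratio(r, q=5)
print(f"5-day variance ratio: {vr:.2f}")
print(f"implied weekly-path Sharpe vs daily path: x{1.0 / np.sqrt(vr):.2f}")
```

A ratio below 1 means weekly risk is smaller than sqrt(5) scaling of daily risk implies, so the Sharpe measured at the weekly horizon comes out higher than the daily sqrt(252) figure.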

This is why some allocators prefer mean-reversion strategies for longer investment horizons. The apparent underperformance on a daily Sharpe basis is partially a measurement artifact caused by the negative autocorrelation.

Market-making

Market-making strategies earn the bid-ask spread on most trades but occasionally take losses from adverse selection (trading against informed participants). The return pattern alternates between small positive returns (spread earned) and occasional larger negative returns (adverse selection loss), followed by recovery.

The typical ACF shape: negative autocorrelation at lag 1 (alternating win-loss pattern), with possible positive autocorrelation at lag 2 (recovery after adverse selection). The ACF usually has a damped oscillating pattern that dies out within 3-5 lags.

Sharpe correction for market-making strategies is usually modest because the negative and positive autocorrelations at different lags partially offset each other. The bigger issue with market-making backtests is usually the simulation of execution quality rather than the Sharpe annualization.

Portfolio rebalancing strategies

Strategies that rebalance on a fixed schedule (monthly, quarterly) create autocorrelation at the rebalancing frequency. Between rebalancing dates, the portfolio drifts, and the returns reflect the drift rather than active decisions. At rebalancing, the portfolio is reset, creating a discontinuity.

The typical ACF shape: mild positive autocorrelation at short lags (within the rebalancing period, the same positions generate returns), with a possible spike at the rebalancing lag (lag 20-22 for monthly rebalancing in daily data).

This pattern is easy to miss if you only check the first few lags. I’ve seen strategies where the ACF at lags 1-5 was insignificant, but the Ljung-Box at lag 20 rejected strongly due to autocorrelation at the monthly rebalancing boundary. The Sharpe correction from these periodic effects is usually modest (5-10%), but it compounds with other sources of autocorrelation.
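Here is a small sketch of that situation, using a hand-rolled Ljung-Box Q statistic so the example depends only on NumPy. In practice a library routine such as statsmodels' `acorr_ljungbox` would be the usual choice. The simulated series is a hypothetical construction in which each return echoes the shock from 20 days earlier, and the critical values are the standard 95% chi-square quantiles for 5 and 20 degrees of freedom.

```python
import numpy as np


def ljung_box_q(returns: np.ndarray, max_lag: int) -> float:
    """Ljung-Box Q statistic over lags 1..max_lag."""
    x = returns - returns.mean()
    n = len(x)
    denom = np.dot(x, x)
    total = 0.0
    for k in range(1, max_lag + 1):
        rho_k = np.dot(x[:-k], x[k:]) / denom  # sample autocorrelation at lag k
        total += rho_k**2 / (n - k)
    return float(n * (n + 2) * total)


# Hypothetical rebalancing effect: dependence only at lag 20.
rng = np.random.default_rng(11)
eps = rng.normal(0.0, 0.01, size=2020)
returns = eps[20:] + 0.3 * eps[:-20]

CHI2_95 = {5: 11.07, 20: 31.41}  # 95% chi-square critical values
q5 = ljung_box_q(returns, 5)
q20 = ljung_box_q(returns, 20)
print(f"Q(5)  = {q5:.1f}  (critical value {CHI2_95[5]})")
print(f"Q(20) = {q20:.1f}  (critical value {CHI2_95[20]})")
```

The lag-20 statistic sits far above its critical value because it includes the rebalancing lag, while the lag-5 statistic only sees the short lags where this series has no built-in dependence.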

Multi-strategy composites

When you combine multiple strategies into a single portfolio, the composite autocorrelation depends on the individual strategies’ autocorrelation structures, their weights, and the cross-correlations between strategies.

Diversification across strategies with different autocorrelation signatures can reduce portfolio-level autocorrelation. Combining a momentum strategy (positive autocorrelation) with a mean-reversion strategy (negative autocorrelation) can produce a composite with near-zero autocorrelation, making the standard Sharpe annualization more reliable.

This is another reason to diagnose autocorrelation at the individual strategy level before aggregating. If each component’s Sharpe is honestly reported, the composite’s Sharpe is more likely to be accurate. If you let each component inflate its Sharpe through uncorrected autocorrelation, the composite Sharpe will inherit the inflation even if the composite itself has low autocorrelation.

Honest Performance Reporting

Everything so far has been building to this section. Detecting and correcting for autocorrelation is only useful if you actually change what you report.

Here is what I consider the minimum reporting standard for a backtest. Not “best practices” in a textbook sense, but the actual list of things I include in every performance report I produce, whether it’s for my own records or for sharing with other people who might rely on the numbers.

  1. Raw Sharpe ratio, annualized from the strategy’s natural frequency. This is the number everyone expects to see. Include it, but never alone.

  2. First-order autocorrelation of the strategy returns at the reporting frequency. This is a single number that immediately tells the reader whether the raw Sharpe is trustworthy.

  3. Ljung-Box p-values at lags 10 and 20. These formalize the autocorrelation check. If either p-value is below 0.05, the raw Sharpe is statistically unreliable.

  4. Autocorrelation-adjusted Sharpe ratio using the Lo (2002) correction. This adjusts the risk-adjusted return estimate itself.

  5. Sharpe ratio at the holding-period frequency (not scaled from daily). This sidesteps the annualization problem entirely for holding-period-related autocorrelation.

If you’re presenting results to allocators, the adjusted Sharpe is the number that matters. The raw Sharpe is marketing. I don’t mean that cynically. I mean that the raw Sharpe systematically overstates risk-adjusted performance for strategies with positive autocorrelation, and any experienced allocator knows this. Presenting the raw number without the correction signals either ignorance or intent to mislead. Neither makes a good impression.

Integration with the validation pipeline

In my validation pipeline, autocorrelation diagnostics run immediately after the backtest and before any performance metrics are computed for decision-making. The flow is:

  1. Run the backtest, produce a daily return series.
  2. Compute autocorrelation diagnostics (diagnose_autocorrelation).
  3. If significant autocorrelation is detected, compute the Lo correction for the Sharpe ratio and the Newey-West test for whether the excess return itself is significantly nonzero.
  4. Feed the adjusted Sharpe (not the raw) into a Deflated Sharpe Ratio calculation to account for multiple testing (see the bias taxonomy for context on how biases compound).
  5. Include all metrics in the performance report.

This ordering matters. The DSR (Deflated Sharpe Ratio) already accounts for the number of strategies tried. If you feed it an inflated raw Sharpe, you’re partially undoing its correction for multiple testing. Using the autocorrelation-adjusted Sharpe as input to the DSR gives a doubly honest result: corrected for both serial dependence and selection bias.

The PerformanceReport dataclass

Here is the reporting structure I use. It bundles all the metrics together and includes a flag that fires when the gap between raw and adjusted Sharpe exceeds a threshold:

from dataclasses import dataclass
from typing import Optional
import numpy as np
import numpy.typing as npt


@dataclass(frozen=True)
class PerformanceReport:
    """Complete performance report with autocorrelation diagnostics.

    Bundles raw and adjusted Sharpe ratios, autocorrelation diagnostics,
    and a flag indicating when the raw Sharpe is materially inflated.
    Use this as the standard output of any backtest evaluation.

    Attributes:
        strategy_name: Identifier for the strategy being evaluated.
        n_observations: Number of return observations.
        annualized_return: Annualized mean return.
        annualized_volatility: Annualized standard deviation of returns.
        raw_sharpe: Naive annualized Sharpe ratio (assumes i.i.d.).
        lo_adjusted_sharpe: Lo (2002) corrected Sharpe ratio.
        nw_t_statistic: Newey-West t-statistic for mean return significance.
        holding_period_sharpe: Sharpe computed at the holding period frequency.
        first_order_autocorrelation: ACF at lag 1.
        ljung_box_p_lag10: Ljung-Box p-value at lag 10.
        ljung_box_p_lag20: Ljung-Box p-value at lag 20.
        durbin_watson: Durbin-Watson statistic.
        sharpe_inflation_pct: Percentage by which raw exceeds adjusted.
        autocorrelation_flag: True if raw Sharpe is materially inflated.
        max_drawdown: Maximum peak-to-trough drawdown.
        avg_holding_period_days: Average number of days positions are held.
    """
    strategy_name: str
    n_observations: int
    annualized_return: float
    annualized_volatility: float
    raw_sharpe: float
    lo_adjusted_sharpe: float
    nw_t_statistic: float
    holding_period_sharpe: Optional[float]
    first_order_autocorrelation: float
    ljung_box_p_lag10: float
    ljung_box_p_lag20: float
    durbin_watson: float
    sharpe_inflation_pct: float
    autocorrelation_flag: bool
    max_drawdown: float
    avg_holding_period_days: Optional[float]


def build_performance_report(
    strategy_name: str,
    daily_returns: npt.NDArray[np.float64],
    avg_holding_period_days: Optional[float] = None,
    inflation_threshold_pct: float = 15.0,
    periods_per_year: int = 252,
) -> PerformanceReport:
    """Build a complete performance report with autocorrelation corrections.

    Computes raw and adjusted Sharpe ratios, runs autocorrelation
    diagnostics, and flags strategies where the raw Sharpe materially
    overstates the corrected number.

    Args:
        strategy_name: Name or identifier for the strategy.
        daily_returns: Array of daily strategy returns.
        avg_holding_period_days: Average holding period in trading days.
            If provided, computes Sharpe at the holding period frequency.
        inflation_threshold_pct: Flag when raw-to-adjusted gap exceeds
            this percentage. Default is 15%.
        periods_per_year: Trading days per year.

    Returns:
        PerformanceReport with all metrics and diagnostic flags.
    """
    n = len(daily_returns)
    mean_ret = float(np.mean(daily_returns))
    std_ret = float(np.std(daily_returns, ddof=1))

    ann_return = mean_ret * periods_per_year
    ann_vol = std_ret * np.sqrt(periods_per_year)
    raw_sharpe = (
        mean_ret / std_ret * np.sqrt(periods_per_year) if std_ret > 0 else 0.0
    )

    # Lo correction
    lo_result = compute_sharpe_inflation(
        daily_returns, max_lag=20, periods_per_year=periods_per_year
    )

    # Newey-West correction
    nw_result = compute_newey_west_significance(
        daily_returns, max_lags=20, periods_per_year=periods_per_year
    )

    # Autocorrelation diagnostics
    diag = diagnose_autocorrelation(daily_returns, max_lag=20)

    # Holding period Sharpe
    hp_sharpe: Optional[float] = None
    if avg_holding_period_days is not None and avg_holding_period_days >= 2:
        period_len = max(2, int(round(avg_holding_period_days)))
        hp_returns = _aggregate_returns(daily_returns, period_len)
        periods_in_year = periods_per_year / period_len
        hp_sharpe = _sharpe_from_returns(hp_returns, np.sqrt(periods_in_year))

    # Max drawdown
    cumulative = np.cumprod(1.0 + daily_returns)
    running_max = np.maximum.accumulate(cumulative)
    drawdowns = (cumulative - running_max) / running_max
    max_dd = float(np.min(drawdowns))

    # Inflation check
    if lo_result.adjusted_sharpe != 0:
        inflation = lo_result.inflation_pct
    else:
        inflation = 0.0

    return PerformanceReport(
        strategy_name=strategy_name,
        n_observations=n,
        annualized_return=float(ann_return),
        annualized_volatility=float(ann_vol),
        raw_sharpe=float(raw_sharpe),
        lo_adjusted_sharpe=lo_result.adjusted_sharpe,
        nw_t_statistic=nw_result.t_statistic,
        holding_period_sharpe=hp_sharpe,
        first_order_autocorrelation=diag.first_order_autocorrelation,
        ljung_box_p_lag10=diag.ljung_box_lag10.p_value,
        ljung_box_p_lag20=diag.ljung_box_lag20.p_value,
        durbin_watson=diag.durbin_watson,
        sharpe_inflation_pct=float(inflation),
        autocorrelation_flag=float(inflation) > inflation_threshold_pct,
        max_drawdown=max_dd,
        avg_holding_period_days=avg_holding_period_days,
    )

The autocorrelation_flag fires when the raw Sharpe exceeds the adjusted Sharpe by more than 15%. That threshold is a judgment call. I’ve found that corrections below 15% are usually within the noise of the Sharpe estimate itself, so flagging them creates false alarms. Above 15%, the inflation is material and should change your assessment of the strategy.

What to do when the flag fires

When autocorrelation_flag is True, I do three things:

First, I check whether the autocorrelation is expected given the strategy design. A momentum strategy with a 10-day average holding period should have positive autocorrelation at short lags. If it does, the autocorrelation is structural, the correction is necessary, and the adjusted Sharpe is the number to report. No remediation needed, just honest reporting.

Second, I check whether the autocorrelation is an artifact of the backtest setup. Common culprits: using daily returns with weekly signals (reduce the return frequency), using mark-to-model prices for illiquid instruments (switch to transaction prices or acknowledge the smoothing), or computing returns on a calendar basis when the strategy trades on a different schedule (align the return computation with actual trades).

Third, I check whether reducing the autocorrelation would improve the strategy. Sometimes you can reduce holding-period autocorrelation by adding a faster exit signal, or reduce signal-update autocorrelation by computing signals at a higher frequency. These changes reduce the gap between raw and adjusted Sharpe, not by inflating the adjusted number, but by genuinely making the returns more independent.

The broader context

Autocorrelation correction is one piece of the backtest bias taxonomy. It interacts with other corrections:

  • Multiple testing correction (Deflated Sharpe Ratio): feed the autocorrelation-adjusted Sharpe into the DSR, not the raw number.
  • Walk-forward validation (covered in my walk-forward article): autocorrelation within each walk-forward test window should be checked separately. The concatenated OOS returns may have different autocorrelation structure than the full in-sample returns.
  • Property-based testing: you can write properties that verify the consistency of your autocorrelation diagnostics. For example, the multi-horizon Sharpe ratios should agree within a tolerance if the returns are genuinely i.i.d.
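The i.i.d. consistency property from the last bullet can be sketched as a seeded check. This is a simplified stand-in for a real property-based test, where a library such as Hypothesis would generate the seeds and parameters. The helper names, the 5% tolerance, and the simulated return parameters are all assumptions chosen for illustration. The sample is deliberately long so that sampling noise in the monthly volatility estimate stays small.

```python
import numpy as np


def annualized_sharpe(daily_returns: np.ndarray, period_length: int) -> float:
    """Annualized Sharpe from non-overlapping compounded period returns."""
    n_periods = len(daily_returns) // period_length
    blocks = daily_returns[: n_periods * period_length].reshape(n_periods, period_length)
    period_returns = np.prod(1.0 + blocks, axis=1) - 1.0
    scale = np.sqrt(252 / period_length)
    return float(period_returns.mean() / period_returns.std(ddof=1) * scale)


def max_divergence_pct(daily_returns: np.ndarray) -> float:
    """Largest percentage gap between any horizon's Sharpe and the cross-horizon mean."""
    sharpes = np.array([annualized_sharpe(daily_returns, q) for q in (1, 5, 21)])
    return float(np.max(np.abs(sharpes - sharpes.mean())) / abs(sharpes.mean()) * 100)


# Property: for i.i.d. returns, the daily, weekly, and monthly paths should agree.
rng = np.random.default_rng(2024)
iid_returns = rng.normal(0.0005, 0.01, size=200_000)
divergence = max_divergence_pct(iid_returns)
print(f"max divergence across horizons: {divergence:.2f}%")
```

On shorter samples the tolerance has to be looser, because the monthly Sharpe is estimated from far fewer observations than the daily one.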

None of these corrections in isolation is sufficient. A strategy that passes autocorrelation correction but fails walk-forward validation is still unreliable. A strategy that passes walk-forward but has uncorrected Sharpe inflation is still overstating its performance. The goal is a validation pipeline where each check catches a different failure mode.

Wrapping Up

The core message is simple. Strategy returns are almost always autocorrelated, the standard Sharpe ratio annualization assumes they are not, and the resulting inflation systematically flatters your backtest. The fix is straightforward: check for autocorrelation, correct for it, and report both numbers.

I check autocorrelation after every backtest, the same way I check for look-ahead bias or overfitting. It’s part of the hygiene. The tools serve different purposes: ACF and Ljung-Box detect autocorrelation, the Lo correction adjusts the Sharpe ratio point estimate, Newey-West tests whether the excess return is significantly nonzero after accounting for serial dependence, and the block bootstrap gives you confidence intervals that reflect the full autocorrelation structure. The code is short. The impact on how you interpret your results can be substantial.

If there’s one thing I want you to take away, it’s this: compute your annualized Sharpe at multiple frequencies (daily, weekly, monthly). If the numbers disagree by more than sampling noise, your returns are very likely autocorrelated, and your sqrt(252) Sharpe is unreliable. Fix it before you allocate capital to it.

References

  • Lo, Andrew W. (2002). “The Statistics of Sharpe Ratios.” Financial Analysts Journal, 58(4), 36-52.
  • Newey, Whitney K. and Kenneth D. West (1987). “A Simple, Positive Semi-definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix.” Econometrica, 55(3), 703-708.
  • Getmansky, Mila, Andrew W. Lo, and Igor Makarov (2004). “An Econometric Model of Serial Correlation and Illiquidity in Hedge Fund Returns.” Journal of Financial Economics, 74(3), 529-609.
  • Christie, Sean (2005). “The Interpretation of Return Autocorrelation.” Working paper.

Work with me

I spent the first half of my career building risk models and market data infrastructure at BNP Paribas, Bank of America, and Citadel, then fourteen years shipping production systems at scale. Now I bring both sides to quantitative trading. If you're a trading firm, family office, or fund looking to tighten the connection between your research ideas and your production trading systems, whether that's building validation pipelines, formalizing signal logic, or getting microstructure analytics into a deployable state, I'd like to hear what you're working on. Reach me at me@susanpotter.net.