Stationarity Testing for Strategy Signals: ADF, KPSS, and Why …

I once spent three weeks building what I thought was a momentum signal. The backtest looked spectacular: Sharpe above 2, smooth equity curve, drawdowns that barely registered. I was already sketching out position sizing when I ran a stationarity test on the signal as part of a pipeline sanity check. The signal was just a moving average of log prices. It trended upward over the backtest period because the underlying equity market trended upward over that period. What I had was not a momentum signal. It was a non-stationary artifact that happened to correlate with positive returns during a bull market. Out of sample, when the trend reversed, every entry threshold I had calibrated became meaningless.

That experience cost me three weeks and no money because the stationarity test caught it before capital was at risk. If I had skipped the test, it would have cost me money too.

This article is about the statistical tests that prevent that class of mistake. Not as a pure stats tutorial (there are textbooks for that) but as a practical framework for catching non-stationary signals before they reach a backtest, integrated into the kind of pipeline I described in my validation toolkit.

Why Stationarity Matters for Strategy Signals

Most quantitative strategies rest on an implicit assumption: the statistical properties of a signal are stable over time. A mean-reversion strategy assumes the signal reverts to a stable mean. A momentum strategy assumes the distribution of signal changes is consistent enough to identify trends. A risk model assumes variance is stable enough to calibrate position sizes. When these assumptions break, the strategy breaks with them, often silently.

The formal term is weak (covariance) stationarity. A time series is weakly stationary when three conditions hold:

Constant mean: E[X_t] = μ for all t. The expected value doesn’t drift.
Constant variance: Var(X_t) = σ² for all t. The spread around the mean stays the same.
Autocovariance depends only on the lag: Cov(X_t, X_{t+k}) = f(k), not f(t, k). The relationship between observations separated by k periods is the same regardless of when you measure it.

Strict stationarity is stronger: it requires that the entire joint distribution of any subset of observations is invariant to time shifts. In practice, weak stationarity is what we test for and what we need. A strictly stationary series with finite second moments is also weakly stationary, and weak stationarity is sufficient for most quant applications.

Why does violating these conditions matter in practice? Because the tools we use to build and evaluate strategies assume stationarity, whether we realize it or not.

Spurious regressions. Granger and Newbold showed in 1974 that regressing one non-stationary series on another produces inflated R-squared values and significant t-statistics even when the two series are completely independent. The regression finds a “relationship” that is purely an artifact of shared trends. I’ve seen this happen with signals that regress a smoothed price series on a macro indicator. Both trend upward over the sample, the regression looks great, and the out-of-sample performance is indistinguishable from random.
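The spurious regression effect is easy to reproduce. The sketch below is a minimal simulation, not the original Granger-Newbold experiment: the sample size, number of simulations, and the helper names (slope_t_stat, rejection_rate) are illustrative assumptions. It regresses one independent series on another and counts how often the slope looks significant at the usual |t| > 1.96 cutoff.

```python
import numpy as np


def slope_t_stat(y: np.ndarray, x: np.ndarray) -> float:
    """OLS t-statistic on the slope of y = a + b*x + e."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return float(beta[1] / np.sqrt(cov[1, 1]))


rng = np.random.default_rng(0)
n_sims, n_obs = 500, 250  # illustrative choices


def rejection_rate(make_series) -> float:
    """Share of simulations where two INDEPENDENT series look related."""
    hits = 0
    for _ in range(n_sims):
        y, x = make_series(), make_series()
        hits += int(abs(slope_t_stat(y, x)) > 1.96)
    return hits / n_sims


# Independent random walks (non-stationary) vs. independent white noise.
walk_rate = rejection_rate(lambda: np.cumsum(rng.standard_normal(n_obs)))
noise_rate = rejection_rate(lambda: rng.standard_normal(n_obs))

print(f"independent random walks: {walk_rate:.0%} 'significant'")
print(f"independent white noise:  {noise_rate:.0%} 'significant'")
```

The white-noise rate should sit near the nominal 5%, while the random-walk rate is far higher even though no relationship exists.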

Drifting entry and exit thresholds. Mean-reversion strategies compute z-scores or percentile bands based on the signal’s historical distribution. If the mean is drifting, your z-score of -2 today doesn’t mean the same thing it meant six months ago. You end up entering trades based on a historical distribution that no longer applies.

Miscalibrated risk. If variance isn’t stable, the volatility estimate you use for position sizing is wrong. You’ll be too aggressive in high-vol regimes and too conservative in low-vol regimes, or the reverse depending on which part of the sample dominates your estimate.

One clarification that trips people up: stationarity does not mean the series is flat or boring. A stationary series can have strong autocorrelation, seasonal patterns (as long as they’re periodic, not trending), and substantial variance. Stationarity is about the distribution being stable over time, not the values being constant. White noise is stationary. So is an AR(1) process with a coefficient of 0.95, even though it has persistent deviations from its mean. What’s not stationary is a random walk, where each shock permanently shifts the level.
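To make the AR(1) versus random walk contrast concrete, here is a small simulation sketch. The number of paths, the horizon, and the coefficient 0.95 are illustrative choices. It measures the variance across many simulated paths at two points in time: for the AR(1) process it should stay near 1 / (1 - 0.95²), while for the random walk it should grow roughly in proportion to t.

```python
import numpy as np

rng = np.random.default_rng(1)
n_paths, n_obs, phi = 2000, 400, 0.95  # illustrative choices
shocks = rng.standard_normal((n_paths, n_obs))

# Random walk: every shock is accumulated permanently.
random_walk = np.cumsum(shocks, axis=1)

# AR(1): shocks decay geometrically. Start from the stationary
# distribution so the variance is stable from the first observation.
ar1 = np.zeros_like(shocks)
ar1[:, 0] = shocks[:, 0] / np.sqrt(1 - phi**2)
for t in range(1, n_obs):
    ar1[:, t] = phi * ar1[:, t - 1] + shocks[:, t]

# Variance ACROSS paths at t=100 and t=400.
ar1_var_100, ar1_var_400 = ar1[:, 99].var(), ar1[:, 399].var()
rw_var_100, rw_var_400 = random_walk[:, 99].var(), random_walk[:, 399].var()

print(f"AR(1) variance:       t=100 {ar1_var_100:.1f}, t=400 {ar1_var_400:.1f}")
print(f"Random walk variance: t=100 {rw_var_100:.1f}, t=400 {rw_var_400:.1f}")
```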

Unit Roots vs. Trend Stationarity: The Distinction That Matters

There are two fundamentally different ways a series can fail the stationarity condition, and confusing them leads to the wrong fix.

Unit root processes. The canonical example is a random walk: X_t = X_{t-1} + ε_t. Each shock ε_t shifts the level of the series permanently. There is no mean to revert to; the series wanders without bound. The variance grows linearly with time. First differencing fixes this: ΔX_t = X_t - X_{t-1} = ε_t, which is stationary (just white noise). A series that requires d differences to become stationary is called integrated of order d, written I(d). Raw prices are typically I(1). Returns are I(0).

Trend-stationary processes. The series follows a deterministic trend with stationary fluctuations around it: X_t = α + βt + ε_t. The non-stationarity comes from the βt term, not from accumulated shocks. Detrending (subtracting the fitted linear trend) makes the series stationary. Shocks are temporary: after a deviation, the series returns to the trend line.

Over short horizons, these two look almost identical. Plot a random walk and a trend-stationary process side by side and you’ll struggle to tell them apart visually. Both go up, both have wiggles. The difference shows up in the autocorrelation function: a unit root series has ACF that decays extremely slowly (it stays near 1.0 even at high lags), while a trend-stationary series, after detrending, has ACF that decays rapidly.
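The ACF contrast can be checked with a few lines of NumPy. This is a sketch under assumed parameters: the trend slope, the AR(1) noise with coefficient 0.5, and the sample_acf helper are all illustrative.

```python
import numpy as np


def sample_acf(x: np.ndarray, max_lag: int) -> np.ndarray:
    """Sample autocorrelations at lags 0..max_lag."""
    x = x - x.mean()
    denom = x @ x
    return np.array(
        [1.0] + [(x[k:] @ x[:-k]) / denom for k in range(1, max_lag + 1)]
    )


rng = np.random.default_rng(2)
n = 1000
t = np.arange(n)

random_walk = np.cumsum(rng.standard_normal(n))

# Trend-stationary series: linear trend plus stationary AR(1) noise.
eps = rng.standard_normal(n)
noise = np.zeros(n)
for i in range(1, n):
    noise[i] = 0.5 * noise[i - 1] + eps[i]
trend_stationary = 0.05 * t + noise

# Detrend with an OLS linear fit before computing the ACF.
detrended = trend_stationary - np.polyval(np.polyfit(t, trend_stationary, 1), t)

acf_walk = sample_acf(random_walk, 20)
acf_detrended = sample_acf(detrended, 20)
for lag in (1, 5, 10, 20):
    print(f"lag {lag:>2}: walk {acf_walk[lag]:.2f}, detrended {acf_detrended[lag]:.2f}")
```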

Why does the distinction matter for your strategy?

If your signal has a unit root, differencing is the correct transformation. Using the levels in a regression produces spurious results. Differencing eliminates the permanent-shock component and gives you a stationary series to work with.

If your signal is trend-stationary, differencing also produces a stationary result, but it’s suboptimal. Differencing a trend-stationary series discards the trend information and introduces unnecessary negative autocorrelation at lag 1 (a well-known consequence of over-differencing). You lose information that detrending would have preserved.
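The over-differencing effect is visible in a simulation. For X_t = α + βt + ε_t with white-noise ε_t, the first difference is β + ε_t - ε_{t-1}, a non-invertible MA(1) whose lag-1 autocorrelation is -0.5 in theory. The sketch below compares differencing with detrending; the trend parameters and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
t = np.arange(n)
trend_stationary = 1.0 + 0.02 * t + rng.standard_normal(n)

# Option 1: difference (removes the trend, but over-differences the noise).
differenced = np.diff(trend_stationary)

# Option 2: detrend with an OLS linear fit (leaves the noise untouched).
detrended = trend_stationary - np.polyval(np.polyfit(t, trend_stationary, 1), t)


def lag1_autocorr(x: np.ndarray) -> float:
    x = x - x.mean()
    return float((x[1:] @ x[:-1]) / (x @ x))


rho_diff = lag1_autocorr(differenced)
rho_detr = lag1_autocorr(detrended)
print(f"lag-1 autocorrelation after differencing: {rho_diff:.2f}")
print(f"lag-1 autocorrelation after detrending:   {rho_detr:.2f}")
```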

The practical upshot: you need to know which type of non-stationarity you’re dealing with before you apply a transformation. This is exactly what the tests in the next few sections help you determine.

Some concrete financial context. Log prices are almost always I(1). Log returns are I(0). Spreads between cointegrated pairs are I(0) by definition; that’s what cointegration means. If you’re building a pairs trading strategy around the concept of cointegration (where two individually non-stationary series combine to form a stationary spread), the stationarity of the spread is the entire basis for the trade. Implied volatility surfaces are more ambiguous: ATM implied vol at a fixed tenor is arguably stationary over long horizons, but term-structure slopes and skew parameters can exhibit unit-root-like behavior depending on the asset class and the sample period.

The Augmented Dickey-Fuller Test

The ADF test is the workhorse of unit root testing. It has been around since 1979, it’s implemented in every major statistics package, and it’s the first test most quants reach for. Understanding its mechanics, its variants, and its limitations is essential.

The null hypothesis of the ADF test is that the series has a unit root (is non-stationary). Rejection of the null provides evidence of stationarity. This framing matters: failing to reject doesn’t prove the series has a unit root. It just means you don’t have enough evidence to rule one out at your chosen significance level.

The test regression is:

ΔX_t = α + βt + γX_{t-1} + Σ_i δ_i ΔX_{t-i} + ε_t

The test statistic is the t-ratio on γ. Under the null hypothesis (γ = 0), this statistic does not follow a standard t-distribution. It follows the Dickey-Fuller distribution, with critical values tabulated by MacKinnon. This is important because the critical values are more negative than standard t-distribution values. A test statistic of -2.5 might be significant in a normal t-test but fail to reject the unit root null in an ADF test.

Three Model Variants

The ADF test comes in three flavors depending on which deterministic terms you include in the regression. Choosing the right one matters more than most people realize.

No constant, no trend (regression='n'): ΔX_t = γX_{t-1} + lags + ε_t. Use this when the series has no drift and no trend under the alternative hypothesis. This is rare in financial data. I almost never use it.

Constant, no trend (regression='c'): ΔX_t = α + γX_{t-1} + lags + ε_t. This is the default and the right choice for most financial signals: returns, spreads, z-scores, oscillators. If your signal fluctuates around a non-zero mean, this is the model.

Constant and trend (regression='ct'): ΔX_t = α + βt + γX_{t-1} + lags + ε_t. Use this when you suspect a deterministic time trend. Log GDP, some price levels, or trending macro indicators might warrant this model. Including a trend when none exists reduces the power of the test, so don’t use 'ct' as a default just to be safe.

My rule of thumb: start with 'c' for anything that’s a derived signal (returns, spreads, ratios, z-scores). Use 'ct' only for level series where a visual inspection or domain knowledge suggests a deterministic trend. Use 'n' basically never.

Lag Selection

The “augmented” in ADF refers to the lagged difference terms Σ_i δ_i ΔX_{t-i}. These soak up serial correlation in the residuals so that the test statistic is valid. Too few lags and the residuals are autocorrelated, biasing the test toward rejecting the null (you see stationarity that isn’t there). Too many lags and you lose statistical power; the test can’t detect stationarity even when it exists.

The standard approach is to let an information criterion choose. AIC tends to select more lags (prioritizes prediction accuracy), while BIC selects fewer (prioritizes parsimony). In my experience, AIC is the safer default for financial data because under-specifying the lag structure is worse than over-specifying it. The alternative is the “test-down” approach: start at a maximum lag and sequentially drop the last lag if its t-statistic is insignificant. I’ve tried both and settled on AIC because it requires less manual intervention in a pipeline context.

Running the ADF Test in Python

Here’s how I run the ADF test in practice. The function wraps statsmodels and returns a structured result rather than a raw tuple.

from dataclasses import dataclass
from statsmodels.tsa.stattools import adfuller
import numpy as np
import numpy.typing as npt


@dataclass(frozen=True)
class ADFResult:
    """Result of an Augmented Dickey-Fuller unit root test.

    Attributes:
        statistic: The ADF test statistic (more negative = stronger
            evidence against unit root).
        pvalue: MacKinnon approximate p-value.
        lags_used: Number of lags selected by the information criterion.
        nobs: Number of observations used in the regression.
        critical_values: Dict mapping significance levels ('1%', '5%',
            '10%') to critical values.
        regression: Model specification used ('n', 'c', or 'ct').
    """

    statistic: float
    pvalue: float
    lags_used: int
    nobs: int
    critical_values: dict[str, float]
    regression: str


def run_adf(
    series: npt.NDArray[np.float64],
    regression: str = "c",
    max_lag: int | None = None,
    autolag: str = "AIC",
) -> ADFResult:
    """Run the Augmented Dickey-Fuller test for a unit root.

    Args:
        series: The time series to test. Must be one-dimensional.
        regression: Deterministic terms to include. 'c' for constant
            only, 'ct' for constant and trend, 'n' for neither.
        max_lag: Maximum number of lags to consider. None lets
            statsmodels choose based on the sample size.
        autolag: Information criterion for lag selection. 'AIC',
            'BIC', or 't-stat'.

    Returns:
        An ADFResult with the test statistic, p-value, and diagnostics.

    Raises:
        ValueError: If series has fewer than 20 observations.
    """
    if len(series) < 20:
        raise ValueError(
            f"ADF requires at least 20 observations, got {len(series)}"
        )
    stat, pvalue, lags, nobs, crit, _ = adfuller(
        series, maxlag=max_lag, autolag=autolag, regression=regression
    )
    return ADFResult(
        statistic=stat,
        pvalue=pvalue,
        lags_used=lags,
        nobs=nobs,
        critical_values=crit,
        regression=regression,
    )

Usage is straightforward:

import numpy as np

rng = np.random.default_rng(42)
# Simulate log prices as a random walk; log returns are the first difference.
log_prices = np.cumsum(0.01 * rng.standard_normal(500))
returns = np.diff(log_prices)

# This should fail to reject (prices have a unit root)
price_result = run_adf(log_prices)
print(f"Prices: stat={price_result.statistic:.4f}, p={price_result.pvalue:.4f}")

# This should reject (returns are stationary)
returns_result = run_adf(returns)
print(f"Returns: stat={returns_result.statistic:.4f}, p={returns_result.pvalue:.4f}")

The Most Common Pitfall

I see this constantly: someone runs ADF on raw prices, gets a p-value near 1.0, and concludes there’s something wrong with the data. There isn’t. Prices should have unit roots. That’s how prices work in efficient markets: they follow random walks (or near-random walks) because predictable price movements get arbitraged away. The relevant question isn’t whether prices are stationary. It’s whether your signal, which is derived from prices, is stationary. Returns, spreads, z-scores, ratios, residuals from regressions: these are the objects you test. If your signal is just a moving average of prices, it inherits the unit root from prices, and that should tell you your signal is broken, not your data.

A separate issue worth mentioning is power: ADF has notoriously low power against near-unit-root alternatives. A series with an autoregressive coefficient of 0.97 behaves almost identically to one with a coefficient of 1.0 over typical sample sizes, but the first is stationary and the second isn’t. ADF struggles to distinguish them unless you have hundreds or thousands of observations. This is one reason I never rely on ADF alone.

The KPSS Test: Stationarity as the Null

The KPSS test, published by Kwiatkowski, Phillips, Schmidt, and Shin in 1992, flips the hypothesis. The null hypothesis is that the series is stationary. Rejection provides evidence of non-stationarity.

This reversal is exactly what makes KPSS useful as a complement to ADF. With ADF, failing to reject means ambiguity: maybe the series has a unit root, or maybe the test just lacks power. With KPSS, you get a second opinion with the opposite burden of proof. Section 6 shows how to combine them into a confirmation framework.

How KPSS Works

The KPSS test decomposes the series as X_t = ξt + r_t + ε_t, where ξt is a deterministic trend, r_t is a random walk (r_t = r_{t-1} + u_t), and ε_t is a stationary error. Under the null hypothesis, the variance of the random walk innovations u_t is zero, so r_t collapses to a constant and there’s no stochastic trend. The test statistic is a Lagrange Multiplier statistic built from the partial sums of the residuals from regressing X_t on the deterministic components, scaled by an estimate of the long-run variance.

Two Variants

Level stationarity (regression='c'): the null is that the series is stationary around a constant mean. Use this for signals that should fluctuate around a fixed level, like spreads, z-scores, or oscillators.

Trend stationarity (regression='ct'): the null is that the series is stationary around a linear trend. Use this when you want to allow for a deterministic trend and test whether the deviations from that trend are stationary. This is useful for level series like GDP or cumulative returns where you expect a trend but want to know if the fluctuations around it are stable.

Choosing the wrong variant causes problems. If your signal has a trend and you test with regression='c', KPSS will reject (correctly) because the trend violates level stationarity. But that doesn’t mean the series has a unit root; it might be trend-stationary. Using regression='ct' would correctly fail to reject in that case.

Bandwidth Selection

Where ADF has lag selection, KPSS has bandwidth selection for the long-run variance estimator. The bandwidth determines how many autocovariances to include when estimating the spectral density at frequency zero. Too few and the estimate is biased; too many and it’s noisy.

With nlags='auto' (the default), statsmodels picks the bandwidth with the data-dependent method of Hobijn et al. (1998). Passing nlags='legacy' instead applies the sample-size rule from Schwert (1989), which scales as 12·(n/100)^(1/4). In practice, I stick with the automatic selection unless I have a specific reason to override it. Manual bandwidth selection is one of those knobs that, in my experience, rarely improves results but frequently introduces researcher degrees of freedom.

Running the KPSS Test in Python

from dataclasses import dataclass
from statsmodels.tsa.stattools import kpss
import numpy as np
import numpy.typing as npt


@dataclass(frozen=True)
class KPSSResult:
    """Result of a KPSS stationarity test.

    Attributes:
        statistic: The KPSS test statistic.
        pvalue: Approximate p-value. Bounded between 0.01 and 0.10
            in statsmodels; values at the boundary should be reported
            as '< 0.01' or '> 0.10'.
        lags_used: Bandwidth (number of lags) used in the long-run
            variance estimator.
        critical_values: Dict mapping significance levels to critical
            values.
        regression: Model specification ('c' or 'ct').
    """

    statistic: float
    pvalue: float
    lags_used: int
    critical_values: dict[str, float]
    regression: str

    @property
    def pvalue_display(self) -> str:
        """Human-readable p-value that accounts for bounds."""
        if self.pvalue <= 0.01:
            return "< 0.01"
        if self.pvalue >= 0.10:
            return "> 0.10"
        return f"{self.pvalue:.4f}"


def run_kpss(
    series: npt.NDArray[np.float64],
    regression: str = "c",
    nlags: str | int = "auto",
) -> KPSSResult:
    """Run the KPSS test for stationarity.

    Unlike ADF, the null hypothesis here is stationarity. Rejection
    means evidence of non-stationarity.

    Args:
        series: The time series to test.
        regression: 'c' for level stationarity, 'ct' for trend
            stationarity.
        nlags: Number of lags for the Newey-West estimator. 'auto'
            uses the data-dependent method of Hobijn et al. (1998);
            'legacy' uses the Schwert (1989) rule.

    Returns:
        A KPSSResult with test statistic, p-value, and diagnostics.

    Raises:
        ValueError: If series has fewer than 20 observations.
    """
    if len(series) < 20:
        raise ValueError(
            f"KPSS requires at least 20 observations, got {len(series)}"
        )
    stat, pvalue, lags, crit = kpss(series, regression=regression, nlags=nlags)
    return KPSSResult(
        statistic=stat,
        pvalue=pvalue,
        lags_used=lags,
        critical_values=crit,
        regression=regression,
    )

A practical caveat: statsmodels bounds KPSS p-values between 0.01 and 0.10. If the test returns pvalue=0.10, that means the true p-value is at least 0.10. Report it as “> 0.10” rather than claiming exact knowledge of the p-value. The pvalue_display property above handles this. It’s a small thing, but I’ve seen research reports that list “p = 0.10” as if it were exact, which misrepresents the certainty.

KPSS also has a known sensitivity to structural breaks. A break in the mean or trend can cause KPSS to reject the stationarity null even when the series is stationary within each sub-period. The test interprets the level shift as evidence of a stochastic trend. This is another reason structural break testing (section 7) is part of my workflow.

Phillips-Perron: The Non-Parametric Alternative

The PP test, introduced by Phillips and Perron in 1988, tests the same null hypothesis as ADF (unit root) but handles serial correlation differently. Where ADF adds lagged differences to the regression, PP applies a non-parametric correction to the test statistic. The correction accounts for both serial correlation and heteroskedasticity in the error term without requiring you to specify a lag structure.

When to Prefer PP Over ADF

PP has advantages when the error structure is heteroskedastic, which is common in financial data. Volatility clustering means that the errors in a unit root regression aren’t identically distributed over time. ADF assumes homoskedastic errors (or at least errors that are adequately captured by the AR lag structure); PP does not.

PP is also useful when you’re unsure about the correct lag length for ADF. Since PP doesn’t use augmented lags, the lag-selection problem disappears. Instead you have a bandwidth parameter for the spectral density estimator, but the default choices tend to be less consequential than ADF’s lag selection.

PP can also handle MA components in the error structure that ADF sometimes handles poorly. If the true data-generating process has a moving average component, ADF might need many augmented lags to capture it, losing power in the process. PP’s non-parametric correction handles this more gracefully.

When ADF Is Better

In small samples, PP’s asymptotic corrections can be unreliable. The non-parametric spectral density estimator needs enough data to be accurate. With fewer than 100 observations, I trust ADF more than PP.

If the data-generating process is well-approximated by an AR(p) model, ADF is the natural choice because its parametric structure matches the data.

Running the Phillips-Perron Test in Python

from dataclasses import dataclass
from arch.unitroot import PhillipsPerron
import numpy as np
import numpy.typing as npt


@dataclass(frozen=True)
class PPResult:
    """Result of a Phillips-Perron unit root test.

    Attributes:
        statistic: The PP test statistic.
        pvalue: Approximate p-value.
        lags: Bandwidth used in the spectral density estimator.
        trend: Deterministic trend specification used.
    """

    statistic: float
    pvalue: float
    lags: int
    trend: str


def run_pp(
    series: npt.NDArray[np.float64],
    trend: str = "c",
    lags: int | None = None,
) -> PPResult:
    """Run the Phillips-Perron test for a unit root.

    Non-parametric alternative to ADF that corrects for serial
    correlation and heteroskedasticity without augmented lags.

    Args:
        series: The time series to test.
        trend: Deterministic trend specification. 'c' for constant,
            'ct' for constant and trend, 'n' for neither.
        lags: Bandwidth for the Newey-West estimator. None uses
            the default data-driven selection.

    Returns:
        A PPResult with test statistic, p-value, and diagnostics.

    Raises:
        ValueError: If series has fewer than 20 observations.
    """
    if len(series) < 20:
        raise ValueError(
            f"PP requires at least 20 observations, got {len(series)}"
        )
    pp = PhillipsPerron(series, trend=trend, lags=lags)
    return PPResult(
        statistic=pp.stat,
        pvalue=pp.pvalue,
        lags=pp.lags,
        trend=trend,
    )

In practice, ADF and PP agree the vast majority of the time. When they disagree, it’s usually because the series sits near the unit root boundary or contains structural breaks. The disagreement itself is informative: it tells you the result is fragile and you should investigate further rather than trusting either test in isolation. Don’t average the p-values. That’s statistically incoherent. Instead, treat the disagreement as a flag for deeper analysis.

The Confirmation Framework: Running ADF and KPSS Together

This is the section that pays for the rest of the article. Running ADF alone tells you one thing. Running KPSS alone tells you another. Running them together, with their opposite null hypotheses, gives you a two-dimensional view that is far more informative than either test in isolation.

The insight is simple: ADF tests whether you can reject non-stationarity. KPSS tests whether you can reject stationarity. Combining the two creates four possible outcomes, and each outcome tells you something different about your signal.

I use a significance level of 0.05 for both tests throughout. You can adjust this, but consistency between the two tests matters more than the exact level.

ADF Rejects (p < 0.05)	KPSS Rejects (p < 0.05)	Conclusion
Yes	No	Stationary. Both tests agree. ADF says “not a unit root” and KPSS says “consistent with stationarity.” Proceed with confidence.
No	Yes	Unit root. Both tests agree. ADF can’t reject the unit root null and KPSS rejects stationarity. Difference the series.
Yes	Yes	Contradictory. ADF rejects the unit root but KPSS also rejects stationarity. This typically indicates trend stationarity or a structural break. The series isn’t a unit root process, but it’s not level-stationary either.
No	No	Inconclusive. Neither test can reject its null. This usually means the series is near the unit root boundary, or you don’t have enough data for either test to have power.

The contradictory cases are the interesting ones. “Both reject” often means trend stationarity: ADF correctly identifies that the series doesn’t have a unit root, but KPSS correctly identifies that it’s not stationary around a constant level (because it has a trend). The fix is detrending, not differencing. Alternatively, a structural break can produce this pattern: the series is stationary within each regime, but the level shift looks like non-stationarity to KPSS and the persistent deviation looks like mean-reversion to ADF.

“Neither rejects” is the low-power case. I see this most often with short samples (fewer than 200 observations) or series that sit just inside the stationary region (autoregressive coefficient around 0.95-0.99). The honest answer is that you don’t have enough evidence to decide, and you should either get more data or be conservative about using the signal.

Decision Tree for Quant Signals

Here’s the decision tree I follow in my pipeline:

Both agree stationary: use the signal in levels. No transformation needed.
Both agree unit root: first-difference the signal, re-run the battery, and confirm the differenced signal is stationary. If you need level information, consider fractional differencing (section 8).
Contradictory (both reject): run Zivot-Andrews to check for structural breaks (section 7). If a break is found, split the sample and test each sub-period. If no break, try KPSS with regression='ct' to test for trend stationarity.
Inconclusive (neither rejects): flag for human review. Increase the sample size if possible. If the signal is going into an ML model, consider fractional differencing as a robust transformation.

The StationarityReport

Here’s the dataclass that ties it all together. Every signal that enters my pipeline gets one of these.

from dataclasses import dataclass
from enum import Enum
import numpy as np
import numpy.typing as npt


class StationarityConclusion(Enum):
    """Outcome of the combined ADF + KPSS stationarity assessment."""

    STATIONARY = "stationary"
    UNIT_ROOT = "unit_root"
    CONTRADICTORY = "contradictory"
    INCONCLUSIVE = "inconclusive"


class RecommendedAction(Enum):
    """Recommended transformation based on stationarity assessment."""

    USE_LEVELS = "use_levels"
    DIFFERENCE = "difference"
    INVESTIGATE_BREAKS = "investigate_breaks"
    NEEDS_REVIEW = "needs_review"


@dataclass(frozen=True)
class StationarityReport:
    """Combined stationarity assessment from ADF, KPSS, and PP tests.

    This report runs all three tests and synthesizes their results
    into a conclusion and recommended action.

    Attributes:
        adf: Result of the Augmented Dickey-Fuller test.
        kpss: Result of the KPSS test.
        pp: Result of the Phillips-Perron test.
        significance_level: The alpha used for all tests.
        conclusion: Synthesized conclusion from the confirmation
            framework.
        recommended_action: What to do with the signal based on
            the test results.
    """

    adf: ADFResult
    kpss: KPSSResult
    pp: PPResult
    significance_level: float
    conclusion: StationarityConclusion
    recommended_action: RecommendedAction


def assess_stationarity(
    series: npt.NDArray[np.float64],
    significance_level: float = 0.05,
    adf_regression: str = "c",
    kpss_regression: str = "c",
    pp_trend: str = "c",
) -> StationarityReport:
    """Run the full stationarity battery and return a synthesized report.

    Executes ADF, KPSS, and Phillips-Perron tests, then combines
    their results using the confirmation framework: ADF and KPSS have
    opposite null hypotheses, so their joint outcome determines the
    conclusion.

    Args:
        series: The time series to assess.
        significance_level: Alpha level for all hypothesis tests.
        adf_regression: Deterministic terms for ADF ('c', 'ct', 'n').
        kpss_regression: Deterministic terms for KPSS ('c', 'ct').
        pp_trend: Deterministic terms for PP ('c', 'ct', 'n').

    Returns:
        A StationarityReport with individual test results, a
        synthesized conclusion, and a recommended action.

    Raises:
        ValueError: If significance_level is not between 0 and 1, or
            if the series is too short.
    """
    if not 0 < significance_level < 1:
        raise ValueError(
            f"significance_level must be in (0, 1), got {significance_level}"
        )

    adf = run_adf(series, regression=adf_regression)
    kpss_result = run_kpss(series, regression=kpss_regression)
    pp = run_pp(series, trend=pp_trend)

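    adf_rejects = adf.pvalue < significance_level
    # statsmodels clips KPSS p-values to [0.01, 0.10], so this comparison
    # is only meaningful for 0.01 < significance_level <= 0.10.
    kpss_rejects = kpss_result.pvalue < significance_level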

    if adf_rejects and not kpss_rejects:
        conclusion = StationarityConclusion.STATIONARY
        action = RecommendedAction.USE_LEVELS
    elif not adf_rejects and kpss_rejects:
        conclusion = StationarityConclusion.UNIT_ROOT
        action = RecommendedAction.DIFFERENCE
    elif adf_rejects and kpss_rejects:
        conclusion = StationarityConclusion.CONTRADICTORY
        action = RecommendedAction.INVESTIGATE_BREAKS
    else:
        conclusion = StationarityConclusion.INCONCLUSIVE
        action = RecommendedAction.NEEDS_REVIEW

    return StationarityReport(
        adf=adf,
        kpss=kpss_result,
        pp=pp,
        significance_level=significance_level,
        conclusion=conclusion,
        recommended_action=action,
    )

The PP result is included for completeness but doesn’t directly affect the conclusion in this implementation. I use it as a tiebreaker during human review: if ADF and PP agree but KPSS disagrees, that shifts my interpretation. You could formalize this into the decision logic, but I’ve found that the cases where PP matters are exactly the cases where automated decisions are risky and human judgment is needed.

This report integrates naturally with the property-based validation pipeline. After validating that your data satisfies structural invariants (non-negative spreads, consistent OHLC bars, monotonic timestamps), stationarity testing is the next gate. There’s no point computing a Sharpe ratio on a strategy whose signal isn’t stationary. The stationarity report is part of the pre-conditions I check before any signal enters a walk-forward optimization.

Structural Breaks: When the Regime Changes

Standard unit root tests assume the data-generating process is constant over the entire sample. When the process changes partway through (a structural break), these tests produce unreliable results. A break in the mean or trend can make an otherwise stationary series look like it has a unit root, and a unit root test that ignores the break will happily confirm that false impression.

I hit this problem with a pairs trading spread. The spread between two energy sector ETFs was clearly stationary from 2018 through early 2020: tight range, fast mean-reversion, textbook behavior. Then came the oil price crash and pandemic disruption. The spread shifted to a new level and oscillated there. Running ADF on the full 2018-2022 sample returned a p-value of 0.35. Unit root, apparently. But within each sub-period (2018-2020 and mid-2020-2022), the spread was stationary. The structural break at the regime change was masquerading as a unit root.

This is not an edge case. Financial data is full of regime changes: monetary policy shifts, sector rotations, liquidity events, regulatory changes. If your sample spans one of these, standard stationarity tests will mislead you.

The Zivot-Andrews Test

The Zivot-Andrews test (1992) tests for a unit root while allowing for a single structural break at an unknown date. Instead of assuming the series has a constant mean and trend throughout, it allows one of three types of break:

Break in intercept: the mean shifts at the break date, but the trend (if any) stays the same.
Break in trend: the trend slope changes at the break date, but the intercept is continuous.
Break in both: both the intercept and trend change simultaneously.

The test searches over all possible break dates (excluding a trimmed region at the endpoints, typically 15% at each end) and selects the break date that gives the strongest evidence against the unit root null. This endogenous break selection means the critical values differ from standard ADF critical values, because the search inflates the probability of rejecting by chance.
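To make the search concrete, here is a stripped-down sketch of the intercept-break case using only NumPy. It is an illustration under simplifying assumptions, not a replacement for a library implementation: it omits the lagged-difference terms a real test includes, the function name `za_intercept_search` is hypothetical, and the statistic it returns must not be compared against standard ADF critical values.

```python
import numpy as np


def za_intercept_search(y: np.ndarray, trim: float = 0.15) -> tuple[float, int]:
    """Simplified Zivot-Andrews-style search for one intercept break.

    For each candidate break index bp, fit by OLS
        dy_t = a + theta * DU_t + b * t + g * y_{t-1} + e_t
    where DU_t = 1 for t >= bp, and record the t-statistic on g.
    Returns the most negative t-statistic and the break index behind it.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    dy = np.diff(y)                        # dy[i] = y[i + 1] - y[i]
    y_lag = y[:-1]
    t_idx = np.arange(1, n, dtype=float)   # time index of each dy observation
    lo = int(round(trim * n))              # skip the first trim fraction
    hi = int(round((1.0 - trim) * n))      # skip the last trim fraction
    best_stat, best_break = np.inf, lo
    for bp in range(lo, hi):
        du = (t_idx >= bp).astype(float)   # level-shift dummy
        X = np.column_stack([np.ones(n - 1), du, t_idx, y_lag])
        beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
        resid = dy - X @ beta
        sigma2 = resid @ resid / (len(dy) - X.shape[1])
        cov = sigma2 * np.linalg.inv(X.T @ X)
        t_stat = beta[3] / np.sqrt(cov[3, 3])
        if t_stat < best_stat:             # strongest evidence against a unit root
            best_stat, best_break = float(t_stat), bp
    return best_stat, best_break


# Simulated example: a mean-reverting AR(1) with a level shift at index 200.
rng = np.random.default_rng(0)
n = 400
noise = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + noise[t]
y = u + 10.0 * (np.arange(n) >= 200)
stat, bp = za_intercept_search(y)
print(f"min t-stat = {stat:.2f} at candidate break index {bp}")
```

Because the statistic is the minimum over hundreds of candidate regressions, it is more negative than a single ADF statistic would be on the same data, which is why the test needs its own critical values.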

from dataclasses import dataclass
import numpy as np
import numpy.typing as npt
from statsmodels.tsa.stattools import zivot_andrews

# Map the break types used in this article to the `regression` codes
# that statsmodels' zivot_andrews expects.
_ZA_REGRESSION = {"intercept": "c", "trend": "t", "both": "ct"}


@dataclass(frozen=True)
class ZivotAndrewsResult:
    """Result of a Zivot-Andrews unit root test with structural break.

    Attributes:
        statistic: The ZA test statistic.
        pvalue: Approximate p-value.
        breakpoint: Index of the estimated structural break.
        lags: Number of lags used.
        method: Type of break tested ('intercept', 'trend', 'both').
    """

    statistic: float
    pvalue: float
    breakpoint: int
    lags: int
    method: str


def run_zivot_andrews(
    series: npt.NDArray[np.float64],
    method: str = "both",
    trim: float = 0.15,
) -> ZivotAndrewsResult:
    """Run the Zivot-Andrews unit root test with structural break.

    Tests for a unit root while allowing for a single structural
    break at an unknown date. The break date is determined
    endogenously by selecting the date that gives the strongest
    evidence against the unit root null.

    Args:
        series: The time series to test. Must have at least 50
            observations for meaningful results.
        method: Type of break to test. 'intercept' for break in
            level, 'trend' for break in trend slope, 'both' for
            break in level and trend.
        trim: Fraction of observations to trim from each end when
            searching for the break date. Default 0.15 excludes the
            first and last 15% of observations.

    Returns:
        A ZivotAndrewsResult with the test statistic, break date,
        and diagnostics.

    Raises:
        ValueError: If series has fewer than 50 observations or
            method is not a recognized break type.
    """
    if len(series) < 50:
        raise ValueError(
            f"Zivot-Andrews needs at least 50 observations, got {len(series)}"
        )
    if method not in _ZA_REGRESSION:
        raise ValueError(
            f"method must be one of {sorted(_ZA_REGRESSION)}, got {method!r}"
        )
    # statsmodels is used here because its zivot_andrews function returns
    # the estimated break index alongside the statistic and p-value.
    stat, pvalue, _crit_values, lags, break_idx = zivot_andrews(
        series, trim=trim, regression=_ZA_REGRESSION[method]
    )
    return ZivotAndrewsResult(
        statistic=float(stat),
        pvalue=float(pvalue),
        breakpoint=int(break_idx),
        lags=int(lags),
        method=method,
    )

Practical Application

Once you have a detected break date, several things follow.

Split and re-test. Run the standard ADF + KPSS battery on each sub-period. If both sub-periods are stationary, your signal is fine within each regime; you just need to know which regime you’re in now.

Decide whether the relationship has broken down. For pairs trading or mean-reversion strategies (the concept behind cointegration-based trading, where two non-stationary series share a common stochastic trend and their spread is stationary), a structural break in the spread means the cointegration relationship may have dissolved. The pair might re-establish, or the fundamental relationship might have changed permanently. This is where domain knowledge matters more than statistics. The short version is: a structural break in a cointegrated spread demands a fundamental explanation. If you can’t find one, assume the relationship is dead until proven otherwise.

Set effective start dates. If a break is detected, I only use data from after the most recent break for calibration. The data before the break came from a different regime and calibrating to it introduces bias. This is a conservative choice that costs sample size, but in my experience the alternative (mixing regimes) costs accuracy.
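The split-and-re-test and effective-start-date steps are mechanical enough to sketch. The helper names below are hypothetical, and the stationarity check is passed in as a callable so the snippet stays self-contained; in the pipeline from this article you would pass `assess_stationarity`. The 50-observation minimum is an assumption borrowed from the Zivot-Andrews wrapper above.

```python
from typing import Callable, TypeVar

import numpy as np

R = TypeVar("R")


def split_and_retest(
    series: np.ndarray,
    break_index: int,
    assess: Callable[[np.ndarray], R],
    min_obs: int = 50,
) -> tuple[R, R]:
    """Run the same stationarity check on each side of a detected break."""
    before = series[:break_index]
    after = series[break_index:]
    if min(len(before), len(after)) < min_obs:
        raise ValueError(
            f"Sub-periods too short to test: {len(before)} and {len(after)} obs"
        )
    return assess(before), assess(after)


def calibration_window(series: np.ndarray, break_index: int) -> np.ndarray:
    """Keep only observations from the most recent break onward."""
    return series[break_index:]


# Usage with a stand-in check (the sub-period mean) instead of a real test.
spread = np.concatenate([np.zeros(100), np.full(100, 5.0)])
pre, post = split_and_retest(spread, 100, lambda x: float(np.mean(x)))
recent = calibration_window(spread, 100)
```

Injecting the check keeps the splitting logic testable on its own and makes it easy to swap in a different battery later.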

Multiple Breaks

Zivot-Andrews handles exactly one break. Real financial data can have multiple regime changes. For detecting multiple breaks, the Bai-Perron (1998) test is the standard tool, though it’s more complex to implement and interpret. Lumsdaine and Papell (1997) extended the ZA framework to allow two breaks.

My pragmatic view: if I suspect multiple structural breaks in a signal over my sample period, the signal is probably too unstable for a reliable strategy. Two regime changes in five years means the statistical properties of the signal have a half-life shorter than my calibration window. I can still use the signal if I’m re-estimating frequently enough (rolling stationarity tests, which I cover in section 9), but I’ll size the position much smaller to account for the structural uncertainty.

Structural break tests are a frequentist approach to regime detection. Hidden Markov Models address the same question probabilistically. Different tools, same underlying concern: when did the world change, and does my strategy’s assumption still hold?

Fractional Differencing: When Integer Differencing Is Too Aggressive

Here’s a tension I ran into early on. Price series are I(1), so you difference them to get returns. Returns are I(0), stationary, ready for analysis. But differencing destroys something valuable. A return series has no memory of where prices have been. The information about whether the stock is at 50 or 500 is gone. For many strategies, this doesn’t matter. For machine learning features and some signal constructions, it matters a lot.

Marcos Lopez de Prado articulated this problem clearly in chapter 5 of Advances in Financial Machine Learning (2018). The standard toolkit offers a binary choice: use prices (non-stationary, invalid for most statistical methods) or use returns (stationary, but stripped of level information). Fractional differencing offers a middle ground.

The Idea

Instead of differencing by d = 1 (standard first differencing), you difference by a fractional value like d = 0.3 or d = 0.5. The fractional differencing operator is:

(1 - B)^d = Σ_{k=0}^{∞} C(d,k) (-B)^k

where B is the backshift operator and C(d,k) are generalized binomial coefficients. When d = 1, this reduces to standard differencing. When 0 < d < 1, the result is a series that’s “partially differenced”: it retains some memory of past levels while being closer to stationary.
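A quick worked example helps here. The coefficients of the expansion follow the recursion w_0 = 1, w_k = -w_{k-1} (d - k + 1) / k, so the first few can be computed directly. The helper name `first_weights` is just for illustration.

```python
def first_weights(d: float, n_terms: int) -> list[float]:
    """First n_terms coefficients of (1 - B)^d via the binomial recursion."""
    weights = [1.0]
    for k in range(1, n_terms):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


# d = 1: weights are 1, -1, then zeros, i.e. x_t - x_{t-1}.
standard = first_weights(1.0, 4)

# d = 0.5: every past observation keeps a small negative weight.
half = first_weights(0.5, 5)
print(standard)
print(half)  # 1, -0.5, -0.125, -0.0625, -0.0390625
```

With d = 1 the weights vanish after the first lag, which is ordinary differencing. With d = 0.5 they shrink slowly and never hit zero, and that long tail is the memory the series retains.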

The key insight is that there exists a minimum value of d that makes the series stationary (as measured by ADF). Using this minimum d gives you the best of both worlds: stationarity for valid statistical inference, with maximum retention of the level information that integer differencing would discard.

Fractional Differencing in Python

The weights for fractional differencing decay but never reach exactly zero, so you need to truncate the infinite sum. Lopez de Prado suggests truncating when the weight magnitude falls below a threshold (typically 1e-5).

import numpy as np
import numpy.typing as npt


def fractional_diff_weights(
    d: float,
    threshold: float = 1e-5,
    max_terms: int = 10000,
) -> npt.NDArray[np.float64]:
    """Compute weights for the fractional differencing operator.

    The weights come from expanding (1 - B)^d as a binomial series.
    They decay toward zero for 0 < d < 1 and are truncated when
    they fall below the threshold.

    Args:
        d: The differencing order. Must be in (0, 1) for fractional
            differencing. d=1 gives standard first differences.
        threshold: Minimum absolute weight to include. Weights
            smaller than this are truncated. Lopez de Prado suggests
            1e-5.
        max_terms: Safety limit on the number of terms. Weights decay
            slowly for small d, so the loop can run for a long time.

    Returns:
        Array of weights, starting with 1.0 (for lag 0) and
        decreasing. Length is determined by the threshold.

    Raises:
        ValueError: If d is not in (0, 1].
    """
    if not 0 < d <= 1:
        raise ValueError(f"d must be in (0, 1], got {d}")

    weights = [1.0]
    k = 1
    while k < max_terms:
        w = -weights[-1] * (d - k + 1) / k
        if abs(w) < threshold:
            break
        weights.append(w)
        k += 1
    return np.array(weights)


def apply_fractional_diff(
    series: npt.NDArray[np.float64],
    d: float,
    threshold: float = 1e-5,
) -> npt.NDArray[np.float64]:
    """Apply fractional differencing to a time series.

    Computes (1 - B)^d applied to the series using truncated
    binomial expansion weights.

    Args:
        series: The input time series (e.g., log prices).
        d: The differencing order, in (0, 1].
        threshold: Weight truncation threshold.

    Returns:
        The fractionally differenced series. The first (len(weights)-1)
        values are NaN because insufficient history is available to
        compute them.

    Raises:
        ValueError: If d is not in (0, 1] or series is too short.
    """
    weights = fractional_diff_weights(d, threshold)
    width = len(weights)

    if len(series) <= width:
        raise ValueError(
            f"Series length ({len(series)}) must exceed weight "
            f"length ({width}) for d={d}"
        )

    result = np.full(len(series), np.nan)
    for t in range(width - 1, len(series)):
        result[t] = np.dot(weights, series[t - width + 1 : t + 1][::-1])

    return result
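One practical consequence of the truncation threshold is the warm-up cost: the first len(weights) - 1 outputs are NaN, and apply_fractional_diff raises if the series is not longer than the weight window. The sketch below counts how many weights survive the threshold for a given d, using the same recursion as fractional_diff_weights. The helper name `truncated_width` is hypothetical.

```python
def truncated_width(
    d: float,
    threshold: float = 1e-5,
    max_terms: int = 100_000,
) -> int:
    """Number of weights kept before |w_k| drops below the threshold."""
    w, k = 1.0, 1
    while k < max_terms:
        w = -w * (d - k + 1) / k
        if abs(w) < threshold:
            break
        k += 1
    return k


for d in (0.1, 0.5, 0.9, 1.0):
    print(d, truncated_width(d))
```

At the 1e-5 threshold the window runs to several thousand observations for small d and shrinks as d approaches 1. On a few years of daily data, the smallest d values may therefore be untestable unless you loosen the threshold.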

Finding the Minimum d

The practical procedure is a search for the smallest d that achieves stationarity. I use a simple iterative approach rather than a true binary search because the relationship between d and the ADF p-value isn’t always monotonic in small samples.

def find_minimum_d(
    series: npt.NDArray[np.float64],
    d_values: npt.NDArray[np.float64] | None = None,
    significance_level: float = 0.05,
    threshold: float = 1e-5,
) -> tuple[float, npt.NDArray[np.float64]]:
    """Find the minimum differencing order that achieves stationarity.

    Iterates over candidate d values from small to large and returns
    the first d for which the ADF test rejects the unit root null.

    Args:
        series: The input time series (typically log prices).
        d_values: Array of d values to try, in ascending order.
            Defaults to np.arange(0.05, 1.05, 0.05).
        significance_level: Alpha level for the ADF test.
        threshold: Weight truncation threshold for fractional
            differencing.

    Returns:
        A tuple of (minimum_d, differenced_series). If no d achieves
        stationarity, returns (1.0, first-differenced series).
    """
    if d_values is None:
        d_values = np.arange(0.05, 1.05, 0.05)

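    for d in d_values:
        try:
            diffed = apply_fractional_diff(series, float(d), threshold)
        except ValueError:
            # The truncated weight window is longer than the series for
            # this d (common for small d), so move to the next candidate.
            continue
        clean = diffed[~np.isnan(diffed)]

        if len(clean) < 20:
            continue

        adf_result = run_adf(clean)
        if adf_result.pvalue < significance_level:
            return float(d), diffed

    # Fallback to standard differencing
    return 1.0, np.diff(series)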

When to Use Fractional Differencing

Fractional differencing shines in machine learning contexts where you want stationary features that still carry level information. If you’re feeding features into a gradient-boosted model or a neural network and you want a feature that captures “where the price is” without being non-stationary, fractional differencing with minimum d is a strong choice.

For classic mean-reversion strategies, fractional differencing is less relevant. If you’re trading a spread and the spread is already stationary, there’s nothing to fix. If the spread has a unit root, the strategy’s premise is broken and fractional differencing won’t save it.

The trade-off is always the same: lower d preserves more memory but may not achieve stationarity. Higher d makes stationarity more likely but discards more information. The minimum-d search picks the smallest d on the candidate grid that passes ADF, which is the most memory you can keep while still clearing the test.

One thing to be aware of: the minimum d you find is sample-dependent. Different time periods may require different d values for the same series. This is yet another reason to re-estimate on a rolling basis rather than computing d once and treating it as fixed.

Building a Stationarity Gate into Your Pipeline

All the tests above are useless if they sit in a notebook and get run manually when you remember to. Stationarity testing needs to be automatic, run on every signal before it touches a backtest, and produce structured output that your pipeline can act on.

Architecture

Here’s where stationarity testing fits in my data-to-backtest flow:

Raw data arrives (prices, spreads, indicators from a data provider).
Parse and validate structural invariants using property-based checks: OHLC consistency, monotonic timestamps, non-negative spreads.
Compute the signal or feature from the validated data.
Run the stationarity battery (ADF + KPSS + PP) on the computed signal.
Route based on the conclusion:
- Stationary: pass directly to the backtest engine.
- Unit root: auto-difference and re-test; if the differenced signal passes, log the transformation and pass it through.
- Contradictory: run Zivot-Andrews, log the break date, flag for human review.
- Inconclusive: reject the signal with a diagnostic message.
Log everything to the research database for audit and reproducibility.

The key design principle is that no signal reaches the backtest without a stationarity report attached to it. This is the gate. If the gate doesn’t open, the signal doesn’t proceed.

The StationarityGate

from dataclasses import dataclass
from enum import Enum
import logging
import numpy as np
import numpy.typing as npt

logger = logging.getLogger(__name__)


class GateOutcome(Enum):
    """Outcome of the stationarity gate check."""

    PASSED = "passed"
    PASSED_AFTER_DIFFERENCING = "passed_after_differencing"
    FLAGGED_FOR_REVIEW = "flagged_for_review"
    REJECTED = "rejected"


@dataclass(frozen=True)
class GateResult:
    """Result of passing a signal through the stationarity gate.

    Attributes:
        outcome: Whether the signal passed, was transformed, or
            was rejected.
        original_report: Stationarity report on the original signal.
        transformed_series: The (possibly differenced) series to use.
            None if rejected.
        transformation_applied: Description of any transformation.
        differenced_report: Stationarity report on the differenced
            signal, if differencing was applied. None otherwise.
        za_result: Zivot-Andrews result if a break test was run.
            None otherwise.
    """

    outcome: GateOutcome
    original_report: StationarityReport
    transformed_series: npt.NDArray[np.float64] | None
    transformation_applied: str
    differenced_report: StationarityReport | None = None
    za_result: ZivotAndrewsResult | None = None


def stationarity_gate(
    series: npt.NDArray[np.float64],
    signal_name: str,
    significance_level: float = 0.05,
) -> GateResult:
    """Run a signal through the stationarity gate.

    Assesses stationarity using the confirmation framework and
    routes the signal based on the outcome: pass, auto-difference,
    flag for review, or reject.

    Args:
        series: The signal time series to evaluate.
        signal_name: Human-readable name for logging.
        significance_level: Alpha level for all tests.

    Returns:
        A GateResult containing the outcome, reports, and the
        (possibly transformed) series to use downstream.
    """
    report = assess_stationarity(series, significance_level)
    logger.info(
        "Stationarity gate for '%s': %s",
        signal_name,
        report.conclusion.value,
    )

    if report.conclusion == StationarityConclusion.STATIONARY:
        return GateResult(
            outcome=GateOutcome.PASSED,
            original_report=report,
            transformed_series=series,
            transformation_applied="none",
        )

    if report.conclusion == StationarityConclusion.UNIT_ROOT:
        diffed = np.diff(series)
        diff_report = assess_stationarity(diffed, significance_level)

        if diff_report.conclusion == StationarityConclusion.STATIONARY:
            logger.info(
                "Signal '%s' passed after first differencing.",
                signal_name,
            )
            return GateResult(
                outcome=GateOutcome.PASSED_AFTER_DIFFERENCING,
                original_report=report,
                transformed_series=diffed,
                transformation_applied="first_difference",
                differenced_report=diff_report,
            )

        logger.warning(
            "Signal '%s' still non-stationary after differencing.",
            signal_name,
        )
        return GateResult(
            outcome=GateOutcome.REJECTED,
            original_report=report,
            transformed_series=None,
            transformation_applied="first_difference_failed",
            differenced_report=diff_report,
        )

    if report.conclusion == StationarityConclusion.CONTRADICTORY:
        za = None
        if len(series) >= 50:
            za = run_zivot_andrews(series)
            logger.info(
                "Signal '%s' contradictory. ZA break at index %d, p=%.4f",
                signal_name,
                za.breakpoint,
                za.pvalue,
            )

        return GateResult(
            outcome=GateOutcome.FLAGGED_FOR_REVIEW,
            original_report=report,
            transformed_series=series,
            transformation_applied="none_pending_review",
            za_result=za,
        )

    # INCONCLUSIVE
    logger.warning(
        "Signal '%s' inconclusive. Neither test could reject its null.",
        signal_name,
    )
    return GateResult(
        outcome=GateOutcome.REJECTED,
        original_report=report,
        transformed_series=None,
        transformation_applied="none_insufficient_evidence",
    )

Monitoring in Production

Stationarity can change. A signal that passed every test during your backtest window can develop a unit root in live data. Regime changes, market structure shifts, and evolving correlations all contribute. If you’re running a strategy in production, you need to monitor the stationarity of its signals on an ongoing basis.

I run rolling stationarity tests on a schedule. For daily signals, I re-run the battery weekly using a trailing window. The window length should match your strategy’s calibration window: if your strategy recalibrates every 252 trading days, test stationarity over a trailing 252-day window.

The alert conditions are:

A signal that was stationary transitions to unit root or contradictory. This is a red flag. It might mean the relationship your strategy depends on has broken down.
A signal that was contradictory (trend-stationary with a suspected break) shows a new break date that falls within your live trading period. The regime changed while you were trading.
The ADF p-value for a stationary signal drifts upward toward the 0.05 threshold (for example, rising from below 0.01 into the 0.01 to 0.05 range) even if it hasn’t crossed it. This is an early warning that the signal is weakening.

I log every stationarity report to a time-series database so I can visualize the trajectory of ADF p-values and KPSS statistics over time. Sudden jumps in these metrics are informative even before they cross decision thresholds.
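The alert conditions above can be sketched as a comparison between two consecutive rolling-window snapshots. Everything here is illustrative: the `Snapshot` class, the string labels for conclusions, and the alert names are assumptions rather than part of the pipeline code shown earlier. The early-warning level is a parameter because the right value depends on your gate threshold.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One rolling-window stationarity result (hypothetical structure)."""

    conclusion: str              # e.g. "stationary", "unit_root", "contradictory"
    adf_pvalue: float
    break_index: int | None = None


def stationarity_alerts(
    prev: Snapshot,
    curr: Snapshot,
    live_start_index: int,
    warn_pvalue: float,
) -> list[str]:
    """Compare consecutive snapshots and list any alert conditions hit."""
    alerts: list[str] = []
    # 1. A previously stationary signal has lost stationarity.
    if prev.conclusion == "stationary" and curr.conclusion in {
        "unit_root",
        "contradictory",
    }:
        alerts.append("lost_stationarity")
    # 2. A new break date appears inside the live trading period.
    if (
        prev.conclusion == "contradictory"
        and curr.break_index is not None
        and curr.break_index != prev.break_index
        and curr.break_index >= live_start_index
    ):
        alerts.append("new_break_in_live_period")
    # 3. Still stationary, but the ADF p-value is above the warning level.
    if curr.conclusion == "stationary" and curr.adf_pvalue > warn_pvalue:
        alerts.append("weakening")
    return alerts


last_week = Snapshot("stationary", 0.004)
this_week = Snapshot("stationary", 0.03)
# warn_pvalue=0.01 is an illustrative choice, not a recommendation.
print(stationarity_alerts(last_week, this_week, live_start_index=500, warn_pvalue=0.01))
```

Returning a list rather than a single flag lets one snapshot trigger several alerts, which matters when a regime change shows up in more than one metric at once.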

This connects directly to the monitoring approach I use for backtest PnL autocorrelation. Stationarity of the signal and autocorrelation structure of the PnL are two faces of the same coin: they both tell you whether the statistical properties your strategy depends on are stable.

Common Mistakes and Pitfalls

After building this framework and using it on dozens of signals, I’ve assembled a list of mistakes I’ve either made or seen others make. Some of these are obvious in retrospect. All of them have burned real research time.

Testing Raw Prices for Stationarity

I mentioned this in the ADF section but it’s worth repeating because it’s the single most common mistake. Running ADF on raw price series, getting a p-value near 1.0, and concluding the data is problematic. Prices are supposed to have unit roots. In an efficient market, predictable price movements get arbitraged away, and what remains is approximately a random walk. The non-stationarity of prices is a feature of markets, not a bug in your data. Test your signal, not the prices it’s derived from.

The more insidious version of this mistake is testing a smoothed price series and being surprised it has a unit root. A 50-day moving average of prices inherits the unit root from prices. A ratio of two moving averages might be stationary, depending on the underlying relationship. The transformation matters.

Ignoring the Model Specification

Using regression='ct' (constant and trend) when the series has no trend reduces the power of the ADF test because you’re estimating an unnecessary parameter. Using regression='n' (no constant) when the series has a non-zero mean biases the test toward failing to reject the unit root null, because the omitted mean makes the series look more persistent than it is. The model specification isn’t a throwaway parameter. It encodes your prior about the data-generating process.

My rule: look at a plot of the series before choosing the specification. If it oscillates around a stable level, whether zero or non-zero, use ‘c’. If it has a visible upward or downward trend, use ‘ct’. If you’re genuinely testing a demeaned, detrended residual, ‘n’ might be appropriate, but that’s rare.

Over-Differencing

Differencing an already-stationary series is not harmless. It introduces spurious negative autocorrelation at lag 1. This phantom autocorrelation can make a series look like it has mean-reverting properties when it doesn’t, which can generate false signals for mean-reversion strategies.

I’ve seen this happen when someone builds a pipeline that automatically differences everything “just to be safe.” The safe thing is to test first and difference only when the test says you need to. That’s what the stationarity gate above is designed to enforce.

You can detect over-differencing by checking the autocorrelation of the differenced series. If the lag-1 autocorrelation is significantly negative (around -0.5), and it wasn’t negative before differencing, you’ve probably over-differenced.
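Here is a minimal sketch of that check using simulated white noise, which is already stationary. The helper name `lag1_autocorr` is mine, and the simulated data stands in for a real signal.

```python
import numpy as np


def lag1_autocorr(x: np.ndarray) -> float:
    """Sample lag-1 autocorrelation of a series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))


rng = np.random.default_rng(42)
noise = rng.normal(size=20_000)      # stationary, no autocorrelation
over_diffed = np.diff(noise)         # differencing something that didn't need it

print(f"before differencing: {lag1_autocorr(noise):+.3f}")       # close to 0
print(f"after differencing:  {lag1_autocorr(over_diffed):+.3f}")  # close to -0.5
```

The -0.5 is not a coincidence. The difference of white noise is an MA(1) process with coefficient -1, whose theoretical lag-1 autocorrelation is exactly -0.5.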

Treating Stationarity as Binary

The ADF test gives you a p-value and you compare it to a threshold. This binary framing (stationary vs. unit root) obscures important nuance. A series with an autoregressive coefficient of 0.97 behaves very differently in practice from one with a coefficient of 1.0, but both might produce ADF p-values above 0.05 in a sample of 200 observations. The first series is mean-reverting with a half-life of about 23 periods. The second never reverts.

In my pipeline, I report the point estimate of the autoregressive coefficient alongside the ADF results. This gives me a sense of how far from a unit root the series is, not just whether I can reject the null. A coefficient of 0.97 and a coefficient of 0.999 both fail ADF at 0.05, but they imply very different trading dynamics.
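The half-life figure comes from the AR(1) relationship half_life = ln(0.5) / ln(phi). A minimal sketch of both the estimate and the conversion follows. The function names are illustrative, and the coefficient estimate is a plain OLS fit without the lag augmentation ADF uses.

```python
import math

import numpy as np


def ar1_coefficient(x: np.ndarray) -> float:
    """OLS estimate of phi in x_t = c + phi * x_{t-1} + e_t."""
    x = np.asarray(x, dtype=float)
    lagged = x[:-1] - x[:-1].mean()
    current = x[1:] - x[1:].mean()
    return float(np.dot(lagged, current) / np.dot(lagged, lagged))


def half_life(phi: float) -> float:
    """Periods for a shock to decay by half; infinite if phi is not in (0, 1)."""
    if not 0.0 < phi < 1.0:
        return math.inf
    return math.log(0.5) / math.log(phi)


print(half_life(0.97))    # about 22.8 periods
print(half_life(0.999))   # about 693 periods
print(half_life(1.0))     # inf: a unit root never reverts
```

A half-life of roughly 23 periods is tradeable on daily data. A half-life of nearly 700 is not, even though both coefficients can look the same to ADF in a short sample.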

Ignoring Structural Breaks

I covered this in section 7 but it deserves emphasis here. Running stationarity tests on data that spans a regime change and trusting the result is one of the most consequential mistakes in quant research. The tests are designed for processes with constant parameters. If the mean or trend shifted partway through your sample, the test result is uninterpretable.

The visual check is simple: plot the series. If you see a level shift, test for breaks before trusting the ADF result. If you can’t be bothered to plot, at least run Zivot-Andrews as a routine check.

Small Sample Problems

ADF has notoriously low power with fewer than about 100 observations. If your signal is monthly and you have five years of data, you’re working with 60 observations. At that length, failing to reject the unit root null tells you very little: the test often lacks the power to reject even for a series that’s stationary but persistent.

For KPSS, small samples cause the opposite problem: the test can over-reject, finding non-stationarity when the series is actually stationary but noisy.

There’s no magic fix for small samples. More data helps but isn’t always available. When I’m stuck with a short sample, I lower my confidence in any individual test result and rely more heavily on domain knowledge and visual inspection. I also consider whether the signal can be computed at a higher frequency to increase the observation count.

Confusing Statistical Significance with Practical Significance

A p-value of 0.04 from ADF technically rejects the unit root null at the 0.05 level. But a series that barely passes the stationarity test is a series that might fail next month with slightly different data. The confidence you place in a stationarity conclusion should scale with the strength of the evidence. A p-value of 0.001 is much more reassuring than 0.04.

I use 0.05 as the gate threshold in my automated pipeline, but I flag any signal with an ADF p-value between 0.01 and 0.05 as “marginally stationary” in the research log. These signals get extra scrutiny during walk-forward testing to see if stationarity holds across all sub-periods.
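The flagging rule is simple enough to write down. This is a sketch with hypothetical label strings. The 0.05 and 0.01 defaults mirror the thresholds described above.

```python
def classify_adf_pvalue(
    pvalue: float,
    gate_alpha: float = 0.05,
    strong_alpha: float = 0.01,
) -> str:
    """Label an ADF p-value for the research log."""
    if pvalue >= gate_alpha:
        return "fails_gate"
    if pvalue >= strong_alpha:
        return "marginally_stationary"
    return "stationary"


for p in (0.001, 0.04, 0.20):
    print(p, classify_adf_pvalue(p))
```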

Not Re-Testing After Transformation

When you difference a series or apply fractional differencing, you should re-run the full stationarity battery on the transformed series to confirm it’s actually stationary. I’ve seen cases where first differencing an I(2) series (which needs two differences) produces an I(1) series, and the researcher assumed the first difference was sufficient because “that’s what you do with prices.” The stationarity gate above handles this automatically by testing the differenced series before passing it through.
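A sketch of the re-test loop, generalized to more than one difference. The stationarity check is injected as a callable so the example runs without a stats library; in practice it would wrap the full battery. The function name and the `max_order` default are assumptions.

```python
from typing import Callable, Optional

import numpy as np


def integration_order(
    series: np.ndarray,
    is_stationary: Callable[[np.ndarray], bool],
    max_order: int = 2,
) -> Optional[int]:
    """Smallest number of differences after which the check passes.

    Re-runs the check after every difference instead of assuming one
    difference is enough. Returns None if max_order differences fail.
    """
    current = np.asarray(series, dtype=float)
    for order in range(max_order + 1):
        if is_stationary(current):
            return order
        current = np.diff(current)
    return None


# Toy check for illustration only: "stationary" means constant.
def is_constant(x: np.ndarray) -> bool:
    return bool(np.allclose(x, x[0]))


t = np.arange(50, dtype=float)
print(integration_order(t**2, is_constant))  # a quadratic needs two differences
```

Returning None instead of silently falling back forces the caller to decide what to do with a series that still fails after the maximum number of differences.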

Wrapping Up

Stationarity testing is not glamorous work. It doesn’t generate alpha. It doesn’t produce impressive equity curves. What it does is prevent an entire class of silent failures that corrupt backtests and waste research time.

The framework is straightforward once you internalize the pieces:

ADF tests whether you can reject a unit root. It’s the default first check.
KPSS tests whether you can reject stationarity. It’s the confirmation.
Phillips-Perron provides a non-parametric second opinion on the unit root question.
The confirmation framework combines ADF and KPSS to produce four distinct conclusions, each with a clear next action.
Zivot-Andrews handles the structural break case that fools standard tests.
Fractional differencing offers a middle path when integer differencing destroys too much information.
The stationarity gate makes all of this automatic and enforces it as a pre-condition before any signal enters a backtest.

Every signal in my pipeline gets a StationarityReport before it goes anywhere. The three weeks I spent on that fake momentum signal taught me that catching non-stationarity early costs almost nothing, while missing it costs everything downstream.

If you’re building a quant pipeline and you’re not testing for stationarity, start. If you’re testing but only running ADF, add KPSS. If you’re running both but not checking for structural breaks, add Zivot-Andrews. Each layer catches failures the previous layer misses. The combination is what makes the framework robust.

References

Dickey, D.A. and Fuller, W.A. (1979). “Distribution of the Estimators for Autoregressive Time Series With a Unit Root.” Journal of the American Statistical Association, 74(366), 427-431.
Granger, C.W.J. and Newbold, P. (1974). “Spurious Regressions in Econometrics.” Journal of Econometrics, 2(2), 111-120.
Kwiatkowski, D., Phillips, P.C.B., Schmidt, P. and Shin, Y. (1992). “Testing the Null Hypothesis of Stationarity Against the Alternative of a Unit Root.” Journal of Econometrics, 54(1-3), 159-178.
Lopez de Prado, M. (2018). Advances in Financial Machine Learning. John Wiley & Sons. Chapter 5: Fractional Differencing.
Phillips, P.C.B. and Perron, P. (1988). “Testing for a Unit Root in Time Series Regression.” Biometrika, 75(2), 335-346.
Zivot, E. and Andrews, D.W.K. (1992). “Further Evidence on the Great Crash, the Oil-Price Shock, and the Unit-Root Hypothesis.” Journal of Business & Economic Statistics, 10(3), 251-270.

Susan Potter

Quant
