Property-Based Testing Meets Financial Data: Turning Market …

The idea that changed how I think about data quality came from software, not finance.

In property-based testing, you don’t write test cases for specific inputs and outputs. You define properties that should hold for all valid inputs, generate thousands of random inputs, and check whether any of them break the property. I first encountered this approach through QuickCheck in Haskell, used it extensively with ZIO-test in Scala, and now work primarily with Hypothesis in Python. The shift from “test this specific example” to “define what must always be true” is one of those ideas that, once internalized, changes how you look at every system you build.

Financial markets are full of invariants. Bid-ask spreads are non-negative. OHLC bars have internal consistency constraints. Timestamps are monotonic. Adjusted prices are continuous through corporate actions. These aren’t statistical estimates that might be wrong. They’re structural properties of how markets work. When your data violates them, something is broken in your pipeline, not in the market.

This article covers the core insight and the data validation properties I check before any data enters my analysis pipeline. I cover metamorphic relations for backtests separately.

The Core Insight: Financial Theory as Test Specifications

In software, a property is an invariant. sort(xs) is a permutation of xs. reverse(reverse(xs)) == xs. These hold for all valid inputs, and if you find an input that violates them, you’ve found a bug.
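As a minimal sketch, the two properties above can be written in Hypothesis like this. The test names are made up for illustration, and lists of integers are just one convenient choice of input type.

```python
from collections import Counter

from hypothesis import given
import hypothesis.strategies as st


@given(st.lists(st.integers()))
def test_sort_is_permutation(xs):
    # Sorting may reorder elements but must not add, drop, or change any.
    assert Counter(sorted(xs)) == Counter(xs)


@given(st.lists(st.integers()))
def test_reverse_twice_is_identity(xs):
    # Reversing a list twice must return the original list.
    assert list(reversed(list(reversed(xs)))) == xs
```

Under pytest these run like any other test. Hypothesis generates the lists, and if an assertion fails it reports a shrunk counterexample.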

Financial theory provides the same kind of invariants. The difference is that most quants don’t think of them as test specifications. They think of them as textbook formulas. But the formulas encode relationships that must hold by the structure of markets:

Non-negative bid-ask spread: the ask price is at least as high as the bid. Always. A violation in your data means stale quotes, a crossed market that should have been filtered, or corrupt data.
OHLC consistency: the high is the highest price in the bar, the low is the lowest. The open and close fall between them. This is definitional, not statistical.
Monotonic timestamps: each observation comes after the previous one. Out-of-order data means your pipeline has a sorting bug or your source delivered duplicates.
Adjusted price continuity through splits: a 2:1 split shouldn’t create a 50% gap in the adjusted price series. If it does, the adjustment factor is wrong or missing.
Total return consistency: the total return computed from adjusted close prices should match the reported total return. Discrepancies mean the adjustment methodology is inconsistent.

The shift in thinking is this: instead of writing test_aapl_bid_ask_spread_jan_15(), you write for_all(quotes, assert bid <= ask). Instead of checking one example, you check the invariant across every data point. Hypothesis generates the adversarial inputs. Your job is to define what “correct” means.

The properties are different from the ones I tested in distributed systems, but the methodology is identical: state what must be true, generate inputs, find violations.

Data Validation Properties

These are the properties I check before any data enters my pipeline. They’re the first line of defense, and they’ve caught problems that would have silently corrupted everything downstream.

Bid-ask spread is non-negative

from hypothesis import given
import hypothesis.strategies as st

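# quote_strategy is assumed to be a Hypothesis strategy defined elsewhere;
# it yields quote objects with bid, ask, and timestamp attributes.
@given(quote=quote_strategy)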
def test_bid_ask_spread_non_negative(quote):
    assert quote.ask >= quote.bid, (
        f"Crossed market at {quote.timestamp}: "
        f"bid={quote.bid}, ask={quote.ask}"
    )

Crossed markets do happen briefly in real markets, particularly during fast moves or around market opens. But they shouldn’t persist in cleaned data. If this property fails on your historical dataset, either your data provider isn’t filtering them or your cleaning step has a bug. I’ve seen both.

OHLC bar consistency

@given(bar=ohlcv_bar_strategy)
def test_ohlc_consistency(bar):
    assert bar.low <= bar.high, f"low > high at {bar.timestamp}"
    assert bar.low <= bar.open <= bar.high, f"open outside range at {bar.timestamp}"
    assert bar.low <= bar.close <= bar.high, f"close outside range at {bar.timestamp}"

This catches data corruption and bad adjustment more often than you’d expect. I once spent a day debugging a mean-reversion signal that was producing spurious entries, only to find that a handful of bars in the source data had lows above their highs. The signal was reacting to impossible price moves. A property test on ingest would have caught it immediately.

Monotonic timestamps

@given(data=time_series_strategy)
def test_monotonic_timestamps(data):
    for i in range(1, len(data)):
        assert data[i].timestamp > data[i - 1].timestamp, (
            f"Non-monotonic: {data[i-1].timestamp} >= {data[i].timestamp}"
        )

Out-of-order data is surprisingly common when you’re ingesting from multiple sources or processing files in parallel. This is where my software engineering background made the property obvious. In distributed systems, you always check message ordering. Market data is no different.

Adjusted price continuity through splits

@given(split_event=split_event_strategy)
def test_split_adjustment(split_event):
    ratio = split_event.ratio
    pre_adj = split_event.pre_split_adjusted_close
    post_adj = split_event.post_split_adjusted_close
    assert abs(post_adj * ratio - pre_adj) < 0.01, (
        f"Split adjustment gap: pre={pre_adj}, post={post_adj}, ratio={ratio}"
    )

A 2:1 split means the last pre-split close of $200 and the first post-split close of $100 should agree once you multiply by the ratio, so the adjusted series shows no 50% gap. If they don’t, the adjustment factor is wrong. Unadjusted data that sneaks past your pipeline creates phantom crashes that trigger every mean-reversion and momentum signal in your system.

Total return consistency

@given(bar_pair=consecutive_bar_strategy)
def test_total_return_consistency(bar_pair):
    # The strategy is assumed to yield a (previous, current) pair of bars.
    prev_bar, curr_bar = bar_pair
    computed_return = curr_bar.adjusted_close / prev_bar.adjusted_close - 1
    assert abs(computed_return - curr_bar.total_return) < 1e-6, (
        f"Return mismatch at {curr_bar.timestamp}: "
        f"computed={computed_return:.6f}, reported={curr_bar.total_return:.6f}"
    )

This catches inconsistencies between price adjustments and reported returns. Different providers adjust differently (proportional vs. additive for dividends, for example), and mixing sources without checking consistency is a quiet way to corrupt your return series.
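To make the proportional vs. additive difference concrete, here is a small worked sketch. The prices and the dividend are invented, the function names are hypothetical, and it covers only a single cash dividend, so treat it as arithmetic rather than any provider's actual methodology.

```python
def proportional_adjust(closes, ex_index, dividend):
    """Scale every close before the ex-date by (1 - dividend / prior close)."""
    factor = 1 - dividend / closes[ex_index - 1]
    return [c * factor if i < ex_index else c for i, c in enumerate(closes)]


def additive_adjust(closes, ex_index, dividend):
    """Subtract the dividend from every close before the ex-date."""
    return [c - dividend if i < ex_index else c for i, c in enumerate(closes)]


def simple_returns(prices):
    """Period-over-period simple returns."""
    return [b / a - 1 for a, b in zip(prices, prices[1:])]


# Made-up data: three closes, with a 2.00 dividend going ex on the third day.
closes = [100.0, 102.0, 101.0]
proportional = simple_returns(proportional_adjust(closes, 2, 2.0))
additive = simple_returns(additive_adjust(closes, 2, 2.0))

# Return over the first interval, which lies entirely before the ex-date:
#   proportional: about 0.0200 (the raw 100 -> 102 return is preserved)
#   additive:     about 0.0204 (98 -> 100)
gap = abs(proportional[0] - additive[0])
```

The gap in this example is roughly 4e-4, far larger than the 1e-6 tolerance in the test above. A return series reported under one method will not match adjusted closes produced under the other.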

Properties at a Glance

Property	What It Checks	Violation Indicates
Bid-ask spread >= 0	Quote data integrity	Stale quotes, crossed market, corrupt data
OHLC consistency	Bar data integrity	Data corruption, bad price adjustment
Monotonic timestamps	Time ordering	Pipeline sorting bug, duplicate records
Split-adjusted continuity	Corporate action handling	Wrong or missing adjustment factor
Total return consistency	Return calculation	Mixed adjustment methodologies across sources

What Violations Tell You

The value of property-based testing isn’t just that it catches bugs. It’s that the pattern of violations is diagnostic.

Systematic failures across the entire dataset usually mean a pipeline bug or a data source problem. Every OHLC bar from a particular exchange has impossible values, or every timestamp from a particular feed is duplicated. These are engineering problems with engineering fixes.

Sporadic failures concentrated around specific events usually mean market microstructure. Bid-ask spreads cross briefly during earnings announcements. Adjusted prices gap at corporate actions the data provider didn’t handle correctly. These need domain-specific handling: wider tolerances during known events, or enrichment from a corporate actions database.

Failures that grow over time usually mean model drift. Your data cleaning assumptions were correct when you wrote them but the market has changed. A tolerance that was appropriate for 2020 volatility is too tight for 2022. This is the kind of problem that property monitoring catches before it shows up in your backtest results.

Failure Pattern	Likely Cause	Response
Systematic, entire dataset	Pipeline bug or data source problem	Fix the engineering
Sporadic, event-concentrated	Market microstructure	Widen tolerances for known events
Growing over time	Model drift, changed assumptions	Revisit cleaning parameters
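One way to see these patterns is to group violation rates by different keys. The sketch below is a hypothetical helper run on made-up quote records, with invented field names (feed, date, bid, ask). It is an illustration of the idea, not a description of any particular pipeline.

```python
from collections import defaultdict


def violation_rate_by(records, key, is_violation):
    """Return the fraction of records violating a property, grouped by key."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if is_violation(record):
            bad[group] += 1
    return {group: bad[group] / totals[group] for group in totals}


def is_crossed(quote):
    # The bid-ask property is violated when the bid exceeds the ask.
    return quote["bid"] > quote["ask"]


# Invented records: feed "B" is crossed on every quote, feed "A" never is.
quotes = [
    {"feed": "A", "date": "2024-01-02", "bid": 10.0, "ask": 10.1},
    {"feed": "A", "date": "2024-01-03", "bid": 10.2, "ask": 10.3},
    {"feed": "B", "date": "2024-01-02", "bid": 10.2, "ask": 10.1},
    {"feed": "B", "date": "2024-01-03", "bid": 10.4, "ask": 10.3},
]

by_feed = violation_rate_by(quotes, "feed", is_crossed)
# -> {"A": 0.0, "B": 1.0}
by_date = violation_rate_by(quotes, "date", is_crossed)
# -> {"2024-01-02": 0.5, "2024-01-03": 0.5}
```

Here the rate is flat across dates but concentrated entirely in one feed, which points at a systematic source problem rather than an event-driven one. Grouping the same rates by month would be one way to look for the growing-over-time pattern.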

The important thing is that you have the properties defined and running before you need them. Writing them after you’ve found a problem is debugging. Writing them before is engineering. The difference is whether you catch the next problem automatically or discover it the hard way.

The Testing Stack

For this work I use Python’s Hypothesis library, which is the most mature property-based testing framework in the Python ecosystem. It handles test case generation, shrinking (finding the minimal failing example), and integration with pytest.

The harder part is writing good strategies (Hypothesis’s term for data generators) that produce realistic financial data rather than random noise. A naive strategy that generates random floats for bid and ask prices will spend most of its time testing obviously invalid inputs. A good strategy generates data that looks like real market data but varies enough to exercise edge cases: prices near zero, very wide spreads, timestamps at market boundaries, volume spikes.

The principle is constrained generation. Don’t generate random floats and hope they look like market data. Generate from a model that guarantees structural validity, then test your pipeline’s ability to handle the variation within those constraints. I’ll cover the specifics of building these generators in a separate article.
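As a rough sketch of constrained generation, the strategy below builds OHLC bars that are valid by construction and then uses them to exercise a pipeline step. The Bar class, the price bounds, and the merge_bars function are all assumptions made up for this example, and a realistic generator would model much more (timestamps, volume, tick sizes).

```python
from dataclasses import dataclass

from hypothesis import given
import hypothesis.strategies as st

MAX_PRICE = 10_000.0  # arbitrary upper bound chosen for this sketch


@dataclass(frozen=True)
class Bar:
    open: float
    high: float
    low: float
    close: float


@st.composite
def ohlc_bars(draw):
    # Draw low first, then high at or above it, then open and close inside
    # [low, high], so every generated bar is structurally valid.
    low = draw(st.floats(min_value=0.01, max_value=MAX_PRICE))
    high = draw(st.floats(min_value=low, max_value=MAX_PRICE))
    open_ = draw(st.floats(min_value=low, max_value=high))
    close = draw(st.floats(min_value=low, max_value=high))
    return Bar(open=open_, high=high, low=low, close=close)


def merge_bars(first, second):
    """Hypothetical pipeline step: combine two consecutive bars into one."""
    return Bar(
        open=first.open,
        high=max(first.high, second.high),
        low=min(first.low, second.low),
        close=second.close,
    )


@given(ohlc_bars(), ohlc_bars())
def test_merge_preserves_ohlc_consistency(first, second):
    merged = merge_bars(first, second)
    assert merged.low <= merged.open <= merged.high
    assert merged.low <= merged.close <= merged.high
```

Because the inputs are always valid bars, a failure here would point at merge_bars rather than at the generator. That is the reason to constrain generation in the first place.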

Where This Fits

Data validation is the foundation of the analysis pipeline. If the data is wrong, everything built on top of it is wrong. Property-based testing turns “I hope the data is clean” into “I have checked every invariant I can articulate, on every data point, and the ones that failed are documented.”

This approach came naturally from my background in functional programming and distributed systems, where property-based testing is standard practice for verifying system correctness. Financial data has more domain-specific invariants than most software systems, but the methodology transfers directly. Define what must be true. Generate inputs. Find violations. Fix the source. Repeat.

The next step beyond data validation is applying the same thinking to backtests and models: metamorphic relations that check whether your backtest’s behavior changes correctly when you transform its inputs. That’s a different article, but the foundation is the same.

Susan Potter

Quant
