Metamorphic Relations for Backtests: Testing the Engine, Not the …

Backtests have a testing problem. You can’t write a traditional assertion for one because you don’t know the “correct” Sharpe ratio in advance. There’s no expected output to compare against. This makes backtesting code harder to verify than most software, and it means bugs in the engine itself can hide for a long time, silently distorting every result you produce.

Metamorphic testing solves this by shifting the question. Instead of “is the output correct?”, you ask “does the output change correctly when I transform the input?” You don’t need to know the right answer. You need to know the relationship between answers under different conditions.
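As a minimal illustration outside of any real engine, consider a hypothetical `total_return` helper. Nobody knows the right return for a randomly generated price series, but quoting the same prices in a different unit must leave the return unchanged. The sketch below uses only the standard library, and every name in it is illustrative rather than part of the framework discussed here.

```python
import random


def total_return(prices):
    """Simple return from the first to the last price (hypothetical helper)."""
    return prices[-1] / prices[0] - 1.0


def check_scale_invariance(prices, factor, tol=1e-9):
    """Metamorphic relation: rescaling every price must not change the return."""
    original = total_return(prices)
    transformed = total_return([p * factor for p in prices])
    return abs(original - transformed) <= tol


# No expected return is hard-coded anywhere; only the relation is asserted.
rng = random.Random(42)
for _ in range(100):
    series = [rng.uniform(1.0, 500.0) for _ in range(rng.randint(2, 50))]
    assert check_scale_invariance(series, factor=rng.uniform(0.01, 100.0))
```

The loop never states what any individual return should be. It only states how two runs must relate, which is the whole idea.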

I first used metamorphic testing when building large-scale SaaS products, where the same problem exists: complex systems with no easily verifiable expected output. I gave a talk on this approach at Haskell Love 2020 (Thinking in Properties), focused on properties like idempotence, commutativity, and invariance for software systems. Applying the same thinking to backtesting infrastructure was a natural step once I returned to quantitative work.

This article covers two metamorphic relations that I use to verify the structural correctness of my backtesting framework. These aren’t strategy validations. They’re infrastructure tests. They check that the engine computes what you think it computes, regardless of which strategy you’re running through it. For the companion piece on using property-based testing to validate market data before it enters the pipeline, see Property-Based Testing Meets Financial Data.

Why Metamorphic Relations Matter for Backtests

The typical approach to testing backtest code is example-based: run a known strategy on a small dataset where you can compute the expected result by hand, then compare. This catches obvious bugs but misses the subtle ones. Rounding errors that only appear at scale. Cost calculations that work for single-leg trades but break for spreads. Position sizing logic that’s correct for round lots but wrong when fractional shares are involved.

Metamorphic relations catch these because they test the engine across a wide range of inputs, not just the handful of examples you thought to write. They encode structural truths about how backtest outputs should relate to each other under transformation. If those relationships break, something is wrong with the machinery.

The two I rely on most are fee monotonicity and cash invariance. Both are sanity checks for the backtesting framework itself. I run them whenever I change the engine code, and they’ve caught bugs that would have taken much longer to find through example-based testing or, worse, through puzzling over strategy results that didn’t make sense.

Fee Monotonicity

If you increase transaction costs, performance should get worse or at best stay the same. It should never improve. This is a structural truth: paying more to enter and exit positions cannot create value.

from hypothesis import assume, given, settings
import hypothesis.strategies as st

# `backtest` is the engine under test; `strategy` and `data` are test fixtures.
# deadline=None because a full backtest can exceed Hypothesis's default
# per-example deadline.
@settings(deadline=None)
@given(
    fee_low=st.floats(min_value=0.0001, max_value=0.01),
    fee_high=st.floats(min_value=0.0001, max_value=0.02),
)
def test_fee_monotonicity(fee_low, fee_high, strategy, data):
    """Higher fees should never improve performance."""
    # Discard draws where the two fees are not strictly ordered.
    assume(fee_high > fee_low)
    result_low = backtest(strategy, data, fee_per_trade=fee_low)
    result_high = backtest(strategy, data, fee_per_trade=fee_high)
    # The 1e-9 slack absorbs floating-point noise when both runs are equal.
    assert result_high.total_return <= result_low.total_return + 1e-9, (
        f"Higher fees improved returns: "
        f"low_fee={fee_low:.4f} returned {result_low.total_return:.4f}, "
        f"high_fee={fee_high:.4f} returned {result_high.total_return:.4f}"
    )

What this catches in practice:

Sign errors in cost calculation. Fees being added to returns instead of subtracted. This sounds too obvious to happen, but in a complex pipeline with multiple cost components (commission, spread, slippage, market impact), getting the sign wrong on one component is easy to do and hard to spot by looking at aggregate results.
Cost applied inconsistently. Entry costs applied but exit costs missed, or vice versa. Round-trip costs that only apply to one leg of a spread trade. These bugs partially inflate returns, so the backtest looks plausible but is wrong.
Threshold interactions. If the strategy has a minimum-profit threshold for taking a trade, increasing fees should cause some trades to fall below the threshold and not execute. If the number of trades stays constant as fees increase, the threshold logic isn’t interacting with the cost model correctly.

The property is simple to state but the bugs it catches are not. I’ve found cost calculation issues with this test that would have taken weeks to surface through normal strategy development, because the individual backtest results looked reasonable. It was only the relationship between results at different fee levels that revealed the problem.
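The threshold interaction described above can be sketched with a toy engine. Everything here is hypothetical: `toy_backtest`, the `SIGNALS` data, and the `threshold_sees_fees` flag exist only to inject the bug on demand and are not part of any real framework. The sketch looks only at trade counts across a sweep of fee levels.

```python
# Hypothetical (expected_edge, realized_return) pairs for five candidate trades.
SIGNALS = [
    (0.0120, 0.012),
    (0.0045, 0.003),
    (0.0035, -0.002),
    (0.0065, 0.008),
    (0.0025, 0.001),
]
FEES = [0.0, 0.001, 0.002, 0.004, 0.008]


def toy_backtest(signals, fee_per_trade, min_edge=0.002, threshold_sees_fees=True):
    """Trade only when the expected edge clears the hurdle.

    Returns (total_return, number_of_trades). Setting threshold_sees_fees=False
    injects the bug: the hurdle ignores the cost model.
    """
    equity, trades = 1.0, 0
    for expected_edge, realized in signals:
        hurdle = min_edge + (fee_per_trade if threshold_sees_fees else 0.0)
        if expected_edge > hurdle:
            equity *= 1.0 + realized - fee_per_trade
            trades += 1
    return equity - 1.0, trades


def trade_counts(fees, **engine_options):
    """Number of trades taken at each fee level."""
    return [toy_backtest(SIGNALS, fee, **engine_options)[1] for fee in fees]


def threshold_responds_to_fees(counts):
    """Counts must never rise with fees and must fall somewhere in the sweep."""
    never_rises = all(later <= earlier for earlier, later in zip(counts, counts[1:]))
    return never_rises and counts[-1] < counts[0]


assert threshold_responds_to_fees(trade_counts(FEES))
assert not threshold_responds_to_fees(trade_counts(FEES, threshold_sees_fees=False))
```

In the buggy variant every single run still produces a plausible return and trade count. Only the flat trade count across the sweep gives the problem away.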

Cash Invariance

If you double the initial capital and proportionally scale position sizes, the Sharpe ratio should stay the same. Returns as a percentage of capital are independent of the absolute dollar amount. If the Sharpe changes when you scale capital, something in the position sizing or return calculation is not scaling linearly.

# Reuses the `given` and `st` imports from the fee monotonicity test.
# Assumes the engine sizes positions from `capital`, so scaling capital
# scales position sizes proportionally.
@given(capital_multiplier=st.floats(min_value=0.5, max_value=10.0))
def test_cash_invariance(capital_multiplier, strategy, data, base_capital):
    """Scaling capital should not change the Sharpe ratio."""
    result_base = backtest(strategy, data, capital=base_capital)
    result_scaled = backtest(
        strategy, data, capital=base_capital * capital_multiplier
    )
    # Absolute tolerance of 0.01 on the Sharpe ratio.
    assert abs(result_base.sharpe - result_scaled.sharpe) < 0.01, (
        f"Sharpe changed with capital scaling: "
        f"base={result_base.sharpe:.4f}, "
        f"scaled={result_scaled.sharpe:.4f}, "
        f"multiplier={capital_multiplier:.2f}"
    )

What this catches in practice:

Integer share rounding. If the backtest rounds to whole shares, small accounts may get significantly different position sizes (as a percentage of capital) than large accounts. A $10,000 account trying to buy a $3,000 stock gets 3 shares (90% of capital), while $20,000 gets 6 shares (90% again). But scale to $15,000 and you get 5 shares (100% of capital). The rounding creates inconsistencies that show up as Sharpe ratio drift under capital scaling.
Fixed-cost components that don’t scale. If your cost model includes a fixed per-trade commission (say $1 per trade), that commission is a larger percentage of a $1,000 trade than a $10,000 trade. This is real and expected, but if your backtest assumes percentage-based costs throughout and then applies a fixed component somewhere, the invariance test will flag the inconsistency.
Position sizing bugs. Any logic that computes position size from capital and doesn’t scale linearly. Hardcoded dollar amounts, minimum position thresholds, or leverage calculations that assume a specific capital base.
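The whole-share arithmetic from the first item can be reproduced in a few lines. `deployed_fraction` is a hypothetical helper written for this illustration, assuming the account buys as many whole shares as it can afford.

```python
def deployed_fraction(capital, price):
    """Whole shares affordable, and the fraction of capital they consume."""
    shares = int(capital // price)
    return shares, shares * price / capital


for capital in (10_000, 15_000, 20_000):
    shares, fraction = deployed_fraction(capital, price=3_000)
    print(f"${capital:,}: {shares} shares, {fraction:.0%} of capital")

# $10,000: 3 shares, 90% of capital
# $15,000: 5 shares, 100% of capital
# $20,000: 6 shares, 90% of capital
```

Exposure as a fraction of capital is not a smooth function of account size, so per-period returns, and with them the Sharpe ratio, differ between accounts even though the strategy is identical.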

I use a tolerance of 0.01 on the Sharpe comparison because small floating-point differences are expected, especially with rounding. If the difference exceeds that, it’s a real bug.
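To see the tolerance at work, here is a toy sketch with deliberately simplified assumptions: capital is fully invested each period, nothing compounds, and the Sharpe ratio is a plain per-period mean over sample standard deviation. `toy_returns`, `sharpe`, and the `GROSS` series are hypothetical names and data for this illustration only.

```python
import statistics

# Hypothetical per-period gross returns.
GROSS = [0.004, -0.002, 0.003, 0.001, -0.001, 0.005]


def sharpe(returns):
    """Per-period Sharpe ratio (mean over sample std dev), not annualized."""
    return statistics.mean(returns) / statistics.stdev(returns)


def toy_returns(capital, gross_returns, pct_cost=0.001, fixed_cost=0.0):
    """Per-period returns net of a percentage cost plus an optional flat fee."""
    returns = []
    for gross in gross_returns:
        pnl = capital * gross - capital * pct_cost - fixed_cost
        returns.append(pnl / capital)
    return returns


def sharpe_drift(fixed_cost):
    """Absolute Sharpe difference between a $1,000 and a $10,000 account."""
    small = sharpe(toy_returns(1_000, GROSS, fixed_cost=fixed_cost))
    large = sharpe(toy_returns(10_000, GROSS, fixed_cost=fixed_cost))
    return abs(small - large)


# Purely percentage-based costs: capital cancels out, only float noise remains.
assert sharpe_drift(fixed_cost=0.0) < 0.01
# A flat $1 fee per period does not scale with capital and breaks the relation.
assert sharpe_drift(fixed_cost=1.0) > 0.01
```

With percentage costs the dollar amount divides out of every period's return, so the difference is far below the tolerance. The flat fee shifts the mean return by a different amount for each account size, which moves the Sharpe ratio well past 0.01 in this toy.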

Relations at a Glance

Relation	Input Transformation	Expected Output Change	Bugs It Catches
Fee monotonicity	Increase transaction costs	Performance decreases or stays the same	Sign errors in cost calc, inconsistent cost application, threshold interaction bugs
Cash invariance	Scale capital, scale position sizes proportionally	Sharpe ratio stays the same	Integer share rounding, fixed-cost components that don’t scale, position sizing bugs

Running These in Practice

I run both properties whenever I modify the backtesting engine. They’re part of the test suite, not something I run manually. The Hypothesis framework handles generating the randomized inputs (fee levels, capital multipliers) and shrinking failures to minimal examples.

The key insight from my software engineering background is that these are infrastructure tests, not strategy tests. They live alongside the engine code, not alongside any particular strategy. When I add a new cost component (say, borrowing costs for short positions), I run fee monotonicity to verify it’s wired in correctly. When I change the position sizing logic, I run cash invariance. The properties tell me whether the plumbing works before I start running strategies through it.
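One way to sketch the "new cost component" check is to raise each component on its own and confirm that none of them improves returns. The toy engine, the `BASE_COSTS` values, and the `borrow_sign` switch below are all hypothetical, with the switch existing only to inject a sign error on the borrow term.

```python
from functools import partial

# Hypothetical per-trade cost levels, expressed as fractions of notional.
BASE_COSTS = {"commission": 0.0005, "slippage": 0.0005, "borrow": 0.0002}

# Hypothetical (gross_return, is_short) trades.
TRADES = [(0.012, False), (-0.008, True), (0.020, False), (0.006, True)]


def toy_backtest(trades, costs, borrow_sign=-1.0):
    """Compound trades net of costs; borrow applies to short trades only.

    borrow_sign=+1.0 injects a sign error on the borrow component.
    """
    equity = 1.0
    for gross, is_short in trades:
        net = gross - costs["commission"] - costs["slippage"]
        if is_short:
            net += borrow_sign * costs["borrow"]
        equity *= 1.0 + net
    return equity - 1.0


def components_that_improve_returns(backtest_fn, trades, bump=0.001):
    """Raise one cost component at a time; report any that improved returns."""
    baseline = backtest_fn(trades, BASE_COSTS)
    offenders = []
    for name in BASE_COSTS:
        bumped = {**BASE_COSTS, name: BASE_COSTS[name] + bump}
        if backtest_fn(trades, bumped) > baseline + 1e-9:
            offenders.append(name)
    return offenders


assert components_that_improve_returns(toy_backtest, TRADES) == []
buggy_backtest = partial(toy_backtest, borrow_sign=1.0)
assert components_that_improve_returns(buggy_backtest, TRADES) == ["borrow"]
```

Bumping one component at a time matters in this toy. Scaling all three together would still lower returns, because commission and slippage outweigh the mis-signed borrow term, and the bug would stay hidden.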

This separation matters. Strategy validation (does this signal have predictive power?) is a different problem from engine validation (does this backtest compute what I think it computes?). Mixing them leads to debugging strategy results when the problem is actually in the infrastructure. I spent enough years in production software engineering to know that you test your tools before you trust their output.

What Metamorphic Relations Don’t Tell You

These properties verify the mechanics. They don’t verify the strategy. A backtest engine that passes both tests can still produce meaningless results if the strategy is overfit, the data has survivorship bias, or the walk-forward validation is misconfigured.

Think of it this way: metamorphic relations are to backtesting what unit tests are to a compiler. If the compiler’s tests pass, you know it translates code correctly. You don’t know whether the program it compiled is any good. Fee monotonicity and cash invariance tell you the engine is sound. Everything else in the validation funnel tells you whether the strategy running through it is worth deploying.

Susan Potter

Quant
