The Testing Pyramid, Sideways: Software Testing Practices …

This is a companion to From Hypothesis to Production: A Quant’s Productivity Toolkit, where I describe the five stages I use to take a trading strategy from hunch to deployment. If you haven’t read that, the short version is: exploration, hypothesis testing, statistical validation, simulation, and robustness testing, each acting as a progressively more expensive filter.

If any of that sounds familiar to software engineers, it should. I spent fourteen years building production software at places like Salesforce, Northern Trust, and Citadel before returning to quantitative work, and the quant workflow maps almost directly onto the testing pyramid. The stages differ in cost, speed, confidence, and scope in exactly the same way that different types of software tests do. Recognizing these parallels has changed how I think about both disciplines.

Stage 1 as customer interviews

Exploration in Nushell, QuestDB, and DuckDB is the equivalent of customer discovery interviews. You’re not building anything yet. You’re figuring out whether the problem is real before investing in a solution.

In product development, customer interviews exist to invalidate assumptions about what users need. You go in with a hypothesis (“users want feature X”), and a good interview either confirms there’s a real pain point or reveals that you were solving a problem nobody has. The whole point is to learn cheaply, before anyone writes code.

Stage 1 works the same way, except the “customer” is the data. You’re asking it whether your hypothesis has any basis in reality. Is there actually a pattern here, or did you imagine it? Is the dataset even capable of answering the question you’re asking? Just as a product team that skips customer interviews wastes months building something nobody wants, a quant who skips exploration wastes days backtesting a signal that was never there.

Stage 2 as unit tests

At Salesforce, I once watched a team spend a week debugging a distributed service failure that a single unit test would have caught: a function that returned the wrong value for a boundary condition that nobody had checked. The fix was one line. The cost of not having the test was a week of engineering time plus a production incident.

Stage 2 exists to prevent the quant equivalent. You’ve isolated a specific behavior (the signal, the feature, the relationship), you’re testing it in a controlled environment (a cleaned dataset, a specific time window), and you’re checking that it does what you expect. A Python script that loads data, computes features, and produces plots is a unit test for your hypothesis. It’s reproducible, it focuses on one thing at a time, and it asserts something specific.

Good unit tests assert specific, falsifiable things. Good Stage 2 analysis asks specific, falsifiable questions. “Is there mean reversion after large moves?” is a testable claim. “Does this data look interesting?” is not. In both cases, if you can’t state what you’re checking, you’re not testing anything.
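To make the "unit test for a hypothesis" idea concrete, here is a minimal sketch of the mean-reversion question written as a single falsifiable assertion. Everything in it is illustrative: `reversion_score`, `synthetic_reverting_returns`, the 2% threshold, and the synthetic data are assumptions made for the example, not the analysis from the companion post.

```python
import random
from statistics import mean


def reversion_score(returns, threshold):
    """Average next-period return after a large move, signed by the move.

    A negative score means that, on average, the period after a large move
    goes the other way (mean reversion). Raises ValueError if the sample
    contains no large moves, since then nothing was actually tested.
    """
    signed_followups = [
        (1.0 if r > 0 else -1.0) * nxt
        for r, nxt in zip(returns, returns[1:])
        if abs(r) > threshold
    ]
    if not signed_followups:
        raise ValueError("no moves larger than threshold; nothing to test")
    return mean(signed_followups)


def synthetic_reverting_returns(n, seed=0):
    """Hypothetical data: each return partly undoes the previous one."""
    rng = random.Random(seed)
    returns = [0.0]
    for _ in range(n - 1):
        returns.append(-0.5 * returns[-1] + rng.gauss(0.0, 0.01))
    return returns


# The "unit test": one specific, falsifiable claim about one behavior.
returns = synthetic_reverting_returns(5000)
assert reversion_score(returns, threshold=0.02) < 0, "no mean reversion found"
```

The point of the shape, rather than the numbers, is that the script can fail. A plot that "looks interesting" cannot.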

Stage 3 as integration tests

At Northern Trust, we had a portfolio analytics calculation that passed every unit test but produced wrong results for a specific combination of multi-currency holdings and performance attribution. Each component worked in isolation. The failure only appeared when they were connected, and it hid for months because the aggregate numbers looked plausible.

Statistical validation has the same failure mode. You’re no longer testing the signal in isolation. You’re testing whether it holds up when connected to the broader context: out-of-sample data, multiple time periods, different market conditions. A signal that looked strong in one time window can disappear when you test it across the full sample. Assumptions that held in a controlled environment often don’t survive contact with the rest of the system.
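One simple way to check "does it hold across time periods" is to split the sample into contiguous windows and require the effect in each one. The sketch below assumes the signal is summarized as a list of per-period returns; the function names and the numbers are made up for illustration.

```python
from statistics import mean


def split_into_windows(values, n_windows):
    """Split a series into equal contiguous windows (any remainder is dropped)."""
    size = len(values) // n_windows
    if size == 0:
        raise ValueError("not enough observations for that many windows")
    return [values[i * size:(i + 1) * size] for i in range(n_windows)]


def window_means(signal_returns, n_windows):
    """Mean return of the signal within each window."""
    return [mean(w) for w in split_into_windows(signal_returns, n_windows)]


def holds_in_every_window(signal_returns, n_windows):
    """True only if the mean return is positive in every window."""
    return all(m > 0 for m in window_means(signal_returns, n_windows))


# Made-up returns: strong first half, slightly negative second half.
sample = [0.03, 0.02, 0.04, 0.03, -0.01, -0.01, 0.0, -0.01]
print(mean(sample) > 0)                  # True: the aggregate looks plausible
print(holds_in_every_window(sample, 2))  # False: the second half does not hold
```

This mirrors the portfolio analytics story: the aggregate number passes while one of the connected pieces is wrong.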

Integration tests are also where you start caring about false positives. A unit test that passes incorrectly is usually obvious. An integration test that passes incorrectly can hide for a long time, like that portfolio analytics bug. In Stage 3, a p-value that looks significant but isn’t (because of multiple comparisons or data snooping) is exactly that kind of hidden false positive. The numbers look plausible, so nobody questions them until real money is on the line.
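The multiple-comparisons problem can be sketched with a Bonferroni correction, which is the bluntest of the standard adjustments: with m tests, each p-value has to clear alpha / m instead of alpha. The p-values and variant names below are hypothetical, chosen only to show how a result that looks significant on its own stops being significant once you account for how many things were tried.

```python
def bonferroni_survivors(p_values, alpha=0.05):
    """Keep only the tests that stay significant after a Bonferroni correction.

    p_values maps a test name to its p-value. With m tests, each p-value
    must be at most alpha / m rather than alpha.
    """
    if not p_values:
        return {}
    threshold = alpha / len(p_values)
    return {name: p for name, p in p_values.items() if p <= threshold}


# Hypothetical p-values from trying four lookback windows on one dataset.
trials = {
    "lookback_5": 0.04,
    "lookback_10": 0.03,
    "lookback_20": 0.001,
    "lookback_60": 0.20,
}

naive = {name for name, p in trials.items() if p <= 0.05}
corrected = set(bonferroni_survivors(trials))

print(sorted(naive))      # three variants look significant in isolation
print(sorted(corrected))  # only one survives the corrected threshold of 0.0125
```

Bonferroni is conservative, and less strict corrections exist. It is used here only because it fits in a few lines.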

Stage 4 as generative and fuzz testing

Every software engineer has a story about the input nobody anticipated. The unicode string that broke the parser. The negative number in a field that should have been positive. The request that arrived in an order the state machine didn’t handle. Fuzz testing exists because humans are bad at imagining the full space of things that can go wrong.

Moving from historical to simulated data is the quant equivalent. Instead of testing against a fixed set of inputs (the historical record), you generate a large number of synthetic inputs and check whether the system handles all of them correctly. A Monte Carlo path might produce a sequence of returns that never occurred in history but is entirely plausible. If the strategy breaks on that path, you’ve found a vulnerability that historical backtesting alone would never have revealed.
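As a rough sketch of generating paths that never occurred, the example below resamples a return history with replacement and measures the maximum drawdown on each synthetic path. This is a plain i.i.d. bootstrap, which is an assumption of the example and not necessarily the simulation method used in the companion post: it throws away any autocorrelation or volatility clustering in the original series. The return values are made up.

```python
import random


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def bootstrap_paths(historical_returns, n_paths, seed=0):
    """Synthetic paths built by resampling history with replacement."""
    rng = random.Random(seed)
    n = len(historical_returns)
    return [rng.choices(historical_returns, k=n) for _ in range(n_paths)]


# Made-up per-period strategy returns standing in for the historical record.
history = [0.01, -0.02, 0.015, -0.03, 0.02, 0.005, -0.01, 0.01]

paths = bootstrap_paths(history, n_paths=1000, seed=42)
worst_simulated = max(max_drawdown(p) for p in paths)

print(f"historical max drawdown:  {max_drawdown(history):.3f}")
print(f"worst simulated drawdown: {worst_simulated:.3f}")
```

The comparison of interest is between the single historical drawdown and the distribution across paths, since the historical record is only one draw from that distribution.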

The philosophy is identical: you don’t trust that your handpicked test cases (or your single historical sample) cover the space of things that could happen. You generate inputs programmatically to explore that space more thoroughly. The cost is higher than unit or integration testing, but the coverage is qualitatively different. At Citadel, the market data infrastructure had to handle data that arrived out of order, with gaps, with corrections to previously published values. You only discover those edge cases by throwing volume at the system. The same is true for strategies.

Stage 5 as property-based tests and chaos engineering

Robustness testing maps to two software testing practices at once.

Parameter sensitivity testing is property-based testing. You’re defining invariants (the strategy should work across a neighborhood of parameter values, performance should be monotonic in signal strength) and then checking whether those invariants hold across a wide range of inputs. This is the approach Claessen and Hughes introduced in “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs”: define universal properties, generate random inputs, and see if any of them break. Property-based testing in software catches the same kind of failure: code that works for the specific examples someone wrote but violates a general property that should always hold.
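The sketch below is a hand-rolled stand-in for what a property-based testing library does: state an invariant, generate random inputs, and report the first input that breaks it. The invariant here is "no parameter value near the chosen one scores much worse than the chosen one." The two score functions are hypothetical curves over a lookback parameter, not real backtest results, and `find_counterexample` is a name invented for this example.

```python
import random


def find_counterexample(score, center, radius, max_drop, trials=200, seed=0):
    """Search the neighborhood of `center` for a parameter that breaks the invariant.

    Invariant: score(p) >= score(center) - max_drop for every integer p
    within `radius` of `center`. Returns the first violating parameter found
    by random sampling, or None if no sampled parameter violates it.
    """
    rng = random.Random(seed)
    baseline = score(center)
    for _ in range(trials):
        candidate = rng.randint(center - radius, center + radius)
        if score(candidate) < baseline - max_drop:
            return candidate
    return None


# Two hypothetical score curves over a lookback parameter.
def smooth_score(lookback):
    """Performance degrades gently as the parameter moves away from 20."""
    return 1.0 - 0.01 * abs(lookback - 20)


def spiky_score(lookback):
    """Performance exists only at exactly 20: a sign of overfitting."""
    return 1.0 if lookback == 20 else 0.1


print(find_counterexample(smooth_score, center=20, radius=5, max_drop=0.2))
print(find_counterexample(spiky_score, center=20, radius=5, max_drop=0.2))
```

The first call finds nothing, and the second returns a neighboring lookback that breaks the invariant. A real property-based testing library would also shrink the counterexample to a minimal one, which this sketch does not attempt.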

Regime stress testing is chaos engineering. Netflix’s Chaos Monkey kills random services in production to verify the system recovers. Regime stress testing kills the favorable market conditions a strategy depends on to verify it doesn’t blow up. In both cases, you’re deliberately introducing adversarial conditions to test the system’s resilience rather than its happy-path performance.
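A very reduced version of a regime stress test is to look at the strategy's compounded return inside each regime separately, as if the favorable regime never arrived, and flag any regime where the loss exceeds a limit. The regime labels, returns, and the 15% loss limit below are all invented for the example, and how regimes get labeled in the first place is outside the scope of this sketch.

```python
from collections import defaultdict


def compounded_return_by_regime(returns, regimes):
    """Compounded strategy return within each regime label."""
    growth = defaultdict(lambda: 1.0)
    for r, regime in zip(returns, regimes):
        growth[regime] *= 1.0 + r
    return {regime: g - 1.0 for regime, g in growth.items()}


def regimes_that_break(returns, regimes, max_loss):
    """Regimes in which the strategy loses more than `max_loss` (a fraction)."""
    by_regime = compounded_return_by_regime(returns, regimes)
    return sorted(name for name, total in by_regime.items() if total < -max_loss)


# Made-up data: the strategy earns in trending periods and bleeds in choppy ones.
strategy_returns = [0.02, 0.03, -0.10, -0.10]
regime_labels = ["trend", "trend", "chop", "chop"]

print(compounded_return_by_regime(strategy_returns, regime_labels))
print(regimes_that_break(strategy_returns, regime_labels, max_loss=0.15))
```

Here the choppy regime compounds to a loss of about 19%, which breaches the limit even though the trending periods look healthy.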

The combination of property-based tests and chaos engineering is the highest-confidence testing you can do, in software or in quant. It’s also the most expensive. That’s why it comes last.

Smoke tests: what happens after deployment

Smoke tests don’t map to a stage in the development funnel. They map to what happens after a strategy is already deployed. Is the strategy still profitable? Has the edge decayed? Are the assumptions it was built on still holding? This is the ongoing health check you run on a live system, the equivalent of a service health endpoint that tells you whether something that used to work still does.

In software, you don’t stop testing after you ship. You monitor, you alert, you run periodic checks. Deployed strategies need the same discipline. A signal that passed every stage of the funnel six months ago might be dead today. Smoke tests are how you find out before the Profit and Loss (P&L) tells you.
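In the spirit of a service health endpoint, a strategy smoke test can be as small as a function that looks at a recent window of live returns and reports a status. The sketch below is one possible shape under simple assumptions: the window length, the zero floor, the status strings, and the `strategy_health` name are all choices made for the example, and a real monitor would likely track more than the mean.

```python
from statistics import mean


def strategy_health(daily_returns, window=20, min_mean=0.0):
    """Health-check style summary of a live strategy's recent returns.

    Reports "unknown" when there is not yet a full window of data,
    "ok" when the recent mean return is above `min_mean`,
    and "degraded" otherwise.
    """
    recent = daily_returns[-window:]
    if len(recent) < window:
        return {"status": "unknown", "recent_mean": None}
    recent_mean = mean(recent)
    status = "ok" if recent_mean > min_mean else "degraded"
    return {"status": status, "recent_mean": recent_mean}


# Made-up live record: the edge works for 30 days, then reverses for 30 days.
live_returns = [0.001] * 30 + [-0.001] * 30

print(strategy_health(live_returns[:30])["status"])  # ok
print(strategy_health(live_returns)["status"])       # degraded
```

Run on a schedule, a check like this raises the question of edge decay after a few weeks of data rather than after a bad quarter.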

The Mapping

| Quant Stage | Testing Analogue | What It Catches | Cost |
| --- | --- | --- | --- |
| Exploration | Customer interviews | Bad premises, non-existent problems | Low |
| Hypothesis testing (pandas) | Unit tests | Signals that don’t exist or aren’t reproducible | Low |
| Statistical validation | Integration tests | Signals that don’t hold across contexts, hidden false positives | Medium |
| Simulated data | Fuzz / generative testing | Strategies overfit to one historical path | Medium-High |
| Robustness testing | Property-based tests + chaos engineering | Fragile parameters, regime-dependent failure | High |
| (Post-deployment) | Smoke tests | Edge decay, assumption drift | Ongoing |

The shared lesson

The testing pyramid exists because different types of tests have different cost-to-confidence ratios. Customer interviews are cheap and strong at invalidation (they’re great at killing bad ideas) but weak at positive confirmation (a customer saying “yes I’d use that” doesn’t mean much). Property-based tests and chaos engineering are expensive but high-confidence in both directions. Running them in the wrong order wastes time and money. You don’t fuzz test a service that nobody needs.

The quant funnel works the same way. You don’t run Monte Carlo simulations on a hypothesis you haven’t plotted yet. You don’t sweep parameters on a signal that failed a basic statistical test. Each stage earns the right to proceed to the next, and the cost of each stage is justified only because the cheaper stages already filtered out the obvious failures.
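The "each stage earns the right to proceed" rule is a short-circuit: run the checks in order of cost and stop at the first failure, so the expensive ones never execute for an idea that a cheap one already killed. A minimal sketch, with hypothetical stage names and stand-in checks:

```python
def run_funnel(idea, stages):
    """Run (name, check) pairs in order, cheapest first, stopping at the first failure.

    Each check takes the idea and returns True (survives) or False (killed).
    Checks after the first failure are never called.
    """
    passed = []
    for name, check in stages:
        if not check(idea):
            return {"survived": False, "killed_at": name, "passed": passed}
        passed.append(name)
    return {"survived": True, "killed_at": None, "passed": passed}


executed = []  # records which stand-in checks actually ran


def make_check(name, field):
    """Build a stand-in check that logs its execution and reads a flag off the idea."""
    def check(idea):
        executed.append(name)
        return idea[field]
    return check


stages = [
    ("exploration", make_check("exploration", "pattern_visible")),
    ("hypothesis testing", make_check("hypothesis testing", "effect_reproducible")),
    ("simulation", make_check("simulation", "survives_simulation")),
]

# A hypothetical idea that looks fine in exploration but fails the next stage.
idea = {"pattern_visible": True, "effect_reproducible": False, "survives_simulation": True}
result = run_funnel(idea, stages)

print(result["killed_at"])  # hypothesis testing
print(executed)             # the simulation check never ran
```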

Whether you’re validating software or validating trading strategies, the discipline is the same: test cheap and fast first, invest in expensive and thorough only when the cheap tests pass, and never skip a level of the pyramid just because you’re excited about the idea.

What software got wrong

The quant funnel is ruthless about killing ideas. Most hypotheses die in Stage 1. The ones that survive face progressively harder tests, and most of those die too. This is considered normal. It’s the whole point of the process.

Software product development is not like this. In most organizations I’ve worked in, the default answer to a feature idea is “yes, let’s figure out when we can build it.” The conversation jumps straight from hypothesis to implementation timeline. How long will it take? Can we fit it in the next sprint? What’s the estimated delivery date?

Nobody stops to ask whether the feature should exist at all. Nobody runs the equivalent of Stage 1: a cheap, fast check to see if the premise holds up. Does this feature solve a real problem? Is there evidence that customers actually need it, or did someone in a meeting assert that they do? Would a quick prototype or a few conversations with users kill the idea before engineering spends weeks on it?

In quant, if you skip exploration and jump straight to backtesting, you waste days building infrastructure around a signal that was never there. In software, if you skip validation and jump straight to building, you waste weeks or months shipping features that don’t move any metric anyone cares about. The cost is just less visible because there’s no P&L statement that tells you a feature was worthless. It ships on time, everyone marks the ticket as done, and nobody measures whether it mattered.

The testing pyramid is supposed to prevent this at the code level. You don’t write expensive end-to-end tests for logic that fails a unit test. But the same principle should apply at the product level. You don’t build expensive features for hypotheses that fail a customer interview. The pyramid should extend upward, past code, past architecture, all the way to the question of whether the thing you’re building deserves to exist.

The quant funnel enforces this because the feedback is unambiguous. A strategy makes money or it doesn’t. Software doesn’t have that forcing function, so the discipline has to come from the team. It rarely does. The result is product roadmaps full of features that passed a timeline test but never passed a value test.

Some will argue that this matters less now because generative AI has made writing code cheaper. Even if you grant that premise (and it’s a big “if,” given that most frontier models are still heavily subsidized by investor capital rather than priced at their true cost of operation), cheaper code doesn’t fix the problem. The waste was never primarily in the typing. It was in building the wrong thing. Making it faster to ship features that nobody needed doesn’t reduce waste. It accelerates it. If anything, the ability to generate code quickly makes the invalidation discipline more important, not less. The cheaper it is to build, the more tempting it is to skip the question of whether you should.

In fourteen years of building software products, I watched this pattern repeat everywhere I worked. Coming back to quant, where killing ideas early is the norm rather than the exception, has been a reminder of how much waste that pattern produces.

Susan Potter
