From Hypothesis to Production: A Quant's Productivity Toolkit

Most “productivity” advice is about doing things faster. In quantitative work, productivity is about knowing when to stop. The quant workflow is a funnel: many hypotheses enter, few survive. Each stage of the funnel demands different tools and different thinking, and the most important skill is killing bad ideas quickly and cheaply.

I’ve spent the last year rebuilding my quant workflow after fourteen years in software product engineering, drawing on earlier experience in quantitative risk modeling at BNP Paribas and Bank of America and market data infrastructure at Citadel. This post walks through the stages I use to take a trading strategy hypothesis from a hunch to something I’d actually deploy, and the tools that make each stage efficient.

Stage 1: Exploration (Nushell + QuestDB + DuckDB)

Most hypotheses don’t start with a model. They start with a question about data. Something looks off in a chart, or a pattern shows up in the weekly returns that you didn’t expect. Before writing any real code, you need to poke at the data and figure out whether the question is even worth pursuing.

The single most important thing you can do at this stage is try to kill the idea. Not develop it, not refine it, not build a backtest around it. Kill it. The faster you invalidate a bad premise, the more time you have for the ones that might actually hold up. Most hypotheses are wrong. That’s not a problem as long as you find out cheaply.

This is where I spend a lot of time in Nushell, QuestDB, and DuckDB. The three tools cover different access patterns, but they share a common trait: they give you answers in seconds, not minutes. That speed matters because invalidation is a volume game. You want to check ten ideas in an afternoon, not spend an afternoon on one.

Nushell as a data workbench

Nushell treats everything as structured data. When I pipe output from one command to another, I’m working with tables and records, not raw text that needs to be parsed with awk and sed. This matters when you’re doing exploratory work on market data because the feedback loop is immediate. Filter a table, sort by a column, group by a field, compute a summary. It’s all built into the shell.

A typical exploration session might look like pulling a CSV of daily closes into Nushell, filtering to a date range, adding a column for daily returns, and then grouping by month to see if there’s any seasonality worth investigating. If the answer is “no, there’s nothing here,” good. You just saved yourself days of work. If the answer is “maybe,” you move to the next stage with your eyes open.

QuestDB for time-series questions

When the data already lives in QuestDB, exploration means SQL. QuestDB is built for time-series, so the queries that matter in quant work are fast and natural. Slicing a year of tick data by trading session, computing rolling 20-day volatility, checking for gaps in the feed, or comparing intraday volume profiles across different market regimes. These are all just queries.

What makes QuestDB productive at this stage is that it doesn’t punish you for asking loose questions. You’re not optimizing a query plan or tuning indexes. You’re trying to poke holes in your own reasoning. Does the spread compression you noticed actually persist across different time periods, or did you just happen to look at a quiet month? Is the volume pattern real, or is it an artifact of a holiday calendar you forgot about? The speed of the engine means you can run through these sanity checks in minutes instead of taking the pattern at face value and building on a false premise.

DuckDB for file-based analysis

Not all data lives in a database. Research datasets, historical exports, and third-party data often show up as Parquet or CSV files sitting in a directory. DuckDB lets me run analytical SQL directly against those files without any setup. No server, no import step, no schema definition.

This is especially useful when I’m working with a new dataset for the first time. I can point DuckDB at a Parquet file, run a few aggregations to understand the shape of the data, check for nulls and outliers, and decide whether it’s clean enough to be worth loading into QuestDB for ongoing work. More often than not, this is also where I discover that the data itself doesn’t support the premise. Maybe the sample size is too small, the coverage is too sparse, or the distribution is nothing like what I assumed. Finding that out now, before writing a single line of Python, is the whole point.

How the three fit together

Nushell is the glue. I’ll query QuestDB or DuckDB from the shell, pipe the results through Nushell for reshaping or joining with other data, and make a call about whether to keep going. The point of this stage isn’t rigor. It’s triage. You’re looking for reasons to stop, not reasons to continue. Every bad idea killed in the terminal is a backtest you didn’t waste a day building, a validation pipeline you didn’t spin up for nothing, and a false lead you didn’t chase into production.

The discipline is in being honest with yourself early. It’s tempting to see a hint of a pattern and jump straight to modeling. But the cheapest place to find out you’re wrong is right here, in a shell session that cost you five minutes.

Stage 2: Testing the Hypothesis (pandas + matplotlib)

Once an idea survives exploration, it moves from terminal poking to a reproducible Python script. The shift matters. Nushell is great for quick questions, but once you’re computing derived features, joining multiple datasets, and looking at the data from several angles, you need something more structured. This is where pandas and matplotlib come in.

pandas for building the dataset

The first thing I do is reconstruct the exploration work in pandas. Not because I don’t trust what I saw in Nushell, but because I need a reproducible pipeline that I can modify and rerun as my understanding changes. The script becomes the record of what I actually tested, not a shell history I’ll lose tomorrow.

A typical script at this stage loads price data, computes returns, adds whatever derived columns the hypothesis depends on, and filters to the relevant universe. If the hypothesis is about mean reversion after large moves, that means computing z-scores of daily returns, flagging threshold crossings, and building a table of what happened in the days and weeks after each event. pandas makes this kind of slicing and reshaping straightforward, and the resulting DataFrame is easy to inspect at every step.
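As a rough sketch of what such a script might look like for the mean-reversion example, the snippet below computes rolling z-scores of daily returns, flags threshold crossings, and tabulates forward returns after each event. The function name `build_event_table`, the 60-day lookback, the 2.0 threshold, and the synthetic price series are all illustrative assumptions, not values from a real study.

```python
import numpy as np
import pandas as pd


def build_event_table(close: pd.Series, lookback: int = 60,
                      z_threshold: float = 2.0,
                      horizons: tuple = (1, 5, 20)) -> pd.DataFrame:
    """Flag large daily moves and record forward returns after each one."""
    ret = close.pct_change()
    # Shift by one day so today's return is compared against a mean and
    # standard deviation that do not include it (no look-ahead).
    mu = ret.rolling(lookback).mean().shift(1)
    sigma = ret.rolling(lookback).std().shift(1)
    z = (ret - mu) / sigma

    table = pd.DataFrame({"ret": ret, "z": z})
    for h in horizons:
        # Forward return from today's close to the close h days later.
        table[f"fwd_{h}d"] = close.shift(-h) / close - 1.0

    events = table[table["z"].abs() >= z_threshold]
    # Drop events too close to the end of the sample to have forward returns.
    return events.dropna()


# Synthetic closes stand in for real data (an assumption for this sketch).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=750)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, len(dates)))),
                  index=dates)

events = build_event_table(close)
# Average forward return after down moves (-1) and up moves (+1).
print(events.groupby(np.sign(events["z"]))[["fwd_1d", "fwd_5d", "fwd_20d"]].mean())
```

Writing the thresholds as explicit arguments makes it harder to tweak them quietly: every change shows up in the script's history.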

The important discipline here is keeping the script honest. It’s tempting to keep tweaking filters and parameters until the data tells you what you want to hear. I try to write the script once based on what I actually believe, run it, and look at the results before changing anything. If the first pass doesn’t show a clear signal, that’s information. Tuning until something appears is just overfitting with extra steps.

matplotlib for visual inspection

Before running any statistical tests, I plot the data. This sounds obvious, but it’s the step that most often changes my mind about a hypothesis.

Summary statistics hide important structure. A mean return of zero could come from a tight distribution around zero or from large gains and large losses that cancel out. A positive correlation could be driven entirely by a single outlier month. Autocorrelation numbers don’t tell you whether the serial dependence is stable across time or concentrated in one regime. Plotting the data answers these questions at a glance.

The plots I reach for most often at this stage are simple: time series of the feature and the target, scatter plots of one against the other, histograms of the return distribution conditional on the signal, and rolling window plots to check whether the relationship is stable over time. Nothing fancy. The goal isn’t to produce publication-quality charts. It’s to see the shape of what I’m working with before committing to a quantitative framework.

Visual inspection is also where I catch data problems that survived Stage 1. A price series that flatlines for a week, a volume spike that’s clearly an exchange error, a gap in the data that creates a spurious return. These show up instantly in a plot. They’re much harder to catch staring at a DataFrame.

The transition point

By the end of this stage, I have a reproducible script, a set of plots I’ve actually looked at, and an updated opinion about whether the hypothesis has legs. A lot of ideas that seemed promising in the terminal die here. The seasonality pattern turns out to be driven by two outlier years. The mean reversion signal is real but tiny compared to transaction costs. The feature is predictive in one regime and meaningless in another.

That’s fine. The goal is still invalidation. The difference from Stage 1 is that the evidence is now reproducible and visual rather than ad hoc. If the idea survives this stage, it’s earned the right to face statistical testing.

Stage 3: Statistical Validation (numpy + scipy + scikit-learn)

This is where the work changes character. Stages 1 and 2 are about intuition, speed, and visual pattern recognition. Stage 3 is about turning those intuitions into testable claims and finding out whether they hold up under scrutiny. The tools shift accordingly: numpy for efficient numerical computation, scipy for statistical tests, and scikit-learn when the hypothesis involves classification or regression.

numpy as the computational backbone

Most of the numerical work at this stage lives in numpy arrays rather than pandas DataFrames. Once you’ve built and cleaned your dataset in Stage 2, the actual computation (rolling statistics, matrix operations, return distributions, correlation structures) runs faster and more naturally in numpy. The difference matters when you’re running the same calculation across thousands of windows or permutations.

A typical workflow: compute a matrix of rolling correlations between a signal and forward returns, calculate the distribution of those correlations across time, and compare it to what you’d expect from random noise. numpy handles this kind of bulk numerical work without complaint.
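A minimal sketch of that workflow, assuming a single signal and a single forward-return series: rolling correlations are computed with `sliding_window_view`, and the noise baseline comes from shuffling the signal. The synthetic data, the 60-day window, and the 200 shuffles are illustrative choices.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_corr(x: np.ndarray, y: np.ndarray, window: int) -> np.ndarray:
    """Pearson correlation of x and y over every length-`window` slice."""
    xw = sliding_window_view(x, window)
    yw = sliding_window_view(y, window)
    xc = xw - xw.mean(axis=1, keepdims=True)
    yc = yw - yw.mean(axis=1, keepdims=True)
    cov = (xc * yc).sum(axis=1)
    return cov / np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))


rng = np.random.default_rng(1)
n, window = 1500, 60
signal = rng.normal(size=n)
# Hypothetical forward returns with a weak dependence on the signal.
fwd_ret = 0.1 * signal + rng.normal(size=n)

observed = rolling_corr(signal, fwd_ret, window)

# Null distribution: shuffle the signal so any link to returns is broken.
null_means = np.array([
    rolling_corr(rng.permutation(signal), fwd_ret, window).mean()
    for _ in range(200)
])

# One-sided permutation p-value with the usual +1 correction.
p_value = (np.sum(null_means >= observed.mean()) + 1) / (len(null_means) + 1)
print(f"mean rolling corr={observed.mean():.3f}, permutation p={p_value:.3f}")
```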

scipy for hypothesis testing

scipy.stats is where intuitions face their first real trial. The question stops being “does this look like a pattern?” and becomes “is this distinguishable from chance?”

The tests I reach for depend on the hypothesis. For mean reversion claims, I’m looking at stationarity tests and autocorrelation significance. For regime-dependent signals, I’m comparing distributions across regimes using Kolmogorov-Smirnov or Mann-Whitney tests. For claims about predictive relationships, I’m running regressions through scipy or statsmodels and looking hard at the residuals.
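For the regime comparison, a short sketch using `scipy.stats.ks_2samp` and `scipy.stats.mannwhitneyu` might look like the following. The two samples are synthetic and deliberately share a mean while differing in spread, which is an assumption chosen to show that the two tests are sensitive to different things.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical post-signal returns, split by volatility regime.
calm = rng.normal(loc=0.0005, scale=0.008, size=400)
stressed = rng.normal(loc=0.0005, scale=0.020, size=250)

# Kolmogorov-Smirnov compares whole distributions (shape and spread included).
ks = stats.ks_2samp(calm, stressed)
# Mann-Whitney mainly detects a shift in location between the two samples.
mw = stats.mannwhitneyu(calm, stressed, alternative="two-sided")

print(f"KS statistic={ks.statistic:.3f}, p={ks.pvalue:.4g}")
print(f"Mann-Whitney U={mw.statistic:.0f}, p={mw.pvalue:.4g}")
```

Because the samples here differ in spread rather than location, the KS test is the one more likely to flag them. Which test is appropriate depends on what the hypothesis actually claims changes between regimes.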

The important thing at this stage is skepticism, and it took me getting burned to internalize it. Early in my return to quant work, I found a volatility signal that looked significant across three lookback variations. It took an embarrassingly long time to realize I’d effectively tripled my false positive rate by testing three variations and only reporting the one that worked. That’s exactly the trap Harvey, Liu, and Zhu documented in "…and the Cross-Section of Expected Returns": most published financial signals are likely false positives once you account for the sheer number of factors that have been tested. I try to be explicit about it now: if I tested three variations of a signal before finding one that “works,” I need to account for that.
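The arithmetic behind that inflated false positive rate is short enough to write down. The sketch below assumes the tests are independent, which lookback variations of the same signal are not, so treat the numbers as a rough guide. Bonferroni is shown as one simple correction, not the only option.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


for m in (1, 3, 20):
    fwer = family_wise_error_rate(0.05, m)
    per_test = bonferroni_alpha(0.05, m)
    print(f"tests={m:2d}  family-wise rate={fwer:.4f}  Bonferroni alpha={per_test:.4f}")
```

With three tests at the 5% level, the chance of at least one spurious hit is about 14%, close to three times the nominal rate.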

scikit-learn for structured prediction

When the hypothesis is more complex than “X predicts Y linearly,” scikit-learn comes in. Maybe the claim is that a combination of features (volatility regime, volume profile, time of day) predicts a directional move. That’s a classification problem, and scikit-learn provides the scaffolding to test it properly.

The key discipline here is keeping the evaluation honest. Train/test splits are the bare minimum, and even those can mislead if the data has temporal structure. For time-series problems, I use expanding or rolling walk-forward validation so the model never sees future data during training. scikit-learn’s pipeline and cross-validation tools make this mechanical once you set it up, which is the point. The mechanics of proper evaluation shouldn’t be something you have to think about every time.
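One way to make walk-forward evaluation mechanical is scikit-learn’s `TimeSeriesSplit` combined with a pipeline. The sketch below uses synthetic features and a logistic regression purely as placeholders. The feature meanings and the model choice are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 1200
# Hypothetical features (e.g. volatility regime, volume profile, time of day).
X = rng.normal(size=(n, 3))
# Hypothetical up/down label with a weak dependence on the first feature.
y = (0.3 * X[:, 0] + rng.normal(size=n) > 0).astype(int)

# The scaler sits inside the pipeline so it is refit on each training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression())
# Expanding window by default; pass max_train_size for a rolling window.
splitter = TimeSeriesSplit(n_splits=5)

fold_scores = []
for train_idx, test_idx in splitter.split(X):
    # Every training index precedes every test index, so no look-ahead.
    assert train_idx.max() < test_idx.min()
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print([round(s, 3) for s in fold_scores])
```

Keeping the per-fold scores rather than only their average also helps with the regime breakdown: a model that is strong in one fold and useless in the rest is visible immediately.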

I’m also watching for a specific failure mode at this stage: a model that performs well on average but only because it nails one regime and is useless in others. Aggregate accuracy numbers hide this. Breaking performance down by time period, volatility regime, or market condition is the only way to catch it.

What “statistical muster” actually means

In practice, passing this stage requires more than a significant p-value or a good accuracy score. It means:

Out-of-sample performance. The signal has to work on data that wasn’t used to find or develop it. If the only evidence is in-sample, it hasn’t passed.
Stability over time. A relationship that was strong from 2015 to 2019 and absent since then is not a tradeable signal. It’s a historical curiosity.
Robustness to specification. If the result disappears when you change the lookback window by a few days or shift the threshold slightly, it was never real. Real effects don’t balance on a knife edge. This is a quick sanity check, not the full parameter sweep that comes in Stage 5. You’re just making sure the signal isn’t obviously fragile before investing in systematic sensitivity analysis.
Honest accounting for multiple comparisons. If you tested twenty signals and one passed at the 5% level, you found exactly what you’d expect from noise. López de Prado covers this extensively in Advances in Financial Machine Learning.

The kill rate at this stage is high. Most ideas that looked visually compelling in Stage 2 fail one or more of these checks. That’s the point. Statistical validation is expensive compared to exploration, but it’s cheap compared to building a full backtest around a signal that was never real.

Stage 4: From Historical to Simulated Data

A strategy that works on historical data has passed one test, not the test. History only happened once. The specific sequence of events, the particular crises, the exact timing of regime changes: none of that will repeat. If a strategy’s edge depends on the specific path that markets actually took, it’s memorizing the past, not capturing a real relationship.

This is the stage where most people stop, and it’s exactly where you shouldn’t. Overfitting to history is the single most common failure mode in quantitative work (I catalog the full taxonomy of backtest biases separately), and the only way to test for it is to see how the strategy behaves on data it has never seen and that never actually happened. White’s “A Reality Check for Data Snooping” provides the statistical foundation for bootstrap-based tests that address exactly this problem.

Monte Carlo simulation

I first used Monte Carlo methods for risk modeling at BNP Paribas, generating thousands of interest rate scenarios to estimate portfolio exposure. The application is different here, but the thinking is the same: you don’t trust a single path. You generate many and see what holds up across all of them. I wrote more about this approach in Monte Carlo Permutation Tests for Strategy Significance.

The simplest approach is generating synthetic price paths that share the statistical properties of the real data but follow different trajectories. numpy’s random number generation makes this straightforward. Fit a model to the historical returns (even something as basic as matching the mean, volatility, and skew), then generate thousands of paths from that model.

The question you’re answering is: does the strategy perform well because it’s capturing a structural feature of how this market behaves, or because it’s latching onto the specific sequence of events in the historical record? If performance is consistent across many simulated paths, that’s evidence of a real effect. If it collapses once the path changes, the strategy was curve-fit to history.

I usually start simple with geometric Brownian motion calibrated to the observed parameters, then add complexity only if the hypothesis demands it. If the strategy claims to exploit volatility clustering, the simulation needs to exhibit volatility clustering. If it claims to exploit mean reversion in spreads, the simulation needs to model the spread process with realistic dynamics. The simulation should test the claim, not a strawman.
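A minimal sketch of the geometric Brownian motion starting point, assuming daily data and 252 trading days per year. The function name `simulate_gbm` and the synthetic “historical” returns are illustrative. GBM matches only the mean and volatility of the log returns, so it produces no skew or volatility clustering.

```python
import numpy as np


def simulate_gbm(s0, mu, sigma, n_steps, n_paths, dt=1 / 252, seed=None):
    """Simulate GBM price paths; mu and sigma are annualized."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_paths, n_steps))
    # Exact discretization of GBM in log space.
    log_ret = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z
    log_paths = np.cumsum(log_ret, axis=1)
    # Prepend a zero column so every path starts at s0.
    return s0 * np.exp(np.hstack([np.zeros((n_paths, 1)), log_paths]))


# Calibrate to historical daily log returns (a synthetic stand-in here).
rng = np.random.default_rng(4)
hist_log_ret = rng.normal(0.0003, 0.012, size=1000)
sigma_hat = hist_log_ret.std(ddof=1) * np.sqrt(252)
# Mean log return per year equals mu - sigma^2 / 2, so add the correction back.
mu_hat = hist_log_ret.mean() * 252 + 0.5 * sigma_hat ** 2

paths = simulate_gbm(100.0, mu_hat, sigma_hat, n_steps=252, n_paths=2000, seed=5)
print(paths.shape, round(float(paths[:, -1].mean()), 2))
```

Each row of `paths` can then be fed through the same strategy code as the historical series, and the distribution of results compared with the single historical outcome.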

Bootstrapped returns

Bootstrap methods take a different angle. Instead of generating synthetic data from a model, you resample the actual historical returns with replacement to create alternative histories. The returns are real, but the order is different.

This is powerful because every return in the bootstrapped series actually happened, so each sample is drawn from the empirical distribution of historical returns (with some returns repeated and others left out), while the temporal structure is destroyed. Efron and Tibshirani’s An Introduction to the Bootstrap remains the reference text for these methods, and I cover the practical application to strategy testing in Bootstrap Methods for Strategy Robustness. If the strategy depends on genuine serial dependence in the data, it should perform differently on bootstrapped samples than on the original series. If it performs just as well on reshuffled data, whatever it’s capturing isn’t related to the ordering of events, which is a problem for most trading strategies.

Block bootstrapping is a useful middle ground. Instead of resampling individual returns, you resample blocks of consecutive returns, which preserves some of the short-term autocorrelation structure while still scrambling the longer-term path. The block length becomes a parameter you can vary to test how much of the strategy’s performance depends on persistence at different time scales.
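The difference between the two resampling schemes can be seen in a small sketch. It assumes a moving-block bootstrap with a fixed block length (one of several block bootstrap variants) and uses synthetic AR(1) returns so there is serial dependence to preserve or destroy.

```python
import numpy as np


def iid_bootstrap(returns, rng):
    """Resample individual returns with replacement (destroys ordering)."""
    idx = rng.integers(0, len(returns), size=len(returns))
    return returns[idx]


def block_bootstrap(returns, block_len, rng):
    """Moving-block bootstrap: stitch together random runs of consecutive returns."""
    n = len(returns)
    n_blocks = -(-n // block_len)  # ceiling division
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    idx = (starts[:, None] + np.arange(block_len)[None, :]).ravel()[:n]
    return returns[idx]


def lag1_autocorr(x):
    """Correlation between the series and itself shifted by one step."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]


rng = np.random.default_rng(6)
# Synthetic AR(1) returns with coefficient 0.4 (an assumption for illustration).
eps = rng.normal(0, 0.01, size=2000)
returns = np.empty_like(eps)
returns[0] = eps[0]
for t in range(1, len(eps)):
    returns[t] = 0.4 * returns[t - 1] + eps[t]

ac_orig = lag1_autocorr(returns)
ac_iid = np.mean([lag1_autocorr(iid_bootstrap(returns, rng)) for _ in range(200)])
ac_block = np.mean([lag1_autocorr(block_bootstrap(returns, 20, rng)) for _ in range(200)])
print(f"lag-1 autocorr: original={ac_orig:.2f}, iid={ac_iid:.2f}, block(20)={ac_block:.2f}")
```

The plain bootstrap should drive the lag-1 autocorrelation toward zero, while the block version keeps most of it. Rerunning with different block lengths shows how much persistence survives at each scale.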

Reshuffling Monte Carlo for ordering sensitivity

A variant worth calling out separately: when your strategy operates on high-frequency event data (tick-level trades, order book updates), the specific ordering of events within a narrow time window may not be deterministic. Distributed exchange feeds deliver events per channel with sequence numbers, but cross-channel ordering isn’t guaranteed. Two events with the same millisecond timestamp could arrive in either order depending on network conditions.

Reshuffling Monte Carlo tests whether your results are sensitive to this ambiguity. Take events that fall within the same timestamp window, randomly permute them across many trials, and recompute your signals each time. If your backtest results shift meaningfully across permutations, the strategy is picking up artifacts of event ordering rather than genuine signal. If results are stable, the signal is robust to the ordering noise inherent in the data infrastructure. This matters most for order flow strategies where features like absorption ratio and iceberg detection depend on the exact sequence of trades hitting a price level.
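A sketch of the within-timestamp permutation, assuming events are already sorted by timestamp. The helper `shuffle_within_timestamp` and the toy feature `longest_same_side_run` are hypothetical stand-ins: a real test would recompute the strategy’s actual order flow features on each permuted sequence.

```python
import numpy as np


def shuffle_within_timestamp(timestamps, rng):
    """Return an index order that permutes events only among equal timestamps."""
    # lexsort uses the last key as primary: timestamps first, random tie-break second.
    tiebreak = rng.random(len(timestamps))
    return np.lexsort((tiebreak, timestamps))


def longest_same_side_run(sides):
    """Toy order-sensitive feature: longest run of consecutive same-side trades."""
    best = run = 1
    for prev, cur in zip(sides[:-1], sides[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


rng = np.random.default_rng(7)
n = 2000
# Hypothetical millisecond timestamps with many ties, and trade sides (+1 buy, -1 sell).
ts = np.sort(rng.integers(0, 600, size=n))
side = rng.choice([-1, 1], size=n)

results = []
for _ in range(200):
    order = shuffle_within_timestamp(ts, rng)
    results.append(longest_same_side_run(side[order]))

print(f"feature across permutations: min={min(results)}, max={max(results)}")
```

A wide spread between the minimum and maximum says the feature depends on an ordering the data cannot actually guarantee.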

The same reshuffling principle extends beyond ordering ambiguity. Two broader applications:

Trade reshuffling within time blocks. Take the historical trade sequence and reshuffle trades within each 1-second or 5-second window. Aggregate volume and price range within each block stay roughly the same, but the specific ordering of individual fills changes. Run the backtest across hundreds of reshuffled variants and look at the distribution of results. If the strategy’s performance scatters widely, it’s fitting to the exact microstructure of the historical record rather than detecting a generalizable pattern.

Noise injection into price paths. Add small random perturbations to historical prices before replaying the backtest. Gaussian noise calibrated to the instrument’s tick size and typical noise level (the right magnitude depends on understanding the specific market) tests whether entry and exit logic is robust to the minor price variations that occur naturally between sessions. A strategy that triggers at exactly one price and fails one tick away is fitting to a level, not a dynamic. Strategies should degrade gracefully as noise increases; those that collapse under small perturbations are too brittle for live trading.

Both are applications of the same principle: a real signal should survive small perturbations to the data that generated it. I cover the order-flow-specific version of these techniques in more detail in the order flow article.
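The noise-injection idea can be sketched as follows. The tick size of 0.25, the noise levels, and the toy breakout rule are assumptions chosen for illustration: the right noise magnitude depends on the instrument.

```python
import numpy as np


def inject_tick_noise(prices, tick_size, noise_ticks, rng):
    """Add Gaussian noise (standard deviation in ticks) and snap to the tick grid."""
    noisy = prices + rng.normal(0.0, noise_ticks * tick_size, size=len(prices))
    return np.round(noisy / tick_size) * tick_size


def breakout_pnl(prices, level):
    """Toy strategy: buy at the first price above `level`, exit at the last price."""
    above = np.nonzero(prices > level)[0]
    if len(above) == 0:
        return 0.0
    return float(prices[-1] - prices[above[0]])


rng = np.random.default_rng(8)
tick = 0.25
# Synthetic tick-aligned price path standing in for historical data.
prices = 4000 + tick * np.cumsum(rng.integers(-2, 3, size=2000))
level = np.quantile(prices, 0.9)

for noise_ticks in (0.5, 1.0, 2.0, 4.0):
    pnls = [breakout_pnl(inject_tick_noise(prices, tick, noise_ticks, rng), level)
            for _ in range(500)]
    print(f"noise={noise_ticks} ticks: mean={np.mean(pnls):.2f}, std={np.std(pnls):.2f}")
```

The thing to look for is how the distribution of results widens as the noise grows: a gradual change suggests graceful degradation, while a sudden jump at small noise levels suggests the rule is fitted to a specific price level.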

Synthetic regimes

Markets move through regimes: low volatility grinds, high volatility sell-offs, range-bound consolidation, trending runs. A strategy tested only on historical data has seen whatever mix of regimes happened to occur during the sample period. If that period was unusually calm or unusually volatile, the backtest results are conditioned on a regime mix that may not repeat.

Synthetic regime testing means deliberately constructing scenarios that stress the strategy in ways history didn’t. What happens if a 2008-style credit crisis lasts twice as long as the original? What if the low-volatility grind of 2017 never occurs and your sample is all COVID-crash chaos and 2022 rate-hiking volatility? What if correlations spike the way they did in March 2020, when everything sold off together? These aren’t exotic edge cases. They’re the normal range of market behavior viewed over a longer horizon than any single backtest covers.

I build these by splicing segments of historical data from different regimes, by scaling volatility in simulated paths, or by generating regime-switching models where the transition probabilities differ from the historical estimates. The goal is to expand the range of conditions the strategy faces beyond what a single historical path provides.
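As an illustration of the regime-switching approach, the sketch below simulates daily returns from a two-state Markov model and then changes the transition probabilities so the stressed regime lasts twice as long on average. All parameter values are made up for the example rather than estimated from data.

```python
import numpy as np


def simulate_regime_switching(n_steps, trans, vols, means, rng):
    """Markov regime model: the state picks the mean and volatility of each return."""
    k = len(vols)
    states = np.empty(n_steps, dtype=int)
    states[0] = 0  # start in the calm regime
    for t in range(1, n_steps):
        states[t] = rng.choice(k, p=trans[states[t - 1]])
    returns = rng.normal(means[states], vols[states])
    return returns, states


rng = np.random.default_rng(9)
vols = np.array([0.006, 0.025])      # calm vs. stressed daily volatility (illustrative)
means = np.array([0.0004, -0.0010])  # illustrative daily drift per regime

# Hypothetical "historical" transition matrix, and a stressed variant in which
# the high-volatility regime is twice as persistent: expected duration is
# 1 / (1 - p_stay), so 10 days at 0.90 and 20 days at 0.95.
trans_hist = np.array([[0.98, 0.02], [0.10, 0.90]])
trans_stress = np.array([[0.98, 0.02], [0.05, 0.95]])

for name, trans in (("historical", trans_hist), ("stressed", trans_stress)):
    rets, states = simulate_regime_switching(5000, trans, vols, means, rng)
    print(f"{name}: share of days in stress={states.mean():.2f}, vol={rets.std():.4f}")
```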

Keeping the pipeline consistent

All of this simulation work uses numpy for generation and feeds back into the same pandas pipeline from Stage 2. The strategy code shouldn’t know or care whether it’s running on historical data or simulated data. If it does, something is wrong. This consistency matters because it means every test you ran in Stages 2 and 3 can be rerun on simulated data without modification. The same plots, the same statistical tests, the same evaluation framework. The only thing that changes is the data.

A strategy that passes Stage 3 on historical data and then falls apart on simulated data has told you something valuable. It saved you from deploying a strategy that was never going to work in production, and it cost you a few hours of compute instead of real money.

Stage 5: Robustness Testing

A strategy that has survived Stages 1 through 4 is promising. It showed up in exploratory data, looked real in plots, passed statistical tests on historical data, and held together on simulated data. That’s more than most ideas can claim. But it still isn’t enough to deploy.

The question at this stage isn’t “does it work?” It’s “how easily does it break?”

Parameter sensitivity

Every strategy has parameters. A lookback window, a threshold for entry, a holding period, a volatility filter. During development, you chose specific values for these. The question is whether those values are the only ones that work.

I test this by sweeping each parameter across a reasonable range while holding the others fixed, then doing the same for combinations. Pardo’s The Evaluation and Optimization of Trading Strategies covers this methodology thoroughly. The output is a surface (or a set of surfaces) showing how performance changes as parameters move. What you want to see is a broad plateau: the strategy works across a wide neighborhood of the parameters you chose. What you don’t want to see is a sharp spike: performance is good at exactly your chosen values and collapses everywhere else.

A sharp spike means you’ve found a parameter combination that happens to fit the data, not a real effect. Real effects are robust to small perturbations. If moving a lookback window from 20 days to 22 days destroys the signal, there was no signal. This is one of the most reliable ways to distinguish genuine edge from overfitting, and it’s surprising how many strategies that pass every other test fail here.
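A one-dimensional version of the sweep can be sketched like this. The toy trend-following rule, the synthetic returns, and the `spike_gap` diagnostic are illustrative assumptions. A real sweep would cover the strategy’s own parameters and their combinations.

```python
import numpy as np
import pandas as pd


def trend_sharpe(returns: pd.Series, lookback: int) -> float:
    """Annualized Sharpe of a toy rule: follow the sign of the trailing mean return."""
    position = np.sign(returns.rolling(lookback).mean()).shift(1)  # act on the next bar
    pnl = (position * returns).dropna()
    return float(np.sqrt(252) * pnl.mean() / pnl.std())


def spike_gap(scores: pd.Series, chosen: int, radius: int = 2) -> float:
    """Score at the chosen parameter minus the median score of its neighbors."""
    pos = scores.index.get_loc(chosen)
    lo, hi = max(0, pos - radius), min(len(scores), pos + radius + 1)
    neighbors = scores.iloc[lo:hi].drop(index=chosen)
    return float(scores.loc[chosen] - neighbors.median())


rng = np.random.default_rng(10)
# Synthetic AR(1) daily returns stand in for real data (an assumption).
eps = rng.normal(0, 0.01, size=3000)
vals = np.zeros_like(eps)
for t in range(1, len(eps)):
    vals[t] = 0.1 * vals[t - 1] + eps[t]
returns = pd.Series(vals)

lookbacks = range(10, 31, 2)  # 10, 12, ..., 30
scores = pd.Series({lb: trend_sharpe(returns, lb) for lb in lookbacks})
print(scores.round(2))
print("gap at lookback=20:", round(spike_gap(scores, 20), 2))
```

A gap near zero means the chosen value sits on a plateau with its neighbors. A large positive gap means the chosen value stands well above them, which is the spike pattern to distrust.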

Regime stress testing

Stage 4 introduced simulated regimes. This stage takes that further by deliberately constructing adversarial conditions. Not “what if volatility is a little higher?” but “what if everything that could go wrong does?”

Concretely, I test against: extended drawdown periods where the strategy is consistently wrong, liquidity shocks where the assumptions about execution break down, correlation regime changes where diversified positions suddenly move together, and prolonged low-volatility environments where the strategy generates no signals at all. These aren’t hypothetical. Every one of these has happened in living memory, and a strategy that can’t survive them isn’t ready for production.

The goal isn’t for the strategy to be profitable in every regime. That’s unrealistic. The goal is to understand the conditions under which it fails, how badly it fails, and whether the failure is bounded. A strategy that loses a little during adverse regimes and makes it back during favorable ones is fine. A strategy that blows up catastrophically under stress is not, regardless of how good the average case looks.

Property-based thinking

This is where my software engineering background pays off. In property-based testing, you don’t test specific inputs and outputs. You define properties that should hold for all valid inputs and then throw random data at the system to see if any of them break.

The same thinking applies to strategies. What invariants should hold regardless of market conditions? Some examples:

Symmetry. If the strategy is long/short, does it perform comparably on both sides, or is all the edge in one direction? An asymmetry isn’t automatically a problem, but it needs an explanation.
Monotonicity. If the signal is stronger, does performance improve? If doubling the signal strength doesn’t improve (or actually hurts) performance, the relationship between signal and outcome isn’t what you think it is.
Scaling. Does the strategy degrade gracefully as position sizes increase, or does it assume infinite liquidity? This catches strategies that work on paper but can’t be executed at meaningful size.
Time invariance. Does the strategy work roughly as well in different calendar periods, or is performance concentrated in a specific era? A signal that only existed from 2010 to 2015 isn’t a strategy. It’s a historical footnote.

Defining these properties forces you to articulate what you actually believe about why the strategy works. If you can’t state a property, you probably don’t understand the mechanism well enough to trade it.
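Two of these properties, monotonicity and symmetry, can be written as plain functions and run against many randomly generated datasets. The sketch below does this with numpy alone rather than a property-based testing library. The synthetic relationship between signal and forward return is an assumption, and in practice the datasets would come from the simulations and bootstraps of Stage 4.

```python
import numpy as np


def bucket_means(signal, fwd_ret, n_buckets=5):
    """Mean forward return within each signal-strength quantile bucket."""
    edges = np.quantile(signal, np.linspace(0, 1, n_buckets + 1))
    # Interior edges only, so bucket ids run from 0 to n_buckets - 1.
    bucket = np.digitize(signal, edges[1:-1])
    return np.array([fwd_ret[bucket == b].mean() for b in range(n_buckets)])


def is_monotone(values, tol=0.0):
    """Monotonicity property: each bucket's mean is at least the previous one's."""
    return bool(np.all(np.diff(values) >= -tol))


def side_edges(signal, fwd_ret):
    """Symmetry property input: average edge captured on longs and on shorts."""
    long_edge = fwd_ret[signal > 0].mean()
    short_edge = -fwd_ret[signal < 0].mean()
    return float(long_edge), float(short_edge)


rng = np.random.default_rng(11)
failures = 0
for trial in range(100):
    # Each trial is a fresh synthetic dataset in which the claimed mechanism holds.
    signal = rng.normal(size=20000)
    fwd_ret = 0.05 * signal + rng.normal(0, 0.2, size=20000)  # hypothetical relationship
    if not is_monotone(bucket_means(signal, fwd_ret)):
        failures += 1

# Symmetry check on the last generated dataset.
long_edge, short_edge = side_edges(signal, fwd_ret)
print(f"monotonicity failures: {failures}/100")
print(f"long edge={long_edge:.4f}, short edge={short_edge:.4f}")
```

When a property fails on some datasets but not others, the failing cases are the interesting ones: they describe the conditions under which the assumed mechanism stops holding.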

The final filter

This is the most expensive stage and intentionally so. You only want to spend this kind of effort on ideas that have already passed every cheaper test. But it’s also the stage that earns the most trust. A strategy that survives parameter sweeps, regime stress tests, and property-based checks is one I’m willing to deploy. Not because I’m certain it will work, but because I’ve done everything I can to prove it won’t, and it’s still standing.

Most strategies don’t make it here. Of the ones that do, some fail. That’s the funnel working as intended.

Coming Back

I worked in quantitative finance from 1998 to 2011, then spent fourteen years building software products. SaaS platforms, distributed systems, infrastructure at scale. I learned a lot during that time, but I missed the clarity of quantitative work. A strategy either makes money or it doesn’t. A risk model either predicts realized losses within its confidence intervals or it doesn’t. The feedback is concrete, measurable, and indifferent to how you feel about it. The whole funnel described in this post only works because there’s a measurable outcome at the end of it. You can’t build a rigorous invalidation pipeline if you can’t define what failure looks like.

The fourteen years away weren’t wasted. The production engineering discipline, the infrastructure skills, the experience shipping systems at scale: all of that makes me better at this work than I was the first time around. Event sourcing, which I used extensively in distributed SaaS systems, turned out to be directly applicable to building auditable financial infrastructure. The discipline of parsing market data correctly draws on the same boundary-validation thinking I used in production systems. And the perfection paradox, the tendency to keep optimizing instead of shipping, is the same trap in quant research as it is in software.

But the thing that makes me want to do this work is the clarity of the objectives. I wrote more about what software product development gets wrong about this in the companion piece.

The Funnel at a Glance

Stage	Tools	Purpose	Cost
1. Exploration	Nushell, QuestDB, DuckDB	Kill bad premises fast	Minutes
2. Hypothesis testing	pandas, matplotlib	Reproduce and visualize the signal	Hours
3. Statistical validation	numpy, scipy, scikit-learn	Test whether the signal is distinguishable from chance	Hours to days
4. Simulated data	numpy, pandas	Test whether the signal survives alternative histories	Days
5. Robustness testing	numpy, pandas, scipy	Test whether the signal breaks under parameter variation and stress	Days

Most hypotheses die in Stage 1. The ones that survive get progressively more expensive to test, and most of those die too. That’s the funnel working as designed.

The Productivity Lesson

The real hack is that each stage acts as a filter with increasing cost. Nushell exploration costs minutes. Robustness testing costs days. Matching the right tool to the right stage prevents wasting expensive computation on bad ideas. Tool fluency across the entire stack matters more than mastery of any single tool.

Most hypotheses die in Stage 1. The ones that survive face visual scrutiny in Stage 2 and statistical testing in Stage 3, and many more die there. Stage 4 catches the strategies that were overfitting to a single historical path, and Stage 5 catches the ones that were fragile in ways the earlier stages couldn’t detect. The ideas that survive the full funnel are rare, and that’s fine. The productivity gain isn’t in finding winners faster. It’s in discarding losers sooner.

Susan Potter
