Risk Disclaimer: Order flow trading involves substantial risk of loss. The methods described here are educational. No strategy guarantees profitability. Always size positions within your defined risk tolerance and consult a qualified financial professional before trading with real capital.
Before you can formalize a pattern, you have to understand it well enough to describe what you’re looking for. I spent weeks paper trading order flow on Bookmap and Sierra Chart, not because I intended to trade discretionarily, but because I needed to internalize what absorption and stop cascades actually look like before I could write detectors for them. You can’t quantify what you haven’t observed. The paper trading was a deliberate research phase: get close to the raw data, watch enough examples to characterize the pattern, then step back and ask what measurable conditions distinguish it from noise.
That research phase surfaced a problem. The patterns I was observing were real, a bright band on the heatmap really did mark levels where price bounced, but “I saw a bright band” isn’t a testable hypothesis. I couldn’t define the entry conditions precisely enough to validate them, which meant I couldn’t tell whether they’d hold up out of sample or whether I was just pattern-matching on randomness.
This article is about the next step: the mental framework for turning those observations into something you could code, measure, and test. None of this is a finished trading system; it’s a demonstration of the quant mindset applied to order flow, the kind of thinking that precedes any real strategy development. If you come from a software background, the approach will feel familiar. Take a messy, high-dimensional data stream, extract features, define thresholds, test whether those features predict anything useful, throw them away when they don’t. The domain is market microstructure. The discipline is the same one I apply across my analysis pipeline: if you can’t express it as a number, you can’t test it.
Why Candlesticks Fail as a Statistical Basis
A 5-minute candlestick compresses potentially thousands of individual transactions into four numbers: Open, High, Low, Close. That compression destroys the sequential microstructure that actually tells you what happened.
Consider a lower wick. Conventional technical analysis attributes it to buying pressure. But statistically, it could result from any of several distinct mechanisms:
- A high-volume aggressive buy sweep against thin ask liquidity
- A complete vacuum of sell orders allowing trivial buy volume to spike price
- A stop cascade followed by mean reversion
The OHLC representation cannot distinguish between these scenarios. Each implies a different market state and different forward expectations. Order flow analysis works from the event-level data that OHLC throws away, the same principle behind data validation with property-based testing: summary statistics destroy structure, and you can’t get it back after the fact.
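To make the information loss concrete, here is a minimal sketch. The two price paths are made-up numbers, and `to_ohlc` is a helper I'm introducing only for illustration. Both paths collapse to the same candle even though the order of events differs.

```python
def to_ohlc(prices: list[float]) -> tuple[float, float, float, float]:
    """Collapse an ordered price sequence into (open, high, low, close)."""
    return prices[0], max(prices), min(prices), prices[-1]


# Two hypothetical intrabar paths (illustrative prices only).
# Path 1: early flush to the low, then a steady grind higher.
sweep_then_recover = [5000.00, 4998.50, 4999.25, 5000.50, 5001.00]
# Path 2: rally first, late flush to the same low, snap back into the close.
rally_then_flush = [5000.00, 5001.00, 5000.25, 4998.50, 5001.00]

bar_1 = to_ohlc(sweep_then_recover)
bar_2 = to_ohlc(rally_then_flush)
print(bar_1 == bar_2)  # True: identical candles from different sequences
```

Once the bar is built, nothing in it tells you which path produced the wick.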
The Software Engineering Parallel
If you’ve built event-driven systems, the data model will feel natural. An order book is an event-sourced data structure: the current state is the projection of every add, modify, and cancel event, and OHLC bars are what you get when you throw away the event log and keep only the final snapshot. Order flow analysis is replaying that log.
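As a rough sketch of the event-sourcing idea, the snippet below projects an ordered log of add, modify, and cancel events into resting volume per price. `BookEvent` and `replay` are hypothetical names with a simplified schema; real exchange feeds carry more fields, and this sketch ignores fills entirely.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BookEvent:
    """Hypothetical normalized book event; real feed schemas differ."""
    action: str   # "add", "modify", or "cancel"
    order_id: int
    side: str     # "bid" or "ask"
    price: float
    size: int


def replay(events: list[BookEvent]) -> dict[str, dict[float, int]]:
    """Project an ordered event log into resting volume per side and price."""
    orders: dict[int, BookEvent] = {}
    for ev in events:
        if ev.action == "cancel":
            orders.pop(ev.order_id, None)
        else:  # "add" creates the order, "modify" overwrites it
            orders[ev.order_id] = ev
    book = {"bid": defaultdict(int), "ask": defaultdict(int)}
    for order in orders.values():
        book[order.side][order.price] += order.size
    return {side: dict(levels) for side, levels in book.items()}


log = [
    BookEvent("add", 1, "bid", 4999.75, 10),
    BookEvent("add", 2, "bid", 4999.75, 5),
    BookEvent("add", 3, "ask", 5000.00, 8),
    BookEvent("modify", 2, "bid", 4999.75, 3),
    BookEvent("cancel", 1, "bid", 4999.75, 0),
]
book = replay(log)
print(book)  # {'bid': {4999.75: 3}, 'ask': {5000.0: 8}}
```

Replaying any prefix of the log gives you the book as it stood at that moment, which is exactly what a bar chart cannot do.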
The signals I describe below are all streaming computations: windowed aggregations, sliding ratios, rate-of-change calculations over an ordered event sequence. If you’ve built Kafka consumers or Flink pipelines, substitute “trade and quote events” for “messages” and “statistical feature extractors” for “consumers” and the architecture is the same. The order book itself is a set of price-time priority queues, and understanding your queue position turns out to matter enormously when you try to backtest limit order strategies (more on that in the backtesting section).
What makes this different from most streaming problems is that the data-generating process fights back. If you’ve worked in security engineering, you’ll recognize the dynamic: adversaries observe your defenses and adapt. Market participants do the same with trading signals. A pattern that works today may stop working once enough participants trade on it. In security, you can patch a vulnerability. In markets, the signal is the vulnerability, and exploiting it degrades it. You’re searching for transient statistical regularities in a system where every participant’s strategy changes the system’s behavior.
What You Need Instead
The minimum data inputs for quantitative order flow work are:
- Level 2 / Depth of Market (DOM) snapshots at tick frequency
- Time-and-sales (tape) with trade direction (bid-hit vs. ask-lift)
- Order book add/cancel/modify events where the exchange exposes them (CME MDP 3.0, for example)
Everything below builds on these.
One derived quantity shows up repeatedly: CVD, the running sum of signed trade volume (positive for ask-lifts, negative for bid-hits). CVD isn’t a separate data feed; you compute it from time-and-sales. I define it here because it feeds into multiple signals.
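A minimal sketch of that computation, assuming each tape entry has already been classified as an ask-lift or a bid-hit (the `"ask_lift"` / `"bid_hit"` labels and the tuple layout are my own placeholders):

```python
from itertools import accumulate


def cumulative_volume_delta(trades: list[tuple[str, int]]) -> list[int]:
    """Running sum of signed volume: +size for ask-lifts, -size for bid-hits."""
    signed = [size if side == "ask_lift" else -size for side, size in trades]
    return list(accumulate(signed))


tape = [("ask_lift", 10), ("bid_hit", 4), ("ask_lift", 7), ("bid_hit", 20)]
cvd = cumulative_volume_delta(tape)
print(cvd)  # [10, 6, 13, -7]
```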
If you don’t have a Level 2 feed, you can still compute some of the features below (CVD slope, volume pressure) from footprint chart data, which records bid-hit and ask-lift volume at each price. Plain OHLC bars aren’t enough for that, and book-state features like order book imbalance and pull rate require Level 2 feeds.
Signal Construction from Order Book State
Liquidity Density as a Continuous Variable
The heatmap’s color gradient encodes a quantity: resting limit order volume at each price level. To do anything rigorous with it, you need to work with the numbers directly.
Let L_b(p, t) be the resting bid-side volume at price p and time t, and L_a(p, t) the same for the ask side. Everything below is derived from these two functions.
Weighted Mid Price
The standard mid-price (best_bid + best_ask) / 2 ignores volume. Cont, Kukanov, and Stoikov (2014) formalized a better estimator that weights each price by the opposite side’s volume:
WMP(t) = [L_a(p_best_ask) * p_best_bid + L_b(p_best_bid) * p_best_ask] / [L_b(p_best_bid) + L_a(p_best_ask)]
If the bid has more volume than the ask (L_b >> L_a), WMP shifts above the simple mid. Heavy bid volume represents buying interest, so the “fair” price gets pulled upward, toward the thinner ask side where there’s less resistance. Cont et al. showed this micro-price predicts future trade prices better than the simple mid, with the advantage concentrated in the 1-30 second horizon.
One natural feature to extract is the gap between WMP and the last trade price. A 0.5-tick WMP shift when volume at the top of book is 500 contracts means something different from the same shift when volume is 20.
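Here is a small sketch of the WMP formula with made-up top-of-book numbers. The function name and the fallback to the simple mid on an empty book are my own choices.

```python
def weighted_mid_price(best_bid: float, bid_size: float,
                       best_ask: float, ask_size: float) -> float:
    """Weight each price by the opposite side's volume (the WMP formula)."""
    total = bid_size + ask_size
    if total == 0:
        # Assumption: fall back to the simple mid when the top of book is empty.
        return (best_bid + best_ask) / 2.0
    return (ask_size * best_bid + bid_size * best_ask) / total


# Bid is three times heavier than the ask, so WMP sits above the simple mid.
wmp = weighted_mid_price(best_bid=5000.00, bid_size=300,
                         best_ask=5000.25, ask_size=100)
simple_mid = (5000.00 + 5000.25) / 2
print(wmp, simple_mid)  # 5000.1875 5000.125
```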
Order Book Imbalance (OBI)
OBI(t) = [L_b(p_best_bid) - L_a(p_best_ask)] / [L_b(p_best_bid) + L_a(p_best_ask)]
OBI ranges from -1 (all ask, bearish) to +1 (all bid, bullish). Cont, Kukanov, and Stoikov (2014) showed that top-of-book OBI correlates with short-horizon price direction on U.S. equity index futures. The relationship is predictive, not causal: the imbalance and the price move may both be driven by the same latent information arriving to the market. Either way, the signal decays fast. It’s useful on a 1-30 second horizon and largely noise beyond that. For a broader treatment of order book dynamics, Cartea, Jaimungal, and Penalva (2015) is the standard reference.
You can extend OBI to N levels deep:
OBI_N(t) = [sum(L_b, top N bids) - sum(L_a, top N asks)] / [sum(L_b, top N bids) + sum(L_a, top N asks)]
In exploratory testing on ES futures, 5-level OBI appears more stable than top-of-book OBI but slower to react. The choice would depend on your holding period. For scalping (seconds to low minutes), top-of-book may be more responsive. For swing entries where you’re using order flow as a timing filter, deeper OBI could reduce false signals. These are hypotheses worth testing rigorously, not conclusions.
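A quick sketch of both variants, using hypothetical depth arrays ordered from the best price outward. The example is contrived so that a strong top-of-book lean disappears once you look five levels deep.

```python
def order_book_imbalance(bid_sizes: list[float], ask_sizes: list[float],
                         levels: int = 1) -> float:
    """OBI over the top `levels` price levels; sizes are ordered best-first."""
    bid_vol = sum(bid_sizes[:levels])
    ask_vol = sum(ask_sizes[:levels])
    total = bid_vol + ask_vol
    return 0.0 if total == 0 else (bid_vol - ask_vol) / total


bids = [120, 80, 60, 90, 50]   # hypothetical resting size, best bid first
asks = [40, 110, 95, 70, 85]   # hypothetical resting size, best ask first

obi_top = order_book_imbalance(bids, asks, levels=1)
obi_5 = order_book_imbalance(bids, asks, levels=5)
print(obi_top, obi_5)  # 0.5 0.0
```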
The threshold matters too. Backtest OBI thresholds (e.g., |OBI| > 0.6) as entry filters on your target instrument. But don’t optimize the threshold on the same data you evaluate on. This is exactly the kind of parameter that should go through walk-forward validation to avoid second-order overfitting.
Absorption Detection
Most institutional order flow executes passively through limit orders, not market orders, because passive execution minimizes price impact (Bouchaud, Farmer, and Lillo, 2009). Absorption is what this looks like on the tape: a level that eats wave after wave of aggressive orders without moving. On the heatmap, it’s a bright band that refuses to fade. The quantitative question is how to distinguish genuine absorption from normal book dynamics.
Absorption Ratio
The absorption ratio measures aggressive volume per tick of price displacement:
AR(p, window) = aggressive_volume_hitting_p / max(|price_displacement_over_window|, min_tick)
Units are contracts per tick. What counts as “high” depends entirely on the instrument and the session. You need to understand the baseline for your market: what does normal price discovery look like in terms of volume per tick of displacement? That baseline also shifts over time as market structure evolves, participation changes, and volatility regimes rotate. Any threshold you calibrate today needs periodic recalibration. Once you have a current baseline, absorption stands out as a multiple of it.
A reasonable starting framework would flag levels where all three conditions hold simultaneously:
- Cumulative aggressive volume exceeds a threshold (e.g., 3x the average volume at that level over the session)
- Price displacement per 100 contracts remains below a threshold (e.g., < 1.0 ticks per 100 contracts, the inverse view of the same relationship)
- The condition persists for at least N consecutive seconds (15-30 seconds might be a starting point on ES, likely longer on less liquid instruments)
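Condition 2 is an AR threshold stated in inverse form (under 1.0 ticks per 100 contracts means AR above 100 contracts/tick), and condition 1 adds a volume floor so that a handful of trades can’t qualify. I find them easier to calibrate independently: volume adapts to session activity, displacement adapts to the instrument.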
The persistence requirement does a lot of the work. AR spikes briefly all the time due to normal book dynamics. Sustained absorption over 15-30 seconds is the part that’s hard to explain without a large participant behind it.
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Trade:
    """A single aggressive trade print (fields assumed by the detector below)."""
    timestamp: datetime
    price: float
    size: int


@dataclass(frozen=True)
class AbsorptionEvent:
    """A detected absorption event at a price level."""
    price: float
    aggressive_volume: int
    price_displacement: float  # in ticks
    duration_seconds: float
    absorption_ratio: float    # contracts per tick


def detect_absorption(
    trades: list[Trade],
    level: float,
    tick_size: float = 0.25,
    window_seconds: float = 30.0,
    volume_multiplier: float = 3.0,
    max_displacement_per_100: float = 1.0,
    session_avg_volume: float = 100.0,
) -> AbsorptionEvent | None:
    """Detect absorption at a specific price level.

    Args:
        trades: Time-ordered trade events (all prices, not only the level).
        level: The price level to check.
        tick_size: Minimum price increment for the instrument (0.25 for ES).
        window_seconds: Minimum persistence duration.
        volume_multiplier: Required multiple of session average volume.
        max_displacement_per_100: Max ticks of displacement per 100 contracts.
        session_avg_volume: Average volume at this level for the session.

    Returns:
        An AbsorptionEvent if absorption is detected, None otherwise.
    """
    # Match the level within half a tick instead of exact float equality.
    level_trades = [t for t in trades if abs(t.price - level) < tick_size / 2]
    if not level_trades:
        return None
    total_volume = sum(t.size for t in level_trades)
    start, end = level_trades[0].timestamp, level_trades[-1].timestamp
    duration = (end - start).total_seconds()
    if duration < window_seconds:
        return None
    if total_volume < session_avg_volume * volume_multiplier:
        return None
    prices_during_window = [t.price for t in trades if start <= t.timestamp <= end]
    # Convert the price range to ticks so it matches the per-100-contract threshold.
    displacement = (max(prices_during_window) - min(prices_during_window)) / tick_size
    displacement_per_100 = displacement / (total_volume / 100) if total_volume > 0 else float("inf")
    if displacement_per_100 > max_displacement_per_100:
        return None
    # Floor displacement at one tick so a flat price does not divide by zero.
    ratio = total_volume / max(displacement, 1.0)
    return AbsorptionEvent(
        price=level,
        aggressive_volume=total_volume,
        price_displacement=displacement,
        duration_seconds=duration,
        absorption_ratio=ratio,
    )
Iceberg Order Detection
Iceberg orders (large orders that expose only a fraction of their total size to the market) leave a statistical fingerprint that distinguishes them from organic resting liquidity. De Winne and D’Hondt (2007) found that hidden order usage on Euronext was concentrated among institutional participants and that hidden orders represented a significant fraction of total depth at the best quotes. Their study covers European equities rather than US futures, but the economic logic transfers: participants use hidden orders to reduce information leakage and price impact, which are concerns in any liquid electronic market. Learning to detect them quantitatively was one of the more useful skills I developed early in this work, because icebergs are strong evidence of institutional activity. When you find one, you know a large participant is working a position.
The Reload Signature
When an iceberg reloads, the visible quantity at a price level snaps back to a fixed value immediately after being consumed. This creates a detectable pattern in Level 2 data:
- Volume at level p drops to zero (consumed by aggressor)
- Volume at level p resets to a constant Q within milliseconds
- The cycle repeats
The low variance in displayed quantity across repeated fills is the statistical tell for naive iceberg implementations. Natural limit order flow at a level produces irregular quantity sequences as different participants add and cancel independently. A basic iceberg produces near-constant displayed quantity because the same algorithm is refilling with the same clip size.
An important caveat: modern iceberg algorithms are aware that constant clip sizes are detectable. Hautsch and Huang (2012) studied the market impact of hidden liquidity and documented the evolving sophistication of hidden order strategies. Many contemporary implementations randomize the displayed quantity by +/- 10-30% on each reload to evade exactly this kind of variance-based detection. The code below uses a coefficient of variation threshold that catches the naive case. For randomized icebergs, you’d need to relax the CV threshold (perhaps to 0.15-0.25) and add a secondary check: whether the inter-reload time intervals are suspiciously regular, since the timing pattern is harder to randomize without degrading fill quality. Detection is an arms race, and any static heuristic will eventually be defeated by sufficiently motivated counterparties.
from __future__ import annotations

from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class FillEvent:
    """A fill against resting liquidity (fields assumed by the detector below)."""
    price: float
    fill_size: int
    displayed_quantity_before_fill: float


@dataclass(frozen=True)
class IcebergCandidate:
    """A suspected iceberg order at a price level."""
    price: float
    reload_count: int
    displayed_quantity: float
    quantity_variance: float  # stores the coefficient of variation, not raw variance
    total_volume_absorbed: int


def detect_iceberg(
    fill_events: list[FillEvent],
    price: float,
    max_variance_ratio: float = 0.05,
    min_reloads: int = 3,
) -> IcebergCandidate | None:
    """Detect iceberg orders by analyzing reload patterns.

    Args:
        fill_events: Sequence of fill events at this price level.
        price: The price level to analyze.
        max_variance_ratio: Maximum coefficient of variation in
            displayed quantity to classify as iceberg.
        min_reloads: Minimum number of reload cycles required.

    Returns:
        An IcebergCandidate if the pattern matches, None otherwise.
    """
    level_fills = [f for f in fill_events if f.price == price]
    if len(level_fills) < min_reloads:
        return None
    quantities = np.array([f.displayed_quantity_before_fill for f in level_fills])
    mean_q = np.mean(quantities)
    if mean_q == 0:
        return None
    # Coefficient of variation: near zero when the same clip size keeps reappearing.
    cv = np.std(quantities) / mean_q
    if cv > max_variance_ratio:
        return None
    return IcebergCandidate(
        price=price,
        reload_count=len(level_fills),
        displayed_quantity=float(mean_q),
        quantity_variance=float(cv),
        total_volume_absorbed=sum(f.fill_size for f in level_fills),
    )
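The secondary timing check mentioned above for randomized icebergs could be sketched like this. It measures how regular the gaps between reloads are, using the same coefficient-of-variation idea applied to time instead of size. The function names and the 0.25 cutoff are placeholders of mine, not calibrated values.

```python
import statistics
from typing import Optional


def reload_interval_cv(reload_times: list[float]) -> Optional[float]:
    """Coefficient of variation of the gaps between consecutive reloads.

    reload_times are timestamps in seconds, in ascending order.
    """
    if len(reload_times) < 3:
        return None  # need at least two intervals to say anything
    gaps = [b - a for a, b in zip(reload_times, reload_times[1:])]
    mean_gap = statistics.fmean(gaps)
    if mean_gap <= 0:
        return None
    return statistics.pstdev(gaps) / mean_gap


def timing_looks_algorithmic(reload_times: list[float],
                             max_interval_cv: float = 0.25) -> bool:
    """Flag suspiciously regular reload timing (threshold is a placeholder)."""
    cv = reload_interval_cv(reload_times)
    return cv is not None and cv <= max_interval_cv


regular = [0.0, 1.0, 2.0, 3.0, 4.0]
irregular = [0.0, 0.2, 3.0, 3.1, 9.0]
print(timing_looks_algorithmic(regular), timing_looks_algorithmic(irregular))
```

Treat this as one more weak piece of evidence to combine with the quantity check, since reload timing also depends on how fast aggressors consume each clip.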
Implications for Strategy
An identified iceberg at price p sets a conditional hypothesis:
- The institutional participant holds a large directional position in formation
- Price is unlikely to sustain through p until the iceberg exhausts
- When the iceberg finally absorbs all aggressive flow and the level holds, it marks a high-probability support/resistance zone
Entry logic: after N confirmed iceberg reload cycles at level p, fade aggressive moves toward p with a limit order at p + buffer (1-2 ticks), targeting mean reversion to the session VWAP or prior equilibrium.
The important caveat: icebergs can and do exhaust. A 5,000-lot iceberg absorbing 3,000 lots of aggression is a wall. The same iceberg after absorbing 4,800 lots is a wall about to break. Tracking the cumulative volume absorbed relative to typical institutional order sizes for your instrument gives you a sense of when the iceberg is likely running low. I’ve been burned by assuming an iceberg was infinite. It never is.
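The arithmetic behind that wall-versus-about-to-break distinction is trivial, but writing it down makes the dependency obvious: everything hinges on an estimate of total size that you can never observe directly. In this sketch `estimated_total_size` is an assumed input you would have to derive from typical order sizes on your instrument.

```python
def iceberg_fill_fraction(total_absorbed: int, estimated_total_size: int) -> float:
    """Fraction of the (estimated) iceberg already consumed, capped at 1.0."""
    if estimated_total_size <= 0:
        raise ValueError("estimated_total_size must be positive")
    return min(total_absorbed / estimated_total_size, 1.0)


# The two cases from the text, for a hypothetical 5,000-lot iceberg.
early = iceberg_fill_fraction(3000, 5000)
late = iceberg_fill_fraction(4800, 5000)
print(early, late)  # 0.6 0.96
```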
Stop Cascade Detection
Price breaches a level where stop-losses are clustered, those stops fire as market orders, and the resulting volume spike amplifies the move. Retail order flow education calls these “stop hunts” and attributes them to predatory institutions. Sometimes that’s true. More often, it isn’t. Danielsson, Shin, and Zigrand (2012) showed that endogenous risk, where participants’ own risk management rules amplify price moves, is a structural feature of modern markets. The stops fire mechanically regardless of what initiated the breach.
This matters because the trading response depends on which kind of cascade you’re seeing. If genuine directional flow pushed through the level (organic cascade), fading it is fighting the trend. If a thin book got swept by the mechanical burst of stop orders but the underlying flow is balanced, mean reversion is more likely. The features below help distinguish these cases, imperfectly.
The typical sequence is:
- Price approaches a technically visible level (prior swing high/low, round number)
- Liquidity at that level thins as participants front-run the anticipated break
- Price breaches the level, triggering stop-loss market orders
- Tape speed surges as stop orders execute sequentially
- Aggressive directional volume spikes, visible in CVD as a sharp slope change
- Price either continues (organic directional flow) or decelerates when it hits fresh institutional limit orders on the other side (mechanical cascade into absorption)
Quantitative Detection
Tape velocity tracks the rate of transaction events per second, normalized to a rolling baseline:
tape_velocity(t) = transactions_per_second(t) / rolling_mean_transactions_per_second(last 20 min)
What ratio constitutes “elevated” depends on your instrument and time of day. You need to understand the normal tape speed for the market you’re trading: ES at the open looks nothing like ES during the lunch hour. A ratio that would be unremarkable at 9:31 AM might be extreme at 1:00 PM. On its own, elevated tape velocity is also ambiguous (it could be a news-driven move, an opening rotation, or a real breakout). You need the additional context.
CVD slope change provides that context. Compute the first derivative of cumulative volume delta over a short window (5-30 seconds). A sharp slope inflection following a level break confirms aggressive directional order flow rather than passive absorption. If CVD slope is steep but price displacement is proportional, that’s a genuine breakout. If CVD slope is steep and price displacement is small, that’s absorption on the other side.
The exhaustion signal is where the counter-trend opportunity lives. The cascade ends when aggressive volume continues but price stops moving. This is the absorption ratio metric applied in real time: when CVD slope remains steep but price velocity approaches zero, exhaustion is occurring. The aggressive orders are being absorbed by fresh institutional liquidity. The counter-trend entry window opens.
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StopRunSignal:
    """Detected stop run with exhaustion signal."""
    breach_price: float
    tape_velocity_ratio: float
    cvd_slope: float
    price_velocity: float
    exhaustion_detected: bool


def detect_stop_run(
    breach_price: float,
    tape_velocity: float,
    cvd_slope: float,
    price_velocity: float,
    velocity_threshold: float = 2.5,
    min_cvd_slope: float = 1.0,
    exhaustion_cvd_slope: float = 2.0,
    exhaustion_price_velocity_max: float = 0.1,
) -> StopRunSignal | None:
    """Detect a stop run event and check for exhaustion.

    Args:
        breach_price: The level that was breached. Passed in by the caller
            because the frozen dataclass cannot be updated afterwards.
        tape_velocity: Current tape velocity ratio (vs. 20-min baseline).
        cvd_slope: First derivative of CVD over short window.
        price_velocity: Rate of price change (ticks per second).
        velocity_threshold: Minimum tape velocity ratio to trigger detection.
        min_cvd_slope: Minimum |CVD slope| for a cascade (scale is instrument-specific).
        exhaustion_cvd_slope: |CVD slope| above which exhaustion can be flagged.
        exhaustion_price_velocity_max: Price velocity below which
            exhaustion is flagged.

    Returns:
        A StopRunSignal if conditions are met, None otherwise.
    """
    if tape_velocity < velocity_threshold:
        return None
    if abs(cvd_slope) < min_cvd_slope:
        return None
    # Exhaustion: aggressive flow is still heavy but price has stopped moving.
    exhaustion = (
        abs(cvd_slope) > exhaustion_cvd_slope
        and abs(price_velocity) < exhaustion_price_velocity_max
    )
    return StopRunSignal(
        breach_price=breach_price,
        tape_velocity_ratio=tape_velocity,
        cvd_slope=cvd_slope,
        price_velocity=price_velocity,
        exhaustion_detected=exhaustion,
    )
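The two inputs to `detect_stop_run` that need computing, the tape velocity ratio and the CVD slope, could be derived roughly as follows. This is a sketch under my own assumptions: the baseline is passed in as a list of per-second transaction counts, and the slope is an ordinary least-squares fit rather than any particular smoothing the production version might use.

```python
def tape_velocity_ratio(current_tps: float, baseline_tps: list[float]) -> float:
    """Current transactions/sec divided by the mean of a rolling baseline sample."""
    if not baseline_tps:
        return 0.0
    baseline = sum(baseline_tps) / len(baseline_tps)
    return current_tps / baseline if baseline > 0 else 0.0


def cvd_slope(timestamps: list[float], cvd: list[float]) -> float:
    """Least-squares slope of CVD against time, in contracts per second."""
    n = len(timestamps)
    if n < 2:
        return 0.0
    mean_t = sum(timestamps) / n
    mean_c = sum(cvd) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps)
    if var_t == 0:
        return 0.0  # all samples share one timestamp; slope is undefined
    cov = sum((t - mean_t) * (c - mean_c) for t, c in zip(timestamps, cvd))
    return cov / var_t


ratio = tape_velocity_ratio(30.0, [10.0, 12.0, 8.0])
slope = cvd_slope([0.0, 1.0, 2.0, 3.0], [0.0, 50.0, 100.0, 150.0])
print(ratio, slope)  # 3.0 50.0
```

Note that the slope comes out in raw contracts per second, so the slope thresholds in the detector only make sense once they are rescaled to your instrument's normal flow.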
I want to be honest about something: stop run detection in real time is harder than it sounds. The window between “this is a stop cascade” and “this is exhaustion” can be a few seconds. If you’re computing features on 1-second bars, you might get 3-5 bars of signal before the opportunity closes. The latency of your data feed and computation pipeline matters here more than in any other signal I work with. If you’re running this on delayed data or with more than a couple hundred milliseconds of processing lag, the signal will consistently arrive too late.
Stacker Algorithm Filtering
This is where most retail traders lose money with DOM-based strategies. Stacker bots create phantom liquidity walls that evaporate on approach. If you don’t filter them, your absorption and iceberg signals will fire on phantom levels that never intended to trade.
Pull Rate Analysis
Track the ratio of limit orders pulled (cancelled) to limit orders filled at each price level over a rolling window:
pull_rate(p, t) = orders_cancelled_at_p / (orders_cancelled_at_p + orders_filled_at_p)
This isn’t as simple as “high pull rate = fake.” Genuine institutional limit orders placed with intent to fill tend to have lower pull rates, but legitimate electronic market makers also have very high cancel-to-fill ratios (often above 90%) because they continuously re-quote as the market moves. Hagströmer and Nordén (2013) studied order-to-trade ratios on NASDAQ OMX and found that high-frequency market makers had cancel rates far exceeding those of other participants, driven by the need to continuously update quotes in response to changing conditions. Similarly, the SEC’s Equity Market Structure Literature Review (2014) documents that high order-to-trade ratios are characteristic of market-making strategies that narrow spreads and provide liquidity. A market maker’s quotes at a price level 3 ticks from the spread will be cancelled and replaced many times per second as the spread shifts. This is normal market making, not manipulation.
The distinction between market making and spoofing/layering (which is what “stacking” often describes) is both a regulatory and a statistical question. Regulators look at intent, which is hard to observe in data. Statistically, a promising distinguishing feature is the relationship between pull rate and market proximity: market makers cancel because the market moved away from their quote (reactive cancellation), while stackers cancel because the market moved toward their quote (evasive cancellation). Cao, Chen, Liang, and Lo (2014) formalized this distinction in their work on detecting spoofing, showing that the correlation between cancellation timing and price direction is a more reliable indicator of manipulative intent than raw order-to-trade ratios. Tracking whether cancellations correlate with adverse price movement (the market approaching the resting order) vs. favorable price movement (the market moving away) adds a useful dimension beyond the raw pull rate.
For signal filtering purposes, the goal isn’t to classify intent. It’s to estimate whether the liquidity at a level will actually be there if price reaches it. A level where most orders get pulled as price approaches, especially when cancellations correlate with the market moving toward the resting order, should be discounted from support/resistance calculations. The right threshold depends on your market’s normal cancel behavior; understanding the baseline cancel rate for your instrument matters more than picking a universal number.
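One way to put the reactive-versus-evasive idea into numbers is to tag each cancellation with the mid-price move that preceded it and measure how often price was approaching the order. The record type, the lookback that produces `mid_change_ticks`, and the function name are all assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CancelObservation:
    """Hypothetical record: one cancel plus the mid-price move just before it."""
    side: str                # "bid" or "ask" (side of the cancelled order)
    mid_change_ticks: float  # mid-price change over the lookback, in ticks


def evasive_cancel_fraction(cancels: list[CancelObservation]) -> float:
    """Share of cancels that occurred while price was moving toward the order."""
    if not cancels:
        return 0.0
    evasive = 0
    for c in cancels:
        # A falling mid approaches resting bids; a rising mid approaches asks.
        approaching = (
            (c.side == "bid" and c.mid_change_ticks < 0)
            or (c.side == "ask" and c.mid_change_ticks > 0)
        )
        if approaching:
            evasive += 1
    return evasive / len(cancels)


observations = [
    CancelObservation("bid", -2.0),  # price fell toward the bid: evasive
    CancelObservation("bid", 1.0),   # price moved away: reactive
    CancelObservation("ask", 3.0),   # price rose toward the ask: evasive
    CancelObservation("ask", 0.0),   # no move: not counted as evasive
]
fraction = evasive_cancel_fraction(observations)
print(fraction)  # 0.5
```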
Distance Decay
Phantom liquidity tends to cluster near the spread, while genuine institutional resting orders more often sit at round numbers and technically significant levels further out. You can weight your liquidity density signals by distance and pull rate:
adjusted_density(p, t) = L(p, t) * (1 - stacker_discount(p, t))
where stacker_discount is derived from the pull rate analysis. In practice, a continuous weight works better than a binary filter. A level with a pull rate of 0.60 might be partially genuine and partially stacker activity. Zeroing it out entirely throws away real information; discounting it proportionally preserves what signal there is.
def compute_stacker_discount(
    cancels: int,
    fills: int,
    distance_from_spread: int,
    near_spread_ticks: int = 3,
) -> float:
    """Compute a discount factor for suspected stacker activity.

    Args:
        cancels: Number of order cancellations at this level.
        fills: Number of order fills at this level.
        distance_from_spread: Distance in ticks from current spread.
        near_spread_ticks: Threshold for "near spread" classification.

    Returns:
        Discount factor between 0.0 (no discount) and 1.0 (full discount).
    """
    total = cancels + fills
    if total == 0:
        return 0.0
    pull_rate = cancels / total
    # Near-spread orders with high pull rates are likely stackers
    if distance_from_spread <= near_spread_ticks and pull_rate > 0.85:
        return min(pull_rate, 0.95)
    # Further from spread, require higher pull rate for discount
    if pull_rate > 0.90:
        return pull_rate * 0.8
    return 0.0
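Applying the discount to the raw book is then a one-liner per level, following the adjusted_density formula above. In this sketch the per-level discounts are hypothetical values supplied directly so the example stands alone; in practice they would come from the pull rate analysis.

```python
def adjusted_density(resting_volume: dict[float, float],
                     discounts: dict[float, float]) -> dict[float, float]:
    """Scale each level's resting volume by (1 - discount); missing levels keep full weight."""
    return {
        price: volume * (1.0 - discounts.get(price, 0.0))
        for price, volume in resting_volume.items()
    }


# Hypothetical bid-side depth and per-level discount factors.
raw = {4999.75: 400.0, 4999.50: 250.0, 4998.00: 600.0}
discounts = {4999.75: 0.875, 4998.00: 0.75}

adjusted = adjusted_density(raw, discounts)
print(adjusted)  # {4999.75: 50.0, 4999.5: 250.0, 4998.0: 150.0}
```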
Microstructure Analytics as Quantitative Features
None of these signals work in isolation. In practice, I compose them into a feature vector per time bar (1-second or 5-second aggregation depending on the signal’s natural timescale) and use the combination to filter entries.
| Feature | Quantitative Definition | Signal Interpretation |
|---|---|---|
| Pace of Tape | Transaction rate / 5-min rolling mean | > 1.7x baseline = elevated participation |
| Volume Pressure | Aggressive buy vol / Total aggressive vol | > 0.65 or < 0.35 = directional imbalance |
| Volume Exhaustion | Rate of change of Volume Pressure approaching zero | Slope sign flip after extreme = reversal signal |
| OBI Skew | Top-of-book imbalance ratio | \|OBI\| > 0.6 = directional lean |
| Absorption Ratio | Aggressive vol / Price displacement | > 200 contracts/tick = active absorption |
| Pull Rate | Orders cancelled / (cancelled + filled) near spread | > 0.85 = phantom liquidity, discount this level |
| CVD Slope | First derivative of cumulative volume delta | Sign and magnitude indicate directional flow |
| WMP Deviation | Weighted mid price - last trade price | Positive = book skewed bullish, negative = bearish |
from dataclasses import dataclass


@dataclass(frozen=True)
class MicrostructureFeatures:
    """Feature vector capturing current order flow state."""
    pace_of_tape: float
    volume_pressure: float
    obi_5_level: float
    absorption_ratio: float
    pull_rate_near_spread: float
    cvd_slope: float
    wmp_deviation: float
Starting with pure threshold-based rules makes sense because they’re transparent and easy to debug. The thresholds in the table above are directional, not prescriptive: you need to calibrate them for your instrument, your session characteristics, and the current volatility regime, and recalibrate as those shift.
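A threshold-based first pass can be as plain as a dictionary of boolean flags. The sketch below hard-codes the illustrative numbers from the table, and the function name, the key names, and the sample values are placeholders of mine.

```python
def threshold_flags(features: dict[str, float]) -> dict[str, bool]:
    """Apply the illustrative table thresholds; every number needs calibration."""
    volume_pressure = features["volume_pressure"]
    return {
        "elevated_participation": features["pace_of_tape"] > 1.7,
        "directional_imbalance": volume_pressure > 0.65 or volume_pressure < 0.35,
        "book_lean": abs(features["obi"]) > 0.6,
        "active_absorption": features["absorption_ratio"] > 200.0,
        "phantom_liquidity": features["pull_rate_near_spread"] > 0.85,
    }


sample = {
    "pace_of_tape": 2.1,
    "volume_pressure": 0.72,
    "obi": -0.3,
    "absorption_ratio": 240.0,
    "pull_rate_near_spread": 0.4,
}
flags = threshold_flags(sample)
print(flags)
```

Because each flag is a single comparison, a bad signal can be traced to the exact feature and threshold that produced it, which is the debugging advantage over a classifier.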
A machine learning classifier (random forest or gradient boosting on the feature vector, trained to predict short-horizon direction) is a natural next step, but it introduces all the usual overfitting risks. López de Prado (2018) documents extensively how ML models applied to financial data are prone to backtest overfitting, particularly when the number of features is large relative to the number of independent observations. I’d only go there after the rule-based version demonstrates that the features have predictive content. If threshold-based rules on these features can’t beat random, a classifier won’t save you. It’ll just overfit more creatively.
Strategy Architecture
Here’s how the pieces fit together. The architecture mirrors my general pipeline: raw data through feature computation, filtering, classification, and into entry/exit logic.
digraph OrderFlowStrategy {
rankdir=TB;
node [shape=rectangle, style=filled, fillcolor="#1a1a2e", fontcolor="#e0e0e0", fontname="monospace"];
edge [color="#4a9eff"];
A [label="Raw Data Feeds\n(DOM snapshots, T&S)"];
B [label="Feature Computation Layer\n(OBI, AR, Pull Rate, Tape Velocity, WMP)"];
C [label="Stacker Filter\n(Discount phantom liquidity)"];
D [label="Signal Classifier\n(Rule-based or ML)"];
E1 [label="Absorption Entry\n(Fade at defended level)"];
E2 [label="Stop Run Entry\n(Counter-trend at exhaustion)"];
E3 [label="Breakout Entry\n(Momentum after iceberg absorbs)"];
F [label="Position Sizing\n(Volatility-adjusted, CVaR-bounded)"];
G [label="Exit Logic\n(VWAP reversion, DOM level flip, Time stop)"];
A -> B;
B -> C;
C -> D;
D -> E1;
D -> E2;
D -> E3;
E1 -> F;
E2 -> F;
E3 -> F;
F -> G;
}
From Observation to Hypothesis to Entry Rule
The specific thresholds matter less than how I arrived at them. Every entry rule below went through the same process, though calling it a “process” makes it sound cleaner than it was:
- Observation: something I noticed repeatedly on the heatmap or tape (“price tends to bounce when I see heavy volume absorbed at a level”)
- Hypothesis: a testable statement with measurable conditions (“when AR exceeds 150 contracts/tick for 30+ seconds at a level with genuine liquidity, the probability of a bounce to VWAP exceeds the probability of a breakdown”)
- Feature construction: translate the hypothesis into computable features (AR, OBI, pull rate)
- Threshold calibration: find thresholds that separate signal from noise on a training set
- Validation: test whether the signal survives walk-forward evaluation across multiple window configurations
The examples below are outputs of this process, not recipes. The specific numbers came from studying ES futures during 2024-2025 and would need fresh calibration for a different instrument or a different year. Understanding the nature of the market you’re trading, its typical volume profile, its tick structure, its participant mix, is the prerequisite for setting any of these parameters. The methodology transfers; the numbers don’t.
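For the calibration and validation steps, the key mechanical requirement is that thresholds are fit on one window and scored on the window that follows it in time. A minimal rolling-window generator might look like this; the function name and window sizes are placeholders, and a real setup would likely add a gap between train and test to limit leakage.

```python
def walk_forward_windows(n_samples: int, train_size: int,
                         test_size: int) -> list[tuple[range, range]]:
    """Rolling (train, test) index windows; test always follows train in time."""
    windows = []
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        windows.append((train, test))
        start += test_size  # roll forward by one test block
    return windows


windows = walk_forward_windows(n_samples=10, train_size=4, test_size=2)
for train, test in windows:
    print(list(train), list(test))
# [0, 1, 2, 3] [4, 5]
# [2, 3, 4, 5] [6, 7]
# [4, 5, 6, 7] [8, 9]
```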
Absorption Fade (hypothesis: defended levels attract mean reversion):
- Absorption ratio at level p > 150 contracts/tick for 30+ seconds
- OBI_5 skewed toward defending side (> 0.4)
- Pull rate at nearby levels < 0.5 (confirming real, not phantom, liquidity)
- Enter limit at p +/- 1 tick, target session VWAP
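Expressed as code, the filter part of the absorption fade is just a conjunction of the conditions listed above. This sketch covers only the entry filter, not order placement or the VWAP target. The function and parameter names are mine, and the defaults are the ES-specific numbers from the list.

```python
def absorption_fade_ok(ar: float, seconds_held: float, obi_5: float,
                       nearby_pull_rate: float, defending_side: str,
                       min_ar: float = 150.0, min_seconds: float = 30.0,
                       min_obi: float = 0.4, max_pull_rate: float = 0.5) -> bool:
    """True when all absorption-fade entry conditions hold.

    defending_side is "bid" (level acting as support) or "ask" (resistance).
    """
    # OBI must lean toward the defender: positive for bids, negative for asks.
    lean = obi_5 if defending_side == "bid" else -obi_5
    return (
        ar > min_ar
        and seconds_held >= min_seconds
        and lean > min_obi
        and nearby_pull_rate < max_pull_rate
    )


print(absorption_fade_ok(180.0, 35.0, 0.55, 0.3, "bid"))   # True
print(absorption_fade_ok(180.0, 10.0, 0.55, 0.3, "bid"))   # False: too brief
```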
Stop Cascade Reversal (hypothesis: mechanical cascades into absorption zones revert):
- Tape velocity > 2.5x baseline
- CVD slope inflection detected (sign change in first derivative)
- Price within 2 ticks of a pre-identified absorption zone
- Volume pressure at extreme (> 0.70 or < 0.30) and flattening
- Enter market order in counter-trend direction, stop 3 ticks beyond cascade low/high
- Critical filter: only fade cascades that terminate at absorption. If there’s no absorption on the other side, the cascade may be organic directional flow and fading it is fighting the trend.
Breakout After Iceberg Exhaustion (hypothesis: iceberg exhaustion signals failed defense):
- Iceberg detected at level p with 5+ reload cycles
- Total volume absorbed exceeds 80% of estimated iceberg size
- OBI flips (the level is about to break)
- Enter market order in breakout direction as the iceberg exhausts, stop at p
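The exhaustion check reduces to three comparisons. In this sketch the function name and parameters are hypothetical, and "OBI flips" is interpreted as a sign change between two readings, which is an assumption.

```python
def iceberg_exhausted(
    reload_cycles: int,
    volume_absorbed: float,
    estimated_size: float,
    obi_before: float,
    obi_now: float,
    min_reloads: int = 5,
    exhaustion_frac: float = 0.80,
) -> bool:
    """True when the iceberg looks mostly consumed and the book has flipped."""
    if estimated_size <= 0:
        return False  # no usable size estimate, so no signal
    enough_reloads = reload_cycles >= min_reloads
    mostly_consumed = volume_absorbed / estimated_size > exhaustion_frac
    obi_flipped = obi_before * obi_now < 0  # sign change between readings
    return enough_reloads and mostly_consumed and obi_flipped
```

The estimate of iceberg size carries most of the uncertainty in this rule, so the 80% threshold is only as good as that estimate.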
Position Sizing
Not all setups are equal. A stop cascade reversal with clear exhaustion at an iceberg level is a different animal from an absorption fade where the pull rate is borderline. The natural approach is to scale position size to signal strength.
The catch is that AR (in contracts/tick), pull rate (a ratio), and OBI (a ratio) live on different scales. Averaging them raw means AR dominates everything. Converting AR to its percentile rank within the current session’s distribution (computed from data seen so far, to avoid look-ahead) puts it on the same [0, 1] scale as the two ratios, so each feature contributes comparably to the average.
def compute_position_size(
    account_risk_per_trade: float,
    stop_distance_ticks: int,
    tick_value: float,
    ar_percentile: float,
    pull_rate_inverse: float,
    obi_magnitude: float,
) -> int:
    """Compute position size scaled by signal confidence.

    All confidence inputs must be normalized to [0, 1] before calling.
    ar_percentile is the absorption ratio's percentile rank within the
    session's AR distribution, not the raw contracts/tick value.

    Args:
        account_risk_per_trade: Dollar risk per trade (0.5-1.0% of account).
        stop_distance_ticks: Stop distance in ticks.
        tick_value: Dollar value per tick per contract.
        ar_percentile: Absorption ratio percentile within session (0 to 1).
        pull_rate_inverse: 1 - pull_rate, higher means more genuine liquidity (0 to 1).
        obi_magnitude: Absolute value of OBI (0 to 1).

    Returns:
        Number of contracts to trade.
    """
    if stop_distance_ticks <= 0 or tick_value <= 0:
        raise ValueError("stop_distance_ticks and tick_value must be positive")
    # Contracts that lose exactly the risk budget if the stop is hit.
    base_size = account_risk_per_trade / (stop_distance_ticks * tick_value)
    # Equal-weight average of the three normalized confidence inputs.
    confidence = (ar_percentile + pull_rate_inverse + obi_magnitude) / 3.0
    # Floor at one contract. Note that this floor can exceed the risk budget
    # when base_size * confidence is below 1.
    return max(1, int(base_size * confidence))
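One way to produce the `ar_percentile` input is to rank each new AR reading against the readings seen earlier in the same session. The class below is a sketch; `RunningPercentile` is a hypothetical helper, and returning 0.5 for the first observation is an arbitrary neutral choice.

```python
from bisect import bisect_right, insort


class RunningPercentile:
    """Percentile rank of each new value against values seen earlier in the session."""

    def __init__(self) -> None:
        self._sorted: list[float] = []

    def rank_then_add(self, value: float) -> float:
        # Rank against history only, then store the value, so a reading is
        # never ranked against itself or against anything from the future.
        n = len(self._sorted)
        rank = bisect_right(self._sorted, value) / n if n else 0.5
        insort(self._sorted, value)
        return rank


rp = RunningPercentile()
ranks = [rp.rank_then_add(ar) for ar in [100.0, 50.0, 150.0, 120.0]]
```

As a worked example of the sizing arithmetic: with $500 of risk, a 4-tick stop, and the ES tick value of $12.50, the base size is 500 / (4 × 12.50) = 10 contracts. Confidence inputs of 0.9, 0.8, and 0.5 average to about 0.73, so the function returns 7 contracts.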
Cap total risk per trade at 0.5-1.0% of account. The speed of order flow setups creates a temptation to oversize because entries and exits happen in seconds. Don’t. A blown stop on an oversized position erases weeks of correct reads.
Backtesting and Validation Considerations
Order flow strategies have their own backtesting pitfalls on top of the usual ones, and some of them are subtle enough that I didn’t catch them on the first pass.
Look-ahead bias in DOM data. Historical Level 2 snapshots must be consumed in event order. Any feature that uses future book state contaminates the backtest. This is more insidious than look-ahead bias in daily data because the granularity is so fine. If your feature computation function accidentally reads the next tick’s book state because of an off-by-one in your event loop, you won’t notice in a casual review. The returns will just look slightly better than they should. Enforcing ordering through property checks helps, but it’s worth noting that truly monotonic global ordering is an idealization. In practice, exchange feeds like CME’s MDP 3.0 provide per-channel sequence numbers, not global timestamps with total ordering guarantees. Events from different channels can arrive out of order at the consumer. Your replay infrastructure needs to handle this, and you need to know whether your results are sensitive to it.
Reshuffling Monte Carlo is one way to test: take events that fall within the same microsecond or millisecond window (where the “true” order is ambiguous), randomly permute them across many trials, and recompute your features and signals each time. If your backtest results change meaningfully depending on which permutation you use, your strategy is picking up artifacts of event ordering rather than real microstructure signal. If results are stable across permutations, you can be more confident the signal is genuine. This is a specific application of the broader Monte Carlo permutation methodology I use across my pipeline, adapted for the ordering ambiguity inherent in distributed market data feeds.
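A property check of the kind mentioned above might look like the following. The `Event` record and its field names are hypothetical, not the layout of any real feed handler. The check only verifies that sequence numbers increase within each channel; it does not detect gaps.

```python
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    channel: int   # feed channel id (hypothetical field name)
    seq: int       # per-channel sequence number
    payload: str


def assert_per_channel_order(events: Iterable[Event]) -> None:
    """Raise if any channel's sequence numbers repeat or go backwards in replay order."""
    last_seq: dict[int, int] = {}
    for i, ev in enumerate(events):
        prev = last_seq.get(ev.channel)
        if prev is not None and ev.seq <= prev:
            raise ValueError(
                f"event {i}: channel {ev.channel} seq {ev.seq} arrived after seq {prev}"
            )
        last_seq[ev.channel] = ev.seq
```

Running a check like this at the top of every replay turns a silent ordering bug into a loud failure. It says nothing about cross-channel ordering, which is where the permutation test comes in.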
Fill assumption realism. Limit orders in backtests often assume fills at the quoted price. In live trading, a resting limit order at an absorption level may not fill if the iceberg absorbs all aggressive flow before your order executes. You’re behind the iceberg in the queue. Model queue position explicitly or use conservative fill assumptions (assume you’re last in queue at the level, or require price to trade through your level by at least one tick before counting a fill). Optimistic fill assumptions are a primary source of inflated backtest returns in order flow strategies. Moallemi and Yuan (2017) modeled the value of queue position in a limit order book and found that, for large-tick assets, it can be of the same order of magnitude as the half-spread. An effect of that size is large enough that ignoring queue position can turn a profitable backtest into a losing live strategy.
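The trade-through rule is simple to encode. In this sketch the function name is hypothetical and the 0.25 default is the ES tick size; a full queue-position model would be more accurate but needs order-level data.

```python
def limit_filled(
    side: str,
    limit_price: float,
    later_trade_prices: list[float],
    tick: float = 0.25,
    through_ticks: int = 1,
) -> bool:
    """Count a resting limit order as filled only if price later trades
    at least `through_ticks` beyond it (conservative fill assumption)."""
    eps = 1e-9  # tolerance for floating-point price comparisons
    if side == "buy":
        trigger = limit_price - through_ticks * tick
        return any(p <= trigger + eps for p in later_trade_prices)
    if side == "sell":
        trigger = limit_price + through_ticks * tick
        return any(p >= trigger - eps for p in later_trade_prices)
    raise ValueError("side must be 'buy' or 'sell'")
```

Comparing backtest results with `through_ticks=0` (touch fills) and `through_ticks=1` gives a quick read on how much of the edge depends on optimistic fills.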
Regime sensitivity. Order flow microstructure differs between high-volatility sessions (open, news events, FOMC) and low-volatility drift periods. Validate strategies separately across these regimes. A strategy calibrated on low-volatility absorption behavior will likely overtrade during a macro event, and a strategy tuned for stop run detection during news will generate false signals during quiet afternoons. Splitting walk-forward validation into regime-specific windows would be important when testing order flow strategies.
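A crude version of regime-specific reporting is sketched below. Splitting sessions at the median range is an assumption chosen for brevity; an event calendar (FOMC, scheduled releases) or a volatility measure known before the session starts would be a more defensible label.

```python
from statistics import mean, median


def pnl_by_regime(sessions: list[tuple[float, float]]) -> dict[str, float]:
    """Mean P&L per regime, given (session_range_in_ticks, session_pnl) pairs.

    Sessions with a range above the sample median are labeled high-volatility.
    """
    cutoff = median(r for r, _ in sessions)
    buckets: dict[str, list[float]] = {"high_vol": [], "low_vol": []}
    for range_ticks, pnl in sessions:
        label = "high_vol" if range_ticks > cutoff else "low_vol"
        buckets[label].append(pnl)
    return {label: mean(values) for label, values in buckets.items() if values}


report = pnl_by_regime([(40, 100.0), (45, 50.0), (120, -200.0), (150, -100.0)])
```

A strategy whose average is positive overall but negative in one regime, as in this toy input, is a candidate for a regime filter rather than a blanket deployment.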
Sample size illusion. With tick-level data, the number of observations is large but the number of independent signals is small. 10,000 one-second bars cover under 3 hours of trading. You might see 3-5 genuine absorption events in that window. Ensure your backtest spans multiple months and market regimes before drawing conclusions. Bootstrap methods are applicable here, though block lengths would need to account for the intraday structure of order flow data.
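One simple way to respect intraday structure is to treat each session as a block and resample whole sessions. The sketch below does that for mean daily P&L. It assumes sessions are roughly independent of each other, which is itself worth checking, and the function name is hypothetical.

```python
import random
from statistics import mean


def session_bootstrap_ci(
    daily_pnl: list[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile confidence interval for mean daily P&L, resampling whole sessions."""
    rng = random.Random(seed)
    n = len(daily_pnl)
    # Each bootstrap sample draws n sessions with replacement.
    means = sorted(mean(rng.choices(daily_pnl, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

The width of the interval depends on the number of sessions, not the number of ticks, which is the point of the sample size warning above.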
Stress-Testing Through Perturbation
A backtest that performs well on the exact historical record has passed one test. History only happened once, and a strategy that’s overfit to the specific sequence of events will look great in retrospect but break on anything slightly different. Two perturbation techniques from the Monte Carlo toolkit help probe this.
Trade reshuffling. Take the historical trade sequence and reshuffle trades within time blocks (e.g., within each 1-second or 5-second window). The aggregate volume and price range within each block stay roughly the same, but the specific ordering of individual fills changes. An absorption signal that depends on trades arriving in a particular sequence may be fitting to the exact microstructure of that session rather than detecting a generalizable pattern. Run the backtest across hundreds of reshuffled variants and look at the distribution of results. If the strategy’s performance is tightly clustered, the signal is robust to the specific trade sequence. If results scatter widely, the strategy is fragile in a way that live trading will expose, because the exact sequence will never repeat.
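A minimal sketch of block reshuffling follows. Trades are assumed to be `(timestamp_s, price, size)` tuples sorted by time, and `backtest` stands in for whatever function maps a trade sequence to a performance number; both are illustrative assumptions. Timestamps stay in place and the price and size payloads are permuted among them, so the output remains time-sorted.

```python
import random
from itertools import groupby
from typing import Callable

Trade = tuple[float, float, int]  # (timestamp_s, price, size)


def reshuffle_within_blocks(
    trades: list[Trade], block_s: float, rng: random.Random
) -> list[Trade]:
    """Permute trades inside each `block_s`-second window, keeping windows in order."""
    out: list[Trade] = []
    for _, group in groupby(trades, key=lambda t: int(t[0] // block_s)):
        block = list(group)
        stamps = [t[0] for t in block]  # original timestamps stay where they are
        rng.shuffle(block)
        out.extend((ts, price, size) for ts, (_, price, size) in zip(stamps, block))
    return out


def perturbation_distribution(
    trades: list[Trade],
    backtest: Callable[[list[Trade]], float],
    n_trials: int = 200,
    block_s: float = 1.0,
    seed: int = 0,
) -> list[float]:
    """Backtest result for each of `n_trials` reshuffled variants."""
    rng = random.Random(seed)
    return [backtest(reshuffle_within_blocks(trades, block_s, rng)) for _ in range(n_trials)]
```

Summarizing the returned list with a spread measure such as the interquartile range gives a direct view of whether results cluster or scatter.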
Noise injection into price paths. Add small random perturbations to historical prices before replaying the backtest. The perturbations should be calibrated to the instrument’s tick size and typical noise level: adding Gaussian noise with a standard deviation of 0.5-1 tick to each price observation is a starting point, but the right magnitude depends on understanding the noise characteristics of your specific market. This tests whether entry and exit logic is robust to the minor price variations that occur naturally between sessions. A strategy that triggers at exactly 4500.25 and fails at 4500.50 is fitting to a price level, not a market dynamic. If the strategy degrades gracefully as noise increases, the signal is more likely to be genuine. If it collapses with even small perturbations, the entry logic is too precise for the noise level of the data.
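A sketch of the noise injection step is below. Snapping perturbed prices back to the tick grid is an added assumption, made so the replayed prices remain valid for the instrument; the 0.25 default is the ES tick size.

```python
import random


def jitter_prices(
    prices: list[float],
    tick: float = 0.25,
    sigma_ticks: float = 0.5,
    seed: int = 0,
) -> list[float]:
    """Add Gaussian noise (standard deviation in ticks) to each price,
    then round the result to the nearest valid tick."""
    rng = random.Random(seed)
    out = []
    for p in prices:
        noisy_ticks = p / tick + rng.gauss(0.0, sigma_ticks)
        out.append(round(noisy_ticks) * tick)
    return out
```

With rounding in place, a small `sigma_ticks` leaves many prices unchanged, so sweeping `sigma_ticks` upward and plotting performance against it shows how quickly the strategy degrades.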
Both techniques are applications of the same principle: a real signal should survive small perturbations to the data that generated it. Strategies that depend on the exact historical path are memorizing, not generalizing. This is the microstructure-specific version of the robustness testing stage in the broader pipeline.
Building Incrementally
I’ve tried building the full pipeline at once. Book state, iceberg detection, stacker filtering, stop run detection, entry logic, position sizing, all connected. It didn’t go well. The same decomposition discipline that works in software applies here: when something fails in a monolithic system, you can’t tell which piece broke. I couldn’t tell which features were adding signal and which were adding noise until I pulled them apart and tested each one independently.
What worked was building one feature at a time:
- Start with OBI as a single-feature predictor on 5-second bars in your target futures instrument. Does top-of-book imbalance predict short-horizon direction? If yes, how much? If no, the more complex signals built on top of it may not help either.
- Add CVD slope to filter directional vs. absorption-driven moves. This gives you two conceptually distinct features: book state (OBI) and flow state (CVD).
- Layer absorption ratio detection at pre-defined support/resistance levels. Now you’re combining book state with price-level context.
- Introduce the stacker pull-rate filter to reduce phantom signal noise. This is a refinement, not a new signal.
- Formalize entry and exit rules, then paper-trade before committing capital.
At each stage, I validate whether the addition actually helps. Does CVD slope on top of OBI improve predictive accuracy? By how much? Does the improvement hold across walk-forward windows, or does it only show up in one favorable stretch? Features that don’t demonstrably improve the signal get cut. I’ve thrown away more features than I’ve kept.
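The first step in the list above can be sketched in a few lines. The imbalance formula here is the common (bid - ask) / (bid + ask) form, assumed rather than taken from this article's definition, and `hit_rate` is a deliberately simple accuracy measure.

```python
def obi(bid_size: float, ask_size: float) -> float:
    """Top-of-book imbalance in [-1, 1]; positive means more resting bid size."""
    total = bid_size + ask_size
    return (bid_size - ask_size) / total if total > 0 else 0.0


def hit_rate(signals: list[float], next_returns: list[float]) -> float:
    """Share of non-zero signals whose sign matches the next bar's return sign."""
    pairs = [(s, r) for s, r in zip(signals, next_returns) if s != 0 and r != 0]
    if not pairs:
        return float("nan")
    return sum((s > 0) == (r > 0) for s, r in pairs) / len(pairs)


# Toy example: OBI per 5-second bar against the following bar's return.
signals = [obi(300, 100), obi(80, 120), obi(130, 70), obi(100, 100)]
accuracy = hit_rate(signals, [1.0, -0.5, -0.25, 2.0])
```

To test an added feature, compute the same hit rate on the subset of bars the new feature lets through, window by window, and keep the feature only if the improvement shows up in most windows.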
The point was never to replicate the heatmap in code. It was to figure out what the heatmap was actually showing me, express it as something measurable, and then build the framework to find out whether it holds up under rigorous testing. This article sketches that framework. The real work, the part where you take these candidate signals through proper validation and find out which ones survive, is where quantitative research gets interesting. Most of what looks promising in exploratory analysis doesn’t make it. That’s the process working as designed.

Susan Potter
Quant