I’ve debugged more strategy issues caused by bad parsing than by bad strategies. A timestamp off by five hours because someone assumed UTC. A price that lost precision when pandas silently converted to float64. A backtest that looked amazing until we realized the parser was dropping every trade with a missing field instead of failing.
Parsing is where raw bytes become domain objects you can trust. Get it wrong and garbage propagates silently through your system until it corrupts a P&L report or produces a phantom signal.
Why This Matters More Than You Think
Most quants treat parsing as solved. `pd.read_csv()`, move on. This works until you discover:
- Timestamp ambiguity: Is `01/02/2024` January 2nd or February 1st? Is `09:30:00` in UTC, Eastern, or exchange local time?
- Numeric precision: Does your CSV parser preserve decimal precision, or does it silently convert to float64 and lose cents?
- Missing data: Is an empty field a missing value, zero, or an error?
- Malformed records: What happens when a trade record has four fields instead of five?
- Price units: Some feeds report prices in cents, others in dollars, others in ticks or pips. One feed’s `18550` is another feed’s `185.50`.
- Options symbology: The OCC standard is `AAPL 240119C00185000`, but vendors format it differently. Some use spaces, some use underscores, some omit leading zeros. Parsing the wrong format silently gives you the wrong strike.
- Exchange identifiers: Is it `NYSE`, `XNYS`, `N`, or `1`? Different feeds use different codes for the same venue. A trade routed to `EDGX` vs `NASDAQ` matters for execution analysis.
- Trade condition codes: A trade marked `T` (extended hours) or `Z` (sold out of sequence) should be excluded from VWAP calculations. Ignoring condition codes corrupts your benchmarks.
- Corporate actions: A 4:1 stock split means yesterday’s $400 close and today’s $100 open are the same price. Unadjusted data breaks any strategy that looks at price changes.
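The price-units pitfall in particular is cheap to guard against. Here is a minimal sketch of per-feed normalization; the feed names and scale factors are hypothetical, stand-ins for whatever your vendors actually send:

```python
from decimal import Decimal

# Hypothetical scale factors: how many raw units equal one dollar for each feed
FEED_PRICE_SCALE = {
    "feed_a": Decimal("1"),    # already in dollars: "185.50" -> 185.50
    "feed_b": Decimal("100"),  # prices in cents:    "18550"  -> 185.50
}


def normalize_price(feed: str, raw: str) -> Decimal:
    """Convert a feed's raw price string to dollars, failing loudly on unknown feeds."""
    if feed not in FEED_PRICE_SCALE:
        raise ValueError(f"unknown feed: {feed}")
    return Decimal(raw) / FEED_PRICE_SCALE[feed]
```

The point is the explicit table: a feed you haven't classified fails loudly instead of being silently treated as dollars.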
A parser that fails loudly on bad data is worth ten times one that silently produces garbage.
What a Parser Actually Does
Parsing happens in stages, and knowing which stage failed tells you what’s wrong:
- Tokenization — split the input into chunks. For CSV, that’s commas and newlines. For FIX, it’s the SOH delimiter.
- Structure — arrange tokens into shape. A CSV row becomes a list of fields. A FIX message becomes tag-value pairs.
- Validation — check domain rules. Is the quantity positive? Is the symbol in your universe?
- Transformation — convert to real types. Strings become `datetime` and `Decimal`.
A tokenization failure means the input isn’t even valid CSV. A validation failure means it’s valid CSV but nonsense data. Different problems, different fixes.
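The four stages can be walked through on a single CSV row in a few lines — a toy sketch, not a production parser:

```python
from datetime import datetime
from decimal import Decimal

line = "2024-01-15T09:30:00+00:00,AAPL,B,100,185.50"

# 1. Tokenization: split the raw text into chunks
tokens = line.split(",")

# 2. Structure: attach field names to positions
headers = ["timestamp", "symbol", "side", "quantity", "price"]
record = dict(zip(headers, tokens))

# 3. Validation: domain rules on the structured record
assert record["side"] in ("B", "S"), "unknown side"
assert int(record["quantity"]) > 0, "quantity must be positive"

# 4. Transformation: convert strings to real types
trade = {
    "timestamp": datetime.fromisoformat(record["timestamp"]),
    "symbol": record["symbol"],
    "side": record["side"],
    "quantity": int(record["quantity"]),
    "price": Decimal(record["price"]),
}
```

If stage 1 blows up, the file isn't CSV; if stage 3 blows up, the CSV is fine but the data is nonsense.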
Example: Parsing Trade Records
Consider a simple CSV of historical trades:
```
timestamp,symbol,side,quantity,price
2024-01-15T09:30:00.123Z,AAPL,B,100,185.50
2024-01-15T09:30:00.456Z,AAPL,S,50,185.51
2024-01-15T09:30:01.789Z,MSFT,B,200,
```
A naive approach:
```python
import pandas as pd

df = pd.read_csv("trades.csv")
```
This “works” for the simplest, unambiguous formats, but even this short example hides problems:
- The third row has a missing price. Pandas silently fills it with `NaN`.
- Timestamps are strings, not datetime objects.
- Side is a string, not an enum.
- No validation that quantity is positive or price is reasonable.
A structured approach:
```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal, InvalidOperation
from enum import Enum


class Side(Enum):
    BUY = "B"
    SELL = "S"


@dataclass(frozen=True)
class Trade:
    timestamp: datetime
    symbol: str
    side: Side
    quantity: int
    price: Decimal


class ParseError(Exception):
    def __init__(self, line_number: int, field: str, message: str):
        self.line_number = line_number
        self.field = field
        self.message = message
        super().__init__(f"Line {line_number}, field '{field}': {message}")


def parse_trade(line_number: int, row: dict) -> Trade:
    # Timestamp: ISO 8601, accepting a trailing "Z" for UTC
    try:
        timestamp = datetime.fromisoformat(row["timestamp"].replace("Z", "+00:00"))
    except (ValueError, KeyError) as e:
        raise ParseError(line_number, "timestamp", str(e))

    # Symbol
    symbol = row.get("symbol", "").strip()
    if not symbol:
        raise ParseError(line_number, "symbol", "missing or empty")

    # Side
    try:
        side = Side(row["side"])
    except (ValueError, KeyError):
        raise ParseError(line_number, "side", f"invalid side: {row.get('side')}")

    # Quantity
    try:
        quantity = int(row["quantity"])
    except (ValueError, KeyError) as e:
        raise ParseError(line_number, "quantity", str(e))
    if quantity <= 0:
        raise ParseError(line_number, "quantity", "must be positive")

    # Price: catch InvalidOperation only, so our own ParseError isn't re-wrapped
    price_str = row.get("price", "").strip()
    if not price_str:
        raise ParseError(line_number, "price", "missing")
    try:
        price = Decimal(price_str)
    except InvalidOperation as e:
        raise ParseError(line_number, "price", str(e))
    if price <= 0:
        raise ParseError(line_number, "price", "must be positive")

    return Trade(timestamp, symbol, side, quantity, price)
```

Now the third row fails with a clear error: `Line 3, field 'price': missing`. You know exactly what went wrong and where.
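Driving the parser over a whole file is one more loop. To stay self-contained, the sketch below condenses `parse_trade` to two fields; the part that matters is the driver, which collects every `ParseError` and fails loudly instead of dropping bad rows:

```python
import csv
import io
from datetime import datetime
from decimal import Decimal, InvalidOperation


class ParseError(Exception):
    def __init__(self, line_number: int, field: str, message: str):
        super().__init__(f"Line {line_number}, field '{field}': {message}")


def parse_trade(line_number: int, row: dict) -> dict:
    """Condensed two-field stand-in for the full parser."""
    try:
        ts = datetime.fromisoformat(row["timestamp"].replace("Z", "+00:00"))
    except (ValueError, KeyError) as e:
        raise ParseError(line_number, "timestamp", str(e))
    price_str = (row.get("price") or "").strip()
    if not price_str:
        raise ParseError(line_number, "price", "missing")
    try:
        price = Decimal(price_str)
    except InvalidOperation as e:
        raise ParseError(line_number, "price", str(e))
    return {"timestamp": ts, "price": price}


def parse_file(text: str) -> list[dict]:
    """Parse every data row; report ALL bad rows instead of dropping them."""
    trades, errors = [], []
    # Number data rows from 1 (the header is excluded by DictReader)
    for line_number, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        try:
            trades.append(parse_trade(line_number, row))
        except ParseError as e:
            errors.append(e)
    if errors:
        raise ValueError("; ".join(str(e) for e in errors))
    return trades
```

Run on the sample file, this raises with every bad row listed, not just the first, so fixing a vendor file takes one pass instead of many.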
Don’t Repeat Your Parsing Logic
The `parse_trade` function works, but you’ll regret it when you need to parse `Quote`, `OHLCV`, and `OrderBook` records that all have timestamps, symbols, and prices. Copy-paste parsing code and you’ll fix the same timezone bug four times.
Extract each field parser into its own function:
```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal, InvalidOperation
from enum import Enum


class ParseError(Exception):
    def __init__(self, field: str, message: str):
        self.field = field
        self.message = message
        super().__init__(f"Field '{field}': {message}")


# Reusable field parsers

def parse_timestamp(raw: str, field: str = "timestamp") -> datetime:
    """Parse an ISO timestamp, accepting a trailing "Z" for UTC."""
    try:
        return datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except ValueError as e:
        raise ParseError(field, str(e))


def parse_symbol(raw: str, field: str = "symbol") -> str:
    """Parse and validate a ticker symbol."""
    symbol = raw.strip().upper()
    if not symbol:
        raise ParseError(field, "missing or empty")
    if not symbol.isalnum():
        raise ParseError(field, f"invalid characters in symbol: {symbol}")
    return symbol


class Side(Enum):
    BUY = "B"
    SELL = "S"


def parse_side(raw: str, field: str = "side") -> Side:
    """Parse a trade side indicator."""
    try:
        return Side(raw.strip().upper())
    except ValueError:
        raise ParseError(field, f"invalid side: {raw}")


def parse_positive_int(raw: str, field: str) -> int:
    """Parse a strictly positive integer."""
    try:
        value = int(raw)
    except ValueError as e:
        raise ParseError(field, str(e))
    if value <= 0:
        raise ParseError(field, "must be positive")
    return value


def parse_positive_decimal(raw: str, field: str) -> Decimal:
    """Parse a strictly positive decimal with full precision."""
    raw = raw.strip()
    if not raw:
        raise ParseError(field, "missing")
    # Catch InvalidOperation only, so our own ParseError isn't re-wrapped
    try:
        value = Decimal(raw)
    except InvalidOperation as e:
        raise ParseError(field, str(e))
    if value <= 0:
        raise ParseError(field, "must be positive")
    return value
```
Now `parse_trade` becomes a composition of these primitives:
```python
@dataclass(frozen=True)
class Trade:
    timestamp: datetime
    symbol: str
    side: Side
    quantity: int
    price: Decimal


def parse_trade(row: dict) -> Trade:
    return Trade(
        timestamp=parse_timestamp(row.get("timestamp", "")),
        symbol=parse_symbol(row.get("symbol", "")),
        side=parse_side(row.get("side", "")),
        quantity=parse_positive_int(row.get("quantity", ""), "quantity"),
        price=parse_positive_decimal(row.get("price", ""), "price"),
    )
```
And when you need to parse a `Quote` record, you reuse the same field parsers:

```python
@dataclass(frozen=True)
class Quote:
    timestamp: datetime
    symbol: str
    bid: Decimal
    ask: Decimal
    bid_size: int
    ask_size: int


def parse_quote(row: dict) -> Quote:
    return Quote(
        timestamp=parse_timestamp(row.get("timestamp", "")),
        symbol=parse_symbol(row.get("symbol", "")),
        bid=parse_positive_decimal(row.get("bid", ""), "bid"),
        ask=parse_positive_decimal(row.get("ask", ""), "ask"),
        bid_size=parse_positive_int(row.get("bid_size", ""), "bid_size"),
        ask_size=parse_positive_int(row.get("ask_size", ""), "ask_size"),
    )
```
Why bother? Because when you find a timezone bug in `parse_timestamp`, you fix it once and every record type—trades, quotes, OHLCV bars—gets the fix. You can unit test `parse_positive_decimal` with weird inputs (empty string, negative, scientific notation) without constructing entire trade records. And you build up a vocabulary of field parsers that encode your domain knowledge: a `parse_symbol` that validates against your actual ticker universe catches typos at parse time, not when your backtest explodes.
FIX Protocol: The Industry Standard You’ll Eventually Hit
If you’re doing anything with order routing or execution, you’ll encounter FIX. It’s a tag-value format that looks like this:
```
8=FIX.4.2|9=178|35=D|49=SENDER|56=TARGET|34=12|52=20240115-09:30:00.123|
11=ORDER123|21=1|55=AAPL|54=1|60=20240115-09:30:00.100|38=100|40=2|44=185.50|10=128|
```
Each segment is a tag-value pair: `tag=value|`. Important tags:

- `35`: Message type (`D` = New Order Single)
- `55`: Symbol
- `54`: Side (`1` = Buy, `2` = Sell)
- `38`: Quantity
- `44`: Price
Parsing FIX requires:
- Split on the delimiter (usually the SOH character, shown here as `|`)
- Parse tag-value pairs
- Validate required tags exist
- Validate checksum (tag 10)
- Transform to domain objects
```python
def parse_fix_message(raw: str, delimiter: str = "|") -> dict[int, str]:
    """Parse a FIX message into a tag-value dictionary."""
    result = {}
    for segment in raw.split(delimiter):
        if "=" not in segment:
            continue
        tag_str, value = segment.split("=", 1)
        try:
            tag = int(tag_str)
        except ValueError:
            raise ParseError(0, f"tag:{tag_str}", "invalid tag number")
        result[tag] = value
    return result


def validate_new_order(tags: dict[int, str]) -> None:
    """Validate required fields for New Order Single (35=D)."""
    required = {
        11: "ClOrdID",
        55: "Symbol",
        54: "Side",
        38: "OrderQty",
        40: "OrdType",
    }
    for tag, name in required.items():
        if tag not in tags:
            raise ParseError(0, name, f"required tag {tag} missing")
```
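Step 4, checksum validation, deserves its own function. In FIX, the checksum (tag 10) is the byte sum of every character up to and including the SOH immediately before the `10=` field, modulo 256, written as three digits. A sketch that keeps `|` as the stand-in for SOH and swaps it back before summing:

```python
def validate_checksum(raw: str, delimiter: str = "|") -> None:
    """Verify the FIX checksum (tag 10)."""
    marker = delimiter + "10="
    idx = raw.rfind(marker)
    if idx == -1:
        raise ValueError("checksum tag 10 missing")
    # Everything up to and including the delimiter before "10=" is checksummed
    body = raw[: idx + 1]
    declared = raw[idx + len(marker):].split(delimiter)[0]
    # On the wire the delimiter is SOH (0x01); restore it before summing bytes
    wire_body = body.replace(delimiter, "\x01")
    computed = sum(wire_body.encode("ascii")) % 256
    if f"{computed:03d}" != declared:
        raise ValueError(f"checksum mismatch: computed {computed:03d}, declared {declared}")
```

A failed checksum usually means a truncated or corrupted message; reject it rather than parse a partial order.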
Parser Combinators: When Field Parsers Aren’t Enough
Field parsers work great when your format is already tokenized (CSV rows, JSON objects). For binary protocols or weird text formats where you’re parsing character-by-character, parser combinators give you composable building blocks.
```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")
U = TypeVar("U")


@dataclass
class ParseResult(Generic[T]):
    value: T
    remaining: str


Parser = Callable[[str], ParseResult[T] | None]


def literal(expected: str) -> Parser[str]:
    """Parse an exact string."""
    def parse(input: str) -> ParseResult[str] | None:
        if input.startswith(expected):
            return ParseResult(expected, input[len(expected):])
        return None
    return parse


def map_parser(parser: Parser[T], f: Callable[[T], U]) -> Parser[U]:
    """Transform a parser's output."""
    def parse(input: str) -> ParseResult[U] | None:
        result = parser(input)
        if result is None:
            return None
        return ParseResult(f(result.value), result.remaining)
    return parse


def sequence(p1: Parser[T], p2: Parser[U]) -> Parser[tuple[T, U]]:
    """Run two parsers in sequence."""
    def parse(input: str) -> ParseResult[tuple[T, U]] | None:
        r1 = p1(input)
        if r1 is None:
            return None
        r2 = p2(r1.remaining)
        if r2 is None:
            return None
        return ParseResult((r1.value, r2.value), r2.remaining)
    return parse


def choice(p1: Parser[T], p2: Parser[T]) -> Parser[T]:
    """Try the first parser; fall back to the second."""
    def parse(input: str) -> ParseResult[T] | None:
        result = p1(input)
        if result is not None:
            return result
        return p2(input)
    return parse
```
These primitives compose. Parse a side indicator:
```python
parse_buy = map_parser(literal("B"), lambda _: Side.BUY)
parse_sell = map_parser(literal("S"), lambda _: Side.SELL)
parse_side = choice(parse_buy, parse_sell)
```
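To show the payoff, here's a self-contained sketch that composes a parser for a compact order token like `B100`. It repeats the combinators in condensed (untyped) form and adds one primitive, `digits`, that isn't in the set above:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Side(Enum):
    BUY = "B"
    SELL = "S"


@dataclass
class ParseResult:
    value: object
    remaining: str


Parser = Callable[[str], "ParseResult | None"]


def literal(expected: str) -> Parser:
    def parse(input: str):
        if input.startswith(expected):
            return ParseResult(expected, input[len(expected):])
        return None
    return parse


def map_parser(parser: Parser, f: Callable) -> Parser:
    def parse(input: str):
        result = parser(input)
        return None if result is None else ParseResult(f(result.value), result.remaining)
    return parse


def choice(p1: Parser, p2: Parser) -> Parser:
    # ParseResult is always truthy, so `or` falls through only on None
    return lambda input: p1(input) or p2(input)


def sequence(p1: Parser, p2: Parser) -> Parser:
    def parse(input: str):
        r1 = p1(input)
        if r1 is None:
            return None
        r2 = p2(r1.remaining)
        if r2 is None:
            return None
        return ParseResult((r1.value, r2.value), r2.remaining)
    return parse


def digits(input: str):
    """Extra primitive: consume a run of ASCII digits."""
    i = 0
    while i < len(input) and input[i].isdigit():
        i += 1
    return ParseResult(input[:i], input[i:]) if i else None


parse_side = choice(map_parser(literal("B"), lambda _: Side.BUY),
                    map_parser(literal("S"), lambda _: Side.SELL))
parse_qty = map_parser(digits, int)

# A compact order token like "B100" parses to (Side.BUY, 100)
parse_order = sequence(parse_side, parse_qty)
```

Each combinator returns `None` on failure and hands the unconsumed remainder to the next parser, which is what makes the pieces snap together.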
Testing Parsers Without Writing Test Cases by Hand
Here’s the trick that changed how I test parsers: if you can generate random valid data and serialize it, your parser should be able to parse it back. Roundtrip testing. Generate a thousand random trades, serialize each one to CSV, parse it back, check you get the same trade.
This is property-based testing. Instead of writing test_parse_trade_with_valid_input() and test_parse_trade_with_missing_price() by hand, you describe what valid data looks like and let the test framework generate cases. Hypothesis (Python) and QuickCheck (Haskell) are the standard tools.
Here’s a complete example. Save it as `test_trade_parser.py` and run with `pytest -v`:
```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal, InvalidOperation
from enum import Enum
import csv
import io

from hypothesis import given, strategies as st


# Domain types
class Side(Enum):
    BUY = "B"
    SELL = "S"


@dataclass(frozen=True)
class Trade:
    timestamp: datetime
    symbol: str
    side: Side
    quantity: int
    price: Decimal


class ParseError(Exception):
    def __init__(self, line_number: int, field: str, message: str):
        self.line_number = line_number
        self.field = field
        self.message = message
        super().__init__(f"Line {line_number}, field '{field}': {message}")


# Parser
def parse_trade(line_number: int, row: dict) -> Trade:
    # Timestamp
    try:
        timestamp = datetime.fromisoformat(row["timestamp"].replace("Z", "+00:00"))
    except (ValueError, KeyError) as e:
        raise ParseError(line_number, "timestamp", str(e))

    # Symbol
    symbol = row.get("symbol", "").strip()
    if not symbol:
        raise ParseError(line_number, "symbol", "missing or empty")

    # Side
    try:
        side = Side(row["side"])
    except (ValueError, KeyError):
        raise ParseError(line_number, "side", f"invalid side: {row.get('side')}")

    # Quantity
    try:
        quantity = int(row["quantity"])
    except (ValueError, KeyError) as e:
        raise ParseError(line_number, "quantity", str(e))
    if quantity <= 0:
        raise ParseError(line_number, "quantity", "must be positive")

    # Price
    price_str = row.get("price", "").strip()
    if not price_str:
        raise ParseError(line_number, "price", "missing")
    try:
        price = Decimal(price_str)
    except InvalidOperation as e:
        raise ParseError(line_number, "price", str(e))
    if price <= 0:
        raise ParseError(line_number, "price", "must be positive")

    return Trade(timestamp, symbol, side, quantity, price)


# Serialization (for roundtrip testing)
def serialize_trade(trade: Trade) -> str:
    """Convert a Trade to a CSV row string."""
    return (f"{trade.timestamp.isoformat()},{trade.symbol},"
            f"{trade.side.value},{trade.quantity},{trade.price}")


def deserialize_row(csv_line: str) -> dict:
    """Parse a CSV line into a dictionary."""
    reader = csv.DictReader(
        io.StringIO("timestamp,symbol,side,quantity,price\n" + csv_line)
    )
    return next(reader)


# Property-based test
trade_strategy = st.builds(
    Trade,
    timestamp=st.datetimes(),
    symbol=st.sampled_from(["AAPL", "MSFT", "GOOGL"]),
    side=st.sampled_from(list(Side)),
    quantity=st.integers(min_value=1, max_value=10000),
    price=st.decimals(min_value=Decimal("0.01"), max_value=Decimal("10000.00"), places=2),
)


@given(trade_strategy)
def test_roundtrip(trade: Trade):
    """Serializing then parsing should return the original trade."""
    serialized = serialize_trade(trade)
    parsed = parse_trade(1, deserialize_row(serialized))
    assert parsed == trade


if __name__ == "__main__":
    test_roundtrip()
    print("All tests passed!")
```
Hypothesis generates 100 random trades by default (you can crank it up for CI). The roundtrip property—serialize then parse returns the original—catches bugs you’d never think to write test cases for.
Testing Field Parsers in Isolation
The real payoff comes when you test field parsers individually. Edge cases that would require dozens of handwritten trade records become one-liners:
```python
from datetime import datetime
from decimal import Decimal

import pytest
from hypothesis import given, strategies as st

# Import the field parsers from wherever you defined them;
# adjust the module name to match your project.
from field_parsers import (
    ParseError,
    parse_positive_decimal,
    parse_symbol,
    parse_timestamp,
)


# Test parse_positive_decimal handles valid inputs
@given(st.decimals(min_value=Decimal("0.01"), max_value=Decimal("1000000"), places=4))
def test_positive_decimal_roundtrip(value: Decimal):
    """Any positive decimal should parse back to the same value."""
    result = parse_positive_decimal(str(value), "price")
    assert result == value


# Test parse_positive_decimal rejects invalid inputs
@given(st.decimals(max_value=Decimal("0"), allow_nan=False, allow_infinity=False))
def test_positive_decimal_rejects_nonpositive(value: Decimal):
    """Zero and negative values should raise ParseError."""
    with pytest.raises(ParseError):
        parse_positive_decimal(str(value), "price")


# Test parse_symbol normalizes and validates
@given(st.text(alphabet="abcdefghijklmnopqrstuvwxyz", min_size=1, max_size=5))
def test_symbol_uppercases(raw: str):
    """Symbols should be normalized to upper case."""
    result = parse_symbol(raw)
    assert result == raw.upper()
    assert result.isupper()


# Test parse_timestamp with edge cases
@given(st.datetimes())
def test_timestamp_roundtrip(dt: datetime):
    """Any datetime should survive an isoformat roundtrip."""
    result = parse_timestamp(dt.isoformat())
    assert result == dt
```
This catches edge cases that record-level tests miss: decimal precision loss, timezone quirks, symbol validation. Test a field parser once, reuse it across every record type.
What I’ve Learned the Hard Way
Fail loud. A parser that silently drops malformed records is worse than one that crashes. I’d rather get paged at 2 AM because the parser rejected a file than discover six months later that we’ve been missing 3% of trades.
Use `Decimal`, not `float`. Financial data needs exact arithmetic. Most decimal prices, `185.51` included, have no exact binary float representation, so every arithmetic step rounds. In a backtest over millions of trades, that drift adds up.
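The drift is easy to demonstrate with nothing but the standard library:

```python
from decimal import Decimal

# Ten ticks of 0.1 as floats: binary rounding error accumulates
float_sum = sum([0.1] * 10)
print(float_sum)    # 0.9999999999999999, not 1.0

# The same ten ticks as Decimal: exact
decimal_sum = sum([Decimal("0.1")] * 10)
print(decimal_sum)  # 1.0
```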
Be explicit about timezones. Every timestamp in your system should carry a timezone. If the source data doesn’t specify, document your assumption (“we assume Eastern time for this vendor”) and convert to UTC at the boundary.
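With the standard library's `zoneinfo`, the conversion at the boundary is short. The vendor and its naive-Eastern convention here are hypothetical:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")


def parse_vendor_timestamp(raw: str) -> datetime:
    """This (hypothetical) vendor sends naive local timestamps; we document
    the assumption (Eastern time) and convert to UTC at the boundary."""
    naive = datetime.fromisoformat(raw)
    return naive.replace(tzinfo=EASTERN).astimezone(timezone.utc)


# 09:30 Eastern on a January day (EST, UTC-5) is 14:30 UTC
print(parse_vendor_timestamp("2024-01-15T09:30:00"))  # 2024-01-15 14:30:00+00:00
```

Note that `zoneinfo` handles DST for you: the same 09:30 in July (EDT, UTC-4) converts to 13:30 UTC.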
Parse at the boundary. Validate data when it enters your system, not when you use it. If a bad record slips into your database, you’ll spend hours tracing where it came from.
Version your parsers. Data formats drift. The file you get today might not match the file you got last year. I keep parser versions and document which date ranges each handles—saves a lot of “why won’t this old file parse?” debugging.
The Payoff
Good parsing isn’t glamorous work. But every hour you spend building a proper parser saves you days of debugging corrupted backtests, tracking phantom signals, or explaining to someone why the P&L report doesn’t match.
Parse strictly. Fail loudly. Use Decimal. Compose your field parsers. Test with property-based tools. Your future self will thank you.