One topic I often see shunned or misunderstood by developers of all walks, is parsing. It is associated with writing compilers. The developers that feel queasy or disinterested hearing the term fall into a number of categories including:
those who studied computer science in college and either hated their compilers class or weren't shown its relevance;
those who are self-taught and think this is an elitist mechanism to exclude the self-taught from conversations;
those who think regular expressions solve all their problems so why learn about something seemingly antiquated?
If I am honest, earlier in my career I fell into the second category. Today I am still not entirely convinced every conversation about parsers or compiler design is completely devoid of those elitist motives.
However, since re-discovering and better understanding Haskell and well-type functional programming over the last nine years, I have found this topic not just fascinating as an intellectual endeavor but also very practical to understand for work purposes.
This post is an attempt to synthesize some of the most interesting ideas that can help the non-academic programmer do any one of the following (depending on the situation):
validate formats with less bugs and sometimes (when needed) more context;
identify what stage/phase of a compiler or validator an error occurs which can provide greater implicit insight and understanding to correct the core problem;
more simply troubleshoot tools you depend on that leverage some form of parsing such as fixing bugs inside of them or their dependencies;
and arguably build better designed applications or tooling.
First, a brief word about compilers
You might be programming in a language ecosystem that supports feeding a source file at the command prompt which then gets interpreted on the fly to produce a representation that the runtime can execute.
I have news for those developers: your "interpreter" is just compiling on-the-fly.
Modern compilers are often built in what is commonly referred to as a multi-pass design. LLVM is one example of a multi-pass compiler. It has multiple pass rounds where it consumes an input and produces an output in a composable way.
For instance, let's imagine a simple model to represent a modern compiler:
Read: read in a source file path which produces the text read in from that path
Tokenize: the text read from the file can be used to find relevant tokens or markers that denote structure of the source
Syntax Analysis: given a semi-structured tokenized structure of the source we can analyze the syntax into terms of a parse tree (AST)
Semantic Analysis: given a parse tree of terms we can analyze semantics of the terms to ensure the parse tree adheres to the semantics of the language. Also returns a validated AST.
Optimizer (intermediate code generation): from this AST we might be able to simplify or reduce it based on known optimizations that are known to hold
Code Generation: after all optimization passes we might then translate the resulting AST to a new format. This might be another source format, or binary, or - heck - multiple targets.
The concerns of parsing are contained in lexical, syntax and semantic analysis phases above. This is what we will dive into more in the next section.
To make the above more concrete we might define functions with the following opaque but concrete types (for now):
-- 0. reading contents of file readFile :: FilePath -> EitherT Error IO Text -- 1. tokenize flattened text source tokenize :: Text -> EitherT Error Identity Token -- 2. parse syntax from tokenized form parseSyntax :: Token -> EitherT Error Identity Syntax -- 3. parse semantics of parse tree returned from prior phase parseSemantics :: Syntax -> EitherT SemanticError Identity Syntax -- 4. from AST to AST optimize :: Syntax -> Syntax -- 5. from AST to Targets generateBinary :: Sytnax -> ByteString generateJS :: Syntax -> Text generateTargets :: Syntax -> Target (ByteString, ByteString, Text) generateTargets ast = AllTargets <$> generateBinary ast <*> generateJS ast
In the last section we looked at the most basic modern compiler model.