Susan Potter

Parsing for the non-academic programmer

Sun September 9, 2020


One topic I often see shunned or misunderstood by developers of all walks, is parsing. It is associated with writing compilers. The developers that feel queasy or disinterested hearing the term fall into a number of categories including:

If I am honest, earlier in my career I fell into the second category. Today I am still not entirely convinced every conversation about parsers or compiler design is completely devoid of those elitist motives.

However, since re-discovering and better understanding Haskell and well-type functional programming over the last nine years, I have found this topic not just fascinating as an intellectual endeavor but also very practical to understand for work purposes.

This post is an attempt to synthesize some of the most interesting ideas that can help the non-academic programmer do any one of the following (depending on the situation):

First, a brief word about compilers

You might be programming in a language ecosystem that supports feeding a source file at the command prompt which then gets interpreted on the fly to produce a representation that the runtime can execute.

One common refrain I have heard from some developers with varying levels of years nof experience is that this interpretation means that compilers are irrelevant for their needs. They suggest that they don't need to worry about compilers because in JavaScript/Ruby/Python (and others) they can just have their source code executed. So, why should they learn about compilers?

I have news for those developers: your "interpreter" is just compiling on-the-fly.

A compiler is just a program (which could be part of a bigger program) that translates some input source language to a target language. In fact, if you use features from a more modern version of JavaScript or any version of TypeScript, you are running a "transpilation" process which is a compiler by another name to target more web browsers or Node.js runtimes.

Modern compilers are often built in what is commonly referred to as a multi-pass design. LLVM is one example of a multi-pass compiler. It has multiple pass rounds where it consumes an input and produces an output in a composable way.

For instance, let's imagine a simple model to represent a modern compiler:

  1. Read: read in a source file path which produces the text read in from that path

  2. Tokenize: the text read from the file can be used to find relevant tokens or markers that denote structure of the source

  3. Syntax Analysis: given a semi-structured tokenized structure of the source we can analyze the syntax into terms of a parse tree (AST)

  4. Semantic Analysis: given a parse tree of terms we can analyze semantics of the terms to ensure the parse tree adheres to the semantics of the language. Also returns a validated AST.

  5. Optimizer (intermediate code generation): from this AST we might be able to simplify or reduce it based on known optimizations that are known to hold

  6. Code Generation: after all optimization passes we might then translate the resulting AST to a new format. This might be another source format, or binary, or - heck - multiple targets.

The concerns of parsing are contained in lexical, syntax and semantic analysis phases above. This is what we will dive into more in the next section.

To make the above more concrete we might define functions with the following opaque but concrete types (for now):

-- 0. reading contents of file
readFile :: FilePath -> EitherT Error IO Text

-- 1. tokenize flattened text source
tokenize :: Text -> EitherT Error Identity Token

-- 2. parse syntax from tokenized form
parseSyntax :: Token -> EitherT Error Identity Syntax

-- 3. parse semantics of parse tree returned from prior phase
parseSemantics :: Syntax -> EitherT SemanticError Identity Syntax

-- 4. from AST to AST
optimize :: Syntax -> Syntax

-- 5. from AST to Targets
generateBinary :: Sytnax -> ByteString
generateJS     :: Syntax -> Text

generateTargets     :: Syntax -> Target (ByteString, ByteString, Text)
generateTargets ast = AllTargets <$> generateBinary ast <*> generateJS ast

Parsing Overview

In the last section we looked at the most basic modern compiler model.