Algebraic Data Types in TypeScript [software]

Note: the code for the TypeScript (to help you follow along) is available here: Encoding Algebraic Data Types in TypeScript code

Background

Recently I've been reviewing TypeScript code at work. A common pattern I've observed in pull requests and some open source projects is a type that looks like the following:

type FooTag = 'Bar' | 'Baz' | 'Qux';

type Foo<A> = {
  _tag  : FooTag,
  /* Required for the 'Bar' tagged values of Foo */
  bar?  : string,
  /* Required for the 'Baz' tagged values of Foo */
  baz?  : number,
  /* Required for the 'Qux' tagged values of Foo */
  qux?  : boolean,
  /* Optional for the 'Qux' tagged values of Foo */
  quux? : A
}

Notice that Foo is a parameterized type (also called generic type) which means it places a value of type A which is determined by the caller somewhere in its structure.

In a business application we might place a Currency enumeration (or another domain type) that might look like the following:

type Currency  =
  'USD'
  | 'JPY'
  | 'EUR'
  | 'CHF'

You might notice that (except for the _tag attribute value) all the other attributes of type Foo are optionally required to be present. The above type declaration would allow us to construct the following values that typecheck against this type:

// According to the documentation, this is an INVALID Bar kind of Foo
const bar : Foo<Currency> = { _tag : 'Bar', baz : 123 };

// This is an INVALID Baz kind of Foo
const baz : Foo<Currency> = { _tag : 'Baz' }

// This is an INVALID Qux without an optional quxx attribute
const qux : Foo<Currency> = { _tag : 'Qux', qux : true, bar : 'WAT?' }

// According to the documentation this is an INVALID Qux kind of Foo
const qux2 : Foo<Currency> = { qux : false, quux : 'JPY' }

Here only the last invalid construction (according to the documentation) is raised as a type error by TypeScript. All the other invalid forms of the data get past tsc with no noise being made.

Remembering Algebraic Data Types (ADTs)

For the remainder of this blog post, I'll assume you know the basics of algebraic data types before continuing with this post, but if you need a refresher, I wrote about algebraic data types while ago.

Inconsistencies of the single record approach

The invalid data constructions above (assuming the documentation is correct) did not get caught by the tsc typechecker even though the data was badly constructed. This adds overhead for the developer using these types as inputs to their functions or outputs from their functions.

I often see extra logic in utility or combinator functions that accommodate bad constructions scattered over all the impacted functions like this:

// I did not want to introduce fp-ts to readers who are new to it
// as I wanted the focus to be on encoding algebraic data types in
// TypeScrpt, but ordinarily I would return fp-ts's Option<Foo<A>> or
// Either<AddError, Foo<A>> depending on whether the calling code
// might need to differentiate between error cases.
const addBars = <A extends unknown>(bar0 : Foo<A>, bar1 : Foo<A>) : Foo<A> | null => {
  const tags : string = bar0._tag + "/" + bar1._tag;
  switch (tags) {
    case "Bar/Bar":
      // In an ill-constructed Bar value the bar attribute might not
      // be set because the bar attribute was optional.
      return (bar0.bar && bar1.bar) ? { _tag : 'Bar', bar : bar0.bar + bar1.bar } : null;
    default: return null
  }
};

Using this in ts-node we would see:

>>> // hits path that returns a new Bar value appending the bar attribute
>>> addBars({ _tag: 'Bar', bar: 'hello'}, { _tag: 'Bar', bar: ' world'})
{ "_tag": "Bar", "bar": "hello world" }

>>> // hits the ternary else null path
>>> addBars({ _tag: 'Bar', baz: 1234}, { _tag: 'Bar', bar: 'wow' })
null

>>> // hits the default path that returns null
>>> addBars({ _tag: 'Baz', baz: 1234}, { _tag: 'Bar', bar: 'wow' })
null

The clutter in addBars adds up across many functions and is not isolated to one area of the code or we end up writing partial functions that ignore bad constructions.

We could offer clients of our library "smart constructors" which is a fancy way to say "functions that create well-formed data constructions". Something like the following:

const mkBar = (s : string) : Foo<Currency> => ({ _tag : 'Bar', bar : s });
const mkBaz = (n : number) : Foo<Currency> => ({ _tag : 'Baz', baz : n });
const mkQux = (qux : boolean, quux? : Currency) : Foo<Currency> => ({ '_tag' : 'Qux', qux, quux });

We can construct valid data from these smart constructors like so:

const foo0 = mkBar('bla bla');
const foo1 = mkBaz(12345);
const foo2 = mkQux(true);
const foo3 = mkQux(false, 'USD');

All these values get inferred as type Foo as expected (due to the type annotation for the return type of the smart constructors).

This certainly helps keep clients of our library construct valid forms of Foo but the type declaration can still be used by rogue callers or those that don't bother reading the documentation (myself included). We could argue about whether it is worth our time nannying these developers but for the sake of this blog post, I will assume we want to do the best we can (within reason) to eliminate bad constructions of type Foo so that we don't need to add extra logic to our functions that consume these values to check they are well-formed according to how we intended (read the inline documentation in the type declaration).

Native Algebraic Data Types

Stepping back this is how we might define data types in a language that natively supports algebraic data types satisfying the documentation's intent.

Here is a datatype declaration in PureScript:

data Foo a
  = Bar { bar :: String }
  | Baz { baz :: Int }
  | Qux { qux :: Boolean, quux :: Maybe a }

What it says is that we have one type named Foo and there are only three ways we can construct a valid value of Foo according to this type. Namely:

foo0 = Bar { bar : "wowser" }
foo1 = Baz { baz : 3000 }
foo2 = Qux { qux : true, quux : Nothing }

Similarly, the application type Currency used in the TypeScript example looks like this in PureScript:

data Currency = USD | CHF | EUR | JPY

With the declaration of Foo a in PureScript, we are not able to construct the equivalent to the invalid forms that escaped the TypeScript typechecker unscathed above.

Here is us attempting the equivalent of { _tag : 'Bar', baz : 123 }. It yields the following type error:

>>> invalid0 = Bar { baz : 123 }

  Could not match type

    ()

  with type

    ( baz :: Int
    | t0
    )


while trying to match type ()
  with type ( baz :: Int
            | t0
            )
while checking that expression { baz: 123
                               }
  has type { bar :: String
           }
in value declaration invalid0

where t0 is an unknown type

Now attempting the PureScript equivalent of { _tag : 'Baz' } does not give us a type error but will infer the type as { baz :: Int } -> Foo Currency.

Annotating with the type of Foo Currency yields the following error from the PureScript typechecker:

>>> invalid1 = Baz :: Foo Currency
  Could not match type

    Function
      { baz :: Int
      }

  with type

    Foo


while trying to match type { baz :: Int
                           }
                           -> Foo t0
  with type Foo Currency
while checking that expression Baz
  has type Foo Currency
in value declaration invalid1

where t0 is an unknown type

Attempting to do the equivalent of { _tag : 'Qux', qux : true, bar : 'WAT?' } in PureScript with the sum of products algebraic data type that describes our intent (based on documentation) yields:

>>> invalid2 = Qux { qux : true, bar : "WAT?" }

 Could not match type

    ()

  with type

    ( bar :: String
    | t0
    )


while trying to match type ()
  with type ( bar :: String
            | t0
            )
while checking that expression { qux: true
                               , bar: "WAT?"
                               }
  has type { quux :: Maybe String
           , qux :: Boolean
           }
in value declaration invalid2

where t0 is an unknown type

We can already see that describing our Foo a parameterized type in a sum of products in PureScript keeps our troublesome callers out of much more trouble than our record-with-many-optional-attributes encoding in TypeScript at the top.

TypeScript encodings of non-native algebraic data types

So how can we encode this kind of algebraic data type in TypeScript which doesn't natively support them (yet)?

We will explore some of the design space and note the progressive solution's trade-offs as we go along, attempting to improve on the faithfulness of the encoding to PureScript's native algebraic data type's characteristics.

Our first attempt might be the following:

type Bar2    = { _tag : 'Bar', bar : string }
type Baz2    = { _tag : 'Baz', baz : number }
type Qux2<A> = { _tag : 'Qux', qux : boolean, quux? : A }
type Foo2<A> = Bar2 | Baz2 | Qux2<A>

We can then determine which data construction was used via a switch statement like so:

const switcher = (foo : Foo2<Currency>) : string | Currency | undefined => {
  switch (foo._tag) {
    case 'Bar': return 'fizz';
    case 'Baz': return 'buzz';
    case 'Qux': return foo.quux;
    default: return undefined;
  }
};

We can no longer mix and match attributes that don't correspond to the data constructor "tag" like we could with the "tagged" single record. However, we can pass a null value for one of the constructor arguments instead of giving it the required Currency value expected. We also probably still want to provide smart constructors. Also note that we now have four types: Bar, Baz, Qux, and Foo not just Foo. This point may not be that important for all library designers.

However, let's see I we can prevent the nullability of the constructor attributes:

type Bar3 = { _tag : 'Bar', bar : string }
type Baz3 = { _tag : 'Baz', baz : number }
type Qux3<A> = { _tag : 'Qux', qux : boolean, quux? : NonNullable<A> }
type Foo3<A> = Bar3 | Baz3 | Qux3<A>

Note that:

primitive types like string, number, boolean cannot accept a null value.
we only have to specifically annotate the parameterized type for the optional quuz attribute as NonNullable to get the semantics we want since it can be anything.

Now if we wanted to eliminate the possibility of constructing a value of type Bar4 instead of unifying all the values under Foo4 then we could update the type declaration like so:

interface Bar4 { _tag : 'Bar', bar : string }
interface Baz4 { _tag : 'Baz', baz : number }
interface Qux4<A> { _tag : 'Qux', qux : boolean, quux? : NonNullable<A> }
type Foo4<A> = Bar4 | Baz4 | Qux4<A>

const mkBar4 = <A extends unknown>(s : string) : Foo4<A> => ({ _tag : 'Bar', bar : s });
const mkBaz4 = <A extends unknown>(n : number) : Foo4<A> => ({ _tag : 'Baz', baz : n });
const mkQux4 = <A extends unknown>(qux : boolean, quux? : NonNullable<A>) : Foo4<A> => ({ '_tag' : 'Qux', qux, quux });

Do not fear, we can still create a Qux4 construction of Foo4 with no quux attribute, but if one is given, it cannot be null. This is our intent based on the original documentation.

This ensures we don't create values of types Bar4, Baz4, or Qux4 instead of Foo4 now giving us the same basic guarantees as PureScript's sum of products algebraic data type definition for data construction without pattern matching or exhaustivity checking.

>>> mkBar4('bla bla bla')
{ "_tag": "Bar", "bar": "bla bla bla" }

>>> mkBaz4(3000);
{ "_tag": "Baz", "baz": 3000 }

>>> mkQux4(false, 'Coronavirus')
{ "_tag": "Qux", "qux": false, "quux": "Coronavirus" }

>>> mkQux4<MyType>(false)
{ "_tag": "Qux", "qux": false }

The _tag values in each case correspond to the data constructor of Foo so we can switch on it when we need to but are the types inferred what we expect (all values infer as Foo)?

First let's see what happens when we tell the typechecker they should be Bar4, Baz4, and Qux4 respectively.

>>> const v0 : Bar4<Currency> = mkBar4('bla bla bla');
Type 'Foo4' is not assignable to type 'Bar4'.
  Property 'bar' is missing in type 'Baz4' but required in type 'Bar4'.(2322)

>>> const v1 : Baz4<Currency> = mkBaz4(3000);
Type 'Foo4' is not assignable to type 'Baz4'.
  Property 'baz' is missing in type 'Bar4' but required in type 'Baz4'.(2322)

>>> const v2 : Qux4<Currency> = mkQux4<string>(false, 'Coronavirus');
Type 'Foo4' is not assignable to type 'Qux4'.
  Property 'qux' is missing in type 'Bar4' but required in type 'Qux4'.(2322)

Removing explicit type annotations we see TypeScript infers type Foo2<A> which is ideally what we want for a faithful encoding of a sum type, which is "closed" (some familiar with JVM languages may say sealed) over a specified number of data constructions at time of declaration (it is not extensible later).

In languages that support algebraic data types natively, like PureScript and Haskell, the typechecker can do exhaustivity checking on our pattern matching clauses. This one kind of check early in development can eliminate a large category of common runtime defects very early in the development process converting them to compile-time checks yielding with fast developer feedback. This means the developer retains their full context of the bugfix or feature being worked on. This is hugely valuable on several fronts:

developers can fix these partial function problems closer to the type they were created
reduces developer frustration
less likely to ship partial logic to environments higher than local development, saving everyone else's time too.

Next, we will attempt to trick TypeScript into providing exhaustivity checking for our switch-case statements that we are using as a poor substitute for pattern matching.

Can we get tsc to do exhaustivity checking on our _tag switch statements?

In the earlier switch statement, the default branch returned a null which always irritated me because we had to update our return type to include null. Now let's remove the return null statement and tighten the return type to string:

const matcher = (foo : Foo4<Currency>) : string => {
  switch (foo._tag) {
    case 'Bar': return 'sand';
    case 'Baz': return 'zap';
    // Uncomment the following line to get tsc to succeed typechecking
    //case 'Qux': return 'xoom';
  }
};

When we comment one or more of the possible _tag cases out of our switch statement the TypeScript compiler gives us the error:

Function lacks ending return statement and return type does not include 'undefined'.

Uncommenting the missing switch statement gives us typechecked code.

We now have some form of exhaustivity checking from this encoding.

Alternative encodings of algebraic data types in TypeScript

This is just one way to encode algebraic data types in TypeScript with the characteristics we care most about and I hope this post gives you a launchpad to explore the design space further.

The following are other approaches or explorations that accomplish the same or similar goals in TypeScript without native algebraic data types:

Parting Thoughts

I hope I've shown that algebraic data types offer developers a way to build up descriptions of data from the small up to the very big using the same basic construct.

Some languages like PureScript (plus Haskell, Fsharp, OCaml) offer a direct or native encoding of algebraic data types (ADTs) that lets the developer focus on modeling their domain for their application's needs.

I also demonstrated that even in languages that do not have a direct or native encoding of algebraic data types we can still describe data in a compositional way with some extra boilerplate needed to gain the ideal characteristics of ADTs, specifically in TypeScript.

My 2012 article about Algebraic Data Types showed Scala using an indirect encoding with sealed traits and case classes that was very common in Scala2 codebases. Since then Scala3 provides more direct encodings of ADTs though still not as native as ML-style languages support.

By building up a rich domain model from smaller ADTs we keep things compositional which makes it easier for developers to break down solutions to big problems into smaller parts that can be composed together without magic or fear and the ability to test and validate the smaller parts of the solution independently. This helps us keep our sanity the longer we maintain our codebases.

Have fun and exploit algebraic data types in your language (or consider using a language that more directly encodes this idea so you can worry less about the boilerplate and more about modeling your application domain).

If you enjoyed this content, please consider sharing this link with a friend, following my GitHub or LinkedIn accounts, or subscribing to my RSS feed.