Monadic Parser Combinators with Scala 3

This tutorial oughts to illustrate how categories and effect-based programming could simplify some kind of abstractions, using the Monadic Parser Combinators as a vehicle for illustration.

This blog post is heavily based on the works by Graham Huton in his Programming Haskell and Monadic Parser Combinators by him and Erik Meijer.

On my previous blog, I tried to convey the idea of categories and combinators by means of using two of the most common Monads: Option and List.

These two monads illustrated how methods such as map or flatMap can be abstracted into one of hundreds of categories, and today’s post will show how you can implement a hierarchy for any data-type you define.

Our vehicle for exposition will be the Parser[?] monad.

What’s a Parser?

A parser is an abstraction over something that reads a string of characters, and tokenize them so we have a hierarchical structure with some ’meaning’. This meaning might be something as complicated as an AST for a programming language, but it can be something as simple as an arithmetic expression – which is what we’re going to do for today.

Hence, a Parser could be defined as a function from a string of characters, to something with a meaning for us, say an arithmetic expression tree. But please notice there are two possibilities: Our parser is successful and gives us one or more valid ways to interpret what we’ve just parsed, and returns to us the string that it didn’t parse (perhaps will do it in the next round), or it fails and there are zero valid syntax trees that are helpful to us from the string it parsed.

In Scala 3, I’ll define it like this, but using an Option instead of a List because I found it easier to follow:

type Parser[A] = String => Option[(A, String)]

Now, how on earth would one implement those so called categories into something like this? It’s not even a data structure!’ You might say. Well… stay tuned.

Basic Parsers

We’re going to start up by defining a couple of parsers so basic we won’t need the notion of categories or monads just yet.

The first of them is the empty parser: It ignores its input and produce nothing. Seems kinda dumb right now, but it’ll have its usage later.

def empty[A]: Parser[A] = _ /* input */ => None

Now let’s see the item or char parser, which takes the first char in the input and puts the tail with the rest of the ‘unprocessed’ string.

def item[A]: Parser[Char] = input =>
  input match
    case "" => None
    case s  => Some((s.head, s.tail))

What it means to run those parsers. Well, as you can notice, parsers are nothing more than a function, so to use them you just have to:

empty(“hi”) // 1: None
item(“Scala rocks”) // 2: Some(‘S’, “cala rocks”)

You’re seeing this taking shape. Now let’s make our parser our monad so I can show you what’s all with this monadic combinator thing.

Functorial Parser

In the previous blog, we talk about how monads have this hierarchy where they inherit, among a ton of other methods, a method from the Functor category called map. Remember that map allows us to map the value that’s inside the parser, while keeping the parser just as a data structure that now remembers how to get there.

In Scala 3, we use something called extension methods, so we don’t need to actually make our class that inherits from Function[String, Option[(?, String)] to extend methods of our types, instead we could just simply:

extension [A](p: Parser[A]) def map(f: A => B): Parser[B] = input =>
    p(input) match // we call our `p` to pattern match over its result
      case None                   => None
      case Some(parsed, unparsed) => Some(f(parsed), unparsed)

Here the choice of applying f to only our parsed part seems a bit arbitrary, and in a sense it is. It falls outside the scope of this post and enters something will discuss in the future of the series, but for now keep in mind two things:

  1. Our Parser[A] is parametrized only on A, which is the type of the parsed part of our parser abstraction and thus makes it a perfect target. Remember how a map on Option[A] gives an Option[B], so we keep the notion of mapping what’s inside and that is important to us, not some kind of hidden implementation detail.
  2. Categories have laws. Yeah, that maybe spooky for my libertarian readers, but, as in this case, laws is what makes us a powerful cohesive society, and that’s what make categories useful and uniform to operate with in the first place. We’ll talk about them later, in a future post.

Now we have a Parser with a map! That means Parser is now a Functor. No rocket science there. Let’s check out how this might be useful:

item(“Hi from Scala 3”).map(c => c.toLowerCase) // 1: Some((‘h’, “i from Scala 3”))

Now let’s jump up difficulty a bit, by implementing some more interesting parsers.

Not-That-Basic Parsers

Let me introduce pure, which is a parser that takes up a value, put it as it was parsed, and left the rest of the input yet to parse.

def pure[A](value: A): Parser[A] = input => Some((value, input))

pure(12)(“parse me please”) // 1: Some((12, “parse me please”))

Monadic Parsers

Now, finally, here we are. We’re going to turn this parser into a Monad, the infamous monad category.

For this to work, we’ll need our good old flatMap, which is named bind in languages like Haskell, although they have a different way to use it, with a symbol, which I’m going to show you but that is not that common to see in Scala.

extension [A](p: Parser[A]) def flatMap(f: A => Parser[B]): Parser[B] = inputB =>
  pa(inputB) match
    case None                   => None
    case Some(parsed, unparsed) => f(parsed)(unparsed) // f returns a Parser, which then we call with the `unparsed` part so we can have our `Option[(A, String)]`

It looks almost the same as our map friend up there, but as you can notice, the f function gives us a Parser and not the right hand side of the function. Remember that the whole function is our parser, so we gotta call that parser because the left-hand-side is already there, we just need to return our Option[(A, String)].

In fact, it looks so similar, that we the help of our boy, the pure parser, we could implement map without losing a hair.

extension [A](p: Parser[A]) def map(f: A => B): Parser[B] = p.flatMap(a => pure(f(a)))

This is where you start to notice the magic of categories. They are abstractions build upon the other, so a lot of the times, by implementing categories upstream, you’ll get for free what goes down the line.

This is why, for implementing a Monad, you have to implement those two methods. That's your abstraction. That's your API. You satisfy it, and you have a new Monad.

It didn’t look like much work, did it? Well, we have already our Parser Monad.

Let’s see some usage examples, but let me first define some more parsers to actually showcase their power:

/** It makes sure a condition is satified so it let the parsed part through, otherwise it empties the parser, as it found an error parsing something or it is simply not what we’re looking for */
def ensure(p: Char => Boolean): Parser[Char] = item.flatMap(x => if p(x) then pure(x) else empty) // notice this `flatMap`.

ensure(c => c.isDigit)(“3rd version of Scala”) // 1: Some((3, “rd version of Scala”))
ensure(c => c.isNotDigit)(“3rd version of Scala”) // 2: None

As you can see, it uses flatMap because we want to return from f another Parser, namely the empty parser. In the 2 example, you can see the item parser is flattened with empty, which means we get the empty parser a result we put input on it: None.

From now on, I’ll use the shorten lambda syntax in Scala, namely the val myFunVal: Char => Boolean = _.isDigit, but don’t freak out ;).

Here is a list of other parsers that use the ensure parser to building themselves:

/** Checks if the char on parser is the same as the one provided to the [[char]] parser */
def char(x: Char): Parser[Char] = ensure(y => x == y)

/** Checks if it’s a digit */
def digit: Parser[Char] = ensure(_.isDigit)
/** Checks if it’s a lower case char */
def lower: Parser[Char] = ensure(_.isLower)
/** Checks if it’s an upper case char */
def upper: Parser[Char] = ensure(_.isUpper)

Choose Parsers

The tools we have so far are insufficient, but we’re almost there. One of those missing feature is the ability to choose between parsers in a way we can try one (let’s say, with ensure), check if that’s parser we were looking for for the string we’re parsing, and move on to next in case it wasn’t the one that could parse the following string.

For that we have a cryptically named combineK, which in languages like Haskell is used more like a syntax extension that looks like <|>. This combinator is the main characteristic of the Alternative category, which name is quite telling.

We want to compare alternative, left to right, until one parser yields us something that isn’t None. See its implementation:

def combineK[A](px: Parser[A], py: Parser[A]): Parser[A] = input =>
px(input) match
  case None   => py(input) // 1
  case result => result // 2

extension [A](px: Parser[A])
  @targetName("infixCombineK")
  def <|>(py: Parser[A]): Parser[A] = Parser.combineK(px, py) // 3

Here I follow a different convention. I don’t define combineK as an extension method, but as a standalone function that receives two parsers as their parameters, because the extension will be a nicely infix syntax, which is a tool from the Scala programming language we have yet to explore, and we just call the function in it.

1 shows us that if None is the result of the first parser, px, then we return the result of the second, py. In the case that py actually yields us a result, namely a Some value, we return it as it is.

Here are some examples using this combinator.

(empty <|> item)(“hi!”) // 1: Some((‘h’, “i!”))
(digit <|> empty)(“123”) // 2: Some((‘1’, “23”))
(empty <|> empty)(“hi!”) // 3: None
(empty <|> empty <|> char(‘h’))(“hi!”) // 3: Some((‘h’, “i!”))

Suffice to say that the combineK combinator associates to the left, meaning operations are run left to right. Notice how in 1 we got an empty parser, but is being chosen upon the item parser, which succeeds, meaning we got the second one as our parser when we apply it to the ”hi!” input. On the third 3 example, we have two empty parsers at first, so it first compare the first two, then that result with the third one, which succeeds with the input we’ve given to it.

Notice that the creation of the parser is always independent to the actual execution of then, the inputting part. This is what’s called deferred execution and is one of the core tenants of Effectful programming. The building of the Effects has nothing to do with the execution of them, concept we’ll explore in, well, future blog posts.

Here you have two parsers that depends on our new shiny feature:

/** A lower or an upper char will do to consider it a letter */
def letter: Parser[Char] = lower <|> upper

/** An alphanum is either a letter or a digit char */
def alphanum: Parser[Char] = letter <|> digit

Please notice how composable this things are. Not far behind we were building parsers that seemed not so useful, but the power comes when you compose them to form bigger and more complex parser, all without having to go back and change anything. Again, a power that comes from working with Effects.

For-Comprehensions

Until now we haven’t need it. Scala 3 has a syntax sugar to use flatMaps and maps out of the box using for-comprehensions. If you are not familiar with scala, you may have notice it in languages like python, which do more or less the same, but only works for the list monad. In Scala it works for any data-type you define, but at a cost: you need to define some things for it to work.

The actual implementation is a bit long and it doesn’t matter that much for this exposition of the parser monad, so I’ll give you the implementation, but it’s on you to learn more about it. Deal?

class WithFilter[A](pa: Parser[A], p: A => Boolean):
  def map[B](f: A => B): Parser[B]               = pa.filter(p).map(f)
  def flatMap[B](f: A => Parser[B])              = pa.filter(p).flatMap(f)
  def withFilter(q: A => Boolean): WithFilter[A] = new WithFilter(pa, x => p(x) && q(x))

With this implementation, now we can use our Parser monad like this:

def ensure = for
    x <- item
    x <- if p(x) then pure(x) else empty
  yield x

I’ve seen Scala devs using it only where multiple nested flatMaps are needed. It looks nice, too.

The String parser

This will be our most complicated parser, so far. It consists of a parser that parses only the given string from the input, leaving the rest untouched. Or it fails if the match is not there to be seen.

def string(str: String): Parser[String] = str match
  case str if str.isBlank() => Parser.pure("")
  case str                  =>
    val (x, xs) = (str.head, str.tail)
    for
      _ <- char(x)
      _ <- string(xs)
    yield (s"$x$xs")

string("abc")("abcdef") === Some(("abc", "def"))
string("abc")("ab21345") === None

Some and Even Many!

The some and many combinators are two of the most important for what we're going to do today.

While some expreses the idea of having at least one char of the input to be present, and then matches as much as it can of the res. You can see it as akin to the + operator on regex. many, on the other hand, matches zero or more, akin to *. See how them works:

Parser.many(digit).parse("123abc") === Some((Seq('1', '2', '3'), "abc"))
Parser.many(letter).parse("abc123") === Some((Seq('a', 'b', 'c'), "123"))
Parser.many(digit).parse("abc") === Some((Seq(), "abc"))

Parser.some(digit).parse("abc") === None
Parser.some(letter).parse("ab1") === Some(Seq('a', 'b'), "1")
Parser.some(letter).parse("a321") === Some(Seq('a'), "321")

These combinators have an impressively elegant implementation in Haskell, where they have some awesome treats like mutual recursion – because it’s a lazy language. Here in Scala, we have to settle for a more modest solution, using that dull normal recursion 😤:

def many[A](p: Parser[A]): Parser[Seq[A]] = input =>
  p(input) match
    case None                     => Parser.pure(Seq())(input)
    case Some((parsed, unparsed)) =>
      Parser.many(p)(unparsed) match
        case None                       => None
        case Some((parsed1, unparsed1)) => Some((parsed +: parsed1, unparsed1))

def some[A](p: Parser[A]): Parser[Seq[A]] = input =>
  p(input) match
    case None                     => None
    case Some((parsed, unparsed)) =>
      Parser.many(p)(unparsed) match
        case None                       => None
        case Some((parsed1, unparsed1)) => Some((parsed +: parsed1, unparsed1))

Behind those cryptic implementations, lies a simple yet powerful functionality.

With these combinators, we’re finally set for our composed parsers, which are the ones that will give us the power to parse more complex expressions, account for white space, take on identifiers and so on.

Let's start with the white space parser, which consumes as many whitespace as it can.

def space: Parser[Unit] = Parser.many(ensure(_.isWhitespace)).flatMap(_ => ()) // Last `flatMap` is used to discard the whitespace char.

For natural numbers and integers, we could define one (int) in terms of the other (nat), since int is just a nat with a posible negative symbol (-).

def nat: Parser[Int] = for xs <- Parser.some(digit)
  yield xs.mkString.toInt

def int: Parser[Int] =
  (for
    _ <- Parser.char('-')
    n <- Parser.nat
  yield -n) <|> Parser.nat

token is a parser that'll let us capture a token, ignoring the whitespace around.

def token[A](p: Parser[A]): Parser[A] =
  for
    _      <- Parser.space
    parsed <- p
    _      <- Parser.space
  yield parsed

symbol parses a specific string while ignoring whitespace around, meanwhile integer is basically int, but ignoring whitespaces around.

def symbol(xs: String): Parser[String] =
  Parser.token(Parser.string(xs))

def integer: Parser[Int] = Parser.token(Parser.int)

With all these parser under our belt, it's time to showcase them by doing something relatively useful: an arithmetic calculator.

Arithmetic Parser

After all our previous work, the following will feel like a walk in the park, or at least I hope so.

First we have to define our grammar. I’m going to base this one on Hutton’s work Programming in Haskell Chapter 13.8. Arithmetic Expressions.

The definition has some key characteristics, such as:

  1. Addition and multiplication are right associative, hence the recursion only on the right-hand side of the operands.
  2. Numbers (or literals) have the higher precedence, followed by parenthesized expressions, then multiplication and lastly additions.
  3. Is unambiguous, meaning that a valid expression has only one interpretation or parse tree.
expr ::= term + expr | term
term ::= factor * term | factor
factor ::= (expr) | nat
nat ::= 1 | 2 | 3 | …

This definition has a simplification because of the right associativity. If e is an empty string, then expr and term have optional right hand-side operators.

expr ::= term (+ expr | e)
term ::= factor (* term | e)

An expression like 2+3+4 will have only one interpretation, meaning 2+(3+4), or as shown by this figure:

With these simplifications, expressing our grammar into actual combinators is almost trivial, since they look quite alike and because we have composability superpowers.

def expr: Parser[Int] = ArithmeticParser.term.flatMap { t =>
  val sum = for
    _ <- Parser.symbol("+")
    e <- ArithmeticParser.expr
  yield t + e

  sum <|> Parser.pure(t)
}

def term: Parser[Int] = ArithmeticParser.factor.flatMap { f =>
  val fact = for
    _ <- Parser.symbol("*")
    t <- ArithmeticParser.term
  yield f * t

  fact <|> Parser.pure(f)
}

def factor: Parser[Int] =
  (for
    _ <- Parser.symbol("(")
    e <- ArithmeticParser.expr
    _ <- Parser.symbol(")")
  yield e) <|> Parser.natural

Lastly, we need something to eval the expression, produce a result, or report the appropriate errors, for which I chose to have a Sum Algebraic Data Type, which you may know just as an Enum.

sealed trait ParseError extends NoStackTrace

final case class UnusedInput(parsed: Int, unused: String) extends ParseError
final case class InvalidInput(input: String)              extends ParseError

def eval(input: String): Either[ParseError, Int] = ArithmeticParser.expr.parse(input) match
  case Some((n, unparsed)) if unparsed.isEmpty() => Right(n)
  case Some((n, unparsed))                       => Left(UnusedInput(n, unparsed))
  case None                                      => Left(InvalidInput(input))

Finally, we should be able to see our parser in action by doing something… fun, I guess.

For our demo purposes, and because I want to show you a pure application, I will use the Typelevel library Cats Effect, which uses the library Cats I talked about in our previous post to create pure functional abstractions over IO, or Input/Output. The details are unnecessary to understand, but I could totally walk you through them in another tutorial. Check it out:

import cats.effect.{ExitCode, IO, IOApp}
import land.yurian.monadicparsers.ArithmeticParser.{InvalidInput, ParseError, UnusedInput}

object Main extends IOApp:
  def run(args: List[String]): IO[ExitCode] =
    def evalLine: IO[ExitCode] =
      def outErr: ParseError => IO[Unit] =
        case UnusedInput(n, unused) => IO.println(s"Some input couldn't be read (after n = $n): $unused")
        case InvalidInput           => IO.println("invalid input :(")

      IO.readLine.map(ArithmeticParser.eval).flatMap {
        case Left(error)  => outErr(error).as(ExitCode.Error)
        case Right(value) => IO.println(s"expresion evaluated to: $value").as(ExitCode.Success)
      }
    end evalLine

    IO.print("insert an expression: ") >> evalLine
  end run
end Main

That’s pretty much it. Today you learned a lot (I hope so), so give yourself a treat. We learned how to implement different categories – such as Functor, Monad, Alternative – for a type we defined ourselves, and saw how this gave us an incredible power for composition, using the same rules as if we had used any other effect, like List or Option.

We defined a simple grammar to eval arithmetic expressions, and expressed this grammar with ease using our composition toolbox.

Stay tuned for the explanation on how to compose Input/Output effects to, for example, printing to the console feedback from our application, which is what we did at last in this post.