Parser Combinators Part 2: This time with types!

If you didn’t read my first post on parser combinators, I highly recommend going back and reading it. This post assumes both that you understand parser combinators, and that you’re already familiar with the way I implemented parser combinators in Python.

All of the code discussed in this post is available in a Github repos. It’s still not anything that I’d seriously use for anything important, but it’s good enough to explain stuff.

When we left off with Parser combinators, I’d shown a simple parser combinator implementation in Python. But as long-term readers of this blog know, I’m seriously into strongly-typed programming. My take on string typing is twofold:

  1. Information is valuable!: When you write a piece of code, you know what types of parameters you expect it to take, and what kind of value you expect it to return. If you know information like that, which could be valuable both to human readers of your code, and to the compiler, why would you throw it away? Write it down!
  2. It prevents mistakes: This is definitely a personal thing, but I find that a good strong type system catches the kinds of mistakes that I make most frequently. When I program in OCaml and Scala, my two favorite strongly typed languages, I’m constantly amazed at how frequently my code works the first time I run it. That’s not because I’m some kind of perfect genius, but because by the time I get to run it, the compiler has already told me about all the stupid mistakes I made

If I’m going to use Parser combinators, I certainly don’t want to give up my strong typing. After all, if I use a parser generator like Antlr, I’m going to end up writing my parser with types – if I can’t have them in the combinators, I’ll just use ANTLR!

I’m going to walk through how to build typed parser combinators in Java. I’m not a huge Java fan – I find it painfully verbose. But it’s a language with an adequate type system and garbage collection, which is widely available and familiar.

In the Python parser combinators, we informally described what a parser looked like. In Java, we can be more formal, and describe it with types. As we said before, a Parser is, conceptually, a function from a sequence of input values to either a success, consisting of some result value and a sequence of the unconsumed input values, or to a a failure. We can say that pretty directly in Java.

Let’s start by saying what we mean by a parser input. It’s a sequence of input values. They’re all the same type, and we look at them one at a time.

public interface ParserInput<In> {
  /**
   * Get the first element from this input stream.
   */
  In first();

  /**
   * Get the remaining part of this input stream after the first character is consumed.
   */
  ParserInput<In> rest();

  /**
   * Return true if there's no input left in the stream.
   */
  boolean atEnd();
}

In the typed version, we need to say what kind of things are in the input sequence: it’s an input stream of something. Most of the time, that’ll probably be an input stream of characters – a ParserInput<Character>. But it could be other things – token, numbers, strings. To make it easy to use whatever input type you want, we just make it a parameter.

A parser, then, is a function from a parser input to a parser output. What’s a parser output? It’s one of two things; either

  • A success containing a result value and a stream of the unconsumed inputs, or
  • a failure.
  • We’ll need to specify both the input and output types. The parse result needs both as parameters, because if it’s a success, it’ll contain an input stream of the correct type.

    public interface ParseResult<In, Out> {
      public ParserInput<In> getRest();
    }
    
    public class Success<In, Out> implements ParseResult<In, Out> {
      public Success(Out result, ParserInput<In> rest) { ... }
    
      public Out getResult() { ... }
    
      public ParserInput<In> getRest() { ... }
    }
    
    public class Failure<In, Out> implements ParseResult<In, Out> {
      public Failure() { }
    }
    

    There’s nothing difficult there. Now, finally, we can describe the type of a parser in Java:

    public abstract class Parser<In, Out> {
      public abstract ParseResult<In, Out> parse(ParserInput<In> in);
    
    }
    

    (In Java, I used an abstract class, so that I could write the combinators, with default implementations, as methods of the base parser type.) The parser is exactly what we said in text earlier: it’s a object with a method, that acts as a function from a parser input to a parse result. We’ve just added type parameters: a parser of type Parser<In, Out> takes a input stream of values of type In, and if it succeeds, it will produce an output value of type

    Out

    .

    To get a sense of what a typical parser looks like, we’ll do the same sort of example we did with the Python version. Let’s look at a parser that parses a single character from some set of potential characters. The characters that the parser should match will be taken from a string.

    public class CharSetParser extends Parser {
      private final String _chars;
    
      public CharSetParser(String chars) {
        this._chars = chars;
      }
    
      @Override
      public org.goodmath.pcomb.ParseResult parse(
          ParserInput in) {
        if (_chars.indexOf(in.first()) == -1) {
          return new Failure();
        } else {
          return new Success(in.first(), in.rest());
        }
      }
    
    }
    

    It looks an awful lot like the Python equivalent, modulo the more verbose Java syntax, and the type declarations. It’s a parser that takes a character as input, and produces a character as its result value.

    The typed approach gets more complicated when it comes to the actual combinators. As a quick reminder, we need to have the following types of combinators:

    1. Repetition: run this parser many times.
    2. Choice: parse either this or that.
    3. Sequence: parse this followed by that.
    4. Option: this is optional: parse this if you can, but just succeed without consuming any input if you can’t.
    5. Action: run some function on the result of some other parser.
    6. Reference: I’ll tell you what I want to you parse later.

    In terms of types, repetition is easy. If I’ve got a parser p whose type is Parser<In, Out>, then parsing a bunch of repetitions should produce a list of the outputs from the repetition – it’s type Parser<In, List<Out>>:

    public class ManyParser<In, Out> extends Parser<In, List<Out>> {
      private final int _atLeast;
      private final Parser<In, Out> _base;
    
      public ManyParser(Parser<In, Out> base, int atLeast) {
        this._base = base;
        this._atLeast = atLeast;
      }
    
      @Override
      public ParseResult<In, List<Out>> parse(ParserInput<In> in) {
        List<Out> results = new ArrayList<Out>();
        ParserInput<In> unparsed = in;
        ParseResult<In, Out> r = _base.parse(unparsed);
        while (r != null && r instanceof Success) {
          Success<In, Out> success = (Success<In, Out>)r;
          results.add(success.getResult());
          unparsed = success.getRest();
          r = _base.parse(unparsed);
        }
        if (results.size() >= _atLeast) {
          return new Success<In, List<Out>>(results, unparsed);
        } else {
          return new Failure<In, List<Out>>();
        }
      }
    
    }
    

    Choice is also easy: we just say that the two arms of the choice have to have the same result type. A choice between two Parser<In, Out> is a Parser<In, Out>. (I’ll skip the implementation: you can look at ChoiceParser on github if you want.)

    Sequence is where things get difficult. We could go the same route as we did with the choice, and say that a sequence of parsers must all return the same result type. But in practice, that’s going to make it painfully cumbersome to write anything. Think of a simple example of parsing a lisp s-expression. We’ll have parsers for the open and close parens, and parsers for other symbols. Simplifying a bit, any lisp expression is either an identifier, a number, or an open-paren followed by a list of expressions followed by a close paren. We’re going to have a parse rule for the paren case. In the Python version, we could write openpar & expr.many() & closepar for that case. But with typing, that’s now a problem. What’s the type of the three parsers in that expression? A parser for an identifier can certainly be written to return an expression type; so can a parser for a number. And we’d like the parser for the parenthesied expression to return an expression. But what about an open-paren? What’s it supposed to return? It doesn’t make sense for it to return an expression!

    What we end up needing to do is to provide a collection of combinators for sequencing – one for each reasonable type-outcome.

    1. x.andFirst(y): parse x followed by y, and return the result of the first parser, x. The type of the resulting combined parser is the same as the type of x.
    2. x.andSecond(y): parse x followed by y, and return the result of the second parser, y. The type of the resulting combined parser is the same as the type of y.
    3. x.andPair(y): : parse x followed by y, and return a pair of the results of both parsers. The type of the resulting combined parser is built from a typed pair: if x is a Parser<In, X>, and y is a Parser<In, Y>, then x.andPair(y) is a Parser<In, Pair<X, Y>>.
    4. x.andThen(y): if x and y are the same type of parser, then we can just combine them and return a list.

    The result sadly ends up looking rather ugly in Java, because of the syntax. In scala, sequencing is done with an arrow pointing at the value to keep – so my “andFirst” is written “a <~ b” instead of “a.andSecond(b)“. But it’s reasonably legible, and it works.

    For choice, the combinator takes a value to use as the result if optional element isn’t parsed. If a is a Parser<Character, String>, then you could write a.opt("empty") to write a parser that processes an optional string; its result value will be "empty" if the optional a is omitted.

    Actions are simple: an action is a function from a parser result type to some other type. I implemented them as an interface type Action:

      public interface Action {
        Transformed run(Orig val);
      }
    

    Then you create an action with the action combinator: if act is an Action<X, Y>, and x is a Parser<Character, X>, then x.action().

    There are a few other combinators, but you can take a look at the source code on github if you want to see them. To get an idea of what it’s like to use this thing, here’s the same calculator that we did in Python.

      @Test
      public void testArithmetic() {
        // FIRST, ACTIONS!
        final Action<Pair<Character, Integer>, Integer> unary_to_int =
          new Action<Pair<Character, Integer>, Integer>() {
            @Override
            public Integer run(Pair<Character, Integer> p) {
              if (p.getFirst() == '-') {
                return -p.getSecond();
              } else {
                return p.getSecond();
              }
            }
          };
    
        final Action<List<Character>, Integer> digits_to_int =
          new Action<List<Character>, Integer>() {
            @Override
            public Integer run(List<Character> digchars) {
              StringBuilder numstr = new StringBuilder(digchars.size());
              for (char c: digchars) {
                numstr.append(c);
              }
              return Integer.parseInt(numstr.toString());
            }
          };
    
          final Action<Pair<Integer, List<Pair<Character, Integer>>>, Integer> mult_to_int =
              new Action<Pair<Integer, List<Pair<Character, Integer>>>, Integer>() {
            @Override
            public Integer run(Pair<Integer, List<Pair<Character, Integer>>> val) {
              int result = val.getFirst();
              for (Pair<Character, Integer> term: val.getSecond()) {
                if (term.getFirst() == '*') {
                  result = result * term.getSecond();
                } else {
                  result = result / term.getSecond();
                }
              }
              return result;
            }
          };
    
          final Action<Pair<Integer, List<Pair<Character, Integer>>>, Integer> add_to_int =
              new Action<Pair<Integer, List<Pair<Character, Integer>>>, Integer>() {
            @Override
            public Integer run(Pair<Integer, List<Pair<Character, Integer>>> val) {
              int result = val.getFirst();
              for (Pair<Character, Integer> term: val.getSecond()) {
                if (term.getFirst() == '+') {
                  result = result + term.getSecond();
                } else {
                  result = result - term.getSecond();
                }
              }
              return result;
            }
          };
    
        // HERE'S THE GRAMMAR!!
        Parser<Character, Integer> number = Parser.charSet("0123456789").many(1).action(digits_to_int);
        RefParser<Character, Integer> exprRef = Parser.ref();
        Parser<Character, Integer> parens = Parser.match('(').andSecond(exprRef).andFirst(Parser.match(')'));
        Parser<Character, Integer> simple = number.or(parens);
        Parser<Character, Integer> unary_expr = Parser.match('-').opt('+').andPair(simple).action(unary_to_int);
        Parser<Character, Integer> mult_expr =
            unary_expr.andPair((Parser.charSet("*/").andPair(unary_expr)).many(0)).action(mult_to_int);
        Parser<Character, Integer> add_expr =
            mult_expr.andPair((Parser.charSet("+-").andPair(mult_expr)).many(0)).action(add_to_int);
        exprRef.setRef(add_expr);
    
        // NOW, RUN IT!!
        ParserInput<Character> in = new StringParserInput("1+2*(3+5*4)*(6+7)");
        ParseResult<Character, Integer> result = add_expr.parse(in);
        assertTrue(result instanceof Success<?, ?>);
        Success<Character, Integer> success = (Success<Character, Integer>)result;
        assertEquals(1 + 23*26, success.getResult().intValue());
        }
    

    The syntax is, obviously, a hell of a lot more cumbersome than the Python. But here’s the advantage. I actually wrote this version first – I started with just the Java version of the parser combinators, and implemented the Python ones because I thought it was too hard to explain the basic idea of the combinators, and the type mess in one go.

    When I originally wrote this code, it worked perfectly the first time I ran it. I translated it into Python for the blog post, and ended up tracking down 4 different bugs before it worked. Every error that I made was something that the type-system caught in Java.

    Java doesn’t have a great type system, but it’s adequate for most stuff. And it payed off here in exactly the way that make me appreciate strong typing so much!

2 thoughts on “Parser Combinators Part 2: This time with types!

  1. none

    Java isn’t that strong though. It would have been nicer to see this post in (abstract) ML or Scala. Thanks though.

    Reply

Leave a Reply