Fault-tolerant Parsing, a Work in Progress

Originally published at: https://javacc.com/2020/04/23/fault-tolerant-parsing-progress/

One of the very basic goals of this iteration of work on the JavaCC codebase, JavaCC 21, is to turn this into a capable tool for generating fault-tolerant parsers. Traditionally, not just JavaCC, but tools in this space generally, consider that it is their job to parse, i.e. to break down input into its various pieces — valid input. Once the parser encounters anything that is not valid, then it just aborts. That’s it.

It’s a funny thing. Imagine if human beings behaved in a way analogous to this. Wouldn’t that be quite exasperating? Suppose, for example, if you mumbled or mispronounced a single word in the middle of a long speech, your audience simply stopped listening. Just like a non-fault-tolerant parser, once they encounter one unit of input that they do not understand, they would just take the entirety to be invalid and stop listening.

Regardless, the goal of a fault-tolerant parser would be to be able to continue parsing and providing feedback to the user even after encountering some errors in the input. The majority of programming tools, such as interactive development environments, really require some fault-tolerant parsing to be useful. This, by the way, might also explain why parser generator tools like JavaCC are actually used quite a bit less than one think would they would be.

Now, to frame an initial discussion, let us consider a very typical and fairly simple grammatical production:

void ParentheticalExpression() : 
{}
{
   "(" Expression() ")"
}

So we could start by considering how JavaCC works (or doesn’t, really) as regards error handling. The above expansion is a sequence of 3 elements, the opening “(“, the non-terminal Expression() production, and the closing “)”. Now, as regards the opening delimiter, let us assume for the purposes of discussion that it is there, and that is, in fact, why we entered the production. Okay, so now, the way that the legacy JavaCC tool works (and JavaCC21 also, if the still experimental fault-tolerant parsing is not turned on) is that it will try to parse the nested Expression() production and then (assuming it succeeds at that) to consume a closing “)” delimiter. If, along the way, it fails, i.e. it encounters unexpected input, it will simply throw an exception. (Legacy JavaCC had two kinds of exceptions here, ParseException and TokenMgrError, but JavaCC21 has merged the two into one and there is only ParseException now. However that is a tangential detail in this exposition…)

So, the above describes what happens traditionally. We get an exception thrown and really, to all intents and purposes, we have no disposition for recovering from the error. (More on that later.) Now, we can break the above scenario down into two separate situations:

  1. Failure to parse the Expression()
  2. We parsed the Expression() successfully but do not have the final “)”

Intuition tells us that the second condition is the easier one to handle. Even with the dispositions available in the legacy tool, one might have a fighting chance at dealing with case 2 and continuing the parse. However, the first case looks completely intractable, since the Expression() is surely arbitrarily recursively nested so the exception was likely thrown from within some deeply nested sub-expansion. Legacy JavaCC provides us with no real machinery for recovering from the error and yes, we do want to remedy that in JavaCC 21. But for now, let’s start by talking about the second, simpler case above.

So, we successfully parse an Expression() and now try to consume the closing “)” but we have the wrong kind of input. Suppose we are parsing some Java code like this:

if (foo {
   ....
}

The parsing machinery is expecting a “)” after the foo and finds “{” instead. Now, any human looking at this would realize what you need to do if you want to keep parsing. You insert the missing “)” and that closes ParentheticalExpression() and it so happens in this case that this was the only problem and that’s it! We’re back on the rails!

Now, as things stand, the currently implemented fault-tolerant parsing machinery is something of a one-trick pony. And that is the trick. In these spots, it will attempt to insert a virtual “)” token to complete the production. Note however, that it will only do that under the following two conditions:

  1. Fault-tolerant parsing is turned on.
  2. The expansion is marked as being forced, which means that we are going to “force” the production to succeed even if there is erroneous input.

The way you do the first one is:

FAULT_TOLERANT=true;

in your options. The second one requires you to put in a special annotation in the appropriate place. For example:

void ParentheticalExpression()! : {}
{
   "(" Expression() ")"
}

The exclamation mark up top means basically that we are going to force the production to complete and build a node in the AST no matter what. This means, by the way, that if the more general error condition is the problem, i.e. the parsing of the nested Expression() production failed, we catch the exception and then we will:

  1. Pretend that we somehow got past the Expression() production or really, whatever led up to the closing delimiter.
  2. Try to insert the closing delimiter, creating and inserting a virtual token if necessary. (For now, if there is more than one token type in the “final set” of the expansion, then it simply chooses the first one ordinally, i.e. the one that came first in the list.)

A One-Trick Pony

In its current state, the fault-tolerant parsing machinery could be described as basically a one-trick pony. It has a single trick really. In principle, any expansion can be marked as forced, so you can also write something like:

void KeyValuePair() : {}
{
   <IDENTIFIER> ":"! Expression()
}

In the above example, the colon between key and value is marked as forced, meaning that if it is not there, it will be inserted. So a case of the above would be somebody writing:

   name "John"

instead of:

name : "John"

and the parsing machinery, failing to find the colon after name, knows that it has to insert a virtual colon. If this, for example, was the only error in JSON file, with the above disposition, one could successfully parse the whole file and report the cases of the missing colon to the user.

Stay Tuned

Now, in closing, to clarify a key point, I do not believe that this single trick of inserting virtual tokens is going to be entirely adequate. Nonetheless, it is an experiment in progress to see how far one can get with this single trick. And doubtless, even this single trick needs some refining. Now, some refinements have been made. For example, if the inserted token’s type would require a change of lexical state, it does that currently.

It will be useful to see how much can be resolved with this trick and then also which specific problems remain unsolved. I already have some thoughts on that and can outline (even if in a somewhat wooly minded way) the steps necessary to move towards a more general solution to the fault-tolerant parsing problem. However, I shall write that up separately. Meanwhile, back to the (coding) trenches!