Revisiting Assertions, introducing the ENSURE keyword

Assertions, implemented in late 2021, are certainly an important feature in CongoCC -- actually pretty much essential when writing grammars for very gnarly languages. However, as currently implemented, they are admittedly a somewhat strange hybrid construct. The main reason for this is that ASSERT has completely different semantics depending on whether you hit the assertion when parsing or scanning ahead.

Now, to review some basic concepts, note that a congo-generated parser is always in one of two states:

  • regular parsing
  • lookahead

In regular parsing, the parsing machinery is consuming tokens based on where it is in the grammar and, assuming that tree-building is on, is building up a parse tree from those tokens.

In lookahead, on the other hand, the parsing machinery is at a choice point and we are scanning ahead to try to figure out which choice to opt for. And we are not doing the tree-building actions or executing other custom code actions.

So, basically, if an assertion fails in regular parsing, this is pretty much analogous to an assertion failing in a programming language such as Java. In Java, when you write:

 assert someCondition() : "Oops!";

this is basically just syntactic sugar for:

 if (!someCondition()) {
    throw new AssertionError("Oops!");
 }

Since the error thrown here is not really meant to be caught, the assertion failing typically just terminates the process. (Or thread...) And that is what a CongoCC assertion does... if we are in regular parsing. (The other aspect of an assertion in Java and other languages is that they can be enabled/disabled, but that has no analogue in CongoCC.)

However, if an assertion fails when scanning ahead, this is just taken to mean that the predicate routine that this is part of returns false, and since we are at a choice point, we simply need to go to the next possible choice. Thus, if we have:

 Foo | Bar | Baz

and Foo is:

 Foo : Foobar Foobaz ASSERT {someCondition()}# =>|| C ;

then in the Foo|Bar choice, the machinery scans ahead into the Foo rule and if it gets to the =>|| up-to-here point, the predicate succeeded and we opt for the Foo branch. If the predicate fails, we try Bar and so on down the line.

For the predicate to succeed, it would have to scan past a Foobar and a Foobaz and then the condition in the assertion, which is someCondition(), would have to evaluate to true.
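To make the two behaviors concrete, here is a minimal Java sketch of the choice point above. It is only a model, not the code that CongoCC generates: the class and method names are hypothetical, Foobar, Foobaz and C are treated as plain token types, Bar is assumed to start with a Foobar as well, and Baz is simply the fallback.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical model of the Foo | Bar | Baz choice point (not CongoCC output).
final class LookaheadChoiceSketch {

    // Lookahead for  Foo : Foobar Foobaz ASSERT {someCondition()}# =>|| C ;
    // Returns true only if we get as far as the up-to-here point.
    static boolean scanFoo(List<String> tokens, BooleanSupplier someCondition) {
        if (tokens.size() < 2) return false;
        if (!tokens.get(0).equals("Foobar")) return false;
        if (!tokens.get(1).equals("Foobaz")) return false;
        // In a lookahead, a failed assertion just makes the predicate fail.
        return someCondition.getAsBoolean();
    }

    // Regular parsing of Foo: here a failed assertion is a hard error.
    static void parseFoo(List<String> tokens, BooleanSupplier someCondition) {
        expect(tokens, 0, "Foobar");
        expect(tokens, 1, "Foobaz");
        if (!someCondition.getAsBoolean()) {
            throw new IllegalStateException("Assertion failed in Foo");
        }
        expect(tokens, 2, "C");
    }

    private static void expect(List<String> tokens, int index, String type) {
        if (index >= tokens.size() || !tokens.get(index).equals(type)) {
            throw new IllegalStateException("Expected " + type + " at position " + index);
        }
    }

    // The choice itself: try Foo's predicate first, then move down the line.
    static String choose(List<String> tokens, BooleanSupplier someCondition) {
        if (scanFoo(tokens, someCondition)) return "Foo";
        // Assumption for this sketch: Bar also begins with a Foobar.
        if (!tokens.isEmpty() && tokens.get(0).equals("Foobar")) return "Bar";
        return "Baz";
    }

    public static void main(String[] args) {
        List<String> input = List.of("Foobar", "Foobaz", "C");
        System.out.println(choose(input, () -> true));
        System.out.println(choose(input, () -> false));
    }
}
```

With the condition true the sketch picks Foo; with it false, the very same input falls through to Bar without any error being raised. Calling parseFoo with a false condition, by contrast, throws.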

First of all, note that the # at the end of the assertion means that the assertion applies in a lookahead as well as in regular parsing. Without the #, the assertion is simply ignored in a lookahead. Another thing to note is that if the assertion passes during the lookahead and we then go on to parse a Foo, we hit the same assertion again in regular parsing. Presumably it passes again, so the second check is redundant: we already know the outcome!

Another thing to note is that CongoCC assertions also have a dual nature in that they can be expressed either as syntactic expansions or as conditions expressed in Java code. The ones actually expressed syntactically have been taken to always apply, both in parsing and lookahead. I mean an assertion like:

 Foo : Foobar ASSERT ~(Baz|Bat) =>|| ;

This means that, at a choice point, to decide whether to enter the Foo expansion, we try to scan past a Foobar and then check whether what follows is a Baz or a Bat. If it is, the predicate fails, since the assertion is expressed as a negative. (That is what the ~ means.) But the assertion is also checked again when you go back to regular parsing. That is usually (though not always) redundant, so that is basically a glitch that is being addressed now as well.
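As a rough illustration of the negative check, the predicate for this Foo could be modeled in Java as below. This is a sketch under simplifying assumptions, not generated code: the class name is made up, and Baz and Bat are treated as single tokens, whereas in a real grammar they could be arbitrary expansions.

```java
import java.util.List;
import java.util.Set;

// Hypothetical model of the predicate for  Foo : Foobar ASSERT ~(Baz|Bat) =>||
final class NegativeAssertionSketch {

    // Simplification: Baz and Bat are modeled as single token types.
    private static final Set<String> FORBIDDEN = Set.of("Baz", "Bat");

    static boolean scanFoo(List<String> tokens) {
        // First we have to be able to scan past a Foobar.
        if (tokens.isEmpty() || !tokens.get(0).equals("Foobar")) return false;
        // Then we peek at what follows; in this sketch the check consumes nothing.
        boolean followedByForbidden =
                tokens.size() > 1 && FORBIDDEN.contains(tokens.get(1));
        // The ~ negates: the predicate succeeds only if neither Baz nor Bat follows.
        return !followedByForbidden;
    }

    public static void main(String[] args) {
        System.out.println(scanFoo(List.of("Foobar", "Qux")));
        System.out.println(scanFoo(List.of("Foobar", "Baz")));
    }
}
```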

By the way, the reason that the # is required when the assertion condition is Java code is actually kind of hypertechnical. Suppose we have something like:

 Foo :
   {
      int someVariable = ...;
   }
   ...
   ASSERT {someVariable > 5}
 ;

The above assertion can really only apply in regular parsing because someVariable is only defined in regular parsing. If the assertion applies in lookahead, this will actually generate code that does not even compile!
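The scoping problem can be pictured with the following Java sketch. It assumes, purely for illustration, that regular parsing and lookahead for Foo end up in two separate routines and that only the parsing routine contains the grammar's code blocks; the names are hypothetical and the real generated code looks different.

```java
// Hypothetical model of why ASSERT {someVariable > 5} cannot apply in a lookahead.
final class ScopeSketch {

    // Regular parsing: the code block runs here, so someVariable is declared.
    static int parseFoo(int computedValue) {
        int someVariable = computedValue; // stands in for the elided initializer
        if (!(someVariable > 5)) {
            throw new IllegalStateException("Assertion failed: someVariable > 5");
        }
        return someVariable;
    }

    // Lookahead: code blocks are not executed, so there is no someVariable.
    static boolean scanFoo() {
        // return someVariable > 5;  // would not compile: no such variable in this scope
        return true; // without the #, the assertion is simply skipped here
    }

    public static void main(String[] args) {
        System.out.println(scanFoo());
        System.out.println(parseFoo(6));
    }
}
```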

The disposition described above does work, as far as that goes, but it is also kind of a mess conceptually, quite aside from the basic glitch of assertions being redundantly double-checked. The reason for the conceptual confusion is that an assertion failing in regular parsing is more like a Java assertion, whereas an assertion failing in a lookahead is actually part of regular programmatic flow. You check that a condition is satisfied. If it isn't, the predicate fails and you go to the next choice...

Introducing the new ENSURE keyword

The foregoing outlines the problem, and the solution that I have come up with is to break ASSERT in two basically. When an assertion is being used in the classical way, like a Java assert, say, we use the ASSERT keyword, but when it is actually part of a predicate and thus really part of regular programmatic flow, we use instead the newly introduced ENSURE keyword.

The goal here is that the assertions written with ASSERT only apply in regular parsing and the ones written with ENSURE only apply in a lookahead. And this should lead to much greater clarity in the resulting grammars.

In the case (not so very common) where you really want the assertion to apply in regular parsing and in lookahead, you can write:

 ASSERT ENSURE ...

or alternatively:

 ENSURE ASSERT ...

For a while, I was pondering which of the two above is better but finally decided they can both work... Of course, either way, ASSERT ENSURE is bound to be redundant unless you are at a point that is sometimes reached in regular parsing after passing the predicate and sometimes reached without passing it. This is certainly possible. For example, in the one-or-more construct, (Foo)+, the predicate is not tested on the first iteration, since we enter unconditionally, but it is tested on every iteration after that. So, one can conceive of writing ASSERT ENSURE in such a case, I suppose. But usually, you will want to write either ASSERT or ENSURE, not both, and, as I say, this should make the intent of one's grammars clearer.
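The (Foo)+ case can be traced with a small Java model. It is a sketch under stated assumptions, not CongoCC's generated code: the class, its fields and the two boolean switches (standing for "ASSERT is present" and "ENSURE is present" on a condition inside Foo) are all hypothetical, and the condition itself always passes so that we only record when it would be checked.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical trace of when a condition inside Foo is checked during (Foo)+.
final class OneOrMoreSketch {
    enum Mode { PARSING, LOOKAHEAD }

    private final boolean assertApplies; // ASSERT: checked in regular parsing
    private final boolean ensureApplies; // ENSURE: checked in a lookahead
    private int remaining;               // how many Foo's are left in the input
    final List<Mode> checks = new ArrayList<>();

    OneOrMoreSketch(int fooCount, boolean assertApplies, boolean ensureApplies) {
        if (fooCount < 1) throw new IllegalArgumentException("(Foo)+ needs at least one Foo");
        this.remaining = fooCount;
        this.assertApplies = assertApplies;
        this.ensureApplies = ensureApplies;
    }

    // Predicate deciding whether to go around the loop again.
    private boolean scanFoo() {
        if (remaining <= 0) return false;
        if (ensureApplies) checks.add(Mode.LOOKAHEAD);
        return true;
    }

    private void parseFoo() {
        if (assertApplies) checks.add(Mode.PARSING);
        remaining--;
    }

    List<Mode> parseOneOrMore() {
        parseFoo();                   // first iteration: entered unconditionally
        while (scanFoo()) parseFoo(); // later iterations: predicate tested first
        return checks;
    }

    public static void main(String[] args) {
        System.out.println(new OneOrMoreSketch(3, false, true).parseOneOrMore());
        System.out.println(new OneOrMoreSketch(3, true, true).parseOneOrMore());
    }
}
```

With ENSURE alone, the condition is never checked for the first Foo. With both keywords, the first Foo is covered by the parsing-time check, at the cost of a redundant second check on each later iteration.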

Backward Compatibility Issues

This is all implemented and, for now, it is all backward compatible. One niggling detail is that an ASSERT where the condition is expressed syntactically (as opposed to with Java code) still applies both in lookahead and regular parsing. In the future it will only apply in regular parsing, but that change is backward-incompatible. So, for the time being, there is a new setting, ASSERT_APPLIES_IN_LOOKAHEAD, which is on by default.

At some later stage, that setting will be off by default (and probably eventually be removed) so that you have to use ASSERT, ENSURE, and occasionally ASSERT ENSURE appropriately depending on your intention.

Recap

The key takeaway is the following: henceforth, where you previously wrote ASSERT, you should continue to write ASSERT if the assertion applies only in regular parsing (where failure typically terminates the parsing job), and you should replace ASSERT with ENSURE if it is meant to apply only in a lookahead and is thus used for normal programmatic flow. And last (and probably least) you should write ASSERT ENSURE, or alternatively ENSURE ASSERT, if the assertion is meant to apply both in regular parsing and in lookahead.