You say “Tomayto”, I say “Tomahto”. Exploring Syntax options in CongoCC.

What this article's title alludes to is the fact that, once you have a complex language (or system) there are usually multiple equally valid ways to say something. Let's consider the following typical case, a comma-separated list of some sort, something like:

  [1, 2, x, y, "Hello, World"]

To describe the grammar for this in CongoCC, we would have something like:

 ArrayLiteral :
    "["
    [
        Element
        ("," Element)*
    ]
    "]"
 ;

Let us now focus on the bit that allows us to have more than one comma-separated element, i.e. this line:

    ("," Element)*

Straightforward bread-and-butter stuff, right? The above expresses a loop. We enter the loop if the next token off the stream is a comma, and then, on each subsequent iteration, we stay in the loop if, after encountering an Element, the next token is a comma. If not, we break out of the loop, and then we would duly expect that the next token off the stream will be a right bracket, i.e. ].

The logic outlined above is totally LL(1), a single-token lookahead, which is what a CongoCC generated parser does by default.

However, in practice this sort of thing gets more complicated because, for whatever reasons,it seems to be increasingly fashionable in these instances to allow (optionally) a superfluous final comma. (Or whatever delimiter it happens to be, of course.) Or, to put it more concretely, we could write:

 [x, y, z,]

In that case, the comma following the z above is optional and surprisingly, this complicates things considerably because the grammar snippet is now longer LL(1). In other words, at a key point, it requires two tokens of lookahead to decide whether to stay in the loop or break out. Now, if we thought about this naively we could just change the line to:

("," Element)* [","]

saying that there is an optional final comma. But... that doesn't work! The problem is that it would still be deciding whether to repeat the loop based on a single-token lookahead, just checking whether the next token is a comma. So, in [x,y,z,] it encounters the comma after the z and, on that basis, decides to repeat the loop, consumes that final comma, and then expects to find one of the tokens that can start an Element but instead, runs into that final ]. And then the parser generates an error. In fact, as things stand, that optional comma inside of the [","] is unreachable.

Now, most readers probably already know that this will work if you write:

 (SCAN 2 "," Element)* [","]

The above tweak is based on realizing that now that the final comma is optional, we need to scan ahead two tokens to decide whether to repeat the loop or jump out. The above handles this construct whether the trailing comma is present or not! (N.B. In legacy JavaCC, one would write LOOKAHEAD(2).)

Now, the reason I entitled this article as I did was to emphasize the fact that, once you have a sophisticated language or system, there are frequently multiple equally valid ways of expressing something. Thus, for example, CongoCC introduces the up-to-here and up-to-here-plus notation. Another way to express this, that is, in practice, completely equivalent is as follows:

  ("," =>|+1 Element)* [","]

What the above means it that we scan past the comma and then one more token past that point. In effect, this is equivalent to the previous SCAN 2. (Well, in this precise case it is. However, it is also true that the up-to-here notation is more powerful generally speaking, since it handles cases which involve variable numbers of tokens.) Probably, in CongoCC, the last example, with the up-to-here-plus-one might be considered the most natural or idiomatic way to write something like this. But the explicit numerical lookahead of SCAN 2 is fine here as well... You say tomahto, I say tomayto...

The above two examples would probably be the most typical way that people would handle this case. That said, there are other alternatives. If you thought about the problem another way, you could think that you repeat the loop as long as the next character after the comma is not a right bracket. So, we could also write:

   ("," ENSURE ~("]") =>|| Element) [","]

The above will generate code that, when it has to decide whether to repeat the loop or not, first checks that the next token is a comma and then that the token immediately following is not the right bracket ]. By the way, the up-to-here after the ENSURE ~("]") is necessary because without that up-to-here, it would just default to checking a single token ahead and we would have the same old problem. (You might think that with the ENSURE statement right there, the system should be smart enough to realize that your intent is to peek at the next token, i.e. to infer the up-to-here that immediately follows. Well, maybe you you would be right to think that, but that is not currently how it works. You have to specify the up-to-here. It all comes back to the fact that the default is a one-token-lookahead unless specified otherwise.)

An interesting twist on this is that if you write it this way:

  (ENSURE {getTokenType(2) != RBRACE} "," =>|| Element)* [","]

Then the up-to-here is optional! That is because the generated lookahead routine is assumed to scan ahead one token by default, i.e. just past the ",". The prior ENSURE statement actually does look past that first token, but it is expressed in Java code, which is a sort of a black box. So, you could (though I think few people would) write the above as:

  (ENSURE {getTokenType(2) != RBRACE} "," Element)* [","]

That few people would write it that way is just a conjecture but it is based on the notion that, for most people it is unnatural to check the second token and then the first one! So probably, most people would write the SCAN 2 or possibly the alternative with up-to-here-plus-one.

However, the fact remains that there are more than two alternative ways of expressing things here. (So, okay, you say "tomayto", I say "tomahto", and Luigi over there says "pomodoro".)

Now, I should also point out that, although the use of ENSURE in this case seems unnatural, in other cases, it provides an elegant solution to the problem at hand.

Here is an example from my recent work on the Rust grammar. Rust has the same old problem as in Java and C# that the character sequence >> sometimes means a right-shift operator but, in the context of generic parameters or arguments should actually be interpreted as two consecutive > tokens. I decided to solve the problem differently in Rust from the way I did it in Java. The << and >> tokens are deactivated by default. Thus, at the top of the grammar, we have:

 DEACTIVATE_TOKENS GT_GT, LT_LT;

We're only going to activate these tokens when we need them. (N.B. Rust, unlike Java, does not have the <<< or >>> left and right unsigned shift respectively, so no need to deal with that.)

Here is how the left-shift operator is currently expressed in the Rust grammar:

 SyntheticLeftShift#void :
     SCAN "<" "<" ENSURE {getToken(-1).getEndOffset() == getToken(0).getBeginOffset()}
     => ACTIVATE_TOKENS LT_LT ("<<")
 ;

The above is worthy of some study. The above production, called SyntheticLeftShift handles the case of the left shift initially being scanned as two consecutive < tokens. However, those two LT tokens cannot have any whitespace or comments in between them. So we also have the additional condition, expressed in Java code, that the two tokens are adjacent. So, the lookahead expansion after SCAN above is two < tokens in a row but we also add the ENSURE condition. To be honest, when I first wrote this, I was not 100% sure that it would work! I thought to myself: "Is it possible to nest an ENSURE into a syntactic lookahed like that? But, I wrote it that way and then took a peek at the generated code and tested it and it's all fine! So, what happens is that it scans ahead to check for two consecutive LT tokens and that they are adjacent. If that succeeds, it then uncaches the tokens (ACTIVATE_TOKENS automatically does all this) and then tokenizes the text as a single left-shift operator.

Of course, I also had to write a similar SyntheticRightShift production but decided to express it a bit differently... (You say tomayto, I saw tomahto...). I first wrote it like this:

  SyntheticRightShift#void :
      ENSURE (
        ">" ">"
        ENSURE {getToken(-1).getEndOffset() == getToken(0).getBeginOffset()}
      )
      ACTIVATE_TOKENS GT_GT (">>")
  ;

In the above, we have an ENSURE nested within another ENSURE. Again, I was not 100% sure that the above would work! But it does!

I later eyeballed the above SyntheticRightShift production and decided I was being... well... excessively clever (showing off a bit, was I??) and I rewrote it as follows:

  SyntheticRightShift#void :
      ENSURE(">" ">")
      ENSURE {getToken(1).getEndOffset() == getToken(2).getBeginOffset()}
      ACTIVATE_TOKENS GT_GT (">>")
  ;

There is, of course, no need to nest one ENSURE inside the other. It could be written as two consecutive ENSURE statements. That, actually, is a key difference between using ENSURE and SCAN. A SCAN must be at the very beginning of an expansion and you can only have one! It is instructive to compare the second ENSURE in the above two versions. In the first case, with the nested ENSURE, it is in a context where we just scanned the two consecutive > tokens. So, the check for whether the two tokens are adjacent has to be relative to that position, so we use getToken(-1) and getToken(0) respectively in this spot.

In the second example, we use getToken(1) and getToken(2) respectively because this is a separate (not nested) ENSURE so we are back to the starting point, you see... Well, look closely and you'll see. The truth is that it took me a couple of tries to get this right! I wrote it wrong initially. But, with practice, I reckon this kind of thing will become natural to you. (And even to me!)

Addendum

Let us consider again the following snippet:

 (SCAN 2 "," Element)* [","] "]"

As I pointed out, you could also write it this way:

 ("," =>|+1 Element)* [","] "]"

And these are actually 100% equivalent in this case.

I also provided a further alternative:

 ("," ENSURE ~("]") =>|| Element)* [","] "]"

This works just as well as the preceding two, but it turns out there is a slight difference. If the input is valid, it's all the same, but if we give it erroneous input, for example:

  [x, y, z, %/]

Let's just go through the logic here. With the above input, with the SCAN 2 way of expressing it, we scan past x,y,z of course, but then consider the next iteration. We scan one token forward, which is the comma after the z, which is fine, and then we go to the second token (we scan 2 tokens after all!) and we hit this invalid input, the %/. Since this two-token scan fails, what happens? Well, we jump out of the loop, we consume the [","] that immediately follows. And then we are supposed to have a closing square bracket. Except the next token off the stream is %. So it generates the error message that it was expecting a ] but found a %. Right?

However, if we write it the other way, what happens? It parses past the x,y,z, which is all fine. It hits the trailing comma after the z, but then the other part of the predicate is to check whether, specifically, the next token is a closing bracket or not. Well, it isn't. It's a %, so the logic is that it does stay in the loop! In the previous case, it exited the loop. It stays in the loop and then reports the error that the next token is erroneous and lists all the token types that could follow here, to start an Element, such as an identifier, a literal number, and such.

The funny thing about this is that both errors are effectively correct. It is correct to say that the problem after the trailing comma is that the next token is not a closing bracket. But it is also correct to say that it is not one of the tokens that could start an Element. Or, to put it another way, that trailing comma after the z could be a lone trailing comma, which is permissible, but then the next token has to be the closing right bracket. Or it could be a comma that precedes another element in the array, in which case, there is a set of following tokens that could be there.

Now, arguably the system should be smart enough to tell you that the next token after that final comma must be either a closing right bracket or one of the set of tokens (first set) that can begin an Element. That would be the most completely correct error message. However, as of this writing at least, it gives one error message or the other depending on how you expressed the loop. And this is, admittedly, based, in either case, on a certain sort of pure bloody-minded logic.

Well, the bottom line is that it is likely to remain as it is for the time being because this is not much of a practical problem. In practice, as long as the error message gives you the exact location of the error, anybody who goes and looks at this, I mean, something like [x, y, z, %/] will see the problem pretty much immediately. Whether you characterize the problem as being that the token after the trailing comma should be the ] or should start a new Element hardly matters. In fact, it is akin perhaps to characterizing the glass as half full or half empty.

Actually, I would say that this doesn't matter if your parser has no pretension of being fault-tolerant. If we are limiting ourselves to parsing valid input, then in either case, it is behaving irreproachably. It stops when it hits invalid input and says where this happened. How the error message characterizes the problem is maybe not so important. It is only when you start thinking about what to do with invalid input that such things matter. (And can even start making our head spin!)

Now, as for the example of SyntheticLeftShift and SyntheticRightShift, most people would write both of these the same way. Some people say "tomayto" and some people say "tomahto" but I reckon they are usually pretty consistent about which pronunciation they use! Meanwhile, I wrote the two productions differently because these example grammars are designed not only to work (which they do!) but to be usage examples. So, effectively, I will sometimes say "tomayto" and other times say "tomahto" for that reason.

*Or maybe I'm just schizophrenic.

Leave a Comment

Your email address will not be published. Required fields are marked *