New Feature: The =>|| delimiter stands for "scan up to here"

Originally published at: https://javacc.com/2020/07/31/new-feature-for-scan-up-to-here/

Revisiting LOOKAHEAD Redux

Not so long ago, I had a sort of eureka moment when I realized that the legacy LOOKAHEAD construct was fundamentally half-baked or broken. Not only that, but I started narrowing in on how to definitely address the issue!

Well, let’s get concrete. Suppose we have a production that looks kind of like this:

FooBar : <KEYWORD1> [<KEYWORD2>] "foo" Bar <STATEMENT_END>;

(I’ll just use the new streamlined syntax for examples now. See here.

So, expressing things in English, our FooBar production above starts with the KEYWORD1 token, is followed optionally by a KEYWORD2, and then we have concretely a “foo” followed by the rest of the production.

Suppose, further, that we know that if we have a FooBar production, and we get to the “foo” above, we can commit to being in a FooBar. Here is how the literature on legacy JavaCC would tell you to deal with such a situation:

LOOKAHEAD (3) FooBar()

If it’s a question of scanning up to and including the “foo”, we can set a maximum lookahead of 3 tokens. This is already somewhat error-prone because anything that requires a human to calculate something just is that way. (We should all be able to count to 3 or 4, say, but if you are sleep-deprived at 3 a.m. with a deadline looming, all bets are off…)

But further, suppose our language evolves and another optional keyword is introduced in the above. So the production is now:

FooBar : <KEYWORD1> [<KEYWORD2>] [<KEYWORD3>] "foo" Bar <STATEMENT_END>;

If we want to generate a lookahead routine that is guaranteed to scan right up to and including “foo” then we need to rewrite the earlier line as:

LOOKAHEAD (4) FooBar()

Well, suppose this line occurs in more than one place. I guess it’s no big problem. We just have to change the 3 to a 4 in all those places where we wrote that LOOKAHEAD. Well, we can imagine situations that are far from optimal. Suppose you needed to change this in 6 places but you only did it in 5 of them. (Well, hey, I would never have such a lapse, but maybe you would!)

Kidding aside, you can see the problem here. Requiring the programmer to manually specify a specific number like 3 or 4 or whatever, to say the maximum number of tokens to look ahead, is really quite error-prone and not very robust with respect to ongoing evolution of the grammar.

So, that outlines the problem. Here is the proposed solution. Using the new up-to-here marker, we could write:

 FooBar : <KEYWORD1> [<KEYWORD2>] "foo" =>|| Bar <STATEMENT_END>;

What the above (new) delimiter, =>|| means is that any generated scan routine scans up to and including the “foo” token and then reports success. It doesn’t need to go into the Bar, or generally speaking, scan the production right to the very end. My current (though it is open to discussion) terminology for the =>|| delimiter is the up-to-here marker.

So, let us consider the scenarios that I outlined above. The first was that we introduce a new optional keyword, KEYWORD3 so the maximum number of tokens that you need to scan ahead goes from 3 to 4. In the new, improved solution with the up-to-here marker you simply insert the optional KEYWORD3 in the production:

FooBar : <KEYWORD1> [<KEYWORD2>] [<KEYWORD3>] "foo" =>|| Bar <STATEMENT_END>;

No code anywhere else needs to be changed.

Moreover, there is no need to specify any LOOKAHEAD (or SCAN) at all when referencing this FooBar production because the information is already included in its specification. Thus, you can write:

FooBar
|
Alternative1
|
Alternative2

And the code generator will generate the same code that it would have generated if you had written:

LOOKAHEAD (<KeyWord1> [<KEYWORD2>] [<KEYWORD3>] "foo") FooBar()
|
Alternative1()
|
Alternative2()

So, some might say that using syntactic lookahead above instead of a concrete number of tokens is the better solution. But it is also quite error-prone. (Also, it is much more verbose!) Regardless, you have the same problem. If you introduce another optional keyword into your FooBar production, you have to change all these LOOKAHEAD specifications everywhere they occurred. Again, quite error-prone. (By the way, in case you didn’t realize it, when I said that I would never have such a lapse as to make the necessary change in five out of six places, say… that was a joke!)

Another use of the up-to-here marker is when you directly write a production in a choice construct. Thus, the older:

LOOKAHEAD(Foo()) Foo() Bar() Baz()

which is now written in the newer streamlined syntax as:

SCAN Foo => Foo Bar Baz

can now be written (this is all implemented, by the way, this is not some sort of “blue sky” wish list!):

Foo Bar =>|| Baz

The up-to-here marker is meant to symbolize visually that we scan up to here, and then hit an explicit block, it’s like the arrow up to a wall. I thought to make the “wall” two of these | symbols rather than just one because I thought the thing looked a bit more balanced that way. (It’s very very early in the game and these details could be discussed if anybody has any better proposals for syntax and such.)

One funny wrinkle to this is that the abbreviated syntax of:

=> Foo Bar Baz

is now kind of superfluous, I believe.

(In legacy JavaCC, you wrote that as: `LOOKAHEAD(Foo() Bar() Baz()) Foo() Bar() Baz() as some might recall.)

The super-abbreviated syntax above is actually not necessary, because we could express the same idea with:

Foo Bar Baz =>||

That is not shorter, actually, it’s two characters longer, but the point is that it is unified with the other ways of expressing it. Thus:

Foo Bar =>|| Baz

would mean we scan ahead for Foo followed by Bar. And if that succeeds, then we parse the whole thing — Foo Bar Baz in this case. Note, by the way, that in this case, the production can blow up, but presumably it does so when parsing the Baz, not the Foo or the Bar. Of course, if you have any java code actions in your production, then there are no guarantees of that, since the scanahead routine does not execute any of the Java code actions. That only happens when the expression is actually being parsed, not in a lookahead routine. So, in those cases, it could actually fail on Foo or Bar without hitting Baz. (Details, details…))

It is so early in all of this that I am actually wondering whether to remove the abbreviated => Foo Bar construct on the grounds that it is superfluous. I have not come to any decision on that.

Concluding Remarks

In the above, I outlined the problems with the existing LOOKAHEAD construct and the way that this new syntax remedies the situation. In a way, I was skirting around the really central issue here, which is that this whole notion of writing something like LOOKAHEAD (3) is fundamentally a very sucky concept. And here is why:

This simply does not correspond to how a human being thinks!!!

For example, consider a class declaration in Java. I put it to you:

Does a human programmer think to himself something like: What is the number N of tokens I have to scan ahead to identify something as a a class declaration?”

I think not. The person’s conceptual model is something like this:

(various stuff...) "class" (WE CAN STOP SCANNING HERE!) "{ (various more stuff...) "}"

In short, once you hit the “class” keyword, that’s it. You know that this is a class declaration. So, with this new feature, the way that the Java grammar in JavaCC works now corresponds much more to a human being’s conceptual model of things. Like so:

ClassDeclaration
Modifiers
"class" =>||
<IDENTIFIER>
[ TypeParameters ]
[ ExtendsList(false) ]
[ ImplementsList(false) ]
ClassOrInterfaceBody(false)
;

Again, this reflects the way a human being considers the situation. Once we hit that “class” keyword, we know we are in a class declaration! So we can abort any scan routine for checking if what follows is a class declaration. Similarly, here is how the current Java grammar defines a Lambda expression:

LambdaExpression :
LambdaLHS
"->" =>||
(Expression | Block)
;

Once we hit the -> token, we know we are in a Lambda since this token does not occur anywhere in a Java grammar except within a Lambda. (This is no longer true as of JDK 14, where the -> is used for a different purpose inside switch expressions. But that is currently unimplemented and I’ll cross that bridge when I choose to cross it!)

In any case, when I first implemented the Lambdas in the Java grammar in earlier versions of the Java grammar, I wrote:

 LOOKAHEAD(LambdaLHS() "->") LambdaExpression()

and now, in the same spot, I just write:

 LambdaExpression

In the definition of the Lambda expression, I simply put a up-to-here marker after the arrow token, so the code generation machinery will generate a scan routine that scans up to that point. And this formulation of things should be pretty robust wrt future evolution and refinements.

One Other Little Detail

The up-to-here marker is ignored in certain situations, of course. If you write:

=> FooBar FooBaz

the scanning machinery will ignore the up-to-here marker that is inside of FooBar, because it must go to the very end of the FooBar production to then scan for the FooBaz that follows! If, however, you had:

=> FooBaz FooBar

then, yes, it would stop scanning after the =>|| inside of FooBar. It can do that because it is at the very end of the expansion to be scanned. Or, similarly, if you were building on top of the Java grammar, and you had:

=> MethodDeclaration ClassDeclaration

It would scan to the end of the MethodDeclaration (regardless of any up-to-here markers inside of it) but it would stop after “class” when it got to the ClassDeclaration. Since the ClassDeclaration is last, there is no need to scan any further at that point. It is important to understand these points because, even though we have much more powerful computing resources than ever before, scanning to the very end of a class declaration (in spots where you only need to go up to the first occurrence of “class”) could start getting a bit expensive and make your parser noticeably sluggish!

So there are some niggling details here and, as of this writing, I cannot exclude that there are some corner cases in all of this that are not being handled correctly, but I would just encourage people to use this and report any bugs they come across (or anything you think is a bug) and it will be dealt with. As I pointed out earlier, even though it is a new feature, this up-to-here marker will result in much more robust and readable grammars. So it is well worth using. It seems solid and I have started using it in internal development already.

I am quite excited about this new feature because I think it will largely supersede the legacy LOOKAHEAD specification, particularly, specifying specific limits, like 3 or 4 or whatever. Also, it seems quite possible that explicit syntactic lookaheads will mostly become unnecessary. Likewise, the new LookBehind feature should serve to make the majority (not all, probably) semantic lookaheads unnecessary.