Turning to Semantic Lookahead

In my last post I reported that I had finally fixed the way syntactic lookahead works in JavaCC.

This leads to the natural question: what about semantic lookahead?

Well, what is historically called semantic lookahead in JavaCC is actually misnamed. It’s really just a bit of Java code that is used to determine whether to enter an expansion or not. Something like:

  LOOKAHEAD ({someCondition()}) Foo()

In the above snippet, someCondition() must return true or false; based on that, the parsing machinery decides whether to enter the Foo() production. The “lookahead” part of the name is particularly misleading. The Java code does not necessarily look ahead. It might or might not, or it might look behind or look somewhere else, or check whether we are currently in the Year of the Dragon…. In short, from the point of view of the JavaCC machinery, the Java code is just a black box that outputs a yes/no or true/false based on whatever inscrutable criteria.

Well, regardless of the terminology, what is interesting about semantic lookahead is that, first of all, semantic lookahead (unlike syntactic lookahead) does nest in legacy JavaCC. This is question 4.11 in Theodore Norvell’s JavaCC FAQ:

Is semantic lookahead evaluated during syntactic lookahead?

And the answer is: yes.

Now, one obvious point about all this is that it makes no sense whatsoever for semantic lookahead to nest but for syntactic lookahead not to nest. That is clear enough, but the plot actually thickens beyond that point. What is also quite clear is that this whole issue is the source of various nasty gotchas when using JavaCC. Those gotchas are also documented in the JavaCC FAQ. For example, the following question, 4.12:

Can local variables (including parameters) be used in semantic lookahead?

The answer to that one is not so clear. Well, basically, you can use local variables in your “semantic lookahead” as long as that lookahead is never invoked outside of that local context – i.e. it is NOT nested. Well, not to worry if you don’t understand that last sentence. Clearly, an example is needed to focus the question. Suppose you have a production:

 void FooBar() :
 {boolean foundOpeningBrace = false;}
 {
     BlahBlah()
     [
        "{" {foundOpeningBrace = true;}
     ]
     MoreBlahBlah()
     [
         LOOKAHEAD({foundOpeningBrace})
         "}"
     ]
     ... etcetera ...
 }

Now, the intent of the LOOKAHEAD above is clear enough. We only match the closing brace after MoreBlahBlah() IF we previously encountered an opening brace after BlahBlah(), which we keep track of with the local variable foundOpeningBrace.

And this works perfectly well…

UNTIL… we try to use this in a nested lookahead, like in some other part of our grammar, we have:

  LOOKAHEAD (FooBar()) ...

What is interesting about this is that the above will generate code that does not even compile! The foundOpeningBrace variable is only defined locally in the generated FooBar() method and the generated code for the nested lookahead (called from a completely different method) will not even have access to that variable. Again, Professor Norvell documents this issue in his answer to the FAQ.

Again, Professor Norvell refrains from using the dreaded B-word, though, to be fair, it is not so clear that this is a bug! The problem is that this issue sometimes is a bug, but sometimes not. Well, that sounds self-contradictory and maybe it is. What I really mean is that sometimes it works as you intend, but often it does not.

You see, the deeper problem is that, as you parse code, there are presumably semantic actions being executed, which would frequently build up some data as the parsing proceeds. It will be very typical that your “semantic lookahead”, i.e. your Java code that decides whether to enter a production or not, depends on your being at a specific point in the parsing. However, when you are scanning ahead, possibly visiting deeply nested expansions, you are not currently at that point in the parsing. Or, to put it another way, a scanahead routine does not execute any of the Java code actions associated with the actual parsing of the constructs you are scanning.

Getting back to the little example above, the action foundOpeningBrace = true is not executed in a lookahead, only when the production is actually parsed! Thus, it is really quite dubious that you want to check that variable if, at this point, it cannot possibly have been set anyway! (That, aside from the straight-out technical impediment that you can’t access the variable because it is local to the FooBar() method to start with!)
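To make the parse-versus-scan distinction concrete, here is a minimal sketch in plain Java. It is not what JavaCC actually generates. The class ScanVersusParse and its methods are hypothetical stand-ins, the tokens "blah" and "more" stand in for BlahBlah() and MoreBlahBlah(), and the flag is promoted from a local variable to a field purely so that the scan routine compiles at all.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical hand-written stand-in for the routines generated for FooBar().
final class ScanVersusParse {
    private final List<String> tokens;

    // ASSUMPTION: a field rather than a local, only so scanFooBar() can see it.
    private boolean foundOpeningBrace;

    ScanVersusParse(List<String> tokens) {
        this.tokens = tokens;
    }

    private boolean at(int pos, String image) {
        return pos < tokens.size() && tokens.get(pos).equals(image);
    }

    /** Real parse: semantic actions run. Returns the number of tokens consumed. */
    int parseFooBar() {
        int pos = 0;
        foundOpeningBrace = false;
        if (at(pos, "blah")) pos++;                              // BlahBlah()
        if (at(pos, "{")) { pos++; foundOpeningBrace = true; }   // action executed
        if (at(pos, "more")) pos++;                              // MoreBlahBlah()
        if (foundOpeningBrace && at(pos, "}")) pos++;            // LOOKAHEAD({foundOpeningBrace}) "}"
        return pos;
    }

    /** Scan-ahead: same shape, but the action is skipped. Returns tokens scanned. */
    int scanFooBar() {
        int pos = 0;
        if (at(pos, "blah")) pos++;
        if (at(pos, "{")) pos++;                                 // action NOT executed
        if (at(pos, "more")) pos++;
        if (foundOpeningBrace && at(pos, "}")) pos++;            // flag was never set
        return pos;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("blah", "{", "more", "}");
        System.out.println(new ScanVersusParse(input).scanFooBar());  // prints 3
        System.out.println(new ScanVersusParse(input).parseFooBar()); // prints 4
    }
}
```

On the same input, the scan stops one token short of the real parse, because the condition it checks depends on an action that only the real parse performs.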

In short, the idea that you always want to check so-called “semantic lookahead” inside of a nested syntactic lookahead is actually quite dubious. At least as a general idea.

I gave this quite a bit of thought over the last few days. It’s actually quite a conundrum. I came to the following conclusions:

Typically, as a default, you do not want to check semantic lookahead in a nested syntactic lookahead.

BUT….

The problem is that sometimes you do!

Well, the solution that I implemented is as follows:

If you really want a semantic lookahead to be checked in a nested lookahead, then you indicate that as follows:

LOOKAHEAD ( {checkSomeCondition()}# ) FooBar()

The # after the semantic lookahead means that it is used in nested lookahead, or in other words, we could say that it is global. If you do not specify that, it will only be used locally and routines generated for nested syntactic lookahead will ignore this check.
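The rule can be sketched in a few lines of Java. This is only an illustration of the behavior described above, not the code JavaCC 21 generates. The class SemanticPredicate and the method permits() are hypothetical names, and I am assuming that "ignoring" a local check in a nested scan means the scan proceeds as if the check had passed.

```java
import java.util.function.BooleanSupplier;

// Hypothetical model of a semantic lookahead with or without the trailing '#'.
final class SemanticPredicate {
    private final BooleanSupplier condition; // the black-box Java code
    private final boolean global;            // true models {condition}#

    SemanticPredicate(BooleanSupplier condition, boolean global) {
        this.condition = condition;
        this.global = global;
    }

    /**
     * @param inNestedScan true when called from a scan routine generated
     *                     for some enclosing syntactic lookahead
     */
    boolean permits(boolean inNestedScan) {
        if (inNestedScan && !global) {
            // ASSUMPTION: a local check is skipped, so it never blocks the scan.
            return true;
        }
        return condition.getAsBoolean();
    }
}
```

With this model, a predicate built with global set to false is only consulted at the actual choice point, while one built with global set to true is consulted everywhere, which mirrors the legacy behavior.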

This provides a way for you to make your intent clear.

So, finally, I guess we could say that legacy JavaCC providing no way of specifying your intent is also a bug. (Oh, that dreaded B-word again!) At the very least, it is a point where the legacy tool is in a rather half-baked state.

In any case, JavaCC 21 now addresses this issue. The updated answer to the FAQ entry here:

Can local variables (including parameters) be used in semantic lookahead?

would now be:

  In legacy JavaCC, semantic lookahead is always invoked 
  in nested lookaheads, which leads to some rather annoying 
  gotchas. The updated version of JavaCC, JavaCC 21, lets 
  you specify whether a semantic lookahead should be 
  considered in nested lookaheads.

And then, the FAQ entry could go on to explain that, in JavaCC 21, if you want a semantic lookahead to be considered even in nested lookaheads, you need to put a # symbol after it, such as:

  LOOKAHEAD({someCondition()}#) FooBar()

If you write the LOOKAHEAD without the # symbol, it is only considered locally.

Closing Note

In general, there is a need for a more powerful way of expressing LOOKAHEADs (predicates really) in JavaCC. With that in place, there would be less and less need for "semantic" lookahead (i.e. dropping into Java) to express a predicate.

One step in this direction that is already implemented is the ability to make a LOOKAHEAD expansion negative. So, you can write:

 LOOKAHEAD(~Foo()) Bar()

This gives us a way of expressing the rather commonplace idea that, if what follows is not a Foo, then we assume it is a Bar.
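In hand-written terms, the negated predicate amounts to something like the following sketch. The names NegativeLookahead, scanFoo and shouldEnterBar are hypothetical, and Foo() is reduced to a single token "foo" to keep the example small. It is an illustration of the idea rather than the generated code.

```java
import java.util.List;

// Hypothetical illustration of LOOKAHEAD(~Foo()) Bar().
final class NegativeLookahead {

    // Stand-in scan routine: here a Foo is just the single token "foo".
    static boolean scanFoo(List<String> tokens, int pos) {
        return pos < tokens.size() && tokens.get(pos).equals("foo");
    }

    // Enter Bar() only if a scan for Foo() fails at the current position.
    // The scan consumes nothing; it only answers yes or no.
    static boolean shouldEnterBar(List<String> tokens, int pos) {
        return !scanFoo(tokens, pos);
    }
}
```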

That is already implemented but one thing that is still missing is a way of writing predicates that answer the basic question of whether we are already in a given grammatical construct. I envisage being able to add something like:

LOOKAHEAD(/**/ClassDeclaration/) ConstructorDeclaration()

which would say that if we are inside a ClassDeclaration production, then we can parse a constructor declaration at this point. Otherwise not. This is the one thing that I describe in this post that is not yet implemented. However, it does not actually look very technically difficult to do! So, stay tuned.