Revisiting LOOKAHEAD in JavaCC, Attacking the Dmitry Dmitriyevich Problem

As a result of some private correspondence with one of the PMD developers (PMD makes extensive use of JavaCC) I started thinking about some issues that I had in the back of my mind to look into at a later point.

Having recently ripped out the code in JavaCC 21 that supported putting LOOKAHEAD specifications at non-choice points, I was informed that this was a change (possibly not the only one) that prevented PMD from switching over to JavaCC 21. To tell the truth, I was rather surprised by this. I had just assumed that practically no JavaCC grammars in the wild really used this. Besides, the usage pattern gave off a funny smell. Basically, you define a phony grammatical production (one in which the parser never enters) in order to use it in LOOKAHEADs.

The need to do things like this is suggestive of some design failings. Well, truth told, I don't think that there is any absolute need to do this as things stand, even when using legacy JavaCC. However, all of this got me thinking about how LOOKAHEAD works in JavaCC. As far as I can tell, despite this being one of the most basic things in the tool, it has never been revisited.

Okay, starting from first principles, JavaCC, at least by default, assumes a LOOKAHEAD of 1 token. Whenever there is a fork in the road and it must decide which path to take, the baseline assumption is that looking at the next token in the stream is enough information to make this decision.

Well, sometimes it is and sometimes it isn't. But if it isn't, you're supposed to specify LOOKAHEAD. That makes sense as far as that goes, but actually, I daresay that all of this has been one of the most annoying aspects of using the tool. For one thing, it requires us to insert explicit LOOKAHEAD specifications in places where the tool really ought to just infer them. For example, if you write something like:

  
  (
    (<IDENTIFIER> "{")
    |
    (<IDENTIFIER> "(")
  )

the thing bitches and moans until you put LOOKAHEAD(2) up top. In fact, the tool can perfectly well deduce that it needs to scan ahead 2 tokens to resolve the choice. Finally, as far as I can see, there is no real reason that it should not just deduce this and just quietly put in the code to look ahead the necessary 2 tokens. As best I understand it, this is really just some sort of longstanding shibboleth.

Now, when it comes to having to overspecify things, as is frequently the case with LOOKAHEAD in JavaCC, it occurs to me that this is an example of what I have called in the past the Dmitri Dmitriyevich problem, a.k.a. the Ivan Ivanovich or Vladimir Vladimirovich problem. (This is just my own private terminology, but hey, maybe it could become popular out there!)

The origin of the term is one of my (possibly dubious) attempts at a witticism some years back, when I commented to a collaborator that, increasingly, I found that reading Java code was like reading one of those huge Russian novels in which the characters all have these sorts of long, seemingly repetitive names. In particular, constructs like:

   Map<String, String> lookup = new HashMap<String, String>();

Well, at some point, one of the many geniuses employed at Sun Microsystems figured out that the second specification of the type parameters in the line above was superfluous and the compiler could simply deduce it. So now you can just write:

   Map<String, String> lookup = new HashMap<>();

Well, that took them 7 years (from JDK 1.5 in 2004 to JDK 1.7 in 2011) but the genius boys figured it out. So much for that particular Dmitry Dmitriyevich problem. R.I.P. (By the way, I actually do know that in these Russian names, the second name is a patronymic, the name of the father, so it is not really repetitious and superfluous. It just looks that way when the son is named after the father. In the line of Java code above, by contrast, the repetition of the type parameters really is redundant and pointless!)

But getting back to JavaCC specifically, we frequently encounter things the following Dmitri Dmitriyevich construct:

   LOOKAHEAD (Foo() [Bar()] Baz())
   Foo()[Bar()]Baz()

It so happens that, in a very high percentage of the cases where you write a LOOKAHEAD with some fairly complex expansion with non-terminals inside, the very same expansion is simply repeated afterwards. Or even not such complex things:

    LOOKAHEAD(Foo()) Foo()

Now, granted, the expansion after the lookahead is not always identical. For example, it would make perfect sense to write:

   LOOKAHEAD ("try" "{")
   "try" CodeBlock()...

In the above case, we only need to look ahead 2 tokens to decide whether to enter the expansion. There is no need to do a syntactic lookahead all the way to the end of the last catch or finally block, which would be rather expensive and pointless. However, the case where the expansion inside the LOOKAHEAD is identical to the one that follows it is common enough that it ought to receive some special consideration. So, in the latest version of JavaCC 21, if the expansion following the LOOKAHEAD is identical, you don't need to specify it, so you can write:

   
LOOKAHEAD()
   Foo() Bar() Baz()

and it is the same as if you had written:

   
LOOKAHEAD(Foo() Bar() Baz())
   Foo() Bar() Baz()

 

In fact, it occurred to me that LOOKAHEAD() was still too verbose, so I introduced a new keyword (though this could be open to some discussion). I chose "SCAN" so you can also just write:

   SCAN Foo() Bar() Baz()

I'm interested in some feedback on that. It also has occurred to me since then that even just using a one-character operator, like maybe ^ could be even better. I don't really know whether that is better than using an English language keyword like "scan". I thought to try to put to vote whether people prefer to write:

   LOOKAHEAD(Foo() Bar() Baz()) Foo() Bar() Baz()

or:

   SCAN Foo() Bar() Baz()

or

   ˆFoo() Bar() Baz()

Somehow, I don't think very many people would vote for the first one! However, it will always remain an option!

1 thought on “Revisiting LOOKAHEAD in JavaCC, Attacking the Dmitry Dmitriyevich Problem”

  1. Pingback: Straightforward LOOKAHEAD Enhancements – JavacCC for the 21st Century

Comments are closed.