Revisiting LOOKAHEAD in JavaCC, Attacking the Dmitry Dmitriyevich Problem

Originally published at: https://javacc.com/2020/04/11/revisiting-lookahead/

As a result of some private correspondence with one of the PMD developers (PMD makes extensive use of JavaCC) I started thinking about some issues that I had in the back of my mind to look into at a later point.

Having recently ripped out the code in JavaCC 21 that supported putting LOOKAHEAD specifications at non-choice points, I was informed that this was a change (possibly not the only one) that prevented PMD from switching over to JavaCC 21. To tell the truth, I was rather surprised by this. I had just assumed that practically no JavaCC grammars in the wild really used this. Besides, the usage pattern gave off a funny smell. Basically, you define a phony grammatical production (one in which the parser never enters) in order to use it in LOOKAHEADs.

The need to do things like this is suggestive of some design failings. Well, truth told, I don’t think that there is any absolute need to do this as things stand, even when using legacy JavaCC. However, all of this got me thinking about how LOOKAHEAD works in JavaCC. As far as I can tell, despite this being one of the most basic things in the tool, it has never been revisited.

Okay, starting from first principles, JavaCC, at least by default, assumes a LOOKAHEAD of 1 token. Whenever there is a fork in the road and it must decide which path to take, the baseline assumption is that looking at the next token in the stream is enough information to make this decision.

Well, sometimes it is and sometimes it isn’t. But if it isn’t, you’re supposed to specify LOOKAHEAD. That makes sense as far as that goes, but actually, I daresay that all of this has been one of the most annoying aspects of using the tool. For one thing, it requires us to insert explicit LOOKAHEAD specifications in places where the tool really ought to just infer them. For example, if you write something like:

  
  (
    (<IDENTIFIER> "{")
    |
    (<IDENTIFIER> "(")
  )

the thing bitches and moans until you put LOOKAHEAD(2) up top. In fact, the tool can perfectly well deduce that it needs to scan ahead 2 tokens to resolve the choice. Finally, as far as I can see, there is no real reason that it should not just deduce this and just quietly put in the code to look ahead the necessary 2 tokens. As best I understand it, this is really just some sort of longstanding shibboleth.

Now, when it comes to having to overspecify things, as is frequently the case with LOOKAHEAD in JavaCC, it occurs to me that this is an example of what I have called in the past the Dmitri Dmitriyevich problem, a.k.a. the Ivan Ivanovich or Vladimir Vladimirovich problem. (This is just my own private terminology, but hey, maybe it could become popular out there!)

The origin of the term is one of my (possibly dubious) attempts at a witticism some years back, when I commented to a collaborator that, increasingly, I found that reading Java code was like reading one of those huge Russian novels in which the characters all have these sorts of long, seemingly repetitive names. In particular, constructs like:

   Map<String, String> lookup = new HashMap<String, String>();

Well, at some point, one of the many geniuses employed at Sun Microsystems figured out that the second specification of the type parameters in the line above was superfluous and the compiler could simply deduce it. So now you can just write:

   Map<String, String> lookup = new HashMap<>();

Well, that took them 7 years (from JDK 1.5 in 2004 to JDK 1.7 in 2011) but the genius boys figured it out. So much for that particular Dmitry Dmitriyevich problem. R.I.P. (By the way, I actually do know that in these Russian names, the second name is a patronymic, the name of the father, so it is not really repetitious and superfluous. It just looks that way when the son is named after the father. In the line of Java code above, by contrast, the repetition of the type parameters really is redundant and pointless!)

But getting back to JavaCC specifically, we frequently encounter things the following Dmitri Dmitriyevich construct:

   LOOKAHEAD (Foo() [Bar()] Baz())
   Foo()[Bar()]Baz()

It so happens that, in a very high percentage of the cases where you write a LOOKAHEAD with some fairly complex expansion with non-terminals inside, the very same expansion is simply repeated afterwards. Or even not such complex things:

    LOOKAHEAD(Foo()) Foo()

Now, granted, the expansion after the lookahead is not always identical. For example, it would make perfect sense to write:

   LOOKAHEAD ("try" "{")
   "try" CodeBlock()...

In the above case, we only need to look ahead 2 tokens to decide whether to enter the expansion. There is no need to do a syntactic lookahead all the way to the end of the last catch or finally block, which would be rather expensive and pointless. However, the case where the expansion inside the LOOKAHEAD is identical to the one that follows it is common enough that it ought to receive some special consideration. So, in the latest version of JavaCC 21, if the expansion following the LOOKAHEAD is identical, you don’t need to specify it, so you can write:

   
LOOKAHEAD()
   Foo() Bar() Baz()

and it is the same as if you had written:

   
LOOKAHEAD(Foo() Bar() Baz())
   Foo() Bar() Baz()

 

In fact, it occurred to me that LOOKAHEAD() was still too verbose, so I introduced a new keyword (though this could be open to some discussion). I chose “SCAN” so you can also just write:

   SCAN Foo() Bar() Baz()

I’m interested in some feedback on that. It also has occurred to me since then that even just using a one-character operator, like maybe ^ could be even better. I don’t really know whether that is better than using an English language keyword like “scan”. I thought to try to put to vote whether people prefer to write:

   LOOKAHEAD(Foo() Bar() Baz()) Foo() Bar() Baz()

or:

   SCAN Foo() Bar() Baz()

or

   ˆFoo() Bar() Baz()

Somehow, I don’t think very many people would vote for the first one! However, it will always remain an option!

Hi. For me:

  • ok for lookahead() foo() bar() baz() (empy args -> full semantic la)
  • against for lookahead foo() bar() baz() : no args could be reserved for other future specific cases
  • against scan foo() bar() baz(): no interest adding a new keyword
  • agains ^ foo() bar() baz(): no interest in a cryptic character; keep it for other future uses
  • ok for lookahead(~…)

Again, thanks for the comment.

I’m not sure why you would prefer to write LOOKAHEAD() Foo() Bar() Baz() as opposed to:
LOOKAHEAD Foo() Bar() Baz()

What good is the extra pair of parentheses after LOOKAHEAD() doing you in this case?

As for the keyword SCAN, I really don’t know. My sense of things is that LOOKAHEAD is a bit of a mouthful and long-winded and to be able to write SCAN is a bit nicer, but where I have misgivings is that maybe (it’s a possibility) I would rather reserve SCAN for maybe some other feature that I implement where I want something a bit like LOOKAHEAD but with slightly different semantics.

Currently, the way it works is that SCAN and LOOKAHEAD are just synonyms. I originally thought that I would introduce SCAN as a shorthand, for when you didn’t really need to repeat the expansion. The Dmitry Dmitriyevich problem. But then I realized that I could just have LOOKAHEAD Foo() Bar() Baz() work and maybe I could just reserve SCAN for some later purposes.

Another thing I’m thinking of doing by the way, is merging SKIP and SPECIAL_TOKEN into one thing and just calling it UNPARSED. The whole SKIP vs. SPECIAL_TOKEN thing is basically pointless now that the gigabyte is the new megabyte. You might as well just store everything. SKIP is almost always used for whitespace and you’re not going to save that much space by throwing it away. I mean, the whole thing is just an extra distinction that serves very little purpose finally.

As for ^Foo() Bar()Baz(), I haven’t implemented anything with that… But, yes, in general, using up delimiters is a bit more problematic. We have an infinite number of keywords, but still, natural ones like SCAN might be in short supply a bit…

Hi
Why LOOKAHEAD() and not LOOKAHEAD?
To keep close to the current syntax : The general structure of a LOOKAHEAD specification is: LOOKAHEAD(amount, expansion, { boolean_expression }):

  • it makes it easier for IDEs (code completion, formatting) to handle one syntax (always with parenthesis) than two (with or without),
  • is it easier for the beginner to learn only one syntax.

SPECIAL_TOKENS : we need them in IDEs (for reformating, syntax highligthing). A good improvement would be to chain them differently to the other tokens: in this order:

  • specials after normals and on the same line as them : chain them in a “end” list (end of line comments or block comments, usually related to the current line)
  • specials before normals and on the same line : chain them in a “beg” list (block comments at the beginning of a line, usually part of the line)
  • specials before normals on the previous line : chain them in a “prev” list (javadoc,block or line comments, usually related to the next line)
  • specials on lines with at least one blank line before normals or at eof : chain them in an “orphan” list (usually more global comments)

SKIP and SPECIAL_TOKENS: a very good improvement would be to transparently generate the tokens for them, to be able to generate the node tokens (for JJTree or JTB), to have them in the ASTs; currently, if we want to manage them (again we need them for formating / code completion in IDEs), we need either to declare them as normal tokens and pollute the grammar (with optional productions or lexical states), or write specific code to get the node tokens, the tokens and then the specials lists or even use the tokenmanager APIs to get the skipped whitespaces.

In fact, I would definitely prefer lookahead(!..) to lookahead(~…), because ! is the usual negation operator.

One more word about letting javacc infer the right lookahead and silently manage it: I’m totally against this, because:

  • this is an open door for javacc making errors (too many times I found that the “please consider lookahead of 3 …” was inappropriate)
  • this is an open door for not yet javacc experts get confused on their grammar problems and knock at the support door
  • what would be the benefit of inferring trivial things that the user can trivially fix?

In your trivial example ( < IDENTIFIER > “{”) | ( < IDENTIFIER > “(”)) the more optimized solution is to factor the first term, so do not hide it by letting the user keep this syntax.

Well, really, there is a fundamental issue here that will repeatedly come up in many future decisions. The question is whether one’s major concern is with making migration easier for existing JavaCC users. OR whether you are mostly concerned with the user experience of completely new users, i.e. people who never used any version of JavaCC before.

From the point of view of somebody who never used JavaCC before, how close this is to the current syntax is something completely irrelevant. And, also, from that perspective, the decision between SCAN and LOOKAHEAD is a very easy decision. Unless somebody was used to writing LOOKAHEAD, why would anybody prefer writing that to simply SCAN?

I remember when I took over the FreeMarker project, and we were preparing FreeMarker 2, there were some similar sorts of issues. FreeMarker 1.x existed and had a smallish community of people around it, but I was thinking that eventually, almost the entire user communahity of FreeMarker 2.x would be people who had never used FreeMarker 1, so basically: while there was no reason to break older templates gratuitously, the main consideration should be doing the best thing for the new users.

Of course, on FreeMarker, I was right, you know. I don’t know the exact numbers, of course, but I’m sure that if you consider all the people using FreeMarker or who have used it in the past, the proportion who ever used FreeMarker 1.x must be very very small.

In general, I’m not somebody who tends to think small. I believe that JavaCC 21, with the ideas being put in place, could eventually have a user community far larger than the legacy tool ever had, so again, the main consideration would be creating something very usable for the new users, and yes, also keeping broad backward compatibility with the legacy tool (not perfect compatibility because that won’t be possible) but that will be more like a secondary consideration.

The problem is that I already plan to use the exclamation mark for another purpose. It indicates that an expansion is “forced”, so if you write:

“(” Expression “)”!

the meaning is that the closing parenthesis, is “forced”, i.e. even if there is no closing parenthesis, we will insert a virtual token there to close the expression.

JavaCC 21 already allows you to treat “special” tokens as nodes. That is the setting SPECIAL_TOKENS_ARE_NODES.

As for SKIP tokens, they are just thrown away, but there is not much reason not to just treat them the same way as SPECIAL_TOKENS. Of course, SKIP is pretty much always used for whitespace, and if you have the location informaiton, you can infer the whitespace (except for the annoying tabs vs. spaces and line endings nonsense, that is).

A funny little terminological thing include, you know, is that if you include all the tokens in your tree, including the so-called “special tokens” and whitespace, then your tree is not really an AST, but a CST. (Concrete syntax tree) Because you can reconstruct the original file precisely from the tree.

As for the legacy API of having this public field specialToken (in the Token class) that is pretty horrible and we need to find a way to refactor that without it being too traumatic for existing users. That is certainly on my mental TODO list.

Well, about ! versus ~, I don’t see there would be a tremendous conflict. In a JavaCC grammar { and } are both JavaCC AND Java delimiters… # has differents meanings…

Why LOOKAHEAD() and not LOOKAHEAD?
Please consider the real argument: it makes it easier for the IDEs. You want plugins for JavacC 21, don’t you?

Frankly, if this is your argument, I can’t really take it very seriously. You’re trying to tell me that the syntax should be LOOKAHEAD() Foo() Bar() Baz() instead of simply LOOKAHEAD Foo Bar Baz()

even though the () after the LOOKAHEAD contains ZERO information

because somehow it’s just so very hard to handle the latter construct. I find the whole concept so unconvincing. For starters, it does not look very difficult to parse at all. You either have:

LOOKAHEAD(blah blah) <Expansion>

OR:

LOOKAHEAD <Expansion>

Why is this some big problem? In Java, a method call with no arguments, like foo() needs to have the () because otherwise, foo on its own is a variable reference and those two things must be distinguished. However, in this case with LOOKAHEAD, there is no need to distinguish LOOKAHEAD() from LOOKAHEAD alone, so the () looks completely superfluous.

I could be missing something about this, but I am not seeing it right now…

As examples:

See lines 674+ in https://sourceforge.net/p/eclipse-javacc/code-git/ci/master/tree/sf.eclipse.javacc/src-plugin/sf/eclipse/javacc/scanners/JavaCCCodeColorRule.java

See lines 917+ in https://sourceforge.net/p/eclipse-javacc/code-git/ci/master/tree/sf.eclipse.javacc/src-plugin/sf/eclipse/javacc/editors/CompletionProcessor.java

Lookahead() is already handled. Lookahead not. Small extra work when so much needs to be done elsewhere will go in last priority, for me.

User will always have to think and choose between Lookahead and Lookahead() when editing with code completion… Currently he has only one proposal. No need an extra thinking.

You have to understand that, in my world, if it is a choice between inconveniencing the plugin writer (you in this case) and inconveniencing the end user, I will always choose to inconvenience the plugin writer.

This is for the very simple reason that there will never be more than a few plugin writers, but there are potentially thousands of end users. In fact, by the same reasoning, I will inconvenience myself rather than inconvenience the end user!

This is why there are things that I wrote over 15 years ago that are still widely used. I have a relentless focus on usability. Everything I do is very usable. It’s not a question of anything personal, about you specifically, but I will not oblige the end user to write LOOKAHEAD() Foo() instead of LOOKAHEAD Foo() because the latter is a bit more work for you! (If it were vastly more work, then maybe the calculation would change, but I don’t believe it is very difficult to support this.)

Another thing this got me wondering about is whether there should really be any need to write a Non-terminal with no args as Foo(). Possibly, Foo would be okay. I’m thinking about that now, but I am not going to make any very rash decision. The reason that you have to put the empty parentheses after a method call in Java is that foo() and foo really are necessarily different.

But that is another question that I am not going to decide on right now. The basic thing here is that if you want to convince me of something, you have to make an argument based on what is good for the end user, not what is good for you! This has to do with the overall philosophical approach of any project that I am in control of.

Why does JavaCC grammar look like a preprocessor language? And not like a normal language? Heritage of lex/yacc and forth.

To use tkid = < ATOKEN > and ret = AProduction() you need to define tkid and ret in the PARSER_BEGIN…END java section, and use them in {} actions like { ret == 0 } or { call xx(tkid); }
Why not allow defining variables at the JavaCC level?
Why not get rid of all Java code actions and begin…end section, and just add simple constructs in the language for declaring / assigning variables, using conditions and calling external methods, like Token = tkid; int = ret; (ret == 0), call xx(tkid);? Just define and implement the minimal set of constructs to condition the grammar parts and call other classes methods.
These constructs could be translated at generation time in Java or C++ or Kotlin, like the rest of the grammar.
That would help focusing on the grammar at lexical & syntactical level, and use an external class to handle the semantic level. Help not writing in the grammar things related to the tools you want to build with the grammar.

What is painful in writing Java fragments in current JavaCC grammars is that the IDEs do not allow easily to support editing these fragments with the same features than in the appropriate editors (it is not easy and may be impossible to use the JDT or CDT in another language development tool like the JavaCC plugin, so you have to reinvent the wheel for the java or c++ parts).
So for now, if you have a lot of java code to write (for chaining parsers, for handling errors, for languages like Uniface or Progress with weird syntaxes because they are interpreted), editing the grammar is a burden and one can end up using a companion class to the parser for that. But the IDE does not allow easy navigation between these. If all the grammar code was JavaCC only, it would be easier to build in the plugin the navigation between it and the JDT / CDT.

So back to xxx() and xxx: I strongly suggest keeping xxx() for methods and reserving xxx for variables for the next decade of the 21st century. 2 keystrokes saves is not a must have for a lot of users, not only for me.

Well, you should have understood that as you asked me to contribute you have to understand what return I expect and if I don’t get what I strongly want I probably shall not contribute!
I contributed to the plugin and JTB during my work time (not my spare time) because I could justify that at the end my projects would cost less.

Well, what can I say? If you prefer a project with a philosophy where you do things for your own convenience rather than that of the end user, then you can find such a project surely. (Well, not with JavaCC, because this is the only branch of active development on JavaCC, but something completely different, I guess…)

I think, however, that you will find that, in the end, working on this project will be more satisfactory than most other things. For one thing, a project run according to this philosophy has the potential of becoming something very widely used.

But again, the bottom line is that if you make the argument: the syntax should be like this because it makes my life easier. As opposed to saying that it makes things easier for the end user, this is just not an argument that can sway me easily because it just doesn’t correspond to how I think. That’s my philosophy and you would have a very very difficult task to convince me that I am wrong on something so fundamental! (On other, more mundane questions, you might frequently argue with me and convince me that I’m wrong, but on something like this…)

As an end user, xxx does not make my life easier that xxx(). It would be the opposite. I guess that’s true for a lot of users. But I can imagine some will have the opposite view, like you.
So let’s go to the design level.
What kind of language do we want? Lisp / yacc like, or Java / C / … like? For language adoption. For code reuse. For less bugs. For less billion dollar mistakes. For better tools / IDE support. For better code quality.
We intend to use JavaCC with Java, C++, Kotlin, no? So let be close to their conventions, constructs.
Their methods are always with (), even with empty arguments. Lots of reasons (easier to learn / remember, easier to read / understand code, easier to write compilers and tools, faster parsers, …).
So many many many end users will agree on this.

Well, first of all, I doubt I will make any decision on making the empty parenthesis optional very soon, because I have other things I want to look into that are more important.

If I do finally decide that it’s just as well to let somebody write: Foo Bar Baz as to write Foo() Bar() Baz() then the latter will still work! I see no value in breaking everybody’s code for no reason. As for the LOOKAHEAD enhancements, it seems pretty clear to me that there is no reason to force everybody to write: LOOKAHEAD(Foo()Bar()Baz()) Foo()Bar()Baz() when the expansion following the LOOKAHEAD is exactly the same as the expansion in the LOOKAHEAD. But, again, the longer way of writing it will still work, so… on that level, what’s the big deal?

At some point, the Java people decided you could write: Map<String, String> map = new HashMap<>();
instead of Map<String,String> map = new HashMap<String,String>();

But the longer one still works! So, again…

But I think the bigger issue in all of this is whether one is ever going to revisit the longstanding shibboleths. I wrote this little thing about the gigabyte being the new megabyte and the really more general point in all that is just the way certain things are never re-examined. So you have this default disposition where you have 4k buffer and grudgingly expand it by 2k, and one has this behavior as the default at a point in time where a full megabyte of RAM costs less than a penny! Once you think about something like that, it really is just ridiculous, you know…

In Java, you have to write foo.bar() to invoke a method on an object because foo.bar means something else. The semantics of the Java language are such that the empty parentheses are necessary. But to keep making people write those empty parentheses in a language where there is no similar necessity for them…

I mean to say, there’s a larger issue here, which is whether you just never revisit anything!

Your point about other languages is valid, I guess. Eventually (especially since code generation is templatized) there should be a move towards generating other languages besides Java. That is definitely a goal. However, since my mentality is such that I would not move towards the generation of other languages if I was not handling Java fully, I first had to get the Java support updated, as you surely saw here but it’s not clear what the implications of this are in terms of this issue, making people write Foo() if there is, in fact, no ambiguity with just Foo. I think also that, whatever syntax one goes for, people will eventually get used to it.

But again, I probably won’t do anything with that in the very near-term. I think that SPECIAL_TOKEN might be replaced by just UNPARSED (or if you have a better idea for a word, I am open to suggestions) and that should be used in preference to both SPECIAL_TOKEN and SKIP. SKIP is nice and short, of course, but there is actually little reason in the current day to have the distinction between SKIP and SPECIAL_TOKEN. One might as well just store everything by default, given how cheap memory is. UNPARSED is much clearer (AND shorter) than SPECIAL_TOKEN. It’s clearer because it means that this is a token that does not participate in the parsing, i.e. it’s purely lexical. I don’t think SPECIAL_TOKEN was ever a very good terminological choice, because it’s not clear what it means, and it’s also very long-winded.

That said, I could keep SPECIAL_TOKEN and SKIP working indefinitely since there is no real cost to it. But it’s a conceptual simplification: a token is either parsed, or unparsed, i.e. it participates in the parsing machinery or it doesn’t. And that’s that…