Reducing Visual Clutter: Introducing a New Streamlined Syntax for JavaCC 21

Originally published at: https://javacc.com/2020/07/07/announce-new-syntax/

The overarching design goal of the JavaCC 21 project (dating back to when it was still called FreeCC) is to transform the legacy JavaCC tool into something more useful and usable. Having a clean syntax does not, in itself, further the first goal, but it is surely a key part of the second.

While it is possible for a language syntax to be too terse and cryptic, it seems pretty clear that JavaCC has always suffered from the opposite problem. JavaCC’s legacy syntax is plagued with visual clutter that makes grammars harder to write and, probably more importantly, harder to read.

Well, they say a picture is worth a thousand words, so just take a look at a JSON grammar written in the newer streamlined syntax:

Root : JSONObject <EOF> ;

JSONObject : "{" [KeyValuePair ("," KeyValuePair)*] "}" ;

KeyValuePair : <STRING_LITERAL> ":" Value ;

Array : "[" [ Value ("," Value)* ] "]" ;

Value : "true" | "false" | "null" | <STRING_LITERAL> | <NUMBER> | Array | JSONObject;

// Lexical specifications
// Since this is meant to be a minimal grammar, we just define
// the string literals like "true", "null" etc.
// implicitly in the grammar part.

SKIP : <WHITESPACE : (" "| "\t"| "\n"| "\r")+>;

TOKEN :
<#ESCAPE1 : "\\" (["\\", "\"", "/","b","f","n","r","t"])>
|
<#ESCAPE2 : "\\u" (["0"-"9", "a"-"f", "A"-"F"]) {4}>
|
<#REGULAR_CHAR : ~["\u0000"-"\u001F","\"","\\"]>
|
<STRING_LITERAL : "\"" (<REGULAR_CHAR>|<ESCAPE2>|<ESCAPE1>)* "\"">
|
<#ZERO : "0">
|
<#NON_ZERO : (["1"-"9"])(["0"-"9"])*>
|
<#FRACTION : "." (["0"-"9"])+>
|
<#EXPONENT : ["E","e"](["+","-"])?(["0"-"9"])+>
|
<NUMBER : ("-")?(<ZERO>|<NON_ZERO>)(<FRACTION>)?(<EXPONENT>)?>
;

If you want to eyeball a more complex example, here is the Java grammar in the newer syntax. Uhh, yes, this is the Java grammar that is now being used internally! I actually intend to convert entirely to the newer syntax in internal use and also in tutorials and examples. However, my intention is to write an automatic converter from the legacy syntax, so once I have that working, I will use it to convert the whole codebase and the various examples, rather than converting them manually.

Now, I know what some people reading this are thinking!

Well, the answer is no. This newer streamlined syntax is not the result of consciously imitating ANTLR. Not that there would be anything wrong with that, mind you, but no, it just turns out that if you analyze the legacy JavaCC syntax and aggressively take advantage of every opportunity to get rid of unnecessary visual clutter, then you end up with something that resembles ANTLR quite a bit! I suppose it is of some interest to compare the grammar above with the JSON grammar provided by the ANTLR project.

A far more important point is that use of the newer syntax is entirely optional. All of the older syntax is supported and will be for the foreseeable future.

So, here are some of the outstanding issues that the newer syntax addresses:

Repeating a Lookahead Expansion in Syntactic Lookahead

This issue has been resolved for several months. With legacy JavaCC syntax, you frequently find yourself writing something like:

 LOOKAHEAD(Foo()) Foo() 

though it can get far worse. You write things like:

  LOOKAHEAD(Foo() [Bar()] (Baz())+) Foo() [Bar()] (Baz())+

This has already been fixed in JavaCC 21. See here. Thus, for some months, the above two lines could already be written as:

 LOOKAHEAD Foo()

and:

 LOOKAHEAD Foo() [Bar()] (Baz())+

respectively. This is because if we don’t write a separate lookahead expansion, we default to the case where the expansion to be matched and the one used by the lookahead are one and the same.
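The defaulting rule can be pictured with a small Java sketch. This is only an illustration under stated assumptions, not how JavaCC 21 is implemented: the class LookaheadSketch, the Expansion interface and all of the helper methods are hypothetical names, and an expansion is modelled simply as a function that tries to match a list of token names from a given position.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model, for illustration only (not JavaCC 21's generated code).
class LookaheadSketch {
    /** Tries to match starting at pos; returns the position after the match, or -1 on failure. */
    interface Expansion {
        int match(List<String> tokens, int pos);
    }

    static Expansion token(String name) {
        return (tokens, pos) ->
            pos < tokens.size() && tokens.get(pos).equals(name) ? pos + 1 : -1;
    }

    static Expansion seq(Expansion... parts) {
        return (tokens, pos) -> {
            int p = pos;
            for (Expansion part : parts) {
                p = part.match(tokens, p);
                if (p < 0) return -1;
            }
            return p;
        };
    }

    /** Models [X]: matches X if possible, otherwise matches nothing. */
    static Expansion optional(Expansion e) {
        return (tokens, pos) -> {
            int next = e.match(tokens, pos);
            return next < 0 ? pos : next;
        };
    }

    /** Models (X)+: one or more repetitions of X. */
    static Expansion oneOrMore(Expansion e) {
        return (tokens, pos) -> {
            int p = e.match(tokens, pos);
            if (p < 0) return -1;
            while (true) {
                int next = e.match(tokens, p);
                if (next < 0 || next == p) return p;
                p = next;
            }
        };
    }

    /** Legacy form, LOOKAHEAD(la) exp: scan ahead with la, then parse exp. */
    static Expansion lookahead(Expansion la, Expansion exp) {
        return (tokens, pos) -> la.match(tokens, pos) >= 0 ? exp.match(tokens, pos) : -1;
    }

    /** Streamlined form, LOOKAHEAD exp: the lookahead defaults to exp itself. */
    static Expansion lookahead(Expansion exp) {
        return lookahead(exp, exp);
    }

    public static void main(String[] args) {
        // Foo [Bar] (Baz)+
        Expansion body = seq(token("Foo"), optional(token("Bar")), oneOrMore(token("Baz")));
        List<String> input = Arrays.asList("Foo", "Baz", "Baz");
        // Both forms accept the same input and consume the same number of tokens.
        System.out.println(lookahead(body, body).match(input, 0) == lookahead(body).match(input, 0));
    }
}
```

The one-argument overload is the whole point: leaving out the separate lookahead expansion is just shorthand for repeating the expansion being parsed.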

Having to write empty parentheses that are actually superfluous

In legacy JavaCC syntax, we always have to write something like:

Foo() Bar() Baz()

even in situations where writing:

Foo Bar Baz

would not cause any ambiguity. Removing the need to write these empty parentheses allows us to write the second LOOKAHEAD above even more simply.

We go from:

(1) LOOKAHEAD(Foo() [Bar()] (Baz())+) Foo() [Bar()] (Baz())+

to:

(2) LOOKAHEAD Foo() [Bar()] (Baz())+

finally to:

(3) LOOKAHEAD Foo [Bar] (Baz)+

and actually, I decided to introduce the shorter keyword SCAN so we can write:

(4) SCAN Foo [Bar] (Baz)+

Added note as of 24 July 2020: As of the last streamlining cycle, it was decided that, since this kind of construct (where you want to scan ahead indefinitely for the expansion to be parsed) is so common in practice, the above can now be written simply as:

=> Foo [Bar] (Baz)+

Superfluous void when declaring productions

In a typical grammatical production that generates a Java method whose return type is void, there is no point in making the grammar author write void every time. Thus, the grammatical production in legacy syntax:

 void FooBar() : 
 {}
 {
     Foo() Bar()
 }

can be written more tersely as:

 FooBar : 
 {}
 {
     Foo Bar
 }

The above, of course, makes use of the fact that we no longer need to write all the superfluous empty parentheses. Of course, the perceptive reader will realize that the above production still contains visual clutter, namely…

Having to write an empty Java code block at the start of any grammatical production

In JavaCC, you can start any grammatical production with a Java code block, in order to define and/or initialize any variables that your Java code needs. This is quite useful generally, but in the legacy syntax, even if you don’t need to do this, you still have to write that code block, leaving it empty. The latest version of the JavaCC 21 parser is smart enough to allow you to omit the opening code block. So, the previous production can now be written:

 FooBar :
 {
    Foo Bar
 }

In fact, many people (probably in order to see more of a file on their screen) would prefer to write:

 FooBar : {Foo Bar}

Given this, I decided that it would be clearer to have an alternative syntax that dispensed completely with the braces. In general, in the newer streamlined syntax, we mostly just use braces (i.e. {...}) when injecting actual Java code into our parser.

So, with the newer streamlined syntax, the preferred way to write the last version of the production would be:

  FooBar : Foo Bar;

Note that if you use this alternative braces-free syntax, you must end your production with a semicolon.

Note that all of the legacy non-streamlined constructs still work. Most of this streamlined syntax is just the result of making certain elements optional that were previously mandatory. The older syntax will work for the foreseeable future, but I really anticipate that, for most people, it will be an easy decision whether to write:

    void FooBar() :
    {}
    {
        Foo() Bar()
    }

or:

   FooBar : Foo Bar;

Let me say in closing that I have on my mental TODO list to write an automatic converter from the legacy syntax to the streamlined syntax.

ADDENDUM: Ambiguities in the streamlined syntax and how they are resolved

There are a few ambiguities, corner cases really, that the newer streamlined syntax introduced.

For example, consider the following construct:

  Foo (Bar | Baz)

I suspect that most readers would express surprise that there is any ambiguity here. It looks quite clear what this is: a grammatical expansion that represents a Foo followed by either a Bar or a Baz.

But it is ambiguous. You see, the Bar | Baz in the parentheses could equivalently be parsed as the bitwise OR of two integers, Bar and Baz. So, the above construct can be parsed alternatively as:

  1. A reference to a single nonterminal Foo in which the value of Bar | Baz is being passed in as an argument. (Here Bar and Baz would have to be integer types of some sort and the | is the Java/C bitwise OR operator.)
  2. A reference to the Foo production with no args, followed by another expansion in parentheses, (Bar | Baz), i.e. a choice between Bar and Baz. (Here Bar and Baz would have to be nonterminals in the grammar and the | here is the JavaCC choice operator.)

Now, as a practical question, the intent of the author will be the second case about 99.99% of the time. So, the solution is simply to parse this construct as the second case above.

Now, in the very rare occasion that you really did want Bar | Baz to be an integer passed in as an argument, you could easily disambiguate by writing it as:

  Foo(0 + Bar | Baz)

This disambiguates because it is now quite clear that the expression inside the parentheses is some integer value. Certainly, it cannot be parsed as a grammar expansion.

However, I daresay that this would be so rare that, as a practical matter, very, very few JavaCC users would ever run into it, and if they did, the above solution does not look very onerous.
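The reason that prefixing 0 + is harmless on the Java side is operator precedence: in Java, + binds more tightly than |, so 0 + Bar | Baz is read as (0 + Bar) | Baz, which has the same value as Bar | Baz. A minimal check of that, where the class name and the constants BAR and BAZ are made-up stand-ins for whatever integer values a real grammar would pass in:

```java
// Hypothetical names throughout; this only demonstrates Java operator precedence.
class BitwiseOrPrecedence {
    static final int BAR = 0b0101;
    static final int BAZ = 0b0011;

    /** The disambiguated argument expression from the text: 0 + Bar | Baz. */
    static int disambiguated(int bar, int baz) {
        return 0 + bar | baz; // Java parses this as (0 + bar) | baz
    }

    public static void main(String[] args) {
        // Adding zero does not change the value, so both sides are equal.
        System.out.println(disambiguated(BAR, BAZ) == (BAR | BAZ)); // prints true
    }
}
```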

Another ambiguity is when you write something like:

  Foo(Bar)

In theory, this could be interpreted as:

  1. The nonterminal production Foo(…) being invoked with a single argument Bar, which is presumably some variable defined in the Java code.

or:

  2. The nonterminal Foo followed by the nonterminal Bar.

The solution I have opted for is simply to interpret this as the first case. In the second case the parentheses would be completely superfluous, and it would just make more sense to write it as:

 Foo Bar

Note also that the following constructs:

Foo (Bar)? 

Foo (Bar)*

Foo (Bar)+

are not ambiguous, since the parser can (and does) simply scan ahead for the '?', '*', or '+' and disambiguate this way. It is only the plain Foo (Bar) that is ambiguous, and that can always be better written as simply Foo Bar.

This is a gotcha that people are more likely to run into than the previous one, since the use of the bitwise operator in that spot to pass in an arg would be quite rare. Still, the solution for disambiguating this case involves simply writing the code more clearly (i.e. deleting the superfluous parentheses if you don’t intend for Bar to be an argument that is passed in). Given all of the extra clarity and brevity that we can achieve, introducing this extra wrinkle seems like a good tradeoff.
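The scan-ahead rule for Foo (Bar)?, Foo (Bar)* and Foo (Bar)+ can be sketched in Java roughly as follows. This is a simplified illustration and not JavaCC 21's actual code: the class ParenDisambiguation, the Reading enum and the classify method are hypothetical names, the grammar text is assumed to be already split into string tokens, and the separate rule for Foo (Bar | Baz) is deliberately not modelled.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the scan-ahead idea; names and token model are assumptions.
class ParenDisambiguation {
    /** The two possible readings of a parenthesized group that follows a nonterminal name. */
    enum Reading { ARGUMENT_LIST, NESTED_EXPANSION }

    /**
     * tokens holds the grammar fragment, openIndex is the index of the "(" after the name.
     * Finds the matching ")" and peeks at the token after it: a trailing ?, * or +
     * means the group can only be a nested expansion.
     */
    static Reading classify(List<String> tokens, int openIndex) {
        if (!tokens.get(openIndex).equals("(")) {
            throw new IllegalArgumentException("Expected '(' at index " + openIndex);
        }
        int depth = 0;
        for (int i = openIndex; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals("(")) {
                depth++;
            } else if (t.equals(")")) {
                depth--;
                if (depth == 0) {
                    String next = i + 1 < tokens.size() ? tokens.get(i + 1) : "";
                    boolean quantified = next.equals("?") || next.equals("*") || next.equals("+");
                    return quantified ? Reading.NESTED_EXPANSION : Reading.ARGUMENT_LIST;
                }
            }
        }
        throw new IllegalArgumentException("Unbalanced parentheses");
    }

    public static void main(String[] args) {
        System.out.println(classify(Arrays.asList("Foo", "(", "Bar", ")"), 1));      // ARGUMENT_LIST
        System.out.println(classify(Arrays.asList("Foo", "(", "Bar", ")", "*"), 1)); // NESTED_EXPANSION
    }
}
```

Under these assumptions, the plain Foo (Bar) falls through to the argument reading, which matches the choice described above.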

IMHO, the just-introduced ambiguities show that the JavaCC syntax is what most people need, especially beginners.
When you use an editor with proper syntax highlighting and pretty formatting, you are not bothered by what you call the visual clutter.

I also need to point out that this new syntax complicates the IDE plugins … so you are probably far from getting them updated :(

Well, I can’t really agree with this.

On the one hand, yes, all of this shows that when you do try to evolve a grammar that is already established, it is easy to introduce ambiguities. And, I have found that you often only notice them when you actually start using it and you do hit these cases. I already found this a long long time ago when I took over the FreeMarker project and started introducing new syntactical elements.

That much is true. However, I cannot agree that it tells you anything about what people need, or in particular, what beginners need. For starters, you seem to conflate two separate questions: (a) what would be more amenable to (human) users and (b) what is easier to parse, or formally define.

These are two separate questions. And one can see this quite easily. What is most amenable to human users is obviously natural language, i.e. English, French, or whatever somebody’s native language happens to be… However, natural language is outrageously difficult to parse and formally define. On the other hand, something that has a very clear and simple formal definition, like XML, let’s say, is very difficult for most ordinary human beings to read and write. However, it is dead easy to write an XML parser!

So, when you try to streamline the syntax for human users, you make it (a) harder to write a parser and (b) harder to define the language formally. This is actually to be expected.

But there is a deeper problem here, you know. Your whole attitude seems to be utterly based on status quo bias, an extreme bias towards the existing state of affairs.

Consider the following thought experiment:

Suppose (contrary to fact) that this new streamlined syntax was the original syntax, how it had always been.

Now, suppose that somebody came along with a proposal to improve this. Their proposal would be that, instead of writing:

   FooBar : foo bar;

you now write:

    void FooBar() : 
     {}
     {
            foo() bar()
     }

How would most users react to this proposed “improvement”?

Isn’t the whole thing self-evidently preposterous?

One thing about this occurs to me. I haven’t even looked at the code of your Eclipse plugin (https://sourceforge.net/projects/eclipse-javacc/). I always would have assumed that you use the actual JavaCC parser (the one that JavaCC itself uses internally to parse grammars) inside your plugin. Now, I have the sudden suspicion that you have cobbled together your own ad-hoc JavaCC parser that you must maintain separately.

Is this true? (“Say it ain’t so, Joe…”)

Because, otherwise… why would you care whether it is difficult (or not) to write a parser that supports the newer streamlined syntax?

That is my job!

You should be able to use the same JavaCC parser that JavaCC (JavaCC21 obviously) uses internally and then, in that case, what does it matter to you whether the newer syntax is easier to parse or not???

If my suspicion is correct and your plugin maintains its own JavaCC parser, then… you know, this is an example of how people get used to very suboptimal situations. A couple of months ago, I was wondering how many different projects that use JavaCC maintain their own Java grammar. PMD does. Actually, this question came up with one of the PMD developers. I brought up this question in private correspondence because I was interested in how he would reply to it. (But he never wrote me back. Shrug…)

There is an entirely separate project whose entire goal is to maintain an up-to-date Java parser. I mean this project. As far as I can see, if the JavaCC project had been something competent, that Javaparser thing would have no reason to exist, because anybody who wanted a JavaCC-based Java grammar would just use the one that the JavaCC project provides. Granted, you might need to alter some things in the stock Java grammar. But that is what the INCLUDE directive is for. You simply INCLUDE the stock Java grammar and then redefine whatever you need to change in the including grammar.

But a lot of this relates to people’s mentality. When I decided to resuscitate my old FreeCC work, about the first task I set myself was updating the Java grammar used internally to handle the current modern Java language. (FreeCC supported up through Java 5 and only partially.) So, you know, you’ve probably seen this. And you see there that I made a big deal out of the fact that the updated Java grammar is separately usable!

So, my point is that while I was doing this work, it quickly occurred to me that other people who need an up-to-date Java grammar should be able to reuse my work. You see, in my open source work, anything I do, I’m constantly thinking about how other people could benefit from it. Is this some special characteristic of mine? I would have thought that pretty much any other open source hacker thinks the same way. No?

Well, getting back to the point at hand, why should you care if the newer syntax is more complicated to parse etcetera if you can just reuse the existing parser? (If, for some reason, it is not separately reusable, then we can explore why and address that, but…)

Please reread one of the previous posts about the Eclipse plugin.

One of the big difficulties is the code scanner. It is based on an FSM, not on a parser. I think it cannot be based on a parser, at least not on a non-fault-tolerant parser. Replacing mandatory constructs (that help find the context) with optional constructs leads to a more complicated algorithm. Yuck.

The trees are based on a JJTree grammar, not on a JavaCC grammar. Yes, it was customized by its creator, because JavaCC / JJTree do not make it easy to build tools on top of an agnostic grammar. You cannot take a new grammar and hope it will work immediately… It is not too complex to enhance the current grammar or to take a new grammar; it is just a boring task for no new great features.

For the IntelliJ plugin, the grammar has to be in (F)Lex / Yacc. Not a trivial task I believe. So another boring task for no new great features.

Keep the syntax simple, as in Java / Kotlin / C++. A method always with (), a member without. Not a mix of them.

Yes, I’m one of those who think that having foo or bar be a method or a token or a variable would be much less readable than having foo() for methods and foo for tokens or variables. Having to write 0+Bar to tell that Bar is an integer is just ugly, not to say more…
So yes IMO the second syntax would be an improvement.

For the braces, this is another story. I agree that the syntax could be enhanced to declare JavaCC variables that would translate to Java or C++ variables and therefore get rid of the first pair of braces, and maybe also the second. But the 21st century feature you could glorify yourself for would be the introduction of these JavaCC variables, not the suppression of the braces.