The “Code Prologue Glitch” is now fixed!

Folks, the good news is that I just fixed this longstanding glitch in CongoCC. The bad news is that the fix is not 100% backward compatible.

Well, granted, the above does kind of beg the question of WTF the code prologue glitch is. Though probably the question that will occur to most people is more like: How does this affect ME? For the answer, please read on...

What I call here the code prologue glitch is basically an artifact of how the code evolved from the way it was implemented in the 1990s in legacy JavaCC. That is how CongoCC (and JavaCC21 before it) ended up in this situation where the first Java code block in a grammar production was treated in a special manner. Well, here is a minimal example that illustrates the problem:

TREE_BUILDING_ENABLED=false; // so that we generate a minimal example
BASE_NAME="Test"; // We generate TestParser.java, TestLexer.java...

Foo :
   {int x = 7;}
   "bar"
   |
   "baz"
   {
       System.out.println(x);
   }
;

I suppose that most readers can see that the above production generates code that outputs a 7 if the input is baz and outputs nothing if the input is bar. (And if the input is neither bar nor baz, it throws an exception, of course.)

You can paste the above grammar into a file, Test.ccc let's say, and run CongoCC over it, i.e.

java -jar congocc.jar Test.ccc

and it generates the parser code, which duly compiles. Well, to be precise, the generated code compiles if you are using a CongoCC build from before the glitch was fixed. But if you use the most recent build of CongoCC when you compile, you get:

./testparser/TestParser.java:214: error: cannot find symbol
            System.out.println(x);

(As always, you can grab a pre-built CongoCC binary here.)

So, yes, with the glitch "fixed", the above little grammar generates uncompilable code. With the glitch present, the generated code compiles okay! That seems strange, but no, I have not lost my mind. It all makes sense, just bear with me.

You see, the "glitch" is that the very first Java code block in a production is treated specially. It is a kind of code prologue that precedes the expansion as a whole. That is a kind of artifact of how the original JavaCC worked. (And still works, actually.)

This becomes more apparent when you consider how the above Foo production would look in legacy JavaCC, where you would write something like:

public void Foo() :
   {int x = 7;}
{
    "bar"
    |
    "baz"
    {
        System.out.println(x);
    }
}

So, to be absolutely fair, the verbose (and butt ugly) legacy JavaCC syntax does make it quite clear that the code prologue, i.e. int x = 7; above, does sort of exist separately from the rest of the production. In fact, as some of you may recall, JavaCC even requires that initial code prologue to be there even if there is really no need for it. So, in such cases, you have to write:

 public void Foobar() : 
 {} // This empty code block is required!
 {
    ....
 }

Now, one of the first things people notice about CongoCC (or its predecessor JavaCC21) is that the syntax is quite a bit less cluttered. There is no need to start every production with a (frequently empty) code block. (By the way, a key difference between JavaCC21 and CongoCC is that JavaCC21 supported all that screwball legacy syntax as well as the newer streamlined syntax, while CongoCC only supports the newer syntax.)

Well, the problem here is that CongoCC kept on with this special code prologue logic, treating the very first Java code block in an expansion specially. Thus, until very recently, the code:

  Foobar :
      {int x = 7;}
      OptionA
      |
      OptionB
      {System.out.println(x);}
  ;

actually generated code as if you had written:

  Foobar :
      {int x = 7;}
      (
         OptionA
         |
         OptionB
         {System.out.println(x);}
      )
  ;

That is why the example up top generated compilable code, but now does not.
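To make the scoping concrete, here is a hand-written Java sketch of the two shapes. It is not actual CongoCC output: the class name, the method names, and the string comparison standing in for token matching are all assumptions made purely for illustration. The old behavior amounts to x being declared ahead of the whole choice, while the new behavior scopes it to the first alternative only.

```java
// Hand-written illustration, NOT actual CongoCC output.
// All names here (PrologueScopeSketch, oldShape, newShape) are hypothetical.
public class PrologueScopeSketch {

    // Old (glitchy) shape: the first code block acts as a prologue,
    // so x is declared before the choice and is visible in every branch.
    public static String oldShape(String input) {
        int x = 7;
        if (input.equals("bar")) {
            return "";
        } else if (input.equals("baz")) {
            return String.valueOf(x);
        }
        throw new IllegalArgumentException("Expected bar or baz, got: " + input);
    }

    // New (fixed) shape: the code block belongs to the first alternative only.
    public static String newShape(String input) {
        if (input.equals("bar")) {
            int x = 7; // scoped to this branch only
            return "";
        } else if (input.equals("baz")) {
            // return String.valueOf(x); // would not compile: cannot find symbol
            return "<x is out of scope here>";
        }
        throw new IllegalArgumentException("Expected bar or baz, got: " + input);
    }

    public static void main(String[] args) {
        System.out.println(oldShape("baz")); // x is in scope in this shape
        System.out.println(newShape("baz")); // x is not reachable in this shape
    }
}
```

Uncommenting the marked line in newShape gives the same kind of "cannot find symbol" error as the one shown earlier.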

This whole state of affairs has some very screwy aspects. This becomes apparent if we extend the initial example and write:

Foo :
   {int x = 7;}
   "bar"
   |
   {int y = 7;}
   "baz"
   |
   "bat"
   {
       System.out.println(x);
       System.out.println(y);
   }
;

Again, if you generate the parser using any version where the glitch is not fixed and then compile the generated code, the compiler hits a problem on the line System.out.println(y); but NOT on the line System.out.println(x);. And that, my friends, is quite illogical!

The reason for this is that CongoCC was carrying forward this screwball (it really is pretty screwball, finally...) concept of treating a code block that begins the production in a special manner. So, for the above example, the current version of CongoCC will generate code that has TWO compilation errors. Both x and y are out of scope in that final code block above. And that is pretty obviously the correct behavior!
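The asymmetry under the old behavior can be pictured with a hand-written Java analogy. Again, this is not real generated code, and the class and method names are hypothetical. Only the first code block was hoisted, so x was visible in the last alternative while y stayed local to the middle one.

```java
// Hand-written illustration, NOT actual CongoCC output.
// The names AsymmetrySketch and oldShape are hypothetical.
public class AsymmetrySketch {

    // Pre-fix shape of the three-way choice: only the FIRST code block is hoisted.
    public static String oldShape(String input) {
        int x = 7; // first code block: treated as a prologue for the whole production
        if (input.equals("bar")) {
            return "";
        } else if (input.equals("baz")) {
            int y = 7; // second code block: local to this alternative
            return "";
        } else if (input.equals("bat")) {
            // x resolves here because it was hoisted.
            // Referring to y here would fail with "cannot find symbol".
            return String.valueOf(x);
        }
        throw new IllegalArgumentException("Expected bar, baz or bat, got: " + input);
    }

    public static void main(String[] args) {
        System.out.println(oldShape("bat"));
    }
}
```

With the fix, the declaration of x moves inside the first branch as well, so both variables are treated the same way.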

To fix it or not to fix it, that is the question.

All of this situation amounts to a rather annoying conundrum. (By the way, I always thought that "conundrum" was an intrinsically funny word, along with "condiment" and "condominium", but I guess that is neither here nor there...) The conundrum is that fixing the glitch is potentially backward incompatible. Meanwhile, there may well be grammars in the wild that rely on this glitchy behavior.

By the way, it so happens that none of the grammars that I maintain as part of this project -- JSON, Lua, Java, Python, or C# or the CongoCC grammar itself -- actually did rely on this glitch. (That was a relief!)

Anyway, the ~~condiment~~conundrum from a developer's point of view is how to deal with this. One solution would be to fix the glitch but to leave a flag to enable the glitchy behavior. That is the solution that has been applied recently and is why we have these backward-compatible options, such as LEGACY_GLITCHY_LOOKAHEAD and ASSERT_APPLIES_IN_LOOKAHEAD. (This is all explained here and here respectively.) These are settings that enable the legacy glitchy behavior, in case people have existing code that relies on it working that way.

Finally, in this case of the code prologue glitch, I decided to just fix it and not leave any setting to enable/disable the fix. And, yes, this could break some existing grammars, but finally, I reasoned that this is a situation where the solution is typically quite obvious. If you had written something like the initial grammar example, you need to put in some parentheses (which were unneeded before). So you would need to write:

Foo :
   {int x = 7;}
   ( // insert the parentheses here
     "bar"
     |
     "baz"
     {
       System.out.println(x);
     }
   ) // and here!
;

The fixed version with the parentheses, by the way, will work equally well with the new non-glitchy version and the older glitchy one. Also, it is not confusing. It actually is fairly obvious why you need the parentheses. You need them so that the x defined in the first code block is accessible in the later one. In short, if you do run into this sort of issue, the solution will pretty much invariably be to add parentheses as in the above example.

By the way, regarding the aforementioned options, LEGACY_GLITCHY_LOOKAHEAD and ASSERT_APPLIES_IN_LOOKAHEAD, not too long ago, I did quietly (or sneakily?) change the default, so that they are now both off by default. Thus, for example, you might have been relying on being able to write an expansion such as:

  "(" Expression ("," ASSERT ~(")") =>|| Expression)* [","] ")"

The above is one way to express syntax like (x, y, z,) where the trailing comma is optional. However, the above does not work (at least with what are currently the default settings).

Well, to make a long story short, unless you have the ASSERT_APPLIES_IN_LOOKAHEAD option set to true, the above won't work as intended, since ASSERT does not apply in a lookahead. You have to write ENSURE in this spot. This whole issue is explained here.

So, the final take-away on all this is that the glitch is fixed (Hooray!) and there is some possibility that it could break your existing code. The solution would always be to put a pair of parentheses in the appropriate place and then the problem will be gone and your code will work again AND be more readable as a result. (Hooray!)
