JavaCC 21 now has a Preprocessor!

The idea had been nagging at me in the back of my mind for some time: I figured we would eventually need some sort of preprocessor functionality to support output for different programming languages (besides Java) and, actually, even for different Java targets. I had also been looking more closely at C# recently, and finally I just decided to copy the way the C# preprocessor works, which is much more limited (and saner) than the full C/C++ preprocessor.

In fact, this JavaCC 21 preprocessor only implements the part of the C# preprocessor that deals with turning on and off regions of the code based on some conditions (conditional symbols) that you define. So, basically, you can write things like:

 #define testing

 #if foo
  #undef testing
  #define stable
 #endif

 #if !(testing || debug)
    something
 #elif stable
    And now something else!
 #else
    And now for something completely different!
 #endif

Well, as you probably see already, the preprocessor is its own separate little mini-grammar. (It is expressed in a couple of hundred lines here).

So it has its own rules (that I did not invent), like: all of these preprocessing directives that start with # must be on a line of their own. In the above, the #if...#elif...#else...#endif structure has to be valid. If a closing #endif is missing, for example, the preprocessor will complain. Same old... same old...
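Just to give a feel for what that checking amounts to, here is a minimal sketch, in plain Java, of verifying that every #if has a matching #endif. This is my own illustration with made-up names, not the actual JavaCC 21 code, which expresses this (and more) declaratively in the grammar:

 import java.util.List;

 // Illustration only: a bare-bones balance check for #if/#endif.
 public class DirectiveBalanceCheck {
     static void check(List<String> lines) {
         int depth = 0;
         for (int i = 0; i < lines.size(); i++) {
             String directive = lines.get(i).trim();
             if (directive.startsWith("#if")) depth++;
             else if (directive.startsWith("#endif") && --depth < 0)
                 throw new IllegalStateException("#endif without #if at line " + (i + 1));
         }
         if (depth > 0) throw new IllegalStateException("missing closing #endif");
     }
 }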

Note, however, that these constructs are not really part of the syntactic or lexical grammar of a JavaCC grammar file. All of the preprocessing is best thought of as pre-lexical. The way it works is that the preprocessor runs over the source file and simply builds up the information (in a BitSet instance) that marks which lines in the source file are turned on and off. Then, when the lexical machinery reads in the code to be lexed (and parsed), the lines marked as ignored are simply skipped. Neither the parser nor the lexical machinery sees any of those ignored lines; both behave as if they weren't there. Well, there is one key difference: the line number information stored in Tokens and Nodes is correct based on the location in the original file. So if you have:

 1. #if false
 2. blah blah blah
 3. #endif
 4. Foobar : "foo" "bar";

The Foobar production and the tokens inside it know that they are on line 4, not on line 1 as they would be if we really stripped out the first three lines and fed the remaining code to the parser.
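To make that mechanism concrete, here is a minimal sketch of the pre-lexical idea. The names are my own inventions for illustration, not the actual JavaCC 21 classes; the point is just that ignored lines are dropped while the surviving lines keep their original line numbers:

 import java.util.ArrayList;
 import java.util.BitSet;
 import java.util.List;

 public class LineFilterSketch {

     // Hypothetical pairing of a surviving line of text with its
     // 1-based line number in the original file.
     record SourceLine(int originalLineNumber, String text) {}

     // Keep only the lines whose (1-based) numbers are not marked as ignored.
     static List<SourceLine> filter(List<String> lines, BitSet ignoredLines) {
         List<SourceLine> result = new ArrayList<>();
         for (int i = 0; i < lines.size(); i++) {
             if (!ignoredLines.get(i + 1)) {
                 result.add(new SourceLine(i + 1, lines.get(i)));
             }
         }
         return result;
     }

     public static void main(String[] args) {
         List<String> lines = List.of(
             "#if false",
             "blah blah blah",
             "#endif",
             "Foobar : \"foo\" \"bar\";");
         BitSet ignored = new BitSet();
         ignored.set(1, 4); // as if the preprocessor turned off lines 1-3
         for (SourceLine line : filter(lines, ignored)) {
             // Prints: 4: Foobar : "foo" "bar";
             System.out.println(line.originalLineNumber() + ": " + line.text());
         }
     }
 }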

Can I use it?

Some readers may already be wondering whether this is reusable in their own projects. The answer is that it basically is. You can see here the key point where the JavaCC grammar uses the Preprocessor grammar to get the BitSet of line markers that turn off the various line ranges.

In fact, I think this is generally useful enough that it will eventually just be a settings toggle, something like USE_PREPROCESSOR=true, so that your DSL automatically has this preprocessor functionality. That is not implemented yet, but it is already not very hard to incorporate the preprocessor into any other project.

Internationalization... Not

At the moment, all the conditional symbols have to be in 7-bit ASCII. The regexp for the conditional symbols currently looks like this:

<PP_SYMBOL : (["_", "a"-"z", "A"-"Z"])(["_", "a"-"z", "A"-"Z", "0"-"9"])*>
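In ordinary java.util.regex notation, that is the same shape as the pattern in the small sketch below, which also demonstrates why a symbol containing a non-ASCII letter is rejected:

 import java.util.regex.Pattern;

 public class SymbolShape {
     // Same shape as PP_SYMBOL above: an ASCII letter or underscore,
     // then any number of ASCII letters, digits, or underscores.
     static final Pattern PP_SYMBOL = Pattern.compile("[_a-zA-Z][_a-zA-Z0-9]*");

     public static void main(String[] args) {
         System.out.println(PP_SYMBOL.matcher("stable").matches()); // true
         System.out.println(PP_SYMBOL.matcher("débug").matches());  // false: non-ASCII
     }
 }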

It would have been easy enough to allow people to have conditional symbols in full Unicode, so as to write things like:

 #define 你好

Or:

 #if отладка
  ....
 #endif

However, there is really still no clean way of including the whole Unicode definition of an identifier for something like this. The whole preprocessor grammar is only a couple of hundred lines and it just felt weird to copy-paste an identifier definition that is longer than that.

I like the idea of treating non-English speakers as full citizens, of course, but I need a cleaner way of reusing the various internationalized definitions of Identifiers and such that use the full Unicode character set. I anticipate that when I have that in place, this will be one of the first places I apply it.

So, well, aside from not currently supporting full Unicode in conditional symbol names, we also only have the #define, #undef, and #if/#elif/#else/#endif directives. The various other directives (mostly just used internally in C#) such as #pragma, #line, #region/#endregion and some others are all just ignored at the moment.

By the way, a directive that does not even exist in C# is just passed through. So the line:

#foobar blah blah

is just passed through to JavaCC since there is no #foobar instruction in the C# preprocessor. On the other hand, the line:

 #warning This is a warning!

is swallowed by the preprocessor, since #warning is a real C# directive, but JavaCC 21 does nothing with it; no warning is actually emitted. At least for now...
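Summing that up, the dispatch logic amounts to something like the sketch below. Again, this is my own illustration with made-up names, not the actual implementation:

 import java.util.Set;

 public class DirectiveDispatchSketch {

     static final Set<String> INTERPRETED =
         Set.of("#define", "#undef", "#if", "#elif", "#else", "#endif");
     static final Set<String> SWALLOWED =
         Set.of("#pragma", "#line", "#region", "#endregion", "#warning");

     // Returns true if the line should be passed through to JavaCC untouched.
     static boolean passesThrough(String line) {
         String word = line.trim().split("\\s+")[0];
         if (!word.startsWith("#")) return true;       // ordinary line
         if (INTERPRETED.contains(word)) return false; // handled by the preprocessor
         if (SWALLOWED.contains(word)) return false;   // real C# directive, dropped
         return true;                                  // e.g. #foobar: unknown, passed through
     }
 }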

ATTEMPT/RECOVER is back

In other news, I put back the ATTEMPT/RECOVER construct. It should work, but is largely untested. The syntax has changed slightly. You write:

  ATTEMPT Foo Bar 
  RECOVER {some Java code...} 

OR:

  ATTEMPT Foo Bar 
  RECOVER (Baz Bat)

So, if you use curly braces, what is inside is Java code, and if you use parentheses, it is a JavaCC grammar expansion. Note that you can put a Java code block in any grammar expansion anyway, so you can write:

  ATTEMPT Foo Bar
  RECOVER ({some java code} Baz {more java code} Bat {even more java code})

The above construct will parse (well, attempt to match) the expansion, in this case Foo Bar, and then, if a ParseException is thrown, it tries to recover with the code after RECOVER. BUT... only after rolling back the state of the world to what it was before it entered Foo!

That is the key difference between ATTEMPT/RECOVER and the older try-catch.
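In plain Java terms, the difference looks roughly like this. It is a conceptual sketch with made-up names (saveState, restoreState, ParserState), not actual generated parser code:

 // Conceptual sketch, not generated code: ATTEMPT/RECOVER is a try-catch
 // that first snapshots the parser state and restores it before recovery.
 ParserState snapshot = saveState(); // token position, node stack, etc.
 try {
     Foo();
     Bar();
 } catch (ParseException e) {
     restoreState(snapshot); // roll back to before Foo(), the key step
     Baz();                  // then match the RECOVER expansion
     Bat();
 }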

Well, there are some other new features that are implemented, but I'll have to document them in a later article.