Originally published at: Context-Sensitive Tokenization, Next Installment, Activating and De-activating Tokens – JavaCC 21
Sometimes, when you complete a major code cleanup, features that were previously pie in the sky become low-hanging fruit. The new feature that I describe here, the ability to activate and deactivate tokens, is such a case. It resulted from my rewrite of the lexical code generation, which I describe elsewhere.
In an earlier article I (rather proudly) outlined my solution (of the time) to a certain rather pesky little problem in Java parsing. Where nested generic type arguments are closed, we need to parse >> as two consecutive > tokens, while, in other contexts, the >> does need to be identified as a single token. At first sight, it does not seem like this should be so hard to deal with, but it is surprisingly difficult to find a palatable solution. I was quite pleased with the solution I describe there because it was definitely far better than the 3-part kludge that I had been using before that, and way, way better than the 5-part kludge that the legacy code used. However, the truth is that it was still a kludge! But now that we can activate and deactivate tokens on an as-needed basis, there is a far more elegant solution.
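To make the two contexts concrete, here is a minimal Java sketch. The class and method names are purely illustrative and are not taken from the grammar or from the earlier article; the only point is where the >> characters appear.

```java
import java.util.ArrayList;
import java.util.List;

final class ShiftVersusGenerics {
    // The ">>" in the return type closes two nested type-argument lists,
    // so a parser has to treat it as two separate ">" tokens.
    static List<List<String>> nestedNames() {
        List<List<String>> rows = new ArrayList<>();
        rows.add(new ArrayList<>());
        rows.get(0).add("alpha");
        return rows;
    }

    // The ">>" here is the signed right-shift operator: a single token.
    static int shiftRightByTwo(int value) {
        return value >> 2;
    }

    public static void main(String[] args) {
        System.out.println(nestedNames().get(0).get(0)); // prints "alpha"
        System.out.println(shiftRightByTwo(16));         // prints 4
    }
}
```

The same two characters appear in both methods, and only the surrounding syntactic context tells the parser which tokenization is wanted.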
Well, the easiest way to explain this is with an actual code example. Here is how the ShiftExpression construct is implemented in the current Java grammar:
ShiftExpression :
    AdditiveExpression
    ACTIVATE_TOKENS RSIGNEDSHIFT, RUNSIGNEDSHIFT
    ( ("<<" | ">>" | ">>>") AdditiveExpression )*
;
The use of ACTIVATE_TOKENS is fairly straightforward. It means that within the expansion that follows, delimited by parentheses, these two tokens are activated. They are switched on at the beginning of that expansion, and at its end the set of active tokens is reset to what it was before.
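The activate-then-restore behavior can be modeled in a few lines of plain Java. This is only a toy sketch of the idea, not the code that JavaCC 21 generates: the class ScopedTokenActivation, its methods, and the restriction to inputs made of > characters are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class ScopedTokenActivation {
    enum TokenType { GT, RSIGNEDSHIFT, RUNSIGNEDSHIFT }

    // Token types the toy lexer may currently produce; the shift tokens start out deactivated.
    private final EnumSet<TokenType> active = EnumSet.of(TokenType.GT);

    // Longest-match tokenization of a run of '>' characters, restricted to the active types.
    List<TokenType> tokenize(String input) {
        List<TokenType> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (input.charAt(pos) != '>') {
                throw new IllegalArgumentException("this toy lexer only handles '>' characters");
            }
            if (active.contains(TokenType.RUNSIGNEDSHIFT) && input.startsWith(">>>", pos)) {
                tokens.add(TokenType.RUNSIGNEDSHIFT);
                pos += 3;
            } else if (active.contains(TokenType.RSIGNEDSHIFT) && input.startsWith(">>", pos)) {
                tokens.add(TokenType.RSIGNEDSHIFT);
                pos += 2;
            } else {
                tokens.add(TokenType.GT);
                pos += 1;
            }
        }
        return tokens;
    }

    // Activates the given types while the input is tokenized, then restores the previous set.
    List<TokenType> withActivated(Set<TokenType> types, String input) {
        EnumSet<TokenType> saved = EnumSet.copyOf(active);
        active.addAll(types);
        try {
            return tokenize(input);
        } finally {
            active.clear();
            active.addAll(saved);
        }
    }

    public static void main(String[] args) {
        ScopedTokenActivation lexer = new ScopedTokenActivation();
        Set<TokenType> shifts = EnumSet.of(TokenType.RSIGNEDSHIFT, TokenType.RUNSIGNEDSHIFT);
        System.out.println(lexer.tokenize(">>"));              // prints [GT, GT]
        System.out.println(lexer.withActivated(shifts, ">>")); // prints [RSIGNEDSHIFT]
        System.out.println(lexer.tokenize(">>"));              // prints [GT, GT]
    }
}
```

Saving and restoring the whole set, rather than simply deactivating the two tokens afterwards, is what lets such scopes nest without one clobbering the other.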
At the very top of the grammar file, we have the option:
DEACTIVATE_TOKENS=RSIGNEDSHIFT, RUNSIGNEDSHIFT, RECORD;
These token types are turned off by default and turned on at key moments. So the RecordDeclaration, a new stable feature in JDK 16, is defined as follows:
RecordDeclaration :
    Modifiers(EnumSet.of(PUBLIC, PROTECTED, PRIVATE, ABSTRACT, FINAL, STATIC, STRICTFP))
    ACTIVATE_TOKENS RECORD ("record")
    =>||
    [TypeParameters]
    RecordHeader
    [ImplementsList]
    RecordBody
;
The token for the "soft keyword" record is activated at the key point where we need it; everywhere else, "record" is simply tokenized as an identifier.
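The effect on a soft keyword can be sketched the same way. Again, this is a hypothetical illustration rather than generated parser code: SoftKeywordSketch and classify are made-up names, and the sketch classifies a single word instead of scanning real input.

```java
import java.util.EnumSet;
import java.util.Set;

final class SoftKeywordSketch {
    enum TokenType { IDENTIFIER, RECORD }

    // Classifies one word: "record" counts as a keyword only while RECORD is active.
    static TokenType classify(String word, Set<TokenType> active) {
        if (word.equals("record") && active.contains(TokenType.RECORD)) {
            return TokenType.RECORD;
        }
        return TokenType.IDENTIFIER;
    }

    public static void main(String[] args) {
        // Default state: RECORD is deactivated, so "record" is an ordinary identifier.
        Set<TokenType> byDefault = EnumSet.of(TokenType.IDENTIFIER);
        // State at the start of a record declaration: RECORD has been activated.
        Set<TokenType> inDeclaration = EnumSet.of(TokenType.IDENTIFIER, TokenType.RECORD);

        System.out.println(classify("record", byDefault));     // prints IDENTIFIER
        System.out.println(classify("record", inDeclaration)); // prints RECORD
    }
}
```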
Well, that's it. Of course, I'm pretty sure that this mechanism is quite generally useful, not just for this specific problem of parsing Java generics, even if that is the example I keep coming back to when discussing the overall problem of context-sensitive tokenization.