Context-Sensitive Tokenization, Next Installment, Activating and De-activating Tokens

Sometimes, when you complete a major code cleanup, features that were previously pie in the sky become low-hanging fruit to pluck. The new feature that I describe here, the ability to activate and deactivate tokens, is such a case. It resulted from my rewriting of the lexical code generation that I describe here.

In an earlier article I (rather proudly) outlined my solution (of the time) to a certain rather pesky little problem in Java parsing. In the following:

List<Set<Foo>> listOfFooSets;

we need to parse >> as two consecutive > tokens, while, in other contexts, the >> does need to be identified as a single token. At first sight, it does not seem like this should be so hard to deal with, but it is surprisingly difficult to find a palatable solution. I was quite pleased with the solution I describe there because it was definitely far better than the 3-part kludge that I had been using before that, and way, way better than the 5-part kludge that the legacy code used. However, the truth is that it was still a kludge! But now that we can activate and deactivate tokens on an as-needed basis, there is a far more elegant solution.

Well, the easiest way to explain this is with an actual code example. Here is how the ShiftExpression construct is implemented in the current Java grammar:

ShiftExpression :
    AdditiveExpression
    ACTIVATE_TOKENS RSIGNEDSHIFT, RUNSIGNEDSHIFT
    (
        ( "<<" | ">>" | ">>>" )
        AdditiveExpression
    )*
;

The use of ACTIVATE_TOKENS is fairly straightforward. It means that these two tokens are activated for the expansion that follows, the one delimited by parentheses. They are activated at the beginning of that expansion and, at its end, the set of active tokens is reset to what it was before.
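To make that save-and-restore behavior concrete, here is a minimal Java sketch. It is not the code that the parser generator actually emits. The class ScopedTokenActivation, its methods, and the toy tokenizer (which only understands runs of '>' characters) are all hypothetical names invented for illustration.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy lexer: illustrates scoped token activation only.
class ScopedTokenActivation {
    enum TokenType { GT, RSIGNEDSHIFT, RUNSIGNEDSHIFT }

    // Only GT is active by default; the two shift tokens start out deactivated.
    private final EnumSet<TokenType> active = EnumSet.of(TokenType.GT);

    // Activate the given types while tokenizing input, then restore the previous set.
    List<TokenType> withActivated(Set<TokenType> types, String input) {
        EnumSet<TokenType> saved = EnumSet.copyOf(active);
        active.addAll(types);
        try {
            return tokenize(input);
        } finally {
            // Reset the active set to what it was before the scoped expansion.
            active.clear();
            active.addAll(saved);
        }
    }

    // Longest match among the currently active token types.
    List<TokenType> tokenize(String input) {
        List<TokenType> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (input.charAt(pos) != '>') {
                throw new IllegalArgumentException("This toy lexer only handles '>'");
            }
            if (active.contains(TokenType.RUNSIGNEDSHIFT) && input.startsWith(">>>", pos)) {
                tokens.add(TokenType.RUNSIGNEDSHIFT);
                pos += 3;
            } else if (active.contains(TokenType.RSIGNEDSHIFT) && input.startsWith(">>", pos)) {
                tokens.add(TokenType.RSIGNEDSHIFT);
                pos += 2;
            } else {
                tokens.add(TokenType.GT);
                pos += 1;
            }
        }
        return tokens;
    }
}
```

In this sketch, calling tokenize(">>") outside of any activation scope yields two GT tokens, which is what we want when closing nested type arguments, while withActivated(...) with both shift tokens yields a single RSIGNEDSHIFT. Once withActivated returns, the lexer is back to producing two GT tokens.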

At the very top of the grammar file, we have the option:

DEACTIVATE_TOKENS=RSIGNEDSHIFT, RUNSIGNEDSHIFT, RECORD;

These token types are turned off by default and turned on at key moments. So RecordDeclaration, a new stable feature in JDK 16, is defined as follows:

RecordDeclaration :
    Modifiers(EnumSet.of(PUBLIC, PROTECTED, PRIVATE, ABSTRACT, FINAL, STATIC, STRICTFP))
    ACTIVATE_TOKENS RECORD ("record")
    =>||
    [TypeParameters]
    RecordHeader
    [ImplementsList]
    RecordBody
;

The token for the "soft keyword" record is "activated" at the key point where we need it and, everywhere else, "record" is simply tokenized as an identifier.
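The effect on a soft keyword can be sketched in a few lines of Java. Again, this is only an illustration under my own assumptions, not the generated lexer: SoftKeywordLexer, activate, deactivate, and classify are hypothetical names, and the "lexer" here merely classifies a single word that has already been scanned.

```java
import java.util.EnumSet;

// Hypothetical sketch: a soft keyword is only recognized while its token type is active.
class SoftKeywordLexer {
    enum TokenType { IDENTIFIER, RECORD }

    // RECORD starts out deactivated, mirroring the DEACTIVATE_TOKENS option.
    private final EnumSet<TokenType> active = EnumSet.of(TokenType.IDENTIFIER);

    void activate(TokenType type) {
        active.add(type);
    }

    void deactivate(TokenType type) {
        active.remove(type);
    }

    // "record" is the RECORD token only when RECORD is active; otherwise it is an identifier.
    TokenType classify(String word) {
        if (word.equals("record") && active.contains(TokenType.RECORD)) {
            return TokenType.RECORD;
        }
        return TokenType.IDENTIFIER;
    }
}
```

With RECORD deactivated, a declaration such as "int record = 1;" still sees "record" as an ordinary identifier. Only in the spot where the grammar activates RECORD does the same text come back as the keyword token.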

Well, that's it. Of course, I'm pretty sure that this feature is quite generally useful, not just for this specific problem of parsing Java generics -- even if that is the example I keep coming back to when discussing this overall problem of context-sensitive tokenization.

I anticipate that this feature will significantly reduce the need to define separate lexical states since, in many cases, all we really want is to turn a single token on or off in a given spot. Defining a separate lexical state for that is a rather heavy, inefficient solution. Well, sometimes a separate lexical state is the natural solution, as when we are embedding JavaScript inside HTML, but it never seemed right to me to have separate lexical states that only differ by a single token or two...
