Token Hooks (CommonTokenAction) Revisited

Originally published at: https://javacc.com/2020/10/16/token-hooks-revisited/

In the beginning there was… CommonTokenAction

Legacy JavaCC had (and still has) a means of applying whatever adjustments (a.k.a. kludges) to a Token just before it is handed off to the parser machinery. You could define a method called CommonTokenAction in your TokenManager class, and this method was invoked each time another Token came off the stream. The method would look something like this:

 void CommonTokenAction(Token t) {
     if (t.kind == IDENTIFIER) {
         // do some kludgy thing here
     }
 }

You would define such a method in your XXXTokenManager class, and you also needed to put the option:

 COMMON_TOKEN_ACTION=true;

up top in your options block.

Well, JavaCC21 still supports this usage pattern, except that you don't need any configuration option: JavaCC21 simply checks whether a method called CommonTokenAction exists in your code and uses it if it is there. (Duh...)

This disposition is still supported, but it has some self-evident flaws, and JavaCC21 offers a somewhat better solution. (This has been around since back in the FreeCC days, c. 2008.) You define a method called tokenHook in your XXXLexer code with the signature:

 Token tokenHook(Token t) {....}

The key difference is that tokenHook has a return value, giving you the option of instantiating a new Token object, possibly of a different subclass, and using that to replace the Token object passed in. This is optional, of course: the method can just do some manipulation on the object passed in and simply return it. In short, you can use tokenHook to do anything that you could do with CommonTokenAction, plus some tricks that were not possible with the older disposition. So CommonTokenAction is really deprecated.
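For instance, a hook that swaps in a different Token subclass for identifiers might look roughly like this. (The MyIdentifierToken subclass and its constructor are made up for illustration; the real details depend on how your Token subclass is set up.)

 Token tokenHook(Token t) {
      if (t.getType() == IDENTIFIER) {
           // Replace the incoming token with an instance of a (hypothetical)
           // subclass that can carry extra information.
           return new MyIdentifierToken(t);
      }
      return t;
 }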

Problems with tokenHook

While the tokenHook disposition addresses some of the problems of CommonTokenAction by allowing a return value, it still suffers from some design problems, in particular in conjunction with the INCLUDE directive. (Admittedly, legacy JavaCC has never had an INCLUDE directive.)

The main problem with tokenHook is that you are only allowed to have one of them! If one grammar includes a grammar that already has a tokenHook method defined, the including grammar now cannot define a tokenHook method of its own!

This is really a first-order problem, and I have to admit that it had never occurred to me in all these years. It finally did when I wanted to have a tokenHook in both the JavaCC grammar, JavaCC.javacc, and in the Java grammar, Java.javacc, that it includes, and it doesn't work! Well, the code generator generates the code all right, but with two tokenHook methods defined with the exact same signature, the Java compiler complains!

What we really need is for it to be possible to define more than one tokenHook method. Here is the solution (and it is already implemented):

When you want to define a token hook method, you use the special name TOKEN_HOOK, like so:

 INJECT LEXER_CLASS : {
      Token TOKEN_HOOK(Token t) {
           if (t.getType() == IDENTIFIER) {
                // do something, maybe a bit kludgy...
           }
           return t;
      }
 }

What happens, of course, is that the TOKEN_HOOK name is a placeholder that later gets turned into a unique method name of the form tokenHook$XXX. This way, you can have more than one of them! To be precise, it generates a unique method name from the location where the method was specified, like this:

Token tokenHook$Java_javacc_line_48(Token tok) {
    ....
}

Because the method name incorporates the location info, you can jump pretty easily to where the code is defined when you realize the method needs to be tweaked.

In any case, when there is an INCLUDE facility, it is absolutely necessary to handle this use case of multiple tokenHook methods. Any subgrammar that you include in your master grammar is quite likely to define its own tokenHook method already, so once you include more than one such grammar, you have a first-order problem! And that is aside from the point already mentioned: the including grammar cannot define a tokenHook of its own once any of the grammars it uses via INCLUDE already has one.

Another aspect of all this is that the ability to have multiple tokenHook methods is also a key usability improvement, since you can define each method very near to where it is actually used. For example, suppose you have a bit of munging you want to do when the token is an identifier, and another bit when the token is a delimiter, like a parenthesis or a semicolon. If you can have only one tokenHook method, you have to write:

 INJECT LEXER_CLASS : {
      Token tokenHook(Token t) {
           if (t.getType() == IDENTIFIER) {
                // some code...
           } else if (t.getImage().equals("(")) {
                // some other code...
           }
           // ...etc...
           return t;
      }
 }

Once you can have multiple tokenHook methods the above can be specified in multiple code injections:

 INJECT LEXER_CLASS {
      Token TOKEN_HOOK(Token t) {
           if (t.getType() == IDENTIFIER) {
                // some code...
           }
           return t;
      }
 }

 INJECT LEXER_CLASS {
      int parenthesisNesting;
      Token TOKEN_HOOK(Token t) {
           String img = t.getImage();
           if (img.equals("(")) {
                ++parenthesisNesting;
           }
           else if (img.equals(")")) {
                --parenthesisNesting;
           }
           return t;
      }
 }

Use TOKEN_HOOK. The plain tokenHook and CommonTokenAction methods are deprecated.

To recap, the TOKEN_HOOK methods defined in the above injections will generate different methods with unique names. But a key usability aspect is also that the above code injections can be placed very near the related places in the grammar file.

This is actually a general improvement that JavaCC21 offers over the legacy tool. Legacy JavaCC lets you inject code into the generated Parser and TokenManager classes via the PARSER_BEGIN...PARSER_END and TOKEN_MGR_DECLS sections. However, you can only have one of each, and they must be up top in the grammar. JavaCC21 allows the analogous code injections to be placed very near the related places where they are used in the grammar. For example, if your parser needs that parenthesisNesting member variable to keep track of the nesting level, the code injection can be placed very close to where the variables (or methods, or whatever...) are actually used in the grammar. So if certain methods defined in an injection are mostly used after line 2000 of your grammar, you can place the INJECT block that defines those methods near where they are used, and you don't have to be continually scrolling up and down thousands of lines to make sense of your code!
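As a rough sketch of what that looks like (the production name and code actions here are invented, not taken from any real grammar), an injection can sit right next to the production that uses it:

 INJECT PARSER_CLASS : {
      // used only by the production just below
      private int parenthesisNesting;
 }

 ParenthesizedExpression :
      "(" { ++parenthesisNesting; }
      Expression
      ")" { --parenthesisNesting; }
 ;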

Concluding Thoughts

On its own, a usability enhancement like this could be considered fairly minor, but the cumulative effect of five, ten, or twelve things of this nature almost certainly constitutes a night-and-day transformation in the tool's usability! Regardless, the inability to define multiple token hook methods was a clear design flaw that has now been fixed. Again, this came about because I ran into the problem in internal development and realized it needed to be addressed.

The other big problem with this token hook disposition is that, until quite recently, it could not really address the problems it was supposed to, because the hook was getting injected in the wrong place: there was no ability to revisit Tokens that had already been cached in the chain of token.next.next.... fields formed during a lookahead.

This is explained in depth here, and it is really a key thing to understand if you are going to use JavaCC in a heavy-duty way. In short, the only way to address the problem of context-sensitive tokenization is to be able to inject the TOKEN_HOOK routine into the parser class, NOT the lexer class!
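In terms of the grammar file, that just means the injection target changes from the lexer class to the parser class, something like this sketch:

 INJECT PARSER_CLASS : {
      Token TOKEN_HOOK(Token t) {
           // Here the hook runs with the parser's context available, so it can
           // re-examine tokens, including ones already cached during a lookahead.
           return t;
      }
 }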