In the beginning there was... CommonTokenAction
Legacy JavaCC had (and still has) a means of applying whatever adjustments (a.k.a. kludges) to a Token just before it is handed off to the parser machinery. You could define a method called CommonTokenAction in your lexer (TokenManager) class, and this method is invoked each time a new Token comes off the stream. The method would look like this:
void CommonTokenAction(Token t) {
    if (t.kind == IDENTIFIER) {
        ... do some kludgy thing here ...
    }
}
You would define such a method in your XXXTokenManager class, and then you needed to put the option:
COMMON_TOKEN_ACTION=true;
up top in your options block.
Well, JavaCC21 still supports this usage pattern, except that you don't need any configuration option, since JavaCC21 simply sets this based on checking whether there is a method called CommonTokenAction in your code and uses it if it is there. (Duh...)
This disposition is still supported, but it has some self-evident flaws, and JavaCC21 offers a somewhat better solution. (And this has been around since back in the FreeCC days, c. 2008.) You define a method called tokenHook in your XXXLexer code and the signature is:
Token tokenHook(Token t) {....}
The key difference is that tokenHook has a return value, thus giving you the option of instantiating a new Token object, possibly of a different subclass, and using that to replace the Token object passed in. This is optional, of course, since the method could just do some manipulation on the object passed in and simply return that. So, in short, you can use tokenHook to do anything that you could do with CommonTokenAction and also some tricks that were not possible with the older disposition. So CommonTokenAction is really deprecated.
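To make the difference concrete, here is a minimal sketch in plain Java. The classes SimpleToken, KeywordToken and HookStyles are hypothetical stand-ins invented for this illustration, not the classes JavaCC21 generates; the only point is that a hook with a return value can hand back a different object, while a void hook cannot.

```java
// Hypothetical stand-in for a generated Token class (illustration only).
class SimpleToken {
    final String image;
    SimpleToken(String image) { this.image = image; }
}

// Hypothetical subclass that a hook might want to substitute in.
class KeywordToken extends SimpleToken {
    KeywordToken(String image) { super(image); }
}

class HookStyles {
    // CommonTokenAction style: no return value, so the caller always keeps
    // the object it passed in. The method can only inspect or mutate it.
    static void commonTokenAction(SimpleToken t) {
        // ... adjust fields of t here, if it has any mutable ones ...
    }

    // tokenHook style: whatever is returned is what the caller uses next,
    // so the hook may replace the token with an instance of another class.
    static SimpleToken tokenHook(SimpleToken t) {
        if (t.image.equals("class")) {
            return new KeywordToken(t.image);
        }
        return t; // the common case: hand back the same object
    }
}
```

Calling HookStyles.tokenHook on a token whose image is "class" yields a KeywordToken, while any other token comes back unchanged.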
Problems with tokenHook
While the tokenHook disposition addresses some of the problems of CommonTokenAction by allowing a return value, it still suffers from some design problems, in particular in conjunction with the INCLUDE directive. (Admittedly, legacy JavaCC has never had an INCLUDE directive.)
The main problem with tokenHook is that you are only allowed to have one of them! If one grammar includes a grammar that already has a tokenHook method defined, the including grammar now cannot define a tokenHook method of its own!
This is really a first order problem, and I have to admit that it had never occurred to me in all these years. It finally did when I wanted to have a tokenHook in both the JavaCC grammar, JavaCC.javacc, and in the Java grammar, Java.javacc, that it includes, and it doesn't work! Well, the code generator generates the code all right, but with two tokenHook methods defined with the exact same signature, the Java compiler complains!
What we really need is for it to be possible to define more than one tokenHook method. Here is the solution (and it is already implemented):
When you want to define a token hook method, you use the special name TOKEN_HOOK, like so:
INJECT LEXER_CLASS : {
    Token TOKEN_HOOK(Token t) {
        if (t.getType() == IDENTIFIER) {
            ... do something, maybe a bit kludgy...
        }
        return t;
    }
}
What happens, of course, is that the TOKEN_HOOK name is a placeholder that later gets turned into a unique method name of the form tokenHook$XXX. This way, you can have more than one of them! Actually, to be precise, it generates a unique method name using the location where the method was specified, like this:
Token tokenHook$Java_javacc_line_48(Token tok) {
    ....
}
That the method name incorporates the location info allows you to jump pretty easily to where the code is defined when you realize the method needs to be tweaked.
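As a rough illustration of that naming scheme, the following sketch builds a name of the same shape from a grammar file name and a line number. The class HookNames and its method are hypothetical, and the rule of replacing every character outside letters, digits and underscore with an underscore is an assumption chosen only because it reproduces the single example shown above; the actual generator may well sanitize names differently.

```java
class HookNames {
    // Builds a method name such as tokenHook$Java_javacc_line_48.
    // ASSUMPTION: characters other than letters, digits and '_' in the
    // file name become '_'. The real generator's rule is not documented here.
    static String uniqueHookName(String grammarFile, int line) {
        String sanitized = grammarFile.replaceAll("[^A-Za-z0-9_]", "_");
        return "tokenHook$" + sanitized + "_line_" + line;
    }
}
```

Because the file name and line number are part of the result, two hooks can only collide if they are declared on the same line of the same file.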
In any case, when there is an INCLUDE facility, it is absolutely necessary to handle this use case of multiple tokenHook methods, because any subgrammar that you include in your master grammar is quite likely to already define its own tokenHook method. So once you include more than one such grammar, you have a first order problem! And that is aside from the point already mentioned: the including grammar cannot define a tokenHook either once any of the grammars it pulls in via INCLUDE already has one of its own!
Another aspect of all this is that the ability to have multiple tokenHook methods is also a key usability improvement, since you can define each method very near to where it is actually used. For example, suppose you have a bit of munging you want to do when the token is an identifier and another when the token is a delimiter, like a parenthesis or semicolon or something. If you can have only one tokenHook method, you have to write:
INJECT LEXER_CLASS : {
    Token tokenHook(Token t) {
        if (t.getType() == IDENTIFIER) {
            ...some code...
        } else if (t.getImage().equals("(")) {
            ... some other code...
        }
        ...etc...
        return t;
    }
}
Once you can have multiple tokenHook methods, the above can be specified in multiple code injections:
INJECT LEXER_CLASS : {
    Token TOKEN_HOOK(Token t) {
        if (t.getType() == IDENTIFIER) {
            ... some code...
        }
        return t;
    }
}
INJECT LEXER_CLASS : {
    int parenthesisNesting;
    Token TOKEN_HOOK(Token t) {
        // Compare the token's image, not the Token object itself.
        String img = t.getImage();
        if (img.equals("(")) {
            ++parenthesisNesting;
        }
        else if (img.equals(")")) {
            --parenthesisNesting;
        }
        return t;
    }
}
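For readers who want to picture the end result, here is a self-contained sketch of what a lexer with two renamed hooks could look like once the placeholders have been replaced. Everything in it is hypothetical: LexToken and SketchLexer are stand-ins, the two method names are made up in the tokenHook$... style, the identifier test is a simplification, and the applyHooks glue (calling each hook in turn and passing the result along) is an assumption about how the hooks get invoked, not a description of the generated code.

```java
// Hypothetical stand-in for a generated Token class (illustration only).
class LexToken {
    private final String image;
    LexToken(String image) { this.image = image; }
    String getImage() { return image; }
}

class SketchLexer {
    int identifierCount;
    int parenthesisNesting;

    // Stand-in for the first TOKEN_HOOK after renaming ('$' is legal in Java names).
    LexToken tokenHook$Example_javacc_line_10(LexToken t) {
        String img = t.getImage();
        // Simplified identifier check: first character could start a Java identifier.
        if (!img.isEmpty() && Character.isJavaIdentifierStart(img.charAt(0))) {
            ++identifierCount;
        }
        return t;
    }

    // Stand-in for the second TOKEN_HOOK after renaming.
    LexToken tokenHook$Example_javacc_line_20(LexToken t) {
        String img = t.getImage();
        if (img.equals("(")) {
            ++parenthesisNesting;
        } else if (img.equals(")")) {
            --parenthesisNesting;
        }
        return t;
    }

    // ASSUMPTION: each hook is applied in turn, feeding its result to the next.
    LexToken applyHooks(LexToken t) {
        t = tokenHook$Example_javacc_line_10(t);
        t = tokenHook$Example_javacc_line_20(t);
        return t;
    }
}
```

The two hooks keep their own state and never need to know about each other, which is what makes it practical to declare them in separate INJECT blocks.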
Use TOKEN_HOOK. The plain tokenHook and CommonTokenAction methods are deprecated.
To recap, the TOKEN_HOOK methods defined in the above injections will generate different methods with unique names, but a key usability aspect is also that these code injections can be placed very near the related places in the grammar file. This is actually a general improvement that JavaCC21 offers over the legacy tool. Legacy JavaCC allows you to inject code into the generated Parser and TokenManager classes via the PARSER_BEGIN...PARSER_END... and TOKEN_MGR_DECLS sections. However, you can only have one of each! And they must be up top in the grammar. JavaCC21 allows the analogous code injections to be placed very near the places where they are used in the grammar. For example, if your parser needs that parenthesisNesting member variable to keep track of the nesting level, then this code injection can be placed very close to where the variables (or methods or whatever...) are actually used in the grammar. So if certain methods defined in an injection are mostly used after line 2000 of your grammar, you can place the INJECT block that defines those methods right there, so you don't have to be continually scrolling up and down thousands of lines to make sense of your code!
Concluding Thoughts
On its own, a usability enhancement like this could be considered fairly minor, but the cumulative effect of five, ten, or twelve things of this nature almost certainly constitutes a night-and-day transformation in the tool's usability! Regardless, the inability to define multiple token hook methods is a clear design flaw that has now been fixed. Again, this came about because I ran into this problem in internal development and realized it needed to be addressed.