Context-Sensitive Tokenizing, Part Deux: Lexical States

Originally published at: https://javacc.com/2021/01/24/context-sensitive-lexical-states/

(To get some prerequisite understanding of this topic, it might be a good idea to read this earlier blog post on context-sensitive tokenization from three months ago.)

The Lay of the Land

There are two quite useful ideas that have been in JavaCC from the very beginning:

  • lookahead (particularly syntactic lookahead)
  • lexical states

Syntactic lookahead allows you to scan ahead and see if a grammar expansion would succeed or fail when deciding between two or more options. (I use the term lookahead above even though the LOOKAHEAD keyword in JavaCC21 is deprecated in favor of SCAN. See here for more info.)

Lexical states allow you to define multiple states in which different tokenization rules apply. This can be particularly useful if you are parsing some sort of document format that allows different embedded snippets in various mini-languages. Naturally, the snippets in the various mini-languages typically need to be broken down into tokens according to their own respective lexical rules.

While both ideas, lookahead and lexical states, are quite good on their own, the interaction between the two features is quite problematic. As they say, the devil is in the details and due to the way this is implemented, these things have never quite work correctly. Perhaps the overarching issue is that while legacy JavaCC has a lookahead concept, it has zero concept of backtracking or error recovery! This is why certain problems are just so difficult (borderline intractable even) and that is what the aforementioned blog post was mostly about. The problem that that article uses as an example is the problem of parsing type arguments in Java. Just as a quick review, consider the use of >> in the following two lines of code:

  int x = y >> 4;

and:

  Map<String, List<Product>> productLists;

In the first line, the >> is a single token, the signed right shift operator (>>> is the unsigned right shift operator) but in the second line, >> is really two tokens, two closing angle brackets. In short, there are really two alternative ways of tokenizing the input >> (or >>>) -- as a single shift operator token or as two (or three) separate > tokens. Which is the correct tokenization depends on context. It is really quite astounding just how difficult it is to deal with such situations in legacy JavaCC. In this forum post I outlined the 5-part kludge that legacy JavaCC uses to deal with this problem. (I suggest that you shouldn't look at that if you have just eaten!)

Regardless, properly understood, all of this is part and parcel of a larger problem:

Once a token is matched as being of a given type in a lookahead routine, legacy JavaCC has no mechanism whereby this decision can ever be revisited.

There was simply no thought in the design of the original tool (circa 1996/1997) that if the lookahead routine fails (i.e. returns false) that the tokens cached in course of the lookahead could be spurious. So, as we see in this case, once you've spuriously interpreted >> as a single token in a lookahead routine, there is no real way of dealing with the problem. Well, there are kludges, like the 5-part kludge I outlined, but nothing built-in. Certainly nothing elegant!

So, this is the longstanding problem I'm talking about as regards lexical states. If your failed lookahead routine involves a lexical state switch, it is quite possible that the tokenization is spurious but there is no disposition to deal with this problem.

Well, this is not exactly some well-kept secret. Consider question 4.16 in Theodore Norvell's JavaCC FAQ. The question is formulated as:

4.16 How do I communicate from the parser to the token manager?

Or, in other words, how do I get the lexer to change its behavior based on where we are in the parsing?

Theo's answer begins:

It is usually a bad idea to try to have the parser try to influence the way the token manager does its job. The reason is that the token manager may produce tokens long before the parser consumes them. This is a result of lookahead.

I had read this before but I came across it again just recently before this latest round of work, and it made me laugh out loud -- I mean, the way Theo frames the question.

"This is a bad idea..."

Well, as I pointed out earlier, at no point in the entire JavaCC FAQ does Norvell ever say forthrightly that something in JavaCC is simply a bug. He basically takes the approach that JavaCC, as is, is more or less perfect, and that has the natural implication that if you want to do something that the tool does not support, then that is a bad idea.

On your part.

Well, in a sense, Theo is right. It is a bad idea in legacy JavaCC because, generally speaking, it just doesn't work! However, it is not a bad idea intrinsically. Again, the problem is that there is no concept of backtracking in legacy JavaCC, so as Theo says, ... the token manager may produce tokens long before the parser consumes them. This is a result of lookahead. Well, yes, that is true, but Theo glosses over a little detail:

AND, there is no disposition for backtracking and re-tokenizing if it turns out to be necessary.

Speaking of this particular bad idea, in recent private correspondence, Brian Goetz told me that he had been a heavy user of JavaCC from the very early days (20 plus years ago) and he had tried to contribute a patch to address this whole problem of lexical states and lookahead but was given the runaround. (Quelle surprise!)

Well, apparently, Mr. Goetz, despite being the Java language architect at Oracle, and the co-author of the standard reference on concurrent programming in Java, is not immune to bad ideas -- like wanting the state of the parser to influence tokenization. (In case you don't realize it already, Dear Reader, I say that tongue in cheek!)

I actually dug up Brian's patch and looked at it but... well... my solution is more elegant. I say that, by the way, not to put down Mr. Goetz, but mostly just because any fix to this against the current JavaCC21 codebase (approximately a 90% rewrite at this point) is bound to be more elegant, because the cleaned up JavaCC21 codebase allows much cleaner solutions to this kind of thing.

We're not in Kansas any more!

So, now that I am through with the long-winded pre-amble, here is the solution that is now implemented:

You can (optionally, of course) specify that a grammar production needs to be tokenized in a certain lexical state. For example:

PythonSnippet : PYTHON : 
    blah blah
;

And that's it.

Basically, that's it. What the above (the extra : PYTHON bit) means is that the parser will switch over to the PYTHON lexical state when entering this production -- whether parsing it OR in a lookahead routine. And when it exits, it will revert to whatever the previous lexical state was.

Also, if it is in a lookahead and the lookahead fails, it throws away all the tokens that it cached that are liable to be spurious.

So, if we have something like:

 CodeSnippet :
    PythonSnippet
    |
    RubySnippet
    |
    JavaScriptSnippet
;

If we scan ahead in the production CodeSnippet, it switches into the appropriate lexical state as it tries the various choices one by one. So, if it is scanning ahead to check whether we are in a Python code snippet, and it decides that no, we aren't, then it just jumps out, but when it jumps out, it reverts to the previous lexical state and throws away any Python tokens it created. (Yep, we're back in bed in Kansas, as if nothing had happened!) And now it tries to scan ahead a Ruby snippet and if it jumps out with failure, it always reverts to the previous lexical state and throws away whatever tokens it cached along the way.

Of course, if the lookahead is successful, it keeps the cached tokens. As far as I can see, the whole thing has close to zero overhead.

You can still specify lexical states switches in your lexical grammar, but I anticipate that this will be the standard way to use lexical states in JavaCC. I think it is both conceptually simpler and less error-prone. Also, it seems quite natural to be able to specify that a certain grammar production must be tokenized in a certain lexical state.

That said, I should mention that I also implemented a way of specifying the lexical state at the level of expansions within a grammatical production. So, you could write:

 LEXICAL_STATE FOOBAR
 (
     Foo Bar Baz
 )

And the same basic logic applies. It switches to the FOOBAR lexical state before entering Foo and after exiting Baz it reverts to the previous lexical state. If this is part of a lookahead routine, and it jumps out of the routine anywhere between the start of Foo and the end of Baz, with a failure, then it reverts to the previous lexical state AND throws away any cached tokens.

I implemented the above mostly for completeness, I guess. I have been trying to think about which cases you would prefer to specify this at the expansion level, as opposed to just defining a production and labeling that production as taking place in a given lexical state. I think you would almost always just prefer to just label the production.

Well, something similar happens with tree building. some people might not even realize this, but you can actually put a tree building annotation anywhere in a grammar, like:

  Foo Bar #FooBar(2)

i.e. create a new FooBar node with the Foo and Bar we just created as 2 child nodes. However, as a practical question, probably about 97% of tree-building annotations occur at the level of grammatical productions. So you will much more often see something like:

  FooBar #FooBar(2) : Foo Bar ;

There, FooBar is a grammatical rule and it creates a FooBar node with Foo and Bar as 2 sub-nodes.

Well, actually, in the above case, you don't need to write that #FooBar(2) part anyway. That is the default behavior in JavaCC21 if you specify nothing at all, i.e.

  FooBar : Foo Bar ;

Note, by the way, that I'm assuming in the examples above that Foo and Bar both definitely create a single node and place it on the stack, and that is not necessarily the case, but it would be quite typical.

If you didn't want the behavior described above, you'd have to write:

  FooBar #void : Foo Bar;

to specify that you don't want the FooBar production to create any Node!

Well, I'm digressing anyway. I'm just saying that in both these cases, both specifying the node creation (or not specifying it since it's the default anyway!) or specifying the lexical state, it will probably be most natural to do this at the level of grammatical productions.

The above feature is not part of some blue skies wish list. It is all implemented! Not so very tested, I would have to admit. Just a little bit. So some issues may become apparent over the coming days, but also maybe not. The code has been restructured and simplified to such an extent that adding these sorts of features is extremely simple. They are expressed in shockingly few lines of code. But please give it your best shot! I'll be grateful to anybody who reports problems with this new feature. (Or any other.)