Context-Sensitive Tokenization, Next Installment, Activating and De-activating Tokens

Originally published at: Context-Sensitive Tokenization, Next Installment, Activating and De-activating Tokens – JavaCC 21

Sometimes, when you complete a major code cleanup, features that were previously pie in the sky become low-hanging fruit to pluck. The new feature that I describe here, the ability to activate and deactivate tokens is such a case. It resulted from my rewriting of the lexical code generation that I describe here.

In an earlier article I (rather proudly) outlined my solution (of the time) to a certain rather pesky little problem in Java parsing. In the following:

List<Set<Foo>> listOfFooSets;

we need to parse >> as two consecutive > tokens, while, in other contexts, the >> it does need to be identified a single token. At first sight, it does not seem like this should be so hard to deal with, but it is surprisingly difficult to find a palatable solution. I was quite pleased with the solution I describe there because it was definitely far better than the 3-part kludge that I had been using before that, and way way better than the 5-part kludge that the legacy code used. However, the truth is that it was still a kludge! But now that we can activate and deactivate tokens on an as-needed basis, there is a far more elegant solution.

Well, the easiest way to explain this is with an actual code example. Here is how the ShiftExpression construct is implemented in the current Java grammar:

ShiftExpression :
    AdditiveExpression
    ACTIVATE_TOKENS RSIGNEDSHIFT,RUNSIGNEDSHIFT 
   (
      ( "<>" | ">>>")
      AdditiveExpression
   )*
;

The use of the ACTIVATE_TOKENS is fairly straightforward. What it means is that in the following expansion, delimited by parentheses, these two tokens are activated. They are activated at the beginning of the expansion that follows and at the end, the set of active tokens is reset to what it was before.

At the very top of the grammarfile, we have the option:

DEACTIVATE_TOKENS=RSIGNEDSHIFT, RUNSIGNEDSHIFT, RECORD;

These token types are turned off by default and turned on at key moments. So, the RecordDeclaration, new stable feature in JDK 16 is defined as follows:

RecordDeclaration :
    Modifiers(EnumSet.of(PUBLIC, PROTECTED, PRIVATE,    ABSTRACT, FINAL, STATIC, STRICTFP))
    ACTIVATE_TOKENS RECORD ("record")
    =>||
   
   [TypeParameters]
   RecordHeader
   [ImplementsList]
   RecordBody
;

The token for the "soft keyword" record is "activated" at the key point where we need it and everywhere else, the "record" is simply tokenized as an identifier.

Well, that's it. Of course, I'm pretty sure that this disposition is quite generally useful, not just for this specific problem of parsing Java generics -- even if that is the example I keep coming back to when discussing this overall problem of context-sensitive tokenization.

I anticipate that this disposition will significantly reduce the need to define separate lexical states since, in many cases, all we really want is to turn on or off a single token in a given spot. Defining a separate lexical state for that is a rather heavy, inefficient solution. Well, sometimes, a separate lexical state is the natural solution, like if we are embedding JavaScript inside HTML, but it never seemed right to me to have separate lexical states that only differ by a single token or two...

I should also add that there is also a new setting called DEACTIVATE_TOKENS so that you can specify one or more token types to deactivate by default when the parser is instantiated. Like this:

  DEACTIVATE_TOKENS=OPEN_PAREN,CLOSE_PAREN;

at the top of a grammar file with all the rest of the various options.

How does one deal with activation/deactivation when there might be recursion involved? For example, the 3.10 Python grammar has soft keywords match and case, and of course case statements can contain nested match statements. Would it be sufficient to do something like:

CompoundStatement :
    FunctionDefinition
(other alternatives elided)
    |
    TryStatement
    |
   ACTIVATE_TOKENS(MATCH)
   MatchStatement 
   DEACTIVATE_TOKENS(MATCH)
 ;

MatchStatement:
   <MATCH> SubjectExpression <COLON> <NEWLINE> <INDENT>
   ACTIVATE_TOKENS(CASE) (CaseBlock)+ DEACTIVATE_TOKENS(CASE) DEDENT
;

CaseBlock:
  <CASE> Patterns [ Guard ] <COLON> Block;

Or is there a better way?

Hmmm. I think there is a problem here, and the code above won’t work correctly, but it doesn’t really have anything to do with recursion. I mean, if you know that the soft keyword “match” should be tokenized as a MATCH keyword in that specific spot and everywhere else, it should be treated as a regular IDENTIFIER, it doesn’t really matter that much how deeply nested you are at that juncture. The problem lies elsewhere.

The above doesn’t work (at least for now, but I’ll be working on the problem!) due to how the default choice resolution algorithm works. Like, in the absence of anything to the contrary, the code is looking ahead one token and deciding whether to enter whatever expansion on that basis, right?

So, if you have a choice, like:

   IfStatement
   |
   WhileStatement
   |
   MatchStatement
   |
   ReturnStatement

the assumption (assuming that we haven’t put any explicit lookahead anywhere) is that we can decide which choice to enter based on the next token. If the next token is an IF, we go into the IfStatement and if it’s a RETURN, we go into the ReturnStatement and so on. And you can imagine what code is generated for this. Roughly anyway. Something like:
if (nextTokenType == IF) {…}
else if (nextTokenType == MATCH {…}

etc.

Well, the problem (as things stand) is that the decision to take a given choice is effectively being made before the ACTIVATE_TOKENS is hit. If we decide, in the above code to go into the MatchStatement option based on the next token, we’ll never do it, because the next token, even if it is the string “match”, is being matched as an IDENTIFIER, and then based on that, we decide whether to enter that choice and the ACTIVATE_TOKENS(MATCH) is never invoked! You see the problem? IOW, the choice that involves activating the MATCH token never gets invoked precisely because the next token has already been scanned as an IDENTIFIER and we don’t enter that choice in the first place. You see my point?

I haven’t actually checked, but I’m about 99% certain that the code you wrote doesn’t work (which is NOT a reproach, mind you. It probably should work…)

Now, I think it would work if you wrote:

        SCAN 0 {getToken(1).getImage().equals("match")}  =>
        ACTIVATE_TOKENS(MATCH)
        MatchStatement

We add that extra semantic lookahead that if the next token is the string “match” (irregardless of the type, so it can be an identifier) we enter the following expansion, then we turn on the token type MATCH and then we enter MatchStatement! And “match” here is a MATCH token because the ACTIVATE_TOKENS caused us to retokenize the tokens that follow!) (And probably the DEACTIVATE_TOKENS should just occur write after the <MATCH>)

I’m pretty sure that works, for example, but I don’t consider it satisfactory. It requires people to be entirely too clever and I don’t want to market the tool as being solely for super clever people!

Oh, another way to get this to work would probably be:

   MatchStatement :
       ACTIVATE_TOKENS(MATCH) <MATCH> =>||
       DEACTIVATE_TOKENS(MATCH) 
       etc...

The reason that would work is that, since we put in an up-to-here marker in the MatchStatement, it generates a lookahead routine and so we’re not doing the simple one-token-lookahead default, so we get around the catch-22 problem I describe initially. But that still requires people to be clever…

As for the CASE, I don’t think there is much problem with it, since it’s not occurring at a choice point. Except, I think it would be better to put the activate/deactivate in the CaseBlock production.

   CaseBlock :
       ACTIVATE_TOKENS(CASE) <CASE> DEACTIVATE_TOKENS(CASE)

which is a bit verbose, but should work. But it works because the CaseBlock is not at a choice point, so we’re not hitting this catch-22 problem.

I’ll close this note here, but it’s not a full answer, because, while writing this, I realized that there are some rather screwy aspects to all this that probably need to be revisited, so stay tuned…

D’oh! Of course, should’ve seen that :blush:

On what basis would the MatchStatement production be entered in this case from CompoundStatement?

Well, it’s not exactly that the MatchStatement is entered on that basis. What happens is that, since MatchStatement contains an up-to-here marker, it is not taken to be a simple LL(1) case, the code generated is something more like:

   if (nextTokenType == IF) {....}
   else if (nextTokenType == TRY) {....}
   ...etc...
   else if (generatedLookaheadRoutineForMatchStatement()) {
                enter Match statement
    }

So that generated lookahed routine activates the token MATCH and then scans the “match” as a MATCH token. And then returns a success and we enter the MatchStatement production on that basis.

However, there is another aspect to all of this that I find troubling. I think I implemented this feature incorrectly and have to redo it. Because I did something wrong that I’ll have to explain separately.

Well… it’s not so easy to reason about these things. I myself find that I still make similar sorts of mistakes. And that’s at a point where I know the whole system more intimately than anybody! So, for somebody in your position, just coming into this…

But the other thing is that this whole conversation makes me realize that this whole thing is implemented incorrectly. I think the ACTIVATE/DEACTIVATE stuff has to work on an expansion as a whole and be implemented in code with a try…finally… block. It should really be something more like:

   ACTIVATE FOO, BAR
   (some Expansion)

And that should generate something like:

   activateTokens(FOO, BAR);
   try {
             expansion code
   }
    finally {
         deactivateTokens(FOO, BAR);
     }

Or, if it was DEACTIVATE instead of ACTIVATE, it’s the other way round. We deactivate at the start and re-activate the same tokens in the finally block.

You see, the problem is that, as things stand, it just doesn’t quite work correctly in terms of a lookahead routine. A lookahead routine jumps out once it realizes that it has failed, but generally speaking, if it activated or deactivated a given token, it should restore the state of the world to what it was before. Something like:

  SCAN MatchStatement =>

should not have the side effect that it activated the MATCH token and now, even asuming that the scanahead failed, we now have MATCH activated. The only way for this to work robustly is for the ACTIVATE/DEACTIVATE to be in a try-finally construct. It’s just implemented wrong, basically.

Well, there’s the other aspect, which is that, in terms of LL(1) sort of logic, there needs to be a recognition that if an expansion starts with an ACTIVATE_TOKENS or DEACTIVATE_TOKENS, even if the expansion or production just requires 1 token of lookahead, we still generate a custom lookahead routine for it. That would get rid of the Gotcha that we were discussing. And that is not too hard. I initially thought that was the full solution but then I started thinking and I realized that there is a deeper problem with all this. But I figure it’s still early and I can re-implement this feature correctly!

Yes, I follow your explanation. It makes sense - I hadn’t considered all those wrinkles, but I’m glad you’re on the case!

I’ll hold off looking at any match-statement related stuff until you’ve done the re-implementation.

It shouldn’t be too long.

I think it’s okay now. There is a change in both syntax and semantics. Now the ACTIVATE_TOKENS works like:

   ACTIVATE_TOKENS FOO BAR
    (
           some expansion
    )

Of course, the “some expansion” above will frequently just be a single token, so you end up with something a bit verbose, like:

      ACTIVATE_TOKENS MATCH (<MATCH>)

At some point, I may put in a shorthand for the above, but for now…

So I’m using it internally already! See for example: javacc21/Java.javacc at master · javacc21/javacc21 · GitHub

But the point here is that the way it works is it creates a try-finally block (both for regular parsing and lookaheads) such that the set of “active” tokens is always reset to what it was at the beginning of the expansion. You can see where it is implemented here: javacc21/CommonUtils.java.ftl at master · javacc21/javacc21 · GitHub

Mostly that. That’s a couple of dozen lines of template code. And it doesn’t look too obvious to me where there would be a problem there, but there could be a bug somehow. Anyway, feel free to give it a try.

Hmmm … not sure if it can be made to work as you suggest for this specific case. Consider this source:

match = 0  # Assignment ... SmallStatement ... SimpleStatement ... Statement
match()  # FunctionCall ... Expression ... StarExpressions ... SmallStatement ... SimpleStatement ... Statement
def match(x):  # FunctionDefinition ... CompoundStatement ... Statement
    match x:  # MatchStatement ... CompoundStatement ... Statement
        case [*_]:
            return "seq"
        case {}:
            return "map"

and the production is

Statement : SimpleStatement | CompoundStatement ;

When we’re about to parse at the top level (Statement) the input "match" could be indicative of either a SimpleStatement (the first two source lines above) or a CompoundStatement (the match statement at line four). If we activate the <MATCH> token in the CompoundStatement or a MatchStatement production under it, that would be too late, as we would have tried to parse as a SimpleStatement already … but if we activate it too soon, then the first two source lines wouldn’t be correctly parsed.

It seems like we might need to do a negative lookahead for “match” followed by [the first set for] SubjectExpr to go into SimpleStatement, otherwise go into CompountStatement. As all the other CompoundStatement cases start with a keyword, then the match statement should be discernable if the next token is “match”. Have I understood things correctly?

I tried changing the Statement production to

Statement :
  LOOKAHEAD(~ "match" SubjectExpr) SimpleStatement
  |
  CompoundStatement
;

but that blows up entirely. Without the negative lookahead, the value of first_set$Python_javacc$25$13 is

result = {JumboEnumSet@1688}  size = 44
 0 = {PythonConstants$TokenType@1704} "AT"
 1 = {PythonConstants$TokenType@1705} "LBRACE"
 2 = {PythonConstants$TokenType@1706} "LBRACKET"
 3 = {PythonConstants$TokenType@1707} "LPAREN"
 4 = {PythonConstants$TokenType@1708} "STAR"
 5 = {PythonConstants$TokenType@1709} "TILDE"
 6 = {PythonConstants$TokenType@1710} "ELLIPSIS"
 7 = {PythonConstants$TokenType@1711} "MINUS"
 8 = {PythonConstants$TokenType@1712} "PLUS"
 9 = {PythonConstants$TokenType@1713} "_ASSERT"
 10 = {PythonConstants$TokenType@1714} "ASYNC"
 11 = {PythonConstants$TokenType@1715} "AWAIT"
 12 = {PythonConstants$TokenType@1716} "BREAK"
 13 = {PythonConstants$TokenType@1717} "CLASS"
 14 = {PythonConstants$TokenType@1718} "CONTINUE"
 15 = {PythonConstants$TokenType@1719} "DEF"
 16 = {PythonConstants$TokenType@1720} "DEL"
 17 = {PythonConstants$TokenType@1721} "FOR"
 18 = {PythonConstants$TokenType@1722} "FROM"
 19 = {PythonConstants$TokenType@1723} "GLOBAL"
 20 = {PythonConstants$TokenType@1724} "IF"
 21 = {PythonConstants$TokenType@1725} "FALSE"
 22 = {PythonConstants$TokenType@1726} "IMPORT"
 23 = {PythonConstants$TokenType@1727} "LAMBDA"
 24 = {PythonConstants$TokenType@1728} "NONLOCAL"
 25 = {PythonConstants$TokenType@1729} "NONE"
 26 = {PythonConstants$TokenType@1730} "NOT"
 27 = {PythonConstants$TokenType@1731} "PASS"
 28 = {PythonConstants$TokenType@1732} "PEG_PARSER"
 29 = {PythonConstants$TokenType@1733} "RAISE"
 30 = {PythonConstants$TokenType@1734} "RETURN"
 31 = {PythonConstants$TokenType@1735} "TRUE"
 32 = {PythonConstants$TokenType@1736} "TRY"
 33 = {PythonConstants$TokenType@1737} "WHILE"
 34 = {PythonConstants$TokenType@1738} "WITH"
 35 = {PythonConstants$TokenType@1739} "YIELD"
 36 = {PythonConstants$TokenType@1740} "DECNUMBER"
 37 = {PythonConstants$TokenType@1741} "HEXNUMBER"
 38 = {PythonConstants$TokenType@1742} "OCTNUMBER"
 39 = {PythonConstants$TokenType@1743} "BINNUMBER"
 40 = {PythonConstants$TokenType@1744} "FLOAT"
 41 = {PythonConstants$TokenType@1745} "COMPLEX"
 42 = {PythonConstants$TokenType@1746} "STRING_LITERAL"
 43 = {PythonConstants$TokenType@1747} "NAME"

but with the negative lookahead, the value changes to have only 10 elements:

result = {JumboEnumSet@1688}  size = 10
 0 = {PythonConstants$TokenType@1705} "AT"
 1 = {PythonConstants$TokenType@1706} "ASYNC"
 2 = {PythonConstants$TokenType@1707} "CLASS"
 3 = {PythonConstants$TokenType@1708} "DEF"
 4 = {PythonConstants$TokenType@1709} "FOR"
 5 = {PythonConstants$TokenType@1710} "IF"
 6 = {PythonConstants$TokenType@1711} "MATCH"
 7 = {PythonConstants$TokenType@1712} "TRY"
 8 = {PythonConstants$TokenType@1713} "WHILE"
 9 = {PythonConstants$TokenType@1714} "WITH"

So of course it blows up. Am I setting up the negative lookahead wrong, or is there some other error in my logic?

Quick question: Are you 100% sure that it is even using that variable? One thing that happened recently is that I used to have a thing where I ran over the code Java AST and eliminated unused private variables/methods. I axed all that with the intention of rewriting it. One thing about that is that, before I finally just fixed the “Code too large problem” I was running into that problem frequently and that was my main motivation for that method where I was getting rid of unused variables. Well, anyway, the generated first set stuff may frequently be unused. If that’s the case, then you would likely be going on a wild goose chase!

I think that probably the more natural way to do things in JavaCC21 is just try the match statement as a first choice and basically, just scan ahead for the “case”. So, in the offical python grammar you have the match_statement expressed as:

     match_stmt:
        | "match" subject_expr ':' NEWLINE INDENT case_block+ DEDENT 

And this actually occurs as the last choice to be checked.

I would think that in JavaCC21, we would perhaps do it the other way round and just put the match statement at the top of the available choices. I think I’d express it something like:

    MatchStatement :
        ACTIVATE_TOKENS <MATCH> (<MATCH>)
        SubjectExpr 
        ":" 
        <NEWLINE> 
        <INDENT> ":" 
       =>|+1 
       (CaseBlock)+
       <DEDENT>
    ;

The above with the =>|+1 means that we scan ahead to the “:” and then one more token (which, here, should be the “case”). If we can match up that “case” then we commit to a MatchStatement and if that lookahead fails, then it’s just the same old pre-3.10 logic. So we go through the remaining options exactly like before and in all those cases MATCH is not activated and is just a regular identifier token.

By the way, what I propose above is a pretty tricky test of the stuff I just implemented, because all the machinery has to activate and de-activate the tokens in the appropriate places. So we need something like:

   CaseBlock :
       ACTIVATE_TOKENS <CASE> (<CASE>)
        Patterns
        Guard 
        Block
    ;

So, I mean, for what I’m proposing above to work, the token activation stuff has to be pretty rock solid in terms of working inside of a lookaheads and such.

Seems to be - I set a breakpoint on the membership check (in the PyModule production) and it failed on the very first statement because the diminished set didn’t contain the required token type.

Do you mean to have that ":" after the <INDENT> ? I don’t think it’s there in the Python grammar - after the newline and the indent, you should get the <CASE> token.

As for the rest - are you assuming that MatchStatement is one of the choices in CompoundStatement? If so, how will the parse of my example source from Statement level not go into SimpleStatement when it parses line 4, as it does for lines 1 and 2?

Yes, you’re right, of course. I don’t know why I typed the extra colon. Never mind that.

It occurs to me also that probably there is no need to scan up to the CASE. Maybe you could stop scanning after the colon. I think that, in general, a lookahead should stop as soon as there is no other viable choice. (Or it can definitively reject the choice…) It seems like, if you scan "match" SubjectExpr ":" then there is no other viable choice but the match statement. Is that correct?

Actually, I’m not sure what I was assuming! Maybe I’ll still suffering a bit of jet lag. It seems like maybe we should check for the match statement at an earlier point than that. Well, maybe that’s not so different from your negative lookahead idea, which I was thinking about a bit more and I’ll comment on a bit more later.

Well, okay, but there could be some bug lurking here. I’m not sure what’s going on. It doesn’t seem that on a complex (multi-token) lookahead that any first_set sort of logic should come into play. In fact, I’m pretty sure that there is no separate logic in the system for the first set of a negated lookahead. Basically, the way a negated lookahead works is very bloody minded. You just do the same non-negated lookahead, but the lookahead succeeds if it otherwise would have failed, and it fails if it would have succeeded.

Well, I mean, the code generated for the negated lookahead, the scan routine, is the same as for the regular, non-negated lookahead. Except that the result of the lookahead gets inverted. See: javacc21/LookaheadRoutines.java.ftl at master · javacc21/javacc21 · GitHub

And that’s about it. What I mean is that the codebase doesn’t include any manipulations of first set logic for “negated lookaheads”. So, in short, I’m getting confused… :slight_smile:

Well, so am I :confused: I couldn’t see what looked a lookahead routine generated from the negative lookahead in the generated code. Also, I got different behaviour in these two cases:

Statement :
  LOOKAHEAD(~ "match" SubjectExpr) SimpleStatement
  |
  CompoundStatement
;

Fails to parse even a single statement: the first set has only ten elements, and the parser doesn’t even enter the Statement production but fails when trying to consume the EOF after all the statements. However, with

Statement :
  (LOOKAHEAD(~ "match" SubjectExpr) SimpleStatement)
  |
  CompoundStatement
;

If parses the first three lines and fails on the fourth (the match statement). Have I just misunderstood how LOOKAHEAD binds? Is LOOKAHEAD(X) A | B ; the same as LOOKAHEAD(X) (A | B ); rather than (LOOKAHEAD(X) A) | B ; ?