It has long been a nagging thing in my head that there is a basic flaw with token lexical actions in CongoCC. Well, first things first. What is a token lexical action, you might be wondering.
What I'm talking about is something like:
TOKEN :
<FOO: "foo"> {some java code here, this is the token lexical action!}
;
The situation was (though now it is remedied!) that you could only attach a token lexical action to a regular TOKEN regexp pattern, or one specified as UNPARSED, but not to a MORE or a SKIP. I don't actually know offhand whether this glitch was in the legacy JavaCC, or whether it was introduced in JavaCC 21. But regardless, I recently found this deficiency annoying enough that I finally addressed it.
Now, to be perfectly precise, you could tack on that token lexical action to a MORE or SKIP pattern, but it would just be ignored! Not only would the code snippet be ignored, but the tool did not even emit a warning! So, really, this is definitely a bug, and a longstanding one at that.
The reason I finally got round to this was that I am currently working up a CongoCC grammar for the Rust programming language and there were some fiddly things in the lexical spec that really made me want to extirpate this longstanding wart.
Motivation: What is this MORE thing anyway?
The MORE pattern is a feature that has existed since the early days of JavaCC, about 30 years ago at this point. The basic idea is that you might want to match one or more regexp patterns without rolling up a token. One thing in Rust that is familiar but a little bit different is the fact that it has multiline comments just like Java or C#, but with the difference that they can nest. So, for example, you can have:
/* This is the outer comment
/* And now in an inner one. */
And now we are still in the outer comment.*/
Of course, the above does not work in Java or C#. The /* on the second line does not start any sort of "nested comment"; it is just regular text that is part of the comment begun on the previous line. And thus, the */ on the second line closes the outer comment. (There is only an outer comment, no inner one!) So all the text on the third line is not inside any comment at all, which is surely not what whoever wrote this intended. It just wouldn't work. But in Rust it does!
So, this is how this is currently implemented in the Rust grammar I am writing:
MORE :
   <COMMENT_START : "/*"> {commentStart = matchStart;} : IN_MULTILINE_COMMENT
;

<IN_MULTILINE_COMMENT>
MORE :
   <NESTED_COMMENT_START : "/*"> {++commentNesting;}
   |
   <ANY_CHAR : ~[]>
;

<IN_MULTILINE_COMMENT>
UNPARSED :
   <MULTILINE_COMMENT : "*/">
   {
      assert commentNesting >= 0 : "Should be impossible!";
      if (commentNesting > 0) {
         commentNesting--;
         matchedToken = null;
      }
      else {
         matchedToken.setBeginOffset(commentStart);
         switchTo(LexicalState.RUST);
      }
   }
;
So, the above is a fairly clean (I think so anyway!) implementation of these nested comments. Well, it is actually missing a little piece. The code keeps track of two things via the commentStart and commentNesting variables. When we hit the start of the outermost comment, we store its start offset in commentStart. When we hit a /* while we are already inside a comment, we increment the commentNesting variable. But, of course, those variables have to be declared somewhere, so we also have:
INJECT LEXER_CLASS : {
   private int commentStart, commentNesting;
}
Without the above injection, our generated code would not compile!
Now, you may be wondering about some other variables that were used, i.e. matchedToken and matchStart. Good question, and the answer is that these variables are already in the context and we don't need to inject them, like we did with commentStart and commentNesting.
Now, matchedToken was already there to be used in a token lexical action, but matchStart is something I recently added. Here is an interesting technical (maybe hypertechnical) point about all this. A key difference between matching a MORE (or SKIP) as opposed to a TOKEN (or UNPARSED) pattern is that with a MORE, we don't roll up a token, so matchedToken is actually null at this spot.

Of course, it is hard to do anything useful in your code action if you don't know precisely where you are, so we now also expose a couple of extra variables, matchStart and matchEnd, which give us the start and end locations of the pattern we just matched. Note that the end position is non-inclusive, so another way of thinking of it is that this is the position where we will start matching on the next iteration. It is interesting to note that if we are matching a regular token, those two variables are superfluous, because they are the same as matchedToken.getBeginOffset() and matchedToken.getEndOffset() respectively. But since matchedToken is null when we have just matched a MORE or SKIP, matchStart and matchEnd are the only way the code action can know where it is.
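Just to illustrate (this is not from the actual Rust grammar, and the skippedWhitespace field is a hypothetical injected variable along the lines of commentStart above), a lexical action on a SKIP pattern that uses these variables might look something like this:

SKIP :
   <WHITESPACE : (" " | "\t" | "\r" | "\n")+>
   {
      // matchedToken is null here, since a SKIP does not roll up a token,
      // but matchStart and matchEnd still tell us exactly what was consumed.
      skippedWhitespace += matchEnd - matchStart;
   }
;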
Some other little tricks
Actually, this whole MORE concept is probably best understood as a specific case of something more general. Namely:
There is not necessarily an exact correspondence between the patterns you match and the tokens you create!
(Not necessarily, I say. Sometimes there is, but sometimes not...)
Here is another very typical kind of situation. Suppose you are parsing some sort of HTML/XML-ish dialect and the most natural thing is to match some text until you hit a < character. However in:
<tag>schmoo</tag>
you typically do not want your token representing the content inside the tag to be "schmoo<" but rather just "schmoo", right?
So the natural thing would be to have a token lexical action that chops off the final character. So, supposing that we match some regexp that includes the angle bracket, we end up with something like:
{matchedToken.truncate(1);}
Note, by the way, that once we truncate that token, the tokenizing machinery uses the matchedToken.getEndOffset() as the starting point on the next iteration of matching. So, it starts with that < character and presumably picks up the </tag> as the next token. But obviously, if it were going to pick up tokenization at the / character that just wouldn't work, would it? So, yes, truncating the matchedToken by one character causes the next iteration of tokenization to start where it needs to start...
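To make that concrete, here is a sketch of what such a production might look like (the TEXT_CONTENT name and the regexp are just made up for illustration):

TOKEN :
   <TEXT_CONTENT : (~["<"])+ "<">
   {
      // Chop off the trailing '<' so the token is just the text itself.
      // Tokenization then resumes at the '<', which starts the next tag.
      matchedToken.truncate(1);
   }
;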
In the case of the nested comments above, the pattern we match is the */, but we don't want that to be our token. We want the token to be the entire comment, right? So, what we do (and note that we only roll up a token if commentNesting is zero) is set the token's beginOffset to the commentStart offset that we stored earlier when we hit the very first /* -- not any nested ones, of course!
Now, if we are in a nested comment, i.e. commentNesting is greater than zero, we (of course!) decrement the commentNesting counter and set matchedToken to null. The tokenization machinery would normally return matchedToken as the freshly minted token, but when it is null, the machinery just keeps on tokenizing.
One interesting aspect of all of this is that the MORE specification is actually superfluous; we could basically use SKIP for the same purpose. As long as we hold onto the desired start offset of the token, it makes very little difference whether the intervening characters were matched as MORE or SKIP. In the end, I chose to keep MORE as a category because it does say something about one's intent. Also, MORE is a longstanding thing in JavaCC -- even if, like many things in legacy JavaCC, it didn't exactly work totally correctly -- so I figured it would be a good idea to remedy that.
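For instance, the middle production of the nested comment grammar above could just as well have been written with SKIP instead of MORE (this is not how the grammar actually does it, just an illustration), since the final UNPARSED action sets the token's begin offset from commentStart anyway:

<IN_MULTILINE_COMMENT>
SKIP :
   // The characters inside the comment are matched and discarded here; the
   // eventual MULTILINE_COMMENT token gets its start position from commentStart.
   <NESTED_COMMENT_START : "/*"> {++commentNesting;}
   |
   <ANY_CHAR : ~[]>
;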
I'll close this little tips and tricks blog article with another example from the Rust grammar.
In Rust, both 1 and 1. are numerical literals. The first is the integer one, while the second is a floating point literal. (I guess somebody decided it was just too much work to write 1.0 so you can write 1. instead. Okay, fine...)
But then in Rust, you have things like:
for i in 1..10 {...}
I was having fits with this until I finally got into a debugger and realized that it was tokenizing the 1. as a floating point literal and then choking on the . that immediately follows. So this is fairly straightforwardly resolved with a token lexical action:
{
   if (charAt(matchEnd-1) == '.' && charAt(matchEnd) == '.') {
      matchedToken.truncate(1);
      matchedToken.setType(INTEGER_LITERAL);
   }
}
So, as you can see, in this spot, when it matches 1. it peeks ahead one extra character, and if that is another ., it truncates the matched token and sets its type to INTEGER_LITERAL. The next token is then scanned as .., and thus the 1..10 range expression is handled correctly.
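For context, here is roughly how that action attaches to a production (the FLOAT_LITERAL name and the digits-dot-digits regexp are simplified stand-ins; the actual numeric literal patterns in the Rust grammar are more involved):

TOKEN :
   <FLOAT_LITERAL : (["0"-"9"])+ "." (["0"-"9"])*>
   {
      // If the '.' we just consumed is immediately followed by another '.',
      // this is really an integer followed by the '..' range operator, so we
      // give back the '.' and reclassify the token.
      if (charAt(matchEnd-1) == '.' && charAt(matchEnd) == '.') {
         matchedToken.truncate(1);
         matchedToken.setType(INTEGER_LITERAL);
      }
   }
;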
Okay, that's enough for now.
Are we having fun yet?
