All sorts of things you always wanted to know about tokenization but were afraid to ask (Part I)

Let's consider the multiline comment in Java (and C/C++/C#, among others) which, you surely know, looks like this:

  /*
   *  Comment text.
   */

This is an interesting construct. Paradoxically, it is extremely simple to describe in natural, human language (English or whatever) but shockingly difficult to express in CongoCC. Or, that was the case until very recently, when lazy tokenization was introduced.

As I said, the natural language explanation is absurdly simple: we recognize /* as starting the comment. Then we scan forward until we reach */, which closes the comment.

But I also said that this is very hard (or has been) to express with CongoCC (and its predecessors, JavaCC21 and the legacy JavaCC). So, I guess the first question is: why is it hard? For starters, why can't we just write:

<COMMENT : "/*" (~[])* "*/" > 

(Note that the ~[] is the way to express any character. The [ ] signifies the empty set, so given that ~ means negation, ~[] is thus all characters.)

So why can't we just write this to mean that we match /*, any characters, and then a closing */?

Well, you can write that, but it doesn't work. The reason it doesn't work is that the above regexp matches, for example, the following as a single comment, not as two comments with some other stuff in the middle.

 /* Some comment */
    some java code
 /* Another comment */   

That is because the loop inside the regexp, (~[])*, keeps matching past the first */ and gobbles up all the input until the very LAST occurrence of */ that it finds. In other words, the default greedy approach causes the "machine" to keep matching text that it is obviously (obvious to us humans, anyway) not supposed to match.
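To see the greediness problem in a form you can actually run, here is a small sketch using java.util.regex (the class name is mine, for illustration). Note that a backtracking regex engine like Java's offers a reluctant quantifier, .*?, which fixes the problem -- and that is precisely the luxury a generated, longest-match lexer does not give you:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    public static void main(String[] args) {
        String input = "/* Some comment */\n   some java code\n/* Another comment */";

        // Greedy: .* (with DOTALL so it crosses newlines) gobbles everything
        // up to the LAST */ in the input.
        Matcher greedy = Pattern.compile("/\\*.*\\*/", Pattern.DOTALL).matcher(input);
        greedy.find();
        System.out.println(greedy.group().contains("some java code")); // the code got swallowed

        // Reluctant: .*? stops at the FIRST */.
        Matcher reluctant = Pattern.compile("/\\*.*?\\*/", Pattern.DOTALL).matcher(input);
        reluctant.find();
        System.out.println(reluctant.group());
    }
}
```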

A funny thing about this is that the above regexp does work if the comment being scanned is the only such comment in the file. Even then there is a pretty major caveat: it will always scan to the very end of the file anyway -- maybe not very far if you're lucky, but possibly thousands of lines later. It has to scan to the end of the file to see that there is no later */. So, even in the one case where it works, there is the potential for a catastrophic degradation in performance!

I say the above as a little detail, but the more important fact is that it just doesn't work in the general case, because there is no reason to believe that this is the only comment in the file!

Anyway, that is the problem at hand, and, leaving aside the aforementioned lazy tokens feature that was only introduced in the last few days, it has been surprisingly difficult to deal with. There are actually three different solutions available.

  1. Create a separate lexical state.
  2. Write a Java code action.
  3. Come up with a surprisingly complex regexp.

The original JavaCC package, released in 1996, came with a sample Java grammar (for a very old, primitive version of the language obviously) and it handled this using the first approach. That is as follows:

MORE : < START_COMMENT : "/*" > : IN_COMMENT ;

<IN_COMMENT> MORE : < ANY_CHAR: ~[] > ;

<IN_COMMENT> UNPARSED : < COMMENT : "*/" > : DEFAULT ;

The key mechanism here is that this defines a little lexical state called IN_COMMENT to handle the state you are in when inside a /*...*/ comment. Granted, it is a very simple lexical state that only defines two patterns, ANY_CHAR and COMMENT. Note that two of the patterns above are specified as MORE. What MORE means is that we don't roll up a token from matching the pattern; rather, we accumulate the matched text and tack it onto the next actual token we create. In this state, that can only happen when we encounter the terminating */ pattern. When we hit that, all the accumulated MORE text is prepended to the token, so the resulting COMMENT token ends up containing the START_COMMENT text (i.e. /*) and all the subsequent individual characters that were matched as ANY_CHAR. By the way, the terminating */ is matched as COMMENT, not as two successive ANY_CHAR, because, given the choice between the two patterns, the lexing machinery prefers to match two characters rather than one: the longer match wins. That is how we get out of the loop!
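If it helps to see the MORE mechanism outside of grammar syntax, here is a hypothetical hand-rolled sketch (plain Java, names mine) of what the IN_COMMENT state effectively does: accumulate single characters until the two-character */ pattern wins as the longer match.

```java
public class MoreStateDemo {
    // Hand-rolled analogue of the IN_COMMENT state. 'pending' plays the role
    // of the accumulated MORE text; returning the string is "rolling up" the token.
    static String scanComment(String input, int start) {
        StringBuilder pending = new StringBuilder("/*"); // START_COMMENT text, kept as MORE
        int pos = start + 2;
        while (pos < input.length()) {
            // Longest match wins: try the two-character COMMENT pattern before ANY_CHAR.
            if (input.charAt(pos) == '*' && pos + 1 < input.length()
                    && input.charAt(pos + 1) == '/') {
                return pending.append("*/").toString(); // roll up the COMMENT token
            }
            pending.append(input.charAt(pos++)); // ANY_CHAR, accumulated as MORE
        }
        return null; // unclosed comment
    }

    public static void main(String[] args) {
        // Stops at the first */, exactly as the lexical-state solution does.
        System.out.println(scanComment("/* first */ code /* second */", 0));
    }
}
```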

Well, the above certainly works, and furthermore, if you are an aspiring CongoCC expert, it is well worth studying the example and thoroughly understanding it. But, that said, is this not a surprisingly intricate solution to such a simple (or seemingly simple) problem?

In the interests of contrast, consider the second way of dealing with this, a Java code action:

<COMMENT : "/*" > {
    int pos = matchedToken.getEndOffset();
    try { 
        while (charAt(pos) != '*' || charAt(pos+1) != '/') ++pos;
        matchedToken.setEndOffset(pos+2);
    } catch (IndexOutOfBoundsException e) {
        // Unclosed comment: mark the token invalid and consume to the end of the input.
        matchedToken.setType(INVALID);
        matchedToken.setUnparsed(false); 
        matchedToken.setEndOffset(length());
    }
} 

The above Java code action is only a few lines long, but I should tell you that it only works with relatively recent versions of CongoCC. It makes use of the fact that the TokenSource object, which the lexer inherits from, now implements java.lang.CharSequence. That gives us the charAt() method with the usual semantics, so we can be confident that if we scan forward and walk off the end of the buffer, we will hit an IndexOutOfBoundsException if the comment is unclosed. The reason the token is then marked as parsed (i.e. not unparsed) is so that the parsing machinery receives it as the next token in the stream. Of course, it is not a type the parser expects (that it is of type INVALID should be the first clue!), so it hits the regular parser error handling machinery.

But when all is well, the code action scans forward, finds the closing */ sequence, and then simply extends the token (that is the line matchedToken.setEndOffset(pos+2);) to include everything up to and including the closing */. The reason that works is that the tokenizing machinery takes the endOffset of the preceding token as the point at which to pick up its scanning. So the next token in the stream will be the one that results from starting to scan at the (non-inclusive) endOffset of the previously vended token.

Still, there is at least one very strong argument against such a Java code action. What if you are generating code in some other language -- e.g. C# or Python? The advantage of avoiding the code action is that your lexer should work just as well regardless of the target language you are generating code in.

The third approach is to come up with a regexp that actually handles this. This one:

< COMMENT : "/*" (~["*"])* "*" (~["*","/"] (~["*"])* "*" | "*")* "/" >

Yes, the above is a single regexp that handles this C style comment construct. Easy peasy, huh? Well, I will be honest and admit that I did not write the above regexp myself. I googled it. In fact, I'll further admit that I am not at all certain that I would have ever come up with that regexp myself. Maybe... maybe not... I'm not even 100% certain that I would come up with the lexical state solution using MORE. Maybe... maybe not... Well, I think the truth of the matter is that the solution I would be most likely to come up with on my own would be the above code action.
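If you want to convince yourself that this pattern really does stop at the first */, it happens to translate directly into java.util.regex (it uses no lexer-specific features), so a quick sanity check is easy. The class name here is just for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentRegexDemo {
    public static void main(String[] args) {
        // java.util.regex transliteration of the CongoCC pattern:
        // "/*" (~["*"])* "*" (~["*","/"] (~["*"])* "*" | "*")* "/"
        Pattern comment = Pattern.compile("/\\*[^*]*\\*(?:[^*/][^*]*\\*|\\*)*/");
        String input = "/* first */ int x = 1; /* second */";
        Matcher m = comment.matcher(input);
        while (m.find()) {
            System.out.println(m.group()); // each comment is matched separately
        }
    }
}
```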

This approach has the advantage that it should work regardless of what your target language is. Also, it fits on a single line, unlike the first solution. But really, maybe the main advantage it has is in terms of impressing your colleagues. (Or better yet, your boss.)

Wow, dude! Did you write that yourself?

Yeah, man. It was a bit tricky but I finally nailed it.

You've really been burning the midnight oil, eh?

By the way, if you are wondering about the performance characteristics of the above solutions, I was too, so I wrote a little benchmark, having it lex a bunch of files with very long C-style comments -- many thousands of lines long. It turns out that the gnarly regexp is about twice as fast as the solution based on a separate lexical state. However, the Java code action is about twice as fast again as the gnarly regexp. Of course, it doesn't matter anyway: scanning to the end of multiline comments is not going to be the performance bottleneck in your app. In fact, to quantify the performance difference at all, I had to write a very contrived benchmark. I did that because I was curious.

It stands to reason that the Java code action is the fastest. It just directly scans to the next occurrence of */. The slowest solution is the one with a separate lexical state, because it repeatedly invokes the tokenization machinery to identify each lone character in the comment. The gnarly regexp has less overhead, but the pure bloody-minded Java code is even faster.
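For the curious, a contrived micro-benchmark in this spirit is easy to sketch (this is not the actual harness described above; the input size and names are mine). It pits the direct scan from the code action against the gnarly regexp on one very long comment:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentScanBench {
    static final Pattern GNARLY =
        Pattern.compile("/\\*[^*]*\\*(?:[^*/][^*]*\\*|\\*)*/");

    // Direct scan, mirroring the code action: return the end of the
    // comment that starts at 'start'.
    static int scanEnd(String s, int start) {
        int pos = start + 2;
        while (s.charAt(pos) != '*' || s.charAt(pos + 1) != '/') ++pos;
        return pos + 2;
    }

    public static void main(String[] args) {
        // One very long comment, as in the kind of benchmark described above.
        StringBuilder sb = new StringBuilder("/*\n");
        for (int i = 0; i < 100_000; i++) sb.append(" * line ").append(i).append('\n');
        sb.append(" */");
        String input = sb.toString();

        long t0 = System.nanoTime();
        int directEnd = scanEnd(input, 0);
        long t1 = System.nanoTime();
        Matcher m = GNARLY.matcher(input);
        m.find();
        long t2 = System.nanoTime();

        System.out.println(directEnd == m.end()); // both agree on where the comment ends
        System.out.printf("direct: %d us, regex: %d us%n", (t1 - t0) / 1000, (t2 - t1) / 1000);
    }
}
```

The timing line will vary from machine to machine, but the direct scan consistently comes out ahead, for the reason given above: it does nothing per character beyond a comparison.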

I mention this, but I am hardly recommending using a Java code action in this spot. I offer it as a conversation piece, basically. And even though I don't recommend it, it would be well worth your effort to study the example and understand it thoroughly.

Actually, the solution I recommend is to use the new lazy tokens feature and write it this way:

<?COMMENT : "/*" (~[])* "*/" >

(At the exact moment of writing, the above only worked for Java code generation, but that has since been addressed.) So this lazy tokens way of specifying things should supersede the other approaches above.