ANN: Tokens can now be declared as “lazy”

The poster child for this feature would be something like the C-style multiline comment, i.e.

  /**
   * comment
   */

It is actually surprisingly difficult to specify this in legacy JavaCC (or JavaCC21/Congo up until very recently). One way is to break it into multiple regexp patterns using the MORE feature. Or, alternatively, there is actually a way of specifying this in a single regexp, that looks like this:

  <COMMENT : "/*" (~["*"])* "*" (~["*","/"] (~["*"])* "*" | "*")* "/">

I did not come up with that myself, by the way. I found it via a search on the internet. And, to be completely honest, I am hardly certain that I would have ever come up with that myself!

But, now that we can specify that a token is matched lazily, we can write this like so:

 <?COMMENT : "/*" ~([])* "*/ >

The ? in front of the name of the comment means that, once we match the pattern, we don't scan any more characters looking for something longer. Without the ? to specify this as lazy, the above would match something like:

 /* ... */ ... */

as a single comment. That is actually a rather hard problem to deal with, though again, it can be dealt with via the horrific regexp above.

The feature is really quite simple. There is only one gotcha that occurs to me. One can use an existing regex as a macro. So, suppose you wanted to use that COMMENT inside another, more complex regexp. Maybe:

<FOOBAR_COMMENT : ("foo")* <COMMENT> ("bar")* >

That would presumably allow you to write foofoofoo/* ... */barbar which is, admittedly, a rather contrived example. Well, actually, the above would not work, because the embedded <COMMENT> is effectively a macro and thus, the above is the same as writing:

<FOOBAR_COMMENT : ("foo")* "/*" (~[])* "*/" ("bar")* >

The fact that the COMMENT regexp is specified as lazy at the top level has no relevance in terms of using it as an embedded regexp. And if you decided to remedy that problem by writing:

<?FOOBAR_COMMENT> ("foo")* <COMMENT> ("bar")* >

that would still not work as intended. It would do okay on the opening ("foo")* but it would never match the ("bar")* part because, once it hits a final state, it just stops. So, if you had foofoo/*baz*/barbarbar it would only match up to the */ because it has hit a "final state" and is not going to scan any more characters!

But, as I say, this is a rather contrived example, and I don't know how many cases in the wild require something like this to work. I daresay it is rare and if you really had a need for this, it is not too hard to drop down into Java code to handle something weird like that.

Though I have been using the term NFA and NFA state when writing about all this, the truth is that it is more accurate to refer to this sort of regexp engine as a DFA engine. It is apparently impossible to implement localized lazy matching in a pure DFA engine. That is the real reason why these old-line UNIX tools like Lex and Flex don't have it. Generally speaking, you can't really have anything that requires backtracking. So there are a host of features in more feature-complete regex engines that we will never have -- at least short of re-implementing our regex functionality using an NFA engine. And frankly, it is hard to conceive of that ever being worth it.

Actually, the implementation of lazy tokenization on a token as a whole was quite low-hanging fruit, much like the context-sensitive tokenization described here.

Frequently, I find myself thinking that, in designing language syntax, we are effectively restricted to 7-bit ASCII and its limited set of operators/delimiters. Now that we support full unicode, would it not be very daring to dispense with that? We could use an emoji like 🏖️, a parasol on the beach, to represent "lazy" tokenization. But I tend to think the computer world is too stodgy for that.

So there you go.

Post Views: 584