Niggles with the New Up-To-Here Marker

Not too long ago, I implemented a new syntactic element in JavaCC that I call the up-to-here marker. To recap:

Suppose you have a production like this:

FooBar : "foo" "bar" Baz;

NB. I am moving towards always just using the newer streamlined syntax in any examples and documentation. For the record, if you really want, you can still write:

void FooBar() : {}
{ "foo" "bar" Baz() }

However, I am confident that most people will prefer the shorter way. (I certainly do!)

Anyway, you have the production above, and it so happens that, to decide whether you want to enter the production, you want to scan ahead two tokens, i.e. check whether the next two tokens in the stream are "foo" and "bar".

Suppose further that elsewhere (or possibly in more than one place) you have something like:

( FooBar )*

This means, of course, that you have a loop that consumes all the FooBars you scan until there aren't any more. Except, in this case, you want to scan ahead two tokens on each iteration. With the legacy JavaCC tool, you write:

 ( LOOKAHEAD(2) FooBar() )*

or alternatively, one could use a syntactic lookahead, i.e.

( LOOKAHEAD("foo" "bar") FooBar() )*

This says that we are going to look ahead two tokens to decide whether to stay in the loop or break out. If that is not specified, the generated code will check one token ahead, i.e. it will check whether the next token is "foo".
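
To make the difference concrete, here is a minimal sketch in Python of a loop that checks a fixed number of tokens before committing to another iteration. This is not the code JavaCC generates (that is Java); the `loop_iterations` helper, the token lists, and the assumption that Baz matches the single token "baz" are all hypothetical and exist only for illustration.

```python
def loop_iterations(tokens, lookahead):
    """Simulate ( FooBar )* where FooBar is "foo" "bar" Baz.

    `lookahead` is how many tokens the loop checks before committing to
    another iteration. Baz is assumed to be the single token "baz".
    Returns (completed iterations, position after the loop), or raises
    ValueError if the loop commits to FooBar and the parse then fails.
    """
    expected = ["foo", "bar", "baz"]
    pos = 0
    count = 0
    # Stay in the loop only while the next `lookahead` tokens match.
    while tokens[pos:pos + lookahead] == expected[:lookahead]:
        # Committed: from here on, any mismatch is a parse error.
        for want in expected:
            if pos >= len(tokens) or tokens[pos] != want:
                raise ValueError(f"expected {want!r} at token {pos}")
            pos += 1
        count += 1
    return count, pos


tokens = ["foo", "bar", "baz", "foo", "qux"]
print(loop_iterations(tokens, 2))  # two-token check: leaves the loop cleanly
```

With a two-token check, the loop stops before the trailing "foo" because it is not followed by "bar". With a one-token check, the same input commits to a second FooBar on seeing "foo" and then fails on "qux".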

The way to deal with this using the new up-to-here syntax is to write the production as:

 FooBar : "foo" "bar" =>|| Baz;

The up-to-here marker =>|| indicates that we scan up to this point to check whether to enter the production. In this case, we simply write the loop as before, with no lookahead annotation:

( FooBar )*

If the FooBar production already has the up-to-here specified, then we don't need to specify it when using the production.


Here is the first niggle. (I know that "niggle", like "niggardly", is a dangerous word, at least when spoken aloud, but we like to live dangerously around here, no?)

This isn't much of a niggle. It's simply that you might not always want to look ahead as specified in the production. Maybe in a certain case, you want to look ahead only one token. Or you want to look ahead more than those two tokens, so you should be able to override what is specified via the up-to-here syntax. Well, you can. (At least, in principle, you can. I've been going over these various cases and making sure they are implemented properly and will be putting in some regression tests.)

So, suppose you write:

 (SCAN 1 => FooBar)*

or equivalently:

 (SCAN "foo" => FooBar)*

then your loop will only check one token. Or alternatively, supposing that in a specific spot, you want to check ahead more tokens, like 3 or 4, you could specify that too. So, here is one rule:

If you specify an explicit lookahead, either numerical or syntactical, at a given choice point, that will override any up-to-here notation.

Well, that is obvious, I think, but it must be specified. Here is the next niggle:

An up-to-here in an expansion sequence overrides what is specified in nested productions

So, suppose you have a loop like:

( FooBar FooBaz =>|| FooBat )*

This, on its own, means that we scan up to the end of the FooBaz production when deciding whether to stay in the loop.

Now, we know that the FooBar production already contains its own up-to-here marker. However, that will be ignored in this spot because we have a higher-level up-to-here marker in the expansion that overrides it. If that were not present, we would use the up-to-here marker in FooBar and scan two tokens on each iteration of the loop. If there were no up-to-here marker at all, and no explicit lookahead (i.e. a SCAN instruction in the newer syntax), then the generated code would just use the default, which would be to look ahead one token, in this case, check for "foo".

If you use the terse syntax of:


this is considered to be an abbreviated way of writing:

(LOOKAHEAD(FooBar()) FooBar())*

Again, in either case, this is an explicit lookahead that overrides any lookahead specified in the FooBar production.


If an expansion at a choice point starts with a non-terminal that refers to a production that contains an up-to-here marker, that is used when generating the code. So, if we have two productions:

 FooBar : "foo" "bar" =>|| Baz;


FooBaz : "foo" "baz" Bat;

in a choice construct, say, like:

 ( FooBar | FooBaz )

the generated code will scan ahead past "foo" and "bar" to check whether to enter FooBar, and if not, will then enter FooBaz. If the up-to-here marker were not present in the FooBar production, then this would pretty much invariably be a bug: the default one-token check would see "foo" and always choose FooBar, so the FooBaz option would be unreachable. The same would apply in other cases, such as:

 ( FooBar )+ FooBaz

Just as in the previous case, FooBaz above is also unreachable IF there is no up-to-here marker in FooBar.
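
The reachability problem in the choice ( FooBar | FooBaz ) can be sketched in a few lines of Python. Again, this is only an illustration under stated assumptions, not JavaCC's generated code: the `choose` function and its `foobar_lookahead` parameter are hypothetical names, and the lookahead is modelled as a simple token count.

```python
def choose(tokens, foobar_lookahead):
    """Pick a branch of ( FooBar | FooBaz ) for the upcoming tokens.

    `foobar_lookahead` is how many tokens are checked before entering
    FooBar: 2 models the =>|| marker after "foo" "bar", 1 models the
    default when no marker is present.
    """
    foobar_prefix = ["foo", "bar"][:foobar_lookahead]
    if tokens[:foobar_lookahead] == foobar_prefix:
        return "FooBar"
    # Last alternative: entered on the default one-token check.
    if tokens[:1] == ["foo"]:
        return "FooBaz"
    return None


upcoming = ["foo", "baz", "bat"]
print(choose(upcoming, 2))  # marker present: FooBar is rejected, FooBaz is chosen
print(choose(upcoming, 1))  # no marker: "foo" alone selects FooBar
```

With a one-token check, every input that starts with "foo" selects FooBar, so no input can ever reach FooBaz.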

So, consider the following rules:

An up-to-here marker in an expansion at a choice point overrides any up-to-here marker specified in a production referenced by a non-terminal in the expansion.

An explicit syntactic or numerical lookahead overrides either of the above.
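
Taken together, the rules amount to a simple precedence order, which the following Python sketch spells out. It is a simplification under loud assumptions: the `ChoicePoint` class, its field names, and `effective_lookahead` are all hypothetical, and every kind of lookahead is reduced to a token count, whereas a real syntactic lookahead need not have a fixed length.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChoicePoint:
    # All fields are hypothetical and expressed as token counts.
    explicit_scan: Optional[int] = None      # SCAN/LOOKAHEAD written at the choice point
    expansion_marker: Optional[int] = None   # =>|| in the expansion at the choice point
    production_marker: Optional[int] = None  # =>|| inside the referenced production


def effective_lookahead(cp):
    """Return (source, tokens) for the lookahead that wins at this choice point."""
    if cp.explicit_scan is not None:
        return ("explicit", cp.explicit_scan)
    if cp.expansion_marker is not None:
        return ("expansion marker", cp.expansion_marker)
    if cp.production_marker is not None:
        return ("production marker", cp.production_marker)
    return ("default", 1)  # no annotation anywhere: check one token


# (SCAN 1 => FooBar)* where FooBar carries its own two-token marker
print(effective_lookahead(ChoicePoint(explicit_scan=1, production_marker=2)))
```

The order of the checks is the whole point: an explicit lookahead beats a marker in the expansion, which beats a marker in the referenced production, which beats the one-token default.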

I'll say in closing that I wrote this post largely to clarify in my own mind how it should work. As I write these lines, I am actually pretty sure that there are cases where it is not working correctly! What happened is that I was on another pass of cleaning up the code and started noticing these issues... these niggles...

It is largely working, but...

But I'm not going to bother to specify the cases I discovered recently where this is not working, because writing about it would be more work than just fixing it! I think it is safe to say that the above will be working properly within a few days at most.

(Since writing the above lines, I put in a certain amount of incremental work, and to the best of my knowledge, everything is working as described above. Also, the up-to-here construct is used internally in JavaCC development, so it should be pretty robust moving forward. If it stops working correctly, most likely the thing won't even build!)