Tokens, I do declare!

There is a surprising amount of messy detail in all of this and it took me several days of hacking the code to get it about right. But okay, I'll try to get straight to the point. There is a new option available that you can put up top of a grammar file. If you put:

 REQUIRE_TOKEN_DECLARATION=true;

or actually (since the true is inferred) just:

 REQUIRE_TOKEN_DECLARATION;

at the top of a grammar file, it means that any tokens referenced as string literals must be explicitly declared.

Well, one could infer from the foregoing that the status quo ante on this was that this option did not exist, and thus, token declaration was not required. And you would be right.

On a certain level, it is (IMHO) very nice that in CongoCC, the following is a valid grammar that builds and runs.

 Foobar : "foo" "bar" ;

Really, it works. Just paste the above line into an empty file and save it as Foobar.ccc, and then, from the command line:

 java -jar congocc.jar Foobar.ccc

Voilà!

I was curious what the minimal legacy JavaCC file would be. It looks like:

PARSER_BEGIN(FoobarParser)
public class FoobarParser {}
PARSER_END(FoobarParser)

public void Foobar() : {}
{"foo" "bar" }

One can see that legacy JavaCC requires a lot more ceremony than CongoCC does. In CongoCC, there is no need for the PARSER_BEGIN...PARSER_END prologue since we have the convention that if the grammar file is Foobar.ccc, say, then the generated parser class is FoobarParser.java and so on. There is no need to write public or void in front of the Foobar production, since those are defaults that are inferred. No need for the empty parentheses after Foobar or the empty code block {} after the colon.

However, one concession to terseness that the legacy JavaCC tool does make is that it does allow you to implicitly declare the string literal tokens "foo" and "bar". Given their penchant for requiring you to declare things, you'd think they would require you to write:

 TOKEN : {<FOO : "foo"> | <BAR : "bar"> }

somewhere in the grammar file so as to declare the token types. But no, at least the minimal example above allows you to do without the declaration of the tokens.

Well, the astute reader will see where this is going. Now, in CongoCC, if you do set the new option REQUIRE_TOKEN_DECLARATION at the top of your grammar file, then yes, declaring the tokens is required. (Duhh.) So, suppose we prepend that line to our minimal Foobar.ccc file and now have:

 REQUIRE_TOKEN_DECLARATION;
 Foobar : "foo" "bar" ;

Then you can try to generate the parser with the same old:

 java -jar congocc.jar Foobar.ccc

But now you will get the error message:

Error: Foobar.ccc:3:10:String literal token "foo" is not declared in the DEFAULT lexical state.
Error: Foobar.ccc:3:16:String literal token "bar" is not declared in the DEFAULT lexical state.

You can get rid of those error messages by adding the following line to the file:

TOKEN : <FOO : "foo"> | <BAR : "bar"> ;

So, now our minimal grammar file has grown a bit. We have:

 REQUIRE_TOKEN_DECLARATION;
 TOKEN : <FOO : "foo"> | <BAR : "bar"> ;
 Foobar : "foo" "bar" ;

Of course, with this minimal example, one could ask why anybody would bother with this. Weren't we better off with the original one-liner? Well, the answer is that declaring the tokens does have practical advantages, but those advantages mostly manifest themselves with much larger grammars. If you are in some fast prototyping mode or just working up something fairly simple and small, you might as well not require token declaration. In particular, the ability to have shorter examples with little or no ceremony or scaffolding should be more approachable for people who are starting out with the tool.
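For readers who like to see the rule spelled out, here is a toy model in Python of what the option checks. To be clear, this is my own sketch and not how the tool is implemented internally; the function name undeclared_literals and its data layout are invented purely for illustration. The only thing taken from the real tool is the wording of the error message shown above.

```python
# Toy model of the REQUIRE_TOKEN_DECLARATION check.
# Hypothetical sketch: names and data layout are assumptions, not CongoCC code.
def undeclared_literals(declared, used, require_declaration=True):
    """Return error strings for string literals used without a declaration.

    declared: dict mapping lexical state name -> set of declared string literals
    used: list of (lexical_state, literal) pairs, in order of appearance
    """
    if not require_declaration:
        return []  # without the option, literals would simply be auto-declared
    errors = []
    for state, literal in used:
        if literal not in declared.get(state, set()):
            errors.append(
                f'String literal token "{literal}" is not declared '
                f"in the {state} lexical state."
            )
    return errors


used = [("DEFAULT", "foo"), ("DEFAULT", "bar")]
print(undeclared_literals({}, used))                            # two errors
print(undeclared_literals({"DEFAULT": {"foo", "bar"}}, used))   # no errors
```

Note that the lookup is per lexical state. That detail does not matter for the tiny example here, but it becomes important once more than one lexical state is in play.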

Advantages of Declaring Tokens

Now, for one thing, requiring tokens to be declared has the same general advantages as requiring people to externalize string literals. It gives you some protection against silly spelling mistakes. Consider some expansion that is supposed to be something like:

 "color" ":" <IDENTIFIER> ;

Suppose somebody wrote:

 "colour" ":" <IDENTIFIER> ;

(Of course, this is not even a misspelling strictly speaking. "Colour" is the favored (or favoured!) spelling in British English. And also in Canada, Australia, etc... But regardless...)

If we require tokens to be declared, we need some declaration like:

 TOKEN : <COLOR : "color"> ;

And then writing colour instead of color will be an error. And we'll get an error message something like:

Error: MyGrammar.ccc:3:10:String literal token "colour" is not declared in the DEFAULT lexical state.

But if token declaration is not required, then the tool simply auto-declares a new string literal token "colour". Or, to put it another way, it sort of sneakily inserts a virtual token production that would look like this:

TOKEN : <COLOUR : "colour"> ;

Now, if your intention was actually to accept either the USA or UK spelling, you could have:

TOKEN : <COLOR : "color" | "colour"> ;

This token type would actually have to be declared in any case because the auto-declaration of tokens in CongoCC only applies to string literals, and the above is no longer a straight string literal, but a choice between two different ones. It could also be written:

TOKEN : <COLOR : "colo" ("u")? "r"> ;

but either way, this is not a simple string literal any more, so it has to be declared in the lexical part of the grammar regardless.
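If you want to convince yourself that the two ways of writing the COLOR token accept exactly the same strings, a quick check with ordinary regular expressions will do. The sketch below uses Python's re module as a stand-in; the patterns are Python regex syntax, not CongoCC syntax, and the list of test words is just something I made up.

```python
import re

# Python regex stand-ins for the two token definitions (not CongoCC syntax).
choice = re.compile(r"color|colour")      # "color" | "colour"
optional_u = re.compile(r"colo(u)?r")     # "colo" ("u")? "r"

# Both patterns should agree on every candidate word.
for word in ["color", "colour", "colr", "colouur", "colors"]:
    matches_choice = bool(choice.fullmatch(word))
    matches_optional = bool(optional_u.fullmatch(word))
    assert matches_choice == matches_optional
    print(word, matches_choice)
```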

Admittedly, the protection against spelling mistakes could be overrated. The spelling of keywords in programming languages, such as if, else, and while, does not change very often and there is a very limited tendency to spell them wrong.

But there are other advantages to declaring tokens. For one thing, there are simply things you can do in a token declaration. For example, in CongoCC, you can specify the token subclass that is generated, so you can write:

TOKEN : <COLOR : "color"> #Color ;

This generates a Color subclass of Token, which is then used instead of plain Token. And this is actually a very powerful feature in conjunction with INJECT, as in:

INJECT Color : {
    ... injected code ...
}

Of course, none of that existed in legacy JavaCC. One thing that you could always do in a token production is have a token lexical action as in:

  TOKEN : <COLOR : "color" > { ...custom code hook...}

That always existed in legacy JavaCC as well as in CongoCC.

Also, requiring that the entire lexical grammar be formally declared has general advantages in terms of the code documenting itself. Within a certain range, this is a matter of taste. I tend to think that with larger grammars it is probably better to declare all the tokens, so I recently added REQUIRE_TOKEN_DECLARATION to the top of the various example grammars -- Java, Python, CSharp, Lua, Rust.

Niggling Details, the default lexical state

It's a funny thing. Back in legacy JavaCC days, life was actually simpler in some ways. For one thing, there was no INCLUDE directive. As such, your grammar was one physical file (sometimes many thousands of lines long) and any options declared up top were effectively global.

But once you can INCLUDE one grammar file from another one, it allows code reuse, but there is the problem that different options specified could conflict -- clobber one another, so to say. For a good while, I've been aware that this whole issue (or set of issues) is not handled all that well in CongoCC. I was aware of it, but never got around to setting aside the time to address it.

Well, again, let us take a concrete example of this. Suppose we want to create a little language but we intend to use the Java lexical grammar as a starting point. We're doing our own thang but we're perfectly happy with how Java defines a string literal, or an integer or floating point. Let's just reuse all that. No need to re-invent the wheel! So let's say we start prototyping our grammar, so let's say we have:

  INCLUDE JAVA_LEXER

  Foobar : "foo" "switch" "case" ;

The above is an example that builds and compiles. However, it is likely not to do what one intended. It is instructive to eyeball the generated code. You would likely think that in the above, we auto-declared the "foo" token, but we are picking up the "switch" and "case" from the included Java lexical grammar. But no, you can look at the generated FoobarParser.java and you will see that the lines in the Foobar() method look like:

  consumeToken(DEFAULT_SWITCH);

and:

  consumeToken(DEFAULT_CASE);

The "switch" and "case" tokens in the Java lexical grammar are labeled SWITCH and CASE respectively. But here we are consuming DEFAULT_SWITCH and DEFAULT_CASE. What happened is that we are in the DEFAULT lexical state (since we never said otherwise!) and the SWITCH and CASE token types are only defined in the JAVA lexical state. So the system auto-declared the "switch" and "case" tokens in the DEFAULT state. However, they need to be labeled differently. (We still have the issue that tokens all need to have globally unique labels.) So the tool auto-declares these token types as DEFAULT_SWITCH and DEFAULT_CASE, i.e. the "switch" and "case" in the DEFAULT lexical state.
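The labeling behavior described here can be mimicked with a few lines of Python. This is only a guess at the shape of the rule, written to reproduce the names mentioned in this post; the function auto_label is hypothetical, it only handles plain alphabetic keywords, and the real tool may well derive its labels differently.

```python
# Hypothetical model of auto-declared token labels (an assumption, not CongoCC code).
def auto_label(literal, state, taken_labels):
    """Pick a globally unique label for an auto-declared string literal.

    Assumed rule: use the upper-cased literal, and if that label is already
    taken (for example by an included grammar), prefix the lexical state name.
    """
    base = literal.upper()
    if base not in taken_labels:
        return base
    return f"{state}_{base}"


# Labels already declared by the included Java lexical grammar (subset).
java_labels = {"SWITCH", "CASE"}

for literal in ["foo", "switch", "case"]:
    print(literal, "->", auto_label(literal, "DEFAULT", java_labels))
```

The point of the sketch is simply that a label clash, combined with a different lexical state, yields a brand new token type rather than a reuse of the existing one.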

This is very unlikely to be the intention of the grammar's author in this case, but do understand that this is just what logically must happen. The included grammar declared that its default lexical state is JAVA, but the including grammar did not! So it's still DEFAULT. Most likely, to get the intended behavior, the grammar writer (maybe after some confusion!) would just write:

 DEFAULT_LEXICAL_STATE=JAVA;

up top of their grammar, so they now have:

 DEFAULT_LEXICAL_STATE=JAVA;

 INCLUDE JAVA_LEXER

 Foobar : "foo" "switch" "case" ;

In the above case, everything is in the JAVA lexical state, i.e. both the including and included grammar. So what happens is that we auto-declare a "foo" token type and the other two, "switch" and "case", are already defined in the included JavaLexer.ccc grammar.

Now, suppose we also added REQUIRE_TOKEN_DECLARATION to this grammar, so we have:

 DEFAULT_LEXICAL_STATE=JAVA;
 REQUIRE_TOKEN_DECLARATION;

 INCLUDE JAVA_LEXER

 Foobar : "foo" "switch" "case" ;

Then if we try to build the parser, we get the following error message:

Error: Foobar.ccc:6:10:String literal token "foo" is not declared in the JAVA lexical state.

The token types "switch" and "case" were of course declared in the included Java lexical grammar, but "foo" was not. We need to declare it. Why? Because we said up top that all tokens needed to be declared! Is this all starting to make sense?

Now, suppose that we actually did want to have our new grammar in a different lexical state. Well, here is one possibility:

   DEFAULT_LEXICAL_STATE=FOOBAR;
   REQUIRE_TOKEN_DECLARATION;

   INCLUDE JAVA_LEXER;

   TOKEN : <FOO : "foo"> ;

   Foobar : "foo" LEXICAL_STATE JAVA ("switch" "case") ;

The above works fine. The token declaration of FOO is in the default lexical state FOOBAR. The "switch" and "case" tokens used in the Foobar grammar production are stated clearly as being the JAVA lexical state, where they were already declared.

Another way that the final line could be written is:

   Foobar : JAVA : LEXICAL_STATE FOOBAR ("foo") "switch" "case" ;

We declare the Foobar production as a whole to be in the JAVA lexical state but we jump back to the default lexical state of FOOBAR to scan the "foo" token and then we go back to the JAVA lexical state to get "switch" and "case".
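To see that the two ways of writing the Foobar production come to the same thing, here is a small Python model that works out which lexical state each string literal ends up being scanned in. The nested list representation and the flatten function are my own invention for this illustration and have nothing to do with the tool's internals.

```python
# Toy model: resolve which lexical state each literal is scanned in.
# The data representation is an assumption made for illustration only.
def flatten(items, state):
    """Expand a production into (lexical_state, literal) pairs.

    items: list whose entries are either a string literal or a
    (state_name, sub_items) tuple meaning "scan sub_items in state_name".
    state: the lexical state in effect for the enclosing production.
    """
    pairs = []
    for item in items:
        if isinstance(item, tuple):
            inner_state, sub_items = item
            pairs.extend(flatten(sub_items, inner_state))
        else:
            pairs.append((state, item))
    return pairs


# Foobar : "foo" LEXICAL_STATE JAVA ("switch" "case") ;   (default state FOOBAR)
first = flatten(["foo", ("JAVA", ["switch", "case"])], "FOOBAR")

# Foobar : JAVA : LEXICAL_STATE FOOBAR ("foo") "switch" "case" ;
second = flatten([("FOOBAR", ["foo"]), "switch", "case"], "JAVA")

print(first)
print(first == second)
```

Either way, "foo" is looked up in FOOBAR while "switch" and "case" are looked up in JAVA, which is where each of them was declared.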

Well, I suppose there is no need to say much more about all this. At the moment, all of the above is working in the latest version of the tool. This has been the case for a day or two. Before that, the situation was somewhat broken. In fact, I was aware of that, and this is why I eschewed the use of string literals. Now that all of this is working properly, I decided that I much prefer to write (and read!):

IfStatement : "if" "(" Expression ")" Statement ["else" Statement] ;

rather than:

IfStatement : <IF> <LPAREN> Expression <RPAREN> Statement [ <ELSE> Statement] ;

Oh, and as a final point: yes, contextual tokens also need to be declared (if you set REQUIRE_TOKEN_DECLARATION up top). See here for an example.

I actually don't think that any of the changes outlined above should break any grammars out there, though I can't rule out the possibility. If this does happen, I think the fix should be quite obvious. Or, to put it differently, if something was working prior to this and now does not work, it would be because you were relying on extant buggy behavior that is now fixed. That was also the case with the (now fixed) code prologue glitch.
