The TERMINATING_STRING setting, a new (and quite minor!) feature

Some days ago, I added a new setting. If, at the top of your grammar, you write:

 TERMINATING_STRING="some string";

this means that the input you're parsing is guaranteed to end with that string. If the input already ends with that string, the setting does nothing. Otherwise, it tacks that string onto the end. In practice, the most typical terminating string is surely a single end-of-line character, and the option to tack on an EOL (assuming it's not already there) already existed.

So this is just a generalization of the existing ENSURE_FINAL_EOL setting, which tacks on a closing end-of-line character if it is not present. Actually, with the TERMINATING_STRING setting, ENSURE_FINAL_EOL is now superfluous, since it can be expressed as:

TERMINATING_STRING="\n";

And that is actually how ENSURE_FINAL_EOL is implemented internally.

Even though it is superfluous, I decided to keep the ENSURE_FINAL_EOL setting. For one thing, it is already there. Also, most of the time that one wants to specify a terminating sequence, it is the end-of-line character! Looking back, I'm pretty sure that I implemented ENSURE_FINAL_EOL when I wrote the preprocessor. I realized that a whole bunch of line-oriented grammars tend to run into this nitpicking little problem: the basic logic fails at the end unless you can be sure the file ends with a newline. The whole thing is surprisingly annoying.

Well, as you can see, the solution to this problem is absurdly simple. You just tack on the newline character at the end if it's not there. And that is what ENSURE_FINAL_EOL does. And generalizing this to any arbitrary string is just as simple!
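To make that concrete, here is a minimal sketch in Java of the behavior described above. The class and method names are made up for illustration, and this is not CongoCC's actual code, just the logic as I described it: do nothing if the input already ends with the string, otherwise append it.

```java
final class TerminatingStringSketch {
    // Returns the input unchanged if it already ends with the terminator;
    // otherwise returns the input with the terminator appended.
    static String ensureTerminated(String input, String terminator) {
        return input.endsWith(terminator) ? input : input + terminator;
    }

    public static void main(String[] args) {
        // The ENSURE_FINAL_EOL case is just the terminator "\n".
        System.out.println(ensureTerminated("int x;", "\n").equals("int x;\n"));   // true
        System.out.println(ensureTerminated("int x;\n", "\n").equals("int x;\n")); // true
    }
}
```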

For a good while, I thought that this was a good example of a general problem in software development. The need for the final newline at the end of the file, and hence this (quite trivial) configuration setting, is something that becomes apparent in actual usage of the tool. When, in addition to developing the tool, you actually use the tool yourself, you encounter these sorts of little rough edges and tend to round them out -- particularly something as trivial as this.

However, I recently noticed that the legacy JavaCC community seems to be aware of this particular glitch. At least it's in their FAQ. See here: "Why do the example Java and C++ token managers report an error when the last line of a file is a single line comment?" That question has been in their FAQ since at least 2007. (I was somewhat curious about this, so I looked into it.)

The answer to the question starts with:

The file is likely missing a newline character (or the equivalent) at the end of the last line.

The answer then goes on to explain (quite accurately) that this problem tends to show up when you define a token using MORE, and it gives the example of Java/C++ single-line comments.

You see, consider the way that the single-line comment is defined in the Java grammar used in CongoCC:

MORE :
  <SINGLE_LINE_COMMENT_START: "//"> : IN_SINGLE_LINE_COMMENT
;

<IN_SINGLE_LINE_COMMENT>
  UNPARSED #Comment :
    <SINGLE_LINE_COMMENT: "\n" | "\r" | "\r\n" > #SingleLineComment : JAVA
;

<IN_SINGLE_LINE_COMMENT, MULTILINE_COMMENT>
   MORE : < ANY_CHAR : ~[] >
;

This definition of a single line comment, I think, has been in the Java grammar from the very beginning, and I never really altered it. It is perfectly okay, except that it relies on finding the terminating newline. It encounters the // and then enters the lexical state IN_SINGLE_LINE_COMMENT and then it consumes any non-newline character until it hits the newline and then returns the SINGLE_LINE_COMMENT token.

However, this definition of the single-line comment does not work if you have a single-line comment at the very end of a file and the file does not have a newline at the end. So, there are two basic solutions to that problem. You can use the ENSURE_FINAL_EOL setting, but that only exists in CongoCC (and JavaCC21). Or you can rewrite the token's definition without the MORE, i.e.

  UNPARSED : <SINGLE_LINE_COMMENT: "//" (~["\n","\r"])*> ;

(The above is actually taken from the Java grammar that is part of the JavaParser project. Well, they use the legacy JavaCC, so, using the older syntax, it is actually:

SPECIAL_TOKEN :
{
     <SINGLE_LINE_COMMENT: "//" (~["\n","\r"])*>
}

You can see that here.)

In any case, that definition of SINGLE_LINE_COMMENT is more concise and also does not have the problem that it potentially fails at the end of the file if there is no final newline character. In fact, I use a similar (not exactly the same) definition in the CSharp grammar. It's written as:

  < SINGLE_LINE_COMMENT : "//" (~["\n"])* "\n" >

It doesn't bother with any \r because the grammar leaves PRESERVE_LINE_ENDINGS at its default of false, so any CR-LF is converted to LF when the file is read in. Aside from that, the definition includes the trailing newline. But, of course, it can include the trailing newline because the grammar has ENSURE_FINAL_EOL up top to make sure that there is always a newline at the end, even if this is the last line of the file!
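Here is a rough way to see the difference between the two definitions, sketched in plain Java. I'm using java.util.regex patterns as stand-ins for the two token definitions (that is my own approximation, not how the generated lexer works), and the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentAtEofSketch {
    // Stand-in for < SINGLE_LINE_COMMENT : "//" (~["\n"])* "\n" >: the newline is required.
    static final Pattern WITH_EOL = Pattern.compile("//[^\\n]*\\n");
    // Stand-in for <SINGLE_LINE_COMMENT: "//" (~["\n","\r"])*>: no newline is required.
    static final Pattern WITHOUT_EOL = Pattern.compile("//[^\\n\\r]*");

    // Tries to match a comment starting at the first "//" in the input.
    // Returns the matched text, or null if there is no "//" or no match.
    static String matchComment(Pattern pattern, String input) {
        int start = input.indexOf("//");
        if (start < 0) return null;
        Matcher m = pattern.matcher(input).region(start, input.length());
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = "public class A {} // A comment."; // no EOL at the end
        System.out.println(matchComment(WITH_EOL, source));                // null
        System.out.println(matchComment(WITHOUT_EOL, source));             // // A comment.
        System.out.println(matchComment(WITH_EOL, source + "\n") != null); // true
    }
}
```

The first pattern only succeeds once the newline has been tacked on, which is exactly the job that ENSURE_FINAL_EOL does.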

(You might wonder why I express this construct differently in the Java and the CSharp grammars. That's actually deliberate. Sometimes I write the same construct differently because it improves the test coverage a bit. For most end users this would make little sense. It would be at least a bit better to express the same thing consistently, but my needs, developing the tool itself, are a bit different.)

Anyway, let's get back to the JavaCC FAQ's treatment of this issue, i.e. the question of why we have this problem when there is a single-line comment at the very end of a file.

Well, there is a rather nitpicking matter in this that should be settled. Or so it seems.

Does a single-line comment include the terminating end-of-line character?

Or, more concretely, should we think of:

  // This is a comment.

as being the string "This is a comment.\n" or as the string "This is a comment." followed by ignorable whitespace? Granted, it doesn't matter on any practical level, but if you're going to specify something down to the last character in some specification document, then I guess you should clarify this, no? So again, does the comment include the final newline or not?

Well, I was curious enough to look at the Java language specification. It appears that the newline at the end is not part of the comment. See here. I was a bit surprised by that. I just tended to think that the final newline at the end of a single-line comment would be part of the comment. That seems a bit more intuitive to me. But, actually, I think they may have specified it this way precisely because of this problem of the final line! Aside from eyeballing the holy writ, I also tested a little Java source file consisting of one line:

 public class A {} // A comment.

And note that there is no EOL at the end. The Java compiler javac compiles the above file with no complaint. The JavaCC FAQ seems to be saying that a strict Java compiler should complain about the above, because the comment should terminate with an EOL character. But the Java compiler that is part of the JDK does not complain about this. Moreover, contrary to what the JavaCC FAQ states, the terminating newline is not obligatory. (That is my reading of it anyway...) However, the Java parser inside of the legacy JavaCC will fail to parse the above single-line file because it needs there to be a newline at the end of the file to terminate the comment.

The JavaCC FAQ adds:

Both the Java and the C++ standards agree with the example .jj files, but some compilers are more liberal and do not insist on that final newline.

Well, I have no idea where these people got that from. It seems to be incorrect. (N.B. I have not checked the C++ standard. Maybe that does insist on the final newline in this spot.)

What is striking about all of this is that they have this FAQ and that item in the FAQ dates back at least 15 years, and it is so trivial to address the problem. So trivial, in fact, that it really ought to be easier to just fix this than to write anything about it! You just tack the newline at the end if it's not there. As entangled and poorly structured as the legacy JavaCC code is (I know whereof I speak) that is probably still easy enough. Well, what to say about this... I mean, look, that whole scene is so dominated by a certain kind of... wankerdom... that... Okay, I know I'm not supposed to say that, but the situation is quite infuriating. We're quite literally talking about a project that cannot fix something when the fix is a single-character edit! How these people can have the sheer audacity to put up all these web pages representing that they have an actively developed project...

Well, never mind. To get back on topic, there is a more general point about all this. I think that MORE is actually a pretty nice feature and there are surely some cases where it provides the most elegant, readable solution, in terms of expressing something in the lexical grammar. However, there is this problem of the final line of the file, which is really due to the fact that a MORE specification does need a clear, reliable termination. If you're just consuming MORE characters and then hit an EOF, you're going to have a problem.

I became more intensely aware of all this, by the way, when I started writing a PHP grammar (very far from finished) and saw that, at the very top level, lexically, a PHP file contains just plain text, interspersed with actual code that is always delimited as follows:

 <? ... ?>

You see, in its origins, PHP is basically a template language (a topic I do know something about) and a template language has the basic feature that just plain text is a valid "program" in the language. So, in PHP, or in FreeMarker, the text:

 Hello, World!

is a valid PHP program or FreeMarker template and the effect is just to echo the text in the file. Of course, any interesting example of PHP (or FreeMarker) contains more than just plain text. So, a text fragment (or snippet or whatever) ends when we hit some sort of escape sequence that causes us to leave this plain text echoing mode. In PHP, that is <?. Or a text block can simply end with an EOF. But here are a couple of points. First of all, one way to express a text block ending when it encounters some actual PHP code would be something like:

 <IN_TEXT> MORE :  <ANY_CHAR : ~[]> ;
 <IN_TEXT> UNPARSED : <TEXT : "<?"> : PHP ;

It just gobbles any character until it hits the special sequence <? and then goes into the PHP lexical mode. However, you have the problem now that the TEXT token will include those initial characters <? which is not quite right. Well, one way of dealing with this in CongoCC would be to write the second line above as:

 <IN_TEXT> UNPARSED : <TEXT : "<?"> {matchedToken.truncate(2);} : PHP ;

Actually, I added the truncate() method very recently. Before then, it would have to be:

 {matchedToken.setEndOffset(matchedToken.getEndOffset()-2);}

But that is ugly enough that it is worth having a shorthand for that, hence the truncate() method. In any case, after the truncation, on the next iteration of tokenization, the machinery starts the next token at the very end of the previous one, which is now (after the truncation) right before the pointy bracket.
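To illustrate the offset arithmetic, here is a small self-contained sketch in Java. The Span class below is a hypothetical stand-in for a token (it is not CongoCC's Token class); all it shows is that truncating by n pulls the end offset back by n, so the next token begins at the delimiter.

```java
final class TruncateSketch {
    // Hypothetical stand-in for a token: a half-open [beginOffset, endOffset) span.
    static final class Span {
        final String input;
        final int beginOffset;
        int endOffset;

        Span(String input, int beginOffset, int endOffset) {
            this.input = input;
            this.beginOffset = beginOffset;
            this.endOffset = endOffset;
        }

        // Shorthand for moving the end offset back by the given amount.
        void truncate(int amount) {
            endOffset -= amount;
        }

        // The text currently covered by the span.
        String image() {
            return input.substring(beginOffset, endOffset);
        }
    }

    public static void main(String[] args) {
        String input = "Hello <?echo 1;?>";
        // The raw match runs up to and including the "<?".
        Span text = new Span(input, 0, input.indexOf("<?") + 2);
        text.truncate(2);
        System.out.println(text.image().equals("Hello "));       // true
        System.out.println(input.startsWith("<?", text.endOffset)); // true: next token starts at "<?"
    }
}
```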

Now the other problem is that the TEXT might not end with <?. It could just end with the end of the file, i.e. there is no <? acting as a marker for the end.

This is where the TERMINATING_STRING setting can come in. One solution to this problem would be to just ensure that the input ends with a certain control character. Traditionally, what is used for this is the CTRL-Z, i.e. \u001A. So, up top, we have:

   TERMINATING_STRING="\u001A";

Then we can have:

  <IN_TEXT> UNPARSED : <TEXT : "<?" | "\u001A">
  {
        // Drop the two-character "<?" or the one-character CTRL-Z from the end of TEXT.
        if (matchedToken.getImage().endsWith("<?")) matchedToken.truncate(2);
        else matchedToken.truncate(1);
  } : PHP ;

Of course, the PHP lexical state has to deal with that final CTRL-Z character that is pending, so we could have:

 <PHP> SKIP : <CTRL_Z : "\u001a"> ;

So then it encounters the CTRL_Z, skips it, and then what follows is EOF and everything gets squared away. Or, I guess another possibility would be not to skip the CTRL-Z but to treat it as an end-of-file marker. You could have:

 <PHP> TOKEN : <CTRL_Z : "\u001a"> {matchedToken.setType(EOF);} ;

And that should work about as well. Though the difference would be that if you encountered a CTRL-Z at any point in the file, that would be taken as an end-of-file. That might make sense, but it also might not be what you want.
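The idea of scanning up to a guaranteed terminator can be sketched outside of the grammar as well. The following Java snippet is only an illustration under my own assumptions (the names are invented, and it is a hand-written loop rather than generated lexer code): because the input is guaranteed to end with CTRL-Z, the scanning loop never needs a separate end-of-input check.

```java
final class TextBlockSketch {
    static final char CTRL_Z = (char) 0x1A;

    // Returns the plain text that precedes the first "<?" or the CTRL-Z terminator.
    static String leadingText(String input) {
        // Same effect as TERMINATING_STRING="\u001A": append CTRL-Z unless already present.
        String s = input.endsWith(String.valueOf(CTRL_Z)) ? input : input + CTRL_Z;
        int i = 0;
        // No bounds check is needed: the terminator always stops the loop.
        while (s.charAt(i) != CTRL_Z && !s.startsWith("<?", i)) {
            i++;
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(leadingText("Hello, World!"));        // Hello, World!
        System.out.println(leadingText("Hi <?echo 1;?>").length()); // 3
    }
}
```

Note that this sketch has the same property as the variant that treats CTRL-Z as an end-of-file marker: a CTRL-Z anywhere in the text ends the block.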

In any case, this key aspect of using MORE, that there must be a clear terminating character (or sequence) to end it, could be dealt with by simply ensuring that any input ends with a certain string and using that to match the end of the token type.

And, generally speaking, this boundary problem of the last element or line in a file seems very real. For example, one could think of a format like Markdown as being, at least at the outermost level, a sequence of paragraph-level elements that all terminate with two (or more) newline characters, i.e. each paragraph has to end with a blank line, which amounts to two newlines. So, again, it could be useful to specify a double-newline as the terminator, i.e.

 TERMINATING_STRING="\n\n";

That, rather than just a single newline. In any case, the TERMINATING_STRING setting seemed well motivated enough to spend 5 minutes implementing it -- though, as you can imagine, Dear Reader, I spent far more time writing the above text. But I thought to clarify various aspects of all this.
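For the Markdown case, here is a minimal Java sketch of why the double-newline terminator helps. It is purely illustrative (the names are hypothetical and it has nothing to do with a real Markdown parser): once the input is guaranteed to end with "\n\n", every paragraph, including the last one, can be found by the same rule.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphSketch {
    // Splits the input into paragraphs, each terminated by two (or more) newlines.
    static List<String> paragraphs(String input) {
        // Same effect as TERMINATING_STRING="\n\n".
        String s = input.endsWith("\n\n") ? input : input + "\n\n";
        List<String> result = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            int end = s.indexOf("\n\n", pos);
            if (end < 0) break; // defensive guard; the terminator should always be found
            if (end > pos) result.add(s.substring(pos, end));
            pos = end + 2;
            // Skip any extra newlines beyond the two that ended the paragraph.
            while (pos < s.length() && s.charAt(pos) == '\n') pos++;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(paragraphs("First.\n\nSecond.")); // [First., Second.]
    }
}
```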