Developing custom error messages in JavaCC 21


I am wondering what facilities there are in JavaCC 21 to generate error messages that are more informative than the defaults that were available in the legacy JavaCC. The default of just dumping all possible acceptable tokens does not work very well.

I would happily add lots of error messages to appropriate places in the grammar if I could, but I could never figure out a way to do it. I suspect that part of the problem with the legacy JavaCC was that there were options in the code base that just were not documented.



Hi Tom, nice to see you here!

Well, let’s see… I think that the issue of generating decent error messages is very important and actually, that is probably one of the key reasons that people end up writing parsers by hand instead of using parser generators!

The current state of affairs in JavaCC 21 is not so great on this front. I have been thinking about how to deal with this sort of thing systematically, but it is not really resolved by any means. Still, I would say that if you are aware of the issues and conscientious about it, you can have it generate better error messages in key spots, but it still is problematic.

Well, to show you what I mean more concretely, consider this point in the included Java grammar that expresses the try statement. Prior to JDK 7, this was very simple, but now there is the try-with-resources construct as well as the “classic” try statement. So, as you see, I express it like this:

TryStatement :
   "try" FAIL "Expecting '{' or '(' after 'try'"

Well, it’s written a lot more elegantly than it would be in legacy JavaCC, because I hide the lookaheads in the corresponding statements using the up-to-here syntax. You can see that on lines 1043 and 1057 respectively. Basically, to enter the “classic” try statement, you need to look ahead two tokens for “try” followed by “{” and for the newer “try with resources” statement (since JDK 7) you need to scan ahead for “try” followed by “(”.

So here is a problem. If the above construct were written (and I think I did write it this way originally) something like:

  SCAN "try" "(" => TryWithResources

then it scans ahead for try ( and if it doesn’t match, then it looks for try {. (BTW, it’s actually a given that the next token is try because that’s why we’re in the TryStatement production to start with!) But, now suppose the token after try is neither a ( nor a {, well, the logic is that it rejects the TryWithResources and then enters the ClassicTryStatement, which will then, based on its own self-contained logic, as it were, complain that it was expecting a { token after the try.

This is actually wrong, because the error message should say that it was expecting either a { OR a ( token! You see the problem?

You see, it tries one and then if that doesn’t work, it defaults to the other one (which is how 99% of programmers would write this…) but if that hits an error because the next token is not a { either, it just generates a message saying it was expecting the {.
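
To make the difference concrete, here is a minimal hand-written Java sketch of the two dispatch strategies. It is not what JavaCC 21 generates. The class and method names (TryChoiceDemo, fallbackStyle, explicitFail) are made up for illustration, and tokens are modeled as plain strings.

```java
import java.util.List;

// Hypothetical stand-in for parser dispatch logic; not actual JavaCC 21 output.
class TryChoiceDemo {

    // Token after "try", or "<EOF>" if the input ends there.
    static String tokenAfterTry(List<String> tokens) {
        return tokens.size() > 1 ? tokens.get(1) : "<EOF>";
    }

    // "Try one choice, otherwise default to the other."
    static String fallbackStyle(List<String> tokens) {
        String next = tokenAfterTry(tokens);
        if (next.equals("(")) {
            return "TryWithResources";
        }
        // Default branch: the classic try statement only knows about its own '{'.
        if (next.equals("{")) {
            return "ClassicTryStatement";
        }
        return "Expecting '{' after 'try', found '" + next + "'";
    }

    // Same dispatch, plus an explicit catch-all that names both alternatives.
    static String explicitFail(List<String> tokens) {
        String next = tokenAfterTry(tokens);
        if (next.equals("(")) {
            return "TryWithResources";
        }
        if (next.equals("{")) {
            return "ClassicTryStatement";
        }
        return "Expecting '{' or '(' after 'try', found '" + next + "'";
    }
}
```

For the input try ; the first method only mentions '{', while the second mentions both '{' and '(', which is the message the user actually needs.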

And, actually, that is bad enough, but there are cases that are worse. For example, if you try to improve things by writing:

   SCAN "try" "(" => TryWithResources
   SCAN "try" "{" => ClassicTryStatement

and the token after try is wrong (neither ( nor {) then it will generate an error message (and this really MUST be fixed at some point soon) that says that it was expecting a try!

Except that message is completely wrong because there was a try token! The issue is the token after that! The problematic point in the code generation is actually here and you can see that it is some boilerplate code for generating the error message based on what starting tokens were permissible at that point in the parse.

Well, you see, the real problem here is that this sort of code generates a reasonable enough error message IF your grammar really is LL(1) in that spot, i.e. all choices can be resolved with a single token of lookahead and nothing more. For example, if you have:


and, let’s say, they start with “if”, “for”, and “while” respectively, then a failure to match any of those 3 rules is precisely the same thing as the next token not being “if”, “for”, or “while”. So it will generate a reasonable error message.

But once you have a situation where entering a “production” is conditioned in a more complex way, then this error message generation will frequently generate messages that don’t even make any sense!

For example, suppose you use “semantic lookahead” (wrongly named, but that’s what they’ve always called it) as follows:

  SCAN 1 {extraCondition()} => IfStatement

So, in the above, to enter the IfStatement, the next token must be “if” AND the extra condition (whatever it is…) must be satisfied.

So, in that case, the automatically generated error message is very likely to be totally spurious. It fails to match any of the three choices, but the reason is not because there was no “if”, “for”, or “while”. It’s because there was an “if”, but the additional condition was not satisfied, right?

Well, I’m aware of all this and, for now, well, I think you can put in extra code that handles these things, so in the above, you could have:

    SCAN 1 {extraCondition()} => IfStatement
    "if" FAIL "The extra condition was not satisfied."

So, as things stand, if you are aware of these issues, and you know that the default error message is going to be pretty misleading, you can deal with it.
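
The following Java sketch imitates that situation by hand. It is only an illustration under assumptions: the extra condition is passed in as a boolean, the names ForStatement and WhileStatement are guesses for the other two choices, and the message format is invented rather than copied from JavaCC 21.

```java
import java.util.List;

// Hypothetical sketch of a three-way choice; not generated by JavaCC 21.
class FirstSetMessageDemo {
    // Tokens that can start one of the three choices.
    static final List<String> FIRST_TOKENS = List.of("if", "for", "while");

    // Boilerplate-style choice: on failure, just list the permissible starting tokens.
    static String chooseDefault(String next, boolean extraCondition) {
        if (next.equals("if") && extraCondition) return "IfStatement";
        if (next.equals("for")) return "ForStatement";
        if (next.equals("while")) return "WhileStatement";
        return "Encountered '" + next + "', was expecting one of: "
                + String.join(", ", FIRST_TOKENS);
    }

    // Same choice with an extra catch-all branch for "if", like the FAIL line above.
    static String chooseWithFail(String next, boolean extraCondition) {
        if (next.equals("if") && !extraCondition) {
            return "The extra condition was not satisfied.";
        }
        return chooseDefault(next, extraCondition);
    }
}
```

With next equal to "if" and the condition false, chooseDefault complains that it encountered 'if' while expecting one of if, for, while, which makes no sense to the reader. chooseWithFail reports the real cause instead.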

But, at the moment, all this is not well explained (or explained at all) and also, the default error message generation should get improved. There are all sorts of cases where, if you write your grammar in a very straightforward way, the error messages generated by default will just be totally wrong. Actually, this whole thing had been on the back of my mind for a while, so thanks for bringing it up.

I hope that answers your question. It’s nuanced because, yes, this is a big problem (though I don’t know if other parser generation tools really solve it adequately either, I suspect not…) but also, yes, there are some ways of dealing with it, but the whole thing should not be as complicated and error-prone as it currently is.

(Maybe that’s all TLDR stuff, but I figured I might as well give a pretty complete answer…)

Actually, I misspoke in the above. This is already addressed. In this sort of situation, it does not generate the message saying that it was expecting try, because it checks whether the next token is try precisely to avoid writing an obviously incorrect error message. You can see where that is implemented here. I had forgotten about this! Of course, the problem remains that, in this spot, it gives hardly any error message at all. It just uses the current location to say:

buf.append("\nEncountered an error at (or somewhere around) " + token.getLocation());

And it gives you a stack trace. But it doesn’t give any real description of the problem, no.

For the moment, I guess the solution would be to add an extra catch-all case in the choice with an explicit FAIL statement. And that gives you a point at which to write your own description of the problem in plain English.

This reminds me of something I have been meaning to add for a good while now but have not got around to: some sort of ASSERT syntax. You can still write assertions of a sort, like:

        SCAN ~("{" | "(") => FAIL "Expecting a '{' or '(' here"

In other words, if the next token is something other than { or ( then we abort.

But I was thinking of having a terser syntax for this, more like:

     ASSERT "foo" => "Expecting a 'foo' here"

And then you could even sprinkle these sorts of assertions all over the place and, like assertions generally, they serve a role in documenting the code, but also, if they are hit when the code is running, they give a custom error message that you wrote yourself.

So, for the try statement:

    ASSERT "try" ("{" | "(") => "The 'try' must be followed by either '{' or '('"

And then the code following the assertion can just assume that the token after try is one of the two possible ones, because otherwise the assertion would have been triggered. Of course, that just amounts to the same thing as the way the try statement is currently written, but there may be cases where expressing things this way would be clearer, I think.
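
In plain Java terms, such an assertion would amount to something like the helper below. This is a sketch of the idea only: ASSERT is a proposed syntax, and assertNext, parseTry, and the use of IllegalStateException are all hypothetical choices made for this example.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating the idea behind the proposed ASSERT syntax.
class AssertNextDemo {

    // Throws with the author-written message unless the token at pos is allowed.
    static void assertNext(List<String> tokens, int pos, Set<String> allowed, String message) {
        String next = pos < tokens.size() ? tokens.get(pos) : "<EOF>";
        if (!allowed.contains(next)) {
            throw new IllegalStateException(
                    message + " (found '" + next + "' at token index " + pos + ")");
        }
    }

    static String parseTry(List<String> tokens) {
        assertNext(tokens, 1, Set.of("{", "("),
                "The 'try' must be followed by either '{' or '('");
        // Past the assertion, only the two legal cases remain.
        return tokens.get(1).equals("(") ? "TryWithResources" : "ClassicTryStatement";
    }
}
```

The code after the assertNext call does not need its own error handling for the token after try, which is the documenting role that assertions usually play.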

Thank you for the detailed and interesting explanation. I had a look around the web at parser code written in JavaCC in order to see how others were implementing custom error messages and I didn’t find anything. I have long wondered how to handle this and I am in a sense happy to see that it is not just that I am a slow learner, but that it is a challenging problem.

The ASSERT statement looks like an interesting approach, and it is kind of in line with what I had thought of: ‘sprinkling’ these around my code. I am an old JavaCC rookie (been using it for 20 years, still don’t understand it, used flex/yacc before that, no formal compiler-compiler training). I look forward to seeing how this works out.

Here is a simple example of one of the issues I would like to catch. I have a domain specific language where strings are delimited by single quote characters. If a string literal does not have the closing quote then the legacy JavaCC error message is “encountered EOF after 'blah blah blah”. Of course this is true but not helpful. I think that an improved message would be ‘string literal starting at column x does not have a closing quote’ or something like that. Is this handling something that goes into the lexer, and if so how?

Well, in terms of your example of unclosed quotes, I assume that you understand that there are really two “machines” in JavaCC. (Actually maybe 3 if you count tree-building as another “machine”. Or 4 if you count lookaheads as yet another “machine”.) But anyway, the parsing machinery and the tokenizing machinery are two separate things, so typically, a failure to close parentheses, say, is a parsing or syntactic error, while the failure to close a quoted string would typically be on the lexical side.

Of course, the end user doesn’t understand that distinction (and shouldn’t have to!) but it does relate to the whole question of somehow generating a human readable error message.

Though JavaCC is kind of a “black box” for most users (not just you!) that goes even more so for the lexical side, which is what I just finished cleaning up!

Traditionally, in JavaCC, there might be some extremely crude kludges for recovering from syntactic errors, but on the lexical side, one has really been SOL in these cases. At least, in the older versions of JavaCC, if you had an error on the lexical side, then it would throw TokenMgrError, which subclasses java.lang.Error, which, in principle, is something you’re not even supposed to try to recover from!

This is substantially fixed in JavaCC 21 because it unifies the exceptions. Either a lexical or syntactic error just ends up throwing ParseException. You see, the way it works is that if it can’t even tokenize some input, I mean legitimately, it creates a token of type INVALID. So if you had something like:

   if (x) ### y();

It will just create a Token of type INVALID for the intractable characters ###. And, in FAULT_TOLERANT mode, it can just ignore these things as if they were comments. Even in a non-fault-tolerant mode, one could imagine some sort of TOKEN_HOOK mechanism that intercepts these situations and generates a much better error message than what the default would be. Well, in terms of an unclosed string literal, if you had something like:

  print 'foo ;

obviously it’s missing a ' after the foo but if it fails to find the closing quote, it will turn the initial ' into a token of type INVALID. Of course, that will end up throwing a ParseException because INVALID is not an expected type. (Presumably, it’s expecting some literal or a variable or something like that.) But, in principle, one could intercept it and do something or other, give a human-readable error message, at the very least. I’m still working on designing a clear API to handle all these sorts of things, but I really had to complete the full cleanup of the lexical code and that is pretty recent.
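
Since that API is not designed yet, the following is only a toy Java sketch of the general idea: a tiny hand-written lexer that emits INVALID tokens for input it cannot tokenize, plus a function that turns such a token into a readable message. The Tok class, tokenize, and describe are hypothetical names, not part of JavaCC 21, and the lexer assumes strings have no escape sequences.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy lexer; none of this is JavaCC 21 API.
class InvalidTokenDemo {
    static final class Tok {
        final String type;
        final String image;
        final int column; // 1-based column where the token starts
        Tok(String type, String image, int column) {
            this.type = type;
            this.image = image;
            this.column = column;
        }
    }

    // True for characters the toy lexer cannot start any valid token with.
    static boolean isUnknown(char c) {
        return !Character.isWhitespace(c) && !Character.isLetter(c) && "();'".indexOf(c) < 0;
    }

    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int start = i;
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetter(c)) {
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) i++;
                out.add(new Tok("WORD", input.substring(start, i), start + 1));
            } else if ("();".indexOf(c) >= 0) {
                i++;
                out.add(new Tok("PUNCT", String.valueOf(c), start + 1));
            } else if (c == '\'') {
                int close = input.indexOf('\'', i + 1);
                if (close >= 0) {
                    i = close + 1;
                    out.add(new Tok("STRING", input.substring(start, i), start + 1));
                } else {
                    i++; // only the opening quote becomes INVALID
                    out.add(new Tok("INVALID", "'", start + 1));
                }
            } else {
                while (i < input.length() && isUnknown(input.charAt(i))) i++;
                out.add(new Tok("INVALID", input.substring(start, i), start + 1));
            }
        }
        return out;
    }

    // Returns a human-readable message for an INVALID token, or null for a valid one.
    static String describe(Tok t) {
        if (!t.type.equals("INVALID")) return null;
        if (t.image.equals("'")) {
            return "string literal starting at column " + t.column
                    + " does not have a closing quote";
        }
        return "unrecognized input '" + t.image + "' at column " + t.column;
    }
}
```

A parser (or a hook sitting between lexer and parser) could call describe on each INVALID token before falling back to the generic message.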

I noticed that somebody opened this issue on the legacy JavaCC project. Of course, the gatekeepers sitting on the legacy project will never do anything about that. But it’s actually a reasonably well motivated kind of idea. I guess JFlex has it. So I was thinking about addressing the issue.
Basically the person wants some sort of assertion mechanism to “reject” a token even if the lexical machinery did match the string. I guess filtering out cuss words is one obvious application. Also, in terms of string matching, just rejecting input that is just too long, say. Of course, that doesn’t answer your question about unclosed strings, because it still has to hit the terminating character to complete the token matching.
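
As a rough illustration of what such a “reject” step could look like, here is a small Java sketch. It is an assumption about the shape of the feature, not an existing JavaCC 21 or JFlex API; classify, TOO_LONG, and the length limit are invented for the example.

```java
import java.util.function.Predicate;

// Hypothetical post-match filter; not an existing JavaCC 21 or JFlex API.
class TokenRejectDemo {
    static final int MAX_LENGTH = 16; // arbitrary limit, for illustration only

    // Example rejection rule: the matched text is simply too long.
    static final Predicate<String> TOO_LONG = image -> image.length() > MAX_LENGTH;

    // Called after the lexer has already matched image as matchedType.
    // A rejected token is reclassified as INVALID instead of keeping its type.
    static String classify(String matchedType, String image, Predicate<String> reject) {
        return reject.test(image) ? "INVALID" : matchedType;
    }
}
```

A word filter would be the same thing with a different predicate, for example one that checks the image against a set of banned words.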

Well, I guess the point I really want to get across here is that the code has now been cleaned up to such an extent that it is not very hard at all to implement feature requests like that one.

Well, one thing I really would like to get through to people is that this whole code rewrite is very much a game changer. Like, in my last answer, I went into some detail and pointed you (or any reader) to relevant points in the codebase. I have a strong suspicion that a lot of people, when I point to the relevant place in the code, don’t even look. It’s a bit like if somebody pointed me to some page in an advanced textbook on biochemistry. I likely wouldn’t even look because I would be so sure that I couldn’t understand. (Probably correctly.)

But the thing is that, now, I think the code is pretty readable. (I can’t be objective about this, I grant that…) But what I mean to say is that if, in a conversation like this one, you make a point of just looking at the code I point to and making some attempt to read it and understand it, okay, it’s likely to be pretty inaccessible at first, but if you make a habit of just clicking and looking at it each time, I really believe that at some point, the clouds will clear up and… And of course, if you see something that you don’t quite understand but would like to, there is no problem with asking me a question (or three…)

You see, what I’m really trying to get at here is that if you do aspire to gradually acquire more of a “white box” relationship with this tool, it is now possible. I mean, I can’t read anybody’s mind, but I get the feeling that people have just accepted that they will never really understand how this tool works and always have a “black box” relationship with it.

It kind of reminds me of how some people just give up on learning a foreign language. Well, I dunno… It’s a different thing, but there is some similarity. It’s like people let something just kinda get the whammy on them. They are so intimidated by something and just stop trying. I hope you understand what I’m expressing here. I used to play a lot of poker, you see, and I feel that this code cleanup offers some of the same satisfaction that you could get from calling a particularly obnoxious bluff. This parser generation stuff is really just not nearly as difficult as all that. One gets the feeling that it’s kind of a bluff. Everything in this parser generation space is explained in such an obfuscated, abstruse kind of way. That’s what I was trying to get at here and here.

Well, it is difficult to understand the code, but absolutely NOT because there is some really complex theory that a normal person can’t get their head around. That was my point in those two essays. What makes it difficult to fully understand is that it’s a code generator so it’s quite different in structure from a classic Java app, and that makes it more difficult to get one’s head around. I wrote an essay about these issues.

Well, anyway, I should close this post here! I always end up talking too much. (Both online and in person!)