What you will find below is not the only interesting (even surprising) conversation I have had with Claude or with AI chatbots generally speaking. And, well, only some of those conversations have any relevance to CongoCC or even to computer programming generally.
The backdrop of this conversation is that I was trying to figure out what the current situation with legacy JavaCC documentation is. I realized that one could browse back and see the state of the world, all the way back to the key date (key in JavaCC history anyway!) of 9 June 2003, when the JavaCC source code was open-sourced on the (now defunct) Java.net site.
Before then, JavaCC had always been free to download and use, but the source code was not available. Actually, I myself used JavaCC for the first time in late 2001. I remember that I downloaded the zip file and that contained the compiled binary along with some simple examples and minimal documentation.
So anyway, the earliest snapshot of the JavaCC code that you can browse is the 9 June 2003 snapshot here. You can browse it online or you can check it out via:
git clone https://github.com/javacc/javacc.git
cd javacc
git checkout c88fe09b3114fba81f50c8e1744e6ace6d7b1597
That gives you the state of the CVS repository on Java.net at that point in time, almost exactly 23 years ago. (Actually, this is probably more like the state of the JavaCC project from at least 4 years before that, like 27 years ago, since, at the point where JavaCC was open sourced, it had not been actively developed for some years -- I reckon four years at least.)
By the way, it is interesting to note that this code snapshot from that CVS commit of 23 years ago does build and run using the latest Java (JDK 26). Well, not quite. You see, there is the problem that the source code contains various code like:
for (java.util.Enumeration enum = rexprlist.elements(); enum.hasMoreElements();) {...
No remotely modern Java compiler compiles the above line, since enum has been a reserved word in Java since JDK 5 (released September 2004) and thus, cannot be used as a variable name. I thought that it would be possible to compile using something like:
javac --target 4
But now, in the latest JDK, the --target option only goes back to JDK 8. However, a global search/replace for the whole-word string enum (which occurred 135 times in the codebase, BTW, I replaced it with enum0) did the trick. I could then type:
ant
on the command-line and it would build the jar file. And the javacc.jar thus built could be used either with the bash (and Windows .bat) files in the bin directory or simply one could invoke the javacc tool via:
java -cp bootstrap/javacc.jar COM.sun.labs.javacc.Main <grammarfile>
Yes, success! I was beginning to feel like some sort of intrepid archaeologist, like a computer nerd version of Indiana Jones!
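For anyone who wants to repeat the experiment, the whole-word search/replace described above can be sketched in a few lines of Java. This is only an illustration of the technique, not the procedure actually used: the class name EnumRenamer and its rename method are made up for the example, and a plain regex pass like this would also touch the word enum inside comments and string literals.

```java
import java.util.regex.Pattern;

// Hypothetical helper, invented for this example.
class EnumRenamer {
    // \b anchors restrict the match to the whole word "enum", so
    // "Enumeration", "enumerate" or "myenum" are left untouched.
    private static final Pattern WHOLE_WORD_ENUM = Pattern.compile("\\benum\\b");

    static String rename(String source) {
        return WHOLE_WORD_ENUM.matcher(source).replaceAll("enum0");
    }

    public static void main(String[] args) {
        String line = "for (java.util.Enumeration enum = rexprlist.elements(); enum.hasMoreElements();) {";
        // prints: for (java.util.Enumeration enum0 = rexprlist.elements(); enum0.hasMoreElements();) {
        System.out.println(rename(line));
    }
}
```

The match is case-sensitive, which is why java.util.Enumeration survives the replacement.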
I guess I did the above to satisfy my curiosity. But my actual motivation for this archaeological journey was just trying to figure out what the state of JavaCC documentation was at that point in time. That was the real reason I checked out such an ancient version of their repository. In fact, you can browse that on the web without checking it out here. In any case, this jogged my memory because I vaguely remembered eyeballing this very same "documentation" (or just about) when I first downloaded JavaCC in 2001.
I was poking around these files and I wondered specifically about this one. I was wondering about this errorrecovery.html file because, to the best of my knowledge, JavaCC never really did have any dispositions for error recovery and I was wondering what the file said. (What error recovery?)
Well, the file describes some very crude hacks that, generally speaking, just don't really work. Well, to be absolutely fair, their error_skipto(SEMICOLON) example might work in some cases if you're extremely lucky, but generally speaking, it's just too crude. In general, it's not gonna work. (Trust me. I know whereof I speak.)
However, there is another reason that this sort of thing is very unlikely to work in a real-world project of any complexity level. Consider their example:
void Stm() :
{}
{
IfStm()
|
WhileStm()
|
{error_skipto(SEMICOLON);}
}
The above Stm() production has the rather interesting feature that it always succeeds in a lookahead. Specifically, if you write:
(LOOKAHEAD (Stm()) Stm())*
The lookahead always succeeds. The reason is that a Java code block is always considered to succeed in a lookahead. In legacy JavaCC, there is no way to indicate that a Java code block is an error handler, and thus, should be treated as a failure in a lookahead. In CongoCC, you can write:
Stm :
IfStm
|
WhileStm
|
FAIL {error_skipto(SEMICOLON);}
;
Then, a lookahead (or a SCAN) that reaches the third option that is a Java code block preceded by FAIL, will return failure. Of course, since the error_skipto disposition probably doesn't even work anyway, the whole example is possibly a couple of degrees away from really working IRL! Nonetheless, it is an example in the JavaCC docs and actually, that file is there basically unchanged in the current docs. Well, it has been reformatted in Markdown instead of its original HTML and there may be a bit of editing. But you see how they have documentation with examples that don't really work! By the way, you can find examples of this sort of thing in JavaCC grammars in the wild, people trying to get things like this to work... I guess they were influenced by this "error handling" doc...
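To make the lookahead point concrete, here is a toy model in plain Java. It is emphatically not the code that either JavaCC or CongoCC generates; the names (Alternative, StartsWith, CodeBlock, FailBlock) are invented for the illustration, and a real lookahead routine scans more than one token. It only models the rule described above: a bare code block counts as success in a lookahead, while a code block preceded by FAIL counts as failure.

```java
import java.util.List;

// Toy model only; all names here are hypothetical.
class LookaheadSketch {
    // A choice alternative, reduced to what a lookahead routine cares about.
    interface Alternative {
        // True if lookahead considers this alternative viable for the next token.
        boolean scan(String nextToken);
    }

    // An alternative that begins with a specific token, like IfStm() or WhileStm().
    record StartsWith(String token) implements Alternative {
        public boolean scan(String nextToken) { return token.equals(nextToken); }
    }

    // A bare Java code block: lookahead does not execute it, it just skips
    // over it, so it consumes nothing and counts as success.
    record CodeBlock() implements Alternative {
        public boolean scan(String nextToken) { return true; }
    }

    // A code block preceded by FAIL: a lookahead that reaches it reports failure.
    record FailBlock() implements Alternative {
        public boolean scan(String nextToken) { return false; }
    }

    // Lookahead over a choice succeeds if any alternative is viable.
    static boolean lookahead(List<Alternative> choice, String nextToken) {
        for (Alternative alt : choice) {
            if (alt.scan(nextToken)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Alternative> legacyStm =
            List.of(new StartsWith("if"), new StartsWith("while"), new CodeBlock());
        List<Alternative> congoStm =
            List.of(new StartsWith("if"), new StartsWith("while"), new FailBlock());
        System.out.println(lookahead(legacyStm, "}")); // true: the code block "matches"
        System.out.println(lookahead(congoStm, "}"));  // false: FAIL is honored
        System.out.println(lookahead(congoStm, "if")); // true: a real statement
    }
}
```

In this model, a loop guarded by a lookahead on the legacy version of Stm never stops on account of the lookahead, since the lookahead cannot fail.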
Okay, so where am I going with all this? Well, consider all of the preceding to be a sort of preamble that explains the context of my conversation with Monsieur Claude. You see, when I came across this (not for the first time, but for the first time in a long while) my curiosity was suddenly piqued. I wondered whether Claude would "know" this. So, I asked him/her/it the following:
JR: What is the motivation behind the FAIL statement in CongoCC? Why does it not exist in the older JavaCC?
Here is how Claude answered and I will intersperse some comments.
Claude: The FAIL statement in CongoCC is a deliberate, explicit way to force a parse failure at a specific point in a grammar rule. The motivation is essentially about making failure a first-class, expressible concept in the grammar language.
(JR: The above is not completely wrong actually. I suppose it is true that the FAIL statement has a certain documenting functionality. However, that is actually ancillary. The real reason that the FAIL statement is necessary is the one I outlined. You need to be able to say that an option is a failure because otherwise lookahead is broken!)
Claude again:
Why FAIL exists in CongoCC
1. Expressing "this branch is impossible" cleanly
In complex grammars, you often have a choice construct where one branch is syntactically reachable but semantically invalid. Without FAIL, you'd have to contort the grammar or use embedded Java/action code to throw an exception. FAIL lets you write this intent declaratively, directly in the grammar.
(JR: This is true but is still skirting around the central point! Claude goes on at length but, to my taste, is largely flailing around because it is not hitting the key point.)
2. Complementing the SCAN / lookahead architecture
CongoCC's SCAN directive allows very precise control over when the parser commits to a branch. FAIL fits naturally into this: you can scan ahead, confirm you're in a syntactically valid but logically dead-end branch, and FAIL explicitly. This is cleaner than the alternatives (throwing exceptions in actions, or writing a lookahead predicate that somehow encodes the negative condition).
3. Supporting fault-tolerant parsing and ATTEMPT/RECOVER
CongoCC's fault-tolerant parsing infrastructure (the ATTEMPT/RECOVER mechanism) needs the parser to be able to fail in a controlled, recoverable way. A FAIL inside an ATTEMPT block triggers the RECOVER path cleanly, which a Java exception thrown from an action would not do as gracefully.
(JR: The above is even something of a confabulation because ATTEMPT/RECOVER is really more or less orthogonal to FAIL.)
4. Contextual keyword disambiguation
This comes up a lot in your C# and Rust grammar work — contextual keywords like async, extension, yield, etc. are identifiers in some contexts and keywords in others. You often want to write a rule that tentatively matches an identifier, checks if it has the right spelling via a scan/semantic predicate, and FAILs if not. Without FAIL, expressing "match an identifier, but only if it's literally async" requires more awkward workarounds.
(JR: Note that Claude has some "memory" of its prior interactions with me, referring to my C# and Rust grammar work. I had had conversations where I asked Claude pointed questions about certain constructs in those languages and received answers that were (not complaining!) mostly useful. But Claude is confabulating some more here. The FAIL statement really is pretty much orthogonal to contextual keywords, which were introduced several years later! A contextual keyword, after all, is just a token that is matched by a more general regular expression. Thus, the contextual keyword yield or record in Java is matched as an identifier by the NFA machinery and then a quick string comparison says that it is actually the contextual keyword in question. (Or isn't!))
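The "quick string comparison" can be sketched as follows. This is a minimal illustration under assumed names (TokenType, Token, isContextualKeyword are made up for the example), not CongoCC's actual token classes.

```java
// Hypothetical types, invented for this example.
class ContextualKeywordSketch {
    enum TokenType { IDENTIFIER, OTHER }

    record Token(TokenType type, String image) {}

    // The lexer has already matched the general identifier pattern; deciding
    // whether the token is a given contextual keyword is a string comparison.
    static boolean isContextualKeyword(Token tok, String keyword) {
        return tok.type() == TokenType.IDENTIFIER && tok.image().equals(keyword);
    }

    public static void main(String[] args) {
        Token first = new Token(TokenType.IDENTIFIER, "yield");
        Token second = new Token(TokenType.IDENTIFIER, "counter");
        System.out.println(isContextualKeyword(first, "yield"));  // true
        System.out.println(isContextualKeyword(second, "yield")); // false
    }
}
```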
Claude again:
Why JavaCC never had it
JavaCC's design philosophy was older and more conservative — it treated the grammar as a mostly-declarative specification of what to accept, with Java action code as the escape hatch for everything else. The assumption was: if you need to fail, throw a ParseException from an action block. This worked well enough for simple grammars but doesn't compose well:
- Thrown exceptions from actions bypass the parser's own lookahead/backtracking machinery
- It makes the grammar harder to read (logic buried in {} action blocks)
- It predates any notion of fault-tolerant or recoverable parsing -- JavaCC parsers were strictly fail-fast
CongoCC's design philosophy is that the grammar language itself should be expressive enough to encode parsing logic that previously had to live in action code. FAIL is one instance of that, alongside SCAN, ASSERT, ATTEMPT/RECOVER, and the richer lookahead predicates. It reflects the broader goal of keeping parsing intent visible in the grammar rather than hidden in embedded code.
(JR: Well... the truth of the matter is that Claude is really bullshitting somewhat more here. Again, the key issue is really the fact that lookahead doesn't work if you don't have FAIL because otherwise, you have no way of saying that a choice that is a java code block really is a parsing failure. As for this issue relating to any so-called "design philosophy", that is really nonsense. Legacy JavaCC does not handle this properly because its creators simply did not think the problem through! It is not that these people had a different "design philosophy"!)
(JR: But now we get to the real essence of the conversation. I try to get Claude on the right track...)
JR: Well, to tell the truth, I posed a question that I knew the answer to, since this is my doing. The main reason that FAIL is necessary is that simply writing A|B|{throw someException} is a production that will always succeed in a lookahead, since the logic in JavaCC is that a Java code block always counts as "success" in a lookahead. So you need to preface the code block with FAIL, as in FAIL {throw the exception} and then a lookahead really will fail.
(So, you see, in the above, I tell Claude which way is up! LOL!)
Claude: That's a really precise and important point — thank you for the correction. The core issue is about lookahead semantics, not expressiveness in the abstract.
(JR: My understanding of AI is that the AI "machine" has no actual "knowledge representation". It is mostly a statistical model that is based on guessing the most likely next word. However, here, I have to admit that Claude may not have any "knowledge representation", but it could fool just about anybody! This all really starts to be uncanny.)
Claude: In JavaCC's model, a Java action block {...} is always treated as succeeding during lookahead evaluation, because the lookahead machinery doesn't execute action code — it just skips over it.
(JR: The above is interesting. Of course, Claude is largely just spitting back at me what I just told it. However, I have to admit that it is also arguably offering a somewhat better explanation than I did. It says: because the lookahead machinery doesn't execute action code -- it just skips over it. Chapeau, Monsieur Claude!)
Claude: So if you write:
A | B | { throw new ParseException("impossible case"); }
the third branch looks like a viable alternative during lookahead, and the parser may commit to it based on that false promise. The exception only fires at parse time, after the parser has already committed — which is too late for the choice to be reconsidered.
FAIL exists specifically to punch through that abstraction: it's a signal the lookahead machinery does recognize and honor. So:
A | B | FAIL { throw new ParseException("impossible case"); }
means the third branch is genuinely not viable during lookahead, and the parser won't commit to it.
(JR: Well, yeah, but it would be simpler just to say that a lookahead routine that reaches the third option returns failure because this is explicitly a FAIL instruction. (Duhh.))
Claude: In other words, FAIL is less about "making failure a first-class concept" philosophically, and more about a concrete gap between what action blocks can express and what the lookahead engine can see. The older JavaCC never addressed this gap because its lookahead was simpler and the workarounds (restructuring the grammar, relying on late exceptions) were considered acceptable. CongoCC's richer lookahead/scan machinery made the gap more visible and more painful to ignore.
(JR: Well, Claude is dancing around the basic issue that the legacy JavaCC developers just never thought this through. But still, what Claude says is not bad. Once you (or I in this instance!) point out the basic issue, Claude does run with it.)
JR: Well, the truth of the matter is that this is an example of how basic issues regarding lookahead are rather half-baked in legacy JavaCC. There is the general problem that lookahead does not work in conjunction with changes in lexical state, because the lookahead machinery caches tokens (which is a good idea generally) but does not take into account that the cached tokens have to be thrown away if there is a lexical state change.
(So you can see that I prompt Claude to extend its "understanding". Or is it "pseudo-understanding" since these AI bots don't "understand" anything, strictly speaking, do they...)
Claude: That's a great example of a subtle but serious bug in the legacy design. The caching logic for lookahead is an optimization that makes perfect sense in the common case — you speculatively consume tokens, then either commit or back up, and caching avoids re-tokenizing the same input. But it silently breaks when lexical state switching is involved, because the cached tokens were produced by the wrong lexical state.
(Again, I sort of fed this to Claude, but the basic explanation that Claude provides is about impeccable. I do not think that I myself could improve on the last sentence above. At least not significantly.)
Claude: The failure mode is particularly nasty because:
- It's not a loud, obvious bug — the parser may appear to work for most inputs and only misparse in cases where the lookahead actually crosses a lexical state boundary
- The cached tokens are syntactically plausible (they're real tokens, just from the wrong state), so the error may not manifest as a parse failure at all — just a wrong parse
- It's very hard to debug because the symptom (wrong parse or spurious failure) is temporally separated from the cause (stale cached tokens from a prior lexical state)
The fix is conceptually straightforward — invalidate the token cache on any lexical state transition during lookahead — but it requires the lookahead machinery to be aware of and integrated with lexical state, which the legacy JavaCC implementation apparently never bothered to do. The two subsystems were just too loosely coupled.
This fits the broader pattern you're describing: JavaCC's lookahead was bolted on rather than designed from the ground up as a first-class concern, so edge cases like FAIL semantics and lexical state interaction were never properly handled. CongoCC having to fix these one by one is essentially the cost of inheriting a half-baked foundation.
END OF CONVERSATION
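The stale-cache problem discussed in the conversation can be reproduced in miniature. The sketch below is a toy, not the implementation of either tool: the class TokenCacheSketch, the two lexical states DEFAULT and RAW, and the token types are all assumptions made up for the illustration. With invalidateOnSwitch set to false it mimics the behaviour described above (cached tokens survive a lexical state change); with true, it discards the cached tokens from the switch position on, which is the conceptual fix.

```java
import java.util.ArrayList;
import java.util.List;

// Toy lexer with a lookahead token cache; all names are hypothetical.
class TokenCacheSketch {
    enum LexicalState { DEFAULT, RAW }

    record Token(String type, String image) {}

    private final String[] words;
    private final boolean invalidateOnSwitch;
    private final List<Token> cache = new ArrayList<>(); // tokens by position
    private LexicalState state = LexicalState.DEFAULT;

    TokenCacheSketch(String input, boolean invalidateOnSwitch) {
        this.words = input.split(" ");
        this.invalidateOnSwitch = invalidateOnSwitch;
    }

    // Tokenize one word; the result depends on the current lexical state.
    private Token tokenize(String word) {
        if (state == LexicalState.RAW) return new Token("RAW_TEXT", word);
        return new Token(word.equals("if") ? "IF" : "IDENTIFIER", word);
    }

    // Return the token at a position, tokenizing and caching as needed.
    Token tokenAt(int position) {
        while (cache.size() <= position) {
            cache.add(tokenize(words[cache.size()]));
        }
        return cache.get(position);
    }

    // Switch lexical state; tokens from 'position' on were produced by the
    // old state, so the fixed variant throws them away.
    void switchTo(LexicalState newState, int position) {
        state = newState;
        if (invalidateOnSwitch && position < cache.size()) {
            cache.subList(position, cache.size()).clear();
        }
    }

    static String demo(boolean invalidate) {
        TokenCacheSketch lexer = new TokenCacheSketch("x if y", invalidate);
        lexer.tokenAt(1);                    // lookahead scans two tokens in DEFAULT
        lexer.switchTo(LexicalState.RAW, 1); // state change after the first token
        return lexer.tokenAt(1).type();      // what the parser now sees
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // IF (stale token from the wrong state)
        System.out.println(demo(true));  // RAW_TEXT (re-tokenized in the new state)
    }
}
```

Note that the stale token in the first case is a perfectly plausible token, which is why this kind of thing tends to show up as a wrong parse and not as an obvious error.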
(JR again:) My good Lord! Sweet Jaysus! That last sentence from Claude is uncanny: "CongoCC having to fix these one by one is essentially the cost of inheriting a half-baked foundation."
Yes, Claude (after getting "fed" the info admittedly) hits the nail on the head. That last sentence is absolutely right. Well, that sentence does leave certain things unsaid. I mean, just how absolutely aggravating the situation is, where you systematically address all these longstanding issues and people just decide (sometimes with a level of sheer cussedness that really is jaw-dropping) that they are simply going to ignore your work.
Of course, it is also true that long-time users of JavaCC who choose to ignore CongoCC, though causing me significant frustration, are likely causing themselves even more frustration and aggravation in the long run. You see, the way the sorts of issues discussed above manifest themselves is rather subtle and non-obvious. What Claude says earlier actually hints at this: "It's not a loud, obvious bug -- the parser may appear to work for most inputs and only misparse in cases where the lookahead actually crosses a lexical state boundary." Quite right. Yes, some approach that works in the general case fails because the lookahead does not take into account lexical state switches. So, what you frequently have is a situation where, to all outward appearances, something should work (and it usually does!) but then, in this case, it doesn't work, and it is not readily obvious why. This is exactly the kind of situation that is bound to lead to many hours (maybe days) wasted in front of a debugger screen. In fact, the person afflicted with this may not even get to the bottom of the issue. Frequently, the person may just tear up what they have done and muck about until they converge on something that somehow does work. But they may never come to an understanding of the core problem.
And what greatly contributes to the problem is documentation that maintains that the tool itself is practically beyond reproach. So, yeah, it stands to reason: if whatever you're doing doesn't work, you must be the one doing something wrong! (Well, think again...) But really, what this amounts to is a form of gaslighting, which is, of course, not a very nice thing to do to people. Now, to be clear, I don't really mean to say that the people who write this kind of documentation (and even, on occasion, entire books) without ever mentioning the key bugs in a tool like JavaCC are consciously trying to drive people crazy. There are various psychological aspects at work, I suppose.
People make an investment in their toolset, not just in time and effort, but a sort of emotional investment. So, if you tell them that this key tool that they rely on has these bugs and they are fixed in a separate fork, they get very defensive somehow. They don't do the rational thing of simply evaluating the claim and then, in short order, migrating to the newer version of the tool, where the various bugs are fixed. They can even get indignant when you point this sort of thing out to them. In private correspondence, the estimable Brian Goetz referred to this as a form of Stockholm Syndrome.
Well, getting back to the above conversation with Claude, I initiated it because I wanted to see whether Claude "understood" this sort of basic issue, how lookahead was fundamentally broken in legacy JavaCC -- which is the real or fundamental reason for the need for CongoCC's FAIL statement. I guess not really, because the initial response from the AI "machine" did not really get to that fundamental point. However, when I followed up by indicating what the real issue was, it managed to write some responses that were actually uncannily incisive. I mean to say that Claude was (or gave the impression of being) much more on-the-ball than the vast majority of human correspondents I have engaged with over the past so many years. I fed Claude certain key information and it incorporated that information into its "understanding" (again, I know it doesn't really "understand" things in a human sense) of the problem. Regardless, the AI "machine" could do this much more readily than most human correspondents.
And why is that?
Well, I don't know exactly why. My current hypothesis is that the "machine" is not handicapped by the emotional baggage typical of humans.
- It is not emotionally invested in a tool like JavaCC. It doesn't feel some weird identification so as to get defensive and emotional when you point out how buggy the thing is.
- More generally speaking, the "machine" has (certainly compared to humans) no real problem recognizing that it said something mistaken before. You give it some new information and it just readily incorporates it. Or, to put it another way, it is not burdened with the all-too-human problem that it just hates being told anything!
Well, my own understanding of all this is in its early stages. While I have no doubt that human intelligence still has some great advantages over AI, probably AI does have some basic advantages in terms of not having these emotionally based handicaps. There is also some overlap in that both humans and AI bots will tend to bullshit a fair bit when they don't quite "understand" the key issue. You see that in Claude's initial response to my query above. However, it seems that the AI is much better at turning on a dime and admitting it said something wrong. And that would be because, while it does confabulate (this is well known) it does not get emotionally committed to its own confabulations!
Well, I guess that's enough said for now. I am quite interested in how other people react to all this, so if you want to discuss it, by all means, sign up (if you haven't already) on our discussion forum and say something.
