Notes on the ongoing CongoCC API Evolution

The following essay outlines the main evolution in the API generated by CongoCC from its origins in the Dark Ages up until now.

The Journey of a Thousand Leagues Begins with a Single Step

Let's say you go on an excursion to a national park and there is a well-known hiking trail that goes up to some famous panoramic point. You set off on the trail and hike the first few miles on a steadily rising slope. For a good while, you're mostly just focused on your walking and never stop to look back and enjoy the view. However, at some point, you do stop and rest, and when you look back, you are totally bowled over by the spectacular panoramic view and realize just how far you have come.

Well, that is kind of similar to how I feel when I contemplate how far this CongoCC (né FreeCC, a.k.a. JavaCC 21) has come from its origins. As with the hiking, most of the time you are just focused on trudging forward, but at certain key moments you do stop and look back and realize just how far you have come. And that can be a very satisfying feeling.

The CongoCC Origin Story

Now, as you probably know, all of this began when a mad genius professor was trying to breed highly intelligent apes... And then (as one would surely expect) some of the particularly intelligent specimens escaped from the laboratory... And then...

Oh, hold on, I'm getting confused. That is the Planet of the Apes origin story. The CongoCC origin story is that yours truly picked up the JavaCC source code at some point in 2008 and started hacking it, trying to do something with it. I originally thought (naively, I guess) that the existing JavaCC "community" would even appreciate my desire to move the thing forward, but... well... apparently not. So I did some work on it in the latter half of 2008, putting out a few releases under the name FreeCC but then drifted away from it -- and then, for some rather inexplicable reasons, I picked up the work again at the very end of 2019.

Now, even when I picked up the code in 2008, the existing JavaCC project had a history dating back over a decade. It was originally written at Sun Microsystems in the latter half of 1996 and, at least as far as I could ever tell, the only incremental work of any real significance was done in 1997, when they added JJTree (automatic tree-building capability) to the package. Some six years later, in June of 2003, Sun Microsystems open-sourced the code to great fanfare, but even at that point (20 years ago as I write these lines) the tool had not evolved at all for a number of years.

So, given this chronology, with the legacy JavaCC tool having been written in 1996/1997, it is hardly surprising that the original codebase does not follow standard Java conventions, as they were not even established at that point in time. More specifically, the original JavaCC tool would always generate a Token.java with eight publicly visible fields:

public int kind; // Now superseded by type in CongoCC, which is an instance of TokenType, a generated type-safe enum.

public int beginLine, beginColumn, endLine, endColumn; // Location info, now superseded by beginOffset/endOffset in CongoCC

public String image; // The string contents of the token. In typical usage, this is no longer present in CongoCC.

public Token next; // A reference to the next (parsed) token in the stream, no longer present in CongoCC

public Token specialToken; // A reference to the previous (unparsed) token in the stream, no longer present in CongoCC

I think it is instructive to note that in the current version of CongoCC, as a result of a huge amount of refactoring/redesign, none of the above fields even exists. (Well, the image field kind of still exists, but only if you use the setting MINIMAL_TOKEN=false (the default is now true), and next also kind of exists if you use the hyper-specialized TOKEN_CHAINING=true setting. But even those things are completely refactored.) The kind field is replaced by type, which is a type-safe enum representing the various token types. Actually, this use case is almost a poster-child case for the use of an enum. However, one can hardly reproach the original JavaCC authors for not using an enum here, since that language feature was only introduced in JDK 5, released in late 2004.
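
Just to make the contrast concrete, here is a hedged sketch of what the change means at a call site. Only the kind field and the generated TokenType enum come from the discussion above; the getType() accessor, the IDENTIFIER token name, and the FooParserConstants interface are illustrative assumptions (the latter following the legacy convention of a generated XXXConstants interface full of int constants).

void handleToken(Token token) {
    // Legacy JavaCC style would have been an int comparison against a generated constant:
    //     if (token.kind == FooParserConstants.IDENTIFIER) {...}
    // With a type-safe enum, the compiler (and switch statements) can check token types for us:
    if (token.getType() == TokenType.IDENTIFIER) {
        // ... handle an identifier token ...
    }
}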

Well, anyway, all of the various coding antipatterns (certainly recognized as such now) in the original codebase would tend to constitute a pretty major impediment to making improvements to the tool. Straightening all of this out would be best carried out as a series of incremental steps. One of the very first (and non-controversial) such steps would be to make all the aforementioned fields private and put accessor methods around them. Thus, for example, the evolution of beginLine (and the other three location fields) would start with replacing the publicly visible field beginLine with:

private int beginLine; 
public int getBeginLine() {return beginLine;} 
public void setBeginLine(int beginLine) {this.beginLine = beginLine;}

And public String image; was likewise replaced with:

private String image;
public String getImage() {return image;}
public void setImage(String image) {this.image = image;}

Actually, this is such a typical sort of refactoring that any current-day IDE will do it for you automatically! You right-click on the field and choose the refactoring "Encapsulate Field" or some such thing. It will then make the above change, but more importantly, it will fix up every direct access to the field in the entire codebase to use the accessor methods instead. A real time saver! Not only can all modern Java IDEs do this for you, at least one of the more opinionated such tools (I am thinking of IntelliJ IDEA) will actually take the initiative of suggesting this refactoring and even nag you if you decline to do it! (I find tools being this presumptuous somewhat annoying, but the feature can be turned off, after all.)

Going with the Flow

Now, of course, one does not do these sorts of refactorings just to do them. There are extremely solid reasons that certain coding patterns are established as the norm. In the case of the encapsulation of the fields, this allows us to evolve the code forward while maintaining the same API for the application programmer. Thus, these methods like getBeginLine() or getEndColumn() can (and do!) continue to work even when the underlying fields have been refactored away!

But speaking even more generally, there is bound to be significant benefit in adhering (as much as possible anyway) to established coding patterns. That is why I recently decided, for example, to have the generated Node.java interface extend java.util.List<Node> and then to deprecate methods like getChild/setChild/addChild in favor of get/set/add which, aside from being less verbose, are part of the java.util.List API. For starters, this is just building on naming conventions that any seasoned Java developer is familiar with -- the principle of least surprise. Also, there is a general tendency for things to just work. For example, there is a whole family of scripting/templating tools, like JRuby or Jython, that will tend to translate/wrap Java semantics in a clean, transparent way. Thus, if the POJO ("Plain Old Java Object") being wrapped implements java.util.List, there will be a tendency for any such tool to expose:

 node.get(i)

as:

 node[i]

in the wrapper language (I call it a wrapper language for lack of a better term offhand). Or, in one such tool that is largely my fault, FreeMarker, you can now of course write:

 [#list node as child]
    ... Do something with the child ...
 [/#list]

What I'm getting at is that if you name/pattern things in a more conventional way, you will very typically get a lot of notational convenience for free -- without any separate ceremony or scaffolding. So, even in Java itself, having an object implement java.util.List automatically means that you can write:

for (Node child : node) {
    child.doSomething();
}

OR you can write it in this style:

node.forEach(child -> child.doSomething());

You can also even write:

for (int i = 0; i < node.size(); i++) {
    node.get(i).doSomething();
}

Granted, that last one is a tad old-fashioned, possibly preferred mostly by the kind of retro people who sit around listening to Frank Sinatra music.

But, okay, Java is a multi-paradigm language nowadays and you can express things in the style you prefer.

Even some early renamings I did back in the day, like changing jjtGetParent/jjtSetParent to just getParent/setParent, would typically mean that in a "wrapper" language, you could just write node.parent, since that would be translated transparently to getParent/setParent on the Java layer.

General Rename-itis and GITNM (Gigabyte is the New Megabyte)

Actually, legacy JavaCC had a strong tendency to put pointless (and IMHO butt-ugly) prefixes in front of fields, methods, and types. Admittedly, much of this is not even noticeable unless you (bravely) venture into the code that the tool generates, but I certainly got sick and tired of scanning over all that sh*t. So I gradually got rid of those things. Thus, for example, the method jj_consume_token became consumeToken. Or, for example, the field jj_scanpos became currentLookaheadToken. (Though more long-winded, this does more clearly say what the variable is.)

Now, granted, that kind of rename-itis amounts to what are rather superficial changes. Writing (and more importantly reading) nextToken instead of jj_ntk does not alter any of the deep logic or structure of how the thing actually works. So, to some extent, I am describing some steps that were just part of my overall effort to tame or gain mastery over the codebase.

However, one more fundamental architectural aspect of legacy JavaCC that I gradually became aware of was what I outlined in a blog article entitled The Gigabyte is the New Megabyte, GITNM for short. I wrote that only a few months after picking up the code again after an 11-year layoff. The bottom line is that the original JavaCC codebase was (and still is) based on some very anachronistic assumptions about the price/scarcity of computer memory. As I noted there, the original landmark parser generator tool YACC was designed at a point in time when a megabyte of RAM cost close to a million dollars, while the current price of a megabyte of RAM is under a penny. After writing that article, I became more and more convinced that all of this disposition with buffering the input and only retaining a smallish window of the input file in memory was basically pointless -- at least in typical usage of the tool. For one, you tend to get mired in a sort of intricate, error-prone coding, and it also ends up imposing annoying limitations on what the tool can do, as compared to just being able to straightforwardly access the entirety of the input! Also, since I had long concluded that the normal usage of the tool was to build a syntax tree, I was already tending towards the view that the normal usage would be to hold the entire content to be parsed in memory.

Even then, it took me a while to finally burn the bridges, as it were. It was finally towards the end of 2021 that I simply axed all the legacy code that dealt with buffering input. For a while, the default had been to simply read all the text into a buffer, but I still kept the older disposition working via a setting called HUGE_FILE_SUPPORT. But now, for over a year and a half, that has been gone. Like, why bother with this kind of stuff when the gigabyte is the new megabyte? (GITNM)

So, really, properly understood, a lot of the API evolution that I outline here can be thought of as a logical consequence of GITNM -- particularly, much of the evolution of the Token API. So, it occurs to me that it could be useful to ponder the following question...

WTF is a "token" anyway?

Well, at this level of exposition, I guess the reader is supposed to know what a token is. But still, it can be useful to reconsider such fundamental questions. I would put it to you that, in typical usage, a token actually has three different functionalities:

  1. A token is a sequence of characters -- i.e. a string. No getting around that, but there are still a lot of subtleties...
  2. A token is a location in the input being parsed (now represented by the TokenSource API). Well, actually, it is a range of the input, so okay, if one wants to nitpick, it is really two locations, the start and end.
  3. A token is a (terminal) node in the constructed parse tree.

Note that, above, I say in typical usage, so anybody could infer that I think that building a parse tree is the "normal usage". Otherwise, point 3 above would not be included. And note that the legacy JavaCC API does not address at all the question of tokens being thought of, quite naturally, as the terminal nodes in a parse tree. So the original legacy JavaCC API only addresses the first two of the points above. It deals with the first point via the image field, which is a String representing the token's character content. It deals with the second via those beginLine/beginColumn/endLine/endColumn fields.

The CongoCC Token API, on the other hand, does this with three fields: tokenSource, beginOffset, and endOffset. The tokenSource is a reference to a TokenSource object, which really is a wrapper around the character buffer holding the entirety of the input to be parsed. The two integer fields beginOffset and endOffset are the begin/end offsets in the parsed input corresponding to this Token object. And the third point (assuming that tree-building is enabled) is dealt with via the parent field.
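
To make that concrete, here is a hedged sketch of the resulting layout, and of how one of the old location accessors can survive the refactoring. Only the three fields and the parent field come from the discussion above; the field types, the accessor body, and the getLineFromOffset() lookup are illustrative assumptions.

private TokenSource tokenSource; // wrapper around the entire input buffer
private int beginOffset, endOffset; // this token's range within that buffer
private Node parent; // the containing node, when tree-building is enabled

// The old beginLine field is gone, but the accessor can still work by deriving
// the line from the offset. (getLineFromOffset is a hypothetical lookup method.)
public int getBeginLine() {
    return tokenSource.getLineFromOffset(beginOffset);
}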

Look, Ma! No image field!

So, you can see how the image field finally melted away. The getImage() method which started off as:

public String getImage() {
   return image;
}

evolved into something like:

public String getImage() {
    return tokenSource.getText(beginOffset, endOffset);
}

And the setImage method which started off as:

public void setImage(String image) {
    this.image = image;
}

became... well, actually, it became nothing at all. It just disappeared.

Well, it disappeared in "typical" usage. "Typical" is a loaded word, of course, but I am thinking of usage where you slurp in a file, tokenize/parse it, and there is no need to ever mutate the resulting parse tree. Actually, perhaps "minimal" would be a better term than "typical". And, in fact, the setting that controls whether to generate a separate image field is called MINIMAL_TOKEN.

Now, here is another interesting point. The generated Node API has a default method getSource() that gets the source text of the input corresponding to this node, based on the TokenSource object and the beginOffset/endOffset values. So, in this minimalist disposition, getImage() and getSource() would return the same string content. Also, the more typical, least-surprise Java API to use would be toString(), which presumably would also return the same thing. (And it does.)
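
As a hedged sketch, the default method presumably amounts to little more than the delegation we already saw in the evolved getImage(); the accessor names here are assumptions made for illustration, not the actual generated code.

// Somewhere in the generated Node.java interface (a sketch, not the actual code):
default String getSource() {
    return getTokenSource().getText(getBeginOffset(), getEndOffset());
}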

This would explain why the getImage() method is marked as @Deprecated. For one thing, the getImage naming dates back to when there really was an image field! So, on those grounds, it's already a bit confusing. But also, it is totally redundant. If we have a getSource() and a toString() that return the same thing, then what is getImage() good for??? (Well, the answer is the same as the answer to "War, what is it good for?" Absolutely nothing!) Well, not absolutely nothing, I guess. It remains there so as not to gratuitously break existing code. That's all.

But there is a little wrinkle here. In default, minimalist usage, getSource() and toString() would always return the same thing. However, there are tricky/advanced usages where you might want them to return something different. There is, after all, the whole problem of localization/internationalization/branding. It is easy enough to imagine that getSource() on a numeric literal token returns the string "1.3" while toString() returns "1,3". That is just one example off the top of my head. So, one way to override what is in the input could well be to set the image field. So, okay, if you must have it, then you set MINIMAL_TOKEN=false and you have an image field. (Though it is renamed to cachedImage, but that is private anyway.) And, actually, getImage/setImage are then there again, but deprecated in favor of getCachedImage/setCachedImage. But this is part of a general pattern. We have a minimal API that makes certain... well... minimalist assumptions, and you can make things more complex if you need to. But if you don't need any of this kind of aliasing, then you can leave the MINIMAL_TOKEN setting as true and avoid the extra complexity. I mean, if you don't need this, then you are really better off just not generating the extra code. Then the character content of your tokens is effectively immutable, and thus easier to reason about.

(On rereading the above, I realized that you don't actually need to set MINIMAL_TOKEN=false to override the "normal" rendering of the Token. For example, in the given example, if a numeric literal were represented by a NumericLiteral subclass, you could use INJECT to inject a toString() method that handles i18n. Just for example... But, you know, it suddenly occurs to me that a lot of best practices for CongoCC are maybe not even so established...)

Now let's consider the TokenSource API...

Preprocessor and Other Related Messiness

At some point in the evolution of all this, I decided to implement a preprocessor in CongoCC. Actually, it was not that there was such an urgent need for this. It was more of a "nice-to-have", but I intended to write a C# grammar, I needed to implement the preprocessor part of that, and I figured I could just reuse the C# preprocessor for CongoCC and kill two birds with one stone. That, by the way, turned out not to be quite possible, since I had misread (actually did not really read) the C# spec as regards the preprocessor and implemented something that did not quite follow the spec. So I had to reimplement the C# preprocessor but left the original one in place for CongoCC. Well, anyway, that C#-style preprocessor only really provides the ability to turn on/off lines in the input, but does not provide any aliasing. You can't write stuff like #define FOO foobar and things like that. (I am seriously considering adding some aliasing capability to the CongoCC preprocessor since it could be useful and it is already decoupled from the C# preprocessor anyway...)

Well, the upshot of this is that the TokenSource API actually needs two separate subSequence sorts of methods. Suppose you have some code snippet like:

 public void foo() {
    #if debug
        foo_debug();
    #else
        foo_nodebug();
    #endif
 }

Let's suppose the debug flag is not set. So, after the preprocessing step, this becomes:

public void foo() {
    foo_nodebug();
}

The preprocessor lines that start with #if, #else, and #endif are ignored from the point of view of lexing/parsing the code, as is the line containing foo_debug();. However, from a location point of view (which is what matters for error reporting), the non-ignored tokens on the foo_nodebug line need to report their locations as if the ignored parts were still present! I mean, if a compiler/interpreter reported error locations based on the preprocessed input, all the line numbers would be off!

Well, anyway, I don't care to get into more explanation of all this. At the level of this exposition, suffice it to say that there is various messy stuff going on if you want to have this kind of preprocessor functionality. And, for the most part, it is all dealt with internally.

BUT... suppose you don't need it. Well, this is what the USES_PREPROCESSOR setting is about. There is some extra code generated in TokenSource.java, such as isIgnored(offset), which tells you whether a given offset in the buffer is marked as "ignored" by the preprocessor. But, by default, USES_PREPROCESSOR is false, so that code is not generated. So, as in the case of the image field in Token, if you don't need it, it is simply not generated.
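
Just for flavor, here is a hedged sketch of the sort of check that becomes possible once that code is generated. Only the isIgnored(offset) method is taken from the paragraph above; the helper itself is purely hypothetical.

// Hypothetical helper: does this whole range consist of preprocessor-ignored text?
static boolean isWhollyIgnored(TokenSource tokenSource, int beginOffset, int endOffset) {
    for (int offset = beginOffset; offset < endOffset; offset++) {
        if (!tokenSource.isIgnored(offset)) return false;
    }
    return true;
}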

And preprocessing is hardly the only use case to consider. There is just the general problem of post-parse manipulation of the tree. Suppose I want to inject some content. So let's say I want to replace:

   int x = 7;

with:

   final int x = 7;

Well, this might amount to inserting a final token before the int token. We have API for that -- actually two approaches: we could handle it on the lexical level, by pre-inserting a final token before the int token, or we could handle it at the post-lexical or syntactic level, by inserting a child node (which is the token final) into the FieldDeclaration node. But note that in either case, the final token is a "synthetic" token in the sense that it does not actually occur in the parsed input but was synthetically inserted afterwards.
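
As a hedged sketch of the second (syntactic-level) approach: since the generated Node interface extends java.util.List<Node> and a Token is itself a terminal Node, the insertion is just an ordinary List operation. How the synthetic final token gets constructed in the first place is left out here, since that depends on settings like TOKEN_CHAINING discussed below; the helper method itself is hypothetical.

// Hypothetical helper: insert an already-constructed synthetic "final" token
// as the first child of the FieldDeclaration node.
void makeFinal(FieldDeclaration fieldDeclaration, Token syntheticFinalToken) {
    fieldDeclaration.add(0, syntheticFinalToken); // plain java.util.List API
}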

In any case, extra code to insert synthetic tokens into the token stream is only present if you have TOKEN_CHAINING=true in your settings. (With TOKEN_CHAINING on, MINIMAL_TOKEN is automatically off, by the way.)

Again, I am loath to get into all the various messy details of this sort of problem. These kinds of questions surely merit their own standalone essay relating to advanced usages of the tool. I would just point out that, in terms of error reporting, the token locations in int x = 7; have to be reported as if the final token was not present. However, the FieldDeclaration.toString() method should return the string with the final that was inserted. And the getSource() method should return the text without the final.

But, again, in minimalist usage, where you are not doing any of this token chaining, you would be surely better off having none of the code for that generated. So these are options that are off by default, but can be turned on.

Oh, and for some reason, I feel I should point out that I did not implement these things like TOKEN_CHAINING=true out of any speculative generality. It was based on a pre-existing need -- in particular, to deal with Python (or other languages with an indentation-based approach). So, the INDENT and DEDENT are dealt with as synthetic tokens that are inserted into the token stream. That was how I decided to deal with the problem. Maybe there is/are some alternative solution(s) that are as good or better, but I could not find them. As best I could figure, the token chaining approach was necessary to handle Python. But, of course, if you don't need any such tricky stuff in what you are doing, then why generate any code for it?

The legacy JavaCC tool has some disposition to deal with inserting synthetic tokens into the stream via the next field, but it tends to be very tricky and error-prone. As for the problems relating to implementing a preprocessor where certain parts of the input are marked as ignored, as far as I can tell, legacy JavaCC never addressed the problem at all. In fact, this has come up at least once in their community, and people have been told that they can run some separate preprocessor tool over their input. Of course, that ignores the fact that any resulting error messages would not be relative to the un-preprocessed input.

Now, getting back to the evolution of API, at some point, and this is quite recent, scanning over the code of TokenSource, I asked myself: why doesn't this object implement java.lang.CharSequence?

After all, TokenSource is, most of all, just a wrapper around a character buffer that has the additional functionality of keeping track of line/token locations and such. But there is no particular reason not to just treat the object itself as a "stringy" sort of object and have it implement CharSequence. So, it would have charAt(offset) and subSequence(begin,end) methods that would typically just delegate to the underlying character buffer (itself a CharSequence).

So then I started thinking along similar lines elsewhere. The generated Token class might as well implement CharSequence also... And this actually leads to some cute possibilities. For example, since the base constructor of a CongoCC-generated parser just takes a CharSequence as an argument, it is now perfectly possible to instantiate a parser using an existing token as its input. So, if the input to a parser for the Foo language -- FooParser, let's say -- contains a snippet of embedded Bar language code, that Bar snippet can be treated as just a Token by FooParser, but then that token can later be passed directly to the constructor of a BarParser.
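
In code, the idea is roughly the following hedged sketch. The FooParser/BarParser names come from the paragraph above; the Root() production, the rootNode() accessor, and the helper method itself are illustrative assumptions.

// Hypothetical: parse an embedded Bar-language snippet that FooParser treated as a single token.
Node parseEmbeddedBar(Token barSnippet) {
    BarParser barParser = new BarParser(barSnippet); // Token implements CharSequence
    barParser.Root();            // stands in for whatever the top-level Bar production is
    return barParser.rootNode(); // assumed accessor for the resulting parse tree
}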

One funny aspect of Token implementing CharSequence is that myToken.charAt(-1) has an automatic meaning, which is that it would (and, for the moment, does) return the character right before the token. (Though it would throw an exception if the token is at the very beginning of the input.) But if token.charAt(index) translates to tokenSource.charAt(beginOffset + index) and index is negative, then... I have mixed feelings about this because the semantics in other languages (Python at least) for having a negative number as an index are that seq[-1] translates into the last element of the sequence, i.e. is syntactic sugar for seq[len(seq) - 1]. Most likely, I'll leave it working the way it does though. After all, it can be useful, in particular, to know whether the character preceding (or following) a token is whitespace or a printable character, and this could be a (very terse) way of finding out.
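
For instance (the helper is hypothetical; the charAt(-1) behavior is exactly as described above):

// Is the character immediately preceding this token whitespace?
// (Throws if the token starts at the very beginning of the input.)
static boolean precededByWhitespace(Token token) {
    return Character.isWhitespace(token.charAt(-1));
}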

I'll conclude this essay here. It may cover too much ground for a single article. I wanted to provide some flavor of these changes to the generated API and, I guess, beyond that, the mind behind them. I would just make the final point that pretty much all of what I describe should not really affect existing users that much. Certainly not people who were using some relatively late version of JavaCC 21.

One key conceptual takeaway is that, by default, the parser that CongoCC generates supports what I call here a minimalist usage pattern, where you don't really need to manipulate the token stream or remove/insert nodes in the parse tree. For certain specific "fancy" usages, like the aforementioned "token chaining" or implementing a preprocessor that marks parts of the input as "ignored", there are settings that turn on the code for that. I am painfully aware that all of this is woefully undocumented. For now, if you want to understand token chaining, the best advice I could give you is to make a very careful study of the included Python grammar. Aside from that, since it is so poorly documented right now, if you have any questions about that kind of thing, we're happy to help you on the discussion forum.

I would also admit that some of this stuff may still be somewhat half-baked and is liable to be improved over the coming months. In particular, I think that the whole question of inserting synthetic tokens or nodes into the tree is likely to get more squared away as the fault-tolerant functionality matures. Fault-tolerant parsing is liable to involve inserting synthetic tokens into the stream, like tacking on a missing semicolon at the end of a statement and things like that.

As for the newest rename-itis sorts of changes, with methods like getChild() changing to get() and things like that, well, the older methods are all still there, but are deprecated, so when you compile, it will tell you that getChild() is deprecated and you should just use get(). But it all still works. Actually, I have no idea when (if ever) I'll really axe all the methods marked as deprecated. Probably I will eventually, just as I finally did remove support for legacy syntax. (This happened with the Congo rebranding.) However, the writing was on the wall for a long time, and for whatever changes are made, like finally getting rid of the deprecated methods, you'll have ample warning.

So, I hope you're along for the ride. It shouldn't be too rough, and well, if the alternative is to hitch your cart to a dead horse, then, well, I really think it should be an easy decision!