Tokens, Then and Now

I recently completed a major refactoring of how the generated Token API works in JavaCC 21. Though this constitutes a quite revolutionary set of changes under the hood, it may actually not be all that noticeable to many (or most) users. I think there are basically two groups of users (not mutually exclusive) who would be affected by this:

  • People who have upgraded from legacy JavaCC to JavaCC 21 with quite minimal changes to their grammar(s) and codebase; in particular, people who were using the LEGACY_API=true setting (which is now gone, by the way!)
  • People who were doing very tricky things with token chaining. In particular, this would involve setting the next field in various token objects, most likely inside a token hook method. (Given that the generated Token.java no longer has a next field, this is bound to be a problem!)

As I write this, I realize that this new state of affairs will require more than one article to clarify. In this first one, I will just outline what the main changes are, and in later articles I will explain how this new API allows you to write much more robust, elegant code yourself! But for now, please bear with me as I outline some of the (ancient) history that led to this point.

In the beginning...

When it comes to understanding JavaCC, and its current incarnation JavaCC 21, it is important -- or useful, at the very least -- to understand the historical timeline. As I outline here, the original JavaCC was mostly written in 1996, against JDK 1.0, which had its first non-beta release in January of that same year. As I write these lines, we are about a month away from the 26th anniversary of that event. In other words, JavaCC is just about as old as Java itself, written when Java was a very new, experimental language.

This was a point in time when the best practices and conventions of Java programming were not yet established, and that really should be borne in mind when looking at the original JavaCC and how it has evolved since then. So, getting down to some of the nitty-gritty, let us consider the field layout defined in the Token.java that the legacy tool generates:

public class Token {
   public int kind, beginLine, beginColumn, endLine, endColumn;
   public String image;
   public Token next, specialToken;
   ...
}

Most people reading this surely know that code like this, made up of publicly accessible fields, is considered to be very poor practice nowadays. It is just taken as a given that these fields should be encapsulated -- hidden behind accessors, a.k.a. getter/setter methods. Well, be that as it may, let's review this original layout. Here, the Token object stores 8 fields, specifically:

  • 5 integer fields
  • 1 String field
  • 2 Token fields

Of those, four of the five integer fields store location information: beginLine, beginColumn, endLine, endColumn. The remaining integer field, kind, is the token's "type" for parsing purposes. Again, this is vintage 1990s Java code. It is hard to imagine any recent Java programming project using a raw integer for this purpose, since it is such an obvious use case for type-safe enums, a language feature introduced in Java 5, released in September of 2004 -- as of this writing, a little over seventeen years ago.
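Just to make the contrast concrete, here is an illustrative fragment (not generated code; the token name IDENTIFIER and the doSomething helper are hypothetical). With a raw integer, the compiler accepts any int at all, whereas an enum restricts comparisons to actual token types:

// Legacy style: kind is a raw int, so the compiler accepts any integer, meaningful or not.
if (tok.kind == IDENTIFIER) doSomething(tok);
if (tok.kind == 9999) doSomething(tok);      // compiles fine, even if 9999 is not a valid kind

// Enum style: only values of the TokenType enum are accepted.
if (tok.getType() == TokenType.IDENTIFIER) doSomething(tok);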

Moving along, the remaining three fields are one String field, image, which stores the text of the Token, and two fields that are references to other tokens: next and specialToken. Those two fields, in particular the next field, have played a significant role in very sophisticated uses of JavaCC. A lot of JavaCC code in the wild makes use of the next field to manipulate the stream of tokens that the parser sees. In this and the articles that follow, we will call this token chaining.

Token chaining is actually very tricky and problematic, and the best solution (where possible) is simply to say no. However, it is sometimes genuinely necessary. For example, it is hard to imagine an implementation of a grammar for Python that does not make use of it to some extent. Python's lexical grammar basically requires the insertion of imaginary or virtual tokens into the token stream that represent indentation -- and the removal of indentation, i.e. INDENT and DEDENT tokens.
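To give a flavor of what that kind of token chaining looked like, here is a rough sketch of the legacy-style approach (not taken from any real grammar; the INDENT kind and the helper name are hypothetical): a virtual token is fabricated and spliced in front of a real one by rewriting the next field that the parser follows.

// Rough sketch of legacy-style token chaining, for illustration only.
Token insertIndent(Token tok) {
    Token indent = new Token();
    indent.kind = INDENT;              // hypothetical virtual token kind
    indent.image = "";                 // a virtual token has no source text
    indent.beginLine = tok.beginLine;  // borrow the location of the real token
    indent.beginColumn = tok.beginColumn;
    indent.endLine = tok.beginLine;
    indent.endColumn = tok.beginColumn;
    indent.next = tok;                 // the real token now follows the virtual one
    return indent;                     // the parser sees INDENT, then tok
}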

You've come a long way, baby!

Now let us fast-forward a couple of decades and take a look at the current JavaCC 21 Token API, as of November of this year, after the most recent round of refactoring. For a hypothetical Foo language, JavaCC 21 generates a rather different field layout:

public class Token implements FooConstants, Node {
    private TokenType type;
    private int beginOffset, endOffset;
    private FooLexer tokenSource;
    private boolean unparsed;
    private Node parent;
    ...
}

One noteworthy point here is that all of the fields that the Token object stores are now private, and thus the implementation is properly encapsulated. We see that the old

 public int kind;

is replaced by:

 private TokenType type;

Aside from being private, the type field is now a type-safe enum called TokenType, which is defined in the generated FooConstants.java file. One field in the JavaCC 21 token definition that was not present originally is the parent field. Since a Token is a Node in JavaCC 21 (at least potentially), it needs to "know" what its parent Node is. (There is no need to define any field(s) for child nodes in Token.java, since a Token is assumed to be a terminal node in the tree.)
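This means that, once a parse is done, you can navigate from any Token upwards into the tree it belongs to. For instance, a minimal sketch, assuming the usual getParent() accessor for the private parent field shown above:

// Walk from a token up to the root of the tree that contains it.
Node node = tok;
while (node.getParent() != null) {
    node = node.getParent();
}
// node is now the root of the parse tree containing tok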

Aside from the fact that the fields are now encapsulated and the kind integer has been replaced by a type-safe enum, it is noteworthy that every last one of the 8 fields in the original Token class has been refactored away. The four fields that held the line/column location information have been replaced by the two fields beginOffset and endOffset, which are the initial/final absolute offsets of this Token in the file it came from. Thus, the Token object no longer stores the line/column information; since it is really simpler and less error-prone to calculate this information on the fly, it delegates that task to the tokenSource object.

In fact, the various accessor methods getBeginLine(), getEndLine(), etcetera, have a default implementation in the base Node API. For example, the default implementation of getBeginLine is as follows:

default int getBeginLine() {
    FooLexer tokenSource = getTokenSource();
    return tokenSource == null ? 0 : tokenSource.getLineFromOffset(getBeginOffset());
}

Generally speaking, Node objects in JavaCC 21, including Token, which implements Node, delegate the task of figuring out their location in line/column terms to the tokenSource object. That object is also responsible for knowing what the tokens before and after this one are, which is why the next and specialToken fields are gone. (Actually, the next/specialToken fields in the original JavaCC were always quite tricky and unintuitive. The next field is the next parsed token, while specialToken is the previous unparsed token. All of that is really quite messy conceptually, and I think the new setup that I explain here should be much clearer.)
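To see just how messy, consider the sort of thing one typically wrote in legacy JavaCC to collect the comments preceding a token. This is a hedged sketch from memory of the old conventions, with the usual java.util imports assumed:

// Legacy-style sketch: unparsed ("special") tokens are chained backwards via
// the specialToken field, so you walk that chain and then restore source order.
List<Token> comments = new ArrayList<>();
for (Token s = tok.specialToken; s != null; s = s.specialToken) {
    comments.add(s);
}
Collections.reverse(comments);   // specialToken points backwards through the unparsed tokens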

Comments about backward compatibility

With some frequency, I find myself in conversations with people whose primary concern seems to be whether they can use JavaCC 21 as a drop-in replacement for the legacy JavaCC with absolutely no changes.

And then I have had to inform people that upgrading to JavaCC 21 is very likely to require some adjustments to their existing grammars. (At that point, these people usually disappear. Well, what is one to expect? Once people frame the question as being whether they can switch to JavaCC 21 with absolutely no adjustment, then obviously, I cannot compete with the legacy tool!)

In any case, as of the recent round of refactoring, it is not just likely that you would have to make certain adjustments to upgrade to JavaCC 21; it is an absolute certainty. Up until about a month ago, JavaCC 21 had a setting LEGACY_API which kept many of the old quirky things working. For example, if you had LEGACY_API=true at the top of a grammar, it would generate a Token.java file in which various fields (albeit contrary to every known best practice) were publicly visible, just as in legacy JavaCC. And thus, any older code that had things like tok.beginLine or tok.next could still work.

However, with the latest refactoring, it is simply impossible to keep such code working, since the various fields have been refactored away!

Now, maybe I have a chip on my shoulder about all this, but I feel I must say that it really is ridiculous and unfair to blame me for this state of affairs. If the legacy JavaCC had properly encapsulated these fields, as far as I can see, I could have kept the same API working indefinitely. Suppose, for example, the beginColumn field had only been accessible via a getter:

public int getBeginColumn() {
    return beginColumn;
}

I could keep that API working even after removing the beginColumn field. Moreover, many application programmers might not even realize that the field had been removed!
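For example, here is a sketch of what that could look like after the refactoring. This is purely hypothetical; getColumnFromOffset is assumed here by analogy with the getLineFromOffset method shown earlier:

// Hypothetical: the beginColumn field is gone, but the same public method
// survives by computing the column from the token's absolute offset.
public int getBeginColumn() {
    FooLexer tokenSource = getTokenSource();
    return tokenSource == null ? 0 : tokenSource.getColumnFromOffset(getBeginOffset());
}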

You see, dear reader, this is why you are supposed to encapsulate the fields!

In short, when an API has no such encapsulation and exposes public fields directly, it becomes impossible to do any very significant refactoring without breaking legacy code that relied on those exact fields being present (and publicly accessible). In such a situation, trying to retain backward compatibility amounts to imposing a straitjacket on oneself that eventually becomes untenable.

Now, it is true that the legacy JavaCC "community" has managed to go from a version 2.x to a version 7.x, five whole major version cycles, without ever addressing this sort of basic problem. As I see it, there are two possible explanations for this:

  • These people are master implementers and have managed to achieve something I myself could not: all this forward progress without breaking backward compatibility at all.
  • The aforementioned "forward progress" is basically fraudulent and, in fact, the legacy JavaCC is a zombie or nothingburger project.

One or the other. I leave it to the reader to decide which one corresponds to reality...

The Gigabyte is the new Megabyte Redux

Note that the refactored token API is much more powerful and expressive than what it replaces. For example, consider the following methods added to Token recently:

Iterator<Token> precedingTokens() 
Iterator<Token> followingTokens()

These two methods allow us to scan backwards or forwards from any token in a quite natural way -- in modern idiomatic Java. We could write something like:

 tok.followingTokens().forEachRemaining(t->{
     switch (t.getType()) {
         case IDENTIFIER : doSomething(t); break;
         case COMMENT : doSomethingElse(t); break;
         ...
     }
 });
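Scanning backwards is just as natural. For instance, here is a quick sketch of finding the nearest comment that precedes a given token (the COMMENT token type is illustrative, not from any particular grammar):

 Token precedingComment = null;
 for (Iterator<Token> it = tok.precedingTokens(); it.hasNext();) {
     Token prev = it.next();
     if (prev.getType() == TokenType.COMMENT) {
         precedingComment = prev;   // first comment we hit walking backwards
         break;
     }
 }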

You see, this is one annoying aspect of legacy JavaCC, in its neglected, abandoned state. The code it generates, and any client code that interfaces with the generated parser code, is basically cut off from modern Java idioms. Party like it's 1999! Certainly, one goal of the JavaCC 21 project (though not really its primary goal; it's more like icing on the cake) is to address this problem.

More importantly though, all of these refactorings are based on a fundamental change of approach.

While the original JavaCC was based on the assumption that a parser only had access to a small window into the input being parsed, a JavaCC 21 generated parser is based on the assumption that we have access to the entire file in memory. In fact, the first thing it does when parsing input is simply to read the entire file into memory.

In a sense, that is really the "secret sauce" that enables all of these elegant refactorings. Once you have all the input in memory, then certain things can be revisited. For example, strikingly, in JavaCC 21, there is really no need for a token to have an image field, since it can get that information from the tokenSource object. Thus, any Node object can get its source text based on the tokenSource object and its absolute begin/end offsets, so the base Node interface has the method getSource, implemented by default as follows:

default String getSource() {
    FooLexer tokenSource = getTokenSource();
    return tokenSource == null ? null : tokenSource.getText(getBeginOffset(), getEndOffset());
}

This works because we have access to the entire file we are parsing. And that is also why the earlier example of iterating over preceding tokens works: since we keep the information in memory, we can simply scan backwards over all the preceding tokens and fish out whatever information we need, or do whatever we want. By the way, that is not quite true when it comes to scanning forward. The analogous API, followingTokens(), if you are in the middle of a parse, gives you an iterator that goes up to the last token that has been scanned (and cached) so far. However, if the parse is finished, you should be able to scan through to the very last token, typically of type EOF.

Now, to be clear, the reason for this major refactoring is not just that memory is now absurdly cheap, i.e. that we do it simply because we can. Rather, it is that the ability to attack more interesting problems -- 21st century problems, if you will -- really does require us to be able to assume that we have access to all the input in memory. For example, to do error handling/recovery properly, we really need to be able to reset the parsing/tokenizing machinery to any arbitrary point in the file.

So, principally, this whole refactoring that I describe here (not completely; there are some follow-up posts in store) is really designed (quite carefully, by the way!) to make JavaCC a feasible tool for attacking problems for which it was not previously in the running at all.