Major Milestone: The Lexical Code Generation is completely rewritten

Originally published at: Major Milestone: The Lexical Code Generation is completely rewritten – JavaCC 21

Dear Readers… This blog has been dark for about two months now, but not because I have been inactive on the project. Quite the opposite, actually. What happened over the last couple of months is that I went into full-blown obsessive mode and managed to rewrite the remaining part of JavaCC that had been resisting my refactoring efforts: the lexer/scanner generation.

Made it ma! Top of the World!

I do not know if I can convey how happy, delighted, ecstatic I am about this. It was a very big deal for me psychologically because it was the last remaining piece to deal with, and now JavaCC 21 is, to all intents and purposes, a complete rewrite. Actually, I finished that work about two weeks ago, but was so mentally drained that I am only getting round to blogging about it now.

I was exhausted, I guess, but in a good way. After I completed that piece, I was so happy with myself that I was walking around with a big grin on my face for the next couple of days. (I have no idea what passersby thought I was so happy about. For all I know, they entertained various salacious speculations.)

With this piece of work, I have not only completed a full rewrite of the legacy JavaCC codebase; I have also dispatched its most challenging part. (I guess I was saving the best for last!) I don't know how many people are interested in understanding the nitty-gritty details, but in case you are, just take a look at the main starting point.

It's not just that this is some huge blob of... SHIT... with no real logic or structure. The entire file (over 3000 lines) does not contain a single comment. Well, to all intents and purposes, it does not. For example, look at the comment on line 863:

/* This function generates the bit vectors of low and hi
   bytes for common bit vectors and returns those that
   are not common with anything (in loBytes) and returns
   an array of indices that can be used to generate the
   function names for char matching using the common bit
   vectors. It also generates code to match a char with the
   common bit vectors. (Need a better comment). */

Need a better comment...

(Duhh, yeah, maybe... Of course, to write a better comment, you actually have to understand the code!) I had a shocking epiphany about a month ago when, after encountering that comment for the umpteenth time, I suddenly realized that I now understood it! Except that this occurred at a point where, after my latest refactorings, the comment was no longer relevant!

So I just deleted it.

But finally I understood it! Funny how that works. That comment had been in my codebase until quite recently and thus had sat there unchanged for at least 13 years (probably well over 20). It is about the only substantive comment in the 3000-line file. That, and these sorts of hand-rolled assertions (which predate Java's assert keyword, introduced in JDK 1.4) that occur on lines 725, 1755, 1812, 2219, and 2774. For example:

if (c >= 128) 
    throw new Error("JavaCC Bug: Please send mail to sankar@cs.stanford.edu");

That is presumably the email address of the main original author of JavaCC, Sriram Sankar, who was (I gather) a graduate student at Stanford in the 1990s and later worked at SunTest, the division of Sun Microsystems where JavaCC was originally developed.

I assume that that email address has not worked for at least 20 years. (I had a bit of email correspondence with Mr. Sankar over the last year, and he uses Gmail like everybody else nowadays!) Regardless, Mr. Sankar has not been involved in JavaCC for at least 18 years, so it would be rather pointless to write to him about a bug in the code, even in the very unlikely event that his stanford.edu email still worked. Mr. Sankar seems like quite a nice fellow and was very positive and complimentary about my efforts to resuscitate the JavaCC project.

Father, forgive them, for they know not what they do...

Generally speaking, I have never made any bones about the fact that I do not consider the legacy JavaCC codebase to be of very high quality. But I feel I have to make a certain point clear: I do not say that to denigrate the original authors, such as the aforementioned Mr. Sankar. The fact is that the original JavaCC codebase was written in the very early days of Java, against JDK 1.0.x, at a point in time when many of the core APIs we take for granted (like the collections classes) did not exist. Moreover, best practices for writing code in Java were not yet well established.

I would also add that, even at this point, in 2021, the best practices for writing code like this (code that generates code) may still not be widely understood. It is my view that any such app (of which JavaCC is just one example) should use templates for code generation. It seems pretty clear that the lead developer of our main competitor in this space, Terence Parr, does not disagree: ANTLR uses his own StringTemplate engine to generate code, while JavaCC 21 uses FreeMarker, the template engine of which I am the main author.

In fact, one of the biggest problems with the legacy JavaCC codebase is that it does not use any template engine to generate code. Though that is not its only problem, it may be the biggest single reason for the lack of any forward progress in that project since it was open-sourced in 2003. One can see how far they have gotten with an approach that eschews the use of the main tool (a template engine, any template engine...) that would allow the code to be properly structured and maintainable.

Now, let us consider the main FreeMarker template in the JavaCC 21 codebase, the one that generates the XXXLexer class. That template file is currently about 400 lines, but discounting comments and whitespace, a true count is more like 300 lines.

In fact, the real heart of it is in this nextToken() method.

That is the main loop of the nondeterministic finite automaton (NFA) algorithm that is at the heart of the tokenization machinery. The real guts of the NFA code are now separated out into another file, which is a holder for the various NFA-related data. The template that generates that is on the order of 200 lines. In short, the core of the lexical scanner machinery is generated in under 1000 lines: maybe 300 lines of FreeMarker and another 600 lines of Java.

Of course, legacy JavaCC could not express this nearly so tersely and elegantly, for starters because the rewritten version makes use of the functional interface API introduced in JDK 8 (around 2014). Basically, the template generates a table of function pointers, actually instances of an NfaFunction interface that I define, which collectively represent the NFA state machine corresponding to a given lexical state. The FTL (FreeMarker Template Language) code that generates this is contained mostly in this relatively short template.
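To make the function-pointer-table idea concrete, here is a minimal hand-written sketch of the pattern. Everything in it (the interface shape, the names, the toy one-state machine) is my own illustrative assumption, loosely modeled on the approach described above, not the actual generated JavaCC 21 code:

```java
import java.util.BitSet;

public class NfaSketch {
    // A hypothetical functional interface in the spirit of NfaFunction:
    // given the current character, it records successor states in nextStates
    // and returns a token type if this state accepts here, or -1 if not.
    @FunctionalInterface
    interface NfaFunction {
        int apply(int ch, BitSet nextStates);
    }

    // One table per lexical state, one entry per NFA state. Here: a toy
    // single-state machine that accepts one or more ASCII digits (type 1).
    static final NfaFunction[] NFA_FUNCTIONS = {
        (ch, next) -> {
            if (ch >= '0' && ch <= '9') {
                next.set(0);   // loop back to state 0 on another digit
                return 1;      // token type 1 (NUMBER) accepted so far
            }
            return -1;         // no transition on this character
        }
    };

    // The main loop: feed each character to every currently active state,
    // remembering the last input position where some state accepted.
    static int longestMatch(String input) {
        BitSet current = new BitSet();
        BitSet next = new BitSet();
        current.set(0);                      // start state
        int matchedLength = 0;
        for (int i = 0; i < input.length() && !current.isEmpty(); i++) {
            int ch = input.charAt(i);
            next.clear();
            boolean accepted = false;
            for (int s = current.nextSetBit(0); s >= 0; s = current.nextSetBit(s + 1)) {
                if (NFA_FUNCTIONS[s].apply(ch, next) >= 0) accepted = true;
            }
            if (accepted) matchedLength = i + 1;
            BitSet tmp = current; current = next; next = tmp;  // swap sets
        }
        return matchedLength;
    }

    public static void main(String[] args) {
        System.out.println(longestMatch("123abc")); // prints 3
    }
}
```

The payoff of this structure is that a code generator only has to emit the small lambda bodies and the table; the driver loop is fixed, which is presumably why it fits in a short template.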

The code on the Java side is in the package com.javacc.core, which is now about 2000 lines in its entirety; the real logic for constructing the NFA, principally in NfaBuilder.java and NfaState.java, amounts to very little code.

Taking Stock of Things

I really feel that cleaning up all the NFA generation code and getting it below 1000 lines total is a very major milestone in JavaCC 21 development. As I say above, the project is now, to all intents and purposes, a 100% rewrite. By the way, if you're wondering how I got from the starting point to this endpoint, here is an executive summary:

What I finally did was try to get rid of everything in the original code that was not strictly necessary. There were all kinds of tricky bits of code that existed only for optimization. (I won't get into all the ugly details here, though I may explain them in a separate post.) I figured that stripping out every single optimization would make the code run somewhat slower, but I reasoned that if it stayed within a binary order of magnitude (that is, no more than twice as slow), and in return I had the thing rewritten in a readable, maintainable manner, that was a good tradeoff.

Also, I undertook the aforementioned code simplification in large part so that I could finally acquire a conceptual grasp of what it really did.

But...

It turned out that once I had the thing reduced to its minimal algorithmic expression, the resulting generated code was dramatically slower: probably 10x or even 15x slower. That was more of a slowdown than I considered acceptable, or than an end-user could (or should) be expected to accept.

I put in some of my own optimizations (rather naive, bloody-minded ones, in retrospect...) and got the code to about 4x slower than legacy, which I thought was still too much of a slowdown. I then investigated the question further and found that there is actually a whole literature on optimizing NFA state machines. I had run across this material before but never really read through it, because I had found it quite inaccessible. (I will likely blog about all this in more detail in the coming days...)
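One recurring idea in that literature is bit-parallel NFA simulation: pack the set of active states into a machine word so that a single arithmetic step advances every state at once. As a hedged illustration, here is the classic Shift-And technique (Baeza-Yates and Gonnet) applied to simple substring search; this is a textbook example of the genre, not necessarily the optimization JavaCC 21 actually applies:

```java
public class ShiftAnd {
    // Returns the index of the first occurrence of pattern in text, or -1.
    // Assumes the pattern has at most 64 characters (one bit per NFA state)
    // and that all characters are ASCII.
    static int find(String text, String pattern) {
        int m = pattern.length();
        long[] mask = new long[128];            // per-character transition masks
        for (int i = 0; i < m; i++) {
            mask[pattern.charAt(i)] |= 1L << i; // bit i set if pattern[i] == c
        }
        long state = 0L;                        // bit i: states 0..i matched so far
        long accept = 1L << (m - 1);            // final-state bit
        for (int i = 0; i < text.length(); i++) {
            // One shift and one AND advance all 'm' NFA states simultaneously.
            state = ((state << 1) | 1L) & mask[text.charAt(i)];
            if ((state & accept) != 0) return i - m + 1;
        }
        return -1;
    }
}
```

The appeal for a lexer generator is obvious: instead of iterating over a set of live states per character, the whole frontier advances in a couple of word-level operations.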

So, I had managed a massive cleanup of the code, but now needed to re-optimize, since the speed difference was unacceptable. (My code simplification had gone too far, I guess!) It turned out that after applying one well-known optimization, plus another one I came up with separately, I got the code within a binary order of magnitude of the legacy code. I was going to leave it at that, because various other matters required my attention (not just in JavaCC!). At that point, depending on your exact use case (one's mileage always varies), the newer lexical scanning code was about 30 to 40% slower than legacy. However, in the days since an early draft of this blog article, I have got the thing back to around par. At least, tokenizing Java code (which is my main benchmark) is now about the same speed as before. Actually, I still think there is a bit of optimization left on the table, so I now believe I can get this a shade faster than the legacy code was.

The overall lexical generation code is still about 1000 lines.

Well, the truth of the matter is that the legacy code for this was classic Rube Goldberg contraption code. Frankly, I look at it and just marvel that it actually works. Yes, it works, but that is setting the bar quite low. The problem is that it provides no basis for doing anything more. It cannot be developed any further from that point; nobody can really pick it up and do anything with it. Most likely even the original author, Sriram Sankar, not having looked at it in two decades, could not easily come back and do anything else with this code!

It's the only game in town!

I will make one final point about all this now. People in this space really do need to understand that my work on JavaCC, the resuscitated JavaCC 21 project, is the only game in town. (By "town", I mean JavaCC broadly speaking. If you want to get involved with ANTLR, or something else entirely, that is another matter, of course.) But if you like JavaCC and its basic approach, if you find it more usable, and in particular if you want to make the intellectual investment of understanding how the code works, so that you feel some sort of mastery over the system... in that case, this refactored version of the code is the only feasible basis to build on.

Now, to be clear, yes, you can use the legacy tool. (It's not clear to me why an informed person would, given that it has quite a few well-known bugs that are all fixed in JavaCC 21.) However, even aside from that, what is more fundamental to understand is this: if a tool like this is an important part of your toolset, and you aspire to a white-box relationship with it, gradually acquiring an understanding of how it works and maybe eventually adding the features you need yourself, then this project is the only option.

This is simply because (as you can surmise from the exposition above) the code has been gradually beaten into a state where it is tractable. And that is simply not the case with the legacy codebase. So, unless you were willing to undertake a similar code cleanup yourself...

But why would you? I've done it for you!

(I felt a tad uncomfortable writing that last passage. You would think it could be left mostly unsaid and that, based on the facts I outline above, people would get the message. But it seems that many people just don't understand this. Granted, I am not (and cannot possibly be) an impartial, objective observer, and nobody has to believe me, but I put it to you that you might save yourself quite a bit of wasted time and aggravation by taking what I say above at face value!)