The main goal of the JavaCC 21 project, soon to be renamed as CongoCC, is to develop a very capable, practical, usable parser generator tool. However, there is another overarching goal -- perhaps not so much a goal, as a philosophy. That can be summed up in a single word:
Demystification
You see, somehow or other, some impenetrable mystique has been built up around this application space. People out there believe (whether they say it explicitly or not) that parser generator tools -- such as the venerable YACC, Bison, JavaCC, Antlr... -- are the preserve of the guru.
"Yeah, man. Forget about mastering this sort of tool unless you are part of the inner circle of humanity."
"Aye, my son. As the good book sayeth: many are called but few are chosen."
Well, in short, people really do seem to be intimidated by these things. Now, granted, there are some conceptual hurdles to overcome. But even taking that into account, people seem to build this up into a bigger deal than it should be. And, as a result, these sorts of parser generator tools seem to be surprisingly under-utilized out there.
Well, this is a big ball of wax and I can really only demystify one thing at a time. Today is the turn of the NFA (a.k.a. nondeterministic finite automaton), which is the heart of how the lexical machinery works. What JavaCC 21 (and other similar tools) do is generate code that implements an automaton that scans a stream of characters and groups them into tokens.
Oops, there I go myself, using this sort of jargon -- automaton, lexical, token... I'm already starting to be guilty of what I am criticizing! You know, obfuscating things with this sort of impenetrable jargon. I think it best to wind back to some first principles.
Complex software is built in layers.
I think that one could further say, without courting too much controversy, that what characterizes a well designed system is that the various layers are decoupled from one another -- well, to the maximum extent possible anyway. So, in this vein, a parser generator such as JavaCC is built around a very basic separation of concerns. It typically partitions the problem into lexical recognition (a.k.a. tokenization) and the syntactic side (a.k.a. parsing). Of those two, lexical recognition is the lower-level task. It consists of breaking the input into tokens, which are roughly analogous to words in natural language. Thus, tokenization can be thought of as akin to identifying individual words in English or some other language. Parsing, on the other hand, is about taking that stream of tokens and parsing them into grammatical elements -- which, in a natural language, would be things like conditional or subjunctive clauses, and then full sentences and paragraphs and so on. With an object-oriented programming language, we would be parsing the stream of tokens into declarations of classes, methods, variables, and so on.
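For example, given a line of Java like

int count = args.length;

the lexical side just sees a flat sequence of tokens -- roughly: int, count, =, args, ., length, ; -- and it is the parsing side's job to recognize that whole sequence of tokens as a local variable declaration.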
So, to be perfectly clear, this article is solely about demystifying the lexical side, breaking the input into tokens. Other articles will demystify the parsing side. Without much further ado, let's start getting our hands dirty, shall we?
A Minimal Lexical Grammar
Now, let us define a minimal lexical grammar in the format used by the JavaCC 21 parser generator. Just paste the following into your favorite text editor and save it as a plain text file.
BASE_NAME="";
TREE_BUILDING_ENABLED=false;
TOKEN :
<AsNoB : ("a")+ >
|
<AB : "ab">
;
INJECT LEXER_CLASS :
{
static public void main(String[] args) {
StringBuilder input = new StringBuilder();
for (String arg : args) {
input.append(arg);
}
System.out.println("Input is: " + input + "\n");
LEXER_CLASS lexer = new LEXER_CLASS(input);
while (true) {
Token tok = lexer.getNextToken();
System.out.println("\"" + tok.getImage() + "\" of type " + tok.getType());
if (tok.getType() == TokenType.EOF) break;
}
}
}
The above toy example should be easy enough to build and run. You do need JavaCC 21. You can download the latest pre-built jarfile here and, from the command line you can build the above grammar like so:
java -jar javacc-full.jar Lex1.javacc
In the above, I use the name Lex1.javacc because it has to be called something, but actually the filename is irrelevant. We do assume, of course, that you have java and javac executables on your path. If your javacc-full.jar is in a different directory, you would, of course, have to specify its location on the command line.
Once you have all that sorted out, the above command line should generate the following files in the same directory as your grammar file:
Constants.java
InvalidToken.java
Lexer.java
NfaData.java
Token.java
And you should be able to compile them with:
javac Lexer.java
The Lexer class has a little main() test harness injected into it so you can try out your lexer on various sample inputs. So, do try out the following three commands and see what happens.
java Lexer ab
and now:
java Lexer ab aaa
and:
java Lexer ab cd aaa
What the test program does is simply take the input and break it into tokens based on the lexical rules we specified. (All two of them!) Thus, the first command above produces the output:
Input is: ab
"ab" of type AB
"" of type EOF
The second command gives us:
Input is: abaaa
"ab" of type AB
"aaa" of type AsNoB
"" of type EOF
Note, by the way, that the input passed to the lexer does not include any spaces since the test harness just concatenates all of the strings together. Or, in other words, our Lexer just scans in the input without any whitespace. The purpose of any whitespace on the command line would really just be for human readability in this case.
Now, probably one of the first things you will observe is that the lexer generates a final EOF token when it reaches the end of the input. So, for our first incursion into the code, let's take a look at the generated Constants.java file. I reckon it looks an awful lot like this:
/* Generated by: JavaCC 21 Parser Generator. Do not edit. Constants.java */
public interface Constants {
/**
* The various token types. The first type EOF
* and the last type INVALID are auto-generated.
* They represent the end of input and invalid input
* respectively.
*/
public enum TokenType {
EOF, AsNoB, AB, INVALID
}
/**
* Lexical States
*/
public enum LexicalState {
DEFAULT
}
}
You see there that there is an enum called TokenType that contains four elements. Well, we only defined two token types in our grammar, AsNoB and AB. The other two, EOF and INVALID, are always generated and they represent the end of input and invalid input respectively.

We can see INVALID in action in the third example above. It produces the output:
Input is: abcdaaa
"ab" of type AB
"cd" of type INVALID
"aaa" of type AsNoB
"" of type EOF
So you see what it does. The first two letters, a followed by b, can be matched as an AB token type. But then it hits two letters that it certainly cannot match. You see, our little 2-rule lexical grammar only uses two input symbols, a and b, so any other letters really must be invalid input! After wrapping up the c and the d as an INVALID token, the lexer finds that it can match the next three letters as AsNoB, and then it hits the end of the input, so we get an EOF token.
Can you say Nondeterministic Finite Automaton? (No, but I can say "egg mcmuffin".)
Arthur C. Clarke famously said that any sufficiently advanced technology is indistinguishable from magic, so we could just leave it at that. However, since our goal is to demystify all of this, we shall now have a look at the generated code.
Take a deep breath and bring up the generated NfaData.java file in your text editor of choice. Yes, we are wandering into the lair of the beast here. Here is where the state machine, the NFA, is defined:
private static class DEFAULT {
static private TokenType NFA_0(int ch, BitSet nextStates, EnumSet validTypes) {
TokenType type= null;
if (ch== 'a') {
nextStates.set(1);
nextStates.set(2);
if (validTypes.contains(TokenType.AsNoB)) type= TokenType.AsNoB;
}
return type;
}
static private TokenType NFA_1(int ch, BitSet nextStates, EnumSet validTypes) {
TokenType type= null;
if (ch== 'b') {
if (validTypes.contains(TokenType.AB)) type= TokenType.AB;
}
return type;
}
static private TokenType NFA_2(int ch, BitSet nextStates, EnumSet validTypes) {
TokenType type= null;
if (ch== 'a') {
nextStates.set(2);
if (validTypes.contains(TokenType.AsNoB)) type= TokenType.AsNoB;
}
return type;
}
static private void NFA_FUNCTIONS_init() {
nfaFunctions= new NfaFunction[]{DEFAULT::NFA_0, DEFAULT::NFA_1, DEFAULT::NFA_2};
}
Yep, that is the state machine that implements the 2-rule lexical grammar that we defined. Here is how the basic algorithm works in this case. All of these states are like little components that only do one very specific task. They read in a character and, based on that, do one (or both or none) of the following:
- They pass back a TokenType to say that we have a match for that type.
- They activate one or more of these states for the next iteration of the loop. That is what that nextStates parameter is about.
On the first iteration of the loop, reading in the first character of the token, the machine is in its initial state, represented by NFA_0. That is this method:
static private TokenType NFA_0(int ch, BitSet nextStates, EnumSet validTypes) {
TokenType type= null;
if (ch== 'a') {
nextStates.set(1);
nextStates.set(2);
if (validTypes.contains(TokenType.AsNoB)) type= TokenType.AsNoB;
}
return type;
}
This corresponds to the initial state of our NFA. It receives the character input, which is the ch parameter. You see by eyeballing the code that this state can only deal with one character, a, and this corresponds to the fact that we only have two types of token and both of them have to start with an a! So let us consider two possibilities:

- The first character read in is not an a.
- The first character read in is an a.
In the first case, our saga is basically over before it began. For one thing, you can see that the NFA_0 routine is going to return null and it is also not setting any of the bits in nextStates. Now, setting bits in nextStates is how we set which states are used in the next iteration of the loop. But if no states are set, the loop is going to terminate because there is nothing to do on the next iteration! Moreover, since we did not return any TokenType, just a null, our basic NFA loop is just not going to match any token. (You can anticipate that this is how we end up with a Token of type INVALID. We are going to read in characters until we can start the loop again. And those invalid characters are going to be collected in a token of type INVALID!)
So, now let's consider what happens when we do read in an a at this point. The lines:
nextStates.set(1);
nextStates.set(2);
amount to setting NFA_1 and NFA_2 to be active on the next iteration of the loop. But note that, regardless of what happens on the next iteration, we already have matched a token type, which is AsNoB. As you see, the lone a that we read in can already be matched as an AsNoB, which consists of one or more a's. Now, if there is no more input, we're going to have that AsNoB token that we just matched and then the machinery will produce an EOF token and we're finished. But let's assume that there is more input, at least one more character. So here are the other two states that are now active for the next iteration:
static private TokenType NFA_1(int ch, BitSet nextStates, EnumSet validTypes) {
TokenType type= null;
if (ch== 'b') {
if (validTypes.contains(TokenType.AB)) type= TokenType.AB;
}
return type;
}
static private TokenType NFA_2(int ch, BitSet nextStates, EnumSet validTypes) {
TokenType type= null;
if (ch== 'a') {
nextStates.set(2);
if (validTypes.contains(TokenType.AsNoB)) type= TokenType.AsNoB;
}
return type;
}
Since both states 1 and 2 were activated in the last iteration, we are going to execute both of the above methods. So now, there are basically three possibilities to consider:

- The next character is an a.
- The next character is a b.
- The next character is neither an a nor a b.
If the next character is an a, NFA_1 does nothing, because it only reacts to a b. However, the NFA_2 state will set NFA_2 (i.e. itself) as active in the next iteration and also it will return the token type AsNoB. Now that we have read in two a's, we can match them both as a single AsNoB token, right? Moreover, on the next iteration, the only active state is that NFA_2 state, which will read in an a (or not). Once it fails to read in an a, the loop is over.
If the next character is a b, NFA_2 does nothing, but NFA_1 returns the token type AB as a match. However, the loop ends there because no states have been set for the next iteration. This corresponds to the fact that, once we have matched an a followed by a b, there is nowhere further to go. We have to start a new token.
If the next character is neither an a nor a b, then neither NFA_1 nor NFA_2 has any effect, so the loop will end there, but at least we did match the first character as an AsNoB.
That is the 3-state machine that our little lexical grammar generated and this is how it is expressed in code -- the initial state NFA_0 and two other states.
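Since the whole point here is demystification, it may help to see how such state functions get driven. Here is a small, self-contained sketch of that kind of loop. To be clear, this is my own simplified illustration, not the actual generated nextToken() routine (we point to the real thing in the end notes); the NfaFunction shape and the class and method names here are just inferred from the NFA_0, NFA_1 and NFA_2 methods shown above. The basic idea is: keep a set of active states, feed the next character to each of them, remember the longest match reported so far, and stop when no states remain active.

import java.util.BitSet;
import java.util.EnumSet;

// Toy re-creation of the A-B state machine plus a greedy-matching driver loop.
public class MiniNfaDemo {

    enum TokenType { EOF, AsNoB, AB, INVALID }

    // Shape inferred from the generated NFA_0/NFA_1/NFA_2 methods above.
    interface NfaFunction {
        TokenType apply(int ch, BitSet nextStates, EnumSet<TokenType> validTypes);
    }

    static final NfaFunction[] STATES = {
        // State 0: on 'a', activate states 1 and 2 and report AsNoB
        (ch, next, valid) -> { if (ch == 'a') { next.set(1); next.set(2); return TokenType.AsNoB; } return null; },
        // State 1: on 'b', report AB (no follow-up states)
        (ch, next, valid) -> ch == 'b' ? TokenType.AB : null,
        // State 2: on 'a', stay in state 2 and report AsNoB
        (ch, next, valid) -> { if (ch == 'a') { next.set(2); return TokenType.AsNoB; } return null; }
    };

    // Run the loop from position 'start' and print the longest match found there.
    static void scanOne(String input, int start) {
        BitSet current = new BitSet();
        current.set(0);                                   // begin in the initial state
        EnumSet<TokenType> valid = EnumSet.of(TokenType.AsNoB, TokenType.AB);
        TokenType bestType = TokenType.INVALID;           // fallback if nothing matches
        int bestEnd = start + 1;
        int pos = start;
        while (pos < input.length() && !current.isEmpty()) {
            int ch = input.charAt(pos++);
            BitSet next = new BitSet();
            for (int s = current.nextSetBit(0); s >= 0; s = current.nextSetBit(s + 1)) {
                TokenType t = STATES[s].apply(ch, next, valid);
                if (t != null) { bestType = t; bestEnd = pos; }   // longest match so far
            }
            current = next;                               // states active for the next character
        }
        System.out.println("\"" + input.substring(start, bestEnd) + "\" of type " + bestType);
    }

    public static void main(String[] args) {
        scanOne("ab", 0);      // "ab" of type AB
        scanOne("aaaab", 0);   // "aaaa" of type AsNoB
        scanOne("cd", 0);      // "c" of type INVALID (the real lexer groups a whole run of invalid characters)
    }
}

If you run this, you get the same answers as the generated lexer on those inputs, which is the whole point: there is nothing magical going on, just a loop over a handful of tiny per-state functions.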
Show me the MONEY CODE!!!
Before moving on to some other points about this sort of state machine approach, let me point out that the above exposition is somewhat unconventional. It is very hard (or impossible?) to find any introduction to this NFA state machine topic that is based on actual code -- I mean, code in some language that actually compiles and runs!
Consider the Wikipedia page on Nondeterministic finite automaton. The page does contain various information -- of a theoretical and historical nature. For example, apparently, "NFAs were introduced in 1959 by Michael O. Rabin and Dana Scott, who also showed their equivalence to DFAs." (They rely on the reader to infer that a DFA is a deterministic finite automaton, which admittedly, does stand to reason, I suppose.) There are various flow-chart diagrams and a lot of mathematical notation, but not a single line of code. Well, as they say, different strokes for different folks. If you can figure this out from that Wikipedia page, that is fine. (And I would add: You are a better man than I, Gunga Din!) But, if you have a mind-set more like mine and prefer to come to some conceptual mastery of such things using actual code, then you might prefer the approach taken here.
Real men do (fill in the blank)
In the above we defined an extremely simple lexical grammar with only two input symbols, a and b, and two lexical rules. Those two rules are:

- We can match any number of repeating a's.
- We can match the string ab.
And that's it. We see that this generates a finite state machine with three states, the initial state and two other ones. Now, I suppose it goes without saying that if all we ever wanted to handle was something on this order of complexity, the whole thing would surely not be worth the candle. I mean, what is this? We read in a character. If it's not an a, then it's invalid. If it is an a, then we read in any number of subsequent a's, or just a lone b. Surely we could just directly code something like that in a few lines and save ourselves a lot of bother.
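And indeed we could. Just to make that concrete, here is roughly what a hand-rolled tokenizer for this two-rule language might look like. (This is my own illustration, with a made-up class name; it is not anything JavaCC 21 generates, though it happens to reproduce the same greedy behavior on the examples we have seen.)

// A hand-written tokenizer for the toy A-B language: runs of a's, or the string "ab".
public class HandWrittenABLexer {
    public static void main(String[] args) {
        String input = String.join("", args);
        int pos = 0;
        while (pos < input.length()) {
            int start = pos;
            if (input.charAt(pos) != 'a') {
                // Neither token type can start with anything but an 'a'.
                while (pos < input.length() && input.charAt(pos) != 'a') pos++;
                System.out.println("\"" + input.substring(start, pos) + "\" of type INVALID");
            } else if (pos + 1 < input.length() && input.charAt(pos + 1) == 'b') {
                pos += 2;   // exactly "ab"
                System.out.println("\"ab\" of type AB");
            } else {
                while (pos < input.length() && input.charAt(pos) == 'a') pos++;   // one or more a's
                System.out.println("\"" + input.substring(start, pos) + "\" of type AsNoB");
            }
        }
        System.out.println("\"\" of type EOF");
    }
}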
Of course, the problem is that, outside of little tutorials for pedagogical purposes (like this one!) we're not working with little two-character languages. Now, just for the fun of it, suppose we decided that real men write their state machines by hand and we consider the magnitude of the task. Consider a real programming language -- Java, let's say. Let's say we want to tokenize some Java code. We start off a new token and let's say the first character is i. Then what?
Well, for starters, that sole i does stand on its own as an IDENTIFIER type. However, depending on the characters that follow, that starting i could be the start of any of the following:
- if
- import
- implements
- instanceof
- int
- interface
- some longer identifier, like if000 or instance or irritate
In other words, all the reserved words in Java that start with an i OR an identifier -- the lone i or something longer. So, now we read in the next character. The simplest case is that this next character is not a letter, digit, underscore, or dollar sign. In that case we just have the single-character identifier i. But if the next character is a letter, say, but not an f or an m or an n, then we know this is an identifier, at least two characters long. And we also know we are going to enter a loop that will match as many of these identifier-part characters as it can. Now, if the next character after the i is an f, we have matched the Java reserved word if, but we still need to see if the next character is an identifier part. If it's not, then we're through and we have matched the if. If it is, we go into the same aforementioned loop where we just scan forward to the end of the identifier.
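To give a flavor of what hand-coding even this one little corner of the problem might look like, here is a small sketch (again, my own illustration with made-up names, not anything the tool generates). Note that a human writing this by hand would probably cheat a bit: scan the whole identifier-like chunk of characters first and then check it against a table of reserved words, rather than branching character by character the way the generated state machine does.

// Illustrative hand-written handling of a token that begins with 'i' in Java source.
public class IKeywordSketch {

    static final String[] I_KEYWORDS = {
        "if", "import", "implements", "instanceof", "int", "interface"
    };

    // 'start' is the index of the initial 'i'. Returns a description of the token.
    static String scanTokenStartingWithI(String input, int start) {
        int end = start + 1;
        // Gobble the rest of the identifier-like characters (letters, digits, _, $).
        while (end < input.length() && Character.isJavaIdentifierPart(input.charAt(end))) {
            end++;
        }
        String text = input.substring(start, end);
        for (String kw : I_KEYWORDS) {
            if (text.equals(kw)) return "reserved word " + kw;
        }
        return "identifier " + text;
    }

    public static void main(String[] args) {
        System.out.println(scanTokenStartingWithI("if (x > 0)", 0));   // reserved word if
        System.out.println(scanTokenStartingWithI("if000 = 3;", 0));   // identifier if000
        System.out.println(scanTokenStartingWithI("i", 0));            // identifier i
    }
}

Even so, multiply this kind of fiddly case analysis by every token type in the language and the appeal of generating it all from a declarative grammar becomes obvious.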
One could conceivably identify all the different states that this sort of state machine for scanning Java code would have. How many? Well, the Java lexical grammar defined here generates a state machine with 427 different states and the resulting generated JavaNfaData.java file is something over 4000 lines. If we did, just as a conceptual experiment, write this out by hand, it would simply be a question of coding those 427 different states -- not very different in principle from our 3-state, 2-character A-B language. Just more work, I suppose.
Now, to be absolutely fair, one possible disadvantage of using a code generation tool for this problem is that it requires one to think in ways that are not necessarily natural for a person.
Going back to our little toy A-B grammar, consider this test input:
java Lexer aaaab
That provides the output:
Input is: aaaab
"aaaa" of type AsNoB
"b" of type INVALID
"" of type EOF
This is kind of interesting because we have two kinds of tokens, the a's repeating and an "ab". You would think that the above input aaaab could (or should) be matched as:
"aaa" of type AsNoB
"ab" of type AB
"" of type EOF
But nooooo. You see, this might be what a human being would do. A human would consume one a, then another one, and then another one. And then he could realize that if he consumes that last a, then he won't be able to do anything on the next round of tokenization. The b is going to be a problem. All of our tokens (all 2 of them!) need to start with an a! But hey, if we could just stop after the third a, then we are left with ab and everything squares away very nicely. So, hey, why doesn't the little man in the machine do that?
Well, no. The little man in the machine is not going to do that because there is no little man in the machine. There's just this NFA, this state machine that's going to consume as many a's as it can and then is going to run into a wall, basically, when it has to start a new token with a b, which it can't. Now, I have to admit that I am no regexp maven. I believe, however, that there are more sophisticated regexp engines that do backtrack and look ahead and so on. (And that may be at a considerable cost in terms of performance...) However, the basic algorithm in JavaCC 21 (like in the legacy JavaCC before it) has none of that. It does not look ahead or backtrack. It is a straightforward implication of the principle of greedy matching that the resulting state machine is not going to scan aaaab as aaa followed by ab. It will (greedily) scan the aaaa and then fail on the following b. It is understandable that you could want (or even expect) it to do something else, but it won't.
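Just to see exactly where that behavior comes from, here is a trace of aaaab through the three state functions shown earlier:

iteration 1: active {NFA_0}, reads 'a' -> matches AsNoB ("a"), activates {NFA_1, NFA_2}
iteration 2: active {NFA_1, NFA_2}, reads 'a' -> matches AsNoB ("aa"), activates {NFA_2}
iteration 3: active {NFA_2}, reads 'a' -> matches AsNoB ("aaa"), activates {NFA_2}
iteration 4: active {NFA_2}, reads 'a' -> matches AsNoB ("aaaa"), activates {NFA_2}
iteration 5: active {NFA_2}, reads 'b' -> no match, activates nothing, loop ends

The longest (and only) match is the aaaa as AsNoB. The next tokenization loop then has to start on the b, which NFA_0 cannot match, so it ends up in an INVALID token.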
Kludge Alert! (Don't try this at home, kids.)
Well, I was pondering this whole thing and finally, I did see a solution to this. We could, in a pinch, "fix" our A-B grammar to deal with this problem, it turns out. Consider:
TOKEN :
<AsNoB : ("a")+ >
{
if (bufferPosition < content.length()
&& content.charAt(bufferPosition) == 'b') {
matchedToken.setEndOffset(--bufferPosition);
}
}
|
<AB : "ab">
;
That's the same lexical grammar as before but we add a code action after AsNoB. If the next character to be read is a b, we basically take the last a out of the token and put it back on the input stream. So, in the case of aaaab, it sees the b that is coming and "realizes" that it needs to put that last a back, and so now, with that code action added, we can rebuild:
java -jar javacc-full.jar Lex1.javacc
javac Lexer.java
java Lexer aaaab
and voilà!
Input is: aaaab
"aaa" of type AsNoB
"ab" of type AB
"" of type EOF
So, here, it is seemingly behaving in a clever way. It doesn't match the fourth a because it needs to leave it there so that we can start the next iteration with an a and thus match the ab. Except it only looks like that. (There is still no little man in the machine!) It actually did greedily (or mindlessly) gobble all four a's, but then we have a code action that sees that the next character is a b (Oops!) and makes a little adjustment. It effectively puts the last a back by decrementing the bufferPosition variable and also setting the endOffset of the token to that -- i.e. shortening it by one character. So now it will start the next tokenization loop on that final a instead of the following b. (Tricky.)
Well, dropping down into Java code to handle this case is perhaps not so ideal, since it requires using things normally reserved for internal use, like directly accessing and adjusting the content and bufferPosition variables. But hey, sometimes a man's gotta do what a man's gotta do, right? (Though, given that real men code all this state machine stuff by hand anyway...)
Some End Notes
The BASE_NAME="" option above means that we name the generated files based on that string, so with it set to the empty string, we simply generate Lexer.java, NfaData.java, etc. If, for example, we had BASE_NAME="Foo", it would generate FooLexer.java, FooNfaData.java, etc. If BASE_NAME is omitted, then by default it is deduced from the name of the file. So a grammar file named JSON.javacc will generate JSONParser.java, JSONLexer.java, etc.
The reason for TREE_BUILDING_ENABLED=false is just so that it doesn't generate certain extra files, such as Node.java and BaseNode.java, that relate to building a tree of nodes and aren't needed for this purely lexical example.
The reason that I write:
INJECT LEXER_CLASS :
and:
LEXER_CLASS lexer = new LEXER_CLASS(input);
rather than simply writing INJECT Lexer : etcetera is that it is better style. If you use the alias LEXER_CLASS (or PARSER_CLASS or whatever), the code snippet is more robust. If you later decide that you do want the generated class to have some different name, like FooLexer or FooParser, the code snippet continues to work unchanged.
If you really, really want to get your hands dirty, the algorithm to convert regular expressions into an NFA is implemented here. That is the main part, though actually NfaBuilder builds a lot of NfaState objects that are basically dummy states, a kind of scaffolding that is later removed to get down to the minimal set of states. Certain states can be consolidated into composite states. All of this is expressed in maybe 700 lines of Java code in this package. The final data structure is exposed to this FreeMarker template that generates the XXXNfaData.java file. Each generated NFA_XXX state function is an instance of the NfaFunction functional interface defined here.
The main NFA loop is in the nextToken() routine, in particular here.
Generally speaking (at least in my own experience) the easiest way to gain familiarity with writing lexical grammars is just by looking at various ones in the wild. You can see a full lexical grammar for Java right here. You could even build it and view the NFA code:
git clone https://github.com/javacc21/javacc21.git
cd javacc21/examples/java
java -jar ../../bin/javacc.jar JavaLexer.javacc
javac JavaLexer.java
And you can look at the generated JavaNfaData.java to see the code that is generated to tokenize Java code. There is a lexical grammar for Python here and for C# here. So, hopefully, dear reader, that leaves you something to chew on.