Having Fun with the Fundamentals (Part I)

I was trying to think back to when most students are introduced to basic algebra. A typical sort of exercise is simplifying unnecessarily complex expressions, so he can rewrite:

 (x + 1)y -y

as:

xy

Well, regardless of the exact age when people first see such things, this is part of a typical middle school curriculum, so there is the implicit assumption that all but the dimmest kids can get it. (At least if the material is not taught horrendously badly...)

This sort of middle school maths does present a certain conceptual leap however -- mostly because it involves thinking at a certain abstract level. One has to grasp that there is no need to specify what x or y is. These are variables, they could be any numbers at all and the above operations are valid regardless. So what this amounts to is introducing people to a formal notation for talking about things at a certain level of abstraction.

Something similar happens with the sort of formal grammar that one works with when using a parser generator such as CongoCC. When we write:

(Foobar)+

this means that, at this juncture, we expect to match the Foobar expansion one or more times. However, just as x(y+z) could be written as xy + xz we could also consider that (Foobar)+ is to be just a shorter, more convenient way of writing:

Foobar
(Foobar)*

i.e. we have a single mandatory Foobar expansion and then zero or more optional Foobar expansions after that. Now, granted, to actually have some running code, this Foobar must be concretely defined somewhere, but the point is that regardless of how precisely it is defined this is a kind of invariant equivalence. Just as we can say that x+y is the same as y+x without specifying what the value of x or y is, the above two expansions, (Foobar)+ and Foobar (Foobar)* are equivalent regardless of what Foobar is.

Well, all that said, let's get more concrete for now and say that we have this grammar rule:

Foobar :
   SCAN 2 "foo" "bar" baz"
;

What this means is that Foobar is a "foo" token, followed by a "bar" and then a "baz", but what the SCAN 2 part means it that when we are at a choice point where we have to decide whether to attempt to match Foobar or not, we make that decision based on scanning ahead two tokens. (If that SCAN 2 is not there, we fall back to the default or implicit scanahead of a single token.)

Now here is a gotcha: this predicate only applies if we are at a choice point. So, if we have:

"bang" (Foobar)+ Bat

the first Foobar is entered unconditionally. Our predicate, in this case, matching the first two tokens, is not used! Not for the first Foobar, only the ones after that. And that is surely at least a bit more obvious if we write the above the longer way:

"bang" Foobar (Foobar)* Bat

There, it is perfectly clear that we only have a choice to make after the first Foobar, right?

Now, we could have written our Foobar grammar rule just as well as:

Foobar :
   SCAN "foo" "bar" => "foo" "bar" "baz"
;

On the one hand, that is more explicit. Rather than just saying we scan ahead two tokens, we say explicitly that we scan ahead for a "foo" and a "bar". However, on the other hand, that is also more long-winded and repetitious! Regardless, those two ways of expressing this are equivalent. (You say tomayto, I say tomahto.) You could express this either way, but actually, in CongoCC, the preferred way of writing this is surely yet another way, using the up-to-here notation, as in:

Foobar :
  "foo" "bar" =>|| "baz"
;

This is not only shortest of the three, but is also arguably the clearest. The above expresses that we scan up to the =>|| up-to-here point and if that scan is successful, we enter the expansion. But again, in the one-or-more construct (Foobar)+ we only have the choice of whether to enter the expansion or not after the first Foobar, which is matched unconditionally. It is the ones after that where we have a choice and thus check the predicate, in this instance scanning ahead the first two tokens.

Now, for the fun of it, let's examine some other ways we could express the same two-token lookahead. For example, we could write:

Foobar :
   "foo" 
   ASSERT {checkNextTokenType(TokenType.BAR)}#
   =>||
   "bar"
   "baz"
;

This is effectively the same thing, except that we drop down into Java code when expressing the predicate. Again, do understand that this is a contrived example. Usually, one resorts to using Java code to express things that are not easily expressed otherwise. But that is not the case here! But it does work. The parser generator logic will generate code that checks for the next token being "foo" and then there is the assertion expressed with Java code that the type of the token after that is BAR, and once we pass that assertion, we have hit the up-to-here point, so the predicate routine is successful and yes, we enter the Foobar rule. (By the way, the reason that the assertion above ends with # is to indicate that it applies in a lookahead as well as regular parsing. If we did not have that there, the generated predicate would not check for the "bar" because the assertion, by default, would not be checked in the generated scanahead routine. Details, details...)

Or we could also write:

Foobar :
    SCAN {getToken(2).getType() == TokenType.BAR}# 
    => "foo" "bar" "baz"
;

That is also equivalent to the preceding examples. That might not immediately seem true. After all, this SCAN is not checking for the first token being "foo", right? But, actually, by default in CongoCC, if a SCAN routine has neither a numerical lookahead amount nor a syntactic expansion, then the generated predicate routine will scan ahead one token, as well as checking whatever condition is expressed in Java code. N.B. In roughly the same spot in legacy JavaCC, the default lookahead amount is zero, so if you wrote, say:

[
    LOOKAHEAD(getToken(2).kind == BAR)
    "foo" "bar" "baz"
]

it would not check whether the first token was a "foo" because the default lookahead in that spot is zero tokens. So, it would check the second token, but not the first one. For that to work as (probably) desired, you would need to write:

[
    LOOKAHEAD(1, getToken(2).kind == BAR)
    "foo" "bar" "baz"
] 

Well, granted, the above example is rather silly, since you could just as well write this as LOOKAHEAD(2). The point is that in this case, the scanahead part defaults to zero if no numerical amount is specified, while in CongoCC, it defaults to looking ahead a single token.

Well, silly season is not quite over, so here is another equivalent way of writing the above -- in CongoCC, that is. You could use the up-to-here-plus notation:

Foobar :
   "foo" =>|+1 "bar" "baz"
;

What the up-to-here-plus means is that we scan up to this point and one token more. So, in this case we end up checking for the "bar". Of course, this is also a contrived example, since there really is no conceivable reason to write it this way. However, the up-to-here-plus is quite useful when you just want to scan one (or maybe two or three) token(s) into the following part of the expansion. For instance this is how an initializer block is expressed in the Java grammar:

Initializer# : 
    [<STATIC>] =>|+1 CodeBlock
;

What this means is that we scan past the (optional) "static" keyword and then one more token which is the opening { of the (mandatory) codeblock. So, this is effectively the same as:

Initializer# :
    SCAN [<STATIC>] <LBRACE> =>
    [<STATIC>] CodeBlock
;

However, using the up-to-here-plus notation is briefer and, I daresay, quite a bit more readable, since we can just mentally parse it as meaning that we check for the optional "static" keyword and then one more token, which is, of course, the initial left brace, i.e. {.

We could also write SCAN 2 here admittedly, but that is slightly sloppy, since, if the "static" keyword is not present, then the two-token scan is unnecessary, it only needs to scan one token! But granted, it basically works. (Though there are some subtle points about scanning extra, superfluous tokens that I will surely discuss in a later article. I'll also explain why Initializer has a # after it!)

The Takeaway

I wrote the above with a kind of dual intention. On the one hand, I wanted to introduce some key concepts about how the tool works, its basic logic and structure. But also I introduced certain nitty gritty features, like the ability to express assertions using Java code and even the fact that we have some special notation, the trailing #, to indicate that the assertion in question applies in a lookahead or predicate routine.

Just as in the case of middle school maths, we understand that certain things can be expressed in different ways, and we gradually gain some fluency with this. So, just as:

(Foobar)+

is equivalent to:

Foobar (Foobar)*

the zero-or-more expansion

[Foobar]

(or alternatively (Foobar)?) can be written as:

Foobar | {}

So this is the same thing essentially, a choice between matching a Foobar or nothing at all. (The empty code block {} is a convenient way to represent an empty or do-nothing expansion.)

[Foo | Bar | Baz ]

is actually equivalent to:

Foo | Bar | Baz | {}

Or, an interesting case is that:

[(Foobar)+]

is equivalent to:

(Foobar)*

but in this case, the latter is simpler and clearer, which is typically preferable. But again, sometimes, the alternative ways of expressing things are about equivalent and it amounts to a stylistic choice.

Closing Thoughts

Obviously, in natural language, there is usually a plethora of different ways of saying what is essentially the same thing. We can say "Hello, how are you?" or "Hey, what's up?". Now, in that exact case, there is, of course, a difference of register and one might be more appropriate than the other depending on the social context, but there is also the fact that, to a large extent, people do express their personality (quirks and all) via what are basically stylistic choices. But there is a deeper point in all this: programming (at least at its best) really is an art form. As such, a master of the craft will have developed his (or occasionally her) own coding style. So, at the risk of sounding a bit too new-agey, programming, as with other art forms such as music or painting, is something of a vehicle for self-realization.

Well, I am struggling with words to explain just why mastering a tool like CongoCC can be so very deeply satisfying. While I freely admit that I would like to get more visibility for my work (and, yes, I am deeply ego-involved) I also just find it quite unfortunate that so many people lose out on this because, somehow they never overcome a sort of intimidation factor. So, this is the first of a series of articles that I intend to write as a attempt to remedy that.

(To be continued)