Terminological Questions

In natural language, we have a lot of words and expressions that are often interchangeable, even where there is a subtle difference in meaning. Take “street” and “road”. I think a “street” is usually inside a city or town (there is no absolutely clear difference between those two things either…) while a road is more likely to be in a rural area, or at least outside the city center. Also, I think of a road as something bigger and wider than a street, but not quite as big as a “highway” or “avenue”. Still, in a lot of cases, “street” and “road” are interchangeable. You don’t want your kids playing in the street or the road. (On the highway even less so, but I suppose that is because vehicles are moving much faster there.) I don’t know offhand what the difference is between an “alley” and a “lane”, but either of them is quite a bit narrower than a street or road. (Some alleys and lanes do not even appear on a street map, I suppose.)

Well, in short, I don’t think it’s such a horrible thing if one uses different words at times to refer to the same thing. Generally speaking, human beings are actually equipped to deal with this sort of thing. Still, I think it is worth thinking about this a bit because some terminology may just be more natural and clear to most people.

Production, “BNF Production”, or “Grammatical Production”

I think a better alternative might simply be “rule”, or when it is worth being extra explicit, “grammar rule”. Or “parser rule”.

I do tend to think that “rule” is clearer to most people than “production”.

Regular Expression

I suppose the term “regular expression” is well known to anybody with a minimal computing culture. However, I tend to think that the regular English word “pattern” may be a better alternative. At least most of the time…

Internally in JavaCC, the term “Token Production” is also used. I don’t think that’s actually a very useful term, so…

(Well, a pattern is really a rule too, but is a rule that is matched by the lexical machinery, not the parsing machinery. So, the idea is to use “pattern” for things that come from lexically scanning the input, and “rule” for things that come from parsing. But, of course, the real issue here is that the user acquires a conceptual model of the difference between the two things.)
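To make the split concrete, here is a minimal hand-written sketch in plain Java. It is not what JavaCC generates, and every name in it (PatternVsRule, Token, tokenize, parseSum) is made up for illustration. The patterns are regular expressions matched against raw characters, while the rule is matched against the tokens that those patterns produce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names come from JavaCC.
class PatternVsRule {

    // A token is a type name plus the text that the pattern matched.
    static final class Token {
        final String type;
        final String image;

        Token(String type, String image) {
            this.type = type;
            this.image = image;
        }
    }

    // Patterns: matched by the lexical machinery against raw characters.
    private static final Pattern LEXER = Pattern.compile(
        "(?<NUMBER>[0-9]+)|(?<PLUS>\\+)|(?<WS>\\s+)");

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = LEXER.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            // Each match must start exactly where the previous one ended.
            if (!m.find(pos) || m.start() != pos) {
                throw new IllegalArgumentException("No pattern matches at offset " + pos);
            }
            if (m.group("NUMBER") != null) {
                tokens.add(new Token("NUMBER", m.group()));
            } else if (m.group("PLUS") != null) {
                tokens.add(new Token("PLUS", m.group()));
            }
            // Whitespace is matched but not passed on to the parser.
            pos = m.end();
        }
        return tokens;
    }

    // Rule: matched by the parsing machinery against tokens.
    //   Sum : NUMBER ( PLUS NUMBER )*
    static int parseSum(List<Token> tokens) {
        int i = 0;
        int total = Integer.parseInt(expect(tokens, i++, "NUMBER").image);
        while (i < tokens.size()) {
            expect(tokens, i++, "PLUS");
            total += Integer.parseInt(expect(tokens, i++, "NUMBER").image);
        }
        return total;
    }

    // Returns the token at the given index, or fails if it has the wrong type.
    private static Token expect(List<Token> tokens, int index, String type) {
        if (index >= tokens.size() || !tokens.get(index).type.equals(type)) {
            throw new IllegalStateException("Expected " + type + " at token " + index);
        }
        return tokens.get(index);
    }
}
```

Calling PatternVsRule.parseSum(PatternVsRule.tokenize("1 + 20 +3")) first turns the characters into five tokens and then applies the rule to them. The rule never sees a character and the patterns never see a token, which is the conceptual model the two words are meant to carry.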

Expansion

It is possible that the term “grammar expression” would be clearer to most people. I’m less certain of this one. Still thinking about it…

Lookahead

This is a funny one because the term is used for things that do not really involve “looking ahead”, e.g. semantic lookahead, which is just a Java code snippet that returns a boolean (true/false) to decide whether to enter an expansion (a.k.a. grammar expression) or not. It does not necessarily involve looking ahead in the token stream. (Though it could.)

In general, I tend to think we should call all these things predicates, or possibly conditions. (Predicate is a rather rare word in normal English, after all.)

Probably this semantic lookahead would be better called a “Java code predicate”, or simply a “code predicate”, since, at some point in the not-too-distant future, we will be able to output other languages besides Java. The predicates that actually do involve scanning ahead could perfectly well be called “lookahead predicates” or, yeah, just “lookaheads” for short.
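The difference between the two kinds of predicate can be sketched in a few lines of hand-written Java. This is only an illustration of the idea, not JavaCC-generated code, and all the names (PredicateSketch, extensionsAllowed, nextTokensAre, chooseExpansion) are invented for the example. One predicate is plain boolean code that never touches the token stream; the other peeks at upcoming tokens without consuming them.

```java
import java.util.List;

// Hypothetical illustration only; these names are not JavaCC API.
class PredicateSketch {
    private final List<String> tokens;
    private int pos = 0;                 // index of the next unconsumed token
    boolean extensionsEnabled = false;   // ordinary program state, unrelated to the input

    PredicateSketch(List<String> tokens) {
        this.tokens = tokens;
    }

    // "Code predicate": just a boolean expression over program state.
    // It does not look at the token stream at all.
    boolean extensionsAllowed() {
        return extensionsEnabled;
    }

    // "Lookahead predicate": checks the upcoming tokens without consuming them.
    boolean nextTokensAre(String... expected) {
        for (int i = 0; i < expected.length; i++) {
            int index = pos + i;
            if (index >= tokens.size() || !tokens.get(index).equals(expected[i])) {
                return false;
            }
        }
        return true;
    }

    // Moves past one token; the predicates above never call this.
    void consume() {
        pos++;
    }

    // A choice point: each expansion is entered only if its predicate holds.
    String chooseExpansion() {
        if (nextTokensAre("id", "(")) {
            return "call";
        }
        if (extensionsAllowed()) {
            return "extension";
        }
        return "plain";
    }
}
```

With the tokens "id", "(", ")" the first branch is chosen by scanning ahead. With "id", "=" the outcome depends only on the extensionsEnabled flag, so nothing is being “looked ahead” at, which is why calling that a lookahead is a bit funny.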

Lookbehind

Lookbehind is a term that I came up with to denote a new feature I introduced that does not even exist in legacy JavaCC.

I think this should really be called a contextual predicate. I’m pretty sure that this is better and, when I get round to it, I intend to replace the term “lookbehind” (wherever I’ve written it) with “contextual predicate”.

Terms related to tree building

The term abstract syntax tree, a.k.a. AST, seems quite well established. I have noticed that various sites take pains to point out the difference between an AST and a CST, i.e. a concrete syntax tree. It is quite true that if you add all of your tokens as the terminal nodes in your tree, it is really a CST, not an AST.

As best I understand it, you have a CST when you could reconstruct the entire source file precisely by traversing the nodes – including whitespace, et cetera. Otherwise, you have an AST. (Is that it?) The distinction is valid, but I somehow doubt that it is all that useful. I mean, it just stands to reason that some applications need all the formatting information, as well as comments, and so require a CST, while other applications can throw that information away and are left with just the AST. But, you see, in any given context, people know that, so as far as I can see, in most discussions, we could just talk about the tree and that’s that.
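Taking the definition above at face value, the test for “is this a CST?” is simply whether concatenating the terminal nodes in order gives back the original text. Here is a small Java sketch of that test. The names (TreeSketch, Node, leaf, branch, source, withoutTrivia) are hypothetical and have nothing to do with the classes a parser generator would actually emit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the node classes of any real tool.
class TreeSketch {

    static final class Node {
        final String text;       // non-null only for terminal nodes (tokens)
        final boolean trivia;    // true for whitespace and comments
        final List<Node> children = new ArrayList<>();

        Node(String text, boolean trivia) {
            this.text = text;
            this.trivia = trivia;
        }
    }

    static Node leaf(String text, boolean trivia) {
        return new Node(text, trivia);
    }

    static Node branch(Node... children) {
        Node node = new Node(null, false);
        for (Node child : children) {
            node.children.add(child);
        }
        return node;
    }

    // Concatenates the text of every terminal node, in order.
    // If the tree kept every token, this reproduces the source exactly.
    static String source(Node node) {
        if (node.text != null) {
            return node.text;
        }
        StringBuilder sb = new StringBuilder();
        for (Node child : node.children) {
            sb.append(source(child));
        }
        return sb.toString();
    }

    // Returns a copy of the tree with whitespace and comment terminals dropped.
    static Node withoutTrivia(Node node) {
        if (node.text != null) {
            return node;
        }
        Node copy = new Node(null, false);
        for (Node child : node.children) {
            if (child.text != null && child.trivia) {
                continue; // formatting information is thrown away here
            }
            copy.children.add(withoutTrivia(child));
        }
        return copy;
    }
}
```

For a tree built from the text "x  = 1; // set x" with every token kept, source returns exactly that string. After withoutTrivia it returns "x=1;", so the round trip is lost. It is the same tree either way, with or without the extra leaves, which is why I think just saying “the tree” is usually enough.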

Various ANTLR documentation makes a distinction between the parse tree and the syntactical tree. I have to admit that, at first blush, I didn’t really know what the difference was. Moreover, I was dubious that, even if I did understand the difference, it would be something very conceptually useful!

I guess a parse tree is what you get from a straightforward top-down parse of the input (the tree is actually built bottom-up, but never mind…) and a syntactical tree is something that is more carefully designed to actually correspond to the syntax (or semantics?) of the programming language that you are parsing. It’s basically the same tree, for the most part, but a bit more trimmed and presentable. (Is that it?)

Well, the bottom line is that we could mostly just talk about the tree or the syntax tree and, in the context, one knows what is being talked about.

There is actually more of a terminological issue when we get into talking about specific tokens and nodes in the tree – especially as fault-tolerant features come into place. We want to be able to say whether a specific node or token reflects text in the parsed input or was inserted, for example. But in a few weeks at most, when I’ve written more about these issues, I’ll be able to converge better on the terminology.

Any feedback is welcome

I think that this whole parser generator space has a lot of very obfuscatory terminology and things are continually being presented as much more complex than they really are. So, I am quite interested in trying to converge on some terminology that obfuscates things as little as possible. But I don’t know myself what people find natural or confusing, so I am quite open to any opinions on this.