Thinking about AST Structures

Hi there,

While playing around with possibilities to traverse the AST today with the visitor pattern I wondered what would be an efficient way to typecheck the programm and operations as well as doing other semantic actions.

A lot of projects on the net seem to do this in a rather complicated way and I’d like to leverage the prower of JavaCC21 to (hopefully) make this easier.

ATM my AST is rather deep and nested.

I have the following ideas:

  1. Removing as much elements as possible, e.g. putting the expression nodes together in a general node

The problem would be: I’d loose the easily conveyed information i won from the syntax-analysis (for example which concrete operation is used, the token-name describes this here and i can easily check for that when i have nested expressions with precedence)

The good thing would be a much smaller AST with mostly terminal-tokens.

  1. Traversing it as is but that would mean: Going deep into the ast every-time and having a lot of operations that are pretty sturdy and not so flexible if i’d wanted to adjust sth. i’d suggest.

  2. Add more parameters to the token-classes and let JJTree fill them in when the syntactical-analysis is done. (for Example operator and expressions in the token, or values if a type is an arraytype etc.) Thats not ideal either, cause it seems a little clunky using JJTree a lot in the Grammar-File and my plan was to use the visitor as much as possible concerning the semantic checking logic. But especially problematic seems to be that the grammar-logic and the visitors become more tied together with this approach and i cant wrap my head around this being optimal. It would be nice have maybe some way to include specific code to fill the AST-objects with needed data but without putting that code in the grammar file, maybe by having a visitor-defined include or something different and then using an interface and generating subclasses for a specific visitor with the needed information. Otherwise my AST is always traversed with the same grammar and jjtree code for all the visitors unless i write another grammer what I dont want to do.

Thanks for all answers
Cheers

Hi
I see different subjects in our post.

  1. Visitor(s) and grammar linked to a tool you want to build:
  • obviously a grammar should not be tailored to a tool (only to the language), but unfortunately it is the case of most projects using JJTree or JavaCC;
  • a project should build different visitors, some rather general (build some global data…), others tailored to (different) responsibilities (check, record, format, translate…), but most projects build only one visitor.
  1. With JJTree, in a visited node, you frequently need to go up at least one level (the caller) or pass info from it to know what you have to do (in the callee);
  • I prefer using JTB, which generates default visitors (which just walks down the entire tree, doing nothing more) and inlines in it the accept/visit calls to the next level node, so you have all the traversing code already generated for you (it also generates intermediate nodes for *+? constructs but they become quite transparent after the inlining);
  • therefore the fact that the AST is big and deep and nested is not a problem, you do not code the AST traversing (it is all there, you just may have to copy/paste/alter it), and you code “against” the grammar, in the productions nodes, in a natural, readable and rather clear manner;
  • also you easily build visitors that traverse only part of the AST (just remove the accept calls to the lower branch in your visitor that extends the default visitor);
  • you seldom have to play with the tokens nodes, everything is in your productions nodes, where you need it;
  • you can add your own classes / methods to be used in your visitors in a java standard manner, because your visitors are your standard classes (that just need to extend at least the generated default visitors);
  • it is easy to call another visitor from some visitor.

Hi Silas,

Thanks for posting. We need more activity here and certainly don’t hesitate to pose any questions.

Now, that said, I guess I have to say that I was a bit puzzled by your question. I reread it a few times to try to figure out what your issue really was.

Well, you seem to have this idea that having a deeply nested tree is some sort of problem that needs to be remedied, and I don’t quite get that. Certainly from the point of view of traversing the tree with a visitor pattern, it really doesn’t matter how deeply nested the tree is.

Let’s say that you want to walk the tree and the only nodes you are interested in are of type Foo and Bar. So you create a Visitor object by extending Node.Visitor.

public class FooBarVisitor extends Node.Visitor {
     public void visit(Foo foo) {
             .... do some things...
             recurse(); // we go into the child nodes
     }

     public void visit(Bar bar) {
             ... do some other things...
             recurse();
     }
}

If, elsewhere, you have:

FooBarVisitor myVisitor = new FooBarVisitor(...params...);
myVisitor.visit(rootNode);

Then it will just walk the tree because the default visit(Node…) routine just visits the child nodes, but it has no behavior defined, so it only does something if you have defined a custom visit(…) routine for that Node type. So it will call your custom visit(Foo) and visit(Bar) methods when it hits those node types, but… I mean to say… it doesn’t really matter at all how deeply nested those nodes are in the tree. I mean, why do you think that is a problem? Or maybe you meant something else…?

So, I guess that is the point that the other person who answered (Marc Mazas) made, which is that your tree being deeply nested is not much of an issue.

Now, the other point that needs to be made (I think…) about this is that there is not much point in talking about JJTree as this separate thing. JJTree is a preprocessor in the legacy tool, but in JavaCC 21, tree-building is just an integral part of the machinery. There is no separate preprocessor called JJTree any more, so to talk about JJTree is really only a source of confusion potentially more than anything else. The parser generated by JavaCC 21 automatically creates an AST – unless you explicitly turn it off.

I think it uses pretty reasonable rule of thumb to build the tree, but it can be adjusted, yes. So, if you want to take precise control of the generated AST, which productions create Nodes under what conditions, you can put annotations in the productions that indicate whether a production builds a node. Like, you could use #void to indicate that a production does not generate a node at all. And yes, this is all pretty much backward compatible with the legacy JJTree tool, but again, in JavaCC21, JJTree (the rather cumbersome preprocessor) does not exist any more. Not as a separate tool anyway, so it just gets confusing to talk about it. Probably it’s better just to talk in terms of whether the production creates a node or not.

Actually, one way in which the JavaCC 21 default behavior differs from JJTree is that it only creates a new node if there than one object on the tree building stack. So, for example, if you had a production like:

    AdditiveExpression : Expression ("+" Expression)* ;

then it would create a new AdditiveExpression node if you went into the "+" Expression part of the production. If not, it doesn’t create a new node. The way you would specify that in the legacy JJTree was something like:

  void AdditiveExpression() #AdditiveExpression(>1) :  ...

Create a new AdditiveExpression (or ASTAdditiveExpression, I think…) if there is more than one object sitting on the stack. But that is just the default now in JavaCC 21, since I just tend to think that this is what people usually want.

The other way you can massage the tree is by using the open/close hooks. So you could have:

  INJECT MyNode :
  {
          public void close() {
                    ... do some manipulation here...
          }
  }

And the close() method is called when the tree building machinery adds the node to the tree. This can be useful. So I mention these things to give you a sense of the things available.

BUT… to be totally honest, if the premise of your question is that having a deeply nested tree is somehow a problem, then… because I don’t think it is!

Hi Marc,

I have no objection to your telling people about JTB (or anyth.ing else really) but I do have to say that if you’re going to make comparisons here, it should be with the tree-building machinery of JavaCC 21, not with the legacy JJTree.

Granted, what JavaCC 21 has for tree building is based on the legacy JJTree, but it’s also largely a rewrite, since JJTree was a separate pre-processor, and JavaCC 21 re-implements it as a core part of JavaCC. Not only that, but the whole Node/Token API was refactored to make the thing far more usable. Also, of course, INJECT makes the whole thing vastly more usable since there is now no need to post-edit any generated XXXNode.java files.

So, the various bullet points you have in your note about why use JTB, I am frankly not certain that most of them apply to a comparison with JavaCC 21. Probably, using the functionality that is built into the core of JavaCC 21 will be overall a more comfortable approach than using a separate tree-building add-on like JTB.

But the thing is that JavaCC 21 offers some things that were not ever in JJTree. JavaCC 21 includes a base Visitor abstract class that (IMHO) is pretty clean and easy to use. That is this class.

Now, here is a very nice example of this being used internally in the core code itself:

Now, this is a very clear example, since the RegexpVisitor class is only interested in objects of type RegexpRef so it simply defines a visit method for that node type. Actually, it is so short, that I’ll just paste it in here:

public class RegexpVisitor extends Node.Visitor {

    private HashSet<RegularExpression> alreadyVisited = new HashSet<>(),  currentlyVisiting = new HashSet<>();

    public void visit(RegexpRef ref) {
        RegularExpression referredTo = ref.getRegexp();
        if (!alreadyVisited.contains(referredTo)) {
            if (!currentlyVisiting.contains(referredTo)) {
                currentlyVisiting.add(referredTo);
                visit(referredTo);
                currentlyVisiting.remove(referredTo);
            } else {
                alreadyVisited.add(referredTo);
                grammar.addSemanticError(ref, "Self-referential loop detected");
            }
        }
    }
}

This is the visitor that is used to do the sanity check of whether a Regexp refers to itself. The code that uses the above-defined object is here: https://github.com/javacc21/javacc21/blob/master/src/main/java/com/javacc/parsegen/ParserData.java#L628

Now, for one thing, I daresay it would be instructive to compare this code with the code in the legacy project that it replaces!

But also, as regards JTB, can you express this any more cleanly using JTB?

Now, all that said, I really still don’t know anything about JTB. It’s still sort of on my TODO list to look into it, but there are so many things to understand, you know.

To me, the whole concept of a separate tree-building related preprocessor that exists separately from the core tool is very dubious, to say the least. To me, the typical usage of a tool like JavaCC is in fact to build a tree and walk the tree recursively using some sort of Visitor pattern. So, to make this a separate add-on tool makes very little sense. (I mean, not only does it not make sense now, but it never made any sense!)

What is of interest about something like JTB for me is just what ideas are there that I could incorporate into JavaCC 21. And maybe you could help me on that.

To tell the truth, I think that using a separate pre-processor tool, whether it’s the legacy JJTree or JTB, is really quite cumbersome and the tree-building machinery should be right in the core and working by default – which is how it is in JavaCC 21. So, the most useful thing you could do (as regards JTB) for me is just to summarize what it offers that the functionality currently in JavaCC 21 is not offering. And I’ll try to address that.

Again, I don’t think that having separate tools for the tree-building side is really where it’s going to be at. Tree-building should just be a core concern of the base tool itself.

Hi

A/ no need to repeatedly hammer on the “JJTree - and JTB - are preprocessors for legacy JavaCC” point: you edit a .jjt - or .jtb - grammar, you save it in the IDE, it builds the .javacc and .java files in a row, you edit your own custom visitors and your classes, all this is quite transparent. Of course, if you have to hack the generated classes, you are dead, but you can manage to avoid this (specially with JTB). Of course one can be satisfied that JavaCC21 does all in one strike. JTB could evolve to do the javacc parsing internally (and incorporate JavaCC21 smart features - then I would call it JavaCC3M for JavaCC for the 3rd millenium :slight_smile:…). Even JavaCC21 could generate ASTs and visitors like JTB :slight_smile:…)

B/ on this last question (you raised): JTB interesting feature is the default visitors it generates, with the inlining. The generated code is quite old fashioned, verbose, more procedural than object oriented, quite no abstraction. This is what it makes it easy to code custom visitors. It generates the - boring - traversing code for you, especially when you choose the inline option, and useful javadoc / java comments recalling the grammar structure. Of course you have to copy / paste / alter the generated “do nothing else than traverse” code to add your logic. This you can like (it reduces the amount of thinking you must produce) or not (if you prefer concise abstract coding). So you have to look at some example (one good is JTB itself) to see. You may like it or completely dislike it. You may add your own ideas.

C/ whatever is generated, one thing matters: when you alter a production, does your corresponding visitors / methods must be adapted or not. Sometimes your code does not compile, but sometimes it may compile and will fail at run time. So if your code must not be adapted, this is quite brilliant. If it must be adapted, it is good to provide a mechanism to warn the developper that it has to.

Well, look… some first principles here.

You know, it’s really not terribly serious to talk about what is straightforward and usable (transparent is the word you use, but it amounts to the same thing, no?) in this kind of flippant way. The major company that comes to mind that has always taken useability (user-friendliness, they might say) very seriously is obviously Apple. My understanding is that Apple has a whole human factors research center where they really test empirically the useability of human interfaces.

I mean to say: nobody who is serious about these issues allows the engineers themselves to define what is user-friendly and what is not. It is an empirical question that must actually be put to the test!

Now, granted, we don’t have the resources of Apple (or Microsoft or even IBM…) but still we can focus the question somewhat empirically. At various points, I have tried to scan over the various projects that use JavaCC and figure out how it’s being used – what the typical usage patterns are. A few months ago, I was wondering how many of the projects that use JavaCC also make use of JJTree. I used various strategies of googling to try to get a rough sense of this.

Without going too much into boring detail of how I came to this estimate, I concluded that it’s about 50-50. About 50% of the projects that use JavaCC out there use JJTree and the other half do not. It is something like that. Now, 50% might seem like a lot, but my strong belief is that this kind of tree building (combined with the visitor pattern tree-walking) is the typical usage pattern of this sort of tool, so, one could expect that something like 90% of JavaCC users would also be JJTree users. But that is far from being the case. So, this tends to reinforce my view that JJTree has some very major useability issues.

And this all kind of forms a whole in my mind. I also recall quite distinctly that, when I was simply an end-user of JavaCC, at a few points, I pondered whether to use JJTree, and always ended up rejecting the idea. The basic idea was okay, of course, but it just had these severe useability issues. Once I started working on the code, one of the first things I set out to do was to address these problems with JJTree. (I have to admit that I barely ever even looked at JTB, but… most people didn’t…) But all of that is the origin of the INJECT feature.

I would add also that I am pretty confident that, as we move forward and the JavaCC 21 user base grows and we have enough of a sample size to draw some conclusions, we will see that the vast majority of JavaCC 21 users (90% or more, I would venture the guess) will use the automatic tree building. (This is a conjecture an remains to be seen but I think it is a cinch bet.)

(By the way, I think it’s also a cinch bet that any new users (I.e. with no previous JavaCC experience) will opt to use the newer streamlined syntax, probably at something like the 99.9% level!)

So, when tree-building is the normal usage pattern but about half the users of a tool opt not to use the tree-building facility that is part of the package, this surely says some things about the tool’s usability.

Now, you do make a valid point (as far as it goes) that tooling can make quite a significant difference in useability. Well, it depends on the case, of course. Typically, tooling, like your Eclipse plugin, say, can mitigate some of the existing useability problems but probably not eliminate the problems completely.

But, okay, I certainly don’t dispute that it can make quite a big difference. So I grant that… (Of course, other useability issues can only be addressed in the core code, however. For example, all of the relatively recent work I did towards making the whole LOOKAHEAD construct more useable is work that can only be done in the core codebase. For example: https://javacc.com/2020/07/31/new-feature-for-scan-up-to-here/)

But the question of tooling leads us to a rather delicate (maybe even embarassing) sort of question now.

I mean, I assume that when you talk about creating a .jjt or .jtb file “in an IDE”, you mean in Eclipse, with your JavaCC plugin enabled. (Correct me if I’m wrong.)

Well, fine, but surely you understand that this discussion forum that I am trying to develop here exists for the purpose of discussing the latest version of JavaCC, i.e. JavaCC 21. It really seems that, unless you are going to make some clear commitment to supporting the up-to-date feature set of JavaCC 21 in your plugin, it is really rather pointless and possibly in dubious taste to be going on about your plugin.

People who want to use JavaCC 21 presumably want a plugin, tooling generally, that supports the up-to-date feature set. A JavaCC plugin that does not support the improved streamlined syntax, or even the things I added over a decade ago, like INJECT and INCLUDE, this is simply not useful to somebody who is trying to get going with JavaCC 21.

Now, Rome was not built in a day, so it is quite understandable if it takes a while for your plugin to be updated to supporting the latest feature set. However, I think there has to be the understanding that this is a goal you have.

So, could you clarify your position on this?

Likewise. The idea was fine, the execution not so much. At least for small grammars, it always seemed easier to me to build my own tree in the .jj itself.

Hi, Vinay! Nice to see you here!

Yes, but properly understood, that is really an indictment of the JJTree tool. Of course, it’s true that it’s not so hard to build your own tree and write your own code that recursively walks the tree and so on…

But it kind of brings to mind the “good old days”. Like when everybody was always writing their own Hashtable etcetera. So, you know, the die-hard C hackers would say that it’s no big deal. It’s easy to write your own Hashtable implementation.

It’s easy! I’ve done it hundreds of times! (Uhh, yeah, right, but that’s exactly the point!)

So, yeah, you can write your own HashMap and ArrayList and StringBuilder and so on… it’s not so hard… But finally, people are more productive in the the current-day world where everybody just uses whatever classes are in the standard library. But probably even more importantly, somebody else joins the team and you’re just using the standard library classes with the standard idioms and the person should be able to get up to speed much more quickly.

So, pretty much all the advantages would lie with just having some standard node classes with all the code for traversing the tree and so forth just already generated. And you have some standard idioms that pretty much all projects that use the tool end up using…

But, as you know, JJTree has these real problems. Just for starters, it gets you into a much more complicated 2-step build process, where you run jjtree as a preprocessor and then javacc over the output of the preprocessor. It also has this very strange half-baked aspect that the Token class is not retrofitted to implement the Node API, which is really very strange, because the most obvious default tree-building strategy would just be for the tokens to be the terminal leaf nodes in the tree, right?

And then there is the very big problem of having no INJECT instruction so you generate all these ASTXXX.java node files and then presumably you’re supposed to post-edit the generated files if you want to add some functionality into those classes.

But anyway, the automatic tree-building in JavaCC 21 is very much based on JJTree, with key improvements, like it not being a preprocessor and Token being retrofitted to implement Node and the INJECT directive. There is probably quite a bit of improvement to be made in the Node API and we have a window of opportunity to do that and even make some amount of backward incompatible changes because, frankly, the current user base is pretty small, and is also made up of people who are willing to handle some amount of evolution in the API’s.