Tree Building¶
CongoCC builds a syntax tree as it parses, with no extra tooling — tree
building is part of the core and is on by default. This chapter explains the
default tree, the # annotations that shape it, and the API for walking the
result. The examples below are language-neutral in spirit; the dumps shown are
from the Java target.
The default tree¶
Out of the box, every production builds a node, and every token is a node too. Given this grammar:
Config : ( Pair )+ <EOF> ;
Pair : <ID> "=" Value ;
Value : <NUM> ;
parsing a=1 b=22 produces:
<Config (1, 1)-(1, 8)>
<Pair (1, 1)-(1, 3)>
ID: (1, 1) - (1, 1): a
Token: (1, 2) - (1, 2): =
NUM: (1, 3) - (1, 3): 1
<Pair (1, 5)-(1, 8)>
ID: (1, 5) - (1, 5): b
Token: (1, 6) - (1, 6): =
NUM: (1, 7) - (1, 8): 22
Token: (1, 1) - (1, 1): EOF
Each Pair production became a Pair node, each token (including the
implicit "=" and the final EOF) is a node, and the whole thing is
rooted at a Config node. Notice there is no Value node even though
Value is a production — that is the effect of smart node creation, described next.
Smart node creation¶
By default (SMART_NODE_CREATION = true) a production builds a node only when
it would have more than one child. A production like Value : <NUM> that
wraps a single child therefore adds no node of its own; the child takes its
place. This keeps trees free of trivial one-child wrappers.
Turn it off to get a node for every production. With
SMART_NODE_CREATION = false; the same grammar yields a Value node around
each number:
<Pair (1, 1)-(1, 3)>
ID: (1, 1) - (1, 1): a
Token: (1, 2) - (1, 2): =
<Value (1, 3)-(1, 3)>
NUM: (1, 3) - (1, 3): 1
Shaping the tree with # descriptors¶
A # annotation on a production (or on a parenthesized expansion) overrides
the default and states exactly what node, if any, to build.
- Renaming a node
Pair #KeyValue : <ID> "=" Value ;builds a node namedKeyValueinstead ofPair. The generated node class is named accordingly.- Suppressing a node with
#void Pair #void : <ID> "=" Value ;builds no node forPair; its children attach to the parent instead. With#voidonPairthe tree above flattens to:<Config (1, 1)-(1, 8)> ID: (1, 1) - (1, 1): a Token: (1, 2) - (1, 2): = NUM: (1, 3) - (1, 3): 1 ID: (1, 5) - (1, 5): b Token: (1, 6) - (1, 6): = NUM: (1, 7) - (1, 8): 22 Token: (1, 1) - (1, 1): EOF
- Conditional nodes
A condition in parentheses builds the node only when the condition holds. The comparison operators
>,>=,<,<=test the number of child nodes the production produced. For example,#List(>1)builds aListnode only when it would hold more than one child:Root : Items <EOF> ; Items #List(>1) : ( <NUM> )+ ;
Parsing
5builds noList(the loneNUMattaches toRoot), while5 6 7does:<Root (1, 1)-(1, 5)> <List (1, 1)-(1, 5)> NUM: (1, 1) - (1, 1): 5 NUM: (1, 3) - (1, 3): 6 NUM: (1, 5) - (1, 5): 7 Token: (1, 1) - (1, 1): EOFNote
The count is of child nodes, which includes token nodes. If the production also matches punctuation or
<EOF>, those count too — keep that in mind when choosing the threshold.- Abstract and interface node types
#abstractand#interfacemark the generated node type as abstract or as an interface, for grammars that build a node class hierarchy and inject shared behavior into it (see Code Injection).- Inline descriptors
A
#descriptor may also follow a parenthesized expansion inside a production, naming a node for just that part:( A B C )#Group.- Whole-grammar defaults
Two settings change the baseline.
NODE_DEFAULT_VOID = true;inverts the default so that productions build a node only when they explicitly ask for one with a#descriptor.NODE_PREFIXprepends a fixed string to every generated node class name. See Settings Reference.
Tokens as nodes¶
Because TOKENS_ARE_NODES is true by default, tokens appear in the tree as
terminal nodes, as the dumps above show. Set TOKENS_ARE_NODES = false; to
keep tokens out of the tree entirely (the JJTree-style behavior).
Unparsed tokens — comments and the like — are excluded by default; set
UNPARSED_TOKENS_ARE_NODES = true; to include them. And as covered in
Lexical Specification, a #ClassName on a token production groups its tokens under a
shared node class, which is handy when you want to treat a family of tokens
uniformly in the tree.
Walking the tree¶
Every node implements the Node interface, which provides a broad traversal
API. The most commonly used members are:
children()— the direct child nodes;descendants()— all nodes below.firstChildOfType(...)/childrenOfType(...)and thedescendantequivalents, selecting by node class or byNodeType.getType()— the node’sNodeType;getParent()— its parent.getSource()— the source text the node spans.dump()— print the subtree, as used throughout this chapter.
The full contract is in Generated API.
Visitors¶
For traversals that dispatch on node type, extend the built-in
Node.Visitor base class and define a visit method for each node type you
care about. Start the walk with visit(rootNode); call recurse(node) from
within a visit method to descend into that node’s children. A visitor that
prints every KeyValue node looks like this:
class Printer extends Node.Visitor {
public void visit(KeyValue kv) {
System.out.println("pair -> " + kv.getSource());
recurse(kv);
}
}
// new Printer().visit(parser.rootNode()); prints:
// pair -> a=1
// pair -> b=22
Dispatch is by the node’s runtime class, falling back to its superclasses and
marked interfaces, so a visit method written for a base class or an
#interface node type handles all of its subtypes. Set the visitor’s
visitUnparsedTokens field to also visit comment tokens. No separate visitor
class is generated and there is no VISITOR setting — the one reflective base
class above is the whole mechanism.
Cardinality constraints¶
A repetition can carry a cardinality constraint — an assertion of how many
times it must match — written with the & form of an assertion. These build
on the assertion machinery in Disambiguation.
Coming from JJTree¶
CongoCC’s tree building replaces the separate JJTree
preprocessor from JavaCC. There is no
jjtThis, no SimpleNode, and no generated visitor interface to implement;
tree building is on by default rather than opt-in, and the node API is the rich
Node interface shown above. The full mapping is in Appendix: Legacy Mapping.