How Parsing Works

This chapter explains the parsing model behind CongoCC, to help you reason about why a grammar behaves as it does. It is background, not a how-to.

Recursive descent

CongoCC generates a recursive-descent parser. Each production becomes a method, and a production that references another production compiles to a method that calls the other’s method. Parsing the grammar is thus, quite literally, running a set of mutually recursive functions whose call structure mirrors the grammar’s structure. This is why the generated code is readable, why a stack trace points at recognizable productions, and why embedding target-language actions in a production is natural — they are just statements in a method.

The cost of this model is that the grammar must not be left-recursive: a production cannot call itself as its very first action, or the parser would recurse forever. Left recursion is expressed instead with the repetition operators, which is exactly what the calculator grammar does to make its operators left-associative.

Deciding at choice points

At each choice point the parser must pick an alternative. By default it decides on the next single token: if that token can begin an alternative, the parser commits to it. This is fast and, for most of a grammar, sufficient.

When one token is not enough, the parser does scan-ahead: it tentatively scans the upcoming tokens far enough to decide, then rolls back and parses for real. The =>|| marker and the SCAN statement (Disambiguation) are how you tell it where and how far to scan. A long-standing limitation of the original JavaCC — that nested scan-ahead did not work — has been fixed, so a scan may itself enter productions that scan.

What “deterministic” means here

The generated parser is deterministic: at every point it makes one choice and follows it, using lookahead to choose well rather than trying alternatives in parallel. Scan-ahead is bounded backtracking for the decision only — the parser looks, decides, and then parses once. Understanding that the decision and the parse are separate steps is the key to reasoning about lookahead.