Disambiguation

CongoCC generates a deterministic recursive-descent parser. At each choice point — a | alternation, an optional [ ], or a loop ( )* / ( )+ — the parser must decide what to do by looking at upcoming input. By default, it decides using the next single token. When one token is not enough to tell the alternatives apart, you can give the parser license to lookahead. This chapter describes the choice-point model and the constructs that control lookahead, along with the related ASSERT / ENSURE and FAIL features.

Choice points and the first-token rule

By default the parser chooses an alternative by checking whether the next token can begin it — formally, whether the token is in that alternative’s first set. This is fast and, for most of a typical grammar, entirely sufficient.

It breaks down when two alternatives can begin with the same token. Consider a loop that chooses between a declaration (x : y) and a bare reference (x), both of which start with an identifier:

Entry : ( Declaration | Reference )+ <EOF> ;
Declaration : <ID> <COLON> <ID> ;
Reference   : <ID> ;

Generating this grammar produces a warning, because the parser will always commit to Declaration on seeing an <ID> and can never reach Reference:

Warning: Entry.ccc:5:25:Expansion is unreachable.

To choose correctly, the parser has to look past the first <ID> to see whether a colon follows. There are two ways to tell it to.

The up-to-here marker

The simplest and preferred tool is the up-to-here marker, =>||. Placed inside an expansion, it means scan the input up to this point to decide whether to take this path. Mark Declaration just after the part that makes it unambiguous — the colon:

Entry : ( Declaration | Reference )+ <EOF> ;
Declaration : <ID> <COLON> =>|| <ID> ;
Reference   : <ID> ;

Now the parser, deciding whether to enter Declaration, scans <ID> <COLON>; if that matches it commits, otherwise it falls through to Reference. The grammar generates without warnings and parses x : y z as expected:

<Entry (1, 1)-(1, 7)>
  <Declaration (1, 1)-(1, 5)>
    ID: (1, 1) - (1, 1): x
    COLON: (1, 3) - (1, 3): :
    ID: (1, 5) - (1, 5): y
  ID: (1, 7) - (1, 7): z
  Token: (1, 1) - (1, 1): EOF

(The bare ID node at the end is the Reference; because Reference has a single child, no separate node is created for it — see Tree Building.)

A variant, =>|+n|, scans through the marked expansion plus n more tokens — useful when one extra token settles the decision but you do not want to scan an entire following construct.

The SCAN statement

=>|| is shorthand for the more general SCAN statement, which supersedes the LOOKAHEAD construct of legacy JavaCC. A SCAN is written at the start of an alternative, ending with => before the expansion it guards. The same example written with SCAN places the lookahead condition at the choice point itself:

Entry : ( SCAN <ID> <COLON> => Declaration | Reference )+ <EOF> ;

SCAN accepts, in combination, any of the following:

Numeric lookahead

SCAN n => permits the parser to look up to n tokens ahead when matching the guard.

Syntactic lookahead

SCAN expansion => succeeds if expansion matches the upcoming input. SCAN <ID> <COLON> => above is syntactic lookahead. Prefix the expansion with ~ to negate it.

Semantic lookahead

SCAN { booleanExpression } => consults a boolean expression in the target language. It is true when the expression is true. For example, SCAN { allowTrailingComma } => ",".

Contextual predicates

A look-behind condition tests the parser’s call stack rather than the upcoming tokens — see below.

Unlike legacy LOOKAHEAD, SCAN defaults to unlimited lookahead when it scans an expansion, and nested syntactic lookahead works correctly: a SCAN may itself invoke productions that contain their own SCANs.

Tip

Reach for =>|| first; it expresses the common case — “look as far as this point” — with the least ceremony. Use a full SCAN when you need a numeric limit, a semantic condition, a contextual predicate, or a negated look-ahead. Task-oriented advice on resolving conflicts is in How To: Resolve Choice Conflicts.

Contextual predicates

A contextual predicate (a look-behind) decides whether to enter an expansion based on which productions are currently on the parse stack, rather than on the tokens ahead. The path syntax uses:

Element

Meaning

\

step backward, up the call stack toward the current rule

/

step forward, down from the root

.

match exactly one production of any name

...

match zero or more productions of any name

~

negate the following step

The most common use is to forbid re-entering a production that is already active. This enters Foo only if Foo is not already somewhere above on the stack:

[ SCAN ~\...\Foo => Foo ]

Contextual predicates combine with the other forms; for example ( SCAN 2 ~\...\Foo => Foo )* applies both a non-reentrancy check and a two-token limit. They are a distinctive CongoCC feature and are explored further in How To: Handle Context-Sensitive Input.

Assertions

ASSERT and ENSURE state a condition that must hold; the difference is when the condition is checked. ASSERT is checked during normal parsing and raises an error if it fails. ENSURE is evaluated during lookahead, to steer a choice. Both accept either a semantic condition or a syntactic expansion:

ASSERT { depth < MAX_DEPTH } : "Nesting too deep"     // semantic, with a message
ASSERT ~( <ELSE> )                                    // syntactic: assert no else follows

A semantic assertion takes a boolean expression and an optional message (after :); a syntactic assertion takes a parenthesized expansion, optionally negated with ~. Assertions are also the basis of cardinality constraints on repetitions, covered in Tree Building.

Note

Whether a plain ASSERT is also consulted during lookahead is governed by the ASSERT_APPLIES_IN_LOOKAHEAD setting (default off); ENSURE always applies in lookahead. See Settings Reference.

Forcing a failure

FAIL aborts the current parse with an error. It is useful in a choice branch that should never be reached, or to give a better message than the default for known-bad input:

( "(" Expression ")" | FAIL "Expected a parenthesized expression" )

FAIL may be followed by a message expression, or by a code block to run. Fault-tolerant alternatives to outright failure — ATTEMPT / RECOVER and the ! recovery markers — are described in Fault-Tolerant Parsing.