Overview¶
CongoCC is a parser generator. Users describe a language in a grammar file, and CongoCC generates source code — a lexer, a parser, and a set of syntax-tree classes — that reads text in that language and generates an abstract syntax tree that your application can then process. It is a recursive-descent generator that can produce parsers in Java, Python, C#, and Rust from the same grammar.
A Short History¶
CongoCC was originally developed as a fork of JavaCC21, which was itself a fork of the original JavaCC. ConngoCC’s goal is to provide a more modern and flexible approach to parser generation. It has since evolved to support multiple target languages and has been used in various projects requiring custom language processing. CongoCC source code is here.
The processing model¶
There are two distinct processing phases: generation, which you run once when the grammar changes, and parsing, which the generated parser code does at run time.
Generation. You run CongoCC on a .ccc grammar file (see
Invocation). It produces, in your chosen target language:
a lexer that recognizes the grammar’s tokens,
a parser with one method per grammar production, and
node classes for the syntax tree.
Parsing. At run time the generated code works in the classic two stages:
At run time the generated lexer and parser turn input text into a syntax tree.¶
The lexer turns characters into a stream of tokens (numbers, identifiers, punctuation, …). The parser consumes that stream according to the grammar’s productions and, by default, builds a syntax tree of nodes as it goes. Your application then walks the tree.
Terminology¶
A few terms recur throughout this manual; they are defined fully in the Appendix: Glossary.
- Grammar
The complete description of a language, written in a
.cccfile (The Grammar File).- Token / terminal
An indivisible lexical unit produced by the lexer. Token types are declared in token productions (Lexical Specification).
- Production / non-terminal
A named grammar rule (Productions and Expansions). CongoCC generates one parser method per production.
- Expansion
The right-hand side of a production — the pattern of tokens, non-terminals, and operators it matches.
- Node / syntax tree
The parser’s output. Each production and token can contribute a node (Tree Building).
- Lexical state
A mode the lexer is in that determines which tokens it can currently match (Lexical Specification).
- Lookahead
Information the parser uses to choose between alternatives at a choice point (Disambiguation).
How this manual is organized¶
The chapters move from running the tool, through the grammar language, to the code it generates:
Invocation and The Grammar File — running CongoCC and the structure of a grammar file, including the preprocessor.
Lexical Specification, Productions and Expansions, and Disambiguation — the core grammar language: tokens, productions, and lookahead.
Tree Building and Code Injection — shaping the syntax tree and adding your own code to the generated classes.
Advanced Tokenization and Fault-Tolerant Parsing — context-sensitive tokenization and recovery from malformed input.
Generated API — the contract the generated parser, lexer, tokens, and nodes present to your application.
Settings Reference — every configuration setting, by category.
The appendices give the formal grammar of the grammar, a legacy mapping for users coming from JavaCC, and the Appendix: Glossary.
For tutorials and task-oriented guidance, see the User Guide; for what differs between target languages, see the Target Language Guide.