Overview

CongoCC is a parser generator. You describe a language in a grammar file, and CongoCC generates source code (a lexer, a parser, and a set of syntax-tree classes) that reads text in that language and builds an abstract syntax tree that your application can then process. It generates recursive-descent parsers and can target Java, Python, C#, and Rust from the same grammar.

A short history

CongoCC was originally developed as a fork of JavaCC21, which was itself a fork of the original JavaCC. CongoCC’s goal is to provide a more modern and flexible approach to parser generation. It has since evolved to support multiple target languages and has been used in various projects requiring custom language processing. CongoCC source code is here.

The processing model

There are two distinct processing phases: generation, which you run whenever the grammar changes, and parsing, which the generated code performs at run time.

Generation. You run CongoCC on a .ccc grammar file (see Invocation). It produces, in your chosen target language:

  • a lexer that recognizes the grammar’s tokens,

  • a parser with one method per grammar production, and

  • node classes for the syntax tree.

Parsing. At run time the generated code works in the classic two stages:

input text → lexer → tokens → parser → syntax tree

The lexer turns characters into a stream of tokens (numbers, identifiers, punctuation, …). The parser consumes that stream according to the grammar’s productions and, by default, builds a syntax tree of nodes as it goes. Your application then walks the tree.
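
The two stages and the tree walk can be illustrated with a small hand-written sketch. This is not code produced by CongoCC and does not use its API; the toy arithmetic language, the token names, and the Node, Parser, lex, parse, and evaluate names are all assumptions made up for this example. It only mirrors the shape described above: a lexer that yields tokens, a recursive-descent parser with one method per production, and an application that walks the finished tree.

```python
import math
import re
from dataclasses import dataclass, field

# Hypothetical token types for a toy arithmetic language (illustration only).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("PLUS", r"\+"),
    ("TIMES", r"\*"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


@dataclass
class Node:
    kind: str                                   # production name or token type
    text: str = ""                              # source text, used by token nodes
    children: list = field(default_factory=list)


def lex(source):
    """Stage 1: turn characters into a list of (type, text) tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":           # whitespace is matched but dropped
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    tokens.append(("EOF", ""))
    return tokens


class Parser:
    """Stage 2: one method per production; each returns the node it built."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0]

    def consume(self, expected):
        kind, text = self.tokens[self.pos]
        if kind != expected:
            raise SyntaxError(f"expected {expected}, found {kind}")
        self.pos += 1
        return Node(kind, text)

    def sum(self):                              # Sum : Product ( "+" Product )*
        node = Node("Sum", children=[self.product()])
        while self.peek() == "PLUS":
            node.children += [self.consume("PLUS"), self.product()]
        return node

    def product(self):                          # Product : Atom ( "*" Atom )*
        node = Node("Product", children=[self.atom()])
        while self.peek() == "TIMES":
            node.children += [self.consume("TIMES"), self.atom()]
        return node

    def atom(self):                             # Atom : NUMBER | "(" Sum ")"
        if self.peek() == "NUMBER":             # choice point: peek one token ahead
            return Node("Atom", children=[self.consume("NUMBER")])
        children = [self.consume("LPAREN"), self.sum(), self.consume("RPAREN")]
        return Node("Atom", children=children)


def parse(source):
    parser = Parser(lex(source))
    tree = parser.sum()
    parser.consume("EOF")                       # reject trailing input
    return tree


def evaluate(node):
    """The application's part: walk the tree the parser built."""
    if node.kind == "NUMBER":
        return int(node.text)
    punctuation = ("PLUS", "TIMES", "LPAREN", "RPAREN")
    values = [evaluate(c) for c in node.children if c.kind not in punctuation]
    return math.prod(values) if node.kind == "Product" else sum(values)
```

In this sketch, evaluate(parse("2 + 3 * (4 + 1)")) walks a Sum node whose children are a Product, a PLUS token, and another Product. Keeping the token nodes in the tree is a choice made for this example; it says nothing about which nodes a CongoCC-generated parser keeps.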

Terminology

A few terms recur throughout this manual; they are defined fully in the Appendix: Glossary.

Grammar

The complete description of a language, written in a .ccc file (The Grammar File).

Token / terminal

An indivisible lexical unit produced by the lexer. Token types are declared in token productions (Lexical Specification).

Production / non-terminal

A named grammar rule (Productions and Expansions). CongoCC generates one parser method per production.

Expansion

The right-hand side of a production — the pattern of tokens, non-terminals, and operators it matches.

Node / syntax tree

The parser’s output. Each production and token can contribute a node (Tree Building).

Lexical state

A mode the lexer is in that determines which tokens it can currently match (Lexical Specification).

Lookahead

Information the parser uses to choose between alternatives at a choice point (Disambiguation).
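
Of the terms above, lexical state is perhaps the least intuitive, so here is a minimal hand-written sketch of the idea. It is not CongoCC syntax or generated code; the state names, token types, and the lex_with_states function are hypothetical. Each state carries its own set of token patterns, and matching certain tokens switches the lexer into another state.

```python
import re

# Hypothetical lexical states (illustration only). Each entry is
# (token type, pattern, state to switch to after a match, or None to stay).
STATES = {
    "DEFAULT": [
        ("WORD", re.compile(r"[A-Za-z]+"), None),
        ("QUOTE_OPEN", re.compile(r'"'), "IN_STRING"),
        ("SPACE", re.compile(r"\s+"), None),
    ],
    "IN_STRING": [
        ("STRING_TEXT", re.compile(r'[^"]+'), None),
        ("QUOTE_CLOSE", re.compile(r'"'), "DEFAULT"),
    ],
}


def lex_with_states(source):
    """Tokenize source; only the current state's patterns are tried."""
    state, pos, tokens = "DEFAULT", 0, []
    while pos < len(source):
        for kind, pattern, next_state in STATES[state]:
            match = pattern.match(source, pos)
            if match:
                if kind != "SPACE":             # whitespace is skipped in DEFAULT
                    tokens.append((kind, match.group()))
                pos = match.end()
                if next_state is not None:
                    state = next_state
                break
        else:
            raise SyntaxError(f"no token matches at offset {pos} in state {state}")
    return tokens
```

Given the input say "hello world" now, the text between the quotes becomes a single STRING_TEXT token, spaces included, because the WORD and SPACE patterns are not available while the lexer is in the IN_STRING state.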

How this manual is organized

The chapters move from running the tool, through the grammar language, to the code it generates.

For tutorials and task-oriented guidance, see the User Guide; for what differs between target languages, see the Target Language Guide.