Appendix: Grammar of the Grammar

This appendix gives the syntax of the CongoCC grammar language itself, in EBNF. It is a curated, readable presentation meant for quick reference. The authoritative and complete definition is the self-hosted grammar that CongoCC uses to parse .ccc files (CongoCC.ccc in the source distribution), which covers details omitted here.

Notation

=        defines a rule
|        alternatives
{ x }    zero or more repetitions of x
[ x ]    x is optional
( x )    grouping
"x"      the literal text x
Name     an identifier
String   a quoted string
Int      an integer literal

Grammar file

GrammarFile    = { Setting } { GrammarElement }
GrammarElement = TokenProduction | Production | Injection | Inclusion
Setting        = Name [ "=" Value ] ";"
Value          = "true" | "false" | Int | String | Name { "," Name }
Inclusion      = "INCLUDE" Location { "!" Location } [ ";" ]
Location       = String | Name                 // a path, or a built-in alias

A setting written as a bare name is a boolean set to true. Comments use // and /* */. The preprocessor directives (#if and friends) are handled before this grammar applies; see The Grammar File.
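
As a rough illustration of these rules, a grammar file might open as follows. This is only a sketch: the package name and include path are hypothetical, and the setting names are illustrative, so check the settings reference for the names CongoCC actually recognizes.

// Hypothetical grammar header; the package name and path are made up.
PARSER_PACKAGE = "org.example.calc";    // String value
TABS_TO_SPACES = 4;                     // Int value
DEACTIVATE_TOKENS = RECORD, VAR;        // Name { "," Name }
ENSURE_FINAL_EOL;                       // bare name, same as ENSURE_FINAL_EOL = true;

INCLUDE "common/Lexer.ccc"              // the trailing semicolon is optional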

Token productions

TokenProduction = [ States ] TokenKind [ "[" "IGNORE_CASE" "]" ] [ "#" Name ] ":"
                  RegexpSpec { "|" RegexpSpec } ";"
States          = "<" ( "*" | Name { "," Name } ) ">"
TokenKind       = "TOKEN" | "UNPARSED" | "SKIP" | "MORE" | "CONTEXTUAL"

RegexpSpec      = String
                | "<" [ [ "#" | "?" ] Name ":" ] Regexp ">"
                  [ "#" Name ] [ Action ] [ ":" Name ]
Regexp          = RegexpSeq { "|" RegexpSeq }
RegexpSeq       = RegexpUnit { RegexpUnit }
RegexpUnit      = String
                | "<" Name ">"                  // reference to another token/regexp
                | CharClass
                | "(" Regexp ")" [ "*" | "+" | "?" | "{" Int [ "," [ Int ] ] "}" ]
CharClass       = [ "~" ] "[" [ CharRange { "," CharRange } ] "]"
CharRange       = Char [ "-" Char ]

In a RegexpSpec, a leading # marks a private regular expression and a leading ? marks a lazy token; a trailing : Name switches lexical state after the match.
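
A small set of token productions, sketched for a hypothetical expression language, shows the common forms side by side. All token and state names here are invented for illustration.

// Hypothetical token definitions; none of these names come from a real grammar.
SKIP : " " | "\t" | "\n" | "\r" ;       // plain String alternatives

TOKEN :
    <PLUS : "+">
  | <NUMBER : (<DIGIT>)+ ( "." (<DIGIT>)+ )? >
  | <#DIGIT : ["0"-"9"]>                // private: usable in other regexps only
  | <OPEN_TEMPLATE : "${"> : IN_TEMPLATE   // switch to state IN_TEMPLATE after the match
;

<IN_TEMPLATE> TOKEN :
    <CLOSE_TEMPLATE : "}"> : DEFAULT_STATE   // DEFAULT_STATE is a made-up state name
  | <TEXT : (~["}"])+ >                 // negated character class
;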

Productions

Production     = [ Access ] [ ReturnType ] Name [ "(" Params ")" ] [ "throws" Names ]
                 [ NodeDescriptor ] [ "RECOVER_TO" Expansion ] ":"
                 [ Name ":" ] Expansion ";"
Access         = "public" | "private" | "protected"
NodeDescriptor = "#" ( Name | "void" | "abstract" | "interface" )
                 [ "(" [ CmpOp ] Expression ")" ]
CmpOp          = ">" | ">=" | "<" | "<="

The optional Name ":" after the first colon sets the production’s starting lexical state.
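
The following sketch shows a few productions that fit the rules above, assuming a hypothetical arithmetic grammar. Every production, node, token, and state name is made up.

// Hypothetical productions for illustration only.
Root : Sum <EOF> ;

Sum #void :                             // #void: this production builds no node of its own
    Product ( ( "+" | "-" ) Product )*
;

Product #Product(>1) :                  // node descriptor with CmpOp ">" and Expression 1
    <NUMBER> ( ( "*" | "/" ) <NUMBER> )*
;

TemplateText : IN_TEMPLATE :            // Name ":" after the first colon: start in state IN_TEMPLATE
    ( <TEXT> )+
;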

Expansions

Expansion      = Sequence { "|" Sequence }
Sequence       = [ Lookahead ] { ExpansionUnit }
ExpansionUnit  = Terminal
               | NonTerminal
               | "(" Expansion ")" [ "*" | "+" | "?" ]
               | "[" Expansion "]"
               | Action
               | Assertion
               | "FAIL" [ Message ]
               | "ATTEMPT" Expansion "RECOVER" ( "(" Expansion ")" | Action )
               | "UNCACHE_TOKENS"
               | TokenActivation "(" Expansion ")"
Terminal       = [ Assignment ] ( String | "<" Name ">" | "<" "EOF" ">" )
                 [ "!" ] [ UpToHere ]
NonTerminal    = [ Assignment ] Name [ "(" Args ")" ] [ "!" ] [ UpToHere ]
TokenActivation= ( "ACTIVATE_TOKENS" | "DEACTIVATE_TOKENS" | "ACTIVE_TOKENS" )
                 Name { [ "," ] Name }
UpToHere       = "=>|" ( "|" | "+" Digit )
Action         = "{" TargetLanguageCode "}"
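
To make the expansion forms concrete, here is a sketch of a statement production that uses grouping, an optional part, and the up-to-here marker. The production names are hypothetical, and the comments describe the marker as the author of this example understands it: the lookahead scan for that alternative runs up to the marked point.

// Hypothetical statement production; Expression and Block are assumed to be defined elsewhere.
Statement :
    "if" =>|| "(" Expression ")" Statement [ "else" Statement ]   // Terminal followed by UpToHere
  | "return" =>|| [ Expression ] ";"
  | Block
  | Expression ";"
;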

Lookahead and assertions

Lookahead      = ( "SCAN" | Int ) [ Int ]
                 { "{" Expression "}" }
                 [ LookBehind ]
                 [ [ "~" ] Expansion ] "=>"
LookBehind     = [ "~" ] ( "\" | "/" ) PathElem { ( "\" | "/" ) PathElem }
PathElem       = [ "~" ] Name | "." | "..."
Assertion      = ( "ASSERT" | "ENSURE" )
                 ( "{" Expression [ ":" Message ] "}" | [ "~" ] "(" Expansion ")" )
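
The three lookahead styles and an assertion can be sketched in one choice point. This is an illustration only: the production names and the allowLambda and depth fields are hypothetical, and the example assumes Message is a string literal.

// Hypothetical choice point; nothing here is taken from a real grammar.
Primary :
    SCAN 2 => <IDENTIFIER> "(" [ Arguments ] ")"     // numeric lookahead: Int after SCAN
  | SCAN "(" Type ")" => "(" Type ")" Primary        // syntactic lookahead: an Expansion before =>
  | SCAN { allowLambda } => Lambda                   // semantic lookahead: a target-language Expression
  | ASSERT { depth < 100 : "nesting too deep" }      // assertion with a Message
    "(" Sum ")"
;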

Injection

Injection  = "INJECT" Name ":" [ Imports ] [ "extends" Type ]
             [ "implements" Types ] [ "{" Members "}" ]
           | "INJECT" ":" "{" CompilationUnit "}"     // top-level code
           | "INJECT" [ Name ] "{" Code "}"           // raw block

The injection target Name may be a node type, a magic class name (PARSER_CLASS, LEXER_CLASS, BASE_NODE_CLASS, BASE_TOKEN_CLASS), or the Node interface; see Code Injection.
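
A sketch of the first Injection form, assuming Java as the target language, might look like this. The field, the Identifier node type, and the Named interface are hypothetical.

// Inject imports and a member into the generated parser class.
INJECT PARSER_CLASS :
    import java.util.ArrayDeque;
    import java.util.Deque;
{
    // Hypothetical field for use in actions and semantic lookahead.
    private final Deque<String> scopes = new ArrayDeque<>();
}

// Make a node type implement an interface; Named is assumed to declare getName().
INJECT Identifier : implements Named
{
    public String getName() { return toString(); }
}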