How the Lexer Works¶

This chapter explains how CongoCC’s lexer recognizes tokens. As with the parsing model, it is background to help you understand behavior, not a how-to.

From patterns to an automaton¶

The token patterns in a grammar are regular expressions, and CongoCC compiles them into a nondeterministic finite automaton (NFA) per lexical state. At run time the lexer feeds characters through the automaton, tracking the set of states it could be in, and at each point remembers the longest match found so far. When no further character can extend a match, it emits the longest matching token — the maximal-munch rule — with ties broken by declaration order.

No DFA table to tune¶

Many lexer generators convert the automaton to a deterministic finite automaton (DFA) and emit a big transition table. CongoCC works directly from the NFA representation instead. The practical consequence for you is that there is no generated lexer table to size or tune and no table-related limits to run into; the lexer is ordinary, readable code. It also makes features like full 32-bit Unicode character classes, lazy tokens, and per-state token sets fall out naturally, because they are expressed in the automaton rather than fought against a fixed table format.

Lexical states and context¶

Each lexical state is its own automaton over just the tokens active in that state, which is what makes states an efficient way to handle regions with different lexical rules. The more dynamic forms of context sensitivity — token activation and contextual tokens (Advanced Tokenization) — work by adjusting which tokens the lexer will currently accept, on top of this same automaton-based core.

How the Lexer Works¶

From patterns to an automaton¶

No DFA table to tune¶

Lexical states and context¶

CongoCC

Navigation

Related Topics