How To: Parse Large Inputs

CongoCC parsers read the entire input into memory rather than streaming it. This is a deliberate simplification: it makes lookahead, backtracking, and the position information on every node straightforward, and for the overwhelming majority of inputs it is exactly the right trade-off. This guide covers the cases where input size actually matters.

The memory model

When you construct a parser from a file or a CharSequence, the whole text is held in memory, and the tokens and tree are built on top of it. As a rule of thumb, budget for the sum of three things: the input text, the token stream, and the tree. On a machine with gigabytes of memory, inputs of many megabytes are unremarkable.
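To turn the rule of thumb into a number for your own grammar, one option is to compare used heap before and after a parse. The following is a minimal sketch, not part of CongoCC: HeapProbe and measure are hypothetical names, and the workload in main is a stand-in for constructing your generated parser and running its root production. System.gc() is only a hint to the JVM, so treat the figure as a rough estimate rather than an exact measurement.

```java
import java.util.function.Supplier;

final class HeapProbe {

    /** Value produced by the measured work plus the approximate change in used heap. */
    static final class Measurement<T> {
        final T result;
        final long heapDeltaBytes;

        Measurement(T result, long heapDeltaBytes) {
            this.result = result;
            this.heapDeltaBytes = heapDeltaBytes;
        }
    }

    /** Heap currently in use, as reported by the JVM. */
    private static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs the work once and reports roughly how much heap its result retains. */
    static <T> Measurement<T> measure(Supplier<T> work) {
        System.gc();                    // a hint only; the JVM may ignore it
        long before = usedHeapBytes();
        T result = work.get();          // result stays reachable, so it is counted
        System.gc();
        long after = usedHeapBytes();
        return new Measurement<>(result, after - before);
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption): in real use, construct the generated
        // parser here, run the root production, and return the resulting tree.
        Measurement<char[]> m = measure(() -> new char[8 * 1024 * 1024]);
        System.out.printf("retained roughly %d KiB; max heap is %d MiB%n",
                m.heapDeltaBytes / 1024,
                Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}
```

Running this against a representative input and comparing the result with the reported maximum heap gives a quick sense of whether input size is a real concern.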

Releasing tokens you are past

For genuinely large inputs where you do not need to revisit earlier tokens, the UNCACHE_TOKENS construct (see Advanced Tokenization) lets the parser release tokens it has moved beyond. This bounds token memory instead of retaining the whole stream.

Practical advice

  • Measure before optimizing. Most “large” inputs are not large enough to matter; confirm there is a real memory problem before complicating a grammar.

  • Split at natural boundaries. If your input is really a sequence of independent records, parse them one at a time with a fresh parser per record instead of one giant parse.

  • Give the JVM room. When you do parse very large files, raise the maximum heap size (for example with the JVM's -Xmx option) rather than fighting the whole-file model.

  • Skip aggressively. Sending bulk ignorable content to SKIP keeps it out of the token stream and the tree entirely.
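
The "split at natural boundaries" advice can be sketched as follows. This is an illustration under stated assumptions, not a CongoCC API: RecordAtATime and forEachRecord are hypothetical names, records are assumed to be separated by blank lines, and the parseOne function stands in for constructing a fresh generated parser on one record's text and running its root production. Results go to a consumer rather than into a returned list, so only the current record needs to be in memory at once.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

final class RecordAtATime {

    /**
     * Reads blank-line-separated records and hands each one to parseOne.
     * Only the text of the current record is buffered.
     */
    static <T> void forEachRecord(BufferedReader in,
                                  Function<String, T> parseOne,
                                  Consumer<T> sink) throws IOException {
        StringBuilder record = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty()) {
                flush(record, parseOne, sink);   // a blank line ends the record
            } else {
                record.append(line).append('\n');
            }
        }
        flush(record, parseOne, sink);           // last record may lack a trailing blank line
    }

    /** Parses the buffered record, if any, and clears the buffer for the next one. */
    private static <T> void flush(StringBuilder record,
                                  Function<String, T> parseOne,
                                  Consumer<T> sink) {
        if (record.length() > 0) {
            sink.accept(parseOne.apply(record.toString()));
            record.setLength(0);
        }
    }

    public static void main(String[] args) throws IOException {
        String input = "a = 1\nb = 2\n\nc = 3\n";
        List<Integer> lineCounts = new ArrayList<>();

        // Stand-in for a real parse (assumption): counts lines instead of
        // building a tree with a fresh generated parser.
        Function<String, Integer> parseOne = text -> text.split("\n").length;
        Consumer<Integer> sink = lineCounts::add;

        forEachRecord(new BufferedReader(new StringReader(input)), parseOne, sink);
        System.out.println(lineCounts);          // prints [2, 1]
    }
}
```

In real use the reader would wrap the large file, and the consumer would process and then drop each tree, so that peak memory tracks the largest record rather than the whole input.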