How To: Parse Large Inputs
==========================

CongoCC parsers read the **entire input into memory** rather than streaming it.
This is a deliberate simplification: it makes lookahead, backtracking, and the
position information on every node straightforward. For the overwhelming
majority of inputs it is exactly the right trade-off. This guide covers the
cases where input size actually matters.

The memory model
----------------

When you construct a parser from a file or a ``CharSequence``, the whole text is
held in memory, and the tokens and tree are built on top of it. As a rule of
thumb, budget for three things that are live at the same time: the input text,
the token stream, and the tree. On current hardware that is rarely a problem;
inputs of many megabytes are unremarkable.
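
One rough way to check that budget against your own inputs is to compare the
used heap before and after the parse. The sketch below uses only JDK APIs;
``HeapProbe``, ``Measurement``, and ``measure`` are hypothetical names, and the
supplier stands in for whatever call runs your generated parser and returns its
tree. The garbage-collection request is only a hint to the JVM, so treat the
result as an estimate rather than an exact figure.

```java
import java.util.function.Supplier;

public class HeapProbe {

    /** Holds the value that was built and the approximate heap growth it caused. */
    public static final class Measurement<T> {
        public final T value;          // kept reachable so it is not collected early
        public final long approxBytes; // heap growth observed around the call

        Measurement(T value, long approxBytes) {
            this.value = value;
            this.approxBytes = approxBytes;
        }
    }

    /** Used heap right now, after asking (not forcing) the JVM to collect garbage. */
    private static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // a hint only; the JVM may ignore it
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs the supplier and reports roughly how much heap its result retains. */
    public static <T> Measurement<T> measure(Supplier<T> work) {
        long before = usedHeap();
        T value = work.get();
        long after = usedHeap();
        return new Measurement<>(value, after - before);
    }

    public static void main(String[] args) {
        // Stand-in workload: replace the lambda with a call that parses your
        // input and returns the root node.
        Measurement<char[]> m = measure(() -> new char[8 * 1024 * 1024]);
        System.out.printf("approx. %,d KiB retained%n", m.approxBytes / 1024);
        System.out.printf("max heap: %,d MiB%n",
                Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}
```

Comparing the reported growth with the maximum heap gives a first indication of
whether input size is a real concern for your workload.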

Releasing tokens you are past
-----------------------------

For genuinely large inputs where you do not need to revisit earlier tokens, the
``UNCACHE_TOKENS`` construct (:doc:`/docs/reference/tokenization-advanced`) lets
the parser release tokens it has moved beyond, bounding the token memory rather
than retaining the whole stream.

Practical advice
----------------

- **Measure before optimizing.** Most "large" inputs are not large enough to
  matter; confirm there is a real memory problem before complicating a grammar.
- **Split at natural boundaries.** If your input is really a sequence of
  independent records, parse them one at a time with a fresh parser per record
  instead of one giant parse.
- **Give the JVM room.** When you do parse very large files, raise the heap size
  rather than fighting the whole-file model.
- **Skip aggressively.** Sending bulk ignorable content to ``SKIP`` keeps it out
  of the token stream and the tree entirely.
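
The "split at natural boundaries" advice can be sketched as follows, assuming
the records are separated by line breaks. ``RecordSplitter`` and
``forEachRecord`` are hypothetical names, and the ``parseRecord`` callback is a
placeholder for code that constructs a fresh parser from the record text and
runs it; nothing here is part of the CongoCC API. Only one record's text is held
at a time, so memory use is bounded by the largest record rather than by the
whole file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;

public class RecordSplitter {

    /**
     * Reads line-delimited records one at a time and hands each non-empty
     * record to the callback. Returns the number of records processed.
     */
    public static int forEachRecord(Reader input, Consumer<String> parseRecord)
            throws IOException {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines between records
                }
                // Placeholder: build a fresh parser for this record here.
                parseRecord.accept(line);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        String input = "a=1\nb=2\n\nc=3\n";
        int n = forEachRecord(new StringReader(input),
                record -> System.out.println("parsing: " + record));
        System.out.println(n + " records");
    }
}
```

For a real file, pass a reader such as ``Files.newBufferedReader(path)`` instead
of the ``StringReader``. If your records span several lines, the same loop
applies with a different rule for deciding where a record ends.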