Advanced Tokenization ===================== The :doc:`lexical` chapter covers declaring tokens and lexical states. This chapter covers the features for **context-sensitive tokenization** — having the same input tokenize differently depending on where the parser is — together with lazy tokens, synthetic tokens, and token memory management. Three tools address context sensitivity, in rough order of how localized they are: lexical states (a mode the lexer is in), token activation (turning token types on and off as parsing proceeds), and contextual tokens (a token that is only recognized where the grammar expects it). Guidance on choosing among them is in :doc:`/docs/userguide/howto/context-sensitive`. Activating and deactivating tokens ---------------------------------- A token type can be switched on and off during parsing, so a word is treated as a keyword in some places and an ordinary identifier in others — the classic problem of "soft" keywords. Start a token off with the ``DEACTIVATE_TOKENS`` setting, then turn it on for a specific expansion with an ``ACTIVATE_TOKENS`` prefix on a parenthesized group (``DEACTIVATE_TOKENS`` works the same way to turn one off): .. code-block:: ccc DEACTIVATE_TOKENS = KW; TOKEN : | ; Root : ACTIVATE_TOKENS KW ( ) ; Because ``KW`` starts deactivated, the first ``begin`` lexes as an ``ID``; the ``ACTIVATE_TOKENS KW`` prefix makes ``KW`` live for the following group, so the second ``begin`` lexes as the keyword. Parsing ``begin begin`` gives: .. code-block:: text ID: (1, 1) - (1, 5): begin KW: (1, 7) - (1, 11): begin Token: (1, 1) - (1, 1): EOF Contextual tokens ----------------- A token declared with the ``CONTEXTUAL`` kind is only produced where the parser actually allows it; elsewhere its text tokenizes by the other rules. This gives soft-keyword behavior without managing activation by hand. The pattern of a contextual token must be a plain string literal: .. code-block:: ccc CONTEXTUAL : ; TOKEN : ; Where the grammar expects ````, the word ``from`` is the ``FROM`` token; everywhere else it is an ordinary ``ID``. Lazy tokens ----------- A token whose name is prefixed with ``?`` is **lazy**: it matches the *shortest* text that satisfies its pattern rather than the longest. This is exactly what block comments need, so that ``/* … */`` ends at the first ``*/`` instead of the last: .. code-block:: ccc UNPARSED : ; TOKEN : ; Root : ( )* ; Parsing ``a /* one */ b /* two */ c`` keeps the two comments separate, so all three identifiers come through: .. code-block:: text ID: (1, 1) - (1, 1): a ID: (1, 13) - (1, 13): b ID: (1, 25) - (1, 25): c Token: (1, 1) - (1, 1): EOF Without the ``?``, the greedy ``(~[])*`` would run from the first ``/*`` to the last ``*/``, swallowing ``b`` along with both comment bodies. Synthetic tokens ---------------- Some languages need token types that the lexer cannot produce by pattern matching alone — the ``INDENT`` and ``DEDENT`` tokens of an indentation-sensitive language, for instance. Declare such types with the ``EXTRA_TOKENS`` setting, and emit them from a ``TOKEN_HOOK`` (see :doc:`injection`), which can inspect each token and insert others around it: .. code-block:: ccc EXTRA_TOKENS = INDENT, DEDENT; Inserting a token into the stream this way is *token chaining*; it is enabled by the ``TOKEN_CHAINING`` setting, which also turns on automatically when the grammar uses the chaining API. See :doc:`settings`. Releasing tokens ---------------- For very large inputs you may not want the whole token stream retained in memory. The ``UNCACHE_TOKENS`` construct, used in an expansion, releases tokens the parser no longer needs as it moves forward. Token hooks ----------- The ``TOKEN_HOOK`` method underlies several of the features above: it runs for every token and may inspect, replace, or chain tokens. Because a hook is just an injected method, it is documented with the injection mechanism in :doc:`injection`.