Advanced Tokenization

The Lexical Specification chapter covers declaring tokens and lexical states. This chapter covers context-sensitive tokenization, where the same input tokenizes differently depending on where the parser is, along with lazy tokens, synthetic tokens, and token memory management.

Three tools address context sensitivity, listed roughly from broadest to most localized: lexical states (a mode the lexer is in), token activation (turning token types on and off as parsing proceeds), and contextual tokens (a token that is only recognized where the grammar expects it). Guidance on choosing among them is in How To: Handle Context-Sensitive Input.

Activating and deactivating tokens

A token type can be switched on and off during parsing, so a word is treated as a keyword in some places and an ordinary identifier in others — the classic problem of “soft” keywords.

Make a token initially inactive with the DEACTIVATE_TOKENS setting, then turn it on for a specific expansion by prefixing a parenthesized group with ACTIVATE_TOKENS. A DEACTIVATE_TOKENS prefix works the same way to turn a token off for a group:

DEACTIVATE_TOKENS = KW;
TOKEN : <KW : "begin"> | <ID : (["a"-"z"])+ > ;

Root : <ID> ACTIVATE_TOKENS KW ( <KW> ) <EOF> ;

Because KW starts deactivated, the first begin lexes as an ID. The ACTIVATE_TOKENS KW prefix makes KW live for the group that follows, so the second begin lexes as the keyword. Parsing the input begin begin gives:

<Root (1, 1)-(1, 11)>
  ID: (1, 1) - (1, 5): begin
  KW: (1, 7) - (1, 11): begin
  Token: (1, 1) - (1, 1): EOF
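
The effect of activation can be modeled outside the tool. The sketch below is a conceptual Python model, not the generated lexer: the names TOKEN_PATTERNS, next_token and parse_root are hypothetical, and it assumes spaces are skipped, that the longest match wins, and that the earlier declaration wins ties.

```python
import re

# Hypothetical token table mirroring the grammar above; KW is declared before ID.
TOKEN_PATTERNS = [("KW", re.compile(r"begin")), ("ID", re.compile(r"[a-z]+"))]

def next_token(text, pos, active):
    """Return (type, lexeme, end_offset), considering only token types in `active`."""
    while pos < len(text) and text[pos] == " ":   # assumption: spaces are skipped
        pos += 1
    if pos == len(text):
        return ("EOF", "", pos)
    best = None
    for name, pattern in TOKEN_PATTERNS:
        if name not in active:
            continue                               # deactivated types never match
        m = pattern.match(text, pos)
        # Longest match wins; on a tie the earlier declaration is kept.
        if m and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())
    if best is None:
        raise ValueError(f"no active token matches at offset {pos}")
    return best

def parse_root(text):
    """Model of: Root : <ID> ACTIVATE_TOKENS KW ( <KW> ) <EOF> ;"""
    active = {"ID"}                                # KW starts deactivated
    first = next_token(text, 0, active)
    active.add("KW")                               # entering the activated group
    second = next_token(text, first[2], active)
    active.discard("KW")                           # leaving the group
    return [first[:2], second[:2]]
```

Calling parse_root("begin begin") in this model returns [("ID", "begin"), ("KW", "begin")], matching the tree above: the same five characters produce different token types purely because the active set changed between the two calls.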

Contextual tokens

A token declared with the CONTEXTUAL kind is only produced where the parser actually allows it; elsewhere its text tokenizes by the other rules. This gives soft-keyword behavior without managing activation by hand. The pattern of a contextual token must be a plain string literal:

CONTEXTUAL : <FROM : "from"> ;
TOKEN : <ID : (["a"-"z"])+ > ;

Where the grammar expects <FROM>, the word from is the FROM token; everywhere else it is an ordinary ID.
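
One way to picture this is a lexer that asks the parser which token types are acceptable at the current position. The following Python sketch is only an illustration of that idea under stated assumptions: classify, tokenize and the expected_at map are hypothetical stand-ins for the parser feedback, not part of the tool.

```python
import re

WORD = re.compile(r"[a-z]+")
CONTEXTUAL = {"from": "FROM"}      # contextual tokens are plain string literals

def classify(word, expected):
    """Return FROM only when the parser would accept it here; otherwise ID."""
    kind = CONTEXTUAL.get(word)
    if kind is not None and kind in expected:
        return kind
    return "ID"

def tokenize(text, expected_at):
    """expected_at maps a token index to the set of types acceptable there.

    This map is a hypothetical stand-in for what a parser knows about its
    own position; indexes that are not listed accept only ID.
    """
    tokens = []
    for index, match in enumerate(WORD.finditer(text)):
        expected = expected_at.get(index, {"ID"})
        tokens.append((classify(match.group(), expected), match.group()))
    return tokens
```

With tokenize("from from x", {1: {"FROM"}}), only the second word becomes FROM; the first from and the x are both plain ID tokens, because nothing at those positions asked for the keyword.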

Lazy tokens

A token whose name is prefixed with ? is lazy: it matches the shortest text that satisfies its pattern rather than the longest. This is exactly what block comments need, so that /* … */ ends at the first */ instead of the last:

UNPARSED : <?BLOCK_COMMENT : "/*" (~[])* "*/" > ;
TOKEN : <ID : (["a"-"z"])+ > ;
Root : ( <ID> )* <EOF> ;

Parsing the input a /* one */ b /* two */ c keeps the two comments separate, so all three identifiers come through:

<Root (1, 1)-(1, 25)>
  ID: (1, 1) - (1, 1): a
  ID: (1, 13) - (1, 13): b
  ID: (1, 25) - (1, 25): c
  Token: (1, 1) - (1, 1): EOF

Without the ?, the greedy (~[])* would run from the first /* to the last */, swallowing b along with both comment bodies.
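
The same shortest-versus-longest distinction exists in ordinary regular expressions, which makes it easy to try out. This sketch uses Python's re module as an analogy only; the lazy quantifier *? plays the role of the ? prefix here, and strip_comments is a hypothetical helper.

```python
import re

# Greedy: .* extends to the LAST "*/" in the input.
GREEDY = re.compile(r"/\*.*\*/", re.DOTALL)
# Lazy: .*? stops at the FIRST "*/" after each "/*".
LAZY = re.compile(r"/\*.*?\*/", re.DOTALL)

text = "a /* one */ b /* two */ c"

def strip_comments(pattern, source):
    """Remove every comment matched by `pattern` and return the remaining words."""
    return pattern.sub("", source).split()

lazy_ids = strip_comments(LAZY, text)      # ['a', 'b', 'c']
greedy_ids = strip_comments(GREEDY, text)  # ['a', 'c']
```

The greedy pattern treats everything from the first /* to the last */ as one comment, so b disappears; the lazy pattern yields two separate comments and leaves all three identifiers.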

Synthetic tokens

Some languages need token types that the lexer cannot produce by pattern matching alone — the INDENT and DEDENT tokens of an indentation-sensitive language, for instance. Declare such types with the EXTRA_TOKENS setting, and emit them from a TOKEN_HOOK (see Code Injection), which can inspect each token and insert others around it:

EXTRA_TOKENS = INDENT, DEDENT;

Inserting a token into the stream this way is called token chaining. It is enabled by the TOKEN_CHAINING setting, which is also turned on automatically when the grammar uses the chaining API. See Settings Reference.
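
To make the INDENT and DEDENT case concrete, here is a minimal Python sketch of the usual indentation-stack technique that such a hook might apply. It is not the tool's hook API: with_indentation and the LINE token type are hypothetical, input is taken line by line, and only leading spaces are counted.

```python
def with_indentation(lines):
    """Yield (type, text) pairs, inserting synthetic INDENT and DEDENT tokens.

    A stack holds the indentation widths currently open. A deeper line
    pushes a width and emits INDENT; a shallower line pops widths and
    emits one DEDENT per pop. Inconsistent dedents are not rejected in
    this sketch; a real implementation would likely report them.
    """
    stack = [0]
    for line in lines:
        if not line.strip():
            continue                      # blank lines do not affect indentation
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT", "")
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT", "")
        yield ("LINE", line.strip())
    while len(stack) > 1:                 # close any blocks still open at end of input
        stack.pop()
        yield ("DEDENT", "")
```

The synthetic tokens carry no text of their own; they exist only because the surrounding tokens were inspected, which is why they cannot be described by a pattern.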

Releasing tokens

For very large inputs you may not want the whole token stream retained in memory. The UNCACHE_TOKENS construct, used in an expansion, releases tokens the parser no longer needs as it moves forward.
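
The memory effect can be illustrated with a small model. The Python sketch below is an assumption-laden illustration rather than a description of how the generated parser stores tokens: TokenWindow, uncache and count_records are hypothetical names, and tokens are released after each ";"-terminated record.

```python
from collections import deque

class TokenWindow:
    """Hypothetical token cache that retains tokens until they are released."""

    def __init__(self, source):
        self._source = iter(source)
        self._cache = deque()

    def next(self):
        token = next(self._source)        # raises StopIteration at end of input
        self._cache.append(token)
        return token

    def uncache(self):
        """Release every token consumed so far."""
        self._cache.clear()

    def retained(self):
        return len(self._cache)

def count_records(tokens):
    """Count ';'-terminated records, releasing tokens after each one.

    Returns (record_count, peak_retained) so the memory bound is visible.
    """
    window = TokenWindow(tokens)
    records = 0
    peak = 0
    try:
        while True:
            token = window.next()
            peak = max(peak, window.retained())
            if token == ";":
                records += 1
                window.uncache()          # nothing before this point is needed again
    except StopIteration:
        pass
    return records, peak
```

In this model the peak number of retained tokens is bounded by the longest single record rather than by the length of the whole input, which is the property that matters for very large files.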

Token hooks

The TOKEN_HOOK method underlies several of the features above: it runs for every token and may inspect, replace, or chain tokens. Because a hook is just an injected method, it is documented with the injection mechanism in Code Injection.