Advanced Tokenization¶
The Lexical Specification chapter covers declaring tokens and lexical states. This chapter covers the features for context-sensitive tokenization — having the same input tokenize differently depending on where the parser is — together with lazy tokens, synthetic tokens, and token memory management.
Three tools address context sensitivity, in rough order of how localized they are: lexical states (a mode the lexer is in), token activation (turning token types on and off as parsing proceeds), and contextual tokens (a token that is only recognized where the grammar expects it). Guidance on choosing among them is in How To: Handle Context-Sensitive Input.
Activating and deactivating tokens¶
A token type can be switched on and off during parsing, so a word is treated as a keyword in some places and an ordinary identifier in others — the classic problem of “soft” keywords.
Start a token off with the DEACTIVATE_TOKENS setting, then turn it on for a
specific expansion with an ACTIVATE_TOKENS prefix on a parenthesized group
(DEACTIVATE_TOKENS works the same way to turn one off):
DEACTIVATE_TOKENS = KW;
TOKEN : <KW : "begin"> | <ID : (["a"-"z"])+ > ;
Root : <ID> ACTIVATE_TOKENS KW ( <KW> ) <EOF> ;
Because KW starts deactivated, the first begin lexes as an ID; the
ACTIVATE_TOKENS KW prefix makes KW live for the following group, so the
second begin lexes as the keyword. Parsing begin begin gives:
<Root (1, 1)-(1, 11)>
ID: (1, 1) - (1, 5): begin
KW: (1, 7) - (1, 11): begin
Token: (1, 1) - (1, 1): EOF
Contextual tokens¶
A token declared with the CONTEXTUAL kind is only produced where the parser
actually allows it; elsewhere its text tokenizes by the other rules. This gives
soft-keyword behavior without managing activation by hand. The pattern of a
contextual token must be a plain string literal:
CONTEXTUAL : <FROM : "from"> ;
TOKEN : <ID : (["a"-"z"])+ > ;
Where the grammar expects <FROM>, the word from is the FROM token;
everywhere else it is an ordinary ID.
Lazy tokens¶
A token whose name is prefixed with ? is lazy: it matches the shortest
text that satisfies its pattern rather than the longest. This is exactly what
block comments need, so that /* … */ ends at the first */ instead of the
last:
UNPARSED : <?BLOCK_COMMENT : "/*" (~[])* "*/" > ;
TOKEN : <ID : (["a"-"z"])+ > ;
Root : ( <ID> )* <EOF> ;
Parsing a /* one */ b /* two */ c keeps the two comments separate, so all
three identifiers come through:
<Root (1, 1)-(1, 25)>
ID: (1, 1) - (1, 1): a
ID: (1, 13) - (1, 13): b
ID: (1, 25) - (1, 25): c
Token: (1, 1) - (1, 1): EOF
Without the ?, the greedy (~[])* would run from the first /* to the
last */, swallowing b along with both comment bodies.
Synthetic tokens¶
Some languages need token types that the lexer cannot produce by pattern
matching alone — the INDENT and DEDENT tokens of an
indentation-sensitive language, for instance. Declare such types with the
EXTRA_TOKENS setting, and emit them from a TOKEN_HOOK (see
Code Injection), which can inspect each token and insert others around it:
EXTRA_TOKENS = INDENT, DEDENT;
Inserting a token into the stream this way is token chaining; it is enabled by
the TOKEN_CHAINING setting, which also turns on automatically when the
grammar uses the chaining API. See Settings Reference.
Releasing tokens¶
For very large inputs you may not want the whole token stream retained in
memory. The UNCACHE_TOKENS construct, used in an expansion, releases tokens
the parser no longer needs as it moves forward.
Token hooks¶
The TOKEN_HOOK method underlies several of the features above: it runs for
every token and may inspect, replace, or chain tokens. Because a hook is just an
injected method, it is documented with the injection mechanism in
Code Injection.