How To: Design Tokens

Good tokens make the rest of a grammar easier to write. This guide collects practical advice; the syntax reference is Lexical Specification.

Name the tokens you refer to, inline the rest

A literal that appears in a production — "(", ";" — does not need a TOKEN declaration; it defines an implicit token. Reserve explicit, named token declarations for the tokens you refer to by name (<IDENTIFIER>, <NUMBER>) and for anything with a non-trivial pattern. If you want every token declared explicitly, set REQUIRE_TOKEN_DECLARATION and the tool will flag undeclared literals.

Build complex patterns from private pieces

A pattern declared <#NAME : …> is private: a reusable building block that is not a token itself. Factoring a hard pattern into named pieces makes it readable, as the JSON number token shows:

TOKEN :
    <#DIGITS : (["0"-"9"])+ >
  | <#FRACTION : "." <DIGITS> >
  | <NUMBER : <DIGITS> (<FRACTION>)? >
;

Handle case where it belongs

For a case-insensitive keyword or two, put [IGNORE_CASE] on the one token production: TOKEN [IGNORE_CASE] : <IF : "if"> | <ELSE : "else"> ;. Reach for the global IGNORE_CASE setting only when the entire language is case-insensitive — it affects every token.

Keep comments out of the way

Declare comments and other ignorable runs that you might still want later as UNPARSED (or SKIP if you never need them). Unparsed tokens are kept and attached to the following token, so tools can recover them, but they do not clutter the grammar. Block comments usually want a lazy token so they end at the first */.

Mind overlapping matches

When two patterns can match the same text, the longest match wins, and ties go to the token declared first. That is why a keyword like begin must be declared before a general <IDENTIFIER> if it is to win the tie — or, when a word should be a keyword only in some places, treated as a soft keyword (see How To: Handle Context-Sensitive Input).