
Internal Tokenizing

Dave DeLong edited this page Sep 18, 2015 · 2 revisions

Tokenizing is the first phase of parsing. It's the point where characters are extracted from the source string and grouped into tokens. There are several kinds of tokens:

  1. numbers
  2. hexadecimal numbers
  3. localized numbers
  4. exponents
  5. variables
  6. operators
  7. identifiers

Numbers

All numbers extracted from the source string are non-negative, since the pattern has no place for a sign. A number is defined as anything that matches the following regular expression:

\d*(\.\d*)?([eE][-+]?\d+)?

Even though "." is technically recognized by this regular expression, it is not evaluated as a number.

Unicode numbers, like ½, are parsed as basic numbers.
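
To see what the pattern above accepts, here is a small Python sketch using the standard re module. It only exercises the regular expression as quoted; it is not the parser's own code, and the helper name looks_like_number is made up for this example.

```python
import re

# The number pattern quoted above, used verbatim.
NUMBER = re.compile(r"\d*(\.\d*)?([eE][-+]?\d+)?")

def looks_like_number(text):
    """Return True if the whole string matches the number pattern."""
    return NUMBER.fullmatch(text) is not None

# Print each sample next to whether the pattern accepts it in full.
for sample in ["42", "3.14", ".5", "1e10", "2.5E-3", ".", "-3", "1e"]:
    print(repr(sample), looks_like_number(sample))
```

A lone "." matches the pattern, which is why it has to be excluded separately. "-3" does not match, because the sign is not part of the number token, and "1e" does not match because an exponent marker must be followed by at least one digit.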

Hexadecimal Numbers

Hexadecimal numbers are anything that matches this regular expression:

0[xX][0-9a-f]+
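
As a rough illustration, the pattern above can be tried out in Python. The helper names looks_like_hex and hex_value are hypothetical, and the conversion through int(text, 16) is an assumption about how a matched token might be turned into a value, not a description of the parser.

```python
import re

# The hexadecimal pattern quoted above, used verbatim.
HEX = re.compile(r"0[xX][0-9a-f]+")

def looks_like_hex(text):
    """Return True if the whole string matches the hexadecimal pattern."""
    return HEX.fullmatch(text) is not None

def hex_value(text):
    """Convert a matching hexadecimal literal to an integer."""
    if not looks_like_hex(text):
        raise ValueError(f"not a hexadecimal token: {text!r}")
    # int() with base 16 accepts the leading 0x / 0X prefix.
    return int(text, 16)

# Print each sample next to whether the pattern accepts it in full.
for sample in ["0xff", "0Xff", "0xdeadbeef", "0x", "0xFF", "ff"]:
    print(repr(sample), looks_like_hex(sample))
```

As written, the character class lists only lowercase a-f, so "0xFF" is rejected by this sketch. Whether the real tokenizer applies the pattern case-insensitively is not stated here.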

Localized Numbers

Exponents

Variables

Variables follow the same rules as identifiers, except that they must be prefixed with a "$" character. Thus, the following are all legal variable names:

  • $a
  • $_
  • $0xdeadbeef

Variables may also be quoted strings:

  • 'a'
  • "hello"
  • '\''
  • "Inigo Montoya"
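
The two variable forms above can be approximated with a pair of regular expressions. This is a minimal Python sketch under stated assumptions: "the same rules as identifiers" is taken to mean letters, decimal digits, and underscores (restricted to ASCII here), and a backslash inside quotes is assumed to escape the next character. The names BARE_VARIABLE, QUOTED_VARIABLE, and is_variable are hypothetical.

```python
import re

# Assumption: a bare variable is "$" followed by identifier characters.
BARE_VARIABLE = re.compile(r"\$[A-Za-z0-9_]+")

# Assumption: inside quotes, a backslash escapes the following character.
SINGLE_QUOTED = r"'(?:\\.|[^'\\])*'"
DOUBLE_QUOTED = r'"(?:\\.|[^"\\])*"'
QUOTED_VARIABLE = re.compile(SINGLE_QUOTED + "|" + DOUBLE_QUOTED)

def is_variable(text):
    """Return True if the whole string is a bare or quoted variable."""
    return (BARE_VARIABLE.fullmatch(text) is not None
            or QUOTED_VARIABLE.fullmatch(text) is not None)

# The examples from the lists above, plus two strings that should not match.
samples = ["$a", "$_", "$0xdeadbeef", "'a'", '"hello"', "'\\''",
           '"Inigo Montoya"', "a", "$"]
for sample in samples:
    print(repr(sample), is_variable(sample))
```

All seven examples from the lists above are accepted, while a bare "a" (no "$" prefix) and a lone "$" are rejected.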

Operators

Operators are, roughly, all remaining characters in the string: anything that is not part of another kind of token.

Parentheses are parsed as operator tokens, even though they are not listed as part of the built-in operators. Parentheses used to denote order of operations and function arguments are eliminated during term grouping.

Identifiers

Identifier tokens are strictly the name of a function. For example, given the string "sin(0)", the extracted identifier token is sin.

Identifiers can contain letters (upper and lowercase), decimal digits, and underscores.

The exceptions are three functions whose recognition is special-cased: π, Φ, and τ. These correspond to the mathematical constants pi, phi, and tau, respectively.

Regarding Whitespace

Whitespace is seen as a logical break in the token stream. That means that "3 4" will be parsed as the 3 token followed by the 4 token. Because the parser recognizes implicit multiplication, a multiplication operator is then injected into the stream between them. Thus, "3 4" is recognized as "3*4", and evaluates to 12.
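
The whitespace behavior can be sketched with a toy tokenizer in Python. Everything here is a simplified assumption made for illustration: the function names are hypothetical, only unsigned decimal numbers and single-character operators are handled, and a multiplication operator is injected only between two adjacent number tokens. The parser's actual implicit-multiplication logic is not described on this page and likely covers more cases.

```python
import re

# Skip leading whitespace, then take either a number or one other character.
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d*)?|\.\d+)|(\S))")

def tokenize(source):
    """Split the source into (kind, text) pairs, breaking on whitespace."""
    tokens = []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        number, other = match.groups()
        if number is not None:
            tokens.append(("number", number))
        else:
            tokens.append(("operator", other))
        pos = match.end()
    return tokens

def inject_multiplication(tokens):
    """Insert a "*" operator between two adjacent number tokens."""
    result = []
    for token in tokens:
        if result and result[-1][0] == "number" and token[0] == "number":
            result.append(("operator", "*"))
        result.append(token)
    return result

tokens = inject_multiplication(tokenize("3 4"))
print("".join(text for _, text in tokens))
```

With the input "3 4" the whitespace splits the digits into two number tokens and the joined result is "3*4". With the input "34" there is no break, so the tokenizer produces a single number token and nothing is injected.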