Internal Tokenizing
Tokenizing is the first phase of parsing. It's the point where characters are extracted from the source string and grouped. There are several kinds of tokens:
- numbers
- hexadecimal numbers
- localized numbers
- exponents
- variables
- operators
- identifiers
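To make those categories concrete, here is a minimal sketch of how the token kinds might be modeled. The type and case names are hypothetical and not necessarily the library's actual types; they exist only to illustrate the grouping described above.

```swift
// Hypothetical token-kind model; the library's real types may differ.
enum TokenKind {
    case number(Double)        // plain, localized, and Unicode numbers
    case hexNumber(UInt64)     // 0x-prefixed hexadecimal numbers
    case exponent(String)      // exponent suffixes such as "e-3"
    case variable(String)      // "$"-prefixed names or quoted strings
    case `operator`(String)    // +, -, *, /, parentheses, and so on
    case identifier(String)    // function names such as "sin"
}
```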
All numbers extracted from the source string are positive, and are defined as anything that matches the following regular expression: `\d*(\.\d*)?([eE][-+]?\d+)?`
Even though ".
" is technically recognized by this regular expression, it is not evaluated as a number.
Unicode numbers, like ½, are parsed as basic numbers.
Hexadecimal numbers are anything that matches this regular expression: `0[xX][0-9a-f]+`
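As a quick check of the two patterns above, the following sketch applies them with Foundation's NSRegularExpression, anchored so the whole string must match. The helper function is illustrative and not part of the tokenizer.

```swift
import Foundation

// Illustrative helper: does the whole string match the pattern?
func wholeMatch(_ pattern: String, _ text: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return false }
    let range = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, range: range)?.range == range
}

let numberPattern = "\\d*(\\.\\d*)?([eE][-+]?\\d+)?"
let hexPattern = "0[xX][0-9a-f]+"

print(wholeMatch(numberPattern, "3.25e-2"))  // true
print(wholeMatch(numberPattern, "."))        // true, but the tokenizer still rejects "." as a number
print(wholeMatch(hexPattern, "0xdeadbeef"))  // true
print(wholeMatch(hexPattern, "0XDEAD"))      // false: the character class as written is lowercase-only
```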
Variables follow the same rules as identifiers, except that they must be prefixed with a "$" character. Thus, the following are all legal variable names:
- `$a`
- `$_`
- `$0xdeadbeef`
- `$π`
In addition, variables may also be quoted strings:
- `'a'`
- `"hello"`
- `'\''`
- `"Inigo Montoya"`
Operators are pretty much all other characters in the string.
Parentheses are parsed as operator tokens, even though they are not listed as part of the built-in operators. Parentheses used to denote order of operations and function arguments are eliminated during term grouping.
Identifier tokens are strictly the name of a function. For example, given the string "sin(0)", the extracted identifier token is sin.
Identifiers can contain letters (upper and lowercase), decimal digits, and underscores.
The exceptions are three functions that are special-cased in their recognition: π, Φ, and τ. These correspond to the mathematical constants pi, phi, and tau, respectively.
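Putting those rules together, identifier recognition might be sketched as follows. The set name and function are hypothetical, and restricting ordinary identifiers to ASCII letters and digits is an assumption based on the description above.

```swift
// Illustrative identifier check; not the library's implementation.
let specialFunctionNames: Set<String> = ["π", "Φ", "τ"]

func isIdentifier(_ text: String) -> Bool {
    // The three constant functions are recognized even though their
    // characters fall outside the normal identifier character set.
    if specialFunctionNames.contains(text) {
        return true
    }
    // Ordinary identifiers: letters, decimal digits, and underscores.
    return !text.isEmpty && text.allSatisfy {
        ($0.isASCII && ($0.isLetter || $0.isNumber)) || $0 == "_"
    }
}

print(isIdentifier("sin"))   // true
print(isIdentifier("τ"))     // true (special-cased)
print(isIdentifier("a-b"))   // false: "-" is an operator, not an identifier character
```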
Whitespace is seen as a logical break in the token stream. That means that "3 4" will be parsed as the 3 token followed by the 4 token. And because of the logic in recognizing implicit multiplication, a multiplication operator will be injected into the stream. Thus, "3 4" is recognized as "3*4", and evaluates to 12.
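For illustration, here is a sketch of how that injection step could work on a token stream, using a simplified token type defined just for this example. This is not the library's actual implicit-multiplication logic, which handles more cases than adjacent numbers.

```swift
// Simplified token type for this sketch only.
enum Tok {
    case number(Double)
    case op(String)
}

// Insert a "*" operator between two adjacent number tokens,
// so the stream for "3 4" becomes the stream for "3*4".
func injectImplicitMultiplication(_ tokens: [Tok]) -> [Tok] {
    var result: [Tok] = []
    for token in tokens {
        if case .number = token, case .number? = result.last {
            result.append(.op("*"))
        }
        result.append(token)
    }
    return result
}

let injected = injectImplicitMultiplication([.number(3), .number(4)])
print(injected)   // roughly: [number(3.0), op("*"), number(4.0)], which evaluates to 12
```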