[superseded] Adding support for user-defined number suffixes #17020

JohnAD · 2021-02-12T05:40:32Z

As a precursor for adding a decimal (base10) library to Nim's stdlib, this PR adds support for user-defined suffixes to numeric literals.

nim-lang/RFCs#308 - Add decimal to standard library

This PR also solves:

nim-lang/RFCs#216 - User defined integer literals
nim-lang/RFCs#228 - user defined quote literals, eg: -12.3E+4022'dec128 ; -128'i8 is transformed as i8("-128")

To distinguish between a typical string literal and a string literal that is the result of a numeric suffix, a new token was added to the lexer: tkStrNumLit. This token is only generated if, when processing a numeric literal, the suffix does not match one of the pre-defined suffixes used internally.

The rest of the work is done in the parser. When the tkStrNumLit token is detected, a "dot" is presumed to follow and the following symbol is considered the indentifier called in a "dotExpr" syntax. So, effectively:

var a = 12345.678'something

becomes the equivalent of:

var a = "12345.678".something

With this, a library, including the up-and-coming decimal library, can parse numbers in source code that cannot be handled by even int64 or float64. For example:

var a = 5192296858534827628530496329220095'm

(The decimal library plans on using m as the suffix to match the convention seen in other languages such as C#.)

While I've written tests and don't know of any bugs, this is my first attempt at modifying the core lexer/parser of a compiler, I've marked the PR as Work-In-Progress [WIP]. Any and all code review is very much appreciated.

links

EDIT(timotheecour) see bug list in [superseded] Adding support for user-defined number suffixes #17020 (comment)

JohnAD · 2021-02-12T05:49:02Z

While I didn't add this check to the lexer, I could see a benefit to it:

I could easily forbid suffixes that start with the letters a to f and A to F. Doing this would remove the confusion that can be seen with both hex numbers and numbers with the e exponent (such as 12.3e22.)

Let me know if adding this restriction to the lexer (and adding it to the documentation) is desirable.

tests/lexer/tstrnumlits.nim

doc/manual.rst

Araq · 2021-02-12T09:34:06Z

Two thoughts:

Unary minus is an operator and never part of a literal. This sucks for cases like -128'i8. This is an existing problem with the current spec and can easily be fixed within this PR.
The design encourages the existing practice of having short top-level proc names like re. I am not sure if this existing practice is so solid that we shouldn't muse about it. For example, in C++ procs that are called as a literal suffix have to defined as operator "" X (with an empty string literal). However, that doesn't address the scoping aspect.

Araq · 2021-02-12T09:45:46Z

I could easily forbid suffixes that start with the letters a to f and A to F. Doing this would remove the confusion that can be seen with both hex numbers and numbers with the e exponent (such as 12.3e22.)

Let me know if adding this restriction to the lexer (and adding it to the documentation) is desirable.

I'm not sure how it currently works, but floating point values containing 'e' and hexadecimal numbers need to use the 'suffix (with the single quote) form. It would be really weird if my user defined literal could not start with A to F.

timotheecour · 2021-02-12T09:49:57Z

The design encourages the existing practice of having short top-level proc names like re. I am not sure if this existing practice is so solid that we shouldn't muse about it.

how about requiring a pragma, which mostly addresses this concern, and makes it more searchable, eg:

proc big(a: string) {.litteral.} = # create a JsBigInt from a
echo -123'big # ok
proc bad(a: string) = # create a JsBigInt from a
echo -123'bad # error, bad must have {.litteral.}

(big can still be used in other context; but the pragma also allows it to be used in a litteral context)

Araq · 2021-02-12T10:16:53Z

A pragma doesn't address the scoping problem either. It's not that I want to annotate the suffix declaration, I am thinking about a scope extension. For example:

from bignums import nil

echo 355'bignum # works because it's turned into

echo bignums.bignum("355")

If it weren't so silly I would propose that 123'suffix is turned into suffix.suffix("123"). (Or into suffixs.suffix("123") as the plural s for a module name is very common.)

timotheecour · 2021-02-12T10:28:07Z

that's the exact same problem with MCS though.

but here, you can use:

option 1: no scope extension

(but IIUC your problem is "small identifiers" polluting scope?)

from bignums import bignum
echo 355'bignum

option 2: special litteral rule

from bignums import 'bignum # makes it only usable as a litteral
echo 355'bignum # ok
echo bignum("355") # `bignum` not declared, was imported via 'bignum

but isn't option 1 good enough?

Araq · 2021-02-12T10:59:57Z

Oh there is an option: Suffix procs need to be declared as

proc `'bignum` (...)

And then the rewrite rule is:

123'suffix is turned into `'suffix`("123")

And then we can import suffixes just like:

from bignums import `'bignum`

timotheecour · 2021-02-12T17:37:27Z

Oh there is an option: Suffix procs need to be declared as

seems good; in particular it helps with turning f32 (etc) into a user defined litteral (defined in system.nim via proc `'f32'*(a: string): float32) without polluting scope with f32, since we'd have:

doAssert not declared(f32)
doAssert declared(`'f32`)

this also encourages 1 way to write things:
123'suffix instead of suffix("123"), since the declaration would contain '

a user could of course define both, but that'd be opt in:

proc `'bignum`(...)
proc bignum(...)

can we also deprecate 123f32 syntax and require 123'f32?
if not then at least not allow the quote-less form for user defined litterals?

1 way to do things seems better (more searchable, etc); see also my argument about ambiguities this would cause otherwise (12e32 vs 12'e32 etc). This would help turning the builtin f32 (etc) into user defined literals, with obvious advantages.

JohnAD · 2021-02-12T20:23:25Z

I like the two big ideas that have emerged from this discussion:

For suffix procs, require a 'suffix style declaration. Also require the apostrophe for usage. This makes things cleaner/clearer.
Modify the lexer so that the minus symbol is part of the numeric literals when the context calls for it.

I'll start work on both; starting with the lexer itself.

Slightly off-topic: after this change is released, I/we will need to encourage linters and code markup tools (such as prettifiers) to properly handle the unpaired "single quotes". I use the sublime text editor and I'll need to fix/hack that sooner rather than later.)

Araq · 2021-02-12T21:52:44Z

if not then at least not allow the quote-less form for user defined literals?

I doubt we can deprecate the 0f32 syntax. But the new form can require the single quote, we can always allow to leave it out later.

JohnAD · 2021-02-14T21:26:28Z

Started work on these changes. Just wrote the new test cases.

But, since we are talking "big picture", would there be value in also allowing more than [0-9a-fA-F] characters in a number?

Perhaps with restrictions such as:

It still starts with a numeric digit [0-9].
It has an attached 'suffix after the character sequence.
The character sequence does not contain whitespace.
The character sequence does not certain punctuation characters such as full-quotes.

This would allow for useful things like this:

import korean

var a = 0k열여섯'hangul

or

import numericsystems

var a = 0s03jKi3h'base60

or

import roman

assert 256 == 0CCLVI'roman

or crazy stuff like:

import moneystuff
var a = 1,234,332.20'USD

I, myself, don't need this functionality personally, so I'm not pushing hard for it. But, I can see it being useful for other libraries.

I can also see danger in allowing any punctuation other than dots and underscores. So perhaps even if we allow more than the just digit/hex characters, punctuation should not be allowed.

timotheecour · 2021-02-15T01:43:39Z

This would allow for useful things like this:

user can always write: var a = "열여섯".hangul, this is probably not common enough to allow it via user defined literal.

or crazy stuff like:

Commas would be more common and useful to support but I don't see how to avoid the ambiguity this would lead to: echo (1,234,567'big) => is it (1, 234, big"567") or (big"1234567") ? you'd need some arbitrary rule to disambiguate. And we can use "_" for that.

In any case, this could be handled in future work.

Araq · 2021-02-15T13:18:30Z

I, myself, don't need this functionality personally, so I'm not pushing hard for it. But, I can see it being useful for other libraries.

We can always add it later.

Araq · 2021-02-24T15:13:58Z

@JohnAD what's the current state of this PR? I really want it... :-)

JohnAD · 2021-02-28T02:30:07Z

Started work again on this today. I have already added the expanded/corrected unit testing. Should have a new commit for the lexer/parser by end of tomorrow.

JohnAD · 2021-02-28T21:02:46Z

Open question:

The minus symbol (-) has two purposes:

acting as sign for a number aka "negative" x = 3+ -4
acting as an operation indicating an operation x = 3-4

And, if we were doing things in the higher levels of the language, I suspect I could easily distinguish between them. But, I'm trying to get the "negative" usage into the numeric literal where it really makes sense. To do that, the logic needs to be in the lexer, which is literally a character-by-character scan.

In the end, being conservative, I suspect any numeric library will still need to handle the unary negative.

I just added the ability for the lexer to "remember" the previous tokType in the main loop. This solved multiple problems for me.

So, if the previous tokType is in the

{tkComma, tkColon, tkParLe, tkBracketLe, tkCurlyLe}

set or is preceded by whitespace, then a - followed by a numeric digit will be parsed into the numeric literal. Otherwise the - is parsed as an operation.

Sound reasonable? Suggestions?

(Off-topic, but IMO, the conversion to float/int (and the handling of different bases) for numeric literals should not be in the lexer. That is beyond the scope of a lexer. But that is me being nitpicky. And moving it would be a thought for another day.)

JohnAD · 2021-02-28T21:15:37Z

Still to do:

Add AST tests to unit testing
Test invalid negation, such as against a UInt or out-of-range, in unit testing
Update manual.rst

compiler/lexer.nim

timotheecour · 2021-02-28T21:41:13Z

Sound reasonable? Suggestions?

yes, this is the best approach IMO.

{tkComma, tkColon, tkParLe, tkBracketLe, tkCurlyLe}

what about ; ? eg: nim --eval:"echo 3;-1.echo"

(Off-topic, but IMO, the conversion to float/int (and the handling of different bases) for numeric literals should not be in the lexer. That is beyond the scope of a lexer. But that is me being nitpicky. And moving it would be a thought for another day.)

well it's not really off-topic and relates to nim-lang/RFCs#228

parser becomes lazy, makes no attempt at parsing the string into a numerical type (parsing is deferred till actually needed insde semphase, if at all)

but it can indeed be fixed in followup work

compiler/lexer.nim

JohnAD · 2021-03-21T20:35:13Z

I've got a new version ready for commit, but I'll pause for a bit just in case you are in the middle of adding a conversation item.

compiler/lexer.nim

tests/lexer/tstrnumlits.nim

timotheecour · 2021-03-21T22:45:21Z

for bug 6, relevant code to fix is in lib/packages/docutils/highlite.nim eg:

    of '\'':
      inc(pos)
      g.kind = gtCharLit

timotheecour · 2021-03-21T23:12:42Z

see bug 10 i added

timotheecour · 2021-03-21T23:50:05Z

fix for bug 10:

diff --git a/compiler/lexer.nim b/compiler/lexer.nim
index c0ba217b5..efd78f983 100644
--- a/compiler/lexer.nim
+++ b/compiler/lexer.nim
@@ -1339,7 +1339,7 @@ proc rawGetTok*(L: var Lexer, tok: var Token) =
       if L.buf[L.bufpos] in SymChars+{'_'}:
         lexMessage(L, errGenerated, "invalid token: no whitespace between number and identifier")
     of '-':
-      if ((tok.strongSpaceA > 0) or (L.previousTokType in negationPrefixes)) and (L.buf[L.bufpos + 1] in '0'..'9'):
+      if ((tok.strongSpaceA > 0) or (L.previousTokType in negationPrefixes) or L.buf[L.bufpos - 1] == ' ') and (L.buf[L.bufpos + 1] in '0'..'9'):
         getNumber(L, tok)
         if L.buf[L.bufpos] in SymChars+{'_'}:
           lexMessage(L, errGenerated, "invalid token: no whitespace between number and identifier")

test for bug 10 (adapt as needed in your test file):

when true:
  func `'wrap`(a: string): string = "{" & a & "}"
  template toSuf1(): untyped =
    -12'wrap
  assert toSuf1() == "{-12}"

  template toSuf2(): untyped =
    var a = "adsf"
    -12'wrap
  assert toSuf2() == "{-12}"

=> fixed in devel, test in #17500

timotheecour · 2021-03-22T00:25:24Z

i think we can close bug 2 because this works:

when true:
  func `'wrap`(a: string): string = "{" & a & "}"
  template toSuf(`'suf`): untyped =
    let x = -12'suf
    x
  assert toSuf(`'wrap`) == "{-12}"

(needs a test case; again, integrate as needed to reuse your wrap)

=> #17500

timotheecour · 2021-03-22T00:33:20Z

bug 8: is completely fixed by genAst (new genAst as replacement for quote do #17426) (test can be added after new genAst as replacement for quote do #17426 is merged)
bug 9: genAst provides a reasonable workaround, but this can be improved in future work:

when defined case40:
  import std/genasts # pending #17426
  import std/macros
  func `'wrap2`(a: string): string = "{" & a & "}"
  macro bar(): untyped =
    func wrap1(a: string): string = "{" & a & "}"
    func `'wrap3`(a: string): string = "{" & a & "}"
    result = genAst():
      let a1 = wrap1"-128" # ok
      let a2 = -128'wrap2
      # let a3 = -128'wrap3 # xxx still doesn't work
      let a3 = `'wrap3`("-128") # ok
  bar()

doc/manual.rst

timotheecour

LGTM, remaining comments and bug fixes can be done in followup PR
/cc @Araq

(EDIT: see bug 11 i added but it's more of a TODO for followup PR)

Araq · 2021-03-23T10:23:51Z

compiler/parser.nim

+  if baseTokType == tkStrNumLit:
+    result = dotExpr(p, result)
+    result = parseGStrLit(p, result)


Seems hacky and has no corresponding grammar update. I'll see if it can be done differently.

compiler/renderer.nim

* make unary minus part of number literals, refs #17020 * fixes regression

Araq · 2021-03-24T10:39:23Z

I took over, see #17489

* user defined integer literals; refs #17020 * updated renderer.nim * use mlexerutils helper * imported all test cases from #17020 * final grammar updated

…ng#17488) * make unary minus part of number literals, refs nim-lang#17020 * fixes regression

* user defined integer literals; refs nim-lang#17020 * updated renderer.nim * use mlexerutils helper * imported all test cases from nim-lang#17020 * final grammar updated

Add ability to use user-defined number suffixes

69d49c0