Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[superseded] Adding support for user-defined number suffixes #17020

Closed
wants to merge 11 commits into from
Closed

[superseded] Adding support for user-defined number suffixes #17020

wants to merge 11 commits into from

Conversation

JohnAD
Copy link
Contributor

@JohnAD JohnAD commented Feb 12, 2021

As a precursor for adding a decimal (base10) library to Nim's stdlib, this PR adds support for user-defined suffixes to numeric literals.

nim-lang/RFCs#308 - Add decimal to standard library

This PR also solves:

nim-lang/RFCs#216 - User defined integer literals
nim-lang/RFCs#228 - user defined quote literals, eg: -12.3E+4022'dec128 ; -128'i8 is transformed as i8("-128")

To distinguish between a typical string literal and a string literal that is the result of a numeric suffix, a new token was added to the lexer: tkStrNumLit. This token is only generated if, when processing a numeric literal, the suffix does not match one of the pre-defined suffixes used internally.

The rest of the work is done in the parser. When the tkStrNumLit token is detected, a "dot" is presumed to follow and the following symbol is considered the indentifier called in a "dotExpr" syntax. So, effectively:

var a = 12345.678'something

becomes the equivalent of:

var a = "12345.678".something

With this, a library, including the up-and-coming decimal library, can parse numbers in source code that cannot be handled by even int64 or float64. For example:

var a = 5192296858534827628530496329220095'm

(The decimal library plans on using m as the suffix to match the convention seen in other languages such as C#.)

While I've written tests and don't know of any bugs, this is my first attempt at modifying the core lexer/parser of a compiler, I've marked the PR as Work-In-Progress [WIP]. Any and all code review is very much appreciated.

links

@JohnAD
Copy link
Contributor Author

JohnAD commented Feb 12, 2021

While I didn't add this check to the lexer, I could see a benefit to it:

I could easily forbid suffixes that start with the letters a to f and A to F. Doing this would remove the confusion that can be seen with both hex numbers and numbers with the e exponent (such as 12.3e22.)

Let me know if adding this restriction to the lexer (and adding it to the documentation) is desirable.

doc/manual.rst Outdated Show resolved Hide resolved
@Araq
Copy link
Member

Araq commented Feb 12, 2021

Two thoughts:

  1. Unary minus is an operator and never part of a literal. This sucks for cases like -128'i8. This is an existing problem with the current spec and can easily be fixed within this PR.
  2. The design encourages the existing practice of having short top-level proc names like re. I am not sure if this existing practice is so solid that we shouldn't muse about it. For example, in C++ procs that are called as a literal suffix have to defined as operator "" X (with an empty string literal). However, that doesn't address the scoping aspect.

@Araq
Copy link
Member

Araq commented Feb 12, 2021

I could easily forbid suffixes that start with the letters a to f and A to F. Doing this would remove the confusion that can be seen with both hex numbers and numbers with the e exponent (such as 12.3e22.)

Let me know if adding this restriction to the lexer (and adding it to the documentation) is desirable.

I'm not sure how it currently works, but floating point values containing 'e' and hexadecimal numbers need to use the 'suffix (with the single quote) form. It would be really weird if my user defined literal could not start with A to F.

@timotheecour
Copy link
Member

timotheecour commented Feb 12, 2021

The design encourages the existing practice of having short top-level proc names like re. I am not sure if this existing practice is so solid that we shouldn't muse about it.

how about requiring a pragma, which mostly addresses this concern, and makes it more searchable, eg:

proc big(a: string) {.litteral.} = # create a JsBigInt from a
echo -123'big # ok
proc bad(a: string) = # create a JsBigInt from a
echo -123'bad # error, bad must have {.litteral.}

(big can still be used in other context; but the pragma also allows it to be used in a litteral context)

@Araq
Copy link
Member

Araq commented Feb 12, 2021

A pragma doesn't address the scoping problem either. It's not that I want to annotate the suffix declaration, I am thinking about a scope extension. For example:

from bignums import nil

echo 355'bignum # works because it's turned into

echo bignums.bignum("355")

If it weren't so silly I would propose that 123'suffix is turned into suffix.suffix("123"). (Or into suffixs.suffix("123") as the plural s for a module name is very common.)

@timotheecour
Copy link
Member

that's the exact same problem with MCS though.

but here, you can use:

option 1: no scope extension

(but IIUC your problem is "small identifiers" polluting scope?)

from bignums import bignum
echo 355'bignum

option 2: special litteral rule

from bignums import 'bignum # makes it only usable as a litteral
echo 355'bignum # ok
echo bignum("355") # `bignum` not declared, was imported via 'bignum

but isn't option 1 good enough?

@Araq
Copy link
Member

Araq commented Feb 12, 2021

Oh there is an option: Suffix procs need to be declared as

proc `'bignum` (...)

And then the rewrite rule is:

123'suffix is turned into `'suffix`("123")

And then we can import suffixes just like:

from bignums import `'bignum`

@timotheecour
Copy link
Member

timotheecour commented Feb 12, 2021

Oh there is an option: Suffix procs need to be declared as

seems good; in particular it helps with turning f32 (etc) into a user defined litteral (defined in system.nim via proc `'f32'*(a: string): float32) without polluting scope with f32, since we'd have:

doAssert not declared(f32)
doAssert declared(`'f32`)

this also encourages 1 way to write things:
123'suffix instead of suffix("123"), since the declaration would contain '

a user could of course define both, but that'd be opt in:

proc `'bignum`(...)
proc bignum(...)
  • can we also deprecate 123f32 syntax and require 123'f32?
    if not then at least not allow the quote-less form for user defined litterals?

1 way to do things seems better (more searchable, etc); see also my argument about ambiguities this would cause otherwise (12e32 vs 12'e32 etc). This would help turning the builtin f32 (etc) into user defined literals, with obvious advantages.

@JohnAD
Copy link
Contributor Author

JohnAD commented Feb 12, 2021

I like the two big ideas that have emerged from this discussion:

  • For suffix procs, require a 'suffix style declaration. Also require the apostrophe for usage. This makes things cleaner/clearer.
  • Modify the lexer so that the minus symbol is part of the numeric literals when the context calls for it.

I'll start work on both; starting with the lexer itself.

Slightly off-topic: after this change is released, I/we will need to encourage linters and code markup tools (such as prettifiers) to properly handle the unpaired "single quotes". I use the sublime text editor and I'll need to fix/hack that sooner rather than later.)

@Araq
Copy link
Member

Araq commented Feb 12, 2021

if not then at least not allow the quote-less form for user defined literals?

I doubt we can deprecate the 0f32 syntax. But the new form can require the single quote, we can always allow to leave it out later.

@JohnAD
Copy link
Contributor Author

JohnAD commented Feb 14, 2021

Started work on these changes. Just wrote the new test cases.

But, since we are talking "big picture", would there be value in also allowing more than [0-9a-fA-F] characters in a number?

Perhaps with restrictions such as:

  1. It still starts with a numeric digit [0-9].
  2. It has an attached 'suffix after the character sequence.
  3. The character sequence does not contain whitespace.
  4. The character sequence does not certain punctuation characters such as full-quotes.

This would allow for useful things like this:

import korean

var a = 0k열여섯'hangul

or

import numericsystems

var a = 0s03jKi3h'base60

or

import roman

assert 256 == 0CCLVI'roman

or crazy stuff like:

import moneystuff
var a = 1,234,332.20'USD

I, myself, don't need this functionality personally, so I'm not pushing hard for it. But, I can see it being useful for other libraries.

I can also see danger in allowing any punctuation other than dots and underscores. So perhaps even if we allow more than the just digit/hex characters, punctuation should not be allowed.

@timotheecour
Copy link
Member

This would allow for useful things like this:

user can always write: var a = "열여섯".hangul, this is probably not common enough to allow it via user defined literal.

or crazy stuff like:

Commas would be more common and useful to support but I don't see how to avoid the ambiguity this would lead to: echo (1,234,567'big) => is it (1, 234, big"567") or (big"1234567") ? you'd need some arbitrary rule to disambiguate. And we can use "_" for that.

In any case, this could be handled in future work.

@Araq
Copy link
Member

Araq commented Feb 15, 2021

I, myself, don't need this functionality personally, so I'm not pushing hard for it. But, I can see it being useful for other libraries.

We can always add it later.

@Araq
Copy link
Member

Araq commented Feb 24, 2021

@JohnAD what's the current state of this PR? I really want it... :-)

@JohnAD
Copy link
Contributor Author

JohnAD commented Feb 28, 2021

Started work again on this today. I have already added the expanded/corrected unit testing. Should have a new commit for the lexer/parser by end of tomorrow.

@JohnAD
Copy link
Contributor Author

JohnAD commented Feb 28, 2021

Open question:

The minus symbol (-) has two purposes:

  1. acting as sign for a number aka "negative" x = 3+ -4
  2. acting as an operation indicating an operation x = 3-4

And, if we were doing things in the higher levels of the language, I suspect I could easily distinguish between them. But, I'm trying to get the "negative" usage into the numeric literal where it really makes sense. To do that, the logic needs to be in the lexer, which is literally a character-by-character scan.

In the end, being conservative, I suspect any numeric library will still need to handle the unary negative.

I just added the ability for the lexer to "remember" the previous tokType in the main loop. This solved multiple problems for me.

So, if the previous tokType is in the

{tkComma, tkColon, tkParLe, tkBracketLe, tkCurlyLe}

set or is preceded by whitespace, then a - followed by a numeric digit will be parsed into the numeric literal. Otherwise the - is parsed as an operation.

Sound reasonable? Suggestions?

(Off-topic, but IMO, the conversion to float/int (and the handling of different bases) for numeric literals should not be in the lexer. That is beyond the scope of a lexer. But that is me being nitpicky. And moving it would be a thought for another day.)

@JohnAD
Copy link
Contributor Author

JohnAD commented Feb 28, 2021

Still to do:

  • Add AST tests to unit testing
  • Test invalid negation, such as against a UInt or out-of-range, in unit testing
  • Update manual.rst

compiler/lexer.nim Outdated Show resolved Hide resolved
@timotheecour
Copy link
Member

timotheecour commented Feb 28, 2021

Sound reasonable? Suggestions?

yes, this is the best approach IMO.

{tkComma, tkColon, tkParLe, tkBracketLe, tkCurlyLe}

what about ; ? eg: nim --eval:"echo 3;-1.echo"

(Off-topic, but IMO, the conversion to float/int (and the handling of different bases) for numeric literals should not be in the lexer. That is beyond the scope of a lexer. But that is me being nitpicky. And moving it would be a thought for another day.)

well it's not really off-topic and relates to nim-lang/RFCs#228

parser becomes lazy, makes no attempt at parsing the string into a numerical type (parsing is deferred till actually needed insde semphase, if at all)

but it can indeed be fixed in followup work

compiler/lexer.nim Outdated Show resolved Hide resolved
compiler/lexer.nim Outdated Show resolved Hide resolved
compiler/lexer.nim Outdated Show resolved Hide resolved
@JohnAD
Copy link
Contributor Author

JohnAD commented Mar 21, 2021

I've got a new version ready for commit, but I'll pause for a bit just in case you are in the middle of adding a conversation item.

compiler/lexer.nim Outdated Show resolved Hide resolved
@timotheecour
Copy link
Member

for bug 6, relevant code to fix is in lib/packages/docutils/highlite.nim eg:

    of '\'':
      inc(pos)
      g.kind = gtCharLit

@timotheecour
Copy link
Member

timotheecour commented Mar 21, 2021

  • see bug 10 i added

@timotheecour
Copy link
Member

timotheecour commented Mar 21, 2021

fix for bug 10:

diff --git a/compiler/lexer.nim b/compiler/lexer.nim
index c0ba217b5..efd78f983 100644
--- a/compiler/lexer.nim
+++ b/compiler/lexer.nim
@@ -1339,7 +1339,7 @@ proc rawGetTok*(L: var Lexer, tok: var Token) =
       if L.buf[L.bufpos] in SymChars+{'_'}:
         lexMessage(L, errGenerated, "invalid token: no whitespace between number and identifier")
     of '-':
-      if ((tok.strongSpaceA > 0) or (L.previousTokType in negationPrefixes)) and (L.buf[L.bufpos + 1] in '0'..'9'):
+      if ((tok.strongSpaceA > 0) or (L.previousTokType in negationPrefixes) or L.buf[L.bufpos - 1] == ' ') and (L.buf[L.bufpos + 1] in '0'..'9'):
         getNumber(L, tok)
         if L.buf[L.bufpos] in SymChars+{'_'}:
           lexMessage(L, errGenerated, "invalid token: no whitespace between number and identifier")

test for bug 10 (adapt as needed in your test file):

when true:
  func `'wrap`(a: string): string = "{" & a & "}"
  template toSuf1(): untyped =
    -12'wrap
  assert toSuf1() == "{-12}"

  template toSuf2(): untyped =
    var a = "adsf"
    -12'wrap
  assert toSuf2() == "{-12}"

=> fixed in devel, test in #17500

@timotheecour
Copy link
Member

timotheecour commented Mar 22, 2021

  • i think we can close bug 2 because this works:
when true:
  func `'wrap`(a: string): string = "{" & a & "}"
  template toSuf(`'suf`): untyped =
    let x = -12'suf
    x
  assert toSuf(`'wrap`) == "{-12}"

(needs a test case; again, integrate as needed to reuse your wrap)

=> #17500

@timotheecour
Copy link
Member

when defined case40:
  import std/genasts # pending #17426
  import std/macros
  func `'wrap2`(a: string): string = "{" & a & "}"
  macro bar(): untyped =
    func wrap1(a: string): string = "{" & a & "}"
    func `'wrap3`(a: string): string = "{" & a & "}"
    result = genAst():
      let a1 = wrap1"-128" # ok
      let a2 = -128'wrap2
      # let a3 = -128'wrap3 # xxx still doesn't work
      let a3 = `'wrap3`("-128") # ok
  bar()

Copy link
Member

@timotheecour timotheecour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, remaining comments and bug fixes can be done in followup PR
/cc @Araq

(EDIT: see bug 11 i added but it's more of a TODO for followup PR)

@timotheecour timotheecour added the TODO: followup needed remove tag once fixed or tracked elsewhere label Mar 22, 2021
Comment on lines +788 to +790
if baseTokType == tkStrNumLit:
result = dotExpr(p, result)
result = parseGStrLit(p, result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems hacky and has no corresponding grammar update. I'll see if it can be done differently.

Araq added a commit that referenced this pull request Mar 23, 2021
Araq added a commit that referenced this pull request Mar 24, 2021
Araq added a commit that referenced this pull request Mar 24, 2021
* make unary minus part of number literals, refs #17020
* fixes regression
Araq added a commit that referenced this pull request Mar 24, 2021
@Araq
Copy link
Member

Araq commented Mar 24, 2021

I took over, see #17489

@Araq Araq closed this Mar 24, 2021
Araq added a commit that referenced this pull request Mar 24, 2021
* user defined integer literals; refs #17020
* updated renderer.nim
* use mlexerutils helper
* imported all test cases from #17020
* final grammar updated
ardek66 pushed a commit to ardek66/Nim that referenced this pull request Mar 26, 2021
…ng#17488)

* make unary minus part of number literals, refs nim-lang#17020
* fixes regression
ardek66 pushed a commit to ardek66/Nim that referenced this pull request Mar 26, 2021
* user defined integer literals; refs nim-lang#17020
* updated renderer.nim
* use mlexerutils helper
* imported all test cases from nim-lang#17020
* final grammar updated
@timotheecour timotheecour removed the TODO: followup needed remove tag once fixed or tracked elsewhere label Apr 29, 2021
@timotheecour timotheecour changed the title Adding support for user-defined number suffixes [superseded] Adding support for user-defined number suffixes Apr 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants