Define appropriate implementation behaviors for extreme values #538

Closed
jongiddy opened this issue May 18, 2018 · 16 comments

Comments

@jongiddy

jongiddy commented May 18, 2018

One thing that I would like to see defined before 1.0 is the behavior when implementation limits are reached. This would consist of an explicit statement that TOML itself does not have a limit but that parsers may.

For example:

Strings can have any length. However, parsers may produce an error for strings that are too long for them to safely handle.

Integers can have any length. Parsers must produce an error if the value cannot be represented in an internal type. Parsers must not produce an integer that does not equal the full value represented in the TOML and must not produce a string representation of the value.

And perhaps most debatable: what should happen to very large or very small floats (e.g. 1e500 and 1e-500)? Programming languages often convert these to inf or 0. For configuration, I'd consider it more appropriate to produce an error. For example, a very small float may represent an epsilon value or error allowance, and should produce an error rather than be rounded down to zero.

Numeric floats do not convert to ±inf. The parser should produce an error if the float is too large to be represented as a non-infinite float.
Floats that contain a non-zero mantissa do not convert to zero. The parser should produce an error if the float is too small to be represented as a non-zero float.
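
The proposed rules can be sketched in Python as follows. This is only an illustration, not part of the proposal: the names `UnrepresentableValueError`, `parse_int_strict` and `parse_float_strict` are hypothetical, and the sketch assumes a parser whose internal types are a signed 64-bit integer and an IEEE 754 binary64 float.

```python
import math

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


class UnrepresentableValueError(ValueError):
    """Hypothetical error for values the parser cannot represent faithfully."""


def parse_int_strict(text: str) -> int:
    """Convert a decimal integer literal, rejecting values outside int64."""
    value = int(text.replace("_", ""))
    if not INT64_MIN <= value <= INT64_MAX:
        raise UnrepresentableValueError(f"integer out of range: {text}")
    return value


def parse_float_strict(text: str) -> float:
    """Convert a float literal, rejecting silent overflow and underflow."""
    cleaned = text.replace("_", "").lower()
    value = float(cleaned)
    if cleaned.lstrip("+-") in ("inf", "nan"):
        return value  # explicitly written special values are allowed
    if math.isinf(value):
        raise UnrepresentableValueError(f"float too large: {text}")
    mantissa = cleaned.split("e")[0]
    if value == 0.0 and any(digit in mantissa for digit in "123456789"):
        raise UnrepresentableValueError(f"float too small: {text}")
    return value
```

With these helpers, `1e500` and `1e-500` are rejected instead of becoming inf and 0.0, while a literal `0.0` or `inf` still parses.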

@eksortso
Contributor

Well, TOML does set a few expectations. Here's what we have for integers. So we cannot say that integers can have any length. Unless you want this requirement rescinded.

64 bit (signed long) range expected (−9,223,372,036,854,775,808 to 9,223,372,036,854,775,807).

Same thing for floats. Any valid value, including a very small epsilon value, is expected to work within this limit. Even inf, -inf, and nan are expected to work.

Floats should be implemented as IEEE 754 binary64 values.

For any numerical value in a TOML file that doesn't translate to these 64-bit standards, I take it that you want parsers to raise errors, rather than leave the parsers' actual behaviors undefined. I'm inclined to agree, but not sure how verbose the standard needs to be about such cases.

@jongiddy
Author

@eksortso I think it would be useful to say that an integer outside the supported ranges should give an error, rather than return a different integer, which may happen for languages that have error-free wrapping, or return a truncated integer calculated up until a parse error occurs.
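
To illustrate the difference between wrapping and erroring, here is a small sketch. Python integers do not wrap, so `wrap_to_int64` (a hypothetical helper) simulates with plain arithmetic what an unchecked two's-complement conversion would produce; `checked_int64` (also hypothetical) is the behavior argued for above.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def wrap_to_int64(value: int) -> int:
    """Simulate an unchecked conversion to a signed 64-bit integer."""
    return (value + 2**63) % 2**64 - 2**63


def checked_int64(value: int) -> int:
    """Return the value unchanged, or raise if it does not fit in int64."""
    if not INT64_MIN <= value <= INT64_MAX:
        raise OverflowError(f"{value} does not fit in a signed 64-bit integer")
    return value


# One past the maximum silently becomes the minimum when wrapping.
assert wrap_to_int64(INT64_MAX + 1) == INT64_MIN
```

The wrapped result is a perfectly valid integer, which is why the caller cannot detect the problem unless the parser raises an error.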

Instead of producing an error, I'd support the parser returning the correct integer where the language can support it (e.g. Python).

If that latter behavior is allowed, and it is clear that an error must be produced otherwise, then the 64-bit signed value could be defined as the guaranteed minimum that a parser should support.

Defining these things specifically encourages parser developers to test for these cases and also ensures parser implementations are more interchangeable.

@jongiddy
Author

Both the Python TOML parsers listed on the Wiki ("specs-conforming and strict" pytoml and "passes the TOML test suite" toml) support integers outside the 64-bit signed range, so parser developers already interpret the statement about integers as a minimum range.

This also emphasises that the word "expected" has almost no meaning in a standard.

For floats, both Python libraries convert 1e500 to inf and 1e-500 to 0.0. Other languages may return the maximum non-infinite float or produce an error. For simplicity of building a parser, it may be best to leave the behavior of floats undefined.
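
The behavior described above matches what Python's built-in `int()` and `float()` conversions do. Whether the two libraries delegate to these built-ins is an assumption here; the snippet only shows the language-level behavior.

```python
import math

# Integers: Python has arbitrary precision, so values beyond int64 are exact.
beyond_int64 = int("9223372036854775808")  # 2**63, one past the int64 maximum
assert beyond_int64 == 2**63

# Floats: out-of-range literals are converted silently, without an exception.
too_large = float("1e500")
too_small = float("1e-500")
assert math.isinf(too_large)
assert too_small == 0.0
```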

@pradyunsg
Member

Are there existing real world problems/issues caused by the lack of strict language on the behavior at these extreme values?

@pradyunsg
Member

Some notes:

  • the case of strings not being able to be stored in memory just seems a little weird.
  • @mojombo thinks that the underlying issue here is that TOML would either need to decide to be strict or flexible or leave it to the implementations to decide what has to be done here.
  • I agree with him though -- there might be a middle position where TOML just tells parsers to do the best they can to represent the values and print a warning when they can't stay true to what the file contains but the spec doesn't really do anything like that today so 🤷‍♂️.

@jongiddy
Author

jongiddy commented Jun 17, 2018

the case of strings not being able to be stored in memory just seems a little weird

Consider a parser for a typed language that allows the caller to provide a fixed-length buffer for a string. Similarly, for integers a typed parser may support multiple integral types (8, 16, 32, 64, and possibly 128-bit). Ensuring that the default behavior is to produce an error if the value cannot be represented exactly is simple and conservative. Parsers can provide configuration options for other behaviors: truncation, wrapping, producing the max value.
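
A minimal sketch of that idea, written in Python for brevity even though the scenario is a typed language. `OverflowPolicy` and `fit_integer` are hypothetical names; the default is to raise an error, and the other behaviors must be requested explicitly.

```python
from enum import Enum


class OverflowPolicy(Enum):
    ERROR = "error"        # default: refuse values that do not fit
    SATURATE = "saturate"  # clamp to the nearest representable value
    WRAP = "wrap"          # reduce modulo 2**bits (two's complement)


def fit_integer(value: int, bits: int, signed: bool = True,
                policy: OverflowPolicy = OverflowPolicy.ERROR) -> int:
    """Fit an exact integer into a fixed-width type under the given policy."""
    if signed:
        low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        low, high = 0, (1 << bits) - 1
    if low <= value <= high:
        return value
    if policy is OverflowPolicy.ERROR:
        raise OverflowError(f"{value} does not fit in {bits} bits")
    if policy is OverflowPolicy.SATURATE:
        return max(low, min(high, value))
    # WRAP: shift into [0, 2**bits), reduce, then shift back.
    return (value - low) % (1 << bits) + low
```

For example, `fit_integer(300, 8)` raises `OverflowError`, while `fit_integer(300, 8, policy=OverflowPolicy.SATURATE)` returns 127.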

If we end up with TOML parsers in different languages undetectably producing different values for the same TOML, then TOML is not going to be an appropriate choice for systems that use multiple languages.

print a warning when they can't stay true to what the file contains

Please do not encourage parser libraries to print warnings. The failure to parse correctly needs to go to the calling code, not to the user.

@lmna

lmna commented Aug 21, 2018

Parsers can provide configuration options for other behaviors: truncation, wrapping, producing the max value.

TOML parsers do not have sufficient domain knowledge to take responsibility for changing values. I believe that the only sane option for parsers is to produce a parsing error, so that the user (or some kind of domain-aware AI) has an opportunity to fix unrepresentable values in a conscious way. See also https://en.wikipedia.org/wiki/Fail-fast

If we end up with TOML parsers in different languages undetectably producing different values for the same TOML

That would be a disaster. To prevent this, a parser should either produce an exact representation of the TOML file, or produce a parsing error.

Please do not encourage parser libraries to print warnings. The failure to parse correctly needs to go to the calling code, not to the user.

+1

@lmna

lmna commented Aug 21, 2018

Should the spec recommend distinct error codes for specific representation problems? Could it be useful for parser-neutral test suites?

  • insufficient memory (too many items in arrays and tables, too long strings/keys, too deep nesting, etc)
  • integer unrepresentable (too small / too large)
  • date / time / date-time unrepresentable (too far in past/future, insufficient precision, violation of leap day rules, violation of leap second rules, invalid timezone, etc)
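
One way such error categories could look in a Python parser is sketched below. The class names and `code` strings are hypothetical, not taken from the spec or from any existing library; `make_date` shows how a host-language limit (Python's `datetime.date` only supports years 1 to 9999) could be reported under one of these categories.

```python
import datetime


class TomlRepresentationError(ValueError):
    """Hypothetical base class: valid TOML that cannot be represented."""

    code = "unrepresentable"


class ResourceLimitError(TomlRepresentationError):
    code = "insufficient-memory"


class IntegerRangeError(TomlRepresentationError):
    code = "integer-unrepresentable"


class DateTimeRangeError(TomlRepresentationError):
    code = "datetime-unrepresentable"


def make_date(year: int, month: int, day: int) -> datetime.date:
    """Build a date, mapping the host language's ValueError to a coded error."""
    try:
        return datetime.date(year, month, day)
    except ValueError as exc:
        text = f"{year:04d}-{month:02d}-{day:02d}"
        raise DateTimeRangeError(f"{text}: {exc}") from exc
```

A parser-neutral test suite could then assert on `exc.code` rather than on language-specific exception types or messages.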

@pradyunsg
Member

Are there existing real world problems/issues caused by the lack of strict language on the behavior at these extreme values?

Given the lack of response to this question, I don't think I want to look into this prior to 1.0 though since we won't miss out on much.

@Suhoy95

Suhoy95 commented Feb 12, 2020

Are there existing real world problems/issues caused by the lack of strict language on the behavior at these extreme values?

Yeah, e.g. undefined and NaN showing up in an HTML/JS user interface, or "undefined is not a function". Of course, we can say that this is the programmer's failure, but I think it can be solved at a deeper level.

I disagree that this is a post-1.0 issue. JSON failed with this, and I do not see any reason to avoid the problem. I would like to consider this topic as I described in #701, Point 5.1. It is a really non-trivial task, but it would be great: applications could negotiate limits (if that is important) or use the popular ones. We need to think a little like microprocessor designers to understand how to do this in the best way.

@Suhoy95

Suhoy95 commented Feb 14, 2020

Hello, everyone.

I can't get TOML out of my mind, and this morning I thought of a possible solution for this edge-case issue.

Statement 1. An Integer, Float or String can have any precision as long as it stays in a TOML text document.

It really doesn't matter whether an Integer stays within the 64-bit range or not. We can easily describe the general syntax with a regular expression or ABNF notation.
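
As a sketch of that point, a regular expression for TOML decimal integers places no bound on the number of digits. The pattern below is a hand translation of the decimal-integer rule in the TOML ABNF and covers only decimal literals, not hex, octal or binary; `is_toml_decimal_integer` is a hypothetical helper name.

```python
import re

# Optional sign, then either a lone 0 or a non-zero digit followed by digits
# that may each be preceded by a single underscore. No leading zeros.
DEC_INT = re.compile(r"[+-]?(?:0|[1-9](?:_?[0-9])*)")


def is_toml_decimal_integer(text: str) -> bool:
    """Return True if text is syntactically a TOML decimal integer."""
    return DEC_INT.fullmatch(text) is not None


# A 30-digit literal is valid syntax, far beyond the 64-bit range.
assert is_toml_decimal_integer("123456789012345678901234567890")
```

The syntax check and the range check are therefore separate steps, and only the second depends on the target machine or language.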

Statement 2. The issue arises when you parse TOML for a particular machine or language (Python can also be considered a virtual machine). The TOML specification (and its authors) cannot solve this problem by design. It is the responsibility of the programmer of the particular system/program.

This problem can be split into two issues:

2.1 Loss of precision when downcasting a Float/Integer/Datetime, because the machine does not support that precision
2.2 A small buffer at the end recipient, which relates to the String size limit and the TOML file size limit as well (for example, #446)

So, can we help the programmer somehow?

Solution 1. We can define hard or recommended limits (as is written right now).

But if the limits are not appropriate for a particular case, the programmer has to consciously break the rules.

Solution 2. We can specify optional Prefix-Types where it really matters:

S2.0 Without prefix-types we can set recommended limits as in Solution 1.

S2.1 Integer. As an example we can take the cstdint or Go basic types, and also add something like long arithmetic for the Python/scientific case.

S2.2 Float. We can easily take it from IEEE 754. I would also add something like long/precise arithmetic, because IEEE 754 is not a perfect way to store numbers (especially for monetary amounts).

S2.3 String. We can encode the length/encoding in the prefix.

Encodings other than UTF-8 make things complicated, but they are really helpful in non-Unicode environments.

Example:

a_int = uint16_0xFFEA
b_float = binary16_0.2
s_string = utf8@14"Hello, world!"
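
To make the integer case concrete, here is a sketch of how a parser might handle the proposed prefix. The syntax exists only in this proposal and is not part of TOML; the regular expression and the `parse_prefixed_int` helper are hypothetical and cover just the `intN_`/`uintN_` forms.

```python
import re

# Hypothetical grammar: "uint16_0xFFEA" -> unsigned, 16 bits, literal 0xFFEA.
TYPED_INT = re.compile(r"(?P<unsigned>u?)int(?P<bits>8|16|32|64)_(?P<literal>\S+)")


def parse_prefixed_int(token: str) -> int:
    """Parse a prefix-typed integer and check it against the declared width."""
    match = TYPED_INT.fullmatch(token)
    if match is None:
        raise ValueError(f"not a prefix-typed integer: {token!r}")
    bits = int(match["bits"])
    # Base 0 lets int() recognise 0x, 0o and 0b prefixes as well as decimal.
    value = int(match["literal"], 0)
    if match["unsigned"]:
        low, high = 0, 2**bits - 1
    else:
        low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    if not low <= value <= high:
        raise OverflowError(f"{value} does not fit the declared type in {token!r}")
    return value


assert parse_prefixed_int("uint16_0xFFEA") == 0xFFEA
```

A value such as `uint16_0x1FFFF` would be rejected with an error, because it exceeds the declared 16-bit width.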

Options:

S2.4 We can add an optional exclamation mark to warn the end recipient that the precision is important and that the value should either be parsed as written or rejected with an error. Without the exclamation mark, the type is only recommended/non-critical.

But this needs to be justified; it seems a little bit over-engineered.

S2.5 This does not solve the problem of a large TOML file and a small recipient buffer. Of course, that is really a problem for the recipient. With HTTP headers or filesystem metadata it can find out the real TOML size, but that is not true for an arbitrary byte stream. So we can add an optional #?[Unsigned integer filesize] line at the beginning of the TOML document.

S2.6 These suggestions seem to be backward compatible with TOML v0.5.0.

@cesss

cesss commented Oct 24, 2021

I'm an outsider (just comparing serialization formats out there, and TOML is perhaps the format I'm going to choose, mainly because of how simple its C library is), but one of the things I was checking now is precisely numeric ranges, and then I found this issue, so I thought I could write my 2cents:

If I'm reading the spec correctly, if an implementation chooses to support, for example, 128-bit integers, it would be fully-compliant, because the spec requires to support 64-bit signed ints, but doesn't say anything about integers with more bits, so I guess they would be legal. Even parsing an integer as unsigned 64 bit when it's positive and doesn't fit as signed 64bit would be legal too (I mean, the spec doesn't say that's forbidden). I find all of this very fortunate, because I tend to work with unsigned 64bit a lot, and sometimes with more than 64 bits, so the fact that the TOML spec doesn't impose limits on integers is in my case a very good point for choosing it.

Regarding floating point, I think it would be nice to relax the "should" word in the spec when it says that implementations should implement fp numbers as doubles. In some of my apps, I use 80bit Intel fp, for example. Yes, not portable to different CPUs, but I use it when I really need it, and it would be nonsense to be able to use TOML in my apps that work with 64bit doubles, and not being able to do so when some variable needs to be 80bit fp.

As I said, just my 2cents, as I'm an outsider...

@pradyunsg
Member

If I'm reading the spec correctly, if an implementation chooses to support, for example, 128-bit integers, it would be fully-compliant, because the spec requires to support 64-bit signed ints, but doesn't say anything about integers with more bits, so I guess they would be legal. Even parsing an integer as unsigned 64 bit when it's positive and doesn't fit as signed 64bit would be legal too (I mean, the spec doesn't say that's forbidden).

You did read it correctly.

Given the lack of response to this question, I don't think I want to look into this prior to 1.0 though since we won't miss out on much.

Given the lack of concrete answers here, I'm leaning toward maintaining status quo; which defers to implementation authors to make the design choices of what the best approach to take would be -- Each implementation can do as its authors deem appropriate for their language/ecosystem/domain and they're welcome to support higher limits, having a strict failure mode or other behaviours on extreme values.

@Timie

Timie commented Jan 26, 2023

Hi, everybody,

I just want to add one thing regarding the exact representation of floats. If the TOML would require exact representation, it could not work with usual floating point types because of their rounding errors. It would fail even for simple values such as 0.1 or 0.3 which do not really have an exact representation in double or float, but at the same time they are pretty likely to appear in human-provided configuration files.

The only solution to that would be to use exact floating point representation types, such as "decimal" in C#. That may complicate the design of parsing libraries in languages where there's no native type for that.

In practice, I believe that supporting special treatment of types and entries is rather an (optional, not required) responsibility of the parsing library. It can either provide "callbacks" or access to the raw textual value of the entry, if the user of the library really cares about the special treatment of the values.
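
The rounding point about 0.1 can be checked directly in Python, whose `float` is an IEEE 754 binary64 value on common platforms. `Decimal(0.1)` converts the stored binary value exactly, which exposes the difference from the decimal literal.

```python
from decimal import Decimal

# The binary64 value nearest to 0.1, written out exactly.
stored = Decimal(0.1)
written = Decimal("0.1")

assert stored != written           # 0.1 has no exact binary64 representation
assert 0.1 + 0.2 != 0.3            # a familiar consequence of that rounding
assert repr(float("0.1")) == "0.1" # yet the shortest round-trip text is kept
```

So a rule demanding that the parsed value equal the written decimal exactly could not be met by binary floats, even though the text round-trips for typical configuration values.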

@eksortso
Contributor

@Timie You said:

Hi, everybody,

I just want to add one thing regarding the exact representation of floats. If the TOML would require exact representation, it could not work with usual floating point types because of their rounding errors. It would fail even for simple values such as 0.1 or 0.3 which do not really have an exact representation in double or float, but at the same time they are pretty likely to appear in human-provided configuration files.

Well, the specification does say the following:

Floats should be implemented as IEEE 754 binary64 values.

Which means that decimal values cannot be expected to be represented with exact precision, barring additional input to the parser. Instances where TOML floats need to be converted to exact Decimal types are exceptional cases, and indeed there are parsers that can make that conversion. Python's tomli/tomllib library immediately comes to mind; it can parse TOML files with a special parse_float option that can be used to make Decimal values out of TOML floats and maintain their precise values. That's just one implementation.

This is not a problem in most cases. The only thing that parsers must do with TOML floats is turn them into numbers. There's no immediate check for exact equivalence to perceived precision. It's usually not necessary, especially since TOML does not perform arithmetic. The consumer program that uses the parser's output is supposed to handle any rounding issues that may arise. And the spec doesn't state how floats ought to be represented if they're not converted to IEEE 754 binary64 values. The parsers already carry a lot of that weight.

So really, nothing more needs to be said.

However, if a distinct TOML decimal type would ever need to be created, then we'd have to provide a syntactic marker to indicate a decimal, and we'd have to put acceptable lower limits on the precision of such decimals. (No idea if we could agree upon a common prefix or suffix to indicate decimals.) And then, nearly all modern programming languages have common implementations of exact decimal values; there's no shortage of those. Parser writers will choose what they deem fits their needs most accurately.

If you think that's something that TOML should consider, for v1.1.0 or for the future, then feel free to make a case for such a data type. Suggest a syntax. (I'm partial to a prefix like a $ or a D, but I'm being provincial.) Give a precise definition of what a resultant "decimal" type and its requirements should be. There are a lot of precedents for these.

But we would still need to gauge whether such a type really belongs in TOML. And for that, we'll need feedback for such a proposed new feature.

@pradyunsg
Member

I'm gonna close this out, based on #538 (comment)

kachick added a commit to kachick/ruby-ulid that referenced this issue Mar 10, 2024
But this made unparsable error for 10000+ years

May relate to toml-lang/toml#538 (comment)