Canonical representation for 0000~0999 years in ISO8601 #2082

Closed
LiviaMedeiros opened this issue Mar 9, 2022 · 18 comments · Fixed by #2090

Comments

@LiviaMedeiros
Contributor

According to current ISOYearString implementation, non-negative years earlier than 1000 are represented in +000YYY format instead of 0YYY.

Is that intentional?
If yes, what is the rationale?

As far as I understand ISO8601, expanded representation is always allowed to be used "by agreement", but is neither necessary nor preferable within the whole 0000~9999 range.

Current behaviour is not compatible with legacy Date:

new Date(-33_333_333_333_333).toJSON(); // 0913-09-16T12:44:26.667Z
Temporal.Instant.fromEpochMilliseconds(-33_333_333_333_333).toJSON(); // +000913-09-16T12:44:26.667Z
Temporal.Instant.from('0913-09-16T12:44:26.667Z').toString(); // +000913-09-16T12:44:26.667Z

It's not how it usually works in other languages:

Instant.ofEpochSecond(-33333333333L).toString() # openjdk 11.0.14: 0913-09-16T12:44:27Z
datetime.fromtimestamp(-33333333333).isoformat() # Python 3.10.2: 0913-09-16T19:41:32
date('c', -33333333333); # PHP 8.1.3: 0913-09-16T12:44:27+00:00

And of course it breaks compatibility with a lot of userland parsers.
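For reference, legacy Date already switches between the 4-digit and the expanded form purely by range. A minimal sketch of that behaviour follows; the helper name `isoStringForYear` is made up for illustration, and it sets the UTC year on a fresh Date because `Date.UTC()` and the Date constructor remap years 0-99 to 1900-1999:

```javascript
// Hypothetical helper (not from the thread): ISO string for 1 January
// of the given proleptic Gregorian year, as produced by legacy Date.
function isoStringForYear(year) {
  const d = new Date(0);   // 1970-01-01T00:00:00.000Z
  d.setUTCFullYear(year);  // no 0-99 -> 19xx remapping here
  return d.toISOString();
}

for (const year of [-1, 0, 913, 9999, 10000]) {
  console.log(year, isoStringForYear(year));
}
// -1    -000001-01-01T00:00:00.000Z
// 0     0000-01-01T00:00:00.000Z
// 913   0913-01-01T00:00:00.000Z
// 9999  9999-01-01T00:00:00.000Z
// 10000 +010000-01-01T00:00:00.000Z
```

So the expanded form only appears in Date output below year 0 and above year 9999.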

@ptomato
Collaborator

ptomato commented Mar 9, 2022

Thanks for noting this!

It's true, ISO 8601 says (section 4.3.2, Calendar Year and Years Duration):

Each Gregorian calendar year can be identified by a 4-digit ordinal number beginning with ‘0000’ for year zero, through ‘9999’.

This behaviour of switching to the extended years at <1000 existed in the proposal since before I was involved. Maybe @pipobscure @maggiepint @mattjohnsonpint might know what the rationale is?

If there are interoperability problems with legacy Date and with other languages that we previously weren't aware of, that might meet the bar for considering a change even during Stage 3.

On the other hand, I'm not sure that a userland parser rejecting the extended year format is a strong enough motivation. Even if the 4-digit year was output for 0000–9999 instead of 1000–9999, that parser would still be unable to handle some of the output from JS (legacy Date as well as Temporal). "We use 6-digit extended years by agreement" means that you have to be able to handle 6-digit extended years if you expect to parse the output, or you will eventually get something that breaks your parser 😄

@justingrant
Collaborator

IMO, matching the output of legacy Date ISO strings seems important enough to justify a change. In some cases we must diverge (e.g. nanoseconds) but seems like we shouldn't diverge from legacy without a really good reason.

@LiviaMedeiros
Contributor Author

Thanks for the responses!

I hope it was simply overlooked in the polyfill code, and intended to be more like:

   ISOYearString: (year) => {
     let yearString;
-    if (year < 1000 || year > 9999) {
+    if (year < 0 || year > 9999) {
       let sign = year < 0 ? '-' : '+';
       let yearNumber = MathAbs(year);
       yearString = sign + `000000${yearNumber}`.slice(-6);
     } else {
-      yearString = `${year}`;
+      yearString = `0000${year}`.slice(-4);
     }
     return yearString;
   }
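The patched rule can also be tried outside the polyfill. The following is only a standalone sketch: `isoYearString` is an illustrative name rather than the polyfill's actual export, `padStart` is swapped in for the `slice` trick used in the diff, and the input is assumed to be an integer year within Temporal's supported range.

```javascript
// Standalone sketch of the proposed rule (assumed name, not the polyfill API).
function isoYearString(year) {
  if (year < 0 || year > 9999) {
    // Expanded form: explicit sign plus six zero-padded digits.
    const sign = year < 0 ? '-' : '+';
    return sign + String(Math.abs(year)).padStart(6, '0');
  }
  // Years 0..9999: plain four zero-padded digits.
  return String(year).padStart(4, '0');
}

console.log(isoYearString(0));     // 0000
console.log(isoYearString(913));   // 0913
console.log(isoYearString(2022));  // 2022
console.log(isoYearString(-1));    // -000001
console.log(isoYearString(10000)); // +010000
```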

@pipobscure
Collaborator

pipobscure commented Mar 9, 2022

The reason for this was that ISO8601 (4.1.2.1) specifies that years before 1582 require prior consent between platforms so aren’t really canonical. Add to that that leading zeros in years give trouble to some platforms, and that 4 digits are required for the basic format. So we said let’s maximise for interoperability, which turned out to mean outputting extended years before 1000 CE.

So if you want to change this for better interop, then don’t do this! If you want to go for strictness then change it to “extended before 1582”. And if you want to optimise for interop and humans then leave it as is.

@mattjohnsonpint
Collaborator

mattjohnsonpint commented Mar 9, 2022

I have no idea why that's there. My preference would be that years 0000-9999 are always represented with four digits.

@mattjohnsonpint
Collaborator

@pipobscure - which platforms can't parse years before 1000 with four digits?

@pipobscure
Collaborator

None. I was being unclear. I meant that because the year is the first thing, and leading zeros are something a lot of things don’t deal well with, it’s problematic. Just look at any CSV library or at handling data in Excel: leading zeros are likely to be dropped. By requiring the extended format before 1000, dates are no longer led by zeroes, since that format requires a + or - in the lead position.

@justingrant
Collaborator

justingrant commented Mar 9, 2022

Summarizing the discussion above, it sounds like the decision is this:

  1. If we want to optimize for platforms that choke on leading zeroes, then leave as-is: +000092-01-01.
  2. If we want to optimize for compatibility with legacy Date, then use a 4-digit year: 0092-01-01.

FWIW, it turns out that in Excel (at least on my Mac), (2) may actually be more compatible. When you paste +000092-01-01 into an Excel cell, it interprets it as a formula!

image

But when you paste 0092-01-01, it's correctly interpreted as text and the leading zeroes are retained.

image

It's possible that modern versions of Excel recognize the ISO 8601 format (at least the 4-digit variant) and intentionally avoid converting it to a formula.

I tried importing the same values in a CSV into Excel. Surprisingly, the behavior was the same: 4-digit years with no prefix imported fine, while 6-digit years with a plus prefix imported as a formula. Note that to get Excel to successfully import, I needed to wrap the values in quotes. Without quotes, the imported file had all blank cells!

Given that the 4-digit format seems to be more compatible with legacy Date and also seems more compatible with Excel, my vote would be to switch to the 4-digit format for years between 1 BCE - 999 CE.

@LiviaMedeiros
Contributor Author

The reason for this was that ISO8601 (4.1.2.1) specifies that years before 1582 require prior consent between platforms so aren’t really canonical.

Sorry for the misleading title: by "canonical" I meant the format that will be produced by the final reference implementation of Temporal, and maybe explicitly described in the final specification.

However, I don't see any rational connection between <1582 and expanded representation.
ISO allows 0000~1582 years by agreement. ISO allows expanded representation by agreement and then recommends agreeing the additional number of digits. ISO also allows omitting T, - and : by agreement. But all these agreements are independent, and "de-facto canonical" format starts with YYYY- as long as it fits.
Also we might assume applicable parts of RFC 3339 to be a global mutual agreement by default.

As for Y1582 itself, I think the logic behind that is that only dates after Inter gravissimas are maintained as legitimate part of Gregorian calendar. Earlier dates are deprecated and may only occur in its extrapolation. So it's a sort of disclaimer: if user implements ISO 8601 in time machine and goes too far into the past, they may desync due to astronomical reasons. If they go to the future, time shall be patched by leap seconds.
Such dates are important in really specific calendar-centric code, where we actually have "start/end of world" timestamps, and going out of bounds must actually throw an error. But in this context it's just a nonsense number.

If we want to optimize for platforms that choke on leading zeroes

Thanks a lot for checking this one.

I suppose, most-likely-to-fail here are parsers with heuristics for 2-digit years and without highest priority check for machine-readable formats. But two extra leading zeros and a plus sign won't help them anyways: if they don't support simple version of standard, why expect support for expanded form?

As an example of 2-digit oriented field, YEAR(4) type in MySQL interprets 0000 as 0000 and +000000 as 2000:

CREATE TABLE gwak ( y YEAR, t TEXT );
INSERT INTO gwak VALUES ( '0000', '0000' ), ( '+000000', '+000000' );
SELECT * from gwak;
+------+---------+
| y    | t       |
+------+---------+
| 0000 | 0000    |
| 2000 | +000000 |
+------+---------+

@mattjohnsonpint
Collaborator

mattjohnsonpint commented Mar 10, 2022

I'm still in favor of using only 4 digits for years 0000-9999. I think having years 0-1000 behave differently is a trap that is too reminiscent of the legacy Date behavior for years 0-100.

Also, I don't think it's our responsibility to avoid leading-zero pitfalls on other platforms. And as mentioned, the leading +/- can cause side effects in some systems.

@ckknight
Contributor

Given the lack of modern representation of dates older than 1022 years ago and the possibility that some parties may choke on leading zeroes, I don't see why the addition of 3 tiny characters (+00) in those circumstances is more harmful than padding with zeroes.

@LiviaMedeiros
Contributor Author

A word about terminology and importance of third parties (personal opinion, feel free to ignore)

ISO8601 representation is a string, serialized form.
Stringified output is expected to be transferred, stored and parsed by ANY kind of software.
Strong interoperability is the main purpose of a standardized format.
Asking things outside of JS ecosystem to support uncommon representation is a dead end: they might be limited by their own specifications, boundary values, performance issues, etc.

Standalone "compatibility with legacy Date" is a totally correct wording, but it sounds like a step back because Temporal was made to solve problems caused by Date.
Thinking of it as "compatibility with a majority of outer-world systems" might be more accurate.

I'd highly recommend investigating further which format is more likely to fail on other systems.

About csv and spreadsheet software, I tried to import this file:

2017-03-17T10:39:13.250Z,0913-09-16T12:44:26.667Z,+000913-09-16T12:44:26.667Z
2017-03-17T10:39:13.250,0913-09-16T12:44:26.667,+000913-09-16T12:44:26.667
2017-03-17T10:39:13.250+0000,0913-09-16T12:44:26.667+0000,+000913-09-16T12:44:26.667+0000
2017-03-17T10:39:13Z,0913-09-16T12:44:26Z,+000913-09-16T12:44:26Z
2017-03-17T10:39:13,0913-09-16T12:44:26,+000913-09-16T12:44:26
2017-03-17T10:39:13+0000,0913-09-16T12:44:26+0000,+000913-09-16T12:44:26+0000

This is how it's interpreted by LibreOffice Calc 7.3.1.3:
temporal-libreoffice-csv
※ Comma before milliseconds is okay: it's actually preferred over dot in ISO 8601.

And this is Google Docs:
temporal-googledoc-csv

Right-aligned cells are the ones that were successfully interpreted as datetime values (i.e. implicitly converted to their own internal objects).
So as you can see, leading zeroes are fine here, but expanded representation is not recognized.
Saving/reopening the csv file didn't corrupt anything, and didn't trim leading zeroes.

Some notable examples of +00 being harmful:

# datetime module is a part of The Python Standard Library
from datetime import datetime
print(datetime.fromisoformat('0913-09-16T12:44:26')) # 0913-09-16 12:44:26
print(datetime.fromisoformat('+000913-09-16T12:44:26')) # ValueError: Invalid isoformat string
# Time is a part of Ruby Standard Library
require 'time'
puts Time.iso8601('0913-09-16T12:44:26Z') # 0913-09-16 12:44:26 UTC
puts Time.iso8601('+000913-09-16T12:44:26Z') # ArgumentError: invalid date
# date is a part of GNU coreutils
date -d '0913-09-16T12:44:26Z' # Sat Sep 16 19:41:31 LMT 0913
date -d '+000913-09-16T12:44:26Z' # date: invalid date
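To make the failure mode concrete: a consumer stuck with a 4-digit-only parser would need a shim along these lines in front of it. This is purely an illustrative workaround, not something proposed in this thread; `toFourDigitYear` is a hypothetical name, and the sketch only rewrites expanded years that fit in 0000-9999, leaving every other string untouched.

```javascript
// Hypothetical consumer-side shim: rewrite "+00YYYY-..." to "YYYY-...".
// Negative years and years above 9999 have no 4-digit form and pass through.
function toFourDigitYear(isoString) {
  return isoString.replace(/^\+00(\d{4})(?=-)/, '$1');
}

console.log(toFourDigitYear('+000913-09-16T12:44:26Z')); // 0913-09-16T12:44:26Z
console.log(toFourDigitYear('0913-09-16T12:44:26Z'));    // 0913-09-16T12:44:26Z
console.log(toFourDigitYear('-000913-09-16T12:44:26Z')); // -000913-09-16T12:44:26Z
console.log(toFourDigitYear('+010000-01-01T00:00:00Z')); // +010000-01-01T00:00:00Z
```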

@ptomato
Collaborator

ptomato commented Mar 16, 2022

Personally, I'm not yet convinced by those examples — those third-parties still choke on other valid ISO 8601 strings that could just as easily be output by Temporal as well as legacy Date. For example, years before 0, or after 9999:

>>> from datetime import datetime
>>> datetime.fromisoformat('-000913-09-16T12:44:26')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Invalid isoformat string: '-000913-09-16T12:44:26'
irb(main):001:0> require 'time'
=> true
irb(main):002:0> Time.iso8601('+100000-09-16T12:44:26Z')
Traceback (most recent call last):
        6: from /usr/local/bin/irb:23:in `<main>'
        5: from /usr/local/bin/irb:23:in `load'
        4: from /var/lib/gems/2.5.0/gems/irb-1.1.0/exe/irb:11:in `<top (required)>'
        3: from (irb):3
        2: from (irb):4:in `rescue in irb_binding'
        1: from /usr/lib/ruby/2.5.0/time.rb:602:in `xmlschema'
ArgumentError (invalid date: "+100000-09-16T12:44:26Z")

(interestingly, Ruby's standard library does accept -YYYYYY, just not +YYYYYY)

$ date -d '-000913-09-16T12:44:26Z'
date: invalid date ‘-000913-09-16T12:44:26Z’

Sure, this change would make them accept some more ISO strings originating from Temporal that they previously would not, but there is still not "mutual agreement between the communicating parties" on the extended years.

Is there a parser in widespread use that accepts -000001 and +100000 but rejects +000999, for example? Or do we find it likely that interoperability will drop when applications that serialize legacy Dates in the year range 0-999 and pass them to third parties port their code to Temporal? (Quirk-for-quirk compatibility with legacy Date is definitely not a goal of Temporal, but I could see a concern like this weighing in favour of making the change.)

@justingrant
Collaborator

justingrant commented Mar 16, 2022

Or do we find it likely that interoperability will drop when applications that serialize legacy Dates in the year range 0-999 and pass them to third parties port their code to Temporal? (Quirk-for-quirk compatibility with legacy Date is definitely not a goal of Temporal, but I could see a concern like this weighing in favour of making the change.)

The most important interop for Temporal is with other ECMAScript code. If there are two valid formats for the same date, but one of them is currently used by legacy Date and the other is not, then IMO there's a really high bar to choose the "not" option.

From what I see from the examples above, some parsers fail on 0111-01-01 and others fail on +000111-01-01. There's no format that works for everyone. Furthermore, other platforms' parsers are not static. They will evolve over time in response to interop concerns.

So I think the best thing we can do here is to match legacy Date output, because:

  • It will avoid interop issues after converting Date code to use Temporal.
  • External parsers that want to support ECMAScript output will only have one format to align to, instead of two different ECMAScript formats.
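For an external parser that wants to align with ECMAScript output, accepting both year forms is a small amount of work. A rough sketch, with `parseIsoYear` as a hypothetical helper name; it only reads the leading year of an extended-format date string and rejects `-000000`, which ECMAScript's date-time string format treats as invalid:

```javascript
// Hypothetical helper: read the leading year of an ISO 8601 date string,
// given either the 4-digit form or the sign + 6-digit expanded form.
function parseIsoYear(isoString) {
  const match = /^(?:(\d{4})|([+-]\d{6}))-/.exec(isoString);
  if (match === null) {
    throw new RangeError(`Unrecognized year in ${isoString}`);
  }
  if (match[2] === '-000000') {
    // Year zero must be written 0000 or +000000, never with a minus sign.
    throw new RangeError('-000000 is not a valid year');
  }
  return Number(match[1] ?? match[2]);
}

console.log(parseIsoYear('0111-01-01'));    // 111
console.log(parseIsoYear('+000111-01-01')); // 111
console.log(parseIsoYear('-000001-01-01')); // -1
console.log(parseIsoYear('+100000-01-01')); // 100000
```

Both spellings of year 111 map to the same number, so downstream code does not need to care which one the emitter chose.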

@LiviaMedeiros
Contributor Author

For example, years before 0, or after 9999

In this case, such values are outside of range, agreed between emitter and parser by supporting it on both sides. 0000~9999 boundaries are justified by at least two major reasons: performance and existence of RFC 3339. Current behaviour of Temporal shrinks intersection of supported formats by 1000 years, seemingly for no reason.

Is there a parser in widespread use that accepts -000001 and +100000 but rejects +000999, for example?

I hope not. 😄

Or do we find it likely that interoperability will drop when applications that serialize legacy Dates in the year range 0-999 and pass them to third parties port their code to Temporal?

Yes. I don't have good examples of applications that would be affected, and very old dates are indeed rare in most "real life" data, but year 0 is affected, and it's natural to give it a special meaning (e.g. "date unspecified").
After porting code with said convention from Date, testing on common cases and releasing, this will cause a lot of gray hair.

Complicated processing is not required to meet this problem: checking and normalization of existing dates is enough to break output in Temporal version. Innocent function like this one might become a textbook footgun:

  static #MIN = Temporal.Instant.from('0000-01-01T00:00:00+0000');
  static #MAX = Temporal.Instant.from('9999-12-31T23:59:59.999999999+0000');
  static validateAndNormalizeAndCheckRange(dateString) {
    const tmp = Temporal.Instant.from(dateString);
    if (Temporal.Instant.compare(this.#MIN, tmp) > 0 || Temporal.Instant.compare(this.#MAX, tmp) < 0)
      throw new RangeError('Datetime outside of range');
    return tmp.toString();
  }

// or simply:
function validateAndNormalize(dateString) { return Temporal.Instant.from(dateString).toString(); }
/**
 * @deprecated and strongly discouraged due to browser differences and inconsistencies
 */
function normalizeOrNull(dateString) { return new Date(dateString).toJSON(); }

By the way, not only emitters and parsers are happier with 0YYY. Some storage engines may add weight as well.

SQLite version 3.38.0 2022-02-22 18:58:40
sqlite> CREATE TABLE gwak (d TEXT, t TEXT);
sqlite> INSERT INTO gwak VALUES ( datetime('0913-09-16T12:44:26'), '0913-09-16T12:44:26' );
sqlite> INSERT INTO gwak VALUES ( datetime('+000913-09-16T12:44:26'), '+000913-09-16T12:44:26' );
sqlite> INSERT INTO gwak VALUES ( datetime(-33333333333, 'unixepoch'), '-33333333333' );
sqlite> SELECT * FROM gwak;
0913-09-16 12:44:26|0913-09-16T12:44:26
|+000913-09-16T12:44:26
0913-09-16 12:44:27|-33333333333

@justingrant
Collaborator

Meeting 2022-03-17: Temporal will change toString output of ISO years 0000-0999 from the extended sign+6-digit format (e.g. +000123) to the compatible-with-Date format of 4 digits (e.g. 0123).

The main reason for making this change is to maximize compatibility with Date.p.toString(). If that method's output unnecessarily differs from Temporal.*.p.toString() then it makes it harder and riskier for developers to port code from Date to Temporal and harder and riskier to interop between Date-using code and Temporal-using code.

We considered other platforms' challenges with parsing either format, but in the end we agreed that it's a much higher priority to be compatible with other ECMAScript code than to worry about non-spec-compliant parsers outside of ECMAScript. One could even argue that it's better for developers dealing with external parsers to only have one ECMAScript format to worry about, instead of two. Hopefully those other parsers will eventually become compliant in the future!

@LiviaMedeiros
Contributor Author

Thanks everyone for agreeing on this despite Stage 3!

@justingrant
Collaborator

Thanks @LiviaMedeiros! We're grateful that you found this soon enough so that it could be fixed. If you find other issues, feel free to let us know!

webkit-commit-queue pushed a commit to WebKit/WebKit that referenced this issue May 11, 2022
…s, not 6

https://bugs.webkit.org/show_bug.cgi?id=240294

Reviewed by Yusuke Suzuki.

This patch implements the spec change of tc39/proposal-temporal#2082:
The range for 4-digit years in ISO8601 date strings should be 0-9999, not 1000-9999.

* test262/expectations.yaml:
Mark four test cases as passing.

* runtime/TemporalInstant.cpp:

Canonical link: https://commits.webkit.org/250456@main
git-svn-id: https://svn.webkit.org/repository/webkit/trunk@294050 268f45cc-cd09-0410-ab3c-d52691b4dbfc