-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The actual formatting part of compact decimal formatting #2898
The actual formatting part of compact decimal formatting #2898
Conversation
I ended up implementing formatting from Still needs a couple of implementation comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work; a few minor things I found, but this is landable now in experimental and you can keep working in another PR if you like.
let (mut plural_map, mut exponent) = self.plural_map_and_exponent_for_magnitude(log10_type); | ||
let mut significand = unrounded.multiplied_pow10(-i16::from(exponent)); | ||
if significand.nonzero_magnitude_start() == 0 { | ||
significand.half_even(-1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought: Even though I reviewed these functions when they went in, it's not super clear to me as a reader that this call is mutating the significant to perform rounding. 😃 We should consider in 2.0 changing these functions to be called round_half_even
or something else with an active verb. CC @younies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strongly agree; I had to reverse-engineer the semantics from the doc tests to figure out what these functions actually do (the comments are not particularly helpful).
While waiting for 2.0 I would like to improve the comments in a follow-up PR.
let log10_type = unrounded.nonzero_magnitude_start(); | ||
let (mut plural_map, mut exponent) = self.plural_map_and_exponent_for_magnitude(log10_type); | ||
let mut significand = unrounded.multiplied_pow10(-i16::from(exponent)); | ||
if significand.nonzero_magnitude_start() == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: This behavior is slightly different than ICU and ECMA-402. In those implementations, we retain 2 significant digits all the way down. We should try to align.
new Intl.NumberFormat("en", { notation: "compact" }).format(1.2e-5)
// '0.000012'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.2e-5 is not a valid input here, since we take i64
s.
Note that this is looking at the magnitude of the significand on its own, not the actual magnitude of the number being formatted, so this condition is « if we show one digit before the decimal point » (and then that branch does « show one more after the decimal point »).
I will add some comments that gloss the logic like that so it is clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this isn't a problem since we only have integers. We shouldn't forget about it when supporting non-integers.
let (mut plural_map, mut exponent) = self.plural_map_and_exponent_for_magnitude(log10_type); | ||
let mut significand = unrounded.multiplied_pow10(-i16::from(exponent)); | ||
if significand.nonzero_magnitude_start() == 0 { | ||
significand.half_even(-1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought: I wonder if we should make trunc
the default compact decimal rounding mode in ICU4X? People might not like 999,501 of something (likes, views, emails, ...) rounded up to 1M.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this was a questionable default (YouTube uses directed rounding towards 0 everywhere for the very reason you cite).
However, I would rather we aligned with ICU4everythingelse and other existing implementations.
Aside from compatibility/consistency questions, I should note that truncating by hand is much easier (you don’t need the adjustment on round up), so I would rather let people do the easy one themselves and provide the more involved (yet also reasonable) option than the reverse.
self.formatter | ||
.fixed_decimal_format | ||
.format(self.value.significand()) | ||
.write_to(sink)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: This might not work as expected with negative numbers. Please add some test cases for that (can be done after landing this PR).
The way FixedDecimalFormatter deals with signs (plus and minus) is to have 3 versions of the pattern that are selected at runtime: no sign, plus sign, and minus sign. So, we don't actually render a sign at runtime; we just select the pre-rendered pattern. I don't know if that's how we want to handle it in CDF since we have a lot more patterns and would like to avoid bloating the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some examples in the docs tests; it works as I expect at least.
French mille
is messy (though it doesn’t do something truly misleading like dropping the sign, it falls to that weird one
case, thus -1 millier
, which may be the least broken thing it can do; getting -mille
would probably be tricky, and that looks utterly weird anyway).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but try this in a locale where the sign and the compact affix are on the same side. For example:
We have a pattern
"1000000-count-other": "M0",
The correct output according to CLDR would be "+M3", but your code produces "M+3"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that same file, there is
"1000-count-one": "elfu 0;elfu -0",
which demonstrates how CLDR does actually want us to consider these as proper decimal patterns that could have different signed and unsigned versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(argh, too many threads.)
Yes, but there is also elfu 0
without that dance, which is therefore suspect; and the millions don’t do that but do it in other variants of Swahili, etc.
See this in the datagen:
icu4x/provider/datagen/src/transform/cldr/decimal/compact_decimal_pattern.rs
Lines 51 to 65 in 356bfa8
// All compact decimal patterns for sw (Swahili) are split by sign; | |
// the sign ends up where it would be as part of the significand, so | |
// this special handling is unneeded. Depending on the region subtag, | |
// the space may be breaking or nonbreaking. | |
("elfu 0;elfu -0", "elfu 0"), | |
("milioni 0;milioni -0", "milioni 0"), | |
("bilioni 0;bilioni -0", "bilioni 0"), | |
("trilioni 0;trilioni -0", "trilioni 0"), | |
("elfu\u{A0}0;elfu\u{A0}-0", "elfu\u{A0}0"), | |
("milioni\u{A0}0;milioni\u{A0}-0", "milioni\u{A0}0"), | |
("bilioni\u{A0}0;bilioni\u{A0}-0", "bilioni\u{A0}0"), | |
("trilioni\u{A0}0;trilioni\u{A0}-0", "trilioni\u{A0}0"), | |
("0M;-0M", "0M"), | |
("0B;-0B", "0B"), | |
("0T;-0T", "0B"), |
I suspect we have bad data, but that having the sign next to the number is the better default, since many swahili patterns go out of their way to stick it back there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"+M3" is likely a reasonable and correct outcome
I am unconvinced of that, given that it abbreviates something that seems to want to be milioni +3
.
Bear in mind that while compact decimal is used quite a bit more than scientific, its edge cases (including the use of negative numbers) quickly get uncharted (after all, the Romance languages are only just starting not to be a flagrantly ungrammatical mess, and that is for positive values).
We will indeed need to deal with that for currencies (along with a number of other complexities).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have some evidence from the number range work that designers have special requirements when it comes to single-character symbols. They tend to prefer single characters to be "glued" to the number, with extra annotations (like range separators and signs) displayed outside of the single-character symbol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unambiguous that the current UTS 35 requires "+M3" instead of "M+3". Perhaps that data is wrong, but even if it is, there will continually be more cases where we need to handle cases like this. Should we appeal to CLDR to get clarity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way I need to open a CLDR ticket because it is clear that there is a lot of wrongness in the Swahili data, for instance
https://github.com/unicode-org/cldr-json/blob/f27f55f1dd487af7cf4260f56296ee1c7649b7fc/cldr-json/cldr-numbers-full/main/sw-KE/numbers.json#L34-L36
next to https://github.com/unicode-org/cldr-json/blob/f27f55f1dd487af7cf4260f56296ee1c7649b7fc/cldr-json/cldr-numbers-full/main/sw-KE/numbers.json#L62-L64.
If it turns out that something like sign
prefix
number
etc. is needed for compact decimal (as opposed to compact currency) we can add a bit that says « sign goes at the beginning », but I suspect many of the patterns that do that (again, for compact decimal, not compact currency or ranges etc.) do that because CLDR hands the translators a handy footgun (which they work around as bugs come up).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it turns out that something like
sign
prefix
number
etc. is needed for compact decimal (as opposed to compact currency) we can add a bit that says « sign goes at the beginning »
OK. Having only two valid places to put the sign (inside or outside the affix) is probably reasonable, but it's yet another invariant that we'll need to enforce in datagen. It does make things a bit easier and smaller.
pub struct FormattedCompactDecimal<'l> { | ||
pub(crate) formatter: &'l CompactDecimalFormatter, | ||
pub(crate) value: Cow<'l, CompactDecimal>, | ||
pub(crate) plural_map: Option<ZeroMap2dCursor<'l, 'l, i8, Count, PatternULE>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: ought we pre-resolve the plural rule selection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess it's tricky since it's not always needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the same thought. There's not much use pre-resolving it since FormattedCompactDecimal is short-lived and generally intended to only be converted to a string or to parts; in fact, CompactDecimalFormatter::format is already doing more work than either FixedDecimalFormatter or DateTimeFormatter. Most of the work should live in FormattedCompactDecimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CompactDecimalFormatter::format is already doing more work than either FixedDecimalFormatter or DateTimeFormatter.
The reason for that is that the error checking of the fallible one (format_compact_decimal
) needs to happen early enough that we can return a useful error (rather than core::fmt::Error
).
That work being done, FormattedCompactDecimal
naturally takes the shape it has.
.get1(&plural_category.into()) | ||
.or_else(|| plural_map.get1(&Count::Other)) | ||
})() | ||
.ok_or(core::fmt::Error)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: IIFEs aren't super idiomatic in Rust, but the alternative is definitely uglier. being able to if foo && let ...
will help in the future though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL a new acronym.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, there's actually a nicer way to do this in Rust 1.65
Co-authored-by: Shane F. Carr <shane@unicode.org>
Towards #1267.
Only the footgun API, the end-to-end one will come in a subsequent PR.Edit: I ended up writing the
format(i64)
API too.