Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The actual formatting part of compact decimal formatting #2898

Merged
merged 103 commits into from
Dec 20, 2022

Conversation

eggrobin
Copy link
Member

@eggrobin eggrobin commented Dec 16, 2022

Towards #1267.

Only the footgun API, the end-to-end one will come in a subsequent PR.
Edit: I ended up writing the format(i64) API too.

@eggrobin
Copy link
Member Author

you can keep working in the PR

I ended up implementing formatting from i64 while writing the doc comments, because it is a lot easier to document something that exists than to awkwardly gesture at something that doesn’t.

Still needs a couple of implementation comments.

@eggrobin eggrobin marked this pull request as ready for review December 19, 2022 17:57
@eggrobin eggrobin requested a review from sffc December 19, 2022 18:23
sffc
sffc previously approved these changes Dec 19, 2022
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work; a few minor things I found, but this is landable now in experimental and you can keep working in another PR if you like.

utils/fixed_decimal/src/compact.rs Outdated Show resolved Hide resolved
experimental/compactdecimal/src/compactdecimal.rs Outdated Show resolved Hide resolved
let (mut plural_map, mut exponent) = self.plural_map_and_exponent_for_magnitude(log10_type);
let mut significand = unrounded.multiplied_pow10(-i16::from(exponent));
if significand.nonzero_magnitude_start() == 0 {
significand.half_even(-1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: Even though I reviewed these functions when they went in, it's not super clear to me as a reader that this call is mutating the significant to perform rounding. 😃 We should consider in 2.0 changing these functions to be called round_half_even or something else with an active verb. CC @younies

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strongly agree; I had to reverse-engineer the semantics from the doc tests to figure out what these functions actually do (the comments are not particularly helpful).

While waiting for 2.0 I would like to improve the comments in a follow-up PR.

let log10_type = unrounded.nonzero_magnitude_start();
let (mut plural_map, mut exponent) = self.plural_map_and_exponent_for_magnitude(log10_type);
let mut significand = unrounded.multiplied_pow10(-i16::from(exponent));
if significand.nonzero_magnitude_start() == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: This behavior is slightly different than ICU and ECMA-402. In those implementations, we retain 2 significant digits all the way down. We should try to align.

new Intl.NumberFormat("en", { notation: "compact" }).format(1.2e-5)
// '0.000012'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.2e-5 is not a valid input here, since we take i64s.
Note that this is looking at the magnitude of the significand on its own, not the actual magnitude of the number being formatted, so this condition is « if we show one digit before the decimal point » (and then that branch does « show one more after the decimal point »).

I will add some comments that gloss the logic like that so it is clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this isn't a problem since we only have integers. We shouldn't forget about it when supporting non-integers.

let (mut plural_map, mut exponent) = self.plural_map_and_exponent_for_magnitude(log10_type);
let mut significand = unrounded.multiplied_pow10(-i16::from(exponent));
if significand.nonzero_magnitude_start() == 0 {
significand.half_even(-1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: I wonder if we should make trunc the default compact decimal rounding mode in ICU4X? People might not like 999,501 of something (likes, views, emails, ...) rounded up to 1M.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this was a questionable default (YouTube uses directed rounding towards 0 everywhere for the very reason you cite).

However, I would rather we aligned with ICU4everythingelse and other existing implementations.

Aside from compatibility/consistency questions, I should note that truncating by hand is much easier (you don’t need the adjustment on round up), so I would rather let people do the easy one themselves and provide the more involved (yet also reasonable) option than the reverse.

Comment on lines +58 to +61
self.formatter
.fixed_decimal_format
.format(self.value.significand())
.write_to(sink)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: This might not work as expected with negative numbers. Please add some test cases for that (can be done after landing this PR).

The way FixedDecimalFormatter deals with signs (plus and minus) is to have 3 versions of the pattern that are selected at runtime: no sign, plus sign, and minus sign. So, we don't actually render a sign at runtime; we just select the pre-rendered pattern. I don't know if that's how we want to handle it in CDF since we have a lot more patterns and would like to avoid bloating the data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some examples in the docs tests; it works as I expect at least.

French mille is messy (though it doesn’t do something truly misleading like dropping the sign, it falls to that weird one case, thus -1 millier, which may be the least broken thing it can do; getting -mille would probably be tricky, and that looks utterly weird anyway).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but try this in a locale where the sign and the compact affix are on the same side. For example:

https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-numbers-full/main/sw-KE/numbers.json

We have a pattern

              "1000000-count-other": "M0",

The correct output according to CLDR would be "+M3", but your code produces "M+3"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that same file, there is

              "1000-count-one": "elfu 0;elfu -0",

which demonstrates how CLDR does actually want us to consider these as proper decimal patterns that could have different signed and unsigned versions.

Copy link
Member Author

@eggrobin eggrobin Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(argh, too many threads.)

Yes, but there is also elfu 0 without that dance, which is therefore suspect; and the millions don’t do that but do it in other variants of Swahili, etc.

See this in the datagen:

// All compact decimal patterns for sw (Swahili) are split by sign;
// the sign ends up where it would be as part of the significand, so
// this special handling is unneeded. Depending on the region subtag,
// the space may be breaking or nonbreaking.
("elfu 0;elfu -0", "elfu 0"),
("milioni 0;milioni -0", "milioni 0"),
("bilioni 0;bilioni -0", "bilioni 0"),
("trilioni 0;trilioni -0", "trilioni 0"),
("elfu\u{A0}0;elfu\u{A0}-0", "elfu\u{A0}0"),
("milioni\u{A0}0;milioni\u{A0}-0", "milioni\u{A0}0"),
("bilioni\u{A0}0;bilioni\u{A0}-0", "bilioni\u{A0}0"),
("trilioni\u{A0}0;trilioni\u{A0}-0", "trilioni\u{A0}0"),
("0M;-0M", "0M"),
("0B;-0B", "0B"),
("0T;-0T", "0B"),

I suspect we have bad data, but that having the sign next to the number is the better default, since many swahili patterns go out of their way to stick it back there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"+M3" is likely a reasonable and correct outcome

I am unconvinced of that, given that it abbreviates something that seems to want to be milioni +3.

Bear in mind that while compact decimal is used quite a bit more than scientific, its edge cases (including the use of negative numbers) quickly get uncharted (after all, the Romance languages are only just starting not to be a flagrantly ungrammatical mess, and that is for positive values).

We will indeed need to deal with that for currencies (along with a number of other complexities).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some evidence from the number range work that designers have special requirements when it comes to single-character symbols. They tend to prefer single characters to be "glued" to the number, with extra annotations (like range separators and signs) displayed outside of the single-character symbol.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unambiguous that the current UTS 35 requires "+M3" instead of "M+3". Perhaps that data is wrong, but even if it is, there will continually be more cases where we need to handle cases like this. Should we appeal to CLDR to get clarity?

Copy link
Member Author

@eggrobin eggrobin Dec 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way I need to open a CLDR ticket because it is clear that there is a lot of wrongness in the Swahili data, for instance
https://github.com/unicode-org/cldr-json/blob/f27f55f1dd487af7cf4260f56296ee1c7649b7fc/cldr-json/cldr-numbers-full/main/sw-KE/numbers.json#L34-L36
next to https://github.com/unicode-org/cldr-json/blob/f27f55f1dd487af7cf4260f56296ee1c7649b7fc/cldr-json/cldr-numbers-full/main/sw-KE/numbers.json#L62-L64.

If it turns out that something like sign prefix number etc. is needed for compact decimal (as opposed to compact currency) we can add a bit that says « sign goes at the beginning », but I suspect many of the patterns that do that (again, for compact decimal, not compact currency or ranges etc.) do that because CLDR hands the translators a handy footgun (which they work around as bugs come up).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it turns out that something like sign prefix number etc. is needed for compact decimal (as opposed to compact currency) we can add a bit that says « sign goes at the beginning »

OK. Having only two valid places to put the sign (inside or outside the affix) is probably reasonable, but it's yet another invariant that we'll need to enforce in datagen. It does make things a bit easier and smaller.

Manishearth
Manishearth previously approved these changes Dec 19, 2022
pub struct FormattedCompactDecimal<'l> {
pub(crate) formatter: &'l CompactDecimalFormatter,
pub(crate) value: Cow<'l, CompactDecimal>,
pub(crate) plural_map: Option<ZeroMap2dCursor<'l, 'l, i8, Count, PatternULE>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: ought we pre-resolve the plural rule selection?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess it's tricky since it's not always needed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same thought. There's not much use pre-resolving it since FormattedCompactDecimal is short-lived and generally intended to only be converted to a string or to parts; in fact, CompactDecimalFormatter::format is already doing more work than either FixedDecimalFormatter or DateTimeFormatter. Most of the work should live in FormattedCompactDecimal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CompactDecimalFormatter::format is already doing more work than either FixedDecimalFormatter or DateTimeFormatter.

The reason for that is that the error checking of the fallible one (format_compact_decimal) needs to happen early enough that we can return a useful error (rather than core::fmt::Error).
That work being done, FormattedCompactDecimal naturally takes the shape it has.

.get1(&plural_category.into())
.or_else(|| plural_map.get1(&Count::Other))
})()
.ok_or(core::fmt::Error)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: IIFEs aren't super idiomatic in Rust, but the alternative is definitely uglier. being able to if foo && let ... will help in the future though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL a new acronym.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, there's actually a nicer way to do this in Rust 1.65

https://stackoverflow.com/a/66629605/1407170

@eggrobin eggrobin dismissed stale reviews from Manishearth and sffc via 6b241f2 December 19, 2022 23:44
@eggrobin eggrobin requested review from sffc and Manishearth December 19, 2022 23:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants