The actual formatting part of compact decimal formatting #2898

eggrobin · 2022-12-16T16:17:39Z

Towards #1267.

~~Only the footgun API, the end-to-end one will come in a subsequent PR.~~
Edit: I ended up writing the format(i64) API too.

eggrobin · 2022-12-19T15:18:31Z

you can keep working in the PR

I ended up implementing formatting from i64 while writing the doc comments, because it is a lot easier to document something that exists than to awkwardly gesture at something that doesn’t.

Still needs a couple of implementation comments.

sffc

Great work; a few minor things I found, but this is landable now in experimental and you can keep working in another PR if you like.

sffc · 2022-12-19T18:42:48Z

experimental/compactdecimal/src/compactdecimal.rs

+        let (mut plural_map, mut exponent) = self.plural_map_and_exponent_for_magnitude(log10_type);
+        let mut significand = unrounded.multiplied_pow10(-i16::from(exponent));
+        if significand.nonzero_magnitude_start() == 0 {
+            significand.half_even(-1);


Thought: Even though I reviewed these functions when they went in, it's not super clear to me as a reader that this call is mutating the significand to perform rounding. 😃 We should consider in 2.0 changing these functions to be called round_half_even or something else with an active verb. CC @younies

Strongly agree; I had to reverse-engineer the semantics from the doc tests to figure out what these functions actually do (the comments are not particularly helpful).

While waiting for 2.0 I would like to improve the comments in a follow-up PR.

sffc · 2022-12-19T18:45:33Z

experimental/compactdecimal/src/compactdecimal.rs

+        let log10_type = unrounded.nonzero_magnitude_start();
+        let (mut plural_map, mut exponent) = self.plural_map_and_exponent_for_magnitude(log10_type);
+        let mut significand = unrounded.multiplied_pow10(-i16::from(exponent));
+        if significand.nonzero_magnitude_start() == 0 {


Issue: This behavior is slightly different than ICU and ECMA-402. In those implementations, we retain 2 significant digits all the way down. We should try to align.

new Intl.NumberFormat("en", { notation: "compact" }).format(1.2e-5) // '0.000012'

1.2e-5 is not a valid input here, since we take i64s.
Note that this is looking at the magnitude of the significand on its own, not the actual magnitude of the number being formatted, so this condition is « if we show one digit before the decimal point » (and then that branch does « show one more after the decimal point »).

I will add some comments that gloss the logic like that so it is clearer.

Yep, this isn't a problem since we only have integers. We shouldn't forget about it when supporting non-integers.
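
The branch under discussion can be glossed with plain integers. The following is a minimal sketch, not the ICU4X implementation: `rounded_significand` and `div_round_half_even` are hypothetical helpers, the exponent is assumed to have already been looked up from the locale data, and the case where rounding carries into a second integer digit (9,950 becoming 10.0) is deliberately left out.

```rust
/// Divides `n` by `d`, rounding to nearest with ties going to the even quotient.
/// Hypothetical helper; assumes `d > 0`.
fn div_round_half_even(n: u64, d: u64) -> u64 {
    let (q, r) = (n / d, n % d);
    if 2 * r > d || (2 * r == d && q % 2 == 1) {
        q + 1
    } else {
        q
    }
}

/// Returns `(digits, fraction_digits)`, where the rounded significand is
/// `digits / 10^fraction_digits`.
///
/// `exponent` is the power of ten that the compact pattern stands for
/// (3 for thousands, 6 for millions, ...). It is an input here; in the PR it
/// comes from the data for the magnitude of the unrounded value.
fn rounded_significand(value: u64, exponent: u32) -> (u64, u32) {
    let scale = 10u64.pow(exponent);
    // "If we show one digit before the decimal point..."
    let one_integer_digit = value / scale < 10;
    if exponent > 0 && one_integer_digit {
        // "...show one more after the decimal point": round to tenths.
        (div_round_half_even(value, scale / 10), 1)
    } else {
        // Otherwise round to an integer significand.
        (div_round_half_even(value, scale), 0)
    }
}
```

For example, `rounded_significand(1_234, 3)` yields `(12, 1)`, that is 1.2 thousand, while `rounded_significand(12_345, 3)` yields `(12, 0)`.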

sffc · 2022-12-19T18:52:00Z

experimental/compactdecimal/src/compactdecimal.rs

+        let (mut plural_map, mut exponent) = self.plural_map_and_exponent_for_magnitude(log10_type);
+        let mut significand = unrounded.multiplied_pow10(-i16::from(exponent));
+        if significand.nonzero_magnitude_start() == 0 {
+            significand.half_even(-1);


Thought: I wonder if we should make trunc the default compact decimal rounding mode in ICU4X? People might not like 999,501 of something (likes, views, emails, ...) rounded up to 1M.

I agree that this was a questionable default (YouTube uses directed rounding towards 0 everywhere for the very reason you cite).

However, I would rather we aligned with ICU4everythingelse and other existing implementations.

Aside from compatibility/consistency questions, I should note that truncating by hand is much easier (you don’t need the adjustment on round up), so I would rather let people do the easy one themselves and provide the more involved (yet also reasonable) option than the reverse.
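
To illustrate why truncation is the easy case, here is a rough sketch comparing the two modes. Everything in it is an assumption made for illustration: the `SUFFIXES` table is a hypothetical English-like stand-in for locale data, the significand is always rounded to an integer (the extra fractional digit is ignored), and none of these functions exist in ICU4X.

```rust
/// Hypothetical English-like suffixes, indexed by exponent / 3.
const SUFFIXES: [&str; 5] = ["", "K", "M", "B", "T"];

/// Largest multiple of 3 (capped at 12) such that 10^exponent <= value.
fn exponent_for(value: u64) -> u32 {
    let mut exponent = 0;
    while exponent < 12 && value >= 10u64.pow(exponent + 3) {
        exponent += 3;
    }
    exponent
}

/// Divides `n` by `d`, rounding to nearest with ties going to the even quotient.
fn div_round_half_even(n: u64, d: u64) -> u64 {
    let (q, r) = (n / d, n % d);
    if 2 * r > d || (2 * r == d && q % 2 == 1) {
        q + 1
    } else {
        q
    }
}

/// Rounds toward zero. The exponent chosen from the unrounded value stays
/// valid, so no adjustment is needed.
fn compact_trunc(value: u64) -> String {
    let exponent = exponent_for(value);
    let significand = value / 10u64.pow(exponent);
    format!("{}{}", significand, SUFFIXES[(exponent / 3) as usize])
}

/// Rounds half-even. Rounding up can carry into a fourth digit (999.501K
/// becomes 1000K), so the exponent has to be bumped and the value re-rounded.
fn compact_half_even(value: u64) -> String {
    let mut exponent = exponent_for(value);
    let mut significand = div_round_half_even(value, 10u64.pow(exponent));
    if significand >= 1000 && exponent < 12 {
        exponent += 3;
        significand = div_round_half_even(value, 10u64.pow(exponent));
    }
    format!("{}{}", significand, SUFFIXES[(exponent / 3) as usize])
}
```

With these definitions, `compact_trunc(999_501)` gives `"999K"` and `compact_half_even(999_501)` gives `"1M"`.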

sffc · 2022-12-19T18:54:39Z

experimental/compactdecimal/src/format.rs

+                    self.formatter
+                        .fixed_decimal_format
+                        .format(self.value.significand())
+                        .write_to(sink)?;


Issue: This might not work as expected with negative numbers. Please add some test cases for that (can be done after landing this PR).

The way FixedDecimalFormatter deals with signs (plus and minus) is to have 3 versions of the pattern that are selected at runtime: no sign, plus sign, and minus sign. So, we don't actually render a sign at runtime; we just select the pre-rendered pattern. I don't know if that's how we want to handle it in CDF since we have a lot more patterns and would like to avoid bloating the data.

Added some examples in the docs tests; it works as I expect at least.

French mille is messy (though it doesn’t do something truly misleading like dropping the sign, it falls to that weird one case, thus -1 millier, which may be the least broken thing it can do; getting -mille would probably be tricky, and that looks utterly weird anyway).

Yes, but try this in a locale where the sign and the compact affix are on the same side. For example:

https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-numbers-full/main/sw-KE/numbers.json

We have a pattern

"1000000-count-other": "M0",

The correct output according to CLDR would be "+M3", but your code produces "M+3"

In that same file, there is

"1000-count-one": "elfu 0;elfu -0",

which demonstrates how CLDR does actually want us to consider these as proper decimal patterns that could have different signed and unsigned versions.

(argh, too many threads.)

Yes, but there is also elfu 0 without that dance, which is therefore suspect; and the millions don’t do that but do it in other variants of Swahili, etc.

See this in the datagen:

icu4x/provider/datagen/src/transform/cldr/decimal/compact_decimal_pattern.rs

Lines 51 to 65 in 356bfa8

    // All compact decimal patterns for sw (Swahili) are split by sign;
    // the sign ends up where it would be as part of the significand, so
    // this special handling is unneeded. Depending on the region subtag,
    // the space may be breaking or nonbreaking.
    ("elfu 0;elfu -0", "elfu 0"),
    ("milioni 0;milioni -0", "milioni 0"),
    ("bilioni 0;bilioni -0", "bilioni 0"),
    ("trilioni 0;trilioni -0", "trilioni 0"),
    ("elfu\u{A0}0;elfu\u{A0}-0", "elfu\u{A0}0"),
    ("milioni\u{A0}0;milioni\u{A0}-0", "milioni\u{A0}0"),
    ("bilioni\u{A0}0;bilioni\u{A0}-0", "bilioni\u{A0}0"),
    ("trilioni\u{A0}0;trilioni\u{A0}-0", "trilioni\u{A0}0"),
    ("0M;-0M", "0M"),
    ("0B;-0B", "0B"),
    ("0T;-0T", "0B"),

I suspect we have bad data, but that having the sign next to the number is the better default, since many Swahili patterns go out of their way to stick it back there.

"+M3" is likely a reasonable and correct outcome

I am unconvinced of that, given that it abbreviates something that seems to want to be milioni +3.

Bear in mind that while compact decimal is used quite a bit more than scientific, its edge cases (including the use of negative numbers) quickly get uncharted (after all, the Romance languages are only just starting not to be a flagrantly ungrammatical mess, and that is for positive values).

We will indeed need to deal with that for currencies (along with a number of other complexities).

We have some evidence from the number range work that designers have special requirements when it comes to single-character symbols. They tend to prefer single characters to be "glued" to the number, with extra annotations (like range separators and signs) displayed outside of the single-character symbol.

It's unambiguous that the current UTS 35 requires "+M3" instead of "M+3". Perhaps that data is wrong, but even if it is, there will continually be more cases where we need to handle cases like this. Should we appeal to CLDR to get clarity?

Either way I need to open a CLDR ticket because it is clear that there is a lot of wrongness in the Swahili data, for instance
https://github.com/unicode-org/cldr-json/blob/f27f55f1dd487af7cf4260f56296ee1c7649b7fc/cldr-json/cldr-numbers-full/main/sw-KE/numbers.json#L34-L36
next to https://github.com/unicode-org/cldr-json/blob/f27f55f1dd487af7cf4260f56296ee1c7649b7fc/cldr-json/cldr-numbers-full/main/sw-KE/numbers.json#L62-L64.

If it turns out that something like sign prefix number etc. is needed for compact decimal (as opposed to compact currency) we can add a bit that says « sign goes at the beginning », but I suspect many of the patterns that do that (again, for compact decimal, not compact currency or ranges etc.) do that because CLDR hands the translators a handy footgun (which they work around as bugs come up).

> If it turns out that something like sign prefix number etc. is needed for compact decimal (as opposed to compact currency) we can add a bit that says « sign goes at the beginning »

OK. Having only two valid places to put the sign (inside or outside the affix) is probably reasonable, but it's yet another invariant that we'll need to enforce in datagen. It does make things a bit easier and smaller.
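
The two sign positions discussed in this thread can be sketched as follows. This is only an illustration under stated assumptions: `SignPlacement` and `apply_pattern` are hypothetical names, the pattern is assumed to contain a single `0` placeholder as in the CLDR examples quoted above, and nothing here reflects how the data is actually stored in ICU4X.

```rust
/// Where the sign goes relative to the compact affix (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SignPlacement {
    /// Next to the digits, as if it were part of the significand: "M+3".
    InsideAffix,
    /// At the very beginning of the pattern: "+M3".
    OutsideAffix,
}

/// Substitutes `digits` for the first `0` in `pattern` and places `sign`
/// according to `placement`. Assumes the pattern has exactly one `0`.
fn apply_pattern(pattern: &str, digits: &str, sign: &str, placement: SignPlacement) -> String {
    match placement {
        SignPlacement::InsideAffix => pattern.replacen('0', &format!("{sign}{digits}"), 1),
        SignPlacement::OutsideAffix => format!("{sign}{}", pattern.replacen('0', digits, 1)),
    }
}
```

With the sw-KE pattern `M0`, `InsideAffix` produces `M+3` (what the PR currently does) and `OutsideAffix` produces `+M3` (what the CLDR data is read as requiring); with `elfu 0` and a minus sign, `InsideAffix` produces `elfu -3`, matching the explicit negative subpattern.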

Manishearth · 2022-12-19T21:34:27Z

experimental/compactdecimal/src/format.rs

+pub struct FormattedCompactDecimal<'l> {
+    pub(crate) formatter: &'l CompactDecimalFormatter,
+    pub(crate) value: Cow<'l, CompactDecimal>,
+    pub(crate) plural_map: Option<ZeroMap2dCursor<'l, 'l, i8, Count, PatternULE>>,


thought: ought we pre-resolve the plural rule selection?

i guess it's tricky since it's not always needed

I had the same thought. There's not much use pre-resolving it since FormattedCompactDecimal is short-lived and generally intended to only be converted to a string or to parts; in fact, CompactDecimalFormatter::format is already doing more work than either FixedDecimalFormatter or DateTimeFormatter. Most of the work should live in FormattedCompactDecimal.

> CompactDecimalFormatter::format is already doing more work than either FixedDecimalFormatter or DateTimeFormatter.

The reason for that is that the error checking of the fallible one (format_compact_decimal) needs to happen early enough that we can return a useful error (rather than core::fmt::Error).
That work being done, FormattedCompactDecimal naturally takes the shape it has.

Manishearth · 2022-12-19T22:49:14Z

experimental/compactdecimal/src/format.rs

+                    .get1(&plural_category.into())
+                    .or_else(|| plural_map.get1(&Count::Other))
+            })()
+            .ok_or(core::fmt::Error)?;


note: IIFEs (immediately invoked function expressions) aren't super idiomatic in Rust, but the alternative is definitely uglier. Being able to write if foo && let ... will help in the future, though.

TIL a new acronym.

Hmm, there's actually a nicer way to do this in Rust 1.65

https://stackoverflow.com/a/66629605/1407170
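
For readers unfamiliar with the pattern, the sketch below contrasts an IIFE with one alternative available since Rust 1.65: a labeled block combined with `let ... else`. Whether this is exactly what the linked answer proposes is an assumption. The `PluralMap` alias, the string keys, and both function names are hypothetical stand-ins for the `ZeroMap2dCursor` lookup in the PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the plural map: plural category -> pattern.
type PluralMap = HashMap<&'static str, &'static str>;

/// IIFE version: the closure exists only so that `?` can be used on `Option`
/// inside a function that returns `Result`.
fn pattern_iife(map: Option<&PluralMap>, category: &str) -> Result<&'static str, std::fmt::Error> {
    (|| {
        let map = map?;
        map.get(category).or_else(|| map.get("other")).copied()
    })()
    .ok_or(std::fmt::Error)
}

/// Labeled-block version (Rust 1.65+): `break 'lookup None` plays the role
/// of the early return that `?` provided inside the closure.
fn pattern_labeled_block(
    map: Option<&PluralMap>,
    category: &str,
) -> Result<&'static str, std::fmt::Error> {
    let pattern = 'lookup: {
        let Some(map) = map else {
            break 'lookup None;
        };
        map.get(category).or_else(|| map.get("other")).copied()
    };
    pattern.ok_or(std::fmt::Error)
}
```

Both functions behave identically: they return the pattern for the requested category, fall back to `"other"`, and produce `std::fmt::Error` when there is no map or no usable entry.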

eggrobin added 30 commits December 7, 2022 17:23

maybe I should commit sometimes?

0f2b754

also the new files

1debe59

I forgot my laptop charger

f1760a1

it appears to compile

d5657bf

meow

8b0f103

plate the boiler

4d9f95f

a test that disturbingly passes

43fe38e

I should probably have pushed that

aa6dc5c

some refactoring and &&i8

507e234

logic is hard

a5b63f6

long

a0db0c9

simpler end-to-end test

12bebe4

a test

7e60fe1

more tests

0c25367

another test

5f5db46

fix some warnings

3dfd5bf

wrong warning is wrong

9537951

use

b9f1405

the licensing will continue until morale improve

9ff37a2

cargo dottore

22097a2

test all the errors

09cdad5

readthee

addb602

two

b9309e7

appease clippy

21a35cf

lints &c

7373756

cargo make testdata

1c17a74

logic remains hard

e9c7bb1

cargo make testdata again

89c14cb

feature propagation

a980ac0

I am shocked, shocked, to find that escaping errors are going on in here

5265ca9

eggrobin added 2 commits December 19, 2022 16:10

fmt

9a9b7ae

appease clippy a little bit

6ba4c27

eggrobin dismissed sffc’s stale review via 6ba4c27 December 19, 2022 15:15

eggrobin added 5 commits December 19, 2022 16:45

try to make the cargo doctor happy

ca644e1

&

1479b06

'_ nonsense

a9fafdd

covered in diplomats

1258527

invariants and correctness

dbe1258

eggrobin marked this pull request as ready for review December 19, 2022 17:57

eggrobin requested a review from sffc December 19, 2022 18:23

sffc previously approved these changes Dec 19, 2022

sffc reviewed Dec 19, 2022

sffc mentioned this pull request Dec 19, 2022

FixedDecimal mutating functions should use active verbs #2902

Closed

Manishearth previously approved these changes Dec 19, 2022

signs

6b241f2

eggrobin dismissed stale reviews from Manishearth and sffc via 6b241f2 December 19, 2022 23:44

comments

da5ec50

eggrobin requested review from sffc and Manishearth December 19, 2022 23:50

eggrobin and others added 2 commits December 20, 2022 01:22

Update experimental/compactdecimal/src/compactdecimal.rs

e64818c

Co-authored-by: Shane F. Carr <shane@unicode.org>

oops, missed some comments

c5184c9

sffc reviewed Dec 20, 2022

eggrobin added 2 commits December 20, 2022 01:50

renaming is hard

ab62c18

more renaming

287d8f7

sffc approved these changes Dec 20, 2022

eggrobin merged commit a61a16a into unicode-org:main Dec 20, 2022

robertbastian mentioned this pull request Jan 23, 2023

1.1.0 #3018

Merged

eggrobin mentioned this pull request Oct 7, 2023

Split the UCA checks into their own job, check UCD consistency unicode-org/unicodetools#562

Merged
