NumberFormatter & co. scope (unified vs modular) #275

Open
zbraniecki opened this issue Sep 27, 2020 · 5 comments
Labels
A-data Area: Data coverage or quality A-design Area: Architecture or design A-performance Area: Performance (CPU, Memory) A-scope Area: Project scope, feature coverage C-numbers Component: Numbers, units, currencies help wanted Issue needs an assignee S-epic Size: Major project (create smaller child issues) T-core Type: Required functionality

Comments

@zbraniecki
Member

In ICU (and ECMA402) NumberFormat becomes the jack of all trades with formatting for numbers, currencies, measuring units, and so on. There's even a drive from Shane to incorporate Pluralization as a feature of a NumberFormat.

Shane justified it by saying that all number formatters take similar options and similar operations to tailor the data.

The cost of this approach is that it becomes trickier to modularize such crates. NumberFormat becomes a pretty large codebase at a very fundamental level, required by basically everything, and fragile dead code elimination (DCE) is the only hope of keeping the overhead low.

It is my impression that in the ICU4X context we can bring that modularity back, and I believe Shane is intending to separate the part that operates on the number (like rounding, tailoring etc.) into FixedDecimal and similar helper structs and effectively removing the benefit of clustering all types of numerical operations in a single formatter.
@sffc can you share your take on this if I misunderstood you?

If that hypothesis is correct, we have a way to split modular and lean CurrencyFormatter, MeasureUnitFormatter, RuleBasedNumberFormatter, RelativeTimeFormatter, DurationFormatter, and so on into their own components, keeping each one small without paying an overhead when all of them are in use.

This issue has been filed to discuss that and to verify that we're all on the same page about how we want to tackle the topic.

@zbraniecki zbraniecki added A-performance Area: Performance (CPU, Memory) C-data-infra Component: provider, datagen, fallback, adapters A-design Area: Architecture or design A-scope Area: Project scope, feature coverage discuss Discuss at a future ICU4X-SC meeting C-numbers Component: Numbers, units, currencies labels Sep 27, 2020
@sffc
Member

sffc commented Sep 27, 2020

I am of the opinion that crates are not the most effective way to go about modularization. I have written in wrapper-layer.md that I believe we can use dead code elimination to achieve modularization in a much more effective way.

@sffc
Member

sffc commented Sep 27, 2020

> I believe Shane is intending to separate the part that operates on the number (like rounding, tailoring etc.) into FixedDecimal and similar helper structs and effectively removing the benefit of clustering all types of numerical operations in a single formatter.
> @sffc can you share your take on this if I misunderstood you?

FixedDecimal is intended as a type that preserves leading and trailing zeros on input and output of NumberFormat, which is an important feature we largely lack in 402. Rounding operations cannot be split from NumberFormat because rounding depends on locale data for currencies, compact decimals, and measurement units.
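
To make the zero-preservation point concrete, here is a minimal sketch in Rust. `SketchDecimal` and its methods are hypothetical names invented for illustration; they are not the real FixedDecimal API. The only idea shown is that digits are stored as written, so "1.50" and "1.5" remain distinct values.

```rust
/// Hypothetical stand-in for a FixedDecimal-like type (NOT the real ICU4X API).
/// Digits are stored exactly as written, so "1.50" and "1.5" stay distinct.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SketchDecimal {
    negative: bool,
    integer_digits: Vec<u8>,  // most significant first, each in 0..=9
    fraction_digits: Vec<u8>, // most significant first, each in 0..=9
}

impl SketchDecimal {
    /// Parses an ASCII decimal string such as "-007.50" without normalizing it.
    /// Returns None on any non-digit character or an empty integer part.
    pub fn parse(input: &str) -> Option<Self> {
        let (negative, rest) = match input.strip_prefix('-') {
            Some(rest) => (true, rest),
            None => (false, input),
        };
        let (int_part, frac_part) = match rest.split_once('.') {
            Some((i, f)) => (i, f),
            None => (rest, ""),
        };
        if int_part.is_empty() {
            return None;
        }
        let to_digits = |s: &str| -> Option<Vec<u8>> {
            s.bytes()
                .map(|b| if b.is_ascii_digit() { Some(b - b'0') } else { None })
                .collect()
        };
        Some(Self {
            negative,
            integer_digits: to_digits(int_part)?,
            fraction_digits: to_digits(frac_part)?,
        })
    }

    /// Writes the digits back out unchanged, leading and trailing zeros included.
    pub fn to_ascii(&self) -> String {
        let mut out = String::new();
        if self.negative {
            out.push('-');
        }
        for d in &self.integer_digits {
            out.push((b'0' + d) as char);
        }
        if !self.fraction_digits.is_empty() {
            out.push('.');
            for d in &self.fraction_digits {
                out.push((b'0' + d) as char);
            }
        }
        out
    }
}
```

With this sketch, `SketchDecimal::parse("1.50")` and `SketchDecimal::parse("1.5")` compare unequal, which is the property a float-based input cannot express.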

@sffc sffc added this to the ICU4X 0.2 milestone Oct 22, 2020
@sffc sffc changed the title from "NumberFormatter & co. scope (monolithic vs modular)" to "NumberFormatter & co. scope (unified vs modular)" Dec 4, 2020
@sffc
Member

sffc commented Dec 4, 2020

2020-12-04 discussion:

  • We all agree that the FFI (logical) API layer should be as modular as possible. So this question is only about the ergonomic layer.
  • @mihnita and @sffc agree that it is useful to have a super-lightweight class that maps from FixedDecimal to localized string, with no currencies, compact notation, measurement units, etc. This would be used in DateTimeFormat to reduce the dependency weight.
  • @sffc points out that there is no clean way to perform inheritance between the different types of higher-level formatters. For example, CurrencyFormatter is one code path, CompactFormatter is another, and CompactCurrencyFormatter is yet a third. The three do not encapsulate each other.
  • @nciric points out that we can start with the two ergonomic classes (simple and kitchen sink), and create modular classes as needed to help platforms perform dead code elimination
  • @zbraniecki agrees; we can extract the second tier of formatters when needed
  • @sffc suggested that RBNF could be on the level of the lightweight FixedDecimal formatter, such that the kitchen sink class can use either decimal format or RBNF

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Dec 4, 2020
@sffc
Member

sffc commented Dec 4, 2020

More specifically, here is how I see the breakdown of features going into FixedDecimalFormat (lower level) versus KitchenSinkNumberFormat (higher level):

FixedDecimalFormat

What: Pass-through formatter for FixedDecimal, applying localized symbols but no arithmetic.

Features:

  • Localized decimal digits
  • Grouping separators
  • Sign display*
  • Scientific notation**

* Sign display is slightly more complex, due to the requirement that we add affixes to the number. It may be slightly smaller if FixedDecimalFormat were "positive only", not capable of outputting a sign.

** Depends on the chosen design of #228
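
As a rough illustration of what "pass-through" could mean, the sketch below maps an already-final ASCII digit string to localized symbols without doing any arithmetic. `SketchSymbols` and `format_pass_through` are hypothetical names, the symbol values would come from locale data in practice, and sign display and scientific notation are deliberately left out.

```rust
/// Hypothetical locale symbols; in a real formatter these would be loaded
/// from locale data rather than constructed by hand.
pub struct SketchSymbols {
    pub digits: [char; 10],
    pub grouping_separator: char,
    pub decimal_separator: char,
    /// Number of integer digits per group; 0 disables grouping.
    pub grouping_size: usize,
}

/// Maps an unsigned ASCII decimal string (e.g. "1234567.50") to localized
/// symbols. No rounding or other arithmetic is performed: every input digit
/// appears in the output. Returns None if a non-digit character is found.
pub fn format_pass_through(ascii: &str, symbols: &SketchSymbols) -> Option<String> {
    let (int_part, frac_part) = match ascii.split_once('.') {
        Some((i, f)) => (i, Some(f)),
        None => (ascii, None),
    };
    // Translate one ASCII digit into the locale's digit character.
    let map_digit = |c: char| c.to_digit(10).map(|d| symbols.digits[d as usize]);

    let mut out = String::new();
    let int_len = int_part.chars().count();
    for (idx, c) in int_part.chars().enumerate() {
        let remaining = int_len - idx;
        // Insert a separator before each full group, except at the very start.
        if idx > 0 && symbols.grouping_size > 0 && remaining % symbols.grouping_size == 0 {
            out.push(symbols.grouping_separator);
        }
        out.push(map_digit(c)?);
    }
    if let Some(frac) = frac_part {
        out.push(symbols.decimal_separator);
        for c in frac.chars() {
            out.push(map_digit(c)?);
        }
    }
    Some(out)
}
```

The function needs only a small table of symbols, which is the kind of footprint that would make it cheap for something like DateTimeFormat to depend on.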

KitchenSinkNumberFormat

What: A larger, data-driven formatter supporting a larger set of UTS 35 features.

Features:

  • Currency*
  • Measurement units
  • Compact notation
  • Percentages

* "Currency" encompasses currency spacing rules, currency rounding, symbol resolution, etc.

Note on Rounding

Rounding is a big chunk of the logic in ICU NumberFormatter. Unfortunately, it needs to be coupled with at least KitchenSinkNumberFormat, because selecting a compact form and applying a currency both require rounding the number based on locale data.
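
A small sketch of why rounding is tied to data: the number of fraction digits to round to depends on the currency. The function names below are hypothetical, the lookup table is a hard-coded stand-in for real currency data, and rounding half away from zero is an assumption made only to keep the example short.

```rust
/// Stand-in for a data lookup: fraction digits used when rounding a currency.
/// A real implementation would read this from currency data; the three
/// entries here are hard-coded for illustration only.
pub fn currency_fraction_digits(code: &str) -> u32 {
    match code {
        "JPY" => 0,
        "KWD" => 3,
        _ => 2, // assumed default for this sketch
    }
}

/// Rounds the value `unscaled * 10^-scale` to `target` fraction digits and
/// returns the new unscaled integer. Ties round away from zero (an assumed
/// rounding mode, chosen for brevity).
pub fn round_to_scale(unscaled: i64, scale: u32, target: u32) -> i64 {
    if target >= scale {
        // No digits are dropped; pad with zeros instead.
        return unscaled * 10_i64.pow(target - scale);
    }
    let divisor = 10_i64.pow(scale - target);
    let quotient = unscaled / divisor;
    let remainder = unscaled % divisor;
    if remainder.abs() * 2 >= divisor {
        quotient + unscaled.signum()
    } else {
        quotient
    }
}

/// Rounds an amount for a given currency, returning (unscaled, scale).
/// The rounding step cannot run until the currency data has been consulted.
pub fn round_for_currency(unscaled: i64, scale: u32, code: &str) -> (i64, u32) {
    let target = currency_fraction_digits(code);
    (round_to_scale(unscaled, scale, target), target)
}
```

For the same input 1234.567 (unscaled 1234567, scale 3), this sketch yields 1234.57 for "USD" and 1235 for "JPY", so the rounding result cannot be computed before the formatter knows the unit.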

@sffc sffc self-assigned this Dec 10, 2020
@sffc sffc added the T-core Type: Required functionality label Jan 21, 2021
@sffc sffc modified the milestones: ICU4X 0.2, 2021-Q2-m1 Mar 25, 2021
@sffc sffc added the S-epic Size: Major project (create smaller child issues) label Apr 3, 2021
@sffc sffc modified the milestones: 2021-Q2-m1, ICU4X 0.4 Apr 19, 2021
@sffc sffc added A-data Area: Data coverage or quality and removed C-data-infra Component: provider, datagen, fallback, adapters labels Jun 16, 2021
@sffc sffc removed this from the ICU4X 0.4 milestone Jul 21, 2021
@sffc sffc added backlog help wanted Issue needs an assignee labels Jul 21, 2021
@sffc
Member

sffc commented Dec 22, 2021

I filed #1441 to track currency formatting.

In terms of class structure / modularity: there are 2 main dimensions:

  1. Notation
    • Decimal
    • Compact Decimal
    • Scientific
    • Spellout (RBNF) -- not yet supported in ECMA-402, but we want to get here
  2. Unit
    • No Unit
    • Currency
    • Percentage
    • Measurement

These are the two main dimensions we need to solve. The challenge is that these two dimensions can be combined freely, and when doing so, we may need to load different data or use different code paths.

For example:

| Unit \ Notation | Decimal | Compact | Scientific | Spellout |
|---|---|---|---|---|
| None | 1000 | 1K | 1E3 | one thousand |
| Currency | $1000.00 | $1K | $1.00E3 | one thousand dollars |
| Percent | 1000% | 1K% | 1E3% | one thousand percent |
| Measure | 1000 m | 1K m | 1E3 m | one thousand meters |

Within each box, there may be multiple display options as well, most often long/short/narrow.

Clearly there are some formats in this table that make more sense than others. But, we need to think about how to scale up to support this grid.
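
One way to picture the grid in code is as two enums whose combination decides which data is needed. This is only a sketch: every type and variant name is hypothetical, and the data categories are invented for illustration rather than taken from any ICU4X data key. The special case for compact currency is an assumption meant to show how a single cell can require something that neither dimension implies on its own.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Notation {
    Decimal,
    Compact,
    Scientific,
    Spellout,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Unit {
    None,
    Currency,
    Percent,
    Measure,
}

/// Hypothetical categories of locale data a grid cell might load.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataNeed {
    DecimalSymbols,
    CompactPatterns,
    CompactCurrencyPatterns,
    SpelloutRules,
    CurrencyData,
    PercentPatterns,
    UnitPatterns,
}

/// Returns the data a (notation, unit) cell would need in this sketch.
pub fn data_needs(notation: Notation, unit: Unit) -> Vec<DataNeed> {
    use DataNeed::*;
    // The notation mostly decides the base data, but the compact + currency
    // cell is assumed to need its own patterns instead of the generic ones.
    let mut needs = match (notation, unit) {
        (Notation::Compact, Unit::Currency) => vec![DecimalSymbols, CompactCurrencyPatterns],
        (Notation::Compact, _) => vec![DecimalSymbols, CompactPatterns],
        (Notation::Spellout, _) => vec![SpelloutRules],
        (Notation::Decimal | Notation::Scientific, _) => vec![DecimalSymbols],
    };
    // The unit then adds its own data on top.
    match unit {
        Unit::None => {}
        Unit::Currency => needs.push(CurrencyData),
        Unit::Percent => needs.push(PercentPatterns),
        Unit::Measure => needs.push(UnitPatterns),
    }
    needs
}
```

The first match arm is the interesting part: because one cell breaks the otherwise additive pattern, the grid cannot be covered by simply stacking one formatter per row on top of one formatter per column.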

CC @robertbastian
