Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Improve accuracy when casting floating-point numbers to decimal #11079

Closed
ttnghia opened this issue Jun 8, 2022 · 5 comments
Closed

[FEA] Improve accuracy when casting floating-point numbers to decimal #11079

ttnghia opened this issue Jun 8, 2022 · 5 comments
Labels
feature request New feature or request

Comments

@ttnghia
Copy link
Contributor

ttnghia commented Jun 8, 2022

Currently, decimals constructed from floating-point numbers are done by:

  • Shifting (multiple/dividing by scale value)
  • Rounding the result toward zero to an integer for internal storage

Ref:

CUDF_HOST_DEVICE inline explicit fixed_point(T const& value, scale_type const& scale)

As such, a number like 0.28 and scale=-1 can be shifted and rounded to 2 and stored as 2! Of course the more accurate rounded number should be 3. We should do a better rounding for this process.

Such rounding was causing issues for the users. In particular, in spark-rapids: NVIDIA/spark-rapids#5765

@jrhemstad
Copy link
Contributor

Of course the more accurate rounded number should be 3.

Why?

float -> integer rounding in C++ always rounds towards zero.

https://godbolt.org/z/4aPMj5zzd

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 8, 2022

That is rounding. We are casting instead, rounding is just the way we use to do casting.

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 8, 2022

Oh here we are casting from floating-point to fixed-point, thus the casting to use should be float-to-point rounding then casting to int, not float-to-int direct casting.

float-to-float rounding is round-to-nearest by C++ specs.

@codereport
Copy link
Contributor

codereport commented Jun 8, 2022

(reposting this message from slack)

There is a long history behind this and truncation is definitely the desired and intentional behaviour.

Here are the main motivations for truncation:

  1. Consistency with CNL. CNL is being proposed for the standard library and consistency with standard types is a goal of libcudf. Note that (IIRC) every other C++ fixed-point library has made the same choice.
  2. Relying on the behaviour of integer types in C++ greatly simplifies certain binary operations like divison for fixed-point
  3. We can (and have) provided rounding functionality through the cudf::round API so on the Spark side of things you are able to get whatever behaviour you want

Note that at one point, we actually did have rounding functionality baked into floating point construction for fixed-point. However, it wasn't actually to address the desire to be more "accurate" as @ttnghia is pointing out, it was to try and address inherent issues in floating point (such as 1.001 not being representable by floating point, meaning if you construct a fixed-point with scale -3, you end up with 1 , missing the .001). @harrism and I had many discussions about this and both agreed that "ideally" fixed-point should be able to avoid this at the end of the day. However, it ended up presenting too many issues / complications and not actually comprehensively fixing all cases. And note, trying to be "better" was not consistent with what CNL and other C++ fixed-point libraries do. So we decided at a certain point to just accept the inherent flaws of floating point and follow CNL. This was done in the following small PR: #6544 If you take a look at the unit tests, it is very clear how the behaviour changed by looking at the before and after. Furthermore, there was a bunch of "opposition research" done at one point in to what other C++ fixed-point libraries do. Our goal is not to be similar to Julia or Go or Python, but to be a generic C++ backend that provides all the tools to do whatever you want in the front end language (like cuDF or Spark-RAPIDS).

@ttnghia
Copy link
Contributor Author

ttnghia commented Jun 8, 2022

Thanks @codereport. That explains to me why we have such choice and I'm convinced with it. So I'm going to close the issue.

@ttnghia ttnghia closed this as completed Jun 8, 2022
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants