-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reflect the new paper on Dragonbox #2750
Conversation
- Change constants appearing in log & division computations - Rename beta_minus_1 to beta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the PR! Looks great, just a few minor comments inline.
include/fmt/format-inl.h
Outdated
@@ -819,6 +817,16 @@ struct uint128_wrapper { | |||
} | |||
}; | |||
|
|||
// Compilers should be able to optimize this into the ror instruction. | |||
inline std::uint32_t rotr(uint32_t n, uint32_t r) noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I suggest dropping std::
in the return type for consistency (and the same in the overload below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason this is not constexpr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vitaut Sorry, I think I was in rush for some reason 😅 corrected now.
@miscco Added, thanks for pointing out.
EDIT: @miscco actually, I reverted the change to make the code C++11-compatible. There should be ways to workaround that of course, but I don't think doing so is a good idea because that will likely to make compilers more unlikely to recognize the pattern and reduce it down to ror
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jk-jeon
FMT_CONSTEXPR
- macro for C++14 constexpr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@phprus oooohhh didn't know that, thanks for letting me know=)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, I thought this should have been possible with C++11. Thanks @phprus for pointing out the workaround
include/fmt/format-inl.h
Outdated
@@ -895,86 +903,72 @@ inline uint64_t umul96_lower64(uint32_t x, uint64_t y) noexcept { | |||
// Computes floor(log10(pow(2, e))) for e in [-1700, 1700] using the method from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[-1700, 1700] -> [-2620, 2620]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected.
include/fmt/format-inl.h
Outdated
@@ -895,86 +903,72 @@ inline uint64_t umul96_lower64(uint32_t x, uint64_t y) noexcept { | |||
// Computes floor(log10(pow(2, e))) for e in [-1700, 1700] using the method from | |||
// https://fmt.dev/papers/Grisu-Exact.pdf#page=5, section 3.4. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we should link to the Dragonbox paper now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I linked to your copy of the paper (fmt.dev/papers/Dragonbox.pdf) assuming you will replace the copy in a near future.
include/fmt/format-inl.h
Outdated
if (q <= std::numeric_limits<uint32_t>::max() / 100) { | ||
n = q; | ||
s += 2; | ||
} else { | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use early exit, i.e.
if (q > std::numeric_limits<uint32_t>::max() / 100) break;
n = q;
s += 2;
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion, modified the code according to it.
Merged, thank you! |
@jk-jeon, have you thought about handling fixed precision in Dragonbox? Is it even possible, at least for small precision? |
I haven't, at least recently.
Might be, but I'm not sure. I don't even know how do you do that with Grisu 😋. |
Okay, so I did some thought experiment meanwhile, and I think something like that should be possible for small precision. It seems quite simple actually. Just find an appropriate exponent k so that when we multiply 10^k to our floating-point number the integer part of the resulting number is representable in uint32_t or uint64_t. Find the minimum possible integer part of the result, then we get a lower bound on how many decimal digits we can precisely compute. To make this lower bound not too small, we may need to normalize subnormal numbers. To compute the integer part of the result of the multiplication, we do what we have been doing in Dragonbox: multiply with the cached 10^k. And then measure the decimal length of the resulting number and cut as many digits as needed and perform proper rounding. In order to do that we may need to know if our number is an integer or not, which we can already figure out from the computation of the integer part (which is what the previous PR was mainly about). We may need to rerun the correctness analysis, but I'm pretty sure it will be alright. Once this is materialized (which I have no plan for doing right now, but I can assist/discuss things with anyone who wants to try it or something similar), I guess it even should not be really called Dragonbox anymore, as it doesn't do anything similar. Except for the multiplication with 10^k, but that's basically something that all the modern floating-point parsing/formatting algorithms developed recently are doing. |
Hi Victor, I'm recently giving some shots on fixed-precision formatting in my spare time. A good news is that I think it is possible to come up with a configurable method of trading between the cache table size and performance. What do you think is the maximum size of the table (which will be there in addition to the Dragonbox table) which |
Ideally the smaller the better but I think we can live with a few more kiB of data. There is no exact budget. |
So I've been thinking about what can I do with the fixed-precision case. Here is what I thought. The algorithm I recently advertised is composed of two parts, one for the first few digits (covered by the Dragonbox table) and another for further digits (covered by an additional table). I think it is probably too early to adopt the second part, but meanwhile we can discuss the adoption of the first part. (The "garbage digits" case is less important anyway.) Assuming I don't know if this adoption will make the performance better or worse though. Grisu uses fewer precision so it might be faster. But we do not need to rely on Dragon4 fallback anymore if the number of digits is at most 18. Also we can get rid of the Grisu table. What do you think? Thanks. |
This sounds like a good plan. The performance will be more predictable without the fallback for the common case and we could get rid of Grisu tables which is an improvement. A regression for the non-fallback case is OK if it's not too big. |
The new paper is here: https://github.com/jk-jeon/dragonbox/blob/master/other_files/Dragonbox.pdf
This PR is mostly the reflection of the new paper that has not been done in the previous PR.
ctz
and friends, and instead introducedrotr
. I think Dragonbox is the only one who usesctz
in fmt, so probably you can removectz
and friends from fmt now. Also, interestingly simple loop seems to outperform binary search. I think this makes a lot of sense for binary64 because 64-bit constants in general must be first loaded into register and cannot be used as immediates, but it is quite surprising that even for binary32 still loop is better. Looking at the assembly, it seems that the reason is because for the binary search, the compiler is trying too hard to get rid of branches and ends up generating more instructions. However, it should be noted that nowremove_trailing_zeros
traps into infinite loop if the input is0
. I believe0
cannot be fed intoremove_trailing_zeros
in the current implementation, but you should make sure nobody feeds0
into Dragonbox.