Consider using csFastFloat for faster FP parsing #48646
Comments
See Daniel Lemire's (@lemire) blog post: https://lemire.me/blog/2021/02/22/parsing-floating-point-numbers-really-fast-in-c - a super fast floating-point parsing algorithm which is up to 6 times faster than the BCL one for an average case (for invariant culture).

The algorithm: https://arxiv.org/abs/2101.11408 (and https://nigeltao.github.io/blog/2020/eisel-lemire.html)

Its C# port (by @CarlVerret): https://github.com/CarlVerret/csFastFloat

The current implementation supports "scientific", "fixed" and "general" formats, and a custom decimal separator (single `char`) - what else is needed for a proper integration into dotnet/runtime? The implementation takes roughly 25Kb (~10kb of it is spent here https://github.com/CarlVerret/csFastFloat/blob/master/csFastFloat/Constants/Constants.cs#L22 - can be converted to a `ROS<byte>`?) of IL byte code.

Go switched to that algorithm in 1.16 (see `strconv`: https://golang.org/doc/go1.16#strconv).

/cc @tannergooding
25kb is a lot of increase. That's basically a full 1% of the raw IL and 0.25% of the crossgen'd image.

Likewise, it looks like this algorithm only works for inputs up to 19 significant digits, whereas we must be able to handle inputs up to …

It would be interesting to run this against our full test suite to see if any failures crop up. Likewise, our current algorithm does try to optimize for the more common cases, and after filling the …
I think that this is entirely correct. This is how it was done in Go: they did not remove their existing code, they just inserted the fast path.

One way to understand the approach taken in csFastFloat is that we insert a fast path between Clinger's fast path (which you already use) and the full approach you rely upon. The trick is that this "second fast path" catches most of the "common inputs", so that the slower path is rarely needed.

It is more useful than just for the 19-digit case, however. Even when the input exceeds 19 digits, you can most often still use the fast path, because 19 digits is often all you need to determine the number (you can often truncate). Please see Section 11, "Processing long numbers quickly", in the manuscript: https://arxiv.org/pdf/2101.11408.pdf Note that this is what the Go runtime does. Otherwise, you must fall through to some other solution, such as your own current code. csFastFloat currently has its own fallback, but it is likely that the current code in the runtime would be superior.

In practice, while users can provide hundreds of digits, you should expect many (most?) use cases to use no more than 17 digits. There is no benefit to serializing double or float values to more than 17 digits if you want to later deserialize them back to float/double. You still need to handle the case with hundreds of digits, but it should not be expected to be common.

Of course, relying on a table has a cost (storage). In Go, this was discussed and they decided that it was a good trade-off.
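The layering described above (Clinger's fast path, then the Eisel-Lemire fast path, then a slow fallback) can be sketched as follows. This is an illustrative Python sketch, not csFastFloat's actual code: the function names are invented, the Eisel-Lemire step itself is elided, and the slow fallback simply defers to Python's own (correctly rounded) parser.

```python
# Illustrative sketch of the layered strategy described above: try
# Clinger's fast path, then the Eisel-Lemire fast path, and only then
# fall back to a slow arbitrary-precision parse. Names are hypothetical.

def clinger_fast_path(mantissa: int, exp10: int):
    """Clinger's fast path: exact when the decimal mantissa and the
    power of ten are both exactly representable as doubles."""
    if mantissa <= 2**53 and -22 <= exp10 <= 22:
        power = float(10**abs(exp10))  # 10^0 .. 10^22 are exact doubles
        return mantissa * power if exp10 >= 0 else mantissa / power
    return None  # not applicable; try the next layer

def parse_decimal(mantissa: int, exp10: int) -> float:
    value = clinger_fast_path(mantissa, exp10)
    if value is None:
        # The second fast path (Eisel-Lemire) would go here: one
        # 64x64 -> 128-bit multiply against the powers-of-five table
        # resolves most remaining inputs. Elided in this sketch; we
        # defer straight to the slow fallback (Python's own parser).
        value = float(f"{mantissa}e{exp10}")
    return value
```

The point of the layering is that the final path is almost never taken: Clinger's path covers small, common inputs exactly, and the Eisel-Lemire path catches nearly everything else.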
The algorithm itself should be sound. It has been in production for quite some time without issue, and there is a complete mathematical derivation on arXiv. The csFastFloat library itself could use more thorough testing... Note, however, that we do have a rather good testing suite. A couple of minor issues were caught in the last few hours, but nothing that has to do with the algorithm or core processing.
Thanks for the reference. I'll give it an additional look through when I have some free time this afternoon. It might be interesting to see if we can compress the table size somehow (many of the powers used in these cases are powers of 10 and have many trailing zeros, for example; although this table looks to be powers of 5 instead).
Yes. They are powers of 5 (for positive exponents) or reciprocals of powers of 5 (for negative exponents). We have that 10^50 = 2^50 * 5^50... so there is no need to represent 2^50 explicitly. There are various ways one could trim that table, but none that I have found appealing so far. One would be to restrict the application of the fast path to some powers only, instead of the full range. The rationale there might be that 1e200 and 1e-200 are relatively uncommon. This is not at all unreasonable, but it leaves you to decide where to set the threshold.
Note that the table is 10kB, not 25kB. Roughly: the exponents go from -300 to 300, so about 600 exponents. You have 600 x 128 bits, which is 9.6kB. The 25kB is what you would need to store the full table, which we do not do. The manuscript addresses this issue...

For comparison, on my current laptop...

Sorry if I was slow to react... it took me hours to think "25kB??? that's not right". I need stronger coffee. (Disclosure: I had a version with a 25kB table for a time, but I decided that it was too wasteful.)
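The size arithmetic above can be made concrete by generating such a table. The sketch below is illustrative, not csFastFloat's actual code: the exponent range (-300..300) follows the comment above, and real implementations use slightly wider, carefully specified bounds and rounding conventions.

```python
# Sketch of generating the 128-bit powers-of-five table (and checking
# the ~9.6 kB figure above). Range and rounding are illustrative; the
# published implementations use slightly wider bounds.

def pow5_top128(q: int) -> int:
    """Most significant 128 bits (top bit set) of 5^q for q >= 0, or of
    1/5^(-q) for q < 0 (truncated / rounded up, one common convention)."""
    if q >= 0:
        v = 5**q
        shift = v.bit_length() - 128
        return v >> shift if shift >= 0 else v << -shift
    d = 5**(-q)
    shift = d.bit_length() + 127
    return -((-(1 << shift)) // d)  # ceil(2^shift / 5^(-q)), 128 bits

table = {q: pow5_top128(q) for q in range(-300, 301)}
# 601 entries x 16 bytes per 128-bit value: about 9.6 kB.
size_bytes = len(table) * 16
```

Generating the entries on demand like this is also roughly what "computing the tables at runtime" would look like, at the cost of big-integer arithmetic during startup.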
25kb is the total size of the lib (code + that 10kb table), so it's a rough estimate of the size overhead for the corelib, but I assume we can save a bit more.
That is the size of the SDK; typical .NET apps are self-contained nowadays, and e.g. for WebAssembly purposes every 10kb counts. But I assume we can have two implementations, OptForSize (WASM, mobile) and OptForSpeed (desktop, server) - we already do that for some other algorithms. I'd love to integrate this algorithm and run some JSON-parsing benchmarks.
Most of the code in csFastFloat is for the fallback scenario, which you would not need. It should be really tight once integrated into the runtime. I expect that 25kB is pessimistic.
Ah, makes sense!
@EgorBo Furthermore, the same data can be amortized by reusing it for UTF-8 processing as well: CarlVerret/csFastFloat#49
The approach in question was merged into Rust ("Update Rust Float-Parsing Algorithms to use ...") for an upcoming release. The bloat analysis was favorable.
This has been my experience as well in writing my own implementations: the two algorithms (Bellerophon and Lemire) are complementary, and make a big-integer algorithm necessary only in extremely rare cases.

A good battery of tests should catch any correctness issues. I've used 3 main test suites extensively in my own implementations, and if an implementation passes all 3, the chance of any major issue should be practically non-existent.

1. Random test generation: this approach generates a battery of random numbers (millions) that are designed to trigger common failures in parsing. Some of these include trailing zeros after significant digits, extremely large numbers of digits, halfway cases, near-infinite values, and subnormal floats. It should be trivial to port these tests (which are only a few lines of code each) to dotnet.

2. The test cases Golang and Wuffs use: a massive collection of curated cases that have been known to cause failures, aggregated from numerous different sources (such as freetype, Wuffs, double-conversion, and more).

3. A very compact but carefully curated collection of test cases designed as a smoke test for common failures. These are all examples of floats known to fail in previous implementations, so it's a very good starting ground.

If either 1) or 2) passes, any implementation should be viewed as sound (as long as it rejects any language-specific syntax requirements). If I can be of help in any way in designing test cases, I'd be more than happy to contribute.
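The failure categories listed above can be illustrated with concrete inputs. The specific values below are well-known published hard cases, not drawn from the suites named above; CPython's `float()` is correctly rounded, so it serves as the reference oracle in this sketch.

```python
# A few classically hard inputs of the kinds described above (halfway
# cases, trailing zeros, subnormals, overflow). These are well-known
# published hard cases, offered only as illustrations of what the
# curated suites exercise; CPython's float() is correctly rounded.
import struct

def double_bits(value: float) -> int:
    """Raw IEEE-754 bit pattern of a double."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]

# Halfway case: 2^53 + 1 lies exactly between two doubles;
# ties-to-even must round down to 2^53.
assert float("9007199254740993") == 9007199254740992.0

# Trailing zeros after the significant digits must not change the value.
assert float("100000000000000000000e-20") == 1.0

# The infamous near-subnormal value that hung some PHP and Java
# parsers; it must round to the largest subnormal double.
assert double_bits(float("2.2250738585072011e-308")) == 0x000FFFFFFFFFFFFF

# Overflow must saturate to infinity, not error or wrap.
assert float("1e999") == float("inf")
```

A correct parser must agree bit-for-bit with the correctly rounded result on every such input, which is why comparing raw bit patterns (rather than values through an epsilon) is the right check.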
We just published V4.0 of our parsing library. This version brings a considerable performance boost, as we now make use of SIMD functionality to speed up the parsing process. In some cases, csFastFloat is up to 9 times faster than double.Parse.

From the first version, we asserted that Nigel Tao's test files were successfully parsed. In V4, we also added strtod_tests and parsed it without any issue (by the way, it looks like there are many cases in strtod_tests that cannot be parsed by the standard library - double.TryParse just returns false). I'll take a look at your Rust random-number tests, although I am not really familiar with Rust.
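csFastFloat V4 uses .NET SIMD intrinsics, which a Python sketch cannot reproduce; a closely related scalar trick from the C/C++ fast_float code gives the flavor of consuming many digits per step: parsing eight ASCII digits at once with three multiplies on a 64-bit word (SWAR). The version below only emulates that trick (masking to 64 bits where C relies on `uint64_t` wraparound); it is not csFastFloat's actual code.

```python
# SWAR trick from the C/C++ fast_float code: parse 8 ASCII digits with
# three multiplies on a 64-bit word. This Python version emulates the
# idea (MASK64 stands in for C's uint64_t wraparound); csFastFloat V4
# itself uses .NET SIMD intrinsics, not this exact code.

MASK64 = (1 << 64) - 1

def parse_eight_digits(chunk: bytes) -> int:
    """Parse exactly eight ASCII digits into an integer."""
    assert len(chunk) == 8 and chunk.isdigit()
    val = int.from_bytes(chunk, "little")  # first digit in the low byte
    # Combine adjacent digits into 2-digit, then 4-digit, then 8-digit
    # values: 2561 = 10*2^8 + 1, 6553601 = 100*2^16 + 1,
    # 42949672960001 = 10000*2^32 + 1.
    val = ((val & 0x0F0F0F0F0F0F0F0F) * 2561) >> 8
    val = ((val & 0x00FF00FF00FF00FF) * 6553601) >> 16
    return (((val & 0x0000FFFF0000FFFF) * 42949672960001) & MASK64) >> 32
```

Each multiply folds neighboring digit groups together in parallel within one machine word, which is why the scalar loop over digits disappears.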
@CarlVerret Thanks for the update. Do you have more numbers on the performance of …
Do you have any data showing how the algorithm scales for various lengths of inputs? The key considerations would be: …

It would likewise be good to see some basic numbers covering other interesting values, such as: …

Most important is the size impact. A 10kb table is still quite a lot and will impact trimming, startup time, and a few other considerations. Perf numbers showing how this would impact scenarios like ML.NET, WPF, or other workloads commonly working with floating-point string data would help show that the size increase is worth taking on.
I'll be back with more info on that. Actually, we only parse float and double. I guess parsing Half could be done the same way.
The most interesting thing is likely that this algorithm cannot be used in Blazor WASM; I expect the size regression (regardless of perf) would be considered too big (CC @jkotas). So we'd need to ensure that it could correctly be disabled or excluded in that scenario. Given that the algorithm still needs to fall back in some cases, that doesn't seem like it would be super difficult to make happen; it's just an additional consideration that would be required here.
Note that the approach has recently landed in LLVM: |
Don't worry, I'll port the logic to C# if there's interest; no need to ask the maintainers for additional work. But if it passes Nigel Tao's tests, it should be fine unless it's failing on very unusual corner cases.
Thanks, yes it does, and it also passes the strtod file (which had recently been added to my test suite).
@tannergooding here's some of the requested info:

Here's a comparison for parsing data with the float datatype. Parsing values as Half is not available yet.

Here's a comparison for parsing data with UTF-8 encoding.

I've generated a 150K random integer number file and parsed it for both UTF-8 and UTF-16.

I've generated 11 files, each containing 150K numbers with a fixed significant-digit count (from 6 to 17 digits, with variable decimal location - i.e. 0.123456, 1.23456, and so on). I've parsed both sets of files with the standard lib and fast_float. For the remaining cases, do you have something specific in mind (such as mixing negative zeros, NaN, ... in one file)? I hope this meets your requirements.
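The fixed-significant-digit test data described above can be reproduced along these lines. This is an illustrative reconstruction of the setup, not the actual generator: the function name, seed, and sample size are all made up.

```python
# Illustrative reconstruction of the test-data generation described
# above: numbers with a fixed count of significant digits and a
# variable decimal point location (e.g. 0.123456, 1.23456, ...).
# Names, seed, and counts are guesses at the setup, not the actual code.
import random

def random_fixed_digits(n_digits: int, rng: random.Random) -> str:
    """One random decimal string with exactly n_digits significant digits."""
    first = rng.choice("123456789")  # no leading zero in the digit string
    rest = "".join(rng.choice("0123456789") for _ in range(n_digits - 1))
    digits = first + rest
    point = rng.randrange(0, n_digits + 1)  # variable decimal location
    if point == 0:
        return "0." + digits
    if point == n_digits:
        return digits
    return digits[:point] + "." + digits[point:]

rng = random.Random(2021)
sample = [random_fixed_digits(10, rng) for _ in range(1000)]
```

Sweeping `n_digits` from 6 to 17 and writing 150K lines per file would reproduce the shape of the benchmark inputs described in the comment.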
Thanks a lot. The numbers do look promising, and I'd be open to taking a prototype here. I think it probably needs to be under some feature flag due to the size increase, and we'd need to explicitly check what the size before vs. size after is with the implementation used.
@jkotas do you think it would be better for such a PR to be in dotnet/runtime proper or in dotnet/runtimelab first? The main concern with the implementation is the size increase to support the tables, but the 2-3x faster parsing should be a big benefit to ML and similar scenarios.
I do not think we would get any benefit from doing this in runtimelab. My understanding is that this is a baked algorithm. We either take the size/throughput trade-off or not; there is not much to experiment with. If we take the new implementation, are we going to delete any of the existing FP parsing code? Can we compute the tables on demand at runtime as an option?
The biggest concern for size (on my end) is the impact for WASM. However, as called out above, other runtimes (like Rust) have taken this as an acceptable trade-off. I think we then have two options: …

Possibly, but I expect that may impact perf some. We'd have to measure and see what the differences are.
The C++ equivalent is now being considered for inclusion in libstdc++ (GCC): https://gcc.gnu.org/pipermail/libstdc++/2021-November/053502.html
It sounds like we should be good for someone to contribute a PR here. If someone wants to do that, please let me know and we can work out the finer details. I think it should be generally straightforward, but we'll want a feature flag around the table and main algorithm, leaving only the "fallback" implementation if that flag is off.
I would be glad to do it. @tannergooding, is it possible to talk about it by email instead of this forum? I've got a couple of questions before I start. Thanks.
Maybe this is frequently asked, but is it possible to get some guidance on how to get started with this PR? I've forked, branched, and compiled the whole runtime, but it takes a lot of time and I'm sure there's a more focused way to build it. After reading all the contributing guidelines, I feel that contributing to System.Private.CoreLib is a bit different from contributing to the libraries.
Looks like I've found the way to compile and test S.P.C; I'll try to incorporate the fast_float algorithm into the parsing process.
Sorry for the delay; I missed this due to the holidays, etc. For corelib-related work, I generally build the repo as: … This gives me a working runtime + corelib and allows VS to open and be used. You can run tests on the command line by appending …
I suspect you can just ignore that file |