
Consider using csFastFloat for faster FP parsing #48646

Closed
EgorBo opened this issue Feb 23, 2021 · 33 comments

@EgorBo
Member

EgorBo commented Feb 23, 2021

See Daniel Lemire's (@lemire) blog post: https://lemire.me/blog/2021/02/22/parsing-floating-point-numbers-really-fast-in-c - a super-fast floating-point parsing algorithm that is up to 6 times faster than the BCL one in the average case (for the invariant culture).

|                     Method |        FileName |      Mean |     Error |    StdDev |       Min | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated | MFloat/s |     MB/s |
|--------------------------- |---------------- |----------:|----------:|----------:|----------:|------:|------:|------:|------:|----------:|---------:|---------:|
|    FastFloat.ParseDouble() | data/canada.txt |  5.974 ms | 0.0060 ms | 0.0053 ms |  5.965 ms |  0.16 |     - |     - |     - |       2 B |    18.63 |   350.04 |
|             Double.Parse() | data/canada.txt | 37.459 ms | 0.2417 ms | 0.2142 ms | 37.003 ms |  1.00 |     - |     - |     - |      21 B |     3.00 |    56.43 |
|                            |                 |           |           |           |           |       |       |       |       |           |          |          |
|    FastFloat.ParseDouble() |   data/mesh.txt |  1.815 ms | 0.0144 ms | 0.0134 ms |  1.798 ms |  0.26 |     - |     - |     - |       1 B |    40.61 |   344.84 |
|             Double.Parse() |   data/mesh.txt |  6.911 ms | 0.1136 ms | 0.1062 ms |  6.758 ms |  1.00 |     - |     - |     - |       2 B |    10.81 |    91.75 |

The algorithm: https://arxiv.org/abs/2101.11408 (and https://nigeltao.github.io/blog/2020/eisel-lemire.html)
Its C# port (by @CarlVerret): https://github.com/CarlVerret/csFastFloat

The current implementation supports the "scientific", "fixed", and "general" formats, and a custom single-character decimal separator.
What else is needed for a proper integration into dotnet/runtime?

The implementation takes roughly 25 KB of IL byte code (~10 KB of which is spent here: https://github.com/CarlVerret/csFastFloat/blob/master/csFastFloat/Constants/Constants.cs#L22 - could that be converted to a ReadOnlySpan<byte>?).

Go switched to this algorithm in 1.16 (see strconv: https://golang.org/doc/go1.16#strconv)

/cc @tannergooding

@EgorBo EgorBo added the tenet-performance Performance related issue label Feb 23, 2021
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 23, 2021
@ghost

ghost commented Feb 23, 2021

Tagging subscribers to this area: @tannergooding, @pgovind
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: EgorBo
Assignees: -
Labels:

area-System.Numerics, tenet-performance, untriaged

Milestone: -

@tannergooding
Member

25 KB is a big increase. That's basically a full 1% of the raw IL and 0.25% of the crossgen'd image.

Likewise, it looks like this algorithm only works for inputs up to 19 significant digits, whereas we must be able to handle inputs up to 113 digits for float and 768 digits for double, so we would still need to maintain our current code paths as well.

It would be interesting to run this against our full test suite to see if any failures crop up.

Likewise, our current algorithm does try to optimize for the more common cases: after filling the NumberBuffer, it optimizes any case with fewer than 15 significant digits and an absolute exponent less than 22. So it might be interesting to find out whether there is something simple we can do to improve things compared to csFastFloat: https://source.dot.net/#System.Private.CoreLib/Number.NumberToFloatingPointBits.cs,07872ebd9c2d23ac

  • That is, for the cases with fewer than 15 digits, is our slowdown just from filling in the number buffer or something similar?

@lemire

lemire commented Feb 23, 2021

@tannergooding

so we would still need to maintain our current code-paths as well.

I think that this is entirely correct.

This is how it was done in Go. They did not remove their code, they just inserted the fast path.

One way to understand the approach taken in csFastFloat is that we insert a fast path between Clinger's fast path (which you already use) and the full approach you rely upon. The trick is that this "second fast path" catches most of the "common inputs" so that the 'slower' path is rarely needed.

It is more useful than just for the 19-digit case, however. Even when the input exceeds 19 digits, you can most often still use the fast path, because 19 digits are often all you need to determine the number (you can often truncate). Please see Section 11, "Processing long numbers quickly", in the manuscript: https://arxiv.org/pdf/2101.11408.pdf Note that this is what the Go runtime does.

Otherwise, you must fall through to some other solution, like your own current code. csFastFloat has its own fallback currently, but it is likely that the current code in the runtime would be superior.

In practice, while users can provide hundreds of digits, you should expect many (most?) use cases to use no more than 17 digits. There is no benefit to serializing double or float values with more than 17 digits if you want to later deserialize them back to float/double. You still need to handle the case with hundreds of digits, but it should not be expected to be common.

Of course, relying on a table has a cost (storage). In Go, this was discussed and they decided that it was a good trade-off.

It would be interesting to run this against our full test suite to see if any failures crop up.

The algorithm itself should be sound. It has been in production for quite some time without issue and there is a complete mathematical derivation on arXiv. The csFastFloat library itself could use more thorough testing... Note however that we do have a rather good testing suite. A couple of minor issues were caught in the last few hours, but nothing that has to do with the algorithm or core processing.

@tannergooding
Member

tannergooding commented Feb 23, 2021

Thanks for the reference. I'll give it an additional look through when I have some free time this afternoon.

It might be interesting to see if we can compress the table size somehow (many of the powers used in these cases are powers of 10 and have many trailing zeros, for example; although this table looks to be powers of 5 instead).
If we can reduce the overall size and still have a perf win, then I'd have fewer concerns about taking something like this on.

@lemire

lemire commented Feb 23, 2021

@tannergooding

Yes. They are powers of 5 (for positive exponents) or reciprocals of powers of 5 (for negative exponents). We have 10^50 = 2^50 * 5^50, so there is no need to represent the power of two explicitly. There are various ways one could trim that table, but none that I have found appealing so far. One would be to restrict the application of the fast path to some powers only, instead of the full range. The rationale there might be that 1e200 and 1e-200 are relatively uncommon. This is not at all unreasonable, but it leaves you to decide where you set the threshold.

@lemire

lemire commented Feb 24, 2021

@tannergooding

Note that the table is 10 kB, not 25 kB. Roughly: the exponents go from -300 to 300, so about 600 exponents. 600 x 128 bits is 9.6 kB. 25 kB is what you would need to store the full table, which we do not do.

The manuscript addresses this issue...

To implement our algorithm, we use a precomputed table of powers of five and reciprocals. Though it uses 10 KiB, we should compare it with the original Gay's implementation of strtod in C which uses 160 KiB and compiles to tens of kilobytes. Our table is used for parsing both 64-bit and 32-bit numbers.

For comparison, on my current laptop...

❯ du -h /usr/local/share/dotnet/shared/Microsoft.NETCore.App
140M	/usr/local/share/dotnet/shared/Microsoft.NETCore.App

Sorry if I was slow to react... it took me hours to think "25kB??? that's not right". I need stronger coffee.

(Disclosure: I had a version with a 25kB for a time, but I decided that it was too wasteful.)

@EgorBo
Member Author

EgorBo commented Feb 24, 2021

Note that the table 10kB not 25kB

25 KB is the total size of the lib (code plus that 10 KB table), so it's a rough estimate of the size overhead for corelib, but I assume we can save a bit more.

For comparison, on my current laptop...

That is the size of the SDK; typical .NET apps are self-contained nowadays, and for web-assembly purposes every 10 KB counts. But I assume we can have two implementations, OptForSize (WASM, mobile) and OptForSpeed (desktop, server); we already do that for some other algorithms. I'd love to integrate this algorithm and run some JSON-parsing benchmarks.

@lemire

lemire commented Feb 25, 2021

25kb is the total size of the lib (code + that 10kb table), so it's a rough estimate of a size overhead for the corelib but I assume we can save a bit more

Most of the code in csFastFloat is for the fallback scenario which you would not need. It should be really tight once integrated in the runtime. I expect that 25kB is pessimistic.

@EgorBo
Member Author

EgorBo commented Feb 25, 2021

25kb is the total size of the lib (code + that 10kb table), so it's a rough estimate of a size overhead for the corelib but I assume we can save a bit more

Most of the code in csFastFloat is for the fallback scenario which you would not need. It should be really tight once integrated in the runtime. I expect that 25kB is pessimistic.

Ah, makes sense!

@lemire

lemire commented Feb 25, 2021

@EgorBo Furthermore, the same data can be amortized by reusing it for UTF-8 processing as well: CarlVerret/csFastFloat#49

@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label Jun 17, 2021
@tannergooding tannergooding added this to the Future milestone Jun 17, 2021
@lemire

lemire commented Jul 17, 2021

The approach in question was merged into Rust ("Update Rust Float-Parsing Algorithms to use ...") for an upcoming release. The bloat analysis was favorable.

@Alexhuszagh

Alexhuszagh commented Jul 18, 2021

One way to understand the approach taken in csFastFloat is that we insert a fast path between Clinger's fast path (which you already use) and the full approach you rely upon. The trick is that this "second fast path" catches most of the "common inputs" so that the 'slower' path is rarely needed.

This has been my experience as well in writing my own implementations: the two algorithms (Bellerophon and Lemire) are complementary, and leave a big-integer algorithm necessary only in extremely rare cases.

The algorithm itself should be sound. It has been in production for quite some time without issue and there is a complete mathematical derivation on arXiv. The csFastFloat library itself could use more thorough testing... Note however that we do have a rather good testing suite. A couple of minor issues were caught in the last few hours, but nothing that has to do with the algorithm or core processing.

A good battery of tests should catch any correctness issues. I've used 3 main test suites extensively in my own implementations, and if it passes all 3, the chance of any major issues should be practically non-existent.

  1. Rust's random number tests

Basically, this approach generates a battery of random numbers (millions) designed to trigger common parsing failures. Some of these include trailing zeros after significant digits, extremely large numbers of digits, halfway cases, near-infinite values, and subnormal floats. It should be trivial to port these tests (which are only a few lines of code each) to dotnet.

  2. parse-number-fxx-test-data

These are the test cases Golang and Wuffs use. This is a massive collection of curated cases that have been known to cause failures, aggregated from numerous different sources (such as freetype, Wuffs, double-conversion, and more).

  3. strtod_tests

A very compact but carefully curated collection of test cases designed as a smoke test for common failures. These are all examples of floats known to fail in previous implementations, so it's a very good starting point.

If either 1) or 2) passes, any implementation should be viewed as sound (as long as it rejects any language-specific syntax requirements). If I can be of help in any way in designing test cases, I'd be more than happy to contribute.

@CarlVerret
Contributor

We just published V4.0 of our parsing library. This version brings a considerable performance boost, as we now make use of SIMD functionality to speed up the parsing process. In some cases, csFastFloat is up to 9 times faster than double.Parse.

If either 1). or 2). pass, any implementation should be viewed as sound

From the first version, we asserted that Nigel Tao's test files were successfully parsed. In V4, we also added strtod_tests and parsed it without any issue (by the way, it looks like there are many cases in strtod_tests that cannot be parsed by the standard library, as double.TryParse just returns false). I'll take a look at your Rust random number tests, although I am not really familiar with Rust.

@EgorBo @lemire

@tannergooding
Member

@CarlVerret. Thanks for the update.

Do you have more numbers on the performance of float and/or Half? Most of the numbers I see on the repo are just for double.

  • Seeing numbers comparing to TryParse and the built-in UTF8 parsing methods would also be beneficial

Do you have any data showing how the algorithm scales for various lengths of inputs? The key considerations would be:

  • Normal (that is non-subnormal) values between 0 and 1
    • These values are ~half of the representable values in a given IEEE floating-point value
  • Whole integer values
    • These are fairly common; where no decimal point exists (and they are fewer than 20 digits), they should ideally have perf comparable to parsing an integer and casting to the right type
  • Values up to 6, 9, 15, and 17 significant digits (decimal location likely varies)
    • These represent common/interesting edge cases for values that will be commonly encountered due to serialization
  • A mix between "human" inputs and computation results
    • 0.3 is a human input
      • ToString() just gives 0.3 back here
      • Its exact value is 0.299999999999999988897769753748434595763683319091796875
    • 0.1 + 0.2 is the result of a computation
      • ToString() gives 0.30000000000000004 here
      • Its exact value is 0.3000000000000000444089209850062616169452667236328125

It would likewise be good to see some basic numbers covering other interesting values such as:

  • negative zero
  • positive/negative infinity
  • nan
  • values that explicitly have trailing zeros
    • it's not uncommon for strings to be formatted to all be the same size, like 1.00
  • values that are precise (e.g. up to 768 digits need to be considered for correctness for double and 113 for float)
    • while these represent "worst case" and aren't likely to be encountered in real world code, its still good to have basic numbers here
  • values that have a 5, then many trailing zeros, and then a non-zero digit
    • these also represent a kind of "worst case", non-zero trailing digits can impact rounding decisions here

Most important is the size impact. A 10 KB table is still quite a lot and will impact trimming, startup time, and a few other considerations. Perf numbers showing how this would impact scenarios like ML.NET, WPF, or other workloads that commonly work with floating-point string data would help show that the size increase is worth taking on.

@CarlVerret
Contributor

I'll be back with more info on that. Actually, we only parse float and double; I guess parsing Half could be done the same way.

@tannergooding
Member

The most interesting thing is likely that this algorithm cannot be used in blazor-wasm, I expect the size regression (regardless of perf) would be considered too big (CC. @jkotas).

So we'd need to ensure that it could be correctly disabled or excluded in that scenario. Given that the algorithm still needs to fall back in some cases, that doesn't seem like it would be super difficult to make happen; it's just an additional consideration that would be required here.

@lemire

lemire commented Oct 23, 2021

Note that the approach has recently landed in LLVM:
llvm/llvm-project@87c0160

@Alexhuszagh

Altough I am not really familiar with rust.,.

Don't worry, I'll port the logic to C# if there's interest, no need to ask additional work from maintainers. But if it passes Nigel Tao's tests, it should be fine unless it's failing on very unusual corner cases.

@CarlVerret
Contributor

Thanks, yes it does, and it also passes the strtod file (which had recently been added to my test suite).

@CarlVerret
Contributor

@tannergooding here's some of the requested info :

Do you have more numbers on the performance of float

Here's a comparison for parsing data with the float datatype. Parsing values as Half is not available yet.

[benchmark chart]

Seeing numbers comparing to TryParse and the built-in UTF8 parsing methods would also be beneficial

here's a comparison for parsing data with UTF-8 encoding

[benchmark chart]

Whole integer values

I've generated a file of 150K random integers and parsed it for both UTF-8 and UTF-16.

[benchmark chart]

Do you have any data showing how the algorithm scales for various lengths of inputs? They key considerations would be:

  • Values up to 6, 9, 15, and 17 significant digits (decimal location likely varies)

I've generated 11 files, each containing 150K numbers with a fixed significant-digit count (from 6 to 17 digits, with variable decimal location), i.e. 0.123456, 1.23456, and so on. I've parsed both sets of files with the standard lib and fast_float.

[benchmark chart]

For the remaining cases, do you have something specific in mind (such as mixing negative zeros, NaN, etc. in one file)?

I hope this meets your requirements.

@tannergooding
Member

Thanks a lot. The numbers do look promising and I'd be open to taking a prototype here.

I think it probably needs to be under some feature flag due to the size increase and we'd need to explicitly check what the size before vs size after is with the implementation used.

@tannergooding
Member

@jkotas do you think it would be better for such a PR to be in dotnet/runtime proper or in dotnet/runtime-lab first?

The main concern with the implementation is size increase to support the tables; but the 2-3x faster parsing should be a big benefit to ML and similar scenarios.

@jkotas
Member

jkotas commented Nov 15, 2021

I do not think we would get any benefits by doing this in runtimelab. My understanding is that this is a baked algorithm: we either take the size/throughput trade-off or not; there is not much to experiment with.

If we take the new implementation, are we going to delete any of the existing FP parsing code?

Can we compute the tables on-demand at runtime as an option?

@tannergooding
Member

If we take the new implementation, are we going to delete any of the existing FP parsing code?

The biggest concern for size (on my end) is impact for WASM. However, as called out above, other runtimes (like Rust) have taken this as an acceptable tradeoff.

I think we then have two options:
A) Take this algorithm everywhere and just accept the size regression
B) #ifdef this algorithm and basically keep just the "fallback" path for platforms that care more about size than perf

  • Notably, we already have a "fast" vs "fallback" path today (Grisu3 vs Dragon4). This algorithm also needs a fallback, so we wouldn't be maintaining two separate things; we'd just disable the "fast path" and the big tables for platforms like WASM

Can we compute the tables on-demand at runtime as an option?

Possibly, but I expect that may impact perf somewhat. We'd have to measure and see what the differences are.
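Computing an entry on demand is straightforward with big integers. The sketch below shows the shape of what the precomputed table holds (the most significant 128 bits of 5^q, normalized so the top bit is set); the real tables also cover negative q via reciprocals, and the open question is whether the runtime cost and generation code beat the static 10 kB:

```go
package main

import (
	"fmt"
	"math/big"
)

// pow5Entry returns the top 128 bits of 5^q (q >= 0), normalized so that
// bit 127 is set, split into high and low 64-bit halves.
func pow5Entry(q int64) (hi, lo uint64) {
	p := new(big.Int).Exp(big.NewInt(5), big.NewInt(q), nil)
	shift := 128 - p.BitLen()
	if shift >= 0 {
		p.Lsh(p, uint(shift))
	} else {
		p.Rsh(p, uint(-shift)) // truncate, as the precomputed tables do
	}
	hi = new(big.Int).Rsh(p, 64).Uint64()
	lo = new(big.Int).And(p, new(big.Int).SetUint64(^uint64(0))).Uint64()
	return
}

func main() {
	h, l := pow5Entry(0) // 5^0 == 1 normalizes to 1 << 127
	fmt.Println(h == 1<<63, l == 0)
}
```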

@lemire

lemire commented Nov 16, 2021

The C++ equivalent is now being considered for inclusion in libstdc++ (GCC): https://gcc.gnu.org/pipermail/libstdc++/2021-November/053502.html

@tannergooding
Member

It sounds like we should be good for someone to contribute a PR here.

If someone wants to do that, please let me know and we can work out the finer details here. I think it should be generally straightforward, but we'll want a feature flag around the table and the main algorithm, leaving only the "fallback" implementation if that flag is off.

@CarlVerret
Contributor

CarlVerret commented Nov 16, 2021

It sounds like we should be good for someone to contribute a PR here.

I would be glad to do it. @tannergooding, would it be possible to talk about it by email instead of this forum? I've got a couple of questions before I start. Thanks.

@CarlVerret
Contributor

Maybe this is frequently asked, but is it possible to get some guidance about how to get started with this PR? I've forked, branched, and compiled the whole runtime, but it takes a lot of time and I'm sure there's a more focused way to build it. After reading all the contributing guidelines, I feel that contributing to System.Private.CoreLib is a bit different than contributing to the libraries.
@tannergooding @EgorBo

@ghost ghost added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Nov 20, 2021
@CarlVerret
Contributor

Looks like I've found a way to compile and test System.Private.CoreLib; I'll try to incorporate the fast_float algorithm into the parsing process.

@tannergooding
Member

Maybe this is frequently asked but is it possible to get some guidance about how to get started with this PR? I've forked, branched and compiled the whole runtime. But it takes a lot of time and I'm sure there's a more focused way to build it. After reading all contributing guidelines, I feel that contributing to Sytem.Private.CoreLib is a bit different than contributing to libraries.

Sorry for the delay, missed this due to the holidays, etc.

For corelib related work, I generally build the repo as: .\build.cmd -subset clr -config debug; .\build.cmd -subset clr+libs+libs.tests -config release
I then open S.P.Corelib via .\build.cmd -vs System.Private.Corelib

This gives me a working runtime + corelib and allows VS to open and be used.
When I'm finally ready to work on tests, I'll likewise do .\build.cmd -vs System.Runtime (or relevant other reference assembly) and that gives me the ref + tests in VS.

You can run tests on the command line by appending -test onto the command line (such as .\build.cmd -subset libs -config release -test).

@CarlVerret
Contributor

I'm about to make my first PR and I wonder why Visual Studio is making a lot of changes to the System.Runtime.sln file. Is there a way to prevent this from happening? Thanks.

[screenshot of changed files]

@EgorBo
Member Author

EgorBo commented Dec 2, 2021

I suspect you can just ignore that file

@ghost ghost locked as resolved and limited conversation to collaborators Mar 17, 2022
@tannergooding tannergooding removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jun 24, 2024