-
Notifications
You must be signed in to change notification settings - Fork 226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Estrin's method for polynomial evaluation #932
Conversation
7ad8c9d
to
b0b0928
Compare
@thomasahle : Do you know literature which gives the roundoff bounds? IIRC for Horner's method it's μn|p'(x)| where μ is the unit roundoff and n is length of the polynomial. But I believe Estrin has something more like μlog(n)|p'(x)|... |
372a981
to
a7f7a01
Compare
@jzmaddock : These are the numbers I get on an M1 Pro:
|
Analyzed the crossover point with DenseRange: 96 with real argument and real coefficients:
|
6dadb98
to
b7a7047
Compare
Are you able to check if specialized code is being generated for the small sizes? It really should be at least as fast as horner for n=1 and 8. Maybe I should try just writing it out. But if we can template on the size, it might be enough as well. |
@thomasahle : The Lemme add a few cases. |
@thomasahle : The comparison between the std::array case is much more favorable to Estrin's method:
Diffs pushed up; can you run the benchmark? |
c17836f
to
9c12ce8
Compare
N.B.: This is a slightly modified version of the code provided by Thomas Dybdahl Ahle in a github issue. [CI SKIP] [ci skip]
This is looking very good, especially for larger polynomials, just pushed some more comparisons to Horner for the std::array case, on cygwin I see:
So Horner is very slightly ahead until probably > 10 terms. Speculation: estrin involves more memory accesses, and/or pulling temp storage into the CPU cache, but that's a one time hit, and after that it just hits the same storage over and over which may explain those results. Two questions spring to mind:
|
N.B.: This is a slightly modified version of the code provided by Thomas Dybdahl Ahle in a github issue. [CI SKIP] [ci skip]
@jzmaddock : Like an idiot I just forced push over your diffs; could your repush? In any case:
Good point. I've kept the code but removed it from documentation. I want to keep the code for a little bit just to see if there are use cases . . . also maybe that gcc extension for stack arrays could help here .. .
Yes, I think that's a good idea. It may also make sense to add the following interface to match template <std::size_t N, class T, class V>
V estrin(const T(&poly)[N], const V& val); However, the tuning required to backend might take quite a huge commit; should we merge the Estrin method without making the changes to the Finally, I have reduced the C++ standard to 11 so that we could backend |
Fixed Conflicts: doc/internals/estrin.qbk include/boost/math/tools/estrin.hpp reporting/performance/estrin_performance.cpp test/Jamfile.v2 test/test_estrin.cpp
@jzmaddock , @mborland , @thomasahle : I have some dream that the build will succeed, or at least be close enough that this is ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. The only thing I saw was the missing header because you use std::is_same
in the assertion.
What I was trying to suggest above was using Estrin only on blocks of up to, say, size 128. That way we will never overflow the stack. Maybe it could be even lower, like 16 or 32. Then we can just combine the blocks using horner. Btw, did any of you try benchmarkingthe Knuth Eve code? |
Not a bad idea. Stack arrays are always tricky to get working across all supported platforms though, and won't the inlining have to die once we're using all the registers for scratch space?
I'm very interested, but isn't it conceptually distinct from this MR? |
Ok, gave the stack array a try; sadly at least with this mechanism, it made the code slower: diff --git a/include/boost/math/tools/estrin.hpp b/include/boost/math/tools/estrin.hpp
index b94702a71..4c3abfffb 100644
--- a/include/boost/math/tools/estrin.hpp
+++ b/include/boost/math/tools/estrin.hpp
@@ -59,8 +59,14 @@ inline RealOrComplex estrin(const RandomAccessContainer &coeffs, RealOrComplex z
// Normally, I'd make `ds` a RandomAccessContainer, but its value type needs to be RealOrComplex,
// and the value_type of the passed RandomAccessContainer can just be Real.
// Allocation of the std::vector is not ideal, but I have no other ideas at the moment:
- std::vector<RealOrComplex> ds((n + 1) / 2);
- return estrin(coeffs, ds, z);
+ constexpr const size_t n_max = 33;
+ if (n >= n_max) {
+ std::vector<RealOrComplex> ds((n + 1) / 2);
+ return estrin(coeffs, ds, z);
+ } else {
+ std::array<RealOrComplex, (n_max+1)/2> ds;
+ return estrin(coeffs, ds, z);
+ }
}
} // namespace tools Obviously I can't preclude the existence of some other mechanism to do a stack array that might be faster, but at this point I believe instruction generation is also playing a major role here. |
Build failure is a timeout. |
Yes. I've added the polynomial degree to the title and tried n=8. There are fewer roundoffs and hence both are more accurate, but it still appears that Horner is winning:
I googled around but found nothing. Someone should ask ChatGPT!
Yeah, I don't think in applied practice it would make much difference, but in general our "best goal" is half-ulp accuracy, because we really don't know what the user intends to do with it. |
@jzmaddock : This is probably good to go IMO. |
I merged dropping the language requirement for condition numbers from C++17 to C++14 which I think was the only thing stopping this from testing at C++14. Might be a worthwhile addition. |
I tried implementing Estrin with blocks:
These were the results using block size 4, 8 and 16:
Clearly 8 is the best block size. One could also try using the even/odd trick of the second order Horner to try and reclaim the last speed difference. Not sure if the extra code is worth it for avoiding the need for "scratch space"? |
@jzmaddock : Mind me merging this? I wanna use it in the Daubechies MR and it's a bit of a pain to fork off forks. |
@NAThompson Go for it! |
@thomasahle : Huge thanks for pushing on this; it's a really interesting algorithm and looks like it might be a major improvement to some of the rational function approximations (though we probably still need to check accuracy pretty carefully in those cases). As to your other ideas: I'm not ignoring them! I just felt like "done" was the correct action at this juncture; we can and should revisit you blocking ideas. @jzmaddock , @mborland : Thanks so much for spending your mental energies on this; it wouldn't have been very good without y'alls help. |
N.B.: This is a slightly modified version of the code provided by Thomas Dybdahl Ahle in a github issue.