Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Estrin's method for polynomial evaluation #932

Merged
merged 16 commits into from
Feb 4, 2023
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 48 additions & 0 deletions doc/internals/estrin.qbk
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
[section:estrin Estrin's method for polynomial evaluation]

[h4 Synopsis]

``
#include <boost/math/tools/estrin.hpp>
``

namespace boost { namespace math { namespace tools {

// Advanced interface: Use if you can preallocate a scratch buffer of size (coeffs.size() +1)/2:
// The scratch buffer size is *unchecked*, so use at your own risk!
mborland marked this conversation as resolved.
Show resolved Hide resolved
template<typename RandomAccessContainer1, typename RandomAccessContainer2, typename RealOrComplex>
inline auto estrin(RandomAccessContainer1 const & coeffs, RandomAccessContainer2& scratch, RealOrComplex z);

// Template specialization for std::array, no preallocation is necessary so massive performance improvements are typically observed:
template <typename RealOrComplex1, size_t n, typename RealOrComplex2>
inline RealOrComplex2 estrin(const std::array<RealOrComplex1, n> &coeffs, RealOrComplex2 z);

}}} // namespaces

[h4 Description]

Boost.math provided free functions which evaluate polynomials by Estrin's method.
Though Estrin's method is not optimal from the standpoint of minimizing arithmetic operations (that claim goes to Horner's method),
it nonetheless is well-suited to SIMD pipelines on modern CPUs.
For example, on an 2022 M1 Pro, evaluating a double precision polynomial of length N using Estrin's method with scratch space takes 0.28 N nanoseconds for large N.
For comparison, using Horner's method takes 1.24 N ns.
However, for small N, choosing between the two methods is rather complicated.
If you know your polynomial coefficients at compile time and can store them in a `std::array`, then Estrin's method is faster.
If the coefficients are computed at runtime, then only for N roughly greater than 90 is Estrin's method better.
These numbers are highly dependent on compiler flags; ensure the compiler is allowed to emit AVX instructions and fmas to take full advantage of the benefits of Estrin's method.

To measure the performance on your system, refer to the google benchmark file `reporting/performance/estrin_performance.cpp`.


[h4 References]

* Muller, Jean-Michel (2005). Elementary Functions: Algorithms and Implementation (2nd ed.). Birkhäuser. p. 58. ISBN 0-8176-4372-9.

[endsect] [/section:estrin]
[/
Copyright 2023 Thomas Dybdahl Alhe, Nicholas Thompson, Matt Borland

Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at
http://www.boost.org/LICENSE_1_0.txt).
]
1 change: 1 addition & 0 deletions doc/math.qbk
Original file line number Diff line number Diff line change
Expand Up @@ -726,6 +726,7 @@ and as a CD ISBN 0-9504833-2-X 978-0-9504833-2-0, Classification 519.2-dc22.

[mathpart poly Polynomials and Rational Functions]
[include internals/polynomial.qbk]
[include internals/estrin.qbk]
[include internals/rational.qbk]
[endmathpart]

Expand Down
68 changes: 68 additions & 0 deletions include/boost/math/tools/estrin.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
/*
* Copyright Thomas Dybdahl Ahle, Nick Thompson, Matt Borland, John Maddock, 2023
* Use, modification and distribution are subject to the
* Boost Software License, Version 1.0. (See accompanying file
* LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
*/
#ifndef BOOST_MATH_TOOLS_ESTRIN_HPP
#define BOOST_MATH_TOOLS_ESTRIN_HPP

#include <array>
#include <vector>

NAThompson marked this conversation as resolved.
Show resolved Hide resolved
namespace boost {
namespace math {
namespace tools {

template <typename RandomAccessContainer1, typename RandomAccessContainer2, typename RealOrComplex>
inline RealOrComplex estrin(RandomAccessContainer1 const &coeffs, RandomAccessContainer2 &scratch, RealOrComplex z) {
// Does anyone care about the complex coefficients, real argument case?
// I've never seen it used, and this static assert makes the error messages much better:
static_assert(std::is_same<typename RandomAccessContainer2::value_type, RealOrComplex>::value,
"The value type of the scratch space must be the same as the abscissa.");
auto n = coeffs.size();
if (n == 0) {
return static_cast<RealOrComplex>(0);
}
for (decltype(n) i = 0; i < n / 2; i++) {
scratch[i] = coeffs[2 * i] + coeffs[2 * i + 1] * z;
}
if (n & 1) {
scratch[n / 2] = coeffs[n - 1];
}
auto m = (n + 1) / 2;

while (m != 1) {
z = z * z;
for (decltype(n) i = 0; i < m / 2; i++) {
scratch[i] = scratch[2 * i] + scratch[2 * i + 1] * z;
}
if (m & 1) {
scratch[m / 2] = scratch[m - 1];
}
m = (m + 1) / 2;
}
return scratch[0];
}

// The std::array template specialization doesn't need to allocate:
template <typename RealOrComplex1, size_t n, typename RealOrComplex2>
inline RealOrComplex2 estrin(const std::array<RealOrComplex1, n> &coeffs, RealOrComplex2 z) {
std::array<RealOrComplex2, (n + 1) / 2> ds;
return estrin(coeffs, ds, z);
}

template <typename RandomAccessContainer, typename RealOrComplex>
inline RealOrComplex estrin(const RandomAccessContainer &coeffs, RealOrComplex z) {
auto n = coeffs.size();
// Normally, I'd make `ds` a RandomAccessContainer, but its value type needs to be RealOrComplex,
// and the value_type of the passed RandomAccessContainer can just be Real.
// Allocation of the std::vector is not ideal, but I have no other ideas at the moment:
std::vector<RealOrComplex> ds((n + 1) / 2);
return estrin(coeffs, ds, z);
}

} // namespace tools
} // namespace math
} // namespace boost
#endif
269 changes: 269 additions & 0 deletions reporting/performance/estrin_performance.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,269 @@
// (C) Copyright Nick Thompson, John Maddock 2023.
// Use, modification and distribution are subject to the
// Boost Software License, Version 1.0. (See accompanying file
// LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
#include <array>
#include <benchmark/benchmark.h>
#include <boost/math/tools/estrin.hpp>
#include <boost/math/tools/rational.hpp>
#include <complex>
#include <iostream>
#include <random>
#include <vector>

using boost::math::tools::estrin;
using boost::math::tools::evaluate_polynomial;

template <class Real> void HornerRealCoeffsRealArg(benchmark::State &state) {
long n = state.range(0);
std::random_device rd;
auto seed = rd();
std::mt19937_64 mt(seed);
std::uniform_real_distribution<Real> unif(-10, 10);

std::vector<Real> coeffs(n);
for (auto &c : coeffs) {
c = unif(mt);
}
Real x = unif(mt);
// Prevent the compiler from hoisting the evaluation out of the loop:
Real fudge = std::sqrt(std::numeric_limits<Real>::epsilon());
for (auto _ : state) {
Real output = evaluate_polynomial(coeffs.data(), x, n);
benchmark::DoNotOptimize(output);
x += fudge;
}
state.SetComplexityN(state.range(0));
}

template <class Real> void HornerRealCoeffsComplexArg(benchmark::State &state) {
long n = state.range(0);
std::random_device rd;
auto seed = rd();
std::mt19937_64 mt(seed);
std::uniform_real_distribution<Real> unif(-10, 10);

std::vector<Real> coeffs(n);
for (auto &c : coeffs) {
c = unif(mt);
}
std::complex<Real> x{unif(mt), unif(mt)};
Real fudge = std::sqrt(std::numeric_limits<Real>::epsilon());
for (auto _ : state) {
std::complex<Real> output = evaluate_polynomial(coeffs.data(), x, n);
benchmark::DoNotOptimize(output);
x += fudge;
}
state.SetComplexityN(state.range(0));
}

template <class Real> void EstrinRealCoeffsRealArg(benchmark::State &state) {
long n = state.range(0);
std::random_device rd;
auto seed = rd();
std::mt19937_64 mt(seed);
std::uniform_real_distribution<Real> unif(-10, 10);

std::vector<Real> c(n);
for (auto &d : c) {
d = unif(mt);
}
Real x = unif(mt);
Real fudge = std::sqrt(std::numeric_limits<Real>::epsilon());
for (auto _ : state) {
Real output = estrin(c, x);
benchmark::DoNotOptimize(output);
x += fudge;
}
state.SetComplexityN(state.range(0));
}

template <class Real> void EstrinRealCoeffsRealArgWithScratch(benchmark::State &state) {
long n = state.range(0);
std::random_device rd;
auto seed = rd();
std::mt19937_64 mt(seed);
std::uniform_real_distribution<Real> unif(-10, 10);

std::vector<Real> c(n);
std::vector<Real> scratch((n + 1) / 2);
for (auto &d : c) {
d = unif(mt);
}
Real x = unif(mt);
Real fudge = std::sqrt(std::numeric_limits<Real>::epsilon());
for (auto _ : state) {
Real output = boost::math::tools::estrin(c, scratch, x);
benchmark::DoNotOptimize(output);
x += fudge;
}
state.SetComplexityN(state.range(0));
}

template <class Real> void EstrinRealCoeffsComplexArgWithScratch(benchmark::State &state) {
long n = state.range(0);
std::random_device rd;
auto seed = rd();
std::mt19937_64 mt(seed);
std::uniform_real_distribution<Real> unif(-10, 10);

std::vector<Real> c(n);
std::vector<std::complex<Real>> scratch((n + 1) / 2);
for (auto &d : c) {
d = unif(mt);
}
std::complex<Real> z{unif(mt), unif(mt)};
Real fudge = std::sqrt(std::numeric_limits<Real>::epsilon());
for (auto _ : state) {
auto output = boost::math::tools::estrin(c, scratch, z);
benchmark::DoNotOptimize(output);
z += fudge;
}
state.SetComplexityN(state.range(0));
}

template <class Real, size_t n> void EstrinRealCoeffsRealArgStdArray(benchmark::State &state) {
std::random_device rd;
auto seed = rd();
std::mt19937_64 mt(seed);
std::uniform_real_distribution<Real> unif(-10, 10);

std::array<Real, n> c;
for (auto &d : c) {
d = unif(mt);
}
Real x = unif(mt);
Real fudge = std::sqrt(std::numeric_limits<Real>::epsilon());
for (auto _ : state) {
Real output = boost::math::tools::estrin(c, x);
benchmark::DoNotOptimize(output);
x += fudge;
}
}

template <class Real, size_t n> void HornerRealCoeffsRealArgStdArray(benchmark::State &state) {
std::random_device rd;
auto seed = rd();
std::mt19937_64 mt(seed);
std::uniform_real_distribution<Real> unif(-10, 10);

std::array<Real, n> c;
for (auto &d : c) {
d = unif(mt);
}
Real x = unif(mt);
Real fudge = std::sqrt(std::numeric_limits<Real>::epsilon());
for (auto _ : state) {
Real output = boost::math::tools::evaluate_polynomial(c, x);
benchmark::DoNotOptimize(output);
x += fudge;
}
}

template <class Real, size_t n> void EstrinRealCoeffsComplexArgStdArray(benchmark::State &state) {
std::random_device rd;
auto seed = rd();
std::mt19937_64 mt(seed);
std::uniform_real_distribution<Real> unif(-10, 10);

std::array<Real, n> c;
for (auto &d : c) {
d = unif(mt);
}
std::complex<Real> z{unif(mt), unif(mt)};
Real fudge = std::sqrt(std::numeric_limits<Real>::epsilon());
for (auto _ : state) {
auto output = boost::math::tools::estrin(c, z);
benchmark::DoNotOptimize(output);
z += fudge;
}
}

BENCHMARK_TEMPLATE(HornerRealCoeffsRealArg, float)->RangeMultiplier(2)->Range(1, 1 << 15)->Complexity(benchmark::oN);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgWithScratch, float)->RangeMultiplier(2)->Range(1, 1 << 15)->Complexity(benchmark::oN);

BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, float, 2);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, float, 3);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, float, 4);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, float, 5);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, float, 8);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, float, 9);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, float, 16);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, float, 17);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, float, 32);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, float, 33);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, float, 64);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, float, 65);

BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, float, 2);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, float, 3);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, float, 4);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, float, 5);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, float, 8);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, float, 9);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, float, 16);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, float, 17);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, float, 32);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, float, 33);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, float, 64);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, float, 65);

BENCHMARK_TEMPLATE(HornerRealCoeffsRealArg, double)->RangeMultiplier(2)->Range(1, 1 << 15)->Complexity(benchmark::oN);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgWithScratch, double)->RangeMultiplier(2)->Range(1, 1 << 15)->Complexity(benchmark::oN);

BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, double, 2);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, double, 3);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, double, 4);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, double, 5);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, double, 8);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, double, 9);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, double, 16);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, double, 17);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, double, 32);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, double, 33);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, double, 64);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgStdArray, double, 65);

BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, double, 2);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, double, 3);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, double, 4);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, double, 5);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, double, 8);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, double, 9);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, double, 16);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, double, 17);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, double, 32);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, double, 33);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, double, 64);
BENCHMARK_TEMPLATE(HornerRealCoeffsRealArgStdArray, double, 65);

BENCHMARK_TEMPLATE(HornerRealCoeffsRealArg, double)->DenseRange(64, 128, 8)->Complexity(benchmark::oN);
BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArgWithScratch, double)->DenseRange(64, 128, 8)->Complexity(benchmark::oN);

BENCHMARK_TEMPLATE(HornerRealCoeffsComplexArg, float)->RangeMultiplier(2)->Range(1, 1 << 15)->Complexity(benchmark::oN);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgWithScratch, float)->RangeMultiplier(2)->Range(1, 1 << 15)->Complexity(benchmark::oN);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgStdArray, float, 2);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgStdArray, float, 4);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgStdArray, float, 8);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgStdArray, float, 16);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgStdArray, float, 32);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgStdArray, float, 64);

BENCHMARK_TEMPLATE(HornerRealCoeffsComplexArg, double)->RangeMultiplier(2)->Range(1, 1 << 15)->Complexity(benchmark::oN);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgWithScratch, double)->RangeMultiplier(2)->Range(1, 1 << 15)->Complexity(benchmark::oN);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgStdArray, double, 2);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgStdArray, double, 4);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgStdArray, double, 8);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgStdArray, double, 16);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgStdArray, double, 32);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgStdArray, double, 64);

BENCHMARK_TEMPLATE(HornerRealCoeffsComplexArg, double)->DenseRange(64, 128, 8)->Complexity(benchmark::oN);
BENCHMARK_TEMPLATE(EstrinRealCoeffsComplexArgWithScratch, double)->DenseRange(64, 128, 8)->Complexity(benchmark::oN);

// These just tell us what we already know: Allocation is expensive!
// BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArg, float)->RangeMultiplier(2)->Range(1, 1<<15)->Complexity(benchmark::oN);
// BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArg, double)->RangeMultiplier(2)->Range(1, 1 << 15)->Complexity(benchmark::oN);
// BENCHMARK_TEMPLATE(EstrinRealCoeffsRealArg, double)->DenseRange(64, 128, 8)->Complexity(benchmark::oN);

BENCHMARK_MAIN();
Loading