Skip to content

Latest commit

 

History

History
286 lines (242 loc) · 17.8 KB

Partial-Sieve-Function.md

File metadata and controls

286 lines (242 loc) · 17.8 KB

The partial sieve function

The partial sieve function $\phi(x, a)$ counts the numbers ≤ $x$ that are not divisible by any of the first $a$ primes. This function is sometimes also named "Legendre's sum" after the French mathematician Adrien-Marie Legendre who first studied it in the 19th century [1]. The partial sieve function is at the heart of all combinatorial prime counting algorithms. In fact Legendre's prime counting function $\pi(x)=\pi(\sqrt{x})+\phi(x,\pi(\sqrt{x}))-1$ can be computed using solely the partial sieve function, whereas the other more advanced combinatorial prime counting algorithms require evaluating the partial sieve function as well as other functions.

In the advanced combinatorial prime counting algorithms such as the Lagarias-Miller-Odlyzko algorithm [3] and the Deléglise-Rivat algorithm [4] the partial sieve function is mainly used as an auxiliary function during initialization. Therefore the overall performance of these algorithms does not meaningfully depend upon the execution speed of the partial sieve function.

However, with the advent of multi-core CPUs, the partial sieve function has found another important use case within the combinatorial prime counting algorithms. For software programs to scale well on multi-core CPUs it is crucial that the individual worker threads are independent of each other, so that the threads have exclusive access to the CPU's resources and in order to prevent that a thread has to wait idle for data from another thread. In 2002 Xavier Gourdon [5] devised a modification to the hard special leaves algorithm so that the computation can be slit up into independent chunks. This modification relies on the partial sieve function for generating a lookup table of $\phi(x, i)$ results for $i \in [0, a]$. The Generate phi(x, i) lookup table paragraph contains more information.

Hence now the partial sieve function's performance has become critical for parallel implementations of the combinatorial prime counting algorithms. This document describes the many known optimizations that can be used to speed up the $\phi(x, a)$ computation and it describes a new optimization that I have devised and that has first been implemented in primecount. It is a compressed cache of $\phi(i, j)$ results, for small to medium values of $i$ and $j$, that speeds up most $\phi(x, a)$ computations by more than an order of magnitude.

primecount's $\phi(x, a)$ implementation

In primecount the partial sieve function is implemented in the file phi.cpp and the files PhiTiny.hpp & PhiTiny.cpp contain auxiliary functions for computing $\phi(x, a)$ in constant time for tiny values of $a$. The partial sieve function $\phi(x, a)$ is also part of primecount's C/C++ API and it is available in the primecount command-line application via the --phi option.

Compute $\phi(1000, 10)$ using the primecount command-line application:

$ primecount 1000 10 --phi

Recursive formula

$\phi(x, a) = \phi(x, a - 1) - \phi(x / \mathrm{prime}_a, a - 1)$

This is the main formula for the computation of the partial sieve function. As mentioned in the introduction this formula was first described by Legendre in his book "Théorie des nombres" [1]. When implemented in a computer program the above recursive $\phi(x, a)$ formula with $a = \pi(\sqrt{x})$ allows computing Legendre's prime counting function $\pi(x)=\pi(\sqrt{x})+\phi(x,\pi(\sqrt{x}))-1$ in $O(x)$ operations and using $O(\sqrt{x} / \log{x})$ space. Tomás Oliveira e Silva's paper [6] contains a simple C implementation of this formula:

static void phi(int x, int a, int sign)
{
loop:
    if(a == 0)
        sum += sign * x;
    else if(x < p[a])
        sum += sign;
    else
    {
        --a;
        phi(x / p[a], a, -sign);
        goto loop;
    }
}

Optimizations

There are many known optimizations that can be used in conjunction with the recursive $\phi(x, a)$ formula and which, when combined, can speed up the computation by many orders for magnitude. I will now briefly describe all known optimizations, and then further down I will present a new optimization that I have devised and that has first been implemented in primecount.

$\phi(x, a) = (x / pp)\times \varphi(pp) + \phi(x \bmod pp, a)$

This formula allows computing $\phi(x, a)$ in $O(1)$ for small values of $a$ e.g. for $a$ ≤ 7. φ(n) is Euler's totient function and $pp$ denotes the product of the first $a$ primes: $pp = 2\times 3\times ...\times \mathrm{prime}_a$. The use of this formula requires initializing a lookup table of $\phi(i, a)$ results for $i \in [0, pp[$, hence the lookup table has a size of $pp$. The German astronomer Ernst Meissel was the first who used this formula for the computation of the number of primes below 1 billion at the end of the 19th century. This formula is also present in Lehmer's paper from 1959 [2] and is described in more detail in most of the other combinatorial prime counting papers. In primecount this formula is implemented in PhiTiny.hpp and the initialization of the lookup table is implemented in PhiTiny.cpp.

$\mathrm{if}(x\bmod pp ≤ pp/2)\ \ \phi(x, a) = (x/pp)\times \varphi(pp) + \phi(x\bmod pp, a)$
$\mathrm{if}(x\bmod pp &gt; pp/2)\ \ \phi(x, a) = (x/pp)\times \varphi(pp) + \varphi(pp) - \phi(pp - 1 - x\bmod pp, a)$

In the formulas above $pp$ corresponds to the product of the first $a$ primes: $pp = 2\times 3\times ...\times \mathrm{prime}_a$ and φ(n) is Euler's totient function. If it is not possible to compute $\phi(x, a)$ in $O(1)$ using the formula from the previous paragraph, these formulas can be used to avoid computing $\phi(x, a)$ where $x$ may be large, and instead compute $\phi(x\bmod pp, a)$ or $\phi(pp - 1 - x\bmod pp, a)$ where $x\bmod pp$ and $pp - 1 - x\bmod pp$ may be orders of magnitude smaller than $x$. I have tested these formulas in primecount, however they did not provide a general speedup. The main issue with these formulas is that they are only useful for relatively large values of $x$ and they are limited to small values of $a$ because they involve the product of the first $a$ primes which grows rather quickly. In computer programs that use 64-bit integers these formulas can be used for $a$ ≤ 16. These formulas are partially described in R.P. Leopold's paper [7].

Stop recursion at $c$ instead of 1

Using the formula $\phi(x, a) = (x / pp)\times \varphi(pp) + \phi(x \bmod pp, a)$ it is possible to compute $\phi(x, c)$ in $O(1)$ for small values of $c$ e.g. $c$ ≤ 7. Using this formula we can stop recursion at $c$ instead of 1 in the main recursive phi(x, a) formula and simply increase the sum by $\phi(x, c)$.

Calculate all $\phi(x / \mathrm{prime}_i, i-1) = 1$ upfront in $O(1)$

Once $\phi(x / \mathrm{prime}_i, i-1) = 1$ occurs in the main recursive phi(x, a) formula all subsequent $\phi(x / \mathrm{prime}_j, j-1)$ computations with $j \in ]i, a]$ will also be 1. Generally $\phi(x / \mathrm{prime}_i, i-1) = 1$ if $(x / \mathrm{prime}_i ≤ \mathrm{prime}_{i-1})$. Hence instead of computing $\phi(x / \mathrm{prime}_j, j-1)$ individually for all $j \in ]i, a]$ we can simply increase the sum by $a - i$.

$\mathrm{if}(a ≥ \pi(\sqrt{x}))\ \ \phi(x, a) = \pi(x) - a + 1$

This formula also allows computing $\phi(x, a)$ in $O(1)$ provided that $a$ is relatively large and $x$ is relatively small. If $a ≥ \pi(\sqrt{x})$ then $\phi(x, a)$ counts the number of primes ≤ $x$, minus the first $a$ primes, plus the number 1. The use of this formula requires using a $\pi(x)$ lookup table of size $x$. In order to reduce the memory usage it is best to use a compressed $\pi(x)$ lookup table such as primecount's PiTable.hpp. The use of a lookup table makes this formula unsuitable for computing $\phi(x, a)$ for large values of $x$ due to its excessive memory requirement. However, for large values of $x$ we can compute the $\pi(x)$ part of this formula using a prime counting function implementation in $O(x^{\frac{2}{3}})$ or less instead of a lookup table which uses much less memory.

$\mathrm{if}(\pi(\sqrt[3]{x}) ≤ a &lt; \pi(\sqrt{x}))\ \ \phi(x, a) = \pi(x) + \mathrm{P_2}(x, a) - a + 1$

$\mathrm{P_2}(x, a)$ corresponds to the 2nd partial sieve function, it counts the numbers ≤ $x$ that have exactly two prime factors each exceeding the a-th prime. If $(\pi(\sqrt[4]{x}) ≤ a &lt; \pi(\sqrt[3]{x}))$ then one needs to add the $\mathrm{P_3}(x, a)$ term i.e. $\phi(x, a) = \pi(x) + \mathrm{P_2}(x, a) + \mathrm{P_3}(x, a) - a + 1$. The formulas from this paragraph are not yet being used in primecount, even though I expect that their use could significantly speed up some $\phi(x, a)$ computations.

New optimization

Due to the recursive nature of the main phi(x, a) formula the same values of $\phi(i, j)$ are calculated over and over again, this is especially true for small to medium values of $i$ and $j$. The formula $\phi(x, a) = (x / pp)\times \varphi(pp) + \phi(x \bmod pp, a)$ can be used to avoid recursion, however it is limited to small values of $a$$c$ with $c$ being a small constant e.g. $c = 7$. The formula $\phi(x, a) = \pi(x) - a + 1$ can also be used to compute $\phi(x, a)$ in $O(1)$, however it is limited to large values of $a$$\pi(\sqrt{x})$. Hence there is currently no known optimization for computing $\phi(x, a)$ for medium values of $a \in\ ]c, \pi(\sqrt{x})[$.

The new optimization that I have devised is a $\phi(i, j)$ cache for small to medium values of $i$ and $j$ e.g. $i$$\sqrt{x}$ and $j$ ≤ 100. The more $\phi(i, j)$ results are cached, the fewer recursive calls occur in the main phi(x, a) formula and the faster the computation runs. However, on the other hand we are memory constrained, we cannot cache everything and ideally our $\phi(i, j)$ cache should fit into the CPU's fast cache memory. Hence the main goal for our cache is to store as many $\phi(i, j)$ results as possible using as little memory as possible.

The densest data structure for storing the count of numbers ≤ $n$ that are not divisible by any of the first $a$ primes that I am aware of is a bit array. If a bit is set, this means the corresponding number is not divisible by any of the first $a$ primes. The 8 bits of each byte correspond to the offsets [ 1, 7, 11, 13, 17, 19, 23, 29 ], hence each byte represents an interval of size 30. However to determine the count of numbers ≤ $n$ that are not divisible by any of the first $a$ primes, we need to iterate over the bit array and count all set bits that correspond to numbers ≤ $n$. This means the access time of our cache would be $O(n)$ which is not great. Therefore we introduce a second array which contains the count of set bits in the first array below the current index. Using these two arrays we can now count the numbers ≤ $n$ that are not divisible by any of the first $a$ primes in $O(1)$ operations. Note that the two arrays may be interleaved which makes our cache slightly more memory efficient. Below is the corresponding code from primecount:

int64_t phi_cache(uint64_t x, uint64_t a)
{
  // We devide by 240 instead of 30 here because our arrays
  // use the uint64_t type instead of std::byte.
  uint64_t count = sieve_[a][x / 240].count;
  uint64_t bits = sieve_[a][x / 240].bits;
  uint64_t bitmask = unset_larger_[x % 240];
  return count + popcnt64(bits & bitmask);
}

Before the $\phi(i, j)$ cache can be used, it needs to be initialized. The cache can be initialized using a modified version of the sieve of Eratosthenes. In primecount the cache is lazily initialized during the execution of the main phi(x, a) formula. Whenever a new $\phi(x, i)$ computation is started we first check whether that result is not yet present in the cache and if $x$ & $i$ meet the caching criteria. If these conditions apply the cache will be filled up to $x$ & $i$. In the first part of this algorithm we unset the bits that correspond to numbers that are divisible by the i-th prime. When sieving has finished, we proceed to the second part of the algorithm where we count all set bits (below each index) in the first array and store that count in the second array. Below is the corresponding code from primecount:

/// Cache phi(x, i) results with: x <= max_x && i <= min(a, max_a).
/// Eratosthenes-like sieving algorithm that removes the first a primes
/// and their multiples from the sieve array. Additionally this
/// algorithm counts the numbers that are not divisible by any of the
/// first a primes after sieving has completed. After sieving and
/// counting has finished phi(x, a) results can be retrieved from the
/// cache in O(1) using the phi_cache(x, a) method.
///
void init_cache(uint64_t x, uint64_t a)
{
  // Each bit in the sieve array corresponds to an integer that
  // is not divisible by 2, 3 and 5. The 8 bits of each byte
  // correspond to the offsets { 1, 7, 11, 13, 17, 19, 23, 29 }.
  sieve_[3].resize(max_x_size_, sieve_t{0, ~0ull});

  for (uint64_t i = 4; i <= a; i++)
  {
    // Initalize phi(x, i) with phi(x, i - 1)
    sieve_[i] = sieve_[i - 1];

    // Remove prime[i] and its multiples
    for (uint64_t n = primes_[i]; n <= max_x_; n += primes_[i] * 2)
      sieve_[i][n / 240].bits &= unset_bit_[n % 240];

    // Fill an array with the cumulative 1 bit counts.
    // sieve[i][j] contains the count of numbers < j * 240 that
    // are not divisible by any of the first i primes.
    uint64_t count = 0;
    for (auto& sieve : sieve_[i])
    {
      sieve.count = count;
      count += popcnt64(sieve.bits);
    }
  }
}

According to my benchmarks, the cache as implemented above speeds up primecount's $\phi(x, a)$ implementation by more than an order of magnitude. Based on my empirical tests, caching $\phi(x, a)$ results for $a$ ≤ 100 provides the best performance. As mentioned earlier, smaller values of $x$ & $a$ are accessed much more frequently than larger values. I also limit the size of the cache to about 16 megabytes in primecount which is slightly larger than my CPU's L3 cache size. Using an even larger cache size deteriorates performance especially when using multi-threading.

Generate $\phi(x, i)$ lookup table

In 2002 Xavier Gourdon [5] devised a modification to the hard special leaves algorithm so that the computation can be slit up into independent chunks. This modification is particularly useful for parallelizing the hard special leaves algorithm, since it avoids the need for frequent thread synchronization. This modification relies on the partial sieve function for generating a lookup table of $\phi(x, i)$ results for $i \in [0, a]$. The idea of the algorithm is described very shortly in Gourdon's paper [5] and it is also described in some more detail in Douglas Staple's paper [8], however no pseudocode is provided in both papers.

Computing $\phi(x, i)$ individually for all $i \in [0, a]$ would be far too slow. However, by taking advantage of the recursive nature of the main formula $\phi(x, a) = \phi(x, a - 1) - \phi(x / \mathrm{prime}_a, a - 1)$, we can actually generate a lookup table of $\phi(x, i)$ results for $i \in [0, a]$ in the same amount of time it takes to compute $\phi(x, a)$. We first compute $\phi(x, 0)$, next we compute $\phi(x, 1)$ and reuse the $\phi(x, 0)$ result we have computed previously. Then we compute $\phi(x, 2)$ and reuse our previous $\phi(x, 1)$ result and so forth. The code below shows how this algorithm can be implemented using a simple for loop. The phi_recursive(x / primes[i], i - 1) part needs to be computed using the recursive $\phi(x, a)$ formula in conjunction with the optimizations described in this document.

int* phi = new int[a + 1];
phi[0] = x;

// Fill an array with phi(x, i) results
for (int i = 1; i <= a; i++)
  phi[i] = phi[i-1] - phi_recursive(x / primes[i], i - 1);

References

  1. A. M. Legendre, Théorie des nombres, Third edition, Paris, 1830. Vol. 2, p. 65.
  2. D. H. Lehmer, On the exact number of primes less than a given limit, Illinois J. Math. 3 (1959), pp. 381–388.
  3. J. C. Lagarias, V. S. Miller, and A. M. Odlyzko, Computing pi(x): The Meissel-Lehmer method, Mathematics of Computation, 44 (1985), pp. 537–560.
  4. M. Deleglise and J. Rivat, "Computing pi(x): The Meissel, Lehmer, Lagarias, Miller, Odlyzko Method", Mathematics of Computation, Volume 65, Number 213, 1996, pp 235–245.
  5. Xavier Gourdon, Computation of pi(x) : improvements to the Meissel, Lehmer, Lagarias, Miller, Odllyzko, Deléglise and Rivat method, February 15, 2001.
  6. Tomás Oliveira e Silva, Computing pi(x): the combinatorial method, Revista do DETUA, vol. 4, no. 6, March 2006, pp. 759-768.
  7. R.P. Leopold, On Methods for Computing pi(x), 2007.
  8. Douglas B. Staple, The combinatorial algorithm for computing pi(x), Master of Science Thesis, Dalhousie University Halifax, Nova Scotia, August 2015.