-
-
Notifications
You must be signed in to change notification settings - Fork 41
/
ChangeLog
652 lines (519 loc) · 26.5 KB
/
ChangeLog
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
Changes in primecount-6.5, 2021-04-02
In primecout-6.4 (and earlier) the AC formula scales very badly on
servers with a large number of CPU cores. The root cause of this
scaling issue is the SegmentedPiTable lookup table which is accessed
randomly and which is much too large to fit into the CPU's cache. In
primecount-6.5 the AC algorithm as been slightly modified to allow a
much smaller SegmentedPiTable without deteriorating the runtime
complexity of the algorithm.
* AC.cpp: Decrease SegmentedPiTable by a factor > log(x).
* AC_libdivide.cpp: Decrease SegmentedPiTable by a factor > log(x).
* SegmentedPiTable.cpp: Increase minimum segment size to 2 MiB.
Changes in primecount-6.4, 2021-03-20
This version fixes a critical integer overflow in the B formula
which is part of Xavier Gourdon's prime counting algorithm. The
integer overflow occurred when computing B(x) with x > 1e25, this bug
has been present in primecount-6.1, 6.2 and 6.3. This version also
adds a workaround for a severe scaling issue in Clang's OpenMP
library (Clang 11, 2021) that occurs when using dynamic thread
scheduling combined with long running computations. Because of this
issue primecount-6.3 compiled with Clang takes 2.5x longer to
compute AC(1e24) than primecount-6.3 compiled with GCC on my
dual-socket AMD EPYC Rome server. I have submitted a bug report
to LLVM which contains more information:
https://bugs.llvm.org/show_bug.cgi?id=49588
* B.cpp: Fix integer overflow (thanks to David Baugh for reporting it)
* B_mpi.cpp: Fix integer overflow (same as in B.cpp).
* P2.cpp: Fix integer overflow (same as in B.cpp).
* for_atomic.hpp: Lock-free for loop built with std::atomic.
* AC.cpp: Fix Clang/OpenMP scaling issue.
* AC.cpp: Decrease size of SegmentedPiTable by 1.5x.
* AC_libdivide.cpp: Fix Clang/OpenMP scaling issue.
* AC_libdivide.cpp: Decrease size of SegmentedPiTable by 1.5x.
* S2_easy.cpp: Fix Clang/OpenMP scaling issue.
* S2_easy_libdivide.cpp: Fix Clang/OpenMP scaling issue.
* SegmentedPiTable.cpp: Increase minimum segment size.
* SegmentedPiTable.cpp: Annotate pi(x) with ALWAYS_INLINE.
* Sieve.cpp: Simplify counters_dist initialization.
* PiTable.cpp: Annotate pi(x) with ALWAYS_INLINE.
* LoadBalancerP2.cpp: Fix last iteration detection.
Changes in primecount-6.3, 2021-03-05
The memory usage of the PiTable and SegmentedPiTable has been
reduced by about 2x, this speeds up the computation of the easy
special leaves (S2_easy & AC formulas) by up to 30% for large pi(x)
computations with x >= 1e23. Furthermore the partial sieve function
a.k.a. phi(x, a) has been improved significantly, it now runs 3 to
5x faster for most input. phi(x, a) is used extensively by many
other functions such as pi_legendre(x) and pi_meissel(x) which now
also run up to 5x faster.
* PiTable.cpp: Reduce memory usage by 2x.
* SegmentedPiTable.cpp: Reduce memory usage by 2x.
* Sieve.cpp: Reduce memory usage of counters array by 2x.
* phi.cpp: Fixed integer overflow #39.
* phi.cpp: Cache 40x more phi(x, a) results.
* phi.cpp: Use pi(x) if a > pi(sqrt(x)).
* phi.cpp: Use larger c constant if phi(x, larger_c) is cached.
* generate_phi.hpp: Same optimizations as phi.cpp.
* LoadBalancer.cpp: Tune for new phi(x, a) implementation.
* FactorTable.hpp: Reduce number of branches.
* DFactorTable.hpp: Reduce number of branches.
Changes in primecount-6.2, 2021-01-07
The sieving algorithm has been improved by annotating branches
with likely/unlikely and __builtin_unreachable(). This improves
the performance of the S2_hard and D formulas by up to 5%. GCC
benefits most from these changes (Clang's performance was
already very good before). With this release there is also a new
primecount-backup version (last primecount-backup release was 6.0).
In order to reduce the amount of work for myself there will be no
precompiled binaries anymore going forward. You can now install
primecount using your operating system's package manager (if it
is available there) or by compiling it from source yourself.
* Use Appveyor-CI instead of Travis-CI for testing.
* README.md: primecount can now be installed using package managers.
* Sieve.cpp: Annotate branches with unlikely, up to 5% speedup.
* Sieve.cpp: Annotate switch cases with fallthrough.
* AC.cpp: Reduce number of function paramaters.
* AC_libdivide.cpp: Reduce number of function paramaters.
* B.cpp: Improve GCC performance.
* P2.cpp: Improve GCC performance.
* popcnt.hpp: Improve GCC performance.
Changes in primecount-6.1, 2020-09-12
The main focus of this release has been on polishing the code
and improving the documentation. I also tried many things to
improve the scaling on servers with a large number of CPU cores,
however I only achieved minor speed ups. The only meaningful
improvement is that the same threads are now reused throughout
the entire AC computation. This improves the scaling for small
to medium sized computations up to 10^20. GCC benefits most
from this change whereas Clang performance is mostly unchanged.
* Xavier Gourdon's algorithm has been distributed using MPI
so that computations can now run on HPC clusters.
* CMakeLists.txt: New WITH_JEMALLOC option (default OFF).
* AC.cpp: Reuse the same threads throughout the computation.
* AC.cpp: Improve upper bound of C2 formula.
* AC.cpp: Avoid branch inside hot loop of A formula.
* SegmentedPiTable.cpp: Reuse threads from AC.cpp.
* LoadBalancerP2.cpp: New load balancer for P2 & B.
* phi.cpp: Reduce caching for tiny numbers.
* generate_phi.hpp: Reduce caching for tiny numbers.
* pod_vector.hpp: Like std::vector, but without default
initialization (useful when allocating 100s of GiB).
* PiTable.cpp: Multi-threaded initialization.
* Status.cpp: Avoid thread synchronization when printing
in order to improve scaling of AC and S2_easy.
* Status.cpp: Improve S2_hard & D status accuracy.
* StatusAC.cpp: More accurate status for AC formula.
* cmdoptions.cpp: Add -B & -D options.
* Fixed #32: primecount-backup prints incorrect time elapsed.
Changes in primecount-6.0, 2020-03-16
This version fixes a scaling issue that has been present in
primecount since the first version. The computation of the
hard special leaves (S2_hard & D formulas) used a flawed
optimization that deteriorated the runtime complexity of the
algorithm. The doc/Special-Leaves.md document contains more
information. The new version should run much faster for large
computations >= 10^23.
* Sieve.cpp: Implemented O(sqrt(sieve_size)) counting,
previously counting was O(sieve_size).
* AC_libdivide.cpp: Up to 20% faster using GCC.
* S2_easy_libdivide.cpp: Up to 20% faster using GCC.
* primecount.h: New C API. primecount now has both a C API
(primecount.h) and a C++ API (primecount.hpp).
* LoadBalancer.cpp: Refactor using new ThreadSettings struct.
* find_optimal_alpha_*.sh: New scripts that find optimal
alpha tuning factors using the Linux perf tool.
Changes in primecount-5.3, 2020-01-18
libprimesieve has been updated to the latest version which
provides a minor speedup for all formulas that use
primesieve::iterator e.g. P2, B, pi_lehmer, ...
* Sieve.hpp: Use NOINLINE.
* S2Status.hpp: Use NOINLINE.
* D4.cpp: Simplify bounds.
* S2_hard.cpp: Simplify bounds.
* LoadBalancer.cpp: Simplify bounds.
* RiemannR.cpp: Add support for __float128 type.
* popcnt.hpp: Faster ARM64 popcount implementation.
* fast_div.hpp: Add ENABLE_DIV32 macro.
* S2Status.cpp: Fix non-critical data race.
* LoadBalancer.cpp: Improve thread load balancing.
Changes in primecount-5.2, 2019-11-17
I have now implemented Xavier Gourdon's algorithm in the
primecount-backup version (branch backup3). Other than that
there have been documentation improvements and build system
improvements which should make it easier to package
primecount for e.g. Linux distros (see BUILD.md).
* Sieve.cpp: Fix vector out of bounds.
* cmdoptions.cpp: Support options of type: --option VALUE.
* doc/BUILD.md: Detailed build instructions. Contains new
section: "Packaging primecount" for Linux distros.
* doc/primecount.1: Add primecount man page.
* doc/primecount.txt: AsciiDoc, used to generate primecount.1.
* CMakeLists.txt: Generate man page using a2x program.
* CMakeLists.txt: New option: -DBUILD_LIBPRIMESIEVE=OFF.
* Get rid of underscores in command-line options:
--deleglise_rivat -> --deleglise-rivat
--S2_trivial -> --S2-trivial
--S2_easy -> --S2-easy
--S2_hard -> --S2-hard
Changes in primecount-5.1, 2019-09-02
The memory usage of Xavier Gourdon's algorithm has been
reduced from O(x^(1/2)) to O(x^(1/3) * log^3 x). Xavier
Gourdon's algorithm is now fully implemented and luckily it
scales well up to a very large number of CPU cores. Xavier
Gourdon's algorithm is now enabled by default for
numbers > 10^7.
* main.cpp: New options: --AC, --B, --D, --Phi0, --Sigma.
* SegmentedPiTable.cpp: New PiTable implementation.
* AC.cpp: Combine the A & C formulas to improve scaling.
* D4.cpp: Tighter bounds, minor speedup.
* Sigma.cpp: Reduce initialization overhead.
* Sieve.cpp: Improved pre-sieving.
* S2_hard.cpp: Improved pre-sieving.
* print.cpp: Updated for Gourdon's algorithm.
=== C++ API Changes ===
In order to simplify the API the following functions have
been removed from the primecount.hpp header:
int64_t pi_deleglise_rivat(int64_t x);
int64_t pi_legendre(int64_t x);
int64_t pi_lehmer(int64_t x);
int64_t pi_lmo(int64_t x);
int64_t pi_meissel(int64_t x);
int64_t pi_primesieve(int64_t x);
You should use pi(x) instead which is an alias for the
fastest prime counting function implementation in
primecount.
Changes in primecount-5.0, 2019-08-13
I have finally implemented Xavier Gourdon's prime counting
algorithm! Xavier Gourdon's algorithm is an improved
version of the Deleglise-Rivat algorithm. According to my
benchmarks Gourdon's algorithm runs up to 2x faster than
the Deleglise-Rivat algorithm.
* src/gourdon: Implementation of Xavier Gourdon's algorithm.
* LoadBalancer.cpp: Updated for Gourdon's algorithm.
* primecount.cpp: Updated for Gourdon's algorithm.
* print.cpp: Updated for Gourdon's algorithm.
* generate.cpp: Generate array with largest prime factors.
* test.cpp: Test Gourdon's algorithm.
* README.md: Update benchmarks.
Changes in primecount-4.8, 2019-07-14
This version reduces the memory usage of S2_easy(n) by
15% and the memory usage of S2_hard(n) by 10%.
I have also improved the documentation of the code so
that others and myself can easier understand it.
* PiTable.cpp: Reduce memory usage by 2x.
* FactorTable.hpp: Reduce memory usage by 10%.
* LoadBalancer.cpp: Improve scaling.
* MpiLoadBalancer.cpp: Improve scaling.
* fast_div.hpp: x64 assembly for (128-bit / 64-bit) = 64-bit.
* phi_tiny.hpp: Speedup for 128-bit integers.
* S2_trivial.cpp: Implement O(1) math formula.
* libdivide.h: Update to libdivide-2.0.
* appveyor.yml: Treat compiler warnings as errors.
Changes in primecount-4.7, 2019-04-16
This version fixes an issue with the new MD5 checksum
feature introduced in primecount-backup-4.6 that I
found unfortunately 1 day after releasing primecount-4.6.
* json.hpp: Fix incorrect MD5 formatting, bytes with the
first 4 bits masked off must be prefixed with '0'.
Changes in primecount-4.6, 2019-04-13
This version fixes 2 bugs in the CMakeLists.txt build script
and improves the primecount backup version that stores
intermediate results to hard disk. Note that
primecount.backup files produced by previous primecount
releases are incompatible with primecount-4.6.
* CMakeLists.txt: Fix libatomic issue.
* CMakeLists.txt: Require CMake 3.4 instead of 3.9.
* LoadBalancer.cpp: Increase number of backups.
* S2_easy.cpp: Allow resuming using fewer/more threads.
* S2_hard.cpp: Allow resuming using fewer/more threads.
* primecount.backup: Add MD5 checksum.
Changes in primecount-4.5, 2019-02-20
This is a maintenance release that contains minor
improvements e.g. tests should run 10% faster.
* lib/primesieve: upgrade to version 7.4.
* travis.yml: Test using many different compiler versions.
* calculator.hpp: Silence clang-cl -Wdeprecated warning.
Changes in primecount-4.4, 2018-05-09
The computation of the second partial sieve function
P2(x, a) has been sped up by 30% due to an update to the
latest primesieve-7.0 library.
* CMakeLists.txt: Add OpenMP support for LLVM/Clang.
* CMakeLists.txt: Add Intel C++ compiler support.
* nth_prime.cpp: Fix non-critical data race.
* pi_legendre.cpp: Fix non-critical data race.
* pi_primesieve.cpp: Fix non-critical data race.
* make test: Runs twice as fast.
Changes in primecount-4.3, 2018-03-18
* Support big-endian CPUs.
* Speed up x86 without POPCNT.
* libdivide.h: update to libdivide 1.0.
* calculator.hpp: Fix integer overflow in pow().
* Required CMake version is now 3.4 (previously 3.1).
* CMakeLists.txt: Fix make install issue.
* CMakeLists.txt: Get rid of int128_t, __int128_t checks.
* isqrt_constexpr.cpp: Add test for compile time square root.
Changes in primecount-4.2, 2017-12-02
The computation of the second partial sieve function
P2(x, a) has been sped up by 10% due to an upgrade to the
latest primesieve-6.3 library.
* lib/primesieve: upgrade to version 6.3.
* src/backup.cpp: Speed up resume from backup.
* test/RiemannR.cpp: Fix bash/ubuntu on Windows issue.
* CMakeLists.txt: Silence GCC 7 warnings.
* .travis.yml: Update to Ubuntu 14.
Changes in primecount-4.1, 2017-07-19
This version improves the backup & resume functionality
and uses up to 8% less memory due to a reduced alpha
tuning factor.
* S2_easy.cpp: Fix severe backup scaling issues.
* S2_easy_libdivide.cpp: Fix severe backup scaling issues.
* S2_hard.cpp: Faster resume.
* LoadBalancer.cpp: Simplify backup.
* results.txt: Fix incorrect time elapsed.
* primecount.cpp: smaller alpha tuning factor.
Changes in primecount-4.0, 2017-07-06
This version features a new highly optimized sieve of
Eratosthenes implementation for the computation of the
hard special leaves that speeds up the S2_hard algorithm
by up to 40%, giving an overall speed up of up to 20%.
* Sieve.cpp: New sieve of Eratosthenes.
* S2_hard.cpp: Use new sieve.
* S2_hard_mpi.cpp: Use new sieve.
* lmo5.cpp: Use new sieve.
* pi_lmo_parallel.cpp: Use new sieve.
* LoadBalancer.cpp: L1 cache size optimization.
* MpiLoadBalancer.cpp: L1 cache size optimization.
Changes in primecount-3.8, 2017-06-27
This version reduces the memory usage by a factor of 2
above 10^20! Furthermore the S2_hard algorithm for
computing the hard special leaves has been improved: The
binary indexed tree data structure (random memory access
pattern) has been removed. This fixes the scaling issues
above 10^24 primecount had previously.
* libdivide.h: Reduce memory usage, pack structs.
* BitSieve.hpp: Reduce memory usage, store odd integers.
* S2_hard.cpp: Remove binary indexed tree.
* S2_hard_mpi.cpp: Remove binary indexed tree.
* primecount.cpp: Update alpha tuning factor formula.
Changes in primecount-3.7, 2017-06-07
This version features a new multi-threading load balancer
for the computation of the hard special leaves. The new
load balancer requires no synchronization between threads
and achieves 100% CPU cores usage. The new load balancer
is based on ideas from Xavier Gourdon and Douglas Staple.
* LoadBalancer.cpp: New multi-threading load balancer.
* generate_phi.hpp: Part of new load balancer.
* S2_hard.cpp: Use new load balancer.
* BitSieve.cpp: Get rid of AVX2 (use POPCNT).
* Li.cpp: Riemann R and inverse Riemann R implementations.
* fast_div.hpp: Refactor using template metaprogramming.
* src/test: Added 27 unit tests.
* phi(x, a) now scales > 8 threads.
* BinaryIndexedTree.hpp: Do not store multiples of 2.
* CMakeLists.txt: Faster C++ compilation.
* S2Status.cpp: Reduce --status overhead.
Changes in primecount-3.6, 2017-03-04
This version features a new AVX2 popcount algorithm which
computes the hard special leaves up to 15% faster on x86 CPUs
with AVX2 support (2013 or later).
* BitSieve-popcnt.cpp: New AVX2 popcount algorithm.
* popcnt.hpp: Fix clang performance bug.
* primecount.cpp: Fix clang time measuring.
* CMakeLists.txt: Add AVX2 check.
Changes in primecount-3.5, 2016-12-16
* CMake: Use CMake build system instead of Autotools.
* README.md: Update build instructions.
Changes in primecount-3.4, 2016-08-06
* Makfile.msvc: Use libdivide also with MSVC compiler.
* S2LoadBalancer.cpp: Improved load balancing.
* P2.cpp: Improved load balancing.
* P2_mpi.cpp: Improved load balancing.
* .travis.yml: Add static C++ code analysis using cppcheck.
Changes in primecount-3.3, 2016-07-16
* README.md: Fix error in "C++ API" section.
* S2_hard.cpp: Refactoring.
* S2_hard_mpi.cpp: Refactoring.
* P2.cpp: Refactoring.
* P2_mpi.cpp: Refactoring.
Changes in primecount-3.2, 2016-05-09
This version runs up to 5% faster due to an improved P2(x, a)
implementation.
* P2.cpp: 30% faster.
* P2_mpi.cpp: 30% faster.
* libdivide.h: Update to latest version.
* autogen.sh: Print error msg if Autotools is not installed.
Changes in primecount-3.1, 2016-04-02
This version runs up to 20% faster! The speed up is due to
the usage of libdivide in S2_easy, libdivide allows to replace
expensive integer divides with comparatively cheap
multiplication and bitshifts.
* S2_easy_libdivide.cpp: 40% speed up due to libdivide.
* S2_easy_mpi_libdivide.cpp: 40% speed up due to libdivide.
* BitSieve.cpp: Fix broken Big-Endian CPU support.
Changes in primecount-3.0, 2016-03-12
In this release I have merged the MPI (Message Passing Interface)
branch into the master branch. primecount can now distribute
computations onto cluster nodes if MPI is enabled in the build
process (--enable-mpi).
* doc/primecount-MPI.md: primecount MPI documentation.
* src/mpi: primecount MPI source code.
* include/FactorTable.hpp: Multi-threaded initialization.
* build.sh: Improved build script with support for primecount MPI.
* README.md: Updated build instructions.
Changes in primecount-2.6, 2016-02-14
primecount has been distributed using MPI (Message Passing Interface)
and can now run computations on large clusters!
The distributed version of primecount is named primecount-mpi and
the code can be found at:
https://github.com/kimwalisch/primecount/tree/mpi
* src/phi.cpp: 2x speed up due to pi(x) lookup table optimization.
* scripts/build.sh: Easily build primecount.
Changes in primecount-2.5, 2016-01-24
This version adds logging to primecount-backup and fixes 2
integer overflow bugs in primecount-backup.
* --log: Logs results into results.txt and primecount.log.
* scripts/worktodo.sh: Script for batch processing via worktodo.txt.
* src/S1.cpp: Fix integer overflow bug (in backup branch).
* src/deleglise-rivat/S2_trivial.cpp: Fix integer overflow bug (in backup branch).
Changes in primecount-2.4, 2015-12-27
* README.md: Simplified build instructions.
* appveyor.yml: Automated Windows (MSVC++) testing.
* configure.ac: Removed usage of buggy ax_gcc_builtin.m4 macro.
* src/P2.cpp: Port to primesieve-5.5.0 library.
* src/test.cpp: Faster nth prime testing.
Changes in primecount-2.3, 2015-10-07
This version runs about 10% faster for x <= 10^21. Because of the
sieving optimizations implemented in primecount-2.2 the sieving
limit has now been increased which reduces the time to compute the
easy special leaves.
I have now also officially published primecount-backup binaries
which save intermediate results to a file once per hour and which
can resume from these files after e.g. a crash:
https://github.com/kimwalisch/primecount/tree/backup
* Renamed to --P2, --S1, --S2_trivial, --S2_easy, --S2_hard.
* find_fastest_alpha.sh: Benchmark which finds fastest alpha factors.
* src/primecount.cpp: Alpha factor tuning.
* src/P2.cpp: Use multi-threading for initialization.
* src/S2Status.cpp: --status[=N], N digits after decimal point.
* src/pi_lehmer.cpp: Remove pi_lehmer2(x).
Changes in primecount-2.2, 2015-09-19
This version runs about 10% faster due to newly added pre-sieving
of small primes and wheel factorization.
* src/primecount.cpp: Increase pi(x) limit to 10^31 (previously 10^27).
* src/BitSieve.cpp: Add pre-sieving of small primes.
* src/P2.cpp: Use pre-sieving and wheel factorization.
* src/deleglise-rivat/*: Use pre-sieving and wheel factorization.
* src/lmo/*: Use pre-sieving and wheel factorization.
* include/popcount.hpp: Faster popcount algorithm for CPUs without POPCNT.
* include/Wheel.hpp: New wheel factorization data structures.
Changes in primecount-2.1, 2015-04-12
* src/S1.cpp: Fix Windows performance regression.
* src/S2LoadBalancer.cpp: Refactoring.
* Makefile.am: Add autogen.sh to EXTRA_DIST.
Changes in primecount-2.0, 2015-03-26
This version improves the POPCNT algorithm in the computation of the
hard special leaves and contains a new algorithm for the computation
of the ordinary leaves which uses half as much memory.
* src/S1.cpp: New implementation based on Douglas Staple's algorithm.
* src/S2LoadBalancer.cpp: Improved load balancing.
* src/deleglise-rivat/S2_hard.cpp: Also use POPCNT if high < y.
* src/lmo/pi_lmo5.cpp: Optimized sieving, up to 10% faster.
* src/lmo/pi_lmo_parallel3.cpp: Optimized sieving, up to 10% faster.
* include/BitSieve.hpp: Add count backwards optimization.
Changes in primecount-1.9, 2015-03-08
This version introduces new command-line options for calculating
intermediate formulas of the Deleglise-Rivat algorithm. This allows
to distribute the computation of large pi(x) values on multiple
systems. This version also fixes two non-critical bugs.
* src/app: New options: --p2, --s1, --s2_trivial, --s2_easy, --s2_hard.
* src/deleglise-rivat/S2_trivial.cpp: Fix off by 1 bug.
* src/deleglise-rivat/*: Bugfix, set 1 and unset 2 in sieve.
* src/deleglise-rivat/S2_hard.cpp: Improved scaling for large x.
* src/deleglise-rivat/*: Alpha factor tuning.
* src/lmo/*: Bugfix, set 1 and unset 2 in sieve.
* src/lmo/*: Alpha factor tuning.
* src/primecount.cpp: Improved alpha formula.
* src/print.cpp: New print functions for terminal output.
Changes in primecount-1.8, 2015-02-28
This version reduces the memory usage of the Deleglise-Rivat
implementation by up to 30 percent. Instead of using the same set of
memory intense data structures for all formulas (P2, S1, S2_trvial,
S2_easy, S2_hard), each individual formula now generates its own set
of data structures and the size of each data structure is the
smallest possible for the given formula.
* -a<N>, --alpha=<N>: Manually set the tuning factor.
* src/primecount.cpp: Improved alpha factor formula.
* src/deleglise-rivat/*: Reduce memory usage.
* src/lmo/*: Update S1(x) function calls.
* src/nth_prime.cpp: Add optimization for small primes.
* src/app/*: Add --alpha option.
* avoid_128bit_div.patch: merged into master branch.
Changes in primecount-1.7.1, 2015-01-26
* Makefile.am: Add avoid_128bit_div.patch to EXTRA_DIST
* avoid_128bit_div.patch: Renamed
Changes in primecount-1.7, 2015-01-24
This version runs to 20% faster on Windows for numbers >= 2^63 by
using 64-bit divisions instead of 128-bit divisions whenever
possible. While primecount-1.6 has already been very fast on Linux
it ran slower on Windows, primecount-1.7 now achieves the same
speed on both Windows and Linux.
* fast_div.patch: Avoid 128-bit divisions.
Changes in primecount-1.6, 2015-01-05
This version calculates hard special leaves much faster, above a
certain threshold the algorithm switches to POPCNT for counting the
number of unsieved elements (instead of Tomás Oliveira's special
tree data structure) which dramatically improves memory efficiency.
This version also fixes a serious bug in P2(x, a) for x > 10^21,
thanks to Dana Jacobsen for reporting it.
* src/deleglise-rivat/S2_hard.cpp: Add POPCNT optimization.
* src/lmo/pi_lmo_parallel3: Add POPCNT optimization.
* src/lmo/pi_lmo5: Add POPCNT optimization.
* src/P2.cpp: Bug fix for numbers > 10^21.
* src/BitSieve.hpp: Improved memory efficiency.
* include/isqrt.hpp: Fixed C++11 bug in isqrt(x).
* include/min.hpp: Refactoring.
* include/int128.hpp: Add int128_t trait functions.
Changes in primecount-1.5, 2014-12-27
This version runs up to 30 percent faster than primecount-1.4 for
numbers > 2^64. The speed up is achieved using a clever trick:
Instead of calculating xn = x / (primes[b] * primes[l]) which uses
a 128-bit division, x2 = x / primes[b] is calculated upfront and
then xn = x2 / primes[l]. If x2 is < 2^64 then xn can be calculated
using a 64-bit division which is much faster.
* src/deleglise-rivat/S2_*.cpp: Avoid 128-bit divisions.
* src/S2Status.cpp: Improved status precision.
* src/app/cmdoptions.cpp: Set -l = --lmo instead of --lehmer.
* README.md: Add command-line options summary.
Changes in primecount-1.4, 2014-12-15
* src/deleglise-rivat/*.cpp: Improved computation of special leaves.
* src/P2.cpp: Use unsigned division.
* src/S1.cpp: Use unsigned division.
* src/S2LoadBalancer.cpp: Update documentation.
* src/PhiCache.cpp: Update documentation.
* include/PiTable.hpp: Update documentation.
Changes in primecount-1.3, 2014-11-04
* Fixed bug in m4-ax_popcnt.m4 for QEMU virtual machines.
* Limit number of threads in phi.cpp to 8.
* Improve scaling of P2_lehmer(x, a).
* Improve scaling of P3(x, a).
Changes in primecount-1.2, 2014-08-24
* Add --status (-s) command-line option.
* src/S2LoadBalancer.cpp: New faster load balancer.
* src/S2Status.cpp: Print S2 progress.
* Fixed integer overflow bugs in: Li_inverse(x), isqrt(x), iroot(x)
Changes in primecount-1.1, 2014-08-06
* pi_deleglise_rivat4.cpp: 128-bit implementation.
* pi_deleglise_rivat_parallel4.cpp: 128-bit implementation.
* Add --time option to print elapsed seconds.
* include/S1.hpp: Added multi-threading and templates.
* include/FactorTable.hpp: template implementation.
* src/PiTable.cpp: Faster initialization.
* src/balance_S2_load.cpp: Improved load balancer.
* src/generate.cpp: Contains all generate_*() functions.
Changes in primecount-1.0, 2014-07-19
* src/BitSieve.cpp: Fix big-endian CPU bug.
* src/lmo/*.cpp: Improved alpha factor.
* src/deleglise-rivat/*.cpp: Improved alpha factor.
* include/pmath.hpp: Add max3(a, b, c).
* m4/m4-ax_popcnt.m4: Support non x86 CPU architectures.
* Readme.md: Update documentation.