Changelog-mfakto.txt

version 0.15 (...)
- more GPU models are now detected
- some support for Intel HD Graphics
- mfakto now compiles and runs on macOS
- updated Windows build instructions and fixed some errors
- fixed some Linux compilation issues
- removed dependency on AMD APP SDK as it has been discontinued
- menu / key-press evaluation to dynamically adjust settings like
  SievePrimes, SieveSize, SieveProcessSize (and others)
- bug fix: SieveCPUMask on Linux had "random" results
- bug fix: kernel compilation did not work correctly if a device number
  greater than 1 was specified via the -d option
- bug fix: verbosity level was not passed to valid_assignment()
- bug fix: the '-d g' option was hard-coded to use an invalid device number
- bug fix: the DETAILED_INFO and CHECKS_MODBASECASE debug options no longer
  cause compilation to fail
- bug fix: the line "Overflow!" was printed ad infinitum when the DETAILED_INFO
  debug option was enabled
- bug fix: the number of threads per grid could be set to zero when less than
  the vector size * maximum threads per block

version 0.14 (2014-04-17)
- --perftest enhancements including GPU sieve evaluation (for optimizing GPUSievePrimes etc.)
- successfully resync when the working directory was temporarily lost (ejected USB device or interrupted network drive)
- save and reload compiled OpenCL kernels => reduce startup time (UseBinfile config variable)
- MoreClasses config variable to allow for a "less classes" version for very short assignments (GPU sieve only)
- BugFix: enforce GPUSieveSize being a multiple of GPUSieveProcessSize
- FlushInterval config variable to fine tune the number of kernels in the GPU queue => address high CPU load issue of newer AMD drivers
- MinGW build (thanks to kracker)
- slight performance improvement for the montgomery kernels
- improved English wording in program output, ini file etc. (thanks to kracker)
- recognition of new GPUs (8xxx, R9, new APUs) (thanks to kracker)
- added a warning when using VectorSize=1 (possible AMD driver issue, see
  https://community.amd.com/thread/167571 for details)
- fix for a small memory leak (~0.5kB per assignment)
- compatible with windows 8.1

version 0.13 (2013-05-19)
- most important: GPU sieving (thanks to George Woltman)
  o set SieveOnGPU=1 to enable
  o use GPUSievePrimes, GPUSieveSize, GPUSieveProcessSize to tweak it
    (see mfakto.ini for details)
  o on lower-end GPUs GPU sieving works as well, but may result in ~20% less
    throughput than CPU sieving: try it out.
  o almost no CPU utilization, but Catalyst 13.4 and 13.5 have a bug so mfakto
    will consume up to one CPU core. Stay below 13.4 for now.
- performance improvements for all kernels between 2 and 20% (also visible when CPU sieving)
  o use of more intrinsics like amd_bitalign and mad_hi
  o direct use of comparison results
  o skipping a few calculations due to better judgement of required precision
  o new kernels (thanks for the ideas and CUDA examples to George Woltman and Oliver Weihe)
  o big set of kernels based on 15-bit-math (speedup for Cayman and GCN)
- closer alignment with mfaktc
  o -v switch (also available as a Verbosity config variable)
  o output adjustments
  o internal file and function structure
- better diagnostics through improved tracing and error number decoding
- prepare for running on Intel HD4000 or NVIDIA devices (not fully functional yet)
- --perftest mode to test CPU sieving performance (to be extended for kernel performance tests)
- allow setting CPU affinity for the CPU sieve thread (SieveCPUMask config variable)
- 30k extra test cases now available in the -st2 selftest

version 0.12 (2012-07-29)
- IMPORTANT: bugfix for missing factors for certain exponent ranges.
  Thanks to dabaichi and Axelsson for bug report and help on
  http://mersenneforum.org/showthread.php?t=13977&page=16#383
- handling of <worktodo>.add (worktodo file name as configured, with suffix
  replaced): will be appended to <worktodo> between 5 and ~10 minutes after
  the add file's creation, or when moving to the next assignment
- new mfakto.ini key SieveCPUMask to set the siever thread's CPU affinity
- auto-detection of the GPU-type, along with new GPUType key in mfakto.ini,
  removed PreferKernel key
- optimization for GCN (Graphics Core Next: HD77xx, HD78xx, HD79xx)
- improved estimation of the number of compute units
- bugfix: occasional abort on very high SievePrimes (> 450,000)
- progress line output format: percent complete now %5.1f (was %6.2f),
  exponent now %-10u (was %d)
- bugfix: if a kernel file cannot be found, tell it's name instead of
  KERNEL_FILE (thanks to Axelsson)
- approx. 30,000 new test cases (factors) for pre-release testing
  (thanks to James Heinrich)

version 0.11 (2012-05-21)
- new 24-bit barrett kernel for FCs up to 2^70 - very fast!
- new 15-bit barrett kernel for FCs up to 2^73 - almost as fast,
  especially on Cayman this one has a speedup of 50%
- new SievePrimesMin ini-file variable to replace the so-far fix
  value of 5000
- new V5UserID and ComputerID ini-file variables that let you configure
  these ID's for the results file output (so far only useful for mersenne.ca)
- new TimeStampInResults ini-file variable allows to configure that each
  output line in the results file should be preceded by a time stamp
- new ProgressHeader and PrintFormat ini-file variables to adapt the
  information that is printed after each class is finished. See the included
	mfakto.ini file for details.
- On Linux: Siever code is now compiled with gcc4.6: ~10% faster sieve
- file locking: worktodo and results files accesses are now synchronized
  using a lock file (.lck appended to the file name).
- evaluation of GHz-days of assignments, and current speed as GHz-days/day
- Ctrl-C handler already in selftest to get a summary of so-far-completed
  tests
- new --pertest option to test the siever performance depending on SievePrimes
  and SieveSizeLimit (if that is not fix at compile time)
- using a fix power of 2 for the number of GPU threads (still set via
  GridSize)
- removed many compiler-warnings

version 0.10 (2011-12-19)
- added workaround for compatibility with Catalyst 11.10 and above
- MODBASECASE for barrett kernels enabled
- Checkpoints now keep a backup (.bu) that is automatically read if the
  .cpt file is corrupt
- mfakto now allows reading checkpoints from any version of mfaktc and mfakto
  as long as the other parameters and the checksum are OK
- When Checkpoint files cannot be used, there are some diagnostic messages
- added some optimization options to the Linux Makefile
- split mul24 kernel in two bitranges 0-64 and 61-72 allowing for some
  optimizations: +10-20% for 70-bit assignments using the mul24 kernel
- merged mfaktc 0.18 features:
  + inifile parameter CheckpointDelay (see below)
	+ extended selftest (-st2)
	+ minor bugfix reporting a bitlevel as incompletely tested if a factor was
	  found in the last class
	+ Changes to the factor found result line as discussed in the mersenne forum
	+ Linux: the signal handler now also catches SIGTERM
- Writing checkpoints can now be limited:
  + set CheckpointDelay <s> to write a checkpoint only if at least s seconds
	  have passed since the last checkpoint
	+ set Checkpoint <n>, n>1 to write a checkpoint only after n classes have
	  been tested (set to n=1 to enable CheckpointDelay)
- added commandline parameter  -i|--inifile <file>
  to load <file> as inifile (default: mfakto.ini), allowing multiple instances
	of mfakto in the same directory
- added ResultsFile parameter to inifile
- added program support for GPU-sieving (siever kernel not yet functional!)

version 0.09 (2011-10-01)
- fixed bug in 72-bit kernel that might miss small factors (<48 bit)
  unfortunately, the fix slows this kernel by ~3-5%
- added test for the above bug to the selftest
- better calculation for SievePrimesAdjust

version 0.08 (2011-09-13)
- added --help and parameter checking
- exclude single-vectored 72-bit-mul24-kernel (not working with AMD APP SDK 2.5)
- removed THREADS_PER_BLOCK (no longer needed, will be selected automatically
  based on the GPU capabilities)
- removed slow 95-bit kernel (not usable anyway)
- added to mfakto.ini a config setting
  PreferKernel=mfakto_cl_barrett79|mfakto_cl_71 as HD5xxx and HD6xxx
  have their top-speed with different kernels
- tuned the settings for SievePrimesAdjust - should now be usable

version 0.07 (2011-08-10)
- vectorized barrett kernels
- re-enabled the MODBASECASE checks
- removed crash-workaround from barrett kernels
  (no longer needed since Catalyst 11.7): +3%
- fixed index-evaluation in barrett kernels
- fix for bitshifts of >31 bits in barrett kernels
- fixed a few compiler warnings
- improved error handling for bad config
- added "-d c" switch to force running on CPU
- added warning when no atomics are available
- added option for debuggable kernel code
- fixed/optimized limits for preprocessing on CPU

version 0.06 (2011-07-09)
- ported barrett kernels for 79 and 92 bit factor sizes
- added support for GPUs without atomics (HD4xxx)
- dropped 2- and 16-wide vectors
- optimization in the 24-bit kernels: +3%
- added a few testcases to the short selftest
- SievePrimes auto-adjust now working (still not optimal)
- slightly faster sieve-initialization per class

version 0.05 (2011-06-19)
- fixed 72-bit subtraction (in various places) as per TheJudger's comment
  on www.mersenneforum.org - Thanks for the hint!
- fixed boundaries for 95-bit kernel (-st failures)
- added vectorized versions of the 72-bit kernel (2, 4, 8, 16 wide)
- added VectorSize config item to select one of the vectorized kernels
- added GridSize=4 (up to 2M threads per grid)
- combined the "if (a>b) sub(a,b)" to a "sub_if_gt" without any "if"s
  (only conditional moves - prerequisite for the vector approach)
- moved the new DETAILED_INFO and CL_PERFORMANCE_INFO debug defines to
  params.h
- added GPLv3 headers everywhere

version 0.04 (2011-06-09)
- first version to go out to some people
- cleanup of
  - unused code
	- cuda-references and workarounds
  - removed OpenCL-example-code
- merged in the signal handler from mfaktc 0.18-pre2
- 71-bit-mul24-kernel working for all tests:
  - fixed bit-shifting offset in square_72_144_shl
	- fixed carry in sub72
- unrolled the modulo loop in the 95-bit kernel: 10% faster