NEWS

FFTW 3.3.9:

* Fix wisdom for avx512.

  The avx512 alignment requirement was set to 64 bytes, but this is
  wrong.  Alignment requirements are a property of the platform (e.g.,
  x86) and not of the instruction set (e.g., AVX).  Among other
  things, this broke wisdom with avx512.

  Note that avx512 support is still experimental because the FFTW
  authors have no avx512 hardware available for testing.

* fftw_threads_set_callback function to change the threading backend at runtime.

FFTW 3.3.8:

* Fixed AVX, AVX2 for gcc-8.

  By default, FFTW 3.3.7 was broken with gcc-8.  AVX and AVX2 code
  assumed that the compiler honors the distinction between +0 and -0,
  but gcc-8 -ffast-math does not.  The default CFLAGS included -ffast-math.
  This release ensures that FFTW works with gcc-8 -ffast-math, and
  removes -ffast-math from the default CFLAGS for good measure.

FFTW 3.3.7:

* Experimental support for CMake.

  The primary build mechanism for FFTW remains GNU autoconf/automake.
  CMake support is meant to offer an easy way to compile FFTW on
  Windows, and as such it does not cover all the features of the
  automake build system, such as exotic cycle counters,
  cross-compiling, or build of binaries for a mixture of ISA's
  (e.g., amd64 vs amd64+avx vs amd64+avx2).  Patches are welcome.

* Fixes for armv7a cycle counter.
* Official support for aarch64, now that we have hardware to test it.
* Tweak usage of FMA instructions in a way that favors newer processors
  (Skylake and Ryzen) over older processors (Haswell).
* tests/bench: use 64-bit precision to compute mflops.

FFTW 3.3.6-pl2:

* Bugfix: MPI Fortran-03 headers were missing in FFTW 3.3.6-pl1.

FFTW 3.3.6-pl1:

* Bugfix: FFTW 3.3.6 had the wrong libtool version number, and generated
  shared libraries of the form libfftw3.so.2.6.6 instead of
  libfftw3.so.3.*.

FFTW 3.3.6:

* The fftw_make_planner_thread_safe() API introduced in 3.3.5 didn't
  work, and this 3.3.6 fixes it.  Sorry about that.
* compilation fixes for IBM XLC
* compilation fixes for threads on Windows
* fix SIMD autodetection on amd64 when (_MSC_VER > 1500)

FFTW 3.3.5:

* New SIMD support:
  - Power8 VSX instructions in single and double precision.
    To use, add --enable-vsx to configure.
  - Support for AVX2 (256-bit FMA instructions).
    To use, add --enable-avx2 to configure.
  - Experimental support for AVX512 and KCVI. (--enable-avx512, --enable-kcvi)
    This code is expected to work but the FFTW maintainers do not have
    hardware to test it.
  - Support for AVX128/FMA (for some AMD machines) (--enable-avx128-fma)
  - Double precision Neon SIMD for aarch64.
    This code is expected to work but the FFTW maintainers do not have
    hardware to test it.
  - generic SIMD support using gcc vector intrinsics
* Add fftw_make_planner_thread_safe() API
* fix #18 (disable float128 for CUDACC)
* fix #19: missing Fortran interface for fftwq_alloc_real
* fix #21 (don't use float128 on Portland compilers, which pretend to be gcc)
* fix: Avoid segfaults due to double free in MPI transpose

* Special note for distribution maintainers: Although FFTW supports a
  zillion SIMD instruction sets, enabling them all at the same time is
  a bad idea, because it increases the planning time for minimal gain.
  We recommend that general-purpose x86 distributions only enable SSE2
  and perhaps AVX.  Users who care about the last ounce of performance
  should recompile FFTW themselves.

FFTW 3.3.4

* New functions fftw_alignment_of (to check whether two arrays are
  equally aligned for the purposes of applying a plan) and fftw_sprint_plan
  (to output a description of plan to a string).

* Bugfix in fftw-wisdom-to-conf; thanks to Florian Oppermann for the
  bug report.

* Fixed manual to work with texinfo-5.

* Increased timing interval on x86_64 to reduce timing errors.

* Default to Win32 threads, not pthreads, if both are present.

* Various build-script fixes.

FFTW 3.3.3

* Fix deadlock bug in MPI transforms (thanks to Michael Pippig for the
  bug report and patch, and to Graham Dennis for the bug report).

* Use 128-bit ARM NEON instructions instead of 64-bits.  This change
  appears to speed up even ARM processors with a 64-bit NEON pipe.

* Speed improvements for single-precision AVX.

* Speed up planner on machines without "official" cycle counters, such as ARM.

FFTW 3.3.2

* Removed an archaic stack-alignment hack that was failing with
  gcc-4.7/i386.

* Added stack-alignment hack necessary for gcc on Windows/i386.  We
  will regret this in ten years (see previous change).

* Fix incompatibility with Intel icc which pretends to be gcc
  but does not support quad precision.

* make libfftw{threads,mpi} depend upon libfftw when using libtool;
  this is consistent with most other libraries and simplifies the life
  of various distributors of GNU/Linux.

FFTW 3.3.1

* Changes since 3.3.1-beta1:

  - Reduced planning time in estimate mode for sizes with large
    prime factors.

  - Added AVX autodetection under Visual Studio.  Thanks Carsten
    Steger for submitting the necessary code.

  - Modern Fortran interface now uses a separate fftw3l.f03 interface
    file for the long double interface, which is not supported by
    some Fortran compilers.  Provided new fftw3q.f03 interface file
    to access the quadruple-precision FFTW routines with recent
    versions of gcc/gfortran.

* Added support for the NEON extensions to the ARM ISA.  (Note to beta
  users: an ARM cycle counter is not yet implemented; please contact
  fftw@fftw.org if you know how to do it right.)

* MPI code now compiles even if mpicc is a C++ compiler; thanks to
  Kyle Spyksma for the bug report.

FFTW 3.3

* Changes since 3.3-beta1:

  - Compiling OpenMP support (--enable-openmp) now installs a
    fftw3_omp library, instead of fftw3_threads, so that OpenMP
    and POSIX threads (--enable-threads) libraries can be built
    and installed at the same time.

  - Various minor compilation fixes, corrections of manual typos, and
    improvements to the benchmark test program.

* Add support for the AVX extensions to x86 and x86-64.  The AVX code
  works with 16-byte alignment (as opposed to 32-byte alignment),
  so there is no ABI change compared to FFTW 3.2.2.

* Added Fortran 2003 interface, which should be usable on most modern
  Fortran compilers (e.g. gfortran) and provides type-checked access
  to the the C FFTW interface.  (The legacy Fortran-77 interface is
  still included also.)

* Added MPI distributed-memory transforms.  Compared to 3.3alpha,
  the major changes in the MPI transforms are:
    - Fixed some deadlock and crashing bugs.
    - Added Fortran 2003 interface.
    - Added new-array execute functions for MPI plans.
    - Eliminated use of large MPI tags, since Cray MPI requires tags < 2^24;
      thanks to Jonathan Bentz for the bug report.
    - Expanded documentation.
    - 'make check' now runs MPI tests
    - Some ABI changes - not binary-compatible with 3.3alpha MPI.

* Add support for quad-precision __float128 in gcc 4.6 or later (on x86.
  x86-64, and Itanium).  The new routines use the fftwq_ prefix.

* Removed support for MIPS paired-single instructions due to lack of
  available hardware for testing.  Users who want this functionality
  should continue using FFTW 3.2.x.  (Note that FFTW 3.3 still works
  on MIPS; this only concerns special instructions available on some
  MIPS chips.)

* Removed support for the Cell Broadband Engine.  Cell users should
  use FFTW 3.2.x.

* New convenience functions fftw_alloc_real and fftw_alloc_complex
  to use fftw_malloc for real and complex arrays without typecasts
  or sizeof.

* New convenience functions fftw_export_wisdom_to_filename and
  fftw_import_wisdom_from_filename that export/import wisdom
  to a file, which don't require you to open/close the file yourself.

* New function fftw_cost to return FFTW's internal cost metric for
  a given plan; thanks to Rhys Ulerich and Nathanael Schaeffer for the
  suggestion.

* The --enable-sse2 configure flag now works in both double and single
  precision (and is equivalent to --enable-sse in the latter case).

* Remove --enable-portable-binary flag: we new produce portable binaries
  by default.

* Remove the automatic detection of native architecture flag for gcc
  which was introduced in fftw-3.1, since new gcc supports -mtune=native.
  Remove the --with-gcc-arch flag; if you want to specify a particlar
  arch to configure, use ./configure CC="gcc -mtune=...".

* --with-our-malloc16 configure flag is now renamed --with-our-malloc.

* Fixed build problem failure when srand48 declaration is missing;
  thanks to Ralf Wildenhues for the bug report.

* Fixed bug in fftw_set_timelimit: ensure that a negative timelimit
  is equivalent to no timelimit in all cases.  Thanks to William Andrew
  Burnson for the bug report.

* Fixed stack-overflow problem on OpenBSD caused by using alloca with
  too large a buffer.

FFTW 3.2.2

* Improve performance of some copy operations of complex arrays on
  x86 machines.

* Add configure flag to disable alloca(), which is broken in mingw64.

* Planning in FFTW_ESTIMATE mode for r2r transforms became slower
  between fftw-3.1.3 and 3.2.  This regression has now been fixed.

FFTW 3.2.1

* Performance improvements for some multidimensional r2c/c2r transforms;
  thanks to Eugene Miloslavsky for his benchmark reports.

* Compile with icc on MacOS X, use better icc compiler flags.

* Compilation fixes for systems where snprintf is defined as a macro;
  thanks to Marcus Mae for the bug report.

* Fortran documentation now recommends not using dfftw_execute,
  because of reports of problems with various Fortran compilers;
  it is better to use dfftw_execute_dft etcetera.

* Some documentation clarifications, e.g. of fact that --enable-openmp
  and --enable-threads are mutually exclusive (thanks to Long To),
  and document slightly odd behavior of plan_guru_r2r in Fortran
  (thanks to Alexander Pozdneev).

* FAQ was accidentally omitted from 3.2 tarball.

* Remove some extraneous (harmless) files accidentally included in
  a subdirectory of the 3.2 tarball.

FFTW 3.2

* Worked around apparent glibc bug that leads to rare hangs when freeing
  semaphores.

* Fixed segfault due to unaligned access in certain obscure problems
  that use SSE and multiple threads.

* MPI transforms not included, as they are still in alpha; the alpha
  versions of the MPI transforms have been moved to FFTW 3.3alpha1.

FFTW 3.2alpha3

* Performance improvements for sizes with factors of 5 and 10.

* Documented FFTW_WISDOM_ONLY flag, at the suggestion of Mario
  Emmenlauer and Phil Dumont.

* Port Cell code to SDK2.1 (libspe2), as opposed to the old libspe1 code.

* Performance improvements in Cell code for N < 32k, thanks to Jan Wagner
  for the suggestions.

* Cycle counter for Sun x86_64 compiler, and compilation fix in cycle
  counter for AIX/xlc (thanks to Jeff Haferman for the bug report).

* Fixed incorrect type prefix in MPI code that prevented wisdom routines
  from working in single precision (thanks to Eric A. Borisch for the report).

* Added 'make check' for MPI code (which still fails in a couple corner
  cases, but should be much better than in alpha2).

* Many other small fixes.

FFTW 3.2alpha2

* Support for the Cell processor, donated by IBM Research; see README.Cell
  and the Cell section of the manual.

* New 64-bit API: for every "plan_guru" function there is a new "plan_guru64"
  function with the same semantics, but which takes fftw_iodim64 instead of
  fftw_iodim.  fftw_iodim64 is the same as fftw_iodim, except that it takes
  ptrdiff_t integer types as parameters, which is a 64-bit type on
  64-bit machines.  This is only useful for specifying very large transforms
  on 64-bit machines.  (Internally, FFTW uses ptrdiff_t everywhere
  regardless of what API you choose.)

* Experimental MPI support.  Complex one- and multi-dimensional FFTs,
  multi-dimensional r2r, multi-dimensional r2c/c2r transforms, and
  distributed transpose operations, with 1d block distributions.
  (This is an alpha preview: routines have not been exhaustively
  tested, documentation is incomplete, and some functionality is
  missing, e.g. Fortran support.)  See mpi/README and also the MPI
  section of the manual.

* Significantly faster r2c/c2r transforms, especially on machines with SIMD.

* Rewritten multi-threaded support for better performance by
  re-using a fixed pool of threads rather than continually
  respawning and joining (which nowadays is much slower).

* Support for MIPS paired-single SIMD instructions, donated by
  Codesourcery.

* FFTW_WISDOM_ONLY planner flag, to create plan only if wisdom is
  available and return NULL otherwise.

* Removed k7 support, which only worked in 32-bit mode and is
  becoming obsolete.  Use --enable-sse instead.

* Added --with-g77-wrappers configure option to force inclusion
  of g77 wrappers, in addition to whatever is needed for the
  detected Fortran compilers.  This is mainly intended for GNU/Linux
  distros switching to gfortran that wish to include both
  gfortran and g77 support in FFTW.

* In manual, renamed "guru execute" functions to "new-array execute"
  functions, to reduce confusion with the guru planner interface.
  (The programming interface is unchanged.)

* Add missing __declspec attribute to threads API functions when compiling
  for Windows; thanks to Robert O. Morris for the bug report.

* Fixed missing return value from dfftw_init_threads in Fortran;
  thanks to Markus Wetzstein for the bug report.

FFTW 3.1.3

* Bug fix: FFTW computes incorrect results when the user plans both
  REDFT11 and RODFT11 transforms of certain sizes.  The bug is caused
  by incorrect sharing of twiddle-factor tables between the two
  transforms, and only occurs when both are used.  Thanks to Paul
  A. Valiant for the bug report.

FFTW 3.1.2

* Correct bug in configure script: --enable-portable-binary option was ignored!
  Thanks to Andrew Salamon for the bug report.

* Threads compilation fix on AIX: prefer xlc_r to cc_r, and don't use
  either if we are using gcc.  Thanks to Guy Moebs for the bug report.

* Updated FAQ to note that Apple gcc 4.0.1 on MacOS/Intel is broken,
  and suggest a workaround.  configure script now detects Core/Duo arch.

* Use -maltivec when checking for altivec.h.  Fixes Gentoo bug #129304,
  thanks to Markus Dittrich.

FFTW 3.1.1

* Performance improvements for Intel EMT64.

* Performance improvements for large-size transforms with SIMD.

* Cycle counter support for Intel icc and Visual C++ on x86-64.

* In fftw-wisdom tool, replaced obsolete --impatient with --measure.

* Fixed compilation failure with AIX/xlc; thanks to Joseph Thomas.

* Windows DLL support for Fortran API (added missing __declspec(dllexport)).

* SSE/SSE2 code works properly (i.e. disables itself) on older 386 and 486
  CPUs lacking a CPUID instruction; thanks to Eric Korpela.

FFTW 3.1

* Faster FFTW_ESTIMATE planner.

* New (faster) algorithm for REDFT00/RODFT00 (type-I DCT/DST) of odd size.

* "4-step" algorithm for faster FFTs of very large sizes (> 2^18).

* Faster in-place real-data DFTs (for R2HC and HC2R r2r formats).

* Faster in-place non-square transpositions (FFTW uses these internally
  for in-place FFTs, and you can also perform them explicitly using
  the guru interface).

* Faster prime-size DFTs: implemented Bluestein's algorithm, as well
  as a zero-padded Rader variant to limit recursive use of Rader's algorithm.

* SIMD support for split complex arrays.

* Much faster Altivec/VMX performance.

* New fftw_set_timelimit function to specify a (rough) upper bound to the
  planning time (does not affect ESTIMATE mode).

* Removed --enable-3dnow support; use --enable-k7 instead.

* FMA (fused multiply-add) version is now included in "standard" FFTW,
  and is enabled with --enable-fma (the default on PowerPC and Itanium).

* Automatic detection of native architecture flag for gcc.  New
  configure options: --enable-portable-binary and --with-gcc-arch=<arch>,
  for people distributing compiled binaries of FFTW (see manual).

* Automatic detection of Altivec under Linux with gcc 3.4 (so that
  same binary should work on both Altivec and non-Altivec PowerPCs).

* Compiler-specific tweaks/flags/workarounds for gcc 3.4, xlc, HP/UX,
  Solaris/Intel.

* Various documentation clarifications.

* 64-bit clean.  (Fixes a bug affecting the split guru planner on
  64-bit machines, reported by David Necas.)

* Fixed Debian bug #259612: inadvertent use of SSE instructions on
  non-SSE machines (causing a crash) for --enable-sse binaries.

* Fixed bug that caused HC2R transforms to destroy the input in
  certain cases, even if the user specified FFTW_PRESERVE_INPUT.

* Fixed bug where wisdom would be lost under rare circumstances,
  causing excessive planning time.

* FAQ notes bug in gcc-3.4.[1-3] that causes FFTW to crash with SSE/SSE2.

* Fixed accidentally exported symbol that prohibited simultaneous
  linking to double/single multithreaded FFTW (thanks to Alessio Massaro).

* Support Win32 threads under MinGW (thanks to Alessio Massaro).

* Fixed problem with building DLL under Cygwin; thanks to Stephane Fillod.

* Fix build failure if no Fortran compiler is found (thanks to Charles
  Radley for the bug report).

* Fixed compilation failure with icc 8.0 and SSE/SSE2.  Automatic
  detection of icc architecture flag (e.g. -xW).

* Fixed compilation with OpenMP on AIX (thanks to Greg Bauer).

* Fixed compilation failure on x86-64 with gcc (thanks to Orion Poplawski).

* Incorporated patch from FreeBSD ports (FreeBSD does not have memalign,
  but its malloc is 16-byte aligned).

* Cycle-counter compilation fixes for Itanium, Alpha, x86-64, Sparc,
  MacOS (thanks to Matt Boman, John Bowman, and James A. Treacy for
  reports/fixes).  Added x86-64 cycle counter for PGI compilers,
  courtesy Cristiano Calonaci.

* Fix compilation problem in test program due to C99 conflict.

* Portability fix for import_system_wisdom with djgpp (thanks to Juan
  Manuel Guerrero).

* Fixed compilation failure on MacOS 10.3 due to getopt conflict.

* Work around Visual C++ (version 6/7) bug in SSE compilation;
  thanks to Eddie Yee for his detailed report.

Changes from FFTW 3.1 beta 2:

* Several minor compilation fixes.

* Eliminate FFTW_TIMELIMIT flag and replace fftw_timelimit global with
  fftw_set_timelimit function.  Make wisdom work with time-limited plans.

Changes from FFTW 3.1 beta 1:

* Fixes for creating DLLs under Windows; thanks to John Pavel for his feedback.

* Fixed more 64-bit problems, thanks to John Pavel for the bug report.

* Further speed improvements for Altivec/VMX.

* Further speed improvements for non-square transpositions.

* Many minor tweaks.

FFTW 3.0.1

* Some speed improvements in SIMD code.

* --without-cycle-counter option is removed.  If no cycle counter is found,
  then the estimator is always used.  A --with-slow-timer option is provided
  to force the use of lower-resolution timers.

* Several fixes for compilation under Visual C++, with help from Stefane Ruel.

* Added x86 cycle counter for Visual C++, with help from Morten Nissov.

* Added S390 cycle counter, courtesy of James Treacy.

* Added missing static keyword that prevented simultaneous linkage
  of different-precision versions; thanks to Rasmus Larsen for the bug report.

* Corrected accidental omission of f77_wisdom.f file; thanks to Alan Watson.

* Support -xopenmp flag for SunOS; thanks to John Lou for the bug report.

* Compilation with HP/UX cc requires -Wp,-H128000 flag to increase
  preprocessor limits; thanks to Peter Vouras for the bug report.

* Removed non-portable use of 'tempfile' in fftw-wisdom-to-conf script;
  thanks to Nicolas Decoster for the patch.

* Added 'make smallcheck' target in tests/ directory, at the request of
  James Treacy.

FFTW 3.0

Major goals of this release:

* Speed: often 20% or more faster than FFTW 2.x, even without SIMD (see below).

* Complete rewrite, to make it easier to add new algorithms and transforms.

* New API, to support more general semantics.

Other enhancements:

* SIMD acceleration on supporting CPUs (SSE, SSE2, 3DNow!, and AltiVec).
 (With special thanks to Franz Franchetti for many experimental prototypes
  and to Stefan Kral for the vectorizing generator from fftwgel.)

* True in-place 1d transforms of large sizes (as well as compressed
  twiddle tables for additional memory/cache savings).

* More arbitrary placement of real & imaginary data, e.g. including
  interleaved (as in FFTW 2.x) as well as separate real/imag arrays.

* Efficient prime-size transforms of real data.

* Multidimensional transforms can operate on a subset of a larger matrix,
  and/or transform selected dimensions of a multidimensional array.

* By popular demand, simultaneous linking to double precision (fftw),
  single precision (fftwf), and long-double precision (fftwl) versions
  of FFTW is now supported.

* Cycle counters (on all modern CPUs) are exploited to speed planning.

* Efficient transforms of real even/odd arrays, a.k.a. discrete
  cosine/sine transforms (types I-IV).  (Currently work via pre/post
  processing of real transforms, ala FFTPACK, so are not optimal.)

* DHTs (Discrete Hartley Transforms), again via post-processing
  of real transforms (and thus suboptimal, for now).

* Support for linking to just those parts of FFTW that you need,
  greatly reducing the size of statically linked programs when
  only a limited set of transform sizes/types are required.

* Canonical global wisdom file (/etc/fftw/wisdom) on Unix, along
  with a command-line tool (fftw-wisdom) to generate/update it.

* Fortran API can be used with both g77 and non-g77 compilers
  simultaneously.

* Multi-threaded version has optional OpenMP support.

* Authors' good looks have greatly improved with age.

Changes from 3.0beta3:

* Separate FMA distribution to better exploit fused multiply-add instructions
  on PowerPC (and possibly other) architectures.

* Performance improvements via some inlining tweaks.

* fftw_flops now returns double arguments, not int, to avoid overflows
  for large sizes.

* Workarounds for automake bugs.

Changes from 3.0beta2:

* The standard REDFT00/RODFT00 (DCT-I/DST-I) algorithm (used in
  FFTPACK, NR, etcetera) turns out to have poor numerical accuracy, so
  we replaced it with a slower routine that is more accurate.

* The guru planner and execute functions now have two variants, one that
  takes complex arguments and one that takes separate real/imag pointers.

* Execute and planner routines now automatically align the stack on x86,
  in case the calling program is misaligned.

* README file for test program.

* Fixed bugs in the combination of SIMD with multi-threaded transforms.

* Eliminated internal fftw_threads_init function, which some people were
  calling accidentally instead of the fftw_init_threads API function.

* Check for -openmp flag (Intel C compiler) when --enable-openmp is used.

* Support AMD x86-64 SIMD and cycle counter.

* Support SSE2 intrinsics in forthcoming gcc 3.3.

Changes from 3.0beta1:

* Faster in-place 1d transforms of non-power-of-two sizes.

* SIMD improvements for in-place, multi-dimensional, and/or non-FFTW_PATIENT
  transforms.

* Added support for hard-coded DCT/DST/DHT codelets of small sizes; the
  default distribution only includes hard-coded size-8 DCT-II/III, however.

* Many minor improvements to the manual.  Added section on using the
  codelet generator to customize and enhance FFTW.

* The default 'make check' should now only take a few minutes; for more
  strenuous tests (which may take a day or so), do 'cd tests; make bigcheck'.

* fftw_print_plan is split into fftw_fprint_plan and fftw_print_plan, where
  the latter uses stdout.

* Fixed ability to compile with a C++ compiler.

* Fixed support for C99 complex type under glibc.

* Fixed problems with alloca under MinGW, AIX.

* Workaround for gcc/SPARC bug.

* Fixed multi-threaded initialization failure on IRIX due to lack of
  user-accessible PTHREAD_SCOPE_SYSTEM there.