
WIP: new DFT api #6193

Closed
wants to merge 12 commits into from

Conversation

stevengj
Member

This is not quite ready to merge yet, but I wanted to bring it to your attention to see what you think. It is a rewritten DFT API that provides:

  • p = plan_fft(x) (and similar) now returns a subtype of Base.DFT.Plan. It acts like a linear operator: you can apply it to an existing array with p * x, apply it to a preallocated output array with A_mul_B!(y, p, x), and apply the inverse plan with p \ x or inv(p) * x (a usage sketch follows this list).
  • It is easy to add new FFT algorithm implementations for specific numeric and array types...
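
For example, a minimal usage sketch (the size and variable names are just illustrative):

x = rand(1024) + rand(1024)*im
p = plan_fft(x)            # a subtype of Base.DFT.Plan
y = p * x                  # apply the plan as a linear operator
A_mul_B!(y, p, x)          # transform into a preallocated output array
x2 = p \ y                 # inverse transform; inv(p) * y is equivalent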

Partly since we have discussed moving FFTW to a module at some point for licensing reasons, and partly to support numeric types like BigFloat, the patch also includes a pure-Julia FFT implementation, with the following features.

  • One-dimensional transforms of arbitrary AbstractVectors of arbitrary Complex types (including BigFloat), with fast algorithms for all sizes (including prime sizes). Multidimensional transforms of arbitrary StridedArray subtypes.
  • Uses code generation (a baby version of the FFTW code generator) in Julia to generate kernels for use as base cases of the transforms; algorithms in general are FFTW1-like.
  • Performance is reasonable. For moderate-power-of-two sizes, it is within a factor of 2 or so of FFTW, with the difference mainly due to the lack of SIMD. For non-power-of-two sizes, the performance is several times worse, but I suspect that there may be some fixable compiler problem there (a good stress test for the compiler). (Non-power-of-two sizes seem fine with latest Julia). (Now they suck again.)
  • import FFTW is enough to automatically switch over to the FFTW algorithms for supported types.

Major to-do items:

  • Documentation
  • More test cases (a self-test algorithm by Funda Ergün is already implemented)

Minor to-do items (will eventually be needed to move the FFTW code out to an external module, but should not block merge):

  • Native Julia rfft support.
  • Native Julia in-place transforms. (The 1d FFT algorithm here is out-of-place, but it could act "in-place" via a buffer. The multi-dim FFTs only need a buffer of length maximum(size(x)).)
  • Performance improvements in non-power-of-two transforms — these underwent a major performance regression at some point (possibly due to the inlining heuristics changing); the right thing is ultimately to change the generator to spit out kernels in terms of real rather than complex arithmetic (or just to use FFTW's generator).

cc: @JeffBezanson, @timholy

@tknopp
Contributor

tknopp commented Mar 18, 2014

I know that we have discussed it in #1805, but it would still be nice if the plan also allowed p \ x in order to perform the inverse FFT. Would it be a large overhead to pair the forward and backward FFTW plans in order to allow this?

@timholy
Sponsor Member

timholy commented Mar 18, 2014

This is quite amazing. Having native code is particularly amazing. For multidimensional transforms, what would be needed to have this easily interact with SharedArrays, so we can parallelize fft computations?

One other tiny question: the word "dimensions" is used somewhat inconsistently in the standard library. Sometimes it refers to the size of an array (in the same way that you might talk about the dimensions of a room in your house), and sometimes it refers to the indices of particular coordinates (i.e., the 2nd dimension). For what you typically mean by dims, the reduction code uses the term region, which I'm not sure is any better. Any thoughts on a consistent terminology?

@tknopp
Contributor

tknopp commented Mar 18, 2014

What about calling it sizes, since size() gives us the "dimensions" of ordinary arrays? I also like shape from NumPy, but as we already inherit size from Matlab we should not mix the two up.

When it comes to selecting one particular "dimension", one could use either dir (for direction) or dim.

@timholy
Sponsor Member

timholy commented Mar 18, 2014

I like that suggestion; but since the same issue just came up on julia-users, and I'll feel guilty if this PR gets hijacked by a terminology bikeshed, let's move this discussion elsewhere.

https://groups.google.com/forum/?fromgroups=#!topic/julia-users/VA5rtWlOdhk

@tknopp
Contributor

tknopp commented Mar 18, 2014

Yes indeed. It is really cool to get a native Julia implementation of the FFT. Even better that this is done by someone that already has some experience with FFTs ... ;-)

@stevengj
Member Author

@tknopp, I see two options. One would be to store both the forward and backward plans in p. This would double the plan-creation time in FFTW, which wouldn't be desirable. The other would be to implement p \ x via conj(p * conj(x))/length(x) or similar, which would work but would add overhead.
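
Roughly, the second option would just exploit the conjugation identity for the inverse DFT; as a sketch (not actual API code):

ifft_via_conj(p, y) = conj(p * conj(y)) / length(y)   # same result that p \ y would give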

@tknopp
Contributor

tknopp commented Mar 18, 2014

Or solution 3: have an extra plan type that supports both directions, and maybe a keyword argument for plan_fft that gives this plan on request (or an extra function).

@stevengj
Member Author

Having another plan type seems messy. Maybe the best solution is to have p\x or inv(p) compute the inverse plan lazily the first time it is needed, maybe caching it in p.
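
Something along these lines, say (the type and the compute_inverse_plan helper are hypothetical; the real field layout would differ):

type MyPlan{T} <: Plan{T}
    pinv::Plan{T}      # cached inverse plan, left undefined until first use
    # ... the actual plan data would go here ...
    MyPlan() = new()   # leave pinv uninitialized
end

function Base.inv(p::MyPlan)
    # build the backward plan only on first use, then reuse it
    isdefined(p, :pinv) || (p.pinv = compute_inverse_plan(p))
    return p.pinv
end

Then p \ x would just fall back to inv(p) * x.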

@tknopp
Contributor

tknopp commented Mar 18, 2014

Well, this would be another option. It could be generalized so that one can specify (as a keyword argument) which of the two plans should be precomputed on initialization.

@timholy
Sponsor Member

timholy commented Mar 18, 2014

OK, now I have an honest-to-goodness engineering question for you. I'm aware that what I'm about to ask makes it harder to cope with my other question about SharedArrays.

I notice that with FFTW, transforming along dimension 2 is substantially slower than transforming along dimension 1 (all times after suitable warmup):

A = rand(1024,1024);
p1 = plan_fft(A, 1);
p2 = plan_fft(A, 2);

julia> @time p1(A);
elapsed time: 0.019999711 seconds (16778600 bytes allocated)

julia> @time p2(A);
elapsed time: 0.070864814 seconds (16778600 bytes allocated)

Presumably this is because of cache. In Images, I've taken @lindahua's advice to heart, and implemented operations on dimensions higher than 1 in terms of interwoven operations along dimension 1. For example, with imfilter_gaussian one gets

julia> copy!(Acopy, A); @time Images._imfilter_gaussian!(Acopy, [3,0]);
elapsed time: 0.028588639 seconds (27768 bytes allocated)

julia> copy!(Acopy, A); @time Images._imfilter_gaussian!(Acopy, [0,3]);
elapsed time: 0.018678493 seconds (52264 bytes allocated)

Slightly faster along dimension 2 than 1! An even more dramatic example is restrict: in Grid I implemented this using BLAS's blazingly-fast axpy! (which must be using SIMD and other fancy tricks), but with more experience I began to wonder whether the need for multiple passes through the data might outweigh its advantages. So in Images I recently tested a new pure-Julia version. Results:

julia> @time Grid.restrict(A, 1, 0.5);
elapsed time: 0.009557864 seconds (4469584 bytes allocated)

julia> @time Grid.restrict(A, 2, 0.5);
elapsed time: 0.040201774 seconds (4461440 bytes allocated)

julia> @time Images.restrict(A, (1,));
elapsed time: 0.008084057 seconds (4206936 bytes allocated)

julia> @time Images.restrict(A, (2,));
elapsed time: 0.010128509 seconds (4206936 bytes allocated)

Notice that the 4-fold difference with Grid's version is about the same as for FFTW. I'm not sure the FFT can be written so easily in terms of interwoven operations, but given the crucial importance of this algorithm in modern computing I thought it might be worth asking. No better time than when you're in the middle of re-implementing it from scratch.

@stevengj
Member Author

FFTW does implement a variety of strategies for "interweaving" the transforms along the discontiguous direction (see our Proc. IEEE paper). You should notice a much smaller performance difference if you use the FFTW.PATIENT flag when creating the plan. But probably you should file an FFTW issue if you want to discuss this further.
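
For example, something like the following (assuming plan_fft's optional positional flags argument; the exact form may differ with this PR's API). PATIENT planning takes longer up front, but usually handles the discontiguous dimension much better than the default FFTW.ESTIMATE:

p2 = plan_fft(A, 2, FFTW.PATIENT)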

@simonbyrne
Contributor

This is hugely impressive, very nice.

@simonster
Member

This looks pretty awesome. One thing I've wondered about: would it be reasonable to use an LRU cache for FFTW plan objects created by fft and friends, or even to keep the plans that have been created but not yet garbage collected in a WeakKeyDict and use them when possible? It seems that, even with an existing identical plan that hasn't been destroyed yet, there is still significant overhead for creating a new plan (using ESTIMATE), and this overhead is typically greater than the time to perform the FFT. I've written a decent amount of code that could be a single function but needs to be a type in order to preserve the plan between invocations and thus avoid incurring the overhead of plan creation.
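
As a naive sketch of what I mean (a plain global Dict rather than an LRU or WeakKeyDict, and the choice of key is exactly the open question):

const _plan_cache = Dict{Any,Any}()

function cached_plan_fft(x)
    key = (typeof(x), size(x))   # what to key on would need more thought
    haskey(_plan_cache, key) && return _plan_cache[key]
    _plan_cache[key] = plan_fft(x)
end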

@stevengj
Member Author

@simonster, that's a good idea, but I'd prefer to do that in a generic way that is not restricted to FFTW (i.e. a cache of Plans, not just FFTWPlans). Some care would be required in figuring out what to key this cache on, though.

However, I'd prefer to put that functionality, assuming it can be implemented cleanly, into a separate PR following this one, since there's a lot going on in this PR already. Ideally, this sort of cache would be completely transparent and wouldn't affect the public API.

@stevengj
Member Author

@timholy, in principle it is straightforward to get shared-memory parallelism in an FFT, since many of the loops can be executed in parallel, though getting good cache performance is hard. You can already use FFTW's shared-memory parallelism on any StridedArray. Shared-memory parallelism for the native Julia FFT is probably something for a later patch, though.

@stevengj
Member Author

@tknopp, the latest version of the patch now supports p \ x and inv(p). The inverse plan is computed once, the first time it is needed, and is cached in the plan thereafter.

@tknopp
Contributor

tknopp commented Mar 18, 2014

great, thanks!

# similar to FFTW, and we do the scaling generically to get the ifft:

type ScaledPlan{T} <: Plan{T}
p::Plan{T}
Member

Since Plan is abstract, I think you might have to parametrize by the type of the plan itself to get type inference?

Member Author

Is it possible to write something along the lines of type ScaledPlan{P<:Plan{T}} <: Plan{T}; p::P; ......; end? Otherwise I don't know how to make ScaledPlan a subtype of the correct T for Plan{T}.

Hmm, I guess I could do:

type ScaledPlan{T,P<:Plan} <: Plan{T}
     p::P
     ....
end

and just make sure in the constructor that P <: Plan{T}.
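
i.e., roughly (the scale field is just for illustration):

type ScaledPlan{T,P<:Plan} <: Plan{T}
    p::P
    scale::Float64
    function ScaledPlan(p::P, scale::Real)
        P <: Plan{T} || throw(ArgumentError("plan type must be a Plan{T}"))
        new(p, scale)
    end
end

# outer constructor that fills in both parameters from the plan itself:
ScaledPlan{T}(p::Plan{T}, scale::Real) = ScaledPlan{T,typeof(p)}(p, scale)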

Member

There may be a better way, but one approach that should work is to parametrize by both T and the plan type.

Member Author

I've been playing with this, and it seems rather hard to avoid nasty recursive type definitions. Suppose we have a type FooFFT{T,true} <: Plan{T} that computes FFTs, and that the inverse is of a slightly different type FooFFT{T,false}. I want to put a correctly typed pinv field in it (initially undefined) like this:

type FooFFT{T,forward} <: Plan{T}
    pinv::FooFFT{T,!forward}
    FooFFT() = new()
end

But this isn't allowed (no method !(TypeVar)). So, instead, I parameterize by the type of pinv:

type FooFFT{T,forward,P} <: Plan{T}
    pinv::P
    FooFFT() = new()
end

But then my initialization requires an infinitely recursive type:

FooFFT{T,true,FooFFT{T,false,FooFFT{T,true,...}}}()

One compromise would be to initialize it as

FooFFT{T,true,FooFFT{T,false,Plan{T}}}()

which means that inv(inv(p)) would not have a concrete type that is detectable by the compiler.

Is there a better way?

@jakebolewski
Member

@stevengj this is really nice. I love the new Plan API; it will make it much easier to support the same API for GPU-accelerated FFTs in CLFFT.jl and, I suspect, CUDA's FFT library.

@stevengj
Member Author

Just rebased, and it looks like the performance problems with non-power-of-two sizes have disappeared: if I disable SIMD in FFTW (via the FFTW.NO_SIMD flag), the performance of the pure-Julia version is within a factor of two of FFTW for moderate non-power-of-two composite sizes.

I'm guessing the recent inlining improvements get the credit here, but I would have to bisect to be sure.

@jtravs
Contributor

jtravs commented Apr 17, 2014

This is really exciting! Have you tried whether @simd helps close the last gap in performance? (I've had mixed results with it so far).

@stevengj
Member Author

@jtravs, it looks like @simd makes only a tiny difference (in double precision), which is disappointing since I was hoping it would help with Complex arithmetic.

@stevengj
Member Author

I'm seeing some weirdness in the FFTW linkage now that I've upgraded deps to FFTW 3.3.4; it seems to be getting confused by the FFTW 3.3.3 installed by Homebrew in /usr/local/lib.

In particular, I use

convert(VersionNumber, split(bytestring(cglobal((:fftw_version,Base.DFT.FFTW.libfftw), Uint8)), "-")[2])

to get the FFTW version number. When this command runs during the Julia build in order to initialize the const version in the FFTW module, it returns 3.3.4 as expected. But if I execute the same command at runtime from the REPL, I get 3.3.3. So Julia is linking different versions of the library at build time and at run time?

@staticfloat, could this be similar to the PCRE woes in #1331 and #3838? How can I fix it so that Julia uses its own version of FFTW consistently, regardless of what is installed in /usr/local/lib?

@stevengj
Member Author

@ScottPJones, the performance isn't crucial for merging as long as we still have FFTW for all of the common types, which is why I thought it didn't need to be checked off for 0.4. But it would be nice to figure out why the speed keeps oscillating over time.

@simonster
Member

@stevengj Since #11486 was merged, I think you'll also have to switch the argument order of ntuple in this PR to avoid deprecation warnings.
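
For reference, the deprecation is just the argument-order swap (x and n here are illustrative):

ntuple(n, i -> size(x, i))    # old order, deprecated by #11486
ntuple(i -> size(x, i), n)    # new order: function first, count second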

@ScottPJones
Contributor

Finally back! @stevengj Sorry, I didn't make myself clear: I meant that it looked like the performance issue mentioned in the top comment had been taken care of, and that somebody had struck out the text about it but had simply forgotten to check off the box. I didn't mean at all that this PR should be held up for any reason other than remaining items like deprecation warnings. It looks like a very worthwhile improvement to the code: I like that it uses a more flexible framework, and for people who can't use FFTW, the decent performance it has is still infinitely better than NO performance at all.

@stevengj
Member Author

stevengj commented Jun 1, 2015

Okay, changed the ntuple usage.

@jtravs
Contributor

jtravs commented Jun 7, 2015

I'd just like to register a vote here for merging this soon, even if performance and other issues need to be resolved later. I'm currently having to merge this pull request into a local julia repository and it is a bit of a pain TBH.

@simonster
Member

Yes, +1 for merging

@jtravs
Contributor

jtravs commented Jun 7, 2015

As an additional comment, is there any chance of getting a nicer syntax for preallocated output than A_mul_B!(y, p, x)? I really like the y = P*x syntax (and the style of this PR in general), but I really need fast, preallocated, real ffts in my code and A_mul_B!(y, p, x) isn't very clean or clear. I guess I could write a macro... I also realize that this is a general issue for inplace operations, not just the DFT.
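
For reference, this is the pattern I need (sizes illustrative, and assuming the rfft plans in this PR support A_mul_B! the same way):

x = rand(1024)
p = plan_rfft(x)
y = Array(Complex128, div(length(x), 2) + 1)   # preallocated output buffer
A_mul_B!(y, p, x)                              # works, but isn't very clean or clear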

@nalimilan
Member

@jtravs Yes, see #249.

@simonbyrne
Contributor

It should also be possible to make InplaceOps.jl work for this.

@stevengj stevengj mentioned this pull request Jul 9, 2015
@stevengj
Member Author

Closing this, as the new DFT API has been merged separately; a separate patch with the pure-Julia FFTs is now needed.

@stevengj stevengj closed this Jul 20, 2015
@yuyichao
Contributor

I've rebased this branch, should I push it out?

@stevengj
Member Author

Please do, thanks @yuyichao.

@yuyichao
Contributor

The rebase is here. I simply rebased it and checked all the conflicts.

I didn't check whether the non-conflicting parts are compatible with the current master, and I also didn't squash any of the non-empty commits.

Edit: the branch is on top of the current master now (after my new DFT API tweak for type inference).

@hayd
Member

hayd commented Sep 19, 2015

@yuyichao Bump! Is this going to be a PR (now that 0.4 is branched)? :)

@yuyichao
Contributor

I don't think I'm familiar enough with this to do the PR. I rebased the branch to make sure the part I want to get into 0.4 doesn't mess up the rest of this PR too much.

CC. @stevengj

@stevengj
Member Author

It's not a priority for me at the moment, but it is certainly something that could be rebased onto 0.5 if needed.

@bjarthur
Contributor

Sorry for reviving a now 4+-year-old thread, but is there a differentiable FFT in Julia? IIUC, FFTW.jl is a C wrapper, so it is not, plus it is also GPL. Thanks.

@DhairyaLGandhi

We have the adjoints for FFTW.jl in Zygote.jl already, and will change that to AbstractFFTs.jl very soon too.

https://github.com/FluxML/Zygote.jl/blob/17ca911b82134c4a765822cd2b7ee19e959cc8e4/src/lib/array.jl#L777 for reference
