
WIP: new DFT api #6193

Closed
wants to merge 12 commits into from

Conversation

stevengj
Member

This is not quite ready to merge yet, but I wanted to bring it to your attention to see what you think. It is a rewritten DFT API that provides:

  • p = plan_fft(x) (and similar) now returns a subtype of Base.DFT.Plan. It acts like a linear operator: you can apply it to an existing array with p * x, apply it to a preallocated output array with A_mul_B!(y, p, x), and apply the inverse plan with p \ x or inv(p) * x (a usage sketch follows this list).
  • It is easy to add new FFT algorithm implementations for specific numeric and array types...
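
For example, a minimal usage sketch (the size and variable names are just illustrative):

x = rand(1024) + rand(1024)*im
p = plan_fft(x)            # a subtype of Base.DFT.Plan
y = p * x                  # apply the plan as a linear operator
A_mul_B!(y, p, x)          # transform into a preallocated output array
x2 = p \ y                 # inverse transform; inv(p) * y is equivalent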

Partly since we have discussed moving FFTW to a module at some point for licensing reasons, and partly to support numeric types like BigFloat, the patch also includes a pure-Julia FFT implementation, with the following features.

  • One-dimensional transforms of arbitrary AbstractVectors of arbitrary Complex types (including BigFloat), with fast algorithms for all sizes (including prime sizes). Multidimensional transforms of arbitrary StridedArray subtypes.
  • Uses code generation (a baby version of the FFTW code generator) in Julia to generate kernels for use as base cases of the transforms; algorithms in general are FFTW1-like.
  • Performance is reasonable. For moderate-power-of-two sizes, it is within a factor of 2 or so of FFTW, with the difference mainly due to the lack of SIMD. For non-power-of-two sizes, the performance is several times worse, but I suspect that there may be some fixable compiler problem there (a good stress test for the compiler). (Non-power-of-two sizes seem fine with latest Julia). (Now they suck again.)
  • import FFTW is enough to automatically switch over to the FFTW algorithms for supported types.

Major to-do items:

  • Documentation
  • More test cases (a self-test algorithm by Funda Ergün is already implemented)

Minor to-do items (will eventually be needed to move the FFTW code out to an external module, but should not block merge):

  • Native Julia rfft support.
  • Native Julia in-place transforms. (The 1d FFT algorithm here is out-of-place, but it could act "in-place" via a buffer. The multi-dim FFTs only need a buffer of length maximum(size(x)).)
  • Performance improvements in non-power-of-two transforms — these underwent a major performance regression at some point (possibly due to the inlining heuristics changing); the right thing is ultimately to change the generator to spit out kernels in terms of real rather than complex arithmetic (or just to use FFTW's generator).

cc: @JeffBezanson, @timholy

@tknopp
Contributor

tknopp commented Mar 18, 2014

I know that we have discussed it in #1805, but it would still be nice if the plan also allowed p \ x in order to perform the inverse FFT. Would it be a large overhead to pair the forward and backward FFTW plans in order to allow this?

@timholy
Sponsor Member

timholy commented Mar 18, 2014

This is quite amazing. Having native code is particularly amazing. For multidimensional transforms, what would be needed to have this easily interact with SharedArrays, so we can parallelize fft computations?

One other tiny question: the word "dimensions" is used somewhat inconsistently in the standard library. Sometimes it refers to the size of an array (in the same way that you might talk about the dimensions of a room in your house), and sometimes it refers to the indices of particular coordinates (i.e., the 2nd dimension). For what you typically mean by dims, the reduction code uses the term region, which I'm not sure is any better. Any thoughts on a consistent terminology?

@tknopp
Contributor

tknopp commented Mar 18, 2014

What about calling it sizes, since size() gives us the "dimensions" of ordinary arrays? I also like shape from NumPy, but as we already inherit size from Matlab we should not mix the two up.

When it comes to selecting one particular "dimension", one could use either dir (for direction) or dim.

@timholy
Sponsor Member

timholy commented Mar 18, 2014

I like that suggestion; but since the same issue just came up on julia-users, and I'll feel guilty if this PR gets hijacked by a terminology bikeshed, let's move this discussion elsewhere.

https://groups.google.com/forum/?fromgroups=#!topic/julia-users/VA5rtWlOdhk

@tknopp
Contributor

tknopp commented Mar 18, 2014

Yes indeed. It is really cool to get a native Julia implementation of the FFT. Even better that this is done by someone that already has some experience with FFTs ... ;-)

@stevengj
Member Author

@tknopp, I see two options. One would be to store both the forward and backward plans in p. This would double the plan-creation time in FFTW, which wouldn't be desirable. The other would be to implement p \ x via conj(p * conj(x))/length(x) or similar, which would work but would add overhead.
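
Roughly, the second option would just exploit the conjugation identity for the inverse DFT; as a sketch (not actual API code):

ifft_via_conj(p, y) = conj(p * conj(y)) / length(y)   # same result that p \ y would give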

@tknopp
Contributor

tknopp commented Mar 18, 2014

Or solution 3: have an extra plan type that supports both directions, and maybe a keyword argument for plan_fft that gives this plan on request (or an extra function).

@stevengj
Member Author

Having another plan type seems messy. Maybe the best solution is to have p\x or inv(p) compute the inverse plan lazily the first time it is needed, maybe caching it in p.
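
Something along these lines, say (the type and the compute_inverse_plan helper are hypothetical; the real field layout would differ):

type MyPlan{T} <: Plan{T}
    pinv::Plan{T}      # cached inverse plan, left undefined until first use
    # ... the actual plan data would go here ...
    MyPlan() = new()   # leave pinv uninitialized
end

function Base.inv(p::MyPlan)
    # build the backward plan only on first use, then reuse it
    isdefined(p, :pinv) || (p.pinv = compute_inverse_plan(p))
    return p.pinv
end

Then p \ x would just fall back to inv(p) * x.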

@tknopp
Contributor

tknopp commented Mar 18, 2014

Well, this would be another option. It could be generalized so that one can specify (as a keyword argument) which of the two plans should be precomputed on initialization.

@timholy
Sponsor Member

timholy commented Mar 18, 2014

OK, now I have an honest-to-goodness engineering question for you. I'm aware that what I'm about to ask makes it harder to cope with my other question about SharedArrays.

I notice that with FFTW, transforming along dimension 2 is substantially slower than transforming along dimension 1 (all times after suitable warmup):

A = rand(1024,1024);
p1 = plan_fft(A, 1);
p2 = plan_fft(A, 2);

julia> @time p1(A);
elapsed time: 0.019999711 seconds (16778600 bytes allocated)

julia> @time p2(A);
elapsed time: 0.070864814 seconds (16778600 bytes allocated)

Presumably this is because of cache. In Images, I've taken @lindahua's advice to heart, and implemented operations on dimensions higher than 1 in terms of interwoven operations along dimension 1. For example, with imfilter_gaussian one gets

julia> copy!(Acopy, A); @time Images._imfilter_gaussian!(Acopy, [3,0]);
elapsed time: 0.028588639 seconds (27768 bytes allocated)

julia> copy!(Acopy, A); @time Images._imfilter_gaussian!(Acopy, [0,3]);
elapsed time: 0.018678493 seconds (52264 bytes allocated)

Slightly faster along dimension 2 than 1! An even more dramatic example is restrict: in Grid I implemented this using BLAS's blazingly-fast axpy! (which must be using SIMD and other fancy tricks), but with more experience I began to wonder whether the need for multiple passes through the data might outweigh its advantages. So in Images I recently tested a new pure-Julia version. Results:

julia> @time Grid.restrict(A, 1, 0.5);
elapsed time: 0.009557864 seconds (4469584 bytes allocated)

julia> @time Grid.restrict(A, 2, 0.5);
elapsed time: 0.040201774 seconds (4461440 bytes allocated)

julia> @time Images.restrict(A, (1,));
elapsed time: 0.008084057 seconds (4206936 bytes allocated)

julia> @time Images.restrict(A, (2,));
elapsed time: 0.010128509 seconds (4206936 bytes allocated)

Notice that the 4-fold difference with Grid's version is about the same as for FFTW. I'm not sure the FFT can be written so easily in terms of interwoven operations, but given the crucial importance of this algorithm in modern computing I thought it might be worth asking. No better time than when you're in the middle of re-implementing it from scratch.

@stevengj
Member Author

FFTW does implement a variety of strategies for "interweaving" the transforms along the discontiguous direction (see our Proc. IEEE paper). You should notice a much smaller performance difference if you use the FFTW.PATIENT flag when creating the plan. But probably you should file an FFTW issue if you want to discuss this further.
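
For example, something like the following (assuming plan_fft's optional positional flags argument; the exact form may differ with this PR's API). PATIENT planning takes longer up front, but usually handles the discontiguous dimension much better than the default FFTW.ESTIMATE:

p2 = plan_fft(A, 2, FFTW.PATIENT)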

@simonbyrne
Contributor

This is hugely impressive, very nice.

@simonster
Member

This looks pretty awesome. One thing I've wondered about: would it be reasonable to use an LRU cache for FFTW plan objects created by fft and friends, or even to keep the plans that have been created but not yet garbage collected in a WeakKeyDict and use them when possible? It seems that, even with an existing identical plan that hasn't been destroyed yet, there is still significant overhead for creating a new plan (using ESTIMATE), and this overhead is typically greater than the time to perform the FFT. I've written a decent amount of code that could be a single function but needs to be a type in order to preserve the plan between invocations and thus avoid incurring the overhead of plan creation.
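
As a naive sketch of what I mean (a plain global Dict rather than an LRU or WeakKeyDict, and the choice of key is exactly the open question):

const _plan_cache = Dict{Any,Any}()

function cached_plan_fft(x)
    key = (typeof(x), size(x))   # what to key on would need more thought
    haskey(_plan_cache, key) && return _plan_cache[key]
    _plan_cache[key] = plan_fft(x)
end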

@stevengj
Member Author

@simonster, that's a good idea, but I'd prefer to do that in a generic way that is not restricted to FFTW (i.e. a cache of Plans, not just FFTWPlans). Some care would be required in figuring out what to key this cache on, though.

However, I'd prefer to put that functionality, assuming it can be implemented cleanly, into a separate PR following this one, since there's a lot going on in this PR already. Ideally, this sort of cache would be completely transparent and wouldn't affect the public API.

@stevengj
Member Author

@timholy, in principle it is straightforward to get shared-memory parallelism in an FFT, since many of the loops can be executed in parallel, though getting good cache performance is hard. You can already use FFTW's shared-memory parallelism on any StridedArray. Shared-memory parallelism for the native Julia FFT is probably something for a later patch, though.

@stevengj
Member Author

@tknopp, the latest version of the patch now supports p \ x and inv(p). The inverse plan is computed once, the first time it is needed, and is cached in the plan thereafter.

@tknopp
Contributor

tknopp commented Mar 18, 2014

great, thanks!

# similar to FFTW, and we do the scaling generically to get the ifft:

type ScaledPlan{T} <: Plan{T}
p::Plan{T}
Member

Since Plan is abstract, I think you might have to parametrize by the type of the plan itself to get type inference?

Member Author

Is it possible to write something along the lines of type ScaledPlan{P<:Plan{T}} <: Plan{T}; p::P; ......; end? Otherwise I don't know how to make ScaledPlan a subtype of the correct T for Plan{T}.

Hmm, I guess I could do:

type ScaledPlan{T,P<:Plan} <: Plan{T}
     p::P
     ....
end

and just make sure in the constructor that P <: Plan{T}.
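
i.e., roughly (the scale field is just for illustration):

type ScaledPlan{T,P<:Plan} <: Plan{T}
    p::P
    scale::Float64
    function ScaledPlan(p::P, scale::Real)
        P <: Plan{T} || throw(ArgumentError("plan type must be a Plan{T}"))
        new(p, scale)
    end
end

# outer constructor that fills in both parameters from the plan itself:
ScaledPlan{T}(p::Plan{T}, scale::Real) = ScaledPlan{T,typeof(p)}(p, scale)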

Member

There may be a better way, but one approach that should work is to parametrize by both T and the plan type.

Member Author

I've been playing with this, and it seems rather hard to avoid nasty recursive type definitions. Suppose we have a type FooFFT{T,true} <: Plan{T} that computes FFTs, and that the inverse is of a slightly different type FooFFT{T,false}. I want to put a correctly typed pinv field in it (initially undefined) like this:

type FooFFT{T,forward} <: Plan{T}
    pinv::FooFFT{T,!forward}
    FooFFT() = new()
end

But this isn't allowed (no method !(TypeVar)). So, instead, I parameterize by the type of pinv:

type FooFFT{T,forward,P} <: Plan{T}
    pinv::P
    FooFFT() = new()
end

But then my initialization requires an infinitely recursive type:

FooFFT{T,true,FooFFT{T,false,FooFFT{T,true,...}}}()

One compromise would be to initialize it as

FooFFT{T,true,FooFFT{T,false,Plan{T}}}()

which means that inv(inv(p)) would not have a concrete type that is detectable by the compiler.

Is there a better way?

@jakebolewski
Member

@stevengj this is really nice. I love the new Plan API; it will make it much easier to support the same API for GPU-accelerated FFTs in CLFFT.jl and, I suspect, CUDA's FFT library.

@stevengj
Member Author

Just rebased, and it looks like the performance problems with non-power-of-two sizes have disappeared: if I disable SIMD in FFTW (via the FFTW.NO_SIMD flag), the performance of the pure-Julia version is within a factor of two of FFTW for moderate non-power-of-two composite sizes.

I'm guessing the recent inlining improvements get the credit here, but I would have to bisect to be sure.

@jtravs
Contributor

jtravs commented Apr 17, 2014

This is really exciting! Have you tried whether @simd helps close the last gap in performance? (I've had mixed results with it so far).

@stevengj
Member Author

@jtravs, it looks like @simd makes only a tiny difference (in double precision), which is disappointing since I was hoping it would help with Complex arithmetic.

@stevengj
Member Author

I'm seeing some weirdness in the FFTW linkage now that I've upgraded deps to FFTW 3.3.4; it seems to be getting confused by the FFTW 3.3.3 installed by Homebrew in /usr/local/lib.

In particular, I use

convert(VersionNumber, split(bytestring(cglobal((:fftw_version,Base.DFT.FFTW.libfftw), Uint8)), "-")[2])

to get the FFTW version number. When this command runs during the Julia build in order to initialize the const version in the FFTW module, it returns 3.3.4 as expected. But if I execute the same command at runtime from the REPL, I get 3.3.3. So Julia is linking different versions of the library at build time and at run time?

@staticfloat, could this be similar to the PCRE woes in #1331 and #3838? How can I fix it so that Julia uses its own version of FFTW consistently, regardless of what is installed in /usr/local/lib?

@stevengj
Member Author

@ScottPJones, the performance isn't crucial for merging as long as we still have FFTW for all of the common types, which is why I thought it didn't need to be checked off for 0.4. But it would be nice to figure out why the speed keeps oscillating over time.

@simonster
Member

@stevengj Since #11486 was merged, I think you'll also have to switch the argument order of ntuple in this PR to avoid deprecation warnings.
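
For reference, the deprecation is just the argument-order swap (x and n here are illustrative):

ntuple(n, i -> size(x, i))    # old order, deprecated by #11486
ntuple(i -> size(x, i), n)    # new order: function first, count second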

@ScottPJones
Contributor

Finally back! @stevengj Sorry, I didn't make myself clear: I meant that it looked like the performance issue mentioned in the top comment had been taken care of, and that somebody had struck out the text about it but had simply forgotten to check off the box. I didn't mean at all that this PR should be held up for any reason other than remaining items like deprecation warnings. It looks like a very worthwhile improvement to the code: I like that it uses a more flexible framework, and for people who can't use FFTW, the decent performance it has is still infinitely better than NO performance at all.

@stevengj
Member Author

stevengj commented Jun 1, 2015

Okay, changed the ntuple usage.

@jtravs
Contributor

jtravs commented Jun 7, 2015

I'd just like to register a vote here for merging this soon, even if performance and other issues need to be resolved later. I'm currently having to merge this pull request into a local julia repository and it is a bit of a pain TBH.

@simonster
Member

Yes, +1 for merging

@jtravs
Contributor

jtravs commented Jun 7, 2015

As an additional comment, is there any chance of getting a nicer syntax for preallocated output than A_mul_B!(y, p, x)? I really like the y = P*x syntax (and the style of this PR in general), but I really need fast, preallocated, real ffts in my code and A_mul_B!(y, p, x) isn't very clean or clear. I guess I could write a macro... I also realize that this is a general issue for inplace operations, not just the DFT.
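
For reference, this is the pattern I need (sizes illustrative, and assuming the rfft plans in this PR support A_mul_B! the same way):

x = rand(1024)
p = plan_rfft(x)
y = Array(Complex128, div(length(x), 2) + 1)   # preallocated output buffer
A_mul_B!(y, p, x)                              # works, but isn't very clean or clear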

@nalimilan
Member

@jtravs Yes, see #249.

@simonbyrne
Contributor

It should also be possible to make InplaceOps.jl work for this.

@stevengj stevengj mentioned this pull request Jul 9, 2015
@stevengj
Member Author

Closing this, as the new DFT API has been merged separately; a separate patch with the pure-Julia FFTs is now needed.

@stevengj stevengj closed this Jul 20, 2015
@yuyichao
Contributor

I've rebased this branch, should I push it out?

@stevengj
Member Author

Please do, thanks @yuyichao.

@yuyichao
Contributor

The rebase is here. I simply rebased it and checked all the conflicts.

I didn't check whether the non-conflicting parts are compatible with the current master, and I also didn't squash any of the non-empty commits.

Edit: the branch is on top of the current master now (after my new DFT API tweak for type inference).

@hayd
Member

hayd commented Sep 19, 2015

@yuyichao Bump! Is this going to be a PR (now that 0.4 is branched)? :)

@yuyichao
Contributor

I don't think I'm familiar enough with this to do the PR. I rebased the branch to make sure the part I want to get into 0.4 doesn't mess up the rest of this PR too much.

CC. @stevengj

@stevengj
Member Author

It's not a priority for me at the moment, but it is certainly something that could be rebased onto 0.5 if needed.

@bjarthur
Contributor

Sorry for reviving a now 4+-year-old thread, but is there a differentiable FFT in Julia? IIUC, FFTW.jl is a C wrapper, so it is not, plus it is also GPL. Thanks.

@DhairyaLGandhi

We have the adjoints for FFTW.jl in Zygote.jl already, and will change that to AbstractFFTs.jl very soon too.

https://github.com/FluxML/Zygote.jl/blob/17ca911b82134c4a765822cd2b7ee19e959cc8e4/src/lib/array.jl#L777 for reference
