Add onepass algorithm for logsumexp #97
Conversation
Seems like we could just use this algorithm for …

The algorithm doesn't handle reductions along certain dimensions, so I guess it would only replace the case with …

Bump :)
Looks good to me.
One thing I did notice in the checks, not specific to this PR, is that the package is only being tested on Julia 1.0, 1.3, and nightly. Perhaps the Travis and AppVeyor configurations should be updated.
I addressed this issue in a separate PR.

Thank you.
AFAICT your algorithm is only very slightly slower for small arrays, and much faster for large arrays, right? So better to use it instead of the existing implementation. EDIT: I hadn't realized …
@simonbyrne wrote the previous implementation in #45, so I'd like him to confirm it's OK, in particular for custom arrays like … I also wonder why …
Calling … The one-pass algorithm could be used as a replacement for the generic … Therefore I created a separate …

Edit: If it is desired, we could dispatch …

Edit 2: Alternatively, one could use the …
Unfortunately, ArrayInterface currently depends on Requires, which increases the load time significantly and is therefore probably not OK for this package. We really need these kinds of traits in a lightweight package containing only basic definitions. I guess in the meantime it would be OK to use … Though another more interesting solution would be to adapt the one-pass algorithm to use …
Hmm, I don't see how this would be possible with …
Yeah, the algorithm currently can only handle one new element at a time. But I wonder whether it would be possible to use a trick in order to keep track of whether the reduction operation is currently working on a single element or on a partial sum of multiple elements. Something like

```julia
mapreduce(x -> (x, 1),
          ((x, i), (y, j)) -> (i == 1 && j == 1 ? ... : (i == 1 ? ... : (j == 1 ? ... : ...)), i + j),
          v)
```

AFAICT that would work if we have a formula to combine two partial sums into a single one. Is that the case? :-) If that doesn't work, let's go with the special-casing of …
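(Such a formula does exist. A minimal sketch, assuming each partial state is a pair `(m, s)` with `m` the maximum of the elements absorbed so far and `s = sum(exp.(seen .- m))`; the name `combine` is hypothetical:)

```julia
# Combine two partial log-sum-exp states (m1, s1) and (m2, s2), where
# m + log(s) equals the logsumexp of the elements each state has absorbed.
# Rescaling both sums to the larger maximum keeps the exponentials bounded.
function combine((m1, s1), (m2, s2))
    m = max(m1, m2)
    return (m, s1 * exp(m1 - m) + s2 * exp(m2 - m))
end
```

(Edge cases such as both maxima being `-Inf` would still need the special-casing discussed below.)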
Yes, I guess that could work 👍 The reduction of two partial sums seems to be very similar to the currently implemented version AFAICT. I'll see if it works and benchmark both variants.
Hmm, it works mathematically but it seems to be slower:

```julia
julia> using StatsFuns, BenchmarkTools, Random

julia> Random.seed!(100);

julia> n = 10_000_000;

julia> X = 500 .* randn(n);

julia> @btime logsumexp($X)
  106.007 ms (0 allocations: 0 bytes)
2570.6633157197443

julia> @btime StatsFuns.logsumexp_onepass($X)
  42.388 ms (0 allocations: 0 bytes)
2570.6633157197443

julia> @btime StatsFuns.logsumexp_onepass_reduce($X)
  74.562 ms (0 allocations: 0 bytes)
2570.6633157197443

julia> f(X) = reduce(logaddexp, X)
f (generic function with 1 method)

julia> @btime f($X)
  163.131 ms (0 allocations: 0 bytes)
2570.6633157197443
```

The implementation should be quite optimized and is very similar to the one of `logsumexp_onepass`:

```julia
function logsumexp_onepass_reduction((xmax1, r1)::T, (xmax2, r2)::T) where {T<:Tuple}
    if xmax1 < xmax2
        xmax = xmax2
        r = r2 + r1 * exp(xmax1 - xmax2)
    elseif xmax1 > xmax2
        xmax = xmax1
        r = r1 + r2 * exp(xmax2 - xmax1)
    else # ensure finite values if x = xmax = ±Inf
        xmax = xmax1
        r = r1 + r2
    end
    return xmax, r
end

function logsumexp_onepass_reduce(X)
    isempty(X) && return log(sum(X))
    xmax, r = mapreduce(logsumexp_onepass_reduction, X) do x
        x, float(one(x))
    end
    return xmax + log(r)
end
```
Ah, too bad. Though it's a bit faster than the current implementation. I wonder what the result would be for … Have you tried forcing …?
There's no need for …

Sorry, I meant …

Yeah, I agree, that would be great. Also it seems that by reducing the partial sums it would be trivial to implement …

Unfortunately, the timings are the same even if I add an …

OK, too bad. Maybe it would be faster if you passed …
src/basicfuns.jl (Outdated)

```julia
    return xmax + log(r)
end

# with initial element: required by CUDA
```
I guess it's also required by `Array` when the input is empty, right?
No, empty inputs are handled by `log(sum(X))`, as before. Actually, we never want to use this method except for GPU arrays: the type calculations are doomed to fail, e.g., for arrays that are not concretely typed, and hence it is much safer to not rely on them and instead just compute the output. I'm not sure why the type calculations in https://github.com/JuliaGPU/GPUArrays.jl/blob/356c5fe3f83f76f8139b4edf5536cbd08d48da7f/src/host/mapreduce.jl#L49-L52 fail; I can check if this can be fixed from our side. Alternatively, maybe another package such as NNlib should provide a GPU-compatible overload of `StatsFuns.logsumexp`?
OK. They fail because `neutral_element` is only defined for a few particular functions. Another approach would be to use the type of the first element, since we already handle the case where the array is empty. That way we wouldn't need two different methods depending on whether the eltype is known.

Given that it seems doable, it would be nice to have something that works for all arrays without another package having to overload the method.
IMO it is just bad that GPUArrays requires us to provide an initial or neutral element. We really never want to do that, regardless of whether there is a first element or not. Retrieving the first element is bad for any stateful iterator, and is a scalar indexing operation on the GPU (which leads to warnings when calling the function, and we cannot remove them without depending on CUDA). Additionally, the docstring of `mapreduce` says that

> It is unspecified whether init is used for non-empty collections.

so there is really no point in providing `init` here, apart from GPU compatibility. The problem is that we cannot restrict our implementation with the `init` hack to only GPU arrays without taking a dependency on CUDA. Hence I don't think it is completely unreasonable to provide a GPU-compatible implementation somewhere else. An alternative would be to not use the one-pass algorithm as the default algorithm (for arrays).
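(For concreteness, a minimal sketch of what the `init` hack could look like, reusing `logsumexp_onepass_reduction` from above; the name `logsumexp_onepass_init` is hypothetical, and whether this actually sidesteps the GPUArrays failure would need to be verified:)

```julia
# Hypothetical sketch: seed mapreduce with a neutral state so GPUArrays
# does not need to infer one. (-Inf, 0) is neutral for the reduction above:
# combining it with any state (x, 1) leaves that state unchanged.
function logsumexp_onepass_init(X::AbstractArray{T}) where {T<:Real}
    FT = float(T)
    xmax, r = mapreduce(x -> (FT(x), one(FT)), logsumexp_onepass_reduction, X;
                        init=(FT(-Inf), zero(FT)))
    return xmax + log(r)
end
```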
Retrieving the first element should be OK for any `AbstractArray` (contrary to some iterables). It's too bad that CUDA prints warnings in that case.

Let's keep what you have now: since the method only accepts arrays with `T<:Real` element type, `float(T)` will (almost) certainly work.
Just a few remaining details.
Cool. Looks good if you confirm the latest version is as fast as the previous one. I have two stylistic comments.
Same benchmark as above now gives

```julia
julia> @btime StatsFuns.logsumexp($X)
  37.203 ms (0 allocations: 0 bytes)
2570.6633157197443
```

on my computer.
For the calls with `dims` I used

```julia
julia> using StatsFuns, BenchmarkTools, Random

julia> Random.seed!(100);

julia> n = 100_000;

julia> X = 500 .* randn(n, 100);
```

and got on this PR

```julia
julia> @btime StatsFuns.logsumexp($X; dims=2);
  99.331 ms (8 allocations: 2.29 MiB)

julia> @btime StatsFuns.logsumexp($X; dims=1);
  63.163 ms (2 allocations: 2.64 KiB)

julia> @btime StatsFuns.logsumexp($X; dims=(1,2));
  66.100 ms (6 allocations: 352 bytes)
```

whereas master yields

```julia
julia> @btime StatsFuns.logsumexp($X; dims=2);
  117.644 ms (16 allocations: 78.58 MiB)

julia> @btime StatsFuns.logsumexp($X; dims=1);
  86.431 ms (5 allocations: 76.30 MiB)

julia> @btime StatsFuns.logsumexp($X; dims=(1,2));
  72.502 ms (13 allocations: 76.29 MiB)
```
Thanks @devmotion!
This PR fixes #91.
I used the streaming algorithm for the computation of logsumexp described in http://www.nowozin.net/sebastian/blog/streaming-log-sum-exp-computation.html in one of my projects lately, and I noticed that there is already an open issue for adding it to StatsFuns.
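(For readers unfamiliar with the approach, a minimal sketch of the streaming algorithm from that post follows; this is not the exact code in this PR, and it ignores the ±Inf edge cases the PR handles:)

```julia
# Single pass over X: maintain the running maximum xmax and the rescaled
# running sum r = sum(exp.(seen .- xmax)), so that logsumexp = xmax + log(r).
function logsumexp_streaming(X)
    xmax, r = -Inf, 0.0
    for x in X
        if x <= xmax
            r += exp(x - xmax)         # no new maximum: just accumulate
        else
            r = r * exp(xmax - x) + 1  # rescale the old sum to the new maximum
            xmax = x
        end
    end
    return xmax + log(r)
end
```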
I reran the benchmark mentioned in the issue and got:
Currently `logsumexp_onepass` is used as the fallback for `logsumexp` but is not exported. I'm not sure if it should be exported as well. Additionally, maybe it would be beneficial to use the one-pass algorithm even for arrays above a certain size, since it seems to perform better in the benchmark above. Some additional benchmarks suggest that this might be the case even for smaller arrays:

Edit: This PR also fixes the example in #63:
Additionally, it addresses (and fixes?) #75 (but maybe we would still want the two-pass algorithm instead?).
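(For comparison with the two-pass approach mentioned in #75, a textbook two-pass version looks roughly like the sketch below; this illustrates the general idea, not the exact previous implementation:)

```julia
# Pass 1 finds the maximum; pass 2 sums the rescaled exponentials.
# Subtracting xmax before exponentiating avoids overflow.
function logsumexp_twopass(X)
    xmax = maximum(X)
    return xmax + log(sum(x -> exp(x - xmax), X))
end
```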