-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve foldl with tail-call function-barrier #34293
base: master
Are you sure you want to change the base?
Conversation
if v isa T | ||
return _foldl_impl(op, v, itr, y[2]) | ||
else | ||
return _foldl_impl(op, v, itr, y[2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this exactly the same as what's done in the other branch of the conditional? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it helps the compiler. Removing this branch introduces an allocation and makes the code a bit slower:
julia> @eval Base function _foldl_impl(op::OP, init::T, itr) where {OP, T}
y = iterate(itr)
y === nothing && return init
v = op(init, y[1])
return _foldl_impl(op, v, itr, y[2])
end
_foldl_impl (generic function with 4 methods)
julia> @btime sum(x for x in $xs if x !== missing)
780.859 ns (1 allocation: 16 bytes)
But the speedup is not as drastic as I felt while I was playing with it. So I'm OK with removing this micro-optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, as this "no-op if
branch" gets rid of a constant-time work, the difference becomes large if you consider shorter arrays:
julia> xs = [abs(x) < 1 ? x : missing for x in randn(10)];
julia> @btime sum(x for x in $xs if x !== missing)
8.405 ns (0 allocations: 0 bytes)
1.5696312630017393
julia> @eval Base function _foldl_impl(op::OP, init::T, itr) where {OP, T}
y = iterate(itr)
y === nothing && return init
v = op(init, y[1])
return _foldl_impl(op, v, itr, y[2])
end
_foldl_impl (generic function with 4 methods)
julia> @btime sum(x for x in $xs if x !== missing)
29.120 ns (1 allocation: 16 bytes)
1.5696312630017393
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is incredibly bizarre. 😳
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I was just doing union splitting manually.
This PR improves
foldl
for type-based filtering by using the function-barrier at tail-call positions. This improves the performance of the benchmark (equivalent to what) I mentioned in #33526This benchmark is included in JuliaCI/BaseBenchmarks.jl#254
To be honest, I am not 100% sure this is the best approach. Maybe this is too much of a strain to the compiler? Should we care about generated code size? Does x1.8 speedup justify that? As some versions of Julia compiler was able to generate a comparable code, is it better to tweak the compiler to recover the performance?
Implementation notes
As I discussed in this discourse thread, using this trick in generic
foldl
is harder than in implementations ofcollect
-like functions. This is because it is easy to have stack overflow if you don't have some kind of monotonicity in the way the accumulator changes its type. This is why I have the counter at type-level. At the moment the type can be refined only up to three times (counter = Val(3)
) which is arbitrary but I think conservative.Aside: it would be great if Julia compiler generates a finite state machine from the tail calls as Guy Steele argued in the talk I linked in the discourse thread.