speed up filter of vector #31929
Conversation
I am not entirely sure whether we can give … For this, we would do …

This would work for simple offset vectors. Are there valid abstract vectors where this would fail? Are we OK with returning a …?
I think this could be ready to go: all dispatches now rely on the same interface functions as before (…).

The test failures look unrelated to me. Can we re-run the tests? Can we get a nanosoldier run to catch unexpected perf regressions and otherwise delight in smaller numbers?

Then there is the question of the time-memory tradeoff: the … Any votes for speeding this up even further, at the cost of potentially increased memory consumption? I tend to be against it: the overhead of …
All CI seems to fail with the same tests; are you sure they are unrelated? I would expect they are not, but we can easily rerun the tests.
You are correct, and I failed at reading comprehension of the test results. Sorry!
base/array.jl
Outdated
```julia
j = ifelse(f(ai), j+1, j)
end
resize!(idxs, j-1)
return a[idxs]
```
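For context, here is a self-contained sketch of the index-vector strategy this hunk belongs to: build a `Vector{Int}` of kept indices without branching, truncate it, then do one bulk `getindex`. The function name and exact structure are illustrative, not the PR's actual code:

```julia
# Hypothetical stand-alone version of the index-vector filter.
function filter_via_indices(f, a::AbstractVector)
    idxs = Vector{Int}(undef, length(a))
    j = 1
    @inbounds for i in eachindex(a)
        idxs[j] = i                  # write the candidate index unconditionally
        j = ifelse(f(a[i]), j+1, j)  # advance only if the predicate holds
    end
    resize!(idxs, j-1)               # drop the unused tail
    return a[idxs]                   # single vectorized getindex
end

filter_via_indices(iseven, collect(1:10))  # => [2, 4, 6, 8, 10]
```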
I don't think this is generally the best algorithm. In some cases the extra index vector might be faster, but not enough to justify using twice the memory in many cases.
Well, it is using up to 8x the temporary memory (`Int` instead of `Bool`), same as a view. However, the temp is very temporary. Should we manually free its buffer immediately via `resize!(idxs, 0); sizehint!(idxs, 0)` in order to limit GC pressure? I am not seeing much precedent for that in Base, but I like the idea.

Or is the issue that out-of-place filtering of giant vectors of tiny objects can cause OOM errors?
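A minimal sketch of the buffer-freeing idiom under discussion, assuming a plain `Vector{Int}` temporary:

```julia
idxs = collect(1:10^6)   # large temporary (~8 MB of Int)
# ... idxs is used for one getindex call, then no longer needed ...
resize!(idxs, 0)         # length is now 0; the backing buffer is still ~8 MB
sizehint!(idxs, 0)       # ask the runtime to shrink the buffer immediately,
                         # rather than keep the dead memory alive until GC
```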
```julia
b = Vector{T}(undef, length(a))
for ai in a
    @inbounds b[j] = ai
    j = ifelse(f(ai), j+1, j)
```
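A runnable sketch of the branchless out-of-place filter this hunk is taken from; the surrounding function is reconstructed here, so treat names and details as assumptions:

```julia
function filter_branchless(f, a::Vector{T}) where {T}
    b = Vector{T}(undef, length(a))  # over-allocate to the maximum possible size
    j = 1
    for ai in a
        @inbounds b[j] = ai          # store every element at the cursor...
        j = ifelse(f(ai), j+1, j)    # ...but only advance the cursor when kept
    end
    resize!(b, j-1)                  # truncate to the number of kept elements
    return b
end

filter_branchless(isodd, [1, 2, 3, 4, 5])  # => [1, 3, 5]
```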
I would guess LLVM is smart enough to eliminate the branch if you write it like this?
```julia
j = ifelse(f(ai), j+1, j)
j = f(ai) ? j+1 : j
```
LLVM should. You can also use the rules of arithmetic to write it as:
```julia
j = ifelse(f(ai), j+1, j)
j += f(ai)::Bool
```
Regarding the `j += f(ai)::Bool` variant: that works, but appears to be no faster and sometimes slightly slower. My naive guess would be that `f(ai)` is typically encoded in a flag that can be directly used for a conditional move, and converting the flag to a 0/1 integer register plus an addition takes longer (even though LLVM should know that both are semantically the same and should select the fastest version regardless of what I wrote). Also, there used to be cases where explicit typeasserts confused the compiler, even when it could correctly infer that `f(ai)` is a `Bool`. But I may well be hallucinating issues where there are none; should I still change it?

Regarding the `j = f(ai) ? j+1 : j` spelling: that becomes the same compiled code as `ifelse(f(ai), j+1, j)`. I just thought that the `ifelse` was more informative to the reader, since we explicitly want a conditional move instead of a branch. With this explanation, do you still prefer the `?:` spelling? Then I'll change it.
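For anyone who wants to check the claim, here is one way to compare the three spellings. The loop bodies below are minimal stand-ins for the filter kernel, and `BenchmarkTools` is assumed to be available:

```julia
using BenchmarkTools

count_ifelse(f, a)  = (j = 0; for x in a; j = ifelse(f(x), j+1, j); end; j)
count_ternary(f, a) = (j = 0; for x in a; j = f(x) ? j+1 : j;       end; j)
count_add(f, a)     = (j = 0; for x in a; j += f(x);                end; j)

a = rand(Int, 10^6)
@btime count_ifelse(iseven, $a)   # all three usually compile to branchless code,
@btime count_ternary(iseven, $a)  # so any differences come down to instruction
@btime count_add(iseven, $a)      # selection (cmov vs. setcc + add)
```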
What's the status of this? How would the examples in #32303 run with these changes?
I think this should be ready, and the test failure looks unrelated. Rerunning tests to see.
Could you quickly run the examples in #32303? Is it true that with this PR the …
Before: …
After: …
To answer your question, look at the benchmarks at the top: you see that every AbstractArray with linear indexing (especially AbstractVector) uses a slower algorithm, because we cannot assume that its indices are …
This is amazing work! If I read the original benchmarks correctly, this PR even improves the case when the success rate is very small? That's when the current …
Yes. The only case that should currently slow down is when you have a very cheap predicate and filter a very small (size k) subset out of a giant Array (size n) of non-isbits type. In that case, the new algorithm allocates a new array of length n, fills it, and afterwards truncates it to size k. That could be done almost for free with appropriate syscalls, but is currently expensive (we need to zero 8*n bytes twice, once in the kernel pagefault handler and once in the Julia initializer for the array; optimally we would only write O(k) bytes to memory). The current mood appears to be against mucking with OS internals, but I hope that this will get fixed eventually.
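To make the described regression concrete, a small illustration; the sizes and predicate are made up for demonstration:

```julia
# Giant non-isbits array, cheap predicate, tiny fraction of hits:
a = [string(i) for i in 1:10^7]
hits = filter(s -> endswith(s, "999999"), a)   # keeps 10 of 10^7 elements
# New algorithm: allocates a full-length pointer array (8 * 10^7 bytes,
# zeroed on allocation), fills it, then truncates it to ~10 slots.
# The old push!-based code only ever touched O(k) bytes for k kept elements.
```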
Fixes #32303
Single test failure looks unrelated ("Error: Download failed: curl: (6) Could not resolve host: frippery.org"). Any change requests? Rebase & squash & merge? Or let it wait some more time?
I think we should merge.
Bump? :)
Fix filter after JuliaLang/julia#31929.
```julia
empty!(idxs)
sizehint!(idxs, 0)
```
Does this really improve performance in the end? This breaks with `AbstractArray`s for which `getindex` returns a view that uses `idxs` to store its indices. It's not clear whether the `AbstractArray` interface allows that or not. See JuliaData/DataFrames.jl#3192.
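A sketch of the failure mode described here, with a hypothetical wrapper type; `LazyVec` and its methods are invented for illustration, and DataFrames' actual case is more involved:

```julia
struct LazyVec <: AbstractVector{Int}
    data::Vector{Int}
end
Base.size(v::LazyVec) = size(v.data)
Base.getindex(v::LazyVec, i::Int) = v.data[i]
# Vector-valued getindex returns a view that *aliases* the index vector:
Base.getindex(v::LazyVec, idxs::Vector{Int}) = view(v.data, idxs)

lv = LazyVec(collect(10:10:50))
idxs = [1, 3, 5]
result = lv[idxs]        # the view keeps a reference to idxs
empty!(idxs)             # what filter does to its temporary...
sizehint!(idxs, 0)
length(result)           # => 0 -- the "result" was silently gutted
```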
@nalimilan, maybe open an issue so this doesn't get lost?
Sure: #47078
Use branchfree code for filtering vectors. Part of the saved time is then spent resizing the buffer of the result, such that memory consumption is reduced if the result should happen to be long-lived.

I observe a 4x speedup for in-place filtering and 7x for out-of-place filtering in the 50% case. If desired, we can obtain an additional 30% speedup by skipping the `realloc` of the buffer, i.e. the `sizehint!`; this will, however, increase memory consumption. Since the result may be long-lived, I tend to be conservative with eating user memory. 1 ms corresponds to about 2 cycles / element.

Abstract assumptions (updated): `getindex(a, idxs::Vector{Int})`

Old algorithms: Branchy in-place filtering for AbstractVector; out-of-place, logical indexing for AbstractArray and `push!`-based for Vector.

New algorithms: Branchless in-place filtering for AbstractVector; branchless out-of-place for Array; `Vector{Int}`-indexing for AbstractArray with IndexLinear (using branchless construction of the indexing vector; this is an 8 times larger temporary than the old code's logical indexing, so we immediately free its buffer via `empty!(tmp); sizehint!(tmp, 0)` to relieve GC pressure). For AbstractArray without IndexLinear, we continue to use the old logical indexing algorithm.

Memory consumption: We have a larger temporary to store the selected indices. In case of in-place filtering of Vector, we now shrink the underlying buffer after filtering. In case of out-of-place filtering of Vector, we now allocate a potentially larger buffer and shrink it afterwards, instead of relying on the `push!`-resize logic. This can be either a gain or a loss with respect to peak memory consumption, and is almost always a win with respect to sustained memory consumption for long-lived results (the old `push!`-based code leads to up to 2x overallocation of the resulting object, and did not truncate the resulting buffer). This should be a gain with respect to GC pressure.

Updated (2) Microbenchmark:
Before: …
After: …
Monkey-patch: …
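The original benchmark output did not survive extraction; a comparable microbenchmark can be reproduced along these lines (array size and predicate are assumptions, `BenchmarkTools` assumed available):

```julia
using BenchmarkTools

a = rand(10^6)                   # ~50% success rate with the predicate below
@btime filter(x -> x < 0.5, $a)  # out-of-place
@btime filter!(x -> x < 0.5, b) setup=(b = copy($a)) evals=1  # in-place
```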