Add optimised methods for reduce(hcat/vcat) on any iterators of vectors #31644

Closed
wants to merge 7 commits

Conversation


@oxinabox oxinabox commented Apr 7, 2019

This covers the two vector cases of #31636, which I think are the most important. Basically, it makes all iterators comparably performant to Arrays for this.

Looking at it, I do wonder if these (including the existing ones from #21672) should be pushed down to be specialisations of mapreduce(identity, vcat/hcat, ...). Not sure though.
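For context, the kind of iterator-generic method under discussion can be sketched as below (an illustrative sketch, not the PR's actual code; the name myreduce_vcat is made up):

```julia
# Illustrative sketch of reducing vcat over any iterator of vectors,
# using the iterate protocol rather than indexing. Not the PR's code.
function myreduce_vcat(xs)
    x_state = iterate(xs)
    x_state === nothing && throw(ArgumentError("reducing over an empty collection is not allowed"))
    x1, state = x_state
    ret = copy(x1)                      # result starts as a copy of the first vector
    x_state = iterate(xs, state)
    while x_state !== nothing
        x, state = x_state
        append!(ret, x)                 # grow the result in place
        x_state = iterate(xs, state)
    end
    return ret
end

myreduce_vcat(v for v in [[1, 2], [3, 4]])  # works on generators too
```

Because this only uses the iterate protocol, it applies equally to Arrays, Generators, and Filters, which is the point of the benchmarks below.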

Benchmarks:

Data

Covering each combination of iterator traits

v_v = [rand(128) for ii in 1:1000]
g_v = (x for x in v_v)
f_g_v = Iterators.filter(x->true, g_v)
f_v_v = Iterators.filter(x->true, v_v);

Results:

Vector
--- hcat, on: Array{Array{Float64,1},1} ---
splatting: 
  148.638 μs (8 allocations: 1023.94 KiB)
reduce old: 
  60.284 μs (2 allocations: 1000.08 KiB)
reduce new: 
  57.536 μs (2 allocations: 1000.08 KiB)

--- hcat, on: Base.Generator{Array{Array{Float64,1},1},getfield(Main, Symbol("##159#160"))} ---
splatting: 
  188.010 μs (1505 allocations: 1.05 MiB)
reduce old: 
  259.465 ms (2985 allocations: 488.88 MiB)
reduce new: 
  58.025 μs (2 allocations: 1000.08 KiB)

--- hcat, on: Base.Iterators.Filter{getfield(Main, Symbol("##163#164")),Array{Array{Float64,1},1}} ---
splatting: 
  190.236 μs (1505 allocations: 1.05 MiB)
reduce old: 
  282.979 ms (2985 allocations: 488.88 MiB)
reduce new: 
  187.215 μs (13 allocations: 2.00 MiB)

--- hcat, on: Base.Iterators.Filter{getfield(Main, Symbol("##161#162")),Base.Generator{Array{Array{Float64,1},1},getfield(Main, Symbol("##159#160"))}} ---
splatting: 
  191.919 μs (1505 allocations: 1.05 MiB)
reduce old: 
  266.831 ms (2985 allocations: 488.88 MiB)
reduce new: 
  228.940 μs (13 allocations: 2.00 MiB)

=================
--- vcat, on: Array{Array{Float64,1},1} ---
splatting: 
  61.553 μs (3 allocations: 1008.02 KiB)
reduce old: 
  57.962 μs (2 allocations: 1000.08 KiB)
reduce new: 
  55.552 μs (2 allocations: 1001.20 KiB)

--- vcat, on: Base.Generator{Array{Array{Float64,1},1},getfield(Main, Symbol("##159#160"))} ---
splatting: 
  103.196 μs (1500 allocations: 1.03 MiB)
reduce old: 
  282.454 ms (1984 allocations: 488.85 MiB)
reduce new: 
  55.424 μs (2 allocations: 1001.20 KiB)

--- vcat, on: Base.Iterators.Filter{getfield(Main, Symbol("##163#164")),Array{Array{Float64,1},1}} ---
splatting: 
  105.145 μs (1500 allocations: 1.03 MiB)
reduce old: 
  281.383 ms (1984 allocations: 488.85 MiB)
reduce new: 
  157.521 μs (11 allocations: 2.00 MiB)

--- vcat, on: Base.Iterators.Filter{getfield(Main, Symbol("##161#162")),Base.Generator{Array{Array{Float64,1},1},getfield(Main, Symbol("##159#160"))}} ---
splatting: 
  107.264 μs (1500 allocations: 1.03 MiB)
reduce old: 
  281.396 ms (1984 allocations: 488.85 MiB)
reduce new: 
  177.695 μs (11 allocations: 2.00 MiB)

Note: the reduce new entries bypass the existing method for reduce(vcat|hcat, ::Array{<:AbstractVector}), so that I could compare performance when treating Arrays just as iterators, and thus see whether we could drop the extra specialised methods for them (once another PR to handle matrices is complete).

Benchmark Takeaways:

  • On iterators with known length, performance is equivalent to what we get on Arrays
  • Up to a 4000x speedup (vcat on generators)

base/reduce.jl Outdated
if !(isize isa SizeUnknown)
# Assume the first element has a representative size, unless that would make this too large
SIZEHINT_CAP = 10^6
sizehint!(ret, min(SIZEHINT_CAP, length(xs)*length(x1)))
Contributor Author:

This is a substantial point in speeding this up. It is the reason knowing the size gives a speedup for vcat.

And I think it is a very common case that your vectors are all the same size; even when that guess is wrong, it is usually still a good estimate.

I am not sure what a good value for SIZEHINT_CAP is. We need it to catch cases that only fit in memory because the first element is massive and the rest are much smaller.
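The capped estimate being discussed amounts to something like this (a sketch; the helper name capped_sizehint! is hypothetical, and 10^6 is the cap value proposed in the diff above):

```julia
# Sketch: hint the estimated total size, capped so a massive first
# element cannot cause a huge speculative allocation.
const SIZEHINT_CAP = 10^6

function capped_sizehint!(ret::Vector, nitems::Integer, firstlen::Integer)
    # Assume the first element's length is representative of the rest.
    sizehint!(ret, min(SIZEHINT_CAP, nitems * firstlen))
    return ret
end
```

sizehint! only reserves capacity; it does not change the length of ret, so an over-estimate wastes memory but never corrupts the result.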

Member:

Are you sure it makes a large difference? IIRC append! is supposed to double the size of the storage to ensure resizing doesn't happen too often. At any rate, I don't think it's correct to assume the first element is representative.

Contributor Author:

Yes, especially on small cases it really matters: adding it got a 60% speedup on my test case of 100 vectors of length 128.

Doubling just doesn't increase the capacity that much if you are doubling a relatively small number. Consider 100 vectors of equal length: then doubling results in having to allocate 7 times, which is a lot as a proportion of the total time spent.

So assuming the first element is representative is a bit of a guess.

  • If the guess is approximately right, then you'll probably only end up doing 1 more allocation.
  • If the guess is too low, then you're basically back in the doubling case, so you've lost nothing by taking it.
  • If the guess is too large, this is the dangerous point, because it risks allocating a ton of memory that isn't needed.

The last case is where SIZEHINT_CAP comes in: it is our guard against that case. So we set it to some suitable number. I thought 10^6 might be OK; that would be allocating 8 MB if it was Float64s, but we could do 10^5 if we wanted to be more conservative. The other thing is that once things are bigger than SIZEHINT_CAP, the array should be large enough that doubling is a very effective increase.
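The claim about 7 allocations can be checked with a quick count, assuming capacity grows by pure doubling from the first vector's length:

```julia
# Count capacity doublings needed to reach the total length,
# starting from the first vector's length.
function ndoublings(firstlen::Int, total::Int)
    cap, n = firstlen, 0
    while cap < total
        cap *= 2
        n += 1
    end
    return n
end

ndoublings(128, 100 * 128)  # → 7
```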

@KristofferC (Member) commented Apr 10, 2019:

I don't think allocating this much extra is reasonable so you need to sizehint! back to the actual number in the end (which will reduce the capacity).

Also, posting the benchmarks is a good idea (unless they are the same as in the first post).

This feels like too much heuristic in a PR which otherwise would be quite simple.

@oxinabox (Contributor Author) commented Apr 10, 2019:

Benchmark:
With code that bypasses normal array reduce:

  • data = [rand(max(ceil(Int,96+32(randn())), 0)) for ii in 1:1000]
  • Timing with @btime, on a different computer from before (probably a worse one for benchmarking, as it is a shared server, but the point still holds)
  • Multiple rounds of testing, to roughly account for chance since we are using random-length arrays
  • round 1:
    • with hint: 498.927 μs (2 allocations: 782.19 KiB)
    • without  : 728.514 μs (11 allocations: 1.80 MiB)
  • round 2:
    • with hint: 757.129 μs (4 allocations: 2.08 MiB)
    • without  : 900.870 μs (13 allocations: 2.56 MiB)
  • round 3:
    • with hint: 643.748 μs (3 allocations: 1.74 MiB)
    • without  : 687.555 μs (11 allocations: 1.51 MiB)

Now of course, the advantage goes up if your initial guess at the size was wrong. And once sizehint!ing back down is added, that will also cut into the advantage of doing it. But I think it will still be worth it (we can further add a heuristic to not sizehint down if we are <2x too large, since that is acceptable).

offset = length(x1)+1
while(x_state !== nothing)
x, state = x_state
length(x)==dim1_size || throw(DimensionMismatch("hcat"))
Contributor Author:

Should include the dimensions in the error message
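For example (an illustrative message format, not what the PR ended up using; the helper name is hypothetical):

```julia
# Hypothetical check with the mismatched sizes included in the message.
function check_hcat_length(x::AbstractVector, dim1_size::Int)
    length(x) == dim1_size || throw(DimensionMismatch(
        "hcat: expected vectors of length $dim1_size, got one of length $(length(x))"))
    return nothing
end
```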


oxinabox commented Apr 7, 2019

The errors given by the CI are correct; looks like I broke at least one case for matrices.

base/abstractarray.jl (outdated; resolved)

oxinabox commented Apr 8, 2019

OK, now the failing tests are not real; some kind of distributed-processing fault.


reduce(op, itr; kw...) = mapreduce(identity, op, itr; kw...)
function reduce(op, itr::T; kw...) where T
# Redispatch, adding traits
reduce(op, itr, eltype_or_default_eltype(itr), IteratorSize(T); kw...)
Member:

This should be a private _reduce method.

@oxinabox (Contributor Author) commented Apr 10, 2019:

Yes, unless it should be a private _mapreduce method.
What do you think?

Member:

I guess _mapreduce is better if you can support that, since that will also support mapreduce(identity, ...).
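The shape of the redispatch being discussed might look like this (a sketch with made-up private names; the PR's code used eltype_or_default_eltype rather than plain eltype):

```julia
# Sketch: a private trait-aware helper that public reduce/mapreduce
# methods funnel into. Names here are illustrative only.
_mapreduce_bytrait(f, op, itr) =
    _mapreduce_bytrait(f, op, itr, eltype(itr), Base.IteratorSize(itr))

# Fallback when the traits offer nothing special:
_mapreduce_bytrait(f, op, itr, et, isize) = mapreduce(f, op, itr)

sketch_reduce(op, itr) = _mapreduce_bytrait(identity, op, itr)

sketch_reduce(+, [1, 2, 3])  # → 6
```

Specialisations for particular ops (vcat, hcat) and trait combinations can then be added as extra methods of the private helper without touching the public API.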

@@ -362,10 +367,95 @@ julia> reduce(*, [2; 3; 4]; init=-1)
-24
```
"""
reduce(op, itr; kw...) = mapreduce(identity, op, itr; kw...)
function reduce(op, itr::T; kw...) where T
# Redispatch, adding traits
Member:

Can remove this comment, it is literally what the code below it does.

end

function reduce(op, itr, et, isize; kw...)
# Fallback: if nothing interesting is being done with the traits
Member:

Think this comment can likely be removed.

while(x_state !== nothing)
x, state = x_state
length(x)==dim1_size || throw(DimensionMismatch("hcat"))
copyto!(ret, offset, x, 1)
Member:

How do we know this will fit into ret?

julia> reduce(hcat, (UInt8[1,2], [1000, 5000]))
ERROR: InexactError: trunc(UInt8, 1000)

Contributor Author:

We don't. This function needs to be tightened to

T::Type{<:AbstractVector{S}}

and things that fell back through @default_eltype need to be banned from using it.

Then another (marginally slower, but benchmarking will tell) version needs to be created that can deal with heterogeneous containers. I would say that could be in another PR, as I just want the common case covered here; but actually I would be sad if we can't get the speedup for generators.

Member:

I would say that could be in another PR, as I just want the common case covered here

I don't understand, this PR breaks the use cases I linked so how can it be in another PR?

Contributor Author:

Sorry, I was unclear. I mean we are definitely fixing that issue. But the question of handling things like

(i for i in ([0x1, 0x2], [1,2]))

does not have to be handled in this PR (though I think it should be, if it doesn't add undue complexity). If we just tighten the type signature from where T<:AbstractVector to where T<:AbstractVector{S} where S, that would solve your case by causing it to fall back to the current slow reduce methods.

Thus working out how to efficiently handle heterogeneous types could be in another PR, if it was going to be hard. But I don't think it will be, so it can be in this one.
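The effect of that tightening can be illustrated with dispatch alone (hypothetical helper; :fast stands in for the specialised method, :slow for the existing fallback):

```julia
# With `T <: AbstractVector{S} where S`, only element types that are
# vectors of a single element type S hit the fast path.
hitsfastpath(::Type{<:AbstractVector{S}}) where {S} = :fast
hitsfastpath(::Type) = :slow

hitsfastpath(Vector{Int})                        # → :fast
hitsfastpath(Union{Vector{UInt8}, Vector{Int}})  # → :slow, falls back
```

A tuple like (UInt8[1,2], [1000, 5000]) has a Union element type, so it would take the slow path and keep the old widening behaviour.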


while(x_state !== nothing)
x, state = x_state
append!(ret_vec, x)
Member:

How do we know this will fit into ret?

julia> reduce(vcat, (UInt8[1,2], [1000, 5000]))
ERROR: InexactError: trunc(UInt8, 1000)


@fredrikekre fredrikekre added collections Data structures holding multiple items, e.g. sets performance Must go faster labels Apr 11, 2019
@oxinabox (Contributor Author):

Now with sizehinting back down if we made it too large. This applies to reduce(vcat, ...) where we know the size of the iterator. To keep performance, I tweaked how the heuristic works.

  • In the ideal case of guessing the size right(ish), you still get basically the same performance as running on Array{Vector}.
  • Right or wrong, if our guess at the size exceeds SIZEHINT_CAP, then we fall back to acting the same as if we did not know the iterator size.
  • Otherwise, depending on how representative the first element was, this ranged between 50%-150% of the time taken by the unknown-iterator-size case.

I think the sizehint heuristic is worth it, but I could be convinced otherwise. Basically the performance boils down to: if you have to sizehint back down, then it would have been better not to have hinted at all. But this case is fairly rare, since it means the first element was >2x the average size, and the estimated size based on it was still under the sizehint cap.
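The hint-down step being described is roughly (a sketch; finish_hintdown! is a made-up name, and the /2 threshold matches the heuristic above):

```julia
# Sketch: after filling the result, give capacity back to the allocator
# only if the original hint overshot by more than 2x.
function finish_hintdown!(ret::Vector, hinted_size::Int)
    if length(ret) < hinted_size ÷ 2
        sizehint!(ret, length(ret))   # shrink capacity back to what we used
    end
    return ret
end
```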

Edge-case benchmarks related to this.
Note: the neoreduce_nohintdown results show what happens if we just never sizehint back down.
The choice to be made is between the first number in each set (which uses the SizeUnknown behaviour regardless of whether the size is known) and the second, which uses the sizehint heuristic, including hinting back down.

Perfect estimate, under SIZEHINT_CAP

  data = [rand(50), rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50)]
  @btime neoreduce_hintdown(vcat, $data, Base.SizeUnknown()); # No hint in the first place
  @btime neoreduce_hintdown(vcat, $data, Base.HasLength());
  @btime neoreduce_nohintdown(vcat, $data, Base.HasLength());

    724.560 ns (5 allocations: 12.38 KiB)
    379.985 ns (2 allocations: 4.86 KiB)
    389.140 ns (2 allocations: 4.86 KiB)

Perfect estimate, over SIZEHINT_CAP

  data = [rand(50) for ii in 1:100_000]
  @btime neoreduce_hintdown(vcat, $data, Base.SizeUnknown()); # No hint in the first place
  @btime neoreduce_hintdown(vcat, $data, Base.HasLength());
  @btime neoreduce_nohintdown(vcat, $data, Base.HasLength());
    19.343 ms (18 allocations: 51.56 MiB)
    18.740 ms (18 allocations: 51.56 MiB)
    19.134 ms (8 allocations: 49.59 MiB)

  @btime reduce(vcat, data);
    9.196 ms (2 allocations: 38.15 MiB)

Maxing out SIZEHINT_CAP, due to our massive over-estimate

  data = [rand(3*10^5), rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50)]
  @btime neoreduce_hintdown(vcat, $data, Base.SizeUnknown()); # No hint in the first place
  @btime neoreduce_hintdown(vcat, $data, Base.HasLength());
  @btime neoreduce_nohintdown(vcat, $data, Base.HasLength());

    243.261 μs (3 allocations: 4.58 MiB)
    243.798 μs (3 allocations: 4.58 MiB)
    239.252 μs (3 allocations: 4.58 MiB)

  @btime reduce(vcat, data);

    130.166 μs (2 allocations: 2.29 MiB)

Under cap, but over-estimated by enough to hint down

  data = [rand(3*10^3), rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50)]
  @btime neoreduce_hintdown(vcat, $data, Base.SizeUnknown()); # No hint in the first place
  @btime neoreduce_hintdown(vcat, $data, Base.HasLength());
  @btime neoreduce_nohintdown(vcat, $data, Base.HasLength());

    1.892 μs (3 allocations: 46.95 KiB)
    2.940 μs (4 allocations: 257.89 KiB)
    1.388 μs (3 allocations: 257.89 KiB)

Under cap, and over-estimated, but not by enough to hint down

  data = [rand(70), rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50),rand(50)]
  @btime neoreduce_hintdown(vcat, $data, Base.SizeUnknown()); # No hint in the first place
  @btime neoreduce_hintdown(vcat, $data, Base.HasLength());
  @btime neoreduce_nohintdown(vcat, $data, Base.HasLength());

    831.905 ns (5 allocations: 17.30 KiB)
    380.134 ns (2 allocations: 6.78 KiB)
    379.488 ns (2 allocations: 6.78 KiB)

base/reduce.jl Outdated
x_state = iterate(xs, state)
end

if length(ret) < hinted_size/2 # it is only allowable to keep at most 2x too much memory
Member:

Get rid of this conditional? sizehint! already has heuristics for when shrinking the capacity is worth it.

@oxinabox (Contributor Author) commented Apr 15, 2019:

Those heuristics are less generous than this one, though: they require saving an eighth, rather than a half.

julia/src/array.c

Line 1108 in 68db871

//if we don't save at least an eighth of maxsize then its not worth it to shrink

base/reduce.jl Outdated
x_state === nothing && return T() # New empty instance
x1, state = x_state

ret = copy(x1) # this is **Not** going to work for StaticArrays
Member:

So did this use to work for StaticArrays and is breaking? What should be done to resolve this comment?

Contributor Author:

Indeed it did use to work, though it worked a little weirdly, since the result type depended on whether the iterator was an Array or not.

julia> data = [(@SVector Int[1,2,3,4]), @SVector Int[1,2,3,4]]
2-element Array{SArray{Tuple{4},Int64,1,4},1}:
 [1, 2, 3, 4]
 [1, 2, 3, 4]

julia> reduce(vcat, data)
8-element Array{Int64,1}:
 1
 2
 3
 4
 1
 2
 3
 4

julia> reduce(vcat, (i for i in data))
8-element SArray{Tuple{8},Int64,1,8}:
 1
 2
 3
 4
 1
 2
 3
 4

We should do something to support it. But I wasn't sure what, so I left the comment (it should have had a #TODO).

  1. We can tighten this to apply only to Vector, not to AbstractVector
  2. We can check for ismutable, and if not mutable, fall back to the standard reduce
  3. We can check for ismutable, and if not mutable, fall back to using an Array for the return type

All 3 options leave it open for a package to define its own improved method for this on its own type.

Member:

1 would be too bad. 3 would break the reduce interface. So 2 sounds like the best solution.
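Option 2 could be sketched like so (a sketch that assumes a restartable iterator; it uses ismutable, which exists as of Julia 1.5, whereas at the time of this PR the check would have been !isimmutable):

```julia
# Sketch of option 2: if the first element is immutable (e.g. an SVector),
# fall back to the generic pairwise reduce; otherwise build in place.
function vcat_reduce_sketch(xs)
    x1 = first(xs)                      # assumes xs can be restarted
    ismutable(x1) || return reduce(vcat, collect(xs))  # generic fallback
    ret = copy(x1)
    for x in Iterators.drop(xs, 1)
        append!(ret, x)
    end
    return ret
end

vcat_reduce_sketch([[1, 2], [3, 4]])  # → [1, 2, 3, 4]
```

The fallback preserves whatever result type the generic reduce produces, so packages like StaticArrays keep their existing behaviour.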

base/reduce.jl Outdated
function reduce(::typeof(hcat), xs, T::Type{<:AbstractVector}, isize)
# Size is known
x_state = iterate(xs)
x_state === nothing && return T() # New empty instance
Member:

Can get rid of the comment. No need to describe what the code does in words.

base/reduce.jl Outdated

function reduce(::typeof(vcat), xs, T::Type{<:AbstractVector}, isize)
x_state = iterate(xs)
x_state === nothing && return T() # New empty instance
Member:

This should just throw an ArgumentError as currently (BTW I'm not even sure the AbstractArray interface actually guarantees that T() works).

Contributor Author:

I think we have an empty_reduce function that throws that error, unless it knows something better to do. But yes.

end

x_state = iterate(xs, state)
while(x_state !== nothing)
Member:

Suggested change
while(x_state !== nothing)
while x_state !== nothing


base/reduce.jl (resolved)
base/reduce.jl Outdated

## vcat

function reduce(::typeof(vcat), xs, T::Type{<:AbstractVector}, isize)
Member:

This should probably be restricted to the case where all the vectors are of the same concrete type. Otherwise there is no guarantee that calling vcat repeatedly (which is what reduce does by default) will return a vector of the same type as the first one.
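That restriction could be expressed as a runtime guard rather than in the signature; a minimal sketch (hypothetical helper name):

```julia
# Take the fast path only when the iterator's eltype is one concrete
# vector type, so repeated vcat is guaranteed to preserve the type.
takes_fast_path(xs) = isconcretetype(eltype(xs)) && eltype(xs) <: AbstractVector

takes_fast_path([[1, 2], [3, 4]])      # true: eltype is Vector{Int}
takes_fast_path(Any[[1, 2], [3, 4]])   # false: eltype is Any
```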

@StefanKarpinski StefanKarpinski added the forget me not PRs that one wants to make sure aren't forgotten label Aug 12, 2019
@vtjnash vtjnash removed the forget me not PRs that one wants to make sure aren't forgotten label Oct 28, 2020

vtjnash commented Oct 28, 2020

Removed the label since it doesn't seem like this was being worked on anymore. It seems relatively complex, and possibly just suggests that our growth strategy isn't sufficient for small sizes (as well as large ones; refs #28588)?

@oxinabox (Contributor Author):

I should return to this: remove the growth heuristics that people didn't like, and do the more minimal version for the case we know about.


vtjnash commented Oct 27, 2023

Hoping this is covered by moving push! out of C into Julia in #51319

@vtjnash vtjnash closed this Oct 27, 2023