Improve efficiency of mergelevels #331

nalimilan · 2021-02-20T15:39:32Z

mergelevels was implemented very inefficiently, which is acceptable when the number of levels is small but incredibly slow when it's large.

@bkamins This is still slow for large pools but at least it finishes:

julia> using CategoricalArrays

julia> function f(x, y)
           z = similar(x, length(x)+length(y))
           copyto!(z, x)
           copyto!(z, length(x)+1, y, 1, length(y))
           return z
       end
f (generic function with 1 method)

julia> @time x = categorical(1:10^7);
  8.157592 seconds (10.03 M allocations: 1.562 GiB, 10.83% gc time)

julia> @time y = categorical(1:2*10^7);
 18.108206 seconds (20.00 M allocations: 3.123 GiB, 11.00% gc time)

julia> @time f(x, y);
 34.870381 seconds (31.33 M allocations: 5.735 GiB, 10.83% gc time)

`mergelevels` was implemented very inefficiently, which is acceptable when the number of levels is small but incredibly slow when it's large.

bkamins · 2021-02-20T15:54:53Z

src/pool.jl

-                else
-                    i = levelsmap[j]
-                end
+    union!(res, levels...)


Since we are breaking anyway I would just opt for using union!. How much cost it the level order fixing incur? If it is a majority of time I would skip doing it and document that we use union! when merging levels.

(I am asking because it looks like the complexity of the level fixing algorithm below is quadratic)

Removing the for levs in reverse(levels) loop below gets rid of about 2/3 of the time. That proportion seems to stay roughly constant when moving from 10^4 to 10^7.

But I'm not sure union! is really a satisfying solution. Using it would mean that even if the second set of levels is a superset of the first one, we won't keep the order of the superset. Of course that would have the advantage that existing refs wouldn't have to be recoded since new levels would always be added at the end. But the downside is that if you assign categorical values from arrays with different levels (or call copyto!), the final levels would depend on the order in which you make the calls.

An intermediate solution would be to check whether one set of levels starts with or ends with a subsequence of another, or is an order superset of another (that should be cheap). If that's the case we can combine them in the appropriate order. If they are not, we can call union!.

I was thinking about it and concluded that:

If the cost is 2/3 of time (and does not explode) then it is acceptable.

We should clearly document what we do (and actually highlight that this is an advantage of our approach)

We should clearly document that CategoricalArrays.jl is not intended to handle huge numbers of levels (as a lot of book-keeping has to be done)

Regarding the implementation. As you suggest what could be checked for the case if we merge two levels (a common case) and one level is a subsequence of another then we probably can be faster (just like you now check if all levels are equal). Such check could be done before checking for the equality.

@ablaom Do you have any preference regarding how levels from different pools are merged? Is this something that you encountered in real use cases?

bkamins · 2021-02-24T10:11:42Z

I would make a release with this change soon. https://github.com/ablaom - do you have any comment on the proposed change here?

bkamins · 2021-04-04T15:01:40Z

Is there anything to be done here still?

ablaom · 2021-04-04T20:02:24Z

Sorry, I am currently on holiday but can take a closer look on Wednesday. Thanks for seeking my feedback.

ablaom · 2021-04-06T23:33:59Z

In an MLJ workflow I would guess merging arrays with different levels does not occur that often, and that when it does the performance bottlenecks are still elsewhere. I would say, from our end, performance is the least important factor. More important is that it is easy to understand what the protocol for merging is, so the simpler the better.

I would say more than 100,000 levels would be pretty unusual for an ordered categorical that is not really a "count" variable in disguise.

For what it is worth, I have tested this branch against MLJBase/dev and nothing breaks.

nalimilan · 2021-04-07T08:36:22Z

Thanks for your comments. Currently the algorithm isn't really simple, but it's a best effort to ensure that when merging two ordered pools, the result is also ordered if possible. In practice I'm not sure many people really on that.

Let's merge this then. I think performance can be improved more as discussed above but this can be done later as it won't change the user-visible behavior.

Improve efficiency of mergelevels

4f653bb

`mergelevels` was implemented very inefficiently, which is acceptable when the number of levels is small but incredibly slow when it's large.

bkamins reviewed Feb 20, 2021

View reviewed changes

bkamins mentioned this pull request Feb 21, 2021

add DataAPI.unwrap #328

Merged

nalimilan closed this Apr 4, 2021

nalimilan reopened this Apr 4, 2021

nalimilan merged commit b2177ab into master Apr 7, 2021

nalimilan deleted the nl/mergelevels branch April 7, 2021 08:36

nalimilan mentioned this pull request Apr 7, 2021

Make merging pools faster and document it better #341

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve efficiency of mergelevels #331

Improve efficiency of mergelevels #331

nalimilan commented Feb 20, 2021

bkamins Feb 20, 2021

bkamins Feb 20, 2021

nalimilan Feb 20, 2021

bkamins Feb 21, 2021

nalimilan Feb 21, 2021

bkamins commented Feb 24, 2021

bkamins commented Apr 4, 2021

ablaom commented Apr 4, 2021

ablaom commented Apr 6, 2021

nalimilan commented Apr 7, 2021

Improve efficiency of mergelevels #331

Improve efficiency of mergelevels #331

Conversation

nalimilan commented Feb 20, 2021

bkamins Feb 20, 2021

Choose a reason for hiding this comment

bkamins Feb 20, 2021

Choose a reason for hiding this comment

nalimilan Feb 20, 2021

Choose a reason for hiding this comment

bkamins Feb 21, 2021

Choose a reason for hiding this comment

nalimilan Feb 21, 2021

Choose a reason for hiding this comment

bkamins commented Feb 24, 2021

bkamins commented Apr 4, 2021

ablaom commented Apr 4, 2021

ablaom commented Apr 6, 2021

nalimilan commented Apr 7, 2021