Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve efficiency of mergelevels #331

Merged
merged 1 commit into from
Apr 7, 2021
Merged

Improve efficiency of mergelevels #331

merged 1 commit into from
Apr 7, 2021

Conversation

nalimilan
Copy link
Member

mergelevels was implemented very inefficiently, which is acceptable when the number of levels is small but incredibly slow when it's large.

@bkamins This is still slow for large pools but at least it finishes:

julia> using CategoricalArrays

julia> function f(x, y)
           z = similar(x, length(x)+length(y))
           copyto!(z, x)
           copyto!(z, length(x)+1, y, 1, length(y))
           return z
       end
f (generic function with 1 method)

julia> @time x = categorical(1:10^7);
  8.157592 seconds (10.03 M allocations: 1.562 GiB, 10.83% gc time)

julia> @time y = categorical(1:2*10^7);
 18.108206 seconds (20.00 M allocations: 3.123 GiB, 11.00% gc time)

julia> @time f(x, y);
 34.870381 seconds (31.33 M allocations: 5.735 GiB, 10.83% gc time)

`mergelevels` was implemented very inefficiently, which is acceptable
when the number of levels is small but incredibly slow when it's large.
else
i = levelsmap[j]
end
union!(res, levels...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are breaking anyway I would just opt for using union!. How much cost it the level order fixing incur? If it is a majority of time I would skip doing it and document that we use union! when merging levels.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I am asking because it looks like the complexity of the level fixing algorithm below is quadratic)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the for levs in reverse(levels) loop below gets rid of about 2/3 of the time. That proportion seems to stay roughly constant when moving from 10^4 to 10^7.

But I'm not sure union! is really a satisfying solution. Using it would mean that even if the second set of levels is a superset of the first one, we won't keep the order of the superset. Of course that would have the advantage that existing refs wouldn't have to be recoded since new levels would always be added at the end. But the downside is that if you assign categorical values from arrays with different levels (or call copyto!), the final levels would depend on the order in which you make the calls.

An intermediate solution would be to check whether one set of levels starts with or ends with a subsequence of another, or is an order superset of another (that should be cheap). If that's the case we can combine them in the appropriate order. If they are not, we can call union!.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about it and concluded that:

  1. If the cost is 2/3 of time (and does not explode) then it is acceptable.
  2. We should clearly document what we do (and actually highlight that this is an advantage of our approach)
  3. We should clearly document that CategoricalArrays.jl is not intended to handle huge numbers of levels (as a lot of book-keeping has to be done)
  4. Regarding the implementation. As you suggest what could be checked for the case if we merge two levels (a common case) and one level is a subsequence of another then we probably can be faster (just like you now check if all levels are equal). Such check could be done before checking for the equality.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ablaom Do you have any preference regarding how levels from different pools are merged? Is this something that you encountered in real use cases?

@bkamins bkamins mentioned this pull request Feb 21, 2021
@bkamins
Copy link
Member

bkamins commented Feb 24, 2021

I would make a release with this change soon. https://github.com/ablaom - do you have any comment on the proposed change here?

@nalimilan nalimilan closed this Apr 4, 2021
@nalimilan nalimilan reopened this Apr 4, 2021
@bkamins
Copy link
Member

bkamins commented Apr 4, 2021

Is there anything to be done here still?

@ablaom
Copy link

ablaom commented Apr 4, 2021

Sorry, I am currently on holiday but can take a closer look on Wednesday. Thanks for seeking my feedback.

@ablaom
Copy link

ablaom commented Apr 6, 2021

In an MLJ workflow I would guess merging arrays with different levels does not occur that often, and that when it does the performance bottlenecks are still elsewhere. I would say, from our end, performance is the least important factor. More important is that it is easy to understand what the protocol for merging is, so the simpler the better.

I would say more than 100,000 levels would be pretty unusual for an ordered categorical that is not really a "count" variable in disguise.

For what it is worth, I have tested this branch against MLJBase/dev and nothing breaks.

@nalimilan
Copy link
Member Author

Thanks for your comments. Currently the algorithm isn't really simple, but it's a best effort to ensure that when merging two ordered pools, the result is also ordered if possible. In practice I'm not sure many people really on that.

Let's merge this then. I think performance can be improved more as discussed above but this can be done later as it won't change the user-visible behavior.

@nalimilan nalimilan merged commit b2177ab into master Apr 7, 2021
@nalimilan nalimilan deleted the nl/mergelevels branch April 7, 2021 08:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants