Improve efficiency of mergelevels #331
`mergelevels` was implemented very inefficiently, which is acceptable when the number of levels is small but incredibly slow when it's large.
else
i = levelsmap[j]
end
union!(res, levels...)
Since we are breaking anyway, I would just opt for using `union!`. How much cost does the level order fixing incur? If it is the majority of the time, I would skip doing it and document that we use `union!` when merging levels.
(I am asking because it looks like the complexity of the level fixing algorithm below is quadratic)
Removing the `for levs in reverse(levels)` loop below gets rid of about 2/3 of the time. That proportion seems to stay roughly constant when moving from 10^4 to 10^7.
But I'm not sure `union!` is really a satisfying solution. Using it would mean that even if the second set of levels is a superset of the first one, we won't keep the order of the superset. Of course, that would have the advantage that existing refs wouldn't have to be recoded, since new levels would always be added at the end. But the downside is that if you assign categorical values from arrays with different levels (or call `copyto!`), the final levels would depend on the order in which you make the calls.
An intermediate solution would be to check whether one set of levels starts with or ends with a subsequence of another, or is an ordered superset of another (that should be cheap). If that's the case, we can combine them in the appropriate order. If not, we can call `union!`.
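To make the order dependence concrete, here is a minimal sketch using only Base `union!` on plain vectors. The helper name `merge_with_union` and the example levels are hypothetical and are not part of CategoricalArrays.jl.

```julia
# Hypothetical helper: merge level vectors with plain `union!`.
# New levels are appended in the order in which they are first seen.
merge_with_union(levels...) = union!(String[], levels...)

a = ["low", "high"]
b = ["low", "medium", "high"]   # ordered superset of `a`

# `a` first: "medium" lands at the end, so the order of the superset is lost.
ab = merge_with_union(a, b)     # ["low", "high", "medium"]

# `b` first: the order of the superset is kept.
ba = merge_with_union(b, a)     # ["low", "medium", "high"]
```

The two calls merge the same sets of levels but give different results, which is the downside described above.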
I was thinking about it and concluded that:
- If the cost is 2/3 of the time (and does not explode), then it is acceptable.
- We should clearly document what we do (and actually highlight that this is an advantage of our approach).
- We should clearly document that CategoricalArrays.jl is not intended to handle huge numbers of levels (as a lot of bookkeeping has to be done).
- Regarding the implementation: as you suggest, when we merge two sets of levels (a common case) and one is a subsequence of the other, we can probably be faster (just like you now check whether all levels are equal). Such a check could be done before checking for equality.
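The subsequence fast path for the two-argument case could look roughly like the sketch below. Both function names are hypothetical, the code assumes each level vector contains no duplicates, and the `union!` fallback follows the intermediate suggestion from earlier in the thread rather than the algorithm this PR actually implements.

```julia
# Hypothetical helper, not part of CategoricalArrays.jl.
# Returns true if every element of `sub` appears in `sup` in the same
# relative order. Single linear pass; assumes levels are unique.
function is_ordered_subsequence(sub::AbstractVector, sup::AbstractVector)
    i = firstindex(sub)
    for x in sup
        i > lastindex(sub) && return true   # all of `sub` already matched
        if isequal(x, sub[i])
            i += 1
        end
    end
    return i > lastindex(sub)
end

# Hypothetical fast path for merging two level vectors.
function merge_two(a::AbstractVector, b::AbstractVector)
    is_ordered_subsequence(a, b) && return copy(b)   # `b` already contains `a` in order
    is_ordered_subsequence(b, a) && return copy(a)   # `a` already contains `b` in order
    return union!(copy(a), b)                        # assumed fallback: append new levels
end

merge_two(["low", "high"], ["low", "medium", "high"])   # keeps the superset order
```

With this check the result no longer depends on argument order whenever one set of levels is an ordered superset of the other; the fallback remains order dependent.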
@ablaom Do you have any preference regarding how levels from different pools are merged? Is this something that you encountered in real use cases?
I would make a release with this change soon. https://github.com/ablaom - do you have any comment on the proposed change here?
Is there anything to be done here still?
Sorry, I am currently on holiday but can take a closer look on Wednesday. Thanks for seeking my feedback.
In an MLJ workflow I would guess merging arrays with different levels does not occur that often, and that when it does the performance bottlenecks are still elsewhere. I would say, from our end, performance is the least important factor. More important is that it is easy to understand what the protocol for merging is, so the simpler the better. I would say more than 100,000 levels would be pretty unusual for an ordered categorical that is not really a "count" variable in disguise. For what it is worth, I have tested this branch against MLJBase/dev and nothing breaks.
Thanks for your comments. Currently the algorithm isn't really simple, but it's a best effort to ensure that when merging two ordered pools, the result is also ordered if possible. In practice I'm not sure many people rely on that. Let's merge this then. I think performance can be improved more as discussed above, but this can be done later as it won't change the user-visible behavior.
@bkamins This is still slow for large pools but at least it finishes: