performance of getindex #54
So my proposal is to define

without copying. The reason is that we currently do not provide a way to shrink the pool in

Potential problems, since we are post-1.0 release:

Additionally we could add

Note that currently we just copy
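The pool-sharing idea can be sketched as follows. This is a minimal illustration, not the actual PooledArrays.jl internals: `SimplePooled`, `share_slice`, and `copy_slice` are hypothetical names introduced here only to contrast the two strategies.

```julia
# A toy pooled vector: `refs` index into `pool`; `invpool` maps value => ref.
struct SimplePooled{T}
    refs::Vector{UInt32}
    pool::Vector{T}
    invpool::Dict{T,UInt32}
end

# Proposal: slicing aliases the parent's pool, so the cost is
# O(length(inds)) and independent of the number of levels.
share_slice(A::SimplePooled, inds) =
    SimplePooled(A.refs[inds], A.pool, A.invpool)

# Current behavior: slicing copies the pool, so the cost grows with the
# number of levels even for a tiny slice.
copy_slice(A::SimplePooled, inds) =
    SimplePooled(A.refs[inds], copy(A.pool), copy(A.invpool))
```

The catch with sharing is that the pool becomes aliased mutable state, which is where the thread-safety discussion in this thread comes from.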
So after discussing this on Slack, it seems that making

I would say that it's too dangerous. Unfortunately it's not yet clear whether it's possible to ensure thread safety without killing the performance of
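The danger can be shown with plain `Dict`s (illustrative only; these variable names are not the PooledArrays.jl API): two arrays that alias one `invpool` see each other's mutations, and under threads concurrent insertion into a `Dict` is additionally a data race.

```julia
invpool = Dict("a" => UInt32(1))   # parent's invpool
slice_invpool = invpool            # a "slice" that shares rather than copies

# Adding a new level through the slice silently grows the parent's pool:
slice_invpool["b"] = UInt32(2)
@assert haskey(invpool, "b")       # the parent observes the mutation

# Under threads this becomes a data race: Dict insertion is not thread safe,
# and guarding it with a lock is the performance concern mentioned above.
```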
Assume

is as fast as

Also note that:

so there is a trade-off between construction and slicing. @nalimilan - do you know the reasons?
Sorry, I don't get this.
I don't think there's a trade-off, really. Maybe just a missing optimization in PooledArrays? For example, it doesn't define
I mean this:

Indeed:

is missing, and we should add it. However, I checked that adding it does not solve this issue.
So you mean that when the requested slice is short it would be faster to re-add the levels that are actually used to an empty pool than to copy the full pool? Yeah, in extreme cases like this it would help. What's hard is identifying appropriate thresholds. I guess we can act as if levels are distributed randomly in slices, so that the only parameter to take into account is the ratio between the length of the slice and the total number of levels? BTW, it's a shame that we don't implement any optimized methods for
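The heuristic discussed here might look like the sketch below. `slice_pool` is a hypothetical helper, not the actual implementation, and the `threshold` default is an arbitrary assumption, which is exactly the tuning question raised above.

```julia
# Rebuild a small pool from the levels actually used when the slice is short
# relative to the number of levels; otherwise fall back to copying everything.
function slice_pool(refs_slice::Vector{UInt32}, pool::Vector, invpool::Dict;
                    threshold::Float64 = 0.5)
    if length(refs_slice) < threshold * length(pool)
        newpool = eltype(pool)[]
        newinv  = Dict{eltype(pool),UInt32}()
        remap   = Dict{UInt32,UInt32}()   # old ref => new ref
        newrefs = similar(refs_slice)
        for (i, r) in pairs(refs_slice)
            newrefs[i] = get!(remap, r) do
                push!(newpool, pool[r])   # re-add only the levels we see
                nr = UInt32(length(newpool))
                newinv[pool[r]] = nr
                nr
            end
        end
        return newrefs, newpool, newinv
    else
        return copy(refs_slice), copy(pool), copy(invpool)
    end
end
```

Under the random-levels assumption, only the ratio of slice length to level count matters, so a single scalar threshold is enough to pick between the two branches.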
Yes - this is roughly what I think we should do.
I will propose a PR for this. In the meantime I have spotted that the current code was already sharing
Wow, that's really bad.
Fortunately it is only bad for threading, and I guess that is currently a very rare use case.
A bottleneck in fast joins in DataFrames.jl is:

as

copy(A.invpool)

is very expensive for large pools (it also usually leaves the pool much larger than needed). I see two options:

This case will hit us (and probably already hits us significantly, but we did not know about it) in the H2O benchmarks, which heavily use PooledArrays.

CC @nalimilan @quinnj
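The cost of the `copy(A.invpool)` step can be demonstrated in isolation (sizes are illustrative; the real `invpool` maps pooled values to reference integers):

```julia
# A large invpool, as left behind by a big parent array:
invpool = Dict(i => UInt32(i) for i in 1:10^6)

# Copying it is O(number of levels), even when the slice being built
# only needs a handful of them:
t_copy = @elapsed copy(invpool)

# By contrast, indexing the refs of a short slice is O(slice length):
refs = rand(UInt32(1):UInt32(1000), 10^3)
t_slice = @elapsed refs[1:100]

# For large pools t_copy dwarfs t_slice, which is the reported bottleneck.
```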