Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redesing of PooledArray internals #64

Open
bkamins opened this issue Apr 29, 2021 · 5 comments
Open

Redesing of PooledArray internals #64

bkamins opened this issue Apr 29, 2021 · 5 comments

Comments

@bkamins
Copy link
Member

bkamins commented Apr 29, 2021

Things to do:

  • treat missing as a special value that is not pooled, probably with level 0. This would work the same as in CategoricalArrays.jl; the benefit is that two PooledArrays differing only in the fact if they allow Missing or not could share pool
  • add locking for setindex! but make sure that we support batch operations of adding levels (both in setindex! and in e.g. copyto!); this will allow to fully drop Copy-On-Write and never copy pool and invpool by default; tentatively unsafe_setindex! would be an alternative that does not use lock
  • stress in documentation that using invpool is not safe if potentially other threads are modifying it (this should not be a problem)
  • add droplevels! to DataAPI.jl and to PooledArrays.jl (this requires also a change in CategoricalArrays.jl); this function would reduce pool and invpool to only used levels and also at the same time make a fresh copy of them (as a way to detach pool and invpool between PooledArray-s)

I think this design is better than global pool. It will still cost us a bit in H2O benchmarks, but at least we avoid a global pool that is not reclaimable.

@nalimilan + @quinnj : any additional comments on this?

@nalimilan
Copy link
Member

I'd split this into separate issues as these are quite orthogonal. I think locking is worth investigating, but I'm less convinced by using droplevels! (see JuliaData/DataAPI.jl#31).

Storing missing as 0 is tempting, but note that it makes things more complex in CategoricalArrays. For example, we need a special CategoricalRefPool type supporting index 0, and missing has to be special cased in several places. It could be simpler to ensure that missing (or nothing, BTW) appears first in the pool, and check that where appropriate.

@bkamins
Copy link
Member Author

bkamins commented Apr 29, 2021

I'd split this into separate issues as these are quite orthogonal.

I agree, but I want to first have a roadmap and then implement things separately. However, they are all needed to reach good performance.

It could be simpler to ensure that missing (or nothing, BTW) appears first in the pool, and check that where appropriate.

The point is that we need to avoid having to reallocate the pool if we go between non-missing and allow-missing arrays. But indeed we could always keep missing in the pool, just not allow storing it in the array.

I'm less convinced by using droplevels! (see JuliaData/DataAPI.jl#31).

so how would you "detach" the pool and invpool of one PooledArray from another if you wanted to?

@bkamins
Copy link
Member Author

bkamins commented Apr 29, 2021

What would you say if we decided that PooledArray should always allow Missing? Is this going to far or would be OK? Apart from the benefit of not having hot handle it specially we would also avoid #undef in it (but when e.g. growing or constructing the default value would be missing). This would be breaking so we would have to go for a 2.0 release if we decided this way.

@bkamins
Copy link
Member Author

bkamins commented Apr 30, 2021

What would you say if we decided that PooledArray should always allow Missing?

Alternatively internal pool and invpool could always allow missing (so missing would be present there, but if eltype of PooledArray would not allow it we would not allow to store it; this would mean that pool and invpool could potentially hold entries that are not allowed in the container itself, but IIUC this is not breaking the API contract; we would just need to remember to correctly handle it in functions like map that do mapping of the internal data structures)

@bkamins
Copy link
Member Author

bkamins commented Apr 30, 2021

I have drafted the changes for storing Missing in pool/invpool always in #65 (but as commented above - this would still disallow using them in arrays whose eltype does not allow it).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants