Introduce new dropunusedlevels! function and allow using from map #43
Addresses #36. This PR proposes ways to address the issue in #36, namely that unused pool levels can cause unexpected results when mapping (and possibly in other operations). It introduces a new exported `dropunusedlevels!` function that scans a `PooledArray` and removes any pooled values that are no longer referenced in the ref array. It also updates the `map` implementation to accept a keyword argument `dropunusedlevels=true` that calls this function before mapping. We could perhaps consider making that `true` by default to avoid the original issue, though it comes with the performance cost of checking for unused levels. This PR also updates the implementation of `map`; it was previously doing quite a bit of extra work: mapping over the keys twice for some reason, in addition to checking the resulting mapped values for duplicates. As far as I can tell, having duplicates in the pool shouldn't actually cause any real issues, but maybe others can comment more on that. The only place I can think of is maybe the sorting code, which I admit I haven't looked at very closely. Anyway, I still need to add tests here, but figured I would put up the PR as-is for discussion on the approach.
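To make the idea of dropping unused levels concrete, here is a minimal sketch on a toy type. `ToyPooled` and `drop_unused!` are hypothetical names made up for illustration; this is not the `PooledArray` type or the function added in this PR, and the real implementation may differ.

```julia
# Toy stand-in for a pooled array (an assumption, NOT the PooledArrays.jl type).
struct ToyPooled{T}
    refs::Vector{Int}   # for each element, an index into `pool`
    pool::Vector{T}     # the distinct values
end

# Hypothetical helper: remove pool entries that no ref points to
# and renumber the refs so they stay valid.
function drop_unused!(x::ToyPooled)
    used = falses(length(x.pool))
    for r in x.refs
        used[r] = true
    end
    all(used) && return x   # nothing to drop

    # newindex[i] is the position of old pool entry i after compaction
    # (left at 0 for dropped entries, which no ref points to).
    newindex = zeros(Int, length(x.pool))
    n = 0
    for i in eachindex(used)
        if used[i]
            n += 1
            newindex[i] = n
        end
    end
    for j in eachindex(x.refs)
        x.refs[j] = newindex[x.refs[j]]
    end
    deleteat!(x.pool, findall(!, used))
    return x
end

x = ToyPooled([1, 3, 3, 1], ["a", "b", "c"])   # "b" is an unused level
drop_unused!(x)
```

The scan over `refs` is the performance cost mentioned above: it is linear in the length of the array even when no level ends up being dropped.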
As I have commented before I think this function should be called The rationale is that dropping levels is often required in generic code that is pooled-vector type agnostic (like now we are e.g. in DataFrames.jl). Also maybe we could consider adding
Ok, PR up for DataAPI.jl to define
for (k, k1) in zip(ks, ks1)
    if haskey(newinvpool, k1)
        translate[x.invpool[k]] = newinvpool[k1]
"""
@nalimilan @bkamins, this is where `droplevels!` is defined, with the new definition of `map` below it.
Thanks for tackling this @quinnj! I have a few remarks. Regarding the handling of unused levels, as we discussed in #36, I think we have better solutions than printing a warning, which is likely to happen deep in code that has no special knowledge of PooledArrays and which will be confusing for users. We can keep track somewhere of the fact that we got an error for a given value, and if we got at least one error, check at the end whether that value is actually used. If not, we can just drop it; if it is used, we can throw an error. That won't slow down code that doesn't error.
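A rough sketch of that deferred check, written against plain vectors rather than the actual `PooledArray` internals. `map_pool_deferred` is a hypothetical helper invented for illustration, and storing results in a `Vector{Any}` is a simplification, not how `map` would be implemented in the package.

```julia
# Hypothetical helper (not the PooledArrays.jl `map`): apply `f` to each pool value,
# remember which pool slots errored, and only rethrow if such a slot is actually used.
function map_pool_deferred(f, refs::Vector{Int}, pool::Vector)
    results = Vector{Any}(undef, length(pool))
    failed = Dict{Int,Any}()   # pool index => captured exception
    for (i, v) in enumerate(pool)
        try
            results[i] = f(v)
        catch err
            failed[i] = err
        end
    end

    # Only scan the refs when at least one pool value errored,
    # so code that never errors pays nothing extra.
    if !isempty(failed)
        for r in refs
            haskey(failed, r) && throw(failed[r])   # the value is in use: a real error
        end
    end

    # Any failed slots are unused at this point, so drop them and renumber the refs.
    keep = [i for i in eachindex(pool) if !haskey(failed, i)]
    newindex = Dict(old => new for (new, old) in enumerate(keep))
    return [newindex[r] for r in refs], [results[i] for i in keep]
end

refs = [1, 3, 1]
pool = [4.0, -1.0, 9.0]   # -1.0 is an unused level; sqrt(-1.0) throws a DomainError
newrefs, newpool = map_pool_deferred(sqrt, refs, pool)
```

Here the failing level is never referenced, so it is silently dropped; if `refs` contained `2`, the stored `DomainError` would be rethrown instead.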
The guarantee that there are no duplicates is essential for the DataFrames grouping code. There, checking for duplicates would be very slow to do. That would basically kill all the advantages of having a
Actually I don't think this should be considered the same thing as
@bkamins This is already what
Almost, as
@nalimilan - so what should the approach be here, given your comments? In general I think that a more crucial trait for DataFrames.jl (that probably should be in DataAPI.jl) is a trait signaling whether levels are guaranteed to be unique, but this is probably a separate PR.
I've proposed a solution in my comment, let's hear what @quinnj thinks. I agree that an API to know whether all levels are unique would be useful for DataFrames. If @quinnj agrees that
Actually I was thinking about it now and we do not need to add anything to DataAPI.jl. What we need to do is to make sure that
I guess we could do that, but note that introducing custom wrapper types forces compiling more functions. So maybe the simpler approach of adding a function to DataAPI is better.
OK - it is a fair point.
The first commit applies JuliaFormatter to the code, which was pretty poorly formatted. Sorry for the diff noise, but you can flip the switch to ignore whitespace changes, which should make it easier to digest. Copying the body of the git commit for the real work below:
Addresses #36. This PR proposes ways to address the issue in #36, namely that unused pool levels can cause unexpected results when mapping (and possibly in other operations). It introduces a new exported `dropunusedlevels!` function that scans a `PooledArray` and removes any pooled values that are no longer referenced in the ref array. It also updates the `map` implementation to accept a keyword argument `dropunusedlevels=true` that calls this function before mapping. We could perhaps consider making that `true` by default to avoid the original issue, though it comes with the performance cost of checking for unused levels.
This PR also updates the implementation of `map`; it was previously doing quite a bit of extra work: mapping over the keys twice for some reason, in addition to checking the resulting mapped values for duplicates. As far as I can tell, having duplicates in the pool shouldn't actually cause any real issues, but maybe others can comment more on that. The only place I can think of is maybe the sorting code, which I admit I haven't looked at very closely.
Anyway, I still need to add tests here, but figured I would put up the PR as-is for discussion on the approach.