Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce new dropunusedlevels! function and allow using from map #43

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Nov 21, 2020

The first commit is applying JuliaFormatter to the code, which was pretty poorly formatted. Sorry for the diff noise, but you can flip the switch to ignore whitespace changes that should make it easier to digest. Copying the body of the git commit for the real work below:

Addresses #36. This PR proposes ways to address the issue in #36, namely
that unused pool levels can cause unexpected results when mapping (and
possibly other operations). It introduces a new exported
dropunusedlevels! function that will scan a PooledArray and remove
any pooled values not being reference in the ref array any more. It also
updates the map implementation to allow a keyword argument
dropunusedlevels=true to call this before mapping. We could perhaps
consider making that true by default to avoid the original issue
happening, though it comes with a performance cost of checking for
unused levels.

This PR also updates the implementation of map; it was previously
doing quite a bit of extra work: mapping over keys twice for some
reason, in addition to checking the resulting mapped values for
duplicates. As far as I can tell/understand, having duplicates in the
pool shouldn't actually cause any real issues, but maybe others can
comment more on that. The only place I can think is maybe in the sorting
code which I admit I haven't look at very closely.

Anyway, still need to add tests here, but figured I would put up the PR
as-is for discussion on the approach.

Addresses #36. This PR proposes ways to address the issue in #36, namely
that unused pool levels can cause unexpected results when mapping (and
possibly other operations). It introduces a new exported
`dropunusedlevels!` function that will scan a `PooledArray` and remove
any pooled values not being reference in the ref array any more. It also
updates the `map` implementation to allow a keyword argument
`dropunusedlevels=true` to call this before mapping. We could perhaps
consider making that `true` by default to avoid the original issue
happening, though it comes with a performance cost of checking for
unused levels.

This PR also updates the implementation of `map`; it was previously
doing quite a bit of extra work: mapping over keys twice for some
reason, in addition to checking the resulting mapped values for
duplicates. As far as I can tell/understand, having duplicates in the
pool shouldn't actually cause any real issues, but maybe others can
comment more on that. The only place I can think is maybe in the sorting
code which I admit I haven't look at very closely.

Anyway, still need to add tests here, but figured I would put up the PR
as-is for discussion on the approach.
@bkamins
Copy link
Member

bkamins commented Nov 21, 2020

As I have commented before I think this function should be called droplevels! like in CategoricalArrays.jl and also its should be defined in DataAPI.jl.

The rationale is that dropping levels is often required in generic code that is pooled-vector type agnostic (like now we are e.g. in DataFrames.jl).

Also maybe we could consider adding usedlevels function in DataAPI.jl that would give a list of used levels without mutating the source.

@quinnj
Copy link
Member Author

quinnj commented Nov 22, 2020

Ok, PR up for DataAPI.jl to define droplevels! there: JuliaData/DataAPI.jl#31. @nalimilan, are you ok w/ that and the general approach in this PR? I'll comment on the relevant new code in this PR.

for (k, k1) in zip(ks, ks1)
if haskey(newinvpool, k1)
translate[x.invpool[k]] = newinvpool[k1]
"""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan @bkamins, this is where the droplevels! is defined, with the new definition of map below it.

@nalimilan
Copy link
Member

nalimilan commented Nov 22, 2020

Thanks for tackling this @quinnj! I have a few remarks.

Regarding the handling of unused levels, as we discussed at #36 I think we have better solutions than printing a warning, which his is likely to happen in places deep in code that has no special knowledge of PooledArrays and which will be confusing for users. We can keep track somewhere that we got an error for this value, and if we got one error check at the end whether the value is actually used. If not, then we can just drop it, and if it's used we can throw an error. That won't slow down code that doesn't error.

As far as I can tell/understand, having duplicates in the pool shouldn't actually cause any real issues, but maybe others can
comment more on that. The only place I can think is maybe in the sorting code which I admit I haven't look at very closely.

The guarantee that there are no duplicates is essential for the DataFrames grouping code. There, checking for duplicates would be very slow to do. That would basically kill all the advantages of having a PooledArray input since we'd need to use a dict instead of just reusing the reference codes directly (EDIT: see JuliaData/DataFrames.jl#2442 (comment)). Since map is the only operation that can generate duplicate values in the pool, I think we'd better handle them here.

Ok, PR up for DataAPI.jl to define droplevels! there: JuliaData/DataAPI.jl#31. @nalimilan, are you ok w/ that and the general approach in this PR? I'll comment on the relevant new code in this PR.

Actually I don't think this should be considered the same thing as droplevels!. In CategoricalArrays, levels will return unused levels unless you call droplevels!. But in PooledArrays, the pool is an implementation detail: when you call levels, you get only levels that are used, and the order of levels is ignored, just like for Array. I don't think we should have "levels" in the name of this function as it would introduce a confusion with levels, droplevels!, levels!, etc. I'd rather mention the pool and have it tied to refpool. Maybe something like trimpool! or cleanpool!?

Also maybe we could consider adding usedlevels function in DataAPI.jl that would give a list of used levels without mutating the source.

@bkamins This is already what unique does, right?

@bkamins
Copy link
Member

bkamins commented Nov 22, 2020

This is already what unique does, right?

Almost, as levels skips missing and unique retains it.

@bkamins
Copy link
Member

bkamins commented Dec 1, 2020

@nalimilan - so what should be approach here given your comments?

In general I think that a more crucial trait for DataFrames.jl (that probably should be in DataAPI.jl) is a trait signaling if levels are guaranteed to be unique, but this is probably a separate PR.

@nalimilan
Copy link
Member

I've proposed a solution in my comment, let's hear what @quinnj thinks.

I agree that an API to know whether all levels are unique would be useful for DataFrames. If @quinnj agrees that map can preserve this property PooledArray currently have, then the method would always be true. At any rate we should add something to DataAPI and packages can implement it as appropriate.

@bkamins
Copy link
Member

bkamins commented Dec 1, 2020

At any rate we should add something to DataAPI.jl and packages can implement it as appropriate.

Actually I was thinking of it now and we do not need to add anything to DataAPI.jl. What we need to do is to make sure that refpool(A) implements allunique function from Base. In CategoricalArrays.jl it is not a problem (it is a special type already). For PooledArray.jl what we would need to do is to change the return type of refpool(::PooledArray) to a simple custom wrapper over Vector that would override allunique. I think that such an approach is cleaner (as later we can add more functionalities to refpool(A) return value if needed and dispatch based on refpool(A) return value type).

@nalimilan
Copy link
Member

I guess we could do that, but note that introducing custom wrapper types forces compiling more functions. So maybe the simpler approach of adding a function to DataAPI is better.

@bkamins
Copy link
Member

bkamins commented Dec 1, 2020

OK - it is a fair point.

Base automatically changed from master to main January 30, 2021 06:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants