Introduce new dropunusedlevels! function and allow using from map #43

quinnj · 2020-11-21T07:55:40Z

The first commit is applying JuliaFormatter to the code, which was pretty poorly formatted. Sorry for the diff noise, but you can flip the switch to ignore whitespace changes that should make it easier to digest. Copying the body of the git commit for the real work below:

Addresses #36. This PR proposes ways to address the issue in #36, namely
that unused pool levels can cause unexpected results when mapping (and
possibly other operations). It introduces a new exported
dropunusedlevels! function that will scan a PooledArray and remove
any pooled values not being reference in the ref array any more. It also
updates the map implementation to allow a keyword argument
dropunusedlevels=true to call this before mapping. We could perhaps
consider making that true by default to avoid the original issue
happening, though it comes with a performance cost of checking for
unused levels.

This PR also updates the implementation of map; it was previously
doing quite a bit of extra work: mapping over keys twice for some
reason, in addition to checking the resulting mapped values for
duplicates. As far as I can tell/understand, having duplicates in the
pool shouldn't actually cause any real issues, but maybe others can
comment more on that. The only place I can think is maybe in the sorting
code which I admit I haven't look at very closely.

Anyway, still need to add tests here, but figured I would put up the PR
as-is for discussion on the approach.

Addresses #36. This PR proposes ways to address the issue in #36, namely that unused pool levels can cause unexpected results when mapping (and possibly other operations). It introduces a new exported `dropunusedlevels!` function that will scan a `PooledArray` and remove any pooled values not being reference in the ref array any more. It also updates the `map` implementation to allow a keyword argument `dropunusedlevels=true` to call this before mapping. We could perhaps consider making that `true` by default to avoid the original issue happening, though it comes with a performance cost of checking for unused levels. This PR also updates the implementation of `map`; it was previously doing quite a bit of extra work: mapping over keys twice for some reason, in addition to checking the resulting mapped values for duplicates. As far as I can tell/understand, having duplicates in the pool shouldn't actually cause any real issues, but maybe others can comment more on that. The only place I can think is maybe in the sorting code which I admit I haven't look at very closely. Anyway, still need to add tests here, but figured I would put up the PR as-is for discussion on the approach.

bkamins · 2020-11-21T08:32:48Z

As I have commented before I think this function should be called droplevels! like in CategoricalArrays.jl and also its should be defined in DataAPI.jl.

The rationale is that dropping levels is often required in generic code that is pooled-vector type agnostic (like now we are e.g. in DataFrames.jl).

Also maybe we could consider adding usedlevels function in DataAPI.jl that would give a list of used levels without mutating the source.

quinnj · 2020-11-22T06:06:50Z

Ok, PR up for DataAPI.jl to define droplevels! there: JuliaData/DataAPI.jl#31. @nalimilan, are you ok w/ that and the general approach in this PR? I'll comment on the relevant new code in this PR.

quinnj · 2020-11-22T06:07:24Z

src/PooledArrays.jl

-        for (k, k1) in zip(ks, ks1)
-            if haskey(newinvpool, k1)
-                translate[x.invpool[k]] = newinvpool[k1]
+"""


@nalimilan @bkamins, this is where the droplevels! is defined, with the new definition of map below it.

nalimilan · 2020-11-22T14:58:55Z

Thanks for tackling this @quinnj! I have a few remarks.

Regarding the handling of unused levels, as we discussed at #36 I think we have better solutions than printing a warning, which his is likely to happen in places deep in code that has no special knowledge of PooledArrays and which will be confusing for users. We can keep track somewhere that we got an error for this value, and if we got one error check at the end whether the value is actually used. If not, then we can just drop it, and if it's used we can throw an error. That won't slow down code that doesn't error.

As far as I can tell/understand, having duplicates in the pool shouldn't actually cause any real issues, but maybe others can
comment more on that. The only place I can think is maybe in the sorting code which I admit I haven't look at very closely.

The guarantee that there are no duplicates is essential for the DataFrames grouping code. There, checking for duplicates would be very slow to do. That would basically kill all the advantages of having a PooledArray input since we'd need to use a dict instead of just reusing the reference codes directly (EDIT: see JuliaData/DataFrames.jl#2442 (comment)). Since map is the only operation that can generate duplicate values in the pool, I think we'd better handle them here.

Ok, PR up for DataAPI.jl to define droplevels! there: JuliaData/DataAPI.jl#31. @nalimilan, are you ok w/ that and the general approach in this PR? I'll comment on the relevant new code in this PR.

Actually I don't think this should be considered the same thing as droplevels!. In CategoricalArrays, levels will return unused levels unless you call droplevels!. But in PooledArrays, the pool is an implementation detail: when you call levels, you get only levels that are used, and the order of levels is ignored, just like for Array. I don't think we should have "levels" in the name of this function as it would introduce a confusion with levels, droplevels!, levels!, etc. I'd rather mention the pool and have it tied to refpool. Maybe something like trimpool! or cleanpool!?

Also maybe we could consider adding usedlevels function in DataAPI.jl that would give a list of used levels without mutating the source.

@bkamins This is already what unique does, right?

bkamins · 2020-11-22T15:36:22Z

This is already what unique does, right?

Almost, as levels skips missing and unique retains it.

bkamins · 2020-12-01T11:16:44Z

@nalimilan - so what should be approach here given your comments?

In general I think that a more crucial trait for DataFrames.jl (that probably should be in DataAPI.jl) is a trait signaling if levels are guaranteed to be unique, but this is probably a separate PR.

nalimilan · 2020-12-01T11:23:02Z

I've proposed a solution in my comment, let's hear what @quinnj thinks.

I agree that an API to know whether all levels are unique would be useful for DataFrames. If @quinnj agrees that map can preserve this property PooledArray currently have, then the method would always be true. At any rate we should add something to DataAPI and packages can implement it as appropriate.

bkamins · 2020-12-01T11:26:06Z

At any rate we should add something to DataAPI.jl and packages can implement it as appropriate.

Actually I was thinking of it now and we do not need to add anything to DataAPI.jl. What we need to do is to make sure that refpool(A) implements allunique function from Base. In CategoricalArrays.jl it is not a problem (it is a special type already). For PooledArray.jl what we would need to do is to change the return type of refpool(::PooledArray) to a simple custom wrapper over Vector that would override allunique. I think that such an approach is cleaner (as later we can add more functionalities to refpool(A) return value if needed and dispatch based on refpool(A) return value type).

nalimilan · 2020-12-01T13:08:57Z

I guess we could do that, but note that introducing custom wrapper types forces compiling more functions. So maybe the simpler approach of adding a function to DataAPI is better.

bkamins · 2020-12-01T13:14:01Z

OK - it is a fair point.

quinnj added 2 commits November 20, 2020 23:18

JuliaFormatter formatting

87ab96b

quinnj commented Nov 22, 2020

View reviewed changes

Base automatically changed from master to main January 30, 2021 06:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce new dropunusedlevels! function and allow using from map #43

Introduce new dropunusedlevels! function and allow using from map #43

quinnj commented Nov 21, 2020

bkamins commented Nov 21, 2020 •

edited

Loading

quinnj commented Nov 22, 2020

quinnj Nov 22, 2020

nalimilan commented Nov 22, 2020 •

edited

Loading

bkamins commented Nov 22, 2020

bkamins commented Dec 1, 2020

nalimilan commented Dec 1, 2020

bkamins commented Dec 1, 2020

nalimilan commented Dec 1, 2020

bkamins commented Dec 1, 2020

Introduce new dropunusedlevels! function and allow using from map #43

Are you sure you want to change the base?

Introduce new dropunusedlevels! function and allow using from map #43

Conversation

quinnj commented Nov 21, 2020

bkamins commented Nov 21, 2020 • edited Loading

quinnj commented Nov 22, 2020

quinnj Nov 22, 2020

Choose a reason for hiding this comment

nalimilan commented Nov 22, 2020 • edited Loading

bkamins commented Nov 22, 2020

bkamins commented Dec 1, 2020

nalimilan commented Dec 1, 2020

bkamins commented Dec 1, 2020

nalimilan commented Dec 1, 2020

bkamins commented Dec 1, 2020

bkamins commented Nov 21, 2020 •

edited

Loading

nalimilan commented Nov 22, 2020 •

edited

Loading