[BREAKING] Change `stack` to create a `PooledArray{String}` column by default #2391

nalimilan · 2020-08-29T14:57:06Z

This is less likely to confuse users (it's the dplyr and pandas default), and it would allow dropping the dependency on CategoricalArrays if we wanted to.
The drawback of returning strings is that the order of old columns is only reflected by the order of rows, and no longer by the sorting order of levels.

Also return a PooledArray{Symbol} when variable_eltype=Symbol is passed. That's more consistent and more efficient, e.g. if you want to group on symbols later.

Any element type is now accepted, with CategoricalValue{String} being handled by the general path. In practice it would allow using a custom AbstractString type, though that's relatively unlikely.

See #1947 for a previous discussion.

This is less likely to confuse users (it's the dplyr and pandas default), and it would allow dropping the dependency on CategoricalArrays if we wanted to. The drawback of returning strings is that the order of old columns is only reflected by the order of rows, and no longer by the sorting order of levels. Also return a `PooledArray{Symbol}` when `variable_eltype=Symbol` is passed. That's more consistent and more efficient, e.g. if you want to group on symbols later. Any element type is now accepted, with `CategoricalValue{String}` being handled by the general path. In practice it would allow using a custom `AbstractString` type, though that's relatively unlikely.

tk3369 · 2020-08-29T16:48:35Z

I like this as I found myself almost always having to unwrap the column after a stack call.

quinnj

I like this as well. I like defaulting to PooledArrays as a lighter dependency w/ the efficient storage, while still allowing CategoricalArray to work automatically. I think the more we can refactor things after this pattern, the better it will be all around; we want to make sure CategoricalArray "just works", but it would also be great to break the dependency.

bkamins · 2020-08-29T21:50:53Z

src/abstractdataframe/reshape.jl

+        # this covers CategoricalArray{String} in particular,
+        # as copyto! inserts levels in their order of appearance
+        nms = names(df, ints_measure_vars)
+        catnms = copyto!(similar(nms, variable_eltype), nms)


This relies on similar for CategoricalValue{String} producing a CategoricalVector. Is this something that will be kept being provided in CategoricalArrays.jl? (I guess you intend so, but just making sure.)

That's a good question. Given improvements at JuliaLang/julia#37163, we may be able to keep that function in the end. Anyway if CategoricalArrays stops doing that, it will be in a major release so we'll be able to adapt. But yeah if we drop that we'll have to find another approach if we want to drop the CategoricalArrays dependency...

Concretely, for this PR, I could also simply go back to explicitly creating a CategoricalArray. That would probably be clearer.

Maybe this is what we can do in this PR? I would go for the "smart" path at the moment we decide do drop CategoricalArrays.jl dependency of DataFrames.jl. I will post on #data to ask about it.

src/abstractdataframe/reshape.jl

bkamins · 2020-08-29T21:58:03Z

but it would also be great to break the dependency.

This is the goal and I think if we want to go for it we should clean up all places where we have it in 0.22. The major things left are:

categorical function - which I think we could just drop as I am not sure how much it is used
fast grouping rows (for groupby and joins) - but here we probably can delegate things to DataAPI.jl - @nalimilan should have a clear vision of the details I assume
reshape.jl - some more minor removals (this should not be very problematic I think)

bkamins · 2020-08-30T12:11:19Z

Given #2321 I would go for removing CategoricalArrays.jl here as you proposed. Then in the docstring we should say that CategoricalArrays.jl should be loaded for what is recommended to work (to make sure we do not forget to add it later).

I have also noticed that I think it makes no sense to create PooledVector{Symbol}, as there is no benefit over Vector{Symbol}, so I would change this PR to create Vector{Symbol} if Symbol is passed.

nalimilan · 2020-08-30T12:17:39Z

categorical function - which I think we could just drop as I am not sure how much it is used

Right, see #2394.

fast grouping rows (for groupby and joins) - but here we probably can delegate things to DataAPI.jl - @nalimilan should have a clear vision of the details I assume

Yes, the API is already there, I just need a way to implement it. The issue I've discovered is that if I want refpool to return the CategoricalPool, it has to allow passing zero to getindex for missing values. But until now CategoricalPool had now idea of missing values: only the parent array allowed or didn't allow for missing values. So I have to add a new type parameter to CategoricalPool, or to have refpool return a different object (which unfortunately cannot be a plain Array due to zeros...).

reshape.jl - some more minor removals (this should not be very problematic I think)

Hopefully it should be easy.

Given #2321 I would go for removing CategoricalArrays.jl here as you proposed. Then in the docstring we should say that CategoricalArrays.jl should be loaded for what is recommended to work (to make sure we do not forget to add it later).

OK I'll have a look.

I have also noticed that I think it makes no sense to create PooledVector{Symbol}, as there is no benefit over Vector{Symbol}, so I would change this PR to create Vector{Symbol} if Symbol is passed.

Actually I think the benefit is exactly the same as for Vector{String}, as in both cases the array holds 64-bit values pointing to shared objects. PooledArray uses 32-bit refs, and allows performing grouping much more efficiently as the pool is known.

bkamins · 2020-08-30T12:58:53Z

Actually I think the benefit is exactly the same

It is not the same, as String is not interned, but Symbol is, but agreed that the benefit is that the pool would be known for grouping/joins, so let us leave it as is.

nalimilan · 2020-08-30T15:00:04Z

Well at least repeat reuses the same string:

julia> x = repeat(["a", "b"], outer=2)
4-element Array{String,1}:
 "a"
 "b"
 "a"
 "b"

julia> x[1] === x[3]
true

But yeah with symbols the difference is that you're sure that two different pointers refer to two different symbols, which isn't the case for strings (that's a big problem for grouping).

bkamins · 2020-08-30T15:04:05Z

Well 😄 :

julia> s1 = "abcd"^100;

julia> s2 = "abcd"^100;

julia> s1 === s2
true

julia> pointer(s1) == pointer(s2)
false

so you see that === does not do an equal pointer test. But indeed repeat reuses the same string, so you are right:

julia> x = repeat([s1], 2);

julia> pointer(x[1]) == pointer(x[2])
true

nalimilan · 2020-08-30T15:47:12Z

Given #2321 I would go for removing CategoricalArrays.jl here as you proposed. Then in the docstring we should say that CategoricalArrays.jl should be loaded for what is recommended to work (to make sure we do not forget to add it later).

Actually after looking at it I'd rather merge this and #2394 first, and only them stop reexporting CategoricalArrays: that can't be done before merging either of these, and that's a logically independent step.

bkamins · 2020-08-30T20:27:16Z

src/abstractdataframe/reshape.jl

-    elseif variable_eltype <: String
-        catnms = names(df, measure_vars)
+    if variable_eltype === Symbol
+        catnms = PooledArray(_names(df)[measure_vars])


there is probably not much benefit from it here but we can have it for consistency.

Good catch. Yeah that's probably counter-productive here so I've reverted the change. What could make sense is to define copy(::RepeatedVector) to return a PooledArray, but that's non-breaking so we can do that later.

bkamins · 2020-08-30T20:32:38Z

I have left two minor comments - can you please merge this PR when you decide on them. Thank you!

bkamins · 2020-08-31T09:11:36Z

Looks good to merge after the CI passes

quinnj approved these changes Aug 29, 2020

View reviewed changes

bkamins changed the title ~~Change stack to create a PooledArray{String} column by default~~ [BREAKING] Change stack to create a PooledArray{String} column by default Aug 29, 2020

bkamins added the breaking The proposed change is breaking. label Aug 29, 2020

bkamins added this to the 1.0 milestone Aug 29, 2020

bkamins reviewed Aug 29, 2020

View reviewed changes

src/abstractdataframe/reshape.jl Show resolved Hide resolved

Use ===

a4bb506

Fixes

95ee656

bkamins reviewed Aug 30, 2020

View reviewed changes

bkamins approved these changes Aug 30, 2020

View reviewed changes

nalimilan added 2 commits August 31, 2020 08:36

Merge branch 'master' into nl/stack

4e86a1e

Fixes

918d734

nalimilan merged commit 6b0feaa into master Aug 31, 2020

nalimilan deleted the nl/stack branch August 31, 2020 13:04

nalimilan mentioned this pull request Sep 1, 2020

[BREAKING] Stop reexporting CategoricalArrays #2404

Merged

JuliaRegistrator mentioned this pull request Nov 15, 2020

New version: DataFrames v0.22.0 JuliaRegistries/General#24650

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BREAKING] Change `stack` to create a `PooledArray{String}` column by default #2391

[BREAKING] Change `stack` to create a `PooledArray{String}` column by default #2391

nalimilan commented Aug 29, 2020 •

edited

Loading

tk3369 commented Aug 29, 2020

quinnj left a comment

bkamins Aug 29, 2020

nalimilan Aug 30, 2020

bkamins Aug 30, 2020

bkamins commented Aug 29, 2020

bkamins commented Aug 30, 2020

nalimilan commented Aug 30, 2020 •

edited

Loading

bkamins commented Aug 30, 2020

nalimilan commented Aug 30, 2020

bkamins commented Aug 30, 2020

nalimilan commented Aug 30, 2020

bkamins Aug 30, 2020

nalimilan Aug 31, 2020

bkamins commented Aug 30, 2020

bkamins commented Aug 31, 2020

[BREAKING] Change stack to create a PooledArray{String} column by default #2391

[BREAKING] Change stack to create a PooledArray{String} column by default #2391

Conversation

nalimilan commented Aug 29, 2020 • edited Loading

tk3369 commented Aug 29, 2020

quinnj left a comment

Choose a reason for hiding this comment

bkamins Aug 29, 2020

Choose a reason for hiding this comment

nalimilan Aug 30, 2020

Choose a reason for hiding this comment

bkamins Aug 30, 2020

Choose a reason for hiding this comment

bkamins commented Aug 29, 2020

bkamins commented Aug 30, 2020

nalimilan commented Aug 30, 2020 • edited Loading

bkamins commented Aug 30, 2020

nalimilan commented Aug 30, 2020

bkamins commented Aug 30, 2020

nalimilan commented Aug 30, 2020

bkamins Aug 30, 2020

Choose a reason for hiding this comment

nalimilan Aug 31, 2020

Choose a reason for hiding this comment

bkamins commented Aug 30, 2020

bkamins commented Aug 31, 2020

[BREAKING] Change `stack` to create a `PooledArray{String}` column by default #2391

[BREAKING] Change `stack` to create a `PooledArray{String}` column by default #2391

nalimilan commented Aug 29, 2020 •

edited

Loading

nalimilan commented Aug 30, 2020 •

edited

Loading