Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BREAKING] Change stack to create a PooledArray{String} column by default #2391

Merged
merged 5 commits into from
Aug 31, 2020

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Aug 29, 2020

This is less likely to confuse users (it's the dplyr and pandas default), and it would allow dropping the dependency on CategoricalArrays if we wanted to.
The drawback of returning strings is that the order of old columns is only reflected by the order of rows, and no longer by the sorting order of levels.

Also return a PooledArray{Symbol} when variable_eltype=Symbol is passed. That's more consistent and more efficient, e.g. if you want to group on symbols later.

Any element type is now accepted, with CategoricalValue{String} being handled by the general path. In practice it would allow using a custom AbstractString type, though that's relatively unlikely.

See #1947 for a previous discussion.

This is less likely to confuse users (it's the dplyr and pandas default),
and it would allow dropping the dependency on CategoricalArrays if we wanted to.
The drawback of returning strings is that the order of old columns is only
reflected by the order of rows, and no longer by the sorting order of levels.

Also return a `PooledArray{Symbol}` when `variable_eltype=Symbol` is passed.
That's more consistent and more efficient, e.g. if you want to group on symbols
later.

Any element type is now accepted, with `CategoricalValue{String}` being handled
by the general path. In practice it would allow using a custom `AbstractString`
type, though that's relatively unlikely.
@tk3369
Copy link
Contributor

tk3369 commented Aug 29, 2020

I like this as I found myself almost always having to unwrap the column after a stack call.

Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this as well. I like defaulting to PooledArrays as a lighter dependency w/ the efficient storage, while still allowing CategoricalArray to work automatically. I think the more we can refactor things after this pattern, the better it will be all around; we want to make sure CategoricalArray "just works", but it would also be great to break the dependency.

@bkamins bkamins changed the title Change stack to create a PooledArray{String} column by default [BREAKING] Change stack to create a PooledArray{String} column by default Aug 29, 2020
@bkamins bkamins added the breaking The proposed change is breaking. label Aug 29, 2020
@bkamins bkamins added this to the 1.0 milestone Aug 29, 2020
# this covers CategoricalArray{String} in particular,
# as copyto! inserts levels in their order of appearance
nms = names(df, ints_measure_vars)
catnms = copyto!(similar(nms, variable_eltype), nms)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This relies on similar for CategoricalValue{String} producing a CategoricalVector. Is this something that will be kept being provided in CategoricalArrays.jl? (I guess you intend so, but just making sure.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. Given improvements at JuliaLang/julia#37163, we may be able to keep that function in the end. Anyway if CategoricalArrays stops doing that, it will be in a major release so we'll be able to adapt. But yeah if we drop that we'll have to find another approach if we want to drop the CategoricalArrays dependency...

Concretely, for this PR, I could also simply go back to explicitly creating a CategoricalArray. That would probably be clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is what we can do in this PR? I would go for the "smart" path at the moment we decide do drop CategoricalArrays.jl dependency of DataFrames.jl. I will post on #data to ask about it.

@bkamins
Copy link
Member

bkamins commented Aug 29, 2020

but it would also be great to break the dependency.

This is the goal and I think if we want to go for it we should clean up all places where we have it in 0.22. The major things left are:

  • categorical function - which I think we could just drop as I am not sure how much it is used
  • fast grouping rows (for groupby and joins) - but here we probably can delegate things to DataAPI.jl - @nalimilan should have a clear vision of the details I assume
  • reshape.jl - some more minor removals (this should not be very problematic I think)

@bkamins
Copy link
Member

bkamins commented Aug 30, 2020

Given #2321 I would go for removing CategoricalArrays.jl here as you proposed. Then in the docstring we should say that CategoricalArrays.jl should be loaded for what is recommended to work (to make sure we do not forget to add it later).

I have also noticed that I think it makes no sense to create PooledVector{Symbol}, as there is no benefit over Vector{Symbol}, so I would change this PR to create Vector{Symbol} if Symbol is passed.

@nalimilan
Copy link
Member Author

nalimilan commented Aug 30, 2020

  • categorical function - which I think we could just drop as I am not sure how much it is used

Right, see #2394.

  • fast grouping rows (for groupby and joins) - but here we probably can delegate things to DataAPI.jl - @nalimilan should have a clear vision of the details I assume

Yes, the API is already there, I just need a way to implement it. The issue I've discovered is that if I want refpool to return the CategoricalPool, it has to allow passing zero to getindex for missing values. But until now CategoricalPool had now idea of missing values: only the parent array allowed or didn't allow for missing values. So I have to add a new type parameter to CategoricalPool, or to have refpool return a different object (which unfortunately cannot be a plain Array due to zeros...).

  • reshape.jl - some more minor removals (this should not be very problematic I think)

Hopefully it should be easy.

Given #2321 I would go for removing CategoricalArrays.jl here as you proposed. Then in the docstring we should say that CategoricalArrays.jl should be loaded for what is recommended to work (to make sure we do not forget to add it later).

OK I'll have a look.

I have also noticed that I think it makes no sense to create PooledVector{Symbol}, as there is no benefit over Vector{Symbol}, so I would change this PR to create Vector{Symbol} if Symbol is passed.

Actually I think the benefit is exactly the same as for Vector{String}, as in both cases the array holds 64-bit values pointing to shared objects. PooledArray uses 32-bit refs, and allows performing grouping much more efficiently as the pool is known.

@bkamins
Copy link
Member

bkamins commented Aug 30, 2020

Actually I think the benefit is exactly the same

It is not the same, as String is not interned, but Symbol is, but agreed that the benefit is that the pool would be known for grouping/joins, so let us leave it as is.

@nalimilan
Copy link
Member Author

Well at least repeat reuses the same string:

julia> x = repeat(["a", "b"], outer=2)
4-element Array{String,1}:
 "a"
 "b"
 "a"
 "b"

julia> x[1] === x[3]
true

But yeah with symbols the difference is that you're sure that two different pointers refer to two different symbols, which isn't the case for strings (that's a big problem for grouping).

@bkamins
Copy link
Member

bkamins commented Aug 30, 2020

Well 😄 :

julia> s1 = "abcd"^100;

julia> s2 = "abcd"^100;

julia> s1 === s2
true

julia> pointer(s1) == pointer(s2)
false

so you see that === does not do an equal pointer test. But indeed repeat reuses the same string, so you are right:

julia> x = repeat([s1], 2);

julia> pointer(x[1]) == pointer(x[2])
true

@nalimilan
Copy link
Member Author

Given #2321 I would go for removing CategoricalArrays.jl here as you proposed. Then in the docstring we should say that CategoricalArrays.jl should be loaded for what is recommended to work (to make sure we do not forget to add it later).

Actually after looking at it I'd rather merge this and #2394 first, and only them stop reexporting CategoricalArrays: that can't be done before merging either of these, and that's a logically independent step.

elseif variable_eltype <: String
catnms = names(df, measure_vars)
if variable_eltype === Symbol
catnms = PooledArray(_names(df)[measure_vars])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is probably not much benefit from it here but we can have it for consistency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Yeah that's probably counter-productive here so I've reverted the change. What could make sense is to define copy(::RepeatedVector) to return a PooledArray, but that's non-breaking so we can do that later.

@bkamins
Copy link
Member

bkamins commented Aug 30, 2020

I have left two minor comments - can you please merge this PR when you decide on them. Thank you!

@bkamins
Copy link
Member

bkamins commented Aug 31, 2020

Looks good to merge after the CI passes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants