Sparse categorical arrays? #374

gdalle opened this issue Nov 3, 2021 · 12 comments

gdalle commented Nov 3, 2021

Hey there,
Thanks for this really nice package! In order to further decrease the memory footprint of my dataframes, I think a sparse version of CategoricalArray might be useful. I'm not really sure how to implement this yet, but I'm willing to take a look.
Would you welcome a PR in this direction? Do you have any tips?


sprmnt21 commented Nov 3, 2021

I have no particular contributions to offer, but I am interested in learning something about the subject and I would like to ask a few questions.
I would like, for example, to have a more precise idea of the quantities involved.
Can you show with examples what the size of your original database is now and how much would you like to reduce it to?


gdalle commented Nov 3, 2021

Essentially I have thousands of columns with categorical entries and lots (> 90%) of missing values, so I thought it might be useful. However, it turns out that sparse columns in a dataframe are not a good idea (thanks @pdeffebach for this minimal not-working example):

julia> using DataFramesMeta, SparseArrays;

julia> df = DataFrame(x = sparse([1, 0, 0, 0, 1]))
5×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     0
   3 │     0
   4 │     0
   5 │     1

julia> @subset! df :x .== 1
ERROR: MethodError: no method matching deleteat!(::SparseVector{Int64, Int64}, ::Vector{Int64})
Closest candidates are:
  deleteat!(::Vector{T} where T, ::AbstractVector{T} where T) at array.jl:1377
  deleteat!(::Vector{T} where T, ::Any) at array.jl:1376
  deleteat!(::BitVector, ::Any) at bitarray.jl:981
  ...
Stacktrace:
 [1] _delete!_helper(df::DataFrame, drop::Vector{Int64})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:999
 [2] delete!(df::DataFrame, inds::Vector{Int64})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:976
 [3] subset!(df::DataFrame, args::Any; skipmissing::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/subset.jl:292
 [4] top-level scope
   @ ~/.julia/packages/DataFramesMeta/lbAjC/src/macros.jl:681


bkamins commented Nov 3, 2021

it turns out that sparse columns in a dataframe are not a good idea

This is unrelated to DataFrames.jl. The problem is that your column type does not define deleteat!. If you add a method of this function for your column type, everything will work.


gdalle commented Nov 3, 2021

But I guess there must be a good reason why deleteat! is not implemented for sparse vectors? Would there be a performance penalty?

pdeffebach commented

CategoricalArrays are already "sparse" in the sense that they don't take up the full memory. So I don't think this is needed.

julia> using CategoricalArrays

julia> x = rand(1:5, 10_000);

julia> y = categorical(x);

julia> sizeof(y)
16

julia> sizeof(x)
80000


bkamins commented Nov 3, 2021

Would there be a performance penalty?

deleteat! could be implemented in general; I think it is just that no one has done it yet. The cost of the operation would not be that big (though of course bigger than for a Vector). You would need to deleteat! the nzind and nzval fields, which are Vectors, and then decrement each remaining index in nzind by the number of deleted positions that precede it.
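
A minimal sketch of that approach, written as a non-mutating function for clarity (`sparse_deleteat` is a hypothetical name, not SparseArrays API; `inds` is assumed sorted and unique):

```julia
using SparseArrays

# Sketch of the approach described above: drop the stored entries whose
# positions are deleted, then shift each surviving stored index down by
# the number of deleted positions that precede it.
function sparse_deleteat(v::SparseVector, inds::AbstractVector{<:Integer})
    keep = [!insorted(i, inds) for i in v.nzind]
    newind = [i - searchsortedlast(inds, i) for i in v.nzind[keep]]
    newval = v.nzval[keep]
    return SparseVector(length(v) - length(inds), newind, newval)
end

v = sparse([1, 0, 0, 0, 1])
w = sparse_deleteat(v, [2, 3])  # equivalent to dense [1, 0, 1]
```

A real implementation would mutate nzind and nzval in place instead of allocating new vectors, but the index arithmetic is the same.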

don't take up the full memory. So I don't think this is needed.

But they do take memory linear in their size, while for sparse vectors the memory taken is proportional to the number of non-zero entries. If the data is very sparse, there is a very big difference between these two scenarios.

Still, for the reason that @pdeffebach raises, this is probably not a top priority. However, your measure of the memory footprint is incorrect: sizeof only counts the immediate fields of the wrapper, not the arrays it references. You should do:

julia> using CategoricalArrays

julia> x = rand(1:5, 10_000);

julia> y = categorical(x);

julia> z = categorical(x, compress=true);

julia> Base.summarysize(x)
80040

julia> Base.summarysize(y)
40584

julia> Base.summarysize(z)
10536


gdalle commented Nov 3, 2021

Yes, I agree with both of you. I just thought it would be a nice-to-have in my case, since categorical arrays still store one integer per value no matter what (the reference code). With compress = true this integer can be small, but maybe we can gain even more space by remembering only the indices where values are not missing.
Of course this isn't a top priority. However, I imagine it wouldn't be too hard to implement, which is why I wouldn't mind having a go myself.
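
As a rough illustration of the potential saving (this pokes at the internal refs field purely to measure sizes; it is not supported CategoricalArrays.jl API), one can sparsify the ref codes directly, since a code of 0 already encodes missing:

```julia
using SparseArrays, CategoricalArrays

# Rough illustration of the hoped-for saving on ~99%-missing data.
# Accessing `refs` is an internal detail, shown only to compare sizes.
n = 10_000
x = Vector{Union{Missing,String}}(missing, n)
x[1:100:n] .= "a"                     # ~1% non-missing
c = categorical(x, compress=true)     # refs is a dense Vector{UInt8}
sparse_refs = sparse(c.refs)          # keeps only the nonzero ref codes
Base.summarysize(c.refs), Base.summarysize(sparse_refs)
```

With 99% missing values the sparse refs should be several times smaller than the dense UInt8 refs, even though each stored entry also needs an Int index.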

pdeffebach commented

I would also file an issue in Base about resizing sparse arrays, to see what they think.


bkamins commented Nov 3, 2021

I imagine it wouldn't be too hard to implement, which is why I wouldn't mind having a go myself

I think it would not be that hard. You would probably need to specify which value you want treated as the "zero" of the sparse array.

However, I would propose to wait for @nalimilan to comment how he sees adding it to CategoricalArrays.jl from a general perspective of package maintenance.


gdalle commented Nov 3, 2021

Yes, let's start with @pdeffebach's suggestion to shake things up in the stdlib.

nalimilan commented

In theory it shouldn't be too hard to support sparse array-backed CategoricalArrays. This would mainly require adding another type parameter corresponding to the type of the refs field. Maybe some places would have to be adapted to use similar instead of allocating an Array (need to check). Since 0 is already used to represent #undef or missing values, a sparse array of integer values should work automatically.

You're welcome to experiment with this if that would be useful for you, though I cannot promise I will accept the change if it turns out to increase the package's complexity too much. As you can see, the number of type parameters is already quite large... (One possible advantage of supporting any array type for refs is that CatArrOrSub could be dropped in favor of having view return a CategoricalArray wrapping a SubArray, rather than the contrary, as we do now.)
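
A toy sketch of that design, using only Base and SparseArrays (the struct and its field layout are illustrative, not how CategoricalArrays.jl is actually organized):

```julia
using SparseArrays

# Toy version of the design sketched above: the refs storage is a type
# parameter, so it can be any integer array, including a SparseVector
# whose structural zero stands for `missing`. All names are hypothetical.
struct MiniCatVec{T,R<:AbstractVector{<:Integer}} <: AbstractVector{Union{T,Missing}}
    refs::R            # 0 => missing, i > 0 => pool[i]
    pool::Vector{T}
end

Base.size(v::MiniCatVec) = size(v.refs)
function Base.getindex(v::MiniCatVec, i::Int)
    r = v.refs[i]
    return r == 0 ? missing : v.pool[r]
end

v = MiniCatVec(sparse([1, 0, 2, 0, 1]), ["low", "high"])
v[3]  # "high"; v[2] is missing
```

The point of the parameterization is that indexing code never needs to know whether refs is dense or sparse; only construction and mutation would.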

Out of curiosity, could you develop in what kind of situation you have this kind of data?


gdalle commented Nov 3, 2021

My data points were generated by many semi-independent sub-systems, and not every sub-system logs the same columns, which means most columns are very sparse.
