Sparse categorical arrays? #374

gdalle opened this issue Nov 3, 2021 · 12 comments

gdalle commented Nov 3, 2021

Hey there,
Thanks for this really nice package! In order to further decrease the memory footprint of my dataframes, I think a sparse version of CategoricalArray might be useful. I'm not really sure how to implement this yet, but I'm willing to take a look.
Would you welcome a PR in this direction? Do you have any tips?


sprmnt21 commented Nov 3, 2021

I have no particular contributions to offer, but I am interested in learning something about the subject and I would like to ask a few questions.
I would like, for example, to have a more precise idea of the quantities involved.
Can you show with examples what the size of your original database is now and how much would you like to reduce it to?


gdalle commented Nov 3, 2021

Essentially I have thousands of columns with categorical entries and lots (> 90%) of missing values, so I thought it might be useful. However, it turns out that sparse columns in a dataframe are not a good idea (thanks @pdeffebach for this minimal not-working example):

julia> using DataFramesMeta, SparseArrays;

julia> df = DataFrame(x = sparse([1, 0, 0, 0, 1]))
5×1 DataFrame
 Row │ x     
     │ Int64 
─────┼───────
   1 │     1
   2 │     0
   3 │     0
   4 │     0
   5 │     1

julia> @subset! df :x .== 1
ERROR: MethodError: no method matching deleteat!(::SparseVector{Int64, Int64}, ::Vector{Int64})
Closest candidates are:
  deleteat!(::Vector{T} where T, ::AbstractVector{T} where T) at array.jl:1377
  deleteat!(::Vector{T} where T, ::Any) at array.jl:1376
  deleteat!(::BitVector, ::Any) at bitarray.jl:981
  ...
Stacktrace:
 [1] _delete!_helper(df::DataFrame, drop::Vector{Int64})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:999
 [2] delete!(df::DataFrame, inds::Vector{Int64})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:976
 [3] subset!(df::DataFrame, args::Any; skipmissing::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/subset.jl:292
 [4] top-level scope
   @ ~/.julia/packages/DataFramesMeta/lbAjC/src/macros.jl:681


bkamins commented Nov 3, 2021

it turns out that sparse columns in a dataframe are not a good idea

This is unrelated to DataFrames.jl. The problem is that your column type does not define deleteat!. If you add a method of this function for your column type, everything will work.


gdalle commented Nov 3, 2021

But I guess there must be a good reason why deleteat! is not implemented for sparse vectors? Would there be a performance penalty?

pdeffebach commented

CategoricalArrays are already "sparse" in the sense that they don't take up the full memory. So I don't think this is needed.

julia> using CategoricalArrays

julia> x = rand(1:5, 10_000);

julia> y = categorical(x);

julia> sizeof(y)
16

julia> sizeof(x)
80000


bkamins commented Nov 3, 2021

Would there be a performance penalty?

deleteat! could be implemented in general; I think it is just that no one has done it yet. The cost of the operation would not be that big (though of course bigger than for a Vector). You would need to deleteat! the nzind and nzval fields, which are Vectors, and then decrement each remaining index in nzind by the number of deleted positions that precede it.
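
A minimal sketch of that approach, written as a non-mutating function for clarity (`sparse_deleteat` is a hypothetical name, not SparseArrays API; `inds` is assumed sorted and unique):

```julia
using SparseArrays

# Sketch of the approach described above: drop the stored entries whose
# positions are deleted, then shift each surviving stored index down by
# the number of deleted positions that precede it.
function sparse_deleteat(v::SparseVector, inds::AbstractVector{<:Integer})
    keep = [!insorted(i, inds) for i in v.nzind]
    newind = [i - searchsortedlast(inds, i) for i in v.nzind[keep]]
    newval = v.nzval[keep]
    return SparseVector(length(v) - length(inds), newind, newval)
end

v = sparse([1, 0, 0, 0, 1])
w = sparse_deleteat(v, [2, 3])  # equivalent to dense [1, 0, 1]
```

A real implementation would mutate nzind and nzval in place instead of allocating new vectors, but the index arithmetic is the same.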

don't take up the full memory. So I don't think this is needed.

But they do take memory linear in their size, while for sparse vectors the memory taken is proportional to the number of non-zero entries. If the data is very sparse, there is a very big difference between these two scenarios.

Still, for the reason that @pdeffebach raises, this is probably not a top priority. However, your measure of the memory footprint is incorrect: sizeof only counts the immediate fields of the wrapper, not the arrays it references. You should do:

julia> using CategoricalArrays

julia> x = rand(1:5, 10_000);

julia> y = categorical(x);

julia> z = categorical(x, compress=true);

julia> Base.summarysize(x)
80040

julia> Base.summarysize(y)
40584

julia> Base.summarysize(z)
10536


gdalle commented Nov 3, 2021

Yes, I agree with both of you. I just thought it would be a nice-to-have in my case, since categorical arrays still store one integer per value no matter what (the reference code). With compress = true this integer can be small, but maybe we can gain even more space by remembering only the indices where values are not missing.
Of course this isn't a top priority. However, I imagine it wouldn't be too hard to implement, which is why I wouldn't mind having a go myself.
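
As a rough illustration of the potential saving (this pokes at the internal refs field purely to measure sizes; it is not supported CategoricalArrays.jl API), one can sparsify the ref codes directly, since a code of 0 already encodes missing:

```julia
using SparseArrays, CategoricalArrays

# Rough illustration of the hoped-for saving on ~99%-missing data.
# Accessing `refs` is an internal detail, shown only to compare sizes.
n = 10_000
x = Vector{Union{Missing,String}}(missing, n)
x[1:100:n] .= "a"                     # ~1% non-missing
c = categorical(x, compress=true)     # refs is a dense Vector{UInt8}
sparse_refs = sparse(c.refs)          # keeps only the nonzero ref codes
Base.summarysize(c.refs), Base.summarysize(sparse_refs)
```

With 99% missing values the sparse refs should be several times smaller than the dense UInt8 refs, even though each stored entry also needs an Int index.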

pdeffebach commented

I would also file an issue in Base about resizing sparse arrays, to see what they think.


bkamins commented Nov 3, 2021

I imagine it wouldn't be too hard to implement, which is why I wouldn't mind having a go myself

I think it would not be that hard. You would probably need to specify which value you want treated as the "zero" of the sparse array.

However, I would propose to wait for @nalimilan to comment how he sees adding it to CategoricalArrays.jl from a general perspective of package maintenance.


gdalle commented Nov 3, 2021

Yes, let's start with @pdeffebach's suggestion to shake things up in the stdlib.

nalimilan commented

In theory it shouldn't be too hard to support sparse array-backed CategoricalArrays. This would mainly require adding another type parameter corresponding to the type of the refs field. Maybe some places would have to be adapted to use similar instead of allocating an Array (need to check). Since 0 is already used to represent #undef or missing values, a sparse array of integer values should work automatically.

You're welcome to experiment with this if that would be useful for you, though I cannot promise I will accept the change if it turns out to increase the package's complexity too much. As you can see, the number of type parameters is already quite large... (One possible advantage of supporting any array type for refs is that CatArrOrSub could be dropped in favor of having view return a CategoricalArray wrapping a SubArray, rather than the contrary, as we do now.)
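
A toy sketch of that design, using only Base and SparseArrays (the struct and its field layout are illustrative, not how CategoricalArrays.jl is actually organized):

```julia
using SparseArrays

# Toy version of the design sketched above: the refs storage is a type
# parameter, so it can be any integer array, including a SparseVector
# whose structural zero stands for `missing`. All names are hypothetical.
struct MiniCatVec{T,R<:AbstractVector{<:Integer}} <: AbstractVector{Union{T,Missing}}
    refs::R            # 0 => missing, i > 0 => pool[i]
    pool::Vector{T}
end

Base.size(v::MiniCatVec) = size(v.refs)
function Base.getindex(v::MiniCatVec, i::Int)
    r = v.refs[i]
    return r == 0 ? missing : v.pool[r]
end

v = MiniCatVec(sparse([1, 0, 2, 0, 1]), ["low", "high"])
v[3]  # "high"; v[2] is missing
```

The point of the parameterization is that indexing code never needs to know whether refs is dense or sparse; only construction and mutation would.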

Out of curiosity, could you develop in what kind of situation you have this kind of data?


gdalle commented Nov 3, 2021

My data points were generated by many semi-independent sub-systems, and not every sub-system logs the same columns, which means most columns are very sparse.
