API: Add Dictionary-encoded Extension Type #20899

TomAugspurger · 2018-05-01T11:07:27Z

Currently, Categorical serves two main purposes:

- A type for expressing data from a fixed set of categories
- A memory-efficient storage format for low-cardinality objects

This proposal is to add a new extension type (let's call it DictEncodedArray
for now) for the second use case. The storage format would be the same as
Categorical: an Index of the unique "keys" (categories) and an array of codes.
Much of the implementation would be shared, but the two types would have
different semantics for the following operations:

- concat (union by default)
- groupby (unobserved categories would be dropped by default)
- value_counts (unobserved categories would be dropped by default)
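
For reference, here is a minimal sketch of how the existing Categorical behaves for the three operations listed above. It only illustrates current pandas behavior that the proposed type would change; the sample data and variable names are made up, and `union_categoricals` is shown as the existing opt-in way to get a union on concatenation.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Storage format: an Index of unique categories plus an integer array of codes.
cat = pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"])
codes = cat.codes            # integer positions into cat.categories
categories = cat.categories  # Index holding "a", "b", "c"

# value_counts: Categorical reports the unobserved category "c" with count 0.
counts = pd.Series(cat).value_counts()

# groupby: observed=False keeps the unobserved group, observed=True drops it.
df = pd.DataFrame({"key": cat, "val": [1, 2, 3]})
all_groups = df.groupby("key", observed=False)["val"].sum()
observed_only = df.groupby("key", observed=True)["val"].sum()

# concat: Series whose categories differ do not stay categorical.
left = pd.Series(["a", "b"], dtype="category")
right = pd.Series(["b", "c"], dtype="category")
combined = pd.concat([left, right], ignore_index=True)

# A union of the categories currently has to be requested explicitly.
unioned = union_categoricals([left, right])
```

Under the proposal, the dictionary-encoded type would behave like `observed_only` and `unioned` by default.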

This is most useful for strings, but could even be useful for storing a large
array of 64-bit precision items (store the 64-bit items once, then use an int16
or int32 array for the codes).
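
As a rough illustration of the memory argument, the sketch below dictionary-encodes a float64 array by hand with NumPy. The data (one million values drawn from 1,000 distinct floats) is an assumption chosen for the example, not a measurement from the issue.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical low-cardinality data: 1,000,000 float64 values,
# all drawn from a pool of only 1,000 distinct values.
pool = rng.standard_normal(1_000)
values = rng.choice(pool, size=1_000_000)

# Dictionary-encode: store each distinct 64-bit value once ("keys"),
# and keep a small integer code per element that points into the keys.
keys, codes = np.unique(values, return_inverse=True)
codes = codes.astype(np.int16)  # 1,000 keys fit comfortably in int16

dense_bytes = values.nbytes                  # 8 bytes per element
encoded_bytes = keys.nbytes + codes.nbytes   # keys once + 2 bytes per element

# Decoding is a single take operation and reproduces the original array.
roundtrip = keys[codes]
```

With these assumptions the encoded form needs roughly a quarter of the dense storage, since each element costs 2 bytes instead of 8.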

toobaz · 2018-05-01T11:23:56Z

From #20583, @jankatins:

> If you have a defined order, where would you put the new category in the order? E.g. with good - middle - bad, where would you put 'extreme'? If you care about order, then you also care about not adding new stuff.

Sure: an exception can be when the new categories include the old ones, but that's a very specific case, and I agree the ability to add categories is less relevant for ordered categories (by "less relevant" I mean that it could, for instance, lose the ordering, or append the new ones according to their own order).

toobaz · 2018-05-03T17:57:15Z

> This is most useful for strings, but could even be useful for storing a large
> array of 64-bit precision items (store the 64-bit items once, then use an int16
> or int32 array for the codes).

... or even arbitrary Python objects (for which you save not only RAM, but possibly also CPU time, e.g. when comparing).

Did I understand correctly that, for strings, we could expect a significant advantage from storing all categories in a single concatenated string and (internally) mapping each category to a slice of it, rather than mapping categories directly to individual string objects?
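
To make the layout in the question concrete, here is a small pure-Python sketch: all categories live in one concatenated string, and an offsets list marks where each one starts and ends. The names `buffer`, `offsets`, and `category_at` are hypothetical, and the sketch uses character offsets for simplicity; Arrow's variable-length string layout follows the same idea with byte offsets into a UTF-8 buffer.

```python
# Categories to store (borrowed from the ordering example earlier in the thread).
categories = ["good", "middle", "bad"]

# One contiguous buffer instead of one Python string object per category.
buffer = "".join(categories)

# offsets[i] is where category i starts; offsets[i + 1] is where it ends.
offsets = [0]
for category in categories:
    offsets.append(offsets[-1] + len(category))


def category_at(code):
    """Return the category for an integer code by slicing the shared buffer."""
    return buffer[offsets[code]:offsets[code + 1]]


# The data itself is still just an array of codes.
codes = [0, 2, 2, 1]
decoded = [category_at(code) for code in codes]
```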

TomAugspurger · 2019-12-30T15:22:37Z

Not happening for 1.0

jbrockmendel · 2023-07-27T22:23:49Z

@mroeschke is this covered by ArrowDtype? Closable?

mroeschke · 2023-07-27T22:45:08Z

I believe so, but I am not sure whether the operations described in the OP are fully covered by an ArrowDtype(pa.dictionary()).

TomAugspurger mentioned this issue May 1, 2018

API: categorical grouping will no longer return the cartesian product #20583

Merged

TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label May 1, 2018

TomAugspurger added this to the 0.23.0 milestone May 1, 2018

jreback modified the milestones: 0.23.0, 0.24.0 May 1, 2018

jreback mentioned this issue May 19, 2018

Unable to do calculations with categorical dtypes #21117

Closed

jreback modified the milestones: 0.24.0, 0.25.0 Oct 23, 2018

TomAugspurger mentioned this issue Nov 19, 2018

Shouldn't Pandas 'categorical' type map to OmniSci dictionary encoded string? heavyai/pymapd#114

Closed

jreback modified the milestones: 0.25.0, 1.0 Apr 20, 2019

jorisvandenbossche mentioned this issue Oct 2, 2019

API: Add string extension type #27949

Merged

TomAugspurger modified the milestones: 1.0, Contributions Welcome Dec 30, 2019

TomAugspurger mentioned this issue Aug 31, 2020

Deprecate groupby/pivot observed=False default #35967

Closed

5 tasks

mroeschke added the Enhancement label Jun 19, 2021

jbrockmendel mentioned this issue Dec 21, 2021

API/ENH: unprotected Categorical #13506

Closed

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

jbrockmendel mentioned this issue Apr 20, 2023

Make pyarrow a required dependency #52509

Closed

topper-123 mentioned this issue May 11, 2023

ENH: Improved CategoricalDtype subtype handling. #48515

Open

9 tasks
