API: Add Dictionary-encoded Extension Type #20899

Open
TomAugspurger opened this issue May 1, 2018 · 5 comments
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@TomAugspurger
Contributor

TomAugspurger commented May 1, 2018

Currently, Categorical serves two main purposes:

  1. A type for expressing data from a fixed set of categories
  2. A memory-efficient storage format for low-cardinality objects

This proposal is to add a new extension type (let's call it DictEncodedArray
for now) for the second use case. The storage format would be the same as
Categorical: an Index of the unique "keys" (categories) and an array of codes.
Much of the implementation would be shared, but the two types would have
different semantics for these operations:

  • concat (union by default)
  • groupby (unobserved categories would be dropped by default)
  • value_counts (unobserved categories would be dropped by default)
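For reference, a rough sketch of how the existing Categorical behaves on these three operations in recent pandas releases, i.e. the behaviour the proposed type would change. DictEncodedArray does not exist, so only current behaviour is shown; the variable names are made up for illustration, and `observed` is passed explicitly because its default has changed between pandas versions.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# "z" is a declared category that never appears in the data.
a = pd.Categorical(["x", "y"], categories=["x", "y", "z"])
b = pd.Categorical(["y", "w"])

# concat: Series whose categories differ fall back to object dtype.
combined = pd.concat([pd.Series(a), pd.Series(b)], ignore_index=True)

# A union of the categories has to be requested explicitly today.
merged = union_categoricals([a, b])

# value_counts: the unobserved category "z" is reported with a count of 0.
counts = pd.Series(a).value_counts()

# groupby: unobserved categories appear unless observed=True is passed.
df = pd.DataFrame({"key": a, "val": [1, 2]})
sizes_all = df.groupby("key", observed=False).size()
sizes_observed = df.groupby("key", observed=True).size()
```

Under the proposal, the union on concat and the dropping of unobserved keys would be the defaults rather than opt-in.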

This is most useful for strings, but could even be useful for storing a large
array of 64-bit precision items (store the 64-bit items once, then use an int16
or int32 array for the codes).
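A minimal sketch of that storage layout in plain NumPy, assuming a low-cardinality float64 array: the unique 64-bit values are stored once as keys and each element is replaced by an int16 code. This only illustrates the idea and is not the proposed implementation; the array sizes and names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100,000 float64 values drawn from a pool of only 100 candidates.
pool = rng.random(100)
values = rng.choice(pool, size=100_000)

# Dictionary-encode: sorted unique keys plus one integer code per element.
keys, codes = np.unique(values, return_inverse=True)
codes = codes.astype(np.int16)  # safe here: at most 100 distinct keys

# Decoding is a single fancy-indexing step.
decoded = keys[codes]

dense_bytes = values.nbytes                  # 8 bytes per element
encoded_bytes = keys.nbytes + codes.nbytes   # keys once + 2 bytes per element
```

With int16 codes the encoded form takes roughly a quarter of the dense size when the number of keys is small; the saving shrinks as cardinality grows.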

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label May 1, 2018
@TomAugspurger TomAugspurger added this to the 0.23.0 milestone May 1, 2018
@toobaz
Member

toobaz commented May 1, 2018

From #20583, @jankatins:

If you have a defined order, where would you put the new categorical into the orders? E.g. good - middle - bad, where would you put 'extreme' -> if you care about order, then you also care about not adding new stuff

Sure: an exception can be when the new categories include the old ones, but that's a very specific case, and I agree the ability to add categories is less relevant for ordered categories (by "less relevant" I mean that it could, for instance, lose the ordering, or append the new ones according to their own order).

@jreback jreback modified the milestones: 0.23.0, 0.24.0 May 1, 2018
@toobaz
Member

toobaz commented May 3, 2018

This is most useful for strings, but could even be useful for storing a large
array of 64-bit precision items (store the 64-bit items once, then use an int16
or int32 array for the codes).

... or even arbitrary Python objects (for which you gain not only RAM, but possibly CPU when e.g. comparing)

Did I understand correctly that for the case of strings we could expect some significant advantage from storing them as a single concatenated string and (internally) mapping categories to slices of it, rather than mapping categories directly to individual strings?

@TomAugspurger
Contributor Author

Not happening for 1.0

@jbrockmendel
Member

@mroeschke is this covered by ArrowDtype? Closable?

@mroeschke
Member

I believe so, but I am not sure whether the operations described in the OP are fully covered by an ArrowDtype(pa.dictionary())
