ENH: allow better categorical dtype strings, e.g. `category[string]` #53190

topper-123 · 2023-05-11T22:27:49Z

closes ENH: Improved CategoricalDtype subtype handling. #48515
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This PR is a quality-of-life improvement when working with categorical dtypes:

>>> df = pd.read_csv(file, dtype={"a": "category[string]", "b": "category[datetime64[ns, UTC]]"})

It can also be visible in reprs and can be used for comparisons:

>>> ser = pd.Series([1,2,3, 4], dtype="category[Int8]")
>>> ser
0    1
1    2
2    3
3    4
dtype: category[Int8]
Categories (4, Int8): [1, 2, 3, 4]
>>> ser.dtype == "category"
True
>>> ser.dtype == "category[Int8]"
True
>>> ser.dtype == "category[string]"
False

Still to do:

update the tests.

EDIT: updated to newest capabilities.

jbrockmendel · 2023-05-12T23:55:52Z

i was expecting this to be about getting a more informative string when doing df.dtypes.

I like the added specificity, but don't like adding more statefulness to the Dtype object. Once the categories are known, the extra state should be redundant, right?

topper-123 · 2023-05-14T09:05:07Z

i was expecting this to be about getting a more informative string when doing df.dtypes.

I was actually thinking about doing that as a follow-up, so we'd get e.g.:

>>> cat = pd.Categorical(["a", "f"])
>>> pd.DataFrame(cat).dtypes
0    category[object]
dtype: object

I like the added specificity, but don't like adding more statefulness to the Dtype object. Once the categories are known, the extra state should be redundant, right?

Could I avoid initializing _categories_dtype on the class and delete it once the categories have been declared?

The issue is that CategoricalDtype can be instantiated without categories, so to make "category[string]" work as a string dtype, we have to store the dtype string somewhere, until the categories are declared.
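
To make the "store the dtype string until the categories are declared" step concrete, here is a minimal sketch of how a string such as "category[string]" could be split into its inner dtype string. The helper name `split_category_string` is hypothetical and is not part of pandas or of this PR; it only illustrates the string handling, under the assumption that the inner part is passed on unchanged to whatever resolves dtype strings.

```python
from typing import Optional


def split_category_string(dtype_string: str) -> Optional[str]:
    """Return the inner dtype string of 'category[...]'.

    Hypothetical helper for illustration only (not a pandas API).
    Returns None for plain 'category', where the categories' dtype
    would still be inferred from the data.
    """
    if dtype_string == "category":
        return None
    prefix, suffix = "category[", "]"
    if dtype_string.startswith(prefix) and dtype_string.endswith(suffix):
        # Strip only the outermost brackets so nested dtype strings
        # such as 'datetime64[ns, UTC]' are kept intact.
        inner = dtype_string[len(prefix):-len(suffix)]
        if inner:
            return inner
    raise ValueError(f"not a categorical dtype string: {dtype_string!r}")


inner_plain = split_category_string("category")
inner_string = split_category_string("category[string]")
inner_nested = split_category_string("category[datetime64[ns, UTC]]")
```

The inner string is the extra state under discussion: it has to live somewhere between parsing the dtype string and seeing the actual categories.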

topper-123 · 2023-05-29T07:53:16Z

I've updated this, so in addition to being able to use these strings as dtypes in constructors, they will also be visible in reprs and can be used for comparisons:

>>> ser = pd.Series([1,2,3, 4], dtype="category[Int8]")
>>> ser
0    1
1    2
2    3
3    4
dtype: category[Int8]
Categories (4, Int8): [1, 2, 3, 4]
>>> ser.dtype == "category"
True
>>> ser.dtype == "category[Int8]"
True
>>> ser.dtype == "category[string]"
False

I still need to write the tests. Before I do, can you confirm that you approve of the idea, @jbrockmendel?

jreback

i am surprised this doesn't break a lot of things

i like the repr idea but i am not sure allowing a dtype that is not inferred is actually useful

can you show when this is the case?

jreback · 2023-05-29T10:31:42Z

pandas/core/dtypes/dtypes.py

    ordered : bool or None, default False
        Whether or not this categorical is treated as a ordered categorical.
        None can be used to maintain the ordered value of existing categoricals when
        used in operations that combine categoricals, e.g. astype, and will resolve to
        False if there is no existing ordered to maintain.
+    categories_dtype : dtype, optional
+        If given, will be the dtype of the categories.


dtype is a better name

I’d be ok to change, categories_dtype is very verbose.

topper-123 · 2023-05-29T11:25:02Z

i am surprised this doesn't break a lot of things

i like the repr idea but i am not sure allowing a dtype that is not inferred is actually useful

can you show when this is the case?

Hmm, this is backward compatible, so I don't see how it could break things? E.g. Series([...], dtype="category") will still infer.

The general use case is wanting a more precise categorical dtype. Currently we typically do it like this:

>>> df = pd.read_csv(file, dtype={"a": "string", "b": "datetime64[ns, UTC]"})
>>> df["a"] = df["a"].astype("category")
>>> df["b"] = df["b"].astype("category")

which after this PR can be:

>>> df = pd.read_csv(file, dtype={"a": "category[string]", "b": "category[datetime64[ns, UTC]]"})

More precise type comparisons are currently done like this when comparing to string dtypes:

>>> cat = pd.Categorical(pd.array([1, 2, 3, 4], dtype="string"), dtype="category")
>>> if cat.dtype == "category" and cat.dtype.categories.dtype == "string":
...     ...

which after this PR can be:

>>> cat = pd.Categorical([1, 2, 3, 4], dtype="category[string]")
>>> if cat.dtype == "category[string]":
...     ...

So this makes some common categorical operations more readable and concise IMO. Note also that in the read_csv example, after this PR the reader can see up front that the end product will be a categorical of strings, which opens up the possibility of making the readers more efficient.
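
For reference, the two-step comparison that is needed today can be wrapped in a small helper. This is only a sketch using existing pandas APIs (`pd.CategoricalDtype`, `pandas.api.types.is_dtype_equal`); the helper name `is_category_of` is hypothetical and not something this PR adds.

```python
import pandas as pd
from pandas.api.types import is_dtype_equal


def is_category_of(dtype, categories_dtype) -> bool:
    """Check that dtype is categorical and its categories have the given dtype.

    Hypothetical helper for illustration; it spells out the comparison
    that a string like "category[string]" is meant to shorten.
    """
    if not isinstance(dtype, pd.CategoricalDtype):
        return False
    if dtype.categories is None:
        # Categories not declared yet, so there is no inner dtype to compare.
        return False
    return is_dtype_equal(dtype.categories.dtype, categories_dtype)


strings = pd.Series(["x", "y", "x"], dtype="string").astype("category")
ints = pd.Series([1, 2, 1]).astype("category")

strings_are_string = is_category_of(strings.dtype, "string")
ints_are_string = is_category_of(ints.dtype, "string")
```

With the proposed strings, the same check would read `strings.dtype == "category[string]"`.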

jbrockmendel · 2023-06-15T15:14:46Z

I very much like the __repr__ and __eq__ here. I'm not sold on it being in the constructor.

topper-123 · 2023-06-19T06:58:54Z

I'm not super fond of a categorical dtype repr not being usable as a dtype, i.e. if we allow "category[string]" as a dtype repr, then IMO not allowing it as a dtype could be unexpected.

The categories attribute can currently be either None or a sequence. Could an idea be that categories could also be a dtype to represent an intermediate state, i.e. no sequence has yet been added, but we've restricted it to be of the given dtype?

For example, Series([1, 2, 3], dtype="category[string]") would then be equivalent to Series([1, 2, 3], dtype=CategoricalDtype(StringDtype()))? This would then resolve into Series([1, 2, 3], dtype=CategoricalDtype(Index([1, 2, 3], dtype="string"))) once CategoricalDtype has been given the sequence?
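
A toy model of this three-state idea, in plain Python rather than pandas, might look as follows. Everything here (`ToyCategoricalDtype`, `CASTERS`, `resolve`) is hypothetical and only illustrates the proposed states; it is not how pandas implements CategoricalDtype, and unlike pandas it keeps categories in first-seen order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple, Union

# Hypothetical casters keyed by dtype name; real pandas dtypes do far more.
CASTERS: Dict[str, Callable[[object], object]] = {"string": str, "int64": int}


@dataclass(frozen=True)
class ToyCategoricalDtype:
    # None: nothing known yet.
    # str: only the categories' dtype is known (the intermediate state).
    # tuple: concrete categories.
    categories: Union[None, str, Tuple[object, ...]] = None

    @property
    def name(self) -> str:
        if isinstance(self.categories, str):
            return f"category[{self.categories}]"
        return "category"

    def resolve(self, values: Iterable[object]) -> "ToyCategoricalDtype":
        """Return a dtype with concrete categories taken from values."""
        if isinstance(self.categories, tuple):
            return self  # already resolved, nothing to do
        cast = CASTERS[self.categories] if self.categories else (lambda v: v)
        # dict.fromkeys drops duplicates while keeping first-seen order.
        return ToyCategoricalDtype(tuple(dict.fromkeys(cast(v) for v in values)))


pending = ToyCategoricalDtype("string")   # like "category[string]"
resolved = pending.resolve([1, 2, 3, 2])  # categories cast to str
inferred = ToyCategoricalDtype().resolve([3, 1, 3])
```

In this model the dtype restriction is consumed once the sequence arrives, so no separate attribute is left behind on the resolved dtype.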

github-actions · 2023-07-20T00:05:39Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2023-10-12T16:48:16Z

It appears the discussion here has stalled and might need a bit more discussion on the design, which may be good to have in the issue, so closing for now. Happy to reopen at a later point.

mroeschke added Enhancement Categorical Categorical Data Type labels May 12, 2023

topper-123 added 5 commits May 27, 2023 09:12

ENH: allow more specific categorical dtype strings, e.g. `category[string]'.

c585867

fix precommit

7ad0624

git doc test (interim)

d4aa357

add categories_dtype to dtype string

6c9a33e

update dtype name

358f654

topper-123 force-pushed the enh_CategoricalDtype.categories_dtype branch from 19d57a2 to 358f654 May 27, 2023 09:37

topper-123 added 4 commits May 27, 2023 10:51

update

55d2bab

improve docs

e54e1fb

fix CI

cddf02a

fix asv test

0a1e438

jreback requested changes May 29, 2023

topper-123 mentioned this pull request Jun 1, 2023

BUG: pd.NA showing up as NaN in Categorical repr #52681

Closed

5 tasks

github-actions bot added the Stale label Jul 20, 2023

mroeschke closed this Oct 12, 2023
