ENH: allow better categorical dtype strings, e.g. category[string] #53190

topper-123
Contributor

@topper-123 topper-123 commented May 11, 2023

This PR is a quality-of-life improvement when working with categorical dtypes:

>>> df = pd.read_csv(file, dtype={"a": "category[string]", "b": "category[datetime64[ns, UTC]]"})

It can also be visible in reprs and can be used for comparisons:

>>> ser = pd.Series([1,2,3, 4], dtype="category[Int8]")
>>> ser
0    1
1    2
2    3
3    4
dtype: category[Int8]
Categories (4, Int8): [1, 2, 3, 4]
>>> ser.dtype == "category"
True
>>> ser.dtype == "category[Int8]"
True
>>> ser.dtype == "category[string]"
False

Still to do:

  • update the tests.

EDIT: updated to newest capabilities.
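
For readers wondering how a nested string such as "category[datetime64[ns, UTC]]" could be taken apart, here is a minimal pure-Python sketch. The helper name `split_category_dtype` is hypothetical and is not taken from this PR or from pandas; it only illustrates stripping the outer `category[...]` wrapper while checking that inner brackets are balanced.

```python
def split_category_dtype(dtype_string):
    """Return the inner dtype string of 'category[inner]'.

    Returns None for plain 'category'. Raises ValueError for anything else.
    Hypothetical helper for illustration only, not pandas API.
    """
    text = dtype_string.strip()
    if text == "category":
        return None

    prefix = "category["
    if not (text.startswith(prefix) and text.endswith("]")):
        raise ValueError(f"not a categorical dtype string: {dtype_string!r}")

    # Drop the outer 'category[' and the final ']'.
    inner = text[len(prefix):-1]

    # Brackets inside the inner string must be balanced, so that
    # 'category[datetime64[ns, UTC]]' passes but 'category[a]]' does not.
    depth = 0
    for char in inner:
        if char == "[":
            depth += 1
        elif char == "]":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unbalanced brackets in {dtype_string!r}")
    if depth != 0 or not inner:
        raise ValueError(f"malformed categorical dtype string: {dtype_string!r}")
    return inner


print(split_category_dtype("category[string]"))
print(split_category_dtype("category[datetime64[ns, UTC]]"))
```

The inner string would then presumably be handed to the normal dtype lookup; that step is left out here because it depends on pandas internals.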

@mroeschke mroeschke added Enhancement Categorical Categorical Data Type labels May 12, 2023
@jbrockmendel
Member

i was expecting this to be about getting a more informative string when doing df.dtypes.

I like the added specificity, but don't like adding more statefulness to the Dtype object. Once the categories are known, the extra state should be redundant, right?

@topper-123
Contributor Author

topper-123 commented May 14, 2023

i was expecting this to be about getting a more informative string when doing df.dtypes.

I was actually thinking about doing that as a follow-up, so we'd get e.g.:

>>> cat = pd.Categorical(["a", "f"])
>>> pd.DataFrame(cat).dtypes
0    category[object]
dtype: object

I like the added specificity, but don't like adding more statefulness to the Dtype object. Once the categories are known, the extra state should be redundant, right?

Could I avoid initializing _categories_dtype on the class and delete it once the categories have been declared?

The issue is that CategoricalDtype can be instantiated without categories, so to make "category[string]" work as a string dtype, we have to store the dtype string somewhere, until the categories are declared.
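
The point about instantiating without categories can be checked against existing pandas API. The snippet below is a small sketch that uses only `pd.CategoricalDtype` as it exists independently of this PR; it does not use the proposed "category[...]" strings.

```python
import pandas as pd

# A CategoricalDtype created without categories has no values to infer a
# categories dtype from yet.
empty = pd.CategoricalDtype()
print(empty.categories)  # None

# Once data is attached, the categories (and their dtype) are inferred
# from the values.
ser = pd.Series(["a", "f"], dtype=empty)
print(list(ser.dtype.categories))
print(ser.dtype.categories.dtype)
```

So any dtype restriction given up front has to live somewhere between the first and the second step, which is the extra state being discussed.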

@topper-123 topper-123 force-pushed the enh_CategoricalDtype.categories_dtype branch from 19d57a2 to 358f654 Compare May 27, 2023 09:37
@topper-123
Contributor Author

I've updated this, so in addition to being able to use these strings as dtypes in constructors, they will also be visible in reprs and can be used for comparisons:

>>> ser = pd.Series([1,2,3, 4], dtype="category[Int8]")
>>> ser
0    1
1    2
2    3
3    4
dtype: category[Int8]
Categories (4, Int8): [1, 2, 3, 4]
>>> ser.dtype == "category"
True
>>> ser.dtype == "category[Int8]"
True
>>> ser.dtype == "category[string]"
False
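
The comparison rules shown above can be modelled with a toy class. This is only a sketch of the semantics: `ToyCategoricalDtype` is a made-up stand-in, not the pandas class or the code in this PR, and it compares dtype names as plain strings.

```python
class ToyCategoricalDtype:
    """Toy stand-in for a categorical dtype; not the pandas class."""

    def __init__(self, categories_dtype):
        # Name of the dtype of the categories, e.g. "Int8" or "string".
        self.categories_dtype = categories_dtype

    def __repr__(self):
        return f"category[{self.categories_dtype}]"

    def __eq__(self, other):
        if isinstance(other, str):
            # The bare string matches any categorical; the bracketed form
            # must also match the dtype of the categories.
            return other == "category" or other == repr(self)
        if isinstance(other, ToyCategoricalDtype):
            return self.categories_dtype == other.categories_dtype
        return NotImplemented

    def __hash__(self):
        # Defining __eq__ removes the default hash, so restore one.
        return hash(self.categories_dtype)


dtype = ToyCategoricalDtype("Int8")
print(dtype == "category")
print(dtype == "category[Int8]")
print(dtype == "category[string]")
```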

I still need to write the tests. Before I do, can you confirm that you approve of the idea, @jbrockmendel?

Contributor

@jreback jreback left a comment

i am surprised this doesn't break a lot of things

i like the repr idea but i am not sure allowing a dtype that is not inferred is actually useful

can you show when this is the case?

ordered : bool or None, default False
    Whether or not this categorical is treated as an ordered categorical.
    None can be used to maintain the ordered value of existing categoricals when
    used in operations that combine categoricals, e.g. astype, and will resolve to
    False if there is no existing ordered to maintain.
categories_dtype : dtype, optional
    If given, will be the dtype of the categories.
Contributor

dtype is a better name

Contributor Author

I’d be OK with changing it; categories_dtype is very verbose.

@topper-123
Contributor Author

topper-123 commented May 29, 2023

i am surprised this doesn't break a lot of things

i like the repr idea but i am not sure allowing a dtype that is not inferred is actually useful

can you show when this is the case?

Hmm, this is backward compatible, so I don't see how it could break things. E.g. Series([...], dtype="category") will still infer.

The general use case is when we want a more precise categorical dtype. Currently we typically do it like this:

>>> df = pd.read_csv(file, dtype={"a": "string", "b": "datetime64[ns, UTC]"})
>>> df["a"] = df["a"].astype("category")
>>> df["b"] = df["b"].astype("category")

which after this can be:

>>> df = pd.read_csv(file, dtype={"a": "category[string]", "b": "category[datetime64[ns, UTC]]"})

More precise type comparisons are currently done like this when comparing to string dtypes:

>>> cat = pd.Categorical(pd.array([1, 2, 3, 4], dtype="string"), dtype="category")
>>> if cat.dtype == "category" and cat.dtype.categories.dtype == "string":
...     ...

which after this PR can be:

>>> cat = pd.Categorical([1, 2, 3, 4], dtype="category[string]")
>>> if cat.dtype == "category[string]":
...     ...

So this makes some common categorical operations more readable and concise IMO. Note also that, in the read_csv example, the reader can see up front after this PR that the end product will be a categorical of strings, which opens up the possibility of the readers being more efficient.
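
For reference, the "current" two-step workflow described above can be run against an in-memory CSV. This sketch uses only pandas API that exists independently of this PR; the CSV content and column name are made up for illustration.

```python
import io

import pandas as pd

# Made-up CSV data standing in for `file` in the examples above.
csv_text = "a,b\nx,1\ny,2\nx,1\n"

# Today: read with a concrete dtype, then convert to categorical afterwards.
df = pd.read_csv(io.StringIO(csv_text), dtype={"a": "string"})
df["a"] = df["a"].astype("category")

# Today: checking for "categorical of strings" takes two comparisons.
dtype = df["a"].dtype
is_string_categorical = dtype == "category" and dtype.categories.dtype == "string"
print(list(dtype.categories))
print(is_string_categorical)
```

Whether the categories keep the "string" dtype after `astype("category")` may depend on the pandas version, which is part of why a single explicit spelling is attractive.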

@jbrockmendel
Member

I very much like the __repr__ and __eq__ here. I'm not sold on it being in the constructor.

@topper-123
Contributor Author

topper-123 commented Jun 19, 2023

I'm not super fond of a categorical dtype repr not being usable as a dtype, i.e. if we allow "category[string]" as a dtype repr, IMO not allowing it as a dtype could be unexpected.

The categories attribute can currently be either None or a sequence. Could an idea be that categories could also be a dtype to represent an intermediate state, i.e. no sequence has yet been added, but we've restricted it to be of the given dtype?

For example, Series([1, 2, 3], dtype="category[string]") would then be equivalent to Series([1, 2, 3], dtype=CategoricalDtype(StringDtype())). This would then resolve into Series([1, 2, 3], dtype=CategoricalDtype(Index([1, 2, 3], dtype="string"))) once CategoricalDtype has been given the sequence?
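
The three-state idea (None, a dtype, or a concrete sequence) can be sketched with a toy model. Everything here is hypothetical: `ToyDtype`, its `resolve` method and the `CASTS` table are invented for illustration and do not correspond to pandas API or to code in this PR.

```python
# Hypothetical mapping from dtype names to Python casts.
CASTS = {"string": str, "int": int}


class ToyDtype:
    """Toy model where `categories` is None, a dtype name, or a tuple."""

    def __init__(self, categories=None):
        # None  -> nothing known yet
        # str   -> intermediate state: only the categories dtype is fixed
        # tuple -> resolved: the concrete categories are known
        self.categories = categories

    def resolve(self, values):
        """Return a dtype with concrete categories built from `values`."""
        if isinstance(self.categories, tuple):
            return self  # already resolved, nothing to do
        if self.categories is None:
            cast = lambda value: value  # no restriction, keep values as-is
        else:
            cast = CASTS[self.categories]
        return ToyDtype(tuple(sorted({cast(value) for value in values})))


pending = ToyDtype("string")
resolved = pending.resolve([1, 2, 3])
print(resolved.categories)
```

In this model the dtype name is replaced by the concrete categories on resolution, so no separate piece of state outlives the intermediate step.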

@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 20, 2023
@mroeschke
Member

It appears the discussion here has stalled and might need a bit more discussion on the design, which may be good to have in the issue, so closing for now. Happy to reopen at a later point.

@mroeschke mroeschke closed this Oct 12, 2023
Successfully merging this pull request may close these issues.

ENH: Improved CategoricalDtype subtype handling.
4 participants