ENH: allow better categorical dtype strings, e.g. category[string]
#53190
i was expecting this to be about getting a more informative string when doing I like the added specificity, but don't like adding more statefulness to the Dtype object. once the categories are known the extra state should be redundant right?
I was actually thinking about doing that as a follow-up, so we'd get e.g.:
>>> cat = pd.Categorical(["a", "f"])
>>> pd.DataFrame(cat).dtypes
0 category[object]
dtype: object
I could not initiate The issue is that |
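As a rough illustration of the more informative string discussed above, the label can already be derived from an existing dtype with the current API. This is only a sketch: `categorical_dtype_string` is a hypothetical helper name, not part of pandas and not the PR's implementation, and the exact inner name depends on what dtype pandas infers for the categories.

```python
import pandas as pd


def categorical_dtype_string(dtype: pd.CategoricalDtype) -> str:
    """Hypothetical helper: render a dtype as 'category[<categories dtype>]'."""
    # A CategoricalDtype created without categories has nothing to report.
    if dtype.categories is None:
        return "category"
    # Reuse the string form of the categories' own dtype as the inner part.
    return f"category[{dtype.categories.dtype}]"


cat = pd.Categorical(["a", "f"])
# The inner name follows whatever dtype pandas inferred for the categories.
print(categorical_dtype_string(cat.dtype))
```

Because the string is computed from the categories, it adds no extra state to the dtype object.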
I've updated this, so in addition to being able to use these strings as dtypes in constructors, they will also be visible in reprs and can be used for comparisons:
>>> ser = pd.Series([1,2,3, 4], dtype="category[Int8]")
>>> ser
0 1
1 2
2 3
3 4
dtype: category[Int8]
Categories (4, Int8): [1, 2, 3, 4]
>>> ser.dtype == "category"
True
>>> ser.dtype == "category[Int8]"
True
>>> ser.dtype == "category[string]"
False
I still need to do the tests and before I do them, can you verify you approve of the idea, @jbrockmendel?
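For comparison, the same three-way check can be approximated with the existing pandas API, without the proposed strings. The sketch below is an assumption-laden illustration: `is_categorical_of` is a hypothetical helper, and it uses plain NumPy-backed integer categories (`int64`) rather than `Int8` to keep the behaviour predictable.

```python
import pandas as pd


def is_categorical_of(dtype, categories_dtype=None) -> bool:
    """Hypothetical helper: is `dtype` categorical, optionally with given categories dtype?"""
    if not isinstance(dtype, pd.CategoricalDtype):
        return False
    # No inner dtype requested: any categorical matches, like `== "category"`.
    if categories_dtype is None:
        return True
    # A dtype without known categories cannot match a specific inner dtype.
    if dtype.categories is None:
        return False
    return bool(dtype.categories.dtype == categories_dtype)


ser = pd.Series([1, 2, 3, 4], dtype="category")
print(is_categorical_of(ser.dtype))             # plain categorical check
print(is_categorical_of(ser.dtype, "int64"))    # categories dtype matches
print(is_categorical_of(ser.dtype, "float64"))  # categories dtype differs
```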
i am surprised this doesn't break a lot of things
i like the repr idea but i am not sure allowing a dtype that is not inferred is actually useful
can you show when this is the case?
ordered : bool or None, default False
    Whether or not this categorical is treated as an ordered categorical.
    None can be used to maintain the ordered value of existing categoricals when
    used in operations that combine categoricals, e.g. astype, and will resolve to
    False if there is no existing ordered to maintain.
categories_dtype : dtype, optional
    If given, will be the dtype of the categories.
dtype is a better name
I’d be ok to change, categories_dtype is very verbose.
Hmm, this is backward compatible, so I don't see how it could break things? E.g. The use case is generally when we want a more precise categorical dtype. We currently typically do it like this:
>>> df = pd.read_csv(file, dtype={"a": "string", "b": "datetime64[ns, UTC]"})
>>> df["a"] = df["a"].astype("category")
>>> df["b"] = df["b"].astype("category")
which after this can be:
>>> df = pd.read_csv(file, dtype={"a": "category[string]", "b": "category[datetime64[ns, UTC]]"})
More precise type comparisons are currently done like this when comparing to string dtypes:
>>> cat = pd.Categorical(pd.array([1, 2, 3, 4], dtype="string"), dtype="category")
>>> if cat.dtype == "category" and cat.dtype.categories.dtype == "string":
...     ...
which after this PR can be:
>>> cat = pd.Categorical([1, 2, 3, 4], dtype="category[string]")
>>> if cat.dtype == "category[string]":
...     ...
So this makes some common categorical operations more readable and concise IMO. Note also in the
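Strings such as `category[datetime64[ns, UTC]]` nest brackets, so splitting off the inner dtype takes slightly more than a naive search for the first `]`. The following is a minimal, pandas-free sketch of one way to do it; `parse_categorical_string` is a hypothetical name, and this is not how the PR parses the strings.

```python
from typing import Optional, Tuple


def parse_categorical_string(spec: str) -> Tuple[bool, Optional[str]]:
    """Hypothetical parser: return (is_category, inner dtype string or None)."""
    spec = spec.strip()
    if spec == "category":
        return True, None
    prefix = "category["
    if not (spec.startswith(prefix) and spec.endswith("]")):
        return False, None
    # Drop the outer 'category[' and the final ']'.
    inner = spec[len(prefix):-1]
    # The inner part may itself contain brackets; require them to balance.
    depth = 0
    for ch in inner:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                return False, None
    if depth != 0 or not inner:
        return False, None
    return True, inner


print(parse_categorical_string("category[string]"))
print(parse_categorical_string("category[datetime64[ns, UTC]]"))
```

The inner string would then be handed to whatever already resolves ordinary dtype strings, so the nested case needs no special treatment beyond the bracket check.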
i very much like the |
I'm not super fond of a categorical dtype repr not being usable as a dtype, i.e. if we allow The For example |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Appears the discussion here has stalled and might need a bit more discussion on the design, which may be good to have in the issue, so closing for now. Happy to reopen at a later point.
CategoricalDtype subtype handling. #48515
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
This PR is a quality-of-life improvement when working with categorical dtypes:
It can also be visible in reprs and can be used for comparisons:
Still needs to do
EDIT: updated to newest capabilities.