ENH: Improved CategoricalDtype subtype handling. #48515
Comments
+1, in particular the "Ensure round tripping subtypes when serializing in formats that support categorical types". Doing a |
take |
xref #50041 (similar but different: about getting an array out of a categorical, in the same dtype as the categories). |
I'm not super sure what I think about this. I can see its utility, but I'm worried about the complexity, e.g. "category[string]" isn't a dtype currently and it takes more to define a CategoricalDtype than the underlying type. For example:
I'm -1 on this concrete idea (but I acknowledge the problem you're describing is real). |
@topper-123 Well, this is how pyarrow's dictionary type works: it simply saves a dictionary [index_type -> value_type], and then stores index-values (typically int32) instead of the actual values. To me it makes a lot of sense to have it this way, and I don't see what's supposedly confusing about it. In fact, allowing to specify the dtype as
But please notice, one of the major points of this issue is that even if we disallow writing |
There was at one point discussion about adding a Categorical-like dtype that didn't have fixed categories. I think the idea was that the categories would be part of the array instead of the dtype and be dict-like instead of an array. It didn't go anywhere, unfortunately IMO, but that would be very similar to what you're describing I think, e.g. having a dtype of I agree it would be nice to specify a categorical with a guaranteed dtype, but not necessarily spelled out categories. I don't think I'd be positive doing it in string form only ( Just riffing a bit on your idea, but maybe could accept a |
Found it: #20899. |
Having a string alias is really nice to have, for example if you want to write table schemas in a config file (json/toml/yaml). I do this quite often when I need to write a pipeline for data that comes via |
@topper-123 Also, as mentioned in this thread, |
My point was to make that possible, i.e. |
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
Internally, categories already distinguish different subtypes. Consider for example:
In the first case, `s.dtype.categories` is `Index(['bar', 'foo'], dtype='object')`; in the latter case it is `Index(['bar', 'foo'], dtype='string')`.
However, handling of these subtypes is currently a bit awkward, hence the proposed features are quality-of-life improvements when working with such data, mainly:
- `.astype("category[<type>]")`
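The two cases described above can be reproduced roughly as follows. This is a minimal sketch, not the snippet from the original issue: the sample values and the variable names `raw`, `s_plain`, and `s_string` are assumptions chosen for illustration, and the dtype of the first case's categories depends on the pandas version and its string-inference settings.

```python
import pandas as pd

# Hypothetical sample data; the original example was not preserved.
raw = pd.Series(["foo", "bar", "foo"])

# First case: cast directly to category. The categories keep whatever
# dtype the values had (object on older pandas versions).
s_plain = raw.astype("category")

# Latter case: cast to the string extension dtype first, then to category.
# The categories are then stored in a string-dtype Index.
s_string = raw.astype("string").astype("category")

print(s_plain.dtype.categories)
print(s_string.dtype.categories)

# Both dtypes compare equal to the plain "category" alias, so the
# subtype is not visible through the string alias alone.
print(s_plain.dtype == "category", s_string.dtype == "category")
```

Inspecting `dtype.categories.dtype` is currently the only way to tell the two apart, which is the awkwardness the proposal tries to address.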
Feature Description
- Make `CategoricalDtype` a `typing.Generic` parametrized by a scalar type. (⇝ relevant for `pandas-stubs`)
- `category[object]` (cf. Defaults for Generics? python/mypy#4236 (comment)).
- `.astype("category[<type>]")`
- `series.astype("category[string]")` should behave equivalently to `series.astype("string").astype("category")`
- `read_csv(file, dtype=...)` and `DataFrame(..., dtype=...)`
- Ensure round tripping subtypes when serializing in formats that support categorical types (`pyarrow`'s dictionary type)
- `series.dtype == "category[string]"`.
- `series.dtype == "string"` and `pd.api.types.is_string_dtype(series)` should evaluate to `True` if the `dtype` is `category[string]`, since `category` acts only as a kind of wrapper and things like the `Series.str` accessor are still applicable. (needs discussion)
Alternative Solutions
Existing functionality is to manually cast as `.astype(<type>).astype("category")` whenever necessary, or to explicitly construct an instance of `CategoricalDtype`, which however requires a-priori knowledge of the categories.
Additional Context
Allowing direct casting to `category[<type>]` when using `read_csv` should bring minor performance benefits.
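For comparison, the two existing workarounds named under Alternative Solutions might look like this today. This is a sketch under assumptions: the CSV content, the column name `name`, and the variable names are hypothetical, and the proposed `category[<type>]` alias is deliberately not used because it does not exist in pandas.

```python
import io

import pandas as pd

# Hypothetical CSV content for illustration only.
csv_text = "name,value\nfoo,1\nbar,2\nfoo,3\n"

# Workaround 1: read the column as the string dtype, then cast to
# category in a second step (the two-step cast the issue wants to avoid).
df = pd.read_csv(io.StringIO(csv_text), dtype={"name": "string"})
df["name"] = df["name"].astype("category")

# Workaround 2: build an explicit CategoricalDtype whose categories carry
# the string dtype. This requires knowing the categories in advance.
known = pd.CategoricalDtype(categories=pd.Index(["bar", "foo"], dtype="string"))
s_known = pd.Series(["foo", "bar", "foo"]).astype(known)

print(df["name"].dtype.categories.dtype)
print(s_known.dtype.categories.dtype)
```

In the first workaround the column is materialized as a string array before being converted, which is the intermediate step a direct cast would presumably skip.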