Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Categoricals shouldn't allow non-strings when object dtype is passed (#13919) #14047

Closed
wants to merge 2 commits into from

Conversation

wcwagner
Copy link
Contributor

@wcwagner wcwagner commented Aug 19, 2016

Why this change is needed: Categorical variables are by definition single types, so to allow them to take on different types of values is misleading. Object dtypes should only be allowed when ALL strings or ALL periods are passed (due to the way there are handled internally).

The result of this PR will raise a TypeError when a categorical is created that has an object dtype but doesn't contain all string or all period values.

@wcwagner wcwagner changed the title BUG: Categoricals shouldn't allow non-strings when object dtype is passed BUG: Categoricals shouldn't allow non-strings when object dtype is passed (#13919) Aug 19, 2016
@jreback
Copy link
Contributor

jreback commented Aug 19, 2016

you already have #14027 you don't need a 2nd
just push to the original
ok for this one


# this however will raise as cannot be sorted
self.assertRaises(
TypeError, lambda: Categorical.from_array(arr, ordered=True))

def test_constructor_object_dtype(self):
Copy link
Member

@sinhrks sinhrks Aug 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add tests:

  • unicode (for py2, include 2 bytes)
  • str and unicode mixed (for py2, it's often the case in asian countries)
  • bool (not sure it is used as category)
  • period with mixed freq

I think some of them can't be covered because of infer_dtype spec and current impl.

@sinhrks sinhrks added API Design Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas Categorical Categorical Data Type labels Aug 19, 2016
@@ -191,6 +192,8 @@ class Categorical(PandasObject):
If an explicit ``ordered=True`` is given but no `categories` and the
`values` are not sortable.

If an `object` dtype is passed and `values` contains dtypes other
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mixed dtypes

@wcwagner
Copy link
Contributor Author

wcwagner commented Aug 21, 2016

This change will always fail to build because it conflicts with the way MultiIndex.from_arrays, MultiIndex.from_tuples, and MultiIndex.from_product are implemented.

Regardless of the way I implement it, ensuring that Categoricals aren't mixed dtypes will always raise when a MultiIndex is constructed from some listlike with mixed dtypes :\

Do I disregard this and keep working/pushing to this PR?

@jreback
Copy link
Contributor

jreback commented Aug 21, 2016

you would have 2 show an example

@wcwagner
Copy link
Contributor Author

@jreback Show an ex. here? the docs? add a test and assert it raises?
If here, then

MultiIndex.from_arrays([[1 ,2], ['foo',  3]])

will raise because this

@jreback
Copy link
Contributor

jreback commented Aug 21, 2016

#13854 refactors those routines

build your fix on top of this PR

@wcwagner
Copy link
Contributor Author

wcwagner commented Aug 22, 2016

The refactored routines still call Categorical.from_array (here), so using mixed dtypes with MultiIndex.from_arrays would still raise unfortunately.

@jreback
Copy link
Contributor

jreback commented Sep 9, 2016

can you rebase / update?

@wcwagner
Copy link
Contributor Author

@jreback I don't think this will work. Even though #13854 refactored MultiIndex.from_arrays, it still creates categoricals, potentially of mixed types. So i will be closing this, as I can't think of a clean way to get around this :\

@wcwagner wcwagner closed this Sep 11, 2016
@jreback
Copy link
Contributor

jreback commented Sep 11, 2016

Categoricals should never be mixed type
not sure why you think that the constructor creates them

if you have an example pls show it

@wcwagner
Copy link
Contributor Author

@jreback If I call

MultiIndex.from_arrays([[1 ,2], ['foo',  3]])

_factorize_from_iterable will create a Categorical from ['foo', 3] here

I'm definitely misunderstanding something here. In reference to your first comment in #13919, shouldn't that raise on creation?

@jreback
Copy link
Contributor

jreback commented Sep 11, 2016

that should raise its not valid

@wcwagner
Copy link
Contributor Author

@jreback So back to my original problem. I can't think of a clean way to allow MultiIndex.from_arrays to exist, while simultaneously blocking the creation of Categoricals with mixed dtypes.

My immediate idea is to pass in some parameter to indicate that a MultiIndex is being created, but as you said in another one of my PRs -- you can't pass in an extra parameter to determine state

@jorisvandenbossche
Copy link
Member

@jreback from a MultiIndex.from_arrays point of vue, such mixed types is perfectly valid IMO

Personally, I don't think we should check this at Categorical construction, I would rather check for this in the hdf code itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ERR: Categoricals should not allow non-strings when an object dtype is passed
4 participants