BUG: DataFrame.to_records, DataFrame constructor broken for categoricals #8626
Comments
hmm, the second part I am not sure about; I am not sure I buy the first part. If you want to see if you can 'fix' it and demonstrate that, it would help. |
Actually, there are several issues. I only had the impression that they are somehow related, in the sense that it makes a difference whether I use `astype('category')` or `astype(pd.core.categorical.Categorical)`.

My main problem is this:

```python
df = pd.DataFrame(list('abc'), columns=['A'])
df['A'] = df['A'].astype('category')
df.to_records()
```

Result: `TypeError: data type not understood`

I have no problems when converting as follows:

```python
df['A'] = df['A'].astype(pd.core.categorical.Categorical)
df.to_records()
```

The same is true for … Another example, which shows it is not only related to `to_records`:

```python
df['A'] = df['A'].astype(pd.core.common.CategoricalDtype)
pd.lib.fast_zip([df.A.values, df.B.values])
```

... works as expected, whereas

```python
df['A'] = df['A'].astype('category')
pd.lib.fast_zip([df.A.values, df.B.values])
```

... returns: …
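Editor's aside: on current pandas versions this `to_records` call no longer raises, but a version-independent workaround is to cast categorical columns back to `object` before building the record array, since numpy's structured dtypes have no category concept. A minimal sketch (assumes a reasonably recent pandas):

```python
import pandas as pd

df = pd.DataFrame(list('abc'), columns=['A'])
df['A'] = df['A'].astype('category')

# Cast every categorical column to object before to_records(),
# so numpy only ever sees dtypes it understands.
safe = df.copy()
for col in safe.select_dtypes(include='category').columns:
    safe[col] = safe[col].astype(object)

rec = safe.to_records()
print(rec.dtype.names)  # ('index', 'A')
```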
|
you realize that using `to_records()` defeats the entire purpose of categorical, as numpy cannot support this |
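Editor's note on the point above: since numpy has no categorical dtype, the closest faithful export is the codes/categories pair. A sketch using the public `.cat` accessor of modern pandas:

```python
import pandas as pd

s = pd.Series(list('abca'), dtype='category')

codes = s.cat.codes.to_numpy()       # small integer codes per element (-1 marks NaN)
categories = list(s.cat.categories)  # the category labels, in order

# codes -> [0, 1, 2, 0], categories -> ['a', 'b', 'c']
```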
Sure, but I don't see the point. Ironically, I came across this issue when creating factor/key variables from multiple columns:

```python
pd.factorize(df.to_records())
```

But I ended up using:

```python
pd.factorize(pd.lib.fast_zip([df[c].values for c in df.columns]))
```
|
by your reply I am confused; `to_records` gives you a structured array |
Sorry for the confusion. Just to make sure: the main issue here is the inconsistency between using `astype('category')` and `astype(pd.core.categorical.Categorical)`. But if this is the case and you think … But instead:

```python
df = pd.DataFrame(list('abc'), columns=['A'])
df['A'] = df['A'].astype('category')
df.to_records()
```
```
Out[...]
...
TypeError: data type not understood
```

Whereas the `pd.core.categorical.Categorical` variant works.
|
Oh wait, it seems the reason for the inconsistency is even weirder. The second call of `df.to_records()` succeeds, while the first and third raise:

```python
df = pd.DataFrame(list('abc'), columns=['A'])

df['A'] = df['A'].astype(pd.core.categorical.Categorical)
df.to_records()
# ...
# TypeError: data type not understood

df['A'] = df['A'].astype(pd.core.categorical.Categorical)
df.to_records()
# rec.array([(0, 'a'), (1, 'b'), (2, 'c')],
#           dtype=[('index', '<i8'), ('A', 'O')])

df['A'] = df['A'].astype(pd.core.categorical.Categorical)
df.to_records()
# ...
# TypeError: data type not understood
```
|
in the other thread, I was referring specifically to serialization/deserialization to other formats (e.g. …). Further, you should never need to touch the actual … The reason we released categorical is that it's simply impossible to catch ALL possible cases. This is a massive addition and I think we got most, but that's why it's helpful for you to find bugs! (which turn into tests and will get fixed for 0.15.1) |
I realize u r using `pd.core.categorical.Categorical` as a dtype; `Categorical` is the actual object. Technically it's actually ok and doesn't raise (nor does numpy when u do this) - but prob should simply raise, as it's not correct at all |
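Editor's aside: in later pandas versions the supported ways to request a categorical dtype are the string `'category'` or a `pandas.api.types.CategoricalDtype` instance; passing the `Categorical` class itself is exactly the misuse described above. A sketch with the modern public API:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(list('abc'))

s1 = s.astype('category')                            # string alias
s2 = s.astype(CategoricalDtype(list('abc'), ordered=False))  # explicit dtype object

# both produce a categorical Series; the explicit dtype also pins the categories
```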
Sorry for the late reply -

```python
df = pd.DataFrame(np.random.choice(list(u'abcde'), 20).reshape(10, 2),
                  columns=list(u'AB'))
pd.lib.fast_zip([df.A.values, df.B.values])
# works

df = pd.DataFrame(np.random.choice(list(u'abcde'), 20).reshape(10, 2),
                  columns=list(u'AB'))
for col in df.columns:
    df[col] = df[col].astype('category')
pd.lib.fast_zip([df.A.values, df.B.values])
# fails
```
But perhaps that's a different issue. |
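Editor's note: a public-API equivalent of the zip above that also works when the columns are categorical is to build an object Series of row tuples (a sketch; `pd.lib.fast_zip` itself is long gone from the public surface):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice(list('abcde'), 20).reshape(10, 2),
                  columns=list('AB'))
for col in df.columns:
    df[col] = df[col].astype('category')

# Iterating the Series yields the plain category values, so this works
# regardless of dtype; tuples are hashable and can be factorized.
zipped = pd.Series(list(zip(df['A'], df['B'])), index=df.index)

codes, uniques = pd.factorize(zipped)
print(len(codes))  # 10
```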
that's an internal routine |
if u really want to use it … |
Oh, wasn't aware of the subtle difference between `values` and `get_values()`. I use:

```python
df['gid'], grps = pd.factorize(pd.lib.fast_zip([df.A.get_values(), df.B.get_values()]))
```
|
@fkaufer I like that you are finding bugs in … you can do:

…

or if you really want to factorize all values:

…

note that using … Another approach is to simply construct the series and then factorize, but really, why are you factorizing directly? |
Seems we're a bit in a loop, so let me explain in a bit more detail. Upfront, the general question: why factorize directly? … Now: why use the sketched technical approach? …

To sum up, the following workaround now works:

```python
pd.factorize(pd.lib.fast_zip([df[col].get_values() for col in factorize_cols]))
```

I don't see how your proposals are equivalent alternatives to what I'm doing: …

But if you provide a

```python
df[factorize_cols].factorize()
```

... then I promise that I'll keep my hands off internal functions ... for now.

As a general remark regarding "categorical bug reporting": I'm not sitting here being overly eager to find as many categorical bugs as possible and therefore also testing all internal functions. It's just that having categoricals is really beneficial for my work, so for some tasks I'm currently working directly on the development/master pandas branch. Because of that, I simply stumble across the issues and report them, without distinguishing between issues in supposedly internal functions and in official API functions; I just report any inconsistencies I come across. And I think it is really important that categoricals behave consistently with other dtypes, so that you get the extra benefits of categoricals without breaking existing code. |
can u show me a small example of what you are wanting? it seems that u simply want a categorical for a few selected columns that have the same categories? give me a complete, concrete example and I'll show you how you should do it - all of the functionality is there now (or if not, we'll see what we can do). `DataFrame.factorize()` doesn't make sense, but maybe your example will shed some light |
Probably it would be better to open a new issue, "ENH: DataFrame.factorize()", but here you go ...

```python
df['user_id_1'], labels_1 = df[['firstname', 'lastname']].factorize()
df['user_id_2'], labels_2 = df[['firstname', 'lastname', 'city']].factorize()
df
```
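Editor's sketch: the proposed method does not exist in pandas, but a hypothetical helper with the same shape can be written on top of `pd.factorize` (the name `factorize_frame` and the example data are invented for illustration):

```python
import pandas as pd

def factorize_frame(frame):
    """Hypothetical stand-in for the proposed DataFrame.factorize():
    returns one integer id per row plus the unique row tuples."""
    key = pd.Series(list(frame.itertuples(index=False, name=None)),
                    index=frame.index)
    codes, uniques = pd.factorize(key)
    return pd.Series(codes, index=frame.index), uniques

df = pd.DataFrame({'firstname': ['ann', 'bob', 'ann'],
                   'lastname': ['lee', 'kim', 'lee']})
ids, labels = factorize_frame(df[['firstname', 'lastname']])
print(list(ids))  # [0, 1, 0]
```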
|
A small example, as far as I understand (@fkaufer correct me if I am wrong!): …

So, @jreback, @fkaufer wants to make categories based on values that are the full rows (combined values of all, or a selection of, the columns), a bit like … @fkaufer I think this is a more uncommon operation, and I don't know if pandas should provide a built-in way to do this (I am not fully convinced that … ) |
@jorisvandenbossche I do not agree that this is uncommon. Generating ids for a column subset is an important preprocessing step for a lot of algorithms doing duplicate detection, clustering, classification, association rule mining, functional/inclusion dependency detection, etc. It is even more useful when dealing with denormalized, dirty data, which I would say is pandas' bread-and-butter business.

Stata has:

```
egen user_id = group(firstname lastname city)
```

http://www.stata.com/support/faqs/data-management/creating-group-identifiers/

In R I would do something like:

```
transform(df, user_id = as.numeric(interaction(firstname, lastname, city, drop=TRUE)))
```
|
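Editor's note: the closest built-in pandas analogue of Stata's `egen ... group()` and R's `interaction()` shown above is `groupby(...).ngroup()`, available in later pandas versions. A sketch with invented example data:

```python
import pandas as pd

df = pd.DataFrame({'firstname': ['ann', 'bob', 'ann'],
                   'lastname':  ['lee', 'kim', 'lee'],
                   'city':      ['nyc', 'sfo', 'nyc']})

# One integer id per distinct (firstname, lastname, city) combination,
# analogous to: egen user_id = group(firstname lastname city)
df['user_id'] = df.groupby(['firstname', 'lastname', 'city']).ngroup()
print(list(df['user_id']))  # [0, 1, 0]
```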
See #8709 for an idea of how to represent this. A 2-d categorical will represent this well (with tuples for categories). Will think about this for CategoricalIndex as well (which this would be natural at). But this works now: …

I suppose we could make a cookbook entry for this. @fkaufer how are you then using the factorized values? And here's the big difference between … These don't preserve the structure! Instead they are an aggregation across a dimension (here some columns). I suppose you could make a … Believe me, I'm all for a better function/way to do X. But what is X here? |
I'm wondering a bit why the use cases are not obvious, but I'll try to elaborate on the "X" factor asap with an example. In general the use cases are algorithms working on groups/value-combinations, so clustering in the broadest sense (I threw in some buzzwords in a comment above: "algorithms doing duplicate detection, clustering, classification, association rule mining, functional/inclusion dependency detection, etc."). Most ML algorithms for clustering and classification work on numeric values, or at least it's faster to work on numeric/integer values than to deal with records potentially containing lengthy strings. So encoding records is an important step.
Yes, exactly, that's what I said. For the use cases mentioned I want to keep the structure, partially because I want to work on many (potentially overlapping) groups in parallel.

Regarding your proposal: the result is equivalent, but I'm not convinced, and here is why:

```python
%timeit pd.factorize(pd.MultiIndex.from_arrays([df[columns]]))
1 loops, best of 3: 1.33 s per loop

# has not worked for categoricals, but is fixed now, see #8652
%timeit pd.factorize(df.to_records())
1 loops, best of 3: 1.42 s per loop

# still does not work for categoricals!
%timeit pd.lib.fast_zip([df[c].values for c in columns])
10 loops, best of 3: 99.8 ms per loop

# necessary when columns contain categoricals
%timeit pd.lib.fast_zip([df[c].get_values() for c in columns])
10 loops, best of 3: 99.4 ms per loop

# further alternatives possible with groupby.groups/groupby.indices
```

Actually I favor … Recall: this brittleness, or more precisely this inconsistency when using categoricals in some of the columns of a df, is the actual topic of this issue, i.e. having categoricals in my data broke my workarounds for … |
just show what you are doing with the results of factorize |
@fkaufer the reason I keep asking questions is that I want to know your flow better |
Original issue description:

`to_records()` and the df constructor are broken for data containing categoricals when created with `dtype='category'` or `astype('category')`.

Broken (`TypeError: data type not understood`): …

Works: …

`to_frame` seems to remove the category dtype though.

Pandas version: 0.15.0-20-g2737f5a