-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regression in DataFrame.set_index with class instance column keys #24969
Comments
We have had quite some discussion lately about In general the usage cc @h-vetinari |
Does pandas support custom objects as labels? I think that's bound to break in many places. The code previously tried everything it got as a key, so in this sense this is a regression, yes. I'm a bit stumped as for how to deal with this. Column keys should IMO clearly be scalar (or tuples, grudgingly) - and that's the only reason Your object (at least the toy version) looks a bit like a tuple. As an immediate workaround, I'd suggest to try inheriting |
In my production code, I use dataclasses as custom objects in both column keys and row indices, which worked throughout Pandas 0.23.4. If Pandas 0.24 or later drop support for custom classes in row/column indices, I would be stuck at 0.23.4 forever. This is why I view the change as a regression. |
We didn't disallow it previously, so yes. |
This may have happened to work, but we don't support custom objects as labels explicity. Not against reverting this, but its buyer beware here. |
I never could find anything in the documentation that takes a stance on what can or can’t be column keys, except the general notion that DataFrames are dict-like. From this I surmised that column keys should be hashable and immutable. Did I miss something in the documentation? |
I've added a fix for the regression in #24984, with the caveat that I think that this should be immediately deprecated. For a method that's supposed to intelligently process column labels, arrays and list of labels/arrays, it's opening Pandora's box to try to support custom types. There's no end to the (hashable) horrors that someone might implement (see below). Allowing dataclasses, as in the usecase of @wkschwartz is a separate consideration, but for much the same reasons, I'd suggest to deprecate even tuples in labels: #24688 See also #24702, which also ties into the following:
This should be clearly defined, documented, and enforced (same as list-like). About the stuff that pandas already "supports", how about lists-as-labels...?
|
Obviously I would prefer no deprecation of custom label types as they are integral to my company’s applications. However, I would urge strongly that if you do decide to deprecate the feature, you do so starting only in the next major release (presumably 0.25.0) rather than in a minor release (0.24.1). If you do stop supporting custom label types and I am not to be stuck at Pandas 0.23.4 forever, I could theoretically undertake the (expensive) refactoring to use unique IDs (my production code has the equivalent of the name field from my toy example in the OP). However, other users whose code would break might not have convenient unique IDs to switch to. Please do not remove this feature lightly. |
|
Hmm well this conversation is somewhat split between this and the PR but I would be -1 on supporting something like this. I understand that it may have worked previously and wasn't explicitly disallowed, but I don't think that means we have to support it going forward, when I would think there are already an infinite other more idiomatic ways of going about this. Even if fixed in this particular instance trying to guarantee support for something like this as an index item is going to cause a lot of unnecessary complexity in other parts of the code base, I would think particularly on subsequent indexing operations |
It does not change, that's what the It would work. Not that it should, of course, much less that it should be supported.
|
I meant: it can contain a proper list. But you're right this is irrelevant (tuples can too). And indeed it works, and that's great. Users will need to remember only a simple rule: "hashable -> works". By the way: it's a rule they already know from |
If your definition of backward compatibility means maintaining the functionality of old Pandas-based code, then you shouldn't care about why folks used Pandas as they did, only that it worked before and should continue to work as long as the major version of Pandas doesn't change. This is the way Go handles compatibility. In that case, whether my use case is Pythonic or elegant shouldn't matter. If your goals for the backward compatibility of Pandas only encompases the examples in the documentation and the tests, then I refer you to my next point.
Using custom classes as column headers in my application has been great because there's a bunch of extra data that is automatically carried around as I make (modified) copies of DataFrames. For my use case, custom classes have been elegant and, IMO, very Pythonic. |
Go is statically typed, and that gives a very different baseline from which to make such a promise, compared to python. Pandas API is under massive development (see the whole ExtensionArray effort), and still has far too many nooks and crannies (whether due to legacy reasons or simple oversight), such that promising backwards compatibility for "anything that works" is suicide from a maintenance perspective. That being said, there's a substantial amount of effort that things are being deprecated gently whenever possible. |
Just giving this some more thought today...is this all we should require? There are a lot of index operations that also require the concept of sortability and that manifests itself in quite a few of the reshaping ops. I feel like hashable alone would only work with a subset of the API that would require intense care to work around.
Sure but what parts of the DataFrame API are you subsequently using? Outside of masking operations I'm kind of curious what the advantage of the DataFrame is versus just a dict, given IO operations with the former are going to be lossy and as mentioned I would think you'd have to be very careful which parts of the API you ultimately navigate. |
Maybe categorical dtypes are useful analogy here. Some categoricals are ordered and thus sortable, others are not. Which operations you can do on a categorical column depends on whether the dtype is ordered. Perhaps there's a hiererarchy for keys:
Pandas can then assume compatibility among
I make substantial use of row and column indexing (both boolean masking and by keys), row and column sums, merging on row indices, broadcasting comparisons (e.g., Most of those operations are way harder with lists of dicts, which I know because I've written this application both ways. |
Mixed type indexes are non-sortable since Python 3: to my eyes the problem is analogous, and not particularly worrisome. For my experience, there are some reshaping ops in which we try to sort, and fallback to not sorting, and this is (I think) OK. |
@toobaz is there a reason you closed this? As I don't think the issue is resolved (or the typical wrong button mistake; github UI should put them farther apart :-)) |
Lack of sleep probably ;-) Thanks for reopening. |
There has been some interesting discussion, here and in the PR, on the general topic of "should we support custom objects in Index" (or, how do we define and specify what we support in Index). And I think it is a discussion we should continue (should we open a new issue for that?); but, we also need to decide what to do on the short term. Do we agree that this was an unintended regression that we would like to fix for 0.24.x, awaiting more general discussion later on? (-> which means working towards merging #24984) Tom asked the same question on the PR a few days ago (#24984 (comment)), and there was not really reaction on it, so I somewhat assumed that it was the case (but I also agree with it). But @WillAyd mentioned on gitter that he has his doubts on the need to fix it. |
@jreback
|
@h-vetinari indeed, changed the milestone accordingly |
The following code worked in Pandas 0.23.4 but not in Pandas 0.24.0 (I'm on Python 3.7.2).
In Pandas 0.23.4, I get the following correct result:
In Pandas 0.24.0, I get the following error:
After looking at Pandas 0.24.0's implementation of
DataFrame.set_index
:pandas/pandas/core/frame.py
Lines 4144 to 4153 in 83eb242
I noticed that
is_scalar
returnsFalse
forthing1
in Pandas 0.24.0:I suspect that it is incorrect to test DataFrame column keys using
is_scalar
.Output of
pd.show_versions()
pd.show_versions()
from Pandas 0.23.4INSTALLED VERSIONS
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.4.3
Cython: None
numpy: 1.16.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.1.2
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
pd.show_versions()
from Pandas 0.24.0INSTALLED VERSIONS
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.0
pytest: None
pip: 18.1
setuptools: 40.4.3
Cython: None
numpy: 1.16.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.1.2
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
The text was updated successfully, but these errors were encountered: