-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug: rename incapable of accepting tuples as new name #19497
Comments
You're fighting against pandas by using tuples as keys in your Index, instead of using a cc @toobaz is this worth attempting to support? |
I agree with @TomAugspurger that tuples as keys are weird. That said, I might have fixed this somewhere... maybe #18600 ... in any case I guess it will be supported eventually. |
Consider this circumstance:
In this circumstance, it would seem to me that a simple Index with tuples is the most obvious and easy solution, and may even be the only option. |
Uhm, no, that PR is unrelated. And I was probably just confused. I still think this is going to be fixed... sooner or later. |
as @TomAugspurger says above, this is simply not supported and you are fighting pandas like crazy here. The only way I could see doing this going forward would be to have an actual TupleIndex (subclassing EA) that is pretty explicity created here. Closing this as won't fix. |
FWIW, I think think when #17246 is fixed, this will happen to be fixed as well. |
I don't see this as "supporting tuples", but as "supporting anything which we don't state is not supported" (and can be supported). The bug must lie in a This said, I totally agree this is low priority. |
I figure by your statement @toobaz that you are have not surveyed the fix that has been provided - indeed the crux of the problem is that Index by default returns a MultiIndex if provided tuples, as above. This can be prevented by supplying the tupleize_cols=False argument. It follows that I don't think the bug does lie in 'tupleize_cols' - it is currently the default behaviour of Index to return a MultiIndex if given tuples (because tupleize_cols, by default, is True). One could argue that the default should be False, but I assume this approach would be avoided because it would be a large impact on the API. This surprising change of type is discussed in #17246, and will hopefully be included in the fix. @jreback argues that the fix is inappropriate, and that using tuples is unsupported. If that be the case - assuming #17246 is not going to be fixed soon, or, even if it is fixed, it doesn't fix this bug - then I think it should be clearly documented that tuples are not supported. Not supporting tuples I think would be a little disappointing, simply because I can't see a more obvious way to support the circumstances I have previously outlined above. |
I think this thread might be benefited by an example of why supporting tuples is a good idea. Consider for example, the country Australia and the states within it: NSW, QLD, VIC, TAS, WA, SA, NT, ACT. Also consider the region "Murray Darling Basin", which also has a natural hierarchical relationship to Australia, but specifies an area within NSW, VIC and SA (but does not completely include all of those states - it specifies the water catchment area). With reference to my earlier comments about the circumstances in which tuples in the index are useful:
There exists a natural hierarchical relationship between 'Australia' and these states - i.e. each state lies within Australia. There is also a relationship between 'Murray Darling Basin' and 'Australia'.
Consider that you wish to include in your dataframe, data series with names:
Because I should be able to.
If it's a multiindex, and there are None or * fields, my recollection is that this doesn't play nice (hence a 3-level multi-index may not be a simple workaround).
Because I want to export to a file that is interpreted by another program that understand the hierarchical relationship.. I hope it is now clear why an Index of tuples becomes the most obvious, if not the best, option for solving some problems. |
... or that one can pass Then probably this bug is fixed also if the default changes (as mentioned by @TomAugspurger ), but that was not my point. When I wrote that the code is "actively" doing something wrong, I just meant that |
This is hard, since it isn't really clear from the name that w.r.t. your example @charlie0389, it's hard to say anything without actual code / data. I suspect that a MI is able to handle your problem. |
Ok, for an example, please consider the following code:
Which has the following output:
Apologies for the wordiness, but I think it illustrates the conceptual point I'm trying to make. |
I don't think it's relevant, but what do those two sentences mean? The reason I say it's not relevant, is because the meaning you attach to a MultiIndex is up to you. Typically they're used to represent hierarchical data, but that's not necessary. It really is just a multi-part label, just like a tuple. Attempting to interpret the "conceptual equals" bit, it seems like you're implicitly putting data in the index. You have some kind of I don't understand the 3-level example. Again, though, it looks like you're putting some data in the index when it should go in the columns. Assuming the new level is something like midx = pd.MultiIndex.from_tuples(given_labels)
df = pd.DataFrame({
"sq. kms": given_data,
"is_water": [False] * 8 + [True]
}, index=midx)
df results in
Which (IIUC) is a much better way to represent the data. |
As much as I love indicizing stuff with Consider keys which represent paths: In [3]: megabytes = pd.Series([103, 30, 5],
index=pd.Index([('usr', 'share'), ('usr', 'bin'), ('usr', 'local', 'bin')], tupleize_cols=False))
In [4]: megabytes
Out[4]:
(usr, share) 103
(usr, bin) 30
(usr, local, bin) 5
dtype: int64 This is not an index which it makes sense to store as So: we can always say pandas does not support tuples because it's just too messy (in terms of API, not necessarily just implementation). I just don't think it is the case. But I might be wrong. In any case, I don't think that investigating the intentions of anybody who wants to use tuples as keys (also see #20597) is a viable long term solution :-) |
Is it possible to leave this bug open please? Going from the discussion so far, no one is disputing the fact that this is a bug. The only question is whether it should be supported, which even then, there still seems to be little disagreement that this should be supported in one form or another - all the discussion appears to revolve around implementation. |
For those that stumble upon this at a later date and are similarly frustrated by this bug, the following code fixes it:
The only differences are the function signature, and the last return line. |
@charlie0389 if you can open a PR where you
... then I think it would be a good candidate for inclusion. If in order to pass tests you do need to change the signature, I would suggest, rather than Reopening this at least temporarily as a fix seems feasible and simple. |
Would we be ok with a rule that rename doesn’t chase the type of the index between MI and other? If you have tuples going in and out, you get an Index. If you have a MI going in and tuples coming out, you get a MI?
…________________________________
From: Pietro Battiston <notifications@github.com>
Sent: Tuesday, April 17, 2018 11:03:53 PM
To: pandas-dev/pandas
Cc: Tom Augspurger; Mention
Subject: Re: [pandas-dev/pandas] Bug: rename incapable of accepting tuples as new name (#19497)
@charlie0389<https://github.com/charlie0389> if you can open a PR where you
* change the current _transform_index to tupleize_cols=False
* add a test (e.g. your initial example)
* verify that it passes all current tests (the part that most worries me)
... then I think it would be a good candidate for inclusion.
If in order to pass tests you do need to change the signature, I would suggest, rather than tupleize_cols, a more general parameter such as keep_type=True (@TomAugspurger<https://github.com/TomAugspurger> @jreback<https://github.com/jreback> better ideas?) which, when set to False, re-interprets the index content (so potentially changing not just an Index to MultiIndex, but also the other way round if e.g. keys in a MultiIndex are replaced with non-tuples). You might then want to split the process of creating the items list and the actual creation of the index.
Reopening this at least temporarily as a fix seems feasible and simple.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#19497 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ABQHIs9QzqC6VkhzwterzbYuVN5yTS8xks5tprspgaJpZM4R2YV9>.
|
Yes, that was my idea (with |
@toobaz do you think we need a new parameter ( |
I don't think we do in principle: my only concern was about backwards compatibility (and if we don't, then the fix to this is really a matter of passing
For all examples I can think of, explicitly recasting is a better solution. |
What's the backwards compatibility concern? I misunderstood In [22]: s = pd.Series(1, index=pd.MultiIndex.from_product([["A", "B"], ['a', 'b']]))
In [23]: s
Out[23]:
A a 1
b 1
B a 1
b 1
dtype: int64
In [24]: s.rename({"A": 'a'})
Out[24]:
a a 1
b 1
B a 1
b 1
dtype: int64 In that case, I think that passing |
Just that somebody (I would already be happy if it doesn't happen in some tests) assumed the following is a reasonable way to create a
|
(but mine might be pure paranoia: if tests pass, I would proceed) |
For completeness: in principle code out there could also be relying on the fact that In [2]: pd.Series(range(3), index=['1', '2', '3']).rename({'1' : 1, '2' : 2, '3' : 3.}).index
Out[2]: Float64Index([1.0, 2.0, 3.0], dtype='float64') although implementation wise this can be decoupled from the issue of multi vs. flat, documentation wise we probably just want to say "the resulting index will have the same type", and disable this automatic conversion. |
Understood. That is a valid concern...
I think converting between types (numeric vs. Index, etc.) is fine. It's the conversion between multi vs. flat that we (maybe) want to disallow via |
…as-dev/pandas#19497). Still compatible with 0.22 by using `pd.MultiIndex` directly.
…as-dev/pandas#19497). Still compatible with 0.22 by using `pd.MultiIndex` directly.
Pandas is incapable of renaming a pandas.Index object with tuples as the new value. Providing a tuple as
new_name
inpandas.DataFrame.rename({old_name: new_name}, axis="index")
returns apandas.MultiIndex
object, and providing it within a singleton tuple returns an undesirable result. See code below (work-around at bottom):...Will produce the output:
Desired/Expected output:
Problem description
The current behaviour is a problem for two reasons:
I have checked for similar issues by search of the word
rename
, and at time of writing, pandas 0.22.0 is the latest released version.Output of
pd.show_versions()
pandas: 0.22.0
pytest: 3.0.3
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.25.1
numpy: 1.11.2
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 1.5.3
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.8.0
bs4: 4.5.1
html5lib: 1.0b10
sqlalchemy: 1.1.3
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Workaround
The workaround below uses
set_value
function which the documentation tells the user to avoid using (unless you really know what you're doing):Produces the output:
The text was updated successfully, but these errors were encountered: