-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge nonstring columns #46879
Merge nonstring columns #46879
Conversation
Hello @ericman93! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-06-13 09:36:13 UTC |
Thanks for the PR. Is there an open issue for this? |
@WillAyd not that I could find |
Thanks for the PR. As mentioned, for changes like this, it's best that this PR as associated with an existing issue since it's unclear the motivation and context for this change. |
@mroeschke here is the issue #46885 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i kind of -0.5 on this. i understand the motivation but this is non-standard case and i don't see a good reason to support it.
@ericman93 reading through your original issue does passing |
@WillAyd it is, but passing
|
I don't think this is best way of doing this, since it might mess with a lot of if x in to_rename and suffix is not None:
try:
return x + suffix
except TypeError:
return f"{x}{suffix}" This would then preserve class Column:
def __init__(self, name):
self.name = name
def __add__(self, arg):
return Column(name=self.name+arg.name) For example if you index was tuples say |
@attack68 I think that trying |
why? |
ha never mind, you are right |
even when non string suffixes are given and it might add ints that are weird, one could always input a string suffix and this reverts to past behaviour anyway. there might be cases where intger addition could be useful too |
don't you think having try-catch will make things slower? |
you can measure this yourself, but it suffices to say the try/except on an int and string catching a type error is 3x slower, but the operation only costs 500ns more. Even if you had 1 million columns this would only be worth 0.5seconds performance degradation. And if you had 1million columns you could supply suffixes with an appropriate format to not raise a |
@attack68 fixed according to your suggestion |
IMO allowing None to allow duplicate columns gives more user control than the changes here as stated in the issue #46885 (comment) ...
However, for even more user control, it may be worth considering allowing a renamer function or dictionary to be passed to otherwise agree with #46879 (review) |
@simonjayhawkins removing the validation for none suffix is the easiest way to go, but I think I'll change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. pls move the note and ping on green.
doc/source/whatsnew/v1.5.0.rst
Outdated
@@ -796,6 +796,8 @@ Reshaping | |||
- Bug in :func:`concat` with identical key leads to error when indexing :class:`MultiIndex` (:issue:`46519`) | |||
- Bug in :meth:`DataFrame.join` with a list when using suffixes to join DataFrames with duplicate column names (:issue:`46396`) | |||
- Bug in :meth:`DataFrame.pivot_table` with ``sort=False`` results in sorted index (:issue:`17041`) | |||
- Bug in :func:`merge` and :meth:`DataFrame.merge` now allows passing ``None`` or ``(None, None)`` for ``suffixes`` argument, keeping column labels unchanged in the resulting :class:`DataFrame` potentially with duplicate column labels (:issue:`46885`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's move these to other enhancements section (change the text as these are not bugs per se, rather a change in api)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can reference the original issue as a bug report if you want (just make it a single note). but i also want a note in enhancements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to enhancements
doc/source/whatsnew/v1.5.0.rst
Outdated
@@ -796,6 +796,8 @@ Reshaping | |||
- Bug in :func:`concat` with identical key leads to error when indexing :class:`MultiIndex` (:issue:`46519`) | |||
- Bug in :meth:`DataFrame.join` with a list when using suffixes to join DataFrames with duplicate column names (:issue:`46396`) | |||
- Bug in :meth:`DataFrame.pivot_table` with ``sort=False`` results in sorted index (:issue:`17041`) | |||
- Bug in :func:`merge` and :meth:`DataFrame.merge` now allows passing ``None`` or ``(None, None)`` for ``suffixes`` argument, keeping column labels unchanged in the resulting :class:`DataFrame` potentially with duplicate column labels (:issue:`46885`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can reference the original issue as a bug report if you want (just make it a single note). but i also want a note in enhancements.
@jreback its all green |
@jreback do we still want to wait for @simonjayhawkins response or should we push it? |
if u rebase we can look again |
@jreback done |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm personally -1 on this. It adds even more complexity and more things that can fail to a part of pandas that is already tricky, that it's supporting non-string columns, and allowing duplicate columns.
I agree that automatically casting columns to strings to add suffixes is not a good practice. But instead of adding this extra complexity, I'd prefer to simply raise an exception if non-string duplicate column names are going to be generated as a result of an operation. So the user can decide what's best for their case, and write their code accordingly.
I think that was generally the reason for not doing this before. non-string columns and duplicate columns are legitimate supported features of pandas and so I think they should be supported and work globally with all pandas methods. But to achieve this, it is probably not easily achieved (with discussion) in a single PR (with suggested follow ups) and without considering the behavior of other pandas methods. So achieve this goal, a PDEP or two maybe required, say "Supporting non-string column labels globally and consistently" and "Supporting duplicate column labels globally and consistently" where extensive background material is collated to describe the current behavior (good an bad) and to ensure that a universally consistent approach is adopted to, say, the api, index sorting, label mangling, exception raising and more.
This is basically the status quo, so am happy to +1 this for now (the quoted comment, not the PR, sorry for adding potential confusion). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I think the discussion we should have is whether we want to continue to support duplicated and non-string column names. And this is unrelated to this PR.
I may make a PDEP proposal to discuss it when the PDEP framework is ready. Would be good to know what are the use cases for duplicate and non-string column names. If there are other ways to deal with them, and if it's worth the extra complexity.
Still not a big fan of what's introduced here, but feel free to move forward with this PR.
the +1 was for your comment, not the PR, sorry for adding potential confusion. |
I've recently been working pandas-stubs and have discovered that the claimed
The non-robustness of operations makes me wonder is encouraging more arbitrary types (that are hashable) is a good idea. |
For duplicates one case is repeated measurements from some labeled entity. For example, suppore you get 10 measurements of type "a" and 7 of type "b" where each has a set of characteristics. If there is not other index information, then the natural set of columns is As for non-string column names. surely integer column names are both useful and common. I would also think dates have obvious utility. |
I also agree that raising and kicking it back to the user to address is the cleanest approach. |
Thanks @bashtage, very useful info. I think it may still be worth going deeper into those use cases. What you say makes perfect sense, but I wonder what the whole pipeline could look like. A dataframe with ten A columns and seven B columns can make sense as you say. But what would a user do with it? Use columns one at a time? Then maybe adding suffixes to disambiguate is better. Or would be user group them and compute an average? In that case, maybe supporting an option to stack/unstack when duplicate columns exist in the source would be more useful to the user than having duplicate column names. We may reach the conclusion that allowing duplicates as we do now ia the best we can do. But since there is an obvious trade off, and a lot of added complexity, it may be useful to do an exercise of trying to better understand specific use cases, and what would be the best in every case. |
Agreed. |
Grouping them to perform statistical analysis based on the properties of the group. For example, repeated timsers measurements in a series of experiments where the only time that matters is time since start of the experiment run. Different treatments would get different labels, but the time series would be identical otherwise. The natural form for this data would be run label in columns, and relative time in rows (to me, at least). Another place where duplicated arise is then dropping levels from MultiIndex data. Again, this is sometimes done to group by for statistical analysis within groups. |
Thanks for your effort so far @ericman93, but appears there's not full buy in for this feature yet and looks like it should be discussed further in the linked issue, so closing for now. We can base a future PR off this one if there's more buy in. |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.