-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: coalesce-method (upgrade for update/combine_first) #22812
Comments
Just about the name: One disadvantage of this name: not being native speaking, and not coming from a database background, I would not even have the slightest idea what even the word itself would mean, let alone what it would do. I know it is a term used in SQL (and which is a big plus for using such a name), and in the meantime I know it from there, but IMO it is a rather newcomer unfriendly name. In that sense, I find 'combine' or 'update' more natural terminology. Then about the functionality: I think a (sidenote in general: sorry for the slow progress at the moment in those discussions, it seems there are currently not many other core devs that have time to participate, and such discussions take time / need input of other people. That can certainly be annoying, but is part of the open source process ..) |
Thanks for the response! I understand that dev time is a very limited resource... ;-) I don't feel strongly about the name. I'd be fine with having it under I'd also be fine with naming it something shorter (and less specific than https://www.dictionary.com/browse/fuse:
verb (used without object), fused, fus·ing.
In any case, my main point is about having the desired capabilities, not about a specific name. |
Re-pinging @jreback @jorisvandenbossche @TomAugspurger @cpcloud @gfyoung @toobaz |
@h-vetinari : Yikes! Sorry that we all went dark on this... Given that you already had some backing already on this proposal from @jreback (and in some way from @jorisvandenbossche ), I would suggest that you try implementing this and open a PR. In terms of implementation, I'm inclined to agree that naming / behavior should be consistent with SQL. If we're afraid that "coalesce" is unfriendly to end-users, I don't see why we couldn't alias it to another, more friendly-sounding name if need be. |
@pandas-dev/pandas-core @h-vetinari : As a side note, I noticed that you have a handful of substantive PR's open that touch upon some pretty big functionality questions but seemed to have reached impasse's either as a result of core-dev's not answering or @h-vetinari your not updating them for awhile... Perhaps we / you might want to clean those up first for merging before opening yet another PR? 😉 |
My preference would be
I try to keep all my PRs current and have responded quickly to any and all feedback, but since that is such a scarce resource (and in my case, >95% of it is done by @jreback, it has to be acknowledged. So thanks for that! :) ), I have no problem doing several big PRs in parallel, which increases the total feedback throughput I receive and therefore lets me make faster progress. In detail:
|
If we're changing the name, I'm wondering if it's worth creating yet-another-name versus using something somewhat standardized (COALESCE from SQL, for better or for worse) |
@wesm The pre-existing name is a strong argument for sure. While I care much more about the functionality than what name it resides under, I also see the appeal of not having the baggage of preconceptions that come with the name (e.g. This could look as follows in a putative docstring: def fuse(self, other, join='left', overwrite=True, filter_func=None,
errors='ignore'):
"""
Unite two DataFrames into one, filling non-NA values where possible.
This method always aligns on index and columns. The parameters easily
allow switching between different usage patterns, which sometimes have
dedicated methods in other languages (or previous versions of pandas):
* `update` (e.g. for python-dict):
same result as `df.fuse(other)`, but `fuse` is not inplace.
* `coalesce` (SQL):
equivalent (entry-by-entry) to `df.fuse(other, overwrite=False)`.
* `combine_first` (pandas before version 1.0):
`df.combine_first(other)` is equivalent to
`df.fuse(other, join='outer', overwrite=False) or
`other.fuse(df, join='outer')`.
Two additional keywords allow further tuning of the behavior, e.g. to
raise if non-NA values coincide.
Parameters
----------
other : DataFrame, or object coercible into a DataFrame
[rest is the same as for df.update (but with more joins)] That being said, I'd be just as happy with |
The name "fuse" doesn't really connote the "overlay" aspect of combine_first to me, but I don't have super strong feelings about it |
It's really a pity that the PS. I was prepared to implement all that along the lines of the docstring above (and do it right, going deep into the bowels of the type promotion to avoid painful inconsistencies for the user), but decided to spend my energies elsewhere after investing 100s of hours that ended up being repaid with what I can only describe as hostility (and - where my changes got taken over - basically zero attribution). |
@h-vetinari Is this something you desired: https://pwwang.github.io/datar/notebooks/coalesce/ |
The state of
update/combine_first
inv0.23
:.update
signature does not match between DataFrame/Series (ENH: unify signature for df.update and Series.update #22358)df.update
has ajoin
-kwarg that only supportsleft
, although the source code itself notes:# TODO: Support other joins
(ENH: more joins for DataFrame.update #21855).update
is one of the (very) few pandas-methods that's inplace by default, but does not have aninplace
-kwarg (ENH: add inplace-kwarg to df.update #22286).combine_first
is effectively (the not-yet-implemented).update(join='outer')
, has an awkward, non-standard name, and much fewer capabilities than.update
. (DEPR: combine_first (replace with update(..., join='outer'); for both Series/DF) #21859)I tried to make some steps towards #21855 and #21859 by adding an
inplace
-kwarg todf.update
in #22286, which has been stalled in discussion whetherupdate
should ever be inplace at all, resp. how to move away from inplacing generally.Today, some headway was made with the comment by @jreback:
which I'm strongly in favour of (with the caveat that it should use the capabilities of
update
; I suggested something similar in #21855; would also solve most of the discussion there). And yes, dplyr uses "coalesce", which itself is inspired by SQL: https://cran.r-project.org/web/packages/dplyr/dplyr.pdf#page.15This discussion is opened on the advice of @jreback, who would like to involve:
Also tagging the other participants of #21855: @gfyoung @toobaz
Summing up this proposal:
.coalesce
togeneric.py
, à la:def coalesce(self, other, join='left', overwrite=True, filter_func=None, raise_conflict=False):
which is not inplace and inherited by DataFrame/Series
join='left'|'outer'|'inner'|'right'
(most of the discussion in ENH: more joins for DataFrame.update #21855 is about potentially allowing different joins for different axes, and which keywords to use for that)..update
and.combine_first
The text was updated successfully, but these errors were encountered: