Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: combine_first (replace with update(..., join='outer'); for both Series/DF) #21859

Open
h-vetinari opened this issue Jul 11, 2018 · 8 comments
Labels
Deprecate Functionality to remove in pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@h-vetinari
Copy link
Contributor

h-vetinari commented Jul 11, 2018

I always found the mechanics of combine_first very unintuitive, and constantly need to look into the docs to see what's happening. I haven't checked the git history, but it seems that the method was a direct response from wesm to a SO question (https://stackoverflow.com/a/9794891). In particular, I think this would be much more intuitive to do with df.update, which is a subset of what #21855 proposes -- it introduces join='outer' for DataFrame.update (currently, only 'left' is supported, but even the source code notes # TODO: Support other joins).

With that new option, df1.combine_first(df2) would be the same as df1.update(df2, join='outer', overwrite=False), only that combine_first has much fewer options and controls (i.e. filter_func and raise_conflict). The only difference is that df.update currently returns None, see #21858.

Since it's quite a well-established function, the deprecation cycle would maybe have to be longer than usual, but I think the update variant is much cleaner, as well as more versatile, than this single-purpose function.

@gfyoung gfyoung added API Design Deprecate Functionality to remove in pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jul 12, 2018
@gfyoung
Copy link
Member

gfyoung commented Jul 12, 2018

cc @jreback

@shoyer
Copy link
Member

shoyer commented Jul 16, 2018

One downside of DataFrame.update() vs combine_first() is that it modifies the dataframe inplace rather than having a return value.

Just as a point of reference, in xarray we implement combine_first() in terms of an outer join + fillna:

    def combine_first(self, other):
        """Combine two DataArray objects, with union of coordinates.

        This operation follows the normal broadcasting and alignment rules of
        ``join='outer'``.  Default to non-null values of array calling the
        method.  Use np.nan to fill in vacant cells after alignment.

        Parameters
        ----------
        other : DataArray
            Used to fill all matching missing values in this array.

        Returns
        -------
        DataArray
        """
        return ops.fillna(self, other, join="outer")

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Jul 16, 2018

@shoyer

One downside of DataFrame.update() vs combine_first() is that it modifies the dataframe inplace rather than having a return value.

I agree, which is why I opened #21858 and linked it with the issue here... :)

@jorisvandenbossche
Copy link
Member

While looking through #22812, I was also thinking about how combine_first is basically fillna, but without the outer alignment. So basically what @shoyer said :)

Adding a join or align keyword to fillna might also be an option.

@jorisvandenbossche
Copy link
Member

I haven't checked the git history, but it seems that the method was a direct response from wesm to a SO question (https://stackoverflow.com/a/9794891)

And for historical fun, combineFirst was already present in the first commit of the git history: c6b236d, so at least a couple of years before the SO question :-)

@jbrockmendel
Copy link
Member

Adding a join or align keyword to fillna might also be an option.

my impression is that we're making more of an effort to avoid adding new keywords than we were in 2018. could we just tell users to align explicitly before calling fillna?

@lukemanley
Copy link
Member

could we just tell users to align explicitly before calling fillna?

This probably works in most cases. It doesn't work well for non-nullable integers where aligning first may introduce nans and convert to float.

@jbrockmendel
Copy link
Member

It doesn't work well for non-nullable integers where aligning first may introduce nans and convert to float.

Fair enough. That's a price I'd be willing to live with, especially as non-nullable integers are going way eventually, but I dont care enough to push on this too hard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

7 participants