Allow concat() to drop/replace duplicate index labels? #1072

Closed
chunweiyuan opened this issue Nov 1, 2016 · 26 comments

@chunweiyuan
Contributor

Right now,

>>> coords_l, coords_r = [0, 1, 2], [1, 2, 3]
>>> missing_3 = xr.DataArray([11, 12, 13], [('x', coords_l)])
>>> missing_0 = xr.DataArray([21, 22, 23], [('x', coords_r)])
>>> together = xr.concat([missing_3, missing_0], dim='x')
>>> together
<xarray.DataArray 'missing_3' (x: 6)>
array([11, 12, 13, 21, 22, 23])
Coordinates:
  * x        (x) int64 0 1 2 1 2 3
>>> together.sel(x=1)
<xarray.DataArray 'missing_3' (x: 2)>
array([12, 21])
Coordinates:
  * x        (x) int64 1 1

Would it be OK to introduce a kwarg ("replace"?) that replaces cells of identical coordinates from right to left?

That would render

>>> together
<xarray.DataArray 'missing_3' (x: 4)>
array([11, 21, 22, 23])
Coordinates:
  * x        (x) int64 0 1 2 3

Some people might even want to drop all cells with coordinate collisions (probably not us). If that's the case, the kwarg would need to be ternary.
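
For what it's worth, here is a minimal sketch, using only the existing API, of the replace-from-the-right behavior described above (an illustration, not the proposed kwarg itself): keep the last occurrence of each duplicated label after the concat, via the underlying pandas index.

import numpy as np
import xarray as xr

coords_l, coords_r = [0, 1, 2], [1, 2, 3]
missing_3 = xr.DataArray([11, 12, 13], [('x', coords_l)])
missing_0 = xr.DataArray([21, 22, 23], [('x', coords_r)])
together = xr.concat([missing_3, missing_0], dim='x')

# keep only the right-most occurrence of each duplicated x label
keep = np.flatnonzero(~together.indexes['x'].duplicated(keep='last'))
print(together.isel(x=keep))   # values [11, 21, 22, 23] at x = 0 1 2 3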

@shoyer
Member

shoyer commented Nov 2, 2016

I don't think this is the best fit for concat. Rather, I would suggest a new method, something like combine_first. See these issues for more discussion: #835 #742

@chunweiyuan
Contributor Author

Is combine_first already implemented? I can't find it in the source code, nor could I find ops.choose.

@shoyer
Member

shoyer commented Nov 2, 2016

Sorry, nope, not implemented yet. ops.choose also is not implemented yet, but it would be in the style of stack from core/ops.py.

@chunweiyuan
Contributor Author

chunweiyuan commented Nov 3, 2016

So I fooled around with Pandas' combine_first:

>>> df1 = pd.DataFrame({'x':[1,2,3],'z':[4,5,6]}).set_index('x')
>>> df1      
   z
x   
1  4
2  5
3  6
>>> df2 = pd.DataFrame({'x':[1,12,13],'y':[0,5,6],'z':[7,8,9]}).set_index(['x','y'])
>>> df2
      z
x  y   
1  0  7
12 5  8
13 6  9
>>> df1.combine_first(df2)
        z
x  y     
1  0  4.0
12 5  8.0
13 6  9.0

and was surprised that the indexes were not "outer-joined". Is this the behavior xarray wants to emulate?

As a mockup for xr.combine_first(arr1, arr2), I was thinking about using xr.align(arr1, arr2, join="outer") to set up the template, and then going into the template to set the values right. Is that a sound approach? Many thanks.

@shoyer
Member

shoyer commented Nov 3, 2016

Indeed, I think we definitely want to do an outer align first. That behavior for pandas looks very strange.

@chunweiyuan
Contributor Author

Here's my somewhat hacky stab at it, not optimized for speed/efficiency:

import numpy as np
import xarray as xr

def combine_first(left, right):
    """
    Takes 2 data arrays, performs an outer alignment, using values from the right
    array to fill in for coordinates missing from the left array.
    """
    l_aligned, r_aligned = xr.align(left, right, join="outer")
    temp = l_aligned + r_aligned   # hack: just to get an array with the outer-joined coords
    temp.values[temp.notnull().values] = np.nan  # now the template is all nan
    # insert non-nan values from the right array first
    temp.values[r_aligned.notnull().values] = r_aligned.values[r_aligned.notnull().values]
    # insert values from the left array, overwriting those from the right array
    temp.values[l_aligned.notnull().values] = l_aligned.values[l_aligned.notnull().values]
    return temp

And the result:

>>> ar1 = xr.DataArray([4,5,6],[('x',[1,2,3])])
>>> ar2 = xr.DataArray([[7,8,9],[10,11,12],[13,14,15]],[('x',[1,12,13]),('y',[0,5,6])])
>>> ar1
<xarray.DataArray (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) int64 1 2 3
>>> ar2
<xarray.DataArray (x: 3, y: 3)>
array([[ 7,  8,  9],
       [10, 11, 12],
       [13, 14, 15]])
Coordinates:
  * x        (x) int64 1 12 13
  * y        (y) int64 0 5 6
>>> temp = combine_first(ar1, ar2)
>>> temp
<xarray.DataArray (x: 5, y: 3)>
array([[  4.,   5.,   6.],
       [  4.,   5.,   6.],
       [  4.,   5.,   6.],
       [ 10.,  11.,  12.],
       [ 13.,  14.,  15.]])
Coordinates:
  * x        (x) int64 1 2 3 12 13
  * y        (y) int64 0 5 6
>>> temp = combine_first(ar2, ar1)
>>> temp
<xarray.DataArray (x: 5, y: 3)>
array([[  7.,   8.,   9.],
       [  4.,   5.,   6.],
       [  4.,   5.,   6.],
       [ 10.,  11.,  12.],
       [ 13.,  14.,  15.]])
Coordinates:
  * x        (x) int64 1 2 3 12 13
  * y        (y) int64 0 5 6

Would this be the behavior we want from xarray's combine_first? It's not entirely analogous to the pandas function. Maybe rename it?

@chunweiyuan
Contributor Author

Another test:

>>> left
<xarray.DataArray (x: 2, y: 2)>
array([[1, 1],
       [1, 1]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0
>>> right
<xarray.DataArray (x: 2, y: 2)>
array([[0, 0],
       [0, 0]])
Coordinates:
  * x        (x) |S1 'b' 'c'
  * y        (y) int64 0 2
>>> combine_first(left, right)
<xarray.DataArray (x: 3, y: 3)>
array([[  1.,   1.,  nan],
       [  1.,   1.,   0.],
       [ nan,   0.,   0.]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
  * y        (y) int64 -2 0 2
>>> combine_first(right, left)
<xarray.DataArray (x: 3, y: 3)>
array([[  1.,   1.,  nan],
       [  1.,   0.,   0.],
       [ nan,   0.,   0.]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
  * y        (y) int64 -2 0 2

@shoyer
Member

shoyer commented Nov 6, 2016

Pandas implements this like: result = where(notnull(left), left, right). That's probably how we want to do it -- much simpler than my proposal in the other issue.
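
A minimal sketch of that where/notnull approach for DataArrays (assuming a where that accepts an other argument, as the later API does; an illustration, not the eventual implementation):

import xarray as xr

def combine_first_via_where(left, right):
    # outer-align so no labels are dropped, then take `left` wherever it
    # has data and fall back to `right` everywhere else
    left, right = xr.align(left, right, join="outer")
    return left.where(left.notnull(), right)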

@shoyer
Member

shoyer commented Nov 6, 2016

I think we should still call it combine_first -- the only difference is the alignment behavior.

@chunweiyuan
Contributor Author

chunweiyuan commented Nov 6, 2016

Thanks for the reply. Would it make things easier if c, d = xr.align(a, b, join="outer") explicitly broadcast missing dimensions in all returned values? Currently, if b is missing a dimension, then d is also missing that dimension:

>>> a
<xarray.DataArray (x: 2, y: 2)>
array([[1, 1],
       [1, 1]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0
>>> b
<xarray.DataArray (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) |S1 'a' 'b' 'd'
>>> c, d = xr.align(a, b, join="outer")
>>> c
<xarray.DataArray (x: 3, y: 2)>
array([[  1.,   1.],
       [  1.,   1.],
       [ nan,  nan]])
Coordinates:
  * x        (x) object 'a' 'b' 'd'
  * y        (y) int64 -2 0
>>> d
<xarray.DataArray (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) object 'a' 'b' 'd'

@shoyer
Member

shoyer commented Nov 6, 2016

Use xarray.broadcast to do xarray-style broadcasting (it also aligns). This will be easier after xarray.apply is merged.

For now, maybe use Dataset._binary_op - take a look at how fillna is implemented. This is actually almost exactly the same as fillna, except using an outer join (instead of a left join).
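
A minimal sketch of the outer-align-then-broadcast step being suggested here (using the arrays a and b from the previous comment; xr.broadcast only has to fill in the missing y dimension, since the arrays are already aligned):

import xarray as xr

a = xr.DataArray([[1, 1], [1, 1]], [('x', ['a', 'b']), ('y', [-2, 0])])
b = xr.DataArray([4, 5, 6], [('x', ['a', 'b', 'd'])])

# outer-align first so no labels are dropped, then broadcast so both
# operands carry the full (x, y) dimensions before filling
a_aligned, b_aligned = xr.align(a, b, join='outer')
a_full, b_full = xr.broadcast(a_aligned, b_aligned)
print(b_full.dims)   # ('x', 'y')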

@chunweiyuan
Contributor Author

chunweiyuan commented Nov 8, 2016

So I spent some time looking at dataset._calculate_binary_op but couldn't quite come up with what I wanted. After banging my head against the wall a bit, this is what I have:

import xarray as xr

def combine_first(left, right):
    la, ra = xr.align(left, right, join='outer', copy=False)  # should copy=True?
    # .where with an always-true condition is a trick to broadcast each array
    # against the other's dimensions without changing any values
    la, ra = la.where(ra.isnull() | ra.notnull()), ra.where(la.isnull() | la.notnull())
    ra.values[la.notnull().values] = la.values[la.notnull().values]
    return ra

And it seems to work. My test cases are

>>> l_2d
<xarray.DataArray (x: 2, y: 2)>
array([[1, 1],
       [1, 1]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0
>>> r_2d
<xarray.DataArray (x: 2, y: 2)>
array([[0, 0],
       [0, 0]])
Coordinates:
  * x        (x) |S1 'b' 'c'
  * y        (y) int64 0 2
>>> ar_1d
<xarray.DataArray (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) |S1 'a' 'b' 'd'

and here are the results:

>>> combine_first(l_2d, r_2d)
<xarray.DataArray (x: 3, y: 3)>
array([[  1.,   1.,  nan],
       [  1.,   1.,   0.],
       [ nan,   0.,   0.]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
  * y        (y) int64 -2 0 2
>>> combine_first(r_2d, l_2d)
<xarray.DataArray (x: 3, y: 3)>
array([[  1.,   1.,  nan],
       [  1.,   0.,   0.],
       [ nan,   0.,   0.]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
  * y        (y) int64 -2 0 2
>>> combine_first(l_2d, ar_1d)
<xarray.DataArray (x: 3, y: 2)>
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 6.,  6.]])
Coordinates:
  * x        (x) object 'a' 'b' 'd'
  * y        (y) int64 -2 0
>>> combine_first(ar_1d, l_2d)
<xarray.DataArray (x: 3, y: 2)>
array([[ 4.,  4.],
       [ 5.,  5.],
       [ 6.,  6.]])
Coordinates:
  * x        (x) object 'a' 'b' 'd'
  * y        (y) int64 -2 0

I don't like the fact that I have to access .values, and the use of .where is slightly wonky. But this is definitely the cleanest working solution I have so far. Any suggestions for improvement are welcome.

@chunweiyuan
Contributor Author

Any suggestions for improvement?

@chunweiyuan
Contributor Author

Hmm, so what would be the expected behavior of ds.combine_first?

If I have

>>> ds0
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -1 0
Data variables:
    ds0      (x, y) int64 0 0 0 0
>>> ds1
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) |S1 'b' 'c'
  * y        (y) int64 0 1
Data variables:
    ds1      (x, y) int64 1 1 1 1

I get

>>> ds0.combine_first(ds1)
<xarray.Dataset>
Dimensions:  (x: 3, y: 3)
Coordinates:
  * x        (x) object 'a' 'b' 'c'
  * y        (y) int64 -1 0 1
Data variables:
    ds0      (x, y) float64 0.0 0.0 nan 0.0 0.0 nan nan nan nan
    ds1      (x, y) float64 nan nan nan nan 1.0 1.0 nan 1.0 1.0

and changing the order to ds1.combine_first(ds0) just flips the order of the data_vars, but the cell values of the data_vars remain the same.

This is done essentially by adding a _combine_first to ops.py that mimics _fillna, except join='outer'.

@shoyer
Member

shoyer commented Nov 13, 2016

What about simply adding a keyword argument join to fillna? Then you could write join='outer' to accomplish combine_first.
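
A minimal sketch of what that proposal amounts to (fillna_with_join is a hypothetical helper, not an existing method): pre-align with the requested join, then fall back to the ordinary fillna.

import xarray as xr

def fillna_with_join(a, b, join='left'):
    # join='left' reproduces plain fillna; join='outer' gives combine_first
    a_aligned, b_aligned = xr.align(a, b, join=join)
    return a_aligned.fillna(b_aligned)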

@chunweiyuan
Contributor Author

So these are my _fillna and _combine_first in ops.inject_binary_ops:

# _fillna: fill from `other` after a left join
f = _func_slash_method_wrapper(fillna)
method = cls._binary_op(f, join='left', fillna=True)
setattr(cls, '_fillna', method)

# _combine_first: identical, except for the outer join
f = _func_slash_method_wrapper(fillna)
method = cls._binary_op(f, join='outer', fillna=True)
setattr(cls, '_combine_first', method)

Within dataarray.py and dataset.py, combine_first(self, other) simply returns self._combine_first(other). This code path produces the test results you saw earlier.

Given this construct, I'm not sure how to do the refactor you mentioned. Perhaps a few more pointers in the right direction? :)

@chunweiyuan
Contributor Author

I suppose I can save one line by getting rid of the duplicate f = _func_slash_method_wrapper(fillna), but beyond that I'm not sure of the best way to refactor this.

If the behaviors check out, then this branch might be ready for a PR.

@chunweiyuan
Contributor Author

Currently, ds0.combine_first(ds1) gives exactly the same result as xr.merge([ds0, ds1]). But for data arrays it still offers something new.

Either
1.) my combine_first should be doing something different with datasets, or
2.) we don't need a combine_first for datasets, or
3.) xr.merge should be changed so that, when applied to data arrays, it creates a new data array with an outer join and fillna (by chaining combine_first down the list of data arrays).

@chunweiyuan
Contributor Author

Checking in on how to move forward from here...I feel it's pretty close to a PR...

@shoyer
Member

shoyer commented Nov 17, 2016

I would rather wait a little while here -- I think #964 will handle the logic we need to make adding a join argument to fillna very easy, which I would prefer to adding a new method.

merge works a little differently than combine_first -- it raises an exception if any overlapping values conflict.
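
A small illustration of that difference (hedged: it uses the combine_first method that was eventually added, plus xarray.MergeError, which merge raises on conflicting overlaps):

import xarray as xr

ds0 = xr.Dataset({'v': ('x', [0, 0])}, coords={'x': [1, 2]})
ds1 = xr.Dataset({'v': ('x', [9, 9])}, coords={'x': [2, 3]})

# merge refuses to pick a winner for the conflicting value at x=2
try:
    xr.merge([ds0, ds1])
except xr.MergeError as err:
    print('merge raised:', err)

# combine_first keeps the calling object's value at x=2
print(ds0.combine_first(ds1)['v'].values)   # [0. 0. 9.]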

@chunweiyuan
Contributor Author

So I took a quick look at the commits in #964. It's not entirely clear to me how one can easily add a join argument to fillna. Should I keep my current commits, submit a PR to master once #964 is merged in, and then see how it goes from there?

@shoyer
Member

shoyer commented Jan 6, 2017

I just merged #964.

If you're up for it, I would suggest rewriting fillna using apply_ufunc, instead of the current hack which relies on the injected _fillna method created with the _binary_op staticmethod (ugh).

Let me know if you need any guidance on using apply_ufunc or if the documentation is sufficient. In this case, at least, you would not need to use the confusing signature argument.
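
For reference, a minimal sketch of the apply_ufunc direction for DataArrays (fillna_outer is a made-up name, and this is not the final implementation; join='outer' gives the combine_first behavior this issue is after):

import numpy as np
import xarray as xr

def fillna_outer(a, b):
    # after an outer join the missing slots are NaN-filled, so a plain
    # element-wise np.where picks `a` where it has data, else `b`
    return xr.apply_ufunc(
        lambda x, y: np.where(np.isnan(x), y, x),
        a,
        b,
        join='outer',
    )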

@chunweiyuan
Contributor Author

OK, I'll give it a shot. I'll touch base when I run into roadblocks.

@chunweiyuan
Contributor Author

I'm curious: if you put a from .computation import apply_ufunc inside ops.py, would you not get a circular ImportError? It seems difficult to get out of that circle.

@shoyer
Member

shoyer commented Jan 10, 2017

@chunweiyuan import at the top of the method/function:

def fillna(self, other, ...):
    from .computation import apply_ufunc
    return apply_ufunc(...)

This is a pretty standard way to get around circular imports. There's basically no way to avoid this issue with APIs designed to have lots of logic in method calls. (Other than gigantic files, which isn't a real fix.)

@shoyer
Member

shoyer commented Jan 23, 2017

Fixed by #1204

shoyer closed this as completed Jan 23, 2017