Allow concat() to drop/replace duplicate index labels? #1072

Closed
chunweiyuan opened this issue Nov 1, 2016 · 26 comments

@chunweiyuan
Contributor

Right now,

>>> coords_l, coords_r = [0, 1, 2], [1, 2, 3]
>>> missing_3 = xr.DataArray([11, 12, 13], [('x', coords_l)])
>>> missing_0 = xr.DataArray([21, 22, 23], [('x', coords_r)])
>>> together = xr.concat([missing_3, missing_0], dim='x')
>>> together
<xarray.DataArray 'missing_3' (x: 6)>
array([11, 12, 13, 21, 22, 23])
Coordinates:
  * x        (x) int64 0 1 2 1 2 3
>>> together.sel(x=1)
<xarray.DataArray 'missing_3' (x: 2)>
array([12, 21])
Coordinates:
  * x        (x) int64 1 1

Would it be OK to introduce a kwarg ("replace"?) that replaces cells of identical coordinates from right to left?

That would render

>>> together
<xarray.DataArray 'missing_3' (x: 4)>
array([11, 21, 22, 23])
Coordinates:
  * x        (x) int64 0 1 2 3

Some people might even want to drop all cells with coordinate collisions (probably not us). If that's the case, the kwarg would need to be ternary.
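
For what it's worth, here is a minimal sketch, using only the existing API, of the replace-from-the-right behavior described above (an illustration, not the proposed kwarg itself): keep the last occurrence of each duplicated label after the concat, via the underlying pandas index.

import numpy as np
import xarray as xr

coords_l, coords_r = [0, 1, 2], [1, 2, 3]
missing_3 = xr.DataArray([11, 12, 13], [('x', coords_l)])
missing_0 = xr.DataArray([21, 22, 23], [('x', coords_r)])
together = xr.concat([missing_3, missing_0], dim='x')

# keep only the right-most occurrence of each duplicated x label
keep = np.flatnonzero(~together.indexes['x'].duplicated(keep='last'))
print(together.isel(x=keep))   # values [11, 21, 22, 23] at x = 0 1 2 3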

@shoyer
Member

shoyer commented Nov 2, 2016

I don't think this is the best fit for concat. Rather, I would suggest a new method, something like combine_first. See these issues for more discussion: #835 #742

@chunweiyuan
Contributor Author

Is combine_first already implemented? I can't find it in the source code, nor could I find ops.choose.

@shoyer
Member

shoyer commented Nov 2, 2016

Sorry, nope, not implemented yet. ops.choose also is not implemented yet, but it would be in the style of stack from core/ops.py.

@chunweiyuan
Contributor Author

chunweiyuan commented Nov 3, 2016

So I fooled around with Pandas' combine_first:

>>> df1 = pd.DataFrame({'x':[1,2,3],'z':[4,5,6]}).set_index('x')
>>> df1      
   z
x   
1  4
2  5
3  6
>>> df2 = pd.DataFrame({'x':[1,12,13],'y':[0,5,6],'z':[7,8,9]}).set_index(['x','y'])
>>> df2
      z
x  y   
1  0  7
12 5  8
13 6  9
>>> df1.combine_first(df2)
        z
x  y     
1  0  4.0
12 5  8.0
13 6  9.0

and was surprised that the indexes were not "outer-joined". Is this the behavior xarray wants to emulate?

As a mockup for xr.combine_first(arr1, arr2), I was thinking about using xr.align(arr1, arr2, join="outer") to set up the template, and then going into the template to set the values right. Is that a sound approach? Many thanks.

@shoyer
Member

shoyer commented Nov 3, 2016

Indeed, I think we definitely want to do an outer align first. That behavior for pandas looks very strange.

@chunweiyuan
Contributor Author

Here's my somewhat hacky stab at it, not optimized for speed/efficiency:

import numpy as np
import xarray as xr

def combine_first(left, right):
    """
    Takes 2 data arrays, performs an outer alignment, using values from the right
    array to fill in for coordinates missing from the left array.
    """
    l_aligned, r_aligned = xr.align(left, right, join="outer")
    temp = l_aligned + r_aligned   # hack: just to get an array with the outer-joined coords
    temp.values[temp.notnull().values] = np.nan  # now the template is all nan
    # insert non-nan values from the right array first
    temp.values[r_aligned.notnull().values] = r_aligned.values[r_aligned.notnull().values]
    # insert values from the left array, overwriting those from the right array
    temp.values[l_aligned.notnull().values] = l_aligned.values[l_aligned.notnull().values]
    return temp

And the result:

>>> ar1 = xr.DataArray([4,5,6],[('x',[1,2,3])])
>>> ar2 = xr.DataArray([[7,8,9],[10,11,12],[13,14,15]],[('x',[1,12,13]),('y',[0,5,6])])
>>> ar1
<xarray.DataArray (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) int64 1 2 3
>>> ar2
<xarray.DataArray (x: 3, y: 3)>
array([[ 7,  8,  9],
       [10, 11, 12],
       [13, 14, 15]])
Coordinates:
  * x        (x) int64 1 12 13
  * y        (y) int64 0 5 6
>>> temp = combine_first(ar1, ar2)
>>> temp
<xarray.DataArray (x: 5, y: 3)>
array([[  4.,   5.,   6.],
       [  4.,   5.,   6.],
       [  4.,   5.,   6.],
       [ 10.,  11.,  12.],
       [ 13.,  14.,  15.]])
Coordinates:
  * x        (x) int64 1 2 3 12 13
  * y        (y) int64 0 5 6
>>> temp = combine_first(ar2, ar1)
>>> temp
<xarray.DataArray (x: 5, y: 3)>
array([[  7.,   8.,   9.],
       [  4.,   5.,   6.],
       [  4.,   5.,   6.],
       [ 10.,  11.,  12.],
       [ 13.,  14.,  15.]])
Coordinates:
  * x        (x) int64 1 2 3 12 13
  * y        (y) int64 0 5 6

Would this be the behavior we want from xarray's combine_first? It's not entirely analogous to the pandas function. Maybe rename it?

@chunweiyuan
Contributor Author

Another test:

>>> left
<xarray.DataArray (x: 2, y: 2)>
array([[1, 1],
       [1, 1]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0
>>> right
<xarray.DataArray (x: 2, y: 2)>
array([[0, 0],
       [0, 0]])
Coordinates:
  * x        (x) |S1 'b' 'c'
  * y        (y) int64 0 2
>>> combine_first(left, right)
<xarray.DataArray (x: 3, y: 3)>
array([[  1.,   1.,  nan],
       [  1.,   1.,   0.],
       [ nan,   0.,   0.]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
  * y        (y) int64 -2 0 2
>>> combine_first(right, left)
<xarray.DataArray (x: 3, y: 3)>
array([[  1.,   1.,  nan],
       [  1.,   0.,   0.],
       [ nan,   0.,   0.]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
  * y        (y) int64 -2 0 2

@shoyer
Member

shoyer commented Nov 6, 2016

Pandas implements this like: result = where(notnull(left), left, right). That's probably how we want to do it -- much simpler than my proposal in the other issue.
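
A minimal sketch of that where/notnull approach for DataArrays (assuming a where that accepts an other argument, as the later API does; an illustration, not the eventual implementation):

import xarray as xr

def combine_first_via_where(left, right):
    # outer-align so no labels are dropped, then take `left` wherever it
    # has data and fall back to `right` everywhere else
    left, right = xr.align(left, right, join="outer")
    return left.where(left.notnull(), right)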

@shoyer
Member

shoyer commented Nov 6, 2016

I think we should still call it combine_first -- the only difference is the alignment behavior.

@chunweiyuan
Contributor Author

chunweiyuan commented Nov 6, 2016

Thanks for the reply. Would it make things easier if c, d = xr.align(a, b, join="outer") explicitly broadcast missing dimensions in all returned values? Currently, if b is missing a dimension, then d is also missing that dimension:

>>> a
<xarray.DataArray (x: 2, y: 2)>
array([[1, 1],
       [1, 1]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0
>>> b
<xarray.DataArray (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) |S1 'a' 'b' 'd'
>>> c, d = xr.align(a, b, join="outer")
>>> c
<xarray.DataArray (x: 3, y: 2)>
array([[  1.,   1.],
       [  1.,   1.],
       [ nan,  nan]])
Coordinates:
  * x        (x) object 'a' 'b' 'd'
  * y        (y) int64 -2 0
>>> d
<xarray.DataArray (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) object 'a' 'b' 'd'

@shoyer
Member

shoyer commented Nov 6, 2016

Use xarray.broadcast to do xarray-style broadcasting (it also aligns). This will be easier after xarray.apply is merged.

For now, maybe use Dataset._binary_op - take a look at how fillna is implemented. This is actually almost exactly the same as fillna, except using an outer join (instead of a left join).
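
A minimal sketch of the outer-align-then-broadcast step being suggested here (using the arrays a and b from the previous comment; xr.broadcast only has to fill in the missing y dimension, since the arrays are already aligned):

import xarray as xr

a = xr.DataArray([[1, 1], [1, 1]], [('x', ['a', 'b']), ('y', [-2, 0])])
b = xr.DataArray([4, 5, 6], [('x', ['a', 'b', 'd'])])

# outer-align first so no labels are dropped, then broadcast so both
# operands carry the full (x, y) dimensions before filling
a_aligned, b_aligned = xr.align(a, b, join='outer')
a_full, b_full = xr.broadcast(a_aligned, b_aligned)
print(b_full.dims)   # ('x', 'y')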

@chunweiyuan
Contributor Author

chunweiyuan commented Nov 8, 2016

So I spent some time looking at dataset._calculate_binary_op but couldn't quite come up with what I wanted. After banging my head against the wall a bit, this is what I have:

import xarray as xr

def combine_first(left, right):
    la, ra = xr.align(left, right, join='outer', copy=False)  # should copy=True?
    # .where with an always-true condition is a trick to broadcast each array
    # against the other's dimensions without changing any values
    la, ra = la.where(ra.isnull() | ra.notnull()), ra.where(la.isnull() | la.notnull())
    ra.values[la.notnull().values] = la.values[la.notnull().values]
    return ra

And it seems to work. My test cases are

>>> l_2d
<xarray.DataArray (x: 2, y: 2)>
array([[1, 1],
       [1, 1]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0
>>> r_2d
<xarray.DataArray (x: 2, y: 2)>
array([[0, 0],
       [0, 0]])
Coordinates:
  * x        (x) |S1 'b' 'c'
  * y        (y) int64 0 2
>>> ar_1d
<xarray.DataArray (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) |S1 'a' 'b' 'd'

and here are the results:

>>> combine_first(l_2d, r_2d)
<xarray.DataArray (x: 3, y: 3)>
array([[  1.,   1.,  nan],
       [  1.,   1.,   0.],
       [ nan,   0.,   0.]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
  * y        (y) int64 -2 0 2
>>> combine_first(r_2d, l_2d)
<xarray.DataArray (x: 3, y: 3)>
array([[  1.,   1.,  nan],
       [  1.,   0.,   0.],
       [ nan,   0.,   0.]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
  * y        (y) int64 -2 0 2
>>> combine_first(l_2d, ar_1d)
<xarray.DataArray (x: 3, y: 2)>
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 6.,  6.]])
Coordinates:
  * x        (x) object 'a' 'b' 'd'
  * y        (y) int64 -2 0
>>> combine_first(ar_1d, l_2d)
<xarray.DataArray (x: 3, y: 2)>
array([[ 4.,  4.],
       [ 5.,  5.],
       [ 6.,  6.]])
Coordinates:
  * x        (x) object 'a' 'b' 'd'
  * y        (y) int64 -2 0

I don't like the fact that I have to access .values, and the use of .where is slightly wonky. But this is definitely the cleanest working solution I have so far. Any suggestions for improvement are welcome.

@chunweiyuan
Contributor Author

Any suggestions for improvement?

@chunweiyuan
Contributor Author

Hmm, so what would be the expected behavior of ds.combine_first?

If I have

>>> ds0
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -1 0
Data variables:
    ds0      (x, y) int64 0 0 0 0
>>> ds1
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) |S1 'b' 'c'
  * y        (y) int64 0 1
Data variables:
    ds1      (x, y) int64 1 1 1 1

I get

>>> ds0.combine_first(ds1)
<xarray.Dataset>
Dimensions:  (x: 3, y: 3)
Coordinates:
  * x        (x) object 'a' 'b' 'c'
  * y        (y) int64 -1 0 1
Data variables:
    ds0      (x, y) float64 0.0 0.0 nan 0.0 0.0 nan nan nan nan
    ds1      (x, y) float64 nan nan nan nan 1.0 1.0 nan 1.0 1.0

and changing the order to ds1.combine_first(ds0) just flips the order of the data_vars, but the cell values of the data_vars remain the same.

This is done essentially by adding a _combine_first to ops.py that mimics _fillna, except join='outer'.

@shoyer
Member

shoyer commented Nov 13, 2016

What about simply adding a keyword argument join to fillna? Then you could write join='outer' to accomplish combine_first.
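
A minimal sketch of what that proposal amounts to (fillna_with_join is a hypothetical helper, not an existing method): pre-align with the requested join, then fall back to the ordinary fillna.

import xarray as xr

def fillna_with_join(a, b, join='left'):
    # join='left' reproduces plain fillna; join='outer' gives combine_first
    a_aligned, b_aligned = xr.align(a, b, join=join)
    return a_aligned.fillna(b_aligned)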

@chunweiyuan
Contributor Author

So these are my _fillna and _combine_first in ops.inject_binary_ops:

# _fillna: fill from `other` after a left join
f = _func_slash_method_wrapper(fillna)
method = cls._binary_op(f, join='left', fillna=True)
setattr(cls, '_fillna', method)

# _combine_first: identical, except for the outer join
f = _func_slash_method_wrapper(fillna)
method = cls._binary_op(f, join='outer', fillna=True)
setattr(cls, '_combine_first', method)

Within dataarray.py and dataset.py, combine_first(self, other) simply returns self._combine_first(other). This code path produces the test results you saw earlier.

Given this construct, I'm not sure how to do the refactor you mentioned. Perhaps a few more pointers in the right direction? :)

@chunweiyuan
Contributor Author

I suppose I can save one line by getting rid of the duplicate f = _func_slash_method_wrapper(fillna), but beyond that I'm not sure of the best way to refactor this.

If the behaviors check out, then this branch might be ready for a PR.

@chunweiyuan
Contributor Author

Currently, ds0.combine_first(ds1) gives exactly the same result as xr.merge([ds0, ds1]). But for data arrays it still offers something new.

Either
1.) my combine_first should be doing something different with datasets, or
2.) we don't need a combine_first for datasets, or
3.) xr.merge should be changed so that, when applied to data arrays, it creates a new data array with an outer join and fillna (by chaining combine_first down the list of data arrays).

@chunweiyuan
Contributor Author

Checking in on how to move forward from here...I feel it's pretty close to a PR...

@shoyer
Member

shoyer commented Nov 17, 2016

I would rather wait a little while here -- I think #964 will handle the logic we need to make adding a join argument to fillna very easy, which I would prefer to adding a new method.

merge works a little differently than combine_first -- it raises an exception if any overlapping values conflict.
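
A small illustration of that difference (hedged: it uses the combine_first method that was eventually added, plus xarray.MergeError, which merge raises on conflicting overlaps):

import xarray as xr

ds0 = xr.Dataset({'v': ('x', [0, 0])}, coords={'x': [1, 2]})
ds1 = xr.Dataset({'v': ('x', [9, 9])}, coords={'x': [2, 3]})

# merge refuses to pick a winner for the conflicting value at x=2
try:
    xr.merge([ds0, ds1])
except xr.MergeError as err:
    print('merge raised:', err)

# combine_first keeps the calling object's value at x=2
print(ds0.combine_first(ds1)['v'].values)   # [0. 0. 9.]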

@chunweiyuan
Contributor Author

So I took a quick look at the commits in #964. It's not entirely clear to me how one can easily add a join argument to fillna. Should I keep my current commits, submit a PR to master once #964 is merged in, and then see how it goes from there?

@shoyer
Member

shoyer commented Jan 6, 2017

I just merged #964.

If you're up for it, I would suggest rewriting fillna using apply_ufunc, instead of the current hack which relies on the injected _fillna method created with the _binary_op staticmethod (ugh).

Let me know if you need any guidance on using apply_ufunc or if the documentation is sufficient. In this case, at least, you would not need to use the confusing signature argument.
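
For reference, a minimal sketch of the apply_ufunc direction for DataArrays (fillna_outer is a made-up name, and this is not the final implementation; join='outer' gives the combine_first behavior this issue is after):

import numpy as np
import xarray as xr

def fillna_outer(a, b):
    # after an outer join the missing slots are NaN-filled, so a plain
    # element-wise np.where picks `a` where it has data, else `b`
    return xr.apply_ufunc(
        lambda x, y: np.where(np.isnan(x), y, x),
        a,
        b,
        join='outer',
    )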

@chunweiyuan
Contributor Author

OK, I'll give it a shot. I'll touch base when I run into roadblocks.

@chunweiyuan
Contributor Author

I'm curious: if you put a from .computation import apply_ufunc inside ops.py, would you not get a circular ImportError? It seems difficult to get out of that circle.

@shoyer
Member

shoyer commented Jan 10, 2017

@chunweiyuan import at the top of the method/function:

def fillna(self, other, ...):
    from .computation import apply_ufunc
    return apply_ufunc(...)

This is a pretty standard way to get around circular imports. There's basically no way to avoid this issue with APIs designed to have lots of logic in method calls. (Other than gigantic files, which isn't a real fix.)

@shoyer
Member

shoyer commented Jan 23, 2017

Fixed by #1204

shoyer closed this as completed Jan 23, 2017