
Add drop duplicates #5089

Closed
wants to merge 32 commits

Conversation

@ahuang11 (Contributor) commented Mar 29, 2021

Semi-related to #2795, but not really; I still want a separate unique function

  • Closes #xxxx
  • Tests added
  • Passes pre-commit run --all-files
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

[image in original PR description]

@pep8speaks commented Mar 29, 2021

Hello @ahuang11! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 7260:29: F821 undefined name '_get_func_args'
Line 7261:43: F821 undefined name '_initialize_curvefit_params'
Line 7326:1: E305 expected 2 blank lines after class or function definition, found 1

Comment last updated at 2021-05-01 03:07:57 UTC

@max-sixty (Collaborator) left a comment

Thanks for the PR @ahuang11 !

I think the method could be really useful. Does anyone else have thoughts?

One important decision is whether this should operate on dimensioned coords or all coords (or even any array?). My guess would be that we could start with dimensioned coords given those are the most likely use case, and we could extend to non-dimensioned coords later.

(here's a glossary as the terms can get confusing: http://xarray.pydata.org/en/stable/terminology.html)
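The core keep='first' operation along a single dimensioned coord can be sketched in plain numpy (a minimal illustration, not the PR's implementation — the variable names here are made up):

```python
import numpy as np

# A dimensioned coordinate with a duplicated label, and data along it.
lat = np.array([0, 1, 2, 2, 3])
data = np.array([10, 20, 30, 40, 50])

# keep='first' semantics: position of the first occurrence of each label.
_, keep_idx = np.unique(lat, return_index=True)
keep_idx.sort()  # restore the original order of first occurrences

lat_dedup = lat[keep_idx]    # array([0, 1, 2, 3])
data_dedup = data[keep_idx]  # array([10, 20, 30, 50])
```

The same positions could then drive a single indexing call (e.g. isel) on every variable sharing that dimension.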

Review comments on xarray/core/dataarray.py and xarray/core/dataset.py (outdated, resolved)
@ahuang11 (Contributor, Author) commented Mar 30, 2021

> Thanks for the PR @ahuang11 !
>
> I think the method could be really useful. Does anyone else have thoughts?
>
> One important decision is whether this should operate on dimensioned coords or all coords (or even any array?). My guess would be that we could start with dimensioned coords given those are the most likely use case, and we could extend to non-dimensioned coords later.
>
> (here's a glossary as the terms can get confusing: http://xarray.pydata.org/en/stable/terminology.html)

Let's start with just dims for now.

Okay, since I had some time, I decided to do coords too.

@ahuang11 (Contributor, Author) commented:

Not sure how to fix this:


xarray/core/dataset.py:7111: error: Keywords must be strings
Found 1 error in 1 file (checked 138 source files)

@mathause (Collaborator) left a comment

I think this could be useful.

  • Is the name of the method clear or should it be made more explicit, e.g. drop_duplicates_dims?
  • Should it be dims=... for all dimensions to allow dims=None for no dimensions once we also want to support coords=? Or is that in the YAGNI category?

(I think it's probably fine as is.)
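The `dims=...` convention mentioned above can be sketched with a hypothetical helper (the name `normalize_dims` is made up for illustration): using Ellipsis as the "all dimensions" sentinel leaves `None` free to later mean "no dimensions" once `coords=` is also supported.

```python
def normalize_dims(dims, all_dims):
    # Hypothetical sketch of the dims=... convention:
    # `...` (Ellipsis) means "all dimensions", which leaves None
    # available to mean "no dimensions" in a future coords= API.
    if dims is ...:
        return list(all_dims)
    if dims is None:
        return []
    if isinstance(dims, str):  # a single dimension name
        return [dims]
    return list(dims)

normalize_dims(..., ("lat", "lon"))   # ['lat', 'lon']
normalize_dims(None, ("lat", "lon"))  # []
normalize_dims("lat", ("lat", "lon")) # ['lat']
```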

Review comments on xarray/core/dataset.py and xarray/core/dataarray.py (outdated, resolved)
"""
if dims is None:
dims = list(self.coords)
elif isinstance(dims, str) or not isinstance(dims, Iterable):
Collaborator comment:

You could in principle use elif isinstance(dims, Hashable): but I would leave it as is (we should once discuss what we do about da.mean(("x", "y")) as ("x", "y") is Hashable)
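To illustrate the point: tuples are Hashable, so `isinstance(dims, Hashable)` cannot distinguish a single tuple-valued dimension name from a tuple of dimension names, and strings are Iterable, which is why the str check has to come first in the elif above.

```python
from collections.abc import Hashable, Iterable

print(isinstance(("x", "y"), Hashable))  # True: could be one tuple-named dim...
print(isinstance(("x", "y"), Iterable))  # True: ...or an iterable of two dims
print(isinstance("x", Iterable))         # True: strings iterate over characters
```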

Collaborator comment:

Let's use utils.is_scalar?

@ahuang11 changed the title from "Add drop duplicates; wip need to fix tests" to "Add drop duplicates" on Mar 31, 2021
Dataset
"""
if dims is None:
dims = list(self.coords)
Collaborator comment:

Suggested change:
- dims = list(self.coords)
+ dims = list(self.dims)

...I think?

And we should add a test for this please — an array with a non-dimensioned coord

@max-sixty (Collaborator) commented:

@pydata/xarray we didn't get to this on the call today — two questions from @mathause :

  • should we have dims=None default to all dims? Or are we gradually transitioning to dims=... for all dims?
  • Is drop_duplicates a good name? Or should it explicitly refer to dropping duplicates on the index?

@max-sixty (Collaborator) commented:

If we don't hear anything, let's add this to the top of the list for the next dev call in ten days

@shoyer (Member) commented Apr 5, 2021

From an API perspective, I think the name drop_duplicates() would be fine. I would guess that handling arbitrary variables in a Dataset would not be any harder than handling only coordinates?

One thing that is a little puzzling to me is how deduplicating across multiple dimensions is handled. It looks like this function preserves existing dimensions, but inserts NA if the arrays would be ragged? This seems a little strange to me. I think it could make more sense to "flatten" all dimensions in the contained variables into a new dimension when dropping duplicates.

This would require specifying the name for the new dimension(s), but perhaps that could work by switching to the de-duplicated variable name? For example, ds.drop_duplicates('valid') on the example in the PR description would result in a "valid" coordinate/dimension of length 3. The original 'init' and 'tau' dimensions could be preserved as coordinates, e.g.,

    import numpy as np
    import xarray as xr

    ds = xr.DataArray(
        [[1, 2, 3], [4, 5, 6]],
        coords={"init": [0, 1], "tau": [1, 2, 3]},
        dims=["init", "tau"],
    ).to_dataset(name="test")
    ds.coords["valid"] = (("init", "tau"), np.array([[8, 6, 6], [7, 7, 7]]))
    result = ds.drop_duplicates('valid')

would result in:

>>> result
<xarray.Dataset>
Dimensions:  (valid: 3)
Coordinates:
    init     (valid) int64 0 0 1
    tau      (valid) int64 1 2 1
  * valid    (valid) int64 8 6 7
Data variables:
    test     (valid) int64 1 2 4

i.e., the exact same thing that would be obtained by indexing with the positions of the de-duplicated values: ds.isel(init=('valid', [0, 0, 1]), tau=('valid', [0, 1, 0])).

@ahuang11 (Contributor, Author) commented Apr 5, 2021 via email

@ahuang11 (Contributor, Author) commented Apr 5, 2021

Oh I just saw the edits with keeping the dims. I guess that would work.

@shoyer mentioned this pull request Apr 5, 2021
@ahuang11 (Contributor, Author) commented Apr 6, 2021

Not sure if there's a more elegant way of implementing this.

@max-sixty (Collaborator) commented:

Hi @ahuang11 — forgive the delay. We discussed this with the team on our call and think it would be a welcome addition, so thank you for contributing.

I took another look through the tests and the behavior looks ideal when dimensioned coords are passed:

In [6]: da
Out[6]:
<xarray.DataArray (lat: 5, lon: 5)>
array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12],
       [ 0,  4,  8, 12, 16]])
Coordinates:
  * lat      (lat) int64 0 1 2 2 3
  * lon      (lon) int64 0 1 3 3 4

In [7]: result = da.drop_duplicate_coords(["lat", "lon"], keep='first')

In [8]: result
Out[8]:
<xarray.DataArray (lat: 4, lon: 4)>
array([[ 0,  0,  0,  0],
       [ 0,  1,  2,  4],
       [ 0,  2,  4,  8],
       [ 0,  4,  8, 16]])
Coordinates:
  * lat      (lat) int64 0 1 2 3
  * lon      (lon) int64 0 1 3 4

And I think this is also the best we can do for non-dimensioned coords. Two things I'd call out:
a. The array is stacked for any non-dim coord with more than one dimension
b. The supplied coord becomes the new dimensioned coord

e.g. Stacking:

In [12]: da
Out[12]:
<xarray.DataArray (init: 2, tau: 3)>
array([[1, 2, 3],
       [4, 5, 6]])
Coordinates:
  * init     (init) int64 0 1
  * tau      (tau) int64 1 2 3
    valid    (init, tau) int64 8 6 6 7 7 7

In [13]: da.drop_duplicate_coords("valid")
Out[13]:
<xarray.DataArray (valid: 3)>
array([1, 2, 4])
Coordinates:
  * valid    (valid) int64 8 6 7
    init     (valid) int64 0 0 1
    tau      (valid) int64 1 2 1

Changing the dimensions: zeta becomes the new dimensioned coord, replacing tau:

In [16]: (
    ...:     da
    ...:     .assign_coords(dict(zeta=(('tau'),[4,4,6])))
    ...:     .drop_duplicate_coords('zeta')
    ...:     )
Out[16]:
<xarray.DataArray (init: 2, zeta: 2)>
array([[1, 3],
       [4, 6]])
Coordinates:
  * init     (init) int64 0 1
    valid    (init, zeta) int64 8 6 7 7
  * zeta     (zeta) int64 4 6
    tau      (zeta) int64 1 3

One peculiarity — though I think a necessary one — is that the order matters in some cases:

In [17]: (
    ...:     da
    ...:     .assign_coords(dict(zeta=(('tau'),[4,4,6])))
    ...:     .drop_duplicate_coords(['zeta','valid'])
    ...:     )
Out[17]:
<xarray.DataArray (valid: 3)>
array([1, 3, 4])
Coordinates:
  * valid    (valid) int64 8 6 7
    tau      (valid) int64 1 3 1
    init     (valid) int64 0 0 1
    zeta     (valid) int64 4 6 4

In [18]: (
    ...:     da
    ...:     .assign_coords(dict(zeta=(('tau'),[4,4,6])))
    ...:     .drop_duplicate_coords(['valid','zeta'])
    ...:     )
Out[18]:
<xarray.DataArray (zeta: 1)>
array([1])
Coordinates:
  * zeta     (zeta) int64 4
    init     (zeta) int64 0
    tau      (zeta) int64 1
    valid    (zeta) int64 8

Unless anyone has any more thoughts, let's plan to merge this over the next few days. Thanks again @ahuang11 !

@shoyer (Member) commented Apr 18, 2021

This looks great, but I wonder if we could simplify the implementation? For example, could we get away with only doing a single isel() for selecting the positions corresponding to unique values, rather than the current loop? .stack() can also be expensive relative to indexing.

This might require using a different routine to find the unique positions than the current approach of calling duplicated() on a pandas.Index. I think we could construct the necessary indices even for multi-dimensional arrays using np.unique with return_index=True and np.unravel_index.
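Applied to the `valid` coordinate from the example earlier in the thread, that routine can be sketched as follows (a plain-numpy illustration, not the PR's code); it reproduces the positions used in the single-isel equivalence, `init=[0, 0, 1]` and `tau=[0, 1, 0]`:

```python
import numpy as np

valid = np.array([[8, 6, 6], [7, 7, 7]])

# Flat indices of the first occurrence of each unique value...
_, flat_idx = np.unique(valid, return_index=True)
flat_idx.sort()  # ...restored to first-occurrence order: 8, 6, 7

# Convert flat positions back to per-dimension indexers for a single isel().
init_idx, tau_idx = np.unravel_index(flat_idx, valid.shape)
init_idx.tolist()                  # [0, 0, 1]
tau_idx.tolist()                   # [0, 1, 0]
valid[init_idx, tau_idx].tolist()  # [8, 6, 7]
```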

@max-sixty (Collaborator) commented:

@ahuang11 IIUC, this is only using .stack where it needs to actually stack the array, is that correct? So if a list of dims is passed (rather than non-dim coords), then it's not stacking.

I agree with @shoyer that we could do it in a single isel in the basic case. One option is to have a fast path for non-dim coords only, and call isel once with those.

@shoyer (Member) commented Apr 19, 2021

@max-sixty is there a case where you don't think we could do a single isel? I'd love to do the single isel() call if possible, because that should have the best performance by far.

I guess this may come down to the desired behavior for multiple arguments, e.g., drop_duplicates(['lat', 'lon'])? I'm not certain that this case is well defined in this PR (it certainly needs more tests!).

I think we could make this work via the axis argument to np.unique, although the lack of support for object arrays could be problematic for us, since we put strings in object arrays.
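The axis-based approach for joint de-duplication over several coords can be sketched with numpy alone (this works for numeric dtypes; as noted above, np.unique with axis= does not support object arrays, so string coords stored as object dtype would be a problem):

```python
import numpy as np

# Joint de-duplication over (lat, lon) pairs via np.unique's axis argument.
pairs = np.array([[0, 0], [1, 1], [1, 1], [2, 3]])
unique_rows, first_idx = np.unique(pairs, axis=0, return_index=True)
unique_rows.tolist()  # [[0, 0], [1, 1], [2, 3]]
first_idx.tolist()    # [0, 1, 3] — positions of the first occurrence of each row
```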

@ahuang11 (Contributor, Author) commented:

> @ahuang11 IIUC, this is only using .stack where it needs to actually stack the array, is that correct? So if a list of dims is passed (rather than non-dim coords), then it's not stacking.
>
> I agree with @shoyer that we could do it in a single isel in the basic case. One option is to have a fast path for non-dim coords only, and call isel once with those.

Yes correct. I am not feeling well at the moment so I probably won't get to this today, but feel free to make commits!

@shoyer (Member) commented Apr 19, 2021

> > I agree with @shoyer that we could do it in a single isel in the basic case. One option is to have a fast path for non-dim coords only, and call isel once with those.
>
> Yes correct. I am not feeling well at the moment so I probably won't get to this today, but feel free to make commits!

I hope you feel better soon! There is no time pressure from our end on this.

@max-sixty (Collaborator) commented:

> @max-sixty is there a case where you don't think we could do a single isel? I'd love to do the single isel() call if possible, because that should have the best performance by far.

IIUC there are two broad cases here

  • where every supplied coord is a dimensioned coord — it's v simple, just isel non-duplicates for each dimension*
  • where there's a non-dimensioned coord with ndim > 1, then it requires stacking; e.g. the example above. Is there a different way of doing this?
In [12]: da
Out[12]:
<xarray.DataArray (init: 2, tau: 3)>
array([[1, 2, 3],
       [4, 5, 6]])
Coordinates:
  * init     (init) int64 0 1
  * tau      (tau) int64 1 2 3
    valid    (init, tau) int64 8 6 6 7 7 7

In [13]: da.drop_duplicate_coords("valid")
Out[13]:
<xarray.DataArray (valid: 3)>
array([1, 2, 4])
Coordinates:
  * valid    (valid) int64 8 6 7
    init     (valid) int64 0 0 1
    tau      (valid) int64 1 2 1

* very close to this is a 1D non-dimensioned coord, in which case we can either turn it into a dimensioned coord or retain the existing dimensioned coords — I think probably the former if we allow the stacking case, for the sake of consistency.

@shoyer (Member) commented Apr 22, 2021

A couple thoughts on strategy here:

  1. Let's consider starting with a minimal set of functionality (e.g., only drop duplicates in a single variable and/or along only one dimension). This is easier to merge and provides a good foundation for implementing the remaining features in follow-on PRs.
  2. It might be useful to start from the foundation of implementing multi-dimensional indexing with a boolean array (Boolean indexing with multi-dimensional key arrays #1887). Then drop_duplicates() (and also unique()) could just be a layer on top of that, passing in a boolean index of "non-duplicate" entries.
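The boolean-index layering in point 2 can be sketched in numpy (the helper name `first_occurrence_mask` is made up for illustration; the multi-dimensional boolean indexing layer from #1887 itself is not implemented here):

```python
import numpy as np

def first_occurrence_mask(values):
    """Boolean mask: True at the first occurrence of each value
    (in flattened, row-major order), False at later duplicates."""
    flat = np.asarray(values).ravel()
    _, first_idx = np.unique(flat, return_index=True)
    mask = np.zeros(flat.size, dtype=bool)
    mask[first_idx] = True  # mark only the first occurrences
    return mask.reshape(np.shape(values))

mask = first_occurrence_mask([[8, 6, 6], [7, 7, 7]])
mask.tolist()  # [[True, True, False], [True, False, False]]
```

drop_duplicates() (and unique()) would then just pass such a "non-duplicate" mask to the boolean indexing layer.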

@max-sixty (Collaborator) commented:

This is great work and it would be good to get this in for the upcoming release #5232.

I think there are two paths:

  1. Narrow: merge the functionality which works along 1D dimensioned coords
  2. Full: Ensure we're at consensus on how we handle >1D coords

I would mildly vote for narrow. While I would also vote to merge it as-is, I think it's not a huge task to move the full version onto a new branch.

@ahuang11 what are your thoughts?

@ahuang11 (Contributor, Author) commented Apr 30, 2021

I can take a look this weekend. If narrow, I could simply roll back to commit 28aa96a, make minor adjustments, and merge.

But I personally prefer full so it'd be nice if we could come to a consensus on how to handle it~

ahuang11 added 12 commits April 30, 2021 21:58
This reverts commit f9ee3fe.
This reverts commit cc94bbe, reversing
changes made to daa6e42.

Conflicts:
	xarray/core/dataset.py
This reverts commit a1ce19d.
This reverts commit 8c27afb.
This reverts commit e307041.
This reverts commit 1698990.
This reverts commit 344a7d8.
This reverts commit f7dcdd4.
@ahuang11 force-pushed the drop_duplicates branch from 663d0c9 to a77f78d on May 1, 2021 03:07
@ahuang11 mentioned this pull request May 1, 2021
@ahuang11 (Contributor, Author) commented May 1, 2021

I failed to commit properly, so see #5239, where I only implement drop duplicates for dims.

@ahuang11 closed this May 1, 2021