
Add drop duplicates #5089

Closed
wants to merge 32 commits

Conversation

@ahuang11 (Contributor) commented Mar 29, 2021

Semi-related to #2795, but not really; I still want a separate unique function

  • Closes #xxxx
  • Tests added
  • Passes pre-commit run --all-files
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

[image in original PR description]

@pep8speaks commented Mar 29, 2021

Hello @ahuang11! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 7260:29: F821 undefined name '_get_func_args'
Line 7261:43: F821 undefined name '_initialize_curvefit_params'
Line 7326:1: E305 expected 2 blank lines after class or function definition, found 1

Comment last updated at 2021-05-01 03:07:57 UTC

@max-sixty (Collaborator) left a comment

Thanks for the PR @ahuang11 !

I think the method could be really useful. Does anyone else have thoughts?

One important decision is whether this should operate on dimensioned coords or all coords (or even any array?). My guess would be that we could start with dimensioned coords given those are the most likely use case, and we could extend to non-dimensioned coords later.

(here's a glossary as the terms can get confusing: http://xarray.pydata.org/en/stable/terminology.html)
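The core keep='first' operation along a single dimensioned coord can be sketched in plain numpy (a minimal illustration, not the PR's implementation — the variable names here are made up):

```python
import numpy as np

# A dimensioned coordinate with a duplicated label, and data along it.
lat = np.array([0, 1, 2, 2, 3])
data = np.array([10, 20, 30, 40, 50])

# keep='first' semantics: position of the first occurrence of each label.
_, keep_idx = np.unique(lat, return_index=True)
keep_idx.sort()  # restore the original order of first occurrences

lat_dedup = lat[keep_idx]    # array([0, 1, 2, 3])
data_dedup = data[keep_idx]  # array([10, 20, 30, 50])
```

The same positions could then drive a single indexing call (e.g. isel) on every variable sharing that dimension.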

Review comments on xarray/core/dataarray.py and xarray/core/dataset.py (outdated, resolved)
@ahuang11 (Contributor, Author) commented Mar 30, 2021

> Thanks for the PR @ahuang11 !
>
> I think the method could be really useful. Does anyone else have thoughts?
>
> One important decision is whether this should operate on dimensioned coords or all coords (or even any array?). My guess would be that we could start with dimensioned coords given those are the most likely use case, and we could extend to non-dimensioned coords later.
>
> (here's a glossary as the terms can get confusing: http://xarray.pydata.org/en/stable/terminology.html)

Let's start with just dims for now.

Okay, since I had some time, I decided to do coords too.

@ahuang11 (Contributor, Author) commented:

Not sure how to fix this:


xarray/core/dataset.py:7111: error: Keywords must be strings
Found 1 error in 1 file (checked 138 source files)

@mathause (Collaborator) left a comment

I think this could be useful.

  • Is the name of the method clear or should it be made more explicit, e.g. drop_duplicates_dims?
  • Should it be dims=... for all dimensions to allow dims=None for no dimensions once we also want to support coords=? Or is that in the YAGNI category?

(I think it's probably fine as is.)
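The `dims=...` convention mentioned above can be sketched with a hypothetical helper (the name `normalize_dims` is made up for illustration): using Ellipsis as the "all dimensions" sentinel leaves `None` free to later mean "no dimensions" once `coords=` is also supported.

```python
def normalize_dims(dims, all_dims):
    # Hypothetical sketch of the dims=... convention:
    # `...` (Ellipsis) means "all dimensions", which leaves None
    # available to mean "no dimensions" in a future coords= API.
    if dims is ...:
        return list(all_dims)
    if dims is None:
        return []
    if isinstance(dims, str):  # a single dimension name
        return [dims]
    return list(dims)

normalize_dims(..., ("lat", "lon"))   # ['lat', 'lon']
normalize_dims(None, ("lat", "lon"))  # []
normalize_dims("lat", ("lat", "lon")) # ['lat']
```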

Review comments on xarray/core/dataset.py and xarray/core/dataarray.py (outdated, resolved)
"""
if dims is None:
dims = list(self.coords)
elif isinstance(dims, str) or not isinstance(dims, Iterable):
Collaborator comment:

You could in principle use elif isinstance(dims, Hashable): but I would leave it as is (we should once discuss what we do about da.mean(("x", "y")) as ("x", "y") is Hashable)
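To illustrate the point: tuples are Hashable, so `isinstance(dims, Hashable)` cannot distinguish a single tuple-valued dimension name from a tuple of dimension names, and strings are Iterable, which is why the str check has to come first in the elif above.

```python
from collections.abc import Hashable, Iterable

print(isinstance(("x", "y"), Hashable))  # True: could be one tuple-named dim...
print(isinstance(("x", "y"), Iterable))  # True: ...or an iterable of two dims
print(isinstance("x", Iterable))         # True: strings iterate over characters
```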

Collaborator comment:

Let's use utils.is_scalar?

@ahuang11 changed the title from "Add drop duplicates; wip need to fix tests" to "Add drop duplicates" on Mar 31, 2021
Dataset
"""
if dims is None:
dims = list(self.coords)
Collaborator comment:

Suggested change:
- dims = list(self.coords)
+ dims = list(self.dims)

...I think?

And we should add a test for this please — an array with a non-dimensioned coord

@max-sixty (Collaborator) commented:

@pydata/xarray we didn't get to this on the call today — two questions from @mathause :

  • should we have dims=None default to all dims? Or are we gradually transitioning to dims=... for all dims?
  • Is drop_duplicates a good name? Or should it explicitly refer to dropping duplicates on the index?

@max-sixty (Collaborator) commented:

If we don't hear anything, let's add this to the top of the list for the next dev call in ten days

@shoyer (Member) commented Apr 5, 2021

From an API perspective, I think the name drop_duplicates() would be fine. I would guess that handling arbitrary variables in a Dataset would not be any harder than handling only coordinates?

One thing that is a little puzzling to me is how deduplicating across multiple dimensions is handled. It looks like this function preserves existing dimensions, but inserts NA if the arrays would be ragged? This seems a little strange to me. I think it could make more sense to "flatten" all dimensions in the contained variables into a new dimension when dropping duplicates.

This would require specifying the name for the new dimension(s), but perhaps that could work by switching to the de-duplicated variable name? For example, ds.drop_duplicates('valid') on the example in the PR description would result in a "valid" coordinate/dimension of length 3. The original 'init' and 'tau' dimensions could be preserved as coordinates, e.g.,

    import numpy as np
    import xarray as xr

    ds = xr.DataArray(
        [[1, 2, 3], [4, 5, 6]],
        coords={"init": [0, 1], "tau": [1, 2, 3]},
        dims=["init", "tau"],
    ).to_dataset(name="test")
    ds.coords["valid"] = (("init", "tau"), np.array([[8, 6, 6], [7, 7, 7]]))
    result = ds.drop_duplicates('valid')

would result in:

>>> result
<xarray.Dataset>
Dimensions:  (valid: 3)
Coordinates:
    init     (valid) int64 0 0 1
    tau      (valid) int64 1 2 1
  * valid    (valid) int64 8 6 7
Data variables:
    test     (valid) int64 1 2 4

i.e., the exact same thing that would be obtained by indexing with the positions of the de-duplicated values: ds.isel(init=('valid', [0, 0, 1]), tau=('valid', [0, 1, 0])).

@ahuang11 (Contributor, Author) commented Apr 5, 2021 via email

@ahuang11 (Contributor, Author) commented Apr 5, 2021

Oh I just saw the edits with keeping the dims. I guess that would work.

@shoyer mentioned this pull request Apr 5, 2021
@ahuang11 (Contributor, Author) commented Apr 6, 2021

Not sure if there's a more elegant way of implementing this.

@max-sixty (Collaborator) commented:

Hi @ahuang11 — forgive the delay. We discussed this with the team on our call and think it would be a welcome addition, so thank you for contributing.

I took another look through the tests and the behavior looks ideal when dimensioned coords are passed:

In [6]: da
Out[6]:
<xarray.DataArray (lat: 5, lon: 5)>
array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12],
       [ 0,  4,  8, 12, 16]])
Coordinates:
  * lat      (lat) int64 0 1 2 2 3
  * lon      (lon) int64 0 1 3 3 4

In [7]: result = da.drop_duplicate_coords(["lat", "lon"], keep='first')

In [8]: result
Out[8]:
<xarray.DataArray (lat: 4, lon: 4)>
array([[ 0,  0,  0,  0],
       [ 0,  1,  2,  4],
       [ 0,  2,  4,  8],
       [ 0,  4,  8, 16]])
Coordinates:
  * lat      (lat) int64 0 1 2 3
  * lon      (lon) int64 0 1 3 4

And I think this is also the best we can do for non-dimensioned coords. Two things I'd call out:
a. The array is stacked for any non-dim coord with more than one dimension
b. The supplied coord becomes the new dimensioned coord

e.g. Stacking:

In [12]: da
Out[12]:
<xarray.DataArray (init: 2, tau: 3)>
array([[1, 2, 3],
       [4, 5, 6]])
Coordinates:
  * init     (init) int64 0 1
  * tau      (tau) int64 1 2 3
    valid    (init, tau) int64 8 6 6 7 7 7

In [13]: da.drop_duplicate_coords("valid")
Out[13]:
<xarray.DataArray (valid: 3)>
array([1, 2, 4])
Coordinates:
  * valid    (valid) int64 8 6 7
    init     (valid) int64 0 0 1
    tau      (valid) int64 1 2 1

Changing the dimensions: zeta becomes the new dimensioned coord, replacing tau:

In [16]: (
    ...:     da
    ...:     .assign_coords(dict(zeta=(('tau'),[4,4,6])))
    ...:     .drop_duplicate_coords('zeta')
    ...:     )
Out[16]:
<xarray.DataArray (init: 2, zeta: 2)>
array([[1, 3],
       [4, 6]])
Coordinates:
  * init     (init) int64 0 1
    valid    (init, zeta) int64 8 6 7 7
  * zeta     (zeta) int64 4 6
    tau      (zeta) int64 1 3

One peculiarity — though I think a necessary one — is that the order matters in some cases:

In [17]: (
    ...:     da
    ...:     .assign_coords(dict(zeta=(('tau'),[4,4,6])))
    ...:     .drop_duplicate_coords(['zeta','valid'])
    ...:     )
Out[17]:
<xarray.DataArray (valid: 3)>
array([1, 3, 4])
Coordinates:
  * valid    (valid) int64 8 6 7
    tau      (valid) int64 1 3 1
    init     (valid) int64 0 0 1
    zeta     (valid) int64 4 6 4

In [18]: (
    ...:     da
    ...:     .assign_coords(dict(zeta=(('tau'),[4,4,6])))
    ...:     .drop_duplicate_coords(['valid','zeta'])
    ...:     )
Out[18]:
<xarray.DataArray (zeta: 1)>
array([1])
Coordinates:
  * zeta     (zeta) int64 4
    init     (zeta) int64 0
    tau      (zeta) int64 1
    valid    (zeta) int64 8

Unless anyone has any more thoughts, let's plan to merge this over the next few days. Thanks again @ahuang11 !

@shoyer (Member) commented Apr 18, 2021

This looks great, but I wonder if we could simplify the implementation? For example, could we get away with only doing a single isel() for selecting the positions corresponding to unique values, rather than the current loop? .stack() can also be expensive relative to indexing.

This might require using a different routine to find the unique positions than the current approach of calling duplicated() on a pandas.Index. I think we could construct the necessary indices even for multi-dimensional arrays using np.unique with return_index=True and np.unravel_index.
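Applied to the `valid` coordinate from the example earlier in the thread, that routine can be sketched as follows (a plain-numpy illustration, not the PR's code); it reproduces the positions used in the single-isel equivalence, `init=[0, 0, 1]` and `tau=[0, 1, 0]`:

```python
import numpy as np

valid = np.array([[8, 6, 6], [7, 7, 7]])

# Flat indices of the first occurrence of each unique value...
_, flat_idx = np.unique(valid, return_index=True)
flat_idx.sort()  # ...restored to first-occurrence order: 8, 6, 7

# Convert flat positions back to per-dimension indexers for a single isel().
init_idx, tau_idx = np.unravel_index(flat_idx, valid.shape)
init_idx.tolist()                  # [0, 0, 1]
tau_idx.tolist()                   # [0, 1, 0]
valid[init_idx, tau_idx].tolist()  # [8, 6, 7]
```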

@max-sixty (Collaborator) commented:

@ahuang11 IIUC, this is only using .stack where it needs to actually stack the array, is that correct? So if a list of dims is passed (rather than non-dim coords), then it's not stacking.

I agree with @shoyer that we could do it in a single isel in the basic case. One option is to have a fast path for non-dim coords only, and call isel once with those.

@shoyer (Member) commented Apr 19, 2021

@max-sixty is there a case where you don't think we could do a single isel? I'd love to do the single isel() call if possible, because that should have the best performance by far.

I guess this may come down to the desired behavior for multiple arguments, e.g., drop_duplicates(['lat', 'lon'])? I'm not certain that this case is well defined in this PR (it certainly needs more tests!).

I think we could make this work via the axis argument to np.unique, although the lack of support for object arrays could be problematic for us, since we put strings in object arrays.
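The axis-based approach for joint de-duplication over several coords can be sketched with numpy alone (this works for numeric dtypes; as noted above, np.unique with axis= does not support object arrays, so string coords stored as object dtype would be a problem):

```python
import numpy as np

# Joint de-duplication over (lat, lon) pairs via np.unique's axis argument.
pairs = np.array([[0, 0], [1, 1], [1, 1], [2, 3]])
unique_rows, first_idx = np.unique(pairs, axis=0, return_index=True)
unique_rows.tolist()  # [[0, 0], [1, 1], [2, 3]]
first_idx.tolist()    # [0, 1, 3] — positions of the first occurrence of each row
```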

@ahuang11 (Contributor, Author) commented:

> @ahuang11 IIUC, this is only using .stack where it needs to actually stack the array, is that correct? So if a list of dims is passed (rather than non-dim coords), then it's not stacking.
>
> I agree with @shoyer that we could do it in a single isel in the basic case. One option is to have a fast path for non-dim coords only, and call isel once with those.

Yes correct. I am not feeling well at the moment so I probably won't get to this today, but feel free to make commits!

@shoyer (Member) commented Apr 19, 2021

> > I agree with @shoyer that we could do it in a single isel in the basic case. One option is to have a fast path for non-dim coords only, and call isel once with those.
>
> Yes correct. I am not feeling well at the moment so I probably won't get to this today, but feel free to make commits!

I hope you feel better soon! There is no time pressure from our end on this.

@max-sixty (Collaborator) commented:

> @max-sixty is there a case where you don't think we could do a single isel? I'd love to do the single isel() call if possible, because that should have the best performance by far.

IIUC there are two broad cases here

  • where every supplied coord is a dimensioned coord — it's v simple, just isel non-duplicates for each dimension*
  • where there's a non-dimensioned coord with ndim > 1, then it requires stacking; e.g. the example above. Is there a different way of doing this?
In [12]: da
Out[12]:
<xarray.DataArray (init: 2, tau: 3)>
array([[1, 2, 3],
       [4, 5, 6]])
Coordinates:
  * init     (init) int64 0 1
  * tau      (tau) int64 1 2 3
    valid    (init, tau) int64 8 6 6 7 7 7

In [13]: da.drop_duplicate_coords("valid")
Out[13]:
<xarray.DataArray (valid: 3)>
array([1, 2, 4])
Coordinates:
  * valid    (valid) int64 8 6 7
    init     (valid) int64 0 0 1
    tau      (valid) int64 1 2 1

* very close to this is a 1D non-dimensioned coord, in which case we can either turn it into a dimensioned coord or retain the existing dimensioned coords — I think probably the former if we allow the stacking case, for the sake of consistency.

@shoyer (Member) commented Apr 22, 2021

A couple thoughts on strategy here:

  1. Let's consider starting with a minimal set of functionality (e.g., only drop duplicates in a single variable and/or along only one dimension). This is easier to merge and provides a good foundation for implementing the remaining features in follow-on PRs.
  2. It might be useful to start from the foundation of implementing multi-dimensional indexing with a boolean array (Boolean indexing with multi-dimensional key arrays #1887). Then drop_duplicates() (and also unique()) could just be a layer on top of that, passing in a boolean index of "non-duplicate" entries.
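The boolean-index layering in point 2 can be sketched in numpy (the helper name `first_occurrence_mask` is made up for illustration; the multi-dimensional boolean indexing layer from #1887 itself is not implemented here):

```python
import numpy as np

def first_occurrence_mask(values):
    """Boolean mask: True at the first occurrence of each value
    (in flattened, row-major order), False at later duplicates."""
    flat = np.asarray(values).ravel()
    _, first_idx = np.unique(flat, return_index=True)
    mask = np.zeros(flat.size, dtype=bool)
    mask[first_idx] = True  # mark only the first occurrences
    return mask.reshape(np.shape(values))

mask = first_occurrence_mask([[8, 6, 6], [7, 7, 7]])
mask.tolist()  # [[True, True, False], [True, False, False]]
```

drop_duplicates() (and unique()) would then just pass such a "non-duplicate" mask to the boolean indexing layer.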

@max-sixty (Collaborator) commented:

This is great work and it would be good to get this in for the upcoming release #5232.

I think there are two paths:

  1. Narrow: merge the functionality which works along 1D dimensioned coords
  2. Full: Ensure we're at consensus on how we handle >1D coords

I would mildly vote for narrow. While I would also vote to merge it as-is, I think it's not a huge task to move the full version onto a new branch.

@ahuang11 what are your thoughts?

@ahuang11 (Contributor, Author) commented Apr 30, 2021

I can take a look this weekend. If narrow, I could simply roll back to commit 28aa96a, make minor adjustments, and merge.

But I personally prefer full so it'd be nice if we could come to a consensus on how to handle it~

ahuang11 added 12 commits April 30, 2021 21:58
This reverts commit f9ee3fe.
This reverts commit cc94bbe, reversing
changes made to daa6e42.

Conflicts:
	xarray/core/dataset.py
This reverts commit a1ce19d.
This reverts commit 8c27afb.
This reverts commit e307041.
This reverts commit 1698990.
This reverts commit 344a7d8.
This reverts commit f7dcdd4.
@ahuang11 force-pushed the drop_duplicates branch from 663d0c9 to a77f78d on May 1, 2021 03:07
@ahuang11 mentioned this pull request May 1, 2021
@ahuang11 (Contributor, Author) commented May 1, 2021

I failed to commit properly, so see #5239, where I only implement drop duplicates for dims.

@ahuang11 closed this May 1, 2021