Add drop duplicates #5089
Conversation
Hello @ahuang11! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2021-05-01 03:07:57 UTC
Thanks for the PR @ahuang11 !
I think the method could be really useful. Does anyone else have thoughts?
One important decision is whether this should operate on dimensioned coords or all coords (or even any array?). My guess would be that we could start with dimensioned coords, given those are the most likely use case, and we could extend to non-dimensioned coords later.
(here's a glossary as the terms can get confusing: http://xarray.pydata.org/en/stable/terminology.html)
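For readers following along, a minimal sketch of that distinction (the coord names here are illustrative, not from the PR):

import xarray as xr

# "x" shares its name with a dimension, so it is a dimensioned
# (dimension) coordinate; "label" is a non-dimensioned coordinate
# that also lies along "x".
da = xr.DataArray(
    [10, 20, 30],
    coords={"x": [0, 1, 2]},
    dims="x",
).assign_coords(label=("x", ["a", "b", "b"]))

print(da.coords)  # shows both the "x" index coord and the "label" coord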
Okay, since I had some time, I decided to do coords too.
Not sure how to fix this:
I think this could be useful.
- Is the name of the method clear, or should it be made more explicit, e.g. drop_duplicates_dims?
- Should it be dims=... for all dimensions, to allow dims=None for no dimensions once we also want to support coords=? Or is that in the YAGNI category?

(I think it's probably fine as is.)
xarray/core/dataset.py
Outdated
""" | ||
if dims is None: | ||
dims = list(self.coords) | ||
elif isinstance(dims, str) or not isinstance(dims, Iterable): |
You could in principle use elif isinstance(dims, Hashable): but I would leave it as is (we should at some point discuss what we do about da.mean(("x", "y")), as ("x", "y") is Hashable).
Let's use utils.is_scalar?
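A quick illustration of the pitfall being discussed (plain Python, not code from the PR):

from collections.abc import Hashable, Iterable

dims = ("x", "y")

# A tuple of strings is itself Hashable, so an isinstance(dims, Hashable)
# check would treat ("x", "y") as a single dimension name:
print(isinstance(dims, Hashable))  # True

# The str-or-not-Iterable check used in the PR instead treats the tuple
# as a collection of names:
print(isinstance(dims, str) or not isinstance(dims, Iterable))  # False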
xarray/core/dataset.py
Outdated
    Dataset
    """
    if dims is None:
        dims = list(self.coords)
Suggested change:
-    dims = list(self.coords)
+    dims = list(self.dims)
...I think?
And we should add a test for this please — an array with a non-dimensioned coord (a rough sketch follows below).
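One possible shape for that test, assuming the drop_duplicates name and keep= argument discussed in this thread (the coord names are illustrative):

import xarray as xr

def test_drop_duplicates_with_non_dim_coord():
    # A duplicated dimension coordinate "x" with a non-dimensioned
    # coordinate "label" riding along the same dimension.
    da = xr.DataArray(
        [1, 2, 3],
        coords={"x": [0, 0, 1]},
        dims="x",
    ).assign_coords(label=("x", ["a", "a", "b"]))

    result = da.drop_duplicates("x", keep="first")

    expected = xr.DataArray(
        [1, 3],
        coords={"x": [0, 1]},
        dims="x",
    ).assign_coords(label=("x", ["a", "b"]))
    xr.testing.assert_identical(result, expected)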
@pydata/xarray we didn't get to this on the call today — two questions from @mathause above.

If we don't hear anything, let's add this to the top of the list for the next dev call in ten days.
From an API perspective, I think the name drop_duplicates() would be fine. I would guess that handling arbitrary variables in a Dataset would not be any harder than handling only coordinates?

One thing that is a little puzzling to me is how deduplicating across multiple dimensions is handled. It looks like this function preserves existing dimensions, but inserts NA if the arrays would be ragged? This seems a little strange to me. I think it could make more sense to "flatten" all dimensions in the contained variables into a new dimension when dropping duplicates. This would require specifying the name for the new dimension(s), but perhaps that could work by switching to the de-duplicated variable name.

For example:

import numpy as np
import xarray as xr

ds = xr.DataArray(
    [[1, 2, 3], [4, 5, 6]],
    coords={"init": [0, 1], "tau": [1, 2, 3]},
    dims=["init", "tau"],
).to_dataset(name="test")
ds.coords["valid"] = (("init", "tau"), np.array([[8, 6, 6], [7, 7, 7]]))

result = ds.drop_duplicates('valid')

would result in a "valid" coordinate/dimension of length 3, i.e., the exact same thing that would be obtained by indexing with the positions of the de-duplicated values.
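To make that equivalence concrete, here is one way the "index with the positions of the de-duplicated values" step could be written; the stack/np.unique approach and the "points" name are illustrative assumptions, not code from this PR:

import numpy as np
import xarray as xr

ds = xr.DataArray(
    [[1, 2, 3], [4, 5, 6]],
    coords={"init": [0, 1], "tau": [1, 2, 3]},
    dims=["init", "tau"],
).to_dataset(name="test")
ds.coords["valid"] = (("init", "tau"), np.array([[8, 6, 6], [7, 7, 7]]))

# Flatten both dims into one, then keep the first occurrence of each
# unique "valid" value, in original order.
stacked = ds.stack(points=("init", "tau"))
_, positions = np.unique(stacked["valid"].values, return_index=True)
deduped = stacked.isel(points=np.sort(positions))

print(deduped["test"].values)   # [1 2 4]
print(deduped["valid"].values)  # [8 6 7]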
I prefer drop duplicate values to be under the unique() PR; maybe it could be renamed as drop_duplicate_values(). Also, I think preserving existing dimensions is more powerful than flattening the dimensions.
Oh I just saw the edits with keeping the dims. I guess that would work.

Not sure if there's a more elegant way of implementing this.
Hi @ahuang11 — forgive the delay. We discussed this with the team on our call and think it would be a welcome addition, so thank you for contributing. I took another look through the tests, and the behavior looks ideal when dimensioned coords are passed:

In [6]: da
Out[6]:
<xarray.DataArray (lat: 5, lon: 5)>
array([[ 0, 0, 0, 0, 0],
[ 0, 1, 2, 3, 4],
[ 0, 2, 4, 6, 8],
[ 0, 3, 6, 9, 12],
[ 0, 4, 8, 12, 16]])
Coordinates:
* lat (lat) int64 0 1 2 2 3
* lon (lon) int64 0 1 3 3 4
In [7]: result = da.drop_duplicate_coords(["lat", "lon"], keep='first')
In [8]: result
Out[8]:
<xarray.DataArray (lat: 4, lon: 4)>
array([[ 0, 0, 0, 0],
[ 0, 1, 2, 4],
[ 0, 2, 4, 8],
[ 0, 4, 8, 16]])
Coordinates:
* lat (lat) int64 0 1 2 3
* lon (lon) int64 0 1 3 4

And I think this is also the best we can do for non-dimensioned coords. One thing I'd call out is the resulting behavior in a few cases, e.g. stacking:

In [12]: da
Out[12]:
<xarray.DataArray (init: 2, tau: 3)>
array([[1, 2, 3],
[4, 5, 6]])
Coordinates:
* init (init) int64 0 1
* tau (tau) int64 1 2 3
valid (init, tau) int64 8 6 6 7 7 7
In [13]: da.drop_duplicate_coords("valid")
Out[13]:
<xarray.DataArray (valid: 3)>
array([1, 2, 4])
Coordinates:
* valid (valid) int64 8 6 7
init (valid) int64 0 0 1
tau (valid) int64 1 2 1

Changing the dimensions:

In [16]: (
...: da
...: .assign_coords(dict(zeta=(('tau'),[4,4,6])))
...: .drop_duplicate_coords('zeta')
...: )
Out[16]:
<xarray.DataArray (init: 2, zeta: 2)>
array([[1, 3],
[4, 6]])
Coordinates:
* init (init) int64 0 1
valid (init, zeta) int64 8 6 7 7
* zeta (zeta) int64 4 6
tau (zeta) int64 1 3 One peculiarity — though I think a necessary one — is that the order matters in some cases: In [17]: (
...: da
...: .assign_coords(dict(zeta=(('tau'),[4,4,6])))
...: .drop_duplicate_coords(['zeta','valid'])
...: )
Out[17]:
<xarray.DataArray (valid: 3)>
array([1, 3, 4])
Coordinates:
* valid (valid) int64 8 6 7
tau (valid) int64 1 3 1
init (valid) int64 0 0 1
zeta (valid) int64 4 6 4
In [18]: (
...: da
...: .assign_coords(dict(zeta=(('tau'),[4,4,6])))
...: .drop_duplicate_coords(['valid','zeta'])
...: )
Out[18]:
<xarray.DataArray (zeta: 1)>
array([1])
Coordinates:
* zeta (zeta) int64 4
init (zeta) int64 0
tau (zeta) int64 1
valid (zeta) int64 8

Unless anyone has any more thoughts, let's plan to merge this over the next few days. Thanks again @ahuang11 !
This looks great, but I wonder if we could simplify the implementation? For example, could we get away with only doing a single isel() for selecting the positions corresponding to unique values, rather than the current loop? This might require a different routine to find the unique positions than the one the current calls use.
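For what it's worth, a sketch of that single-isel idea; the helper name and the np.unique routine are assumptions for illustration, not the PR's code:

import numpy as np

def unique_positions(values, keep="first"):
    # Positions of the first (or last) occurrence of each unique value,
    # returned in ascending order so a single isel preserves order.
    if keep == "last":
        values = values[::-1]
    _, idx = np.unique(values, return_index=True)
    positions = np.sort(idx)
    if keep == "last":
        positions = np.sort(len(values) - 1 - positions)
    return positions

# e.g. da.isel(lat=unique_positions(da["lat"].values)) would replace the
# per-duplicate loop with one indexing operation per coordinate.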
@ahuang11 IIUC, this is only using […]. I agree with @shoyer that we could do it in a single isel().
@max-sixty is there a case where you don't think we could do a single isel()? I guess this may come down to the desired behavior for multiple arguments, e.g., I think we could make this work via the […].
Yes, correct. I am not feeling well at the moment, so I probably won't get to this today, but feel free to make commits!

I hope you feel better soon! There is no time pressure from our end on this.
IIUC there are two broad cases here:

* a multi-dimensional non-dimensioned coord, where de-duplicating effectively stacks the array:
In [12]: da
Out[12]:
<xarray.DataArray (init: 2, tau: 3)>
array([[1, 2, 3],
[4, 5, 6]])
Coordinates:
* init (init) int64 0 1
* tau (tau) int64 1 2 3
valid (init, tau) int64 8 6 6 7 7 7
In [13]: da.drop_duplicate_coords("valid")
Out[13]:
<xarray.DataArray (valid: 3)>
array([1, 2, 4])
Coordinates:
* valid (valid) int64 8 6 7
init (valid) int64 0 0 1
tau (valid) int64 1 2 1 * very close to this is a 1D non-dimensioned coord, in which case we can either turn it into a dimensioned coord or retain the existing dimensioned coords — I think probably the former if we allow the stacking case, for the sake of consistency. |
A couple thoughts on strategy here: […]
This is great work and it would be good to get this in for the upcoming release #5232. I think there are two paths: narrow (merge only the drop-duplicates-on-dims part for now) or wide (the full coords support in this branch).

I would mildly vote for narrow. While I would also vote to merge it as-is, I think it's not a huge task to move wide onto a new branch. @ahuang11 what are your thoughts?
I can take a look this weekend. If narrow, I could simply roll back to this commit (28aa96a), make minor adjustments, and merge. But I personally prefer full, so it'd be nice if we could come to a consensus on how to handle it~
This reverts commit f9ee3fe.
This reverts commit a1ce19d.
This reverts commit 8a168ce.
This reverts commit 8c27afb.
This reverts commit d33586e.
This reverts commit e307041.
This reverts commit 1698990.
This reverts commit 596ec7a.
This reverts commit 344a7d8.
This reverts commit 915dcf5.
This reverts commit f7dcdd4.
I failed to commit properly, so see #5239, where I only do drop duplicates for dims.
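For reference, a minimal sketch of the narrow, dims-only behavior, using a boolean mask built from the dimension's pandas index; this is close in spirit to, but not a verbatim copy of, that follow-up PR:

import xarray as xr

def drop_duplicates_narrow(obj, dim, keep="first"):
    # Keep only the first (or last) occurrence of each label along a
    # single dimension, via pandas.Index.duplicated on the underlying
    # index.
    mask = ~obj.get_index(dim).duplicated(keep=keep)
    return obj.isel({dim: mask})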
Semi-related to #2795, but not really; I still want a separate unique function.
- Passes pre-commit run --all-files
- User visible changes are documented in whats-new.rst
- New functions/methods are listed in api.rst