Initialize empty or full DataArray #3159
Conversation
Just realised most things broke with the change I made. I'll refactor it and try again.

Hello @DangoMelon! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-08-26 20:15:42 UTC
@DangoMelon thanks for starting on this! I have a few comments to discuss on the design of this.
```diff
@@ -279,6 +302,7 @@ def __init__(
         encoding = getattr(data, 'encoding', None)

         data = as_compatible_data(data)
+        data = _check_data_shape(data, coords, dims)
```
I wonder if we should move this logic above `as_compatible_data`, which would let us distinguish between scalar values like `float`/`int` (which don't have an inherent shape) vs 0-dimensional NumPy arrays (which do have an array shape already). For example:

- `xarray.DataArray(0.5, coords=[('x', np.arange(3)), ('y', ['a', 'b'])])` -> duplicate the scalar to make an array of shape `(3, 2)`
- `xarray.DataArray(np.array(1.0), coords=[('x', np.arange(3)), ('y', ['a', 'b'])])` -> error, shapes do not match
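The distinction being proposed can be seen with plain NumPy; `np.isscalar` is one way to draw the line (a sketch for illustration, not xarray's actual check):

```python
import numpy as np

# A plain Python scalar has no shape of its own:
print(np.isscalar(0.5))            # True

# A 0-dimensional array already carries an explicit shape of ():
print(np.isscalar(np.array(1.0)))  # False
print(np.array(1.0).shape)         # ()
```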
If I understand correctly, the second example you provided shouldn't work, since `np.array(1.0)` is a 0-dimensional NumPy array with shape `()` and `DataArray` expects it to have a `(3, 2)` shape, right? The current behavior is to duplicate the value as if it were `xarray.DataArray(1.0, coords=[('x', np.arange(3)), ('y', ['a', 'b'])])`, which I thought was the desired feature. I am currently pushing a commit that makes this work, since I didn't consider the case of `coords` being a list of tuples (although all tests passed).
Regarding the position of `_check_data_shape`, I placed it after `as_compatible_data` since the latter returns an `ndarray` containing the value passed to it, scalar or `None`, on which I can check the shape.
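The behavior being described can be sketched with `np.asarray`, which acts similarly here (an illustration only; it is not xarray's actual helper):

```python
import numpy as np

# Wrapping a scalar (or None) produces a 0-d ndarray whose shape is inspectable:
print(np.asarray(0.5).shape)   # ()
print(np.asarray(None).shape)  # () -- dtype is object in this case
```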
I am not sure what happened to the commit history. I might have messed up trying to update my local fork. Is there anything I can do to revert this?
Hi @DangoMelon, no stress, I know git can be a pain. I don't remember your previous effort so I'm not sure what the previous version looked like. Is it materially different from what currently exists? The current code looks relevant, and @shoyer has reviewed & commented on it. If you have overwritten your previous commits, you can generally get them back by using `git reflog`.
Hi @max-sixty, I managed to fix it up a bit. It previously showed all the commits made by other collaborators since I forked the repo. I did a …
So far, this addition can do the following:

```python
>>> xr.DataArray(5, coords=[('x', np.arange(3)), ('y', ['a', 'b'])])
<xarray.DataArray (x: 3, y: 2)>
array([[5, 5],
       [5, 5],
       [5, 5]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) <U1 'a' 'b'

>>> xr.DataArray(np.array(1.0), coords=[('x', np.arange(3)), ('y', ['a', 'b'])])
<xarray.DataArray (x: 3, y: 2)>
array([[1., 1.],
       [1., 1.],
       [1., 1.]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) <U1 'a' 'b'

>>> xr.DataArray(0, coords={'x': pd.date_range('20190101', '20190131'),
...                         'y': ['north', 'south'], 'z': np.arange(4)},
...              dims=['w', 'x', 'y', 'p', 'z'])
<xarray.DataArray (w: 1, x: 31, y: 2, p: 1, z: 4)>
array([[[[[0, ..., 0]],
         [[0, ..., 0]]],
        ...,
        [[[0, ..., 0]],
         [[0, ..., 0]]]]])
Coordinates:
  * x        (x) datetime64[ns] 2019-01-01 2019-01-02 ... 2019-01-30 2019-01-31
  * y        (y) <U5 'north' 'south'
  * z        (z) int64 0 1 2 3
Dimensions without coordinates: w, p

>>> xr.DataArray(None, coords={'x': np.datetime64('2019-01-01'),
...                            'y': np.arange(100),
...                            'z': 'ST1',
...                            'p': np.arange(10)}, dims=['y', 'p'])
<xarray.DataArray (y: 100, p: 10)>
array([[ 4.047386e-320,  6.719293e-321,  0.000000e+000, ...,  6.935425e-310,
         6.935319e-310,  0.000000e+000],
       [ 4.940656e-324,  6.935107e-310,  6.935432e-310, ...,  6.935432e-310,
         1.086944e-322,  6.935430e-310],
       [ 6.935432e-310,  6.935319e-310,  2.758595e-313, ...,  6.935432e-310,
         6.935432e-310,  6.935432e-310],
       ...,
       [ 6.781676e+194,  3.328071e-113,  9.124901e+192, ...,  2.195875e-157,
        -4.599251e-303, -2.217863e-250],
       [ 7.830998e+247, -8.407382e+089,  1.299071e+193, ...,  9.124901e+192,
        -4.661908e-303,  2.897933e+193],
       [ 1.144295e-309,  7.041423e+053, -8.538757e-210, ...,  1.473665e+256,
        -6.525461e-210, -1.665001e-075]])
Coordinates:
    x        datetime64[ns] 2019-01-01
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 ... 90 91 92 93 94 95 96 97 98 99
    z        <U3 'ST1'
  * p        (p) int64 0 1 2 3 4 5 6 7 8 9
```

Any comment on what is missing or needs to be fixed is welcome.
Great @DangoMelon, that looks superb! Could you run those cases you listed above in test functions? You can copy and paste those lines, and then compare the resulting object to one constructed with the full array (let me know if this is unclear and I can give more guidance). Could you also construct a test case using @shoyer's example which fails? Again, let me know if you need any guidance on how to do that. And then I think we can merge, pending any other feedback! Thanks, this will be a valuable addition to xarray.
This is the case that I'm not sure we want to support. I think the rule we want is something like "scalar values are repeated automatically", but 0-dimensional arrays are kind of a strange case: are they really scalars or multi-dimensional arrays? My inclination is to treat them like multi-dimensional arrays, in which case we should raise an error rather than silently hiding mistakes. In particular, one thing that an xarray user might expect, but which I think we don't want to support, is full broadcasting of multi-dimensional arrays to match the shape of the coordinates.
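The failure mode being ruled out can be illustrated with plain NumPy (a sketch of the concern, not of xarray behavior):

```python
import numpy as np

# A 1-d array "fits" a (3, 2) target via broadcasting, but it is ambiguous
# which axis the user meant -- silently repeating it could hide real bugs:
row = np.array([1.0, 2.0])
b = np.broadcast_to(row, (3, 2))
print(b.shape)  # (3, 2)
```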
Rather than using `None` as a special value, the way we do this in xarray is with a sentinel object; see xarray/core/computation.py, line 26 in 1757dff, and its use at xarray/core/computation.py, line 692 in 1757dff.
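A minimal sketch of that sentinel pattern (the names `ReprObject`, `_NO_DATA`, and `make_array` here are illustrative, not xarray's actual API):

```python
import numpy as np

class ReprObject:
    """Unique sentinel object with a fixed repr."""
    def __init__(self, value):
        self._value = value
    def __repr__(self):
        return self._value

_NO_DATA = ReprObject('<no-data>')  # hypothetical sentinel name

def make_array(shape, data=_NO_DATA):
    # Identity comparison distinguishes "argument omitted" from data=None:
    if data is _NO_DATA:
        return np.full(shape, np.nan)
    return np.full(shape, data)

print(make_array((2, 2)))        # filled with nan
print(make_array((2, 2), 0.0))   # filled with 0.0
```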
There is also the question of what values should be inside such an empty array. Here I think there are roughly two options:

1. Fill the array with a constant default value such as `NaN` (e.g. via `np.full`).
2. Leave the memory uninitialized (e.g. via `np.empty`).

It looks like you've currently implemented option (2), but again I'm not sure that is the most sensible default behavior for xarray. The performance gains from not filling in array values with a constant are typically very small (writing constant values into memory is very fast). Pandas also seems to use `NaN` as its default fill value.
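The two options amount to `np.full` versus `np.empty` (a numpy-only sketch of the trade-off):

```python
import numpy as np

# Option (1): initialize with a constant -- contents are well defined.
filled = np.full((3, 2), np.nan)

# Option (2): skip initialization -- faster in principle, but the contents
# are whatever happened to be in that memory.
uninitialized = np.empty((3, 2))

print(filled)               # all nan
print(uninitialized.shape)  # (3, 2), values arbitrary
```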
👍

👍 ...so for clarity, there's no need to expose the sentinel values externally; they're an internal implementation detail that then fills the array with the default value.
If the default value is the one defined here: line 8 in 55593a8.
Thanks for the feedback.

I wasn't sure how to treat 0-dimensional arrays and just assumed them to be the same as a scalar, since this function considers them as such: xarray/core/variable.py, lines 238 to 248 in 1ab7569. Should I treat them like multi-dimensional arrays, or leave the current behavior for consistency with the snippet above?

Thanks for the advice, I'll be using this.
That's a good point. I think in this case, given that it's passed to an arg expecting an array, we should raise on 0-d arrays. I realize that's a bit inconsistent with treating them as scalars elsewhere. Happy to be outvoted if others disagree.
I was expecting to rely on the current implementation of `is_scalar` to do the type checking, since I'm moving `_check_data_shape` above `as_compatible_data` to do something like this:

```python
if utils.is_scalar(data) and coords is not None:
```

Otherwise everything would be filtered out, since `as_compatible_data` returns a 0-d array given a scalar value (xarray/core/variable.py, lines 195 to 196 in 8d46bf0).

I can only imagine copying `is_scalar` but removing `getattr(value, 'ndim', None) == 0` to filter out the 0-d case and only do the duplication on scalars.
Yes, I think it would make sense to add an option to `is_scalar()` to indicate whether or not 0-d arrays should be considered "scalars".
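A sketch of what such an option might look like (the parameter name `include_0d` and this exact logic are assumptions, not necessarily the final xarray API):

```python
import numpy as np

def is_scalar(value, include_0d=True):
    # Treat 0-d arrays as scalars only when the caller opts in.
    if getattr(value, 'ndim', None) == 0:
        return include_0d
    return np.isscalar(value) or value is None

print(is_scalar(1.0))                              # True
print(is_scalar(np.array(1.0)))                    # True
print(is_scalar(np.array(1.0), include_0d=False))  # False
```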
I think this is looking great!
@shoyer any final thoughts?
Great - will merge unless anyone has final comments?

Great, will merge later unless other comments. Will be good to get this in!

Great, looking good @DangoMelon - I merged master to resolve a conflict - then we can get this in!

Thanks for the help! I'm glad to contribute to this project.

Test failure is the same as on master, ref #3265

Thanks @DangoMelon! I know this PR was a long road, but it's a material improvement to the ergonomics of xarray. We'd enjoy having you as a contributor in the future!
* upstream/master:
  - Initialize empty or full DataArray (pydata#3159)
  - Raise on inplace=True (pydata#3260)
  - added support for rasterio geotiff tags (pydata#3249)
  - Remove sel_points (pydata#3261)
  - Fix sparse ops that were calling bottleneck (pydata#3254)
  - New feature of filter_by_attrs added (pydata#3259)
  - Update filter_by_attrs to use 'variables' instead of 'data_vars' (pydata#3247)
I attempted to implement what has been asked for in #277 as an effort to contribute to this project.

This PR adds the ability to initialize a DataArray with a constant value, including `np.nan`. Also, if `data = None` then it is initialized with `np.empty` to take advantage of its speed for big arrays.

- Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API

Regarding the tests, I am not sure how to test the creation of an empty DataArray with `data=None`, since the values change between calls of `np.empty`. This is the reason I only added the test for the constant value.
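One way around this is to assert only on properties that don't depend on the uninitialized contents, such as shape and dtype (a numpy-level sketch of the idea):

```python
import numpy as np

# np.empty's contents are arbitrary, but its shape and dtype are deterministic:
arr = np.empty((100, 10))
assert arr.shape == (100, 10)
assert arr.dtype == np.dtype('float64')
```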