
Initialize empty or full DataArray #3159

Merged: 31 commits into pydata:master, Aug 26, 2019

Conversation

@griverat (Contributor):

I attempted to implement what has been asked for in #277 as an effort to contribute to this project.
This PR adds the ability to initialize a DataArray with a constant value, including np.nan. Also, if data=None, the array is allocated with np.empty to take advantage of its speed for big arrays.

>>> foo = xr.DataArray(None, coords=[range(3), range(4)])
>>> foo
<xarray.DataArray (dim_0: 3, dim_1: 4)>
array([[4.673257e-310, 0.000000e+000, 0.000000e+000, 0.000000e+000],
       [0.000000e+000, 0.000000e+000, 0.000000e+000, 0.000000e+000],
       [0.000000e+000, 0.000000e+000, 0.000000e+000, 0.000000e+000]])
Coordinates:
  * dim_0    (dim_0) int64 0 1 2
  * dim_1    (dim_1) int64 0 1 2 3

Regarding the tests, I am not sure how to test the creation of an empty DataArray with data=None, since the values change between calls to np.empty. This is the reason I only added the test for the constant value.
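
For illustration, a minimal sketch of how the constant-value case could be tested, comparing against a DataArray built from an explicit np.full array. The test name and use of xr.testing.assert_identical are my choices, not the PR's actual test code, and it assumes the scalar-fill behavior this PR adds.

import numpy as np
import xarray as xr


def test_constructor_from_scalar_fill():
    coords = [("x", np.arange(3)), ("y", ["a", "b"])]
    # Expected result built explicitly from a full array of the same shape.
    expected = xr.DataArray(np.full((3, 2), 7.0), coords=coords)
    # Assumes the behavior this PR adds: a bare scalar is repeated
    # to match the shape implied by the coordinates.
    actual = xr.DataArray(7.0, coords=coords)
    xr.testing.assert_identical(expected, actual)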

@griverat (Contributor, author):

Just realised most things broke with the change I made. I'll refactor it and try again.

@pep8speaks commented Jul 24, 2019:

Hello @DangoMelon! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-26 20:15:42 UTC

@shoyer (Member) left a comment:

@DangoMelon thanks for starting on this! I have a few comments to discuss on the design of this.

Inline comment on xarray/core/dataarray.py:
@@ -279,6 +302,7 @@ def __init__(
         encoding = getattr(data, 'encoding', None)

         data = as_compatible_data(data)
+        data = _check_data_shape(data, coords, dims)
@shoyer (Member):

I wonder if we should move this logic above as_compatible_data, which would let us distinguish between scalar values like float/int (which don't have an inherent shape) vs 0-dimensional NumPy arrays (which do have an array shape already).

For example:

  • xarray.DataArray(0.5, coords=[('x', np.arange(3)), ('y', ['a', 'b'])]) -> duplicate the scalar to make an array of shape (3, 2)
  • xarray.DataArray(np.array(1.0), coords=[('x', np.arange(3)), ('y', ['a', 'b'])]) -> error, shapes do not match
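
For illustration, a minimal sketch of the distinction being proposed here (the helper name is hypothetical, not xarray code): a bare Python scalar carries no array shape, while np.array(1.0) already has shape (), so it can be held to the usual shape-matching rule.

import numpy as np

def looks_like_bare_scalar(value):
    # Hypothetical check: True for plain floats/ints/strings (no array shape),
    # False for anything that already has a shape, even a 0-d ndarray.
    return not hasattr(value, "shape")

looks_like_bare_scalar(0.5)            # True  -> tile to the coords' shape, e.g. (3, 2)
looks_like_bare_scalar(np.array(1.0))  # False -> shape () vs (3, 2) mismatch -> error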

@griverat (Contributor, author):

If I understand correctly, the second example you provided shouldn't work, since np.array(1.0) is a 0-dimensional NumPy array with shape () and DataArray expects it to have a (3, 2) shape, right? The current behavior is to duplicate the value as if it were xarray.DataArray(1.0, coords=[('x', np.arange(3)), ('y', ['a', 'b'])]), which I thought was the desired feature. I am currently pushing a commit that makes this work, since I didn't consider the case of coords being a list of tuples (although all tests passed).

Regarding the _check_data_shape position, I placed it after as_compatible_data since the latter returns an ndarray containing the value passed to it, scalar or None, on which I can check the shape.
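
As a quick illustration of that point (plain numpy here, standing in for as_compatible_data): both a scalar and None come back as 0-d arrays whose shape can then be checked.

import numpy as np

np.asarray(5.0)          # array(5.) -> a 0-d ndarray with shape ()
np.asarray(5.0).shape    # ()
np.asarray(None)         # array(None, dtype=object) -> also 0-d, shape ()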

@griverat (Contributor, author) commented Aug 5, 2019:

I am not sure what happened to the commit history. I might have messed up trying to update my local fork. Is there anything I can do to revert this?

@griverat force-pushed the add-init-val-darray branch from 7c4c049 to 2fa6e48 on August 5, 2019, 17:07
@max-sixty (Collaborator):

Hi @DangoMelon, no stress, I know git can be a pain.

I don't remember your previous effort so I'm not sure what the previous version looked like. Is it materially different from what currently exists? The current code looks relevant and @shoyer has reviewed & commented on it.

If you have overwritten your previous commits, you can generally get them back by using git reflog. There's plenty of documentation on that online; let us know if we can help with your specific case, though.

@griverat (Contributor, author) commented Aug 5, 2019:

Hi @max-sixty, I managed to fix it up a bit. It previously showed all the commits made by other collaborators since I forked the repo. I did a git rebase, solved all the merge conflicts and then force-pushed it. It seems like some commits got duplicated, though.

@griverat (Contributor, author) commented Aug 6, 2019:

So far, this addition can do the following:

  • Use a scalar value
>>> xr.DataArray(5, coords=[('x', np.arange(3)), ('y', ['a', 'b'])])

<xarray.DataArray (x: 3, y: 2)>
array([[5, 5],
       [5, 5],
       [5, 5]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) <U1 'a' 'b'
  • Use a scalar array
>>> xr.DataArray(np.array(1.0), coords=[('x', np.arange(3)), ('y', ['a', 'b'])])

<xarray.DataArray (x: 3, y: 2)>
array([[1., 1.],
       [1., 1.],
       [1., 1.]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) <U1 'a' 'b'
  • Match any number of dims
>>> xr.DataArray(0, coords={'x': pd.date_range('20190101', '20190131'),
                            'y': ['north', 'south'], 'z': np.arange(4)},
                 dims=['w', 'x', 'y', 'p', 'z'])

<xarray.DataArray (w: 1, x: 31, y: 2, p: 1, z: 4)>
array([[[[[0, ..., 0]],

         [[0, ..., 0]]],


        ...,


        [[[0, ..., 0]],

         [[0, ..., 0]]]]])
Coordinates:
  * x        (x) datetime64[ns] 2019-01-01 2019-01-02 ... 2019-01-30 2019-01-31
  * y        (y) <U5 'north' 'south'
  * z        (z) int64 0 1 2 3
Dimensions without coordinates: w, p
  • Use None to get an empty array
>>> xr.DataArray(None, coords={'x': np.datetime64('2019-01-01'),
                               'y': np.arange(100),
                               'z': 'ST1',
                               'p': np.arange(10)}, dims=['y', 'p'])

<xarray.DataArray (y: 100, p: 10)>
array([[ 4.047386e-320,  6.719293e-321,  0.000000e+000, ...,  6.935425e-310,
         6.935319e-310,  0.000000e+000],
       [ 4.940656e-324,  6.935107e-310,  6.935432e-310, ...,  6.935432e-310,
         1.086944e-322,  6.935430e-310],
       [ 6.935432e-310,  6.935319e-310,  2.758595e-313, ...,  6.935432e-310,
         6.935432e-310,  6.935432e-310],
       ...,
       [ 6.781676e+194,  3.328071e-113,  9.124901e+192, ...,  2.195875e-157,
        -4.599251e-303, -2.217863e-250],
       [ 7.830998e+247, -8.407382e+089,  1.299071e+193, ...,  9.124901e+192,
        -4.661908e-303,  2.897933e+193],
       [ 1.144295e-309,  7.041423e+053, -8.538757e-210, ...,  1.473665e+256,
        -6.525461e-210, -1.665001e-075]])
Coordinates:
    x        datetime64[ns] 2019-01-01
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 ... 90 91 92 93 94 95 96 97 98 99
    z        <U3 'ST1'
  * p        (p) int64 0 1 2 3 4 5 6 7 8 9

Any comment on what is missing or needs to be fixed is welcome.

@max-sixty (Collaborator):

Great @DangoMelon, that looks superb!

Could you add those cases you listed above as test functions? You can copy and paste those lines, and then compare the resulting object to one constructed with the full array (let me know if this is unclear and I can give more guidance).

Could you also construct a test case using @shoyer's example which fails? Again, let me know if you need any guidance on how to do that.

In the final case "Use None", is that correct? Where do all the values come from? (Maybe I'm missing something basic?)

And then I think we can merge, pending any other feedback! Thanks - this will be a valuable addition to xarray.
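
A sketch of the failing-case test being asked for here. The exact exception type is an assumption; the point is that a 0-d array whose shape () does not match the (3, 2) shape implied by the coords should raise rather than be silently broadcast.

import numpy as np
import pytest
import xarray as xr


def test_0d_array_does_not_broadcast():
    # A 0-d array is treated as an array, not a scalar, so its shape must match.
    with pytest.raises(ValueError):
        xr.DataArray(np.array(1.0),
                     coords=[("x", np.arange(3)), ("y", ["a", "b"])])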

@shoyer (Member) commented Aug 6, 2019:

  • Use a scalar array

This is the case that I'm not sure we want to support.

I think the rule we want is something like "scalar values are repeated automatically," but 0-dimensional arrays are kind of a strange case -- are they really scalars or multi-dimensional arrays? My inclination is to treat these like multi-dimensional arrays, in which case we should raise an error to avoid hiding errors.

In particular, one thing that an xarray user might expect, but which I think we don't want to support, is full broadcasting of multi-dimensional arrays to match the shape of coordinates.

  • Use None to get an empty array

Rather than using None, I would suggest using a custom sentinel value. Somebody might actually want an array full of None values! If users want an empty DataArray, make them omit the argument entirely, e.g., xr.DataArray(coords=coords, dims=dims).

The way we do this in xarray is with a ReprObject, e.g., see here for apply_ufunc:

_NO_FILL_VALUE = utils.ReprObject('<no-fill-value>')

dataset_fill_value: object = _NO_FILL_VALUE,

There is also the question of what values should be inside such an empty array. Here I think there are roughly two options:

  1. Fill the unspecified array with np.nan, to indicate invalid values.
  2. Just use np.empty, which means the array can be filled with arbitrary invalid data.

It looks like you've currently implemented option (2), but again I'm not sure that is the most sensible default behavior for xarray. The performance gains from not filling in array values with a constant are typically very small (writing constant values into memory is very fast). Pandas also seems to use NaN as the default value:

>>> pandas.Series(index=[1, 2])
1   NaN
2   NaN
dtype: float64
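
A rough sketch of the sentinel-default pattern described above. The names here are illustrative, not xarray's actual internals: the default is a unique object, so None remains usable as real data, and an omitted argument yields a NaN-filled array, following option (1).

import numpy as np

_DEFAULT = object()  # stand-in for a ReprObject sentinel such as '<NA>'


def build_data(data=_DEFAULT, shape=(3, 2)):
    if data is _DEFAULT:
        # option (1) above: fill an unspecified array with NaN
        return np.full(shape, np.nan)
    if np.ndim(data) == 0 and not isinstance(data, np.ndarray):
        # repeat a bare scalar (including None) to the requested shape
        return np.full(shape, data)
    return np.asarray(data)  # otherwise use the data as given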

@max-sixty (Collaborator):

  1. Fill the unspecified array with np.nan, to indicate invalid values.

👍

If users want an empty DataArray, make them omit the argument entirely, e.g., xr.DataArray(coords=coords, dims=dims).

👍

...so for clarity: there's no need to expose the sentinel value externally; it's an internal implementation detail that then fills with np.nan.

@shoyer (Member) commented Aug 6, 2019:

If the default value is NaN, we could reuse xarray's pre-existing sentinel value for NA:

NA = utils.ReprObject('<NA>')

@griverat (Contributor, author) commented Aug 6, 2019:

Thanks for the feedback.

My inclination is to treat these like multi-dimensional arrays, in which case we should raise an error to avoid hiding errors.

I wasn't sure how to treat 0-dimensional arrays and just assumed they are the same as a scalar, since this function treats them as such:

xarray/xarray/core/utils.py

Lines 238 to 248 in 1ab7569

def is_scalar(value: Any) -> bool:
    """Whether to treat a value as a scalar.

    Any non-iterable, string, or 0-D array
    """
    return (
        getattr(value, 'ndim', None) == 0 or
        isinstance(value, (str, bytes)) or not
        (isinstance(value, (Iterable, ) + dask_array_type) or
         hasattr(value, '__array_function__'))
    )

Should I treat them like multi-dimensional arrays or leave the current behavior for consistency with the snippet above?

If the default value is NaN, we could reuse xarray's pre-existing sentinel value for NA:

Thanks for the advice, I'll be using this.

@max-sixty (Collaborator):

Should I treat them like multi-dimensional arrays or leave the current behavior for consistency with the snippet above?

That's a good point. I think in this case, given that it's passed to an arg expecting an array, we should raise on 0-d. I realize that's a bit inconsistent with treating them as scalars elsewhere.

Happy to be outvoted if others disagree

@griverat (Contributor, author) commented Aug 8, 2019:

That's a good point. I think in this case, given that it's passed to an arg expected an array, we should raise on 0d.

I was expecting to rely on the current implementation of is_scalar to do the type checking, since I'm moving _check_data_shape above as_compatible_data to do something like this:

if utils.is_scalar(data) and coords is not None:

Otherwise everything would be filtered out, since as_compatible_data returns a 0-d array given a scalar value:

# validate whether the data is valid data types
data = np.asarray(data)

The only alternative I can think of is copying is_scalar but removing getattr(value, 'ndim', None) == 0, so that the duplication only happens for true scalars and 0-d arrays are filtered out.
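
A sketch of the variant being described, where the 0-d-array branch of is_scalar becomes opt-in. This mirrors the snippet quoted above but is simplified: the dask array types are omitted to keep it self-contained, and it is not the PR's final code.

from collections.abc import Iterable
from typing import Any


def is_scalar(value: Any, include_0d: bool = True) -> bool:
    """Whether to treat a value as a scalar.

    Any non-iterable, string, or (optionally) 0-D array.
    """
    zero_d = include_0d and getattr(value, "ndim", None) == 0
    return (
        zero_d
        or isinstance(value, (str, bytes))
        or not (isinstance(value, Iterable) or hasattr(value, "__array_function__"))
    )


# With include_0d=False, np.array(1.0) is no longer treated as a scalar,
# so only true scalars get duplicated to fill the coords' shape.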

@shoyer (Member) commented Aug 8, 2019 via email.

@max-sixty (Collaborator) left a comment:

I think this is looking great!

@shoyer any final thoughts?

@max-sixty (Collaborator):

Great - will merge unless anyone has final comments?

@max-sixty (Collaborator):

Great, will merge later unless other comments. Will be good to get this in!

@max-sixty (Collaborator):

Great, looking good @DangoMelon - I merged master to resolve a conflict - then we can get this in!

@griverat (Contributor, author):

Great, looking good @DangoMelon - I merged master to resolve a conflict - then we can get this in!

Thanks for the help! I'm glad to contribute to this project.

@max-sixty (Collaborator):

Test failure is the same as on master, ref #3265

@max-sixty merged commit 3c020e5 into pydata:master on Aug 26, 2019
@max-sixty (Collaborator):

Thanks @DangoMelon! I know this PR was a long road, but it's a material improvement to the ergonomics of xarray. We'd enjoy having you as a contributor in the future!

dcherian added a commit to dcherian/xarray that referenced this pull request Aug 26, 2019
* upstream/master:
  Initialize empty or full DataArray (pydata#3159)
  Raise on inplace=True (pydata#3260)
  added support for rasterio geotiff tags (pydata#3249)
  Remove sel_points (pydata#3261)
  Fix sparse ops that were calling bottleneck (pydata#3254)
  New feature of filter_by_attrs added (pydata#3259)
  Update filter_by_attrs to use 'variables' instead of 'data_vars' (pydata#3247)
@griverat deleted the add-init-val-darray branch on August 27, 2019, 16:28

Successfully merging this pull request may close the following issue:

Allow passing a default value (instead of ndarray) for data argument for DataArray