
Initialize empty or full DataArray #3159

Merged: 31 commits into pydata:master, Aug 26, 2019

Conversation

@griverat (Contributor):

I attempted to implement what has been asked for in #277 as an effort to contribute to this project.
This PR adds the ability to initialize a DataArray with a constant value, including np.nan. Also, if data=None, the array is allocated with np.empty to take advantage of its speed for big arrays.

>>> foo = xr.DataArray(None, coords=[range(3), range(4)])
>>> foo
<xarray.DataArray (dim_0: 3, dim_1: 4)>
array([[4.673257e-310, 0.000000e+000, 0.000000e+000, 0.000000e+000],
       [0.000000e+000, 0.000000e+000, 0.000000e+000, 0.000000e+000],
       [0.000000e+000, 0.000000e+000, 0.000000e+000, 0.000000e+000]])
Coordinates:
  * dim_0    (dim_0) int64 0 1 2
  * dim_1    (dim_1) int64 0 1 2 3

Regarding the tests, I am not sure how to test the creation of an empty DataArray with data=None, since the values change between calls to np.empty. This is the reason I only added the test for the constant value.
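
For illustration, a minimal sketch of how the constant-value case could be tested, comparing against a DataArray built from an explicit np.full array. The test name and use of xr.testing.assert_identical are my choices, not the PR's actual test code, and it assumes the scalar-fill behavior this PR adds.

import numpy as np
import xarray as xr


def test_constructor_from_scalar_fill():
    coords = [("x", np.arange(3)), ("y", ["a", "b"])]
    # Expected result built explicitly from a full array of the same shape.
    expected = xr.DataArray(np.full((3, 2), 7.0), coords=coords)
    # Assumes the behavior this PR adds: a bare scalar is repeated
    # to match the shape implied by the coordinates.
    actual = xr.DataArray(7.0, coords=coords)
    xr.testing.assert_identical(expected, actual)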

@griverat (Contributor, author):

Just realised most things broke with the change I made. I'll refactor it and try again.

@pep8speaks commented Jul 24, 2019:

Hello @DangoMelon! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-26 20:15:42 UTC

@shoyer (Member) left a comment:

@DangoMelon thanks for starting on this! I have a few comments to discuss on the design of this.

Inline comment on xarray/core/dataarray.py:
@@ -279,6 +302,7 @@ def __init__(
         encoding = getattr(data, 'encoding', None)

         data = as_compatible_data(data)
+        data = _check_data_shape(data, coords, dims)
@shoyer (Member):

I wonder if we should move this logic above as_compatible_data, which would let us distinguish between scalar values like float/int (which don't have an inherent shape) vs 0-dimensional NumPy arrays (which do have an array shape already).

For example:

  • xarray.DataArray(0.5, coords=[('x', np.arange(3)), ('y', ['a', 'b'])]) -> duplicate the scalar to make an array of shape (3, 2)
  • xarray.DataArray(np.array(1.0), coords=[('x', np.arange(3)), ('y', ['a', 'b'])]) -> error, shapes do not match
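
For illustration, a minimal sketch of the distinction being proposed here (the helper name is hypothetical, not xarray code): a bare Python scalar carries no array shape, while np.array(1.0) already has shape (), so it can be held to the usual shape-matching rule.

import numpy as np

def looks_like_bare_scalar(value):
    # Hypothetical check: True for plain floats/ints/strings (no array shape),
    # False for anything that already has a shape, even a 0-d ndarray.
    return not hasattr(value, "shape")

looks_like_bare_scalar(0.5)            # True  -> tile to the coords' shape, e.g. (3, 2)
looks_like_bare_scalar(np.array(1.0))  # False -> shape () vs (3, 2) mismatch -> error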

@griverat (Contributor, author):

If I understand correctly, the second example you provided shouldn't work, since np.array(1.0) is a 0-dimensional NumPy array with shape () and DataArray expects it to have a (3, 2) shape, right? The current behavior is to duplicate the value as if it were xarray.DataArray(1.0, coords=[('x', np.arange(3)), ('y', ['a', 'b'])]), which I thought was the desired feature. I am currently pushing a commit that makes this work, since I didn't consider the case of coords being a list of tuples (although all tests passed).

Regarding the _check_data_shape position, I placed it after as_compatible_data since the latter returns an ndarray containing the value passed to it, scalar or None, on which I can check the shape.
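
As a quick illustration of that point (plain numpy here, standing in for as_compatible_data): both a scalar and None come back as 0-d arrays whose shape can then be checked.

import numpy as np

np.asarray(5.0)          # array(5.) -> a 0-d ndarray with shape ()
np.asarray(5.0).shape    # ()
np.asarray(None)         # array(None, dtype=object) -> also 0-d, shape ()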

@griverat (Contributor, author) commented Aug 5, 2019:

I am not sure what happened to the commit history. I might have messed up trying to update my local fork. Is there anything I can do to revert this?

@griverat force-pushed the add-init-val-darray branch from 7c4c049 to 2fa6e48 on August 5, 2019, 17:07
@max-sixty (Collaborator):

Hi @DangoMelon, no stress, I know git can be a pain.

I don't remember your previous effort so I'm not sure what the previous version looked like. Is it materially different from what currently exists? The current code looks relevant and @shoyer has reviewed & commented on it.

If you have overwritten your previous commits, you can generally get them back by using git reflog. There's plenty of documentation on that online; let us know if we can help with your specific case, though.

@griverat (Contributor, author) commented Aug 5, 2019:

Hi @max-sixty, I managed to fix it up a bit. It previously showed all the commits made by other collaborators since I forked the repo. I did a git rebase, solved all the merge conflicts and then force-pushed it. It seems like some commits got duplicated, though.

@griverat (Contributor, author) commented Aug 6, 2019:

So far, this addition can do the following:

  • Use a scalar value
>>> xr.DataArray(5, coords=[('x', np.arange(3)), ('y', ['a', 'b'])])

<xarray.DataArray (x: 3, y: 2)>
array([[5, 5],
       [5, 5],
       [5, 5]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) <U1 'a' 'b'
  • Use a scalar array
>>> xr.DataArray(np.array(1.0), coords=[('x', np.arange(3)), ('y', ['a', 'b'])])

<xarray.DataArray (x: 3, y: 2)>
array([[1., 1.],
       [1., 1.],
       [1., 1.]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) <U1 'a' 'b'
  • Match any number of dims
>>> xr.DataArray(0, coords={'x': pd.date_range('20190101', '20190131'),
                            'y': ['north', 'south'], 'z': np.arange(4)},
                 dims=['w', 'x', 'y', 'p', 'z'])

<xarray.DataArray (w: 1, x: 31, y: 2, p: 1, z: 4)>
array([[[[[0, ..., 0]],

         [[0, ..., 0]]],


        ...,


        [[[0, ..., 0]],

         [[0, ..., 0]]]]])
Coordinates:
  * x        (x) datetime64[ns] 2019-01-01 2019-01-02 ... 2019-01-30 2019-01-31
  * y        (y) <U5 'north' 'south'
  * z        (z) int64 0 1 2 3
Dimensions without coordinates: w, p
  • Use None to get an empty array
>>> xr.DataArray(None, coords={'x': np.datetime64('2019-01-01'),
                               'y': np.arange(100),
                               'z': 'ST1',
                               'p': np.arange(10)}, dims=['y', 'p'])

<xarray.DataArray (y: 100, p: 10)>
array([[ 4.047386e-320,  6.719293e-321,  0.000000e+000, ...,  6.935425e-310,
         6.935319e-310,  0.000000e+000],
       [ 4.940656e-324,  6.935107e-310,  6.935432e-310, ...,  6.935432e-310,
         1.086944e-322,  6.935430e-310],
       [ 6.935432e-310,  6.935319e-310,  2.758595e-313, ...,  6.935432e-310,
         6.935432e-310,  6.935432e-310],
       ...,
       [ 6.781676e+194,  3.328071e-113,  9.124901e+192, ...,  2.195875e-157,
        -4.599251e-303, -2.217863e-250],
       [ 7.830998e+247, -8.407382e+089,  1.299071e+193, ...,  9.124901e+192,
        -4.661908e-303,  2.897933e+193],
       [ 1.144295e-309,  7.041423e+053, -8.538757e-210, ...,  1.473665e+256,
        -6.525461e-210, -1.665001e-075]])
Coordinates:
    x        datetime64[ns] 2019-01-01
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 ... 90 91 92 93 94 95 96 97 98 99
    z        <U3 'ST1'
  * p        (p) int64 0 1 2 3 4 5 6 7 8 9

Any comment on what is missing or needs to be fixed is welcome.

@max-sixty (Collaborator):

Great @DangoMelon, that looks superb!

Could you add those cases you listed above as test functions? You can copy and paste those lines, and then compare the resulting object to one constructed with the full array (let me know if this is unclear and I can give more guidance).

Could you also construct a test case using @shoyer's example which fails? Again, let me know if you need any guidance on how to do that.

In the final case "Use None", is that correct? Where do all the values come from? (Maybe I'm missing something basic?)

And then I think we can merge, pending any other feedback! Thanks - this will be a valuable addition to xarray.
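
A sketch of the failing-case test being asked for here. The exact exception type is an assumption; the point is that a 0-d array whose shape () does not match the (3, 2) shape implied by the coords should raise rather than be silently broadcast.

import numpy as np
import pytest
import xarray as xr


def test_0d_array_does_not_broadcast():
    # A 0-d array is treated as an array, not a scalar, so its shape must match.
    with pytest.raises(ValueError):
        xr.DataArray(np.array(1.0),
                     coords=[("x", np.arange(3)), ("y", ["a", "b"])])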

@shoyer (Member) commented Aug 6, 2019:

  • Use a scalar array

This is the case that I'm not sure we want to support.

I think the rule we want is something like "scalar values are repeated automatically," but 0-dimensional arrays are kind of a strange case -- are they really scalars or multi-dimensional arrays? My inclination is to treat these like multi-dimensional arrays, in which case we should raise an error to avoid hiding errors.

In particular, one thing that an xarray user might expect, but which I think we don't want to support, is full broadcasting of multi-dimensional arrays to match the shape of coordinates.

  • Use None to get an empty array

Rather than using None, I would suggest using a custom sentinel value. Somebody might actually want an array full of None values! If users want an empty DataArray, make them omit the argument entirely, e.g., xr.DataArray(coords=coords, dims=dims).

The way we do this in xarray is with a ReprObject, e.g., see here for apply_ufunc:

_NO_FILL_VALUE = utils.ReprObject('<no-fill-value>')

dataset_fill_value: object = _NO_FILL_VALUE,

There is also the question of what values should be inside such an empty array. Here I think there are roughly two options:

  1. Fill the unspecified array with np.nan, to indicate invalid values.
  2. Just use np.empty, which means the array can be filled with arbitrary invalid data.

It looks like you've currently implemented option (2), but again I'm not sure that is the most sensible default behavior for xarray. The performance gains from not filling in array values with a constant are typically very small (writing constant values into memory is very fast). Pandas also seems to use NaN as the default value:

>>> pandas.Series(index=[1, 2])
1   NaN
2   NaN
dtype: float64
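
A rough sketch of the sentinel-default pattern described above. The names here are illustrative, not xarray's actual internals: the default is a unique object, so None remains usable as real data, and an omitted argument yields a NaN-filled array, following option (1).

import numpy as np

_DEFAULT = object()  # stand-in for a ReprObject sentinel such as '<NA>'


def build_data(data=_DEFAULT, shape=(3, 2)):
    if data is _DEFAULT:
        # option (1) above: fill an unspecified array with NaN
        return np.full(shape, np.nan)
    if np.ndim(data) == 0 and not isinstance(data, np.ndarray):
        # repeat a bare scalar (including None) to the requested shape
        return np.full(shape, data)
    return np.asarray(data)  # otherwise use the data as given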

@max-sixty (Collaborator):

  1. Fill the unspecified array with np.nan, to indicate invalid values.

👍

If users want an empty DataArray, make them omit the argument entirely, e.g., xr.DataArray(coords=coords, dims=dims).

👍

...so for clarity: there's no need to expose the sentinel value externally; it's an internal implementation detail that then fills with np.nan.

@shoyer (Member) commented Aug 6, 2019:

If the default value is NaN, we could reuse xarray's pre-existing sentinel value for NA:

NA = utils.ReprObject('<NA>')

@griverat (Contributor, author) commented Aug 6, 2019:

Thanks for the feedback.

My inclination is to treat these like multi-dimensional arrays, in which case we should raise an error to avoid hiding errors.

I wasn't sure how to treat 0-dimensional arrays and just assumed they are the same as a scalar, since this function treats them as such:

xarray/xarray/core/utils.py

Lines 238 to 248 in 1ab7569

def is_scalar(value: Any) -> bool:
    """Whether to treat a value as a scalar.

    Any non-iterable, string, or 0-D array
    """
    return (
        getattr(value, 'ndim', None) == 0 or
        isinstance(value, (str, bytes)) or not
        (isinstance(value, (Iterable, ) + dask_array_type) or
         hasattr(value, '__array_function__'))
    )

Should I treat them like multi-dimensional arrays or leave the current behavior for consistency with the snippet above?

If the default value is NaN, we could reuse xarray's pre-existing sentinel value for NA:

Thanks for the advice, I'll be using this.

@max-sixty (Collaborator):

Should I treat them like multi-dimensional arrays or leave the current behavior for consistency with the snippet above?

That's a good point. I think in this case, given that it's passed to an arg expecting an array, we should raise on 0-d. I realize that's a bit inconsistent with treating them as scalars elsewhere.

Happy to be outvoted if others disagree

@griverat (Contributor, author) commented Aug 8, 2019:

That's a good point. I think in this case, given that it's passed to an arg expected an array, we should raise on 0d.

I was expecting to rely on the current implementation of is_scalar to do the type checking, since I'm moving _check_data_shape above as_compatible_data to do something like this:

if utils.is_scalar(data) and coords is not None:

Otherwise everything would be filtered out, since as_compatible_data returns a 0-d array given a scalar value:

# validate whether the data is valid data types
data = np.asarray(data)

The only alternative I can think of is copying is_scalar but removing getattr(value, 'ndim', None) == 0, so that the duplication only happens for true scalars and 0-d arrays are filtered out.
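
A sketch of the variant being described, where the 0-d-array branch of is_scalar becomes opt-in. This mirrors the snippet quoted above but is simplified: the dask array types are omitted to keep it self-contained, and it is not the PR's final code.

from collections.abc import Iterable
from typing import Any


def is_scalar(value: Any, include_0d: bool = True) -> bool:
    """Whether to treat a value as a scalar.

    Any non-iterable, string, or (optionally) 0-D array.
    """
    zero_d = include_0d and getattr(value, "ndim", None) == 0
    return (
        zero_d
        or isinstance(value, (str, bytes))
        or not (isinstance(value, Iterable) or hasattr(value, "__array_function__"))
    )


# With include_0d=False, np.array(1.0) is no longer treated as a scalar,
# so only true scalars get duplicated to fill the coords' shape.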

@shoyer (Member) commented Aug 8, 2019 via email.

@max-sixty (Collaborator) left a comment:

I think this is looking great!

@shoyer any final thoughts?

@max-sixty (Collaborator):

Great - will merge unless anyone has final comments?

@max-sixty (Collaborator):

Great, will merge later unless other comments. Will be good to get this in!

@max-sixty (Collaborator):

Great, looking good @DangoMelon - I merged master to resolve a conflict - then we can get this in!

@griverat (Contributor, author):

Great, looking good @DangoMelon - I merged master to resolve a conflict - then we can get this in!

Thanks for the help! I'm glad to contribute to this project.

@max-sixty (Collaborator):

Test failure is the same as on master, ref #3265

@max-sixty merged commit 3c020e5 into pydata:master on Aug 26, 2019
@max-sixty (Collaborator):

Thanks @DangoMelon! I know this PR was a long road, but it's a material improvement to the ergonomics of xarray. We'd enjoy having you as a contributor in the future!

dcherian added a commit to dcherian/xarray that referenced this pull request Aug 26, 2019
* upstream/master:
  Initialize empty or full DataArray (pydata#3159)
  Raise on inplace=True (pydata#3260)
  added support for rasterio geotiff tags (pydata#3249)
  Remove sel_points (pydata#3261)
  Fix sparse ops that were calling bottleneck (pydata#3254)
  New feature of filter_by_attrs added (pydata#3259)
  Update filter_by_attrs to use 'variables' instead of 'data_vars' (pydata#3247)
@griverat deleted the add-init-val-darray branch on August 27, 2019, 16:28

Successfully merging this pull request may close the following issue:

Allow passing a default value (instead of ndarray) for data argument for DataArray