Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add numpy interface #210

Merged
merged 3 commits into from
Jun 11, 2021
Merged

Conversation

erikmannerfelt
Copy link
Contributor

As was discussed GlacioHack/xdem#145, allowing numpy functions directly on a raster object would be very nice to have (it would just point to self._data):

mean = np.mean(raster)

raster += np.ones_like(raster)

This turned out to be a quite simple fix! numpy has the __array_interface__ which is used for all of their functionalities. So I just added a property that first loads the data if needed and then provides the __array_interface__ of self._data. As you can see in the tests, it seems to work quite well!

@erikmannerfelt erikmannerfelt self-assigned this Jun 11, 2021
@erikmannerfelt erikmannerfelt added the enhancement Feature improvement or request label Jun 11, 2021
@erikmannerfelt
Copy link
Contributor Author

Oh and I also updated the versions and removed an unused dependency in the pre-commit config, and black changed a tiny thing in geovector.py

@atedstone
Copy link
Member

Superb! 👍

@erikmannerfelt erikmannerfelt merged commit 19ec8d8 into GlacioHack:main Jun 11, 2021
@erikmannerfelt erikmannerfelt deleted the numpy_interface branch June 11, 2021 09:56
@property
def __array_interface__(self) -> dict[str, Any]:
if self._data is None:
self.load()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really load it? Maybe it's not wanted by the user (some cropping or loading only a single band might be needed instead). I would rather raise an error.

@adehecq
Copy link
Member

adehecq commented Jun 11, 2021

Fantastic! It will be really useful.
I only have some concern regarding the case where no data is loaded. With large rasters, I often just load the metadata, and possibly do a subset before loading. I wouldn't want to load the entire array "by accident".

@atedstone
Copy link
Member

I'm gonna disagree with you there @adehecq ;) I think that applying a numpy function to a Raster is a pretty clear indication that you expect the data to be loaded. I think that this 'lazy' approach, where we load the data in for the user automatically, is pretty sensible.
Having said that, if in the future we can support dask arrays (hahahah) then we would need to rethink this.

@adehecq
Copy link
Member

adehecq commented Jun 11, 2021

I'm gonna disagree with you there @adehecq ;) I think that applying a numpy function to a Raster is a pretty clear indication that you expect the data to be loaded. I think that this 'lazy' approach, where we load the data in for the user automatically, is pretty sensible.
Having said that, if in the future we can support dask arrays (hahahah) then we would need to rethink this.

I also disagree :-) The reason why this implementation was brought up was that, in xdem functions one could provide the raster instead of a numpy array. In these cases, it might not be so obvious that at some point all the data will be loaded into memory, to for example just calculate a mean.

A typical use case would be:
I load the metadata to get some georeferencing info etc.
I do some other stuff, including e.g. finding the pixels that overlap with my area of interest.
I forget to load the subset.
I run a function (like xdem.coreg) that uses the raster as a numpy array underneath -> I overload my memory... :-(

I know, it seems quite specific, but

  1. I could see myself do this quite often
  2. it does not cost much to raise an error that says "Run self.load" rather than assuming the user wants to load the entire data set in memory. This is already done in other instances: https://github.com/GlacioHack/GeoUtils/blob/19ec8d8810f2ab2dc59c412cce0d4148cf768b37/geoutils/georaster.py#L377 or https://github.com/GlacioHack/GeoUtils/blob/19ec8d8810f2ab2dc59c412cce0d4148cf768b37/geoutils/georaster.py#L958
  3. If people load the metadata only, this is for good (memory) reason and they are necessarily aware of the self.load function and how to use it.

@atedstone
Copy link
Member

Alright, I think I can see the issue, although it remains a bit difficult for me to imagine! But easy either way on this.

Copy link
Contributor

@rhugonnet rhugonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is amazing indeed, well done 👍
For the self.load() issue brought up by Amaury, I think I agree more with @atedstone and would encourage a lazy approach. It's the responsibility of the user to ensure he does not perform any calculation if he does not want the data to be loaded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature improvement or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants