Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use windowed writing in to_raster #9

Closed
snowman2 opened this issue Jun 25, 2019 · 16 comments · Fixed by #210
Closed

Use windowed writing in to_raster #9

snowman2 opened this issue Jun 25, 2019 · 16 comments · Fixed by #210
Labels
proposal Idea for a new feature.

Comments

@snowman2
Copy link
Member

snowman2 commented Jun 25, 2019

From: #8

The only other helpful long term solution would be to use windowed writing like we do in trollimage to achieve better performance with dask (saving chunk by chunk), but this has its own issues we've discovered with GDAL's in-memory caching and making large output files if you don't have a lot of memory available.

ping @djhoese

@snowman2
Copy link
Member Author

Currently dask is not a dependency to rioxarray.

If I understand correctly, it seems this is how you are achieving this in trollimage here: https://github.com/pytroll/trollimage/blob/63fa32f2d40bb65ebc39c4be1fb1baf8f163db98/trollimage/xrimage.py#L371

With the use of RIOFile: https://github.com/pytroll/trollimage/blob/63fa32f2d40bb65ebc39c4be1fb1baf8f163db98/trollimage/xrimage.py#L63-L138

I am wondering if this could be an option to toggle on and will only work if you have dask installed.

@djhoese
Copy link
Contributor

djhoese commented Jun 25, 2019

Yes. Dask's store function expects a destination that supports numpy-like array indexing. Our RIOFile helps with that. I'm sure core xarray has some helpful things for this now, they didn't when we first created RIOFile.

There is also the complication of when to open/close the file. This is one of the uglier parts of how Satpy/trollimage does things. Dask has no way of having a task that opens a file, runs a series of tasks (saving array chunks), and then closing the file when they are all done. There are related issues on dask's github but I don't have time to find them right now.

Since trollimage requires dask (because Satpy uses dask) we haven't had to deal with the difficulty of handling numpy and dask arrays in the most efficient way possible.

@snowman2
Copy link
Member Author

Interesting. I will have to dig into it later.

@snowman2
Copy link
Member Author

Reading on the subject leads me to think that using the blockxsize and blockysize in GDAL/rasterio in connection with the dask chunks would be a good idea for reading from/writing to rasters.

  1. If it is a dask array, the chunks property is available and the blockxsize and blockysize should be set based on that property.
  2. If it is not a dask array, the chunk method could be used to convert it into a dask array based on the blockxsize and blockysize properties.

@djhoese
Copy link
Contributor

djhoese commented Jun 26, 2019

In point 1, doesn't blockxsize and blockysize assume tiling? What if I wanted a striped geotiff but was using dask arrays with 2D chunks?

For point 2, what "chunk" method are you referring to?

@snowman2
Copy link
Member Author

For point 1, it does assume tiling. With my current understanding I was thinking about how to optimize, but I am not entirely sure at the moment if it is a good idea or not. How often do you use a striped raster?

For point 2, when writing with dask chunks and you didn't start with a dask array: http://xarray.pydata.org/en/stable/generated/xarray.DataArray.chunk.html

@snowman2
Copy link
Member Author

Also, these could be defaults that could be overridden.

@djhoese
Copy link
Contributor

djhoese commented Jun 26, 2019

Point 1, ok. Since the default for gdal/rasterio is striped rasters that's what I typically do but I give users the option to switch to tiled if they'd like. I'd be ok with more libraries changing the default to tiled but I'd be curious about gdal's reasoning for not doing tiling by default. This would likely remove the issues I mentioned with GDAL's internal caching. The only thing I don't 100% agree with here is that the chunk size be used directly for block size. I think it could be used, but in most cases (in my experience) the dask array chunk size is much larger than "usual" tiled geotiffs (256x256, 512x512, etc); especially if you consider cloud optimized geotiffs as the "preferred" version of a geotiff.

Point 2, ah cool. I didn't know that existed.

@snowman2
Copy link
Member Author

The only thing I don't 100% agree with here is that the chunk size be used directly for block size.

That is definitely a valid concern. In these cases, we would have to have a check for that. The block sizes for a geotiff for overviews are a power-of-two value between 64 and 4096 (https://gdal.org/drivers/raster/gtiff.html#overviews), but I am not sure what it is for normal tiffs. I have read here that the TIFF spec states that tile width (and this goes for height, I presume) must be a multiple of 16.

So, it seems you can get away with larger block sizes as long as they are in the power of 2. But, this does limit the allowed dask chunk sizes when writing to a file. Ideally the dask chunk size and the block size would be the same. Thoughts?

@snowman2
Copy link
Member Author

Another option for optimizing writing would be to call the chunk function beforehand to better match the chunk sizes. But, it would be a good idea to profile that one.

@djhoese
Copy link
Contributor

djhoese commented Jun 26, 2019

Agree with everything you said. Perhaps if this method is defaulting to tiled geotiffs then it could have default block sizes of 256x256 (like GDAL does) and it could rechunk the data before writing it. If you tiled=False then no rechunking is needed.

Brain dump/side note: I wonder if rechunking for a striped geotiff to stripe size chunks would improve geotiff writing performance and final geotiff size. I'll have to try that in satpy some time.

@snowman2
Copy link
Member Author

Sounds good. Looks like we have enough here to tinker with some ideas later. It will be interesting to hear back about the stripsize tests as well.

@snowman2
Copy link
Member Author

This looks interesting: https://github.com/dymaxionlabs/dask-rasterio. Hasn't been updated in a while and only works with poetry. But, seems like a good additional reference.

@djhoese
Copy link
Contributor

djhoese commented Jan 30, 2020

Looks like they do a lot of the same stuff we did in the trollimage package. The main difference is that we allow you to not compute the result right away. This lets you delay the data writing until later which is needed by some of the stuff we do in Satpy. I should probably look at how xarray is handling things nowadays with handling open file objects (and more importantly serializing them).

@snowman2
Copy link
Member Author

snowman2 commented Jan 30, 2020

@snowman2 snowman2 added the proposal Idea for a new feature. label Apr 10, 2020
@snowman2
Copy link
Member Author

Was planning to tinker around with this and noticed the GPL license in trollimage, As such, dask-rasterio is going to have to be the reference point for this since it has a BSD-2 Clause license.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal Idea for a new feature.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants