Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dataset::read as #374

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

julienr
Copy link
Contributor

@julienr julienr commented Feb 26, 2023

  • I agree to follow the project's code of conduct.
  • I added an entry to CHANGES.md if knowledge of this change could be valuable to users.

This adds a Dataset::read_as function that is the dataset counterpart to RasterBand::read_as. and reads the whole raster at once. This avoids requiring multiple calls to RasterBand::read_as.

@julienr julienr force-pushed the add_dataset_read_as branch from 0a689a2 to 9cdcb52 Compare February 26, 2023 15:49
window_size: (usize, usize),
buffer_size: (usize, usize),
e_resample_alg: Option<ResampleAlg>,
) -> Result<Buffer3D<T>> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here is heavily inspired by RasterBand::read_as since a lot of the logic is similar. The main change is that the RasterIO function on dataset needs some direction about which bands to read ( the bands param).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow this comment, but I'm wondering, is it not possible to specify which bands to read? IOW, is reading all the bands the only option here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's definitely possible. I'll try to expose this in a way that make it easy to read all bands if you don't want to bother, but also gives you the power to be more specific.

src/dataset.rs Outdated
T::gdal_ordinal(),
band_count as i32,
bands.as_mut_ptr() as *mut c_int,
// We want to read a HWC
Copy link
Contributor Author

@julienr julienr Feb 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit arbitrary, but it seems to make more sense than the default CHW because other rust image library (e.g. the png crate) expect image data a HWC. Happy to get feedback / revise this though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with this terminology. Does "CHW" mean "channel/height/width" and "HWC" mean "height/width/channel"?

How does this compare to how Rasterio handles reading multiple bands at once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the jargon. Yes indeed CHW is for channel/height/width.

How does this compare to how Rasterio handles reading multiple bands at once?

Do you mean the C++ GDALDataset::RasterIO API or the python rasterio project ? I'll try to answer for both:

  • GDALDataset::RasterIO lets you specify this through the nPixelSpace, nLineSpace, nBandSpace params
  • rasterio DatasetReader.read will return data in channel-first format (channel/heigh/width). Numpy makes it trivial to change that though (call data.transpose(1, 2, 0) ). I feel this is going to be a bit more involved in rust so having a good default would make sense - but maybe with the option to change it.

src/dataset.rs Outdated
/// * `buffer_size` - the desired size of the 'Buffer'
/// * `e_resample_alg` - the resample algorithm used for the interpolation. Default: `NearestNeighbor`.
///
/// The returned buffer contains the image data in HWC order.
Copy link
Contributor

@metasim metasim Mar 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think "HWC" is common jargon in the GDAL/EO world... Do you mean "rows/cols/band" order?


#[derive(Debug)]
pub struct Buffer3D<T: GdalType> {
pub size: (usize, usize, usize),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document how to interpret the size/shape?

window_size: (usize, usize),
buffer_size: (usize, usize),
e_resample_alg: Option<ResampleAlg>,
) -> Result<Buffer3D<T>> {
Copy link
Contributor

@metasim metasim Mar 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking out loud here: Why don't we just expose RasterIO instead? If you want to read all the cells in rows/cols/band order, this gets the job done, but how common is that use case compared to needing something more flexible?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also wondering if this should be paired with exposing AdviseRead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points. One thing I wasn't sure is if georust/gdal aims to be mostly a thin wrapper around GDAL C++ API or focus more on ergonomics (like the rasterio python project).

For example currently RasterBand doesn't expose GDALRasterBand::RasterIO but instead exposes a number of read_as function that also make some choices (e.g. they return images in width / height format and I don't think they offer the possibility to get it in height/width format.

I do like the idea of exposing RasterIO (and AdviseRead possibly). Then I feel it would make sense to also expose these on the RasterBand type then.

And then maybe still keep some high-level read_as functions that simplify common usage patterns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I wasn't sure is if georust/gdal aims to be mostly a thin wrapper around GDAL C++ API or focus more on ergonomics (like the rasterio python project).

I'm not sure about that either! 😅 Actually, it's more like a thin wrapper around the C API, with inspiration from the C++ API, while leveraging safety benefits of Rust when possible. I personally would like to focus more on the ergonomics, but there are good, well-considered arguments against that. There have been lots of cooks in this kitchen, and lacking a project BDFL, my default position is to try to be consistent, yet look for ways burn-down friction and entropy along the way.

RasterBand doesn't expose GDALRasterBand::RasterIO but instead exposes a number of read_as function that also make some choices

I'm too ignorant of the project history to offer an explanation for that. It probably best scratched an itch at the time.

they return images in width / height format and I don't think they offer the possibility to get it in height/width format.

This may be a simple situation of convention. I've worked on a number of different geospatial raster processing systems, and I can't think of one that doesn't assum row-major order.

I should also note that if ndarray is turned on, you have all the same shape manipulating view capabilities as you do in numpy.


Something that might help is to understand the use case for this PR. Is calling RasterBand::read* multiple times to construct your tensor a real bottleneck? Or is this just a convenience? In my work code, we do that and assemble the results into the shape we want using ndarray methods. Duplicating all the read* methods on Dataset has some code smell to it, except that there's precedence in the C++ library. But my beef with it gets back to your original question: is the goal just a wrapper, or do we aim for better ergonomics?

Copy link
Contributor Author

@julienr julienr Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good questions, let me try to give some more motivations.

The reason I started working on this is because I was working on a xyz tiler service (similar to rio-tiler) to serve XYZ tiles to a web frontend from cloud-optimized GeoTiffs stored on a blobstore. This is a case where I always want to read all bands of the tiff because either the tiffs are RGB or they have more than 3 bands but then I'm gonna do some band combinations to turn into 3 bands for visualization.

I do already have a similar service in Python and I'm using gdal.Dataset.ReadAsArray from the GDAL python bindings and so I was looking for the same thing in rust. Since this function is also available in the GDAL C API (GDALDatasetRasterIO), it felt quite natural to have it in the rust bindings.

Now, granted I could do this by reading band-by-band and recombining with ndarray, but:

  • I'd like to avoid the ndarray dependency since I'm not using it elsewhere
  • Even with ndarray, at some point when I want to turn the ndarray into a height x width x channel buffer to write to a PNG encoder, this will involve an extra copy I feel
  • Without ndarray, the logic of recombining the bands in an image - even if simple - feels kind of cumbersome to write and something that should be provided by the GDAL wrapper
  • I haven't benchmarked this in a while, but I think reading band by band is less efficient than reading a whole band at once, depending on the compresion/interleaving of the TIFF file. In particular, when using JPEG compression, reading band by band will still require reading and uncompressiong all the bands and discard everything but the requested band. If I understand correctly, this is what this comment in the GDAL code is referring to
    • I am also not sure if reading band by band could cause one network request (when reading from a blobstore) for each band instead of one request for all bands when reading by block. GDAL does have some caching logic, so I'm unsure about this one.

All in all, if the goal is for this crate to be a wrapper around the C API (that seem like a good guideline to me) but inspired by the GDAL C++ API, then I feel exposing Dataset::RasterIO make sense because it's available on both of those API.

I get the code duplication argument though, but then I would almost argue that if we have to pick one function to expose, it should be RasterIO on the Dataset and not on the RasterBand. Because the Dataset one lets you do a single band read (by passing the band) while the opposite is not true (you can't read the whole dataset from RasterBand).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@julienr Interesting and helpful context. Thanks for taking the time to write it up! I hope I didn't come across as questioning your need; I just thought for the sake of one of the other contributors having the use case fleshed out would help with thinking about the best long-term API.

I have a few additional reflections on the above, in case it's helpful (ignore if not), but I'll need to defer to a more senior contributor to determine the appropriate direction.

I'd like to avoid the ndarray dependency since I'm not using it elsewhere

We have a rust-based XYZ tiler as well, and I avoided ndarray for a while, but found it's worth the price in the end. It's just too damn convenient. And it's impressively efficient, at least in our use cases. The API has a learning curve, but being able to re-shape things and zip arrays (and even try out parallelization with rayon) has been worth it to us.

at some point when I want to turn the ndarray into a height x width x channel buffer to write to a PNG encoder, this will involve an extra copy I feel

FWIW, we create an in-memory Dataset where we assemble the bands, apply color correction, etc., and then use the built-in GDAL PNG encoder to render. Negligible overhead compared to the S3 network transfer. And GDAL appears to take care of any reshaping needed.

the logic of recombining the bands in an image - even if simple - feels kind of cumbersome to write and something that should be provided by the GDAL wrapper

I agree, but even with the C/C++ API, GDAL doesn't help much, unless you start looking at GDAL Warp and the pixel functions. I aspire to write some wrappers for that API, but I've run away from the dragons therein several times 😅! But maybe I'm missing a part of the API here. 🤔 I've found ndarray more of a help here.

I haven't benchmarked this in a while, but I think reading band by band is less efficient than reading a whole band at once

That may be so. I wonder if it depends on how big your blocks are, and if allocating B x W x H contiguous memory is more efficient than B x (W x H). 🤔

In particular, when using JPEG compression, reading band by band will still require reading and uncompressing all the bands and discard everything but the requested band.

Maybe I'm wrong, but if it's truly a COG format with TILED=YES, I don't think this is the case. We're certainly not seeing that behavior in our system, in our use cases. We also use the Kakadu JP2K driver and haven't had problems with range reads as long as it's internally tiled. But I can't claim comprehensive experience or analysis here. It's definitely been "fast enough" and definitely isn't downloading the full raster.

I feel exposing Dataset::RasterIO make sense because it's available on both of those API.

Based on your arguments, I agree. But I'd go all the way and expose as much of the API as you can without leaking unsafe.

I get the code duplication argument though,

I think it's a minor consideration. We could consider traits for a common interface and share functions as we're able, but a lot of this is just baked into GDAL.

@metasim
Copy link
Contributor

metasim commented Mar 4, 2023

I'm not qualified to approve this PR, but a couple other questions that come to mind are:

  • What about when ndarray is enabled?
  • Is Buffer3d absolutely necessary? Should Buffer be extended to handle the multiple dimension case?
  • Once you have a Buffer3d, how does one know how to traverse it? Do we need to provide traversal semantics?
  • Should we be looking to rasterio for ergonomic inspiration?
  • Do we need to consider any overlap or coherence with the MDArray API, or is the use case completely different?

@julienr
Copy link
Contributor Author

julienr commented Mar 5, 2023

@metasim thanks for the thorough review and the comments ! I'll convert this to draft because I think some more feedback/discussion on the API design is required to make this good.

What about when ndarray is enabled?

Should we adopt the same strategy as for RasterBand which has a read_as_array function for this case ?

Is Buffer3d absolutely necessary? Should Buffer be extended to handle the multiple dimension case?

I need to think about this a bit more. I feel this would require turning the size of Buffer into a (usize, usize, usize) and this would have the impact that you always have to think about the 3rd dimension being 0 when using this as a 2D buffer. Obviously you can achieve flexible nd buffer types, but are when then re-implementing ndarray ?

Once you have a Buffer3d, how does one know how to traverse it? Do we need to provide traversal semantics?

Indeed, this isn't well specified. It isn't well specified currently for Buffer either though (happy to tackle that as well, but it feels like a slightly different issue).

(I feel I've adressed the other 2 questions in my other comments)

@julienr julienr marked this pull request as draft March 5, 2023 13:23
@metasim metasim requested review from lnicola and jdroenner March 8, 2023 16:22
@lnicola
Copy link
Member

lnicola commented Mar 10, 2023

So I haven't made up my mind on the API here, but I think we should offer some API for it. Pixel-interleaved images are a good argument (although the data will probably be cached unless you're reading too much at once), and in general I'd like us to have an equivalent for every GDAL API.

@julienr
Copy link
Contributor Author

julienr commented Mar 11, 2023

Thanks for the input @metasim and @lnicola .

I feel the next step would be for me to give a shot at exposing Dataset::RasterIO in a manner that's closer to the C API than what I did here. I'll do that.

Would you rather have me do a new MR or shall I just keep updating this one ?

@metasim
Copy link
Contributor

metasim commented Mar 20, 2023

@julienr

Would you rather have me do a new MR or shall I just keep updating this one ?

I think doing it in this PR is probably best... we already have the conversation going here.

@julienr julienr force-pushed the add_dataset_read_as branch from 9cdcb52 to a11a53a Compare July 19, 2023 18:37
@julienr
Copy link
Contributor Author

julienr commented Jul 19, 2023

I've pushed an updated version that expose an API that is very close to GDAL RasterIO (including choosing the interleaving) without sacrificing safety.

The only thing I haven't adressed (yet) is the ndarray part but that could be done in a follow up MR by adding Dataset::read_as_array following the same logic as RasterBand. Happy to add it in this MR if you feel that's better.

I also see that the CI is failing, but I think this isn't this branch fault but more than recent clippy is more nitpicky that old versions.

@julienr
Copy link
Contributor Author

julienr commented Jul 25, 2023

Rebasing on #426 should fix the linting

julienr added 4 commits July 28, 2023 08:48
This is similar to RasterBand::read_as, but it is reading the whole
image at once. Similar to how GDAL has RasterIO on both the dataset
and band level.
@julienr julienr force-pushed the add_dataset_read_as branch from a11a53a to 76fdac5 Compare July 28, 2023 06:48
@julienr julienr marked this pull request as ready for review July 28, 2023 06:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants