Add satellite image processing benchmark #1550
Conversation
Still TODO:
- Try with `odc.stac` instead of `stackstac`
- Write output to Zarr instead of persisting to cluster memory
@guillaumeeb @Kirill888 does this look like it captures a common use case for folks processing satellite images?
Thanks @jrbourbeau, this is clearly what I had in mind, and I think it's a very good basic benchmark for imagery and the underlying infrastructure (here, Microsoft Planetary Computer). I just made some comments, but more questions than remarks!
Were you able to run this? Do you have an idea of the data volume you read? Did you have problems with the Datacube being too sparse?
```python
from distributed import wait


def harmonize_to_old(data):
```
Interesting, I knew there was some version change, but I didn't know it could affect the processing. This adds some realism, but it also complicates the benchmark, so it might not be needed...
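For context on that version change, a hedged numpy sketch of the harmonization idea (this is not the PR's actual helper, just an illustration): Sentinel-2 L2A scenes produced with processing baseline >= 04.00 (late January 2022) carry a +1000 offset on their digital numbers, so "harmonizing to old" subtracts that offset from the newer scenes to put old and new imagery on one radiometric scale.

```python
import numpy as np

# Assumed offset introduced by Sentinel-2 processing baseline 04.00.
OFFSET = 1000

def harmonize_to_old(dn, is_new_baseline):
    """dn: raw digital numbers; is_new_baseline: boolean mask marking
    scenes acquired under the newer (offset) processing baseline."""
    out = dn.astype("int32")
    # Clip first so values below the offset floor at zero instead of going negative.
    out[is_new_baseline] = np.clip(out[is_new_baseline], OFFSET, None) - OFFSET
    return out
```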
```python
items = search.item_collection()

# Construct Xarray Dataset from stack items
ds = stackstac.stack(items, assets=["B08", "B11"], chunksize=(400, 1, 256, 256))
```
So here you could add complexity and select other bands for other indices like NDVI, NDSI, NDWI.
Do you have a suggestion on which indices are most common? I see NDVI come up a lot. Is there any computational difference between these indices, or are they all just relatively straightforward reductions like we already have here?
Yeah, I see NDVI as the HelloWorld of satellite image processing. There is no computational difference between these indices; it's always the same formula, you only switch the bands. You probably already know, but NDVI is for vegetation, NDSI for snow, NDWI for water.
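The "same formula" point can be made concrete with a toy numpy sketch (values are made up, not real reflectances): every one of these indices is a normalized difference, and only the band pairing changes.

```python
import numpy as np

def normalized_difference(a, b):
    """The one formula behind NDVI, NDSI, and NDWI: (a - b) / (a + b)."""
    return (a - b) / (a + b)

# NDVI pairs NIR with Red; NDSI pairs Green with SWIR; NDWI pairs Green with NIR.
nir = np.array([0.8, 0.6])  # toy NIR reflectance
red = np.array([0.1, 0.2])  # toy Red reflectance
ndvi = normalized_difference(nir, red)
```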
> So here you can complexify and select other bands for other indices like NDVI, NDSI, NDWI.

Is it common to calculate multiple of these at the same time?
```python
    modifier=planetary_computer.sign_inplace,
)

# GeoJSON for region of interest is from https://github.com/isellsoap/deutschlandGeoJSON/tree/main/1_deutschland
```
How many Sentinel 2 tiles does that represent? Do you have an idea?
`search.item_collection()` returns 2835 items. Is that what you're asking about, or something else?
That was more a question about the number of Sentinel 2 zones in UTM format, but never mind, it's already a good answer. If my calculation is correct, this already means half a TB of data to load.
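A hedged back-of-envelope behind that half-a-TB figure, using the 2835 items reported above and assumed native band shapes (uint16 digital numbers; a full 10 m band like B08 is 10980x10980 px, a 20 m band like B11 is 5490x5490 px). All numbers are assumptions, not measurements:

```python
# Rough data-volume estimate for 2835 items with two bands each.
items = 2835
bytes_per_px = 2                  # uint16 digital numbers
b08_px = 10_980 ** 2              # 10 m band, full Sentinel-2 tile
b11_px = 5_490 ** 2               # 20 m band, full Sentinel-2 tile
total_tb = items * bytes_per_px * (b08_px + b11_px) / 1e12
# total_tb comes out around 0.85 TB, the same ballpark as "half a TB"
```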
```python
ds = harmonize_to_old(ds)

# Compute humidity index
humidity = (ds.sel(band="B08") - ds.sel(band="B11")) / (
```
It would be prettier to have `(ds.B08 - ds.B11) / (ds.B08 + ds.B11)`, but I guess this is the way stackstac works, and it's not a big deal.
Yeah, you're right, this has to do with what `stackstac` returns (a data array). `odc.stac` returns a dataset with the structure you're proposing. Thoughts on `stackstac` vs `odc.stac`?
> Thoughts on stackstac vs odc.stac?

Nope, there is a thread about that on the Pangeo Discourse, I think. Both probably have advantages and drawbacks...
```python
    ds.sel(band="B08") + ds.sel(band="B11")
)
result = humidity.groupby("time.month").mean()
result = result.persist()
```
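For intuition on the reduction above: `groupby("time.month").mean()` collapses the time axis into one mean per calendar month. A toy numpy stand-in with made-up scalar "humidity" values per timestamp:

```python
import numpy as np

# Six timestamps spread over two months (toy data, one scalar per timestamp).
months = np.array([1, 1, 2, 2, 2, 1])
humidity = np.array([0.2, 0.4, 0.1, 0.3, 0.2, 0.6])

# Group by month label and average, mimicking the xarray groupby-mean.
monthly_mean = {int(m): float(humidity[months == m].mean()) for m in np.unique(months)}
# month 1 -> (0.2 + 0.4 + 0.6) / 3 = 0.4; month 2 -> (0.1 + 0.3 + 0.2) / 3 = 0.2
```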
I would also love to see the plot and whether it represents something, typically seasonality at least.
+1
Are these benchmarks running in a stable cloud / region? It'd be nice to find a dataset in the same region if possible (to cut down on egress costs and speed up the I/O portion of the benchmark, which probably isn't relevant for dask). (edit: I see the

On stackstac vs. odc-stac, the main things to be aware of are

I'll give this workload a shot today or tomorrow and will report back.
```python
else:
    time_of_interest = "2015-01-01/2024-09-01"

search = catalog.search(
```
Does Coiled have a spot where we could store a small parquet file? As written, this is hitting the search endpoint every time the benchmark runs, which isn't exercising Dask at all and has the potential to throw errors.
We could instead do the search once and use stac-geoparquet to cache that result.
Yeah, I may need to do a little setup to give the CI runner access to an Azure bucket, but we can definitely do that
I assume that the parquet file is small enough that we can also put it into S3 if that’s easier?
Ah, yeah, fair point
Right now this is running in

That'd be great. I'm happy to chat generally about this. Also, let me know if you need access to a Coiled workspace that's configured for Azure.
Okay, so here's a notebook (https://gist.github.com/jrbourbeau/900b602d19fe8087cafc0490b5c26f68) that runs the same computation using

```python
resolution = 10
SHRINK = 4
resolution = resolution * SHRINK
ds = odc.stac.load(
    items,
    chunks={},
    patch_url=planetary_computer.sign,
    resolution=resolution,
    crs="EPSG:3857",
    groupby="solar_day",
)
```

where I use things like
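The `SHRINK` factor is doing a lot of work here: multiplying the ground sample distance by 4 divides the pixel count, and hence the rough array size, by 16. A quick check of that arithmetic (the tile-edge length is an assumed round number, not taken from the PR):

```python
# Coarsening resolution 10 m -> 40 m over an assumed ~109,800 m tile edge.
native_res, shrink = 10, 4
side_native = 109_800 // native_res              # pixels per tile edge at 10 m
side_shrunk = 109_800 // (native_res * shrink)   # pixels per tile edge at 40 m
ratio = (side_native / side_shrunk) ** 2
# ratio == 16.0: a 16x smaller load per band
```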
Force-pushed from 693f70c to e534038.
I've updated the benchmark to use

Open TODOs:
This PR is ready for review. I haven't bothered too much with caching data from Planetary Computer yet, since I saw some occasional permission issues when the data seems to have gotten stale, but we can tackle that when necessary.
xref #1548