
Roadmap towards UC8 #124

Closed
LukeWeidenwalker opened this issue Jun 19, 2023 · 10 comments
@LukeWeidenwalker (Contributor) commented Jun 19, 2023

@clausmichele @aljacob I wanted to follow up on our call last Friday with a roadmap of concrete steps we'll need to finally close UC8.

Last year, because of how the architecture was set up and the lack of a local execution environment, we could only test the whole workflow within the entire running system, which made it difficult to collaborate on this outside of EODC. With the rewrites to the parser and the processes, this should now be much easier, because we can essentially prototype the entire workflow without needing to have it exposed in a fully-fledged backend. As discussed on Friday, I'm summing up here what's still missing, so that @clausmichele can start supporting us again!

Subissues:

  • Support the geometries parameter in load_collection (or load_stac) by loading sparse arrays (https://github.com/pydata/sparse) (Implement load_stac #120, Support geometries in load_stac #121)
    • The problem with this use case was always that we had no way to load only the data that we'd crop to later in aggregate_spatial. I'm attaching my proof-of-concept notebook for how to get around this by leveraging pydata/sparse arrays (super crude - @clausmichele please skim this quickly and let me know if you can work with this already, or if I need to tidy it up and add some more comments!). This has to be done at the point where data is loaded initially, because otherwise any operations between load_collection and aggregate_spatial will run over an extent much larger than necessary. See the first sketch after this list for the general idea.
    • @clausmichele : If you prototype this in load_stac, the rest of the pipeline should be unlocked to be worked on, even if this isn't live at EODC yet!
  • Rewrite aggregate_spatial to work with sparse arrays and produce vector cubes with https://github.com/xarray-contrib/xvec (Port aggregate_spatial to work with sparse arrays and produce vector cubes with xvec #119)
    • When we worked on this last year, we used geopandas+dask_geopandas to represent vector cubes somewhat haphazardly. This is actually not a good idea in the long term, because we'd have to handle raster and vector cubes separately everywhere (think if isinstance(data, gpd.GeoDataFrame): statements all over the place) - which would be annoying and make this unnecessarily hard to maintain!
    • Xvec has since come out, which I find quite promising, because it should allow us to support vector cubes within the xarray paradigm. It has the added benefit that the implementation is directly motivated by Edzer's concept of the vector cube, so we should be quite closely aligned conceptually already. It should also be easier to keep processes dask-friendly that way, rather than relying on the extra dependency of geopandas/dask-geopandas. The second sketch after this list shows what such a vector cube could look like.
    • Disclaimer: I haven't tried xvec yet, so all of this (especially how it behaves with dask) still needs to be confirmed!
  • adapt fit_regr_random_forest to work on xvec vector cubes (Adapt fit_regr_random_forest to work on xvec vector cubes #122)
    • there have also been some developments in the spec for the ML processes, which we should adopt
  • adapt pred_regr_random_forest to work with the new setup (Test pred_regr_random_forest to work this new setup #123)
    • This should already work (there's been an integration test for this for a long time), but it still needs to be verified against the new setup
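To make the first subissue concrete, here's a minimal sketch of the sparse-masking idea (roughly what the attached test_sparse_multipolygons.zip notebook does; the file names and the band name "B04" are placeholders I made up):

```python
# Rough sketch of the sparse-masking idea -- file names, band name and CRS
# handling are hypothetical placeholders, not the actual PoC code.
import geopandas as gpd
import rioxarray  # noqa: F401 -- registers the .rio accessor used below
import sparse
import xarray as xr
from rasterio.features import geometry_mask

geoms = gpd.read_file("fields.geojson")    # polygons passed to load_collection
ds = xr.open_dataset("s2_cube.nc")         # raster cube with dims (time, y, x)

# Rasterize the polygons onto the cube's grid; True = pixel inside a polygon.
inside = geometry_mask(
    geoms.geometry,
    out_shape=(ds.sizes["y"], ds.sizes["x"]),
    transform=ds.rio.transform(),          # assumes x/y spatial dims are set
    invert=True,
)
inside = xr.DataArray(inside, dims=("y", "x"))

# Replace everything outside the polygons with the fill value, then store the
# band as a sparse COO array: the (potentially huge) empty area between
# polygons costs essentially no memory, and later ops only touch kept pixels.
band = ds["B04"].where(inside, 0.0)
band_sparse = xr.DataArray(
    sparse.COO.from_numpy(band.values, fill_value=0.0),  # densifies once; fine for a sketch
    dims=band.dims,
    coords=band.coords,
)
```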
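And for the aggregate_spatial rewrite, this is roughly what an xvec vector cube could look like as process output. Per the disclaimer above, I haven't tried xvec yet, so treat the exact accessor usage (set_geom_indexes) as an assumption to verify; values and shapes here are synthetic:

```python
# Minimal sketch of a vector cube built with xvec -- values are synthetic and
# the accessor usage is an assumption that still needs checking against dask.
import geopandas as gpd
import numpy as np
import pandas as pd
import xarray as xr
import xvec  # noqa: F401 -- registers the .xvec accessor on import

geoms = gpd.GeoSeries.from_wkt(
    ["POLYGON ((0 0, 1 0, 1 1, 0 0))", "POLYGON ((2 2, 3 2, 3 3, 2 2))"],
    crs="EPSG:4326",
)
times = pd.date_range("2018-05-01", periods=4, freq="MS")

# Pretend these are the per-polygon means coming out of aggregate_spatial.
cube = xr.DataArray(
    np.random.rand(len(geoms), len(times)),
    coords={"geometry": list(geoms), "time": times},
    dims=("geometry", "time"),
)

# Promoting the geometry coordinate to a GeometryIndex is what makes this a
# vector cube in the xvec sense: the usual xarray semantics (sel, groupby,
# dask chunking) keep working, so no isinstance() special-casing is needed.
vector_cube = cube.xvec.set_geom_indexes("geometry", crs=geoms.crs)
```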

Let me know if you have any questions on this general plan! As I explained on Friday, we don't have bandwidth to tackle these issues ourselves for at least another 2-week increment, so any progress you're able to make on this will help us out a lot and will be much appreciated. I'll try to make myself available to support and review any pull requests in a timely manner!

cc @SerRichard @ValentinaHutter @christophreimer @bschumac

Attachment: test_sparse_multipolygons.zip

@LukeWeidenwalker added this to the Support for UC8 milestone Jun 19, 2023
@clausmichele (Member) commented:

@LukeWeidenwalker my colleagues are working on the aggregate functionality: ks905383/xagg#48

@SerRichard (Collaborator) commented:

Some open points from the coordination:

  • Test xagg on larger-scale datasets

  • Does it work with projected coordinate systems?

  • Is loading a multipolygon a hard requirement?

  • fit_regr_random_forest is unblocked once we know we will be using xvec as the output of aggregate_spatial

@ValentinaHutter (Collaborator) commented:

> Some open points from the coordination:
>   • Test xagg on larger-scale datasets

Concerning larger-scale datasets:

@clausmichele (Member) commented:

@masawdah implemented filter_spatial in this PR: #170
It is dask-based; we tested it with full S2 tiles. This allows us to crop a large input dataset to the requested polygons and load only the necessary subset, roughly as in the sketch below.
Once this is approved, we will reuse this process to implement aggregate_spatial using the xvec data structure as output.
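The core of the trick, as a simplified sketch (file names and chunk sizes are placeholders, not the actual PR code): slice before computing, so dask only ever materialises the chunks that intersect the requested window.

```python
import geopandas as gpd
import xarray as xr

geoms = gpd.read_file("fields.geojson")  # requested polygons (placeholder file)
ds = xr.open_dataset("s2_tile.nc", chunks={"x": 1024, "y": 1024})  # dask-backed

# Slicing to the polygons' total bounds is lazy: dask loads only the chunks
# that intersect this window, never the full S2 tile.
minx, miny, maxx, maxy = geoms.total_bounds
subset = ds.sel(x=slice(minx, maxx), y=slice(maxy, miny))  # y usually descends
```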

@ValentinaHutter (Collaborator) commented Nov 14, 2023

I now did some tests with filter_spatial and can confirm that it works with our datasets as well. When testing it with one polygon, the dataset gets cropped to the bounds of the polygon and gets smaller. When two polygons that are farther away from each other are used, it also works, but the dataset has lots of NaN values in the area between the two polygons (which is as expected). This is the most important detail for the aggregate_spatial implementation, I think. I will now continue with the implementation of fit_regr_random_forest, hope that's okay!

@masawdah (Member) commented:

Hi @ValentinaHutter, aggregate_spatial is implemented in this PR: #194

As discussed with @clausmichele, if you plan to proceed with the implementation of fit_regr_random_forest, it is important to verify that the output of aggregate_spatial is compatible with the requirements of the other processes. Currently, it is an xvec vector cube in the form of an xarray dataset.

@ValentinaHutter (Collaborator) commented:

Thanks a lot for the quick reply! Will have a look at the PR and test it on our side ASAP!

@ValentinaHutter (Collaborator) commented:

There is now a general implementation of aggregate_spatial in openeo-processes-dask.

However, we found that the spatio-temporal extent for UC8 needs to be treated differently. This is because UC8 starts with load_collection on bbox = [3, 43, 18, 51] and temporal_extent = ["2018-05-01", "2018-09-01"]. This fails for the datasets provided by EODC in the EODC backend. The error is raised as soon as dask tries to create a lazy array, either because one chunk holds too much data or because the task graph is too large; the kernel crashes even before aggregate_spatial is started.
Therefore, we needed to reimplement aggregate_spatial internally to work with subsets of the load_collection outputs, roughly as in the sketch below.
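A simplified sketch of the idea (not our actual code; file names, chunk sizes and the reducer are placeholders): aggregate each polygon over its own small bounding-box subset instead of building one lazy array for the full extent.

```python
import geopandas as gpd
import xarray as xr

geoms = gpd.read_file("uc8_fields.geojson")  # placeholder for the UC8 geometries
ds = xr.open_dataset("eodc_collection.nc", chunks={"x": 2048, "y": 2048})

# Each polygon gets its own small subset and therefore its own small task
# graph, so dask never has to plan over the whole [3, 43, 18, 51] bbox at once.
per_polygon = []
for geom in geoms.geometry:
    minx, miny, maxx, maxy = geom.bounds
    sub = ds.sel(x=slice(minx, maxx), y=slice(maxy, miny))
    # Stand-in for the real reducer; a per-polygon mask (cf. the sparse sketch
    # above) could be applied on top of the bounding-box crop.
    per_polygon.append(sub.mean(dim=("x", "y")))

result = xr.concat(per_polygon, dim="geometry")
result = result.assign_coords(geometry=("geometry", list(geoms.geometry)))
```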

Let me know if anything is unclear or you need a more detailed description of the issue.

@clausmichele (Member) commented:

@ValentinaHutter in my opinion the re-implementation/adaptation should also be shared, so that we understand the internal logic. Additionally, if someone else faces the same issue, it would be useful to have a possible solution tracked here.

@ValentinaHutter (Collaborator) commented:

I will try to sum it up in the coming days - I think this might be a nice thing to discuss at our first openeo-processes-dask meeting next Monday :)
