
Does graphchain make sense for persisting partial computations of xarrays as zarr? #93

Open
SohumB opened this issue Aug 17, 2022 · 0 comments


Hi there!

I'm trying to evaluate graphchain for our use case, and I'm not sure I fully understand the model. I'd appreciate help clarifying, if possible!

We have some relatively simple dask computations, backed by climate science zarr datastores, from which we eventually want to compute location-specific metrics. At any given time or compute step, only a small subset of these metrics will be needed. Ideally, we'd like to persist and cache whichever subset of metrics we've already computed, but we don't want to precompute and cache metrics for the entire world.

As I understand it, graphchain will examine the dask compute graph, identify nodes as caching opportunities, and persist them. I think this means that if a subset operation happens late in the compute graph, dask moves that subset op back up the graph, and graphchain then starts by looking at the already-subsetted computation graph. Is that right?
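To make the question concrete, here is a toy sketch of the culling behaviour described above. It is not dask's or graphchain's actual code: `load_chunk`, `metric`, `identity`, `dependencies`, and `cull` are hypothetical helpers written only for illustration, and the graph is a plain dict in the style dask uses (a key maps to a tuple of a callable plus its arguments).

```python
def load_chunk(i):
    """Hypothetical stand-in for reading one chunk from a zarr store."""
    return [i * 10 + j for j in range(3)]


def metric(chunk):
    """Hypothetical stand-in for a per-chunk metric."""
    return sum(chunk)


def identity(x):
    """Hypothetical stand-in for the late subset/selection step."""
    return x


def dependencies(task, graph):
    """Return the set of graph keys that a task refers to."""
    if isinstance(task, tuple) and task and callable(task[0]):
        deps = set()
        for arg in task[1:]:
            deps |= dependencies(arg, graph)
        return deps
    if isinstance(task, list):
        deps = set()
        for item in task:
            deps |= dependencies(item, graph)
        return deps
    try:
        return {task} if task in graph else set()
    except TypeError:
        # Unhashable literal arguments cannot be keys.
        return set()


def cull(graph, wanted):
    """Keep only the tasks that the wanted keys transitively depend on."""
    keep = {}
    stack = list(wanted)
    while stack:
        key = stack.pop()
        if key in keep:
            continue
        keep[key] = graph[key]
        stack.extend(dependencies(graph[key], graph))
    return keep


# A "whole world" graph with four chunks, then a subset that needs only chunk 2.
graph = {}
for i in range(4):
    graph[("load", i)] = (load_chunk, i)
    graph[("metric", i)] = (metric, ("load", i))
graph["subset"] = (identity, ("metric", 2))

culled = cull(graph, ["subset"])
print(len(graph), "tasks before culling,", len(culled), "after")
```

If graphchain only ever sees something like `culled`, then only the tasks for the requested chunk would be candidates for caching, which is the behaviour I'm hoping for.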

Graphchain then persists this computed node, by default as a joblib dump, but you can write custom serializers that do something different when they see a dask dataframe. So what happens if I use to_zarr in a custom serializer? Are the dataframes we're talking about here already subsetted, or would calling to_zarr on them force them and compute the entire globe's worth of metrics? The latter can't be the case, since to_parquet would do the same. So does the key then include the chunk being persisted? And are those chunks the same as zarr chunks, if only by accident because our input is a zarr datastore?
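The "does the key include the chunk" part of the question can be sketched as follows. This is a toy model under my own assumptions, not graphchain's hashing scheme: `task_hash` and `is_key` are hypothetical, and the sketch assumes a task's cache key is derived from its function name, its literal arguments, and the keys of its dependencies.

```python
import hashlib


def load_chunk(i):
    """Hypothetical stand-in for reading one chunk."""
    return [i * 10 + j for j in range(3)]


def metric(chunk):
    """Hypothetical per-chunk metric."""
    return sum(chunk)


def metric_v2(chunk):
    """Hypothetical changed version of the metric."""
    return max(chunk)


def is_key(arg, graph):
    """True if arg names another task in the graph."""
    try:
        return arg in graph
    except TypeError:
        return False


def task_hash(key, graph):
    """Assumed scheme: hash the function name, literals, and dependency hashes."""
    func, *args = graph[key]
    parts = [func.__name__]
    for arg in args:
        if is_key(arg, graph):
            parts.append(task_hash(arg, graph))
        else:
            parts.append(repr(arg))
    return hashlib.sha256("|".join(parts).encode()).hexdigest()


graph = {}
for i in range(4):
    graph[("load", i)] = (load_chunk, i)
    graph[("metric", i)] = (metric, ("load", i))

# One cache key per chunk-level task.
hashes = {i: task_hash(("metric", i), graph) for i in range(4)}

# Changing the metric for chunk 2 changes that task's key but not its input's key.
changed = dict(graph)
changed[("metric", 2)] = (metric_v2, ("load", 2))
```

Under that assumption, each chunk-level task gets its own stable key because the chunk index flows into the hash through the `load_chunk` argument, so a cached result for one chunk could be reused without touching the others. Whether graphchain's real keys line up with zarr chunks is exactly what I'm unsure about.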

I get the feeling that I'm not quite understanding something here. I'll be running some experiments of my own to figure out what graphchain is doing, but I figured I might as well ask the question too!
