
Passing: GPM MERGIR #17

Closed · 4 tasks done
ranchodeluxe opened this issue Jan 4, 2024 · 11 comments
Labels: documentation (Improvements or additions to documentation)

ranchodeluxe (Contributor) commented Jan 4, 2024

pangeo-forge/staged-recipes#260

  • Runs on LocalDirectBakery with the prune option
pangeo-forge-runner bake \
    --repo=https://github.com/developmentseed/pangeo-forge-staging \
    --ref="gpm_mergir_gcorradini" \
    --Bake.feedstock_subdir="recipes/gpm_mergir" \
    --prune \
    -f config.py
  • Runs on LocalDirectBakery for all timesteps
pangeo-forge-runner bake \
    --repo=https://github.com/developmentseed/pangeo-forge-staging \
    --ref="gpm_mergir_gcorradini" \
    --Bake.feedstock_subdir="recipes/gpm_mergir" \
    -f config.py 
  • Runs on FlinkOperatorBakery with the prune option
curl -X POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
-H "Authorization: token blablah" \
https://api.github.com/repos/NASA-IMPACT/veda-pforge-job-runner/actions/workflows/job-runner.yaml/dispatches \
-d '{"ref":"main", "inputs":{"repo":"https://github.com/developmentseed/pangeo-forge-staging","ref":"gpm_mergir_gcorradini","prune":"1","feedstock_subdir": "recipes/gpm_mergir"}}'
  • Runs on FlinkOperatorBakery for all timesteps
curl -X POST \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
-H "Authorization: token blablah" \
https://api.github.com/repos/NASA-IMPACT/veda-pforge-job-runner/actions/workflows/job-runner.yaml/dispatches \
-d '{"ref":"main", "inputs":{"repo":"https://github.com/developmentseed/pangeo-forge-staging","ref":"gpm_mergir_gcorradini","prune":"0","feedstock_subdir": "recipes/gpm_mergir"}}'
ranchodeluxe added the documentation label Jan 4, 2024
ranchodeluxe (Contributor, Author) commented Jan 15, 2024

I'm able to run this locally and on Flink, but here's a list of recipe problems that probably need to be fixed before we can run this for the whole ~225k-file archive (also applicable to #15):

  • the biggest challenge: the full archive is ~225k files, and even with runs of <=5k we often run into a whole host of botocore.exceptions.ConnectionError situations. This makes sense, but the pangeo-forge-recipes openers (in this case OpenWithKerchunk) do not handle errors gracefully, log them, and move on, so the whole pipeline fails (see the retry sketch after this list)

  • pangeo-forge-recipes only has the changes we need on main, and there is no cut release for those changes yet, so a release should happen

  • the main changes (compared to 0.10.4) include a breaking change I mentioned here (we can use target_options and remote_options as a workaround for now, but there shouldn't be a reason we need to)
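
As a minimal sketch (not pangeo-forge-recipes' actual API), the error tolerance the first bullet asks for could look like a retry-then-skip wrapper; open_fn and the retry counts here are hypothetical:

import logging
import time

import botocore.exceptions

def open_with_retries(open_fn, url, retries=3, backoff=2.0):
    # open_fn is a stand-in for whatever opener the recipe uses;
    # retry transient connection errors with a growing pause
    for attempt in range(1, retries + 1):
        try:
            return open_fn(url)
        except botocore.exceptions.ConnectionError as err:
            logging.warning("attempt %d/%d failed for %s: %s",
                            attempt, retries, url, err)
            time.sleep(backoff * attempt)
    # log and move on instead of failing the whole pipeline
    logging.error("skipping %s after %d failed attempts", url, retries)
    return None  # callers filter out Nones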

abarciauskas-bgse (Collaborator) commented:

@ranchodeluxe the target dataset was intended to be GPM IMERG, so I have opened a new (draft) PR pangeo-forge/staged-recipes#264, which thankfully has fewer files (8,461). Should we update this issue with that collection or close this and open a new issue?

ranchodeluxe (Contributor, Author) commented:

> @ranchodeluxe the target dataset was intended to be GPM IMERG, so I have opened a new (draft) PR pangeo-forge/staged-recipes#264, which thankfully has fewer files (8,461). Should we update this issue with that collection or close this and open a new issue?

Thanks for doing that 🥳 I'll update this ticket to point to your new PR branch and test it out locally and on Flink

ranchodeluxe (Contributor, Author) commented:

@abarciauskas-bgse: I was creating a PR for pangeo-forge/staged-recipes#264 to fold in the changes on pangeo-forge-recipes that just merged. I thought I'd write a new validator/tester function to make sure the reference file that ConsolidateMetadata outputs works as expected. Here is my updated recipe.

Outcomes:

  1. we can read the reference file fine with zarr.open_consolidated, but...
  2. xr.open_dataset(..., consolidated=True) should work too but doesn't

Filing a ticket
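
A minimal sketch of that validator, assuming the kerchunk reference JSON sits at a hypothetical local path and points at anonymously readable s3 files:

import fsspec
import xarray as xr
import zarr

def validate_reference(ref_path="reference.json"):  # hypothetical path
    # lay a reference filesystem over the kerchunk JSON
    fs = fsspec.filesystem(
        "reference",
        fo=ref_path,
        remote_protocol="s3",
        remote_options={"anon": True},  # assumed anonymous access
    )
    mapper = fs.get_mapper("")

    # outcome 1: zarr reads the consolidated metadata fine
    group = zarr.open_consolidated(mapper)
    print(list(group.array_keys()))

    # outcome 2: the equivalent xarray call should also work (but doesn't)
    ds = xr.open_dataset(mapper, engine="zarr", consolidated=True)
    print(ds)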

ranchodeluxe (Contributor, Author) commented Jan 25, 2024

> @abarciauskas-bgse: I was creating a PR for pangeo-forge/staged-recipes#264 ...

I guess the good news is this mostly works

ranchodeluxe (Contributor, Author) commented Jan 27, 2024

Calling this a success and changing the title to passing, since I've run multiple years on Flink and locally to prove it works. There still seem to be holes for some datasets that result in 404(s), so we'll have to figure out where the holes are incrementally, work around them, and tell folks upstream.
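
A minimal sketch of how those holes could be found incrementally; the URL template below is a made-up stand-in for the real archive layout:

import datetime as dt

import requests

# hypothetical template; the real archive layout will differ
URL = "https://example.com/gpm_mergir/{t:%Y/%j}/merg_{t:%Y%m%d%H}_4km-pixel.nc4"

def find_holes(start, end, step=dt.timedelta(hours=1)):
    # HEAD each expected timestep and collect the ones that 404
    missing = []
    t = start
    while t < end:
        url = URL.format(t=t)
        if requests.head(url).status_code == 404:
            missing.append(url)
        t += step
    return missing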

ranchodeluxe (Contributor, Author) commented Jan 27, 2024

> Calling this a success and changing the title to passing, since I've run multiple years on Flink and locally to prove it works. There still seem to be holes for some datasets that result in 404(s), so we'll have to figure out where the holes are incrementally, work around them, and tell folks upstream.

Seems we have holes after ~15 years of data (which makes sense based on the history of IMERG and TRMM).

This run can do all 14 years in ~9 minutes with parallelism:5: https://github.com/NASA-IMPACT/veda-pforge-job-runner/actions/runs/7679542574
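
For reference, that dispatch re-created in Python, assuming the workflow really does expose a parallelism input (as "parallelism:5" above suggests); the other input names just mirror the curl examples at the top of this issue:

import requests

requests.post(
    "https://api.github.com/repos/NASA-IMPACT/veda-pforge-job-runner"
    "/actions/workflows/job-runner.yaml/dispatches",
    headers={
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
        "Authorization": "token blablah",
    },
    json={
        "ref": "main",
        "inputs": {
            "repo": "https://github.com/developmentseed/pangeo-forge-staging",
            "ref": "gpm_mergir_gcorradini",
            "prune": "0",
            "feedstock_subdir": "recipes/gpm_mergir",
            "parallelism": "5",  # hypothetical input name
        },
    },
)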

abarciauskas-bgse (Collaborator) commented:

> This run can do all 14 years in ~9 minutes with parallelism:5: https://github.com/NASA-IMPACT/veda-pforge-job-runner/actions/runs/7679542574

That's awesome!!! 🥳

ranchodeluxe (Contributor, Author) commented:

@abarciauskas-bgse: profile of LocalDirectRunner.num_workers=1 on JH with this GPM IMERG recipe. So it's the same memory pattern locally (just not as drastic as TRMM), and it possibly runs on Flink only because the current per-worker resourcing is so large that it just works. More to look into. About to run LEAP (which is only StoreToZarr), which will definitely have the same pattern. I have ideas on how to isolate the issue if it is a memory leak and will create a ticket.

[Screenshot 2024-02-04: memory profile of the run]
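
One minimal way to check the leak hypothesis locally, using plain tracemalloc around a stand-in for a recipe batch:

import tracemalloc

def run_recipe_batch():
    # stand-in for running the recipe over a batch of inputs
    ...

tracemalloc.start()
before = tracemalloc.take_snapshot()

run_recipe_batch()

after = tracemalloc.take_snapshot()
# show the ten call sites whose allocations grew the most
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)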

ranchodeluxe (Contributor, Author) commented:

ticket: #32
