Getting data in the binder #94
Comments
@openradar/erad2024-team, we need to download the data in advance for the QPE exercise, and we thought of doing it through a script in the appendix.txt. Are you OK with that, or do you have other suggestions? We do not yet have a way to read data from S3 buckets directly in Pyrad. |
@mgrover1 and I have created an analysis-ready version of the dataset, which is stored in the bucket. It can be read using This is the code that you need to open the dataset:
import s3fs
from xarray.backends.api import open_datatree
URL = 'https://js2.jetstream-cloud.org:8001/'
path = 'pythia/radar/erad2024'
fs = s3fs.S3FileSystem(anon=True, client_kwargs=dict(endpoint_url=URL))
file = s3fs.S3Map(f"{path}/zarr_radar/erad_2024.zarr", s3=fs)
dt = open_datatree(file, engine='zarr', consolidated=True)
Then it will load a |
I don't think it works for us. We need access to the original files, and we thought it would be good to have them available in advance so that we do not need to download lots of data on the fly. The QPE exercise needs more data than most demonstrations. |
@jfigui We can't bake >2 GB of data into the image, see #79. The jetstream s3 bucket is located alongside the pythiahub IIUC (at least in the same datacenter), so transfer times would be negligible. You can utilize |
@jfigui - do you have an FTP server or Google Drive link to the data? |
We need to process blocks of data stored locally at the moment. Is there a way to mount a drive in the image? |
We can also restrict the size of our data by removing some unnecessary data, possibly enough to put it on GitHub. |
@mgrover1 Please correct me if I'm wrong. The bottleneck is the image size, which we should keep as small as possible. If deployed on the pythia binder hub, there is a larger storage per instance. Please add the data to the jetstream s3 bucket, where you can fetch it into the running notebook. I'd strongly vote against putting large data (>tens of MB) into github or docker image. |
@jfigui - the bucket storing the data is located on the same network as the Jupyter/Binder hub. If that works for you, I can add a bash script at the top of the notebook you are using for the exercise that pulls from the bucket. I can assure you the transfer speeds will be quick. This will make the course more reproducible, with an opportunity to migrate this over to a more permanent space such as the radar cookbook. Where is the data currently located? Has it been used in previous repositories? |
We should have all the original files we need for the exercise stored "locally" before Pyrad runs. If you think we can do that fast enough then that is fine. |
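For reference, pulling the original files into the running instance before Pyrad starts could look roughly like the sketch below. It is only an illustration, not the script that was used for the course: the helper names `fetch_to_local` and `open_bucket` are made up here, and the sub-folder holding the raw files inside `pythia/radar/erad2024` is not given in this thread, so it is left as a parameter. The endpoint URL and the anonymous `s3fs.S3FileSystem` setup are taken from the snippet earlier in the thread; `find` and `get` are the standard fsspec filesystem methods.

```python
import os
import posixpath


def fetch_to_local(fs, remote_prefix, local_dir):
    """Copy every file under remote_prefix into local_dir.

    fs is any fsspec-style filesystem exposing find() and get().
    remote_prefix is assumed to be given without a protocol, e.g. "bucket/folder".
    Files that already exist locally are skipped, so re-running the cell is cheap.
    Returns the list of local paths that were newly downloaded.
    """
    prefix = remote_prefix.rstrip("/")
    fetched = []
    for remote_path in fs.find(prefix):
        # Keep the folder layout below the prefix when writing locally.
        relative = posixpath.relpath(remote_path, prefix)
        local_path = os.path.join(local_dir, relative)
        if os.path.exists(local_path):
            continue
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        fs.get(remote_path, local_path)
        fetched.append(local_path)
    return fetched


def open_bucket(endpoint_url="https://js2.jetstream-cloud.org:8001/"):
    """Anonymous connection to the Jetstream bucket (same settings as in the thread)."""
    import s3fs  # imported lazily so fetch_to_local has no hard dependency on s3fs

    return s3fs.S3FileSystem(anon=True, client_kwargs=dict(endpoint_url=endpoint_url))


# Usage at the top of a notebook (the raw-data folder name is an assumption):
# fs = open_bucket()
# files = fetch_to_local(fs, "pythia/radar/erad2024/<raw-data-folder>", "data")
```

Skipping files that are already on disk means the cell can be re-run after a kernel restart without transferring the data again.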
we do not use xradar yet in Pyrad |
The original C-band data is fetched into the running instance from the jetstream bucket in 15.5 s. |
@jfigui Just test https://openradarscience.org/erad2024/intro-data-access in the running instance for yourself. @mgrover1 If you find the time when stress testing the machine on Friday, please test the download with that notebook. But even if it takes a minute, this is still sufficient. |
OK, if it is on the order of 1 minute or so, I think we can deal with it :) |
I quickly looked back into my snippets and found out that the past solutions I was using to read Zarr stores on cloud buckets do not work well with recent Zarr versions. I found a new solution to open the Zarr store directly using Furthermore, opening the datatree is extremely slow (it often hangs when I try). Maybe @aladinor has something to add on this topic. Below I provide the code to reproduce the solutions and problems.
|
I have been working on I ran your code in my xarray version (2024.7.1....) and these are the results. I think we should install xarray directly from the GitHub repo:
python -m pip install git+https://github.com/pydata/xarray.git
Can we add this version to the binder, @kmuehlbauer? |
I recall having seen such a PR, so I updated to xarray 2024.7.0. But now that I check, it looks so strange: with |
I'd suggest to fire up a binder instance and debug directly there. |
Ah but your PR @aladinor is not included in xarray |
I am not sure if there will be a new release soon... therefore, I suggest to install it directly from the repo |
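When debugging in a binder instance, a quick way to see whether the environment picked up the repo install or a regular release is to inspect the installed version string. The sketch below is a minimal illustration with made-up helper names (`installed_version`, `looks_like_dev_build`); it assumes that a build installed from the Git repository reports a version containing `.dev`, which is how setuptools_scm-style versions usually look, so treat the check as a heuristic.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def looks_like_dev_build(version):
    """Heuristic: development builds usually carry a '.dev' segment in the version."""
    return ".dev" in version


version = installed_version("xarray")
if version is None:
    print("xarray is not installed in this environment")
elif looks_like_dev_build(version):
    print(f"xarray {version} looks like a build from the repository")
else:
    print(f"xarray {version} looks like a regular release")
```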
Yes, makes sense! I just tried it out too, and now opening the datatree is muuuch faster. Good work!!! |
We should add an install to that specific branch 👍 |