
Getting data in the binder #94

Closed
jfigui opened this issue Aug 27, 2024 · 22 comments

Comments

@jfigui
Contributor

jfigui commented Aug 27, 2024

No description provided.

@jfigui
Contributor Author

jfigui commented Aug 27, 2024

@openradar/erad2024-team ,

We need to download the data a priori for the QPE exercise, and we thought of doing it through a script in the appendix.txt. Are you OK with that, or do you have other suggestions?

We do not have a way to read data from s3 buckets directly yet in Pyrad.

@aladinor
Member

@mgrover1 and I have created an analysis-ready version of the dataset, which is stored in the bucket. It can be read using xarray.open_datatree; you don't need to download any data. Do you think it will work for you @jfigui?

This is the code that you need to open the dataset:

import s3fs
from xarray.backends.api import open_datatree

URL = 'https://js2.jetstream-cloud.org:8001/'
path = 'pythia/radar/erad2024'
fs = s3fs.S3FileSystem(anon=True, client_kwargs=dict(endpoint_url=URL))
file = s3fs.S3Map(f"{path}/zarr_radar/erad_2024.zarr", s3=fs)
dt = open_datatree(file, engine='zarr', consolidated=True)

Then it will load a datatree (dt) with the C-band and X-band nodes, and each node contains all sweeps and all files concatenated along the volume_time dimension. You must have xarray version v2024.07.01 to open this dataset. Please let me know if this works for you.

@jfigui
Contributor Author

jfigui commented Aug 27, 2024

I don't think it works for us. We need access to the original files, and we thought it would be good to have them available in advance so that we do not need to download lots of data on the fly.

The QPE exercise needs more data than the demonstrations do.

@kmuehlbauer
Collaborator

@jfigui We can't bake >2 GB of data into the image, see #79.

The jetstream s3 bucket is located alongside the pythiahub IIUC (at least in the same datacenter), so transfer times would be negligible.

You can utilize fsspec, please see: https://openradarscience.org/erad2024/intro-data-access. There is a pyart example, too.
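The fsspec fetch pattern mentioned above can be sketched as follows. The endpoint URL and bucket path are the ones quoted elsewhere in this thread; the filename is made up for illustration, and fsspec's in-memory filesystem stands in for the s3 bucket so the snippet runs offline.

```python
# Sketch of fetching a file from the bucket into the running notebook with
# fsspec. An in-memory filesystem stands in for s3 so this runs offline;
# for the real bucket you would use something like:
#   fs = fsspec.filesystem("s3", anon=True,
#           client_kwargs=dict(endpoint_url="https://js2.jetstream-cloud.org:8001"))
import os
import tempfile

import fsspec

fs = fsspec.filesystem("memory")
fs.pipe("/pythia/radar/erad2024/example_volume.nc", b"fake radar bytes")  # fake object

# Download one object to local disk before Pyrad (or any reader) runs.
local_path = os.path.join(tempfile.mkdtemp(), "example_volume.nc")
fs.get("/pythia/radar/erad2024/example_volume.nc", local_path)

print(os.path.getsize(local_path))  # size in bytes of the fetched copy
```

`fs.get(..., recursive=True)` can pull a whole prefix the same way, which is the shape a "fetch everything before the exercise" cell would take.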

@mgrover1
Contributor

@jfigui - do you have an FTP server or Google Drive link to the data?

@jfigui
Contributor Author

jfigui commented Aug 27, 2024

We need to process blocks of data stored locally at the moment. Is there a way to mount a drive in the image?

@jfigui
Contributor Author

jfigui commented Aug 27, 2024

We can also restrict the size of our data by removing some unnecessary data, possibly enough to put it on GitHub.

@kmuehlbauer
Collaborator

@mgrover1 Please correct me if I'm wrong. The bottleneck is the image size, which we should keep as small as possible. If deployed on the Pythia binder hub, there is larger storage per instance. Please add the data to the jetstream s3 bucket, from where you can fetch it into the running notebook.

I'd strongly vote against putting large data (>tens of MB) into GitHub or the Docker image.

@mgrover1
Contributor

@jfigui - the bucket storing the data is located on the same network as the jupyter/binder hub. I can add a bash script that pulls from the bucket at the top of the notebook you are using for the exercise, if that works for you. I can assure you the transfer speeds will be quick.

This will make the course more reproducible, with an opportunity to migrate this over to a more permanent space such as the radar cookbook. Where is the data currently located? Has it been used in previous repositories?

@jfigui
Contributor Author

jfigui commented Aug 27, 2024

We should have all the original files we need for the exercise stored "locally" before Pyrad runs. If you think we can do that fast enough then that is fine.

@jfigui
Contributor Author

jfigui commented Aug 27, 2024

We do not use xradar yet in Pyrad.

@kmuehlbauer
Collaborator

The original C-band data is fetched into the running instance from the jetstream bucket in 15.5 seconds.

@kmuehlbauer
Collaborator

@jfigui Just test https://openradarscience.org/erad2024/intro-data-access in the running instance for yourself.

@mgrover1 If you find the time when stress testing the machine on Friday, please test the download with that notebook. But even if it takes a minute, this is still sufficient.

openradar deleted a comment Aug 27, 2024
openradar deleted a comment Aug 27, 2024
openradar deleted a comment from GoldenCaterpie Aug 27, 2024
@jfigui
Contributor Author

jfigui commented Aug 28, 2024

OK, if it is on the order of 1 minute or so, I think we can deal with it :)

@ghiggi
Contributor

ghiggi commented Aug 28, 2024

I quickly looked back into my snippets and found out that the past solutions I was using to read a Zarr store on cloud buckets do not work well with recent Zarr versions.

I found a new solution that opens the Zarr store directly using fsspec and the get_mapper method, but the old fs.open approach (with all simple/block cache variants) does not work anymore. I get the error: "Starting with Zarr 2.11.0, stores must be subclasses of BaseStore; if your store exposes the MutableMapping interface, wrap it in zarr.storage.KVStore."

Furthermore, opening the datatree is extremely slow (it often hangs when I try). Maybe @aladinor has something to add on this topic.

Below I provide the code to reproduce the solutions and problems.

import s3fs 
import zarr
import fsspec
import xarray as xr
from xarray.backends.api import open_datatree

URL = "https://js2.jetstream-cloud.org:8001"
path = "pythia/radar/erad2024"

#---------------------------------------------------------.
#### List bucket files
fs = fsspec.filesystem("s3", anon=True, client_kwargs=dict(endpoint_url=URL))
fs.glob("pythia/*")
fs.glob(f"{path}/*")
fs.glob(f"{path}/gpm_api/*")
fs.glob(f"{path}/gpm_api/2A.GPM.DPR.V9-20211125.20230820-S213941-E231213.053847.V07B.zarr/*")
fs.glob(f"{path}/gpm_api/KNKX20230820_221341_V06.zarr/*")

#---------------------------------------------------------.
#### Open with s3fs.S3FileSystem
fs = s3fs.S3FileSystem(anon=True, client_kwargs=dict(endpoint_url=URL))

# Dataset
file = s3fs.S3Map(f"{path}/gpm_api/2A.GPM.DPR.V9-20211125.20230820-S213941-E231213.053847.V07B.zarr", s3=fs)
ds = xr.open_zarr(file, consolidated=True)

# Datatree
file = s3fs.S3Map(f"{path}/gpm_api/KNKX20230820_221341_V06.zarr", s3=fs)
dt = open_datatree(file, engine='zarr', consolidated=True)

#-------------------------------------------------------------------------------------
#### Open with fsspec directly  
fs = fsspec.filesystem("s3", anon=True, client_kwargs=dict(endpoint_url=URL))

# Dataset
file = fs.get_mapper(f"{path}/gpm_api/2A.GPM.DPR.V9-20211125.20230820-S213941-E231213.053847.V07B.zarr")
ds = xr.open_zarr(file, consolidated=True, chunks={}) 
ds["zFactorFinalNearSurface"] = ds["zFactorFinalNearSurface"].compute()

# Datatree [SUPERSLOW ... NEVER ENDING ...]
file = fs.get_mapper(f"{path}/gpm_api/KNKX20230820_221341_V06.zarr")
dt = open_datatree(file, engine="zarr", consolidated=True, chunks={}) 

#-------------------------------------------------------------------------------------
#### fs.open() does not work anymore !  
# Error:
# Starting with Zarr 2.11.0, stores must be subclasses of BaseStore, 
# if your store exposes the MutableMapping interface wrap it in Zarr.storage.KVStore

# Option 1
fs = fsspec.filesystem("s3", anon=True, client_kwargs=dict(endpoint_url=URL))
file = fs.open(f"{path}/gpm_api/2A.GPM.DPR.V9-20211125.20230820-S213941-E231213.053847.V07B.zarr")
ds = xr.open_zarr(file, consolidated=True, chunks={}) 

# Option 2
file = fsspec.open(
        f"{URL}{path}/gpm_api/2A.GPM.DPR.V9-20211125.20230820-S213941-E231213.053847.V07B.zarr", 
        s3={"anon": True}
    )
ds = xr.open_zarr(file, consolidated=True, chunks={}) 

# Context manager
with fs.open(f"{path}/gpm_api/2A.GPM.DPR.V9-20211125.20230820-S213941-E231213.053847.V07B.zarr") as file:
    ds = xr.open_zarr(file, consolidated=True, chunks={}) 

with fs.open(f"{path}/gpm_api/KNKX20230820_221341_V06.zarr") as file:
    dt = open_datatree(file, engine="zarr", chunks={}) 

#------------------------------------------------------------------------------------- 
#### Caching Blocks of Data (for re-access) (instead of full file download with simplecache)
import appdirs
from fsspec.implementations.cached import CachingFileSystem, SimpleCacheFileSystem
fs = fsspec.filesystem("s3", anon=True, client_kwargs=dict(endpoint_url=URL))
cache_expiry_times = 60*5 
block_size = 2**20
cachedir = appdirs.user_cache_dir("ABI-simple-cache-numpy")
fs_block = CachingFileSystem(
    fs=fs,
    cache_storage=cachedir,
    expiry_time=cache_expiry_times,  # seconds 
)

file = fs_block.open(f"{path}/gpm_api/2A.GPM.DPR.V9-20211125.20230820-S213941-E231213.053847.V07B.zarr", block_size=block_size)
ds = xr.open_zarr(file, consolidated=True, chunks={}) 
ds["zFactorFinalNearSurface"] = ds["zFactorFinalNearSurface"].compute()
 

@aladinor
Member

aladinor commented Aug 28, 2024

I have been working on xarray.datatree because I found out that it takes too long, as reported here. I recently opened a PR to solve that issue here. @ghiggi, what xarray version are you working with?

I ran your code with my xarray version (2024.7.1....) and these are the results:
[screenshot: dtree output]

I think we should install xarray directly from the GitHub repo:

python -m pip install git+https://github.com/pydata/xarray.git

Can we add this version to the binder, @kmuehlbauer?

@ghiggi
Contributor

ghiggi commented Aug 28, 2024

I recall having seen that PR, so I updated to xarray 2024.7.0. But now that I check, it looks strange: mamba list shows 2024.7.0, while import xarray as xr; xr.__version__ says 2024.2.0. Really strange. Time for a fresh environment 😄 ..
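A quick way to diagnose this kind of mismatch (a generic debugging sketch, not something from the thread) is to compare the version of the module Python actually imports with the version the installed-package metadata reports, along with where the module was loaded from:

```python
# Compare what Python actually imports with what the package-manager
# metadata reports; a mismatch usually means two installations shadow
# each other (e.g. a pip install inside a conda/mamba environment).
import importlib.metadata

import xarray as xr

print(xr.__version__)                        # version of the imported module
print(xr.__file__)                           # where it was loaded from
print(importlib.metadata.version("xarray"))  # version per installed metadata
```

If the last two disagree, the interpreter is picking up a stale copy from elsewhere on sys.path, and a fresh environment is indeed the cleanest fix.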

@kmuehlbauer
Collaborator

I'd suggest firing up a binder instance and debugging directly there.

@ghiggi
Contributor

ghiggi commented Aug 28, 2024

Ah, but your PR, @aladinor, is not included in xarray 2024.7.0. We need to wait for the new xarray release .. hopefully in the next few days?

@aladinor
Member

aladinor commented Aug 28, 2024

I am not sure if there will be a new release soon... therefore, I suggest installing it directly from the repo.

@ghiggi
Contributor

ghiggi commented Aug 28, 2024

Yes, that makes sense! I just tried it out too, and now opening the datatree is much faster. Good work!!!

@mgrover1
Contributor

We should add an install to that specific branch 👍
