
Getting data in the binder #94

Closed
jfigui opened this issue Aug 27, 2024 · 22 comments

Comments

@jfigui
Contributor

jfigui commented Aug 27, 2024

No description provided.

@jfigui
Contributor Author

jfigui commented Aug 27, 2024

@openradar/erad2024-team ,

We need to download the data a priori for the QPE exercise, and we thought of doing it through a script in the appendix.txt. Are you OK with that, or do you have other suggestions?

We do not have a way to read data from s3 buckets directly yet in Pyrad.

@aladinor
Member

@mgrover1 and I have created an analysis-ready version of the dataset, which is stored in the bucket. It can be read using xarray.open_datatree; you don't need to download any data. Do you think it will work for you @jfigui?

This is the code that you need to open the dataset:

import s3fs
from xarray.backends.api import open_datatree

URL = 'https://js2.jetstream-cloud.org:8001/'
path = 'pythia/radar/erad2024'
fs = s3fs.S3FileSystem(anon=True, client_kwargs=dict(endpoint_url=URL))
file = s3fs.S3Map(f"{path}/zarr_radar/erad_2024.zarr", s3=fs)
dt = open_datatree(file, engine='zarr', consolidated=True)

Then it will load a datatree (dt) with the C-band and X-band nodes, and each node contains all sweeps and all files concatenated along the volume_time dimension. You must have xarray version v2024.07.01 to open this dataset. Please let me know if this works for you.

@jfigui
Contributor Author

jfigui commented Aug 27, 2024

I don't think it works for us. We need access to the original files, and we thought it would be good to have them available in advance so that we do not need to download lots of data on the fly.

The QPE exercise needs more data than the demonstrations do.

@kmuehlbauer
Collaborator

@jfigui We can't bake >2 GB of data into the image, see #79.

The jetstream s3 bucket is located alongside the pythiahub IIUC (at least in the same datacenter), so transfer times would be negligible.

You can utilize fsspec, please see: https://openradarscience.org/erad2024/intro-data-access. There is a pyart example, too.
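The fsspec fetch pattern mentioned above can be sketched as follows. The endpoint URL and bucket path are the ones quoted elsewhere in this thread; the filename is made up for illustration, and fsspec's in-memory filesystem stands in for the s3 bucket so the snippet runs offline.

```python
# Sketch of fetching a file from the bucket into the running notebook with
# fsspec. An in-memory filesystem stands in for s3 so this runs offline;
# for the real bucket you would use something like:
#   fs = fsspec.filesystem("s3", anon=True,
#           client_kwargs=dict(endpoint_url="https://js2.jetstream-cloud.org:8001"))
import os
import tempfile

import fsspec

fs = fsspec.filesystem("memory")
fs.pipe("/pythia/radar/erad2024/example_volume.nc", b"fake radar bytes")  # fake object

# Download one object to local disk before Pyrad (or any reader) runs.
local_path = os.path.join(tempfile.mkdtemp(), "example_volume.nc")
fs.get("/pythia/radar/erad2024/example_volume.nc", local_path)

print(os.path.getsize(local_path))  # size in bytes of the fetched copy
```

`fs.get(..., recursive=True)` can pull a whole prefix the same way, which is the shape a "fetch everything before the exercise" cell would take.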

@mgrover1
Contributor

@jfigui - do you have an FTP server or Google Drive link to the data?

@jfigui
Contributor Author

jfigui commented Aug 27, 2024

We need to process blocks of data stored locally at the moment. Is there a way to mount a drive in the image?

@jfigui
Contributor Author

jfigui commented Aug 27, 2024

We can also restrict the size of our data by removing some unnecessary data, possibly enough to put it on GitHub.

@kmuehlbauer
Collaborator

@mgrover1 Please correct me if I'm wrong. The bottleneck is the image size, which we should keep as small as possible. If deployed on the Pythia binder hub, there is larger storage per instance. Please add the data to the jetstream s3 bucket, from where you can fetch it into the running notebook.

I'd strongly vote against putting large data (>tens of MB) into GitHub or the Docker image.

@mgrover1
Contributor

@jfigui - the bucket storing the data is located on the same network as the jupyter/binder hub. I can add a bash script that pulls from the bucket at the top of the notebook you are using for the exercise, if that works for you. I can assure you the transfer speeds will be quick.

This will make the course more reproducible, with an opportunity to migrate this over to a more permanent space such as the radar cookbook. Where is the data currently located? Has it been used in previous repositories?

@jfigui
Contributor Author

jfigui commented Aug 27, 2024

We should have all the original files we need for the exercise stored "locally" before Pyrad runs. If you think we can do that fast enough then that is fine.

@jfigui
Contributor Author

jfigui commented Aug 27, 2024

We do not use xradar yet in Pyrad.

@kmuehlbauer
Collaborator

The original C-band data is fetched into the running instance from the jetstream bucket in 15.5 seconds.

@kmuehlbauer
Collaborator

@jfigui Just test https://openradarscience.org/erad2024/intro-data-access in the running instance for yourself.

@mgrover1 If you find the time when stress testing the machine on Friday, please test the download with that notebook. But even if it takes a minute, this is still sufficient.

openradar deleted a comment Aug 27, 2024
openradar deleted a comment Aug 27, 2024
openradar deleted a comment from GoldenCaterpie Aug 27, 2024
@jfigui
Contributor Author

jfigui commented Aug 28, 2024

OK, if it is on the order of 1 minute or so, I think we can deal with it :)

@ghiggi
Contributor

ghiggi commented Aug 28, 2024

I quickly looked back into my snippets and found out that the past solutions I was using to read a Zarr store on cloud buckets do not work well with recent Zarr versions.

I found a new solution that opens the Zarr store directly using fsspec and the get_mapper method, but the old fs.open approach (with all simple/block cache variants) does not work anymore. I get the error: "Starting with Zarr 2.11.0, stores must be subclasses of BaseStore; if your store exposes the MutableMapping interface, wrap it in zarr.storage.KVStore."

Furthermore, opening the datatree is extremely slow (it often hangs when I try). Maybe @aladinor has something to add on this topic.

Below I provide the code to reproduce the solutions and problems.

import s3fs 
import zarr
import fsspec
import xarray as xr
from xarray.backends.api import open_datatree

URL = "https://js2.jetstream-cloud.org:8001"
path = "pythia/radar/erad2024"

#---------------------------------------------------------.
#### List bucket files
fs = fsspec.filesystem("s3", anon=True, client_kwargs=dict(endpoint_url=URL))
fs.glob("pythia/*")
fs.glob(f"{path}/*")
fs.glob(f"{path}/gpm_api/*")
fs.glob(f"{path}/gpm_api/2A.GPM.DPR.V9-20211125.20230820-S213941-E231213.053847.V07B.zarr/*")
fs.glob(f"{path}/gpm_api/KNKX20230820_221341_V06.zarr/*")

#---------------------------------------------------------.
#### Open with s3fs.S3FileSystem
fs = s3fs.S3FileSystem(anon=True, client_kwargs=dict(endpoint_url=URL))

# Dataset
file = s3fs.S3Map(f"{path}/gpm_api/2A.GPM.DPR.V9-20211125.20230820-S213941-E231213.053847.V07B.zarr", s3=fs)
ds = xr.open_zarr(file, consolidated=True)

# Datatree
file = s3fs.S3Map(f"{path}/gpm_api/KNKX20230820_221341_V06.zarr", s3=fs)
dt = open_datatree(file, engine='zarr', consolidated=True)

#-------------------------------------------------------------------------------------
#### Open with fsspec directly  
fs = fsspec.filesystem("s3", anon=True, client_kwargs=dict(endpoint_url=URL))

# Dataset
file = fs.get_mapper(f"{path}/gpm_api/2A.GPM.DPR.V9-20211125.20230820-S213941-E231213.053847.V07B.zarr")
ds = xr.open_zarr(file, consolidated=True, chunks={}) 
ds["zFactorFinalNearSurface"] = ds["zFactorFinalNearSurface"].compute()

# Datatree [SUPERSLOW ... NEVER ENDING ...]
file = fs.get_mapper(f"{path}/gpm_api/KNKX20230820_221341_V06.zarr")
dt = open_datatree(file, engine="zarr", consolidated=True, chunks={}) 

#-------------------------------------------------------------------------------------
#### fs.open() does not work anymore !  
# Error:
# Starting with Zarr 2.11.0, stores must be subclasses of BaseStore, 
# if your store exposes the MutableMapping interface wrap it in Zarr.storage.KVStore

# Option 1
fs = fsspec.filesystem("s3", anon=True, client_kwargs=dict(endpoint_url=URL))
file = fs.open(f"{path}/gpm_api/2A.GPM.DPR.V9-20211125.20230820-S213941-E231213.053847.V07B.zarr")
ds = xr.open_zarr(file, consolidated=True, chunks={}) 

# Option 2
file = fsspec.open(
        f"{URL}{path}/gpm_api/2A.GPM.DPR.V9-20211125.20230820-S213941-E231213.053847.V07B.zarr", 
        s3={"anon": True}
    )
ds = xr.open_zarr(file, consolidated=True, chunks={}) 

# Context manager
with fs.open(f"{path}/gpm_api/2A.GPM.DPR.V9-20211125.20230820-S213941-E231213.053847.V07B.zarr") as file:
    ds = xr.open_zarr(file, consolidated=True, chunks={}) 

with fs.open(f"{path}/gpm_api/KNKX20230820_221341_V06.zarr") as file:
    dt = open_datatree(file, engine="zarr", chunks={}) 

#------------------------------------------------------------------------------------- 
#### Caching Blocks of Data (for re-access) (instead of full file download with simplecache)
import appdirs
from fsspec.implementations.cached import CachingFileSystem, SimpleCacheFileSystem
fs = fsspec.filesystem("s3", anon=True, client_kwargs=dict(endpoint_url=URL))
cache_expiry_times = 60*5 
block_size = 2**20
cachedir = appdirs.user_cache_dir("ABI-simple-cache-numpy")
fs_block = CachingFileSystem(
    fs=fs,
    cache_storage=cachedir,
    expiry_time=cache_expiry_times,  # seconds 
)

file = fs_block.open(f"{path}/gpm_api/2A.GPM.DPR.V9-20211125.20230820-S213941-E231213.053847.V07B.zarr", block_size=block_size)
ds = xr.open_zarr(file, consolidated=True, chunks={}) 
ds["zFactorFinalNearSurface"] = ds["zFactorFinalNearSurface"].compute()
 

@aladinor
Member

aladinor commented Aug 28, 2024

I have been working on xarray.datatree because I found out that it takes too long, as reported here. I recently opened a PR to solve that issue here. @ghiggi, what xarray version are you working with?

I ran your code with my xarray version (2024.7.1....) and these are the results:
[screenshot: dtree output]

I think we should install xarray directly from the GitHub repo:

python -m pip install git+https://github.com/pydata/xarray.git

Can we add this version to the binder, @kmuehlbauer?

@ghiggi
Contributor

ghiggi commented Aug 28, 2024

I recall having seen that PR, so I updated to xarray 2024.7.0. But now that I check, it looks strange: mamba list shows 2024.7.0, while import xarray as xr; xr.__version__ says 2024.2.0. Really strange. Time for a fresh environment 😄 ..
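A quick way to diagnose this kind of mismatch (a generic debugging sketch, not something from the thread) is to compare the version of the module Python actually imports with the version the installed-package metadata reports, along with where the module was loaded from:

```python
# Compare what Python actually imports with what the package-manager
# metadata reports; a mismatch usually means two installations shadow
# each other (e.g. a pip install inside a conda/mamba environment).
import importlib.metadata

import xarray as xr

print(xr.__version__)                        # version of the imported module
print(xr.__file__)                           # where it was loaded from
print(importlib.metadata.version("xarray"))  # version per installed metadata
```

If the last two disagree, the interpreter is picking up a stale copy from elsewhere on sys.path, and a fresh environment is indeed the cleanest fix.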

@kmuehlbauer
Collaborator

I'd suggest firing up a binder instance and debugging directly there.

@ghiggi
Contributor

ghiggi commented Aug 28, 2024

Ah, but your PR, @aladinor, is not included in xarray 2024.7.0. We need to wait for the new xarray release .. hopefully in the next few days?

@aladinor
Member

aladinor commented Aug 28, 2024

I am not sure if there will be a new release soon... therefore, I suggest installing it directly from the repo.

@ghiggi
Contributor

ghiggi commented Aug 28, 2024

Yes, that makes sense! I just tried it out too, and now opening the datatree is much faster. Good work!!!

@mgrover1
Contributor

We should add an install to that specific branch 👍
