Implement FilePattern.file_type
#322
The failed tests are the same upstream error seen in #303 (comment). Setting that aside for the moment... @rabernat, are you aware of anything about the "classic CDF-1" format which would make it incompatible with appending to an existing Zarr store?

Here's what we know. Using the procedure recommended in the Unidata docs, we can confirm that this file is in "classic CDF-1" format:

```console
$ wget http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/2021/ZCYL5_20210101v30001.nc
$ od -An -c -N4 ZCYL5_20210101v30001.nc
   C   D   F 001
```

There is no mention of a compatibility problem there. With xarray, we can open the file, dump it to Zarr, and get the data back out of Zarr:

```python
import xarray as xr

ds = xr.open_dataset("ZCYL5_20210101v30001.nc", engine="scipy")
print(ds)
ds.to_zarr("my.zarr")
ds_zarr = xr.open_zarr("my.zarr")
print(ds_zarr)
```
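As an aside, the same magic-byte check can be sketched in Python (a minimal standalone helper, not part of any library discussed here):

```python
def netcdf_magic(path):
    """Guess the netCDF container format from a file's first four bytes."""
    with open(path, "rb") as f:
        magic = f.read(4)
    if magic[:3] == b"CDF":
        # b"CDF\x01" is classic CDF-1; b"CDF\x02" is 64-bit-offset CDF-2
        return f"netCDF-3 (classic, version {magic[3]})"
    if magic == b"\x89HDF":
        # first four bytes of the HDF5 signature
        return "netCDF-4 (HDF5-based)"
    return "unknown"
```

This is just the programmatic equivalent of the `od -An -c -N4` check above.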
But with an equivalent recipe:

```python
import pandas as pd
import xarray as xr
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe, setup_logging


def make_url(time):
    year = time.strftime('%Y')
    year_month_day = time.strftime('%Y%m%d')
    return f'http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/{year}/ZCYL5_{year_month_day}v30001.nc'


dates = pd.date_range('2021-01-01', '2021-01-03', freq='D')
time_concat_dim = ConcatDim("time", dates, nitems_per_file=1)
pattern = FilePattern(
    make_url,
    time_concat_dim,
    file_type="netcdf3",
)
recipe = XarrayZarrRecipe(pattern, inputs_per_chunk=30)
setup_logging()
recipe_pruned = recipe.copy_pruned()
run_function = recipe_pruned.to_function()
run_function()
```

`run_function()` errors with the following logs + traceback:

[03/09/22 10:13:15] INFO Caching input 'Index({DimIndex(name='time', index=0, sequence_len=2, xarray_zarr.py:149
operation=<CombineOp.CONCAT: 2>)})'
INFO Caching file 'http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/2021/ZCYL5_20 storage.py:154
210101v30001.nc'
INFO Copying remote file 'http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/2021/Z storage.py:165
CYL5_20210101v30001.nc' to cache
[03/09/22 10:13:16] DEBUG entering fs.open context manager for /var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmptvm460p6 storage.py:122
/oiAAMfqs/777be2b9214151be7e2c4f211c36a334-http_tds.coaps.fsu.edu_thredds_fileserver_samos_data_r
esearch_zcyl5_2021_zcyl5_20210101v30001.nc
DEBUG FSSpecTarget.open yielding <fsspec.implementations.local.LocalFileOpener object at 0x10bb8afa0> storage.py:124
DEBUG _copy_btw_filesystems total bytes copied: 305660 storage.py:51
DEBUG avg throughput over 0.01 min: 0.69 MB/sec storage.py:52
DEBUG FSSpecTarget.open yielded storage.py:126
DEBUG _copy_btw_filesystems done storage.py:56
INFO Caching input 'Index({DimIndex(name='time', index=1, sequence_len=2, xarray_zarr.py:149
operation=<CombineOp.CONCAT: 2>)})'
INFO Caching file 'http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/2021/ZCYL5_20 storage.py:154
210102v30001.nc'
INFO Copying remote file 'http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/2021/Z storage.py:165
CYL5_20210102v30001.nc' to cache
DEBUG entering fs.open context manager for /var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmptvm460p6 storage.py:122
/oiAAMfqs/8c0e3b3efe320f6d7e72b7b9c38c77e0-http_tds.coaps.fsu.edu_thredds_fileserver_samos_data_r
esearch_zcyl5_2021_zcyl5_20210102v30001.nc
DEBUG FSSpecTarget.open yielding <fsspec.implementations.local.LocalFileOpener object at 0x17a8418e0> storage.py:124
DEBUG _copy_btw_filesystems total bytes copied: 305868 storage.py:51
DEBUG avg throughput over 0.00 min: 1.65 MB/sec storage.py:52
DEBUG FSSpecTarget.open yielded storage.py:126
DEBUG _copy_btw_filesystems done storage.py:56
/Users/charlesstern/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py:111: RuntimeWarning: Failed to open Zarr store with consolidated metadata, falling back to try reading non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
return xr.open_zarr(target.get_mapper())
INFO Creating a new dataset in target xarray_zarr.py:452
INFO Opening inputs for chunk Index({DimIndex(name='time', index=0, sequence_len=1, xarray_zarr.py:334
operation=<CombineOp.CONCAT: 2>)})
INFO Opening input with Xarray Index({DimIndex(name='time', index=0, sequence_len=2, xarray_zarr.py:249
operation=<CombineOp.CONCAT: 2>)}): 'http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/r
esearch/ZCYL5/2021/ZCYL5_20210101v30001.nc'
INFO Opening 'http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/2021/ZCYL5_2021010 storage.py:260
1v30001.nc' from cache
DEBUG file_opener entering first context for <contextlib._GeneratorContextManager object at storage.py:275
0x10bc96790>
DEBUG entering fs.open context manager for /var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmptvm460p6 storage.py:122
/oiAAMfqs/777be2b9214151be7e2c4f211c36a334-http_tds.coaps.fsu.edu_thredds_fileserver_samos_data_r
esearch_zcyl5_2021_zcyl5_20210101v30001.nc
DEBUG FSSpecTarget.open yielding <fsspec.implementations.local.LocalFileOpener object at 0x17be41e50> storage.py:124
DEBUG file_opener entering second context for <fsspec.implementations.local.LocalFileOpener object at storage.py:277
0x17be41e50>
DEBUG about to enter xr.open_dataset context on <fsspec.implementations.local.LocalFileOpener xarray_zarr.py:303
object at 0x17be41e50>
DEBUG successfully opened dataset xarray_zarr.py:305
DEBUG <xarray.Dataset> xarray_zarr.py:315
Dimensions: (time: 1439, h_num: 50)
Coordinates:
* time (time) datetime64[ns] 2021-01-01 ... 2021-01-01T23:59:00
Dimensions without coordinates: h_num
Data variables: (12/38)
lat (time) float32 ...
lon (time) float32 ...
PL_HD (time) float32 ...
PL_CRS (time) float32 ...
DIR (time) float32 ...
DIR2 (time) float32 ...
... ...
RAD_PAR (time) float32 ...
RAD_PAR2 (time) float32 ...
date (time) int32 ...
time_of_day (time) int32 ...
flag (time) |S35 ...
history (h_num) |S236 ...
Attributes: (12/22)
title: FALKOR Meteorological Data
site: FALKOR
elev: 0
ID: ZCYL5
IMO: 007928677
platform: unknown at this time
... ...
Cruise_id: Cruise_id undefined for now
Data_modification_date: 01/12/2021 13:21:16 EST
Metadata_modification_date: 01/12/2021 13:21:16 EST
metadata_retrieved_from: ZCYL5_20210101v10001.nc
files_merged: [ZCYL5_20210101v10001.nc]
merger_version: v001
INFO Opening input with Xarray Index({DimIndex(name='time', index=1, sequence_len=2, xarray_zarr.py:249
operation=<CombineOp.CONCAT: 2>)}): 'http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/r
esearch/ZCYL5/2021/ZCYL5_20210102v30001.nc'
INFO Opening 'http://tds.coaps.fsu.edu/thredds/fileServer/samos/data/research/ZCYL5/2021/ZCYL5_2021010 storage.py:260
2v30001.nc' from cache
DEBUG file_opener entering first context for <contextlib._GeneratorContextManager object at storage.py:275
0x17be41040>
DEBUG entering fs.open context manager for /var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmptvm460p6 storage.py:122
/oiAAMfqs/8c0e3b3efe320f6d7e72b7b9c38c77e0-http_tds.coaps.fsu.edu_thredds_fileserver_samos_data_r
esearch_zcyl5_2021_zcyl5_20210102v30001.nc
DEBUG FSSpecTarget.open yielding <fsspec.implementations.local.LocalFileOpener object at 0x17be93f70> storage.py:124
DEBUG file_opener entering second context for <fsspec.implementations.local.LocalFileOpener object at storage.py:277
0x17be93f70>
DEBUG about to enter xr.open_dataset context on <fsspec.implementations.local.LocalFileOpener xarray_zarr.py:303
object at 0x17be93f70>
DEBUG successfully opened dataset xarray_zarr.py:305
DEBUG <xarray.Dataset> xarray_zarr.py:315
Dimensions: (time: 1440, h_num: 50)
Coordinates:
* time (time) datetime64[ns] 2021-01-02 ... 2021-01-02T23:59:00
Dimensions without coordinates: h_num
Data variables: (12/38)
lat (time) float32 ...
lon (time) float32 ...
PL_HD (time) float32 ...
PL_CRS (time) float32 ...
DIR (time) float32 ...
DIR2 (time) float32 ...
... ...
RAD_PAR (time) float32 ...
RAD_PAR2 (time) float32 ...
date (time) int32 ...
time_of_day (time) int32 ...
flag (time) |S35 ...
history (h_num) |S236 ...
Attributes: (12/22)
title: FALKOR Meteorological Data
site: FALKOR
elev: 0
ID: ZCYL5
IMO: 007928677
platform: unknown at this time
... ...
Cruise_id: Cruise_id undefined for now
Data_modification_date: 01/12/2021 13:41:04 EST
Metadata_modification_date: 01/12/2021 13:41:04 EST
metadata_retrieved_from: ZCYL5_20210102v10002.nc
files_merged: [ZCYL5_20210102v10001.nc, ZCYL5_20210102v100...
merger_version: v001
INFO Combining inputs for chunk 'Index({DimIndex(name='time', index=0, sequence_len=1, xarray_zarr.py:352
operation=<CombineOp.CONCAT: 2>)})'
[03/09/22 10:13:17] DEBUG <xarray.Dataset> xarray_zarr.py:368
Dimensions: (time: 2879, h_num: 50)
Coordinates:
* time (time) datetime64[ns] 2021-01-01 ... 2021-01-02T23:59:00
Dimensions without coordinates: h_num
Data variables: (12/38)
lat (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
lon (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
PL_HD (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
PL_CRS (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
DIR (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
DIR2 (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
... ...
RAD_PAR (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
RAD_PAR2 (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
date (time) int32 dask.array<chunksize=(1439,), meta=np.ndarray>
time_of_day (time) int32 dask.array<chunksize=(1439,), meta=np.ndarray>
flag (time) |S35 dask.array<chunksize=(1439,), meta=np.ndarray>
history (time, h_num) |S236 dask.array<chunksize=(1439, 50), meta=np.ndarray>
Attributes: (12/22)
title: FALKOR Meteorological Data
site: FALKOR
elev: 0
ID: ZCYL5
IMO: 007928677
platform: unknown at this time
... ...
Cruise_id: Cruise_id undefined for now
Data_modification_date: 01/12/2021 13:21:16 EST
Metadata_modification_date: 01/12/2021 13:21:16 EST
metadata_retrieved_from: ZCYL5_20210101v10001.nc
files_merged: [ZCYL5_20210101v10001.nc]
merger_version: v001
DEBUG Setting variable time encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable lat encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable lon encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable PL_HD encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable PL_CRS encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable DIR encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable DIR2 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable DIR3 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable PL_WDIR encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable PL_WDIR2 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable PL_WDIR3 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable PL_SPD encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable SPD encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable SPD2 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable SPD3 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable PL_WSPD encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable PL_WSPD2 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable PL_WSPD3 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable P encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable P2 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable P3 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable T encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable T2 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable T3 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable RH encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable RH2 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable RH3 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable TS encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable TS2 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable SSPS encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable CNDC encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable RAD_SW encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable RAD_LW encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable RAD_PAR encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable RAD_PAR2 encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable date encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable time_of_day encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable flag encoding chunks to (2879,) xarray_zarr.py:482
DEBUG Setting variable history encoding chunks to (2879, 50) xarray_zarr.py:482
INFO Storing dataset in /var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmptvm460p6/xtIlMDcU xarray_zarr.py:494
DEBUG <xarray.Dataset> xarray_zarr.py:495
Dimensions: (time: 2879, h_num: 50)
Coordinates:
* time (time) datetime64[ns] 2021-01-01 ... 2021-01-02T23:59:00
Dimensions without coordinates: h_num
Data variables: (12/38)
lat (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
lon (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
PL_HD (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
PL_CRS (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
DIR (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
DIR2 (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
... ...
RAD_PAR (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
RAD_PAR2 (time) float32 dask.array<chunksize=(1439,), meta=np.ndarray>
date (time) int32 dask.array<chunksize=(1439,), meta=np.ndarray>
time_of_day (time) int32 dask.array<chunksize=(1439,), meta=np.ndarray>
flag (time) |S35 dask.array<chunksize=(1439,), meta=np.ndarray>
history (time, h_num) |S236 dask.array<chunksize=(1439, 50), meta=np.ndarray>
Attributes: (12/22)
title: FALKOR Meteorological Data
site: FALKOR
elev: 0
ID: ZCYL5
IMO: 007928677
platform: unknown at this time
... ...
Cruise_id: Cruise_id undefined for now
Data_modification_date: 01/12/2021 13:21:16 EST
Metadata_modification_date: 01/12/2021 13:21:16 EST
metadata_retrieved_from: ZCYL5_20210101v10001.nc
files_merged: [ZCYL5_20210101v10001.nc]
merger_version: v001
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/fsspec/mapping.py:135, in FSMap.__getitem__(self, key, default)
134 try:
--> 135 result = self.fs.cat(k)
136 except self.missing_exceptions:
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/fsspec/spec.py:739, in AbstractFileSystem.cat(self, path, recursive, on_error, **kwargs)
738 else:
--> 739 return self.cat_file(paths[0], **kwargs)
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/fsspec/spec.py:649, in AbstractFileSystem.cat_file(self, path, start, end, **kwargs)
648 # explicitly set buffering off?
--> 649 with self.open(path, "rb", **kwargs) as f:
650 if start is not None:
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/fsspec/spec.py:1009, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
1008 ac = kwargs.pop("autocommit", not self._intrans)
-> 1009 f = self._open(
1010 path,
1011 mode=mode,
1012 block_size=block_size,
1013 autocommit=ac,
1014 cache_options=cache_options,
1015 **kwargs,
1016 )
1017 if compression is not None:
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/fsspec/implementations/local.py:155, in LocalFileSystem._open(self, path, mode, block_size, **kwargs)
154 self.makedirs(self._parent(path), exist_ok=True)
--> 155 return LocalFileOpener(path, mode, fs=self, **kwargs)
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/fsspec/implementations/local.py:250, in LocalFileOpener.__init__(self, path, mode, autocommit, fs, compression, **kwargs)
249 self.blocksize = io.DEFAULT_BUFFER_SIZE
--> 250 self._open()
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/fsspec/implementations/local.py:255, in LocalFileOpener._open(self)
254 if self.autocommit or "w" not in self.mode:
--> 255 self.f = open(self.path, mode=self.mode)
256 if self.compression:
FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmptvm460p6/xtIlMDcU/.zmetadata'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/xarray/backends/zarr.py:348, in ZarrStore.open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, append_dim, write_region, safe_chunks, stacklevel)
347 try:
--> 348 zarr_group = zarr.open_consolidated(store, **open_kwargs)
349 except KeyError:
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/zarr/convenience.py:1188, in open_consolidated(store, metadata_key, mode, **kwargs)
1187 # setup metadata store
-> 1188 meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)
1190 # pass through
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/zarr/storage.py:2645, in ConsolidatedMetadataStore.__init__(self, store, metadata_key)
2644 # retrieve consolidated metadata
-> 2645 meta = json_loads(store[metadata_key])
2647 # check format of consolidated metadata
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/zarr/storage.py:546, in KVStore.__getitem__(self, key)
545 def __getitem__(self, key):
--> 546 return self._mutable_mapping[key]
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/fsspec/mapping.py:139, in FSMap.__getitem__(self, key, default)
138 return default
--> 139 raise KeyError(key)
140 return result
KeyError: '.zmetadata'
During handling of the above exception, another exception occurred:
GroupNotFoundError Traceback (most recent call last)
File ~/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py:443, in prepare_target(config)
442 try:
--> 443 ds = open_target(config.storage_config.target)
444 logger.info("Found an existing dataset in target")
File ~/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py:111, in open_target(target)
110 def open_target(target: FSSpecTarget) -> xr.Dataset:
--> 111 return xr.open_zarr(target.get_mapper())
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/xarray/backends/zarr.py:752, in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, storage_options, decode_timedelta, use_cftime, **kwargs)
743 backend_kwargs = {
744 "synchronizer": synchronizer,
745 "consolidated": consolidated,
(...)
749 "stacklevel": 4,
750 }
--> 752 ds = open_dataset(
753 filename_or_obj=store,
754 group=group,
755 decode_cf=decode_cf,
756 mask_and_scale=mask_and_scale,
757 decode_times=decode_times,
758 concat_characters=concat_characters,
759 decode_coords=decode_coords,
760 engine="zarr",
761 chunks=chunks,
762 drop_variables=drop_variables,
763 backend_kwargs=backend_kwargs,
764 decode_timedelta=decode_timedelta,
765 use_cftime=use_cftime,
766 )
767 return ds
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/xarray/backends/api.py:495, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
494 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495 backend_ds = backend.open_dataset(
496 filename_or_obj,
497 drop_variables=drop_variables,
498 **decoders,
499 **kwargs,
500 )
501 ds = _dataset_from_backend_dataset(
502 backend_ds,
503 filename_or_obj,
(...)
510 **kwargs,
511 )
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/xarray/backends/zarr.py:800, in ZarrBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel)
799 filename_or_obj = _normalize_path(filename_or_obj)
--> 800 store = ZarrStore.open_group(
801 filename_or_obj,
802 group=group,
803 mode=mode,
804 synchronizer=synchronizer,
805 consolidated=consolidated,
806 consolidate_on_close=False,
807 chunk_store=chunk_store,
808 storage_options=storage_options,
809 stacklevel=stacklevel + 1,
810 )
812 store_entrypoint = StoreBackendEntrypoint()
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/xarray/backends/zarr.py:365, in ZarrStore.open_group(cls, store, mode, synchronizer, group, consolidated, consolidate_on_close, chunk_store, storage_options, append_dim, write_region, safe_chunks, stacklevel)
350 warnings.warn(
351 "Failed to open Zarr store with consolidated metadata, "
352 "falling back to try reading non-consolidated metadata. "
(...)
363 stacklevel=stacklevel,
364 )
--> 365 zarr_group = zarr.open_group(store, **open_kwargs)
366 elif consolidated:
367 # TODO: an option to pass the metadata_key keyword
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/zarr/hierarchy.py:1182, in open_group(store, mode, cache_attrs, synchronizer, path, chunk_store, storage_options)
1181 raise ContainsArrayError(path)
-> 1182 raise GroupNotFoundError(path)
1184 elif mode == 'w':
GroupNotFoundError: group not found at path ''
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Input In [2], in <cell line: 29>()
25 recipe_pruned = recipe.copy_pruned()
27 run_function = recipe_pruned.to_function()
---> 29 run_function()
File ~/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/executors/python.py:46, in FunctionPipelineExecutor.compile.<locals>.function()
44 stage.function(m, config=pipeline.config)
45 else:
---> 46 stage.function(config=pipeline.config)
File ~/Dropbox/pangeo/pangeo-forge-recipes/pangeo_forge_recipes/recipes/xarray_zarr.py:500, in prepare_target(config)
496 with warnings.catch_warnings():
497 warnings.simplefilter(
498 "ignore"
499 ) # suppress the warning that comes with safe_chunks
--> 500 ds.to_zarr(target_mapper, mode="a", compute=False, safe_chunks=False)
502 # Regardless of whether there is an existing dataset or we are creating a new one,
503 # we need to expand the concat_dim to hold the entire expected size of the data
504 input_sequence_lens = calculate_sequence_lens(
505 config.nitems_per_input, config.file_pattern, config.storage_config.metadata,
506 )
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/xarray/core/dataset.py:2036, in Dataset.to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
2033 if encoding is None:
2034 encoding = {}
-> 2036 return to_zarr(
2037 self,
2038 store=store,
2039 chunk_store=chunk_store,
2040 storage_options=storage_options,
2041 mode=mode,
2042 synchronizer=synchronizer,
2043 group=group,
2044 encoding=encoding,
2045 compute=compute,
2046 consolidated=consolidated,
2047 append_dim=append_dim,
2048 region=region,
2049 safe_chunks=safe_chunks,
2050 )
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/xarray/backends/api.py:1406, in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region, safe_chunks, storage_options)
1391 zstore = backends.ZarrStore.open_group(
1392 store=mapper,
1393 mode=mode,
(...)
1402 stacklevel=4, # for Dataset.to_zarr()
1403 )
1405 if mode in ["a", "r+"]:
-> 1406 _validate_datatypes_for_zarr_append(dataset)
1407 if append_dim is not None:
1408 existing_dims = zstore.get_dimensions()
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/xarray/backends/api.py:1301, in _validate_datatypes_for_zarr_append(dataset)
1292 raise ValueError(
1293 "Invalid dtype for data variable: {} "
1294 "dtype must be a subtype of number, "
(...)
1297 "object".format(var)
1298 )
1300 for k in dataset.data_vars.values():
-> 1301 check_dtype(k)
File ~/miniconda3/envs/pangeo-forge-recipes/lib/python3.9/site-packages/xarray/backends/api.py:1292, in _validate_datatypes_for_zarr_append.<locals>.check_dtype(var)
1283 def check_dtype(var):
1284 if (
1285 not np.issubdtype(var.dtype, np.number)
1286 and not np.issubdtype(var.dtype, np.datetime64)
(...)
1290 ):
1291 # and not re.match('^bytes[1-9]+$', var.dtype.name)):
-> 1292 raise ValueError(
1293 "Invalid dtype for data variable: {} "
1294 "dtype must be a subtype of number, "
1295 "datetime, bool, a fixed sized string, "
1296 "a fixed size unicode string or an "
1297 "object".format(var)
1298 )
ValueError: Invalid dtype for data variable: <xarray.DataArray 'flag' (time: 2879)>
dask.array<concatenate, shape=(2879,), dtype=|S35, chunksize=(1440,), chunktype=numpy.ndarray>
Coordinates:
* time (time) datetime64[ns] 2021-01-01 ... 2021-01-02T23:59:00
Attributes:
long_name: quality control flags
A: Units added
B: Data out of range
C: Non-sequential time
D: Failed T>=Tw>=Td
E: True wind error
F: Velocity unrealistic
G: Value > 4 s. d. from climatology
H: Discontinuity
I: Interesting feature
J: Erroneous
K: Suspect - visual
L: Ocean platform over land
M: Instrument malfunction
N: In Port
O: Multiple original units
P: Movement uncertain
Q: Pre-flagged as suspect
R: Interpolated data
S: Spike - visual
T: Time duplicate
U: Suspect - statistial
V: Spike - statistical
X: Step - statistical
Y: Suspect between X-flags
Z: Good data
metadata_retrieved_from: ZCYL5_20210101v10001.nc
dtype must be a subtype of number, datetime, bool, a fixed sized string, a fixed size unicode string or an object
The zarr errors are a red herring.
Why does it work outside pangeo-forge-recipes with the simplified xarray example? And also, in terms of a solution, this would be resolved by fixing the dtype in a preprocessing step.
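For illustration, that kind of dtype fix could look like the following sketch (a hypothetical helper applied per data variable; not actual pangeo-forge-recipes API). Per the error message above, fixed-size unicode strings are an accepted dtype for Zarr appends, so casting the byte-string variable sidesteps the validation failure:

```python
import numpy as np


def bytes_to_unicode(arr: np.ndarray) -> np.ndarray:
    """Cast a fixed-width byte-string array (|S*) to fixed-width unicode (<U*)."""
    if arr.dtype.kind != "S":
        return arr
    # itemsize of |S35 is 35, so the cast preserves the fixed width
    return arr.astype(f"U{arr.dtype.itemsize}")
```

In a recipe this would run over each offending variable (e.g. the `flag` variable here) before writing to the target.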
xref pydata/xarray#6345
With 154fa6a, I've added descriptive errors so future users with mismatched xarray backend + FilePattern.file_type configurations are shown how to resolve their configuration issues. For the case of the motivating recipe from #315 (comment), if run ...
Some other failure modes: With the motivating recipe...
@rabernat, things which are unresolved:
Things which IMO are resolved:

Looking forward to your review.
```
:param is_opendap: If True, assume all input fnames represent opendap endpoints.
    Cannot be used with caching.
```
If we wanted to be nice to our users, we would not just remove this but deprecate it. Now that we have a few users, do we want to be more conservative about breaking changes? Or do we just want to move fast and not worry about that?
Sorry, did I miss something?
Martin see #305.
Great!
One more suggestion then LGTM.
Can we add a test in `test_patterns` that makes sure you can initialize a FilePattern with every valid FileType, and that specifying an unsupported type raises an error?
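Such a test might look roughly like this sketch (hypothetical: a stand-in `FileType` enum is defined locally so the snippet is self-contained; the real one lives in `pangeo_forge_recipes.patterns`, and its member set may differ):

```python
import enum

import pytest


class FileType(str, enum.Enum):
    # stand-in for pangeo_forge_recipes.patterns.FileType
    netcdf3 = "netcdf3"
    netcdf4 = "netcdf4"
    opendap = "opendap"


@pytest.mark.parametrize("value", [ft.value for ft in FileType])
def test_every_valid_file_type_initializes(value):
    assert FileType(value) in FileType


def test_unsupported_file_type_raises():
    # enum lookup by value raises ValueError: "'grib' is not a valid FileType"
    with pytest.raises(ValueError, match="is not a valid FileType"):
        FileType("grib")
```

The real version would construct a `FilePattern` in each case rather than the enum directly.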
`pangeo_forge_recipes/patterns.py` (outdated):

```python
if is_opendap:
    _deprecation_message = (
        "`FilePattern(..., is_opendap=True)` will be deprecated in v0.9.0. "
        "Please use `FilePattern(..., file_type='opendap')` instead."
    )
    warnings.warn(_deprecation_message, DeprecationWarning)
    self.file_type = FileType("opendap")
```
👌
In doing this I realized that we never actually followed through on

`pangeo_forge_recipes/recipes/xarray_zarr.py`, lines 631 to 634 (at de33ebc):

```python
_deprecation_message = (
    "This method will be deprecated in v0.8.0. "
    "Please call the equivalent function directly from the xarray_zarr module."
)
```

and `pangeo_forge_recipes/recipes/xarray_zarr.py`, lines 839 to 840 (at de33ebc):

```python
# Below lie convience methods that help users develop and debug the recipe
# They will all be deprecated
```

but for clarity that's probably best as a separate PR.
/run-test-tutorials
Some of the tutorial notebooks appear to be failing as well. See https://github.com/pangeo-forge/pangeo-forge-recipes/actions/runs/1964614213
/run-test-tutorials
Logging output makes notebook diffs hard to read, even with ReviewNB, so I'm enumerating the tutorial notebook fixes here:
`pangeo_forge_recipes/patterns.py` (outdated):

```python
OPENER_MAP = {
    FileType.netcdf3: dict(engine="scipy"),
    FileType.netcdf4: dict(engine="h5netcdf"),
}
```
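For context, a small sketch of how a mapping like this gets consumed, looking up `xr.open_dataset` keyword arguments from the declared file type (hypothetical helper, not the actual call site; string keys stand in for the `FileType` enum members so the snippet is self-contained):

```python
OPENER_MAP = {
    "netcdf3": dict(engine="scipy"),
    "netcdf4": dict(engine="h5netcdf"),
}


def opener_kwargs(file_type: str) -> dict:
    """Return the xr.open_dataset kwargs for a declared file type."""
    try:
        return OPENER_MAP[file_type]
    except KeyError:
        # e.g. opendap endpoints are opened directly, not via this map
        raise ValueError(f"no opener registered for file_type={file_type!r}") from None
```

The declared file type selects the backend engine, which is exactly what makes a mismatched `file_type` + backend configuration fail in a way worth a descriptive error.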
Final nit I promise! 😇

Can we move `OPENER_MAP` to `xarray_zarr_recipe.py`?
`tests/test_patterns.py` (outdated):

```python
else:
    with pytest.raises(ValueError) as excinfo:
        fp = make_concat_merge_pattern(**file_type_kwargs)[0]
        assert f"'{file_type_value}' is not a valid FileType" in str(excinfo.value)
```
I'm not sure this assertion ever gets hit. You want to use the `match=` option in `pytest.raises` to check the error message.
I'm confused as to why we are still seeing upstream-dev fail when this was solved by fsspec/kerchunk#132 and we are installing kerchunk from GitHub.
Is it possible that the environment is cached? Can you double check the exact version / hash of kerchunk that is getting installed? Can you make the tests pass in a local env?
IIUC we never cache upstream dev versions; see `.github/workflows/main.yaml`, lines 85 to 87 (at de33ebc).
Referencing the latest 3.8 upstream dev env build here:
Does the trailing
Can you clarify what you mean by this?
/run-test-tutorials
All checks (including Tutorial Notebooks) pass, with one exception. All other notes have been addressed here, so going to merge. Noting that the only docs change I made was to update the release notes.
Closes #320
Wanted to prioritize this because it will resolve:
Defaulting `FilePattern.file_type` to `FileType.netcdf4` means that this PR is backwards compatible with recent releases and documentation. We might add a block quote note about this in the docs somewhere.

The only other thing I think I'd like to do is consider wrapping the `xr.open_dataset` call in `XarrayZarrRecipe` in a try/except, so that we can raise a more descriptive error in situations like those that motivated this PR.
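A sketch of that try/except idea (hypothetical wrapper name and message; the real change would live where `XarrayZarrRecipe` calls `xr.open_dataset`):

```python
def open_with_descriptive_error(opener, path, file_type):
    """Call an opener, re-raising failures with a hint about file_type mismatches."""
    try:
        return opener(path)
    except Exception as err:
        raise ValueError(
            f"Failed to open {path!r} with file_type={file_type!r}. "
            "If the file's actual format differs, set "
            "FilePattern(..., file_type=...) to match it."
        ) from err
```

The point is just that a backend's opaque low-level error gets chained onto a message that names the configuration knob users actually control.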