
Add support for byte order (endianness) #132

Merged
valeriupredoi merged 1 commit into main from byte-order on Aug 11, 2023
Conversation

markgoddard

netCDF4 variables are written with a specific byte order (endianness),
typically the same as the machine writing to the dataset. In
netCDF4-python this shows up in the variable's dtype.byteorder, and may
be '<' (little), '>' (big) or '=' (native).
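
For illustration, this is visible on the numpy dtype that netCDF4-python exposes (a minimal sketch; the file and variable names are placeholders):

```python
import netCDF4

# Open a dataset and inspect a variable's byte order via its numpy
# dtype. "example.nc" and "temp" are placeholder names.
ds = netCDF4.Dataset("example.nc")
var = ds.variables["temp"]
print(var.dtype.byteorder)  # '<' (little), '>' (big) or '=' (native)
ds.close()
```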

For the local storage backend this generally "just works", since the
numpy array returned by netCDF4-python has the correct byte ordering.

For files in an S3 store, kerchunk uses h5py to read the netCDF4 dataset
when converting it to a Zarr array index. In this case the byte order of
the dtype on the Zarr array is not always correct, and sometimes shows
up as big endian even though little endian was specified. We work around
this by capturing the dtype of the dataset when reading other attributes
(filters, metadata, etc.).
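
The idea of the workaround, as a minimal sketch (the helper name and its direct use of netCDF4-python are assumptions for illustration, not the actual activestorage code):

```python
import netCDF4

# Illustrative sketch only: record the variable's dtype while reading
# its other attributes, so the correct byte order can be used in
# preference to the dtype recorded on the Zarr array.
# "capture_metadata" is a hypothetical helper.
def capture_metadata(path, varname):
    with netCDF4.Dataset(path) as ds:
        var = ds.variables[varname]
        dtype = var.dtype        # preserves the written byte order
        filters = var.filters()  # compression settings, etc.
    return dtype, filters
```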

Closes #76

@markgoddard markgoddard self-assigned this Aug 3, 2023

codecov bot commented Aug 3, 2023

Codecov Report

Patch coverage: 89.47% and no project coverage change.

Comparison is base (8aae749) 86.52% compared to head (bbcfe46) 86.53%.

Additional details and impacted files
@@               Coverage Diff                @@
##           s3-missing-data     #132   +/-   ##
================================================
  Coverage            86.52%   86.53%           
================================================
  Files                    8        8           
  Lines                  579      594   +15     
================================================
+ Hits                   501      514   +13     
- Misses                  78       80    +2     
Files Changed                   Coverage Δ
activestorage/reductionist.py   87.17% <75.00%> (-1.40%) ⬇️
activestorage/active.py         92.68% <100.00%> (+0.07%) ⬆️
activestorage/dummy_data.py     89.21% <100.00%> (+0.55%) ⬆️

☔ View full report in Codecov by Sentry.

@markgoddard (Author)

(Please merge #130 before merging this, since this PR depends on and merges into the s3-missing-data branch. We can change the target branch once #130 merges)

Base automatically changed from s3-missing-data to main August 8, 2023 15:22
@valeriupredoi (Collaborator) left a comment

looking good, many thanks @markgoddard 👍 🍺

@valeriupredoi (Collaborator)

#130 was merged so am merging this too 👍

@valeriupredoi valeriupredoi merged commit 8adf0f0 into main Aug 11, 2023
4 of 5 checks passed
@valeriupredoi valeriupredoi deleted the byte-order branch August 11, 2023 11:25
markgoddard added a commit that referenced this pull request Aug 15, 2023
Since byte order support was added in #132, we capture the dtype from
netCDF metadata and use it in preference to the Zarr dtype when using
the S3 storage backend.

In the case where missing data values are passed into the Active
constructor, we do not load netCDF metadata, and therefore do not
capture the dtype. This leads to the following error with S3 storage:

  AttributeError: 'Active' object has no attribute '_dtype'

This change switches to always loading the metadata so that we can
capture the dtype. The downside is that the metadata load is
unnecessary when using local storage with missing data specified.

This allows us to enable the test_daily_data_masked test for S3.

Closes #137
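
A minimal sketch of the shape of this fix (the constructor signature and method bodies are illustrative assumptions; the real Active class lives in activestorage/active.py and differs in detail):

```python
import netCDF4

class Active:
    def __init__(self, uri, ncvar, missing_values=None):
        self.uri = uri
        self.ncvar = ncvar
        self.missing_values = missing_values
        # Previously metadata was loaded only when missing values were
        # not supplied, so self._dtype could be left unset and later
        # raise AttributeError with the S3 backend. Now always load it.
        self._load_metadata()

    def _load_metadata(self):
        # Record the variable's dtype (including byte order) for use
        # by the S3 storage backend.
        with netCDF4.Dataset(self.uri) as ds:
            self._dtype = ds.variables[self.ncvar].dtype
```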