Variable of dtype int8 casted to float64 #1576
Comments
Can you run |
Here you are
|
OK. I'll let @shoyer comment on the substance but indeed it seems that |
I guess the problem is caused in xarray/conventions.py. Note, when debugging into it, |
Right, since xarray uses Out of curiosity, what is the meaning |
We currently decode anything with a However, this isn't really a useful thing to do for a dataset like this where the values really represent enums/categories. It seems like the CF compliant way to indicate this is with the various flag_* attributes. So we could look for those to indicate that we shouldn't fill-in fill values. Eventually, we could possibly also use this for decoding into a true "categorical" dtype, but numpy doesn't have anything like that yet. |
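To illustrate the idea of looking for the CF flag_* attributes, here is a minimal sketch. It is not xarray's actual decoding logic: the helper name `looks_categorical` and the sample attribute dictionaries are hypothetical, and only the attribute names flag_values, flag_masks and flag_meanings come from the CF conventions.

```python
# CF attribute names that mark a variable as holding flags/categories.
CF_FLAG_ATTRS = ("flag_values", "flag_masks", "flag_meanings")


def looks_categorical(attrs):
    """Return True if the attribute mapping carries any CF flag_* attribute.

    Hypothetical helper: a decoder could use a check like this to skip
    filling in fill values (and the float cast that comes with it).
    """
    return any(name in attrs for name in CF_FLAG_ATTRS)


# Made-up attribute dictionaries, for illustration only.
class_var_attrs = {"flag_values": [10, 20, 30], "flag_meanings": "a b c"}
plain_var_attrs = {"units": "K", "long_name": "temperature"}

is_class_var = looks_categorical(class_var_attrs)
is_plain_var = looks_categorical(plain_var_attrs)
```

A check on attribute names alone is deliberately conservative: it does not validate that flag_values and flag_meanings have matching lengths.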
I see, that is what is done in |
@jhamman |
We have an open issue for this topic (#1194). A lot of it comes down to performance: dask is part of that, but the other issue is that masked arrays in numpy are quite slow. |
I believe this fact is surprising for any user of integer/index/enum/classification datasets. Since its justification seems to be an implementation detail which comes at the cost of increased memory and CPU consumption I suggest documenting it in Here is how we overcome this issue by deleting the
where |
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here or remove the |
I'm using a CF-compliant dataset from the ESA Land Cover CCI Project that contains a variable lccs_class with dtype=int8 and attribute _Unsigned='true'. Its values are class numbers in the range 1 to 220. When I open the dataset with default options, the resulting dtype of that variable will be float64. As the Land Cover maps are quite large (global, 300m grid cells, 129600 x 64800) this produces a considerable memory overhead.
If I switch off CF decoding I get the original data type.
I'd actually expect it to be converted to uint8 or int16 so that values above 127 are represented correctly.
The dataset is available here: ftp://anon-ftp.ceda.ac.uk/neodc/esacci/land_cover/data/land_cover_maps/v1.6.1/ESACCI-LC-L4-LCCS-Map-300m-P5Y-2010-v1.6.1.nc. Note the file is ~3 GB.
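For readers who want to see the expected conversion and the memory cost concretely, here is a small NumPy sketch. The sample values are made up for illustration (they are not read from the dataset), and this shows plain NumPy reinterpretation rather than anything xarray does during decoding.

```python
import numpy as np

# Made-up raw values as int8 would store them: class numbers above 127
# wrap around to negative numbers (130 -> -126, 220 -> -36).
raw = np.array([1, 127, -126, -36], dtype=np.int8)

# Reinterpret the same bytes as unsigned, which is what _Unsigned='true' asks for.
as_uint8 = raw.view(np.uint8)

# Alternative: widen to int16 and wrap negatives back into the 0..255 range.
as_int16 = raw.astype(np.int16) % 256

# Memory footprint of one global 129600 x 64800 grid at different dtypes.
n_cells = 129600 * 64800
bytes_uint8 = n_cells * np.dtype(np.uint8).itemsize
bytes_float64 = n_cells * np.dtype(np.float64).itemsize

print(as_uint8.tolist(), as_int16.tolist())
print(f"uint8: {bytes_uint8 / 1e9:.1f} GB, float64: {bytes_float64 / 1e9:.1f} GB")
```

The float64 array needs eight times the memory of the uint8 one, which is the overhead described above.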
Btw, the attributes of the variable are