Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Variable of dtype int8 casted to float64 #1576

Closed
forman opened this issue Sep 18, 2017 · 11 comments
Closed

Variable of dtype int8 casted to float64 #1576

forman opened this issue Sep 18, 2017 · 11 comments

Comments

@forman
Copy link

forman commented Sep 18, 2017

I'm using a CF-compliant dataset from the ESA Land Cover CCI Project that contains a variable lccs_class with dtype=int8 and attribute _Unsigned='true'. Its values are class numbers in the range 1 to 220. When I open the dataset with default options, the resulting dtype of that variable will be float64. As the Land Cover maps are quite large (global, 300m grid cells, 129600 x 64800) this produces a considerable memory overhead.

>>> ds = xr.open_dataset(path)
>>> ds['lccs_class'].dtype
dtype('float64')

If I switch off CF decoding I get the original data type.

>>> ds = xr.open_dataset(path, decode_cf=False)
>>> ds['lccs_class'].dtype
dtype('int8')

I'd actually expect it to be converted to uint8 or int16 so that values above 127 are represented correctly.

The dataset is available here: ftp://anon-ftp.ceda.ac.uk/neodc/esacci/land_cover/data/land_cover_maps/v1.6.1/ESACCI-LC-L4-LCCS-Map-300m-P5Y-2010-v1.6.1.nc. Note the file is ~3 GB.

Btw, the attributes of the variable are

>>> ds['lccs_class'].attrs
OrderedDict([('long_name', 'Land cover class defined in LCCS'),
             ('standard_name', 'land_cover_lccs'),
             ('flag_values',
              array([   0,   10,   11,   12,   20,   30,   40,   50,   60,   61,   62,
                       70,   71,   72,   80,   81,   82,   90,  100,  110,  120,  121,
                      122, -126, -116, -106, -104, -103,  -96,  -86,  -76,  -66,  -56,
                      -55,  -54,  -46,  -36], dtype=int8)),
             ('flag_meanings',
              'no_data cropland_rainfed cropland_rainfed_herbaceous_cover cropland_rainfed_tree_or_shrub_cover ...'),
             ('valid_min', array([1])),
             ('valid_max', array([220])),
             ('_Unsigned', 'true'),
             ('_FillValue', array([0], dtype=int8)),
             ('ancillary_variables',
              'processed_flag current_pixel_state observation_count algorithmic_confidence_level')])
@fmaussion
Copy link
Member

fmaussion commented Sep 18, 2017

Can you run ncdump -h -s on the file an report back?

@forman
Copy link
Author

forman commented Sep 18, 2017

Here you are

$ ncdump -h -s
netcdf ESACCI-LC-L4-LCCS-Map-300m-P5Y-2005-v1.6.1 {
dimensions:
        lat = 64800 ;
        lon = 129600 ;
variables:
        byte lccs_class(lat, lon) ;
                lccs_class:long_name = "Land cover class defined in LCCS" ;
                lccs_class:standard_name = "land_cover_lccs" ;
                lccs_class:flag_values = 0b, 10b, 11b, 12b, 20b, 30b, 40b, 50b, 60b, 61b, 62b, 70b, 71b, 72b, 80b, 81b, 82b, 90b, 100b, 110b, 120b, 121b, 122b, -126b, -116b, -106b, -104b, -103b, -96b, -86b, -76b, -66b, -56b, -55b, -54b, -46b,
-36b ;
                lccs_class:flag_meanings = "no_data cropland_rainfed cropland_rainfed_herbaceous_cover cropland_rainfed_tree_or_shrub_cover cropland_irrigated mosaic_cropland mosaic_natural_vegetation tree_broadleaved_evergreen_closed_to_open
tree_broadleaved_deciduous_closed_to_open tree_broadleaved_deciduous_closed tree_broadleaved_deciduous_open tree_needleleaved_evergreen_closed_to_open tree_needleleaved_evergreen_closed tree_needleleaved_evergreen_open tree_needleleaved_decidu
ous_closed_to_open tree_needleleaved_deciduous_closed tree_needleleaved_deciduous_open tree_mixed mosaic_tree_and_shrub mosaic_herbaceous shrubland shrubland_evergreen shrubland_deciduous grassland lichens_and_mosses sparse_vegetation sparse_s
hrub sparse_herbaceous tree_cover_flooded_fresh_or_brakish_water tree_cover_flooded_saline_water shrub_or_herbaceous_cover_flooded urban bare_areas bare_areas_consolidated bare_areas_unconsolidated water snow_and_ice" ;
                lccs_class:valid_min = 1 ;
                lccs_class:valid_max = 220 ;
                lccs_class:_Unsigned = "true" ;
                lccs_class:_FillValue = 0b ;
                lccs_class:ancillary_variables = "processed_flag current_pixel_state observation_count algorithmic_confidence_level" ;
                lccs_class:_Storage = "chunked" ;
                lccs_class:_ChunkSizes = 2048, 2048 ;
                lccs_class:_DeflateLevel = 6 ;
                lccs_class:_NoFill = "true" ;
        byte processed_flag(lat, lon) ;
                processed_flag:standard_name = "land_cover_lccs status_flag" ;
                processed_flag:flag_values = 0b, 1b ;
                processed_flag:flag_meanings = "not_processed processed" ;
                processed_flag:valid_min = 0 ;
                processed_flag:valid_max = 1 ;
                processed_flag:_FillValue = -1b ;
                processed_flag:long_name = "LC map processed area flag" ;
                processed_flag:_Storage = "chunked" ;
                processed_flag:_ChunkSizes = 2048, 2048 ;
                processed_flag:_DeflateLevel = 6 ;
                processed_flag:_NoFill = "true" ;
        byte current_pixel_state(lat, lon) ;
                current_pixel_state:standard_name = "land_cover_lccs status_flag" ;
                current_pixel_state:flag_values = 0b, 1b, 2b, 3b, 4b, 5b ;
                current_pixel_state:flag_meanings = "invalid clear_land clear_water clear_snow_ice cloud cloud_shadow" ;
                current_pixel_state:valid_min = 0 ;
                current_pixel_state:valid_max = 5 ;
                current_pixel_state:_FillValue = -1b ;
                current_pixel_state:long_name = "LC pixel type mask" ;
                current_pixel_state:_Storage = "chunked" ;
                current_pixel_state:_ChunkSizes = 2048, 2048 ;
                current_pixel_state:_DeflateLevel = 6 ;
                current_pixel_state:_NoFill = "true" ;
        short observation_count(lat, lon) ;
                observation_count:standard_name = "land_cover_lccs number_of_observations" ;
                observation_count:valid_min = 0 ;
                observation_count:valid_max = 32767 ;
                observation_count:_FillValue = -1s ;
                observation_count:long_name = "number of valid observations" ;
                observation_count:_Storage = "chunked" ;
                observation_count:_ChunkSizes = 2048, 2048 ;
                observation_count:_DeflateLevel = 6 ;
                observation_count:_Endianness = "little" ;
                observation_count:_NoFill = "true" ;
        byte algorithmic_confidence_level(lat, lon) ;
                algorithmic_confidence_level:standard_name = "land_cover_lccs algorithmic_confidence" ;
                algorithmic_confidence_level:valid_min = 0 ;
                algorithmic_confidence_level:valid_max = 100 ;
                algorithmic_confidence_level:scale_factor = 0.01f ;
                algorithmic_confidence_level:_FillValue = -1b ;
                algorithmic_confidence_level:long_name = "LC map confidence level based on algorithm performance" ;
                algorithmic_confidence_level:_Storage = "chunked" ;
                algorithmic_confidence_level:_ChunkSizes = 2048, 2048 ;
                algorithmic_confidence_level:_DeflateLevel = 6 ;
                algorithmic_confidence_level:_NoFill = "true" ;
        float lat(lat) ;
                lat:long_name = "latitude" ;
                lat:standard_name = "latitude" ;
                lat:valid_min = -89.9986f ;
                lat:valid_max = 89.99861f ;
                lat:units = "degrees_north" ;
                lat:_Storage = "chunked" ;
                lat:_ChunkSizes = 64800 ;
                lat:_DeflateLevel = 6 ;
                lat:_Endianness = "little" ;
                lat:_NoFill = "true" ;
        float lon(lon) ;
                lon:long_name = "longitude" ;
                lon:standard_name = "longitude" ;
                lon:valid_min = -179.9986f ;
                lon:valid_max = 179.9986f ;
                lon:units = "degrees_east" ;
                lon:_Storage = "chunked" ;
                lon:_ChunkSizes = 129600 ;
                lon:_DeflateLevel = 6 ;
                lon:_Endianness = "little" ;
                lon:_NoFill = "true" ;
        int crs ;
                crs:i2m = "0.002777777701187,0.0,0.0,-0.002777777701187,-180.00000033927267,90.0" ;
                crs:wkt = "GEOGCS[\"WGS 84\", \r\n  DATUM[\"World Geodetic System 1984\", \r\n    SPHEROID[\"WGS 84\", 6378137.0, 298.257223563, AUTHORITY[\"EPSG\",\"7030\"]], \r\n    AUTHORITY[\"EPSG\",\"6326\"]], \r\n  PRIMEM[\"Greenwich\",
0.0, AUTHORITY[\"EPSG\",\"8901\"]], \r\n  UNIT[\"degree\", 0.017453292519943295], \r\n  AXIS[\"Geodetic longitude\", EAST], \r\n  AXIS[\"Geodetic latitude\", NORTH], \r\n  AUTHORITY[\"EPSG\",\"4326\"]]" ;
                crs:_Endianness = "little" ;
                crs:_NoFill = "true" ;

// global attributes:
                :title = "ESA CCI Land Cover Map" ;
                :summary = "This dataset contains the global ESA CCI land cover classification map derived from satellite data of one epoch." ;
                :type = "ESACCI-LC-L4-LCCS-Map-300m-P5Y" ;
                :id = "ESACCI-LC-L4-LCCS-Map-300m-P5Y-2005-v1.6.1" ;
                :project = "Climate Change Initiative - European Space Agency" ;
                :references = "http://www.esa-landcover-cci.org/" ;
                :institution = "Universite catholique de Louvain" ;
                :contact = "landcover-cci@uclouvain.be" ;
                :comment = "" ;
                :Conventions = "CF-1.6" ;
                :standard_name_vocabulary = "NetCDF Climate and Forecast (CF) Standard Names version 21" ;
                :keywords = "land cover classification,satellite,observation" ;
                :keywords_vocabulary = "NASA Global Change Master Directory (GCMD) Science Keywords" ;
                :license = "ESA CCI Data Policy: free and open access" ;
                :naming_authority = "org.esa-cci" ;
                :cdm_data_type = "grid" ;
                :TileSize = "2048:2048" ;
                :tracking_id = "00f7e0ee-3b0e-4ea3-9b9f-186e02fb4439" ;
                :product_version = "1.6.1" ;
                :date_created = "20151217T094622Z" ;
                :creator_name = "University catholique de Louvain" ;
                :creator_url = "http://www.uclouvain.be/" ;
                :creator_email = "landcover-cci@uclouvain.be" ;
                :source = "MERIS FR L1B version 5.05, MERIS RR L1B version 8.0, SPOT VGT P" ;
                :history = "amorgos-4,0, lc-sdr-1.0, lc-sr-1.0, lc-classification-1.0,lc-user-tools-3.10" ;
                :time_coverage_start = "20030101" ;
                :time_coverage_end = "20071231" ;
                :time_coverage_duration = "P5Y" ;
                :time_coverage_resolution = "P5Y" ;
                :geospatial_lat_min = "-89.99999" ;
                :geospatial_lat_max = "90.0" ;
                :geospatial_lon_min = "-180.0" ;
                :geospatial_lon_max = "179.99998" ;
                :spatial_resolution = "300m" ;
                :geospatial_lat_units = "degrees_north" ;
                :geospatial_lat_resolution = "0.002778" ;
                :geospatial_lon_units = "degrees_east" ;
                :geospatial_lon_resolution = "0.002778" ;
                :_SuperblockVersion = 2 ;
                :_IsNetcdf4 = 1 ;
                :_Format = "netCDF-4" ;
}

@fmaussion
Copy link
Member

OK. I'll let @shoyer comment on the substance but indeed it seems that decode_cf could be cleverer here. It should be an easy fix.

@forman
Copy link
Author

forman commented Sep 18, 2017

I guess, the poblem is caused in xarray/conventions.py.

Note, when debugging into it, fill_value == nd.array([0], dtype == np.int8) and fill_value.dtype.kind='i' and the latter kind is not dealt with. Therefore int8 is turned into float64.

@jhamman
Copy link
Member

jhamman commented Sep 18, 2017

Right, since xarray uses np.nan as its fill value, any array with a _FillValue will be promoted to a float dtype.

Out of curiosity, what is the meaning _NoFill = "true"?

@shoyer
Copy link
Member

shoyer commented Sep 18, 2017

We currently decode anything with a _FillValue attribute to float, so that we can convert any values equal to the fill value to NaN. This ensure's that xarray's NaN skipping aggregations (e.g., mean()) work properly.

However, this isn't really a useful thing to do for a dataset like this where the values really represent enums/categories. It seems like the CF compliant way to indicate this is with the various flag_* attributes. So we could look for those to indicate that we shouldn't fill-in fill values.

Eventually, we could possibly also use this for decoding into a true "categorical" dtype, but numpy doesn't have anything like that yet.

@forman
Copy link
Author

forman commented Sep 18, 2017

I see, that is what is done in mask_and_scale(). Why can't xarray used masked arrays, that would retain the original dtype? (Dask, I guess?)
Expanding integers to 8 byte floats not only cost memory but also CPU, including an inaccurate in-memory integer representation.

@forman
Copy link
Author

forman commented Sep 18, 2017

@jhamman _NoFill is about optimizing writes, see nc_set_fill

@jhamman
Copy link
Member

jhamman commented Sep 18, 2017

Why can't xarray used masked arrays, that would retain the original dtype?

We have an open issue for this topic (#1194). A lot of it comes down to performance, dask is part of that but the other issue is that masked arrays in numpy are quite slow.

@forman forman changed the title Variable of dtype int8 casted to float64 for no obvious reason Variable of dtype int8 casted to float64 Sep 18, 2017
@forman
Copy link
Author

forman commented Sep 19, 2017

@shoyer

We currently decode anything with a _FillValue attribute to float, ...

I believe this fact is surprising for any user of integer/index/enum/classification datasets. Since its justification seems to be an implementation detail which comes at the cost of increased memory and CPU consumption I suggest documenting it in open_dataset() and decode_cf() functions.

Here is how we overcome this issue by deleting the _FillValue attribute of integer variables if their scale_factor and add_offset attributes are not provided:

ds = xr.open_dataset(path, decode_cf=False)
old_fill_values = unset_fill_value_for_int_vars(ds)
ds = xr.decode_cf(ds)
reset_fill_value_for_int_vars(ds, old_fill_values)

where old_fill_values is a mapping of variable names to fill values.

@stale
Copy link

stale bot commented Sep 26, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

@stale stale bot added the stale label Sep 26, 2020
@stale stale bot closed this as completed Nov 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants