Variable of dtype int8 casted to float64 #1576

forman · 2017-09-18T14:28:32Z

I'm using a CF-compliant dataset from the ESA Land Cover CCI Project that contains a variable lccs_class with dtype=int8 and attribute _Unsigned='true'. Its values are class numbers in the range 1 to 220. When I open the dataset with default options, the resulting dtype of that variable will be float64. As the Land Cover maps are quite large (global, 300m grid cells, 129600 x 64800) this produces a considerable memory overhead.

>>> ds = xr.open_dataset(path)
>>> ds['lccs_class'].dtype
dtype('float64')

If I switch off CF decoding I get the original data type.

>>> ds = xr.open_dataset(path, decode_cf=False)
>>> ds['lccs_class'].dtype
dtype('int8')

I'd actually expect it to be converted to uint8 or int16 so that values above 127 are represented correctly.

The dataset is available here: ftp://anon-ftp.ceda.ac.uk/neodc/esacci/land_cover/data/land_cover_maps/v1.6.1/ESACCI-LC-L4-LCCS-Map-300m-P5Y-2010-v1.6.1.nc. Note the file is ~3 GB.

Btw, the attributes of the variable are

>>> ds['lccs_class'].attrs
OrderedDict([('long_name', 'Land cover class defined in LCCS'),
             ('standard_name', 'land_cover_lccs'),
             ('flag_values',
              array([   0,   10,   11,   12,   20,   30,   40,   50,   60,   61,   62,
                       70,   71,   72,   80,   81,   82,   90,  100,  110,  120,  121,
                      122, -126, -116, -106, -104, -103,  -96,  -86,  -76,  -66,  -56,
                      -55,  -54,  -46,  -36], dtype=int8)),
             ('flag_meanings',
              'no_data cropland_rainfed cropland_rainfed_herbaceous_cover cropland_rainfed_tree_or_shrub_cover ...'),
             ('valid_min', array([1])),
             ('valid_max', array([220])),
             ('_Unsigned', 'true'),
             ('_FillValue', array([0], dtype=int8)),
             ('ancillary_variables',
              'processed_flag current_pixel_state observation_count algorithmic_confidence_level')])

The text was updated successfully, but these errors were encountered:

fmaussion · 2017-09-18T14:37:06Z

Can you run ncdump -h -s on the file an report back?

forman · 2017-09-18T15:32:27Z

Here you are

$ ncdump -h -s
netcdf ESACCI-LC-L4-LCCS-Map-300m-P5Y-2005-v1.6.1 {
dimensions:
        lat = 64800 ;
        lon = 129600 ;
variables:
        byte lccs_class(lat, lon) ;
                lccs_class:long_name = "Land cover class defined in LCCS" ;
                lccs_class:standard_name = "land_cover_lccs" ;
                lccs_class:flag_values = 0b, 10b, 11b, 12b, 20b, 30b, 40b, 50b, 60b, 61b, 62b, 70b, 71b, 72b, 80b, 81b, 82b, 90b, 100b, 110b, 120b, 121b, 122b, -126b, -116b, -106b, -104b, -103b, -96b, -86b, -76b, -66b, -56b, -55b, -54b, -46b,
-36b ;
                lccs_class:flag_meanings = "no_data cropland_rainfed cropland_rainfed_herbaceous_cover cropland_rainfed_tree_or_shrub_cover cropland_irrigated mosaic_cropland mosaic_natural_vegetation tree_broadleaved_evergreen_closed_to_open
tree_broadleaved_deciduous_closed_to_open tree_broadleaved_deciduous_closed tree_broadleaved_deciduous_open tree_needleleaved_evergreen_closed_to_open tree_needleleaved_evergreen_closed tree_needleleaved_evergreen_open tree_needleleaved_decidu
ous_closed_to_open tree_needleleaved_deciduous_closed tree_needleleaved_deciduous_open tree_mixed mosaic_tree_and_shrub mosaic_herbaceous shrubland shrubland_evergreen shrubland_deciduous grassland lichens_and_mosses sparse_vegetation sparse_s
hrub sparse_herbaceous tree_cover_flooded_fresh_or_brakish_water tree_cover_flooded_saline_water shrub_or_herbaceous_cover_flooded urban bare_areas bare_areas_consolidated bare_areas_unconsolidated water snow_and_ice" ;
                lccs_class:valid_min = 1 ;
                lccs_class:valid_max = 220 ;
                lccs_class:_Unsigned = "true" ;
                lccs_class:_FillValue = 0b ;
                lccs_class:ancillary_variables = "processed_flag current_pixel_state observation_count algorithmic_confidence_level" ;
                lccs_class:_Storage = "chunked" ;
                lccs_class:_ChunkSizes = 2048, 2048 ;
                lccs_class:_DeflateLevel = 6 ;
                lccs_class:_NoFill = "true" ;
        byte processed_flag(lat, lon) ;
                processed_flag:standard_name = "land_cover_lccs status_flag" ;
                processed_flag:flag_values = 0b, 1b ;
                processed_flag:flag_meanings = "not_processed processed" ;
                processed_flag:valid_min = 0 ;
                processed_flag:valid_max = 1 ;
                processed_flag:_FillValue = -1b ;
                processed_flag:long_name = "LC map processed area flag" ;
                processed_flag:_Storage = "chunked" ;
                processed_flag:_ChunkSizes = 2048, 2048 ;
                processed_flag:_DeflateLevel = 6 ;
                processed_flag:_NoFill = "true" ;
        byte current_pixel_state(lat, lon) ;
                current_pixel_state:standard_name = "land_cover_lccs status_flag" ;
                current_pixel_state:flag_values = 0b, 1b, 2b, 3b, 4b, 5b ;
                current_pixel_state:flag_meanings = "invalid clear_land clear_water clear_snow_ice cloud cloud_shadow" ;
                current_pixel_state:valid_min = 0 ;
                current_pixel_state:valid_max = 5 ;
                current_pixel_state:_FillValue = -1b ;
                current_pixel_state:long_name = "LC pixel type mask" ;
                current_pixel_state:_Storage = "chunked" ;
                current_pixel_state:_ChunkSizes = 2048, 2048 ;
                current_pixel_state:_DeflateLevel = 6 ;
                current_pixel_state:_NoFill = "true" ;
        short observation_count(lat, lon) ;
                observation_count:standard_name = "land_cover_lccs number_of_observations" ;
                observation_count:valid_min = 0 ;
                observation_count:valid_max = 32767 ;
                observation_count:_FillValue = -1s ;
                observation_count:long_name = "number of valid observations" ;
                observation_count:_Storage = "chunked" ;
                observation_count:_ChunkSizes = 2048, 2048 ;
                observation_count:_DeflateLevel = 6 ;
                observation_count:_Endianness = "little" ;
                observation_count:_NoFill = "true" ;
        byte algorithmic_confidence_level(lat, lon) ;
                algorithmic_confidence_level:standard_name = "land_cover_lccs algorithmic_confidence" ;
                algorithmic_confidence_level:valid_min = 0 ;
                algorithmic_confidence_level:valid_max = 100 ;
                algorithmic_confidence_level:scale_factor = 0.01f ;
                algorithmic_confidence_level:_FillValue = -1b ;
                algorithmic_confidence_level:long_name = "LC map confidence level based on algorithm performance" ;
                algorithmic_confidence_level:_Storage = "chunked" ;
                algorithmic_confidence_level:_ChunkSizes = 2048, 2048 ;
                algorithmic_confidence_level:_DeflateLevel = 6 ;
                algorithmic_confidence_level:_NoFill = "true" ;
        float lat(lat) ;
                lat:long_name = "latitude" ;
                lat:standard_name = "latitude" ;
                lat:valid_min = -89.9986f ;
                lat:valid_max = 89.99861f ;
                lat:units = "degrees_north" ;
                lat:_Storage = "chunked" ;
                lat:_ChunkSizes = 64800 ;
                lat:_DeflateLevel = 6 ;
                lat:_Endianness = "little" ;
                lat:_NoFill = "true" ;
        float lon(lon) ;
                lon:long_name = "longitude" ;
                lon:standard_name = "longitude" ;
                lon:valid_min = -179.9986f ;
                lon:valid_max = 179.9986f ;
                lon:units = "degrees_east" ;
                lon:_Storage = "chunked" ;
                lon:_ChunkSizes = 129600 ;
                lon:_DeflateLevel = 6 ;
                lon:_Endianness = "little" ;
                lon:_NoFill = "true" ;
        int crs ;
                crs:i2m = "0.002777777701187,0.0,0.0,-0.002777777701187,-180.00000033927267,90.0" ;
                crs:wkt = "GEOGCS[\"WGS 84\", \r\n  DATUM[\"World Geodetic System 1984\", \r\n    SPHEROID[\"WGS 84\", 6378137.0, 298.257223563, AUTHORITY[\"EPSG\",\"7030\"]], \r\n    AUTHORITY[\"EPSG\",\"6326\"]], \r\n  PRIMEM[\"Greenwich\",
0.0, AUTHORITY[\"EPSG\",\"8901\"]], \r\n  UNIT[\"degree\", 0.017453292519943295], \r\n  AXIS[\"Geodetic longitude\", EAST], \r\n  AXIS[\"Geodetic latitude\", NORTH], \r\n  AUTHORITY[\"EPSG\",\"4326\"]]" ;
                crs:_Endianness = "little" ;
                crs:_NoFill = "true" ;

// global attributes:
                :title = "ESA CCI Land Cover Map" ;
                :summary = "This dataset contains the global ESA CCI land cover classification map derived from satellite data of one epoch." ;
                :type = "ESACCI-LC-L4-LCCS-Map-300m-P5Y" ;
                :id = "ESACCI-LC-L4-LCCS-Map-300m-P5Y-2005-v1.6.1" ;
                :project = "Climate Change Initiative - European Space Agency" ;
                :references = "http://www.esa-landcover-cci.org/" ;
                :institution = "Universite catholique de Louvain" ;
                :contact = "landcover-cci@uclouvain.be" ;
                :comment = "" ;
                :Conventions = "CF-1.6" ;
                :standard_name_vocabulary = "NetCDF Climate and Forecast (CF) Standard Names version 21" ;
                :keywords = "land cover classification,satellite,observation" ;
                :keywords_vocabulary = "NASA Global Change Master Directory (GCMD) Science Keywords" ;
                :license = "ESA CCI Data Policy: free and open access" ;
                :naming_authority = "org.esa-cci" ;
                :cdm_data_type = "grid" ;
                :TileSize = "2048:2048" ;
                :tracking_id = "00f7e0ee-3b0e-4ea3-9b9f-186e02fb4439" ;
                :product_version = "1.6.1" ;
                :date_created = "20151217T094622Z" ;
                :creator_name = "University catholique de Louvain" ;
                :creator_url = "http://www.uclouvain.be/" ;
                :creator_email = "landcover-cci@uclouvain.be" ;
                :source = "MERIS FR L1B version 5.05, MERIS RR L1B version 8.0, SPOT VGT P" ;
                :history = "amorgos-4,0, lc-sdr-1.0, lc-sr-1.0, lc-classification-1.0,lc-user-tools-3.10" ;
                :time_coverage_start = "20030101" ;
                :time_coverage_end = "20071231" ;
                :time_coverage_duration = "P5Y" ;
                :time_coverage_resolution = "P5Y" ;
                :geospatial_lat_min = "-89.99999" ;
                :geospatial_lat_max = "90.0" ;
                :geospatial_lon_min = "-180.0" ;
                :geospatial_lon_max = "179.99998" ;
                :spatial_resolution = "300m" ;
                :geospatial_lat_units = "degrees_north" ;
                :geospatial_lat_resolution = "0.002778" ;
                :geospatial_lon_units = "degrees_east" ;
                :geospatial_lon_resolution = "0.002778" ;
                :_SuperblockVersion = 2 ;
                :_IsNetcdf4 = 1 ;
                :_Format = "netCDF-4" ;
}

fmaussion · 2017-09-18T15:38:49Z

OK. I'll let @shoyer comment on the substance but indeed it seems that decode_cf could be cleverer here. It should be an easy fix.

forman · 2017-09-18T15:52:55Z

I guess, the poblem is caused in xarray/conventions.py.

Note, when debugging into it, fill_value == nd.array([0], dtype == np.int8) and fill_value.dtype.kind='i' and the latter kind is not dealt with. Therefore int8 is turned into float64.

jhamman · 2017-09-18T16:03:49Z

Right, since xarray uses np.nan as its fill value, any array with a _FillValue will be promoted to a float dtype.

Out of curiosity, what is the meaning _NoFill = "true"?

shoyer · 2017-09-18T16:04:47Z

We currently decode anything with a _FillValue attribute to float, so that we can convert any values equal to the fill value to NaN. This ensure's that xarray's NaN skipping aggregations (e.g., mean()) work properly.

However, this isn't really a useful thing to do for a dataset like this where the values really represent enums/categories. It seems like the CF compliant way to indicate this is with the various flag_* attributes. So we could look for those to indicate that we shouldn't fill-in fill values.

Eventually, we could possibly also use this for decoding into a true "categorical" dtype, but numpy doesn't have anything like that yet.

forman · 2017-09-18T16:13:45Z

I see, that is what is done in mask_and_scale(). Why can't xarray used masked arrays, that would retain the original dtype? (Dask, I guess?)
Expanding integers to 8 byte floats not only cost memory but also CPU, including an inaccurate in-memory integer representation.

forman · 2017-09-18T16:20:33Z

@jhamman _NoFill is about optimizing writes, see nc_set_fill

jhamman · 2017-09-18T16:26:36Z

Why can't xarray used masked arrays, that would retain the original dtype?

We have an open issue for this topic (#1194). A lot of it comes down to performance, dask is part of that but the other issue is that masked arrays in numpy are quite slow.

forman · 2017-09-19T08:16:43Z

@shoyer

We currently decode anything with a _FillValue attribute to float, ...

I believe this fact is surprising for any user of integer/index/enum/classification datasets. Since its justification seems to be an implementation detail which comes at the cost of increased memory and CPU consumption I suggest documenting it in open_dataset() and decode_cf() functions.

Here is how we overcome this issue by deleting the _FillValue attribute of integer variables if their scale_factor and add_offset attributes are not provided:

ds = xr.open_dataset(path, decode_cf=False)
old_fill_values = unset_fill_value_for_int_vars(ds)
ds = xr.decode_cf(ds)
reset_fill_value_for_int_vars(ds, old_fill_values)

where old_fill_values is a mapping of variable names to fill values.

stale · 2020-09-26T10:40:38Z

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

forman changed the title ~~Variable of dtype int8 casted to float64 for no obvious reason~~ Variable of dtype int8 casted to float64 Sep 18, 2017

This was referenced Sep 19, 2017

Support for unsigned data #1579

Closed

Land Cover CCI uses float64 instead of uint8 CCI-Tools/cate#378

Open

dcherian added the topic-CF conventions label Oct 26, 2018

stale bot added the stale label Sep 26, 2020

stale bot closed this as completed Nov 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Variable of dtype int8 casted to float64 #1576

Variable of dtype int8 casted to float64 #1576

forman commented Sep 18, 2017 •

edited

Loading

fmaussion commented Sep 18, 2017 •

edited

Loading

forman commented Sep 18, 2017

fmaussion commented Sep 18, 2017

forman commented Sep 18, 2017 •

edited

Loading

jhamman commented Sep 18, 2017

shoyer commented Sep 18, 2017

forman commented Sep 18, 2017

forman commented Sep 18, 2017

jhamman commented Sep 18, 2017

forman commented Sep 19, 2017

stale bot commented Sep 26, 2020

Variable of dtype int8 casted to float64 #1576

Variable of dtype int8 casted to float64 #1576

Comments

forman commented Sep 18, 2017 • edited Loading

fmaussion commented Sep 18, 2017 • edited Loading

forman commented Sep 18, 2017

fmaussion commented Sep 18, 2017

forman commented Sep 18, 2017 • edited Loading

jhamman commented Sep 18, 2017

shoyer commented Sep 18, 2017

forman commented Sep 18, 2017

forman commented Sep 18, 2017

jhamman commented Sep 18, 2017

forman commented Sep 19, 2017

stale bot commented Sep 26, 2020

forman commented Sep 18, 2017 •

edited

Loading

fmaussion commented Sep 18, 2017 •

edited

Loading

forman commented Sep 18, 2017 •

edited

Loading