Use masked arrays while preserving int #1194

gerritholl · 2017-01-06T12:40:22Z

A great beauty of NumPy's masked arrays is that they work with any dtype, since they do not use NaN. Unfortunately, when I try to put my data into an xarray.Dataset, it converts ints to float, as shown below:

In [137]: x = arange(30, dtype="i1").reshape(3, 10)

In [138]: xr.Dataset({"count": (["x", "y"], ma.masked_where(x%5>3, x))}, coords={"x": range(3), "y":
     ...: range(10)})
Out[138]:
<xarray.Dataset>
Dimensions:  (x: 3, y: 10)
Coordinates:
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9
  * x        (x) int64 0 1 2
Data variables:
    count    (x, y) float64 0.0 1.0 2.0 3.0 nan 5.0 6.0 7.0 8.0 nan 10.0 ...

This happens in the function _maybe_promote.

Such type “promotion” is unaffordable for me; the memory consumption of my multi-gigabyte arrays would explode by a factor of 4. Secondly, many of my integer-dtype fields are bit arrays, for which a floating-point representation is not desirable.
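
To make the memory cost concrete, here is a minimal sketch that reuses the int8 example above and compares buffer sizes before and after promotion. The variable names are illustrative only, and the exact factor depends on the source dtype and the target float width: int8 to float64 is 8x, while a factor of 4 corresponds to, for example, int16 to float64.

```python
import numpy as np

# Same data as in the session above: int8 values with every x % 5 == 4 masked.
x = np.arange(30, dtype="i1").reshape(3, 10)
masked = np.ma.masked_where(x % 5 > 3, x)

# Emulate the promotion: cast to float64 and write NaN into the masked slots.
promoted = masked.astype("f8").filled(np.nan)

data_bytes = masked.data.nbytes                  # int8 payload
mask_bytes = np.ma.getmaskarray(masked).nbytes   # one bool per element
promoted_bytes = promoted.nbytes                 # float64 payload

print(f"int8 data: {data_bytes} B, mask: {mask_bytes} B, float64: {promoted_bytes} B")
print(f"promotion factor vs. raw data: {promoted_bytes / data_bytes:.0f}x")
```

Even counting the boolean mask, the masked int8 representation stays well below the float64 one in this sketch.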

It would greatly benefit xarray if it could use masking while preserving the dtype of input data.

(See also: Stackoverflow question)

shoyer · 2017-01-07T02:54:54Z

I answered your question on StackOverflow.

I agree that this is unfortunate. The cleanest solution would be an integer dtype with missing value support in NumPy itself, but that isn't going to happen anytime soon.

I'm not entirely opposed to the idea of adding (limited) support for masked arrays in xarray (see also #1118), but this could be a lot of work for relatively limited return.

I definitely recommend trying dask for processing multi-gigabyte arrays. You might even find the performance boost compelling enough that you could forgive the limitation that it doesn't handle masked arrays, either.

gerritholl · 2017-01-07T11:24:49Z

I don't see how an integer dtype could ever support missing values; float missing values are specifically defined by IEEE 754, but for ints, every sequence of bits corresponds to a valid value. OTOH, NetCDF does have a _FillValue attribute that works for any type, including int. If we view xarray as "NetCDF in memory", that could be an approach to follow, but for numpy in general it would fairly heavily break existing code (see also http://www.numpy.org/NA-overview.html), in particular for 8-bit types. If I understand correctly, R uses INT_MAX which would be 127 for 'int8… Apparently, R ints are always 32 bits. I'm new to xarray so I don't have a good idea of how much work adding support for masked arrays would be, but I'll take your word that it's not straightforward.
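
The _FillValue idea can be sketched in plain numpy as a round trip between a masked array and a sentinel-encoded integer array. This is only an illustration under assumptions: the sentinel `FILL = -128` is a hypothetical choice, not something NetCDF or xarray prescribes, and the approach only works when at least one value of the dtype is never used by real data.

```python
import numpy as np

# Hypothetical sentinel for illustration; it must never occur as real data.
FILL = np.int8(-128)

x = np.arange(30, dtype="i1").reshape(3, 10)
masked = np.ma.masked_where(x % 5 > 3, x)

# Encode: write the sentinel into the masked slots; the dtype stays int8.
encoded = masked.filled(FILL)

# Decode: rebuild the mask by comparing against the sentinel.
decoded = np.ma.masked_equal(encoded, FILL)

print(encoded.dtype, decoded.count(), "valid values")
```

Note that, unlike a true mask, the encoded array no longer holds the original values under the mask, and a bit-field array that uses every bit pattern has no spare value to serve as a sentinel.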

stale · 2019-01-24T11:05:22Z

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.
If this issue remains relevant, please comment here; otherwise it will be marked as closed automatically.

gerritholl · 2019-01-24T11:10:46Z

I think this issue should remain open. I think it would still be highly desirable to implement support for true masked arrays, such that any value can be masked without throwing away the original value.

max-sixty · 2019-01-24T14:09:32Z

@gerritholl check out https://pandas-docs.github.io/pandas-docs-travis/whatsnew/v0.24.0.html#whatsnew-0240-enhancements-intna

I think that's the closest way of having int support; from my understanding, supporting masked arrays directly would be a decent lift.

gerritholl · 2019-01-24T14:40:33Z

@max-sixty Interesting! I wonder what it would take to make use of this "nullable integer data type" in xarray. It wouldn't work to convert it to a standard numpy array (da.values) retaining the dtype, but one could make a new .to_maskedarray() method returning a numpy masked array; that would probably be easier than to add full support for masked arrays.
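
As a rough sketch of that conversion outside of xarray, a pandas nullable integer array can be turned into a numpy masked array without passing through float. The placeholder value `0` written under the mask is an arbitrary assumption for illustration; nothing here is an existing xarray method.

```python
import numpy as np
import pandas as pd

# Nullable integer array: the missing entry is stored as pd.NA, dtype stays Int8.
values = pd.array([0, 1, 2, 3, None, 5], dtype="Int8")

# Boolean mask of missing entries.
mask = np.asarray(values.isna())

# Plain int8 buffer; missing slots get an arbitrary placeholder (0 here).
data = values.to_numpy(dtype="int8", na_value=0)

masked = np.ma.masked_array(data, mask=mask)
print(masked.dtype, masked)
```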

gerritholl · 2020-01-31T14:42:36Z

Pandas 1.0 uses pd.NA for integers, boolean, and string dtypes: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values

Hoeze · 2020-03-29T13:00:29Z

Currently I keep carrying a "_missing" mask with all of my unstacked arrays to solve this issue. It would be very desirable to have a clean solution for this to keep arrays from being converted to float.
Also, NaN does not necessarily mean NA, which has already caused me quite some head-scratching in the past. Further, it would be a very cool indicator to see which values of a dense array should be converted into a sparse array.

eric-czech · 2020-03-29T20:37:29Z

I agree, I have this same issue with large genotyping data arrays often containing tiny integers and some degree of missingness in nearly 100% of raw datasets. Are there recommended workarounds now? I am thinking of constantly using Datasets instead of DataArrays with mask arrays to accompany every data array, but I'm not sure if that's the best interim solution.
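
For reference, the companion-mask workaround described in the last two comments might look roughly like the sketch below. The variable name `count_missing` and the layout are assumptions made for illustration, not an xarray convention.

```python
import numpy as np
import xarray as xr

x = np.arange(30, dtype="i1").reshape(3, 10)
missing = x % 5 > 3  # True where the value should be treated as missing

# Store the integer data and its boolean mask side by side; no promotion occurs.
ds = xr.Dataset(
    {
        "count": (["x", "y"], x),
        "count_missing": (["x", "y"], missing),
    },
    coords={"x": range(3), "y": range(10)},
)

# Rebuild a numpy masked array when masked semantics are needed.
masked = np.ma.masked_array(ds["count"].values, mask=ds["count_missing"].values)

# Mask-aware bookkeeping stays in integer land, e.g. valid entries per row.
valid_per_row = (~ds["count_missing"]).sum("y")

print(ds["count"].dtype, masked.count(), valid_per_row.values)
```

The cost is that every operation has to be applied to both variables consistently, which is the bookkeeping burden the comments above describe.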

keewis · 2024-09-29T13:56:34Z

I've recently come across marray, which is still very experimental (and still needs a hack to really work) but allows us to wrap masked arrays:

In [1]: import marray
   ...: import numpy as np
   ...: import xarray as xr
   ...: 
   ...: # create a nested namespace for masked arrays wrapping numpy
   ...: xp = marray.masked_array(np)
   ...: data = xp.arange(10)
   ...: data.mask[:] = data.data % 2 == 0
   ...: # hack: set `__array_namespace__` to the nested namespace we just created
   ...: data.__array_namespace__ = lambda self, **kwargs: xp
   ...: 
   ...: arr = xr.DataArray(data, dims="x")
   ...: arr
Out[1]: 
<xarray.DataArray (x: 10)> Size: 80B
masked_array(data=[--, 1, --, 3, --, 5, --, 7, --, 9],
             mask=[ True, False,  True, False,  True, False,  True, False,
                    True, False],
       fill_value=999999)
Dimensions without coordinates: x

(there are a lot of other things that do not work, for example indexing / isel)

Also, @shoyer, this is another instance of the nested array namespace I was talking about in the last meeting.
