This repository has been archived by the owner on Dec 1, 2022. It is now read-only.

Integer-valued ranges and missing values #78

Closed
jonblower opened this issue Jul 25, 2016 · 4 comments

@jonblower
Member

jonblower commented Jul 25, 2016

In JSON there's nothing wrong with having an array of integers with missing values:

{
  "type": "NdArray",
  "dataType": "integer",
  "axisNames": ["t", "z", "y", "x"],
  "shape": [1, 1, 2, 3],
  "values": [5, 6, 4, 6, null, 2]
}

But in many programming environments (e.g. NumPy in Python) this causes problems, because there is no way to record a "missing value" in a native array of integers. (With floating-point numbers one can use NaN for missing values.)
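To make the problem concrete, here is a minimal sketch assuming the NdArray above is parsed with Python's `json` module and converted with NumPy. The variable names are illustrative only and do not come from any CoverageJSON library.

```python
import json

import numpy as np

# The NdArray from above, parsed from JSON text.
doc = json.loads("""
{
  "type": "NdArray",
  "dataType": "integer",
  "axisNames": ["t", "z", "y", "x"],
  "shape": [1, 1, 2, 3],
  "values": [5, 6, 4, 6, null, 2]
}
""")

values = doc["values"]  # JSON null arrives as Python None

# Forcing an integer dtype fails: None has no integer representation.
try:
    np.array(values, dtype=np.int64)
    int_conversion_failed = False
except (TypeError, ValueError):
    int_conversion_failed = True

# A float dtype works, because NumPy converts None to NaN.
as_float = np.array(values, dtype=np.float64).reshape(doc["shape"])
```

The float conversion succeeds but silently changes the declared data type, which is why the integer case needs one of the workarounds listed next.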

The workarounds would include:

  1. Use a masked array (i.e. a parallel array of flags to indicate missing values), which adds inefficiency.
  2. Use an array of objects (which can be nulled), instead of an array of primitives, which also adds inefficiency.
  3. Use a "special" integer value (e.g. -999) to denote missing values, and make sure this is taken into account in calculations (requires extra metadata on the NdArray to advertise this special value, and usually translates into the creation of a masked array anyway).

So there are two possible courses of action:

  1. Consider the presence of "null" in an integer array to be an error, and disallow it in the spec
  2. Allow the use of "null", but provide advice to data providers of the difficulties it may cause for clients. (If the data are categorical, then assigning one category to "missing data" may be preferable to using nulls.)
@letmaik
Member

letmaik commented Jul 26, 2016

On 25/07/2016 15:00, Jon Blower wrote:

  1. Consider the presence of "null" in an integer array to be an
    error, and disallow it in the spec
  2. Allow the use of "null", but provide advice to data providers of
    the difficulties it may cause for clients. (If the data are
    categorical, then assigning one category to "missing data" may be
    preferable to using nulls.)

Hm, I think from a semantic point of view I don't like "missing data"
categories that much. For example, think about an observed property
"land cover" with categories grassland, urban, ... "Missing data"
doesn't fit into that collection in my opinion. And this also makes
rendering more complicated as there is no easy way anymore to detect
missing data and to skip rendering of those pixels etc.

I think this issue should be solved in the software by reading the
integer data into an array and replacing nulls with an unused integer
(which would be fairly easy to find out by first scanning the input
array for the maximum value). This unused integer is then marked as
missing value, either with abstractions like numpy's masked arrays or
otherwise. I don't think the performance impact would be noticeable.
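A rough sketch of that approach with NumPy, assuming the values have already been parsed into a Python list. The helper name `to_masked_int_array` is hypothetical and is not taken from any existing library.

```python
import numpy as np


def to_masked_int_array(values, shape):
    """Hypothetical helper: list of ints/None -> masked integer array."""
    present = [v for v in values if v is not None]
    # One above the maximum is guaranteed not to occur in the data.
    # (Assumes the maximum is below the int64 limit.)
    fill = max(present) + 1 if present else 0

    # Replace each None with the unused integer and record where it was.
    data = np.array(
        [fill if v is None else v for v in values], dtype=np.int64
    ).reshape(shape)
    mask = np.array([v is None for v in values], dtype=bool).reshape(shape)

    return np.ma.masked_array(data, mask=mask, fill_value=fill)


arr = to_masked_int_array([5, 6, 4, 6, None, 2], (1, 1, 2, 3))
```

Aggregations such as `arr.sum()` then skip the masked element, and `arr.filled()` returns a plain integer array with the placeholder in the missing positions.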

@jonblower
Member Author

jonblower commented Jul 27, 2016

I think you're right that that's the only realistic solution. A bit of a pain though.

With numpy arrays of floating point numbers, does it matter if we use NaNs for missing values, or are masked arrays better? Do they give different results, or perform differently?

@letmaik
Member

letmaik commented Jul 27, 2016

If efficiency is important, I would use NaNs because masked arrays are
still slower (but maybe not enough to notice for our purposes!). If you
use NaNs, then you have to be careful with aggregation operations and
there are special versions like np.nanmean which ignore NaNs. For
consistency though, I would probably use masked arrays everywhere (for
float and integer arrays).
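As an illustration of the aggregation caveat, the following small sketch compares a plain mean, `np.nanmean`, and a masked-array mean on the same data. The sample values are made up for the example.

```python
import numpy as np

values = [5.0, 6.0, 4.0, 6.0, None, 2.0]

# None becomes NaN when converting to a float array.
nan_array = np.array(values, dtype=np.float64)

# The same data as a masked array, with the NaN entry masked out.
masked = np.ma.masked_invalid(nan_array)

plain_mean = np.mean(nan_array)    # NaN propagates into the result
nan_mean = np.nanmean(nan_array)   # NaN-aware variant ignores the NaN
masked_mean = masked.mean()        # masked entries are ignored
```

Both NaN-aware and masked aggregation give the mean of the five valid values, while the plain `np.mean` returns NaN.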

@letmaik
Member

letmaik commented Feb 18, 2022

Closing this as it's not a real issue. The libraries we've created can deal with it, and yes, it's a bit annoying, but it's also not too bad. I think, if anything, this may be picked up in the future with a new range type that supports both missing-value encoding and offset/scaling.

@letmaik letmaik closed this as completed Feb 18, 2022