[Task]: Investigate xcdat handling of missing data #244

Closed · pochedls opened this issue May 27, 2022 · 3 comments

pochedls commented May 27, 2022

Describe the task

Much of the geospatial data that we use has missing / masked values. This issue is intended to verify that xcdat handles missing data correctly for:

  • spatial averages
  • temporal averages
  • regridding

Notes on weighted averaging

Spatial and temporal averages are weighted averages: spatial averages are typically weighted by the area of each grid cell, and temporal averages by the length of each time interval. In general, a weighted average is just:

WA = Σ_x [ w(x) * v(x) ] / Σ_x [ w(x) ]

where the sum runs over the times/locations x, and w(x) and v(x) are the weight and value at x.

If a value is missing, its weight should be set to zero. For example, if I have arrays v = [99, 80, 77, 92, 87] and w = [10, 10, 10, 10, 30], I get WA = 87.0 (think of a weighted grade average with four homeworks worth 10 points each and a quiz worth 30).

Now suppose the teacher says that I can skip one homework, so I have v = [99, 80, np.nan, 92, 87] with w = [10, 10, 10, 10, 30] unchanged. Naively I get WA = 76.0 (or nan if I don't use np.nansum in the numerator). That is impossible: none of my remaining grades is below 80, so the average cannot fall below 80. The problem is that the missing homework still carries its weight in the denominator. Zeroing out that weight (w = [10, 10, 0, 10, 30]) yields WA = 88.67.
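
A quick numpy sketch of the grade example (values copied from above; a minimal illustration, not xcdat code):

```python
import numpy as np

v = np.array([99, 80, np.nan, 92, 87])  # grades; third homework is missing
w = np.array([10, 10, 10, 10, 30.0])    # weights: four homeworks and a quiz

# Biased: the missing value is dropped from the numerator,
# but its weight still inflates the denominator.
biased = np.nansum(w * v) / np.sum(w)         # 5320 / 70 = 76.0

# Correct: zero the weight wherever the value is missing.
w_fixed = np.where(np.isnan(v), 0.0, w)
correct = np.nansum(w * v) / np.sum(w_fixed)  # 5320 / 60 ≈ 88.67

print(biased, correct)
```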

The take-home message is that weights corresponding to missing / masked values must be zeroed out for spatial and temporal averaging.

The current spatial and temporal averages use the xarray .weighted().mean() API, which generally appears to handle missing data appropriately, though special attention may be needed for the groupby operations used in temporal averaging. One open question is whether a group that includes a missing value (e.g., May, June, July temperature) should return NaN or the weighted average of the available data.
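
As a sanity check, here are the same numbers run through xarray's weighted API (assuming xarray ≥ 0.15, where .weighted() was introduced; this reflects my understanding of its NaN handling):

```python
import numpy as np
import xarray as xr

da = xr.DataArray([99.0, 80.0, np.nan, 92.0, 87.0], dims="x")
weights = xr.DataArray([10.0, 10.0, 10.0, 10.0, 30.0], dims="x")

# With skipna enabled (the default for float data), xarray masks the
# weights of missing values, matching the hand computation above.
print(float(da.weighted(weights).mean(dim="x")))  # ≈ 88.67, not 76.0
```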

These notes are excerpted from a conversation with @tomvothecoder

pochedls self-assigned this May 27, 2022
taylor13 commented Jun 3, 2022

Regarding the question of a single missing month in a seasonal average, it may not make much practical difference. But when computing an annual mean, if all the winter or all the summer months were missing, then you could get a quite biased result (for things like temperature, at least).

For the annual mean calculation in CDAT we came up with two criteria (set by the user) that determine whether a mean is recorded or the value is set to missing: 1) the threshold fraction of samples required to compute the mean, and 2) how far the "centroid" of weights computed from the available samples differs from the centroid of weights computed assuming no missing samples. If you consider monthly data as numbers on a clock, then with no missing data the centroid lies at the axis of the clock hands. Similarly, if data were available for only 4 months but equally distributed (say, January, April, July, and October), the centroid would still be at the center. But if the 4 months all occurred in one half of the year, the centroid would be offset. When computing an annual mean, the user would specify the minimum number of months required and how close the centroid must be to the center of the clock.
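
A rough sketch of those two criteria (the function name and default thresholds are made up here; this is not CDAT's actual API):

```python
import numpy as np

def annual_mean(values, weights, min_frac=0.5, max_offset=0.1):
    """Annual mean with two acceptance criteria: a minimum fraction of
    available months, and a cap on how far the centroid of the available
    months' weights drifts from the center of the "clock"."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    present = ~np.isnan(values)

    # Criterion 1: enough months with data?
    if present.mean() < min_frac:
        return np.nan

    # Criterion 2: place the 12 months evenly around a circle and compute
    # the weighted centroid of the available months. Fully sampled (or
    # evenly sampled, e.g. Jan/Apr/Jul/Oct) data puts it at the origin.
    theta = 2 * np.pi * np.arange(values.size) / values.size
    w = np.where(present, weights, 0.0)
    cx = np.sum(w * np.cos(theta)) / np.sum(w)
    cy = np.sum(w * np.sin(theta)) / np.sum(w)
    if np.hypot(cx, cy) > max_offset:
        return np.nan  # available months are lopsided (e.g. all summer)

    return np.sum(w * np.nan_to_num(values)) / np.sum(w)
```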

I would be happy to discuss further.

taylor13 commented Jun 3, 2022

Another suggestion: sometimes carrying both the "unmasked" weights and a second array of "masking" factors can be useful when constructing algorithms. When computing means, you would then calculate Σ(wts × msk × data) / Σ(wts × msk) over the samples. In general "msk" would be a fraction (set to 0 for missing values). When regridding conservatively, "wts" would be set to the area of each grid cell and "msk" would indicate the fraction of each grid cell that is unmasked. The output of the regridder would give the area of each target grid cell and the unmasked fraction of each target cell, along with the regridded field itself.
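
A tiny numeric illustration of that bookkeeping (arrays invented for the example):

```python
import numpy as np

wts = np.array([1.0, 2.0, 2.0, 1.0])      # "unmasked" weights, e.g. cell areas
msk = np.array([1.0, 1.0, 0.5, 0.0])      # fraction of each cell that is valid
data = np.array([10.0, 12.0, 14.0, 3.0])  # cell values (last cell fully masked)

# Mean over the valid fractions only: the fully masked cell drops out,
# and the half-masked cell contributes half its area.
mean = np.sum(wts * msk * data) / np.sum(wts * msk)
print(mean)  # (10 + 24 + 14) / (1 + 2 + 1) = 12.0
```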

@tomvothecoder

I am going to convert this issue to a discussion to make it easier to follow individual threads.

xCDAT locked and limited conversation to collaborators Jul 6, 2022
tomvothecoder converted this issue into discussion #275 Jul 6, 2022
