[Task]: Investigate xcdat handling of missing data #244

Closed · pochedls opened this issue May 27, 2022 · 3 comments

pochedls commented May 27, 2022

Describe the task

Much of the geospatial data that we use has missing / masked values. This issue is intended to verify that xcdat handles missing data correctly for:

  • spatial averages
  • temporal averages
  • regridding

Notes on weighted averaging

Spatial and temporal averages are weighted averages: spatial averages are typically weighted by the area of each grid cell, and temporal averages by the length of each time interval. In general, a weighted average is just:

WA = Σ_x [ w(x) * v(x) ] / Σ_x [ w(x) ]

where the sum runs over the times/locations x, and w(x) and v(x) are the weight and value at x.

If a value is missing, its weight should be set to zero. For example, if I have arrays v = [99, 80, 77, 92, 87] and w = [10, 10, 10, 10, 30], I get WA = 87.0 (think of a weighted grade average with four homeworks worth 10 points each and a quiz worth 30).

Now suppose the teacher says that I can skip one homework, so I have v = [99, 80, np.nan, 92, 87] with w = [10, 10, 10, 10, 30] unchanged. Naively I get WA = 76.0 (or nan if I don't use np.nansum in the numerator). That is impossible: none of my remaining grades is below 80, so the average cannot fall below 80. The problem is that the missing homework still carries its weight in the denominator. Zeroing out that weight (w = [10, 10, 0, 10, 30]) yields WA = 88.67.
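
A quick numpy sketch of the grade example (values copied from above; a minimal illustration, not xcdat code):

```python
import numpy as np

v = np.array([99, 80, np.nan, 92, 87])  # grades; third homework is missing
w = np.array([10, 10, 10, 10, 30.0])    # weights: four homeworks and a quiz

# Biased: the missing value is dropped from the numerator,
# but its weight still inflates the denominator.
biased = np.nansum(w * v) / np.sum(w)         # 5320 / 70 = 76.0

# Correct: zero the weight wherever the value is missing.
w_fixed = np.where(np.isnan(v), 0.0, w)
correct = np.nansum(w * v) / np.sum(w_fixed)  # 5320 / 60 ≈ 88.67

print(biased, correct)
```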

The take-home message is that weights corresponding to missing / masked values must be zeroed out for spatial and temporal averaging.

The current spatial and temporal averages use the xarray .weighted().mean() API, which generally appears to handle missing data appropriately, though special attention may be needed for the groupby operations used in temporal averaging. One open question is whether a group that includes a missing value (e.g., May, June, July temperature) should return NaN or the weighted average of the available data.
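
As a sanity check, here are the same numbers run through xarray's weighted API (assuming xarray ≥ 0.15, where .weighted() was introduced; this reflects my understanding of its NaN handling):

```python
import numpy as np
import xarray as xr

da = xr.DataArray([99.0, 80.0, np.nan, 92.0, 87.0], dims="x")
weights = xr.DataArray([10.0, 10.0, 10.0, 10.0, 30.0], dims="x")

# With skipna enabled (the default for float data), xarray masks the
# weights of missing values, matching the hand computation above.
print(float(da.weighted(weights).mean(dim="x")))  # ≈ 88.67, not 76.0
```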

These notes are excerpted from a conversation with @tomvothecoder

pochedls self-assigned this May 27, 2022
taylor13 commented Jun 3, 2022

Regarding the question of a single missing month in a seasonal average, it may not make much practical difference. But when computing an annual mean, if all the winter or all the summer months were missing, then you could get a quite biased result (for things like temperature, at least).

For the annual mean calculation in CDAT we came up with two criteria (set by the user) that determine whether a mean is recorded or the value is set to missing: 1) the threshold fraction of samples required to compute the mean, and 2) how far the "centroid" of weights computed from the available samples differs from the centroid of weights computed assuming no missing samples. If you consider monthly data as numbers on a clock, then with no missing data the centroid lies at the axis of the clock hands. Similarly, if data were available for only 4 months but equally distributed (say, January, April, July, and October), the centroid would still be at the center. But if the 4 months all occurred in one half of the year, the centroid would be offset. When computing an annual mean, the user would specify the minimum number of months required and how close the centroid must be to the center of the clock.
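
A rough sketch of those two criteria (the function name and default thresholds are made up here; this is not CDAT's actual API):

```python
import numpy as np

def annual_mean(values, weights, min_frac=0.5, max_offset=0.1):
    """Annual mean with two acceptance criteria: a minimum fraction of
    available months, and a cap on how far the centroid of the available
    months' weights drifts from the center of the "clock"."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    present = ~np.isnan(values)

    # Criterion 1: enough months with data?
    if present.mean() < min_frac:
        return np.nan

    # Criterion 2: place the 12 months evenly around a circle and compute
    # the weighted centroid of the available months. Fully sampled (or
    # evenly sampled, e.g. Jan/Apr/Jul/Oct) data puts it at the origin.
    theta = 2 * np.pi * np.arange(values.size) / values.size
    w = np.where(present, weights, 0.0)
    cx = np.sum(w * np.cos(theta)) / np.sum(w)
    cy = np.sum(w * np.sin(theta)) / np.sum(w)
    if np.hypot(cx, cy) > max_offset:
        return np.nan  # available months are lopsided (e.g. all summer)

    return np.sum(w * np.nan_to_num(values)) / np.sum(w)
```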

I would be happy to discuss further.

taylor13 commented Jun 3, 2022

Another suggestion: sometimes carrying both the "unmasked" weights and a second array of "masking" factors can be useful when constructing algorithms. When computing means, you would then calculate Σ(wts × msk × data) / Σ(wts × msk) over the samples. In general "msk" would be a fraction (set to 0 for missing values). When regridding conservatively, "wts" would be set to the area of each grid cell and "msk" would indicate the fraction of each grid cell that is unmasked. The output of the regridder would give the area of each target grid cell and the unmasked fraction of each target cell, along with the regridded field itself.
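
A tiny numeric illustration of that bookkeeping (arrays invented for the example):

```python
import numpy as np

wts = np.array([1.0, 2.0, 2.0, 1.0])      # "unmasked" weights, e.g. cell areas
msk = np.array([1.0, 1.0, 0.5, 0.0])      # fraction of each cell that is valid
data = np.array([10.0, 12.0, 14.0, 3.0])  # cell values (last cell fully masked)

# Mean over the valid fractions only: the fully masked cell drops out,
# and the half-masked cell contributes half its area.
mean = np.sum(wts * msk * data) / np.sum(wts * msk)
print(mean)  # (10 + 24 + 14) / (1 + 2 + 1) = 12.0
```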

@tomvothecoder

I am going to convert this issue to a discussion to make it easier to follow individual threads.

xCDAT locked and limited conversation to collaborators Jul 6, 2022
tomvothecoder converted this issue into discussion #275 Jul 6, 2022
