[Task]: Investigate xcdat handling of missing data #244
Comments
Regarding the question of a single missing month in a seasonal average, it may not make much practical difference. But when computing an annual mean, if all the winter or all the summer months were missing, then you could get a quite biased result (for things like temperature, at least). For the annual mean calculation in CDAT we came up with two criteria (set by the user) that would determine whether a mean was recorded or the value was set to missing: 1) the threshold fraction of samples required to compute the mean, and 2) how far the "centroid" of weights computed from available samples differed from the centroid of weights computed assuming no missing samples.

If you consider monthly data as numbers on a clock, then for no missing data the centroid lies at the axis of the clock hands. Similarly, if data were only available for 4 months but equally distributed (say, January, April, July, and October), the centroid would still be at the center. But if the 4 months all occurred in one half of the year, then the centroid would be offset. When computing an annual mean, the user would specify what minimum number of months was required and how close the centroid should be to the center of the clock. I would be happy to discuss further.
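The clock analogy above can be sketched in a few lines of NumPy. This is illustrative only (the function name and equal monthly weights are assumptions, not CDAT's actual API): each month maps to an angle on the clock, and the centroid of the available-sample weights is compared with the centroid assuming no missing data, which sits at the clock's center.

```python
import numpy as np

def monthly_centroid(available):
    """Return the (x, y) centroid of equally weighted available months."""
    angles = 2 * np.pi * np.arange(12) / 12  # one clock angle per month
    mask = np.asarray(available, dtype=float)
    if mask.sum() == 0:
        return np.array([np.nan, np.nan])
    x = (mask * np.cos(angles)).sum() / mask.sum()
    y = (mask * np.sin(angles)).sum() / mask.sum()
    return np.array([x, y])

# Evenly spaced months (Jan, Apr, Jul, Oct) -> centroid at the center
even = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
# All available months in one half of the year -> centroid offset
clustered = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

print(np.linalg.norm(monthly_centroid(even)))       # ~0.0
print(np.linalg.norm(monthly_centroid(clustered)))  # clearly offset (~0.64)
```

A user-specified threshold on this distance would then decide whether the annual mean is recorded or set to missing.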
Another suggestion: Sometimes carrying both the "unmasked" weights and a second array of "masking" factors can be useful when constructing algorithms. When computing means, you would then calculate the sum-over-samples(wts x msk x data) and divide by sum-over-samples(wts x msk). In general "msk" would be a fraction (set to 0 for missing values). When regridding conservatively, the "wts" would be set to the area of each grid cell and the "msk" would indicate the fraction of each grid cell that was unmasked. The output of the regridder would give the area of each target grid cell and the unmasked fraction of each target cell, along with the regridded field itself.
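As a minimal sketch of the weights-plus-mask idea above (the variable names `wts` and `msk` follow the comment; the data values are illustrative, not from any regridder):

```python
import numpy as np

data = np.array([99.0, 80.0, 77.0, 92.0, 87.0])
wts = np.array([10.0, 10.0, 10.0, 10.0, 30.0])  # e.g. grid-cell areas
msk = np.array([1.0, 1.0, 0.0, 1.0, 0.5])       # unmasked fraction per cell

# sum-over-samples(wts x msk x data) / sum-over-samples(wts x msk)
mean = (wts * msk * data).sum() / (wts * msk).sum()
print(mean)
```

Fully missing cells (`msk = 0`) contribute nothing to either sum, and partially masked cells contribute in proportion to their unmasked fraction.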
I am going to convert this issue to a discussion to make it easier to follow individual threads.
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Describe the task
Much of the geospatial data that we use has missing / masked values. This issue is intended to make sure that missing data is properly handled by xcdat, in particular for operations such as spatial and temporal averaging.
Notes on weighted averaging
Spatial and temporal averages are weighted averages. Spatial averages are typically weighted by the area in each grid cell and temporal averages are weighted by the length of time for each time interval. In general, a weighted average is just:
WA(x) = Σ(w(x) * v(x)) / Σ(w(x))
where WA is the weighted average at time/location, x, for a given weight, w, and value, v.
If a value is missing, its weight should be set to zero. For example, with arrays `v = [99, 80, 77, 92, 87]` and `w = [10, 10, 10, 10, 30]`, I get `WA = 87.0` (think of a weighted grade average with homeworks worth 10 points and a quiz worth 30). Now suppose the teacher says that I can skip one homework, so I have `v = [99, 80, np.nan, 92, 87]` and `w = [10, 10, 10, 10, 30]`. I then get `WA = 76.0` (or `nan` if I don't use `np.nansum`). This is impossible: none of my grades is below 87, so my average cannot be below 87. The problem is that the homework with a `nan` value is still being weighted. I need to zero out that weight (`w = [10, 10, 0, 10, 30]`), yielding `WA = 88.67`.

The take-home message is that we need to ensure that weights for missing / masked values are zeroed out for spatial and temporal averaging.
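The grade example above, worked in NumPy (a small sketch; variable names match the text):

```python
import numpy as np

v = np.array([99.0, 80.0, np.nan, 92.0, 87.0])  # grades, one missing
w = np.array([10.0, 10.0, 10.0, 10.0, 30.0])    # weights

# Biased: the missing sample still contributes its weight to the denominator.
biased = np.nansum(v * w) / w.sum()               # 76.0 -- impossible

# Correct: zero out the weights of missing values first.
w_fixed = np.where(np.isnan(v), 0.0, w)
correct = np.nansum(v * w_fixed) / w_fixed.sum()  # ~88.67

print(round(biased, 2), round(correct, 2))
```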
The current spatial and temporal averages use the xarray `.weighted().mean()` API, which generally appears to handle missing data appropriately, though special attention may be needed for groupby averaging operations (used in temporal averaging). One open question is whether a group of values that includes a missing value (e.g., May, June, July temperature) should return a NaN or the weighted average of the available data.

These notes are subsetted from a conversation with @tomvothecoder
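A quick check (using the same grade numbers as above) suggests xarray's `.weighted().mean()` does mask the weights of NaN values, matching the corrected 88.67 rather than the biased 76.0:

```python
import numpy as np
import xarray as xr

v = xr.DataArray([99.0, 80.0, np.nan, 92.0, 87.0], dims="x")
w = xr.DataArray([10.0, 10.0, 10.0, 10.0, 30.0], dims="x")

# xarray excludes the weights of NaN entries from the denominator.
wa = v.weighted(w).mean()
print(float(wa))  # ~88.67, not 76.0
```

Whether the same holds through every groupby code path used in temporal averaging is the open question above and would still need dedicated tests.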