Allow xr.combine_by_coords to work with point coordinates? #3774
Hi @kazimuth, thanks for highlighting this! This is a case which I can see the argument for handling, but which we apparently didn't think of when implementing `combine_by_coords`. What this really boils down to is "should `combine_by_coords` handle point coordinates?"

This function is supposed to be "magic", but the behaviour still needs to be clearly-defined.
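As a concrete illustration of the case under discussion (a hypothetical sketch, not code from the thread), here are two datasets that are identical except for a scalar ("point") coordinate, combined via the `expand_dims` workaround described in this issue:

```python
import xarray as xr

# Two datasets identical except for a scalar ("point") coordinate
data_0 = xr.Dataset({'temperature': ('time', [10, 20, 30])},
                    coords={'time': [0, 1, 2]})
data_0.coords['trial'] = 0
data_1 = xr.Dataset({'temperature': ('time', [50, 60, 70])},
                    coords={'time': [0, 1, 2]})
data_1.coords['trial'] = 1

# The workaround: promote the point coordinate to a length-1 dimension
# first, so combine_by_coords has a dimension coordinate to order along
all_trials = xr.combine_by_coords(
    [ds.expand_dims('trial') for ds in (data_0, data_1)]
)
print(all_trials['temperature'].shape)
```

Without the `expand_dims` step, `combine_by_coords` has no dimension coordinate to order the datasets by, which is exactly the gap this issue asks about.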
Yes - I actually got your example to work with just a few extra lines. Change the top of `_infer_concat_order_from_coords` to:

```python
def _infer_concat_order_from_coords(datasets):
    # All datasets have same variables because they've been grouped as such
    ds0 = datasets[0]
    concat_dim_candidates = list(ds0.dims.keys())

    # Look for 0D coordinates which can be concatenated over but aren't dims
    non_dimension_0d_coords = [coord for coord in ds0.coords
                               if coord not in ds0.dims
                               and len(ds0[coord].dims) == 0]
    for coord in non_dimension_0d_coords:
        if all(coord in ds.coords for ds in datasets):
            datasets = [ds.expand_dims(coord) for ds in datasets]
            concat_dim_candidates.append(coord)

    concat_dims = []
    tile_ids = [() for ds in datasets]
    for dim in concat_dim_candidates:
        ...  # The rest of the function is unchanged
```
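The patch above relies on `expand_dims` promoting an existing scalar coordinate to a length-1 dimension coordinate, which `concat` can then operate over. A small standalone check of that xarray behaviour (illustrative only, not part of the patch):

```python
import xarray as xr

ds = xr.Dataset({'temperature': ('time', [10, 20, 30])},
                coords={'time': [0, 1, 2]})
ds.coords['trial'] = 0              # 0D (scalar) coordinate

# expand_dims on the name of an existing scalar coordinate turns it
# into a dimension of size 1, carried by all data variables
promoted = ds.expand_dims('trial')
print(promoted.sizes)
```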
I don't think it would be breaking... the only cases which would behave differently are ones which previously would have thrown an error. I would have to think about this though. Might you be interested in submitting a PR along these lines, @kazimuth? It would require some more robust input checks, some tests for different cases, and some minor changes to docstrings I think.
I did get one test failure.

I think the problem here is that it's happily promoting the coord to a coordinate dimension even though there is only 1 dataset. We should probably only expand dims which are going to be used in the concatenation...
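For reference, the single-dataset case behind that failure can be checked directly: in released xarray, a lone dataset passes straight through `combine_by_coords` with its scalar coordinate left unpromoted (a sketch with hypothetical data, not the actual failing test):

```python
import xarray as xr

ds = xr.Dataset({'temperature': ('time', [10, 20, 30])},
                coords={'time': [0, 1, 2]})
ds.coords['trial'] = 0  # scalar coordinate

# With a single dataset there is nothing to concatenate, so 'trial'
# should stay a 0D coordinate rather than being expanded to a dimension
result = xr.combine_by_coords([ds])
print(result['trial'].ndim)
```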
By the way, there is an additional problem to consider on top of the above: ambiguity with multiple coordinates for the same dimension.

```python
data0 = xr.Dataset({'temperature': ('time', [10, 20, 30])}, coords={'time': [0, 1, 2]})
data0.coords['trial'] = 0
data0.coords['day'] = 5
data1 = xr.Dataset({'temperature': ('time', [50, 60, 70])}, coords={'time': [0, 1, 2]})
data1.coords['trial'] = 1
data1.coords['day'] = 3
```

In this example it is ambiguous whether the user wanted to concatenate along `trial` or `day`. We could pass extra optional arguments to deal with this, but for a "magic" function that doesn't seem desirable. It might be better to have some logic like:
That would be backwards-compatible (currently all magic ordering is being done for dimension coords), and not make any arbitrary choices.
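The ambiguity can be made concrete with a small helper (hypothetical, not from the patch above) that lists every scalar coordinate whose values differ across the datasets - each one is a candidate concatenation dimension, so both `trial` and `day` qualify here:

```python
import xarray as xr

def varying_scalar_coords(datasets):
    """Scalar (0D, non-dimension) coords whose values differ across datasets."""
    ds0 = datasets[0]
    scalars = [c for c in ds0.coords
               if c not in ds0.dims and ds0[c].ndim == 0]
    return [c for c in scalars
            if len({ds[c].item() for ds in datasets}) > 1]

data0 = xr.Dataset({'temperature': ('time', [10, 20, 30])},
                   coords={'time': [0, 1, 2]})
data0.coords['trial'] = 0
data0.coords['day'] = 5
data1 = xr.Dataset({'temperature': ('time', [50, 60, 70])},
                   coords={'time': [0, 1, 2]})
data1.coords['trial'] = 1
data1.coords['day'] = 3

print(varying_scalar_coords([data0, data1]))
```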
@TomNicholas I think this is the case I ran into. Does the old `auto_combine` handle it?
@dcherian it looks like the old `auto_combine` does:

```python
import xarray as xr
from xarray.core.combine import _old_auto_combine

data_0 = xr.Dataset({'temperature': ('time', [10, 20, 30])}, coords={'time': [0, 1, 2]})
data_0.coords['trial'] = 0
data_1 = xr.Dataset({'temperature': ('time', [50, 60, 70])}, coords={'time': [0, 1, 2]})
data_1.coords['trial'] = 1

all_trials = _old_auto_combine([data_0, data_1], concat_dim='trial')
print(all_trials)
```
but you can get the same result with `combine_nested`:

```python
all_trials = xr.combine_nested([data_0, data_1], concat_dim='trial')
print(all_trials)
```
So technically this issue doesn't need to be closed before completely removing the old `auto_combine`.
Ok sounds good. Thanks
I opened a PR to fix this. Also I realised that what I said about dealing with ambiguities of multiple coordinates for the same dimension doesn't make sense: in that situation there is no way to know that the user wanted each of the scalar coordinates promoted, so the PR just promotes them all.

(I did make sure it handles the case which caused the test failure mentioned above.)
Suppose there were multiple scalar coordinates that are unique for each variable. How would `combine_by_coords` handle that?
@shoyer it would expand and stack along both, filling the (many) gaps created with `NaN`s:

```python
import xarray as xr

data_0 = xr.Dataset({'temperature': ('time', [10, 20, 30])}, coords={'time': [0, 1, 2]})
data_0.coords['trial'] = 0  # scalar coords
data_0.coords['day'] = 1
data_1 = xr.Dataset({'temperature': ('time', [50, 60, 70])}, coords={'time': [0, 1, 2]})
data_1.coords['trial'] = 1
data_1.coords['day'] = 0

# both scalar coords will be promoted to dims
all_trials = xr.combine_by_coords([data_0, data_1])
print(all_trials)
```

The gaps created will be filled in with NaNs:

```python
print(all_trials['temperature'].data)
```
This gap-filling isn't new though - without this PR the same thing already happens with length-1 dimension coords (since PR #3649 - see my comment there):

```python
data_0 = xr.Dataset({'temperature': ('time', [10, 20, 30])}, coords={'time': [0, 1, 2]})
data_0.coords['trial'] = [0]  # 1D dimension coords
data_0.coords['day'] = [1]
data_1 = xr.Dataset({'temperature': ('time', [50, 60, 70])}, coords={'time': [0, 1, 2]})
data_1.coords['trial'] = [1]
data_1.coords['day'] = [0]

all_trials = xr.combine_by_coords([data_0, data_1])
print(all_trials)

# gaps will again be filled in with NaNs
print(all_trials['temperature'].data)
```
So all my PR is doing is promoting all scalar coordinates (those which aren't equal across all datasets) to dimension coordinates before combining. There is a chance this could unwittingly increase the overall size of people's datasets (when they have different scalar coordinates in different datasets), but that could already happen since #3649.
I don't think it's a great idea to automatically turn scalar coords into 1d arrays. It's not uncommon to have datasets with a whole handful of scalar coordinates, which could potentially become very high dimensional due to this change.

I also don't like fallback logic that tries to do matching by coordinates with dimensions first, and then falls back to using scalars. These types of heuristics look very convenient at first (and are very convenient much of the time) but then have a tendency to fail in unexpected/unpredictable ways.

The other choice for situations like this would be to encourage switching to `combine_nested`.
We could even put a reference to `combine_nested` in the error message.
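The `combine_nested` route generalises to the multiple-scalar-coordinate case too: the user spells out the layout explicitly instead of relying on `combine_by_coords` to guess. A sketch with hypothetical data (not from the thread):

```python
import xarray as xr

def make(temps, trial, day):
    # Build a dataset carrying two scalar coordinates
    ds = xr.Dataset({'temperature': ('time', temps)},
                    coords={'time': [0, 1, 2]})
    ds.coords['trial'] = trial
    ds.coords['day'] = day
    return ds

# Explicit 2D layout: outer list varies 'trial', inner lists vary 'day'
grid = [[make([10, 20, 30], 0, 0), make([11, 21, 31], 0, 1)],
        [make([50, 60, 70], 1, 0), make([51, 61, 71], 1, 1)]]

# concat_dim names one dimension per nesting level, outermost first;
# concat promotes each scalar coord to the matching new dimension
combined = xr.combine_nested(grid, concat_dim=['trial', 'day'])
print(combined['temperature'].shape)
```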
Fair enough. Those are good reasons.
The fact that you can solve this use case with `combine_nested` is a fair point.

The error you get is `Could not find any dimension coordinates to use to order the datasets for concatenation`, so we could try to automatically detect the coordinates the user might have wanted to combine along, or we can just add something like a suggestion to use `combine_nested` to that error message.
MCVE Code Sample
Expected output
Both work and produce the same result.
Problem Description
I found this behavior somewhat unintuitive -- I didn't understand that I needed to call `expand_dims` to get `combine_by_coords` to work. It might make things a little easier to understand if both behaved the same way -- maybe the code that concatenates dimensions should look for full dimensions first, and point coordinates second. This would also make code that selects dimensions with `sel` and then calls `combine_by_coords` easier to write.

On the other hand, this could be seen as "magic", and it might also be a breaking change to `combine_by_coords`; I'm not sure. So I'd understand if it isn't implemented.

Output of `xr.show_versions()`