
Coordinates implicitly created when passing a DataArray as coord to Dataset constructor #5732

Open
benbovy opened this issue Aug 23, 2021 · 3 comments

@benbovy
Member

benbovy commented Aug 23, 2021

I stumbled on this while working on #5692. Is this intended behavior or an unwanted side effect?

What happened:

Creating a new Dataset by passing a DataArray object as a coordinate also adds the DataArray's own coordinates to the dataset:

>>> foo = xr.DataArray([1.0, 2.0, 3.0], coords={"x": [0, 1, 2]}, dims="x")
>>> ds = xr.Dataset(coords={"foo": foo})
>>> ds
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
    foo      (x) float64 1.0 2.0 3.0
Data variables:
    *empty*

What you expected to happen:

The behavior above seems a bit counter-intuitive to me. I would rather expect no additional coordinates to be auto-magically added to the dataset, i.e., only the foo coordinate in this example:

>>> ds
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
    foo      (x) float64 1.0 2.0 3.0
Data variables:
    *empty*
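
For reference, a minimal workaround sketch under the current behavior: passing the bare values (with the dimension name) instead of the DataArray keeps the extra coordinates from being pulled in:

>>> ds = xr.Dataset(coords={"foo": ("x", foo.values)})
>>> list(ds.coords)
['foo']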

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.6 | packaged by conda-forge | (default, Nov 27 2020, 19:17:44)
[Clang 11.0.0 ]
python-bits: 64
OS: Darwin
OS-release: 20.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.19.0
pandas: 1.1.5
numpy: 1.21.1
scipy: 1.7.0
netCDF4: 1.5.5.1
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.3.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.07.2
distributed: 2021.07.2
matplotlib: 3.3.3
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20201009
pip: 20.3.1
conda: None
pytest: 6.1.2
IPython: 7.25.0
sphinx: 3.3.1

benbovy added a commit to benbovy/xarray that referenced this issue Aug 23, 2021
@benbovy benbovy mentioned this issue Aug 23, 2021
@dcherian
Contributor

dcherian commented Aug 23, 2021

I have definitely relied on this before; I imagine it's a pretty common pattern too, but I could be wrong.

Note that automatically adding the index satisfies this condition, which seems sensible to me:

ds.foo.identical(foo)
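
The x index is indeed carried along when the coordinate is extracted back (a quick check, assuming the current behavior):

>>> "x" in ds.foo.coords
True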

EDIT: would a similar concern apply to

ds.update({"foo": foo})

or is your concern specifically for coords vs data_vars?

@benbovy
Member Author

benbovy commented Aug 24, 2021

I think a similar concern applies to .update. I find it surprising in the first place that a single mapping item ("foo") results in multiple mapping items ("foo", "x") in the returned xarray object.
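
Concretely, with the example from the issue (a quick check, assuming the current behavior):

>>> sorted(xr.Dataset(coords={"foo": foo}).coords)
['foo', 'x']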

That said, I agree that taking into account the (indexed) coordinates of the DataArray is important for alignment. The example below makes perfect sense:

>>> foo = xr.DataArray([1.0, 2.0, 3.0], coords={"x": [0, 1, 2]}, dims="x")
>>> ds = xr.Dataset(coords={"x": [2, 3, 4, 5]})
>>> ds.update({"foo": foo})
<xarray.Dataset>
Dimensions:  (x: 4)
Coordinates:
  * x        (x) int64 2 3 4 5
Data variables:
    foo      (x) float64 3.0 nan nan nan

I guess my concern is related to the (my?) slightly ambiguous view of the DataArray concept as a single array-like vs. a collection of data + coordinates.

@dcherian
Contributor

the DataArray concept as a single array-like vs. a collection of data + coordinates.

Yeah, non-dim coordinates essentially make a DataArray a Dataset with one special variable =)
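
For instance, the foo array from above can be viewed as exactly that (output sketch, current behavior):

>>> foo.to_dataset(name="foo")
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
Data variables:
    foo      (x) float64 1.0 2.0 3.0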

shoyer pushed a commit that referenced this issue Mar 17, 2022
* no need to wrap pandas index in lazy index adapter

* multi-index default level names

* refactor setting Dataset/DataArray default indexes

* update multi-index (text) repr

Notes:

- move the multi-index formatting logic into
  PandasMultiIndexingAdapter._repr_inline_
- inline repr: check for _repr_inline_ implementation first

* remove print

* minor fixes and improvements

* fix dtype of index variables created from Index

* fix multi-index selection regression

See #5691

* check conflicting multi-index level names

* update formatting (text and html)

* check level name conflicts for midx given as coord

* intended behavior or unwanted side effect? see #5732

* get rid of multi-index virtual coordinates

Not totally yet: need to refactor set_index / reset_index

* add level coords in indexes & keep coord order

* fix copying multi-index level variable data

* collect index for multi-index level variables

Avoid re-creating the indexes for dimension variables. Collect them
directly instead.

Note: the change here is working for building new Datasets but I haven't
checked other cases like merging different objects, etc. So I'm not sure
this is the right approach.

* wip refactor label based selection

- Index.query must now return a mapping of {dim_name:
positional_indexer} as indexes may be based on several coordinates with
different dimensions

- Added `group_coords_by_index` utility function (not used yet, not sure
we'll need it)

TODO:

- Update DataArray selection
- Update .loc and other places using remap_label_indexers
- Fix selection of multi-index that returns only scalar coordinates

* fix index query tests

* fix multi-index adapter getitem scalar

Multi-index level variables now return the scalar value that corresponds
to the level instead of the multi-index tuple element (all levels).

Also get rid of PandasMultiIndexingAdapter.__getitem__ cache
optimization, which doesn't work with level scalar values and was
premature optimization anyway.

* wip refactor label based selection

Fixed renamed dimension in the case of multi-index -> single index

Updated DataArray._overwrite_indexes

Dirty fix for alignment (not tested yet)

* wip: deeper refactoring label-based sel

Created QueryResult and MergedQueryResults classes for convenience.

* fix some tests + minor tweaks

* fix indexing PandasMultiIndexingAdapter

When level is not None:

- if result is another adapter: propagate it properly
- if result is a numpy adapter: use level values

* refactor cast label indexer to coord dtype

Make the fix in #3153 specific to pandas indexes (i.e., do not apply it
to other, custom indexes). See #5697 for details.

This should also fix #5700 although no test has been added yet (we need
to refactor set_index first).

* better handling of multi-index level labels

* label-based selection tweaks and fixes

- Use a dataclass for QueryResult

- Pass Variables and DataArrays indexers un-normalized to Index.query().
  Indexes have the responsibility of returning the expected types for
  positional (dimension) indexers by adding back dimensions and
  coordinates if needed.

- Typing fixes and tweaks

* sel: propagate multi-index vars attrs/encoding

Related to #1366

* wip refactor rename

Still needs tests

Also need to address the problem of multi-index level coordinates (data
adapters currently not updated). We'll probably need for
`Index.rename()` to also return new index variables?

* wip refactor rename: return new vars from Index

* refactor rename: update tests

Still need to address default indexes internal checks (these are
disabled for now).

* typing tweaks

* fix html formatting: dims w/ or w/o index

* minor fixes and tweaks

* wip refactor set_index

* wip refactor set_index

- Refactor reset_index
- Improve creating new xarray indexes from pandas indexes with proper
  propagation of variable metadata (dtype, attrs, encoding)

* refactor reorder_levels

* Set pd.MultiIndex name from dim name

Closes #4542

* .sel() with multi-index: return scalar coords

..instead of dropping those coordinates

Closes #1408 once tests are fixed/updated

* fix multi-index level coordinate inline repr

* wip refactor stack

Some changes:

Multi-indexes are created only if the
stacked dimensions each have a single coordinate index. In the other
cases, multi-indexes are not created, which means that it is an
irreversible operation (unstack will not work)

Multi-indexes are created from the stacked coordinate variables and no
longer from the unstacked dimension indexes (level product). There's a
significant decrease in performance but it is probably acceptable since
it's now possible to avoid the creation of multi-indexes.

It is possible to stack a dimension with a multi-index, but it drops the
index. Otherwise it would make it hard for unstack() to figure out
what's going on. It makes it clear that this is an irreversible
operation.

* stack: better rule for add/skip multi-index

It is more robust and allows creating a multi-index from
non-default (non-dimension) coordinates, as long as there's one and only
one 1-d indexed coordinate for each dimension to stack.

* add utility methods to class Indexes

This removes some ugly & duplicate patterns.

* re-arrange class Indexes internals

* wip refactor stack / unstack

stack: revert to old way of creating multi-index

unstack: support non-default (non-dimension) multi-index (as long as
there is exactly one multi-index per specified dimension)

* add PandasMultiIndex.from_product_variables

Refactored utils.multiindex_from_product_levels utility function
into a new PandasMultiIndex classmethod.

* unstack: propagate index coordinate metadata

* wip: fix/update tests

* fix/refactor reindex

* functools.cached_property is only available since py38

Use the good old way as we still support py37

* update reset_coords

* fix ipython key completion test

make sure the stacked dataset used has a multi-index

* update test_map_index_queries

* do not coerce bool indexer as float coord dtype

Fixes #5727

* add Index.create_variables() method

This will probably be used instead of including index coordinate
variables in the signature (return type) of many methods of ``Index``.

It has several advantages:

- More DRY, and for custom indexes that do not need to create coordinate
variables with special data adapters, it's easier to just skip
implementing this method and not bother with returning empty dicts in
other methods

- This allows decoupling index creation from coordinate creation. In many
cases both can be done at the same time, but for some cases like alignment
the decoupling is useful

- It's a more elegant solution when we need to propagate
metadata (attrs, encoding)

* improve Dataset/DataArray indexes proxy class

One possible solution to the recurring problem of accessing coordinate
variables related to a given index in some of Xarray's internals.

I feel that the ``Indexes`` proxy is the right place for storing
references to those indexed coordinate variables. It's safer than
relying on a public attribute/property of ``Index``.

* align Indexes API with Coordinates API

Expose similar properties:
- variables
- dims

* clean-up formatting using updated Indexes API

* wip: refactor alignment

* wip refactor alignment

Almost done rewriting `align`.

* tweaks and fixes

- Some fixes and clean-up in new implementation of align
- Add Index.join method (only for inner/outer join)
- Tweak index generic types
- IndexVars type: use Variable instead of IndexVariable (IndexVariable
  may eventually be dropped)

* wip refactor alignment and reindex

Moved all re-indexing logic into the `Aligner` class, which can be
reused directly for both `align()` and `obj.reindex()`.

TODO:
- import override indexes into `Aligner`
- connect Aligner to public API entry points
- clean-up (remove old implementation functions)

* wip refactor alignment / reindex: fixes and tweaks

* wip refactor alignment (support join='override')

* wip alignment: clean-up + tests

I will fix mypy errors later.

* refactor alignment fix tests & types annotations

All `*align*` and `*reindex*` tests are passing! (locally)
Mypy doesn't complain.

TODO: refactor `interp` and `interp_like` (I only plan to fix it with
the new Aligner class for now)

* refactor interp and interp_like

It now reuses the new alignment/reindex internals, but it still only
supports dimension coordinates with a single index.

* wip review merge

* refactor swap_dims

* refactor isel

* add create_index option to stack

* add Index.stack and Index.unstack methods

Also added an `index_cls` parameter to `Dataset.stack` and
`DataArray.stack`.

Custom indexes may thus implement their own logic for stack/unstack,
with the current limitation that a `pandas.MultiIndex` is still
explicitly required for unstack optimized versions.
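
A rough usage sketch of these new options (parameter names as introduced above; outputs omitted and details may differ):

>>> ds.stack(z=("x", "y"))                      # default: creates a (multi-)index for "z"
>>> ds.stack(z=("x", "y"), create_index=False)  # stack without creating an index for "z"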

* fix PandasIndex.isel with scalar indexers

Fix concat tests and some groupby tests

* fix index adapter (variable) dtype

* more indexes invariants checks

Check consistency between PandasIndex objects and
coordinate variables data adapters (PandasIndexingAdapter).

* fix DataArray.isel()

* refactor Dataset/DataArray copy

Filter indexes and preserve unique index objects for multi-coordinate
indexes.

Also fixed indexes tests (stack/unstack).

* refactor computation (and merge)

* minor fixes and tweaks

* propagate_indexes -> filter_indexes_from_coords

We can't just look at excluded dimensions anymore...

* Update xarray/core/coordinates.py

* refactor expand_dims

* merge: avoid compare same indexes more than once

* wip refactor update coords

Check in merge that prioritized elements won't corrupt collected
indexes.

* refactor update/remove coords

Fixes and tweaks
Refactored assign, update, setitem, delitem, drop_vars
Added or updated tests

* misc. fixes

* fix Dataset.__delitem__

* fix Dataset.reset_coords (default coords)

* refactor Dataset.from_dataframe

* Fix .sel with DataArray and multi-index

* PandasIndex.from_variables: preserve wrapped index

Prevent converting special index types like ``pd.CategoricalIndex`` when
such objects are wrapped in xarray variables.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* two minor fixes after merging main

* filter_indexes_from_coords: preserve index order

* implicit multi-index from coord: fix edge case

* groupby combine: fix missing index in result

When coord is None

* implicit multi-index coord: more robust fix

Also cover the cases where indexes are equal but not identical (this caused
a couple of tests to fail). Perf may not be a concern for such an edge case?

* backward compat fix: multi-index given as data-var

... in Dataset constructor

* refactor concat

Summary:

- Add the ``Index.concat()`` class method with implementations for
  ``PandasIndex`` and ``PandasMultiIndex`` (adapted from
  ``IndexVariable.concat()``).

- Use ``Index.concat()`` to create new indexes (and their corresponding
  variables) in ``_dataset_concat``. Fallback to variable concatenation
  (with default index) when no implementation is given for
  ``Index.concat`` or when not all of the variables have an index.

- Refactored merging the other variables (also merge the indexes) in
  ``_dataset_concat``.

This refactor is incomplete. It should work with Pandas (multi-)indexes
but it is likely that it won't work with meta-indexes involving multiple
dimensions. We probably need to update ``_calc_concat_over`` to take
into account such meta-indexes and their related coordinates.

Another limitation (?): here we use the index of the 1st dataset for the
concatenation (i.e., ``Index.concat``). No specific check is made on
the type and/or coordinates of the other indexes.

* sel drop=True: remove multi-index scalar coords

* PandasIndex.from_variables(multi-index level var)

Make sure it gets the level index (not the whole multi-index).

* reindex: disable invalid dimension check

From the docstrings: mis-matched dimensions are simply ignored

* add index concat tests + fix positions type

* add Indexes.get_all_dims convenient method

* refactor pad

* unstack: return copies of mindex.levels

Fix error when trying to set the name of the extracted level indexes.

* strip_units: prevent index->array conversion

which caused alignment conflicts due to the multi-index
variable being converted to a non-index variable.

* attach_units: do not preserve coord multi-indexes

To prevent conflicts when trying to implicitly create
multi-index coordinates.

* concat: avoid multiple loads of dask arrays

* groupby mindex coord: propagate level names

* Dataset.copy: don't reset indexes if data is given

The `data` parameter accepts new data for data variables only; it won't
affect the indexes.

* reindex/align: fix coord dtype of new indexes

Use `as_compatible_data` to get the right dtype to assign to the new
index coordinates created from reindex indexers.

* fix doctests

* to_stacked_array: do not coerce level dtypes

If the intent was to propagate the original coordinate dtypes, this is
now handled by PandasMultiIndex.stack.

This fixes the case where multi-index ``.levels`` do not return labels
with "nan" values (if any) but where the coordinate does, which resulted
in dtype mismatch between the coordinate variable dtype (e.g., coerced
to int64) and the actual label values (float64 due to nan values),
eventually causing numpy errors when trying to coerce nan values to int64 while
indexing the coordinate.

* fix indent

* doc: fix user-guide build errors

* stack: fix new index variables not in coordinates

* unstack full-reindex: fix alignment errors

We need a mapping of all index coordinate names to the full index
so that Aligner can find matching indexes.

(Note: reindex may eventually be refactored to be a bit more clever so
that providing only a `{dim: indexer}` will be enough in this case?)

* PandasIndex coord dtype: avoid convert index->array

* refactor diff

* quick fix level coord dtype int32/64 on win

* dask persist/compute dataarray: propagate indexes

* refactor roll

* rename Index.query -> Index.sel

Also rename:
- QueryResult -> IndexSelResult
- merge_query_results -> merge_sel_results

* remove Index.union and Index.intersection

Those are not used elsewhere.
``Index.join`` is used instead for alignment.

* use future annotations in indexes.py

* Index.rename: return only the new index

Create new coordinate variables using Index.create_variables
instead (get rid of PandasIndex.from_pandas_index).

* Index.stack: return only the new index

Create new coordinate variables using Index.create_variables
instead (get rid of PandasIndex.from_pandas_index)

* PandasMultiIndex class methods: return only the index

* wip get rid of Pandas(Multi)Index.from_pandas_index

* remove Pandas(Multi)Index.from_pandas_index

* Index.from_variables: return the new index only

Use Index.create_variables to get the new coordinate variables.

* rename: propagate_attrs_encoding not needed

Index.create_variables already propagates coordinate variable metadata.

* align exclude dims: pass through indexes

* fix set_index append=True

Level variables may have updated names in this case: single index + one
level.

* refactor default_indexes and re-enable invariant check

Default indexes invariant check is now optional (enabled by default).

* fix formatting errors

* fix type

* Dataset._indexes, DataArray._indexes: always dict

Default indexes are created by the default constructors. Indexes are
always passed explicitly by the fastpath constructors.

* remove _level_coords property

It was only used for plotting and has been replaced by simpler logic.

* clean-up

Use ``_indexes`` instead of ``.xindexes`` when possible, for speed.
``xindexes`` is not cached: it would return inconsistent results if
updating the object in-place.

Fix some types.

* optimize Dataset/DataArray copy

Avoid copying pandas indexes twice (indexes + coordinate variables)

* add default implementation for Index.copy

Better than raising an error?

* add/fix indexes tests

* tweak indexes formatting

This will need more work (in a follow-up PR) to be consistent with the
other data model components (i.e., coordinates and data variables): add
summary (inline) and detailed reprs, maybe group by coordinates, etc.

* fix doctests

* PandasIndex.copy: avoid too many pd.Index copies

* DataArray stack test: revert to original

Now that default (range) indexes are created again (create_index=True).

* test indexes/indexing: rename query -> sel

* alignment: use future annotations

* misc. tweaks

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Aligner tweaks

Access aligned objects through the ``.result`` attribute.

Enable type checking in internal methods.

* remove assert_unique_multiindex_level_names

It is not needed anymore as multi-indexes have their own coordinates and
we now check for possible name conflicts when "unpacking" multi-index
levels at Dataset / DataArray creation.

* remove propagate_attrs_encoding

Only used in one place.

* assert -> check + ValueError

* misc. fixes and tweaks

* update what's new

* concat: remove fall-backs to Variable.concat

Raise an error instead when trying to concat a mix of indexed and
un-indexed coordinates or when the index doesn't support concat.

We still support the concatenation of a mix of scalar and dimension
indexed coordinates by creating a PandasIndex on-the-fly for scalar
coordinates.

* fix flaky test

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: keewis <keewis@users.noreply.github.com>