Implement grid concatenation and standardize datatype casting #2762
Conversation
```diff
@@ -1532,7 +1531,7 @@ def groupby_python(self_or_cls, ndmapping, dimensions, container_type,
     selects = get_unique_keys(ndmapping, dimensions)
     selects = group_select(list(selects))
     groups = [(k, group_type((v.reindex(idims) if hasattr(v, 'kdims')
-                              else [((), (v,))]), **kwargs))
+                              else [((), v)]), **kwargs))
```
These are NdMapping implementation details left over from when we had an NdElement, the precursor to datasets, which now lead to strange behavior.
holoviews/core/data/__init__.py (outdated):
```python
datatypes = ['dictionary', 'grid']

try:
    import pandas as pd  # noqa (Availability import)
    from .pandas import PandasInterface
    default_datatype = 'dataframe'
```
When converting from gridded to columnar data, the code usually has to cast the data to a specific datatype. Various places in the code hardcoded `['pandas', 'dictionary']`; defining a `default_datatype` avoids having to hardcode this all over the place.
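A rough sketch of the fallback pattern this sets up (hedged: the `except` branch and the final `datatypes` list are assumptions about the module's structure, not a verbatim quote of it):

```python
# Interfaces that are always importable
datatypes = ['dictionary', 'grid']
default_datatype = datatypes[0]   # fallback when pandas is missing

try:
    import pandas as pd  # noqa (availability import)
    # the real module also imports the PandasInterface class here
    default_datatype = 'dataframe'
    datatypes = ['dataframe', 'dictionary', 'grid']
except ImportError:
    pass
```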
Shouldn't this be `default_columnar_datatype`, then? Or are there no cases where columnar data needs to be cast into some gridded data type?
Columnar data cannot be cast to gridded data without some kind of aggregation occurring, so that's correct. I'd still be okay with changing it to `default_columnar_datatype`.
holoviews/core/data/grid.py (outdated):
```python
arrays = [grid[vdim.name] for grid in grids]
stack = da.stack if any(is_dask(arr) for arr in arrays) else np.stack
new_data[vdim.name] = stack(arrays, -1)
return new_data
```
Since arrays cannot be concatenated along multiple axes at once, the implementation of `concat` on gridded interfaces has two components: a general `concat` method coordinates hierarchical concatenation along each dimension, and it uses the interface-specific `concat_dim` method implementations to concatenate along one particular axis or dimension.
Made some additional comments to clarify certain implementation details.
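A minimal sketch of that two-level scheme, using plain dicts of NumPy arrays in place of the real interface data (the `concat`/`concat_dim` bodies below are illustrative assumptions, not the holoviews implementation):

```python
import numpy as np

def concat_dim(datasets, dim, vdims):
    """Concatenate a {key: data_dict} mapping along a single new dimension."""
    keys = [k[0] if isinstance(k, tuple) else k for k in datasets]
    new_data = {dim: np.asarray(keys)}
    for vdim in vdims:
        # The new dimension becomes the trailing axis of each value array
        new_data[vdim] = np.stack([d[vdim] for d in datasets.values()], axis=-1)
    # Coordinates of pre-existing dimensions are shared; copy them over
    first = next(iter(datasets.values()))
    for name, vals in first.items():
        new_data.setdefault(name, vals)
    return new_data

def concat(datasets, dims, vdims):
    """Hierarchically concatenate {key_tuple: data_dict} along each of dims."""
    if len(dims) > 1:
        # Group on the first key element, recurse along the remaining
        # dimensions, then concatenate the intermediate results
        grouped = {}
        for key, data in datasets.items():
            grouped.setdefault(key[0], {})[key[1:]] = data
        datasets = {k: concat(v, dims[1:], vdims) for k, v in grouped.items()}
    return concat_dim(datasets, dims[0], vdims)

grids = {(a, b): {'z': np.full((3, 4), 10 * a + b)}
         for a in (0, 1) for b in (0, 1, 2)}
out = concat(grids, dims=['a', 'b'], vdims=['z'])
print(out['z'].shape)  # (3, 4, 3, 2): 'b', then 'a', appended as trailing axes
```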
holoviews/core/data/interface.py (outdated):
```python
cast = []
for ds in datasets:
    if cast_type is not None or ds.interface.datatype != datatype:
        ds = ds.clone(ds, datatype=[datatype], new_type=cast_type)
```
Casting works quite simply: if `Interface.initialize` is passed another dataset and finds a mismatch between the supplied and requested datatypes, it deconstructs the original dataset into the columnar or gridded tuple format, which all interfaces support. In this way a dataset can easily be cast to any other datatype, except for columnar -> gridded conversions.

As a follow-up to this PR we should provide special handling for dask arrays/dataframes during casting; this will require multiple changes.
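To make the casting path described above concrete, a hedged round-trip example (it mirrors the `clone` call in the snippet above and assumes pandas is installed):

```python
import numpy as np
import holoviews as hv

# A gridded dataset: z is a 2D array indexed by the x and y coordinates
grid = hv.Dataset({'x': np.arange(3), 'y': np.arange(2),
                   'z': np.random.rand(2, 3)},
                  kdims=['x', 'y'], vdims=['z'])
print(grid.interface.datatype)  # 'grid'

# Cloning with a different datatype triggers the deconstruct-and-rebuild
# path described above; gridded -> columnar is always possible
columnar = grid.clone(grid, datatype=['dataframe'])
print(columnar.interface.datatype)  # 'dataframe'
```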
Ready for review.
holoviews/core/data/__init__.py (outdated):
""" | ||
Concatenates multiple datasets wrapped in an NdMapping type | ||
along all of its dimensions. Before concatenation all datasets | ||
are cast to the same datatype. For columnar data concatenation |
'same datatype' determined how?
Either explicitly defined or the type of the first dataset that was passed in.
It would be good to state that bit about the datatype being chosen from the first dataset if not explicitly set.
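Something like this could anchor the docstring wording (hedged: the `datatype` keyword and the module-level `concat` entry point are assumed from this PR's diff):

```python
import holoviews as hv
from holoviews.core.data import concat  # entry point added in this PR

ds1 = hv.Dataset({'x': [0, 1], 'y': [1.0, 2.0]}, kdims=['x'], vdims=['y'],
                 datatype=['dictionary'])
ds2 = hv.Dataset({'x': [2, 3], 'y': [3.0, 4.0]}, kdims=['x'], vdims=['y'])

print(concat([ds1, ds2]).interface.datatype)   # 'dictionary', taken from ds1
print(concat([ds1, ds2], datatype='dataframe').interface.datatype)  # explicit
```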
holoviews/core/data/interface.py (outdated):
```python
    datasets = datasets.items()
    keys, datasets = zip(*datasets)
elif isinstance(datasets, list) and not any(isinstance(v, tuple) for v in datasets):
    keys = [()]*len(datasets)
```
What are all these empty tuple keys for? Just to get things in the right format?
Right, `concatenate` is usually meant for concatenating along some dimension, but you can also concatenate a simple list of datasets without concatenating along any dimension. For that case we generate empty tuple keys. Happy to add a comment. Separately, I also need to assert that this only happens for tabular data, since gridded data must be concatenated along some dimension.
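For illustration, the bare-list case might behave like this (a hedged sketch; `concat` stands in for the code path under review):

```python
import holoviews as hv
from holoviews.core.data import concat  # assumed entry point

ds1 = hv.Dataset({'x': [0, 1], 'y': [1.0, 2.0]}, kdims=['x'], vdims=['y'])
ds2 = hv.Dataset({'x': [2, 3], 'y': [3.0, 4.0]}, kdims=['x'], vdims=['y'])

# A bare list gets the placeholder () keys: rows are simply appended,
# which is only well-defined for tabular (columnar) data
appended = concat([ds1, ds2])
print(len(appended))  # 4
```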
```diff
@@ -4,6 +4,9 @@
 from itertools import product

 import iris
+from iris.coords import DimCoord
+from iris.cube import CubeList
+from iris.experimental.equalise_cubes import equalise_attributes
```
It will be good to have the iris interface moved to geoviews. Could this be done for 1.10.6?
Tests need to be moved into the holoviews package first.
You mean the 'geoviews' package?
No, I mean the `/tests` need to move to `/holoviews/tests`.
The interface tests are defined as mix-in classes, so if I want to run them in geoviews I have to be able to import them from holoviews. We also promised this to the bokeh folks so they can run our bokeh unit tests easily.
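The mix-in pattern might look roughly like this (a hedged sketch with hypothetical class names, not the actual holoviews test suite):

```python
import unittest

import numpy as np
import holoviews as hv

class InterfaceTests(object):
    """Datatype-agnostic assertions, mixed into per-interface TestCases."""
    datatype = None

    def test_dataset_roundtrip(self):
        ds = hv.Dataset({'x': np.arange(3), 'y': np.arange(3) * 2.0},
                        kdims=['x'], vdims=['y'], datatype=[self.datatype])
        self.assertEqual(ds.interface.datatype, self.datatype)
        self.assertEqual(len(ds), 3)

# A downstream package (e.g. geoviews) only needs to import InterfaceTests
# and pair it with its own interface:
class DictInterfaceTests(InterfaceTests, unittest.TestCase):
    datatype = 'dictionary'

if __name__ == '__main__':
    unittest.main()
```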
```python
col_data = group.last.clone(data)
collapsed[key] = col_data
return collapsed if self.ndims > 1 else collapsed.last
group_data = group.last.clone(data)
```
`group_data` can be a whole load of different things at different times. Not critical, but I would prefer to have something that isn't clobbered so much.
Other than a few minor comments this looks good and I'm happy to merge.
Tests are green. Merging.
This PR has two main aims: implementing concatenation for gridded data, and standardizing datatype casting across interfaces.

Before this PR, both casting and concatenation were limited to columnar data formats, which meant that certain operations could not be applied to gridded data, e.g. a HoloMap collapse. Having a dedicated concat implementation for both columnar and gridded data also allows much more efficient concatenation than what is currently used by methods like `.table` and `.dframe`, and will generalize them so that we can eventually replace the column-specific `.table` implementation with a general one that returns a dataset of arbitrary type. Implementing concatenation along HoloMap dimensions also means that Dataset.groupby operations are now reversible, and fixes HoloMap.collapse.
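A hedged example of the round trip this enables (assuming the module-level `concat` added in this PR):

```python
import numpy as np
import holoviews as hv
from holoviews.core.data import concat  # entry point added by this PR

grid = hv.Dataset({'x': np.arange(3), 'y': np.arange(2),
                   'z': np.random.rand(2, 3)},
                  kdims=['x', 'y'], vdims=['z'])

# groupby splits the gridded dataset into a HoloMap over y...
grouped = grid.groupby('y')
# ...and concat stitches the groups back together along that dimension
restored = concat(grouped)
print(restored.interface.datatype)  # still a gridded datatype
```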