
Use Generic Types instead of Hashable or Any #8199

Open
headtr1ck opened this issue Sep 17, 2023 · 2 comments

headtr1ck (Collaborator) commented Sep 17, 2023

Is your feature request related to a problem?

Currently, part of the static type of a DataArray or Dataset is a Mapping[Hashable, DataArray].
I'm quite sure that 99% of users actually use str keys (i.e. variable names), while some exotic people (me included) want to use e.g. Enums as their keys.
Currently, we allow anything to be used as a key as long as it is hashable, but once the DataArray/Dataset is created, the type information of the keys is lost.

Consider e.g.

import numpy as np
from xarray import Dataset

for name, da in Dataset({"a": ("t", np.arange(5))}).items():
    reveal_type(name)  # Hashable
    reveal_type(da.dims)  # tuple[Hashable, ...]

Wouldn't it be nice if this actually returned str, so you don't have to cast or assert it every time?

This could be solved by making these classes generic.

Another related issue is the underlying data.
This could be introduced as a Generic type as well.
This would probably require some common ground across all the wrapping array libs out there: each would need a generic Array class that keeps track of the type of the wrapped array, e.g. dask.array.core.Array[np.ndarray].
In turn, we could write DataArray[np.ndarray] or even DataArray[dask.array.core.Array[np.ndarray]].
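
A minimal sketch of what such a generic wrapper could look like (hypothetical: dask's Array is not actually generic today, and ChunkedArray here is a made-up stand-in):

# Hypothetical stand-in for a chunked array that is generic over the
# type of the arrays it wraps (dask's Array is not generic today).
from typing import Generic, TypeVar

import numpy as np

WrappedT = TypeVar("WrappedT")

class ChunkedArray(Generic[WrappedT]):
    def __init__(self, meta: WrappedT) -> None:
        # `meta` is a zero-sized array carrying the wrapped array type
        self._meta = meta

    @property
    def meta(self) -> WrappedT:
        return self._meta

# ChunkedArray[np.ndarray] wraps numpy chunks; under this proposal a
# dask-backed DataArray could then be typed DataArray[ChunkedArray[np.ndarray]].
chunked: ChunkedArray[np.ndarray] = ChunkedArray(np.empty((0,)))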

Describe the solution you'd like

The implementation would be something along the lines of:

# sketch only; Variable, Index and dtypes are xarray internals
from collections.abc import Hashable, Mapping, Sequence
from typing import Any, Generic, TypeVar

import pandas as pd

KeyT = TypeVar("KeyT", bound=Hashable)
DataT = TypeVar("DataT", bound=<some protocol?>)

class DataArray(Generic[KeyT, DataT]):

    _coords: dict[KeyT, Variable[DataT]]
    _indexes: dict[KeyT, Index[DataT]]
    _name: KeyT | None
    _variable: Variable[DataT]

    def __init__(
        self,
        data: DataT = dtypes.NA,
        coords: Sequence[Sequence[DataT] | pd.Index | DataArray[KeyT, DataT]]
        | Mapping[KeyT, DataT]
        | None = None,
        dims: str | Sequence[KeyT] | None = None,
        name: KeyT | None = None,
        attrs: Mapping[KeyT, Any] | None = None,
        # internal parameters
        indexes: Mapping[KeyT, Index] | None = None,
        fastpath: bool = False,
    ) -> None:
        ...

Now you could create a "classical" DataArray:

da = DataArray(np.arange(10), {"t": np.arange(10)}, dims=["t"])
# will be of type
# DataArray[str, np.ndarray]

while you could also create something fancier:

da2 = DataArray(dask.array.array([1, 2, 3]), {}, dims=[("tup1", "tup2")])
# will be of type
# DataArray[tuple[str, str], dask.array.core.Array]

And whenever you access the dimensions, coord names, or underlying data, you will get the correct type.

For now I only see three major problems:

  1. non-array types (like lists or anything iterable) will get cast to a np.ndarray, and I have no idea how to tell the type checker that DataArray([1, 2, 3], {}, "a") should be DataArray[str, np.ndarray] and not DataArray[str, list[int]]. Depending on the Protocol in the bound TypeVar, this might even fail static type analysis or require tons of special casing and overloads (see the overload sketch after this list).
  2. How does the type checker extract the dimension type for Datasets? This is quite convoluted and I am not sure this can be typed correctly...
  3. The parallel compute workflows are quite dynamic and I am not sure if static type checking can keep track of the underlying datatype... What does DataArray([1, 2, 3], dims="a").chunk({"a": 2}) return? Is it DataArray[str, dask.array.core.Array]? But what about other chunking frameworks?
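
For problem 1, one conceivable approach is special-casing bare Python sequences with overloads that pin DataT to np.ndarray. A rough, untested sketch with heavily abbreviated signatures:

from collections.abc import Hashable, Sequence
from typing import Any, Generic, TypeVar, overload

import numpy as np

KeyT = TypeVar("KeyT", bound=Hashable)
DataT = TypeVar("DataT")

class DataArray(Generic[KeyT, DataT]):
    @overload
    def __init__(
        self: "DataArray[KeyT, np.ndarray]",
        data: list[Any] | tuple[Any, ...],  # bare sequences get cast to np.ndarray
        dims: KeyT | Sequence[KeyT] | None = None,
    ) -> None: ...
    @overload
    def __init__(
        self,
        data: DataT,  # proper array types are kept as-is
        dims: KeyT | Sequence[KeyT] | None = None,
    ) -> None: ...
    def __init__(self, data, dims=None):
        self._data = np.asarray(data) if isinstance(data, (list, tuple)) else data
        self._dims = dims

Whether this scales beyond a few input types without an explosion of overloads is exactly the open question.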

Describe alternatives you've considered

One could even extend this and add more Generic types.

Different types for dimensions and variable names would be a first (and probably quite a nice) feature addition.

One could even go so far as to type the keys and values of variables and coords (for Datasets) differently.
This came up e.g. in #3967
However, this would create a ridiculous number of Generic types and is probably more confusing than helpful.
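
To illustrate the explosion (hypothetical TypeVar names, not a proposal):

from collections.abc import Hashable
from typing import Generic, TypeVar

DimT = TypeVar("DimT", bound=Hashable)
VarNameT = TypeVar("VarNameT", bound=Hashable)
CoordNameT = TypeVar("CoordNameT", bound=Hashable)
VarDataT = TypeVar("VarDataT")
CoordDataT = TypeVar("CoordDataT")

# Five generic parameters on Dataset, one per distinguishable
# key/value kind; already hard to read, and still incomplete.
class Dataset(Generic[DimT, VarNameT, CoordNameT, VarDataT, CoordDataT]):
    ...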

Additional context

Probably this feature should be implemented in consecutive PRs, each adding one Generic; otherwise this will be a giant task!

mathause (Collaborator) commented:

I think #6142 may be relevant here.

TomNicholas (Member) commented:

> 3. The parallel compute workflows are quite dynamic and I am not sure if static type checking can keep track of the underlying datatype... What does DataArray([1, 2, 3], dims="a").chunk({"a": 2}) return? Is it DataArray[str, dask.array.core.Array]? But what about other chunking frameworks?

It's handled by the from_array method of the ChunkManagerEntrypoint ABC. The implementation of the ABC that gets used depends in general on the value of chunked_array_type passed to .chunk and on what libraries are installed and available. By default it will use the dask implementation, in which case the return type of .chunk would be DataArray[str, dask.array.core.Array]. In general it will return DataArray[str, T_ChunkedArray], but that type is just a placeholder for now.
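
A rough sketch of the shape of that hook (simplified; the real ABC lives in xarray's parallelcompat module and has many more methods):

from abc import ABC, abstractmethod
from typing import Any, Generic, TypeVar

import numpy as np

T_ChunkedArray = TypeVar("T_ChunkedArray")

class ChunkManagerEntrypoint(ABC, Generic[T_ChunkedArray]):
    @abstractmethod
    def from_array(
        self, data: np.ndarray, chunks: Any, **kwargs: Any
    ) -> T_ChunkedArray:
        """Wrap an in-memory array as a chunked array of this manager's type."""
        ...

# The dask entrypoint implements this ABC for dask.array.core.Array,
# so under the proposal .chunk could return
# DataArray[str, dask.array.core.Array].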
