-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
frozenlist/frozendict types are inconvenient when passed outside of Dagster #3008
Labels
area: config
Related to Configuration
Comments
IMO passing values as |
This was referenced Feb 12, 2023
smackesey
added a commit
that referenced
this issue
Mar 23, 2023
### Summary & Motivation Internal companion PR: dagster-io/internal#5239 Relevant discussion: dagster-io/internal#4859 (comment) Removes the `frozenlist`, `frozendict` and `frozentags` classes. Reasons: - They don't play nicely with static type-checking and are a source of type-checker diagnostic noise - Static type-checking with Sequence/Mapping sort of solves the same problem (this is kind of like removing runtime check calls from internal methods) - They are used inconsistently in the codebase (most places where we have the intention of creating an immutable list/dict they are not used). Where they are used, it is not at all obvious why at first glance. - They generally complicate the code The main purpose the `frozen*` classes were serving in our code was to make a few select `NamedTuple` classes hashable. If a `NamedTuple` contains mutable collection members, its default hash function will fail. Replacing those mutable collections with immutable ones lets the hash succeed, which in turns lets the `NamedTuple` be cached via `lru_cache` or used as a dict key. The set of classes that need to be made hashable for `lru_cache` purposes are: - `CustomPointer` - `AssetsDefinitionCacheableData` - `RepositoryLoadData` - `ReconstructableRepository` These are part of a tree of objects rooted on `ReconstructablePipeline`, which uses `lru_cache` in its `get_definition` method. The above classes can be made hashable in a more legible and type-annotation-friendly way by defining a `__hash__` method. This PR does that-- wherever we have a `NamedTuple` that needed to be hashed, `frozen*` instantiations in the constructor were removed and a `__hash__` method was added. All of the `__hash__` methods are the same, they just call `dagster._utils.hash_named_tuple` and cache the result. This function just converts mutable collections to tuples prior to hashing, which allows hashing to succeed as tuples are immutable. Aside from cases where frozen classes were used to achieve `NamedTuple` hashability, the other uses were: - random? uses in some places where whoever wrote the code thought the structure should be immutable. I removed these cases for the reasons above (most of the time we want structures to be immutable and in most of the codebase we signal this with `Sequence`/`Mapping`. - A few other cases where dicts need to be hashable to sort them or use them in a cache. For these, I provided simple substitute solutions that achieve the same effect-- e.g. instead of using `frozendict`, converting a dict to a tuple of 2-tuples before hashing. --- - Resolves #3008 - Resolves #3641 ### How I Tested These Changes Existing test suite.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Problem to solve
Dagster has frozenlist/frozendict types that are immutable versions of the standard Python list and dict types.
The
frozenlist
/frozendict
types pose difficulties when Dagster config values are passed to other systems. For example, consider the following solid that writes out a Dask dataframe to a Parquet dataset, with a solid config forpartition_on
.The list of keys to partition on (defaults to
["year", "month"]
) is returned as afrozenlist
fromcontext.solid_config
. When this is passed as the value of thepartition_on
argument ininput_df.to_parquet(…)
, thefrozenlist
is pickled and sent to Dask workers.The problem this poses is that the Dask workers now need to have the
dagster
module installed in order to unpickle what is otherwise a normal Pythonlist
. This adds an extra dependency on the Dask workers, or any other system that receives pickled values from a Dagster pipeline.Intended users
User experience goal
If a solid or resource passes a config to another system, the target system should be able to unpickle the value without a complete
dagster
installation. Ideally, no extra modules would be required at all.This particularly affects Dask pipelines, where most things are pickled to run on Dask workers, outside of the Dagster execution context.
Proposal
There are several possibilities.
frozenlist
/frozendict
pickle as their native Python types. That way they can be unpickled aslist
/dict
without requiring any additional modules.frozenlist
/frozendict
to an independent module, saydagster_collections
. Systems that receive these types can install this module only instead ofdagster
and its complete dependency tree.frozenlist
/frozendict
tolist
/dict
. Solids and resources are then responsible for converting the Dagster types to native types before passing them to other systems. Utility functions are needed for deep conversions (e.g., afrozendict
nested in afrozenlist
).list
/dict
to solids and resources. Not sure about the consequences of this.frozenlist
/frozenlist
. This would have negative impacts on caching in Dagster.I personally prefer option 1 or 2. At minimum, I don't want to install
dagster
completely on my Dask workers just to unpicklefrozenlist
/frozendict
.Option 3 might be useful to have regardless of the other options.
Other information
frozenlist
.frozenlist
tolist
when passing configs to Dask.The text was updated successfully, but these errors were encountered: