Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

frozenlist/frozendict types are inconvenient when passed outside of Dagster #3008

Closed
kinghuang opened this issue Oct 4, 2020 · 1 comment · Fixed by #12293
Closed

frozenlist/frozendict types are inconvenient when passed outside of Dagster #3008

kinghuang opened this issue Oct 4, 2020 · 1 comment · Fixed by #12293
Labels
area: config Related to Configuration

Comments

@kinghuang
Copy link
Contributor

Problem to solve

Dagster has frozenlist/frozendict types that are immutable versions of the standard Python list and dict types.

The frozenlist/frozendict types pose difficulties when Dagster config values are passed to other systems. For example, consider the following solid that writes out a Dask dataframe to a Parquet dataset, with a solid config for partition_on.

@solid(
    config_schema={
        "partition_on": Field(list, default_value=["year", "month"], is_required=False),
    }
)
def example(context, input_df: DataFrame) -> DataFrame:
    partition_on_keys = context.solid_config.get("partition_on")
    input_df.to_parquet("s3://bucket/key", partition_on=partition_on_keys)

The list of keys to partition on (defaults to ["year", "month"]) is returned as a frozenlist from context.solid_config. When this is passed as the value of the partition_on argument in input_df.to_parquet(…), the frozenlist is pickled and sent to Dask workers.

The problem this poses is that the Dask workers now need to have the dagster module installed in order to unpickle what is otherwise a normal Python list. This adds an extra dependency on the Dask workers, or any other system that receives pickled values from a Dagster pipeline.

Intended users

  • Pipeline Developer
  • DevOps Engineer

User experience goal

If a solid or resource passes a config to another system, the target system should be able to unpickle the value without a complete dagster installation. Ideally, no extra modules would be required at all.

This particularly affects Dask pipelines, where most things are pickled to run on Dask workers, outside of the Dagster execution context.

Proposal

There are several possibilities.

  1. Have frozenlist/frozendict pickle as their native Python types. That way they can be unpickled as list/dict without requiring any additional modules.
  2. Move frozenlist/frozendict to an independent module, say dagster_collections. Systems that receive these types can install this module only instead of dagster and its complete dependency tree.
  3. Provide utility functions to convert frozenlist/frozendict to list/dict. Solids and resources are then responsible for converting the Dagster types to native types before passing them to other systems. Utility functions are needed for deep conversions (e.g., a frozendict nested in a frozenlist).
  4. Pass config values as list/dict to solids and resources. Not sure about the consequences of this.
  5. Stop using frozenlist/frozenlist. This would have negative impacts on caching in Dagster.

I personally prefer option 1 or 2. At minimum, I don't want to install dagster completely on my Dask workers just to unpickle frozenlist/frozendict.

Option 3 might be useful to have regardless of the other options.

Other information

@yuhan yuhan added the area: config Related to Configuration label May 14, 2021
@yuhan
Copy link
Contributor

yuhan commented May 14, 2021

IMO passing values as list/dict/set from outside Dagster sounds good and it may be too much to expect user inputs to be immutable objects. And we should have already guarded these cases by converting them into frozenlist/frozendict/frozenset inside the machinery for immutability.

smackesey added a commit that referenced this issue Mar 23, 2023
### Summary & Motivation

Internal companion PR: dagster-io/internal#5239

Relevant discussion:
dagster-io/internal#4859 (comment)

Removes the `frozenlist`, `frozendict` and `frozentags` classes.
Reasons:

- They don't play nicely with static type-checking and are a source of
type-checker diagnostic noise
- Static type-checking with Sequence/Mapping sort of solves the same
problem (this is kind of like removing runtime check calls from internal
methods)
- They are used inconsistently in the codebase (most places where we
have the intention of creating an immutable list/dict they are not
used). Where they are used, it is not at all obvious why at first
glance.
- They generally complicate the code

The main purpose the `frozen*` classes were serving in our code was to
make a few select `NamedTuple` classes hashable. If a `NamedTuple`
contains mutable collection members, its default hash function will
fail. Replacing those mutable collections with immutable ones lets the
hash succeed, which in turns lets the `NamedTuple` be cached via
`lru_cache` or used as a dict key.

The set of classes that need to be made hashable for `lru_cache`
purposes are:

- `CustomPointer`
- `AssetsDefinitionCacheableData`
- `RepositoryLoadData`
- `ReconstructableRepository`

These are part of a tree of objects rooted on `ReconstructablePipeline`,
which uses `lru_cache` in its `get_definition` method.

The above classes can be made hashable in a more legible and
type-annotation-friendly way by defining a `__hash__` method. This PR
does that-- wherever we have a `NamedTuple` that needed to be hashed,
`frozen*` instantiations in the constructor were removed and a
`__hash__` method was added. All of the `__hash__` methods are the same,
they just call `dagster._utils.hash_named_tuple` and cache the result.
This function just converts mutable collections to tuples prior to
hashing, which allows hashing to succeed as tuples are immutable.

Aside from cases where frozen classes were used to achieve `NamedTuple`
hashability, the other uses were:

- random? uses in some places where whoever wrote the code thought the
structure should be immutable. I removed these cases for the reasons
above (most of the time we want structures to be immutable and in most
of the codebase we signal this with `Sequence`/`Mapping`.
- A few other cases where dicts need to be hashable to sort them or use
them in a cache. For these, I provided simple substitute solutions that
achieve the same effect-- e.g. instead of using `frozendict`, converting
a dict to a tuple of 2-tuples before hashing.

---

- Resolves #3008
- Resolves #3641

### How I Tested These Changes

Existing test suite.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: config Related to Configuration
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants