Improve optimizer to traverse the dask graph from the requested key #3

maurosilber · 2022-02-02T04:33:39Z

Currently, the optimizer step traverses the full dask graph in no particular order:

https://github.com/maurosilber/pipeline/blob/0cac8b8954b4def43e593040dced79807ac37f3a/pipeline/storage.py#L50-L58

It would be better to traverse it starting from the requested key(s) and following their dependencies. A task does not need to be checked if all of its dependents are already stored, since they will be loaded instead of computed.

For instance, consider the following graph, a -> b -> c, where b is already stored.

dsk = {
    "a": (task_a,),
    "b": (task_b, "a"),
    "c": (task_c, "b"),
}

If we request c, which is not stored, then we would need to check b, which is stored and hence loaded. Then, we don't need to check a, which is simply removed from the graph.

optimized_dsk = {
    # "a": (task_a,),
    "b": (load, task_b),
    "c": (task_c, "b"),
}

We could adapt the implementation of dask.optimization.cull.
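A rough sketch of what that traversal could look like, assuming flat tuple tasks with string keys as in the example above. The names `optimize`, `get_dependencies`, `is_stored` and the stub functions are hypothetical and are not taken from pipeline/storage.py; the `(load, func)` form just mirrors the optimized graph shown above.

```python
def get_dependencies(dsk, key):
    """Return the keys of dsk referenced directly by the task stored under key.

    Simplification: only flat tuple tasks with string keys are handled.
    """
    task = dsk[key]
    if not isinstance(task, tuple):
        return set()
    return {arg for arg in task[1:] if isinstance(arg, str) and arg in dsk}


def optimize(dsk, keys, is_stored, load):
    """Walk the graph from the requested keys, stopping at stored tasks.

    is_stored(key) -> bool and load are hypothetical hooks standing in for
    whatever the storage backend provides.
    """
    out = {}
    stack = list(keys)
    while stack:
        key = stack.pop()
        if key in out:
            continue
        if is_stored(key):
            # Replace the task with a load; its dependencies are never visited.
            out[key] = (load, dsk[key][0])
            continue
        out[key] = dsk[key]
        stack.extend(get_dependencies(dsk, key))
    return out


# Placeholder functions, only to make the example self-contained.
def task_a(): return 1
def task_b(a): return a + 1
def task_c(b): return b + 1
def load(func): return None

dsk = {
    "a": (task_a,),
    "b": (task_b, "a"),
    "c": (task_c, "b"),
}

optimized_dsk = optimize(dsk, ["c"], is_stored=lambda key: key == "b", load=load)
```

Here a is dropped without ever being checked, because the walk stops at b. Unlike a full-graph pass, the cost depends only on the tasks reachable from the requested keys before hitting a stored one.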

maurosilber added the enhancement New feature or request label Feb 2, 2022
