
Remove lock to support multiprocessing schedulers #88

Merged
merged 8 commits into superlinear-ai:master on Aug 11, 2022

Conversation

multimeric
Contributor

@multimeric multimeric commented Aug 7, 2022

Currently we use functools.lru_cache and functools.cached_property (which internally uses lru_cache) to cache a few things; in particular, I think it caches the filesystem object. From my understanding this isn't essential, because the actual computations are cached in the filesystem itself and we're just caching the filesystem wrapper, but I assume there was a good reason for doing this.

In any case, lru_cache seems to acquire a lock whenever it writes to the cache. I guess this is with multithreaded scenarios in mind, but that seems like a fairly niche use case, because I think dask is generally used in a multiprocessing mode, or synchronously for debugging. In those situations I don't think the lock does anything, but it does prevent the computation from being pickled, and therefore transmitted between processes (see #87).
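
To make the pickling problem concrete, here's a minimal standalone demo (not graphchain code): any object that holds a thread lock refuses to pickle, which is what blocks sending the computation to worker processes:

import pickle
import threading

lock = threading.RLock()
try:
    pickle.dumps(lock)  # locks wrap OS-level state, so they cannot be serialized
except TypeError as exc:
    print(exc)  # cannot pickle '_thread.RLock' object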

For this reason I've moved to cachetools, which offers a more customizable cache that can use a lock but doesn't need one. I've added a test that would normally fail under multiprocessing, and I've also incidentally tested this on my own workflow, where it successfully loads from the cache while using the distributed scheduler.

The only remaining issue is what to do about multithreading. Because these decorators are customizable, I could conditionally use a lock based on the dask scheduler config, but this would be ugly because these are top-level decorators. Could we simply decide not to support multithreading? As far as I know, multithreading isn't particularly useful in Python anyway, because the GIL prevents threads from running truly concurrently.

@lsorber
Member

lsorber commented Aug 7, 2022

Hi @multimeric, thanks for submitting this PR!

Currently we use functools.lru_cache and functools.cached_property (which internally uses lru_cache) to cache a few things; in particular, I think it caches the filesystem object. From my understanding this isn't essential, because the actual computations are cached in the filesystem itself and we're just caching the filesystem wrapper, but I assume there was a good reason for doing this.

You're right, the current usage of lru_cache and cached_property is not essential. For context, here's why we use them:

  1. The two instances of cached_property are used to avoid having to open the PyFilesystem FS for every computation, which can be slow. Removing this would not reduce functionality; it would just be inefficient to have to open the filesystem for every computation. (A rough sketch of this is shown below the list.)
  2. The one instance of lru_cache is used to cache the reading of the load, store, and compute timings of the computation. These can be queried by other CachedComputations to determine their own time to result, so it would again be rather inefficient to have to read the timings from the filesystem every time.
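
For illustration, a rough sketch of the first point (the attribute and constructor names here are assumptions for the example, not necessarily graphchain's actual code):

from functools import cached_property

import fs  # PyFilesystem2


class CachedComputation:
    def __init__(self, cache_url: str) -> None:
        self.cache_url = cache_url

    @cached_property
    def cache_fs(self):
        # Opening the PyFilesystem FS can be slow, so open it once per instance
        # and reuse it for every subsequent computation.
        return fs.open_fs(self.cache_url)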

In any case, lru_cache seems to acquire a lock whenever it writes to the cache. I guess this is with multithreaded scenarios in mind, but that seems like a fairly niche use case, because I think dask is generally used in a multiprocessing mode, or synchronously for debugging. In those situations I don't think the lock does anything, but it does prevent the computation from being pickled, and therefore transmitted between processes (see #87).

I see, thanks for submitting that issue!

For this reason I've moved to cachetools, which offers a more customizable cache that can use a lock but doesn't need one. I've added a test that would normally fail under multiprocessing, and I've also incidentally tested this on my own workflow, where it successfully loads from the cache while using the distributed scheduler.

I think we can solve this without introducing a dependency:

  1. We can replace the two instances of @cached_property with @lazy_property from David Beazley's Python Cookbook:
class lazy_property:
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, cls):
        if instance is None:
            return self
        else:
            # Compute the value once and store it on the instance; because this
            # is a non-data descriptor, the instance attribute shadows it on
            # subsequent lookups.
            value = self.func(instance)
            setattr(instance, self.func.__name__, value)
            return value
  2. The one instance of @lru_cache could be removed by splitting def read_time(self, timing_type: str) -> float into:
@lazy_property
def read_time_load(self) -> float:
    ...

@lazy_property
def read_time_store(self) -> float:
    ...

@lazy_property
def read_time_compute(self) -> float:
    ...
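
For illustration, here's how lazy_property behaves (hypothetical Example class): the value is computed on first access and then stored as an ordinary instance attribute, so later accesses bypass the descriptor entirely:

class Example:
    @lazy_property
    def answer(self):
        print("computing...")
        return 42


e = Example()
print(e.answer)  # prints "computing..." and then 42; stores e.answer = 42
print(e.answer)  # prints 42 only; the instance attribute now shadows the descriptor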

Does that seem like an acceptable solution to you? Would you want to give this a go, or do you prefer that I create a PR?

@multimeric
Contributor Author

This all seems reasonable.

Regarding the memoization mechanism, I tend to prefer importing a trusted package over writing an implementation myself, because I feel maintaining our own version increases the maintenance burden, but not everyone agrees with me on that. Also, cachetools is a tiny package at only 9.3 kB. However, I'm happy to use the property you have described.

Do you have any thoughts on the multithreading issue? Also, is there any way you can think of to pickle the lock object so that we don't have to rewrite anything? I don't quite understand why it can't be done.

@multimeric
Contributor Author

I guess there's also the question of whether there is another reason why you recommended the sync scheduler in the first place, in case it's more than just this pickling issue?

@lsorber
Member

lsorber commented Aug 7, 2022

This all seems reasonable.

Regarding the memoization mechanism, I tend to prefer importing a trusted package over writing an implementation myself, because I feel maintaining our own version increases the maintenance burden, but not everyone agrees with me on that. Also, cachetools is a tiny package at only 9.3 kB. However, I'm happy to use the property you have described.

If it's all right with you I'd prefer not to introduce a dependency in this case – every dependency is an opportunity for graphchain to break down the line.

Do you have any thoughts on the multithreading issue? Also, is there any way you can think of to pickle the lock object so that we don't have to rewrite anything? I don't quite understand why it can't be done.

If we go with the lazy_property solution, I don't see any reason why multithreading wouldn't work.

I guess there's also the question of whether there is another reason why you recommended the sync scheduler in the first place, in case it's more than just this pickling issue?

Different schedulers have different pros and cons; this is how I see it:

  1. dask.threaded.get (multithreading): useful if your dask graph has independent paths and the computations release the GIL. For instance, many NumPy operations will release the GIL, enabling you to make good use of multiple cores (in the sense that multiple tasks will be processed concurrently, and the tasks themselves may benefit from multiple cores too). If your computations don't release the GIL, I don't think you'll see any benefit at all.
  2. dask.multiprocessing.get (multiprocess): useful if your dask graph has independent paths and the computations don't release the GIL. Requires tasks to be pickle-able though, which may not always be possible.
  3. dask.get (single process): graphchain is most useful here if you have dask graphs in which the computations take a lot of time or the individual tasks already saturate your node's cores (in which case multiprocessing wouldn't bring any benefit), and you want to be able to quickly iterate on the graph without having to wait on intermediate results.
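
For reference, here's a minimal sketch of invoking each of these schedulers on a toy hand-written task graph (the graph below is only an illustration, not a graphchain example):

import dask
import dask.multiprocessing
import dask.threaded

# A tiny dask graph: z is the sum of x and y.
dsk = {"x": 1, "y": 2, "z": (sum, ["x", "y"])}

if __name__ == "__main__":  # guard needed for the multiprocessing scheduler on spawn platforms
    print(dask.get(dsk, "z"))                  # 3, single process (synchronous)
    print(dask.threaded.get(dsk, "z"))         # 3, thread pool
    print(dask.multiprocessing.get(dsk, "z"))  # 3, process pool; tasks must be picklable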

@multimeric
Contributor Author

In terms of multithreading, what I'm getting at is more that there must be a reason why lru_cache has a lock. If Python could genuinely run threads in parallel, there could be an issue where one thread tries to instantiate the filesystem while another thread is doing the same. I would have thought the GIL prevents this, but then I don't really understand why Python even has a threading library at all.
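
One way to see why a lock can still matter despite the GIL is that the GIL only serializes individual bytecode operations, not compound check-then-store sequences. A standalone sketch (not graphchain code) of an unlocked cache running its expensive factory twice:

import threading
import time

calls = 0
cache = {}


def expensive_open():
    global calls
    calls += 1
    time.sleep(0.01)  # simulate a slow filesystem open; a thread switch can happen here
    return object()


def get_fs():
    # Both threads can pass this check before either one stores a value,
    # because the GIL may switch threads between the check and the store.
    if "fs" not in cache:
        cache["fs"] = expensive_open()
    return cache["fs"]


threads = [threading.Thread(target=get_fs) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(calls)  # often 2 without a lock: wasted work rather than corruption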

@multimeric
Contributor Author

multimeric commented Aug 7, 2022

Actually, I just realised that @functools.cache doesn't use a lock (by default), so we can fix this bug very simply, without any dependencies and with minimal changes.

@multimeric changed the title from "Use cachetools lock" to "Remove lock to support multiprocessing schedulers" on Aug 7, 2022
@multimeric
Contributor Author

Any thoughts @lsorber?

@lsorber
Member

lsorber commented Aug 9, 2022

That sounds like an even better solution, thanks @multimeric!

I just enabled the CI workflow on this PR. Could you update the PR to make sure it succeeds? At first glance it seems like the dev dependencies were removed from the lock file.

EDIT: If you're developing locally, you'll need to run poetry install followed by poetry shell and poe lint. Otherwise, you can also follow the README to start a Dev Container with the development environment set up for you where you can run poe lint.

@multimeric
Contributor Author

Okay, it should be properly linted now. There were some ugly things I had to do to make mypy and flake8 happy though.

Member

@lsorber lsorber left a comment


LGTM. Just one comment: can the changes to poetry.lock be removed, or are they necessary?

EDIT: It looks like functools.cache is new since Python 3.9. I'd rather not drop support for Python 3.8 yet. It also looks like the implementation of @cache just invokes lru_cache(maxsize=None) [1], so I'm not sure using it would avoid the lock issue.

[1] https://github.com/python/cpython/blob/3.10/Lib/functools.py#L651
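
For reference, the implementation at [1] is essentially:

def cache(user_function):
    """Simple lightweight unbounded cache.  Sometimes called "memoize"."""
    return lru_cache(maxsize=None)(user_function)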

@multimeric
Contributor Author

EDIT: It looks like functools.cache is new since Python 3.9. I'd rather not drop support for Python 3.8 yet. It also looks like the implementation of @cache just invokes lru_cache(maxsize=None) [1], so I'm not sure using it would avoid the lock issue.

Okay, I can change it to lru_cache. Actually the main bug was with cached_property, and so I can still fix it and retain backwards compatibility.

LGTM. Just one comment: can the changes to poetry.lock be removed, or are they necessary?

Well, they can be reverted and everything will still work, but then the lock file will be out of date. The reason my lock file is different is just that I have newer (but still compatible) versions of some dependencies. I think it's normal for the lock file to change over time, but it actually seems you could even delete the file entirely: https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control.

Member

@lsorber lsorber left a comment


The changes LGTM, only mypy complaining now. Might be because of a newer version of dask. Could you address the remaining issue? Happy to merge once the pipeline succeeds.

Member

@lsorber lsorber left a comment


isort complaining now, could you check? I recommend running poe lint and poe test locally to make sure the CI workflow will pass too.

Member

@lsorber lsorber left a comment


Getting closer, only pytest left!

@codecov-commenter

Codecov Report

Merging #88 (a584514) into master (408e130) will increase coverage by 0.25%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #88      +/-   ##
==========================================
+ Coverage   87.19%   87.44%   +0.25%     
==========================================
  Files           3        3              
  Lines         242      247       +5     
  Branches       41       41              
==========================================
+ Hits          211      216       +5     
  Misses         17       17              
  Partials       14       14              
Impacted Files Coverage Δ
src/graphchain/core.py 91.54% <100.00%> (+0.21%) ⬆️


@lsorber lsorber merged commit 1c27c8c into superlinear-ai:master Aug 11, 2022