
Remove lock to support multiprocessing schedulers #88

Merged
merged 8 commits into superlinear-ai:master on Aug 11, 2022

Conversation

multimeric
Contributor

@multimeric multimeric commented Aug 7, 2022

Currently we use functools.lru_cache and functools.cached_property (which internally uses lru_cache) to cache a few things; in particular, I think it caches the filesystem object. From my understanding this isn't essential, because the actual computations are cached in the filesystem itself and we're just caching the filesystem wrapper, but I assume there was a good reason for doing this.

In any case, lru_cache seems to acquire a lock whenever it writes to the cache. I guess this is with multithreaded scenarios in mind, but that seems like a fairly niche use case, because I think dask is generally used in a multiprocessing mode, or synchronously for debugging. In those situations I don't think the lock does anything, but it does prevent the computation from being pickled, and therefore transmitted between processes (see #87).
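
To make the pickling problem concrete, here's a minimal standalone demo (not graphchain code): any object that holds a thread lock refuses to pickle, which is what blocks sending the computation to worker processes:

import pickle
import threading

lock = threading.RLock()
try:
    pickle.dumps(lock)  # locks wrap OS-level state, so they cannot be serialized
except TypeError as exc:
    print(exc)  # cannot pickle '_thread.RLock' object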

For this reason I've moved to cachetools, which offers a more customizable cache that can use a lock but doesn't need one. I've added a test that would normally fail under multiprocessing, and I've also incidentally tested this on my own workflow, where it successfully loads from the cache while using the distributed scheduler.

The only remaining issue is what to do about multithreading. Because these decorators are customizable, I could conditionally use a lock based on the dask scheduler config, but this would be ugly because these are top-level decorators. Could we simply decide not to support multithreading? As far as I know, multithreading isn't particularly useful in Python anyway, because the GIL prevents threads from running truly concurrently.

@lsorber
Member

lsorber commented Aug 7, 2022

Hi @multimeric, thanks for submitting this PR!

Currently we use functools.lru_cache and functools.cached_property (which internally uses lru_cache) to cache a few things; in particular, I think it caches the filesystem object. From my understanding this isn't essential, because the actual computations are cached in the filesystem itself and we're just caching the filesystem wrapper, but I assume there was a good reason for doing this.

You're right, the current usage of lru_cache and cached_property is not essential. For context, here's why we use them:

  1. The two instances of cached_property are used to avoid having to open the PyFilesystem FS for every computation, which can be slow. Removing this would not reduce functionality; it would just be inefficient to have to open the filesystem for every computation. (A rough sketch of this is shown below the list.)
  2. The one instance of lru_cache is used to cache the reading of the load, store, and compute timings of the computation. These can be queried by other CachedComputations to determine their own time to result, so it would again be rather inefficient to have to read the timings from the filesystem every time.
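
For illustration, a rough sketch of the first point (the attribute and constructor names here are assumptions for the example, not necessarily graphchain's actual code):

from functools import cached_property

import fs  # PyFilesystem2


class CachedComputation:
    def __init__(self, cache_url: str) -> None:
        self.cache_url = cache_url

    @cached_property
    def cache_fs(self):
        # Opening the PyFilesystem FS can be slow, so open it once per instance
        # and reuse it for every subsequent computation.
        return fs.open_fs(self.cache_url)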

In any case, lru_cache seems to acquire a lock whenever it writes to the cache. I guess this is with multithreaded scenarios in mind, but that seems like a fairly niche use case, because I think dask is generally used in a multiprocessing mode, or synchronously for debugging. In those situations I don't think the lock does anything, but it does prevent the computation from being pickled, and therefore transmitted between processes (see #87).

I see, thanks for submitting that issue!

For this reason I've moved to cachetools, which offers a more customizable cache that can use a lock but doesn't need one. I've added a test that would normally fail under multiprocessing, and I've also incidentally tested this on my own workflow, where it successfully loads from the cache while using the distributed scheduler.

I think we can solve this without introducing a dependency:

  1. We can replace the two instances of @cached_property with @lazy_property from David Beazley's Python Cookbook:
class lazy_property:
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, cls):
        if instance is None:
            return self
        else:
            # Compute the value once and store it on the instance; because this
            # is a non-data descriptor, the instance attribute shadows it on
            # subsequent lookups.
            value = self.func(instance)
            setattr(instance, self.func.__name__, value)
            return value
  2. The one instance of @lru_cache could be removed by splitting def read_time(self, timing_type: str) -> float into:
@lazy_property
def read_time_load(self) -> float:
    ...

@lazy_property
def read_time_store(self) -> float:
    ...

@lazy_property
def read_time_compute(self) -> float:
    ...
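
For illustration, here's how lazy_property behaves (hypothetical Example class): the value is computed on first access and then stored as an ordinary instance attribute, so later accesses bypass the descriptor entirely:

class Example:
    @lazy_property
    def answer(self):
        print("computing...")
        return 42


e = Example()
print(e.answer)  # prints "computing..." and then 42; stores e.answer = 42
print(e.answer)  # prints 42 only; the instance attribute now shadows the descriptor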

Does that seem like an acceptable solution to you? Would you want to give this a go, or do you prefer that I create a PR?

@multimeric
Contributor Author

This all seems reasonable.

Regarding the memoization mechanism, I tend to prefer importing a trusted package over writing an implementation myself, because I feel maintaining our own version increases the maintenance burden, but not everyone agrees with me on that. Also, cachetools is a tiny package at only 9.3 kB. However, I'm happy to use the property you have described.

Do you have any thoughts on the multithreading issue? Also, is there any way you can think of to pickle the lock object so that we don't have to rewrite anything? I don't quite understand why it can't be done.

@multimeric
Contributor Author

I guess there's also the question of whether there is another reason why you recommended the sync scheduler in the first place, in case it's more than just this pickling issue?

@lsorber
Member

lsorber commented Aug 7, 2022

This all seems reasonable.

Regarding the memoization mechanism, I tend to prefer importing a trusted package over writing an implementation myself, because I feel maintaining our own version increases the maintenance burden, but not everyone agrees with me on that. Also, cachetools is a tiny package at only 9.3 kB. However, I'm happy to use the property you have described.

If it's all right with you I'd prefer not to introduce a dependency in this case – every dependency is an opportunity for graphchain to break down the line.

Do you have any thoughts on the multithreading issue? Also, is there any way you can think of to pickle the lock object so that we don't have to rewrite anything? I don't quite understand why it can't be done.

If we go with the lazy_property solution, I don't see any reason why multithreading wouldn't work.

I guess there's also the question of whether there is another reason why you recommended the sync scheduler in the first place, in case it's more than just this pickling issue?

Different schedulers have different pros and cons; this is how I see it:

  1. dask.threaded.get (multithreading): useful if your dask graph has independent paths and the computations release the GIL. For instance, many NumPy operations will release the GIL, enabling you to make good use of multiple cores (in the sense that multiple tasks will be processed concurrently, and the tasks themselves may benefit from multiple cores too). If your computations don't release the GIL, I don't think you'll see any benefit at all.
  2. dask.multiprocessing.get (multiprocess): useful if your dask graph has independent paths and the computations don't release the GIL. Requires tasks to be pickle-able though, which may not always be possible.
  3. dask.get (single process): graphchain is most useful here if you have dask graphs in which the computations take a lot of time or the individual tasks already saturate your node's cores (in which case multiprocessing wouldn't bring any benefit), and you want to be able to quickly iterate on the graph without having to wait on intermediate results.
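
For reference, here's a minimal sketch of invoking each of these schedulers on a toy hand-written task graph (the graph below is only an illustration, not a graphchain example):

import dask
import dask.multiprocessing
import dask.threaded

# A tiny dask graph: z is the sum of x and y.
dsk = {"x": 1, "y": 2, "z": (sum, ["x", "y"])}

if __name__ == "__main__":  # guard needed for the multiprocessing scheduler on spawn platforms
    print(dask.get(dsk, "z"))                  # 3, single process (synchronous)
    print(dask.threaded.get(dsk, "z"))         # 3, thread pool
    print(dask.multiprocessing.get(dsk, "z"))  # 3, process pool; tasks must be picklable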

@multimeric
Contributor Author

In terms of multithreading, what I'm getting at is more that there must be a reason why lru_cache has a lock. If Python could genuinely run threads in parallel, there could be an issue where one thread tries to instantiate the filesystem while another thread is doing the same. I would have thought the GIL prevents this, but then I don't really understand why Python even has a threading library at all.
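
One way to see why a lock can still matter despite the GIL is that the GIL only serializes individual bytecode operations, not compound check-then-store sequences. A standalone sketch (not graphchain code) of an unlocked cache running its expensive factory twice:

import threading
import time

calls = 0
cache = {}


def expensive_open():
    global calls
    calls += 1
    time.sleep(0.01)  # simulate a slow filesystem open; a thread switch can happen here
    return object()


def get_fs():
    # Both threads can pass this check before either one stores a value,
    # because the GIL may switch threads between the check and the store.
    if "fs" not in cache:
        cache["fs"] = expensive_open()
    return cache["fs"]


threads = [threading.Thread(target=get_fs) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(calls)  # often 2 without a lock: wasted work rather than corruption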

@multimeric
Contributor Author

multimeric commented Aug 7, 2022

Actually, I just realised that @functools.cache doesn't use a lock (by default), so we can fix this bug very simply, without any dependencies and with minimal changes.

@multimeric changed the title from "Use cachetools lock" to "Remove lock to support multiprocessing schedulers" on Aug 7, 2022
@multimeric
Contributor Author

Any thoughts @lsorber?

@lsorber
Member

lsorber commented Aug 9, 2022

That sounds like an even better solution, thanks @multimeric!

I just enabled the CI workflow on this PR. Could you update the PR to make sure it succeeds? At first glance it seems like the dev dependencies were removed from the lock file.

EDIT: If you're developing locally, you'll need to run poetry install followed by poetry shell and poe lint. Otherwise, you can also follow the README to start a Dev Container with the development environment set up for you where you can run poe lint.

@multimeric
Contributor Author

Okay, it should be properly linted now. There were some ugly things I had to do to make mypy and flake8 happy though.

Member

@lsorber lsorber left a comment


LGTM. Just one comment: can the changes to poetry.lock be removed, or are they necessary?

EDIT: It looks like functools.cache is new since Python 3.9. I'd rather not drop support for Python 3.8 yet. It also looks like the implementation of @cache just invokes lru_cache(maxsize=None) [1], so I'm not sure using it would avoid the lock issue.

[1] https://github.com/python/cpython/blob/3.10/Lib/functools.py#L651
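
For reference, the implementation at [1] is essentially:

def cache(user_function):
    """Simple lightweight unbounded cache.  Sometimes called "memoize"."""
    return lru_cache(maxsize=None)(user_function)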

@multimeric
Contributor Author

EDIT: It looks like functools.cache is new since Python 3.9. I'd rather not drop support for Python 3.8 yet. It also looks like the implementation of @cache just invokes lru_cache(maxsize=None) [1], so I'm not sure using it would avoid the lock issue.

Okay, I can change it to lru_cache. Actually the main bug was with cached_property, and so I can still fix it and retain backwards compatibility.

LGTM. Just one comment: can the changes to poetry.lock be removed, or are they necessary?

Well, they can be reverted and everything will still work, but then the lock file will be out of date. The reason my lock file is different is just that I have newer (but still compatible) versions of some dependencies. I think it's normal for the lock file to change over time, but it actually seems you could even delete the file entirely: https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control.

Member

@lsorber lsorber left a comment


The changes LGTM, only mypy complaining now. Might be because of a newer version of dask. Could you address the remaining issue? Happy to merge once the pipeline succeeds.

Member

@lsorber lsorber left a comment


isort complaining now, could you check? I recommend running poe lint and poe test locally to make sure the CI workflow will pass too.

Member

@lsorber lsorber left a comment


Getting closer, only pytest left!

@codecov-commenter

Codecov Report

Merging #88 (a584514) into master (408e130) will increase coverage by 0.25%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #88      +/-   ##
==========================================
+ Coverage   87.19%   87.44%   +0.25%     
==========================================
  Files           3        3              
  Lines         242      247       +5     
  Branches       41       41              
==========================================
+ Hits          211      216       +5     
  Misses         17       17              
  Partials       14       14              
Impacted Files Coverage Δ
src/graphchain/core.py 91.54% <100.00%> (+0.21%) ⬆️


@lsorber lsorber merged commit 1c27c8c into superlinear-ai:master Aug 11, 2022