Remove lock to support multiprocessing schedulers #88
Conversation
Hi @multimeric, thanks for submitting this PR! You're right, the current usage of `lru_cache` is a problem for multiprocessing schedulers.
I see, thanks for submitting that issue! I think we can solve this without introducing a dependency:

```python
class lazy_property:
    """Non-data descriptor that computes a value once and caches it
    on the instance, without taking a lock."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, cls):
        if instance is None:
            return self
        else:
            value = self.func(instance)
            # Shadow the descriptor with the computed value, so later
            # accesses bypass __get__ entirely.
            setattr(instance, self.func.__name__, value)
            return value


@lazy_property
def read_time_load(self) -> float:
    ...


@lazy_property
def read_time_store(self) -> float:
    ...


@lazy_property
def read_time_compute(self) -> float:
    ...
```

Does that seem like an acceptable solution to you? Would you want to give this a go, or do you prefer that I create a PR?
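As a quick illustration of how the descriptor behaves (a minimal sketch; the `Expensive` class here is hypothetical, not part of graphchain):

```python
class Expensive:
    @lazy_property
    def load(self) -> float:
        print("computing...")
        return 42.0

obj = Expensive()
print(obj.load)  # prints "computing..." then 42.0
print(obj.load)  # prints 42.0 only: the first access replaced the
                 # descriptor with a plain instance attribute
print("load" in vars(obj))  # True: the value now lives on the instance
```

Because `lazy_property` defines only `__get__`, it is a non-data descriptor: once the first access stores the value via `setattr`, later lookups hit the instance attribute directly. No lock is involved, so instances stay picklable.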
This all seems reasonable. Regarding the memoization mechanism, I tend to prefer importing a trusted package over writing an implementation myself, because I feel maintaining our own increases the maintenance burden, but not everyone agrees with me on that. Also, cachetools is a tiny package: it's only 9.3 kB. However, I'm happy to use the property you have described. Do you have any thoughts on the multithreading issue? Also, is there any way you can think of to pickle the lock object so that we don't have to rewrite anything? I don't quite understand why it can't be done.
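For context on why the lock can't be pickled: lock objects wrap OS-level primitives and carry no serializable state, so `pickle` rejects them outright. A minimal demonstration:

```python
import pickle
import threading

# Locks cannot be serialized; pickle raises immediately:
pickle.dumps(threading.Lock())
# TypeError: cannot pickle '_thread.lock' object
```

Since `lru_cache` stores such a lock inside its wrapper, any serializer that pickles the wrapper by value, as dask's cloudpickle-based serialization can, hits the same error.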
I guess there's also the question of whether there was another reason why you recommended the sync scheduler in the first place, in case it's more than just this pickling issue?
If it's all right with you, I'd prefer not to introduce a dependency in this case – every dependency is an opportunity for graphchain to break down the line.
If we go with the `lazy_property` approach above, we avoid both the extra dependency and the lock.
Different schedulers have different pros and cons to me; this is how I see it:
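Roughly, and as a general sketch rather than the original comparison: the synchronous scheduler is single-threaded and easiest to debug; the threaded scheduler shares memory but contends on the GIL for pure-Python tasks; the process-based and distributed schedulers parallelize CPU-bound work but require everything in the task graph to be picklable. Selecting a scheduler per compute call looks like this (the `inc` task is hypothetical):

```python
import dask

@dask.delayed
def inc(x):
    # Hypothetical task used only to demonstrate scheduler selection.
    return x + 1

total = inc(1) + inc(2)
print(total.compute(scheduler="synchronous"))  # single-threaded, easy to debug
print(total.compute(scheduler="threads"))      # thread pool; shares the GIL
print(total.compute(scheduler="processes"))    # process pool; graph must pickle
```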
In terms of multithreading, I'm more thinking that there must be a reason why `lru_cache` takes a lock in the first place.

Actually, I just realised that we could use `functools.cache` instead.

Any thoughts @lsorber?
That sounds like an even better solution, thanks @multimeric! I just enabled the CI workflow on this PR. Could you update the PR to make sure it succeeds? At first glance it seems like the dev dependencies were removed from the lock file. EDIT: If you're developing locally, you'll need to run `poetry install` to get the dev dependencies.
Okay, it should be properly linted now. There were some ugly things I had to do to make mypy and flake8 happy, though.
LGTM. Just one comment: can the changes to `poetry.lock` be removed, or are they necessary?

EDIT: It looks like `functools.cache` is new in Python 3.9, and I'd rather not drop support for Python 3.8 yet. It also looks like the implementation of `@cache` just invokes `lru_cache(maxsize=None)` [1], so I'm not sure using it would avoid the lock issue.

[1] https://github.com/python/cpython/blob/3.10/Lib/functools.py#L651
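For reference, the implementation at [1] is just:

```python
def cache(user_function):
    """Simple lightweight unbounded cache.  Sometimes called "memoize"."""
    return lru_cache(maxsize=None)(user_function)
```

so `@cache` inherits `lru_cache`'s internal lock and wouldn't avoid the pickling problem.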
Okay, I can change it to the `lazy_property` implementation then.
Well, they can be reverted and everything will still work, but then the lock file will be out of date. My lock file differs only because I have newer (but still compatible) versions of some dependencies; I think it's normal for the lock file to change over time. Actually, it seems that you could even delete the file entirely: https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control.
The changes LGTM, only `mypy` is complaining now. Might be because of a newer version of dask. Could you address the remaining issue? Happy to merge once the pipeline succeeds.
`isort` is complaining now, could you check? I recommend running `poe lint` and `poe test` locally to make sure the CI workflow will pass too.
Getting closer, only pytest left!
Codecov Report
```diff
@@            Coverage Diff             @@
##           master      #88      +/-   ##
==========================================
+ Coverage   87.19%   87.44%   +0.25%
==========================================
  Files           3        3
  Lines         242      247       +5
  Branches       41       41
==========================================
+ Hits          211      216       +5
  Misses         17       17
  Partials       14       14
```
Original PR description:

Currently we use `functools.lru_cache` and `functools.cached_property` (which internally uses `lru_cache`) to cache a few things; in particular, I think it caches the filesystem object. From my understanding this isn't essential, because the actual computations are cached in the filesystem itself and we're just caching the filesystem wrapper, but I assume there was a good reason for doing this.

In any case, `lru_cache` acquires a lock whenever it writes to the cache. I guess this is with multithreaded scenarios in mind, but that seems like a niche use case: dask is generally used with a multiprocessing-style scheduler, or the synchronous scheduler for debugging. In those situations the lock does nothing, but it does prevent the computation from being pickled, and therefore from being transmitted between processes (see #87).

For this reason I've moved to `cachetools`, which offers a more customizable cache that can take a lock but doesn't require one. I've added a test which would normally fail under multiprocessing, and I've also incidentally tested this on my own workflow, where it successfully loads from the cache while using the distributed scheduler.

The only remaining issue is what to do about multithreading. Because these decorators are customizable, I could conditionally use a lock based on the dask scheduler config, but this would be ugly because these are top-level decorators. Could we just decide not to support multithreading? As far as I know, multithreading isn't particularly useful in Python anyway, because the GIL prevents the threads from running concurrently.
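As a sketch of the conditional-lock idea above, assuming the `cachetools` decorators and a hypothetical `_maybe_lock` helper that inspects dask's config (illustrative only; the catch is that the decorator runs at import time, before the user has necessarily chosen a scheduler):

```python
import threading

import dask
from cachetools import LRUCache, cached

def _maybe_lock():
    # Hypothetical helper: only pay for a lock when the threaded
    # scheduler is configured. Returning None keeps the wrapper
    # free of unpicklable state.
    scheduler = dask.config.get("scheduler", default=None)
    return threading.RLock() if scheduler == "threads" else None

@cached(cache=LRUCache(maxsize=1), lock=_maybe_lock())
def read_time_load() -> float:
    ...
```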