Optimize job __hash__ and __eq__ checks. #442

bdice · 2020-12-22T07:49:49Z

Description

In signac-flow's aggregation feature, it is fairly common to check membership: job in [job1, job2, ...]. This is currently very slow in signac.

By Python's membership test rules, x in y is equivalent to any(x is e or x == e for e in y).
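
As a rough illustration of that rule (the `Tracked` class below is a toy, not part of signac), counting `__eq__` calls shows that a list membership test compares against each element in turn, and that the identity shortcut skips `__eq__` entirely:

```python
class Tracked:
    """Toy object that counts how often __eq__ is called."""

    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        Tracked.eq_calls += 1
        return isinstance(other, Tracked) and self.value == other.value

    def __hash__(self):
        return hash(self.value)


items = [Tracked(i) for i in range(1000)]
needle = Tracked(999)  # equal to the last element, but a different object

# The list is scanned front to back, so __eq__ runs once per element.
found = needle in items
calls_for_needle = Tracked.eq_calls

# An object that is already in the list matches by identity first.
found_by_identity = items[0] in items
calls_after_identity = Tracked.eq_calls
```

If `__eq__` is expensive, that cost is paid once per element scanned, which is why a slow equality check dominates membership tests on long lists.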

Currently, the __eq__ check requires a call to os.path.realpath, which is fairly expensive. For a directory path like /a/b/c/d/e/f, realpath must check whether /a, /a/b, /a/b/c, etc. are symlinks, and if so, resolve them to their target location. That requires a lot of system calls just to check if jobs are equal. You can see that definition here.
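
A small sketch of why the symlink resolution matters, using only the standard library (the directory names are made up for illustration): two different path strings can refer to the same location, and only `os.path.realpath` reveals that.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "workspace")
    os.mkdir(target)

    # Create a second name for the same directory.
    # Note: creating symlinks may need extra privileges on Windows.
    link = os.path.join(tmp, "alias")
    os.symlink(target, link)

    # The strings differ, but both resolve to the same real location.
    same_string = target == link
    same_location = os.path.realpath(target) == os.path.realpath(link)
```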

I propose weakening the __hash__ function (which should always be fast in the proposed optimization) and using the hash as a fast way to rule out equality. This optimization is valid because a == b implies hash(a) == hash(b) (see the Python data model section on hashing for details). By the contrapositive, hash(a) != hash(b) implies a != b, so we can compare hashes first (the job's hash is simply hash(job.id), a property that is known by the job) and only check the realpath if the hashes match.
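
A minimal sketch of this idea, assuming a hypothetical `SketchJob` class with `_id` and `_wd` attributes (this is not the actual signac implementation, and the `isinstance` guard is an addition for the sketch):

```python
import os


class SketchJob:
    """Hypothetical stand-in for a job; not the library's actual class."""

    def __init__(self, job_id, workspace):
        self._id = job_id
        self._wd = os.path.join(workspace, job_id)

    def __hash__(self):
        # Cheap: depends only on the id, which the job already knows.
        return hash(self._id)

    def __eq__(self, other):
        if not isinstance(other, SketchJob):
            return NotImplemented
        # Different hashes prove the jobs differ, so realpath is skipped.
        if hash(self) != hash(other):
            return False
        # Hashes match: fall back to the expensive symlink-aware check.
        return os.path.realpath(self._wd) == os.path.realpath(other._wd)


a = SketchJob("abc123", "/tmp/project_a/workspace")
b = SketchJob("def456", "/tmp/project_a/workspace")
c = SketchJob("abc123", "/tmp/project_b/workspace")
a_again = SketchJob("abc123", "/tmp/project_a/workspace")
```

Here `a` and `c` share a hash but are not equal, because their working directories differ; `a` and `a_again` compare equal.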

Motivation and Context

For a workspace of 30,000 jobs, this speeds up job in list(project) by a factor of ~70 (6.05 seconds without the optimization, 0.086 seconds with it). (Note that job in project would use the project's __contains__ method, so we need to check against a list of jobs.)

Types of Changes

Documentation update
Bug fix
New feature
Breaking change¹

¹The change breaks (or has the potential to break) existing functionality.

Checklist:

I am familiar with the Contributing Guidelines.
I agree with the terms of the Contributor Agreement.
My name is on the list of contributors.
My code follows the code style guideline of this project.
The changes introduced by this pull request are covered by existing or newly introduced tests.

If necessary:

I have updated the API documentation as part of the package doc-strings.
I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

bdice · 2020-12-22T07:55:55Z

I made a mistake and pushed the changelog update to master in commit 9d41d5c. 😞 If we choose to close this PR and not merge it, I will take care of fixing that.

codecov · 2020-12-22T08:02:56Z

Codecov Report

Merging #442 (0179c46) into master (426d2ed) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #442   +/-   ##
=======================================
  Coverage   76.98%   76.98%           
=======================================
  Files          42       42           
  Lines        5704     5704           
  Branches     1112     1112           
=======================================
  Hits         4391     4391           
  Misses       1029     1029           
  Partials      284      284

Impacted Files	Coverage Δ
signac/contrib/job.py	`89.60% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

bdice · 2020-12-22T08:04:28Z

I had to update one test file to accommodate this change. I believe the guarantees about job hashes that were previously tested were too strong, and were only true because of the implementation details of hash(job) (the test went beyond the expectations of the Python data model, which I believe are sufficient for our needs). I am open to reviewer feedback on whether this is "too breaking" or not. If it's "too breaking" then we'll need to punt on it until signac 2.0.

mikemhenry · 2020-12-22T21:24:37Z

Looking at this PR now, but how were you able to push to master? I thought we had the master branch locked down.

mikemhenry · 2020-12-22T21:28:10Z

tests/test_project.py

@@ -655,7 +655,6 @@ def test_job_move(self):
        job = project_a.open_job(dict(a=0))
        job_b = project_b.open_job(dict(a=0))
        assert job != job_b
-        assert hash(job) != hash(job_b)


So this test fails under the new changes because it previously passed only because the working directories for job and job_b were different?

Yes, this fails because of the relaxed definition of the hash function. job == job_b implies that hash(job) == hash(job_b), but the logical inverse (job != job_b implies hash(job) != hash(job_b)) does not hold. In the Python data model, it is fine for objects to be non-equal and have the same hash.
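
To illustrate that last point with a toy class (the `Entry` name and its fields are hypothetical), sets and dicts still keep non-equal objects apart when their hashes are identical, because a hash match is always followed by an `__eq__` check:

```python
class Entry:
    """Toy class whose hash ignores one of the fields used by __eq__."""

    def __init__(self, key, location):
        self.key = key
        self.location = location

    def __hash__(self):
        # Only the key contributes to the hash.
        return hash(self.key)

    def __eq__(self, other):
        return (
            isinstance(other, Entry)
            and self.key == other.key
            and self.location == other.location
        )


p = Entry("a", "/project_a")
q = Entry("a", "/project_b")

# Same hash, different objects: both survive in a set and as dict keys.
unique = {p, q}
lookup = {p: "first", q: "second"}
```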

mikemhenry · 2020-12-22T21:37:19Z

> I had to update one test file to accommodate this change. I believe the guarantees about job hashes that were previously tested were too strong, and were only true because of the implementation details of hash(job) (the test went beyond the expectations of the Python data model, which I believe are sufficient for our needs). I am open to reviewer feedback on whether this is "too breaking" or not. If it's "too breaking" then we'll need to punt on it until signac 2.0.

I had a few questions about the removed tests since I think understanding how job equality is different now impacts how I feel about this change being too breaking or not. On one hand I think this optimization is great, on the other hand it feels like we are touching the most important internals for the core, so I would be interested in hearing the opinion of another maintainer (@vyasr @csadorf @b-butler)

bdice · 2020-12-22T22:10:15Z

> Looking at this PR now, but how were you able to push to master? I thought we had the master branch locked down.

Repo administrators have privileges to push to master, to simplify the process of applying tiny fixes like correcting typos and to allow manual merges via the command line (instead of the GitHub web interface). I got careless while benchmarking between my feature branch and master and forgot to change back before committing. We used to have branch restrictions apply to maintainers as well, but we changed it to grant maintainers push access at some point. If maintainers agree that we want to adopt a more restrictive set of rules on this, I am open to changing it back.

vyasr · 2020-12-24T01:38:08Z

@mikemhenry @bdice I'm not necessarily opposed to changing the hash function, and I agree that the current hash function is more than what the Python data model requires. One could argue that changing how hash(x) behaves is a change to the public API; however, I don't think we need to make such a drastic change solely for the purpose of optimization here. Why not simply define:

    def __eq__(self, other):
        return (self._id == other._id) and (
            os.path.realpath(self._wd) == os.path.realpath(other._wd)
        )

Checking equality of the id directly is at least as good as hashing it first, and then you don't have to modify the hash. We could still make the hash change if we find other cases where a faster hash function would be helpful, but at least wait until 2.0 to break that behavior.

Based on your benchmark, making this change alone is sufficient to get large speedups. That said, if realpath is still the bottleneck after making the change, would we benefit from caching it anyway (i.e. storing something like self._real_wd = os.path.realpath(self._wd))? We could store that as an internal property that calculates and caches the realpath the first time it's requested.
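
A rough sketch of what such a lazily cached property could look like. The class name and the `real_wd` property are hypothetical; only `_real_wd`, `_wd`, and `_id` follow the names used in this thread.

```python
import os


class CachedPathJob:
    """Hypothetical job-like class with a lazily cached real path."""

    def __init__(self, job_id, workspace):
        self._id = job_id
        self._wd = os.path.join(workspace, job_id)
        self._real_wd = None  # filled in on first access

    @property
    def real_wd(self):
        # Resolve symlinks once and reuse the result afterwards.
        if self._real_wd is None:
            self._real_wd = os.path.realpath(self._wd)
        return self._real_wd

    def __eq__(self, other):
        if not isinstance(other, CachedPathJob):
            return NotImplemented
        return self._id == other._id and self.real_wd == other.real_wd

    def __hash__(self):
        return hash(self._id)


job = CachedPathJob("abc123", "/tmp/project/workspace")
first = job.real_wd   # triggers the realpath call
second = job.real_wd  # served from the cached attribute
```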

Moving to 2.0 we could consider having a project always normalize its root directory to an absolute path. If we also start enforcing that the workspace be a subdirectory of a job then I don't think the realpath call would even ever be necessary.

bdice · 2020-12-28T05:24:27Z

> I'm not necessarily opposed to changing the hash function, and I agree that the current hash function is more than what the Python data model requires. One could argue that changing how hash(x) behaves is a change to the public API; however, I don't think we need to make such a drastic change solely for the purpose of optimization here. [...]

@vyasr @mikemhenry There are several places where jobs' hashes are used. This includes sets of jobs (I know there are examples of this) and dicts with jobs as keys (I'm not 100% sure if there are examples of this but it seems likely). Thus, I believe it is necessary to optimize __hash__ as well as __eq__. Only optimizing __eq__ is a partial solution but it seems like we should change both at the same time, because of the close relationship of __hash__ and __eq__ in the Python data model.

> That said, if realpath is still the bottleneck after making the change, would we benefit from caching it anyway (i.e. storing something like self._real_wd = os.path.realpath(self._wd))? We could store that as an internal property that calculates and caches the realpath the first time it's requested.

Unfortunately caching the realpath would require significant additional work to deal with cache invalidation. There are several cases where the realpath is invalidated, such as moving a job from one project to another. For now, I prefer to limit the number of calls to realpath, as is done in the implementation of this PR.
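
To make the invalidation concern concrete, here is a small sketch under assumed names (`RelocatableJob` and `relocate` are hypothetical and do not mirror signac's API): every code path that changes the working directory also has to reset the cached value, or later comparisons would use a stale path.

```python
import os


class RelocatableJob:
    """Hypothetical class showing where a cached realpath must be reset."""

    def __init__(self, wd):
        self._wd = wd
        self._real_wd = None

    @property
    def real_wd(self):
        if self._real_wd is None:
            self._real_wd = os.path.realpath(self._wd)
        return self._real_wd

    def relocate(self, new_wd):
        # Changing the working directory must also drop the cached value;
        # without the reset below, `real_wd` would keep the old path.
        self._wd = new_wd
        self._real_wd = None


job = RelocatableJob("/tmp/project_a/workspace/abc123")
before = job.real_wd
job.relocate("/tmp/project_b/workspace/abc123")
after = job.real_wd
```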

> Moving to 2.0 we could consider having a project always normalize its root directory to an absolute path. If we also start enforcing that the workspace be a subdirectory of a job then I don't think the realpath call would even ever be necessary.

It is unfortunately not possible to eliminate the call to realpath under the current definition of equality, even if we enforce that restriction in signac. (I assume you meant "subdirectory of a project"). Because of symbolic links, it is possible to construct cases where the jobs must be compared by realpath in order to determine equality under the current definition of equality.

vyasr · 2020-12-28T14:48:14Z

> > I'm not necessarily opposed to changing the hash function, and I agree that the current hash function is more than what the Python data model requires. One could argue that changing how hash(x) behaves is a change to the public API; however, I don't think we need to make such a drastic change solely for the purpose of optimization here. [...]

> @vyasr @mikemhenry There are several places where jobs' hashes are used. This includes sets of jobs (I know there are examples of this) and dicts with jobs as keys (I'm not 100% sure if there are examples of this but it seems likely). Thus, I believe it is necessary to optimize __hash__ as well as __eq__. Only optimizing __eq__ is a partial solution but it seems like we should change both at the same time, because of the close relationship of __hash__ and __eq__ in the Python data model.

That's fair, I agree that optimizing for usage of the hash is something that we should do. I'm simply suggesting that we can avoid the question of whether changing the hash is too breaking in the 1.x line by only optimizing __eq__ for now, then changing the hash in 2.x. That seems like the safest choice, assuming that your benchmark mostly holds in that case. I would be more comfortable waiting for 2.x for changing the hash unless it's demonstrably necessary to achieve these speedups, but if other @glotzerlab/signac-maintainers feel less need to maintain this stability then I'm OK being outvoted.

I do think that in either case we should redefine __eq__ in the way that I proposed. Using the hash directly in equality exploits a detail of the Python Data Model, but the equality check should be something that clearly indicates what defines a job in a way that's easy for readers to understand. Using the hash in the equality check is possible, but less semantically clear. Adding comments to my snippet indicating that id checks are cheap and a necessary but not sufficient condition -- and that because of symlinks the absolute path test is the true test defining equality -- would be helpful as well. This implementation has the added benefits of not being dependent on the implementation of the hash and being faster by avoiding the extra function calls and hashes.

> > That said, if realpath is still the bottleneck after making the change, would we benefit from caching it anyway (i.e. storing something like self._real_wd = os.path.realpath(self._wd))? We could store that as an internal property that calculates and caches the realpath the first time it's requested.

> Unfortunately caching the realpath would require significant additional work to deal with cache invalidation. There are several cases where the realpath is invalidated, such as moving a job from one project to another. For now, I prefer to limit the number of calls to realpath, as is done in the implementation of this PR.

Fair point, I agree that might end up requiring more work. Something to consider for a future PR if realpath remains a necessity in future versions.

> > Moving to 2.0 we could consider having a project always normalize its root directory to an absolute path. If we also start enforcing that the workspace be a subdirectory of a job then I don't think the realpath call would even ever be necessary.

> It is unfortunately not possible to eliminate the call to realpath under the current definition of equality, even if we enforce that restriction in signac. (I assume you meant "subdirectory of a project"). Because of symbolic links, it is possible to construct cases where the jobs must be compared by realpath in order to determine equality under the current definition of equality.

Yes, subdirectory of a project... the alternative of infinite recursion would be a small problem 😂

If we make both changes I listed (a project root directory is defined as an absolute path and the workspace is a subdirectory of a project) the only case I can see where symbolic links would cause problems would be if two different projects symlink to the same workspace directory in a different location. I guess technically those should compare as the same job... that seems like a very error-prone use case that I intentionally discounted from my previous evaluation, but on second consideration even if we do redefine the workspace as a subdirectory of a project there's no way for us to check for that case, so yes I agree that realpath is necessary and we're stuck with this definition as the best that can be done.

bdice · 2020-12-28T16:41:17Z

> That's fair, I agree that optimizing for usage of the hash is something that we should do. I'm simply suggesting that we can avoid the question of whether changing the hash is too breaking in the 1.x line by only optimizing __eq__ for now, then changing the hash in 2.x. That seems like the safest choice, assuming that your benchmark mostly holds in that case. I would be more comfortable waiting for 2.x for changing the hash unless it's demonstrably necessary to achieve these speedups, but if other @glotzerlab/signac-maintainers feel less need to maintain this stability then I'm OK being outvoted.

I profiled set(project) for a project of 30,000 jobs. With the __hash__ optimizations in this PR, it's around 1/3 faster (3.3 seconds vs. 2.2 seconds). Much of the 2.2 seconds is spent fetching statepoints.

@klywang @mikemhenry This PR is ready for review. Tagging @glotzerlab/signac-maintainers in case anyone feels strongly and wants to vote against the small breaking change to __hash__ behavior in 1.x releases.

mikemhenry

I believe this change is non-breaking enough to happen without a major version change. I'm going to "vote" by approving this PR.

bdice · 2020-12-29T19:46:06Z

Thanks @mikemhenry! This can be merged with a second approval from @klywang or one of @glotzerlab/signac-maintainers.

klywang

Looks good!

csadorf · 2021-01-08T15:27:55Z

As long as two jobs that are in different directories are not comparing equal, I think it is fine to relax the restraint on the hash function. So this is a retroactive approval. 👍

Optimize job __hash__ and __eq__ checks.

d886455

bdice requested review from a team as code owners December 22, 2020 07:49

bdice requested review from mikemhenry and klywang and removed request for a team December 22, 2020 07:49

Update changelog.

e4d22f8

bdice added 2 commits December 22, 2020 01:58

Use id instead of working directory for hash, because it can cause problems if two projects share the same job but have different working directories.

336315d

Remove strict guarantees in tests about job hashes.

e3cea09

bdice self-assigned this Dec 22, 2020

bdice added the enhancement New feature or request label Dec 22, 2020

bdice added this to the v1.6.0 milestone Dec 22, 2020

bdice mentioned this pull request Dec 22, 2020

Optimize H5Store init. #443

Merged

12 tasks

mikemhenry reviewed Dec 22, 2020

Test that hashes are equal when the jobs are equal.

941a2fb

bdice requested a review from mikemhenry December 22, 2020 22:11

Use id instead of hash for equality pre-test.

fb643ca

Merge branch 'master' into feature/optimize-job-hash-eq

ef7cc60

mikemhenry approved these changes Dec 29, 2020

bdice requested a review from a team December 29, 2020 19:46

Merge branch 'master' into feature/optimize-job-hash-eq

0179c46

klywang approved these changes Dec 29, 2020

bdice merged commit c1cf74e into master Dec 29, 2020

bdice deleted the feature/optimize-job-hash-eq branch December 29, 2020 20:02

This was referenced Dec 30, 2020

Handle non-Job objects in Job equality check. #455

Merged

Optimization: use cached status everywhere. glotzerlab/signac-flow#410

Merged
