Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize job __hash__ and __eq__ checks. #442

Merged
merged 8 commits into from
Dec 29, 2020
Merged

Conversation

bdice
Copy link
Member

@bdice bdice commented Dec 22, 2020

Description

In signac-flow's aggregation feature, it is fairly common to check membership: job in list(job1, job2, ...). This is currently very slow in signac.

By Python's membership test rules, x in y is equivalent to any(x is e or x == e for e in y).

Currently, the __eq__ check requires a call to os.path.realpath, which is fairly expensive. For a directory path like /a/b/c/d/e/f, realpath must check whether /a, /a/b, /a/b/c, etc. are symlinks, and if so, resolve them to their target location. That requires a lot of system calls just to check if jobs are equal. You can see that definition here.

I propose weakening the __hash__ function (which should always be fast in the proposed optimization) and using the hash as a fast way to rule out equality. This optimization is valid because a == b implies hash(a) == hash(b) (see the Python data model section on hashing for details). Using the contrapositive, we can check for hash collision (the job's hash is simply hash(job.id), a property that is known by the job) and then only check the realpath if hashes collide.

Motivation and Context

For a workspace of 30,000 jobs, this speeds up job in list(project) by a factor of ~70. (6.05 seconds without the optimization, 0.086 seconds with optimization). (Note that job in project would use the project's __contains__ method, so we need to check against a list-of-jobs,)

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change1

1The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

@bdice bdice requested review from a team as code owners December 22, 2020 07:49
@bdice bdice requested review from mikemhenry and klywang and removed request for a team December 22, 2020 07:49
@bdice
Copy link
Member Author

bdice commented Dec 22, 2020

I made a mistake and pushed the changelog update to master in commit 9d41d5c. 😞 If we choose to close this PR and not merge it, I will take care of fixing that.

@codecov
Copy link

codecov bot commented Dec 22, 2020

Codecov Report

Merging #442 (0179c46) into master (426d2ed) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #442   +/-   ##
=======================================
  Coverage   76.98%   76.98%           
=======================================
  Files          42       42           
  Lines        5704     5704           
  Branches     1112     1112           
=======================================
  Hits         4391     4391           
  Misses       1029     1029           
  Partials      284      284           
Impacted Files Coverage Δ
signac/contrib/job.py 89.60% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 426d2ed...0179c46. Read the comment docs.

@bdice
Copy link
Member Author

bdice commented Dec 22, 2020

I had to update one test file to accommodate this change. I believe the guarantees about job hashes that were previously tested were too strong, and were only true because of the implementation details of hash(job) (the test went beyond the expectations of the Python data model, which I believe are sufficient for our needs). I am open to reviewer feedback on whether this is "too breaking" or not. If it's "too breaking" then we'll need to punt on it until signac 2.0.

@bdice bdice self-assigned this Dec 22, 2020
@bdice bdice added the enhancement New feature or request label Dec 22, 2020
@bdice bdice added this to the v1.6.0 milestone Dec 22, 2020
@bdice bdice mentioned this pull request Dec 22, 2020
12 tasks
@mikemhenry
Copy link
Collaborator

Looking at this PR now, but how were you able to push to master? I thought we had the master branch locked down.

@@ -655,7 +655,6 @@ def test_job_move(self):
job = project_a.open_job(dict(a=0))
job_b = project_b.open_job(dict(a=0))
assert job != job_b
assert hash(job) != hash(job_b)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this test fails under the new changes since it only failed since the working directories for job and job_b were different

Copy link
Member Author

@bdice bdice Dec 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this fails because of the relaxed definition of the hash function. job == job_b implies that hash(job) == hash(job_b), but the logical inverse (job != job_bhash(job) != hash(job_b)) does not hold. In the Python data model, it is fine for objects to be non-equal and have the same hash.

tests/test_project.py Outdated Show resolved Hide resolved
@mikemhenry
Copy link
Collaborator

I had to update one test file to accommodate this change. I believe the guarantees about job hashes that were previously tested were too strong, and were only true because of the implementation details of hash(job) (the test went beyond the expectations of the Python data model, which I believe are sufficient for our needs). I am open to reviewer feedback on whether this is "too breaking" or not. If it's "too breaking" then we'll need to punt on it until signac 2.0.

I had a few questions about the removed tests since I think understanding how job equality is different now impacts how I feel about this change being too breaking or not. On one hand I think this optimization is great, on the other hand it feels like we are touching the most important internals for the core, so I would be interested in hearing the opinion of another maintainer (@vyasr @csadorf @b-butler)

@bdice
Copy link
Member Author

bdice commented Dec 22, 2020

Looking at this PR now, but how were you able to push to master? I thought we had the master branch locked down.

Repo administrators have privileges to push to master, to simplify the process of applying tiny fixes like correcting typos and to allow manual merges via the command line (instead of the GitHub web interface). I got careless while benchmarking between my feature branch and master and forgot to change back before committing. We used to have branch restrictions apply to maintainers as well, but we changed it to grant maintainers push access at some point. If maintainers agree that we want to adopt a more restrictive set of rules on this, I am open to changing it back.

@vyasr
Copy link
Contributor

vyasr commented Dec 24, 2020

@mikemhenry @bdice I'm not necessarily opposed to changing the hash function, and I agree that the current hash function is more than what the Python data model requires. One could argue that changing how hash(x) behaves is a change to the public API; however, I don't think we need to make such a drastic change solely for the purpose of optimization here. Why not simply define:

    def __eq__(self, other):
        return (self._id == other._id) and (
            os.path.realpath(self._wd) == os.path.realpath(other._wd)
        )

Checking equality of the id directly is at least as good as hashing it first, and then you don't have to modify the hash. We could still make the hash change if we find other cases where a faster hash function would be helpful, but at least wait until 2.0 to break that behavior.

Based on your benchmark making this change alone is sufficient to get large speedups. That said, if realpath is still the bottleneck after making the change, would we benefit from caching it anyway (i.e. storing something like self._real_wd = os.path.realpath(self._wd))? We could store that as an internal property that calculates and caches the realpath the first time it's requested.

Moving to 2.0 we could consider having a project always normalize its root directory to an absolute path. If we also start enforcing that the workspace be a subdirectory of a job then I don't think the realpath call would even ever be necessary.

@bdice
Copy link
Member Author

bdice commented Dec 28, 2020

I'm not necessarily opposed to changing the hash function, and I agree that the current hash function is more than what the Python data model requires. One could argue that changing how hash(x) behaves is a change to the public API; however, I don't think we need to make such a drastic change solely for the purpose of optimization here. [...]

@vyasr @mikemhenry There are several places where jobs' hashes are used. This includes sets of jobs (I know there are examples of this) and dicts with jobs as keys (I'm not 100% sure if there are examples of this but it seems likely). Thus, I believe it is necessary to optimize __hash__ as well as __eq__. Only optimizing __eq__ is a partial solution but it seems like we should change both at the same time, because of the close relationship of __hash__ and __eq__ in the Python data model.

That said, if realpath is still the bottleneck after making the change, would we benefit from caching it anyway (i.e. storing something like self._real_wd = os.path.realpath(self._wd))? We could store that as an internal property that calculates and caches the realpath the first time it's requested.

Unfortunately caching the realpath would require significant additional work to deal with cache invalidation. There are several cases where the realpath is invalidated, such as moving a job from one project to another. For now, I prefer to limit the number of calls to realpath, as is done in the implementation of this PR.

Moving to 2.0 we could consider having a project always normalize its root directory to an absolute path. If we also start enforcing that the workspace be a subdirectory of a job then I don't think the realpath call would even ever be necessary.

It is unfortunately not possible to eliminate the call to realpath under the current definition of equality, even if we enforce that restriction in signac. (I assume you meant "subdirectory of a project"). Because of symbolic links, it is possible to construct cases where the jobs must be compared by realpath in order to determine equality under the current definition of equality.

@vyasr
Copy link
Contributor

vyasr commented Dec 28, 2020

I'm not necessarily opposed to changing the hash function, and I agree that the current hash function is more than what the Python data model requires. One could argue that changing how hash(x) behaves is a change to the public API; however, I don't think we need to make such a drastic change solely for the purpose of optimization here. [...]

@vyasr @mikemhenry There are several places where jobs' hashes are used. This includes sets of jobs (I know there are examples of this) and dicts with jobs as keys (I'm not 100% sure if there are examples of this but it seems likely). Thus, I believe it is necessary to optimize __hash__ as well as __eq__. Only optimizing __eq__ is a partial solution but it seems like we should change both at the same time, because of the close relationship of __hash__ and __eq__ in the Python data model.

That's fair, I agree that optimizing for usage of the hash is something that we should do. I'm simply suggesting that we can avoid the question of whether changing the hash is too breaking in the 1.x line by only optimizing __eq__ for now, then changing the hash in 2.x. That seems like the safest choice, assuming that your benchmark mostly holds in that case. I would be more comfortable waiting for 2.x for changing the hash unless it's demonstrably necessary to achieve these speedups, but if other @glotzerlab/signac-maintainers feel less need to maintain this stability then I'm OK being outvoted.

I do think that in either case we should redefine __eq__ in the way that I proposed. Using the hash directly in equality exploits a detail of the Python Data Model, but the equality check should be something that clearly indicates what defines a job in a way that's easy for readers to understand. Using the hash in the equality check is possible, but less semantically clear. Adding a comments to my snippet indicating that id checks are cheap and a necessary but not sufficient condition -- and that because of symlinks the absolute path test is the true test defining equality -- would be helpful as well. This implementation has the added benefits of not being dependent on the implementation of the hash and being faster by avoiding the extra function calls and hashes.

That said, if realpath is still the bottleneck after making the change, would we benefit from caching it anyway (i.e. storing something like self._real_wd = os.path.realpath(self._wd))? We could store that as an internal property that calculates and caches the realpath the first time it's requested.

Unfortunately caching the realpath would require significant additional work to deal with cache invalidation. There are several cases where the realpath is invalidated, such as moving a job from one project to another. For now, I prefer to limit the number of calls to realpath, as is done in the implementation of this PR.

Fair point, I agree that might end up requiring more work. Something to consider for a future PR if realpath remains a necessity in future versions.

Moving to 2.0 we could consider having a project always normalize its root directory to an absolute path. If we also start enforcing that the workspace be a subdirectory of a job then I don't think the realpath call would even ever be necessary.

It is unfortunately not possible to eliminate the call to realpath under the current definition of equality, even if we enforce that restriction in signac. (I assume you meant "subdirectory of a project"). Because of symbolic links, it is possible to construct cases where the jobs must be compared by realpath in order to determine equality under the current definition of equality.

Yes, subdirectory of a project... the alternative of infinite recursion would be a small problem 😂

If we make both changes I listed (a project root directory is defined as an absolute path and the workspace is a subdirectory of a project) the only case I can see where symbolic links would cause problems would be if two different projects symlink to the same workspace directory in a different location. I guess technically those should compare as the same job... that seems like a very error-prone use case that I intentionally discounted from my previous evaluation, but on second consideration even if we do redefine the workspace as a subdirectory of a project there's no way for us to check for that case, so yes I agree that realpath is necessary and we're stuck with this definition as the best that can be done.

@bdice
Copy link
Member Author

bdice commented Dec 28, 2020

That's fair, I agree that optimizing for usage of the hash is something that we should do. I'm simply suggesting that we can avoid the question of whether changing the hash is too breaking in the 1.x line by only optimizing eq for now, then changing the hash in 2.x. That seems like the safest choice, assuming that your benchmark mostly holds in that case. I would be more comfortable waiting for 2.x for changing the hash unless it's demonstrably necessary to achieve these speedups, but if other @glotzerlab/signac-maintainers feel less need to maintain this stability then I'm OK being outvoted.

I profiled set(project) for a project of 30,000 jobs. With the __hash__ optimizations in this PR, it's around 1/3 faster (3.3 seconds vs. 2.2 seconds). Much of the 2.2 seconds is spent fetching statepoints.

@klywang @mikemhenry This PR is ready for review. Tagging @glotzerlab/signac-maintainers in case anyone feels strongly and wants to vote against the small breaking change to __hash__ behavior in 1.x releases.

Copy link
Collaborator

@mikemhenry mikemhenry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this change is non-breaking enough to happen without a major version change. I'm going to "vote" by approving this PR.

@bdice
Copy link
Member Author

bdice commented Dec 29, 2020

Thanks @mikemhenry! This can be merged with a second approval from @klywang or one of @glotzerlab/signac-maintainers.

@bdice bdice requested a review from a team December 29, 2020 19:46
Copy link
Contributor

@klywang klywang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@bdice bdice merged commit c1cf74e into master Dec 29, 2020
@bdice bdice deleted the feature/optimize-job-hash-eq branch December 29, 2020 20:02
@csadorf
Copy link
Contributor

csadorf commented Jan 8, 2021

As long as two jobs that are in different directories are not comparing equal, I think it is fine to relax the restraint on the hash function. So this is a retroactive approval. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants