Improve performance #975

joaander · 2024-02-07T21:57:09Z

Description

Improve the performance of signac for many use-cases, especially when workspaces have large job counts. Read the complete commit messages for full details on each change.

To summarize:

Introduce Job.statepoint_mapping (edit: Job.cached_statepoint), which allows cached, read-only access to the statepoint. statepoint_mapping is loaded on demand when the job is not in the cache.
Use the statepoint cache in additional code paths.
All open_job code paths now lazily load Job._statepoint.
Add validate_statepoint argument to Job.init. When False, init checks only that the job directory exists.
Cache job ids in JobsCursor so that __len__ and __contains__ are O(1).
Re-use the results from listdir when iterating over jobs in Project and bypass the exists check on every job as it is opened.

Motivation and Context

Users would like their scripts to complete quickly. I will post benchmark results in the comments.

Checklist:

I am familiar with the Contributing Guidelines.
I agree with the terms of the Contributor Agreement.
My name is on the list of contributors.
The changes introduced by this pull request are covered by existing or newly introduced tests.
The package documentation and framework documentation in signac-docs are up to date with these changes.
I have updated the changelog and added any related issue and pull request numbers for future reference.

_StatePointDict takes significant time to initialize, even when the statepoint dict is known. Adjust `Job` initialization to make more use of the statepoint cache and initialize `_StatePointDict` only when `Job.statepoint` is accessed. Provide a faster path for cached *read* access to the statepoint dict via the new property `Job.statepoint_dict`. One side effect of this change is that some warnings are now deferred to `statepoint` access that were previously issued during `Job.__init__` (see changes in tests/). There are additional opportunities to use the cached statepoint dict in `Project.groupby` that this commit does not address.
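The deferral described above can be sketched roughly as follows. This is a minimal illustration, not signac's implementation: `ExpensiveStatePoint` is a hypothetical stand-in for `_StatePointDict`, and `LazyJob` with its attributes is an invented name used only to show the pattern of building the costly wrapper on first access.

```python
class ExpensiveStatePoint:
    """Hypothetical stand-in for a costly wrapper such as _StatePointDict."""

    instances_created = 0

    def __init__(self, data):
        ExpensiveStatePoint.instances_created += 1
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]


class LazyJob:
    """Illustrative job that builds the expensive wrapper only on demand."""

    def __init__(self, known_statepoint):
        # Keep the plain dict; do not construct the wrapper yet.
        self._known_statepoint = dict(known_statepoint)
        self._statepoint = None

    @property
    def cached_statepoint(self):
        # Fast path: plain dict lookup, no wrapper construction.
        # (A real implementation would likely hand out a read-only view.)
        return self._known_statepoint

    @property
    def statepoint(self):
        # Slow path: construct the wrapper on first access only.
        if self._statepoint is None:
            self._statepoint = ExpensiveStatePoint(self._known_statepoint)
        return self._statepoint


jobs = [LazyJob({"a": i}) for i in range(1000)]
fast_total = sum(job.cached_statepoint["a"] for job in jobs)
created_after_fast_path = ExpensiveStatePoint.instances_created
first_value = jobs[0].statepoint["a"]
created_after_slow_path = ExpensiveStatePoint.instances_created
```

Reading through the fast path never constructs a wrapper, which is also why any warnings tied to wrapper construction would be deferred until `statepoint` is first accessed.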

Cache the ids matching the job filter. This enables O(1) cost for __len__ and __contains__ as users would expect. In some use-cases, signac-flow repeatedly calls __contains__ on a JobsCursor. The side effect of this change is that modifications to the workspace will not be reflected in existing JobsCursor instances. This behavior was not previously documented in the user API.
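A rough sketch of the trade-off in this commit, assuming a hypothetical `CachedCursor` over an in-memory `workspace` dict (neither name comes from signac): the filter is evaluated once, so `len()` and `in` are cheap afterwards, but later changes to the workspace are not seen by the existing cursor.

```python
class CachedCursor:
    """Illustrative cursor that evaluates its filter once and caches the ids."""

    def __init__(self, workspace, predicate):
        self._workspace = workspace  # hypothetical mapping: job id -> statepoint
        self._predicate = predicate
        self._ids = None

    def _load(self):
        # Evaluate the filter only on first use; reuse the result afterwards.
        if self._ids is None:
            self._ids = frozenset(
                job_id
                for job_id, statepoint in self._workspace.items()
                if self._predicate(statepoint)
            )
        return self._ids

    def __len__(self):
        return len(self._load())

    def __contains__(self, job_id):
        return job_id in self._load()


workspace = {f"job{i}": {"a": i} for i in range(10)}
cursor = CachedCursor(workspace, lambda sp: sp["a"] % 2 == 0)

length_before = len(cursor)
workspace["job10"] = {"a": 10}  # modify the workspace after the ids were cached
length_after = len(cursor)
sees_new_job = "job10" in cursor
```

In this sketch the cursor keeps reporting the original five matches, which mirrors the documented side effect: create a new cursor to observe workspace changes.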

`with job`, `Job.document`, and `Job.stores` call `init()` because they require that the job directory exists. Prior to this change, `init()` also forced a load of the `_StatePointDict`. These methods now call `init(validate_statepoint=False)`, which exits early when the job directory exists. This change provides a reasonable performance boost (5x on NVME, more on network storage). There may be more room for improvement as there are currently 2N stat calls in this use-case:

```python
for job in project:
    with job:
        pass
```
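The early exit can be pictured with the sketch below. It assumes a hypothetical free function `init_job` and a made-up file name `statepoint.json`; only the `validate_statepoint` flag name is taken from the description above, and the real `Job.init` does more than this.

```python
import json
import os
import tempfile


def init_job(job_dir, statepoint, validate_statepoint=True):
    """Hypothetical init: ensure the directory exists, optionally validate."""
    # Early exit: one directory check and no file I/O on the state point.
    if not validate_statepoint and os.path.isdir(job_dir):
        return "fast"

    os.makedirs(job_dir, exist_ok=True)
    sp_file = os.path.join(job_dir, "statepoint.json")  # illustrative file name
    if not os.path.exists(sp_file):
        with open(sp_file, "w") as f:
            json.dump(statepoint, f)

    # Full path: read the file back and compare with the expected state point.
    with open(sp_file) as f:
        if json.load(f) != statepoint:
            raise ValueError("state point on disk does not match")
    return "validated"


with tempfile.TemporaryDirectory() as root:
    job_dir = os.path.join(root, "job")
    first = init_job(job_dir, {"a": 1})
    second = init_job(job_dir, {"a": 1}, validate_statepoint=False)
```

The second call returns after a single directory check, which is the kind of saving that matters most on networked file systems.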

deepcopy is unexpectedly expensive. Refactor the earlier commit to deepcopy only user-provided statepoint dicts. Statepoints from the cache are passed to the user read-only via MappingProxyType.

`open_job` uses the statepoint cache to improve performance. Read the cache from disk in `open_job` (if it has not already been read). This provides consistently high performance in cases where `open_job` is called before any other method that may have triggered `_get_statepoint`.

Users may find the messages too verbose. At the same time, users might never realize that they should run `signac update-cache` without this message...

`open_job` is a user-facing function and performs error checking on the id. This check involves a stat call to verify the job directory exists. When `Project` is looping over ids from `_get_job_ids`, the directory is known to exist (subject to race conditions). `stat` calls are expensive, especially on networked filesystems. Instantiating `Job` directly bypasses this check in `open_job`.
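To illustrate the difference between one directory listing and a per-job existence check, here is a minimal sketch using only the standard library. The helper names `ids_via_listdir` and `ids_via_stat` are hypothetical, and the temporary directory stands in for a workspace.

```python
import os
import tempfile


def ids_via_listdir(workspace):
    """A single directory listing returns every job id at once."""
    return sorted(os.listdir(workspace))


def ids_via_stat(workspace, candidate_ids):
    """One stat-style check per candidate id (the pattern being avoided)."""
    return sorted(
        job_id
        for job_id in candidate_ids
        if os.path.isdir(os.path.join(workspace, job_id))
    )


with tempfile.TemporaryDirectory() as workspace:
    expected = [f"{i:032x}" for i in range(5)]  # fake 32-character job ids
    for job_id in expected:
        os.mkdir(os.path.join(workspace, job_id))

    listed = ids_via_listdir(workspace)
    checked = ids_via_stat(workspace, expected + ["missing"])
```

Both approaches find the same ids here, but the second issues one filesystem call per candidate, so its cost grows with the number of jobs.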

for more information, see https://pre-commit.ci

joaander · 2024-02-07T21:58:35Z

@tcmoore3 @janbridley here are the signac modifications I've been talking about. I will post benchmarks soon.

codecov · 2024-02-07T22:00:22Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5a82c4c) 85.71% compared to head (6639566) 86.09%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #975      +/-   ##
==========================================
+ Coverage   85.71%   86.09%   +0.37%     
==========================================
  Files          20       20              
  Lines        3466     3503      +37     
  Branches      760      770      +10     
==========================================
+ Hits         2971     3016      +45     
+ Misses        337      330       -7     
+ Partials      158      157       -1

…Job. This allows Job to avoid some `stat` calls which greatly improves performance on networked file systems.

These missed opportunities to pre-populate _statepoint_mapping triggered slow code paths.

joaander · 2024-02-08T15:52:12Z

Here are the results for a suite of benchmarks run on cheme-hodges (NVME) with 100,000 jobs in the workspace. For example, the access_statepoint_mapping benchmark is:

for job in project:
    job.statepoint_mapping['a']

The "no cache" column is run with no statepoint cache file on disk. The "cached" column is measured after executing signac update-cache.

main:

Benchmark	no cache (s)	cached (s)
iterate_jobs	0.365	0.379
with_job	5.6	5.65
open_job_id	0.299	0.303
open_job_statepoint	1.34	1.31
open_job_statepoint_id	0.942	0.938
access_statepoint_mapping	3.73	3.72
access_statepoint	3.55	3.62
access_job_document	8.08	8.03
access_job_stores	34.2	33.9
find_all	2.07	1.36
groupby	2.47	1.66
project_len	6.6	6.62
project_contains	0.47	0.474
find_jobs_len	24.9	30.1
find_jobs_contains	>240	>240

This pull request (7998f9a0b2084f99493312d17ee28391cffce8e4):

Benchmark	no cache (s)	cached (s)
iterate_jobs	0.158	0.16
with_job	0.533	0.541
open_job_id	0.323	0.169
open_job_statepoint	0.507	0.579
open_job_statepoint_id	0.0733	0.0762
access_statepoint_mapping	1.08	0.292
access_statepoint	3.27	3.3
access_job_document	3.01	2.79
access_job_stores	25.5	25.6
find_all	1.15	0.478
groupby	2.54	1.65
project_len	6.64	6.67
project_contains	0.497	0.398
find_jobs_len	1.06	0.391
find_jobs_contains	1.17	0.527

joaander · 2024-02-08T15:52:30Z

Here is the same on Great Lakes scratch (GPFS).

main:

Benchmark	no cache (s)	cached (s)
iterate_jobs	45.5	46.6
with_job	136.0	128.0
open_job_id	49.0	50.6
open_job_statepoint	1.69	1.66
open_job_statepoint_id	1.19	1.17
access_statepoint_mapping	117.0	114.0
access_statepoint	117.0	123.0
access_job_document	199.0	201.0
access_job_stores	>240	>240
find_all	120.0	2.17
groupby	113.0	2.24
project_len	9.37	9.36
project_contains	50.1	52.3
find_jobs_len	151.0	45.1
find_jobs_contains	>240	>240

This pull request (7998f9a0b2084f99493312d17ee28391cffce8e4):

Benchmark	no cache (s)	cached (s)
iterate_jobs	0.214	0.215
with_job	52.5	53.2
open_job_id	48.8	0.256
open_job_statepoint	0.616	0.732
open_job_statepoint_id	0.0893	0.0877
access_statepoint_mapping	116.0	0.403
access_statepoint	110.0	115.0
access_job_document	116.0	111.0
access_job_stores	>240	>240
find_all	119.0	0.764
groupby	115.0	2.23
project_len	9.42	9.41
project_contains	50.8	50.7
find_jobs_len	112.0	0.66
find_jobs_contains	106.0	0.763

joaander · 2024-02-08T15:52:40Z

Nearly all usage scenarios are significantly faster. There are some remaining cached benchmarks that take more than several seconds:

with_job cd's into the job directory. This requires O(N) chdir calls which account for nearly the entire 49 seconds in the benchmark. There is no opportunity for further optimization here.
access_statepoint and access_job_document activate the expensive synced_collections interface to the json files.
access_job_stores spends almost all time in h5store.py and presumably the hdf5 Python package.
groupby uses Job.statepoint. groupby could be updated to use statepoint_mapping to greatly improve performance, but doing so would require additional work to emulate the dotted access notation.
project_contains checks job in project for every job. This makes O(N) stat calls which account for nearly the entire 50 seconds. Caching this is challenging because signac can make few assumptions about when the job workspace directory changes. We can relax these assumptions in signac-flow and cache the listdir results while in buffered mode. One listdir is much faster than O(N) calls to stat (see also the improvement in iterate_jobs performance).

joaander · 2024-02-08T17:37:40Z

This is ready for review. I suggest waiting to merge until I complete additional testing of this branch with signac-flow.

bdice

The benchmarks look great! Nice job.

I worked quite a bit on improving signac performance prior to signac 2. At first glance, these optimizations seem fine, but it would be nice to know if there are tradeoffs in guaranteed consistency. Two important cases that are very hard to protect with increased levels of caching / reduced validation are:

parallel access from multiple Python interpreters modifying the same signac project
files being manually modified on disk by researchers who are unaware of signac's data model (we can't protect fully here, but if possible, we want to avoid data corruption / loss and give information to the user that the data model has been violated)

At some point, the filesystem itself is the layer that gives signac atomicity, in the sense of a database transaction. Cutting out the filesystem where possible is important for performance, but may come at a cost to ACID properties (atomicity, consistency, isolation, durability). If you anticipate significant impact to those properties, please share your thoughts.

bdice · 2024-02-08T18:47:58Z

Great! That's the analysis I needed. Please verify both signac-flow and signac-dashboard tests pass, if possible. Then I can approve.

janbridley

Looks good and seems to perform well! Thanks for this

joaander · 2024-02-08T19:46:52Z

Yes, I plan to test this with flow and dashboard soon.

cbkerr

Great!

Discussing naming offline with Josh

signac/project.py

signac/job.py

tests/test_job.py

Also attempt to fix sphinx-doc links.

joaander · 2024-02-10T17:25:16Z

This pull request breaks flow aggregate.groupsof (and possibly other aggregates). The jobs appear in random orders in the aggregates. I am investigating.

Python set has a randomized iteration order. Preserve the original iteration order with a list and convert to a set only for __contains__ checks.
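The fix can be sketched as a container that keeps two views of the same ids. `OrderedIds` is a hypothetical name for illustration only: the list preserves the order in which ids were found, and the frozenset serves membership tests.

```python
class OrderedIds:
    """Illustrative container: list for ordering, frozenset for lookups."""

    def __init__(self, ids):
        self._ordered = list(ids)  # preserves the original order
        self._lookup = frozenset(self._ordered)  # O(1) average membership test

    def __iter__(self):
        return iter(self._ordered)

    def __len__(self):
        return len(self._ordered)

    def __contains__(self, job_id):
        return job_id in self._lookup


ids = OrderedIds(["c3", "a1", "b2"])
iteration_order = list(ids)
```

Iterating a plain set of strings can yield a different order from one interpreter run to the next, which is consistent with the shuffled aggregates reported above.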

This has the added benefit of validating all statepoints that are added to the cache. I needed to add a validate argument to update_cache because one of the unit tests relies on adding invalid statepoints to the cache.

joaander · 2024-02-12T13:09:57Z

Since the last review, I:

Restored the job iteration order.
Added cached_statepoint validation when it is read from disk. This includes validation on update-cache to prevent invalid mappings from reaching the cache. One unit test relied on an invalid cache, so I added a validate argument to opt out of this behavior. Now, cached_statepoint will raise a JobsCorruptedError if the calc_id(statepoint) and the job's id do not match when loaded from disk. This matches the signac 2.1.0 behavior with statepoint.
Used cached_statepoint to accelerate groupby. After testing, I learned that "dotted" keys referred only to the "sp." and "doc." prefixes.
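The id check described in the list above can be sketched as follows. Two assumptions are made here: that a job id is the MD5 hex digest of the key-sorted JSON encoding of the state point, and that `CorruptedError` and `validate_cache` are hypothetical stand-ins (the PR itself raises `JobsCorruptedError`).

```python
import hashlib
import json


class CorruptedError(Exception):
    """Hypothetical stand-in for signac's JobsCorruptedError."""


def calc_id(statepoint):
    """Assumed id scheme: MD5 hex digest of the key-sorted JSON encoding."""
    blob = json.dumps(statepoint, sort_keys=True)
    return hashlib.md5(blob.encode()).hexdigest()


def validate_cache(cache):
    """Raise if any cached state point does not hash to the id it is stored under."""
    bad = [job_id for job_id, sp in cache.items() if calc_id(sp) != job_id]
    if bad:
        raise CorruptedError(bad)


good_sp = {"a": 1}
cache = {calc_id(good_sp): good_sp}
validate_cache(cache)  # a consistent cache passes silently

cache["0" * 32] = {"a": 2}  # entry whose id does not match its contents
try:
    validate_cache(cache)
    detected = []
except CorruptedError as error:
    detected = error.args[0]
```

Recomputing the id costs one hash per entry when the cache is read, in exchange for catching entries that were edited or mislabeled on disk.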

flow and dashboard work well with 1de7155 installed - behaving correctly in production runs and passing all unit tests.

Here are updated benchmarks (1de7155).
cheme-hodges (NVME):

Benchmark	no cache (s)	cached (s)
iterate_jobs	0.151	0.149
with_job	0.529	0.523
open_job_id	0.318	0.172
open_job_statepoint	0.508	0.58
open_job_statepoint_id	0.0717	0.0701
access_cached_statepoint	1.44	0.271
access_statepoint	3.15	3.21
access_job_document	2.78	2.85
access_job_stores	25.8	25.6
find_all	1.56	0.456
groupby	1.72	0.585
to_dataframe	5.45	3.72
project_len	6.57	6.58
project_contains	0.499	0.391
find_jobs_len	1.44	0.389
find_jobs_contains	1.58	0.53

Great Lakes (scratch):

Benchmark	no cache (s)	cached (s)
iterate_jobs	0.197	0.194
with_job	44.0	45.8
open_job_id	47.6	0.293
open_job_statepoint	0.603	0.732
open_job_statepoint_id	0.0918	0.0904
access_cached_statepoint	116.0	0.379
access_statepoint	113.0	116.0
access_job_document	119.0	117.0
access_job_stores	>240	>240
find_all	113.0	0.852
groupby	111.0	0.857
to_dataframe	183.0	119.0
project_len	9.58	9.38
project_contains	48.7	50.0
find_jobs_len	115.0	0.902
find_jobs_contains	109.0	0.768

bdice · 2024-02-12T16:53:29Z

signac/job.py

@@ -406,6 +414,33 @@ def update_statepoint(self, update, overwrite=False):
        statepoint.update(update)
        self.statepoint = statepoint

+    @property
+    def cached_statepoint(self):


Must we add a new public API in order to provide the performance benefits of this PR? I am unsure if we should permit users to call this, or if it should be only leveraged internally as a private property job._cached_statepoint.

As noted below, I want to conceal as much as possible about topics like caching and validation from the user API as we can.

Yes, I intentionally made this public.

Job.statepoint is writeable and carries significant overhead from synced_collections as shown in the benchmarks - reading one key from every job's statepoint takes 116 seconds even when the statepoint is in the cache.

Many workflows only need to read the statepoint, flow in particular. While flow could internally use a private _cached_statepoint for str keys, users need public access to the fast path so that their user-defined callable methods (key, select, sort_by) can complete quickly. Many users are frustrated with 10+ minute flow status updates. As shown in the benchmarks, the same loop over projects accessing cached_statepoint completes in 0.379 seconds - 306 times faster. This alone improves flow performance tremendously when using aggregates.

The alternative API I considered was to replace statepoint with the read-only statepoint and require update_statepoint to change it. I opted for a new attribute as changing statepoint semantics is a massive breaking change.

Thanks for the explanation, that is helpful.

I would be open to changing statepoint semantics to be read-only in a future major version. We had discussed this at one point as a possibility for signac 2. Let's file an issue for that proposal.

Noted in #983.

signac/project.py

Locally catch the JobsCorruptedError and ignore it in the test that needs to.

bdice

Thanks for the hard work on this @joaander.

cbkerr

Adding changes to unify docstrings around state point being two words (https://docs.signac.io/en/latest/glossary.html#term-state-point).

changelog.txt

signac/job.py

signac/project.py

cbkerr

Excited to release this!

joaander and others added 8 commits February 7, 2024 09:38

Rename statepoint_dict to statepoint_mapping.

8e6e2fc

deepcopy is unexpectedly expensive. Refactor the earlier commit to deepcopy only user-provided statepoint dicts. Statepoints from the cache are passed to the user read-only via MappingProxyType.

Restore cache miss logger level to debug.

e20194f

Users may find the messages too verbose. At the same time, users might never realize that they should run `signac update-cache` without this message...

[pre-commit.ci] auto fixes from pre-commit.com hooks

15c2d60

for more information, see https://pre-commit.ci

joaander added 2 commits February 7, 2024 20:24

Add statepoint_mapping test.

99aaf28

Pass information about the Job directories existence from Project to …

57ef8a9

…Job. This allows Job to avoid some `stat` calls which greatly improves performance on networked file systems.

joaander force-pushed the improve-performance branch from 5e7ab62 to 57ef8a9 Compare February 8, 2024 01:26

Populate _statepoint_mapping in additional code paths.

04fca6e

These missed opportunities to pre-populate _statepoint_mapping triggered slow code paths.

janbridley self-assigned this Feb 8, 2024

Increase test coverage.

4f135b6

joaander force-pushed the improve-performance branch from bc862f4 to 4f135b6 Compare February 8, 2024 17:28

joaander marked this pull request as ready for review February 8, 2024 17:36

joaander requested review from a team as code owners February 8, 2024 17:36

joaander requested review from cbkerr and jennyfothergill and removed request for a team February 8, 2024 17:36

Update change log.

c95ac0b

bdice reviewed Feb 8, 2024

janbridley approved these changes Feb 8, 2024

cbkerr reviewed Feb 9, 2024

joaander and others added 4 commits February 9, 2024 15:28

Rename statepoint_mapping to cached_statepoint.

f02beb5

Also attempt to fix sphinx-doc links.

Doc fixes.

32b8fa9

Update code comments

e641333

Use cached_statepoint in to_dataframe.

60e76a2

joaander marked this pull request as draft February 10, 2024 17:23

joaander added 2 commits February 11, 2024 19:00

Restore iteration order.

7578993

Python set has a randomized iteration order. Preserve the original iteration order with a list and convert to a set only for __contains__ checks.

Validate cached_statepoint when read from disk.

4ebf74f

This has the added benefit of validating all statepoints that are added to the cache. I needed to add a validate argument to update_cache because one of the unit tests relies on adding invalid statepoints to the cache.

joaander force-pushed the improve-performance branch from 6b30785 to 4ebf74f Compare February 12, 2024 01:41

joaander marked this pull request as ready for review February 12, 2024 01:44

Use cached_statepoint in groupby.

1de7155

joaander requested review from bdice and cbkerr February 12, 2024 13:10

cbkerr mentioned this pull request Feb 12, 2024

Update contributors #982

Merged

6 tasks

bdice reviewed Feb 12, 2024

Remove validate argument from update_cache.

d48d281

Locally catch the JobsCorruptedError and ignore it in the test that needs to.

bdice approved these changes Feb 13, 2024

joaander mentioned this pull request Feb 13, 2024

Make Job.statepoint read only. #983

Open

cbkerr reviewed Feb 13, 2024

cbkerr added 2 commits February 13, 2024 10:10

Write state point as two words in doc strings

9eb658f

Merge branch 'main' into improve-performance

6639566

cbkerr approved these changes Feb 13, 2024

cbkerr merged commit 2d6db63 into main Feb 13, 2024
17 checks passed

cbkerr deleted the improve-performance branch February 13, 2024 16:28

cbkerr added this to the v2.2.0 milestone Feb 13, 2024
