Reduce memory and CPU use when scanning #222

Merged 5 commits into main from memory-slimming on Sep 27, 2024

Conversation

@bradlarsen (Collaborator) commented on Sep 25, 2024

Rework git metadata calculation to reduce peak memory use and wall clock time when scanning. This includes many changes; the net effect is a typical 30% speedup and 50% memory reduction when scanning Git repositories, and in pathological cases up to a 5x speedup and 20x memory reduction.

  • Git metadata graph:

    • Do not construct an in-memory graph of all trees and blob names; instead, read tree objects from the repo as late as possible
    • Use SmallVec to reduce heap fragmentation and small heap allocations (see the SmallVec sketch below)
    • Use more suitable initial size for worklists and scratch buffers to reduce reallocations and heap fragmentation
    • Use the fastest / slimmest order for iterating object headers from a git repository when initially counting objects
    • Eliminate redundant intermediate data structures; remove unused fields from remaining intermediate data structures
    • Avoid temporary allocations when concatenating blob filenames
    • Fix a longstanding bug where a blob introduced multiple times within a single commit would have only a single arbitrary pathname reported
  • BStringTable:

    • Change default initialization to create an empty table
    • Change get_or_intern to avoid heap-allocated temporaries when an entry already exists (see the interning sketch below)
  • Scanning:

    • Use an Arc<CommitMetadata> instead of CommitMetadata and Arc<PathBuf> instead of PathBuf within git blob provenance entries (allows sharing and sometimes reduces memory use of these object types 10,000x; see the sharing sketch below)
    • Use a higher default level of parallelism: require 3 GiB of RAM per parallel job instead of 4 GiB (see the parallelism sketch below)
    • Open Git repositories a single time instead of twice
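
The SmallVec point is about keeping short sequences inline instead of behind a separate heap allocation. A tiny sketch using the smallvec crate; the element type and inline capacity here are arbitrary choices, not the values used in this PR:

```rust
use smallvec::SmallVec;

fn main() {
    // Up to 4 elements are stored inline inside the SmallVec itself;
    // only longer sequences spill to a heap allocation.
    let mut children: SmallVec<[u32; 4]> = SmallVec::new();
    children.extend([1, 2, 3]);
    assert!(!children.spilled()); // still inline: no heap allocation yet

    children.extend([4, 5]);
    assert!(children.spilled()); // exceeded the inline capacity, now on the heap
}
```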
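
A minimal sketch of the interning change, using plain byte slices instead of the crate's BString type to stay dependency-free; the struct layout and method body are assumptions, not the real BStringTable internals:

```rust
use std::collections::HashMap;

/// Illustrative interner mapping byte strings to integer symbols.
/// Deriving `Default` yields an empty table, matching the new default initialization.
#[derive(Default)]
pub struct BStringTable {
    map: HashMap<Vec<u8>, u32>,
}

impl BStringTable {
    /// Return the symbol for `s`, interning it only if it is not already present.
    pub fn get_or_intern(&mut self, s: &[u8]) -> u32 {
        // Borrowed-key lookup: nothing is allocated when the entry already
        // exists, which is the common case while scanning.
        if let Some(&sym) = self.map.get(s) {
            return sym;
        }
        let sym = self.map.len() as u32;
        self.map.insert(s.to_vec(), sym); // allocate an owned copy only on a miss
        sym
    }
}
```

The point is that the fast path looks up the caller's borrowed bytes and returns the existing symbol; an owned copy is built only when a new entry is actually inserted.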
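
A rough illustration of the Arc sharing described under Scanning; the field names here are hypothetical and only show the pattern, not the real provenance types:

```rust
use std::path::PathBuf;
use std::sync::Arc;

/// Hypothetical shapes chosen only to show the sharing pattern.
#[allow(dead_code)]
struct CommitMetadata {
    commit_id: String,
    committer: String,
}

#[allow(dead_code)]
struct GitBlobProvenance {
    // Shared, reference-counted metadata: cloning an entry bumps a refcount
    // instead of duplicating the commit metadata and repository path.
    commit: Arc<CommitMetadata>,
    repo_path: Arc<PathBuf>,
    blob_path: String,
}

fn main() {
    let commit = Arc::new(CommitMetadata {
        commit_id: "0123abcd".to_string(),
        committer: "A. Committer <a@example.com>".to_string(),
    });
    let repo_path = Arc::new(PathBuf::from("/tmp/example-repo"));

    // Many blobs introduced by the same commit all point at the same two
    // allocations instead of each carrying its own copies.
    let entries: Vec<GitBlobProvenance> = (0..10_000)
        .map(|i| GitBlobProvenance {
            commit: Arc::clone(&commit),
            repo_path: Arc::clone(&repo_path),
            blob_path: format!("src/file_{i}.rs"),
        })
        .collect();

    assert_eq!(Arc::strong_count(&commit), entries.len() + 1);
}
```

When a single commit introduces a very large number of blobs, every provenance entry shares one CommitMetadata and one PathBuf allocation, which is where the extreme (up to 10,000x) per-type memory reductions come from.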

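For the parallelism default, the 3 GiB-per-job figure comes from the list above; the helper below is only a sketch of how such a default could be derived from CPU count and available memory (the function name and rounding behavior are assumptions, not the actual CLI logic):

```rust
/// Assumed memory budget per parallel scanning job (this PR lowers it from 4 GiB).
const RAM_PER_JOB_BYTES: u64 = 3 * 1024 * 1024 * 1024;

/// Pick a default job count: one job per 3 GiB of available RAM,
/// capped at the number of logical CPUs, and never less than one.
fn default_parallelism(logical_cpus: usize, available_ram_bytes: u64) -> usize {
    let by_ram = (available_ram_bytes / RAM_PER_JOB_BYTES) as usize;
    logical_cpus.min(by_ram).max(1)
}

fn main() {
    // With 16 logical CPUs and 24 GiB of available RAM, a 3 GiB budget
    // allows 8 parallel jobs; the old 4 GiB budget would have allowed 6.
    assert_eq!(default_parallelism(16, 24 * 1024 * 1024 * 1024), 8);
}
```
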
@bradlarsen added the performance (related to runtime performance) and content discovery (related to enumerating or specifying content to scan) labels on Sep 25, 2024
@bradlarsen marked this pull request as ready for review on September 26, 2024 at 21:23
@bradlarsen merged commit cd6c187 into main on Sep 27, 2024
11 checks passed
@bradlarsen deleted the memory-slimming branch on September 27, 2024 at 14:46