Reduce memory and CPU use when scanning #222

Merged 5 commits into main from memory-slimming on Sep 27, 2024

Conversation

@bradlarsen (Collaborator) commented on Sep 25, 2024

Rework git metadata calculation to reduce peak memory use and wall clock time when scanning. This includes many changes; the net effect is a typical 30% speedup and 50% memory reduction when scanning Git repositories, and in pathological cases up to a 5x speedup and 20x memory reduction.

  • Git metadata graph:

    • Do not construct an in-memory graph of all trees and blob names; instead, read tree objects from the repo as late as possible
    • Use SmallVec to reduce heap fragmentation and small heap allocations (see the SmallVec sketch below)
    • Use more suitable initial size for worklists and scratch buffers to reduce reallocations and heap fragmentation
    • Use the fastest / slimmest order for iterating object headers from a git repository when initially counting objects
    • Eliminate redundant intermediate data structures; remove unused fields from remaining intermediate data structures
    • Avoid temporary allocations when concatenating blob filenames
    • Fix a longstanding bug where a blob introduced multiple times within a single commit would have only a single arbitrary pathname reported
  • BStringTable:

    • Change default initialization to create an empty table
    • Change get_or_intern to avoid heap-allocated temporaries when an entry already exists (see the interning sketch below)
  • Scanning:

    • Use an Arc<CommitMetadata> instead of CommitMetadata and Arc<PathBuf> instead of PathBuf within git blob provenance entries (allows sharing and sometimes reduces memory use of these object types 10,000x; see the sharing sketch below)
    • Use a higher default level of parallelism: require 3 GiB of RAM per parallel job instead of 4 GiB (see the parallelism sketch below)
    • Open Git repositories a single time instead of twice
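
The SmallVec point is about keeping short sequences inline instead of behind a separate heap allocation. A tiny sketch using the smallvec crate; the element type and inline capacity here are arbitrary choices, not the values used in this PR:

```rust
use smallvec::SmallVec;

fn main() {
    // Up to 4 elements are stored inline inside the SmallVec itself;
    // only longer sequences spill to a heap allocation.
    let mut children: SmallVec<[u32; 4]> = SmallVec::new();
    children.extend([1, 2, 3]);
    assert!(!children.spilled()); // still inline: no heap allocation yet

    children.extend([4, 5]);
    assert!(children.spilled()); // exceeded the inline capacity, now on the heap
}
```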
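
A minimal sketch of the interning change, using plain byte slices instead of the crate's BString type to stay dependency-free; the struct layout and method body are assumptions, not the real BStringTable internals:

```rust
use std::collections::HashMap;

/// Illustrative interner mapping byte strings to integer symbols.
/// Deriving `Default` yields an empty table, matching the new default initialization.
#[derive(Default)]
pub struct BStringTable {
    map: HashMap<Vec<u8>, u32>,
}

impl BStringTable {
    /// Return the symbol for `s`, interning it only if it is not already present.
    pub fn get_or_intern(&mut self, s: &[u8]) -> u32 {
        // Borrowed-key lookup: nothing is allocated when the entry already
        // exists, which is the common case while scanning.
        if let Some(&sym) = self.map.get(s) {
            return sym;
        }
        let sym = self.map.len() as u32;
        self.map.insert(s.to_vec(), sym); // allocate an owned copy only on a miss
        sym
    }
}
```

The point is that the fast path looks up the caller's borrowed bytes and returns the existing symbol; an owned copy is built only when a new entry is actually inserted.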
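
A rough illustration of the Arc sharing described under Scanning; the field names here are hypothetical and only show the pattern, not the real provenance types:

```rust
use std::path::PathBuf;
use std::sync::Arc;

/// Hypothetical shapes chosen only to show the sharing pattern.
#[allow(dead_code)]
struct CommitMetadata {
    commit_id: String,
    committer: String,
}

#[allow(dead_code)]
struct GitBlobProvenance {
    // Shared, reference-counted metadata: cloning an entry bumps a refcount
    // instead of duplicating the commit metadata and repository path.
    commit: Arc<CommitMetadata>,
    repo_path: Arc<PathBuf>,
    blob_path: String,
}

fn main() {
    let commit = Arc::new(CommitMetadata {
        commit_id: "0123abcd".to_string(),
        committer: "A. Committer <a@example.com>".to_string(),
    });
    let repo_path = Arc::new(PathBuf::from("/tmp/example-repo"));

    // Many blobs introduced by the same commit all point at the same two
    // allocations instead of each carrying its own copies.
    let entries: Vec<GitBlobProvenance> = (0..10_000)
        .map(|i| GitBlobProvenance {
            commit: Arc::clone(&commit),
            repo_path: Arc::clone(&repo_path),
            blob_path: format!("src/file_{i}.rs"),
        })
        .collect();

    assert_eq!(Arc::strong_count(&commit), entries.len() + 1);
}
```

When a single commit introduces a very large number of blobs, every provenance entry shares one CommitMetadata and one PathBuf allocation, which is where the extreme (up to 10,000x) per-type memory reductions come from.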

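For the parallelism default, the 3 GiB-per-job figure comes from the list above; the helper below is only a sketch of how such a default could be derived from CPU count and available memory (the function name and rounding behavior are assumptions, not the actual CLI logic):

```rust
/// Assumed memory budget per parallel scanning job (this PR lowers it from 4 GiB).
const RAM_PER_JOB_BYTES: u64 = 3 * 1024 * 1024 * 1024;

/// Pick a default job count: one job per 3 GiB of available RAM,
/// capped at the number of logical CPUs, and never less than one.
fn default_parallelism(logical_cpus: usize, available_ram_bytes: u64) -> usize {
    let by_ram = (available_ram_bytes / RAM_PER_JOB_BYTES) as usize;
    logical_cpus.min(by_ram).max(1)
}

fn main() {
    // With 16 logical CPUs and 24 GiB of available RAM, a 3 GiB budget
    // allows 8 parallel jobs; the old 4 GiB budget would have allowed 6.
    assert_eq!(default_parallelism(16, 24 * 1024 * 1024 * 1024), 8);
}
```
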
@bradlarsen added the performance (related to runtime performance) and content discovery (related to enumerating or specifying content to scan) labels on Sep 25, 2024
@bradlarsen marked this pull request as ready for review on September 26, 2024 at 21:23
@bradlarsen merged commit cd6c187 into main on Sep 27, 2024
11 checks passed
@bradlarsen deleted the memory-slimming branch on September 27, 2024 at 14:46