Re-use buffers to optimise memory allocation in fingerprint #36736

Merged
merged 2 commits into from
Oct 4, 2023

Conversation

rdner
Member

@rdner rdner commented Oct 4, 2023

This dramatically reduces memory usage, particularly when scanning a large number of files.

Benchmark results

Before
BenchmarkToFileDescriptor-10   764442     15849 ns/op   2688 B/op    12 allocs/op

After
BenchmarkToFileDescriptor-10   758116     15171 ns/op   416 B/op      8 allocs/op

CPU Profiles

Before

cpu-before

After

cpu-after

Memory Profiles

Before

mem-before

After

mem-after

Checklist

  • My code follows the style guidelines of this project
    - [ ] I have commented my code, particularly in hard-to-understand areas
    - [ ] I have made corresponding changes to the documentation
    - [ ] I have made corresponding change to the default configuration files
    - [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

go test -run=none -bench=".*ToFileDescriptor.*" -benchmem -benchtime=10s -memprofile profile.bin
go tool pprof -http localhost:9999 profile.bin

Related issues

@rdner rdner added Filebeat Filebeat Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team backport-v8.10.0 Automated backport with mergify labels Oct 4, 2023
@rdner rdner self-assigned this Oct 4, 2023
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Oct 4, 2023
@rdner rdner force-pushed the optimise-fingerprint-memory branch from 6acd3c1 to 1562bfc Compare October 4, 2023 09:26
@rdner rdner changed the title Re-use the SHA256 block to optimise memory allocation in fingerprint Re-use buffers to optimise memory allocation in fingerprint Oct 4, 2023
@elasticmachine
Collaborator

elasticmachine commented Oct 4, 2023

💚 Build Succeeded

Build stats

  • Duration: 71 min 32 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@rdner rdner added the backport-7.17 Automated backport to the 7.17 branch with mergify label Oct 4, 2023
@rdner rdner marked this pull request as ready for review October 4, 2023 09:37
@rdner rdner requested a review from a team as a code owner October 4, 2023 09:37
@rdner rdner merged commit 429b38f into elastic:main Oct 4, 2023
26 checks passed
@rdner rdner deleted the optimise-fingerprint-memory branch October 4, 2023 11:40
mergify bot pushed a commit that referenced this pull request Oct 4, 2023
This dramatically drops the memory usage, particularly on large amount of files.

(cherry picked from commit 429b38f)
mergify bot pushed a commit that referenced this pull request Oct 4, 2023
This dramatically drops the memory usage, particularly on large amount of files.

(cherry picked from commit 429b38f)
rdner added a commit that referenced this pull request Oct 4, 2023
…in fingerprint (#36738)

* Re-use buffers to optimise memory allocation in fingerprint (#36736)

This dramatically drops the memory usage, particularly on large amount of files.

(cherry picked from commit 429b38f)

* Fix changelog

---------

Co-authored-by: Denis <denis.rechkunov@elastic.co>
rdner added a commit that referenced this pull request Oct 4, 2023
…in fingerprint (#36739)

* Re-use buffers to optimise memory allocation in fingerprint (#36736)

This dramatically drops the memory usage, particularly on large amount of files.

(cherry picked from commit 429b38f)

* Fix changelog

---------

Co-authored-by: Denis <denis.rechkunov@elastic.co>
@rodrigc

rodrigc commented Oct 11, 2023

@rdner can you give me a rough idea as to the order of magnitude improvement in memory usage of this patch?
In the images you posted, I see before: 1.97GB, and after 301MB.

Were similar data inputs used? An improvement of 5-6 times is a huge improvement.

Will this PR improve some of the issues described here:

@rdner
Member Author

rdner commented Oct 11, 2023

@rodrigc I think it's more accurate to compare the per-operation numbers, since the Go benchmark here adjusts its iteration count to fill the 10-second run, so the totals in the profiles depend on how many iterations were executed.

So it's 416 B/op against 2688 B/op, which is roughly 6.5 times less memory per operation (2688 / 416 ≈ 6.46).

The issue you linked (the same one twice?) does not use the filestream fingerprint mode. This optimisation affects only the filestream input, and only when the new fingerprint file identity is used.

- written, err := io.Copy(h, r)
+ s.hasher.Reset()
+ lr := io.LimitReader(file, s.cfg.Fingerprint.Length)
+ written, err := io.CopyBuffer(s.hasher, lr, s.readBuffer)
Member

Do you need to reset the length of s.readBuffer before calling CopyBuffer? If s.cfg.Fingerprint.Length were to be made smaller there would still be data left in s.readBuffer from the previous read that is never cleared.

The CopyBuffer implementation does not clear the buffer before it copies https://cs.opensource.google/go/go/+/refs/tags/go1.21.3:src/io/io.go;l=399

Member Author

Do you need to reset the length of s.readBuffer before calling CopyBuffer?

No, because the buffer is created per file watcher (per prospector, eventually per filestream input).

prospector, err := newProspector(config)

func newProspector(config config) (loginp.Prospector, error) {
	err := checkConfigCompatibility(config.FileWatcher, config.FileIdentity)
	if err != nil {
		return nil, err
	}
	filewatcher, err := newFileWatcher(config.Paths, config.FileWatcher)

If the input configuration changes (e.g. fingerprint size), the file watcher gets re-created with a new buffer size.

The CopyBuffer implementation does not clear the buffer before it copies

That's true, but it does not matter, since this is just a scratch buffer: each Read returns the number of bytes it wrote into the buffer, and only that many bytes are passed to Write on the destination Writer https://cs.opensource.google/go/go/+/refs/tags/go1.21.3:src/io/io.go;l=432

I have tests, unchanged in this PR, that would fail if stale data from a previous read were re-used or if the buffer were corrupted in any other way:

{
	name: "returns all files except too small to fingerprint",
	cfgStr: `
scanner:
  symlinks: true
  recursive_glob: true
  fingerprint:
    enabled: true
    offset: 0
    length: 1024
`,
	expDesc: map[string]loginp.FileDescriptor{
		normalFilename: {
			Filename:    normalFilename,
			Fingerprint: "2edc986847e209b4016e141a6dc8716d3207350f416969382d431539bf292e4a",
			Info: testFileInfo{
				size: sizes[normalFilename],
				name: normalBasename,
			},
		},
		excludedFilename: {
			Filename:    excludedFilename,
			Fingerprint: "bd151321c3bbdb44185414a1b56b5649a00206dd4792e7230db8904e43987336",
			Info: testFileInfo{
				size: sizes[excludedFilename],
				name: excludedBasename,
			},
		},
		excludedIncludedFilename: {
			Filename:    excludedIncludedFilename,
			Fingerprint: "bfdb99a65297062658c26dfcea816d76065df2a2da2594bfd9b96e9e405da1c2",
			Info: testFileInfo{
				size: sizes[excludedIncludedFilename],
				name: excludedIncludedBasename,
			},
		},
		travelerSymlinkFilename: {
			Filename:    travelerSymlinkFilename,
			Fingerprint: "c4058942bffcea08810a072d5966dfa5c06eb79b902bf0011890dd8d22e1a5f8",
			Info: testFileInfo{
				size: sizes[travelerFilename],
				name: travelerSymlinkBasename,
			},
		},
	},
},

Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
…36736)

This dramatically drops the memory usage, particularly on large amount of files.
Labels
backport-7.17 Automated backport to the 7.17 branch with mergify backport-v8.10.0 Automated backport with mergify Filebeat Filebeat Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
5 participants