[CWS] add cleanup loop, removing persisted dumps that are no longer needed #29333

paulcacheux · 2024-09-13T11:05:39Z

What does this PR do?

This PR adds a cleanup goroutine that removes unneeded activity dumps from the directory managed by the directory provider.

The basic rule is: if a real profile (i.e. tag == "") exists for a given image name, a dump (tag != "") for that image name is not needed (and we already have logic to skip loading it).

The main goal is to drastically cut the number of calls to LoadProfile, which in turn clearly reduces the number of allocations (since far fewer profiles are loaded).

Motivation

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

pr-commenter · 2024-09-13T11:12:47Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 1ef06f16-0e98-4260-8ab7-457ed134b8f3

Baseline: d0a1d25
Comparison: 9fff824
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	+1.19	[+0.46, +1.92]	1	Logs
➖	basic_py_check	% cpu utilization	+0.97	[-2.85, +4.79]	1	Logs
➖	otel_to_otel_logs	ingress throughput	+0.65	[-0.06, +1.35]	1	Logs
➖	quality_gate_idle	memory utilization	+0.21	[+0.16, +0.25]	1	Logs bounds checks dashboard
➖	file_to_blackhole_300ms_latency	egress throughput	+0.11	[-0.52, +0.75]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	+0.09	[-0.59, +0.76]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	+0.00	[-0.01, +0.01]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.02	[-0.12, +0.08]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	-0.05	[-0.82, +0.71]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	-0.08	[-0.94, +0.78]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	-0.13	[-0.24, -0.02]	1	Logs bounds checks dashboard
➖	tcp_syslog_to_blackhole	ingress throughput	-0.23	[-0.30, -0.16]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	-0.31	[-1.09, +0.47]	1	Logs
➖	file_to_blackhole_1000ms_latency_linear_load	egress throughput	-0.36	[-0.81, +0.10]	1	Logs
➖	file_tree	memory utilization	-0.43	[-0.57, -0.29]	1	Logs
➖	pycheck_lots_of_tags	% cpu utilization	-3.48	[-6.81, -0.15]	1	Logs

Bounds Checks: ❌ Failed

perf	experiment	bounds_check_name	replicates_passed	links
❌	file_to_blackhole_1000ms_latency	lost_bytes	0/10
✅	file_to_blackhole_0ms_latency	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency_linear_load	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	lost_bytes	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_300ms_latency	lost_bytes	10/10
✅	file_to_blackhole_300ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	lost_bytes	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	quality_gate_idle	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
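The three criteria above can be written as a small predicate. This is only a sketch of the decision rule as stated in this report, not the Regression Detector's implementation; the `experiment` type and `meetsRegressionCriteria` name are hypothetical, and the `erratic` flag is assumed to be false for the example row because the table does not show it.

```go
package main

import (
	"fmt"
	"math"
)

// experiment is a hypothetical record holding one row of the results table.
type experiment struct {
	name          string
	deltaMeanPct  float64 // "Δ mean %"
	ciLow, ciHigh float64 // bounds of "Δ mean % CI"
	erratic       bool    // assumption: taken from the experiment configuration
}

// meetsRegressionCriteria applies the three criteria listed in the report.
func meetsRegressionCriteria(e experiment, tolerancePct float64) bool {
	bigEnough := math.Abs(e.deltaMeanPct) >= tolerancePct
	excludesZero := e.ciLow > 0 || e.ciHigh < 0
	return bigEnough && excludesZero && !e.erratic
}

func main() {
	// Row taken from the table above: the interval excludes zero,
	// but |Δ mean %| is below the 5.00% tolerance.
	e := experiment{name: "pycheck_lots_of_tags", deltaMeanPct: -3.48, ciLow: -6.81, ciHigh: -0.15}
	fmt.Println(e.name, meetsRegressionCriteria(e, 5.00)) // prints: pycheck_lots_of_tags false
}
```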

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.

agent-platform-auto-pr · 2024-09-16T12:08:56Z

[Fast Unit Tests Report]

On pipeline 49247697 (CI Visibility). The following jobs did not run any unit tests:

Jobs:

tests_windows-x64

If you modified Go files and expected unit tests to run in these jobs, please double check the job logs. If you think tests should have been executed, reach out to #agent-devx-help.

pr-commenter · 2024-09-16T12:17:07Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=49247697 --os-family=ubuntu

Note: This applies to commit 9fff824

pkg/security/security_profile/profile/profile.go

pkg/security/security_profile/profile/profile_dir.go

spikat · 2024-09-18T16:29:46Z

pkg/security/security_profile/profile/profile_dir.go

+	// read workload selectors from all directory profiles
+	workloadSelectors := make([]wsAndPath, 0)
+	for _, path := range paths {
+		_, workloadSelector, err := readProfile(path)
+		if err != nil {
+			return err
+		}
+
+		workloadSelectors = append(workloadSelectors, wsAndPath{
+			selector: workloadSelector,
+			path:     path,
+		})
+	}


To me, this suggests the cached profileMapping list is not very useful. I wonder if we could change the cached list to store the "real" workload selector (instead of a somewhat "fake" profile one) and use it to clean up dumps more easily, instead of reconstructing the whole view by re-opening every dump/profile every 5 minutes. WDYT?

paulcacheux · Sep 20, 2024

Yes, good point. I updated the PR to clean up profileMapping a bit and used it in the cleanup loop (instead of re-reading from disk).

github-actions bot added component/system-probe team/agent-security labels Sep 13, 2024

paulcacheux force-pushed the paulcacheux/move-dirprov-fixes branch from 7fde8c4 to 8789d00 Compare September 16, 2024 11:47

paulcacheux force-pushed the paulcacheux/move-dirprov-fixes branch 2 times, most recently from 4b74b33 to 89d603d Compare September 18, 2024 13:56

paulcacheux added changelog/no-changelog qa/done QA done before merge and regressions are covered by tests labels Sep 18, 2024

paulcacheux changed the title ~~[CWS][WIP] directory provider debug~~ [CWS] add cleanup loop, removing persisted dumps that are no longer needed Sep 18, 2024

spikat reviewed Sep 18, 2024

paulcacheux force-pushed the paulcacheux/move-dirprov-fixes branch 3 times, most recently from a21f6e5 to e1df669 Compare September 20, 2024 16:19

paulcacheux force-pushed the paulcacheux/move-dirprov-fixes branch 2 times, most recently from b8f255f to c5894ee Compare October 13, 2024 16:09

paulcacheux added 4 commits November 18, 2024 12:11

add debug log

106776a

cleanup loop

ddafdb3

simplification of profileMapping

e0f2e50

make use of profile mapping instead of reading from disk

9fff824

paulcacheux force-pushed the paulcacheux/move-dirprov-fixes branch from c5894ee to 9fff824 Compare November 18, 2024 11:11

github-actions bot added the medium review PR review might take time label Nov 18, 2024

paulcacheux closed this Nov 28, 2024
