
Reduce locks during decision logging #6797

Merged

Conversation

mjungsbluth
Contributor

Why are the changes in this PR needed?

Decision logging is heavy on mutexes, which increases the latency of decisions (see #5724) and also limits parallelism. This PR removes two of the three mutexes and replaces them with Go standard-library primitives suited to those use cases.

What are the changes in this PR?

Mutexes are replaced with sync.Once, which has an optimised fast path when the prepared queries are already set and only takes a lock on the first invocation.

Signed-off-by: Magnus Jungsbluth <magnus.jungsbluth@zalando.de>
@mjungsbluth mjungsbluth force-pushed the remove_locks_decision_logs branch from eaa0c8a to 3d6d2be Compare June 6, 2024 14:17
@ashutosh-narkar
Member

@mjungsbluth thanks for the contribution. If you can share any numbers that help validate this change that would be helpful. It would be interesting to see the performance improvements this change gives us. Also FWIW @johanfylling is planning to take a closer look at #5724 later this month.

Contributor

@philipaconrad philipaconrad left a comment


Overall, this looks like a good change to me! As described by @mjungsbluth, the main place we were using mutexes here was for updates to the mask/drop policies, which makes them perfect candidates for sync.Once to guarantee safe one-time initialization. 😄

Also, because my memory was fuzzy, I double-checked how sync.Once guarantees one-time execution: it uses an atomic load of its done flag on the fast path (checking whether the sync.Once has already run), and an underlying mutex on the slower path (which ensures only the first invocation of (Once).Do() runs the function).


I think the only thing missing here would be some kind of benchmark, or instructions/materials for running a head-to-head test of decision logging performance. One option might be to get a Mutex profile from pprof, since that should directly address the core improvement of the PR. 🤔

EDIT: Let me see if I can get a quick mutex profile...

@tsandall tsandall requested a review from johanfylling June 10, 2024 16:16
@philipaconrad
Contributor

philipaconrad commented Jun 10, 2024

Okay! So I found an existing masking benchmark that seems to be illustrative of the improvements here: BenchmarkMaskingErase.

Here's how I ran my benchmarks (30 runs each):

```shell
git fetch origin pull/6797/head:remove_locks_decision_logs

git checkout main
go test -mutexprofile mtx.prof -run=^$ -tags opa_wasm,slow,e2e,noisy -bench ^BenchmarkMaskingErase$ github.com/open-policy-agent/opa/plugins/logs -count=30 > main.txt

git checkout remove_locks_decision_logs
go test -mutexprofile mtx.prof -run=^$ -tags opa_wasm,slow,e2e,noisy -bench ^BenchmarkMaskingErase$ github.com/open-policy-agent/opa/plugins/logs -count=30 > pr.txt

benchstat main.txt pr.txt
```

And here's the benchstat results from my Core i7-8650U Thinkpad:

```
name            old time/op  new time/op  delta
MaskingErase-8  16.4µs ± 5%  15.9µs ± 4%  -3.38%  (p=0.000 n=29+28)
```

So, I think that's a small, but solid bit of evidence in favor of this PR improving performance! 😄

Benchmark results files from my run:


The above benchmark illustrates the PR's effects far better than a mutex profile against the OPA server as a whole: I found that, at least for console logging, the Logrus mutexes drown out any effect from this change! 😓

Example mutex profile with Logrus overwhelming everything else

This profile was collected by issuing 10,000 queries against a few decisions that trigger masking/drop policies for a fairly simple policy. The large number of seconds shown is aggregated across 8 threads, so take it with a grain of salt. 😅

[image: mutex profile (mutex-main-1)]

philipaconrad
philipaconrad previously approved these changes Jun 10, 2024
…nt calls

Fixing nil pointer dereference panic.

Signed-off-by: Johan Fylling <johan.dev@fylling.se>

netlify bot commented Jun 11, 2024

Deploy Preview for openpolicyagent ready!

Name Link
🔨 Latest commit e716234
🔍 Latest deploy log https://app.netlify.com/sites/openpolicyagent/deploys/66699b757b7ae90008c279f2
😎 Deploy Preview https://deploy-preview-6797--openpolicyagent.netlify.app

Contributor

@johanfylling johanfylling left a comment


Thank you for your contribution! 😃
There’s just one concern to address before this can be merged.

```diff
@@ -975,42 +988,33 @@ func (p *Plugin) bufferChunk(buffer *logBuffer, bs []byte) {
 }
 
 func (p *Plugin) maskEvent(ctx context.Context, txn storage.Transaction, input ast.Value, event *EventV1) error {
```

A benefit of the old mutex-based approach was that on error, the PrepareForEval() call would be retried on subsequent mask/drop calls. Now, if the first (and only) call to PrepareForEval() fails for the current configuration, we'll end up with a broken PreparedEvalQuery that'll cause a panic when used.

I've made a commit to your branch with a proposed fix and tests asserting the behavior. The fix simply re-emits the error for subsequent calls. I think it's unlikely that subsequent PrepareForEval() calls would have a different outcome, so we don't need to retain the old behavior where we retry.

Thoughts?
This change shouldn't have an impact on performance, but if you're running separate tests on your end, please let us know how this fares.

srenatus
srenatus previously approved these changes Jun 11, 2024
Contributor

@srenatus srenatus left a comment


@johanfylling reviewed just your commit. Looks good to me, one naming nitpick. But it's no blocker. 👍

plugins/logs/plugin.go (outdated review thread, resolved)
for clarity at call-site.

Signed-off-by: Johan Fylling <johan.dev@fylling.se>
@mjungsbluth
Contributor Author

One small observation: before the change, an error would set the query to rego.PreparedEvalQuery{}; now it is nil. I think this is generally nicer and more idiomatic Go, but it is a behavioral change. Was it intentional?

And thanks for ironing out the kinks :)

@johanfylling
Contributor

@mjungsbluth, I think we can chalk that one up as unconsciously intentional 😄. Since prepareOnce has an initial state in which neither preparedQuery nor err has been assigned a value, it makes sense to internally use field types that can be nil to reflect being unassigned.

@johanfylling johanfylling merged commit b463d30 into open-policy-agent:main Jun 12, 2024
28 checks passed