
Reduce locks during decision logging #6797

Merged

Conversation

mjungsbluth
Contributor

Why are the changes in this PR needed?

Decision logging is heavy on mutexes, which increases the latency of decisions (see #5724) and also limits parallelism. This PR removes two of the three mutexes and replaces them with Go standard-library primitives suited to those use cases.

What are the changes in this PR?

Mutexes are replaced with sync.Once, which has an optimised fast path when the prepared queries are already set and only takes a lock on the first invocation.

Signed-off-by: Magnus Jungsbluth <magnus.jungsbluth@zalando.de>
@mjungsbluth mjungsbluth force-pushed the remove_locks_decision_logs branch from eaa0c8a to 3d6d2be Compare June 6, 2024 14:17
@ashutosh-narkar
Member

@mjungsbluth thanks for the contribution. If you can share any numbers that help validate this change that would be helpful. It would be interesting to see the performance improvements this change gives us. Also FWIW @johanfylling is planning to take a closer look at #5724 later this month.

Contributor

@philipaconrad philipaconrad left a comment


Overall, this looks like a good change to me! As described by @mjungsbluth, the main place we were using mutexes here was for updates to the mask/drop policies, which makes them perfect candidates for sync.Once to guarantee safe one-time initialization. 😄

Also, because my memory was fuzzy, I double-checked how sync.Once guarantees one-time execution: it uses an atomic load of its done flag on the fast path (checking whether the sync.Once has already run), and an underlying mutex on the slower path (which ensures only the first invocation of (Once).Do() runs the function).


I think the only thing missing here would be some kind of benchmark, or instructions/materials for running a head-to-head test of decision logging performance. One option might be to get a Mutex profile from pprof, since that should directly address the core improvement of the PR. 🤔

EDIT: Let me see if I can get a quick mutex profile...

@tsandall tsandall requested a review from johanfylling June 10, 2024 16:16
@philipaconrad
Contributor

philipaconrad commented Jun 10, 2024

Okay! So I found an existing masking benchmark that seems to be illustrative of the improvements here: BenchmarkMaskingErase.

Here's how I ran my benchmarks (30 runs each):

```shell
git fetch origin pull/6797/head:remove_locks_decision_logs

git checkout main
go test -mutexprofile mtx.prof -run=^$ -tags opa_wasm,slow,e2e,noisy -bench ^BenchmarkMaskingErase$ github.com/open-policy-agent/opa/plugins/logs -count=30 > main.txt

git checkout remove_locks_decision_logs
go test -mutexprofile mtx.prof -run=^$ -tags opa_wasm,slow,e2e,noisy -bench ^BenchmarkMaskingErase$ github.com/open-policy-agent/opa/plugins/logs -count=30 > pr.txt

benchstat main.txt pr.txt
```

And here's the benchstat results from my Core i7-8650U Thinkpad:

```
name            old time/op  new time/op  delta
MaskingErase-8  16.4µs ± 5%  15.9µs ± 4%  -3.38%  (p=0.000 n=29+28)
```

So, I think that's a small, but solid bit of evidence in favor of this PR improving performance! 😄

Benchmark results files from my run:


The above benchmark illustrates the PR's effects far better than a mutex profile against the OPA server as a whole: I found that, at least for console logging, the Logrus mutexes drown out any effect from this change! 😓

Example mutex profile with Logrus overwhelming everything else

This profile was collected by issuing 10,000 queries against a few decisions that trigger masking/drop policies for a fairly simple policy. The large number of seconds shown is aggregated across 8 threads, so take it with a grain of salt. 😅

[image: mutex profile (mutex-main-1)]

philipaconrad
philipaconrad previously approved these changes Jun 10, 2024
…nt calls

Fixing nil pointer dereference panic.

Signed-off-by: Johan Fylling <johan.dev@fylling.se>

netlify bot commented Jun 11, 2024

Deploy Preview for openpolicyagent ready!

Name Link
🔨 Latest commit e716234
🔍 Latest deploy log https://app.netlify.com/sites/openpolicyagent/deploys/66699b757b7ae90008c279f2
😎 Deploy Preview https://deploy-preview-6797--openpolicyagent.netlify.app

Contributor

@johanfylling johanfylling left a comment


Thank you for your contribution! 😃
There’s just one concern to address before this can be merged.

```diff
@@ -975,42 +988,33 @@ func (p *Plugin) bufferChunk(buffer *logBuffer, bs []byte) {
 }
 
 func (p *Plugin) maskEvent(ctx context.Context, txn storage.Transaction, input ast.Value, event *EventV1) error {
```

A benefit of the old mutex-based approach was that on error, the PrepareForEval() call would be retried on subsequent mask/drop calls. Now, if the first (and only) call to PrepareForEval() fails for the current configuration, we'll end up with a broken PreparedEvalQuery that'll cause a panic when used.

I've made a commit to your branch with a proposed fix and tests asserting the behavior. The fix simply re-emits the error for subsequent calls. I think it's unlikely that subsequent PrepareForEval() calls would have a different outcome, so we don't need to retain the old behavior where we retry.

Thoughts?
This change shouldn't have an impact on performance, but if you're running separate tests on your end, please let us know how this fares.

srenatus
srenatus previously approved these changes Jun 11, 2024
Contributor

@srenatus srenatus left a comment


@johanfylling reviewed just your commit. Looks good to me, one naming nitpick. But it's no blocker. 👍

plugins/logs/plugin.go (outdated review thread, resolved)
for clarity at call-site.

Signed-off-by: Johan Fylling <johan.dev@fylling.se>
@mjungsbluth
Contributor Author

One small observation: before the change, an error would set the query to rego.PreparedEvalQuery{}; now it is nil. I think this is generally nicer and more idiomatic Go, but it is a behavioral change. Was it intentional?

And thanks for ironing out the kinks :)

@johanfylling
Contributor

@mjungsbluth, I think we can chalk that one up as unconsciously intentional 😄. Since prepareOnce has an initial state in which neither preparedQuery nor err has been assigned a value, it makes sense to internally use field types that can be nil to reflect being unassigned.

@johanfylling johanfylling merged commit b463d30 into open-policy-agent:main Jun 12, 2024
28 checks passed