Reduce locks during decision logging #6797
Conversation
Signed-off-by: Magnus Jungsbluth <magnus.jungsbluth@zalando.de>
eaa0c8a to 3d6d2be
@mjungsbluth thanks for the contribution. If you can share any numbers that help validate this change, that would be helpful. It would be interesting to see the performance improvements this change gives us. Also, FWIW, @johanfylling is planning to take a closer look at #5724 later this month.
Overall, this looks like a good change to me! As described by @mjungsbluth, the main place we were using mutexes here was for updates to the mask/drop policies -- perfect candidates for using `sync.Once` instead to ensure atomicity. 😄
Also, because my memory was fuzzy, I double-checked how `sync.Once` ensures atomicity -- it uses an atomic check of the done flag for the fast path (checking whether the `sync.Once` has already run), and an underlying mutex on the slower path (which is used to ensure atomicity on the first invocation of `(*Once).Do()`).
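A minimal, generic illustration of those semantics (not code from this PR; the values are made up):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var (
		once  sync.Once
		value string
		wg    sync.WaitGroup
	)

	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// The function passed to Do runs exactly once; concurrent
			// callers block until it has finished. Every later call is a
			// cheap atomic check of the done flag, not a mutex acquisition.
			once.Do(func() {
				value = "initialized exactly once"
			})
			// Do establishes a happens-before edge, so every goroutine
			// observes the fully initialized value here.
			fmt.Println(value)
		}()
	}
	wg.Wait()
}
```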
I think the only thing missing here would be some kind of benchmark, or instructions/materials for running a head-to-head test of decision logging performance. One option might be to get a Mutex profile from pprof, since that should directly address the core improvement of the PR. 🤔
EDIT: Let me see if I can get a quick mutex profile...
Okay! So I found an existing masking benchmark that seems to be illustrative of the improvements here. Here's how I ran my benchmarks (30x runs each):
And here are the benchstat results from my run:
So, I think that's a small but solid bit of evidence in favor of this PR improving performance! 😄
Benchmark results files from my run:
The above benchmark is dramatically better for illustrating the PR's effects than doing a mutex profile against the OPA server as a whole -- I found that, at least for console logging, the Logrus mutexes drown out any effect from this change! 😓
Example mutex profile with Logrus overwhelming everything else: This profile was collected by launching 10,000x queries at a few decisions that would trigger masking/drop policies for a fairly simple policy. The large number of seconds shown is aggregated across 8x threads, and thus should be taken with a grain of salt. 😅
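For anyone wanting to reproduce that kind of profile, here is a rough sketch of collecting a mutex profile from a Go process in general; the port and setup here are assumptions for illustration, not OPA's built-in flags:

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
)

func main() {
	// Record every mutex contention event (the argument is a sampling
	// rate; 1 means no events are skipped).
	runtime.SetMutexProfileFraction(1)

	go func() {
		// While load is applied, fetch the profile with e.g.
		//   go tool pprof http://localhost:6060/debug/pprof/mutex
		_ = http.ListenAndServe("localhost:6060", nil)
	}()

	// ... drive the workload here (e.g. replay decisions that trigger the
	// mask/drop policies) ...
	select {}
}
```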
…nt calls
Fixing nil pointer dereference panic.
Signed-off-by: Johan Fylling <johan.dev@fylling.se>
Thank you for your contribution! 😃
There’s just one concern to address before this can be merged.
@@ -975,42 +988,33 @@ func (p *Plugin) bufferChunk(buffer *logBuffer, bs []byte) {
}

func (p *Plugin) maskEvent(ctx context.Context, txn storage.Transaction, input ast.Value, event *EventV1) error {
A benefit of the old mutex-based approach was that on error, the `PrepareForEval()` call would be retried on subsequent mask/drop calls. Now, if the first (and only) call to `PrepareForEval()` fails for the current configuration, we'll end up with a broken `PreparedEvalQuery` that'll cause a panic when used.
I've made a commit to your branch with a proposed fix and tests asserting the behavior. The fix simply re-emits the error for subsequent calls. I think it's unlikely that subsequent `PrepareForEval()` calls would have a different outcome, so we don't need to retain the old behavior where we retry.
Thoughts?
This change shouldn't have an impact on performance, but if you're running separate tests on your end, please let us know how this fares.
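A minimal sketch of that error-caching pattern, assuming hypothetical type and field names (the query path and structure are illustrative, not the plugin's exact code): prepare once, cache both the prepared query and the error, and re-emit the error on every later call instead of evaluating a zero-value `PreparedEvalQuery`.

```go
package logs

import (
	"context"
	"sync"

	"github.com/open-policy-agent/opa/rego"
)

// preparedMaskQuery lazily prepares the decision-log mask query exactly once.
type preparedMaskQuery struct {
	once     sync.Once
	prepared rego.PreparedEvalQuery
	err      error
}

func (q *preparedMaskQuery) eval(ctx context.Context, input interface{}) (rego.ResultSet, error) {
	q.once.Do(func() {
		// Runs only on the first call; the result (or error) is cached.
		q.prepared, q.err = rego.New(
			rego.Query("data.system.log.mask"),
		).PrepareForEval(ctx)
	})
	if q.err != nil {
		// Re-emit the original preparation error; we never retry and never
		// touch the broken PreparedEvalQuery, so no panic.
		return nil, q.err
	}
	return q.prepared.Eval(ctx, rego.EvalInput(input))
}
```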
@johanfylling I reviewed just your commit. Looks good to me; one naming nitpick, but it's no blocker. 👍
…for clarity at call-site.
Signed-off-by: Johan Fylling <johan.dev@fylling.se>
One small observation: before the change, an error would set the query to …
And thanks for ironing out the kinks :)
@mjungsbluth, I think we can chalk that one up as unconsciously intentional 😄. Since there is an initial state of …
Why are the changes in this PR needed?
Decision logging is heavy on mutexes, which increases the latency of decisions (see #5724) and also limits parallelism. This PR removes two of the three mutexes and replaces them with standard Go primitives better suited to those use cases.
What are the changes in this PR?
Mutexes are replaced with `sync.Once`, which has an optimised call path if the prepared queries are already set and only locks on the first occurrence.
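As a rough illustration of that "optimised call path" point (a generic sketch with made-up types, not the plugin's code), a micro-benchmark can contrast a mutex taken on every read with `sync.Once`, which only locks on the first call:

```go
package main

import (
	"sync"
	"testing"
)

type mutexGuarded struct {
	mu    sync.Mutex
	ready bool
	value int
}

func (g *mutexGuarded) get() int {
	// Locks on every call, even after initialization is done.
	g.mu.Lock()
	defer g.mu.Unlock()
	if !g.ready {
		g.value = 42 // stand-in for preparing the query
		g.ready = true
	}
	return g.value
}

type onceGuarded struct {
	once  sync.Once
	value int
}

func (g *onceGuarded) get() int {
	// Locks only inside the first Do; later calls are an atomic check.
	g.once.Do(func() {
		g.value = 42 // stand-in for preparing the query
	})
	return g.value
}

func BenchmarkMutexGuarded(b *testing.B) {
	g := &mutexGuarded{}
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			_ = g.get()
		}
	})
}

func BenchmarkOnceGuarded(b *testing.B) {
	g := &onceGuarded{}
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			_ = g.get()
		}
	})
}
```

Saved as a `_test.go` file, this can be run with `go test -bench . -cpu 8` to see the contention difference under parallelism; the absolute numbers are machine-dependent and only illustrative.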