
[bugfix] Address several remaining race conditions to improve stability #325

Merged

Conversation

fako1024
Collaborator

Improves multiple things across the board, detected when (excessively) running hacksaw-style concurrency tests against an actual deployment using gpctl.

Closes #317
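The class of bug fixed here surfaces under the Go race detector when shared state is hammered from many goroutines. A minimal stress harness in that spirit (all names here are illustrative stand-ins, not the actual hacksaw/gpctl setup, which is external tooling):

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a stand-in for shared capture state; the real
// structures live in pkg/capture (names here are illustrative).
type counter struct {
	mu sync.Mutex
	n  int
}

func (c *counter) inc() { c.mu.Lock(); c.n++; c.mu.Unlock() }

func (c *counter) get() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// hammer runs fn from nWorkers goroutines, nIter times each,
// mimicking a concurrency stress test; run under `go run -race`
// (or `go test -race`) to surface unsynchronized access.
func hammer(nWorkers, nIter int, fn func()) {
	var wg sync.WaitGroup
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < nIter; i++ {
				fn()
			}
		}()
	}
	wg.Wait()
}

func main() {
	c := &counter{}
	hammer(8, 1000, c.inc)
	fmt.Println("count:", c.get()) // prints "count: 8000"
}
```

Dropping the mutex from `inc`/`get` makes the race detector flag this immediately, which is how the issues below were caught.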

@fako1024 fako1024 added bug Something isn't working critical Issues that should be looked at with priority labels Jun 11, 2024
@fako1024 fako1024 requested a review from els0r June 11, 2024 10:57
@fako1024 fako1024 self-assigned this Jun 11, 2024
@fako1024 fako1024 linked an issue Jun 11, 2024 that may be closed by this pull request
3 tasks
cmd/goProbe/goProbe.go Show resolved Hide resolved
c.wgProc.Wait()
c.capLock.Close()
Collaborator Author


Uses the upstream improvement from gotools/concurrency...
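The ordering above matters: the processing goroutines must fully drain (`wgProc.Wait()`) before the lock primitive is torn down (`capLock.Close()`). A toy sketch of that wait-then-close pattern (the actual gotools/concurrency primitive differs; names here are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// capLock is a toy stand-in for the capture lock: closing it while a
// worker can still send on its request channel would be a race
// (send on closed channel panics).
type capLock struct {
	requests chan struct{}
}

func newCapLock() *capLock {
	return &capLock{requests: make(chan struct{}, 1)}
}

// Close tears down the lock; it must only be called once no
// goroutine can still send on requests.
func (l *capLock) Close() { close(l.requests) }

func main() {
	var wgProc sync.WaitGroup
	l := newCapLock()

	wgProc.Add(1)
	go func() {
		defer wgProc.Done()
		l.requests <- struct{}{} // worker still using the lock
		<-l.requests
	}()

	// Wait for the processing goroutine first, then close; inverting
	// these two lines is exactly the kind of shutdown race removed here.
	wgProc.Wait()
	l.Close()
	fmt.Println("shutdown clean")
}
```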

@@ -347,14 +348,14 @@ func (c *Capture) bufferPackets(buf *LocalBuffer, captureErrors chan error) erro

// Ensure that the buffer is released at the end of the method
defer func() {
c.capLock.ConsumeUnlockRequest() // Consume the unlock request to continue normal processing
Collaborator Author


Moved this into the defer to ensure it is always run (even if code is added below it in the future). Since the function is on the stack anyway, this has no negative impact.
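A reduced sketch of why the defer is the safer spot (hypothetical, stripped-down version of `bufferPackets`): every return path, including early error returns and code paths added later, now consumes the pending unlock request:

```go
package main

import "fmt"

// lock is a toy stand-in for the capture lock; ConsumeUnlockRequest
// in the real code drains a pending unlock request so normal
// processing can continue.
type lock struct{ pending bool }

func (l *lock) ConsumeUnlockRequest() { l.pending = false }

// bufferPackets mimics the structure after the change: the consume
// call lives in a defer, so every return path (including ones added
// later) is covered.
func bufferPackets(l *lock, fail bool) error {
	defer func() {
		l.ConsumeUnlockRequest() // runs on all paths
	}()
	if fail {
		return fmt.Errorf("capture error") // early return: defer still fires
	}
	// ... normal processing; code appended here later is covered too
	return nil
}

func main() {
	l := &lock{pending: true}
	_ = bufferPackets(l, true)
	fmt.Println("pending after early return:", l.pending) // prints "false"
}
```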

pkg/capture/capture_manager.go (two resolved review comments)
@@ -580,12 +583,12 @@ func (cm *Manager) performWriteout(ctx context.Context, timestamp time.Time, ifa
writeoutChan := make(chan capturetypes.TaggedAggFlowMap, writeout.WriteoutsChanDepth)
doneChan := cm.writeoutHandler.HandleWriteout(ctx, timestamp, writeoutChan)

cm.Lock()
Collaborator Author


This was also incorrect. The global lock must encompass the whole (unprotected) rotation; otherwise it's possible that a scheduled rotation runs in parallel with one triggered by an interface going down. TBH this is the most likely candidate for the deadlocks (we've seen them happen quite frequently in exactly these scenarios: interface changes + scheduled rotation)...
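Roughly what the widened critical section looks like (illustrative stand-ins only; the real `performWriteout` also wires up writeout channels): holding the manager lock across the entire rotation means a scheduled rotation and an interface-down rotation serialize instead of interleaving:

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a reduced stand-in for capture.Manager.
type manager struct {
	sync.Mutex
	rotations int
}

// rotate is the unprotected rotation; callers must hold the lock.
func (m *manager) rotate() { m.rotations++ }

// performWriteout holds the global lock around the *whole* rotation,
// not just part of it, so a scheduled writeout cannot run
// concurrently with one triggered by an interface going down.
func (m *manager) performWriteout() {
	m.Lock()
	defer m.Unlock()
	m.rotate()
}

func main() {
	m := &manager{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // mix of scheduled + interface-triggered
		wg.Add(1)
		go func() { defer wg.Done(); m.performWriteout() }()
	}
	wg.Wait()
	fmt.Println("rotations:", m.rotations) // prints "rotations: 100"
}
```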

@fako1024 fako1024 changed the title Address several remaining race conditions to improve stability [bugfix] Address several remaining race conditions to improve stability Jun 11, 2024
@fako1024
Collaborator Author

Damn, now there's a race in the capture manager logic (it has probably been there all along, just never got triggered):

=== RUN   TestConcurrentMethodAccess/1_ifaces
==================
WARNING: DATA RACE
Read at 0x00c0001d2078 by goroutine 30:
  github.com/els0r/goProbe/pkg/capture.(*Capture).updateParsingErrorCounters()
      /home/runner/work/goProbe/goProbe/pkg/capture/capture.go:439 +0x5e4
  github.com/els0r/goProbe/pkg/capture.(*Capture).bufferPackets()
      /home/runner/work/goProbe/goProbe/pkg/capture/capture.go:418 +0x5ad
  github.com/els0r/goProbe/pkg/capture.(*Capture).process.func1()
      /home/runner/work/goProbe/goProbe/pkg/capture/capture.go:282 +0x9da

Previous write at 0x00c0001d2078 by goroutine 35:
  github.com/els0r/goProbe/pkg/capture/capturetypes.(*ParsingErrTracker).Reset()
      /home/runner/work/goProbe/goProbe/pkg/capture/capturetypes/parsing_error.go:66 +0x539
  github.com/els0r/goProbe/pkg/capture.(*Capture).status()
      /home/runner/work/goProbe/goProbe/pkg/capture/capture.go:561 +0x4f5
  github.com/els0r/goProbe/pkg/capture.(*Capture).fetchStatusInBackground.func1()
      /home/runner/work/goProbe/goProbe/pkg/capture/capture.go:573 +0x64

Goroutine 30 (running) created at:
  github.com/els0r/goProbe/pkg/capture.(*Capture).process()
      /home/runner/work/goProbe/goProbe/pkg/capture/capture.go:250 +0x104
  github.com/els0r/goProbe/pkg/capture.(*Manager).update.func2()
      /home/runner/work/goProbe/goProbe/pkg/capture/capture_manager.go:405 +0x584
  github.com/els0r/goProbe/pkg/capture.(*RunGroup).Run.func1()
      /home/runner/work/goProbe/goProbe/pkg/capture/rungroup.go:25 +0x33

Goroutine 35 (finished) created at:
  github.com/els0r/goProbe/pkg/capture.(*Capture).fetchStatusInBackground()
      /home/runner/work/goProbe/goProbe/pkg/capture/capture.go:572 +0x184
  github.com/els0r/goProbe/pkg/capture.(*Manager).rotate()
      /home/runner/work/goProbe/goProbe/pkg/capture/capture_manager.go:522 +0x352
  github.com/els0r/goProbe/pkg/capture.testConcurrentMethodAccess.func2()
      /home/runner/work/goProbe/goProbe/pkg/capture/capture_test.go:149 +0x1d1
==================

On it...
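The trace boils down to the capture goroutine reading the parsing-error counters (`updateParsingErrorCounters`) while `status()` resets the same `ParsingErrTracker` from another goroutine. One way to close such a read/reset race is an atomic swap; a sketch under that assumption (the repository may well fix it differently, and these names are reduced stand-ins):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// parsingErrTracker is a reduced stand-in for
// capturetypes.ParsingErrTracker: an error counter that the capture
// path increments while status() resets it concurrently.
type parsingErrTracker struct {
	count atomic.Int64 // atomic: safe for concurrent Add/Reset
}

func (t *parsingErrTracker) Add(n int64) { t.count.Add(n) }

// Reset atomically swaps the counter to zero and returns the drained
// value, so a concurrent Add is never lost or torn.
func (t *parsingErrTracker) Reset() int64 { return t.count.Swap(0) }

func main() {
	var t parsingErrTracker
	var wg sync.WaitGroup
	var drained atomic.Int64

	wg.Add(2)
	go func() { // capture path: increments
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			t.Add(1)
		}
	}()
	go func() { // status path: concurrent resets
		defer wg.Done()
		for i := 0; i < 10; i++ {
			drained.Add(t.Reset())
		}
	}()
	wg.Wait()
	drained.Add(t.Reset()) // collect whatever is left

	fmt.Println("errors accounted for:", drained.Load()) // prints 1000
}
```

Run with `go run -race` to confirm the detector stays quiet; replacing the atomic with a plain `int64` reproduces a WARNING like the one above.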

@els0r els0r merged commit c1fe480 into main Jun 11, 2024
5 checks passed
@els0r els0r deleted the 317-deadlock-in-three-point-capture-routine-lock-sporadic-2 branch June 11, 2024 18:21
Successfully merging this pull request may close these issues: Deadlock in three-point capture routine lock (sporadic)