Hanging following IO.blocking(...) calls in v3.5.0 RC3 and RC4 #3548

Closed
armanbilge opened this issue Apr 22, 2023 · 13 comments

Comments

@armanbilge
Member

So far this has been observed in both FS2 and Skunk and specifically in their TLS/SSL suites, where blocking(...) is used in the SSL engine wrapper.

  1. SSLTest intermittently hanging on CI skunk#852
    We started experiencing non-deterministic hangs after upgrading to CE v3.5.0-RC3 (via FS2 v3.7.0-RC4). It also reproduced with CE RC4, and does not appear to reproduce after bumping back down to the stable series.

  2. Implement I/O with CE polling system on JVM fs2#3091 (comment)
    This one reproduces reliably for me locally on Linux after updating to CE 1f95fd7 (which merged RC4 into the previous snapshot CE 6581dc4). Furthermore, replacing the blocking(...) with delay(...) in the SSL engine wrapper appeared to resolve the issue, so long as no other blocking calls are involved (e.g. DNS resolution).
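
For readers unfamiliar with the swap described in item 2, here is a minimal sketch of the two variants. It is not the actual FS2 or Skunk code: `EngineWrapSketch`, `newClientEngine`, `wrapBlocking` and `wrapDelay` are hypothetical names invented for illustration. Only `IO.blocking`, `IO.delay` and the JDK `SSLEngine` API are real. `IO.blocking` marks the thunk as blocking so the runtime can move it off the compute path, while `IO.delay` runs it directly on a compute thread.

```scala
import cats.effect.IO
import java.nio.ByteBuffer
import javax.net.ssl.{SSLContext, SSLEngine, SSLEngineResult}

object EngineWrapSketch {
  // Hypothetical helper: a fresh client-mode engine from the default JDK context.
  def newClientEngine: IO[SSLEngine] = IO.delay {
    val engine = SSLContext.getDefault.createSSLEngine()
    engine.setUseClientMode(true)
    engine
  }

  // Variant that goes through the blocking machinery of the runtime.
  def wrapBlocking(engine: SSLEngine, src: ByteBuffer, dst: ByteBuffer): IO[SSLEngineResult] =
    IO.blocking(engine.wrap(src, dst))

  // Variant that runs the same call as an ordinary synchronous effect.
  def wrapDelay(engine: SSLEngine, src: ByteBuffer, dst: ByteBuffer): IO[SSLEngineResult] =
    IO.delay(engine.wrap(src, dst))

  // Small driver: wraps an empty source buffer once and returns the result.
  def wrapOnce(useBlocking: Boolean): IO[SSLEngineResult] =
    for {
      engine <- newClientEngine
      dst = ByteBuffer.allocate(engine.getSession.getPacketBufferSize)
      src = ByteBuffer.allocate(0)
      result <- if (useBlocking) wrapBlocking(engine, src, dst) else wrapDelay(engine, src, dst)
    } yield result
}
```

Both variants call the same `engine.wrap`; they differ only in how the runtime schedules the call, which is why the swap helps narrow down where a hang originates.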

@armanbilge added this to the v3.5.0 milestone on Apr 22, 2023

@djspiewak
Member

Just to rule out some things quickly, can you try swapping WorkStealingThreadPool#canExecuteBlockingCode() to returning false in all cases? This should allow us to check if it's WSTP or something higher level.

@armanbilge
Member Author

Btw, here is a heap dump from typelevel/fs2#3091 (comment) when it hangs. i3548.bin.zip

@djspiewak
Member

All of the workers are parked (in epoll) but there is one item in the external queue, suggesting that something has been messed up in the state machine. Additionally, there is one sleeper, but all of the threads are selecting without a timeout, which suggests that there's some part of the logic there which is also confused.
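
To illustrate the second observation, here is a rough sketch of how a worker might pick its park mode from the nearest sleeper deadline. This is not the Cats Effect implementation: `SleeperTimeoutSketch`, `Park`, `decide` and `park` are hypothetical names, and only the `java.nio.channels.Selector` calls are real API. Note that `Selector.select(0)` blocks indefinitely, so an already expired deadline has to map to `selectNow()` instead.

```scala
import java.nio.channels.Selector

object SleeperTimeoutSketch {
  // Hypothetical park modes for a worker thread.
  sealed trait Park
  case object Indefinitely extends Park
  case object PollNow extends Park
  final case class ForMillis(millis: Long) extends Park

  // With no sleeper, block without a timeout; otherwise block at most
  // until the earliest deadline, rounding up so we never wake too early.
  def decide(nowNanos: Long, nextDeadlineNanos: Option[Long]): Park =
    nextDeadlineNanos match {
      case None => Indefinitely
      case Some(deadline) =>
        val remaining = deadline - nowNanos
        if (remaining <= 0L) PollNow
        else ForMillis((remaining + 999999L) / 1000000L)
    }

  // Applies the decision to a real selector.
  def park(selector: Selector, how: Park): Int = how match {
    case Indefinitely  => selector.select()
    case PollNow       => selector.selectNow()
    case ForMillis(ms) => selector.select(ms) // ms is always >= 1 here
  }
}
```

Under this model, the heap dump state (one sleeper present, every thread in `Indefinitely`) should not be reachable, which is consistent with the suspicion that the timeout logic is confused.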

@armanbilge
Member Author

Thanks. So that's probably polling system bugs 😬

@armanbilge
Member Author

I just published an update in typelevel/skunk#852 (comment). The tl;dr is that there may be an issue with Mutex in CE 3.5. We recently started using Mutex in FS2 3.7. I was the last person to touch Mutex in #3409, so hip hip hooray 😩

I may have been hasty to blame blocking(...), sorry 😇 at least for the Skunk issue. My polling system branch is probably just generally borked and blocking is aggravating it.

@durban
Contributor

durban commented Apr 23, 2023

@armanbilge Can you try 3da03b9? It seems to fix the FS2 TLSSocketSuite hanging for me.

But it doesn't help with the skunk problem (it is polling system specific), which I couldn't reproduce locally.

@armanbilge
Member Author

armanbilge commented Apr 24, 2023

> But it doesn't help with the skunk problem (it is polling system specific)

Hmm, not really. The skunk issues are reproducible with just the RCs with timers. The FS2 issue is the polling system one.

In any case, thanks! Will try that and report back.

Edit: oh, is that commit only relevant for the polling system branch?

@durban
Contributor

durban commented Apr 24, 2023

Yeah, sorry, in "it is polling system specific", the "it" is my commit (3da03b9).

@durban
Contributor

durban commented Apr 24, 2023

#3549 might be related to the skunk issue (although I couldn't reproduce that one locally).

@armanbilge
Member Author

Yeah, no one's reproduced that one locally 😛 I'll publish your fix and try it in CI, awesome!!! Thank you so much 😁

@He-Pin

He-Pin commented Apr 24, 2023

If it's a select without a timeout, then it must be woken up on external submission.
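
A minimal sketch of that rule, using only the JDK: a thread parked in `Selector.select()` with no timeout only notices externally submitted work if the submitter also calls `wakeup()`. The `WakeupSketch` and `Worker` names and the queue layout are hypothetical stand-ins, not the Cats Effect data structures.

```scala
import java.nio.channels.Selector
import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch, TimeUnit}

object WakeupSketch {
  // Hypothetical worker with an external queue and a selector it parks on.
  final class Worker {
    private val selector = Selector.open()
    private val external = new ConcurrentLinkedQueue[Runnable]()
    @volatile private var done = false

    private val thread = new Thread(() => {
      while (!done) {
        // Drain everything that was submitted from outside.
        var task = external.poll()
        while (task ne null) {
          task.run()
          task = external.poll()
        }
        // No timeout: returns only on I/O readiness or wakeup().
        if (!done) selector.select()
      }
      selector.close()
    })
    thread.setDaemon(true)

    def start(): Unit = thread.start()

    def submit(task: Runnable): Unit = {
      external.add(task)
      // Without this call the task could sit in the queue indefinitely.
      selector.wakeup()
    }

    def shutdown(): Unit = submit(() => done = true)
  }

  // Submits one task and reports whether it ran within five seconds.
  def runOnce(): Boolean = {
    val worker = new Worker
    val latch = new CountDownLatch(1)
    worker.start()
    worker.submit(() => latch.countDown())
    val ran = latch.await(5, TimeUnit.SECONDS)
    worker.shutdown()
    ran
  }

  def main(args: Array[String]): Unit =
    println(s"task ran: ${runOnce()}")
}
```

Enqueueing before calling `wakeup()` matters: if the wakeup lands before the worker reaches `select()`, the JDK makes the next `select()` return immediately, so the worker still drains the queue.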

@armanbilge
Member Author

In #3551 (comment) we just failed:

x not lose cedeing threads from the bypass when blocker transitioning

@djspiewak
Member

I think we sorted this.

4 participants