Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No need to use mirai::saisei() #136

Closed
wlandau opened this issue Nov 6, 2023 · 9 comments
Closed

No need to use mirai::saisei() #136

wlandau opened this issue Nov 6, 2023 · 9 comments

Comments

@wlandau
Copy link
Owner

wlandau commented Nov 6, 2023

The most recent version of mirai locks dispatcher sockets, so only one worker can connect at a time. That means crew should no longer need to rotate websocket URLs (I plan to test this locally).

@wlandau
Copy link
Owner Author

wlandau commented Nov 6, 2023

Using {nanonext} 0.10.4.9000 and {mirai} 0.11.1.9000, I am finding I can part with saisei() just fine. But with the latest CRAN releases (0.10.4 and 0.11.1), it appears that two daemons can connect to the same websocket, and that this flips the "online" bit back to 0 as predicted at #31 (comment). I guess we'll be depending on the dev versions for a while.

@wlandau
Copy link
Owner Author

wlandau commented Nov 6, 2023

On the bright side, eliminating saisei() dramatically reduces manual interactions with the dispatcher, which just might solve #128 and user-end troubles like it.

wlandau-lilly added a commit that referenced this issue Nov 6, 2023
@wlandau
Copy link
Owner Author

wlandau commented Nov 6, 2023

But with the latest CRAN releases (0.10.4 and 0.11.1), it appears that two daemons can connect to the same websocket, and that this flips the "online" bit back to 0 as predicted at #31 (comment).

It is straightforward to reproduce this locally (R 4.3.2, Mac OS 14.1). In a host process, I call:

# terminal 1 R session (host)
mirai::daemons(n = 1L, url = "ws://127.0.0.1:5700")
#> [1] 1
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                     i online instance assigned complete
#> ws://127.0.0.1:5700 1      0        0        0        0

Then I start a daemon process in a different terminal.

# terminal 2
R -e 'mirai::daemon(url = "ws://127.0.0.1:5700", asyncdial = FALSE)'

The host process sees this new daemon. So far so good.

# terminal 1 R session (host)
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                     i online instance assigned complete
#> ws://127.0.0.1:5700 1      1        1        0        0

But now when I start a second daemon process in a different terminal, that daemon also connects.

# terminal 3
# The daemon R session stays running:
R -e 'mirai::daemon(url = "ws://127.0.0.1:5700", asyncdial = FALSE)'
# terminal 1 R session (host)
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                     i online instance assigned complete
#> ws://127.0.0.1:5700 1      0        1        0        0

At this point, both daemons are running (terminals 2 and 3), and the online bit flipped back to 0. Interestingly, mirai::status()$connections still shows 1.

FYI @shikokuchuo

@wlandau
Copy link
Owner Author

wlandau commented Nov 6, 2023

I am rediscovering that I need to manually reset the instance counter each time I launch a {crew} worker. Otherwise, it's hard to reason about how many worker instances I am expecting and whether the "current" one completed. So I think crew attached to saisei() after all.

@wlandau wlandau closed this as completed Nov 6, 2023
@wlandau wlandau reopened this Nov 6, 2023
@wlandau
Copy link
Owner Author

wlandau commented Nov 6, 2023

Reopened to close as "not planned".

@wlandau wlandau closed this as not planned Won't fix, can't repro, duplicate, stale Nov 6, 2023
@shikokuchuo
Copy link
Contributor

But with the latest CRAN releases (0.10.4 and 0.11.1), it appears that two daemons can connect to the same websocket, and that this flips the "online" bit back to 0 as predicted at #31 (comment).

It is straightforward to reproduce this locally (R 4.3.2, Mac OS 14.1). In a host process, I call:

# terminal 1 R session (host)
mirai::daemons(n = 1L, url = "ws://127.0.0.1:5700")
#> [1] 1
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                     i online instance assigned complete
#> ws://127.0.0.1:5700 1      0        0        0        0

Then I start a daemon process in a different terminal.

# terminal 2
R -e 'mirai::daemon(url = "ws://127.0.0.1:5700", asyncdial = FALSE)'

The host process sees this new daemon. So far so good.

# terminal 1 R session (host)
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                     i online instance assigned complete
#> ws://127.0.0.1:5700 1      1        1        0        0

But now when I start a second daemon process in a different terminal, that daemon also connects.

# terminal 3
# The daemon R session stays running:
R -e 'mirai::daemon(url = "ws://127.0.0.1:5700", asyncdial = FALSE)'
# terminal 1 R session (host)
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                     i online instance assigned complete
#> ws://127.0.0.1:5700 1      0        1        0        0

At this point, both daemons are running (terminals 2 and 3), and the online bit flipped back to 0. Interestingly, mirai::status()$connections still shows 1.

FYI @shikokuchuo

Your reprex works as expected for me on Ubuntu 22.04.3, R 4.3.2. The second terminal instance exits straight away. The first one remains connected. With nanonext 0.10.4, mirai 0.11.1. But you report it working with the dev versions anyway so I'm not really concerned.

@shikokuchuo
Copy link
Contributor

I am rediscovering that I need to manually reset the instance counter each time I launch a {crew} worker. Otherwise, it's hard to reason about how many worker instances I am expecting and whether the "current" one completed. So I think crew attached to saisei() after all.

I am not sure what is meant here, but it seems possible to create (duplicate to) another counter for this purpose, you don't actually need to manipulate the actual condition value inside the cv right? In my opinion, eliminating saisei() is not a small matter for your semi-transient workers, having to re-create the listeners is just pure overhead - I wouldn't be so quick to dismiss.

@wlandau
Copy link
Owner Author

wlandau commented Nov 7, 2023

crew needs a way to figure out if it needs to relaunch a worker. It uses the instance counter to help determine whether an instance of a worker successfully connected. If there are lurking instances in the background that could potentially connect, this creates ambiguity. That's why I need to rotate the websocket URL.

Say I launch a worker as an AWS Batch spot instance. It could take a few seconds or several minutes to connect to the host, depending pricing fluctuations on the cloud. crew's policy is to wait for some fixed time interval for it to start, then give up if it does not connect within that window. By "give up", I mean send a REST API request to cancel the job, and rotate the websocket URL. If crew does not rotate the websocket and the cancellation request arrives too late, then this could happen:

  1. crew sends the REST request to cancel the worker instance.
  2. To replace the first worker instance, crew launches a second one.
  3. The first worker connects in spite of the cancellation request.
  4. The REST cancellation request succeeds and the first worker abruptly terminates.
  5. The second worker connects.

Here, we would expect an instance count of 1, but it would actually be 2. In general, instance could be either 0, 1, or 2, depending on a hidden race condition. Rotating the websocket with saisei() excludes lurking worker instances and guarantees that crew can trust instance.

We had a similar discussion about this in #31 leading up to the implementation of saisei().

@shikokuchuo
Copy link
Contributor

Thanks, that's very clear now. Locked sockets will ensure that only one daemon can connect at any one time, but it doesn't guarantee that the correct one does in your example above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants