
Integration with {targets} #64

Closed · wlandau opened this issue May 17, 2022 · 11 comments

Comments

@wlandau commented May 17, 2022

The targets package currently uses clustermq and future to send tasks to workers running on traditional clusters. As a next step for targets, I aim to support workers running on the cloud (AWS Batch, Fargate, Google Cloud Run, Kubernetes, etc.) just like Airflow, Metaflow, Nextflow, and Prefect. A task queue would be an excellent layer between targets and cloud platforms.

Before I learned about rrq, I started crew to extend https://www.tidyverse.org/blog/2019/09/callr-task-q/ to other types of workers. There are a couple of callr-based queues, and the future-based queue seems to make future.batchtools workloads a bit more efficient. That's about as far as I have pursued crew up to this point. Interprocess communication and heartbeating seem like huge challenges given how much more isolated AWS Batch jobs are than jobs on a traditional cluster.

So I am wondering if I can use rrq for targets. Can it support workers on the cloud? There are mentions of AWS in the docs, particularly about the Redis server, and I would like to learn more about how the pieces fit together for a use case like mine.

@wlandau (Author) commented May 18, 2022

Proposal

On a second look at the rrq docs, I think crew actually does still serve a purpose. In fact, I think rrq is the missing piece that will make crew viable for the cloud.

crew takes full responsibility for configuring and launching workers, whether those workers are local processes, cluster jobs, or (eventually) AWS Batch jobs. targets really needs this piece. However, it is not enough: the way crew currently sends tasks and detects crashes is storage-based, naive, slow, and unreliable.

rrq does not manage the whole business of launching workers, but it does help configure the R sessions within those workers, and it supports a robust and proper system for managing and monitoring workers. This is exactly what crew needs the most.

Questions

  1. How would you recommend setting up a Redis server so cloud workers can reach it? Have you found an AWS service that does this?
  2. How quick and easy is the setup and teardown for (1)? Ideally, tar_make() should start with nothing, spin up a temporary Redis server if there are any targets that require cloud jobs, then tear everything down when the pipeline completes or crashes.
  3. From https://mrc-ide.github.io/rrq/articles/rrq.html#start-workers-on-another-node-perhaps-using-a-scheduler, how would a cloud worker talk to a Redis server on a different computer or cloud job? Does --config cover it?

P.S.: I am just starting to learn these packages but am currently stuck at richfitz/redux#54.

@richfitz (Member) commented:

I suspect that the details of cloud orchestration will vary by user and their needs: are the workers in the cloud while the orchestrating process is on a desktop, or is that also in the cloud? Is the cloud resource some generally available single large instance, or are you using more transient compute like ECS?

The second and third questions are easiest so I'll answer them first:

  2. Setup and teardown of a Redis server is comparable to that of any R process: it looks like it's on the order of about 0.1 s each to start and stop if you do it via a Docker container with the image already present.
  3. Not --config; that's for controlling the configuration after your worker can already talk to the server. Instead, set the REDIS_URL environment variable (see redux::redis_config()); a minimal sketch follows below.
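
For example, something like this on the worker side (a minimal sketch; the host name is a placeholder, not a real server):

```r
# Point redux (and therefore rrq) at the shared Redis server via REDIS_URL
# before the worker process starts doing any work.
Sys.setenv(REDIS_URL = "redis://redis-host.example.com:6379")

# redux picks this up automatically:
con <- redux::hiredis()
con$PING()  # "PONG" if the worker can reach the server
```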

We don't use the cloud much here, but one of our patterns is probably of interest:

  • we keep a redis server running on each cluster's head node
  • users submit cluster "jobs" that are just workers that poll for work
  • they then submit tasks to their rrq queue which uses these workers

I imagine this is not dissimilar to how someone might want to use these things when doing work in the cloud: stand up your compute, then work with it. Importantly, from rrq's point of view the worker pool can change (shrink, grow, be replaced) at any point, and workers can be told to turn off once they're idle, etc. However, at the point that you submit any work to the queue, you must have a working connection to Redis, even if there are no workers yet.
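
To make the pattern concrete, here is a bare-bones illustration using redux directly rather than rrq's actual implementation (the queue key names are made up, and real workers need error handling, timeouts, heartbeats, and so on):

```r
# Controller side: push a serialised task onto a Redis list.
con <- redux::hiredis()  # honours REDIS_URL, so this works across machines
con$LPUSH("myqueue:tasks", redux::object_to_bin(quote(sin(1))))

# Worker side (each cluster job runs a loop like this):
worker <- redux::hiredis()
repeat {
  task <- worker$BRPOP("myqueue:tasks", timeout = 30)
  if (is.null(task)) next                       # queue empty; keep polling
  result <- eval(redux::bin_to_object(task[[2]]))
  worker$LPUSH("myqueue:results", redux::object_to_bin(result))
}
```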

If you're doing this in the cloud with a controlling process that is local to you, then the other factor you need to consider is how to secure the communication channel. I've done this in the distant past (2014-2015) with an SSH tunnel and that worked well. This was where many of the ideas came from in the first place, but that project is well in the past now and I don't remember the details.

The other way we use this is we have a group of docker containers that form an application - we bring up the workers and the redis server at the start of the application life cycle and they live until the entire system is torn down.

@wlandau (Author) commented May 21, 2022

are the workers in the cloud while the orchestrating process is on a desktop? Is that also in the cloud?

I would ideally like to allow for both, but it may be the case that only the latter is possible. I am not sure yet.

Is the cloud resource some generally-available single large instance or are you using more transient compute like ecs?

I'm thinking more transient compute, mainly Batch to start, which sits on top of ECS.

We don't use the cloud much here, but one of our patterns is probably of interest:

  • we keep a redis server running on each cluster's head node
  • users submit cluster "jobs" that are just workers that poll for work
  • they then submit tasks to their rrq queue which uses these workers

I see. That sounds very much like clustermq. I could see a use for something like this that extends rrq and can submit to clusters in a clustermq/batchtools kind of way.

I imagine this is not dissimilar to how someone might want to use these things when doing work in the cloud: stand up your compute, then work with it.

Exactly.

However, at the point that you submit any work to the queue, you must have a working connection to Redis, even if there are no workers yet. If you're doing this in the cloud with a controlling process that is local to you, then the other factor you need to consider is how to secure the communication channel. I've done this in the distant past (2014-2015) with an SSH tunnel and that worked well.

That's where I am stumped. When you worked with an SSH tunnel, do you remember if you used a cluster or something on the cloud? Between the encapsulation of Batch and my company's infosec wall, I haven't been able to SSH/tunnel into a running job. I was thinking as a fallback that I would send work to an EFS mount that the Batch jobs could pick up, then crew could poll CloudWatch or something to check for crashes, but that's not nearly as efficient as Redis/ZeroMQ and not nearly as robust as heartbeating.

The other way we use this is we have a group of docker containers that form an application - we bring up the workers and the redis server at the start of the application life cycle and they live until the entire system is torn down.

I haven't had much of a chance to use Docker, but that sounds promising. How easy is it in practice to make Docker containers talk to each other? I'm picturing one with the Redis server and another with the workers.

@richfitz (Member) commented:

For Docker, the usual model is one process per container. It's extremely straightforward to have different Docker containers talk to each other (this is really the point of Docker, in fact!) or to expose things to the host in a single-node setting.

I imagine that a sensible AWS setup would look like

  • some micro instance running Redis and an SSH tunnel
  • configure your workers in the cloud to connect to the Redis server either over a private network if Batch allows this, or over the SSH tunnel if not
  • configure the controlling process to connect over the SSH tunnel

Importantly, you're not connecting to the running jobs; the running jobs are connecting to the Redis server.
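
For the controlling process, the tunnelled connection might look roughly like this (a rough sketch only: the host, key path, and use of processx are placeholders/assumptions, not a prescribed setup):

```r
# Forward local port 6379 to the Redis port on the micro instance.
tunnel <- processx::process$new(
  "ssh",
  c("-N", "-L", "6379:localhost:6379",
    "-i", "~/.ssh/my-key.pem", "user@redis-host.example.com")
)

# The local controlling process then just talks to localhost;
# redux picks this up via REDIS_URL.
Sys.setenv(REDIS_URL = "redis://127.0.0.1:6379")
con <- redux::hiredis()

# Tear the tunnel down when done.
tunnel$kill()
```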

@wlandau (Author) commented Oct 19, 2022

That setup makes sense. I have read more about AWS and Docker since we last spoke ("AWS in Action" and "Docker Deep Dive"), and it seems like a custom security group could allow the required traffic from cloud instance to cloud instance.

But for the SSH tunnel from the local machine to the Redis instance, it seems like I would need to expose the user's public IP address in an AWS security group. Would that hard-coded public IP create vulnerabilities if AWS stores it? Would it be more secure to run that traffic through a local container with its own temporary public IP address? If so, would it make sense to run the Redis server in that local container instead of a persistent and potentially expensive cloud instance? Or would that not be worth it because of the increased attack surface (one tunnel per worker over the public internet)?

Do you have any recommendations for resources that would help me learn more about the relevant networking and security fundamentals? I feel like it would help to improve my understanding of TCP/IP, the OSI model, overlays, firewalls, public-key cryptography, TLS encryption, and other infrastructure that would help the local machine securely tunnel into the cloud.

By the way, what are your plans for CRAN? rrq is incredibly deep and comprehensive, and I would love to build on top of it.

@wlandau (Author) commented Oct 19, 2022

From @mschubert at mschubert/clustermq#208 (comment)

We don't know workers' node or IP address because the scheduler will assign them to nodes. Instead, each worker connects back to the main process (which is listening on the port corresponding to the ID field), either directly or via an SSH tunnel.

So each worker needs to know either the host+port of the main process or of the SSH tunnel.

In terms of running tunnels over the public internet, I would feel better if this public IP were a temporary one-time address and the AWS API requests that delivered it were encrypted. But it's entirely possible that I just need to understand more about internet security.

@mschubert commented:

@wlandau In clustermq the tunnel is established from local to remote, so there is no need to expose any local ports (but the remote needs to accept SSH connections and reverse tunneling; traffic over the tunnel is always encrypted via SSH).
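
A minimal configuration for that SSH connector looks something like the following (the host is a placeholder, the remote needs R and clustermq installed, and the clustermq user guide is the authoritative reference):

```r
# Use the SSH connector: the tunnel is opened from the local side,
# so no local ports are exposed.
options(
  clustermq.scheduler = "ssh",
  clustermq.ssh.host  = "user@myhost"
)

# Tasks are then submitted as usual and run via the remote machine.
clustermq::Q(function(x) x * 2, x = 1:10, n_jobs = 2)
```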

@wlandau (Author) commented Oct 19, 2022

Interesting. Would be great to follow up at mschubert/clustermq#290.

@wlandau (Author) commented Mar 27, 2023

Hi @richfitz,

A lot has changed on my end since we last spoke on this thread. The crew package, which will eventually drive auto-scaled high-performance computing in targets, has adopted mirai as the backend task scheduler. At first I was not sure whether integration with mirai would be feasible, but @shikokuchuo has implemented incredible improvements in the last few weeks. Today, crew is totally committed to mirai, and it has reached CRAN.

I know my comments have spurred development in rrq, particularly in #66, and I am sorry to have changed direction after sending those requests. I still think rrq is a valuable contribution to R, and I hope a CRAN release will bring recognition to the effort you and your team put into developing it.

@wlandau wlandau closed this as completed Mar 27, 2023
@richfitz (Member) commented:

Cool, that looks like a really nice package -- a very different model to rrq, and I hope it's easier to work with from targets/crew than rrq would have been. In particular, relaxing the requirement for a persistent Redis server will probably make things much easier than they would have been with rrq.

@wlandau (Author) commented Mar 27, 2023

Thanks Rich!

Yes, I had actually planned to start short-lived Redis server instances for crew workloads. That would have probably worked well, but the system dependency on redis-server might have been a lot to ask users to install themselves.

Also, I am super happy with the blazing fast alternative to heartbeating that @shikokuchuo suggested at wlandau/crew#31 (comment). It's such slick magic that I was considering recommending it for rrq. All you need is NNG (via the nanonext package). Bus sockets are lightweight, and you don't even have to send any messages for it to work.
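
For context, a minimal sketch of that idea using nanonext directly (the socket address is made up, and this is not necessarily how crew wires it up):

```r
library(nanonext)

# Monitor side: listen on a bus socket at a known address.
monitor <- socket("bus", listen = "ipc:///tmp/crew-demo")

# Worker side, in a separate R process: dial the same address and just
# keep the socket open for the lifetime of the process.
#   worker <- nanonext::socket("bus", dial = "ipc:///tmp/crew-demo")

# Liveness comes from the connection itself, with no messages exchanged:
# while the worker process is alive, its pipe stays registered on the
# listening socket; if the worker exits or crashes, the pipe disappears.
stat(monitor, "pipes")  # 1 while the worker is connected, 0 after it dies

close(monitor)
```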
