Update docs to hide Mesos #4413

Merged
merged 15 commits into from
Sep 27, 2023
38 changes: 19 additions & 19 deletions docs/appendices/deploy.rst
Original file line number Diff line number Diff line change
@@ -31,27 +31,27 @@ From here, you can install a project and its dependencies::
$ tree
.
├── util
   ├── __init__.py
   └── sort
   ├── __init__.py
   └── quick.py
├── __init__.py
└── sort
├── __init__.py
└── quick.py
└── workflow
├── __init__.py
└── main.py

3 directories, 5 files
$ pip install matplotlib
$ cp -R workflow util venv/lib/python2.7/site-packages
$ cp -R workflow util venv/lib/python3.9/site-packages

Ideally, your project would have a ``setup.py`` file (see `setuptools`_) which streamlines the installation process::

$ tree
.
├── util
   ├── __init__.py
   └── sort
   ├── __init__.py
   └── quick.py
├── __init__.py
└── sort
├── __init__.py
└── quick.py
├── workflow
│ ├── __init__.py
│ └── main.py
@@ -70,7 +70,7 @@ both Python and Toil are assumed to be present on the leader and all worker node

We can now run our workflow::

$ python main.py --batchSystem=mesos
$ python main.py --batchSystem=kubernetes

.. important::

@@ -101,13 +101,13 @@ This scenario applies if the user script imports modules that are its siblings::
$ cd my_project
$ ls
userScript.py utilities.py
$ ./userScript.py --batchSystem=mesos
$ ./userScript.py --batchSystem=kubernetes

Here ``userScript.py`` imports additional functionality from ``utilities.py``.
Toil detects that ``userScript.py`` has sibling modules and copies them to the
workers, alongside the user script. Note that sibling modules will be
auto-deployed regardless of whether they are actually imported by the user
scriptall .py files residing in the same directory as the user script will
script-all .py files residing in the same directory as the user script will
automatically be auto-deployed.

Sibling modules are a suitable method of organizing the source code of
@@ -134,16 +134,16 @@ The following shell session illustrates this::
$ tree
.
├── utils
   ├── __init__.py
   └── sort
   ├── __init__.py
   └── quick.py
├── __init__.py
└── sort
├── __init__.py
└── quick.py
└── workflow
├── __init__.py
└── main.py

3 directories, 5 files
$ python -m workflow.main --batchSystem=mesos
$ python -m workflow.main --batchSystem=kubernetes

.. _package: https://docs.python.org/2/tutorial/modules.html#packages

@@ -168,7 +168,7 @@ could do this::
$ cd my_project
$ export PYTHONPATH="$PWD"
$ cd /some/other/dir
$ python -m workflow.main --batchSystem=mesos
$ python -m workflow.main --batchSystem=kubernetes

Also note that the root directory itself must not be package, i.e. must not
contain an ``__init__.py``.
@@ -193,7 +193,7 @@ replicates ``PYTHONPATH`` from the leader to every worker.
Toil Appliance
--------------

The term Toil Appliance refers to the Mesos Docker image that Toil uses to simulate the machines in the virtual mesos
The term Toil Appliance refers to the ubuntu-based Docker image that Toil uses to simulate the machines in the virtual
cluster. It's easily deployed, only needs Docker, and allows for workflows to be run in single-machine mode and for
clusters of VMs to be provisioned. To specify a different image, see the Toil :ref:`envars` section. For more
information on the Toil Appliance, see the :ref:`runningAWS` section.
21 changes: 14 additions & 7 deletions docs/gettingStarted/quickStart.rst
@@ -32,14 +32,14 @@ Toil uses batch systems to manage the jobs it creates.

The ``singleMachine`` batch system is primarily used to prepare and debug workflows on a
local machine. Once validated, try running them on a full-fledged batch system (see :ref:`batchsysteminterface`).
Toil supports many different batch systems such as `Apache Mesos`_ and Grid Engine; its versatility makes it
Toil supports many different batch systems such as `Kubernetes`_ and Grid Engine; its versatility makes it
easy to run your workflow in all kinds of places.

Toil is totally customizable! Run ``python helloWorld.py --help`` to see a complete list of available options.

For something beyond a "Hello, world!" example, refer to :ref:`runningDetail`.

.. _Apache Mesos: https://mesos.apache.org/getting-started/
.. _Kubernetes: https://kubernetes.io/

.. _cwlquickstart:

@@ -279,7 +279,7 @@ workflow there is always one leader process, and potentially many worker process

When using the single-machine batch system (the default), the worker processes will be running
on the same machine as the leader process. With full-fledged batch systems like
Mesos the worker processes will typically be started on separate machines. The
Kubernetes the worker processes will typically be started on separate machines. The
boilerplate ensures that the pipeline is only started once---on the leader---but
not when its job functions are imported and executed on the individual workers.

@@ -394,8 +394,10 @@ Also! Remember to use the :ref:`destroyCluster` command when finished to destro
#. Launch a cluster in AWS using the :ref:`launchCluster` command::

(venv) $ toil launch-cluster <cluster-name> \
--clusterType kubernetes \
--keyPairName <AWS-key-pair-name> \
--leaderNodeType t2.medium \
--nodeTypes t2.medium -w 1 \
--zone us-west-2a

The arguments ``keyPairName``, ``leaderNodeType``, and ``zone`` are required to launch a cluster.
@@ -448,8 +450,10 @@ Also! Remember to use the :ref:`destroyCluster` command when finished to destro
#. First launch a node in AWS using the :ref:`launchCluster` command::

(venv) $ toil launch-cluster <cluster-name> \
--clusterType kubernetes \
--keyPairName <AWS-key-pair-name> \
--leaderNodeType t2.medium \
--nodeTypes t2.medium -w 1 \
--zone us-west-2a

#. Copy ``example.cwl`` and ``example-job.yaml`` from the :ref:`CWL example <cwlquickstart>` to the node using
@@ -462,24 +466,25 @@ Also! Remember to use the :ref:`destroyCluster` command when finished to destro

(venv) $ toil ssh-cluster --zone us-west-2a <cluster-name>

#. Once on the leader node, it's a good idea to update and install the following::
#. Once on the leader node, command line tools such as ``kubectl`` will be available to you. It's also a good idea to
update and install the following::

sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get -y dist-upgrade
sudo apt-get -y install git
sudo pip install mesos.cli

#. Now create a new ``virtualenv`` with the ``--system-site-packages`` option and activate::

virtualenv --system-site-packages venv
source venv/bin/activate

#. Now run the CWL workflow::
#. Now run the CWL workflow with the kubernetes batch system::

(venv) $ toil-cwl-runner \
--provisioner aws \
--jobStore aws:us-west-2a:any-name \
--batchSystem kubernetes \
--jobStore aws:us-west-2:any-name \
/tmp/example.cwl /tmp/example-job.yaml

.. tip::
@@ -498,6 +503,8 @@ Also! Remember to use the :ref:`destroyCluster` command when finished to destro
Running a Workflow with Autoscaling - Cactus
---------------------------------------------------

.. TODO: change to use a kubernetes cluster.

`Cactus <https://github.com/ComparativeGenomicsToolkit/cactus>`__ is a reference-free, whole-genome multiple alignment
program that can be run on any of the cloud platforms Toil supports.

32 changes: 23 additions & 9 deletions docs/running/cloud/amazon.rst
@@ -99,21 +99,27 @@ during the computation of a workflow, first set up and configure an account with
the installed version that you are using if you're using a different version): ::

$ TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:5.3.0 \
toil launch-cluster clustername \
toil launch-cluster <cluster-name> \
--clusterType kubernetes \
--leaderNodeType t2.medium \
--nodeTypes t2.medium -w 1 \
--zone us-west-1a \
--keyPairName id_rsa

To further break down each of these commands:

**TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:latest** --- This is optional. It specifies a mesos docker image that we maintain with the latest version of toil installed on it. If you want to use a different version of toil, please specify the image tag you need from https://quay.io/repository/ucsc_cgl/toil?tag=latest&tab=tags.
**TOIL_APPLIANCE_SELF=quay.io/ucsc_cgl/toil:latest** --- This is optional. It specifies a ubuntu-based docker image that we maintain with the latest version of toil installed on it. If you want to use a different version of toil, please specify the image tag you need from https://quay.io/repository/ucsc_cgl/toil?tag=latest&tab=tags.

**toil launch-cluster** --- Base command in toil to launch a cluster.

**clustername** --- Just choose a name for your cluster.
**<cluster-name>** --- Just choose a name for your cluster.

**--clusterType kubernetes** --- Specify the type of cluster to coordinate and execute your workflow. Kubernetes is the recommended option.

**--leaderNodeType t2.medium** --- Specify the leader node type. Make a t2.medium (2CPU; 4Gb RAM; $0.0464/Hour). List of available AWS instances: https://aws.amazon.com/ec2/pricing/on-demand/

**--nodeTypes t2.medium -w 1** --- Specify the worker node type and the number of worker nodes to launch. The kubernetes cluster requires at least 1 worker node.

**--zone us-west-1a** --- Specify the AWS zone you want to launch the instance in. Must have the same prefix as the zone in your awscli credentials (which, in the example of this tutorial is: "us-west-1").

**--keyPairName id_rsa** --- The name of your key pair, which should be "id_rsa" if you've followed this tutorial.
@@ -162,7 +168,7 @@ Getting started with the provisioner is simple:
`here <http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html#cli-config-files>`__.

The Toil provisioner is built around the Toil Appliance, a Docker image that bundles
Toil and all its requirements (e.g. Mesos). This makes deployment simple across
Toil and all its requirements (e.g. Kubernetes). This makes deployment simple across
platforms, and you can even simulate a cluster locally (see :ref:`appliance_dev` for details).

.. admonition:: Choosing Toil Appliance Image
@@ -182,12 +188,14 @@ Details about Launching a Cluster in AWS
----------------------------------------

Using the provisioner to launch a Toil leader instance is simple using the ``launch-cluster`` command. For example,
to launch a cluster named "my-cluster" with a t2.medium leader in the us-west-2a zone, run ::
to launch a kubernetes cluster named "my-cluster" with a t2.medium leader in the us-west-2a zone, run ::

(venv) $ toil launch-cluster my-cluster \
--clusterType kubernetes \
--leaderNodeType t2.medium \
--nodeTypes t2.medium -w 1 \
--zone us-west-2a \
--keyPairName <your-AWS-key-pair-name>
--keyPairName <AWS-key-pair-name>

The cluster name is used to uniquely identify your cluster and will be used to
populate the instance's ``Name`` tag. Also, the Toil provisioner will
@@ -234,9 +242,12 @@ change. This is in contrast with :ref:`Autoscaling`.
To launch worker nodes alongside the leader we use the ``-w`` option::

(venv) $ toil launch-cluster my-cluster \
--clusterType kubernetes \
--leaderNodeType t2.small -z us-west-2a \
--keyPairName your-AWS-key-pair-name \
--nodeTypes m3.large,t2.micro -w 1,4
--keyPairName <AWS-key-pair-name> \
--nodeTypes m3.large,t2.micro -w 1,4 \
--zone us-west-2a


This will spin up a leader node of type t2.small with five additional workers --- one m3.large instance and four t2.micro.

@@ -264,6 +275,8 @@ look like ::
Running a Workflow with Autoscaling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. TODO: change to use kubernetes. But the kubernetes batch system doesn't support autoscaling?

Autoscaling is a feature of running Toil in a cloud whereby additional cloud instances are launched to run the workflow.
Autoscaling leverages Mesos containers to provide an execution environment for these workflows.

@@ -276,6 +289,7 @@ Autoscaling leverages Mesos containers to provide an execution environment for t
#. Launch the leader node in AWS using the :ref:`launchCluster` command: ::

(venv) $ toil launch-cluster <cluster-name> \
Contributor Author

Does the kubernetes batch system not support autoscaling? I tried this with --clusterType=kubernetes, but when I run:

python sort.py aws:us-west-2:<job-store-name> \
          --provisioner aws \
          --batchSystem kubernetes \
          --nodeTypes t2.medium \
          --maxNodes 2

I get:

[2023-03-24T23:04:25+0000] [scaler    ] [E] [toil.provisioners.clusterScaler] Exception encountered in scaler thread. Making a best-effort attempt to keep going, but things may go wrong from now on.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/toil/provisioners/clusterScaler.py", line 1149, in tryRun
    self.scaler.updateClusterSize(estimatedNodeCounts)
  File "/usr/local/lib/python3.10/dist-packages/toil/provisioners/clusterScaler.py", line 754, in updateClusterSize
    newNodeCount = self.setNodeCount(instance_type, estimatedNodeCount, preemptible=nodeShape.preemptible)
  File "/usr/local/lib/python3.10/dist-packages/toil/provisioners/clusterScaler.py", line 802, in setNodeCount
    raise RuntimeError('Non-scalable batch system abusing a scalable-only function.')
RuntimeError: Non-scalable batch system abusing a scalable-only function.

It does look like the kubernetes batch system doesn't implement AbstractScalableBatchSystem.

Are there other ways to dynamically spin up nodes that I am not aware of?

Member

@adamnovak Apr 13, 2023

It looks like our Kubernetes implementation supports scaling up and down through Kubernetes's cluster autoscaler (which you get when you pass a range of numbers instead of a single number for the number of nodes you want of a given type, when launching the cluster).

Running the Toil-integrated autoscaler as part of the workflow needs the AbstractScalableBatchSystem functions to e.g. drain nodes to safely scale them away. The Kubernetes cluster autoscaler is supposed to take care of all of that inside of Kubernetes and not inside the individual workflows.
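
The distinction described in this thread can be sketched in a few lines of Python. This is only an illustration: every class and function name below is a hypothetical stand-in rather than Toil's actual API, and only the error message is copied from the traceback above. It assumes the workflow-side scaler checks whether the batch system offers node-draining hooks before it resizes anything.

```python
from abc import ABC, abstractmethod


class BatchSystem(ABC):
    """Hypothetical minimal batch system interface (not Toil's real class)."""

    @abstractmethod
    def issue_job(self, command: str) -> int:
        """Submit a job and return its ID."""


class ScalableBatchSystem(BatchSystem):
    """Hypothetical extension with the hooks a workflow-side autoscaler needs."""

    @abstractmethod
    def drain_node(self, node_address: str) -> None:
        """Stop scheduling onto a node so it can be removed safely."""


class ExternallyScaledBatchSystem(BatchSystem):
    """Stand-in for a batch system whose cluster is resized by an outside autoscaler."""

    def __init__(self) -> None:
        self._next_id = 0

    def issue_job(self, command: str) -> int:
        # Jobs can still be submitted; only node management is out of scope.
        self._next_id += 1
        return self._next_id


def set_node_count(batch_system: BatchSystem, count: int) -> int:
    """Mimic the guard in the traceback: refuse batch systems without scaling hooks."""
    if not isinstance(batch_system, ScalableBatchSystem):
        raise RuntimeError('Non-scalable batch system abusing a scalable-only function.')
    return count
```

Under these assumptions, a batch system like `ExternallyScaledBatchSystem` runs jobs normally, but any attempt to drive node counts from inside the workflow fails, which matches the behavior reported above when `--provisioner aws` is combined with `--batchSystem kubernetes`.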

--clusterType mesos \
--keyPairName <AWS-key-pair-name> \
--leaderNodeType t2.medium \
--zone us-west-2a
@@ -382,7 +396,7 @@ For example, to launch a Toil cluster with a Kubernetes scheduler, run: ::
--provisioner=aws \
--clusterType kubernetes \
--zone us-west-2a \
--keyPairName wlgao@ucsc.edu \
Contributor Author

I forgot to replace this in the last PR.

--keyPairName <AWS-key-pair-name> \
--leaderNodeType t2.medium \
--leaderStorage 50 \
--nodeTypes t2.medium -w 1-4 \
9 changes: 6 additions & 3 deletions docs/running/cloud/cloud.rst
@@ -7,9 +7,11 @@ Running in the Cloud
Toil supports Amazon Web Services (AWS) and Google Compute Engine (GCE) in the cloud and has autoscaling capabilities
that can adapt to the size of your workflow, whether your workflow requires 10 instances or 20,000.

Toil does this by creating a virtual cluster with `Apache Mesos`_. `Apache Mesos`_ requires a leader node to coordinate
Toil does this by creating a virtual cluster with `Kubernetes`_. `Kubernetes`_ requires a leader node to coordinate
the workflow, and worker nodes to execute the various tasks within the workflow. As the workflow runs, Toil will
"autoscale", creating and terminating workers as needed to meet the demands of the workflow.
"autoscale", creating and terminating workers as needed to meet the demands of the workflow. Historically, Toil has
spun up clusters with `Apache Mesos`_, but it is no longer the recommended way to coordinate and execute tasks within
the workflow.

Once a user is familiar with the basics of running toil locally (specifying a :ref:`jobStore <jobStoreOverview>`, and
how to write a toil script), they can move on to the guides below to learn how to translate these workflows into cloud
@@ -25,12 +27,13 @@ distributed over several nodes. The provisioner also has the ability to automati
the cluster to handle dynamic changes in computational demand (autoscaling). Currently we have working provisioners
with AWS and GCE (Azure support has been deprecated).

Toil uses `Apache Mesos`_ as the :ref:`batchSystemOverview`.
Toil uses `Kubernetes`_ as the :ref:`batchSystemOverview`.

See here for instructions for :ref:`runningAWS`.

See here for instructions for :ref:`runningGCE`.

.. _Kubernetes: https://kubernetes.io/
.. _Apache Mesos: https://mesos.apache.org/gettingstarted/

.. _cloudJobStore:
13 changes: 12 additions & 1 deletion src/toil/utils/toilLaunchCluster.py
@@ -39,7 +39,8 @@ def create_tags_dict(tags: List[str]) -> Dict[str, str]:
def main() -> None:
parser = parser_with_common_options(provisioner_options=True, jobstore_option=False)
parser.add_argument("-T", "--clusterType", dest="clusterType",
choices=['mesos', 'kubernetes'], default='mesos',
choices=['mesos', 'kubernetes'],
default=None, # TODO: change default to "kubernetes" when we are ready.
help="Cluster scheduler to use.")
parser.add_argument("--leaderNodeType", dest="leaderNodeType", required=True,
help="Non-preemptible node type to use for the cluster leader.")
@@ -160,6 +161,16 @@ def main() -> None:
raise RuntimeError(f'Please provide a value for --zone or set a default in the '
f'TOIL_{options.provisioner.upper()}_ZONE environment variable.')

if options.clusterType == "mesos":
logger.warning('You are using a "mesos" cluster, which is no longer recommended as Toil is '
'transitioning to using a kubernetes-based cluster. Consider switching to '
'--clusterType=kubernetes.')

if options.clusterType is None:
logger.warning('Argument --clusterType is not set... using "mesos" as the cluster scheduler. '
'Starting in the next version of Toil, the default cluster scheduler will be '
'set to "kubernetes" if the cluster type is not specified.')
options.clusterType = "mesos"

logger.info('Creating cluster %s...', options.clusterName)
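
The pattern in this hunk, a `None` default that separates "flag not given" from an explicit `mesos`, can be tried in isolation with the sketch below. It is a standalone approximation rather than the code from `toilLaunchCluster.py`: the function name `resolve_cluster_type` and the constant `LEGACY_DEFAULT` are made up for illustration, and the warning texts are shortened.

```python
import argparse
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)

# Assumption: mirrors the fallback value used in the diff above.
LEGACY_DEFAULT = "mesos"


def resolve_cluster_type(argv: Optional[List[str]] = None) -> str:
    """Parse --clusterType and apply the transitional default with a warning."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-T", "--clusterType", dest="clusterType",
                        choices=["mesos", "kubernetes"],
                        default=None,  # None means the user did not choose
                        help="Cluster scheduler to use.")
    options = parser.parse_args(argv)

    if options.clusterType is None:
        # Flag omitted: warn about the upcoming default change, then fall back.
        logger.warning('--clusterType is not set; using "%s" for now.', LEGACY_DEFAULT)
        return LEGACY_DEFAULT

    if options.clusterType == "mesos":
        # Explicit choice of the legacy scheduler: warn, but respect it.
        logger.warning('"mesos" clusters are no longer recommended; '
                       'consider --clusterType=kubernetes.')
    return options.clusterType
```

Keeping the default as `None` instead of `'mesos'` is what lets the two cases produce different warnings: a user who never passed the flag is told the default will change, while a user who asked for `mesos` explicitly is only nudged toward `kubernetes`.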
