This project is an effort to create an open-source implementation of a task execution engine based on the TES standard, distributing task executions to services exposing a Pulsar application. For more details on TES, see the Task Execution Schemas documentation. Pulsar is a Python server application that allows a Galaxy server to run jobs on remote systems. The original intention of this project was to modify the Pulsar project (e.g. via forking) so that its REST API would be compatible with the TES standard. Later it was decided to instead create a separate microservice, decoupled from Pulsar, that implements the TES standard and distributes the execution of TES tasks to Pulsar applications.
The most straightforward way to deploy TESP is with Docker Compose:

```shell
docker compose up -d --build
```

Depending on your Docker and Docker Compose installation, you may need to use `docker-compose` (with a hyphen) instead.
You might encounter a timeout error in the container runtime, which can be solved by a correct `mtu` configuration, either in `docker-compose.yaml`:

```yaml
networks:
  default:
    driver: bridge
    driver_opts:
      com.docker.network.driver.mtu: 1442
```

or directly in your `/etc/docker/daemon.json`:

```json
{
  "mtu": 1442
}
```
The `docker-compose.yaml` also spins up a collection of Data Transfer Services which can be used for testing.
Once TESP is running, you can try submitting a task. One way is to use cURL. Although the project is still in development, TESP should be compatible with TES, so you can also try TES clients such as Snakemake or Nextflow. The example below shows how to submit a task using cURL.
First, prepare a JSON file with the task. For inspiration you can use the tests located in this repository, or the TES documentation.
Example JSON file:

```json
{
  "inputs": [
    {
      "url": "http://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.5.5.tar.xz",
      "path": "/data/kernel.tar.gz",
      "type": "FILE"
    }
  ],
  "executors": [
    {
      "image": "ubuntu:20.04",
      "command": [
        "/bin/sha1sum",
        "./kernel.tar.gz"
      ],
      "workdir": "/data/",
      "stdout": "/tmp/stdout.log",
      "stderr": "/tmp/stderr.log"
    }
  ]
}
```
Adjust the URL of the running TES instance and the name of the task file you just created as needed:

```shell
curl http://localhost:8080/v1/tasks -X POST -H "Content-Type: application/json" -d "$(sed -e 's/ //g' example.json | tr -d '\n')"
```

(The only reason for the subshell is to remove whitespace and newlines; quoting it keeps the JSON as a single argument.)
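Alternatively, the same submission can be scripted with the Python standard library. The sketch below is illustrative: the `submit_task` helper is not part of TESP API; only the `/v1/tasks` endpoint and the `id` field of the response come from the examples in this readme.

```python
import json
import urllib.request

def submit_task(task: dict, base_url: str = "http://localhost:8080") -> str:
    """POST a TES task document and return the ID of the created task."""
    req = urllib.request.Request(
        f"{base_url}/v1/tasks",
        data=json.dumps(task).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]

# json.dumps with compact separators already yields a single minified line,
# so no sed/tr cleanup is needed:
task = {"executors": [{"image": "alpine", "command": ["echo", "hello"]}]}
payload = json.dumps(task, separators=(",", ":"))
```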
After the task is submitted, the endpoint returns the task ID. This is useful for checking the task status, and there are several endpoints for doing so.
List all tasks:

```shell
curl "http://localhost:8080/v1/tasks"
```

Check the status of a specific task (enter your task ID):

```shell
curl "http://localhost:8080/v1/tasks/<id>?view=FULL"
```
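For longer-running tasks, the status check can be turned into a small polling loop. This is a sketch using only the Python standard library; the helper names are illustrative, while the endpoint and the terminal state names come from the TES specification.

```python
import json
import time
import urllib.request

def task_url(task_id: str, base_url: str = "http://localhost:8080") -> str:
    """Build the FULL-view status URL for a task."""
    return f"{base_url}/v1/tasks/{task_id}?view=FULL"

def wait_for_task(task_id: str, timeout: float = 300, interval: float = 5) -> dict:
    """Poll GET /v1/tasks/{id} until the task reaches a terminal TES state."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen(task_url(task_id)) as resp:
            task = json.load(resp)
        if task.get("state") in {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}:
            return task
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
```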
The repository contains a `docker-compose.yaml` file with the infrastructure setup for the current functionality, which can be used to immediately start the project in a DEVELOPMENT environment. This is convenient for users and contributors, as there is no need to manually install and configure all the services which TESP API requires to be fully functional. While this is the easiest way to start the project, "first timers" are recommended to follow this readme to understand all the services and tools used across the project. Also have a detailed look at the Current Docker services section of this readme before starting up the infrastructure for the first time.
!! DISCLAIMER:
The project is currently in the development phase and should not be used in production environments yet. If you really wish to set up a production environment despite the missing features, tests etc., the following contents will show what needs to be done.
You can work purely with docker and docker-compose instead of starting the project locally without docker. In that case, only those two dependencies are relevant for you.
| dependency | version | note |
|---|---|---|
| docker | 20.10.0+ | latest is preferred |
| docker-compose | 1.28.0+ | - |
| python | 3.10.0+ | - |
| pip | python3 | 21.3.1 in case of problems |
| poetry | 1.1.13+ | `pip install poetry` |
| mongodb | 4.4+ | docker-compose uses latest |
| pulsar | 0.14.13 | actively trying to support latest. Must have access to docker on the same host as the pulsar application itself |
| ftp server | - | no real recommendation here. docker-compose uses ftpserver, so a local alternative should support the same ftp commands |
TESP API uses dynaconf for its configuration, which is currently set up via the `./settings.toml` file. This file declares sections which represent different environments for TESP API. The default section is currently used for local development without docker. All the properties from the default section are also propagated to the other sections, unless they are overridden in the specific section itself. For example, if the following `settings.toml` file is used:
```toml
[default]
db.mongodb_uri = "mongodb://localhost:27017"
logging.level = "DEBUG"

[dev-docker]
db.mongodb_uri = "mongodb://tesp-db:27017"
```
then the dev-docker environment will also use the property `logging.level = "DEBUG"`, while `db.mongodb_uri` gets overridden to the URL of mongodb in the docker environment. The dev-docker section in the current `./settings.toml` file is set up to support `./docker-compose.yaml` for the development infrastructure.
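The propagation rule can be illustrated with a small sketch that mimics dynaconf's layering. This is only an illustration of the behavior described above, not dynaconf's actual implementation:

```python
# Mimics dynaconf's section layering: values from [default] propagate
# into a named section unless that section overrides them.
DEFAULT = {"db.mongodb_uri": "mongodb://localhost:27017", "logging.level": "DEBUG"}
SECTIONS = {"dev-docker": {"db.mongodb_uri": "mongodb://tesp-db:27017"}}

def effective_settings(profile: str) -> dict:
    merged = dict(DEFAULT)                     # start from every default value
    merged.update(SECTIONS.get(profile, {}))   # apply section-specific overrides
    return merged

settings = effective_settings("dev-docker")
# logging.level is inherited from [default]; db.mongodb_uri is overridden
```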
To apply a different environment (i.e. to switch which section will be picked by TESP API), the environment variable FASTAPI_PROFILE must be set to the name of that section (e.g. FASTAPI_PROFILE=dev-docker, as can be seen in `./docker/tesp_api/Dockerfile`).
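For a local (non-docker) run, that looks roughly as follows; the uvicorn command mirrors the one used elsewhere in this readme and is left commented here:

```shell
# Select which settings.toml section TESP API reads (see the Dockerfile):
export FASTAPI_PROFILE=dev-docker
# ...then start the application as usual, e.g.:
#   poetry run uvicorn tesp_api.tesp_api:app --host localhost --port 8000
```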
You can have a look at `./docker-compose.yaml` to see what the infrastructure for development should look like. Of course, you can configure those services in your preferred way if you are going to start the project without docker, or if you are trying to create an environment other than development, but some things must remain as they are. For example, TESP API currently supports communication with Pulsar only through its REST API, and therefore Pulsar must be configured accordingly.
All the Docker services used when the project is started with docker-compose share the common directory `./docker` for configurations, data, logs and Dockerfiles where required. docker-compose should run out of the box, but occasionally a problem with privileges may occur, for example while trying to create a data folder for a given service. Such issues should be easy to resolve manually. Always look into `./docker-compose.yaml` to see which directories need to be mapped, which ports are used, etc. The following services are currently defined by `./docker-compose.yaml`:
- tesp-api - this project itself; depends on mongodb
- tesp-db - MongoDB instance for the persistence layer
- pulsar_rest - Pulsar configured to use the REST API, with access to a docker instance thanks to DIND
- pulsar_amqp - currently disabled; will be used in future development
- ftpserver - online storage for TES tasks' input/output content
- minio - currently acting only as a storage backend for the ftpserver, with a simple web interface to access data
The folder ./docker/minio/initial_data contains startup folders for the minio service which must be copied into the ./docker/minio/data folder before starting up the infrastructure. This data configures minio to start with an already created bucket and a user which will be used by ftpserver for access.
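A sketch of that preparation step as shell commands, assuming they are run from the repository root (the paths come from the layout described above; the existence check just makes the snippet safe to re-run):

```shell
# Copy minio's seed folders into its data directory before first startup.
src=./docker/minio/initial_data
dst=./docker/minio/data
mkdir -p "$dst"
if [ -d "$src" ]; then
  cp -r "$src"/. "$dst"/
fi
```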
This project uses Poetry for dependency management and packaging. Poetry makes it easy to install the libraries required by TESP API. It uses the `./pyproject.toml` file to obtain the current project orchestration configuration. Poetry automatically creates a virtualenv, so the application can be run immediately. You can use the command `poetry config virtualenvs.in-project true`, which globally configures virtualenv directories to be created directly in the project instead of the default cache folder. Then all you need to do to run TESP API, deployed to uvicorn for example, is:

```shell
poetry install
poetry run uvicorn tesp_api.tesp_api:app --reload --host localhost --port 8000
```
Otherwise, as already mentioned, you can use docker-compose to start the whole development infrastructure. The service representing TESP API is configured to mount this project's sources as a volume, and TESP API is run with the very same command as mentioned above. Therefore, any changes made to the sources in this repository are immediately applied to the docker service as well, enabling live reloading, which makes development within the docker environment very easy.

```shell
docker-compose up -d
```
docker-compose sets up the whole development infrastructure. There are two important endpoints to explore if you wish to execute some TES tasks. Before doing anything, don't forget to run the docker-compose logs command to see whether each service initialized properly or whether any errors occurred.
- http://localhost:8080/ - redirects to the Swagger documentation of TESP API. This endpoint also currently acts as a frontend; you can use it to execute the REST calls expected by TESP API. Swagger is automatically generated from the sources, and therefore it corresponds to the very current state of the TESP API interface.
- http://localhost:40949/ - the minio web interface. Use the admin and !Password123 credentials to log in. Make sure that the bucket tesp-ftp is already present; otherwise see the Current Docker services section of this readme to properly prepare the infrastructure before startup.
This section demonstrates the execution of a simple TES task which calculates the md5sum hash of a given input. There are more approaches to handling I/O in TES, but the main goal here is to demonstrate the ftp server as well.
- Head over to http://localhost:40949/buckets/tesp-ftp/browse and upload a new file with your preferred name and content (e.g. name holy_file and content Hello World!). This file will now be accessible through the ftpserver service and will be used as the input file for this demonstration.
- Go to http://localhost:8080/ and use the POST /v1/tasks request to create the following TES task (the task is sent in the request body). In "inputs.url", replace <file_uploaded_to_minio> with the file name you chose in the previous step. If the HTTP status of the returned response is 200, the response body will contain the id of the created task, which will be used to reference this task later on.
```json
{
  "inputs": [
    {
      "url": "ftp://ftpserver:2121/<file_uploaded_to_minio>",
      "path": "/data/file1",
      "type": "FILE"
    }
  ],
  "outputs": [
    {
      "path": "/data/outfile",
      "url": "ftp://ftpserver:2121/outfile-1",
      "type": "FILE"
    }
  ],
  "executors": [
    {
      "image": "alpine",
      "command": [
        "md5sum"
      ],
      "stdin": "/data/file1",
      "stdout": "/data/outfile"
    }
  ]
}
```
- Use the GET /v1/tasks/{id} request to view the task you have created, with the id from the response obtained in the previous step. This request also supports the view query parameter, which can be used to limit the view of the task. By default, TESP API returns the MINIMAL view, which only includes the id and state of the requested task. Wait until the task state is set to COMPLETE or one of the error states. In the case of an error state, depending on its type, the error will either be part of the task logs in the response (use the FULL view), or you can inspect the logs of the TESP API service, where the error should be logged with a respective message.
- Once the task completes, head back to http://localhost:40949/buckets/tesp-ftp/browse, where you should find the uploaded outfile-1 containing the output of the executed md5sum. You can play around by creating different tasks; just be sure to only use functionality which is currently supported - see Known limitations. For example, you can omit inputs.url and instead use inputs.content, which allows you to create the input in place, or you can omit outputs and executors.stdout, in which case the output will be present in logs.logs.stdout, as the executor is no longer configured to redirect stdout into a file.
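For instance, a variant of the task above that uses inputs.content instead of inputs.url, and lets the output land in the task logs rather than an FTP upload, could look like this (the shape follows the same TES task schema as the example above; adapt it to your needs):

```json
{
  "inputs": [
    {
      "content": "Hello World!",
      "path": "/data/file1",
      "type": "FILE"
    }
  ],
  "executors": [
    {
      "image": "alpine",
      "command": ["md5sum"],
      "stdin": "/data/file1"
    }
  ]
}
```

With no executors.stdout configured, the digest should appear in logs.logs.stdout of the FULL task view.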
| Domain | Limitation |
|---|---|
| Pulsar | TESP API communicates with Pulsar only through its REST API; functionality for message queues is missing |
| Pulsar | TESP API should be able to dispatch executions to multiple Pulsar services via different types of Pulsar interfaces. Currently, only one Pulsar service is supported |
| Pulsar | Pulsar must be "polled" for job state. Preferably, Pulsar should notify TESP API about state changes. This is already the default behavior when using Pulsar with message queues |
| TES | Canceling a TES task does not immediately stop the task. A task cannot even be canceled while it is running |
| TES | TES does not mandate specific URL schemes for file transfer (e.g. tasks inputs.url). Only FTP is supported for now |
| TES | Task inputs.type and outputs.type can be either DIRECTORY or FILE. Only FILE is supported; DIRECTORY will lead to undefined behavior for now |
| TES | Task resources currently do not change execution behavior in any way. This configuration will take effect once the Pulsar limitations are resolved |
| TES | Task executors.workdir and executors.env functionality is not yet implemented. You can use them, but they will have no effect |
| TES | Task volumes and tags functionality is not yet implemented. You can use them, but they will have no effect |
| TES | Task logs.outputs functionality is not yet implemented. However, this limitation can be bypassed with task outputs |
The current main branch is origin/main, which also happens to be the release branch for now. Developers should typically derive their own feature branches, e.g. feature/TESP-111-task-monitoring. This project has no CI/CD configured yet; releases are done manually by creating a tag on the current release branch. No issue tracking software is configured yet either, but for any possible future integration, this project should reference commits, branches, PRs etc. with the prefix TESP-0 as a reference to work done before such integration. Pull requests should be merged using the Squash and merge option with the message format Merge pull request #<PRnum> from <branch-name>.
Since there is no CI/CD setup, this is only an opinionated view on how branching policies should work, and for now everything is possible.
Copyright (c) 2022 Norbert Dopjera
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.