infrastructure_playbook_repo.md

Generic introduction

  • This repository contains the Ansible playbooks, roles, and related files for the European Galaxy server. Jenkins uses it to deploy the server's infrastructure, and all configuration changes to Galaxy EU are made through this repository.
  • The Galaxy EU compute infrastructure runs on the BW OpenStack cloud. At the time of writing (21/02/2023), our cloud comprises 8488 vCPUs, 44.6 TB RAM, and 162.6 TB storage. In addition, a few petabytes of storage are mounted in the cloud over NFS.
  • The compute infrastructure (cloud cluster; Galaxy worker nodes) is configured through the VGCN infrastructure repo, where we define which cloud images to use, the size of the cloud cluster, the number of VMs, the cloud network, the cloud security groups, etc.
  • The cloud (VMs for group members and non-Galaxy worker nodes) is configured through this infrastructure repo using Terraform. The underlying cloud hardware, storage, network, etc. are managed by the compute center of the University of Freiburg. For DNS records we use Amazon's Route53.
  • Some documentation on services and IT operations is available in this operations repo.
  • For Galaxy Admin training you can refer here
  • For monitoring of the Galaxy EU infrastructure we use Grafana. The dashboards are available here

Ansible

  • Ansible is an open-source software provisioning, configuration management, and application-deployment tool enabling infrastructure as code.
  • The basic components of Ansible can be found here. Understanding them is essential to following this repo.
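
To make those components concrete, here is a minimal sketch of a play with two tasks and a handler. It is not taken from this repo: the group name `webservers`, the package, and the handler name are illustrative assumptions.

```yaml
---
# Hypothetical example; none of these names come from this repo.
- name: Install and start Nginx      # a play maps a group of hosts to tasks
  hosts: webservers                  # assumed group defined in the inventory
  become: true                       # run tasks with escalated privileges
  tasks:
    - name: Install the nginx package
      ansible.builtin.package:
        name: nginx
        state: present
      notify: Restart nginx          # handler runs only if this task changed something

    - name: Ensure nginx is running and enabled at boot
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

  handlers:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
```

A role packages tasks, handlers, templates, and default variables like these into a reusable unit that a play can list under `roles:`.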

Our repo structure

  • files: contains files/configs that are used by the playbooks/roles
  • group_vars and host_vars: contain variables specific to the groups and hosts defined in the inventory file. Every playbook has an associated group_vars/host_vars file that defines the variables used by the playbook and by the roles it includes or imports.
  • roles: contains roles that are maintained only in this repo, which is not typical for Ansible. All other roles are installed during deployment from the requirements.yaml file.
  • secret_group_vars: This is our vault. It contains the passwords and other sensitive information that is used by the playbooks/roles. The files are encrypted using Ansible Vault.
  • templates: contains the templates that are used by the playbooks/roles. The templates are used to generate the final configuration files that are used by the services. The templates are written in Jinja2 syntax.
  • ansible.cfg: contains the configuration of Ansible. This is used to define the location of the inventory file, the vault password file, etc.
  • hosts: contains the inventory of the hosts that are managed by Ansible.
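
As a rough sketch of how these directories fit together, a playbook for a hypothetical `example` host group might look like the following. The playbook name, group name, file paths, and the variable `example_listen_port` are all assumptions made for illustration. Only the role names `hostname` and `usegalaxy-eu.dynmotd` appear in this document, and their use here is not meant to reflect any actual playbook.

```yaml
---
# example.yml: hypothetical playbook in the repo root
- name: Example service
  hosts: example                          # assumed group from the `hosts` inventory
  become: true
  vars_files:
    - secret_group_vars/example.yml       # assumed vault-encrypted variables file
  pre_tasks:
    - name: Render a config file from a Jinja2 template
      ansible.builtin.template:
        src: templates/example/service.conf.j2   # assumed template path
        dest: /etc/example-service.conf
        mode: "0644"
  roles:
    - hostname                            # local role from the roles directory
    - usegalaxy-eu.dynmotd                # role installed from requirements.yaml
```

```yaml
---
# group_vars/example.yml: hypothetical variables file, loaded automatically
# for every host in the `example` group
example_listen_port: 8080                 # could be referenced in the template as {{ example_listen_port }}
```

Ansible loads `group_vars` files by matching the file name to the inventory group, whereas files under `secret_group_vars` have to be listed explicitly (here via `vars_files`) and are decrypted with the vault password.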

Playbooks

The playbooks are located in the root directory of the repo.

The playbooks are:

Roles

Our locally maintained Ansible roles are located in the roles directory. We also maintain several other roles, each in its own GitHub repository; they can be found in our organization, and most of them are published on Ansible Galaxy. In addition to our own roles, we use roles from the galaxyproject. All the non-local roles we use are listed in our requirements.yaml file. These roles can be installed by running the following command:

ansible-galaxy install -r requirements.yaml
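
For orientation, entries in a role requirements file generally take one of the two shapes sketched below. This is a hypothetical excerpt, not the content of our requirements.yaml: the version pins and the Git URL are placeholders.

```yaml
---
# Hypothetical excerpt; the real entries and versions live in requirements.yaml.

# A role pulled from Ansible Galaxy by its namespace.role_name
- src: geerlingguy.docker
  version: 1.2.3                    # placeholder version pin

# A role pulled straight from a Git repository
- src: https://github.com/example-org/ansible-autofs   # placeholder URL
  name: usegalaxy-eu.autofs         # name the role is installed under
  version: main                     # branch, tag, or commit to check out
```

Pinning `version` keeps deployments reproducible, since `ansible-galaxy install` otherwise fetches whatever is current at install time.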

Roles in use

  • Separate repo: Whether the role has its own repo or is a local role available only in the infrastructure_playbook repo
  • Still being used: Whether the role is included/imported in any of the above listed playbooks
Roles Separate repo Still being used Description
devops.tomcat7 ✔️ Installs Tomcat 7 on RedHat/CentOS Linux servers
dj-wasabi.telegraf ✔️ Installs and configures telegraf
docker Installs and configures docker; sets up SSL certificates
galaxyprojectdotorg.proftpd ✔️ Installs, configures, and manages the proftpd (FTP) server.
geerlingguy.haproxy Installs HAProxy
geerlingguy.nginx Installs and configures Nginx
hostname ✔️ Sets the system's hostname
htcondor Installs and configures HTCondor
hxr.admin-tools ✔️ Installs some admin packages via the package manager and stops the firewalld service if it is installed
hxr.api-check Installs a bash script to check the http status
hxr.apollo ✔️ Installs and configures a genome annotation web-based editor
hxr.autofs ✔️ Installs autofs and adds autofs configuration to mount needed NFS shares (NOTE: This should be merged/replaced with usegalaxy-eu.autofs role at some point)
hxr.autofs-format-n-mount Copies a script to format a certain disk and mount it
hxr.aws-cli ✔️ Creates AWS directory (~/.aws) and deploys (copies) AWS credentials to a given system user account
hxr.dns Sets DNS entries using route53 and refreshes the certbot certificates if domains have changed
hxr.docker-ssl Configures docker to use SSL certificates
hxr.docker-ssl-client Adds SSL certs to the user home directory
hxr.exclude-repo Excludes a given list of YUM repositories
hxr.galaxy-cron Adds cron jobs for cleaning up docker (via prune) and for cleaning up condor held jobs
hxr.galaxy-echo-tool Add a custom nagios "echo" tool to the Galaxy tool directory
hxr.galaxy-log-dir Creates galaxy log directory if it does not exist
hxr.galaxy-nonreproducible-tools ✔️ Clones temporary tools repo to the Galaxy's custom tools directory
hxr.grafana-gitter-bridge ✔️ Configures a bridge between Grafana and Gitter
hxr.gx-cookie-proxy Sets up and configures the translation of a galaxy session cookie into a remote user identity
hxr.haproxy-error-pages Downloads Galaxy error pages through a bash script (NOTE: Bash script is not found in the role, so can't explain what these error pages are)
hxr.install-to-venv ✔️ Installs Python dependencies into any requested virtual environment
hxr.monitor-cluster ✔️ Adds Condor cluster monitoring scripts and configures Telegraf user to run these scripts
hxr.monitor-cvmfs ✔️ Adds a Telegraf task that monitors the CernVM-FS repos
hxr.monitor-email ✔️ Adds a /var/spool/mail counter script and a Telegraf task to monitor it
hxr.monitor-galaxy NOTE: Tasks file is empty
hxr.monitor-galaxy-journalctl ✔️ Adds a script and a Telegraf task to monitor Galaxy's journalctl logs
hxr.monitor-galaxy-queue Adds a Telegraf task to run gxadmin queries for monitoring the Galaxy queue and workflow-invocation-status
hxr.monitor-squid ✔️ Adds a squid proxy data parser script and a Telegraf task to monitor
hxr.monitor-ssl ✔️ Adds an SSL check script and a Telegraf task to monitor SSL certificates expiry
hxr.postgres-connection ✔️ Adds Galaxy database credentials and required ENVs to the user's bashrc file and creates a ~/.pgpass file as well
hxr.remap-user ✔️ Remaps system user's UID and GID to the tomcat user/group
hxr.replace-galaxy-user ✔️ Creates a system user and group with 999 as UID and GID
hxr.sentry Installs and configures Sentry service
hxr.simple-nagios Installs a few "simple" scripts that perform various tasks related to Galaxy, Nagios, and SSL certificate checks
hxr.zfs-monit NOTE: Task file does not exist
jasonroyle.rabbitmq Installs and configures RabbitMQ
linuxhq.yum_cron ✔️ Installs yum-cron and adds required configuration
matterircd Sets up a minimal IRC server using Docker that can integrate with mattermost, slack, and mastodon
multinic Adds network config files and configures multiple NICs
multinic-old Same as multinic but without routing config etc.
pgs ✔️ Sets up Probes Public Galaxy Servers (pgs) instances
sentry Sets up Sentry, a real-time event logging and aggregation platform, using Docker
ssh-host-resign ✔️ Copies server CA and signs the Host SSH keys
ssh-host-sign ✔️ Signs the server host key to prevent TOFU for SSH
usegalaxy-eu.bashrc ✔️ Adds ENVs, aliases, etc. to the bashrc file for any given user
usegalaxy-eu.create-user ✔️ Creates a galaxy system user and group
usegalaxy-eu.error-pages ✔️ Copies Nginx's error (404, 502, 503, and 504) pages
usegalaxy-eu.fix-ancient-ftp-data ✔️ Removes old FTP data and adds a cron job to create FTP users
usegalaxy-eu.fix-failing-to-fail-jobs ✔️ Adds a cron job to fix failing to fail jobs
usegalaxy-eu.fix-galaxy-server-dir ✔️ Creates a GDPR compliance log file if it does not exist and creates a symlink of all tools present in /usr/local/tools to <galaxy_server_dir>/dependencies
usegalaxy-eu.fix-missing-api-keys ✔️ Adds a cron job that generates and adds missing API keys for IE (Interactive Environments) users
usegalaxy-eu.fix-oidc ✔️ Adds a cron job that finds all of the OIDC authenticated users that do not have any roles associated to them and fixes them
usegalaxy-eu.fix-stop-ITs ✔️ Adds a cron job that finds the Galaxy ITs running longer than 24 hrs and terminates them
usegalaxy-eu.fix-stuck-handlers ✔️ Adds several cron jobs (sync-to-nfs, restart galaxy handlers systemd service, restart gunicorn systemd service, restart galaxy workflow schedulers systemd service)
usegalaxy-eu.fix-unscheduled-jobs ✔️ Adds a cron job that finds the Galaxy jobs that failed to run (unscheduled) and sets its state to error in the Galaxy database
usegalaxy-eu.fix-unscheduled-workflows ✔️ Adds a cron job (the Ansible task is commented, so it does not create a cron job at the moment) that fixes unscheduled workflows
usegalaxy-eu.fix-user-quotas ✔️ Adds cron jobs that recalculate user quotas and set the ELIXIR quota for ELIXIR users
usegalaxy-eu.galactic-radio-telescope ✔️ Installs and configures Galactic Radio Telescope
usegalaxy-eu.galaxy-cleanup ✔️ Adds a Telegraf task that performs a cleanup of histories/hdas/etc that are older than 60 days
usegalaxy-eu.galaxy-procstat ✔️ Adds Telegraf procstat tasks that collect metrics from processes (Gunicorn, Galaxy handlers, Galaxy workflow schedulers)
usegalaxy-eu.galaxy-slurp ✔️ Adds cron jobs for pulling Galaxy stats (like, how many users were registered as of date X, current user count, current dataset size/distribution/etc.) into InfluxDB using gxadmin's slurp commands
usegalaxy-eu.gapars-galaxy ✔️ Sets up and installs the GAPARS Galaxy webhook
usegalaxy-eu.gie-deployer Creates GIE (Galaxy Interactive Environments) required directories, adds config, etc to deploy GIE
usegalaxy-eu.gie-node-proxy Clones GIE NodeJS proxy configurations and installs Node dependencies and sets up the GIE proxy
usegalaxy-eu.google-verification Adds Google site verification HTML file and adds required Nginx configuration
usegalaxy-eu.grt-client Adds cron jobs that can export and upload data to GRT
usegalaxy-eu.grt-export Adds a cron job that exports data to GRT
usegalaxy-eu.htcondor_release ✔️ Adds a cron job that releases Condor jobs that are in hold state (also removes jobs in hold state that are resubmitted more than two times)
usegalaxy-eu.jenkins-ssh-key ✔️ Creates SSH directory and adds a key to the Jenkins user
usegalaxy-eu.log-cleaner ✔️ Adds a cron job to clean up old journalctl logs of the gunicorn and galaxy handlers services
usegalaxy-eu.logrotate ✔️ Adds logrotate configuration for galaxy and atop logs
usegalaxy-eu.monitoring ✔️ Adds Telegraf tasks for monitoring NFS shares access times, and to collect NFS stats
usegalaxy-eu.plausible ✔️ Clones Plausible Analytics setup and adds the configuration and starts the service using Docker
usegalaxy-eu.remap-user Remaps system user and group with UID and GID 999 to a different UID and GID so Galaxy user can be created with 999 UID and GID
usegalaxy-eu.rsync-to-nfs ✔️ Adds and executes a script that rsyncs the Galaxy root directory to an NFS location
usegalaxy-eu.subdomain-themes ✔️ Adds custom subdomain themes (HTML and CSS files)
usegalaxy-eu.tours ✔️ Clones Galaxy tours repo
usegalaxy-eu.webhooks ✔️ Clones webhooks repo
usegalaxy-eu.vgcn-monitoring ✔️ Adds VGCN monitoring python script and a Telegraf configuration file
dev-sec.os-hardening ✔️ ✔️ Now part of the devsec.hardening collection. This role provides numerous security-related configurations, giving the system all-round base protection.
dev-sec.ssh-hardening ✔️ ✔️ Now part of the devsec.hardening collection. This role provides secure ssh-client and ssh-server configurations.
devops.tomcat7 ✔️ ✔️ Installs Tomcat 7 on RedHat/CentOS Linux servers
dj-wasabi.telegraf ✔️ ✔️ Installs and configures Telegraf
galaxyproject.galaxy ✔️ ✔️ Installs and configures Galaxy
galaxyproject.cvmfs ✔️ ✔️ Installs and configures CernVM-FS (CVMFS), particularly for Galaxy servers.
galaxyproject.proftpd ✔️ ✔️ Installs, configures, and manages the proftpd (FTP) server.
usegalaxy_eu.ansible_nginx_upload_module ✔️ ✔️ Role for building the Nginx upload module
usegalaxy-eu.nginx ✔️ ✔️ Role for installing and managing nginx servers
galaxyproject.nginx ✔️ ✔️ Role for installing and managing nginx servers
galaxyproject.postgresql ✔️ Role for installing and managing PostgreSQL servers
usegalaxy-eu.ansible-postgresql ✔️ ✔️ Role for installing and managing PostgreSQL servers
geerlingguy.docker ✔️ ✔️ Role for installing Docker
geerlingguy.java ✔️ ✔️ Role for installing Java
geerlingguy.jenkins ✔️ ✔️ Role for installing Jenkins CI
geerlingguy.repo-epel ✔️ ✔️ Installs the EPEL repository
influxdata.chrony ✔️ ✔️ Manages the Chrony services on Linux.
linuxhq.yum_cron ✔️ ✔️ Installs yum-cron and adds required configuration
galaxyproject.gxadmin ✔️ ✔️ Installs and configures gxadmin
usegalaxy-eu.certbot ✔️ Installs and configures Certbot (for Let's Encrypt).
usegalaxy_eu.galaxy_systemd ✔️ ✔️ Copies systemd service files and starts processes for Gunicorn handlers, Galaxy (workflow) handlers, and Celery. Important for configuring those processes.
usegalaxy-eu.dynmotd ✔️ ✔️ Sets up a dynamic message-of-the-day login prompt
cloudalchemy.grafana ✔️ ✔️ Role for provisioning and managing Grafana platform for analytics and monitoring
galaxyproject.tiaas2 ✔️ ✔️ Installs and configures TIaaS (Training Infrastructure as a Service)
usegalaxy-eu.autoupdates ✔️ ✔️ Sets up automatic system updates using dnf-automatic
usegalaxy_eu.htcondor ✔️ ✔️ Role for installing and configuring HTCondor
usegalaxy-eu.update-hosts ✔️ ✔️ Adds a cron job to update the list of computing nodes in an HTCondor-managed cluster
usegalaxy_eu.gie_proxy ✔️ ✔️ Installs and configures the proxy server used by Galaxy for IE (Interactive Environments) / IT (Interactive Tools)
usegalaxy-eu.autofs ✔️ ✔️ Installs autofs and configures mount points for auto mounting
usegalaxy_eu.fs_maintenance ✔️ ✔️ Role for deploying and configuring some common Galaxy file system maintenance routines and also adds cron jobs
galaxyproject.tusd ✔️ ✔️ Installs and configures the tusd server
usegalaxy_eu.rabbitmqserver ✔️ ✔️ Role to deploy and configure a RabbitMQ server using a docker container
usegalaxy_eu.influxdbserver ✔️ ✔️ Role to deploy and configure an InfluxDB server using a docker container
usegalaxy_eu.flower ✔️ ✔️ Role for installing Celery's Web UI Flower.
paprikant.beacon ✔️ ✔️ Role that sets up a running instance of beacon-python, with an accompanying PostgreSQL database
paprikant.beacon-importer ✔️ ✔️ Sets up Beacon importer and adds a cron job for the import task
galaxyproject.miniconda ✔️ ✔️ Role for installing and managing Miniconda installation and Conda environments
usegalaxy_eu.tpv_auto_lint ✔️ ✔️ Adds a TPV (Total Perspective Vortex) lint script that automatically lints all TPV YAML files