Transfer the project from the SCC to the NHR partition #16

Merged · 28 commits (transfer-to-nhr-partition → main) · Aug 27, 2024

Commits
a693602
add nhr hosts and storages
MehmedGIT Aug 26, 2024
0656a95
adapt: .env variables
MehmedGIT Aug 26, 2024
5c6824b
adapt: hpc utility tests
MehmedGIT Aug 26, 2024
0c1e540
adapt: hpc scc to nhr
MehmedGIT Aug 26, 2024
abd68e2
fix: hpc __init__
MehmedGIT Aug 26, 2024
1a45ae0
refactor: remove hpc utils file
MehmedGIT Aug 26, 2024
e08bf5a
adapt: batch scripts
MehmedGIT Aug 26, 2024
602b510
improve: use standard96s partition, wrapper
MehmedGIT Aug 26, 2024
0d36a34
adapt: broker workers
MehmedGIT Aug 26, 2024
dc171ed
fix: server request model
MehmedGIT Aug 26, 2024
b6dc3cf
fix: determine project env
MehmedGIT Aug 26, 2024
0e1821b
enable: the last 2 hpc tests
MehmedGIT Aug 26, 2024
a62764a
add: put_batch_script
MehmedGIT Aug 26, 2024
2c8a3fb
increase integration test resources
MehmedGIT Aug 26, 2024
352723c
fix: singularity -> apptainer
MehmedGIT Aug 26, 2024
7198391
refactor: improve code, remove constants
MehmedGIT Aug 26, 2024
56ee9f1
fix freeze issue: use EmmyPhase2, not EmmyPhase3
MehmedGIT Aug 27, 2024
0fe8f51
get rid of extra echos in batch
MehmedGIT Aug 27, 2024
238ec87
remove: print debugging
MehmedGIT Aug 27, 2024
989ca7e
clean local node temp data
MehmedGIT Aug 27, 2024
f566081
increase rabbitmq and database healthcheck timeout
MehmedGIT Aug 27, 2024
41e5137
try fixing: pulling from PR not main
MehmedGIT Aug 27, 2024
6865881
Merge branch 'main' into transfer-to-nhr-partition
MehmedGIT Aug 27, 2024
2ac80f3
Merge branch 'main' into transfer-to-nhr-partition
MehmedGIT Aug 27, 2024
456debb
enable all utils tests
MehmedGIT Aug 27, 2024
ca824f5
fix: avoid name collisions in tests
MehmedGIT Aug 27, 2024
1771887
utilize: force command to check job status
MehmedGIT Aug 27, 2024
2ee8edb
release: v2.15.0
MehmedGIT Aug 27, 2024
Files changed
3 changes: 1 addition & 2 deletions .env
@@ -4,8 +4,7 @@ OPERANDI_DB_ROOT_PASS=db_operandi
 OPERANDI_DB_URL=mongodb://db_operandi:db_operandi@localhost:27017
 OPERANDI_HARVESTER_DEFAULT_USERNAME=harvester_operandi
 OPERANDI_HARVESTER_DEFAULT_PASSWORD=harvester_operandi
-OPERANDI_HPC_USERNAME=mmustaf
-OPERANDI_HPC_PROJECT_USERNAME=u11874
+OPERANDI_HPC_PROJECT_USERNAME=u12198
 OPERANDI_HPC_SSH_KEYPATH=/home/mm/.ssh/gwdg-cluster
 OPERANDI_HPC_PROJECT_NAME=operandi
 OPERANDI_LOGS_DIR=/tmp/operandi_logs
5 changes: 2 additions & 3 deletions .github/workflows/tests/.env
@@ -4,10 +4,9 @@ OPERANDI_DB_ROOT_PASS=db_operandi
 OPERANDI_DB_URL=mongodb://db_operandi:db_operandi@localhost:27017
 OPERANDI_HARVESTER_DEFAULT_USERNAME=harvester_operandi
 OPERANDI_HARVESTER_DEFAULT_PASSWORD=harvester_operandi
-OPERANDI_HPC_USERNAME=mmustaf
-OPERANDI_HPC_PROJECT_USERNAME=u11874
+OPERANDI_HPC_PROJECT_USERNAME=u12198
 OPERANDI_HPC_SSH_KEYPATH=/home/runner/.ssh/key_hpc
-OPERANDI_HPC_PROJECT_NAME=operandi
+OPERANDI_HPC_PROJECT_NAME=operandi_test_cicd
 OPERANDI_LOGS_DIR=/tmp/operandi_logs
 OPERANDI_RABBITMQ_CONFIG_JSON=./src/rabbitmq_definitions.json
 OPERANDI_RABBITMQ_URL=amqp://operandi_user:operandi_password@localhost:5672/
11 changes: 5 additions & 6 deletions docker-compose.yml
@@ -31,8 +31,8 @@ services:
     healthcheck:
       test: rabbitmq-diagnostics check_port_connectivity
       interval: 1s
-      timeout: 3s
-      retries: 30
+      timeout: 5s
+      retries: 120

   operandi-mongodb:
     image: "mongo"
@@ -50,15 +50,15 @@
     healthcheck:
       test: echo 'db.runCommand("ping").ok' | mongosh localhost:27017/test --quiet
       interval: 1s
-      timeout: 3s
-      retries: 30
+      timeout: 5s
+      retries: 120

   operandi-server:
     image: operandi-server
     container_name: operandi-server
     build:
       context: ./src
-      dockerfile: ./Dockerfile_server
+      dockerfile: ./Dockerfile_server
     depends_on:
       operandi-rabbitmq:
         condition: service_healthy
@@ -102,7 +102,6 @@ services:
       - OPERANDI_DB_NAME=${OPERANDI_DB_NAME}
       - OPERANDI_DB_URL=${OPERANDI_DB_URL}
       - OPERANDI_HPC_PROJECT_NAME=${OPERANDI_HPC_PROJECT_NAME}
-      - OPERANDI_HPC_USERNAME=${OPERANDI_HPC_USERNAME}
       - OPERANDI_HPC_PROJECT_USERNAME=${OPERANDI_HPC_PROJECT_USERNAME}
       - OPERANDI_HPC_SSH_KEYPATH=/home/root/.ssh/gwdg_hpc_key
       - OPERANDI_LOGS_DIR=${OPERANDI_LOGS_DIR}
9 changes: 4 additions & 5 deletions docker-compose_image_based.yml
@@ -31,8 +31,8 @@ services:
     healthcheck:
       test: rabbitmq-diagnostics check_port_connectivity
       interval: 1s
-      timeout: 3s
-      retries: 30
+      timeout: 5s
+      retries: 120

   operandi-mongodb:
     image: "mongo"
@@ -50,8 +50,8 @@
     healthcheck:
       test: echo 'db.runCommand("ping").ok' | mongosh localhost:27017/test --quiet
       interval: 1s
-      timeout: 3s
-      retries: 30
+      timeout: 5s
+      retries: 120

   operandi-server:
     image: ghcr.io/subugoe/operandi-server:main
@@ -96,7 +96,6 @@
       - OPERANDI_DB_NAME=${OPERANDI_DB_NAME}
      - OPERANDI_DB_URL=${OPERANDI_DB_URL}
      - OPERANDI_HPC_PROJECT_NAME=${OPERANDI_HPC_PROJECT_NAME}
-      - OPERANDI_HPC_USERNAME=${OPERANDI_HPC_USERNAME}
       - OPERANDI_HPC_PROJECT_USERNAME=${OPERANDI_HPC_PROJECT_USERNAME}
       - OPERANDI_HPC_SSH_KEYPATH=/home/root/.ssh/gwdg_hpc_key
       - OPERANDI_LOGS_DIR=${OPERANDI_LOGS_DIR}
3 changes: 1 addition & 2 deletions docker.env
@@ -4,8 +4,7 @@ OPERANDI_DB_ROOT_PASS=db_operandi
 OPERANDI_DB_URL=mongodb://db_operandi:db_operandi@mongo-db-host:27017
 OPERANDI_HARVESTER_DEFAULT_USERNAME=harvester_operandi
 OPERANDI_HARVESTER_DEFAULT_PASSWORD=harvester_operandi
-OPERANDI_HPC_USERNAME=mmustaf
-OPERANDI_HPC_PROJECT_USERNAME=u11874
+OPERANDI_HPC_PROJECT_USERNAME=u12198
 OPERANDI_HPC_SSH_KEYPATH=/home/mm/.ssh/gwdg-cluster
 OPERANDI_HPC_PROJECT_NAME=operandi
 OPERANDI_LOGS_DIR=/tmp/operandi_logs
6 changes: 3 additions & 3 deletions src/broker/operandi_broker/job_status_worker.py
@@ -10,7 +10,7 @@
     DBHPCSlurmJob, DBWorkflowJob, DBWorkspace,
     sync_db_initiate_database, sync_db_get_hpc_slurm_job, sync_db_get_workflow_job, sync_db_get_workspace,
     sync_db_update_hpc_slurm_job, sync_db_update_workflow_job, sync_db_update_workspace)
-from operandi_utils.hpc import HPCExecutor, HPCTransfer
+from operandi_utils.hpc import NHRExecutor, NHRTransfer
 from operandi_utils.rabbitmq import get_connection_consumer


@@ -51,9 +51,9 @@ def run(self):
         signal.signal(signal.SIGTERM, self.signal_handler)

         sync_db_initiate_database(self.db_url)
-        self.hpc_executor = HPCExecutor(tunnel_host='localhost', tunnel_port=self.tunnel_port_executor)
+        self.hpc_executor = NHRExecutor()
         self.log.info("HPC executor connection successful.")
-        self.hpc_io_transfer = HPCTransfer(tunnel_host='localhost', tunnel_port=self.tunnel_port_transfer)
+        self.hpc_io_transfer = NHRTransfer()
         self.log.info("HPC transfer connection successful.")

         self.rmq_consumer = get_connection_consumer(rabbitmq_url=self.rmq_url)
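For orientation, a minimal usage sketch of the switched-in classes, assuming the operandi_utils package is installed and the HPC variables from the .env files above (OPERANDI_HPC_PROJECT_USERNAME, OPERANDI_HPC_SSH_KEYPATH, OPERANDI_HPC_PROJECT_NAME) are set; that the constructors read their configuration from the environment is an assumption inferred from the dropped tunnel arguments:

```python
from operandi_utils.hpc import NHRExecutor, NHRTransfer

# Unlike the SCC-era HPCExecutor/HPCTransfer, the NHR classes take no
# tunnel_host/tunnel_port arguments; connection details are presumably
# resolved from the environment (see the .env changes in this PR).
hpc_executor = NHRExecutor()
hpc_io_transfer = NHRTransfer()
```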
18 changes: 8 additions & 10 deletions src/broker/operandi_broker/worker.py
@@ -10,10 +10,10 @@
 from operandi_utils.database import (
     sync_db_initiate_database, sync_db_get_workflow, sync_db_get_workspace, sync_db_create_hpc_slurm_job,
     sync_db_update_workflow_job, sync_db_update_workspace)
-from operandi_utils.hpc import HPCExecutor, HPCTransfer
+from operandi_utils.hpc import NHRExecutor, NHRTransfer
 from operandi_utils.hpc.constants import (
-    HPC_JOB_DEADLINE_TIME_REGULAR, HPC_JOB_DEADLINE_TIME_TEST, HPC_JOB_QOS_SHORT, HPC_JOB_QOS_DEFAULT
-)
+    HPC_BATCH_SUBMIT_WORKFLOW_JOB, HPC_JOB_DEADLINE_TIME_REGULAR, HPC_JOB_DEADLINE_TIME_TEST, HPC_JOB_QOS_SHORT,
+    HPC_JOB_QOS_DEFAULT)
 from operandi_utils.rabbitmq import get_connection_consumer


@@ -58,9 +58,9 @@ def run(self):
         signal.signal(signal.SIGTERM, self.signal_handler)

         sync_db_initiate_database(self.db_url)
-        self.hpc_executor = HPCExecutor(tunnel_host='localhost', tunnel_port=self.tunnel_port_executor)
+        self.hpc_executor = NHRExecutor()
         self.log.info("HPC executor connection successful.")
-        self.hpc_io_transfer = HPCTransfer(tunnel_host='localhost', tunnel_port=self.tunnel_port_transfer)
+        self.hpc_io_transfer = NHRTransfer()
         self.log.info("HPC transfer connection successful.")

         self.rmq_consumer = get_connection_consumer(rabbitmq_url=self.rmq_url)
@@ -214,8 +214,6 @@ def prepare_and_trigger_slurm_job(
         # self.hpc_io_transfer = HPCTransfer(tunel_host='localhost', tunel_port=4023)
         # self.log.info("HPC transfer connection renewed successfully.")

-        hpc_batch_script_path = self.hpc_io_transfer.put_batch_script(batch_script_id="batch_submit_workflow_job.sh")
-
         try:
             sync_db_update_workspace(find_workspace_id=workspace_id, state=StateWorkspace.TRANSFERRING_TO_HPC)
             sync_db_update_workflow_job(find_job_id=workflow_job_id, job_state=StateJob.TRANSFERRING_TO_HPC)
@@ -228,8 +226,8 @@
         try:
             # NOTE: The paths below must be a valid existing path inside the HPC
             slurm_job_id = self.hpc_executor.trigger_slurm_job(
-                batch_script_path=hpc_batch_script_path, workflow_job_id=workflow_job_id,
-                nextflow_script_path=workflow_script_path, workspace_id=workspace_id, mets_basename=workspace_base_mets,
+                workflow_job_id=workflow_job_id, nextflow_script_path=workflow_script_path,
+                workspace_id=workspace_id, mets_basename=workspace_base_mets,
                 input_file_grp=input_file_grp, nf_process_forks=nf_process_forks, ws_pages_amount=ws_pages_amount,
                 use_mets_server=use_mets_server, file_groups_to_remove=file_groups_to_remove, cpus=cpus, ram=ram,
                 job_deadline_time=job_deadline_time, partition=partition, qos=qos)
@@ -239,7 +237,7 @@
         try:
             sync_db_create_hpc_slurm_job(
                 workflow_job_id=workflow_job_id, hpc_slurm_job_id=slurm_job_id,
-                hpc_batch_script_path=hpc_batch_script_path,
+                hpc_batch_script_path=HPC_BATCH_SUBMIT_WORKFLOW_JOB,
                 hpc_slurm_workspace_path=join(self.hpc_io_transfer.slurm_workspaces_dir, workflow_job_id))
         except Exception as error:
             raise Exception(f"Failed to save the hpc slurm job in DB: {error}")
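To make the signature change concrete, a hedged sketch of the new trigger_slurm_job call. The keyword names come from this diff; all values are illustrative placeholders, and the claim that the executor resolves HPC_BATCH_SUBMIT_WORKFLOW_JOB internally is an assumption based on the dropped batch_script_path argument:

```python
from operandi_utils.hpc import NHRExecutor
from operandi_utils.hpc.constants import HPC_JOB_DEADLINE_TIME_TEST, HPC_JOB_QOS_SHORT

executor = NHRExecutor()
# batch_script_path is no longer passed; the executor is expected to locate
# the submit script (HPC_BATCH_SUBMIT_WORKFLOW_JOB) on its own.
slurm_job_id = executor.trigger_slurm_job(
    workflow_job_id="wj_demo",                     # placeholder
    nextflow_script_path="/hpc/path/workflow.nf",  # placeholder
    workspace_id="ws_demo",                        # placeholder
    mets_basename="mets.xml",
    input_file_grp="DEFAULT",
    nf_process_forks=2, ws_pages_amount=8, use_mets_server=False,
    file_groups_to_remove="", cpus=4, ram=32,
    job_deadline_time=HPC_JOB_DEADLINE_TIME_TEST,
    partition="standard96:shared",                 # the NHR partition used elsewhere in this PR
    qos=HPC_JOB_QOS_SHORT)
```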
4 changes: 2 additions & 2 deletions src/server/operandi_server/models/base.py
@@ -2,7 +2,7 @@
 from typing import Optional

 from operandi_utils import StateJob
-from operandi_utils.hpc.constants import HPC_JOB_DEFAULT_PARTITION
+from operandi_utils.hpc.constants import HPC_NHR_JOB_DEFAULT_PARTITION

 from ..constants import DEFAULT_FILE_GRP, DEFAULT_METS_BASENAME

@@ -35,6 +35,6 @@ class WorkflowArguments(BaseModel):


 class SbatchArguments(BaseModel):
-    partition: str = HPC_JOB_DEFAULT_PARTITION  # partition to be used
+    partition: str = HPC_NHR_JOB_DEFAULT_PARTITION  # partition to be used
     cpus: int = 4  # cpus per job allocated by default
     ram: int = 32  # RAM (in GB) per job allocated by default
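A quick sketch of what the changed default means for API clients that omit partition (standard pydantic defaulting is assumed; the concrete value of HPC_NHR_JOB_DEFAULT_PARTITION lives in operandi_utils.hpc.constants and is not shown in this diff):

```python
from operandi_server.models.base import SbatchArguments

# Requests without an explicit partition now target the NHR default
# partition instead of the old SCC one; cpus/ram defaults are unchanged.
args = SbatchArguments()
print(args.partition)       # value of HPC_NHR_JOB_DEFAULT_PARTITION
print(args.cpus, args.ram)  # 4 32
```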
12 changes: 4 additions & 8 deletions src/utils/operandi_utils/hpc/__init__.py
@@ -1,9 +1,5 @@
-__all__ = [
-    "HPCConnector",
-    "HPCExecutor",
-    "HPCTransfer"
-]
+__all__ = ["NHRConnector", "NHRExecutor", "NHRTransfer"]

-from operandi_utils.hpc.connector import HPCConnector
-from operandi_utils.hpc.executor import HPCExecutor
-from operandi_utils.hpc.transfer import HPCTransfer
+from operandi_utils.hpc.nhr_connector import NHRConnector
+from operandi_utils.hpc.nhr_executor import NHRExecutor
+from operandi_utils.hpc.nhr_transfer import NHRTransfer
@@ -1,17 +1,20 @@
 #!/bin/bash
-#SBATCH --constraint scratch
-#SBATCH --partition medium
-#SBATCH --time 0:05:00
-#SBATCH --output /scratch1/projects/project_pwieder_ocr/batch_job_logs/batch_check_ocrd_all_version_job-%J.txt
+#SBATCH --partition standard96:shared
+#SBATCH --time 00:05:00
+#SBATCH --qos 2h
+#SBATCH --output check_ocrd_all_version_job-%J.txt
 #SBATCH --cpus-per-task 1
-#STABCH --mem 16G
+#SBATCH --mem 16G

 set -e

 hostname
 /opt/slurm/etc/scripts/misc/slurm_resources

 module purge
-module load singularity
+module load apptainer
+SIF_PATH="/mnt/lustre-emmy-hdd/projects/project_pwieder_ocr_nhr/ocrd_all_maximum_image.sif"

-singularity exec "/scratch1/projects/project_pwieder_ocr/ocrd_all_maximum_image.sif" ocrd --version
+apptainer exec "$SIF_PATH" ocrd --version
+apptainer exec "$SIF_PATH" ocrd-tesserocr-recognize --dump-module-dir
+apptainer exec "$SIF_PATH" ls -la /models
@@ -1,23 +1,23 @@
 #!/bin/bash
-#SBATCH --constraint scratch
-#SBATCH --partition medium
-#SBATCH --time 02:00:00
-#SBATCH --output /scratch1/projects/project_pwieder_ocr/batch_job_logs/batch_create_ocrd_all_sif_job-%J.txt
+#SBATCH --partition standard96:shared
+#SBATCH --time 2:00:00
+#SBATCH --output create_ocrd_all_sif_job-%J.txt
 #SBATCH --cpus-per-task 16
 #SBATCH --mem 64G

 set -e

 module purge
-module load singularity
+module load apptainer

 hostname
 /opt/slurm/etc/scripts/misc/slurm_resources

-SINGULARITY_CACHE_DIR="/scratch1/projects/project_pwieder_ocr"
-SIF_NAME="ocrd_all_maximum_image.sif"
+APPTAINER_TMPDIR="$LOCAL_TMPDIR"
+APPTAINER_CACHE_DIR="/mnt/lustre-emmy-hdd/projects/project_pwieder_ocr_nhr"
+SIF_NAME="ocrd_all_maximum_image_new.sif"
 OCRD_ALL_MAXIMUM_IMAGE="docker://ocrd/all:latest"

-cd "${SINGULARITY_CACHE_DIR}" || exit
-singularity build --disable-cache "${SIF_NAME}" "${OCRD_ALL_MAXIMUM_IMAGE}"
-singularity exec "${SIF_NAME}" ocrd --version
+cd "${APPTAINER_CACHE_DIR}" || exit
+apptainer build --disable-cache "${SIF_NAME}" "${OCRD_ALL_MAXIMUM_IMAGE}"
+apptainer exec "${SIF_NAME}" ocrd --version
@@ -1,22 +1,21 @@
 #!/bin/bash
-#SBATCH --constraint scratch
-#SBATCH --partition medium
-#SBATCH --time 06:00:00
-#SBATCH --output /scratch1/projects/project_pwieder_ocr/batch_job_logs/batch_download_all_ocrd_models_job-%J.txt
+#SBATCH --partition standard96:shared
+#SBATCH --time 6:00:00
+#SBATCH --output download_all_ocrd_models_job-%J.txt
 #SBATCH --cpus-per-task 16
-#SBATCH --mem 64G
+#SBATCH --mem 32G

 set -e

 module purge
-module load singularity
+module load apptainer

 hostname
 /opt/slurm/etc/scripts/misc/slurm_resources

 # This sif file is generated with another batch script
-SIF_PATH="/scratch1/projects/project_pwieder_ocr/ocrd_all_maximum_image.sif"
-OCRD_MODELS_DIR="/scratch1/projects/project_pwieder_ocr/ocrd_models"
+SIF_PATH="/mnt/lustre-emmy-hdd/projects/project_pwieder_ocr_nhr/ocrd_all_maximum_image.sif"
+OCRD_MODELS_DIR="/mnt/lustre-emmy-hdd/projects/project_pwieder_ocr_nhr/ocrd_models"
 OCRD_MODELS_DIR_IN_DOCKER="/usr/local/share"

 if [ ! -f "${SIF_PATH}" ]; then
@@ -28,11 +27,9 @@ if [ ! -d "${OCRD_MODELS_DIR}" ]; then
     mkdir -p "${OCRD_MODELS_DIR}"
 fi

-# Download all available ocrd models
-singularity exec --bind "${OCRD_MODELS_DIR}:${OCRD_MODELS_DIR_IN_DOCKER}" "${SIF_PATH}" ocrd resmgr download '*'
-# Download models for ocrd-tesserocr-recognize which are not downloaded with the '*' glob
-singularity exec --bind "${OCRD_MODELS_DIR}:${OCRD_MODELS_DIR_IN_DOCKER}" "${SIF_PATH}" ocrd resmgr download ocrd-tesserocr-recognize '*'
-# Download models for ocrd-kraken-recognize which are not downloaded with the '*' glob
-singularity exec --bind "${OCRD_MODELS_DIR}:${OCRD_MODELS_DIR_IN_DOCKER}" "${SIF_PATH}" ocrd resmgr download ocrd-kraken-recognize '*'
-# Download models for ocrd-calamari-recognize which are not downloaded with the '*' glob
-singularity exec --bind "${OCRD_MODELS_DIR}:${OCRD_MODELS_DIR_IN_DOCKER}" "${SIF_PATH}" ocrd resmgr download ocrd-calamari-recognize '*'
+apptainer exec --bind "${OCRD_MODELS_DIR}:${OCRD_MODELS_DIR_IN_DOCKER}" "${SIF_PATH}" ocrd resmgr download -o '*'
+apptainer exec --bind "${OCRD_MODELS_DIR}:${OCRD_MODELS_DIR_IN_DOCKER}" "${SIF_PATH}" ocrd resmgr download -o ocrd-tesserocr-recognize '*'
+apptainer exec --bind "${OCRD_MODELS_DIR}:${OCRD_MODELS_DIR_IN_DOCKER}" "${SIF_PATH}" ocrd resmgr download -o ocrd-calamari-recognize '*'
+apptainer exec --bind "${OCRD_MODELS_DIR}:${OCRD_MODELS_DIR_IN_DOCKER}" "${SIF_PATH}" ocrd resmgr download -o ocrd-kraken-recognize '*'
+apptainer exec --bind "${OCRD_MODELS_DIR}:${OCRD_MODELS_DIR_IN_DOCKER}" "${SIF_PATH}" ocrd resmgr download -o ocrd-sbb-binarize '*'
+apptainer exec --bind "${OCRD_MODELS_DIR}:${OCRD_MODELS_DIR_IN_DOCKER}" "${SIF_PATH}" ocrd resmgr download -o ocrd-cis-ocropy-recognize '*'