Adds System sub-module for HPC Wisteria #176

Merged
merged 19 commits into from
Aug 15, 2023


bch0w
Member

@bch0w bch0w commented Aug 15, 2023

Relevant Issue: #161

This PR adds support for Wisteria, the U. Tokyo cluster, which is a Fujitsu system with its own workload management software.

Changelog:

  • system.Cluster
    • Makes the location of the submit and run scripts a Class variable which can be overwritten.
    • This does not affect the existing Workflow sub-modules, but allows the Fujitsu sub-module to override this and point to a different script.
  • system.Fujitsu:
    • Creates a Fujitsu system sub-module which inherits from Cluster.
    • Acts as the general system-interaction framework, similar to Slurm or PBS
    • Allows for scheduler-specific sub-arguments such as rscgrp (resource group)
    • Also contains the architecture for submitting jobs and monitoring the queue (using pjstat)
  • system.Wisteria:
    • Inherits from Fujitsu to provide more specific arguments for the Wisteria system.
  • Custom run and submit scripts for Wisteria:
    • These custom scripts are required because, on Wisteria, compute nodes do not inherit the login node's environment, so everything (e.g., modules, Conda environment) must be reloaded before submitting a workflow or running a job
    • Paths, environment and loaded modules are hard-coded for the Kyoto group and are not generalized
    • Making this more general would likely require generating these scripts on the fly, or building them from paths/parameters defined in the parameter file
    • For now, we leave this somewhat hard-coded to get research problems going
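
The class-variable override described in the changelog can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the attribute names (`submit_workflow`, `run_functions`), the script paths, and the `script_paths` helper are all hypothetical.

```python
class Cluster:
    """Generic cluster system. Script locations are class variables."""

    # Hypothetical default locations, for illustration only
    submit_workflow = "scripts/submit"
    run_functions = "scripts/run"

    def script_paths(self):
        """Return the (submit, run) script paths used by this system."""
        return self.submit_workflow, self.run_functions


class Fujitsu(Cluster):
    """Fujitsu-style system: adds scheduler-specific arguments like rscgrp."""

    def __init__(self, rscgrp=None):
        self.rscgrp = rscgrp


class Wisteria(Fujitsu):
    """Overrides the class variables to point at custom scripts."""

    # Hypothetical custom script names that reload modules and Conda first
    submit_workflow = "scripts/custom_submit_wisteria"
    run_functions = "scripts/custom_run_wisteria"


print(Cluster().script_paths())
print(Wisteria(rscgrp="debug").script_paths())
```

Because the paths are looked up on the class, existing sub-modules that do not override them keep the defaults, while Wisteria picks up its own scripts without changing any method in the parent classes.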

Notes from System.Wisteria docstring:

  • Wisteria Caveat 1: On Wisteria you cannot submit batch jobs from compute nodes and you cannot SSH from compute nodes (Manual 5.13), so the master job must be run from the login node or the pre-post node (Manual 5.2.3)
  • Wisteria Caveat 2: On Wisteria, the login node Conda environment is not inherited by compute nodes, so Wisteria requires custom submit and run scripts which first load the correct modules and then run the corresponding script
  • Wisteria Caveat 3: On Wisteria, command line arguments for the submit and run scripts, normally input like '--key value', interfere with the batch submission command pjsub. Instead, we use the pjsub '-x' flag, which sets environment variables, and use these in place of command line arguments
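
Caveat 3 can be illustrated with a short sketch that builds a pjsub call passing values through '-x' and then recovers them from the environment inside the job. The function names, variable names and script name are hypothetical, and the exact option syntax (comma-separated KEY=VALUE pairs for '-x', and '-L rscgrp=...' for the resource group) is an assumption that should be checked against the Wisteria manual.

```python
import os
import shlex


def build_pjsub_command(script, env, rscgrp=None):
    """Build a pjsub call that passes `env` via '-x' instead of '--key value'."""
    cmd = ["pjsub"]
    if rscgrp is not None:
        # Assumed form of the resource-group option
        cmd += ["-L", f"rscgrp={rscgrp}"]
    if env:
        # Assumed form: comma-separated KEY=VALUE pairs after '-x'
        pairs = ",".join(f"{key}={value}" for key, value in env.items())
        cmd += ["-x", pairs]
    cmd.append(script)
    return cmd


def read_job_arguments(keys, environ=None):
    """Inside the run script: recover the arguments from environment variables."""
    environ = os.environ if environ is None else environ
    return {key: environ[key] for key in keys}


# Hypothetical variable names and script name
cmd = build_pjsub_command(
    "custom_run_wisteria", {"FUNCS_NAME": "forward", "TASK_ID": 3}, rscgrp="debug"
)
print(shlex.join(cmd))
```

The submit side only ever hands pjsub a script name plus scheduler options, so nothing that looks like a '--key value' pair reaches pjsub's own argument parser.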

bch0w added 19 commits March 2, 2023 08:47
…hat is: submit jobs one by one and track individual job ids rather than submitting one array job"
…ored in root dir. instead this is defined once at the top of cluster script and inherited by all child classes.

this also allows wisteria to override this and set custom run and submit scripts which allow for activating conda environment after submitted job to scheduler
… get around inability to pass command line arguments in pjsub command and inability to inherit conda env from login node
… for array jobs but not all systems run array style
@bch0w bch0w merged commit 596c4a2 into devel Aug 15, 2023
@bch0w bch0w deleted the feature-system_wisteria branch August 15, 2023 20:46
@bch0w bch0w mentioned this pull request Jan 18, 2024
@bch0w bch0w mentioned this pull request May 9, 2024
3 tasks