
**NOTE** - Please use [`/gscratch/scrubbed/<UW_NetID>`](https://robertslab.github.io/resources/klone_Data-Storage-and-System-Organization/#3-temporary-storage) for running jobs (i.e. writing output files). As the name suggests, this space is periodically scrubbed, so you will need to move files to a "bird" for archival storage. If you need a set of large raw files for analysis, also place these in the `/gscratch/scrubbed/<UW_NetID>` directory.
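
As a quick sketch of what that staging step might look like (the project directory name below is just a placeholder, not a lab convention):

```bash
# Create your personal scratch directory (replace <UW_NetID> with your UW NetID)
mkdir -p /gscratch/scrubbed/<UW_NetID>/my-project

# Stage large raw input files into scratch before running the job
rsync -avP /path/to/raw/reads/ /gscratch/scrubbed/<UW_NetID>/my-project/raw-reads/
```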

---

`sbatch` is the main execution command for the job scheduler ([SLURM](https://slurm.schedmd.com/overview.html)). It spools up a compute node for long-running or compute-intensive tasks such as assemblies, BLASTs, alignments, etc.

`sbatch` can be run from a login node with the following command:


```bash
sbatch <slurm_script_name.sh>
```

`sbatch` requires a shell script to function. The script has two main parts: the header and the execute portion.

## The Header

<pre>
<code>
#!/bin/bash
## Job Name
#SBATCH --job-name=<b>myjob</b>
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=cpu-g2-mem2x
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=<b>dd-hh:mm:ss</b>
## Memory per node
#SBATCH --mem=<b>450G</b>
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER
## Specify the working directory for this job
#SBATCH --chdir=<b>/gscratch/scrubbed/<UW_NetID>/to/your/desired/directory</b>
</code>
</pre>


<b>Bolded sections above must be changed prior to execution.</b> Those sections are described in more detail below.

- `--job-name=`**`myjob`** is an identifier for your current job. It's what shows up in `scontrol` and `squeue` calls (see the monitoring example after this list). Providing a unique-to-you job name can be helpful for distinguishing between different runs, but is not necessary.

- `--time=`**`dd-hh:mm:ss`** is the "wall" time, or how long we are reserving the node for our use. This argument requires some consideration and knowledge about the program you're running prior to execution. Selecting too little wall time will cause the scheduler to kill your process mid-run when time runs out. Selecting too much time limits others' ability to use Hyak, but the scheduler usually releases the node upon program completion, so this is a secondary consideration.

- `--mem=`**`450G`** specifies how much memory (RAM) to allocate to the process. We have a single slice with a maximum of 490GB of RAM. Specifying a value below the maximum allows for some additional processing overhead. Usually, setting this to the maximum is fine, but reserving only what you might need can allow for multiple users to use the slice at the same time.

- `--chdir=`**`/gscratch/scrubbed/<UW_NetID>/to/your/desired/directory`** indicates the working directory where output will be written. All jobs should be executed in your `/gscratch/scrubbed/<UW_NetID>` directory. See the [Data Storage & System Organization section of the wiki](https://github.com/RobertsLab/hyak_mox/wiki/Data-Storage-&-System-Organization) for more info.
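
For reference, once a job has been submitted, the job name and resources set in the header can be checked from a login node (`<job_id>` is the ID reported by `sbatch`/`squeue`):

```bash
# List your jobs currently queued or running
squeue -u "$USER"

# Show full details (node, walltime, working directory) for a specific job
scontrol show job <job_id>
```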

## The Execute portion

This section contains the commands/programs you want executed. You can treat it like the command line, in that it executes commands sequentially as input. These can include program calls, module loading, making directories, etc. However, since Klone relies on [containers](./klone_containers.md) to run software, your SLURM script will _require_ the following:

```bash
# Load modules
module load apptainer

# Execute Roberts Lab bioinformatics container
# Binds home directory
# Binds /gscratch directory
# Directory bindings allow outputs to be written to the hard drive.
apptainer exec \
  --home "$PWD" \
  --bind /mmfs1/home/ \
  --bind /mmfs1/gscratch/ \
  /gscratch/srlab/containers/srlab-bioinformatics-container-<git_commit_hash>.sif \
  <program_name> <program_arguments>
```
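
The `<git_commit_hash>` placeholder must be replaced with the hash from an actual container image. One way to see which images are available (assuming they are stored as `.sif` files in `/gscratch/srlab/containers/`, as shown above) is:

```bash
# List available container images, most recently modified last
ls -ltr /gscratch/srlab/containers/srlab-bioinformatics-container-*.sif
```
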
## SLURM Script Template/Example - Multiple Commands

If you need to execute multiple commands using a container, which will usually be the case, you'll need to place those commands in a separate script.

### Command script example

Here's an example script, called `commands.sh`. This is where we'll set all of our variables and execute various commands/programs we'd like for our analysis:

```bash
#!/bin/bash

# Requires Bash >=4.0, as the script uses associative arrays.

###################################################################################
# These variables need to be set by user

## Number of CPU threads to use for programs (if applicable)
threads=28

## Programs associative array
## Using array is useful for logging program options (see end of script)
declare -A programs_array

programs_array=(
  [bowtie2]="bowtie2" \
  [bowtie2_build]="bowtie2-build" \
  [samtools_index]="samtools index" \
  [samtools_sort]="samtools sort" \
  [samtools_view]="samtools view"
)


## INPUT FILES ##
genome_fasta="./data/C_gigas/genomes/cgig-NCBI-genome.fasta"
genome_name="cgig-NCBI-genome"

###################################################################################

# Capture program options
if [[ "${#programs_array[@]}" -gt 0 ]]; then
echo "Logging program options..."
for program in "${!programs_array[@]}"
do
{
echo "Program options for ${program}: "
echo ""
# Handle samtools help menus
if [[ "${program}" == "samtools_index" ]] \
|| [[ "${program}" == "samtools_sort" ]] \
|| [[ "${program}" == "samtools_view" ]]
then
${programs_array[$program]}

# Handle DIAMOND BLAST menu
elif [[ "${program}" == "diamond" ]]; then
${programs_array[$program]} help

# Handle NCBI BLASTx menu
elif [[ "${program}" == "blastx" ]]; then
${programs_array[$program]} -help

# Handle StringTie prepDE script
elif [[ "${program}" == "prepDE" ]]; then
python3 ${programs_array[$program]} -h
fi
${programs_array[$program]} -h
echo ""
echo ""
echo "----------------------------------------------"
echo ""
echo ""
} &>> program_options.log || true

# If MultiQC is in programs_array, copy the config file to this directory.
if [[ "${program}" == "multiqc" ]]; then
cp --preserve ~/.multiqc_config.yaml multiqc_config.yaml
fi
done
echo "Finished logging programs options."
echo ""
fi


# Document programs in PATH (primarily for program version ID)
echo "Logging system $PATH..."
{
date
echo ""
echo "System PATH for $SLURM_JOB_ID"
echo ""
printf "%0.s-" {1..10}
echo "${PATH}" | tr : \\n
} >> system_path.log
echo "Finished logging system $PATH."
```
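
When this script runs, it writes two log files (`program_options.log` and `system_path.log`) to the job's working directory, which can be reviewed after the job finishes, e.g.:

```bash
# Review the logs written by commands.sh
less program_options.log
less system_path.log
```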

To run the `commands.sh` script above in our container on Klone, we would use the following SLURM script.

This example will perform the following:

- Request the slice assigned to our account (`--account=srlab`)
- Request the partition on the `srlab` slice (`--partition=cpu-g2-mem2x`)
- Set a run time of 10 days (`--time=10-00:00:00`)
- Request 120GB of memory (`--mem=120G`)
- Identify the most recent version of the bioinformatics container to use.
- Run the `commands.sh` script from the bioinformatics container to construct a bowtie2 index of the provided genome FastA.

NOTE: This example assumes that the `commands.sh` script and the SLURM script are in the same directory.
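
Since the container is handed `commands.sh` directly, the script generally needs to be executable (and located where the container can find it). If the job fails with a "permission denied" or "not found" error, this is the first thing to check:

```bash
# Make the command script executable before submitting the job
chmod +x commands.sh
```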

```bash
#!/bin/bash
## Job Name
#SBATCH --job-name=DESCRIPTIVE_JOB_NAME
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=cpu-g2-mem2x
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/$USER/to/your/desired/directory
###################################################################################

# Exit script if any command fails
set -e

# Get most recent container git hash
git_commit_hash=$(find /gscratch/srlab/containers/ \
  -name "srlab-bioinformatics-container*" \
  -printf "%T+ %p\n" \
  | sort -rn \
  | awk -F'[-.]' 'NR == 1 {print $7}')

# Load modules
module load apptainer

# Execute Roberts Lab bioinformatics container
# Binds home directory
# Binds /gscratch directory
# Directory bindings allow outputs to be written to the hard drive.
apptainer exec \
  --home "$PWD" \
  --bind /mmfs1/home/ \
  --bind /mmfs1/gscratch/ \
  /gscratch/srlab/containers/srlab-bioinformatics-container-"${git_commit_hash}".sif \
  commands.sh
```
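
With both scripts in place, the job is submitted from a login node. A sketch of a typical submit-and-check workflow (the SLURM script filename below is just an example):

```bash
# Submit the job; sbatch prints "Submitted batch job <job_id>"
sbatch bowtie2_index_job.sh

# By default, SLURM writes stdout/stderr to slurm-<job_id>.out in the --chdir directory
tail -f slurm-<job_id>.out
```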

## SLURM Script Template/Example - Single Command

This approach is less likely to be used, since many of our analyses require multiple steps, but it is still useful to know.

```bash
#!/bin/bash
## Job Name
#SBATCH --job-name=DESCRIPTIVE_JOB_NAME
## Allocation Definition
#SBATCH --account=srlab
#SBATCH --partition=cpu-g2-mem2x
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=120G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$USER@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/$USER/to/your/desired/directory
###################################################################################

# Exit script if any command fails
set -e

# Get most recent container git hash
git_commit_hash=$(find /gscratch/srlab/containers/ \
  -name "srlab-bioinformatics-container*" \
  -printf "%T+ %p\n" \
  | sort -rn \
  | awk -F'[-.]' 'NR == 1 {print $7}')

# Load modules
module load apptainer

# Execute Roberts Lab bioinformatics container
# Binds home directory
# Binds /gscratch directory
# Directory bindings allow outputs to be written to the hard drive.
apptainer exec \
  --home "$PWD" \
  --bind /mmfs1/home/ \
  --bind /mmfs1/gscratch/ \
  /gscratch/srlab/containers/srlab-bioinformatics-container-"${git_commit_hash}".sif \
  bowtie2-build \
  --threads 28 \
  /gscratch/scrubbed/"$USER"/data/C_gigas/genomes/cgig-ncbi-genome.fasta \
  cgig-ncbi-genome
```
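
If the job completes successfully, `bowtie2-build` writes its index files using the supplied basename to the `--chdir` directory; a quick check might look like:

```bash
# bowtie2 index files use the basename given to bowtie2-build
# (*.bt2, or *.bt2l for very large genomes)
ls -lh cgig-ncbi-genome*.bt2*
```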
