2021_Bash refresher
Today we will start with a short introduction to sequencing technologies, followed by practicals on bash basics split into 2 parts:
- Lecture on YouTube: Introduction to Genomics and Sequencing
- Bash practicals part 1:
- Lecture on YouTube: bash / unix / bioinformatics tutorial for beginners part 1
- Jump to General bash introduction
- Bash practicals part 2:
- Lecture on YouTube: bash / unix / bioinformatics tutorial for beginners part 2
- Jump to Working on the cluster
The command line:
After opening your terminal your command line should look something like this:
[yourname@computer_name ~]$
This is the prompt, telling you who you are and what device/server you are on, followed by the $ sign, which lets us know that it is ready for our commands.
To get started let's log onto the cluster
Logging on:
First we can log onto the cluster. This can be done using the following command, where $USER is the username you were issued:
ssh $USER@saga.sigma2.no
Just like navigating through folders in your computer's finder/file manager, on the terminal you can move up and down through your file structure.
BUT, just like Maya Angelou says, "you can't really know where you're going until you know where you have been"
We can check our current location with the following command:
pwd
you should see something like /cluster/home/rdekayne
Navigating directories:
We can check whether anything is in our current directory with ls
- for now there probably isn't anything there
ls
Now that we know where we are, we can make a new directory (so that we can move around). The mkdir command lets us make a directory:
mkdir ohknow
And we can move into this new folder using cd
to change directory.
cd ohknow
If we want to go back to our previous directory we can do so using relative filepaths
cd ../
Here the ../
represents moving 'back' or 'up' a directory in our file structure
You can also give an absolute filepath to any directory - complete filepaths from the root directory start with a /
cd /path/to/my/folder/ohknow
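To convince yourself of the difference you can check where you end up after each move - the paths in the comments below are just an example, assuming you made ohknow in a home directory like the one shown earlier:
pwd                                  # e.g. /cluster/home/rdekayne/ohknow
cd ../                               # relative path: move up one level
pwd                                  # e.g. /cluster/home/rdekayne
cd /cluster/home/rdekayne/ohknow     # absolute path: works from anywhere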
Making and moving files:
First let's make a file - if we use touch
it will make an empty file
touch file1
check it is there
ls
we can also see the size of files with
ls -lh
Then we can move this file between folders using mv
mv file1 ../
Or back to our ohknow folder (the one we are currently in), using . which means 'here':
mv ../file1 .
For more complicated scripts I usually use Vim, which is a bit less user-friendly but has lots of nice features. To try vim, type vim followed by your filename:
vim test_file
this will open this new file in the vim editor - to add text to the file you need to type i
to put vim in 'insert' mode
i
One of the most intimidating things about vim is getting stuck inside it - the most common way to exit the editor once you're done typing or pasting your code is to press ESC on your keyboard, then type :wq and press enter.
Loops:
For many things in bioinformatics a loop might be helpful - in terms of making, moving or altering files
Let's add some files to the folder - files 2-5:
for i in {2..5}
do
#here we will get the loop to 'echo' the variable i so we know what number we are on
echo ${i}
touch file${i}
done
Now we can use mv to rename them in a loop - adding ohknow_
at the beginning of the filename - this time we could use i and the numbers 1-5 but we can also use the wildcard *
to specify all files in the current directory ./
for file in file*
do
echo ${file}
mv ${file} ohknow_${file}
#in the following line '>>' appends the information to the file on the right - note that '>' will make a new file that would be overwritten each time
echo '>fastaheader1' >> ohknow_${file}
echo 'acgggttttttaaaaa' >> ohknow_${file}
echo '>fastaheader2' >> ohknow_${file}
echo 'ttttctcacggtgaacgggttttttaaaaa' >> ohknow_${file}
done
We can now take a look at one of the files we made with less
- to exit less just press q
on your keyboard
less ohknow_file1
Or we can look at the whole file by 'cat'ing the file to standard-output
cat ohknow_file1
If we redirect the output to a new file, cat can also be used to combine files:
cat ohknow_file4 ohknow_file5 > ohknow_file6
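As a quick sanity check, each of the two input files was given 4 lines (2 headers and 2 sequences) by our loop, so the combined file should contain 8 lines:
wc -l ohknow_file6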
Pattern matching - sed:
sed can act like a find and replace
We can do a find and replace to modify the fasta headers, for example, and redirect the output to a new file.
Here is the syntax for sed: sed 's/replace_this/with_this/g' file > new_file
sed 's/>/>genome_2/g' ohknow_file2 > ohknow_file2_newheaders
If you're 100% sure you want to edit the file in place without making a new one, we can use the -i flag in sed:
sed -i 's/>/>genome_3/g' ohknow_file3
Some more tips and tricks for sed can be found here: https://www.tecmint.com/linux-sed-command-tips-tricks/
Pattern matching - grep:
We can take a look at certain lines using grep - to search for a pattern/word/phrase. To get our fasta headers, let's search for >:
grep ">" ohknow_file1
count the lines
wc -l ohknow_file1
we can count the sequences in our combined file by looking for lines with a >
and piping the output to wc
grep ">" ohknow_file6 | wc -l
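As a small aside, grep also has a -c flag that counts matching lines directly, so the same number can be obtained without the pipe to wc:
grep -c ">" ohknow_file6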
What if we want to extract the sequence associated with the read/chromosome fastaheader1? Use -A1 to extract one line trailing the line we searched for:
grep -A1 "fastaheader1" ohknow_file1
We can also use grep and pipe '|' commands together
If we want the above command but don't want to have the fasta header in the output we can use grep -v
to include only lines without the phrase
grep -A1 "fastaheader1" ohknow_file1 | grep -v ">"
Some more grep tips can be found here: https://ryanstutorials.net/linuxtutorial/cheatsheetgrep.php
Manipulating data frames - awk:
To take a look at a few examples of awk - a very powerful (if a bit beginner-unfriendly) language - we will make a new file.
This file will have individuals we have sampled and some phenotypic measures.
echo 'Individual Species Lake Head_length Body_length' > sample_info.txt
echo '1 C_heglingus Zurich 40 100' >> sample_info.txt
echo '4 C_heglingus Zurich 43 99' >> sample_info.txt
echo '3 C_heglingus Zurich 38 102' >> sample_info.txt
echo '8 C_duplex Zurich 50 110' >> sample_info.txt
echo '9 C_duplex Zurich 47 105' >> sample_info.txt
echo '10 C_duplex Zurich 35 90' >> sample_info.txt
echo '28 C_profundus Thun 20 80' >> sample_info.txt
Now convert spaces to tabs so that we can use awk easily (you can specify the delimiter but to keep it simple here we will keep it at tabs)
sed -i 's/ /\t/g' sample_info.txt
We can use awk to do a number of things - first we can extract a single column e.g. the individual ID number
awk '{print $1}' sample_info.txt
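In the command above awk split the columns on whitespace, which is its default behaviour; since we converted the file to tabs we could also be explicit about the delimiter with the -F flag (here printing the Species column):
awk -F'\t' '{print $2}' sample_info.txt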
Now we want to only keep samples from Lake Zurich
awk '{if($3=="Zurich") print $0}' sample_info.txt > Zurich_samples.txt
Then we can add a new column with standard length - a combination of the last two lengths
awk '{$6 = ($4+$5); print}' Zurich_samples.txt > Zurich_samples_SL.txt
Say every fish with a standard length below 140 is a juvenile; we can make a new file containing only the adults:
awk '{if($6 >= 140) {print $0}}' Zurich_samples_SL.txt > Zurich_samples_SL_adults.txt
Awk cheat sheets can be found here:
https://www.shortcutfoo.com/app/dojos/awk/cheatsheet
https://bl831.als.lbl.gov/~gmeigs/scripting_help/awk_cheat_sheet.pdf
Writing and running a basic script:
To make a script, use Vim to create the following file, called my_first_script.sh:
#!/bin/sh
# here the script will return 'Hello world'
echo "Hello world"
Now exit vim and make it executable with chmod
and run
chmod +x my_first_script.sh
./my_first_script.sh
Now let's try a more complex script - this time the script will tell us how many lines are in each file in our directory
Again, open vim and save the following script as count_lines.sh
#!/bin/bash
for file in ohknow_file*
do
echo ${file} '-line count is:'
wc -l ${file}
done
Now make it executable and run as before
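Spelled out, that is:
chmod +x count_lines.sh
./count_lines.sh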
A very comprehensive bash cheat sheet can be found here: https://devhints.io/bash
Logging on:
First we can log onto the cluster. This can be done using the following command, where $USER is the username you were issued:
ssh $USER@saga.sigma2.no
You will then be prompted to enter your password
Cluster basics:
This cluster is run using slurm - information about this specific cluster can be found here: https://documentation.sigma2.no/hpc_machines/hardware_overview.html
If you get stuck trying to run jobs on a cluster it's always a good idea to check out the wiki - it specifies the types of nodes available and the best practices for using them as well as the quirks of the specific computer cluster. This is particularly important for understanding which parameters are needed for the job queueing/submission system.
We can see what jobs are already running:
squeue
You can add additional flags to see specifics:
- -j jobids: show only the specified jobs
- -w nodes: show only jobs on the specified nodes
- -A projects: show only jobs belonging to the specified projects
- -t states: show only jobs in the specified states (pending, running, suspended, etc.)
- -u users: show only jobs belonging to the specified users
and they can be combined
squeue -j 14132,14133 # shows jobs 14132 and 14133
squeue -w c23-11 # shows jobs running on c23-11
squeue -u foo -t PD # shows pending jobs belonging to user 'foo'
squeue -A bar # shows all jobs in the project 'bar'
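For example, to see only your own jobs you can combine the -u flag with the $USER variable that holds your username:
squeue -u $USER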
Submitting a basic script:
Let's try running a similar script as before, but this time on the cluster as a job: my_first_cluster_script.sh
#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Project:
#SBATCH --account=NN9458K
#
# Wall time limit:
#SBATCH --time=00-00:10:00
#
#SBATCH --mem-per-cpu=4000M
echo "hello_world"
and now we can submit the job to the cluster using sbatch
- this will make sure that your job gets submitted with the right resources to the right nodes so it runs successfully
sbatch my_first_cluster_script.sh
This should run and you will see a *.out
file in your directory (the job name is usually in the file name, which can be useful if you run into problems or need to keep track of a batch job). Let's open it and see what information it returns:
less *.out
You'll see that we have our output hello_world
as well as information about the resources we used - in reality we might want to use this information to optimise our computations, or, if the job failed, to see the reason why. It could be insufficient memory/time or an error in our code.
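If you want to look this information up again later, Slurm's accounting command sacct can report it per job (assuming job accounting is enabled on the cluster, as it usually is) - replace the job ID below with the one printed when you ran sbatch:
sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State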
Loading environments:
Often we want to run scripts that make use of other programs rather than just simple bash scripts
One of the easiest ways to do this is to use Conda - conda creates an environment in which everything we need to run a specific program is available.
To set up our environment for conda packages:
ml purge
ml avail conda
ml Miniconda3/4.9.2
Then initialise our environment:
conda init bash
Make sure conda knows which channels to use to download and install new packages:
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
For this exercise we want to install fastp which allows us to run QC on our raw sequencing data:
conda create --name fastp
To add packages to our newly made conda environment we must first activate it
source ~/.bashrc
conda activate fastp
Here you may get an error about not initialising your environment - this should only happen once and if so you should log off and back on - then running conda activate fastp
should work
Then we can install software: the -n flag specifies which conda environment to install into, and the following fastp is the 'recipe' for the fastp package - e.g. from https://anaconda.org/bioconda/fastp
conda install -n fastp fastp
Check that we can now use fastp
fastp -h
To exit from the current environment we can use
conda deactivate
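To see which conda environments you have available (the currently active one is marked with a *) you can list them:
conda env list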
Basic sequencing checks: The data we will be using for this part of the refresher is kept here:
/cluster/projects/nn9458k/oh_know/teachers/kamil/data/sacharomyces
One thing we might want to do is see how many reads we have.
Let's look at the top of one of the files we will use with head:
zcat /cluster/projects/nn9458k/oh_know/teachers/kamil/data/sacharomyces/SRR3265401_1.fastq.gz | head
We see that each read starts with a header beginning with @SRR
so we can count the reads with
zcat /cluster/projects/nn9458k/oh_know/teachers/kamil/data/sacharomyces/*.fastq.gz | grep "@SRR" | wc -l
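Since every fastq record is exactly four lines, an alternative way to count the reads in one of the files is to count lines and divide by four - a useful cross-check in case a pattern like @SRR ever turns up in a quality line:
zcat /cluster/projects/nn9458k/oh_know/teachers/kamil/data/sacharomyces/SRR3265401_1.fastq.gz | wc -l | awk '{print $1/4}'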
Running QC on reads:
Now we can check the quality of our reads using fastp
We will make a script to do this that we will submit to the queue - fastp_my_reads_test.sh
#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Project:
#SBATCH --account=NN9458K
#
# Wall time limit:
#SBATCH --time=00-00:10:00
#
#SBATCH --mem-per-cpu=40000M
eval "$(conda shell.bash hook)"
conda activate /cluster/home/YOUR_USERNAME/.conda/envs/fastp
# Execute the program in some way
fastp -h
Let's submit it and make sure that it works:
sbatch fastp_my_reads_test.sh
And now let's run the fastp filtering with a script where we add the fastp command: run_fastp.sh
#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Project:
#SBATCH --account=NN9458K
#
# Wall time limit:
#SBATCH --time=00-00:10:00
#
#SBATCH --mem-per-cpu=40000M
eval "$(conda shell.bash hook)"
conda activate /cluster/home/YOUR_USERNAME/.conda/envs/fastp
fastp -i /cluster/projects/nn9458k/oh_know/teachers/kamil/data/sacharomyces/*_1.fastq.gz -I /cluster/projects/nn9458k/oh_know/teachers/kamil/data/sacharomyces/*_2.fastq.gz -o /cluster/home/YOUR_USERNAME/out.R1.fq.gz -O /cluster/home/YOUR_USERNAME/out.R2.fq.gz
sbatch run_fastp.sh
You now have your output files and an .html file showing the filtering and QC.
You can now download the html file to your personal computer so that you can open it in a browser:
scp -r YOUR_USERNAME@saga.sigma2.no:/cluster/home/YOUR_USERNAME/*.html ./
If you want to try writing your own script then you can write another submission script to analyse this tse-tse fly data /cluster/projects/nn9458k/oh_know/teachers/kamil/data/tse-tse
Arguments in bash
Sometimes it is useful to be able to define arguments when you submit a script that are then used inside the script.
In this example we will make a short script that returns some information about yourself: participent_info.sh
This script is a bit more complex than before, but more realistic for real analyses.
#!/bin/bash
#SBATCH --job-name=oh_know_participent
#SBATCH --account=NN9458K
#SBATCH --time=00-00:10:00
#SBATCH --mem-per-cpu=1000M
# Submit as follows:
# sbatch participent_info.sh <your first name> <your study genus>
# make a working directory for this job
#this command will make the working directory for the script and the -p flag means it will only make it if it doesn't exist already
SCRATCH=/cluster/home/$USER/me
mkdir -p $SCRATCH
#here we specify our arguments based on the user inputted arguments from the command - $1 will be first, $2 second and so on.
NAME=$1
STUDYSPP=$2
#and here is the bit our script does - appending the info and the arguments to a new file - me.txt
echo "My name is:" >> ${SCRATCH}/me.txt
echo ${NAME} >> ${SCRATCH}/me.txt
echo "I study the genus:" >> ${SCRATCH}/me.txt
echo ${STUDYSPP} >> ${SCRATCH}/me.txt
This time we can run and specify the variables - in my case:
sbatch participent_info.sh Rishi Danaus
Once it is done we can move into our new me/
directory and check the output
less me/me.txt
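If you just want to see how positional arguments behave without submitting a job, you can test the idea directly on the command line - test_args.sh here is just a throwaway name for this sketch:
echo 'echo "argument 1 is: $1 and argument 2 is: $2"' > test_args.sh
bash test_args.sh Rishi Danaus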
The interactive node:
As well as running commands directly on the login node or submitting them as jobs, we can start an interactive session and run them there.
This is particularly useful for more memory/CPU intensive tasks which 'head' or 'login' nodes are not able to handle.
To start an interactive job try using the following - this time note that we use srun
rather than sbatch
srun --ntasks=1 --mem-per-cpu=5G --time=02:00:00 --qos=devel --account=nn9458k --pty bash -i
You should see that your prompt has now changed and instead of showing login
it will show a number - the node which you are now computing from
Now you can try loading a conda module from here - in this example we use Kamil's conda environment, which you will be using later in the workshop
module purge
module load Miniconda3/4.9.2
conda activate /cluster/projects/nn9458k/oh_know/.conda/kmer_tools
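When you are finished with the interactive session, typing exit will end it, free the reserved resources and return you to the login node:
exit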
Different sequencing technologies:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3841808/
More Bash tips/tricks:
https://data-skills.github.io/unix-and-bash/03-bash-scripts/index.html
https://speciationgenomics.github.io/getting_used_to_unix/