2021_Bash refresher
Today we will start with a short introduction to sequencing technologies, followed by practicals on bash basics split into 2 parts:
- Lecture on YouTube: Introduction to Genomics and Sequencing
- Bash practicals part 1:
- Lecture on YouTube: bash / unix / bioinformatics tutorial for beginners part 1
- Jump to General bash introduction
- Bash practicals part 2:
- Lecture on YouTube: bash / unix / bioinformatics tutorial for beginners part 2
- Jump to Working on the cluster
The command line:
After opening your terminal your command line should look something like this:
[yourname@computer_name ~]$
This is the prompt, telling you who you are and what device/server you are on, followed by the $ sign, which lets us know that it is ready for our commands.
To get started let's log onto the cluster
Logging on:
First we can log onto the cluster. This can be done using the following command, where $USER is the username you were issued:
ssh $USER@saga.sigma2.no
Just like navigating through folders in your computer's finder/file manager, on the terminal you can move up and down through your file structure.
BUT, just like Maya Angelou says, "you can't really know where you're going until you know where you have been"
We can check our current location with the following command:
pwd
you should see something like /cluster/home/rdekayne
Navigating directories:
We can check whether anything is in our current directory with ls
- for now there probably isn't anything there
ls
Now that we know where we are, we can make a new directory (so that we can move around). The mkdir command lets us make a directory:
mkdir ohknow
And we can move into this new folder using cd
to change directory.
cd ohknow
If we want to go back to our previous directory we can do so using relative filepaths
cd ../
Here the ../
represents moving 'back' or 'up' a directory in our file structure
You can also give an absolute filepath to any directory - complete filepaths from the root directory start with a /
cd /path/to/my/folder/ohknow
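To convince yourself of the difference you can check where you end up after each move - the paths in the comments below are just an example, assuming you made ohknow in a home directory like the one shown earlier:
pwd                                  # e.g. /cluster/home/rdekayne/ohknow
cd ../                               # relative path: move up one level
pwd                                  # e.g. /cluster/home/rdekayne
cd /cluster/home/rdekayne/ohknow     # absolute path: works from anywhere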
Making and moving files:
First let's make a file - if we use touch
it will make an empty file
touch file1
check it is there
ls
we can also see the size of files with
ls -lh
Then we can move this file between folders using mv
mv file1 ../
Or back to our ohknow folder (the one we are currently in), using . which means 'here':
mv ../file1 .
For more complicated scripts I usually use Vim, which is a bit less user-friendly but has lots of nice features. To try vim, type vim followed by your filename:
vim test_file
this will open this new file in the vim editor - to add text to the file you need to type i
to put vim in 'insert' mode
i
One of the most intimidating things about vim is getting stuck inside it - the most common way to exit the editor once you're done typing or pasting your code is to press ESC on your keyboard, then type :wq and press enter.
Loops:
For many things in bioinformatics a loop might be helpful - in terms of making, moving or altering files
Let's add some files to the folder - files 2-5:
for i in {2..5}
do
#here we will get the loop to 'echo' the variable i so we know what number we are on
echo ${i}
touch file${i}
done
Now we can use mv to rename them in a loop - adding ohknow_
at the beginning of the filename - this time we could use i and the numbers 1-5 but we can also use the wildcard *
to specify all files in the current directory ./
for file in file*
do
echo ${file}
mv ${file} ohknow_${file}
#in the following line '>>' appends the information to the file on the right - note that '>' will make a new file that would be overwritten each time
echo '>fastaheader1' >> ohknow_${file}
echo 'acgggttttttaaaaa' >> ohknow_${file}
echo '>fastaheader2' >> ohknow_${file}
echo 'ttttctcacggtgaacgggttttttaaaaa' >> ohknow_${file}
done
We can now take a look at one of the files we made with less
- to exit less just press q
on your keyboard
less ohknow_file1
Or we can look at the whole file by 'cat'ing the file to standard-output
cat ohknow_file1
If we redirect the output to a new file, cat can also be used to combine files:
cat ohknow_file4 ohknow_file5 > ohknow_file6
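As a quick sanity check, each of the two input files was given 4 lines (2 headers and 2 sequences) by our loop, so the combined file should contain 8 lines:
wc -l ohknow_file6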
Pattern matching - sed:
sed can act like a find and replace
We can do a find and replace to modify the fasta headers, for example, and redirect the output to a new file.
Here is the syntax for sed: sed 's/replace_this/with_this/g' file > new_file
sed 's/>/>genome_2/g' ohknow_file2 > ohknow_file2_newheaders
If you're 100% sure you want to edit the file in place without making a new one, we can use the -i flag in sed:
sed -i 's/>/>genome_3/g' ohknow_file3
Some more tips and tricks for sed can be found here: https://www.tecmint.com/linux-sed-command-tips-tricks/
Pattern matching - grep:
We can take a look at certain lines using grep - to search for a pattern/word/phrase. To get our fasta headers, let's search for >:
grep ">" ohknow_file1
count the lines
wc -l ohknow_file1
we can count the sequences in our combined file by looking for lines with a >
and piping the output to wc
grep ">" ohknow_file6 | wc -l
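As a small aside, grep also has a -c flag that counts matching lines directly, so the same number can be obtained without the pipe to wc:
grep -c ">" ohknow_file6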
What if we want to extract the sequence associated with the read/chromosome fastaheader1? Use -A1 to extract one line trailing the line we searched for:
grep -A1 "fastaheader1" ohknow_file1
We can also use grep and pipe '|' commands together
If we want the above command but don't want to have the fasta header in the output we can use grep -v
to include only lines without the phrase
grep -A1 "fastaheader1" ohknow_file1 | grep -v ">"
Some more grep tips can be found here: https://ryanstutorials.net/linuxtutorial/cheatsheetgrep.php
Manipulating data frames - awk:
To take a look at a few examples of awk - a very powerful (if a bit beginner-unfriendly) language - we will make a new file.
This file will have individuals we have sampled and some phenotypic measures.
echo 'Individual Species Lake Head_length Body_length' > sample_info.txt
echo '1 C_heglingus Zurich 40 100' >> sample_info.txt
echo '4 C_heglingus Zurich 43 99' >> sample_info.txt
echo '3 C_heglingus Zurich 38 102' >> sample_info.txt
echo '8 C_duplex Zurich 50 110' >> sample_info.txt
echo '9 C_duplex Zurich 47 105' >> sample_info.txt
echo '10 C_duplex Zurich 35 90' >> sample_info.txt
echo '28 C_profundus Thun 20 80' >> sample_info.txt
Now convert spaces to tabs so that we can use awk easily (you can specify the delimiter but to keep it simple here we will keep it at tabs)
sed -i 's/ /\t/g' sample_info.txt
We can use awk to do a number of things - first we can extract a single column e.g. the individual ID number
awk '{print $1}' sample_info.txt
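In the command above awk split the columns on whitespace, which is its default behaviour; since we converted the file to tabs we could also be explicit about the delimiter with the -F flag (here printing the Species column):
awk -F'\t' '{print $2}' sample_info.txt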
Now we want to only keep samples from Lake Zurich
awk '{if($3=="Zurich") print $0}' sample_info.txt > Zurich_samples.txt
Then we can add a new column with standard length - a combination of the last two lengths
awk '{$6 = ($4+$5); print}' Zurich_samples.txt > Zurich_samples_SL.txt
Say every fish with a standard length below 140 is a juvenile; we can make a new file containing only the adults:
awk '{if($6 >= 140) {print $0}}' Zurich_samples_SL.txt > Zurich_samples_SL_adults.txt
Awk cheat sheets can be found here:
https://www.shortcutfoo.com/app/dojos/awk/cheatsheet
https://bl831.als.lbl.gov/~gmeigs/scripting_help/awk_cheat_sheet.pdf
Writing and running a basic script:
To make a script, use Vim to create the following file, called my_first_script.sh:
#!/bin/sh
# here the script will return 'Hello world'
echo "Hello world"
Now exit vim and make it executable with chmod
and run
chmod +x my_first_script.sh
./my_first_script.sh
Now let's try a more complex script - this time the script will tell us how many lines are in each file in our directory
Again, open vim and save the following script as count_lines.sh
#!/bin/bash
for file in ohknow_file*
do
echo ${file} '-line count is:'
wc -l ${file}
done
Now make it executable and run as before
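Spelled out, that is:
chmod +x count_lines.sh
./count_lines.sh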
A very comprehensive bash cheat sheet can be found here: https://devhints.io/bash
Logging on:
First we can log onto the cluster. This can be done using the following command, where $USER is the username you were issued:
ssh $USER@saga.sigma2.no
You will then be prompted to enter your password
Cluster basics:
This cluster is run using slurm - information about this specific cluster can be found here: https://documentation.sigma2.no/hpc_machines/hardware_overview.html
If you get stuck trying to run jobs on a cluster it's always a good idea to check out the wiki - it specifies the types of nodes available and the best practices for using them as well as the quirks of the specific computer cluster. This is particularly important for understanding which parameters are needed for the job queueing/submission system.
We can see what jobs are already running:
squeue
You can add additional flags to see specifics:
- -j jobids: show only the specified jobs
- -w nodes: show only jobs on the specified nodes
- -A projects: show only jobs belonging to the specified projects
- -t states: show only jobs in the specified states (pending, running, suspended, etc.)
- -u users: show only jobs belonging to the specified users
and they can be combined
squeue -j 14132,14133 # shows jobs 14132 and 14133
squeue -w c23-11 # shows jobs running on c23-11
squeue -u foo -t PD # shows pending jobs belonging to user 'foo'
squeue -A bar # shows all jobs in the project 'bar'
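For example, to see only your own jobs you can combine the -u flag with the $USER variable that holds your username:
squeue -u $USER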
Submitting a basic script:
Let's try running a similar script as before, but this time on the cluster as a job: my_first_cluster_script.sh
#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Project:
#SBATCH --account=NN9458K
#
# Wall time limit:
#SBATCH --time=00-00:10:00
#
#SBATCH --mem-per-cpu=4000M
echo "hello_world"
and now we can submit the job to the cluster using sbatch
- this will make sure that your job gets submitted with the right resources to the right nodes so it runs successfully
sbatch my_first_cluster_script.sh
This should run and you will see a *.out
file in your directory (the job name is usually in the file name, which can be useful if you run into problems or need to keep track of a batch job). Let's open it and see what information it returns:
less *.out
You'll see that we have our output hello_world
as well as information about the resources we used - in reality we might want to use this information to optimise our computations, or, if the job failed, to see the reason why. It could be insufficient memory/time or an error in our code.
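If you want to look this information up again later, Slurm's accounting command sacct can report it per job (assuming job accounting is enabled on the cluster, as it usually is) - replace the job ID below with the one printed when you ran sbatch:
sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State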
Loading environments:
Often we want to run scripts that make use of other programs rather than just simple bash scripts
One of the easiest ways to do this is to use Conda - conda creates an environment in which everything we need to run a specific program is available.
To set up our environment for conda packages:
ml purge
ml avail conda
ml Miniconda3/4.9.2
Then initialise our environment:
conda init bash
Make sure conda knows which channels to use to download and install new packages:
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
For this exercise we want to install fastp which allows us to run QC on our raw sequencing data:
conda create --name fastp
To add packages to our newly made conda environment we must first activate it
source ~/.bashrc
conda activate fastp
Here you may get an error about not initialising your environment - this should only happen once and if so you should log off and back on - then running conda activate fastp
should work
Then we can install software: the -n flag specifies which conda environment to install into, and the following fastp is the 'recipe' for the fastp package - e.g. from https://anaconda.org/bioconda/fastp
conda install -n fastp fastp
Check that we can now use fastp
fastp -h
To exit from the current environment we can use
conda deactivate
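To see which conda environments you have available (the currently active one is marked with a *) you can list them:
conda env list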
Basic sequencing checks: The data we will be using for this part of the refresher is kept here:
/cluster/projects/nn9458k/oh_know/teachers/kamil/data/sacharomyces
One thing we might want to do is see how many reads we have.
Let's look at the top of one of the files we will use with head:
zcat /cluster/projects/nn9458k/oh_know/teachers/kamil/data/sacharomyces/SRR3265401_1.fastq.gz | head
We see that each read starts with a header beginning with @SRR
so we can count the reads with
zcat /cluster/projects/nn9458k/oh_know/teachers/kamil/data/sacharomyces/*.fastq.gz | grep "@SRR" | wc -l
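Since every fastq record is exactly four lines, an alternative way to count the reads in one of the files is to count lines and divide by four - a useful cross-check in case a pattern like @SRR ever turns up in a quality line:
zcat /cluster/projects/nn9458k/oh_know/teachers/kamil/data/sacharomyces/SRR3265401_1.fastq.gz | wc -l | awk '{print $1/4}'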
Running QC on reads:
Now we can check the quality of our reads using fastp
We will make a script to do this that we will submit to the queue - fastp_my_reads_test.sh
#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Project:
#SBATCH --account=NN9458K
#
# Wall time limit:
#SBATCH --time=00-00:10:00
#
#SBATCH --mem-per-cpu=40000M
eval "$(conda shell.bash hook)"
conda activate /cluster/home/YOUR_USERNAME/.conda/envs/fastp
# Execute the program in some way
fastp -h
Let's submit it and make sure that it works:
sbatch fastp_my_reads_test.sh
And now let's run the fastp filtering with a script where we add the fastp command: run_fastp.sh
#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Project:
#SBATCH --account=NN9458K
#
# Wall time limit:
#SBATCH --time=00-00:10:00
#
#SBATCH --mem-per-cpu=40000M
eval "$(conda shell.bash hook)"
conda activate /cluster/home/YOUR_USERNAME/.conda/envs/fastp
fastp -i /cluster/projects/nn9458k/oh_know/teachers/kamil/data/sacharomyces/*_1.fastq.gz -I /cluster/projects/nn9458k/oh_know/teachers/kamil/data/sacharomyces/*_2.fastq.gz -o /cluster/home/YOUR_USERNAME/out.R1.fq.gz -O /cluster/home/YOUR_USERNAME/out.R2.fq.gz
sbatch run_fastp.sh
You now have your output files and an .html file showing the filtering and QC.
You can now download the html file to your personal computer so that you can open it in a browser:
scp -r YOUR_USERNAME@saga.sigma2.no:/cluster/home/YOUR_USERNAME/*.html ./
If you want to try writing your own script then you can write another submission script to analyse this tse-tse fly data /cluster/projects/nn9458k/oh_know/teachers/kamil/data/tse-tse
Arguments in bash
Sometimes it is useful to be able to define arguments when you submit a script that are then used inside the script.
In this example we will make a short script that returns some information about yourself: participent_info.sh
This script is a bit more complex than before, but more realistic for real analyses.
#!/bin/bash
#SBATCH --job-name=oh_know_participent
#SBATCH --account=NN9458K
#SBATCH --time=00-00:10:00
#SBATCH --mem-per-cpu=1000M
# Submit as follows:
# sbatch participent_info.sh <your first name> <your study genus>
# make a working directory for this job
#this command will make the working directory for the script and the -p flag means it will only make it if it doesn't exist already
SCRATCH=/cluster/home/$USER/me
mkdir -p $SCRATCH
#here we specify our arguments based on the user inputted arguments from the command - $1 will be first, $2 second and so on.
NAME=$1
STUDYSPP=$2
#and here is the bit our script does - appending the info and the arguments to a new file - me.txt
echo "My name is:" >> ${SCRATCH}/me.txt
echo ${NAME} >> ${SCRATCH}/me.txt
echo "I study the genus:" >> ${SCRATCH}/me.txt
echo ${STUDYSPP} >> ${SCRATCH}/me.txt
This time we can run and specify the variables - in my case:
sbatch participent_info.sh Rishi Danaus
Once it is done we can move into our new me/
directory and check the output
less me/me.txt
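If you just want to see how positional arguments behave without submitting a job, you can test the idea directly on the command line - test_args.sh here is just a throwaway name for this sketch:
echo 'echo "argument 1 is: $1 and argument 2 is: $2"' > test_args.sh
bash test_args.sh Rishi Danaus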
The interactive node:
As well as running commands directly on the login node or submitting them as jobs, we can start an interactive session and run them there.
This is particularly useful for more memory/CPU intensive tasks which 'head' or 'login' nodes are not able to handle.
To start an interactive job try using the following - this time note that we use srun
rather than sbatch
srun --ntasks=1 --mem-per-cpu=5G --time=02:00:00 --qos=devel --account=nn9458k --pty bash -i
You should see that your prompt has now changed and instead of showing login
it will show a number - the node which you are now computing from
Now you can try loading a conda module from here - in this example we use Kamil's conda environment, which you will be using later in the workshop
module purge
module load Miniconda3/4.9.2
conda activate /cluster/projects/nn9458k/oh_know/.conda/kmer_tools
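When you are finished with the interactive session, typing exit will end it, free the reserved resources and return you to the login node:
exit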
Different sequencing technologies:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3841808/
More Bash tips/tricks:
https://data-skills.github.io/unix-and-bash/03-bash-scripts/index.html
https://speciationgenomics.github.io/getting_used_to_unix/