Instructor: Daehwan Kim (Daehwan.Kim@UTSouthwestern.edu)
Teaching assistant: Chanhee Park (Chanhee.Park@UTSouthwestern.edu)
The class meets at 9am from May 6th (Day 1) through May 10th (Day 5) in the following rooms.
Day | Classroom |
---|---|
May 6th (Day 1) | ND5.218 |
May 7th (Day 2) | NL3.120 |
May 8th (Day 3) | ND5.218 |
May 9th (Day 4) | NL3.120 |
May 10th (Day 5) | ND5.218 |
Office hours will be 2 - 4pm on each of Days 1-5 in J9.120A.
This course provides hands-on practice with C++ programming for implementing DNA sequence alignment. You will design and implement memory-efficient data structures and rapid algorithms for suffix arrays, Burrows-Wheeler transforms (BWT), and Ferragina-Manzini (FM) indexes. It is important for you to write code in a modular and easy-to-maintain fashion. You are encouraged to provide documentation in the form of comments along with your code so that other future developers can understand and extend it further. You will be expected to write code so that your code will live forever.
Prerequisites:
- Familiarity with C++ features such as array, pointer, and class
- Understanding of concepts in Object-oriented programming
- Knowledge of basic algorithm techniques such as binary search and sorting
We will provide relevant articles and slides that you can read on your own in order to carry out programming assignments as described in the daily course schedule below:
From 9 to 10am, we will provide a brief introduction to sequence alignment, suffix arrays, and a C++ coding environment. For the rest of the day, you will implement suffix arrays (SA) for two genomic sequences: a small 100-bp sequence and a large 51 million-bp long sequence (human chromosome 22). As this session marks the commencement of the programming course, there will also be a concise introduction to the practice of code commenting. Optional: you can implement suffix trees
From 9 to 10am, we will provide a brief introduction to sequence alignment and have a Q&A session. For the rest of the day, you will implement an alignment algorithm using the suffix array. We will provide input sequencing reads and true output alignment, which you can use to test your alignment algorithm in terms of alignment accuracy and speed. You can also implement test cases and use assertions so that you can effectively identify and fix bugs in your code. Optional: you can implement a suffix array and an alignment algorithm for the human reference genome (about 3 billion bases) that can be used on your computer.
From 9 to 10am, we will introduce Burrows-Wheeler transforms (BWT) and have a Q&A session. For the rest of the day, you will implement BWT for two genomic sequences: a small 100-bp sequence and a large 51 million-bp long sequence (human chromosome 22). Optional: you can implement a BWT for the human reference genome.
From 9 to 10am, we will cover sequence alignment and have a Q&A session. For the rest of the day, you will implement an alignment algorithm using BWTs for two genomic sequences: a small 100-bp sequence and a large 51 million-bp long sequence (human chromosome 22). Optional: you can implement an alignment algorithm using a BWT for the human reference genome.
From 9 to 10am, we will provide a brief introduction to Ferragina-Manzini (FM) indexing and have a Q&A session. For the rest of the day, you will implement FM indexes and an alignment algorithm using FM indexes for two genomic sequences: a small 100-bp sequence and a large 51 million-bp long sequence (human chromosome 22). Optional: you can make this FM index and alignment work for the human reference genome.
The performance standards for you to achieve in producing your algorithms include lean memory usage and high speed. Please document your memory usage and runtime information. The output formats for your programs are described below. You will be asked to review one of your classmates’ code implementations, with the code to review assigned by the course instructors. Your review summary should not be more than 300 words.
All assignments are to be done individually, but discussion with classmates regarding general solution strategies is allowed.
This C++ course will constitute 20% of the final grade or 20 points as follows:
Day 1: SA implementation | Day 2: SA-based alignment implementation | Day 3: BWT implementation | Day 4: BWT-based alignment implementation | Day 5: FM index implementation | |
---|---|---|---|---|---|
Correctness | 2 | 2 | 2 | 2 | 3 |
Memory efficiency and runtime | 1 | 1 | 1 | 1 | 1 |
Code Review | 0 | 1 | 1 | 1 | 1 |
The deadline for submitting the programming assignment, except for the Day5 assignment, is the next day at 9 a.m. The submission deadline for the Day 5 assignment is 10 p.m. on Day 5.
There is no code review assignment for Day1.
The output file formats are described as follows.
For each day we will assess the correctness of the output of the code. Please produce codes which write the results of the computation to a file on disk according to each of the following:
Input Format Description | Input Format Example | Output Format Description | Output Format Example | Command Example | |
---|---|---|---|---|---|
Day 1: SA Impl. | Input .fa file, single sequence multi-line, > for header | sample_ref.fa | A list of integers in txt format (label .sa) | sample_ref.sa | <your_program_name> sample_ref.fa sample_ref.sa |
Day 2: SA Search | Input .fa reference file, input .sa from Day 1, and Multi-line query fasta file (labeled .fa) | sample_ref.fa, sample_ref.sa, and sample_reads.fa | A tab-delimited file where each row contains query name and alignment location, multiple rows for the same query matching multiple locations. The order of the query names in the alignment file must be the same as the input. No entries for queries that don't align | sample.aln | <your_program_name> sample_ref.sa sample_ref.fa sample_reads.fa > sample.aln |
Day 3: BWT Impl. | Input .fa reference file, input suffix array .sa | sample_ref.fa and sample_ref.sa | text file containing burrows-wheeler transform text (with or without extra character) named .bwt | sample.bwt | <your_program_name> sample_ref.fa sample_ref.sa > sample.bwt |
Day 4: BWT Search | Input .bwt, input .sa, and queries file (.fa) | sample.bwt and sample_ref.sa and sample_reads.fa | A tab-delimited file where each row contains query name and alignment location, multiple rows for the same query matching multiple locations. | sample.aln | <your_program_name> sample.bwt sample_ref.sa sample_reads.fa > sample.aln |
Day 5 part 1: FM Index Impl. | Input .bwt, input .sa | sample.bwt and sample_ref.sa and commandline options for sampling rate | An FM Index file, with FM-Index components seperated by > and appropriate headers as shown in the example output. | sample.fm | <your_program_name> sample.bwt sample.sa -c 2 > sample.fm |
Day 5 part 2: FM Index Search | input .fm, input .fa | sample.fm and sample_reads.fa | a tab-delimited file where each row contains query name and alignment location, multiple rows for the same query matching multiple locations | sample.aln | <your_program_name> -s sample_ref.fa -b sample.bwt -f sample.fm |
Each dataset contains a reference sequence, reads, and a position of read in the reference. Student may use the Human Reference Genome to do optional assignment. When you test your program with chr22_reads.fa and genome_reads.fa, you can use a subset of reads, for example, first 1000 reads. But when you test a peformance of your program, we recommend to use whole reads.
- Small
- Reference: 100bp sequence small.fa
- Reads: 10 10bp reads small_reads.fa
- Read positions: small.aln
- Human Chromosome 22
- Reference: chr22.fa
- Reads: 1Million 100bp reads chr22_reads.fa
- Read positions: chr22.aln
- Human Reference Genome
- Reference: genome.fa
- Reads: 1Million 100bp reads genome_reads.fa
- Read positions: genome.aln
We recommend using a laptop/desktop rather than the biohpc client.
-
Install gcc/g++ compiler
- Ubuntu
sudo apt install build-essential
- CentOS
sudo yum group install "Development Tools"
-
Download CLion package. This package includes an evaluation license key for a free 30-day trial.
CLion - Download -
Unpack the package
tar zxvf CLion-2023.1.3.tar.gz
-
Run CLion.sh from the bin subdirectory
-
Open your project directory (
Day1
,Day2
, ...)
Visual Studio 2022 Community - Download
- Open a solution file in the project directory(
Day1.sln
,Day2.sln
, ...)
Xcode - Download
- Open a xcode project in the project directory(
Day1.xcodeproj
,Day2.xcodeproj
, ...)