Here I present the programs and the steps for carrying out a small benchmark pre-analysis for determining full-length mRNA isoforms by Nanopore RNA-seq data.
As for the datasets to perform a pre-screening study for this benchmark analysis, I have decided to use sequins, synthetic RNA standards, available thanks to the Garvan Institute of Medical Research.
For this initial analysis, I downloaded the following datasets:
- Reference files:
- rnasequin_decoychr_2.4.fa.gz - decoy chromosome (chrIS) sequence the encodes all synthetic sequin gene loci (10Mb).
- rnasequin_annotation_2.4.gtf – annotation of sequin genes/isoforms on the decoy chromosome (120kb).
- rnasequin_sequences_2.4.fa – sequences of all sequin isoforms (213kb).
- rnasequin_isoforms_2.4.tsv – expected concentration (expression) of each sequin isoform in mixture (5kb).
- Example libraries:
- K562_SequinMixA.Rep2.R1.fq.gz (2.5Gb): Sequins (RNA mixture A) was added to RNA extract from the K562 cell type. The sample then underwent library preparation (using the KAPA Stranded mRNA-Seq) and sequencing (using the Illumina HiSeq 2500).
- k562sequins_dRNA_albacore2.1.3.tar.gz (33Mb): RNA sequins (Mixture A) was added to 500ng total RNA extract from the K562 cell type. The sample then underwent sequencing using Nanopore MinION flowcell (R9.4 chemistry), displayed 1323 active channels at commencement of run. Used sequencing protocol RNA001. Base called with Albacore version 2.1.3.
To highlight the differences between Illumina and Oxford Nanopore reads, I have plotted the read length distributions for each sequencing method, using the following command to obtain the read lengths:
awk 'NR%4 == 2 {lengths[length($0)]++ ; counter++} END {for (l in lengths) {print l, lengths[l]}}' file.fastq >> file_with_read_lengths.txt
Once I obtained the length of the reads and their frequencies, using an RScript, I made the comparative graph to observe the differences, obtaining the following graph:
We can observe that the difference in the average read lengths, since, while for Illumina we have a large number of reads of the same size (125 bp) for Nanopore we obtain a much lower performance, with fewer reads, but of a much greater length (917 bp), although it is true that the variability is very wide, obtaining reads ranging from 5 bp to ~3 kb, but for our preliminary analysis this is fine.
In addition, I have performed an alignment of Illumina short reads and Nanopore long reads against the isoform reference in order to observe the differences in coverage and read length between the two methods.
For this purpose I used Minimap2 to map both reads, converted the resulting SAM files to BAM and sorted and indexed them using samtools.
In this way we can observe that the Illumina reads are much shorter, since a single read does not cover the entire area of the sequin, while in Nanopore it does, although the coverage of the Illumina sample is much higher than that of Nanopore (32902 vs. 706).
As for the programmes used to carry out this small benchmark pre-analysis, I have chosen 3 of the most recent algorithms that allow the analysis of RNA-seq data using the Oxford Nanopore technology, as the de novo reconstruction algorithms are much more complex to set up and run than the reference-guided ones. These can be divided into:
In addition to these 3 that allow the use of dRNA-seq data, I will also use HTSeq to highlight possible differences with isoform detection using programmes that use Illumina data as input.
Note: All programs that featured a Docker/Singularity image or Conda package have been installed in this way to facilitate the replicability of the results obtained in this analysis.
FLAIR (Full-Length Alternative Isoform analysis of RNA) is an algorithm for the correction, isoform definition and alternative splicing analysis of noisy reads that has been used mainly for native RNA developed by Tang et al. (2018). FLAIR uses multiple alignment steps and splice site filters to increase confidence in the set of isoforms defined from noisy data. FLAIR was designed to be able to sense subtle splicing changes in nanopore data.
As we can see in the image above, this algorithm has multiple modules to follow, so I will therefore show below the commands I have used to carry out the analysis, but you can visit the FLAIR repository for further information.
Align reads.
singularity exec flair_latest.sif python3 ./flair/flair.py align -g <genome.fa> -r <reads.fq.gz>|<reads.fq>|<reads.fa> -nvra -o <output_aligned>
In our case:
- -g:
rnasequin_decoychr_2.4.fa
- -r:
k562+sequins_dRNA_albacore-2.1.3.fastq
Correct reads using an annotation file.
singularity exec flair_latest.sif python3 ./flair/flair.py correct -q <output_aligned.bed> -g <genome.fa> -f <annotation.gtf> -c <chromsize.fa.fai> -nvrna -o <output_correct>
In our case:
- -g:
rnasequin_decoychr_2.4.fa
- -f:
rnasequin_annotation_2.4.gtf
- -c:
rnasequin_decoychr_2.4.fa.fai
Defines high-confidence isoforms from corrected reads.
singularity exec flair_latest.sif python3 ./flair/flair.py collapse -g <genome.fa> -r <reads.fq>|<reads.fa> -f <annotation.gtf> -q <output_corrected.bed> --trust_ends
In our case:
- -g:
rnasequin_decoychr_2.4.fa
- -r:
k562+sequins_dRNA_albacore-2.1.3.fastq
- -f:
rnasequin_annotation_2.4.gtf
Convenience function to quantifying FLAIR isoform usage across samples using minimap2, creating a manifest.tsv tab-separated file with the following structure:
sample1 conditionA batch1 ./sample1_reads.fq
Command to quantifying:
singularity exec flair_latest.sif python3 ./flair/flair.py quantify -r <reads_manifest.tsv> -i <flair.collapse.isoforms.fa> --trust_ends
After all the analysis, we obtain a .gtf file with all the isoforms found by FLAIR and a table with their quantification, which will be later compared with the quantifications of the sequins and the predictions made by the other programmes.
bambu is a R package for multi-sample transcript discovery and quantification using long read RNA-Seq data. You can use bambu after read alignment to obtain expression estimates for known and novel transcripts and genes. The output from bambu can directly be used for visualisation and downstream analysis such as differential gene expression or transcript usage.
To carry out the analysis using bambu, just run the Rscript.
See the bambu repository for further information.
Once the script is executed, three files are obtained:
counts_gene.txt
: a tab-delimited file with the counts per gene.counts_transcript.txt
: a tab-delimited file with counts per transcript.extended_annotations.gtf
: a tab-delimited file with the annotations made by the programme.
LIQA(Long-read Isoform Quantification and Analysis) is an Expectation-Maximization based statistical method to quantify isoform expression and detect differential alternative splicing (DAS) events using long-read RNA-seq data. LIQA incorporates base-pair quality score and isoform-specific read length information to assign different weights across reads instead of summarizing isoform-specific read counts directly. Moreover, LIQA can detect DAS events between conditions using isoform usage estimates.
To perform the analysis using LIQA, two steps must be carried out:
The first step is the transformation of isoforms into a compatible matrix based on a reference annotation file. In our case we will use the command for gtf files, although there is also an option for files in ucsc format.
liqa -task refgene -ref <example.gtf> -format gtf -out <example.refgene><
In our case:
- -ref:
rnasequin_annotation_2.4.gtf
The next step is to quantify the expression of the isoforms. Although it is recommended to filter the bam file reads, this has not been done, as it has not been done with any of the above algorithms.
It is then necessary to sort and index the bam file, and give refgene_file, bam_file to LIQA to estimate isoform expression:
liqa -task quantify -refgene <refgene_file> -bam <bam_file> -out <output_file> -max_distance <max distance> -f_weight <weight of F function>
In our case:
- -refgene:
refgene_file
generated in the previous step. - -bam:
flair.aligned.bam
, as the reads are aligned via minimap2 and we use the same input for all programmes. - -max_distance: 10
- -f_weight: 1, as the example shown.
Finally, a file with the quantification of each isoform is obtained.
See the LIQA repository for further information.
HTSeq is a Python package that provides infrastructure to process data from high-throughput sequencing assays. It also has a script (htseq-count) which takes a GTF file with gene/transcript models and a SAM file and counts for each gene/transcript how many reads map to it.
In our case, I have used this algorithm instead of Cufflinks as stated in the report, as I have had great difficulty running Cufflinks and HTSeq is the second most used programme to quantify genes/transcripts from short reads, so for a first analysis it is more than sufficient.
First we have to align the Illumina reads with a reference genome, using minimap2:
minimap2 -a ref.fa query.fq > alignment.sam
In our case:
- -a:
rnasequin_decoychr_2.4.fa
- query.fq:
K562_SequinMixA.Rep2.R1.fq
Once we have the alignment.sam
, we can run htseq-count:
htseq-count --idattr transcript_id <aligment_file> <gff_file>
In our case:
- alignment_file:
alignment.sam
, previously generated. - gff_file:
rnasequin_annotation_2.4.gtf
To carry out the preliminary analysis of the reference analysis once we have managed to quantify at the isoform level with the three proposed programmes (and HTSeq for comparison), I have generated an RScript that takes as input data from an auto-generated Excel file with the quantisation data for each programme and outputs a graph for each algorithm comparing the quantisation of that programme with the expected quantification, available in the file rnasequin_isoforms_2.4.tsv
.
In this way we obtain the following plot:
In the file with the expected quantifications, there are a total of 160 isoforms. Bambu, FLAIR and LIQA were able to quantify 82 (51.25%), 59 (36.88%) and 78 (48.75%) isoforms respectively, with each algorithm showing a respective correlation of 0.57, 0.7 and 0.41. Meanwhile, HTSeq can quantify 75 isoforms (46.88%) with a correlation of 0.43.
With these results we can see that programs such as FLAIR are not able to quantify even half of the isoforms, but those that it is able to detect do so quite accurately. On the other hand, bambu and LIQA quantify about 50% of the transcripts with moderate accuracy. What is interesting is that both these last two algorithms and FLAIR show a higher correlation (or very similar in the case of LIQA) with that obtained by HTSeq, which reinforces our initial hypothesis, since methods using long reads show, a priori, better, or similar precision than methods using short reads.
However, the accuracies are not as good as they should be, so one would expect that for the de novo reconstruction algorithms these accuracies would be even worse, discovering even fewer transcripts.