RelocaTE2: a high resolution transposable element insertion sites mapping tool for population resequencing
RelocaTE2 is an improved version of RelocaTE (Robb et al., 2013). RelocaTE2 is highly sensitive and accurate in mapping transposable elements (TE) polymorphisms at single base pair resolution. RelocaTE2 uses the reads associated with TEs as seeds to cluster the read pairs on chromosomes. It automatically detects the target site duplication (TSD) of a TE insertion from alignments in each cluster, which enable high resolution mapping of TE polymorphisms. Unlike parallel searching of multi-TE elements in RelocaTE, RelocaTE2 searches all TEs in one cycle, which enables us find polymorphisms of thousands of TEs in an individual genome or large population in a reasonable timeframe without losing sensitivity and specificity.
-
System requirements
- Linux/Unix platform
- Short read aligner: BLAT (v35+), bowtie2 (v2.2.6+), bwa (v0.6.2)
- samtools (v1.3.1)
- bedtools (v2.25.0)
- Python (v2.7.5+) and pysam package (v0.8.5+)
- Perl (v5.20.2+)
- seqtk (v1.0+)
-
Install
echo "Installation via github"
git clone https://github.com/JinfengChen/RelocaTE2.git
cd RelocaTE2
bash install.sh
cd test_data
bash run_test.sh > run_test.sh.log 2>&1 &
echo "Installation via conda (Linux as example)"
#download miniconda from http://conda.pydata.org/miniconda.html and install
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh
source ~/.bashrc
#install RelocaTE2 into miniconda isolated environment "RelocaTE2"
conda create --name RelocaTE2 -c bioconda relocate2
#run test data
source activate RelocaTE2
cd /PATH_TO_miniconda/env/RelocaTE2
cd test_data
bash run_test.sh > run_test.sh.log 2>&1 &
- Troubleshooting
- Installation of RelocaTE2 using install.sh or conda will install all the tools and packages required to run RelocaTE2. The executables of all tools are RelocaTE2/bin directory and record their paths in RelocaTE2/CONFIG. The main script of RelocaTE2, relocaTE2.py, searches for executables in $PATH; however, the executables from RelocaTE2/CONFIG will supercede $PATH. Users can modify RelocaTE2/CONFIG with paths to tools installed on their specific system to avoid problems.
- The Python module "pysam" is installed to RelocaTE2/lib. By setting PYTHONPATH=PATH_OF_RelocaTE2/lib/python2.7/site-packages, any other locally-installed versions of pysam are temporarily ignored and the supported version of pysam for RelocaTE2 is used instead.
- In RelocaTE2, we align trimmed reads to reference genome by bwa v0.6.2, which allows paired-end reads have different names in fastq files. We recommend using install.sh or conda provided in RelocaTE2 to install these dependent tools including bwa v0.6.2.
- set environment variables if failed to find executable PATH or pysam libary
export PYTHONPATH=`pwd`/lib/python2.7/site-packages:$PYTHONPATH
export PATH=`pwd`/bin:$PATH
- run RelocaTE2 to find transposable element insertions
cd test_data
#repeat element
repeat=RiceTE.fa
#reference genome
ref=MSU7.Chr3_2M.fa
#repeatmasker results of TE annotation on reference genome
ref_te=MSU7.Chr3_2M.fa.RepeatMasker.out
#directory where the input fastq format reads are located
fq_dir=MSU7.Chr3_2M.ALL_reads/
#output directory where RelocaTE2 write temperary and final output
outdir=MSU7.Chr3_2M.ALL_reads_RelocaTE2_outdir
python scripts/relocaTE2.py --te_fasta $repeat --genome_fasta $ref --fq_dir $fq_dir --outdir $outdir --reference_ins $ref_te --run
- check results of TE insertions and compare with simulated TE insertions
wc -l MSU7.Chr3_2M.ALL_reads_RelocaTE2_outdir/repeat/results/ALL.all_nonref_insert.gff
196
bedtools window -w 10 -a MSU7.Chr3_2M.ALL.gff -b MSU7.Chr3_2M.ALL_reads_RelocaTE2_outdir/repeat/results/ALL.all_nonref_insert.gff | wc -l
196
python scripts/relocaTE2.py --help
usage: relocaTE2.py [-h] [-b BAM] [-t TE_FASTA] [-d FQ_DIR] [-g GENOME_FASTA]
[-r REFERENCE_INS] [-o OUTDIR] [-s SIZE] [-c CPU]
[-1 MATE_1_ID] [-2 MATE_2_ID] [-u UNPAIRED_ID]
[--sample SAMPLE] [--aligner ALIGNER]
[--len_cut_match LEN_CUT_MATCH]
[--len_cut_trim LEN_CUT_TRIM] [--mismatch MISMATCH]
[--mismatch_junction MISMATCH_JUNCTION] [--step STEP]
[--run] [--split] [-v VERBOSE]
optional arguments:
-h, --help show this help message and exit
-b BAM, --bam BAM Name of BAM file of reads mapped reference genome
-t TE_FASTA, --te_fasta TE_FASTA
Name of fasta sequence of repeat element
-d FQ_DIR, --fq_dir FQ_DIR
Name of directory of input fastq sequence data
-g GENOME_FASTA, --genome_fasta GENOME_FASTA
Name of fasta file of reference genome sequence
-r REFERENCE_INS, --reference_ins REFERENCE_INS
Name of RepeatMasker TE annotation of reference genome
-o OUTDIR, --outdir OUTDIR
Name of output directory where to put temperary and
final results
-s SIZE, --size SIZE Insert size of sequence library, default = 500
-c CPU, --cpu CPU Number of CPUs to use for multiplex, default = 1
-1 MATE_1_ID, --mate_1_id MATE_1_ID
string define paired-end read1, default = "_1"
-2 MATE_2_ID, --mate_2_id MATE_2_ID
string define paired-end read2, default = "_2"
-u UNPAIRED_ID, --unpaired_id UNPAIRED_ID
string defining single-end reads, default = ".unPaired"
--sample SAMPLE string defining sample name which will present in output
GFF, default = "not_given"
--aligner ALIGNER aligner used to map reads to repeat elements,
default=blat
--len_cut_match LEN_CUT_MATCH
length cutoff threshold for match between reads and
repeat elements. Large value will lead to less
sensitive but more accuracy, default = 10
--len_cut_trim LEN_CUT_TRIM
length cutoff threshold for trimed reads after
trimming repeat sequence from reads. Large value will
lead to less sensitive but more accuracy, default = 10
--mismatch MISMATCH Number of mismatches allowed for matches between reads
and repeat elements, default = 2
--mismatch_junction MISMATCH_JUNCTION
Number of mismatches allowed for matches between
junction reads and repeat elements, default = 2
--step STEP Number to control steps of pipeline, default =
"1234567"
--dry_run write shell scripts only while this script excute
--run run while this script excute
--split split fastq into 1 million reads chunks to run blat/bwa jobs
-v VERBOSE, --verbose VERBOSE
verbose grade to print out information in all scripts:
range from 0 to 4, default = 2
- Reference genome sequence (ref): multiple sequences of reference genome in fasta format
cat test_data/MSU7.Chr3.fa | head -n 25
>Chr3
#########Skip lines of NNNNNN in sequence#########
CCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACC
CTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCC
TAAACCCTAAACCCTAAACCCTAAACCCTCAATCCTAACCCCTCAATCCT
AAGCCGTCTGCAGTGGCCATCGCAGGGTGAGGTGGGTGGAGGGTGAGTGA
- Repeat sequence (repeat): Consensus sequences of repeat families in fasta format
cat test_data/RiceTE.fa | head -n 4
>mPing
GGCCAGTCACAATGGGGGTTTCACTGGTGTGTCATGCACATTTAATAGGGGTAAGACTGAATAAAAAATG
ATTATTTGCATGAAATGGGGATGAGAGAGAAGGAAAGAGTTTCATCCTGGTGAAACTCGTCAGCGTCGTT
TCCAAGTCCTCGGTAACAGAGTGAAACCCCCGTTGAGGCCGATTCGTTTCATTCACCGGATCTCTTGCGT
- Resequencing data (fq_dir): Illumina reads from one strain in fastq format (fastq or fastq.gz). Sequences need to be put in one directory. Paired-end reads need to end with _1.fastq and _2.fastq.
ls test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500/*.fq
test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500/MSU7.Chr3.ALL.rep1_reads_2X_100_500_1.fq
test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500/MSU7.Chr3.ALL.rep1_reads_2X_100_500_2.fq
cat test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500/MSU7.Chr3.ALL.rep1_reads_2X_100_500_1.fq | head -n 4
@read_500_1/1
TGTAAGAGTGCTATTGATGTTCGTTAATATTGTGCTCATATTTATAACATAATGATTTCTTTCATCTATACGAACAAATAGTACAAAATCCAAATACGAC
+
GGGGGGFCCFGHEEE:?<8;C:;@<6=@BFGHEEHHHGGEBEEEEFFH?F;<<@HHFECFFHHHGFFFHHHFBE<BCDCDFHFHHHHHGBHHEED3;DHH
cat test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500/MSU7.Chr3.ALL.rep1_reads_2X_100_500_2.fq | head -n 4
@read_500_1/2
ATTAATAATTTAAAATCTATATTAACTAATGTACACTTACACTAGAACCGGTGGACCCATTATTTAAATCATCTGATATATTATTTGCTAATAAATAAAG
+
GGGFHHHHHHHGGGHHHHHHHEHFFEHFHHGHHHDHHHGHFFGEHHFHHHFGDC@=@GFGB4DEHHHECE>>8/@ABHHH?DGHBEFEDAEDHFBDBCC>
- RepeatMasker results of TE annotation on reference genome (ref_te): default TE annotation of reference genome used to call TE insertions in reference genome and remove possible false positive non-reference TE insertions.
cat test_data/MSU7.Chr3.fa.RepeatMasker.out | head -n 4
249 23.8 0.0 4.8 Chr3 1283 1366 (36412453) + CACTA-K DNA/En-Spm 663 742 (389) 1
2250 19.7 0.7 0.9 Chr3 1973 2424 (36411395) + SZ-7_LTR LTR/Gypsy 1 451 (131) 2
4798 12.5 5.6 1.0 Chr3 2425 3618 (36410201) + SZ_LTR LTR/Gypsy 1 1249 (0) 3
820 15.4 0.0 0.0 Chr3 3619 3754 (36410065) + SZ-33_LTR LTR/Gypsy 440 575 (0) 4
- Bam file of reads mapping results on reference genome (bam): characterize heterozygous and homozygous TE insertions when BAM file is provided.
-
Structure of output directory
- blat_output: BLAT results of reads to repeat sequence
- bwa_aln: bwa results of trimmed-reads to reference genome
- flanking_seq: trimmed-reads and their pairs
- results: final results of TE insertions
- te_containing_fq: temporary results of reads that matched to repeat sequence
- te_only_read_portions_fa: temporary results of the proportion of reads that matched to repeat sequence
-
TE insertions
- TE insertions shared between resequenced strain and reference genome: test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500_RelocaTE2_outdir/repeat/results/ALL.all_ref_insert.gff
- TE insertions only present in resequenced strain: test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500_RelocaTE2_outdir/repeat/results/ALL.all_nonref_insert.gff
- TE insertions characterized as heterozygous and homozygous as described in Robb et al., 2013: test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500_RelocaTE2_outdir/repeat/results/ALL.all_nonref_insert.characTErized.gff.
-
GFF format used in RelocaTE2
cat test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500_RelocaTE2_outdir/repeat/results/ALL.all_nonref_insert.gff | head -n 4
Chr3 RelocaTE2 rice 85173 85180 . - . ID=repeat_Chr3_85173_85180;Name=nDaiz;TSD=GTGCTTGT;Note=Non-reference, not found in reference;Right_junction_reads=1;Left_junction_reads=2;Right_support_reads=3;Left_support_reads=4;
Chr3 RelocaTE2 rice 239462 239464 . + . ID=repeat_Chr3_239462_239464;Name=Truncator;TSD=supporting_junction;Note=Non-reference, not found in reference;Right_junction_reads=2;Left_junction_reads=0;Right_support_reads=4;Left_support_reads=2;
Chr3 RelocaTE2 rice 429710 429717 . + . ID=repeat_Chr3_429710_429717;Name=nDaiz;TSD=CAAAATTT;Note=Non-reference, not found in reference;Right_junction_reads=2;Left_junction_reads=2;Right_support_reads=2;Left_support_reads=6;
Chr3 RelocaTE2 rice 522905 522907 . + . ID=repeat_Chr3_522905_522907;Name=Dasheng;TSD=supporting_junction;Note=Non-reference, not found in reference;Right_junction_reads=0;Left_junction_reads=3;Right_support_reads=2;Left_support_reads=0;
Attributes in field 8 of GFF:
ID: unique id of TE insertions, repeat_"chromosome"_"start"_"end"
Name: TE family name of this insertion
TSD: target site duplicate predicted from read alignments
Note: definition of TE insertions, including Non-reference, Reference-Only and Shared.
Right_junction_reads: number of reads covering the junction of TE insertion on right side/downstream.
Left_junction_reads: number of reads covering the junction of TE insertion on left side/upstream.
Right_support_reads: number of reads not covering the junction of TE insertion, but supporting TE insertion by paired-end reads on right side/downstream.
Left_support_reads: number of reads not covering the junction of TE insertion, but supporting TE insertion by paired-end reads on left side/downstream.
- Chen, J., Wrightsman T, Wessler SR, Stajich JE. (2017) RelocaTE2: a high resolution transposable element insertion site mapping tool for population resequencing. PeerJ 5:e2942.
- Robb S.M., Lu L., Valencia E., Burnette J.M. 3rd., Okumoto Y., Wessler S.R., Stajich J.E. (2013) The use of RelocaTE and unassembled short reads to produce high-resolution snapshots of transposable element generated diversity in rice. G3 2013;3:949-957.