TAMA GO: Sequence Cleanup

GenomeRIK edited this page Mar 31, 2022 · 10 revisions

This set of tools in TAMA-GO is used to clean up sequences. Right now it contains only one tool, but more will be added later.

tama_flnc_polya_cleanup.py

To remove poly-A tail sequences from the FLNC fasta files, use tama_flnc_polya_cleanup.py. This tool removes the poly-A tails left in the FLNC fasta files after running IsoSeq3 Refine without the "--require-polya" parameter. If you have Iso-Seq data generated from cDNA libraries prepared with the Teloprime kit, you should not use the "--require-polya" parameter: it will remove many reads due to an issue with the Teloprime 3' primer sequence and the way LIMA works. Instead, run Refine with default settings and then clean up the remaining poly-A tails using this tool.

See twitter thread for more info: https://twitter.com/GenomeRIK/status/1179788262187110401


Instructions for Teloprime Iso-Seq data:

Primer sequences (i.e. primers.fasta; you may need to change the headers depending on the software version):

  >primer_5p
  TGGATTGATATGTAATACGACTCACTATAG
  >primer_3p
  AAAAAAAAAAAAAAAAAACGCCTGAGA

Run LIMA (the command below is for IsoSeq3 3.2; adjust it for the version you are using):

  lima --isoseq --dump-clips --no-pbi --peek-guess -j 24 ccs.bam primers.fasta demux.bam

Run refine without the "--require-polya" argument (for IsoSeq3 3.2):

  isoseq3 refine output.5p--3p.bam primers.fasta flnc.bam

Convert flnc.bam file into a fasta file:

  bamtools convert -format fasta -in flnc.bam  > flnc.fa

Run tama_flnc_polya_cleanup.py to remove remaining 3' poly-A tails:

  python tama_flnc_polya_cleanup.py -f flnc.fa -p prefix

The resulting fasta file is now ready for genome mapping.


To convert the FLNC BAM file into a fasta file, you can use this command:

  bamtools convert -format fasta -in bam_file > fasta_file

Note: bamtools is a separate tool and is not part of TAMA.

usage: tama_flnc_polya_cleanup.py [-h] [-f] [-p] [-m]

optional arguments:

  -h, --help  show this help message and exit
  -f F        FLNC fasta file
  -p P        Prefix for output file
  -m M        Minimum read length to keep (default is 200)

Default command would look like this:

  python tama_flnc_polya_cleanup.py -f flnc.fa -p prefix

Detailed explanation of arguments:

-f F

The FLNC fasta file is the output from running IsoSeq3 Refine and then the BAM to Fasta conversion.

-p P

This is the prefix used for the file naming of all the output files.

-m M

This is the minimum read length to keep after poly-A trimming. Default is 200bp.
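To illustrate what a minimum-length cutoff after poly-A trimming means, here is a minimal sketch. It is not the logic used by tama_flnc_polya_cleanup.py; the helper name `trim_polya` and the simple "strip every trailing A" rule are assumptions made for illustration only. Real tails can contain sequencing errors that a rule this simple would not handle.

```python
import re

# Hypothetical illustration only: NOT the algorithm used by
# tama_flnc_polya_cleanup.py. It strips an uninterrupted run of
# trailing A's and then applies a minimum-length cutoff.
POLYA_TAIL = re.compile(r"A+$", re.IGNORECASE)


def trim_polya(seq, min_len=200):
    """Return the trimmed sequence, or None if it is shorter than min_len."""
    trimmed = POLYA_TAIL.sub("", seq)
    if len(trimmed) < min_len:
        return None  # such a read would be discarded
    return trimmed


long_read = "ACGT" * 60 + "A" * 25   # 240 bp body plus a 25 bp tail
short_read = "ACGT" * 10 + "A" * 30  # 40 bp body plus a 30 bp tail

print(len(trim_polya(long_read)))    # 240
print(trim_polya(short_read))        # None
```

With the default of 200, the first read is kept at 240 bp, while the second falls below the cutoff once its tail is removed.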

Outputs:

  prefix.fa
  prefix_polya_flnc_report.txt
  prefix_discarded_reads.txt
  prefix_summary.txt

Detailed explanation:

prefix.fa

This is the cleaned up FLNC fasta file.
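Before mapping, it can be useful to sanity-check the cleaned file. The sketch below is not part of TAMA; `read_fasta` and `flag_suspect_reads` are hypothetical helpers, and the 10-base tail threshold is an arbitrary assumption. It flags reads that are shorter than the minimum length or that still end in a run of A's.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def flag_suspect_reads(records, min_len=200, tail_len=10):
    """Return headers of reads shorter than min_len or ending in tail_len A's."""
    suspects = []
    for header, seq in records:
        if len(seq) < min_len or seq.upper().endswith("A" * tail_len):
            suspects.append(header)
    return suspects


# Small in-memory example; for a real file, pass an open handle instead,
# e.g. `with open("prefix.fa") as fh: flag_suspect_reads(read_fasta(fh))`.
example = [">read1", "ACGT" * 60, ">read2", "ACGT" * 60 + "A" * 12]
print(flag_suspect_reads(read_fasta(example)))  # ['read2']
```

An empty result does not prove the file is clean; it only means no read matched these two simple checks.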

prefix_polya_flnc_report.txt

This is a report file containing a table that gives, for each poly-A count (polya_num), the number of sequences with that count (polya_num_count).

  polya_num       polya_num_count
  0       40676
  1       46986
  2       63718
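The table can be loaded for a quick look at how the counts are distributed. The sketch below assumes the file is a whitespace-separated, two-column table with one header row, as in the excerpt above; `load_polya_report` is a hypothetical helper, not part of TAMA.

```python
def load_polya_report(lines):
    """Parse the two-column report into a {polya_num: count} dict."""
    counts = {}
    for i, line in enumerate(lines):
        fields = line.split()
        if i == 0 or not fields:  # skip the header row and blank lines
            continue
        counts[int(fields[0])] = int(fields[1])
    return counts


# The three rows shown in the excerpt above; for a real file, pass an open
# handle, e.g. `with open("prefix_polya_flnc_report.txt") as fh: ...`.
sample = [
    "polya_num\tpolya_num_count",
    "0\t40676",
    "1\t46986",
    "2\t63718",
]
counts = load_polya_report(sample)
total = sum(counts.values())
for polya_num, count in sorted(counts.items()):
    print(polya_num, count, f"{count / total:.1%}")
```

The percentages are relative to the rows that were loaded, so for this three-row excerpt they describe only those rows.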

prefix_discarded_reads.txt

This is a report file listing the discarded reads in fasta format, along with the reason each read was discarded.

prefix_summary.txt

This is a report file showing a summary of some of the characteristics of the reads.