Skip to content

Removes reads mapping to a genome with a prefix to [0-9XY]|[MT] as its designation e.g. M1

Notifications You must be signed in to change notification settings

rufessor/Strip_readsGenome_SAM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 

Repository files navigation

Strip_readsGenome_SAM

HTS reads dervied from mixed genome samples must be mapped to combined genomic index files (using for instance BWA) Although this is required for correct read placement it can be problematic for downstream processing

This script is a catch all to work through some of the SAM file manipulations requied to deal with these issues.

In general- this script can accomplish the following things while either writing a new .SAM file prepended with the --prefix=NAME or through streaming output to the command line by invoking --stream.

It has the following dependicies- File::Basename Getopt::Long

the command line options are --prefix=<> (provide name to be prepended to input file to generate unique output file) --rg=<> --help (display abbreviated help) --reverse (switch mode from removal of rg specified tags to removal of tags NOT corresponding to rg) --insert (switch mode entirely, now we prepend rg to the reference seq name- refSeq name must correspond to [0-9]|X|Y|MT)

This script is built to edit BWA output and examines the alt SAM fields BWA generates. This script operates on SAM files from paired end or single ended reads- it operates on the RNAME and RNEXT fields. In addition it parses the BWA generated alt fields XA|SA in the same manner as it does for the RNAME and RNEXT fields.

For example, if you have mapped to a combined Human and Mouse index where in the Mouse SN fields have been altered from the typical [0-9]|X\Y|MT to instead include a prefix "M"- e.g. M1, MMT, MX, MY for the sequence names, and the human are represented normally, e.g. 1,2,X,Y,MT- and you wish to retain ONLY reads that map to the human genome.

perl Remove_genomeV3.pl --rg=M --prefix=StripGRCm /path/to/sam/MYfile.SAM

Will generate a new file- StripGRCm_MYfile.SAM in which any read or read with a mate mapping to any reference sequence name with a M as a prefix has been removed. In addition, any alt mapping fields with reference to the same will be removed. Any alt mapping fields with reference to the human genome will be retained. In the instance that the only XA|SA mapping is to Mouse sequence names, the entire XA|SA field is removed.

AS SUCH- reads that themselves map to the human genome, but for which their mate mapped to mouse, will be removed entirely- this behavior is intended.

If instead you wanted to retain the mouse mapping and remove the human- just invoke the --reverse flag (which does not take an argument)

Finally, if you wish to modify a SAM file to INCLUDE a prefix to the standard [0-9]|X|Y|MT designation, just use --insert and the rg tag will be prepended to every instance of the reference sequence name. This option does NOT work on mixed files- you must have a clean file with mapping to reference sequence names described as [0-9|X|Y|MT.

About

Removes reads mapping to a genome with a prefix to [0-9XY]|[MT] as its designation e.g. M1

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages