A pipeline for clustering of PacBio CCS reads into Operational Taxonomic Units.

oscar-franzen/oclust

About oclust

A pipeline for clustering long 16S rRNA sequencing reads, or any sequences, into Operational Taxonomic Units.

Requirements

  • Linux v.2.6.x
  • Perl v.5.10.1
  • R (must be in PATH)
    • the seqinr package must be installed; from within R:
      > install.packages("seqinr")
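
The requirements above can be checked before a first run. The following is a minimal sketch, not part of oclust: the function names `have_tool` and `check_requirements` are hypothetical, and it assumes that Perl v.5.10.1 is a minimum version rather than an exact one.

```shell
#!/usr/bin/env bash
# Pre-flight check for the listed requirements (illustrative sketch only).

# Return 0 if the named command is found in PATH, 1 otherwise.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Print one status line per requirement; return non-zero if any is missing.
check_requirements() {
  local status=0

  # "use 5.010001" makes perl exit with an error if it is older than 5.10.1.
  if have_tool perl && perl -e 'use 5.010001;' 2>/dev/null; then
    echo "perl: ok"
  else
    echo "perl: missing or older than 5.10.1"
    status=1
  fi

  if have_tool Rscript; then
    # requireNamespace() returns FALSE when the package is not installed.
    if Rscript -e 'if (!requireNamespace("seqinr", quietly = TRUE)) quit(status = 1)' >/dev/null 2>&1; then
      echo "seqinr: ok"
    else
      echo "seqinr: not installed"
      status=1
    fi
  else
    echo "R: not found in PATH"
    status=1
  fi

  return "$status"
}

check_requirements || echo "Some requirements are missing."
```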

Note on data

The pipeline is designed for PacBio CCS reads; it will not work on raw PacBio reads.

Input files

The only input file to oclust is a file in FASTA format containing the sequencing reads to be clustered.

FASTQ files can be converted to FASTA:

   $ cd utils
   $ chmod +x fastq_to_fasta.pl
   $ ./fastq_to_fasta.pl file.fastq > file.fasta
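
To show what the conversion involves, here is a minimal awk-based sketch. It is not the bundled fastq_to_fasta.pl and may differ from it in detail. It assumes the common four-lines-per-record FASTQ layout, with sequences not wrapped over several lines. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Convert four-line-per-record FASTQ to FASTA (illustrative sketch only).
fastq_to_fasta() {
  awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header line: replace leading "@" with ">"
       NR % 4 == 2 { print }                     # sequence line: copy unchanged
      ' "$1"
}

# Count FASTA records by counting header lines.
count_fasta_records() {
  grep -c '^>' "$1"
}

# Usage:
#   fastq_to_fasta file.fastq > file.fasta
#   count_fasta_records file.fasta
```

Comparing the record count of the FASTA file with the number of reads expected from the sequencing run is a quick sanity check before clustering.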

Installation

  1. Get the repository:

    $ git clone https://github.com/oscar-franzen/oclust.git oclust

  2. Make executable (might not be necessary):

    $ cd oclust
    $ chmod +x *.pl
    
  3. Decide whether to compute distances from pairwise alignments (Needleman-Wunsch) or from a multiple sequence alignment (Infernal). The latter is substantially faster.

    The first time it is run, oclust_pipeline.pl downloads the human genome sequence and formats it.

    $ ./oclust_pipeline.pl -x <method> -f <input file> -o <output directory> -p <number of CPUs>

   General settings:
   -x PW or MSA               Can be PW for pairwise alignments (based on Needleman-Wunsch)
                               or MSA for multiple sequence alignment (based on
                               Infernal). [MSA]
   -t local or cluster        If -x is PW, should it be parallelized by running it locally
                               on multiple cores or by submitting jobs to a cluster
                               (requires a system with the LSF scheduler). [local]
   -a complete, average or    The desired clustering algorithm. [complete]
       single    
   -f [string]                Input fasta file.
   -o [string]                Name of the output directory (must not already exist); use the full path.
   -R HMM, BLAST, or none     Method to use for reverse complementing sequences. [HMM]
   -p [integer]               Number of processor cores to use for BLAST. [4]
   -minl [integer]            Minimum sequence length. [optional]
   -maxl [integer]            Maximum sequence length. [optional]
   -rand [integer]            Randomly sample a specified number of sequences. [optional]
   -human Y or N              If Y, run a BLAST-based contamination screen
                               against the human genome. [Y]
   -chimera Y or N            Run chimera check. Can be Y or N. [Y]

   LSF settings (only valid for -x PW when -t cluster):
   -lsf_queue [string]       Name of the LSF queue to use. [scavenger]
   -lsf_account [string]     Name of the account to use. [optional]
   -lsf_time [integer]       Runtime limit per job, in hours. [1]
   -lsf_memory [integer]     Requested amount of RAM in MB. [3000]
   -lsf_nb_jobs [integer]    Number of jobs. [20]
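
As an illustration of how the options combine, the sketch below assembles a command line from the flags documented above. It also checks the two constraints stated for -o: a full path, and a directory that does not exist yet. The function name `build_oclust_cmd`, the example paths, and the length cut-offs of 1200 and 1600 are hypothetical values chosen for the example, not recommendations from the author.

```shell
#!/usr/bin/env bash
# Assemble an oclust invocation from the documented flags (illustrative sketch).
build_oclust_cmd() {
  local input="$1" outdir="$2" cores="${3:-4}"

  # -o expects a full (absolute) path.
  case "$outdir" in
    /*) ;;
    *) echo "output directory must be a full path: $outdir" >&2; return 1 ;;
  esac

  # -o must name a directory that does not exist yet.
  if [ -e "$outdir" ]; then
    echo "output directory already exists: $outdir" >&2
    return 1
  fi

  # The -minl/-maxl values are hypothetical; -x, -a, -human and -chimera are set to their defaults.
  printf './oclust_pipeline.pl -x MSA -a complete -f %s -o %s -p %s -minl 1200 -maxl 1600 -human Y -chimera Y\n' \
    "$input" "$outdir" "$cores"
}

# Print the command for inspection; copy it into the oclust directory to run it.
build_oclust_cmd /data/reads.fasta /data/oclust_out_example 8
```

Printing the command instead of running it lets you review the settings first. This is useful because the first execution also downloads and formats the human genome.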

Dependencies

The oclust pipeline bundles together the following open source/public domain software:

Reference

Contact

  • p.oscar.franzen at gmail.com
