Skip to content

Latest commit

 

History

History
337 lines (226 loc) · 16.8 KB

README.rdoc

File metadata and controls

337 lines (226 loc) · 16.8 KB

1. Getting started

*Process your raw data without cooking it*

SUSHI can Support Users for SHell script Integration, defined as a recursive acronym as usual, but someone might say after using it, SUSHI is a Super Ultra Special Hyper Incredible system!! SUSHI is an agile and extensible data analysis framework that brings innovative concepts to Next Generation Sequencing bioinformatics. SUSHI lets bioinformaticians wrap open source tools like e.g. bowtie, STAR or GATK into apps and provides natively a commandline and a web interface to run these apps. Users can script their analysis as well as do an analysis-by-clicking in their browser. When running an SUSHI application SUSHI takes care of all aspects of documentation: Input data, parameters, software tools and versions are stored persistently. SUSHI accepts meta-information on samples and processing and lets the user define the meta-information she/he needs in a tabular format. Finally, all results, the associated logs, all parameters and all meta-information is stored in a single, self-contained directory that can be shared with collaborators. Altogether, SUSHI as a framework truly supports collaborative, reproducible data analysis and research. At the Functional Genomics Center, SUSHI is tightly integrated with B-Fabric(fgcz-bfabric.uzh.ch) and shares the authentication, sample information and the storage. The production version of SUSHI is currently being further integrated in the overall B-Fabric framework.

Getting started at the demo SUSHI server: fgcz-sushi-demo.uzh.ch, just go and play around with it, and feel what SUSHI tastes like. Go to *Todays Menu* and select your project number and then DataSet to start analyzing data or viewing analysis results. You can find the basic usage here: fgcz-sushi.uzh.ch/usage.html

2. Installation

First of all, let me introduce quickly main SUSHI components:

  1. SUSHI server (Web-based front-end, Ruby on Rails)

  2. SUSHI application (sushi_fabric.gem)

  3. Workflow manager (workflow_manager.gem)

SUSHI server is the user interface and it send some requests to SUSHI application, and the SUSHI application calls an job submission command of the Worlflow manager that manages and observes each submitted job. All the components will be installed at once by using bundle command (see the next section). Note that the workflow manager is working as an independent process from SUSHI server (Ruby on Rails) process. Both processes do not have to run on a same computer node (they communicate each other using dRuby).

2.1. Requirements

  • Ruby (>= v2.6.x)

  • Git (>= v2.10.2)

  • bundler (RubyGem, >= v1.10.6)

  • (Ruby on Rails (>= v5) this wll be automatically installed through bundler)

Just confirm first to check if the required applications are installed in your system as follows:

$ ruby -v
$ git --version
$ bundle -v

If you get ‘Command not found’-like message, you have to install the applications before going on to the next step.

Note

  • Ruby on Rails will be installed in the following steps by bundler

  • Please refer to the following sites for the other application installation

  • Ruby www.ruby-lang.org/en

  • Git git-scm.com

  • To instal ‘bundler’, just type the following command after you install Ruby:

    $ gem install bundler

2.2. Download and configration

$ git clone https://github.com/uzh/sushi.git
$ cd sushi/master
$ bundle install --path vendor/bundle
$ bundle exec rake secret

If you get an error during gem library installation, please try 1) delete Gemfile.lock and 2) bundle install again. This may adjust the library version to your environment.

‘rake secret’ command gives you the secret key and paste it in /config/initilizer/devise.rb as follows:

config.secret_key = 'xxxxx'

# xxxxx is your secret key. Next,

$ bundle exec rake db:migrate RAILS_ENV=production

This command initializes the database appropriately.

3. Start SUSHI server

3.1. Start workflow_manager

$ mkdir workflow_manager
$ cd workflow_manager/
$ bundle exec workflow_manager
mode = development
ruby = xxx
GEM_PATH = xxx
DB = PStore
Cluster = LocalComputer
druby://localhost:12345

Note

  • The final line shows the dRuby address of the workflow manager

  • If KyotoCabinet (gem) is installed in your system, KyotoCabinet is automatically selected as a DB system rather than PSotre

  • KyotoCabinet: fallabs.com/kyotocabinet/pkg

3.2. Start SUSHI server

$ cd master/
$ bundle exec rails server -e production

Note

  • SUSHI (Ruby on Rails) server should run in a different folder from the workflow manager folder

  • The rails server command starts a WEBRick webserver daemon, which is implemented in Ruby

  • SUSHI server uses SQLite3 DBMS as a default

  • You can replace the DBMS to MySQL and the webserver application to Apache (+passenger module) if you need, then SUSHI will be much faster and more stable

  • Please search google and refer to a lot of Rails documents on the web for it

4. Set your first NGS Data and DataSet file

You need to import a DataSet file (.tsv) to SUSHI database.

A SUSHI application takes a DataSet, which is meta-information set of actual data, as an input, and a SUSHI application produces another DataSet as a result. The DataSet is identified as a .tsv (Tab-Separated-Value) text file, and it includes all input/output actual data file location (path). All DataSet file (.tsv) should be located in your project directory, the default project path becomes $RAILS_ROOT/public/gstore/projects/p1001, which directory will be automatically generated at the first running of the SUSHI server.

Sample DataSet

You can download it from your SUSHI server, or you can copy it from $RAIL_ROOT/public/ventricles_100k.tgz. After you decompress the archive file, move the folder in the project directory, $RAILS_ROOT/public/gstore/projects/p1001/

# Currently, SUSHI does not have a function to upload the data itself. It is the SUSHI framework concept, SUSHI does not manage data itself, but SUSHI does manage only the meta-information, called DataSet.

then you can find it from the SUSHI server by clicking ‘gStore’ of the main menu, and you can find ‘ventricles_100k’ directory. Double click the directory and you will find ‘dataset.tsv’ file in the ventricles_100k directory. By clickking the “+”(plus) button next to ‘dataset.tsv’ file under the ventricles_100k directory imports it as a SUSHI DataSet to SUSHI database, and a new DataSet, ventricles_100k, will show up in ‘DataSet’ view (automatically jump to ‘DataSet’ view and also you can select ‘DataSet’ menu from the top menu).

5. How to run a SUSHI application

You can run a SUSHI application after selecting a DataSet:

  1. Select a DataSet

  2. Click a SUSHI application button

  3. Set parameters

  4. Click ‘submit’ button

Note

  • Initially, only ‘WordCountApp’ is available, though other SUSHI application buttons appear in DataSet view.

  • After installing ezRun package and installing corresponding applications in your system, the other SUSHI applications will work. See the following steps.

6. Install ezRun package

Please refer to github.com/uzh/ezRun in detail.

Simply to say, just run the following command in R environement:

> source("http://bioconductor.org/biocLite.R")
> biocLite("devtools")
> biocLite("uzh/ezRun")

And set the following constants in lib/global_variables.rb appropriately:

EZ_GLOBAL_VARIABLES = '/usr/local/ngseq/opt/EZ_GLOBAL_VARIABLES.txt'
R_COMMAND = '/usr/local/ngseq/stow/R-3.2.0/bin/R'

You can find the EZ_GLOBAL_VARIABLES.txt location in your system by loading ezRun package

> library(ezRun)

in the R environment console, and R path for

$ which R

in shell command terminal.

7. Install third party applications and configure EZ_GLOBAL_VARIABLES.txt

ezRun package bundles a sort of wrapper applications to call a thrid party application software. For example, FastQCApp (SUSHI application) calles EzAppFastqc method in ezRun package, and subsequently it calls ‘fastqc’ command defined in EZ_GLOBAL_VARIABLES.txt. In other words, FastQCApp does not work without the installation of FastQC application software. All the application paths are defined in EZ_GLOBAL_VARIABLES.txt explicitly. Please check the application paths defined in EZ_GLOBAL_VARIABLES.txt one by one and set the correct path installed in your environement. If an appropriate application software is not found, the corresponding SUSHI application will stop with an error, though the SUSHI application button in DataSet view appears. SUSHI does not check if the third party application software is installed in your computer.

8. Install iGenome reference

In the case of using alignment tools, such as STARApp and BowtieApp, you need reference genome sequence(s). We use iGenome format for the reference data. If you can find your target species in the site of Illumina/iGenome, you can download it in your system as a reference of SUSHI application. The current all SUSHI applciations refer to the following environment variable

  • GENOME_REF_DIR = ‘/srv/GT/reference’

in lib/global_variables.rb and

  • GENOMES_ROOT = ‘/srv/GT/reference’

in EZ_GLOBAL_VARIABLES.txt. Please set the GENOME_REF_DIR and GENOMES_ROOT path accordingly to the local iGenome folders in your system.

Reference

9. Final small configuration

Please make a symbolic link to gstore directory in public directory, namely just run the following command in public/ directory:

$ ln -s gstore/projects

and set the following environment variable

in EZ_GLOBAL_VARIABLES.txt

10. Advanced SUSHI configuration

10.1. SUSHI configuration

Usually Ruby on Railes uses either configuration file depending on the Rails mode (development or production):

  • config/environments/development.rb

  • config/environments/production.rb

in these files, the following four properties are used in SUSHI:

  1. config.workflow_manager = ‘druby://localhost:12345’

  2. config.gstore_dir = ‘/sushi/public/gstore/projects’

  3. config.sushi_app_dir = ‘/sushi’

  4. config.scratch_dir = ‘/tmp/scratch’

config.workflow_manager is the hostname and port number of which the workflow manager is working. config.gstore_dir is the path to the actual project folder where input and result data is stored. config.sushi_app_dir is the location of SUSHI server (Ruby on Rails) installation. config.scratch_dir is used by actuall shell script as a temporary directory. After a job finished, all the data is copied to config.gstore_dir from config.scratch_dir. config.scratch_dir should be enough disk space for every job calculation, otherwise a job will stop due to no disk space.

10.2. How to make/add a new SUSHI application

All SUSHI applications (classes) inherit SushiApp class that is defined in sushi_fabric.gem. At the moment, we must make a SUSHI application class file manually and import it in lib directory. You can refer to many existing application files in lib directory. lib/otherApps/WordCountApp.rb is so smallest application that you can refer to for the first time when you create a new SUSHI Application.

Template of SUSHI application

#!/usr/bin/env ruby
# encoding: utf-8

require 'sushiApp'

class YourSushiApplication < SushiApp
  def initialize
    super
    @name = 'Application_Name'      # No Space
    @analysis_category = 'Category' # No Space
    @required_columns = ['Name', 'Read1']
    @required_params = []
  end
  def next_dataset
    {'Name'=>@dataset['Name'],
     'Stats [File]'=>File.join(@result_dir, @dataset['Name'].to_s + '.stats')
    }
    # [File] tag value (as a file) will be exported in gStore directory
  end
  def preprocess
  end
  def commands
    'echo hoge'
  end
end

Note

  • The important methods required to define are initialize, next_dataset, and commands

  • The preprocess method is not required but it will be executed before calling commands method

  • The next_dataset method must return Hash type data, the key will be column name and the value will be corresponding value to the column name

  • The commands method must return String type data that will become actual script commands

  • @name and @analysis_category instance variable is required and no space must be inserted

After you make your new SUSHI application code, put it in lib/ directory and restart the SUSHI server. The SUSHI application button will appear in DataSet view if the DataSet has the required columns defined in your SUSHI applicaiton.

10.3. Workflow Manager configuration

As well as SUSHI configuration, there are two configuration files depending on running mode (development/production):

  • config/environments/development.rb

  • config/environments/production.rb

These files are generated automatically with default setting at the first run of Workflow Manager.

WorkflowManager::Server.configure do |config|
  config.log_dir = '/home/workflow_manager/logs'
  config.db_dir = '/home/workflow_manager/dbs'
  config.interval = 30
  config.resubmit = 0
  config.cluster = WorkflowManager::LocalComputer.new('local_computer')
end

config.log_dir is the log directory path, and config.db_dir is the database directory path. config.interval means the interval time (seconds) to check a running job whether it is running, finished, or failed. When config.resubmit > 0, a failed job is resubmitted again. config.cluster sets a cluster class which has several method definitions depending on the cluster on which a job will actually run. For more details to set/customize/add a cluster, refer to the next section.

10.4. How to add a new cluster

For example, WorkflowManager::LocalComputer class, inheriting WorkflowManager::Cluster class, defines (overwrites) the following methods:

  1. submit_job

  2. job_running?

  3. job_ends?

  4. copy_commands

  5. kill_command

  6. delete_command

  7. cluster_nodes

When you add or customize your cluster system (which has calculation node(s)), you need to add a new cluser class and define (overwrite) these methods. Even though you have only one laptop computer, it is regonized as a ‘cluster’. These methods are called from SUSHI server via WorkflowManager instance and SUSHI manages the job submission. The Workflow Manager instance should run in the cluster system, but it is not required if the submission and job managing commands are available from the node where the Workflow Manager instance is running on.

Actually, although the WorkflowManager::LocalComputer class file is located in

{workflow_manager gem installed directory}/lib/workflow_manager/cluster.rb

, you can make a new class definition in the Workflow Manager configuration file (the following Ruby source code file):

  • config/environments/development.rb

  • config/environments/production.rb

An example to add a new cluster node is shown below:

#!/usr/bin/env ruby
# encoding: utf-8

class WorkflowManager::FGCZCluster
  alias :cluster_nodes_orig :cluster_nodes
  def cluster_nodes
    nodes = cluster_nodes_orig
    nodes.merge!({
      'fgcz-h-007: cpu 16,mem 61GB,scr 350GB' => 'fgcz-h-007',
      'fgcz-h-008: cpu 16,mem 61GB,scr 350GB' => 'fgcz-h-008',
    })
    Hash[*nodes.sort.flatten]
  end
end

WorkflowManager::Server.configure do |config|
  config.log_dir = '/srv/GT/analysis/masaomi/workflow_manager/run_workflow_manager/logs'
  config.db_dir = '/srv/GT/analysis/masaomi/workflow_manager/run_workflow_manager/dbs'
  config.interval = 30
  config.resubmit = 0
  config.cluster = WorkflowManager::FGCZCluster.new('FGCZ Cluster')
end

The other methods can be updated as well accordingly to your environment. Please refer to the WorkflowManager::LocalComputer class source code for more details.

10.5. Local user authentification

As default, there is no user authentification. Please apply the patch, local_devise.20160415.patch, before executing DB migration (rake db:migrate) as follows:

$ patch -p0 < master/local_devise.patch

Release information

Ver. 5.0.0, 2023.02.19

  • Rails v6.1.7.2, Ruby v3.1.3, ezRun v3.16.1 (Bioconductor v3.16), sushi_fabric v1.1.8, workflow_manager v0.8.0

Ver. 2.1.2, 2021.03.11

  • Rails v5.2.4.3, Ruby v2.6.6, ezRun v3.12.1 (Bioconductor v3.12.1, R v4.0.3), sushi_fabric v1.0.8, workflow_manager v0.6.0

  • all source files are concatenated in g-req copy case of job footer, sushi_fabric.gem ver.1.0.8

  • remove non-existing igv files in MpileupApp

  • extend EdgeR memory options

  • add SCMultipleSamplesApp in replacement of the other merging samples apps

  • Fixed SBATCH job dependency option, sushi_fabric.gem ver.1.0.7

  • Fixed EAGLERCApp

Ver. 2.1.0, 2020.11.04

  • Rails v5.2.4.3, Ruby v2.6.6, ezRun v3.12.1, sushi_fabric v1.0.0, workflow_manager v0.5.9

Ver. 2.0.2

  • fastq bug fix

  • Workflow Manager v0.5.6, added a new Cluster class for Debian10 (with slurm)

Ver. 2.0.0

  • Rails v5 + Ruby v2.6 + ezRun v3.11.1 (BioConductor v3.11.1)

Ver. 2.2.0

  • Rails v5 + Ruby v2.6 + ezRun v3.13 (BioConductor v3.13)