pRIblast

pRIblast is a highly efficient parallel application for extensive lncRNA-RNA interaction prediction. It is based on RIblast, the work of T. Fukunaga and M. Hamada, and has been optimized to minimize I/O latency and memory usage.

Version

Version 0.0.3.

Requirements

To compile and execute pRIblast, the following software is required:

  • GNU Make.
  • C++ compiler (with support for OpenMP and the C++17 standard).
  • MPI implementation (MPI-3 compliant).

For instance, a valid combination of these tools is GNU Make v3.82, GCC v9.3.0 and Open MPI v3.1.4.

Compilation

Download the source code from this repository, either by cloning it with Git or by downloading a copy from GitHub, and run GNU Make to compile pRIblast. This creates a binary file named pRIblast in the target folder of your current working directory.
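The compilation steps might look like the following sketch. The clone URL is an assumption derived from the GitHub project name UDC-GAC/pRIblast, and `run` and `build_priblast` are hypothetical helper names, not part of the tool. Setting `DRY_RUN=1` prints the commands without executing them.

```shell
#!/bin/sh
# Assumption: clone URL derived from the GitHub project name UDC-GAC/pRIblast.
REPO_URL="https://github.com/UDC-GAC/pRIblast.git"

# Hypothetical helper: echo a command, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Hypothetical helper: clone the sources, build them with GNU Make,
# and list the resulting binary in the target folder.
build_priblast() {
    run git clone "$REPO_URL" pRIblast &&
    run make -C pRIblast &&
    run ls -l pRIblast/target/pRIblast
}
```

Calling `build_priblast` from an empty directory should leave the binary at `pRIblast/target/pRIblast`, assuming the Makefile places it in the target folder as described above.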

Execution

To execute pRIblast, launch it through the MPI runtime as follows:

$ mpirun -np <p> -x OMP_NUM_THREADS=<t> pRIblast <options>

where <p> is the number of processes that will exist in the MPI group and <t> is the number of threads spawned per MPI process.
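As a rough illustration of how <p> and <t> combine, the sketch below builds the launch line and computes the total number of threads it would start (<p> times <t>). The helpers `is_pos_int`, `launch_line` and `total_threads` are hypothetical and not part of pRIblast. The `-x` flag is how Open MPI's mpirun exports an environment variable, so other MPI launchers may need a different flag.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for a positive integer without leading zeros.
is_pos_int() {
    case "$1" in
        ''|*[!0-9]*|0*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical helper: print the mpirun command for <p> processes with
# <t> OpenMP threads each, followed by any pRIblast options.
launch_line() {
    is_pos_int "$1" && is_pos_int "$2" || {
        echo "usage: launch_line <p> <t> [options...]" >&2
        return 1
    }
    p=$1
    t=$2
    shift 2
    echo "mpirun -np $p -x OMP_NUM_THREADS=$t pRIblast $*"
}

# Hypothetical helper: total number of threads across all processes.
total_threads() {
    echo $(( $1 * $2 ))
}
```

For example, `launch_line 16 16 db -i db.fa -o rna-db` prints a command that starts 16 processes with 16 threads each, and `total_threads 16 16` prints 256.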

As for the program options, RIblast's official repository provides a detailed list of the available execution modes (i.e. database construction and RNA interaction search) and per-mode parameters. In addition, pRIblast implements new options that give fine-grained control over the execution of the parallel algorithm. Those options are:

 (db) -a  <str>, sets the parallel algorithm used to distribute data among processes (block | heap | dynamic)
 (db) -p  <str>, sets a per process local path for fast writing of temporary output files
 (db) -c  <int>, sets the database page size (smaller page implies less memory usage)
(ris) -a  <str>, sets the parallel algorithm used to distribute data among processes (block | area | dynamic)
(ris) -p  <str>, sets a per process local path for fast writing of temporary output files
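Since the accepted `-a` values differ between the two modes, a wrapper script could check them before launching a job. The following is a minimal sketch; `valid_algorithm` is a hypothetical helper, and the accepted values are taken from the option list above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if <algorithm> is one of the values
# listed above for the given mode (db or ris).
valid_algorithm() {
    mode=$1
    algo=$2
    case "$mode:$algo" in
        db:block|db:heap|db:dynamic) return 0 ;;
        ris:block|ris:area|ris:dynamic) return 0 ;;
        *) return 1 ;;
    esac
}
```

With this helper, `valid_algorithm db heap` succeeds, while `valid_algorithm ris heap` fails because heap is only listed for the db mode.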

Execution example

Suppose you want to execute pRIblast (both the db and ris steps, using the dynamic algorithm) on a 16-node multicore cluster, with 1 process per node and 16 threads per process. The inputs are a FASTA file db.fa, which contains the RNA sequences used to construct the database, and a FASTA file ris.fa, which contains the RNA sequences whose interactions you want to predict against that database. The database page size is 500 sequences. Furthermore, every node has a local temporary disk, located in /tmp/scratch, that allows fast writing of temporary output files.

First, create the target RNA database by running the pRIblast database construction step as follows:

$ mpirun -np 16 -x OMP_NUM_THREADS=16 \
         pRIblast db -i db.fa -o rna-db -a dynamic -p /tmp/scratch -c 500

Then, predict interactions against the database by running the pRIblast RNA interaction search step as follows:

$ mpirun -np 16 -x OMP_NUM_THREADS=16 \
         pRIblast ris -i ris.fa -o predictions.txt -d rna-db -a dynamic -p /tmp/scratch

Note that the -p option is not mandatory, but it is highly recommended whenever every node has a local temporary disk attached, as this drastically reduces I/O latencies. Note also that the -c option is only available for the database construction step. It sets the page size of the database, i.e. the number of RNA sequences that will be loaded into memory at once. The smaller the page size, the less memory will be used in the ris step.
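Because -p is optional, a launch script could add it only when the scratch path is usable. The sketch below assumes that the path is the same on every node and checks it only on the node where the script runs, which is an approximation. `scratch_opt` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print "-p <dir>" if <dir> is a writable directory
# on the current node, and print nothing otherwise.
scratch_opt() {
    if [ -d "$1" ] && [ -w "$1" ]; then
        echo "-p $1"
    fi
}

# Possible use inside a launch script (not executed here):
#   mpirun -np 16 -x OMP_NUM_THREADS=16 \
#          pRIblast ris -i ris.fa -o predictions.txt -d rna-db -a dynamic \
#          $(scratch_opt /tmp/scratch)
```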

Configuration of threads, processes and algorithms

To achieve maximum performance, avoid running the block algorithm, whose only purpose is benchmarking. If the computing nodes have a high number of CPU cores, use the heap algorithm (database construction step) and the area algorithm (RNA interaction search step) to take advantage of the multithreading optimization heuristics built into the tool. In that case, spawn one process per socket and run as many threads as the socket has cores. If the number of available nodes and/or the number of CPU cores per node is low, use the dynamic algorithm instead and spawn one process per core.
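The guidance above can be sketched as a small decision helper. The text does not say what counts as a "high" or "low" number of cores or nodes, so the thresholds `MANY_CORES` and `MIN_NODES` and their default values are arbitrary assumptions to be tuned per cluster. `suggest_layout` is a hypothetical helper, not part of pRIblast.

```shell
#!/bin/sh
# Assumption: arbitrary placeholder thresholds; the text gives no numbers.
MANY_CORES=${MANY_CORES:-16}   # cores per node considered "high"
MIN_NODES=${MIN_NODES:-4}      # fewer nodes than this is considered "low"

# Hypothetical helper: suggest a layout for one pRIblast step.
# Arguments: <mode: db|ris> <nodes> <sockets per node> <cores per socket>
# Prints:    <algorithm> <processes> <threads per process>
suggest_layout() {
    mode=$1
    nodes=$2
    sockets=$3
    cores=$4
    cores_per_node=$((sockets * cores))
    if [ "$nodes" -ge "$MIN_NODES" ] && [ "$cores_per_node" -ge "$MANY_CORES" ]; then
        # Many cores: one process per socket, one thread per core of that socket.
        if [ "$mode" = "db" ]; then algo=heap; else algo=area; fi
        echo "$algo $((nodes * sockets)) $cores"
    else
        # Few nodes or few cores: dynamic algorithm, one process per core.
        echo "dynamic $((nodes * cores_per_node)) 1"
    fi
}
```

With the default thresholds, `suggest_layout db 16 2 12` prints `heap 32 12` and `suggest_layout ris 2 1 4` prints `dynamic 8 1`. The three fields map to the -a option, the -np value and OMP_NUM_THREADS respectively.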

Cite us

If you use pRIblast in your research, please cite our work using the following references:

@article{amatria2023priblast,
  title={pRIblast: A highly efficient parallel application for comprehensive {lncRNA--RNA} interaction prediction},
  author={Amatria-Barral, I{\~n}aki and Gonz{\'a}lez-Dom{\'\i}nguez, Jorge and Touri{\~n}o, Juan},
  journal={Future Generation Computer Systems},
  volume={138},
  pages={270--279},
  year={2023}
}

@inproceedings{amatria2023parallel,
  author={Amatria-Barral, I{\~n}aki and Gonz{\'a}lez-Dom{\'\i}nguez, Jorge and Touri{\~n}o, Juan},
  title={Parallel construction of {RNA} databases for extensive {lncRNA--RNA} interaction prediction},
  booktitle={Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing},
  series={SAC '23},
  pages={555--558},
  year={2023},
  address={Tallinn, Estonia}
}

License

pRIblast is free software distributed under the MIT License. However, pRIblast makes use of several modules that are not original pieces of work. Their usage is therefore subject to their corresponding THIRDPARTYLICENSE, and all rights are reserved to their authors.