Jorge Navarro edited this page Dec 19, 2021 · 1 revision

Introduction


What is BiG-SCAPE

In bioinformatics, mining (meta)genomes for Biosynthetic Gene Clusters (BGCs) that encode the production of secondary metabolites has become a key strategy for natural product discovery. For a single genome, this process is performed by tools such as antiSMASH.

When studying large sets of genomes and metagenomes, it becomes essential to perform analyses at a large scale. BiG-SCAPE (Biosynthetic Gene Similarity Clustering and Prospecting Engine) is a tool that calculates distances between BGCs in order to map BGC diversity onto sequence similarity networks. These networks are then processed for automated reconstruction of Gene Cluster Families: groups of gene clusters that encode the biosynthesis of highly similar or identical molecules. BiG-SCAPE's interactive visualizations of these similarity networks allow effective exploration of BGC diversity and link BGCs to knowledge from reference data in the MIBiG repository.

How does it work in a nutshell

BiG-SCAPE (recursively) reads BGC information stored as GenBank files from the input folder (preferably, these files correspond to gene clusters identified with a tool like antiSMASH).
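As a rough illustration of what a recursive scan of the input folder looks like, the sketch below collects GenBank files at any depth. The `.gbk` extension and the function name `find_genbank_files` are assumptions made for this example; this is not BiG-SCAPE's own code.

```python
from pathlib import Path


def find_genbank_files(input_dir, suffix=".gbk"):
    """Return all files under input_dir (at any depth) whose name ends with suffix."""
    root = Path(input_dir)
    # rglob walks the folder tree recursively; sorting keeps the order stable.
    return sorted(p for p in root.rglob(f"*{suffix}") if p.is_file())
```

Calling `find_genbank_files("my_input_folder")` would return a sorted list of `Path` objects, including files in nested subfolders.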

BiG-SCAPE then uses the Pfam database and hmmscan from the HMMER suite to predict Pfam domains in each sequence, thus summarizing each BGC as a linear string of Pfam domains.
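To make the idea of a "linear string of Pfam domains" concrete, the following sketch orders a set of domain hits by gene and by position within each protein. The `DomainHit` layout is hypothetical and chosen only for illustration; parsing real hmmscan output and resolving overlapping hits are not shown.

```python
from typing import List, NamedTuple


class DomainHit(NamedTuple):
    gene_index: int  # position of the gene within the cluster (hypothetical field)
    start: int       # start of the domain within the protein
    pfam_id: str     # Pfam accession, e.g. "PF00109"


def domain_string(hits: List[DomainHit]) -> List[str]:
    """Order hits by gene, then by position in the protein, and keep the Pfam IDs."""
    ordered = sorted(hits, key=lambda h: (h.gene_index, h.start))
    return [h.pfam_id for h in ordered]


# Hypothetical hits, deliberately given out of order.
hits = [
    DomainHit(1, 10, "PF00698"),
    DomainHit(0, 250, "PF02801"),
    DomainHit(0, 5, "PF00109"),
]
print(domain_string(hits))  # ['PF00109', 'PF02801', 'PF00698']
```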

For every pair of BGCs in the set, the pairwise distance is calculated as a weighted combination of the Jaccard, Adjacency Index (AI) and Domain Sequence Similarity (DSS) indices. Two types of output are generated: text files, which include network files, and an interactive visualization. Different distance cutoff values can be applied in one or multiple runs (i.e. only pairs with Raw Distance < cutoff are written to the final .network file).
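A minimal sketch of how such a weighted combination and cutoff filter could look is given below. Several things here are assumptions rather than BiG-SCAPE's exact method: the Jaccard index is taken over the sets of distinct domains, the AI is approximated as a Jaccard index over unordered pairs of adjacent domains, the DSS is passed in as a precomputed number, the distance is taken as one minus the weighted sum of the three indices, and the weights and domain strings are illustrative values only.

```python
def jaccard_index(domains_a, domains_b):
    """Shared distinct domains divided by all distinct domains."""
    set_a, set_b = set(domains_a), set(domains_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def adjacent_pairs(domains):
    """Unordered pairs of neighbouring domains in a domain string."""
    return {tuple(sorted(pair)) for pair in zip(domains, domains[1:])}


def adjacency_index(domains_a, domains_b):
    """Shared adjacent-domain pairs divided by all distinct adjacent pairs."""
    pairs_a, pairs_b = adjacent_pairs(domains_a), adjacent_pairs(domains_b)
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


def raw_distance(jaccard, ai, dss, weights):
    """Assumed form: distance = 1 - weighted sum of the three similarity indices."""
    similarity = (weights["jaccard"] * jaccard
                  + weights["ai"] * ai
                  + weights["dss"] * dss)
    return 1.0 - similarity


def pairs_below_cutoff(distances, cutoff):
    """Keep only the pairs whose distance is strictly below the cutoff."""
    return {pair: d for pair, d in distances.items() if d < cutoff}


# Hypothetical domain strings and weights, for illustration only.
bgc_a = ["PF00109", "PF02801", "PF00698"]
bgc_b = ["PF00109", "PF02801", "PF08659"]
weights = {"jaccard": 0.2, "ai": 0.05, "dss": 0.75}

j = jaccard_index(bgc_a, bgc_b)     # 2 shared domains out of 4 distinct ones
ai = adjacency_index(bgc_a, bgc_b)  # 1 shared adjacent pair out of 3 distinct ones
dss = 0.8  # placeholder; computing DSS requires domain sequence comparisons
distance = raw_distance(j, ai, dss, weights)
```

With these illustrative numbers the pair would pass a cutoff of 0.3, so `pairs_below_cutoff({("bgc_a", "bgc_b"): distance}, 0.3)` keeps it.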

The distances for each cutoff value will be used to [automatically define](GCFs and GCCs) 'Gene Cluster Families' (GCFs) and 'Gene Cluster Clans' (GCCs).
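To give a feel for how a cutoff turns pairwise distances into groups, the sketch below puts BGCs in the same group whenever they are linked by a chain of pairs with distance below the cutoff (connected components). This is a deliberately simplified stand-in and not BiG-SCAPE's actual family-calling procedure, which is described on the linked page; the function name and the example distances are hypothetical.

```python
def group_by_cutoff(bgcs, distances, cutoff):
    """Group BGCs into connected components of the 'distance < cutoff' graph."""
    parent = {b: b for b in bgcs}

    def find(x):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        if d < cutoff:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for b in bgcs:
        groups.setdefault(find(b), []).append(b)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical distances between four BGCs.
distances = {
    ("bgc1", "bgc2"): 0.10,
    ("bgc2", "bgc3"): 0.25,
    ("bgc3", "bgc4"): 0.70,
}
print(group_by_cutoff(["bgc1", "bgc2", "bgc3", "bgc4"], distances, cutoff=0.3))
# [['bgc1', 'bgc2', 'bgc3'], ['bgc4']]
```

Raising the cutoff merges more BGCs into the same group, which is why results are reported per cutoff value.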

By default, BiG-SCAPE uses the /product information of antiSMASH-processed GenBank files to separate the analysis into eight [BiG-SCAPE classes](BiG-SCAPE classes). Each has different (tuned) sets of [weights](distance indices weights) for the distance components. You can also choose to combine all BGC classes into a single network file (--mix) and deactivate the default classification (--no_classify). It is also possible to prevent analysis of any of the BiG-SCAPE classes by using the --banned_classes parameter.
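The options above can be combined on one command line. The sketch below only assembles an argument list and does not run anything. The `-i` and `-o` options for the input and output folders, and `Saccharides` as a class name, are assumptions here and should be checked against `python bigscape.py -h` and the classes page; the helper name `build_bigscape_command` is hypothetical.

```python
def build_bigscape_command(input_dir, output_dir, mix=False,
                           no_classify=False, banned_classes=()):
    """Assemble an argument list for bigscape.py without running it."""
    # -i / -o for the input and output folders are assumed; verify with `-h`.
    cmd = ["python", "bigscape.py", "-i", input_dir, "-o", output_dir]
    if mix:
        cmd.append("--mix")
    if no_classify:
        cmd.append("--no_classify")
    if banned_classes:
        cmd.append("--banned_classes")
        cmd.extend(banned_classes)
    return cmd


# "Saccharides" is an example class name; see the classes page for valid names.
command = build_bigscape_command("gbks/", "results/", mix=True,
                                 banned_classes=["Saccharides"])
print(" ".join(command))
# python bigscape.py -i gbks/ -o results/ --mix --banned_classes Saccharides
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the options have been verified.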

Learn more about the BiG-SCAPE options with python bigscape.py -h or by going to the specific wiki page.

See the related pages in the wiki for more detailed information.
