processing large numbers of files with Snakemake and gridss? #354
Comments
This won't work. The sample tracking and breakdown fields in the assembly use the ordinals of the corresponding input/label, and these ordinals must match the output VCF ordinals. Assuming your overall workflow looks something like the following:
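A minimal sketch of that kind of batched layout, using illustrative file names and assuming the gridss driver script's -r/-s/-a/-o options (check gridss --help for your version):

```bash
# preprocess every input independently (one cluster job per file)
gridss -r ref.fa -s preprocess input1.bam   # ... likewise for input2.bam .. input6.bam

# assemble in independent batches, each batch seeing only its own inputs
gridss -r ref.fa -s assemble -a batch1_assembly.bam input1.bam input2.bam input3.bam
gridss -r ref.fa -s assemble -a batch2_assembly.bam input4.bam input5.bam input6.bam

# attempt a single joint call over all six inputs
# (combined_assembly.bam stands for however the two per-batch assemblies are
#  combined; inside batch2_assembly.bam the samples are tracked as ordinals
#  0/1/2, which is what breaks the mapping described below)
gridss -r ref.fa -s call -o all_samples.vcf.gz -a combined_assembly.bam \
  input1.bam input2.bam input3.bam input4.bam input5.bam input6.bam
```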
Then the VCF will have no assembly support for inputs 4/5/6, since the mapping is based on ordinal: input4's assembly support will be allocated to input1, input5's to input2, and input6's to input3. The workaround is to introduce empty placeholder bams when you do each assembly step, so your pipeline needs to look like:
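A sketch of the placeholder approach (empty1.bam .. empty6.bam stand for valid but read-free BAMs, or symlinks to a single shared empty.bam; names are illustrative):

```bash
# every assembly is given the full sample list, in the same order;
# samples that are not part of the batch are replaced by empty placeholder BAMs
gridss -r ref.fa -s assemble -a batch1_assembly.bam \
  input1.bam input2.bam input3.bam empty4.bam empty5.bam empty6.bam
gridss -r ref.fa -s assemble -a batch2_assembly.bam \
  empty1.bam empty2.bam empty3.bam input4.bam input5.bam input6.bam
# (the symlink layout described further down makes the file names themselves
#  identical across batches as well)
```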
This will allow you to batch your assemblies (related samples in the same batch where possible) without breaking the assembly-to-input mapping. Note that joint assembly can also be spread across multiple nodes, and a batch of up to roughly ~1000x aggregate coverage is about as large as is practical.
The key here is that the call step requires a single assembly bam, since GRIDSS technically only supports joint assembly. As I posted above, you can simulate/hack this with empty bam files. The safest approach is to perform each assembly in its own working directory and symlink either to the real input or to the empty input (depending on which files you're processing). This way all assemblies just have the same set of input file names. For example, something like the following would set up the first batch:
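For instance, a sketch of such a setup (paths and names are illustrative; empty.bam is assumed to be a valid, read-free BAM):

```bash
# one working directory per assembly batch, each seeing an identical set of
# file names: real BAMs for this batch's samples, a shared empty BAM for the rest
mkdir -p assembly_batch1 && cd assembly_batch1

# real inputs for this batch
ln -s ../bams/input1.bam input1.bam
ln -s ../bams/input2.bam input2.bam
ln -s ../bams/input3.bam input3.bam

# placeholders for every sample not in this batch
ln -s ../empty.bam input4.bam
ln -s ../empty.bam input5.bam
ln -s ../empty.bam input6.bam

gridss -r ../ref.fa -s assemble -a batch1_assembly.bam \
  input1.bam input2.bam input3.bam input4.bam input5.bam input6.bam
```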
Thanks for the response, and please forgive my ignorance: I am a software engineer assisting with the workflow, but I have no prior experience at all with this field.

For joint assembly, I have around 850 input BAM files, but how does that number of files relate to the "~1000x aggregate coverage" you mention? Do you have any tips on how to estimate or measure aggregate coverage from my input files so I can guide the batch size?

Based on your comments, I think my best approach is to use multinode assembly on batches that are as large as possible, and combine that with the empty BAM trick to split into multiple batches. As for pre-processing, hopefully I can preprocess the empty BAM just once and then symlink to the output files as required. Also, does gridss do that step for me, or do I need to handle it in the workflow?

Finally, I have done some scaling analysis of pre-processing and assembly, but only on 2 input files and only testing variable thread counts on a single node. This was to help guide the resource requests on the cluster. I can share the results with you if they are of any interest, but I am sure you have examined this already.
The assembly algorithm is single-threaded. By default, the assembly process splits the genome into 10-megabase chunks and multiple assemblies are done in parallel on different regions of the genome. The recommended settings are 8 threads and a 25-31Gb heap size (ie just under 32Gb so Java compressed OOPs can be used), so up to 8 assembly chunks will be processed in parallel. Each chunk gets written to its own output file (in the assembly working directory); these are then concatenated together at the end to generate the assembly.bam.
These are all independent jobs; each job processes the chunks assigned to it. All of this is handled internally by GRIDSS (either in the driver script, or the java code).
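As an illustration of how that can map onto SLURM, a sketch assuming the gridss driver's --jobindex/--jobnodes options for multi-machine assembly (flag names and the final gather behaviour are worth confirming against your version's --help and docs):

```bash
#!/bin/bash
#SBATCH --array=0-9
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
# each array task assembles only the ~10Mb chunks assigned to it;
# chunk outputs land in the assembly working directory
gridss -r ref.fa -s assemble -a assembly.bam -t 8 --jvmheap 31g \
  --jobindex "${SLURM_ARRAY_TASK_ID}" --jobnodes 10 \
  input1.bam input2.bam input3.bam

# once all array tasks have finished, a single follow-up gridss invocation
# (the call step, or one more assemble run without the job-splitting flags)
# gathers the completed chunks into assembly.bam
```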
Technically speaking, GRIDSS does not support multiple assembly files at all. The batching with empty files is a work-around. In this case, yes, you'll need to do the
Correct, just symlink the files with the appropriate names as I outlined in my previous example.
That depends on your sequencing depth (your biologist should be able to tell you the depth they sequenced to). If you have done GRIDSS preprocessing for a file, you can find its read length and read counts in the metrics files in that file's GRIDSS working directory. Note that this assumes whole genome sequencing: if you did targeted or exome sequencing, your denominator is the size of the capture region, not the size of the entire genome.
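As a rough way to put a number on it (a sketch using samtools; the read counts and lengths could equally be taken from the GRIDSS preprocessing metrics): per-sample coverage is approximately mapped reads x read length / genome size, and the aggregate coverage of a batch is the sum over its samples.

```bash
GENOME_SIZE=3100000000   # ~3.1 Gb for human WGS; use the capture size for exome/targeted data
total=0
for bam in input*.bam; do
  reads=$(samtools view -c -F 0x904 "$bam")    # primary mapped reads only
  readlen=$(samtools view "$bam" | head -n 10000 | awk '{s+=length($10)} END {print int(s/NR)}')
  cov=$(echo "$reads * $readlen / $GENOME_SIZE" | bc -l)
  printf '%s\t%.1fx\n' "$bam" "$cov"
  total=$(echo "$total + $cov" | bc -l)
done
printf 'aggregate\t%.0fx\n' "$total"
```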
Finally someone is talking my language :) Thanks, I have my work cut out for me tomorrow adjusting the workflow and running tests! For assembly, my black-box scaling tests showed low efficiency but still usable reductions in processing time for up to 20 threads, though my heap size was larger. I will test again with your suggested settings.
I need to process approximately 850 large files with gridss. I have a Snakemake workflow that groups the files into batches, runs pre-processing on each file in parallel on a cluster, and then runs assembly on each batch of files. Multiple assemble processes run in parallel as batch jobs on SLURM. I then intend to attempt a call operation on the results of all parallel assembly runs.
My question is: do the parallel gridss assemble processes need separate working directories to avoid conflicts, or can I run them all in the same working directory? There will be many assemble processes running concurrently on the cluster as the workflow moves through the entire set of input files.
Currently, each batch of input files has its own working directory, just in case. But if separate working directories are required for assemble, how can I run the call step on everything? Can I just move everything into a common directory prior to the call? Any advice is appreciated.