-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inefficient in high-throughput, targeted amplicon use case #306
Comments
It isn't easy to decide which is the best parallelization strategy for Clair3 as it can be used for many purposes. The largest unit per process is a sequence in Clair3 (through setting max_chunk_size and chunk_size) because for some reasons, such as, the intermediate phasing step, as the most sequence context-dependent step, can use the whole context of a single sequence at best. The candidate proposal step does not always run fast, especially when the coverage is low. If there should be a solution to serialize the candidate proposal step, I prefer it to be a flag-controlled behavior. And I prefer additional code and minimal changes to the existing code. |
I'd be interested to see a BAM where candidate generation is particularly slow so I can understand the situation better. Do you have one you can share? Here's my proposal Add a flag If I can get a BAM with a significant number of candidates in it, I'd also like to look at letting tensorflow use the threads GNU parallel isn't using in the case where contig is None. If there's a significant performance improvement to letting python use multiple threads I would propose something like this pseudocode for the clair3_c_impl.sh script:
|
Any whole genome BAM with average depth between 6-10x. As you can see, this is a situation totally different from targeted amplicon use case. Both are very common use cases that Clair3 should optimize for both in a long run.
Tensorflow takes up quite some memory, so not setting INTERNAL_THREADS to 1 is to improve the usability of Clair3 in some memory constrained environments. Setting INTERNAL_THREADS to a constant 4 was empirical also considering that Tensorflow libraries are huge, the overhead of starting many Tensorflow processes was pretty significant. |
I don't follow. Currently INTERAL_THREADS is hard coded to 1 in the Run method of CallVariantsFromCFFI. The current behavior is to start up many tensorflow processes and pay that significant overhead. I preserved the current behavior in my pseudocode above. My proposal is to not start up many tensorflow processes but to have only one using multiple cores. Sounds like we agree that this will be more memory efficient and have less overhead. We don't do whole genome sequencing, so if you don't have a BAM to link to me I'm sure there's one on the Internet somewhere I can find. I've got enough to get started. Thanks! |
Our use case is variant calling on the output of a targeted amplicon assay on many samples in parallel. The way that
clair3_c_impl.sh
parallelizes the pileup calling step in this case is inefficient. A single process that accumulates all of the candidates across all of the contigs and then does variant calling on them works great, but the best case for the current script is a separate process per contig (if you edit the hard coded max_chunk_size and set the chunk_size to be bigger than any single contig).In the current implementation, each process works on a number of candidates far smaller than the networks batch size. This involves a number of threads that are doing nothing or very little which limits the amount of real parallelization we can do in our pipeline before the CPU gets heavily overloaded. The example I have been looking at compares the current implementation vs a single process batching all of the pileup candidates, both using a single thread, and the results are over 12 minutes vs under 2 minutes.
Currently we have a patch that works for us but I'd prefer this use case were officially supported so there's no chance my patch will break on new versions. Two possible solutions are 1) when a bed file is provided, don't use GNU parallel and instead use a single process that accumulates candidates from the bed file regions, or 2) a new flag that bypasses GNU parallel and accumulates candidates across the entire genome. It might be nice in these cases for the user provided number of threads to be used by tensorflow instead.
In general, since the candidate proposal step is so fast compared the network prediction or even process start time, it seems to me like accumulating all of the candidates and then using the available threads for processing those batches would be much more efficient in all cases. This would fully utilize the network batch size where chunking the genome will never give even and consistent batches of candidates. So I wonder if the current parallelization strategy is beneficial over my proposal in any situations.
I'd be happy to start a pull request once I know what you are open to having changed.
Thanks!
The text was updated successfully, but these errors were encountered: