Memory in peakcalling checkBams #78
Comments
So you suggest that we remove checkBams? I actually quite like the functionality of this code though because it generates an output table showing where the majority of your reads are filtered. Since peakcalling is your baby, what's your opinion @Charlie-George?
I personally think that any algo that is trying to hold a whole BAM file in memory is a non-starter. I agree that it's useful to know at what stages reads are being filtered, but there needs to be a better way than this.
Ah, I see your point. It seems like this is the offending line: cgat-flow/cgatpipelines/tasks/peakcalling.py, line 577 (at 06f35b7).
How about a workaround:
Then the iterator works on the object and not over the whole file, is this correct? Please tell me if I'm talking crap.
The offending lines are these: cgat-flow/cgatpipelines/tasks/peakcalling.py, lines 607 to 609 (at 06f35b7).
Yep, I totally agree it's far too memory-consuming. I think the function's useful though, especially when we were checking we had the flags correct etc.; however, now it's stable we could take it out. As a quick workaround we added in a memory option, and my longer-term plan was to rewrite it to be more memory-efficient. That hasn't been a top priority and won't be until after Xmas, when I was planning to spend some time on this and actually split the bam-filtering step out of the peakcalling pipeline into a bamprocessing pipeline for filtering, merging and downsampling BAMs, in which case I think it would be useful to have the checkBams function there.
Thought I'd do a quick check to see quite how memory-consuming it was. I tried to load a 2 GB BAM file (about 15 million reads) the same way on my desktop. This has now consumed over 20 GB of RAM and is still going (very, very slowly, as the machine is now thrashing the disk). So if we are talking more than 10x the size of the BAM file, and say a simple ChIP-seq experiment has 9 samples, you'd need to find a lot of memory for what should be a fairly simple operation.
What was the consensus on checkBams?
It's on the TODO list to change.
The function checkBams, called by cgatpipelines.tasks.filterBams, is seg-faulting or giving a MemoryError. One solution is obviously to increase the amount of RAM, but this seems like it will be a recurring problem as files get bigger. The problem appears to be that checkBams is building a dictionary of all the reads. Since the in-memory structure will be decompressed BAM, it cannot take less space than the decompressed BAM file would. Since the average compressed BAM file is in the tens of GB, this is never going to work with real data of any size.
checkBams seems to check that the filtering has worked ... in what situation would this not be true? Surely we have to trust that, for example, MarkDuplicates does what it says it will do?
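For what it's worth, the per-stage filtering table could be produced without a dictionary of reads at all. Here is a streaming sketch (the stage names, predicates and dict-shaped reads are hypothetical illustrations, not the cgat-flow API): keep one counter per filter stage and bump it as reads go past, attributing each read to the first filter that rejects it.

```python
from collections import Counter

def tally_filters(reads, filters):
    """Count how many reads each filter removes, in a single
    streaming pass, without retaining the reads themselves.

    `filters` is an ordered list of (name, predicate) pairs; a
    read is attributed to the first predicate that rejects it.
    """
    counts = Counter()
    kept = 0
    for read in reads:
        for name, rejects in filters:
            if rejects(read):
                counts[name] += 1
                break
        else:  # no filter rejected this read
            kept += 1
    counts["kept"] = kept
    return counts

# Hypothetical example: reads as plain dicts with the fields
# the predicates look at.
reads = [
    {"mapq": 0, "duplicate": False},
    {"mapq": 30, "duplicate": True},
    {"mapq": 30, "duplicate": False},
]
filters = [
    ("low_mapq", lambda r: r["mapq"] < 10),
    ("duplicate", lambda r: r["duplicate"]),
]
print(tally_filters(reads, filters))
```

Memory use is one integer per filter stage rather than one entry per read, so the output table the thread wants to keep survives even for arbitrarily large BAMs.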