nanorepeat "freezing" #18

Open
fansalon opened this issue Dec 13, 2024 · 3 comments

@fansalon

Hi there,

I am running nanorepeat on some ONT bam files with the following command:

nanoRepeat.py -i $bam -t bam -d ont_sup -r $genome -b $bed -c 20 -o $output

where $bed contains 6,841 entries and $bam is a ~100G file.

The problem is that after a few hours of running, nanorepeat seems to freeze. At the beginning of the run I could see temporary subfolders in the output folder, such as ${sample}.NanoRepeat_temp.chr8-20707547-20707710-AAGA. Now the output folder contains only a subfolder named ${sample}.details. The program is still running, but the last file operation appears to have happened more than 6 hours ago and no *tsv file has been generated.

I repeated the analysis on a SR bed file containing 10 entries and everything ran smoothly. Might the problem be linked to the large size of the BAM (~100G) and BED (7k entries) files?

I have increased the memory to 500 GB, but nothing changed.

Do you have any idea what is going on?

Thanks in advance,
Federico

@fangli80
Collaborator

Hi Federico,
Thanks for your feedback. NanoRepeat might be stuck in a specific repeat region. Can you locate this region? You can find it in the stderr output.

like this: [image]

By the way, please try to use -c 8 instead of -c 20. When using more than 8 threads, the speed improvement is minimal, but the I/O load is very high.

@fansalon
Author

fansalon commented Dec 14, 2024

Hi,

Thanks a lot for the feedback and suggestion. I am now running with -c 8. It does seem that nanorepeat gets stuck at a specific repeat: it has been stuck at the repeat chrY-87448334-87448375-A for 16 h, see below:

[12/14/2024 00:40:07] NOTICE: [Process 04] Quantifying repeat: chrY-87448334-87448375-A
[12/14/2024 00:40:07] NOTICE: [Process 04] Step 1: finding anchor location in reads
[12/14/2024 00:40:07] NOTICE: [Process 04] Step 2: round 1 and round 2 estimation
[12/14/2024 00:40:08] NOTICE: [Process 04] Step 3: round 3 estimation
[12/14/2024 00:40:14] NOTICE: [Process 04] Step 4: phasing reads using GMM

I could certainly remove this repeat from my input file and re-run; I have no problem doing that. But I was wondering whether you have a guess as to why this happens. Is it random, or do you think this particular repeat might be problematic?

Thanks a lot,
Federico
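
For anyone who wants to try the workaround mentioned above (dropping the suspect repeat and re-running), here is a minimal sketch in Python. It assumes the BED file is tab-delimited with at least four columns (chrom, start, end, repeat unit), so that joining them with "-" gives the same ID that appears in the log. The function name drop_repeats is hypothetical and not part of NanoRepeat.

```python
def drop_repeats(bed_in, bed_out, repeat_ids):
    """Copy bed_in to bed_out, skipping entries whose ID is in repeat_ids.

    Assumption: a tab-delimited BED with at least four columns
    (chrom, start, end, repeat unit), so "chrom-start-end-unit"
    matches IDs such as chrY-87448334-87448375-A from the log.
    Returns (number of entries kept, number of entries dropped).
    """
    skip = set(repeat_ids)
    kept = dropped = 0
    with open(bed_in) as src, open(bed_out, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            # Blank lines, comments and short lines are passed through unchanged.
            if len(fields) >= 4 and not line.startswith("#"):
                if "-".join(fields[:4]) in skip:
                    dropped += 1
                    continue
                kept += 1
            dst.write(line)
    return kept, dropped
```

For example, drop_repeats("repeats.bed", "repeats.filtered.bed", ["chrY-87448334-87448375-A"]) would write a filtered copy that can be passed to -b; both file names here are placeholders.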

@fangli80
Collaborator

fangli80 commented Dec 14, 2024

I'm not sure, because I haven't seen the reads in this repeat. It might be due to a high sequencing depth, or these reads might contain very long repeats. Would it be convenient for you to check the region in IGV? (You can use samtools view to extract the reads of a specific region, so that you don't need to download the whole bam file from the server.)
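
As a rough illustration of the samtools view suggestion, the Python sketch below builds the command for one repeat ID. It assumes the BAM is coordinate-sorted and indexed and that samtools is on the PATH. The helper names and the 1000 bp flank are hypothetical choices, and the coordinates are copied from the repeat ID as-is, without adjusting for BED's 0-based start.

```python
import subprocess


def region_from_repeat_id(repeat_id, flank=1000):
    """Turn 'chrY-87448334-87448375-A' into a samtools region string.

    Splits from the right so chromosome names containing '-' survive.
    flank (an arbitrary default) pads the region on both sides.
    """
    chrom, start, end, _unit = repeat_id.rsplit("-", 3)
    left = max(1, int(start) - flank)
    right = int(end) + flank
    return f"{chrom}:{left}-{right}"


def extract_cmd(bam, repeat_id, out_bam, flank=1000):
    """Build the 'samtools view' command that writes the region to out_bam."""
    region = region_from_repeat_id(repeat_id, flank)
    return ["samtools", "view", "-b", "-o", out_bam, bam, region]


def extract_reads(bam, repeat_id, out_bam, flank=1000):
    """Run the extraction and index the result so IGV can open it."""
    subprocess.run(extract_cmd(bam, repeat_id, out_bam, flank), check=True)
    subprocess.run(["samtools", "index", out_bam], check=True)
```

Calling extract_reads("sample.bam", "chrY-87448334-87448375-A", "chrY_repeat.bam") should leave a small BAM and its index that can be copied off the server; the file names are placeholders.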

By the way, it looks like chrY-87448334-87448375-A has finished, so the problem might be a different repeat. Locating the problematic repeat is a little difficult because there are 20 parallel processes and their stderr output is interleaved. But you can separate the lines using the process ID (e.g. [Process 04]).

Thanks,
Li
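
One possible way to separate the interleaved stderr by process ID is sketched below in Python. It assumes every relevant line follows the format shown earlier in this thread ("[timestamp] NOTICE: [Process NN] message"); the function name last_activity is hypothetical. For each process it keeps the most recent timestamp, the repeat it last started, and its last message, so a process whose last line is old and still mid-step is a candidate for the stuck one.

```python
import re

# Matches e.g. "[12/14/2024 00:40:07] NOTICE: [Process 04] Step 3: round 3 estimation"
LINE_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+\w+:\s+\[Process (?P<pid>\d+)\]\s+(?P<msg>.*)$"
)
REPEAT_PREFIX = "Quantifying repeat: "


def last_activity(lines):
    """Return {process id: (last timestamp, current repeat, last message)}."""
    state = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ignore lines that do not carry a process tag
        pid, msg = m["pid"], m["msg"]
        # Carry over the repeat this process was already working on.
        repeat = state.get(pid, (None, None, None))[1]
        if msg.startswith(REPEAT_PREFIX):
            repeat = msg[len(REPEAT_PREFIX):].strip()
        state[pid] = (m["time"], repeat, msg)
    return state
```

Passing an open stderr log file to last_activity and printing the result sorted by process ID gives one line per process, which is easier to scan than the mixed log.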
