
Use on large cohorts #176

Open
andrepessoni opened this issue Oct 18, 2022 · 6 comments

Comments

@andrepessoni

  • spladder version: 3.0.3
  • Python version: 3.9.6
  • Operating System: Linux - CentOS 7

I have a large number of samples (101) and I'm trying to run the build mode in several parts (as instructed in the SplAdder documentation, chapter 4).
I have finished the single-graphs step by running each sample in parallel, and it was fast. Now I'm trying to run the merge-graphs part, and after more than 5 days it has only processed half of the samples. I have tried to increase the number of cores, but the job only uses one. I have also tried running the merge-graphs command in parallel and adding the --parallel parameter (used in test mode), but neither worked.
In summary, I would like to know whether there is any spladder parameter for build mode that would speed up or parallelize this step.

At first, I used this command for each single graph: spladder build -o spladder_out -a annotation.gtf -b file.bam --merge-strat single --no-extract-ase (and it worked without problems)
Right now, I'm using this command: spladder build -o spladder_out -a annotation.gtf -b bamfiles_path.txt --merge-strat merge_graphs --no-extract-ase

Thank you very much,
André.

@kate-stankiewicz

Hello,

I am experiencing a similar problem. Were you able to find a way to speed things up at the graph merging step?

Thanks!

@andrepessoni
Author

Hello,

I was not. I ran independent sets of samples in parallel, but in the end I was not able to speed up this step.

@MB571

MB571 commented Dec 9, 2022

I'm having a similar issue. I also had a similar problem with the SplAdder build function, and I'm curious how fast your build stage was. Mine took 3 days for 129 samples using 400 GB of RAM on an HPC, and setting the --parallel parameter didn't seem to have any effect either.

@kate-stankiewicz

Yeah, mine took about 2 days for 83 samples on an HPC system. I also don't think the --parallel parameter had any impact.

akahles added a commit that referenced this issue Feb 7, 2023
@akahles
Member

akahles commented Feb 7, 2023

Dear all,

thanks for the valuable feedback. One of the slowest parts of the current pipeline is the progressive merging of the individual single-sample graphs, which is hard to parallelize over genes. This has been on my to-do list for a while. As an intermediate solution, SplAdder offers the possibility of splitting the merge procedure into chunks and running them in parallel, using the option --chunked-merge LEVEL MAX_LEVEL START END. I have just added a section to the docs describing this in more detail. Something similar could also be done using multiple threads on the same machine, but that is currently not implemented.

Below, I have pasted the bit that I added to the docs.
I hope this helps.

Best,
Andre

========================================================
As a last point, the merging of single-sample graphs is done progressively over samples and takes
some time. It is also hard to parallelize over genes, as there is the option to merge neighboring
genes if they overlap too much. If you really have many samples to merge (like hundreds or
thousands), SplAdder offers the possibility to parallelize the merge over chunks of samples, using
the option --chunked-merge LEVEL MAX_LEVEL START END, where LEVEL is the current level of
the merging tree, MAX_LEVEL is the height of the merging tree, and START and END
describe the node range at the current merge level. It is easiest to describe this using a small
example. Assume we have 100 single-sample graphs to be merged. Instead of merging all 100 samples
progressively in one go, we can merge 10 samples at a time in parallel. This gives us
10 ten-sample graphs. These 10 graphs can then be merged again, resulting in the 100-sample graph.
The total number of levels (MAX_LEVEL) can be computed as the ceiling of the base-chunksize
logarithm of the total number of samples. For our example this is log_10(100) = 2.
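The level computation above can be sketched in a few lines of Python (this helper is not part of SplAdder; the function name is illustrative). Instead of a floating-point logarithm, it repeatedly applies ceiling division, which avoids rounding surprises for sample counts like 101 that sit just above a power of the chunk size:

```python
import math

def merge_levels(n_samples: int, chunksize: int = 10) -> int:
    """Number of chunked-merge levels (MAX_LEVEL), i.e. ceil(log_chunksize(n_samples))."""
    levels = 0
    n = n_samples
    while n > 1:
        n = math.ceil(n / chunksize)  # graphs remaining after one merge level
        levels += 1
    return levels

print(merge_levels(100))  # 100 samples, chunks of 10 -> 2 levels
print(merge_levels(101))  # 101 samples -> 3 levels (11 graphs remain after level 0)
```
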

Applying this to our 100-sample example:

### first level, merging samples in groups of 10
for i in $(seq 0 10 90)
do
    spladder build -o spladder_out \
                   -a annotation.gtf \
                   -b alignments.txt \
                   --merge-strat merge_graphs \
                   --no-extract-ase \
                   --no-quantify-graph \
                   --chunked-merge 0 1 ${i} $((${i} + 10))
done

### second level, merging the 10 ten-sample graphs
spladder build -o spladder_out \
               -a annotation.gtf \
               -b alignments.txt \
               --merge-strat merge_graphs \
               --no-extract-ase \
               --no-quantify-graph \
               --chunked-merge 1 1 0 10

By default, the chunk size is 10, but it can be adapted via the --chunksize option.

@erzakiev

erzakiev commented Mar 14, 2023

Hello Andre, thank you very much for this cool tool!
And also: hello, dear community of alternative splicing analysis, you are all also very cool!

Just to clear up my doubts about the necessity of graph merging in my particular case, could someone please ELI5 when we might be interested in this merging of multiple samples?

What is the eventual point of that action? To better analyze, for instance, whether a certain splicing event is more prevalent in one sample compared to another?

And why is this only recommended for large cohorts? If it confers an advantage for between-sample comparisons, it should probably be used even in the simplest setup of two samples: [1 _control_ sample] vs [1 _some condition_ sample].
