
Use on large cohorts #176

Open
andrepessoni opened this issue Oct 18, 2022 · 6 comments

Comments

@andrepessoni

  • spladder version: 3.0.3
  • Python version: 3.9.6
  • Operating System: Linux - CentOS 7

I have a large number of samples (101) and I'm trying to run the build mode in several parts (as instructed in the SplAdder documentation, chapter 4).
I have finished the single-graphs step by running each sample in parallel, and it was fast. Now I'm trying to run the merge-graphs part, and after more than 5 days it has only processed half of the samples. I have tried to increase the number of cores, but the job only uses one. I have also tried running the merge-graphs command in parallel and adding the --parallel parameter (used in test mode), but neither worked.
In summary, I would like to know whether there is any spladder parameter for build mode that would speed up or parallelize this step.

At first, I used this command for each single graph: spladder build -o spladder_out -a annotation.gtf -b file.bam --merge-strat single --no-extract-ase (and it worked without problems)
Right now, I'm using this command: spladder build -o spladder_out -a annotation.gtf -b bamfiles_path.txt --merge-strat merge_graphs --no-extract-ase

Thank you very much,
André.

@kate-stankiewicz

Hello,

I am experiencing a similar problem. Were you able to find a way to speed things up at the graph merging step?

Thanks!

@andrepessoni
Author

Hello,

I was not. I ran independent sets of samples in parallel, but in the end I was not able to speed up this step.

@MB571

MB571 commented Dec 9, 2022

I'm having a similar issue. I also had a similar problem with the SplAdder build function, and I'm curious how fast your build stage was. Mine took 3 days for 129 samples using 400 GB of RAM on an HPC, and setting the --parallel parameter didn't seem to have any effect either.

@kate-stankiewicz

Yeah, mine took about 2 days for 83 samples on an HPC system. I also don't think the --parallel parameter had any impact.

akahles added a commit that referenced this issue Feb 7, 2023
@akahles
Member

akahles commented Feb 7, 2023

Dear all,

thanks for the valuable feedback. One of the slowest parts of the current pipeline is the progressive merging of the individual single-sample graphs, which is hard to parallelize over genes. This has been on my to-do list for a while. As an intermediate solution, SplAdder offers the possibility of splitting the merge procedure into chunks and running them in parallel, using the option --chunked-merge LEVEL MAX_LEVEL START END. I have just added a section to the docs describing this in more detail. Something similar could also be done using multiple threads on the same machine, but that is currently not implemented.

Below, I have pasted the bit that I added to the docs.
I hope this helps.

Best,
Andre

========================================================
As a last point, the merging of single-sample graphs is done progressively over samples and takes
some time. It is also hard to parallelize over genes, as there is the option to merge neighboring
genes if they overlap too much. If you really have many samples to merge (like hundreds or
thousands), SplAdder offers the possibility to parallelize the merge over chunks of samples, using
the option --chunked-merge LEVEL MAX_LEVEL START END, where LEVEL is the current level of
the merging tree, MAX_LEVEL is the height of the merging tree, and START and END
describe the node range at the current merge level. It is easiest to describe this using a small
example. Assume we have 100 single-sample graphs to be merged. Instead of merging all 100 samples
progressively in one go, we can merge 10 samples at a time in parallel. This gives us
10 ten-sample graphs. These 10 graphs can then be merged again, resulting in the 100-sample graph.
The total number of levels (MAX_LEVEL) can be computed as the ceiling of the base-chunksize
logarithm of the total number of samples. For our example this is log_10(100) = 2.
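The level computation above can be sketched in a few lines of Python (this helper is not part of SplAdder; the function name is illustrative). Instead of a floating-point logarithm, it repeatedly applies ceiling division, which avoids rounding surprises for sample counts like 101 that sit just above a power of the chunk size:

```python
import math

def merge_levels(n_samples: int, chunksize: int = 10) -> int:
    """Number of chunked-merge levels (MAX_LEVEL), i.e. ceil(log_chunksize(n_samples))."""
    levels = 0
    n = n_samples
    while n > 1:
        n = math.ceil(n / chunksize)  # graphs remaining after one merge level
        levels += 1
    return levels

print(merge_levels(100))  # 100 samples, chunks of 10 -> 2 levels
print(merge_levels(101))  # 101 samples -> 3 levels (11 graphs remain after level 0)
```
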

Applying this to our 100-sample example:

### first level, merging samples in groups of 10
for i in $(seq 0 10 90)
do
    spladder build -o spladder_out \
                   -a annotation.gtf \
                   -b alignments.txt \
                   --merge-strat merge_graphs \
                   --no-extract-ase \
                   --no-quantify-graph \
                   --chunked-merge 0 1 ${i} $((${i} + 10))
done

### second level, merging the 10 ten-sample graphs
spladder build -o spladder_out \
               -a annotation.gtf \
               -b alignments.txt \
               --merge-strat merge_graphs \
               --no-extract-ase \
               --no-quantify-graph \
               --chunked-merge 1 1 0 10

By default, the chunk size is 10, but it can be adapted via the --chunksize option.

@erzakiev

erzakiev commented Mar 14, 2023

Hello Andre, thank you very much for this cool tool!
And also: hello, dear community of alternative splicing analysis, you are all also very cool!

Just to clear up my doubts about the necessity of graph merging in my particular case, could someone please ELI5 when we might be interested in this merging of multiple samples?

What is the eventual point of that action? To better analyze, for instance, whether a certain splicing event is more prevalent in one sample compared to another?

And why is this only recommended for large cohorts? If it confers an advantage for between-sample comparisons, it should probably be used even in the simplest setup of two samples: [1 _control_ sample] vs [1 _some condition_ sample].
