
How to parallelize? #4

Open
BlueSpace-ice opened this issue May 14, 2024 · 3 comments

Comments

@BlueSpace-ice

I'm sorry to bother you, but is it possible to use the graphics card to process multiple slices in parallel at the same time? I use a for loop and it takes too long. An unknown error occurred when I used `from multiprocessing import Pool`. So I don't have any other options. Thank you!

@CielAl
Contributor

CielAl commented Jun 5, 2024

I have the same observation, but I think the challenges would be:
(1) Individual inputs take different numbers of steps to converge, so a different variant of the ISTA algorithm might be needed (so far torchvahadane implements ISTA and FISTA), or the step size would at least have to be managed based on the losses of a whole batch of images.

(2) Tissue masking - vahadane essentially performs dictionary learning on the tissue pixels of each image, so the dimensionality of the actual input to the dictionary varies among the input images, depending on the tissue region.

If time efficiency is a specific concern, I would recommend simply caching the stain matrices of all images that may be reused, to avoid recomputation, and/or using a faster approach to obtain the stain concentrations from the OD and the stain matrix (e.g., the least-squares solver torch.linalg.lstsq).

An example of using least squares to solve for the concentrations is attached here, derived from @cwlkr's code.
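For reference, a minimal sketch of that least-squares idea. The helper name, the 2x3 stain-matrix convention (rows for H and E), and Io = 255 are assumptions for illustration, not torchvahadane's exact API:

```python
import torch

def concentrations_from_lstsq(img, stain_matrix, Io=255.0, eps=1e-6):
    """Hypothetical helper: recover per-pixel concentrations C with OD ≈ C @ stain_matrix
    via a least-squares solve instead of ISTA/FISTA."""
    # Optical density of the flattened RGB image, shape (N_pixels, 3).
    od = -torch.log((img.reshape(-1, 3).float() + eps) / Io)
    # Solve stain_matrix.T @ C.T = OD.T in the least-squares sense:
    # torch.linalg.lstsq(A, B) returns X minimizing ||A X - B||_F.
    concentrations = torch.linalg.lstsq(stain_matrix.T.float(), od.T).solution.T
    # Concentrations should be non-negative; clamp the unconstrained solution.
    return concentrations.clamp_min(0.0)
```

Note that the stain matrix still has to be estimated (or loaded from a cache) per image; the least-squares step only replaces the concentration solve, and unlike the sparse solver it drops the L1 sparsity constraint.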

@cwlkr
Owner

cwlkr commented Aug 6, 2024

Hello,

Unfortunately, this is not really feasible, at least not in a straightforward manner.

The problem lies in CUDA itself, as far as I understand. In CUDA, a task is accelerated by splitting it into many small steps that run in parallel as GPU kernels, whereas CPU parallelization runs the same task for different inputs in parallel. Unfortunately, because of how CUDA is constructed, the two are not easily intermixable. As far as I know, this is more related to shared-memory issues than to the optimization algorithms themselves.

There might be a fix nowadays with torch.multiprocessing, but I lack the time at the moment to investigate this further.
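For anyone who wants to experiment, an untested sketch of what that could look like. The slide paths, process count, and per-slide body are placeholders, and several CUDA processes on one GPU may still contend for memory:

```python
import torch
import torch.multiprocessing as mp

def worker(rank, paths, num_procs):
    # Each spawned process gets its own CUDA context; mp.spawn uses the
    # "spawn" start method, which is required before touching CUDA in children.
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
    for path in paths[rank::num_procs]:
        # ... build a normalizer on `device` and process the slide at `path` ...
        pass

if __name__ == "__main__":
    slide_paths = ["slide_0.svs", "slide_1.svs", "slide_2.svs"]  # placeholder paths
    num_procs = 2
    mp.spawn(worker, args=(slide_paths, num_procs), nprocs=num_procs)
```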

If it's a training situation, setting your num_workers to 16 or more usually still results in good GPU utilization, as the forward pass + backprop can take longer than the (parallelized) image normalization/augmentation anyway.

For this, I have seen that with SPAMS it is better to set numThreads=1 and use a higher num_workers, as creating new threads all the time can be slow.
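As an illustration of that setup, a minimal sketch assuming an already-fitted, CPU-based normalizer object with a staintools-style transform() method; the dataset class and tile list are hypothetical:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NormalizedTileDataset(Dataset):
    """Hypothetical dataset: tiles are normalized on the CPU inside DataLoader workers."""

    def __init__(self, tiles, normalizer):
        # `tiles`: list of HxWx3 uint8 numpy arrays; `normalizer`: an already
        # fitted object exposing a staintools-style .transform(image) method.
        self.tiles = tiles
        self.normalizer = normalizer

    def __len__(self):
        return len(self.tiles)

    def __getitem__(self, idx):
        # Runs inside a worker process, so up to num_workers tiles are normalized
        # concurrently while the GPU handles the forward/backward pass.
        tile = self.normalizer.transform(self.tiles[idx])
        return torch.from_numpy(tile).permute(2, 0, 1).float() / 255.0

# loader = DataLoader(NormalizedTileDataset(tiles, normalizer),
#                     batch_size=32, num_workers=16)
```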

@CielAl
Contributor

CielAl commented Aug 12, 2024

It might be possible in a multi-GPU scenario where each GPU uses its own process (e.g., dask-cuda, etc.), but that's up to how users build their own workflows rather than something a stain normalization toolkit should resolve.
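A rough sketch of such a user-side workflow with dask-cuda; the per-slide function body and the slide paths are placeholders:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

def normalize_slide(path):
    # dask-cuda pins each worker to one GPU via CUDA_VISIBLE_DEVICES,
    # so "cuda" inside this function refers to that worker's own GPU.
    import torch
    device = torch.device("cuda")
    # ... load the slide at `path`, build a normalizer on `device`,
    #     normalize, and write the result ...
    return path

if __name__ == "__main__":
    # One worker process per visible GPU, each with its own CUDA context.
    cluster = LocalCUDACluster()
    client = Client(cluster)
    slide_paths = ["slide_0.svs", "slide_1.svs"]  # placeholder paths
    futures = client.map(normalize_slide, slide_paths)
    client.gather(futures)
```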
