Best practice for offline bulk batch inference in candle #1700
Replies: 3 comments
-
It's hard to tell just from this what is causing the slowness. You may want to add some tracing/timing so as to measure where the time actually gets spent. Also note that CUDA is a lazy (asynchronous) API, so on both the PyTorch and the candle side you want to force the results to be retrieved on the CPU to be sure that the computation has actually finished.
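For illustration, a minimal timing sketch along these lines, with a dummy matmul standing in for the actual forward pass (the shapes and the model call are placeholders, not the code from this thread):

```rust
use std::time::Instant;
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::new_cuda(0)?;
    // Dummy data standing in for a tokenized batch of 32 sequences.
    let input = Tensor::randn(0f32, 1f32, (32, 384), &device)?;

    let start = Instant::now();
    // Replace this with the real call, e.g. `model.forward(...)` on the BERT model.
    let output = input.matmul(&input.t()?)?;
    // CUDA work is queued lazily; copying the result back to the host forces the
    // queue to drain, so the elapsed time below reflects the actual compute.
    let _host_copy = output.to_device(&Device::Cpu)?;
    println!("batch took {:?}", start.elapsed());
    Ok(())
}
```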
-
I've been running some tests on your code, and the reason it runs slowly is not candle but the fact that you are overloading the GPUs by using rayon. This is not a candle issue; it's an issue with your code. If more than one thread is trying to move data in and out of the GPU, execution slows down. Instead, you want to tune the batch size and limit the number of threads moving data in and out of GPU memory. The same thing would happen in any application, written in any language with any framework, that had multiple threads trying to move data in and out of the GPU at the same time.
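For reference, a rough sketch of that single-threaded pattern: one loop builds each batch, moves it to the GPU, runs the forward pass, and pulls the result back before starting the next batch. The helper name and the `forward` closure are placeholders, not part of candle's API:

```rust
use candle_core::{Device, Result, Tensor};

/// Hypothetical helper: runs inference batch by batch on a single thread.
/// `forward` stands in for whatever model call produces the embeddings.
fn embed_all(
    rows: &[Vec<u32>],            // tokenized inputs, already padded to equal length
    batch_size: usize,
    device: &Device,
    forward: impl Fn(&Tensor) -> Result<Tensor>,
) -> Result<Vec<Tensor>> {
    let mut out = Vec::new();
    for chunk in rows.chunks(batch_size) {
        // Build one (batch, seq_len) tensor per chunk and move it to the GPU once.
        let flat: Vec<u32> = chunk.iter().flatten().copied().collect();
        let input = Tensor::from_vec(flat, (chunk.len(), chunk[0].len()), device)?;
        let embeddings = forward(&input)?;
        // Pull results back to the CPU so the GPU queue drains before the next batch.
        out.push(embeddings.to_device(&Device::Cpu)?);
    }
    Ok(out)
}
```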
-
I was able to get candle batch inference to work. You were right about the rayon part; that was not the correct way to handle this. I have used a single thread for the time being and am happy with the results.
-
I want to do bulk offline batch inference in candle on text data and extract embeddings. I modified the bert example to read a CSV containing text data and tried to process it in batches, but the resulting process is ~2.5x slower than Python (PyTorch). Here is my code. Is slicing the wrong pattern here? I didn't see any dataloader example for inference in candle. Any suggestions/guidance, please? This is for GPU inference.
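For context, the slicing-over-batches loop I mean looks roughly like the sketch below (shapes, batch size, and the zero-filled token ids are made up, and the actual BERT forward pass is omitted):

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::new_cuda(0)?;
    // Stand-in for the tokenized CSV: (num_rows, seq_len) token ids.
    let ids = Tensor::zeros((1024, 256), DType::U32, &device)?;
    let (num_rows, _seq_len) = ids.dims2()?;
    let batch_size = 64;

    for start in (0..num_rows).step_by(batch_size) {
        let len = batch_size.min(num_rows - start);
        // Slice one batch of rows along dim 0.
        let batch = ids.narrow(0, start, len)?;
        // The BERT forward pass and embedding extraction would go here.
        let _ = batch;
    }
    Ok(())
}
```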