Best practice for offline bulk batch inference in candle #1700
Replies: 3 comments
-
It's hard to tell just from this what is causing the slowness. You may want to add some tracing/timing so as to measure where the time actually gets spent. Also note that CUDA is a lazy (asynchronous) API, so on both the PyTorch and the candle side you want to force the results to be retrieved on the CPU to be sure that the computation has actually finished.
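For illustration, a minimal timing sketch along these lines, with a dummy matmul standing in for the actual forward pass (the shapes and the model call are placeholders, not the code from this thread):

```rust
use std::time::Instant;
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::new_cuda(0)?;
    // Dummy data standing in for a tokenized batch of 32 sequences.
    let input = Tensor::randn(0f32, 1f32, (32, 384), &device)?;

    let start = Instant::now();
    // Replace this with the real call, e.g. `model.forward(...)` on the BERT model.
    let output = input.matmul(&input.t()?)?;
    // CUDA work is queued lazily; copying the result back to the host forces the
    // queue to drain, so the elapsed time below reflects the actual compute.
    let _host_copy = output.to_device(&Device::Cpu)?;
    println!("batch took {:?}", start.elapsed());
    Ok(())
}
```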
-
I've been running some tests on your code, and the reason it runs slowly is not candle but the fact that you are overloading the GPUs by using rayon. This is not a candle issue; it's an issue with your code. If more than one thread is trying to move data in and out of the GPU, execution slows down. Instead, you want to tune the batch size and limit the number of threads moving data in and out of GPU memory. The same thing would happen in any application, written in any language with any framework, that had multiple threads trying to move data in and out of the GPU at the same time.
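For reference, a rough sketch of that single-threaded pattern: one loop builds each batch, moves it to the GPU, runs the forward pass, and pulls the result back before starting the next batch. The helper name and the `forward` closure are placeholders, not part of candle's API:

```rust
use candle_core::{Device, Result, Tensor};

/// Hypothetical helper: runs inference batch by batch on a single thread.
/// `forward` stands in for whatever model call produces the embeddings.
fn embed_all(
    rows: &[Vec<u32>],            // tokenized inputs, already padded to equal length
    batch_size: usize,
    device: &Device,
    forward: impl Fn(&Tensor) -> Result<Tensor>,
) -> Result<Vec<Tensor>> {
    let mut out = Vec::new();
    for chunk in rows.chunks(batch_size) {
        // Build one (batch, seq_len) tensor per chunk and move it to the GPU once.
        let flat: Vec<u32> = chunk.iter().flatten().copied().collect();
        let input = Tensor::from_vec(flat, (chunk.len(), chunk[0].len()), device)?;
        let embeddings = forward(&input)?;
        // Pull results back to the CPU so the GPU queue drains before the next batch.
        out.push(embeddings.to_device(&Device::Cpu)?);
    }
    Ok(out)
}
```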
-
I was able to get candle batch inference to work. You were right about the rayon part; that was not the correct way to handle this. I have used a single thread for the time being and am happy with the results.
-
I want to do bulk offline batch inference in candle on text data and extract embeddings. I modified the bert example to read a CSV containing text data and tried to process it in batches, but the resulting process is ~2.5x slower than Python (PyTorch). Here is my code. Is slicing the wrong pattern here? I didn't see any dataloader example for inference in candle. Any suggestions/guidance, please? This is for GPU inference.
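For context, the slicing-over-batches loop I mean looks roughly like the sketch below (shapes, batch size, and the zero-filled token ids are made up, and the actual BERT forward pass is omitted):

```rust
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::new_cuda(0)?;
    // Stand-in for the tokenized CSV: (num_rows, seq_len) token ids.
    let ids = Tensor::zeros((1024, 256), DType::U32, &device)?;
    let (num_rows, _seq_len) = ids.dims2()?;
    let batch_size = 64;

    for start in (0..num_rows).step_by(batch_size) {
        let len = batch_size.min(num_rows - start);
        // Slice one batch of rows along dim 0.
        let batch = ids.narrow(0, start, len)?;
        // The BERT forward pass and embedding extraction would go here.
        let _ = batch;
    }
    Ok(())
}
```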