Refactored vertical_slash_index.cu for performance improvement #72

Open · wants to merge 1 commit into base: main
Conversation

@alvi75 commented Sep 9, 2024

(Attached screenshot: profiling report, Sep 9, 2024)

What does this PR do?

This PR refactors the CUDA kernel in vertical_slash_index.cu to optimize the performance of vertical slash indexing for large context sizes in MInference. The changes are primarily focused on improving the efficiency of block and column indexing operations during inference.

The refactor yielded a 6-second improvement in execution time, as shown in the attached profiling reports. The gain comes from reusing calculations within the CUDA kernel and improving memory access patterns.
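The diff itself is not reproduced in this description. As a minimal sketch of the kind of kernel-level change being described (hoisting repeated index arithmetic out of the inner loop and keeping global-memory accesses coalesced), consider the following; all names (build_block_index, col_idx, block_idx, n_cols, block_size) are hypothetical and are not taken from the PR:

```cuda
// Hypothetical sketch -- not the actual diff from this PR.
// Illustrates hoisting repeated index arithmetic out of the inner loop
// and keeping global-memory accesses coalesced (consecutive threads
// touch consecutive addresses).
__global__ void build_block_index(const int* __restrict__ col_idx,
                                  int* __restrict__ block_idx,
                                  int n_cols, int block_size) {
    int head = blockIdx.x;   // one CUDA block per attention head
    int tid  = threadIdx.x;

    // Computed once per thread instead of inside every loop iteration.
    int head_offset = head * n_cols;

    // Strided loop over this head's columns; accesses at
    // head_offset + i are coalesced across the warp.
    for (int i = tid; i < n_cols; i += blockDim.x) {
        int c = col_idx[head_offset + i];
        block_idx[head_offset + i] = c / block_size;  // column -> block id
    }
}
```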

We tested the refactored code with 100K tokens using NVIDIA A30 GPUs (24GB memory each), which limits the scale of our tests compared to the original results with 1M tokens. The project maintainers are encouraged to rebuild and validate the changes with larger token sizes and on other GPU models like A100.

The profiling results (before and after refactor) are attached, showing a reduction in total inference time from 268.128 seconds to 262.092 seconds.

Performance Results

Below are the benchmark results comparing the original and refactored versions of MInference on NVIDIA A30 GPUs for 10K, 50K, and 100K tokens:

GPU           | Version                  | 10K Tokens (s) | 50K Tokens (s) | 100K Tokens (s)
A30 (2 GPUs)  | MInference               | 3.86           | 12.82          | 26.02
A30 (2 GPUs)  | MInference (Refactored)  | 3.796          | 12.88          | 25.98

We request that you test the refactored version with an A100 GPU using your original settings, especially for larger token counts such as 1M tokens, as we were limited by the memory capacity of the A30 GPUs.


Fixes #(issue number, if applicable)

Motivation

  • Performance improvement: The motivation behind this refactor is to optimize the CUDA kernel used in vertical_slash_index.cu for better inference performance. Our changes focus on minimizing redundant calculations and improving memory access patterns during the inference process.

  • Context size handling: The refactor improves the handling of large context sizes, particularly when working with limited GPU memory on A30 devices. This may benefit applications that need to process large-scale input data efficiently.


Dependencies

  • No new dependencies were introduced by this PR.

Before submitting:

  • This PR fixes a performance bottleneck in the inference pipeline.
  • This was discussed/approved via a GitHub issue.
  • The documentation was updated where necessary.
  • Tests were performed using 100K tokens on NVIDIA A30 GPUs.
  • Additional tests are recommended for larger context sizes (e.g., 1M tokens) and on different hardware (e.g., A100 GPUs).

Who can review?

Tagging relevant maintainers for this PR review:


Attached Images:

  • Profiling 1: Before refactor (268.128 seconds)
  • Profiling 2: After refactor (262.092 seconds)

Additional Notes

While the refactor improves performance, the current pipeline still contains numerous cudaStreamSynchronize calls. These synchronization points block the host thread and could be analyzed further to remove unnecessary stalls and improve overall throughput. We recommend reviewing these points to explore further optimizations; a sketch of one possible direction follows.
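As a hedged illustration only (not code from this PR), one common way to remove a host-blocking cudaStreamSynchronize between dependent kernels is to express the dependency with a CUDA event, so the wait happens on the device while the host keeps enqueuing work. kernel_a, kernel_b, and the launch parameters grid and block are placeholders:

```cuda
// Hypothetical illustration -- not code from this PR.
// Replaces a host-blocking cudaStreamSynchronize with a CUDA event,
// so the dependency is enforced on the device and the host keeps
// enqueuing work. kernel_a / kernel_b and their arguments are placeholders.
cudaStream_t s1, s2;
cudaEvent_t done;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

kernel_a<<<grid, block, 0, s1>>>(/* ... */);
cudaEventRecord(done, s1);          // marks completion of kernel_a on s1

// Before: cudaStreamSynchronize(s1);  // blocks the CPU thread
// After: make s2 wait on the event without stalling the host.
cudaStreamWaitEvent(s2, done, 0);
kernel_b<<<grid, block, 0, s2>>>(/* ... */);
```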

@alvi75 (Author) commented Sep 9, 2024

@microsoft-github-policy-service agree company="William & Mary"

@iofu728 added the feature label on Sep 9, 2024
@Starmys (Contributor) commented Oct 9, 2024

Thank you for your contribution! Could you provide your CUDA version so we can check whether this is related to compiler optimization?

@iofu728 requested a review from Copilot on December 23, 2024


Copilot wasn't able to review any files in this pull request.

Files not reviewed (1)
  • csrc/vertical_slash_index.cu: Language not supported
Labels: feature
Projects: None yet
3 participants