multi gpu docs #391
# Multi GPU with ICICLE

:::info

If you are looking for the Multi GPU API documentation, refer to the [Rust](./rust-bindings/multi-gpu.md) docs.

:::
One common challenge with Zero-Knowledge computation is managing large input sizes. It's not uncommon to encounter circuits surpassing 2^25 constraints, pushing the capabilities of even advanced GPUs to their limits. To effectively scale and process such large circuits, leveraging multiple GPUs in tandem becomes a necessity.

Multi-GPU programming involves developing software that operates across multiple GPU devices. Let's first explore different approaches to multi-GPU programming, then cover how ICICLE allows you to easily develop your ZK computations to run across many GPUs.

## Approaches to Multi GPU programming

There are many [different strategies](https://github.com/NVIDIA/multi-gpu-programming-models) available for implementing multi GPU, but they can be split into two categories.
### GPU Server approach

This approach usually involves one or more CPUs opening threads to read from / write to multiple GPUs. You can think of it as a scaled-up HOST - Device model.

![GPU server approach](image.png)

This approach won't let us tackle larger computation sizes, but it will allow us to run multiple computations that we wouldn't be able to load onto a single GPU together.

For example, let's say you had to compute two MSMs of size 2^26 on a 16GB VRAM GPU: you would normally have to perform them one after the other. However, if you double the number of GPUs in your system, you can run them in parallel.
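To see why a size around 2^26 is the tipping point, here is a back-of-the-envelope sketch. This is illustrative arithmetic only (not part of the ICICLE API), and it assumes uncompressed BLS12-381 affine points (two 48-byte coordinates) plus a 32-byte scalar per term; it counts input buffers only, while real peak usage also includes bucket and scratch memory.

```rust
// Rough input-memory estimate for an MSM of 2^log_size terms, assuming
// uncompressed BLS12-381 affine points (2 * 48 bytes) and a 32-byte scalar
// per term. Counts inputs only; bucket/scratch memory comes on top.
fn msm_input_bytes(log_size: u32) -> u64 {
    let n = 1u64 << log_size;
    n * (2 * 48 + 32)
}

fn main() {
    // 2^20 terms: 128 MiB of inputs, tiny next to 16 GiB of VRAM.
    println!("2^20: {} MiB", msm_input_bytes(20) >> 20);
    // 2^26 terms: 8 GiB of inputs, so one MSM fits in 16 GiB but two do not.
    println!("2^26: {} GiB", msm_input_bytes(26) >> 30);
}
```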
### Inter GPU approach

This is a more sophisticated approach to multi GPU computation. Using technologies such as [GPUDirect, NCCL, NVSHMEM](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-cwes1084/) and NVLink, it's possible to combine multiple GPUs and split a computation among the devices.

This approach requires redesigning the algorithm at the software level to be compatible with splitting amongst devices. In some cases, to keep latency to a minimum, special inter-GPU connections are installed on a server to allow direct communication between multiple GPUs.
# Writing ICICLE Code for Multi GPUs

The approach we have taken for the moment is the GPU Server approach; we assume you have a machine with multiple GPUs and you wish to run some computation on each GPU.

To dive deeper and learn about the API, check out the docs for our different ICICLE APIs:
- [Rust Multi GPU APIs](./rust-bindings/multi-gpu.md)
- C++ Multi GPU APIs
## Best practices

- Never hardcode device IDs. If you want your software to take advantage of all GPUs on a machine, use methods such as `get_device_count` to support an arbitrary number of GPUs.

- Launch one CPU thread per GPU. To avoid nasty errors and hard-to-read code, we suggest you create a dedicated thread for each GPU you target; a single thread can run multiple tasks as long as they all target the same device. This will make your code more manageable, easier to read, and performant. See [CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs](https://developer.nvidia.com/blog/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/) for background.
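The one-thread-per-GPU pattern can be sketched in plain Rust. This is a stand-alone illustration: the CUDA calls are replaced by comments, and `device_count` is passed in where real code would call `get_device_count`.

```rust
use std::thread;

// Spawn one dedicated CPU thread per device; each thread would bind itself
// to its GPU with set_device and may then run any number of tasks, as long
// as they all target that same device.
fn run_on_all_devices(device_count: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..device_count)
        .map(|device_id| {
            thread::spawn(move || {
                // In real code: set_device(device_id).unwrap();
                // ... launch this device's work here ...
                device_id // placeholder for this device's result
            })
        })
        .collect();
    // Joining in spawn order keeps results ordered by device ID.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    println!("{:?}", run_on_all_devices(4));
}
```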
## ZKContainer support for multi GPUs

Multi GPU support should work with ZKContainers by simply defining which devices the docker container should interact with:

```sh
docker run -it --gpus '"device=0,2"' zk-container-image
```

If you wish to expose all GPUs:

```sh
docker run --gpus all zk-container-image
```
# Multi GPU APIs

To learn more about the theory of Multi GPU programming, refer to [this part](../multi-gpu.md) of the documentation.

Here we will cover the core multi GPU APIs and [an example](#a-multi-gpu-example).

## Device management API

To streamline device management, the `icicle-cuda-runtime` package offers methods for dealing with devices.
#### [`set_device`](https://github.com/vhnatyk/icicle/blob/275eaa99040ab06b088154d64cfa50b25fbad2df/wrappers/rust/icicle-cuda-runtime/src/device.rs#L6)

Sets the current CUDA device by its ID; calling `set_device` binds the calling thread to the given CUDA device.

**Parameters:**

- `device_id: usize`: The ID of the device to set as the current device. Device IDs start from 0.

**Returns:**

- `CudaResult<()>`: An empty result indicating success if the device is set successfully. In case of failure, returns a `CudaError`.

**Errors:**

- Returns a `CudaError` if the specified device ID is invalid or if a CUDA-related error occurs during the operation.

**Example:**

```rust
let device_id = 0; // Device ID to set
match set_device(device_id) {
    Ok(()) => println!("Device set successfully."),
    Err(e) => eprintln!("Failed to set device: {:?}", e),
}
```
#### [`get_device_count`](https://github.com/vhnatyk/icicle/blob/275eaa99040ab06b088154d64cfa50b25fbad2df/wrappers/rust/icicle-cuda-runtime/src/device.rs#L10)

Retrieves the number of CUDA devices available on the machine.

**Returns:**

- `CudaResult<usize>`: The number of available CUDA devices. On success, contains the count of CUDA devices. On failure, returns a `CudaError`.

**Errors:**

- Returns a `CudaError` if a CUDA-related error occurs during the retrieval of the device count.

**Example:**

```rust
match get_device_count() {
    Ok(count) => println!("Number of devices available: {}", count),
    Err(e) => eprintln!("Failed to get device count: {:?}", e),
}
```
#### [`get_device`](https://github.com/vhnatyk/icicle/blob/275eaa99040ab06b088154d64cfa50b25fbad2df/wrappers/rust/icicle-cuda-runtime/src/device.rs#L15)

Retrieves the ID of the current CUDA device.

**Returns:**

- `CudaResult<usize>`: The ID of the current CUDA device. On success, contains the device ID. On failure, returns a `CudaError`.

**Errors:**

- Returns a `CudaError` if a CUDA-related error occurs during the retrieval of the current device ID.

**Example:**

```rust
match get_device() {
    Ok(device_id) => println!("Current device ID: {}", device_id),
    Err(e) => eprintln!("Failed to get current device: {:?}", e),
}
```
## Device context API

The `DeviceContext` is embedded into `NTTConfig`, `MSMConfig` and `PoseidonConfig`, meaning you can simply pass a `device_id` to your existing config and the same computation will be triggered on a different device. Note that the `device_id` in the config must match the device currently set for the calling thread; the implementation checks for a match rather than switching devices automatically.

#### [`DeviceContext`](https://github.com/vhnatyk/icicle/blob/eef6876b037a6b0797464e7cdcf9c1ecfcf41808/wrappers/rust/icicle-cuda-runtime/src/device_context.rs#L11)

Represents the configuration of a CUDA device, encapsulating the device's stream, ID, and memory pool. The default device is always `0`.
```rust
pub struct DeviceContext<'a> {
    pub stream: &'a CudaStream,
    pub device_id: usize,
    pub mempool: CudaMemPool,
}
```
##### Fields

- **`stream: &'a CudaStream`**

  A reference to a `CudaStream`. This stream is used for executing CUDA operations. By default, it points to the null stream, CUDA's default execution stream.

- **`device_id: usize`**

  The index of the GPU currently in use. The default value is `0`, indicating the first GPU in the system.

- **`mempool: CudaMemPool`**

  Represents the memory pool used for CUDA memory allocations. The default is set to a null pointer, which signifies the use of the default CUDA memory pool.

##### Implementation Notes

- The `DeviceContext` structure is cloneable and can be debugged, facilitating easier logging and duplication of contexts when needed.
#### [`DeviceContext::default_for_device(device_id: usize) -> DeviceContext<'static>`](https://github.com/vhnatyk/icicle/blob/eef6876b037a6b0797464e7cdcf9c1ecfcf41808/wrappers/rust/icicle-cuda-runtime/src/device_context.rs#L30C12-L30C30)

Provides a default `DeviceContext` for a given device, ideal for straightforward setups.

#### Parameters

- **`device_id: usize`**: The ID of the device for which to create the context.

#### Returns

A `DeviceContext` instance configured with:

- The default stream (`null_mut()`).
- The provided `device_id`.
- The default memory pool (`null_mut()`).
#### [`check_device(device_id: i32)`](https://github.com/vhnatyk/icicle/blob/eef6876b037a6b0797464e7cdcf9c1ecfcf41808/wrappers/rust/icicle-cuda-runtime/src/device_context.rs#L42)

Validates that the specified `device_id` matches the ID of the currently active device, ensuring operations are targeted correctly.

#### Parameters

- **`device_id: i32`**: The device ID to verify against the currently active device.

#### Behavior

- **Panics** if the `device_id` does not match the active device's ID, preventing cross-device operation errors.

#### Example

```rust
let device_id: i32 = 0; // Example device ID
check_device(device_id);
// Ensures that the current context is correctly set for the specified device ID.
```
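The guard's behavior can be sketched in plain Rust. This is not the ICICLE implementation: the hypothetical `current_device` function stands in for querying the CUDA runtime for the calling thread's active device.

```rust
// Stand-in for querying the calling thread's active CUDA device.
fn current_device() -> i32 {
    0
}

// Panics if the requested device does not match the thread's active device,
// mirroring the cross-device guard described above.
fn check_device(device_id: i32) {
    assert_eq!(
        device_id,
        current_device(),
        "device {} requested while another device is active",
        device_id
    );
}

fn main() {
    check_device(0); // matches the stand-in active device: no panic
    println!("device check passed");
}
```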
## A Multi GPU example

In this example we will show how you can:

1. Fetch the number of devices installed on a machine.
2. Launch a thread for every GPU and set an active device per thread.
3. Execute an MSM on each GPU.
```rust
...

let device_count = get_device_count().unwrap();

(0..device_count)
    .into_par_iter()
    .for_each(move |device_id| {
        set_device(device_id).unwrap();

        // you can create a per-device stream and allocate points and scalars_d here

        let mut cfg = MSMConfig::default_for_device(device_id);
        cfg.ctx.stream = &stream;
        cfg.is_async = true;
        cfg.are_scalars_montgomery_form = true;
        msm(&scalars_d, &HostOrDeviceSlice::on_host(points), &cfg, &mut msm_results).unwrap();

        // collect and process results
    });

...
```
|
||
We use `get_device_count` to fetch the number of connected devices, device IDs will be `0...device_count-1` | ||
|
||
[`into_par_iter`](https://docs.rs/rayon/latest/rayon/iter/trait.IntoParallelIterator.html#tymethod.into_par_iter) is a parallel iterator, you should expect it to launch a thread for every iteration. | ||
|
||
We then call `set_device(device_id).unwrap();` it should set the context of that thread to the selected `device_id`. | ||
|
||
Any data you now allocate from the context of this thread will be linked to the `device_id`. We create our `MSMConfig` with the selected device ID `let mut cfg = MSMConfig::default_for_device(device_id);`, behind the scene this will create for us a `DeviceContext` configured for that specific GPU. | ||
|
||
We finally call our `msm` method. |
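The same fan-out pattern can also split one large input across devices. Here is a pure-Rust sketch of that idea (not ICICLE code): a plain chunk sum stands in for the per-device MSM, and in real code each thread would call `set_device` and run `msm` on its chunk.

```rust
use std::thread;

// Split the scalars into one chunk per device and process each chunk on a
// dedicated thread; a chunk sum stands in for the per-device MSM.
fn split_across_devices(scalars: &[u64], device_count: usize) -> Vec<u64> {
    // Ceiling division so every element lands in some chunk.
    let chunk_size = (scalars.len() + device_count - 1) / device_count;
    thread::scope(|s| {
        let handles: Vec<_> = scalars
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    // In real code: set_device(device_id) then msm(...) here.
                    chunk.iter().sum::<u64>()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let scalars: Vec<u64> = (1..=8).collect();
    println!("{:?}", split_across_devices(&scalars, 2));
}
```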