multi gpu docs #391
# Multi GPU with ICICLE

:::info

If you are looking for the Multi GPU API documentation, refer to the [Rust](./rust-bindings/multi-gpu.md) docs.

:::
One common challenge with Zero-Knowledge computation is managing large input sizes. It's not uncommon to encounter circuits surpassing 2^25 constraints, pushing the capabilities of even advanced GPUs to their limits. To effectively scale and process such large circuits, leveraging multiple GPUs in tandem becomes a necessity.

Multi-GPU programming involves developing software that operates across multiple GPU devices. Let's first explore different approaches to multi-GPU programming, then cover how ICICLE allows you to easily develop your ZK computations to run across many GPUs.

## Approaches to Multi GPU programming

There are many [different strategies](https://github.com/NVIDIA/multi-gpu-programming-models) available for implementing multi GPU, but they can be split into two categories.
### GPU Server approach

This approach usually involves one or more CPUs opening threads to read from / write to multiple GPUs. You can think of it as a scaled-up HOST - Device model.

![GPU server approach](image.png)

This approach won't let us tackle larger computation sizes, but it will allow us to run multiple computations that we wouldn't be able to load onto a single GPU together.

For example, let's say you had to compute two MSMs of size 2^26 on a 16GB VRAM GPU: you would normally have to perform them one after the other. However, if you double the number of GPUs in your system, you can run them in parallel.
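To see why a size around 2^26 is the tipping point, here is a back-of-the-envelope sketch. This is illustrative arithmetic only (not part of the ICICLE API), and it assumes uncompressed BLS12-381 affine points (two 48-byte coordinates) plus a 32-byte scalar per term; it counts input buffers only, while real peak usage also includes bucket and scratch memory.

```rust
// Rough input-memory estimate for an MSM of 2^log_size terms, assuming
// uncompressed BLS12-381 affine points (2 * 48 bytes) and a 32-byte scalar
// per term. Counts inputs only; bucket/scratch memory comes on top.
fn msm_input_bytes(log_size: u32) -> u64 {
    let n = 1u64 << log_size;
    n * (2 * 48 + 32)
}

fn main() {
    // 2^20 terms: 128 MiB of inputs, tiny next to 16 GiB of VRAM.
    println!("2^20: {} MiB", msm_input_bytes(20) >> 20);
    // 2^26 terms: 8 GiB of inputs, so one MSM fits in 16 GiB but two do not.
    println!("2^26: {} GiB", msm_input_bytes(26) >> 30);
}
```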
### Inter GPU approach

This is a more sophisticated approach to multi GPU computation. Using technologies such as [GPUDirect, NCCL, NVSHMEM](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-cwes1084/) and NVLink, it's possible to combine multiple GPUs and split a computation among the devices.

This approach requires redesigning the algorithm at the software level to be compatible with splitting amongst devices. In some cases, to keep latency to a minimum, special inter-GPU connections are installed on a server to allow direct communication between multiple GPUs.
# Writing ICICLE Code for Multi GPUs

The approach we have taken for the moment is the GPU Server approach; we assume you have a machine with multiple GPUs and you wish to run some computation on each GPU.

To dive deeper and learn about the API, check out the docs for our different ICICLE APIs:
- [Rust Multi GPU APIs](./rust-bindings/multi-gpu.md)
- C++ Multi GPU APIs
## Best practices

- Never hardcode device IDs. If you want your software to take advantage of all GPUs on a machine, use methods such as `get_device_count` to support an arbitrary number of GPUs.

- Launch one CPU thread per GPU. To avoid nasty errors and hard-to-read code, we suggest you create a dedicated thread for each GPU you target; a single thread can run multiple tasks as long as they all target the same device. This will make your code more manageable, easier to read, and performant. See [CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs](https://developer.nvidia.com/blog/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/) for background.
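The one-thread-per-GPU pattern can be sketched in plain Rust. This is a stand-alone illustration: the CUDA calls are replaced by comments, and `device_count` is passed in where real code would call `get_device_count`.

```rust
use std::thread;

// Spawn one dedicated CPU thread per device; each thread would bind itself
// to its GPU with set_device and may then run any number of tasks, as long
// as they all target that same device.
fn run_on_all_devices(device_count: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..device_count)
        .map(|device_id| {
            thread::spawn(move || {
                // In real code: set_device(device_id).unwrap();
                // ... launch this device's work here ...
                device_id // placeholder for this device's result
            })
        })
        .collect();
    // Joining in spawn order keeps results ordered by device ID.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    println!("{:?}", run_on_all_devices(4));
}
```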
## ZKContainer support for multi GPUs

Multi GPU support should work with ZKContainers by simply defining which devices the docker container should interact with:

```sh
docker run -it --gpus '"device=0,2"' zk-container-image
```

If you wish to expose all GPUs:

```sh
docker run --gpus all zk-container-image
```
# Multi GPU APIs

To learn more about the theory of Multi GPU programming, refer to [this part](../multi-gpu.md) of the documentation.

Here we will cover the core multi GPU APIs and [an example](#a-multi-gpu-example).

## Device management API

To streamline device management, the `icicle-cuda-runtime` package offers methods for dealing with devices.
#### [`set_device`](https://github.com/vhnatyk/icicle/blob/275eaa99040ab06b088154d64cfa50b25fbad2df/wrappers/rust/icicle-cuda-runtime/src/device.rs#L6)

Sets the current CUDA device by its ID; calling `set_device` binds the calling thread to the given CUDA device.

**Parameters:**

- `device_id: usize`: The ID of the device to set as the current device. Device IDs start from 0.

**Returns:**

- `CudaResult<()>`: An empty result indicating success if the device is set successfully. In case of failure, returns a `CudaError`.

**Errors:**

- Returns a `CudaError` if the specified device ID is invalid or if a CUDA-related error occurs during the operation.

**Example:**

```rust
let device_id = 0; // Device ID to set
match set_device(device_id) {
    Ok(()) => println!("Device set successfully."),
    Err(e) => eprintln!("Failed to set device: {:?}", e),
}
```
#### [`get_device_count`](https://github.com/vhnatyk/icicle/blob/275eaa99040ab06b088154d64cfa50b25fbad2df/wrappers/rust/icicle-cuda-runtime/src/device.rs#L10)

Retrieves the number of CUDA devices available on the machine.

**Returns:**

- `CudaResult<usize>`: The number of available CUDA devices. On success, contains the count of CUDA devices. On failure, returns a `CudaError`.

**Errors:**

- Returns a `CudaError` if a CUDA-related error occurs during the retrieval of the device count.

**Example:**

```rust
match get_device_count() {
    Ok(count) => println!("Number of devices available: {}", count),
    Err(e) => eprintln!("Failed to get device count: {:?}", e),
}
```
#### [`get_device`](https://github.com/vhnatyk/icicle/blob/275eaa99040ab06b088154d64cfa50b25fbad2df/wrappers/rust/icicle-cuda-runtime/src/device.rs#L15)

Retrieves the ID of the current CUDA device.

**Returns:**

- `CudaResult<usize>`: The ID of the current CUDA device. On success, contains the device ID. On failure, returns a `CudaError`.

**Errors:**

- Returns a `CudaError` if a CUDA-related error occurs during the retrieval of the current device ID.

**Example:**

```rust
match get_device() {
    Ok(device_id) => println!("Current device ID: {}", device_id),
    Err(e) => eprintln!("Failed to get current device: {:?}", e),
}
```
## Device context API

The `DeviceContext` is embedded into `NTTConfig`, `MSMConfig` and `PoseidonConfig`, meaning you can simply pass a `device_id` to your existing config and the same computation will be triggered on a different device. Note that the `device_id` in the config must match the device currently set for the calling thread; the implementation checks for a match rather than switching devices automatically.

#### [`DeviceContext`](https://github.com/vhnatyk/icicle/blob/eef6876b037a6b0797464e7cdcf9c1ecfcf41808/wrappers/rust/icicle-cuda-runtime/src/device_context.rs#L11)

Represents the configuration of a CUDA device, encapsulating the device's stream, ID, and memory pool. The default device is always `0`.
```rust
pub struct DeviceContext<'a> {
    pub stream: &'a CudaStream,
    pub device_id: usize,
    pub mempool: CudaMemPool,
}
```
##### Fields

- **`stream: &'a CudaStream`**

  A reference to a `CudaStream`. This stream is used for executing CUDA operations. By default, it points to the null stream, CUDA's default execution stream.

- **`device_id: usize`**

  The index of the GPU currently in use. The default value is `0`, indicating the first GPU in the system.

- **`mempool: CudaMemPool`**

  Represents the memory pool used for CUDA memory allocations. The default is set to a null pointer, which signifies the use of the default CUDA memory pool.

##### Implementation Notes

- The `DeviceContext` structure is cloneable and can be debugged, facilitating easier logging and duplication of contexts when needed.
#### [`DeviceContext::default_for_device(device_id: usize) -> DeviceContext<'static>`](https://github.com/vhnatyk/icicle/blob/eef6876b037a6b0797464e7cdcf9c1ecfcf41808/wrappers/rust/icicle-cuda-runtime/src/device_context.rs#L30C12-L30C30)

Provides a default `DeviceContext` for a given device, ideal for straightforward setups.

#### Parameters

- **`device_id: usize`**: The ID of the device for which to create the context.

#### Returns

A `DeviceContext` instance configured with:

- The default stream (`null_mut()`).
- The provided `device_id`.
- The default memory pool (`null_mut()`).
#### [`check_device(device_id: i32)`](https://github.com/vhnatyk/icicle/blob/eef6876b037a6b0797464e7cdcf9c1ecfcf41808/wrappers/rust/icicle-cuda-runtime/src/device_context.rs#L42)

Validates that the specified `device_id` matches the ID of the currently active device, ensuring operations are targeted correctly.

#### Parameters

- **`device_id: i32`**: The device ID to verify against the currently active device.

#### Behavior

- **Panics** if the `device_id` does not match the active device's ID, preventing cross-device operation errors.

#### Example

```rust
let device_id: i32 = 0; // Example device ID
check_device(device_id);
// Ensures that the current context is correctly set for the specified device ID.
```
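The guard's behavior can be sketched in plain Rust. This is not the ICICLE implementation: the hypothetical `current_device` function stands in for querying the CUDA runtime for the calling thread's active device.

```rust
// Stand-in for querying the calling thread's active CUDA device.
fn current_device() -> i32 {
    0
}

// Panics if the requested device does not match the thread's active device,
// mirroring the cross-device guard described above.
fn check_device(device_id: i32) {
    assert_eq!(
        device_id,
        current_device(),
        "device {} requested while another device is active",
        device_id
    );
}

fn main() {
    check_device(0); // matches the stand-in active device: no panic
    println!("device check passed");
}
```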
## A Multi GPU example

In this example we will show how you can:

1. Fetch the number of devices installed on a machine.
2. Launch a thread for every GPU and set an active device per thread.
3. Execute an MSM on each GPU.
```rust
...

let device_count = get_device_count().unwrap();

(0..device_count)
    .into_par_iter()
    .for_each(move |device_id| {
        set_device(device_id).unwrap();

        // you can create a per-device stream and allocate points and scalars_d here

        let mut cfg = MSMConfig::default_for_device(device_id);
        cfg.ctx.stream = &stream;
        cfg.is_async = true;
        cfg.are_scalars_montgomery_form = true;
        msm(&scalars_d, &HostOrDeviceSlice::on_host(points), &cfg, &mut msm_results).unwrap();

        // collect and process results
    });

...
```
|
||
We use `get_device_count` to fetch the number of connected devices, device IDs will be `0...device_count-1` | ||
|
||
[`into_par_iter`](https://docs.rs/rayon/latest/rayon/iter/trait.IntoParallelIterator.html#tymethod.into_par_iter) is a parallel iterator, you should expect it to launch a thread for every iteration. | ||
|
||
We then call `set_device(device_id).unwrap();` it should set the context of that thread to the selected `device_id`. | ||
|
||
Any data you now allocate from the context of this thread will be linked to the `device_id`. We create our `MSMConfig` with the selected device ID `let mut cfg = MSMConfig::default_for_device(device_id);`, behind the scene this will create for us a `DeviceContext` configured for that specific GPU. | ||
|
||
We finally call our `msm` method. |
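The same fan-out pattern can also split one large input across devices. Here is a pure-Rust sketch of that idea (not ICICLE code): a plain chunk sum stands in for the per-device MSM, and in real code each thread would call `set_device` and run `msm` on its chunk.

```rust
use std::thread;

// Split the scalars into one chunk per device and process each chunk on a
// dedicated thread; a chunk sum stands in for the per-device MSM.
fn split_across_devices(scalars: &[u64], device_count: usize) -> Vec<u64> {
    // Ceiling division so every element lands in some chunk.
    let chunk_size = (scalars.len() + device_count - 1) / device_count;
    thread::scope(|s| {
        let handles: Vec<_> = scalars
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    // In real code: set_device(device_id) then msm(...) here.
                    chunk.iter().sum::<u64>()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let scalars: Vec<u64> = (1..=8).collect();
    println!("{:?}", split_across_devices(&scalars, 2));
}
```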