This repository describes how to set up and benchmark NVIDIA GPUDirect Storage (GDS) performance.
Download and reinstall Mellanox OFED (MLNX_OFED, version 5.3 or later) before installing the CUDA Toolkit.
# Check MLNX_OFED version
ofed_info -s
# If MLNX_OFED is already installed, uninstall it first
cd MLNX_OFED_LINUX-5.8-1.0.1.1-ubuntu20.04-x86_64
./uninstall.sh
# MLNX_OFED installation for GDS
./mlnxofedinstall --with-nvmf --with-nfsrdma --enable-gds --add-kernel-support
update-initramfs -u -k `uname -r`
/etc/init.d/openibd restart # if this prints error[FAILED] while loading new modules, follow the recovery steps below
reboot
# Recovery steps: skip these if there was no error[FAILED]
rmmod {old_module} # if needed
/etc/init.d/openibd restart # try again
reboot
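The version requirement above can also be checked from a script. The following is a minimal sketch, assuming `ofed_info -s` prints a string of the form `MLNX_OFED_LINUX-<major>.<minor>-...`; the function name `ofed_version_ok` is hypothetical and not part of MLNX_OFED.

```shell
# Hypothetical helper (not part of MLNX_OFED): succeed if the given
# 'ofed_info -s' style string reports version 5.3 or later.
# Assumes the format 'MLNX_OFED_LINUX-<major>.<minor>-...'.
ofed_version_ok() {
    local ver major minor
    ver="${1#MLNX_OFED_LINUX-}"          # strip the expected prefix
    [ "$ver" != "$1" ] || return 1       # prefix missing: unknown format
    case "$ver" in *.*) ;; *) return 1 ;; esac
    major="${ver%%.*}"                   # text before the first dot
    minor="${ver#*.}"                    # text after the first dot
    minor="${minor%%[!0-9]*}"            # keep only the leading digits
    case "$major" in ''|*[!0-9]*) return 1 ;; esac
    case "$minor" in ''|*[!0-9]*) return 1 ;; esac
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 3 ]; }
}
```

A possible use is `ofed_version_ok "$(ofed_info -s)" || echo "MLNX_OFED 5.3 or later is required"`.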
Install the CUDA Toolkit with the runfile installer. nvidia-fs (the nvidia-gds package) is supported in the runfile installation from CUDA 12.0 onward.
# If CUDA is already installed, uninstall it first
# Uninstall CUDA
sudo /usr/local/cuda-11.7/bin/cuda-uninstaller
# Uninstall the NVIDIA driver
sudo /usr/bin/nvidia-uninstall
# CUDA / NVIDIA driver installation with the GDS package
sudo sh cuda_12.0.0_525.60.13_linux.run --kernelobjects # 'nvidia-fs' kernel object should be installed
# After installation, add the CUDA directories to PATH and LD_LIBRARY_PATH
export PATH="/usr/local/cuda-12.0/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH"
lsmod | grep nvidia # this should print 'nvidia_fs ...'
export PATH="/usr/local/cuda-12.0/gds/tools:$PATH"
gdscheck -p # this should print 'NVMe : Supported'
# Check your NVMe device
sudo fdisk -l | grep nvme
# Test run with 'gdsio' benchmarking utility
gdsio -f /dev/nvme0n1 -d 0 -w 4 -s 1G -i 1M -I 1 -x 0 -V
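The `gdscheck -p` check above can be automated instead of read by eye. The following is a minimal sketch, assuming the output contains lines of the form `<name> : Supported` as quoted in the comment above; the function name `gds_backend_supported` is hypothetical and not part of the GDS tools.

```shell
# Hypothetical helper (not part of the GDS tools): read 'gdscheck -p'
# output on stdin and succeed if the named backend (e.g. NVMe, NVMeOF)
# is listed as 'Supported'.
# Assumes lines of the form '<name> : Supported'.
gds_backend_supported() {
    grep -Eq "^[[:space:]]*$1[[:space:]]*:[[:space:]]*Supported[[:space:]]*\$"
}
```

A possible use is `gdscheck -p | gds_backend_supported NVMe && echo "GDS can use local NVMe"`.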
NVMe Server (Target) Configuration
# Load kernel modules
modprobe nvmet
modprobe nvmet-rdma
# Create an NVMe subsystem
cd /sys/kernel/config/nvmet/subsystems
mkdir nvme_subsystem # any name you want
cd nvme_subsystem
echo 1 > attr_allow_any_host
# Create a namespace
cd namespaces
mkdir 1 # any number you want
cd 1
echo -n /dev/nvme0n1 > device_path
echo 1 > enable
# Create a port
cd /sys/kernel/config/nvmet/ports
mkdir 1 # any number you want
cd 1
echo X.X.X.X > addr_traddr # IP address on the Mellanox adapter (e.g., InfiniBand)
echo rdma > addr_trtype
echo 4420 > addr_trsvcid # 4420 is the IANA default for NVMeOF
echo ipv4 > addr_adrfam
# Link the subsystem to the port
ln -s /sys/kernel/config/nvmet/subsystems/nvme_subsystem /sys/kernel/config/nvmet/ports/1/subsystems/nvme_subsystem
# Verify that the NVMe target is listening on the port
dmesg | grep nvme # this should print 'nvmet_rdma: enabling port 1 (X.X.X.X:4420)'
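The target configuration above can be wrapped in a function. This is a sketch under assumptions rather than a tested deployment script: the function name `nvmet_setup` and its argument order are hypothetical, the namespace ID is fixed to 1, and the configfs root is a parameter so the steps can be tried against a scratch directory first. On a real target it must run as root with `nvmet` and `nvmet-rdma` loaded.

```shell
# Hypothetical helper: write the NVMe-oF RDMA target settings shown above.
# Usage: nvmet_setup ROOT SUBSYS DEVICE ADDR [PORT_ID] [SVC_ID]
#   ROOT is the nvmet configfs root, e.g. /sys/kernel/config/nvmet
nvmet_setup() {
    local root="$1" subsys="$2" dev="$3" addr="$4"
    local port="${5:-1}" svc="${6:-4420}"
    local ss="$root/subsystems/$subsys" p="$root/ports/$port"

    # Create the subsystem, namespace 1, and the port directories
    mkdir -p "$ss/namespaces/1" "$p/subsystems" || return 1

    echo 1 > "$ss/attr_allow_any_host" &&
    printf '%s' "$dev" > "$ss/namespaces/1/device_path" &&
    echo 1 > "$ss/namespaces/1/enable" &&
    echo "$addr" > "$p/addr_traddr" &&
    echo rdma > "$p/addr_trtype" &&
    echo "$svc" > "$p/addr_trsvcid" &&
    echo ipv4 > "$p/addr_adrfam" &&
    # Link the subsystem to the port
    ln -s "$ss" "$p/subsystems/$subsys"
}
```

A possible use is `nvmet_setup /sys/kernel/config/nvmet nvme_subsystem /dev/nvme0n1 X.X.X.X`, with X.X.X.X replaced by the IP address of the Mellanox adapter.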
NVMeOF Client (Initiator) Configuration
The client needs nvme-cli to run NVMe commands. Install it by following the installation guide in the nvme-cli repository.
# Load kernel modules
modprobe nvme-rdma
gdscheck -p # this should print 'NVMeOF : Supported'
# Discover available subsystems on NVMeOF target
sudo nvme discover -t rdma -a X.X.X.X -s 4420 # Same IP you have set on the NVMeOF Server
# Connect to the discovered subsystem with the 'subnqn' name and IP you have set on the NVMeOF Server
sudo nvme connect -t rdma -n nvme_subsystem -a X.X.X.X -s 4420
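The `subnqn` name needed for `nvme connect` can be pulled out of the discovery output. The following is a minimal sketch, assuming nvme-cli's plain-text discovery log with one `subnqn:  <name>` line per entry; the function name `list_subnqns` is hypothetical.

```shell
# Hypothetical helper: read 'nvme discover' output on stdin and print the
# subsystem NQN of each discovery log entry, one per line.
# Assumes nvme-cli's plain-text format with 'subnqn:  <name>' lines.
list_subnqns() {
    awk '$1 == "subnqn:" { print $2 }'
}
```

A possible use is `sudo nvme discover -t rdma -a X.X.X.X -s 4420 | list_subnqns`, which should list `nvme_subsystem` if the target was configured with that name.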
For a persistent setup, refer to the NVMeOF Configuration Docs.
NVMe Server (Target) Termination
rm -f /sys/kernel/config/nvmet/ports/1/subsystems/nvme_subsystem
rmdir /sys/kernel/config/nvmet/ports/1
rmdir /sys/kernel/config/nvmet/subsystems/nvme_subsystem/namespaces/1
rmdir /sys/kernel/config/nvmet/subsystems/nvme_subsystem
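If the subsystem name or port number differs from the values used here, the teardown commands can be generated rather than edited by hand. The following is a minimal sketch; the function name `nvmet_teardown_cmds` and its argument order are hypothetical. It only prints the commands, in the same order as above, so they can be reviewed before being run.

```shell
# Hypothetical helper: print the teardown commands for one subsystem/port
# pair, in the same order as the commands above.
# Usage: nvmet_teardown_cmds SUBSYS [PORT_ID] [NAMESPACE_ID] [ROOT]
# Assumes names and paths contain no whitespace.
nvmet_teardown_cmds() {
    local subsys="$1" port="${2:-1}" ns="${3:-1}"
    local root="${4:-/sys/kernel/config/nvmet}"
    printf 'rm -f %s/ports/%s/subsystems/%s\n' "$root" "$port" "$subsys"
    printf 'rmdir %s/ports/%s\n' "$root" "$port"
    printf 'rmdir %s/subsystems/%s/namespaces/%s\n' "$root" "$subsys" "$ns"
    printf 'rmdir %s/subsystems/%s\n' "$root" "$subsys"
}
```

After reviewing the output of `nvmet_teardown_cmds nvme_subsystem`, one option is to pipe it to `sudo sh`.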
NVMeOF Client (Initiator) Termination
# If you want to disconnect from the target, run the following command
nvme disconnect -n nvme_subsystem
Currently, EXT4 and XFS are the only block-device-based filesystems that GDS supports.
# Format the device with the ext4 file system, which GDS supports
sudo mkfs.ext4 /dev/nvme0n1
# The ext4 file system must be mounted with the journaling mode set to 'data=ordered'
sudo mount -o data=ordered /dev/nvme0n1 /gds_files # any existing directory you want to use as the mount point
# Check mounted mode
sudo mount | grep /dev/nvme0n1 # this should print '/dev/nvme0n1 on /gds_files type ext4 (rw,relatime,data=ordered)'
# To unmount
sudo umount /dev/nvme0n1
# If the target is busy, try the following
fuser -cu /dev/nvme0n1 # find who is using the mount target
fuser -ck /dev/nvme0n1 # kill process
sudo umount /dev/nvme0n1 # try again
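The mounted-mode check above can be scripted so a benchmark run stops early when the journaling mode is wrong. The following is a minimal sketch, assuming the `mount` output has the form `<dev> on <dir> type <fs> (<options>)`, that the mount point contains no spaces, and that `data=ordered` appears in the option list as in the expected line above; the function name `mount_is_ordered` is hypothetical.

```shell
# Hypothetical helper: read 'mount' output on stdin and succeed if the
# given device is mounted as ext4 with 'data=ordered' among its options.
# Assumes lines of the form '<dev> on <dir> type <fs> (<options>)'.
mount_is_ordered() {
    awk -v dev="$1" '
        $1 == dev && $5 == "ext4" && $6 ~ /[(,]data=ordered[,)]/ { found = 1 }
        END { exit !found }
    '
}
```

A possible use is `mount | mount_is_ordered /dev/nvme0n1 || echo "remount with -o data=ordered"`.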
# Running in a 'conda' environment is recommended
pip install -r requirements.txt
# Run GDS benchmarking code
python gds_benchmark.py