The bandwidth does not scale linearly when increasing the number of SSDs #28

Closed
starry12 opened this issue Apr 11, 2024 · 2 comments
Labels: invalid (This doesn't seem right)

@starry12

Describe the bug
Hi,
I am trying to run nvm-block-bench on an ASUS ESC8000-E11 that has a V100 GPU and 6x Intel NVMe SSDs. The GPU is in a PCIe5 x16 slot.

To Reproduce
In fact, I'd like to reproduce the test from #17.

Expected behavior
I expected to get the same linear bandwidth scaling as reported there.

However, my bandwidth seems to be capped at ~10 GB/s.
Here are some results:

1. I could get ~5 GB/s when reading from only one SSD:
run:
./bin/nvm-blockbench --threads=$((1024*1024*4)) --blk_size=64 --reqs=1 --pages=$((1024*1024*4)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=4 --n_ctrls=1 --num_queues=128 --random=true --access_type=0

result:

Elapsed Time: 3.10162e+06 Number of Ops: 4194304 Data Size (bytes): 17179869184
Ops/sec: 1.3523e+06 Effective Bandwidth(GB/S): 5.1586

2. When we increase to 2 SSDs, we only get ~8 GB/s:
run:
./bin/nvm-blockbench --threads=$((1024*1024*4)) --blk_size=64 --reqs=1 --pages=$((1024*1024*4)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=4 --n_ctrls=2 --num_queues=128 --random=true --access_type=0

result:

Elapsed Time: 1.97415e+06 Number of Ops: 4194304 Data Size (bytes): 17179869184
Ops/sec: 2.12461e+06 Effective Bandwidth(GB/S): 8.10473

3. Increasing to 4 SSDs gives ~10 GB/s:
run:
./bin/nvm-blockbench --threads=$((1024*1024*4)) --blk_size=64 --reqs=1 --pages=$((1024*1024*4)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=4 --n_ctrls=4 --num_queues=128 --random=true --access_type=0

result:

Elapsed Time: 1.54689e+06 Number of Ops: 4194304 Data Size (bytes): 17179869184
Ops/sec: 2.71144e+06 Effective Bandwidth(GB/S): 10.3433

4. Unfortunately, increasing to 6 SSDs didn't help; the bandwidth appears to be capped:
run:
./bin/nvm-blockbench --threads=$((1024*1024*4)) --blk_size=64 --reqs=1 --pages=$((1024*1024*4)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=4 --n_ctrls=6 --num_queues=128 --random=true --access_type=0

result:

Elapsed Time: 1.54337e+06 Number of Ops: 4194304 Data Size (bytes): 17179869184
Ops/sec: 2.71762e+06 Effective Bandwidth(GB/S): 10.3669
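
As a sanity check on the reported numbers (assuming Elapsed Time is in microseconds, which is consistent with the figures above), the Effective Bandwidth appears to be Data Size / Elapsed Time expressed in GiB/s; for run 1:

echo "17179869184 / 3.10162 / 1024^3" | bc -l   # bytes per microsecond scaled to GiB/s: ~5.16, matching the reported 5.1586 within rounding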

I tried changing page_size, reqs, threads, etc., but the bandwidth stayed at ~10 GB/s.
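
A sweep like the following reproduces the four runs above in one go (a sketch; same flags as above, the binary path and --gpu index come from this setup and may differ on another machine):

for n in 1 2 4 6; do
    echo "=== n_ctrls=$n ==="
    ./bin/nvm-blockbench --threads=$((1024*1024*4)) --blk_size=64 --reqs=1 --pages=$((1024*1024*4)) --queue_depth=1024 --page_size=4096 --num_blks=2097152 --gpu=4 --n_ctrls=$n --num_queues=128 --random=true --access_type=0 | grep "Effective Bandwidth"
done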

To troubleshoot from the SSD side, I used fio to read data from multiple SSDs to the CPU at the same time, and the aggregate bandwidth reached ~30 GB/s:
[screenshot: fio CPU-side bandwidth results]
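
An fio invocation of roughly this form does that kind of CPU-side multi-SSD random read (a sketch, not the exact job behind the screenshot; device names, queue depth, and job count are placeholders):

sudo fio --name=cpu_bw --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=256 --numjobs=4 --time_based --runtime=30 --group_reporting --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1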

Do you have any ideas or solutions for this result? Thanks.

Machine Setup:
OS: Ubuntu 20.04.6, Kernel 5.4.0-99-generic
NVIDIA Driver: 545.23.08, CUDA Versions: 12.3, GPU name: NVIDIA V100-PCIE-32GB
SSD used: Intel SSD D7-P5520 SERIES
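
One quick way to confirm the link the GPU actually negotiated (as opposed to what the slot supports) is to query it with nvidia-smi; a minimal check:

nvidia-smi --query-gpu=name,pcie.link.gen.max,pcie.link.gen.current,pcie.link.width.current --format=csv
# a V100 should report gen 3 and width 16 here, regardless of the slot's own capability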

@msharmavikram (Collaborator)

Two things:

1. I think the answer is in your own question. The V100 is a Gen3 GPU. The maximum GPU ingress bandwidth at Gen3 x16 is about 12 GB/s, so expecting 30 GB/s over a Gen3 link is unreasonable and unrealistic. In the CPU case you observe ~30 GB/s because the reads are spread across multiple PCIe slots, each capable of 12 GB/s of ingress bandwidth!

2. The really interesting question here is why scaling to two SSDs does not hit 10 GB/s and instead gets capped at ~8 GB/s. I strongly suspect this is due to limitations in the root complex of the ASUS ESC8000-E11 CPU socket.

Lastly, issue #17 is a completely different case, as it is on a Gen5 system (two generations ahead), so there is no relationship between the two!
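
As a rough cross-check of that Gen3 ceiling (back-of-the-envelope only, ignoring TLP/protocol and DMA overhead):

# PCIe Gen3 runs 8 GT/s per lane with 128b/130b encoding, so for an x16 link:
echo "16 * 8 * 128 / 130 / 8" | bc -l   # ~15.75 GB/s raw payload rate; ~12-13 GB/s of GPU ingress is a realistic ceiling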

@starry12 (Author)

Thanks for your answer; it looks like I made a very basic mistake.
I only considered the capabilities of the server slot and ignored the GPU itself. Thanks again!

I will close this issue.

@msharmavikram added the invalid (This doesn't seem right) label on Apr 11, 2024