Hey,

We have seen big improvements in our B-Tree and BRIN index workloads since switching to the Samsung PM1735 series of PCIe 4.0 x8 NVMe drives, and we are now deciding whether to move everything onto them in a RAID configuration, or to simply use them exclusively as index/temporary/materialisation tablespaces so as to avoid the RAID capacity overhead (a rough sketch of the latter option follows the spec list below):
Capacity: 6.4 TB
Sequential Read (128 KB): 8000 MB/s
Sequential Write (128 KB): 3800 MB/s
Random Read (4 KB): 1500K IOPS
Random Write (4 KB): 250K IOPS
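To make the second option concrete, here is a minimal sketch of roughly what I have in mind; the mount point, tablespace name and index name are placeholders, not our actual objects:

```sql
-- Rough sketch of the "dedicated NVMe tablespace" option; the mount point,
-- tablespace name and index name below are placeholders.
CREATE TABLESPACE nvme_fast LOCATION '/mnt/pm1735/pgdata';

-- Move (rebuild) a hot index onto the NVMe tablespace.
ALTER INDEX orders_created_at_brin SET TABLESPACE nvme_fast;

-- Route temp files (sorts, hash spills, materialisation) to the same drives.
ALTER SYSTEM SET temp_tablespaces = 'nvme_fast';
SELECT pg_reload_conf();
```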
As you can see, at 64 Gbit/s of sequential read bandwidth the pipe is within striking distance of a single channel of DDR4 ECC memory.
Perhaps this makes a statement about the whole "NVMe revolution" business, though it doesn't yet go as far as to rival the purpose-built optical NICs used in datacenter and NUMA hyperscale arrangements. At any rate, this is how I discovered PG-Strom: by casually exploring options to further leverage NVMe storage in bandwidth-first scenarios. In the server rack I'm working with at the moment, we also have an AMD Instinct MI50, a datacenter-grade GPU that wouldn't otherwise be interesting had it not carried 32 GB of high-bandwidth HBM2 memory at 1024 GB/s. We had initially intended to use this card for Llama 2 inference, but eventually decided against it, as our IBM POWER9 system wasn't as good a fit as a purpose-built x86 system with top-of-the-line NVIDIA cards.
We were supposed to get rid of the GPU earlier this week.
However, after discovering your work on peer-to-peer GPU<->NVMe access, I decided to postpone that until further investigation. NVIDIA GPUDirect is the technology that enables this capability, as I understand it. According to AMD's documentation, the GPUDirect RDMA API is not supported in HIP, which is unfortunate considering that HIP otherwise provides a fairly complete set of drop-in bindings to CUDA, but frankly it was also to be expected. So the implementation, should it be possible at all, obviously wouldn't be as easy as rewriting the headers and substituting hipcc for nvcc. That said, I'm fairly sure I have seen peer-to-peer code in the amdgpu kernel driver, and I would expect it to work because the driver supports multi-GPU configurations without an NVLink equivalent, i.e. directly over PCIe.
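For the multi-GPU leg at least, the HIP runtime can be asked directly whether peer access is expected to work. Below is a throwaway probe I'd start from; it only covers GPU-to-GPU peer access, not GPU-to-NVMe DMA, which as far as I understand would have to go through the kernel's PCI p2pdma side rather than HIP:

```cpp
// Hypothetical probe (not PG-Strom code): asks the HIP runtime whether peer
// access is possible between each pair of GPUs over PCIe.
// Build with: hipcc p2p_probe.cpp -o p2p_probe
#include <cstdio>
#include <cstdlib>
#include <hip/hip_runtime.h>

static void check(hipError_t err, const char *what) {
    if (err != hipSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, hipGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    int n = 0;
    check(hipGetDeviceCount(&n), "hipGetDeviceCount");
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int can = 0;
            check(hipDeviceCanAccessPeer(&can, src, dst), "hipDeviceCanAccessPeer");
            std::printf("GPU %d -> GPU %d: peer access %s\n",
                        src, dst, can ? "possible" : "not possible");
            if (can) {
                // Enable the mapping from src's context; flags must be 0.
                check(hipSetDevice(src), "hipSetDevice");
                hipError_t e = hipDeviceEnablePeerAccess(dst, 0);
                if (e != hipSuccess && e != hipErrorPeerAccessAlreadyEnabled)
                    std::fprintf(stderr, "hipDeviceEnablePeerAccess: %s\n",
                                 hipGetErrorString(e));
            }
        }
    }
    return 0;
}
```

With only the single MI50 in our box this proves nothing by itself, but on a dual-card system it would at least show whether the runtime (and the IOMMU setup) lets PCIe peer traffic through; the GPU<->NVMe leg is a separate question entirely.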
Do you have any idea how hard this is likely to be, if it's possible at all, on a server mainboard with the hardware IOMMU fully enabled? If it were to work, we could make a solid case for perhaps petascale Postgres, dedicating anywhere from 32 to 256 GB of HBM2 memory to indexing and scanning directly over NVMe without ever consuming CPU time or incurring unnecessary copies to RAM. I haven't done the measurements, but I would expect various hash-join and point-in-polygon intersection type workloads to perform much better in hybrid HBM/NVMe tablespaces.
The idea excites me very much, considering that the MI50 remains an exceptionally affordable (950 USD) route to 32+ GB of HBM2.