Storage Benchmarks

These benchmarks are designed to measure the performance of storage on the MOC and NERC, with a focus on AI applications.

Experiments

  • FIO
  • MLPerf storage
  • Real Inference Workload

FIO

The benchmarking tool selected for this evaluation is FIO, following John Strunk's FIO evaluation framework. The specific client machine used for running the experiments is not critical, as long as we can distinguish between network and storage latency. The same experiment data file is reused, without cleanup between runs, for all of the benchmarks described below; a minimal sketch of driving one of these runs with fio appears after the list.

The evaluation will include five key benchmarks:

  • Streaming: 100% sequential writes, followed by 100% sequential reads, using 32 I/O threads. This test measures the maximum write/read bandwidth (in MB/s) we can achieve, with approximately 300GB of data written over about 30 minutes.
  • Maximal Latency: 100% random writes, followed by 100% random reads, using a single I/O thread. This test measures the maximum write/read latency we can expect, lasting about 30 minutes.
  • Minimal Latency: 100% sequential writes, followed by 100% sequential reads, using a single I/O thread. This test focuses on the minimal write/read latency, providing insights primarily into network latency. Client-side caching should be disabled. The test runs for around 30 minutes, and the sequential throughput should also be measured or calculated.
  • Max I/O Random Throughput: 100% random writes, followed by 100% random reads, using 32 I/O threads. This test assesses the maximum write/read throughput we can expect, with a duration of approximately 30 minutes.
  • Small WSS Streaming Reads: 100% sequential reads over a small working-set with 32 I/O threads. This test is an attempt to separate the effects of the network between the client and storage from the overheads of the storage back end (i.e., disk). The workload generator is configured to bypass the client cache, ensuring the reads are sent to the storage system even though the WSS is small. Given the small WSS, the expectation is that it will fit in the storage system’s cache, leading to network overheads being the dominant contributor to performance.
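
As a rough illustration, the sketch below shows how a run such as the Streaming benchmark could be driven with fio from Python. The target path, per-job size, and 30-minute runtime are assumptions for illustration only; the actual runs follow John Strunk's framework with the parameters listed above.

```python
# Minimal sketch (assumed paths and sizes) of driving one FIO benchmark:
# 1MB sequential I/O, 32 concurrent jobs, direct I/O to bypass the client cache.
import json
import subprocess

def run_fio(rw: str, bs: str, numjobs: int, runtime_s: int, target: str) -> dict:
    """Run a single fio job and return its parsed JSON output."""
    cmd = [
        "fio",
        "--name=streaming",
        f"--filename={target}",   # assumed path on the mounted storage under test
        f"--rw={rw}",             # write / read / randwrite / randread
        f"--bs={bs}",
        f"--numjobs={numjobs}",
        "--size=10G",             # assumed per-job file size
        "--direct=1",             # bypass the client page cache
        "--time_based",
        f"--runtime={runtime_s}",
        "--group_reporting",
        "--output-format=json",
    ]
    result = subprocess.run(cmd, capture_output=True, check=True, text=True)
    return json.loads(result.stdout)

# Sequential writes, then sequential reads, as in the Streaming benchmark (~30 min each).
writes = run_fio("write", "1M", 32, 1800, "/mnt/benchmark/testfile")
reads = run_fio("read", "1M", 32, 1800, "/mnt/benchmark/testfile")
print("write BW (KiB/s):", writes["jobs"][0]["write"]["bw"])
print("read BW (KiB/s):", reads["jobs"][0]["read"]["bw"])
```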

Results

(Network) Bandwidth

Big (1MB) sequential I/O requests, 32 concurrently, to stress the network

            100MB       300GB       600GB
Writes:     193 MiB/s   219 MiB/s   200 MiB/s
Reads:      1990 MiB/s  690 MiB/s   1003 MiB/s

Latency

Small (4KB) random I/O requests, no concurrency, to measure per-request latency without queueing effects

[in ms]         100MB       300GB       600GB
Writes Avg.:    37.23       38.7        45.8
Writes Median:  5.1         5.21        7.1
Writes 99%:     371.1       337.6       405.5
Reads Avg.:     0.8         17.62       10.6
Reads Median:   0.39        13.43       10.58
Reads 99%:      10.6        109.57      96.5

Full results can be found in the results/ folder.

MLPerf Storage

MLPerf Storage benchmarks storage performance for training workloads. It does this by generating a dataset and simulating the process of training over it. The benchmark does not use GPUs; the time a GPU would have spent training on each sample of the dataset is replaced with a sleep. Processing time on actual hardware (A100 and H100) has been measured in order to calculate the correct sleep duration for each sample. Training runs for 5 epochs and does not perform checkpointing.

The main metric for the MLPerf Storage experiment is Accelerator Utilization. This is measured as the fraction of time that the GPU would spend processing compared to the overall duration of the experiment, as defined by the formula AU = Accelerator Total Time / Total Duration = Accelerator Total Time / (Accelerator Total Time + Storage Load Time).
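
For illustration, a minimal sketch of the AU calculation described above; the numbers in the example are hypothetical and not taken from our runs.

```python
# Accelerator Utilization: fraction of the run the simulated GPU spends "computing"
# rather than waiting on storage.
def accelerator_utilization(accelerator_total_time_s: float, storage_load_time_s: float) -> float:
    """AU = accelerator total time / (accelerator total time + storage load time)."""
    total_duration_s = accelerator_total_time_s + storage_load_time_s
    return accelerator_total_time_s / total_duration_s

# Hypothetical example: 900s of simulated GPU compute, 7200s spent loading data.
au = accelerator_utilization(accelerator_total_time_s=900.0, storage_load_time_s=7200.0)
print(f"AU = {au:.2%}")  # ~11.11%; MLPerf Storage treats anything below 90% as a fail
```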

By default, MLPerf Storage defines an accelerator utilization score below 90% as a fail.

In our setup, the dataset is loaded from a Persistent Volume Claim hosted on the NESE Ceph cluster. The training workload is unet3d with 1 simulated GPU, run against datasets of 1000 samples (~140GB) and 3500 samples (~500GB); each sample ranges in size from around 80MB to 200MB. The Kubernetes Job and PVC definitions can be found in the k8s/ folder.

The results can be found in the results/ folder.

Simulated GPU   Samples   Storage Type         AU (%)   MB/s
A100            3500      NESE Ceph PVC        10.81    165.35
H100            3500      NESE Ceph PVC        5.58     168.05
A100            1000      Local EmptyDir PVC   24.51    729.10
                          Weka PVC
                          Weka PVC

The results below were not run on the NERC and are provided purely for reference.

Simulated GPU   Samples   Storage Type           AU (%)   MB/s
A100            1200      Macbook Pro 14" NVMe   99.16    1495.76

Other results contributed by participating organizations can be found on the MLPerf Storage website.

Real Inference Workload

To be derived from Sanjay's work on which average model to use as an example. Candidate models: Granite (consult the Perf group), OPT-13B, and LLaMA.
