Add multi-GPU option for qibo library #28

Merged
merged 9 commits into libraries from multigpu on Feb 9, 2022

Conversation

@stavros11 (Member) commented Jan 6, 2022

@mlazzarin I realized that I had an old implementation of multi-GPU benchmarks for qibo in this repo, so I updated it using the latest `libraries` branch to avoid duplicating work. The multi-GPU configuration can be passed via `--library-options` with the existing benchmark scripts, for example:

```
python compare.py --library qibo --nqubits 31 --library-options accelerators=1/GPU:0+1/GPU:1
```

Let me know if you agree. Note that you can "reuse" a single GPU by passing `accelerators=2/GPU:0` (reuse factors larger than 2 also work, but they should be powers of 2).
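For clarity, the accelerators string corresponds to the dictionary that qibo's distributed circuits expect (see the `accelerators={"/GPU:0": 1, "/GPU:1": 1}` example further down). A minimal sketch of that mapping; the `parse_accelerators` helper is only illustrative, not the actual parsing code in the benchmark scripts:

```python
# Illustrative only: how an option string such as "1/GPU:0+1/GPU:1" corresponds
# to the accelerators dictionary used by qibo's distributed circuits.
def parse_accelerators(option):
    accelerators = {}
    for part in option.split("+"):          # e.g. ["1/GPU:0", "1/GPU:1"]
        count, device = part.split("/", 1)  # e.g. ("1", "GPU:0")
        accelerators["/" + device] = int(count)
    return accelerators

print(parse_accelerators("1/GPU:0+1/GPU:1"))  # {'/GPU:0': 1, '/GPU:1': 1}
print(parse_accelerators("2/GPU:0"))          # {'/GPU:0': 2}
```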

I ran the above for a few configurations and it executes for both qibojit and qibotf. I am not sure the results make sense, though. I suspect that parallelization across multiple GPUs is broken for qibojit: with qibotf, nvidia-smi shows 100% utilization on all devices simultaneously, while with qibojit the devices seem to run sequentially. I will investigate this further. We also need to add tests here that check the final state for multi-GPU configurations.

@stavros11 (Member, Author) commented Jan 10, 2022

Here are some results on the DGX in double precision:

QFT

| nqubits (accelerators) | qibojit dry run (sec) | qibojit simulation (sec) | qibotf dry run (sec) | qibotf simulation (sec) |
|---|---|---|---|---|
| 31 (2/GPU:0) | 75.1929 | 68.7758 | 82.987 | 83.3789 |
| 31 (1/GPU:0 + 1/GPU:1) | 67.636 | OOM | 55.9763 | 55.613 |
| 32 (4/GPU:0) | 195.4 | 188.271 | 219.633 | 220.952 |
| 32 (2/GPU:0 + 2/GPU:1) | 155.852 | 137.265 | 137.71 | 138.196 |
| 32 (1/GPU:0 + 1/GPU:1 + 1/GPU:2 + 1/GPU:3) | 135.925 | OOM | 96.0887 | 94.1834 |

Variational

| nqubits (accelerators) | qibojit dry run (sec) | qibojit simulation (sec) | qibotf dry run (sec) | qibotf simulation (sec) |
|---|---|---|---|---|
| 31 (2/GPU:0) | 61.958 | 55.7334 | 74.2878 | 75.3073 |
| 31 (1/GPU:0 + 1/GPU:1) | 59.6055 | 38.3918 | 50.0096 | 50.3635 |
| 32 (4/GPU:0) | 167.434 | 160.963 | 217.574 | 217.58 |
| 32 (2/GPU:0 + 2/GPU:1) | 127.991 | 109.543 | 146.671 | 148.087 |
| 32 (1/GPU:0 + 1/GPU:1 + 1/GPU:2 + 1/GPU:3) | 121.044 | 75.4436 | 111.084 | 110.45 |

The issues to be resolved from the qibo side are the following:

  • Why the second QFT run goes out of memory when many devices are used, while the same run works with qibotf. I suspect that some objects are not deleted properly, but it is strange that the same issue does not appear for the variational circuit.
  • It seems that some kind of compilation happens during the qibojit dry run, since in all cases it is much slower than the simulation run, in contrast to qibotf where both runs take similar time. I also believe that the joblib parallelization does not work very well with qibojit.

Despite these issues, it seems that qibojit generally gets better simulation times than qibotf, most likely due to faster CPU-GPU communication.

@stavros11 (Member, Author) commented:

To demonstrate the second issue (the parallelization), here are some plots showing the GPU utilization sampled from nvidia-smi every 0.05 sec. For qibotf the two GPUs work simultaneously, while for qibojit most operations appear to be applied sequentially.

qibojit - QFT - 31 qubits: [GPU utilization plot]

qibotf - QFT - 31 qubits: [GPU utilization plot]

qibojit - Variational - 31 qubits: [GPU utilization plot]

qibotf - Variational - 31 qubits: [GPU utilization plot]
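
For reference, the utilization traces can be collected with something along these lines (a sketch of the approach, not the exact script used for these plots): poll nvidia-smi roughly every 0.05 s and record one sample per GPU.

```python
# Sketch: sample per-GPU utilization from nvidia-smi at ~0.05 s intervals.
import subprocess
import time

samples = []  # (timestamp, gpu_index, utilization_percent)
for _ in range(2000):  # ~100 s of samples
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,utilization.gpu",
        "--format=csv,noheader,nounits",
    ]).decode()
    now = time.time()
    for line in out.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        samples.append((now, int(index), int(util)))
    time.sleep(0.05)
# `samples` can then be plotted as utilization vs. time, one curve per GPU.
```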

@scarrazza (Member) commented:

I suppose the joblib configuration is the same for both backends, right?
If that is the case, then maybe there is some extra CUDA sync which is blocking the operations in qibojit.

@stavros11 (Member, Author) commented:

> I suppose the joblib configuration is the same for both backends, right?

Yes, the multigpu circuit is defined in qibo and is the same for both backends. This is the only place where joblib is used.
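
For context, the dispatch follows the usual joblib pattern, roughly as sketched below; this is only an illustration of that pattern, not the actual qibo source, and the thread-based backend is an assumption:

```python
# Illustrative joblib pattern (not qibo's actual code): dispatch one state piece
# per device; threads are used so that the underlying GPU calls can overlap.
from joblib import Parallel, delayed

devices = ["/GPU:0", "/GPU:1"]
pieces = [[0.0, 1.0], [2.0, 3.0]]  # stand-ins for the per-device state pieces

def run_on_device(device, piece):
    # placeholder for applying the gate queue of `piece` on `device`
    return [2.0 * x for x in piece]

pieces = Parallel(n_jobs=len(devices), prefer="threads")(
    delayed(run_on_device)(dev, piece) for dev, piece in zip(devices, pieces)
)
```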

> If that is the case, then maybe there is some extra CUDA sync which is blocking the operations in qibojit.

If you mean the cp.cuda.stream.get_current_stream().synchronize() that we do in cupy backends, I tried removing it from everywhere (all qibo and qibojit tests still pass without the sync), but the multigpu situation remains the same. Also the OOM issue in the second QFT run remains.
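
For reference, the synchronization in question is the standard CuPy stream sync; a minimal illustration (assuming CuPy with a visible GPU):

```python
# Minimal illustration of the stream sync being discussed.
import cupy as cp

x = cp.arange(2**20, dtype=cp.float64)
y = x * 2  # kernel is launched asynchronously on the current stream
cp.cuda.stream.get_current_stream().synchronize()  # block until the kernel finishes
```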

@stavros11 (Member, Author) commented Jan 16, 2022

I explored the memory issue a bit more and found the following:

  1. It also happens with smaller qubit numbers; in that case there is no OOM, but additional GPU memory is occupied. For example, for 28 qubits distributed over two physical GPUs, the dry run occupies 2 GB on each GPU, while the second run occupies 4 GB on the first GPU and 2 GB on the second. It seems that the state is not cleaned properly on the first GPU after the dry run.
  2. If I do three or more repetitions (dry run + two more reps) for 28 qubits, the memory occupation remains 4 GB + 2 GB during all reps. So the non-cleaning issue only happens in the dry run; additional reps do not increase memory.
  3. The issue appears for the QFT circuit, and only with the QFT definition used in the current repo. For example, if we do

```python
from qibo import models

c = models.QFT(31, accelerators={"/GPU:0": 1, "/GPU:1": 1})
final_state = c().numpy()  # dry run
final_state = c().numpy()  # second run
```

the OOM does not appear. More explicitly, if we define the QFT as:

```python
import math
from qibo import gates
from qibo.models import Circuit

nqubits = 31
circuit = Circuit(nqubits)  # accelerators can be passed here for the distributed case
for i1 in range(nqubits):
    circuit.add(gates.H(i1))
    for i2 in range(i1 + 1, nqubits):
        theta = math.pi / 2 ** (i2 - i1)
        circuit.add(gates.CU1(i2, i1, theta))
for i in range(nqubits // 2):
    circuit.add(gates.SWAP(i, nqubits - i - 1))
```

then the OOM / double-memory issue appears, while if we define the circuit using the _DistributedQFT method from qibo the OOM does not appear. Note that the two circuits are equivalent; only some commuting gates are reordered, and I checked that the final states of both are the same, even with a random initial state. I am not sure whether the issue appears with other circuits too.
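
For what it is worth, here is a rough sketch of how the per-device occupation can be inspected from Python between runs, assuming the qibojit cupy backend; the 2 GB / 4 GB figures above were read from nvidia-smi, and the pool statistics only cover CuPy's own allocations:

```python
# Rough sketch: report CuPy memory-pool usage on each device between runs.
import cupy as cp

def report_pool(label, device_ids=(0, 1)):
    for dev_id in device_ids:
        with cp.cuda.Device(dev_id):  # pool statistics refer to the current device
            pool = cp.get_default_memory_pool()
            print(f"{label}: GPU:{dev_id} "
                  f"used={pool.used_bytes() / 2**30:.2f} GB, "
                  f"held={pool.total_bytes() / 2**30:.2f} GB")

# Example usage around the two runs (circuit `c` defined as above):
# final_state = c().numpy(); report_pool("after dry run")
# final_state = c().numpy(); report_pool("after second run")
```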

@scarrazza (Member) commented:

Interesting, so the state vector is not being cleaned between runs.
What happens if you force deletion of the state vector between the dry run and the execution, using the code from this repo?

@stavros11 (Member, Author) commented:

> Interesting, so the state vector is not being cleaned between runs. What happens if you force deletion of the state vector between the dry run and the execution, using the code from this repo?

All the issues I wrote above remain the same regardless of whether I delete the result (the execution output) after each run. I would assume there is a bug and that, in the second circuit which creates the problem, a reference to the state remains somewhere and that is why it is not cleaned properly; but this does not explain why the problem does not appear with qibotf. The problem also remains even if I delete both the result and the whole circuit after each run.

@scarrazza (Member) commented:

Did you try with cp._default_memory_pool.free_all_blocks()?

@stavros11 (Member, Author) commented:

> Did you try with cp._default_memory_pool.free_all_blocks()?

Yes, I also tried that after deleting the objects, and it does not make a difference.
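
For completeness, the cleanup that was attempted between runs looks roughly like the sketch below (using the public cp.get_default_memory_pool() accessor and assuming two physical GPUs); it did not change the behaviour described above:

```python
# Sketch of the attempted cleanup between runs.
import cupy as cp

# del final_state  # drop the Python reference to the previous run's result first
for dev_id in (0, 1):  # the two physical GPUs used above
    with cp.cuda.Device(dev_id):
        # release the cached blocks of this device's pool back to the driver
        cp.get_default_memory_pool().free_all_blocks()
```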

@stavros11 changed the title from "[WIP] Add multi-GPU option for qibo library" to "Add multi-GPU option for qibo library" on Feb 8, 2022

@stavros11 (Member, Author) commented Feb 8, 2022

Here are some benchmarks with the latest version of qibojit, after merging the memory duplication fix.

qft

| nqubits (accelerators) | qibojit dry run (sec) | qibojit simulation mean (sec) | qibotf dry run (sec) | qibotf simulation mean (sec) |
|---|---|---|---|---|
| 31 (2/GPU:3) | 75.1272 | 68.4773 | 84.8177 | 85.2387 |
| 31 (1/GPU:2+1/GPU:3) | 56.0591 | 46.028 | 56.27 | 55.7507 |
| 32 (4/GPU:3) | 199.658 | 190.287 | 220.224 | 220.174 |
| 32 (2/GPU:2+2/GPU:3) | 144.643 | 129.706 | 138.685 | 141.355 |
| 32 (1/GPU:0+1/GPU:1+1/GPU:2+1/GPU:3) | 125.408 | 98.043 | 99.0873 | 96.8065 |

variational

| nqubits (accelerators) | qibojit dry run (sec) | qibojit simulation mean (sec) | qibotf dry run (sec) | qibotf simulation mean (sec) |
|---|---|---|---|---|
| 31 (2/GPU:3) | 61.8927 | 56.3496 | 75.4588 | 76.1184 |
| 31 (1/GPU:2+1/GPU:3) | 48.8197 | 44.9677 | 51.4959 | 51.3772 |
| 32 (4/GPU:3) | 177.81 | 171.728 | 227.636 | 230.986 |
| 32 (2/GPU:2+2/GPU:3) | 140.191 | 134.599 | 157.079 | 159.028 |
| 32 (1/GPU:0+1/GPU:1+1/GPU:2+1/GPU:3) | 125.215 | 90.3429 | 120.335 | 118.85 |

supremacy

| nqubits (accelerators) | qibojit dry run (sec) | qibojit simulation mean (sec) | qibotf dry run (sec) | qibotf simulation mean (sec) |
|---|---|---|---|---|
| 31 (2/GPU:3) | 64.1213 | 57.8321 | 76.3821 | 76.6204 |
| 31 (1/GPU:2+1/GPU:3) | 49.0247 | 38.3491 | 51.978 | 51.9739 |
| 32 (4/GPU:3) | 179.127 | 173.83 | 232.332 | 233.252 |
| 32 (2/GPU:2+2/GPU:3) | 131.618 | 120.149 | 157.172 | 158.845 |
| 32 (1/GPU:0+1/GPU:1+1/GPU:2+1/GPU:3) | 120.386 | 92.9109 | 118.797 | 120.125 |

Note that some other jobs are running in parallel on the machine, so there may be some noise and a fair comparison with the tables above is not possible. However, I ran qibojit and qibotf back to back, so at least the two can be compared with each other.

There appears to be an issue with the 4-GPU run of qibojit; I will rerun it to confirm whether it is code related or just something temporary with the machine. Other than that, qibojit times, even for the dry run, are competitive with qibotf.

@scarrazza (Member) commented:

Thanks @stavros11, performance is quite good, despite the unsynchronized initial step.

@stavros11 (Member, Author) commented:

@scarrazza, I updated the post above with the latest results for three different circuits. I believe qibojit performance is acceptable compared to qibotf, even for the dry run. Perhaps it could be improved with a better multi-GPU approach, but it may be interesting to compare with cuQuantum first, once it is released.

@stavros11 marked this pull request as ready for review on February 9, 2022 at 12:19
@scarrazza (Member) commented:

Looks good to me, can we merge?

@stavros11 (Member, Author) commented:

> Looks good to me, can we merge?

Yes, this is okay from my side.

@scarrazza merged commit ace6522 into libraries on Feb 9, 2022
@stavros11 deleted the multigpu branch on March 15, 2022