Add multi-GPU option for qibo library #28

Merged
merged 9 commits into libraries from multigpu on Feb 9, 2022

Conversation

@stavros11 (Member) commented Jan 6, 2022

@mlazzarin I realized that I had an old implementation of multi-GPU benchmarks for qibo in this repo, so I updated it using the latest `libraries` branch to avoid duplicating work. The multi-GPU configuration can be passed via `--library-options` with the existing benchmark scripts, for example:

```
python compare.py --library qibo --nqubits 31 --library-options accelerators=1/GPU:0+1/GPU:1
```

Let me know if you agree. Note that you can "reuse" a single GPU by passing `accelerators=2/GPU:0` (reuse factors larger than 2 also work, but they should be powers of 2).
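For clarity, the accelerators string corresponds to the dictionary that qibo's distributed circuits expect (see the `accelerators={"/GPU:0": 1, "/GPU:1": 1}` example further down). A minimal sketch of that mapping; the `parse_accelerators` helper is only illustrative, not the actual parsing code in the benchmark scripts:

```python
# Illustrative only: how an option string such as "1/GPU:0+1/GPU:1" corresponds
# to the accelerators dictionary used by qibo's distributed circuits.
def parse_accelerators(option):
    accelerators = {}
    for part in option.split("+"):          # e.g. ["1/GPU:0", "1/GPU:1"]
        count, device = part.split("/", 1)  # e.g. ("1", "GPU:0")
        accelerators["/" + device] = int(count)
    return accelerators

print(parse_accelerators("1/GPU:0+1/GPU:1"))  # {'/GPU:0': 1, '/GPU:1': 1}
print(parse_accelerators("2/GPU:0"))          # {'/GPU:0': 2}
```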

I ran the above for a few configurations and it executes for both qibojit and qibotf. I am not sure the results make sense, though. I suspect that parallelization across multiple GPUs is broken for qibojit: with qibotf, nvidia-smi shows 100% utilization on all devices simultaneously, while with qibojit the devices seem to run sequentially. I will investigate this further. We also need to add tests here that check the final state for multi-GPU configurations.

@stavros11 (Member, Author) commented Jan 10, 2022

Here are some results on the DGX in double precision:

QFT

| nqubits (accelerators) | qibojit dry run (sec) | qibojit simulation (sec) | qibotf dry run (sec) | qibotf simulation (sec) |
|---|---|---|---|---|
| 31 (2/GPU:0) | 75.1929 | 68.7758 | 82.987 | 83.3789 |
| 31 (1/GPU:0 + 1/GPU:1) | 67.636 | OOM | 55.9763 | 55.613 |
| 32 (4/GPU:0) | 195.4 | 188.271 | 219.633 | 220.952 |
| 32 (2/GPU:0 + 2/GPU:1) | 155.852 | 137.265 | 137.71 | 138.196 |
| 32 (1/GPU:0 + 1/GPU:1 + 1/GPU:2 + 1/GPU:3) | 135.925 | OOM | 96.0887 | 94.1834 |

Variational

| nqubits (accelerators) | qibojit dry run (sec) | qibojit simulation (sec) | qibotf dry run (sec) | qibotf simulation (sec) |
|---|---|---|---|---|
| 31 (2/GPU:0) | 61.958 | 55.7334 | 74.2878 | 75.3073 |
| 31 (1/GPU:0 + 1/GPU:1) | 59.6055 | 38.3918 | 50.0096 | 50.3635 |
| 32 (4/GPU:0) | 167.434 | 160.963 | 217.574 | 217.58 |
| 32 (2/GPU:0 + 2/GPU:1) | 127.991 | 109.543 | 146.671 | 148.087 |
| 32 (1/GPU:0 + 1/GPU:1 + 1/GPU:2 + 1/GPU:3) | 121.044 | 75.4436 | 111.084 | 110.45 |

The issues to be resolved from the qibo side are the following:

  • Why the second QFT run goes out of memory when many devices are used, while the same run works with qibotf. I suspect that some objects are not deleted properly, but it is strange that the same issue does not appear for the variational circuit.
  • It seems that some kind of compilation happens during the qibojit dry run, since in all cases it is much slower than the simulation run, in contrast to qibotf where both runs take similar time. I also believe that the joblib parallelization does not work very well with qibojit.

Despite these issues, it seems that qibojit generally gets better simulation times than qibotf, most likely due to faster CPU-GPU communication.

@stavros11 (Member, Author) commented:

To demonstrate the second issue (the parallelization), here are some plots showing the GPU utilization sampled from nvidia-smi every 0.05 sec. For qibotf the two GPUs work simultaneously, while for qibojit most operations appear to be applied sequentially.

qibojit - QFT - 31 qubits: [GPU utilization plot]

qibotf - QFT - 31 qubits: [GPU utilization plot]

qibojit - Variational - 31 qubits: [GPU utilization plot]

qibotf - Variational - 31 qubits: [GPU utilization plot]
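
For reference, the utilization traces can be collected with something along these lines (a sketch of the approach, not the exact script used for these plots): poll nvidia-smi roughly every 0.05 s and record one sample per GPU.

```python
# Sketch: sample per-GPU utilization from nvidia-smi at ~0.05 s intervals.
import subprocess
import time

samples = []  # (timestamp, gpu_index, utilization_percent)
for _ in range(2000):  # ~100 s of samples
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,utilization.gpu",
        "--format=csv,noheader,nounits",
    ]).decode()
    now = time.time()
    for line in out.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        samples.append((now, int(index), int(util)))
    time.sleep(0.05)
# `samples` can then be plotted as utilization vs. time, one curve per GPU.
```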

@scarrazza (Member) commented:

I suppose the joblib configuration is the same for both backends, right?
If that is the case, then maybe there is some extra CUDA sync which is blocking the operations in qibojit.

@stavros11 (Member, Author) commented:

> I suppose the joblib configuration is the same for both backends, right?

Yes, the multigpu circuit is defined in qibo and is the same for both backends. This is the only place where joblib is used.
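
For context, the dispatch follows the usual joblib pattern, roughly as sketched below; this is only an illustration of that pattern, not the actual qibo source, and the thread-based backend is an assumption:

```python
# Illustrative joblib pattern (not qibo's actual code): dispatch one state piece
# per device; threads are used so that the underlying GPU calls can overlap.
from joblib import Parallel, delayed

devices = ["/GPU:0", "/GPU:1"]
pieces = [[0.0, 1.0], [2.0, 3.0]]  # stand-ins for the per-device state pieces

def run_on_device(device, piece):
    # placeholder for applying the gate queue of `piece` on `device`
    return [2.0 * x for x in piece]

pieces = Parallel(n_jobs=len(devices), prefer="threads")(
    delayed(run_on_device)(dev, piece) for dev, piece in zip(devices, pieces)
)
```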

> If that is the case, then maybe there is some extra CUDA sync which is blocking the operations in qibojit.

If you mean the cp.cuda.stream.get_current_stream().synchronize() that we do in cupy backends, I tried removing it from everywhere (all qibo and qibojit tests still pass without the sync), but the multigpu situation remains the same. Also the OOM issue in the second QFT run remains.
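
For reference, the synchronization in question is the standard CuPy stream sync; a minimal illustration (assuming CuPy with a visible GPU):

```python
# Minimal illustration of the stream sync being discussed.
import cupy as cp

x = cp.arange(2**20, dtype=cp.float64)
y = x * 2  # kernel is launched asynchronously on the current stream
cp.cuda.stream.get_current_stream().synchronize()  # block until the kernel finishes
```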

@stavros11 (Member, Author) commented Jan 16, 2022

I explored the memory issue a bit more and found the following:

  1. It also happens with smaller qubit numbers; in that case there is no OOM, but additional GPU memory is occupied. For example, for 28 qubits distributed over two physical GPUs, the dry run occupies 2 GB on each GPU, while the second run occupies 4 GB on the first GPU and 2 GB on the second. It seems that the state is not cleaned properly on the first GPU after the dry run.
  2. If I do three or more repetitions (dry run + two more reps) for 28 qubits, the memory occupation remains 4 GB + 2 GB during all reps. So the non-cleaning issue only happens in the dry run; additional reps do not increase memory.
  3. The issue appears for the QFT circuit, and only with the QFT definition used in the current repo. For example, if we do

```python
from qibo import models

c = models.QFT(31, accelerators={"/GPU:0": 1, "/GPU:1": 1})
final_state = c().numpy()  # dry run
final_state = c().numpy()  # second run
```

the OOM does not appear. More explicitly, if we define the QFT as:

```python
import math
from qibo import gates
from qibo.models import Circuit

nqubits = 31
circuit = Circuit(nqubits)  # accelerators can be passed here for the distributed case
for i1 in range(nqubits):
    circuit.add(gates.H(i1))
    for i2 in range(i1 + 1, nqubits):
        theta = math.pi / 2 ** (i2 - i1)
        circuit.add(gates.CU1(i2, i1, theta))
for i in range(nqubits // 2):
    circuit.add(gates.SWAP(i, nqubits - i - 1))
```

then the OOM / double-memory issue appears, while if we define the circuit using the _DistributedQFT method from qibo the OOM does not appear. Note that the two circuits are equivalent; only some commuting gates are reordered, and I checked that the final states of both are the same, even with a random initial state. I am not sure whether the issue appears with other circuits too.
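
For what it is worth, here is a rough sketch of how the per-device occupation can be inspected from Python between runs, assuming the qibojit cupy backend; the 2 GB / 4 GB figures above were read from nvidia-smi, and the pool statistics only cover CuPy's own allocations:

```python
# Rough sketch: report CuPy memory-pool usage on each device between runs.
import cupy as cp

def report_pool(label, device_ids=(0, 1)):
    for dev_id in device_ids:
        with cp.cuda.Device(dev_id):  # pool statistics refer to the current device
            pool = cp.get_default_memory_pool()
            print(f"{label}: GPU:{dev_id} "
                  f"used={pool.used_bytes() / 2**30:.2f} GB, "
                  f"held={pool.total_bytes() / 2**30:.2f} GB")

# Example usage around the two runs (circuit `c` defined as above):
# final_state = c().numpy(); report_pool("after dry run")
# final_state = c().numpy(); report_pool("after second run")
```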

@scarrazza (Member) commented:

Interesting, so the state vector is not being cleaned between runs.
What happens if you force deletion of the state vector between the dry run and the execution, using the code from this repo?

@stavros11 (Member, Author) commented:

> Interesting, so the state vector is not being cleaned between runs. What happens if you force deletion of the state vector between the dry run and the execution, using the code from this repo?

All the issues I wrote above remain the same regardless of whether I delete the result (the execution output) after each run. I would assume there is a bug and that, in the second circuit which creates the problem, a reference to the state remains somewhere and that is why it is not cleaned properly; but this does not explain why the problem does not appear with qibotf. The problem also remains even if I delete both the result and the whole circuit after each run.

@scarrazza (Member) commented:

Did you try with cp._default_memory_pool.free_all_blocks()?

@stavros11 (Member, Author) commented:

> Did you try with cp._default_memory_pool.free_all_blocks()?

Yes, I also tried that after deleting the objects, and it does not make a difference.
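
For completeness, the cleanup that was attempted between runs looks roughly like the sketch below (using the public cp.get_default_memory_pool() accessor and assuming two physical GPUs); it did not change the behaviour described above:

```python
# Sketch of the attempted cleanup between runs.
import cupy as cp

# del final_state  # drop the Python reference to the previous run's result first
for dev_id in (0, 1):  # the two physical GPUs used above
    with cp.cuda.Device(dev_id):
        # release the cached blocks of this device's pool back to the driver
        cp.get_default_memory_pool().free_all_blocks()
```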

@stavros11 changed the title from "[WIP] Add multi-GPU option for qibo library" to "Add multi-GPU option for qibo library" on Feb 8, 2022

@stavros11 (Member, Author) commented Feb 8, 2022

Here are some benchmarks with the latest version of qibojit, after merging the memory duplication fix.

qft

| nqubits (accelerators) | qibojit dry run (sec) | qibojit simulation mean (sec) | qibotf dry run (sec) | qibotf simulation mean (sec) |
|---|---|---|---|---|
| 31 (2/GPU:3) | 75.1272 | 68.4773 | 84.8177 | 85.2387 |
| 31 (1/GPU:2+1/GPU:3) | 56.0591 | 46.028 | 56.27 | 55.7507 |
| 32 (4/GPU:3) | 199.658 | 190.287 | 220.224 | 220.174 |
| 32 (2/GPU:2+2/GPU:3) | 144.643 | 129.706 | 138.685 | 141.355 |
| 32 (1/GPU:0+1/GPU:1+1/GPU:2+1/GPU:3) | 125.408 | 98.043 | 99.0873 | 96.8065 |

variational

| nqubits (accelerators) | qibojit dry run (sec) | qibojit simulation mean (sec) | qibotf dry run (sec) | qibotf simulation mean (sec) |
|---|---|---|---|---|
| 31 (2/GPU:3) | 61.8927 | 56.3496 | 75.4588 | 76.1184 |
| 31 (1/GPU:2+1/GPU:3) | 48.8197 | 44.9677 | 51.4959 | 51.3772 |
| 32 (4/GPU:3) | 177.81 | 171.728 | 227.636 | 230.986 |
| 32 (2/GPU:2+2/GPU:3) | 140.191 | 134.599 | 157.079 | 159.028 |
| 32 (1/GPU:0+1/GPU:1+1/GPU:2+1/GPU:3) | 125.215 | 90.3429 | 120.335 | 118.85 |

supremacy

| nqubits (accelerators) | qibojit dry run (sec) | qibojit simulation mean (sec) | qibotf dry run (sec) | qibotf simulation mean (sec) |
|---|---|---|---|---|
| 31 (2/GPU:3) | 64.1213 | 57.8321 | 76.3821 | 76.6204 |
| 31 (1/GPU:2+1/GPU:3) | 49.0247 | 38.3491 | 51.978 | 51.9739 |
| 32 (4/GPU:3) | 179.127 | 173.83 | 232.332 | 233.252 |
| 32 (2/GPU:2+2/GPU:3) | 131.618 | 120.149 | 157.172 | 158.845 |
| 32 (1/GPU:0+1/GPU:1+1/GPU:2+1/GPU:3) | 120.386 | 92.9109 | 118.797 | 120.125 |

Note that some other jobs are running in parallel on the machine, so there may be some noise and a fair comparison with the tables above is not possible. However, I ran qibojit and qibotf back to back, so at least the two can be compared with each other.

There appears to be an issue with the 4-GPU run of qibojit; I will rerun it to confirm whether it is code related or just something temporary with the machine. Other than that, qibojit times, even for the dry run, are competitive with qibotf.

@scarrazza (Member) commented:

Thanks @stavros11, performance is quite good, despite the unsynchronized initial step.

@stavros11 (Member, Author) commented:

@scarrazza, I updated the post above with the latest results for three different circuits. I believe qibojit performance is acceptable compared to qibotf, even for the dry run. Perhaps it could be improved with a better multi-GPU approach, but it may be interesting to compare with cuQuantum first, once it is released.

@stavros11 marked this pull request as ready for review on February 9, 2022 at 12:19
@scarrazza (Member) commented:

Looks good to me, can we merge?

@stavros11 (Member, Author) commented:

> Looks good to me, can we merge?

Yes, this is okay from my side.

@scarrazza merged commit ace6522 into libraries on Feb 9, 2022
@stavros11 deleted the multigpu branch on March 15, 2022