Add multi-GPU option for qibo library #28
Conversation
Here are some results on the DGX in double precision for the QFT and variational circuits (benchmark tables not reproduced in this excerpt).
The issues to be resolved from the qibo side are the following (the list is not captured in this excerpt):
Despite these issues, it seems that qibojit generally gets better simulation times than qibotf, most likely due to faster CPU-GPU communication.
In order to demonstrate the second issue with parallelization, here are some plots that show the GPU utilization as captured from `nvidia-smi` (plots not reproduced in this excerpt).
I suppose the joblib configuration is the same for both backends, right?
Yes, the multigpu circuit is defined in qibo and is the same for both backends. This is the only place where joblib is used.
If you mean the
I explored the memory issue a bit more and found the following:
```python
from qibo import models

c = models.QFT(31, accelerators={"/GPU:0": 1, "/GPU:1": 1})
final_state = c().numpy()  # dry run
final_state = c().numpy()  # second run: the OOM does not appear
```

More explicitly, if we define the QFT manually as:

```python
import math
from qibo import gates
from qibo.models import Circuit

circuit = Circuit(nqubits)
for i1 in range(nqubits):
    circuit.add(gates.H(i1))
    for i2 in range(i1 + 1, nqubits):
        theta = math.pi / 2 ** (i2 - i1)
        circuit.add(gates.CU1(i2, i1, theta))
for i in range(nqubits // 2):
    circuit.add(gates.SWAP(i, nqubits - i - 1))
```

then the OOM / double memory issue appears, while if we define it using the `models.QFT` shortcut it does not.
Interesting, so the state vector is not being cleaned between runs.
All the issues I wrote above are the same regardless of whether I delete the result (the execution output) after each run. I would assume there is a bug in the second circuit, which creates the problem: a reference to the state remains somewhere, and that is why it is not properly cleaned. However, this does not explain why the problem does not appear with qibotf. The problem also remains even if I delete both the result and the whole circuit after each run.
Did you try with
Yes, I also tried that one after object deletion and it does not make a difference.
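To illustrate the hypothesis above (a lingering internal reference keeping the state alive even after the user deletes the result), here is a minimal, self-contained Python sketch. The `Circuit` and `State` classes here are stand-ins invented for this illustration only, not qibo code:

```python
import gc
import weakref


class State:
    """Stands in for a large state vector."""
    def __init__(self, n):
        self.data = [0.0] * n


class Circuit:
    """Hypothetical circuit that keeps a hidden reference to its output."""
    def __init__(self):
        self._cached_state = None  # lingering internal reference

    def execute(self):
        state = State(1000)
        self._cached_state = state  # this reference prevents cleanup
        return state


circuit = Circuit()
result = circuit.execute()
monitor = weakref.ref(result)  # watch whether the state gets freed

del result       # the user deletes the result...
gc.collect()
print(monitor() is not None)  # True: the circuit still holds the state

circuit._cached_state = None  # only dropping the hidden reference frees it
gc.collect()
print(monitor() is None)      # True: the state can now be collected
```

This matches the observed behavior: deleting the result alone does not release the memory, because the hidden reference keeps the object reachable.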
Here are some benchmarks with the latest version of qibojit, after merging the memory duplication fix, for the qft, variational, and supremacy circuits (tables not reproduced in this excerpt).
Note that some other things are running in parallel on the machine, so there may be some noise and we cannot make a fair comparison with the tables above. I ran qibojit and qibotf sequentially, though, so at least the two can be compared with each other. There appears to be an issue with the 4x GPU run of qibojit; I will rerun it to confirm whether it is code related or just something temporary with the machine. Other than that, qibojit times, even for the dry run, are competitive with qibotf.
Thanks @stavros11, performance is quite good, despite the unsynchronized initial step.
@scarrazza, I updated the post above with the latest results for three different circuits. I believe qibojit performance is acceptable compared to qibotf, even for the dry run. Perhaps it is possible to improve with a better multigpu approach, but it may be interesting to compare with cuquantum first, once it is released.
Looks good to me, can we merge? |
Yes, this is okay from my side. |
@mlazzarin I realized that I had an old implementation of multigpu benchmarks for qibo in this repo, so I updated it using the latest `libraries` branch to avoid doing double work. The multigpu configuration can be passed via `--library-options` using the existing benchmark scripts (the example command is not captured in this excerpt). Let me know if you agree. Note that you can "reuse" a single GPU by passing `accelerators=2/GPU:0` (this works for more than two repetitions too, but the number should be a power of 2).

I checked running the above for a few configurations and it seems to execute for both qibojit and qibotf. I am not sure if the results make sense, though. I suspect that parallelization is broken for qibojit when multiple GPUs are used: with qibotf I see 100% utilization simultaneously on all devices in `nvidia-smi`, while qibojit seems to run the devices sequentially. I will investigate this further. We also need to add tests here that check the final state for multi-GPU configurations.
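For reference, here is how the single-GPU "reuse" option presumably maps onto the `accelerators` dictionary used earlier in this thread. This is an illustrative sketch only: it requires qibo installed with GPU support, and the correspondence between the CLI string `accelerators=2/GPU:0` and the dictionary form is our assumption:

```python
from qibo import models

# Distribute a 31-qubit QFT across two physical GPUs
# (same configuration as shown earlier in this thread):
c = models.QFT(31, accelerators={"/GPU:0": 1, "/GPU:1": 1})

# "Reuse" a single GPU twice; assumed to correspond to the
# CLI option accelerators=2/GPU:0 (powers of 2 only):
c = models.QFT(31, accelerators={"/GPU:0": 2})

final_state = c().numpy()
```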