Benchmark external libraries #11
Conversation
Here are some numbers using the `compare.py` script. @scarrazza, you can confirm Qiskit's performance by running something simple, e.g. a QFT for 30 qubits.

[Plots: CPU dry run times and CPU simulation times for the qft, variational, bv, supremacy, bc, qv and hs circuits.]
EDIT: Added qibotf times.
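As a sanity check independent of any simulator, the QFT acting on a statevector is mathematically just a DFT, so a QFT benchmark result can be cross-checked against numpy's FFT. A small sketch (qubit bit-ordering conventions differ between libraries and are ignored here):

```python
import numpy as np

def qft_matrix(nqubits):
    """Dense QFT matrix: F[j, k] = exp(2*pi*i*j*k / N) / sqrt(N)."""
    n = 2 ** nqubits
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(2j * np.pi * j * k / n) / np.sqrt(n)

rng = np.random.default_rng(0)
state = rng.normal(size=32) + 1j * rng.normal(size=32)
state /= np.linalg.norm(state)

# QFT_k = (1/sqrt(N)) sum_j x_j exp(+2*pi*i*j*k/N) = sqrt(N) * ifft(x)
via_matrix = qft_matrix(5) @ state
via_fft = np.sqrt(32) * np.fft.ifft(state)
assert np.allclose(via_matrix, via_fft)
```

This only validates correctness for small qubit counts, of course; the benchmark timings themselves still need the real simulators.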
Thanks for these numbers. Do you have similar numbers for qibotf? For some circuits, like hs and qv, the difference is too large; are you sure that Qiskit is using the CPU instead of the GPU? What is the average total program execution time? Maybe Qiskit is precomputing objects during the circuit definition. Is the final state vector the same for all backends?
Btw, how many threads is Qiskit using? It is possible that this value differs from our default; e.g. limiting the number of threads might have an impact.
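One quick way to compare is to log the thread-related knobs each benchmark run sees. A sketch, assuming the common conventions: OpenMP-based backends typically read `OMP_NUM_THREADS`, while Numba-based ones (like qibojit) read `NUMBA_NUM_THREADS`; whether a given backend actually honors these is something to verify per library.

```python
import os

def thread_report():
    """Collect the environment knobs that typically control simulator threading."""
    return {
        "cpu_count": os.cpu_count(),
        # OpenMP-based backends usually read this variable.
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS", "<unset>"),
        # Numba-based backends read this one.
        "NUMBA_NUM_THREADS": os.environ.get("NUMBA_NUM_THREADS", "<unset>"),
    }

print(thread_report())
```

Printing this at the start of every benchmark run would make thread-count mismatches visible in the logs.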
Btw2, is Qiskit really using double precision? If I set Qibo to single, I get numbers which are quite close to Qiskit's...
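Precision plausibly accounts for a factor of about 2x here: statevector simulation is usually memory-bandwidth bound, and a single-precision statevector is half the size of a double-precision one. A quick numpy illustration of the footprint:

```python
import numpy as np

def statevector_bytes(nqubits, dtype):
    """Memory footprint of a dense 2**nqubits statevector."""
    return (2 ** nqubits) * np.dtype(dtype).itemsize

single = statevector_bytes(30, np.complex64)
double = statevector_bytes(30, np.complex128)
print(f"30 qubits, complex64:  {single / 2**30:.0f} GiB")   # 8 GiB
print(f"30 qubits, complex128: {double / 2**30:.0f} GiB")   # 16 GiB
assert double == 2 * single
```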
Thanks for the response and the questions. Some quick answers:
I added the possibility to use qibotf in the same script in the latest push and will update the above tables once I have the numbers. I don't expect much difference from qibojit; it will certainly not be much closer to Qiskit.
I haven't checked htop explicitly during all benchmarks, but all the Qiskit runs I checked used the CPU. I think Qiskit only uses the GPU when the appropriate simulator is used. I also used
The benchmark script logs the circuit creation time too, which in this case corresponds to transforming the OpenQASM circuit to the library circuit. Here are the numbers from the above benchmarks:

[Plots: CPU circuit creation times for the qft, variational, bv, supremacy, bc, qv and hs circuits.]
Indeed Qiskit has slightly higher creation times in all cases but still wins when considering the sum of creation + execution.
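For reference, the kind of measurement discussed here can be sketched with a minimal timing harness; `from_qasm` and `execute` below are hypothetical adapter callables standing in for each library's loader and simulator, not the benchmark script's actual API:

```python
import time

def timed(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def benchmark(from_qasm, execute, qasm_str):
    """Time creation, first execution (dry run) and a repeat execution separately."""
    circuit, creation_time = timed(from_qasm, qasm_str)
    _, dry_run_time = timed(execute, circuit)      # includes compilation/caching costs
    _, simulation_time = timed(execute, circuit)   # steady-state timing
    return {"creation": creation_time, "dry_run": dry_run_time,
            "simulation": simulation_time,
            "total": creation_time + dry_run_time}
```

Reporting both the breakdown and the total avoids a library "winning" only because it moves work from execution into creation.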
This is exactly what is tested in the new
Qiskit and Qulacs use all available threads while Qibo uses half of them. This may cause some of the difference, but I don't think it explains all of it. In past Qibo benchmarks, using all threads made minimal difference in performance.
I am not sure exactly what happens during simulation but if I do
@stavros11 thanks for the comments. I have tested and indeed Qiskit is 2x faster when using single precision. Starting from the QFT, if I keep only the first layer of H gates, Qiskit is 1s faster than Qibo. At this point we should revisit each gate; if the single gates have similar performance, then I agree that some extra parallelization is performed by Qiskit.
In particular, if I yield just 1 Hadamard, the Qibo performance is better than Qiskit's; however, as soon as I include 5 Hadamards, one per qubit, the Qiskit performance is better, so this sounds like circuit fusion/block parallelization.
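The fusion hypothesis can be illustrated with plain numpy: applying H to each of 5 qubits one gate at a time gives the same state as applying the single fused 32x32 matrix H⊗H⊗H⊗H⊗H, but the fused version traverses the statevector once instead of five times. A toy sketch of the idea, not how Aer actually implements fusion:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

def apply_1q(state, gate, target, nqubits):
    """Apply a single-qubit gate by contracting one tensor axis of the statevector."""
    psi = state.reshape((2,) * nqubits)
    psi = np.tensordot(gate, psi, axes=([1], [target]))
    psi = np.moveaxis(psi, 0, target)
    return psi.reshape(-1)

nqubits = 5
state = np.zeros(2 ** nqubits, dtype=complex)
state[0] = 1.0

# One gate at a time: five passes over the statevector.
seq = state
for q in range(nqubits):
    seq = apply_1q(seq, H, q, nqubits)

# Fused: a single 32x32 matrix-vector product.
fused_gate = H
for _ in range(nqubits - 1):
    fused_gate = np.kron(fused_gate, H)
fused = fused_gate @ state

assert np.allclose(seq, fused)  # identical states, one memory pass vs five
```

Real fusion engines cap the fused block at a few qubits (a full n-qubit matrix would be exponentially large), but the memory-traffic argument is the same.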
Following their docs I think this latest version of qiskit:
Last comment about that, if I set
I have been running the benchmark using the same option and can confirm that performance is the same as Qibo's. Here are the results for all circuits:

[Plots: CPU dry run times and CPU simulation times for the qft, variational, bv, supremacy, bc, qv and hs circuits.]
So the results are pretty similar, with the exception of dry run times for small qubit numbers. I am not sure if this can be improved by disabling parallelization for nqubits < 14, as Qiskit does by default.
I agree that we should revisit gate fusion in Qibo, and if performance improves so much for the most common circuits, we could consider making it the default with some cut-off in the number of qubits. We should open an issue about that in Qibo.
By the way, I added the option to use Qiskit without fusion in the benchmark script (via I'm not yet sure if this is a bug with Qiskit or a problem in our code but will investigate it further (just noting it in case you try to run something in the meantime).
@stavros11 thank you very much for these numbers and confirmation. I agree concerning fusion and the possibility to set threads automatically, as you have posted in the issue. I will try the new GPU implementations tomorrow.
@stavros11 2 points:
Quick response before I take off for Abu Dhabi:
I was checking this thoroughly yesterday and, interestingly, the problem exists only on my local machine. I tried both the DGX and the qibo machine, and qiskit-gpu works well there. On my machine I get errors even when using simple Qiskit circuits, without all the benchmark code we have here. I'll provide a simple script later. I'm not sure if it is related to the CUDA version or if something is wrong in my configuration; I followed the same installation procedure everywhere (just pip install qiskit-aer-gpu). So I believe the code here is okay to try GPU benchmarks as it is. We just need to expand by adding QCGPU and HyQuas.
I haven’t checked how fusion affects the GPU yet.
@stavros11, tests are passing on my PC; however, if I print the
Does this happen for you?
Note that the tests that are uploaded on GitHub do not test the GPU backends. In order to test these you have to include "qiskit-gpu" and "qulacs-gpu" in the
Yes, I observe some strange behavior from qiskit-gpu on all machines. If I add "qiskit-gpu" to the tests, they fail on my machine but pass on the Qibo machine. However, when I print the state during the benchmark as in your example, I get wrong results on all machines. Also, the final state changes if I run the same script more than once, even though there is nothing random involved. Here is a simple script that reproduces these issues:

```python
import qiskit
from qiskit.providers.aer import StatevectorSimulator

def main(nqubits, nreps, gpu, transpile):
    for _ in range(nreps):
        circuit = qiskit.QuantumCircuit(nqubits)
        for i in range(nqubits):
            circuit.h(i)
        if gpu:
            simulator = StatevectorSimulator(method="statevector_gpu")
        else:
            simulator = StatevectorSimulator()
        if transpile:
            circuit = qiskit.transpile(circuit, simulator)
        print("nqubits:", nqubits)
        print("nreps:", nreps)
        print("gpu:", gpu)
        print("transpile:", transpile)
        result = simulator.run(circuit).result()
        print(result.get_statevector(circuit))
        print()
```

@scarrazza, if you try to run this with
@stavros11 I confirm all your points. I was monitoring the GPU usage on different systems while running pytest, and I realized that only on the qibomachine it doesn't seem to use any GPU during the tests, so maybe it is falling back to the CPU (I think Qiskit provides some get_device method to check whether the backend is using CPU or GPU). Did you try the qft using qiskit.*.library.QFT directly?
If I replace the circuit creation with
@stavros11 I just monitored the pytest performance on test_libraries for 5, 10, 15 and 26 qubits. Tests are failing for 15 and 26; for these I can see high GPU usage and low CPU usage, while for <= 10 qubits the CPU usage is very high and the GPU usage is low. So I assume they have some fallback mechanism which selects the appropriate hardware. As discussed today, let me suggest completing the other libraries listed in the first post and making a final decision afterwards.
@stavros11 concerning qiskit, I have opened this issue Qiskit/qiskit-aer#1319, and they have proposed a fix in this PR Qiskit/qiskit-aer#1325. So it is a qiskit bug.
@stavros11 I have installed the aer master locally and indeed the GPU problem is fixed. On the other hand, their performance is a factor of 2x slower than qibojit.
@scarrazza here are some plots using the circuits and libraries we have so far for CPU. It seems that creation time is the main bottleneck for some libraries and circuits; this is the time required to convert the circuit from Qasm to the library's format. For Qulacs I do this conversion manually, as I could not find a Qasm parser in their docs, while for Qiskit I am using

Other than that, I will try to run some single precision benchmarks with Qibo, Cirq, Qiskit and TFQ, because I could not find how to switch TFQ to double, and also some GPU benchmarks with Qibo, QCGPU, Qulacs and Qiskit (if their GPU simulator is fixed). Let me know what other configurations and plots would be interesting.
Cool, thanks for these interesting results. |
I think we should have a look at the dry run; I have the suspicion that our initialization overhead is not 100% due to *jit, but maybe due to the object allocations (gate matrices, etc...).
That is a good point and makes sense, because some elements such as gate matrices are allocated during the first execution (which is the dry run) and cached for subsequent runs. However, I tried executing the benchmark while recreating a new circuit object before every execution (dry run and simulation), and the difference between dry run and simulation remains. Here are some numbers:

[Tables: dry run vs simulation times for the qft and variational circuits.]
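The caching effect can be reproduced in isolation: if gate matrices are allocated lazily and memoized at module or class level, the first execution pays the allocation cost even when the circuit object itself is fresh, which matches the observation that recreating the circuit does not close the gap. A toy model of the suspected behavior, not Qibo's actual code:

```python
import numpy as np

class LazyGate:
    """Gate whose matrix is built on first use and cached at class level."""
    _cache = {}  # survives circuit re-creation, like a module-level cache

    def __init__(self, name, builder):
        self.name = name
        self.builder = builder

    @property
    def matrix(self):
        if self.name not in LazyGate._cache:
            LazyGate._cache[self.name] = self.builder()  # paid once, on the dry run
        return LazyGate._cache[self.name]

h_gate = LazyGate("H", lambda: np.array([[1, 1], [1, -1]]) / np.sqrt(2))
assert "H" not in LazyGate._cache        # nothing allocated at definition time
_ = h_gate.matrix                        # the dry run triggers the allocation
assert "H" in LazyGate._cache            # subsequent runs hit the cache
fresh = LazyGate("H", lambda: None)      # a brand-new "circuit"...
assert fresh.matrix is h_gate.matrix     # ...still reuses the cached matrix
```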
Thanks for checking; this sounds like some for-loop overhead. At some point, after completing the codes/libs for this exercise, one should go step by step, profile the function calls, and identify where we lose performance.
I am not sure if this helps, but I tried profiling the benchmark script using cProfile and I noticed that the difference between the logged dry run time and simulation time is similar to the cumulative time of numba's

[Tables: dry run, simulation and numba compile times for the qft, variational and supremacy circuits.]
Here dry run and simulation are logged by the benchmark script, while the numba compile is read from the cProfile output as the cumulative time of the
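For reference, the cumulative time of a specific function can be pulled out of a cProfile run programmatically via `pstats` rather than by reading the printed table. A minimal sketch of this kind of measurement, with a `time.sleep` standing in for the JIT compilation cost:

```python
import cProfile
import pstats
import time

def slow_compile():
    time.sleep(0.05)  # stand-in for numba's JIT compilation cost

def simulate():
    slow_compile()    # only the first ("dry") run pays this in the real scenario
    sum(i * i for i in range(10000))

profiler = cProfile.Profile()
profiler.enable()
simulate()
profiler.disable()

stats = pstats.Stats(profiler)
# stats.stats maps (file, line, funcname) -> (ncalls, ..., tottime, cumtime, callers)
for (filename, lineno, funcname), row in stats.stats.items():
    if funcname == "slow_compile":
        print(f"slow_compile cumulative time: {row[3]:.3f}s")
```

Extracting the number this way makes it easy to subtract the compile time from the dry run automatically when tabulating results.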
Thanks @stavros11, could you please rerun one of these examples by removing the
Scripts to generate paper plots
Add benchmark script and plots
Adds a template and script for benchmarking external quantum simulation libraries other than Qibo (fixes #10). We should cover at least the libraries included in the HyQuas benchmark paper. Here is a list of required libraries:
Python:
These benchmarks can be executed using the new `compare.py` script, and the library is selected using the `--library` flag. The supported libraries are defined under `benchmarks/libraries`, and the goal is to support all circuits included in the Qibo benchmark for all libraries. This works by defining every circuit using OpenQASM and then building each library's circuit from this. This is straightforward for libraries that have built-in Qasm loaders, such as Qiskit and Qibo, while for the rest (eg. Qulacs) I use the Qasm parser we have in Qibo, modified to add the gates from the corresponding library. All circuits we have here can be written in the Qasm format we support in Qibo, except perhaps QAOA, which contains some RZZ gates that we do not have built-in in Qibo.

Next steps for this PR:

(instead of `--library qibo` we should have `--library qibojit`, etc.)

Note: I noticed that Qibo's U2 and U3 gates follow a different parameter convention compared to Qiskit and other libraries (for example, compare our docs vs Qiskit's docs). This should not affect performance, which is what we mainly care about here, but it may confuse users who use these gates for other applications, as it changes results. The main issue is that, for example, parsing `u3(0.1,0.2,0.3) q[0];` from Qasm will create a different gate in Qibo than in Qiskit (and others). I guess Qiskit should be the reference for such conventions, given that Qasm is developed by IBM.
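The convention difference can be made concrete with numpy. The OpenQASM/Qiskit `u3(theta, phi, lam)` matrix is standard; for Qibo I am assuming (from my reading of its docs) that its U3 differs by an extra global phase `exp(-i(phi+lam)/2)`. The two matrices then agree on all measurement probabilities but produce different raw statevectors:

```python
import numpy as np

def u3_qiskit(theta, phi, lam):
    """OpenQASM / Qiskit u3 convention."""
    return np.array([
        [np.cos(theta / 2), -np.exp(1j * lam) * np.sin(theta / 2)],
        [np.exp(1j * phi) * np.sin(theta / 2),
         np.exp(1j * (phi + lam)) * np.cos(theta / 2)],
    ])

def u3_qibo(theta, phi, lam):
    """Assumed Qibo convention: the same matrix times a global phase."""
    return np.exp(-1j * (phi + lam) / 2) * u3_qiskit(theta, phi, lam)

a = u3_qiskit(0.1, 0.2, 0.3)
b = u3_qibo(0.1, 0.2, 0.3)
# Same physics: all measurement probabilities agree...
assert np.allclose(np.abs(a) ** 2, np.abs(b) ** 2)
# ...but the raw matrices differ by the global phase exp(-i*(0.2+0.3)/2).
assert not np.allclose(a, b)
assert np.allclose(b, np.exp(-1j * 0.25) * a)
```

This is why comparing final statevectors element-wise across libraries can flag "wrong" results that are physically equivalent; comparisons should be made up to a global phase (or on probabilities).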