
Add qsim, qsim-gpu and qsim-cuquantum #14

Merged: 11 commits merged into libraries on Dec 21, 2021

Conversation

@mlazzarin (Contributor) commented on Nov 15, 2021

In this PR I added qsim (CPU), qsim-gpu and qsim-cuquantum.
For qsim (CPU) I set the number of threads to multiprocessing.cpu_count().
For all of them, I set max_fused_gate_size to zero.
EDIT: For `qibojit`, I disabled the compilation during import.
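
For reference, a minimal sketch of how these settings can be passed to qsim through the qsimcirq wrapper. The exact option names (cpu_threads, max_fused_gate_size, use_gpu, gpu_mode) follow qsimcirq's QSimOptions interface and may differ between versions, so treat them as assumptions:

import multiprocessing

import qsimcirq

# Assumed qsimcirq interface: cpu_threads sets the number of simulation
# threads, max_fused_gate_size bounds gate fusion (0 as an attempt to
# disable it), use_gpu/gpu_mode switch between the CPU, CUDA and
# cuQuantum (cuStateVec) backends.
options = qsimcirq.QSimOptions(
    cpu_threads=multiprocessing.cpu_count(),
    max_fused_gate_size=0,
    use_gpu=False,  # True for qsim-gpu / qsim-cuquantum
    gpu_mode=0,     # 0: plain CUDA kernels, 1: cuStateVec (cuQuantum)
)
simulator = qsimcirq.QSimSimulator(qsim_options=options)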

I also performed some benchmarks (cupy 9.6.0, CUDA toolkit 11.5) on GPU.

  • total_dry_time: import + creation + dry run
  • total_simulation_time: import + creation + simulation time
[Plots: GPU scaling of dry run time, mean simulation time, total dry time and total simulation time vs. number of qubits for the qft, variational, supremacy, bv and qv circuits.]

Some comments:

  • It doesn't seem that cuQuantum provides a speed-up w.r.t. qsim's C++/CUDA implementation.
  • Apart from the compilation overhead, we are competitive with the C++/CUDA implementation, in particular on the qft circuit (maybe it's due to our approach to controlled gates?).
  • In these benchmarks (cupy 9.6.0, CUDA toolkit 11.5) our dry run overhead is ~3.2 s. This is much higher than in other benchmarks I performed. I will open a new issue to discuss it (EDIT: Dry run overhead is inconsistent between different environments, qibojit#44).
  • Qibojit crashes with 32 qubits, as you can see in the plots (EDIT: see Fix CupyBackend crash with 32 qubits, qibojit#43).

EDIT: I will also prepare some benchmarks on CPU.

@scarrazza (Member)

@mlazzarin thanks for these tests. The cuQuantum run is using a single GPU device, correct?

@mlazzarin (Contributor, Author)

The cuQuantum run is using a single GPU device, correct?

Yes, I'm using the machine with a single NVIDIA RTX A6000. By the way, I'm not sure if qsim supports multi-GPU.

@scarrazza (Member)

Ok, thanks, anyway quite good to see that we are strong XD.

@mlazzarin (Contributor, Author)

Here are the results for CPU. For qsim I'm using a number of threads equal to the number of logical cores, while for qibo I kept the default value, which is half of the logical cores. (I also tried with all logical cores and it's actually slower for small circuits.)
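
As a concrete sketch of the thread settings (qibo.set_threads is part of qibo's public API; the qsim thread count goes through the options shown in the first comment):

import multiprocessing

import qibo

n_logical = multiprocessing.cpu_count()

# qibo defaults to roughly half the logical cores (as noted above);
# this overrides it to use all of them, matching the qsim configuration.
qibo.set_threads(n_logical)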

[Plots: CPU scaling of dry run time, mean simulation time, total dry time and total simulation time vs. number of qubits for the qft, variational, supremacy, bv and qv circuits.]

Two comments:

  • qsim is usually faster than qibo with large circuits, except for the qft, while qibo seems competitive with smaller circuits.
  • I'm not 100% sure that I was able to deactivate gate fusion in qsim. I simply set the max_fused_gate_size parameter to 0, because I didn't find a flag to disable fusion completely.

@scarrazza (Member)

This really sounds like there is circuit fusion; maybe we should try to activate it from qibojit and see what happens.

@mlazzarin (Contributor, Author)

Ok, I'm on it.

@mlazzarin (Contributor, Author)

Here are the results for CPU with gate fusion up to two-qubit gates and using all threads.
Indeed, the situation is now different.
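
On the qibo side, "fusion up to two-qubit gates" corresponds to something like the sketch below (the max_qubits keyword of Circuit.fuse is an assumption here; on the qsim side the same limit is max_fused_gate_size=2):

from qibo import models

# Fuse gates into blocks acting on at most two qubits before execution;
# max_qubits=2 mirrors qsim's default max_fused_gate_size=2 (keyword
# name assumed).
circuit = models.QFT(26)
fused = circuit.fuse(max_qubits=2)
result = fused()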

[Plots (CPU, fusion up to two-qubit gates): dry run time, mean simulation time, total dry time and total simulation time vs. number of qubits for the qft, variational, supremacy, bv and qv circuits.]

I re-ran the GPU benchmarks with gate fusion up to two-qubit gates, and now qibojit seems a bit faster.

[Plots (GPU, fusion up to two-qubit gates): dry run time, mean simulation time, total dry time and total simulation time vs. number of qubits for the qft, variational, supremacy, bv and qv circuits.]

@scarrazza (Member)

Cool, however it would be great to understand if/how they are doing the gate fusion.

@mlazzarin (Contributor, Author)

Cool, however it would be great to understand if/how they are doing the gate fusion.

With qsim there is an option to set the maximum size of fused gates. In the last benchmarks that I posted I set that value to 2 (which is the default value). I've not found a specific flag to disable gate fusion, so in the other benchmarks I simply set that value to 0, but I don't know whether that actually disables fusion or not.
Concerning how they do fusion, their approach is described here: https://arxiv.org/abs/2111.02396

@scarrazza (Member)

Ok, so these last plots are comparing like with like, good.

@mlazzarin (Contributor, Author)

I double-checked and I believe that this implementation is the optimal one, so we may proceed with the review and then merge it into the library branch. I have only two comments left:

  • We still need to understand how to properly disable gate fusion, but we can worry about it in PR Add fusion max_qubits option in compare.py #17.
  • Among the possible options of qsim, I found this one (see the sketch after this list):
        denormals_are_zeros: if true, set flush-to-zero and denormals-are-zeros
             MXCSR control flags. This prevents rare cases of performance
             slowdown potentially at the cost of a tiny precision loss.
    
    I'm not sure if we should use it in the benchmarks or not.
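
For reference, enabling it would presumably look like the following sketch; I'm assuming the option is exposed through qsimcirq's QSimOptions under the same name.

import qsimcirq

# denormals_are_zeros sets the flush-to-zero / denormals-are-zeros MXCSR
# flags, trading a tiny precision loss for fewer performance slowdowns
# (option name assumed from the qsim docstring quoted above).
options = qsimcirq.QSimOptions(denormals_are_zeros=True)
simulator = qsimcirq.QSimSimulator(qsim_options=options)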

@mlazzarin mlazzarin requested review from andrea-pasquale, stavros11 and scarrazza and removed request for scarrazza and stavros11 December 19, 2021 06:41
@mlazzarin (Contributor, Author)

I fixed some gates in Cirq, and now the CI works fine. Once we fix the tests for the gates, we should review each library to ensure that everything is properly implemented.

@stavros11 (Member) left a comment

Thanks for adding this and fixing the QAOA issue, the tests are now working for me. My only comment would be that we could consider removing tfq completely to simplify the code and CI/tests. Given that its backend is equivalent to qsim, it would be redundant to include it in any benchmarks we do. As long as it is not causing any issues we could keep it, but if there is something in the tests I wouldn't spend much time on it.

If you don't plan any other changes here, we can merge this to reduce the number of active branches.

@andrea-pasquale (Contributor) left a comment

Thanks for this implementation. It looks good to me.
I've left below a few comments regarding some missing factors of pi and also the overall phase of the CU3 gate.
Let me know what you think.


    def CU1(self, theta):
        # TODO: Check if this is the right gate
        return self.cirq.CZPowGate(exponent=theta)
@andrea-pasquale (Contributor):

I believe that here we are missing a factor of pi.

Suggested change:
-        return self.cirq.CZPowGate(exponent=theta)
+        return self.cirq.CZPowGate(exponent=theta/np.pi)
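
As a quick sanity check of the rescaling (just a sketch, not part of the PR): with the exponent divided by pi, cirq's CZPowGate reproduces the usual CU1 matrix diag(1, 1, 1, exp(i*theta)).

import numpy as np
import cirq

theta = 0.7
# CZPowGate(exponent=t) is diag(1, 1, 1, exp(i*pi*t)), so dividing the
# angle by pi yields the CU1(theta) unitary diag(1, 1, 1, exp(i*theta)).
gate = cirq.CZPowGate(exponent=theta / np.pi)
expected = np.diag([1, 1, 1, np.exp(1j * theta)])
assert np.allclose(cirq.unitary(gate), expected)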

@stavros11 (Member), Dec 20, 2021:

I am currently working on adding some random rotations before every circuit during tests, so that the initial state is non-trivial and such problems are caught by the tests. I will open a new PR based on this once all tests are ready. For now I can confirm that this fix works, thanks!

@andrea-pasquale (Contributor):

I see, in fact I had to check manually to find these small errors. If we can detect all of them from tests, that would be great.

benchmarks/libraries/cirq.py (resolved discussion)
        return self.cirq.CZPowGate(exponent=theta)

    def CU3(self, theta, phi, lam):
        # TODO: Check if this is the right gate
        gate = self.cirq.circuits.qasm_output.QasmUGate(theta, phi, lam)
@andrea-pasquale (Contributor):

Again, a missing factor of pi.

Suggested change:
-        gate = self.cirq.circuits.qasm_output.QasmUGate(theta, phi, lam)
+        gate = self.cirq.circuits.qasm_output.QasmUGate(theta/np.pi, phi/np.pi, lam/np.pi)

@mlazzarin (Contributor, Author):

Yes, thanks. However, I'm not sure yet about this gate, as I couldn't find the exact matrix representation in the docs.
Anyway, we will find out after we fix the single-qubit gate tests.
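
One quick way to check (a sketch; the comparison target depends on which CU3 convention qibo uses, including the overall phase) is to print the unitary that cirq assigns to QasmUGate with the rescaled angles:

import numpy as np
import cirq

theta, phi, lam = 0.1, 0.2, 0.3
# QasmUGate takes its three angles in units of pi, hence the division
# suggested above; compare the printed matrix (up to a global phase)
# with the CU3 definition used by qibo.
gate = cirq.circuits.qasm_output.QasmUGate(theta / np.pi, phi / np.pi, lam / np.pi)
print(np.round(cirq.unitary(gate), 5))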

mlazzarin and others added 3 commits on December 20, 2021 (co-authored by Andrea Pasquale <andreapasquale97@gmail.com>).
@mlazzarin (Contributor, Author)

Shall we merge this?

@stavros11 (Member)

Yes, please go ahead and merge this, and I will update randomtests to use the latest libraries so that we can find any issues with the gates.

@mlazzarin merged commit e269329 into libraries on Dec 21, 2021.