
Improve performance of CommutationAnalysis transpiler pass #6982

Merged: 3 commits into Qiskit:main, Sep 22, 2021

Conversation

jakelishman (Member):

For large circuits with many gates, the `CommutationAnalysis` transpiler
pass is typically one of the longer passes, along with the mapping
passes. It spends most of its time deciding whether any two given
operators commute, which it does by matrix multiplication, with a cache
to avoid repeated work.

This commit maintains the same general method (as opposed to, say,
maintaining a known-good lookup table), but performs the following
optimisations to improve it, in approximate order from most to least
impactful (a code sketch of the combined approach follows this list):

  • we store the result of "do these two matrices commute?" instead of
    caching the two matrices individually, as before. With the current
    way the caching is written, the keys depend on both matrices, so it
    is never the case that one key can be a cache hit and the other a
    cache miss. This means that storing only the result does not cause
    any more cache misses than before, and saves the subsequent matrix
    operations (multiplication and comparison). In real-world usage,
    this is the most significant change.

  • the matrix multiplication is slightly reorganised to halve the number
    of calls to `Operator.compose`. Instead of first constructing an
    identity operator over all qubits and composing both operators onto
    it, we reorganise the indices of the qubit arguments so that all the
    "first" operator's qubits come first, and that operator is tensored
    with a small identity to bring it up to size. Then the other
    operator is composed with it. This is generally faster, since it
    replaces a call to `compose`, which needs to do a matmul-einsum
    step, with a simple `kron`, which need not concern itself with
    operand order. It also results in fewer operations, since the input
    matrices are smaller.

  • the cache-key algorithm is changed to avoid string-ification as much
    as possible. This generally has a very small impact for most
    real-world applications, but has a massive impact on circuits with
    large numbers of unsynthesised `UnitaryGate` elements (like quantum
    volume circuits being transpiled without `basis_gates`). This is
    because the gate parameters were previously converted to strings,
    which for `UnitaryGate` meant string-ifying the whole matrix. That
    was much slower than just doing the dot product, so it was defeating
    the purpose of the cache.
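
As a concrete illustration of the three bullets above, here is a
minimal sketch of the combined approach in terms of
`qiskit.quantum_info.Operator`. It is not the pass's actual code: the
names `commute` and `_hashable_params` are illustrative, and the qubit
relabelling from the second bullet is assumed to have already placed the
first gate's qubits at indices `0, 1, ...`.

```python
import numpy as np
from qiskit.quantum_info import Operator


def _hashable_params(params):
    # Cheap cache key for gate parameters: use hashable values directly;
    # key ndarrays (e.g. a UnitaryGate matrix) by identity rather than
    # string-ifying them, accepting the odd extra miss for referentially
    # unequal copies; string-ify only as a last resort.
    key = []
    for param in params:
        try:
            hash(param)
            key.append(param)
        except TypeError:
            key.append(id(param) if isinstance(param, np.ndarray) else str(param))
    return tuple(key)


def commute(gate1, qarg1, gate2, qarg2, cache):
    """Return True if the two gates commute.  Assumes ``qarg1`` has been
    relabelled to ``[0, 1, ...]`` so that ``gate1``'s qubits come first."""
    # The key depends on *both* operands, so caching anything finer than
    # the boolean answer cannot save a miss; store only the result.
    key = (gate1.name, _hashable_params(gate1.params), tuple(qarg1),
           gate2.name, _hashable_params(gate2.params), tuple(qarg2))
    if key in cache:
        return cache[key]
    num_qubits = len(set(qarg1) | set(qarg2))
    operator_1 = Operator(gate1)
    extra = num_qubits - len(qarg1)
    if extra:
        # One cheap kron brings operator_1 up to full width, replacing one
        # of the two Operator.compose calls onto a full-size identity.
        operator_1 = Operator(np.eye(2**extra)).tensor(operator_1)
    operator_2 = Operator(gate2)
    # Compose operator_2 in both orders (``front`` flips the order) and
    # compare the two results.
    op12 = operator_1.compose(operator_2, qargs=list(qarg2))
    op21 = operator_1.compose(operator_2, qargs=list(qarg2), front=True)
    cache[key] = result = np.allclose(op12.data, op21.data)
    return result
```

A quick usage check of the sketch:

```python
from qiskit.circuit.library import CXGate, ZGate

cache = {}
print(commute(CXGate(), [0, 1], ZGate(), [0], cache))  # True: Z on the control
print(commute(CXGate(), [0, 1], ZGate(), [1], cache))  # False: Z on the target
```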

On my laptop (i7 MacBook Pro 2019), this gives a 15-35% speed increase
on the `time_quantum_volume_transpile` benchmark at transpiler
optimisation levels 2 and 3 (the only levels at which the
`CommutationAnalysis` pass runs by default), measured over the whole
transpile. The improvement in the runtime of the pass itself depends
strongly on the type of circuit, but in the worst (highly unrealistic)
cases it can be nearly an order of magnitude (for example, just calling
`transpile(optimization_level=3)` on a `QuantumVolume(128)` circuit
drops from 27s to 5s).

`asv` `time_quantum_volume_transpile` benchmarks:

  • This commit:

    =============================== ============
     transpiler optimization level      time
    ------------------------------- ------------
                   0                  3.57±0s
                   1                 7.16±0.03s
                   2                 10.8±0.02s
                   3                 33.3±0.08s
    =============================== ============
    
  • Previous commit:

    =============================== ============
     transpiler optimization level      time
    ------------------------------- ------------
                   0                 3.56±0.02s
                   1                 7.24±0.05s
                   2                 16.8±0.04s
                   3                 38.9±0.1s
    =============================== ============
    

Details and comments

There is more work that could be done here as well, which I think would give us more performance in more natural workflows that don't just involve dumping in a tonne of unitary matrices. The next obvious step would be to build a lookup table of gate types that are known to commute with each other, in order to skip the entire process of comparing the two matrix multiplications for common cases; we can always fall back to this method for anything not in the table (a hypothetical sketch follows). I haven't done this right now because it's a bigger task, and it doesn't need doing in one go.
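
For illustration only, one hypothetical shape such a table could take.
The `_KNOWN_COMMUTATIONS` name and its entries are mine, not anything in
this PR, and a real table would also need to encode how the gates'
qubits overlap (e.g. control vs. target of a CX):

```python
# Hypothetical sketch only; not part of this PR.  Restricting the table
# to gates applied to exactly the same qubits sidesteps the qubit-overlap
# bookkeeping a real implementation would need.
_KNOWN_COMMUTATIONS = {
    frozenset(("z", "rz")): True,   # both diagonal, so they always commute
    frozenset(("x", "z")): False,   # anticommute on the same qubit
    frozenset(("cx",)): True,       # an identical gate on identical qubits
}


def commute_with_table(gate1, qarg1, gate2, qarg2, cache):
    if list(qarg1) == list(qarg2):
        known = _KNOWN_COMMUTATIONS.get(frozenset((gate1.name, gate2.name)))
        if known is not None:
            return known
    # Everything else falls back to the matrix-based check (the `commute`
    # sketch above).
    return commute(gate1, qarg1, gate2, qarg2, cache)
```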

With caches, it's important that the time taken to construct the key isn't excessive; previously, when the parameters were arrays (even for gates other than `UnitaryGate`), that requirement wasn't honoured (see the illustration below).
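
As a toy illustration of that cost asymmetry (timings vary by machine,
so none are quoted here): string-ifying an array parameter scales with
the matrix size, while an identity-based key is constant-time.

```python
import timeit

import numpy as np

# e.g. the matrix parameter of a 4-qubit UnitaryGate: 256 complex entries.
mat = np.eye(2**4, dtype=complex)

print(timeit.timeit(lambda: str(mat), number=1000))  # string-ifies every entry
print(timeit.timeit(lambda: id(mat), number=1000))   # constant-time identity key
```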

jakelishman requested a review from a team as a code owner on September 3, 2021 at 23:50
ewinston (Contributor) left a comment:
How does the cache handle floating point round-off errors? Would the keys be unique enough by manually rounding them?

@@ -87,42 +87,87 @@ def run(self, dag):
                self.property_set["commutation_set"][(current_gate, wire)] = temp_len - 1


def _commute(node1, node2, cache):
_COMMUTE_ID_OP = {}
ewinston (Contributor) commented on the diff:
Does this need to be module-level, or can it be put inside `_commute`?

I couldn't find where it gets populated?

jakelishman (Member, Author) replied:

It can be either - realistically, I can't imagine it'll ever contain more than ~3 elements (you'd need to be dealing with many-qubit gates to have more than even a single element in it), so there shouldn't be any memory problems. Having it module-level means that the cache persists between calls; it's only ever populated with various sizes of the identity matrix, and those are complete constants.

It's populated here:
https://github.com/Qiskit/qiskit-terra/blob/07299e170532a24fbaefa6a87aa3c427b79f765e/qiskit/transpiler/passes/optimization/commutation_analysis.py#L161-L168
on line 164 (note it's a double assignment).
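
Paraphrasing the shape of that population site (a sketch under my own
naming, not a verbatim copy of the linked code): the double assignment
binds the identity operator to the local name and stores it in the
module-level dict in a single statement.

```python
import numpy as np
from qiskit.quantum_info import Operator

_COMMUTE_ID_OP = {}


def _identity_operator(extra_qubits):
    # Illustrative helper: fetch a cached identity of the requested width,
    # creating and caching it on first use.
    try:
        id_op = _COMMUTE_ID_OP[extra_qubits]
    except KeyError:
        # The "double assignment": populate the module-level cache and the
        # local name in one statement.
        id_op = _COMMUTE_ID_OP[extra_qubits] = Operator(
            np.eye(2**extra_qubits),
            input_dims=(2,) * extra_qubits,
            output_dims=(2,) * extra_qubits,
        )
    return id_op
```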

jakelishman (Member, Author):

The original code actually had far worse behaviour with regards to string-ification, since before this commit, all Numpy arrays were turned into strings, which has all the floating-point problems you mentioned. In the new form, I'm not aware of any cases where the fallback will actually be used, but I kept it in place so it won't break any extension gates doing super weird things.

This commit in theory can make more cache misses (though it certainly doesn't in the benchmark suite), if several instructions are separately decomposed into the same type of gate with exactly equivalent, but referentially unequal, Numpy arrays. Even if this were to happen, the slowness of string-ifying Numpy arrays (the previous method) almost invariably outweighs the cost of just doing the commutation analysis again, especially now that I've tweaked the comparison to avoid unnecessary einsum calls and Operator creations.

(Sorry for the slow reply - I was on holiday!)

ewinston (Contributor):

I checked scaling of timing and memory usage of QV circuits and didn't see any major issues. LGTM.

[figure: peak memory usage vs. square QV circuit size]

[figure: total time spent in the function]

kdk added the automerge label on Sep 21, 2021
kdk added this to the 0.19 milestone on Sep 21, 2021
mergify bot merged commit 890d9ca into Qiskit:main on Sep 22, 2021
jakelishman deleted the commutation-analysis-performance branch on September 22, 2021 at 08:44
kdk added the "Changelog: None (Do not include in changelog)" label on Dec 6, 2021
Labels: Changelog: None (Do not include in changelog); performance