Add caching to the autograd batch interface #1508
Conversation
Co-authored-by: Nathan Killoran <co9olguy@users.noreply.github.com>
Hello. You may have forgotten to update the changelog!
…nto autograd-caching
Thanks @josh146, this is great! 🚀 I've left a few questions for my understanding.
# disable caching on the forward pass
execute_fn = cache_execute(device.batch_execute, cache=None)

# replace the backward gradient computation
gradient_fn = device.gradients
gradient_fn = cache_execute(
    device.gradients, cache, pass_kwargs=True, return_tuple=False
)
Probably my unfamiliarity with the recent changes, but do we expect to need caching for device-based gradients? I thought this was mainly for parameter shift.
Caching is only needed for device-based gradients if mode="backwards". Backwards mode essentially means:
- On the forward pass, only the cost function is computed
- The gradients are only requested during backpropagation
This means that there will always be 1 additional eval required -- caching therefore reduces the number of evals by 1 😆
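For concreteness, a hedged sketch of what such a backwards-mode call could look like, mirroring the forward-mode example further down this thread (the cache keyword and the exact mode string are assumptions based on this PR's description, not code from the PR):

# Sketch only: with device gradients in backwards mode, the gradient call
# triggers one extra forward evaluation, which caching turns into a cache hit.
execute(
    tapes,
    dev,
    gradient_fn="device",
    interface="torch",
    gradient_kwargs={"method": "adjoint_jacobian"},
    mode="backward",  # the "backwards" mode discussed above; exact string assumed
    cache=True,
)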
Worth it?
I mean, I'd expect 99% of users to use device gradients with mode="forward".
Sounds good!
Does this supersede #1341?
No, this complements it for now 🙂

#1341 added the use_device_state keyword argument, which instructs QubitDevice.adjoint_jacobian() to use the existing device state and avoid a redundant forward pass. When mode="forward", we can pass this option:
execute(
    tapes,
    dev,
    gradient_fn="device",
    interface="torch",
    gradient_kwargs={"method": "adjoint_jacobian", "use_device_state": True},
    mode="forward"
)
mode="best", | ||
gradient_kwargs=None, | ||
cache=True, | ||
cachesize=10000, |
Do we have an idea of the memory implications of this? 🤔
Assuming you do not pass a cache object manually to the execute function, the cache will be created inside execute. What this means is that, as soon as execute has exited, the cache is out of scope and will be garbage collected by Python.
I am 99.99% sure of this, but don't know how to sanity check 😖
This is from the last time I tried to explore this: #1131 (comment)
Do you have any ideas on how to double check that the cache is deleted after execution?
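One possible way to sanity-check this kind of thing (a sketch only, using a weakref finalizer on a stand-in cache object, since the actual cache created inside execute is not directly reachable from outside):

import gc
import weakref

class Cache(dict):
    """Stand-in for the cache created inside execute (plain dicts cannot be weak-referenced)."""

def run():
    cache = Cache()
    finalizer = weakref.finalize(cache, lambda: print("cache garbage collected"))
    # ... execute the tapes here, populating `cache` ...
    return finalizer

finalizer = run()
gc.collect()
print(finalizer.alive)  # False once the cache has been collected

If the finalizer fires as soon as run() returns, the cache really is being freed when it goes out of scope.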
Co-authored-by: Tom Bromley <49409390+trbromley@users.noreply.github.com>
Thanks @josh146 for the updates and comments! Looks great 💯
qml.RX(np.array(a), wires=[0])
qml.RY(np.array(b), wires=[1])
Is the np.array() left over from the previous test? Though I guess it doesn't matter, because the hash should be the same.
oh this was semi-intentional - I was trying to ensure that the datatype of the parameter doesn't affect hashing 😆
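A sketch of the kind of check this implies, assuming tape.hash behaves as described in this PR (the circuit here is illustrative only, not taken from the test suite):

import numpy as np
import pennylane as qml

# two tapes that are identical up to the datatype of the gate parameters
with qml.tape.QuantumTape() as tape1:
    qml.RX(0.4, wires=0)
    qml.RY(0.6, wires=1)
    qml.expval(qml.PauliZ(0))

with qml.tape.QuantumTape() as tape2:
    qml.RX(np.array(0.4), wires=0)
    qml.RY(np.array(0.6), wires=1)
    qml.expval(qml.PauliZ(0))

# float vs. np.array parameters should not change the hash
assert tape1.hash == tape2.hash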
"""Tests that the circuit hash of circuits with single-qubit | ||
rotations differing by multiples of 2pi have identical hash""" |
Oh wow, that's cool, didn't realise we'd support that!
It's required in order to reduce the number of Hessian evals to the optimum number (I don't think the autodiff frameworks are smart enough to do this cancelling out themselves).

Currently it's hardcoded in for the R and CR gates, but it would be cool to add this as an operation property:

class Rot(Operation):
    periodicity = [2 * np.pi, 2 * np.pi, 2 * np.pi]
Co-authored-by: Tom Bromley <49409390+trbromley@users.noreply.github.com>
Context: In #1501, batch_execute was made differentiable in the Autograd interface using the new qml.gradients subpackage. However, since the new qml.gradients subpackage is itself differentiable, this allows for out-of-the-box higher-order derivatives, as long as the new Autograd interface is recursive. Thus, not only do we have 3rd order and higher derivatives, we are able to: […] (expval).

However, the recursive evaluation is not smart; the autodiff frameworks will traverse the recursive structure naively, resulting in redundant evaluations.
This PR is a result of thinking about the following two questions:

- How fast is the new batch_execute pipeline compared to master?
- How can we reduce the number of redundant evaluations?

Benchmarking

To test the performance of #1501 vs. master, I ran the following benchmark:
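(The exact snippet is not reproduced here. A minimal sketch of this kind of benchmark, assuming a small parameter-shift QNode cost function and the autograd interface; the circuit, sizes, and timing harness are placeholders rather than the code that was actually run:)

import timeit

import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev, diff_method="parameter-shift")
def cost(weights):
    qml.RX(weights[0], wires=0)
    qml.RY(weights[1], wires=1)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))

weights = np.array([0.1, 0.2], requires_grad=True)

# time the cost and its gradient; run once on this branch and once on master
print("cost:", timeit.timeit(lambda: cost(weights), number=100))
print("gradient:", timeit.timeit(lambda: qml.grad(cost)(weights), number=100))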
With the following results:

Interestingly:

- The batch_execute pipeline is ~ the same speed when the recursive evaluation is turned on.
- The batch_execute pipeline is slower when recursive evaluation is turned on.

Description of the changes
A new argument max_diff is added, which allows the user to specify the 'depth'/'order' at which the recursive evaluation ends. E.g., setting max_diff=1 completely deactivates the recursive evaluation.

Caching is added to the qml.interfaces.execute() function by way of a decorator. This decorator makes use of tape.hash to identify unique tapes.

- If a tape does not match a hash in the cache, then the tape has not been previously executed. It is executed, and the result is added to the cache.
- If a tape matches a hash in the cache, then the tape has been previously executed. The corresponding cached result is extracted, and the tape is not passed to the execution function.
- Finally, it may be the case that two or more tapes in the current set of tapes to be executed share a hash. If so, duplicates are removed to avoid redundant evaluations.
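A hedged sketch of how these arguments fit together, using the names described above (the exact execute signature and import path are assumptions, not verbatim from this PR):

import pennylane as qml

dev = qml.device("default.qubit", wires=2)

with qml.tape.QuantumTape() as tape:
    qml.RX(0.1, wires=0)
    qml.RY(0.2, wires=1)
    qml.CNOT(wires=[0, 1])
    qml.expval(qml.PauliZ(0))

res = qml.interfaces.execute(  # the qml.interfaces.execute() described above
    [tape],
    dev,
    gradient_fn=qml.gradients.param_shift,
    interface="autograd",
    cache=True,        # hash-based tape caching via tape.hash
    cachesize=10000,   # maximum number of cached results
    max_diff=2,        # stop the recursive expansion at second order (e.g. Hessians)
)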
Benefits
Caching has a significant effect. E.g., consider the benchmarking example above, modified to compute the Hessian and display the number of executions:
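(As an illustration only, continuing the benchmark sketch from the Benchmarking section above, with cost, dev, and weights as defined there, one way to display the execution count would be:)

# Hessian of the sketch cost function via autograd (jacobian of the gradient),
# then inspect how many device executions were needed.
hessian = qml.jacobian(qml.grad(cost))(weights)
print("device executions:", dev.num_executions)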
By using a cache, we can reduce the number of evaluations below the minimum we currently have in master.
Questions

- What should the default max_diff be? max_diff>1 is probably fine?