Add support for the parameter-shift hessian to the Autograd interface #1131

Merged · 26 commits · Mar 26, 2021

Conversation

@josh146 (Member) commented Mar 9, 2021

Context:

The Autograd custom gradient interface currently returns a non-differentiable gradient.

Description of the Change:

Modifies the Autograd interface so that the Jacobian also defines a custom gradient --- the vector-Hessian product --- where the Hessian is computed by an efficient implementation of the parameter-shift rule.

Autograd QNodes can now be doubly differentiated, on both simulators and hardware:

import pennylane as qml
from pennylane import numpy as np

dev = qml.device('default.qubit', wires=1)

@qml.qnode(dev, diff_method="parameter-shift")
def circuit(p):
    qml.RY(p[0], wires=0)
    qml.RX(p[1], wires=0)
    return qml.probs(0)

x = np.array([1.0, 2.0], requires_grad=True)
>>> circuit(x)
tensor([0.38757745, 0.61242255], requires_grad=True)
>>> qml.jacobian(circuit)(x)
array([[ 0.17508774, -0.24564775],
       [-0.17508774,  0.24564775]])
>>> qml.jacobian(qml.jacobian(circuit))(x)
array([[[ 0.11242255,  0.3825737 ],
        [ 0.3825737 ,  0.11242255]],
       [[-0.11242255, -0.3825737 ],
        [-0.3825737 , -0.11242255]]])

In addition, a slight tweak was made to the Jacobian/Hessian evaluation logic in order to avoid redundant Jacobian/Hessian computations.

Benefits:

  • Autograd QNodes now support double derivatives on both hardware and simulators

  • The Jacobian/Hessian is now computed only once for a given set of input parameters, and re-used for multiple VJPs/VHPs.

Possible Drawbacks:

  • 3rd derivatives and higher are not supported. We could potentially support higher derivatives using recursion.

  • Currently, Jacobian parameter-shifts are not being re-used for the Hessian parameter-shifts. This requires more thinking.

  • I realised while writing this PR that the parameter-shift Hessian logic is based on the formula given here: https://arxiv.org/abs/2008.06517. However, this formula assumes all gates support the 2-term parameter-shift rule; this is not the case for controlled rotations. Therefore, computing the Hessian of a circuit containing controlled rotations will give an incorrect result! (A sketch of the 2-term formula appears after this list.)

  • Long term, we should consider implementing parameter-shift logic as follows:

    • At a low level, provide functions for computing the VJP and VHP directly. This avoids redundant parameter-shift evaluations.
    • Cache and re-use parameter-shift evaluations performed for the same parameter values.
    • Use recursion to support arbitrary-order derivatives.
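
For reference, here is a rough sketch of the 2-term double-shift formula mentioned above. This is an illustrative, hypothetical implementation (param_shift_hessian and the toy function f are not part of this PR), and it assumes every parameter obeys the 2-term shift rule, which is exactly the assumption that controlled rotations violate:

import numpy as np

def param_shift_hessian(f, params, shift=np.pi / 2):
    """Hessian via the 2-term parameter-shift rule (illustrative sketch only)."""
    n = len(params)
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            total = 0.0
            # four doubly-shifted evaluations, with signs (+,+), (+,-), (-,+), (-,-)
            for s_i, s_j in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
                shifted = np.array(params, dtype=float)
                shifted[i] += s_i * shift
                shifted[j] += s_j * shift
                total += s_i * s_j * f(shifted)
            hess[i, j] = total / 4
    return hess

# toy stand-in for a qubit expectation value: f(x) = cos(x0) * cos(x1)
f = lambda x: np.cos(x[0]) * np.cos(x[1])
x = np.array([1.0, 2.0])
param_shift_hessian(f, x)  # matches the analytic Hessian of f at x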

Related GitHub Issues: n/a

codecov bot commented Mar 9, 2021

Codecov Report

Merging #1131 (8745c6f) into master (ede35f4) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #1131   +/-   ##
=======================================
  Coverage   98.12%   98.12%           
=======================================
  Files         144      144           
  Lines       10813    10833   +20     
=======================================
+ Hits        10610    10630   +20     
  Misses        203      203           
Impacted Files                      Coverage Δ
pennylane/_grad.py                  100.00% <100.00%> (ø)
pennylane/interfaces/autograd.py    100.00% <100.00%> (ø)
pennylane/tape/jacobian_tape.py     97.84% <100.00%> (+0.01%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@chaserileyroberts (Contributor) left a comment:

The matrix caching scares me.

def gradient_product(g):
    # In autograd, the forward pass is always performed prior to the backwards
    # pass, so we do not need to re-unwrap the parameters.
    saved_grad_matrices = {}
Contributor:

Not a big fan of a global storage solution. This is going to kill memory usage if the user runs a ton of experiments.

@josh146 (Member, Author) replied Mar 10, 2021:

Good point. The idea here is to simply cache the hessian/jacobian for the same parameter value, if vjp is called repeatedly by Autograd. So repeated calls by the user via qml.jacobian won't do any caching, but a single call by the user to qml.jacobian will cache under the hood for autograd.

This is basically a workaround for Python not having first-class dynamic programming support (which would be very useful here).
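
For illustration, a rough sketch of this closure-based caching pattern (toy code with assumed names, not the PR's actual implementation; np.outer stands in for the expensive parameter-shift evaluation):

import numpy as np

def make_vjp(p):
    saved_grad_matrices = {}

    def _evaluate_grad_matrix(grad_matrix_fn):
        # compute each matrix at most once per returned vjp; later calls hit the cache
        if grad_matrix_fn not in saved_grad_matrices:
            saved_grad_matrices[grad_matrix_fn] = np.outer(p, p)  # stand-in for the real work
        return saved_grad_matrices[grad_matrix_fn]

    def gradient_product(g):
        jacobian = _evaluate_grad_matrix("jacobian")
        return g @ jacobian

    return gradient_product

vjp = make_vjp(np.array([1.0, 2.0]))
vjp(np.array([1.0, 0.0]))  # computes and caches the "jacobian"
vjp(np.array([0.0, 1.0]))  # re-uses the cached matrix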

@josh146 (Member, Author) replied Mar 10, 2021:

Regarding memory usage: I think this should be safe, since this isn't global storage; instead, it is locally scoped within the vjp function. Once the call to vjp() is complete, CPython should delete the saved_grad_matrices data.


@autograd.extend.primitive
def jacobian(p):
    jacobian = _evaluate_grad_matrix(p, "jacobian")
Contributor:

Why are you storing the results in a dictionary with string keys? Especially if you only have two keys.

Seems... dangerous lol.

@josh146 (Member, Author):

I'm not sure I entirely follow?

Contributor:

In order to modify a variable captured in a closure, it needs to be a mutable type, so either a dictionary, set, or list - if that's what you are asking about.

I was confused about this and had to do some background reading on closures to figure it out. Maybe there should be a comment explaining it?
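
As a minimal, generic illustration of that point (plain Python, not taken from the PR): rebinding a name inside a nested function only creates a new local, while mutating a captured container (or declaring the name nonlocal) updates the enclosing scope.

def make_counter():
    state = {"count": 0}  # mutable container captured by the closure

    def increment():
        # `state = {...}` here would only create a new local name;
        # mutating the captured dict updates the enclosing scope's variable
        state["count"] += 1
        return state["count"]

    return increment

counter = make_counter()
counter()  # 1
counter()  # 2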

self.set_parameters(self._all_parameter_values, trainable_only=False)

# only flatten g if all parameters are single values
saved_grad_matrices[grad_matrix_fn] = grad_matrix
Contributor:

Is saved_grad_matrices ever cleared? How does it behave between runs? Could there be unintended state poisoning between calls to this method?

@josh146 (Member, Author):

Let me write some tests for this to make sure it works as I expect.

@josh146 (Member, Author) replied Mar 10, 2021:

Is saved_grad_matrices ever cleared?

I believe it is cleared by virtue of CPython deleting local variables once they go out of scope. That is, once the call to vjp() is complete, saved_grad_matrices will be deleted.

I'm not 100% sure though, since it is used in locally defined functions that are returned...

Is there a way of testing/verifying this?

Contributor:

Yep, you are correct, that is what will happen. That dictionary is in the namespace of the vjp function and not the class, as I originally thought. This should be fine then.

@josh146 (Member, Author) commented Mar 10, 2021:

Benchmarking courtesy of @mariaschuld:

      before           after         ratio
     [f87e5c5f]       [fc39d412]
     <master>         <hessian_autograd>
       14.3±0.7ms       14.0±0.4ms     0.98  asv.core_suite.CircuitEvaluation.time_circuit(10, 3)
         26.7±1ms       25.9±0.5ms     0.97  asv.core_suite.CircuitEvaluation.time_circuit(10, 6)
         41.4±2ms       37.9±0.7ms     0.92  asv.core_suite.CircuitEvaluation.time_circuit(10, 9)
       4.82±0.7ms      4.34±0.02ms    ~0.90  asv.core_suite.CircuitEvaluation.time_circuit(2, 3)
         7.45±2ms       6.41±0.3ms    ~0.86  asv.core_suite.CircuitEvaluation.time_circuit(2, 6)
         9.69±1ms      8.90±0.08ms     0.92  asv.core_suite.CircuitEvaluation.time_circuit(2, 9)
       8.49±0.7ms       7.80±0.1ms     0.92  asv.core_suite.CircuitEvaluation.time_circuit(5, 3)
      13.4±0.04ms       13.3±0.1ms     0.99  asv.core_suite.CircuitEvaluation.time_circuit(5, 6)
       20.1±0.1ms       19.2±0.4ms     0.96  asv.core_suite.CircuitEvaluation.time_circuit(5, 9)
       6.92±0.3ms         6.25±1ms    ~0.90  asv.core_suite.GradientComputation.time_gradient_autograd(2, 3)
       11.0±0.9ms       9.23±0.2ms    ~0.84  asv.core_suite.GradientComputation.time_gradient_autograd(2, 6)
       13.4±0.8ms       12.2±0.5ms    ~0.91  asv.core_suite.GradientComputation.time_gradient_autograd(5, 3)
         22.4±1ms       22.6±0.8ms     1.01  asv.core_suite.GradientComputation.time_gradient_autograd(5, 6)
          177±4ms          170±2ms     0.96  asv.core_suite.Optimization.time_optimization_autograd

and memory usage:

       before           after         ratio
     [f87e5c5f]       [fc39d412]
     <master>         <hessian_autograd>
             196M             196M     1.00  asv.core_suite.CircuitEvaluation.peakmem_circuit(10, 3)
             196M             196M     1.00  asv.core_suite.CircuitEvaluation.peakmem_circuit(10, 6)
             196M             196M     1.00  asv.core_suite.CircuitEvaluation.peakmem_circuit(10, 9)
             196M             196M     1.00  asv.core_suite.CircuitEvaluation.peakmem_circuit(2, 3)
             196M             196M     1.00  asv.core_suite.CircuitEvaluation.peakmem_circuit(2, 6)
             196M             196M     1.00  asv.core_suite.CircuitEvaluation.peakmem_circuit(2, 9)
             196M             196M     1.00  asv.core_suite.CircuitEvaluation.peakmem_circuit(5, 3)
             196M             196M     1.00  asv.core_suite.CircuitEvaluation.peakmem_circuit(5, 6)
             196M             196M     1.00  asv.core_suite.CircuitEvaluation.peakmem_circuit(5, 9)
             196M             196M     1.00  asv.core_suite.GradientComputation.peakmem_gradient_autograd(2, 3)
             196M             196M     1.00  asv.core_suite.GradientComputation.peakmem_gradient_autograd(2, 6)
             197M             197M     1.00  asv.core_suite.GradientComputation.peakmem_gradient_autograd(5, 3)
             197M             197M     1.00  asv.core_suite.GradientComputation.peakmem_gradient_autograd(5, 6)
             197M             198M     1.00  asv.core_suite.Optimization.peakmem_optimization_autograd

@thisac (Contributor) left a comment:

Thanks @josh146! Seems like a nice addition, although I'm a bit confused about exactly how autograd works here, so it would be nice to have a brief chat about it sometime.

def gradient_product(g):
    # In autograd, the forward pass is always performed prior to the backwards
    # pass, so we do not need to re-unwrap the parameters.
    saved_grad_matrices = {}
Contributor:

Just a thought, and not necessarily better than the current solution, but could there be a case here for saving them as saved_hessian and saved_jacobian instead of storing them in a single dict? 🤔

@josh146 (Member, Author):

I'm not sure I see the advantage over a dictionary - in Python, a lot of caching is done with dictionaries, including class properties, memoization, etc.

Comment on lines 925 to 926
def test_grad_matrix_caching(self, mocker, dev_name, diff_method):
    """Test the gradient matrix caching"""
Contributor:

How is this testing the caching? It seems like it's only testing that it doesn't cache when calling qml.jacobian or qml.hessian several times (which is good, although not what I would expect from the naming and docstring for this test).

@josh146 (Member, Author):

Good point - this is still WIP; I added it after @Thenerdstation's comments. I'm still not fully sure how to test this; it might require usage of https://docs.python.org/3/library/gc.html?

@josh146 (Member, Author):

I've added a bit more of an exploration here: #1131 (comment)
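
For reference, a hypothetical sketch of the "no caching across separate user-level calls" check described above. It assumes pytest-mock's mocker fixture and that the tape class is exposed as qml.tape.JacobianTape; the test name and exact call counts are assumptions, not taken from the PR:

import pennylane as qml
from pennylane import numpy as np

def test_no_cross_call_caching(mocker):
    dev = qml.device("default.qubit", wires=1)

    @qml.qnode(dev, diff_method="parameter-shift")
    def circuit(p):
        qml.RY(p[0], wires=0)
        return qml.probs(0)

    spy = mocker.spy(qml.tape.JacobianTape, "jacobian")
    x = np.array([0.5], requires_grad=True)

    qml.jacobian(circuit)(x)
    calls_after_first = spy.call_count

    qml.jacobian(circuit)(x)
    # a second user-level call should re-evaluate the Jacobian rather than re-use a cache
    assert spy.call_count == 2 * calls_after_first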

Comment on lines 236 to 237
def vhp(ans, p):
    def hessian_product(ddy):
Contributor:

I think it'd be good to get some more details on what's happening here. I'm not fully sure what's going on tbh (mainly autograd-wise).

Co-authored-by: Theodor <theodor@xanadu.ai>
@josh146 (Member, Author) commented Mar 17, 2021:

@Thenerdstation I've explored the issue of the cache further, just putting my exploration here for now.

The autograd.grad function, essentially, looks like this:

def grad(fun, x):
    vjp, ans = _make_vjp(fun, x)
    grad_value = vjp(vspace(ans).ones())
    return grad_value, ans

where vjp is a function that returns the gradient, and ans is the output of the forward pass.

vjp is actually created via a huge chain of nested functions, with the 'leaves' of this structure being the individual vjp functions for each node in the computational graph (including AutogradInterface.vjp).

By using Python's __closure__ attribute, we can dig deep into this structure and extract the saved matrix cache:

def grad(fun, x):
    vjp, ans = _make_vjp(fun, x)

    ################################################################
    # The returned vjp function contains the saved matrix cache via closure.
    # Let's extract this.
    saved_matrices = []

    import autograd

    # extract the end node from the VJP closure
    end_node = vjp.__closure__[0].cell_contents
    outgrads = {end_node: (vspace(ans).ones(), False)}

    # iterate through all nodes in the computational graph
    for node in autograd.tracer.toposort(end_node):
        outgrad = outgrads.pop(node)
        ingrads = node.vjp(outgrad[0])

        try:
            # look deep inside the nested function closure, and attempt to
            # extract the saved matrices dictionary cache if the node is a QNode
            qnode_cache = (
                node.vjp
                .__closure__[0].cell_contents
                .__closure__[0].cell_contents
                .__closure__[1].cell_contents
            )
            if isinstance(qnode_cache, dict) and "jacobian" in qnode_cache:
                saved_matrices.append(qnode_cache)
        except:
            # the node was not a QNode; pass.
            pass

        for parent, ingrad in zip(node.parents, ingrads):
            outgrads[parent] = autograd.core.add_outgrads(outgrads.get(parent), ingrad)

    print(saved_matrices)
    ################################################################

    grad_value = vjp(vspace(ans).ones())
    return grad_value, ans # after returning, vjp is now out of scope

So, inside the qml.grad() function, the saved matrix cache (which contains at most two keys, the Jacobian and the Hessian) continues to exist in memory.

As far as I can tell, however, as soon as the qml.grad() function exits, the vjp function variable goes out of scope, its reference is dropped, and CPython can reclaim vjp along with the nested closures it contains. So the memory used by the cached matrices should be freed.
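
One generic way to verify this kind of claim (a sketch, not the PR's actual test; the Sentinel class is introduced only because plain dicts cannot be weak-referenced) is to hold a weak reference to an object stored in the cache and check that it dies once the closure is dropped:

import gc
import weakref

class Sentinel:
    """Placeholder object; plain dicts cannot be weak-referenced directly."""

def make_vjp_like():
    saved_grad_matrices = {"jacobian": Sentinel()}  # stand-in for a cached matrix

    def vjp(g):
        return saved_grad_matrices["jacobian"]

    return vjp, weakref.ref(saved_grad_matrices["jacobian"])

vjp, cache_ref = make_vjp_like()
assert cache_ref() is not None  # the cache is alive while vjp is reachable

del vjp
gc.collect()
assert cache_ref() is None  # the cache is freed once the closure goes out of scope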

if dy.size > 1:
    if all(np.ndim(p) == 0 for p in params):
        # only flatten dy if all parameters are single values
        vhp = dy.flatten() @ ddy @ hessian @ dy.flatten()
Contributor:

I don't understand why we have dy twice here.

@josh146 (Member, Author) replied Mar 23, 2021:

I'm not sure I can fully answer that, but it makes sense if you consider the shape of the hessian and perform dimensional analysis:

  • the hessian will be of shape (num_params, num_params, num_output),
  • dy (think of it like the gradient) has shape (num_output,), and
  • ddy (think of it like the gradient of the gradient) has shape (num_output, num_params).

So performing the matrix multiplication will return a vhp of the right size, (num_params,).

E.g.:

>>> num_params = 2
>>> num_output = 3
>>> dy = np.random.random([num_output])
>>> ddy = np.random.random([num_output, num_params])
>>> hessian = np.random.random([num_params, num_params, num_output])
>>> dy @ ddy @ hessian @ dy
array([1.82038363, 2.03458631])


dev = qml.device(dev_name, wires=1)

@qnode(dev, diff_method=diff_method, interface="autograd")
Contributor:

I wonder if we should create an "analytic circuit" set of fixtures for the testing suite and be able to easily reuse the quantum function, result, derivative, and hessian across the different interfaces.

@josh146 (Member, Author) replied Mar 23, 2021:

This is a good point 🤔. Many more tests like this exist across the interface test suite, so maybe this is something worth considering in a new PR/issue?
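
A rough sketch of what such a shared fixture could look like (hypothetical: the fixture name, the chosen circuit, and the closed-form expression are illustrative, not part of this PR):

import pytest
import pennylane as qml
from pennylane import numpy as np

@pytest.fixture
def analytic_circuit():
    """A circuit with a known closed-form output, re-usable across interface tests."""
    dev = qml.device("default.qubit", wires=1)

    @qml.qnode(dev, diff_method="parameter-shift")
    def circuit(p):
        qml.RY(p[0], wires=0)
        qml.RX(p[1], wires=0)
        return qml.probs(0)

    def expected(p):
        # for this circuit, P(0) = (1 + cos(p0) cos(p1)) / 2
        p0 = (1 + np.cos(p[0]) * np.cos(p[1])) / 2
        return np.array([p0, 1 - p0])

    return circuit, expected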

@albi3ro (Contributor) left a comment:

Summary of things to change:

  • comment clarifying closure and saved_grad_matrices
  • redefining p in a comprehension

Optional suggestion:

  • analytic circuits test fixture file

There are still things I don't completely understand about this, but it works and that is the most important thing :)

@josh146 (Member, Author) commented Mar 23, 2021:

Thanks @albi3ro! Your comment about the parameter flattening if statement really helped make the PR better.

@josh146 josh146 requested a review from albi3ro March 23, 2021 16:12
@albi3ro (Contributor) left a comment:

Excited to have this! :)

@thisac (Contributor) left a comment:

Great work @albi3ro and @josh146! Only found a missing punctuation mark, but other than that it looks good. 💯

Co-authored-by: Theodor <theodor@xanadu.ai>
@albi3ro (Contributor) left a comment:

Found something to fix before getting merged in. Could definitely cause bugs down the line.

@albi3ro (Contributor) left a comment:

Updating from "request changes" to "approve" since the change isn't actually needed.

Base automatically changed from hessian_torch to master March 26, 2021 16:39
Successfully merging this pull request may close these issues:

Too many circuit executions when computing parameter-shift Jacobians using Autograd