
[WIP] Abstract differentiation interface #1

Merged: 28 commits merged into master from mt/interface on Aug 28, 2021

Conversation

@mohamed82008 (Member) commented Feb 8, 2021

In this PR, I implement a high level API for differentiation. The idea is to unify the APIs of all the AD packages we have in the Julia ecosystem. This should enable AD users to write backend-agnostic code using only the API from AbstractDifferentiation.

In the current implementation, AD package authors would need to define at least the following:

  1. A backend struct, e.g. PackageBackend, for the package that subtypes AbstractBackend.
  2. jacobian(ab::PackageBackend, f, xs...): returns the Jacobian(s) of the output(s) of f wrt its inputs at xs.
  3. primalvalue(x) (not needed for finite differences or source-to-source AD): returns the primal value of x, where x can be a dual number, a vector of duals, a tracked array, etc.

By defining the above, the following functions are then all automatically defined (a usage sketch follows the list below):

  1. derivative(::AbstractBackend, f, xs...): returns the derivatives of the scalar-valued function f wrt its inputs at xs where xs are all scalars.
  2. gradient(ab::AbstractBackend, f, xs...): returns the gradient of the scalar-valued function f wrt its inputs at xs where xs can be anything that the backend ab supports.
  3. hessian(ab::AbstractBackend, f, xs...): returns the Hessian of the scalar-valued function f wrt its inputs at xs.
  4. value_and_derivative(::AbstractBackend, f, xs...): returns the output value of the function f as well as its derivatives wrt its inputs at xs.
  5. value_and_gradient(::AbstractBackend, f, xs...): returns the output value of the function f as well as its gradients wrt its inputs at xs.
  6. value_and_jacobian(::AbstractBackend, f, xs...): returns the output value of the function f as well as its Jacobians wrt its inputs at xs.
  7. value_and_hessian(ab::AbstractBackend, f, xs...): returns the output value of the function f as well as its Hessian wrt its inputs at xs.
  8. value_gradient_and_hessian(ab::AbstractBackend, f, xs...): returns the output value of the function f as well as its gradients and Hessians wrt its inputs at xs.
  9. pullback_function(::AbstractBackend, f, xs...): returns the pullback function of f at xs.
  10. pushforward_function(::AbstractBackend, f, xs...): returns the pushforward function of f at xs.
  11. value_and_pullback_function(::AbstractBackend, f, xs...): returns a function that takes as input the differential of f and returns the primal value of f at xs and the pullback of the differential.
  12. value_and_pushforward_function(::AbstractBackend, f, xs...): returns a function that takes as input the tangents of the inputs xs and returns the primal value of f at xs and the pushforward of the tangents.
  13. Lazy Jacobian and Jacobian transpose vector/matrix multiplication.
  14. Lazy Hessian and Hessian transpose vector/matrix multiplication.
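
To make this concrete, here is a rough sketch of what implementing the interface could look like for a hypothetical finite-difference backend. The FDMBackend struct, its use of FiniteDifferences.jl, and the fully qualified function names are assumptions for the example, not code from this PR:

using AbstractDifferentiation, FiniteDifferences

# Hypothetical backend struct wrapping a finite-difference method.
struct FDMBackend{M} <: AbstractDifferentiation.AbstractBackend
    fdm::M
end
FDMBackend() = FDMBackend(central_fdm(5, 1))

# The one required primitive: the Jacobian(s) of f wrt each of its inputs at xs.
function AbstractDifferentiation.jacobian(ab::FDMBackend, f, xs...)
    return FiniteDifferences.jacobian(ab.fdm, f, xs...)
end
# primalvalue is not needed here: finite differencing never produces duals or tracked values.

# With only the above defined, the derived API should work:
ab = FDMBackend()
AbstractDifferentiation.gradient(ab, x -> sum(abs2, x), rand(3))
AbstractDifferentiation.value_and_jacobian(ab, (x, y) -> x .* y, rand(3), rand(3))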

A package author can additionally choose to define any of the above automatically defined functions for their package in the following cases:

  1. The default implementation is not efficient enough. For example, the default implementations of the pushforward and pullback in terms of jacobian incur some additional arithmetic required to encode both of these functions as Jacobians. Some savings can be made by defining the method for the backend directly (see the sketch after this list).
  2. To avoid control flow. The value_and_ versions of the functions use control flow to avoid querying the primal value more than once when the function is called multiple times, e.g. when calculating the gradient of a multivariate function with forward mode in chunks.
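
As a hypothetical sketch of case 1, a reverse-mode backend could bypass the Jacobian-based default and wire pullback_function straight to its native VJP (ReverseBackend and native_vjp below are placeholders, not real APIs):

using AbstractDifferentiation

# Purely illustrative reverse-mode backend and a stand-in for its native reverse rule.
struct ReverseBackend <: AbstractDifferentiation.AbstractBackend end
native_vjp(f, xs, ws) = error("stand-in for the package's own reverse rule")

# Case 1: skip materializing Jacobians and call the native VJP directly.
function AbstractDifferentiation.pullback_function(::ReverseBackend, f, xs...)
    return ws -> native_vjp(f, xs, ws)
end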

I tried to keep the restrictions minimal in my implementation. Looking forward to your feedback!

The main remaining items to do here are:

  • Test the hessian functions
  • Test the lazy operators
  • Write documentation

@mohamed82008 changed the title from "[WIP] Abstract interface implementation" to "[WIP] Abstract differentiation interface" on Feb 8, 2021
Comment on lines 14 to 23
struct HigherOrderBackend{B} <: AbstractBackend
    backends::B  # tuple of backends; the last entry is the lowest-level backend
end
# Drop the last (lowest-level) backend, keeping the rest in order.
reduceorder(b::AbstractBackend) = b
function reduceorder(b::HigherOrderBackend)
    return HigherOrderBackend(reverse(Base.tail(reverse(b.backends))))
end
lowest(b::AbstractBackend) = b
lowest(b::HigherOrderBackend) = b.backends[end]
# Equivalent to b.backends[end-1]; see the discussion below.
secondlowest(b::HigherOrderBackend) = lowest(reduceorder(b))
Member:
I don't get this part

Member Author:
It's an over-complicated way to get b.backends[end-1]. I was trying to be generic but I don't think generic helps here.

Member Author:
The lowest-level backend is b.backends[end]. The second lowest is b.backends[end-1]. In forward-over-reverse, the lowest is reverse and the second lowest is forward.
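
For example (a sketch with hypothetical ForwardBackend and ReverseBackend structs; per the code above, the last entry of the tuple is the lowest-level backend):

ab = HigherOrderBackend((ForwardBackend(), ReverseBackend()))

lowest(ab)        # ReverseBackend() -- the innermost differentiation
secondlowest(ab)  # ForwardBackend() -- applied on top, i.e. forward-over-reverse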

@willtebbutt (Member) commented Feb 9, 2021

This all looks great. Any chance we could go with pushforward instead of pushforward_function etc?

Please ignore this.

@willtebbutt (Member) commented Feb 9, 2021

I'm still unclear on why the primitive that everything is implemented in terms of is jacobian, rather than something involving pushforward / pullback.

It seems to me that the way ADs are going to wind up implementing this interface is by defining jacobian in terms of evaluations of their native pushforwards (forwards-mode) or pullbacks (reverse-mode). Then this package defines its version of pushforwards / pullbacks in terms of jacobian. Does this not seem backwards?

Is there a reason not to

  1. require that an AD implement some variant on pushforward / pullback, depending on its mode.
  2. implement jacobian in terms of those?

@mohamed82008 (Member Author)

Is there a reason not to

  1. require that an AD implement some variant on pushforward / pullback, depending on its mode.
  2. implement jacobian in terms of those?

No. I will write macros that let you define any one of the three and get the other 2 for free.

@mohamed82008 (Member Author)

@willtebbutt how would you define the Jacobian of a multi-input function using the jvp? What do you pushforward?

@mohamed82008 (Member Author)

So it's not clear to me how to define the jacobian function from the jvp or j'vp without committing to a representation for the differential. The best I can think of is to have users define an identity_like function for the arguments or outputs to pushforward or pullback.
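
A rough sketch of that identity_like idea for plain vectors (hypothetical helpers, committing to a vector-of-one-hot-vectors tangent representation):

# Hypothetical: the "basis" tangents of a vector input, one one-hot vector per entry.
function identity_like(x::AbstractVector)
    return [[i == j ? one(eltype(x)) : zero(eltype(x)) for i in eachindex(x)] for j in eachindex(x)]
end

# One Jacobian column per basis tangent, assuming pf maps a single input tangent
# to a single output tangent (a jvp).
function jacobian_from_pushforward(pf, x::AbstractVector)
    return reduce(hcat, [pf(t) for t in identity_like(x)])
end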

@willtebbutt (Member) commented Feb 9, 2021

@willtebbutt how would you define the Jacobian of a multi-input function using the jvp? What do you pushforward?

To be honest I don't even know how to define the Jacobian in this context, let alone how to construct it using a pushforward. Do you have thoughts on how this should be done?

@mohamed82008 (Member Author) commented Feb 9, 2021

So it seems that generically defining a jacobian using the pushforward_function or pullback_function is more difficult than the other way around. Essentially, you have to commit to a certain representation for the tangents or cotangents. Does your pushforward/pullback support multiple tangents/cotangents? Are the multiple tangents a vector of vectors or a matrix? These questions can have different answers in different packages, and I don't want to make one the default. So I am left with my initial design, in which the jacobian is the only primitive. The nice thing about this is that the pushforward and pullback now come for free so long as:

  1. dot is defined for the output of the function and its cotangent representation.
  2. + and * are defined for the input to the function and the tangent representation.

These assumptions are representation-agnostic. They just assume that some functions are defined.

For specific AD packages that want to commit to a specific tangent or cotangent representation, they can define pushforward_function or pullback_function as a primitive and then define jacobian in terms of that. But this belongs in the AD package not here imo.
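
For concreteness, a sketch of the "for free" direction in the simple array case, where those assumptions reduce to matrix products (illustrative names, not necessarily the exact code in this PR):

# One Jacobian per input; the JVP sums J_i * v_i over the inputs (only * and + needed).
function pushforward_from_jacobian(ab, f, xs...)
    return (vs...) -> begin
        Js = jacobian(ab, f, xs...)
        return sum(J * v for (J, v) in zip(Js, vs))
    end
end

# The VJP contracts the output cotangent w against each Jacobian.
function pullback_from_jacobian(ab, f, xs...)
    return w -> map(J -> adjoint(J) * w, jacobian(ab, f, xs...))
end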

@willtebbutt (Member) commented Feb 9, 2021

@mohamed82008 it's still not clear to me that we've figured out how to define the Jacobian in the first place.

Let's forget about jvps and vjps for the time being, how are you proposing to define the Jacobian of some function f: A -> B, where neither A nor B subtype Vector{<:Real}?

@mohamed82008 (Member Author)

Do you have thoughts on how this should be done?

I take the gradient case as a reference. So if we return a tuple for the gradient of a scalar-valued function with multiple arguments, then a tuple of Jacobians makes sense for vector-valued functions. Similarly, for single-input, multi-output functions, a tuple can be returned, but it means something different. The complicated case is the multi-input, multi-output case, because you need to consider all combinations. So it's not enough to define the differential of a struct; we need a type for the derivative of one struct wrt another.

But even for a single-input, single-output function, do we pass a vector of one-hot tangent vectors or an identity matrix to the pushforward? Ideally both should be supported, but I am afraid some packages or adjoint rules may only work with the vector-of-vectors case or the matrix case, and converting between representations is not something that I think belongs here, simply because the derivative representation problem isn't tackled here at all.

Let's forget about jvps and vjps for the time being

Hmm, this is tempting, but the current implementation already works out of the box for functions with multiple array-like inputs and a single array-like output, with mild assumptions. But I imagine there is little use for these functions anyway outside the context of AD implementation. Most people just need derivatives, gradients, Jacobians, and Hessians.

how are you proposing to define the Jacobian of some function f: A -> B, where neither A nor B subtype Vector{<:Real}?

I am not proposing any! I think this is an interesting problem to solve, perhaps in ChainRulesCore, where differential types are defined. I suspect something like https://github.com/jonniedie/ComponentArrays.jl may come in handy.

@mohamed82008 (Member Author)

As an aside, personally I think the best representation is a good old matrix! Let's agree to always vectorize all the inputs and all the outputs, and have a decoder that decodes each element to the derivative it represents. Then you can query this special matrix in different ways and get different differential structs out of it.
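
A toy sketch of that flatten-and-decode idea (purely illustrative, not part of this PR):

# Hypothetical: flatten a collection of array arguments into one long vector,
# together with a decoder that restores the original shapes.
function flatten(xs...)
    sizes = map(size, xs)
    v = vcat(map(vec, xs)...)
    decoder = function (w)
        parts = Any[]
        offset = 0
        for sz in sizes
            n = prod(sz)
            push!(parts, reshape(w[offset+1:offset+n], sz))
            offset += n
        end
        return Tuple(parts)
    end
    return v, decoder
end

v, decode = flatten(rand(2, 2), rand(3))
decode(v)  # back to a 2×2 matrix and a length-3 vector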

@mohamed82008 (Member Author)

This separates the representation problem from the AD problem. Both are interesting but mixing them is a nightmare.

@mohamed82008 (Member Author) commented Feb 9, 2021

I will go ahead and test the current implementation with the most common high level use cases for all the common AD packages. If tests pass, I think we can merge and release and then revisit later if we come up with a better design.

@mohamed82008 (Member Author)

I think the package is useful enough even if it only supports number and array inputs and outputs (single output) which is like 90% of the AD use cases out there.

@willtebbutt (Member)

I think the package is useful enough even if it only supports number and array inputs and outputs (single output)

This makes sense to me. We know how to implement this in terms of the ADs we have using jvps and vjps, and I agree that it's probably useful.

@mohamed82008 (Member Author)

I may have figured out a nice-ish solution. This got second-tiered on my priority list though, so I will get back to this some time next week.

@oxinabox (Member)

I will have time to review this next week, hopefully

@mohamed82008 (Member Author)

I pushed what I have. It's not fully functional yet. Until later.

@oschulz commented Mar 15, 2021

Thanks for this initiative! I was looking for a package providing a common AD API exactly like this.

As far as

Lazy Jacobian and Jacobian transpose vector/matrix multiplication.
Lazy Hessian and Hessian transpose vector/matrix multiplication.

are concerned, a nice way to handle this might be an additional package ADLinearMaps.jl, based on both AbstractDifferentiation.jl (pushforward_function and pullback_function) and LinearMaps.jl. We're currently using LinearMaps.jl in MGVI.jl as a common API for JVP and VJP; it feels very natural.
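
A rough sketch of the idea (assuming a single vector input and output, and that the backend's pushforward/pullback closures map one vector to one vector; this is not an existing package API):

using LinearMaps

# Hypothetical: expose the Jacobian of f at x as a lazy LinearMap, with forward
# application via the pushforward and adjoint application via the pullback.
function jacobian_linearmap(ab, f, x::AbstractVector)
    y = f(x)
    jvp = v -> pushforward_function(ab, f, x)(v)
    vjp = w -> pullback_function(ab, f, x)(w)
    return LinearMap(jvp, vjp, length(y), length(x))
end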

@CarloLucibello

Would it make sense for this to be part of ChainRulesCore?

@oxinabox (Member)

We've talked about it.
Not today. Maybe one day.
We're not blocking each other over it.

They are kind of opposite ends of the abstraction stack.

ChainRulesCore will soon get an abstraction (currently pencilled in as configurable rules) that will let it do things like call back into AD.
JuliaDiff/ChainRulesCore.jl#68
The AD system will need to provide ChainRules with one of those (or settle for the default, which is what we have now, and which can't call back into AD).

Having one of these (beyond the default) gives the ability to do value_and_directional_derivative (frule) and value_and_pullback_function (rrule).
Those are the things ChainRulesCore needs in order to be able to write rules for map etc.

Having those is also enough to be able to implement everything in this API.
(The converse is not quite true as ChainRules' configured rules also need to have traits about mutation support and some other things)

@oschulz commented Mar 15, 2021

They are kind of opposite ends of the abstraction stack.

Also, from what I understand, ForwardDiff at least will not adopt ChainRulesCore any time soon (if ever), right? But maybe it could support AbstractDifferentiation.jl?

@mohamed82008 (Member Author)

But maybe it could support AbstractDifferentiation.jl?

Yes. The main users of AbstractDifferentiation will be users of AD. The main users of ChainRulesCore are developers of AD packages. So they are at two different levels of abstraction, as Lyndon said.

@oschulz commented Mar 15, 2021

What about packages like FiniteDiff.jl and FiniteDifferences.jl? Strictly speaking, they do numerical rather than automatic differentiation, but in contexts where AD is not possible (e.g. because one has to call external code) and the number of dims is not too high, it would be very useful to be able to use them via the AbstractDifferentiation.jl interface, right?

@mohamed82008 (Member Author)

it would be very useful to be able to use them via the AbstractDifferentiation.jl interface, right?

Right. All the tests in this PR so far are using finite difference. So they are definitely in scope.
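
For example (reusing the hypothetical FDMBackend sketched after the PR description above):

# A black-box function we cannot differentiate through with AD,
# e.g. because it shells out to external code.
blackbox(x) = [sum(abs2, x), prod(x)]

ab = FDMBackend()  # the hypothetical finite-difference backend from the earlier sketch
AbstractDifferentiation.jacobian(ab, blackbox, rand(4))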

@mohamed82008 (Member Author)

Looks like this PR fell into the black hole of forgotten PRs. @frankschae has been secretly working on fixing the errors here though in his fork. We should see more activity here soon. Would be nice to get some attention from potential reviewers in the coming 1-2 weeks.

@mohamed82008 merged commit 5a21414 into master on Aug 28, 2021
@gdalle deleted the mt/interface branch on December 21, 2023