Use cases:

  • MAP and MLE estimation with gradient descent or quasi-Newton methods (log density, order 1; see the sketch after this list)
  • posterior sampling with HMC (log density, order 1 for Euclidean metrics, order 3 for Riemannian metrics)
  • standard errors and posterior covariance with the Laplace approximation (log density, order 2)
  • maximum marginal likelihood and marginal maximum a posteriori estimation with Monte Carlo expectation methods (E_mc, order 1)
  • variational inference with stochastic gradient descent based on nested expectations (E_mc, order 1)
  • sensitivity analysis with respect to data, hyperparameters, and parameters
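
As a toy sketch of the first use case (my own example, not from any particular library): MAP estimation by gradient ascent on the log posterior of a conjugate normal model, where the hand-written first-order gradient below is exactly the quantity AD would supply for a real model.

```python
import numpy as np

# Toy model: y_i ~ Normal(theta, 1) with prior theta ~ Normal(0, 1).
# Log posterior (up to a constant): -0.5 * sum((y - theta)^2) - 0.5 * theta^2
y = np.array([1.2, 0.8, 1.5, 0.9])

def grad_log_post(theta):
    # First-order derivative of the log posterior; for a real model this is
    # what automatic differentiation would provide.
    return np.sum(y - theta) - theta

theta, lr = 0.0, 0.1
for _ in range(200):
    theta += lr * grad_log_post(theta)   # gradient ascent step

print(theta)                      # converges to the MAP estimate
print(y.sum() / (len(y) + 1))     # closed-form MAP for this conjugate toy
```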

Two types of automatic differentiation (AD) exist: forward mode and reverse mode.

| mode | forward | reverse |
| --- | --- | --- |
| derivative | \dot{u} (du/dx) | \bar{u} (dy/du) |
| map | R -> R^N | R^M -> R |
| rule | dual numbers, tangent propagation | adjoint propagation |
Forward mode is input-centered (starting from the input x), while reverse mode is output-centered (starting from the output y).
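
Stated in matrix form (a standard way to summarize this, with J the Jacobian of y = f(x)): a single forward-mode pass computes a Jacobian-vector product, while a single reverse-mode pass computes a vector-Jacobian product.

$latex \text{forward:}\;\; \dot{y} = J\,\dot{x} \qquad\qquad \text{reverse:}\;\; \bar{x} = J^{\top}\bar{y}$

So for a scalar output (R^M -> R), one reverse pass seeded with \bar{y} = 1 returns the full gradient, whereas forward mode would need M passes, one per input direction; the situation reverses for R -> R^N.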

If b = f(a), the tangent is $latex \dot{b} = \frac{\partial f}{\partial a}\,\dot{a}$, while the adjoint is $latex \bar{a} = \bar{b}\,\frac{\partial f}{\partial a}$. Recall the philosophy: forward mode is input-centered and reverse mode is output-centered, matching their natural maps R -> R^N and R^M -> R.

Reverse: Adjoints are propagated in the reverse order of function computation, hence the name of the method.
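
As a small worked illustration (my own notation), take the chain x -> a = f(x) -> b = g(a) -> y = h(b). The forward sweep propagates tangents in the same order as the evaluation; the reverse sweep, seeded with $latex \bar{y} = 1$, visits the nodes in the opposite order:

$latex \text{forward:}\;\; \dot{a} = f'(x)\,\dot{x},\quad \dot{b} = g'(a)\,\dot{a},\quad \dot{y} = h'(b)\,\dot{b}$

$latex \text{reverse:}\;\; \bar{b} = h'(b)\,\bar{y},\quad \bar{a} = g'(a)\,\bar{b},\quad \bar{x} = f'(x)\,\bar{a}$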

Without explicitly calculating the gradient of an expression, we can derive the same result with dual-number operations; the following are the basic operation rules for dual numbers <u, u'> and <v, v'>.

Dual-number operation rules:
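
The standard identities, where u' and v' are the tangent components, are:

$latex \langle u, u' \rangle + \langle v, v' \rangle = \langle u + v,\; u' + v' \rangle$

$latex \langle u, u' \rangle \times \langle v, v' \rangle = \langle u\,v,\; u'\,v + u\,v' \rangle$

$latex f(\langle u, u' \rangle) = \langle f(u),\; f'(u)\,u' \rangle$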

For example, if the input tangents are seeded with a direction vector v, the final dual number's tangent (<f, f'>) equals the gradient-vector product (grad_f · v).
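
A minimal sketch in Python (a toy Dual class of my own, not any particular library), evaluating f(x1, x2) = x1*x2 + sin(x1) with the tangents seeded by the direction v = (1, 0):

```python
import math

class Dual:
    """Dual number <value, tangent> for forward-mode AD."""
    def __init__(self, value, tangent=0.0):
        self.value, self.tangent = value, tangent

    def __add__(self, other):
        # <u, u'> + <v, v'> = <u + v, u' + v'>
        return Dual(self.value + other.value, self.tangent + other.tangent)

    def __mul__(self, other):
        # <u, u'> * <v, v'> = <u v, u' v + u v'>
        return Dual(self.value * other.value,
                    self.value * other.tangent + self.tangent * other.value)

def sin(d):
    # elementary function rule: f(<u, u'>) = <f(u), f'(u) u'>
    return Dual(math.sin(d.value), math.cos(d.value) * d.tangent)

def f(x1, x2):
    return x1 * x2 + sin(x1)

x1, x2 = Dual(1.0, 1.0), Dual(2.0, 0.0)   # v = (1, 0) picks out df/dx1
out = f(x1, x2)
print(out.value)    # f(1, 2) = 2 + sin(1)
print(out.tangent)  # grad_f . v = df/dx1 = x2 + cos(x1) = 2 + cos(1)
```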

For reverse mode, a variable may receive gradient contributions from several downstream expressions (a could also be used in another subexpression involving c, for example), so cumulative derivatives are used: each incoming adjoint is added to the variable's running total, as in the sketch below.
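
A minimal sketch of reverse-mode accumulation in Python (my own toy code, not Stan's implementation); note the += where adjoints from different uses of the same variable are summed:

```python
class Var:
    """Minimal reverse-mode node: value, accumulated adjoint, and parent edges."""
    def __init__(self, value, parents=()):
        self.value = value
        self.adjoint = 0.0        # cumulative derivative dy/d(this variable)
        self.parents = parents    # list of (parent_node, local_derivative)

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def backward(output):
    """Propagate adjoints in the reverse of the computation order."""
    # Build a topological order of the graph (inputs first, output last).
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)
    visit(output)

    output.adjoint = 1.0          # seed: dy/dy = 1
    for node in reversed(order):  # reverse sweep over the computation
        for parent, local in node.parents:
            parent.adjoint += node.adjoint * local   # accumulate, don't overwrite

# a feeds into two products, so its adjoint accumulates two contributions.
a, b, c = Var(2.0), Var(3.0), Var(4.0)
y = a * b + a * c
backward(y)
print(a.adjoint)   # dy/da = b + c = 7.0
print(b.adjoint)   # dy/db = a = 2.0
```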

Stan exposes Hessians and Hessian-vector products in the modelfit class, with which $latex H_{\eta,\eta}$ can be calculated. Refer to the following Sensitivity post.
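
For intuition only (this is not Stan's API): a Hessian-vector product $latex H v$ can be approximated from two gradient evaluations, which is a handy check on an AD-based implementation; with AD proper one would instead nest forward mode over reverse mode.

```python
import numpy as np

# Toy log density: f(x) = -0.5 * x^T A x, so its Hessian is -A.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def grad_f(x):
    # Analytic gradient of the toy log density; in practice this would come
    # from reverse-mode AD (e.g. the gradient of a model's log density).
    return -A @ x

def hvp(grad, x, v, eps=1e-6):
    # Hessian-vector product via a finite difference of gradients:
    # H v ~= (grad(x + eps*v) - grad(x)) / eps
    return (grad(x + eps * v) - grad(x)) / eps

x = np.array([1.0, -1.0])
v = np.array([0.0, 1.0])
print(hvp(grad_f, x, v))   # ~= -A @ v = [-0.5, -1.0]
```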