goal: structured additive regression models where the latent field is Gaussian, controlled by a few hyperparameters and with non‐Gaussian response variables

The {f^(j)(·)}'s are unknown functions of the covariates u, the {β_k}'s represent the linear effects of the covariates z, and the ε_i's are unstructured error terms. Latent Gaussian models are the subset of Bayesian additive models with a structured additive predictor in which a Gaussian prior is assigned to α, {f^(j)(·)}, {β_k} and {ε_i}; the hyperparameters need not be Gaussian.
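
For reference, the structured additive predictor these terms belong to is usually written (standard INLA-style notation; the summation limits $latex n_f$ and $latex n_\beta$ are just my shorthand) as

$latex \eta_i = \alpha + \sum_{j=1}^{n_f} f^{(j)}(u_{ji}) + \sum_{k=1}^{n_\beta} \beta_k z_{ki} + \varepsilon_i$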

todo: the posteriors $latex p(\theta \mid y)$ and $latex p(\phi \mid y)$ via MCMC

You might wonder about two things: why not target $latex p(\theta, \phi \mid y)$ directly, and why is MCMC needed at all? Three problems have led to the development of adjoint-differentiated Laplace approximations.

Problem 1: no closed form.

Solution: use MCMC!

Problem 2: the joint posterior of $\theta, \phi$ can have bad geometry.

Solution: divide and conquer!

Problem 3: high dimension + multimodality.

Solution: HMC + a Laplace approximation differentiated in reverse mode, so that the Jacobian computation is unnecessary, for a speedup (see the two-step recipe and sketch below):

  1. Run HMC on φ by encoding π(φ) and π(y | φ) in the model block.
  2. Sample θ from π(θ | y, φ) in the generated quantities block.
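
To make the two steps concrete, here is a minimal NumPy sketch under toy assumptions of my own: a Poisson model y_i ~ Poisson(exp(θ_i)) with iid latent effects θ_i ~ N(0, σ²) and a single hyperparameter φ = log σ, with random-walk Metropolis standing in for HMC so the example has no dependencies. It illustrates the recipe, not the Stan implementation.

```python
# Toy sketch of the two-step recipe: Laplace approximation inside the target for phi,
# then a conditional Gaussian draw of theta for each retained phi.
import numpy as np

rng = np.random.default_rng(0)
y = rng.poisson(lam=np.exp(rng.normal(0.0, 1.0, size=50)))  # simulated counts

def laplace(phi, y, n_newton=20):
    """Newton mode-finding for log pi(theta | y, phi) plus the Laplace estimate
    of log pi(y | phi). Returns (log_marginal, mode, posterior variances)."""
    sigma2 = np.exp(2.0 * phi)
    theta = np.zeros(y.shape[0])
    for _ in range(n_newton):
        grad = y - np.exp(theta) - theta / sigma2        # d/dtheta log pi(y, theta | phi)
        hess = -np.exp(theta) - 1.0 / sigma2             # diagonal Hessian (negative)
        theta = theta - grad / hess                      # Newton step
    var = 1.0 / (np.exp(theta) + 1.0 / sigma2)           # diag of the Gaussian approximation
    log_joint = np.sum(y * theta - np.exp(theta)                    # Poisson terms (y! dropped)
                       - 0.5 * theta**2 / sigma2
                       - 0.5 * np.log(2.0 * np.pi * sigma2))        # Gaussian prior terms
    # log pi_G(y | phi) = log pi(y, theta* | phi) - log N(theta*; theta*, var)
    return log_joint + 0.5 * np.sum(np.log(2.0 * np.pi * var)), theta, var

def log_target(phi):
    # log pi(phi) + log pi_G(y | phi); the standard normal prior on phi is my arbitrary choice
    return -0.5 * phi**2 + laplace(phi, y)[0]

# Step 1: MCMC on phi against the Laplace-approximated marginal posterior
# (random-walk Metropolis here; in the actual workflow this is dynamic HMC).
phi, lp = 0.0, log_target(0.0)
phi_draws = []
for _ in range(2000):
    prop = phi + 0.3 * rng.normal()
    lp_prop = log_target(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        phi, lp = prop, lp_prop
    phi_draws.append(phi)

# Step 2: for each retained phi, draw theta from the Gaussian approximation pi_G(theta | y, phi).
theta_draws = []
for phi_s in phi_draws[1000:]:
    _, mode, var = laplace(phi_s, y)
    theta_draws.append(mode + np.sqrt(var) * rng.normal(size=mode.size))
```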

Problem 1:

reparameterized regression for ease of computation

When the likelihood is Gaussian, the posterior is available in closed form thanks to conjugacy.

For Poisson and Bernoulli likelihoods, we have no analytical expressions for π(y | φ) and π(θ | φ, y). In other words, the posterior marginals are not available in closed form because the response variables are non-Gaussian. SOS, MCMC!
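
As a one-line contrast (my notation, with a single latent value): the Gaussian case integrates in closed form, the Poisson case does not:

$latex \int \mathcal{N}(y \mid \theta, \sigma^2)\,\mathcal{N}(\theta \mid 0, \tau^2)\,d\theta = \mathcal{N}(y \mid 0, \sigma^2 + \tau^2), \qquad \int \mathrm{Poisson}(y \mid e^{\theta})\,\mathcal{N}(\theta \mid 0, \tau^2)\,d\theta \ \text{has no closed form.}$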

Problems 2 & 3:

Use a Gaussian approximation to integrate over the latent values (Laplace approximation, INLA, expectation propagation).

Here, a Laplace approximation marginalizes out the latent Gaussian variables, and the remaining hyperparameters are then integrated out using dynamic Hamiltonian Monte Carlo.
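
Concretely, this is the standard Laplace/INLA identity, with $latex \theta^*(\phi)$ the mode of $latex \pi(\theta \mid y, \phi)$ and $latex \pi_G$ its Gaussian approximation:

$latex \pi(y \mid \phi) \;\approx\; \pi_G(y \mid \phi) \;=\; \left.\frac{\pi(y \mid \theta)\,\pi(\theta \mid \phi)}{\pi_G(\theta \mid y, \phi)}\right|_{\theta = \theta^*(\phi)}$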

We use reverse-mode automatic differentiation (AD): Algorithm 2 instead of Algorithm 1, based on Theorem 1, which gives a formula for the gradient of the approximate log marginal likelihood with respect to the hyperparameters.
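
This is not the paper's Theorem 1 verbatim; below is just a toy NumPy illustration of the adjoint idea it relies on: when θ*(φ) is defined implicitly by a stationarity condition, the gradient of a scalar (like the log marginal) needs only one extra linear solve, never the full Jacobian dθ*/dφ. All names here (A, B, c, f, g) are made up for the demo.

```python
# Adjoint trick for differentiating through an implicitly defined "mode":
# g(theta, phi) = A theta - B phi = 0 defines theta*(phi); F(phi) = f(theta*(phi), phi)
# with f(theta, phi) = 0.5 ||theta||^2 + c . phi. One adjoint solve replaces the
# (n x m) Jacobian d theta*/d phi.
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 3
A = rng.normal(size=(n, n)); A = A @ A.T + n * np.eye(n)   # SPD, plays the role of a Hessian
B = rng.normal(size=(n, m))
c = rng.normal(size=m)
phi = rng.normal(size=m)

theta_star = np.linalg.solve(A, B @ phi)  # the implicitly defined mode, g(theta*, phi) = 0

f_theta = theta_star                      # df/dtheta evaluated at the mode
f_phi = c                                 # df/dphi (explicit part)

lam = np.linalg.solve(A.T, f_theta)       # adjoint system: (dg/dtheta)^T lam = df/dtheta
grad = f_phi - (-B).T @ lam               # dF/dphi = df/dphi - (dg/dphi)^T lam

# Check against central finite differences.
def F(p):
    t = np.linalg.solve(A, B @ p)
    return 0.5 * t @ t + c @ p

fd = np.array([(F(phi + 1e-6 * e) - F(phi - 1e-6 * e)) / 2e-6 for e in np.eye(m)])
print(np.allclose(grad, fd, atol=1e-5))   # should print True
```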

Original vs. dynamic HMC (Laplace + HMC)

My remaining question: where exactly is the gradient of the approximate marginal log density used? In my next post, I will delve into auto-diff concepts.