Computing the posterior

$$
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\,p(\theta)}{p(y)}, \qquad
p(y) \;=\; \int p(y \mid \theta)\,p(\theta)\,d\theta \;=\; \mathbb{E}_{p(\theta)}\big[p(y \mid \theta)\big] \tag{1}
$$

is our target. The marginal $p(y)$ is written as an expectation over the prior, which is how we compute it when we do not know the exact data distribution; even when we do, the computation is burdensome in high dimensions. Approximations around equation 1 have been attempted in terms of both modeling and computation.
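To make the burden concrete, here is a minimal sketch, assuming a toy Gaussian model of my own choosing, of the naive Monte Carlo estimate of $p(y)$ as an average of the likelihood over prior draws; the estimator is unbiased, but its variance deteriorates quickly as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_likelihood(theta, y, sigma=1.0):
    """Gaussian likelihood: y ~ N(theta, sigma^2 I), with theta a d-vector."""
    return -0.5 * np.sum((y - theta) ** 2) / sigma**2 - y.size * 0.5 * np.log(2 * np.pi * sigma**2)

def naive_marginal(y, dim, n_draws=10_000):
    """Estimate p(y) = E_{p(theta)}[p(y | theta)] with draws from a N(0, I) prior."""
    thetas = rng.normal(size=(n_draws, dim))           # prior draws
    logliks = np.array([log_likelihood(t, y) for t in thetas])
    m = logliks.max()                                  # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(logliks - m)))

dim = 5
y = rng.normal(size=dim)                               # toy "observed" data
print("log p(y) estimate:", naive_marginal(y, dim))
```

As the dimension grows, almost all prior draws land where the likelihood is negligible, which is exactly the curse the text refers to.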
1. Modeling
By replacing the negative log-likelihood with a tailored loss function, we can obtain a classifier without having to know the full data distribution; PAC-Bayes is one example.
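For concreteness, a common way to write this idea down is the generalized (Gibbs) posterior; the notation below is my own and not tied to the equation numbering above:

$$
\pi_\lambda(\theta \mid y_{1:n}) \;\propto\; \exp\!\Big(-\lambda \sum_{i=1}^{n} \ell(\theta; y_i)\Big)\, p(\theta),
$$

which reduces to the usual posterior when $\ell$ is the negative log-likelihood and $\lambda = 1$; PAC-Bayes bounds then control the risk of distributions of this form over $\theta$ without reference to the full data distribution.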
2. Computation
2.1 non-exact Monte Carlo methods
The Metropolis–Hastings algorithm requires computing a likelihood ratio in its acceptance step. It is known that if each likelihood is replaced by an unbiased Monte Carlo estimator, the algorithm remains exact (the pseudo-marginal construction). Approximate Bayesian Computation can be applied to models where the likelihood is too complex to compute but simulating data is relatively easy.
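Here is a minimal random-walk Metropolis–Hastings sketch, with a toy normal model I chose for illustration, showing where the likelihood (here, posterior) ratio enters the acceptance step:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_post(theta, y, sigma=1.0):
    """Unnormalized log posterior: y_i ~ N(theta, sigma^2) likelihood, N(0, 1) prior."""
    return -0.5 * np.sum((y - theta) ** 2) / sigma**2 - 0.5 * theta**2

def metropolis_hastings(y, n_iter=5000, step=0.5):
    theta = 0.0
    draws = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.normal()             # symmetric random-walk proposal
        # acceptance ratio = posterior ratio (likelihood ratio x prior ratio)
        log_alpha = log_post(prop, y) - log_post(theta, y)
        if np.log(rng.uniform()) < log_alpha:
            theta = prop
        draws[i] = theta
    return draws

y = rng.normal(loc=2.0, size=20)                       # toy data with true mean 2
draws = metropolis_hastings(y)
print("posterior mean estimate:", draws[1000:].mean())
```

In the pseudo-marginal variant, `log_post` would be replaced by the log of an unbiased estimator of the same quantity, and the chain would still target the exact posterior.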
2.2 asymptotic approximation
A Gaussian, or Laplace, approximation of the posterior is centered at the MLE, with covariance given by the inverse of the Fisher information. The Bernstein–von Mises theorem justifies this approximation for parametric models under certain regularity conditions, and there are some extensions to non-parametric settings as well.
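A minimal sketch of the Laplace approximation, using a toy normal model of my own and scipy's generic BFGS optimizer (whose built-in inverse-Hessian approximation stands in for the exact curvature):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
y = rng.normal(loc=1.5, scale=2.0, size=50)

def neg_log_post(params):
    """Unnormalized negative log posterior for (mu, log_sigma) with weak normal priors."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    loglik = -0.5 * np.sum((y - mu) ** 2) / sigma**2 - y.size * log_sigma
    logprior = -0.5 * (mu**2 / 100.0 + log_sigma**2 / 100.0)
    return -(loglik + logprior)

# mode of the (unnormalized) posterior
opt = minimize(neg_log_post, x0=np.zeros(2), method="BFGS")
mode = opt.x

# covariance = inverse Hessian of the negative log posterior at the mode
# (here BFGS's approximation; an exact or finite-difference Hessian would also do)
cov = opt.hess_inv

print("Laplace mean:", mode)
print("Laplace covariance:\n", cov)
```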
2.3 approximation via optimization
Here the idea is to find the best approximation of the posterior among a set of prespecified probability distributions Q. Different metrics for measuring the distance between distributions include the Kullback–Leibler divergence, as in equations 3 and 4; computation is tractable when using the KL divergence. To extend the concept, we can either replace KL with other divergence measures or optimize a different objective function under KL.
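I do not reproduce equations 3 and 4 here, but the standard variational objective they presumably correspond to is

$$
q^\ast \;=\; \operatorname*{arg\,min}_{q \in Q}\; \mathrm{KL}\big(q(\theta)\,\|\,p(\theta \mid y)\big)
\;=\; \operatorname*{arg\,max}_{q \in Q}\; \mathbb{E}_{q}\big[\log p(y, \theta) - \log q(\theta)\big],
$$

where the second form (the ELBO) is what makes the computation tractable: it avoids the intractable marginal $p(y)$.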
For the first approach, non-KL measures can be found among the divergence measures of information geometry, including the Rényi and chi-squared divergences. The Wasserstein distance is another choice, for which theoretical guarantees have been provided by Huggins et al. (2020).
The second approach measures the KL divergence from the posterior to the approximating distribution instead of the other way around, as in expectation propagation. The concept has further extensions such as power EP and α-divergences.
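In symbols (my notation), the two approaches differ only in the direction of the divergence:

$$
\underbrace{\mathrm{KL}\big(q \,\|\, p(\cdot \mid y)\big)}_{\text{variational inference}}
\qquad \text{vs.} \qquad
\underbrace{\mathrm{KL}\big(p(\cdot \mid y) \,\|\, q\big)}_{\text{expectation propagation}},
$$

with the α-divergence family recovering each direction as a limiting case in α.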
To solve these optimization problems, methods including conjugate gradient, coordinate-wise optimization, gradient and stochastic gradient, and natural gradient methods have been applied. Convexity and smoothness properties, which provide the theoretical justification for the convergence of these algorithms, are analyzed by Domke (2019).
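As an illustration of the stochastic-gradient route, here is a small numpy sketch of my own: a Gaussian approximating family fit by single-sample reparameterization gradients of the ELBO on a conjugate normal model, so the result can be checked against the exact posterior.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=1.0, size=30)          # data: y_i ~ N(theta, 1), prior theta ~ N(0, 1)

# Gaussian approximating family q(theta) = N(m, s^2), with s = exp(log_s)
m, log_s = 0.0, 0.0
lr = 0.01

for step in range(5000):
    s = np.exp(log_s)
    eps = rng.normal()
    theta = m + s * eps                   # reparameterization trick
    g = np.sum(y - theta) - theta         # gradient of log joint w.r.t. theta
    # single-sample stochastic gradients of the ELBO
    grad_m = g
    grad_log_s = g * s * eps + 1.0        # +1 from the entropy term d(log s)/d(log_s)
    m += lr * grad_m
    log_s += lr * grad_log_s

n = y.size
print("VI:   mean %.3f  sd %.3f" % (m, np.exp(log_s)))
print("True: mean %.3f  sd %.3f" % (y.sum() / (n + 1), (1 / (n + 1)) ** 0.5))
```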
Related are ways for Stan to achieve efficiency through modeling, computation and approximation, and their diagnostics.
1. Modeling
– reparameterization: changing the model implementation can achieve the same performance as playing around with the algorithm; see Betancourt’s post, the Incomplete Reparameterizations and Equivalent Metrics paper, and the sketch below. A simple trick such as modeling the differences instead of the cumulative values for an HMM could be another example.
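A classic instance (my own summary of the kind of reparameterization those references discuss) is the centered versus non-centered parameterization:

$$
\theta \sim \mathcal{N}(\mu, \tau^2)
\qquad \Longleftrightarrow \qquad
\theta = \mu + \tau\,\tilde{\theta}, \quad \tilde{\theta} \sim \mathcal{N}(0, 1),
$$

two mathematically equivalent model implementations whose posterior geometries, and hence sampler performance, can differ sharply.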
2. Computation
– brute force computation: multiprocessing, parallel computing (within-chain parallelization), precomputing X^T X, faster adaptation, improved mass matrix
– enhance initial point quality with theoretical support (e.g. convex optimization)
– approximate computation in each iteration: Laplace, GMO, ADVI, EP (I would address validation in another post)
– to validate Bayesian approximate computation, both accuracy checks (comparison with full HMC, simulation-based calibration; see the sketch below) and asymptotic results (convergence) are needed. I assume PAC-Bayes (Probably Approximately Correct) is related to this.
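For the simulation-based calibration check mentioned above, here is a minimal sketch with a toy conjugate-normal model of my own, where the "approximate" posterior is actually exact, so the rank statistics should come out uniform:

```python
import numpy as np

rng = np.random.default_rng(4)
n_sims, n_obs, n_post = 1000, 10, 100

ranks = []
for _ in range(n_sims):
    # 1. draw a "true" parameter from the prior and simulate data from it
    mu_true = rng.normal()                            # prior: mu ~ N(0, 1)
    y = rng.normal(loc=mu_true, size=n_obs)           # likelihood: y_i ~ N(mu, 1)

    # 2. draw from the (here exact, conjugate) posterior N(post_mean, post_var)
    post_var = 1.0 / (n_obs + 1.0)
    post_mean = post_var * y.sum()
    post_draws = rng.normal(loc=post_mean, scale=post_var**0.5, size=n_post)

    # 3. rank of the true parameter among the posterior draws
    ranks.append(int(np.sum(post_draws < mu_true)))

# If the posterior computation is well calibrated, ranks are uniform on {0, ..., n_post}
hist, _ = np.histogram(ranks, bins=np.arange(n_post + 2))
print("rank counts (should look roughly flat):", hist[::10])
```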
Comments are the energy for a writer, thanks!