I am trying to make Stan's ADVI engine more robust in the following respects:

  1. stopping rule
  2. values returned from each iteration

1 is decided based on 2; examples of the latter are khat of the log likelihood ratio between the target and the approximating function (lr = log_p - log_g) and Rhat of the samples. This research suggested the following algorithm.
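As a sketch of the khat idea: given draws from the approximation g, compute lr = log_p - log_g and estimate the Pareto shape of the ratio tail. The function name and the simple Hill-type tail estimate below are my own illustration; the actual PSIS diagnostic uses the more careful Zhang & Stephens (2009) fit.

```python
import numpy as np

def khat_of_lr(log_p, log_g, tail_frac=0.2):
    """Rough Pareto-shape (khat) estimate for importance ratios.

    Uses a simple Hill-type tail estimator in log space as a
    stand-in for the full Zhang & Stephens fit that PSIS uses.
    Large khat (rule of thumb: > 0.7) flags that the
    approximation g is unreliable for the target p.
    """
    lr = log_p - log_g                  # log importance ratios
    lw = lr - lr.max()                  # stabilise in log space
    lw_sorted = np.sort(lw)
    m = max(int(len(lw) * tail_frac), 5)
    tail = lw_sorted[-m:]               # largest ~20% of log ratios
    u = lw_sorted[-m - 1]               # threshold just below the tail
    # Hill estimator: mean of log(w_i / u) over the tail.
    return float(np.mean(tail - u))
```

For example, with draws from g = N(0, 1) and a wider target p = N(0, 2), the ratios are heavy-tailed and khat comes out clearly positive.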

Two ideas from this research are interesting. The first is to regard the sequence of iterates produced by the optimizer as samples. Through this analogy, Rhat, which measures the convergence of a chain (whether the sampler has landed on the typical set), can be applied to the optimization scheme; as can be seen in line 3 of Alg. 1, J parallel optimization runs proceed. The second is that viewing optimization as producing a Markov chain justifies iterate averaging via the mean-reverting Ornstein-Uhlenbeck process, which admits a stationary distribution. In other words, SGD trajectories can be seen as a Markov chain that, under mild assumptions, admits a stationary distribution described by an Ornstein-Uhlenbeck process.

The suggested ideas are being pushed forward in the following two directions, which I will continue writing about in the next post.

1. A convergence criterion that is more specific to stochastic optimization and adaptively decreases the step size

2. A new stopping rule based on how much the variational approximation changes as the step size decreases

Diagnostics are evolving; the most recent updates (2020.01.06) are as follows.

- Rhat
- Effective sample size (ESS)
- Monte Carlo standard error (MCSE)
- Importance resampling (SIR)

Original IS could only compute expected values; it could not reproduce samples from the target distribution. By resampling the draws according to the probability ratio (p/q), samples from the target can be recovered, which is the basic idea behind SIR. See here for more explanation.
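A minimal sketch of SIR, assuming draws from an approximation g and log densities of both g and the target p at those draws (the function name is my own):

```python
import numpy as np

def sir_resample(draws, log_p, log_g, n_resample, rng):
    """Sampling/importance resampling (SIR).

    Resampling the draws with probability proportional to p/q
    turns weighted draws from g into (approximate) draws from p.
    """
    lw = log_p - log_g
    w = np.exp(lw - lw.max())       # stabilised importance ratios
    probs = w / w.sum()
    idx = rng.choice(len(draws), size=n_resample, replace=True, p=probs)
    return draws[idx]
```

For instance, resampling draws from g = N(0, 1) with weights targeting p = N(1, 1) yields a sample whose mean is close to 1.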

PSIS is needed before calculating the weights used for SIR, because regions with extreme lr would badly affect the resampled draws. It is known that diagnostics after resampling are trickier.