- no theoretical guarantee of accurate results (Wang18)
- marginal variances of the parameters are often underestimated (Turner11)
Turner11 identify two problems in applying vEM to time-series models
- compactness: when two modes are separated by an intermediate region of zero density, the approximation will be compact, i.e. confined to a single mode
- when the intermediate region does not dip to zero, the tightness of the variational lower bound depends on the parameters, which induces a strong bias
standard variational inference algorithms maximize the ELBO using coordinate ascent, whereas ADVI performs the maximization with a gradient-based algorithm (sketch below)
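A minimal sketch of the gradient-based route, assuming a 1-D Gaussian variational family and the reparameterization trick; the target density, step size, and sample count here are illustrative choices of mine, not part of ADVI's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D target: log p(x) = -(x - 3)^2 / 2 up to a constant,
# i.e. the true posterior is N(3, 1). Only its gradient is needed.
def grad_log_p(x):
    return -(x - 3.0)

# Gaussian variational family q = N(mu, sigma^2), parameterized by (mu, log_sigma).
mu, log_sigma = 0.0, 0.0
lr, n_samples = 0.05, 64

for _ in range(2000):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(n_samples)
    x = mu + sigma * eps                             # reparameterization trick
    g = grad_log_p(x)
    grad_mu = g.mean()                               # Monte Carlo d(ELBO)/d(mu)
    grad_log_sigma = (g * sigma * eps).mean() + 1.0  # +1 is the entropy term's gradient
    mu += lr * grad_mu                               # gradient *ascent* on the ELBO
    log_sigma += lr * grad_log_sigma

print(mu, np.exp(log_sigma))                         # -> approx 3.0 and 1.0
```

Coordinate ascent would instead cycle through closed-form updates for each factor; the gradient route only needs the gradient of log p.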
relative bias of different approximations depended not only on which parameter was sought, but also on its true value
MFVB avoids the overfitting problem which plagues MAP
the choice of variational distribution should be complemented with an analysis of how the variational bound depends on the model parameters
1.1 introduction
variational approach ~ approximate inference
vEM, variational Bayes ~ variational optimization of free energy applied to time series
- a poor approximation lowers the reported uncertainty, and the compactness property of VI means uncertainty cannot be propagated in time, which limits the usefulness of the retained distributional information
“consequence of the well-known compactness property of variational inference is a failure to propagate uncertainty in time, thus limiting the usefulness of the retained distributional information”
- analytically reveal systematic biases in the parameters found by vEM; a simpler variational approximation (MFVB) can lead to less bias
1.2 variational approach
EM finds maximum likelihood parameters for latent variable models, including Hidden Markov Models and linear or non-linear State Space Models (SSMs) for time-series.
vEM formulates EM as a variational optimization problem over the free energy, a lower bound on the log-likelihood (written out below)
MF: the mean-field approximation, in which each latent variable appears in a factor of its own
structured approximations: retain some of the dependencies between latent variables (e.g. the chain structure across time)
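For reference, the free energy being optimized has the standard form (x for latents, y for observations; notation mine, following the usual presentation rather than the chapter's exact symbols):

$$
\mathcal{F}(q,\theta) \;=\; \Big\langle \log \frac{p(y, x \mid \theta)}{q(x)} \Big\rangle_{q(x)} \;=\; \log p(y \mid \theta) \;-\; \mathrm{KL}\big(q(x)\,\big\|\,p(x \mid y,\theta)\big) \;\le\; \log p(y \mid \theta),
$$

and the mean-field family factors over latents, $q(x) = \prod_t q_t(x_t)$, while structured families keep some factors joint.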
1.3 VI is compact
variational approximations tend to be compact, i.e. to have smaller entropy than the true distribution; applied to MFVB, this explains the failure to propagate uncertainty between time-steps, and it inverts the expected relation between approximation accuracy and confidence (high confidence precisely where the approximation is poor)
VI: the approximation is zero wherever the true distribution is zero (zero-forcing), so it tends to have smaller entropy than the true distribution
for a Gaussian mixture, the approximation matches the mode with the largest variance (compact); variational approximations to independent component analysis, by contrast, can be non-compact
compactness is model-dependent: compact for clustering-type models, non-compact for ICA
VI fails to propagate uncertainty, which is especially harmful for time series (a numeric illustration below)
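A small numeric illustration of compactness, my own construction rather than anything from the paper: fit a single Gaussian q to a bimodal mixture by minimizing the reverse KL divergence KL(q||p) over a grid of (mu, sigma); the optimum locks onto one mode and has lower entropy than p:

```python
import numpy as np
from scipy.stats import norm

# Bimodal target: two well-separated Gaussian modes (toy example).
xs = np.linspace(-8, 8, 4001)
dx = xs[1] - xs[0]
p = 0.5 * norm.pdf(xs, -2, 0.5) + 0.5 * norm.pdf(xs, 2, 1.0)

def reverse_kl(mu, sigma):
    """KL(q || p) for q = N(mu, sigma^2), computed on the grid."""
    q = norm.pdf(xs, mu, sigma)
    mask = q > 1e-12                      # reverse KL only cares where q has mass
    return np.sum(q[mask] * np.log(q[mask] / (p[mask] + 1e-300))) * dx

# Brute-force search over the Gaussian variational family.
grid = [(m, s) for m in np.linspace(-4, 4, 81) for s in np.linspace(0.1, 4.0, 79)]
mu_star, s_star = min(grid, key=lambda t: reverse_kl(*t))

H_p = -np.sum(p * np.log(p + 1e-300)) * dx        # entropy of the true mixture
H_q = 0.5 * np.log(2 * np.pi * np.e * s_star**2)  # entropy of the Gaussian optimum
print(f"q* = N({mu_star:.2f}, {s_star:.2f}^2): H(q*) = {H_q:.2f} < H(p) = {H_p:.2f}")
```

Any q that straddles both modes pays an enormous KL penalty in the near-zero-density gap between them, which is exactly the compactness mechanism described above.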
MFVB factored over time: the update for the latent variable at time t follows from eq. 1.12 (written out below)
the variational update combines the likelihood with a variational prior-predictive built from the variational marginal of the latent at the immediately preceding time-step; this variational prior-predictive stands in for the true prior
temporally factored variational methods therefore recover a posterior approximation that is narrower (more compact) than the state-conditional distribution
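Written out, this is the standard mean-field update for a Markov chain (my notation; eq. 1.12 may differ in symbols):

$$
q_t(x_t) \;\propto\; p(y_t \mid x_t)\,\exp\Big( \big\langle \log p(x_t \mid x_{t-1}) \big\rangle_{q_{t-1}} + \big\langle \log p(x_{t+1} \mid x_t) \big\rangle_{q_{t+1}} \Big),
$$

i.e. the likelihood is combined with messages formed from the variational marginals of the two neighbouring time-steps rather than from the true smoothed posterior.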
time-series MFVB optimizes means while zero-temperature EM optimizes modes; this helps time-series MFVB avoid pathological likelihood spikes
the failure to propagate uncertainty becomes more salient as the time series develops stronger correlations; this is the opposite of an ideal approximation, which should become more precise as correlations strengthen and the data become more predictable (see the check below)
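A small numpy check of this effect, my own construction rather than code from the paper: for a linear-Gaussian chain the joint posterior has a tridiagonal precision matrix, and mean-field VB on a multivariate Gaussian converges to the correct means but to marginal variances equal to the inverse of the precision's diagonal; the underestimation worsens as the dynamics coefficient grows:

```python
import numpy as np

def mf_vs_true_var(lam, T=200, sig2_x=1.0, sig2_y=1.0):
    """Posterior over a linear-Gaussian chain x_t = lam * x_{t-1} + w_t, y_t = x_t + v_t.

    p(x | y) is Gaussian with tridiagonal precision Lam; mean-field VB
    converges to marginals N(mu_t, 1 / Lam[t, t]), whereas the exact
    marginal variance is the corresponding diagonal entry of inv(Lam).
    """
    Lam = np.zeros((T, T))
    i = np.arange(T)
    Lam[i, i] = 1.0 / sig2_y + 1.0 / sig2_x      # likelihood + own dynamics/prior term
    Lam[i[:-1], i[:-1]] += lam**2 / sig2_x       # contribution of the next-step dynamics
    Lam[i[:-1], i[1:]] = Lam[i[1:], i[:-1]] = -lam / sig2_x
    t = T // 2                                   # inspect the middle of the chain
    return 1.0 / Lam[t, t], np.linalg.inv(Lam)[t, t]

for lam in (0.1, 0.5, 0.9, 0.99):
    mf, exact = mf_vs_true_var(lam)
    print(f"lam={lam:5}: MF var={mf:.3f}  true var={exact:.3f}  ratio={mf / exact:.2f}")
```

The ratio of mean-field to true variance drops well below 1 as lam approaches 1: exactly the regime of strong temporal correlation where good uncertainty estimates matter most.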
1.4 biased VI
estimation bias increases with the number of parameters
VB may fail to preserve uncertainty, but it is still better than zero-temperature EM (see the decomposition below)
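To spell out why the bias arises (a standard decomposition, my notation): at convergence vEM maximizes the free energy rather than the likelihood,

$$
\hat{\theta}_{\mathrm{vEM}} \;=\; \operatorname*{arg\,max}_{\theta}\,\Big[ \log p(y \mid \theta) \;-\; \mathrm{KL}\big(q^{*}(x)\,\big\|\,p(x \mid y,\theta)\big) \Big],
$$

so the estimate is pulled toward parameter values where the posterior happens to be easy to approximate (small KL gap), not necessarily where the likelihood peaks.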