1. Goal
Here is a short summary of SBC, a widely used algorithm-testing framework, which I aim to extend as follows:
An update algorithm that finds a well-calibrated prior, or a range of well-calibrated priors, under certain conditions.
A. Well-calibration is the guide for the update.
B. "Find" requires a proof of convergence.
C. If B does not converge, certain conditions should be added to make the update converge.
Considering that the success of SBC lies in its ability to test implicit posterior sampling methods (a.k.a. the computation or inference algorithm, which is the object of our anatomy), the mathematical formulation faces the challenges described in Section 2.
For a concrete example, take an example from the SBC paper and suppose the SBC test fails. Then what? Our algorithm exits the testing and uses the vestige of information from the SBC test to update $\lambda_0$: we change the model and update with the loss function below.
A is nearly complete via coupling plus gradient descent w.r.t. the hyperparameter, with loss function $C\left(\lambda_{0}\right):= ||\int_{\theta^{\prime}} \int_{y} \int_{\theta} g\left(\theta \mid y, \pi_{\lambda_{0}}, f\right) f\left(y \mid \theta^{\prime}\right) \pi_{\lambda_{0}}\left(\theta^{\prime}\right) d \theta \, d y \, d \theta^{\prime}-\lambda_{0}||$, but I need help with B and C. A toy sketch of the update in A follows.
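To make A concrete, here is a minimal NumPy sketch of that loop under strong simplifying assumptions of mine: a conjugate normal model, $\lambda_0$ as the prior mean, the quantity inside the norm read as a summary (the mean) of the data-averaged posterior, and a deliberately miscalibrated $g$ that over-shrinks the posterior mean. None of these choices come from the SBC paper; they only make the gradient-descent mechanics visible.

```python
import numpy as np

TAU = SIGMA = 1.0  # prior and likelihood scales, fixed for simplicity

def data_averaged_mean(lam0, n=100_000, seed=0):
    """MC estimate of the data-averaged posterior mean under a broken g."""
    rng = np.random.default_rng(seed)         # common random numbers keep
    theta_p = rng.normal(lam0, TAU, n)        # finite differences stable
    y = rng.normal(theta_p, SIGMA)            # y ~ f(y | theta')
    exact = (TAU**2 * y + SIGMA**2 * lam0) / (TAU**2 + SIGMA**2)
    v = (TAU**2 * SIGMA**2) / (TAU**2 + SIGMA**2)
    theta = rng.normal(0.8 * exact, v**0.5)   # g over-shrinks the mean by 20%
    return theta.mean()

def C(lam0):  # || summary of data-averaged posterior - lambda_0 ||
    return abs(data_averaged_mean(lam0) - lam0)

lam, h, step = 3.0, 0.1, 0.5
for _ in range(60):                           # finite-difference gradient descent
    lam -= step * (C(lam + h) - C(lam - h)) / (2 * h)
print(lam)  # settles near 0, the only lambda_0 where this g is self-consistent
```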
Another view of my problem is solving for $\lambda_0$ given the likelihood $f$ and the sampler $g$: a solution of the self-consistency equation (Eq. 1 of the SBC paper), which is an identity under faithful computation. The interpretation of this self-consistency equation is that if you average the posterior distribution produced by the computation, $g(\theta \mid y)$, over the sampled joint $(\theta^{\prime}, y)$, you fully recover the prior distribution you started from. The sampled joint $(\theta^{\prime}, y)$ is constructed by first sampling $\theta^{\prime}$ from the prior, then sampling $y$ conditional on $\theta^{\prime}$ from the likelihood model $f$. The mechanism behind this remarkable recovery is the symmetry of $\theta$ and $\theta^{\prime}$ in the joint $(\theta, y, \theta^{\prime})$ space; a small numerical check follows.
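As a numerical sanity check on the self-consistency equation (with an exact conjugate $g$, my choice purely for illustration), averaging exact posterior draws over the sampled joint $(\theta^{\prime}, y)$ does reproduce the prior:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
MU0, TAU, SIGMA = 1.5, 1.0, 0.7
R = 100_000

theta_p = rng.normal(MU0, TAU, R)             # theta' ~ prior
y = rng.normal(theta_p, SIGMA)                # y ~ f(y | theta')
post_mean = (TAU**2 * y + SIGMA**2 * MU0) / (TAU**2 + SIGMA**2)
post_sd = (TAU**2 * SIGMA**2 / (TAU**2 + SIGMA**2)) ** 0.5
theta = rng.normal(post_mean, post_sd)        # theta ~ exact posterior g

# the data-averaged posterior should equal the prior N(MU0, TAU^2)
print(theta.mean(), theta.std())              # ~ (1.5, 1.0)
print(stats.kstest(theta, stats.norm(MU0, TAU).cdf).pvalue)
```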
2. Challenges
1. Formulating the data-averaged posterior
The current frontier for the data-averaged posterior is the operator
$\mathcal{T} \pi := \iint g(\theta \mid y, f, \pi)\, f(y \mid \theta^{\prime})\, \pi(\theta^{\prime})\, dy\, d\theta^{\prime}$ (edited 11.05).
Several procedural variations exist, among which I wish to fix the form that meets Goal B:
– the number of samples for each $f(y \mid \theta^{\prime})$ and $g(\theta \mid y, \pi_{\lambda_0}, f)$ (i.e., $\theta^{\prime} \rightarrow y \rightarrow \theta = 1:1:1 \; \text{vs} \; 1:N:M$)
– when to apply $g$: to each outcome in each dataset, or to combined outcomes from different datasets?
This answer shows how the first procedural change affects the formulation, but would it breach the use of the self-consistency equation? Here is the code that explicitly illustrates the differing results from the change in the second procedure; a minimal stand-in appears below.
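Since that code is behind a link, here is a self-contained stand-in of my own (conjugate normal model, exact $g$) showing the second variation in action: applying $g$ to each outcome separately recovers the prior, while applying it once to outcomes pooled across different $\theta^{\prime}$ draws breaks the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
MU0, TAU, SIGMA = 0.0, 1.0, 1.0

def post_draw(y):
    """One draw from the exact conjugate posterior given iid outcomes y."""
    prec = 1 / TAU**2 + np.size(y) / SIGMA**2
    mean = (MU0 / TAU**2 + np.sum(y) / SIGMA**2) / prec
    return rng.normal(mean, prec**-0.5)

R, K = 20_000, 5
theta_p = rng.normal(MU0, TAU, (R, K))        # K independent theta' per row
y = rng.normal(theta_p, SIGMA)                # one outcome per theta'

separate = np.array([post_draw(yi) for yi in y.ravel()])  # g per outcome
pooled = np.array([post_draw(row) for row in y])          # g on K pooled outcomes

print(separate.std())  # ~ TAU = 1.0: the prior is recovered
print(pooled.std())    # ~ 0.67: pooling across theta' breaks self-consistency
```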
2. Black-boxing the adaptive posterior sampling method $g$
Empirically, I observed convergence with a bernoulli_logit likelihood and a Laplace approximation. I need advice on which computation would be most promising for understanding this better, and on how to attack it for Goal B. Options are listed here, including both deterministic (variational) and stochastic estimators (sampling methods such as MCMC and IS). In the end, I can't do this for every $g$, which is why I feel black-boxing is needed; this is detailed in the second part of the possible connections below. For now, I am trying to understand how an algorithm's adaptiveness (e.g., the penalty parameter in the adaptive augmented Lagrangian method) can be modeled so as not to affect the convergence of the whole process, by studying papers like Convergence of Adaptive Finite Element Methods or Adaptive Augmented Lagrangian Methods. A toy reproduction of the empirical observation follows.
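Here is how I read that empirical setup, sketched under my own assumptions: $g$ is a Laplace approximation to a Bernoulli-logit posterior, and the update moves $\lambda_0$ to the mean of the data-averaged posterior (a fixed-point reading of the loss with the mean summary; this update rule is my assumption, not an established one). Under a faithful $g$ every $\lambda_0$ is a fixed point by self-consistency, so any systematic drift below measures the approximation's bias.

```python
import numpy as np

rng = np.random.default_rng(2)
TAU, N_OBS = 1.0, 20
sigmoid = lambda t: 1 / (1 + np.exp(-t))

def laplace_g(y, lam0):
    """One draw from the Laplace approx. of p(theta | y), theta ~ N(lam0, TAU^2)."""
    theta = lam0
    for _ in range(25):                                 # Newton iterations to the MAP
        p = sigmoid(theta)
        grad = np.sum(y) - N_OBS * p - (theta - lam0) / TAU**2
        hess = -N_OBS * p * (1 - p) - 1 / TAU**2
        theta -= grad / hess
    return rng.normal(theta, (-hess) ** -0.5)           # N(MAP, -H^{-1})

lam = 2.0
for _ in range(20):
    draws = []
    for _ in range(1_000):
        theta_p = rng.normal(lam, TAU)                  # theta' ~ prior
        y = rng.binomial(1, sigmoid(theta_p), N_OBS)    # y ~ bernoulli_logit
        draws.append(laplace_g(y, lam))
    lam = float(np.mean(draws))  # move lambda_0 to the data-averaged mean
print(lam)  # stability of this iterate is exactly the Goal B question
```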
3. Compromise between validation and verification via prior choice
Two causes exist for why we need different priors for simulation and inference: encoding constraints and user preference. This means splitting the two $\pi_{\lambda_0}$s in $C(\lambda_0)$, giving $||\int_{\theta^{\prime}} \int_{y} \int_{\theta} g\left(\theta \mid y, \pi^{i}_{\lambda_0}, f\right) f\left(y \mid \theta^{\prime}\right) \pi^{s}_{\lambda_0}\left(\theta^{\prime}\right) d \theta \, d y \, d \theta^{\prime}-\lambda_{0}||$, where $\pi^s$ and $\pi^i$ are the simulation and inference priors.
First, it is very important to be aware of the encoding constraints on the prior associated with a certain choice of algorithm. A theoretician might have difficulty appreciating this, but again, SBC lives in the real world and is sometimes the only choice for testing algorithms. For example, when choosing the HMC algorithm implemented in Stan, the encoding constraints are the well-known distributions, such as the normal, implemented in "6.4 Probability functions" of the Stan paper. This is why an ECDF of samples from $\pi_{\lambda_0}$ cannot be used as $\pi_{\lambda_0}$ in $g(\theta \mid y, \pi_{\lambda_0}, f)$ in the loss function above. Using a normal mixture to weaken this constraint has been attempted; a sketch of that idea follows.
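One way to operationalize the normal-mixture idea (a sketch; the bimodal stand-in prior and component count are mine) is to fit a Gaussian mixture to draws from the desired prior, so that what is handed to $g$ stays inside a family the sampler's language supports:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# stand-in for a prior outside the built-in families (illustration only)
draws = np.concatenate([rng.gamma(2.0, 1.0, 5_000), rng.normal(6.0, 0.5, 3_000)])

gm = GaussianMixture(n_components=3, random_state=0).fit(draws.reshape(-1, 1))
w, mu = gm.weights_, gm.means_.ravel()
sd = np.sqrt(gm.covariances_.ravel())
print(np.c_[w, mu, sd])
# each (w, mu, sd) triple is expressible in Stan via log_sum_exp over
# log(w_k) + normal_lpdf(theta | mu_k, sd_k), so the encoded prior
# respects the algorithm's encoding constraint
```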
Second is the computation-friendly prior. Imagine a modeler who wants a fat-tailed prior $\pi^s$ that induces extreme posterior geometry.
The posterior sampling method will then produce unreliable samples, and therefore all inferential results are untrustworthy. A compromise between inferential (validation) and computational (verification) calibration should be made, which is what my algorithm does: it transforms the prior in the modeler's head into a more computation-friendly one through iteration, and thereby improves the quality of inference.
These two practical causes serve as penalty terms that can aid Goal C. With a minute difference (encoding is a constraint on $\pi^i$, while computation-friendliness is a constraint on $\pi^s$), the two share the same mechanism when viewed as reversed processes.
However, how to formulate this penalizing action as a $C(\lambda_0) + \Omega(\lambda_0)$ form that convexifies the objective function is an open question; one hypothetical shape for $\Omega$ is sketched below.
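As one hypothetical candidate (my example; $\lambda_{\mathrm{ref}}$ denotes a computation-friendly reference hyperparameter), a Tikhonov-style penalty

$$\Omega(\lambda_0) = \frac{\alpha}{2}\, \lVert \lambda_0 - \lambda_{\mathrm{ref}} \rVert_2^2$$

convexifies the sum whenever the curvature of $C$ is bounded below: if $\nabla^2 C \succeq -L I$ on the region of interest, then $\nabla^2 (C + \Omega) \succeq (\alpha - L) I$, so any $\alpha \geq L$ suffices. The open part is whether the penalties induced by encoding constraints and computation-friendliness actually admit such a form.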
3. Possible connections
The seminars offered to first-year PhD students in Columbia IEOR are great in that I can learn where each professor's heart lies: what research problems they wish to work on with us, and why, all in one hour! The two ideas I found are both from there. Further connections are detailed in each blog post.
1. Regularization in neural networks
I found the optimization formulation of neural networks below, from the Adaptive algorithm design seminar with Donald Goldfarb (the G in L-BFGS, a default optimization algorithm in Stan!), similar to my goal: I might want to minimize the gap between generated and real data, and the output of the former procedure (a summary of the data-averaged posterior) becomes the input (prior) for the next procedure. As described in Sec. 2.3, techniques for regularization may be found by tracking the formulation of $\Omega$ in the NN literature.
2. Privacy algorithms
I found hope for analyzing the repeated application of a black-box algorithm in the "Analyzing repeated use of blackbox algorithm" seminar with Rachel Cummings.