Model+Algorithm space

  • aka. mathematical formulation + computational tool trying to reproduce the real world and posterior space. As the figure suggests, a combination of two mathematical formulations, prior and observational model (ie. data simulator), targets to simulate the real world. These form the parameter-data joint space once marginalized provide the form of posterior which computation tool aims to reproduce. Note that all 1+2 and 3 are approximations.
  • To diagnose the second approximation (component 3), understanding $p(\theta’| \theta)$ which I call GI operator (generate + inference) as the output of marginalizing y from $p(\theta’, y, \theta)$ is useful. When approximation is perfect, this operator is symmetric for every $\theta$ and this condition can be tested by observing posterior coverage of the simulated parameters.
  • p(θ, θ’) = p(θ, θ) symmetry could be proved using Bayes’ Theorem and self-consistency condition p(θ′)=p(θ):

$p\left(\theta^{\prime}, \theta\right)$

$=p(\theta) \int \mathrm{d} y p\left(\theta^{\prime} \mid y\right) p(y \mid \theta) $
$=p\left(\theta^{\prime}\right) \int \mathrm{d} y p\left(\theta^{\prime} \mid y\right) p(y \mid \theta) $

$=p\left(\theta, \theta’\right)$


further proof

  • MCMC algorithm is the approximation where theoretical error approaches zero as the number of simulations increases.
  • section 3.3. Approximate algorithms and approximate models from this paper along with the quote “We may never end up fitting the complex model we had intended to fit at first, either because it was too difficult to fit using currently available computational algorithms, or because existing data and prior information are not informative enough to allow useful inferences from the model, or simply because the process of model exploration lead us in a different direction than we had originally planned.”
information bottleneck and contraction from iterative algorithm from haykin
  • My aim is to find good prior and topics on iterative algorithm for computing the optimal manifold representation of data and EM algorithm convergence seems relevant.

Consistency

  1. quantitative
  • order/preferenc : A>B, B>C -> A>C

2. qualitative

  • eigenvector scale
    Consistency index in AHP: The difference in decision ratio is the difference in the size of the eigenvector. Better to be less than 10%.

Markov chain

  • Markov kernel guarantees an invariant measure
  • Recurrence combined with irreducibility guarantees the existence
    of an invariant measure. Give an example showing that the converse is in general false if the invariant measure is not a probability
  • reversible and stationary distribution (the former implies the latter)
  • only finite measure could have a stationary distribution; normalized from stationary measure

SLLN

  • time average converges to space average a.s

Ergodic

  • over long periods of time, the time spent by a system in some region of the phase space of microstates with the same energy is proportional to the volume of this region
  • proportion of time that the orbit of a point (i.e. the sequence x, T x, T2x, . . .) is in a set A, tends to µ(A)
  • measure-preserving system is ergodic if every set satisfying $T^{-1}A = A$ is trivial (ie mu(A) = 0 or 1)

Paremeterization

  • centering/non-centering has to be done on a context-by-context basis (keep strongly informative contexts’ likelihood functions centered)
    centered
  • degeneracies between the population parameters (population location, scale, and correlations if working with multivariate hierarchies) can be resolved only with more information (more data or stronger priors on those population parameters). Priors
  • additional information (more data) or marginalized likelihood wrt to data

Geometry

  • studies structured points
  • example of the structures are curvature, flow, geodesic
    • hamiltonian flow: A point (q,p) in the phase space transported by the Hamiltonian flow on the space equipped with $(M, \omega, H)$, exists hamiltonian flow $ dH ( . )= \omega(X_{H}, . )$
    • gradient flow df(.) = <gradf, .>; likewise when the space is equipped with (M, <>, f)

EM algorithm

  • attempt to learn high dimensional in a subcomponent-fixed iteration way. K-means clustering is an example of an ‘expectation–maximization’ (EM) algorithm, with the two steps, which we called ‘assignment’ and ‘update’, being known as the ‘E-step’ and the ‘M-step’ respectively.

Toric variety

  • Algebraic geometry studies the curves which are formed from observing the preimage of zero of polynomial.
  • function points on a plane where that function is zero will form a curve. An algebraic curve extends this notion to polynomials over a field in a given number of variables
  • Toric variety contains a torus as an open subset and is equipped with a torus action. A torus is a particular linear algebraic group.
  • equivalence between geometric objects (the varieties) and algebraic objects (some particular sets of polynomials called prime ideals).
  • example is a projective space

Malliavin calculus

  • stochastic calculus of variations.
  • extend the mathematical field of calculus of variations from deterministic functions to stochastic processes.

Random variable

because of the definition of random variable.

  • Preimage of random variable is not from Real to event space, but from borel set to a member of sigma field in event set.

Optimization usecase

Examples of optimization in learning (from Adam’s tweet)
-Why does training error go down if I add more features? -> Increases the feasible region and decreases objective
-Why do best subset, ridge, lasso help generalization? -> Equiv. to constraints on coefficients (l0,l1,l2) and less ability to fit training data
-Why can’t we just solve an optimization problem to do clustering or decision trees? -> Combinatorial optimization is time-consuming and we need heuristics
-Why not minimize MSE for logistic regression? -> Nonconvex problem and thus we look at deviance

Problem solving: analytic vs empirical solution

problem solving is finding the answer to the problem with the tool that

  • shows the existence or construction of a solution
  • does counting or optimizing the solution
  • related to induction vs deduction approach
  • solutionAnalyticalEmpirical??
    typeexact, perfect solutionapproximate, good enough solution
    finding processnon-iterative*iterative, guess and checkis induction an counterexample for *?
    requirementperfect model diagnose tool that measures improvement or pass/fail
    possibility of better or additional solution endsnever ends
    scopegloballocal
  • From existence, construction, enumeration, optimization goal from combinatorics,

Convex optimization

Knowing the existence of the solution, finding the tool to lead us to the solution, finding the solution are three different things. I will introduce in the order of:
1) theory behind the model formulation and solution guarantee
2) actual algorithms such as subgradient, polyhedral approximation, and proximal

from Bertsekas Convex Analysis and Optimization
  • Lagrange multipliers are used as a measure of the sensitivity of certain constraints which is my main interest. They are expected to exist if the tangent cone could be express as a polar cone of a set of constraint’s gradients. Recall that negative objective’s gradient is included in a normal cone at x which is a range space of constraint’s gradient) the tangent cone of x*. However, this expectation is valid on regularity conditions such as LICQ.

Duality has many forms including the following

LP y'b \leq c'x
Lagrange 1

minimax inequality (d* <= p*)

saddle point

if saddle points exists minimax inequality becomes equality

frenchel

with this duality btw subgradient and conditional gradient is proved (Bach15) and motivates mirror descent

dual cone & conjugate function

Some quotes to ponder upon
1. polarity relation between cones is a special case of conjugacy between convex functions.
I know that duality is $latex \min_{x \in K} f(x) = \min_{x \in K*} f*(x)$ but cannot understand how cones could be actually regarded as a function when it is a set.

2. general ferkas lemma is

Normal cone and tangent cones are orthogonal and based on this, it seems this implies a condition for normal cone of x, y \in N_{x}(x). x should be $latetx conv( N(y) basis) + cone(N_{y}(y))$. This might be related to the expression of objective’s gradient as a convex combination of active constraint’s gradient.

Optimization algorithm. The underlying principle is that iterating the well-guided approximation would lead us to the optimal solution. Examples are
– newton method: repeating the first-order approximation gradient (or subgradients for non-differentiable) to find the solution of the derivative. With the theoretical support of a unique solution from a convex setting (convex function on convex support) this tool brings us to the solution

  • the tool that decreases the objective function at every step does not guarantee that the optimal solution could be reached. Certain remediation like decreasing the step size is needed for convergence guarantee.
    – polyhedron, proximal approximation are underlying tools.
  • Nocedal classified iterative methods into line search and true region whose difference is whether we set the direction before or after the step size
  • Among the four interest of problem solving, optimizaition algorithm usually is related to the latter two, enumeration and optimization, but when the enumeration returns zero discovery, it could empirically provide answer to existence problem.

Combinatorics

  • counting with finite structures
  • existence, construction, enumeration tool, optimization tool are the target (I recommend that strategic order; induction to deduction, analytical to empirical)
  • related to algorithms analysis in computer science
  • subfield include algebraic, geometric, topological, arithmetic combinatorics and partition, graph, design, order, matroid, probabilistic theory are some main topics

Future topics

  • Groups and symmetric groups in Z, Q
  • Two-dimensional vector spaces and matrices in terms of groups
  • In relation to the inverse matrix, why is symmetry group necessary in the definition of the determinant, volume, and inverse matrix?
  • Eigenvalues and eigenvectors
  • The limit concept required for linear approximation of the derivative
  • Area and volume in the process of informing the integral with the inverse operation of the derivative
  • Basic Theorems and Implications of Taylor Series and Calculus
  • Green’s and Stokes’ theorem, which extend integrals to line integrals and area integrals
  • Cauchy’s Integral Theorem and Residual Theorem
  • Topology and differential geometry based on dot product
  • Gaussian-Bonet and 3D space classification
  • statistical models as specific objects of non-Euclidean geometry when viewed with dot product
  • Maxwell’s equations based on multivariate calculus

Korean

  • 정수와 유리수에서 군과 대칭군
  • 군의 관점에서 이차원 벡터공간과 행렬
  • 역행렬과 관련해서 행렬식과 부피, 역행렬의 정의에서 대칭군이 왜 필요한지
  • 고유치와 고유벡터
  • 미분의 선형 근사를 말하는 과정에서 필요한 극한개념
  • 미분의 역연산으로 적분을 알려주는 과정에서 넓이 및 부피의 개념
  • 테일러 급수와 미적분학의 기본정리와 함의
  • 적분을 선적분과 면적분으로 확장하는 그린 정리 및 스톡스 정리
  • 코시 적분정리와 유수정리
  • 내적을 토대로 위상수학과 미분기하학의 관점
  • 가우스-보넷과 3차원 공간 분류
  • 내적의 개념으로 공간을 바라보면 통계적 모델들이 구체적인 비유클리드 기하학의 대상
  • 다변수 미적분학을 토대로 맥스웰 방정식

군대수의 곱하기와 푸리에해석의 콘볼류션이 본질적으로 같은 연산
가환군의 푸리에변환과 국소옹골군의 표현론을 같은 관점

L^p-공간의 쌍대정리
한–바나하 정리
바나하–알라우글루 정리 unit circle is wk*-compact

Krein–Milman thm — A compact convex subset of a Hausdorff locally convex topological vector space is equal to the closed convex hull of its extreme points.

Stone weierstrass thm

X is a (locally) compact Hausdorff space and A is a subalgebra of C(X, R). Then A is dense in C(X, R) if and only if it separates points (+vanishes nowhere). Dense:- || f  − g|| < ε $\exists g \in A \forall f \in C(X,R)$

eg. X: C([a,b]), A: polynomial


다항식과 해석함수는 공통점이 많음
복소수체는 모든 대수방정식이 근을 가지도록 유리수체를 확장
모든 다항식은 복소수범위에서 일차 다항식들의 곱으로 인수분해
테일러 전개, 로랑 전개, 바이어쉬트라스 정리 등 극한과 미적분의 순서를 바꾸어야 하는 경우 고른수렴 개념 필요

  1. L(x,\lambda)=f(x)+\sum_{i=1}^{m} \lambda_{i} \underbrace{h_{i}(x)}_{\leq 0} \leq f(x) when proving, we first take inf to get g(\lambda, \nu) \leq p* then take sup to get d* \leq p*