Category Archives: Optimization

Variational inference with Normalizing Flows

I keep coming back to this ICML 2015 paper by Rezende and Mohamed (arXiv version). While this is not due to the particular novelty of the papers contents, I agree that the suggested approach is very promising for any inference approach, be it VI or adaptive Monte Carlo. The paper adopts the term normalizing flow for refering to the plain old change of variables formula for integrals. With the minor change of view that one can see this as a flow and the correct but slightly alien reference to a flow defined by the Langevin SDE or Fokker-Planck, both attributed only to ML/stats literature in the paper.
The theoretical contribution feels a little like a strawman: it simply states that, as Langevin and Hamiltonian dynamics can be seen as an infinitesimal normalizing flow, and both approximate the posterior when the step size goes to zero, normalizing flows can approximate the posterior arbitrarily well. This is of course nothing that was derived in the paper, nor is it news. Nor does it say anything about the practical approach suggested.
The invertible maps suggested have practical merit however, as they allow “splitting” of a mode into two, called the planar transformation (and plotted on the right of the image), as well as “attracting/repulsing” probability mass around a point. The Jacobian correction for both invertible maps being computable in time that is linear in the number of dimensions.

A non-parametric ensemble transform method for Bayesian inference

This 2013 paper by Sebastian Reich in the Journal on Scientific Computing introduces an approach called the Ensemble Transport Particle Filter (ETPF). The main innovation of  ETPF, when compared to SMC-like filtering methods, lies in the resampling step. Which is

  1.  based on an optimal transport idea and
  2. completely deterministic.

No rejuvenation step is used, contrary to the standard in SMC. While the notation is unfamiliar to me, coming from an SMC background, I’ll adopt it here: by \{x_i^f\}_{i=1}^M denote samples from the prior with density \pi_f (the f, meaning forecast, is probably owed to Reich having done a lot of Bayesian weather prediction). The idea is to transform these into samples \{x_i^a\}_{i=1}^M that follow the posterior density \pi_a (the a meaning analyzed), preferably without introducing unequal weights. Let the likelihood term be denoted by p(y|x) where y is the data and let w_i = \frac{p(y|x_i^f)}{\sum_{j=1}^M p(y|x_i^f)} be the normalized importance weight. The normalization in the denominator stems from the fact that in Bayesian inference we can often only evaluate an unnormalized version of the posterior \pi_a(x) \propto p(y|x) \pi_f(x).

Then the optimal transport idea enters. Given the discrete realizations \{x_i^f\}_{i=1}^M, \pi_f is approximated by assigning the discrete probability vector p_f = (1/M,\dots,1/M)^\top, while \pi_a is approximated by the probability vector p_a=(w_1,\dots, w_m)^\top. Now we construct a joint probability between the discrete random variables distributed according to p_f and those distributed according to p_a, i.e. a matrix \mathbf{T} with non-negative entries summing to 1 which has the column sum p_f and row sum p_a (another view would be that \mathbf{T} is a discrete copula which has prior and posterior as marginals). Let \pi_\mathbf{T}(x^f,x^a) be the joint pmf induced by \mathbf{T}. To qualify as optimal transport, we now seek \mathbf{T}^* = \mathrm{arg min}_\mathbf{T} \mathbb{E}_{\pi_\mathbf{T}} \|x^f - x^a\|_2^2 under the additional constraint of cyclical monotonicity. This boils down to a linear programming problem. For a fixed prior sample x^f_j this induces a conditional distribution over the discretely approximated posterior given the discretely approximated prior \pi_{\mathbf{T}^*}(\cdot|x^f_j) =\pi_{\mathbf{T}^*}(x^f_j, \cdot)/\pi_f(x^f_j) = M \pi_{\mathbf{T}^*}(x^f_j, \cdot).

We could now simply sample from this conditional to obtain equally weighted posterior samples for each j =1,\dots,M. Instead, the paper proposes a deterministic transformation using the expected value x^a_j = \mathbb{E}_{\pi_{\mathbf{T}^*}(\cdot|x^f_j)}  x^f_i = \sum_{i=1}^M M t^*_{ij} x^f_i. Reich proves that the mapping \Psi_M induced by this transformation is such that for X^f \sim \pi_f, \Psi_M(X^f) \sim \pi_a for M \to \infty. In other words, if the ensemble size M goes to infinity, we indeed get samples from the posterior.

Overall I think this is a very interesting approach. The construction of an optimal transport map based on the discrete approximations of prior and posterior is indeed novel compared to standard SMC. My one objection is that as it stands, the method will only work if the prior support covers all relevant regions of the posterior, as taking the expected value over prior samples will always lead to a contraction.

ETPF_issue.pngOf course, this is not a problem when M is infinite, but my intuition would be that it has a rather strong effect in our finite world. One remedy here would of course be to introduce a rejuvenation step as in SMC, for example moving each particle  x^a_j using MCMC steps that leave \pi^a invariant.

Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

During the super nice International Conference on Monte Carlo techniques in the beginning of July in Paris at Université Descartes (photo), which featured many outstanding talks, one by Tong Zhang particularly caught my interest. He talked about several variants of Stochastic Gradient Descent (SGD) that basically use variance reduction techniques from Monte Carlo algorithms in order to improve the convergence rate versus vanilla SGD. Even though some of the papers mentioned in the talk do not always point out the connection to Monte Carlo variance reduction techniques.

One of the first works in this line, Accelerating Stochastic Gradient Descent using Predictive Variance Reduction by Johnson and Zhang, suggests using control variates to lower the variance of the loss estimate. Let L_j(\theta_{t-1}) be the loss for the parameter at t-1 and jth data point, then the usual batch gradient descent update is \theta_{t} =\theta_{t-1} - \frac{\eta_t}{N} \sum_{j=1}^N\nabla L_j(\theta_{t-1}) with \eta_t as step size.

In naive SGD instead one picks a data point index uniformly j \sim \mathrm{Unif}(\{1,\dots,N\}) and uses the update \theta_{t} =\theta_{t-1} - \eta_t \nabla L_j(\theta_{t-1}), usually with a decreasing step size \eta_t to guarantee convergence. The expected update resulting from this Monte Carlo estimate of the batch loss is exactly the batch procedure update. However the variance of the estimate is very high, resulting in slow convergence of SGD after the first steps (even in minibatch variants).

The authors choose a well-known solution to this, namely the introduction of a control variate. Keeping a version of the parameter that is close to the optimum, say \tilde\theta, observe that \nabla L_j(\tilde\theta) -  \frac{1}{N} \sum_{i=1}^N  \nabla L_i(\tilde\theta) has an expected value of 0 and is thus a possible control variate. With the possible downside that whenever \tilde\theta is updated, one has to go over the complete dataset.

The contribution, apart from the novel combination of knowledge, is the proof that this improves convergence. This proof assumes smoothness and strong convexity of the overall loss function and convexity of the L_j for the individual data points and then shows that the proposed procedure (termed stochastic variance reduced gradient or SVRG) enjoys geometric convergence. Even though the proof uses a slightly odd version of the algorithm, namely where \tilde\theta \sim\mathrm{Unif}(\{\theta_0,\dots,\theta_{t-1}\}). Rather simply setting  \tilde\theta = \theta_{t-1} should intuitively improve convergence, but the authors could not report a result on that. Overall a very nice idea, and one that has been discussed in more papers quite a bit, among others by Simon Lacoste-Julien and Francis Bach.