# Optimization Monte Carlo

A very short note on Optimization Monte Carlo by Meeds and Welling. Max Welling gave a talk on this at a NIPS workshop, during which I unfortunately wasn't very attentive. Unfortunate because they propose essentially the same thing as David Duvenaud and coauthors did in Early stopping is nonparametric variational inference, which I had been very vocal in criticizing just two hours earlier at David's poster. Because I didn't listen closely to Max's talk, and because it was Max Welling rather than David, I wasn't quite as vocal this time, but took the discussion with Max offline.

(The following critique does not actually apply to this paper; see the edits below.)

The critique is essentially the same: if you draw an RV and then optimize it, you can't compute the Jacobian needed to correct for the transformation, because the transformation is typically not bijective. Think gradient ascent: let $y = \lambda \nabla \log f(x)$. What is the inverse step when you only know $y$ but not $x$, and thus can't evaluate $\nabla \log f(x)$? It doesn't exist in general. This makes the very basis of the method unsound. Max answered that one can still compute the Jacobian in certain cases, even if the inverse function doesn't exist. For some reason I wasn't able to counter that on the spot. Of course you can always compute the Jacobian of the forward transformation, but that is not the Jacobian you need; the one you need belongs to the inverse function. See Theorem 12.7 in Jacod & Protter, for example.
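To make the non-invertibility concrete, here is a small numerical sketch (my own illustration, not from the paper): for a bimodal target, the score function takes the same value at two distinct points, so a gradient step $y = \lambda \nabla \log f(x)$ cannot be inverted from $y$ alone.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def score(x):
    """Score (gradient of the log density) of an equal mixture of N(-2,1) and N(2,1)."""
    p = 0.5 * norm.pdf(x, -2, 1) + 0.5 * norm.pdf(x, 2, 1)
    dp = -0.5 * norm.pdf(x, -2, 1) * (x + 2) - 0.5 * norm.pdf(x, 2, 1) * (x - 2)
    return dp / p

# Find two distinct points where the score equals -0.5: the gradient
# step y = lam * score(x) is identical at both, so x cannot be
# recovered from y.
x1 = brentq(lambda x: score(x) + 0.5, -2.0, -1.0)
x2 = brentq(lambda x: score(x) + 0.5, 2.0, 4.0)
lam = 0.1
y1, y2 = lam * score(x1), lam * score(x2)
print(x1, x2)  # two different points...
print(y1, y2)  # ...mapping to the same gradient step
```

For a unimodal Gaussian the step happens to be invertible, which is exactly why toy examples can be misleading here.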

I'll stop here, as I asked Xi'an for his opinion and he will post about this as well, presumably with a more detailed discussion. If Ted Meeds and Max Welling reply to my email on this and prove me wrong, I'll update this post of course.

EDIT 1: Lesson learned. I shouldn't post when I only half understand the paper (if half isn't an overstatement). They don't actually suggest optimizing an RV and computing the Jacobian of the transformation. Rather, they propose optimizing another variable which is not taken to be random (the parameters of a simulator, which is specific to the ABC setting).

EDIT 2: Xi'an's post

# A Complete Recipe for Stochastic Gradient MCMC

Another NIPS paper, this one by Ma, Chen and Fox. It is interesting in that it seeks to build a framework for MCMC algorithms that use only random subsamples of the data. This is achieved by considering the general form of a continuous Markov process as the SDE

$dz = f(z)\textrm{d}t + \sqrt{2D(z)}\textrm{d}W(t)$

where the drift $f$ is determined by the joint density over the actual sampling space you care about (whose marginal is the target distribution/posterior) and possible auxiliary variables. They suggest parameterizing $f$ by one skew-symmetric and one PSD matrix (both of which may depend on the current state $z$), and prove in two theorems that this is all you need to describe every Markov process that admits the target distribution as its unique stationary distribution. The proofs use the Fokker-Planck and Liouville equations, which as a non-SDE person I'd never heard of.
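Concretely (my transcription, with notation as I remember it from the paper): writing the stationary density as $p(z) \propto \exp(-H(z))$, with $D(z)$ PSD and $Q(z)$ skew-symmetric, the recipe sets

$f(z) = -\left[D(z) + Q(z)\right] \nabla H(z) + \Gamma(z), \qquad \Gamma_i(z) = \sum_j \frac{\partial}{\partial z_j}\left(D_{ij}(z) + Q_{ij}(z)\right)$

and any such choice leaves $p$ invariant, while conversely every suitable process can be written in this form.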

They use this to show ways of reconstructing known stochastic sampling algorithms and give a new sampler, which is a stochastic version of Riemannian HMC.
The paper is very honest in stating the limitations of SG sampling, namely that one has to anneal the step size to infinitesimal values (i.e. to 0) to ensure unbiasedness, which would of course defeat the purpose of sampling.
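As a sanity check on the framework, its simplest instance, $D = I$, $Q = 0$, and no auxiliary variables, recovers stochastic gradient Langevin dynamics. A minimal toy sketch, my own and not from the paper, on a hypothetical Gaussian posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y_i ~ N(theta, 1) with prior theta ~ N(0, 10^2),
# so the posterior is roughly N(mean(y), 1/n). Illustrative setup only.
n, m = 100, 10                 # data size, minibatch size
y = rng.normal(1.0, 1.0, n)

eps = 1e-4                     # fixed step size (the paper's point: this biases the chain)
theta, chain = 0.0, []
for t in range(20_000):
    batch = rng.choice(y, size=m, replace=False)
    # Stochastic estimate of grad log posterior: rescaled minibatch plus prior term.
    grad = (n / m) * np.sum(batch - theta) - theta / 100.0
    # Euler discretization of dz = f dt + sqrt(2 D) dW with D = 1, Q = 0.
    theta = theta + eps * grad + np.sqrt(2 * eps) * rng.normal()
    chain.append(theta)

samples = np.array(chain[5_000:])
print(samples.mean())   # close to mean(y), up to discretization bias
```

With the fixed step size the stationary distribution is only approximately the posterior, which is exactly the bias the authors are candid about.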

# The Sample Size Required in Importance Sampling

A very exciting new preprint by Chatterjee and Diaconis takes a fresh look at good old importance sampling. While it presents itself as a constructive paper (maybe because of publication bias), its main result to me is actually the deconstruction of an IS myth. Let me elaborate.

The paper shows that the variance estimate for an importance sampling estimator is not a good diagnostic for convergence. An implication is that aiming for low variance of the importance weights, or equivalently a high Effective Sample Size (ESS), might be better than nothing but is not actually that great. In more mathy speak, high ESS is a necessary condition for convergence of the estimate, but not a sufficient one.
Especially striking is their result that for any $\epsilon$, one can find a number of samples $n$ such that $P(\mathrm{var}_n(f) < \epsilon) \geq 1-4\epsilon$, where $\mathrm{var}_n$ is the empirical variance of $n$ samples, independent of your target density and your proposal distribution. Think about this: I don't care about your actual integration problem or your proposal distribution; just tell me what estimator variance you are willing to accept and I will tell you a number of samples that will exhibit lower variance with high probability. This is, in their words, absurd as a diagnostic.
However, the new convergence diagnostic they propose is plagued by a similar problem. They even state themselves that their alternative criterion $q_n = \frac{\max_i\{w_i\}}{\sum_i w_i}$ (where the $w_i$ are the importance weights) being close to 0 can be proved to be necessary but not sufficient. Just stating that ESS isn't great is not a great sell for some reviewers, so they have to come up with an alternative, and conjecture that it is sufficient while not being able to prove it. Still, $q_n$ might be a better diagnostic than ESS, and all things considered this is a wonderful piece of research.
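A toy illustration of the necessary-but-not-sufficient point (my own construction, not from the paper): a proposal that misses a mode entirely can have near-perfect ESS and a tiny $q_n$ while the estimate is badly wrong.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 10_000

# Target: equal mixture of N(-5,1) and N(5,1), so E[X] = 0.
# Proposal: N(5,1), which covers only the right mode.
x = rng.normal(5.0, 1.0, size=n)
target = 0.5 * norm.pdf(x, -5, 1) + 0.5 * norm.pdf(x, 5, 1)
proposal = norm.pdf(x, 5, 1)
w = target / proposal                 # importance weights, all roughly 0.5

ess = w.sum() ** 2 / (w ** 2).sum()   # effective sample size
q_n = w.max() / w.sum()               # Chatterjee-Diaconis style statistic
estimate = np.sum(w * x) / np.sum(w)  # self-normalized IS estimate of E[X]

print(ess / n)   # close to 1: the ESS diagnostic looks perfect
print(q_n)       # close to 1/n: this diagnostic looks fine too
print(estimate)  # close to 5, but the truth is 0
```

Both diagnostics pass with flying colors because the weights are nearly constant on the region the proposal actually visits; they simply cannot see the mode that was never sampled.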

# Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations

This paper by Schölkopf et al in Statistics and Computing (and on arXiv, which I will refer to when giving page numbers) was a very rewarding read. Although I think the promise of "Kernel Probabilistic Programming" given in the abstract is not really kept, it served me as a concise review of ideas in modern (i.e. distribution-based) methods using positive definite kernels. The obvious connection of the kernel mean map with kernel density estimators is just one of the things that I think are important to realize in this domain.
Estimating integrals of functions in a certain RKHS becomes a rather straightforward exercise in the proposed framework. However, it's a bit unfortunate that they assume independent samples from the distribution; posing this as an IS estimator, one might be able to apply some elementary results to generalize to dependent samples (they do talk about applications to dependent samples, but I felt that discussion was geared only towards causal inference). What might get in the way here is that their estimators are assumed to be independent of the actual samples used (which is not the case in IS).
These difficulties with their estimator for functions of RVs notwithstanding, what struck me again and again is the close connection with importance sampling estimators (I'm a one-trick pony). Specifically, the empirical kernel mean map of a set of samples $X$, $\hat \mu[X] = \sum_{x \in X} \alpha(x) k(\cdot,x)$ for a pd kernel $k$, becomes an IS estimator under the constraint imposed in the paper (that $\sum_{x \in X} \alpha(x) = 1$) together with one additional assumption ($\alpha(x) > 0$). Also, when using their method with sparse representations (in the paragraph "More general expansion sets", where they make the assumption $\alpha(x) > 0$ themselves), the way they expand the sparse representation is almost exactly importance resampling; funny enough, they are not sure whether the method gives consistent estimates. I'm not sure about the influence of their slight variation of importance resampling, but plain importance resampling of course yields consistent estimators under some conditions on the weights $\alpha(\cdot)$ and $\beta(\cdot)$, those conditions unfortunately being incompatible with the assumption that weights are independent of samples.
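To fix ideas, here is a minimal sketch of estimating $E[f(X)]$ through the empirical kernel mean map when $f$ itself lies in the RKHS (my own toy setup, with uniform weights $\alpha$ and a single-atom expansion for $f$; the paper's setting is more general):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian RBF kernel matrix k(x_i, y_j)."""
    return np.exp(-((x[:, None] - y[None, :]) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(0.0, 1.0, size=n)   # samples from the distribution of X
alpha = np.full(n, 1.0 / n)        # uniform weights: plain empirical mean map

# f lies in the RKHS: f = sum_j beta_j k(., y_j); here simply f = k(., 0).
y = np.array([0.0])
beta = np.array([1.0])

# The reproducing property gives <mu_hat, f> = sum_ij alpha_i beta_j k(x_i, y_j),
# which estimates E[f(X)].
estimate = alpha @ rbf(x, y) @ beta

# For X ~ N(0,1) and f(x) = exp(-x^2/2), E[f(X)] = 1/sqrt(2) ~ 0.707.
print(estimate)
```

With non-uniform, positive, normalized $\alpha$ this is exactly a self-normalized IS estimator of the same quantity, which is the connection drawn above.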

I would be really interested to see whether this IS connection is fruitful.

On a more critical note, the paper states that no parametric assumptions on the embedded distributions are made. While this is true, it masks the fact that the type of pd kernel used for the embedding determines how well integrals are approximated in this framework with a finite number of samples. Also, the paper claims the method "does not require explicit density estimation as an intermediate step" while still using the kernel mean map, which includes KDEs as a special case. A minor thing is that Figure 1 badly needs error bars.

# Frank-Wolfe Bayesian Quadrature

This paper (arXiv) on Bayesian Quadrature with actual theoretical guarantees starts from sigma point rules for approximating integrals. "Sigma point rule" is just a fancy name for using weighted samples, where the weights are not constrained in any way (and will actually be negative sometimes in the algorithm).

The basic idea is to approximate the target distribution $\pi$ as lying in a finite-dimensional Reproducing Kernel Hilbert Space and to minimize the distance between the approximation $q$ and $\pi$ (measured using the RKHS's own brand of distribution distance, the Maximum Mean Discrepancy or MMD). This, so far, is an idea used in basically all modern methods for numerical integration: variational Bayes and EP of course, but also adaptive MC methods that sample from some $q_t$ at time $t$ and then update the proposal to a $q_{t+1}$ that should be closer to $\pi$ (e.g. a Gaussian approximation in Adaptive Metropolis, or a mixture of Gaussians or Student's t distributions with an EM-inspired update in both Andrieu and Moulines 2006 and Cappé et al 2007).

However, the nice thing in this paper is that they actually provide theoretical guarantees that their integral estimates are consistent and converge at rate $O(1/n)$ or $O(\exp(-Cn))$ (where n is the number of samples). While the assumption of $\pi$ lying in a finite dimensional RKHS is to be expected, the stronger assumption is that it has compact support.

The crucial point seems to me to be the following: while the estimator has a superior rate, picking design points costs $O(n^2)$ or $O(n^3)$, where $n$ is the number of points. Drawing a sample with a Monte Carlo algorithm, on the other hand, takes constant time, which makes MC linear in the number of points ($O(n)$). This is of course crucial for wall-clock performance and makes a theoretical comparison of MC vs FW Bayesian Quadrature delicate at best.
As far as a conclusive story for the paper goes, I think it is a bit unfortunate that the evaluation uses iid samples drawn directly from $\pi$ in the optimization algorithm, because if you have those, plain Monte Carlo estimates are really good as well. But after a nice discussion with the authors, I realized that this does not in any way taint the theoretical results.

I wonder whether it is mostly the idea of extrapolating $\pi$ that gives these good rates, as I think Dan Simpson suggested on Xi'an's blog. And finally, this does not really alleviate the problem that BQ practically works only with Gaussian RBF kernels, which is a rather strong assumption on the smoothness of $\pi$.