# Talks in England

Next week I’ll be touring England, more specifically the Universities of Reading, Warwick, Oxford and UCL. Apart from visiting dear friends, I’ll be presenting a poster in Warwick (still not ready!) and giving talks at all the other places.

| Date | Time | Talk | Venue |
|---|---|---|---|
| 19.4.2015 | 2pm | Gradient IS and Unadjusted Langevin for IS | Univ. of Reading, Afternoon on Bayesian Computation, Nike Lecture Theatre, Agriculture Building |
| 22.4.2016 | 1pm | Kernel Sequential Monte Carlo | UCL, Roberts G08 Sir David Davies LT, Announcement |
| 25.4.2016 | 3:30pm | Kernel Sequential Monte Carlo | Univ. of Oxford, Department of Statistics, St Giles building, Seminar Room 1, LG.03 |

The poster will be on Flyweight evidence estimates at the CRiSM workshop on Estimating Constants on the 20th. Slides and the poster will go up on the Talks page.

# Kernel Sequential Monte Carlo

Heiko Strathmann, Brooks Paige, Dino Sejdinovic and I updated our draft on Kernel Sequential Monte Carlo (on the arxiv). Apart from locally adaptive covariance matrices for Gaussian proposals in various SMC algorithms, we also look at gradient emulators. The gradient emulators are meant for targets that do not admit a first derivative, the locally adaptive covariances for targets that do not even admit a second one.
The emulators can be used in different ways, either as proposals for an MCMC rejuvenation step in SMC or as importance densities directly, for example in Population Monte Carlo.
We found the gradient emulator in particular to be rather sensitive to the variance of the fit. Not Rao-Blackwellizing across the importance densities used in a PMC iteration leads to gigantic estimated gradients and an exploding algorithm, while using a weighted streaming fit of the emulator with Rao-Blackwellization works just fine.
Plus we evaluate on the stochastic volatility target from Nicolas’ SMC^2 paper, which is a much nicer benchmark than what we had in the last draft (the plot shows the target’s marginals). Please send any feedback my way.
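
To illustrate what Rao-Blackwellizing across importance densities means here, a minimal sketch follows. It is not the code from the draft: the isotropic Gaussian proposals, the toy standard normal target and all function names are assumptions made for illustration. Each sample is weighted either against the proposal that generated it, or against the equal-weight mixture of all proposals used in the iteration.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I), evaluated along the last axis."""
    d = x.shape[-1]
    sq = np.sum((x - mean) ** 2, axis=-1)
    return -0.5 * (d * np.log(2 * np.pi * var) + sq / var)

def pmc_log_weights(samples, locations, var, log_target):
    """Return (standard, Rao-Blackwellized) log importance weights.

    samples[i] is assumed to be drawn from N(locations[i], var * I).
    """
    log_pi = log_target(samples)
    # Standard weights: each sample only sees its own proposal.
    log_q_own = log_gauss(samples, locations, var)
    # Rao-Blackwellized weights: the denominator is the mixture of all proposals.
    log_q_all = log_gauss(samples[:, None, :], locations[None, :, :], var)  # (N, N)
    log_q_mix = np.logaddexp.reduce(log_q_all, axis=1) - np.log(len(locations))
    return log_pi - log_q_own, log_pi - log_q_mix

def ess(log_w):
    """Effective sample size computed from unnormalized log weights."""
    w = np.exp(log_w - log_w.max())
    return w.sum() ** 2 / np.sum(w ** 2)

# Toy setting (assumed): standard normal target, narrow proposals at scattered locations.
rng = np.random.default_rng(0)
log_target = lambda x: log_gauss(x, 0.0, 1.0)
locations = 1.5 * rng.normal(size=(500, 2))
var = 0.25
samples = locations + np.sqrt(var) * rng.normal(size=locations.shape)

log_w_std, log_w_rb = pmc_log_weights(samples, locations, var, log_target)
print("ESS, standard weights:         ", ess(log_w_std))
print("ESS, Rao-Blackwellized weights:", ess(log_w_rb))
```

In this toy setting the proposals are narrower than the target, so the standard weights are heavy-tailed while the mixture weights stay well behaved. The price is the quadratic cost of evaluating every sample under every proposal.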

# Streaming weighted updates of Cholesky factors

… are hell. Heiko and I have spent the day trying to get code to work for a numerically stable streaming update of a Cholesky factor in an adaptive Importance Sampling setting. This is less than trivial, also because it has to be done in a weighted fashion (because of Importance Sampling) and weights should be kept in log space for stability. Thinking that I will have to go through this again for an eigendecomposition instead of Cholesky, for a collaboration planned with Mathieu Gerber, does not make things easier to stomach.
This day again strengthens my wish to move more in the direction of Judith-Rousseau-statistics, i.e. away from computers, towards pen and paper.

PS: Does anybody have code solving exactly this problem (we have tons for almost this problem)?
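
To make the problem concrete, here is a minimal sketch of one way such an update can be written. It is an illustration under assumptions, not the code we were fighting with: the class name, the pseudo-observation initialisation (`mean0`, `chol0`, `log_weight0`) and the toy data are all made up. The idea is that adding a point with weight share $r = w / (W + w)$ changes the covariance to $(1-r)\,C + r(1-r)\,dd^\top$ with $d = x - \text{mean}$, which is a rescaling of the factor followed by a standard rank-one Cholesky update, and $r$ can be computed from log weights alone.

```python
import numpy as np

def chol_rank1_update(L, v):
    """Return the lower Cholesky factor of L @ L.T + outer(v, v)."""
    L, v = L.copy(), v.copy()
    for k in range(v.size):
        r = np.hypot(L[k, k], v[k])
        c, s = r / L[k, k], v[k] / L[k, k]
        L[k, k] = r
        L[k + 1:, k] = (L[k + 1:, k] + s * v[k + 1:]) / c
        v[k + 1:] = c * v[k + 1:] - s * L[k + 1:, k]
    return L

class StreamingWeightedCholesky:
    """Weighted running mean and Cholesky factor of the covariance (hypothetical helper).

    The state starts from a pseudo-observation: a Gaussian with mean `mean0`,
    Cholesky factor `chol0` and total log weight `log_weight0`.
    """

    def __init__(self, mean0, chol0, log_weight0):
        self.mean = np.array(mean0, dtype=float)
        self.chol = np.array(chol0, dtype=float)
        self.log_w_sum = float(log_weight0)

    def update(self, x, log_w):
        log_w_new = np.logaddexp(self.log_w_sum, log_w)
        r = np.exp(log_w - log_w_new)      # share of the new point in the total weight
        d = np.asarray(x, dtype=float) - self.mean
        # New covariance: (1 - r) * C + r * (1 - r) * outer(d, d)
        self.chol = chol_rank1_update(np.sqrt(1.0 - r) * self.chol,
                                      np.sqrt(r * (1.0 - r)) * d)
        self.mean = self.mean + r * d
        self.log_w_sum = log_w_new

# Toy usage: the weights are so small that exp(log_w) would underflow to zero.
rng = np.random.default_rng(1)
D, N = 3, 50
xs = rng.normal(size=(N, D))
log_ws = -1000.0 + 2.0 * rng.normal(size=N)
est = StreamingWeightedCholesky(np.zeros(D), np.eye(D), log_weight0=-1000.0)
for x, lw in zip(xs, log_ws):
    est.update(x, lw)
print(est.chol @ est.chol.T)
```

The sketch breaks down when a single weight swamps everything seen so far ($r$ rounds to 1 and the factor collapses to zero), which is one of the places where the real problem gets nasty.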

This arXival from last spring/summer by Martino, Elvira, Luengo and Corander combines and extends recent advances in Importance Sampling, mainly using ideas from Adaptive Multiple Importance Sampling (AMIS) and Population Monte Carlo (PMC). The extension is to not use the Importance Sampling procedure itself to come up with new proposal distributions, but rather to run a Markov chain whose output is used solely as the location parameter for the IS proposal distributions $q_{n,t}$. The weights of the samples drawn from these are Rao-Blackwellized using the deterministic mixture idea of Owen and Zhou, and as far as I can see only the importance samples are used for estimating integrals.
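
A minimal sketch of the two-layer idea, as I read it, might look like the following. This is not the authors' algorithm in any of its variants: the single random-walk chain, the one importance sample per chain state, the full mixture in the weight denominator, the toy bimodal target and all names are assumptions for illustration.

```python
import numpy as np

def log_target(x):
    """Unnormalized log density of a 1-D mixture 0.5 N(-3, 1) + 0.5 N(3, 1) (toy target)."""
    return np.logaddexp(-0.5 * (x + 3.0) ** 2, -0.5 * (x - 3.0) ** 2)

def layered_is(n_iter, mh_scale, is_scale, rng):
    # Upper layer: random-walk Metropolis-Hastings on the same target.
    # Its states are only used as proposal locations, never for estimation.
    locations = np.empty(n_iter)
    mu = 0.0
    for t in range(n_iter):
        prop = mu + mh_scale * rng.normal()
        if np.log(rng.uniform()) < log_target(prop) - log_target(mu):
            mu = prop
        locations[t] = mu
    # Lower layer: one importance sample from a Gaussian centred at each chain state.
    samples = locations + is_scale * rng.normal(size=n_iter)
    # Deterministic mixture weights: denominator is the mixture of all proposals.
    diff = samples[:, None] - locations[None, :]
    log_q = -0.5 * (diff / is_scale) ** 2 - np.log(is_scale * np.sqrt(2 * np.pi))
    log_mix = np.logaddexp.reduce(log_q, axis=1) - np.log(n_iter)
    log_w = log_target(samples) - log_mix
    w = np.exp(log_w - log_w.max())
    return samples, w / w.sum()

rng = np.random.default_rng(0)
samples, w = layered_is(n_iter=1000, mh_scale=3.0, is_scale=1.0, rng=rng)
# Self-normalized IS estimate of E[x^2]; the true value for this toy target is 10.
print("estimate of E[x^2]:", np.sum(w * samples ** 2))
```

Using the full mixture in the denominator is just one possible partition; smaller groups of proposals trade variance for computing time.
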
What’s most striking for me, as somebody who has thought about these methods a lot during the PhD, is the idea that in principle one is free to Rao-Blackwellize using an arbitrary partition of the samples/proposal distributions and still get a consistent estimator. Xi’an mentioned this to me earlier, and of course it is not surprising given that even without Rao-Blackwellization my Gradient IS did considerably better than some Adaptive MCMC algorithms. However, this paper makes that idea transparent and uses it extensively.

The main idea put forward, however, is to use (parallel) MCMC with the same target to come up with IS proposal locations. The output of the MCMC is used only for that purpose and not for estimation. That seems kind of wasteful, but in a nice conversation over email the first author, Luca Martino, assured me that recycling proposals as both IS and MH proposals made performance go down because of correlations. I don’t have an intuition for why that would be the case, but maybe I’ll have to learn that the hard way.

What I like about this particular idea of getting locations from MCMC is that one is free from the tuning problem I’ve hit upon in GRIS: if you scale up the proposal covariance in GRIS (or in the PMC approach from the Cappé 2004 paper), you can get an arbitrarily high ESJD (expected squared jumping distance), together with a really bad target approximation. Thus unmodified ESJD cannot be used for tuning, and neither can the acceptance rate, which doesn’t exist in an importance sampler. Using MCMC for getting proposal locations is an elegant way around that problem. The effect of this is shown in the plot from the paper below, where the two rightmost plots show one of their methods.

Some other aspects of the paper I find less clear. For instance, I’m not sure about the abundance of different algorithms that are introduced. It leaves the impression that the authors were going for quantity over quality (something I might be guilty of myself these weeks as well). Also, while the targets they use for evaluation are fine, only reporting the MSE of one dimension of one integrand seems odd. One simple thing here might be to report the MSE averaged over dimensions as well, another to report the MSE of an estimate of the target distribution’s variance or higher-order moments.

(Title image (c) Carlos Delgado, CC-BY-SA)

# Why the MAP is a bad starting point in high dimensions

During MCMSki 2016, Heiko mentioned that in high dimensions, the MAP is not a particularly good starting point for a Monte Carlo sampler, because there is no volume around it. Several people smarter than me and I were not sure why that would be the case, so Heiko gave us a proof by simulation: he sampled from multivariate standard normals of increasing dimension and plotted the Euclidean norm of the samples. The following is what I reproduced for sampling a standard normal in $D=30$ dimensions. The histogram has a peak at about $\sqrt{D}$, which means most of the samples lie in a thin shell of radius about $\sqrt{D}$ around the mean/mode/MAP of the target distribution, and none are at the MAP, which would correspond to norm $0$.
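
The simulation is easy to redo. A minimal sketch with NumPy follows (the sample size, seed and number of bins are my own arbitrary choices, not Heiko's):

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_samples = 30, 100_000

# Draw from a D-dimensional standard normal and compute Euclidean norms.
x = rng.standard_normal(size=(n_samples, D))
norms = np.linalg.norm(x, axis=1)

print(f"sqrt(D)       = {np.sqrt(D):.3f}")
print(f"mean norm     = {norms.mean():.3f}")
print(f"smallest norm = {norms.min():.3f}")

# Histogram of the norms; pass `counts` and `edges` to any plotting library.
counts, edges = np.histogram(norms, bins=30)
print("histogram peak near norm", 0.5 * (edges[counts.argmax()] + edges[counts.argmax() + 1]))
```

Even the smallest norm among all the draws stays far away from $0$, i.e. from the MAP.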

We were dumbstruck by this, but nobody (not even Heiko) had an explanation for what was happening. Yesterday I asked Nicolas about this and he gave the most intuitive interpretation: Given a standard normal variable in $D$ dimensions, $x \sim \mathcal{N}(0,I_D)$, the squared Euclidean norm is $n^2 = \|x\|^2 = \sum_{i=1}^D x_i^2$. As $x$ is Gaussian, this just means $n^2$ has a $\chi^2(D)$ distribution, so $\mathbb{E}(n^2) = D$ and the norm $n$ concentrates around $\sqrt{D}$ (by Jensen’s inequality $\mathbb{E}(n)$ is slightly below $\sqrt{D}$, with the gap shrinking as $D$ grows). There’s the explanation.

(Title image (c) Niki Odolphie)

# Kernel Adaptive Sequential Monte Carlo

Heiko Strathmann, Dino Sejdinovic, Brooks Paige and I had a poster on our work on Kernel Adaptive SMC (arXiv) at the Scalable Inference NIPS workshop this year. The basic idea is closely related to Dino’s paper on Kernel-informed MCMC, in that the particle system is (implicitly) mapped to a function space (the infamous Reproducing Kernel Hilbert Space) and used to inform a sample from a Gaussian fit to the functionals associated with the samples. As always, things that are nasty and nonlinear in the input space behave much more nicely in feature space. And when your name is Dino Sejdinovic, you can actually integrate out all steps in feature space, ending up with a Gaussian proposal in the space you actually want to sample from.
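
To give a feeling for what such a Gaussian proposal looks like, here is a minimal sketch of a kernel-informed proposal covariance of the form $\gamma^2 I + \nu^2 M H M^\top$, which is how I remember it from the Kameleon MCMC paper: $M$ stacks (twice) the kernel gradients at the current point with respect to each particle and $H$ is the centering matrix. Treat the exact form, the Gaussian kernel, the parameter values and the function names as assumptions; the SMC version may differ in details.

```python
import numpy as np

def gauss_kernel_grad(y, Z, sigma):
    """Gradients of k(x, z_i) = exp(-||x - z_i||^2 / (2 sigma^2)) w.r.t. x at x = y; shape (n, d)."""
    diff = Z - y
    k = np.exp(-np.sum(diff ** 2, axis=1) / (2 * sigma ** 2))
    return diff * (k / sigma ** 2)[:, None]

def kernel_proposal_cov(y, Z, sigma=1.0, gamma=0.1, nu=1.0):
    """Covariance gamma^2 I + nu^2 M H M^T of a Gaussian proposal centred at y (assumed form)."""
    n, d = Z.shape
    M = 2.0 * gauss_kernel_grad(y, Z, sigma).T      # (d, n)
    H = np.eye(n) - np.ones((n, n)) / n             # centering matrix
    return gamma ** 2 * np.eye(d) + nu ** 2 * M @ H @ M.T

# Toy particle system on a banana-shaped (nonlinear) support.
rng = np.random.default_rng(0)
z1 = rng.normal(scale=2.0, size=300)
Z = np.column_stack([z1, 0.5 * z1 ** 2 + 0.3 * rng.normal(size=300)])

y = np.array([2.0, 2.0])                            # current point on the right arm
C = kernel_proposal_cov(y, Z)
proposal = rng.multivariate_normal(y, C)
print("proposal covariance at y:\n", C)
print("proposed point:", proposal)
```

The covariance depends on the current point `y`, so the proposal can follow the local shape of the particle cloud instead of using one global covariance.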

In the title cartoon, our approach (in red, called KASS for Kernel Adaptive SMC sampler) lines up nicely with the support of the target distribution. ASMC by Fearnhead et al., or at least the sampler they use in their evaluation, which is a special case of our method, cannot adapt to this nonlinear structure. This results in better performance for KASS compared to ASMC and Random Walk SMC, as measured by Maximum Mean Discrepancy. However, this of course comes at the price of higher computational complexity, similar to related methods such as Riemannian Manifold MALA/HMC. If your dataset is large, however, and evaluating the likelihood dominates evaluating the Gram matrix on your particle system, this method will still have an edge over others when the support is nonlinear. For the MCMC predecessor KAMH/Kameleon see Heiko’s post.
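
For readers who have not met Maximum Mean Discrepancy before, a minimal sketch of the (biased) squared MMD estimate between two unweighted samples is given below. The Gaussian kernel, its bandwidth and the toy samples are assumptions for illustration and not necessarily the setup of our evaluation.

```python
import numpy as np

def gauss_gram(X, Y, sigma):
    """Gram matrix of the Gaussian kernel between the rows of X and the rows of Y."""
    sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(Y ** 2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma ** 2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate of the squared MMD between the samples X and Y."""
    return (gauss_gram(X, X, sigma).mean()
            + gauss_gram(Y, Y, sigma).mean()
            - 2 * gauss_gram(X, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))              # reference sample from the "target"
Y_same = rng.normal(size=(500, 2))         # sampler output from the same distribution
Y_shift = rng.normal(size=(500, 2)) + 2.0  # sampler output that misses the target

print("MMD^2, same distribution:", mmd2_biased(X, Y_same))
print("MMD^2, shifted sample:   ", mmd2_biased(X, Y_shift))
```

A smaller value means the sampler output is closer to the reference sample in the kernel's feature space.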