Category Archives: Monte Carlo

Talks in England

Next week I’ll be touring England, more specifically the Universities of Reading, Warwick and Oxford, and UCL. Apart from visiting dear friends, I’ll be presenting a poster in Warwick (still not ready!) and giving talks at all the other places.

19.4.2016, 2pm: Gradient IS and Unadjusted Langevin for IS. Univ. of Reading, Afternoon on Bayesian Computation, Nike Lecture Theatre, Agriculture Building
22.4.2016, 1pm: Kernel Sequential Monte Carlo. UCL, Roberts G08 Sir David Davies LT, Announcement
25.4.2016, 3:30pm: Kernel Sequential Monte Carlo. Univ. of Oxford, Department of Statistics, St Giles building, Seminar Room 1, LG.03

The poster will be on Flyweight evidence estimates at the CRiSM workshop on Estimating Constants on the 20th. Slides and the poster will go up on the Talks page.

Marginal Likelihood From the Metropolis–Hastings Output

This is a rather old JASA paper by Siddhartha Chib and Ivan Jeliazkov from 2001. I dug it up again as I am frantically trying to get my contribution for the CRiSM workshop on Estimating Constants in shape. My poster will be titled Flyweight evidence estimates as it looks at methods for estimating model evidence (aka marginal likelihood, normalizing constant, free energy) that are computationally very cheap when a sample from the target/posterior distribution exists.
I noticed again that the approach of Chib & Jeliazkov (2001) does not satisfy this constraint, although they claim otherwise in the paper. The estimator is based on the fact that the normalized probability of a point \theta^* under the target when using Metropolis–Hastings is

\pi(\theta^*|y) = \frac{\int \alpha(\theta, \theta^*|y)\, q(\theta, \theta^*|y)\, \pi(\theta|y)\, \mathrm{d}\theta}{\int \alpha(\theta^*, \theta|y)\, q(\theta^*, \theta|y)\, \mathrm{d}\theta}

where q is the proposal density, \alpha the acceptance probability and y the data. If \bar\pi is the unnormalized target, we get the evidence estimate Z = \bar\pi(\theta^*|y)/\pi(\theta^*|y).
Now the integral in the numerator is cheap to estimate when we already have samples from \pi. The authors claim that because it is cheap to sample from q, the denominator is cheap as well. I can’t help it: this statement feels a bit deceptive to me. While samples from q are indeed cheap to generate, evaluating \alpha(\theta^*,\theta^{(j)}|y) for each sample \theta^{(j)} \sim q(\theta^*,\cdot|y) is the opposite of cheap! It involves evaluating \bar\pi(\theta^{(j)}|y) for every j, which is basically the same cost as getting samples from the target in the first place.
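To make the cost argument concrete, here is a minimal sketch of the Chib & Jeliazkov estimate in plain Python/NumPy. All names (log_pi_bar, q, q_sample, theta_post) are hypothetical placeholders, not code from the paper; the point is simply that the denominator loop has to evaluate the unnormalized target once per proposal draw.

```python
import numpy as np

# Hypothetical placeholders:
# log_pi_bar(t)  - log of the unnormalized target \bar\pi(t|y)
# q(a, b)        - proposal density of b when proposing from a
# q_sample(a)    - draw one proposal from q(a, .)
# theta_post     - samples from the posterior pi(.|y), e.g. MH output

def log_alpha(log_pi_bar, q, x, y):
    # Metropolis-Hastings acceptance probability alpha(x, y) for a move x -> y
    return min(0.0, log_pi_bar(y) - log_pi_bar(x)
               + np.log(q(y, x)) - np.log(q(x, y)))

def cj_log_evidence(theta_star, theta_post, log_pi_bar, q, q_sample, n_prop):
    # Numerator: average of alpha(theta, theta*) q(theta, theta*) over posterior samples
    num = np.mean([np.exp(log_alpha(log_pi_bar, q, t, theta_star)) * q(t, theta_star)
                   for t in theta_post])
    # Denominator: average of alpha(theta*, theta_j) over draws theta_j ~ q(theta*, .).
    # Each term needs a fresh evaluation of log_pi_bar, which is the cost criticized above.
    den = np.mean([np.exp(log_alpha(log_pi_bar, q, theta_star, q_sample(theta_star)))
                   for _ in range(n_prop)])
    log_post_at_star = np.log(num) - np.log(den)      # estimate of log pi(theta*|y)
    return log_pi_bar(theta_star) - log_post_at_star  # estimate of log Z
```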
I had already included this criticism in my PhD thesis, but while revising under time pressure I convinced myself it was not valid and erroneously took it out.

Non-asymptotic convergence analysis for the Unadjusted Langevin Algorithm

This is an important arXival by Alain Durmus and Eric Moulines. The title is slightly optimized for effect, as the paper actually contains both non-asymptotic and asymptotic analyses.

The basic theme of the paper is getting upper bounds on total variation (and more general distances between distributions) between an uncorrected, discretized Langevin diffusion wrt some target \pi and \pi itself. The discretization used is the common scheme with the scary name Euler–Maruyama:

X_{k} \sim \mathcal{N}(\cdot \mid X_{k-1} + \gamma_{k}\nabla \log \pi(X_{k-1}),\ 2\gamma_{k} I) = R_{\gamma_{k}}(\cdot \mid X_{k-1})

Under a Foster-Lyapunov condition, R_{\gamma_{k}} is a Markov kernel that admits a unique stationary distribution \pi_{\gamma_{k}} that is close to the desired \pi in total variation distance and gets closer when \gamma_{k} decreases.
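For concreteness, here is a minimal sketch of the fixed-step scheme in plain NumPy; grad_log_pi and the step size gamma are placeholders for whatever target one cares about, and there is of course no Metropolis–Hastings correction.

```python
import numpy as np

def ula(grad_log_pi, x0, gamma, n_steps, seed=0):
    """Unadjusted Langevin Algorithm with fixed step size gamma.

    Iterates X_k = X_{k-1} + gamma * grad_log_pi(X_{k-1}) + sqrt(2*gamma) * Z_k
    with Z_k ~ N(0, I), i.e. the Euler-Maruyama discretization above,
    without any Metropolis-Hastings correction.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        x = x + gamma * grad_log_pi(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        samples[k] = x
    return samples

# Toy example: standard normal target, so grad log pi(x) = -x
draws = ula(lambda x: -x, x0=np.zeros(2), gamma=0.1, n_steps=5000)
```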

Now in the non-asymptotic case with fixed \gamma = \gamma_k, the authors provide bounds that explicitly depend on dimensionality of the support of the target, the number of samples drawn and the chosen step size \gamma . Unfortunately, these bounds contain some unknowns as well, such as the Lipschitz constant L of the gradient of the logpdf \log \pi(\cdot) and some suprema that I am unsure how to get explicitly.

Durmus and Moulines particularly look at scaling with dimension under increasingly strong conditions on \pi, getting constants that are exponential in dimension for the convergence when \pi is superexponential outside a ball. Better convergence can be achieved when assuming \pi to be log-concave or strongly log-concave. This is not surprising; nevertheless, the theoretical importance of the results is clear from the fact that, together with Arnak Dalalyan's work, these are the first results for ULA since the Roberts & Tweedie papers from 1996.

As a practitioner, I would have wished for very explicit guidance on picking \gamma or the sequence \{\gamma_k\}_{k=1}^\infty. But hopefully, with Alain's paper as a foundation, that can be the next step. As a non-mathematician, I had some problems following the paper and at some point I completely lost it. This is surely in part due to the quite involved content. However, one might still manage to give intuition even in this case, as Sam Livingstone's recent paper on HMC shows. I hope Alain goes over it again with readability and presentation in mind so that it gets the attention it deserves. Yet another task for something that already took a lot of work…

(photo: the Lipschitz Grill diner in Berlin – I don’t know about their food, but the name is remarkable)

Talk in Berlin

Last Friday I gave a talk at FU Berlin, organized by Sebastian Reich and Tim Sullivan. Mostly I presented Gradient IS in the state it was in for my PhD thesis, though I gave a much better outlook than what I was thinking of as future work back then, especially for high dimensions. Sure enough, as people who have worked on DE-related topics for quite some time, Sebastian and Carsten Hartmann had very interesting remarks which will hopefully help (DE-ignorant) me in scaling GRIS up. Here are the slides.

(The title image by Armin Kübelbeck shows, oddly enough, a subway station in Berlin)

Importance Sampling Session at MCQMC 2016

On behalf of the mcqmc 2016 organizing committee I am pleased to accept your proposal.
-Art Owen
I got this nice message from Art last night. My proposal for a session on Advances in Importance Sampling at MCQMC 2016 has been accepted, which is great, as I think the session is made up of strong papers (obviously). The session will almost surely be moderated by Nicolas Chopin.

MCQMC session on Advances in Importance Sampling

The sample size required in Importance Sampling

S. Chatterjee, P. Diaconis

The goal of importance sampling is to estimate the expected value of a given function with respect to a probability measure ν using a random sample of size n drawn from a different probability measure μ. If the two measures μ and ν are nearly singular with respect to each other, which is often the case in practice, the sample size required for accurate estimation is large. In this article it is shown that in a fairly general setting, a sample of size approximately exp(D(ν||μ)) is necessary and sufficient for accurate estimation by importance sampling, where D(ν||μ) is the Kullback–Leibler divergence of μ from ν. In particular, the required sample size exhibits a kind of cut-off in the logarithmic scale. The theory is applied to obtain a fairly general formula for the sample size required in importance sampling for exponential families (Gibbs measures). We also show that the standard variance-based diagnostic for convergence of importance sampling is fundamentally problematic. An alternative diagnostic that provably works in certain situations is suggested.
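As a plain reminder of the setting (not code from the paper), here is ordinary importance sampling on a toy Gaussian pair of my own choosing, with the exp(D(ν||μ)) sample-size rule of thumb worked out in a comment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy setup of my own (not from the paper): target nu = N(2, 1),
# proposal mu = N(0, 1), estimate E_nu[f] with f(x) = x.
nu, mu = stats.norm(2.0, 1.0), stats.norm(0.0, 1.0)
f = lambda x: x

# D(nu||mu) for these two unit-variance Gaussians is (2 - 0)^2 / 2 = 2, so the
# rule of thumb from the abstract suggests roughly exp(2) ~ 7.4 samples suffice
# here; for nearly singular pairs exp(D) explodes and so does the required n.
kl = 0.5 * (2.0 - 0.0) ** 2
n = 10 * int(np.ceil(np.exp(kl)))         # some headroom over exp(D(nu||mu))

x = mu.rvs(size=n, random_state=rng)
w = np.exp(nu.logpdf(x) - mu.logpdf(x))   # importance weights d nu / d mu
estimate = np.mean(w * f(x))              # IS estimate of E_nu[f], here close to 2
print(estimate)
```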

Generalized Multiple Importance Sampling

V. Elvira, L. Martino, D. Luengo, M. Bugallo

Importance Sampling methods are broadly used to approximate posterior distributions or some of their moments. In its standard approach, samples are drawn from a single proposal distribution and weighted properly. However, since the performance depends on the mismatch between the targeted and the proposal distributions, several proposal densities are often employed for the generation of samples. Under this Multiple Importance Sampling (MIS) scenario, many works have addressed the selection or adaptation of the proposal distributions, interpreting the sampling and the weighting steps in different ways. In this paper, we establish a general framework for sampling and weighting procedures when more than one proposal is available. The most relevant MIS schemes in the literature are encompassed within the new framework, and, moreover, novel valid schemes appear naturally. All the MIS schemes are compared and ranked in terms of the variance of the associated estimators. Finally, we provide illustrative examples which reveal that, even with a good choice of the proposal densities, a careful interpretation of the sampling and weighting procedures can make a significant difference in the performance of the method.
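For readers new to MIS, here is a toy sketch of two classic weighting choices, the standard scheme and the deterministic-mixture (balance heuristic) weights; the target, proposals and integrand are my own placeholders, and the code does not claim to reproduce the schemes analyzed in the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy illustration of two classic MIS weightings; target, proposals and
# integrand are my own placeholders, not the schemes analyzed in the paper.
target = stats.norm(0.0, 1.0)
proposals = [stats.norm(-2.0, 1.5), stats.norm(2.0, 1.5)]
n_per_prop = 1000
f = lambda x: x ** 2   # E_target[x^2] = 1

samples = [q.rvs(size=n_per_prop, random_state=rng) for q in proposals]

# Standard MIS: each sample is weighted only against the proposal it came from.
est_std = np.mean([np.mean(target.pdf(x) / q.pdf(x) * f(x))
                   for x, q in zip(samples, proposals)])

# Deterministic mixture (balance heuristic): the denominator is the equal-weight
# mixture of all proposals, which typically gives lower-variance weights.
mix_pdf = lambda x: np.mean([q.pdf(x) for q in proposals], axis=0)
est_dm = np.mean([np.mean(target.pdf(x) / mix_pdf(x) * f(x)) for x in samples])

print(est_std, est_dm)   # both unbiased for 1; the mixture version is usually less noisy
```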

Continuous-Time Importance Sampling

K. Łatuszyński, G. Roberts, G. Sermaidis, P. Fearnhead

We will introduce a new framework for sequential Monte Carlo, based on evolving a set of weighted particles in continuous time. This framework can lead to novel versions of existing algorithms, such as Annealed Importance Sampling and the Exact Algorithm for diffusions, and can be used as an alternative to MALA for sampling from a target distribution of interest. These methods are amenable to the use of sub-sampling, which can greatly increase their computational efficiency for big-data applications; and can enable unbiased sampling from a much wider range of target distributions than existing approaches.

Quasi-Monte Carlo Methods in Finance

This second reference Mathieu Gerber gave me in my quest to educate myself about QMC is a paper by Pierre L’Ecuyer from the Winter Simulation Conference in 2004. As a tutorial it was much clearer (for me) than the Art Owen paper. Maybe because it didn’t contain so much ANOVA. Or maybe because I was already more used to ANOVA from Art's paper.

This paper specifically and quite transparently treats different constructions for low-discrepancy point sets, in particular digital nets and their special cases. It also discusses randomization procedures, which sometimes seem to be very specialized to the sequence used. One seemingly general transform applied after randomization, the baker's transformation, results in a surprisingly strong variance reduction of order O(n^{-4+\epsilon}). The transformation replaces a uniform coordinate u \in [0,1) by 2u for u \leq 0.5 and by 2(1-u) otherwise.
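In code, that baker's transformation is just (a minimal NumPy sketch):

```python
import numpy as np

def baker_transform(u):
    """Baker's transformation: 2u for u <= 0.5, 2(1 - u) otherwise.

    Applied coordinate-wise to a (randomized) QMC point set in [0, 1)^d
    after randomization, as described in the text.
    """
    u = np.asarray(u)
    return np.where(u <= 0.5, 2.0 * u, 2.0 * (1.0 - u))
```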

In the examples, L’Ecuyer mentions that using an eigendecomposition of the covariance matrix (i.e. PCA) results in much higher variance reductions than using Cholesky factors, which he attributes to dimension reduction. I find that name odd, as the complete information is retained (as opposed to, e.g., discarding the components with the lowest eigenvalues). My intuition is that the strong empirical gains with PCA might rather be attributed to the fact that eigenvectors are orthogonal, making this decomposition as close as possible to QMC's beloved unit hypercube.
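To illustrate what the comparison refers to, here is a minimal sketch of mapping low-discrepancy uniforms to correlated Gaussians via Cholesky factors versus via the eigendecomposition/PCA construction; the covariance matrix and the use of scipy's Halton generator are my own choices for illustration.

```python
import numpy as np
from scipy.stats import norm, qmc

# Hypothetical 3x3 covariance matrix, just for illustration
Sigma = np.array([[1.0, 0.8, 0.5],
                  [0.8, 1.0, 0.8],
                  [0.5, 0.8, 1.0]])

# Low-discrepancy uniforms mapped to standard normals coordinate-wise
u = qmc.Halton(d=3, scramble=True, seed=3).random(1024)
z = norm.ppf(np.clip(u, 1e-12, 1 - 1e-12))   # clip guards against u == 0

# Cholesky construction: Sigma = L L^T
L = np.linalg.cholesky(Sigma)
x_chol = z @ L.T

# PCA / eigendecomposition construction: Sigma = U diag(lam) U^T, with the
# eigenvalues sorted decreasingly so the best-equidistributed QMC coordinates
# carry the most variance.
lam, U = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]
A = U[:, order] * np.sqrt(lam[order])
x_pca = z @ A.T

# Both x_chol and x_pca have covariance Sigma; an integrand evaluated on x_pca
# typically has lower effective dimension and hence better QMC accuracy.
```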

Monte Carlo Extension of Quasi-Monte Carlo

Low discrepancy (top) vs. randomly uniform points (bottom)

Mathieu Gerber gave me this 1998 paper by Art Owen as one of several references for good introductions to QMC in the course of evaluating an idea we had at MCMSki.

The paper surveys re-randomization of low-discrepancy/QMC point sets, mostly as a way of getting an estimate of the variance of the integral estimate. Or at least that is what the paper states most prominently; another reason for doing randomized QMC is that in some cases it further decreases variance compared to plain QMC.

One of the things this paper stresses is that QMC will not help with truly high-dimensional integrands: say you have a d-dimensional integrand that does not live on a lower-dimensional manifold, and you only use n \ll O(d^2) points. Then equidistributing these n points meaningfully in d dimensions becomes impossible. Of course, if the integrand does live on a lower-dimensional manifold, one can use that fact to get convergence rates that correspond to the dimensionality of that manifold, which corresponds (informally) to what Art Owen calls effective dimension. Two variants of effective dimension are defined, both using the ANOVA decomposition. ANOVA only crossed my way earlier through my wife's psychology courses in stats, where it seemed to be mostly a test, so I'll have to delve more deeply into the decomposition method. It seems that, basically, the value of the integrand f(x_1,\dots,x_d) is treated as a dependent variable while x_1, \dots, x_d are the independent variables, and ANOVA is used to get an idea of how the independent variables interact in producing the dependent variable. In that case, of course, we would have to get some meaningful samples of points (x_1,\dots,x_d, f(x_1,\dots,x_d)), which currently seems to me like circling back to the beginning, since meaningful samples are what we want in the first place.

The idea for getting the variance of the integral estimate from RQMC is the following: given a number m of deterministic QMC points, randomize them r times using independently drawn random variables. The integral estimates \hat I_1,\dots, \hat I_r are unbiased estimates of the true integral I with some variance \sigma_{\textrm{RQMC}}^2. Averaging over the \hat I_i we get another unbiased estimator \hat I which has variance \sigma_{\textrm{RQMC}}^2/r. This RQMC variance can be estimated unbiasedly as \frac{1}{r(r-1)} \sum_{i=1}^r (\hat I_i - \hat I)^2. If m = 1, then r is the number of samples and this is simply the standard estimate of the Monte Carlo variance.
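A minimal sketch of this recipe, using a random shift modulo 1 (Cranley–Patterson rotation) as the randomization and a toy integrand, both my own choices for illustration:

```python
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(4)
d, m, r = 2, 256, 20                       # dimension, QMC points, randomizations
f = lambda u: np.prod(u, axis=1)           # toy integrand on [0,1]^2, true integral 1/4

base = qmc.Halton(d=d, scramble=False).random(m)   # deterministic QMC point set

# r independent randomizations by a random shift modulo 1 (Cranley-Patterson)
I = np.empty(r)
for i in range(r):
    shifted = (base + rng.random(d)) % 1.0
    I[i] = np.mean(f(shifted))

I_hat = I.mean()                                      # combined RQMC estimate
var_hat = np.sum((I - I_hat) ** 2) / (r * (r - 1))    # unbiased variance estimate
print(I_hat, var_hat)
```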

The paper goes on to talk about what it calls Latin Supercube Sampling for applying RQMC in even higher dimensions, again using the ANOVA decomposition and RQMC points in each group of interacting variables as determined by the ANOVA.

Overall, I now know a bit more about QMC but am still more in the confusion state than in the knowledge state, which I hope Mathieu's next reference will help with.