Frank-Wolfe Bayesian Quadrature

This paper (arXiv) on Bayesian Quadrature with actual theoretical guarantees starts from sigma point rules for approximating integrals. “Sigma point rule” is just a fancy name for an estimate built from weighted samples, where the weights are not constrained in any way (and will in fact sometimes be negative in the algorithm).
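
To make that concrete, here is a minimal sketch of what such a rule looks like; the points and weights below are made-up toy values, not the output of the paper's algorithm:

```python
import numpy as np

# A "sigma point rule": an integral estimate built from weighted evaluation
# points. The weights need not be uniform and may be negative.
def sigma_point_estimate(f, points, weights):
    """Approximate E_pi[f] by the weighted sum  sum_i w_i * f(x_i)."""
    return np.dot(weights, np.array([f(x) for x in points]))

# Hypothetical design points and (partly negative) weights, for illustration only.
points = np.array([-1.2, 0.0, 0.7, 2.1])
weights = np.array([0.35, 0.55, 0.25, -0.15])   # sum to 1, one negative
print(sigma_point_estimate(np.sin, points, weights))
```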

The basic idea is to approximate the target distribution \pi as lying in a finite-dimensional Reproducing Kernel Hilbert Space and to minimize the distance between the approximation q and \pi, measured using the RKHS' own brand of distribution distance, the Maximum Mean Discrepancy (MMD). This much is an idea used in basically all modern methods for numerical integration: variational Bayes and EP of course, but also adaptive MC methods that sample from some q_t at time t and then update the proposal to q_{t+1}, which should be close to \pi (e.g. a Gaussian approximation in Adaptive Metropolis, or a mixture of Gaussians or Student's t distributions with an EM-inspired update in both Andrieu and Moulines 2006 and Cappé et al 2007).
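
As a toy illustration of the MMD objective being minimized, here is a sketch for the special case of a Gaussian RBF kernel and a one-dimensional Gaussian \pi, where the kernel mean \mu_\pi(x) = \int k(x,y) d\pi(y) has a closed form; the lengthscale and variance values are arbitrary choices for the example:

```python
import numpy as np

# Squared MMD between a weighted point set q = sum_i w_i delta_{x_i} and pi,
# for a Gaussian kernel and pi = N(0, sigma_pi^2), so all integrals are closed form.
ell, sigma_pi = 1.0, 1.0   # illustrative values

def k(x, y):                # Gaussian RBF kernel
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

def kernel_mean(x):         # mu_pi(x) = int k(x, y) N(y; 0, sigma_pi^2) dy
    s2 = ell ** 2 + sigma_pi ** 2
    return np.sqrt(ell ** 2 / s2) * np.exp(-x ** 2 / (2 * s2))

def mmd_squared(points, weights):
    K = k(points[:, None], points[None, :])                     # Gram matrix
    z = kernel_mean(points)                                     # embedding of pi at the points
    pipi = np.sqrt(ell ** 2 / (ell ** 2 + 2 * sigma_pi ** 2))   # double integral of k under pi
    return weights @ K @ weights - 2 * weights @ z + pipi

# Example: uniform weights on a few toy points.
pts = np.array([-1.0, 0.0, 1.5])
w = np.ones(3) / 3
print(mmd_squared(pts, w))
```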

However, the nice thing in this paper is that they actually provide theoretical guarantees that their integral estimates are consistent and converge at rate O(1/n) or O(\exp(-Cn)) (where n is the number of samples). While the assumption that \pi lies in a finite-dimensional RKHS is to be expected, the stronger assumption is that it has compact support.
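
For context, the reason MMD rates carry over to the integral estimates is the standard RKHS worst-case bound: for any integrand f in the RKHS \mathcal{H},

\left| \int f \,\mathrm{d}\pi - \sum_{i=1}^n w_i f(x_i) \right| \le \|f\|_{\mathcal{H}} \cdot \mathrm{MMD}\!\left(\sum_{i=1}^n w_i \delta_{x_i}, \pi\right),

so whatever rate you get for the MMD is the rate you get for the worst-case integration error.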

The crucial point, it seems to me, is the following: while the estimator has a superior rate, picking the design points costs O(n^2) or O(n^3), where n is the number of points. Drawing a sample with a Monte Carlo algorithm, on the other hand, takes constant time, which makes MC linear in the number of points (O(n)). This of course is crucial for wall-clock performance and makes a theoretical comparison of MC versus FW Bayesian Quadrature delicate at best.
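
To see where the quadratic cost comes from, here is a rough sketch of a Frank-Wolfe selection loop over a candidate grid. This is my own simplification, using the standard 1/(t+1) step size rather than the paper's line-search variant; the kernel, target (\pi = N(0,1)) and candidate grid are toy choices:

```python
import numpy as np

# Frank-Wolfe on the objective 1/2 ||g - mu_pi||^2 in the RKHS, where
# g = sum_j w_j k(x_j, .). Each iteration scans all candidates and, for each,
# sums over the points already selected -- that inner sum is what makes
# iteration t cost O(t), hence O(n^2) over the run (O(n^3) if, e.g., the
# weights are recomputed by solving a linear system at the end).
ell = 1.0
k = lambda x, y: np.exp(-(x - y) ** 2 / (2 * ell ** 2))
kernel_mean = lambda x: np.sqrt(ell**2 / (ell**2 + 1.0)) * np.exp(-x**2 / (2 * (ell**2 + 1.0)))

candidates = np.linspace(-4, 4, 200)
mu = kernel_mean(candidates)          # mu_pi evaluated at every candidate

points, weights = [], []
n = 30
for t in range(n):
    # Linearized objective at each candidate: sum_j w_j k(x_j, x) - mu_pi(x).
    if points:
        X, w = np.array(points), np.array(weights)
        lin = np.sum(w[:, None] * k(X[:, None], candidates[None, :]), axis=0) - mu
    else:
        lin = -mu
    x_new = candidates[np.argmin(lin)]   # Frank-Wolfe vertex
    rho = 1.0 / (t + 1)                  # standard FW step size; with it the weights
                                         # stay uniform, whereas the paper's line-search
                                         # variant can give non-uniform, even negative, weights
    weights = [(1 - rho) * wj for wj in weights] + [rho]
    points.append(x_new)
```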
As far as a conclusive story for the paper goes, I think it is a bit unfortunate that the evaluation feeds iid samples drawn directly from \pi into the optimization algorithm: if you have those, plain Monte Carlo estimates are already really good. But after a nice discussion with the authors, I realized that this does not in any way taint the theoretical results.

I wonder whether it is mostly the idea of extrapolating \pi that gives these good rates, as I think Dan Simpson suggested on Xi’an’s blog. And finally, this does not really alleviate the problem that BQ practically works only with Gaussian RBF kernels, which is a rather strong assumption on the smoothness of \pi.
