A Complete Recipe for Stochastic Gradient MCMC

Another NIPS paper, this one by Ma, Chen and Fox. It is interesting in that it seeks to build a framework for MCMC algorithms which use only random subsamples of the data. This is achieved by considering the general form of a continuous Markov process as an SDE given by

dz = f(z)\textrm{d}t + \sqrt{2D(z)}\textrm{d}W(t)

where z spans the actual sampling space you care about (the marginal being the target distribution/posterior) plus possible auxiliary variables, and the drift f is constructed from the log of the joint probability over z. Now they suggest parameterizing f by one skew-symmetric and one PSD matrix (both of which can depend on the current state z), and prove in two theorems that this is all you need to describe all continuous Markov processes that admit the target distribution as their unique stationary distribution. The proofs use the Fokker-Planck and Liouville equations, which as a non-SDE person I had never heard of.

They use this to show ways of reconstructing known stochastic sampling algorithms and give a new sampler, which is a stochastic version of Riemannian HMC.
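To see how a known algorithm drops out of the recipe, here is a minimal sketch (mine, not the paper's code) of SGLD: take D as the identity and Q = 0, so the correction term \Gamma vanishes, discretize with step size \epsilon and plug in a stochastic gradient of the log posterior. The toy Gaussian target, minibatch size and step size below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: posterior over the mean of a unit-variance Gaussian, prior N(0, 10^2).
data = rng.normal(loc=2.0, scale=1.0, size=1000)
N, m = len(data), 50  # dataset size, minibatch size

def stoch_grad_log_post(theta):
    """Unbiased minibatch estimate of grad log p(theta | data)."""
    batch = rng.choice(data, size=m, replace=False)
    return -theta / 100.0 + (N / m) * np.sum(batch - theta)

def sgld_step(theta, eps):
    """One SGLD step: the recipe with D = I, Q = 0 (hence Gamma = 0),
    Euler-discretized with step size eps and a stochastic gradient."""
    return theta + eps * stoch_grad_log_post(theta) + np.sqrt(2 * eps) * rng.normal()

theta, eps, samples = 0.0, 1e-4, []
for _ in range(5000):
    theta = sgld_step(theta, eps)
    samples.append(theta)

print(np.mean(samples[1000:]))  # close to the posterior mean (about 2.0)
```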
The paper is very honest in stating the limitations of SG sampling, namely that one has to diminish the learning rate to infinitesimal values (i.e. towards 0) to ensure unbiasedness – which would of course defeat the purpose of sampling.

Probabilistic Line Searches for Stochastic Optimization

This very remarkable NIPS paper by Mahsereci and Hennig (from my beloved Tübingen where I was an undergrad) deals with getting rid of learning rates for SGD. It does so by placing a GP prior on the value of the objective function along the search direction and on its gradients projected onto the search direction. The pd kernel used for the GP is that of a once-integrated Wiener process, resulting in a cubic spline posterior. They argue that this particular form is advantageous because its minima can be found exactly in linear time – tada! This of course is great news, because, paraphrasing the paper, you don’t want to run an optimization algorithm to find minima inside the line search routine of the actual optimization algorithm. Another reason for choosing this particular pd kernel is that it does not assume an infinite number of derivatives of the GP (as the Gaussian RBF does) – making it robust to irregularities in the loss function.
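For concreteness, a small sketch of the covariance function of a once-integrated Wiener process (my own minimal version; the paper's kernel additionally includes offset and noise terms, which I omit here):

```python
import numpy as np

def integrated_wiener_kernel(s, t):
    """Covariance of a once-integrated Wiener process:
    cov(X(s), X(t)) = min^3 / 3 + |s - t| * min^2 / 2 for s, t >= 0."""
    m = np.minimum(s, t)
    return m**3 / 3.0 + np.abs(s - t) * m**2 / 2.0

# Gram matrix over a few candidate step sizes along the search direction
ts = np.array([0.0, 0.25, 0.5, 1.0])
K = integrated_wiener_kernel(ts[:, None], ts[None, :])
print(K)
```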

Another point this rich paper makes is that the Wolfe conditions for termination of the line search procedure are actually positivity constraints, and that the probability of these constraints being satisfied can be modeled by a Gaussian CDF. If that probability is greater than 0.3 (a threshold motivated in the paper), the algorithm terminates.
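For illustration, a rough sketch of the resulting termination test as I understand it: under the GP posterior the two Wolfe quantities are (approximately) Gaussian, so the probability of them being positive is a Gaussian CDF evaluation. Treating the two conditions as independent is my simplification; the paper works with their joint distribution, and all numbers below are made up.

```python
import numpy as np
from scipy.stats import norm

def wolfe_probability(m_a, s_a, m_b, s_b):
    """Probability that both Wolfe quantities a_t and b_t are positive,
    given their GP-posterior means and standard deviations.
    Simplification: a_t and b_t treated as independent Gaussians."""
    p_a = norm.cdf(m_a / s_a)  # P(a_t >= 0) for a_t ~ N(m_a, s_a^2)
    p_b = norm.cdf(m_b / s_b)
    return p_a * p_b

# accept the candidate step if the probability exceeds the threshold from the paper
accept = wolfe_probability(0.4, 0.3, 0.2, 0.25) > 0.3
print(accept)
```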

Generally speaking, the authors do a very nice job of getting rid of any free parameters in their approach, leading to a black box line search which can be used in SGD, effectively eliminating the dreaded learning rate. What I, as a sampling person, would really be interested in is whether or not this could actually be put to good use in Monte Carlo as well. However, it is delicate to use gradient information in sampling and not introduce bias, so I think this might be very difficult.

Sampling from Probabilistic Submodular Models

This NIPS paper by Gotovos, Hassani and Krause deals with coming up with a sampler for submodular models. Submodularity is an interesting notion capturing the diversity/representativeness of a set. A submodular function is a function from sets to real values defined by a diminishing-returns property: for S,T \subseteq \Omega, a function F is submodular if F(S)+F(T)\geq F(S\cup T)+F(S\cap T). For a probabilistic model one can use such an F by defining the probability of a set as p(S) \propto \exp(\beta F(S)). Because of the diminishing returns, this results in a distribution where points are repulsive and tend to spread out over the space rather than clump together, just like in the low-discrepancy point sets used for QMC. Think of the Determinantal Point Process, which is log-submodular and has also been called the Antisocial Coffeeshop Process, or, evocatively, the Urinal Process (thanks to Vinayak Rao for this 😀 ).

The paper deals with the case of a discrete sampling space. Sampling a set over a discrete domain can be thought of as sampling binary vectors. Speaking of sampling, the reference they give for MCMC is rather obscure. The Gibbs sampler they give is very straightforward: it proposes to change one entry of the binary vector at a time (sketched below). The main contribution of the paper is not the sampler, which is rather standard, but rather proving upper bounds on its mixing time, which is of order n^2 if the base set from which we can add/remove points has cardinality n. If adding/removing a point changes the submodular function by an amount bounded by 1, the upper bound improves to order n\log n. I have no intuition whether this improved result is obtained under realistic assumptions.
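For concreteness, a minimal sketch of such a single-site Gibbs sampler for p(S) \propto \exp(\beta F(S)), with a toy coverage function standing in for the submodular F (the function, ground set and parameters are my own choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy submodular function: coverage. Item i covers a subset of a ground set.
coverage = [{0, 1}, {1, 2}, {2, 3}, {3, 4, 5}]
n = len(coverage)

def F(x):
    """Coverage of the selected items; x is a binary vector over the n items."""
    covered = set().union(*(coverage[i] for i in range(n) if x[i]))
    return len(covered)

def gibbs_sweep(x, beta):
    """Single-site Gibbs: resample each coordinate given the rest,
    targeting p(x) proportional to exp(beta * F(x))."""
    for i in range(n):
        x1, x0 = x.copy(), x.copy()
        x1[i], x0[i] = 1, 0
        p1 = 1.0 / (1.0 + np.exp(beta * (F(x0) - F(x1))))  # P(x_i = 1 | rest)
        x[i] = int(rng.random() < p1)
    return x

x = np.zeros(n, dtype=int)
for _ in range(1000):
    x = gibbs_sweep(x, beta=1.0)
print(x, F(x))
```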

What I have concluded for myself, independent of this paper, is that sampling one component at a time often gives a very bad sampler. In this particular case, the sampler is very close to the Gibbs sampler for the Indian Buffet Process, which mixes rather awfully and is one of the reasons why I turned away from the IBP and towards working on sampling algorithms. I don’t see how their sampler might improve here. Also, I’m unsure why they did not cite Sequential Monte Carlo on large binary sampling spaces by Christian Schäfer and Nicolas Chopin – were they unaware of it? That paper suggests a much more elegant sampler. It adapts to correlations between entries in the binary vector, which is really what you want when doing posterior inference. And the gains of adaptivity are hard to overestimate (as are the perils) – take for example the fact that unadaptive, perfectly tuned HMC is worse than most simple adaptive algorithms, as reported for example in my Gradient IS work or in the tutorial on adaptive MCMC by Andrieu and Thoms.

Training Very Deep Networks – aka Deep Highway Networks

This interesting paper by Srivastava, Greff and Schmidhuber (arxiv) looks at the problem of enabling deeper architectures for neural networks. The main argument is that neural networks have become successful mainly because training deeper architectures has only recently become possible – and making them deeper might only improve performance. However, performance tends to plateau in conventional deep architectures, as exemplified by their first plot.

The proposed solution to training deeper architectures is to simply introduce a gating architecture similar to LSTM ideas. If a normal layer is defined by y = H(x, w) for output y, input x and parametrization of the layer given by w, a highway layer is simply given by y = H(x, w_H) \odot T(x, w_T) + x \odot C(x, w_C), where T and C (transform and carry, respectively) are functions with a (co)domain of the same dimensionality as x and y. This is called a highway layer, as gradient information supposedly travels faster along the x \odot C(x, w_C) part (hence the Autobahn illustration). When it is required to change dimensionality, one has to resort to a normal layer, or to subsampling/padding.

Now this architecture of the highway layer allows gradient information in SGD to travel two paths, one through H(\cdot)T(\cdot), one through C(\cdot)\odot x, and the output of a layer is a combination of the input of the layer x and the output of H(x,w). The authors use a logistic model T(\cdot) = \sigma(W_T \cdot+b_T) and couple the gates as C(\cdot) = 1-T(\cdot). They bias the network towards simply carrying information through without transformation, which in their setup simply means b_T < 0. After training the network with SGD and looking at the values of the transform gates of different layers, they find that those are closed on average (Figure 2, second column) while exhibiting large variance across inputs (exemplified by the transform gate outputs of a single sample, Figure 2, third column). Interestingly, some dimensions of the input to the network are carried through right to the last layer.
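To make the layer concrete, here is a minimal NumPy sketch of a single highway layer with the C = 1 - T coupling; the tanh nonlinearity for H, the dimensions and the initialization are my own illustrative choices, not the paper's exact setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = H(x) * T(x) + x * (1 - T(x)), elementwise.
    H is a nonlinear transform (tanh here), T a sigmoid transform gate,
    and the carry gate is coupled as C = 1 - T."""
    H = np.tanh(W_H @ x + b_H)        # the usual nonlinear layer
    T = sigmoid(W_T @ x + b_T)        # transform gate in (0, 1)
    return H * T + x * (1.0 - T)      # gated mix of transform and carry

d = 8
rng = np.random.default_rng(0)
x = rng.normal(size=d)
W_H, W_T = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
b_H = np.zeros(d)
b_T = -2.0 * np.ones(d)               # negative bias: start close to the identity map
y = highway_layer(x, W_H, b_H, W_T, b_T)
print(np.max(np.abs(y - x)))          # small: the layer initially mostly carries x through
```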

They find that this architecture not only helps with training larger networks, but also aids in actually utilizing deeper architectures. So this paper had a similar inception to LSTMs: it tries to solve an inference problem and ends up with a model that is more powerful in a way.

I feel that the result is somewhat odd: carrying the input information through could also be achieved by setting the weights of the original normal layer H accordingly (if it is fully connected, that is). Why does adding the extra complexity gain so much? I do not feel the paper addresses this important question, and I can only speculate myself: perhaps the transform/carry architecture is effective because its weights lie on a compact manifold (because of the logistic transform). Maybe the problem becomes better conditioned because simple downscaling of the input is done by T, while H takes care of actual transformations. To check this, it would have been interesting to know whether they used any constraints on the weights of H – if they haven’t, and H is fully connected, the better conditioning might be an actual explanation. In that case, simply constraining the weights W_H^{(i,i)} of H (mapping input x_i to the same output dimension y_i) in an appropriate way might have similar effects.

EDIT: After talking to the authors, I realized where my thinking was insufficient: the layer H inside the highway always uses a nonlinear map, and the gating allows to switch that off. I guess that’s the difference between reading about NNs and using them…

Black Box Variational Inference

My friend Patrick Jähnichen asked me to read this AISTATS 2014 paper by Ranganath, Gerrish and Blei. For somebody using mainly variational inference, I guess the prospect of not having to derive an individual algorithm for each new model is very appealing (while for poor sampling Ingmar, it would be very appealing to scale like VI does).

However, I found the paper hard to read and unclear in its conclusions. For the Rao-Blackwellization procedure, the main paper leaves unclear what is conditioned upon; I guess it’s the variational parameters of all other components. Similarly, I’m not sure where the first equality in equation (13) comes from, which is crucial for the control variate ansatz. These things may just be me.
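For readers who, like me, needed to write it down: a minimal sketch of the plain score-function gradient estimator that Black Box VI builds on, without the paper's Rao-Blackwellization and control variates, for a Gaussian variational family and a toy target of my choosing.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_joint(z):
    """Toy unnormalized target: a standard Gaussian 'posterior'."""
    return -0.5 * z**2

def elbo_grad_estimate(mu, log_sigma, n_samples=200):
    """Score-function estimator of the ELBO gradient for q = N(mu, sigma^2):
    E_q[ grad_lambda log q(z) * (log p(z) - log q(z)) ]."""
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=n_samples)
    log_q = -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    score_mu = (z - mu) / sigma**2               # d/d mu of log q(z)
    score_ls = ((z - mu) ** 2 / sigma**2) - 1.0  # d/d log_sigma of log q(z)
    f = log_joint(z) - log_q
    return np.mean(score_mu * f), np.mean(score_ls * f)

mu, log_sigma, step = -2.0, np.log(0.3), 0.05
for _ in range(2000):
    g_mu, g_ls = elbo_grad_estimate(mu, log_sigma)
    mu, log_sigma = mu + step * g_mu, log_sigma + step * g_ls
print(mu, np.exp(log_sigma))  # should approach the target's mean 0 and std 1
```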

More problematic, I think, is their evaluation, especially the comparison to MH-within-Gibbs. This is a particularly bad sampling scheme, and I don’t even suggest using HMC instead: one could easily use a simple Adaptive Metropolis algorithm and still be black box. Also, Rao-Blackwellization can be applied in this pure adaptive MCMC setting as well. Another conservative and black-box possibility would be the adaptive SMC approach of Fearnhead. My intuition is that both variants might actually be better than the suggested Black Box VI, or at least that the difference in the one performance plot of the paper would barely be noticeable.

However, I think this direction of research is a useful one and I hope they can improve upon the current state.

The Sample Size Required in Importance Sampling


A very exciting new preprint by Chatterjee and Diaconis takes a new look at good old importance sampling. While it is framed as a constructive paper (maybe because of publication bias), its main result to me is actually the deconstruction of an IS myth. Let me elaborate.

The paper shows that the variance estimate for an importance sampling estimator is not a good diagnostic for convergence. An implication of this is that aiming for low variance of the importance weights, or equivalently a high Effective Sample Size, might be better than nothing, but is not actually that great. In more mathy speak, a high ESS is a necessary condition for convergence of the estimate, but not a sufficient one.
Especially in your face is their result that for any \epsilon, one can find a number of samples n such that p(\mathrm{var}_n(f) < \epsilon) \geq 1-4\epsilon, where \mathrm{var}_n is the empirical variance of n samples, independent of your target density and your proposal distribution. Think about this: I don’t care about your actual integration problem or your proposal distribution. Just tell me what estimator variance you are willing to accept and I will tell you a number of samples that will exhibit lower variance with high probability. This is, in their words, absurd as a diagnostic.
However, the new convergence diagnostic they propose is basically plagued by a similar problem. They even state themselves that their alternative criterion q_n = \frac{\max_i\{w_i\}}{\sum_i w_i} (where the w_i are the importance weights) being close to 0 can be proved to be necessary but not sufficient. Just stating that the ESS isn’t great is not a great sell for some reviewers, so they have to come up with an alternative – and conjecture that it is sufficient, while not being able to prove it. Anyway, q_n might still be a better diagnostic than the ESS, and all things considered this is a wonderful piece of research.
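For reference, both diagnostics computed from unnormalized importance weights (my own snippet; the target/proposal pair is just an example):

```python
import numpy as np

def is_diagnostics(log_w):
    """Effective sample size and the Chatterjee/Diaconis statistic
    q_n = max(w) / sum(w), computed from unnormalized log-weights
    in a numerically stable way."""
    log_w = np.asarray(log_w)
    w = np.exp(log_w - np.max(log_w))     # rescaling cancels in both ratios
    ess = w.sum() ** 2 / np.sum(w ** 2)   # ESS = (sum w)^2 / sum w^2
    q_n = np.max(w) / np.sum(w)           # close to 0 is necessary for convergence
    return ess, q_n

# Example: target N(0, 1), proposal N(0, 2^2)
rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=10_000)
log_w = (-0.5 * x**2) - (-0.5 * (x / 2.0) ** 2 - np.log(2.0))
print(is_diagnostics(log_w))
```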

On Markov Chain Monte Carlo methods for tall data

The preprint by Rémi Bardenet, Arnaud Doucet and Chris Holmes offers a very well written overview and some original contributions regarding MCMC with large datasets (I like ‘tall’ – maybe high dimensional posteriors could then be ‘fat’, and problems which exhibit both are the worn-out ‘big’? ;-).

Their original contributions revolve around using ideas from Firefly MC (developed by the authors of the Stochastic Optimization is Variational Inference paper) on bounds for the likelihood within their confidence sampler. I must say I wasn’t able to follow in detail and will have to read up on Firefly and the original confidence sampler paper. One downside I see is that targets need to be differentiable (thrice, for their bounds using Taylor approximations) and thus in particular continuous. While in general I try to avoid discrete models simply because you can’t use gradient information, sometimes they are necessary. And it is better to make progress on something slightly limited than no progress at all.

In their review section, they clearly advise against divide-and-conquer approaches for tall data, i.e. approaches that sample from posteriors involving only part of the data and then combine the information somehow. The main reason is that the currently existing upper bounds suggest that the MSE of divide-and-conquer approaches could grow exponentially in the number of batches. Thus one would have to use few, large batches, defeating the whole purpose of special methods for tall data.

The section finally exposed me to the delayed acceptance approach of Marco Banterle et al., which I have been meaning to read for quite some time (and will read in the original). I found this approach very appealing for its elementary formulation; most other approaches, including that of Bardenet, Doucet and Holmes, are quite involved. I still have to look into the objection regarding ergodicity of delayed acceptance raised by Bardenet et al., which could of course be quite a problem.
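As far as I understand the idea, the acceptance ratio is factored into a cheap part and an expensive part, and the expensive factor is only evaluated if a first stage based on the cheap factor already accepts. A minimal sketch with a symmetric proposal and an artificial factorization of a toy target (mine, not Banterle et al.'s setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def delayed_acceptance_mh(x0, log_cheap, log_expensive, proposal_std, n_iter):
    """Two-stage Metropolis-Hastings with a symmetric random-walk proposal.
    log pi(x) = log_cheap(x) + log_expensive(x); the expensive factor is
    only evaluated when the cheap first stage already accepts."""
    x, chain = x0, []
    for _ in range(n_iter):
        y = x + proposal_std * rng.normal()
        # Stage 1: cheap surrogate ratio
        if np.log(rng.random()) < log_cheap(y) - log_cheap(x):
            # Stage 2: correct with the remaining (expensive) factor
            if np.log(rng.random()) < log_expensive(y) - log_expensive(x):
                x = y
        chain.append(x)
    return np.array(chain)

# Toy: pi = N(0, 1), artificially split into a cheap and an 'expensive' factor
log_cheap = lambda x: -0.25 * x**2
log_expensive = lambda x: -0.25 * x**2
chain = delayed_acceptance_mh(0.0, log_cheap, log_expensive, 1.0, 20_000)
print(chain.mean(), chain.var())  # roughly 0 and 1
```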

A word about the evaluation section: I think it’s unfortunate that two of the four experiments use logistic regression. The logistic regression target is very close to Gaussian, where MCMC is usually slammed by a simple EP approach (Leave Pima Indians alone!). And EP is easy to apply in the big data setting, or that’s what I’ve heard. About the gamma regression target I don’t know; it might be that EP does not do so well in that case.

Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations

This paper by Schölkopf et al. in Statistics and Computing (and on arXiv, which I will refer to when giving page numbers) was a very rewarding read. Although I think the promise of “Kernel Probabilistic Programming” given in the abstract is not really kept, it served me as a concise review of ideas in modern (i.e. distribution-based) methods using positive definite kernels. The obvious connection of the pd kernel mean map with kernel density estimators is just one of the things that I think are important to realize in this domain.
Estimating integrals of functions in a certain RKHS becomes a rather straightforward exercise in the proposed framework. However, it’s a bit unfortunate that they assume independent samples from the distribution; posing this as an IS estimator, one might be able to apply some elementary results leading to a generalization for dependent samples (they talk about applications to dependent samples, but I felt that was geared only towards causal inference). What might get in the way here is that their estimators are assumed to be independent of the actual samples used (which is not the case in IS).
These difficulties with their estimator for functions of RVs notwithstanding, what struck me again and again is the close connection with importance sampling estimators (I’m a one trick pony). Specifically, the empirical kernel mean map of a set of samples X, \hat \mu[X] = \sum_{x \in X} \alpha(x) k(\cdot,x) for a pd kernel k, becomes an IS estimator under the constraint imposed in the paper (that \sum_{x \in X} \alpha(x) =1) and one additional assumption (\alpha(x) >0). Also, when using their method with sparse representations (in the paragraph “More general expansion sets”, where they make the assumption \alpha(x) >0 themselves), the way they expand the sparse representation is almost exactly importance resampling – funny enough, they are not sure whether the method gives consistent estimates. I’m not sure about the influence of their slight variation of importance resampling, but just using importance resampling straight away results in consistent estimators, of course under some conditions on the weights \alpha(\cdot) and \beta(\cdot) – those conditions unfortunately being incompatible with the assumption that weights are independent of the samples.
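To make the IS connection concrete: for a function f = \sum_j \beta_j k(\cdot, y_j) in the RKHS, the inner product with the empirical mean map is just \sum_i \alpha(x_i) f(x_i), a weighted average of function evaluations – and with normalized, nonnegative weights this is exactly a self-normalized importance sampling estimate. A small sketch with a toy target/proposal pair of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

# Samples from a proposal, weighted so the empirical mean map targets pi = N(0, 1)
x = rng.normal(0.0, 2.0, size=2000)                        # proposal N(0, 2^2)
log_w = (-0.5 * x**2) - (-0.5 * (x / 2.0) ** 2 - np.log(2.0))
alpha = np.exp(log_w - np.max(log_w))
alpha /= alpha.sum()                                       # weights sum to one, nonnegative

# A function living in the RKHS: f = sum_j beta_j k(., y_j)
y = np.array([-1.0, 0.5, 2.0])
beta = np.array([0.3, -0.7, 1.1])
f_at_x = gauss_kernel(x, y) @ beta

# <mu_hat[X], f> = sum_i alpha_i f(x_i): a self-normalized IS estimate of E_pi[f]
print(alpha @ f_at_x)
```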

I would be really interested to see whether this IS connection is fruitful.

On a more critical note, the paper states that no parametric assumptions on the embedded distributions are made. While this is true, it masks the fact that different types of pd kernels used for the embedding will result in better or worse approximations of integrals in this framework when using finite numbers of samples. Also, the paper claims their method “does not require explicit density estimation as an intermediate step” while still using the kernel mean map, which includes KDEs as a special case. A minor thing is that Figure 1 needs error bars badly.

Early Stopping as Nonparametric Variational Inference (not)

Update: This paper has been accepted to AISTATS. As far as I can see, in exactly the same version that I review here.
This preprint by Maclaurin, Duvenaud and Adams promised to be a very interesting read. Their idea is that stochastic optimization of a Bayesian model can be seen as sampling from a variational approximation of the target distribution. Indeed this does not seem surprising given the consistency results for Stochastic Gradient Langevin Dynamics by Teh et al. A major gist of the paper is that the variational approximation can be used to estimate a lower bound on the model’s marginal likelihood/model evidence. This estimate is used throughout the evaluations.

However, in its current version the paper exhibits several weaknesses, the least of which is the experiments, with which I will start (using the psychologically researched foot-in-the-door technique 😉 ). In their first experiment they optimized the parameters of a neural network without regularization on the parameters. This is equivalent to using an improper prior – which makes the marginal likelihood completely useless for Bayesian model comparison because of the marginalization paradox. Basically, the problem is that improper priors are only known up to proportionality – and the marginal likelihood could be anything because of that. To get an intuition for this, observe that the marginal likelihood satisfies \int p(\textrm{Data}\vert \theta)\, p(\theta)\,\textrm{d}\theta = C\int p(\textrm{Data}\vert \theta)\,\textrm{d}\theta, where C is some constant that can be anything, because p(\theta) = C is always a correct characterization of the (flat) improper prior. In the other experiments, a statement on whether the prior is proper or improper is absent.

They also claim that, after sampling from some distribution q and taking a stochastic gradient step computed from the partial posterior, they are able to compute the Jacobian of this step. This however is impossible because the gradient step is not bijective. Consider for example a univariate Gaussian distribution and taking a gradient step: you could end up at the mode from two values of equal log posterior density with equally large gradients pointing towards the mode. But if you can end up at the mode from two different points, then the gradient step is not injective and thus definitely not bijective – hence no computable Jacobian of the transformation. I hit upon the same problem in my early ideas for Gradient Importance Sampling, which is why in that algorithm the gradient is not used for an optimization step after sampling but rather to inform the sampling distribution.

A further problem is that they seem to assume that evaluating the log posterior on only part of the data gives an unbiased estimate of the log posterior using all of the data. This is not true; in fact, using minibatches can introduce a small sampling bias – this being the reason that Consensus Monte Carlo introduces a jackknife debiasing procedure, and the motivation behind Heiko Strathmann’s work on removing subsampling bias in MCMC for big data (another example being the paper on tall data by Rémi Bardenet, Arnaud Doucet and Chris Holmes).

Finally, and I’ll end here, they state that a limitation of their method might be that

using only a single sample to estimate both the expected likelihood as well as the entropy of an entire distribution will necessarily have high variance under some circumstances

This is understating things – I’m not sure a single sample will have finite variance at all.

I really hope that they can improve on this work, but it won’t be a piece of cake.

EDIT: I talked to David Duvenaud at his poster at a NIPS workshop today. While I still think my objections are valid, I was harsher than I wish I had been (I wanted to be constructive, but I just have no idea how one could proceed and ended up just criticizing, which must have been very frustrating). Just to clarify: even when you do not end up at the mode after a gradient step, the step is still not invertible in general, and hence not bijective.

Frank-Wolfe Bayesian Quadrature

This paper (arXiv) on Bayesian Quadrature with actual theoretical guarantees starts from sigma point rules for approximating integrals. “Sigma point rule” is just a fancy name for using weighted samples, where weights are not constrained in any way (and will actually be negative sometimes in the algorithm).

The basic idea is to approximate the target distribution \pi as lying in a finite dimensional Reproducing Kernel Hilbert Space and minimize the distance between the approximation q and \pi (measured using RKHS’ own brand of distribution distance, the Maximum Mean Discrepancy or MMD). This, so far, is an idea that is used in basically all modern methods for numeric integration – variational Bayes and EP of course, but also adaptive MC methods that sample from some q_t at time t then update the proposal to q_{t+1}, which should be close to \pi (e.g. a Gaussian approximation in Adaptive Metropolis, or a Mixture of Gaussians or Students with an EM-inspired update in both Andrieu and Moulines 2006 and Cappé et al 2007).
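For orientation, here is what the objective looks like in the simplest case I can think of: the squared MMD between a weighted point set and \pi, which requires the kernel mean of \pi to be available. I plug in the closed form for a Gaussian kernel and a Gaussian \pi purely as a stand-in; the kernel, target and points below are my own toy choices, not the paper's setup.

```python
import numpy as np

def mmd_squared(x, w, ell=1.0, sigma_pi=1.0):
    """Squared MMD between a weighted point set (x, w) and pi = N(0, sigma_pi^2),
    for the Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 ell^2)).
    Uses the closed-form kernel mean of a Gaussian target as a stand-in for
    the general case, where mu_pi must be available in some other form."""
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell**2)
    mu_pi = ell / np.sqrt(ell**2 + sigma_pi**2) * np.exp(
        -0.5 * x**2 / (ell**2 + sigma_pi**2))
    double_int = ell / np.sqrt(ell**2 + 2 * sigma_pi**2)  # E[k(X, X')] under pi
    return w @ K @ w - 2 * w @ mu_pi + double_int

# Equal-weight points at rough Gaussian quantiles already do quite well
x = np.array([-1.5, -0.5, 0.5, 1.5])
w = np.full(4, 0.25)
print(mmd_squared(x, w))
```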

However, the nice thing in this paper is that they actually provide theoretical guarantees that their integral estimates are consistent and converge at rate O(1/n) or O(\exp(-Cn)) (where n is the number of samples). While the assumption of \pi lying in a finite dimensional RKHS is to be expected, the stronger assumption is that it has compact support.

The crucial point seems to me to be the following: while the estimator has a superior rate, picking design points costs O(n^2) or O(n^3) where n is the number of points. Drawing a sample using Monte Carlo algorithms on the other hand takes constant time, which makes MC linear in the number of points (O(n)). This of course is crucial for the wall clock time performance and makes theoretical comparison of MC vs FW Bayesian Quadrature delicate at best.
As far as a conclusive story for the paper goes, I think it is a bit unfortunate that the evaluation uses iid samples drawn directly from \pi inside the optimization algorithm – because if you have those, plain Monte Carlo estimates are really good as well. But after a nice discussion with the authors, I realized that this does not in any way taint the theoretical results.

I wonder whether it is mostly the idea of extrapolating \pi that gives these good rates, as I think Dan Simpson suggested on Xi’an’s blog. And finally, this does not really alleviate the problem that BQ practically works only with Gaussian RBF kernels – which is a rather strong assumption on the smoothness of \pi.