# Sampling from Probabilistic Submodular Models

This NIPS paper by Gotovos, Hassani and Krause deals with coming up with a sampler for submodular models. Submodularity is an interesting concept capturing the concept of diversity/representativeness of a set. A submodular function itself is a function from sets to a real value defined by a diminishing returns property: for $S,T \subseteq \Omega$ some function F is submodular if $F(S)+F(T)\geq F(S\cup T)+F(S\cap T)$. For a probabilistic model one can use such an F by defining a probability of a set as $S \propto exp(\beta F(S))$. Because of the diminishing returns, this results in a distribution  where points are repulsive and tend to spread out over the space rather than clump together, just like in Low Discrepancy Point Sets used for QMC. Think Determinantal Point Process, which is log-submodular and has also been called the Antisocial Coffeshop Process, or, evocatively,  the Urinal Process (thanks to Vinayak Rao for this 😀 ).

The paper deals with the case of a discrete sampling space. Sampling a set over a discrete domain can be thought of as sampling binary vectors. Speaking of sampling, the reference they give for MCMC is rather obscure.  The  Gibbs sampler they give is very straight forward: proposing to change one entry in the binary vector at a time. The main contribution of the paper is not the sampler, which is rather straight forward, but rather proving upper bounds on its mixing time, which is $n^2$ if the base set from which we can add/remove points is of cardinality n. If adding/removing a point only has an certain effect on the submodular function bounded by 1, then the sampler mixes at least in time $n\log n$. I have no intuition wether this improved result is obtained under realistic assumptions.

What I have concluded for myself independent of this paper is that sampling one component at a time often gives a very bad sampler. In this particular case, the sampler is very close to the Gibbs sampler for the Indian Buffer Process, which is mixing rather awfully and one of the reasons why I turned away from the IBP and towards working on sampling algorithms. I don’t see how their sampler might improve here. Also, I’m unsure why they did not cite Sequential Monte Carlo on large binary sampling spaces  by Christian Schäfer and Nicolas Chopin – have they been unaware of it? That paper suggests a much more elegant sampler. It adapts to correlations between entries in the binary vector, which is really what you want when doing posterior inference. And the gains of adaptivity are hard to underestimate (as are the perils) – take for example the fact that unadaptive, perfectly tuned HMC is worse than most simple adaptive algorithms, as for example reported in my Gradient IS work or in the Tutorial on adaptive MCMC by Andrieu and Thoms.

# Black Box Variational Inference

My friend Patrick Jähnichen asked me to read this AISTATS 2014 paper by Ranganath, Gerrish and Blei. For somebody using mainly variational inference, I guess the prospect of not having to derive an individual algorithm for each new model is very appealing (while for poor sampling Ingmar, it would be very appealing to scale like VI does).

However, I found the paper hard to read and unclear in its conclusions. For the Rao-Blackwellization procedure it is unclear what is conditioned upon in the main paper. I guess it’s the variational parameters of all other components. Similarly, I’m not sure where the first equality in equation (13) stems from, which is crucial for the Control Variate ansatz. These things may just be me.

More problematic I think is their evaluation, especially comparing to MH-within-Gibbs. This is a particularly bad sampling scheme, and I don’t even suggest to use HMC instead. One could easily use a simple Adaptive Metropolis algorithm and still be black box. Also, Rao-Blackwellization can be applied in this pure adaptive MCMC setting as well. Another conservative and black-box possibility would be adaptive SMC by Fearnhead. My intuition is that both variants might actually be better than the suggested Black Box VI or at least the difference in the one performance plot of the paper will barely be noticeable.

However, I think this direction of research is a useful one and I hope they can improve upon the current state.

# On Markov Chain Monte Carlo methods for tall data

The preprint by Rémi Bardenet, Arnaud Doucet and Chris Homes offers a very well written overview and some original contributions regarding MCMC with large datasets (I like ‘tall’ – maybe high dimensional posteriors could then be ‘fat’ and problems which exhibit both are the worn-out ‘big’ ?-).

Their original contributions revolve around using ideas from Firefly MC (developed by the authors of the Stochastic Optimization is Variational inference paper) on bounds for the likelihood with their Confidence sampler. I must say I wasn’t able to follow in detail and will have to read up on Firefly and the original Confidence sampler paper. One downside I see is that targets need to be differentiable (thrice for their bounds using Taylor approximations) and thus in particular continuous. While in general I try to avoid discrete models simply because you cant use gradient information, sometimes they are necessary. And better make progress on something slightly limited than not make progress.

In their review section, they clearly advise against divide-and-conquer approaches for tall data, i.e. approaches that sample from posteriors involving only part of the data and than combining the information somehow. The main reason for this being that the currently existing upper bounds suggest that the MSE of divide-and-conquer approaches could grow exponential in the number of batches. Thus one would have to use few, large batches, defeating the whole purpose of special methods for tall data.

The section finally exposed me to the delayed acceptance approach of Marco Banterle et al. which I meant to read for quite some time (and will read in the original). I found this approach to be very appealing for its elementary formulation. Most other approaches, including Bardenet, Doucet and Homes, are quite involved. I have to look into the objection regarding ergodicity of delayed acceptance raised by Bardenet, which could potentially be quite a problem of course.

A word about the evaluation section: I think its unfortunate that two of four experiments use logistic regression. The logistic regression target is very close to Gaussian, where MCMC is usually slammed by a simple EP approach (Leave Pima Indians alone!). And EP is easy to apply in the big data setting, or thats what I’ve heard. About the gamma regression target I don’t know, might be that EP does not do so well in that case.