# Exponential growth as plotted by my son

With my son, my niece, sister and her boyfriend I visited the science museum in Mannheim about two months ago. There where many nice experiments, and as a (former ?) computer scientist I was particularly impressed by Leibniz’ mechanical digital calculator, completed in the 17th century and able to do all four basic arithmetic operations. This of course is another example for the superiority of Leibniz over Newton
One piece was a pure exhibit with no experiments: a system of cogwheels connected sequentially, where each added wheel took ten times longer for one complete cycle compared to the one before. While the first wheel needed five seconds, the 17th and last one did not seem to move, and indeed it completes a cycle about every 1.6 billion years ($5*10^{16}\mathrm{sec}$).
This is an example of exponential growth of course, so I told him the rice on chessboard story, e.g. the idea that starting with one grain of rice on the first square of a chessboard and doubling it on the next leads to loads and loads of rice. When he asked for rice this weekend I was rather unsuspecting, until he spilled some of it and asked for our help cleaning it up. He started the first four squares (though with linear growth as he didn’t remember completely) and we completed the first row together – after which I started simply weighing the rice, as counting took about a second per grain and already for the 256 grains on h2 I was too impatient, let alone  the 1024 for f2, which would have taken about 16 minutes to count. If this little exercise doesn’t drive home what exponential growth means, I don’t know what would.

# A Variational Analysis of Stochastic Gradient Algorithms

This arXival by Mandt, Hoffman and Blei from February takes a look at what they call SGD with constant learning rate or constant SGD. Which is not really Stochastic Gradient Descent anymore, but rather a sampling algorithm, as we should bounce  around the mode of the posterior rather than converging to it. Consequently they interpret constant SGD as a stochastic process in discrete time with some stationary distribution. They go on to analyse it under the assumption that the stationary distribution is Gaussian (Assumption 1 and 4). Something that springs to my mind here is the following: even if we accept that the stationary distribution of the process might be Gaussian (intuitively, not rigorously), what guarantees that the stationary distribution is the posterior? Even if the posterior is also gaussian, it could have a different (co)variance. I think a few more words about this question might be good. In particular, without adding artificial noise, is it guaranteed that every point in the posteriors domain can be reached with non-zero probability? And can be reached an infinite number of times if we collect infinite samples? If not, the constant SGD process is transient and does not enjoy good stability properties.

Another delicate point hides in Assumption 1: As our stochastic gradient is the sum of independent RVs, the CLT is invoked to assume that the gradient noise is Gaussian. But the CLT might not yet take effect if the number of data points in our subsample is small, which is the very thing one would like in order to have advantages in computation time. This of course is yet another instance of the dilemma common to all scalable sampling techniques, and at this stage of research this is not a show stopper.
Now assuming that constant SGD is not transient and it has a stationary distribution that is the posterior and that the posterior or its dominant mode is approximately Gaussian, we can of course try to optimize the parameters of constant SGD to make the sample it produces as close a possible to a posterior sample. Which the authors conveniently do using an Ornstein-Uhlenbeck approximation, which has a Gaussian stationary distribution. Their approach is to minimize KL-Divergence between the OU approximation and the posterior, and derive optimal parameter settings for constant SGD in closed form. The most interesting part of the theoretical results section probably is the one on optimal preconditioning in Stochastic Gradient Fisher Scoring, as that sampler adds artificial noise. Which might solve some of the transience questions.

The presentation would gain a lot by renaming “constant SGD” for what it is – a sampling algorithm. The terminology and notation in general are sometimes a bit confusing, but nothing a revision can’t fix. In general, the paper presents an interesting approach to deriving optimal parameters for a class of algorithms that is notoriously hard to tune. This is particularly relevant because the constant step size could improve mixing considerably over Welling & Tehs SGLD. What would be interesting to see is wether the results could be generalized to the case where the posterior is not Gaussian. For practical reasons, because stochastic VB/EP works quite well in the gaussian case. For theoretical reasons, because EP now even comes with some guarantees (haven’t read that paper admittedly). Maybe a path would be to take a look at the Gaussian approximation to a multimodal posterior spanning all modes, and minimizing KL-Divergence between that and the OU process. Or maybe one can proof that constant SGD (with artificial noise?) has some fixed stationary distribution to which stochastic drift term plus noise are a Markov Kernel, which might enable a pseudo-marginal correction.

# Kernel Adaptive Sequential Monte Carlo

Heiko Stratmann, Dino Sejdinovic, Brooks Paige and myself had a poster on our work on Kernel Adaptive SMC (arXiv) at the Scalable Inference NIPS workshop this year. The basic idea is very related to Dinos paper on Kernel-informed MCMC, in that the particle system is (implicitly) mapped to a function space (the infamous Reproducing Kernel Hilbert Space) and used to inform a sample from a Gaussian fit to the functional associated with samples. As always, things that are nasty and nonlinear in the input space behave much more nicely in feature space. An when your name is Dino Sejdinovic, you can actually integrate out all steps in feature space, ending up with a Gaussian proposal in the space you actually want to sample from.

In the title cartoon, our approach (in red, called KASS for Kernel Adaptive SMC sampler) lines up nicely with the support of the target distribution. ASMC by Fearnhead et al., or at least the sampler they use in their evaluation, which is a special case of our method, cannot adapt to this nonlinear structure. This results in better performance for KASS compared to ASMC and Random Walk SMC, as measured by Maximum Mean Discrepancy.However, this of course comes at the price of higher computational complexity, similar to related methods such as Riemannian Manifold MALA/HMC. If your dataset is large however and evaluating the likelihood majorizes evaluating the Gram matrix on your particle system, this method will still have an edge over others when the support is nonlinear. For the MCMC predecessor KAMH/Kameleon see Heikos post.

# Probabilistic Line Searches for Stochastic Optimization

This very remarkable NIPS paper by Mahsereci and Hennig (from my beloved Tübingen where I was an undergrad) deals with getting rid of learning rates for SGD. It does so by placing a GP prior on the value of the objective function along the search direction and its gradients projected onto the search direction. The pd kernel used for the GP is that of a once-integrated Wiener Process, resulting in a cubic spline posterior. They argue that this particular form is advantageous because its minima can be found exactly in linear time – tada! This of course is great news, because, paraphrasing the paper, you don’t want to run an optimization algorithm to find minima inside of the line search routine of the actual optimization algorithm. Another reason for choosing the particular pd kernel is that it does not assume an infinite number of derivatives of the GP (such as the Gaussian RBF does) – making it robust to irregularities in the Loss function.

Another point this rich paper makes is that the Wolfe conditions for termination of the line search procedure are actually positivity constraints. And that the probability of the constraints being satisfied can be modeled by a Gaussian CDF. If the probability is greater that $0.3$ (this threshold being motivated in the paper), the algorithm is done.

Generally speaking, the authors do a very nice job of getting rid of any free parameters in their approach, leading to a black box line search which can be used in SGD, effectively eliminating the dreaded learning rate. What I, as a sampling person, would really be interested in is whether or not this could actually be put to good use in Monte Carlo as well. However, it is delicate to use gradient information in sampling and not introduce bias, so I think this might be very difficult.

# Early Stopping as Nonparametric Variational Inference (not)

Update: This paper has been accepted to AISTATS. As far as I can see, in exactly the same version that I review here.
This preprint by Maclaurin, Duvenaud and Adams suggested to be a very interesting read. Their idea is that stochastic optimization of a Bayesian Model could be seen as sampling from a variational approximation of the target distribution. Indeed this does not seem surprising given the consistency results for Stochastic Gradient Langevin Dynamics by Teh et al. A major gist of the paper is that the variational approximation can be used to estimate a lower bound of the models marginal likelihood/model evidence. This estimate is used throughout evaluations.

However in its current version the paper exhibits several weaknesses, the least of which is the experiments, with which I will start (using the psychologically researched foot-in-the-door-technique 😉 ). In their first experiment they optimized the parameters of a neural network without regularization on the parameters. This is equivalent to using an improper prior – but this makes the marginal likelihood completely useless for Bayesian model comparison because of the Marginalization Paradox. Basically the problem is that improper priors are only known proportionally – and marginal likelihood could be anything because of that. To get an intuition for this, observe that the marginal likelihood satisfies $\int p(\textrm{Data}\vert \theta) \textrm{d}p(\theta) = C\int p(\textrm{Data}\vert \theta){d}\theta$, where C is some constant that can be anything because $p(\theta){\propto}C$ is always a correct characterization of the improper prior. In the other experiments, a statement on whether the prior is proper or improper is absent.

Also they claim that sampling from some distribution q and taking a stochastic gradient step computed from the partial posterior, they are able to compute the Jacobian for this step. This however is impossible because the gradient step is not bijective. Consider for example a univariate Gaussian distribution and taking a gradient step. You could end up at the mode from two values of equal log posterior density and equally large gradients pointing towards the mode. But if you can end up at the mode from two different points, then the gradient step is not injective and thus definitely not bijective. Hence no computable Jacobian transformation. I hit upon the same problem in my early ideas for Gradient Importance Sampling, which is why in that algorithm the gradient is not used for an optimization step after sampling but rather to inform the sampling distribution.

A further problem is that they seem to assume that evaluating the log posterior of only part of the data represents an unbiased estimate of the log posterior using all of the data. This is not true, in fact using minibatches can introduce small sampling bias – this being the reason that Consensus Monte Carlo introduces a Jackknife debiasing procedure and the motivation behind Heiko Strathmanns work on removing subsampling bias in MCMC for big data (another example being the paper on tall data by Remi Bardenet, Arnaud Doucet and Chris Holmes).

Finally, and I’ll end here, they state that a limitation of their method might be that

using only a single sample to estimate both the expected likelihood as well as the entropy of an entire distribution will necessarily have high variance under some circumstances

This is understating things – I’m not sure a single sample will have finite variance at all.

I really hope that they can improve on this work, but it won’t be a piece of cake.

EDIT:  I talked to David Duvenaud at his poster at a NIPS workshop today. While I still think my objections are valid, I was harsher than I wished I had been (I wanted to be constructive, but I just have no idea how one could proceed and ended up just criticizing, which must have been very frustrating). Just to clarify: even when you do not end up at the mode after a gradient step, the step still is not invertible, resulting in it being not bijective.