Exponential growth as plotted by my son


With my son, my niece, my sister and her boyfriend I visited the science museum in Mannheim about two months ago. There were many nice experiments, and as a (former?) computer scientist I was particularly impressed by Leibniz's mechanical digital calculator, completed in the 17th century and able to do all four basic arithmetic operations. This of course is another example of the superiority of Leibniz over Newton.
One piece was a pure exhibit with no experiments: a system of cogwheels connected sequentially, where each added wheel took ten times longer for one complete cycle than the one before. While the first wheel needed five seconds, the 17th and last one did not seem to move at all, and indeed it completes a cycle only about every 1.6 billion years (5 \cdot 10^{16}\,\mathrm{sec}).
This is an example of exponential growth of course, so I told him the rice-on-chessboard story, i.e. the idea that starting with one grain of rice on the first square of a chessboard and doubling the amount on each following square leads to loads and loads of rice. When he asked for rice this weekend I was rather unsuspecting, until he spilled some of it and asked for our help cleaning it up. He started on the first four squares (though with linear growth, as he didn't remember the rule completely) and we completed the first row together – after which I switched to simply weighing the rice, as counting took about a second per grain and already for the 256 grains on h2 I was too impatient, let alone the 1024 for f2, which would have taken about 17 minutes to count. If this little exercise doesn't drive home what exponential growth means, I don't know what would.
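For the record, the numbers are easy to check – a small sketch (the square numbering follows the zig-zag order we filled the board in, so h2 is square 9 and f2 is square 11):

```python
# Grains on the k-th chessboard square (1-indexed): 2**(k-1).
def grains(k):
    return 2 ** (k - 1)

# h2 was the 9th square we filled, f2 the 11th (zig-zag order):
assert grains(9) == 256
assert grains(11) == 1024

# At roughly one grain per second, counting f2 alone takes about 17 minutes:
print(grains(11) / 60)  # ≈ 17.1 minutes

# The full board would hold 2**64 - 1 grains:
total = sum(grains(k) for k in range(1, 65))
assert total == 2 ** 64 - 1
```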

History doesn’t repeat itself, but it rhymes


Alexander Gauland (AfD party; top right) said last week that Germans wouldn't want Jérôme Boateng, a superb defender in Germany's national football team, as a neighbor. This of course is aimed at Boateng being black.
Had he been Jewish, Gauland would have had to resign the day after. But while Germans still react sensitively to antisemitism, it is becoming more and more possible to be openly racist in every other way. Actually, German history is used in a perverted way by the radical right: the fact that antisemitism exists among Muslims is used as an argument that Muslims can't be integrated. The homophobia of some Muslims is used in the same vein.
I predict, and this is a quite safe bet, that the current AfD party leader will step down within the next year and the AfD will become far more radical.

Many people in Europe are having an affair with the extreme right at the moment. Especially Germany, Italy, Austria and Spain should know better (and the rest too, actually). But well, first of all we didn't make that mistake yet – our grandparents did (or the grandparents of people in other countries). Second, these politicians are not anti-Jewish, just anti-black and anti-Muslim. Incomparable, obviously.

A Variational Analysis of Stochastic Gradient Algorithms

This arXival by Mandt, Hoffman and Blei from February takes a look at what they call SGD with a constant learning rate, or constant SGD. Which is not really Stochastic Gradient Descent anymore, but rather a sampling algorithm, as the iterates bounce around the mode of the posterior rather than converging to it. Consequently they interpret constant SGD as a stochastic process in discrete time with some stationary distribution. They go on to analyse it under the assumption that the stationary distribution is Gaussian (Assumptions 1 and 4). Something that springs to my mind here is the following: even if we accept that the stationary distribution of the process might be Gaussian (intuitively, not rigorously), what guarantees that the stationary distribution is the posterior? Even if the posterior is also Gaussian, it could have a different (co)variance. I think a few more words about this question would be good. In particular, without adding artificial noise, is it guaranteed that every point in the posterior's domain can be reached with non-zero probability? And can be reached an infinite number of times if we collect infinite samples? If not, the constant SGD process is transient and does not enjoy good stability properties.
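To see why the stationary distribution need not match the posterior, here is a toy sketch (entirely mine, not the paper's setup): constant-step-size SGD on a quadratic loss with Gaussian gradient noise is an AR(1) process, and its stationary variance depends on the step size and the noise level, not only on the curvature of the loss.

```python
import random

random.seed(0)
A, sigma, eps = 1.0, 1.0, 0.1  # curvature, gradient-noise scale, step size

# theta_{t+1} = theta_t - eps * (A * theta_t + noise) is AR(1) with
# coefficient (1 - eps * A); it bounces around 0 instead of converging.
theta, samples = 5.0, []
for t in range(200_000):
    noisy_grad = A * theta + random.gauss(0.0, sigma)
    theta -= eps * noisy_grad
    if t > 1000:  # discard burn-in
        samples.append(theta)

empirical_var = sum(x * x for x in samples) / len(samples)
# Closed-form stationary variance of this AR(1) process:
predicted_var = eps * sigma**2 / (A * (2.0 - eps * A))
print(empirical_var, predicted_var)  # both ≈ 0.0526, i.e. step-size dependent
```

Note that the stationary variance shrinks with eps, so it agrees with a particular posterior variance only for one specific step size – which is exactly the kind of tuning question the paper addresses.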

Another delicate point hides in Assumption 1: as our stochastic gradient is a sum of independent RVs, the CLT is invoked to argue that the gradient noise is Gaussian. But the CLT might not yet take effect if the number of data points in our subsample is small – which is the very thing one would like in order to save computation time. This of course is yet another instance of a dilemma common to all scalable sampling techniques, and at this stage of research it is not a showstopper.
Now assuming that constant SGD is not transient, that it has a stationary distribution equal to the posterior, and that the posterior or its dominant mode is approximately Gaussian, we can of course try to optimize the parameters of constant SGD to make the sample it produces as close as possible to a posterior sample. The authors do this using an Ornstein-Uhlenbeck approximation, which has a Gaussian stationary distribution. Their approach is to minimize the KL divergence between the OU approximation and the posterior, and to derive optimal parameter settings for constant SGD in closed form. The most interesting part of the theoretical results section is probably the one on optimal preconditioning in Stochastic Gradient Fisher Scoring, as that sampler adds artificial noise – which might solve some of the transience questions.
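For reference, the OU connection can be stated compactly (these are standard facts about OU processes; the symbols A and B are my own shorthand for the drift and diffusion matrices, not necessarily the paper's notation):

```latex
% A multivariate OU process approximating constant SGD,
%   d\theta_t = -A\,\theta_t\,\mathrm{d}t + B\,\mathrm{d}W_t ,
% has Gaussian stationary distribution \mathcal{N}(0, \Sigma),
% where \Sigma solves the Lyapunov equation
\[
  A\Sigma + \Sigma A^{\top} = B B^{\top}.
\]
% Minimizing the KL divergence between this stationary Gaussian and the
% (approximately Gaussian) posterior over the free parameters hidden in
% A and B -- step size and preconditioner -- is what yields the
% closed-form optimal settings.
```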

The presentation would gain a lot by calling "constant SGD" what it is – a sampling algorithm. The terminology and notation in general are sometimes a bit confusing, but nothing a revision can't fix. In general, the paper presents an interesting approach to deriving optimal parameters for a class of algorithms that is notoriously hard to tune. This is particularly relevant because the constant step size could improve mixing considerably over Welling & Teh's SGLD. What would be interesting to see is whether the results could be generalized to the case where the posterior is not Gaussian. For practical reasons, because stochastic VB/EP works quite well in the Gaussian case. For theoretical reasons, because EP now even comes with some guarantees (haven't read that paper, admittedly). Maybe a path would be to look at the Gaussian approximation to a multimodal posterior spanning all modes, and to minimize the KL divergence between that and the OU process. Or maybe one could prove that constant SGD (with artificial noise?) has some fixed stationary distribution for which the stochastic drift term plus noise form a Markov kernel, which might enable a pseudo-marginal correction.

Markov Chain Monte Carlo and Variational Inference: Bridging the Gap

This ICML 2015 paper by Salimans, Kingma and Welling tries to explore ‘a new synthesis of variational inference and Monte Carlo’.

The paper looks at a potentially interesting approach, namely introducing an optimization objective for the parameters of the inference algorithm and approximating some integrals appearing in the objective using Monte Carlo. I had big problems following the notation and general ideas, however. The authors want to treat Monte Carlo samples as auxiliary variables and introduce a distribution r(z_0,\dots,z_{T-1}|x, z_T), where x is the data and the z_i are the Monte Carlo samples. This distribution is called the auxiliary inference distribution in one place and the inverse model further down in the same column – one example of notation/terminology changing within the paper. I have little intuition as to what r is used for or why it is interesting to look at. An elementary explanation would be nice here.

When using Monte Carlo to approximate some intractable integrals, in particular a variational lower bound, the authors claim that this gives them unbiased estimates. It is unclear to me why the estimates should be unbiased, especially if only a few MC samples are collected, as suggested by the paper. Completely elusive to me is the way they propose to compute gradients of the lower bound.

At this point I was probably so lost that any further material in the paper had no hope of being understood. The proposed Hamiltonian Variational Inference is the first of these proposals I don't really get, as is Annealed Variational Inference.

In general, I feel the paper is a potpourri of ideas and I am unclear how they fit together. But this of course may be a result of me not understanding the principles of the approach in the first place.

Variational Hamiltonian Monte Carlo via Score Matching

This is an arXival (maybe by now an ICML paper?) by Zhang, Shahbaba and Zhao. It suggests fitting an emulator/surrogate q_\eta to the target density \pi via score matching and then using the gradients of q_\eta rather than those of \pi for generating proposals with HMC. The main selling point is that this decreases the wall-clock time of HMC.

However, this seems to me like reselling the earlier paper by Heiko Strathmann and collaborators on Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families. Zhang et al. use basically the same idea but fail to discuss this earlier work. Their reasoning for using a neural network rather than a GP emulator (computation time) is a bit arbitrary: if you go for a less rich function class (a neural network), then of course the computation time goes down – but you would get the same effect by using GPs with inducing points.
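For concreteness, the idea shared by both papers can be sketched like this (a minimal 1-D toy version of my own, not either paper's actual implementation; names like grad_log_q are made up): run the leapfrog integrator with the gradient of a cheap surrogate, then do the Metropolis accept/reject against the exact target, which keeps the target invariant however bad the surrogate is.

```python
import math, random

def leapfrog(theta, p, grad_log_q, step, n_steps):
    """Leapfrog using only surrogate gradients (volume-preserving, reversible)."""
    p = p + 0.5 * step * grad_log_q(theta)
    for _ in range(n_steps - 1):
        theta = theta + step * p
        p = p + step * grad_log_q(theta)
    theta = theta + step * p
    p = p + 0.5 * step * grad_log_q(theta)
    return theta, p

def surrogate_hmc_step(theta, log_pi, grad_log_q, step=0.2, n_steps=10):
    p0 = random.gauss(0.0, 1.0)
    theta_new, p_new = leapfrog(theta, p0, grad_log_q, step, n_steps)
    # Metropolis correction uses the EXACT target log-density, so the
    # surrogate only affects proposal quality, not the stationary law.
    log_alpha = (log_pi(theta_new) - 0.5 * p_new**2) - (log_pi(theta) - 0.5 * p0**2)
    return theta_new if math.log(random.random()) < log_alpha else theta

# Standard normal target with a deliberately imperfect surrogate gradient:
log_pi = lambda x: -0.5 * x * x
grad_log_q = lambda x: -1.2 * x  # surrogate is off by 20%; still valid via MH

random.seed(1)
theta, samples = 0.0, []
for _ in range(20_000):
    theta = surrogate_hmc_step(theta, log_pi, grad_log_q)
    samples.append(theta)

mean = sum(samples) / len(samples)
var = sum(x * x for x in samples) / len(samples)
print(mean, var)  # mean ≈ 0, var ≈ 1 for the standard normal target
```

The wall-clock argument in both papers then rests on grad_log_q being much cheaper to evaluate than the gradient of the target itself.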

Very much lacking, to my taste, is a justification for the tuning they do. Sometimes they tune HMC/Variational HMC to 85% acceptance, sometimes to 70%. Also, it seems they are not adapting the mass matrix of HMC. If they did, I conjecture the relative efficiency of standard HMC vs. Variational HMC could change drastically. Details on how they tune SGLD are completely lacking.

Overall, I think it is not yet clear what can be learned from the work reported in the paper.


Talk in Oxford

Today I gave a talk at the Kernel Monday in Oxford, about Kernel Sequential Monte Carlo (slides). Apart from meeting Dino Sejdinovic and Brooks Paige again, I also had the chance to get Monsieur Doucet's input on some questions – the same ones I got input on from Dino and Brooks later on.

Always amazed by how this town is exactly like beloved Tübingen, where I did my undergrad – maybe after adding some ingredients that seem very anachronistic (to me). I also keep thinking (and Oxford obviously triggers this) that Germany has been overdoing it with its comparatively anti-elite research policy, a reaction to the Nazi focus on elites. I think some more elitism might be necessary for good research, but it's a thin line: elitism clearly breeds randomness. It's similar to overfitting/high variance: you can't actually predict the future merit of people with high accuracy, but still act as if you could.

The picture shows some pretty college in Oxford. I'm agnostic with respect to its name.

Talks in England

Next week I'll be touring England, more specifically the Universities of Reading, Warwick, Oxford and UCL. Apart from visiting dear friends, I'll be presenting a poster in Warwick (still not ready!) and giving talks at all the other places.

19.4.2016, 2pm – Gradient IS and Unadjusted Langevin for IS – Univ. of Reading, Afternoon on Bayesian Computation, Nike Lecture Theatre, Agriculture Building
22.4.2016, 1pm – Kernel Sequential Monte Carlo – UCL, Roberts G08 Sir David Davies LT, Announcement
25.4.2016, 3:30pm – Kernel Sequential Monte Carlo – Univ. of Oxford, Department of Statistics, St Giles building, Seminar Room 1, LG.03

The poster will be on Flyweight evidence estimates at the CRiSM workshop on Estimating Constants on the 20th. Slides and the poster will go up on the Talks page.