With my son, my niece, sister and her boyfriend I visited the science museum in Mannheim about two months ago. There where many nice experiments, and as a (former ?) computer scientist I was particularly impressed by Leibniz’ mechanical digital calculator, completed in the 17th century and able to do all four basic arithmetic operations. This of course is another example for the superiority of Leibniz over Newton…
One piece was a pure exhibit with no experiments: a system of cogwheels connected sequentially, where each added wheel took ten times longer for one complete cycle compared to the one before. While the first wheel needed five seconds, the 17th and last one did not seem to move, and indeed it completes a cycle about every 1.6 billion years ().
This is an example of exponential growth of course, so I told him the rice on chessboard story, e.g. the idea that starting with one grain of rice on the first square of a chessboard and doubling it on the next leads to loads and loads of rice. When he asked for rice this weekend I was rather unsuspecting, until he spilled some of it and asked for our help cleaning it up. He started the first four squares (though with linear growth as he didn’t remember completely) and we completed the first row together – after which I started simply weighing the rice, as counting took about a second per grain and already for the 256 grains on h2 I was too impatient, let alone the 1024 for f2, which would have taken about 16 minutes to count. If this little exercise doesn’t drive home what exponential growth means, I don’t know what would.
This is an important arXival by Alain Durmus and Eric Moulines. The title is slightly optimized for effect, as the paper actually contains non-asymptotic and asymptotic analysis.
The basic theme of the paper is in getting upper bounds on total variation (and more general distribution distances) between an uncorrected discretized Langevin diffusion wrt some target and itself. The discretization used is the common scheme with the scary name Euler-Maruyama:
Under a Foster-Lyapunov condition, is a Markov kernel that admits a unique stationary distribution that is close to the desired in total variation distance and gets closer when decreases.
Now in the non-asymptotic case with fixed , the authors provide bounds that explicitly depend on dimensionality of the support of the target, the number of samples drawn and the chosen step size . Unfortunately, these bounds contain some unknowns as well, such as the Lipschitz constant of the gradient of the logpdf and some suprema that I am unsure how to get explicitly.
Durmus and Moulines particularly take a look at scaling with dimension under increasingly strong conditions on , getting exponential (in dimension) constants for the convergence when is superexponential outside a ball. Better convergence can be achieved when assuming to be log-concave or strongly log-concave. This is not surprising, nevertheless the theoretical importance of the results is clear from the fact that together with Arnak Dalalyan this is the first time that results are given for the ULA after the Roberts & Tweedie papers from 1996.
As a practitioner, I would have wished for very explicit guidance in picking or the series . But hopefully with Alains paper as a foundation that can be the next step. As a non-mathematician, I had some problems in following the paper and at some point I completely lost it. This surely is in part due to the quite involved content. However, one might still manage to give intuition even in this case, as Sam Livingstones recent paper on HMC shows. I hope Alain goes over it again with readability and presentation in mind so that it will get the attention it deserves. Yet another task for something that already took a lot of work…
(photo: the Lipschitz Grill diner in Berlin – I don’t
know about their food, but the name is remarkable)
This second reference Mathieu Gerber gave me in the quest for educating myself about QMC, is paper by Pierre L’Ecuyer from the Winter Simulation Conference in 2004. It was much clearer as a tutorial (for me) as compared to the Art Owen paper. Maybe because it didn’t contain so much ANOVA. Or maybe because I was more used to ANOVA from Arts paper.
This paper specifically and quite transparently treats different constructions for low discrepancy point sets, in particular digital nets and their special cases. On the other hand, randomization procedures are discussed, which sometimes seem to be very specialized to the sequence used. One seemingly general transform after randomization called the baker transformation results in surprisingly high variance reduction of order . The transformation being to replace the uniform coordinate by for and else.
In the examples L’Ecuyer mentions that using an Eigenzerlegung of covariance matrices (i.e. PCA) results in much higher variance reductions as compared to using Cholesky factors. Which he attributes to dimension reduction – a naming I find odd, as the complete information is retained (as opposed to, e.g. tossing the components with lowest Eigenvalue). My intuition is that maybe the strong empirical gains with PCA might rather be attributed to the fact that Eigenvectors are orthogonal, making this decomposition as close as possible to QMCs beloved unit hypercube.
The paper surveys re-randomization of low discrepancy/QMC point sets mostly as a way of getting an estimate of the variance of the integral estimate. Or that is what the paper makes states most prominently – another reason for doing randomized QMC being that in some cases it further decreases variance as compared to plain QMC.
One of the things this paper stresses is that QMC will not help in truly high-dimensional integrands: say you have a dimensional integrand which does not live on a lower dimensional manifold, and only use . Then equidistributing these points meaningfully in becomes impossible. Of course if the integrand does live on a lower dimensional manifold, one can use that fact to get convergence rates that correspond to the dimensionality of that manifold, which corresponds (informally) to what Art Owen calls effective dimension. Two definitions of effective dimension variants are given, both using the ANOVA decomposition. ANOVA only crossed my way earlier through my wifes psychology courses in stats, where it seemed to be a test mostly, so I’ll have delve more deeply into the decomposition method. It seems that basically, the value of the integrand is treated as a dependent variable while the are the independent variables and ANOVA is used to get an idea of how independent variables interact in producing the dependent variable. In which case of course we would have to get some meaningful samples of points which currently seems to me like circling back to the beginning, since meaningful samples are what we want in the first place.
The idea for getting variance of the integral estimate from RQMC is the following: given a number of deterministic QMC points, randomize them times using independently drawn random variables. The integral estimates are unbiased estimates of the true integral with some variance . Averaging over the $latex \hat I_i$ we get another unbiased estimator $latex \hat I$ which has variance . This RQMC variance can be estimated unbiasedly as . If then is the number of samples and this simply is the standard estimate of Monte Carlo variance.
The paper goes on to talk about using what it calls Latin Supercube Sampling for using RQMC in even higher dimensions, again using ANOVA decomposition and RQMC points in each group of interacting variables as determined by the ANOVA.
Overall, I know a bit more about QMC but am still more in the confusion than the knowledge state, which I hope Mathieus next reference will help with.
This remarkable arXival by Sam Livingstone, Betancourt, Byrne and Girolami takes a look at the conditions under which Hybrid/Hamiltonian Monte Carlo is ergodic wrt a certain distribution , and, in a second step, under which it will converge in total variation at a geometric rate. A nice feature being that they even prove that setting the integration time dynamically can lead to the sampler yielding geometric rate for a larger class of targets. Under idealized conditions for the dynamic integration time and, admittedly, in a certain exponential family.
Whats great about this paper, apart from the results, is how it manages to give the arguments in a rather accessible manner. Considering the depth of the results that is. The ideas are that
the gradient should at most grow linearly
for the gradient of to be of any help, it cannot lead further into the tails of the distribution or vanish (both assumption 4.4.1)
as we approach infinity, every proposed point thats closer to the origin will be accepted (assumption 4.4.2)
If the gradient grows faster than linear, numerics will punch you in the face. That happened to Heiko Strathmann and me just 5 days ago in a GradientIS-type algorithm, where an estimated gradient had length and led to Nirvana. If not the estimation is the problem but the actual gradient, of course you have the same effect. The second assumption is also very intuitive: if the gradient leads into the tails, then it hurts rather than helps convergence. And your posterior is improper. The final assumption basically requires that your chance to improve by proposing a point closer to the origin goes to the further you move away from it.
A helpful guidance they provide is the following. For a class of exponential family distributions given by , whenever the tails of the distribution are no heavier than Laplace () and no thinner than Gaussian (), HMC will be geometrically ergodic. In all other cases, it won’t.
Another thing that caught my eye was their closed form expression for the proposal after leapfrog steps with stepsize , eq. (13):
Which gives me hope that one should be able to compute the proposal probability after any number of leapfrog steps as long as is invertible and thus a Jacobian exists to account for the transformation of hidden in the . A quick check with given by a bayesian posterior shows that this might not be trivial however.
… are hell. Heiko and I have been spending the day trying to get code to work for a numerically stable streaming update of a Cholesky factor in an adaptive Importance Sampling setting. This is less then trivial, also because it has to be done in a weighted fashion (because of Importance Sampling) and weights should be kept in logspace for stability. Thinking I will have to go through this again for an Eigendecompositon instad of Cholesky for a collaboration planned with Mathieu Gerber does not make things easier to stomach.
This day again strengthens my wish to move more in the direction of Judith-Russeau-statistics, i.e. away from computers, towards pen and paper.
PS: Does anybody have code solving exactly this problem (we have tons for almost this problem)?