# Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

During the very nice International Conference on Monte Carlo techniques, held in early July at Université Paris Descartes (photo) and featuring many outstanding talks, one talk by Tong Zhang particularly caught my interest. He talked about several variants of Stochastic Gradient Descent (SGD) that use variance reduction techniques from Monte Carlo algorithms to improve the convergence rate over vanilla SGD, even though some of the papers mentioned in the talk do not explicitly point out this connection to Monte Carlo variance reduction.

One of the first works in this line, Accelerating Stochastic Gradient Descent using Predictive Variance Reduction by Johnson and Zhang, suggests using control variates to lower the variance of the gradient estimate. Let $L_j(\theta_{t-1})$ be the loss for the parameter at step $t-1$ and the $j$-th data point; then the usual batch gradient descent update is $\theta_{t} =\theta_{t-1} - \frac{\eta_t}{N} \sum_{j=1}^N\nabla L_j(\theta_{t-1})$ with step size $\eta_t$.

In naive SGD one instead picks a data point index uniformly, $j \sim \mathrm{Unif}(\{1,\dots,N\})$, and uses the update $\theta_{t} =\theta_{t-1} - \eta_t \nabla L_j(\theta_{t-1})$, usually with a decreasing step size $\eta_t$ to guarantee convergence. In expectation this Monte Carlo estimate of the batch gradient yields exactly the batch update. However, the variance of the estimate is very high, resulting in slow convergence of SGD after the first steps (even in minibatch variants).

The authors choose a well-known remedy for this, namely the introduction of a control variate. Keeping a version of the parameter that is close to the optimum, say $\tilde\theta$, observe that $\nabla L_j(\tilde\theta) - \frac{1}{N} \sum_{i=1}^N \nabla L_i(\tilde\theta)$ has an expected value of 0 over the random index $j$ and is thus a possible control variate, giving the update $\theta_{t} =\theta_{t-1} - \eta_t \left(\nabla L_j(\theta_{t-1}) - \nabla L_j(\tilde\theta) + \frac{1}{N} \sum_{i=1}^N \nabla L_i(\tilde\theta)\right)$. A possible downside is that whenever $\tilde\theta$ is updated, one has to go over the complete dataset.

The contribution, apart from the novel combination of known ideas, is the proof that this improves convergence. The proof assumes smoothness and strong convexity of the overall loss function and convexity of the individual $L_j$, and shows that the proposed procedure (termed stochastic variance reduced gradient, or SVRG) enjoys geometric convergence, albeit for a slightly odd version of the algorithm in which $\tilde\theta \sim\mathrm{Unif}(\{\theta_0,\dots,\theta_{t-1}\})$. Simply setting $\tilde\theta = \theta_{t-1}$ should intuitively improve convergence, but the authors could not report a result for that choice. Overall a very nice idea, and one that has since been discussed quite a bit in further papers, among others by Simon Lacoste-Julien and Francis Bach.
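To make the update concrete, here is a minimal numpy sketch of SVRG on a toy least-squares problem (the setup, names and constants are my own, and the step size and epoch count are not tuned):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=N)

def grad_j(theta, j):
    # gradient of the squared loss for data point j
    return (X[j] @ theta - y[j]) * X[j]

def svrg(theta, eta=0.01, epochs=30, m=2 * N):
    for _ in range(epochs):
        # snapshot parameter: one full pass over the data per epoch
        theta_snap = theta.copy()
        full_grad = np.mean([grad_j(theta_snap, j) for j in range(N)], axis=0)
        for _ in range(m):
            j = rng.integers(N)
            # variance-reduced gradient: the control variate has mean zero,
            # so the update is unbiased but much less noisy near the optimum
            g = grad_j(theta, j) - grad_j(theta_snap, j) + full_grad
            theta = theta - eta * g
    return theta

theta_hat = svrg(np.zeros(d))
```

Note that unlike vanilla SGD, the step size stays constant; the variance of the gradient estimate shrinks as $\theta$ and $\tilde\theta$ approach the optimum.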

# Nonparametric maximum likelihood inference for mixture models via convex optimization

This arXival by Feng and Dicker deals with the problem of fitting multivariate mixture models by maximum likelihood. One advantage put forward is that nonparametric maximum likelihood estimators (NPMLEs) put virtually no constraints on the base measure $G_0$. While the abstract claims their approach works for arbitrary distributions as mixture components, they really assume that the components are well approximated by a Gaussian (which of course includes distributions arising as sums of RVs, because of the CLT). And while NPMLEs might in theory put no constraints on the base measure, in practice the paper first constrains $G_0$ to measures supported on at most as many points as there are data points. To make the optimization problem convex, the support is further constrained to a finite grid on some compact set containing the data.

The main result of the paper is Proposition 1, which basically says that the finite grid constraint indeed makes the problem convex. After that the paper moves directly to empirical evaluation. I think the method proposed is not all that general. While the elliptical unimodal (Gaussian approximation) assumption would not be that problematic, the claimed theoretical flexibility of the NPMLE does not really bear fruit in this approach, as the finite grid constraint is very strong and removes most of the flexibility left after the Gaussian assumption. For instance, the Gaussian location model fitted is merely a very constrained KDE, without even the ability of general Gaussian mixture models to fit the covariance matrix of individual components. While a finite grid support for both location and covariance matrix is possible, to be practical the grid would have to be extremely dense in order to gain flexibility in the fit. So while it is correct that the optimization problem becomes convex, this is bought at the price of a rigid model. However, Ewan Cameron assured me that the model is very useful for astrostatistics, and I realized that it might be so in other contexts as well, e.g. adaptive Monte Carlo techniques.
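To see why the grid constraint buys convexity, here is a small sketch (my own toy setup, not the authors' code): with fixed unit-variance Gaussian components on a fixed location grid, only the mixing weights are free, the log-likelihood is concave in them, and simple EM-style multiplicative updates converge to the grid-constrained NPMLE.

```python
import numpy as np

rng = np.random.default_rng(1)
# data from a two-component Gaussian location mixture
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])

# fixed grid of candidate locations with fixed unit-variance components:
# the mixing weights w are the only free variable, so the log-likelihood
# sum_i log(sum_k w_k phi_ik) is concave in w -- a convex problem
grid = np.linspace(-5, 5, 41)
# N x K matrix of component densities N(x_i; g_k, 1)
phi = np.exp(-0.5 * (x[:, None] - grid[None, :]) ** 2) / np.sqrt(2 * np.pi)

w = np.full(len(grid), 1.0 / len(grid))
for _ in range(500):
    # EM / multiplicative update; its fixed point is the grid-constrained NPMLE
    resp = phi * w                          # unnormalized responsibilities
    resp /= resp.sum(axis=1, keepdims=True)
    w = resp.mean(axis=0)
```

The fitted mixing measure should concentrate its mass near the true component locations, illustrating the "very constrained KDE" reading above.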

A minor comment regarding the allegation in the paper that nonparametric models lack interpretability: while this is certainly true for the model proposed in the paper and for mainstream Bayesian nonparametric models, it is not a given. One notable interpretable class is that of mixture models with a prior on the number of components, by Miller and Harrison (2015).

# Lifted Convex Quadratic Programming

This arXival by Mladenov, Kleinhans and Kersting considers taking advantage of symmetry structure in quadratic programming problems in order to speed up solvers/optimizers. The basic idea is to check which of the model parameters are indistinguishable because of symmetries in the model and assign all parameters in a group the same parameter in all steps of optimization – yielding a partition of the parameter space. Of course, this will reduce the effective dimensionality of the problem and might thus be considered a method of uncovering the lower-dimensional manifold that contains at least one optimum/solution.
The core contributions of the paper are
1) to provide the conditions under which a partition is a lifting partition, i.e. a partition such that setting all the variables in a group to the same value still includes a solution of the quadratic program
2) to provide conditions under which the quadratic part of the program is fractionally symmetric, i.e. can be factorized
Furthermore, the paper touches on the fact that some kernelized learning algorithms can benefit from lifted solvers, and on an approximate lifting procedure for when no strict lifting partition exists.
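A toy illustration of the lifting idea (my own, far simpler than the paper's setting): if a convex QP is invariant under swapping two variables, tying them to a common value preserves a solution and reduces the dimension.

```python
import numpy as np

# a small convex QP, min 0.5 x'Qx - b'x, invariant under swapping x1 and x2
Q = np.array([[2.0, 0.5, 0.3],
              [0.5, 2.0, 0.3],
              [0.3, 0.3, 1.5]])
b = np.array([1.0, 1.0, 0.7])

# full solution of the unconstrained convex QP: Q x = b
x_full = np.linalg.solve(Q, b)

# lifted solution: the partition {{1,2},{3}} ties x1 = x2 = z1 and x3 = z2,
# i.e. x = B z with a 3x2 "lifting" matrix B
B = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
Qz = B.T @ Q @ B     # reduced quadratic term
bz = B.T @ b         # reduced linear term
z = np.linalg.solve(Qz, bz)
x_lifted = B @ z
```

Because the optimum respects the symmetry, solving the 2-dimensional reduced program recovers the solution of the 3-dimensional one; the paper's contribution is characterizing when such a partition is admissible in general.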

The paper is rather hard to read for somebody who is not an expert in lifted inference. While all formal criteria for presentation are fulfilled, more intuition for the subject matter could be given. The contributions are hard to grasp and rather on the theoretical side (lifting partitions for convex programs can be “computed via packages such as Saucy”). While they deepen the theoretical understanding a bit, it is unclear whether this paper will have a strong effect on the development of the theory of lifted inference. Even the authors seem to think that it will not have strong practical implications.

In general I wonder whether lifted inference isn’t really just an automatic way of finding simpler, equivalent models, rather than an inference technique per se.

# Overdispersed Black-Box Variational Inference

This UAI paper by Ruiz, Titsias and Blei presents important insights for the idea of a black box procedure for VI (which I discussed here). The setup of BBVI is the following: given a target/posterior $\pi$ and a parametric approximation $q_\lambda$, we want to find

$\mathrm{argmax}_\lambda \int \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) q_\lambda(x) \mathrm{d}x$

which can be achieved for any $q_\lambda$ by estimating the gradient

$\nabla_\lambda \int \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) q_\lambda(x) \mathrm{d}x$

with Monte Carlo samples and stochastic optimization. This works if we can easily sample from $q_\lambda$ and can compute its derivative wrt $\lambda$ in closed form. In the original paper, the authors suggested the use of the score function as a control variate, together with Rao-Blackwellization. Both were described in a way that utterly confused me – until now, because Ruiz, Titsias and Blei manage to describe the concrete application of both control variates and Rao-Blackwellization in a very transparent way.

Their own contribution to variance reduction (apart from some tricks they applied) is based on the fact that the optimal sampling distribution for estimating $\nabla_\lambda \int \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) q_\lambda(x) \mathrm{d}x$ is proportional to $\left | \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) \right | q_\lambda(x)$ rather than exactly $q_\lambda(x)$. They argue that this optimal sampling distribution has considerably heavier tails than $q_\lambda(x)$. Their reasoning is mainly that the integrand of the gradient (which is essentially $(\nabla_\lambda q_\lambda(x)) \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) = q_\lambda(x)(\nabla_\lambda \log q_\lambda(x)) \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right )$) vanishes at the modes, making that region irrelevant for gradient estimation. The same should be true for the tails of the distribution, I think. Overall very interesting work that I strongly recommend reading, if only to understand the original black-box VI proposal.
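For concreteness, here is a small numpy sketch (my own toy example with a Gaussian target and Gaussian $q_\lambda$, not the authors' code) of the score-function gradient estimator with the score function itself as control variate, as suggested in the original BBVI paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# target pi = N(2, 1) (an unnormalized log-density is fine),
# variational family q_lam = N(lam, 1)
def log_pi(x):
    return -0.5 * (x - 2.0) ** 2

def log_q(x, lam):
    return -0.5 * (x - lam) ** 2 - 0.5 * np.log(2 * np.pi)

lam = 0.0
x = rng.normal(lam, 1.0, size=10_000)
score = x - lam                     # d/dlam log q(x, lam) for this family
f = log_pi(x) - log_q(x, lam)
raw = score * f                     # plain score-function (REINFORCE) estimator

# the score function has expectation zero under q, so it is a valid
# control variate; estimate the optimal coefficient from the same samples
a = np.cov(raw, score)[0, 1] / np.var(score)
cv = raw - a * score
```

Both `raw.mean()` and `cv.mean()` estimate the same gradient (for this toy model it equals $2 - \lambda$), but the control-variate version has visibly lower variance.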

# A Variational Analysis of Stochastic Gradient Algorithms

This arXival by Mandt, Hoffman and Blei from February takes a look at what they call SGD with constant learning rate, or constant SGD, which is not really stochastic gradient descent anymore but rather a sampling algorithm, as the iterates bounce around the mode of the posterior rather than converging to it. Consequently, they interpret constant SGD as a stochastic process in discrete time with some stationary distribution. They go on to analyse it under the assumption that the stationary distribution is Gaussian (Assumptions 1 and 4). Something that springs to my mind here is the following: even if we accept that the stationary distribution of the process might be Gaussian (intuitively, not rigorously), what guarantees that the stationary distribution is the posterior? Even if the posterior is also Gaussian, it could have a different (co)variance. I think a few more words about this question would be good. In particular, without adding artificial noise, is it guaranteed that every point in the posterior's domain can be reached with non-zero probability? And can it be reached an infinite number of times if we collect infinitely many samples? If not, the constant SGD process is transient and does not enjoy good stability properties.

Another delicate point hides in Assumption 1: As our stochastic gradient is the sum of independent RVs, the CLT is invoked to assume that the gradient noise is Gaussian. But the CLT might not yet take effect if the number of data points in our subsample is small, which is the very thing one would like in order to have advantages in computation time. This of course is yet another instance of the dilemma common to all scalable sampling techniques, and at this stage of research this is not a show stopper.
Now, assuming that constant SGD is not transient, that its stationary distribution is the posterior, and that the posterior (or its dominant mode) is approximately Gaussian, we can of course try to optimize the parameters of constant SGD to make the samples it produces as close as possible to posterior samples. The authors conveniently do this using an Ornstein-Uhlenbeck approximation, which has a Gaussian stationary distribution. Their approach is to minimize the KL divergence between the OU approximation and the posterior, and to derive optimal parameter settings for constant SGD in closed form. The most interesting part of the theoretical results section probably is the one on optimal preconditioning in Stochastic Gradient Fisher Scoring, as that sampler adds artificial noise, which might resolve some of the transience questions.

The presentation would gain a lot by calling “constant SGD” what it is – a sampling algorithm. The terminology and notation in general are sometimes a bit confusing, but nothing a revision can’t fix. Overall, the paper presents an interesting approach to deriving optimal parameters for a class of algorithms that is notoriously hard to tune. This is particularly relevant because the constant step size could improve mixing considerably over Welling & Teh's SGLD. What would be interesting to see is whether the results can be generalized to the case where the posterior is not Gaussian: for practical reasons, because stochastic VB/EP works quite well in the Gaussian case, and for theoretical reasons, because EP now even comes with some guarantees (admittedly, I haven't read that paper). Maybe a path forward would be to look at the Gaussian approximation to a multimodal posterior spanning all modes, and to minimize the KL divergence between that and the OU process. Or maybe one could prove that constant SGD (with artificial noise?) has some fixed stationary distribution for which the stochastic drift term plus noise form a Markov kernel, which might enable a pseudo-marginal correction.
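The premise is easy to see in a toy experiment (my own sketch, not the authors' setup): run constant-stepsize SGD on a conjugate Gaussian model and the iterates never settle down; they hover around the posterior mean with a spread governed by the step size and minibatch size rather than by the posterior itself, which is exactly the gap their optimal-parameter derivation tries to close.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1_000
data = rng.normal(1.0, 1.0, size=N)   # model: x_i ~ N(theta, 1), flat prior

def stoch_grad(theta, batch):
    # minibatch estimate of the gradient of the negative log-posterior
    return N * np.mean(theta - batch)

eta = 1e-4     # constant step size: no decay, so no convergence
theta = 0.0
samples = []
for t in range(20_000):
    batch = rng.choice(data, size=32)
    theta -= eta * stoch_grad(theta, batch)
    if t > 5_000:                     # discard burn-in
        samples.append(theta)
samples = np.asarray(samples)
```

The resulting chain is centered on the posterior mean (the data mean here), but its stationary variance is set by `eta` and the minibatch size, so without tuning there is no reason for it to match the posterior variance.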

# Variational Hamiltonian Monte Carlo via Score Matching

This is an arXival (maybe by now an ICML paper?) by Zhang, Shahbaba and Zhao. It suggests fitting an emulator/surrogate $q_\eta$ to the target density $\pi$ via score matching and then using the gradients of $q_\eta$ rather than those of $\pi$ for generating proposals with HMC. The main selling point is that this decreases wall-clock time for HMC.
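The idea in miniature (my own toy sketch, with a hand-picked mis-scaled gradient standing in for an actual score-matched emulator): run leapfrog with the cheap surrogate gradient, but accept/reject with the true density. The chain should still target $\pi$, since leapfrog under the surrogate remains volume-preserving and reversible.

```python
import numpy as np

rng = np.random.default_rng(4)

# true (expensive) target: standard Gaussian in 2 dimensions
def log_pi(x):
    return -0.5 * x @ x

# surrogate gradient, e.g. from a fitted emulator q_eta; here a slightly
# mis-scaled Gaussian score stands in for the score-matched fit
def grad_surrogate(x):
    return -1.1 * x

def hmc_step(x, eps=0.2, L=10):
    p = rng.normal(size=x.shape)
    x_new, p_new = x.copy(), p.copy()
    # leapfrog integration using ONLY the surrogate gradient
    p_new += 0.5 * eps * grad_surrogate(x_new)
    for _ in range(L - 1):
        x_new += eps * p_new
        p_new += eps * grad_surrogate(x_new)
    x_new += eps * p_new
    p_new += 0.5 * eps * grad_surrogate(x_new)
    # Metropolis correction with the TRUE density keeps pi invariant
    log_alpha = (log_pi(x_new) - 0.5 * p_new @ p_new) \
              - (log_pi(x) - 0.5 * p @ p)
    return x_new if np.log(rng.uniform()) < log_alpha else x

x = np.zeros(2)
chain = []
for _ in range(5_000):
    x = hmc_step(x)
    chain.append(x)
chain = np.asarray(chain)
```

The better the surrogate gradient matches $\nabla \log \pi$, the higher the acceptance rate; the wall-clock savings come from never evaluating $\nabla \log \pi$ inside the trajectory.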

However, that seems to me like reselling the earlier paper by Heiko Strathmann and collaborators on Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families. Zhang et al. use basically the same idea but fail to discuss this earlier work. Their reasoning for using a neural network rather than a GP emulator (computation time) is a bit arbitrary: if you go for a less rich function class (a neural network), computation time will of course go down – but you would get the same effect by using GPs with inducing points.

Very much lacking, for my taste, is a rationale for the tuning they do. Sometimes they tune HMC/Variational HMC to 85% acceptance, sometimes to 70%. Also, it seems they are not adapting the mass matrix of HMC; if they did, I conjecture the relative efficiency of standard HMC vs. Variational HMC could change drastically. Details on how they tune SGLD are completely lacking.

Overall, I think it is not yet clear what can be learned from the work reported in the paper.

# Non-asymptotic convergence analysis for the Unadjusted Langevin Algorithm

This is an important arXival by Alain Durmus and Eric Moulines. The title is slightly optimized for effect, as the paper actually contains both non-asymptotic and asymptotic analysis.

The basic theme of the paper is getting upper bounds on the total variation distance (and more general distances between distributions) between an uncorrected discretized Langevin diffusion with respect to some target $\pi$ and $\pi$ itself. The discretization used is the common scheme with the scary name Euler-Maruyama:

$X_{k} \sim \mathcal{N}(\cdot \mid X_{k-1} + \gamma_{k}\nabla \log \pi(X_{k-1}),\; 2\gamma_{k} I) = R_{\gamma_{k}}(\cdot \mid X_{k-1})$

Under a Foster-Lyapunov condition, $R_{\gamma_{k}}$ is a Markov kernel that admits a unique stationary distribution $\pi_{\gamma_{k}}$ that is close to the desired $\pi$ in total variation distance and gets closer when $\gamma_{k}$ decreases.
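The scheme itself is a one-liner; here is ULA with a fixed step size $\gamma$ on a toy standard Gaussian target (my own sketch). For this target the bias of $\pi_\gamma$ can even be computed in closed form, since the iterates form an AR(1) process:

```python
import numpy as np

rng = np.random.default_rng(5)

# target: 1-d standard Gaussian, so grad log pi(x) = -x
def grad_log_pi(x):
    return -x

gamma = 0.05     # fixed step size
x = 0.0
samples = np.empty(50_000)
for k in range(len(samples)):
    # Euler-Maruyama step: drift by gamma * grad log pi, diffuse by sqrt(2 gamma)
    x = x + gamma * grad_log_pi(x) + np.sqrt(2 * gamma) * rng.normal()
    samples[k] = x

# here x_{k} = (1 - gamma) x_{k-1} + sqrt(2 gamma) xi_k, an AR(1) process,
# so pi_gamma is Gaussian with variance 2*gamma / (1 - (1-gamma)**2)
#   = 1 / (1 - gamma/2), slightly larger than the target variance 1
```

The stationary law $\pi_\gamma$ is close to $\pi$ but biased, and the bias shrinks as $\gamma$ decreases, matching the qualitative statement above.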

Now in the non-asymptotic case with fixed $\gamma = \gamma_k$, the authors provide bounds that explicitly depend on dimensionality of the support of the target, the number of samples drawn and the chosen step size $\gamma$. Unfortunately, these bounds contain some unknowns as well, such as the Lipschitz constant $L$ of the gradient of the logpdf $\log \pi(\cdot)$ and some suprema that I am unsure how to get explicitly.

Durmus and Moulines particularly look at scaling with dimension under increasingly strong conditions on $\pi$, obtaining constants that are exponential in the dimension when $\pi$ is superexponential outside a ball. Better convergence can be achieved when assuming $\pi$ to be log-concave or strongly log-concave. This is not surprising; nevertheless, the theoretical importance of the results is clear from the fact that, together with work by Arnak Dalalyan, these are the first results for ULA since the Roberts & Tweedie papers from 1996.

As a practitioner, I would have wished for very explicit guidance in picking $\gamma$ or the sequence $\{\gamma_k\}_{k=1}^\infty$. But hopefully, with Alain's paper as a foundation, that can be the next step. As a non-mathematician, I had some problems following the paper and at some point I completely lost it. This is surely in part due to the quite involved content. However, one might still manage to give intuition even in this case, as Sam Livingstone's recent paper on HMC shows. I hope Alain goes over it again with readability and presentation in mind, so that it will get the attention it deserves. Yet another task for something that already took a lot of work…

(photo: the Lipschitz Grill diner in Berlin – I don’t
know about their food, but the name is remarkable)