Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

September 6, 2016big tall data, Inference, Monte Carlo, Optimization, Variance ReductionIngmar

During the super nice International Conference on Monte Carlo techniques in the beginning of July in Paris at Université Descartes (photo), which featured many outstanding talks, one by Tong Zhang particularly caught my interest. He talked about several variants of Stochastic Gradient Descent (SGD) that basically use variance reduction techniques from Monte Carlo algorithms in order to improve the convergence rate versus vanilla SGD. Even though some of the papers mentioned in the talk do not always point out the connection to Monte Carlo variance reduction techniques.

One of the first works in this line, Accelerating Stochastic Gradient Descent using Predictive Variance Reduction by Johnson and Zhang, suggests using control variates to lower the variance of the loss estimate. Let $L_j(\theta_{t-1})$ be the loss for the parameter at $t-1$ and jth data point, then the usual batch gradient descent update is $\theta_{t} =\theta_{t-1} - \frac{\eta_t}{N} \sum_{j=1}^N\nabla L_j(\theta_{t-1})$ with $\eta_t$ as step size.

In naive SGD instead one picks a data point index uniformly $j \sim \mathrm{Unif}(\{1,\dots,N\})$ and uses the update $\theta_{t} =\theta_{t-1} - \eta_t \nabla L_j(\theta_{t-1})$ , usually with a decreasing step size $\eta_t$ to guarantee convergence. The expected update resulting from this Monte Carlo estimate of the batch loss is exactly the batch procedure update. However the variance of the estimate is very high, resulting in slow convergence of SGD after the first steps (even in minibatch variants).

The authors choose a well-known solution to this, namely the introduction of a control variate. Keeping a version of the parameter that is close to the optimum, say $\tilde\theta$ , observe that $\nabla L_j(\tilde\theta) - \frac{1}{N} \sum_{i=1}^N \nabla L_i(\tilde\theta)$ has an expected value of 0 and is thus a possible control variate. With the possible downside that whenever $\tilde\theta$ is updated, one has to go over the complete dataset.

The contribution, apart from the novel combination of knowledge, is the proof that this improves convergence. This proof assumes smoothness and strong convexity of the overall loss function and convexity of the $L_j$ for the individual data points and then shows that the proposed procedure (termed stochastic variance reduced gradient or SVRG) enjoys geometric convergence. Even though the proof uses a slightly odd version of the algorithm, namely where $\tilde\theta \sim\mathrm{Unif}(\{\theta_0,\dots,\theta_{t-1}\})$ . Rather simply setting $\tilde\theta = \theta_{t-1}$ should intuitively improve convergence, but the authors could not report a result on that. Overall a very nice idea, and one that has been discussed in more papers quite a bit, among others by Simon Lacoste-Julien and Francis Bach.

Berlin street art

Paris street art

Nonparametric maximum likelihood inference for mixture models via convex optimization

August 22, 2016Inference, Maximum Likelihood, Monte CarloIngmar

This arXival by Feng and Dicker deals with the problem of fitting multivariate mixture estimates with maximum likelihood. One of the advantages put forward being that nonparametric maximum likelihood estimators (NPMLEs) put virtually no constraints on the base measure $G_0$ . While the abstract claims their approach works for arbitrary distributions as mixture components, really they make the assumption that the components are well approximated by a Gaussian (of course including distributions arising from sums of RVs because of the CLT). While theoretically NPMLEs might put no constraints on the base measure, practically in the paper first $G_0$ is constrained to measures supported on at most as many points as there are data points. To make the optimization problem convex, the support is further constrained to a finite grid on some compact space that the data lies on.

The main result of the paper is in Proposition 1, which basically says that the finite grid constraint indeed makes the problem convex. After that the paper directly continues with empirical evaluation. I think the method proposed is not all that general. While the elliptical unimodal (gaussian approximation) assumption would not be that problematic, the claimed theoretical flexibility of NPMLE is not really bearing fruit in this approach, as the finite grid constraint is very strong and gets rid of most flexibility left after the gaussian assumption. For instance, the gaussian location model fitted is merely a very constrained KDE without even allowing the ability of general gaussian mixture models of fitting the covariance matrix of individual components. While a finite grid support for location and covariance matrix is possible, to be practical the grid would have to be extremely dense in order gain flexibility in the fit. While it is correct that the optimization problem is becoming convex, this is bought for the price of a rigid model. However, Ewan Cameron assured me that the model is very useful for astrostatistics, and I realized that it might be so in other contexts, e.g. adaptive Monte Carlo techniques.

A minor comment regarding the allegation in the paper that nonparametric models lack interpretability: while this is certainly true for the model proposed in the paper and mainstream bayesian nonparametric models, this is not a given. One notable interpretable class of models are Mixture models with a prior on the number of components by Miller and Harrison (2015).

Rodin museum

Lifted Convex Quadratic Programming

August 17, 2016Inference, Monte Carlo, statsIngmar

This arXival by Mladenov, Kleinhans and Kersting considers taking advantage of symmetry structure in quadratic programming problems in order to speed up solvers/optimizers. The basic idea is to check which of the model parameters are indistinguishable because of symmetries in the model and assign all parameters in a group the same parameter in all steps of optimization – yielding a partition of the parameter space. Of course, this will reduce the effective dimensionality of the problem and might thus be considered a method of uncovering the lower-dimensional manifold that contains at least one optimum/solution.
The core contributions of the paper are
1) to provide the conditions under which a partition is a lifting partition, i.e. a partition such that setting all the variables in a group to the same value still includes a solution of the quadratic program
2) provide conditions under which the quadratic part of the program is fractionally symmetric , i.e. can be factorized
Furthermore, the paper touches on the fact that some kernelized learning algorithms can benefit from lifted solvers and an approximate lifting procedures when no strict lifting partition exists.

It is rather hard to read for somebody not an expert in lifted inference. While all formal criteria for presentation are fulfilled, more intuition for the subject matter could be given. The contributions of the paper are hard to grasp and more on the theoretical side (lifting partitions for convex programs can be “computed via packages such as Saucy”). While deepening the theoretical understanding a bit, it is unclear wether this paper will have a strong effect on the development of the theory of lifted inference. Even the authors seem to think that it will not have strong practical implications.

In general I wonder wether lifted inference really isn’t just an automatic way of finding simpler, equivalent models, rather than an inference technique per se.

MCqMC 2016

August 16, 2016Monte CarloIngmar

Yesterday, the MCqMC 2016 conference started, which I am unfortunately not attending – the program looks very interesting! Tomorrow, Nicolas Chopin will chair the session on importance sampling that I organized.

Advances in Importance Sampling

Wednesday August 17th 10:25–11:55, Berg B

Recently, there has been renewed interest in Importance Sampling, with results that go far beyond the state of the art of the early nineties when research focus shifted to MCMC. These results include theoretical advances in the analysis of convergence conditions and convergence assessment on one side. On the other, an overarching Multiple Importance Sampling framework has been proposed as well as IS based on piecewise deterministic processes, which allows, amongst other things, data subsampling and incorporating gradient information.

Generalized Multiple Importance Sampling

V. Elvira

Importance sampling methods are broadly used to approximate posterior distributions or some of their moments. In its standard approach, samples are drawn from a single proposal distribution and weighted properly. However, since the performance depends on the mismatch between the targeted and the proposal distributions, several proposal densities are often employed for the generation of samples. Under this Multiple Importance Sampling (MIS) scenario, many works have addressed the selection or adaptation of the proposal distributions, interpreting the sampling and the weighting steps in different ways. In this paper, we establish a novel general framework for sampling and weighting procedures when more than one proposal is available. The most relevant MIS schemes in the literature are encompassed within the new framework, and, moreover novel valid schemes appear naturally. All the MIS schemes are compared and ranked in terms of the variance of the associated estimators. Finally, we provide illustrative examples which reveal that, even with a good choice of the proposal densities, a careful interpretation of the sampling and weighting procedures can make a significant difference in the performance of the method. Joint work with L. Martino (University of Valencia), D. Luengo (Universidad Politecnica de Madrid) and M. F. Bugallo (Stony Brook University of New York).

The sample size required in Importance Sampling

S. Chatterjee

I will talk about a recent result, obtained in a joint work with Persi Diaconis, about the sample size required for importance sampling. If an i.i.d. sample from a probability measure P is used to estimate expectations with respect to a probability measure Q using the importance sampling technology, the result says that the required sample size is exp(K), where K is the Kullback-Leibler divergence of P from Q. If the sample size is smaller than this threshold, the importance sampling estimates may be far from the truth, while if the sample size is larger, the estimates are guaranteed to be close to the truth.

Continuous Time Importance Sampling for Jump Diffusions with Application to Maximum Likelihood Estimation

K. Łatuszyński

In the talk I will present a novel algorithm for sampling multidimensional irreducible jump diffusions that, unlike methods based on time discretisation, is unbiased. The algorithm can be used as a building block for maximum likelihood parameter estimation. The approach is illustrated by numerical examples of financial models like the Merton or double-jump model where its efficiency is compared to methods based on the standard Euler discretisation and Multilevel Monte Carlo. Joint work with Sylvain Le Corff, Paul Fearnhead, Gareth O. Roberts, and Giorgios Sermaidis.

Reichstagsbrand in Turkey

July 21, 2016PoliticsIngmar

Politics again! However, Erdogan is at work, so I feel the urge to comment on it even when NIPS rebuttals are pressing. As a reaction to my last post that was (marginally) about Erdogan, Xian already sent me news that the turkish president called birth control unislamic and that mothers have to ensure the continued growth of Turkeys population. This already reminded me strongly of Hitlers Mutterkreuz, basically a prize the Nazis gave to german women that where particularly successful in procreation.

Now the military coup reported by Turkey strongly reminds me of the Reichstagsbrand. Basically, Nazis used the fact that the parliament building was lit on fire and blamed their political opponents for the crime, starting to remove them from all important positions and parliament itself by claiming the communists where planning a coup. The parallels to Erdogans way of proceeding are striking, and like in the Nazi case, the sheer speed with which possibly opposing forces in all parts of society are losing their positions is at least dodgy. Hitler called the fire a sign from God, Erdogan the coup a gift from Allah. What can you say? History doesn’t repeat itself, but it rhymes. Erdogan will continue to follow the script Hitler gave him, and the Turkish version of the 1933 enabling act has already been planned. Now while I completely understood Merkels way of proceeding in the refugee crisis, I think a test of maybe even bigger scope is coming her way: while she was collaborating with Erdogan to get rid of the refugee stream, will she be able to avoid the mistakes that where made in handling Nazi Germany? In particular, will she be able to leave the path of an appeasement policy?

Another statistician has more thoughts.

Championnat d’Europe de football

July 10, 2016Random, SnapshotsIngmar

Overdispersed Black-Box Variational Inference

June 22, 2016Inference, Monte Carlo, stats, Variational InferenceIngmar

This UAI paper by Ruiz, Titsias and Blei presents important insights for the idea of a black box procedure for VI (which I discussed here). The setup of BBVI is the following: given a target/posterior $\pi$ and a parametric approximation $q_\lambda$ , we want to find

$\mathrm{argmin}_\lambda \int \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) q_\lambda(x) \mathrm{d}x$

which can be achieved for any $q_\lambda$ by estimating the gradient

$\nabla_\lambda \int \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) q_\lambda(x) \mathrm{d}x$

with Monte Carlo Samples and stochastic gradient descent. This works if we can easily sample from $q_\lambda$ and can compute its derivative wrt $\lambda$ in closed form. In the original paper, the authors suggested the use of the score function as a control variate and a Rao-Blackwellization. Both where described in a way that utterly confused me – until now, because Ruiz, Titsias and Blei manage to describe the concrete application of both control variates and Rao-Blackwellization in a very transparent way. Their own contribution to variance reduction (minus some tricks they applied) is based on the fact that the optimal sampling distribution for estimating $\nabla_\lambda \int \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) q_\lambda(x) \mathrm{d}x$ is proportional to $\left | \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) \right | q_\lambda(x)$ rather than exactly $q_\lambda(x)$ . They argue that this optimal sampling distribution is considerably heavier tailed than $q_\lambda(x)$ . Their reasoning is mainly that the norm of the gradient (which is essentially $(\nabla_\lambda q_\lambda) \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) = q_\lambda(x)(\nabla_\lambda \log q_\lambda(x)) \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right )$ ) vanishes for the modes, making that region irrelevant for gradient estimation. The same should be true for the tails of the distribution I think. Overall very interesting work that I strongly recommend reading, if only to understand the original Blackbox VI proposal.