Category Archives: stats


Layered Adaptive Importance Sampling

This arXival from last spring/summer by Martino, Elvira, Luengo and Corander combines and extends recent advances in Importance Sampling, mainly using ideas from Adaptive Multiple Importance Sampling (AMIS) and Population Monte Carlo (PMC). The extension consists in not using the Importance Sampling procedure itself to come up with new proposal distributions, but rather running a Markov Chain whose output is used solely as the location parameters for the IS proposal distributions q_{n,t}. The weights of the samples drawn from these are Rao-Blackwellized using the deterministic mixture idea of Zhou and Owen, and as far as I can see only the Importance Samples are used for estimating integrands.
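To make the deterministic mixture weighting concrete, here is a minimal numpy sketch under my own assumptions (Gaussian proposals with a shared covariance, self-normalized weights); it is not code from the paper:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def dm_weights(samples, proposal_means, cov, log_target):
    """Deterministic mixture weights (Zhou & Owen): every sample is
    weighted against the mixture of ALL proposals, not only the
    proposal it was actually drawn from."""
    # log of the mixture density (1/N) * sum_n q_n(x) at every sample
    log_q = np.stack([
        multivariate_normal.logpdf(samples, mean=m, cov=cov)
        for m in proposal_means
    ])                                          # (num_proposals, num_samples)
    log_mix = logsumexp(log_q, axis=0) - np.log(len(proposal_means))
    log_w = log_target(samples) - log_mix       # unnormalized log weights
    w = np.exp(log_w - log_w.max())             # numerically stabilized
    return w / w.sum()                          # self-normalized weights
```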
What’s most striking for me, as somebody who has thought about these methods a lot during my PhD, is the idea that in principle one is free to Rao-Blackwellize using an arbitrary partition of the samples/proposal distributions and still get a consistent estimator. Xi’an mentioned this to me earlier, and of course it is not surprising given that even without Rao-Blackwellization my Gradient IS did considerably better than some Adaptive MCMC algorithms. However, this paper makes that idea transparent and uses it extensively. The main idea that is put forward, however, is to use (parallel) MCMC with the same target for coming up with IS proposal locations. The MCMC output is used only for that purpose, not for estimation. This seems kind of wasteful, but in a nice conversation over email the first author Luca Martino assured me that recycling proposals as both IS and MH proposals made performance go down because of correlations. I don’t get an intuition for why that would be the case, but maybe I’ll have to find that out the hard way. What I like about this particular idea of getting locations from MCMC is that one is free from the tuning problem I’ve hit upon in GRIS: if you scale up the proposal covariance in GRIS (or in the PMC approach from the Cappé 2004 paper), you can get an arbitrarily high expected squared jumping distance (ESJD) together with a really bad target approximation. Thus unmodified ESJD cannot be used for tuning, and neither can the acceptance rate, which doesn’t exist in IS. Using MCMC for getting proposal locations is an elegant way around that problem. The effect is shown in the plot from the paper below, where the two rightmost panels show one of their methods.

[Plot from the paper; the two rightmost panels show one of the proposed methods.]

Some other aspects of the paper I find less convincing. For instance, I’m not sure about the abundance of different algorithms that are introduced. It leaves the impression that the authors were going for quantity rather than quality (something I might make myself guilty of these weeks as well). Also, while the targets they use for evaluation are fine, reporting only the MSE of one dimension of one integrand seems odd. One simple improvement would be to report the MSE averaged over dimensions, another to report the MSE of an estimate of the target distribution’s variance or higher-order moments.

(Title image (c) Carlos Delgado, CC-BY-SA)


Why the MAP is a bad starting point in high dimensions

During MCMSki 2016, Heiko mentioned that in high dimensions, the MAP is not a particularly good starting point for a Monte Carlo sampler, because there is no volume around it. I and several people smarter than me were not sure why that should be the case, so Heiko gave us a proof by simulation: he sampled from multivariate standard normals of increasing dimension and plotted the Euclidean norms of the samples. The following is what I reproduced for a standard normal in D=30 dimensions.

[Histogram of sample norms for a standard normal in D=30 dimensions.]

The histogram has a peak at about \sqrt{D}, which means most of the samples lie in a thin shell around the mean/mode/MAP of the target distribution and none are at the MAP itself, which would correspond to norm 0.

We were dumbstruck by this, but nobody (not even Heiko) had an explanation for what was happening. Yesterday I asked Nicolas about it and he gave the most intuitive interpretation: given a standard normal variable in D dimensions, x \sim \mathcal{N}(0,I_D), its squared Euclidean norm is n^2 = \|x\|^2 = \sum_{i=1}^D x_i^2. But as x is Gaussian, this means n^2 has a \chi^2(D) distribution, so \mathbb{E}(n^2) = D and the norm n concentrates around \sqrt{D}. There’s the explanation.
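The simulation itself is short; a small sketch (my own code) to reproduce the histogram:

```python
import numpy as np
import matplotlib.pyplot as plt

D, n_samples = 30, 10_000
x = np.random.randn(n_samples, D)      # standard normal samples in D dimensions
norms = np.linalg.norm(x, axis=1)      # Euclidean norm of every sample

print(norms.mean(), np.sqrt(D))        # both close to sqrt(30) ≈ 5.48
plt.hist(norms, bins=60)               # peaks near sqrt(D); no mass near 0
plt.show()
```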

(Title image (c) Niki Odolphie)

 

Kernel Adaptive Sequential Monte Carlo

Heiko Strathmann, Dino Sejdinovic, Brooks Paige and myself had a poster on our work on Kernel Adaptive SMC (arXiv) at the Scalable Inference NIPS workshop this year. The basic idea is closely related to Dino’s paper on Kernel-informed MCMC: the particle system is (implicitly) mapped to a function space (the infamous Reproducing Kernel Hilbert Space), where a Gaussian fit to the mapped samples is used to inform proposals. As always, things that are nasty and nonlinear in the input space behave much more nicely in feature space. And when your name is Dino Sejdinovic, you can actually integrate out all steps in feature space, ending up with a Gaussian proposal in the space you actually want to sample from.

In the title cartoon, our approach (in red, called KASS for Kernel Adaptive SMC sampler) lines up nicely with the support of the target distribution. ASMC by Fearnhead et al., or at least the sampler they use in their evaluation, which is a special case of our method, cannot adapt to this nonlinear structure. This results in better performance for KASS compared to ASMC and Random Walk SMC, as measured by Maximum Mean Discrepancy (MMD).

[Plot: MMD comparison of KASS, ASMC and Random Walk SMC.]

However, this of course comes at the price of higher computational complexity, similar to related methods such as Riemannian Manifold MALA/HMC. If your dataset is large and evaluating the likelihood dominates the cost of evaluating the Gram matrix of the particle system, this method will still have an edge over others when the support is nonlinear. For the MCMC predecessor KAMH/Kameleon see Heiko’s post.
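Since MMD is the evaluation metric, here is a minimal sketch of the standard unbiased estimator of squared MMD with a Gaussian RBF kernel (a textbook estimator; the evaluation code we actually used may differ):

```python
import numpy as np

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of the squared Maximum Mean Discrepancy
    between samples X (n, d) and Y (m, d) under an RBF kernel."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * bandwidth ** 2))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)   # drop diagonal terms for unbiasedness
    np.fill_diagonal(Kyy, 0.0)
    return (Kxx.sum() / (n * (n - 1))
            + Kyy.sum() / (m * (m - 1))
            - 2.0 * Kxy.mean())
```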

Character-Aware Neural Language Models

I’m taking a friend’s questions on Vector Space Semantics models as a reason to write about this arXival by Kim, Jernite, Sontag and Rush, which uses an idea I’ve had in the back of my mind while fighting my semantics model: that maybe it is easier to model characters directly instead of words, as there are only so many characters but hundreds of thousands of words. And modeling characters directly might make an algorithm robust to neologisms and spelling mistakes. Which indeed seems to be one of the conclusions of the paper.

[Figure from the paper.]

The basic idea is to use character-level ConvNets (with a pooling operation over time), extract features using the recently proposed autobahn networks (see my post) and bind things together over time using LSTMs. The big win is that one has far fewer parameters than when processing at the word level, while the downside is that I’m not completely sure how easily one might extract word embeddings. While that might be no problem in the long run (as I suppose models could be developed that work based on characters only), it might be a problem for quickly combining this paper’s approach with current mainstream tools used in industry.

In the convolutional layer, they use between 100 and 1000 filters of varying length, each of which basically recognizes different n-grams. After max pooling (over time), autobahn highway layers give a way of extracting hierarchical features. LSTMs are then used to predict the next word, the loss/”likelihood” being the cross entropy against the actually observed next word. Unsurprisingly, in their discussion they attribute a strong increase in performance to the highway layers. Making me again think that we have to work on enabling deeper hierarchies for Bayesian Models.
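To fix ideas, here is a schematic PyTorch sketch of the pipeline (the paper used Torch; all layer sizes here are illustrative guesses of mine, not the paper’s configuration):

```python
import torch
import torch.nn as nn

class CharAwareLM(nn.Module):
    """Schematic: char embeddings -> conv over char n-grams ->
    max-over-time pooling -> one highway layer -> word-level LSTM."""
    def __init__(self, n_chars=100, char_dim=15, n_filters=100, width=4):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=width)
        self.t_gate = nn.Linear(n_filters, n_filters)  # highway transform gate
        self.h = nn.Linear(n_filters, n_filters)       # highway nonlinearity
        self.lstm = nn.LSTM(n_filters, 256, batch_first=True)

    def forward(self, chars):          # chars: (batch, seq_len, word_len) ids
        b, s, w = chars.shape
        e = self.emb(chars.reshape(b * s, w)).transpose(1, 2)
        f = torch.relu(self.conv(e)).max(dim=2).values  # max pool over time
        t = torch.sigmoid(self.t_gate(f))
        f = t * torch.relu(self.h(f)) + (1 - t) * f     # highway layer
        out, _ = self.lstm(f.reshape(b, s, -1))         # LSTM over words
        return out   # project to vocabulary + cross entropy for training
```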

Overall a super interesting case study, though as a linguist I would really like to know how these models could be changed to give us some insight into the structure of the language. As a plus for my friend, the code is available on github, written using the ubiquitous Torch library.

Probabilistic Line Searches for Stochastic Optimization

This very remarkable NIPS paper by Mahsereci and Hennig (from my beloved Tübingen, where I was an undergrad) deals with getting rid of learning rates for SGD. It does so by placing a GP prior on the value of the objective function along the search direction, together with its gradients projected onto the search direction. The positive definite (pd) kernel used for the GP is that of a once-integrated Wiener Process, resulting in a cubic spline posterior. They argue that this particular form is advantageous because its minima can be found exactly in linear time – tada! This of course is great news, because, paraphrasing the paper, you don’t want to run an optimization algorithm to find minima inside the line search routine of the actual optimization algorithm. Another reason for choosing this particular pd kernel is that it does not assume an infinite number of derivatives of the GP (as the Gaussian RBF does), making it robust to irregularities in the loss function.
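The “exact minima in linear time” claim rests on the fact that between evaluations the posterior mean is a cubic polynomial, whose extrema are available in closed form; here is a sketch for a single spline segment (my own code, not the authors’):

```python
import numpy as np

def cubic_min(a, b, c, d, lo, hi):
    """Exact minimizer of a*t^3 + b*t^2 + c*t + d on [lo, hi]: the
    derivative is quadratic, so candidates come from a closed-form
    root computation -- no inner optimization loop needed."""
    candidates = [lo, hi]
    disc = b * b - 3.0 * a * c       # discriminant of 3a*t^2 + 2b*t + c
    if abs(a) > 1e-12 and disc >= 0.0:
        for sign in (1.0, -1.0):
            t = (-b + sign * np.sqrt(disc)) / (3.0 * a)
            if lo < t < hi:
                candidates.append(t)
    elif abs(a) <= 1e-12 and abs(b) > 1e-12:   # quadratic fallback
        t = -c / (2.0 * b)
        if lo < t < hi:
            candidates.append(t)
    f = lambda t: ((a * t + b) * t + c) * t + d
    return min(candidates, key=f)
```

Minimizing over the whole posterior mean is then one such computation per segment, hence linear in the number of observed points.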

Another point this rich paper makes is that the Wolfe conditions for terminating the line search are actually positivity constraints, and that the probability of the constraints being satisfied can be modeled by a Gaussian CDF. If the probability is greater than 0.3 (this threshold being motivated in the paper), the algorithm is done.
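A hedged sketch of that termination rule: if the two Wolfe quantities are (approximately) jointly Gaussian under the GP posterior, the acceptance probability is a bivariate Gaussian CDF. The paper’s actual computation of the posterior moments is more involved than this:

```python
from scipy.stats import multivariate_normal

def p_wolfe(m_a, s_a, m_b, s_b, rho):
    """P(a > 0, b > 0) for jointly Gaussian Wolfe quantities a, b with
    means m_a, m_b, standard deviations s_a, s_b and correlation rho."""
    cov = [[1.0, rho], [rho, 1.0]]
    # by symmetry, P(a > 0, b > 0) equals the standardized bivariate
    # CDF evaluated at (m_a/s_a, m_b/s_b)
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf(
        [m_a / s_a, m_b / s_b])

# terminate the line search once p_wolfe(...) > 0.3
```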

Generally speaking, the authors do a very nice job of getting rid of any free parameters in their approach, leading to a black box line search which can be used in SGD, effectively eliminating the dreaded learning rate. What I, as a sampling person, would really be interested in is whether or not this could actually be put to good use in Monte Carlo as well. However, it is delicate to use gradient information in sampling and not introduce bias, so I think this might be very difficult.

Sampling from Probabilistic Submodular Models

This NIPS paper by Gotovos, Hassani and Krause deals with coming up with a sampler for submodular models. Submodularity is an interesting concept capturing the diversity/representativeness of a set. A submodular function is a function from sets to the reals satisfying a diminishing returns property: for S,T \subseteq \Omega, a function F is submodular if F(S)+F(T)\geq F(S\cup T)+F(S\cap T). One can build a probabilistic model from such an F by defining the probability of a set as p(S) \propto \exp(\beta F(S)). Because of the diminishing returns, this results in a distribution where points are repulsive and tend to spread out over the space rather than clump together, just like in the Low Discrepancy Point Sets used for QMC. Think Determinantal Point Process, which is log-submodular and has also been called the Antisocial Coffeeshop Process or, evocatively, the Urinal Process (thanks to Vinayak Rao for this 😀 ).
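A tiny sanity check of the diminishing returns property on a coverage function (my own toy example, not from the paper):

```python
# F(S) = size of the union of the regions indexed by S -- a classic
# submodular function
regions = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}}

def F(S):
    return len(set().union(*(regions[i] for i in S))) if S else 0

S, T = {1, 2}, {2, 3}
# F(S) + F(T) >= F(S | T) + F(S & T):  3 + 4 >= 5 + 2
assert F(S) + F(T) >= F(S | T) + F(S & T)
```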

The paper deals with the case of a discrete sampling space, where sampling a set over a discrete domain can be thought of as sampling binary vectors. Speaking of sampling, the reference they give for MCMC is rather obscure. The Gibbs sampler they give is very straightforward: it proposes to change one entry of the binary vector at a time. The main contribution of the paper is not the sampler, but upper bounds on its mixing time, which is of order n^2 if the base set from which we can add/remove points has cardinality n. If adding/removing a point changes the submodular function by at most 1, the sampler already mixes in time of order n\log n. I have no intuition whether this improved result is obtained under realistic assumptions.
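A minimal rendering of that single-flip Gibbs sampler (my own code; sets are encoded as binary vectors):

```python
import numpy as np

def gibbs_sweep(x, F, beta, rng):
    """One sweep of single-site Gibbs for p(S) ∝ exp(beta * F(S)),
    where x is a binary vector encoding S and F maps such vectors
    to the reals."""
    for i in rng.permutation(len(x)):
        x1, x0 = x.copy(), x.copy()
        x1[i], x0[i] = 1, 0
        delta = beta * (F(x1) - F(x0))
        p_on = 1.0 / (1.0 + np.exp(-delta))  # conditional P(x_i = 1 | rest)
        x[i] = int(rng.random() < p_on)
    return x
```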

What I have concluded for myself, independent of this paper, is that sampling one component at a time often gives a very bad sampler. In this particular case, the sampler is very close to the Gibbs sampler for the Indian Buffet Process, which mixes rather awfully and is one of the reasons why I turned away from the IBP and towards working on sampling algorithms. I don’t see how their sampler might improve here. Also, I’m unsure why they did not cite Sequential Monte Carlo on large binary sampling spaces by Christian Schäfer and Nicolas Chopin – were they unaware of it? That paper suggests a much more elegant sampler, which adapts to correlations between entries in the binary vector – which is really what you want when doing posterior inference. And the gains of adaptivity are hard to overestimate (as are the perils) – take for example the fact that unadaptive, perfectly tuned HMC is worse than most simple adaptive algorithms, as reported for example in my Gradient IS work or in the tutorial on adaptive MCMC by Andrieu and Thoms.

Training Very Deep Networks – aka Deep Highway Networks

This interesting paper by Srivastava, Greff and Schmidhuber (arXiv) looks at the problem of enabling deeper architectures for neural networks. The main argument is that neural networks have become successful mainly because training deeper architectures has only recently become possible – and making them deeper might further improve performance. However, performance tends to plateau in conventional deep architectures, as exemplified by their first plot.

The proposed solution to training deeper architectures is to introduce a gating architecture similar to LSTM ideas. If a normal layer is defined by y = H(x, w) for output y, input x and layer parametrization w, a highway layer is given by y = H(x, w_H) \odot T(x, w_T) + x \odot C(x, w_C), where T and C (Transfer and Carry, respectively) are functions with a (co)domain of the same dimensionality as x and y. This is called a highway layer, as gradient information supposedly travels faster along the x \odot C(x, w_C) part (hence the Autobahn illustration). When it is required to change dimensionality, one has to resort to a normal layer, or subsample/pad.

Now this architecture of the highway layer allows gradient information in SGD to travel two paths, one through H(\cdot) \odot T(\cdot), one through x \odot C(\cdot), and the output of a layer is a combination of the input x and the output of H(x,w). The authors use a logistic model T(\cdot) = \sigma(W_T \cdot+b_T) and set C(\cdot) = 1-T(\cdot). They bias the network towards simply carrying information through without transformation, which in their setup simply means b_T < 0. After training the network with SGD and looking at the values of the transform gates of different layers, they find that those are closed on average (Figure 2, second column) while exhibiting large variance across inputs (exemplified by the transform gate outputs of a single sample, Figure 2, third column). Interestingly, some dimensions of the input to the network are carried through right to the last layer.
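A numpy sketch of one such layer with the negative transform-gate bias (my own minimal rendering of the formulas above):

```python
import numpy as np

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = H(x) ⊙ T(x) + x ⊙ (1 - T(x)), with C = 1 - T as in the paper."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    H = np.tanh(W_H @ x + b_H)      # nonlinear transform of the input
    T = sigmoid(W_T @ x + b_T)      # transform gate in (0, 1)
    return H * T + x * (1.0 - T)

d = 8
x = np.random.randn(d)
W_H, W_T = np.random.randn(d, d), np.random.randn(d, d)
b_H = np.zeros(d)
b_T = -2.0 * np.ones(d)   # b_T < 0: initially the layer mostly carries x
y = highway_layer(x, W_H, b_H, W_T, b_T)
```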

They find that this architecture not only helps to train larger networks, but also aids in actually utilizing deeper architectures. So this paper had a similar inception to LSTMs: it tries to solve an inference problem and comes up with a model that is more powerful in its own right.

I feel that the result is somehow odd: carrying the input information through can be achieved by setting the weights of the original normal layer H accordingly (if it is fully connected, that is). Why does adding the extra complexity gain so much? I do not feel the paper addresses this important question, and I can only speculate myself: perhaps the transform/carry architecture is effective because its weights lie on a compact manifold (because of the logistic transform). Maybe the problem becomes better conditioned because simple downscaling of the input is done by T, while H takes care of actual transformations. To check this, it would have been interesting to know whether they used any constraints on the weights of H – if they haven’t, and H is fully connected, the better conditioning might be an actual explanation. In that case, simply constraining the weights W_H^{(i,i)} of H (mapping input x_i to the same output dimension y_i) in an appropriate way might have similar effects.

EDIT: After talking to the authors, I realized where my thinking was insufficient: the layer H inside the highway always uses a nonlinear map, and the gating allows switching that off. I guess that’s the difference between reading about NNs and using them…

On Markov Chain Monte Carlo methods for tall data

The preprint by Rémi Bardenet, Arnaud Doucet and Chris Holmes offers a very well written overview and some original contributions regarding MCMC with large datasets (I like ‘tall’ – maybe high dimensional posteriors could then be ‘fat’, and problems which exhibit both the worn-out ‘big’? ;-)

Their original contributions revolve around using ideas from Firefly MC (developed by the authors of the Stochastic Optimization is Variational Inference paper) on bounds for the likelihood in combination with their Confidence sampler. I must say I wasn’t able to follow all details and will have to read up on Firefly and the original Confidence sampler paper. One downside I see is that targets need to be differentiable (thrice for their bounds using Taylor approximations) and thus in particular continuous. While in general I try to avoid discrete models simply because you can’t use gradient information, sometimes they are necessary. But better to make progress on something slightly limited than no progress at all.

In their review section, they clearly advise against divide-and-conquer approaches for tall data, i.e. approaches that sample from posteriors involving only part of the data and then combine the information somehow. The main reason is that the currently existing upper bounds suggest that the MSE of divide-and-conquer approaches can grow exponentially in the number of batches. Thus one would have to use few, large batches, defeating the whole purpose of special methods for tall data.

The section finally exposed me to the delayed acceptance approach of Marco Banterle et al., which I have been meaning to read for quite some time (and will read in the original). I found this approach very appealing for its elementary formulation – most other approaches, including that of Bardenet, Doucet and Holmes, are quite involved. I still have to look into the objection regarding the ergodicity of delayed acceptance raised by Bardenet et al., which could potentially be quite a problem of course.

A word about the evaluation section: I think it’s unfortunate that two of the four experiments use logistic regression. The logistic regression target is very close to Gaussian, where MCMC is usually slammed by a simple EP approach (Leave Pima Indians alone!). And EP is easy to apply in the big data setting, or that’s what I’ve heard. About the gamma regression target I don’t know; it might be that EP does not do so well in that case.

Early Stopping as Nonparametric Variational Inference (not)

Update: This paper has been accepted to AISTATS – as far as I can see, in exactly the same version that I review here.
This preprint by Maclaurin, Duvenaud and Adams promised to be a very interesting read. Their idea is that stochastic optimization of a Bayesian model can be seen as sampling from a variational approximation of the target distribution. Indeed this does not seem surprising, given the consistency results for Stochastic Gradient Langevin Dynamics by Teh et al. A major gist of the paper is that the variational approximation can be used to estimate a lower bound on the model’s marginal likelihood/model evidence. This estimate is used throughout the evaluations.

However, in its current version the paper exhibits several weaknesses, the least of which is the experiments – with which I will start (using the psychologically researched foot-in-the-door technique 😉 ). In their first experiment they optimize the parameters of a neural network without any regularization on the parameters. This is equivalent to using an improper prior – which makes the marginal likelihood completely useless for Bayesian model comparison because of the marginalization paradox. Basically the problem is that improper priors are only known up to proportionality – and the marginal likelihood can be anything because of that. To get an intuition for this, observe that with the flat improper prior p(\theta) = C, the marginal likelihood is \int p(\textrm{Data}\vert \theta)\, p(\theta)\, \textrm{d}\theta = C\int p(\textrm{Data}\vert \theta)\,\textrm{d}\theta, where the constant C can be anything, because p(\theta) \propto 1 characterizes the prior regardless of C. In the other experiments, a statement on whether the prior is proper or improper is absent.

They also claim that, after sampling from some distribution q and taking a stochastic gradient step computed from a partial posterior, they are able to compute the Jacobian of this step. This however is impossible in general, because the gradient step is not bijective. Consider a univariate Gaussian target: you could end up at the mode from two points of equal log posterior density, with equally large gradients pointing towards the mode from either side. But if you can end up at the mode from two different points, the gradient step is not injective, and thus definitely not bijective. Hence no computable Jacobian transformation. I hit upon the same problem in my early ideas for Gradient Importance Sampling, which is why in that algorithm the gradient is not used for an optimization step after sampling, but rather to inform the sampling distribution.
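A one-dimensional toy demonstration of the non-injectivity (my own example, in the spirit of the argument above):

```python
# For log p(x) = -x**4 / 4, a gradient ascent step with step size eps is
#   g(x) = x + eps * d/dx log p(x) = x - eps * x**3
eps = 1.0
g = lambda x: x - eps * x**3

print(g(0.0), g(1.0), g(-1.0))   # all three starting points map to the mode 0.0
# g is not injective, so there is no well-defined Jacobian with which
# to track how the density of q transforms under the step
```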

A further problem is that they seem to assume that evaluating the log posterior on only part of the data yields an unbiased estimate of the log posterior using all of the data. This is not true; in fact, using minibatches can introduce a small sampling bias – this being the reason that Consensus Monte Carlo introduces a jackknife debiasing procedure, and the motivation behind Heiko Strathmann’s work on removing subsampling bias in MCMC for big data (another example being the paper on tall data by Rémi Bardenet, Arnaud Doucet and Chris Holmes).

Finally, and I’ll end here, they state that a limitation of their method might be that

using only a single sample to estimate both the expected likelihood as well as the entropy of an entire distribution will necessarily have high variance under some circumstances

This is understating things – I’m not sure a single-sample estimate has finite variance at all.

I really hope that they can improve on this work, but it won’t be a piece of cake.

EDIT: I talked to David Duvenaud at his poster at a NIPS workshop today. While I still think my objections are valid, I was harsher than I wished I had been (I wanted to be constructive, but I just have no idea how one could proceed, and ended up just criticizing, which must have been very frustrating). Just to clarify: even when you do not end up exactly at the mode after a gradient step, the step still need not be invertible, and hence is not bijective.

The Bouncy Particle Sampler

Bouncy Particles (c) Lionel Allorge

The Bouncy Particle Sampler (on arXiv) is an advanced MCMC algorithm making use of gradient information. Initialized at a random position x_0 in the support of the target density \pi, and with a random direction v_0 (of Euclidean length 1), the algorithm proceeds at iteration i by moving to x_{i+1} = x_i+\tau v_i, where \tau>0 is a random travel length. The intuitive idea is:

  • the step size/learning rate \tau should be large if the target’s gradient points in a direction similar to v_{i} (similarity measured by their dot product), and small otherwise
  • after the move, the direction v_{i+1} is obtained from v_{i} using the target’s gradient at the new position x_{i+1}, by reflecting v_{i} on the hyperplane orthogonal to the gradient

At random times, the direction v_{i+1} is refreshed by drawing a direction uniformly at random (without this, no irreducibility). While this algorithm is as mind-boggling as HMC at first reading (for a non-physicist), it has the nice property of having only one tuning parameter, the frequency with which the direction is refreshed – unlike Hamiltonian MC, where step size, number of steps and the mass matrix for the momentum have to be tuned. Also, unlike HMC (or any other MH-based algorithm), there are no rejected proposals.
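A sketch of the two velocity updates, plus the standard Poisson thinning construction for the random bounce times (my own minimal rendering; the rate bound must be supplied per target):

```python
import numpy as np

def bounce(v, grad_U):
    """Reflect velocity v on the hyperplane orthogonal to grad U,
    where U = -log(target density)."""
    return v - 2.0 * (v @ grad_U) / (grad_U @ grad_U) * grad_U

def refresh(dim, rng):
    """Draw a fresh direction uniformly on the unit sphere."""
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def next_bounce_time(x, v, grad_U, rate_bound, rng):
    """Poisson thinning: the true bounce rate along the ray x + t*v is
    max(0, <v, grad U(x + t*v)>); rate_bound must upper bound it."""
    t = 0.0
    while True:
        t += rng.exponential(1.0 / rate_bound)
        rate = max(0.0, v @ grad_U(x + t * v))
        if rng.random() < rate / rate_bound:
            return t
```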

A variant of the algorithm partitions the sampling space when the target is a product of independent factors, with a direction, position and travel length for each factor. Thus, in this version an actual particle system (in the SMC sense) is set up, where interaction between the particles is achieved by them colliding – only particles belonging to overlapping factors collide. This part is very interesting, but I think it could use a more intuitive exposition of the basic idea.
The paper does not manage to adapt the bouncy sampler to discrete spaces, but shows that one can just apply another MCMC step for the discrete part.

Also, the paper establishes that you can use Hamiltonian dynamics instead of straight line updates, but as far as I can tell the straight line version is used in the numerical experiments. I’m not sure why you would want Hamiltonian dynamics back, as this reintroduces the tuning problems of HMC (maybe less so when using NUTS).

In comparison to NUTS, the paper reports that their algorithm performs better (of course). It is not entirely clear, however, whether they used NUTS with or without adaptation of the mass matrix, which should make quite a difference. In their application example (in 10 dimensions), they seem not to adapt the mass matrix of HMC, which is a pity. A comparison to Adaptive Metropolis with mass matrix and scale adaptation would have been nice, as this is often hard to beat in terms of wall clock time – not to mention its simplicity of implementation.

Overall a very interesting paper, though it doesn’t exactly give an easy exposition. One direction that might be interesting is whether the refreshed direction v_{i+1} has to be drawn uniformly. I suppose it might be possible to draw it non-uniformly, for example by drawing a vector from the empirical covariance of the target and normalizing it. But the correctness of this is likely harder to prove, and the advantage over HMC in terms of tuning parameters essentially vanishes in that case.