Why the MAP is a bad starting point in high dimensions

During MCMSki 2016, Heiko mentioned that in high dimensions, the MAP is not a particularly good starting point for a Monte Carlo sampler, because there is no volume around it. I and several people smarter than me where not sure why that could be the case, so Heiko gave us a proof by simulation: he sampled from multivariate standard normals with increasing dimensions and plotted the euclidean norm of the samples. The following is what I reproduced for sampling a standard normal in $D=30$ dimensions.The histogram has a peak at about $\sqrt{D}$, which means most of the samples are in a sphere around the mean/mode/MAP of the target distribution and none are at the MAP, which would correspond to norm $0$.

We where dumbstruck by this, but nobody (not even Heiko) had an explanation for what was happening. Yesterday I asked Nicolas about this and he gave the most intuitive interpretation: Given a standard normal variable in $D$ dimensions, $x \sim \mathcal{N}(0,I_D)$, computing the euclidean norm you get $n^2 = \|x\|^2 = \sum_{i=1}^D x_i^2$. But as $x$ is gaussian, this just means $n^2$ has a $\chi^2(D)$ distribution, which results in the expected value  $\mathbb{E}(n) = \sqrt{D}$. Voici l’explication.

(Title image (c) Niki Odolphie)

Kernel Adaptive Sequential Monte Carlo

Heiko Stratmann, Dino Sejdinovic, Brooks Paige and myself had a poster on our work on Kernel Adaptive SMC (arXiv) at the Scalable Inference NIPS workshop this year. The basic idea is very related to Dinos paper on Kernel-informed MCMC, in that the particle system is (implicitly) mapped to a function space (the infamous Reproducing Kernel Hilbert Space) and used to inform a sample from a Gaussian fit to the functional associated with samples. As always, things that are nasty and nonlinear in the input space behave much more nicely in feature space. An when your name is Dino Sejdinovic, you can actually integrate out all steps in feature space, ending up with a Gaussian proposal in the space you actually want to sample from.

In the title cartoon, our approach (in red, called KASS for Kernel Adaptive SMC sampler) lines up nicely with the support of the target distribution. ASMC by Fearnhead et al., or at least the sampler they use in their evaluation, which is a special case of our method, cannot adapt to this nonlinear structure. This results in better performance for KASS compared to ASMC and Random Walk SMC, as measured by Maximum Mean Discrepancy.However, this of course comes at the price of higher computational complexity, similar to related methods such as Riemannian Manifold MALA/HMC. If your dataset is large however and evaluating the likelihood majorizes evaluating the Gram matrix on your particle system, this method will still have an edge over others when the support is nonlinear. For the MCMC predecessor KAMH/Kameleon see Heikos post.

Probabilistic Line Searches for Stochastic Optimization

This very remarkable NIPS paper by Mahsereci and Hennig (from my beloved Tübingen where I was an undergrad) deals with getting rid of learning rates for SGD. It does so by placing a GP prior on the value of the objective function along the search direction and its gradients projected onto the search direction. The pd kernel used for the GP is that of a once-integrated Wiener Process, resulting in a cubic spline posterior. They argue that this particular form is advantageous because its minima can be found exactly in linear time – tada! This of course is great news, because, paraphrasing the paper, you don’t want to run an optimization algorithm to find minima inside of the line search routine of the actual optimization algorithm. Another reason for choosing the particular pd kernel is that it does not assume an infinite number of derivatives of the GP (such as the Gaussian RBF does) – making it robust to irregularities in the Loss function.

Another point this rich paper makes is that the Wolfe conditions for termination of the line search procedure are actually positivity constraints. And that the probability of the constraints being satisfied can be modeled by a Gaussian CDF. If the probability is greater that $0.3$ (this threshold being motivated in the paper), the algorithm is done.

Generally speaking, the authors do a very nice job of getting rid of any free parameters in their approach, leading to a black box line search which can be used in SGD, effectively eliminating the dreaded learning rate. What I, as a sampling person, would really be interested in is whether or not this could actually be put to good use in Monte Carlo as well. However, it is delicate to use gradient information in sampling and not introduce bias, so I think this might be very difficult.

Sampling from Probabilistic Submodular Models

This NIPS paper by Gotovos, Hassani and Krause deals with coming up with a sampler for submodular models. Submodularity is an interesting concept capturing the concept of diversity/representativeness of a set. A submodular function itself is a function from sets to a real value defined by a diminishing returns property: for $S,T \subseteq \Omega$ some function F is submodular if $F(S)+F(T)\geq F(S\cup T)+F(S\cap T)$. For a probabilistic model one can use such an F by defining a probability of a set as $S \propto exp(\beta F(S))$. Because of the diminishing returns, this results in a distribution  where points are repulsive and tend to spread out over the space rather than clump together, just like in Low Discrepancy Point Sets used for QMC. Think Determinantal Point Process, which is log-submodular and has also been called the Antisocial Coffeshop Process, or, evocatively,  the Urinal Process (thanks to Vinayak Rao for this 😀 ).

The paper deals with the case of a discrete sampling space. Sampling a set over a discrete domain can be thought of as sampling binary vectors. Speaking of sampling, the reference they give for MCMC is rather obscure.  The  Gibbs sampler they give is very straight forward: proposing to change one entry in the binary vector at a time. The main contribution of the paper is not the sampler, which is rather straight forward, but rather proving upper bounds on its mixing time, which is $n^2$ if the base set from which we can add/remove points is of cardinality n. If adding/removing a point only has an certain effect on the submodular function bounded by 1, then the sampler mixes at least in time $n\log n$. I have no intuition wether this improved result is obtained under realistic assumptions.

What I have concluded for myself independent of this paper is that sampling one component at a time often gives a very bad sampler. In this particular case, the sampler is very close to the Gibbs sampler for the Indian Buffer Process, which is mixing rather awfully and one of the reasons why I turned away from the IBP and towards working on sampling algorithms. I don’t see how their sampler might improve here. Also, I’m unsure why they did not cite Sequential Monte Carlo on large binary sampling spaces  by Christian Schäfer and Nicolas Chopin – have they been unaware of it? That paper suggests a much more elegant sampler. It adapts to correlations between entries in the binary vector, which is really what you want when doing posterior inference. And the gains of adaptivity are hard to underestimate (as are the perils) – take for example the fact that unadaptive, perfectly tuned HMC is worse than most simple adaptive algorithms, as for example reported in my Gradient IS work or in the Tutorial on adaptive MCMC by Andrieu and Thoms.

Training Very Deep Networks – aka Deep Highway Networks

This interesting paper by Srivastava, Greff and Schmidhuber (arxiv) looks at the problem of enabling deeper architectures for neural networks. The main argument being that Neural Networks have become successful mainly because training deeper architectures has only recently become possible – and making them deeper might only improve performance. However, performance tends to plateau in conventional deep architectures, as exemplified by their first plot.

The proposed solution to training deeper architectures is to simply introduce a gaiting architecture similar to LSTM ideas. If a normal layer is defined by $y = H(x, w)$ for output y, input x and parametrization of the layer given by w, a highway layer is simply given by $y = H(x, w_H) \odot T(x, w_T) + x \odot C(x, w_C)$, where T and C (Transfer and Carry, respectively) are functions with a (co)domain of the same dimensionality as x and y. This is called a highway layer, as gradient infomation supposedly travels faster along the $latexx \odot C(x, w_C)$ part (hence the Autobahn illustration). When it is required to change dimensionality, one has to resort to a normal layer, or subsample/pad.

Now this archictecture of the highway layer of course allows for gradient information in SGD to travel two paths, one trough $H(\cdot)T(\cdot)$, one through $C(\cdot)\cdot x$ and the output of a layer is a combination of the input of the layer x and the output of $H(x,w)$. The authors use a logistic model $T(\cdot) = \sigma(W_T \cdot+b_T)$ and $C(\cdot) = 1-T(\cdot)$. The authors biased towards simply carrying information through without transformation, which in their setup means simply $b_T < 0$. After training the network with SGD an looking at the value of the transform gates of different layers, they find that those are closed on average (Figure 2, second column) while exhibiting large variance across inputs (exemplified by the transform gate outputs of a single sample, Figure 2, third column). Interestingly, some dimensions of the input to the network are carried through right to the last layer.

They find that this architecture not only helps to train larger networks, but also aids actually utilizing deeper architectures. So this paper took a similar inception as LSTMs: tries to solve an inference problem and comes up with a model that is more powerful in a way.

I feel that the result is somehow odd: carrying the input information through can be achieved by setting the weights in the original normal layer $H$ accordingly (if it is fully connected that is). Why does adding the extra complexity gain so much? I do not feel the paper addressing this important question and I can only speculate myself: perhaps the transform/carry architecture is effective because its weights lie on a compact manifold (because of the logistic transform). Maybe the problem becomes better conditioned because simple downscaling of the input is done by T, while H takes care of actual transformations. To check this, it would have been interesting to know whether they used any constraints on the weights of H – if they haven’t, and H is fully connected, the better conditioning might be an actual explanation. In that case, simply constraining the weights $W_H^{(i,i)}$  of H (mapping input $x_i$ to the same output dimension $y_i$) in an appropriate way might have similar effects.

EDIT: After talking to the authors, I realized where my thinking was insufficient: the layer H inside the highway always used a nonlinear map. The gating allowed to switch that off. I guess thats the difference from reading about NNs to using them…