To understand deep learning we need to understand kernel learning

This is a most interesting ICML paper by Belkin, Ma and Mandal. Its starting point (as in seemingly many recent papers by Belkin) is the observation that there is robust generalisation performance of statistical models for interpolating solutions/solutions with zero training error, an observation previously made by Recht and Bengio for deep learning. The authors find that this phenomenon persists if we change the model class from deep networks to kernel machines. In particular, the non-smooth Laplacian kernel can fit random training labels in a classification problem.

On the other hand, the authors provide bounds for smooth kernels that diverge with increasing data. To arrive at this conclusion they show that the norm of interpolating solutions diverges with the number of data points (they call interpolating solutions “overfitted” which I think is not the best term). This implies divergence of generalisation bounds as well, as they depend on the norm of the solution.

The assumption needed for the theoretical results is that the label noise of the training data is nonzero. Experimentally, the paper checks how interpolating kernel classifiers perform in this case using synthetic datasets with known label noise and real world data sets with added noise. Here, the empirical test performance of the interpolating kernel classifier is slightly below the noise level, showing near optimal performance. This parallels the performance of deep networks.
The paper concludes that a deep architecture, as such, is not responsible for the good test performance of zero-training-error solutions. And as almost all bounds for RKHS methods depend on the norm of the solution, that a new theory is needed for these zero-training-error regimes. A complementary finding is that optimisation for fitting random labels is more demanding with Gaussian kernels as compared to non-smooth Laplacian kernels.

The authors conjecture that inductive bias, rather than regularisation, is particularly important in deep learning and kernel methods. Concretely, for kernel methods, the span of the canonical features at the data points $k(x_1, \cdot), \dots, k(x_n, \cdot)$ contains the interpolating solution that has minimum RKHS norm. As existing generalization bounds depend on this norm, it’s obvious that this inductive bias is advantageous.
I agree with the authors in that (frequentist) kernel methods are a good place to start analysing minimum norm solutions, as here they are often available in closed form, unlike in (frequentist) deep learning.

Marginal Likelihood From the Metropolis–Hastings Output

This is a rather old JASA paper by Siddhartha Chib and Ivan Jeliazkov from 2001. I dug it up again as I am frantically trying to get my contribution for the CRiSM workshop on Estimating Constants in shape. My poster will be titled Flyweight evidence estimates as it looks at methods for estimating model evidence (aka marginal likelihood, normalizing constant, free energy) that are computationally very cheap when a sample from the target/posterior distribution exists.
I noticed again that the approach by Chib & Jeliazkov (2001) does not satisfy this constraint, although they claim otherwise in the paper. The estimator is based on the fact that the normalized probability of a point $\theta^*$ under the target when using Metropolis Hastings is

where $q$ is the proposal density, $\alpha$ the acceptance probability and $y$ the data. If $\bar\pi$ is the unnormalized target, we can get the evidence estimate $Z = \bar\pi (\theta^*|y)/\pi(\theta^*|y)$.
Now the integral in the numerator is cheap to estimate when we already have samples from $\pi$. The authors claim that because it is cheap to sample from  $q$, the denominator is cheap as well. I can’t help it: this statement feels a bit deceptive to me. While indeed samples from $q$ are cheep to generate, evaluating  $\alpha(\theta^*,\theta^{(j)}|y)$ for each sample $\theta^{(j)} \sim q(\theta^*,\cdot|y)$ is the opposite of cheap! It involves evaluating $\bar\pi(\theta^{(j)}|y)$ for all j, which is basically the same cost as getting samples from the target in the first place.
I already included that criticism in my PhD thesis, but when revising under time pressure I no longer thought it was valid and erroneously took it out.

Character-Aware Neural Language Models

I’m taking a friends questions on Vector Space Semantics models as a reason to write about this arXival by Kim, Jernite, Sontag and Rush using an idea I’ve had in the back of my mind while fighting my semantics model: that maybe it is easier to model characters directly instead of words, as there are only so many characters but hundreds of thousands of words. And modeling characters directly might give an algorithm robustness to neologisms and spelling mistakes. Which indeed seems to be one of the conclusions of the paper.

The basic idea is to use character level ConvNets (with a pooling operation over time), extract features using the recently proposed autobahn networks (see my post) and bind things together over time using LSTMs. The big thing is that one has far fewer parameters than when processing at the word level, while the downside is that I’m not completely sure how easily one might extract word embeddings. While that might be no problem in the long run (as I suppose models could be developed that work based on characters only), it might be a problem for quick combination of this papers approach with current mainstream tools used in the industry.

In the convolutional layer, they used between 100 and 1000 filters of varying length, each of which basically recognizes different n-grams. After max pooling (over time), autobahn highway layers give a way of extracting hierarchical features. LSTMs are then used to predict the next word, the loss/”likelihood” being cross entropy to the actually observed next word. Unsurprisingly, in their discussion they account a strong increase in performance to the highway layers. Making me again think that we have to  work on enabling deeper hierarchies for Bayesian Models.

Overall a super interesting case study, though as a linguist I would really like to know how these models could be changed to give us some insight into the structure of the language.  As a plus for my friend, the code is available on github, written using the ubiquitius Torch library.

Sampling from Probabilistic Submodular Models

This NIPS paper by Gotovos, Hassani and Krause deals with coming up with a sampler for submodular models. Submodularity is an interesting concept capturing the concept of diversity/representativeness of a set. A submodular function itself is a function from sets to a real value defined by a diminishing returns property: for $S,T \subseteq \Omega$ some function F is submodular if $F(S)+F(T)\geq F(S\cup T)+F(S\cap T)$. For a probabilistic model one can use such an F by defining a probability of a set as $S \propto exp(\beta F(S))$. Because of the diminishing returns, this results in a distribution  where points are repulsive and tend to spread out over the space rather than clump together, just like in Low Discrepancy Point Sets used for QMC. Think Determinantal Point Process, which is log-submodular and has also been called the Antisocial Coffeshop Process, or, evocatively,  the Urinal Process (thanks to Vinayak Rao for this 😀 ).

The paper deals with the case of a discrete sampling space. Sampling a set over a discrete domain can be thought of as sampling binary vectors. Speaking of sampling, the reference they give for MCMC is rather obscure.  The  Gibbs sampler they give is very straight forward: proposing to change one entry in the binary vector at a time. The main contribution of the paper is not the sampler, which is rather straight forward, but rather proving upper bounds on its mixing time, which is $n^2$ if the base set from which we can add/remove points is of cardinality n. If adding/removing a point only has an certain effect on the submodular function bounded by 1, then the sampler mixes at least in time $n\log n$. I have no intuition wether this improved result is obtained under realistic assumptions.

What I have concluded for myself independent of this paper is that sampling one component at a time often gives a very bad sampler. In this particular case, the sampler is very close to the Gibbs sampler for the Indian Buffer Process, which is mixing rather awfully and one of the reasons why I turned away from the IBP and towards working on sampling algorithms. I don’t see how their sampler might improve here. Also, I’m unsure why they did not cite Sequential Monte Carlo on large binary sampling spaces  by Christian Schäfer and Nicolas Chopin – have they been unaware of it? That paper suggests a much more elegant sampler. It adapts to correlations between entries in the binary vector, which is really what you want when doing posterior inference. And the gains of adaptivity are hard to underestimate (as are the perils) – take for example the fact that unadaptive, perfectly tuned HMC is worse than most simple adaptive algorithms, as for example reported in my Gradient IS work or in the Tutorial on adaptive MCMC by Andrieu and Thoms.

Training Very Deep Networks – aka Deep Highway Networks

This interesting paper by Srivastava, Greff and Schmidhuber (arxiv) looks at the problem of enabling deeper architectures for neural networks. The main argument being that Neural Networks have become successful mainly because training deeper architectures has only recently become possible – and making them deeper might only improve performance. However, performance tends to plateau in conventional deep architectures, as exemplified by their first plot.

The proposed solution to training deeper architectures is to simply introduce a gaiting architecture similar to LSTM ideas. If a normal layer is defined by $y = H(x, w)$ for output y, input x and parametrization of the layer given by w, a highway layer is simply given by $y = H(x, w_H) \odot T(x, w_T) + x \odot C(x, w_C)$, where T and C (Transfer and Carry, respectively) are functions with a (co)domain of the same dimensionality as x and y. This is called a highway layer, as gradient infomation supposedly travels faster along the $latexx \odot C(x, w_C)$ part (hence the Autobahn illustration). When it is required to change dimensionality, one has to resort to a normal layer, or subsample/pad.

Now this archictecture of the highway layer of course allows for gradient information in SGD to travel two paths, one trough $H(\cdot)T(\cdot)$, one through $C(\cdot)\cdot x$ and the output of a layer is a combination of the input of the layer x and the output of $H(x,w)$. The authors use a logistic model $T(\cdot) = \sigma(W_T \cdot+b_T)$ and $C(\cdot) = 1-T(\cdot)$. The authors biased towards simply carrying information through without transformation, which in their setup means simply $b_T < 0$. After training the network with SGD an looking at the value of the transform gates of different layers, they find that those are closed on average (Figure 2, second column) while exhibiting large variance across inputs (exemplified by the transform gate outputs of a single sample, Figure 2, third column). Interestingly, some dimensions of the input to the network are carried through right to the last layer.

They find that this architecture not only helps to train larger networks, but also aids actually utilizing deeper architectures. So this paper took a similar inception as LSTMs: tries to solve an inference problem and comes up with a model that is more powerful in a way.

I feel that the result is somehow odd: carrying the input information through can be achieved by setting the weights in the original normal layer $H$ accordingly (if it is fully connected that is). Why does adding the extra complexity gain so much? I do not feel the paper addressing this important question and I can only speculate myself: perhaps the transform/carry architecture is effective because its weights lie on a compact manifold (because of the logistic transform). Maybe the problem becomes better conditioned because simple downscaling of the input is done by T, while H takes care of actual transformations. To check this, it would have been interesting to know whether they used any constraints on the weights of H – if they haven’t, and H is fully connected, the better conditioning might be an actual explanation. In that case, simply constraining the weights $W_H^{(i,i)}$  of H (mapping input $x_i$ to the same output dimension $y_i$) in an appropriate way might have similar effects.

EDIT: After talking to the authors, I realized where my thinking was insufficient: the layer H inside the highway always used a nonlinear map. The gating allowed to switch that off. I guess thats the difference from reading about NNs to using them…