# To understand deep learning we need to understand kernel learning

This is a most interesting ICML paper by Belkin, Ma and Mandal. Its starting point (as in seemingly many recent papers by Belkin) is the observation that there is robust generalisation performance of statistical models for interpolating solutions/solutions with zero training error, an observation previously made by Recht and Bengio for deep learning. The authors find that this phenomenon persists if we change the model class from deep networks to kernel machines. In particular, the non-smooth Laplacian kernel can fit random training labels in a classification problem.

On the other hand, the authors provide bounds for smooth kernels that diverge with increasing data. To arrive at this conclusion they show that the norm of interpolating solutions diverges with the number of data points (they call interpolating solutions “overfitted” which I think is not the best term). This implies divergence of generalisation bounds as well, as they depend on the norm of the solution.

The assumption needed for the theoretical results is that the label noise of the training data is nonzero. Experimentally, the paper checks how interpolating kernel classifiers perform in this case using synthetic datasets with known label noise and real world data sets with added noise. Here, the empirical test performance of the interpolating kernel classifier is slightly below the noise level, showing near optimal performance. This parallels the performance of deep networks.
The paper concludes that a deep architecture, as such, is not responsible for the good test performance of zero-training-error solutions. And as almost all bounds for RKHS methods depend on the norm of the solution, that a new theory is needed for these zero-training-error regimes. A complementary finding is that optimisation for fitting random labels is more demanding with Gaussian kernels as compared to non-smooth Laplacian kernels.

The authors conjecture that inductive bias, rather than regularisation, is particularly important in deep learning and kernel methods. Concretely, for kernel methods, the span of the canonical features at the data points $k(x_1, \cdot), \dots, k(x_n, \cdot)$ contains the interpolating solution that has minimum RKHS norm. As existing generalization bounds depend on this norm, it’s obvious that this inductive bias is advantageous.
I agree with the authors in that (frequentist) kernel methods are a good place to start analysing minimum norm solutions, as here they are often available in closed form, unlike in (frequentist) deep learning.

# Variational Hamiltonian Monte Carlo via Score Matching

This is an arXival (maybe by now ICML paper?) by Zhang, Shahbaba and Zhao. It suggest fitting an emulator/surrogate $q_\eta$ to the target density $\pi$ via score matching and then use the gradients of $q_\eta$ rather than those of $\pi$ for generating proposal with HMC. Their main selling point being that this decreases wall-clock time for HMC.

However, that seems to me like reselling the earlier paper by Heiko Strathmann and collaborators on Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families. Zhang et al use basically the same idea but fail to look at this earlier work. Their reasoning to use a neural network rather than a GP emulator (computation time) is a bit arbitrary. If you go for a less rich function class (neural network) then the computation time will go down of course – but you would get the same effect by using GPs with inducing points.

Very much lacking to my taste is reasoning for doing the tuning they do. Sometimes they tune HMC/Variational HMC to 85% acceptance, sometimes to 70% acceptance. Also, it seems they not adapting the mass matrix of HMC. If they would, I conjecture the relative efficiency of Standard HMC vs Variational HMC could change drastically. Details on how they tune SGLD is completely lacking.

Overall, I think it is not yet clear what can be learned from the work reported in the paper.

# Character-Aware Neural Language Models

I’m taking a friends questions on Vector Space Semantics models as a reason to write about this arXival by Kim, Jernite, Sontag and Rush using an idea I’ve had in the back of my mind while fighting my semantics model: that maybe it is easier to model characters directly instead of words, as there are only so many characters but hundreds of thousands of words. And modeling characters directly might give an algorithm robustness to neologisms and spelling mistakes. Which indeed seems to be one of the conclusions of the paper.

The basic idea is to use character level ConvNets (with a pooling operation over time), extract features using the recently proposed autobahn networks (see my post) and bind things together over time using LSTMs. The big thing is that one has far fewer parameters than when processing at the word level, while the downside is that I’m not completely sure how easily one might extract word embeddings. While that might be no problem in the long run (as I suppose models could be developed that work based on characters only), it might be a problem for quick combination of this papers approach with current mainstream tools used in the industry.

In the convolutional layer, they used between 100 and 1000 filters of varying length, each of which basically recognizes different n-grams. After max pooling (over time), autobahn highway layers give a way of extracting hierarchical features. LSTMs are then used to predict the next word, the loss/”likelihood” being cross entropy to the actually observed next word. Unsurprisingly, in their discussion they account a strong increase in performance to the highway layers. Making me again think that we have to  work on enabling deeper hierarchies for Bayesian Models.

Overall a super interesting case study, though as a linguist I would really like to know how these models could be changed to give us some insight into the structure of the language.  As a plus for my friend, the code is available on github, written using the ubiquitius Torch library.

# Training Very Deep Networks – aka Deep Highway Networks

This interesting paper by Srivastava, Greff and Schmidhuber (arxiv) looks at the problem of enabling deeper architectures for neural networks. The main argument being that Neural Networks have become successful mainly because training deeper architectures has only recently become possible – and making them deeper might only improve performance. However, performance tends to plateau in conventional deep architectures, as exemplified by their first plot.

The proposed solution to training deeper architectures is to simply introduce a gaiting architecture similar to LSTM ideas. If a normal layer is defined by $y = H(x, w)$ for output y, input x and parametrization of the layer given by w, a highway layer is simply given by $y = H(x, w_H) \odot T(x, w_T) + x \odot C(x, w_C)$, where T and C (Transfer and Carry, respectively) are functions with a (co)domain of the same dimensionality as x and y. This is called a highway layer, as gradient infomation supposedly travels faster along the $latexx \odot C(x, w_C)$ part (hence the Autobahn illustration). When it is required to change dimensionality, one has to resort to a normal layer, or subsample/pad.

Now this archictecture of the highway layer of course allows for gradient information in SGD to travel two paths, one trough $H(\cdot)T(\cdot)$, one through $C(\cdot)\cdot x$ and the output of a layer is a combination of the input of the layer x and the output of $H(x,w)$. The authors use a logistic model $T(\cdot) = \sigma(W_T \cdot+b_T)$ and $C(\cdot) = 1-T(\cdot)$. The authors biased towards simply carrying information through without transformation, which in their setup means simply $b_T < 0$. After training the network with SGD an looking at the value of the transform gates of different layers, they find that those are closed on average (Figure 2, second column) while exhibiting large variance across inputs (exemplified by the transform gate outputs of a single sample, Figure 2, third column). Interestingly, some dimensions of the input to the network are carried through right to the last layer.

They find that this architecture not only helps to train larger networks, but also aids actually utilizing deeper architectures. So this paper took a similar inception as LSTMs: tries to solve an inference problem and comes up with a model that is more powerful in a way.

I feel that the result is somehow odd: carrying the input information through can be achieved by setting the weights in the original normal layer $H$ accordingly (if it is fully connected that is). Why does adding the extra complexity gain so much? I do not feel the paper addressing this important question and I can only speculate myself: perhaps the transform/carry architecture is effective because its weights lie on a compact manifold (because of the logistic transform). Maybe the problem becomes better conditioned because simple downscaling of the input is done by T, while H takes care of actual transformations. To check this, it would have been interesting to know whether they used any constraints on the weights of H – if they haven’t, and H is fully connected, the better conditioning might be an actual explanation. In that case, simply constraining the weights $W_H^{(i,i)}$  of H (mapping input $x_i$ to the same output dimension $y_i$) in an appropriate way might have similar effects.

EDIT: After talking to the authors, I realized where my thinking was insufficient: the layer H inside the highway always used a nonlinear map. The gating allowed to switch that off. I guess thats the difference from reading about NNs to using them…