Supervised learning with quantum enhanced feature spaces

This letter to Nature (also on arXiv) by Havlicek and coauthors, written by quantum physicists, deals with gaining a quantum computing advantage for classification problems. The main reason a friend brought it to my attention is that the classification problem is solved using support vector machines, which fits my recent interest in reproducing kernel Hilbert space (RKHS) methods.
The main idea is that the way numerical data is encoded into a quantum state can act as a nonlinear feature map into the very high-dimensional quantum Hilbert space. Subsequent computation in the quantum Hilbert space then induces nonlinear methods in the input space, in this case nonlinear classification.

The main contributions over papers that are easier to read coming from an RKHS background (such as Quantum machine learning in feature Hilbert spaces by Schuld et al) are twofold. First, Havlicek and coauthors use a feature map that does not result in a trivial/useless RKHS. Specifically, they propose to use two layers of a diagonal gate and a Hadamard gate and conjecture that this gives a quantum advantage (while a single layer can be simulated classically). I am quite lost here of course, having no background in quantum computing. Second, their classification algorithm was implemented and run on an actual quantum computer, rather than simulated on standard computers. In particular, they use five superconducting transmons, which, as far as I understand, are a type of qubit implementation that allows for quantum coupling.

The classification problem they tackle is a toy problem that they construct so as to be perfectly separable with their classification algorithm, which is of course a good sanity check for this first step of developing actual quantum machine learning. The decision of the algorithm for either of the two classes, however, cannot be read off the computing device deterministically, only stochastically. The solution, seemingly common in quantum computing, is to read out the class repeatedly to obtain samples and compute their empirical average.
The training then consists of optimizing a bias variable and a parameter \theta that determines the feature map (unless I’m mistaken!) so as to minimize the empirical misclassification error for the training dataset R_{\mathrm{emp}}(\vec{\theta})=\frac{1}{|T|} \sum_{\vec{x} \in T} \mathrm{Pr}(\tilde{m}(\vec{x}) \neq m(\vec{x})), where \tilde{m}(\vec{x}) is the label assigned by the algorithm and m(\vec{x}) is the actual label of \vec{x}.
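To make the stochastic readout and the empirical risk concrete, here is a purely classical mock-up I wrote for myself, not the paper's circuit: `measure_once` is a hypothetical stand-in for a single run of the quantum classifier on one input, and the readout biases below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_prob_plus(measure_once, n_shots=1000):
    # Repeat the stochastic readout (a callable returning +1 or -1) and
    # return the empirical frequency of the +1 outcome.
    shots = np.array([measure_once() for _ in range(n_shots)])
    return np.mean(shots == 1)

def empirical_risk(prob_plus, true_labels):
    # R_emp: average over the training set of the estimated probability
    # that the assigned label differs from the true label.
    prob_wrong = np.where(true_labels == 1, 1.0 - prob_plus, prob_plus)
    return prob_wrong.mean()

# Toy stand-in for the quantum readout: one biased coin per training point.
true_labels = np.array([1, -1, 1, -1])
readout_bias = [0.9, 0.2, 0.7, 0.4]   # hypothetical Pr(+1) for each point
prob_plus = np.array([
    estimate_prob_plus(lambda p=p: rng.choice([1, -1], p=[p, 1 - p]))
    for p in readout_bias
])
print(empirical_risk(prob_plus, true_labels))
```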

They use two main approaches. One works directly in the space spanned by the \phi(\vec{x}) at input points and finds a separating hyperplane in the quantum Hilbert space directly; this is what they call the variational algorithm. The other approach uses the canonical Aronszajn features, i.e. the mapping of an input point \vec{x} to the function \langle\phi(\vec{x}), \cdot\rangle, and constructs a separating hyperplane as a linear combination of the functions induced by the training points.
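For intuition on the second, kernel-based approach: once the device has estimated the Gram matrix of overlaps between feature states, the classical part is just a support vector machine with a precomputed kernel. The sketch below is a classical stand-in with an arbitrary placeholder feature map `phi` and a squared-overlap kernel, not the circuit or feature map from the paper.

```python
import numpy as np
from sklearn.svm import SVC

# Classical stand-in: the quantum device would estimate the Gram matrix entries
# as overlaps of the quantum feature states; here an arbitrary placeholder
# feature map phi and a squared inner product are used instead.
def phi(x):
    return np.array([x[0], x[1], np.cos(np.pi * x[0]) * np.cos(np.pi * x[1])])

def gram(Xa, Xb):
    Fa = np.array([phi(x) for x in Xa])
    Fb = np.array([phi(x) for x in Xb])
    return (Fa @ Fb.T) ** 2   # squared overlap, mimicking a fidelity-type kernel

rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=(40, 2))
y_train = np.where(X_train[:, 0] * X_train[:, 1] > 0, 1, -1)
X_test = rng.uniform(-1, 1, size=(10, 2))

clf = SVC(kernel="precomputed").fit(gram(X_train, X_train), y_train)
y_pred = clf.predict(gram(X_test, X_train))
```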

Personally, I would think that the variational approach makes much more sense in the mid to long term, as the canonical approach does not yield a more powerful method but induces large memory costs for large datasets. On the other hand, I’m completely lost when thinking about how to invert a matrix in the quantum feature space or compute solutions of systems of linear equations there.

Overall, I think that this is a very interesting path to follow and am keen on finding out how quantum computing and machine learning/statistics might combine in beneficial ways.

Approximation beats concentration?

This paper by Mikhail Belkin from COLT 2018 (arXiv) takes a hard look at the main source of error for statistical problems solved using RKHS methodology, finding, perhaps unsurprisingly, that approximation-theoretic results provide much sharper error bounds than the typical statistical concentration results with rate \mathcal{O}({1}/{\sqrt{n}}).

For a class of radial kernels, the paper shows nearly exponential decay of covariance operator eigenvalues and of function coordinates with respect to the eigenbasis. The radial kernel class is characterized by rather obscure (to me) conditions, but includes Gaussian, inverse multiquadric and Cauchy kernels. Notably, Belkin doesn’t mention Laplace kernels here, which seem to be his favorite. His main result, in my mind, is contained in Theorem 2 and bounds the i-th eigenvalue of the covariance operator (and shared by the kernel matrix) as \lambda_{i} \leq \sqrt{\kappa} C^{\prime} \exp \left(-C i^{1 / d}\right), where d is the dimension of the problem, C^{\prime}, C > 0 are constants and \kappa is a constant depending on the kernel and the domain.
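To get a feeling for what such a bound looks like numerically, here is a small sketch of my own (with eyeballed constants, not those of the theorem) comparing the eigenvalues of a Gaussian kernel matrix with a curve of the form C' \exp(-C i^{1/d}):

```python
import numpy as np

# Eigenvalues of a Gaussian kernel matrix (divided by n, as a proxy for the
# covariance operator spectrum) next to a nearly exponential curve
# C' * exp(-C * i**(1/d)). The constants below are eyeballed, not the paper's.
rng = np.random.default_rng(0)
n, d = 500, 2
X = rng.uniform(size=(n, d))

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * 0.5 ** 2))            # Gaussian kernel, bandwidth 0.5
eigvals = np.sort(np.linalg.eigvalsh(K / n))[::-1]

i = np.arange(1, n + 1)
bound = 1.0 * np.exp(-2.0 * i ** (1.0 / d))       # hypothetical C' = 1, C = 2
print(np.column_stack([eigvals[:10], bound[:10]]))
```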

The implications of this are many. First, the number of samples needs to be nearly exponential in the eigenvalue index for concentration bounds to be sharper than these approximation-theory-based bounds. Second, approximation theory bounds need no iid assumptions on the samples. Third, they are, according to the author, independent of the reference measure of the L2-space, which is at the same time the measure used for the covariance operator. While the latter is true in principle, I want to point out that the null sets of the reference measure are the only thing that does matter, as they influence the span of the eigenbasis.

For kernel conditional density operators, the nearly exponential eigenvalue decay contains a mixed message. The good news is that we now know better what to expect for the conditioning of the non-stochastic inverse problem. But slow eigenvalue decay would be preferable, which we definitely do not have for the class of kernels Belkin considers. The hope then is that one can get a handle on the constants in the bound, C^{\prime}, C, \kappa and d, to make the problem as well-posed as possible.
On the other hand, Theorem 3 shows that a function f can be approximated very well by the first few eigenfunctions: writing f=\sum_i a_{i} e_{i} with a_{i}=\left\langle\mathcal{R}_{\mu} f, e_{i}\right\rangle_{L_{\mu}^{2}}, where the e_i are the eigenfunctions, the coefficients satisfy the decay condition \left|a_{i}\right| \leq \sqrt{\lambda_{i}}|f|_{\mathcal{H}}<C^{\prime} \exp \left(-C i^{1 / d}\right)|f|_{\mathcal{H}}. That sounds like spectacular news, actually.

Overall, the paper is a very interesting and enlightening read. Obviously, I am still undecided about what the results actually mean for some kernel methods.

Kernel conditional density operators

We recently submitted and arXived our new work on (conditional) density reconstruction methods using the reproducing kernel Hilbert space (RKHS) formalism. This is work with relentless PhD candidate Mattes Mollenhauer, Stefan Klus and Krikamol Muandet. It seeks to provide theory for an idea that came up over the course of working on eigendecompositions of RKHS operators and was catalysed by a characterization of the kernel mean embedding (KME) object, aka the classical kernel density estimate (KDE)/Parzen window estimate, that Stefan derived in another paper.

The key insight that we started out with is the following. Let \mu_\mathbb{P} = \int k(x,\cdot) \mathrm{d}\mathbb{P}(x) be the KME/KDE functional for a density p (with respect to reference measure \rho) of distribution \mathbb{P}, and assume that this density is in the RKHS. Then \mu_\mathbb{P} = C_\rho p, where C_\rho is the covariance operator with respect to the reference measure. Thus we can recover p by solving the inverse problem, i.e. in terms of the pseudoinverse p = C^\dagger_\rho \mu_\mathbb{P}.
This is stated in Fukumizu's kernel Bayes' rule paper for the general case of Radon-Nikodym derivatives, and later on in our work, though we didn’t realize the connection at first. To be more precise: if the reference measure is Lebesgue, we estimate densities in the usual sense; in general, Radon-Nikodym derivatives are estimated, i.e. p = \mathrm{d}\mathbb{P}/\mathrm{d}\rho.

Estimated density (green) starting from the kernel mean embedding (red); ground truth in black.

A first impression of the resulting estimates is given above. When choosing Lebesgue measure as the reference, we can estimate the density in the usual sense as given by the green curve using 200 samples from the true black density. Using a Tikhonov approach, we prove dimensionality-independent bounds on the stochastic error of this reconstruction depending mostly on the norm of the KME/KDE function and the number of samples.
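As a rough illustration of the reconstruction idea, here is a discretised toy version I put together for this post (not the estimator analysed in the paper): approximate C_\rho on a grid with Lebesgue reference measure and apply a Tikhonov-regularised solve to the empirical KME.

```python
import numpy as np

# Discretised toy version of the reconstruction: rho = Lebesgue on [-4, 4],
# Gaussian kernel, Tikhonov-regularised solve of C_rho p = mu_P on a grid.
def k(a, b, sigma=0.4):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
samples = rng.normal(size=200)                    # data from P = N(0, 1)

grid = np.linspace(-4, 4, 400)
dx = grid[1] - grid[0]

mu_hat = k(grid, samples).mean(axis=1)            # empirical KME evaluated on the grid
C = k(grid, grid) * dx                            # discretised covariance operator C_rho
lam = 1e-2                                        # Tikhonov regularisation parameter
p_hat = np.linalg.solve(C + lam * np.eye(len(grid)), mu_hat)

print((p_hat * dx).sum())                         # total mass, roughly 1 if all went well
```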

The real power of this idea, however, lies in conditional density estimation. While the venerable Gaussian process (GP) already exists as a kernel-based method in this field, our method offers several advantages. Among them are off-the-shelf modelling of arbitrary output dimensions and multiple modes, whereas standard GPs can only handle one output dimension and a single mode. In the example below, this is demonstrated on a toy problem based on a ring with Gaussian noise in 3d. Both capabilities are on display: multimodal and multidimensional output that easily captures correlations between the dimensions.

Gaussian donut. Samples, conditional densities and their estimates at x = 0 and x = 1.

And as no paper would be complete without deep learning, we compare the resulting conditional density operator (CDO) to a neural-network-based conditional density model using RealNVP. We find that the CDO outperforms RealNVP by orders of magnitude, although the latter was tuned for quite some time by colleagues at Zalando Research for the dataset used.

Fingers crossed for the review process and thanks to my wonderful coauthors.

To understand deep learning we need to understand kernel learning

This is a most interesting ICML paper by Belkin, Ma and Mandal. Its starting point (as in seemingly many recent papers by Belkin) is the observation that statistical models generalise robustly even for interpolating solutions, i.e. solutions with zero training error, an observation previously made for deep learning by Recht, Bengio and coauthors. The authors find that this phenomenon persists if we change the model class from deep networks to kernel machines. In particular, the non-smooth Laplacian kernel can fit random training labels in a classification problem.

On the other hand, the authors provide bounds for smooth kernels that diverge with increasing data. To arrive at this conclusion, they show that the RKHS norm of interpolating solutions diverges with the number of data points (they call interpolating solutions “overfitted”, which I think is not the best term). This implies divergence of the generalisation bounds as well, as these depend on the norm of the solution.

The assumption needed for the theoretical results is that the label noise of the training data is nonzero. Experimentally, the paper checks how interpolating kernel classifiers perform in this case, using synthetic datasets with known label noise and real-world datasets with added noise. Here, the empirical test performance of the interpolating kernel classifier is slightly below the noise level, showing near-optimal performance. This parallels the performance of deep networks.
The paper concludes that a deep architecture, as such, is not responsible for the good test performance of zero-training-error solutions, and, since almost all bounds for RKHS methods depend on the norm of the solution, that a new theory is needed for these zero-training-error regimes. A complementary finding is that optimisation for fitting random labels is more demanding with Gaussian kernels than with non-smooth Laplacian kernels.

The authors conjecture that inductive bias, rather than regularisation, is particularly important in deep learning and kernel methods. Concretely, for kernel methods, the span of the canonical features at the data points k(x_1, \cdot), \dots, k(x_n, \cdot) contains the interpolating solution that has minimum RKHS norm. As existing generalization bounds depend on this norm, it’s obvious that this inductive bias is advantageous.
I agree with the authors that (frequentist) kernel methods are a good place to start analysing minimum-norm solutions, as here they are often available in closed form, unlike in (frequentist) deep learning.
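As a concrete illustration (my own sketch, not the authors' code), here is the minimum-norm interpolant for a Laplacian kernel and random labels, together with its squared RKHS norm y^\top K^{-1} y, the quantity whose growth with n drives the divergence of the norm-based bounds discussed above.

```python
import numpy as np

# Minimum-RKHS-norm interpolant for random labels with a Laplacian kernel:
# f = sum_i alpha_i k(x_i, .), alpha = K^{-1} y. Its squared RKHS norm is
# alpha^T K alpha = y^T K^{-1} y.
rng = np.random.default_rng(0)

def laplacian_kernel(A, B, bandwidth=1.0):
    dists = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    return np.exp(-dists / bandwidth)

n, d = 300, 10
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)               # random labels

K = laplacian_kernel(X, X)
alpha = np.linalg.solve(K, y)                     # interpolating (zero training error) solution
print("training error:", np.mean(np.sign(K @ alpha) != y))
print("squared RKHS norm:", float(y @ alpha))
```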

Linear predictors for nonlinear dynamical systems: Koopman operator meets model predictive control

This arXival by the dynamical duo Korda & Mezić deals with optimal control for a dynamical system when no forward differential equations are known. Instead, we are given observation/sensor data x_t capturing the system's state (i.e. the observed state in HMM lingo, or observables in dynamical systems lingo), the control input u_t and the observed state at the following time index, x_{t+1}.

The basic idea is to build a Koopman operator that captures the dynamics of the observed data. In the simplest case, this operator is just a linear map, i.e. matrices A, B satisfying [x_{t+1}^\top, u_{t+1}^\top]^\top = [A,B]\, [x_{t}^\top, u_t^\top]^\top, where A encodes the dependence of the next state upon the current state and B upon the control. Already, this formula reveals a slight inelegance – to stay in Koopman operator world, the control input at t+1 is predicted as well. This does not make much sense, but so be it.

Now the trick to approximate nonlinear dynamics is well known to statisticians and ML people: rather than finding a mapping between x_t and x_{t+1}, we find a mapping between \phi(x_{t}) and \phi(x_{t+1}), where \phi is some nonlinear map to \mathbb{R}^{d_\mathrm{lift}}, called a lifting function in the paper. To fit a model of the nonlinear system dynamics, we then only need to fit the linear maps A, B to the training data. The loss in the paper is based on a Frobenius or squared Euclidean norm and is of the form

\min_{A,B} \left\| [\phi(x_{2}), \dots, \phi(x_{T})] - A [\phi(x_{1}), \dots, \phi(x_{T-1})] - B[u_{1}, \dots, u_{T-1}] \right\|

For the Frobenius norm, the analytical solution is

[A, B] = [\phi(x_{2}), \dots, \phi(x_{T})] \begin{bmatrix} \phi(x_{1}) & \dots & \phi(x_{T-1}) \\ u_{1} & \dots & u_{T-1} \end{bmatrix}^{-1}

where the inverse is the pseudoinverse, of course. Up to here, the paper was all about approximating a forward model from data in the case where no a priori closed-form DE model is given (a toy version of this fit is sketched below).
Now the idea for using this for control is to define a convex cost function J that includes both the cost of the system deviating from the target state and the cost of the control inputs. As the cost is convex, a global optimum can be attained.
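Here is a minimal sketch of the lifted least-squares fit mentioned above. It is my own toy version with a made-up lifting function and system, not the authors' code.

```python
import numpy as np

# Toy version of the lifted least-squares fit: invented lifting function and
# system dynamics, [A, B] recovered via a pseudoinverse as in the formula above.
rng = np.random.default_rng(0)

def phi(x):
    # hypothetical lifting: the state itself plus a few nonlinear features
    return np.concatenate([x, np.sin(x), x ** 2])

def f(x, u):
    # some nonlinear dynamics standing in for the unknown true system
    return np.array([0.9 * x[0] + 0.1 * x[1] ** 2, 0.8 * x[1] + u[0]])

T = 500
xs, us = [rng.normal(size=2)], []
for _ in range(T - 1):
    us.append(rng.uniform(-1, 1, size=1))
    xs.append(f(xs[-1], us[-1]))

Phi_next = np.column_stack([phi(x) for x in xs[1:]])    # [phi(x_2), ..., phi(x_T)]
Phi_curr = np.column_stack([phi(x) for x in xs[:-1]])   # [phi(x_1), ..., phi(x_{T-1})]
U = np.column_stack(us)                                 # [u_1, ..., u_{T-1}]

AB = Phi_next @ np.linalg.pinv(np.vstack([Phi_curr, U]))
A, B = AB[:, :Phi_curr.shape[0]], AB[:, Phi_curr.shape[0]:]

lifted_pred = A @ phi(xs[0]) + B @ us[0]                # one-step prediction in the lifted space
```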

The theoretical analysis mainly shows that the approximated Koopman operator converges to the actual infinite-dimensional Koopman operator.

While the extension to stochastic systems is only mentioned, I think this is an interesting paper following an approach that seems particularly promising for what's called reinforcement learning in the deep/machine learning community. It seems that many reinforcement learning papers do not use a proper optimisation procedure for the control input (it's just random search). Fitting a model of the system's dynamics and its reaction to control input instead, as the current paper does, enables the use of well-known gradient-based optimisation methods and seems like a good idea to me.

Three PhD positions in Lille

My colleague and friend Victor Elvira and his collaborator Francois Septier have three vacant PhD positions in Lille.

The first one is a cooperation with industry: the objective is to predict the evolution of the outputs in the supply chain of the industry partner, i.e. quantities in locations of storage or sales. The thesis will be about mathematical modeling of the supply chain and inference methods for prediction.
The two other positions actually pertain to a recent and an older interest of mine, dynamical systems and importance sampling. You will study importance sampling based methods for probabilistic inference in complex non-linear high-dimensional systems. More specifically, you will work on novel adaptation schemes in order to overcome current limitations of more traditional IS-based techniques in such a challenging context. (Apply fast, or else I might!)

PostDoc at Imperial College London with Sarah Filippi

My wonderful colleague Sarah Filippi has a postdoctoral opening in statistics to work with her at Imperial College. The topic is Bayesian Nonparametric Statistics for Conditional Independence Tests and Causal Inference, and the start date is October 2018. Details about the position are available at https://www.imperial.ac.uk/jobs/description/NAT00152/research-associate-statistics/

I can highly recommend both Sarah as a researcher and this most interesting and important topic.

Variational inference with Normalizing Flows

I keep coming back to this ICML 2015 paper by Rezende and Mohamed (arXiv version). While this is not due to the particular novelty of the paper's contents, I agree that the suggested approach is very promising for any inference approach, be it VI or adaptive Monte Carlo. The paper adopts the term normalizing flow to refer to the plain old change-of-variables formula for integrals, with the minor change of view that one can see this as a flow, and with the correct but slightly alien reference to a flow defined by the Langevin SDE or Fokker-Planck equation, both attributed only to the ML/stats literature in the paper.
The theoretical contribution feels a little like a strawman: it simply states that, as Langevin and Hamiltonian dynamics can be seen as an infinitesimal normalizing flow, and both approximate the posterior when the step size goes to zero, normalizing flows can approximate the posterior arbitrarily well. This is of course nothing that was derived in the paper, nor is it news. Nor does it say anything about the practical approach suggested.
The invertible maps suggested do have practical merit, however: they allow “splitting” a mode into two, via what is called the planar transformation (plotted on the right of the image), as well as “attracting/repulsing” probability mass around a point. The Jacobian correction for both invertible maps can be computed in time that is linear in the number of dimensions.
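For reference, here is a small numpy sketch of the planar transformation and its log-det-Jacobian as I understand it from the paper; the parameter values below are arbitrary.

```python
import numpy as np

# Planar transformation f(z) = z + u * tanh(w^T z + b) and its log-det-Jacobian
# log|1 + u^T psi(z)| with psi(z) = tanh'(w^T z + b) * w, which is O(d) per
# sample. (Invertibility needs a constraint like w^T u >= -1, omitted here.)
def planar_flow(z, w, u, b):
    # z: (n, d) batch of samples, w and u: (d,), b: scalar
    lin = z @ w + b                                 # (n,)
    f_z = z + np.outer(np.tanh(lin), u)             # transformed samples, (n, d)
    psi = np.outer(1.0 - np.tanh(lin) ** 2, w)      # (n, d)
    log_det = np.log(np.abs(1.0 + psi @ u))         # (n,)
    return f_z, log_det

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 2))                      # samples from the base density
f_z, log_det = planar_flow(z, w=np.array([2.0, 0.0]), u=np.array([1.0, 1.0]), b=0.0)
# change of variables: log q_1(f(z)) = log q_0(z) - log_det
```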