# Supervised learning with quantum enhanced feature spaces

This letter to Nature (also on arXiv) by Havlicek and coauthors deals with gaining a quantum computing advantage for classification problems and is written by quantum physicists. The main reason a friend brought this to my attention is that the classification problem is solved using support vector machines, thus fitting my recent interest in reproducing kernel Hilbert space (RKHS) methods.
The main idea is that the way numerical data is encoded into a quantum state can result in a nonlinear feature map into the very high dimensional quantum Hilbert space. Consecutive computation in the quantum Hilbert space then induces nonlinear methods in the input space. In this case nonlinear classification.

The main contributions over papers that are easier to read coming from an RKHS background (such as Quantum machine learning in feature Hilbert spaces by Schuld et al) are twofold. For one, Havlicek and coauthors use a feature map that does not result in a trivial/useless RKHS. Specifically they propose to use two layers of a diagonal gate and a Hadamard gate and conjecture that this gives a quantum advantage (while a single layer can be simulated classically). I am quite lost here of course without any background in quantum computing. Second, their classification algorithm was implemented and run on an actual quantum computer, rather than simulated on standard computers. In particular, they use five superconducting transmons, which seems to be a type of qubit implementation and allows for quantum coupling.

The classification problem they tackle is a toy problem that they construct so as to be perfectly separable with their classification algorithm, which is of course a good sanity check for this first step of developing actual quantum machine learning. The decision of the algorithm for any of the two classes however cannot be read from the computing device deterministically, but only stochastically. The solution, seemingly common in quantum computing, is to read out the class repeatedly to obtain samples and compute their empirical average.
The training then consists of optimizing a bias variable and a parameter $\theta$ that determines the feature map (unless I’m mistaken!) so as to minimize the empirical missclassification error for the training dataset $R_{\mathrm{emp}}(\vec{\theta})=\frac{1}{|T|} \sum_{\vec{x} \in T} \mathrm{Pr}(\tilde{m}(\vec{x}) \neq m(\vec{x}))$, where $\tilde{m}(\vec{x})$ is the label assigned by the algorithm and $m(\vec{x})$ is the actual label of $\vec{x}$.

They use two main approaches, one working directly in the space spanned by the $\phi(\vec{x})$ at input points and finding a separating hyperplane in the quantum hilbert space directly. This is what they call the variational algorithm. The other approach uses the canonical Aronszajn features, i.e. the mapping of an input point $\vec{x}$ to the function $\langle\phi(\vec{x}), \cdot\rangle$, and then construct a separating hyperplane as a linear combination of the functions induced by the training points.

Personally, I would think that the variational approach makes much more sense in the mid to long term, as the canonical approach does not yield a more powerful method but induces large memory cost for large datasets. Then on the other hand I’m completely lost when thinking about how to invert a matrix in the quantum feature space or compute solutions of systems of linear equations.

Overall, I think that this is a very interesting path to follow and am keen on finding out how quantum computing and machine learning/statistics might combine in beneficial ways.

# Variational inference with Normalizing Flows

I keep coming back to this ICML 2015 paper by Rezende and Mohamed (arXiv version). While this is not due to the particular novelty of the papers contents, I agree that the suggested approach is very promising for any inference approach, be it VI or adaptive Monte Carlo. The paper adopts the term normalizing flow for refering to the plain old change of variables formula for integrals. With the minor change of view that one can see this as a flow and the correct but slightly alien reference to a flow defined by the Langevin SDE or Fokker-Planck, both attributed only to ML/stats literature in the paper.
The theoretical contribution feels a little like a strawman: it simply states that, as Langevin and Hamiltonian dynamics can be seen as an infinitesimal normalizing flow, and both approximate the posterior when the step size goes to zero, normalizing flows can approximate the posterior arbitrarily well. This is of course nothing that was derived in the paper, nor is it news. Nor does it say anything about the practical approach suggested.
The invertible maps suggested have practical merit however, as they allow “splitting” of a mode into two, called the planar transformation (and plotted on the right of the image), as well as “attracting/repulsing” probability mass around a point. The Jacobian correction for both invertible maps being computable in time that is linear in the number of dimensions.

# Operator Variational Inference

This NIPS 2016 paper by Ranganath et al. is concerned with Variational Inference using objective functions other than KL-divergence between a target density $\pi$ and a proposal density $q$. It’s called Operator VI as a fancy way to say that one is flexible in constructing how exactly the objective function uses $\pi, q$ and test functions from some family $\mathcal{F}$. I completely agree with the motivation: KL-Divergence in the form $\int q(x) \log \frac{q(x)}{\pi(x)} \mathrm{d}x$ indeed underestimates the variance of $\pi$ and approximates only one mode. Using KL the other way around, $\int \pi(x) \log \frac{pi(x)}{q(x)} \mathrm{d}x$ takes all modes into account, but still tends to underestimate variance.

As a particular case, the authors suggest an objective using what they call the Langevin-Stein Operator which does not make use of the proposal density $q$  at all but uses test functions exclusively. The only requirement is that we be able to draw samples from the proposal. The authors claim that assuming access to $q$ limits applicability of an objective/operator. This claim is not substantiated however. The example they give in equation (10) is that it is not possible to find a Jacobian correction for a certain transformation of a standard normal random variable $\epsilon \sim \mathcal{N}(0,I)$  to a bimodal distribution. However their method is not the only one to get bimodality by transforming a standard normal variable and actually the Jacobian correction can be computed even for their suggested transformation! The problem they encounter really is that they throw away one dimension of $\epsilon$, which makes the tranformation lose injectivity. However by not throwing the variable away, we keep injectivity and it is possible to compute the density of the transformed variables. The reasons for not accessing the density $q$ I thus find rather unconvincing.

To compute expectations with respect to $q$, the authors suggest Monte Carlo sums, where every summand uses an evaluation of $\pi$ or its gradient. As that is the most computationally costly part in MCMC and SMC often times, I am very curious whether the method performs any better computationally than modern adaptive Monte Carlo methods.

# Overdispersed Black-Box Variational Inference

This UAI paper by Ruiz, Titsias and Blei presents important insights for the idea of a black box procedure for VI (which I discussed here). The setup of BBVI is the following: given a target/posterior $\pi$ and a parametric approximation $q_\lambda$, we want to find

$\mathrm{argmin}_\lambda \int \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) q_\lambda(x) \mathrm{d}x$

which can be achieved for any $q_\lambda$ by estimating the gradient

$\nabla_\lambda \int \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) q_\lambda(x) \mathrm{d}x$

with Monte Carlo Samples and stochastic gradient descent. This works if we can easily sample from $q_\lambda$  and can compute its derivative wrt $\lambda$ in closed form. In the original paper, the authors suggested the use of the score function as a control variate and a Rao-Blackwellization. Both where described in a way that utterly confused me – until now, because Ruiz, Titsias and Blei manage to describe the concrete application of both control variates and Rao-Blackwellization in a very transparent way. Their own contribution to variance reduction (minus some tricks they applied) is based on the fact that the optimal sampling distribution for estimating $\nabla_\lambda \int \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) q_\lambda(x) \mathrm{d}x$ is proportional to $\left | \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) \right | q_\lambda(x)$ rather than exactly $q_\lambda(x)$. They argue that this optimal sampling distribution is considerably heavier tailed than $q_\lambda(x)$. Their reasoning is mainly that the norm of the gradient (which is essentially $(\nabla_\lambda q_\lambda) \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right ) = q_\lambda(x)(\nabla_\lambda \log q_\lambda(x)) \log \left ( \frac{\pi(x)}{q_\lambda(x)} \right )$)  vanishes for the modes, making that region irrelevant for gradient estimation. The same should be true for the tails of the distribution I think. Overall very interesting work that I strongly recommend reading, if only to understand the original Blackbox VI proposal.

# A Variational Analysis of Stochastic Gradient Algorithms

This arXival by Mandt, Hoffman and Blei from February takes a look at what they call SGD with constant learning rate or constant SGD. Which is not really Stochastic Gradient Descent anymore, but rather a sampling algorithm, as we should bounce  around the mode of the posterior rather than converging to it. Consequently they interpret constant SGD as a stochastic process in discrete time with some stationary distribution. They go on to analyse it under the assumption that the stationary distribution is Gaussian (Assumption 1 and 4). Something that springs to my mind here is the following: even if we accept that the stationary distribution of the process might be Gaussian (intuitively, not rigorously), what guarantees that the stationary distribution is the posterior? Even if the posterior is also gaussian, it could have a different (co)variance. I think a few more words about this question might be good. In particular, without adding artificial noise, is it guaranteed that every point in the posteriors domain can be reached with non-zero probability? And can be reached an infinite number of times if we collect infinite samples? If not, the constant SGD process is transient and does not enjoy good stability properties.

Another delicate point hides in Assumption 1: As our stochastic gradient is the sum of independent RVs, the CLT is invoked to assume that the gradient noise is Gaussian. But the CLT might not yet take effect if the number of data points in our subsample is small, which is the very thing one would like in order to have advantages in computation time. This of course is yet another instance of the dilemma common to all scalable sampling techniques, and at this stage of research this is not a show stopper.
Now assuming that constant SGD is not transient and it has a stationary distribution that is the posterior and that the posterior or its dominant mode is approximately Gaussian, we can of course try to optimize the parameters of constant SGD to make the sample it produces as close a possible to a posterior sample. Which the authors conveniently do using an Ornstein-Uhlenbeck approximation, which has a Gaussian stationary distribution. Their approach is to minimize KL-Divergence between the OU approximation and the posterior, and derive optimal parameter settings for constant SGD in closed form. The most interesting part of the theoretical results section probably is the one on optimal preconditioning in Stochastic Gradient Fisher Scoring, as that sampler adds artificial noise. Which might solve some of the transience questions.

The presentation would gain a lot by renaming “constant SGD” for what it is – a sampling algorithm. The terminology and notation in general are sometimes a bit confusing, but nothing a revision can’t fix. In general, the paper presents an interesting approach to deriving optimal parameters for a class of algorithms that is notoriously hard to tune. This is particularly relevant because the constant step size could improve mixing considerably over Welling & Tehs SGLD. What would be interesting to see is wether the results could be generalized to the case where the posterior is not Gaussian. For practical reasons, because stochastic VB/EP works quite well in the gaussian case. For theoretical reasons, because EP now even comes with some guarantees (haven’t read that paper admittedly). Maybe a path would be to take a look at the Gaussian approximation to a multimodal posterior spanning all modes, and minimizing KL-Divergence between that and the OU process. Or maybe one can proof that constant SGD (with artificial noise?) has some fixed stationary distribution to which stochastic drift term plus noise are a Markov Kernel, which might enable a pseudo-marginal correction.

# Variational Hamiltonian Monte Carlo via Score Matching

This is an arXival (maybe by now ICML paper?) by Zhang, Shahbaba and Zhao. It suggest fitting an emulator/surrogate $q_\eta$ to the target density $\pi$ via score matching and then use the gradients of $q_\eta$ rather than those of $\pi$ for generating proposal with HMC. Their main selling point being that this decreases wall-clock time for HMC.

However, that seems to me like reselling the earlier paper by Heiko Strathmann and collaborators on Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families. Zhang et al use basically the same idea but fail to look at this earlier work. Their reasoning to use a neural network rather than a GP emulator (computation time) is a bit arbitrary. If you go for a less rich function class (neural network) then the computation time will go down of course – but you would get the same effect by using GPs with inducing points.

Very much lacking to my taste is reasoning for doing the tuning they do. Sometimes they tune HMC/Variational HMC to 85% acceptance, sometimes to 70% acceptance. Also, it seems they not adapting the mass matrix of HMC. If they would, I conjecture the relative efficiency of Standard HMC vs Variational HMC could change drastically. Details on how they tune SGLD is completely lacking.

Overall, I think it is not yet clear what can be learned from the work reported in the paper.

# Black Box Variational Inference

My friend Patrick Jähnichen asked me to read this AISTATS 2014 paper by Ranganath, Gerrish and Blei. For somebody using mainly variational inference, I guess the prospect of not having to derive an individual algorithm for each new model is very appealing (while for poor sampling Ingmar, it would be very appealing to scale like VI does).

However, I found the paper hard to read and unclear in its conclusions. For the Rao-Blackwellization procedure it is unclear what is conditioned upon in the main paper. I guess it’s the variational parameters of all other components. Similarly, I’m not sure where the first equality in equation (13) stems from, which is crucial for the Control Variate ansatz. These things may just be me.

More problematic I think is their evaluation, especially comparing to MH-within-Gibbs. This is a particularly bad sampling scheme, and I don’t even suggest to use HMC instead. One could easily use a simple Adaptive Metropolis algorithm and still be black box. Also, Rao-Blackwellization can be applied in this pure adaptive MCMC setting as well. Another conservative and black-box possibility would be adaptive SMC by Fearnhead. My intuition is that both variants might actually be better than the suggested Black Box VI or at least the difference in the one performance plot of the paper will barely be noticeable.

However, I think this direction of research is a useful one and I hope they can improve upon the current state.