# Approximation beats concentration?

This paper by Mikhail Belkin from COLT 2018 (arXiv) takes a hard look at the main source of error for statistical problems solved using RKHS methodology. Unsurprisingly finding that approximation-theoretical results provide much sharper error bounds than the typical statistical concentration results with rate $\mathcal{O}({1}/{\sqrt{n}})$.

For a class of radial kernels, the paper shows nealy exponential decay of covariance operator eigenvalues and of function coordinates with respect to the eigenbasis. The radial kernel class is characterized by rather obscure (to me) conditions, but includes Gaussian, inverse multiquadric and Cauchy kernels. Notably, Belkin doesn’t mention Laplace kernels here, which seem to be his favorite. His main result, in my mind, is contained in theorem 2 and bounds the $i$th eigenvalue of the covariance operator (and shared by the kernel matrix) as $\lambda_{i} \leq \sqrt{\kappa} C^{\prime} \exp \left(-C i^{1 / d}\right)$, where $d$ is the dimension of the problem, $C^{\prime} C> 0$ are constants and $\kappa$ is a constant depending on the kernel and the domain.

The implications of this are many. First, the number of samples needs to be nearly exponantial in the eigenvalue index for concentration bounds to be sharper than these approximation theory based bounds. Second, approximation theory bounds need no iid assumptions on the samples. Third, they are, according to the author, independent of the reference measure of the L2-space, which is at the same time the measure used for the covariance operator. While the latter is true in principle, I want to hint at the fact that the null sets of the reference measure is the only thing that does matter, as it influences the span of the eigenbasis.

For Kernel conditional density operators, the nearly exponetial eigenvalue decay contains a mixed message. Good news is that we now know better what to expect for the conditioning of the non-stochastic inverse problem. But slow eigenvalue decay would be good, which we definitely do not have for the class of kernels Belkin considers. The hope then is that one can get a handle on the constants in the bound, $C^{\prime} C, \kappa$ and $d$ to make the problem as well-posed as possible.
On the other hand, Theorem 3 shows that a function $f$ can be approximated very well by the first few eigenfunctions, or in other words $f=\sum a_{i} e_{i}, a_{i}=\left\langle\mathcal{R}_{\mu} f, e_{i}\right\rangle_{L_{\mu}^{2}}$ where $e_i$ are the eigenfunctions, satisfies the decay condition $\left|a_{i}\right| \leq \sqrt{\lambda_{i}}|f|_{\mathcal{H}}. Which sounds like spectacular news, actually.

Overall the paper is a very interesting read and very enlightening. Obviously, I am still undecided of what the results actually mean for some kernel methods.

# Variational inference with Normalizing Flows

I keep coming back to this ICML 2015 paper by Rezende and Mohamed (arXiv version). While this is not due to the particular novelty of the papers contents, I agree that the suggested approach is very promising for any inference approach, be it VI or adaptive Monte Carlo. The paper adopts the term normalizing flow for refering to the plain old change of variables formula for integrals. With the minor change of view that one can see this as a flow and the correct but slightly alien reference to a flow defined by the Langevin SDE or Fokker-Planck, both attributed only to ML/stats literature in the paper.
The theoretical contribution feels a little like a strawman: it simply states that, as Langevin and Hamiltonian dynamics can be seen as an infinitesimal normalizing flow, and both approximate the posterior when the step size goes to zero, normalizing flows can approximate the posterior arbitrarily well. This is of course nothing that was derived in the paper, nor is it news. Nor does it say anything about the practical approach suggested.
The invertible maps suggested have practical merit however, as they allow “splitting” of a mode into two, called the planar transformation (and plotted on the right of the image), as well as “attracting/repulsing” probability mass around a point. The Jacobian correction for both invertible maps being computable in time that is linear in the number of dimensions.

# A non-parametric ensemble transform method for Bayesian inference

This 2013 paper by Sebastian Reich in the Journal on Scientific Computing introduces an approach called the Ensemble Transport Particle Filter (ETPF). The main innovation of  ETPF, when compared to SMC-like filtering methods, lies in the resampling step. Which is

1.  based on an optimal transport idea and
2. completely deterministic.

No rejuvenation step is used, contrary to the standard in SMC. While the notation is unfamiliar to me, coming from an SMC background, I’ll adopt it here: by $\{x_i^f\}_{i=1}^M$ denote samples from the prior with density $\pi_f$ (the $f$, meaning forecast, is probably owed to Reich having done a lot of Bayesian weather prediction). The idea is to transform these into samples $\{x_i^a\}_{i=1}^M$ that follow the posterior density $\pi_a$ (the $a$ meaning analyzed), preferably without introducing unequal weights. Let the likelihood term be denoted by $p(y|x)$ where $y$ is the data and let $w_i = \frac{p(y|x_i^f)}{\sum_{j=1}^M p(y|x_i^f)}$ be the normalized importance weight. The normalization in the denominator stems from the fact that in Bayesian inference we can often only evaluate an unnormalized version of the posterior $\pi_a(x) \propto p(y|x) \pi_f(x)$.

Then the optimal transport idea enters. Given the discrete realizations $\{x_i^f\}_{i=1}^M$, $\pi_f$ is approximated by assigning the discrete probability vector $p_f = (1/M,\dots,1/M)^\top$, while $\pi_a$ is approximated by the probability vector $p_a=(w_1,\dots, w_m)^\top$. Now we construct a joint probability between the discrete random variables distributed according to $p_f$ and those distributed according to $p_a$, i.e. a matrix $\mathbf{T}$ with non-negative entries summing to 1 which has the column sum $p_f$ and row sum $p_a$ (another view would be that $\mathbf{T}$ is a discrete copula which has prior and posterior as marginals). Let $\pi_\mathbf{T}(x^f,x^a)$ be the joint pmf induced by $\mathbf{T}$. To qualify as optimal transport, we now seek $\mathbf{T}^* = \mathrm{arg min}_\mathbf{T} \mathbb{E}_{\pi_\mathbf{T}} \|x^f - x^a\|_2^2$ under the additional constraint of cyclical monotonicity. This boils down to a linear programming problem. For a fixed prior sample $x^f_j$ this induces a conditional distribution over the discretely approximated posterior given the discretely approximated prior $\pi_{\mathbf{T}^*}(\cdot|x^f_j) =\pi_{\mathbf{T}^*}(x^f_j, \cdot)/\pi_f(x^f_j) = M \pi_{\mathbf{T}^*}(x^f_j, \cdot)$.

We could now simply sample from this conditional to obtain equally weighted posterior samples for each $j =1,\dots,M$. Instead, the paper proposes a deterministic transformation using the expected value $x^a_j = \mathbb{E}_{\pi_{\mathbf{T}^*}(\cdot|x^f_j)} x^f_i = \sum_{i=1}^M M t^*_{ij} x^f_i$. Reich proves that the mapping $\Psi_M$ induced by this transformation is such that for $X^f \sim \pi_f$, $\Psi_M(X^f) \sim \pi_a$ for $M \to \infty$. In other words, if the ensemble size M goes to infinity, we indeed get samples from the posterior.

Overall I think this is a very interesting approach. The construction of an optimal transport map based on the discrete approximations of prior and posterior is indeed novel compared to standard SMC. My one objection is that as it stands, the method will only work if the prior support covers all relevant regions of the posterior, as taking the expected value over prior samples will always lead to a contraction.

Of course, this is not a problem when M is infinite, but my intuition would be that it has a rather strong effect in our finite world. One remedy here would of course be to introduce a rejuvenation step as in SMC, for example moving each particle  $x^a_j$ using MCMC steps that leave $\pi^a$ invariant.

# Ergodicity of Combocontinuous Adaptive MCMC algorithms

This preprint by Jeff Rosenthal and Jinyoung Yang (currently available from Jeffs webpage) might also be called “Easily verifiable adaptive MCMC”. Jeff Rosenthal gave a tutorial on adaptive MCMC during MCMSki 2016 mentioning this work.  Adaptive MCMC is based on the idea that one can use the information gathered from sampling a distribution using MCMC to improve the efficiency of the sampling process.

If two conditions, diminishing adaptation and containment are satisfied, an adaptive MCMC algorithm is valid in the sense of asymptotically consistent. Diminishing adaptation means that two consecutive Markov Kernels in the algorithm will be asymptotically equal. In other words, we either stop adaptation at some point or we know that the adaptation algorithm converges.
Containment means the number of repeated applications of all used Markov Kernels to get close to the target measure is bounded. Concretely, let $\gamma$ be a Markov kernel index, $P_\gamma^m(x,\cdot)$ be the distribution resulting from m-fold application of kernel $P_\gamma$ starting from $x$ . In other words start MCMC at point x with kernel $P_\gamma$, let it run for m iterations and consider the induced distribution for the last point. Let $\pi$ be the target distribution. Then containment requires that
$\{M_\epsilon(X_n, \Gamma_n)\}_{n=1}^\infty$  is bounded in probability for all $\epsilon > 0$. Here $M_\epsilon(x, \gamma) = \inf \{ m \geq 1 : \| P_\gamma^m(x,\cdot) - \pi(\cdot) \|_\textrm{TV}\}$ and $\|\cdot\|_\textrm{TV}\}$ is a worst case distance between distributions (total variation distance).

The paper is concerned with trying to find conditions for containment in adaptive MCMC that are more easily verified than those from earlier papers. First however it gives a kind of blueprint for adaptive algorithms that satisfy containment.

### A blueprint for consistent adaptive MCMC

Nameley, let $\mathbb{R}^d$ be the support of the target distribution and $K \subseteq \mathbb{R}^d$ some large bounded region, $D > 0$ some large constant. The blueprint, Bounded Adaptive Metropolis, is the following:

Start the algorithm at some $X_0 \in K$ and fix a $d \times d$ covariance matrix $\Sigma_*$. At iteration n generate a proposal $Y_{n+1}$ by

(1)$Y_{n+1} \sim \mathcal{N}(X_n, \Sigma_*)~\textrm{if}~X_n \notin K$
(2)$Y_{n+1} \sim \mathcal{N}(X_n, \Sigma_{n+1})~\textrm{if}~X_n \in K$

Reject if $|Y_{n+1} - X_{n}| > D$, else accept with the usual Metropolis-Hastings acceptance probability. The $\Sigma_{n+1}$ can be chosen almost arbitrarily if the diminishing adaptation condition is met, so either the mechanism of choosing is fixed asymptotically or converges.

It would seem to me that we can actually change the distribution in (2) arbitrarily if we continue to meet diminishing adaptation. So for example we could use an independent metropolis, adaptive Langevin or other sophisticated proposal inside K, so long as condition (e) in the paper is satisfied, i.e. the adaptive proposal distribution used in (2) is continuous in $X_n$. Which leads us to the actual conditions for containment.

### General conditions for containment in adaptive MCMC

Let $\mathcal{X}$ be a general state space. For example in the Bounded Metropolis we had $\mathcal{X}=\mathbb{R}^d$. The conditions the authors give are (even more simplified by me):

(a)  The probability to move more than some finite distance D > 0 is zero: $Pr(|X_{n+1} - X_n| > D) = 0$
(b) Outside of K, the algorithm uses a fixed transition kernel P that never changes (and still respects that we can at most move D far away)
(c) The fixed kernel P is bounded above by $P(x, dy) \leq M \mu_*(dy)$ for finite constant M > 0 and all x that are outside K but no farther from it than D (call that set $K_D$) and all y that are between D and 2D distance from K (call that set $K_{2D} \backslash K_D$). Here $\mu_*$ is any distribution concentrated on $K_{2D} \backslash K_D$.
(d) The fixed kernel P is bounded below by $P^{n_0}(x, A) \geq \epsilon \nu_*(A)$ for some measure $\nu_*$ on $\mathcal{X}$, some $n_0 \in \mathbb{N}$ and some event A.
(e) Let $\gamma$ be the parameter adapted by the algorithm. The overall proposal densities $q_\gamma(x,y)$ (combining the proposal in and outside of K) are continuous in $\gamma$ for fixed (x,y) and combocontinuous in x. Practically, this would be that the fixed proposal when outside  K and the adaptive proposal when inside K are both continuous.

Here, conditions (a) and (b) are very easy to ensure even when not an expert on MCMC. Conditions (c) and (d) sound harder, but as mentioned above it seems to me that they are easy to ensure by just using a (truncated, i.e. respecting (a)) gaussian random walk proposal outside of K. Finally, (e) seems to boil down to making the adaptive proposal continuous in both $\gamma$ and x.

The proofs use a generalization of piecewise continuous functions and a generalized version of Dinis theorem to prove convergence in total variation distance.

This paper seems to me to be a long way from Roberts & Rosenthal (2007, Journal of Applied Probability) which was the first paper I read on ergodicity conditions for adaptive MCMC. It truly makes checking containment much easier. My one concern is that the exposition could be clearer for people that are not MCMC researchers. Then again, this is a contribution paper rather than a tutorial.

# Operator Variational Inference

This NIPS 2016 paper by Ranganath et al. is concerned with Variational Inference using objective functions other than KL-divergence between a target density $\pi$ and a proposal density $q$. It’s called Operator VI as a fancy way to say that one is flexible in constructing how exactly the objective function uses $\pi, q$ and test functions from some family $\mathcal{F}$. I completely agree with the motivation: KL-Divergence in the form $\int q(x) \log \frac{q(x)}{\pi(x)} \mathrm{d}x$ indeed underestimates the variance of $\pi$ and approximates only one mode. Using KL the other way around, $\int \pi(x) \log \frac{pi(x)}{q(x)} \mathrm{d}x$ takes all modes into account, but still tends to underestimate variance.

As a particular case, the authors suggest an objective using what they call the Langevin-Stein Operator which does not make use of the proposal density $q$  at all but uses test functions exclusively. The only requirement is that we be able to draw samples from the proposal. The authors claim that assuming access to $q$ limits applicability of an objective/operator. This claim is not substantiated however. The example they give in equation (10) is that it is not possible to find a Jacobian correction for a certain transformation of a standard normal random variable $\epsilon \sim \mathcal{N}(0,I)$  to a bimodal distribution. However their method is not the only one to get bimodality by transforming a standard normal variable and actually the Jacobian correction can be computed even for their suggested transformation! The problem they encounter really is that they throw away one dimension of $\epsilon$, which makes the tranformation lose injectivity. However by not throwing the variable away, we keep injectivity and it is possible to compute the density of the transformed variables. The reasons for not accessing the density $q$ I thus find rather unconvincing.

To compute expectations with respect to $q$, the authors suggest Monte Carlo sums, where every summand uses an evaluation of $\pi$ or its gradient. As that is the most computationally costly part in MCMC and SMC often times, I am very curious whether the method performs any better computationally than modern adaptive Monte Carlo methods.

# Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

During the super nice International Conference on Monte Carlo techniques in the beginning of July in Paris at Université Descartes (photo), which featured many outstanding talks, one by Tong Zhang particularly caught my interest. He talked about several variants of Stochastic Gradient Descent (SGD) that basically use variance reduction techniques from Monte Carlo algorithms in order to improve the convergence rate versus vanilla SGD. Even though some of the papers mentioned in the talk do not always point out the connection to Monte Carlo variance reduction techniques.

One of the first works in this line, Accelerating Stochastic Gradient Descent using Predictive Variance Reduction by Johnson and Zhang, suggests using control variates to lower the variance of the loss estimate. Let $L_j(\theta_{t-1})$ be the loss for the parameter at $t-1$ and jth data point, then the usual batch gradient descent update is $\theta_{t} =\theta_{t-1} - \frac{\eta_t}{N} \sum_{j=1}^N\nabla L_j(\theta_{t-1})$ with $\eta_t$ as step size.

In naive SGD instead one picks a data point index uniformly $j \sim \mathrm{Unif}(\{1,\dots,N\})$ and uses the update $\theta_{t} =\theta_{t-1} - \eta_t \nabla L_j(\theta_{t-1})$, usually with a decreasing step size $\eta_t$ to guarantee convergence. The expected update resulting from this Monte Carlo estimate of the batch loss is exactly the batch procedure update. However the variance of the estimate is very high, resulting in slow convergence of SGD after the first steps (even in minibatch variants).

The authors choose a well-known solution to this, namely the introduction of a control variate. Keeping a version of the parameter that is close to the optimum, say $\tilde\theta$, observe that $\nabla L_j(\tilde\theta) - \frac{1}{N} \sum_{i=1}^N \nabla L_i(\tilde\theta)$ has an expected value of 0 and is thus a possible control variate. With the possible downside that whenever $\tilde\theta$ is updated, one has to go over the complete dataset.

The contribution, apart from the novel combination of knowledge, is the proof that this improves convergence. This proof assumes smoothness and strong convexity of the overall loss function and convexity of the $L_j$ for the individual data points and then shows that the proposed procedure (termed stochastic variance reduced gradient or SVRG) enjoys geometric convergence. Even though the proof uses a slightly odd version of the algorithm, namely where $\tilde\theta \sim\mathrm{Unif}(\{\theta_0,\dots,\theta_{t-1}\})$. Rather simply setting  $\tilde\theta = \theta_{t-1}$ should intuitively improve convergence, but the authors could not report a result on that. Overall a very nice idea, and one that has been discussed in more papers quite a bit, among others by Simon Lacoste-Julien and Francis Bach.