Operator Variational Inference

This NIPS 2016 paper by Ranganath et al. is concerned with Variational Inference using objective functions other than KL-divergence between a target density $\pi$ and a proposal density $q$. It’s called Operator VI as a fancy way to say that one is flexible in constructing how exactly the objective function uses $\pi, q$ and test functions from some family $\mathcal{F}$. I completely agree with the motivation: KL-Divergence in the form $\int q(x) \log \frac{q(x)}{\pi(x)} \mathrm{d}x$ indeed underestimates the variance of $\pi$ and approximates only one mode. Using KL the other way around, $\int \pi(x) \log \frac{pi(x)}{q(x)} \mathrm{d}x$ takes all modes into account, but still tends to underestimate variance.

As a particular case, the authors suggest an objective using what they call the Langevin-Stein Operator which does not make use of the proposal density $q$  at all but uses test functions exclusively. The only requirement is that we be able to draw samples from the proposal. The authors claim that assuming access to $q$ limits applicability of an objective/operator. This claim is not substantiated however. The example they give in equation (10) is that it is not possible to find a Jacobian correction for a certain transformation of a standard normal random variable $\epsilon \sim \mathcal{N}(0,I)$  to a bimodal distribution. However their method is not the only one to get bimodality by transforming a standard normal variable and actually the Jacobian correction can be computed even for their suggested transformation! The problem they encounter really is that they throw away one dimension of $\epsilon$, which makes the tranformation lose injectivity. However by not throwing the variable away, we keep injectivity and it is possible to compute the density of the transformed variables. The reasons for not accessing the density $q$ I thus find rather unconvincing.

To compute expectations with respect to $q$, the authors suggest Monte Carlo sums, where every summand uses an evaluation of $\pi$ or its gradient. As that is the most computationally costly part in MCMC and SMC often times, I am very curious whether the method performs any better computationally than modern adaptive Monte Carlo methods.