This UAI paper by Ruiz, Titsias and Blei presents important insights for the idea of a black box procedure for VI (which I discussed here). The setup of BBVI is the following: given a target/posterior $p(x,z)$ and a parametric approximation $q_\lambda(z)$, we want to find

$$\lambda^* = \arg\max_\lambda \mathcal{L}(\lambda) = \arg\max_\lambda \mathbb{E}_{q_\lambda(z)}\left[\log p(x,z) - \log q_\lambda(z)\right],$$

which can be achieved for any $q_\lambda$ by estimating the gradient

$$\nabla_\lambda \mathcal{L}(\lambda) = \mathbb{E}_{q_\lambda(z)}\left[\nabla_\lambda \log q_\lambda(z)\,\big(\log p(x,z) - \log q_\lambda(z)\big)\right]$$

with Monte Carlo samples and stochastic gradient ascent. This works if we can easily sample from $q_\lambda(z)$ and can compute its derivative with respect to $\lambda$ in closed form.
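To make this concrete, here is a minimal numpy sketch of the score-function gradient estimator (my own toy example, not the authors' code; the 1-d Gaussian target, the parametrization and the step size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unnormalized target: log p(x, z) up to a constant,
# here a Gaussian N(2, 1) standing in for the joint.
def log_p(z):
    return -0.5 * (z - 2.0) ** 2

# Gaussian approximation q_lambda = N(mu, exp(log_sigma)^2),
# with lambda = (mu, log_sigma).
def log_q(z, mu, log_sigma):
    sigma = np.exp(log_sigma)
    return -0.5 * ((z - mu) / sigma) ** 2 - log_sigma - 0.5 * np.log(2 * np.pi)

def score(z, mu, log_sigma):
    # Gradient of log q w.r.t. (mu, log_sigma), available in closed form.
    sigma = np.exp(log_sigma)
    d_mu = (z - mu) / sigma**2
    d_log_sigma = ((z - mu) / sigma) ** 2 - 1.0
    return np.stack([d_mu, d_log_sigma], axis=-1)

def elbo_grad(mu, log_sigma, n_samples=1000):
    # Score-function (BBVI) estimator of the ELBO gradient:
    # E_q[ grad log q(z) * (log p(x,z) - log q(z)) ]
    z = rng.normal(mu, np.exp(log_sigma), size=n_samples)
    f = log_p(z) - log_q(z, mu, log_sigma)
    return (score(z, mu, log_sigma) * f[:, None]).mean(axis=0)

# One step of stochastic gradient ascent on the ELBO.
mu, log_sigma = 0.0, 0.0
g = elbo_grad(mu, log_sigma)
mu += 0.1 * g[0]
log_sigma += 0.1 * g[1]
```

Note that nothing here differentiates through $\log p$: only samples from $q_\lambda$ and the closed-form score are needed, which is exactly what makes the procedure "black box".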
in closed form. In the original paper, the authors suggested the use of the score function as a control variate and a Rao-Blackwellization. Both where described in a way that utterly confused me – until now, because Ruiz, Titsias and Blei manage to describe the concrete application of both control variates and Rao-Blackwellization in a very transparent way. Their own contribution to variance reduction (minus some tricks they applied) is based on the fact that the optimal sampling distribution for estimating
is proportional to
rather than exactly
. They argue that this optimal sampling distribution is considerably heavier tailed than
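A quick numeric way to see this (my own 1-d toy, not from the paper: $q = N(0,1)$, a hypothetical unnormalized target $N(2,1)$, and only the gradient with respect to the mean):

```python
import numpy as np

# Toy check: compare q = N(0, 1) with the variance-optimal sampling
# density ∝ q(z)|f(z)|, where f is the integrand of the mean-gradient:
# f(z) = d/dmu log q(z) * (log p(z) - log q(z)).
z = np.linspace(-8.0, 8.0, 4001)
dz = z[1] - z[0]

log_p = -0.5 * (z - 2.0) ** 2                    # hypothetical target, unnormalized
log_q = -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)  # q = N(0, 1)
q = np.exp(log_q)

score_mu = z                                     # d/dmu log N(z; mu, 1) at mu = 0
opt = q * np.abs(score_mu * (log_p - log_q))     # optimal density, unnormalized
opt /= opt.sum() * dz                            # normalize on the grid

# The optimal density is zero at the mode of q (the score vanishes there)
# and puts noticeably more mass in the tails than q does.
tail_q = q[z > 3.0].sum() * dz
tail_opt = opt[z > 3.0].sum() * dz
```

On this grid, `tail_opt` comes out several times larger than `tail_q`, and `opt` vanishes at $z = 0$: the optimal proposal avoids the mode of $q$ and oversamples the tails.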
. Their reasoning is mainly that the norm of the gradient (which is essentially
) vanishes for the modes, making that region irrelevant for gradient estimation. The same should be true for the tails of the distribution I think. Overall very interesting work that I strongly recommend reading, if only to understand the original Blackbox VI proposal.
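As a closing aside, the control-variate trick from the original BBVI paper is easy to replicate on a toy problem: since $\mathbb{E}_{q_\lambda}[\nabla_\lambda \log q_\lambda(z)] = 0$, any multiple of the score can be subtracted from the estimator without introducing bias. A numpy sketch (my own toy setup; the target, sample sizes and parametrization are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-d setup: q = N(mu, 1), unnormalized target log p(z) = -0.5 (z - 2)^2.
mu = 0.0
z = rng.normal(mu, 1.0, size=(500, 2000))  # 500 replications of 2000 samples each

score = z - mu                                       # d/dmu log q(z)
f = -0.5 * (z - 2.0) ** 2 + 0.5 * (z - mu) ** 2      # log p - log q (up to constants)
h = score * f
naive = h.mean(axis=1)                               # plain score-function estimator

# a * score has expectation zero, so it is a valid control variate for any a;
# the variance-optimal a is Cov(h, score) / Var(score), estimated per replication.
a = ((h - h.mean(axis=1, keepdims=True)) * score).mean(axis=1) / score.var(axis=1)
cv = (h - a[:, None] * score).mean(axis=1)

# Variance across replications drops with the control variate,
# while both estimators remain unbiased for the true gradient.
```

In this toy both estimators average to the true mean-gradient, but the control-variate version has visibly lower variance across the replications.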