# A non-parametric ensemble transform method for Bayesian inference

This 2013 paper by Sebastian Reich in the SIAM Journal on Scientific Computing introduces an approach called the Ensemble Transform Particle Filter (ETPF). The main innovation of the ETPF, compared to SMC-like filtering methods, lies in the resampling step, which is

1.  based on an optimal transport idea and
2. completely deterministic.

No rejuvenation step is used, contrary to the standard in SMC. While the notation is unfamiliar to me, coming from an SMC background, I’ll adopt it here: denote by $\{x_i^f\}_{i=1}^M$ samples from the prior with density $\pi_f$ (the $f$, meaning forecast, is probably owed to Reich having done a lot of Bayesian weather prediction). The idea is to transform these into samples $\{x_i^a\}_{i=1}^M$ that follow the posterior density $\pi_a$ (the $a$ meaning analyzed), preferably without introducing unequal weights. Let the likelihood term be denoted by $p(y|x)$, where $y$ is the data, and let $w_i = \frac{p(y|x_i^f)}{\sum_{j=1}^M p(y|x_j^f)}$ be the normalized importance weight. The normalization in the denominator stems from the fact that in Bayesian inference we can often only evaluate an unnormalized version of the posterior, $\pi_a(x) \propto p(y|x)\, \pi_f(x)$.
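Computing the normalized weights is straightforward, but in practice it is best done in log-space to avoid underflow when the likelihood values are tiny. A minimal sketch (the function name and toy inputs are my own, not from the paper):

```python
import numpy as np

def normalized_weights(log_lik):
    """Normalized importance weights w_i from log-likelihoods log p(y|x_i^f).

    Shifting by the maximum (log-sum-exp trick) keeps exp() from
    underflowing when the raw likelihoods are extremely small.
    """
    log_w = log_lik - np.max(log_lik)  # shift for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

# toy example: 4 prior samples with given log-likelihoods
w = normalized_weights(np.array([-1.0, -2.0, -1.5, -3.0]))
assert np.isclose(w.sum(), 1.0)
```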

Then the optimal transport idea enters. Given the discrete realizations $\{x_i^f\}_{i=1}^M$, $\pi_f$ is approximated by assigning the discrete probability vector $p_f = (1/M,\dots,1/M)^\top$, while $\pi_a$ is approximated by the probability vector $p_a=(w_1,\dots, w_M)^\top$. Now we construct a joint probability distribution between the discrete random variables distributed according to $p_f$ and those distributed according to $p_a$, i.e. a matrix $\mathbf{T}$ with non-negative entries summing to 1 which has column sums $p_f$ and row sums $p_a$ (another view would be that $\mathbf{T}$ is a discrete copula which has prior and posterior as marginals). Let $\pi_\mathbf{T}(x^f,x^a)$ be the joint pmf induced by $\mathbf{T}$. To qualify as optimal transport, we now seek $\mathbf{T}^* = \arg\min_\mathbf{T} \mathbb{E}_{\pi_\mathbf{T}} \|x^f - x^a\|_2^2$ under the additional constraint of cyclical monotonicity. This boils down to a linear programming problem. For a fixed prior sample $x^f_j$ this induces a conditional distribution over the discretely approximated posterior given the discretely approximated prior, $\pi_{\mathbf{T}^*}(\cdot|x^f_j) =\pi_{\mathbf{T}^*}(x^f_j, \cdot)/\pi_f(x^f_j) = M \pi_{\mathbf{T}^*}(x^f_j, \cdot)$.
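The linear program can be written down directly: minimize the expected squared transport cost over couplings whose marginals match $p_f$ and $p_a$. A sketch using SciPy's generic LP solver, following the convention above (column sums $1/M$, row sums $w_i$); a dedicated OT solver such as the POT library would scale much better than this illustrative version:

```python
import numpy as np
from scipy.optimize import linprog

def transport_plan(xf, w):
    """Solve the LP for the optimal coupling T* between prior and posterior.

    xf : (M, d) array of prior ("forecast") samples
    w  : (M,) normalized importance weights
    Returns T with row sums w (posterior marginal) and column sums 1/M
    (prior marginal), minimizing the expected squared transport cost.
    """
    M = len(w)
    # cost c_{ij} = ||x_i^f - x_j^f||^2, flattened row-major
    C = ((xf[:, None, :] - xf[None, :, :]) ** 2).sum(-1)
    # equality constraints on the flattened t_{ij}:
    A_rows = np.kron(np.eye(M), np.ones(M))  # sum_j t_{ij} = w_i
    A_cols = np.kron(np.ones(M), np.eye(M))  # sum_i t_{ij} = 1/M
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([w, np.full(M, 1.0 / M)])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(M, M)
```

Note the $M^2$ decision variables: the LP grows quickly with the ensemble size, which is why specialized network-flow or entropic-regularization solvers are the practical choice for large $M$.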

We could now simply sample from this conditional to obtain equally weighted posterior samples for each $j =1,\dots,M$. Instead, the paper proposes a deterministic transformation using the expected value $x^a_j = \mathbb{E}_{\pi_{\mathbf{T}^*}(\cdot|x^f_j)}[x] = \sum_{i=1}^M M t^*_{ij} x^f_i$. Reich proves that the mapping $\Psi_M$ induced by this transformation is such that for $X^f \sim \pi_f$, $\Psi_M(X^f)$ converges in distribution to $\pi_a$ as $M \to \infty$. In other words, as the ensemble size $M$ goes to infinity, we indeed get samples from the posterior.
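Given a coupling matrix, the deterministic update is a single matrix product. A minimal sketch of the formula above (function name my own):

```python
import numpy as np

def etpf_transform(xf, T):
    """Deterministic ETPF update: x_j^a = sum_i M t*_{ij} x_i^f.

    xf : (M, d) prior samples
    T  : (M, M) optimal coupling with column sums 1/M
    Each analysis particle is the conditional mean of the coupling
    given its forecast particle.
    """
    M = xf.shape[0]
    return M * T.T @ xf  # row j of the result is sum_i M t_{ij} x_i^f
```

In the degenerate case of uniform weights the optimal coupling is $t^*_{ij} = \delta_{ij}/M$ and the transform leaves the ensemble unchanged, as one would hope.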

Overall I think this is a very interesting approach. The construction of an optimal transport map based on the discrete approximations of prior and posterior is indeed novel compared to standard SMC. My one objection is that, as it stands, the method will only work if the prior support covers all relevant regions of the posterior, since taking expected values over prior samples always contracts the ensemble. Of course, this is not a problem when $M$ is infinite, but my intuition would be that it has a rather strong effect in our finite world. One remedy here would of course be to introduce a rejuvenation step as in SMC, for example moving each particle $x^a_j$ using MCMC steps that leave $\pi_a$ invariant.
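Such a rejuvenation step is not part of the paper; as a hypothetical sketch, a few random-walk Metropolis steps targeting the (unnormalized) posterior would do, applied independently to each transformed particle:

```python
import numpy as np

def rejuvenate(xa, log_post, n_steps=10, step=0.1, rng=None):
    """Hypothetical rejuvenation step (not from the paper): random-walk
    Metropolis moves that leave the posterior invariant.

    xa       : (M, d) transformed particles
    log_post : callable returning log pi_a(x) up to an additive constant
    """
    rng = np.random.default_rng() if rng is None else rng
    x = xa.copy()
    lp = np.array([log_post(p) for p in x])
    for _ in range(n_steps):
        prop = x + step * rng.standard_normal(x.shape)  # Gaussian proposal
        lp_prop = np.array([log_post(p) for p in prop])
        # Metropolis accept/reject, vectorized over particles
        accept = np.log(rng.random(len(x))) < lp_prop - lp
        x[accept] = prop[accept]
        lp[accept] = lp_prop[accept]
    return x
```

Because each move leaves $\pi_a$ invariant, this can only improve particle diversity without biasing the (asymptotic) posterior approximation.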