# Linear predictors for nonlinear dynamical systems: Koopman operator meets model predictive control

This arXival by the dynamical duo Korda & Mesiç deals with optimal control for a dynamical system when no forward differential equations are known. Instead, we are given observation/sensor data $x_t$ capturing the systems state (i.e. the observed state in HMM lingo or observables in dynamical systems lingo),  the control input $u_t$ and the observed state at the following time index $x_{t+1}$.

The basic idea is to build a Koopman operator that captures the dynamics of the observed data. In the simplest case, this operator is just a linear map, i.e. matrices $A, B$ satisfying $[x_{t+1}, u_{t+1}] = [A,B] [x_{t}^\top, u_t^\top]^\top$, where $A$ encodes the dependence of the next state upon the current state, $B$ upon the control. Already, this formula reveals a slight inelegance – to stay in Koopman operator world, the control input at $t+1$ is predicted. This does not make much sense, but so be it.

Now the trick to approximate nonlinear dynamics is well known to statisticians and ML people: rather finding a mapping between $x_t$ and $x_{t+1}$, we find a mapping between $\phi(x_{t})$ and $\phi(x_{t+1})$, where $\phi$ is some nonlinear map to $\mathbb{R}^{d_\mathrm{lift}}$ and called a lifting function in the paper. Now to fit a model to the nonlinear system dynamics, we only need to fit the linear maps $A, B$ to the training data. The loss in the paper is based on a Frobenius or squared euclidean norm and of the form

$\min_{A,B} | [\phi(x_{2}), \dots, \phi(x_{T})] - A [\phi(x_{1}), \dots, \phi(x_{T-1})] - B[u_{1}, \dots, u_{T-1}] \|$

For the Frobenius norm, the analytical solution is

$[A, B] = [\phi(x_{2}), \dots, \phi(x_{T})] [\phi(x_{1}), \dots, \phi(x_{T-1}),u_{1}, \dots, u_{T-1}]^{-1}$

where the inverse is the pseudoinverse of course. Up to here, the paper was all about approximating a forward model from data in the case where no a priori closed form DE model is given.
Now the idea for using this for control is to define a convex cost function $J$ that includes both the cost of the system deviating from the target state as well as the cost of control inputs. As the cost is convex, a global optimum can be attained.

The theoretical analysis mainly shows that the approximated Koopman operator converges to the actual infinite dimensional Koopman operator.

While the extension to stochastic systems is only mentioned, I think this is an interesting paper following an approach that seems particularly promising for whats called reinforcement learning in the deep/machine learning community. It seems that many reinforcement learning papers do not use a proper optimisation procedure for control input (it’s just random search). Fitting a model of the systems dynamics and its reaction to control input instead, enabling the usage of well-known gradient based optimisation methods, as the current paper, seems like a good idea to me.

# Three PhD positions in Lille

My collegue and friend Victor Elvira and his collaborator Francois Septier have three vacant PhD positions in Lille.

The first one is a cooperation with industry: the objective is to predict the evolution of the outputs in the supply chain of the industry partner, i.e. quantities in locations of storage or sales. The thesis will be about mathematical modeling of the supply chain and
inference methods for prediction.
The two other positions actually pertain to a recent and an older interest of mine, dynamical systems and importance sampling. You will study importance sampling based methods for probabilistic inference in complex non-linear high-dimensional systems. More specifically, you will work on novel adaptation schemes in order to overcome current limitations of more traditional IS-based techniques in such a challenging context. (Apply fast, or else I might!)

# PostDoc at Imperial College London with Sarah Filippi

My wonderful collegue Sarah Filippi has a postdoctoral opening in statistics to work with her at Imperial College. The topic is Bayesian Nonparametric Statistics for Conditional Independence Tests and Causal Inference, and the start date October 2018. Details about the position are available at https://www.imperial.ac.uk/jobs/description/NAT00152/research-associate-statistics/

I can highly recommend both Sarah as a researcher and this most interesting and important topic.

# Variational inference with Normalizing Flows

I keep coming back to this ICML 2015 paper by Rezende and Mohamed (arXiv version). While this is not due to the particular novelty of the papers contents, I agree that the suggested approach is very promising for any inference approach, be it VI or adaptive Monte Carlo. The paper adopts the term normalizing flow for refering to the plain old change of variables formula for integrals. With the minor change of view that one can see this as a flow and the correct but slightly alien reference to a flow defined by the Langevin SDE or Fokker-Planck, both attributed only to ML/stats literature in the paper.
The theoretical contribution feels a little like a strawman: it simply states that, as Langevin and Hamiltonian dynamics can be seen as an infinitesimal normalizing flow, and both approximate the posterior when the step size goes to zero, normalizing flows can approximate the posterior arbitrarily well. This is of course nothing that was derived in the paper, nor is it news. Nor does it say anything about the practical approach suggested.
The invertible maps suggested have practical merit however, as they allow “splitting” of a mode into two, called the planar transformation (and plotted on the right of the image), as well as “attracting/repulsing” probability mass around a point. The Jacobian correction for both invertible maps being computable in time that is linear in the number of dimensions.

# A non-parametric ensemble transform method for Bayesian inference

This 2013 paper by Sebastian Reich in the Journal on Scientific Computing introduces an approach called the Ensemble Transport Particle Filter (ETPF). The main innovation of  ETPF, when compared to SMC-like filtering methods, lies in the resampling step. Which is

1.  based on an optimal transport idea and
2. completely deterministic.

No rejuvenation step is used, contrary to the standard in SMC. While the notation is unfamiliar to me, coming from an SMC background, I’ll adopt it here: by $\{x_i^f\}_{i=1}^M$ denote samples from the prior with density $\pi_f$ (the $f$, meaning forecast, is probably owed to Reich having done a lot of Bayesian weather prediction). The idea is to transform these into samples $\{x_i^a\}_{i=1}^M$ that follow the posterior density $\pi_a$ (the $a$ meaning analyzed), preferably without introducing unequal weights. Let the likelihood term be denoted by $p(y|x)$ where $y$ is the data and let $w_i = \frac{p(y|x_i^f)}{\sum_{j=1}^M p(y|x_i^f)}$ be the normalized importance weight. The normalization in the denominator stems from the fact that in Bayesian inference we can often only evaluate an unnormalized version of the posterior $\pi_a(x) \propto p(y|x) \pi_f(x)$.

Then the optimal transport idea enters. Given the discrete realizations $\{x_i^f\}_{i=1}^M$, $\pi_f$ is approximated by assigning the discrete probability vector $p_f = (1/M,\dots,1/M)^\top$, while $\pi_a$ is approximated by the probability vector $p_a=(w_1,\dots, w_m)^\top$. Now we construct a joint probability between the discrete random variables distributed according to $p_f$ and those distributed according to $p_a$, i.e. a matrix $\mathbf{T}$ with non-negative entries summing to 1 which has the column sum $p_f$ and row sum $p_a$ (another view would be that $\mathbf{T}$ is a discrete copula which has prior and posterior as marginals). Let $\pi_\mathbf{T}(x^f,x^a)$ be the joint pmf induced by $\mathbf{T}$. To qualify as optimal transport, we now seek $\mathbf{T}^* = \mathrm{arg min}_\mathbf{T} \mathbb{E}_{\pi_\mathbf{T}} \|x^f - x^a\|_2^2$ under the additional constraint of cyclical monotonicity. This boils down to a linear programming problem. For a fixed prior sample $x^f_j$ this induces a conditional distribution over the discretely approximated posterior given the discretely approximated prior $\pi_{\mathbf{T}^*}(\cdot|x^f_j) =\pi_{\mathbf{T}^*}(x^f_j, \cdot)/\pi_f(x^f_j) = M \pi_{\mathbf{T}^*}(x^f_j, \cdot)$.

We could now simply sample from this conditional to obtain equally weighted posterior samples for each $j =1,\dots,M$. Instead, the paper proposes a deterministic transformation using the expected value $x^a_j = \mathbb{E}_{\pi_{\mathbf{T}^*}(\cdot|x^f_j)} x^f_i = \sum_{i=1}^M M t^*_{ij} x^f_i$. Reich proves that the mapping $\Psi_M$ induced by this transformation is such that for $X^f \sim \pi_f$, $\Psi_M(X^f) \sim \pi_a$ for $M \to \infty$. In other words, if the ensemble size M goes to infinity, we indeed get samples from the posterior.

Overall I think this is a very interesting approach. The construction of an optimal transport map based on the discrete approximations of prior and posterior is indeed novel compared to standard SMC. My one objection is that as it stands, the method will only work if the prior support covers all relevant regions of the posterior, as taking the expected value over prior samples will always lead to a contraction.

Of course, this is not a problem when M is infinite, but my intuition would be that it has a rather strong effect in our finite world. One remedy here would of course be to introduce a rejuvenation step as in SMC, for example moving each particle  $x^a_j$ using MCMC steps that leave $\pi^a$ invariant.