Bayesian Inversion

I’ve previously worked on applications of Bayesian inversion to inverse problems in geophysics. I wanted to talk a little about Bayesian inversion, analyze it from a theoretical perspective, and give an illustration.

Bayesian inversion is a method for finding solutions to inverse problems using the machinery of Bayesian statistics. Being probabilistic by nature, it has some advantages over deterministic methods. Geophysics seems to be a field where this approach to solving inverse problems has been adopted most readily, but it could likely be applied in other domains as well.

The main idea behind Bayesian inverse methods is that you have a function f(x), where f: \mathbb{X} \rightarrow \mathbb{Y} and \mathbb{X} and \mathbb{Y} are two Lebesgue measurable measure spaces; maybe \mathbb{R} or \mathbb{R}^k for some k \in \mathbb{N} with the standard Borel \sigma-algebra. You observe some y \in \mathbb{Y} and want to invert f to obtain x \in \mathbb{X} such that f(x) = y.

The reason the probabilistic approach, Bayesian methods included, is useful is that f may be ill-posed: it can be a many-to-one mapping. This means that there is a set of solutions to the inverse problem which all map to y. Additionally, some solutions might map to y approximately, but not exactly, meaning there is uncertainty about which solutions solve the inverse problem.
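As a concrete toy sketch (purely hypothetical, not a geophysical model), take f(x) = x^2 on the real line; it is many-to-one, so a single observation y cannot identify a unique x:

```python
# A hypothetical many-to-one forward model f: R -> R.
# Both x and -x map to the same observation y, so inverting
# y = f(x) does not pick out a unique solution.
def f(x):
    return x ** 2

y_obs = 4.0
# Two distinct values of x satisfy f(x) = y_obs exactly:
print(f(2.0) == y_obs, f(-2.0) == y_obs)  # True True
```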

The Bayesian approach sets the problem up in a Bayesian fashion. If we have a likelihood, P(y | x), and a prior, P(x), we can use Bayes’ theorem to get a distribution over possible solutions to the problem:

P(x | y) = \frac{P(y|x)P(x)}{P(y)} = \frac{P(y|x)P(x)}{\int_\mathbb{X} P(y|x)P(x)dx}
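For a one-dimensional toy problem this posterior can be evaluated directly on a grid. Here I assume the hypothetical forward model f(x) = x^2 with made-up illustrative values for the observation, prior, and noise scale, and approximate the evidence integral by a Riemann sum:

```python
import numpy as np

# Grid approximation of P(x | y) for a toy model: f(x) = x**2, y = 4,
# Gaussian likelihood and prior. All parameter values are illustrative.
def f(x):
    return x ** 2

xs = np.linspace(-5, 5, 2001)
dx = xs[1] - xs[0]
y = 4.0
sigma_y, mu_x, sigma_x = 0.5, 0.0, 2.0

likelihood = np.exp(-0.5 * ((y - f(xs)) / sigma_y) ** 2)
prior = np.exp(-0.5 * ((xs - mu_x) / sigma_x) ** 2)
unnorm = likelihood * prior
posterior = unnorm / np.sum(unnorm * dx)  # Riemann sum stands in for the evidence integral

# The posterior is bimodal: mass concentrates near both x = 2 and x = -2,
# reflecting the many-to-one forward map.
```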

This solution might be difficult to express analytically due to the intractability of the integral \int_\mathbb{X} P(y|x)P(x)dx, but at this point nothing distinguishes the problem from a typical Bayesian one, and methods like variational inference, Markov chain Monte Carlo, or Stein variational gradient descent could be used to approximate the posterior.
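As a minimal sketch of the MCMC route, here is a random-walk Metropolis sampler for the same toy posterior (f(x) = x^2 and all parameter values are my own illustrative assumptions); it only needs the unnormalized log-posterior, so the intractable evidence integral never appears:

```python
import numpy as np

# Random-walk Metropolis for the toy posterior with f(x) = x**2.
# Illustrative values: y = 4, sigma_y = 0.5, prior N(0, 2**2).
def log_post(x, y=4.0, sigma_y=0.5, mu_x=0.0, sigma_x=2.0):
    return (-0.5 * ((y - x ** 2) / sigma_y) ** 2
            - 0.5 * ((x - mu_x) / sigma_x) ** 2)

rng = np.random.default_rng(0)
x, samples = 1.0, []
for _ in range(20000):
    prop = x + rng.normal(scale=0.5)          # symmetric proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(x):
        x = prop                               # accept
    samples.append(x)
samples = np.array(samples[5000:])             # discard burn-in

# |x| concentrates near 2; the chain may visit both modes.
```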

My main interest in writing this blog post was on the accuracy of the approximation, and ways in which it can be tweaked or modified. For example, say f: \mathbb{R}^p \rightarrow \mathbb{R}^q, and we place x \sim \textrm{N}_p(x | \mu_x, \Sigma_x), and y | x \sim \textrm{N}_q(y | f(x), \Sigma_y). Obviously \mu_x and \Sigma_x affect the posterior distribution in the way they might in a conventional Bayesian framework, but what about \Sigma_y?
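One way to see the role of \Sigma_y is numerically: on the toy model (again, f(x) = x^2 and all numbers are illustrative assumptions), shrinking \sigma_y tightens the posterior around the solution. I restrict to x > 0 so the posterior is unimodal and its standard deviation is easy to read off:

```python
import numpy as np

# Posterior standard deviation as a function of sigma_y, on a grid,
# for the toy model f(x) = x**2, y = 4, restricted to x > 0.
def posterior_sd(sigma_y, mu_x=0.0, sigma_x=2.0, y=4.0):
    xs = np.linspace(0.01, 5, 4000)
    dx = xs[1] - xs[0]
    unnorm = (np.exp(-0.5 * ((y - xs ** 2) / sigma_y) ** 2)
              * np.exp(-0.5 * ((xs - mu_x) / sigma_x) ** 2))
    p = unnorm / np.sum(unnorm * dx)
    m = np.sum(xs * p * dx)
    return float(np.sqrt(np.sum((xs - m) ** 2 * p * dx)))

# A tighter likelihood gives a tighter posterior around x = 2:
print(posterior_sd(1.0), posterior_sd(0.1))
```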

Another thing to consider is the observation y. We have only one observation, which is not very informative, but the problem is set up in such a way that we can replicate that observation as many times as we like, since y is fixed as the data of the problem. So we take y_1,...,y_n, all equal to y, with likelihood y_1,...,y_n \overset{iid}{\sim} \textrm{N}(y_i | f(x), \Sigma_y). If we do things this way, our posterior is given by:

P(x|y_1,...,y_n) = \frac{\prod_{i=1}^n \textrm{N}(y_i | f(x), \Sigma_y) \, \textrm{N}(x| \mu_x, \Sigma_x)}{\int_{\mathbb{X}} \prod_{i=1}^n \textrm{N}(y_i | f(x), \Sigma_y) \, \textrm{N}(x| \mu_x, \Sigma_x) \, dx}

So we have four parameters to control: \mu_x, \Sigma_x, \Sigma_y, and n, and these parameters affect our posterior.

Suppose that \hat{x} is the unique solution to the problem, and that x is sufficiently close to \hat{x} (so that \textrm{N}(x|\mu_x, \Sigma_x) is approximately constant). Then as n \rightarrow \infty, and under some regularity assumptions, we have:

P(x|y) \approx \frac{\exp[\log P(y|\hat{x}) - \frac{1}{2} (x - \hat{x})^T\hat{I}_n(\hat{x})(x - \hat{x})]P(\hat{x})}{\int_\mathbb{X} \exp[\log P(y|\hat{x}) - \frac{1}{2} (x - \hat{x})^T\hat{I}_n(\hat{x})(x - \hat{x})]P(\hat{x}) dx}

where \hat{I}_n(\hat{x}) = -\frac{1}{n} \left[\frac{\partial^2}{\partial x^{(k)}\partial x^{(l)}}\log \textrm{N}(y | f(x), \Sigma_y) \big|_{x = \hat{x}}\right]_{kl} is the Fisher information matrix.

However, note that \hat{I}_n(\hat{x}) for a given n \geq 1 is just \hat{I}_1(\hat{x}) evaluated at a different value of \Sigma_y: scaling \Sigma_y by a factor of \frac{1}{n} has the same effect. This means that the number of repeated observations of the same value of y doesn’t actually matter; it just controls how diffuse the likelihood is, which we already control through \Sigma_y. So n is a redundant parameter and can be discarded.
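This equivalence is easy to check numerically on the toy model (f(x) = x^2 with illustrative parameter values): the posterior from n repeated copies of y matches the posterior from a single observation with variance \sigma_y^2/n, since the exponents are algebraically identical:

```python
import numpy as np

# Check that n repeated copies of the same observation y give the same
# posterior as one observation with variance sigma_y**2 / n (toy model).
xs = np.linspace(-5, 5, 4001)
dx = xs[1] - xs[0]
y, sigma_y, n = 4.0, 0.5, 7
prior = np.exp(-0.5 * (xs / 2.0) ** 2)

def normalize(u):
    return u / np.sum(u * dx)

# The likelihood of n iid copies of y is the single-observation
# likelihood raised to the n-th power.
post_repeat = normalize(np.exp(-0.5 * n * ((y - xs ** 2) / sigma_y) ** 2) * prior)
post_scaled = normalize(np.exp(-0.5 * (y - xs ** 2) ** 2 / (sigma_y ** 2 / n)) * prior)

print(np.max(np.abs(post_repeat - post_scaled)))  # numerically identical
```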

This gives us that:

P(x|y) \approx \textrm{N}(\hat{x},\hat{I}(\hat{x})^{-1})

in a neighborhood of \hat{x}; this approximation contracts as \Sigma_y gets less diffuse. This is just a Bernstein-von Mises theorem with the diffusivity of the likelihood controlled directly through the parameter \Sigma_y, as opposed to through an increasing sample size. The book Statistical Decision Theory and Bayesian Analysis by Jim Berger has a good heuristic sketch of the classical Bernstein-von Mises theorems, and I used that book as a reference here.
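On the toy model this normal approximation can be checked directly: for a scalar Gaussian likelihood \textrm{N}(y | f(x), \sigma_y^2) the information at the solution works out to f'(\hat{x})^2/\sigma_y^2, and the resulting Gaussian standard deviation matches a grid computation near the mode (all values below are illustrative assumptions, with an essentially flat prior over the plotted neighborhood):

```python
import numpy as np

# Bernstein-von Mises / Laplace sketch for the toy model f(x) = x**2,
# y = 4, sigma_y = 0.1, near the positive solution x_hat = 2.
sigma_y, x_hat = 0.1, 2.0
fisher = (2 * x_hat) ** 2 / sigma_y ** 2   # f'(x) = 2x, so I = f'(x_hat)**2 / sigma_y**2
approx_sd = 1 / np.sqrt(fisher)            # sd of the approximating normal

# Exact posterior sd on a grid restricted to a neighborhood of x_hat:
xs = np.linspace(1.5, 2.5, 20001)
dx = xs[1] - xs[0]
p = np.exp(-0.5 * ((4.0 - xs ** 2) / sigma_y) ** 2)
p /= np.sum(p * dx)
m = np.sum(xs * p * dx)
grid_sd = float(np.sqrt(np.sum((xs - m) ** 2 * p * dx)))

print(approx_sd, grid_sd)  # both close to 0.025
```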

Now consider

||x - \mu_x - \hat{x}||_{\Sigma_x} <\epsilon

where \|\cdot\|_{\Sigma} denotes the Mahalanobis distance induced by \Sigma, and \epsilon is chosen such that \|y - f(x)\|_{\Sigma_y} < \delta for all x satisfying the previous inequality.

This gives us some intuition about the roles \Sigma_x and \Sigma_y play in the problem: they define what “closeness” means for the two different aspects of the solution. If, under the elliptical geometry of \Sigma_y, f(x) is close to y (or equivalently to f(\hat{x}), since y = f(\hat{x})), then, under the geometry of \Sigma_x, x is close to \mu_x + \hat{x}, which gives us the notion of “sufficiently close” discussed previously with the normality result.
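A quick sketch of the Mahalanobis distance itself, \|v\|_{\Sigma} = \sqrt{v^T \Sigma^{-1} v}, with a made-up diagonal covariance, shows how the covariance defines the elliptical geometry in which closeness is measured:

```python
import numpy as np

# Mahalanobis distance: the covariance sets the geometry, so the same
# displacement is "short" along high-variance axes and "long" along
# low-variance axes. Sigma here is an arbitrary illustrative choice.
def mahalanobis(v, Sigma):
    return float(np.sqrt(v @ np.linalg.solve(Sigma, v)))

Sigma = np.array([[4.0, 0.0], [0.0, 0.25]])
v = np.array([1.0, 1.0])
print(mahalanobis(v, Sigma))       # sqrt(1/4 + 1/0.25) = sqrt(4.25)
print(mahalanobis(v, np.eye(2)))   # Euclidean case: sqrt(2)
```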

Understanding what these parameters mean has value when setting up the machinery of the problem. Typically in Bayesian inversion people just assume isotropic normality for the prior and likelihood, but if there’s structural information about the inverse problem that can be framed in terms of these parameters, it might be beneficial to use it.
