Kernel-Embedded Gaussian Processes for Bayesian Computation and Model Evaluation

(Work in Progress)

It had been a while since I had worked on a technical blog post, so I wanted to get something out there. I have been playing around with a lot of concepts in my head centering on Gaussian processes and using them for Bayesian computation. Most of the time when I write on technical topics I’m working on it as I’m writing (e.g. using the blog post as my scratch pad). I think this serves two purposes: the first is it gives me a framework for working through a problem and an evolving template, and the second is it makes the dialogue surrounding the central themes of the post more narrative in structure.

The problem I was toying around with in my head was something to the effect of having a grid of values which we could pass through the unnormalized posterior, and then fitting a gaussian process to them embedded in a kernel function. At first pass this does nothing really useful, as the unnormalized-kernel-embedded gaussian process is only really fitting values which might be much higher or much lower in absolute value than the true posterior value after normalization. I was curious if we might be able to use some trickery with KL divergences and some functional analysis to figure out a way to normalize the gaussian process to better fit the true posterior.

The uncertainty of the gaussian process would represent the so called uncertainty “around” the true value of the posterior at each particular point, and by incorporating observation uncertainty into the fixed points we can prevent overfitting the unnormalized posterior at these values.

I should note, when I was thinking about doing this I was mostly interested in it as a mathematical curiosity, and as I’m going I’m thinking about ways in which the work could be practically useful.

The first step is to find a grid of values \{\theta_k\}_{k=K}. Evaluate this grid within the unnormalized posterior to get an unnormalized approximation:

P(X|\theta_k)P(\theta_k) = \xi_k.

Take \exp[\circ], and we place a specific Gaussian process inside:

\exp[f(\theta)] with f(\theta) \sim GP(\mu, \exp[-t K(\Theta,\Theta)])

This is an Ornstein Ohlenbeck process where the process as a function of time tends towards the mean process around \mu ; where the mean process around \mu has covariance K(\Theta,\Theta). The reason I selected the Ornstein Ohlenbeck process is because it has a closed form for the distribution of the maximum.

The goal here is to think about t and it’s filtration as representing the amount of information we’re giving from the grid of observed values to the posterior process f(\theta) | \xi_k \sim GP(\mu^*, K^*(t,K)). In a sense t represents a (possibly fractionally valued) amount of information repeatedly applied by \xi to f(\theta) | \xi .

We note that this is a log gaussian process and if we take the log unnormalized values \{\xi\}_{k=1}^K -c where c is the log normalizing constant of the posterior. We can get what the embedded process should be:

\log \xi_k - c \sim GP(\mu,K).

Fit these values to obtain \exp[f(\theta) - c] with f(\theta) \sim GP(\mu^*,K^*).

We then try and normalize this log gaussian process:

\int_{\theta \in \Theta} \exp[f(\theta) - c] d\theta with f(\theta) \sim GP(\mu^*, K^*).

I was mainly curious if we could incorporate some type of LogSumExp function with the Riemann integral of this smooth function. E.g. come up with a partition of \Theta, \mathbf{V}_n, that increases in resolution as n\rightarrow \infty.

\int_{\theta \in \Theta} \exp[f(\theta)-c] d\theta = \lim_{n\rightarrow \infty} \sum_{v \in \mathbf{V}_n} |v| \sup_{\theta \in v} \frac{\exp[f(\theta)]}{\log c}

with f \sim GP(\mu^*,K^*).

This is where the LogSumExp comes in. Note that this function is a smooth approximation to the maximum, and that the following is true:

LSE(\log(x_1) + \dots + \log(x_n) = \log(x_1 + \dots + x_n)

Placing this function inside the Riemann sum in the limit gives us:

\lim_{n\rightarrow \infty} \exp[LSE[\log(|v_1|) + \sup_{\theta \in v_1} f(\theta) - c, \dots, \log(|v_n |) + \sup_{\theta \in v_1} f(\theta)]] = \int_{\theta \in \Theta} \frac{\exp[f(\theta)]}{c} d\theta

This is where some trickery comes in. We know that we’re approximating our posterior with this kernel-embedded Gaussian process, and this is the Riemann integral over these random functions, so we set it equal to 1:

\frac{1}{log c}\lim_{n\rightarrow \infty} \exp[LSE[\log(|v_1|) + \sup_{\theta \in v_1} f(\theta), \dots, \log(|v_n |) + \sup_{\theta \in v_1} f(\theta)]] = \int_{\theta \in \Theta}  \frac{\exp[f(\theta)]}{c} d\theta = 1

Note that if the partition is uniform then the \{log(|v_i|)\}_{i=1}^n is all constant, allowing us to take it out of the LSE and into the exponential to multiply out, to get:

\lim_{n\rightarrow \infty} \exp[LSE[\sup_{\theta \in v_1} f(\theta), \dots, \sup_{\theta \in v_n} f(\theta)]] *|v_n | = \int_{\theta \in \Theta} \exp[f(\theta)] d\theta = \log c

I’m going to revisit this later when I have some time to read more on extreme value theory, but what I was trying to come up with was a way to get a distribution over marginal likelihoods. There might be different interpretations of this quantity related to model validation and quality depending on the original kernel of the gaussian process (K). This is kinda circuitous, but it’s been interesting to play around with.

Using Decision Theory and KL Divergences to Find Optimal Approximation to Posterior:

Note that our original goal was an approximation to the posterior by a log-gaussian random function.

Leave a Reply

Discover more from Ryan Warnick Phd.

Subscribe now to keep reading and get access to the full archive.

Continue reading