Bayesian Decision Theory for Gaussian Process (GP) Models with an Application Towards Approximate Evaluation of Source Functions Generating the GP as a Solution to a Differential Equation.


(Featured image: The Old Guitarist, 1903-1904, Pablo Picasso)

EDIT: I started working on a draft of a preprint for this which you can find in the publication and awards section.

I recently got interested in decision theory in the Bayesian context after reading Statistical Decision Theory and Bayesian Analysis by Jim Berger, and because one of my advisors in graduate school did some work in the area. However, I noticed that the majority of the literature focuses on parametric models, and I think there's a lot of interesting potential to play around with decision theory in nonparametric models, particularly Gaussian processes. When I started thinking about this I quickly browsed through Google Scholar to see if anything had been done, and found a couple of things scattered across the statistics and operations research literature. That said, I didn't find anything in the vein of what I was looking to work on, so I believe (to the best of my knowledge) that it's a novel approach that hopefully people find interesting.

One of the immediate questions that pops up when considering decision theory in any context is "what loss function should we use?" With Gaussian process models, that question quickly leads into functional analysis and differential equations. Accordingly, we'll end up using inner products on Hilbert spaces and ODEs in this blog post, along with some preliminary material on Gaussian processes.

Preliminary Stuff on Gaussian Processes and Reproducing Kernel Hilbert Spaces

A Gaussian process is a stochastic process which serves as a distribution over a function space. It's commonly used in Bayesian analysis as a prior on spaces of functions. The defining characteristic of a Gaussian process is that for any finite collection of points in the domain of the random functions, the joint distribution of the function values over that finite set is multivariate Gaussian. We'll denote that a function is distributed according to a Gaussian process by writing f \sim \textrm{GP}(\mu,\textrm{Cov}_f), where \mu is the mean function of the random functions and \textrm{Cov}_f is the covariance function, which we'll get into below.

The covariance function of a function-valued random process is a function taking two values in the domain of the random functions and returning the covariance between the function values there:

\textrm{Cov}_f (u,t) = \int_f (f(u) - \mu (u) ) (f(t)- \mu (t) ) d\mathbb{P}_f = E_{\mathbb{P}_f}[(f(u) - \mu (u) ) (f(t)- \mu (t) )]

Where u,t\in\mathbb{U} are elements of the domain of the random function f

For the sake of simplicity, from here on out we're considering mean-standardized Gaussian processes, so \mu \equiv 0, giving us a simpler distribution, f \sim \textrm{GP}(0,\textrm{Cov}_f), and a simpler expression for the covariance function:

\textrm{Cov}_f(u,t) = E_{\mathbb{P}_f}[f(u)f(t)]
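As a concrete example (the kernel choice here is mine, not something fixed by the discussion above), the common squared-exponential covariance function can be written as a short Python function:

```python
import numpy as np

def rbf_cov(u, t, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance:
    Cov_f(u, t) = variance * exp(-(u - t)^2 / (2 * lengthscale^2))."""
    return variance * np.exp(-0.5 * (u - t) ** 2 / lengthscale ** 2)
```

Note that it is symmetric in its arguments and largest when u = t, as a covariance function should be.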

To get back to the point of the Gaussian process example we need to briefly explain what a reproducing kernel Hilbert space is. Let \mathbb{H} be a Hilbert space of functions over a set \mathbb{U}. For each u \in \mathbb{U} there is an evaluation functional L_u defined by L_u : f \rightarrow f(u) for all f \in \mathbb{H}. If the evaluation functional is bounded, then there exists a function K_u \in \mathbb{H} such that f(u) = L_u(f) = <f,K_u>_\mathbb{H}. This function K_u is the reproducing kernel of the Hilbert space \mathbb{H} at u. By the Riesz representation theorem such a function exists for every u \in \mathbb{U} so long as all the evaluation functionals are bounded.

Importantly, because K_u is itself a function in \mathbb{H}, for some other t\in\mathbb{U} we can evaluate K_u(t) = L_t(K_u) = <K_u,K_t>_\mathbb{H}. Then we can define a kernel function K(u,t) = <K_u,K_t>_\mathbb{H}. It is important to note that K(u,\circ) is a function on \mathbb{U}, while <K_u,\circ>_\mathbb{H} is a functional defined on \mathbb{H}. However, if the argument to <K_u, \circ>_\mathbb{H} is itself the reproducing kernel K_t for some value t\in\mathbb{U}, then the functional returns the value K_u(t) = K(u,t); this is important later.

The Moore-Aronszajn theorem says that if K is a symmetric positive definite kernel, then there exists an associated reproducing kernel Hilbert space (RKHS) for which K is a reproducing kernel.

Now to ground this a bit and bring it back to the Gaussian process defined above: looking at our Gaussian process measure f \sim \textrm{GP}(\mu, \textrm{Cov}_f). We have that the covariance of the Gaussian process is given by:

\textrm{Cov}_f(u,t) = \mathbb{E}_{\mathbb{P}_f}[f(u)f(t)]

Switching it around a bit, suppose we start with a kernel K(\circ,\circ): \mathbb{U}\times\mathbb{U} \rightarrow \mathbb{R}, which by the previously stated theorem defines an associated RKHS, \mathbb{H}_K. Then this kernel defines a covariance function given by:

\textrm{Cov}_{\mathbb{P}_f} (u,t) = K(u,t) = E_{\mathbb{P}_f}[f(u)f(t)] = <K_u,K_t>_{\mathbb{H}_K}

Where \mathbb{H}_K is the Hilbert space induced by the reproducing kernel K

This is how most people typically work with Gaussian processes in applied settings. They start with a kernel, and this kernel defines a covariance function on a RKHS implied by the kernel, and this covariance function dictates the properties of the random functions realized from the Gaussian process.
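To make that workflow concrete, here is a minimal sketch (using a squared-exponential kernel of my own choosing): pick a kernel, build the covariance matrix over a finite set of domain points, and draw random functions from the implied multivariate Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.linspace(0.0, 1.0, 50)                    # finite set of domain points
# Squared-exponential kernel evaluated at all pairs (u_i, u_j).
K = np.exp(-0.5 * (u[:, None] - u[None, :]) ** 2)
K += 1e-10 * np.eye(len(u))                      # jitter for numerical stability
# By the defining GP property, f(u) ~ N(0, K); each row is one realized function.
samples = rng.multivariate_normal(np.zeros(len(u)), K, size=3)
```

Each of the three rows of `samples` is a realization of f evaluated at the 50 grid points.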

Background for Bayesian Updating for GPs and Derivative Function of a GP

We can evaluate the covariance of the random process at a variety of locations by taking the outer product of the function values at those locations:

\textrm{Cov}_f(\textbf{u}) = E_{\mathbb{P}_f}[f(\textbf{u})f(\textbf{u})^T] = E_{\mathbb{P}_f}[[f(u_1),...,f(u_{|\textbf{u}|})]^T[f(u_1),...,f(u_{|\textbf{u}|})]] = K(\textbf{u},\textbf{u})

Where |\textbf{u}| is the length of the vector \textbf{u} and \textbf{u}\subset \mathbb{U} is a finite subset of \mathbb{U}.

Noting that the defining property of a Gaussian process is that its finite-dimensional marginals are normal, the Gaussian process evaluated at the points \textbf{u} is distributed multivariate normal:

f(\textbf{u}) \sim \textrm{N}_{|\textbf{u}|}(0, \textrm{Cov}_f(\textbf{u}))

The multivariate normal has a convenient conditioning property involving the Schur complement, so if we set up the following Bayesian hierarchical nonparametric model:

y = f(u)

f \sim \textrm{GP}(0,K)

and we observe y_1,...,y_n = \textbf{y} at points u_1,...,u_n = \textbf{u}, then we can get a conditional distribution for a new, unobserved value y^* at an unevaluated location u^* with the following formula:

 

y^*| \textbf{u}, \textbf{y} \sim \textrm{N}(\mu_{y^*},\sigma^2_{y^*})

Where \mu_{y^*} = K(u^*,\textbf{u})K(\textbf{u},\textbf{u})^{-1}\textbf{y}, and \sigma^2_{y^*} = K(u^*, u^*) - K(u^*,\textbf{u}) K(\textbf{u},\textbf{u})^{-1}K(u^*,\textbf{u})^T. If you want to incorporate observation noise into the model, the typical way to do it is to add a scaled identity matrix to your kernel matrix, i.e. K' = K+\sigma^2 I. This is not relevant here, but I added it for completeness.
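These conditioning formulas can be sketched directly in code; the squared-exponential kernel and noise-free observations below are illustrative assumptions of mine, not anything fixed above:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    # Squared-exponential kernel on all pairs (a_i, b_j).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def gp_predict(u_new, u_obs, y_obs, jitter=1e-10):
    """Posterior mean and covariance of y* | u, y via the Schur complement."""
    K_oo = rbf(u_obs, u_obs) + jitter * np.eye(len(u_obs))
    K_no = rbf(u_new, u_obs)
    alpha = np.linalg.solve(K_oo, y_obs)          # K(u,u)^{-1} y
    mean = K_no @ alpha                           # K(u*,u) K(u,u)^{-1} y
    cov = rbf(u_new, u_new) - K_no @ np.linalg.solve(K_oo, K_no.T)
    return mean, cov

u_obs = np.array([0.0, 0.5, 1.0])
y_obs = np.array([1.0, -0.5, 0.3])
mean, cov = gp_predict(np.array([0.25, 0.75]), u_obs, y_obs)
```

A quick sanity check: predicting back at the observed locations returns the observed values with (essentially) zero posterior variance, since the model is noise-free.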

Similarly, given a Gaussian process f with a kernel function K(u,t), we can obtain a closed form for the derivative process if the mixed partial derivative of the kernel exists. That is, if K_{11}'(u,t) = \frac{\partial^2}{\partial u \partial t} K(u,t) exists, then f is mean square differentiable with covariance function K_{11}'(u,t), so f' \sim \textrm{GP}(0, K_{11}'). This GP form of the derivative process is convenient for the ODE application below.

However, evaluating the derivative function as it correlates with values of the original random function is another matter. In this context, if one wants to realize a random function together with the derivative of that same random function, one needs the partial derivatives with respect to only one of the arguments of the covariance function: K_{01} = \frac{\partial}{\partial t} K(u,t), and similarly K_{10} = \frac{\partial}{\partial u} K(u,t). By the symmetry of the covariance function these are equivalent, just with u and t swapped.

The correlation between f and f' is a tad confusing, but to ground it, it helps to look at the multivariate Gaussian marginals. Say we want to realize f as well as its associated derivative f' at values u_1,...,u_n = \textbf{u}; then we construct a joint multivariate Gaussian on \mathbb{R}^{2n} and realize them both jointly as follows:

\begin{pmatrix} f(\textbf{u}) \\ f'(\textbf{u})\end{pmatrix} \sim \textrm{N}(0,\begin{pmatrix} K_{00}(\textbf{u},\textbf{u}) & K_{01}(\textbf{u},\textbf{u}) \\ K_{10}(\textbf{u},\textbf{u})& K_{11}(\textbf{u},\textbf{u}) \end{pmatrix})

A realization from this multivariate Gaussian will contain realizations for f at \textbf{u} in the first n elements, and realizations from the derivative of f at \textbf{u} in the second n elements. This is important later.
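For the squared-exponential kernel (again my illustrative choice, where the derivative blocks have simple closed forms), the joint draw above can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)
l = 1.0
u = np.linspace(0.0, 1.0, 20)
r = u[:, None] - u[None, :]               # pairwise differences u_i - u_j
K00 = np.exp(-0.5 * r**2 / l**2)          # Cov(f(u_i), f(u_j))
K01 = (r / l**2) * K00                    # Cov(f(u_i), f'(u_j)) = d/dt K(u, t)
K10 = K01.T                               # Cov(f'(u_i), f(u_j)) = d/du K(u, t)
K11 = (1.0 / l**2 - r**2 / l**4) * K00    # Cov(f'(u_i), f'(u_j))
joint = np.block([[K00, K01], [K10, K11]]) + 1e-8 * np.eye(2 * len(u))
draw = rng.multivariate_normal(np.zeros(2 * len(u)), joint)
f_vals, fprime_vals = draw[:len(u)], draw[len(u):]   # f and f' realized together
```

The 2n-by-2n block matrix is exactly the joint covariance in the display above, and the small jitter on the diagonal is only there for numerical stability.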

Bayesian Decision Theory for Gaussian Processes with an Application to Ordinary Differential Equations (ODEs)

Decision theory is the science of making decisions in uncertain settings with some notion of loss or utility involved. Typically, for a parameter in a Bayesian model, \theta \in \Theta, where we observe some data and get the posterior P(\theta | X), our goal is to make a decision \delta(X) \in \Delta, a function of our observed values X, that either minimizes a loss L(\delta(X),\theta) or maximizes a utility U(\delta(X),\theta). Without loss of generality we can restrict ourselves to loss functions, because we can always take the negative of a utility as a loss to be minimized.

The Bayes decision \delta^* is the decision that minimizes the posterior expected loss: \delta^* = \arg\min_{\delta \in \Delta}E_{P_{\theta | X}}[L(\delta, \theta)]
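As a quick sanity check of this definition with squared-error loss (on a toy discrete posterior of my own making), the minimizer of the posterior expected loss is the posterior mean:

```python
import numpy as np

# Toy discrete posterior (hypothetical support and probabilities).
theta = np.array([0.0, 1.0, 2.0])     # posterior support points
post = np.array([0.2, 0.5, 0.3])      # posterior probabilities P(theta | X)

def expected_loss(delta):
    # Posterior expected squared-error loss E[(delta - theta)^2 | X].
    return float(np.sum(post * (delta - theta) ** 2))

# Search a grid of candidate decisions for the minimizer.
grid = np.linspace(-1.0, 3.0, 4001)
bayes_decision = grid[np.argmin([expected_loss(d) for d in grid])]
# Under squared-error loss the Bayes decision is the posterior mean, here 1.1.
```
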

So how is this relevant to differential equations? Well, our Gaussian process is a distribution over functions, so if we encode the boundary conditions of the differential equation as our observed data, then we can find optimal predictive values for the source term by minimizing a loss function that measures how far the ODE, evaluated at the GP, is from that source term.

To clarify what I mean by this: assuming the Gaussian process is the solution to an ODE, we can take the source function as our decision and minimize the loss with respect to it, thereby approximately evaluating the source function that generated the Gaussian process as a solution to the differential equation.

That is, if the Gaussian process conditioned on the boundary condition as its observation is a random function, and that function is the solution to some ODE with some source function, then minimizing the posterior expected value of a loss which measures the difference between the ODE applied to the GP and a candidate source function minimizes the long-run risk of assuming that source function generated the GP as a solution to the differential equation.

For the sake of clarity we consider a simple first-order linear ODE: y^{(1)}(u) + P(u)y(u) = Q(u) with y(u^*) = C, where P(u) is known and Q(u) is unknown.

Then a loss that makes sense for this goal is the squared L_2 distance between the left-hand side of the ODE and the source term Q(u): ||y^{(1)}(u) + P(u)y(u) - Q(u)||^2_{L_2}. The more similar y^{(1)} + P(u)y(u) and Q(u) are, the lower this loss.

Taking the posterior expected loss we get:

E_{y,y' | y(u^*) = C}[\int_{u}(y^{(1)}(u) + P(u)y(u) - Q(u))^2du]

Under the assumptions of Fubini’s Theorem (which are going to be situation dependent), we can exchange the integration and expectation to give us:

\int_u E_{y,y' | y(u^*) = C}[(y^{(1)}(u) + P(u)y(u) - Q(u))^2] du

Then our goal is to minimize this loss. This is where the "pointwise" interpretation comes in.

Recall that the joint distribution of y and y' is given by:

\begin{pmatrix} y(\textbf{u}) \\ y'(\textbf{u})\end{pmatrix} \sim \textrm{N}(0,\begin{pmatrix} K_{00}(\textbf{u},\textbf{u}) & K_{01}(\textbf{u},\textbf{u}) \\ K_{10}(\textbf{u},\textbf{u})& K_{11}(\textbf{u},\textbf{u}) \end{pmatrix})

But we have observed y(u^*) = C, so, conditioning on y(u^*) = C for unobserved predictive values \textbf{y}=y_1,...,y_n and \textbf{y}' = y'_1,...,y'_n over domain values \textbf{u} = u_1,...,u_n, we get:

\textbf{y},\textbf{y}' | y(u^*)=C \sim \textrm{N}_{2n}( \begin{pmatrix} \mu^*(\textbf{u}) \\ \mu'^*(\textbf{u}) \end{pmatrix},\begin{pmatrix} K_{00}^*(\textbf{u},\textbf{u}) & K^*_{10}(\textbf{u},\textbf{u}) \\ K^*_{01}(\textbf{u},\textbf{u})& K^*_{11}(\textbf{u},\textbf{u}) \end{pmatrix}) = \textrm{N}(\mu^*, K^*)

Then, reducing the previous integral over u in our loss to a sum under the counting measure on \textbf{u}, we get the following:

\sum_{u \in \textbf{u}} E_{y,y' | y(u^*) = C}[(y^{(1)}(u) + P(u)y(u) - Q(u))^2]

Our decision problem is then to minimize this loss over the source values: \delta^* = \arg\min_{Q(\textbf{u})} \sum_{u \in \textbf{u}} E_{y,y' | y(u^*) = C}[(y^{(1)}(u) + P(u)y(u) - Q(u))^2], where Q(\textbf{u}) denotes the values of the source function at the domain values \textbf{u}.

Note that each term of the sum above is a quadratic expectation in Y = \textbf{y},\textbf{y}' | y(u^*)=C. For each u_i \in \textbf{u}, define the row vector A_i \in \mathbb{R}^{1\times 2n} with P(u_i) in entry i, 1 in entry n+i, and zeros elsewhere, so that A_i Y = P(u_i)y(u_i) + y^{(1)}(u_i). Each term of the loss is then E_{y,y' | y(u^*) = C}[(A_i Y - Q(u_i))^2], and the sum expands to:

\sum_{i=1}^{n} E_{y,y' | y(u^*) = C}[(A_i Y)^2 - 2Q(u_i)A_i Y + Q(u_i)^2]

By the standard result for expectations of quadratic forms, E_{y,y' | y(u^*) = C}[(A_i Y)^2] = \textrm{tr}(A_i^T A_i K^*) + \mu^{*T} A_i^T A_i \mu^* = A_i K^* A_i^T + (A_i \mu^*)^2, so the posterior expected loss becomes:

\sum_{i=1}^{n} [A_i K^* A_i^T + (A_i \mu^*)^2 - 2Q(u_i)A_i \mu^* + Q(u_i)^2] = \sum_{i=1}^{n} [A_i K^* A_i^T + (A_i \mu^* - Q(u_i))^2]

This is minimized by setting each Q(u_i) to the posterior mean of the ODE's left-hand side, Q(u_i) = P(u_i)\mu^*(u_i) + \mu'^*(u_i); these are the source function values Q(\textbf{u}) at \textbf{u}\subset \mathbb{U} which minimize the L_2 loss, i.e. the long-run risk, of generating the GP as a solution to the differential equation.
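Putting the whole procedure together, here is a minimal numerical sketch under illustrative assumptions of mine (squared-exponential kernel, P(u) \equiv 1, boundary condition y(0) = 2): condition the GP and its derivative on the boundary value, then read off the decision, which works out to the posterior mean of y^{(1)} + P y at each point:

```python
import numpy as np

l = 1.0
def k00(a, b):
    # Squared-exponential kernel: Cov(y(a_i), y(b_j)).
    r = a[:, None] - b[None, :]
    return np.exp(-0.5 * r**2 / l**2)
def k10(a, b):
    # Cov(y'(a_i), y(b_j)) = d/du K(u, t) evaluated at (a_i, b_j).
    r = a[:, None] - b[None, :]
    return (-r / l**2) * k00(a, b)

u = np.linspace(0.0, 1.0, 10)           # locations where we decide Q(u)
u_star = np.array([0.0]); C = 2.0       # boundary condition y(0) = 2
P = np.ones_like(u)                     # illustrative known coefficient P(u) = 1

# Posterior means of y(u) and y'(u) given y(u_star) = C (GP conditioning).
Kinv_C = np.linalg.solve(k00(u_star, u_star), np.array([C]))
mu_y  = k00(u, u_star) @ Kinv_C         # E[y(u)  | y(u_star) = C]
mu_yp = k10(u, u_star) @ Kinv_C         # E[y'(u) | y(u_star) = C]

# Decision minimizing the quadratic loss: Q(u_i) = P(u_i)*mu_y_i + mu_yp_i.
Q_hat = P * mu_y + mu_yp
```

At the boundary point itself the conditioning forces the posterior mean of y to C and of y' to zero (the derivative cross-covariance vanishes at zero lag for this kernel), so Q_hat there is simply P(u^*)C.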

Conclusion

A lot of this is probably overkill for such a simple differential equation, but I just wanted to show it could be done. Some interesting things to consider are whether this can be used to show or enforce smoothness properties for higher-order differential equations, or for partial differential equations. Additionally, the number of boundary conditions does not matter in this context, and other loss and utility functions could be used. Conditioning the GP at other locations is also valid, and one could even use all observed values of a GP as boundary conditions (more than one boundary condition for this ODE would not create any difficulties).

One thing I wanted to consider when working on this was financial applications; where differential equations and stochastic processes are prominent. Another thing to consider is that a Gaussian process is a convenient approximation to other stochastic processes, or deep learning models. I’m also curious about using decision theory for Dirichlet process models, and what kind of loss / utility functions would fit well in that space. In the Dirichlet process mixture setting it might open up some possibilities to properly assign subjects to clusters given some sense of costs or rewards.

If you stuck with it this far I appreciate it, and hopefully the write up made sense. Thanks for checking out my blog :).

Ryan Warnick, PhD