Dirichlet Process Projections

Disclaimer: This blog post is a work in progress, but I wanted to share here in case anyone had any thoughts on it.

It had been a while since I had done any personal pet technical projects on my blog, but I had an interesting realization while playing around with Principal Components Analysis (PCA) on another project, and wanted to do something related to that in the context of Bayesian nonparametric.

I had an intuition on something related to projections on dirichlet process distributed random vectors, and I was curious if there was a way to do a kind of nonparametric projection onto an infinite-dimensional space that would be useful for certain high-dimensional or infinite-dimensional dimension reduction problems, so I decided I would delve into it. I’m also trying to get familiar with the new GitHub copilot tool on VSCode, so, unlike my other blog posts, I have a computational component to this that you can find here with a notebook illustrating the approach.

My first thought about this was to see if I could come up with a simple model built around the Dirichlet process and see if some interesting conjugacy results would fall out. At first pass the most obvious model specification was something to the effect of the following:

$X_1,...,X_n \textrm{proj}(\textrm{span}(g_i)) \overset{iid}{\sim} \textrm{N}_1(0,||g_i||^2)$
$g_i \overset{iid}{\sim} G$
$G \sim \textrm{DP}(\nu, G_0)$
$G_0 = \textrm{N}_{\textrm{dim}(\mathcal{X})}(0, \tau I_{\textrm{dim}(\mathcal{X})})$
$\nu \in \mathbb{R}$
$\tau \in \mathbb{R}$

The only difference here from a mathematical perspective from the normal dirichlet process mixture of normals is that the data are not univariate but we are projecting onto a univariate distribution with support on a one dimensional subspace of $\mathcal{X}$ . To provide some intuition about what this is doing, it’s saying we have a distribution over distributions over vectors, and then we project onto the span of these vectors. This was my first approach, but it’s not quite effective because the vectors might not be orthogonal. This creates a dilemma because we can’t provide the orthogonality in the dirichlet process in any obvious way; as the atoms of a dirichlet process are independent by construction. Some people more familiar with the DP might also see some problems in the uncountability of the DP with the finite $\textrm{dim}(\mathcal{X})$ . This is accounted for later, but the previous model is still illustrative.

So it appears that if we want to preserve the Dirichlet process as a feature of the model, then we need to find a way to incorporate the necessary orthogonality into the top level of the hierarchy. This means that we need to modify the $X_1,...,X_n \textrm{proj}(\textrm{span}(g_i))\sim \textrm{N}_1(0,||g_i||^2)$ component somehow. This turns out to be quite tricky.

One way to do this is to construct a varying dimension hierarchy, where each level of the hierarchy going down removes the projection of the current vector on the span of all previous ones. Expressing this mathematically we have a random $K_G \in \mathbb{N}$ which is the number of atoms of the Dirichlet process, which we’ll get into later as there’s been a lot of work on quantifying this random natural number for various models, and then we proceed with the following:

$X_1,...,X_n \textrm{proj}(\textrm{span}(g_1)) \overset{iid}{\sim} \textrm{N}_1(0,||g_1||^2)$
$X_1,...,X_n \textrm{proj}(\textrm{span}(g_2 - \textrm{proj}(g_2\rightarrow \textrm{span}(g_1)) \overset{iid}{\sim} \textrm{N}_1(0,||g_2 - \textrm{proj}(g_2\rightarrow \textrm{span}(g_1)||^2)$
$...$
$X_1,...,X_n \textrm{proj}(\textrm{span}(g_{K_G} - \textrm{proj}(g_{K_G}\rightarrow \textrm{span}(g_1,...,g_{K_G-1})) \overset{iid}{\sim} \textrm{N}_1(0,||g_{K_G} - \textrm{proj}(g_{K_G}\rightarrow \textrm{span}(g_1,...,g_{K_G-1})||^2)$

Forgive me for the poor formatting due to the limited space on the blog page :).

I think looking at this hierarchy there is some natural confusion (for me at least initially) about what this means in the context of a hierarchical model. Think of this hierarchy as a joint distribution on the $X_1,...,X_n$ , as opposed to thinking of having the data with multiple distributions. There’s certain standards about specifying distributions in this hierarchical structure, and the need to show that the distribution is well posed. So let’s look at that now to see what kind of properties it contains.

Theorem 1: The joint distribution defined by the hierarchical projections onto orthogonal random vectors is well defined and non-atomic.

Proof: This proof will be a tad bit informal for the sake of illustration. We need to show that $X\textrm{proj}(\textrm{span}(g)) \overset{iid}{\sim} \textrm{N}_1(0,||g||^2)$ with a distribution $g\sim \textrm{N}_{\textrm{dim}(\mathcal{X})}(\underbar{0},\tau I_{\dim(\mathcal{X})})$ defines a joint distribution on $X , g \in \mathcal{X}$ and similarly that for any length hierarchy for $K_G \in \mathbb{N}$ the conditional distributions completely specify a joint distribution. E.g.:

$P(X|g_1,...,g_{K_G})P(g_1,...,g_{K_G})P(K_G) = P(X,g_1,...,g_{K_G},K_G)$

is a well defined joint model. There’s also the possibility of some identifiability coming from the hierarchical structure of the joint model which we will get into in another theorem later

The first step is that the first two distributions define a joint model on $X, g \in \mathcal{X}$ . Note that the distribution of $||g||^2$ is chi squared, so we have that this distribution is a distribution over a projected axis in a circle around the origin, where the magnitude of the direction along the radial axis is a mixture of normals with a chi square mixing measure on the variance. The interesting fact about the projection is that as you move circularly around the origin the projection of any individual $x \in \mathcal{X}$ is elliptical as you move the vector onto which you rotate the axis of projection. This is intuitive but can be seen using Lagrange multipliers on the norm of the vector $g$ . So we have an ellipse, with the radius of the ellipse being a chi square mixture of normals. Thus the marginal distribution of the $X$ has a distribution has support over the full $\mathcal{X}$ and is also non-atomic. The distribution on $g$ is just a normal so is clearly well defined.

We now proceed by induction on $P(X|g_1,...,g_k)P(g_1,...,g_k)$ and increase the dimension of $K_G$ by one to see what happens with $P(X|g_1,...,g_{k+1})P(g_1,...,g_{k+1})$ . Our goal is to prove by induction that if the first distribution is elliptical then the next distribution is a higher dimensional ellipse. Note also that $P(g_1,...,g_k)=P(g_1)...P(g_k)$ because the atoms of a dirichlet process are almost surely independent.

If $P(X|g_1,...,g_k)P(g_1,...,g_k)$ is elliptical, then $g_{k+1} - g_{k+1}\textrm{proj}(\textrm{span}(g_1,...,g_k))$ is going to be orthogonal (by construction) from $\textrm{span}(g_1,...g_k)$ where length of the vector along that axis $g_{k+1}$ is going to be the square root of another chi squared random variable. If we project onto the span of the orthogonal vector as it rotates around the origin the resulting trace is elliptical. In other words, we’re repeating the same elliptical motion around the axis of symmetry of the original ellipse, with the magnitude of the ellipse protruding orthogonal to the first ellipse being a chi square mixture of gaussians. This adds an additional axis of symmetry to the original ellipse and creates a higher dimensional ellipsoidal distribution. We showed that it is true for $k=1$ so by induction it is true for all $k$ .

That is to say. For any fixed $K_G$ , we just have a change of axis and increase of diffusiveness due to the chi squared mixture on the variance along each direction.

There is one additional thing we’re interested in along the same lines, and that is what happens concerning $P(K_G)$ . Note that if $K_G > \textrm{dim}(\mathcal{X})$ then any additional atoms of the DP will be projected onto the full $\textrm{span}(\mathcal{X})$ and the resulting subtraction of the projection reduce the projected vector to a point mass at zero – almost surely. This was the problem with using the DP on the finite dimensional problem I mentioned earlier, but I wanted to talk about this specifically in the context of the DP for some work I plan to do later in the context of function spaces where the dimension of the orthogonal set of vectors is truly uncountable. In this sense, it would be beneficial for finite dimensional problems such as this one to truncate the Dirichlet Process to a finite dimensional Dirichlet mixture on $\{1,...,\textrm{dim}(\mathcal{X})\}$ .

This gives us the following model:

I want to note something else here with another theorem now that the model has been more fleshed out:

Theorem 2: The distribution above on $X_1,...,X_n$ for the top hierarchy and fixing $K_G = \textrm{dim}(\mathcal{X})$ , so that the distribution on $K_G$ and $\underbar{p}$ are removed and holding $K_G$ fixed, has the following properties:

$X_1,...,X_n | \hspace{5mm}||g_1||=c_1,....,||g_{K_G} - \textrm{proj}(g_{K_G}\rightarrow \textrm{span}(g_1,...,g_{K_G-1})||=c_{K_G} \overset{iid}{\sim} \textrm{N}_{\textrm{dim}(\mathcal{X})}(0, G C G^T)$

where $G = [g_1^T,....,(g_{K_G} - \textrm{proj}(g_{K_G}\rightarrow \textrm{span}(g_1,...,g_{K_G-1}))^T]$ and $C = \textrm{diag}(c_1,...c_{K_G})$ where $\textrm{diag}(\circ)$ is the matrix of zeros with the argument to the diag operator along the diagonal.

This is essentially saying if we condition on $\forall k\in\{1,...,K_G\} \textrm{ where } ||g_k-\textrm{proj}(g_k\rightarrow \textrm{span}(g_1,...,g_{k-1})|| = c_k$ which is equivalent to a radial distribution on a sphere, then the conditional distribution of the $X_1,...,X_n$ conditioning on $g_k-\textrm{proj}(g_k\rightarrow \textrm{span}(g_1,...,g_{k-1})$ is a multivariate normal with covariance matrix having eigenvectors $g_1,...,g_{K_G}$ and eigenvalues $c_1,...,c_{K_G}$ . Similarly, if we fix $1\leq K_G < \textrm{dim}(\mathcal{X})$ we get the same distribution just with $c_i = 0$ $\forall i > K_G$

Proof:

……….

Markov Chain Monte Carlo Algorithm (MCMC):

People familiar with more advanced Bayesian computation can probably see that the algorithm to sample from the posterior (which is not available in closed form) is likely to be a Metropolis-Hastings-Green within Gibbs algorithm. The Metropolitan-Hastings-Green comes from the dimension changing. It will be interesting to see if there is any types of conjugacy or mathematical properties of the Green algorithm as well as the full conditionals for the Gibbs component.

We split $\{g_i\}_{i=1,...,\textrm{dim}(\mathcal{X})}$ into a directional and magnitude component, and first start with a fixed dimension of $K_G$ and then increment from there to analyze the dimension changing component.

We have that $||g_i|| \sim \chi^2(\textrm{dim}(\mathcal{X}))$ which is also the same as saying that $||g_i|| \sim \textrm{Gamma}(\frac{\textrm{dim}(\mathcal{X})}{2},\frac{1}{\tau 2})$ projecting the data $\textbf{X}$ onto $\textrm{dir}(g_i)$ we get that $\textbf{X} \textrm{proj} \textrm{dir}(g_i)$ are our observations and our likelihood is a normal with mean zero and unknown precision. This is conjugate and we get that: $\frac{1}{||g_i||} \hspace{3mm}| \hspace{3mm} \textrm{dir}(g_i), \textbf{X}, K_G \sim \textrm{InverseGamma}(\frac{\textrm{dim}(\mathcal{X})}{2} + \frac{n}{2}, \frac{1}{\tau 2} + \frac{\sum_{j=1}^n (X_j \textrm{proj}(\textrm{dir}(g_i)))^2}{2})$
For the directional component of (which are orthonormal vectors) we have that is an element of the Stiefel manifold on orthonormal p-frames. Thus the distribution on the projected vectors is another way to come up with a distribution over the p-Steifel manifold. There exists a couple of distribution on Steifel manifolds already, notably the Bingham-Von-Mises-Fisher family of distributions and subtypes built around parameter restrictions to this family. We choose to model as a Bingham family with A fixed to the identity. This gives us one parameter to control over our prior, :
- $P(G | X, B, ||G||) \propto \exp[\textrm{tr}(X^TG^T||G||GX)]\exp[\textrm{tr}(BG^TG)]$ where $||G||$ is $\textrm{diag}(\{||g_i||\}_{i=1}^m)$
- Using cyclical properties of traces and their additivity this gives us:
- $P(G| X, B, ||G||) \propto \exp[\textrm{tr}(XX^TG^T||G||G + BG^TG)$
- Note that because $G$ is orthogonal $G^T = G^{-1}$ giving us:
  $P(G|X, B, ||G||) \propto \exp[\textrm{tr}(XX^TG^T||G||G) + B)] = \exp[\textrm{tr}(XX^TG^T||G||G) + \textrm{tr}(B)]$ .
- The $\exp[\textrm{tr}(B)]$ is a constant term which factors out, giving us that our full conditional is: $P(G|X, B, ||G||) \sim \textrm{B}(G; ||G||, XX^T)$ .
- This means the posterior is invariant under the parameter $B$ , giving us that the only prior fitting into this setting is the prior $\textrm{B}(G; I, I)$ which is the same as our isometric normal projections (or uniform on the $K_G, \textrm{dim}(\mathcal{X})$ Steifel manifold.

Note that both of these full conditionals are invariant under the choice of $K_G$ .

Sampling from Bingham distributions is challenging so here I propose a new sampler built on the orthogonal projections.

Sample $G \sim \textrm{MN}(0, ||G||^{-1}, 2(XX^T)^{-1}) \propto \exp[-\textrm{tr}(XX^TG^T||G||G)]$ which is the matrix normal distribution. Then set $g_1 = \frac{G_1}{||G_1||}$ , $g_2| g_1 = G_2 - G_2\textrm{proj}(\textrm{span}(g_1))$ then $g_2 = \frac{g_2}{||g_2||}$ repeating this process inductively. This induces a distribution over Steifel Manifolds with density $\propto \exp[\textrm{tr}(XX^TG^T||G||G)]$ . We then accept or reject based on a Metropolis-Hastings ratio, where the Jacobian of changing from a distribution on euclidean space to a Steifel manifold cancels out between the numerator and denominator.

Function Spaces:

As mentioned previously the goal was to use the same hierarchical projection structure on a distribution over functions with a DP as the mixing measure; as opposed to the Dirichlet distribution on the ordinal numbers $\{1,...,\textrm{dim}(\mathcal{X})\}$ . The base measure of the DP would naturally be a stochastic process defined over functions; typically a Gaussian Process. This would have the structure of the following:

$X_1,...,X_n \textrm{proj}(\textrm{span}(g_1))\overset{iid}{\sim} \textrm{GP}(0, <g_1,g_1>)$
$....$
$X_1,...,X_n \textrm{proj}(\textrm{span}(g_{K_G}-\textrm{proj}(g_{K_G}\rightarrow \textrm{span}(g_1,...,g_{K_G-1})) )\textrm{GP}(0,<g_{K_G}-\textrm{proj}(g_{K_G}\rightarrow \textrm{span}(g_1,...,g_{K_G-1})),g_{K_G}-\textrm{proj}(g_{K_G}\rightarrow \textrm{span}(g_1,...,g_{K_G-1}))>)$
$g_i \overset{iid}{\sim} G$
$G \sim \textrm{DP}(\nu, G_0)$
$G_0 = \textrm{GP}(0, K)$

Ryan Warnick Phd.

Dirichlet Process Projections

Leave a ReplyCancel reply

Dirichlet Process Projections

Share this:

Like this:

Leave a ReplyCancel reply

Discover more from Ryan Warnick Phd.