A Few Results on Spatial Dirichlet Processes

(EDIT: This work culminated in a research article published here.)

I did some work towards the tail end of graduate school and a little while after during my professional career which was never completed, and never published, on Spatial Dirichlet Process models; as well as convolutions of Spatial Dirichlet Process models with white noise base measure. You can read the notes on the publications page, but the notes were written at a time when I was struggling with some things going on in my personal life, and are generally speaking somewhat erratic and confused. Still, there’s a lot of decent work in there that never got to see the light of day, and I wanted to introduce three theorems from the notes (and fix some issues) here with a better exposition to explain what exactly was going through my mind as the notes were being written.

My technical blog posts lean pretty technical already, but this blog post will use some fairly heavy duty math (at least from my perspective), so I also wanted to give an introduction to things here that could allow people who had not had any familiarity with these concepts to follow and to use it in their own work in other technical settings should they so choose. The primary things of interest are Gaussian processes, Hilbert spaces, compact integral kernel operators between Hilbert spaces and convergent sequences of compact integral kernel operators, the Wasserstein metric on probability measures, the Dirichlet process and it’s associated Sethuraman representation, and a few other bits and pieces here and there that will be outlined in the three main proofs.

A lot of this stuff can be fairly mind bend-y (think distributions over distributions over spatial fields, and measuring distances between them), but hopefully talking about it here in a more casual format which doesn’t have to appease any reviewers can make it more digestible than the fairly opaque (and error ridden) notes.

Also, as an aside, I highly doubt this will be practically useful for anything directly :), but it was fun to think about.

Dirichlet Processes:

Dirichlet Processes (DPs) are stochastic processes which for which realizations from a DP are almost-surely discrete distributions over some measurable space. That is, assuming P \sim DP(\alpha, P_0) ( where \alpha > 0 and P_0 is a measure on some measurable space (\Sigma, \mathbb{B}). For the sake of exposition with the main proofs assume that \Sigma is a topological space and \mathbb{B} is the Borel \sigma-algebra

The primary defining characteristic of a DP is that the distribution of the amount of mass assigned to each element of a mutually exclusive and exhaustive partition \mathbf{A} = \{A_i\}_{I=1}^m is:

P(A_1), .... , P(A_m) \sim \textrm{Dirichlet}(\alpha P_0(A_1),...., \alpha P_0(A_n))

However, there is a constructive definition of a measure defined using a DP called the Sutheraman representation [1] for which the proof of the Sethuraman representation (which is omitted in this post) shows that realizations of the DP are almost surely discrete.


P(B) = \sum_{j=1}^\infty \pi_j\delta_{\theta_j}(B)

With \theta_j \overset{iid}{\sim} P_0 and \{\pi_j\}_{j=1}^\infty defined in a recursive scheme where \pi_j = \pi'_j\prod_{k=1}^{j-1}(1-\pi'_k) where \pi'_k \overset{iid}{\sim} \textrm{Beta}(1, \alpha) \forall k \in \{1,...,j\}.

So we can see that each realization from a DP is a almost-surely discrete probability distribution.

Spatial Dirichlet Process:

Recall that we said that the DP requires a measure and a concentration parameter \alpha \geq 0; and nothing else. [2] introduced a model where the base measure is a Gaussian Process (GP), creating a random distribution over stochastic processes. I wrote some preliminary stuff on GPs and Reproducing Kernel Hilbert Spaces (RKHSs) here in the first section. I didn’t want to go over any of that stuff again, but essentially the important pieces boil down to a GP, \textrm{G} \sim \textrm{GP}(\mu_0, K_0), being defined over some compact cdomain \mathbf{T} and having support in the reproducing kernel Hilbert space of the operator K_0; with a mean function \mu_0 which is the mean of the random process, and a covariance operator K_0. Recall that GPs are almost surely in the RKHS of their covariance operator, giving us that the Hilbert space on which they are defined has a naturally induced norm given by the square root of the inner product:

||f|| = \sqrt{<f,f>_{K_0}}

This allows us to metricize the space and induce a topology which can then be extended to a measurable space with the Borel sets defined on the topology.

The spatial DP is a DP with a spatial process as the base measure, and for the sake of this post we’ll be considering GPs as the base measure. This is expressed as follows, restricting ourselves to mean zero GPs for the sake of exposition:

g \sim P
P \sim DP(\alpha, P_0)
P_0 = GP(0, K_0)

Then the Sethuraman representation of this would be:

P = \sum_{j=1}^\infty \pi_j \delta_{\theta_j}

Where \theta_j \overset{iid}{\sim} GP(0, K_0), and \pi_j is constructed using the recursive scheme outlined in the last section.

This allows us to introduce the following theorem:

Theorem 1:

Suppose we have a spatial DP with base measure GP(0, K_0) defined on a compact set \mathbf{T} and K_0 is a compact covariance operator, and additionally that |\int_{\mathbf{T}}\int_{\mathbf{T}}K(t,t')dtdt'| < M for some M \geq 0. Then:

\mathbb{E}_{g |\{\pi_j\}_{j=1}^\infty}[g | \{\pi_j\}_{j=1}^\infty] \overset{P| \{\pi_j\}_{j=1}^\infty}{\sim} GP(0, (\sum_{j=1}^\infty \pi_j^2)K)

Where \overset{P| \{\pi_j\}_{j=1}^\infty}{\sim} means the expectation of realizations of P, given a particular set of \{\pi_j\}_{j=1}^\infty, and the randomness is the induced randomness that emerges as a function of the remaining stochasticity.

What this is saying is that if we take the expectation of the random measures conditioned on both the measure itself (which is drawn from a spatial DP), and the weights, then we get a gaussian process. I think this is fairly intuitive but proving it requires some hoops to jump through.

Proof:

The proof relies on Theorem 1.1 here [3], the moment result introduced here [4], the Borell Cantelli lemma, and the Kolmogorov Extension Theorem.

We have that P = \sum_{j=1}^\infty \pi_j \delta_{\theta_j} for \theta_j \overset{iid}{\sim} GP(0,K_0) and \pi_j = \pi_j'\prod_{k=1}^{j-1}(1-\pi_j'), where \pi_j' \overset{iid}{\sim} \textrm{Beta}(1,\alpha).

Let P_n(B) = \sum_{j=1}^n \pi_j \delta_{\theta_j}(B) where B \in \mathbb{B} the collection of Borel sets in the metric topology induced by the norm ||\circ||_{K_0}.

Note that \mathbb{E}_{\{\theta_j\}_{j=1}^n}[P_n(B)] =\mathbb{E}_{\{\theta_j\}_{j=1}^n}[\sum_{j=1}^n \pi_j \delta_{\theta_j}(B)] which, switching the point mass to an indicator gives us:

\mathbb{E}_{\{\theta_j\}_{j=1}^n}[\sum_{j=1}^n \pi_j \delta_{\theta_j}(B)] =  \mathbb{E}_{\{\theta_j\}_{j=1}^n}[\sum_{j=1}^n \pi_j I_{(\theta_j \in B)}]

=  \sum_{j=1}^n \pi_j P_{\theta_j}(\theta_j \in B)

But we have that \theta_j \overset{iid}{\sim} \textrm{GP}(0, K_0)

Note that this sum is less than or equal to P_n(B) \leq \sum_{j=1}^n \pi_j because the mass assigned by the GP’s \delta_{\theta_j(B)} to any Borel set B \in \mathbb{B} is less than or equal to one.

We proceed to use Borell cantelli to show almost sure convergence in the limit

Note that \pi_j = \pi_j'\prod_{k=1}^{j-1}(1-\pi_j') for \pi_j' \sim \textrm{Beta}(1,\alpha), and that 1-c for c\sim \textrm{Beta}(a,b) has distribution \textrm{Beta}(b,a), giving us that each \pi_j is the product of independent beta distributions and is thus a beta distribution.

Specifically, using the principle result introduced in the introduction in [4] we have that, for independent random variables X and Y, and the product random variable XY, we get that

\mathbb{E}[\pi_j'\prod_{k=1}^{j-1}(1-\pi_k')] = \mathbb{E}_{\pi_j'}[\pi_j']\prod_{k=1}^{j-1}\mathbb{E}_{\pi_k'}[(1-\pi_k')]

This gives us that the expectation of \pi_j is \frac{\alpha^{j-1}}{(1+\alpha)^j}.

The Borel Canteli is satisfied if P(\pi_j < \frac{1}{j^2}) \overset{j}{\rightarrow} 1. Using Markov’s inequality we get:

P(\pi_j < \frac{1}{j}^2) \geq 1-\frac{\frac{\alpha^{j-1}}{(1+\alpha)^j}}{\frac{1}{j^2}}=1-\frac{j^2\alpha^{j-1}}{(1+\alpha)^j}

The limit is 0 which gives us that in the limit P(\pi_j < \frac{1}{j^2}) \rightarrow 1, so Borel Canteli is satisfied.

We showed this is true for all Borel sets, and this includes closed sets, so select arbitrary individual singleton sets of \mathbf{T} and construct the union of these sets, giving us another element in the Borel collection. This allows us to proceed using the multivariate normal marginality and get a closed form for the limit almost surely, then extend using the Kolmogorov extension theorem to show it is true for the full GP.

Pick a set of individual elements in T = (t_1,....,t_m) \subset \mathbf{T} , and examine that P_n(t_1,...,T_m) \sim \textrm{GP}(0, \sum_{j=1}^n \pi_j^2 K_0(T,T)), and note that as \pi_j is a probability almost surely and we have the boundedness of the covariance operator \pi_j^2K_0(T,T) is still compact, and the previously linked theorem 1.1 here [3] shows that the limit of these sequences is compact as well. Borel canteli shows us that this sequence converges almost surely to the compact limit, which because each individual element is a multivariate gaussian the limit converges to a multivariate gaussian as well. The remainder to be shown is to use the Kolmogorov extension theorem to show that this can be extended to the full process.

Noting that the selection of the marginals was arbitrary, the limit exists and is a Gaussian process for all marginals, the Kolmogorov extension theorem shows that the limit process exists and is a Gaussian process.

\blacksquare

This theorem also introduces some interesting intuition about “what a spatial DP is”. Note that if we mix on \pi \sim \textrm{GEM}(\alpha) in GP(0, (\sum_j \pi_j^2)K_0) we recover the original spatial DP. But this is simply a mixing measure on the global scale parameter of a covariance operator (\sum_j \pi_j^2)K_0. This equates to doing something akin to aK_0 a\sim A, where a| Y learns from the data and allows spatial varying of the covariance operator. This means the SDP is really just a special case of a mixture distribution over the global scale parameter (which, as we shall see, admits many convenient properties).

Wasserstein Metric:

Examine the metric space (T, d) and suppose the metric space is a Polish space (this will be relevant later), where d(x,y) = ||x - y||, then the Wasserstein metric on probability distributions \mathbb{G} and \mathbb{H} with finite p-moments,defined on (T,d), is given by:

W_p(\mathbb{G},\mathbb{H}) = \inf_{\gamma \in \Gamma(\mathbb{G},\mathbb{H})}\mathbb{E}_{g,h \sim \gamma}[d(g,h)^p]^{\frac{1}{p}}

Where \Gamma(\mathbb{G},\mathbb{H}) is defined to be the set of all joint measures on T\times T such that the marginals of \Gamma are \mathbb{G} and \mathbb{H}, respectively.

Suppose \eta is a measure on measures on (T,||\circ - \circ||), the goal is to find a bound on:

W_p(\eta, P)

Where the distance metric on the metric space on which P and \eta are defined is W_r(\circ,\circ), and the distance metric this Wasserstein metric is defined is d(x,y) with d(x,y) = ||x - y||_{L_1}. We assume that the random measures \eta and P both are defined over the same one layer deep L_1 space. That is, both random measures are defined on measures defined on (T, ||\circ - \circ||_{L_1}); where ||f||_{L_1} = \sup_{t\in\mathbf{T}} |f(t)|.

Theorem 2 (REVISIT, NEEDS SOME WORK):

Suppose \eta is a measure on measures over stochastic processes defined on some compact set \mathbf{T} \subset \mathbb{R}, and that P is a spatial DP similarly defined over measures defined over Gaussian processes defined on \mathbf{T}. Suppose the requirements of Theorem 1 are satisfied on the Spatial DP P. Let c(\eta) = \mathbf{E}_{g \sim \eta}[\mathbb{E}_{h\sim g}[||h||_{L_1}]]\geq 0 be a positive constant which is a function of \eta. Let a(\alpha) = \mathbb{E}_{h\sim P}[\mathbb{E}_{h}[||h||_{L_1}]] >0 be a positive constant which is a function of \alpha. Then we can bound the two level deep nested Wasserstein distance between \eta and P by:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq \min\{|c(\eta) -\mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[\exp[\frac{-\mathbb{E}_{g|\{\pi_j\}_{j=1}^\infty}[||g||_{L_1}]^2}{\sigma^2_T}]] -a(\alpha)|,|c(\eta) - a(\alpha)| \}

Proof:

What we’re interested in examining is:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) = \inf_{\gamma \in \Gamma(\eta,P)}\mathbb{E}_{g,h \sim \gamma}[W_{r,||\circ||_{L_1}}(g,h)^p]^{\frac{1}{p}}

Jensen’s inequality allows us to remove the exponential and it cancels with the root.


\Rightarrow \inf_{\gamma \in \Gamma(\eta,P)}\mathbb{E}_{g,h \sim \gamma}[W_{r,||\circ||_{L_1}}(g,h)] = \inf_{\gamma \in \Gamma(\eta,P)}\mathbb{E}_{g,h \sim \gamma}[\inf_{\psi \in \Psi(g,h)}\mathbb{E}_{q,u\sim \psi}[      ||q-u||_{L_1}^r]^{\frac{1}{r}}]

Once again Jensen’s inequality allows us to get rid of the exponential and root in r, and applying the reverse triangle inequality to the last inner expectation we get:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq \inf_{\gamma \in \Gamma(\eta,P)}\mathbb{E}_{g,h \sim \gamma}[\inf_{\psi \in \Psi(g,h)}\mathbb{E}_{q,u\sim \psi}[|||q||_{L_1}-||u||_{L_1}|]]

Applying Jensen’s inequality we get:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq \inf_{\gamma \in \Gamma(\eta,P)}\mathbb{E}_{g,h \sim \gamma}[\inf_{\psi \in \Psi(g,h)}|\mathbb{E}_{q,u\sim \psi}[||q||_{L_1}-||u||_{L_1}]|]

The expectation factors and we get:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq \inf_{\gamma \in \Gamma(\eta,P)}\mathbb{E}_{g,h \sim \gamma}[\inf_{\psi \in \Psi(g,h)}|\mathbb{E}_{q\sim \psi_1}[||q||_{L_1}]-\mathbb{E}_{u\sim \psi_2}[||u||_{L_1}]|]

Noting that the marginals of \psi \forall \psi \in \Psi(g,h) are just g and h the infimum goes away, and we get:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq \inf_{\gamma \in \Gamma(\eta,P)}\mathbb{E}_{g,h \sim \gamma}[|\mathbb{E}_{g}[||g||_{L_1}]-\mathbb{E}_h[||h||_{L_1}]|]

Applying Jensen’s inequality again, we get:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq \inf_{\gamma \in \Gamma(\eta,P)}|\mathbb{E}_{g,h \sim \gamma}[\mathbb{E}_{g}[||g||_{L_1}]-\mathbb{E}_h[||h||_{L_1}]]|

The left side double expectation is some positive constant c \geq 0, and inner right expectation is the distribution of a Gaussian process by Theorem 1.

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq \inf_{\gamma \in \Gamma(\eta,P)}|c -\mathbb{E}_{\{\pi_j\}_{j=1^\infty}}[\mathbb{E}_{h | \{\pi_j\}_{j=1}^\infty}[||h||_{L_1}]]|

Note here that we used the conditional expectation as in Theorem 1, conditioning first on \{\pi_j\}_{j=1}^\infty and then taking the expectation with respect to the \{\pi_j\}_{j=1}^\infty.

Note that for an expectation \mathbb{E}_X[x] we can express the expectation as the integral of the CDF:

\mathbb{E}_X[x] = \int_{-\infty}^\infty (1-F_X(x))dx

We also have the Borell-TIS inequality for Gaussian processes:

Proposition 1, Borel-TIS Inequality:

Let f be a centered Gaussian process defined on some topological space \mathbf{T}, with ||f||_T = \sup_{t\in\mathbf{T}}|f_t| almost surely finite. Define \sigma^2_{\mathbf{T}} = \sup_{t\in\mathbf{T}}\mathbb{E}[|f_t|^2]. Then:

P(||f||_T > \mathbb{E}[||f||_T] + u) \leq \exp[\frac{-u^2}{\sigma^2_T}]

for u > 0.

Suppose for purpose of examination that c is very large relative to the other terms, and note that our previous statement about the expectation being and integral function of the CDF and the Borel-TIS inequality can be applied to reframe the original Gaussian expectation as:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq \inf_{\gamma \in \Gamma(\eta,P)}|c -\mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[\int_{\mathbb{E}_{g|\{\pi_j\}_{j=1}^\infty}[||g||_{L_1}]}^\infty P(||g||_{L_1} > u)du - \int_{0}^{\mathbb{E}_{g|\{\pi_j\}_{j=1}^\infty}[||g||_{L_1}]} P_{g | \{\pi_j\}_{j=1}^\infty}(||g||_{L_1} > u)du]|

Do a change of variables on the first integral to get:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq \inf_{\gamma \in \Gamma(\eta,P)}|c -\mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[\int_0^\infty P(||g||_{L_1} > \mathbb{E}_{g|\{\pi_j\}_{j=1}^\infty}[||g||_{L_1}] + u)du - \int_{0}^{\mathbb{E}_{g|\{\pi_j\}_{j=1}^\infty}[||g||_{L_1}]} P_{g | \{\pi_j\}_{j=1}^\infty}(||g||_{L_1} > u)du]|

Use the Borel-TIS inequality on the integrand of the first integral to obtain:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq \inf_{\gamma \in \Gamma(\eta,P)}|c -\mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[\int_0^\infty \exp[\frac{-u}{\sigma^2_T}]du - \int_{0}^{\mathbb{E}_{g|\{\pi_j\}_{j=1}^\infty}[||g||_{L_1}]} P_{g | \{\pi_j\}_{j=1}^\infty}(||g||_{L_1} > u)du]|

Integrating the exponential we get:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq \inf_{\gamma \in \Gamma(\eta,P)}|c +\mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[\sigma^2_T \exp[\frac{-u}{\sigma^2_T}]|_{0}^\infty - \int_{0}^{\mathbb{E}_{g|\{\pi_j\}_{j=1}^\infty}[||g||_{L_1}]} P_{g | \{\pi_j\}_{j=1}^\infty}(||g||_{L_1} > u)du]|

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq \inf_{\gamma \in \Gamma(\eta,P)}|c -\mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[\sigma^2_T] - \mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[\int_{0}^{\mathbb{E}_{g|\{\pi_j\}_{j=1}^\infty}[||g||_{L_1}]} P_{g | \{\pi_j\}_{j=1}^\infty}(||g||_{L_1} > u)du]|

Multiplying the inside of the absolute value by -1 we can lower bound this by removing the final term in the absolute value to get:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq \inf_{\gamma \in \Gamma(\eta,P)}|c -\mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[\sigma^2_T]|

Which is constant with respect to the infimum so this gives us:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq |c -\mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[\sigma^2_T]|

Taking the (\sum_{j=1}^\infty \pi_j^2) term out of \sigma^2_T, we get:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq |c -\mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[(\sum_{j=1}^\infty \pi_j^2)\mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[\sup_{t \in \mathbf{T}}|g(t)|^2]]|

\blacksquare

Examining the theorem, note that we can reproduce everything we did for the spatial DP for another spatial DP as \eta. That is, assuming \eta \sim \textrm{SDP}(0, K^{1}_0, \alpha^{1}), and P \sim \textrm{SDP}(0,K^{2}_0,\alpha^{2}), we get:

W_{p,W_{r, ||\circ||_{L_1}}}(\eta, P) \geq |\mathbb{E}_{\{\pi^{(1)}_j\}_{j=1}^\infty}[(\sum_{j=1}^\infty \pi_j^{(1)^2})\mathbb{E}_{g^{(1)} | K_0^{(1)}}[\sup_{t \in \mathbf{T}}|g^{(1)}(t)|^2]] -\mathbb{E}_{\{\pi^{(2)}_j\}_{j=1}^\infty}[(\sum_{j=1}^\infty \pi_j^{(2)^2})\mathbb{E}_{g^{(2)} |K_0^{(2)}}[\sup_{t \in \mathbf{T}}|g^{(2)}(t)|^2]]|

Wasserstein Distance as Linear Program Between Two Spatial DPs:

We exploit the linear programming representation of optimal transport but extend it to infinite dimensional matrices (an operator on a countably infinite set). You can read more about optimal transport between finite discrete distributions in that previously linked blog post, but the general idea we’re going to do is note that for discrete distributions P = \sum_{i=1}^n \pi_i \delta_{\theta_i} and Q = \sum _{j=1}^n \eta_j \delta_{\epsilon_j}, we have the following solution to the Wasserstein distance between them:

\min_{T} <T,C>_F

where <\circ,\circ>_F denotes the Frobenius inner product between two matrices, subject to:

T1 = \pi, T^\perp 1 = \eta, T \geq 0

Where C_{ij} = ||\theta_i - \eta_jj||^r where the norm is any appropriately specified norm.

Then the Wasserstein distance between these two discrete distributions is:

W_{r,||\circ||}(P,Q) = <T^*,C>^{\frac{1}{r}}_F

Extending this to countably infinite discrete distributions we get:

\min_T \sum_{i=1}^\infty \sum_{j=1}^\infty T_{i,j}||\theta_i - \epsilon_j||^r

subject to:

T1_\infty = (\pi_1,\pi_2,.....), T^\perp 1_\infty = (\eta_1,\eta_2,...), T\geq 0

Proof:

Note for myself to finish this. The minimizer of the expectation over all joint distributions on the atoms and weights of this expression is the Wasserstein distance between two spatial dirichlet processes. Use self similarity of the DP to do an iterative conditional expectation to reduce down the inside term. With the conditional expectation iterating we get an iterative linear program for increasing dimension of T. Likely can be extended to general species sampling models. The spatial dirichlet processes’ weights are contained in the L_q Hilbert space for all q >0 because they almost surely are infinite dimensional probability vectors

Truncation Bounds:

The final and last theorem is related to the error of the truncation. This is going to heavily rely on Proposition 1.

Truncation means that supposing that we sample g \sim P with P \sim DP(\alpha, G_0) with G_0 a centered mean zero Gaussian process with some covariance operator K_0, we want to examine the difference between P_n and P; where P_n = \sum_{j=1}^n \pi_j \theta_j with the same definitions as previously for \{\pi_j\}_{j=1}^n and \{\theta_j\}_{j=1}^n.

By the Borel-TIS inequality we have: P_{P | \{\pi_j\}_{j=1}^\infty}(|| \mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g-g_n]||_{L_1}] > \mathbb{E}_{P | \{\pi_j\}_{j=1}^\infty}[||\mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g - g_n]||_{L_1}]] + t)\leq \exp[\frac{-t^2}{(\sum_{j=n+1}^\infty \pi_j^2) \sigma^2_\mathbf{T}}]. This is due to the fact that \mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g - g_n]] is still a Gaussian process, as it’s just the same infinite series done in Theorem one just with the principal first n terms removed.

Theorem 3:

Suppose the requirements of Theorem 1 and 2 are satisfied. Then we have the following:

\mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[P_{P | \{\pi_j\}_{j=1}^\infty}(|| \mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g-g_n]||_{L_1}] -\mathbb{E}_{P | \{\pi_j\}_{j=1}^\infty}[\mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[||g - g_n||_{L_1}]] > t)] \leq \mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[\exp[\frac{-t}{(\sum_{j=n+1}^\infty \pi_j^2) c(K_0)}]]

Proof:

Take the probability expression

P_{P | \{\pi_j\}_{j=1}^\infty}(|| \mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g-g_n]||_{L_1}] > \mathbb{E}_{P | \{\pi_j\}_{j=1}^\infty}[||\mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g - g_n]||_{L_1}]] + t)\leq \exp[\frac{-t^2}{(\sum_{j=n+1}^\infty \pi_j^2) \sigma^2_\mathbf{T}}]

, with the term involving \{\pi_j\}_{j=n+1}^\infty in the exponential involved in factoring out of the expectation the multiplicative scalar in the covariance of the Gaussian process. Note that by Jensen’s inequality we can upper bound, and take the expectation of both sides with respect to \{\pi_j\}_{j=1}^\infty. This gives us:

\mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[P_{P | \{\pi_j\}_{j=1}^\infty}(|| \mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g-g_n]||_{L_1}] > \mathbb{E}_{P | \{\pi_j\}_{j=1}^\infty}[||\mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g - g_n]||_{L_1}]] + t)]\leq \mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[\exp[\frac{-t^2}{(\sum_{j=n+1}^\infty \pi_j^2) \sigma^2_\mathbf{T}}]

Note that \sigma^2_T is a constant which is a function of K_0. Denote this by c(K_0) \geq 0:

\mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[P_{P | \{\pi_j\}_{j=1}^\infty}(|| \mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g-g_n]||_{L_1}] > \mathbb{E}_{P | \{\pi_j\}_{j=1}^\infty}[||\mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g - g_n]||_{L_1}] + t)]\leq \mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[\exp[\frac{-t^2}{(\sum_{j=n+1}^\infty \pi_j^2) c(K_0)}]

\mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[P_{P | \{\pi_j\}_{j=1}^\infty}( ||\mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g - g_n]||_{L_1}]]-\mathbb{E}_{P | \{\pi_j\}_{j=1}^\infty}[||\mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g - g_n]||_{L_1}] > t)] \leq \mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[\exp[\frac{-t}{(\sum_{j=n+1}^\infty \pi_j^2) c(K_0)}]]

\blacksquare

Let’s examine this theorem for a second. As \alpha \rightarrow 0 we have that \sum_{j=1}^n \pi_j^2 \overset{p}{\rightarrow} 1, and for large \alpha we have that \sum_{j=1}^n \pi_j^2 \overset{p}{\rightarrow} a for some small constant a. This means that the term on the right hand side of the inequality is better bounded for large \alpha, which is the same as saying the mass is more concentrated on a a larger amount of atoms of the DP. As a gets smaller the division by a makes the exponential negative a larger value which lowers the threshold.

As we increase the threshold t the probability becomes smaller, which makes sense because no matter what \alpha we have the probability inside the expectation over \{\pi_j\}_{j=1}^\infty on the left hand side is over a smaller range.

This is all intuitive. As the DP has more atoms removing any individual primary atom doesn’t have as much of an influence on the approximation of the truncation to the DP itself. If there are only 3 atoms and you remove two, then you will get a bad truncation error.

Note also that examining a small number of principal atoms \{\theta_j\}_{j=1}^n with associated weights \{\pi_j\}_{j=1}^n we have that \mathbb{E}_{P | \{\pi_j\}_{j=1}^\infty}[||\mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g - g_n]||_{L_1}] \rightarrow 0, as the remaining atoms have so little probability mass that the expected value pushes the difference down to 0. This means that for small \alpha the bound isn’t as good but is a better approximation to the following expression:

\mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[P_{P | \{\pi_j\}_{j=1}^\infty}( ||\mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g-g_n]||_{L_1}] > t)] \leq \mathbb{E}_{\{\pi_j\}_{j=1}^\infty}[\exp[\frac{-t}{(\sum_{j=n+1}^\infty \pi_j^2) c(K_0)}]]

That is, the bound as a function of t becomes tighter as \{\pi_j\}_{j=n+1}^\infty assigns more mass to fewer atoms.

Another interesting thing to consider is that the expectation over \{\pi_j\}_{j=1}^\infty can be removed (this was just a minor addition in the derivation to get a fixed quantity), and we can specify the following:

P_{P | \{\pi_j\}_{j=1}^\infty}( ||\mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g-g_n]||_{L_1}] -\mathbb{E}_{P | \{\pi_j\}_{j=1}^\infty}[||\mathbb{E}_{g | \{\pi_j\}_{j=1}^\infty}[g - g_n]||_{L_1}] > t) \leq \exp[\frac{-t}{(\sum_{j=n+1}^\infty \pi_j^2) c(K_0)}]

\forall \{\pi_j\}_{j=1}^\infty \in S_\infty where S_\infty denotes the infinite dimensional simplex. That is, we can come up with a bound for any given infinite dimensional probability vector (on the interior of the simplex to avoid dividing by 0); instead of using the expectation to get a fixed bounding.

A final interesting thing to consider is that we don’t have to remove the principal n terms, we can instead remove any subset of the infinite terms in the series and get an updated bound. This just becomes replacing the sums from 1 to n, or from n+1 to \infty, with appropriately specified sets and their compliments.

EDITED:

Contraction Rates:

We have that the model is given by:

y_t | \theta_t \overset{ind}{\sim}\textrm{N}_n(\theta_t,\tau^2I_n)
\theta_t  \overset{iid}{\sim} G
G \sim \textrm{DP}(v,G_0)
G_0 = \textrm{N}_n(0, \sigma^2 H_n(\psi)

with n observed sites s_1,...,s_n for t = 1,...,T.


This gives that the base measure is the finite dimensional GP law at the sites, and \sigma,\tau,\psi>0 treated as known scale parameters.

Let the true random effects law G' be non-atomic and supported on a Holder ball of order \alpha (typical for GP sample paths on S.

The true marginal law of Y_t is P' = \int \textrm{N}_n(\theta, \tau I_n) dG_*(\theta). When G_* = \textrm{N}_n(0, \sigma^2_*H_n(\psi_*) is a GP law, the marginal simplifies to \textrm{N}_n(0, \sigma_*^2H_n(\psi_*) + \tau I_n).

Metrics:

  • On the Mixing law G: Wasserstein W_r over \Theta, ||\circ||_\infty
  • On the induced marginal P_G for Y: Hellinger, (or total variation)

Theorem 4 (Mixing law, Wasserstein):

Assume the true G_* is supported on a Holder ball \Theta = \{f: [0,1]^d \rightarrow \mathbb{R} \hspace{2mm} \textrm{s.t.} \hspace{2mm} ||f||_{C^\alpha}< M\}, with \alpha \in (0,1], and that the prior is DP(vG_0) with nonatomic G_0 whose support contains \Theta. Then for some constant M_1>0 and any fixed r\geq 1:

\Pi(W_r(G,G_*) > M_1 \epsilon_T | Y^{(T)}) \overset{P}{T\rightarrow\infty} 0

at the (near) minimax nonparametric rate:

\epsilon_T \approx (\frac{\log T}{T})^{\frac{\alpha}{2\alpha + d}}

Proof:

The relevant background from the notes is Lemma 3, and the testing and KL results in the linked paper that’s available in my publications section.

  • The prior mass near the truth in W_r for a DP with nonatomic base measure on a holder ball ( the Lemma 3 style lower bound and packing number appearance): the display involving W_r and N(\epsilon, \Theta, ||\circ||) shows the DP prior puts strictly positive mass on W_r, controlled by the metric entropy of \Theta.
  • The dependence of the packing/covering number N(\epsilon, \theta, ||\circ||) is explicitly and is the engine behind the d, \alpha dependence. (Holder balls satisfy \log N(\epsilon, \Theta, ||\circ|_\infty) \approx \epsilon^{-\frac{d}{\alpha}}).
  • The notes set up the testing and Kullback Leibler Pieces. The KL support definition and KL bound of GPs in terms of sup-norm of the mean (A1) give the construction/contiguity needed in the general Ghosal-Ghosh-Vandervaart scheme.

Putting these together with the standard master theorem for contraction rates (entropy vs. sample size) yields the rate solving:

\log N(\epsilon_T, \Theta, ||\circ||_\infty) \lessapprox T \epsilon_T^2\Rightarrow \epsilon_T \approx (\frac{\log T}{T})^{\frac{\alpha}{2\alpha + d}}

Theorem 5 (Contraction of the induced marginal P_G for Y:

If the true G_* is a Gaussian law \textrm{N}_n(0,\sigma_*^2H_n(\psi_*)) (so P_* = \textrm{N}_n(0, \Sigma_*) with \Sigma_* = \sigma_*^2H_n(\psi_*) + \tau I_n) then the postetrior over the induced marginal P_G contracts to P_* at the parametric rate:

\delta_T \approx T^{-\frac{1}{2}}

Why parametric?:

With (\sigma^2, psi, \tau^2) >0 fixed, the target P_* is a single Gaussian; the \textrm{DP} centered at the matching GP law places positive Kullback-Leibler mass there, and the KL bounds/tests for Gaussian process provide exponentially powerful tests, yielding a T^{-\frac{1}{2}} rate in Hellinger/KL. The GP-mixture integrates to a single Gaussian when the mixing is Gaussian, and recovering that finite dimensional target density is a parametric task.

Practical Reading of the Rates:

  • If your aim is to learn the distribution of the latent spatial fields \theta_t (i.e. G) the intrinsic function-space difficulty appears: smoothness \alpha, dimension d, and sample size T trade off as \epsilon_T \approx (\frac{\log T}{T})^{\frac{\alpha}{2\alpha + d}}.
  • If you only care about the marginal law of the observed vector Y_t (e.g. to predict Y at observed sites), and the truth is Gaussian, you recover it at the optimal parametric T^{-1\frac{1}{2}} rate.

Conclusion:

That completes the blog post. I’m going to come back and do some simulations and add some figures so people can better see and understand what I was thinking as I worked on these problems. It’s nice to revisit old work and see it with fresh eyes, and I’m happy it still made somewhat sense to me while looking at it.

Leave a Reply

Discover more from Ryan Warnick Phd.

Subscribe now to keep reading and get access to the full archive.

Continue reading