(EDIT: This work culminated in a research article published here.)
Towards the tail end of graduate school, and for a little while after during my professional career, I did some work on Spatial Dirichlet Process models, as well as on convolutions of Spatial Dirichlet Process models with a white noise base measure. That work was never completed and never published. You can read the notes on the publications page, but the notes were written at a time when I was struggling with some things going on in my personal life, and are generally speaking somewhat erratic and confused. Still, there’s a lot of decent work in there that never got to see the light of day, and I wanted to introduce three theorems from the notes (and fix some issues) here with a better exposition to explain what exactly was going through my mind as the notes were being written.
My technical blog posts lean pretty technical already, but this blog post will use some fairly heavy-duty math (at least from my perspective), so I also wanted to give an introduction here that could allow people with no prior familiarity with these concepts to follow along and to use them in their own work in other technical settings should they so choose. The primary things of interest are Gaussian processes, Hilbert spaces, compact integral kernel operators between Hilbert spaces and convergent sequences of compact integral kernel operators, the Wasserstein metric on probability measures, the Dirichlet process and its associated Sethuraman representation, and a few other bits and pieces here and there that will be outlined in the three main proofs.
A lot of this stuff can be fairly mind bend-y (think distributions over distributions over spatial fields, and measuring distances between them), but hopefully talking about it here in a more casual format which doesn’t have to appease any reviewers can make it more digestible than the fairly opaque (and error ridden) notes.
Also, as an aside, I highly doubt this will be practically useful for anything directly :), but it was fun to think about.
Dirichlet Processes:
Dirichlet Processes (DPs) are stochastic processes whose realizations are almost-surely discrete distributions over some measurable space. That is, assume $G \sim DP(\alpha, G_0)$, where $\alpha > 0$ and $G_0$ is a measure on some measurable space $(\Theta, \mathcal{B})$. For the sake of exposition with the main proofs, assume that $\Theta$ is a topological space and $\mathcal{B}$ is the Borel $\sigma$-algebra.
The primary defining characteristic of a DP is that the distribution of the amount of mass assigned to each element of a mutually exclusive and exhaustive partition $A_1, \dots, A_k$ of $\Theta$ is:
$$(G(A_1), \dots, G(A_k)) \sim \text{Dirichlet}(\alpha G_0(A_1), \dots, \alpha G_0(A_k))$$
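However, there is a constructive definition of a measure drawn from a DP, called the Sethuraman representation [1]. The proof of the Sethuraman representation (which is omitted in this post) shows that realizations of the DP are almost surely discrete:
$$G = \sum_{i=1}^{\infty} \pi_i \delta_{\theta_i}$$
With $\theta_i \overset{iid}{\sim} G_0$ drawn from the base measure, and $\pi_i$ defined in a recursive scheme where
$$\pi_i = V_i \prod_{j=1}^{i-1} (1 - V_j)$$
where $V_i \overset{iid}{\sim} \text{Beta}(1, \alpha)$ and $\alpha$ is the concentration parameter.
So we can see that each realization from a DP is an almost-surely discrete probability distribution.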
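To make the stick-breaking recursion concrete, here is a minimal sketch of a truncated draw from a DP. It assumes the standard construction with $\text{Beta}(1, \alpha)$ sticks; the function names, the standard normal base measure, and the truncation level of 200 atoms are all illustrative choices of mine, not anything taken from the notes.

```python
import random

def stick_breaking_weights(alpha, n_atoms, rng):
    """First n_atoms stick-breaking weights for concentration alpha."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)  # V_i ~ Beta(1, alpha)
        weights.append(v * remaining)    # pi_i = V_i * prod_{j<i} (1 - V_j)
        remaining *= 1.0 - v
    return weights

def sample_truncated_dp(alpha, n_atoms, base_sampler, rng):
    """Truncated draw from DP(alpha, G0): returns (atoms, weights)."""
    weights = stick_breaking_weights(alpha, n_atoms, rng)
    atoms = [base_sampler(rng) for _ in range(n_atoms)]  # theta_i ~ G0
    return atoms, weights

rng = random.Random(0)
atoms, weights = sample_truncated_dp(
    alpha=2.0,
    n_atoms=200,
    base_sampler=lambda r: r.gauss(0.0, 1.0),  # assumed base measure N(0, 1)
    rng=rng,
)
total_mass = sum(weights)  # very close to 1 once enough sticks are broken
```

The realization is just a list of atoms with attached weights, which is the discreteness the Sethuraman representation makes visible.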
Spatial Dirichlet Process:
Recall that we said that the DP requires a base measure $G_0$ and a concentration parameter $\alpha$, and nothing else. [2] introduced a model where the base measure is a Gaussian Process (GP), creating a random distribution over stochastic processes. I wrote some preliminary stuff on GPs and Reproducing Kernel Hilbert Spaces (RKHSs) here in the first section. I didn’t want to go over any of that stuff again, but essentially the important pieces boil down to a GP, $w \sim GP(\mu, K)$, being defined over some compact domain $D$ and having support in the reproducing kernel Hilbert space $\mathcal{H}_K$ of the operator $K$; with a mean function $\mu$ which is the mean of the random process, and a covariance operator $K$. Recall that GPs are almost surely in the RKHS of their covariance operator, giving us that the Hilbert space on which they are defined has a naturally induced norm given by the square root of the inner product:
$$\|w\|_{\mathcal{H}_K} = \sqrt{\langle w, w \rangle_{\mathcal{H}_K}}$$
This allows us to metricize the space and induce a topology which can then be extended to a measurable space with the Borel sets defined on the topology.
The spatial DP is a DP with a spatial process as the base measure, and for the sake of this post we’ll be considering GPs as the base measure. This is expressed as follows, restricting ourselves to mean zero GPs for the sake of exposition:
$$G \sim DP(\alpha, GP(0, K))$$
Then the Sethuraman representation of this would be:
$$G = \sum_{i=1}^{\infty} \pi_i \delta_{w_i}$$
Where $w_i \overset{iid}{\sim} GP(0, K)$, and $\pi_i$ is constructed using the recursive scheme outlined in the last section.
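A rough sketch of what a truncated spatial DP draw looks like at a finite set of sites may help here. Everything specific in it is an assumption made for illustration: the exponential covariance function, the ten sites on $[0, 1]$, the concentration parameter of 1, and the truncation at 50 atoms. Each atom is a whole mean-zero GP realization at the sites, and a draw from the random measure picks one atom according to the stick-breaking weights.

```python
import math
import random

def exp_kernel(s, t, length_scale=0.3):
    """Exponential covariance k(s, t); an assumed stand-in for the operator K."""
    return math.exp(-abs(s - t) / length_scale)

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sample_gp(chol, rng):
    """One mean-zero GP draw at the sites, computed as L z with z ~ N(0, I)."""
    z = [rng.gauss(0.0, 1.0) for _ in chol]
    return [sum(row[k] * z[k] for k in range(i + 1)) for i, row in enumerate(chol)]

def stick_breaking_weights(alpha, n_atoms, rng):
    """First n_atoms stick-breaking weights with Beta(1, alpha) sticks."""
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

rng = random.Random(1)
sites = [i / 9 for i in range(10)]                        # 10 sites on [0, 1]
cov = [[exp_kernel(s, t) for t in sites] for s in sites]  # covariance at the sites
chol = cholesky(cov)

n_atoms = 50
weights = stick_breaking_weights(1.0, n_atoms, rng)
atoms = [sample_gp(chol, rng) for _ in range(n_atoms)]    # each atom is a GP draw

# One draw from the truncated random measure: pick atom i with probability pi_i.
index = rng.choices(range(n_atoms), weights=weights, k=1)[0]
field = atoms[index]
```

The exponential covariance is used only because its matrix is well conditioned without any jitter; any valid covariance function could be swapped in.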
This allows us to introduce the following theorem:
Theorem 1:
Suppose we have a spatial DP with base measure defined on a compact set
and
is a compact covariance operator, and additionally that
for some
. Then:
Where means the expectation of realizations of
, given a particular set of
, and the randomness is the induced randomness that emerges as a function of the remaining stochasticity.
What this is saying is that if we take the expectation of the random measures conditioned on both the measure itself (which is drawn from a spatial DP), and the weights, then we get a Gaussian process. I think this is fairly intuitive, but proving it requires jumping through some hoops.
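One elementary fact sits behind this intuition: for a fixed weight vector, a weighted sum of independent mean-zero Gaussian values is again Gaussian, with variance scaled by the sum of squared weights. The sketch below checks that variance by Monte Carlo at a single site. It illustrates that fact only and is not the theorem or its proof; the weight vector and the site variance are hypothetical values picked for the example.

```python
import random

rng = random.Random(2)
weights = [0.5, 0.3, 0.15, 0.05]  # a fixed, hypothetical weight vector
site_variance = 2.0               # assumed value of k(s, s) at one site
n_draws = 20000

samples = []
for _ in range(n_draws):
    # Independent mean-zero Gaussian atom values at the site.
    atom_values = [rng.gauss(0.0, site_variance ** 0.5) for _ in weights]
    samples.append(sum(p * w for p, w in zip(weights, atom_values)))

mean = sum(samples) / n_draws
empirical_var = sum((x - mean) ** 2 for x in samples) / (n_draws - 1)

# Var(sum_i pi_i w_i) = k(s, s) * sum_i pi_i^2 for independent w_i.
theoretical_var = site_variance * sum(p * p for p in weights)
```

Here the weights only rescale the covariance, which is the sense in which conditioning on them leaves a Gaussian object behind.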
Proof:
The proof relies on Theorem 1.1 here [3], the moment result introduced here [4], the Borel-Cantelli lemma, and the Kolmogorov Extension Theorem.
We have that for
and
, where
.
Let where
the collection of Borel sets in the metric topology induced by the norm
.
Note that which, switching the point mass to an indicator gives us:
But we have that
Note that this sum is less than or equal to because the mass assigned by the GP’s
to any Borel set
is less than or equal to one.
We proceed to use the Borel-Cantelli lemma to show almost sure convergence in the limit
Note that for
, and that
for
has distribution
, giving us that each
is the product of independent beta distributions and is thus a beta distribution.
Specifically, using the principal result from the introduction of [4] we have that, for independent random variables and
, and the product random variable
, we get that
This gives us that the expectation of is
.
The Borel-Cantelli lemma is satisfied if . Using Markov’s inequality we get:
The limit is 0 which gives us that in the limit , so the Borel-Cantelli condition is satisfied.
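As a sketch of how the Markov step can deliver the summability that Borel-Cantelli needs, assume the standard stick-breaking weights $\pi_i = V_i \prod_{j<i} (1 - V_j)$ with $V_j \overset{iid}{\sim} \text{Beta}(1, \alpha)$, and take the events to be $\{\pi_i > \epsilon\}$ for a fixed $\epsilon > 0$. Both the choice of events and the symbols are assumptions on my part, and the notes may bound a different sequence.

```latex
\begin{align*}
\mathbb{E}[\pi_i]
  &= \mathbb{E}[V_i] \prod_{j<i} \mathbb{E}[1 - V_j]
   = \frac{1}{1+\alpha} \left( \frac{\alpha}{1+\alpha} \right)^{i-1}, \\
\sum_{i=1}^{\infty} P(\pi_i > \epsilon)
  &\le \frac{1}{\epsilon} \sum_{i=1}^{\infty} \mathbb{E}[\pi_i]
   = \frac{1}{\epsilon} < \infty .
\end{align*}
```

The geometric decay of $\mathbb{E}[\pi_i]$ is what makes the series converge, so only finitely many of the events occur almost surely.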
We showed this is true for all Borel sets, and this includes closed sets, so select arbitrary individual singleton sets of and construct the union of these sets, giving us another element in the Borel collection. This allows us to proceed using the multivariate normal marginality and get a closed form for the limit almost surely, then extend using the Kolmogorov extension theorem to show it is true for the full GP.
Pick a set of individual elements in , and examine that
, and note that as
is a probability almost surely and we have the boundedness of the covariance operator
is still compact, and the previously linked Theorem 1.1 [3] shows that the limit of these sequences is compact as well. The Borel-Cantelli lemma shows us that this sequence converges almost surely to the compact limit, and because each individual element is multivariate Gaussian, the limit is multivariate Gaussian as well. What remains is to use the Kolmogorov extension theorem to show that this can be extended to the full process.
Since the selection of the marginals was arbitrary, and the limit exists and is Gaussian for all marginals, the Kolmogorov extension theorem shows that the limit process exists and is a Gaussian process.
This theorem also introduces some interesting intuition about “what a spatial DP is”. Note that if we mix on in
we recover the original spatial DP. But this is simply a mixing measure on the global scale parameter of a covariance operator
. This equates to doing something akin to
, where
learns from the data and allows spatial varying of the covariance operator. This means the
is really just a special case of a mixture distribution over the global scale parameter (which, as we shall see, admits many convenient properties).
Wasserstein Metric:
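Examine the metric space $(M, d)$ and suppose it is a Polish space (this will be relevant later). For $p \in [1, \infty)$, the Wasserstein metric on probability distributions $\mu$ and $\nu$ with finite $p$-moments, defined on $M$, is given by:
$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma} \left[ d(x, y)^p \right] \right)^{1/p}$$
Where $\Gamma(\mu, \nu)$ is defined to be the set of all joint measures on $M \times M$ such that the marginals of $\gamma$ are $\mu$ and $\nu$, respectively.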
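For intuition, the infimum over couplings has a closed form in one very special case: two empirical measures on the real line with the same number of equally weighted points, where the optimal coupling pairs the sorted samples. The sketch below uses that fact. It is nowhere near the function-space setting of this post; the function name and data are made up for illustration.

```python
def wasserstein_1d(xs, ys, p=2):
    """W_p between two equal-size, equally weighted empirical measures on the line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    n = len(xs)
    # On the real line the optimal coupling matches order statistics.
    cost = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys))) / n
    return cost ** (1.0 / p)

xs = [0.0, 1.0, 3.0]
ys = [5.0, 2.0, 3.0]  # xs shifted by 2 and shuffled
w1 = wasserstein_1d(xs, ys, p=1)
w2 = wasserstein_1d(xs, ys, p=2)
```

Because `ys` is just `xs` translated by 2, both `w1` and `w2` come out as 2, which matches the idea of the metric measuring how far mass has to move.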
Suppose is a measure on measures on
, the goal is to find a bound on:
Where the distance metric on the metric space on which and
are defined is
, and the distance metric this Wasserstein metric is defined is
with
. We assume that the random measures
and
both are defined over the same one layer deep
space. That is, both random measures are defined on measures defined on
; where
.
Theorem 2 (REVISIT, NEEDS SOME WORK):
Suppose is a measure on measures over stochastic processes defined on some compact set
, and that
is a spatial DP similarly defined over measures defined over Gaussian processes defined on
. Suppose the requirements of Theorem 1 are satisfied on the Spatial DP
. Let
be a positive constant which is a function of
. Let
be a positive constant which is a function of
. Then we can bound the two level deep nested Wasserstein distance between
and
by:
Proof:
What we’re interested in examining is:
Jensen’s inequality allows us to remove the exponent, which cancels with the root.
Once again Jensen’s inequality allows us to get rid of the exponent and root in , and applying the reverse triangle inequality to the last inner expectation we get:
Applying Jensen’s inequality we get:
The expectation factors and we get:
Noting that the marginals of are just
and
the infimum goes away, and we get:
Applying Jensen’s inequality again, we get:
The left side double expectation is some positive constant , and inner right expectation is the distribution of a Gaussian process by Theorem 1.
Note here that we used the conditional expectation as in Theorem 1, conditioning first on and then taking the expectation with respect to the
.
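Note that for a nonnegative random variable $X$ we can express the expectation as the integral of the survival function (one minus the CDF):
$$\mathbb{E}[X] = \int_0^{\infty} P(X > t) \, dt$$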
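A quick numerical sanity check of the tail-integral identity $\mathbb{E}[X] = \int_0^\infty P(X > t) \, dt$ for a nonnegative random variable, using an exponential distribution where both sides are known in closed form. The rate, the integration cutoff, and the step count are arbitrary choices for this sketch.

```python
import math

def tail_integral(survival, upper, n_steps=100000):
    """Trapezoidal approximation of the integral of P(X > t) over [0, upper]."""
    h = upper / n_steps
    total = 0.5 * (survival(0.0) + survival(upper))
    total += sum(survival(k * h) for k in range(1, n_steps))
    return total * h

rate = 2.0
# For X ~ Exponential(rate), P(X > t) = exp(-rate * t) and E[X] = 1 / rate.
approx_mean = tail_integral(lambda t: math.exp(-rate * t), upper=20.0)
exact_mean = 1.0 / rate
```

The two values agree to many decimal places, and the same identity is what lets a tail bound on a supremum be turned into a bound on its expectation.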
We also have the Borell-TIS inequality for Gaussian processes:
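Proposition 1, Borell-TIS Inequality:
Let $f_t$ be a centered Gaussian process defined on some topological space $T$, with $\|f\|_T = \sup_{t \in T} |f_t|$ almost surely finite. Define $\sigma_T^2 = \sup_{t \in T} \mathbb{E}[f_t^2]$. Then:
$$P\left( \|f\|_T > \mathbb{E}[\|f\|_T] + u \right) \le \exp\left( -\frac{u^2}{2 \sigma_T^2} \right)$$
for $u > 0$.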
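The Borell-TIS inequality can be checked by simulation in a toy setting. The sketch below assumes a "process" that is just ten independent standard normal coordinates, so the largest pointwise variance is 1, and it replaces the true expected supremum with a Monte Carlo average. The sizes and the threshold `u` are arbitrary.

```python
import math
import random

rng = random.Random(3)
n_sites, n_draws, u = 10, 20000, 1.0

# Supremum of |f_t| over a centered Gaussian vector with unit variances,
# so sigma_T^2 = sup_t E[f_t^2] = 1 in this toy setting.
sups = [
    max(abs(rng.gauss(0.0, 1.0)) for _ in range(n_sites))
    for _ in range(n_draws)
]
mean_sup = sum(sups) / n_draws                    # Monte Carlo stand-in for E[sup]
tail = sum(s > mean_sup + u for s in sups) / n_draws
bound = math.exp(-u ** 2 / 2.0)                   # exp(-u^2 / (2 sigma_T^2))
```

The empirical tail lands well below the bound, which is expected: the inequality is dimension free and therefore fairly loose for a small, independent example like this one.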
Suppose for the purposes of examination that is very large relative to the other terms, and note that our previous statement about the expectation being an integral of tail probabilities, together with the Borell-TIS inequality, can be applied to reframe the original Gaussian expectation as:
Do a change of variables on the first integral to get:
Use the Borell-TIS inequality on the integrand of the first integral to obtain:
Integrating the exponential we get:
Multiplying the inside of the absolute value by we can lower bound this by removing the final term in the absolute value to get:
Which is constant with respect to the infimum so this gives us:
Taking the term out of
, we get:
Examining the theorem, note that we can reproduce everything we did for the spatial DP for another spatial DP as . That is, assuming
, and
, we get:
Wasserstein Distance as Linear Program Between Two Spatial DPs:
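We exploit the linear programming representation of optimal transport but extend it to infinite dimensional matrices (an operator on a countably infinite set). You can read more about optimal transport between finite discrete distributions in that previously linked blog post, but the general idea is to note that for discrete distributions $\mu = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} b_j \delta_{y_j}$, we have the following solution to the Wasserstein distance between them:
$$P^* = \arg\min_{P} \langle P, C \rangle_F$$
where $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius inner product between two matrices, subject to:
$P \mathbf{1}_m = a$,
$P^\top \mathbf{1}_n = b$,
$P_{ij} \ge 0$.
Where $C_{ij} = \|x_i - y_j\|^p$, where the norm is any appropriately specified norm.
Then the Wasserstein distance between these two discrete distributions is:
$$W_p(\mu, \nu) = \langle P^*, C \rangle_F^{1/p}$$
Extending this to countably infinite discrete distributions we get:
$$P^* = \arg\min_{P} \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} P_{ij} C_{ij}$$
subject to:
$\sum_{j=1}^{\infty} P_{ij} = a_i$,
$\sum_{i=1}^{\infty} P_{ij} = b_j$,
$P_{ij} \ge 0$.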
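The finite linear program can be illustrated without an LP solver in one special case: when both distributions put mass $1/n$ on each of $n$ points, an optimal plan can be taken to be a permutation matrix scaled by $1/n$ (a consequence of the Birkhoff-von Neumann theorem), so brute force over permutations works for small $n$. This is a toy sketch under that uniform-weight assumption, with scalar atoms and the absolute difference as the cost; the function name and data are hypothetical.

```python
from itertools import permutations

def uniform_ot(xs, ys, p=1):
    """Brute-force optimal transport between two uniform n-point measures."""
    n = len(xs)
    cost = [[abs(x - y) ** p for y in ys] for x in xs]  # cost matrix C_ij
    # With uniform marginals an optimal plan is a scaled permutation matrix.
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    plan = [[1.0 / n if best[i] == j else 0.0 for j in range(n)] for i in range(n)]
    # Frobenius inner product <P, C>_F of the plan with the cost matrix.
    frobenius = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(n))
    return frobenius ** (1.0 / p), plan

xs = [0.0, 2.0, 5.0]
ys = [1.0, 6.0, 2.5]
distance, plan = uniform_ot(xs, ys, p=1)
```

The rows and columns of `plan` each sum to $1/3$, which are exactly the marginal constraints of the linear program. For non-uniform or infinite weight vectors this shortcut no longer applies and a genuine LP (or its infinite dimensional analogue) is needed.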
Proof:
Note for myself to finish this. The minimizer of the expectation over all joint distributions on the atoms and weights of this expression is the Wasserstein distance between two spatial Dirichlet processes. Use self similarity of the DP to do an iterative conditional expectation to reduce down the inside term. With the conditional expectation iterating we get an iterative linear program for increasing dimension of . Likely can be extended to general species sampling models. The spatial Dirichlet processes’ weights are contained in the
Hilbert space for all
because they almost surely are infinite dimensional probability vectors.
Truncation Bounds:
The final theorem concerns the error of the truncation. It relies heavily on Proposition 1.
Truncation means that supposing that we sample with
with
a centered mean zero Gaussian process with some covariance operator
, we want to examine the difference between
and
; where
with the same definitions as previously for
and
.
By the Borell-TIS inequality we have: . This is due to the fact that
is still a Gaussian process, as it’s just the same infinite series as in Theorem 1, only with the principal first
terms removed.
Theorem 3:
Suppose the requirements of Theorem 1 and 2 are satisfied. Then we have the following:
Proof:
Take the probability expression
, with the term involving in the exponential involved in factoring out of the expectation the multiplicative scalar in the covariance of the Gaussian process. Note that by Jensen’s inequality we can upper bound, and take the expectation of both sides with respect to
. This gives us:
Note that is a constant which is a function of
. Denote this by
:
Let’s examine this theorem for a second. As we have that
, and for large
we have that
for some small constant
. This means that the term on the right hand side of the inequality is better bounded for large
, which is the same as saying the mass is spread across a larger number of atoms of the DP. As
gets smaller the division by
makes the exponential negative a larger value which lowers the threshold.
As we increase the threshold the probability becomes smaller, which makes sense because no matter what
we have the probability inside the expectation over
on the left hand side is over a smaller range.
This is all intuitive. As the DP has more atoms removing any individual primary atom doesn’t have as much of an influence on the approximation of the truncation to the DP itself. If there are only 3 atoms and you remove two, then you will get a bad truncation error.
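The intuition about truncation can be checked numerically. Assuming the standard stick-breaking weights with $\text{Beta}(1, \alpha)$ sticks, the mass left over after keeping the first $N$ atoms is $\prod_{j \le N} (1 - V_j)$, whose expectation is $(\alpha / (1 + \alpha))^N$. The sketch below compares that formula with a Monte Carlo estimate for a small and a large concentration parameter; the symbol $\alpha$, the values, and the function name are my own choices for illustration.

```python
import random

def leftover_mass(alpha, n_kept, rng):
    """Mass remaining after the first n_kept stick-breaking weights."""
    remaining = 1.0
    for _ in range(n_kept):
        remaining *= 1.0 - rng.betavariate(1.0, alpha)  # multiply by (1 - V_j)
    return remaining

rng = random.Random(4)
n_kept, n_draws = 5, 20000

estimates = {}
for alpha in (0.5, 5.0):
    draws = [leftover_mass(alpha, n_kept, rng) for _ in range(n_draws)]
    estimates[alpha] = sum(draws) / n_draws

# Closed form: E[prod_{j<=N} (1 - V_j)] = (alpha / (1 + alpha))^N.
expected = {alpha: (alpha / (1.0 + alpha)) ** n_kept for alpha in estimates}
```

With a small concentration parameter almost all the mass sits in the first five atoms, while with a large one roughly 40 percent of it is still in the truncated tail, so the same truncation level means very different things in the two regimes.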
Note also that examining a small number of principal atoms with associated weights
we have that
, as the remaining atoms have so little probability mass that the expected value pushes the difference down to
. This means that for small
the bound isn’t as good but is a better approximation to the following expression:
That is, the bound as a function of becomes tighter as
assigns more mass to fewer atoms.
Another interesting thing to consider is that the expectation over can be removed (this was just a minor addition in the derivation to get a fixed quantity), and we can specify the following:
where
denotes the infinite dimensional simplex. That is, we can come up with a bound for any given infinite dimensional probability vector (on the interior of the simplex to avoid dividing by 0); instead of using the expectation to get a fixed bounding.
A final interesting thing to consider is that we don’t have to remove the principal terms, we can instead remove any subset of the infinite terms in the series and get an updated bound. This just becomes replacing the sums from
to
, or from
to
, with appropriately specified sets and their complements.
EDITED:
Contraction Rates:
We have that the model is given by:
with observed sites
for
.
This gives that the base measure is the finite dimensional law at the sites, and
treated as known scale parameters.
Let the true random effects law be non-atomic and supported on a Hölder ball of order
(typical for
sample paths on
).
The true marginal law of is
. When
is a GP law, the marginal simplifies to
.
Metrics:
- On the Mixing law
: Wasserstein
over
- On the induced marginal
for
: Hellinger (or total variation)
Theorem 4 (Mixing law, Wasserstein):
Assume the true is supported on a Hölder ball
, with
, and that the prior is
with nonatomic
whose support contains
. Then for some constant
and any fixed
:
at the (near) minimax nonparametric rate:
Proof:
The relevant background from the notes is Lemma 3, and the testing and KL results in the linked paper that’s available in my publications section.
- The prior mass near the truth in
for a
with nonatomic base measure on a Hölder ball (the Lemma 3 style lower bound, in which the packing number appears): the display involving
and
shows the DP prior puts strictly positive mass on
, controlled by the metric entropy of
.
- The dependence of the packing/covering number
is explicitly and is the engine behind the
dependence. (Hölder balls satisfy
.)
- The notes set up the testing and Kullback-Leibler pieces. The
support definition and
bound of
s in terms of the sup-norm of the mean (A1) give the construction/contiguity needed in the general Ghosal-Ghosh-van der Vaart scheme.
Putting these together with the standard master theorem for contraction rates (entropy vs. sample size) yields the rate solving:
Theorem 5 (Contraction of the induced marginal for
):
If the true is a Gaussian law
(so
with
) then the posterior over the induced marginal
contracts to
at the parametric rate:
Why parametric?:
With fixed, the target
is a single Gaussian; the
centered at the matching
law places positive Kullback-Leibler mass there, and the KL bounds/tests for Gaussian processes provide exponentially powerful tests, yielding a
rate in Hellinger/KL. The GP-mixture integrates to a single Gaussian when the mixing is Gaussian, and recovering that finite dimensional target density is a parametric task.
Practical Reading of the Rates:
- If your aim is to learn the distribution of the latent spatial fields
(i.e.
) the intrinsic function-space difficulty appears: smoothness
, dimension
, and sample size
trade off as
.
- If you only care about the marginal law of the observed vector
(e.g. to predict
at observed sites), and the truth is Gaussian, you recover it at the optimal parametric
rate.
Conclusion:
That completes the blog post. I’m going to come back and do some simulations and add some figures so people can better see and understand what I was thinking as I worked on these problems. It’s nice to revisit old work and see it with fresh eyes, and I’m happy it still made some sense to me while looking at it.
