Some Statistical Properties of Mixture Density Networks

I recently worked on an applied project involving Mixture Density Networks (MDNs), and it got me pondering them more from the statistical perspective. Most of the papers on MDNs focus on computational approaches or applications. I also haven’t actually used this blog for any blog posts, so I wanted to get something out there and maybe start a pattern of contributing more regularly.

MDNs were proposed by Christopher Bishop in a 1994 technical report, and they have gained traction recently as a number of modern software packages became capable of efficiently implementing them. tensorflow/tensorflow-probability and pytorch/pyro are two examples of package pairs where the shared infrastructure allows welding together of probabilistic elements with neural networks and algorithmic differentiation in a cohesive fashion, and using these tools permits relatively easy implementation of the MDN concept. A quick Google Scholar search yields many results on applications of MDNs in problems ranging from climate science to robotics.

The basic concept of an MDN is that, instead of using a neural network to deterministically predict the output, you use the network to predict the parameters of a distribution over the output, and then estimate the weights of the network by maximizing the likelihood of the observed data under that distribution. That is, instead of estimating the weights of the network in an attempt to estimate the function $\hat{f}: X \rightarrow Y$ , we try to estimate a function $\xi: X \rightarrow \Xi$ such that $Y_i \sim P(Y| \xi(X_i))$ for some parameterized distribution $P(Y|\xi)$ with $\Xi \subseteq \mathbb{R}^L$ and $L\in \mathbb{N}$ . Typically $P(Y|\xi)$ is modeled as a $K$-component mixture distribution, where $K-1$ of the dimensions of $\Xi$ correspond to the mixing coefficients on the $(K-1)$-simplex.
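To make the mapping $\xi: X \rightarrow \Xi$ a bit more concrete, here is a minimal sketch in plain Python, assuming a one-dimensional $Y$ and a Gaussian mixture with $K = 2$ components. The names `to_mixture_params` and `mixture_density`, and the layout of the raw vector, are hypothetical; the raw vector stands in for whatever the last layer of a network would output. A softmax puts the mixing coefficients on the simplex and an exponential keeps the scales positive, which are common choices but not the only ones.

```python
import math

def to_mixture_params(raw, k):
    """Map 3*k unconstrained network outputs to valid mixture parameters.

    Assumed layout (illustrative only): [k logits | k means | k log-scales].
    """
    logits, means, log_scales = raw[:k], raw[k:2 * k], raw[2 * k:3 * k]
    # Softmax, shifted by the max for numerical stability, lands on the simplex.
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Exponentiating keeps every scale strictly positive.
    scales = [math.exp(v) for v in log_scales]
    return weights, means, scales

def normal_pdf(y, mean, scale):
    z = (y - mean) / scale
    return math.exp(-0.5 * z * z) / (scale * math.sqrt(2.0 * math.pi))

def mixture_density(y, weights, means, scales):
    """Density p(y | xi) of the Gaussian mixture evaluated at y."""
    return sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, scales))

# Stand-in for the network output at a single input x, with K = 2.
raw_output = [0.0, 0.0, -1.0, 1.0, 0.0, 0.0]
weights, means, scales = to_mixture_params(raw_output, k=2)
density_at_zero = mixture_density(0.0, weights, means, scales)
```

Because the constraints are handled by the transformation, the network itself can output any real vector, which is what lets $\Xi$ sit inside $\mathbb{R}^L$ .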

The way inference is typically done is to maximize the log likelihood as a function of the weights of the network. That is, for any network architecture with weights $\eta$ in some weight space $\mathbb{R}^q$ , where $q \in \mathbb{N}$ is determined by the architecture of the network, we equivalently minimize the negative log likelihood:

$\hat{\eta} = \textrm{argmin}_\eta -\sum_{i=1}^n \log p(y_i | \xi(x_i; \eta))$

Where $p(y | \xi)$ is the density of the distribution $P(Y | \xi)$ .
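The objective above can be sketched in plain Python as follows, assuming a one-dimensional Gaussian mixture for $p(y|\xi)$ . The function names and the `toy_xi` callable (a stand-in for the network, returning weights, means and scales for an input) are hypothetical. The log-sum-exp step is a standard numerical safeguard rather than anything specific to MDNs.

```python
import math

def log_normal_pdf(y, mean, scale):
    z = (y - mean) / scale
    return -0.5 * z * z - math.log(scale) - 0.5 * math.log(2.0 * math.pi)

def mixture_log_density(y, weights, means, scales):
    """log p(y | xi) for a Gaussian mixture, computed with log-sum-exp."""
    terms = [math.log(w) + log_normal_pdf(y, m, s)
             for w, m, s in zip(weights, means, scales)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def negative_log_likelihood(xi, xs, ys):
    """Sum over the data of -log p(y_i | xi(x_i)); xi stands in for the network."""
    return -sum(mixture_log_density(y, *xi(x)) for x, y in zip(xs, ys))

def toy_xi(x):
    # Hypothetical "network": two equally weighted components whose means move with x.
    return [0.5, 0.5], [x - 1.0, x + 1.0], [1.0, 1.0]

xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 2.0]
nll = negative_log_likelihood(toy_xi, xs, ys)
```

In a real MDN the role of `toy_xi` is played by the network with weights $\eta$ , and the minimization over $\eta$ is carried out by a gradient-based optimizer.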

Let $f_\eta(y|x)$ denote $p(y|\xi(x;\eta))$ and suppose that:

$(X_1,Y_1),\ldots,(X_n,Y_n) \overset{iid}{\sim} G$

where $G$ is some distribution that has density $g(x,y)$ . Factorizing $g(x,y)$ into $g(y|x)g(x)$ , and using the MDN model to approximate $g(y|x)$ with $f_\eta(y|x)$ , the estimator (rescaled by $1/n$ , which does not change the minimizer) is:

$\hat{\eta} = \textrm{argmin}_\eta -\frac{1}{n}\sum_{i=1}^n \log[f_\eta(y_i| x_i)]$

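The Law of Large Numbers gives us that the sample average converges to its expectation, so that (under the regularity conditions discussed further down) as $n \rightarrow \infty$ :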

$\hat{\eta} \overset{P}{\rightarrow} \textrm{argmin}_\eta -\textrm{E}_G[\log[f_\eta(Y|X)]]$
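As a small sanity check of this limit, the sketch below uses a deliberately simple case that is not an MDN: $G$ is a standard normal with no covariate, and $f_\eta$ is a normal density with mean $\eta$ and unit variance. Under those assumptions the expected negative log-likelihood has the closed form $\frac{1}{2}\log(2\pi) + \frac{1}{2}(1 + \eta^2)$ , so the sample average can be compared with it directly. The seed, sample size and the value of $\eta$ are arbitrary choices.

```python
import math
import random

def neg_log_density(y, eta):
    """-log f_eta(y) for a normal density with mean eta and unit variance."""
    return 0.5 * math.log(2.0 * math.pi) + 0.5 * (y - eta) ** 2

def expected_nll(eta):
    """Closed form of -E_G[log f_eta(Y)] when G is standard normal."""
    return 0.5 * math.log(2.0 * math.pi) + 0.5 * (1.0 + eta ** 2)

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(200_000)]

eta = 0.7
sample_average = sum(neg_log_density(y, eta) for y in sample) / len(sample)
gap = abs(sample_average - expected_nll(eta))

# In this toy model the minimizer of the sample objective is the sample mean,
# which should sit close to the population minimizer eta = 0.
eta_hat = sum(sample) / len(sample)
```

Pointwise convergence of the objective is what the Law of Large Numbers delivers; carrying it over to the minimizer is where the extra assumptions come in.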

We also have that, because $g(x)$ does not depend on $\eta$ , and assuming $\textrm{E}_G[\log g(X)]$ exists:

$\hat{\eta} \overset{P}{\rightarrow} \textrm{argmin}_\eta -\textrm{E}_G[\log[f_\eta(Y|X)g(X)]]$

Now note that $\textrm{E}_G[\log[g(Y|X)g(X)]]$ doesn’t depend on $\eta$ , so we can add it to the previous limit to give us:

$\hat{\eta} \overset{P}{\rightarrow} \textrm{argmin}_\eta -\textrm{E}_G[\log[f_\eta(Y|X)g(X)]] + \textrm{E}_G[\log[g(Y|X)g(X)]]$

Rearranging a bit by putting the two logs together, and using the fact that $g(x,y) =g(y|x)g(x)$ , we get:

$\hat{\eta} \overset{P}{\rightarrow} \textrm{argmin}_\eta -\textrm{E}_G[\log[\frac{f_\eta(Y|X)g(X)}{g(X,Y)}]]$

The quantity being minimized is the Kullback-Leibler divergence $\textrm{KL}(G || F_\eta)$ between $G$ and the distribution $F_\eta$ with density $f_\eta(y|x)g(x)$ . This gives us that the MDN model converges in probability to the Kullback-Leibler projection of $G$ onto $\{F_\eta : \eta \in \mathbb{R}^q\}$ .

Breaking this apart a little bit, note that if the family $\{F_\eta : \eta \in \mathbb{R}^q\}$ is sufficiently rich that $\min_\eta \textrm{KL}(G || F_\eta) = 0$ , then, using the fact that the total variation distance is bounded above by the square root of one half of the Kullback-Leibler divergence, we have in the limit $\textrm{P}[F_{\hat{\eta}}(A) = G(A)]=1$ for all measurable sets $A$ , where the probability is taken over $\hat{\eta}$ . Another important fact is that there are no requirements placed on the space $Y$ . $\Xi$ must be a subset of $\mathbb{R}^L$ for some $L\in \mathbb{N}$ to make it a typical neural network architecture, but $Y$ could be a more general probability space and the limiting result would still hold. Two important special cases are when $Y$ is discrete or matrix valued.
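For reference, the total variation bound used above is Pinsker's inequality. Writing $\textrm{TV}$ for the total variation distance (notation that is not used elsewhere in this post), it reads:

$\textrm{TV}(G, F_\eta) = \sup_A |G(A) - F_\eta(A)| \leq \sqrt{\frac{1}{2}\textrm{KL}(G || F_\eta)}$

so a Kullback-Leibler divergence of zero forces $G(A) = F_\eta(A)$ for every measurable set $A$ .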

Since $g(x,y) = g(y|x)g(x)$ , the $g(X)$ factors cancel inside the log, and we obtain:

$\hat{\eta} \overset{P}{\rightarrow} \textrm{argmin}_\eta -\textrm{E}_G[\log[\frac{f_\eta(Y|X)}{g(Y|X)}]]$

Using properties of conditional expectations we get:

$\hat{\eta} \overset{P}{\rightarrow} \textrm{argmin}_\eta \textrm{E}_{G_X}[-\textrm{E}_{G_{Y|X}}[\log[\frac{f_\eta(Y|X)}{g(Y|X)}]| X]]$

So the term inside is the KL divergence between the distribution with density $g(y|x)$ and the distribution with density $f_\eta(y|x)$ for given $x$ . The outer expectation is the expectation with respect to the marginal distribution of $X$ . This gives another interpretation of the limit of $\hat{\eta}$ : it is the value of $\eta$ that minimizes the Kullback-Leibler divergence between the true conditional distribution and the MDN at each $x$ , averaged over the true distribution of $x$ .
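The equality between the joint divergence and the averaged conditional divergence can be checked numerically. The sketch below assumes a small discrete example with made-up probability tables (binary $X$ , three-valued $Y$ ); `g_joint` plays the role of $g(x,y)$ and `f_cond` the role of $f_\eta(y|x)$ for one fixed $\eta$ .

```python
import math

# Made-up joint distribution g(x, y): rows index x in {0, 1}, columns y in {0, 1, 2}.
g_joint = [[0.10, 0.20, 0.10],
           [0.30, 0.05, 0.25]]
# Made-up model conditionals f(y | x); each row sums to one.
f_cond = [[0.30, 0.40, 0.30],
          [0.40, 0.20, 0.40]]

# Marginal g(x) and true conditional g(y | x).
g_x = [sum(row) for row in g_joint]
g_cond = [[p / g_x[x] for p in row] for x, row in enumerate(g_joint)]

# Joint form: KL(G || F) where F has density f(y | x) g(x).
kl_joint = sum(
    g_joint[x][y] * math.log(g_joint[x][y] / (f_cond[x][y] * g_x[x]))
    for x in range(2) for y in range(3)
)

# Conditional form: KL between g(. | x) and f(. | x), averaged over g(x).
kl_per_x = [
    sum(g_cond[x][y] * math.log(g_cond[x][y] / f_cond[x][y]) for y in range(3))
    for x in range(2)
]
kl_averaged = sum(g_x[x] * kl_per_x[x] for x in range(2))
```

The two numbers agree up to floating point error, which is the discrete version of the conditional expectation step above.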

There are some gaps in the calculations above, mainly unstated regularity and identifiability assumptions. The outline mostly follows Theorems 2.1 and 2.2 of a 1982 paper by Halbert White in Econometrica, titled Maximum Likelihood Estimation in Misspecified Models. The primary assumptions we need are from A3 in that paper:

1. $\textrm{E}_G[\log g(x,y)]$ needs to exist and $|\log[f_\eta(y|x)]| \leq m(x,y)$ for every $\eta \in \mathbb{R}^q$ , where $m(x,y)$ is integrable with respect to $G$ .
2. $\textrm{KL}(G || F_\eta)$ has a unique minimum in $\eta$ .

To incorporate the fact that the results in the previously referenced paper didn’t have the conditional distribution component, we also need the assumption that $\textrm{E}_G[\log g(x)]$ exists.

The assumptions on $\textrm{E}_G[\log g(x)]$ and $\textrm{E}_G[\log g(x,y)]$ are hard to verify when building the model, as they involve the unknown true distribution; up to sign, they correspond to the marginal entropy of $X$ and the joint entropy of $G$ . The assumption $|\log[f_\eta(y|x)]| \leq m(x,y)$ is easier to validate, as we have knowledge about the structure of $f_\eta(y|x)$ . The second assumption concerns identifiability, and it generally does not hold for MDNs, as the mixture component is only identifiable up to permutation. One can remedy this by enforcing an ordering on the mixing coefficients, but there might still be identifiability issues with the weights in the network component.
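The permutation issue is easy to see in code. In the sketch below (plain Python, one-dimensional Gaussian mixture, hypothetical function names), swapping the components leaves the density unchanged, so two different parameter vectors describe the same distribution. Sorting the components by mixing coefficient is one way of enforcing an ordering and maps both to a single representative; ties between equal coefficients would need an extra rule.

```python
import math

def mixture_density(y, components):
    """Density of a Gaussian mixture; components are (weight, mean, scale) triples."""
    return sum(
        w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in components
    )

def canonical_order(components):
    """Sort components by mixing coefficient so relabelings collapse to one form."""
    return sorted(components, key=lambda c: c[0])

original = [(0.7, -1.0, 0.5), (0.3, 2.0, 1.5)]
swapped = [original[1], original[0]]  # same mixture, components relabeled

# Compare the two parameterizations on a grid of y values from -3 to 3.
grid = [-3.0 + 0.5 * i for i in range(13)]
max_difference = max(
    abs(mixture_density(y, original) - mixture_density(y, swapped)) for y in grid
)
same_after_sorting = canonical_order(original) == canonical_order(swapped)
```

This only addresses the labeling of the mixture components; it does nothing about symmetries among the hidden units of the network itself.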

It may also be possible to calculate the asymptotic distribution of the weights, given some additional assumptions outlined in the previously referenced paper. Most of the results should still apply with some additional requirements, as we’re approximating the conditional distribution using the MDN and not the full joint distribution of the data $G$ . An interesting thing to consider with the asymptotic distributions is that they’re likely to depend on the Fisher information in some way, and with the automatic differentiation capabilities of tensorflow and pytorch you might be able to approximate the matrix of second derivatives accurately using numerical methods.
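As a rough illustration of the second-derivative idea, the sketch below approximates the Hessian of an average negative log-likelihood at the maximum likelihood estimate. Two substitutions are assumptions of this sketch rather than anything from the discussion above: it uses central finite differences in plain Python as a dependency-free stand-in for automatic differentiation, and it uses a single normal model with parameters $(\mu, \log\sigma)$ instead of an MDN, because there the exact answer is known to be the diagonal matrix with entries $1/\hat{\sigma}^2$ and $2$ . The data values and step size are arbitrary.

```python
import math

def average_nll(params, ys):
    """Average negative log-likelihood of a normal model, params = (mu, log_sigma)."""
    mu, log_sigma = params
    sigma = math.exp(log_sigma)
    return sum(
        log_sigma + 0.5 * ((y - mu) / sigma) ** 2 + 0.5 * math.log(2.0 * math.pi)
        for y in ys
    ) / len(ys)

def finite_difference_hessian(fn, point, step=1e-4):
    """Central-difference approximation to the matrix of second derivatives."""
    dim = len(point)
    hessian = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(dim):
            def shifted(di, dj):
                # Evaluate fn with coordinates i and j moved by +/- one step each.
                p = list(point)
                p[i] += di * step
                p[j] += dj * step
                return fn(p)
            hessian[i][j] = (
                shifted(1, 1) - shifted(1, -1) - shifted(-1, 1) + shifted(-1, -1)
            ) / (4.0 * step * step)
    return hessian

ys = [0.3, -1.2, 0.8, 2.1, -0.4, 1.0]
mu_hat = sum(ys) / len(ys)
sigma_hat = math.sqrt(sum((y - mu_hat) ** 2 for y in ys) / len(ys))

hessian = finite_difference_hessian(
    lambda p: average_nll(p, ys), [mu_hat, math.log(sigma_hat)]
)
```

With an automatic differentiation library the same matrix could be obtained without choosing a step size, which matters more as the number of weights grows.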

I was really interested in MDNs when I first heard about them. I think analyzing them from this perspective gives insight into how to build models with desirable asymptotic properties, and even the entropy assumptions on $G$ illustrate that, for particularly unwieldy underlying distributions, the asymptotics may not apply.

Thanks to Anar Yusifov for introducing me to MDNs, and for many insightful discussions on the topic.

Ryan Warnick Phd.
