Decision Theory for Large-Scale Outlier Detection Using Aleatoric Uncertainty

Edit: This blog post has been developed into a preprint and extended in a number of ways. You can view the preprint on my ResearchGate here.

I was introduced to the concept of aleatoric uncertainty in the context of Bayesian neural networks, but I was recently working on a project for outlier detection and got interested in how reframing approaches from decision theory to a version of aleatoric uncertainty for typical Bayesian models could provide a mechanism for outlier detection. I also want to thank Karl Hallgren for inspiring me to think more about Bayesian outlier dection.

The reason this comes up naturally in the Bayesian context is because the model has some inherent uncertainty. Thinking of a parametric model, acknowledging in the Bayesian paradigm our uncertainty in the parameters we get that a distribution over the parameters maps to a distribution over a finite dimensional subspace of all distributions on the data. Thus the two types of uncertainty: our uncertainty in the model, epistemic, and our uncertainty in the data generating mechanism, aleatoric.

Introduction:

To see why this is useful for large scale outlier detection let us analyze a simple model. Suppose the data are assumed to follow X_i \overset{iid}{\sim}\textrm{N}(\mu,\sigma^2), and we place a Normal-Inverse-Gamma prior, \textrm{NIG}(\mu_0, \nu, \alpha,\beta), on the parameters \mu, \sigma^2. Then we have that, given data X_1,...,X_n:

\mu, \sigma^2 | \{X_i\}_{i=1}^n \sim \textrm{NIG}(\frac{\nu\mu_0 + n\overline{X}}{\nu + n}, \nu + n, \alpha + \frac{n}{2}, \beta + \frac{\sum_{i=1}^n (X_i - \overline{X})^2}{2} + \frac{n \nu }{\nu + n}\frac{(\overline{X} - \mu_0)^2}{2})

These parameters are fairly complicated, so denote the parameters \mu_0^*, \nu^*, \alpha^*, \beta^*, such that \mu, \sigma^2 | \{X_i\}_{i=1}^n \sim \textrm{NIG}(\mu_0^*, \nu^*, \alpha^*, \beta^*).

Referring to the point made previously, we have a distribution over \mu and \sigma^2 which is maps to a two-dimensional subset of all distributions over X. This is the epistemic uncertainty. However, if we use the posterior predictive distribution of X | X_1,...,X_n we get the aleatoric uncertainty:

P(X | X_1,...,X_n) = \int_{\mu \in \mathbb{R}}\int_{\sigma \in \mathbb{R}^+}P(X | \mu, \sigma^2)P(\mu,\sigma^2| X_1,....,X_n)d\sigma^2d\mu

which luckily for this simple model has a closed form solution:

X | X_1,...,X_n \sim \textrm{t}_{2\alpha^*}(\mu_0^*,\frac{\beta^*(\nu^* + 1)}{\nu^*\alpha^*})

Where the distribution is a non-centered student-t distribution, with 2\alpha^* degrees of freedom, location parameter \mu_0^*, and scale parameter \frac{\beta^*(\nu^* + 1)}{\nu^*\alpha^*}.

Where Decision Theory Comes In:

Where this idea of uncertainty in the data being inherited from the model uncertainty becomes useful is in outlier detection. Say we observe an observation X^* and we want to know if it’s an outlier or not, we can look at P(X^*> X | X_1,...,X_n) and select a cutoff to classify the observed value as an outlier if this probability is too large. The problem becomes choosing this cutoff.

Now to incorporate some ideas from decision theory, suppose we observed \{X_{ij}\}_{i \in \{1,...,n\},j\in \{1,...,m\}} and want to observe if set of observations \{X^*_{(n+1)j}\}_{j \in \{1,...,m\}} has outliers. If agree with the philosophical leap that we can make decisions based on unobserved predictive values, \{X_{(n+1)j}\}_{j\in \{1,...,m\}}\}, then our goal is to make a decision \delta_j \in \{0,1\}; where \delta_j = 1 means that X^*_{(n+1)j} was an outlier, and \delta_j =0 means it was not. We can construct loss functions similar in spirit to the following:

L(X^*, X) = - \sum_{j=1}^m \delta_j \textrm{I}(X^*_{(n+1)j} > X_{(n+1)j} ) + c_1\sum_{j=1}^m (1-\delta_j)\textrm{I}(X^*_{(n+1)j} > X_{(n+1)j} ) + c_2D

Where \textrm{I}(\circ) denotes an indicator function, and D = \sum_{j=1}^m \delta_j is a penalty for total discoveries to control for multiplicity.

This loss function provides an incentive for true positives in the first addendum, - \sum_{j=1}^m \delta_j \textrm{I}(X^*_{(n+1)j} > X_{(n+1)j} ), an adjustable penalty for false negatives in the second addendum, c_1\sum_{j=1}^m (1-\delta_j)\textrm{I}(X^*_{(n+1)j} > X_{(n+1)j} ), and a penalty for total discoveries that can be made stronger or weaker, c_2D.

In typical decision theory we use the posterior expected loss and minimize it, but in this context we’re working with aleatoric uncertainty; so we take the posterior predictive expected loss. This gives us:

\mathbb{E}_{X|X_1,...,X_n}[L(X^*, X)] = - \sum_{j=1}^m \delta_j P(X^*_{(n+1)j} > X_{(n+1)j} | X_{1j},...,X_{nj}) + c_1\sum_{j=1}^m (1-\delta_j) P(X^*_{(n+1)j} > X_{(n+1)j} | X_{1j},...,X_{nj}) + c_2D

Denoting by r_j the value P(X^*_{(n+1)j} > X_{(n+1)j} | X_{1j},...,X_{nj}), we readjust the posterior predictive expected loss to the following:

\mathbb{E}_{X|X_1,...,X_n}[L(X^*, X)] =     -\sum_{j=1}^m \delta_j r_j  + c_1\sum_{j=1}^m (1-\delta_j)r_j + c_2D

For which the solution, minimizing with respect to \{\delta_j\}_{j\in\{1,...,m\}}, is:

\delta_j = \textrm{I}(r_j > \frac{c_2}{1+c_1}).

This follows along pretty closely with an outline in Section 4 of the article FDR and Bayesian Multiple Comparison Rules by P. Muller, G. Parmigiani, and K. Rice in Bayesian Statistics 8; just using the posterior predictive to calculate probabilities for outliers from the indicator functions, but with the same lower bound on the minimizer of the expected loss.

Alternative Way to Choose Threshold and Data Informed Loss Function:

This means that the optimal decision for this loss function (indicator functions of outliers) is constructed out of some lower bound on r_j. One way to select this lower bound instead of fine tuning the parameters c_1 and c_2 is to optimize the Bayesian FDR outlined here[1]. The posterior predictive CDF values, r_j, provide evidence in favor of testing the hypothesis H_0: X^*_{(n+1)j} > X_{(n+1)j}, for unobserved future predictive values X_{(n+1)}, versus the alternative hypothesis H_1: X^*_{(n+1)j} \leq X_{(n+1)j} . Thus the threshold could alternatively be set with multiplicity control by finding, for some fixed q\in [0,1], the threshold would be the largest such \eta to satisfy the following equation:

BFDR(\eta) = \frac{\sum_{j=1}^m r_j(I_{(r_j \leq \eta)})}{\sum_{j=1}^m I_{(r_j\leq\eta)}} < q

This provides a multiplicity control in the correction and offers a less controlled way to manage the false discoveries than tinkering directly with c_1 and c_2.

Let’s now examine, in the context of the previous decision theoretic model, what such a Bayesian FDR dictated threshold means in terms of the values of penalties on multiplicity and false negatives. We set up the following equation and solve for c_1 and c_2:

BFDR(c_2/(1+c_1)) = \frac{\sum_{j=1}^m r_j(I_{(r_j \leq c_2/(1+c_1))})}{\sum_{j=1}^m I_{(r_j\leq c_2/(1+c_1))}} < q

Rearranging a bit we get:

\sum_{j=1}^m (r_j-q) I_{(r_j \leq c_2/(1+c_1))}< 0

In other words, the largest such pairing of c_1 and c_2 such that the magnitude of the sum of all \{ r_j - q: r_j > q \land r_j \leq c_2/(1+c_1)\} is less than the magnitude of the sum of all \{ r_j - q: r_j \leq q \land r_j \leq c_2/(1+c_1)\} (which is going to be less than or equal to 0 because r_j \leq q).

As c_2/(1+c_1) gets smaller, the number of elements in the first set decreases more rapidly than the number of elements in the second set. This is because the second logical expression makes the first less likely to be true. Inversely, the same is true in the opposite context in the opposite direction, as c_2/(1+c_1) gets larger the number of elements increases more rapidly in the first set. An equilibrium needs to be find as the largest such c_2/(1+c_1) such that the statement is true.

Note that if c_2/(1+c_1) = \eta for some \eta : BFDR(\eta) < q, then we have that c_2 = (1+c_1)\eta and vice versa c_1 = \frac{c_2}{\eta} - 1. Remembering that c_1 is the penalty for false negatives, this gives us the linear relationship dictated by the BFDR threshold between the penalty for false negatives and the penalty for total discoveries. We have that BFDR^{-1}(q; r_1,...,r_n) is a monotonically (not strictly) decreasing function; for which the rate of increase depends on \{r_1,...,r_n\}

Thus, holding the penalty on false negatives, c_1, fixed, we get c_2 = (1+c_1)BFDR^{-1}(q) is a where the right hand side is a composition of a linear and monotonic function, this monotonicity as a function of q. Note that as q decreases c_2 increases monotonically, and as c_1 increases c_2 increases linearly. This is different from holding c_2, the penalty for total discoveries fixed, where an increase in c_2 causes a linear decrease in c_1.

L(X^*, X) = - \sum_{j=1}^m \delta_j \textrm{I}(X^*_{(n+1)j} > X_{(n+1)j} ) + c_1\sum_{j=1}^m (1-\delta_j)\textrm{I}(X^*_{(n+1)j} > X_{(n+1)j} ) +(c_1 + 1)BFDR^{-1}(q)D

It should be obvious at this point that one can reverse engineer decision criteria to give a loss function for which a particular threshold selection criteria (whatever type, whatever appropriately specified loss function) coincides.

However this is somewhat circular logic, as the Bayesian FDR is necessarily Bayesian, and is taken and evaluated after the expectation is taken. Still I think there’s interesting potential to play around with this generally speaking. Essentially it is saying we have a data dependent loss function. In other words, if you use the the Bayesian FDR a posteriori to select a cutoff, it would have equated to using this decision criteria on that data.

Noting Something Here About Bayesian FDR and the Previous Decision Criteria:

Assume in a typical Bayesian model that we have \omega \in \mathbf{R}^m and \{\gamma_j\}_{j=1}^m \in \{0,1\}^m which is a model for \{\omega_j\}. E.g. \omega_j \equiv 0 \overset{a.s.}{\Leftrightarrow} \gamma_j \equiv 0. Then we get a posterior for these two conditional on some data
P( \{\gamma_j\}_{j=1}^m, \{\omega_j\}_{j=1}^m | Y) where Y is some observed data. Let r_j = E_{\gamma_j | Y}[\gamma_j].

Examine the following loss:

L(\omega, \gamma, X) = - \sum_{j=1}^m \delta_j \gamma_j + c_1\sum_{j=1}^m (1-\delta_j)\gamma_j +FDR^{-1}(q; \gamma)D

Where q = FDR(\eta; \gamma) is just the false discovery rate (which is monotonic in q because it can only be 0 or 1 in this scenario prior to taking the expectation). Note also that FDR^{-1}(q; \gamma) is monotonic in each \gamma_j. We prove a small theorem here to show that E_{\gamma}[FDR^{-1}(q;\gamma)] = BFDR^{-1}(q; r)

Theorem 1:

Statement: For \gamma_j \overset{ind.}{\sim}\textrm{Bern}(r_j), \hspace{2mm} j\in\{1,...,m\}, and FDR(\eta) =  \sum_{j=1}^m\frac{I_{\gamma_j > \eta }(1-\gamma_j)}{\sum_{I_{\gamma_j > \eta}}} < q, we have the following:


\textrm{E}_{\gamma| Y }[FDR^{-1}(q; \gamma)] = BFDR^{-1}(q; r)

Proof:

Note that \textrm{E}_{\gamma| Y, \eta}[FDR(\eta)] = BFDR(\eta) [2], and note that FDR^{-1}(q,\gamma) is independently monotonic with respect to all \gamma_j, \hspace{2mm} j\in\{1,...,m\} as well as q. Specifically FDR^{-1}(q;\gamma) is piecewise linear with respect to \gamma, giving us that the monotonicity and piecewise linear combine such that the expectation is just the inverse of the original expectation. Thus \textrm{E}_{\gamma | Y}[FDR^{-1}(q; \gamma)] = BFDR^{-1}(q; r).

\blacksquare

This gives us that the posterior expectation of our previous loss function I given by:

\textrm{E}_{\gamma | Y}[L(\omega, \gamma, X)] = - \sum_{j=1}^m \delta_jr_j +c_1\sum_{j=1}^m (1-\delta_j)r_j+BFDR^{-1}(q; r)D

Which is a decision criteria with solution:

\delta_j = I_{(r_j\geq  \frac{BFDR^{-1}(q;r))}{1+c_1})}

Additional Note on Bayesian FDR:

We can generalize the Bayesian FDR with the following definition:

Definition:

For BFDR(\eta) we can introduce a modified:

BFDR(\eta; a) =\textrm{argmin}_q \sum_{j=1}^m\textrm{sign}(r_j - q)|r_j - q|^a I_{r_j \leq \eta}

For a \in \mathbf{R}^+\cup{0}.

Illustration:

We have that:

\eta =  \textrm{argmax}_\eta \sum_{j=1}^m \textrm{sign}(r_j-q) |(r_j - q)I_{(r_j \leq \eta)}| < 0

Using the sets outlined 2 sections ago we can see that the previous statement is true.

We note that this expression for BFDR^{-1}(q) is piecewise constant and increasing with respect to latex q, and that for a fixed \eta the BFDR is the minimum such value of q that this expression is true. (This is because the BFDR is satisfied by more than one such q, but the minimum value is the true BFDR). This gives us that

q = \textrm{argmin}_q \sum_{j=1}^m\textrm{sign}(r_j - q) |(r_j - q)I_{r_j \leq \eta}|

Note that, r_j - q \in [-1,1], giving us that |r_j - q| \in [0,1], giving us that |r_j - q|^a is going to similarly be in [0,1] for any a \in \mathbf{R}^+\cup{0}. This allows us to modify the previous expression to be:

q = \textrm{argmin}_q \sum_{j=1}^m\textrm{sign}(r_j - q) |r_j - q|^a I_{r_j \leq \eta}

Note that as a \rightarrow \infty the value of q is more emphasized by extreme values, and for values of a going towards 0 the value is q is more emphasized by moderate values. Note that for a = 0 the value for q is just \textrm{argmin}_q \sum_{j=1}^m \textrm{sign}(r_j -q) I_{r_j \leq \eta} = \min_{j} r_j if \exists r_j \leq \eta and 0 otherwise.

It should be obvious that we can reverse engineer this to get an appropriate \eta for BFDR(\eta; a) < q.

Another interesting point to consider is that \sum_{j=1}^m \textrm{sign}(r_j - q)I_{r_j < \eta} = \frac{d}{dq}\sum_{j=1}^m |r_j - q|I_{r_j \leq \eta}. Giving us that our solution is the point at which the rate of change of the conditional median is maximized. This is true by induction for all a in the appropriate range. The solution is the minimizer of the change in rate of the conditional absolute a+1 non-central moment of the \{r_j\}_{j=1}^m. E.g. with respect to the non-centrality parameter:

Theorem 2:

Let R \sim \frac{1}{m}\sum_{j=1}^m \delta_{r_j}(\circ) and we have the following:

\forall a \geq 0, \hspace{2mm} BDFR(\eta; a) = \textrm{argmin}_{q} \frac{d}{dq}\mathbb{E}_R[|r - q|^{a+1}| R \leq \eta]

Which is the point which minimizes the change in the a+1 non-central absolute moment with respect to the non-centrality parameter.

Proof:

The proof follows from applying the dominated convergence theorem to exchange an expectation and a derivative and using laws from basic calculus. There is a \frac{1}{a+1} term that has to be considered, but this term can be removed from the expectation and derivative, and the minimum is invariant under scalar multiplication/division for scalars greater than 0, so we can remove this term as well. The same applies to the \frac{1}{m} normalizing constant in the distribution of R.

\blacksquare

This theorem also provides us a way to investigate complexities such as those that arise when \{r_j\}_{j=1}^m is random and correlated. In other words, conditional on R the \gamma are independent, but there are dependencies amongst the r_j \sim R.

Returning to Outlier Detection in the Scenario of Nonparametric Models:

It’s possible to get more abstract with the posterior predictive expection for outlier detection. For example, suppose we don’t want to use a parametric model, but instead some nonparametric approach. Taking species sampling models as an example, and specifically the Dirichlet process:

X_i\overset{iid}{\sim} G

G \sim\textrm{DP}(\alpha, G_0)

We get that the posterior predictive is X | X_1,...,X_n \sim \frac{\alpha}{\alpha +n} G_0 + \frac{1}{\alpha+n}\sum_{i=1}^n \xi_{X_i}(X); where \xi_x(\circ) denotes a point mass measure at x. This can be used to make outlier decisions in a decision theoretic model, and some interesting stuff could fall out.

Another important factor to consider is that we used a pretty simple probability event, |X^*| > |X|, but in general other notions of what an outlier is could be encoded into this expression. For data that is more complex than real numbers, you could look at distance metrics on the space, and come up with a way to detect outliers from a different perspective.

I thought this was an interesting way to look at decision theory, and the loss function can be modified to prioritize different things one might be interested in. Additionally, the flexibility of the approach lends itself to a lot of different applications and generalizations.

Simulations:

I did simulations for both the DP outlier model as well as the NormalInverseGamma outlier model. There are 1000 replicated tests with 750 in the control group and 250 in the “anomalous” group. All data for the control group were simulated according to a \textrm{N}(0,1) distribution, and the previous values for the anomalous group were simulated according to the same distribution. There are 25 experiments, where the X_{(n+1)j} for the outlier group are simulated according to a \textrm{N}(\mu, \sigma) distribution with \mu \in \{1, 1.5, 2, 2.5, 3\} and \sigma \in \{1.2,1.4,1.6,1.8,2.0\}. n is specified to be equal to 10. The \textrm{NIG} prior has parameters \mu_0 = 0, \nu_0 = 1, \alpha_0 = 1, \beta_0 = 1. For the DP the concentration parameters is specified to be \alpha = 10, and the base measure is a normal estimated from all of the \{X_{ij}\}_{j\in \{1,...,m\}, i \in \{1,...,n\}}. Five replications were done of the experiments. The model gives the following results:

For the Area under the curve for the outliers for the NormalInverseGamma model we get the following performance

Where the tick marks from left to right, for each individual plot, indicate \mu \in \{1, 1.5, 2, 2.5, 3\}.

Using the Bayesian FDR Threshold and estimating accuracy, precision, and recall, we get the following:

Where the tick marks from left to right, for each individual plot, indicate \mu \in \{1, 1.5, 2, 2.5, 3\}.

Now to examine the DP model, we get the following performance on AUC:

Where the tick marks from left to right, for each individual plot, indicate \mu \in \{1, 1.5, 2, 2.5, 3\}.

For accuracy, precision, and recall, we get the following:

Where the tick marks from left to right, for each individual plot, indicate \mu \in \{1, 1.5, 2, 2.5, 3\}.

The previous plots for accuracy, precision, and recall, for both the DP and NIG, are with q = .005. Adjusting q=.1 we get higher recall but lower accuracy and precision.

Leave a Reply

Discover more from Ryan Warnick Phd.

Subscribe now to keep reading and get access to the full archive.

Continue reading