\[\begin{align} \nabla_\theta A(\theta, \Theta) &= -\frac{1}{2}\Theta^{-1}\theta = \mu\\ \nabla_\Theta A(\theta, \Theta) &= -\frac{1}{4}\theta\theta^\top\Theta^{-2} - \frac{1}{2}\Theta^{-1} = \mu\mu^\top + \Sigma\\ \end{align}\]

Introduction

We return to a favorite topic: Gaussian distributions, which we already know arise from a simple differential equation, have close connections to convolutions, and serve as the error model for least-squares regression.

In this post, we see a new side of the Gaussian: the Gaussians as a log-linear family, aka an exponential family. As it turns out, close connections between Gaussians and two of the problems we know how to solve, linear systems and quadratic forms, give (relatively) simple and computationally tractable expressions for the quantities of interest in doing inference in log-linear families: the cumulant generating function, the sufficient statistics, the canonical and mean parameters, and the entropy. In this post, we’ll focus on the canonical parameters. The main contribution of this post is to work carefully through the (non-degenerate) multivariate Gaussian case, encompassing both the the derivation of the canonical parameters and computing the gradient of the cumulant generating function.

Log-Linear Families

A family of distributions can be said to be log-linear if its members can be generated by varying the parameters \(\theta\) of an equation with the following form:

\[\begin{align} \log p(x; \theta) = \langle \theta, \phi(x) \rangle + A(\theta) + h(x) \end{align}\]

If such an equation exists, then the parameters \(\theta\) are known as canonical parameters for the family. In plain English, this equation states that the way that the parameters, \(\theta\), interact with the outcome values, \(x\), to determine the log-probability, \(\log p\) (negative one times the surprise), is linear (an inner product), once the outcomes have been suitably transformed by \(\phi\). Importantly, \(x\), \(\theta\), and \(\phi(x)\) can all be vector-valued.

This may not seem like a big win, since the linearity we have gained is tentative and limited: changing \(\theta\), for example, changes \(A(\theta)\), which is as yet an undefined, likely nonlinear function. But as it turns out, even this small degree of linearity is sufficient to give big gains in manipulating the probabilities.

Warm-Up: The Univariate Gaussian Family

Let’s begin by massaging the functional form of the log-probability of a Gaussian to see if it can be written as a member of a log-linear family.

\[\begin{align} \log p(x; \mu, \sigma^2) &= -\frac{(x - \mu)^2}{2\sigma^2} - \log \sqrt{2\pi}\sigma \end{align}\]

Our goal is to independently apply nonlinear transformations to \(x\) and to our parameters, so having them all tied up inside that polynomial is no good. So we expand, then group terms and pattern match:

\[\begin{align} \log p(x; \mu, \sigma^2) &= -\frac{(x - \mu)^2}{2\sigma^2} - \log \sqrt{2\pi}\sigma \\ &= -\frac{x^2}{2\sigma^2} + 2 \frac{\mu x}{2\sigma^2} - \frac{\mu^2}{2\sigma^2} - \log \sqrt{2\pi} +\log \sigma \\ &= \frac{\mu}{\sigma^2}x + \frac{-1}{2\sigma^2}x^2 - \left(\frac{\mu^2}{2\sigma^2} + \log \sigma\right) - \log \sqrt{2\pi}\\ &:= \theta_1 x + \theta_2 x^2 - A(\mu, \sigma) + \log \sqrt{2\pi} \end{align}\]

The final line should be taken as a definition, for \(\theta\), \(A(\mu, \sigma)\), and \(h\). Notice how it matches both the line above and the definition of a log-linear family.

Except for one problem: \(A\) is a function of \(\mu\) and \(\sigma\), which are not our parameters \(\theta_1\) and \(\theta_2\). One of the key insights of log-linear families is that it matters very much how exactly you parameterize your distributions; the different parameterizations correspond to different geometries, and the problems of inference become geometric problems of transforming from one parameterization to another. This approach is called information geometry.

We will eat our vegetables and convert \(A\) into a bona fide function of \(\theta\) in a moment. Before we do so, though, let’s take a look at what our \(\phi\) functions turned out to be:

\[\begin{align} \phi(x) &= \left[\begin{array}{c} x \\ x^2 \end{array}\right]\ \end{align}\]

For something that could be an arbitrary nonlinear function, these have turned out rather simple indeed! The two functions are also the sufficient statistics of the Gaussian family: if you collect i.i.d. observations from a Gaussian distribution, the average values of these two functions are all you need to know in order to extract all of the information those observations gave you about what the underlying parameters of the Gaussian were.

In fact, instead of starting with the formula for the Gaussian, we might have started by specifying these two sufficient statistics and then asking “what is the distribution that has this \(\phi\) as its sufficient statistics?”. If you’ve ever derived the Gaussian as a maximum entropy distribution, then that’s another way of stating what you did. That method requires some heftier math though, so we’ll stick with this approach.

Let’s finish up our derivation of the log-linear form of the univariate Gaussian. We first write our traditional parameters, \(\mu\) and \(\sigma\), in terms of our newly-derived canonical parameters, \(\theta\),

\[\begin{align} \mu &= \frac{\theta_1}{-2\theta_2}\\ \sigma^2 &= -\frac{1}{2}\theta_2^{-1} \end{align}\]

and then substitute:

\[\begin{align} A(\theta) &= -\frac{1}{4}\frac{\theta_1^2}{\theta_2} - \frac{1}{2}\log\left(-2\theta_2\right) \end{align}\]

enabling us to finally write

\[\begin{align} \log p(x; \theta) &:= \theta_1 x + \theta_2 x^2 - A(\theta) + \log \sqrt{2\pi} \end{align}\]

and so demonstrate that the family of univariate Gaussians is a log-linear family. The canonical parameter \(\theta_1\) is an arbitrary real number; the canonical parameter \(\theta_2\) is an arbitrary positive real number, as the definition in terms of \(\sigma_2\) makes clear.

We close this section by demonstrating, for this particular case, one remarkable property of log-linear families. In a fit of curiosity, let us calculate the gradient of \(A\). Since this is fundamentally a function of two vector-valued variables, we compute its gradient with respect to each variable separately, and embark with \(\theta_1\), which only touches the first, rational term:

\[\begin{align} \frac{\partial}{\partial \theta_1} A(\theta) &= \frac{\theta_1}{-2\theta_2} \\ &= \mu \\ &= \mathbb{E}(\phi_1(x)) \end{align}\]

How curious! The partial derivative of our function \(A\) in its first argument gave us the expected value of the first index of \(\phi\)! Does this pattern continue if we try the second argument?

\[\begin{align} \frac{\partial}{\partial \theta_2} A(\theta) &= \frac{1}{4}\frac{\theta_1^2}{\theta_2^2} - \frac{1}{2}\theta_2^{-1} \\ &= \mu^2 +\sigma^2 \\ &= \mathbb{E}(\phi_2(x)) \end{align}\]

It does! It would appear that the derivatives of this function \(A\) give us the expected values of the sufficient statistics. In fact, this function is also known as cumulant generating function, because taking its derivatives (including higher order derivatives) generates the expected values of certain polynomial functions, known as cumulants, of the sufficient statistics and their expectations. This is advantageous because it exchanges integration, which is typically difficult, for differentiation, which is typically tractable. The second cumulant, from the second derivative, is the beloved Fisher Information Matrix.

Before we move on, note that we had to bring in some facts from calculus, namely that \(\frac{\partial}{\partial x} \log -x = -\frac{1}{x}\) for \(x<0\).. Outside of that, we’ll be sticking to derivatives of linear and polynomial functions, all of which can be fairly straightforwardly found using the Fréchet derivative formulation, in case you’ve forgotten the rules.

The Main Event: The Multivariate Gaussian Family

Let us now repeat that same set of moves, but with multivariate Gaussians instead of univariate Gaussians. We’ll need to bust out our linear algebra, as all of our simple expressions in terms of squares and ratios will start to involve inner products and matrix inverses, but nothing fundamental will change. We begin with the log-probability, which is perhaps less familiar:

\[\begin{align} \log p (x, \mu, \Sigma) &= -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) + \frac{1}{2} \log \left|\Sigma^{-1}\right| - k \log \sqrt{2\pi} \end{align}\]

Once again, we’ll need to coerce this into an expression where the only interactions between our state vector \(x\) and our parameter tuple \((\mu, \Sigma)\) is via inner products. Again, we expand the troublesome polynoimal, this time showing up as a quadratic form:

\[\begin{align} (x - \mu)^\top \Sigma^{-1} (x - \mu) &= x^\top\Sigma^{-1}x - \mu^\top\Sigma^{-1}x - x^\top\Sigma^{-1}\mu + \mu^\top\Sigma^{-1}\mu \\ &= \mathrm{tr}\left(x^\top\Sigma^{-1}x\right) -\mathrm{tr}\left(\mu^\top\Sigma^{-1}x\right) -\mathrm{tr}\left(x^\top\Sigma^{-1}\mu\right) + \mathrm{tr}\left(\mu^\top\Sigma^{-1}\mu\right) \end{align}\]

The final move we made, familiar to anyone who read through the Fréchet derivative series, especially the section on linear regression with multiple inputs and outputs, was to write a bunch of scalar values as traces, \(\mathrm{tr}\). For a matrix, the trace is the sum of the diagonal elements. For a one-by-one matrix, aka a scalar or number, the trace is just equal to the value. The two key propertes we need are

  1. the trace is invariant to cyclic permutations.
  2. the trace is used to define the inner product of two matrices.

We will combine these together, rearranging the elements of our traces by “cycling” them until we get inner products between our parameters and our \(x\)s.

\[\begin{align} (x - \mu)^\top \Sigma^{-1} (x - \mu) &= \mathrm{tr}\left(x^\top\Sigma^{-1}x\right) -2\mathrm{tr}\left(x^\top\Sigma^{-1}\mu\right) + \mathrm{tr}\left(\mu^\top\Sigma^{-1}\mu\right)\\ &=\mathrm{tr}\left(xx^\top\Sigma^{-1}\right) -2\mathrm{tr}\left(x^\top\Sigma^{-1}\mu\right) + \mathrm{tr}\left(\mu\mu^\top,\Sigma^{-1}\right)\\ &=\langle\Sigma^{-1}, xx^\top \rangle -2 \langle \Sigma^{-1}\mu, x \rangle + \langle \Sigma^{-1}, \mu\mu^\top \rangle \end{align}\]

Notice that the symbol \(\langle,\rangle\) is doing double duty: it can mean the usual inner product of vectors or it can mean the (derived) inner product of matrices.

We’ve now got an expression in terms of inner products between parameters and functions of \(x\), so we can plug back in and get one step closer to the log-linear form for the multivariate Gaussian:

\[\begin{align} \log p (x, \mu, \Sigma) &= \langle \Sigma^{-1}\mu, x \rangle + \langle-\frac{1}{2} \Sigma^{-1}, xx^\top \rangle -\left(\frac{1}{2} \langle \Sigma^{-1}, \mu\mu^\top \rangle - \frac{1}{2} \log \left|\Sigma^{-1}\right|\right) - k \log \sqrt{2\pi}\\ \log p (x, \theta, \Theta) &= \langle \theta, x \rangle + \langle \Theta, xx^\top \rangle - A(\theta, \Theta) -k \log \sqrt{2\pi} \end{align}\]

On top of the shared log-linear family form (inner products, cumulant generating function) notice the similarities between the multivariate and the univariate case: \(\mu^2\) becomes \(\mu\mu^\top\), \(\frac{\mu}{\sigma^2}\) becomes \(\Sigma^{-1}\mu\), and \(\frac{-1}{2\sigma^2}\) becomes \(\frac{-1}{2}\Sigma^{-1}\) (lowercase \(\sigma\) is reserved for the standard deviation, hence the absence of a \(^2\) in the multivariate version). Tracking these patterns through derivations for univariate and multivariate cases has helped me to build my intuition for vector operations by “porting” it from the more familiar scalar numbers.

Of course, we still need to write down an explicit form for \(A\) as a function of \(\theta\) and \(\Theta\), not just of \(\mu\) and \(\Sigma\), just as we had to in the univariate case.

We use the following substitutions (compare them to their univariate versions!):

\[\begin{align} \mu &= -\frac{1}{2}\Theta^{-1}\theta\\ \Sigma^{-1} &= -2\Theta\\ \mu\mu^\top &= \frac{1}{4}\Theta^{-1}\theta\theta^\top\Theta^{-1} \end{align}\]

Note that the last one requires some algebraic ledgerdemain. To show the identity, take the transpose of the right hand side twice, then bring one transpose in, using the fact that \(\Sigma\) and \(\Sigma^{-1}\) (and therefore \(\Theta\) and \(\Theta^{-1}\)) are symmetric.

These identities let us obtain, with some algebra:

\[\begin{align} A(\mu, \Sigma) &= \frac{1}{2} \langle \Sigma^{-1}, \mu\mu^\top \rangle - \frac{1}{2} \log \left|\Sigma^{-1}\right|\\ A(\theta, \Theta) &= \frac{1}{2} \langle -2 \Theta, \frac{1}{4}\Theta^{-1}\theta\theta^\top\Theta^{-1} \rangle - \frac{1}{2} \log \left| -2 \Theta \right|\\ &= -\frac{1}{4}\mathrm{tr}\left(\Theta\Theta^{-1}\theta\theta^\top\Theta^{-1}\right) - \frac{1}{2} \log \left| -2 \Theta \right|\\ &= -\frac{1}{4} \langle \theta\theta^\top, \Theta^{-1} \rangle - \frac{1}{2} \log \left| -2 \Theta \right|\\ \end{align}\]

Take a deep breath. The trace has served us well again, helping us to obtain a much simpler expression for \(A\) that now closely resembles the expression for the univariate case.

As a last exercise, we would like to compute the gradients of \(A\) and show that the results are the expected values of the sufficient statistic functions, \(x\) and \(xx^\top\).

The derivative with respect to the vector-valued parameter \(\theta\) is almost easy:

\[\begin{align} \nabla_\theta A(\theta, \Theta) &= -\frac{1}{4} \nabla_\theta \langle \theta \theta^\top, \Theta^{-1}\rangle \end{align}\]

This is the derivative of a quadratic form with respect to its argument, which was one of the examples used in the blog post introducing the Fréchet derivative. For completeness, we rederive it here:

\[\begin{align} \langle \left(\theta + \varepsilon\right)\left(\theta + \varepsilon\right)^\top \Theta^{-1}\rangle &=\ \langle \theta \theta^\top, \Theta^{-1} \rangle + 2\langle \Theta^{-1}\theta, \varepsilon \rangle + \langle \varepsilon\varepsilon^\top, \Theta^{-1}\rangle \end{align}\]

The middle term contains the gradient as the argument to the inner product with \(\varepsilon\), establishing that the gradient of \(A\) with respect to its first argument is

\(\begin{align} \nabla_\theta A(\theta, \Theta) &= -\frac{1}{2} \Theta^{-1}\theta = \mu \end{align}\)

as desired.

We’re in the home stretch! We just need to compute the (matrix-valued) gradient of \(A(\theta, \Theta)\) with respect to \(\Theta\).

\[\begin{align} \nabla_\Theta A(\theta, \Theta) &= -\frac{1}{4} \nabla_\Theta \langle \theta \theta^\top, \Theta^{-1}\rangle -\frac{1}{2} \nabla_\Theta \log\left|-2\Theta\right| \end{align}\]

We split this into two pieces. First, we tackle the second term. The derivative of the log-determinant function is, funnily enough, just equal to the inverse of the transposed matrix, as can be shown using the Fréchet derivative again. This makes our second term \(-\frac{1}{2}\Theta^{-1}\), which is auspicious, since it corresponds to \(\Sigma\), once of the two terms in the expectation we hope to match.

Now, we tackle the first term, using the chain rule to split apart the derivative of the inner product (which, due to linearity, is just equal to the other argument) from the derivative of the matrix inverse:

\[\begin{align} \nabla_\Theta \langle \theta \theta^\top, \Theta^{-1}\rangle \nabla_\Theta\Theta^{-1} &= \theta\theta^\top \nabla_\Theta\Theta^{-1} \end{align}\]

One might hope, by analogy with the derivative of the inverse of a scalar, that the answer is the matrix equivalent of \(\frac{-1}{x^{-2}}\). And so it is!

Putting it all together, we end up with

\(\begin{align} \nabla_\Theta A(\theta, \Theta) &= \frac{1}{4}\theta\theta^\top\Theta^{-2} -\frac{1}{2}\Theta^{-1}\\ &= \mu\mu^\top + \Sigma \end{align}\)

which is, indeed, the expected value of \(xx^\top\).