What your loss function says about YOU

Written on March 20, 2020

Or more accurately, what does your choice of loss function say about the assumptions you’re making regarding the distribution of residuals for your model?

Consider the following example: suppose you’ve fit a regression using mean-squared-error (MSE) as your loss criterion, and you arrive at an MSE of, say, $4$. Now suppose for a given input $x$, your model produces a prediction $\hat{y}=5$. How would you produce a 95% confidence interval on this prediction?

To answer this question, let’s first write out our regression a little more precisely: $y=A\mathbf{x}+b+\varepsilon$, where $\varepsilon\sim\mathcal{N}(0,\sigma)$ is the error variable, assumed to be Gaussian with mean zero and standard deviation $\sigma$.
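To make the setup concrete, here’s a quick numpy simulation of this model. All the constants below ($A$, $b$, $\sigma$) are made-up values for illustration:

```python
# Simulate the model y = A x + b + eps with Gaussian noise.
# A_true, b_true, and sigma_true are made-up illustrative values.
import numpy as np

rng = np.random.default_rng(0)
A_true, b_true, sigma_true = 2.0, -1.0, 0.5

x = rng.uniform(-3, 3, size=10_000)
y = A_true * x + b_true + rng.normal(0, sigma_true, size=10_000)

# The residuals y - (A x + b) recover the assumed N(0, sigma) noise:
residuals = y - (A_true * x + b_true)
print(residuals.mean(), residuals.std())  # ~0.0 and ~0.5
```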

Besides the assumption that $\varepsilon$ is Gaussian, I’m also assuming that it is independent of $\mathbf{x}$. Our regression can then be re-written as:

\[y-(A\mathbf{x}+b)\sim\mathcal{N}(0,\sigma)\]

This also gives us an expression for the conditional PDF of $y$, given a choice of parameters $A$, $b$, $\sigma$:

\[P(y|\mathbf{x};A,b,\sigma):=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2\sigma^{2}}\left(y-A\mathbf{x}-b\right)^{2}}\]

For a given collection of observations $(\mathbf{x}_{i}, y_{i})$, we can also ask which choice of parameters $A$, $b$, $\sigma$ maximizes the likelihood function:

\[\mathcal{L}(A,b,\sigma|y,\mathbf{x}):=\prod_{i}\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2\sigma^{2}}\left(y_{i}-A\mathbf{x}_{i}-b\right)^{2}}\]

For simplicity, I’ll denote $\hat{y}_{i} := A\mathbf{x}_{i} + b$. In practice, it is easier to work with the log-likelihood function (which is just the logarithm of the likelihood):

\[\log \mathcal{L} = \sum_{i}\left[-\log\left(\sigma\sqrt{2\pi}\right)-\frac{1}{2\sigma^{2}}\left(y_{i}-\hat{y}_{i}\right)^{2}\right]\]

\[=-\frac{N}{2}\log 2\pi-N\log\sigma-\frac{1}{2\sigma^{2}}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}\]

where $N$ is the total number of samples $y_i$.
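This expression translates directly into code. Here’s a sketch in numpy, written for the one-dimensional case (scalar $A$, $b$) purely for illustration, with a sanity check against scipy’s Gaussian log-pdf:

```python
# Direct numpy translation of the Gaussian log-likelihood above,
# for the one-dimensional case (scalar A, b).
import numpy as np
from scipy.stats import norm

def log_likelihood(A, b, sigma, x, y):
    N = len(y)
    y_hat = A * x + b
    return (-N / 2 * np.log(2 * np.pi)
            - N * np.log(sigma)
            - np.sum((y - y_hat) ** 2) / (2 * sigma ** 2))

# Sanity check against scipy on some toy data:
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 2.0 * x - 1.0 + rng.normal(0, 0.5, size=100)
assert np.isclose(log_likelihood(2.0, -1.0, 0.5, x, y),
                  norm.logpdf(y, loc=2.0 * x - 1.0, scale=0.5).sum())
```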

We can immediately note that, for fixed $\sigma$, a choice of $A$, $b$ that minimizes the MSE will also maximize the log-likelihood function (and hence the likelihood function). But what is the optimal $\sigma$? Let’s do some calculus.

\[0=\frac{\partial}{\partial\sigma}\left[-\frac{N}{2}\log 2\pi-N\log\sigma-\frac{1}{2\sigma^{2}}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}\right]\]

\[0=-\frac{N}{\sigma}+\frac{1}{\sigma^{3}}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}\]

\[\boxed{\sigma^{2}=\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}}\]

That is, for any fixed choice of $A$, $b$, the optimal $\sigma$ is precisely the square root of the MSE.
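We can verify this numerically: minimize the negative log-likelihood over $\sigma$ and compare against the closed form. The data and the (pretend already-fitted) $A$, $b$ below are made up for the sketch:

```python
# Numerical sanity check that the maximum-likelihood sigma is sqrt(MSE).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=1000)
y = 2.0 * x - 1.0 + rng.normal(0, 0.5, size=1000)

A, b = 2.0, -1.0  # pretend these were already fit
residuals = y - (A * x + b)
N = len(y)

def neg_log_likelihood(sigma):
    return (N / 2 * np.log(2 * np.pi) + N * np.log(sigma)
            + np.sum(residuals ** 2) / (2 * sigma ** 2))

best = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print(best.x, np.sqrt(np.mean(residuals ** 2)))  # these two should agree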

So what’s the point of this exercise? There are two takeaways.

First, we can give an explicit answer to our original question: if our MSE is 4, then the maximum-likelihood estimate of the noise scale is $\sigma = \sqrt{4} = 2$, so for a prediction $\hat{y} = 5$, a 95% confidence interval is $5 \pm 2\sigma = 5 \pm 4$.
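In code, the whole answer amounts to a couple of lines:

```python
# The numbers from the running example: MSE = 4, prediction y_hat = 5.
import numpy as np

mse, y_hat = 4.0, 5.0
sigma = np.sqrt(mse)  # = 2, the maximum-likelihood noise scale
lo, hi = y_hat - 2 * sigma, y_hat + 2 * sigma
print((lo, hi))  # (1.0, 9.0), i.e. 5 +/- 4
```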

Second, the approach above is entirely generic: we may pick a different distribution for our noise, and we may pick an entirely different regression relation (e.g., a neural net), and re-run the same calculation. As long as you can compute the log-likelihood function, numeric techniques like stochastic gradient descent can be used to find approximately optimal solutions. (This is one of the basic ideas behind Pyro, which uses the autograd machinery of PyTorch to perform this kind of calculation.)
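Here’s a minimal sketch of that idea in plain PyTorch (not the Pyro API): maximize the Gaussian log-likelihood by gradient descent, fitting $A$, $b$, and $\sigma$ jointly. The linear model and all constants are illustrative stand-ins; the model could just as well be a neural net:

```python
# Maximize the Gaussian log-likelihood by gradient descent in PyTorch.
import torch

torch.manual_seed(0)
x = torch.rand(1000) * 6 - 3
y = 2.0 * x - 1.0 + 0.5 * torch.randn(1000)  # "true" A=2, b=-1, sigma=0.5

A = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)  # log scale keeps sigma positive

opt = torch.optim.Adam([A, b, log_sigma], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    sigma = log_sigma.exp()
    y_hat = A * x + b
    # Per-sample negative log-likelihood, dropping the constant log(2*pi)/2 term
    nll = (torch.log(sigma) + (y - y_hat) ** 2 / (2 * sigma ** 2)).mean()
    nll.backward()
    opt.step()

print(A.item(), b.item(), log_sigma.exp().item())  # ~2.0, ~-1.0, ~0.5
```

Note that $\sigma$ is optimized on the log scale, a common trick to keep the scale parameter positive without constrained optimization.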