Linear regression works by finding $\underset{\theta}{argmin} \|{X\theta - Y}\|_2^2$. This quantity can be interpreted in several ways: it is the sum of the squared differences between the predictions and the actual values (the squared Euclidean distance between $X\theta$ and $Y$), and it can also be obtained from maximum likelihood estimation (MLE).

Maximum Likelihood Estimation

In MLE we are interested in finding the value of $\theta$ that maximizes the likelihood of the data, as expressed by $p(Y|X,\theta)$.

Let's consider every data point to be of the form $y_i = x_i \theta + \epsilon_i$, where $\epsilon_i$ follows a normal distribution $\mathcal{N}(0,\,\sigma^{2})$; then $y_i$ follows the distribution $\mathcal{N}(x_i \theta,\,\sigma^{2})$.
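To make this concrete, here is a minimal sketch in Python with NumPy that generates data under this assumption. The values chosen for $n$, $d$, $\theta$, and $\sigma$ are arbitrary, and the same synthetic setup is reused in the later snippets.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 3                              # number of samples and features (arbitrary)
theta_true = np.array([1.5, -2.0, 0.7])    # hypothetical "true" parameter
sigma = 0.5                                # noise standard deviation

X = rng.normal(size=(n, d))                # design matrix
epsilon = rng.normal(0.0, sigma, size=n)   # Gaussian noise, N(0, sigma^2)
Y = X @ theta_true + epsilon               # y_i = x_i theta + epsilon_i
```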

Our goal is now to find $\underset{\theta}{argmax}\ p(Y|X,\theta)$

Because we assume that the data points are independently sampled, we can write $p(Y|X,\theta) = \displaystyle\prod_{i=1}^{n} p(y_i|x_i,\theta)$.

Then, because we assume the data points are identically distributed, in this case with $y_i \sim \mathcal{N}(x_i \theta,\,\sigma^{2})$, we can write $\displaystyle\prod_{i=1}^{n} p(y_i|x_i,\theta) = \displaystyle\prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^2}}\exp\left(-\frac{(y_i - x_i \theta)^2}{2 \sigma^2}\right)$.

Lastly, because the $\log$ function is a strictly increasing function, maximizing the likelihood $p(Y|X,\theta)$ is the same as maximizing the log likelihood $\log p(Y|X,\theta)$.

Using this, let's develop the expression of the log likelihood. $$ \begin{aligned} p(Y|X,\theta) & = \displaystyle\prod_{i=1}^{n} p(y_i|x_i,\theta) \\ & = \displaystyle\prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^2}}\exp\left(-\frac{(y_i - x_i \theta)^2}{2 \sigma^2}\right) \\ \log p(Y|X,\theta) & = \displaystyle\sum_{i=1}^{n} \log \left( \frac{1}{\sqrt{2 \pi \sigma^2}}\exp\left(-\frac{(y_i - x_i \theta)^2}{2 \sigma^2}\right)\right) \\ & = \displaystyle\sum_{i=1}^{n} \log \frac{1}{\sqrt{2 \pi \sigma^2}} - \displaystyle\sum_{i=1}^{n} \frac{(y_i - x_i \theta)^2}{2 \sigma^2} \\ & = - n \log \sqrt{2 \pi \sigma^2} - \frac{1}{2 \sigma^2} \displaystyle\sum_{i=1}^{n} (y_i - x_i \theta)^2 \\ & = - n \log \sqrt{2 \pi \sigma^2} - \frac{1}{2 \sigma^2} \|{Y - X\theta }\|_2^2 \end{aligned} $$
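As a quick sanity check, the sketch below (NumPy plus SciPy's `norm.logpdf`, with the same synthetic data as above and an arbitrary candidate $\theta$) verifies numerically that the sum of the individual Gaussian log densities matches the closed form just derived.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.5
theta_true = np.array([1.5, -2.0, 0.7])
X = rng.normal(size=(n, d))
Y = X @ theta_true + rng.normal(0.0, sigma, size=n)

theta = np.array([1.0, -1.0, 0.0])   # arbitrary candidate value of theta

# Sum of the individual Gaussian log densities log p(y_i | x_i, theta).
log_lik_direct = norm.logpdf(Y, loc=X @ theta, scale=sigma).sum()

# Closed form: -n log sqrt(2 pi sigma^2) - ||Y - X theta||^2 / (2 sigma^2).
log_lik_formula = (-n * np.log(np.sqrt(2 * np.pi * sigma**2))
                   - np.sum((Y - X @ theta) ** 2) / (2 * sigma**2))

print(np.isclose(log_lik_direct, log_lik_formula))   # True
```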

From that we can see that finding the value of $\theta$ that maximizes the likelihood is the same as finding the $\theta$ that minimizes $\|{Y - X\theta }\|_2^2$, since the remaining terms do not depend on $\theta$.

The closed form solution is $\theta^{MLE} = (X^T X)^{-1}X^T Y$, which exists whenever $X^T X$ is invertible.
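Here is a minimal sketch of this closed form on the synthetic data from above, using a linear solve rather than an explicit matrix inverse, and cross-checked against NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 3, 0.5
theta_true = np.array([1.5, -2.0, 0.7])
X = rng.normal(size=(n, d))
Y = X @ theta_true + rng.normal(0.0, sigma, size=n)

# Closed-form MLE: theta = (X^T X)^{-1} X^T Y, via a linear solve for stability.
theta_mle = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(theta_mle, theta_lstsq))   # True
print(theta_mle)                             # close to theta_true
```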

Maximum A Posteriori

In maximum a posteriori estimation (MAP estimation) we treat $\theta \in \mathbb{R}^d$ as a random variable and place a prior on it; here each component of $\theta$ is assumed to independently follow a normal distribution $\mathcal{N}(0,\,b^{2})$, i.e. $\theta \sim \mathcal{N}(0,\,b^{2} I_d)$.

We want to maximize the posterior $p(\theta|X,Y)$.

Using Bayes' theorem we have $$ \begin{aligned} p(\theta|X,Y) & = \frac{p(Y|X,\theta)\,p(\theta)}{p(Y|X)} \\ \log p(\theta|X,Y) & = \log p(Y|X,\theta) + \log p(\theta) - \log p(Y|X) \\ & = - n \log \sqrt{2 \pi \sigma^2} - \frac{1}{2 \sigma^2} \|{Y - X\theta }\|_2^2 - d \log \sqrt{2 \pi b^2} - \frac{1}{2 b^2} \|{\theta }\|_2^2 - \log p(Y|X) \\ & = - \left( \frac{1}{2 \sigma^2} \|{Y - X\theta }\|_2^2 + \frac{1}{2 b^2} \|{\theta }\|_2^2 \right) + c \end{aligned} $$

Here $c$ is a constant that groups all the terms that do not depend on $\theta$.
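As a sketch of this equivalence (same synthetic data, with an assumed prior standard deviation $b$), the snippet below checks that $\log p(Y|X,\theta) + \log p(\theta)$ and $-\left(\frac{1}{2 \sigma^2} \|Y - X\theta\|_2^2 + \frac{1}{2 b^2} \|\theta\|_2^2\right)$ differ only by a constant that does not depend on $\theta$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, d, sigma, b = 200, 3, 0.5, 1.0   # b is the prior standard deviation (assumed value)
theta_true = np.array([1.5, -2.0, 0.7])
X = rng.normal(size=(n, d))
Y = X @ theta_true + rng.normal(0.0, sigma, size=n)

def log_lik_plus_log_prior(theta):
    # log p(Y|X,theta) + log p(theta); equals log p(theta|X,Y) up to a constant.
    log_lik = norm.logpdf(Y, loc=X @ theta, scale=sigma).sum()
    log_prior = norm.logpdf(theta, loc=0.0, scale=b).sum()
    return log_lik + log_prior

def neg_penalized_cost(theta):
    # -( ||Y - X theta||^2 / (2 sigma^2) + ||theta||^2 / (2 b^2) )
    return -(np.sum((Y - X @ theta) ** 2) / (2 * sigma**2)
             + np.sum(theta**2) / (2 * b**2))

theta_a = np.array([1.0, -1.0, 0.0])
theta_b = np.array([0.0, 0.5, 2.0])

# The two quantities differ by a theta-independent constant,
# so their differences between any two theta values agree.
print(np.isclose(
    log_lik_plus_log_prior(theta_a) - log_lik_plus_log_prior(theta_b),
    neg_penalized_cost(theta_a) - neg_penalized_cost(theta_b),
))   # True
```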

From this we can see that maximizing the posterior is equivalent to minimizing the ridge regression cost function $\|{Y - X\theta }\|_2^2 + \lambda \|{\theta }\|_2^2$ with $\lambda = \frac{\sigma^2}{b^2}$ (multiplying the bracketed term by $2 \sigma^2$ does not change the minimizer).

Ridge regression can be very useful because of its regularizing effect, and because its closed form solution $\theta^{MAP} = (X^T X + \lambda I_d)^{-1}X^T Y$ always exists ($X^T X + \lambda I_d$ is invertible for any $\lambda > 0$). Also, there is a very useful equality for ridge regression that is discussed here.
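A minimal sketch of this closed form, assuming the same synthetic setup as before and taking $\lambda = \sigma^2 / b^2$ from the MAP derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, b = 200, 3, 0.5, 1.0
theta_true = np.array([1.5, -2.0, 0.7])
X = rng.normal(size=(n, d))
Y = X @ theta_true + rng.normal(0.0, sigma, size=n)

lam = sigma**2 / b**2   # lambda = sigma^2 / b^2 from the MAP derivation

# Closed-form ridge / MAP solution: (X^T X + lambda I)^{-1} X^T Y.
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Ordinary least squares for comparison; ridge shrinks the solution toward 0.
theta_mle = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.linalg.norm(theta_map) <= np.linalg.norm(theta_mle))   # True
print(theta_map)
```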