We did maximum likelihood estimation (MLE) and maximum a posteriori estimation (MAP) in the context of linear regression here. Let's do the same for classification.

Maximum Likelihood Estimation

Binary classification

Let's say we have two classes, represented by 0 and 1, and we are trying to predict the probability of a feature vector belonging to class 1. We would have: $$ h_\theta (x) = P(y=1|x;\theta) \\ 1 - h_\theta (x) = P(y=0|x;\theta) $$

Since we have two possible outcomes (classes), we can use a Bernoulli distribution on the labels and get the likelihood and log likelihood: $$ \begin{aligned} P(y|x;\theta) & = h_\theta (x)^y (1 - h_\theta (x))^{1-y} \\ \log P(y|x;\theta) & = y \log h_\theta (x) + (1-y) \log (1 - h_\theta (x)) \end{aligned} $$
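As a quick check, the Bernoulli form simply selects the predicted probability of the observed class: for $y=1$ it reduces to $h_\theta(x)$, and for $y=0$ to $1-h_\theta(x)$. A minimal sketch (the value $0.8$ below is an arbitrary example for $h_\theta(x)$):

```python
# Sanity check: h**y * (1 - h)**(1 - y) picks out the probability of the observed class.
h = 0.8  # arbitrary example value for h_theta(x) = P(y=1|x;theta)

for y in (0, 1):
    likelihood = h**y * (1 - h) ** (1 - y)
    print(f"y={y}: P(y|x;theta) = {likelihood:.1f}")  # 0.2 for y=0, 0.8 for y=1
```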

Maximizing the log likelihood over the $N$ data points is the same as minimizing the negative log likelihood. From that we get our loss function (the binary cross-entropy) to be: $$ \mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log h_\theta (x_i) + (1-y_i) \log (1 - h_\theta (x_i)) \right] $$
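As a concrete illustration, here is a minimal NumPy sketch of this loss, assuming the usual logistic regression model $h_\theta(x) = \sigma(\theta^\top x)$; the model choice, data, and names are only for the example, not something the derivation requires:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(theta, X, y, eps=1e-12):
    """Negative mean Bernoulli log likelihood (binary cross-entropy).

    X: (N, d) feature matrix, y: (N,) labels in {0, 1}, theta: (d,) parameters.
    eps keeps the logs away from log(0).
    """
    h = np.clip(sigmoid(X @ theta), eps, 1 - eps)  # h_theta(x_i) for every row
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Tiny usage example with made-up data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0, 1, 1, 0, 1])
print(binary_cross_entropy(np.zeros(3), X, y))  # log(2) ≈ 0.693 when theta = 0
```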

Multiclass classification

In the case of multiclass classification we use a multinomial distribution instead of a Bernoulli one. Now, for $m$ classes represented as $\{1, 2, \dots, m\}$, we let $\hat{y} = h_\theta(x)$ be a vector such that $\hat{y}_k = P(y=k|x;\theta)$, and we one-hot encode the label $y$ as a vector with $y_k = 1$ if the data point belongs to class $k$ and $y_k = 0$ otherwise.

For one data point $(x,y)$ we have: $$ \begin{aligned} P(y|x;\theta) & = \prod_{k=1}^m \frac{\hat{y}_k^{y_k}}{y_k!} \\ \log P(y|x;\theta) & = \sum_{k=1}^m \log \frac{\hat{y}_k^{y_k}}{y_k!} \\ & = \sum_{k=1}^m \log \hat{y}_k^{y_k} - \sum_{k=1}^m \log y_k! \\ & = \sum_{k=1}^m y_k \log \hat{y}_k - \sum_{k=1}^m \log y_k! \\ & = \sum_{k=1}^m y_k \log \hat{y}_k + c \end{aligned} $$

Here $c$ is a constant not depending on $\theta$, so it is not part of the loss function.
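For a one-hot label the remaining sum is just the log of the probability the model assigns to the true class (and $c = 0$, since $\log y_k! = 0$ when $y_k \in \{0, 1\}$). A small numeric check with arbitrary predicted probabilities:

```python
import numpy as np

y_hat = np.array([0.1, 0.7, 0.2])  # arbitrary predicted class probabilities, sums to 1
y = np.array([0, 1, 0])            # one-hot label: the true class is the second one

log_likelihood = np.sum(y * np.log(y_hat))
print(log_likelihood, np.log(0.7))  # both equal log(0.7) ≈ -0.357
```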

We get our loss function to be: $$ \mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^m y_{i,k} \log h_\theta(x_i)_k $$
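Below is a minimal NumPy sketch of this loss, assuming the usual softmax parameterization $h_\theta(x) = \mathrm{softmax}(\Theta^\top x)$ and one-hot labels; the parameterization, shapes, and data are illustrative assumptions:

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax; subtracting the row max keeps exp() numerically stable.
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def cross_entropy(Theta, X, Y, eps=1e-12):
    """Negative mean categorical log likelihood.

    X: (N, d) features, Y: (N, m) one-hot labels, Theta: (d, m) parameters.
    """
    Y_hat = softmax(X @ Theta)  # row i holds h_theta(x_i)
    return -np.mean(np.sum(Y * np.log(Y_hat + eps), axis=1))

# Tiny usage example with made-up data and m = 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
Y = np.eye(3)[[0, 2, 1, 2]]  # one-hot labels for 4 data points
print(cross_entropy(np.zeros((3, 3)), X, Y))  # log(3) ≈ 1.099 when Theta = 0
```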

Maximum a Posteriori Estimation

The MAP derivation for logistic regression is very similar to the one for linear regression. We consider $\theta$ to be a random variable whose components independently follow a normal distribution $\mathcal{N}(0,\,b^{2})$, and we want to maximize $P(\theta|X,Y)$.

Using Bayes' theorem, we have:

$$ \begin{aligned} P(\theta|X,Y) & = \frac{P(Y|X,\theta)\,p(\theta)}{P(Y|X)} \\ \log P(\theta|X,Y) & = \log P(Y|X,\theta) + \log p(\theta) - \log P(Y|X) \\ & = \sum_{i=1}^{N} \left[ y_i \log h_\theta (x_i) + (1-y_i) \log (1 - h_\theta (x_i)) \right] - d \log \sqrt{2 \pi b^2} - \frac{1}{2 b^2} \|\theta\|_2^2 - \log P(Y|X) \\ & = \sum_{i=1}^{N} \left[ y_i \log h_\theta (x_i) + (1-y_i) \log (1 - h_\theta (x_i)) \right] - \frac{1}{2 b^2} \|\theta\|_2^2 + c \end{aligned} $$

Here $c$ is a constant that groups all the terms that do not depend on $\theta$, and $d$ is the dimension of $\theta$, so the $- d \log \sqrt{2 \pi b^2}$ term coming from the Gaussian prior is constant as well.

From that we get our loss function to be: $$ \mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log h_\theta (x_i) + (1-y_i) \log (1 - h_\theta (x_i)) \right] + \lambda \| \theta \|_2^2 $$ where $\lambda = \frac{1}{2 b^2 N}$.
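As a sketch, here is the same binary cross-entropy with the L2 penalty attached, assuming the sigmoid model from before and the $\lambda = \frac{1}{2 b^2 N}$ implied by the $\mathcal{N}(0,\,b^{2})$ prior; the prior scale $b$ and the data are arbitrary example values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_loss(theta, X, y, b=1.0, eps=1e-12):
    """Negative log posterior up to constants: binary cross-entropy + L2 penalty."""
    N = len(y)
    lam = 1.0 / (2.0 * b**2 * N)                   # lambda implied by the N(0, b^2) prior
    h = np.clip(sigmoid(X @ theta), eps, 1 - eps)  # h_theta(x_i) for every row
    nll = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    return nll + lam * np.sum(theta**2)

# With theta = 0 the penalty vanishes and the loss matches the MLE case.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0, 1, 1, 0, 1])
print(map_loss(np.zeros(3), X, y))  # ≈ 0.693
```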