Typically in linear regression, the parameter $\theta$ is obtained using $$\theta = (X^TX)^{-1}X^TY$$ In the case of ridge regression (L2 regularization) however, the expression for $\theta$ is the following: $$\theta = (X^TX + \lambda I_d)^{-1}X^TY$$ where $I_d$ is the $d \times d$ identity matrix. With $d$ the number of features and $n$ the number of examples, the dimensions are as follows:

  • $\theta$ is (d,1)
  • $X$ is (n,d)
  • $Y$ is (n,1)

Prediction is done using:

  • $\hat{Y} = X \theta$
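
As a quick sanity check on these shapes, here is a minimal sketch on synthetic data (the sizes $n = 100$, $d = 5$, the value of $\lambda$ and all variable names below are just for illustration):

import numpy as np

# toy sizes: n examples, d features
n, d, lam = 100, 5, 0.1
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))        # (n, d)
Y = rng.normal(size=(n, 1))        # (n, 1)

theta = np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T @ Y   # (d, 1)
Y_hat = X @ theta                                            # (n, 1)

print(theta.shape, Y_hat.shape)    # (5, 1) (100, 1)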

Ridge regression can be useful for several reasons. It serves as a regularization technique. Moreover, its solution is always defined: the normal equation for ordinary least squares requires us to compute the inverse of $X^TX$, which may not exist, whereas the matrix $X^TX + \lambda I_d$ is always invertible.

Showing that $X^TX + \lambda I_d$ is always invertible

There are multiple ways of showing that the inverse of $X^TX + \lambda I_d$ always exists, and one of them is to show that its determinant cannot be 0.

One of the properties of determinants, valid for positive semidefinite matrices, is: $$det(A + B) \geq det(A) + det(B)$$ In our case that means: $$det(X^T X + \lambda I_d) \geq det(X^T X) + det(\lambda I_d)$$
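
This superadditivity property can also be checked numerically. The snippet below is just an illustrative sketch (random positive semidefinite matrices built as $M^T M$), not a proof:

import numpy as np

rng = np.random.default_rng(0)
d = 4

# two random positive semidefinite matrices of the form M^T M
M, N = rng.normal(size=(d, d)), rng.normal(size=(d, d))
A, B = M.T @ M, N.T @ N

lhs = np.linalg.det(A + B)
rhs = np.linalg.det(A) + np.linalg.det(B)
print(lhs >= rhs)   # True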

We need the two matrices to be positive semidefinite for this to work. I will assume $X^T X$ to be positive semidefinite. It's the Hessian of the loss function we optimize, so if it were not positive semidefinite the loss wouldn't be convex, which would have prevented us from finding the closed form solution the way we did. (One can also see it directly: for any vector $v$, $v^T X^T X v = \|Xv\|^2 \geq 0$.) In fact, since $X^T X$ is positive semidefinite we have: $det(X^T X) \geq 0$

$$ \begin{aligned} det(\lambda I_d) & = \lambda^d det(I_d) \\ & = \lambda^d \end{aligned} $$

so for $\lambda > 0$ we have $\lambda^d > 0$ and $\lambda I_d$ is positive semidefinite

Since we have $det(\lambda I_d) = \lambda^d$ and $det(X^T X) \geq 0$, we can now say: $$ det(X^T X) + det(\lambda I_d) \geq \lambda^d > 0$$ Therefore $$ det(X^T X + \lambda I_d) > 0 $$ and the matrix $X^T X + \lambda I_d$ is always invertible
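
To make this concrete, here is a small sketch on synthetic data (sizes and names are illustrative) where $X$ has a duplicated column, so $X^T X$ is singular while $X^T X + \lambda I_d$ is not:

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 4, 0.1

X = rng.normal(size=(n, d))
X[:, -1] = X[:, 0]   # duplicate a column: X^T X becomes singular

print(np.linalg.matrix_rank(X.T @ X))                 # 3 < d, so X^T X is not invertible
print(np.linalg.det(X.T @ X + lam * np.eye(d)) > 0)   # True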

Showing that $(X^TX + \lambda I_d)^{-1}X^TY = X^T(XX^T + \lambda I_n)^{-1}Y$

For this one too, there are several ways to proceed. My favorite is the one from this blog post by Daniel Seita.
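
For completeness, here is a short sketch of one standard way to see it (not necessarily the argument from that post). Starting from $$(X^TX + \lambda I_d)X^T = X^TXX^T + \lambda X^T = X^T(XX^T + \lambda I_n)$$ and multiplying on the left by $(X^TX + \lambda I_d)^{-1}$ and on the right by $(XX^T + \lambda I_n)^{-1}$ (both inverses exist for $\lambda > 0$, by the same argument as above), we get $$(X^TX + \lambda I_d)^{-1}X^T = X^T(XX^T + \lambda I_n)^{-1}$$ Applying both sides to $Y$ yields the identity.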

Now, when do we use each of them? I believe it is:

  • $(X^TX + \lambda I_d)^{-1}X^TY$ when $n > d$
  • $X^T(XX^T + \lambda I_n)^{-1}Y$ when $d > n$

The reason is that we want to invert the smaller of the two matrices, since matrix inversion is an expensive operation (roughly cubic in the matrix size).
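
To get a feel for the cost difference, here is a rough sketch on synthetic data (the sizes below are arbitrary) where $n \gg d$; in that regime the first expression only inverts a $(d, d)$ matrix while the second inverts an $(n, n)$ one:

import time
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 2000, 20, 0.1          # n >> d
X, Y = rng.normal(size=(n, d)), rng.normal(size=(n, 1))

start = time.perf_counter()
theta_a = np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T @ Y   # inverts a (d, d) matrix
t_a = time.perf_counter() - start

start = time.perf_counter()
theta_b = X.T @ np.linalg.inv(X @ X.T + lam * np.eye(n)) @ Y   # inverts an (n, n) matrix
t_b = time.perf_counter() - start

print(np.allclose(theta_a, theta_b))   # same solution (up to numerical precision)
print(t_a, t_b)                        # the first form should be noticeably faster here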

Code

import pandas as pd
import numpy as np
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv -P data
!ls data
winequality-red.csv
def read_data(path):
    """Load the wine quality dataset and split it into features X and target Y."""
    data = pd.read_csv(path, sep=";")
    X, Y = data[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol']], data[["quality"]]
    return X, Y
def split_data(X, Y):
    """Split X and Y into 80% train / 20% test, keeping the original row order."""
    split = int(0.8 * len(X))
    X_train, Y_train = X[:split].values, Y[:split].values
    X_test, Y_test = X[split:].values, Y[split:].values
    return (X_train, Y_train), (X_test, Y_test)
X,Y = read_data("data/winequality-red.csv")
(X_train,Y_train), (X_test, Y_test) = split_data(X, Y)
def mse(Y,Y_hat): return np.mean((Y-Y_hat)**2)

def LSO(X, Y):
    """
    Computes 𝜃 using the normal equation
    """
    theta = np.linalg.inv(X.T @ X) @ X.T @ Y
    return theta

theta1 = LSO(X_train,Y_train)
print(f"train error : {mse(X_train @ theta1, Y_train)}")
print(f"test error : {mse(X_test @ theta1, Y_test)}")
train error : 0.41601754667558516
test error : 0.4315544880376387
def ridge(X, Y, l=0.1):
    """
    Computes 𝜃 using ridge regression with the first expression
    """
    theta = np.linalg.inv(X.T @ X + (l*np.eye(X.shape[1])) ) @ X.T @ Y
    return theta
    
theta2 = ridge(X_train,Y_train)
print(f"train error : {mse(X_train @ theta2, Y_train)}")
print(f"test error : {mse(X_test @ theta2, Y_test)}")
train error : 0.416193401976357
test error : 0.4321054875487965
def ridge_2(X, Y, l=0.1):
    """
    Computes 𝜃 using ridge regression with the second expression
    """
    theta = X.T @ np.linalg.inv(X @ X.T + (l*np.eye(X.shape[0])) ) @ Y
    return theta
    
theta3 = ridge_2(X_train,Y_train)
print(f"train error : {mse(X_train @ theta3, Y_train)}")
print(f"test error : {mse(X_test @ theta3, Y_test)}")
train error : 0.41619340197628557
test error : 0.432105487324763

We can make two observations:

  • the loss values for ordinary least squares and ridge regression are not very different. That's probably because a plain linear model doesn't overfit much on this data, and when there is little overfitting, regularization is unlikely to improve performance
  • the loss values for the two ridge functions are about the same, which is expected since the two expressions are mathematically equivalent (the tiny differences come from floating point precision)