AMMI notes - ridge regression
An explanation of a useful equality for ridge regression
- Showing that $X^TX + \lambda I_d$ is always invertible
- Showing that $(X^TX + \lambda I_d)^{-1}X^TY = X^T(XX^T + \lambda I_n)^{-1}Y$
- Code
Typically in linear regression, the parameter $\theta$ is obtained using $$\theta = (X^TX)^{-1}X^TY$$ In the case of ridge regression (L2 regularization) however, the expression for $\theta$ is the following: $$\theta = (X^TX + \lambda I_d)^{-1}X^TY$$ where $d$ is the number of features and $n$ is the number of examples (a short derivation of this closed form is sketched after the lists below). The dimensions are as follows:
- $\theta$ is (d,1)
- $X$ is (n,d)
- $Y$ is (n,1)
Prediction is done using:
- $Y = X \theta$
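For reference, here is a short sketch of one way to derive this closed form, by minimizing the regularized squared error with the same notation as above. Ridge regression minimizes $$L(\theta) = \|X\theta - Y\|^2 + \lambda \|\theta\|^2$$ Setting the gradient to zero gives $$\nabla_\theta L = 2X^T(X\theta - Y) + 2\lambda\theta = 0 \implies (X^TX + \lambda I_d)\theta = X^TY$$ and since $X^TX + \lambda I_d$ is invertible (shown below), $$\theta = (X^TX + \lambda I_d)^{-1}X^TY$$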
Ridge regression can be useful for several reasons. It serves as a regularization technique. Moreover, its solution is always defined: the normal equation for ordinary least squares requires us to compute the inverse of $X^TX$, which may not exist, whereas the matrix $X^TX + \lambda I_d$ is always invertible.
There are multiple ways of showing that the inverse of $X^TX + \lambda I_d$ always exists, and one of them is to show that its determinant cannot be 0.
One of the properties of determinants, valid when $A$ and $B$ are positive semidefinite, is: $$det(A + B) \geq det(A) + det(B)$$ In our case that means: $$det(X^T X + \lambda I_d) \geq det(X^T X) + det(\lambda I_d)$$
We need the two matrices to be positive semidefinite for this to work. $X^T X$ is positive semidefinite: for any vector $v$, $v^T X^T X v = \|Xv\|^2 \geq 0$. It is also the Hessian of the loss function we optimize, and this positive semidefiniteness is what makes the loss convex and lets us find the closed form solution the way we did. In particular, since $X^T X$ is positive semidefinite we have: $det(X^T X) \geq 0$
As for $\lambda I_d$: for $\lambda > 0$ it is positive definite, hence positive semidefinite.
Since $det(\lambda I_d) = \lambda^d$ and $det(X^T X) \geq 0$, we can now say: $$ det(X^T X) + det(\lambda I_d) \geq \lambda^d > 0$$ Therefore $$ det(X^T X + \lambda I_d) > 0 $$ and the matrix $X^T X + \lambda I_d$ is always invertible.
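As a quick numerical illustration (a small sketch with made-up data), a rank-deficient $X$ makes $X^TX$ singular, while $X^TX + \lambda I_d$ stays full rank:
import numpy as np
rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
# build a rank-deficient X by duplicating a column, so X^T X is singular
X = rng.normal(size=(n, d))
X[:, -1] = X[:, 0]
A = X.T @ X
print(np.linalg.matrix_rank(A))                         # 4 < d: A is singular, inverting it is meaningless
B = A + lam * np.eye(d)
print(np.linalg.matrix_rank(B), np.linalg.det(B) > 0)   # 5 and True: full rank, strictly positive determinant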
For the second equality, too, there are several proofs. My favorite one is from this blog post by Daniel Seita.
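One short way to see why the equality holds (not necessarily the argument from that post) is the following push-through identity: $$(X^TX + \lambda I_d)X^T = X^TXX^T + \lambda X^T = X^T(XX^T + \lambda I_n)$$ Both $X^TX + \lambda I_d$ and $XX^T + \lambda I_n$ are invertible for $\lambda > 0$, so multiplying on the left by $(X^TX + \lambda I_d)^{-1}$ and on the right by $(XX^T + \lambda I_n)^{-1}$ gives $$X^T(XX^T + \lambda I_n)^{-1} = (X^TX + \lambda I_d)^{-1}X^T$$ and multiplying both sides by $Y$ on the right yields the equality above.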
Now, when do we use each form? I believe it is:
- $(X^TX + \lambda I_d)^{-1}X^TY$ when $n > d$
- $X^T(XX^T + \lambda I_n)^{-1}Y$ when $d > n$
The reason is that we want to invert the smaller of the two matrices, since matrix inversion is an expensive operation: the first form inverts a $d \times d$ matrix, while the second inverts an $n \times n$ one.
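As a rough illustration (a quick sketch with arbitrary random data, not a careful benchmark), the $d \times d$ inverse is far cheaper than the $n \times n$ one when $n \gg d$:
import time
import numpy as np
rng = np.random.default_rng(0)
n, d, lam = 3000, 10, 0.1
X = rng.normal(size=(n, d))   # n >> d
start = time.perf_counter()
np.linalg.inv(X.T @ X + lam * np.eye(d))    # (d, d) inverse, used by the first form
print("d x d inverse:", time.perf_counter() - start)
start = time.perf_counter()
np.linalg.inv(X @ X.T + lam * np.eye(n))    # (n, n) inverse, used by the second form
print("n x n inverse:", time.perf_counter() - start)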
import pandas as pd
import numpy as np
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv -P data
!ls data
def read_data(path):
    # load the wine quality dataset; features go in X, the quality score in Y
    data = pd.read_csv(path, sep=";")
    X, Y = data[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
                 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
                 'pH', 'sulphates', 'alcohol']], data[["quality"]]
    return X, Y
def split_data(X, Y):
    # 80/20 train/test split, keeping the original row order
    split = int(0.8 * len(X))
    X_train, Y_train = X[:split].values, Y[:split].values
    X_test, Y_test = X[split:].values, Y[split:].values
    return (X_train, Y_train), (X_test, Y_test)
X,Y = read_data("data/winequality-red.csv")
(X_train,Y_train), (X_test, Y_test) = split_data(X, Y)
def mse(Y,Y_hat): return np.mean((Y-Y_hat)**2)
def LSO(X, Y):
    """
    Computes 𝜃 using the normal equation
    """
    theta = np.linalg.inv(X.T @ X) @ X.T @ Y
    return theta
theta1 = LSO(X_train,Y_train)
print(f"train error : {mse(X_train @ theta1, Y_train)}")
print(f"test error : {mse(X_test @ theta1, Y_test)}")
def ridge(X, Y, l=0.1):
    """
    Computes 𝜃 using ridge regression with the first expression
    """
    # inverts a (d, d) matrix: (X^T X + l*I_d)^{-1} X^T Y
    theta = np.linalg.inv(X.T @ X + l * np.eye(X.shape[1])) @ X.T @ Y
    return theta
theta2 = ridge(X_train,Y_train)
print(f"train error : {mse(X_train @ theta2, Y_train)}")
print(f"test error : {mse(X_test @ theta2, Y_test)}")
def ridge_2(X, Y, l=0.1):
    """
    Computes 𝜃 using ridge regression with the second expression
    """
    # inverts an (n, n) matrix: X^T (X X^T + l*I_n)^{-1} Y
    theta = X.T @ np.linalg.inv(X @ X.T + l * np.eye(X.shape[0])) @ Y
    return theta
theta3 = ridge_2(X_train,Y_train)
print(f"train error : {mse(X_train @ theta3, Y_train)}")
print(f"test error : {mse(X_test @ theta3, Y_test)}")
We can make two observations:
- The loss values for ordinary least squares and for ridge regression are not very different. That is probably because a linear model is simple enough that there is little overfitting here, and in that case regularization is unlikely to improve performance much.
- The loss values for the two ridge functions are about the same, as expected since the two expressions are mathematically equivalent (see the quick check below).
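As a quick sanity check (a small sketch reusing the theta2 and theta3 computed above), the two parameter vectors can be compared directly:
# the two ridge expressions are mathematically equivalent, so the fitted
# parameters should match up to floating point error
print(np.max(np.abs(theta2 - theta3)))
print(np.allclose(theta2, theta3, atol=1e-6))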