Linear Regression Exercise (Closed Form Solution)

In statistics, linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables) [Wikipedia]. The closed form solution to finding the parameter $\theta$ of a linear regression model is given by $$\theta = (X^TX)^{-1}X^TY$$ where $X$ are your features and $Y$ is your target.

Let d be the number of features and n the number of examples. The dimensions are as follows:

  • $\theta$ is (d,1)
  • $X$ is (n,d)
  • $Y$ is (n,1)

Prediction is done using:

  • $\hat{Y} = X \theta$, where $\hat{Y}$ is the (n,1) vector of predicted targets
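To make the shapes concrete, here is a minimal sketch on synthetic data. The sizes, the random seed, and the `true_theta` values are made up for illustration and are not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3                                   # n examples, d features (arbitrary choices)
X = rng.normal(size=(n, d))                     # features, shape (n, d)
true_theta = np.array([[2.0], [-1.0], [0.5]])   # made-up parameters, shape (d, 1)
Y = X @ true_theta                              # noiseless targets, shape (n, 1)

# Closed form: solve (X^T X) theta = X^T Y for theta
theta = np.linalg.solve(X.T @ X, X.T @ Y)       # shape (d, 1)
Y_hat = X @ theta                               # predictions, shape (n, 1)

print(theta.shape, Y_hat.shape)
```

Because the targets here contain no noise, the recovered `theta` should match `true_theta` up to floating point error.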

We are trying to find the value of $\theta$ that minimizes the squared error, which means finding the solution to: $\underset{\theta}{\operatorname{argmin}} \|{X \theta - Y}\|_2^2$

Because the squared error is convex in $\theta$, any point where its derivative vanishes is a global minimum. We can therefore differentiate the expression with respect to $\theta$ and find the value of $\theta$ that makes the derivative 0.

First let's expand $\|{X \theta - Y}\|_2^2$: $$ \begin{aligned} \|{X \theta - Y}\|_2^2 &= (X \theta - Y)^T(X \theta - Y) \\ & = (\theta^T X^T - Y^T)(X \theta - Y) \\ & = \theta^T X^T X \theta - Y^T X \theta - \theta^T X^T Y + Y^T Y \\ & = \theta^T X^T X \theta - (\theta^T X^T Y)^T - \theta^T X^T Y + Y^T Y \\ & = \theta^T X^T X \theta - 2\theta^T X^T Y + Y^T Y \quad \text{because } \theta^T X^T Y \text{ is a scalar} \end{aligned} $$ Differentiating with respect to $\theta$ (the $Y^T Y$ term does not depend on $\theta$): $$ \frac{\partial \|{X \theta - Y}\|_2^2}{\partial \theta} = 2 X^T X \theta - 2 X^T Y $$ By equating the derivative to 0 we get, assuming $X^T X$ is invertible: $$ \begin{aligned} 2 X^T X \theta - 2 X^T Y & = 0 \\ X^T X \theta - X^T Y & = 0 \\ X^T X \theta & = X^T Y \\ \theta & = (X^T X)^{-1} X^T Y \\ \end{aligned} $$
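The derivative can be sanity-checked numerically. The sketch below compares the analytic gradient $2 X^T X \theta - 2 X^T Y$ with central finite differences at a random point, then confirms that the gradient is (numerically) zero at the closed form solution. The helper names `squared_error` and `gradient`, and the random data, are assumptions made for this illustration.

```python
import numpy as np

def squared_error(theta, X, Y):
    """Squared L2 norm of the residual X @ theta - Y."""
    r = X @ theta - Y
    return float(np.sum(r ** 2))

def gradient(theta, X, Y):
    """Analytic gradient of the squared error: 2 X^T X theta - 2 X^T Y."""
    return 2 * X.T @ X @ theta - 2 * X.T @ Y

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))        # synthetic features
Y = rng.normal(size=(50, 1))        # synthetic targets
theta0 = rng.normal(size=(4, 1))    # arbitrary point at which to check the gradient

# Central finite differences, one coordinate of theta at a time
eps = 1e-6
numeric = np.zeros_like(theta0)
for j in range(theta0.shape[0]):
    step = np.zeros_like(theta0)
    step[j] = eps
    numeric[j] = (squared_error(theta0 + step, X, Y)
                  - squared_error(theta0 - step, X, Y)) / (2 * eps)

grad_matches = np.allclose(numeric, gradient(theta0, X, Y), rtol=1e-4, atol=1e-4)

# At the closed form solution the gradient should vanish
theta_star = np.linalg.solve(X.T @ X, X.T @ Y)
grad_at_solution = gradient(theta_star, X, Y)

print(grad_matches, np.abs(grad_at_solution).max())
```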

Here is an implementation using numpy and the wine quality dataset (`winequality-red.csv`).

import pandas as pd
import numpy as np
!wget -P data
!ls data
data = pd.read_csv("data/winequality-red.csv",sep=';')
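| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |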
cols = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH", "sulphates", "alcohol"]
target = "quality"
data = data.sample(frac=1)
X = data[cols].values
Y = data[[target]].values
X.shape, Y.shape
((1599, 11), (1599, 1))

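We also implement the bias parameter by adding a feature with a fixed value of one to every data point. For a data point with features $x_1, \dots, x_d$, the prediction then becomes: $$ \sum_{i=1}^{d}(\theta_i \cdot x_i) + \theta_{d+1} \cdot 1 $$ where $\theta_{d+1}$ is the bias parameter, so $\theta$ has shape (d+1,1) when the bias is used.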
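As a small illustration of why the constant feature works, the sketch below builds synthetic targets with a known intercept and checks that the last entry of $\theta$ recovers it. The weights `3.0`, `-2.0` and the intercept `5.0` are made-up values chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))                     # two synthetic features
Y = X @ np.array([[3.0], [-2.0]]) + 5.0           # made-up weights plus an intercept of 5

# Append a column of ones so the last parameter acts as the bias
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # shape (100, 3)

theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y)
weights, bias = theta[:-1], theta[-1, 0]

print(weights.ravel(), bias)
```

Without the extra column, the model would be forced through the origin and could not represent the constant offset.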

def add_ones(X): return np.hstack([X,np.ones((X.shape[0],1))])

class LinearReg:
    """Basic linear regression implementation using numpy"""
    def __init__(self, bias=False):
        """Initialization of theta and a boolean to determine whether to use a bias or not"""
        self.theta = None
        self.bias = bias
    def fit(self,X,Y):
        """Fit function. Uses the normal equation to compute theta"""
        if self.bias:
            X = add_ones(X)
        A = X.T @ X
        B = X.T @ Y
        self.theta = np.linalg.solve(A,B)
        #self.theta = np.linalg.inv(A) @ B

    def predict(self,X):
        """Prediction function"""
        if self.bias:
            X = add_ones(X)
        return X @ self.theta

    @staticmethod
    def mse(y_hat,y):
        """Static method implementing the mean squared error"""
        return np.mean((y-y_hat)**2)
model1 = LinearReg()
model1.fit(X,Y)
model2 = LinearReg(bias=True)
model2.fit(X,Y)
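The `fit` method solves the linear system $X^T X \theta = X^T Y$ with `np.linalg.solve` instead of forming the inverse explicitly, which is generally the more numerically stable choice. One way to cross-check the result is to compare it with `np.linalg.lstsq`, which minimizes the same squared error directly from `X` and `Y`. The sketch below does this on synthetic data; the data, seed, and noise level are assumptions for illustration and are unrelated to the wine dataset.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))                                        # synthetic features
Y = X @ rng.normal(size=(5, 1)) + 0.1 * rng.normal(size=(300, 1))    # noisy synthetic targets

A = X.T @ X
B = X.T @ Y

theta_solve = np.linalg.solve(A, B)           # normal equation via a linear solve
theta_inv = np.linalg.inv(A) @ B              # normal equation via an explicit inverse
theta_lstsq, _, rank, _ = np.linalg.lstsq(X, Y, rcond=None)  # least squares on X directly

def mse(y_hat, y):
    """Mean squared error between predictions and targets."""
    return np.mean((y - y_hat) ** 2)

print(np.allclose(theta_solve, theta_lstsq), rank, mse(X @ theta_solve, Y))
```

On a well-conditioned problem like this one, the three approaches should agree closely. When the columns of `X` are nearly collinear, $X^T X$ becomes ill-conditioned and `lstsq` is usually the safer option.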