Linear Regression Exercise (Closed Form Solution)

In statistics, linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables) [Wikipedia]. The closed form solution for the parameter $\theta$ of a linear regression model is given by $$\theta = (X^TX)^{-1}X^TY$$ where $X$ is the feature matrix and $Y$ is the target vector.

Let $d$ be the number of features and $n$ the number of examples. The dimensions are as follows:

  • $\theta$ is (d,1)
  • $X$ is (n,d)
  • $Y$ is (n,1)

Prediction is done using (see the shape check after this list):

  • $\hat{Y} = X \theta$
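Before touching real data, here is a quick shape check on synthetic, noise-free data (a minimal sketch; the sizes $n = 100$ and $d = 3$ are arbitrary choices) that also verifies the closed form recovers the true parameters:

import numpy as np

n, d = 100, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                # X is (n, d)
theta_true = rng.normal(size=(d, 1))       # theta is (d, 1)
Y = X @ theta_true                         # Y is (n, 1)
theta = np.linalg.inv(X.T @ X) @ X.T @ Y   # closed form solution
print(theta.shape)                         # (3, 1)
print(np.allclose(theta, theta_true))      # True, since Y has no noise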

We are trying to find the value of $\theta$ that minimizes the squared error, which means solving: $\underset{\theta}{\operatorname{argmin}} \|X \theta - Y\|_2^2$

Since the squared error is convex in $\theta$, we can find that value by taking the derivative of the expression and solving for the $\theta$ that makes it 0.

First let's expand $\|X \theta - Y\|_2^2$: $$ \begin{aligned} \|X \theta - Y\|_2^2 &= (X \theta - Y)^T(X \theta - Y) \\ &= (\theta^T X^T - Y^T)(X \theta - Y) \\ &= \theta^T X^T X \theta - Y^T X \theta - \theta^T X^T Y + Y^T Y \\ &= \theta^T X^T X \theta - (\theta^T X^T Y)^T - \theta^T X^T Y + Y^T Y \\ &= \theta^T X^T X \theta - 2 \theta^T X^T Y + Y^T Y \quad \text{because } \theta^T X^T Y \text{ is a scalar} \end{aligned} $$ Taking the derivative with respect to $\theta$: $$ \frac{\partial \|X \theta - Y\|_2^2}{\partial \theta} = 2 X^T X \theta - 2 X^T Y $$ By setting the derivative to 0 we get: $$ \begin{aligned} 2 X^T X \theta - 2 X^T Y &= 0 \\ X^T X \theta - X^T Y &= 0 \\ X^T X \theta &= X^T Y \\ \theta &= (X^T X)^{-1} X^T Y \end{aligned} $$
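We can sanity-check this gradient numerically with central finite differences (an optional verification on synthetic data; the step size $10^{-6}$ and tolerance are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Y = rng.normal(size=(50, 1))
theta = rng.normal(size=(4, 1))

analytic = 2 * X.T @ X @ theta - 2 * X.T @ Y   # the derivative above

# finite-difference approximation of the same gradient, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(theta.shape[0]):
    e = np.zeros_like(theta)
    e[i] = eps
    f_plus = np.sum((X @ (theta + e) - Y) ** 2)
    f_minus = np.sum((X @ (theta - e) - Y) ** 2)
    numeric[i] = (f_plus - f_minus) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # True up to finite-difference error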

Here is an implementation using numpy and the wine quality dataset from the UCI Machine Learning Repository.

import pandas as pd
import numpy as np
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv -P data
!ls data
winequality-red.csv
data = pd.read_csv("data/winequality-red.csv",sep=';')
data.head(3)
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
cols = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH", "sulphates", "alcohol"]
target = "quality"
data = data.sample(frac=1)  # shuffle the rows
X = data[cols].values
Y = data[[target]].values
X.shape, Y.shape
((1599, 11), (1599, 1))

We also implement the bias parameter by adding a feature with fixed value one to every data point. With $d$ original features, the prediction for one example becomes: $$ \sum_{i=1}^{d}(\theta_i \cdot x_i) + \theta_{d+1} \cdot 1 $$ where $\theta_{d+1}$ is the bias parameter.

def add_ones(X): return np.hstack([X, np.ones((X.shape[0], 1))])  # append a column of ones (bias feature)
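For instance (a quick illustration with made-up values):

add_ones(np.array([[7.4, 0.70], [7.8, 0.88]]))
# array([[7.4 , 0.7 , 1.  ],
#        [7.8 , 0.88, 1.  ]])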

class LinearReg:
    """
    Basic linear regression implementation using numpy
    """
    def __init__(self, bias=False):
        """
        Initialization of theta and a boolean to determine whether to use a bias or not
        """
        self.theta = None
        self.bias = bias
  
    def fit(self,X,Y):
        """
        Fit function. Uses the normal equation to compute theta
        """
        if self.bias:
            X = add_ones(X)
        A = X.T @ X
        B = X.T @ Y
        # Solve the normal equation A @ theta = B; np.linalg.solve is more
        # numerically stable than explicitly inverting A:
        self.theta = np.linalg.solve(A, B)
        # self.theta = np.linalg.inv(A) @ B

    def predict(self,X):
        """
        Prediction function: computes X @ theta (with a ones column appended if bias is used)
        """
        if self.bias:
            X = add_ones(X)
        return X @ self.theta

    @staticmethod
    def mse(y_hat,y):
        """
        Static method implementing the mean squared error
        """
        return np.mean((y-y_hat)**2)
model1 = LinearReg()
model1.fit(X,Y)
LinearReg.mse(model1.predict(X),Y)
0.4170492248204846
model2 = LinearReg(bias=True)
model2.fit(X,Y)
LinearReg.mse(model2.predict(X),Y)
0.41676716722140805
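
As a final sanity check (an optional step, not part of the original exercise), the bias model's parameters can be reproduced with numpy's built-in least squares solver:

theta_lstsq, *_ = np.linalg.lstsq(add_ones(X), Y, rcond=None)
np.allclose(theta_lstsq, model2.theta)  # expected: True (up to numerical precision)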