
Generalized Linear Regression

Regression solves the following problem:

Given,

$$x_i \in \mathbb{R}^d, \quad y_i \in \mathbb{R}, \quad i = 1, 2, \ldots, n$$

And a training dataset,

$$\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$

Find the best fit $f$ for,

$$\hat{y} = f(x)$$

By best fit, we typically mean to minimize a loss value.

$$f = \operatorname{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))$$
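
For example, with squared error as the loss $L$, the empirical loss of any candidate $f$ over $\mathcal{D}$ can be computed directly. A small illustrative sketch (the candidate function and toy data here are made up, not from a real dataset):

def empirical_risk(f, xs, ys):
    # Mean squared error of candidate f over the dataset.
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Toy data and a hand-picked linear candidate, purely for illustration.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
print(empirical_risk(lambda x: 2.0 * x, xs, ys))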

Generalized linear regression mainly covers linear regression, logistic regression, and regularized regression (L1, L2, Elastic Net, etc.). They are all variations of linear regression.

Linear Regression

Linear regression finds the best fit $f$ in the linear function space. It usually uses MSE (Mean Squared Error) as the loss function.

Suppose,

$$\hat{y} = \hat{w}^T x + \hat{b}$$
tip

You may see the following in other books,

$$\hat{y} = \hat{w}^T x$$

This is because they use homogeneous coordinates. This form adds an extra dimension to $w$ and $x$, and for this extra dimension, the value of $x$ is always one.

This is equivalent to,

$$\hat{y} = (\hat{w}, \hat{b})^T (x, 1)$$

Which is exactly,

$$\hat{y} = \hat{w}^T x + \hat{b}$$

This can simplify the calculation in some cases.

Then we want,

$$(\hat{w}^T, \hat{b}) = \operatorname{argmin}_{(w^T, b)} \frac{1}{n} \sum_{i=1}^{n} (y_i - (w^T x_i + b))^2$$

There are many ways to solve this. Here, we simply set the partial derivatives to zero.

Denote,

$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - (w^T x_i + b))^2$$

Then,

$$\frac{\partial L}{\partial w^T} = \frac{1}{n} \sum_{i=1}^{n} -2 x_i (y_i - (w^T x_i + b)) = 0$$

$$\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} -2 (y_i - (w^T x_i + b)) = 0$$

We denote the sample mean by,

$$\overline{v} = \frac{1}{n} \sum_{i=1}^{n} v_i$$

Then,

$$w^T \overline{xx} + b \overline{x} = \overline{yx}$$

$$w^T \overline{x} + b = \overline{y}$$

Please note that $\overline{xx}$ is a matrix (the mean of $x x^T$), not a scalar.

In the end,

$$w^T = (\overline{xy} - \overline{x} \; \overline{y})(\overline{xx} - \overline{x}\; \overline{x})^{-1}$$

$$b = \overline{y} - w^T \overline{x}$$

We adjust the indices to make the result more suitable for matrix calculations, using abstract index notation and the Einstein summation convention.

$$w_{a} = (\overline{x^b y} - \overline{x^b} \; \overline{y})(\overline{x^a x^b} - \overline{x^a}\; \overline{x^b})^{-1} = (\overline{x_b y} - \overline{x_b} \; \overline{y})(\overline{x^a x_b} - \overline{x^a}\; \overline{x_b})^{-1}$$

Rewrite it in matrix form,

$$w^T = (\overline{x^T y} - \overline{x^T} \; \overline{y})(\overline{x x^T} - \overline{x}\; \overline{x}^T)^{-1}$$
tip

For homogeneous coordinates,

$$w^T = \overline{y x^T} \, (\overline{x x^T})^{-1}$$

Using NumPy to implement this,

import numpy as np

def linear_regression(x, y):
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    n = len(y)
    # Sample means: xx is a matrix, xy and x_bar are vectors, y_bar is a scalar.
    xx = np.einsum('ij,ik->jk', x, x) / n
    xy = np.einsum('ij,i->j', x, y) / n
    x_bar = np.einsum('ij->j', x) / n
    y_bar = np.einsum('i->', y) / n
    # w^T = (mean(xy) - mean(x) mean(y)) (mean(x x^T) - mean(x) mean(x)^T)^{-1}
    # pinv handles the singular case (the example features below are collinear).
    w = np.linalg.pinv(xx - np.outer(x_bar, x_bar)) @ (xy - x_bar * y_bar)
    b = y_bar - w @ x_bar
    return w, b

def predict(x, w, b):
    x = np.array(x, dtype=float)
    return x @ w + b

train_x = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
train_y = [3, 5, 7, 9, 11]
w, b = linear_regression(train_x, train_y)
print(w, b)
print(predict([[6, 7], [7, 8]], w, b))
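
As a follow-up to the tip above, here is a minimal sketch of the homogeneous-coordinate form $w^T = \overline{y x^T} \, (\overline{x x^T})^{-1}$, reusing train_x and train_y from the snippet above. The function name is just for illustration, and a pseudoinverse is used because the two example features are collinear:

def linear_regression_homogeneous(x, y):
    # Append a constant 1 to every input so the bias is folded into w.
    x = np.hstack([np.array(x, dtype=float), np.ones((len(x), 1))])
    y = np.array(y, dtype=float)
    n = len(y)
    yx = np.einsum('i,ij->j', y, x) / n    # mean of y x^T
    xx = np.einsum('ij,ik->jk', x, x) / n  # mean of x x^T
    # pinv instead of inv: the collinear example features make xx singular.
    return np.linalg.pinv(xx) @ yx         # the last component plays the role of b

print(linear_regression_homogeneous(train_x, train_y))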

Logistic Regression

Logistic regression is a variation of linear regression. It uses the function space,

$$\hat{y} = \sigma(w^T x + b)$$

Where $\sigma(z) = \frac{1}{1+e^{-z}}$ is the sigmoid function.

tip

The sigmoid function has some useful properties,

$$\sigma'(z) = \sigma(z)(1-\sigma(z))$$

$$\sigma(-z) = 1 - \sigma(z)$$

$$\sigma^{-1}(z) = \log\left(\frac{z}{1-z}\right)$$
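
These identities are easy to check numerically. A quick sketch (the helper name and test points are arbitrary):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-3, 3, 7)
eps = 1e-6
# sigma'(z) = sigma(z) (1 - sigma(z)), checked with a central finite difference.
assert np.allclose((sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps),
                   sigmoid(z) * (1 - sigmoid(z)), atol=1e-5)
# sigma(-z) = 1 - sigma(z)
assert np.allclose(sigmoid(-z), 1 - sigmoid(z))
# sigma^{-1}(sigma(z)) = z, with the logit as the inverse.
assert np.allclose(np.log(sigmoid(z) / (1 - sigmoid(z))), z)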

To solve logistic regression, we can convert it to a linear regression problem by applying the inverse sigmoid to the targets.

$$\sigma^{-1}(\hat{y}) = w^T x + b$$
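
A minimal sketch of this logit-transform approach, reusing the linear_regression function from the NumPy snippet above. It assumes every target is a probability strictly between 0 and 1 (the logit is undefined at exactly 0 or 1); the helper names and toy data are made up:

def logistic_regression(x, y):
    # Map probability targets through the logit (inverse sigmoid),
    # then reuse the closed-form linear regression above.
    y = np.array(y, dtype=float)
    z = np.log(y / (1 - y))
    return linear_regression(x, z)

def predict_proba(x, w, b):
    return 1 / (1 + np.exp(-(np.array(x, dtype=float) @ w + b)))

train_x2 = [[0.0], [1.0], [2.0], [3.0]]
train_y2 = [0.1, 0.3, 0.7, 0.9]  # probabilities, not hard 0/1 labels
w2, b2 = logistic_regression(train_x2, train_y2)
print(predict_proba([[1.5], [4.0]], w2, b2))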

Regularized Regression

Sometimes we add extra regularization terms to the loss function to reduce overfitting and the effect of outliers.

If we add L1 regularization, that is,

$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - (w^T x_i + b))^2 + \lambda \|w\|_1 \quad \text{s.t.} \quad \lambda > 0$$

This makes the weights sparser, which is useful for feature selection.

If we add L2 regularization, that is,

$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - (w^T x_i + b))^2 + \lambda \|w\|_2^2 \quad \text{s.t.} \quad \lambda > 0$$

This shrinks the weights, which helps reduce overfitting.

Linear regression with L1 regularization is called Lasso regression, and with L2 regularization it is called Ridge regression. If we combine both, we get the Elastic Net.

$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - (w^T x_i + b))^2 + \lambda \|w\|_1 + \mu \|w\|_2^2 \quad \text{s.t.} \quad \lambda, \mu > 0$$
info

There is another way of introducing the regularization terms.

Consider the following constrained optimization problem,

$$\operatorname{argmin}_{w^T, b} \frac{1}{n} \sum_{i=1}^{n} (y_i - (w^T x_i + b))^2 \quad \text{s.t.} \quad \|w\|_1 - C_1 \le 0, \; \|w\|_2^2 - C_2 \le 0$$

We can use Lagrange multipliers to solve this problem.

$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - (w^T x_i + b))^2 + \lambda (\|w\|_1 - C_1) + \mu (\|w\|_2^2 - C_2) \quad \text{s.t.} \quad \lambda, \mu \ge 0$$

Take note that $\lambda C_1$ and $\mu C_2$ do not depend on $w$ or $b$, so they are irrelevant to the minimizer and we can simply drop them to get the final form.

The form introduced in this section is called the constrained form, and the form introduced in the previous section is called the penalty form.

We can solve the penalty form by (sub)gradient descent,

$$\frac{\partial L}{\partial w^T} = \frac{1}{n} \sum_{i=1}^{n} -2 x_i (y_i - (w^T x_i + b)) + \lambda \operatorname{sign}(w) + 2 \mu w$$

$$\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} -2 (y_i - (w^T x_i + b))$$
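
A minimal (sub)gradient-descent sketch of the penalty form using these gradients. The learning rate, step count, and the λ, μ values are illustrative assumptions, not tuned settings:

import numpy as np

def elastic_net_gd(x, y, lam=0.1, mu=0.1, lr=0.01, steps=5000):
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    n, d = x.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        r = y - (x @ w + b)  # residuals
        grad_w = -2.0 / n * x.T @ r + lam * np.sign(w) + 2.0 * mu * w
        grad_b = -2.0 / n * np.sum(r)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = elastic_net_gd([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]], [3, 5, 7, 9, 11])
print(w, b)

Because of the non-smooth L1 term, plain gradient descent only approaches the optimum up to an oscillation on the order of the learning rate; proximal or coordinate-descent methods are the usual choice in practice.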