Generalized Linear Regression
Regression solves the following problem.
Given,
$$x_i \in \mathbb{R}^n, \quad y_i \in \mathbb{R}, \quad i = 1, 2, \ldots, n$$
And a training dataset,
$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$
Find the best fit $f$ for,
$$\hat{y} = f(x)$$
By best fit, we typically mean minimizing a loss value over the training set,
$$f = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))$$
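As a concrete illustration of this objective, here is a small sketch that evaluates the empirical risk of a candidate f on a toy dataset, assuming squared error as L (the dataset and names such as candidate_f are made up for the example):

import numpy as np

# Hypothetical toy dataset, purely for illustration.
toy_x = np.array([[1.0], [2.0], [3.0]])
toy_y = np.array([2.0, 4.0, 6.0])

def candidate_f(x):
    # A guessed linear model; not necessarily the best fit.
    return 1.8 * x[:, 0] + 0.1

def empirical_risk(f, x, y):
    # (1/n) * sum_i L(y_i, f(x_i)) with squared-error loss.
    return np.mean((y - f(x)) ** 2)

print(empirical_risk(candidate_f, toy_x, toy_y))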
Generalized linear regression mainly covers linear regression, logistic regression, and regularized regression (L1, L2, Elastic Net, etc.). They are all variations of linear regression.
Linear Regression
Linear regression finds the best fit f in the linear function space. It usually uses MSE (Mean Squared Error) as the loss function.
Suppose,
$$\hat{y} = \hat{w}^T x + \hat{b}$$
You may see the following in other books,
$$\hat{y} = \hat{w}^T x$$
This is because they use homogeneous coordinates. This form adds an extra dimension to $w$ and $x$, and for this extra dimension, the value of $x$ is always one.
This is equivalent to,
$$\hat{y} = (\hat{w}, \hat{b})^T (x, 1)$$
Which is exactly,
$$\hat{y} = \hat{w}^T x + \hat{b}$$
This may simplify calculation in some cases.
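A small sketch of this trick, assuming NumPy (the names x_aug and w_aug are made up here): appending a constant 1 to every sample folds the bias into the weight vector.

import numpy as np

x = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
# Append a constant-1 feature to every sample, i.e. (x, 1).
x_aug = np.hstack([x, np.ones((x.shape[0], 1))])
# A weight vector (w, b) in homogeneous coordinates; the last entry plays the role of b.
w_aug = np.array([1.0, 1.0, 0.5])
print(x_aug @ w_aug)   # equals x @ [1, 1] + 0.5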
Then we want,
$$(\hat{w}^T, \hat{b}) = \arg\min_{(w^T, b)} \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (w^T x_i + b)\right)^2$$
There are many ways to solve this. We simply use partial derivatives,
Denote,
$$L = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (w^T x_i + b)\right)^2$$
Then,
$$\frac{\partial L}{\partial w^T} = \frac{1}{n} \sum_{i=1}^{n} -2 x_i \left(y_i - (w^T x_i + b)\right) = 0$$
$$\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} -2 \left(y_i - (w^T x_i + b)\right) = 0$$
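Before solving these equations in closed form, here is an optional numerical sanity check of the two derivatives using a central finite difference; the synthetic data and step size are placeholders chosen for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y = x @ np.array([1.5, -2.0]) + 0.7 + rng.normal(scale=0.1, size=50)

def loss(w, b):
    return np.mean((y - (x @ w + b)) ** 2)

def grads(w, b):
    r = y - (x @ w + b)
    dw = np.mean(-2 * x * r[:, None], axis=0)   # (1/n) sum -2 x_i (y_i - w^T x_i - b)
    db = np.mean(-2 * r)
    return dw, db

w, b, eps = np.array([0.3, 0.4]), 0.1, 1e-6
dw, db = grads(w, b)
# Central finite difference on the first weight component as a check.
num_dw0 = (loss(w + np.array([eps, 0.0]), b) - loss(w - np.array([eps, 0.0]), b)) / (2 * eps)
print(dw[0], num_dw0)   # the two values should agree closely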
To simplify the equations, we write the sample mean as,
$$\bar{v} = \frac{1}{n} \sum_{i=1}^{n} v_i$$
Then,
$$w^T \overline{xx} + b\,\bar{x} = \overline{yx}$$
$$w^T \bar{x} + b = \bar{y}$$
Please note that $\overline{xx}$ is a tensor (the matrix $\frac{1}{n}\sum_i x_i x_i^T$), not a scalar.
In the end,
$$w^T = \left(\overline{xy} - \bar{x}\,\bar{y}\right)\left(\overline{xx} - \bar{x}\,\bar{x}\right)^{-1}$$
$$b = \bar{y} - w^T \bar{x}$$
We adjust the indices to make the result more suitable for matrix calculations, using abstract index notation and the Einstein summation convention.
$$w_a = \left(\overline{x_b y} - \bar{x}_b\,\bar{y}\right)\left(\overline{x_a x_b} - \bar{x}_a\,\bar{x}_b\right)^{-1}$$
Rewrite it in matrix form,
$$w^T = \left(\overline{x^T y} - \bar{x}^T \bar{y}\right)\left(\overline{x x^T} - \bar{x}\,\bar{x}^T\right)^{-1}$$
For homogeneous coordinates,
$$w^T = \overline{y x^T}\left(\overline{x x^T}\right)^{-1}$$
Using NumPy to implement this (note that the toy features below are perfectly collinear, so the pseudo-inverse is used in place of a plain matrix inverse),
import numpy as np

def linear_regression(x, y):
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    n = len(y)
    # Sample means and second moments (the overline quantities in the derivation).
    xx = np.einsum('ij,ik->jk', x, x) / n   # mean of x x^T
    xy = np.einsum('ij,i->j', x, y) / n     # mean of x y
    x_mean = np.einsum('ij->j', x) / n
    y_mean = np.einsum('i->', y) / n
    # w^T = (mean(xy) - mean(x) mean(y)) (mean(x x^T) - mean(x) mean(x)^T)^{-1}
    # pinv is used instead of inv so that collinear features (a singular matrix) still work.
    w = np.linalg.pinv(xx - np.outer(x_mean, x_mean)) @ (xy - x_mean * y_mean)
    b = y_mean - w @ x_mean
    return w, b

def predict(x, w, b):
    x = np.array(x, dtype=float)
    return x @ w + b

train_x = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
train_y = [3, 5, 7, 9, 11]
w, b = linear_regression(train_x, train_y)
print(w, b)
print(predict([[6, 7], [7, 8]], w, b))
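The homogeneous-coordinates formula $w^T = \overline{y x^T}\left(\overline{x x^T}\right)^{-1}$ can be implemented in the same style. The sketch below reuses the same training data; np.linalg.pinv is used because the augmented moment matrix is singular for this collinear dataset, so the returned weights are the minimum-norm solution, but the predictions agree with the previous example.

import numpy as np

def linear_regression_homogeneous(x, y):
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    # Append the constant-1 feature so the bias is folded into w.
    x = np.hstack([x, np.ones((x.shape[0], 1))])
    yx = np.einsum('i,ij->j', y, x) / len(y)    # mean of y x^T
    xx = np.einsum('ij,ik->jk', x, x) / len(y)  # mean of x x^T
    return yx @ np.linalg.pinv(xx)              # w^T = mean(y x^T) (mean(x x^T))^{-1}

train_x = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
train_y = [3, 5, 7, 9, 11]
w_aug = linear_regression_homogeneous(train_x, train_y)
print(w_aug)   # the last entry plays the role of b
test_x = np.hstack([np.array([[6, 7], [7, 8]], dtype=float), np.ones((2, 1))])
print(test_x @ w_aug)   # matches the predictions from the previous example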
Logistic Regression
Logistic regression is a variation of linear regression. It uses the function space,
$$\hat{y} = \sigma(w^T x + b)$$
Where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
The sigmoid function has some special properties,
$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$$
$$\sigma(-z) = 1 - \sigma(z)$$
$$\sigma^{-1}(z) = \log\left(\frac{z}{1 - z}\right)$$
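These identities can be checked numerically; a quick sketch (the helper names sigmoid and sigmoid_inv are just illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_inv(p):
    # The logit function, the inverse of the sigmoid on (0, 1).
    return np.log(p / (1.0 - p))

z = np.linspace(-3, 3, 7)
s = sigmoid(z)
eps = 1e-6
print(np.allclose((sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps), s * (1 - s)))  # sigma'(z) = sigma(z)(1 - sigma(z))
print(np.allclose(sigmoid(-z), 1 - s))                                              # sigma(-z) = 1 - sigma(z)
print(np.allclose(sigmoid_inv(s), z))                                               # sigma^{-1}(sigma(z)) = z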
To solve logistic regression, we just convert it to a linear regression problem.
$$\sigma^{-1}(\hat{y}) = w^T x + b$$
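A minimal sketch of this conversion, assuming the targets are probabilities strictly inside (0, 1) so that $\sigma^{-1}$ is defined; the toy data are made up, and np.linalg.lstsq is used here simply to solve the resulting least-squares problem.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

# Hypothetical soft targets strictly inside (0, 1), so sigma^{-1}(y) is well defined.
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

# Transform the targets with the inverse sigmoid and fit a linear model to (x, logit(y)).
z = logit(y)
x_aug = np.hstack([x, np.ones((len(x), 1))])
w_aug, *_ = np.linalg.lstsq(x_aug, z, rcond=None)
w, b = w_aug[:-1], w_aug[-1]
print(sigmoid(x @ w + b))   # fitted probabilities, close to y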
Regularized Regression
Sometimes, we add extra regularization terms to the loss function to reduce overfitting and the effect of outliers.
If we add L1 regularization, that is,
$$L = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (w^T x_i + b)\right)^2 + \lambda ||w||_1 \quad \text{s.t.} \quad \lambda > 0$$
This makes the weights sparser, which is good for feature selection.
If we add L2 regularization, that is,
$$L = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (w^T x_i + b)\right)^2 + \lambda ||w||_2^2 \quad \text{s.t.} \quad \lambda > 0$$
This shrinks the weights, which is good for reducing overfitting.
Linear regression with L1 regularization is called Lasso regression, and with L2 regularization is called Ridge regression. If we combine them, it is called an Elastic Net.
$$L = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (w^T x_i + b)\right)^2 + \lambda ||w||_1 + \mu ||w||_2^2 \quad \text{s.t.} \quad \lambda, \mu > 0$$
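To make the three penalties concrete, here is a small sketch that evaluates each penalized loss for a fixed w and b; all data and coefficient values are placeholders.

import numpy as np

def mse(w, b, x, y):
    return np.mean((y - (x @ w + b)) ** 2)

x = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 0.0]])
y = np.array([1.0, 2.0, 3.0])
w, b = np.array([0.8, -0.1]), 0.2
lam, mu = 0.1, 0.05

lasso_loss = mse(w, b, x, y) + lam * np.sum(np.abs(w))                           # L1 (Lasso)
ridge_loss = mse(w, b, x, y) + mu * np.sum(w ** 2)                               # L2 (Ridge)
elastic_loss = mse(w, b, x, y) + lam * np.sum(np.abs(w)) + mu * np.sum(w ** 2)   # Elastic Net
print(lasso_loss, ridge_loss, elastic_loss)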
There is another way of introducing regularization terms.
Consider the following constrained problem,
$$\arg\min_{w^T, b} \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (w^T x_i + b)\right)^2 \quad \text{s.t.} \quad ||w||_1 - C_1 \le 0, \quad ||w||_2^2 - C_2 \le 0$$
We can use Lagrange multipliers to solve this problem.
$$L = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (w^T x_i + b)\right)^2 + \lambda \left(||w||_1 - C_1\right) + \mu \left(||w||_2^2 - C_2\right) \quad \text{s.t.} \quad \lambda, \mu \ge 0$$
Take note that the terms $\lambda C_1$ and $\mu C_2$ do not depend on $w$ or $b$, so we can simply drop them to get the final form.
The form introduced here is called the constrained form, and the form introduced earlier is called the penalty form.
We can solve this by gradient descent,
$$\frac{\partial L}{\partial w^T} = \frac{1}{n} \sum_{i=1}^{n} -2 x_i \left(y_i - (w^T x_i + b)\right) + \lambda\,\mathrm{sign}(w) + 2\mu w$$
$$\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} -2 \left(y_i - (w^T x_i + b)\right)$$
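A minimal (sub)gradient-descent sketch based on these derivatives; the synthetic data, learning rate, and iteration count are placeholders, and sign(w) serves as the L1 subgradient.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 3))
y = x @ np.array([2.0, 0.0, -1.0]) + 0.5 + rng.normal(scale=0.1, size=100)

lam, mu, lr = 0.1, 0.1, 0.05   # L1 weight, L2 weight, learning rate (placeholders)
w, b = np.zeros(3), 0.0
for _ in range(2000):
    r = y - (x @ w + b)
    grad_w = np.mean(-2 * x * r[:, None], axis=0) + lam * np.sign(w) + 2 * mu * w
    grad_b = np.mean(-2 * r)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # the second weight is pulled toward zero by the L1 term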