The support vector machine (SVM) is a natural departure from the linear regression model. In linear regression, we fit a line that best fits the data points; in a support vector machine, we fit a line that best separates them.
Here, we use the label set
$$B = \{-1, 1\}$$
Ideas
For a support vector machine, we want to find a plane,
$$w^T x + b = 0$$
that optimally splits the space into two regions: the upper part $w^T x + b > 0$ and the lower part $w^T x + b < 0$.
We would like as few wrongly classified points as possible and the maximum margin between the two regions. The margin is an $\epsilon$ such that there are as few points as possible in the region
$$\frac{|w^T x + b|}{\|w\|_2} < \epsilon$$
so that few points lie close to the plane, which makes the classification more robust.
To make a prediction,
$$\hat{y} = \operatorname{sgn}(w^T x + b)$$
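As a minimal sketch of this decision rule in Python (the weights and points below are made-up values, not a trained model):

```python
import numpy as np

def predict(w, b, X):
    """Classify each row of X by the sign of w^T x + b."""
    return np.sign(X @ w + b)

# Hypothetical example: the plane x1 + x2 - 1 = 0 in 2-D
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 2.0],   # upper region -> +1
              [0.0, 0.0]])  # lower region -> -1
print(predict(w, b, X))     # [ 1. -1.]
```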
For a high-dimensional plane $w^T x + b = 0$, $w$ is the normal vector of the plane because, for any two points $x_1$ and $x_2$ on the plane, $w^T(x_1 - x_2) = 0$ (by subtracting $w^T x_2 + b = 0$ from $w^T x_1 + b = 0$).
The training points that lie closest to this plane, and thus determine the margin, are called the support vectors.
For a high-dimensional plane $w^T x + b = 0$, the distance from any point $x_0$ to the plane is
$$\frac{|w^T x_0 + b|}{\|w\|_2}$$
You can extrapolate this directly from the low-dimensional case, but a more rigorous proof follows.
The distance from a point to a plane is defined as the minimum distance from the given point to any point on the plane. That is to say,
$$D = \min_x \|x - x_0\|_2 \quad \text{s.t.} \quad w^T x + b = 0$$
This objective is awkward to optimize directly, so we use an equivalent one,
$$\frac{1}{2} D^2 = \min_x \frac{1}{2} \|x - x_0\|_2^2 \quad \text{s.t.} \quad w^T x + b = 0$$
Construct a Lagrangian,
$$L(x, \lambda) = \frac{1}{2} \|x - x_0\|_2^2 + \lambda (w^T x + b)$$
Then,
$$\frac{\partial L}{\partial x} = (x - x_0)^T + \lambda w^T = 0, \qquad \frac{\partial L}{\partial \lambda} = w^T x + b = 0$$
Because we are in Euclidean space,
$$x - x_0 + \lambda w = 0$$
Then,
$$w^T x - w^T x_0 + \lambda w^T w = 0$$
Using the constraint $w^T x = -b$,
$$\lambda = \frac{w^T x_0 + b}{\|w\|_2^2}$$
Now that we have $\lambda$, we substitute it back into the first equation,
$$x - x_0 = -\frac{w \, (w^T x_0 + b)}{\|w\|_2^2}$$
Thus,
$$D = \|x - x_0\|_2 = \frac{|w^T x_0 + b|}{\|w\|_2}$$
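As a quick numeric sanity check of this formula (the plane and point below are arbitrary made-up values):

```python
import numpy as np

w = np.array([3.0, 4.0])   # normal vector, ||w||_2 = 5
b = -5.0
x0 = np.array([4.0, 6.0])  # an arbitrary point

# Closed-form distance |w^T x0 + b| / ||w||_2
d_formula = abs(w @ x0 + b) / np.linalg.norm(w)

# Foot of the perpendicular: x = x0 - w (w^T x0 + b) / ||w||_2^2
x_proj = x0 - w * (w @ x0 + b) / (w @ w)
d_direct = np.linalg.norm(x0 - x_proj)

print(d_formula, d_direct)  # both 6.2
print(w @ x_proj + b)       # ~0, so x_proj lies on the plane
```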
We now state our optimization problem more formally. Because we want the maximum margin and the maximum number of correctly classified points, we would like to minimize
$$L = \sum_{i=1}^n \max\left(0,\ \epsilon - \frac{y_i (w^T x_i + b)}{\|w\|_2}\right)$$
where $\epsilon$ is constrained to a given range, for example,
$$\epsilon \le \epsilon_0$$
If the classification is wrong,
$$y_i (w^T x_i + b) < 0$$
so the term inside the max exceeds $\epsilon$ and contributes a large loss.
If the classification is correct but the point is not at least a distance $\epsilon$ from the plane, then
$$\frac{y_i (w^T x_i + b)}{\|w\|_2} < \epsilon$$
which also results in a positive loss.
Only when the sample is correctly classified and lies at least a distance $\epsilon$ from the plane is its loss zero.
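A small sketch of this loss in Python, with made-up parameters and points; `eps` plays the role of the margin $\epsilon$:

```python
import numpy as np

def margin_loss(w, b, X, y, eps):
    """Sum of max(0, eps - y_i (w^T x_i + b) / ||w||_2) over all samples."""
    signed_dist = y * (X @ w + b) / np.linalg.norm(w)
    return np.maximum(0.0, eps - signed_dist).sum()

w, b, eps = np.array([1.0, 1.0]), -1.0, 0.5
X = np.array([[2.0, 2.0],    # correct and far from the plane -> zero loss
              [0.8, 0.4],    # correct but inside the margin  -> small loss
              [0.0, 0.0]])   # wrong side of the plane        -> large loss
y = np.array([1.0, 1.0, 1.0])
print(margin_loss(w, b, X, y, eps))
```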
There are two types of SVM: the soft-margin SVM and the hard-margin SVM. We use the soft-margin SVM here.
The hard-margin SVM assumes that there must exist a line that separates the two classes perfectly, so it uses the loss function
$$L = -\frac{1}{\|w\|_2} \quad \text{s.t.} \quad y_i(w^T x_i + b) > 0$$
or even
$$L = \|w\|_2^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) > 0$$
Because such a $w$ and $b$ are guaranteed to exist, this loss function only optimizes the margin.
But when that is not the case, we use the soft-margin SVM, which minimizes the number of wrongly classified points while maximizing the margin, as described above.
Optimization
To optimize the loss function, we first need to eliminate the max function. We can use the following trick, called slack variables,
$$L = \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad \frac{y_i(w^T x_i + b)}{\epsilon \|w\|_2} \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \epsilon \le \epsilon_0$$
We let
$$\xi_i = \max\left(0,\ 1 - \frac{y_i(w^T x_i + b)}{\epsilon \|w\|_2}\right)$$
Thus,
$$\xi_i \ge 0$$
and
$$\xi_i \ge 1 - \frac{y_i(w^T x_i + b)}{\epsilon \|w\|_2}$$
With this, we can convert a max (or min) function into free variables with two extra constraints.
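A tiny numeric check of this conversion: for any margin value $m = \frac{y_i(w^T x_i + b)}{\epsilon \|w\|_2}$, the smallest $\xi_i$ satisfying both constraints is exactly the original max term (the margin values below are arbitrary):

```python
# The two constraints  xi >= 1 - m  and  xi >= 0  are satisfied by the
# smallest feasible slack exactly when  xi = max(0, 1 - m).
for m in (1.7, 0.3, -0.8):      # far enough / inside margin / misclassified
    xi = max(0.0, 1.0 - m)      # value the slack variable takes
    assert xi >= 1.0 - m and xi >= 0.0
    print(f"m = {m:+.1f} -> xi = {xi:.1f}")
```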
This form in my note looks very different from the form you may see in textbooks, which is
$$L = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$
However, they are fundamentally identical.
This is our form,
$$L = \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad \frac{y_i(w^T x_i + b)}{\epsilon \|w\|_2} \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \epsilon \le \epsilon_0$$
If we enforce the extra restriction $\epsilon \|w\|_2 = 1$, then
$$L = \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \|w\|_2 \ge \frac{1}{\epsilon_0}$$
We can apply a Lagrange multiplier to the last constraint, then
$$L = \sum_{i=1}^n \xi_i + \lambda\left(\|w\|_2 - \frac{1}{\epsilon_0}\right)$$
Again, the constant $\frac{\lambda}{\epsilon_0}$ doesn't affect the optimization, so eventually,
$$L = \lambda \|w\|_2 + \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$
This is identical to the traditional form, except that the margin term and its coefficient are written differently.
You will see later that, if we take $\epsilon_0 = \frac{1}{\|w\|_2}$, the optimization steps are exactly identical in both forms.
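For reference, the textbook form above (a squared-norm margin term plus $C$ times the total slack) is what off-the-shelf solvers expose. A minimal usage sketch, assuming scikit-learn is available and using a tiny made-up dataset:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny made-up dataset with labels in {-1, +1}
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 2.0]])
y = np.array([-1, -1, 1, 1])

# C corresponds to the slack coefficient in the textbook form
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)   # the learned w and b
print(clf.predict([[2.5, 2.5]]))   # -> [1]
```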
Returning to our form, we rewrite the constraints in standard form,
$$L = \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad 1 - \xi_i - \frac{y_i(w^T x_i + b)}{\epsilon \|w\|_2} \le 0, \quad -\xi_i \le 0, \quad \epsilon - \epsilon_0 \le 0$$
Then apply Lagrange multipliers,
$$L = \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i\left(1 - \xi_i - \frac{y_i(w^T x_i + b)}{\epsilon \|w\|_2}\right) + \sum_{i=1}^n \beta_i(-\xi_i) + \gamma(\epsilon - \epsilon_0) \quad \text{s.t.} \quad \alpha_i \ge 0, \quad \beta_i \ge 0, \quad \gamma \ge 0$$
Before using gradient descent, we can simplify, because
$$\frac{\partial L}{\partial \xi_i} = 1 - \alpha_i - \beta_i = 0$$
So,
$$L = \sum_{i=1}^n \alpha_i\left(1 - \frac{y_i(w^T x_i + b)}{\epsilon \|w\|_2}\right) + \gamma(\epsilon - \epsilon_0)$$
We can further eliminate $\gamma$ by taking $\epsilon = \epsilon_0$, so that the last term vanishes,
$$L = \sum_{i=1}^n \alpha_i\left(1 - \frac{y_i(w^T x_i + b)}{\epsilon_0 \|w\|_2}\right)$$
Typically, we use,
$$\epsilon_0 = \frac{1}{\|w\|_2}$$
So,
$$L = \sum_{i=1}^n \alpha_i\left(1 - y_i(w^T x_i + b)\right)$$
Eventually,
$$\frac{\partial L}{\partial \alpha_i} = 1 - y_i(w^T x_i + b)$$
$$\frac{\partial L}{\partial w} = -\sum_{i=1}^n \alpha_i y_i x_i$$
$$\frac{\partial L}{\partial b} = -\sum_{i=1}^n \alpha_i y_i$$
$$\alpha_i \ge 0$$
The final inequality can be enforced with a penalty method or, more simply, by setting $\alpha_i$ to zero whenever it becomes negative after an update.
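Below is a naive sketch of this procedure: gradient ascent on the multipliers $\alpha_i$ (they are maximized in the Lagrangian) and descent on $w$ and $b$, clipping each $\alpha_i$ at zero after every update. The learning rate, iteration count, initialization, and toy data are all made up; this is an illustration of the update rules above, not a practical SVM solver:

```python
import numpy as np

def train_svm(X, y, lr=0.01, epochs=200):
    """Alternating updates for L = sum_i alpha_i * (1 - y_i (w^T x_i + b)):
    ascent on alpha (projected onto alpha_i >= 0), descent on w and b."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    alpha = np.full(n, 0.1)                    # made-up initialization
    for _ in range(epochs):
        margin = y * (X @ w + b)
        # dL/dalpha_i = 1 - y_i (w^T x_i + b): ascend, then clip at zero
        alpha = np.maximum(0.0, alpha + lr * (1.0 - margin))
        # dL/dw = -sum_i alpha_i y_i x_i, dL/db = -sum_i alpha_i y_i: descend
        w += lr * (alpha * y) @ X
        b += lr * (alpha * y).sum()
    return w, b

# Tiny made-up separable dataset
X = np.array([[-3.0, -3.0], [-4.0, -3.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = train_svm(X, y)
print(np.sign(X @ w + b))   # [-1. -1.  1.  1.], matching y
```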
A faster approach is the SMO algorithm, a more efficient way to optimize the SVM loss function. Its basic idea is to update the $\alpha_i$ in pairs, choosing the pair that violates the KKT conditions the most. We won't go through it here.
Non-Linear SVM
A linear SVM can only separate linearly separable data. To handle non-linear data, we can use the kernel trick.
Data that is not linearly separable in its original $n$-dimensional space can often be made separable by mapping it into a higher-dimensional space. So, we use a function $\phi(x)$ to map the data to a higher-dimensional space; the hyperplane then becomes
$$\phi(w)^T \phi(x) + b = 0$$
The rest of the SVM algorithm remains the same.
However, because the mapped data only ever appears inside inner products, we can use the kernel trick to avoid computing $\phi(x)$ explicitly. That is, we define a new inner product, $K(x, x') = \phi(x)^T \phi(x')$, called a kernel function.
A kernel function represents an inner product in a higher-dimensional space, and thus it must satisfy the properties of an inner product:
- Symmetric: $K(x, x') = K(x', x)$
- Positive semi-definite: $K(x, x) \ge 0$
- Linear in the second argument: $K(x, c x') = c K(x, x')$
- Conjugate: $K(x, c x') = K(c^\dagger x, x')$
Then, with everything else remaining the same, whenever we compute an inner product, we use the kernel function instead of the original inner product.
Common kernel functions include the following (a small sketch in code follows the list):
- Plain (linear) kernel: $K(x, x') = x^T x'$
- Polynomial kernel: $K(x, x') = (x^T x' + c)^d$
- Gaussian kernel: $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
- Sigmoid kernel: $K(x, x') = \tanh(\alpha x^T x' + c)$
- Laplace kernel: $K(x, x') = \exp\left(-\frac{\|x - x'\|}{\sigma}\right)$
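A small sketch of these kernels ($\sigma$, $c$, $d$, and $\alpha$ are hyperparameters whose values below are made up; the final check just illustrates the symmetry property on a few arbitrary points):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, d=3):
    return (x @ z + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def sigmoid_kernel(x, z, alpha=0.5, c=0.0):
    return np.tanh(alpha * (x @ z) + c)

def laplace_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / sigma)

# Gram matrix on a few arbitrary points; symmetry means K == K^T
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
print(np.allclose(K, K.T))  # True
```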