
Support Vector Machine

The support vector machine is a natural departure from the linear regression model. In linear regression, we try to fit a line that best fits the data points; in a support vector machine, we try to fit a line that best separates the data points.

We use $\mathbb{B} = \{-1, 1\}$ as the label set here.

Ideas

For the support vector machine, we want to find a plane,

$$w^T x + b = 0$$

that optimally splits the space into two regions: the upper part where $w^T x + b > 0$ and the lower part where $w^T x + b < 0$.

We would like to have as few wrongly classified points as possible and the maximum margin between the two regions. The margin is the $\epsilon$ such that as few points as possible lie in the region

$$\frac{||w^T x + b||}{||w||_2} < \epsilon$$

so that fewer points sit close to the plane, which makes the classification more robust.

To make a prediction,

$$\hat{y} = \mathrm{sgn}(w^T x + b)$$
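As a quick illustration, here is a minimal NumPy sketch of this prediction rule; the weights, bias, and points below are made-up values, not the output of any training.

```python
import numpy as np

# made-up, "already trained" parameters of the plane w^T x + b = 0
w = np.array([2.0, -1.0])
b = 0.5

# a few made-up points to classify
X = np.array([[1.0, 1.0],
              [0.0, 2.0],
              [-1.0, 0.0]])

# y_hat = sgn(w^T x + b), giving labels in {-1, 1}
y_hat = np.sign(X @ w + b)
print(y_hat)  # [ 1. -1. -1.]
```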
tip

For a high-dimensional plane $w^T x + b = 0$, the vector $w$ is the normal vector of the plane because, for any two points $x_1$ and $x_2$ on the plane, $w^T (x_1 - x_2) = 0$ (by subtracting $w^T x_2 + b = 0$ from $w^T x_1 + b = 0$).

The training points that lie closest to this plane (those on the margin) are called the support vectors.

tip

For a high-dimensional plane $w^T x + b = 0$, the distance from any point $x_0$ to the plane is

$$\frac{||w^T x_0 + b||}{||w||_2}$$

You can extrapolate this directly from the low-dimensional case, but a more rigorous proof follows.

The distance from a point to a plane is defined as the minimum distance from the given point to any point on the plane. That is to say,

$$D = \min_x ||x - x_0||_2 \quad \text{s.t.} \quad w^T x + b = 0$$

This function is hard to optimize directly, so we use an equivalent one,

$$\frac{1}{2} D^2 = \min_x \frac{1}{2}||x - x_0||_2^2 \quad \text{s.t.} \quad w^T x + b = 0$$

Construct a Lagrangian,

$$L(x, \lambda) = \frac{1}{2}||x - x_0||_2^2 + \lambda (w^T x + b)$$

Then,

$$\frac{\partial L}{\partial x} = (x - x_0)^T + \lambda w^T = 0$$

$$\frac{\partial L}{\partial \lambda} = w^T x + b = 0$$

Because we are in Euclidean space,

$$x - x_0 + \lambda w = 0$$

Then, multiplying by $w^T$ and using $w^T x = -b$,

$$w^T x - w^T x_0 + \lambda w^T w = 0$$

$$\lambda = \frac{w^T x_0 + b}{||w||_2^2}$$

Now that we have $\lambda$, we substitute it back into the first equation,

$$x - x_0 = - \frac{w(w^T x_0 + b)}{||w||_2^2}$$

Thus,

$$D = ||x - x_0||_2 = \frac{||w^T x_0 + b||}{||w||_2}$$
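As a quick numerical sanity check of this result (assuming NumPy; the plane and the point are arbitrary made-up values), we can compare the closed-form distance with the projection point derived above:

```python
import numpy as np

# an arbitrary plane w^T x + b = 0 and an arbitrary point x0
w = np.array([3.0, 4.0])
b = -2.0
x0 = np.array([5.0, 1.0])

# closed-form distance: ||w^T x0 + b|| / ||w||_2
d_formula = abs(w @ x0 + b) / np.linalg.norm(w)

# the minimizer derived above: x = x0 - w (w^T x0 + b) / ||w||_2^2
x_proj = x0 - w * (w @ x0 + b) / (w @ w)

print(np.isclose(w @ x_proj + b, 0.0))                      # x_proj lies on the plane
print(np.isclose(np.linalg.norm(x_proj - x0), d_formula))   # distances agree
```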

Mathematical Formulation

We now rewrite our optimization problem in a more mathematical way.

Because we want the maximum margin and as many correctly classified points as possible, we would like to minimize

$$L = \sum_{i=1}^{n} \max\left(0, \epsilon - \frac{y_i(w^T x_i + b)}{||w||_2}\right)$$

with $\epsilon$ restricted to a given range, for example,

$$\epsilon \leq \epsilon_0$$
tip

If the classification is wrong,

$$y_i(w^T x_i + b) < 0$$

so there will be more loss.

If the classification is correct but the point is not at least a distance $\epsilon$ from the plane, then

$$\frac{y_i(w^T x_i + b)}{||w||_2} < \epsilon$$

which also results in more loss.

Only if a sample is correctly classified and at least a distance $\epsilon$ from the plane is its loss zero.
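A minimal NumPy sketch of this per-sample loss; the parameters, the data, and the value of $\epsilon$ are all made-up for illustration:

```python
import numpy as np

w = np.array([1.0, -1.0])      # made-up parameters
b = 0.0
eps = 0.5                      # assumed margin width

X = np.array([[2.0, 0.0],      # correct and far from the plane -> zero loss
              [0.3, 0.0],      # correct but inside the margin  -> some loss
              [-1.0, 0.0]])    # wrongly classified             -> larger loss
y = np.array([1.0, 1.0, 1.0])

signed_dist = y * (X @ w + b) / np.linalg.norm(w)
loss = np.maximum(0.0, eps - signed_dist)
print(loss)
```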

info

There are two types of SVM: the soft-margin SVM and the hard-margin SVM. We use the soft-margin SVM here.

The hard-margin SVM assumes that there must exist a line that separates the two classes perfectly, so it uses the loss function

$$L = - \frac{1}{||w||_2} \quad \text{s.t.} \quad y_i(w^T x_i + b) > 0$$

or even,

$$L = ||w||_2^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) > 0$$

Because such a pair of $w$ and $b$ is guaranteed to exist, this loss function only optimizes the margin.

If that is not the case, we use the soft-margin SVM, which minimizes the number of wrongly classified points and maximizes the margin, as shown above.
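If scikit-learn is available, the difference can be illustrated by varying the `C` penalty of a linear `SVC`: a very large `C` approximates a hard margin (it tries very hard to classify every point correctly), while a small `C` gives a softer margin. This is only a rough sketch on randomly generated data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)),
               rng.normal(loc=+1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# small C: soft margin, tolerates points inside the margin or misclassified
soft = SVC(kernel="linear", C=0.1).fit(X, y)
# very large C: approximately a hard margin
hard = SVC(kernel="linear", C=1e6).fit(X, y)

print("soft-margin support vectors:", len(soft.support_))
print("hard-margin support vectors:", len(hard.support_))
```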

Optimization

To optimize the loss function, we first need to eliminate the $\max$ function. We can use the following trick, called a slack variable,

$$L = \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \frac{y_i(w^T x_i + b)}{\epsilon ||w||_2} \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \epsilon \leq \epsilon_0$$
tip

We let

$$\max\left(0, 1 - \frac{y_i(w^T x_i + b)}{\epsilon ||w||_2}\right) = \xi_i$$

Thus,

$$\xi_i \geq 0$$

and

$$\xi_i \geq 1 - \frac{y_i(w^T x_i + b)}{\epsilon ||w||_2}$$

With this trick, we can convert a $\max$ or $\min$ function into a free variable with two extra constraints.
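A tiny NumPy check of this conversion, using made-up margin values: the $\xi_i$ defined this way are non-negative and satisfy the inequality constraint.

```python
import numpy as np

# made-up values of y_i (w^T x_i + b) / (eps * ||w||_2)
m = np.array([1.8, 0.4, -0.7])

xi = np.maximum(0.0, 1.0 - m)    # the slack variables
print(xi)                        # [0.  0.6 1.7]
print(np.all(xi >= 0))           # True
print(np.all(xi >= 1.0 - m))     # True
```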

info

This form in my note looks very different from the form you may see in textbooks, which is

$$L = \frac{1}{2} ||w||_2^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

However, they are fundamentally identical.

This is our form,

$$L = \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \frac{y_i(w^T x_i + b)}{\epsilon ||w||_2} \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \epsilon \leq \epsilon_0$$

If we enforce an extra restriction $\epsilon ||w||_2 = 1$, then

$$L = \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad ||w||_2 \geq \frac{1}{\epsilon_0}$$

We can apply a Lagrange multiplier to the last constraint, then

$$L = \sum_{i=1}^{n} \xi_i + \lambda \left(||w||_2 - \frac{1}{\epsilon_0}\right)$$

Again, the constant term $\lambda \frac{1}{\epsilon_0}$ doesn't affect the minimizer, so we drop it. Eventually,

$$L = \lambda ||w||_2 + \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

This is identical to the traditional form, except that the coefficient is expressed in a different way.

You will later see that, if we take $\epsilon_0 = \frac{1}{||w||_2}$, the optimization step is exactly identical in both forms.

Rewriting our form with the constraints in the standard $\leq 0$ form,

$$L = \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad 1 - \xi_i - \frac{y_i(w^T x_i + b)}{\epsilon ||w||_2} \leq 0, \quad -\xi_i \leq 0, \quad \epsilon - \epsilon_0 \leq 0$$

Then apply Lagrange multipliers,

$$L = \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \alpha_i \left(1 - \xi_i - \frac{y_i(w^T x_i + b)}{\epsilon ||w||_2}\right) + \sum_{i=1}^{n} \beta_i (- \xi_i) + \gamma (\epsilon - \epsilon_0)$$

$$\text{s.t.} \quad \alpha_i \geq 0, \quad \beta_i \geq 0, \quad \gamma \geq 0$$

Before using gradient descent, we can simplify a bit, because

$$\frac{\partial L}{\partial \xi_i} = 1 - \alpha_i - \beta_i = 0$$

So,

$$L = \sum_{i=1}^{n} \alpha_i \left(1 - \frac{y_i(w^T x_i + b)}{\epsilon ||w||_2}\right) + \gamma (\epsilon - \epsilon_0)$$

We can further eliminate $\gamma$ by taking $\epsilon = \epsilon_0$,

$$L = \sum_{i=1}^{n} \alpha_i \left(1 - \frac{y_i(w^T x_i + b)}{\epsilon_0 ||w||_2}\right)$$

Typically, we take

$$\epsilon_0 = \frac{1}{||w||_2}$$

So,

$$L = \sum_{i=1}^{n} \alpha_i \left(1 - y_i(w^T x_i + b)\right)$$

Eventually,

$$\frac{\partial L}{\partial \alpha_i} = 1 - y_i(w^T x_i + b)$$

$$\frac{\partial L}{\partial w} = -\sum_{i=1}^{n} \alpha_i y_i x_i$$

$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{n} \alpha_i y_i$$

$$\alpha_i \geq 0$$

The final inequality can be enforced using a penalty method or, more simply, by forcing $\alpha_i$ to zero if it becomes negative after an update.
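As a rough sketch of these update rules and of the projection that clamps negative $\alpha_i$ back to zero, here is a toy NumPy loop on a randomly generated, linearly separable dataset. The learning rate and iteration count are arbitrary, and this simplified objective is not guaranteed to behave well in general; the point is only to show the shape of the updates and the non-negativity projection.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                      # toy features
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)        # toy labels in {-1, 1}

w = np.zeros(2)
b = 0.0
alpha = np.zeros(len(X))
lr = 0.01

for _ in range(200):
    margins = y * (X @ w + b)
    # ascent on the multipliers: dL/dalpha_i = 1 - y_i (w^T x_i + b)
    alpha += lr * (1.0 - margins)
    # projection: force alpha_i back to zero if it went negative
    alpha = np.maximum(alpha, 0.0)
    # descent on w and b: dL/dw = -sum alpha_i y_i x_i, dL/db = -sum alpha_i y_i
    w += lr * (alpha * y) @ X
    b += lr * np.sum(alpha * y)

print("train accuracy:", np.mean(np.sign(X @ w + b) == y))
```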

A faster method is the SMO algorithm, which is a more efficient way to optimize the SVM loss function. Its basic idea is to update the $\alpha_i$ in pairs, choosing the pair that violates the KKT conditions the most. We won't go through it here.

Non-Linear SVM

A plain SVM can only separate data with a linear decision boundary. To handle data that is not linearly separable, we can use the kernel trick.

By mapping the data into a sufficiently high-dimensional space, it can become linearly separable. So, we use a function $\phi(x)$ to map the data to a higher-dimensional space; the hyperplane then becomes

$$\phi(w)^T \phi(x) + b = 0$$

The other part of the SVM algorithm remains the same.

However, because the algorithm only ever needs inner products of the mapped data, we can use the kernel trick to avoid the explicit calculation of $\phi(x)$. That is, we define a new inner product, $K(x, x') = \phi(x)^T \phi(x')$. This is called a kernel function.

A kernel function computes an inner product in a higher-dimensional space, and thus it must satisfy the properties of an inner product.

  • Symmetric: $K(x, x') = K(x', x)$
  • Positive semi-definite: $K(x, x) \geq 0$
  • Linear in the second argument: $K(x, c x') = c K(x, x')$
  • Conjugate: $K(x, c x') = K(c^{\dagger} x, x')$

Then, with everything else remaining the same, whenever we calculate an inner product, we use the non-linear kernel function instead of the original inner product.
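If scikit-learn is available, this amounts to passing a different kernel to the same training routine. A minimal sketch on a toy dataset that is not linearly separable (`make_moons`), comparing a linear kernel with a Gaussian (RBF) kernel:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)   # Gaussian kernel

print("linear kernel accuracy:", linear.score(X, y))
print("rbf kernel accuracy:   ", rbf.score(X, y))
```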

info

Common kernel functions include,

  • Plain (linear) kernel: $K(x, x') = x^T x'$
  • Polynomial kernel: $K(x, x') = (x^T x' + c)^d$
  • Gaussian kernel: $K(x, x') = \exp\left(-\frac{||x - x'||^2}{2\sigma^2}\right)$
  • Sigmoid kernel: $K(x, x') = \tanh(\alpha x^T x' + c)$
  • Laplace kernel: $K(x, x') = \exp\left(-\frac{||x - x'||}{\sigma}\right)$
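As a sketch, here are two of these kernels written directly in NumPy, together with a quick check that the Gaussian kernel's Gram matrix on a small random sample is symmetric and numerically positive semi-definite:

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=3):
    # K(x, x') = (x^T x' + c)^d
    return (x @ z + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

print(polynomial_kernel(X[0], X[1]))   # a single polynomial kernel evaluation

# Gram matrix for the Gaussian kernel
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))                       # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # numerically positive semi-definite
```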