
LDA

Different from PCA, LDA (Linear Discriminant Analysis) tries to keep data points with the same label clustered together in the low-dimensional space. LDA assumes that the classes are distinguished by their mean values and that every class has the same covariance. Thus, LDA performs better when the classes in the original data are well separated by their means.

Assume we have,

$$\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$

$y_i \in \mathbb{Y}$, where $\mathbb{Y}$ is the set of all possible labels.

We calculate the mean and variance for each label $j \in \mathbb{Y}$,

$$\mu_j = \frac{1}{n_j} \sum_{i=1}^{n} x_i \, \mathbb{I}(y_i = j)$$

$$\sigma_j^2 = \frac{1}{n_j} \sum_{i=1}^{n} (x_i - \mu_j)(x_i - \mu_j)^T \, \mathbb{I}(y_i = j)$$

Please note that $\mu_j$ is the mean vector, $\sigma_j^2$ is the covariance matrix, and $n_j = \sum_{i=1}^{n} \mathbb{I}(y_i = j)$ is the number of samples with label $j$.
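
To make these definitions concrete, here is a minimal NumPy sketch (the toy data `X`, `y` and all variable names are made up for illustration) that computes $\mu_j$ and $\sigma_j^2$ for each label:

```python
import numpy as np

# Toy labeled data: n samples of dimension d, with labels in {0, 1, 2}.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))        # shape (n, d), one sample per row
y = rng.integers(0, 3, size=150)     # labels y_i

mu = {}       # per-class mean vectors  mu_j
sigma2 = {}   # per-class covariance matrices  sigma_j^2
for j in np.unique(y):
    Xj = X[y == j]                   # the n_j samples with label j
    mu[j] = Xj.mean(axis=0)
    diff = Xj - mu[j]
    sigma2[j] = diff.T @ diff / len(Xj)   # (1/n_j) * sum (x_i - mu_j)(x_i - mu_j)^T
```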

We also use the following formula to measure the within-class variance,

$$\Sigma_w^2 = \frac{1}{n} \sum_{j=1}^{|\mathbb{Y}|} n_j \sigma_j^2$$

And we use the following formula to measure the between-class variance,

$$\Sigma_b^2 = \frac{1}{n} \sum_{j=1}^{|\mathbb{Y}|} n_j (\mu_j - \overline{x}) (\mu_j - \overline{x})^T$$

Where $\overline{x}$ is the mean of all the vectors.
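
Building on the same kind of toy data, $\Sigma_w^2$ and $\Sigma_b^2$ can be accumulated directly from the per-class statistics; again, this is only an illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = rng.integers(0, 3, size=150)
n, d = X.shape
x_bar = X.mean(axis=0)               # mean of all vectors

Sigma_w = np.zeros((d, d))           # within-class variance
Sigma_b = np.zeros((d, d))           # between-class variance
for j in np.unique(y):
    Xj = X[y == j]
    n_j = len(Xj)
    mu_j = Xj.mean(axis=0)
    diff = Xj - mu_j
    Sigma_w += (n_j / n) * (diff.T @ diff / n_j)                  # (1/n) * n_j * sigma_j^2
    Sigma_b += (n_j / n) * np.outer(mu_j - x_bar, mu_j - x_bar)   # (1/n) * n_j * (mu_j - x̄)(mu_j - x̄)^T
```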

The variance of the whole dataset is,

$$\Sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x}) (x_i - \overline{x})^T$$

We can prove that,

$$\Sigma^2 = \Sigma_w^2 + \Sigma_b^2$$

We use the following property,

$$\sigma^2(x) = \mathbb{E}(xx^T) - \mathbb{E}(x)\mathbb{E}(x)^T$$
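
This identity is easy to confirm numerically on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                    # rows are samples x_i
x_bar = X.mean(axis=0)

cov_direct = (X - x_bar).T @ (X - x_bar) / len(X)           # (1/n) sum (x_i - x̄)(x_i - x̄)^T
cov_identity = X.T @ X / len(X) - np.outer(x_bar, x_bar)    # E(xx^T) - E(x)E(x)^T
assert np.allclose(cov_direct, cov_identity)
```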

Thus,

$$
\begin{aligned}
\Sigma_w^2 + \Sigma_b^2 &= \frac{1}{n} \sum_{i=1}^{n} \left( (x_i - \mu_j)(x_i - \mu_j)^T + (\mu_j - \overline{x})(\mu_j - \overline{x})^T \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( x_i x_i^T + \overline{x}\,\overline{x}^T - \mu_j x_i^T - x_i \mu_j^T - \mu_j \overline{x}^T - \overline{x}\,\mu_j^T + 2\mu_j\mu_j^T \right) \\
&= \mathbb{E}_i(x_i x_i^T) + \overline{x}\,\overline{x}^T - \mathbb{E}_i(x_i \mu_j^T) - \mathbb{E}_i(\mu_j x_i^T) - \mathbb{E}_i(\mu_j \overline{x}^T) - \mathbb{E}_i(\overline{x}\,\mu_j^T) + 2\,\mathbb{E}_i(\mu_j\mu_j^T)
\end{aligned}
$$

Here $j$ is shorthand for the label $y_i$ of sample $x_i$, and $\mathbb{E}_i$ denotes the average over all samples $i$.

Simplify term by term:

$$\frac{1}{n} \sum_{i=1}^n x_i x_i^T = \mathbb{E}(x x^T)$$

The terms $-x_i \mu_j^T$ and $-\mu_j x_i^T$ combine to:

$$-\frac{1}{n} \sum_{j} n_j (\mu_j \mu_j^T + \mu_j \mu_j^T) = -\frac{2}{n} \sum_{j} n_j \mu_j \mu_j^T$$

This is valid because, for each class $j$, $\mu_j$ is a constant, so summing over the set $\mathbb{G}_j$ of indices belonging to class $j$ gives,

$$\sum_{i\in\mathbb{G}_j} -x_i \mu_j^T = -n_j \mu_j \mu_j^T$$

The same goes for $-\mu_j x_i^T$.

The terms $-\mu_j \overline{x}^T$ and $-\overline{x}\,\mu_j^T$ combine, using the identity $\sum_{j} n_j \mu_j = n \overline{x}$, to:

$$-\frac{1}{n} \sum_{j} n_j (\mu_j \overline{x}^T + \overline{x}\,\mu_j^T) = -2\, \overline{x}\,\overline{x}^T$$

The remaining terms $2\mu_j \mu_j^T$ and $\overline{x}\,\overline{x}^T$ simplify to:

$$\frac{1}{n} \left( 2 \sum_{j} n_j \mu_j \mu_j^T + n\, \overline{x}\,\overline{x}^T \right) = 2\, \mathbb{E}(\mu_j \mu_j^T) + \overline{x}\,\overline{x}^T$$

Putting it all together,

$$
\begin{aligned}
\Sigma_w^2 + \Sigma_b^2 &= \mathbb{E}(x x^T) - 2\,\mathbb{E}(\mu_j \mu_j^T) + 2\,\mathbb{E}(\mu_j \mu_j^T) - 2\,\overline{x}\,\overline{x}^T + \overline{x}\,\overline{x}^T \\
&= \mathbb{E}(x x^T) - \overline{x}\,\overline{x}^T \\
&= \Sigma^2
\end{aligned}
$$
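
Before moving on, a quick numerical sanity check of $\Sigma^2 = \Sigma_w^2 + \Sigma_b^2$ on toy data (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = rng.integers(0, 3, size=150)
n, d = X.shape
x_bar = X.mean(axis=0)

Sigma = (X - x_bar).T @ (X - x_bar) / n          # total variance Sigma^2
Sigma_w = np.zeros((d, d))
Sigma_b = np.zeros((d, d))
for j in np.unique(y):
    Xj = X[y == j]
    n_j = len(Xj)
    mu_j = Xj.mean(axis=0)
    Sigma_w += (Xj - mu_j).T @ (Xj - mu_j) / n
    Sigma_b += (n_j / n) * np.outer(mu_j - x_bar, mu_j - x_bar)

assert np.allclose(Sigma, Sigma_w + Sigma_b)     # Sigma^2 == Sigma_w^2 + Sigma_b^2
```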

Please note that all the calculations above presume every sample is already labeled, so there is no free variable yet. Back to the main track: we want to project the original data onto a low-dimensional space, that is, to find a non-square projection matrix $W^T$ that maximizes the between-class variance after projection while minimizing the within-class variance.

For convenience, we write the projection as $W^T$, so that every column vector of $W$ is a new basis vector.

That is,

$$x' = W^T x$$

$$\mu_j' = W^T \mu_j$$

$$\sigma_j'^2 = W^T \sigma_j^2 W$$

$$\Sigma_b'^2 = W^T \Sigma_b^2 W$$

$$\Sigma_w'^2 = W^T \Sigma_w^2 W$$
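
As a sanity check on these projection rules, the sketch below uses a made-up orthonormal $W$ and verifies that the covariance of the projected data equals $W^T \Sigma^2 W$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 2
X = rng.normal(size=(150, d))                      # rows are samples
x_bar = X.mean(axis=0)
Sigma = (X - x_bar).T @ (X - x_bar) / len(X)       # total variance of the original data

W = np.linalg.qr(rng.normal(size=(d, k)))[0]       # a d x k matrix with orthonormal columns
X_proj = X @ W                                     # x' = W^T x, applied row by row
Sigma_proj = W.T @ Sigma @ W                       # Sigma' = W^T Sigma W, a k x k matrix

# The covariance computed directly from the projected samples matches W^T Sigma W.
xp_bar = X_proj.mean(axis=0)
assert np.allclose((X_proj - xp_bar).T @ (X_proj - xp_bar) / len(X), Sigma_proj)
```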

We would like to maximize $\Sigma_b'^2$ and minimize $\Sigma_w'^2$. However, they are matrices, not scalars, so we cannot compare them directly. LDA chooses the following target function, which we will minimize,

$$J(W^T) = \left\| {\Sigma_b'^2}^{-1}\Sigma_w'^2 \right\|_F$$
tip

$\|M\|_F$ is the Frobenius norm, defined as,

$$\|M\|_F = \sqrt{\sum_{i,j} M_{i,j}^2}$$

Which, if $M$ is symmetric (or more generally normal) and we define $\lambda_i$ as the eigenvalues of $M$, also equals,

$$\|M\|_F = \sqrt{\sum_{i} \lambda_i^2}$$
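
For a symmetric matrix, this relation is easy to confirm numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
M = (A + A.T) / 2                                   # a symmetric (hence normal) matrix

fro = np.linalg.norm(M, "fro")                      # sqrt of the sum of squared entries
eigvals = np.linalg.eigvalsh(M)                     # real eigenvalues of the symmetric M
assert np.allclose(fro, np.sqrt(np.sum(eigvals ** 2)))
```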

Another point to consider is that we must constrain every column vector (each new basis vector) to have unit norm. Otherwise, the objective can be made arbitrarily small or large simply by rescaling the basis vectors.

That is to say,

$$W^T W = I$$

Suppose $w$ is a unit eigenvector of ${\Sigma_b'^2}^{-1}\Sigma_w'^2$ with eigenvalue $\lambda$; then,

$$
\begin{aligned}
w^T {\Sigma_b'^2}^{-1}\Sigma_w'^2\, w &= \lambda \\
&= w^T W^T \left({\Sigma_b^2}^{-1}\Sigma_w^2\right) W w \\
&= (W w)^T \left({\Sigma_b^2}^{-1}\Sigma_w^2\right) (W w)
\end{aligned}
$$

That is to say, $W w$ is an eigenvector of ${\Sigma_b^2}^{-1}\Sigma_w^2$ with the same eigenvalue $\lambda$.

So,

$$J(W^T) = \left\|{\Sigma_b'^2}^{-1}\Sigma_w'^2\right\|_F = \sqrt{\sum_{i \in \mathbb{L}} \lambda_i^2}$$

Where $\mathbb{L}$ is a set of eigenvalues chosen from ${\Sigma_b^2}^{-1}\Sigma_w^2$, and $|\mathbb{L}|$ equals the rank of $W$.

So obviously, if we want to minimize $J(W^T)$, we simply use the smallest eigenvalues of ${\Sigma_b^2}^{-1}\Sigma_w^2$.

And because $w$ is a unit vector, we can always take $w_i = e_i$. Thus, since each $W w_i$ is an eigenvector of ${\Sigma_b^2}^{-1}\Sigma_w^2$, each column of $W$ (that is, each row of $W^T$) is an eigenvector of ${\Sigma_b^2}^{-1}\Sigma_w^2$.

note

That was a long proof. But the result is simple: find the $k$ smallest eigenvalues of ${\Sigma_b^2}^{-1}\Sigma_w^2$, and stack the corresponding eigenvectors as the rows of $W^T$ (the columns of $W$). After performing,

$$X' = W^T X$$

We get the LDA result.
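
A minimal end-to-end sketch of this recipe (toy data and the helper name `lda_project` are hypothetical, not a reference implementation): build $\Sigma_w^2$ and $\Sigma_b^2$, take the eigenvectors of ${\Sigma_b^2}^{-1}\Sigma_w^2$ with the $k$ smallest eigenvalues as the rows of $W^T$, and project. A tiny ridge is added to $\Sigma_b^2$ before inverting, since it is typically rank-deficient.

```python
import numpy as np

def lda_project(X, y, k, ridge=1e-8):
    """Sketch of the recipe above: the rows of W^T are the eigenvectors of
    inv(Sigma_b) @ Sigma_w with the k smallest eigenvalues.
    X has one sample per row; a small ridge keeps Sigma_b invertible."""
    n, d = X.shape
    x_bar = X.mean(axis=0)
    Sigma_w = np.zeros((d, d))
    Sigma_b = np.zeros((d, d))
    for j in np.unique(y):
        Xj = X[y == j]
        n_j = len(Xj)
        mu_j = Xj.mean(axis=0)
        Sigma_w += (Xj - mu_j).T @ (Xj - mu_j) / n
        Sigma_b += (n_j / n) * np.outer(mu_j - x_bar, mu_j - x_bar)

    # Sigma_b has rank at most (#classes - 1), so regularize before inverting.
    M = np.linalg.inv(Sigma_b + ridge * np.eye(d)) @ Sigma_w
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(eigvals.real)[:k]      # indices of the k smallest eigenvalues
    W = eigvecs[:, order].real                # columns of W = chosen unit-norm eigenvectors
    return X @ W                              # row-wise version of X' = W^T X

# Hypothetical usage on toy data with three classes separated by their means:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(50, 4)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 50)
X_low = lda_project(X, y, k=2)
print(X_low.shape)   # (150, 2)
```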