Unlike PCA, LDA (Linear Discriminant Analysis) tries to keep data points with the same label clustered together in the low-dimensional space. LDA assumes that the classes are distinguished by their mean values and that all classes share the same variance. Thus, LDA performs better when the original classes are well separated by their means.
Assume we have,
$$D=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}$$
$y_i \in Y$, where $Y$ is the set of all possible labels.
We calculate the mean and variance for each label $j \in Y$.
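As a rough sketch, the per-class means and the within-/between-class scatter matrices (written here as `Sw` and `Sb` for $\Sigma_w^2$ and $\Sigma_b^2$) could be computed like this; the exact normalization (plain sums over samples rather than averages) is a choice, not the only possibility:

```python
import numpy as np

def class_statistics(X, y):
    """Per-class means plus within-class (Sw) and between-class (Sb) scatter.

    X: (n, d) data matrix, y: (n,) integer labels.
    Scatters are plain sums over samples; other normalizations are possible.
    """
    labels = np.unique(y)
    mu = X.mean(axis=0)                     # overall mean
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for j in labels:
        Xj = X[y == j]
        mu_j = Xj.mean(axis=0)
        diff = Xj - mu_j
        Sw += diff.T @ diff                 # within-class scatter of class j
        nj = Xj.shape[0]
        m = (mu_j - mu).reshape(-1, 1)
        Sb += nj * (m @ m.T)                # between-class scatter of class j
    return mu, Sw, Sb
```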
Note that all of the calculations above presume every sample is already labeled, so these quantities are fixed constants rather than variables. Back to the main track: we want to project the original data onto a low-dimensional space, that is, find a non-square projection matrix $W^T$ that maximizes the between-class variance after projection and minimizes the within-class variance.
For convenience, we use $W^T$ so that every column vector of $W^T$ is a new basis vector.
We would like to maximize the projected between-class scatter $\Sigma_b'^2$ and minimize the projected within-class scatter $\Sigma_w'^2$. However, they are matrices, not scalars, so we cannot compare them directly. LDA chooses the following target function,
$$J(W^T)=\left\|\left(\Sigma_b'^2\right)^{-1}\Sigma_w'^2\right\|_F$$
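A minimal sketch of evaluating this target function for a given $W^T$, assuming the `Sw` and `Sb` matrices from the sketch above:

```python
import numpy as np

def lda_objective(Wt, Sw, Sb):
    """J(W^T) = || (Sb')^{-1} Sw' ||_F for the projected scatter matrices.

    Wt: (d, k) matrix whose columns are the new basis vectors.
    Sw, Sb: (d, d) within- and between-class scatter matrices.
    """
    Sw_p = Wt.T @ Sw @ Wt                   # projected within-class scatter
    Sb_p = Wt.T @ Sb @ Wt                   # projected between-class scatter
    M = np.linalg.solve(Sb_p, Sw_p)         # (Sb')^{-1} Sw' without explicit inverse
    return np.linalg.norm(M, "fro")
```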
tip
Here $\|M\|_F$ is the Frobenius norm, defined as,
$$\|M\|_F=\sqrt{\sum_{i,j}M_{i,j}^2}$$
which, if we define $\lambda_i$ as the eigenvalues of $M$ (this holds when $M$ is symmetric, or more generally normal), also equals,
$$\|M\|_F=\sqrt{\sum_{i}\lambda_i^2}$$
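A quick numerical sanity check of this identity for a symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
M = (A + A.T) / 2                                        # make M symmetric

fro_entries = np.sqrt((M ** 2).sum())                    # sqrt of sum of squared entries
fro_eigs = np.sqrt((np.linalg.eigvalsh(M) ** 2).sum())   # sqrt of sum of squared eigenvalues

print(np.isclose(fro_entries, fro_eigs))                 # True
print(np.isclose(fro_entries, np.linalg.norm(M, "fro"))) # True
```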
Another point to consider is that we must constrain every column vector (the new basis vectors) to have unit norm. Otherwise, the result can be made arbitrarily small or large simply by rescaling a basis vector.
That is to say, at the optimum, $W^T w$ is an eigenvector of $(\Sigma_b^2)^{-1}\Sigma_w^2$ for each unit vector $w$ in the projected space.
So,
$$J(W^T)=\left\|\left(\Sigma_b'^2\right)^{-1}\Sigma_w'^2\right\|_F=\sqrt{\sum_{i\in L}\lambda_i^2}$$
where $L$ is the set of eigenvalues of $(\Sigma_b^2)^{-1}\Sigma_w^2$ selected by the projection, and $|L|$ equals the rank of $W$.
So, to minimize $J(W^T)$, we simply pick the smallest eigenvalues of $(\Sigma_b^2)^{-1}\Sigma_w^2$.
And because $w$ is a unit vector, we can always take $w_i = e_i$ (the standard basis vectors). Then, since $W^T w$ is an eigenvector of $(\Sigma_b^2)^{-1}\Sigma_w^2$ and $W^T e_i$ is simply the $i$-th column of $W^T$, each column vector of $W^T$ is an eigenvector of $(\Sigma_b^2)^{-1}\Sigma_w^2$.
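Putting everything together, a minimal sketch of the whole procedure, again assuming the `Sw` and `Sb` from before; handling a possibly singular $\Sigma_b^2$ and the choice of eigensolver are left as implementation details:

```python
import numpy as np

def lda_projection(Sw, Sb, k):
    """Build W^T from the k smallest eigenvalues of (Sb)^{-1} Sw.

    Returns Wt of shape (d, k); its columns are the chosen unit-norm eigenvectors.
    Assumes Sb is invertible; in practice it may need regularization.
    """
    M = np.linalg.solve(Sb, Sw)             # (Sb)^{-1} Sw without explicit inverse
    eigvals, eigvecs = np.linalg.eig(M)     # M is generally not symmetric
    order = np.argsort(eigvals.real)        # ascending: smallest eigenvalues first
    Wt = eigvecs[:, order[:k]].real
    Wt /= np.linalg.norm(Wt, axis=0)        # enforce unit-norm columns
    return Wt

# Usage: project the data onto the k-dimensional subspace
# (rows of X are samples): X_low = X @ Wt
```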
note
That was a long proof, but the result is simple: we find the $k$ smallest eigenvalues of $(\Sigma_b^2)^{-1}\Sigma_w^2$, and stack the corresponding eigenvectors as columns to get $W^T$. And after performing,