Introduction
Both PCA and LDA are linear dimensionality reduction techniques commonly used for data compression and data visualization, with the crucial difference that the latter also serves as a supervised learning algorithm that takes into account the classes to which the data belong. In this article, I review the mathematical details of PCA and LDA and explore how the two differ in the context of machine learning.
Principal Component Analysis (PCA)
In training a machine learning model after feature engineering, we often end up with feature vectors of very large dimension. While these vectors certainly contain a lot of useful information that our model can discover, they may also contain noise and unnecessary residual information irrelevant to the general trend displayed by the data.
PCA, being a linear dimensionality reduction technique, identifies a smaller set of “important” features, thereby aiming to reduce the complexity of the original feature set. In more mathematical language, this amounts to finding the coordinates, or vectors, that maximize the variance. These vectors are called the principal components.
Maximizing the variance
Why do we maximize the variance? The intuition behind variance maximization is quite clear once we visualize it in a two-dimensional plane. In Figure 1, we have three candidates for the first principal component. We are interested in the distribution of the data points when projected onto each of these lines. The best candidate (illustrated in red) is the best-fitting line and, as we can see, the projections onto it have a larger variance than those onto the other two.
In signal processing, we assume that the signal or data of interest has a relatively larger variance, whereas the noise has a smaller one.
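To make the intuition concrete, here is a minimal sketch (NumPy on synthetic, hypothetical data, not taken from the article’s figures) that projects a zero-mean 2D point cloud onto a few candidate unit directions and compares the variance of the resulting 1D projections; the direction closest to the cloud’s long axis yields the largest variance.

```python
# A minimal sketch of the variance-maximization intuition: project zero-mean 2D
# points onto a few candidate unit directions and compare the projection variances.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, roughly elliptical cloud of points (hypothetical data).
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.2], [1.2, 0.5]],
                            size=500)
X -= X.mean(axis=0)  # normalize to zero mean, as assumed in the text

def projected_variance(X, angle):
    """Variance of the data projected onto the unit vector at the given angle."""
    u = np.array([np.cos(angle), np.sin(angle)])
    return np.var(X @ u)

for deg in (0, 30, 60, 90):
    print(f"direction {deg:3d} deg -> projected variance "
          f"{projected_variance(X, np.radians(deg)):.3f}")
```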
Problem setting
Let’s formalize this. Suppose we are given the dataset $\mathcal{X} = \{x_1, \dots, x_n\}$ with $x_i \in \mathbb{R}^d$, where the $x_i$ are i.i.d. and normalized to have zero mean. We want to find the basis $\{u_1, \dots, u_k\}$ of a $k$-dimensional subspace, where $k < d$, such that the “reconstructed” or projected dataset still decently represents the original data.
Finding the principal components
We first focus on finding the first principal component, the unit vector $u_1$ (the unit-length constraint prevents scaling from affecting the magnitude of the variance). In Euclidean space, the projection of a vector $x_i$ onto a unit vector $u_1$ is given by their inner product (i.e. dot product), which we denote as $u_1^\top x_i$. Based on our formalization, we intend to find a unit vector $u_1$ that maximizes the variance of the projections of the $x_i$ onto it. If we denote this quantity by $V_1$, we have the following:

$$V_1 = \frac{1}{n} \sum_{i=1}^{n} \left(u_1^\top x_i\right)^2$$
Notice that the mean of $u_1^\top x_i$ is 0, since the mean of the dot product between $u_1$ and the $x_i$ is 0 due to the assumption that the $x_i$ are normalized to have zero mean:

$$\frac{1}{n} \sum_{i=1}^{n} u_1^\top x_i = u_1^\top \left(\frac{1}{n} \sum_{i=1}^{n} x_i\right) = u_1^\top \mathbf{0} = 0$$
Continuing with the computation,

$$V_1 = \frac{1}{n} \sum_{i=1}^{n} \left(u_1^\top x_i\right)^2 = \frac{1}{n} \sum_{i=1}^{n} u_1^\top x_i x_i^\top u_1 = u_1^\top \left(\frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top\right) u_1$$
The quantity within the parentheses is a familiar one: the covariance matrix. We will use $S$ to denote it, so that $V_1 = u_1^\top S u_1$.
We thus end up with the constrained optimization problem below:

$$\max_{u_1} \; u_1^\top S u_1 \quad \text{subject to} \quad \|u_1\|_2 = 1$$
Solving with the method of Lagrange multipliers, we form the Lagrangian

$$\mathcal{L}(u_1, \lambda_1) = u_1^\top S u_1 + \lambda_1 \left(1 - u_1^\top u_1\right)$$
Using the usual method (setting the partial derivatives of $\mathcal{L}$ with respect to $u_1$ and $\lambda_1$ to zero), we obtain $S u_1 = \lambda_1 u_1$, which shows that $(\lambda_1, u_1)$ is an eigenvalue-eigenvector pair of $S$. Subsequently, $V_1 = u_1^\top S u_1 = \lambda_1 u_1^\top u_1 = \lambda_1$. In other words, the variance obtained by setting the first principal component to $u_1$ is precisely the corresponding eigenvalue, so we choose $u_1$ to be the eigenvector of $S$ associated with the largest eigenvalue.
The other principal components can be found in a similar manner: the $m$-th principal component is the unit vector $u_m$ that maximizes the variance of the data once the contributions of the first $m-1$ principal components have been subtracted from every sample, i.e. of $\hat{x}_i := x_i - \sum_{j=1}^{m-1} u_j u_j^\top x_i$. If we iteratively continue this process until the deflated data is the zero matrix, we will have discovered that $S u_m = \lambda_m u_m$ for every $m$, where $\lambda_i \geq \lambda_j$ if $i < j$.
One important thing to keep in mind is that the $m$-th principal component is orthogonal to the previous $m-1$ principal components. Considering that the objective of PCA is to find features that best capture the dataset, it is strictly better to consider a vector that cannot be expressed as a linear combination of the previously found principal components.
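Putting the derivation together, below is a minimal sketch, assuming NumPy and a data matrix with samples stored in rows; the helper name pca_components and the synthetic data are hypothetical, introduced only for illustration. It computes the principal components as the eigenvectors of the covariance matrix, sorted by decreasing eigenvalue.

```python
# A minimal sketch of PCA via the eigendecomposition of the covariance matrix
# S = (1/n) * X^T X, for a data matrix X with one sample per row.
import numpy as np

def pca_components(X, k):
    """Return the top-k principal components (as columns) and their variances."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)               # enforce the zero-mean assumption
    S = (Xc.T @ Xc) / n                   # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh, since S is symmetric
    order = np.argsort(eigvals)[::-1]     # sort eigenvalues in decreasing order
    return eigvecs[:, order[:k]], eigvals[order[:k]]

# Usage on hypothetical data: the components are mutually orthogonal unit vectors.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
U, variances = pca_components(X, k=2)
print(variances)             # variances captured by the first two components
print(np.round(U.T @ U, 6))  # approximately the identity: orthonormal components
```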
Low rank approximations
That we find the eigenvalues in decreasing order should ring a bell 🔔, and the bell shall ring a second time when we realize that the principal components are mutually orthogonal. This is precisely the structure exposed by the singular value decomposition (SVD): every matrix $X \in \mathbb{R}^{d \times n}$ can be factorized as $X = U \Sigma V^\top$, where $U \in \mathbb{R}^{d \times d}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{d \times n}$ is the rectangular diagonal matrix with diagonal entries $\sigma_1 \geq \sigma_2 \geq \dots \geq 0$, the singular values.
Based on this formulation, applied to the data matrix $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$ whose columns are the zero-mean samples, we have $S = \frac{1}{n} X X^\top$. Let’s focus on $X X^\top$:

$$X X^\top = \left(U \Sigma V^\top\right)\left(U \Sigma V^\top\right)^\top = U \Sigma V^\top V \Sigma^\top U^\top$$
Since $V^\top V = I$ for the orthogonal matrix $V$, it follows that $X X^\top = U \Sigma \Sigma^\top U^\top$, where $\Sigma \Sigma^\top$ is diagonal with entries $\sigma_i^2$. Further, since this is exactly an eigendecomposition of $X X^\top$, we can conclude that the columns of $U$ are the eigenvectors of $X X^\top$ (and hence of $S$, i.e. the principal components), and the eigenvalues of $S$ are $\lambda_i = \sigma_i^2 / n$.
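As a quick numerical sanity check, here is a minimal sketch (NumPy assumed, random zero-mean data with columns as samples, as in the derivation) verifying that the left singular vectors of $X$ diagonalize $S$ and that the eigenvalues of $S$ equal $\sigma_i^2 / n$.

```python
# A minimal sketch linking the SVD of the data matrix to the eigendecomposition
# of the covariance matrix S = (1/n) * X X^T.
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 300
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)                    # zero mean per feature

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)  # X = U @ diag(sigma) @ Vt
S = (X @ X.T) / n                                     # covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]                 # eigenvalues, decreasing

print(np.allclose(eigvals, sigma**2 / n))               # True: lambda_i = sigma_i^2 / n
print(np.allclose(U.T @ S @ U, np.diag(sigma**2 / n)))  # True: columns of U diagonalize S
```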
Linear Discriminant Analysis (LDA)
PCA relies on vector projections that maximize the variance. A problem arises if we have labels associated with each data point: applying PCA to a dataset with two classes, we may no longer be able to distinguish between the classes after projection, since the labels are ignored and the samples can end up shuffled together in the process. LDA avoids this issue by taking the class of each sample into account.
Maximize A and minimize B
It is not difficult to see that we cannot naively carry PCA over to a setting with class labels. To be more precise, let’s consider a new dataset $\mathcal{X} = \{(x_1, y_1), \dots, (x_n, y_n)\}$ with $y_i \in \{0, 1\}$ representing the two classes of samples. Our objective is to find a vector onto which we can project the data without confounding the labels, in the optimal way. To characterize what we mean by the “optimal” way, refer to Figure 2.
The plot on the right is objectively better than the one on the left. If we spot the differences between the two plots, we shall have a better idea of how to formulate LDA.
There are two major differences: one concerns the samples that belong to the same class, and the other concerns the relationship between the two classes. The first is that, after projection, the center of the samples belonging to one class has to be far from the center of the other class. This avoids the issue of the two classes overlapping or being cluttered post-projection. The other property is that the variance within each class after projection should be as small as possible. This keeps each projected class tightly clustered, which makes it easier to find a line that separates the two classes well.
To sum up, LDA aims to find a vector $w$ that satisfies the two requirements below (a short numeric check of both follows the list):
Requirement 1: Maximizes inter-class distances
Requirement 2: Minimizes intra-class variances
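Here is the promised numeric check: a minimal sketch (NumPy, hypothetical two-class Gaussian data) that scores a few candidate directions on both requirements, namely the squared gap between the projected class centers and the summed within-class variances of the projections.

```python
# A minimal sketch scoring candidate projection directions against the two LDA
# requirements: large gap between projected centers, small within-class variances.
import numpy as np

rng = np.random.default_rng(3)
X0 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.3], size=(100, 2))  # class 0
X1 = rng.normal(loc=[3.0, 0.5], scale=[1.0, 0.3], size=(100, 2))  # class 1

def score(w):
    w = w / np.linalg.norm(w)                   # unit vector
    p0, p1 = X0 @ w, X1 @ w                     # 1D projections of each class
    center_gap = (p0.mean() - p1.mean()) ** 2   # requirement 1 (larger is better)
    within_var = p0.var() + p1.var()            # requirement 2 (smaller is better)
    return center_gap, within_var

for w in (np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])):
    gap, var = score(w)
    print(f"w = {w}: center gap = {gap:.2f}, within-class variance = {var:.2f}")
```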
Mathematical formulation of LDA
Let’s first define the center (more precisely, the centroid) of each of the two classes:

$$\mu_0 = \frac{1}{n_0} \sum_{i \,:\, y_i = 0} x_i, \qquad \mu_1 = \frac{1}{n_1} \sum_{i \,:\, y_i = 1} x_i,$$

where $n_0$ and $n_1$ denote the number of samples in each class.
As in the section on PCA, we want to find a unit vector $w$ onto which the data will be projected.
Requirement 1 mandates that the two projected centers be far apart. Mathematically, we can express this as a constrained optimization problem:

$$\max_{w} \; \left(w^\top \mu_0 - w^\top \mu_1\right)^2 \quad \text{subject to} \quad \|w\|_2 = 1$$
Requirement 2 mandates that the variance within each class be as small as possible. Mathematically, we have the following:

$$\min_{w} \; \sum_{i \,:\, y_i = 0} \left(w^\top x_i - w^\top \mu_0\right)^2 + \sum_{i \,:\, y_i = 1} \left(w^\top x_i - w^\top \mu_1\right)^2$$
A convenient way of maximizing one quantity while minimizing the other is to form their ratio and maximize the whole expression. We can write our objective function as $J(w)$ and maximize this quantity:

$$J(w) = \frac{\left(w^\top \mu_0 - w^\top \mu_1\right)^2}{\sum_{i \,:\, y_i = 0} \left(w^\top x_i - w^\top \mu_0\right)^2 + \sum_{i \,:\, y_i = 1} \left(w^\top x_i - w^\top \mu_1\right)^2}$$
Since

$$\left(w^\top \mu_0 - w^\top \mu_1\right)^2 = w^\top (\mu_0 - \mu_1)(\mu_0 - \mu_1)^\top w$$

and

$$\sum_{i \,:\, y_i = c} \left(w^\top x_i - w^\top \mu_c\right)^2 = w^\top \left(\sum_{i \,:\, y_i = c} (x_i - \mu_c)(x_i - \mu_c)^\top\right) w \quad \text{for } c \in \{0, 1\},$$

we can simplify by defining the following quantities: the between-class scatter matrix $S_B := (\mu_0 - \mu_1)(\mu_0 - \mu_1)^\top$ and the within-class scatter matrix $S_W := \sum_{c \in \{0,1\}} \sum_{i \,:\, y_i = c} (x_i - \mu_c)(x_i - \mu_c)^\top$, so that

$$J(w) = \frac{w^\top S_B w}{w^\top S_W w}.$$
If we differentiate $J(w)$ with respect to $w$ and set the derivative to zero to find the $w$ that maximizes $J$,

$$\frac{\partial J}{\partial w} = \frac{2\, S_B w \left(w^\top S_W w\right) - 2\, S_W w \left(w^\top S_B w\right)}{\left(w^\top S_W w\right)^2} = 0.$$
Consequently,

$$S_B w \left(w^\top S_W w\right) = S_W w \left(w^\top S_B w\right), \qquad \text{i.e.} \qquad S_B w = \frac{w^\top S_B w}{w^\top S_W w}\, S_W w.$$
Since both $w^\top S_B w$ and $w^\top S_W w$ are scalars, we can define their ratio as $\lambda$, which turns the problem into another familiar eigenvalue-eigenvector problem once we multiply both sides of the equation by $S_W^{-1}$: $S_W^{-1} S_B w = \lambda w$. In other words, $w$ is an eigenvector of $S_W^{-1} S_B$ with $\lambda$ as its eigenvalue; since $\lambda = J(w)$, the optimal direction is the eigenvector associated with the largest eigenvalue.
As a quick note, can we guarantee that the inverse of $S_W$ exists?
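To close the discussion, below is a minimal sketch, assuming NumPy and two hypothetical classes stored as row-major arrays X0 and X1, of the two-class solution derived above. It uses the pseudo-inverse so the sketch still runs when $S_W$ happens to be singular (for instance, when the dimension exceeds the number of samples).

```python
# A minimal sketch of two-class LDA: find the direction w maximizing
# J(w) = (w^T S_B w) / (w^T S_W w).
import numpy as np

def lda_direction(X0, X1):
    """Return the unit vector along the LDA projection direction."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)  # within-class scatter
    # For two classes, S_B = (mu0 - mu1)(mu0 - mu1)^T has rank one, so the top
    # eigenvector of S_W^{-1} S_B is proportional to S_W^{-1} (mu0 - mu1).
    w = np.linalg.pinv(S_W) @ (mu0 - mu1)
    return w / np.linalg.norm(w)

# Usage on hypothetical data.
rng = np.random.default_rng(4)
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=150)
X1 = rng.multivariate_normal([2.0, 1.0], [[1.0, 0.8], [0.8, 1.0]], size=150)
w = lda_direction(X0, X1)
print("LDA direction:", np.round(w, 3))
print("projected center gap:", (X0 @ w).mean() - (X1 @ w).mean())
```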