
Understanding PCA and LDA

8/5/2023

Introduction

Both PCA and LDA are linear dimensionality reduction techniques commonly used for data compression and data visualization, with the crucial difference that the latter also serves as a supervised learning algorithm that takes into account the classes to which the data belong. In this article, I review the mathematical details of PCA and LDA and explore how the two differ in the context of machine learning.

Principal Component Analysis (PCA)

In training a machine learning model after feature engineering, we often end up with feature vectors of very large dimension. While these vectors certainly contain a lot of useful information that our model can discover, they may also contain noise and residual information irrelevant to the general trend displayed by the data.

PCA, being a linear dimensionality reduction technique, identifies a small set of “important” directions (linear combinations of the original features), thereby reducing the complexity of the original feature set. In more mathematical language, this amounts to finding the directions that maximize the variance of the projected data. These directions are called the principal components.
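As a quick illustration of what this looks like in practice, here is a minimal sketch using scikit-learn's `PCA` on synthetic data; the random dataset and the choice of two components are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples in 10 dimensions (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Keep the two directions of maximal variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # fraction of total variance captured by each component
```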

Maximizing the variance

Why do we maximize the variance? The intuition behind variance maximization becomes quite clear once we visualize it in a two-dimensional plane. In Figure 1, we have three candidates for the first principal component. We are interested in the distribution of the data points when projected onto each of these lines. The best candidate (illustrated in red) fits the data most closely and, as we can see, the projections onto it have a larger variance than those onto the other two.

In signal processing, we assume that the signal or data of interest has a relatively larger variance, whereas the noise has a smaller one.

Problem setting

Let’s formalize this. Suppose we are given the dataset $\mathbf X = \{\mathbf x_1, \mathbf x_2, \cdots, \mathbf x_n \}$ where the $\mathbf x_i \in \mathbb{R}^D$ are i.i.d. and are normalized to have zero mean. We want to find the basis of a $d$-dimensional subspace $P \subseteq \mathbb{R}^D$, where $d < D$, such that the “reconstructed” or projected dataset $\tilde {\mathbf X} = \{\tilde{\mathbf x}_1, \cdots, \tilde{\mathbf x}_n\}$ still decently represents the original data.
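To make the setup concrete, the sketch below centers a randomly generated dataset (purely an illustrative assumption) and reconstructs it from an arbitrary $d$-dimensional orthonormal basis; the rest of this section is about how to choose that basis well.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 100, 5, 2

# Rows of X are the samples x_i; center them so they have zero mean.
X = rng.normal(size=(n, D))
X = X - X.mean(axis=0)

# W holds an arbitrary orthonormal basis of a d-dimensional subspace of R^D.
W, _ = np.linalg.qr(rng.normal(size=(D, d)))

# Project onto the subspace, then map back into R^D.
X_proj = X @ W           # coordinates in the subspace, shape (n, d)
X_recon = X_proj @ W.T   # reconstructed samples, shape (n, D)

print(np.linalg.norm(X - X_recon))   # reconstruction error for this particular basis
```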

Finding the principal components

We first focus on finding the first principal component: a unit vector $\mathbf w$ (the unit-norm constraint prevents scaling from affecting the magnitude of the variance). In Euclidean space, the scalar projection of a vector $\mathbf v_1$ onto a unit vector $\mathbf v_2$ equals their inner product (i.e. dot product), which we denote as $\langle \mathbf v_1, \mathbf v_2 \rangle = \mathbf v_1^\top \mathbf v_2 = \mathbf v_2^\top \mathbf v_1$. Based on our formalization, we intend to find a unit vector $\mathbf w$ that maximizes the variance of the projections of $\mathbf x_1, \cdots, \mathbf x_n$ onto it. If we denote this quantity by $V_1(\mathbf x_1, \cdots, \mathbf x_n)$, we have the following:

$$V_1(\mathbf x_1, \cdots, \mathbf x_n) = \frac{1}{n}\sum_{i=1}^n (\mathbf x_i^\top \mathbf w)^2$$

Notice that the mean of $\{\mathbf x_i^\top \mathbf w\}_{i=1}^n$ is 0, because the $\mathbf x_i$ are normalized to have zero mean:

$$\frac{1}{n}\sum_{i=1}^n \mathbf x_i^\top \mathbf w = \left(\frac{1}{n}\sum_{i=1}^n \mathbf x_i^\top\right) \mathbf w = 0$$

Continuing with the computation,

$$
\begin{align*}
V_1(\mathbf x_1, \cdots, \mathbf x_n) &= \frac{1}{n}\sum_{i=1}^n\mathbf w^\top \mathbf x_i \mathbf x_i^\top \mathbf w \\
&= \mathbf w^\top \left(\frac{1}{n}\sum_{i=1}^n \mathbf x_i \mathbf x_i^\top \right) \mathbf w
\end{align*}
$$

The quantity within the parentheses is a familiar one: the covariance matrix. We will use $\mathbf S$ to denote it.

We thus end up with the constrained optimization problem below:

$$
\begin{cases}
\max\, V_1(\mathbf x_1, \cdots, \mathbf x_n) \\
\text{subject to } \lVert \mathbf w \rVert = 1
\end{cases}
$$

Solving with the method of Lagrange multipliers,

$$\mathcal L (\mathbf w, \lambda_1) = \mathbf w^\top \mathbf S \mathbf{w} + \lambda_1 (1 - \mathbf w^\top \mathbf w)$$

Setting the gradient with respect to $\mathbf w$ to zero, we obtain $\mathbf S \mathbf w = \lambda_1 \mathbf w$, which shows that $(\lambda_1, \mathbf w)$ is an eigenvalue-eigenvector pair of $\mathbf S$. Subsequently, $V_1(\mathbf x_1, \cdots, \mathbf x_n) = \mathbf w^\top \mathbf S \mathbf w = \mathbf w^\top \lambda_1 \mathbf w = \lambda_1$. In other words, the variance obtained by setting the first principal component to $\mathbf w$ is precisely the corresponding eigenvalue.
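This result is easy to verify numerically. The sketch below (on arbitrary correlated random data, an assumption for illustration) builds the covariance matrix $\mathbf S$, takes the eigenvector with the largest eigenvalue as $\mathbf w$, and checks that the variance of the projections equals that eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 500, 4

# Correlated random data (illustrative), normalized to zero mean.
X = rng.normal(size=(n, D)) @ rng.normal(size=(D, D))
X = X - X.mean(axis=0)

# Covariance matrix S = (1/n) * sum_i x_i x_i^T.
S = (X.T @ X) / n

# eigh returns eigenvalues of a symmetric matrix in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
w = eigvecs[:, -1]        # eigenvector with the largest eigenvalue
lambda_1 = eigvals[-1]

# Variance of the projections onto w equals the largest eigenvalue.
projected_variance = np.mean((X @ w) ** 2)
print(projected_variance, lambda_1)
```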

The other principal components can be found in a similar manner: the $k$th principal component is the unit vector $\mathbf w_k$ that maximizes the variance after the components along the previously found directions have been removed, i.e., after each $\mathbf x_i$ is replaced by $\mathbf x_i - \sum_{j=1}^{k-1} (\mathbf x_i^\top \mathbf w_j)\,\mathbf w_j$. If we continue this process iteratively until the deflated data become zero, we find that $V_k(\mathbf x_1, \cdots, \mathbf x_n) = \lambda_k$, where $\lambda_i \geq \lambda_j$ if $i < j$.

One important thing to keep in mind is that the $k$th principal component is orthogonal to the previous $k-1$ principal components. Considering that the objective of PCA is to find features that best capture the dataset, it is strictly better to consider a vector that cannot be expressed as a linear combination of the previously found principal components.

Low rank approximations

That we find eigenvalues in decreasing magnitude should ring a bell 🔔, and the bell shall ring a second time when we realize that the principal components are mutually orthogonal. This is precisely what SVD gives us: every matrix $\mathbf A \in \mathbb{R}^{m \times n}$ can be factorized as $\mathbf A = \mathbf U \mathbf \Sigma \mathbf V^\top$, where $\mathbf U$ and $\mathbf V$ are orthogonal matrices and $\mathbf \Sigma$ is the (rectangular) diagonal matrix with diagonal entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m, n)} \geq 0$.

Based on this formulation, we can write the (centered) data matrix, whose columns are the samples $\mathbf x_i$, as $\mathbf X = \mathbf U \mathbf \Sigma \mathbf V^\top$. Let’s focus on $\mathbf X \mathbf X^\top$:

$$\mathbf X \mathbf X^\top = (\mathbf U \mathbf \Sigma \mathbf V^\top)(\mathbf U \mathbf \Sigma \mathbf V^\top)^\top = \mathbf U \mathbf \Sigma \mathbf V^\top \mathbf V \mathbf \Sigma^\top \mathbf U^\top = \mathbf U \mathbf \Sigma \mathbf \Sigma^\top \mathbf U^\top$$

Since $\sigma_i \geq \sigma_j \geq 0$ for $i < j$, it follows that $\sigma_i^2 \geq \sigma_j^2$. Further, since $\mathbf \Sigma \mathbf \Sigma^\top = \text{diag}(\sigma_1^2, \cdots, \sigma_{\min (m, n)}^2)$, we can conclude that the columns of $\mathbf U$ are the eigenvectors of $\mathbf X \mathbf X^\top$, and hence of $\mathbf S = \frac{1}{n}\mathbf X \mathbf X^\top$, with eigenvalues sorted in decreasing order.
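A short numerical check of this connection (again with arbitrary random data, columns as samples, purely for illustration): the squared singular values divided by $n$ recover the eigenvalues of $\mathbf S$, and the left singular vectors match its eigenvectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 500, 4

# Columns of X are the (centered) samples, so X has shape (D, n).
X = rng.normal(size=(D, D)) @ rng.normal(size=(D, n))
X = X - X.mean(axis=1, keepdims=True)

# SVD of the data matrix.
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of S = (1/n) X X^T, sorted in decreasing order.
S = (X @ X.T) / n
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

print(np.allclose(sigma**2 / n, eigvals))        # squared singular values give the eigenvalues
print(np.allclose(np.abs(U), np.abs(eigvecs)))   # same directions up to sign
```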

Linear Discriminant Analysis (LDA)

PCA relies on vector projections that maximize the variance. A problem arises if we have labels associated with each data point. Applying PCA to a dataset with two classes, we may no longer be able to distinguish between them, because the class structure is ignored and the two classes can end up shuffled together in the projection. LDA avoids this issue by taking the class of each sample into account.
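For instance, scikit-learn exposes LDA with the same fit/transform interface as PCA; the sketch below (on synthetic two-class data, an assumption for illustration) contrasts the two, with PCA ignoring the labels and LDA consuming them.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data (illustrative): class 1 is a shifted copy of class 0.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
               rng.normal(1.5, 1.0, size=(100, 5))])
y = np.array([0] * 100 + [1] * 100)

# PCA ignores y entirely; LDA uses it (at most n_classes - 1 = 1 component here).
X_pca = PCA(n_components=1).fit_transform(X)
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # (200, 1) (200, 1)
```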

Maximize A and minimize B

It is not difficult to see that we cannot naively adapt PCA to a labeled setting. To be more precise, let’s consider a new dataset $\{(\mathbf x_i, y_i)\}_{i=1}^n$ where $y_i \in \{C_1, C_2\}$ indicates which of the two classes each sample belongs to. Our objective is to find a direction that separates the two classes in the optimal way. To characterize what we mean by the “optimal” way, refer to Figure 2.

The plot on the right is objectively better than the one on the left. If we spot the differences between the two plots, we shall have a better idea of how to formulate LDA.

There are two major differences: one concerns the relationship between the two classes, and the other concerns the samples within each class. The first is that, after projection, the center of the samples belonging to one class should be far from the center of the other class. This avoids the two classes overlapping or being cluttered together post-projection. The second is that the variance within each class after projection should be as small as possible. This keeps each class tightly clustered, which makes the two classes easier to separate along a single direction.

To sum up, LDA aims to find a vector that:

Requirement 1: Maximizes inter-class distances

Requirement 2: Minimizes intra-class variances

Mathematical formulation of LDA

Let’s first define the centers (more precisely, the centroids) of the two classes:

$$
\begin{align*}
\mathbf \mu_{C_1} &= \frac{1}{\sum_{i=1}^n\mathbf{1}_{C_1}[y_i]}\sum_{i=1}^n \mathbf{1}_{C_1}[y_i]\, \mathbf x_i \\
\mathbf \mu_{C_2} &= \frac{1}{\sum_{i=1}^n\mathbf{1}_{C_2}[y_i]}\sum_{i=1}^n \mathbf{1}_{C_2}[y_i]\, \mathbf x_i
\end{align*}
$$
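In code, these centroids are just per-class means; here is a minimal sketch on toy two-class data (the data and labels are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class dataset in R^3 (assumed for illustration).
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)   # labels: 0 for C1, 1 for C2

# Centroids are simply per-class means.
mu_C1 = X[y == 0].mean(axis=0)
mu_C2 = X[y == 1].mean(axis=0)
print(mu_C1, mu_C2)
```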

As in the section on PCA, we want to find a unit vector $\mathbf w$.

Requirement 1 mandates that the two projected centers be far apart. Mathematically, we can express this as a constrained optimization problem:

$$
\begin{cases}
\max\, D(\mathbf w) = \max \, \lVert \mathbf w^\top\mathbf \mu_{C_1} - \mathbf w^\top\mathbf \mu_{C_2} \rVert_2^2 \\
\text{subject to } \mathbf w^\top \mathbf w = 1
\end{cases}
$$

Requirement 2 mandates that the variance in each class be as small as possible. Mathematically, we have the following:

$$\min V(\mathbf w) = \min \sum_{i \in \{1, 2\}} \sum_{j=1}^n \mathbf 1_{C_i}[y_j]\, (\mathbf w^\top \mathbf x_j - \mathbf w^\top \mathbf \mu_{C_i})^2$$

A convenient way to maximize one quantity while minimizing the other is to form their ratio and maximize it. We can write our objective function as $J(\mathbf w)$ and maximize this quantity:

$$J(\mathbf w) = \frac{D(\mathbf w)}{V(\mathbf w)}$$

Since

$$
\begin{align*}
D(\mathbf w) = \lVert \mathbf w^\top \mathbf \mu_{C_1} - \mathbf w^\top \mathbf \mu_{C_2} \rVert_2^2 &= (\mathbf w^\top \mathbf \mu_{C_1} - \mathbf w^\top \mathbf \mu_{C_2})(\mathbf w^\top \mathbf \mu_{C_1} - \mathbf w^\top \mathbf \mu_{C_2})^\top \\
&= \mathbf w^\top (\mathbf \mu_{C_1} - \mathbf \mu_{C_2})(\mathbf \mu_{C_1} - \mathbf \mu_{C_2})^\top \mathbf w
\end{align*}
$$

and

$$V(\mathbf w) = \sum_{i \in \{1, 2\}}\sum_{j=1}^n \mathbf 1_{C_i}[y_j]\,\mathbf w^\top (\mathbf x_j - \mathbf \mu_{C_i})(\mathbf x_j - \mathbf \mu_{C_i})^\top \mathbf w$$

we can simplify $J(\mathbf w)$ by defining the following quantities:

$$
\begin{align*}
\mathbf S_D &= (\mathbf \mu_{C_1} - \mathbf \mu_{C_2})(\mathbf \mu_{C_1} - \mathbf \mu_{C_2})^\top \\
\mathbf S_V &= \sum_{i \in \{1, 2\}}\sum_{j=1}^n \mathbf 1_{C_i}[y_j]\, (\mathbf x_j - \mathbf \mu_{C_i})(\mathbf x_j - \mathbf \mu_{C_i})^\top \\
J(\mathbf w) &= \frac{\mathbf w^\top \mathbf S_D \mathbf w}{\mathbf w^\top \mathbf S_V \mathbf w}
\end{align*}
$$
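Translating these definitions directly into code, the sketch below (toy two-class data assumed for illustration) builds $\mathbf S_D$ and $\mathbf S_V$ as outer products.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data in R^3 (assumed for illustration).
X1 = rng.normal(loc=0.0, size=(60, 3))   # samples of class C1
X2 = rng.normal(loc=2.0, size=(40, 3))   # samples of class C2
mu_C1, mu_C2 = X1.mean(axis=0), X2.mean(axis=0)

# Between-class scatter S_D: outer product of the centroid difference.
diff = (mu_C1 - mu_C2).reshape(-1, 1)
S_D = diff @ diff.T

# Within-class scatter S_V: sum of outer products of the centered samples.
S_V = (X1 - mu_C1).T @ (X1 - mu_C1) + (X2 - mu_C2).T @ (X2 - mu_C2)

print(S_D.shape, S_V.shape)  # both (3, 3)
```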

If we differentiate $J(\mathbf w)$ with respect to $\mathbf w$ and set the derivative to zero to find the $\mathbf w$ that maximizes $J(\mathbf w)$,

$$
\begin{align*}
\frac{dJ(\mathbf w)}{d\mathbf w} &= \frac{\left(\frac{d}{d\mathbf w}(\mathbf w^\top \mathbf S_D \mathbf w)\right)(\mathbf w^\top \mathbf S_V \mathbf w) - (\mathbf w^\top \mathbf S_D \mathbf w) \left(\frac{d}{d\mathbf w}(\mathbf w^\top \mathbf S_V \mathbf w)\right) }{(\mathbf w^\top \mathbf S_V \mathbf w)^2} \stackrel{\text{set}}{=} 0 \\
&\Rightarrow (\mathbf w^\top \mathbf S_V \mathbf w) (\mathbf S_D \mathbf w) - (\mathbf w^\top \mathbf S_D \mathbf w) (\mathbf S_V \mathbf w) = 0 \\
&\Rightarrow (\mathbf w^\top \mathbf S_V \mathbf w) (\mathbf S_D \mathbf w) = (\mathbf w^\top \mathbf S_D \mathbf w) (\mathbf S_V \mathbf w)
\end{align*}
$$

Consequently,

$$\mathbf S_D \mathbf w = \frac{\mathbf w^\top \mathbf S_D \mathbf w}{\mathbf w^\top \mathbf S_V \mathbf w} \mathbf S_V \mathbf w$$

Since both $\mathbf w^\top \mathbf S_D \mathbf w$ and $\mathbf w^\top \mathbf S_V \mathbf w$ are scalars, we can define their ratio as $\lambda$, which turns the problem into another familiar eigenvalue-eigenvector problem once we multiply both sides of the equation on the left by $\mathbf S_V^{-1}$: $\mathbf S_V^{-1} \mathbf S_D \mathbf w = \lambda \mathbf w$. In other words, $\mathbf w$ is an eigenvector of $\mathbf S_V^{-1} \mathbf S_D$ with $\lambda$ as its eigenvalue, and maximizing $J(\mathbf w)$ amounts to taking the eigenvector with the largest eigenvalue.
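Putting the pieces together, here is a sketch (repeating the toy data from above so the block stands alone) that solves the eigenproblem for $\mathbf S_V^{-1} \mathbf S_D$ and takes the eigenvector with the largest eigenvalue. For the two-class case, this direction also matches the well-known closed form $\mathbf w \propto \mathbf S_V^{-1}(\mathbf \mu_{C_1} - \mathbf \mu_{C_2})$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data in R^3 (assumed for illustration).
X1 = rng.normal(loc=0.0, size=(60, 3))
X2 = rng.normal(loc=2.0, size=(40, 3))
mu_C1, mu_C2 = X1.mean(axis=0), X2.mean(axis=0)

diff = (mu_C1 - mu_C2).reshape(-1, 1)
S_D = diff @ diff.T                                                   # between-class scatter
S_V = (X1 - mu_C1).T @ (X1 - mu_C1) + (X2 - mu_C2).T @ (X2 - mu_C2)   # within-class scatter

# Eigenvector of S_V^{-1} S_D with the largest eigenvalue is the LDA direction.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_V) @ S_D)
w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
w /= np.linalg.norm(w)

# Two-class closed form: w is proportional to S_V^{-1} (mu_C1 - mu_C2).
w_closed = np.linalg.solve(S_V, mu_C1 - mu_C2)
w_closed /= np.linalg.norm(w_closed)

print(np.allclose(np.abs(w), np.abs(w_closed)))  # same direction up to sign
```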

As a quick note, can we guarantee that the inverse of $\mathbf S_V$ exists? In general, no: when the number of samples is small relative to the dimension, $\mathbf S_V$ can be singular, in which case a pseudoinverse or a regularized version of $\mathbf S_V$ is typically used instead.

