
Understanding PCA and LDA

8/5/2023

Introduction

Both PCA and LDA are linear dimensionality reduction techniques commonly used for data compression and data visualization, with the crucial difference that the latter also serves as a supervised learning algorithm that takes into account the classes to which the data belong. In this article, I review the mathematical details of PCA and LDA and explore how the two differ in the context of machine learning.

Principal Component Analysis (PCA)

In training a machine learning model after feature engineering, we often end up with feature vectors of very large dimension. While these vectors certainly contain a lot of useful information that our model can discover, they may also contain noise and residual information irrelevant to the general trend displayed by the data.

PCA, being a linear dimensionality reduction technique, identifies a small set of “important” directions (linear combinations of the original features), thereby reducing the complexity of the original feature set. In more mathematical language, this amounts to finding the directions that maximize the variance of the projected data. These directions are called the principal components.
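As a quick illustration of what this looks like in practice, here is a minimal sketch using scikit-learn's `PCA` on synthetic data; the random dataset and the choice of two components are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples in 10 dimensions (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Keep the two directions of maximal variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # fraction of total variance captured by each component
```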

Maximizing the variance

Why do we maximize the variance? The intuition behind variance maximization becomes quite clear once we visualize it in a two-dimensional plane. In Figure 1, we have three candidates for the first principal component. We are interested in the distribution of the data points when projected onto each of these lines. The best candidate (illustrated in red) fits the data most closely and, as we can see, the projections onto it have a larger variance than those onto the other two.

In signal processing, we assume that the signal or data of interest has a relatively larger variance, whereas the noise has a smaller one.

Problem setting

Let’s formalize this. Suppose we are given the dataset $\mathbf X = \{\mathbf x_1, \mathbf x_2, \cdots, \mathbf x_n \}$ where the $\mathbf x_i \in \mathbb{R}^D$ are i.i.d. and are normalized to have zero mean. We want to find the basis of a $d$-dimensional subspace $P \subseteq \mathbb{R}^D$, where $d < D$, such that the “reconstructed” or projected dataset $\tilde {\mathbf X} = \{\tilde{\mathbf x}_1, \cdots, \tilde{\mathbf x}_n\}$ still decently represents the original data.
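To make the setup concrete, the sketch below centers a randomly generated dataset (purely an illustrative assumption) and reconstructs it from an arbitrary $d$-dimensional orthonormal basis; the rest of this section is about how to choose that basis well.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 100, 5, 2

# Rows of X are the samples x_i; center them so they have zero mean.
X = rng.normal(size=(n, D))
X = X - X.mean(axis=0)

# W holds an arbitrary orthonormal basis of a d-dimensional subspace of R^D.
W, _ = np.linalg.qr(rng.normal(size=(D, d)))

# Project onto the subspace, then map back into R^D.
X_proj = X @ W           # coordinates in the subspace, shape (n, d)
X_recon = X_proj @ W.T   # reconstructed samples, shape (n, D)

print(np.linalg.norm(X - X_recon))   # reconstruction error for this particular basis
```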

Finding the principal components

We first focus on finding the first principal component: a unit vector $\mathbf w$ (the unit-norm constraint prevents scaling from affecting the magnitude of the variance). In Euclidean space, the scalar projection of a vector $\mathbf v_1$ onto a unit vector $\mathbf v_2$ equals their inner product (i.e. dot product), which we denote as $\langle \mathbf v_1, \mathbf v_2 \rangle = \mathbf v_1^\top \mathbf v_2 = \mathbf v_2^\top \mathbf v_1$. Based on our formalization, we intend to find a unit vector $\mathbf w$ that maximizes the variance of the projections of $\mathbf x_1, \cdots, \mathbf x_n$ onto it. If we denote this quantity by $V_1(\mathbf x_1, \cdots, \mathbf x_n)$, we have the following:

$$V_1(\mathbf x_1, \cdots, \mathbf x_n) = \frac{1}{n}\sum_{i=1}^n (\mathbf x_i^\top \mathbf w)^2$$

Notice that the mean of $\{\mathbf x_i^\top \mathbf w\}_{i=1}^n$ is 0, because the $\mathbf x_i$ are normalized to have zero mean:

$$\frac{1}{n}\sum_{i=1}^n \mathbf x_i^\top \mathbf w = \left(\frac{1}{n}\sum_{i=1}^n \mathbf x_i^\top\right) \mathbf w = 0$$

Continuing with the computation,

$$
\begin{align*}
V_1(\mathbf x_1, \cdots, \mathbf x_n) &= \frac{1}{n}\sum_{i=1}^n\mathbf w^\top \mathbf x_i \mathbf x_i^\top \mathbf w \\
&= \mathbf w^\top \left(\frac{1}{n}\sum_{i=1}^n \mathbf x_i \mathbf x_i^\top \right) \mathbf w
\end{align*}
$$

The quantity within the parentheses is a familiar one: the covariance matrix. We will use $\mathbf S$ to denote it.

We thus end up with the constrained optimization problem below:

$$
\begin{cases}
\max\, V_1(\mathbf x_1, \cdots, \mathbf x_n) \\
\text{subject to } \lVert \mathbf w \rVert = 1
\end{cases}
$$

Solving with the method of Lagrange multipliers,

$$\mathcal L (\mathbf w, \lambda_1) = \mathbf w^\top \mathbf S \mathbf{w} + \lambda_1 (1 - \mathbf w^\top \mathbf w)$$

Setting the gradient with respect to $\mathbf w$ to zero, we obtain $\mathbf S \mathbf w = \lambda_1 \mathbf w$, which shows that $(\lambda_1, \mathbf w)$ is an eigenvalue-eigenvector pair of $\mathbf S$. Subsequently, $V_1(\mathbf x_1, \cdots, \mathbf x_n) = \mathbf w^\top \mathbf S \mathbf w = \mathbf w^\top \lambda_1 \mathbf w = \lambda_1$. In other words, the variance obtained by setting the first principal component to $\mathbf w$ is precisely the corresponding eigenvalue.
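This result is easy to verify numerically. The sketch below (on arbitrary correlated random data, an assumption for illustration) builds the covariance matrix $\mathbf S$, takes the eigenvector with the largest eigenvalue as $\mathbf w$, and checks that the variance of the projections equals that eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 500, 4

# Correlated random data (illustrative), normalized to zero mean.
X = rng.normal(size=(n, D)) @ rng.normal(size=(D, D))
X = X - X.mean(axis=0)

# Covariance matrix S = (1/n) * sum_i x_i x_i^T.
S = (X.T @ X) / n

# eigh returns eigenvalues of a symmetric matrix in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
w = eigvecs[:, -1]        # eigenvector with the largest eigenvalue
lambda_1 = eigvals[-1]

# Variance of the projections onto w equals the largest eigenvalue.
projected_variance = np.mean((X @ w) ** 2)
print(projected_variance, lambda_1)
```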

The other principal components can be found in a similar manner: the $k$th principal component is the unit vector $\mathbf w_k$ that maximizes the variance after the components along the previously found directions have been removed, i.e., after each $\mathbf x_i$ is replaced by $\mathbf x_i - \sum_{j=1}^{k-1} (\mathbf x_i^\top \mathbf w_j)\,\mathbf w_j$. If we continue this process iteratively until the deflated data become zero, we find that $V_k(\mathbf x_1, \cdots, \mathbf x_n) = \lambda_k$, where $\lambda_i \geq \lambda_j$ if $i < j$.

One important thing to keep in mind is that the $k$th principal component is orthogonal to the previous $k-1$ principal components. Considering that the objective of PCA is to find features that best capture the dataset, it is strictly better to consider a vector that cannot be expressed as a linear combination of the previously found principal components.

Low rank approximations

That we find eigenvalues in decreasing magnitude should ring a bell 🔔, and the bell shall ring a second time when we realize that the principal components are mutually orthogonal. This is precisely what SVD gives us: every matrix $\mathbf A \in \mathbb{R}^{m \times n}$ can be factorized as $\mathbf A = \mathbf U \mathbf \Sigma \mathbf V^\top$, where $\mathbf U$ and $\mathbf V$ are orthogonal matrices and $\mathbf \Sigma$ is the (rectangular) diagonal matrix with diagonal entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m, n)} \geq 0$.

Based on this formulation, we can write the (centered) data matrix, whose columns are the samples $\mathbf x_i$, as $\mathbf X = \mathbf U \mathbf \Sigma \mathbf V^\top$. Let’s focus on $\mathbf X \mathbf X^\top$:

$$\mathbf X \mathbf X^\top = (\mathbf U \mathbf \Sigma \mathbf V^\top)(\mathbf U \mathbf \Sigma \mathbf V^\top)^\top = \mathbf U \mathbf \Sigma \mathbf V^\top \mathbf V \mathbf \Sigma^\top \mathbf U^\top = \mathbf U \mathbf \Sigma \mathbf \Sigma^\top \mathbf U^\top$$

Since $\sigma_i \geq \sigma_j \geq 0$ for $i < j$, it follows that $\sigma_i^2 \geq \sigma_j^2$. Further, since $\mathbf \Sigma \mathbf \Sigma^\top = \text{diag}(\sigma_1^2, \cdots, \sigma_{\min (m, n)}^2)$, we can conclude that the columns of $\mathbf U$ are the eigenvectors of $\mathbf X \mathbf X^\top$, and hence of $\mathbf S = \frac{1}{n}\mathbf X \mathbf X^\top$, with eigenvalues sorted in decreasing order.
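A short numerical check of this connection (again with arbitrary random data, columns as samples, purely for illustration): the squared singular values divided by $n$ recover the eigenvalues of $\mathbf S$, and the left singular vectors match its eigenvectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 500, 4

# Columns of X are the (centered) samples, so X has shape (D, n).
X = rng.normal(size=(D, D)) @ rng.normal(size=(D, n))
X = X - X.mean(axis=1, keepdims=True)

# SVD of the data matrix.
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of S = (1/n) X X^T, sorted in decreasing order.
S = (X @ X.T) / n
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

print(np.allclose(sigma**2 / n, eigvals))        # squared singular values give the eigenvalues
print(np.allclose(np.abs(U), np.abs(eigvecs)))   # same directions up to sign
```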

Linear Discriminant Analysis (LDA)

PCA relies on vector projections that maximize the variance. A problem arises if we have labels associated with each data point. Applying PCA to a dataset with two classes, we may no longer be able to distinguish between them, because the class structure is ignored and the two classes can end up shuffled together in the projection. LDA avoids this issue by taking the class of each sample into account.
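For instance, scikit-learn exposes LDA with the same fit/transform interface as PCA; the sketch below (on synthetic two-class data, an assumption for illustration) contrasts the two, with PCA ignoring the labels and LDA consuming them.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data (illustrative): class 1 is a shifted copy of class 0.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
               rng.normal(1.5, 1.0, size=(100, 5))])
y = np.array([0] * 100 + [1] * 100)

# PCA ignores y entirely; LDA uses it (at most n_classes - 1 = 1 component here).
X_pca = PCA(n_components=1).fit_transform(X)
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # (200, 1) (200, 1)
```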

Maximize A and minimize B

It is not difficult to see that we cannot naively adapt PCA to a labeled setting. To be more precise, let’s consider a new dataset $\{(\mathbf x_i, y_i)\}_{i=1}^n$ where $y_i \in \{C_1, C_2\}$ indicates which of the two classes each sample belongs to. Our objective is to find a direction that separates the two classes in the optimal way. To characterize what we mean by the “optimal” way, refer to Figure 2.

The plot on the right is objectively better than the one on the left. If we spot the differences between the two plots, we shall have a better idea of how to formulate LDA.

There are two major differences: one concerns the relationship between the two classes, and the other concerns the samples within each class. The first is that, after projection, the center of the samples belonging to one class should be far from the center of the other class. This avoids the two classes overlapping or being cluttered together post-projection. The second is that the variance within each class after projection should be as small as possible. This keeps each class tightly clustered, which makes the two classes easier to separate along a single direction.

To sum up, LDA aims to find a vector that:

Requirement 1: Maximizes inter-class distances

Requirement 2: Minimizes intra-class variances

Mathematical formulation of LDA

Let’s first define the centers (more precisely, the centroids) of the two classes:

$$
\begin{align*}
\mathbf \mu_{C_1} &= \frac{1}{\sum_{i=1}^n\mathbf{1}_{C_1}[y_i]}\sum_{i=1}^n \mathbf{1}_{C_1}[y_i]\, \mathbf x_i \\
\mathbf \mu_{C_2} &= \frac{1}{\sum_{i=1}^n\mathbf{1}_{C_2}[y_i]}\sum_{i=1}^n \mathbf{1}_{C_2}[y_i]\, \mathbf x_i
\end{align*}
$$
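In code, these centroids are just per-class means; here is a minimal sketch on toy two-class data (the data and labels are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class dataset in R^3 (assumed for illustration).
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)   # labels: 0 for C1, 1 for C2

# Centroids are simply per-class means.
mu_C1 = X[y == 0].mean(axis=0)
mu_C2 = X[y == 1].mean(axis=0)
print(mu_C1, mu_C2)
```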

As in the section on PCA, we want to find a unit vector $\mathbf w$.

Requirement 1 mandates that the two projected centers be far apart. Mathematically, we can express this as a constrained optimization problem:

$$
\begin{cases}
\max\, D(\mathbf w) = \max \, \lVert \mathbf w^\top\mathbf \mu_{C_1} - \mathbf w^\top\mathbf \mu_{C_2} \rVert_2^2 \\
\text{subject to } \mathbf w^\top \mathbf w = 1
\end{cases}
$$

Requirement 2 mandates that the variance in each class be as small as possible. Mathematically, we have the following:

$$\min V(\mathbf w) = \min \sum_{i \in \{1, 2\}} \sum_{j=1}^n \mathbf 1_{C_i}[y_j]\, (\mathbf w^\top \mathbf x_j - \mathbf w^\top \mathbf \mu_{C_i})^2$$

A convenient way to maximize one quantity while minimizing the other is to form their ratio and maximize it. We can write our objective function as $J(\mathbf w)$ and maximize this quantity:

$$J(\mathbf w) = \frac{D(\mathbf w)}{V(\mathbf w)}$$

Since

$$
\begin{align*}
D(\mathbf w) = \lVert \mathbf w^\top \mathbf \mu_{C_1} - \mathbf w^\top \mathbf \mu_{C_2} \rVert_2^2 &= (\mathbf w^\top \mathbf \mu_{C_1} - \mathbf w^\top \mathbf \mu_{C_2})(\mathbf w^\top \mathbf \mu_{C_1} - \mathbf w^\top \mathbf \mu_{C_2})^\top \\
&= \mathbf w^\top (\mathbf \mu_{C_1} - \mathbf \mu_{C_2})(\mathbf \mu_{C_1} - \mathbf \mu_{C_2})^\top \mathbf w
\end{align*}
$$

and

$$V(\mathbf w) = \sum_{i \in \{1, 2\}}\sum_{j=1}^n \mathbf 1_{C_i}[y_j]\,\mathbf w^\top (\mathbf x_j - \mathbf \mu_{C_i})(\mathbf x_j - \mathbf \mu_{C_i})^\top \mathbf w$$

we can simplify $J(\mathbf w)$ by defining the following quantities:

$$
\begin{align*}
\mathbf S_D &= (\mathbf \mu_{C_1} - \mathbf \mu_{C_2})(\mathbf \mu_{C_1} - \mathbf \mu_{C_2})^\top \\
\mathbf S_V &= \sum_{i \in \{1, 2\}}\sum_{j=1}^n \mathbf 1_{C_i}[y_j]\, (\mathbf x_j - \mathbf \mu_{C_i})(\mathbf x_j - \mathbf \mu_{C_i})^\top \\
J(\mathbf w) &= \frac{\mathbf w^\top \mathbf S_D \mathbf w}{\mathbf w^\top \mathbf S_V \mathbf w}
\end{align*}
$$
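Translating these definitions directly into code, the sketch below (toy two-class data assumed for illustration) builds $\mathbf S_D$ and $\mathbf S_V$ as outer products.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data in R^3 (assumed for illustration).
X1 = rng.normal(loc=0.0, size=(60, 3))   # samples of class C1
X2 = rng.normal(loc=2.0, size=(40, 3))   # samples of class C2
mu_C1, mu_C2 = X1.mean(axis=0), X2.mean(axis=0)

# Between-class scatter S_D: outer product of the centroid difference.
diff = (mu_C1 - mu_C2).reshape(-1, 1)
S_D = diff @ diff.T

# Within-class scatter S_V: sum of outer products of the centered samples.
S_V = (X1 - mu_C1).T @ (X1 - mu_C1) + (X2 - mu_C2).T @ (X2 - mu_C2)

print(S_D.shape, S_V.shape)  # both (3, 3)
```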

If we differentiate $J(\mathbf w)$ with respect to $\mathbf w$ and set the derivative to zero to find the $\mathbf w$ that maximizes $J(\mathbf w)$,

$$
\begin{align*}
\frac{dJ(\mathbf w)}{d\mathbf w} &= \frac{\left(\frac{d}{d\mathbf w}(\mathbf w^\top \mathbf S_D \mathbf w)\right)(\mathbf w^\top \mathbf S_V \mathbf w) - (\mathbf w^\top \mathbf S_D \mathbf w) \left(\frac{d}{d\mathbf w}(\mathbf w^\top \mathbf S_V \mathbf w)\right) }{(\mathbf w^\top \mathbf S_V \mathbf w)^2} \stackrel{\text{set}}{=} 0 \\
&\Rightarrow (\mathbf w^\top \mathbf S_V \mathbf w) (\mathbf S_D \mathbf w) - (\mathbf w^\top \mathbf S_D \mathbf w) (\mathbf S_V \mathbf w) = 0 \\
&\Rightarrow (\mathbf w^\top \mathbf S_V \mathbf w) (\mathbf S_D \mathbf w) = (\mathbf w^\top \mathbf S_D \mathbf w) (\mathbf S_V \mathbf w)
\end{align*}
$$

Consequently,

$$\mathbf S_D \mathbf w = \frac{\mathbf w^\top \mathbf S_D \mathbf w}{\mathbf w^\top \mathbf S_V \mathbf w} \mathbf S_V \mathbf w$$

Since both $\mathbf w^\top \mathbf S_D \mathbf w$ and $\mathbf w^\top \mathbf S_V \mathbf w$ are scalars, we can define their ratio as $\lambda$, which turns the problem into another familiar eigenvalue-eigenvector problem once we multiply both sides of the equation on the left by $\mathbf S_V^{-1}$: $\mathbf S_V^{-1} \mathbf S_D \mathbf w = \lambda \mathbf w$. In other words, $\mathbf w$ is an eigenvector of $\mathbf S_V^{-1} \mathbf S_D$ with $\lambda$ as its eigenvalue, and maximizing $J(\mathbf w)$ amounts to taking the eigenvector with the largest eigenvalue.
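Putting the pieces together, here is a sketch (repeating the toy data from above so the block stands alone) that solves the eigenproblem for $\mathbf S_V^{-1} \mathbf S_D$ and takes the eigenvector with the largest eigenvalue. For the two-class case, this direction also matches the well-known closed form $\mathbf w \propto \mathbf S_V^{-1}(\mathbf \mu_{C_1} - \mathbf \mu_{C_2})$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data in R^3 (assumed for illustration).
X1 = rng.normal(loc=0.0, size=(60, 3))
X2 = rng.normal(loc=2.0, size=(40, 3))
mu_C1, mu_C2 = X1.mean(axis=0), X2.mean(axis=0)

diff = (mu_C1 - mu_C2).reshape(-1, 1)
S_D = diff @ diff.T                                                   # between-class scatter
S_V = (X1 - mu_C1).T @ (X1 - mu_C1) + (X2 - mu_C2).T @ (X2 - mu_C2)   # within-class scatter

# Eigenvector of S_V^{-1} S_D with the largest eigenvalue is the LDA direction.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_V) @ S_D)
w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
w /= np.linalg.norm(w)

# Two-class closed form: w is proportional to S_V^{-1} (mu_C1 - mu_C2).
w_closed = np.linalg.solve(S_V, mu_C1 - mu_C2)
w_closed /= np.linalg.norm(w_closed)

print(np.allclose(np.abs(w), np.abs(w_closed)))  # same direction up to sign
```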

As a quick note, can we guarantee that the inverse of $\mathbf S_V$ exists? In general, no: when the number of samples is small relative to the dimension, $\mathbf S_V$ can be singular, in which case a pseudoinverse or a regularized version of $\mathbf S_V$ is typically used instead.

