PCA - Principal Components Analysis
Tags: Basic Concepts
Principal Component Analysis (PCA) Overview
Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction while preserving as much of the data's variability as possible. It transforms the data into a new coordinate system, where the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
How PCA Works
- Standardization: The first step often involves standardizing the data so that each feature contributes equally to the analysis.
- Covariance Matrix Computation: PCA computes the covariance matrix of the data to understand how the variables of the input data are varying from the mean with respect to each other.
- Eigenvalue and Eigenvector Calculation: It then calculates the eigenvalues and eigenvectors of this covariance matrix. Eigenvectors point in the direction of the largest variance, while eigenvalues correspond to the magnitude of this variance in each direction.
- Sort Eigenvalues and Eigenvectors: The eigenvalues and eigenvectors are sorted in order of decreasing eigenvalues. The eigenvectors are the principal components (PCs) of the data.
- Project Data Onto Principal Components: The data is projected onto the principal components to transform the data into a new subspace with reduced dimensions.
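The steps above can be sketched directly in NumPy. This is a minimal illustration, not a library implementation; all variable names are made up for the example:
import numpy as np
# Toy data: 100 observations with 5 features
X = np.random.rand(100, 5)
# 1. Standardize: zero mean and unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. Covariance matrix of the standardized features
C = np.cov(X_std, rowvar=False)
# 3. Eigenvalues and eigenvectors (eigh handles symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(C)
# 4. Sort by decreasing eigenvalue; the eigenvectors are the PCs
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 5. Project onto the top k principal components
k = 2
X_reduced = X_std @ eigvecs[:, :k]
print(X_reduced.shape)  # (100, 2)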
Mathematical Formulation
Given a data matrix $X$ with zero empirical mean (the data vectors have been centered), the covariance matrix is given by $C = \frac{1}{n-1} X^T X$, where $n$ is the number of data points.
The eigenvectors of $C$, $v_i$, and their corresponding eigenvalues, $\lambda_i$, are found by solving $C v_i = \lambda_i v_i$.
The data can then be projected onto the eigenvectors corresponding to the $k$ largest eigenvalues to reduce its dimensionality to $k$ dimensions.
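As a quick numerical check of this formulation (illustrative code, assuming the $\frac{1}{n-1}$ sample-covariance convention used above):
import numpy as np
X = np.random.rand(100, 5)
X_centered = X - X.mean(axis=0)
n = X_centered.shape[0]
# C = (1 / (n - 1)) X^T X for centered data
C = X_centered.T @ X_centered / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)
# Each eigenpair satisfies C v = lambda v
v, lam = eigvecs[:, -1], eigvals[-1]
print(np.allclose(C @ v, lam * v))  # True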
Pros and Cons of PCA
Pros:
- Reduces Dimensionality: Helpful in reducing the dataset's dimensions while retaining most of the critical information.
- Noise Reduction: Can improve the performance of algorithms by eliminating noise.
- Visualization: Facilitates the visualization of high-dimensional data.
Cons:
- Linearity: PCA assumes that the principal components are a linear combination of the original features.
- Variance Equals Information: Assumes that components with larger variance contain more information, which might not always be the case.
- Sensitive to Scaling: The results of PCA depend on the scaling of the features of the data, as the sketch below demonstrates.
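The scaling sensitivity is easy to demonstrate: a feature with a much larger scale dominates the first component unless the data is standardized first. A minimal sketch with made-up data:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:, 1] = X[:, 0] + 0.5 * rng.normal(size=200)  # correlated features
X[:, 0] *= 1000  # first feature on a much larger scale
# Without scaling, the first component aligns almost entirely with feature 0
print(PCA(n_components=1).fit(X).components_)
# After standardization, both features contribute comparably
X_std = StandardScaler().fit_transform(X)
print(PCA(n_components=1).fit(X_std).components_)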
Applications of PCA
- Data Visualization: Reducing data to 2 or 3 dimensions for visualization.
- Preprocessing: Preprocessing step before applying machine learning algorithms to reduce computation time and avoid the curse of dimensionality.
- Noise Filtering: Removing noise from signals in data (see the sketch below).
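Noise filtering, for instance, can be sketched by keeping only the leading components and reconstructing the data with inverse_transform. The data here is synthetic and only illustrative:
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
# A low-rank "signal" corrupted by additive noise
signal = np.outer(np.sin(np.linspace(0, 6, 100)), np.ones(20))
noisy = signal + 0.3 * rng.normal(size=signal.shape)
# Keep the top 2 components, then project back to the original space
pca = PCA(n_components=2)
denoised = pca.inverse_transform(pca.fit_transform(noisy))
# The reconstruction typically sits closer to the clean signal than the noisy data does
print(np.abs(noisy - signal).mean(), np.abs(denoised - signal).mean())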
Implementing PCA in Python with Scikit-Learn
Here's a simple example of applying PCA using Scikit-Learn:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
# Sample data: 100 observations with 5 features
X = np.random.rand(100, 5)
# Standardize the data
X_standardized = StandardScaler().fit_transform(X)
# Apply PCA to reduce to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_standardized)
print("Original shape:", X.shape)
print("Reduced shape:", X_pca.shape)
In this example, PCA is used to reduce a 5-dimensional dataset to 2 dimensions for analysis or visualization.
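A useful follow-up is to check how much of the total variance the two retained components capture. Continuing the example above, scikit-learn exposes this as explained_variance_ratio_:
# Fraction of total variance captured by each retained component
print(pca.explained_variance_ratio_)
print("Total:", pca.explained_variance_ratio_.sum())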
Definition:
- unsupervised learning algorithm
- primarily a tool for dealing with high-dimensional data
- a technique for feature extraction: it combines our input variables in a specific way so that we can drop the "least important" resulting variables while still retaining the most valuable parts of all of the original variables.
- As an added benefit, the "new" variables produced by PCA are all uncorrelated with one another. This matters because linear models can behave poorly when predictors are highly correlated (multicollinearity); if we fit a linear regression model with these "new" variables (see the principal component regression sketch below), that concern is avoided by construction.
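A minimal sketch of principal component regression, using standard scikit-learn pieces on made-up data:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
# Regress on the top principal components instead of the raw features
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))  # R^2 on the training data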
Function:
- Visualization: PCA provides a way to visualize the data by projecting it down to two or three dimensions that you can plot, in order to get a better sense of the data. Furthermore, the principal component vectors sometimes provide insight into the nature of the data as well.
- Preprocessing:
- Learning complex models of high-dimensional data is often very slow, and also prone to overfitting—the number of parameters in a model is usually exponential in the number of dimensions, meaning that very large data sets are required for higher-dimensional models. This problem is generally called the curse of dimensionality.
- PCA can be used to first map the data to a low-dimensional representation before applying a more sophisticated algorithm to it. With PCA one can also whiten the representation, which rebalances the weights of the data to give better performance in some cases (see the sketch after this list).
- Modeling: PCA learns a representation that is sometimes used as an entire model, e.g., a prior distribution for new data.
- Compression: PCA can be used to compress data by replacing it with its low-dimensional representation.
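The whitening mentioned under preprocessing rescales each component to unit variance; in scikit-learn this is the whiten=True option. A minimal sketch on made-up correlated data:
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated features
X_white = PCA(whiten=True).fit_transform(X)
# Whitened components are uncorrelated with (approximately) unit variance
print(np.round(np.cov(X_white, rowvar=False), 2))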
When should I use PCA?
- Do you want to reduce the number of variables, but aren't able to identify variables to completely remove from consideration?
- Do you want to ensure your variables are uncorrelated with one another?
- Are you comfortable making your independent variables less interpretable?