PCA - Principal Components Analysis

Principal Component Analysis (PCA) Overview

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction while preserving as much of the data's variability as possible. It transforms the data into a new coordinate system, where the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

How PCA Works

  1. Standardization: The first step often involves standardizing the data so that each feature contributes equally to the analysis.
  2. Covariance Matrix Computation: PCA computes the covariance matrix of the data to understand how the variables of the input data vary from the mean with respect to each other.
  3. Eigenvalue and Eigenvector Calculation: It then calculates the eigenvalues and eigenvectors of this covariance matrix. Eigenvectors point in the directions of largest variance, while eigenvalues give the magnitude of the variance in each of those directions.
  4. Sort Eigenvalues and Eigenvectors: The eigenvalues and eigenvectors are sorted in order of decreasing eigenvalue. The eigenvectors are the principal components (PCs) of the data.
  5. Project Data Onto Principal Components: The data is projected onto the principal components to transform it into a new subspace with reduced dimensions (a minimal NumPy sketch of these steps follows this list).
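
The same steps can also be carried out directly with NumPy; the sketch below assumes a row-per-observation data matrix and uses illustrative variable names:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 observations, 5 features

# 1. Standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features (5 x 5)
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvalues and eigenvectors (eigh suits symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort by decreasing eigenvalue; the sorted eigenvectors are the PCs
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Project onto the top k principal components
k = 2
X_reduced = X_std @ eigvecs[:, :k]       # shape (100, 2)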

Mathematical Formulation

Given a data matrix, X, with zero empirical mean (the data vectors have been centered), the covariance matrix is given by C = \frac{1}{n-1} X X^T, where n is the number of data points.

The eigenvectors of C, v_i, and their corresponding eigenvalues, \lambda_i, are found by solving C v_i = \lambda_i v_i.

The data can then be projected onto the k largest eigenvectors to reduce its dimensionality to k dimensions.
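
Written out, if V_k = [v_1, \dots, v_k] collects the k eigenvectors with the largest eigenvalues as its columns, the reduced representation is Y = V_k^T X, a k \times n matrix whose columns are the k-dimensional projections of the original data points (this assumes the same column-per-observation convention as the covariance formula above).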

Pros and Cons of PCA

Pros:

  • Reduces dimensionality while retaining as much of the data's variance as possible.
  • Produces uncorrelated components, which mitigates multicollinearity.
  • Can speed up downstream training and reduce storage by discarding low-variance components.
  • Useful for visualizing high-dimensional data in two or three dimensions.

Cons:

  • The principal components are linear combinations of the original features and are harder to interpret.
  • Captures only linear structure; nonlinear relationships may be missed.
  • Sensitive to feature scaling, so standardization is usually required first.
  • High variance does not always mean high relevance for a downstream task.

Applications of PCA

  • Dimensionality reduction before fitting other machine learning models.
  • Visualization of high-dimensional data in 2D or 3D.
  • Noise reduction and data compression (e.g., images).
  • Exploratory data analysis and feature extraction (e.g., principal component regression).

Implementing PCA in Python with Scikit-Learn

Here's a simple example of applying PCA using Scikit-Learn:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data: 100 observations with 5 features
X = np.random.rand(100, 5)

# Standardize the data
X_standardized = StandardScaler().fit_transform(X)

# Apply PCA to reduce to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_standardized)

print("Original shape:", X.shape)
print("Reduced shape:", X_pca.shape)

In this example, PCA is used to reduce a 5-dimensional dataset to 2 dimensions for analysis or visualization.
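
To see how much of the original variance the two components retain, one can inspect the fitted PCA object's explained_variance_ratio_ attribute (continuing the example above):

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())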

Definition:

  • unsupervised learning algorithm
  • primarily a tool for dealing with high-dimensional data
  • a technique for feature extraction: it combines our input variables in a specific way so that we can drop the “least important” variables while still retaining the most valuable parts of all of the variables
  • As an added benefit, each of the “new” variables after PCA is uncorrelated with the others. This is a benefit because correlated predictors (multicollinearity) cause problems for linear models; if we decide to fit a linear regression model with these “new” variables (see the “principal component regression” sketch below), this concern is addressed by construction.
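
As a rough sketch of the principal component regression idea, here is an illustrative pipeline (the data and the choice of 2 components are made up for the example):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
import numpy as np

# Illustrative data: 100 observations, 5 features, and a linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

# Standardize, project onto 2 principal components, then fit a linear model
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print("R^2 on training data:", pcr.score(X, y))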

Function:

When should I use PCA?

  1. Do you want to reduce the number of variables, but aren’t able to identify variables to completely remove from consideration? (A sketch of choosing the number of components follows this list.)
  2. Do you want to ensure your variables are independent of one another?
  3. Are you comfortable making your independent variables less interpretable?
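
For the first point, scikit-learn's PCA accepts a fractional n_components and keeps just enough components to explain that share of the variance; a minimal sketch (the 0.95 threshold and the random data are only illustrative):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.random.rand(100, 5)
X_standardized = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_standardized)
print("Components kept:", pca.n_components_)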