Feature Hashing


Feature Hashing (Hashing Trick)

Feature hashing, also known as the "hashing trick," is a fast and space-efficient way of vectorizing features, especially useful in high-dimensional data. Instead of maintaining a direct mapping from feature names to indices in a vector (as done in one-hot encoding), feature hashing uses a hash function to convert features into indices in a vector of fixed size. This method is particularly useful for datasets with a large number of categorical variables and when the dataset is too large to fit in memory.

Feature hashing is especially useful for features with very high cardinality, i.e., hundreds or even thousands of unique values. The hashing trick limits the growth in dimensionality and memory footprint by allowing multiple distinct values to be encoded at the same index.

How It Works

  1. Hash Function: A hash function maps a feature name to an integer in a fixed range [0, N-1], where N is the predetermined size of the vector. Different features may hash to the same index, which is known as a hash collision.
  2. Vector Representation: Each feature's value (or presence) is stored at the hashed index of the vector. For categorical features, a simple presence indicator (1) or a count can be used. For numerical features, the feature value itself can be added at the index indicated by the hash function.
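The two steps above can be sketched in a few lines. This version also illustrates a common refinement not described above: a second bit of the hash chooses a sign of +1 or -1, so colliding features tend to cancel rather than pile up (the function and variable names here are illustrative).

```python
import hashlib

def hashed_index_and_sign(feature: str, n: int):
    """Map a feature name to an (index, sign) pair for a vector of size n."""
    # Step 1: hash the feature name to a big integer, then map into [0, n-1]
    h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16)
    index = h % n
    # Signed variant: use another bit of the hash to pick +1 or -1
    sign = 1 if (h // n) % 2 == 0 else -1
    return index, sign

# Step 2: accumulate each feature's value at its hashed index
n = 8
vector = [0.0] * n
for name, value in [("color=red", 1.0), ("color=blue", 1.0), ("weight", 3.5)]:
    i, s = hashed_index_and_sign(name, n)
    vector[i] += s * value
```

Note that the vector stays size 8 no matter how many distinct feature names appear, which is the whole point of the technique.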

Pros and Cons

Pros:

  - Fixed, predictable memory footprint: the vector size N is chosen up front, independent of the number of distinct feature values.
  - No vocabulary or dictionary needs to be built or stored, which suits streaming and online learning.
  - Fast: vectorization requires only one hash computation per feature.
  - Handles previously unseen feature values gracefully, since any string hashes to some index.

Cons:

  - Hash collisions: distinct features can share an index, introducing noise (mitigated by a larger N or a signed hash).
  - Loss of interpretability: the mapping is one-way, so you cannot recover which feature produced a given index.
  - N must be chosen in advance; too small increases collisions, too large wastes memory.
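Collisions are easy to demonstrate: by the pigeonhole principle, hashing five distinct features into a four-slot vector guarantees that at least two share an index (a minimal sketch; the helper name is illustrative).

```python
import hashlib

def bucket(feature: str, n: int) -> int:
    """Hash a feature name into one of n buckets."""
    return int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16) % n

# Five distinct features cannot fit into four buckets without a collision
n = 4
features = ["dog", "cat", "fish", "bird", "horse"]
buckets = [bucket(f, n) for f in features]
collision = len(set(buckets)) < len(features)  # guaranteed True here
```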

Applications

Feature hashing is widely used in text processing (e.g., bag-of-words models), online learning algorithms, and handling categorical data in large-scale machine learning tasks where memory and speed are critical considerations.
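For the bag-of-words case, the idea reduces to hashing each token into a fixed-size count vector. A minimal stdlib-only sketch (function name and default size are illustrative; production libraries such as scikit-learn's HashingVectorizer use a faster non-cryptographic hash):

```python
import hashlib

def hashed_bag_of_words(text: str, n: int = 16) -> list:
    """Hash each whitespace-separated token into a fixed-size count vector."""
    vec = [0] * n
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % n
        vec[idx] += 1  # count occurrences; collisions merge counts
    return vec

vec = hashed_bag_of_words("the cat sat on the mat")
```

Whatever the vocabulary of the corpus, every document maps to the same 16-dimensional vector, so no vocabulary dictionary is ever stored.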

Commonly used hash functions include MurmurHash3, Jenkins, and CityHash; cryptographic hashes such as MD5 also work but are slower than necessary, since adversarial collision resistance is not required here.

Feature hashing is widely used at tech companies such as Booking.com, Facebook (Semantic Hashing using Tags and Topic Modeling, 2013), Yahoo, Yandex, Avazu, and Criteo.

Python Implementation

Here's a simple Python implementation of feature hashing for a list of categorical features:

import hashlib

def feature_hashing(features, N):
    """
    Hashes a list of features into a vector of size N using a simple hash function.

    Args:
    - features: List of categorical features (strings).
    - N: Size of the resulting hashed feature vector.

    Returns:
    - A vector of size N with hashed feature values.
    """
    # Initialize the feature vector with zeros
    vector = [0] * N

    for feature in features:
        # Hash the feature name and mod by N so the index fits in the vector
        index = int(hashlib.md5(feature.encode()).hexdigest(), 16) % N
        # Mark presence with 1 (use += 1 for counts, or += value for numerical features)
        vector[index] = 1

    return vector

# Example usage
features = ['dog', 'cat', 'fish']
N = 10  # Size of the feature vector
hashed_features = feature_hashing(features, N)
print(hashed_features)

This implementation uses Python's hashlib for hashing, demonstrating a basic approach to feature hashing. Depending on the application, you might adjust it to handle numerical features, mitigate collisions (for example with a signed hash), or use a faster non-cryptographic hash function such as MurmurHash3.
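As one such variation, the sketch below handles numerical features given as a {name: value} mapping, accumulating values so that colliding features sum. It uses zlib.crc32 from the stdlib because it is fast and deterministic, whereas Python's built-in hash() of strings is randomized per process and would produce different vectors between runs. (Function name is illustrative.)

```python
import zlib

def feature_hashing_numeric(features: dict, n: int) -> list:
    """Hash a {name: value} mapping into a fixed-size vector of floats."""
    vec = [0.0] * n
    for name, value in features.items():
        # crc32 is a fast, deterministic stdlib hash (not cryptographic)
        idx = zlib.crc32(name.encode("utf-8")) % n
        vec[idx] += value  # accumulate so colliding features sum
    return vec

vec = feature_hashing_numeric({"age": 37.0, "height_cm": 180.0, "city=Paris": 1.0}, 10)
```

Because values are added rather than overwritten, the total mass of the input is preserved even when collisions occur.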