Embedding
Tags: Basic Concepts
Both one-hot encoding and feature hashing can represent features in a multidimensional space. However, these representations do not usually preserve the semantic meaning of each feature. For example, one-hot encoding cannot guarantee that the words ‘cat’ and ‘animal’ are close to each other in that space, or that the user ‘Kanye West’ is close to ‘rap music’ in YouTube data. The proximity here can be interpreted from the semantic perspective or the engagement perspective. This is an important distinction and has implications for how we train embeddings.
Embedding in Machine Learning
Embeddings are a fundamental concept in machine learning, particularly in the context of natural language processing (NLP), recommendation systems, and deep learning. They provide a way to represent categorical data, like words, items, or users, as dense vectors of real numbers. The key idea behind embeddings is to capture the semantics or the relationship between the entities in a lower-dimensional space.
How It Works
- Representation: Instead of using sparse representations such as one-hot encoding, where the dimensionality equals the number of unique categories (leading to a lot of zeros and a few ones), embeddings map these categories to a continuous vector space. This dense representation is usually of much lower dimensionality.
- Semantic Similarity: In this vector space, semantically similar items are closer together. For example, in a well-trained word embedding model, words with similar meanings, like "king" and "queen," will have vectors that are close to each other in the embedding space.
- Training: Embeddings can be learned in two ways: either trained from scratch for a specific task along with the neural network (co-trained, e.g., YouTube video embeddings), or taken from a pre-trained model (e.g., word2vec-style embeddings) and fine-tuned. A toy sketch of how these dense vectors encode similarity follows this list.
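For illustration, here is a minimal sketch of the dense-vector idea with small hand-written vectors (the words, values, and the cosine_similarity helper are made up for this example; a trained model would learn the actual values):

import numpy as np

# Toy, hand-written 4-dimensional embeddings; a trained model would learn these values.
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10, 0.05]),
    "queen": np.array([0.78, 0.70, 0.12, 0.04]),
    "apple": np.array([0.05, 0.10, 0.90, 0.70]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: values near 1.0 mean "close" in the embedding space.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: semantically similar
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # lower: less related

The same lookup-then-compare pattern applies whether the vectors come from a pre-trained model or are co-trained with the downstream task.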
Pros and Cons
Pros:
- Efficiency: Embeddings significantly reduce the dimensionality of the input features, making the models more computationally efficient.
- Semantic Representation: They capture the semantic relationships between entities, which can improve model performance, especially on tasks involving natural language.
- Transferability: Pre-trained embeddings can be transferred across different tasks, enabling models to leverage knowledge from large datasets.
Cons:
- Data and Domain Specific: Embeddings are specific to the data they were trained on. Pre-trained embeddings might not capture domain-specific nuances.
- Opacity: The dense representation of embeddings makes them harder to interpret compared to sparse representations like one-hot encoding.
- Training Complexity: Training embeddings from scratch requires a significant amount of data and computational resources.
Applications
- Natural Language Processing: Word embeddings (e.g., Word2Vec, GloVe) are used to process text, enabling tasks like sentiment analysis, named entity recognition, and machine translation.
- Recommendation Systems: Embeddings can represent users and items, helping to predict user preferences or item similarities.
- Graph Data: Node embeddings capture the structure and features of graphs in tasks like social network analysis.
- Instagram uses embeddings to provide personalized recommendations for its users, while Pinterest uses them as part of its Ads Ranking model. In practice, for apps like Pinterest and Instagram where the user’s intent within a session is strong, we can use word2vec-style embedding training.
Application of Embedding in Tech Companies
- Twitter uses embeddings for user IDs, and they are widely used across Twitter in use cases such as recommendation, nearest neighbor search, and transfer learning.
- Pinterest Ads ranking uses a word2vec style where each user session can be viewed as: pin A → pin B → pin C, then co-trained with multitask modeling.
- Instagram’s personalized recommendation model uses a word2vec style where each user session can be viewed as: account 1 → account 2 → account 3, to predict accounts with which a person is likely to interact within a given session.
- YouTube recommendations use two-tower model embeddings, co-trained with a multihead model architecture. (Read about multitask learning in section Common Deep Learning 1.5.)
- DoorDash’s personalized store feed uses a word2vec style where each user session can be viewed as: restaurant 1 → restaurant 2 → restaurant 3. This Store2Vec model can be trained to predict whether restaurants were visited in the same session using the CBOW algorithm (a rough session-as-sentence sketch follows this list).
- The TensorFlow documentation recommends the rule of thumb $d = \sqrt[4]{D}$, where $D$ is the number of categories. Another way is to treat the embedding dimension $d$ as a hyperparameter and tune it on a downstream task. In large-scale production, embedding features are usually pre-computed and stored in key/value storage to reduce inference latency.
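As a rough sketch of the session-as-sentence setup described above (the item IDs are hypothetical, gensim’s Word2Vec stands in for the proprietary systems mentioned, and the embedding dimension follows the fourth-root rule of thumb):

from gensim.models import Word2Vec

# Hypothetical user sessions: each session is an ordered list of item IDs
# (restaurants, pins, accounts, ...) the user engaged with.
sessions = [
    ["restaurant_12", "restaurant_7", "restaurant_33"],
    ["restaurant_7", "restaurant_33", "restaurant_90"],
    ["restaurant_5", "restaurant_12", "restaurant_7"],
]

num_items = len({item for session in sessions for item in session})
embedding_dim = max(2, round(num_items ** 0.25))  # fourth-root rule of thumb

# CBOW (sg=0) predicts an item from the surrounding items in the same session,
# similar in spirit to the Store2Vec setup described above.
model = Word2Vec(sentences=sessions, vector_size=embedding_dim, window=2,
                 min_count=1, sg=0, epochs=50)

print(model.wv["restaurant_7"])                       # learned embedding vector
print(model.wv.most_similar("restaurant_7", topn=2))  # nearest items in the space

In production, such vectors would typically be pre-computed for all items and written to a key/value store so they can be looked up at inference time rather than recomputed.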
Python Implementation Example
Below is a simple illustration of how to use embeddings in TensorFlow/Keras for a word embedding task:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
# Example parameters
vocab_size = 10000 # Number of unique words in the vocabulary
embedding_dim = 32 # Dimension of the embedding vector
max_length = 100 # Maximum length of input sequences
model = Sequential([
    # The Embedding layer takes the input sequence and maps each word to an embedding vector
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length),
    Flatten(),  # Flatten the 3D embedding output to 2D for the Dense layer
    Dense(1, activation='sigmoid')  # Example output layer for a binary classification task
])
model.summary()
This code snippet creates a simple neural network with an embedding layer as the first layer. The Embedding layer is configured to transform integer-encoded categorical data (in this case, word indices) into dense vectors of a specified size (embedding_dim). This approach is commonly used in NLP to prepare text data for deep learning models.
Evaluation
There is no single standard way to evaluate embedding quality. We have two approaches:
- Apply embeddings to downstream tasks and measure model performance. For certain applications, like natural language processing (NLP), we can also visualize embeddings using t-SNE (t-distributed stochastic neighbor embedding) or UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction). We can look for clusters in 2-3 dimensions and validate whether they match our intuition. In practice, most embeddings built with engagement optimization do not show any clear structure.
- Apply clustering (e.g., k-means) or nearest neighbor search (k-NN) on the embedding data and see if it forms meaningful clusters (a minimal sketch using scikit-learn follows this list).
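A minimal sketch of these checks, assuming an embedding matrix is already available (random vectors stand in for real learned embeddings, and scikit-learn is used for t-SNE and k-means):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Stand-in for a learned embedding table: 200 items, 32-dimensional vectors.
rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(200, 32))

# Project to 2D with t-SNE for visual inspection; look for clusters that match intuition.
projected = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(item_embeddings)

# Cluster the original vectors and check whether the groups are meaningful.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(item_embeddings)

print(projected[:3])         # 2D coordinates, e.g. for a scatter plot
print(kmeans.labels_[:10])   # cluster assignment per item

With real embeddings, the 2D projection and the cluster assignments can then be spot-checked against known relationships, for example whether similar restaurants or related words land in the same cluster.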