word2vec vs. doc2vec

Word2Vec Overview

Word2Vec is a popular technique in natural language processing (NLP) for learning word embeddings, which are dense vector representations of words. These embeddings aim to capture semantic meanings, syntactic relationships, and various linguistic patterns based on the words' context in the text. Developed by Mikolov et al. at Google, Word2Vec models are trained to reconstruct linguistic contexts of words, which allows the embeddings to capture a wide array of semantic and syntactic similarities among words.

How Word2Vec Works

Word2Vec uses one of two model architectures:

  1. CBOW (Continuous Bag of Words): Predicts a target word from a set of context words surrounding it. The context is defined as a window of words around the target word. This approach is faster and has higher accuracy for frequent words.

    For CBOW, we want to predict one word based on the surrounding words. For example, if we are given: word1 word2 word3 word4 word5, we want to use (word1, word2, word4, word5) to predict word3.

  2. Skip-Gram: Predicts context words from a target word. This model is slower but better for infrequent words and works well with small datasets.

    In the Skip-Gram model, we use 'word3' to predict all the surrounding words 'word1, word2, word4, word5' (see the pair-generation sketch after this list).
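
To make the two objectives concrete, here is a small sketch that generates (input, target) training pairs for both architectures from the five-word example above; the sentence and window size are toy values chosen purely for illustration.

# Illustrative sketch: building training pairs for CBOW and Skip-Gram.
# The sentence and window size are toy values chosen for this example.
sentence = ["word1", "word2", "word3", "word4", "word5"]
window = 2

cbow_pairs = []       # (list of context words, target word)
skipgram_pairs = []   # (target word, one context word)

for i, target in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(lo, hi) if j != i]
    cbow_pairs.append((context, target))
    skipgram_pairs.extend((target, c) for c in context)

print(cbow_pairs[2])       # (['word1', 'word2', 'word4', 'word5'], 'word3')
print(skipgram_pairs[:4])  # first few (target, context) pairs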

Both architectures use a shallow neural network, typically consisting of a single hidden layer, where the training objective optimizes the word embeddings to predict words' context effectively.
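
As a rough sketch of that single-hidden-layer setup, the snippet below performs one simplified Skip-Gram forward pass with a full softmax; the vocabulary size and embedding dimension are toy values assumed here for illustration, not Word2Vec's actual defaults.

import numpy as np

vocab_size, embedding_dim = 10, 4                   # toy sizes for illustration
W_in = np.random.rand(vocab_size, embedding_dim)    # input-to-hidden weights (these rows become the embeddings)
W_out = np.random.rand(embedding_dim, vocab_size)   # hidden-to-output weights

target_index = 3                      # index of the input word
hidden = W_in[target_index]           # the "hidden layer" is just a row lookup
scores = hidden @ W_out               # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
print(probs.round(3))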

Training Process

  1. Vocabulary Construction: A vocabulary is created from the training corpus, often with a threshold to discard infrequent words.
  2. Initialization: Embeddings for words in the vocabulary are initialized randomly.
  3. Context and Target Words: Depending on the architecture (CBOW or Skip-Gram), pairs of context and target words are generated from the corpus.
  4. Training: The neural network is trained on these pairs, adjusting the embeddings to minimize the prediction error (using negative sampling or hierarchical softmax to manage the computational cost); a minimal sketch of this step follows the list.
  5. Extraction: After training, the rows of the input-to-hidden weight matrix are taken as the word embeddings.
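
The following is a minimal sketch of step 4 for the Skip-Gram case with negative sampling, written in plain NumPy; the hyperparameters, the uniform negative-sampling scheme, and the helper name train_pair are simplifying assumptions for illustration (real implementations sample negatives from a smoothed unigram distribution).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab_size, dim, lr, k = 10, 4, 0.025, 3                # toy hyperparameters
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))    # word embeddings (extracted after training)
W_out = np.zeros((vocab_size, dim))                     # output ("context") vectors

def train_pair(target, context):
    # One Skip-Gram update with negative sampling for a (target index, context index) pair.
    negatives = rng.integers(0, vocab_size, size=k)     # simplistic uniform negative sampling
    v_t = W_in[target].copy()
    grad_t = np.zeros(dim)
    for word, label in [(context, 1.0)] + [(int(n), 0.0) for n in negatives]:
        score = sigmoid(v_t @ W_out[word])
        grad = score - label                            # gradient of the logistic loss w.r.t. the score
        grad_t += grad * W_out[word]
        W_out[word] -= lr * grad * v_t
    W_in[target] -= lr * grad_t

train_pair(3, 4)   # one toy update for word indices 3 and 4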

Python Code Example

While training Word2Vec from scratch is complex and computationally intensive, Python's Gensim library provides a straightforward way to use Word2Vec:

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

# Sample corpus
corpus = ["Word embeddings are a type of word representation that allows words to be represented as vectors in a continuous vector space.",
          "Similar words will have similar representation in vector space."]

# Tokenization
tokenized_corpus = [word_tokenize(doc.lower()) for doc in corpus]

# Training the Word2Vec model
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)

# Accessing the vector for a word
word_vector = model.wv['word']

print(word_vector)
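
Once trained, the same model.wv object can also be queried for nearest neighbours and pairwise similarity (results on this two-sentence toy corpus will be noisy):

# Most similar words to 'vector' in the toy model
print(model.wv.most_similar('vector', topn=3))

# Cosine similarity between two words from the corpus
print(model.wv.similarity('word', 'words'))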

Pros and Cons

Pros:

  - Efficient to train on large corpora and produces compact, dense vectors.
  - Captures semantic and syntactic relationships between words (e.g., vector arithmetic such as king - man + woman ≈ queen).
  - The resulting embeddings can be reused to improve many downstream NLP tasks.

Cons:

  - Learns a single vector per word, so it cannot distinguish different senses of a polysemous word.
  - Cannot produce vectors for out-of-vocabulary words.
  - Only captures local context within a fixed window; it produces no representation of whole sentences or documents.

Applications

Word2Vec embeddings are used in numerous NLP applications, including text classification, sentiment analysis, machine translation, named entity recognition, and information retrieval.

The model has become a cornerstone in NLP for pre-training word embeddings, which can significantly improve the performance of various downstream tasks.

Word2Vec vs Doc2Vec

Word2Vec learns one vector per word, while Doc2Vec learns vectors for variable-length pieces of text (sentences, paragraphs, or whole documents), so the tasks each model can accomplish differ. With Word2Vec you can predict a word from its context (CBOW) or the context from a word (Skip-Gram), while with Doc2Vec you can represent and compare complete documents directly.
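
For contrast, here is a minimal Gensim Doc2Vec sketch that reuses the toy corpus from the Word2Vec example; the tags, hyperparameters, and query sentence are illustrative assumptions.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
# Assumes the NLTK 'punkt' tokenizer data is already downloaded (see the Word2Vec example above).

corpus = ["Word embeddings are a type of word representation that allows words to be represented as vectors in a continuous vector space.",
          "Similar words will have similar representation in vector space."]

# Each document gets at least one tag; the tag's vector is learned alongside the word vectors.
documents = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)])
             for i, doc in enumerate(corpus)]

# Train a small Doc2Vec model (toy hyperparameters)
d2v = Doc2Vec(documents, vector_size=50, window=2, min_count=1, epochs=40)

# Each training document now has its own vector (model.dv in Gensim 4.x)
print(d2v.dv['0'])

# Unseen text of any length can be embedded and compared to the training documents
new_vector = d2v.infer_vector(word_tokenize("similar words have similar vectors"))
print(d2v.dv.most_similar([new_vector], topn=1))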