Implementing Word2Vec Using the Skip-gram Model
Description: The Skip-gram model is one of the two main architectures in Word2Vec (the other being Continuous Bag of Words, or CBOW). It learns word embeddings by predicting the context words surrounding a target word, and it is especially useful when dealing with rare words or small corpora.
Step-by-Step Process
Preprocessing the Corpus: The text is split into sentences and tokenized into words. Stop words and other irrelevant tokens can be removed, though this isn't strictly necessary. The tokenized data is then arranged into a format that can be fed into the model for training.
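As an illustrative sketch of this step (the raw text, the stop-word list, and the regular-expression tokenizer below are assumptions made up for the example, not anything prescribed by Word2Vec), the corpus can be lowercased, tokenized, and optionally filtered like this:
import re
# Made-up raw corpus and stop-word list, purely for illustration
raw_corpus = [
    "This is a simple example.",
    "Word2Vec models convert words to vectors."
]
stop_words = {"is", "a", "to"}
tokenized_sentences = []
for sentence in raw_corpus:
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())   # lowercase and tokenize
    tokens = [t for t in tokens if t not in stop_words]   # optional stop-word removal
    tokenized_sentences.append(tokens)
print(tokenized_sentences)
# [['this', 'simple', 'example'], ['word2vec', 'models', 'convert', 'words', 'vectors']]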
Target and Context Word Pairing: From each sentence, a target word is selected. A context window is defined (e.g., a window size of 2), and all the words within this window around the target word become its context words. For each target word, the model learns to predict the words around it (the context), as illustrated in the sketch below.
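To make the pairing concrete, here is a small hand-rolled sketch (the helper function and the example sentence are assumptions for illustration, not part of any library) that builds (target, context) pairs with a window size of 2:
# Illustrative helper: build (target, context) pairs with a symmetric window
def generate_skipgram_pairs(tokens, window_size=2):
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)               # window boundary on the left
        end = min(len(tokens), i + window_size + 1)   # window boundary on the right
        for j in range(start, end):
            if j != i:                                # skip the target word itself
                pairs.append((target, tokens[j]))
    return pairs
sentence = ["word2vec", "is", "a", "popular", "embedding", "model"]
for target, context in generate_skipgram_pairs(sentence, window_size=2):
    print(target, "->", context)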
Training the Skip-gram Model: Input Layer: The target word is represented as a one-hot encoded vector over the vocabulary. Hidden Layer: The model has a hidden layer that learns a distributed representation (embedding) of the target word; this layer is a weight matrix that gets updated during training. Output Layer: The model predicts the context words (i.e., the words most likely to appear near the target word). The output layer uses a softmax activation to produce a probability distribution over the vocabulary for the context positions.
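As a rough sketch of that architecture (the toy vocabulary size, embedding dimension, random initialization, and use of NumPy are all assumptions made only to keep the example short), the forward pass looks like this:
import numpy as np
vocab_size, embedding_dim = 10, 4                                 # toy sizes for illustration
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))    # hidden-layer weights (embeddings)
W_out = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))   # output-layer weights
target_index = 3                        # index of the one-hot encoded target word
hidden = W_in[target_index]             # the one-hot input just selects one embedding row
scores = hidden @ W_out                 # a raw score for every word in the vocabulary
probs = np.exp(scores - scores.max())   # softmax (shifted for numerical stability)
probs /= probs.sum()
print(probs)        # probability of each vocabulary word appearing in the context
print(probs.sum())  # sums to 1.0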
Objective Function: The model's objective is to maximize the probability of the context words given the target word. Equivalently, the loss function commonly used is the negative log-likelihood of the context words, which is minimized during training.
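Continuing the toy NumPy sketch above (the target and context indices here are arbitrary assumptions), the negative log-likelihood of one observed context word can be evaluated as follows; note that in practice gensim replaces the full softmax with negative sampling or hierarchical softmax for efficiency, but the full-softmax loss is the simplest way to state the objective:
import numpy as np
vocab_size, embedding_dim = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))
W_out = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))
target_index, context_index = 3, 7      # arbitrary (target, context) pair for illustration
hidden = W_in[target_index]
scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()
# Negative log-likelihood of the observed context word given the target word
loss = -np.log(probs[context_index])
print(f"NLL for this (target, context) pair: {loss:.4f}")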
Learning Word Embeddings: As training progresses, the Skip-gram model adjusts the word vectors (embeddings) to minimize the error in predicting the context words. The weight matrix (which holds the learned embeddings) is gradually optimized to capture semantic information about the words. Words with similar meanings tend to have similar word vectors because they appear in similar contexts.
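One plain SGD step on that loss can be sketched as follows (again a self-contained toy illustration; the learning rate, number of steps, and word indices are assumptions, and real Word2Vec training uses negative sampling rather than the full-softmax gradient shown here):
import numpy as np
vocab_size, embedding_dim = 10, 4
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))
W_out = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))
target_index, context_index = 3, 7
learning_rate = 0.05
for step in range(100):
    hidden = W_in[target_index]
    scores = hidden @ W_out
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Gradient of the negative log-likelihood with respect to the scores
    error = probs.copy()
    error[context_index] -= 1.0
    # Backpropagate into both weight matrices, then take one SGD step
    grad_W_out = np.outer(hidden, error)
    grad_hidden = W_out @ error
    W_out -= learning_rate * grad_W_out
    W_in[target_index] -= learning_rate * grad_hidden
# Probability of the observed context word at the last step; it rises as the loss falls
print("P(context | target):", probs[context_index])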
Sample Code
from gensim.models import Word2Vec
# Example sentences for training the Word2Vec model
sentences = [
["this", "is", "a", "simple", "example"],
["word2vec", "models", "convert", "words", "to", "vectors"],
["average", "word2vec", "represents", "sentences"],
["word2vec", "is", "a", "popular", "embedding", "model"]
]
# Step 1: Train the Word2Vec model using Skip-gram (sg=1 for Skip-gram, sg=0 for CBOW)
model_skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, workers=4, epochs=10)
# Step 2: Use the trained model to find word vectors
word = "word2vec"
vector = model_skipgram.wv[word]
# Output the vector for a specific word
print(f"Vector for '{word}':\n{vector}")
# Step 3: Finding similar words using the trained model
similar_words = model_skipgram.wv.most_similar(word, topn=5)
print(f"Words most similar to '{word}':")
for similar_word, similarity_score in similar_words:
print(f"{similar_word}: {similarity_score}")