How to Implement Next Word Prediction using Tokenizer and LSTM?
How Next Word Prediction using Tokenizer and LSTM Works
Description: Next word prediction using LSTM (Long Short-Term Memory) works by training the model on a sequence of words, where the LSTM learns to capture long-term dependencies and contextual information from previous words. The model processes the input sequence word by word, maintaining hidden states that represent the context. When predicting the next word, the LSTM generates a probability distribution over the vocabulary, suggesting the most likely next word based on the learned context. This process is repeated iteratively for continuous predictions.
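As a rough illustration of the last point (choosing the next word from the model's output distribution), the toy snippet below uses made-up probabilities and a hand-written index-to-word mapping; in the real pipeline these values come from the trained model and the fitted tokenizer shown later in the sample code.

import numpy as np

# Hypothetical example: a 5-word vocabulary and one softmax output from the model.
# Index 0 is reserved for padding, so it never corresponds to a real word.
index_word = {1: "cat", 2: "sat", 3: "on", 4: "the", 5: "mat"}
probs = np.array([0.01, 0.05, 0.60, 0.14, 0.10, 0.10])

predicted_index = int(np.argmax(probs))  # index with the highest probability -> 2
print(index_word[predicted_index])       # prints "sat"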
Step-by-Step Process
Tokenization: The Tokenizer class is used to tokenize the corpus, converting each word into a unique integer index. This step creates a word index based on the frequency of words in the corpus.
Input Sequences Creation: The corpus is processed to create input sequences, where each sequence is an n-gram of words growing from the start of the sentence up to the current word. For each sentence, a series of sub-sequences is created (e.g., for "the cat sat", the n-grams are "the cat" and "the cat sat").
Padding Sequences: The input sequences are padded using pad_sequences to ensure all sequences are of the same length, which is necessary for feeding them into a neural network.
Splitting Input and Labels: The sequences are split into inputs (X) and corresponding labels (y). The labels are one-hot encoded using to_categorical, which transforms each label into a one-hot vector over the vocabulary (see the sketch after this list for what these intermediate outputs look like).
Building the LSTM Model: A Sequential model is created with an Embedding layer, followed by an LSTM layer, and a Dense layer with a softmax activation function. This architecture allows the model to learn sequential patterns and predict the next word.
Training the Model: The model is compiled with the Adam optimizer and categorical cross-entropy loss. It is then trained on the input data (X) and the one-hot encoded labels (y) for 100 epochs.
Prediction Function: The predict_next_word function takes a text input, converts it to the corresponding sequence, pads it, and predicts the next word using the trained LSTM model. The predicted word is returned based on the highest probability.
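To make steps 1-4 concrete, the minimal sketch below runs the same tokenization, n-gram, padding, and one-hot steps on a tiny two-sentence corpus and prints the intermediate results. The corpus and the values shown in the comments are illustrative only; the actual integer indices are whatever the Tokenizer assigns (most frequent word first).

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

corpus = ["the cat sat", "the dog ran"]

# Step 1: tokenize
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
print(tokenizer.word_index)   # e.g. {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4, 'ran': 5}

# Step 2: n-gram input sequences
sequences = []
for line in corpus:
    tokens = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(tokens)):
        sequences.append(tokens[:i + 1])
print(sequences)              # [[1, 2], [1, 2, 3], [1, 4], [1, 4, 5]]

# Step 3: pre-pad to a common length
max_len = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_len, padding='pre')
print(padded)                 # [[0 1 2], [1 2 3], [0 1 4], [1 4 5]]

# Step 4: split inputs/labels and one-hot encode the labels
X, y = padded[:, :-1], padded[:, -1]
y = to_categorical(y, num_classes=len(tokenizer.word_index) + 1)
print(X.shape, y.shape)       # (4, 2) (4, 6)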
Sample Code
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
# Example dataset
corpus = [
    "the cat sat on the mat",
    "the dog lay on the rug",
    "the bird flew over the house",
    "the fish swam in the tank"
]
# Step 1: Tokenize the sentences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1 # Add 1 for padding
# Generate input sequences
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i + 1]
        input_sequences.append(n_gram_sequence)
# Step 2: Pad sequences to make them of the same length
max_sequence_len = max(len(seq) for seq in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
# Split input and labels
X, y = input_sequences[:, :-1], input_sequences[:, -1]
y = tf.keras.utils.to_categorical(y, num_classes=total_words)
# Step 3: Build the LSTM model
model = Sequential([
    Embedding(total_words, 64, input_length=max_sequence_len - 1),  # Embedding layer: maps word indices to dense vectors
    LSTM(100),                                                      # LSTM layer: captures sequential context
    Dense(total_words, activation='softmax')                        # Dense layer with softmax for word probabilities
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Step 4: Train the model
model.fit(X, y, epochs=100, verbose=2)
# Step 5: Function to predict the next word
def predict_next_word(model, tokenizer, text, max_sequence_len):
    token_list = tokenizer.texts_to_sequences([text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
    predicted_probs = model.predict(token_list, verbose=0)
    predicted_word_index = np.argmax(predicted_probs)
    # Map the predicted index back to its word in the tokenizer's vocabulary
    for word, index in tokenizer.word_index.items():
        if index == predicted_word_index:
            return word
    return ""
# Example usage
input_text = "the bird flew"
predicted_word = predict_next_word(model, tokenizer, input_text, max_sequence_len)
print(f"Input text: '{input_text}'")
print(f"Predicted next word: '{predicted_word}'")