How to Implement Next Word Prediction using Tokenizer and LSTM?

Next Word Prediction using Tokenizer and LSTM

Condition for Next Word Prediction using Tokenizer and LSTM

  • Description:
    Next word prediction using LSTM (Long Short-Term Memory) works by training the model on a sequence of words, where the LSTM learns to capture long-term dependencies and contextual information from previous words. The model processes the input sequence word by word, maintaining hidden states that represent the context. When predicting the next word, the LSTM generates a probability distribution over the vocabulary, suggesting the most likely next word based on the learned context. This process is repeated iteratively for continuous predictions.
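To make this training signal concrete, here is a minimal, hedged sketch (using the Keras Tokenizer and a sentence from the sample corpus further down) that prints the (context → next word) pairs an LSTM would be trained on; the variable names are illustrative only.

    from tensorflow.keras.preprocessing.text import Tokenizer

    sentence = "the cat sat on the mat"
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([sentence])

    # Every prefix of the sentence is a context; the token that follows it is the label
    tokens = tokenizer.texts_to_sequences([sentence])[0]
    for i in range(1, len(tokens)):
        context = " ".join(tokenizer.index_word[t] for t in tokens[:i])
        target = tokenizer.index_word[tokens[i]]
        print(f"{context} -> {target}")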
Step-by-Step Process
  • Tokenization: The Tokenizer class is used to tokenize the corpus, converting each word into a unique integer index. This step creates a word index based on the frequency of words in the corpus.
  • Input Sequences Creation: The corpus is processed to create input sequences, where each sequence contains an n-gram of words (from 1 word to the current word). For each sentence, a series of subsequences is created (e.g., for "the cat sat", the sequences are "the", "the cat", "the cat sat").
  • Padding Sequences: The input sequences are padded using pad_sequences to ensure all sequences are of the same length, which is necessary for feeding them into a neural network.
  • Splitting Input and Labels: The sequences are split into inputs (X) and corresponding labels (y). The labels are one-hot encoded using to_categorical, which transforms them into a binary matrix representing word probabilities.
  • Building the LSTM Model: A Sequential model is created with an Embedding layer, followed by an LSTM layer, and a Dense layer with a softmax activation function. This architecture allows the model to learn sequential patterns and predict the next word.
  • Training the Model: The model is compiled with the Adam optimizer and categorical cross-entropy loss. It is then trained on the input data (X) and the one-hot encoded labels (y) for 100 epochs.
  • Prediction Function: The predict_next_word function takes a text input, converts it to the corresponding sequence, pads it, and predicts the next word using the trained LSTM model. The predicted word is returned based on the highest probability.
Sample Code
  • import numpy as np
    import tensorflow as tf
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    # Example dataset
    corpus = [
        "the cat sat on the mat",
        "the dog lay on the rug",
        "the bird flew over the house",
        "the fish swam in the tank"
    ]

    # Step 1: Tokenize the sentences
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1  # add 1 because index 0 is reserved for padding

    # Generate n-gram input sequences from each sentence
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i + 1]
            input_sequences.append(n_gram_sequence)

    # Step 2: Pad sequences to make them of the same length
    max_sequence_len = max(len(seq) for seq in input_sequences)
    input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

    # Split into inputs (all words but the last) and labels (the last word)
    X, y = input_sequences[:, :-1], input_sequences[:, -1]
    y = tf.keras.utils.to_categorical(y, num_classes=total_words)

    # Step 3: Build the LSTM model
    model = Sequential([
        Embedding(total_words, 64, input_length=max_sequence_len - 1),  # embedding layer
        LSTM(100),                                                      # LSTM layer
        Dense(total_words, activation='softmax')                        # softmax over the vocabulary
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    # Step 4: Train the model
    model.fit(X, y, epochs=100, verbose=2)

    # Step 5: Function to predict the next word
    def predict_next_word(model, tokenizer, text, max_sequence_len):
        token_list = tokenizer.texts_to_sequences([text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
        predicted_probs = model.predict(token_list, verbose=0)
        predicted_word_index = np.argmax(predicted_probs)
        for word, index in tokenizer.word_index.items():
            if index == predicted_word_index:
                return word
        return ""

    # Example usage
    input_text = "the bird flew"
    predicted_word = predict_next_word(model, tokenizer, input_text, max_sequence_len)
    print(f"Input text: '{input_text}'")
    print(f"Predicted next word: '{predicted_word}'")
Screenshots
  • Next word prediction (Tokenizer & LSTM)