How to Build a Text Classification Model Using GRU for Sentiment Analysis with IMDB Dataset


  • Description:
    The code loads and preprocesses the IMDB dataset by cleaning the text, removing stopwords, and applying TF-IDF vectorization. It then builds a GRU-based model for binary sentiment classification and trains it. Finally, it evaluates the model's performance using classification metrics and visualizes the results with a confusion matrix.
Step-by-Step Process
  • Load Libraries:
    Import the required libraries (pandas, scikit-learn, NLTK, and TensorFlow) for data preprocessing and model building.
  • Preprocess Text Data:
    Clean the text by removing HTML tags, special characters, and stopwords, then tokenize and lemmatize it (a short worked example follows this list).
  • Encode and Vectorize Data:
    Encode sentiment labels and transform the text into numerical features using TF-IDF.
  • Build and Train Model:
    Build a model with a GRU as its core recurrent layer, compile it, and train it on the prepared data.
  • Evaluate and Visualize:
    Evaluate the model on the test dataset, generate a classification report, and display a confusion matrix.
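As a quick illustration of the preprocessing step, this standalone sketch runs one invented review string through the same cleaning used in the full listing below:

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer

    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('wordnet')

    raw = "<br />This movie was surprisingly good, the actors were amazing!"
    text = re.sub(r'<.*?>', '', raw)           # strip HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)    # keep only letters and whitespace
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    print(' '.join(lemmatizer.lemmatize(w) for w in tokens if w not in stop_words))
    # -> movie surprisingly good actor amazing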
Sample Source Code
  • # Import Necessary Libraries
    import pandas as pd
    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    from sklearn.preprocessing import LabelEncoder
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.models import Model
    from tensorflow.keras.layers import Input, Dense, GRU
    from sklearn.metrics import classification_report, confusion_matrix
    import matplotlib.pyplot as plt
    import seaborn as sns

    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('wordnet')

    # Load and preprocess IMDB dataset
    df = pd.read_csv("IMDB_Dataset.csv")

    # Preprocess text: strip HTML, keep letters only, tokenize, drop stopwords, lemmatize
    def preprocess_text(text):
        text = re.sub(r'<.*?>', '', text)        # remove HTML tags
        text = re.sub(r'[^a-zA-Z\s]', '', text)  # keep only letters and whitespace
        tokens = word_tokenize(text.lower())
        stop_words = set(stopwords.words('english'))
        lemmatizer = WordNetLemmatizer()         # instantiate once, not per token
        tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
        return ' '.join(tokens)

    df['review'] = df['review'].apply(preprocess_text)

    # Vectorize text
    tfidf = TfidfVectorizer(max_features=250)
    X = tfidf.fit_transform(df['review']).toarray()
    # Reshape so the GRU reads each TF-IDF feature as one timestep
    X = X.reshape(X.shape[0], X.shape[1], 1)

    # Encode labels
    y = LabelEncoder().fit_transform(df['sentiment'])

    # Split dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Build GRU Model
    def build_model():
        inputs = Input(shape=(X_train.shape[1], 1))
        x = GRU(64)(inputs)                          # GRU as the core recurrent layer
        x = Dense(32, activation='relu')(x)
        outputs = Dense(1, activation='sigmoid')(x)  # binary sentiment output
        model = Model(inputs, outputs)
        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
        return model

    model = build_model()
    model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test), batch_size=32)

    # Evaluate Model
    y_pred = (model.predict(X_test) > 0.5).astype(int)
    print(classification_report(y_test, y_pred))

    # Visualize the results with a confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['negative', 'positive'], yticklabels=['negative', 'positive'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    plt.show()
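    # Usage sketch: classify one unseen review by reusing preprocess_text, the
    # fitted tfidf vectorizer, and the trained model (the review string below is
    # an invented example, not part of the dataset)
    new_review = "The plot was dull and the acting felt wooden."
    vec = tfidf.transform([preprocess_text(new_review)]).toarray()
    vec = vec.reshape(1, vec.shape[1], 1)  # match the model's (timesteps, 1) input
    prob = float(model.predict(vec)[0][0])
    print('positive' if prob > 0.5 else 'negative', prob)
    # Note: GRU text classifiers more commonly read token-ID sequences through an
    # Embedding layer; treating the 250 TF-IDF values as a sequence keeps this
    # article's TF-IDF pipeline but discards word order.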
Screenshots
  • GRU Output Screenshot