How to Build a Text Classification Model Using LSTM for Binary Classification

Text Classification Model with LSTM

  • Description:
    This code demonstrates how to preprocess text data for binary classification by cleaning it and vectorizing it with TF-IDF. The TF-IDF features are reshaped into a 3D tensor and fed to an LSTM model that uses dropout layers for regularization and a sigmoid output. Finally, the model's performance is evaluated with metrics such as accuracy, F1 score, and the confusion matrix.
Step-by-Step Process
  • Import the dataset and select the relevant columns for text and labels.
  • Identify and drop rows with missing values.
  • Plot the class distribution using a bar plot to check for class imbalance.
  • Convert text to lowercase and remove HTML tags, URLs, and special characters.
  • Tokenize the text and remove stopwords, then apply lemmatization to normalize words.
  • Convert text data into numerical features using TF-IDF vectorization.
  • Reshape the TF-IDF features into the 3D tensor (samples, timesteps, features) that the LSTM expects; see the sketch after this list.
  • Split the dataset into training and test sets.
  • Define an LSTM model with dropout layers for regularization and a sigmoid output for binary classification.
  • Train the model and evaluate its performance using accuracy, F1 score, recall, precision, and confusion matrix.
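The reshape step is the least obvious part of this pipeline: each document's fixed-length TF-IDF vector is treated as a pseudo-sequence of (timesteps x features) so that an LSTM has something sequence-shaped to consume. A minimal sketch of the idea on toy data (the sentences and shapes below are illustrative, not from the dataset):
  • # Toy corpus; the full listing below applies the same idea to the cleaned 'text' column
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["fake news spreads fast",
            "reliable reporting cites sources",
            "fake claims cite no sources",
            "reporting spreads information"]

    # Request up to 50 TF-IDF features so each row can be viewed as 5 timesteps x 10 features
    vec = TfidfVectorizer(max_features=50)
    X = vec.fit_transform(docs).toarray()

    # Pad with zero columns in case the vocabulary is smaller than max_features
    X = np.pad(X, ((0, 0), (0, 50 - X.shape[1])))

    # View each row as a (timesteps, features) matrix
    X = X.reshape(X.shape[0], 5, 10)
    print(X.shape)  # (4, 5, 10)

Note that the full listing's reshape to (10, 25) assumes the fitted vocabulary contains at least 250 terms; padding as above guards the general case.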
Sample Source Code
  • # Import Necessary Libraries
    import pandas as pd
    import re
    import string
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.layers import Dense, Dropout, Input, LSTM
    from tensorflow.keras.models import Model
    from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score, f1_score, recall_score, precision_score)

    import warnings
    warnings.filterwarnings("ignore")

    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('wordnet')

    import matplotlib.pyplot as plt
    import seaborn as sns

    data = pd.read_csv("/home/soft12/Downloads/sample_dataset/Website/Dataset/fake_train.csv")

    # Keep only the text and label columns used below (the leading columns hold metadata)
    df = data.iloc[:, 3:]

    # Display initial rows of the dataset
    print("Initial data preview:")
    print(df.head())

    # Check for missing (NaN) values
    print("Check for NaN values\n")
    print(df.isna().sum())

    # Drop rows with missing values
    df = df.dropna()

    # Verify that no missing values remain after dropping
    print("Check after dropping NaN values\n")
    print(df.isnull().sum())

    # Plotting the class distribution
    plt.figure(figsize=(6, 4))
    sns.countplot(x='label', data=df, palette='viridis')
    plt.title('Class Distribution')
    plt.xlabel('Class')
    plt.ylabel('Count')
    plt.show()

    # Initialize lemmatizer
    lemmatizer = WordNetLemmatizer()

    # Define the preprocessing functions
    def preprocess_text(text):
        text = text.lower()
        text = clean_text(text)
        tokens = word_tokenize(text)
        stopwords_set = set(stopwords.words('english'))
        tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stopwords_set]
        preprocessed_text = ' '.join(tokens)
        return preprocessed_text

    def clean_text(text):
        # Remove HTML tags
        text = re.sub(r'<.*?>', '', text)
        # Remove URLs
        text = re.sub(r'http\S+', '', text)
        # Remove non-ASCII characters except periods
        text = re.sub(r'[^\x00-\x7F.]', ' ', text)
        # Remove punctuation except periods
        text = re.sub(f'[{re.escape(string.punctuation.replace(".", ""))}]', '', text)
        # Remove isolated numbers
        text = re.sub(r'\b\d+\b', '', text)
        # Replace runs of two or more periods with a single space
        text = re.sub(r'\.{2,}', ' ', text)
        # Normalize whitespace after periods
        text = re.sub(r'(?<=\.)\s+', ' ', text).strip()
        return text

    # Apply preprocessing to the text column
    text_data = df['text'].apply(preprocess_text)

    # Convert each document into a fixed-length vector of 250 TF-IDF features
    tfidf_vectorizer = TfidfVectorizer(max_features=250)

    tfidf_features = tfidf_vectorizer.fit_transform(text_data)
    text = tfidf_features.toarray()

    # Reshape each 250-feature row into 10 timesteps of 25 features for the LSTM
    text = text.reshape(text.shape[0], 10, 25)

    # Split the train-test data
    X_train, X_test, y_train, y_test = train_test_split(text, df['label'], test_size=.2, random_state=42)

    def LSTM_model(input_shape):
        # Input layer
        inputs = Input(shape=input_shape)

        # LSTM layers with dropout for regularization
        lstm1 = LSTM(64, return_sequences=True, activation='relu')(inputs)
        dropout1 = Dropout(0.2)(lstm1)
        lstm2 = LSTM(32, return_sequences=False, activation='relu')(dropout1)
        dropout2 = Dropout(0.2)(lstm2)

        # Sigmoid output for binary classification
        output_layer = Dense(1, activation='sigmoid')(dropout2)

        # Build the model
        lstm_model = Model(inputs=inputs, outputs=output_layer)

        # Compile with the Adam optimizer and binary cross-entropy loss
        lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

        return lstm_model

    model = LSTM_model((X_train.shape[1], X_train.shape[2]))

    # Summary of Model
    model.summary()

    model.fit(X_train, y_train, batch_size=2, epochs=10, validation_data=(X_test, y_test))

    # Threshold the predicted probabilities at 0.5 to obtain class labels
    y_pred = model.predict(X_test)
    y_pred = (y_pred > 0.5).astype(int).ravel()

    print("___Performance_Metrics___\n")
    print('Classification_Report:\n', classification_report(y_test, y_pred))
    print('Confusion_Matrix:\n', confusion_matrix(y_test, y_pred))
    print('Accuracy_Score: ', accuracy_score(y_test, y_pred))
    print('F1_Score: ', f1_score(y_test, y_pred))
    print('Recall_Score: ', recall_score(y_test, y_pred))
    print('Precision_Score: ', precision_score(y_test, y_pred))
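Inference on New Text
  • The fitted objects above can be reused to score unseen text. A minimal sketch, assuming preprocess_text, tfidf_vectorizer, and model from the listing are in scope (the sample sentence is made up for illustration):
    # Clean and vectorize the new document exactly as the training data was
    sample = "Breaking: officials deny report about election results"
    features = tfidf_vectorizer.transform([preprocess_text(sample)]).toarray()

    # Reshape to the same (1, 10, 25) layout the LSTM was trained on
    features = features.reshape(1, 10, 25)

    # Predict the probability of the positive class and threshold at 0.5
    prob = float(model.predict(features)[0][0])
    print("Predicted label:", int(prob > 0.5), "with probability", prob)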
Screenshots
  • LSTM Model Output Screenshot