How to Perform Text Classification Using Word2Vec and LSTM


  • Description:
    The process involves loading and preprocessing a dataset for text classification. The text is cleaned, tokenized, and lemmatized, then transformed into Word2Vec embeddings. These embeddings are used to train an LSTM-based model, which is evaluated with accuracy, F1-score, and a confusion matrix.
Step-by-Step Process
  • Import necessary libraries for text preprocessing, machine learning, and deep learning, including pandas, numpy, nltk, gensim, and tensorflow.
  • Read the dataset using pandas and drop any unnecessary columns, like 'Unnamed: 0', to focus on relevant data.
  • Inspect the dataset for NaN and null values to ensure clean data before processing. Remove rows with NaN values.
  • Plot a count of the class labels (e.g., positive or negative) to understand the dataset's class balance.
  • Clean the text by converting to lowercase, removing HTML tags, URLs, punctuation, and non-ASCII characters. Tokenize the text and lemmatize words.
  • Use nltk's word_tokenize to split the text into individual tokens for further processing.
  • Train a Word2Vec model on the tokenized text, which generates word embeddings for each token.
  • Convert each tokenized text into an embedding vector by averaging the embeddings of its words using the trained Word2Vec model.
  • Reshape the Word2Vec features into a 3D array suitable for input into the LSTM model (i.e., (samples, time_steps, features)).
  • Define the LSTM model with two LSTM layers and a dense output layer. Train the model on the Word2Vec features and evaluate its performance using accuracy, F1-score, and confusion matrix.
Sample Source Code
  • # Import Necessary Libraries
    import pandas as pd
    import numpy as np
    import re
    import string
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.layers import Dense, Dropout, Input, LSTM
    from tensorflow.keras.models import Model
    from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
    f1_score, recall_score, precision_score)
    from gensim.models import Word2Vec

    import warnings
    warnings.filterwarnings("ignore")

    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('wordnet')

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load your dataset
    data = pd.read_csv("/home/soft12/Downloads/sample_dataset/Website/Dataset/text.csv")

    df = data.drop(['Unnamed: 0'], axis=1)

    # Display initial rows of the dataset
    print("Initial data preview:")
    print(df.head())

    # Check for NaN values
    print("Check for NaN values\n")
    print(df.isna().sum())

    # Drop rows with NaN values
    df = df.dropna()

    # Check for Null Values
    print("Check for Null Values\n")
    print(df.isnull().sum())

    # Plotting the class distribution
    plt.figure(figsize=(6, 4))
    sns.countplot(x='label', data=df, palette='viridis')
    plt.title('Class Distribution')
    plt.xlabel('Class')
    plt.ylabel('Count')
    plt.show()

    # Initialize lemmatizer
    lemmatizer = WordNetLemmatizer()

    # Define the preprocessing functions
    def preprocess_text(text):
        text = text.lower()
        text = clean_text(text)
        tokens = word_tokenize(text)
        stopwords_set = set(stopwords.words('english'))
        tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stopwords_set]
        preprocessed_text = ' '.join(tokens)
        return preprocessed_text

    def clean_text(text):
        # Remove HTML tags
        text = re.sub(r'<.*?>', '', text)
        # Remove URLs
        text = re.sub(r'http\S+', '', text)
        # Replace non-ASCII characters with spaces
        text = re.sub(r'[^\x00-\x7F.]', ' ', text)
        # Remove punctuation except periods
        text = re.sub(f'[{re.escape(string.punctuation.replace(".", ""))}]', '', text)
        # Remove isolated numbers
        text = re.sub(r'\b\d+\b', '', text)
        # Replace runs of two or more periods with a single space
        text = re.sub(r'\.{2,}', ' ', text)
        # Normalize whitespace after periods and trim
        text = re.sub(r'(?<=\.)\s+', ' ', text).strip()
        return text
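    # Example: preprocess_text("<p>Visit http://x.com NOW!!!</p>") returns
    # "visit" -- the tag, URL, and punctuation are stripped, and the
    # stopword "now" is filtered out after tokenization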

    # Apply preprocessing to the text column
    text_data = df['text'].apply(preprocess_text)

    # Tokenize the preprocessed text
    tokenized_texts = [word_tokenize(text) for text in text_data]

    # Train Word2Vec model
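    # vector_size=250 sets the embedding dimensionality (it must match the
    # zero-vector size in get_word2vec_vector below); window=5 is the context
    # window, and min_count=1 keeps every token, including rare ones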
    word2vec_model = Word2Vec(sentences=tokenized_texts, vector_size=250, window=5, min_count=1, workers=4)

    # Average the Word2Vec embeddings of a document's tokens into one vector
    def get_word2vec_vector(tokens):
        vector = np.zeros(250)  # Vector size of the Word2Vec model
        valid_word_count = 0
        for word in tokens:
            if word in word2vec_model.wv:
                vector += word2vec_model.wv[word]
                valid_word_count += 1
        if valid_word_count > 0:
            vector /= valid_word_count  # Average the word vectors
        return vector

    # Convert the tokenized text to Word2Vec vectors
    word2vec_features = np.array([get_word2vec_vector(tokens) for tokens in tokenized_texts])
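    # word2vec_features now has shape (num_documents, 250)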

    # Reshape for LSTM input: each 250-dim document vector is split into a
    # pseudo-sequence of 10 time steps x 25 features (10 * 25 = 250)
    word2vec_features = word2vec_features.reshape(word2vec_features.shape[0], 10, 25)
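    # Note: binary_crossentropy below expects numeric 0/1 labels. If the
    # 'label' column holds strings (e.g. 'positive'/'negative'), encode it
    # first, for instance with sklearn.preprocessing.LabelEncoder.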

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(word2vec_features, df['label'], test_size=0.2, random_state=42)

    # Define LSTM Model
    def LSTM_model(input_shape):
        # Input layer
        inputs = Input(shape=input_shape)

        # LSTM layers with dropout
        lstm1 = LSTM(64, return_sequences=True, activation='relu')(inputs)
        drop1 = Dropout(0.5)(lstm1)

        lstm2 = LSTM(64, return_sequences=False, activation='relu')(drop1)
        drop2 = Dropout(0.5)(lstm2)

        output = Dense(1, activation='sigmoid')(drop2)

        model = Model(inputs=inputs, outputs=output)

        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

        return model

    # Train LSTM model
    model = LSTM_model((10, 25))
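    # Optionally inspect the architecture
    model.summary()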
    history = model.fit(X_train, y_train, epochs=5, batch_size=32, validation_data=(X_test, y_test))

    # Predict and evaluate on the test set
    y_pred = model.predict(X_test)
    y_pred_class = (y_pred > 0.5).astype(int)

    print("Accuracy:", accuracy_score(y_test, y_pred_class))
    print("Precision:", precision_score(y_test, y_pred_class))
    print("Recall:", recall_score(y_test, y_pred_class))
    print("F1-Score:", f1_score(y_test, y_pred_class))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_class))
    print("\nClassification Report:\n", classification_report(y_test, y_pred_class))
