How to Perform Sentiment Analysis on Movie Reviews Using the IMDB Dataset with Keras in Python?

Movie Review Sentiment Analysis Using Keras in Python

Condition for Performing Sentiment Analysis on Movie Reviews Using the IMDB Dataset with Keras in Python

  • Description:
    This code is particularly useful for analyzing large-scale text data, such as customer reviews or social media posts, to determine user sentiment (positive or negative) and derive actionable insights.
Step-by-Step Process
  • Step 1: Load the required libraries for data manipulation, text preprocessing, model building, and evaluation.
  • Step 2: Read the IMDB movie review dataset into a DataFrame and inspect the data structure.
  • Step 3: Clean the text by converting it to lowercase, removing unwanted characters (HTML tags, URLs, non-ASCII symbols, and special characters), and tokenizing it. Stopwords are removed, and lemmatization reduces words to their root form.
  • Step 4: Train a Word2Vec model on the tokenized reviews so that it learns dense vector representations (embeddings) for words based on their context in the text.
  • Step 5: Convert each review into a numerical representation by averaging the embeddings of the words it contains, producing a fixed-length feature vector (see the toy sketch after this list).
  • Step 6: Convert the sentiment column to numeric labels (1 for positive, 0 for negative) using LabelEncoder, ensuring compatibility with the deep learning model.
  • Step 7: Divide the dataset into training and test sets (80/20 split) using train_test_split, with a fixed seed (random_state=42) for reproducible shuffling.
  • Step 8: Design a neural network with an input layer, two hidden layers with ReLU activation and dropout regularization to avoid overfitting, and a sigmoid output layer for binary classification.
  • Step 9: Train the ANN model on the training set with a batch size of 2 for 10 epochs, evaluating on the test set to monitor validation performance during training.
  • Step 10: Make predictions on the test set and compute metrics such as accuracy, F1-score, precision, recall, and a confusion matrix to assess the model's effectiveness in sentiment classification.
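  • A toy sketch of the Step 5 averaging, using made-up 3-dimensional embeddings (the actual model below uses 100 dimensions):
    import numpy as np
    # Hypothetical 3-d embeddings for the two words of a review "good movie"
    v_good = np.array([0.2, 0.5, -0.1])
    v_movie = np.array([0.4, -0.3, 0.6])
    # The review vector is the element-wise mean of its word vectors
    review_vector = np.mean([v_good, v_movie], axis=0)
    print(review_vector)  # [0.3  0.1  0.25]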
Sample Code
  • # Import necessary libraries
    import pandas as pd
    import numpy as np
    import re
    import string
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    from gensim.models import Word2Vec
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                                 f1_score, recall_score, precision_score)
    from tensorflow.keras.layers import Dense, Dropout, Input
    from tensorflow.keras.models import Model
    nltk.download('stopwords')
    nltk.download('punkt')  # newer NLTK versions may also require nltk.download('punkt_tab')
    nltk.download('wordnet')
    import warnings
    warnings.filterwarnings("ignore")
    df = pd.read_csv("/home/soft12/Downloads/sample_dataset/Website/Dataset/IMDB Dataset.csv")
    # Display initial rows of the dataset
    print("Initial data preview:")
    print(df.head())
    # Initialize lemmatizer
    lemmatizer = WordNetLemmatizer()
    # Define the preprocessing functions
    def preprocess_text(text):
        text = text.lower()
        text = clean_text(text)
        tokens = word_tokenize(text)
        stopwords_set = set(stopwords.words('english'))
        tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stopwords_set]
        preprocessed_text = ' '.join(tokens)
        return preprocessed_text
    def clean_text(text):
        # Remove HTML tags using regex
        text = re.sub(r'<.*?>', '', text)
        # Remove URLs
        text = re.sub(r'http\S+', '', text)
        # Remove non-ASCII characters except periods
        text = re.sub(r'[^\x00-\x7F.]', ' ', text)
        # Remove special characters except periods
        text = re.sub(f'[{re.escape(string.punctuation.replace(".", ""))}]', '', text)
        # Remove isolated numbers
        text = re.sub(r'\b\d+\b', '', text)
        # Replace multiple periods with a single space
        text = re.sub(r'\.{2,}', ' ', text)
        # Remove extra spaces after periods
        text = re.sub(r'(?<=\.)\s+', ' ', text).strip()
        return text
    # Apply preprocessing to the text column
    text_data = df['review'].apply(preprocess_text)
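    # Illustrative example (exact output depends on the NLTK stopword/lemmatizer data):
    # preprocess_text("I <b>really</b> loved this movie!!!") -> "really loved movie"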
    def train_word2vec_model(text_data):
        # Tokenize the preprocessed text and train a Word2Vec model
        tokenized_data = [text.split() for text in text_data]
        model = Word2Vec(tokenized_data, vector_size=100, window=5, min_count=1, workers=4)
        return model
    word2vec_model = train_word2vec_model(text_data)
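    # Quick sanity checks (embedding values are learned, so exact numbers vary per run):
    # word2vec_model.wv['movie'] returns a 100-dimensional numpy vector, and
    # word2vec_model.wv.most_similar('movie', topn=3) lists the closest words by cosine similarity.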
    def vectorize_text_with_word2vec(text, model):
        tokens = text.split()
        vectors = [model.wv[token] for token in tokens if token in model.wv]
        if len(vectors) == 0:
            # Return a zero vector if no tokens are in the model
            return np.zeros(model.vector_size)
        return np.mean(vectors, axis=0)
    # Apply Word2Vec vectorization to the text column
    word2vec_features = np.array([vectorize_text_with_word2vec(text, word2vec_model)
                                  for text in text_data])
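    # Each review is now a fixed-length vector, so word2vec_features.shape is
    # (number_of_reviews, 100) -- e.g. (50000, 100) for the standard 50,000-review IMDB dataset.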
    label = LabelEncoder()
    y = label.fit_transform(df['sentiment'])
    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(word2vec_features, y, test_size=0.2, random_state=42)
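    # Note: LabelEncoder assigns classes alphabetically, so 'negative' -> 0 and
    # 'positive' -> 1, which matches the single sigmoid output of the network below.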
    def ANN_model(input_shape):
        # Input layer
        inputs = Input(shape=(input_shape,))
        # Hidden layers with ReLU activation and dropout regularization
        layer1 = Dense(64, activation='relu')(inputs)
        dropout1 = Dropout(0.2)(layer1)
        layer2 = Dense(32, activation='relu')(dropout1)
        dropout2 = Dropout(0.2)(layer2)
        # Output layer for binary classification
        output_layer = Dense(1, activation='sigmoid')(dropout2)
        # Build the model
        ann_model = Model(inputs=inputs, outputs=output_layer)
        # Compile the model with Adam optimizer and binary cross-entropy loss
        ann_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
        return ann_model
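    # Optional sanity check: with 100-dimensional inputs the network has
    # (100*64 + 64) + (64*32 + 32) + (32*1 + 1) = 8,577 trainable parameters;
    # model.summary() prints the layer-by-layer breakdown.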
    model = ANN_model(X_train.shape[1])
    model.fit(X_train, y_train, batch_size=2, epochs=10, validation_data=(X_test, y_test))
    # Threshold the sigmoid outputs at 0.5 to obtain class labels
    y_pred = (model.predict(X_test).ravel() > 0.5).astype(int)
    print("___Performance_Metrics___\n")
    print('Classification_Report:\n', classification_report(y_test, y_pred))
    print('Confusion_Matrix:\n', confusion_matrix(y_test, y_pred))
    print('Accuracy_Score: ', accuracy_score(y_test, y_pred))
    print('F1_Score: ', f1_score(y_test, y_pred))
    print('Recall_Score: ', recall_score(y_test, y_pred))
    print('Precision_Score: ', precision_score(y_test, y_pred))
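  • Scoring a new review (an illustrative sketch; the review text below is made up, not drawn from the dataset):
    # Reuse the same preprocessing and Word2Vec averaging for unseen text
    new_review = "The plot was predictable, but the performances were outstanding."
    new_vector = vectorize_text_with_word2vec(preprocess_text(new_review), word2vec_model)
    probability = model.predict(new_vector.reshape(1, -1))[0][0]
    print("Predicted sentiment:", "positive" if probability > 0.5 else "negative")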
Screenshots
  • Movie Review Sentiment Analysis