How to Perform Sentiment Analysis on Movie Reviews Using the IMDB Dataset with Keras in Python?
Description: This code is particularly useful for analyzing large-scale text data, such as customer reviews or social media posts, to determine user sentiment (positive or negative) and derive actionable insights.
Step-by-Step Process
Step 1: Load the required libraries for data manipulation, text preprocessing, model building, and evaluation.
Step 2: Read the IMDB movie review dataset into a DataFrame and inspect its structure.
Step 3: Clean the text by converting it to lowercase and removing unwanted characters (HTML tags, URLs, non-ASCII symbols, and special characters), then tokenize it. Remove stopwords and apply lemmatization to reduce words to their root form.
Step 4: Train a Word2Vec model on the tokenized reviews so it learns dense vector representations (embeddings) for words based on the contexts in which they appear.
Step 5: Convert each review into a fixed-length feature vector by averaging the embeddings of the words it contains.
Step 6: Encode the sentiment column as numeric labels (1 for positive, 0 for negative) using LabelEncoder, ensuring compatibility with the deep learning model.
Step 7: Split the dataset into training and test sets (80/20) with train_test_split, using a fixed seed (random_state=42) for reproducibility.
Step 8: Build a neural network with an input layer, two hidden layers using ReLU activation and dropout regularization to curb overfitting, and a sigmoid output layer for binary classification.
Step 9: Train the ANN on the training set with a batch size of 2 for 10 epochs, evaluating on the test set after each epoch to monitor validation performance.
Step 10: Generate predictions on the test set and compute accuracy, F1-score, precision, recall, and a confusion matrix to assess the model's effectiveness in sentiment classification.
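The steps above map one-to-one onto the sample code that follows. Before running it, make sure the required packages are installed, e.g. pip install pandas numpy nltk gensim scikit-learn tensorflow (versions are not pinned here; any reasonably recent releases should work).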
Sample Code
# Import the necessary libraries
import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Model
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                             f1_score, recall_score, precision_score)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
import warnings
warnings.filterwarnings("ignore")
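# Load the IMDB review dataset (adjust the path to your local copy)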
df = pd.read_csv("/home/soft12/Downloads/sample_dataset/Website/Dataset/IMDB Dataset.csv")
# Display initial rows of the dataset
print("Initial data preview:")
print(df.head())
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()
# Define the preprocessing functions
def preprocess_text(text):
    # Lowercase, clean, tokenize, drop stopwords, and lemmatize a single review
    text = text.lower()
    text = clean_text(text)
    tokens = word_tokenize(text)
    stopwords_set = set(stopwords.words('english'))
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stopwords_set]
    return ' '.join(tokens)
def clean_text(text):
    # Remove HTML tags using regex
    text = re.sub(r'<.*?>', '', text)
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Replace non-ASCII characters (periods kept) with a space
    text = re.sub(r'[^\x00-\x7F.]', ' ', text)
    # Remove special characters except periods
    text = re.sub(f'[{re.escape(string.punctuation.replace(".", ""))}]', '', text)
    # Remove isolated numbers
    text = re.sub(r'\b\d+\b', '', text)
    # Replace runs of two or more periods with a single space
    text = re.sub(r'\.{2,}', ' ', text)
    # Collapse whitespace after periods
    text = re.sub(r'(?<=\.)\s+', ' ', text).strip()
    return text
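# Illustrative example on a made-up review (traced through the rules above):
#   preprocess_text("<b>Great movie!</b> Visit http://example.com")
#   -> "great movie visit"   (tags, URL, and punctuation stripped; no stopwords among the tokens)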
# Apply preprocessing to the text column
text_data = df['review'].apply(preprocess_text)
def train_word2vec_model(text_data):
    # Tokenize the preprocessed text and train a Word2Vec model
    # (100-dimensional vectors, 5-word context window; min_count=1 keeps every token)
    tokenized_data = [text.split() for text in text_data]
    model = Word2Vec(tokenized_data, vector_size=100, window=5, min_count=1, workers=4)
    return model
word2vec_model = train_word2vec_model(text_data)
def vectorize_text_with_word2vec(text, model):
    tokens = text.split()
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if len(vectors) == 0:
        return np.zeros(model.vector_size)  # Return a zero vector if no tokens are in the vocabulary
    return np.mean(vectors, axis=0)
# Apply Word2Vec vectorization to the text column
word2vec_features = np.array([vectorize_text_with_word2vec(text, word2vec_model) for text in text_data])
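# Each review is now a single fixed-length vector (shape: number of reviews x 100)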
label = LabelEncoder()
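# LabelEncoder assigns classes alphabetically: 'negative' -> 0, 'positive' -> 1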
y = label.fit_transform(df['sentiment'])
# Split the data into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(word2vec_features, y, test_size=0.2, random_state=42)
def ANN_model(input_shape):
    # Input layer
    inputs = Input(shape=(input_shape,))
    # Hidden layers with dropout regularization
    layer1 = Dense(64, activation='relu')(inputs)
    dropout1 = Dropout(0.2)(layer1)
    layer2 = Dense(32, activation='relu')(dropout1)
    dropout2 = Dropout(0.2)(layer2)
    # Output layer for binary classification
    output_layer = Dense(1, activation='sigmoid')(dropout2)
    # Build the model
    ann_model = Model(inputs=inputs, outputs=output_layer)
    # Compile the model with the Adam optimizer and binary cross-entropy loss
    ann_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return ann_model
model = ANN_model(X_train.shape[1])
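# Optional: inspect layer shapes and parameter counts before training
model.summary()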
# Train for 10 epochs with batch size 2, as described in Step 9 (a larger batch size would train faster)
model.fit(X_train, y_train, batch_size=2, epochs=10, validation_data=(X_test, y_test))
# Predict on the test set and threshold the sigmoid outputs at 0.5
y_pred = (model.predict(X_test).ravel() > 0.5).astype(int)
print("___Performance_Metrics___\n")
print('Classification_Report:\n',classification_report(y_test, y_pred))
print('Confusion_Matrix:\n',confusion_matrix(y_test, y_pred))
print('Accuracy_Score: ',accuracy_score(y_test, y_pred))
print('F1_Score: ',f1_score(y_test, y_pred))
print('Recall_Score: ',recall_score(y_test, y_pred))
print('Precision_Score: ',precision_score(y_test, y_pred))
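As a quick usage check, the trained pipeline can score an unseen review end to end. Below is a minimal sketch, assuming the objects defined above (preprocess_text, vectorize_text_with_word2vec, word2vec_model, model, and label) are still in scope; the review text itself is made up for illustration.
# Score a new, made-up review with the trained pipeline
new_review = "The plot was predictable, but the performances were genuinely moving."
new_vector = vectorize_text_with_word2vec(preprocess_text(new_review), word2vec_model)
probability = model.predict(new_vector.reshape(1, -1))[0][0]  # sigmoid output in [0, 1]
predicted = label.inverse_transform([int(probability > 0.5)])[0]  # back to 'positive'/'negative'
print(f"Predicted sentiment: {predicted} (p={probability:.3f})")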