How to Perform Sentiment Analysis on Amazon Product Reviews Using Keras and Deep Learning?
Overview
Description: This example preprocesses Amazon product review data by cleaning and tokenizing the text, then vectorizes it with TF-IDF. It splits the dataset into training and testing sets and trains a simple artificial neural network (ANN) to classify each review into one of two sentiment classes.
Step-by-Step Process
Step 1: Import the necessary libraries for data manipulation (pandas), text processing (nltk, re, string), machine learning (sklearn), and deep learning (tensorflow).
Step 2: Download the NLTK resources: stopwords, the punkt tokenizer, and the WordNet lemmatizer.
Step 3: Load the Amazon product review dataset from a CSV file and display the first few rows for inspection.
Step 4: Define and apply a preprocessing function to clean and tokenize the review text. This involves converting to lowercase, removing HTML tags, URLs, non-ASCII characters, special characters, and stopwords, and lemmatizing the remaining tokens.
Step 5: Use TfidfVectorizer to convert the preprocessed text into a sparse matrix of TF-IDF features, representing each review numerically.
Step 6: Split the dataset into training and testing sets using train_test_split.
Step 7: Define a simple artificial neural network (ANN) with two hidden layers, dropout layers for regularization, and a sigmoid output layer for binary classification.
Step 8: Compile the model with the Adam optimizer, binary cross-entropy loss, and accuracy as the metric.
Step 9: Train the model on the training set for 10 epochs while validating on the test set.
Step 10: Make predictions on the test data, convert them to 0 or 1 using a 0.5 threshold, and evaluate the model with accuracy, F1 score, recall, precision, and a confusion matrix.
Sample Code
# Import necessary libraries
import pandas as pd
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Model
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                             f1_score, recall_score, precision_score)
import warnings
warnings.filterwarnings("ignore")
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
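nltk.download('punkt_tab')  # some newer NLTK releases also require this resource for word_tokenize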
df = pd.read_csv("/home/soft12/Downloads/sample_dataset/Website/Dataset/amazon.csv")
# Display initial rows of the dataset
print("Initial data preview:")
print(df.head())
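# Note: the script below assumes the CSV has a 'Text' column (review text) and a
# binary 'label' column (target); adjust these column names if your file differs.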
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()
# Define the preprocessing functions
def preprocess_text(text):
    text = text.lower()
    text = clean_text(text)
    tokens = word_tokenize(text)
    stopwords_set = set(stopwords.words('english'))
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stopwords_set]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text

def clean_text(text):
    # Remove HTML tags using regex
    text = re.sub(r'<.*?>', '', text)
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Replace non-ASCII characters with spaces
    text = re.sub(r'[^\x00-\x7F.]', ' ', text)
    # Remove special characters except periods
    text = re.sub(f'[{re.escape(string.punctuation.replace(".", ""))}]', '', text)
    # Remove isolated numbers
    text = re.sub(r'\b\d+\b', '', text)
    # Replace multiple periods with a single space
    text = re.sub(r'\.{2,}', ' ', text)
    # Collapse whitespace after periods
    text = re.sub(r'(?<=\.)\s+', ' ', text).strip()
    return text
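# Illustrative check: the pipeline lowercases, strips markup, drops stopwords and
# punctuation, and lemmatizes, e.g.
# preprocess_text("This product is <b>great</b> - loved it!")  ->  "product great loved"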
# Apply preprocessing to the text column
text_data = df['Text'].apply(preprocess_text)
tfidf_vectorizer = TfidfVectorizer(max_features=250)
tfidf_features = tfidf_vectorizer.fit_transform(text_data)
features = tfidf_features.toarray()
# Split into training and testing sets (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(features, df['label'], test_size=0.2, random_state=42)
def ANN_model(input_shape):
    # Input layer
    inputs = Input(shape=(input_shape,))
    # Hidden layers with dropout for regularization
    layer1 = Dense(64, activation='relu')(inputs)
    drop1 = Dropout(0.2)(layer1)
    layer2 = Dense(32, activation='relu')(drop1)
    drop2 = Dropout(0.2)(layer2)
    # Output layer for binary classification
    output_layer = Dense(1, activation='sigmoid')(drop2)
    # Build the model
    ann_model = Model(inputs=inputs, outputs=output_layer)
    # Compile the model with the Adam optimizer and binary cross-entropy loss
    ann_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return ann_model
model = ANN_model(X_train.shape[1])
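# Optionally inspect the architecture and parameter counts
model.summary()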
model.fit(X_train, y_train, batch_size=2, epochs=10, validation_data=(X_test, y_test))
y_pred = model.predict(X_test)
# Convert predicted probabilities to class labels using a 0.5 threshold
y_pred = [1 if p > 0.5 else 0 for p in y_pred]
print("___Performance_Metrics___\n")
print('Classification_Report:\n',classification_report(y_test, y_pred))
print('Confusion_Matrix:\n',confusion_matrix(y_test, y_pred))
print('Accuracy_Score: ',accuracy_score(y_test, y_pred))
print('F1_Score: ',f1_score(y_test, y_pred))
print('Recall_Score: ',recall_score(y_test, y_pred))
print('Precision_Score: ',precision_score(y_test, y_pred))
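Scoring a New Review
Once trained, the same preprocessing function, fitted TfidfVectorizer, and model can be reused to classify unseen text. The snippet below is a minimal sketch that assumes the objects defined above (preprocess_text, tfidf_vectorizer, model) are still in scope; the review string is a hypothetical example.
# Classify a single unseen review (illustrative)
new_review = "The product stopped working after two days and support never replied."
new_features = tfidf_vectorizer.transform([preprocess_text(new_review)]).toarray()
probability = model.predict(new_features)[0][0]
predicted_label = 1 if probability > 0.5 else 0
print("Predicted label:", predicted_label, "| probability:", probability)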