How to Perform Sentiment Analysis on Food Reviews Using Word2Vec and Keras with Deep Learning
Description: This code preprocesses a food review dataset by cleaning, tokenizing, and lemmatizing the text, then trains a Word2Vec model to generate dense vector representations of the reviews. A simple artificial neural network (ANN) is then trained on these Word2Vec features for binary sentiment classification.
Step-by-Step Process
Step 1: Import the necessary libraries: pandas for data handling, nltk for text processing, gensim for Word2Vec, and tensorflow for building the neural network model.
Step 2: Download the required NLTK resources: stopwords, the punkt tokenizer, and the WordNet lemmatizer used in preprocessing.
Step 3: Load the food review dataset from a CSV file into a DataFrame and display the first few rows to understand its structure.
Step 4: Create functions to clean the text (removing HTML tags, URLs, special characters, etc.) and preprocess it by lowercasing, tokenizing, and lemmatizing while removing stopwords.
Step 5: Apply the preprocessing functions to the Text column of the dataset, producing cleaned and tokenized text.
Step 6: Train a Word2Vec model on the preprocessed text to generate word embeddings that capture semantic relationships between words.
Step 7: Convert each review into a dense vector by averaging the Word2Vec embeddings of the words it contains.
Step 8: Split the dataset into training and testing sets with train_test_split, reserving a portion of the data for evaluation.
Step 9: Define a simple artificial neural network (ANN) with two hidden layers, each followed by a dropout layer to reduce overfitting, and compile it with the Adam optimizer and binary cross-entropy loss.
Step 10: Train the ANN on the Word2Vec feature vectors and evaluate it on the test set using accuracy, F1 score, recall, and precision.
Sample Code
# Import necessary libraries
import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Model
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                             f1_score, recall_score, precision_score)
import warnings
warnings.filterwarnings("ignore")
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
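# Note: newer NLTK releases ship the sentence tokenizer data as 'punkt_tab';
# if word_tokenize raises a LookupError, also run nltk.download('punkt_tab')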
df = pd.read_csv("/home/soft12/Downloads/sample_dataset/Website/Dataset/amazon.csv")
# Display initial rows of the dataset
print("Initial data preview:")
print(df.head())
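# Assumption: the CSV provides the review text in a 'Text' column and a
# binary sentiment target in a 'label' column, as used below; rename these
# if your copy of the dataset uses different column names.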
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()
# Define the preprocessing functions
def preprocess_text(text):
    # Lowercase, clean, tokenize, drop stopwords, and lemmatize
    text = text.lower()
    text = clean_text(text)
    tokens = word_tokenize(text)
    stopwords_set = set(stopwords.words('english'))
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stopwords_set]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text
def clean_text(text):
    # Remove HTML tags using regex
    text = re.sub(r'<.*?>', '', text)
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove non-ASCII characters except periods
    text = re.sub(r'[^\x00-\x7F.]', ' ', text)
    # Remove special characters except periods
    text = re.sub(f'[{re.escape(string.punctuation.replace(".", ""))}]', '', text)
    # Remove isolated numbers
    text = re.sub(r'\b\d+\b', '', text)
    # Replace multiple periods with a single space
    text = re.sub(r'\.{2,}', ' ', text)
    # Remove extra spaces after periods
    text = re.sub(r'(?<=\.)\s+', ' ', text).strip()
    return text
# Apply preprocessing to the text column
text_data = df['Text'].apply(preprocess_text)
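# Illustrative example of the full pipeline on a made-up review (not from the
# dataset): preprocess_text("I LOVED this <b>pasta</b>! 10/10") -> 'loved pasta'
print("Sample preprocessed review:", text_data.iloc[0][:100])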
def train_word2vec_model(text_data):
    # Tokenize the preprocessed text and train a Word2Vec model
    tokenized_data = [text.split() for text in text_data]
    model = Word2Vec(tokenized_data, vector_size=100, window=5, min_count=1, workers=4)
    return model
word2vec_model = train_word2vec_model(text_data)
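# Optional sanity check on the embeddings; 'good' is an assumed vocabulary
# word (common in food reviews) and may be absent from other datasets
if 'good' in word2vec_model.wv:
    print(word2vec_model.wv.most_similar('good', topn=5))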
def vectorize_text_with_word2vec(text, model):
    tokens = text.split()
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if len(vectors) == 0:
        return np.zeros(model.vector_size)  # Return a zero vector if no tokens are in the model
    return np.mean(vectors, axis=0)
# Apply Word2Vec vectorization to the text column
word2vec_features = np.array([vectorize_text_with_word2vec(text, word2vec_model) for text in text_data])
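# Each review is now a single 100-dimensional vector (the Word2Vec vector_size)
print("Word2Vec feature matrix shape:", word2vec_features.shape)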
# Split the data into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(word2vec_features, df['label'], test_size=0.2, random_state=42)
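# If the positive/negative classes are imbalanced, passing stratify=df['label']
# to train_test_split keeps the class ratio consistent across both splits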
def ANN_model(input_shape):
    # Input layer
    inputs = Input(shape=(input_shape,))
    # Hidden layers, each followed by dropout to reduce overfitting
    layer1 = Dense(64, activation='relu')(inputs)
    dropout1 = Dropout(0.2)(layer1)
    layer2 = Dense(32, activation='relu')(dropout1)
    dropout2 = Dropout(0.2)(layer2)
    # Output layer
    output_layer = Dense(1, activation='sigmoid')(dropout2)
    # Build the model
    ann_model = Model(inputs=inputs, outputs=output_layer)
    # Compile the model with the Adam optimizer and binary cross-entropy loss
    ann_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return ann_model
model = ANN_model(X_train.shape[1])
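# Inspect layer shapes and parameter counts before training
model.summary()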
model.fit(X_train, y_train, batch_size=2, epochs=10, validation_data=(X_test, y_test))
# Predict probabilities and threshold at 0.5 to get binary class labels
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
print("___Performance_Metrics___\n")
print('Classification_Report:\n',classification_report(y_test, y_pred))
print('Confusion_Matrix:\n',confusion_matrix(y_test, y_pred))
print('Accuracy_Score: ',accuracy_score(y_test, y_pred))
print('F1_Score: ',f1_score(y_test, y_pred))
print('Recall_Score: ',recall_score(y_test, y_pred))
print('Precision_Score: ',precision_score(y_test, y_pred))