How to Perform Sentiment Analysis on Food Reviews Using Word2Vec and Keras with Deep Learning?

Condition for Performing Sentiment Analysis on Food Reviews Using Word2Vec and Keras with Deep Learning

  • Description:
    This code preprocesses a food review dataset by cleaning, tokenizing, and lemmatizing the text, then trains a Word2Vec model to generate dense word vectors. Each review is represented by the average of its word vectors, and a simple artificial neural network (ANN) is trained on these features to classify reviews into two categories.
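
The cleaning stage described above can be illustrated with a minimal sketch that uses only the standard library. The helper name clean_review and the sample sentence are hypothetical, and this simplified version drops all punctuation, so it is not a copy of the cleaning function in the sample code.

```python
import re
import string

def clean_review(text):
    """Hypothetical helper: a simplified version of the cleaning steps."""
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)        # drop HTML tags
    text = re.sub(r"http\S+", " ", text)      # drop URLs
    # Drop every punctuation character (assumption: periods are not needed)
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    text = re.sub(r"\b\d+\b", " ", text)      # drop standalone numbers
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

sample = "Great taste!<br />Bought 2 bags, see http://example.com"
cleaned = clean_review(sample)
print(cleaned)  # great taste bought bags see
```

Tags and URLs are replaced with a space rather than an empty string so that words on either side of a tag are not glued together.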
Step-by-Step Process
  • Step1: Import the necessary libraries: pandas for data handling, nltk for text processing, gensim for Word2Vec, scikit-learn for data splitting and evaluation metrics, and tensorflow for building the neural network model.
  • Step2: Download required NLTK resources like stopwords, tokenizer (punkt), and wordnet lemmatizer for text preprocessing.
  • Step3: Load the food review dataset from a CSV file into a DataFrame and display the first few rows to understand its structure.
  • Step4: Create functions to clean the text (removing HTML tags, URLs, special characters, etc.) and preprocess the text by lowercasing, tokenizing, and lemmatizing while removing stopwords.
  • Step5: Apply the preprocessing functions to the Text column of the dataset, producing a cleaned, lemmatized string for each review.
  • Step6: Train a Word2Vec model on the preprocessed text data to generate word embeddings, capturing semantic relationships between words.
  • Step7: Convert each review into a dense vector by averaging the Word2Vec embeddings of the words present in the review.
  • Step8: Split the dataset into training and testing sets using train_test_split, so that the training data is used to fit the model and the test data is held out for evaluation.
  • Step9: Define a simple artificial neural network (ANN) model with two hidden layers, each followed by a dropout layer to reduce overfitting. Compile the model using the Adam optimizer and the binary cross-entropy loss function.
  • Step10: Train the ANN model on the Word2Vec feature vectors and evaluate its performance on the test set using classification metrics such as accuracy, F1 score, recall, and precision.
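
Step7 is the least obvious step, so here is a minimal sketch of it on toy data. The three-dimensional vectors in toy_vectors and the helper name review_vector are hypothetical stand-ins for a trained Word2Vec model; only the averaging logic is the point.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained Word2Vec model
toy_vectors = {
    "tasty": np.array([1.0, 0.0, 2.0]),
    "snack": np.array([3.0, 2.0, 0.0]),
}

def review_vector(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Out-of-vocabulary tokens are skipped, so they do not affect the average
print(review_vector(["tasty", "snack", "unknownword"], toy_vectors))  # [2. 1. 1.]
print(review_vector(["unknownword"], toy_vectors))                    # [0. 0. 0.]
```

Averaging gives every review a fixed-length vector regardless of its word count, which is what lets a plain dense network consume it. The trade-off is that word order is discarded.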
Sample Code
  • #Import Necessary Libraries
    import pandas as pd
    import numpy as np
    import re
    import string
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    from gensim.models import Word2Vec
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.layers import Dense, Dropout, Input
    from tensorflow.keras.models import Model
    from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                                 f1_score, recall_score, precision_score)
    import warnings
    warnings.filterwarnings("ignore")
    # Download the NLTK resources used for preprocessing
    nltk.download('stopwords')
    nltk.download('punkt')
    # Assumption: newer NLTK releases load the tokenizer data from 'punkt_tab'
    nltk.download('punkt_tab')
    nltk.download('wordnet')
    # Load the dataset (adjust the path to where your copy of the CSV is stored)
    df = pd.read_csv("/home/soft12/Downloads/sample_dataset/Website/Dataset/amazon.csv")
    # Display initial rows of the dataset
    print("Initial data preview:")
    print(df.head())
    # Initialize the lemmatizer and build the stopword set once
    lemmatizer = WordNetLemmatizer()
    stopwords_set = set(stopwords.words('english'))
    # Define the preprocessing functions
    def preprocess_text(text):
        # Lowercase, clean, tokenize, drop stopwords, and lemmatize
        text = text.lower()
        text = clean_text(text)  # clean_text is defined just below
        tokens = word_tokenize(text)
        tokens = [lemmatizer.lemmatize(word) for word in tokens
                  if word not in stopwords_set]
        return ' '.join(tokens)
    def clean_text(text):
        # Remove HTML tags using regex
        text = re.sub(r'<.*?>', '', text)
        # Remove URLs
        text = re.sub(r'http\S+', '', text)
        # Replace non-ASCII characters with a space
        text = re.sub(r'[^\x00-\x7F.]', ' ', text)
        # Remove special characters except periods
        text = re.sub(f'[{re.escape(string.punctuation.replace(".", ""))}]', '', text)
        # Remove isolated numbers
        text = re.sub(r'\b\d+\b', '', text)
        # Replace runs of two or more periods with a single space
        text = re.sub(r'\.{2,}', ' ', text)
        # Collapse whitespace after a period to one space and trim the ends
        text = re.sub(r'(?<=\.)\s+', ' ', text).strip()
        return text
    # Apply preprocessing to the text column
    text_data = df['Text'].apply(preprocess_text)
    def train_word2vec_model(text_data):
        # Split each preprocessed review into tokens and train a Word2Vec model
        tokenized_data = [text.split() for text in text_data]
        model = Word2Vec(tokenized_data, vector_size=100, window=5, min_count=1,
                         workers=4)
        return model
    word2vec_model = train_word2vec_model(text_data)
    def vectorize_text_with_word2vec(text, model):
        # Average the embeddings of all tokens that the model knows
        tokens = text.split()
        vectors = [model.wv[token] for token in tokens if token in model.wv]
        if len(vectors) == 0:
            # Return a zero vector if no tokens are in the model
            return np.zeros(model.vector_size)
        return np.mean(vectors, axis=0)
    # Apply Word2Vec vectorization to the text column
    word2vec_features = np.array([vectorize_text_with_word2vec(text, word2vec_model)
                                  for text in text_data])
    # Split the features and labels into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        word2vec_features, df['label'], test_size=0.2, random_state=42)
    def ANN_model(input_shape):
        # Input layer
        inputs = Input(shape=(input_shape,))
        # Hidden layers, each followed by dropout
        layer1 = Dense(64, activation='relu')(inputs)
        dropout1 = Dropout(0.2)(layer1)
        layer2 = Dense(32, activation='relu')(dropout1)
        dropout2 = Dropout(0.2)(layer2)
        # Output layer: a single sigmoid unit for binary classification
        output_layer = Dense(1, activation='sigmoid')(dropout2)
        # Build the model
        ann_model = Model(inputs=inputs, outputs=output_layer)
        # Compile the model with Adam optimizer and binary crossentropy loss function
        ann_model.compile(optimizer='adam', loss='binary_crossentropy',
                          metrics=['accuracy'])
        return ann_model
    # Build and train the model; the test split is also used for validation here
    model = ANN_model(X_train.shape[1])
    model.fit(X_train, y_train, batch_size=2, epochs=10, validation_data=(X_test, y_test))
    # Convert predicted probabilities to class labels with a 0.5 threshold
    y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
    print("___Performance_Metrics___\n")
    print('Classification_Report:\n', classification_report(y_test, y_pred))
    print('Confusion_Matrix:\n', confusion_matrix(y_test, y_pred))
    print('Accuracy_Score: ', accuracy_score(y_test, y_pred))
    print('F1_Score: ', f1_score(y_test, y_pred))
    print('Recall_Score: ', recall_score(y_test, y_pred))
    print('Precision_Score: ', precision_score(y_test, y_pred))
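
To make the reported scores easier to interpret, the sketch below computes accuracy, precision, recall, and F1 by hand from the confusion counts. The function name binary_metrics and the label lists are made up for illustration; they are not results from the food review dataset.

```python
def binary_metrics(y_true, y_pred):
    """Compute binary classification metrics from raw confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

Precision is the share of predicted positives that are correct, and recall is the share of actual positives that are found. When the two classes are imbalanced, these say more about the model than accuracy alone.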
Screenshots
  • Sentiment Analysis for Food Review