How to Perform Sentiment Analysis on Amazon Product Reviews Using the Naive Bayes Algorithm in Python?

Sentiment Analysis using Naive Bayes

Performing Sentiment Analysis on Amazon Product Reviews Using the Naive Bayes Algorithm in Python

  • Description: Sentiment analysis uses Natural Language Processing (NLP) to identify and extract subjective information from text. In this project, we perform sentiment analysis on Amazon product reviews using the Naive Bayes algorithm, a popular machine learning model for classification tasks. The sample below treats the task as binary, labeling each review positive or negative; a neutral class can be added in the same way given suitably labeled data.
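As a minimal sketch of the end-to-end flow (assuming scikit-learn is installed; the two-review corpus below is hypothetical, not the Amazon data), the whole approach fits in a few lines:

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical toy corpus: 1 = positive, 0 = negative
    reviews = ["Great product, works perfectly!",
               "Terrible, it stopped working after a day."]
    labels = [1, 0]

    # Vectorize and classify in one pipeline object
    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(reviews, labels)
    print(clf.predict(["Great value, works well."]))  # expected: [1]

The steps below unpack this pipeline into its individual stages.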
Step-by-Step Process
  • Data Collection: Collect Amazon product reviews data (available from Kaggle or similar sources).
  • Data Preprocessing: Clean the text data, remove unwanted characters, tokenize, and convert the text to numerical features using TF-IDF or Bag of Words (a short comparison sketch follows this list).
  • Splitting the Data: Split the dataset into training and test sets (e.g., 80-20 split).
  • Model Training: Train a Naive Bayes classifier using the training data.
  • Model Evaluation: Evaluate the model using metrics like accuracy, precision, recall, and F1-score. Display the results using confusion matrices and classification reports.
  • Prediction: Use the trained model to predict the sentiment of new Amazon product reviews (a minimal continuation sketch appears after the sample code below).
  • Visualization: Visualize the most frequent words in positive and negative reviews and display classification results using bar plots or pie charts.
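As referenced in the preprocessing step above, here is a minimal sketch (on a hypothetical two-document corpus) contrasting the two vectorization options:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["good good product", "bad product"]  # hypothetical mini-corpus

    # Bag of Words: raw term counts per document
    bow = CountVectorizer()
    print(bow.fit_transform(docs).toarray())     # [[0 2 1]
                                                 #  [1 0 1]]
    print(bow.get_feature_names_out())           # ['bad' 'good' 'product']

    # TF-IDF: counts reweighted so a term shared by every document
    # ('product' here) contributes less than distinctive terms
    tfidf = TfidfVectorizer()
    print(tfidf.fit_transform(docs).toarray().round(2))

Either representation plugs into the same classifier; TF-IDF usually helps when very common words would otherwise dominate the counts.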
Why Should We Choose the Naive Bayes Algorithm?
  • Efficiency: Fast to train and predict, suitable for large datasets like Amazon product reviews.
  • Simplicity: Requires less computational power and is easy to implement.
  • Probabilistic Interpretation: Offers a probabilistic framework that works well for text classification tasks (see the short sketch after this list).
  • Good for Text Classification: Performs well with text data, especially when features are independent (Naive Bayes assumption).
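To illustrate the probabilistic interpretation mentioned above, a minimal sketch (again with a hypothetical mini-corpus) shows that MultinomialNB exposes per-class probabilities, not just hard labels:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical mini-corpus: 1 = positive, 0 = negative
    texts = ["love it", "hate it", "love this product", "hate this product"]
    labels = [1, 0, 1, 0]

    vec = TfidfVectorizer()
    clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

    # Columns of predict_proba follow clf.classes_ (here [0, 1]),
    # so each row gives P(negative | review), P(positive | review)
    probs = clf.predict_proba(vec.transform(["love this"]))
    print(clf.classes_, probs.round(3))

These probabilities make it easy to flag low-confidence reviews for manual inspection instead of trusting every hard label.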
Sample Source Code
  • import pandas as pd
    import numpy as np
    import re
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import classification_report, confusion_matrix
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Simulate a small dataset (1 = positive, 0 = negative)
    data = {
        'review': [
            "I love this product, it's amazing!",
            "Worst purchase I ever made, very disappointing.",
            "It's okay, not great but not bad either.",
            "Totally worth the price, I'm so happy with it!",
            "Do not buy this, it broke after one use.",
            "Great quality, I'm definitely buying again.",
            "The worst! Would not recommend to anyone.",
            "Very useful, I use it every day and it's perfect."
        ],
        'sentiment': [1, 0, 1, 1, 0, 1, 0, 1]
    }

    data = pd.DataFrame(data)
    print(data.head())

    # Preprocess the text data: strip URLs and punctuation, lowercase
    def clean_text(text):
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        return text.lower()

    data['cleaned_review'] = data['review'].apply(clean_text)

    # Convert to numerical form using TF-IDF
    vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
    X = vectorizer.fit_transform(data['cleaned_review'])
    y = data['sentiment']

    # Train-test split, stratified so both classes appear in each split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)

    # Initialize and train the Naive Bayes classifier
    model = MultinomialNB()
    model.fit(X_train, y_train)

    # Make predictions on the held-out test set
    y_pred = model.predict(X_test)

    # Evaluate the model (zero_division=0 silences warnings on tiny test sets)
    print("Classification Report:\n",
          classification_report(y_test, y_pred, zero_division=0))

    # labels=[0, 1] pins the matrix to a fixed Negative/Positive order
    cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Negative', 'Positive'],
                yticklabels=['Negative', 'Positive'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    plt.show()

    # Rank features by how strongly they favor the positive class:
    # feature_log_prob_ holds per-class log probabilities, and
    # model.classes_ is [0, 1], so index 1 is the positive class
    log_probs = model.feature_log_prob_
    features = vectorizer.get_feature_names_out()
    sorted_idx = np.argsort(log_probs[1] - log_probs[0])[::-1]
    top_n = 10
    plt.barh(range(top_n),
             log_probs[1][sorted_idx[:top_n]] - log_probs[0][sorted_idx[:top_n]])
    plt.yticks(range(top_n), features[sorted_idx[:top_n]])
    plt.xlabel('Feature Importance (Log Prob Difference)')
    plt.title('Top 10 Important Features')
    plt.show()
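The prediction step from the list above is not shown in the sample code; a minimal continuation (reusing clean_text, vectorizer, and model from above; the two new reviews are hypothetical) could look like this:

    # Score unseen reviews with the already-fitted vectorizer and model
    new_reviews = [
        "This is the best thing I have ever bought.",
        "Completely useless, it arrived broken.",
    ]
    new_clean = [clean_text(r) for r in new_reviews]
    new_X = vectorizer.transform(new_clean)  # transform only; never re-fit on new data
    for review, label in zip(new_reviews, model.predict(new_X)):
        print('Positive' if label == 1 else 'Negative', '-', review)

Note that new text must go through the same cleaning and the same fitted vectorizer as the training data; re-fitting the vectorizer on new reviews would change the feature space and invalidate the trained model.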
Screenshots
  • Sentiment Analysis Screenshot