How to Build a Spam Detector Using a Multi-Layer Perceptron and TF-IDF Vectorizer in Python?

Condition for Building a Spam Detector Using a Multi-Layer Perceptron and TF-IDF Vectorizer in Python

  • Description:
    This code implements a spam detection model using a Multi-Layer Perceptron (MLP) classifier. It preprocesses the text by removing noise, tokenizing, and lemmatizing, then applies TF-IDF vectorization and trains the MLP on the resulting features. The model's performance is evaluated with accuracy, F1-score, precision, and recall, along with a classification report and a confusion matrix.
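To make the evaluation metrics concrete, here is a minimal sketch that computes them by hand on a small, made-up set of labels. The function name `binary_metrics` and the toy label lists are hypothetical and not part of the original sample; spam is treated as the positive class (1), which matches how LabelEncoder orders the labels alphabetically (ham = 0, spam = 1).

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 0 = ham, 1 = spam
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case one ham message is flagged as spam and one spam message is missed, so precision and recall are both 0.75 while accuracy is 0.8. On a dataset where one class dominates, accuracy alone can look good even when the minority class is handled poorly, which is why precision and recall are reported alongside it.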
Step-by-Step Process
  • Step 1: Import the required libraries: pandas for data handling, nltk for text processing, and scikit-learn (including MLPClassifier) for vectorization, modeling, and evaluation.
  • Step 2: Download the NLTK resources needed for preprocessing: stopwords, punkt, and wordnet.
  • Step 3: Load the spam dataset into a pandas DataFrame using read_csv.
  • Step 4: Define functions to clean and preprocess text, including removing HTML tags, URLs, special characters, and stopwords.
  • Step 5: Tokenize the text and lemmatize each token to reduce words to their base form.
  • Step 6: Use TfidfVectorizer to convert the text into numerical vectors, keeping at most 250 features.
  • Step 7: Convert the target labels (ham and spam) into numeric labels using LabelEncoder.
  • Step 8: Split the vectorized data into training and testing sets using train_test_split (80% training, 20% testing).
  • Step 9: Create an MLPClassifier with four hidden layers (128, 64, 32, and 16 units), ReLU activation, and the Adam optimizer.
  • Step 10: Train the model on the training data, predict on the test set, and evaluate the model with accuracy, precision, recall, and F1-score.
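The TF-IDF step can be illustrated without any external library. The sketch below is a simplified, pure-Python version of the weighting that TfidfVectorizer applies with its default settings (smoothed IDF followed by L2 normalization of each row). The function name `tfidf` and the three example messages are hypothetical, and whitespace splitting stands in for the real tokenizer.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, idf weights, L2-normalized TF-IDF rows)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears
    df = Counter(tok for doc in tokenized for tok in set(doc))

    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, idf, rows


# Hypothetical, already-cleaned messages
docs = ["free prize call now", "call me later", "free entry win prize now"]
vocab, idf, rows = tfidf(docs)

print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Terms that occur in fewer messages (such as "later") receive a larger IDF weight than terms shared by several messages (such as "call"). In the real vectorizer, max_features=250 additionally restricts the vocabulary to the 250 terms with the highest total count across the corpus.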
Sample Code
  • # Import necessary libraries
    import re
    import string
    import warnings
    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import (classification_report, confusion_matrix,
                                 accuracy_score, f1_score, recall_score,
                                 precision_score)
    warnings.filterwarnings("ignore")
    # Download the NLTK resources used below
    nltk.download('stopwords')
    nltk.download('punkt')
    nltk.download('punkt_tab')  # assumption: newer NLTK releases load the tokenizer from 'punkt_tab'
    nltk.download('wordnet')
    # Load the dataset (adjust the path to your local copy of spam.csv)
    df = pd.read_csv("/home/soft12/Downloads/sample_dataset/Website/Dataset/spam.csv",
                     encoding='latin1')
    x = df['v2']  # message text
    y = df['v1']  # label: ham or spam
    # Initialize the lemmatizer and build the stopword set once
    lemmatizer = WordNetLemmatizer()
    stopwords_set = set(stopwords.words('english'))
    # Define the preprocessing functions
    def preprocess_text(text):
        # Lowercase, clean, tokenize, drop stopwords, and lemmatize
        text = text.lower()
        text = clean_text(text)
        tokens = word_tokenize(text)
        tokens = [lemmatizer.lemmatize(word) for word in tokens
                  if word not in stopwords_set]
        return ' '.join(tokens)
    def clean_text(text):
        # Remove HTML tags using regex
        text = re.sub(r'<.*?>', '', text)
        # Remove URLs
        text = re.sub(r'http\S+', '', text)
        # Replace non-ASCII characters with a space
        text = re.sub(r'[^\x00-\x7F.]', ' ', text)
        # Remove punctuation except periods
        text = re.sub(f'[{re.escape(string.punctuation.replace(".", ""))}]', '', text)
        # Remove isolated numbers
        text = re.sub(r'\b\d+\b', '', text)
        # Replace runs of two or more periods with a single space
        text = re.sub(r'\.{2,}', ' ', text)
        # Collapse whitespace after a period to a single space
        text = re.sub(r'(?<=\.)\s+', ' ', text).strip()
        return text
    # Apply preprocessing to the text column
    text_data = x.apply(preprocess_text)
    # Convert the text to TF-IDF vectors (at most 250 features)
    tfidf_vectorizer = TfidfVectorizer(max_features=250)
    tfidf_features = tfidf_vectorizer.fit_transform(text_data)
    X = tfidf_features.toarray()
    # Encode the string labels as integers (ham = 0, spam = 1)
    label = LabelEncoder()
    y = label.fit_transform(y)
    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    # Define the MLP classifier
    mlp = MLPClassifier(hidden_layer_sizes=(128, 64, 32, 16),
                        activation='relu',
                        solver='adam',
                        max_iter=100,
                        batch_size=64,
                        random_state=42,
                        verbose=True)
    # Train the model
    mlp.fit(X_train, y_train)
    # Make predictions on the test set
    y_pred = mlp.predict(X_test)
    # Report the evaluation metrics
    print("___Performance_Metrics___\n")
    print('Classification_Report:\n', classification_report(y_test, y_pred))
    print('Confusion_Matrix:\n', confusion_matrix(y_test, y_pred))
    print('Accuracy_Score: ', accuracy_score(y_test, y_pred))
    print('F1_Score: ', f1_score(y_test, y_pred))
    print('Recall_Score: ', recall_score(y_test, y_pred))
    print('Precision_Score: ', precision_score(y_test, y_pred))
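One point worth noting about the sample above: the TfidfVectorizer is fitted on the full dataset before the train/test split, so vocabulary and IDF statistics from the test messages influence the training features. A possible variant, sketched below, wraps the vectorizer and classifier in a scikit-learn pipeline so that both are fitted on training text only and raw strings can be classified directly. The toy messages, the single 16-unit hidden layer, and the name `spam_pipeline` are illustrative assumptions, and the NLTK preprocessing is omitted to keep the sketch self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages; in practice use the preprocessed training split
train_texts = [
    "win free prize now",
    "free entry claim cash prize",
    "urgent call now to claim your free reward",
    "cash prize waiting call now",
    "are we meeting for lunch today",
    "see you at home tonight",
    "can you call me after work",
    "thanks for the lunch yesterday",
]
train_labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# The vectorizer is fitted inside the pipeline, on training text only
spam_pipeline = make_pipeline(
    TfidfVectorizer(max_features=250),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=42),
)
spam_pipeline.fit(train_texts, train_labels)

# Classify unseen raw messages with the same fitted vocabulary
new_messages = ["claim your free cash prize now", "lunch at home today"]
predictions = spam_pipeline.predict(new_messages)
print(list(zip(new_messages, predictions)))
```

With the real dataset, the same pipeline could be fitted on the training portion of the preprocessed messages and scored on the held-out portion, which keeps the evaluation independent of the test text.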
Screenshots
  • Spam Detector using Multi Layer Perceptron