How to Build a Spam Detector Using a Multi-Layer Perceptron and TF-IDF Vectorizer in Python?
Condition for Building a Spam Detector Using a Multi-Layer Perceptron and TF-IDF Vectorizer in Python
Description: This code implements a spam detection model using a multi-layer perceptron (MLP) classifier. It preprocesses the text by removing noise, tokenizing, lemmatizing, and applying TF-IDF vectorization, then trains the MLP on the processed data. The model's performance is evaluated with accuracy, F1-score, precision, and recall.
Step-by-Step Process
Step1: Import the necessary libraries (pandas, re, string, nltk, and scikit-learn, including MLPClassifier) for text processing and machine learning tasks.
Step2: Download required NLTK resources like stopwords, punkt, and wordnet for text processing.
Step3: Load the spam dataset into a pandas DataFrame using read_csv.
Step4: Define functions to clean and preprocess text, including removing HTML tags, URLs, special characters, and stopwords.
Step5: Tokenize the text and apply lemmatization to reduce words to their base form.
Step6: Use TfidfVectorizer to convert text data into numerical vectors with a maximum of 250 features.
Step7: Convert the target labels (ham and spam) into numeric labels using LabelEncoder.
Step8: Split the preprocessed data into training and testing sets using train_test_split.
Step9: Create an MLPClassifier with four hidden layers (128, 64, 32, and 16 units), ReLU activation, and the Adam optimizer.
Step10: Train the model on the training data, make predictions, and evaluate the model's performance using metrics like accuracy, precision, and recall.
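Steps 6 to 10 can also be sketched compactly with a scikit-learn Pipeline. This is an alternative arrangement, not the layout used in the sample code: the pipeline fits the TF-IDF vocabulary on the training split only. The toy messages, the smaller hidden layer, and the higher max_iter are assumptions chosen so the example runs quickly on a handful of rows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, only for illustration (not the real spam.csv)
messages = [
    "win a free prize now", "claim your free cash reward",
    "urgent call this number to win", "free entry in a weekly prize draw",
    "are we still meeting for lunch", "i will call you after work",
    "see you at home tonight", "thanks for the notes from class",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

# LabelEncoder sorts the classes, so ham -> 0 and spam -> 1
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Split the raw text first so the vectorizer never sees the test messages
X_train, X_test, y_train, y_test = train_test_split(
    messages, y, test_size=0.25, random_state=42, stratify=y
)

# Vectorizer and classifier are fitted together on the training split
pipeline = make_pipeline(
    TfidfVectorizer(max_features=250),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=42),
)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(encoder.inverse_transform(predictions))
```

On the real dataset, the preprocessed text column would take the place of messages and the hidden layer sizes from Step9 would replace the single small layer.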
Sample Code
#Import Necessary Libraries
import pandas as pd
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings("ignore")
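from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                             f1_score, recall_score, precision_score)
# Download the NLTK resources used below
nltk.download('stopwords')
nltk.download('punkt')
# Assumption: recent NLTK releases load the tokenizer data from 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab')
nltk.download('wordnet')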
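# Load the dataset; adjust the path to wherever spam.csv is stored on your machine
df = pd.read_csv("/home/soft12/Downloads/sample_dataset/Website/Dataset/spam.csv", encoding='latin1')
x = df['v2']  # message text
y = df['v1']  # label: ham or spam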
# Initialize the lemmatizer and build the stopword set once
lemmatizer = WordNetLemmatizer()
stopwords_set = set(stopwords.words('english'))

# Define the preprocessing functions
def clean_text(text):
    # Remove HTML tags using regex
    text = re.sub(r'<.*?>', '', text)
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Replace non-ASCII characters with a space
    text = re.sub(r'[^\x00-\x7F.]', ' ', text)
    # Remove punctuation except periods
    text = re.sub(f'[{re.escape(string.punctuation.replace(".", ""))}]', '', text)
    # Remove isolated numbers
    text = re.sub(r'\b\d+\b', '', text)
    # Replace runs of two or more periods with a single space
    text = re.sub(r'\.{2,}', ' ', text)
    # Collapse whitespace after a period to a single space, then trim the ends
    text = re.sub(r'(?<=\.)\s+', ' ', text).strip()
    return text

def preprocess_text(text):
    # Lowercase, clean, tokenize, drop stopwords, and lemmatize
    text = text.lower()
    text = clean_text(text)
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stopwords_set]
    preprocessed_text = ' '.join(tokens)
    return preprocessed_text
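# Apply preprocessing to the text column
text_data = x.apply(preprocess_text)
# Convert the text into TF-IDF vectors (at most 250 features)
# Note: the vectorizer is fitted on the full dataset here, so test-set term statistics
# influence the features; fitting it on the training split only avoids that
tfidf_vectorizer = TfidfVectorizer(max_features=250)
tfidf_features = tfidf_vectorizer.fit_transform(text_data)
features = tfidf_features.toarray()
# Convert the string labels into numeric labels
label = LabelEncoder()
y = label.fit_transform(y)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, y, test_size=0.2, random_state=42)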
# Define the MLP Classifier
mlp = MLPClassifier(hidden_layer_sizes=(128, 64, 32, 16),
                    activation='relu',
                    solver='adam',
                    max_iter=100,
                    batch_size=64,
                    random_state=42,
                    verbose=True)
# Train the model
mlp.fit(X_train, y_train)
# Make predictions
y_pred = mlp.predict(X_test)
# Evaluate the model
# LabelEncoder assigns ham -> 0 and spam -> 1, so the F1, recall, and precision
# scores below are reported for the spam class
print("___Performance_Metrics___\n")
print('Classification_Report:\n', classification_report(y_test, y_pred))
print('Confusion_Matrix:\n', confusion_matrix(y_test, y_pred))
print('Accuracy_Score: ', accuracy_score(y_test, y_pred))
print('F1_Score: ', f1_score(y_test, y_pred))
print('Recall_Score: ', recall_score(y_test, y_pred))
print('Precision_Score: ', precision_score(y_test, y_pred))
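Once a model is trained, new messages have to pass through the same preprocessing and the already fitted vectorizer before prediction. The sketch below shows one way to wrap that. The helper classify_messages is hypothetical and not part of the sample code, and the toy training data, the small network, and str.lower (standing in for preprocess_text) are assumptions that keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder


def classify_messages(messages, preprocess, vectorizer, model, encoder):
    """Return a text label for each raw message (hypothetical helper)."""
    cleaned = [preprocess(message) for message in messages]
    features = vectorizer.transform(cleaned)  # transform only, never refit
    predicted = model.predict(features)
    return encoder.inverse_transform(predicted).tolist()


# Toy stand-ins for the fitted objects from the sample code
train_texts = [
    "win a free prize now", "claim your free cash reward",
    "are we still meeting for lunch", "i will call you after work",
]
train_labels = ["spam", "spam", "ham", "ham"]

demo_encoder = LabelEncoder()
demo_vectorizer = TfidfVectorizer()
demo_model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=42)
demo_model.fit(
    demo_vectorizer.fit_transform(train_texts),
    demo_encoder.fit_transform(train_labels),
)

results = classify_messages(
    ["Free prize, claim now", "See you at lunch"],
    str.lower,  # stand-in for preprocess_text
    demo_vectorizer, demo_model, demo_encoder,
)
print(results)
```

With the objects from the sample code, the call would pass preprocess_text, tfidf_vectorizer, mlp, and label in place of the toy stand-ins.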