How to Build a Spam Email Detection System Using Deep Learning with Keras in Python?

Condition for Building a Spam Email Detection System Using Deep Learning with Keras in Python

  • Description:
    This code implements an SMS spam detection model using an artificial neural network (ANN). It preprocesses the text data by removing noise, tokenizing, and applying lemmatization, then vectorizes the text using TF-IDF. The model is trained on the processed data, and performance is evaluated using classification metrics and a confusion matrix.
Step-by-Step Process
  • Step1: Essential libraries like Pandas, NLTK, Scikit-learn, and Keras are imported for data manipulation, preprocessing, and model building.
  • Step2: The SMS dataset is loaded from a CSV file using Pandas, and its class distribution is visualized to understand class imbalance.
  • Step3: Text data is preprocessed by converting to lowercase, removing HTML tags, URLs, special characters, and stopwords, followed by tokenization and lemmatization.
  • Step4: The preprocessed text is converted into numerical feature vectors with TfidfVectorizer, capped at 250 features (max_features=250).
  • Step5: The processed data is split into training and testing sets, with 80% for training and 20% for testing using train_test_split.
  • Step6: A simple artificial neural network (ANN) model is defined with an input layer, two hidden layers with dropout for regularization, and an output layer using the sigmoid activation function.
  • Step7: The model is compiled using the Adam optimizer, binary cross-entropy loss, and accuracy as a performance metric.
  • Step8: The model is trained on the training data for 10 epochs with a batch size of 2, using the test set as validation data.
  • Step9: The trained model predicts the class labels for the test data, and the results are thresholded (values > 0.5 are classified as spam).
  • Step10: The model's performance is evaluated using metrics like classification report, accuracy, F1 score, recall, precision, and confusion matrix, with results visualized in a heatmap.
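To make Step 4 concrete, here is a minimal pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings: a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. The function name tfidf_vectors, the whitespace tokenisation, and the three sample messages are assumptions made for illustration. The real vectorizer uses its own tokenizer and, in the sample code, a max_features cap.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights per document."""
    # Assumption: simple whitespace tokenisation, unlike scikit-learn's regex tokenizer
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

# Hypothetical, already-cleaned messages
messages = ["free prize call now", "call me later", "free free prize"]
vocabulary, rows = tfidf_vectors(messages)
for message, row in zip(messages, rows):
    print(message, [round(w, 3) for w in row])
```

Terms that occur in fewer messages (such as "now") receive a larger weight than terms shared across messages (such as "call"), which is what lets the network separate the two classes from these features.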
Sample Code
  • #Import Necessary Libraries
    import pandas as pd
    import re
    import string
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings
    warnings.filterwarnings("ignore")
    from tensorflow.keras.layers import Dense,Dropout,Input
    from tensorflow.keras.models import Model
    from sklearn.metrics import (classification_report,confusion_matrix,accuracy_score,
      f1_score,recall_score,precision_score)
    df = pd.read_csv("/home/soft12/Downloads/sample_dataset/Website/Dataset/train.csv")
    # Count the occurrences of each class
    class_counts = df['label'].value_counts()
    # Plotting the class imbalance
    plt.figure(figsize=(8,6))
    sns.barplot(x=class_counts.index, y=class_counts.values)
    plt.title('Class Imbalance')
    plt.xlabel('Class')
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.show()
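    # Split independent (message text) and dependent (label) variables
    x = df['sms']
    # Assumption: 'label' is already numeric (0 = not spam, 1 = spam), as the
    # sigmoid output and the binary metrics used later require
    y = df['label']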
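    # Download the NLTK resources used for stopword removal, tokenization, and lemmatization
    nltk.download('stopwords')
    nltk.download('punkt')
    # Newer NLTK releases load 'punkt_tab' for word_tokenize instead of 'punkt'
    nltk.download('punkt_tab')
    nltk.download('wordnet')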
    # Initialize the lemmatizer and build the stopword set once, not on every call
    lemmatizer = WordNetLemmatizer()
    stopwords_set = set(stopwords.words('english'))
    # Define the preprocessing functions
    def preprocess_text(text):
     # Lowercase, strip noise, tokenize, drop stopwords, then lemmatize
     text = text.lower()
     text = clean_text(text)
     tokens = word_tokenize(text)
     tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stopwords_set]
     preprocessed_text = ' '.join(tokens)
     return preprocessed_text
    def clean_text(text):
     # Remove HTML tags using regex
     text = re.sub(r'<.*?>', '', text)
     # Remove URLs
     text = re.sub(r'http\S+', '', text)
     # Remove non-ASCII characters except periods
     text = re.sub(r'[^\x00-\x7F.]', ' ', text)
     # Remove special characters except periods
     text = re.sub(f'[{re.escape(string.punctuation.replace(".", ""))}]', '', text)
     # Remove isolated numbers
     text = re.sub(r'\b\d+\b', '', text)
     # Replace multiple periods with a single space
     text = re.sub(r'\.{2,}', ' ', text)
     # Remove extra spaces after periods
     text = re.sub(r'(?<=\.)\s+', ' ', text).strip()
     return text
    # Apply preprocessing to the text column
    text_data = x.apply(preprocess_text)
    # Convert the cleaned text into TF-IDF feature vectors
    tfidf_vectorizer = TfidfVectorizer(max_features=250)
    tfidf_features = tfidf_vectorizer.fit_transform(text_data)
    text = tfidf_features.toarray()
    #Split the train_test_data
    X_train,X_test,y_train,y_test = train_test_split(text,y,test_size=.2,random_state=42)
    def ANN_model(input_shape):
     # Input layer
     inputs = Input(shape=(input_shape,))
     # Hidden layers
     layer1 = Dense(32, activation='relu')(inputs)
     Dropout1 = Dropout(0.2)(layer1)
     layer2 = Dense(16, activation='relu')(Dropout1)
     Dropout2 = Dropout(0.2)(layer2)
     # Output layer
     output_layer = Dense(1, activation='sigmoid')(Dropout2)
     # Build the model
     ann_model = Model(inputs=inputs, outputs=output_layer)
     ann_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
     return ann_model
    model = ANN_model(X_train.shape[1])
    model.fit(X_train,y_train,batch_size=2,epochs=10,validation_data=(X_test,y_test))
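    # predict() returns probabilities of shape (n_samples, 1); flatten, then threshold at 0.5
    y_pred = model.predict(X_test)
    y_pred = [1 if p > 0.5 else 0 for p in y_pred.ravel()]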
    print("___Performance_Metrics___\n")
    print('Classification_Report:\n',classification_report(y_test, y_pred))
    print('Confusion_Matrix:\n',confusion_matrix(y_test, y_pred))
    print('Accuracy_Score: ',accuracy_score(y_test, y_pred))
    print('F1_Score: ',f1_score(y_test, y_pred))
    print('Recall_Score: ',recall_score(y_test, y_pred))
    print('Precision_Score: ',precision_score(y_test, y_pred))
    #Plot Confusion Matrix
    # Compute confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    # Plot confusion matrix using seaborn heatmap
    plt.figure(figsize=(6,6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Spam', 'Spam'], yticklabels=['Not Spam', 'Spam'])
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.show()
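The thresholding in Step 9 and the metrics in Step 10 can be checked by hand on a small example. The following is a sketch only: the probabilities and labels are hypothetical values, not results from the model above, and the helper names (threshold, confusion_counts, binary_metrics) are illustrative rather than part of Scikit-learn or Keras.

```python
def threshold(probabilities, cutoff=0.5):
    """Map sigmoid outputs to labels: 1 (spam) if above the cutoff, else 0."""
    return [1 if p > cutoff else 0 for p in probabilities]

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means spam."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical ground truth and model outputs
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
probs = [0.10, 0.62, 0.91, 0.40, 0.05, 0.77, 0.30, 0.20]

y_pred = threshold(probs)
counts = confusion_counts(y_true, y_pred)
metrics = binary_metrics(y_true, y_pred)
print("tn, fp, fn, tp:", counts)
print(metrics)
```

In this toy case accuracy is 0.75 while precision and recall are both 2/3. With an imbalanced dataset the gap between accuracy and the spam-class metrics can be wider, which is why the sample code reports precision, recall, and F1 alongside accuracy.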
Screenshots
  • Spam Email Detection System