How to Implement a Decision Tree Classifier Algorithm Using Scikit-Learn in Python?


Condition for Decision Tree Classifier Algorithm Using scikit-learn in Python

  • Description:
    A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It works by recursively splitting the dataset into subsets, choosing at each node the feature and threshold that give the best split (for classification, usually measured by Gini impurity or entropy). The resulting tree structure makes the model's decisions easy to interpret.
  • In this documentation, we will implement a Decision Tree Classifier using the scikit-learn library to classify data points. The classifier can be visualized as a tree structure with decision nodes and leaf nodes.
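To make the "best split" idea concrete, here is a minimal sketch of how Gini impurity and entropy can be computed for a set of class labels, and how two candidate splits can be compared. The helper names (`gini_impurity`, `entropy`, `weighted_impurity`) and the toy labels are illustrative assumptions, not part of scikit-learn or of the car evaluation dataset.

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class proportions."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


# Toy parent node with 5 "yes" and 5 "no" labels (made-up data).
parent = ["yes"] * 5 + ["no"] * 5

# Two hypothetical candidate splits of the same parent node.
split_a = (["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4)          # mostly pure children
split_b = (["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3)  # mixed children

print("Parent Gini    :", gini_impurity(parent))
print("Parent entropy :", entropy(parent))
print("Split A Gini   :", weighted_impurity(*split_a, gini_impurity))
print("Split B Gini   :", weighted_impurity(*split_b, gini_impurity))
```

Split A leaves the children purer (lower weighted impurity) than split B, so a tree builder that minimizes Gini impurity would prefer it. scikit-learn performs an equivalent comparison internally; the criterion is selected with the `criterion` parameter of `DecisionTreeClassifier`.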
Step by Step Process
  • Step 1: Dataset Preparation
    Load and preprocess the dataset.
  • Step 2: Split the Data
    Divide the dataset into training and testing sets.
  • Step 3: Model Training
    Train the Decision Tree Classifier using the training data.
  • Step 4: Model Evaluation
    Evaluate the model's performance on the test data using accuracy, confusion matrix, and other evaluation metrics.
  • Step 5: Visualization
    Visualize the Decision Tree for better interpretability.
  • Step 6: Fine-tuning
    Optionally, tune hyperparameters such as max_depth, min_samples_split, and min_samples_leaf to reduce overfitting and improve performance on unseen data.
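The fine-tuning step can be sketched with a cross-validated grid search. This is one possible approach, not the only one. To stay self-contained, the sketch uses scikit-learn's built-in Iris dataset as a stand-in for the car evaluation data, and the grid values are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): Iris replaces the CarEval.csv file here.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Candidate hyperparameter values (illustrative, not recommended defaults).
param_grid = {
    "max_depth": [2, 3, 4, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

# 5-fold cross-validation on the training set for every grid combination.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters       :", search.best_params_)
print("Best CV accuracy      :", search.best_score_)

# The best estimator is refit on the full training set by default.
test_accuracy = search.score(X_test, y_test)
print("Held-out test accuracy:", test_accuracy)
```

Because the search uses only the training split, the test set remains a fair estimate of how the tuned tree generalizes.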
Sample Code
  • # Importing necessary libraries
    import pandas as pd # Used to load and read the dataset
    import seaborn as sns # For drawing count plots
    import matplotlib.pyplot as plt # For visualizing plots like confusion matrix, decision tree
    from sklearn.preprocessing import LabelEncoder # Converts categorical data to numerical format
    from sklearn.preprocessing import StandardScaler # Scales features to standard normal distribution (mean = 0, std = 1)
    from sklearn.model_selection import train_test_split # Splits data into training and testing datasets
    from sklearn.tree import DecisionTreeClassifier # Model for decision tree classification
    from sklearn.tree import plot_tree # Function to plot the decision tree structure
    from sklearn.metrics import accuracy_score # Function to calculate accuracy score of predictions
    from sklearn.metrics import confusion_matrix # Function to calculate confusion matrix
    from sklearn.metrics import classification_report # Function to generate classification report (precision, recall, f1-score)
    # Load car evaluation dataset
    data = pd.read_csv('CarEval.csv') # Read the dataset
    df = pd.DataFrame(data) # read_csv already returns a DataFrame, so this step is redundant but harmless
    # Plot count plots for each feature against the target variable 'class values'
    for col in df.columns:
        sns.countplot(data=df, x='class values', hue=col) # Count plot with target 'class values' and each feature
        plt.title(col) # Set title as column name
        plt.xlabel('class values') # Label the x-axis
        plt.ylabel('Counts') # Label the y-axis
        plt.show() # Display the plot
    # Convert categorical data to numerical labels using Label Encoding
    l_encoder = LabelEncoder() # Instantiate the label encoder
    for column in df.columns:
        df[column] = l_encoder.fit_transform(df[column]) # Apply label encoding to each column
    # Split dataset into features (X) and target (y)
    x = df.drop(['class values'], axis=1) # Drop 'class values' column to get feature data
    y = df['class values'] # The 'class values' column is the target variable
    # Scale feature data to standard normal distribution (mean=0, std=1) using StandardScaler
    s_scalar = StandardScaler() # Instantiate the standard scaler
    x = s_scalar.fit_transform(x) # Scale features (X) using fit_transform
    # Split the data into training and test sets (90% training, 10% testing)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42) # Split the data
    # Train a Decision Tree Classifier model using the training dataset
    model = DecisionTreeClassifier() # Instantiate the DecisionTreeClassifier model
    model.fit(x_train, y_train) # Fit the model on the training data
    # Make predictions on the training set
    y_train_pred = model.predict(x_train) # Predict on the training data
    # Make predictions on the test set
    y_test_pred = model.predict(x_test) # Predict on the test data
    # Plot the trained decision tree for visualization
    plt.figure(figsize=(20, 16)) # Set figure size for the tree plot
    plot_tree(model, filled=True, max_depth=4) # Plot only the top 4 levels to keep the figure readable
    plt.show() # Display the tree plot
    # Evaluate the model on the training data
    # Calculate and print the confusion matrix for the training predictions
    confu_matrix = confusion_matrix(y_train, y_train_pred) # Compute confusion matrix for training data
    print(f"Train Confusion Matrix : \n{confu_matrix}") # Print the confusion matrix
    # Calculate and print the classification report for the training predictions
    classi_report = classification_report(y_train, y_train_pred) # Compute classification report for training data
    print(f"Train Classification Report : \n{classi_report}") # Print classification report
    # Calculate and print the accuracy score for the training predictions
    print(f"Train accuracy_score : {accuracy_score(y_train, y_train_pred)}") # Print accuracy on the training data
    # Evaluate the model on the test data
    # Calculate and print the confusion matrix for the test predictions
    conf_matrix = confusion_matrix(y_test, y_test_pred) # Compute confusion matrix for test data
    print(f"Test Confusion Matrix : \n{conf_matrix}") # Print the confusion matrix
    # Calculate and print the classification report for the test predictions
    class_report = classification_report(y_test, y_test_pred) # Compute classification report for test data
    print(f"Test Classification Report : \n{class_report}") # Print classification report
    # Calculate and print the accuracy score for the test predictions
    accuracy = accuracy_score(y_test, y_test_pred) # Compute accuracy on the test data
    print(f"Test Accuracy Score : {accuracy}") # Print accuracy on the test data
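To help read the confusion matrices printed above, the following sketch shows how accuracy, per-class recall, and per-class precision can be derived from a confusion matrix. The labels are small made-up lists for illustration only; they are not results from the car evaluation model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy true and predicted labels for three classes (made-up data).
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 2, 2, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Correct predictions sit on the diagonal.
correct = np.diag(cm)

accuracy = correct.sum() / cm.sum()    # correct predictions / all predictions
recall = correct / cm.sum(axis=1)      # per class: correct / samples truly in the class
precision = correct / cm.sum(axis=0)   # per class: correct / samples predicted as the class

print("Accuracy :", accuracy)
print("Recall   :", recall)
print("Precision:", precision)

# Cross-check against scikit-learn's own accuracy computation.
print("accuracy_score:", accuracy_score(y_true, y_pred))
```

The per-class values computed this way correspond to the precision and recall columns that `classification_report` prints.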
