How to Implement a Decision Tree Classifier Algorithm Using Scikit-Learn in Python?
Decision Tree Classifier Algorithm Using scikit-learn in Python
Description: A Decision Tree is a supervised machine learning algorithm used for both classification and
regression tasks. It works by splitting the dataset into subsets based on the feature that results
in the best split (usually measured by Gini Impurity or Entropy for classification). The tree
structure allows for easy interpretation and understanding of the decisions being made.
In this guide, we implement a Decision Tree Classifier using the scikit-learn library on a car
evaluation dataset. The fitted classifier can be visualized as a tree structure with decision nodes
and leaf nodes.
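For intuition, the split criterion can be selected directly when the model is created. The snippet below is a minimal sketch using scikit-learn's built-in Iris dataset (not the car evaluation data used later) to show the criterion argument; both variants are standard scikit-learn calls.
# Minimal sketch: choosing the split criterion (Gini Impurity vs. Entropy)
from sklearn.datasets import load_iris # Small built-in dataset, used only for this illustration
from sklearn.tree import DecisionTreeClassifier # Decision tree model
X, y = load_iris(return_X_y=True) # Features and labels
gini_tree = DecisionTreeClassifier(criterion='gini', random_state=0).fit(X, y) # Default criterion (Gini Impurity)
entropy_tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y) # Information-gain (Entropy) criterion
print(gini_tree.get_depth(), entropy_tree.get_depth()) # Depth of each fitted tree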
Step by Step Process
Step 1: Dataset Preparation
Load and preprocess the dataset.
Step 2: Split the Data
Divide the dataset into training and testing sets.
Step 3: Model Training
Train the Decision Tree Classifier using the training data.
Step 4: Model Evaluation
Evaluate the model's performance on the test data using accuracy, confusion matrix, and other
evaluation metrics.
Step 5: Visualization
Visualize the Decision Tree for better interpretability.
Step 6: Fine-tuning
Optionally, tune hyperparameters such as max_depth, min_samples_split, and min_samples_leaf
for better performance (a tuning sketch follows the sample code below).
Sample Code
# Importing necessary libraries
import pandas as pd # Used to load and read the dataset
import seaborn as sns # For drawing count plots
import matplotlib.pyplot as plt # For visualizing plots like confusion matrix, decision tree
from sklearn.preprocessing import LabelEncoder # Converts categorical data to numerical format
from sklearn.preprocessing import StandardScaler # Standardizes features to mean 0 and standard deviation 1
from sklearn.model_selection import train_test_split # Splits data into training and testing datasets
from sklearn.tree import DecisionTreeClassifier # Model for decision tree classification
from sklearn.tree import plot_tree # Function to plot the decision tree structure
from sklearn.metrics import accuracy_score # Function to calculate accuracy score of predictions
from sklearn.metrics import confusion_matrix # Function to calculate confusion matrix
from sklearn.metrics import classification_report # Function to generate classification report (precision, recall, f1-score)
# Load car evaluation dataset
df = pd.read_csv('CarEval.csv') # Read the dataset into a DataFrame (read_csv already returns one)
# Plot count plots for each feature against the target variable 'class values'
for col in df.columns:
    sns.countplot(data=df, x='class values', hue=col) # Count plot of the target 'class values' split by each feature
    plt.title(col) # Set title as column name
    plt.xlabel('class values') # Label the x-axis
    plt.ylabel('Counts') # Label the y-axis
    plt.show() # Display the plot
# Convert categorical data to numerical labels using Label Encoding
l_encoder = LabelEncoder() # Instantiate the label encoder
for column in df.columns:
    df[column] = l_encoder.fit_transform(df[column]) # Apply label encoding to each column
# Split dataset into features (X) and target (y)
x = df.drop(['class values'], axis=1) # Drop 'class values' column to get feature data
y = df['class values'] # The 'class values' column is the target variable
# Standardize feature data to mean 0 and standard deviation 1 using StandardScaler
# (Note: tree-based models do not require feature scaling, but it does no harm here)
s_scaler = StandardScaler() # Instantiate the standard scaler
x = s_scaler.fit_transform(x) # Scale features (X) using fit_transform
# Split the data into training and test sets (90% training, 10% testing)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42) # Split the data
# Train a Decision Tree Classifier model using the training dataset
model = DecisionTreeClassifier() # Instantiate the DecisionTreeClassifier model
model.fit(x_train, y_train) # Fit the model on the training data
# Make predictions on the training set
y_train_pred = model.predict(x_train) # Predict on the training data
# Make predictions on the test set
y_test_pred = model.predict(x_test) # Predict on the test data
# Plot the trained decision tree for visualization
plt.figure(figsize=(20, 16)) # Set figure size for the tree plot
plot_tree(model, filled=True, max_depth=4) # Plot the tree with max depth 4 to avoid complexity
plt.show() # Display the tree plot
# Evaluate the model on the training data
# Calculate and print the confusion matrix for the training predictions
confu_matrix = confusion_matrix(y_train, y_train_pred) # Compute confusion matrix for training data
print(f"Train Confusion Matrix : \n{confu_matrix}") # Print the confusion matrix
# Calculate and print the classification report for the training predictions
classi_report = classification_report(y_train, y_train_pred) # Compute classification report for training data
print(f"Train Classification Report : \n{classi_report}") # Print classification report
# Calculate and print the accuracy score for the training predictions
print(f"Train accuracy_score : {accuracy_score(y_train, y_train_pred)}") # Print accuracy on the training data
# Evaluate the model on the test data
# Calculate and print the confusion matrix for the test predictions
conf_matrix = confusion_matrix(y_test, y_test_pred) # Compute confusion matrix for test data
print(f"Test Confusion Matrix : \n{conf_matrix}") # Print the confusion matrix
# Calculate and print the classification report for the test predictions
class_report = classification_report(y_test, y_test_pred) # Compute classification report for test data
print(f"Test Classification Report : \n{class_report}") # Print classification report
# Calculate and print the accuracy score for the test predictions
accuracy = accuracy_score(y_test, y_test_pred) # Compute accuracy on the test data
print(f"Test Accuracy Score : {accuracy}") # Print accuracy on the test data