How to Detect Breast Cancer with a Decision Tree Algorithm in Python?
Condition for Detecting Breast Cancer Using Decision Tree Algorithm
Description: Breast cancer is one of the most common cancers affecting women worldwide, and early detection greatly improves the chances of successful treatment. In this project, we use a Decision Tree algorithm to build a machine learning model that classifies breast tumors as malignant or benign. The dataset used in this case study is the well-known "Breast Cancer Wisconsin (Diagnostic) Dataset", which contains 569 samples with 30 numeric features computed from digitized images of fine needle aspirate (FNA) biopsies of breast masses. The features describe characteristics of the cell nuclei, such as radius, texture, smoothness, compactness, and concavity.
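As a quick orientation before modeling, the dataset can be inspected directly. The snippet below is a minimal sketch that assumes scikit-learn's bundled copy of the dataset (load_breast_cancer); the variable name class_counts is only illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load scikit-learn's bundled copy of the Wisconsin (Diagnostic) dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Number of samples and number of features
print("Shape:", X.shape)

# Label 0 and label 1, in that order
print("Classes:", list(data.target_names))

# Number of samples per class, indexed by label
class_counts = np.bincount(y)
for name, count in zip(data.target_names, class_counts):
    print(f"{name}: {count}")
```

The classes are moderately imbalanced (more benign than malignant samples), which is one reason to look at precision and recall in addition to accuracy.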
Why Should We Choose Decision Tree Algorithm?
Interpretability: The decision-making process of the model is easy to follow and understand.
Non-Linear Relationships: Decision trees can model non-linear relationships between features.
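Handles Numerical and Categorical Data: Decision trees can split on both types of data. Note that scikit-learn's DecisionTreeClassifier expects numeric input, so categorical features must be encoded first; the features in this dataset are all numeric.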
No Need for Feature Scaling: Unlike distance- or margin-based algorithms such as SVM or KNN, Decision Trees split on one feature at a time using thresholds, so they do not require normalization or scaling of features.
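To illustrate the interpretability point, the learned rules of a small tree can be printed as plain text. This is a minimal sketch using scikit-learn's export_text; the depth limit of 2 and the name shallow_tree are assumptions chosen only to keep the output short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# A deliberately shallow tree so that the rule list stays readable
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
shallow_tree.fit(data.data, data.target)

# Render the tree as nested if/else rules on the raw (unscaled) features
rules = export_text(shallow_tree, feature_names=list(data.feature_names))
print(rules)
```

Each line of the output is a threshold test on a single feature, and each leaf shows the predicted class (0 = malignant, 1 = benign).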
Step-by-Step Process
Data Collection: Load the Breast Cancer dataset from sources like sklearn.datasets.
Data Preprocessing: Handle missing or irrelevant data, if necessary, and split the dataset into training and testing sets.
Model Training: Train the Decision Tree model on the training data.
Model Evaluation: Evaluate the model using accuracy, precision, recall, and F1-score. Visualize the Decision Tree to understand the decision-making process.
Model Tuning: Fine-tune the model by adjusting hyperparameters such as max_depth, min_samples_split, and min_samples_leaf, which limit how far the tree grows and help reduce overfitting.
Results Interpretation: Present results and draw conclusions based on the model's performance.
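The tuning step above could be sketched with a cross-validated grid search. This is one possible approach, not the only one: the parameter grid, the 5-fold setting, the accuracy scoring, and the stratified split are all assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Stratify so both splits keep the same malignant/benign ratio (an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Illustrative grid; the values are not tuned recommendations
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

# 5-fold cross-validation on the training set only
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")

# Score the refitted best model once on the held-out test set
test_accuracy = search.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Keeping the test set out of the search means the final test accuracy remains an unbiased estimate of how the tuned model performs on unseen data.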
Sample Source Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Load the dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
# Make predictions
y_pred = dt_classifier.predict(X_test)
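# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy*100:.2f}%")
# Per-class precision, recall, and F1-score (label 0 = malignant, 1 = benign)
print(classification_report(y_test, y_pred, target_names=data.target_names))
# Confusion matrix: rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))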
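The visualization mentioned in the evaluation step is not shown in the code above. A minimal sketch with plot_tree might look like the following; the depth limit of 3, the figure size, and the output file name breast_cancer_tree.png are assumptions chosen so the figure stays legible.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Limit the depth so that every node fits in one readable figure
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(16, 8))
annotations = plot_tree(
    tree,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    filled=True,  # color nodes by majority class
    ax=ax,
)
fig.savefig("breast_cancer_tree.png", dpi=150, bbox_inches="tight")
```

Each box in the saved figure shows the split condition, the impurity, the number of samples reaching the node, and the majority class, which makes it possible to trace how any single prediction was reached.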