How to Build a Logistic Regression Model for the Wine Quality Dataset?
Logistic Regression Model for Predicting Wine Quality
Description: The goal of this project is to build a Logistic Regression model that predicts the quality of wine based on various physicochemical features such as acidity, alcohol content, pH, and residual sugar. The dataset used is the Wine Quality Dataset, which contains measurements of these features along with the associated quality rating of the wine.
Why Should We Choose Logistic Regression?
Binary Classification: Logistic regression is well suited to binary classification tasks (for example, good vs. not-good wine, as in the code below), and it extends to multi-class problems via one-vs-rest or multinomial logistic regression.
Simplicity and Interpretability: It’s easy to implement and understand, making it ideal for situations where we need to interpret model results.
Probabilistic Output: Logistic regression outputs probabilities, which indicate how likely a wine is to belong to a given quality category (a short sketch follows this list).
Efficient for Smaller Datasets: It works efficiently on smaller datasets and offers a good baseline model before trying more complex algorithms.
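To make the probabilistic output and interpretability points concrete, here is a minimal, self-contained sketch (assuming scikit-learn and NumPy are installed; the tiny dataset below is purely illustrative and not taken from the wine data):
import numpy as np
from sklearn.linear_model import LogisticRegression
# Illustrative single feature (alcohol-like values) and binary labels.
X_demo = np.array([[9.0], [9.5], [10.0], [11.5], [12.0], [13.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression()
clf.fit(X_demo, y_demo)
# predict_proba returns the probability of each class for every sample.
print(clf.predict_proba([[12.5]]))
# The coefficients show how each feature shifts the log-odds of the positive class.
print(clf.coef_, clf.intercept_)
# With more than two classes, scikit-learn's LogisticRegression handles
# multi-class targets (one-vs-rest or multinomial) without extra code.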
Step-by-Step Process
Data Collection: Load the Wine Quality dataset.
Data Preprocessing: Handle missing values and duplicates, and normalize/standardize the features if necessary (a scaling sketch follows this list).
Data Exploration: Perform exploratory data analysis (EDA) to understand the distribution of features and the target variable.
Feature Selection/Engineering: Identify which features are important for predicting wine quality (a quick correlation check appears in the sample code below).
Model Building: Split the data into training and testing sets, and then apply Logistic Regression.
Model Evaluation: Assess the model with metrics such as accuracy, the confusion matrix, and the ROC curve (a ROC-curve sketch follows the sample code).
Optimization: Fine-tune the model with hyperparameter optimization (optional; a grid-search sketch follows the sample code).
Visualization: Plot relevant graphs to visualize the performance and understand the data.
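Before the full walkthrough, here is a minimal sketch of the normalization/standardization mentioned in the preprocessing step (assuming scikit-learn's StandardScaler; the small arrays below are illustrative stand-ins for the wine features, not real data):
import numpy as np
from sklearn.preprocessing import StandardScaler
# Illustrative training and test feature matrices (e.g. alcohol and pH columns).
X_train_demo = np.array([[9.4, 3.51], [9.8, 3.20], [11.2, 3.16], [12.8, 3.30]])
X_test_demo = np.array([[10.5, 3.40]])
scaler = StandardScaler()
# Fit the scaler on the training data only, then reuse it on the test data,
# so no information from the test set leaks into the preprocessing.
X_train_scaled = scaler.fit_transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)
print(X_train_scaled)
print(X_test_scaled)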
Sample Source Code
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
filterwarnings("ignore")
# Read Data
data = pd.read_csv('/home/soft15/soft15/Python/py_Exercises/python_Machine_Learning/22-11-2024/26.wine quality/WineQT.csv')
# Check for missing values
print(data.isnull().sum())
# Check for duplicates
duplicates = data[data.duplicated()]
print("Number of duplicate rows:", len(duplicates))
# quality vs. alcohol content
plt.figure(figsize=(8, 6))
sns.boxplot(x='quality', y='alcohol', data=data)
plt.title('Alcohol Content by Wine Quality')
plt.xlabel('Quality')
plt.ylabel('Alcohol Content')
plt.show()
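# (Optional) Feature selection hint: correlation of each feature with quality.
# This is a lightweight sketch; it assumes all columns in WineQT.csv are numeric.
print(data.corr()['quality'].sort_values(ascending=False))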
from sklearn.model_selection import train_test_split
# Convert the target variable into a binary classification task
data['good_quality'] = (data['quality'] >= 7).astype(int)
# Prepare the feature matrix and target vector
# (if the file includes an identifier column such as 'Id', it can be dropped here as well)
X = data.drop(['quality', 'good_quality'], axis=1)
y = data['good_quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)  # a higher max_iter helps the solver converge on unscaled features
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
confusion_mat = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", confusion_mat)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy * 100)
classi_report = classification_report(y_test, y_pred)
print("Classification report:\n", classi_report)