How to Make Predictions in Multiple Linear Regression Using the Statsmodels Library in Python?
Making Predictions with Multiple Linear Regression in Statsmodels
Description: Multiple Linear Regression (MLR) is a statistical method that models the relationship between two or more features and a response by fitting a linear equation to the observed data, so that each feature contributes to the prediction of the outcome. The statsmodels library in Python provides tools for estimating and interpreting linear models, making it straightforward to perform statistical analyses such as multiple linear regression.
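For example, fitting an MLR model with statsmodels takes only a few lines. A minimal sketch (the data and column names below are made up purely for illustration):
import pandas as pd
import statsmodels.api as sm
# Hypothetical data: two predictors and one response
data = pd.DataFrame({
    'hours_studied':    [1, 2, 3, 4, 5],
    'classes_attended': [2, 2, 3, 4, 5],
    'score':            [10, 18, 30, 38, 52],
})
X = sm.add_constant(data[['hours_studied', 'classes_attended']])  # adds the intercept term
model = sm.OLS(data['score'], X).fit()
print(model.params)  # one coefficient per predictor, plus the intercept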
Why Should We Use Multiple Linear Regression?
Predictive Modeling: MLR is helpful for predicting a continuous dependent variable based on multiple independent variables.
Interpretability: The coefficients in the regression model show how each feature affects the target variable.
Flexibility: It can handle complex relationships with more than one predictor variable.
Correlation Insights: It helps in understanding how the predictor variables relate to each other and to the target variable.
Step-by-Step Process
Load Data: Import the dataset and examine the structure.
Preprocessing: Clean the data by handling missing values, outliers, and normalizing features if necessary.
Exploratory Data Analysis (EDA): Visualize the relationships between variables using plots, heatmaps, and correlation matrices.
Feature Selection: Identify which features are most important for the model (a simple correlation-based sketch follows this list).
Train-Test Split: Divide the dataset into a training set and a testing set to evaluate model performance.
Model Training: Use statsmodels to build the regression model.
Model Evaluation: Evaluate the model using metrics like R², RMSE, etc.
Prediction: Use the trained model to make predictions on unseen data.
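As mentioned in the Feature Selection step, one lightweight approach is to rank features by their correlation with the target. A minimal sketch, assuming the same df and target column ('Marks') used in the sample code below; the 0.1 threshold is arbitrary:
# Rank features by their absolute correlation with the target
correlations = df.corr()['Marks'].drop('Marks').abs().sort_values(ascending=False)
print(correlations)
# Keep features above a chosen threshold; the sample code below simply uses all features
selected_features = correlations[correlations > 0.1].index.tolist()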
Sample Code
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
# Load the dataset
df = pd.read_csv('Student_Marks.csv')
# Check the first few rows of the dataset to understand its structure
print(df.head())
# Encode any non-numeric (categorical) columns as integers so they can be used as predictors
label_encoder = LabelEncoder()
for column in df.select_dtypes(include='object').columns:
    df[column] = label_encoder.fit_transform(df[column])
# Data Exploration
# Check for missing values
print("\nMissing values:\n", df.isnull().sum())
# Basic statistical summary
print("\nStatistical Summary:\n", df.describe())
# Heatmap of correlations to visualize the relationship between features
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
# The target variable in this dataset is 'Marks'; the remaining columns are features
X = df.drop('Marks', axis=1)  # Replace 'Marks' with the actual target column name in your CSV
y = df['Marks']
# Split the dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Add a constant to the features for the intercept in the model
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)
# Build the regression model using statsmodels
model = sm.OLS(y_train, X_train).fit()
# Display the summary of the regression
print("\nRegression Model Summary:\n", model.summary())
# Predictions using the test set
y_pred = model.predict(X_test)
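# (Optional) Beyond point predictions, statsmodels can quantify uncertainty:
# get_prediction() returns standard errors plus confidence and prediction
# intervals for every test observation.
pred_intervals = model.get_prediction(X_test).summary_frame(alpha=0.05)
print("\nPredictions with 95% intervals:\n", pred_intervals.head())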
# Plotting: Predicted vs Actual
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--r', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Predicted vs Actual')
plt.show()
# Residuals plot
residuals = y_test - y_pred
plt.figure(figsize=(8, 6))
sns.histplot(residuals, kde=True)
plt.title('Residuals Distribution')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()
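# (Optional) A Q-Q plot is another standard residual check: points close to the
# 45-degree line suggest the normality assumption behind the OLS inference
# (p-values, confidence intervals) is plausible.
sm.qqplot(residuals, line='45', fit=True)
plt.title('Residuals Q-Q Plot')
plt.show()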
# Model Evaluation: Metrics
# R-squared
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2}")
# Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
# Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error: {rmse}")
# Mean Absolute Error (MAE)
mae = np.mean(np.abs(residuals))
print(f"Mean Absolute Error: {mae}")
# Visualizing the distribution of predicted values
plt.figure(figsize=(8, 6))
sns.histplot(y_pred, kde=True)
plt.title('Predicted Values Distribution')
plt.xlabel('Predicted')
plt.ylabel('Frequency')
plt.show()
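Finally, to make predictions on genuinely unseen data, build a DataFrame with the same feature columns the model was trained on, add the constant, and call predict(). A minimal sketch; the new observation below (each feature set to its training mean) is purely illustrative:
# Predict for a new, unseen observation (hypothetical feature values)
new_data = pd.DataFrame({column: [X[column].mean()] for column in X.columns})
new_data = sm.add_constant(new_data, has_constant='add')  # a single row needs has_constant='add'
new_prediction = model.predict(new_data)
print(f"Predicted Marks for the new observation: {np.asarray(new_prediction)[0]:.2f}")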