How Can I Split a Dataset into Training and Test Sets for Simple Linear Regression in Python?
Condition for Splitting a Dataset into Training and Test Sets
Description: This document outlines the steps involved in splitting a dataset into training and test sets for simple linear regression analysis. The process ensures that the model trained on the training set can be evaluated on unseen data from the test set, which helps assess how well the model generalizes.
Why Should We Separate the Data into Test and Train Sets?
Evaluating Model Performance: Splitting data allows us to train the model on one subset and test it on another, providing a more accurate assessment of the model's performance.
Detecting Overfitting: Holding out a separate test set lets us detect overfitting, where the model memorizes the training data rather than learning patterns that generalize to unseen data. A model that scores much better on the training set than on the test set is likely overfitted.
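The gap between training and test scores can be made visible with a small experiment. The sketch below is only an illustration under stated assumptions: the data is synthetic (a made-up line plus noise), and an unconstrained decision tree is used purely because it memorizes training rows readily; it is not part of the workflow described in this document.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): y = 3x + 5 plus Gaussian noise
rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 2.0, size=200)

# Hold out 20% of the rows as unseen test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree grows until it reproduces every training target
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# score() returns R-squared; compare seen data against unseen data
train_r2 = tree.score(X_train, y_train)
test_r2 = tree.score(X_test, y_test)
print(f"R-squared on training data: {train_r2:.3f}")
print(f"R-squared on test data:     {test_r2:.3f}")
```

The training score alone would suggest a perfect model; only the held-out test score reveals how the model behaves on data it has not seen.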
Step-by-Step Process
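Choosing Dataset: Select a dataset with a continuous target variable (Y) and a predictor variable (X). Simple linear regression uses exactly one predictor; with several predictors the model becomes multiple linear regression. Ensure the dataset is clean and preprocessed before splitting.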
Data Splitting: Randomly split the dataset into training and test sets, typically using an 80/20 or 70/30 split ratio. This split ensures a sufficient amount of data for training while reserving some for testing.
Simple Linear Regression: Fit a simple linear regression model on the training set to establish the relationship between the independent variable (X) and dependent variable (Y).
Model Evaluation: Evaluate the model's performance by predicting on the test set and calculating metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared. Visualize the regression line and scatter plots for better understanding.
Conclusion: Summarize the results, discuss the strengths and limitations of the model, and identify areas for further improvement or exploration.
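The splitting, fitting, and evaluation steps above can be sketched without any machine-learning library. The example below is a minimal illustration rather than the method used in the sample code: it assumes synthetic data (a made-up line y = 3x + 5 plus noise), an 80/20 split, and a closed-form least-squares fit. All variable names are hypothetical.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3x + 5 plus Gaussian noise
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 5.0 + rng.normal(0, 1.0, size=100)

# Data splitting: shuffle the row indices, then cut at 80%
indices = rng.permutation(len(x))
n_train = int(0.8 * len(x))
train_idx, test_idx = indices[:n_train], indices[n_train:]
x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]

# Simple linear regression: closed-form least-squares slope and intercept,
# estimated from the training rows only
x_mean, y_mean = x_train.mean(), y_train.mean()
slope = np.sum((x_train - x_mean) * (y_train - y_mean)) / np.sum((x_train - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Model evaluation: mean squared error on the held-out test rows
y_pred = slope * x_test + intercept
test_mse = np.mean((y_test - y_pred) ** 2)

print(f"train rows: {len(x_train)}, test rows: {len(x_test)}")
print(f"slope: {slope:.3f}, intercept: {intercept:.3f}, test MSE: {test_mse:.3f}")
```

Because the test rows play no part in estimating the slope and intercept, the test MSE reflects how the fitted line behaves on data it has not seen.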
Sample Code
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
# Load and preprocess the dataset
data = pd.read_csv('Salary Data.csv')
# Check for missing values
if data.isnull().any().any():
    data = data.dropna()  # Drop rows with missing values (imputation is an alternative)
# Encode categorical (text) columns as integers
label_encoder = LabelEncoder()
for column in data.select_dtypes(include='object').columns:
    data[column] = label_encoder.fit_transform(data[column])
# Define X (features) and y (target)
X = data.drop(['Salary'], axis=1) # Drop 'Salary' from features
y = data['Salary'] # 'Salary' is the target variable
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a linear regression model (simple if X has one column, multiple otherwise)
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model using Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
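The sample code reports only MSE, while the evaluation step also names RMSE and R-squared. The sketch below shows one way to compute all three with scikit-learn. It is self-contained, so it assumes a synthetic single-predictor dataset in place of 'Salary Data.csv'; the data and its coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the salary data (an assumption, not the real file):
# a single predictor column, so this is simple linear regression
rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 2.0, size=200)

# Same 80/20 split and seed as the sample code
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training set, predict on the test set
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# MSE, RMSE (the square root of MSE, in the units of y), and R-squared
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.3f}")
print(f"Root Mean Squared Error: {rmse:.3f}")
print(f"R-squared: {r2:.3f}")
```

RMSE is often easier to interpret than MSE because it is expressed in the same units as the target, and R-squared gives the share of variance in the test targets that the model explains.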