How Can I Split a DataSet into Training and Testing Sets for Simple Linear Regression in Python?

Split Data Set into Train and Test in Simple Linear Regression

Condition for Splitting DataSet into TrainSet and TestSet

  • Description:
    This document outlines the steps involved in splitting a dataset into training and test sets for simple linear regression analysis. The split ensures that a model trained on the training set can be evaluated on unseen data from the test set, which helps assess how well the model generalizes.
Why Should We Separate the Data into Test and Train Sets?
  • Evaluating Model Performance:
    Splitting data allows us to train the model on one subset and test it on another, providing a more accurate assessment of the model's performance.
  • Detecting Overfitting:
    A held-out test set does not stop overfitting by itself, but it exposes it: a model that memorizes the training data rather than learning patterns that generalize will score well on the training set and poorly on the unseen test set.
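The point about memorization can be illustrated with a deliberately bad model. The sketch below is an assumption-laden toy, not part of the original tutorial: the synthetic data (y = 2x plus noise) and the helper names `memorizer_predict` and `mse` are made up for illustration. The "model" simply returns the target of the nearest training point, so its error on the training set is zero while its error on the held-out points is not.

```python
import random

random.seed(1)

# Synthetic data (assumed for illustration): y = 2x plus Gaussian noise
points = [(x, 2.0 * x + random.gauss(0, 1)) for x in range(50)]
random.shuffle(points)
train, test = points[:40], points[40:]  # 80/20 split


def memorizer_predict(x, training_points):
    """Return the target of the training point whose x is closest."""
    nearest = min(training_points, key=lambda p: abs(p[0] - x))
    return nearest[1]


def mse(points_to_score, training_points):
    """Mean squared error of the memorizer on the given points."""
    errors = [(y - memorizer_predict(x, training_points)) ** 2
              for x, y in points_to_score]
    return sum(errors) / len(errors)


train_mse = mse(train, train)  # every training point finds itself
test_mse = mse(test, train)    # unseen points expose the noise it memorized

print(f"Training MSE: {train_mse}")
print(f"Test MSE:     {test_mse}")
```

Scoring only on the training set would suggest a perfect model; the test score is the one that reflects performance on new data.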
Step-by-Step Process
  • Choosing Dataset:
    Select a dataset with a continuous target variable (Y) and a predictor variable (X). Simple linear regression uses exactly one predictor, so choose a single column if the dataset has several.
    Ensure the dataset is clean and preprocessed before splitting.
  • Data Splitting:
    Randomly split the dataset into training and test sets, typically using an 80/20 or 70/30 split ratio.
    This split ensures a sufficient amount of data for training while reserving some for testing.
  • Simple Linear Regression:
    Fit a simple linear regression model on the training set to establish the relationship between the independent variable (X) and dependent variable (Y).
  • Model Evaluation:
    Evaluate the model's performance by predicting on the test set and calculating metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared.
    Visualize the regression line and scatter plots for better understanding.
  • Conclusion:
    Summarize the results, discuss the strengths and limitations of the model, and identify areas for further improvement or exploration.
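The steps above can be sketched without any library, which makes explicit what the split, the fit, and the metrics compute. This is a minimal sketch on synthetic data: the true line y = 3 + 2x, the noise level, and the helper name `fit_line` are assumptions chosen for illustration.

```python
import math
import random

random.seed(0)

# Step 1: a synthetic dataset (assumed): one predictor x, continuous target y
xs = [i / 10 for i in range(100)]
ys = [3.0 + 2.0 * x + random.gauss(0, 1) for x in xs]
rows = list(zip(xs, ys))

# Step 2: shuffle, then cut at 80% to get an 80/20 split
random.shuffle(rows)
cut = int(0.8 * len(rows))
train, test = rows[:cut], rows[cut:]


# Step 3: ordinary least squares for one predictor (closed form)
def fit_line(pairs):
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(train)

# Step 4: evaluate on the test set only
actual = [y for _, y in test]
preds = [intercept + slope * x for x, _ in test]
ss_res = sum((a - p) ** 2 for a, p in zip(actual, preds))
mean_actual = sum(actual) / len(actual)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

mse = ss_res / len(test)
rmse = math.sqrt(mse)
r2 = 1 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, R-squared={r2:.3f}")
```

RMSE is in the same units as the target, which usually makes it easier to interpret than MSE; R-squared reports the share of test-set variance the fitted line explains.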
Sample Code
  • # Import necessary libraries
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.preprocessing import LabelEncoder
    # Load and preprocess the dataset
    data = pd.read_csv('Salary Data.csv')
    # Check for missing values
    if data.isnull().any().any():
      data = data.dropna() # Drop rows with missing values or apply imputation
    label_encoder = LabelEncoder()
    for column in data.select_dtypes(include='object').columns:
      data[column] = label_encoder.fit_transform(data[column])
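    # Define X (features) and y (target)
    # Note: every column except 'Salary' is used as a feature; for a strictly
    # simple linear regression, keep a single predictor column in X
    X = data.drop(['Salary'], axis=1) # Drop 'Salary' from features
    y = data['Salary'] # 'Salary' is the target variable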
    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Fit a simple linear regression model
    model = LinearRegression()
    model.fit(X_train, y_train)
    # Make predictions on the test set
    y_pred = model.predict(X_test)
    # Evaluate the model using Mean Squared Error
    mse = mean_squared_error(y_test, y_pred)
    print(f'Mean Squared Error: {mse}')
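The sample reports only MSE, while the evaluation step also mentions RMSE and R-squared. The sketch below shows one way to compute all three with scikit-learn for a single predictor. Because 'Salary Data.csv' is not available here, it uses a synthetic stand-in: the years-of-experience predictor, the coefficients, and the noise level are assumptions, not values from the actual file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the salary data (assumed values, not from the CSV)
rng = np.random.default_rng(42)
years = rng.uniform(0, 20, size=200)
salary = 30000 + 5000 * years + rng.normal(0, 4000, size=200)

X = years.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
y = salary

# 80/20 split with a fixed seed so the result is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training set, predict on the test set
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Metrics computed on the test set only
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # same units as the target
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'Root Mean Squared Error: {rmse:.2f}')
print(f'R-squared: {r2:.3f}')
```

Taking the square root of the MSE by hand avoids depending on a particular scikit-learn version for an RMSE helper.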
Screenshots
  • [Screenshot: splitting a dataset into train and test sets for simple linear regression using Python]