How to Build a Simple Linear Regression Model using Python?
Condition for Simple Linear Regression Model using Python
Description: A Simple Linear Regression model is one of the most basic and commonly used algorithms in machine learning, which establishes a relationship between two variables. It assumes that there is a linear relationship between the independent variable (input) and the dependent variable (output). In simple terms, it attempts to predict a continuous target variable based on a single feature.
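The linear relationship described above can be written as y = b0 + b1 * x, where b0 is the intercept and b1 is the slope. As a minimal sketch (the toy numbers and the helper name `fit_line` are illustrative assumptions, not part of the original tutorial), the least-squares intercept and slope can be computed directly with NumPy:

```python
import numpy as np

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope = sum of co-deviations of x and y / sum of squared deviations of x
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # The fitted line always passes through the point of means
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Toy data lying exactly on y = 1 + 2x (made up for illustration)
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
intercept, slope = fit_line(x, y)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

Libraries such as scikit-learn arrive at the same line for a single feature; the explicit formula is shown here only to make the meaning of the two coefficients concrete.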
Why Should We Use Simple Linear Regression?
Simple Linear Regression is a go-to method for predictive analysis when:
You want to understand the relationship between two continuous variables.
You have a clear linear relationship between the target and the feature.
You are working with small to medium datasets and want interpretable results.
It is widely used in industries such as finance (predicting stock prices), healthcare (predicting patient outcomes), and marketing (predicting sales based on advertising).
Step-by-Step Process
Define the Problem: Identify the dependent (target) variable and independent (feature) variable.
Collect the Data: Gather relevant data that can be used to make predictions.
Preprocess the Data: Handle missing values, remove outliers, and scale the data if necessary.
Visualize the Data: Plot the data to check for any obvious trends or relationships.
Train-Test Split: Divide the data into training and testing sets (commonly 80% for training and 20% for testing).
Build the Model: Use linear regression from libraries like scikit-learn to fit the model on the training data.
Evaluate the Model: Use metrics like Mean Squared Error (MSE) and R-squared to evaluate the model's performance.
Interpret Results: Analyze the coefficients (intercept and slope) to understand the relationship between the variables.
Make Predictions: Use the trained model to make predictions on new data.
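The steps above can be sketched end to end with a single feature. This is a minimal sketch under stated assumptions: the data is synthetic (generated here, not taken from the tutorial's dataset), the variable names `experience` and `salary` are illustrative, and the 80/20 split follows the common choice mentioned in the steps.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): salary grows roughly
# linearly with years of experience, plus random noise.
rng = np.random.default_rng(seed=0)
experience = rng.uniform(0, 20, size=200).reshape(-1, 1)  # one feature, 2-D for scikit-learn
salary = 30000 + 5000 * experience.ravel() + rng.normal(0, 4000, size=200)

# Train-test split: 80% training, 20% testing
x_train, x_test, y_train, y_test = train_test_split(
    experience, salary, test_size=0.2, random_state=0
)

# Build the model on the training data
model = LinearRegression()
model.fit(x_train, y_train)

# Evaluate on the held-out test data
y_pred = model.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.1f}, R-squared: {r2:.3f}")

# Interpret: the slope is the estimated salary change per extra year of experience
print(f"intercept: {model.intercept_:.1f}, slope: {model.coef_[0]:.1f}")

# Predict for new, unseen inputs (3 and 10 years of experience)
new_experience = np.array([[3.0], [10.0]])
new_predictions = model.predict(new_experience)
print(new_predictions)
```

Because the data was generated with a slope of 5000 and an intercept of 30000, the fitted coefficients should land close to those values, which is a quick sanity check that the workflow is wired correctly.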
Sample Code
# Import the required libraries
import pandas as pd # Import pandas to load and manipulate the dataset
from sklearn.preprocessing import LabelEncoder # For encoding categorical data as integer codes
from sklearn.preprocessing import StandardScaler # To standardize features to zero mean and unit variance
import matplotlib.pyplot as plt # Used to create visualizations, such as graphs
import seaborn as sns # For advanced visualizations like heat maps
from sklearn.model_selection import train_test_split # To split data into training and testing sets
from sklearn.linear_model import LinearRegression # For building the linear regression model
from sklearn.metrics import r2_score # To evaluate model performance (R-squared score)
# To calculate the error between predicted and actual values
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Load the Salary dataset
df = pd.read_csv("Salary Data.csv") # read_csv already returns a DataFrame
# Rename columns to make them more user-friendly
df = df.rename(columns={'Education Level':'Education_Level','Job Title':'Job_Title',
'Years of Experience':'Experience'})
# Remove columns with low correlation to salary prediction (Job Title and Gender)
df = df.drop(['Job_Title', 'Gender'], axis=1)
# Optional alternative to dropping rows: fill missing values with each column's mean (commented out)
'''
for column in ['Age', 'Experience', 'Salary']:
    df[column] = df[column].fillna(df[column].mean())
'''
# Remove rows with missing values
df = df.dropna(axis=0)
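# Encode categorical data (Education Level) to numerical format
# Note: LabelEncoder assigns integer codes in sorted (alphabetical) order, which may not match the real ranking of education levels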
l_encoder = LabelEncoder()
df['Education_Level'] = l_encoder.fit_transform(df['Education_Level'])
# Calculate the correlation matrix to analyze relationships between variables
correlation_matrix = df.corr()
# Visualize the correlation matrix using a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True)
plt.title('Correlation Heatmap for Salary Prediction')
plt.show()
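# Prepare the features (X) and target variable (y)
# Note: x holds several columns (Age, Education_Level, Experience), so the model below is strictly a multiple linear regression;
# for a simple linear regression, keep a single feature instead, e.g. x = df[['Experience']]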
x = df.drop(['Salary'], axis=1) # Features (everything except 'Salary')
y = df['Salary'] # Target variable (Salary)
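# Split the data into training (90%) and testing (10%) sets; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)
# Standardize the features (z-scores), fitting the scaler on the training set only so no test-set information leaks into training
s_scalar = StandardScaler()
x_train = s_scalar.fit_transform(x_train)
x_test = s_scalar.transform(x_test)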
# Train a linear regression model using the training data
model = LinearRegression()
model.fit(x_train, y_train)
# Predict the salary using the test data
y_predict = model.predict(x_test)
# Calculate the R-squared score: the share of variance in salary explained by the model (1.0 is a perfect fit; it is not a classification-style accuracy)
r2_score_value = r2_score(y_test, y_predict)
print(f"R-squared between predicted and test data: {r2_score_value:.4f}")
# Calculate the Mean Squared Error (MSE) to assess prediction error
mean_square = mean_squared_error(y_test, y_predict)
print(f"Prediction error (Mean Squared Error) between predicted and test data: {mean_square}")
# Calculate the Mean Absolute Error (MAE): the average absolute difference between predicted and actual salaries
mean_absolute = mean_absolute_error(y_test, y_predict)
print(f"Prediction error (Mean Absolute Error) between predicted and test data: {mean_absolute}")
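To make the three metrics printed by the sample code more concrete, the sketch below recomputes them by hand with NumPy. The four actual and predicted values are made up for illustration and are not output from the salary model.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred

# Mean Squared Error: average of squared residuals (penalizes large errors more)
mse = np.mean(residuals ** 2)

# Mean Absolute Error: average size of the residuals, in the target's own units
mae = np.mean(np.abs(residuals))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.3f}, MAE={mae:.3f}, R-squared={r2:.3f}")
# → MSE=0.375, MAE=0.500, R-squared=0.925
```

MSE is expressed in squared units of the target, so MAE is often easier to read as a typical error in salary terms, while R-squared is unitless and can even be negative when a model does worse than always predicting the mean.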