Income prediction for the given data set using python?

Description

To predict individuals income using Logistic Regression in python.

  • Get the data set(population).
  • Clean the data.(population).
  • Find and fill the missing values
    (population).
  • Check outliers(for dependent variables).
  • After fill missing values take random sample
    from the population.
  • Write the sample data into a CSV file
    (Easy to handle).
  • Read the sample data from CSV file.
  • Make it as a data frame.
  • Check if there is any missing values.
  • Calculate basic descriptive statistic for
    sample data.
  • Check correlation of whole data frame.
  • Take the variables which has the highly
    correlated with y(target) variable.
  • Correlation range must lies in between
    -1 to 1.
  • Take X variables and y variable.
  • Split X and y into train and test data sat.
  • Import logistic regression from sklearn
    library.
  • Build the regression model.
  • Fit the X_train and y_train data in to
    the model.
  • Make predictions.
  • Calculate the coefficients, intercept,
    confusion matrix by using sklearn.
    metrics library.
  • Based on the confusion matrix we can
    calculate the accuracy,specificity
    and sensitivity also.

#import libraries

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn import metrics

from sklearn.metrics import classification_report

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

#Take sample from population

#read the data sample

data=pd.read_csv(‘/home/soft23/soft23/

Sathish/Spyder workings/sample.csv’)

df=pd.DataFrame(data)

print(“Actual Data frame is:\n”,df.head(10))

#checking missing values

print(“Checking missing values in the sample”)

print(df.isnull().sum())

print(“\n”)

print(“Descriptive statistics”)

print(df.describe())

print(“\n”)

print(“Correlation is”)

print(df.corr(method=’pearson’))

#Depends upon the correlation choose X variable

X=df[[‘hoursperweek’,’relationship’,’EdType’]]

#Fix the target variable

y=(df[‘SalStat’])

#Split the data into train and test data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

print(“Shape of train data of X\n”,X_train.shape)

print(“Shape of train data of y\n”,y_train.shape)

#build the model

logmodel = LogisticRegression()

result=logmodel.fit(X_train,y_train)

#predictions

y_pred = logmodel.predict(X_test)

df1=pd.DataFrame({‘Actual':y_test, ‘Predicted': y_pred})

print(df1)

#regression c-efficients and intercept

print(“Regression intercept is”,logmodel.intercept_)

print(“Regression coefficient is”,logmodel.coef_)

#classification report

print(classification_report(y_test,y_pred))

cm = metrics.confusion_matrix(y_test, y_pred)

print(“The confusion matrix is:\n”,cm)

#finding score of the model

print(“Model score”)

score=result.score(X_train,y_train)

print(score)

print(“Accuracy:”,metrics.accuracy_score

(y_test, y_pred))

print(“Precision:”,metrics.precision_score

(y_test, y_pred))

print(“Recall:”,metrics.recall_score(y_test, y_pred))
class_names=[0,1]

fig, ax = plt.subplots()

tick_marks = np.arange(len(class_names))

plt.xticks(tick_marks,class_names)

plt.yticks(tick_marks,class_names)

# create heatmap

sns.heatmap(pd.DataFrame(cm), annot=True, cmap=”YlGnBu” ,fmt=’g’)

ax.xaxis.set_label_position(“top”)

plt.tight_layout()

plt.title(‘Confusion matrix’)

plt.ylabel(‘Actual label’)

plt.xlabel(‘Predicted label’)

Leave Comment

Your email address will not be published. Required fields are marked *

clear formSubmit