How to Build an Income Prediction Model for a Given Dataset Using Python?

Income Prediction using Machine Learning

Conditions for Building an Income Prediction Model for a Given Dataset Using Python

  • Description:
    Income prediction is a common machine learning problem: given features such as age, education, occupation, and marital status, predict a person's income level. This project treats it as a binary classification task, predicting whether a person earns more than $50K per year (>50K) or at most $50K per year (<=50K).

    The model is trained and evaluated on a labeled dataset, typically the UCI Adult Income dataset, whose attributes include age, workclass, education, and occupation. The target column is the income class.
Step-by-Step Process
  • Data Loading and Preprocessing:
    Load the dataset.
    Handle missing data, if any.
    Convert categorical data into numerical features using encoding techniques.
  • Exploratory Data Analysis (EDA):
    Visualize data distributions.
    Check correlations between features.
    Plot the heatmap for correlation of features.
  • Data Splitting:
    Split the dataset into training and testing datasets (e.g., 80% for training and 20% for testing).
  • Model Selection and Training:
    Choose a classification model (e.g., Random Forest, Decision Tree, Logistic Regression).
    Train the model using the training dataset.
  • Model Evaluation:
    Predict the income class on the test set.
    Evaluate the model using accuracy, precision, recall, and F1-score.
  • Visualization:
    Plot ROC curves.
    Generate confusion matrix and classification metrics.
  • Output:
    Predicted income class for each person (>50K or <=50K).
    Classification performance metrics (accuracy, precision, recall, and F1-score).
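The steps above can be sketched end to end as a single scikit-learn pipeline. This is a minimal illustration, not the method used in the sample source code: the `toy` DataFrame, its values, and the labeling rule are made up so that the block runs without downloading anything, and one-hot encoding inside a `ColumnTransformer` is swapped in for label encoding.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the real dataset (hypothetical values, illustration only).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "hours-per-week": rng.integers(10, 60, size=n),
    "education": rng.choice(["HS-grad", "Bachelors", "Masters"], size=n),
})
# Made-up labeling rule so that the target depends on the features.
toy["income"] = np.where(
    (toy["education"] != "HS-grad") & (toy["hours-per-week"] >= 35), ">50K", "<=50K"
)

X = toy.drop(columns="income")
y = toy["income"]

# Split first, so the encoder is fitted on training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One-hot encode the categorical column; numeric columns pass through unchanged.
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["education"])],
    remainder="passthrough",
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
toy_accuracy = accuracy_score(y_test, y_pred)
print(f"Toy accuracy: {toy_accuracy:.2f}")
```

Keeping the encoder inside the pipeline means the same fitted transformation is applied to the training and test sets, and categories unseen during training are ignored rather than raising an error.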
Why Should We Choose This Approach?
  • Random Forest and Decision Trees:
    These models are robust for classification problems and perform well on tabular data with mixed numerical and categorical features.
  • Heatmaps:
    Correlation heatmaps help identify linear relationships between pairs of variables, including which features correlate most strongly with the income label. They do not measure a model's feature importance directly.
  • Classification Metrics:
    Accuracy, precision, recall, and F1-score are standard in evaluating the performance of machine learning models for classification tasks.
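To make the four metrics concrete, the sketch below computes them from confusion-matrix counts in plain Python. The counts are hypothetical, chosen only to show how accuracy can look good while recall for the >50K class stays low. The function name `classification_metrics` is an illustrative helper, not part of any library.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, treating ">50K" as the positive class.
metrics = classification_metrics(tp=60, fp=20, fn=40, tn=280)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.85
# precision: 0.75
# recall: 0.60
# f1: 0.67
```

Here 85% of all predictions are correct, yet only 60% of the actual >50K earners are found, which is why precision, recall, and F1 are reported alongside accuracy.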
Sample Source Code
  • # Importing necessary libraries
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

    # Load the dataset (using a typical adult income dataset, CSV format)
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
    column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

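    # Load the dataset into a Pandas dataframe.
    # Fields are separated by ", ", so skip the leading space; missing values are marked "?".
    data = pd.read_csv(url, header=None, names=column_names, skipinitialspace=True, na_values="?")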

    # Data preprocessing
    data.dropna(inplace=True) # Drop rows with missing values

    # Encode categorical features
    encoder = LabelEncoder()
    categorical_columns = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'income']

    for col in categorical_columns:
        # Refit the encoder on each column; every column gets its own integer codes
        data[col] = encoder.fit_transform(data[col])

    # Splitting dataset into features and target variable
    X = data.drop('income', axis=1)
    y = data['income']

    # Split the data into training and testing datasets (80% train, 20% test)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

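    # Feature scaling (not required by tree-based models such as Random Forest,
    # but useful if a scale-sensitive model such as Logistic Regression is swapped in)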
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Model training using Random Forest
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Predictions
    y_pred = model.predict(X_test)

    # Classification metrics
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(8,6))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["<=50K", ">50K"], yticklabels=["<=50K", ">50K"])
    plt.title("Confusion Matrix")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()

    # Feature Importance Plot
    feature_importances = model.feature_importances_
    features = X.columns
    plt.figure(figsize=(10,6))
    sns.barplot(x=feature_importances, y=features)
    plt.title("Feature Importance")
    plt.xlabel("Importance")
    plt.ylabel("Feature")
    plt.show()

    # Accuracy Score
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy*100:.2f}%")
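The Visualization step lists ROC curves, which the sample code above does not produce. A minimal sketch follows, assuming a fitted classifier that exposes `predict_proba`. To keep the block self-contained it trains on synthetic data from `make_classification` rather than the income dataset, and the names `X_demo`, `clf`, and `plot_roc` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data as a stand-in for the encoded income features (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(Xd_train, yd_train)

# ROC analysis needs scores, not hard labels: use the predicted probability of class 1.
scores = clf.predict_proba(Xd_test)[:, 1]
fpr, tpr, thresholds = roc_curve(yd_test, scores)
auc = roc_auc_score(yd_test, scores)
print(f"ROC AUC: {auc:.3f}")


def plot_roc(fpr, tpr, auc):
    """Draw the ROC curve; matplotlib is imported here so the metrics above run without it."""
    import matplotlib.pyplot as plt

    plt.figure(figsize=(6, 6))
    plt.plot(fpr, tpr, label=f"Random Forest (AUC = {auc:.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.title("ROC Curve")
    plt.legend()
    plt.show()
```

Calling `plot_roc(fpr, tpr, auc)` draws the curve. With the sample code, the equivalent scores would come from `model.predict_proba(X_test)[:, 1]` evaluated against `y_test`.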
Screenshots
  • Income-Prediction-Accuracy Score