How to implement Linear Discriminant Analysis (LDA) using sklearn in Python?


This article shows how to implement LDA in Python using sklearn.

Linear Discriminant Analysis:

  • LDA is used mainly for dimensionality reduction of a data set.
  • LDA reduces the number of dimensions in the feature set while retaining the information that discriminates between the output classes.
  • LDA is a supervised dimensionality reduction technique: it uses the class labels when finding the projection.
  • Reducing the number of features in this way can help to limit overfitting.
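
The points above can be illustrated with a minimal sketch. The synthetic data, the class centers, and the names X_demo, y_demo and lda_demo are assumptions made for illustration only; they are not part of the original example.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data (an assumption for illustration): 3 classes, 4 features,
# 30 samples per class drawn around three different centers.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0]], dtype=float)
X_demo = np.vstack([rng.normal(loc=c, scale=1.0, size=(30, 4)) for c in centers])
y_demo = np.repeat([0, 1, 2], 30)

# LDA is supervised: the labels y_demo are passed to fit_transform.
lda_demo = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda_demo.fit_transform(X_demo, y_demo)

# Four features are projected onto two linear discriminants.
print(X_demo.shape, "->", X_reduced.shape)
```

With three classes, LDA can produce at most two discriminants (n_classes - 1), which is why n_components is set to 2 here.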

Data Rescaling:

  • Standardization is one data rescaling method.
  • Data rescaling is an important part of data preparation before applying machine learning algorithms.
  • Standardization shifts and scales each attribute so that its distribution has a mean of zero and a standard deviation of one (unit variance).
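
As a small sketch of what standardization does, the snippet below compares a manual calculation with StandardScaler. The array X_small and its values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two attributes on very different scales.
X_small = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 60.0]])

# Manual standardization: subtract the column mean, divide by the column
# standard deviation (population standard deviation, ddof=0).
manual = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

# StandardScaler applies the same transformation column by column.
scaled = StandardScaler().fit_transform(X_small)

print(scaled.mean(axis=0))  # approximately zero for each column
print(scaled.std(axis=0))   # approximately one for each column
```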

Eigenvalues:

  • An eigenvalue is a number that indicates how much of the variance between the output classes, relative to the variance within them, lies along the direction of its eigenvector.
  • Each eigenvector is a direction in feature space (a linear combination of the features) and has its own eigenvalue.
  • The eigenvector with the highest eigenvalue is therefore the first linear discriminant.
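
One textbook way to obtain these eigenvalues is from the within-class and between-class scatter matrices. The sketch below is an illustration under assumptions (synthetic data, hypothetical names S_W and S_B); it is not necessarily how sklearn computes them internally.

```python
import numpy as np
from scipy.linalg import eigh

# Synthetic data (an assumption for illustration): 3 classes, 4 features.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0]], dtype=float)
X_demo = np.vstack([rng.normal(loc=c, scale=1.0, size=(30, 4)) for c in centers])
y_demo = np.repeat([0, 1, 2], 30)

overall_mean = X_demo.mean(axis=0)
n_features = X_demo.shape[1]
S_W = np.zeros((n_features, n_features))  # within-class scatter
S_B = np.zeros((n_features, n_features))  # between-class scatter
for label in np.unique(y_demo):
    X_c = X_demo[y_demo == label]
    mean_c = X_c.mean(axis=0)
    S_W += (X_c - mean_c).T @ (X_c - mean_c)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_B += X_c.shape[0] * (diff @ diff.T)

# Solve the generalized eigenproblem S_B v = lambda * S_W v.
eigenvalues, eigenvectors = eigh(S_B, S_W)

# Sort from largest to smallest eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
print(eigenvalues)
```

With three classes, only two eigenvalues are meaningfully greater than zero, so only two discriminant directions carry class information.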

Explained Variance:

  • It contains the variance ratio for each linear discriminant, that is, the share of the variance that the discriminant explains.
  • The first discriminant explains the largest share of the variance.
  • The second discriminant explains a smaller share.
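
As a worked example, one common way to obtain such ratios is to divide each eigenvalue by the sum of all eigenvalues. The eigenvalues below are hypothetical numbers chosen for easy arithmetic, not results from any data set.

```python
import numpy as np

# Hypothetical eigenvalues, one per linear discriminant (illustrative only).
eigenvalues = np.array([9.0, 1.0])

# Share of the total explained by each discriminant.
ratio = eigenvalues / eigenvalues.sum()

# Running total, expressed as a percentage.
cumulative_percent = np.cumsum(ratio) * 100

print(ratio)
print(cumulative_percent)
```

Here the first discriminant accounts for 90% of the total and the second for the remaining 10%, so the cumulative sum reaches 100% after two discriminants.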

#import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import warnings

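#load data set URL (left empty here; supply the location of the CSV file)
url = ""
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
data = pd.read_csv(url, names=names)

#features: every column except the class label
X = data.drop(columns='class')

print("Actual Features before standardizing\n\n",X.head())

#target: the class label
y = data['class']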

# Standardizing the features
X_trans = StandardScaler().fit_transform(X)

print("After standardizing the features\n\n",X_trans)

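#fit LDA on the standardized features, keeping every available discriminant
#(n_components cannot exceed min(n_classes - 1, n_features), so None is used instead of 4)
covar_matrix = LDA(n_components = None).fit(X_trans, y)

variance = covar_matrix.explained_variance_ratio_

#Cumulative sum of variance
var=np.cumsum(np.round(variance, decimals=3)*100)
print("Cumulative explained variance (%)\n\n",var)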

#plot for variance explained
plt.ylabel('% Variance Explained')
plt.xlabel('# of Linear Discriminants')
plt.title('LDA Analysis')
plt.plot(var)
plt.show()

#Fit LDA for two components
lda = LDA(n_components = 2)
LinearComponents = lda.fit_transform(X_trans, y)

#make it as data frame
finalDf = pd.DataFrame(data = LinearComponents
, columns = ['linear discriminant 1', 'linear discriminant 2'])

print("After transform X, the linear discriminants are\n\n",finalDf.head())

#data visualizations
print("2D LDA Visualization\n")

def visual(df):
    #histogram of the first linear discriminant
    #(sns.histplot replaces the deprecated sns.distplot)
    sns.histplot(df['linear discriminant 1'], kde = False)
    plt.show()

def visual1(df):
    #histogram of the second linear discriminant
    sns.histplot(df['linear discriminant 2'], kde = False,
                 bins = int(180/5), color = 'blue')
    plt.show()

visual(finalDf)
visual1(finalDf)


#scatter plot
ax = sns.scatterplot(x="linear discriminant 1", y="linear discriminant 2", data=finalDf)
plt.show()

print("The explained variance percentage is:",lda.explained_variance_ratio_*100)
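
Because url is left empty in the listing above, it cannot be run as-is. The sketch below is a self-contained variant that assumes the data set is the Iris data (suggested by the column names) and loads the copy bundled with sklearn through load_iris. The names X_iris, y_iris, lda_iris and iris_df are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Assumption: the bundled Iris data stands in for the CSV behind `url`.
iris = load_iris()
X_iris = pd.DataFrame(
    iris.data,
    columns=['sepal-length', 'sepal-width', 'petal-length', 'petal-width'],
)
y_iris = pd.Series(iris.target_names[iris.target], name='class')

# Standardize the features, then project onto two linear discriminants.
X_iris_trans = StandardScaler().fit_transform(X_iris)
lda_iris = LDA(n_components=2)
components = lda_iris.fit_transform(X_iris_trans, y_iris)

# Collect the discriminants and the class label in one data frame.
iris_df = pd.DataFrame(
    components,
    columns=['linear discriminant 1', 'linear discriminant 2'],
)
iris_df['class'] = y_iris

print(iris_df.head())
print("Explained variance percentage:", lda_iris.explained_variance_ratio_ * 100)
```

Keeping the class column next to the discriminants makes it easy to colour the scatter plot by class, for example by passing hue='class' to sns.scatterplot.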
