#5, First Floor, 4th Street , Dr. Subbarayan Nagar Kodambakkam, Chennai-600 024 pro@slogix.in

How to resolve missing values of a data frame in Spark using Python
Objective

To resolve missing values of a data frame in Spark using Python.

Functions used:

df.fillna(value) – Fills null (NA) values with the given value
df.na.drop() – Drops rows that contain null values
df.na.replace(to_replace, value) – Replaces one value with another
df.dropDuplicates() – Drops duplicate rows

Process

  Import necessary libraries

  Initialize the Spark session

  Create the required data frame

  Use the predefined functions to resolve missing values of the data frame

Sample Code

from pyspark.sql import SparkSession

# Set up the SparkSession
spark = SparkSession \
    .builder \
    .appName("Python spark example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Load the CSV file (first row is the header, column types are inferred)
df1 = spark.read.format("csv").options(header="True", inferSchema="True").load("/home/……/Empm.csv")

# Statistical summary of a column
df1.select("SALARY").describe().show()

# Fill nulls in a column with a given value
df1.select("SALARY").show()
df1.select("SALARY").fillna(16083.33).show()

# Drop rows with null values
df1.select("SALARY").show()
df1.select("SALARY").na.drop().show()

# Fill nulls with 0, then replace 0 with another value
# (this also replaces any genuine 0 values in the column)
sal = df1.select("SALARY").fillna(0)
sal.show()
sal.na.replace(0, 16083.33).show()

# Drop duplicate values
sal.select("SALARY").dropDuplicates().show()
