How to resolve missing values in a Spark data frame using R?

Description

This example shows how to resolve missing values (NAs) in a Spark data frame from R, using the SparkR package.

Functions used:

describe(data) – To get the statistical summary
fillna(data, value) – To fill NAs across the whole data frame
fillna(data, list("col1" = value, "col2" = value)) – To fill NAs in particular columns
dropna(data) – To drop rows that contain NAs
dropDuplicates(data, "Colname") – To drop rows with duplicate values in the given column
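
As a quick illustration of how these functions differ, here is a minimal sketch on a small in-memory data frame. The toy data, the variable names, and the `local[1]` master are assumptions made for illustration only; they are not the Empm.csv data set used in this article. It also assumes a Spark 2.x installation where `sparkR.session` is available.

```r
library(SparkR)

# Assumes Spark 2.x with SparkR already on the library path
sparkR.session(master = "local[1]")

# Hypothetical toy data (not Empm.csv); R's NA is stored as a Spark null
toy <- createDataFrame(data.frame(
  NAME   = c("A", "B", "C", "C"),
  SALARY = c(12000, NA, 15000, 15000),
  TA     = c(500, 600, NA, NA),
  stringsAsFactors = FALSE
))

# One numeric value for the whole data frame: fills nulls in numeric columns
filled_all <- fillna(toy, 0)

# Different values per column, passed as a named list
filled_cols <- fillna(toy, list("SALARY" = 16000, "TA" = 750))

# Drop rows that have a null in any column (the default, how = "any")
complete_rows <- dropna(toy)

# Drop rows only when SALARY is null, ignoring nulls in other columns
has_salary <- dropna(toy, cols = "SALARY")

# Keep one row per distinct SALARY value
unique_salary <- dropDuplicates(toy, "SALARY")

showDF(filled_cols)
```

None of these calls modifies `toy` itself. Each one returns a new Spark data frame, so the result has to be assigned or displayed to be of any use.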

  • Set up the Spark home
  • Load the SparkR library
  • Initialize the Spark context
  • Load the data set
  • Use the predefined functions to resolve missing values of the data frame

#Set up spark home
Sys.setenv(SPARK_HOME = "/…/spark-2.4.0-bin-hadoop2.7")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
#Load the library
library(SparkR)
#Initialize the Spark Context
#To run spark in a local node give master="local"
#Note: sparkR.init and sparkRSQL.init are deprecated since Spark 2.0;
#sparkR.session(master = "local") replaces both calls
sc <- sparkR.init(master = "local")
#Start the SparkSQL Context
sqlContext <- sparkRSQL.init(sc)
#Load the data set, reading the string "NA" as a missing value
data <- read.df("file:///…./Empm.csv", "csv", header = "true", inferSchema = "true", na.strings = "NA")
showDF(data)
#To get the statistical summary of all columns, then of SALARY only
showDF(describe(data))
showDF(describe(data, "SALARY"))
#Fill NAs in SALARY and TA with fixed values
showDF(fillna(data, list("SALARY" = 16000, "TA" = 750)))
#Drop rows that contain NAs
showDF(dropna(data))
#Drop rows with duplicate SALARY values
showDF(dropDuplicates(data, "SALARY"))
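
Before choosing between filling and dropping, it can help to know how many values are actually missing in each column. The following is a minimal sketch of one way to count them. The helper `count_nulls` and the toy data are hypothetical names introduced here for illustration, and the session call assumes Spark 2.x.

```r
library(SparkR)

# Assumes Spark 2.x with SparkR already on the library path
sparkR.session(master = "local[1]")

# Hypothetical toy data (not Empm.csv); R's NA is stored as a Spark null
toy <- createDataFrame(data.frame(
  SALARY = c(12000, NA, 15000),
  TA     = c(500, NA, NA)
))

# Hypothetical helper: number of rows where the given column is null
count_nulls <- function(df, col_name) {
  count(filter(df, isNull(df[[col_name]])))
}

# Named vector with one null count per column
null_counts <- sapply(columns(toy), function(col_name) count_nulls(toy, col_name))
print(null_counts)
```

Each call to `count_nulls` runs a separate Spark job, so on wide data frames this loop may be slow. It is meant as a quick check rather than an optimized approach.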
