Resolve missing values of a data frame in Spark using R

Description

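This tutorial shows how to resolve missing values (NAs) of a data frame in Spark using R. With SparkR you can summarize the data, fill NAs with replacement values, drop rows that contain NAs, and drop duplicate rows.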

Functions used:

describe(data) – To get the statistical summary (count, mean, stddev, min, max)
fillna(data, value) – To fill NAs across the whole data frame with a single value
fillna(data, list("col1" = value, "col2" = value)) – To fill NAs of particular columns
dropna(data) – To drop rows that contain NAs
dropDuplicates(data, "Colname") – To drop rows with duplicate values in the given column
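
As a rough illustration of how these functions behave, here is a minimal sketch on a small toy data frame. The data is hypothetical (only the SALARY and TA column names come from the sample code), and it assumes a local Spark installation that SparkR can find. The cols argument of dropna is an optional extra that limits the NA check to the named columns.

```r
library(SparkR)

# Assumption: a local Spark installation that SparkR can find
sparkR.session(master = "local")

# Hypothetical toy data; SALARY and TA mirror the sample data set's columns
local_df <- data.frame(
  NAME   = c("a", "b", "c", "d"),
  SALARY = c(10000, NA, 12000, 12000),
  TA     = c(500, 600, NA, NA),
  stringsAsFactors = FALSE
)
df <- createDataFrame(local_df)

# Fill NAs in the numeric columns with one value
filled_all <- fillna(df, 0)

# Fill NAs per column with a named list
filled_cols <- fillna(df, list("SALARY" = 16000, "TA" = 750))

# Drop rows with an NA in any column, or only rows where SALARY is NA
complete_rows <- dropna(df)
salary_known  <- dropna(df, cols = "SALARY")

# Keep one row per distinct SALARY value
unique_salary <- dropDuplicates(df, "SALARY")

showDF(filled_cols)
```

None of these calls modify df; each returns a new SparkDataFrame, so the result has to be assigned or passed on to showDF.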

Process

Set up Spark home
Load the SparkR library
Initialize the Spark context
Load the data set
Use the predefined functions to resolve missing values of the data frame

Sample Code

# Set up Spark home
Sys.setenv(SPARK_HOME = "/…/spark-2.4.0-bin-hadoop2.7")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
# Load the library
library(SparkR)
# Initialize the Spark context
# To run Spark on a local node, give master = "local"
# In Spark 2.x, sparkR.session() also starts the SparkSQL context
sparkR.session(master = "local")
# Load the data set; the string "NA" in the file is read as a missing value
data <- read.df("file:///…./Empm.csv", "csv", header = "true", inferSchema = "true", na.strings = "NA")
showDF(data)
# To get the statistical summary
showDF(describe(data))
showDF(describe(data, "SALARY"))
# Fill NAs
showDF(fillna(data, list("SALARY" = 16000, "TA" = 750)))
# Drop NAs
showDF(dropna(data))
# Drop duplicates
showDF(dropDuplicates(data, "SALARY"))
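
The sample code fills NAs with fixed values. A common variation is to check how many values are missing first and then fill with the column mean. The sketch below is one possible way to do that; count_missing and fill_with_mean are hypothetical helper names, not SparkR functions, and the toy data is made up for illustration.

```r
library(SparkR)

# Hypothetical helper: number of rows where `col_name` is missing
count_missing <- function(df, col_name) {
  count(where(df, isNull(df[[col_name]])))
}

# Hypothetical helper: replace missing values in a numeric column with its mean
fill_with_mean <- function(df, col_name) {
  # avg() ignores missing values when computing the mean
  col_mean <- collect(agg(df, m = avg(df[[col_name]])))$m
  fillna(df, setNames(list(col_mean), col_name))
}

# Assumption: a local Spark installation that SparkR can find
sparkR.session(master = "local")

# Hypothetical toy data with one missing SALARY
toy <- createDataFrame(data.frame(SALARY = c(10000, NA, 14000)))

n_missing <- count_missing(toy, "SALARY")
toy_filled <- fill_with_mean(toy, "SALARY")
showDF(toy_filled)
```

Whether the mean is a sensible replacement depends on the data; for skewed columns a fixed value, as in the sample code, may be the safer choice.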
