To implement a pipeline architecture for regression in Spark with R, the sparklyr package provides the functions listed below. A short self-contained sketch follows the list, and the full walkthrough comes after it.
spark_connect(master = "local") – To create a Spark connection
sdf_copy_to(spark_connection, R_object, name) – To copy data to the Spark environment
sdf_partition(spark_dataframe, partitions_with_weights, seed) – To partition a Spark dataframe into multiple groups
ml_pipeline(spark_connection) – To create a Spark ML pipeline
ft_r_formula(formula) – To implement the transforms required for fitting a dataset against an R model formula
ml_linear_regression() – To build a linear regression model
ml_fit(pipeline, train_data) – To fit the pipeline to the training data
ml_transform(fitted_pipeline, test_data) – To generate predictions on the test data
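As a quick, runnable illustration of how these functions fit together, here is a minimal sketch that assumes only a local Spark installation and the built-in mtcars dataset; the names sc_demo, mtcars_tbl, demo_pipeline, demo_splits, demo_fitted and demo_preds are illustrative placeholders and are not part of the walkthrough below.
#Minimal sketch using the built-in mtcars dataset (illustrative only)
library(sparklyr)
sc_demo <- spark_connect(master = "local")
#Copy mtcars into spark
mtcars_tbl <- sdf_copy_to(sc_demo, mtcars, name = "mtcars_sp", overwrite = TRUE)
#Pipeline: R formula transform followed by a linear regression stage
demo_pipeline <- ml_pipeline(sc_demo) %>%
  ft_r_formula(mpg ~ wt + hp) %>%
  ml_linear_regression()
#Split, fit, predict and evaluate (RMSE is the evaluator's default metric)
demo_splits <- sdf_partition(mtcars_tbl, training = 0.8, test = 0.2, seed = 42)
demo_fitted <- ml_fit(demo_pipeline, demo_splits$training)
demo_preds <- ml_transform(demo_fitted, demo_splits$test)
ml_regression_evaluator(demo_preds, label_col = "mpg", prediction_col = "prediction")
spark_disconnect(sc_demo)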
#Load the sparklyr library
library(sparklyr)
#Create a spark connection
sc <- spark_connect(master = "local")
#Copy the data to spark (df is a placeholder for an R data frame with a numeric Class column)
data_s <- sdf_copy_to(sc, df, name = "data_s", overwrite = TRUE)
#Create the pipeline: R formula transform followed by a linear regression stage
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Class ~ .) %>%
  ml_linear_regression()
#Split the data for train and test
partitions <- sdf_partition(data_s, training = 0.8, test = 0.2, seed = 111)
train_data <- partitions$training
test_data <- partitions$test
#Fit the pipeline model
fitted_pipeline <- ml_fit(pipeline, train_data)
fitted_pipeline
#Predict using the test data
predictions <- ml_transform(fitted_pipeline, test_data)
predictions
#Evaluate the metrics
#Default metric is RMSE (Root Mean Squared Error)
cat("Root Mean Squared Error : ", ml_regression_evaluator(predictions, label_col = "Class", prediction_col = "prediction"))
cat("\nMean Squared Error : ", ml_regression_evaluator(predictions, label_col = "Class", prediction_col = "prediction", metric_name = "mse"))
cat("\nR-Squared : ", ml_regression_evaluator(predictions, label_col = "Class", prediction_col = "prediction", metric_name = "r2"))
cat("\nMean Absolute Error : ", ml_regression_evaluator(predictions, label_col = "Class", prediction_col = "prediction", metric_name = "mae"))
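Once the metrics have been inspected, the fitted pipeline can optionally be persisted for reuse and the connection closed. Below is a minimal sketch, assuming the fitted_pipeline and sc objects from the code above; the path "fitted_pipeline_model" is an illustrative placeholder.
#Save the fitted pipeline to disk (path is an illustrative placeholder)
ml_save(fitted_pipeline, "fitted_pipeline_model", overwrite = TRUE)
#Reload it later from the same path
reloaded_pipeline <- ml_load(sc, "fitted_pipeline_model")
#Close the spark connection when finished
spark_disconnect(sc)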