How to implement pipeline architecture for regression in spark with R ?


To implement pipeline architecture for regression in spark with R using sparklyR package

Functions used :

 spark_connect(master = “local”) – To create a spark connection
 sdf_copy_to(spark_connection,R object,name) – To copy data to spark environment
 sdf_partition(spark_dataframe,partitions with weights,seed) – To partition spark dataframe into multiple groups
 ml_pipeline(spark connection) – To create a Spark ML pipeline
 ft_r_formula(formula) – To implement the  transforms required for fititng a dataset against an R model formula
 ml_linear_regression() – To build a linear regression model
 ml_fit(pipeline_model,train_data) – To fit the model
 ml_transform(fitted_pipeline,test_data) – To predict the test data

  • Load the sparklyr library
  • Create a spark connection
  • Copy data to spark environment
  • Split the data for train and test
  • Create an empty pipeline model
  • Fit the pipeline model using the train data
  • Predict using the test data
  • Evaluate the metrics

#Load the sparklyr library
#Create a spark connection
sc %
ft_r_formula(Class~.) %>%
#Split the data for train and test
#Fit the pipeline model
fitted_pipeline fitted_pipeline
#Predict using the test data
predictions predictions
#Evaluate the metrics
#Default RMSE(Root Mean Square Error)
cat(“Root Mean Squared Error : “,ml_regression_evaluator(predictions, label_col = “Class”,prediction_col = “prediction”))
cat(“\nMean Squared Error : “,ml_regression_evaluator(predictions, label_col = “Class”,prediction_col = “prediction”,metric_name = “mse”))
cat(“\nR-Squared : “,ml_regression_evaluator(predictions, label_col = “Class”,prediction_col = “prediction”,metric_name = “r2”))
cat(“\nMean Absolute Error : “,ml_regression_evaluator(predictions, label_col = “Class”,prediction_col = “prediction”,metric_name = “mae”))

Leave Comment

Your email address will not be published. Required fields are marked *

clear formSubmit