How to add,remove and update column of a data frame in spark using python ?

Objective

To add,remove and update column of a data frame in spark using python

Functions used :

  df.withColumn(“New_col_name”, (Col_value)) – To add columns to a dataframe

  df.withColumnRenamed(‘Existing column name’, ‘Rename’) – To rename column name of a data frame

  df.drop(columns_to_be_dropped).show() – To drop columns of a data frame

Process:

  Import necessary libraries

  Initialize the Spark session

  Create the required data frame

  Use the predefined functions to add,remove and update column of the data frame

from pyspark.sql import SparkSession
#Set up SparkContext and SparkSession
spark=SparkSession \
.builder \
.appName(“Python spark example”)\
.config(“spark.some.config.option”,”some-value”)\
.getOrCreate()

#Load the file
df1=spark.read.format(‘com.databricks.spark.csv’).options(header=’True’,inferschema=’True’).load(“/home/……/Emp.csv”)
#To add a new column
df2 = df1.withColumn(“Final_salary”, (df1.SALARY+df1.TA))
df2.show(5)
#To rename a column name
df2 = df2.withColumnRenamed(‘SALARY’, ‘BASIC_PAY’)
df2.show(5)
#To drop columns of a dataframe
df2.drop(“Bal_Amnt”,”DEPT”).show()

Leave Comment

Your email address will not be published. Required fields are marked *

clear formSubmit