To add,remove and update column of a data frame in spark using python
df.withColumn(“New_col_name”, (Col_value)) – To add columns to a dataframe
df.withColumnRenamed(‘Existing column name’, ‘Rename’) – To rename column name of a data frame
df.drop(columns_to_be_dropped).show() – To drop columns of a data frame
Import necessary libraries
Initialize the Spark session
Create the required data frame
Use the predefined functions to add,remove and update column of the data frame
from pyspark.sql import SparkSession
#Set up SparkContext and SparkSession
spark=SparkSession \
.builder \
.appName(“Python spark example”)\
.config(“spark.some.config.option”,”some-value”)\
.getOrCreate()
#Load the file
df1=spark.read.format(‘com.databricks.spark.csv’).options(header=’True’,inferschema=’True’).load(“/home/……/Emp.csv”)
#To add a new column
df2 = df1.withColumn(“Final_salary”, (df1.SALARY+df1.TA))
df2.show(5)
#To rename a column name
df2 = df2.withColumnRenamed(‘SALARY’, ‘BASIC_PAY’)
df2.show(5)
#To drop columns of a dataframe
df2.drop(“Bal_Amnt”,”DEPT”).show()