Research breakthrough possible @S-Logix pro@slogix.in

Office Address

  • 2nd Floor, #7a, High School Road, Secretariat Colony Ambattur, Chennai-600053 (Landmark: SRM School) Tamil Nadu, India
  • pro@slogix.in
  • +91- 81240 01111

Social List

How to add,remove and update column of a data frame in spark using python

Objective

To add,remove and update column of a data frame in spark using python

Functions used :

df.withColumn(“New_col_name”, (Col_value)) – To add columns to a dataframe

df.withColumnRenamed(‘Existing column name’, ‘Rename’) – To rename column name of a data frame

df.drop(columns_to_be_dropped).show() – To drop columns of a data frame

Process

  Import necessary libraries

  Initialize the Spark session

  Create the required data frame

  Use the predefined functions to add,remove and update column of the data frame

Sample Code

from pyspark.sql import SparkSession
#Set up SparkContext and SparkSession
spark=SparkSession \
.builder \
.appName(“Python spark example”)\
.config(“spark.some.config.option”,”some-value”)\
.getOrCreate()

#Load the file
df1=spark.read.format(‘com.databricks.spark.csv’).options(header=’True’,inferschema=’True’).load(“/home/……/Emp.csv”)
#To add a new column
df2 = df1.withColumn(“Final_salary”, (df1.SALARY+df1.TA))
df2.show(5)
#To rename a column name
df2 = df2.withColumnRenamed(‘SALARY’, ‘BASIC_PAY’)
df2.show(5)
#To drop columns of a dataframe
df2.drop(“Bal_Amnt”,”DEPT”).show()

Screenshots
add,remove and update column of a data frame in spark using python
Import necessary libraries
Initialize the Spark session
Create the required data frame