#5, First Floor, 4th Street , Dr. Subbarayan Nagar Kodambakkam, Chennai-600 024 pro@slogix.in

How to Select, Filter, and Sort a Data Frame in Spark Using Python
Objective

To select, filter, and sort a data frame in Spark using Python.

Functions used:

df.groupBy("Col_name") – Groups the data frame by the given column.
df.filter(condition) – Keeps only the rows that satisfy the given condition.
df.sort("Col_name", ascending=True) – Sorts the data frame by a column. Pass ascending=True for ascending order or ascending=False for descending order.
df.orderBy(["Col_1", "Col_2"], ascending=[True, False]) – Sorts the data frame by more than one column. Columns listed first take priority, and each entry in ascending sets the direction for the column at the same position.
df.select("Col_name").show() – Selects a column and displays it.

Process

  Import necessary libraries

  Initialize the Spark session

  Create the required data frame

  Use the predefined functions to select, filter, and sort the data frame

Sample Code

from pyspark.sql import SparkSession

# Set up the SparkSession
spark = SparkSession \
    .builder \
    .appName("Python spark example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Load the file (the first row is a header; column types are inferred)
df1 = spark.read.format("csv").options(header="True", inferSchema="True").load("/home/…../weight-height.csv")
# Group the data frame by a column and count the rows in each group
df1.groupBy("Gender").count().show()
# Filter the data frame based on a condition
df1.filter(df1["Height"] > 60.00).show()
# Count the number of rows in the filtered data frame
print(df1.filter(df1["Height"] > 60.00).count())
# Sort in descending order of a column; collect() returns a list of Row objects
df1.sort(df1.Height.desc()).collect()
# Sort in ascending order of a column
df1.sort("Height", ascending=True).collect()
# Sort by more than one column: Gender descending (0), then Height ascending (1)
df1.orderBy(["Gender", "Height"], ascending=[0, 1]).collect()
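
The ascending list in the last call is easy to misread, so here is a plain-Python sketch (no Spark required) that mimics the same ordering rule: the first column has the highest priority, and each entry in the list sets the direction for the column at the same position. The helper order_by and the rows are hypothetical and exist only for illustration; this is an emulation of the semantics, not how Spark implements sorting.

```python
# Hypothetical (Gender, Height) rows; not taken from the original data set.
rows = [
    ("Male", 70.0),
    ("Female", 58.5),
    ("Male", 66.0),
    ("Female", 63.0),
]


def order_by(data, keys, ascending):
    """Mimic multi-column ordering: keys[0] has the highest priority."""
    result = list(data)
    # Python's sort is stable, so sorting by the lowest-priority key first
    # and the highest-priority key last yields a lexicographic ordering.
    for index, asc in reversed(list(zip(keys, ascending))):
        result.sort(key=lambda row: row[index], reverse=not asc)
    return result


# Column 0 is Gender (descending), column 1 is Height (ascending).
ordered = order_by(rows, keys=[0, 1], ascending=[False, True])
for row in ordered:
    print(row)
```

With these rows, both Male entries come before both Female entries because Gender is sorted in descending order, and within each gender the heights increase.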
