To select, filter, and sort a data frame in Spark using Python
df.groupBy("Col_name") – To group the data frame by a given column
df.filter(condition) – To filter the data frame based on the given condition
df.sort(Col_name, ascending=True or False) – To sort the data frame by a column; ascending=True sorts in ascending order and ascending=False in descending order
df.orderBy([Col_names], ascending=[values]) – To sort the data frame by more than one column. Columns listed first take priority, and each entry in the ascending list sets the direction for the column in the same position (True or 1 for ascending, False or 0 for descending)
df.select(Col_name).show() – To select a column and display it
Import necessary libraries
Initialize the Spark session
Create the required data frame
Use the predefined functions to select, filter, and sort the data frame
from pyspark.sql import SparkSession
# Set up the SparkSession (it also provides the underlying SparkContext)
spark = SparkSession \
    .builder \
    .appName("Python spark example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
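# Load the CSV file; the first row holds the column names and column types are inferred
# ("csv" is the built-in source in Spark 2.0 and later, which is also where SparkSession was introduced)
df1 = spark.read.format("csv").options(header="True", inferSchema="True").load("/home/…../weight-height.csv")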
# To group the data frame by a column and count the rows in each group
df1.groupBy("Gender").count().show()
# To filter the data frame based on a condition
df1.filter(df1["Height"] > 60.00).show()
# To count the number of rows in the filtered data frame
df1.filter(df1["Height"] > 60.00).count()
# To sort the data frame in descending order by the values of a column
df1.sort(df1.Height.desc()).collect()
# To sort the data frame in ascending order by the values of a column
df1.sort("Height", ascending=True).collect()
# To sort by more than one column: Gender in descending order (0), then Height in ascending order (1)
df1.orderBy(["Gender", "Height"], ascending=[0, 1]).collect()