Groupby functions in pyspark (Aggregate functions)

Groupby functions in pyspark, also known as aggregate functions (count, sum, mean, min, max), are calculated using groupBy(). Groupby on a single column and on multiple columns is shown with an example of each. We will use aggregate functions to get the groupby count, groupby mean, groupby sum, groupby min and groupby max of a dataframe in pyspark. Let's get clarity with an example.

  • Groupby count of dataframe in pyspark – Groupby single and multiple columns
  • Groupby sum of dataframe in pyspark – Groupby single and multiple columns
  • Groupby mean of dataframe in pyspark – Groupby single and multiple columns
  • Groupby min of dataframe in pyspark – Groupby single and multiple columns
  • Groupby max of dataframe in pyspark – Groupby single and multiple columns

We will use the dataframe named df_basket1, which has the columns Item_group, Item_name and Price.

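The original post shows df_basket1 as a screenshot. A minimal sketch that builds a comparable dataframe is given below; the rows are made-up sample values, and only the column names (Item_group, Item_name, Price) come from the examples in this post.

## Build a small sample df_basket1 (hypothetical rows, column names taken from this post)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupby_examples").getOrCreate()

data = [("Fruits", "Apple", 100),
        ("Fruits", "Banana", 40),
        ("Vegetables", "Carrot", 30),
        ("Vegetables", "Onion", 25),
        ("Dairy", "Milk", 55)]

df_basket1 = spark.createDataFrame(data, ["Item_group", "Item_name", "Price"])
df_basket1.show()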

Groupby count of dataframe in pyspark – Groupby single and multiple columns:


Groupby count of single column in pyspark: Method 1
Groupby count of dataframe in pyspark – this method uses the count() function along with the groupBy() function.

## Groupby count of single column
df_basket1.groupBy("Item_group").count().show()

The result has one row per Item_group, with the number of rows in each group in a column named count.

Groupby count of single column in pyspark: Method 2

Groupby count of dataframe in pyspark – this method uses the groupBy() function along with the aggregate function agg(), which takes a dictionary mapping the column name ('Price') to the aggregate 'count' as its argument.

## Groupby count of single column
df_basket1.groupby('Item_group').agg({'Price': 'count'}).show()

The groupby count of the "Item_group" column is returned in a column named count(Price).

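If you want control over the name of the aggregated column, the same groupby count can be written with pyspark.sql.functions and alias(); the alias count_price below is just an illustrative name, not something from the original post.

## Groupby count using pyspark.sql.functions, with an explicit column alias
from pyspark.sql import functions as F

df_basket1.groupby('Item_group').agg(F.count('Price').alias('count_price')).show()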

Groupby count of multiple columns in pyspark

Groupby count of multiple columns of dataframe in pyspark – this method passes multiple column names to the groupBy() function and uses the aggregate function agg(), which takes the column name and 'count' as its argument.

## Groupby count of multiple column
df_basket1.groupby('Item_group','Item_name').agg({'Price': 'count'}).show()

The groupby count for each combination of the "Item_group" and "Item_name" columns is returned, again in a column named count(Price).


Groupby sum of dataframe in pyspark – Groupby single and multiple columns:

Groupby sum of dataframe in pyspark – Groupby single column

Groupby sum of dataframe in pyspark – this method uses the groupBy() function along with the aggregate function agg(), which takes the column name and 'sum' as its argument.

## Groupby sum of single column
df_basket1.groupby('Item_group').agg({'Price': 'sum'}).show()

The groupby sum of the Price column for each "Item_group" is returned in a column named sum(Price).


Groupby sum of dataframe in pyspark – Groupby multiple columns

Groupby sum of multiple columns of dataframe in pyspark – this method passes multiple column names to the groupBy() function and uses the aggregate function agg(), which takes the column name and 'sum' as its argument.

## Groupby sum of multiple column
df_basket1.groupby('Item_group','Item_name').agg({'Price': 'sum'}).show()

The groupby sum of the Price column for each combination of "Item_group" and "Item_name" is returned in a column named sum(Price).

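GroupedData also exposes a sum() shortcut, so the same result can be obtained without agg(); this is equivalent to the dictionary form above and is shown only as a sketch of an alternative.

## Groupby sum using the GroupedData sum() shortcut
df_basket1.groupby('Item_group','Item_name').sum('Price').show()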

Groupby mean of dataframe in pyspark – Groupby single and multiple columns:

Groupby mean of dataframe in pyspark – Groupby single column

Groupby mean of dataframe in pyspark – this method uses the groupBy() function along with the aggregate function agg(), which takes the column name and 'mean' as its argument.

## Groupby mean of single column
df_basket1.groupby('Item_group').agg({'Price': 'mean'}).show()

The groupby mean of the Price column for each "Item_group" is returned.


Groupby mean of dataframe in pyspark – Groupby multiple columns

Groupby mean of multiple columns of dataframe in pyspark – this method passes multiple column names to the groupBy() function and uses the aggregate function agg(), which takes the column name and 'mean' as its argument.

## Groupby mean of multiple column
df_basket1.groupby('Item_group','Item_name').agg({'Price': 'mean'}).show()

The groupby mean of the Price column for each combination of "Item_group" and "Item_name" is returned.

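In Spark, 'mean' and 'avg' are the same aggregate. If you prefer an explicitly named result column, a sketch using pyspark.sql.functions is shown below; mean_price is just an illustrative alias.

## Groupby mean with an explicit alias for the aggregated column
from pyspark.sql import functions as F

df_basket1.groupby('Item_group').agg(F.mean('Price').alias('mean_price')).show()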

Groupby min of dataframe in pyspark – Groupby single and multiple columns:

Groupby min of dataframe in pyspark – Groupby single column:

Groupby min of dataframe in pyspark – this method uses the groupBy() function along with the aggregate function agg(), which takes the column name and 'min' as its argument.

## Groupby min of single column
df_basket1.groupby('Item_group').agg({'Price': 'min'}).show()

The groupby min of the Price column for each "Item_group" is returned in a column named min(Price).


Groupby min of dataframe in pyspark – Groupby multiple columns:

Groupby min of multiple columns of dataframe in pyspark – this method passes multiple column names to the groupBy() function and uses the aggregate function agg(), which takes the column name and 'min' as its argument.

## Groupby min of multiple column
df_basket1.groupby('Item_group','Item_name').agg({'Price': 'min'}).show()

The groupby min of the Price column for each combination of "Item_group" and "Item_name" is returned in a column named min(Price).

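The dictionary form of agg() cannot apply two aggregates to the same column (a later key simply overwrites the earlier one), so to get both the min and the max of Price in one pass you can use pyspark.sql.functions instead; the aliases below are illustrative.

## Groupby min and max of Price together in a single agg()
from pyspark.sql import functions as F

df_basket1.groupby('Item_group').agg(
    F.min('Price').alias('min_price'),
    F.max('Price').alias('max_price')
).show()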

Groupby max of dataframe in pyspark – Groupby single and multiple columns:

Groupby max of dataframe in pyspark – Groupby single column

Groupby max of dataframe in pyspark – this method uses the groupBy() function along with the aggregate function agg(), which takes the column name and 'max' as its argument.

## Groupby max of single column
df_basket1.groupby('Item_group').agg({'Price': 'max'}).show()

The groupby max of the Price column for each "Item_group" is returned in a column named max(Price).


Groupby max of dataframe in pyspark – Groupby multiple columns

Groupby max of multiple columns of dataframe in pyspark – this method passes multiple column names to the groupBy() function and uses the aggregate function agg(), which takes the column name and 'max' as its argument.

## Groupby max of multiple column
df_basket1.groupby('Item_group','Item_name').agg({'Price': 'max'}).show()

The groupby max of the Price column for each combination of "Item_group" and "Item_name" is returned in a column named max(Price).

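Finally, all of the above aggregates can be computed in a single agg() call. The sketch below combines count, sum, mean, min and max of Price per Item_group, with illustrative aliases for the output columns.

## Groupby count, sum, mean, min and max of Price in one pass
from pyspark.sql import functions as F

df_basket1.groupby('Item_group').agg(
    F.count('Price').alias('count_price'),
    F.sum('Price').alias('sum_price'),
    F.mean('Price').alias('mean_price'),
    F.min('Price').alias('min_price'),
    F.max('Price').alias('max_price')
).show()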


Author

  • Sridhar Venkatachalam

    With close to 10 years of experience in data science and machine learning, he has worked extensively with programming languages such as R, Python (Pandas), SAS and PySpark.