Groupby functions in pyspark (Aggregate functions)

Groupby functions in pyspark, also known as aggregate functions (count, sum, mean, min, max), are calculated using groupBy(). Groupby on a single column and on multiple columns is shown with an example of each. We will use aggregate functions to get the groupby count, groupby sum, groupby mean, groupby min and groupby max of a dataframe in pyspark. Let’s get clarity with an example.

• Groupby count of dataframe in pyspark – Groupby single and multiple columns
• Groupby sum of dataframe in pyspark – Groupby single and multiple columns
• Groupby mean of dataframe in pyspark – Groupby single and multiple columns
• Groupby min of dataframe in pyspark – Groupby single and multiple columns
• Groupby max of dataframe in pyspark – Groupby single and multiple columns

We will use the dataframe named df_basket1.

Groupby count of dataframe in pyspark – Groupby single and multiple columns:

Groupby count of single column in pyspark: Method 1

Groupby count of dataframe in pyspark – this method uses the count() function along with the groupBy() function.

```
## Groupby count of single column
df_basket1.groupBy("Item_group").count().show()
```

Groupby count of single column in pyspark: Method 2

Groupby count of dataframe in pyspark – this method uses the groupBy() function along with the aggregate function agg(), which takes a column name and "count" as arguments.

```
## Groupby count of single column
df_basket1.groupBy("Item_group").agg({"Item_group": "count"}).show()
```

Output: groupby count of the “Item_group” column.

Groupby count of multiple columns in pyspark

Groupby count of multiple columns of dataframe in pyspark – this method uses the groupBy() function, which takes the list of column names, along with the aggregate function agg(), which takes a column name and "count" as arguments.

```
## Groupby count of multiple columns
df_basket1.groupBy("Item_group", "Item_name").agg({"Item_name": "count"}).show()
```

Output: groupby count of the “Item_group” and “Item_name” columns.

Groupby sum of dataframe in pyspark – Groupby single column

Groupby sum of dataframe in pyspark – this method uses the groupBy() function along with the aggregate function agg(), which takes a column name and "sum" as arguments (here “Price” is assumed to be a numeric column in df_basket1).

```
## Groupby sum of single column
## "Price" is an assumed numeric column in df_basket1
df_basket1.groupBy("Item_group").agg({"Price": "sum"}).show()
```

Output: groupby sum for each “Item_group”.

Groupby sum of dataframe in pyspark – Groupby multiple columns

Groupby sum of multiple columns of dataframe in pyspark – this method uses the groupBy() function, which takes the list of column names, along with the aggregate function agg(), which takes a column name and "sum" as arguments (here “Price” is assumed to be a numeric column in df_basket1).

```
## Groupby sum of multiple columns
## "Price" is an assumed numeric column in df_basket1
df_basket1.groupBy("Item_group", "Item_name").agg({"Price": "sum"}).show()
```

Output: groupby sum for each “Item_group” and “Item_name”.

Groupby mean of dataframe in pyspark – Groupby single column

Groupby mean of dataframe in pyspark – this method uses the groupBy() function along with the aggregate function agg(), which takes a column name and "mean" as arguments (here “Price” is assumed to be a numeric column in df_basket1).

```
## Groupby mean of single column
## "Price" is an assumed numeric column in df_basket1
df_basket1.groupBy("Item_group").agg({"Price": "mean"}).show()
```

Output: groupby mean for each “Item_group”.

Groupby mean of dataframe in pyspark – Groupby multiple columns

Groupby mean of multiple columns of dataframe in pyspark – this method uses the groupBy() function, which takes the list of column names, along with the aggregate function agg(), which takes a column name and "mean" as arguments (here “Price” is assumed to be a numeric column in df_basket1).

```
## Groupby mean of multiple columns
## "Price" is an assumed numeric column in df_basket1
df_basket1.groupBy("Item_group", "Item_name").agg({"Price": "mean"}).show()
```

Output: groupby mean for each “Item_group” and “Item_name”.

Groupby min of dataframe in pyspark – Groupby single column:

Groupby min of dataframe in pyspark – this method uses the groupBy() function along with the aggregate function agg(), which takes a column name and "min" as arguments (here “Price” is assumed to be a numeric column in df_basket1).

```
## Groupby min of single column
## "Price" is an assumed numeric column in df_basket1
df_basket1.groupBy("Item_group").agg({"Price": "min"}).show()
```

Output: groupby min for each “Item_group”.

Groupby min of dataframe in pyspark – Groupby multiple columns:

Groupby min of multiple columns of dataframe in pyspark – this method uses the groupBy() function, which takes the list of column names, along with the aggregate function agg(), which takes a column name and "min" as arguments (here “Price” is assumed to be a numeric column in df_basket1).

```
## Groupby min of multiple columns
## "Price" is an assumed numeric column in df_basket1
df_basket1.groupBy("Item_group", "Item_name").agg({"Price": "min"}).show()
```

Output: groupby min for each “Item_group” and “Item_name”.

Groupby max of dataframe in pyspark – Groupby single column

Groupby max of dataframe in pyspark – this method uses the groupBy() function along with the aggregate function agg(), which takes a column name and "max" as arguments (here “Price” is assumed to be a numeric column in df_basket1).

```
## Groupby max of single column
## "Price" is an assumed numeric column in df_basket1
df_basket1.groupBy("Item_group").agg({"Price": "max"}).show()
```

Output: groupby max for each “Item_group”.

Groupby max of dataframe in pyspark – Groupby multiple columns

Groupby max of multiple columns of dataframe in pyspark – this method uses the groupBy() function, which takes the list of column names, along with the aggregate function agg(), which takes a column name and "max" as arguments (here “Price” is assumed to be a numeric column in df_basket1).

```
## Groupby max of multiple columns
## "Price" is an assumed numeric column in df_basket1
df_basket1.groupBy("Item_group", "Item_name").agg({"Price": "max"}).show()
```

Output: groupby max for each “Item_group” and “Item_name”.


Author

• With close to 10 years of experience in data science and machine learning, the author has worked extensively with programming languages and tools such as R, Python (Pandas), SAS, and PySpark.