Populate row number in pyspark – Row number by Group

To populate a row number in pyspark we use the row_number() function. Applied over a window that is partitioned with partitionBy() on another column, row_number() populates the row number by group. Let’s see an example of each.

  • Populate row number in pyspark
  • Populate row number in pyspark by group

We will be using the dataframe df_basket1, which includes the columns Item_group, Item_name and Price.

[Image: the df_basket1 dataframe]

 

 

Populate Row number in pyspark:

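The row number is populated by the row_number() function, which has to be applied over a window. We call partitionBy() with no arguments, so the whole dataframe is treated as a single partition, and orderBy() on the price column, so rows are numbered in ascending order of price. Note that with no partition column Spark moves all rows to a single partition, which can be slow on large data.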

### Row number in pyspark

from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Empty partitionBy(): all rows form one window, numbered in ascending order of Price
w = Window.partitionBy().orderBy(df_basket1["Price"])

df_basket1 = df_basket1.select("Item_group", "Item_name", "Price", F.row_number().over(w).alias("row_num"))
df_basket1.show()

The resulting dataframe has an extra row_num column that numbers all rows in ascending order of price.
[Image: df_basket1 with the row_num column]

 

 

Populate row number in pyspark by group

Row number by group is also populated by the row_number() function. This time we use partitionBy() on the Item_group column and orderBy() on the price column, so the numbering restarts at 1 within each group and follows ascending price inside the group.

### Row number in pyspark by group

from pyspark.sql.window import Window
import pyspark.sql.functions as F

# One window per Item_group; numbering restarts at 1 in each group, in ascending order of Price
w = Window.partitionBy(df_basket1["Item_group"]).orderBy(df_basket1["Price"])

df_basket1 = df_basket1.select("Item_group", "Item_name", "Price", F.row_number().over(w).alias("row_num"))
df_basket1.show()

The resulting dataframe has a row_num column that restarts at 1 for every Item_group.
[Image: df_basket1 with row_num populated by group]
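
To make the by-group numbering concrete without a Spark session, here is a plain-Python sketch of what row_number() over partitionBy(Item_group).orderBy(Price) computes. It is an illustration of the semantics only, not how Spark implements it; the helper name row_number_by_group and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (Item_group, Item_name, Price)
rows = [
    ("Fruit", "Apple", 100.0),
    ("Vegetable", "Carrot", 30.0),
    ("Fruit", "Banana", 40.0),
    ("Vegetable", "Tomato", 60.0),
]

def row_number_by_group(rows):
    """Mimic row_number() over partitionBy(Item_group).orderBy(Price)."""
    # Sort by group first, then by price, so each group is contiguous and ordered.
    ordered = sorted(rows, key=itemgetter(0, 2))
    numbered = []
    for _, group_rows in groupby(ordered, key=itemgetter(0)):
        # The counter restarts at 1 for every group.
        for row_num, row in enumerate(group_rows, start=1):
            numbered.append(row + (row_num,))
    return numbered

for row in row_number_by_group(rows):
    print(row)
```

Within each group the cheapest item gets row number 1, which mirrors the ascending orderBy() on price used above.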

 
