Get duplicate rows in pyspark

In order to get duplicate rows in pyspark we use a roundabout method: first we do a groupby count over all the columns, then we filter for the rows whose count is greater than 1. The rows that survive the filter are the duplicate rows in pyspark.

  • Get Duplicate rows in pyspark

We will be using dataframe df_basket1


### Get Duplicate rows in pyspark

df1 = df_basket1.groupBy("Item_group", "Item_name", "price").count().filter("count > 1")
df1.drop('count').show()
  • First we do a groupby count over all the columns, i.e. "Item_group", "Item_name" and "price".
  • Secondly we filter the rows with count greater than 1.

So the resultant duplicate rows are the rows of df_basket1 whose combination of all three columns appears more than once.

