Set Difference in PySpark – Difference of two dataframes

Set difference in PySpark returns the rows that are present in one dataframe but not in the other. It is computed with the subtract() function.

We will look at examples of:

  • Set difference of two dataframes in PySpark
  • Set difference of a single column across two dataframes in PySpark

We will use two dataframes, df_summerfruits and df_fruits.

df_summerfruits:

[Image: contents of df_summerfruits]

df_fruits:

[Image: contents of df_fruits]

Difference of two dataframes in PySpark – set difference

Syntax:

 df1.subtract(df2)

df1 – the dataframe whose rows are kept
df2 – the dataframe whose rows are removed from df1

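df1.subtract(df2) returns the rows that are present in df1 but not in df2. The operation is equivalent to EXCEPT DISTINCT in SQL, so a row that occurs several times in df1 appears only once in the result. Both dataframes must have the same number of columns, and columns are matched by position.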

# Difference of two dataframes in PySpark: rows of df_summerfruits that are not in df_fruits

df_summerfruits.subtract(df_fruits).show()

The result contains the rows of df_summerfruits that do not appear in df_fruits.

Difference of a column in two dataframes in PySpark – set difference of a column

We will use the subtract() function together with select() to get the set difference of a single column. Each dataframe is first reduced to that column, and the column values that are present in the first dataframe but not in the second are returned.

# Difference of a column in two dataframes in PySpark: 'color' values found only in df_summerfruits

df_summerfruits.select('color').subtract(df_fruits.select('color')).show()

The set difference of the 'color' column of the two dataframes is calculated: the 'color' values that are present in the first dataframe but not in the second are returned.
