Intersect and Intersect all of dataframes in pyspark (two or more)

The intersection of two dataframes in pyspark can be computed with the intersect() function. It returns the rows that are common to both dataframes and removes duplicates from the result. intersectAll() also returns the common rows, but keeps the duplicates.

In SQL terms, intersect() is equivalent to INTERSECT, which applies a DISTINCT to the result set, and intersectAll() is equivalent to INTERSECT ALL.

  • Intersect of two dataframes in pyspark
  • Intersect of more than two dataframes in pyspark
  • Intersect all of two or more dataframes – without removing the duplicate rows.
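The difference between the two operations can be modelled without Spark. The following is a minimal pure-Python sketch, not PySpark code: the helper names intersect_distinct and intersect_all_rows and the sample rows are made up for illustration. It treats each dataframe as a list of rows and shows that intersect keeps each common row once, while intersect all keeps a common row as many times as it occurs in both inputs (the smaller of the two counts).

```python
from collections import Counter

def intersect_distinct(left, right):
    """Model of intersect(): common rows, each returned once."""
    return sorted(set(left) & set(right))

def intersect_all_rows(left, right):
    """Model of intersectAll(): a common row is kept min(count_left, count_right) times."""
    common = Counter(left) & Counter(right)  # multiset intersection
    return sorted(common.elements())

# Hypothetical rows, not the data used in this post
left_rows = ["Mango", "Mango", "Mango", "Kiwi", "Papaya"]
right_rows = ["Mango", "Mango", "Kiwi", "Apple"]

print(intersect_distinct(left_rows, right_rows))
print(intersect_all_rows(left_rows, right_rows))
```

Here "Mango" occurs three times on the left and twice on the right, so the intersect all model keeps it twice, while the intersect model keeps it once.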


We will be using two dataframes namely df_summerfruits and df_fruits.

df_summerfruits:
[Image: df_summerfruits dataframe]

df_fruits:
[Image: df_fruits dataframe]

 

Intersect of two dataframe in pyspark

dataframe1.intersect(dataframe2) returns the rows that are present in both dataframe1 and dataframe2, with duplicates removed.

########## Intersect of two dataframe in pyspark

df_inter=df_summerfruits.intersect(df_fruits)
df_inter.show()

The common rows present in both “df_summerfruits” and “df_fruits” are shown below

[Image: output of df_summerfruits.intersect(df_fruits)]


Intersect of two or more dataframe in pyspark

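intersect() accepts only one dataframe as its argument, so to get the rows common to more than two dataframes it has to be applied repeatedly. The helper below uses functools.reduce to chain intersect() across any number of dataframes.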

############ intersect of more than two tables

from functools import reduce
from pyspark.sql import DataFrame

def intersect(*dfs):
    return reduce(DataFrame.intersect, dfs)

intersect(df_summerfruits, df_fruits, df_summerfruits).show()

The common rows present in “df_summerfruits”, “df_fruits” and “df_summerfruits” are shown below

[Image: output of the intersect of df_summerfruits, df_fruits and df_summerfruits]



IntersectAll of the dataframe in pyspark:

intersectAll() is similar to intersect(); the only difference is that it does not remove duplicate rows from the resultant dataframe. A row that occurs in both dataframes is kept as many times as it occurs in each of them (the smaller of the two counts).

Like intersect(), intersectAll() takes a single dataframe as its argument. To apply it to more than two dataframes, chain the calls or use reduce as in the previous section.

############ intersect all of two dataframes

df_summerfruits.intersectAll(df_fruits).show()

The common rows present in both dataframes, “df_summerfruits” and “df_fruits”, with duplicates retained, are shown below

[Image: output of df_summerfruits.intersectAll(df_fruits)]

 



Author

  • Sridhar Venkatachalam

    Close to 10 years of experience in data science and machine learning. Has worked extensively with programming languages such as R, Python (Pandas), SAS and PySpark.