Intersect of two or more dataframes in PySpark

The intersection of two dataframes in PySpark is computed with the intersect() function. It returns the rows that are common to both dataframes.

Like INTERSECT in SQL, intersect() performs a DISTINCT on the result set, so each common row appears only once even if it is duplicated in the inputs.

  • Intersect of two dataframes in PySpark
  • Intersect of more than two dataframes in PySpark

We will use two dataframes, df_summerfruits and df_fruits.

df_summerfruits:
(Figure: contents of the df_summerfruits dataframe)

df_fruits:
(Figure: contents of the df_fruits dataframe)

Intersect of two dataframes in PySpark

dataframe1.intersect(dataframe2) returns the rows that are present in both dataframe1 and dataframe2. Rows are compared across all columns, so both dataframes must have the same number of columns with compatible types.

########## Intersect of two dataframes in pyspark

df_inter=df_summerfruits.intersect(df_fruits)
df_inter.show()

The common rows present in both “df_summerfruits” and “df_fruits” are shown below.

(Figure: output of df_inter.show())

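Intersect of more than two dataframes in PySpark

intersect() itself accepts only one other dataframe, so to intersect more than two dataframes the calls have to be chained. functools.reduce does this by applying DataFrame.intersect pairwise across the dataframes, returning the rows common to all of them.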

############ intersect of more than two tables

from functools import reduce
from pyspark.sql import DataFrame

def intersect(*dfs):
    return reduce(DataFrame.intersect, dfs)

intersect(df_summerfruits, df_fruits, df_summerfruits).show()

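The common rows present in “df_summerfruits”, “df_fruits” and “df_summerfruits” are shown below. Because df_summerfruits is passed twice, the result is the same as the two-dataframe intersect above.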

(Figure: output of the intersect of more than two dataframes)