Union and union all of two dataframes in PySpark (row bind)

Union all of two dataframes in PySpark can be accomplished with the unionAll() function. unionAll() row binds two dataframes and does not remove duplicates; this is called union all in PySpark. A union with the duplicates removed is obtained in a roundabout way: first row bind the dataframes with unionAll(), then remove the duplicate rows with distinct().

Note: union() in PySpark behaves differently from UNION in SQL, because it does not remove duplicates. Since Spark 2.0, unionAll() is an alias for union(), and both behave like UNION ALL in SQL.

We will demonstrate each of the following with examples:

union of two dataframes in PySpark – union with distinct rows
union of more than two dataframes
union all of two dataframes in PySpark
union all of more than two dataframes

Union pictographic representation:

[Image: pictographic representation of union]

PySpark union all: union all concatenates the rows of the dataframes but does not remove duplicates.

Union all pictographic representation:

[Image: pictographic representation of union all]

Let’s discuss with an example. We will be using three dataframes, namely df_summerfruits, df_fruits and df_dryfruits.

df_summerfruits:

[Image: contents of df_summerfruits]

df_fruits:

[Image: contents of df_fruits]

df_dryfruits:

[Image: contents of df_dryfruits]

Union all of two dataframes without removing duplicates – Union ALL:

The unionAll() function row binds two dataframes and does not remove duplicates.

########## Union ALL of two dataframes in pyspark

df_union_all = df_summerfruits.unionAll(df_fruits)
df_union_all.show()

The unionAll of the “df_summerfruits” and “df_fruits” dataframes will be:

[Image: result of unionAll on df_summerfruits and df_fruits]

Union all of more than two dataframes in PySpark without removing duplicates – Union ALL:
unionAll() itself takes a single dataframe as its argument, so to row bind more than two dataframes we chain the calls, here with functools.reduce. Duplicates are not removed.

########## Union ALL of more than two dataframes in pyspark

from functools import reduce
from pyspark.sql import DataFrame

def unionAll(*dfs):
    # Fold the dataframes pairwise: (df1 unionAll df2) unionAll df3, and so on
    return reduce(DataFrame.unionAll, dfs)

unionAll(df_summerfruits, df_fruits, df_dryfruits).show()

The unionAll of the “df_summerfruits”, “df_fruits” and “df_dryfruits” dataframes will be:
[Image: result of unionAll on df_summerfruits, df_fruits and df_dryfruits]

Union of two dataframes in PySpark after removing duplicates – Union:
unionAll() followed by distinct() row binds two dataframes and then removes the duplicate rows.

########## Union of two dataframes in pyspark

df_union = df_summerfruits.unionAll(df_fruits).distinct()
df_union.show()

The union of the two dataframes is shown below:
[Image: result of unionAll followed by distinct on df_summerfruits and df_fruits]

Union of more than two dataframes after removing duplicates – Union:

The unionAll() calls are chained with functools.reduce to row bind more than two dataframes, and distinct() then removes the duplicate rows from the combined result.

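########## Union of more than two dataframes in pyspark

from functools import reduce
from pyspark.sql import DataFrame

def unionAll(*dfs):
    # Fold the dataframes pairwise: (df1 unionAll df2) unionAll df3, and so on
    return reduce(DataFrame.unionAll, dfs)

# distinct() is applied once, after all dataframes have been row bound
unionAll(df_summerfruits, df_fruits, df_dryfruits).distinct().show()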

The union of the three dataframes with duplicates removed is shown below:
[Image: result of unionAll followed by distinct on df_summerfruits, df_fruits and df_dryfruits]

Author

Sridhar Venkatachalam

Has close to 10 years of experience in data science and machine learning, and has worked extensively with programming languages like R, Python (Pandas), SAS and PySpark.