Distinct value of dataframe in pyspark – drop duplicates

To get the distinct rows of a dataframe in pyspark, we use the distinct() function. Another way is to drop the duplicate rows with dropDuplicates(), which also leaves only the distinct rows. Let’s see with an example how to get distinct rows in pyspark.

  • Distinct value of dataframe in pyspark using distinct() function.
  • Drop duplicates in pyspark and thereby getting distinct rows – dropDuplicates()
  • Drop duplicates by a specific column in pyspark
  • Distinct value of a column in pyspark

We will be using the dataframe df_student_detail:

(Image: the df_student_detail dataframe)

 

 

Get distinct value of dataframe in pyspark – distinct rows – Method 1

Syntax:

 df.distinct()

df – dataframe

dataframe.distinct() takes no arguments and returns a new dataframe that contains only the distinct rows of the original dataframe.


### Get distinct value of dataframe – distinct row in pyspark
df_student_detail.distinct().show()

The distinct rows of the “df_student_detail” dataframe will be:

(Output image: distinct rows of df_student_detail)

 

 

 

Drop duplicates in pyspark –  get distinct rows – Method 2

Syntax:

 df.dropDuplicates()

df – dataframe

dataframe.dropDuplicates() removes the duplicate rows of the dataframe and keeps only the distinct rows. When called without arguments it compares all columns, so the result is the same as with distinct().

### Drop duplicate rows of dataframe – distinct rows in pyspark

df_student_detail.dropDuplicates().show()

The distinct rows of the “df_student_detail” dataframe obtained with the dropDuplicates() function will be:

(Output image: df_student_detail after dropDuplicates())

 

 

 

Drop duplicates in pyspark by a specific column:

dataframe.dropDuplicates() also accepts a list of column names. Rows that repeat a value in those columns are dropped, so one row is kept for each distinct value of the column, and all other columns of that row are retained.

### drop duplicates by specific column

df_student_detail.dropDuplicates(['name']).show()

The dataframe with duplicate values of column “name” removed will be:

(Output image: df_student_detail after dropDuplicates(['name']))

 
