Distinct value of a column in pyspark

To get the distinct values of a column in pyspark, we use the select() function together with the distinct() function. Another way to get the distinct values of a column is the dropDuplicates() function. Let’s see an example of both.

  • Distinct value of a column in pyspark using distinct() function
  • Distinct value of a column in pyspark using dropDuplicates() function

We will be using the dataframe df

Distinct value of a column in pyspark 1

 

 

Get distinct value of a column in pyspark – distinct() – Method 1

The distinct values of a column in pyspark are obtained by using the select() function along with the distinct() function. select() takes the column name as its argument, and the distinct() call that follows returns the distinct values of that column.

### Get distinct value of column

df.select("name").distinct().show()

The distinct values of the “name” column will be

Distinct value of a column in pyspark 2

 

 

Get distinct value of a column in pyspark – dropDuplicates() – Method 2

The dropDuplicates() function takes a list of column names as its argument and keeps one row for each distinct value of those columns. Selecting the column afterwards gives its distinct values.

### Drop duplicates of the column

df.dropDuplicates(["name"]).select("name").show()

The distinct values of the “name” column will be

Distinct value of a column in pyspark 3

 
