Distinct value of a column in pyspark – distinct()

To get the distinct values of a column in pyspark, use the select() function together with distinct(). The dropDuplicates() function is another way to get the same result. Both approaches also work for multiple columns and for all the columns of a dataframe. Let's see an example of each:

  • Distinct value of a column in pyspark using the distinct() function
  • Distinct value of a column in pyspark using the dropDuplicates() function
  • Unique/distinct value of multiple columns in pyspark using the distinct() and dropDuplicates() functions
  • Unique/distinct value of all the columns using the distinct() and dropDuplicates() functions


We will be using the dataframe df_basket.


 

Get distinct value of a column in pyspark – distinct() – Method 1

The distinct values of a column are obtained by chaining the select() function with the distinct() function. select() takes the column name as its argument, and distinct() then returns the unique values of that column.

### Get distinct value of column

df_basket.select("Price").distinct().show()

Each distinct value of the “Price” column appears once in the output.

 

Get distinct value of a column – dropDuplicates() – Method 2

The dropDuplicates() function takes a list of column names and keeps one row for each distinct value of those columns. Selecting the column afterwards gives the distinct values of that column.

### Drop duplicates of the column
df_basket.dropDuplicates(['Price']).select("Price").show()

Each distinct value of the “Price” column appears once in the output, as with Method 1.

 

 


Distinct Value of multiple columns in pyspark: Method 1

The distinct values of multiple columns in pyspark are obtained by using the select() function along with the distinct() function. select() takes multiple column names as arguments, and distinct() then returns the unique combinations of values in those columns.

### Get distinct value of multiple columns

df_basket.select("Item_group","Price").distinct().show()

Each distinct combination of “Item_group” and “Price” appears once in the output.

 

Get distinct value of multiple columns in pyspark – Method 2

The dropDuplicates() function takes a list of multiple column names and keeps one row for each distinct combination of values in those columns.

### Drop duplicates across multiple columns
df_basket.dropDuplicates(['Price','Item_group']).select("Item_group","Price").show()

Each distinct combination of “Item_group” and “Price” appears once in the output, as with Method 1.

 

 


Distinct value of all the columns in pyspark – distinct() – Method 1

Calling the distinct() function directly on the dataframe, without select(), returns the distinct rows of the dataframe, i.e. the rows that are unique across all the columns.

### distinct of all the columns

df_basket.distinct().show()

Rows that are identical in every column appear only once in the output.

 

Distinct value of all the columns in pyspark – dropDuplicates() – Method 2

The dropDuplicates() function called without any arguments also returns the distinct rows across all the columns, as shown below.

### Drop duplicates of all the columns
df_basket.dropDuplicates().show()

Rows that are identical in every column appear only once in the output, as with Method 1.

 



Author

  • Sridhar Venkatachalam

    With close to 10 years of experience in data science and machine learning, has worked extensively with programming languages such as R, Python (Pandas), SAS and Pyspark.