Remove Duplicate rows in R using Dplyr – distinct () function

Distinct function in R is used to remove duplicate rows in R using Dplyr package.  Dplyr package in R is provided with distinct() function  which eliminate duplicates rows with single variable or with multiple variable. There are other methods to drop duplicate rows in R one method is duplicated() which identifies and removes duplicate in R. The other method is unique() which identifies the unique values.

we will looking at example on How to

  • Get distinct Rows of the dataframe in R using distinct() function.
  • Remove duplicate rows based on two or more variables/columns in R
  • Drop duplicates of the dataframe using duplicated() function in R
  • Get unique rows (remove duplicate rows) of the dataframe in R using unique() function.

remove duplicates in R dplyr 0

Create Dataframe

We will be using the following dataframe  to depict the above functions. Lets first create the dataframe.

# simple Data frame creation

mydata = data.frame (NAME =c ('Alisa','Bobby','jodha','jack','raghu','Cathrine',
                      'Alisa','Bobby','kumar','Alisa','jack','Cathrine'),
                      Age = c (26,24,26,22,23,24,26,24,22,26,22,25),
                      Score =c(85,63,55,74,31,77,85,63,42,85,74,78))

mydata

so the resultant data frame will be

remove duplicates in R dplyr 1

 

 

distinct() Function in Dplyr  –  Remove duplicate rows of a dataframe in R:

library(dplyr)

# Remove duplicate rows of the dataframe
distinct(mydata)

OR

 
library(dplyr) 
mydata %>% distinct()

In this dataset, all the duplicate rows are eliminated so it returns the unique rows in mydata.

remove duplicates in R dplyr 2

 

Remove Duplicate Rows based on a variable

We will be removing duplicate rows using a particular variable.

library(dplyr)

# Remove duplicate rows of the dataframe using NAME variable
distinct(mydata,NAME, .keep_all= TRUE)

OR

 
library(dplyr) 
mydata %>% distinct(NAME, .keep_all= TRUE)

The .keep_all function is used to retain all other variables in the output data frame. So the output dataframe will be

remove duplicates in R dplyr 3

 

Remove Duplicate Rows based on multiple variables

We will be removing duplicate rows using Multiple variables in the below example.

library(dplyr)

# Remove duplicate rows of the dataframe using NAME and Age variables
distinct(mydata, NAME,Age, .keep_all= TRUE)

OR

 
library(dplyr) 
mydata %>% distinct(NAME,Age, .keep_all= TRUE)

The .keep_all function is used to retain all other variables in the output data frame. So the resultant dataframe will be

remove duplicates in R dplyr 4

 

 

DROP Duplicates in R using unique() function in R

When we apply unique function to the above data frame

## Apply unique function for data frame in R
unique(mydata)

Duplicate entries in the data frame are eliminated and the final output will be
unique function in R 5

 

unique rows of the dataframe by keeping last occurrences

unique() function along with the argument fromLast =T indicates keeping the last occurrence in the process of identifying unique values

 
## unique value of dataframe in R by keeping last occurrences

unique(mydata, fromLast=T)

unique values of a dataframe by keeping last occurrences will be

unique function in R 6

unique value of the columns in the dataframe

unique() function takes up the column name as argument and results in identifying unique value of the particular column as shown below

## unique value of the column in R dataframe

unique(mydata$NAME)

so the unique values of the name column will be

unique function in R 7

 

 

Remove Duplicates based on a column using duplicated() function

duplicated()  function along with [!] takes up the column name as argument and results in identifying unique value of the particular column as shown below

 
## unique value of the column in R dataframe 
mydata[!duplicated(mydata$NAME), ] 

so the dataframe with unique values of the NAME column will be

remove duplicates in R dplyr 3

For Further understanding on how to drop duplicate rows in R using Dplyr one can refer dplyr documentation


Other Related Topics :

 

Remove Duplicate rows in R using Dplyr – distinct () previous                                                                                                         Remove Duplicate rows in R using Dplyr – distinct () next

Author

  • Sridhar Venkatachalam

    With close to 10 years on Experience in data science and machine learning Have extensively worked on programming languages like R, Python (Pandas), SAS, Pyspark.