R Dplyr Learning

Select Random Samples in R with Dplyr – (sample_n() and sample_frac())

sample_n() and sample_frac() are the functions used to select random samples from a data set in R using the Dplyr package. sample_n() selects n random rows from a data frame, while sample_frac() selects a random fraction of the rows.
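A minimal sketch of both functions on the built-in mtcars data. The seed value and the sample sizes are arbitrary choices made for illustration.

```r
library(dplyr)
mydata <- mtcars

# Fix the seed so the random sample is reproducible (42 is an arbitrary choice)
set.seed(42)

# Select 5 random rows of the dataframe
sample_rows <- sample_n(mydata, 5)

# Select a random 25% of the rows of the dataframe
sample_quarter <- sample_frac(mydata, 0.25)

nrow(sample_rows)
nrow(sample_quarter)
```

Both functions sample without replacement by default; passing replace = TRUE samples with replacement.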

Remove Duplicate rows in R using Dplyr – distinct ()

The distinct() function in R is used to remove duplicate rows using the Dplyr package. distinct() eliminates duplicate rows based on all columns, on a single variable, or on multiple variables.

distinct() Function in Dplyr – Remove duplicate rows of a dataframe:

library(dplyr)
mydata <- mtcars

# Remove duplicate rows of the dataframe
distinct(mydata)
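Since distinct() can also work on a single variable or on multiple variables, here is a short sketch of both cases. The choice of the cyl and gear columns is only for illustration.

```r
library(dplyr)
mydata <- mtcars

# Keep one row per distinct value of cyl; .keep_all = TRUE retains the other columns
distinct_cyl <- distinct(mydata, cyl, .keep_all = TRUE)

# Keep the distinct combinations of cyl and gear (only these two columns are returned)
distinct_cyl_gear <- distinct(mydata, cyl, gear)

distinct_cyl
distinct_cyl_gear
```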

Select variables (columns) in R using Dplyr – select () Function

The select() function in R is used to select columns using the Dplyr package. select() picks columns by name, by position, or based on conditions on the column names.
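A small sketch of select() on mtcars, first by name and then with the starts_with() helper as one example of a condition. The columns chosen are only illustrative.

```r
library(dplyr)
mydata <- mtcars

# Select columns by name
by_name <- select(mydata, mpg, cyl, wt)

# Select columns whose names start with "d" (disp and drat in mtcars)
by_prefix <- select(mydata, starts_with("d"))

head(by_name)
head(by_prefix)
```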

Drop variables (columns) in R using Dplyr

Dropping a variable in R can be done by putting a minus sign before the column names inside the select() function. Dplyr package in R is provided with select() function which is used to select or drop columns by name, by position, or based on conditions.

Drop by column names in Dplyr:

The select() function, combined with a minus sign, drops columns by name.

library(dplyr)
mydata <- mtcars

# Drop the columns of the dataframe
select (mydata,-c(mpg,cyl,wt))

The above code drops the mpg, cyl and wt columns.


Drop 3rd, 4th and 5th columns of the dataframe:

library(dplyr)
mydata <- mtcars

# Drop 3rd,4th and 5th columns of the dataframe
select(mydata,-c(3,4,5))

The above code drops the 3rd, 4th and 5th columns.


Rearrange or reorder the columns of dataframe in R using Dplyr

The columns of a dataframe can be rearranged in R using Dplyr. Dplyr package in R is provided with select() function, which returns the columns in the order in which they are listed.

The example below reorders the columns with select() so that gear, hp, qsec and vs come first; everything() then appends all remaining columns in their original order.

library(dplyr)
mydata <- mtcars

# Reorder the columns of the dataframe

Mydata1 = select(mydata, gear,hp,qsec,vs, everything())
Mydata1


Rename the column name in R using Dplyr

A column (variable) of a dataframe can be renamed in R using Dplyr. Dplyr package in R is provided with rename() function, which renames columns using the form new_name = old_name.

Rename the column name in R using Dplyr:

Rename the column name using rename function in dplyr.

library(dplyr)
mydata <- mtcars

# Rename the column name of the dataframe
Mydata1 = rename(mydata, displacement=disp, cylinder=cyl)
Mydata1

The above code renames the column disp to displacement and cyl to cylinder.


Filter or subsetting rows in R using Dplyr

Filtering or subsetting rows in R can be done using Dplyr. Dplyr package in R is provided with filter() function which subsets the rows based on one or more conditions.

Filter or subsetting the rows in R using Dplyr:

Subset using filter() function.

library(dplyr)
mydata <- mtcars

# subset the rows of dataframe with condition
Mydata1 = filter(mydata,cyl==6)
Mydata1

Only the rows with cyl equal to 6 are kept.


Filter or subsetting the rows in R with multiple conditions using Dplyr:

library(dplyr)
mydata <- mtcars

# subset the rows of dataframe with multiple conditions
Mydata1 = filter(mydata, gear %in% c(4,5))
Mydata1

Only the rows with gear equal to 4 or 5 are kept.
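The example above matches several values of a single column. As a sketch of combining conditions on different columns, conditions separated by commas inside filter() are joined with AND, and | can be used for OR. The thresholds below are arbitrary and chosen only for illustration.

```r
library(dplyr)
mydata <- mtcars

# Rows that satisfy both conditions: cyl is 6 AND gear is 4 or 5
both_conditions <- filter(mydata, cyl == 6, gear %in% c(4, 5))

# Rows that satisfy either condition: mpg above 30 OR hp above 250
either_condition <- filter(mydata, mpg > 30 | hp > 250)

both_conditions
either_condition
```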


Get the summary of dataset in R using Dplyr – summarise()

Summary statistics of a dataset (such as the mean and median) can be computed in R using Dplyr. Dplyr package in R is provided with summarise() function which collapses a dataframe into a summary. Dplyr package has summarise(), summarise_at(), summarise_if() and summarise_all().

Summary of column in dataset in R using Dplyr – summarise()

library(dplyr)
mydata <- mtcars

# summarise the columns of dataframe
summarise(mydata, mpg_mean=mean(mpg),mpg_median=median(mpg))

Here the summarise() function gets the mean and median of mpg.

Summary of multiple column of dataset in R using Dplyr – summarise_at()

library(dplyr)
mydata <- mtcars

# summarise the list of columns of dataframe
summarise_at(mydata, vars(mpg, hp), list(n = length, mean = mean, median = median))

Here the summarise_at() function gets the number of rows, mean and median of mpg and hp.
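The remaining two variants listed earlier, summarise_if() and summarise_all(), can be sketched in the same way. The iris data and the mean and max functions are chosen here only for illustration.

```r
library(dplyr)
mydata2 <- iris

# summarise_if(): mean of every numeric column (the Species factor is skipped)
numeric_means <- summarise_if(mydata2, is.numeric, mean)

# summarise_all(): maximum of every column, after dropping the Species column
all_max <- summarise_all(select(mydata2, -Species), max)

numeric_means
all_max
```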

Sorting DataFrame in R using Dplyr

Sorting the dataframe in R can be done using Dplyr. Dplyr package in R is provided with arrange() function which sorts the dataframe by one or more columns, in ascending order by default.
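A minimal sketch of arrange() on mtcars, sorting by two columns. The choice of cyl and mpg is only for illustration; desc() reverses the order of a column.

```r
library(dplyr)
mydata <- mtcars

# Sort by cyl in ascending order, then by mpg in descending order within each cyl value
sorted <- arrange(mydata, cyl, desc(mpg))

head(sorted)
```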

Group by function in R using Dplyr

The group_by() function in R is used to group the rows of a dataframe. Dplyr package in R is provided with group_by() function which groups the dataframe by one or more columns, so that mean, sum or any other summary function is then computed per group.

We will be using iris data to depict the example of group_by() function

library(dplyr)
mydata2 <- iris 

# Groupby function for dataframe in R
summarise_at(group_by(mydata2, Species), vars(Sepal.Length), list(~ mean(., na.rm = TRUE)))

The mean of Sepal.Length is calculated for each value of the Species variable.


Groupby function in R with dplyr pipe operator %>%:

library(dplyr)
mydata2 <- iris

# Group by function for dataframe in R using pipe operator
mydata2 %>%
  group_by(Species) %>%
  summarise_at(vars(Sepal.Length), list(~ sum(., na.rm = TRUE)))

The sum of Sepal.Length is calculated for each value of the Species variable with the help of the pipe operator (%>%) in dplyr package.


Window Functions in R using Dplyr

As in SQL, dplyr provides window functions. A window function takes a vector of n values and returns a vector of n values, which makes it useful for ranking or subsetting rows within a group. For example, the min_rank() function calculates the rank of each value.
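A short sketch of min_rank() used as a window function within groups. The iris data and the new column name length_rank are assumptions made for illustration.

```r
library(dplyr)
mydata2 <- iris

# Rank Sepal.Length within each Species (the largest value gets rank 1),
# then keep only the top-ranked rows of each group
top_per_species <- mydata2 %>%
  group_by(Species) %>%
  mutate(length_rank = min_rank(desc(Sepal.Length))) %>%
  filter(length_rank == 1) %>%
  ungroup()

top_per_species
```

min_rank() gives tied values the same rank, so a group can return more than one row when its largest value is repeated.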

Create new variable in R using Mutate Function in dplyr

The mutate() function in R is used to add a new variable or column to a dataframe. Dplyr package in R is provided with mutate(), mutate_all() and mutate_at() functions, which create new variables in the dataframe.

We will be using iris data to depict the example of mutate() function

library(dplyr)
mydata2 <- iris 

# Mutate function for creating new variable to the dataframe in R

mydata3 = mutate(mydata2, sepal_length_width_ratio=Sepal.Length/Sepal.Width)
head(mydata3)

A new column named sepal_length_width_ratio is created using the mutate function, and its values are populated by dividing sepal length by sepal width.

mutate_all() Function in R

mutate_all() function in R applies a function to every column of the dataframe. In our example mutate_all() creates 4 new columns (suffixed with _percent) that hold sepal length, sepal width, petal length and petal width, each divided by 100.

library(dplyr)
mydata2 <- iris 

# Mutate_all function for creating new variable to the dataframe in R

mydata3 = mutate_all(mydata2[,-5], list(percent = ~ . / 100))
head(mydata3)

mutate_at() Function in R

mutate_at() function in R applies a function to the specified columns only. In our example mutate_at() gets the min_rank() of sepal length and sepal width in descending order and stores it in the new columns Sepal.Length_Rank and Sepal.Width_Rank.

library(dplyr)
mydata2 <- iris 

# mutate_at() function for creating new variable to the dataframe in R

mydata4 = mutate_at(mydata2, vars(Sepal.Length, Sepal.Width), list(Rank = ~ min_rank(desc(.))))
head(mydata4)

Union Function & union_all in R using Dplyr (union of data frames)

The union of two data frames in R can be easily achieved by using the union() and union_all() functions in Dplyr package. union() combines the rows of both data frames and removes duplicates, while union_all() keeps the duplicates.
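A minimal sketch of both functions. The two small data frames df1 and df2 are hypothetical data made up for illustration; they share one row (id 3).

```r
library(dplyr)

# Two small illustrative data frames (hypothetical data) that share one row
df1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))
df2 <- data.frame(id = c(3, 4), name = c("c", "d"))

# union() combines the rows and drops the duplicated row
union_rows <- union(df1, df2)

# union_all() combines the rows and keeps the duplicated row
union_all_rows <- union_all(df1, df2)

union_rows
union_all_rows
```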

Intersect Function in R using Dplyr (intersection of data frames)

The intersection of two data frames in R can be easily achieved by using the intersect() function in Dplyr package.

The intersect() function takes the rows that appear in both tables and creates a dataframe from them.

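library(dplyr)

# Two small illustrative data frames (hypothetical data) that share one row
df1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))
df2 <- data.frame(id = c(3, 4), name = c("c", "d"))

# intersect two dataframes
intersect(df1, df2)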

Setdiff() Function in R using Dplyr (get difference of dataframes)

To get the difference of two data frames, i.e. the rows present in one table that are not in the other table, we will be using the setdiff() function in R's Dplyr package. setdiff(x, y) returns the rows of x that do not appear in y.
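A minimal sketch of setdiff(). The data frames df1 and df2 are hypothetical data made up for illustration. Note that the order of the arguments matters.

```r
library(dplyr)

# Two small illustrative data frames (hypothetical data) that share one row
df1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))
df2 <- data.frame(id = c(3, 4), name = c("c", "d"))

# Rows of df1 that are not in df2
only_in_df1 <- setdiff(df1, df2)

# Rows of df2 that are not in df1
only_in_df2 <- setdiff(df2, df1)

only_in_df1
only_in_df2
```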

Case when statement in R using case_when() Dplyr

A case when statement in R can be written with the case_when() function in dplyr package, which is similar to the CASE WHEN statement in SQL.

We will be using iris data to depict the example of case_when() function.

library(dplyr)

mydata2 <- iris 
head(mydata2)


We will be creating an additional variable species_new using the mutate function and a case when statement.

mydata2 %>% mutate(species_new = case_when(is.na(Species) ~ "missing",
                                           Species=="setosa" ~ "setosa_new",
                                           Species=="versicolor" ~ "versicolor_new",
                                           Species=="virginica" ~ "virginica_new",
                                           TRUE ~ "others"))

You can refer to variables directly inside case_when().
TRUE is equivalent to the ELSE statement.


NOTE: case_when() evaluates the conditions in order and the first match wins, so put the is.na() condition at the beginning to handle missing values before they reach the TRUE catch-all.

Row wise operation in R using Dplyr

Row wise operations in R can be performed using the rowwise() function in dplyr package. With rowwise() we will be doing row wise maximum or row wise minimum operations.

We will be creating an additional variable row_max using the mutate function and rowwise() function to store the row wise maximum.

# c_across() (dplyr 1.0.0 or later) collects the values of the selected columns for each row
df1 = mydata2 %>%
  rowwise() %>%
  mutate(row_max = max(c_across(Sepal.Length:Petal.Width)))

head(df1)


Calculate percentile, quantile, N tile of dataframe in R using dplyr (create column with percentile rank)

Quantile, decile and percentile ranks can be calculated using the ntile() function in R's dplyr package, together with mutate(). The ntile() function divides the data into N ranked bins, which makes it useful for creating a column with the percentile, decile or quantile rank.

Decile rank in R:

library(dplyr)

mydata <- mtcars
df1 = mutate(mydata, decile_rank = ntile(mpg, 10))
df1

So in the resultant data frame the decile rank of mpg is calculated and stored in the decile_rank column.


Percentile rank in R:

library(dplyr)

mydata <- mtcars
df1 = mutate(mydata, percentile_rank = ntile(mpg, 100))
df1

So in the resultant data frame the percentile rank of mpg is calculated and stored in the percentile_rank column.


Author

Sridhar Venkatachalam

Close to 10 years of experience in data science and machine learning, with extensive work in programming languages such as R, Python (Pandas), SAS and PySpark.