We can merge two data frames in R by using the merge() function or by using family of join() function in dplyr package. The data frames must have same column names on which the merging happens. Merge() Function in R is similar to database join operation in SQL. The different arguments to merge() allow you to perform natural joins i.e. inner join, left join, right join,cross join, semi join, anti join and full outer join. We can perform Join in R using merge() Function or by using family of join() functions in dplyr package.
We will have look at an example of
- Inner join using merge() function in R or inner_join() function of dplyr with example
- Outer join using merge() function or full_join() function of dplyr with example
- Left join using left_join() function of dplyr or merge() function
- Right join using right_join() function of dplyr or merge() function.
- Cross join with merge() function
- semi join and anti join in R using semi_join() function and anti_join() function.
Syntax of merge() function in R
- x:data frame1.
- y:data frame2.
- by,x, by.y: The names of the columns that are common to both x and y. The default is to use the columns with common names between the two data frames.
- all, all.x, all.y:Logical values that specify the type of merge. The default value is all=FALSE (meaning that only the matching rows are returned).
UNDERSTANDING THE DIFFERENT TYPES OF MERGE IN R:
- Natural join or Inner Join: To keep only rows that match from the data frames, specify the argument all=FALSE.
- Full outer join or Outer Join:To keep all rows from both data frames, specify all=TRUE.
- Left outer join or Left Join:To include all the rows of your data frame x and only those from y that match, specify x=TRUE.
- Right outer join or Right Join:To include all the rows of your data frame y and only those from x that match, specify y=TRUE.
Lets look at with some examples
# data frame 1 df1 = data.frame(CustomerId = c(1:6), Product = c("Oven","Television","Mobile","WashingMachine","Lightings","Ipad")) df1 # data frame 2 df2 = data.frame(CustomerId = c(2, 4, 6, 7, 8), State = c("California","Newyork","Santiago","Texas","Indiana")) df2
so we will get following two data frames
df1 will be
df2 will be
INNER JOIN Explained
Inner Join in R is the simplest and most common type of join. It is also known as simple join or Natural Join. Inner join returns the rows when matching condition is met.
Inner join in R using merge() function: merge() function takes df1 and df2 as argument. merge() function by default performs inner join there by return only the rows in which the left table have matching keys in the right table.
#### Left Join using merge function df = merge(x=df1,y=df2,by="CustomerId") df
the resultant inner joined dataframe df will be
Inner join in R using inner_join() function of dplyr:
dplyr() package has inner_join() function which performs inner join of two dataframes by “CustomerId” as shown below.
#### Left Join using inner_join function library(dplyr) df= df1 %>% inner_join(df2,by="CustomerId") df
the resultant inner joined dataframe df will be
OUTER JOIN Explained
Outer Join in R combines the results of both left and right outer joins. The joined table will contain all records from both the tables
Outer join in R using merge() function: merge() function takes df1 and df2 as argument along with all=TRUE there by returns all rows from both tables, join records from the left which have matching keys in the right table.
###### outer join in R using merge() function df = merge(x=df1,y=df2,by="CustomerId",all=TRUE) df
the resultant data frame df will be
outer join in R using full_join() function of dplyr:
dplyr() package has full_join() function which performs outer join of two dataframes by “CustomerId” as shown below.
###### outer join in R using outer_join() function library(dplyr) df= df1 %>% full_join(df2,by="CustomerId") df
the resultant outer joined dataframe df will be
LEFT JOIN Explained:
The LEFT JOIN in R returns all records from the left dataframe (A), and the matched records from the right dataframe (B)
Left join in R: merge() function takes df1 and df2 as argument along with all.x=TRUE there by returns all rows from the left table, and any rows with matching keys from the right table.
###### left join in R using merge() function df = merge(x=df1,y=df2,by="CustomerId",all.x=TRUE) df
the resultant data frame df will be
Left join in R using left_join() function of dplyr:
dplyr() package has left_join() function which performs left join of two dataframes by “CustomerId” as shown below.
###### left join in R using left_join() function library(dplyr) df= df1 %>% left_join(df2,by="CustomerId") df
the resultant Left joined dataframe df will be
RIGHT JOIN Explained:
The RIGHT JOIN in R returns all records from the right dataframe (B), and the matched records from the left dataframe (A)
Right join in R: merge() function takes df1 and df2 as argument along with all.y=TRUE and thereby returns all rows from the right table, and any rows with matching keys from the left table.
###### right join in R using merge() function df = merge(x=df1,y=df2,by="CustomerId",all.y=TRUE) df
the resultant data frame df will be
Right join in R using right_join() function of dplyr:
dplyr() package has right_join() function which performs outer join of two dataframes by “CustomerId” as shown below.
###### right join in R using merge() function library(dplyr) df= df1 %>% right_join(df2,by="CustomerId") df
the resultant right joined dataframe df will be
Cross join in R: A Cross Join (also sometimes known as a Cartesian Join) results in every row of one table being joined to every row of another table
##### cross join in R df = merge(x = df1, y = df2, by = NULL) df
the resultant data frame df will be
SEMI JOIN in R using dplyr:
This is like inner join, with only the left dataframe columns and values are selected
#### Semi join in R library(dplyr) df= df1 %>% semi_join(df2,by="CustomerId") df
the resultant data frame df will be
ANTI JOIN in R using dplyr:
This join is like df1-df2, as it selects all rows from df1 that are not present in df2.
#### anti join in R library(dplyr) df= df1 %>% anti_join(df2,by="CustomerId") df
the resultant data frame df will be
For further understanding of join() function in R using dplyr one can refer the dplyr documentation
Related Topics: