Group by Multiple Variables
Introduction
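The basics of the group by operation are covered separately. A group by can also be performed on multiple columns at once, in which case each distinct combination of values in those columns forms one group.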
Procedure
We will be working with a custom dataframe.
# package for creating dataframe
library(tibble)
# tibble or dataframe
df <- tibble(
  col1 = as.integer(c(1, 2, 3, 4, 5)),
  col2 = c(11, 12, 13, 14, 15),
  col3 = c("A", "B", "A", "B", "C"),
  col4 = c("X", "X", "Y", "Y", "Y")
)
View(df)
The rows of the data are:
col1  col2  col3  col4
1     11    A     X
2     12    B     X
3     13    A     Y
4     14    B     Y
5     15    C     Y
We will use the group by and summarize operations to:
- group by col3 and col4, and then:
  - find the mean of col1
  - find the median of col2
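Thus the group by col3 and col4 operation will create one intermediate group for each combination of col3 and col4 that occurs in the data:
Group “AX”: col1 = 1, col2 = 11
Group “BX”: col1 = 2, col2 = 12
Group “AY”: col1 = 3, col2 = 13
Group “BY”: col1 = 4, col2 = 14
Group “CY”: col1 = 5, col2 = 15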
And then we perform summarization on each of these groups.
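The grouping step can be inspected before any summarization. The sketch below rebuilds the same df and uses dplyr's n_groups() and group_keys() to list the groups; the names grouped and keys are chosen here only for illustration.

```r
library(tibble)
library(dplyr)

# same data as in the Procedure section
df <- tibble(
  col1 = as.integer(c(1, 2, 3, 4, 5)),
  col2 = c(11, 12, 13, 14, 15),
  col3 = c("A", "B", "A", "B", "C"),
  col4 = c("X", "X", "Y", "Y", "Y")
)

# group by both columns; no rows are changed, only grouping metadata is added
grouped <- group_by(df, col3, col4)

# number of distinct (col3, col4) combinations
n_groups(grouped)

# one row per group, showing the key values of each group
keys <- group_keys(grouped)
print(keys)
```

Note that the keys are listed in sorted order of col3 and then col4, which need not match the order in which the combinations first appear in the data.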
Code
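# see the Procedure section for the definition of df
library(dplyr)
# group by col3 and col4: one group per distinct (col3, col4) pair
grouped_data <- group_by(df, col3, col4)
# then summarize each group; .groups = "drop" returns an ungrouped result
result <- summarize(grouped_data,
                    mean_value = mean(col1),
                    median_value = median(col2),
                    .groups = "drop")
View(result)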
The output of the above code is:
col3  col4  mean_value  median_value
A     X     1           11
A     Y     3           13
B     X     2           12
B     Y     4           14
C     Y     5           15
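We first grouped by col3 and col4, which produced the groups AX, BX, AY, BY and CY. We then summarized each group to get the final result. Because every group in this dataframe contains a single row, the mean and median of each group are simply that row's values.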
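To see the summarization actually combine rows, here is a minimal sketch on a hypothetical dataframe, df2, in which some combinations of col3 and col4 repeat. The data and the names df2 and result2 are assumptions made for illustration and are not part of the example above; the pipe operator %>% is used as an alternative way of chaining the same two steps.

```r
library(tibble)
library(dplyr)

# hypothetical data: (A, X) appears twice and (B, Y) three times
df2 <- tibble(
  col1 = c(1, 2, 3, 4, 5, 6),
  col2 = c(10, 20, 30, 40, 50, 60),
  col3 = c("A", "A", "A", "B", "B", "B"),
  col4 = c("X", "X", "Y", "Y", "Y", "Y")
)

# group by both columns, then compute one summary row per group
result2 <- df2 %>%
  group_by(col3, col4) %>%
  summarize(
    mean_value = mean(col1),      # mean of col1 within each group
    median_value = median(col2),  # median of col2 within each group
    .groups = "drop"              # return an ungrouped tibble
  )

print(result2)
# group (A, X): mean_value = 1.5, median_value = 15
# group (A, Y): mean_value = 3,   median_value = 30
# group (B, Y): mean_value = 5,   median_value = 50
```

Six input rows collapse to three output rows, one for each combination of col3 and col4 present in df2.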
Conclusion
Thus we have successfully implemented group by with multiple variables using dplyr from the tidyverse.
References
- https://r4ds.had.co.nz/