Basic Group by operation
Introduction
Group by operation is used with summarize(..) function. It helps to summarize data based on desired groups instead on the entire dataset.
The syntax for group by is:
grouped_data <- group_by(<dataframe>, <columns_to_group_by>)
summarize(grouped_data, <variable_name)> = <function_applied_to_column_to_summarize>)
Procedure
We will be working with a custom dataframe.
# package for creating dataframe
library(tibble)
# tibble or dataframe
df <- tibble(col1 = as.integer(c(1,2,3,4,5)),
col2 = c(11,12,13,14,15),
col3 = c("A", "B", "A", "B", "C")
)
View(df)
Few rows of the data are:
We will use the group by and summarize operation to:
-
group by col3 and then:
- find mean for col1
- fnd median for col2
Thus the group by col3 operation will create intermediate results of:
Group “A”:
Group “B”:
Group “C”:
And then we perform summarization on each of these groups.
Code
# refer procedure for definition of df
library(dplyr)
grouped_data <- dplyr::group_by(df, col3)
result <- summarize(grouped_data, mean_value=mean(col1), median_value=median(col2))
View(result)
The output of above code is:
We first grouped by col3, thus we got groups based on column value A, B and C. Then we performed summarization on each group to get final result.
Conclusion
Thus we have successfully implemented basic group by function in tidyverse.
References
- https://r4ds.had.co.nz/