15.16 Calculating New Columns by Groups

15.16.1 Problem

You want to create new columns that are the result of calculations performed on groups of data, as specified by a grouping column.

15.16.3 Discussion

Let’s take a closer look at the cabbages data set. It has two grouping variables (factors): Cult, which has levels c39 and c52, and Date, which has levels d16, d20, and d21. It also has two measured numeric variables, HeadWt and VitC:
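A quick way to inspect these columns is sketched below. It assumes the cabbages data set is the one that ships with the MASS package, and loads it with data() rather than library(MASS) so that no MASS functions are attached.

```r
# Assumption: cabbages comes from the MASS package (must be installed)
data(cabbages, package = "MASS")

str(cabbages)     # two factors (Cult, Date) and two numeric columns (HeadWt, VitC)
head(cabbages)    # first few rows
levels(cabbages$Cult)
levels(cabbages$Date)
```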

Suppose we want to find, for each case, the deviation of HeadWt from the overall mean. All we have to do is take the overall mean and subtract it from the observed value for each case:
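A minimal sketch of that calculation, assuming dplyr is installed and using DevWt and cb_overall as illustrative names for the new column and the result:

```r
library(dplyr)
data(cabbages, package = "MASS")   # assumption: cabbages comes from MASS

# Deviation of each head weight from the overall (ungrouped) mean
cb_overall <- mutate(cabbages, DevWt = HeadWt - mean(HeadWt))
head(cb_overall)
```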

You’ll often want to do separate operations like this for each group, where the groups are specified by one or more grouping variables. Suppose, for example, we want to normalize the data within each group by finding the deviation of each case from the mean within the group, where the groups are specified by Cult. In these cases, we can use group_by() and mutate() together:
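One way to write this is sketched below. The names cb and DevWt are illustrative, and the data is again assumed to come from the MASS package.

```r
library(dplyr)
data(cabbages, package = "MASS")   # assumption: cabbages comes from MASS

# Group by Cult, then subtract each group's own mean from HeadWt
cb <- cabbages %>%
  group_by(Cult) %>%
  mutate(DevWt = HeadWt - mean(HeadWt))

head(cb)
```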

The call to group_by() first groups cabbages based on the value of Cult, which has two levels, c39 and c52. The mutate() function is then applied to each group separately, so mean(HeadWt) is the mean within that group rather than the overall mean.

The before and after results are shown in Figure 15.2:

Figure 15.2: Before normalizing (left); After normalizing (right)
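The plotting code for this figure is not shown here. As a rough sketch, a comparison like it could be drawn with ggplot2. The use of box plots and the names p_before and p_after are assumptions for illustration, not necessarily how the figure was made.

```r
library(dplyr)
library(ggplot2)
data(cabbages, package = "MASS")   # assumption: cabbages comes from MASS

# Within-group deviations, as computed with group_by() and mutate()
cb <- cabbages %>%
  group_by(Cult) %>%
  mutate(DevWt = HeadWt - mean(HeadWt)) %>%
  ungroup()

# Before normalizing: raw head weights by cultivar
p_before <- ggplot(cb, aes(x = Cult, y = HeadWt)) +
  geom_boxplot()

# After normalizing: deviations from each cultivar's mean
p_after <- ggplot(cb, aes(x = Cult, y = DevWt)) +
  geom_boxplot()

print(p_before)
print(p_after)
```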

You can also group the data frame on multiple variables and perform operations on multiple variables. The following code groups the data by Cult and Date, forming a group for each distinct combination of the two variables. After forming these groups, the code will calculate the deviation of HeadWt and VitC from the mean of each group:
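A sketch of such code follows. The column names DevWt and DevVitC and the result name cb2 are illustrative, and the data is assumed to come from the MASS package.

```r
library(dplyr)
data(cabbages, package = "MASS")   # assumption: cabbages comes from MASS

# One group per distinct Cult/Date combination;
# each deviation is taken from that group's own mean
cb2 <- cabbages %>%
  group_by(Cult, Date) %>%
  mutate(
    DevWt   = HeadWt - mean(HeadWt),
    DevVitC = VitC - mean(VitC)
  )

head(cb2)
```

The result is still a grouped data frame, so later dplyr operations on it will also run per group. Calling ungroup() on the result removes the grouping if that is not wanted.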

15.16.4 See Also

To summarize data by groups, see Recipe 15.17.