15.17 Summarizing Data by Groups

15.17.1 Problem

You want to summarize your data, based on one or more grouping variables.
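
One way to approach this, sketched here on the assumption that dplyr is installed and that the example data is the cabbages data set shipped with the MASS package, is to pass the data to group_by() and then to summarise(). The name by_cult_date is only illustrative.

```r
library(dplyr)

# Assumption: the example data is cabbages from the MASS package
data(cabbages, package = "MASS")

# Group by both factors, then compute one mean per group for each numeric column
by_cult_date <- cabbages %>%
  group_by(Cult, Date) %>%
  summarise(Weight = mean(HeadWt), VitC = mean(VitC))

by_cult_date
```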

15.17.3 Discussion

There are a few things going on here that may be unfamiliar if you’re new to dplyr and the tidyverse in general.

First, let’s take a closer look at the cabbages data set. It has two factors that can be used as grouping variables: Cult, which has levels c39 and c52, and Date, which has levels d16, d20, and d21. It also has two numeric variables, HeadWt and VitC:
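
A quick way to inspect the data, assuming it is loaded from the MASS package, is with str() and levels():

```r
# Assumption: cabbages is the data set shipped with the MASS package
data(cabbages, package = "MASS")

str(cabbages)           # column types: two factors and two numeric columns
levels(cabbages$Cult)   # levels of the first grouping variable
levels(cabbages$Date)   # levels of the second grouping variable
```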

Finding the overall mean of HeadWt is simple. We could just use the mean() function on that column, but for reasons that will soon become clear, we’ll use the summarise() function instead:
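
A minimal sketch of that call, assuming cabbages comes from MASS (the variable name overall is illustrative):

```r
library(dplyr)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

# With no grouping, summarise() collapses the whole data frame to a single row
overall <- summarise(cabbages, Weight = mean(HeadWt))
overall
```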

The result is a data frame with one row and one column, named Weight.

Often we want to find information about each subset of the data, as specified by a grouping variable. For example, suppose we want to find the mean of each Cult group. To do this, we can use summarise() with group_by().
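
As a sketch, the two steps can be written with a temporary variable holding the grouped data (tmp and by_cult are illustrative names, and cabbages is assumed to come from MASS):

```r
library(dplyr)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

# Step 1: mark the data frame as grouped by Cult
tmp <- group_by(cabbages, Cult)

# Step 2: compute one summary row per group
by_cult <- summarise(tmp, Weight = mean(HeadWt))
by_cult
```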

The command first groups the data frame cabbages based on the value of Cult. There are two levels of Cult, c39 and c52, so there are two groups. It then applies the summarise() function to each group, calculating Weight by taking the mean() of the HeadWt column within that group. The resulting summaries for each group are assembled into a data frame, which is returned.

You can imagine that the cabbages data is split up into two separate data frames, then summarise() is called on each data frame (returning a one-row data frame for each), and then those results are combined together into a final data frame. This is actually how things worked in dplyr’s predecessor, plyr, with the ddply() function.
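
That mental model can be spelled out by hand. The following is only a conceptual sketch using base R's split() and lapply() together with dplyr's bind_rows(); it is not how dplyr or plyr is implemented internally, and the names pieces, summaries, and combined are illustrative.

```r
library(dplyr)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

# Split: one data frame per level of Cult
pieces <- split(cabbages, cabbages$Cult)

# Apply: summarise each piece, giving a one-row data frame per piece
summaries <- lapply(pieces, function(piece) summarise(piece, Weight = mean(HeadWt)))

# Combine: stack the one-row results, keeping the list names as a Cult column
combined <- bind_rows(summaries, .id = "Cult")
combined
```

Note that in this hand-rolled version the Cult column comes back as a character vector rather than a factor.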

The syntax of the previous code used a temporary variable to store results. That’s a little verbose, so instead, we can use %>%, also known as the pipe operator, to chain the function calls together. The pipe operator simply takes what’s on its left and substitutes it as the first argument of the function call on the right. The following two lines of code are equivalent:
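
For instance, under the assumption that cabbages is loaded from MASS, these two calls should produce the same result (a and b are illustrative names):

```r
library(dplyr)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

# Ordinary function call: the data frame is the first argument
a <- summarise(cabbages, Weight = mean(HeadWt))

# Piped call: the data frame on the left becomes the first argument
b <- cabbages %>% summarise(Weight = mean(HeadWt))

all.equal(a, b)
```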

The reason it’s called a pipe operator is that it lets you connect function calls together in sequence to form a pipeline of operations. Another common term for this is a different metaphor: chaining.

So the first argument of the function call is in a different place. So what? The advantages become apparent when chaining is involved. Here’s what it would look like if you wanted to call group_by() and then summarise() without making use of a temporary variable. Instead of proceeding left to right, the computation occurs from the inside out:
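
A sketch of the nested form, assuming cabbages from MASS (nested is an illustrative name):

```r
library(dplyr)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

# group_by() is evaluated first even though it appears inside summarise()
nested <- summarise(group_by(cabbages, Cult), Weight = mean(HeadWt))
nested
```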

Using a temporary variable, as we did earlier, makes it more readable, but a more elegant solution is to use the pipe operator:
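
The piped version of the same computation might look like this, again assuming cabbages from MASS (piped is an illustrative name):

```r
library(dplyr)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

# Reads top to bottom in the order the steps are carried out
piped <- cabbages %>%
  group_by(Cult) %>%
  summarise(Weight = mean(HeadWt))
piped
```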

Back to summarizing data. Summarizing the data frame by grouping using more variables (or columns) is simple: just give it the names of the additional variables. It’s also possible to get more than one summary value by specifying more calculated columns. Here we’ll summarize each Cult and Date group, getting the average of HeadWt and VitC:
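
A sketch with two grouping variables and two calculated columns, assuming cabbages from MASS (the result name two_way is illustrative):

```r
library(dplyr)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

two_way <- cabbages %>%
  group_by(Cult, Date) %>%          # one group per Cult/Date combination
  summarise(
    Weight = mean(HeadWt),          # first summary column
    VitC   = mean(VitC)             # second summary column
  )
two_way
```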

Note

You might have noticed that it says that the result is grouped by Cult, but not Date. This is because the summarise() function removes one level of grouping. This is typically what you want when the input has one grouping variable. When there are multiple grouping variables, this may or may not be what you want. To remove all grouping, use ungroup(), and to add back the original grouping, use group_by() again.
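
One way to check which grouping remains is dplyr's group_vars() function. This is a small sketch assuming cabbages from MASS; summ is an illustrative name.

```r
library(dplyr)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

summ <- cabbages %>%
  group_by(Cult, Date) %>%
  summarise(Weight = mean(HeadWt))

group_vars(summ)                          # grouping left after summarise()
group_vars(ungroup(summ))                 # all grouping removed
group_vars(group_by(summ, Cult, Date))    # original grouping added back
```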

It’s possible to do more than take the mean. You may, for example, want to compute the standard deviation and count of each group. To get the standard deviation, use sd(), and to get a count of rows in each group, use n():
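
A sketch combining the three summaries, assuming cabbages from MASS (stats is an illustrative name):

```r
library(dplyr)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

stats <- cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),   # group mean
    sd     = sd(HeadWt),     # group standard deviation
    n      = n()             # number of rows in the group
  )
stats
```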

Other useful functions for generating summary statistics include min(), max(), and median(). The n() function is a special function that works only inside of the dplyr functions summarise(), mutate() and filter(). See ?summarise for more useful functions.

The n() function gets a count of rows, but if you want it to skip NA values in a column, you need to use a different technique. For example, if you want it to ignore any NAs in the HeadWt column, use sum(!is.na(HeadWt)).
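
The difference between the two counts can be seen in a small sketch. It assumes cabbages from MASS; the copy c1 and the row positions set to NA are chosen only for illustration.

```r
library(dplyr)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

# Illustrative copy with three HeadWt values replaced by NA
c1 <- cabbages
c1$HeadWt[c(1, 20, 45)] <- NA

counts <- c1 %>%
  group_by(Cult) %>%
  summarise(
    n        = n(),                  # counts every row, NA or not
    n_non_na = sum(!is.na(HeadWt))   # counts only rows where HeadWt is present
  )
counts
```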

15.17.3.1 Dealing with NAs

One potential pitfall is that NAs in the data will lead to NAs in the output. Let’s see what happens if we sprinkle a few NAs into HeadWt:
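
A sketch of the problem, assuming cabbages from MASS; the copy c1 and the three row positions are arbitrary choices for illustration.

```r
library(dplyr)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

# Illustrative copy with three HeadWt values replaced by NA
c1 <- cabbages
c1$HeadWt[c(1, 20, 45)] <- NA

with_na <- c1 %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),   # NA for any group that contains an NA
    sd     = sd(HeadWt)      # likewise NA for those groups
  )
with_na
```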

The problem is that mean() and sd() simply return NA if any of the input values are NA. Fortunately, these functions have an option to deal with this very issue: setting na.rm=TRUE will tell them to ignore the NAs.
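
With na.rm = TRUE, the same kind of summary might look like this (again assuming cabbages from MASS, with c1 as an illustrative copy containing NAs):

```r
library(dplyr)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

# Illustrative copy with three HeadWt values replaced by NA
c1 <- cabbages
c1$HeadWt[c(1, 20, 45)] <- NA

no_na <- c1 %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt, na.rm = TRUE),   # NAs dropped before averaging
    sd     = sd(HeadWt, na.rm = TRUE)      # NAs dropped before computing sd
  )
no_na
```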

15.17.3.2 Missing combinations

If there are any empty combinations of the grouping variables, they will not appear in the summarized data frame. These missing combinations can cause problems when making graphs. To illustrate, we’ll remove all entries that have levels c52 and d21. The graph on the left in Figure 15.3 shows what happens when there’s a missing combination in a bar graph:
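
A sketch of how such a data frame and bar graph could be produced, assuming cabbages from MASS and that ggplot2 is installed. The name c2a matches the name used in the text; the plot aesthetics and the name p are assumptions.

```r
library(dplyr)
library(ggplot2)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

# Drop every row with Cult c52 and Date d21, then summarise the remaining groups
c2a <- cabbages %>%
  filter(!(Cult == "c52" & Date == "d21")) %>%
  group_by(Cult, Date) %>%
  summarise(Weight = mean(HeadWt))

c2a  # only five rows: the c52/d21 combination is absent

# Dodged bar graph of the group means; the missing combination leaves a gap
p <- ggplot(c2a, aes(x = Date, y = Weight, fill = Cult)) +
  geom_col(position = "dodge")
p
```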

To fill in the missing combination (Figure 15.3, right), use the complete() function from the tidyr package, which is also part of the tidyverse. The grouping on the summarized data frame, c2a, must first be removed with ungroup(); otherwise complete() will return too many rows.
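
A minimal sketch of that step, assuming cabbages from MASS; c2a is rebuilt here so the example is self-contained, and c2b is an illustrative name.

```r
library(dplyr)
library(tidyr)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

# Summary with the c52/d21 combination missing
c2a <- cabbages %>%
  filter(!(Cult == "c52" & Date == "d21")) %>%
  group_by(Cult, Date) %>%
  summarise(Weight = mean(HeadWt))

# Remove the grouping, then add a row for every Cult/Date combination
c2b <- c2a %>%
  ungroup() %>%
  complete(Cult, Date)
c2b
```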

Figure 15.3: Bar graph with a missing combination (left); With missing combination filled (right)

When we used complete(), it filled in the missing combinations with NA. It’s possible to fill with a different value, with the fill parameter. See ?complete for more information.
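
For example, to fill the missing combination with 0 instead of NA, a sketch could pass a named list to fill. It assumes cabbages from MASS; the choice of 0 and the name c2c are illustrative.

```r
library(dplyr)
library(tidyr)
data(cabbages, package = "MASS")  # assumption: data set from the MASS package

c2c <- cabbages %>%
  filter(!(Cult == "c52" & Date == "d21")) %>%
  group_by(Cult, Date) %>%
  summarise(Weight = mean(HeadWt)) %>%
  ungroup() %>%
  complete(Cult, Date, fill = list(Weight = 0))  # 0 for combinations with no data
c2c
```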

15.17.4 See Also

If you want to calculate standard errors and confidence intervals, see Recipe 15.18.

See Recipe 6.8 for an example of using stat_summary() to calculate means and overlay them on a graph.

To perform transformations on data by groups, see Recipe 15.16.