Chapter 15 Getting Your Data into Shape

When it comes to making data graphics, half the battle occurs before you call any plotting commands. Before you pass your data to the plotting functions, it must first be read in and given the correct structure. The data sets provided with R are ready to use, but when dealing with real-world data, this usually isn’t the case: you’ll have to clean up and restructure the data before you can visualize it.

The recipes in this chapter will often use packages from the tidyverse. For a little background about the tidyverse, see the introduction section of Chapter 1. I will also show how to do many of the same tasks using base R, because in some situations it is important to minimize the number of packages you use, and because it is useful to be able to understand code written for base R.

Note

The %>% symbol, also known as the pipe operator, is used extensively in this chapter. If you are not familiar with it, see Recipe 1.7.

Most of the tidyverse functions used in this chapter are from the dplyr package, and in this chapter I’ll assume that dplyr is already loaded. You can load it with library(tidyverse), or, if you want to keep things more streamlined, you can load dplyr directly:

library(dplyr)

Data sets in R are most often stored in data frames. They’re typically used as two-dimensional data structures, with each row representing one case and each column representing one variable. Data frames are essentially lists of vectors and factors, all of the same length, where each vector or factor represents one column.

Here’s the heightweight data set:

library(gcookbook) # Load gcookbook for the heightweight data set
heightweight
#>     sex ageYear ageMonth heightIn weightLb
#> 1     f   11.92      143     56.3     85.0
#> 2     f   12.92      155     62.3    105.0
#>  ...<232 more rows>...
#> 236   m   13.92      167     62.0    107.5
#> 237   m   12.58      151     59.3     87.0

It consists of five columns, with each row representing one case: a set of information about a single person. We can get a clearer idea of how it’s structured by using the str() function:

str(heightweight)
#> 'data.frame':    236 obs. of  5 variables:
#>  $ sex     : Factor w/ 2 levels "f","m": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ ageYear : num  11.9 12.9 12.8 13.4 15.9 ...
#>  $ ageMonth: int  143 155 153 161 191 171 185 142 160 140 ...
#>  $ heightIn: num  56.3 62.3 63.3 59 62.5 62.5 59 56.5 62 53.8 ...
#>  $ weightLb: num  85 105 108 92 112 ...

The first column, sex, is a factor with two levels, "f" and "m", and the other four columns are vectors of numbers (one of them, ageMonth, is specifically a vector of integers, but for the purposes here, it behaves the same as any other numeric vector).

Factors and character vectors behave similarly in ggplot – the main difference is that with character vectors, items will be displayed in lexicographical order, but with factors, items will be displayed in the same order as the factor levels, which you can control.
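As a quick sketch of this difference, compare the display order a character vector would produce with the level order of a factor:

```r
# A character vector sorts lexicographically...
chars <- c("medium", "small", "large")
sort(unique(chars))
#> [1] "large"  "medium" "small"

# ...while a factor keeps whatever level order you give it
fac <- factor(chars, levels = c("small", "medium", "large"))
levels(fac)
#> [1] "small"  "medium" "large"
```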

15.1 Creating a Data Frame

15.1.1 Problem

You want to create a data frame from vectors.

15.1.2 Solution

You can put vectors together in a data frame with data.frame():

# Two starting vectors
g <- c("A", "B", "C")
x <- 1:3
dat <- data.frame(g, x)
dat
#>   g x
#> 1 A 1
#> 2 B 2
#> 3 C 3

15.1.3 Discussion

A data frame is essentially a list of vectors and factors. Each vector or factor can be thought of as a column in the data frame.

If your vectors are in a list, you can convert the list to a data frame with the as.data.frame() function:

lst <- list(group = g, value = x)    # A list of vectors

dat <- as.data.frame(lst)

The tidyverse way of creating a data frame is to use tibble(); in older code you may also see data_frame() or as_data_frame() (note the underscores instead of periods), which are now deprecated aliases. These return a special kind of data frame – a tibble – which behaves like a regular data frame in most contexts, but prints out more nicely and is specifically designed to play well with the tidyverse functions.

data_frame(g, x)
#> Warning: `data_frame()` is deprecated, use `tibble()`.
#> This warning is displayed once per session.
#> # A tibble: 3 x 2
#>   g         x
#>   <chr> <int>
#> 1 A         1
#> 2 B         2
#> 3 C         3
# Convert the list of vectors to a tibble
as_data_frame(lst)

A regular data frame can be converted to a tibble using as_tibble():

as_tibble(dat)
#> # A tibble: 3 x 2
#>   group value
#>   <fct> <int>
#> 1 A         1
#> 2 B         2
#> 3 C         3

15.2 Getting Information About a Data Structure

15.2.1 Problem

You want to find out information about an object or data structure.

15.2.2 Solution

Use the str() function:

str(ToothGrowth)
#> 'data.frame':    60 obs. of  3 variables:
#>  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
#>  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

This tells us that ToothGrowth is a data frame with three columns, len, supp, and dose. len and dose contain numeric values, while supp is a factor with two levels.

Another useful function is the summary() function:

summary(ToothGrowth)

Instead of showing you the first few values of each column as str() does, summary() provides basic descriptive statistics (the minimum, maximum, median, mean, and first and third quartiles) for numeric variables, and, for character or factor variables, tells you the number of values at each character value or factor level.

15.2.3 Discussion

The str() function is very useful for finding out more about data structures. One common source of problems is a data frame where one of the columns is a character vector instead of a factor, or vice versa. This can cause puzzling issues with analyses or graphs.

When you print out a data frame the normal way, by just typing the name at the prompt and pressing Enter, factor and character columns appear exactly the same. The difference will be revealed only when you run str() on the data frame, or print out the column by itself:

tg <- ToothGrowth
tg$supp <- as.character(tg$supp)
str(tg)
#> 'data.frame':    60 obs. of  3 variables:
#>  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
#>  $ supp: chr  "VC" "VC" "VC" "VC" ...
#>  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# Print out the columns by themselves
# From old data frame (factor)
ToothGrowth$supp
#>  [1] VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC
#> [26] VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
#> [51] OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
#> Levels: OJ VC
# From new data frame (character)
tg$supp
#>  [1] "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC"
#> [16] "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC"
#> [31] "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ"
#> [46] "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ"

15.3 Adding a Column to a Data Frame

15.3.1 Problem

You want to add a column to a data frame.

15.3.2 Solution

Use mutate() from dplyr to add a new column and assign values to it. This returns a new data frame, which you’ll typically want to save over the original.

If you assign a single value to the new column, the entire column will be filled with that value. This adds a column named newcol, filled with NA:

library(dplyr)

ToothGrowth %>%
  mutate(newcol = NA)
#>     len supp dose newcol
#> 1   4.2   VC  0.5     NA
#> 2  11.5   VC  0.5     NA
#>  ...<56 more rows>...
#> 59 29.4   OJ  2.0     NA
#> 60 23.0   OJ  2.0     NA

You can also assign a vector to the new column:

# Since ToothGrowth has 60 rows, we must create a vector with 60 elements
vec <- rep(c(1, 2), 30)

ToothGrowth %>%
  mutate(newcol = vec)

Note that the vector being added to the data frame must have either one element or as many elements as the data frame has rows. In the example above, we created a vector with 60 elements by repeating c(1, 2) thirty times.

15.3.3 Discussion

Each column of a data frame is a vector. R handles columns in data frames slightly differently from standalone vectors because all the columns in a data frame must have the same length.

To add a column using base R, you can simply assign values into the new column like so:

# Make a copy of ToothGrowth for this example
ToothGrowth2 <- ToothGrowth

# Assign NA's for the whole column
ToothGrowth2$newcol <- NA

# Assign 1 and 2, automatically repeating to fill
ToothGrowth2$newcol <- c(1, 2)

With base R, the vector being assigned into the data frame will automatically be repeated to fill the number of rows in the data frame.

15.4 Deleting a Column from a Data Frame

15.4.1 Problem

You want to delete a column from a data frame.

15.4.2 Solution

Use select() from dplyr and specify the columns you want to drop by using - (a minus sign). This returns a new data frame, which you’ll typically want to save over the original.

# Remove the len column
ToothGrowth %>%
  select(-len)

15.4.3 Discussion

You can list multiple columns that you want to drop at the same time, or conversely specify only the columns that you want to keep. The following two pieces of code are thus equivalent:

# Remove both len and supp from ToothGrowth
ToothGrowth %>%
  select(-len, -supp)
#>    dose
#> 1   0.5
#> 2   0.5
#>  ...<56 more rows>...
#> 59  2.0
#> 60  2.0

# This keeps just dose, which has the same effect for this data set
ToothGrowth %>%
  select(dose)
#>    dose
#> 1   0.5
#> 2   0.5
#>  ...<56 more rows>...
#> 59  2.0
#> 60  2.0

To remove a column using base R, you can simply assign NULL to that column.

# Make a copy of ToothGrowth for this example
ToothGrowth2 <- ToothGrowth

ToothGrowth2$len <- NULL

15.4.4 See Also

Recipe 15.7 for more on getting a subset of a data frame.

See ?select for more ways to drop and keep columns.

15.5 Renaming Columns in a Data Frame

15.5.1 Problem

You want to rename the columns in a data frame.

15.5.2 Solution

Use rename() from dplyr. This returns a new data frame, which you’ll typically want to save over the original.

tg_mod <- ToothGrowth %>%
  rename(length = len)

15.5.3 Discussion

You can rename multiple columns within the same call to rename():

ToothGrowth %>%
  rename(
    length = len,
    supplement_type = supp
  )
#>    length supplement_type dose
#> 1     4.2              VC  0.5
#> 2    11.5              VC  0.5
#>  ...<56 more rows>...
#> 59   29.4              OJ  2.0
#> 60   23.0              OJ  2.0

Renaming a column using base R is a bit more verbose. It uses the names() function on the left side of the <- operator.

# Make a copy of ToothGrowth for this example
ToothGrowth2 <- ToothGrowth

names(ToothGrowth2)  # Print the names of the columns
#> [1] "len"  "supp" "dose"

# Rename "len" to "length"
names(ToothGrowth2)[names(ToothGrowth2) == "len"] <- "length"

names(ToothGrowth)
#> [1] "len"  "supp" "dose"

15.5.4 See Also

See ?select for more ways to rename columns within a data frame.

15.6 Reordering Columns in a Data Frame

15.6.1 Problem

You want to change the order of columns in a data frame.

15.6.2 Solution

Use the select() function from dplyr.

ToothGrowth %>%
  select(dose, len, supp)
#>    dose  len supp
#> 1   0.5  4.2   VC
#> 2   0.5 11.5   VC
#>  ...<56 more rows>...
#> 59  2.0 29.4   OJ
#> 60  2.0 23.0   OJ

The new data frame will contain the columns you specified in select(), in the order you specified. Note that select() returns a new data frame, so if you want to change the original variable, you’ll need to save the new result over it.

15.6.3 Discussion

If you are only reordering a few variables and want to keep the rest of the variables in order, you can use everything() as a placeholder:

ToothGrowth %>%
  select(dose, everything())
#>    dose  len supp
#> 1   0.5  4.2   VC
#> 2   0.5 11.5   VC
#>  ...<56 more rows>...
#> 59  2.0 29.4   OJ
#> 60  2.0 23.0   OJ

See ?select_helpers for other ways to select columns. You can, for example, select columns by matching parts of the name.

Using base R, you can also reorder columns by their name or numeric position. This returns a new data frame, which can be saved over the original.

ToothGrowth[c("dose", "len", "supp")]

ToothGrowth[c(3, 1, 2)]

In these examples, I used list-style indexing. A data frame is essentially a list of vectors, and indexing into it as a list will return another data frame. You can get the same effect with matrix-style indexing:

ToothGrowth[c("dose", "len", "supp")]   # List-style indexing

ToothGrowth[, c("dose", "len", "supp")] # Matrix-style indexing

In this case, both methods return the same result, a data frame. However, when retrieving a single column, list-style indexing will return a data frame, while matrix-style indexing will return a vector:

ToothGrowth["dose"]
#>    dose
#> 1   0.5
#> 2   0.5
#>  ...<56 more rows>...
#> 59  2.0
#> 60  2.0
ToothGrowth[, "dose"]
#>  [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#> [20] 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
#> [39] 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
#> [58] 2.0 2.0 2.0

You can use drop = FALSE to ensure that it returns a data frame:

ToothGrowth[, "dose", drop = FALSE]
#>    dose
#> 1   0.5
#> 2   0.5
#>  ...<56 more rows>...
#> 59  2.0
#> 60  2.0

15.7 Getting a Subset of a Data Frame

15.7.1 Problem

You want to get a subset of a data frame.

15.7.2 Solution

Use filter() to get the rows, and select() to get the columns you want. These operations can be chained together using the %>% operator. These functions return a new data frame, so if you want to change the original variable, you’ll need to save the new result over it.

We’ll use the climate data set for the examples here:

library(gcookbook) # Load gcookbook for the climate data set
climate
#>       Source Year Anomaly1y Anomaly5y Anomaly10y Unc10y
#> 1   Berkeley 1800        NA        NA     -0.435  0.505
#> 2   Berkeley 1801        NA        NA     -0.453  0.493
#>  ...<495 more rows>...
#> 498  CRUTEM3 2010    0.8023        NA         NA     NA
#> 499  CRUTEM3 2011    0.6193        NA         NA     NA

Let’s say that we only want to keep rows where Source is "Berkeley" and where the year is between 1900 and 2000, inclusive. You can do this with the filter() function:

climate %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000)

If you want only the Year and Anomaly10y columns, use select(), as we did in Recipe 15.4:

climate %>%
  select(Year, Anomaly10y)
#>     Year Anomaly10y
#> 1   1800     -0.435
#> 2   1801     -0.453
#>  ...<495 more rows>...
#> 498 2010         NA
#> 499 2011         NA

These operations can be chained together using the %>% operator:

climate %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000) %>%
  select(Year, Anomaly10y)
#>     Year Anomaly10y
#> 1   1900     -0.171
#> 2   1901     -0.162
#>  ...<97 more rows>...
#> 100 1999      0.734
#> 101 2000      0.748

15.7.3 Discussion

The filter() function picks out rows based on a condition. If you want to pick out rows based on their numeric position, use the slice() function:

slice(climate, 1:100)

I generally recommend indexing using names rather than numbers when possible. It makes the code easier to understand when you’re collaborating with others or when you come back to it months or years after writing it, and it makes the code less likely to break when there are changes to the data, such as when columns are added or removed.

With base R, you can get a subset of rows like this:

climate[climate$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000, ]
#>       Source Year Anomaly1y Anomaly5y Anomaly10y Unc10y
#> 101 Berkeley 1900        NA        NA     -0.171  0.108
#> 102 Berkeley 1901        NA        NA     -0.162  0.109
#>  ...<97 more rows>...
#> 200 Berkeley 1999        NA        NA      0.734  0.025
#> 201 Berkeley 2000        NA        NA      0.748  0.026

Notice that we needed to prefix each column name with climate$, and that there’s a comma after the selection criteria. This indicates that we’re getting rows, not columns.

This row filtering can also be combined with the column selection from Recipe 15.4:

climate[climate$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000,
        c("Year", "Anomaly10y")]
#>     Year Anomaly10y
#> 101 1900     -0.171
#> 102 1901     -0.162
#>  ...<97 more rows>...
#> 200 1999      0.734
#> 201 2000      0.748

15.8 Changing the Order of Factor Levels

15.8.1 Problem

You want to change the order of levels in a factor.

15.8.2 Solution

Pass the factor to factor(), and give it the levels in the order you want. This returns a new factor, so if you want to change the original variable, you’ll need to save the new result over it.

# By default, levels are ordered alphabetically
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes
#> [1] small  large  large  small  medium
#> Levels: large medium small

factor(sizes, levels = c("small", "medium", "large"))
#> [1] small  large  large  small  medium
#> Levels: small medium large

The order can also be specified with levels when the factor is first created:

factor(c("small", "large", "large", "small", "medium"),
       levels = c("small", "medium", "large"))

15.8.3 Discussion

There are two kinds of factors in R: ordered factors and regular factors. (In practice, ordered factors are not commonly used.) In both types, the levels are arranged in some order; the difference is that the order is meaningful for an ordered factor, but it is arbitrary for a regular factor – it simply reflects how the data is stored. For plotting data, the distinction between ordered and regular factors is generally unimportant, and they can be treated the same.
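A minimal sketch of the distinction:

```r
# A regular factor: the level order is arbitrary (alphabetical by default)
f1 <- factor(c("small", "large", "medium"))
is.ordered(f1)
#> [1] FALSE

# An ordered factor: the level order is meaningful, and comparisons work
f2 <- factor(c("small", "large", "medium"),
             levels = c("small", "medium", "large"), ordered = TRUE)
f2[1] < f2[2]   # small < large
#> [1] TRUE
```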

The order of factor levels affects graphical output. When a factor variable is mapped to an aesthetic property in ggplot, the aesthetic adopts the ordering of the factor levels. If a factor is mapped to the x-axis, the ticks on the axis will be in the order of the factor levels, and if a factor is mapped to color, the items in the legend will be in the order of the factor levels.

To reverse the level order, you can use rev(levels()):

factor(sizes, levels = rev(levels(sizes)))

The tidyverse function for reordering factors is fct_relevel() from the forcats package. It has a syntax similar to the factor() function from base R.

# Change the order of levels
library(forcats)
fct_relevel(sizes, "small", "medium", "large")
#> [1] small  large  large  small  medium
#> Levels: small medium large

15.8.4 See Also

To reorder a factor based on the value of another variable, see Recipe 15.9.

Reordering factor levels is useful for controlling the order of axes and legends. See Recipes 8.4 and 10.3 for more information.

15.9 Changing the Order of Factor Levels Based on Data Values

15.9.1 Problem

You want to change the order of levels in a factor based on values in the data.

15.9.2 Solution

Use reorder() with the factor that has levels to reorder, the values to base the reordering on, and a function that aggregates the values:

# Make a copy of the InsectSprays data set since we're modifying it
iss <- InsectSprays
iss$spray
#>  [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C C C D D
#> [39] D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F F F F F
#> Levels: A B C D E F

iss$spray <- reorder(iss$spray, iss$count, FUN = mean)
iss$spray
#>  [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C C C D D
#> [39] D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F F F F F
#> attr(,"scores")
#>         A         B         C         D         E         F 
#> 14.500000 15.333333  2.083333  4.916667  3.500000 16.666667 
#> Levels: C E D A B F

Notice that the original levels were ABCDEF, while the reordered levels are CEDABF. What we’ve done is reorder the levels of spray based on the mean value of count for each level of spray.

15.9.3 Discussion

The usefulness of reorder() might not be obvious from just looking at the raw output. Figure 15.1 shows three plots made with reorder(). In these plots, the order in which the items appear is determined by their values.


Figure 15.1: Original data (left); Reordered by the mean of each group (middle); Reordered by the median of each group (right)

In the middle plot in Figure 15.1, the boxes are sorted by the mean. The horizontal line that runs across each box represents the median of the data. Notice that these values do not increase strictly from left to right. That’s because with this particular data set, sorting by the mean gives a different order than sorting by the median. To make the median lines increase from left to right, as in the plot on the right in Figure 15.1, we used the median() function in reorder().
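The median-based ordering shown in the right-hand plot can be produced the same way, just swapping in median():

```r
# Reorder the levels of spray by the median of count for each group
iss2 <- InsectSprays
iss2$spray <- reorder(iss2$spray, iss2$count, FUN = median)

# The level order now follows the group medians, smallest first
levels(iss2$spray)
```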

The tidyverse function for reordering factors is fct_reorder(), and it is used the same way as reorder(). These do the same thing:

reorder(iss$spray, iss$count, FUN = mean)
fct_reorder(iss$spray, iss$count, .fun = mean)

15.9.4 See Also

Reordering factor levels is also useful for controlling the order of axes and legends. See Recipes 8.4 and 10.3 for more information.

15.10 Changing the Names of Factor Levels

15.10.1 Problem

You want to change the names of levels in a factor.

15.10.2 Solution

Use fct_recode() from the forcats package:

sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes
#> [1] small  large  large  small  medium
#> Levels: large medium small

# Pass it a named vector with the mappings
fct_recode(sizes, S = "small", M = "medium", L = "large")
#> [1] S L L S M
#> Levels: L M S

15.10.3 Discussion

If you want to use two vectors, one with the original levels and one with the new ones, use do.call() with fct_recode().

old <- c("small", "medium", "large")
new <- c("S", "M", "L")

# Create a named vector that has the mappings between old and new
mappings <- setNames(old, new)
mappings
#>        S        M        L 
#>  "small" "medium"  "large"

# Create a list of the arguments to pass to fct_recode
args <- c(list(sizes), mappings)

# Look at the structure of the list
str(args)
#> List of 4
#>  $  : Factor w/ 3 levels "large","medium",..: 3 1 1 3 2
#>  $ S: chr "small"
#>  $ M: chr "medium"
#>  $ L: chr "large"

# Use do.call to call fct_recode with the arguments
do.call(fct_recode, args)
#> [1] S L L S M
#> Levels: L M S

Or, more concisely, we can do all of that in one go:

do.call(
  fct_recode,
  c(list(sizes), setNames(c("small", "medium", "large"), c("S", "M", "L")))
)
#> [1] S L L S M
#> Levels: L M S

For a more traditional (and clunky) base R method for renaming factor levels, use the levels()<- function:

sizes <- factor(c("small", "large", "large", "small", "medium"))

# Index into the levels and rename each one
levels(sizes)[levels(sizes) == "large"]  <- "L"
levels(sizes)[levels(sizes) == "medium"] <- "M"
levels(sizes)[levels(sizes) == "small"]  <- "S"
sizes
#> [1] S L L S M
#> Levels: L M S

If you are renaming all your factor levels, there is a simpler method. You can pass a list to levels()<-:

sizes <- factor(c("small", "large", "large", "small", "medium"))
levels(sizes) <- list(S = "small", M = "medium", L = "large")
sizes
#> [1] S L L S M
#> Levels: S M L

With this method, all factor levels must be specified in the list; if any are missing, they will be replaced with NA.
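For example, here "medium" is omitted from the list, so those values become NA:

```r
sizes <- factor(c("small", "large", "large", "small", "medium"))

# "medium" is missing from the list, so its values are replaced with NA
levels(sizes) <- list(S = "small", L = "large")
sizes
#> [1] S    L    L    S    <NA>
#> Levels: S L
```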

It’s also possible to rename factor levels by position, but this is somewhat inelegant:

sizes <- factor(c("small", "large", "large", "small", "medium"))
levels(sizes)[1] <- "L"
sizes
#> [1] small  L      L      small  medium
#> Levels: L medium small

# Rename all levels at once
levels(sizes) <- c("L", "M", "S")
sizes
#> [1] S L L S M
#> Levels: L M S

It’s safer to rename factor levels by name rather than by position, since you will be less likely to make a mistake (and mistakes here may be hard to detect). Also, if your input data set changes to have more or fewer levels, the numeric positions of the existing levels could change, which could cause serious but nonobvious problems for your analysis.

15.10.4 See Also

If, instead of a factor, you have a character vector with items to rename, see Recipe 15.12.

15.11 Removing Unused Levels from a Factor

15.11.1 Problem

You want to remove unused levels from a factor.

15.11.2 Solution

Sometimes, after processing your data you will have a factor that contains levels that are no longer used. Here’s an example:

sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes <- sizes[1:3]
sizes
#> [1] small large large
#> Levels: large medium small

To remove them, use droplevels():

droplevels(sizes)
#> [1] small large large
#> Levels: large small

15.11.3 Discussion

The droplevels() function preserves the order of factor levels. When applied to a whole data frame, it drops unused levels from every factor column; the except parameter specifies columns to leave untouched.

The tidyverse way is to use fct_drop() from the forcats package:

fct_drop(sizes)
#> [1] small large large
#> Levels: large small

15.12 Changing the Names of Items in a Character Vector

15.12.1 Problem

You want to change the names of items in a character vector.

15.12.2 Solution

Use recode() from the dplyr package:

library(dplyr)

sizes <- c("small", "large", "large", "small", "medium")
sizes
#> [1] "small"  "large"  "large"  "small"  "medium"

# With recode(), pass it a named vector with the mappings
recode(sizes, small = "S", medium = "M", large = "L")
#> [1] "S" "L" "L" "S" "M"

# Can also use quotes -- useful if there are spaces or other strange characters
recode(sizes, "small" = "S", "medium" = "M", "large" = "L")
#> [1] "S" "L" "L" "S" "M"

15.12.3 Discussion

If you want to use two vectors, one with the original levels and one with the new ones, use do.call() with recode().

old <- c("small", "medium", "large")
new <- c("S", "M", "L")
# Create a named vector that has the mappings between old and new
mappings <- setNames(new, old)
mappings
#>  small medium  large 
#>    "S"    "M"    "L"

# Create a list of the arguments to pass to recode
args <- c(list(sizes), mappings)
# Look at the structure of the list
str(args)
#> List of 4
#>  $       : chr [1:5] "small" "large" "large" "small" ...
#>  $ small : chr "S"
#>  $ medium: chr "M"
#>  $ large : chr "L"
# Use do.call to call recode with the arguments
do.call(recode, args)
#> [1] "S" "L" "L" "S" "M"

Or, more concisely, we can do all of that in one go:

do.call(
  recode,
  c(list(sizes), setNames(c("S", "M", "L"), c("small", "medium", "large")))
)
#> [1] "S" "L" "L" "S" "M"

Note that for recode(), the names and values of the arguments are reversed, compared to the fct_recode() function from the forcats package. With recode(), you would use small = "S", whereas for fct_recode(), you would use S = "small".

A more traditional R method is to use square-bracket indexing to select the items and rename them:

sizes <- c("small", "large", "large", "small", "medium")
sizes[sizes == "small"]  <- "S"
sizes[sizes == "medium"] <- "M"
sizes[sizes == "large"]  <- "L"
sizes
#> [1] "S" "L" "L" "S" "M"

15.12.4 See Also

If, instead of a character vector, you have a factor with levels to rename, see Recipe 15.10.

15.13 Recoding a Categorical Variable to Another Categorical Variable

15.13.1 Problem

You want to recode a categorical variable to another categorical variable.

15.13.2 Solution

For the examples here, we’ll use a subset of the PlantGrowth data set:

# Work on a subset of the PlantGrowth data set
pg <- PlantGrowth[c(1,2,11,21,22), ]
pg
#>    weight group
#> 1    4.17  ctrl
#> 2    5.58  ctrl
#> 11   4.81  trt1
#> 21   6.31  trt2
#> 22   5.12  trt2

In this example, we’ll recode the categorical variable group into another categorical variable, treatment. If the old value was "ctrl", the new value will be "No", and if the old value was "trt1" or "trt2", the new value will be "Yes".

This can be done with the recode() function from the dplyr package:

library(dplyr)

recode(pg$group, ctrl = "No", trt1 = "Yes", trt2 = "Yes")
#> [1] No  No  Yes Yes Yes
#> Levels: No Yes

You can assign it as a new column in the data frame:

pg$treatment <- recode(pg$group, ctrl = "No", trt1 = "Yes", trt2 = "Yes")

Note that since the input was a factor, it returns a factor. If you want to get a character vector instead, use as.character():

recode(as.character(pg$group), ctrl = "No", trt1 = "Yes", trt2 = "Yes")
#> [1] "No"  "No"  "Yes" "Yes" "Yes"

15.13.3 Discussion

You can also use the fct_recode() function from the forcats package. It works the same, except the names and values are swapped, which may be a little more intuitive:

library(forcats)
fct_recode(pg$group, No = "ctrl", Yes = "trt1", Yes = "trt2")
#> [1] No  No  Yes Yes Yes
#> Levels: No Yes

Another difference is that fct_recode() will always return a factor, whereas recode() will return a character vector if it is given a character vector, and will return a factor if it is given a factor. (Although dplyr does have a recode_factor() function which also always returns a factor.)
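A quick sketch of recode_factor():

```r
library(dplyr)

# recode_factor() returns a factor even when given a character vector,
# with the levels ordered to match the order of the replacements
recode_factor(c("ctrl", "trt1", "trt2"), ctrl = "No", trt1 = "Yes", trt2 = "Yes")
#> [1] No  Yes Yes
#> Levels: No Yes
```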

Using base R, recoding can be done with the match() function:

oldvals <- c("ctrl", "trt1", "trt2")
newvals <- factor(c("No", "Yes", "Yes"))

newvals[ match(pg$group, oldvals) ]
#> [1] No  No  Yes Yes Yes
#> Levels: No Yes

It can also be done by indexing in the vectors:

pg$treatment[pg$group == "ctrl"] <- "No"
pg$treatment[pg$group == "trt1"] <- "Yes"
pg$treatment[pg$group == "trt2"] <- "Yes"

# Convert to a factor
pg$treatment <- factor(pg$treatment)
pg
#>    weight group treatment
#> 1    4.17  ctrl        No
#> 2    5.58  ctrl        No
#> 11   4.81  trt1       Yes
#> 21   6.31  trt2       Yes
#> 22   5.12  trt2       Yes

Here, we combined two of the factor levels and put the result into a new column. If you simply want to rename the levels of a factor, see Recipe 15.10.

The coding criteria can also be based on values in multiple columns, by using the & and | operators:

pg$newcol[pg$group == "ctrl" & pg$weight < 5]  <- "no_small"
pg$newcol[pg$group == "ctrl" & pg$weight >= 5] <- "no_large"
pg$newcol[pg$group == "trt1"] <- "yes"
pg$newcol[pg$group == "trt2"] <- "yes"
pg$newcol <- factor(pg$newcol)
pg
#>    weight group   newcol
#> 1    4.17  ctrl no_small
#> 2    5.58  ctrl no_large
#> 11   4.81  trt1      yes
#> 21   6.31  trt2      yes
#> 22   5.12  trt2      yes

It’s also possible to combine two columns into one using the interaction() function, which appends the values with a . in between. This combines the weight and group columns into a new column, weightgroup:

pg$weightgroup <- interaction(pg$weight, pg$group)
pg
#>    weight group weightgroup
#> 1    4.17  ctrl   4.17.ctrl
#> 2    5.58  ctrl   5.58.ctrl
#> 11   4.81  trt1   4.81.trt1
#> 21   6.31  trt2   6.31.trt2
#> 22   5.12  trt2   5.12.trt2

15.13.4 See Also

For more on renaming factor levels, see Recipe 15.10.

See Recipe 15.14 for recoding continuous values to categorical values.

15.14 Recoding a Continuous Variable to a Categorical Variable

15.14.1 Problem

You want to recode a continuous variable to a categorical variable.

15.14.2 Solution

Use the cut() function. In this example, we’ll use the PlantGrowth data set and recode the continuous variable weight into a categorical variable, wtclass, using the cut() function:

pg <- PlantGrowth
pg$wtclass <- cut(pg$weight, breaks = c(0, 5, 6, Inf))
pg
#>    weight group wtclass
#> 1    4.17  ctrl   (0,5]
#> 2    5.58  ctrl   (5,6]
#>  ...<26 more rows>...
#> 29   5.80  trt2   (5,6]
#> 30   5.26  trt2   (5,6]

15.14.3 Discussion

For three categories we specify four bounds, which can include Inf and -Inf. If a data value falls outside of the specified bounds, it’s categorized as NA. The result of cut() is a factor, and you can see from the example that the factor levels are named after the bounds.
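To see the NA behavior concretely, here is a small illustration (not from the PlantGrowth data):

```r
# -1 falls outside the range covered by the breaks, so it becomes NA
cut(c(-1, 2, 7), breaks = c(0, 5, 6, Inf))
#> [1] <NA>    (0,5]   (6,Inf]
#> Levels: (0,5] (5,6] (6,Inf]
```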

To change the names of the levels, set the labels:

pg$wtclass <- cut(pg$weight, breaks = c(0, 5, 6, Inf),
                  labels = c("small", "medium", "large"))
pg
#>    weight group wtclass
#> 1    4.17  ctrl   small
#> 2    5.58  ctrl  medium
#>  ...<26 more rows>...
#> 29   5.80  trt2  medium
#> 30   5.26  trt2  medium

As indicated by the factor levels, the bounds are by default open on the left and closed on the right. In other words, they don’t include the lowest value, but they do include the highest value. For the smallest category, you can have it include both the lower and upper bounds by setting include.lowest=TRUE. In this example, that would put values of 0 into the small category; otherwise, 0 would be coded as NA.
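Here is the difference in a minimal example:

```r
cut(0, breaks = c(0, 5, 6, Inf))
#> [1] <NA>
#> Levels: (0,5] (5,6] (6,Inf]

cut(0, breaks = c(0, 5, 6, Inf), include.lowest = TRUE)
#> [1] [0,5]
#> Levels: [0,5] (5,6] (6,Inf]
```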

If you want the categories to be closed on the left and open on the right, set right = FALSE:

cut(pg$weight, breaks = c(0, 5, 6, Inf), right = FALSE)
#>  [1] [0,5)   [5,6)   [5,6)   [6,Inf) [0,5)   [0,5)   [5,6)   [0,5)   [5,6)  
#> [10] [5,6)   [0,5)   [0,5)   [0,5)   [0,5)   [5,6)   [0,5)   [6,Inf) [0,5)  
#> [19] [0,5)   [0,5)   [6,Inf) [5,6)   [5,6)   [5,6)   [5,6)   [5,6)   [0,5)  
#> [28] [6,Inf) [5,6)   [5,6)  
#> Levels: [0,5) [5,6) [6,Inf)

15.14.4 See Also

To recode a categorical variable to another categorical variable, see Recipe 15.13.

15.15 Calculating New Columns From Existing Columns

15.15.1 Problem

You want to calculate a new column of values in a data frame.

15.15.2 Solution

Use mutate() from the dplyr package.

library(gcookbook) # Load gcookbook for the heightweight data set
heightweight
#>     sex ageYear ageMonth heightIn weightLb
#> 1     f   11.92      143     56.3     85.0
#> 2     f   12.92      155     62.3    105.0
#>  ...<232 more rows>...
#> 236   m   13.92      167     62.0    107.5
#> 237   m   12.58      151     59.3     87.0

This will convert heightIn to centimeters and store it in a new column, heightCm:

library(dplyr)
heightweight %>%
  mutate(heightCm = heightIn * 2.54)
#>     sex ageYear ageMonth heightIn weightLb heightCm
#> 1     f   11.92      143     56.3     85.0  143.002
#> 2     f   12.92      155     62.3    105.0  158.242
#>  ...<232 more rows>...
#> 235   m   13.92      167     62.0    107.5  157.480
#> 236   m   12.58      151     59.3     87.0  150.622

This returns a new data frame, so if you want to replace the original variable, you will need to save the result over it.

15.15.3 Discussion

You can use mutate() to transform multiple columns at once:

heightweight %>%
  mutate(
    heightCm = heightIn * 2.54,
    weightKg = weightLb / 2.204
  )
#>     sex ageYear ageMonth heightIn weightLb heightCm weightKg
#> 1     f   11.92      143     56.3     85.0  143.002 38.56624
#> 2     f   12.92      155     62.3    105.0  158.242 47.64065
#>  ...<232 more rows>...
#> 235   m   13.92      167     62.0    107.5  157.480 48.77495
#> 236   m   12.58      151     59.3     87.0  150.622 39.47368

It is also possible to calculate a new column based on multiple columns. Note that this example assumes the heightCm and weightKg columns created above have been saved back into heightweight; otherwise mutate() won’t find them:

heightweight %>%
  mutate(bmi = weightKg / (heightCm / 100)^2)

With mutate(), the columns are added sequentially. That means that we can reference a newly-created column when calculating a new column:

heightweight %>%
  mutate(
    heightCm = heightIn * 2.54,
    weightKg = weightLb / 2.204,
    bmi = weightKg / (heightCm / 100)^2
  )
#>     sex ageYear ageMonth heightIn weightLb heightCm weightKg      bmi
#> 1     f   11.92      143     56.3     85.0  143.002 38.56624 18.85919
#> 2     f   12.92      155     62.3    105.0  158.242 47.64065 19.02542
#>  ...<232 more rows>...
#> 235   m   13.92      167     62.0    107.5  157.480 48.77495 19.66736
#> 236   m   12.58      151     59.3     87.0  150.622 39.47368 17.39926

With base R, calculating a new column can be done by referencing the new column with the $ operator and assigning values to it:

heightweight$heightCm <- heightweight$heightIn * 2.54
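Base R also has transform(), which, like mutate(), evaluates expressions in the context of the data frame. A sketch of the same calculation:

```r
library(gcookbook) # Load gcookbook for the heightweight data set

heightweight <- transform(heightweight, heightCm = heightIn * 2.54)
```

Unlike mutate(), transform() evaluates all of its arguments against the original data frame, so one new column cannot refer to another new column created in the same call.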

15.15.4 See Also

See Recipe 15.16 for how to perform group-wise transformations on data.

15.16 Calculating New Columns by Groups

15.16.1 Problem

You want to create new columns that are the result of calculations performed on groups of data, as specified by a grouping column.

15.16.2 Solution

Use group_by() from the dplyr package to specify the grouping variable, and then specify the operations in mutate():

library(MASS)  # Load MASS for the cabbages data set
library(dplyr)

cabbages %>%
  group_by(Cult) %>%
  mutate(DevWt = HeadWt - mean(HeadWt))
#> # A tibble: 60 x 5
#> # Groups:   Cult [2]
#>   Cult  Date  HeadWt  VitC  DevWt
#>   <fct> <fct>  <dbl> <int>  <dbl>
#> 1 c39   d16      2.5    51 -0.407
#> 2 c39   d16      2.2    55 -0.707
#> 3 c39   d16      3.1    45  0.193
#> 4 c39   d16      4.3    42  1.39 
#> 5 c39   d16      2.5    53 -0.407
#> 6 c39   d16      4.3    50  1.39 
#> # … with 54 more rows

This returns a new data frame, so if you want to replace the original variable, you will need to save the result over it.

15.16.3 Discussion

Let’s take a closer look at the cabbages data set. It has two grouping variables (factors): Cult, which has levels c39 and c52, and Date, which has levels d16, d20, and d21. It also has two measured numeric variables, HeadWt and VitC:

cabbages
#>    Cult Date HeadWt VitC
#> 1   c39  d16    2.5   51
#> 2   c39  d16    2.2   55
#>  ...<56 more rows>...
#> 59  c52  d21    1.5   66
#> 60  c52  d21    1.6   72

Suppose we want to find, for each case, the deviation of HeadWt from the overall mean. All we have to do is take the overall mean and subtract it from the observed value for each case:

mutate(cabbages, DevWt = HeadWt - mean(HeadWt))
#>    Cult Date HeadWt VitC       DevWt
#> 1   c39  d16    2.5   51 -0.09333333
#> 2   c39  d16    2.2   55 -0.39333333
#>  ...<56 more rows>...
#> 59  c52  d21    1.5   66 -1.09333333
#> 60  c52  d21    1.6   72 -0.99333333

You’ll often want to do separate operations like this for each group, where the groups are specified by one or more grouping variables. Suppose, for example, we want to normalize the data within each group by finding the deviation of each case from the mean within the group, where the groups are specified by Cult. In these cases, we can use group_by() and mutate() together:

cb <- cabbages %>%
  group_by(Cult) %>%
  mutate(DevWt = HeadWt - mean(HeadWt))

This code first groups cabbages based on the value of Cult. There are two levels of Cult, c39 and c52, so there are two groups. It then applies the mutate() function separately to each group.
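The same group-wise deviation can be sketched in base R with ave(), which returns each row’s group mean (by default it applies mean() within groups):

```r
library(MASS) # Load MASS for the cabbages data set

cb_base <- cabbages
cb_base$DevWt <- cb_base$HeadWt - ave(cb_base$HeadWt, cb_base$Cult)
```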

The before and after results are shown in Figure 15.2:

# The data before normalizing
ggplot(cb, aes(x = Cult, y = HeadWt)) +
  geom_boxplot()

# After normalizing
ggplot(cb, aes(x = Cult, y = DevWt)) +
  geom_boxplot()

Figure 15.2: Before normalizing (left); After normalizing (right)

You can also group the data frame on multiple variables and perform operations on multiple variables. The following code groups the data by Cult and Date, forming a group for each distinct combination of the two variables. After forming these groups, the code will calculate the deviation of HeadWt and VitC from the mean of each group:

cabbages %>%
  group_by(Cult, Date) %>%
  mutate(
    DevWt = HeadWt - mean(HeadWt),
    DevVitC = VitC - mean(VitC)
  )
#> # A tibble: 60 x 6
#> # Groups:   Cult, Date [6]
#>   Cult  Date  HeadWt  VitC DevWt DevVitC
#>   <fct> <fct>  <dbl> <int> <dbl>   <dbl>
#> 1 c39   d16      2.5    51 -0.68   0.7  
#> 2 c39   d16      2.2    55 -0.98   4.7  
#> 3 c39   d16      3.1    45 -0.08  -5.30 
#> 4 c39   d16      4.3    42  1.12  -8.30 
#> 5 c39   d16      2.5    53 -0.68   2.7  
#> 6 c39   d16      4.3    50  1.12  -0.300
#> # … with 54 more rows

15.16.4 See Also

To summarize data by groups, see Recipe 15.17.

15.17 Summarizing Data by Groups

15.17.1 Problem

You want to summarize your data, based on one or more grouping variables.

15.17.2 Solution

Use group_by() and summarise() from the dplyr package, and specify the operations to do:

library(MASS)  # Load MASS for the cabbages data set
library(dplyr)

cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),
    VitC = mean(VitC)
  )
#> # A tibble: 6 x 4
#> # Groups:   Cult [2]
#>   Cult  Date  Weight  VitC
#>   <fct> <fct>  <dbl> <dbl>
#> 1 c39   d16     3.18  50.3
#> 2 c39   d20     2.8   49.4
#> 3 c39   d21     2.74  54.8
#> 4 c52   d16     2.26  62.5
#> 5 c52   d20     3.11  58.9
#> 6 c52   d21     1.47  71.8

15.17.3 Discussion

There are a few things going on here that may be unfamiliar if you’re new to dplyr and the tidyverse in general.

First, let’s take a closer look at the cabbages data set. It has two factors that can be used as grouping variables: Cult, which has levels c39 and c52, and Date, which has levels d16, d20, and d21. It also has two numeric variables, HeadWt and VitC:

cabbages
#>    Cult Date HeadWt VitC
#> 1   c39  d16    2.5   51
#> 2   c39  d16    2.2   55
#>  ...<56 more rows>...
#> 59  c52  d21    1.5   66
#> 60  c52  d21    1.6   72

Finding the overall mean of HeadWt is simple. We could just use the mean() function on that column, but for reasons that will soon become clear, we’ll use the summarise() function instead:

library(dplyr)
summarise(cabbages, Weight = mean(HeadWt))
#>     Weight
#> 1 2.593333

The result is a data frame with one row and one column, named Weight.

Often we want to find information about each subset of the data, as specified by a grouping variable. For example, suppose we want to find the mean of each Cult group. To do this, we can use summarise() with group_by().

tmp <- group_by(cabbages, Cult)
summarise(tmp, Weight = mean(HeadWt))
#> # A tibble: 2 x 2
#>   Cult  Weight
#>   <fct>  <dbl>
#> 1 c39     2.91
#> 2 c52     2.28

The command first groups the data frame cabbages based on the value of Cult. There are two levels of Cult, c39 and c52, so there are two groups. It then applies the summarise() function to each of these data frames; it calculates Weight by taking the mean() of the HeadWt column in each of the sub-data frames. The resulting summaries for each group are assembled into a data frame, which is returned.

You can imagine that the cabbages data is split up into two separate data frames, then summarise() is called on each data frame (returning a one-row data frame for each), and then those results are combined together into a final data frame. This is actually how things worked in dplyr’s predecessor, plyr, with the ddply() function.
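Base R expresses the same split-apply-combine idea with aggregate(), which computes a summary for each group specified by a formula (a sketch, returning one row per Cult group):

```r
library(MASS) # Load MASS for the cabbages data set

aggregate(HeadWt ~ Cult, data = cabbages, FUN = mean)
```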

The syntax of the previous code used a temporary variable to store results. That’s a little verbose, so instead, we can use %>%, also known as the pipe operator, to chain the function calls together. The pipe operator simply takes what’s on its left and substitutes it as the first argument of the function call on the right. The following two lines of code are equivalent:

group_by(cabbages, Cult)
# The pipe operator moves `cabbages` to the first argument position of group_by()
cabbages %>% group_by(Cult)

The reason it’s called a pipe operator is that it lets you connect function calls together in sequence to form a pipeline of operations. Another common term for this, based on a different metaphor, is chaining.

So the first argument of the function call is in a different place. So what? The advantages become apparent when chaining is involved. Here’s what it would look like if you wanted to call group_by() and then summarise() without making use of a temporary variable. Instead of proceeding left to right, the computation occurs from the inside out:

summarise(group_by(cabbages, Cult), Weight = mean(HeadWt))

Using a temporary variable, as we did earlier, makes it more readable, but a more elegant solution is to use the pipe operator:

cabbages %>%
  group_by(Cult) %>%
  summarise(Weight = mean(HeadWt))

Back to summarizing data. Summarizing the data frame by grouping using more variables (or columns) is simple: just give it the names of the additional variables. It’s also possible to get more than one summary value by specifying more calculated columns. Here we’ll summarize each Cult and Date group, getting the average of HeadWt and VitC:

cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),
    Vitc = mean(VitC)
  )
#> # A tibble: 6 x 4
#> # Groups:   Cult [2]
#>   Cult  Date  Weight  Vitc
#>   <fct> <fct>  <dbl> <dbl>
#> 1 c39   d16     3.18  50.3
#> 2 c39   d20     2.8   49.4
#> 3 c39   d21     2.74  54.8
#> 4 c52   d16     2.26  62.5
#> 5 c52   d20     3.11  58.9
#> 6 c52   d21     1.47  71.8

Note

You might have noticed that it says that the result is grouped by Cult, but not Date. This is because the summarise() function removes one level of grouping. This is typically what you want when the input has one grouping variable. When there are multiple grouping variables, this may or may not be what you want. To remove all grouping, use ungroup(), and to add back the original grouping, use group_by() again.
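For example, to drop all remaining grouping after summarizing by Cult and Date:

```r
library(MASS) # Load MASS for the cabbages data set
library(dplyr)

cabbages %>%
  group_by(Cult, Date) %>%
  summarise(Weight = mean(HeadWt)) %>%
  ungroup()
```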

It’s possible to do more than take the mean. You may, for example, want to compute the standard deviation and count of each group. To get the standard deviation, use sd(), and to get a count of rows in each group, use n():

cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),
    sd = sd(HeadWt),
    n = n()
  )
#> # A tibble: 6 x 5
#> # Groups:   Cult [2]
#>   Cult  Date  Weight    sd     n
#>   <fct> <fct>  <dbl> <dbl> <int>
#> 1 c39   d16     3.18 0.957    10
#> 2 c39   d20     2.8  0.279    10
#> 3 c39   d21     2.74 0.983    10
#> 4 c52   d16     2.26 0.445    10
#> 5 c52   d20     3.11 0.791    10
#> 6 c52   d21     1.47 0.211    10

Other useful functions for generating summary statistics include min(), max(), and median(). The n() function is a special function that works only inside of the dplyr functions summarise(), mutate() and filter(). See ?summarise for more useful functions.

The n() function gets a count of rows, but if you want it to not count NA values from a column, you need to use a different technique. For example, if you want it to ignore any NAs in the HeadWt column, use sum(!is.na(HeadWt)).

15.17.3.1 Dealing with NAs

One potential pitfall is that NAs in the data will lead to NAs in the output. Let’s see what happens if we sprinkle a few NAs into HeadWt:

c1 <- cabbages # Make a copy
c1$HeadWt[c(1, 20, 45)] <- NA # Set some values to NA

c1 %>%
  group_by(Cult) %>%
  summarise(
    Weight = mean(HeadWt),
    sd = sd(HeadWt),
    n = n()
  )
#> # A tibble: 2 x 4
#>   Cult  Weight    sd     n
#>   <fct>  <dbl> <dbl> <int>
#> 1 c39       NA    NA    30
#> 2 c52       NA    NA    30

The problem is that mean() and sd() simply return NA if any of the input values are NA. Fortunately, these functions have an option to deal with this very issue: setting na.rm=TRUE will tell them to ignore the NAs.

c1 %>%
  group_by(Cult) %>%
  summarise(
    Weight = mean(HeadWt, na.rm = TRUE),
    sd = sd(HeadWt, na.rm = TRUE),
    n = n()
  )
#> # A tibble: 2 x 4
#>   Cult  Weight    sd     n
#>   <fct>  <dbl> <dbl> <int>
#> 1 c39     2.9  0.822    30
#> 2 c52     2.23 0.828    30

15.17.3.2 Missing combinations

If there are any empty combinations of the grouping variables, they will not appear in the summarized data frame. These missing combinations can cause problems when making graphs. To illustrate, we’ll remove all entries that have levels c52 and d21. The graph on the left in Figure 15.3 shows what happens when there’s a missing combination in a bar graph:

# Copy cabbages and remove all rows with both c52 and d21
c2 <- filter(cabbages, !( Cult == "c52" & Date == "d21" ))
c2a <- c2 %>%
  group_by(Cult, Date) %>%
  summarise(Weight = mean(HeadWt))

ggplot(c2a, aes(x = Date, fill = Cult, y = Weight)) +
  geom_col(position = "dodge")

To fill in the missing combination (Figure 15.3, right), use the complete() function from the tidyr package – which is also part of the tidyverse. Also, the grouping for c2a must be removed, with ungroup(); otherwise it will return too many rows.

library(tidyr)
c2b <- c2a %>%
  ungroup() %>%
  complete(Cult, Date)

ggplot(c2b, aes(x = Date, fill = Cult, y = Weight)) +
  geom_col(position = "dodge")

Figure 15.3: Bar graph with a missing combination (left); With missing combination filled (right)

When we used complete(), it filled in the missing combinations with NA. It’s possible to fill with a different value, with the fill parameter. See ?complete for more information.
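For example, to fill the missing combination’s Weight with 0 instead of NA, you can pass a named list to the fill parameter (a sketch that rebuilds c2a so it can be run on its own):

```r
library(MASS) # Load MASS for the cabbages data set
library(dplyr)
library(tidyr)

c2a <- cabbages %>%
  filter(!(Cult == "c52" & Date == "d21")) %>%
  group_by(Cult, Date) %>%
  summarise(Weight = mean(HeadWt)) %>%
  ungroup()

# Fill the missing combination's Weight with 0 instead of NA
complete(c2a, Cult, Date, fill = list(Weight = 0))
```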

15.17.4 See Also

If you want to calculate standard errors and confidence intervals, see Recipe 15.18.

See Recipe 6.8 for an example of using stat_summary() to calculate means and overlay them on a graph.

To perform transformations on data by groups, see Recipe 15.16.

15.18 Summarizing Data with Standard Errors and Confidence Intervals

15.18.1 Problem

You want to summarize your data with the standard error of the mean and/or confidence intervals.

15.18.2 Solution

Getting the standard error of the mean involves two steps: first get the standard deviation and count for each group, then use those values to calculate the standard error. The standard error for each group is just the standard deviation divided by the square root of the sample size:

library(MASS)  # Load MASS for the cabbages data set
library(dplyr)

ca <- cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),
    sd = sd(HeadWt),
    n = n(),
    se = sd / sqrt(n)
  )

ca
#> # A tibble: 6 x 6
#> # Groups:   Cult [2]
#>   Cult  Date  Weight    sd     n     se
#>   <fct> <fct>  <dbl> <dbl> <int>  <dbl>
#> 1 c39   d16     3.18 0.957    10 0.303 
#> 2 c39   d20     2.8  0.279    10 0.0882
#> 3 c39   d21     2.74 0.983    10 0.311 
#> 4 c52   d16     2.26 0.445    10 0.141 
#> 5 c52   d20     3.11 0.791    10 0.250 
#> 6 c52   d21     1.47 0.211    10 0.0667

15.18.3 Discussion

The summarise() function computes the columns in order, so you can refer to previous newly-created columns. That’s why se can use the sd and n columns.

The n() function gets a count of rows, but if you want it to not count NA values from a column, you need to use a different technique. For example, if you want it to ignore any NAs in the HeadWt column, use sum(!is.na(HeadWt)).

15.18.3.1 Confidence Intervals

Confidence intervals are calculated using the standard error of the mean and the degrees of freedom. To calculate a confidence interval, use the qt() function to get the quantile, then multiply that by the standard error. The qt() function will give quantiles of the t-distribution when given a probability level and degrees of freedom. For a 95% confidence interval, use a probability level of .975; for the bell-shaped t-distribution, this will in essence cut off 2.5% of the area under the curve at either end. The degrees of freedom equal the sample size minus one.

This will calculate the multiplier for each group. There are six groups and each has the same number of observations (10), so they will all have the same multiplier:

ciMult <- qt(.975, ca$n - 1)
ciMult
#> [1] 2.262157 2.262157 2.262157 2.262157 2.262157 2.262157

Now we can multiply that vector by the standard error to get the 95% confidence interval:

ca$ci95 <- ca$se * ciMult
ca
#> # A tibble: 6 x 7
#> # Groups:   Cult [2]
#>   Cult  Date  Weight    sd     n     se  ci95
#>   <fct> <fct>  <dbl> <dbl> <int>  <dbl> <dbl>
#> 1 c39   d16     3.18 0.957    10 0.303  0.684
#> 2 c39   d20     2.8  0.279    10 0.0882 0.200
#> 3 c39   d21     2.74 0.983    10 0.311  0.703
#> 4 c52   d16     2.26 0.445    10 0.141  0.318
#> 5 c52   d20     3.11 0.791    10 0.250  0.566
#> 6 c52   d21     1.47 0.211    10 0.0667 0.151

This could be done in one line, like this:

ca$ci95 <- ca$se * qt(.975, ca$n - 1)

For a 99% confidence interval, use .995.

Error bars that represent the standard error of the mean and confidence intervals serve the same general purpose: to give the viewer an idea of how good the estimate of the population mean is. The standard error is the standard deviation of the sampling distribution. Confidence intervals are a little easier to interpret. Very roughly, a 95% confidence interval means that there’s a 95% chance that the true population mean is within the interval (actually, it doesn’t mean this at all, but this seemingly simple topic is way too complicated to cover here; if you want to know more, read up on Bayesian statistics).

This function will perform all the steps of calculating the standard deviation, count, standard error, and confidence intervals. It can also handle NAs and missing combinations, with the na.rm and .drop options. By default, it provides a 95% confidence interval, but this can be set with the conf.interval argument:

summarySE <- function(data = NULL, measurevar, groupvars = NULL, na.rm = FALSE,
                      conf.interval = .95, .drop = TRUE) {

  # New version of length which can handle NA's: if na.rm==T, don't count them
  length2 <- function(x, na.rm = FALSE) {
    if (na.rm) sum(!is.na(x))
    else       length(x)
  }

  groupvars  <- rlang::syms(groupvars)
  measurevar <- rlang::sym(measurevar)

  datac <- data %>%
    dplyr::group_by(!!!groupvars) %>%
    dplyr::summarise(
      N             = length2(!!measurevar, na.rm = na.rm),
      sd            = sd     (!!measurevar, na.rm = na.rm),
      !!measurevar := mean   (!!measurevar, na.rm = na.rm),
      se            = sd / sqrt(N),
      # Confidence interval multiplier for standard error
      # Calculate t-statistic for confidence interval:
      # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
      ci            = se * qt(conf.interval/2 + .5, N - 1)
    ) %>%
    dplyr::ungroup() %>%
    # Rearrange the columns so that sd, se, ci are last
    dplyr::select(seq_len(ncol(.) - 4), ncol(.) - 2, sd, se, ci)

  datac
}

The following usage example has a 99% confidence interval and handles NAs and missing combinations:

# Remove all rows with both c52 and d21
c2 <- filter(cabbages, !(Cult == "c52" & Date == "d21" ))
# Set some values to NA
c2$HeadWt[c(1, 20, 45)] <- NA
summarySE(c2, "HeadWt", c("Cult", "Date"),
          conf.interval = .99, na.rm = TRUE, .drop = FALSE)
#> # A tibble: 5 x 7
#>   Cult  Date      N HeadWt    sd     se    ci
#>   <fct> <fct> <int>  <dbl> <dbl>  <dbl> <dbl>
#> 1 c39   d16       9   3.26 0.982 0.327  1.10 
#> 2 c39   d20       9   2.72 0.139 0.0465 0.156
#> 3 c39   d21      10   2.74 0.983 0.311  1.01 
#> 4 c52   d16      10   2.26 0.445 0.141  0.458
#> 5 c52   d20       9   3.04 0.809 0.270  0.905

15.18.4 See Also

See Recipe 7.7 to use the values calculated here to add error bars to a graph.

15.19 Converting Data from Wide to Long

15.19.1 Problem

You want to convert a data frame from “wide” format to “long” format.

15.19.2 Solution

Use gather() from the tidyr package. In the anthoming data set, for each angle, there are two measurements: one column contains measurements in the experimental condition and the other contains measurements in the control condition:

library(gcookbook) # For the data set
anthoming
#>   angle expt ctrl
#> 1   -20    1    0
#> 2   -10    7    3
#> 3     0    2    3
#> 4    10    0    3
#> 5    20    0    1

We can reshape the data so that all the measurements are in one column. This will put the values from expt and ctrl into one column, and put the names into a different column:

library(tidyr)
gather(anthoming, condition, count, expt, ctrl)
#>    angle condition count
#> 1    -20      expt     1
#> 2    -10      expt     7
#>  ...<6 more rows>...
#> 9     10      ctrl     3
#> 10    20      ctrl     1

This data frame represents the same information as the original one, but it is structured in a way that is more conducive to some analyses.

15.19.3 Discussion

In the source data, there are ID variables and value variables. The ID variables are those that specify which values go together. In the source data, the first row holds measurements for when angle is –20. In the output data frame, the two measurements, for expt and ctrl, are no longer in the same row, but we can still tell that they belong together because they have the same value of angle.

The value variables are by default all the non-ID variables. The names of these variables are put into a new key column, which we called condition, and the values are put into a new value column which we called count.

You can designate the value columns from the source data by naming them individually, as we did above with expt and ctrl. gather() automatically inferred that the ID variable was the remaining column, angle. Another way to tell it which columns are values is to do the reverse: if you exclude the angle column, then gather() will infer that the value columns are the remaining ones, expt and ctrl.

gather(anthoming, condition, count, expt, ctrl)
# Prepending the column name with a '-' means it is not a value column
gather(anthoming, condition, count, -angle)

There are other convenient shortcuts to specify which columns are values. For example expt:ctrl means to select all columns between expt and ctrl (in this particular case, there are no other columns in between, but for a larger data set you can imagine how this would save typing).
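In code, the range shortcut looks like this, and is equivalent to naming expt and ctrl individually:

```r
library(gcookbook) # For the anthoming data set
library(tidyr)

# expt:ctrl selects all columns from expt through ctrl
gather(anthoming, condition, count, expt:ctrl)
```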

By default, gather() will use all of the columns from the source data as either ID columns or value columns. That means that if you want to ignore some columns, you’ll need to filter them out first using the select() function.

For example, in the drunk data set, suppose we want to convert it to long format, keeping sex in one column and putting the numeric values in another column. This time, we want the values for only the 0-29 and 30-39 columns, and we want to discard the values for the other age ranges:

# Our source data
drunk
#>      sex 0-29 30-39 40-49 50-59 60+
#> 1   male  185   207   260   180  71
#> 2 female    4    13    10     7  10

# Try gather() with just 0-29 and 30-39
drunk %>%
  gather(age, count, "0-29", "30-39")
#>      sex 40-49 50-59 60+   age count
#> 1   male   260   180  71  0-29   185
#> 2 female    10     7  10  0-29     4
#> 3   male   260   180  71 30-39   207
#> 4 female    10     7  10 30-39    13

That doesn’t look right! We told gather() that 0-29 and 30-39 were the value columns we wanted, and it automatically inferred that we wanted to use all of the other columns as ID columns, when we wanted to just keep sex and discard the others. The solution is to use select() to remove the unwanted columns first, and then gather().

library(dplyr)  # For the select() function

drunk %>%
  select(sex, "0-29", "30-39") %>%
  gather(age, count, "0-29", "30-39")
#>      sex   age count
#> 1   male  0-29   185
#> 2 female  0-29     4
#> 3   male 30-39   207
#> 4 female 30-39    13

There are times when you may want to use more than one column as the ID variables:

plum_wide
#>   length      time dead alive
#> 1   long   at_once   84   156
#> 2   long in_spring  156    84
#> 3  short   at_once  133   107
#> 4  short in_spring  209    31
# Use length and time as the ID variables (by not naming them as value variables)
gather(plum_wide, "survival", "count", dead, alive)
#>   length      time survival count
#> 1   long   at_once     dead    84
#> 2   long in_spring     dead   156
#>  ...<4 more rows>...
#> 7  short   at_once    alive   107
#> 8  short in_spring    alive    31

Some data sets don’t come with a column with an ID variable. For example, in the corneas data set, each row represents one pair of measurements, but there is no ID variable. Without an ID variable, you won’t be able to tell how the values are meant to be paired together. In these cases, you can add an ID variable before using gather():

# Make a copy of the data
co <- corneas
# Add an ID column
co$id <- 1:nrow(co)

gather(co, "eye", "thickness", affected, notaffected)
#>    id         eye thickness
#> 1   1    affected       488
#> 2   2    affected       478
#>  ...<12 more rows>...
#> 15  7 notaffected       464
#> 16  8 notaffected       476

Having numeric values for the ID variable may be problematic for subsequent analyses, so you may want to convert id to a character vector with as.character(), or a factor with factor().

15.19.4 See Also

See Recipe 15.20 to do conversions in the other direction, from long to wide.

See the stack() function for another way of converting from wide to long.

15.20 Converting Data from Long to Wide

15.20.1 Problem

You want to convert a data frame from “long” format to “wide” format.

15.20.2 Solution

Use the spread() function from the tidyr package. In this example, we’ll use the plum data set, which is in a long format:

library(gcookbook) # For the data set
plum
#>   length      time survival count
#> 1   long   at_once     dead    84
#> 2   long in_spring     dead   156
#>  ...<4 more rows>...
#> 7  short   at_once    alive   107
#> 8  short in_spring    alive    31

The conversion to wide format takes each unique value in one column and uses those values as headers for new columns, then uses another column for source values. For example, we can “move” values in the survival column to the top and fill them with values from count:

library(tidyr)
spread(plum, survival, count)
#>   length      time dead alive
#> 1   long   at_once   84   156
#> 2   long in_spring  156    84
#> 3  short   at_once  133   107
#> 4  short in_spring  209    31

15.20.3 Discussion

The spread() function requires you to specify a key column which is used for header names, and a value column which is used to fill the values in the output data frame. It’s assumed that you want to use all the other columns as ID variables.

In the preceding example, there are two ID columns, length and time, one key column, survival, and one value column, count. What if we want to use two of the columns as keys? Suppose, for example, that we want to use length and survival as keys. This would leave us with time as the ID column.

The way to do this is to combine the length and survival columns together and put it in a new column, then use that new column as a key.

# Create a new column, length_survival, from length and survival.
plum %>%
  unite(length_survival, length, survival)
#>   length_survival      time count
#> 1       long_dead   at_once    84
#> 2       long_dead in_spring   156
#>  ...<4 more rows>...
#> 7     short_alive   at_once   107
#> 8     short_alive in_spring    31

# Now pass it to spread() and use length_survival as a key
plum %>%
  unite(length_survival, length, survival) %>%
  spread(length_survival, count)
#>        time long_alive long_dead short_alive short_dead
#> 1   at_once        156        84         107        133
#> 2 in_spring         84       156          31        209
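If you later need to undo the unite() step, tidyr’s separate() splits a column back into its parts. A sketch, assuming the underscore separator produced by unite() above:

```r
library(gcookbook) # For the plum data set
library(tidyr)

plum %>%
  unite(length_survival, length, survival) %>%
  separate(length_survival, into = c("length", "survival"), sep = "_")
```

This round trip should return a data frame with the same columns as the original plum.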

15.20.4 See Also

See Recipe 15.19 to do conversions in the other direction, from wide to long.

See the unstack() function for another way of converting from long to wide.

15.21 Converting a Time Series Object to Times and Values

15.21.1 Problem

You have a time series object that you wish to convert to numeric vectors representing the time and values at each time.

15.21.2 Solution

Use the time() function to get the time for each observation, then convert the times and values to numeric vectors with as.numeric():

# Look at nhtemp Time Series object
nhtemp
#> Time Series:
#> Start = 1912 
#> End = 1971 
#>  ...
#> [31] 51.0 50.6 51.7 51.5 52.1 51.3 51.0 54.0 51.4 52.7 53.1 54.6 52.0 52.0 50.9
#> [46] 52.6 50.2 52.6 51.6 51.9 50.5 50.9 51.7 51.4 51.7 50.8 51.9 51.8 51.9 53.0

# Get times for each observation
as.numeric(time(nhtemp))
#>  [1] 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926
#> [16] 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
#> [31] 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956
#> [46] 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971

# Get value of each observation
as.numeric(nhtemp)
#>  [1] 49.9 52.3 49.4 51.1 49.4 47.9 49.8 50.9 49.3 51.9 50.8 49.6 49.3 50.6 48.4
#> [16] 50.7 50.9 50.6 51.5 52.8 51.8 51.1 49.8 50.2 50.4 51.6 51.8 50.9 48.8 51.7
#> [31] 51.0 50.6 51.7 51.5 52.1 51.3 51.0 54.0 51.4 52.7 53.1 54.6 52.0 52.0 50.9
#> [46] 52.6 50.2 52.6 51.6 51.9 50.5 50.9 51.7 51.4 51.7 50.8 51.9 51.8 51.9 53.0
# Put them in a data frame
nht <- data.frame(year = as.numeric(time(nhtemp)), temp = as.numeric(nhtemp))
nht
#>    year temp
#> 1  1912 49.9
#> 2  1913 52.3
#>  ...<56 more rows>...
#> 59 1970 51.9
#> 60 1971 53.0

15.21.3 Discussion

Time series objects efficiently store information when there are observations at regular time intervals, but to use them with ggplot, they must be converted to a data frame with separate columns for the time and the value of each observation.
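If you do this conversion often, the pattern can be wrapped in a small helper function. The function below, ts_to_df, is a hypothetical convenience function written for illustration, not part of any package:

```r
# Convert a ts object to a two-column data frame.
# value_name is the name to use for the value column.
ts_to_df <- function(ts_obj, value_name = "value") {
  df <- data.frame(
    time = as.numeric(time(ts_obj)),
    value = as.numeric(ts_obj)
  )
  names(df)[2] <- value_name
  df
}

head(ts_to_df(nhtemp, "temp"))
```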

Some time series objects are cyclical. The presidents data set, for example, contains four observations per year, one for each quarter:

presidents
#>      Qtr1 Qtr2 Qtr3 Qtr4
#> 1945   NA   87   82   75
#> 1946   63   50   43   32
#>  ...
#> 1973   68   44   40   27
#> 1974   28   25   24   24

To convert it to a two-column data frame, with one column representing the year (using fractional values for the quarters), we can use the same approach as before:

pres_rating <- data.frame(
  year = as.numeric(time(presidents)),
  rating = as.numeric(presidents)
)
pres_rating
#>        year rating
#> 1   1945.00     NA
#> 2   1945.25     87
#>  ...<116 more rows>...
#> 119 1974.50     24
#> 120 1974.75     24
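Once the data is in this form, it can be passed directly to ggplot. A minimal sketch, assuming ggplot2 is installed and pres_rating has been created as above:

```r
library(ggplot2)

# geom_line() leaves gaps where rating is NA
ggplot(pres_rating, aes(x = year, y = rating)) +
  geom_line()
```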

It is also possible to store the year and quarter in separate columns, which may be useful in some visualizations:

pres_rating2 <- data.frame(
  year = as.numeric(floor(time(presidents))),
  quarter = as.numeric(cycle(presidents)),
  rating = as.numeric(presidents)
)
pres_rating2
#>     year quarter rating
#> 1   1945       1     NA
#> 2   1945       2     87
#>  ...<116 more rows>...
#> 119 1974       3     24
#> 120 1974       4     24

15.21.4 See Also

The zoo package is also useful for working with time series objects.