# Chapter 15 Getting Your Data into Shape

When it comes to making data graphics, half the battle occurs before you call any plotting commands. Before you pass your data to the plotting functions, it must first be read in and given the correct structure. The data sets provided with R are ready to use, but when dealing with real-world data, this usually isn’t the case: you’ll have to clean up and restructure the data before you can visualize it.

The recipes in this chapter will often use packages from the tidyverse. For a little background about the tidyverse, see the introduction section of Chapter 1. I will also show how to do many of the same tasks using base R, because in some situations it is important to minimize the number of packages you use, and because it is useful to be able to understand code written for base R.

Note

The `%>%` symbol, also known as the pipe operator, is used extensively in this chapter. If you are not familiar with it, see Recipe 1.7.

Most of the tidyverse functions used in this chapter are from the dplyr package, and in this chapter, I’ll assume that dplyr is already loaded. You can load it with `library(tidyverse)`, or, if you want to keep things more streamlined, you can load dplyr directly:

```r
library(dplyr)
```

Data sets in R are most often stored in data frames. They’re typically used as two-dimensional data structures, with each row representing one case and each column representing one variable. Data frames are essentially lists of vectors and factors, all of the same length, where each vector or factor represents one column.

Here’s the `heightweight` data set:

```r
library(gcookbook) # Load gcookbook for the heightweight data set
heightweight
#>     sex ageYear ageMonth heightIn weightLb
#> 1     f   11.92      143     56.3     85.0
#> 2     f   12.92      155     62.3    105.0
#>  ...<232 more rows>...
#> 236   m   13.92      167     62.0    107.5
#> 237   m   12.58      151     59.3     87.0
```

It consists of five columns, with each row representing one case: a set of information about a single person. We can get a clearer idea of how it’s structured by using the `str()` function:

```r
str(heightweight)
#> 'data.frame':    236 obs. of  5 variables:
#>  $ sex     : Factor w/ 2 levels "f","m": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ ageYear : num  11.9 12.9 12.8 13.4 15.9 ...
#>  $ ageMonth: int  143 155 153 161 191 171 185 142 160 140 ...
#>  $ heightIn: num  56.3 62.3 63.3 59 62.5 62.5 59 56.5 62 53.8 ...
#>  $ weightLb: num  85 105 108 92 112 ...
```

The first column, `sex`, is a factor with two levels, `"f"` and `"m"`, and the other four columns are vectors of numbers (one of them, `ageMonth`, is specifically a vector of integers, but for the purposes here, it behaves the same as any other numeric vector).

Factors and character vectors behave similarly in ggplot – the main difference is that with character vectors, items will be displayed in lexicographical order, but with factors, items will be displayed in the same order as the factor levels, which you can control.
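
A quick way to see that difference without plotting is to compare a sorted character vector with a factor whose levels were given explicitly:

```r
# Character vectors sort (and display) lexicographically
chr <- c("small", "large", "medium")
sort(chr)
#> [1] "large"  "medium" "small"

# A factor displays in the order of its levels, which you control
fac <- factor(chr, levels = c("small", "medium", "large"))
levels(fac)
#> [1] "small"  "medium" "large"
```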

## 15.1 Creating a Data Frame

### 15.1.1 Problem

You want to create a data frame from vectors.

### 15.1.2 Solution

You can put vectors together in a data frame with `data.frame()`:

```r
# Two starting vectors
g <- c("A", "B", "C")
x <- 1:3

dat <- data.frame(g, x)
dat
#>   g x
#> 1 A 1
#> 2 B 2
#> 3 C 3
```

### 15.1.3 Discussion

A data frame is essentially a list of vectors and factors. Each vector or factor can be thought of as a column in the data frame.

If your vectors are in a list, you can convert the list to a data frame with the `as.data.frame()` function:

```r
lst <- list(group = g, value = x)    # A list of vectors

dat <- as.data.frame(lst)
```

The tidyverse way of creating a data frame is to use `data_frame()` or `as_data_frame()` (note the underscores instead of periods); in recent versions of the tibble package these are deprecated in favor of `tibble()` and `as_tibble()`. They return a special kind of data frame – a tibble – which behaves like a regular data frame in most contexts, but prints out more nicely and is specifically designed to play well with the tidyverse functions.

```r
data_frame(g, x)
#> Warning: `data_frame()` is deprecated, use `tibble()`.
#> This warning is displayed once per session.
#> # A tibble: 3 x 2
#>   g         x
#>   <chr> <int>
#> 1 A         1
#> 2 B         2
#> 3 C         3
```

```r
# Convert the list of vectors to a tibble
as_data_frame(lst)
```

A regular data frame can be converted to a tibble using `as_tibble()`:

```r
as_tibble(dat)
#> # A tibble: 3 x 2
#>   group value
#>   <fct> <int>
#> 1 A         1
#> 2 B         2
#> 3 C         3
```

## 15.2 Getting Information About a Data Structure

### 15.2.1 Problem

You want to find out information about an object or data structure.

### 15.2.2 Solution

Use the `str()` function:

```r
str(ToothGrowth)
#> 'data.frame':    60 obs. of  3 variables:
#>  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
#>  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
```

This tells us that `ToothGrowth` is a data frame with three columns, `len`, `supp`, and `dose`. `len` and `dose` contain numeric values, while `supp` is a factor with two levels.

Another useful function is the `summary()` function:

```r
summary(ToothGrowth)
```

Instead of showing the first few values of each column as `str()` does, `summary()` provides basic descriptive statistics (minimum, first quartile, median, mean, third quartile, and maximum) for numeric variables, and counts of the values at each level for character or factor variables.
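
To make those statistics easy to verify, here's `summary()` applied to the columns of a small hand-made data frame (the exact layout of the printed output may vary slightly across R versions):

```r
dat <- data.frame(
  x = c(1, 2, 3, 4),
  g = factor(c("a", "a", "b", "b"))
)

summary(dat$x)  # Min 1.00, 1st Qu. 1.75, Median 2.50, Mean 2.50, 3rd Qu. 3.25, Max 4.00
summary(dat$g)  # a: 2, b: 2
```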

### 15.2.3 Discussion

The `str()` function is very useful for finding out more about data structures. One common source of problems is a data frame where one of the columns is a character vector instead of a factor, or vice versa. This can cause puzzling issues with analyses or graphs.

When you print out a data frame the normal way, by just typing the name at the prompt and pressing Enter, factor and character columns appear exactly the same. The difference will be revealed only when you run `str()` on the data frame, or print out the column by itself:

```r
tg <- ToothGrowth
tg$supp <- as.character(tg$supp)

str(tg)
#> 'data.frame':    60 obs. of  3 variables:
#>  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
#>  $ supp: chr  "VC" "VC" "VC" "VC" ...
#>  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
```

```r
# Print out the columns by themselves

# From old data frame (factor)
ToothGrowth$supp
#>  [1] VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC
#> [26] VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
#> [51] OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
#> Levels: OJ VC

# From new data frame (character)
tg$supp
#>  [1] "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC"
#> [16] "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC"
#> [31] "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ"
#> [46] "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ"
```

## 15.3 Adding a Column to a Data Frame

### 15.3.1 Problem

You want to add a column to a data frame.

### 15.3.2 Solution

Use `mutate()` from dplyr to add a new column and assign values to it. This returns a new data frame, which you’ll typically want to save over the original.

If you assign a single value to the new column, the entire column will be filled with that value. This adds a column named `newcol`, filled with `NA`:

```r
library(dplyr)

ToothGrowth %>%
  mutate(newcol = NA)
#>     len supp dose newcol
#> 1   4.2   VC  0.5     NA
#> 2  11.5   VC  0.5     NA
#>  ...<56 more rows>...
#> 59 29.4   OJ  2.0     NA
#> 60 23.0   OJ  2.0     NA
```

You can also assign a vector to the new column:

```r
# Since ToothGrowth has 60 rows, we must create a new vector with 60 elements
vec <- rep(c(1, 2), 30)

ToothGrowth %>%
  mutate(newcol = vec)
```

Note that the vector being added to the data frame must either have one element, or as many elements as the data frame has rows. In the example above, we created a vector of 60 elements by repeating the values `c(1, 2)` thirty times.

### 15.3.3 Discussion

Each column of a data frame is a vector. R handles columns in data frames slightly differently from standalone vectors because all the columns in a data frame must have the same length.

To add a column using base R, you can simply assign values into the new column like so:

```r
# Make a copy of ToothGrowth for this example
ToothGrowth2 <- ToothGrowth

# Assign NA's for the whole column
ToothGrowth2$newcol <- NA

# Assign 1 and 2, automatically repeating to fill
ToothGrowth2$newcol <- c(1, 2)
```

With base R, the vector being assigned into the data frame will automatically be repeated to fill the number of rows in the data frame.

## 15.4 Deleting a Column from a Data Frame

### 15.4.1 Problem

You want to delete a column from a data frame.

### 15.4.2 Solution

Use `select()` from dplyr and specify the columns you want to drop by using `-` (a minus sign). This returns a new data frame, which you’ll typically want to save over the original.

```r
# Remove the len column
ToothGrowth %>%
  select(-len)
```

### 15.4.3 Discussion

You can list multiple columns that you want to drop at the same time, or conversely specify only the columns that you want to keep. The following two pieces of code are thus equivalent:

```r
# Remove both len and supp from ToothGrowth
ToothGrowth %>%
  select(-len, -supp)
#>    dose
#> 1   0.5
#> 2   0.5
#>  ...<56 more rows>...
#> 59  2.0
#> 60  2.0

# This keeps just dose, which has the same effect for this data set
ToothGrowth %>%
  select(dose)
#>    dose
#> 1   0.5
#> 2   0.5
#>  ...<56 more rows>...
#> 59  2.0
#> 60  2.0
```

To remove a column using base R, you can simply assign `NULL` to that column.

```r
# Make a copy of ToothGrowth for this example
ToothGrowth2 <- ToothGrowth

ToothGrowth2$len <- NULL
```

See Recipe 15.7 for more on getting a subset of a data frame.

See `?select` for more ways to drop and keep columns.

## 15.5 Renaming Columns in a Data Frame

### 15.5.1 Problem

You want to rename the columns in a data frame.

### 15.5.2 Solution

Use `rename()` from dplyr. This returns a new data frame, which you’ll typically want to save over the original.

```r
tg_mod <- ToothGrowth %>%
  rename(length = len)
```

### 15.5.3 Discussion

You can rename multiple columns within the same call to `rename()`:

```r
ToothGrowth %>%
  rename(
    length = len,
    supplement_type = supp
  )
#>    length supplement_type dose
#> 1     4.2              VC  0.5
#> 2    11.5              VC  0.5
#>  ...<56 more rows>...
#> 59   29.4              OJ  2.0
#> 60   23.0              OJ  2.0
```

Renaming a column using base R is a bit more verbose. It uses the `names()` function on the left side of the `<-` operator.

```r
# Make a copy of ToothGrowth for this example
ToothGrowth2 <- ToothGrowth

names(ToothGrowth2)  # Print the names of the columns
#> [1] "len"  "supp" "dose"

# Rename "len" to "length"
names(ToothGrowth2)[names(ToothGrowth2) == "len"] <- "length"

names(ToothGrowth2)
#> [1] "length" "supp"   "dose"
```

See `?rename` for more ways to rename columns within a data frame.

## 15.6 Reordering Columns in a Data Frame

### 15.6.1 Problem

You want to change the order of columns in a data frame.

### 15.6.2 Solution

Use `select()` from dplyr.

```r
ToothGrowth %>%
  select(dose, len, supp)
#>    dose  len supp
#> 1   0.5  4.2   VC
#> 2   0.5 11.5   VC
#>  ...<56 more rows>...
#> 59  2.0 29.4   OJ
#> 60  2.0 23.0   OJ
```

The new data frame will contain the columns you specified in `select()`, in the order you specified. Note that `select()` returns a new data frame, so if you want to change the original variable, you’ll need to save the new result over it.

### 15.6.3 Discussion

If you are only reordering a few variables and want to keep the rest of the variables in order, you can use `everything()` as a placeholder:

```r
ToothGrowth %>%
  select(dose, everything())
#>    dose  len supp
#> 1   0.5  4.2   VC
#> 2   0.5 11.5   VC
#>  ...<56 more rows>...
#> 59  2.0 29.4   OJ
#> 60  2.0 23.0   OJ
```

See `?select_helpers` for other ways to select columns. You can, for example, select columns by matching parts of the name.
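
For instance, `starts_with()` is one of those helpers. A small sketch using ToothGrowth, where `supp` is the only column name beginning with "s":

```r
library(dplyr)

# Keep only the columns whose names start with "s"
ToothGrowth %>%
  select(starts_with("s"))
```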

Using base R, you can also reorder columns by their name or numeric position. This returns a new data frame, which can be saved over the original.

```r
ToothGrowth[c("dose", "len", "supp")]

ToothGrowth[c(3, 1, 2)]
```

In these examples, I used list-style indexing. A data frame is essentially a list of vectors, and indexing into it as a list will return another data frame. You can get the same effect with matrix-style indexing:

```r
ToothGrowth[c("dose", "len", "supp")]   # List-style indexing

ToothGrowth[, c("dose", "len", "supp")] # Matrix-style indexing
```

In this case, both methods return the same result, a data frame. However, when retrieving a single column, list-style indexing will return a data frame, while matrix-style indexing will return a vector:

```r
ToothGrowth["dose"]
#>    dose
#> 1   0.5
#> 2   0.5
#>  ...<56 more rows>...
#> 59  2.0
#> 60  2.0

ToothGrowth[, "dose"]
#>  [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#> [20] 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
#> [39] 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
#> [58] 2.0 2.0 2.0
```

You can use `drop = FALSE` to ensure that it returns a data frame:

```r
ToothGrowth[, "dose", drop = FALSE]
#>    dose
#> 1   0.5
#> 2   0.5
#>  ...<56 more rows>...
#> 59  2.0
#> 60  2.0
```

## 15.7 Getting a Subset of a Data Frame

### 15.7.1 Problem

You want to get a subset of a data frame.

### 15.7.2 Solution

Use `filter()` to get the rows, and `select()` to get the columns you want. These operations can be chained together using the `%>%` operator. These functions return a new data frame, so if you want to change the original variable, you’ll need to save the new result over it.

We’ll use the `climate` data set for the examples here:

```r
library(gcookbook) # Load gcookbook for the climate data set
climate
#>       Source Year Anomaly1y Anomaly5y Anomaly10y Unc10y
#> 1   Berkeley 1800        NA        NA     -0.435  0.505
#> 2   Berkeley 1801        NA        NA     -0.453  0.493
#>  ...<495 more rows>...
#> 498  CRUTEM3 2010    0.8023        NA         NA     NA
#> 499  CRUTEM3 2011    0.6193        NA         NA     NA
```

Let’s say that we want to keep only the rows where `Source` is `"Berkeley"` and where the year is between 1900 and 2000, inclusive. You can do so with the `filter()` function:

```r
climate %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000)
```

If you want only the `Year` and `Anomaly10y` columns, use `select()`, as we did in Recipe 15.4:

```r
climate %>%
  select(Year, Anomaly10y)
#>     Year Anomaly10y
#> 1   1800     -0.435
#> 2   1801     -0.453
#>  ...<495 more rows>...
#> 498 2010         NA
#> 499 2011         NA
```

These operations can be chained together using the `%>%` operator:

```r
climate %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000) %>%
  select(Year, Anomaly10y)
#>     Year Anomaly10y
#> 1   1900     -0.171
#> 2   1901     -0.162
#>  ...<97 more rows>...
#> 100 1999      0.734
#> 101 2000      0.748
```

### 15.7.3 Discussion

The `filter()` function picks out rows based on a condition. If you want to pick out rows based on their numeric position, use the `slice()` function:

```r
slice(climate, 1:100)
```

I generally recommend indexing using names rather than numbers when possible. It makes the code easier to understand when you’re collaborating with others or when you come back to it months or years after writing it, and it makes the code less likely to break when there are changes to the data, such as when columns are added or removed.

With base R, you can get a subset of rows like this:

```r
climate[climate$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000, ]
#>       Source Year Anomaly1y Anomaly5y Anomaly10y Unc10y
#> 101 Berkeley 1900        NA        NA     -0.171  0.108
#> 102 Berkeley 1901        NA        NA     -0.162  0.109
#>  ...<97 more rows>...
#> 200 Berkeley 1999        NA        NA      0.734  0.025
#> 201 Berkeley 2000        NA        NA      0.748  0.026
```

Notice that we needed to prefix each column name with `climate$`, and that there’s a comma after the selection criteria. This indicates that we’re getting rows, not columns.

This row filtering can also be combined with the column selection from Recipe 15.4:

```r
climate[climate$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000,
        c("Year", "Anomaly10y")]
#>     Year Anomaly10y
#> 101 1900     -0.171
#> 102 1901     -0.162
#>  ...<97 more rows>...
#> 200 1999      0.734
#> 201 2000      0.748
```

## 15.8 Changing the Order of Factor Levels

### 15.8.1 Problem

You want to change the order of levels in a factor.

### 15.8.2 Solution

Pass the factor to `factor()`, and give it the levels in the order you want. This returns a new factor, so if you want to change the original variable, you’ll need to save the new result over it.

```r
# By default, levels are ordered alphabetically
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes
#> [1] small  large  large  small  medium
#> Levels: large medium small

factor(sizes, levels = c("small", "medium", "large"))
#> [1] small  large  large  small  medium
#> Levels: small medium large
```

The order can also be specified with `levels` when the factor is first created:

```r
factor(c("small", "large", "large", "small", "medium"),
       levels = c("small", "medium", "large"))
```

### 15.8.3 Discussion

There are two kinds of factors in R: ordered factors and regular factors. (In practice, ordered factors are not commonly used.) In both types, the levels are arranged in some order; the difference is that the order is meaningful for an ordered factor, but arbitrary for a regular factor – it simply reflects how the data is stored. For plotting data, the distinction between ordered and regular factors is generally unimportant, and they can be treated the same.
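
One concrete difference, in case you ever need it: comparison operators work on ordered factors, but not on regular factors. A minimal sketch:

```r
sizes_ord <- factor(c("small", "medium", "large"),
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)

sizes_ord[1] < sizes_ord[3]  # "small" < "large", according to the level order
#> [1] TRUE
```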

The order of factor levels affects graphical output. When a factor variable is mapped to an aesthetic property in ggplot, the aesthetic adopts the ordering of the factor levels. If a factor is mapped to the x-axis, the ticks on the axis will be in the order of the factor levels, and if a factor is mapped to color, the items in the legend will be in the order of the factor levels.

To reverse the level order, you can use `rev(levels())`:

```r
factor(sizes, levels = rev(levels(sizes)))
```

The tidyverse function for reordering factors is `fct_relevel()` from the forcats package. It has a syntax similar to the `factor()` function from base R.

```r
# Change the order of levels
library(forcats)
fct_relevel(sizes, "small", "medium", "large")
#> [1] small  large  large  small  medium
#> Levels: small medium large
```

To reorder a factor based on the value of another variable, see Recipe 15.9.

Reordering factor levels is useful for controlling the order of axes and legends. See Recipes 8.4 and 10.3 for more information.

## 15.9 Changing the Order of Factor Levels Based on Data Values

### 15.9.1 Problem

You want to change the order of levels in a factor based on values in the data.

### 15.9.2 Solution

Use `reorder()` with the factor that has levels to reorder, the values to base the reordering on, and a function that aggregates the values:

```r
# Make a copy of the InsectSprays data set since we're modifying it
iss <- InsectSprays
iss$spray
#>  [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C C C D D
#> [39] D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F F F F F
#> Levels: A B C D E F

iss$spray <- reorder(iss$spray, iss$count, FUN = mean)
iss$spray
#>  [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C C C D D
#> [39] D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F F F F F
#> attr(,"scores")
#>         A         B         C         D         E         F
#> 14.500000 15.333333  2.083333  4.916667  3.500000 16.666667
#> Levels: C E D A B F
```

Notice that the original levels were `ABCDEF`, while the reordered levels are `CEDABF`. What we’ve done is reorder the levels of `spray` based on the mean value of `count` for each level of `spray`.

### 15.9.3 Discussion

The usefulness of `reorder()` might not be obvious from just looking at the raw output. Figure 15.1 shows three plots made with `reorder()`. In these plots, the order in which the items appear is determined by their values.

In the middle plot in Figure 15.1, the boxes are sorted by the mean. The horizontal line that runs across each box represents the median of the data. Notice that these values do not increase strictly from left to right. That’s because with this particular data set, sorting by the mean gives a different order than sorting by the median. To make the median lines increase from left to right, as in the plot on the right in Figure 15.1, we used the `median()` function in `reorder()`.

The tidyverse function for reordering factors is `fct_reorder()`, and it is used the same way as `reorder()`. These do the same thing:

```r
reorder(iss$spray, iss$count, FUN = mean)

fct_reorder(iss$spray, iss$count, .fun = mean)
```

Reordering factor levels is also useful for controlling the order of axes and legends. See Recipes 8.4 and 10.3 for more information.

## 15.10 Changing the Names of Factor Levels

### 15.10.1 Problem

You want to change the names of levels in a factor.

### 15.10.2 Solution

Use `fct_recode()` from the forcats package:

```r
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes
#> [1] small  large  large  small  medium
#> Levels: large medium small

# Pass it a named vector with the mappings
fct_recode(sizes, S = "small", M = "medium", L = "large")
#> [1] S L L S M
#> Levels: L M S
```

### 15.10.3 Discussion

If you want to use two vectors, one with the original levels and one with the new ones, use `do.call()` with `fct_recode()`.

```r
old <- c("small", "medium", "large")
new <- c("S", "M", "L")

# Create a named vector that has the mappings between old and new
mappings <- setNames(old, new)
mappings
#>        S        M        L
#>  "small" "medium"  "large"

# Create a list of the arguments to pass to fct_recode
args <- c(list(sizes), mappings)

# Look at the structure of the list
str(args)
#> List of 4
#>  $  : Factor w/ 3 levels "large","medium",..: 3 1 1 3 2
#>  $ S: chr "small"
#>  $ M: chr "medium"
#>  $ L: chr "large"

# Use do.call to call fct_recode with the arguments
do.call(fct_recode, args)
#> [1] S L L S M
#> Levels: L M S
```

Or, more concisely, we can do all of that in one go:

```r
do.call(
  fct_recode,
  c(list(sizes), setNames(c("small", "medium", "large"), c("S", "M", "L")))
)
#> [1] S L L S M
#> Levels: L M S
```

For a more traditional (and clunky) base R method for renaming factor levels, use the `levels()<-` function:

```r
sizes <- factor(c("small", "large", "large", "small", "medium"))

# Index into the levels and rename each one
levels(sizes)[levels(sizes) == "large"]  <- "L"
levels(sizes)[levels(sizes) == "medium"] <- "M"
levels(sizes)[levels(sizes) == "small"]  <- "S"
sizes
#> [1] S L L S M
#> Levels: L M S
```

If you are renaming all your factor levels, there is a simpler method. You can pass a list to `levels()<-`:

```r
sizes <- factor(c("small", "large", "large", "small", "medium"))
levels(sizes) <- list(S = "small", M = "medium", L = "large")
sizes
#> [1] S L L S M
#> Levels: S M L
```

With this method, all factor levels must be specified in the list; if any are missing, they will be replaced with `NA`.
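
Here's a small illustration of that caveat, leaving `"medium"` out of the list:

```r
sizes <- factor(c("small", "large", "large", "small", "medium"))

# "medium" is not mentioned in the list, so those values become NA
levels(sizes) <- list(S = "small", L = "large")
sizes
#> [1] S    L    L    S    <NA>
#> Levels: S L
```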

It’s also possible to rename factor levels by position, but this is somewhat inelegant:

```r
sizes <- factor(c("small", "large", "large", "small", "medium"))
levels(sizes)[1] <- "L"
sizes
#> [1] small  L      L      small  medium
#> Levels: L medium small

# Rename all levels at once
levels(sizes) <- c("L", "M", "S")
sizes
#> [1] S L L S M
#> Levels: L M S
```

It’s safer to rename factor levels by name rather than by position, since you will be less likely to make a mistake (and mistakes here may be hard to detect). Also, if your input data set changes to have more or fewer levels, the numeric positions of the existing levels could change, which could cause serious but nonobvious problems for your analysis.

If, instead of a factor, you have a character vector with items to rename, see Recipe 15.12.

## 15.11 Removing Unused Levels from a Factor

### 15.11.1 Problem

You want to remove unused levels from a factor.

### 15.11.2 Solution

Sometimes, after processing your data you will have a factor that contains levels that are no longer used. Here’s an example:

```r
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes <- sizes[1:3]
sizes
#> [1] small large large
#> Levels: large medium small
```

To remove them, use `droplevels()`:

```r
droplevels(sizes)
#> [1] small large large
#> Levels: large small
```

### 15.11.3 Discussion

The `droplevels()` function preserves the order of factor levels. You can use the `except` parameter to keep particular levels.

The tidyverse way: Use `fct_drop()` from the forcats package:

```r
fct_drop(sizes)
#> [1] small large large
#> Levels: large small
```

## 15.12 Changing the Names of Items in a Character Vector

### 15.12.1 Problem

You want to change the names of items in a character vector.

### 15.12.2 Solution

Use `recode()` from the dplyr package:

```r
library(dplyr)

sizes <- c("small", "large", "large", "small", "medium")
sizes
#> [1] "small"  "large"  "large"  "small"  "medium"

# With recode(), pass it a named vector with the mappings
recode(sizes, small = "S", medium = "M", large = "L")
#> [1] "S" "L" "L" "S" "M"

# Can also use quotes -- useful if there are spaces or other strange characters
recode(sizes, "small" = "S", "medium" = "M", "large" = "L")
#> [1] "S" "L" "L" "S" "M"
```

### 15.12.3 Discussion

If you want to use two vectors, one with the original values and one with the new ones, use `do.call()` with `recode()`.

```r
old <- c("small", "medium", "large")
new <- c("S", "M", "L")

# Create a named vector that has the mappings between old and new
mappings <- setNames(new, old)
mappings
#>  small medium  large
#>    "S"    "M"    "L"

# Create a list of the arguments to pass to recode
args <- c(list(sizes), mappings)

# Look at the structure of the list
str(args)
#> List of 4
#>  $       : chr [1:5] "small" "large" "large" "small" ...
#>  $ small : chr "S"
#>  $ medium: chr "M"
#>  $ large : chr "L"

# Use do.call to call recode with the arguments
do.call(recode, args)
#> [1] "S" "L" "L" "S" "M"
```

Or, more concisely, we can do all of that in one go:

```r
do.call(
  recode,
  c(list(sizes), setNames(c("S", "M", "L"), c("small", "medium", "large")))
)
#> [1] "S" "L" "L" "S" "M"
```

Note that for `recode()`, the names and values of the arguments are reversed compared to the `fct_recode()` function from the forcats package. With `recode()`, you would use `small = "S"`, whereas with `fct_recode()`, you would use `S = "small"`.
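
To see the two argument orders side by side (both produce `"S" "L" "L" "S" "M"`, though `fct_recode()` returns a factor):

```r
library(dplyr)
library(forcats)

sizes <- c("small", "large", "large", "small", "medium")

# recode(): old = "new"
recode(sizes, small = "S", medium = "M", large = "L")

# fct_recode(): new = "old"
fct_recode(sizes, S = "small", M = "medium", L = "large")
```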

A more traditional R method is to use square-bracket indexing to select the items and rename them:

```r
sizes <- c("small", "large", "large", "small", "medium")
sizes[sizes == "small"]  <- "S"
sizes[sizes == "medium"] <- "M"
sizes[sizes == "large"]  <- "L"
sizes
#> [1] "S" "L" "L" "S" "M"
```

If, instead of a character vector, you have a factor with levels to rename, see Recipe 15.10.

## 15.13 Recoding a Categorical Variable to Another Categorical Variable

### 15.13.1 Problem

You want to recode a categorical variable to another variable.

### 15.13.2 Solution

For the examples here, we’ll use a subset of the `PlantGrowth` data set:

```r
# Work on a subset of the PlantGrowth data set
pg <- PlantGrowth[c(1, 2, 11, 21, 22), ]
pg
#>    weight group
#> 1    4.17  ctrl
#> 2    5.58  ctrl
#> 11   4.81  trt1
#> 21   6.31  trt2
#> 22   5.12  trt2
```

In this example, we’ll recode the categorical variable `group` into another categorical variable, `treatment`. If the old value was `"ctrl"`, the new value will be `"No"`, and if the old value was `"trt1"` or `"trt2"`, the new value will be `"Yes"`.

This can be done with the `recode()` function from the dplyr package:

```r
library(dplyr)

recode(pg$group, ctrl = "No", trt1 = "Yes", trt2 = "Yes")
#> [1] No  No  Yes Yes Yes
#> Levels: No Yes
```

You can assign it as a new column in the data frame:

```r
pg$treatment <- recode(pg$group, ctrl = "No", trt1 = "Yes", trt2 = "Yes")
```

Note that since the input was a factor, it returns a factor. If you want to get a character vector instead, use `as.character()`:

```r
recode(as.character(pg$group), ctrl = "No", trt1 = "Yes", trt2 = "Yes")
#> [1] "No"  "No"  "Yes" "Yes" "Yes"
```

### 15.13.3 Discussion

You can also use the `fct_recode()` function from the forcats package. It works the same, except the names and values are swapped, which may be a little more intuitive:

```r
library(forcats)
fct_recode(pg$group, No = "ctrl", Yes = "trt1", Yes = "trt2")
#> [1] No  No  Yes Yes Yes
#> Levels: No Yes
```

Another difference is that `fct_recode()` will always return a factor, whereas `recode()` will return a character vector if it is given a character vector, and will return a factor if it is given a factor. (Although dplyr does have a `recode_factor()` function which also always returns a factor.)
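
A quick check of those return types:

```r
library(dplyr)
library(forcats)

chr <- c("ctrl", "trt1")
fac <- factor(chr)

is.character(recode(chr, ctrl = "No", trt1 = "Yes"))  # character in, character out
is.factor(recode(fac, ctrl = "No", trt1 = "Yes"))     # factor in, factor out
is.factor(fct_recode(chr, No = "ctrl", Yes = "trt1")) # always a factor
```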

Using base R, recoding can be done with the `match()` function:

```r
oldvals <- c("ctrl", "trt1", "trt2")
newvals <- factor(c("No", "Yes", "Yes"))

newvals[match(pg$group, oldvals)]
#> [1] No  No  Yes Yes Yes
#> Levels: No Yes
```

It can also be done by indexing in the vectors:

```r
pg$treatment[pg$group == "ctrl"] <- "No"
pg$treatment[pg$group == "trt1"] <- "Yes"
pg$treatment[pg$group == "trt2"] <- "Yes"

# Convert to a factor
pg$treatment <- factor(pg$treatment)
pg
#>    weight group treatment
#> 1    4.17  ctrl        No
#> 2    5.58  ctrl        No
#> 11   4.81  trt1       Yes
#> 21   6.31  trt2       Yes
#> 22   5.12  trt2       Yes
```

Here, we combined two of the factor levels and put the result into a new column. If you simply want to rename the levels of a factor, see Recipe 15.10.

The coding criteria can also be based on values in multiple columns, by using the `&` and `|` operators:

```r
pg$newcol[pg$group == "ctrl" & pg$weight < 5]  <- "no_small"
pg$newcol[pg$group == "ctrl" & pg$weight >= 5] <- "no_large"
pg$newcol[pg$group == "trt1"] <- "yes"
pg$newcol[pg$group == "trt2"] <- "yes"
pg$newcol <- factor(pg$newcol)
pg
#>    weight group   newcol
#> 1    4.17  ctrl no_small
#> 2    5.58  ctrl no_large
#> 11   4.81  trt1      yes
#> 21   6.31  trt2      yes
#> 22   5.12  trt2      yes
```

It’s also possible to combine two columns into one using the `interaction()` function, which appends the values with a `.` in between. This combines the `weight` and `group` columns into a new column, `weightgroup`:

```r
pg$weightgroup <- interaction(pg$weight, pg$group)
pg
#>    weight group weightgroup
#> 1    4.17  ctrl   4.17.ctrl
#> 2    5.58  ctrl   5.58.ctrl
#> 11   4.81  trt1   4.81.trt1
#> 21   6.31  trt2   6.31.trt2
#> 22   5.12  trt2   5.12.trt2
```

For more on renaming factor levels, see Recipe 15.10.

See Recipe 15.14 for recoding continuous values to categorical values.

## 15.14 Recoding a Continuous Variable to a Categorical Variable

### 15.14.1 Problem

You want to recode a continuous variable to a categorical variable.

### 15.14.2 Solution

Use the `cut()` function. In this example, we’ll use the `PlantGrowth` data set and recode the continuous variable `weight` into a categorical variable, `wtclass`:

``````pg <- PlantGrowth
pg\$wtclass <- cut(pg\$weight, breaks = c(0, 5, 6, Inf))
pg
#>    weight group wtclass
#> 1    4.17  ctrl   (0,5]
#> 2    5.58  ctrl   (5,6]
#>  ...<26 more rows>...
#> 29   5.80  trt2   (5,6]
#> 30   5.26  trt2   (5,6]``````

### 15.14.3 Discussion

For three categories we specify four bounds, which can include `Inf` and `-Inf`. If a data value falls outside of the specified bounds, it’s categorized as `NA`. The result of `cut()` is a factor, and you can see from the example that the factor levels are named after the bounds.
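A quick sketch of this behavior, using a few made-up values rather than the `PlantGrowth` data:

``````# -1 falls outside the bounds, so it is coded as NA
cut(c(-1, 3, 7), breaks = c(0, 5, 6, Inf))
#> [1] <NA>    (0,5]   (6,Inf]
#> Levels: (0,5] (5,6] (6,Inf]``````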

To change the names of the levels, set the labels:

``````pg\$wtclass <- cut(pg\$weight, breaks = c(0, 5, 6, Inf),
labels = c("small", "medium", "large"))
pg
#>    weight group wtclass
#> 1    4.17  ctrl   small
#> 2    5.58  ctrl  medium
#>  ...<26 more rows>...
#> 29   5.80  trt2  medium
#> 30   5.26  trt2  medium``````

As indicated by the factor levels, the bounds are by default open on the left and closed on the right. In other words, they don’t include the lowest value, but they do include the highest value. For the smallest category, you can have it include both the lower and upper values by setting `include.lowest=TRUE`. In this example, that would result in values of exactly 0 going into the small category; otherwise, 0 would be coded as `NA`.
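To see the difference, here is a small sketch with made-up values, where one value is exactly 0:

``````x <- c(0, 2, 6)

# With the defaults, 0 is not included in any interval
cut(x, breaks = c(0, 5, 6, Inf))
#> [1] <NA>  (0,5] (5,6]
#> Levels: (0,5] (5,6] (6,Inf]

# With include.lowest = TRUE, the first interval becomes [0,5]
cut(x, breaks = c(0, 5, 6, Inf), include.lowest = TRUE)
#> [1] [0,5] [0,5] (5,6]
#> Levels: [0,5] (5,6] (6,Inf]``````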

If you want the categories to be closed on the left and open on the right, set `right = FALSE`:

``````cut(pg\$weight, breaks = c(0, 5, 6, Inf), right = FALSE)
#>  [1] [0,5)   [5,6)   [5,6)   [6,Inf) [0,5)   [0,5)   [5,6)   [0,5)   [5,6)
#> [10] [5,6)   [0,5)   [0,5)   [0,5)   [0,5)   [5,6)   [0,5)   [6,Inf) [0,5)
#> [19] [0,5)   [0,5)   [6,Inf) [5,6)   [5,6)   [5,6)   [5,6)   [5,6)   [0,5)
#> [28] [6,Inf) [5,6)   [5,6)
#> Levels: [0,5) [5,6) [6,Inf)``````

To recode a categorical variable to another categorical variable, see Recipe 15.13.

## 15.15 Calculating New Columns From Existing Columns

### 15.15.1 Problem

You want to calculate a new column of values in a data frame.

### 15.15.2 Solution

Use `mutate()` from the dplyr package.

``````library(gcookbook) # Load gcookbook for the heightweight data set
heightweight
#>     sex ageYear ageMonth heightIn weightLb
#> 1     f   11.92      143     56.3     85.0
#> 2     f   12.92      155     62.3    105.0
#>  ...<232 more rows>...
#> 235   m   13.92      167     62.0    107.5
#> 236   m   12.58      151     59.3     87.0``````

This will convert `heightIn` to centimeters and store it in a new column, `heightCm`:

``````library(dplyr)
heightweight %>%
mutate(heightCm = heightIn * 2.54)
#>     sex ageYear ageMonth heightIn weightLb heightCm
#> 1     f   11.92      143     56.3     85.0  143.002
#> 2     f   12.92      155     62.3    105.0  158.242
#>  ...<232 more rows>...
#> 235   m   13.92      167     62.0    107.5  157.480
#> 236   m   12.58      151     59.3     87.0  150.622``````

This returns a new data frame, so if you want to replace the original variable, you will need to save the result over it.

### 15.15.3 Discussion

You can use `mutate()` to transform multiple columns at once:

``````heightweight %>%
mutate(
heightCm = heightIn * 2.54,
weightKg = weightLb / 2.204
)
#>     sex ageYear ageMonth heightIn weightLb heightCm weightKg
#> 1     f   11.92      143     56.3     85.0  143.002 38.56624
#> 2     f   12.92      155     62.3    105.0  158.242 47.64065
#>  ...<232 more rows>...
#> 235   m   13.92      167     62.0    107.5  157.480 48.77495
#> 236   m   12.58      151     59.3     87.0  150.622 39.47368``````

It is also possible to calculate a new column based on multiple columns. This assumes the `heightCm` and `weightKg` columns created above have been saved back into `heightweight`:

``````heightweight %>%
mutate(bmi = weightKg / (heightCm / 100)^2)``````

With `mutate()`, the columns are added sequentially. That means that we can reference a newly-created column when calculating a new column:

``````heightweight %>%
mutate(
heightCm = heightIn * 2.54,
weightKg = weightLb / 2.204,
bmi = weightKg / (heightCm / 100)^2
)
#>     sex ageYear ageMonth heightIn weightLb heightCm weightKg      bmi
#> 1     f   11.92      143     56.3     85.0  143.002 38.56624 18.85919
#> 2     f   12.92      155     62.3    105.0  158.242 47.64065 19.02542
#>  ...<232 more rows>...
#> 235   m   13.92      167     62.0    107.5  157.480 48.77495 19.66736
#> 236   m   12.58      151     59.3     87.0  150.622 39.47368 17.39926``````

With base R, calculating a new column can be done by referencing the new column with the `\$` operator and assigning some values to it:

``heightweight\$heightCm <- heightweight\$heightIn * 2.54``

See Recipe 15.16 for how to perform group-wise transformations on data.

## 15.16 Calculating New Columns by Groups

### 15.16.1 Problem

You want to create new columns that are the result of calculations performed on groups of data, as specified by a grouping column.

### 15.16.2 Solution

Use `group_by()` from the dplyr package to specify the grouping variable, and then specify the operations in `mutate()`:

``````library(MASS)  # Load MASS for the cabbages data set
library(dplyr)

cabbages %>%
group_by(Cult) %>%
mutate(DevWt = HeadWt - mean(HeadWt))
#> # A tibble: 60 x 5
#> # Groups:   Cult [2]
#>   Cult  Date  HeadWt  VitC  DevWt
#>   <fct> <fct>  <dbl> <int>  <dbl>
#> 1 c39   d16      2.5    51 -0.407
#> 2 c39   d16      2.2    55 -0.707
#> 3 c39   d16      3.1    45  0.193
#> 4 c39   d16      4.3    42  1.39
#> 5 c39   d16      2.5    53 -0.407
#> 6 c39   d16      4.3    50  1.39
#> # … with 54 more rows``````

This returns a new data frame, so if you want to replace the original variable, you will need to save the result over it.

### 15.16.3 Discussion

Let’s take a closer look at the `cabbages` data set. It has two grouping variables (factors): `Cult`, which has levels `c39` and `c52`, and `Date`, which has levels `d16`, `d20`, and `d21`. It also has two measured numeric variables, `HeadWt` and `VitC`:

``````cabbages
#>    Cult Date HeadWt VitC
#> 1   c39  d16    2.5   51
#> 2   c39  d16    2.2   55
#>  ...<56 more rows>...
#> 59  c52  d21    1.5   66
#> 60  c52  d21    1.6   72``````

Suppose we want to find, for each case, the deviation of `HeadWt` from the overall mean. All we have to do is take the overall mean and subtract it from the observed value for each case:

``````mutate(cabbages, DevWt = HeadWt - mean(HeadWt))
#>    Cult Date HeadWt VitC       DevWt
#> 1   c39  d16    2.5   51 -0.09333333
#> 2   c39  d16    2.2   55 -0.39333333
#>  ...<56 more rows>...
#> 59  c52  d21    1.5   66 -1.09333333
#> 60  c52  d21    1.6   72 -0.99333333``````

You’ll often want to do separate operations like this for each group, where the groups are specified by one or more grouping variables. Suppose, for example, we want to normalize the data within each group by finding the deviation of each case from the mean within the group, where the groups are specified by `Cult`. In these cases, we can use `group_by()` and `mutate()` together:

``````cb <- cabbages %>%
group_by(Cult) %>%
mutate(DevWt = HeadWt - mean(HeadWt))``````

This code first groups `cabbages` based on the value of `Cult`. There are two levels of `Cult`, `c39` and `c52`. It then applies the `mutate()` expression separately to each group, as if each group were its own data frame.

The before and after results are shown in Figure 15.2:

``````# The data before normalizing
ggplot(cb, aes(x = Cult, y = HeadWt)) +
geom_boxplot()

# After normalizing
ggplot(cb, aes(x = Cult, y = DevWt)) +
geom_boxplot()``````

You can also group the data frame on multiple variables and perform operations on multiple variables. The following code groups the data by `Cult` and `Date`, forming a group for each distinct combination of the two variables. After forming these groups, the code will calculate the deviation of `HeadWt` and `VitC` from the mean of each group:

``````cabbages %>%
group_by(Cult, Date) %>%
mutate(
DevWt = HeadWt - mean(HeadWt),
DevVitC = VitC - mean(VitC)
)
#> # A tibble: 60 x 6
#> # Groups:   Cult, Date [6]
#>   Cult  Date  HeadWt  VitC DevWt DevVitC
#>   <fct> <fct>  <dbl> <int> <dbl>   <dbl>
#> 1 c39   d16      2.5    51 -0.68   0.7
#> 2 c39   d16      2.2    55 -0.98   4.7
#> 3 c39   d16      3.1    45 -0.08  -5.30
#> 4 c39   d16      4.3    42  1.12  -8.30
#> 5 c39   d16      2.5    53 -0.68   2.7
#> 6 c39   d16      4.3    50  1.12  -0.300
#> # … with 54 more rows``````

To summarize data by groups, see Recipe 15.17.

## 15.17 Summarizing Data by Groups

### 15.17.1 Problem

You want to summarize your data, based on one or more grouping variables.

### 15.17.2 Solution

Use `group_by()` and `summarise()` from the dplyr package, and specify the operations to do:

``````library(MASS)  # Load MASS for the cabbages data set
library(dplyr)

cabbages %>%
group_by(Cult, Date) %>%
summarise(
Weight = mean(HeadWt),
VitC = mean(VitC)
)
#> # A tibble: 6 x 4
#> # Groups:   Cult [2]
#>   Cult  Date  Weight  VitC
#>   <fct> <fct>  <dbl> <dbl>
#> 1 c39   d16     3.18  50.3
#> 2 c39   d20     2.8   49.4
#> 3 c39   d21     2.74  54.8
#> 4 c52   d16     2.26  62.5
#> 5 c52   d20     3.11  58.9
#> 6 c52   d21     1.47  71.8``````

### 15.17.3 Discussion

There are a few things going on here that may be unfamiliar if you’re new to dplyr and the tidyverse in general.

First, let’s take a closer look at the `cabbages` data set. It has two factors that can be used as grouping variables: `Cult`, which has levels `c39` and `c52`, and `Date`, which has levels `d16`, `d20`, and `d21`. It also has two numeric variables, `HeadWt` and `VitC`:

``````cabbages
#>    Cult Date HeadWt VitC
#> 1   c39  d16    2.5   51
#> 2   c39  d16    2.2   55
#>  ...<56 more rows>...
#> 59  c52  d21    1.5   66
#> 60  c52  d21    1.6   72``````

Finding the overall mean of `HeadWt` is simple. We could just use the `mean()` function on that column, but for reasons that will soon become clear, we’ll use the `summarise()` function instead:

``````library(dplyr)
summarise(cabbages, Weight = mean(HeadWt))
#>     Weight
#> 1 2.593333``````

The result is a data frame with one row and one column, named `Weight`.

Often we want to find information about each subset of the data, as specified by a grouping variable. For example, suppose we want to find the mean of each `Cult` group. To do this, we can use `summarise()` with `group_by()`.

``````tmp <- group_by(cabbages, Cult)
summarise(tmp, Weight = mean(HeadWt))
#> # A tibble: 2 x 2
#>   Cult  Weight
#>   <fct>  <dbl>
#> 1 c39     2.91
#> 2 c52     2.28``````

The command first groups the data frame `cabbages` based on the value of `Cult`. There are two levels of `Cult`, `c39` and `c52`, so there are two groups. It then applies the `summarise()` function to each of these data frames; it calculates `Weight` by taking the `mean()` of the `HeadWt` column in each of the sub-data frames. The resulting summaries for each group are assembled into a data frame, which is returned.

You can imagine that the `cabbages` data is split up into two separate data frames, then `summarise()` is called on each data frame (returning a one-row data frame for each), and then those results are combined together into a final data frame. This is actually how things worked in dplyr’s predecessor, plyr, with the `ddply()` function.

The syntax of the previous code used a temporary variable to store results. That’s a little verbose, so instead, we can use `%>%`, also known as the pipe operator, to chain the function calls together. The pipe operator simply takes what’s on its left and substitutes it as the first argument of the function call on the right. The following two lines of code are equivalent:

``````group_by(cabbages, Cult)
# The pipe operator moves `cabbages` to the first argument position of group_by()
cabbages %>% group_by(Cult)``````

The reason it’s called a pipe operator is that it lets you connect function calls together in sequence to form a pipeline of operations. Another common term for this, based on a different metaphor, is chaining.

So the first argument of the function call is in a different place. So what? The advantages become apparent when chaining is involved. Here’s what it would look like if you wanted to call `group_by()` and then `summarise()` without making use of a temporary variable. Instead of proceeding left to right, the computation occurs from the inside out:

``summarise(group_by(cabbages, Cult), Weight = mean(HeadWt))``

Using a temporary variable, as we did earlier, makes it more readable, but a more elegant solution is to use the pipe operator:

``````cabbages %>%
group_by(Cult) %>%
summarise(Weight = mean(HeadWt))``````

Back to summarizing data. Summarizing the data frame by grouping using more variables (or columns) is simple: just give it the names of the additional variables. It’s also possible to get more than one summary value by specifying more calculated columns. Here we’ll summarize each `Cult` and `Date` group, getting the average of `HeadWt` and `VitC`:

``````cabbages %>%
group_by(Cult, Date) %>%
summarise(
Weight = mean(HeadWt),
Vitc = mean(VitC)
)
#> # A tibble: 6 x 4
#> # Groups:   Cult [2]
#>   Cult  Date  Weight  Vitc
#>   <fct> <fct>  <dbl> <dbl>
#> 1 c39   d16     3.18  50.3
#> 2 c39   d20     2.8   49.4
#> 3 c39   d21     2.74  54.8
#> 4 c52   d16     2.26  62.5
#> 5 c52   d20     3.11  58.9
#> 6 c52   d21     1.47  71.8``````

Note

You might have noticed that it says that the result is grouped by `Cult`, but not `Date`. This is because the `summarise()` function removes one level of grouping. This is typically what you want when the input has one grouping variable. When there are multiple grouping variables, this may or may not be what you want. To remove all grouping, use `ungroup()`, and to add back the original grouping, use `group_by()` again.

It’s possible to do more than take the mean. You may, for example, want to compute the standard deviation and count of each group. To get the standard deviation, use `sd()`, and to get a count of rows in each group, use `n()`:

``````cabbages %>%
group_by(Cult, Date) %>%
summarise(
Weight = mean(HeadWt),
sd = sd(HeadWt),
n = n()
)
#> # A tibble: 6 x 5
#> # Groups:   Cult [2]
#>   Cult  Date  Weight    sd     n
#>   <fct> <fct>  <dbl> <dbl> <int>
#> 1 c39   d16     3.18 0.957    10
#> 2 c39   d20     2.8  0.279    10
#> 3 c39   d21     2.74 0.983    10
#> 4 c52   d16     2.26 0.445    10
#> 5 c52   d20     3.11 0.791    10
#> 6 c52   d21     1.47 0.211    10``````

Other useful functions for generating summary statistics include `min()`, `max()`, and `median()`. The `n()` function is a special function that works only inside of the dplyr functions `summarise()`, `mutate()` and `filter()`. See `?summarise` for more useful functions.

The `n()` function gets a count of rows, but if you want a count that excludes `NA` values in a column, you need to use a different technique. For example, if you want it to ignore any `NA`s in the `HeadWt` column, use `sum(!is.na(HeadWt))`.
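As a minimal illustration with a made-up vector:

``````x <- c(3.2, NA, 4.1, 5.0, NA)

length(x)        # counts everything, including NAs
#> [1] 5
sum(!is.na(x))   # counts only the non-NA values
#> [1] 3``````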


#### 15.17.3.1 Dealing with NAs {#_dealing_with_literal_na_literal_s}

One potential pitfall is that `NA`s in the data will lead to `NA`s in the output. Let’s see what happens if we sprinkle a few `NA`s into `HeadWt`:

``````c1 <- cabbages # Make a copy
c1\$HeadWt[c(1, 20, 45)] <- NA # Set some values to NA

c1 %>%
group_by(Cult) %>%
summarise(
Weight = mean(HeadWt),
sd = sd(HeadWt),
n = n()
)
#> # A tibble: 2 x 4
#>   Cult  Weight    sd     n
#>   <fct>  <dbl> <dbl> <int>
#> 1 c39       NA    NA    30
#> 2 c52       NA    NA    30``````

The problem is that `mean()` and `sd()` simply return `NA` if any of the input values are `NA`. Fortunately, these functions have an option to deal with this very issue: setting `na.rm=TRUE` will tell them to ignore the `NA`s.

``````c1 %>%
group_by(Cult) %>%
summarise(
Weight = mean(HeadWt, na.rm = TRUE),
sd = sd(HeadWt, na.rm = TRUE),
n = n()
)
#> # A tibble: 2 x 4
#>   Cult  Weight    sd     n
#>   <fct>  <dbl> <dbl> <int>
#> 1 c39     2.9  0.822    30
#> 2 c52     2.23 0.828    30``````

#### 15.17.3.2 Missing combinations {#_missing_combinations}

If there are any empty combinations of the grouping variables, they will not appear in the summarized data frame. These missing combinations can cause problems when making graphs. To illustrate, we’ll remove all entries that have levels `c52` and `d21`. The graph on the left in Figure 15.3 shows what happens when there’s a missing combination in a bar graph:

``````# Copy cabbages and remove all rows with both c52 and d21
c2 <- filter(cabbages, !( Cult == "c52" & Date == "d21" ))
c2a <- c2 %>%
group_by(Cult, Date) %>%
summarise(Weight = mean(HeadWt))

ggplot(c2a, aes(x = Date, fill = Cult, y = Weight)) +
geom_col(position = "dodge")``````

To fill in the missing combination (Figure 15.3, right), use the `complete()` function from the tidyr package – which is also part of the tidyverse. Also, the grouping for `c2a` must be removed, with `ungroup()`; otherwise it will return too many rows.

``````library(tidyr)
c2b <- c2a %>%
ungroup() %>%
complete(Cult, Date)

ggplot(c2b, aes(x = Date, fill = Cult, y = Weight)) +
geom_col(position = "dodge")``````

When we used `complete()`, it filled in the missing combinations with `NA`. It’s possible to fill with a different value, with the `fill` parameter. See `?complete` for more information.

If you want to calculate standard errors and confidence intervals, see Recipe 15.18.

See Recipe 6.8 for an example of using stat_summary() to calculate means and overlay them on a graph.

To perform transformations on data by groups, see Recipe 15.16.

## 15.18 Summarizing Data with Standard Errors and Confidence Intervals

### 15.18.1 Problem

You want to summarize your data with the standard error of the mean and/or confidence intervals.

### 15.18.2 Solution

Getting the standard error of the mean involves two steps: first get the standard deviation and count for each group, then use those values to calculate the standard error. The standard error for each group is just the standard deviation divided by the square root of the sample size:

``````library(MASS)  # Load MASS for the cabbages data set
library(dplyr)

ca <- cabbages %>%
group_by(Cult, Date) %>%
summarise(
Weight = mean(HeadWt),
sd = sd(HeadWt),
n = n(),
se = sd / sqrt(n)
)

ca
#> # A tibble: 6 x 6
#> # Groups:   Cult [2]
#>   Cult  Date  Weight    sd     n     se
#>   <fct> <fct>  <dbl> <dbl> <int>  <dbl>
#> 1 c39   d16     3.18 0.957    10 0.303
#> 2 c39   d20     2.8  0.279    10 0.0882
#> 3 c39   d21     2.74 0.983    10 0.311
#> 4 c52   d16     2.26 0.445    10 0.141
#> 5 c52   d20     3.11 0.791    10 0.250
#> 6 c52   d21     1.47 0.211    10 0.0667``````

### 15.18.3 Discussion

The `summarise()` function computes the columns in order, so you can refer to previous newly-created columns. That’s why `se` can use the `sd` and `n` columns.

The `n()` function gets a count of rows, but if you want to have it not count `NA` values from a column, you need to use a different technique. For example, if you want it to ignore any `NA`s in the `HeadWt` column, use `sum(!is.na(Headwt))`.

#### 15.18.3.1 Confidence Intervals {#_confidence_intervals}

Confidence intervals are calculated using the standard error of the mean and the degrees of freedom. To calculate a confidence interval, use the `qt()` function to get the quantile, then multiply that by the standard error. The `qt()` function will give quantiles of the t-distribution when given a probability level and degrees of freedom. For a 95% confidence interval, use a probability level of .975; for the bell-shaped t-distribution, this will in essence cut off 2.5% of the area under the curve at either end. The degrees of freedom equal the sample size minus one.

This will calculate the multiplier for each group. There are six groups and each has the same number of observations (10), so they will all have the same multiplier:

``````ciMult <- qt(.975, ca\$n - 1)
ciMult
#> [1] 2.262157 2.262157 2.262157 2.262157 2.262157 2.262157``````

Now we can multiply that vector by the standard error to get the 95% confidence interval:

``````ca\$ci95 <- ca\$se * ciMult
ca
#> # A tibble: 6 x 7
#> # Groups:   Cult [2]
#>   Cult  Date  Weight    sd     n     se  ci95
#>   <fct> <fct>  <dbl> <dbl> <int>  <dbl> <dbl>
#> 1 c39   d16     3.18 0.957    10 0.303  0.684
#> 2 c39   d20     2.8  0.279    10 0.0882 0.200
#> 3 c39   d21     2.74 0.983    10 0.311  0.703
#> 4 c52   d16     2.26 0.445    10 0.141  0.318
#> 5 c52   d20     3.11 0.791    10 0.250  0.566
#> 6 c52   d21     1.47 0.211    10 0.0667 0.151``````

This could be done in one line, like this:

``ca\$ci95 <- ca\$se * qt(.975, ca\$n - 1)``

For a 99% confidence interval, use .995.
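For example, with groups of 10 observations (so 9 degrees of freedom), the 99% multiplier is noticeably larger than the 95% one:

``````# 95% multiplier, as computed above
qt(.975, 10 - 1)
#> [1] 2.262157

# 99% multiplier for the same group size (roughly 3.25)
qt(.995, 10 - 1)``````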

Error bars that represent the standard error of the mean and confidence intervals serve the same general purpose: to give the viewer an idea of how good the estimate of the population mean is. The standard error is the standard deviation of the sampling distribution of the sample mean. Confidence intervals are a little easier to interpret. Very roughly, a 95% confidence interval means that there’s a 95% chance that the true population mean is within the interval (actually, it doesn’t mean this at all, but this seemingly simple topic is way too complicated to cover here; if you want to know more, read up on Bayesian statistics).

The following function, `summarySE()`, performs all the steps of calculating the standard deviation, count, standard error, and confidence intervals. It can also handle `NA`s and missing combinations, with the `na.rm` and `.drop` options. By default, it provides a 95% confidence interval, but this can be set with the `conf.interval` argument:

``````summarySE <- function(data = NULL, measurevar, groupvars = NULL, na.rm = FALSE,
conf.interval = .95, .drop = TRUE) {

# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function(x, na.rm = FALSE) {
if (na.rm) sum(!is.na(x))
else       length(x)
}

groupvars  <- rlang::syms(groupvars)
measurevar <- rlang::sym(measurevar)

datac <- data %>%
dplyr::group_by(!!!groupvars) %>%
dplyr::summarise(
N             = length2(!!measurevar, na.rm = na.rm),
sd            = sd     (!!measurevar, na.rm = na.rm),
!!measurevar := mean   (!!measurevar, na.rm = na.rm),
se            = sd / sqrt(N),
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ci            = se * qt(conf.interval/2 + .5, N - 1)
) %>%
dplyr::ungroup() %>%
# Rearrange the columns so that sd, se, ci are last
dplyr::select(seq_len(ncol(.) - 4), ncol(.) - 2, sd, se, ci)

datac
}``````

The following usage example has a 99% confidence interval and handles `NA`s and missing combinations:

``````# Remove all rows with both c52 and d21
c2 <- filter(cabbages, !(Cult == "c52" & Date == "d21" ))
# Set some values to NA
c2\$HeadWt[c(1, 20, 45)] <- NA
summarySE(c2, "HeadWt", c("Cult", "Date"),
conf.interval = .99, na.rm = TRUE, .drop = FALSE)
#> # A tibble: 5 x 7
#>   Cult  Date      N HeadWt    sd     se    ci
#>   <fct> <fct> <int>  <dbl> <dbl>  <dbl> <dbl>
#> 1 c39   d16       9   3.26 0.982 0.327  1.10
#> 2 c39   d20       9   2.72 0.139 0.0465 0.156
#> 3 c39   d21      10   2.74 0.983 0.311  1.01
#> 4 c52   d16      10   2.26 0.445 0.141  0.458
#> 5 c52   d20       9   3.04 0.809 0.270  0.905``````

See Recipe 7.7 to use the values calculated here to add error bars to a graph.

## 15.19 Converting Data from Wide to Long

### 15.19.1 Problem

You want to convert a data frame from “wide” format to “long” format.

### 15.19.2 Solution

Use `gather()` from the tidyr package. In the `anthoming` data set, for each `angle`, there are two measurements: one column contains measurements in the experimental condition and the other contains measurements in the control condition:

``````library(gcookbook) # For the data set
anthoming
#>   angle expt ctrl
#> 1   -20    1    0
#> 2   -10    7    3
#> 3     0    2    3
#> 4    10    0    3
#> 5    20    0    1``````

We can reshape the data so that all the measurements are in one column. This will put the values from `expt` and `ctrl` into one column, and put the names into a different column:

``````library(tidyr)
gather(anthoming, condition, count, expt, ctrl)
#>    angle condition count
#> 1    -20      expt     1
#> 2    -10      expt     7
#>  ...<6 more rows>...
#> 9     10      ctrl     3
#> 10    20      ctrl     1``````

This data frame represents the same information as the original one, but it is structured in a way that is more conducive to some analyses.

### 15.19.3 Discussion

In the source data, there are ID variables and value variables. The ID variables are those that specify which values go together. In the source data, the first row holds measurements for when `angle` is –20. In the output data frame, the two measurements, for `expt` and `ctrl`, are no longer in the same row, but we can still tell that they belong together because they have the same value of `angle`.

The value variables are by default all the non-ID variables. The names of these variables are put into a new key column, which we called `condition`, and the values are put into a new value column which we called `count`.

You can designate the value columns from the source data by naming them individually, as we did above with `expt` and `ctrl`. `gather()` automatically inferred that the ID variable was the remaining column, `angle`. Another way to tell it which columns are values is to do the reverse: if you exclude the `angle` column, then `gather()` will infer that the value columns are the remaining ones, `expt` and `ctrl`.

``````gather(anthoming, condition, count, expt, ctrl)
# Prepending the column name with a '-' means it is not a value column
gather(anthoming, condition, count, -angle)``````

There are other convenient shortcuts to specify which columns are values. For example `expt:ctrl` means to select all columns between `expt` and `ctrl` (in this particular case, there are no other columns in between, but for a larger data set you can imagine how this would save typing).

By default, `gather()` will use all of the columns from the source data as either ID columns or value columns. That means that if you want to ignore some columns, you’ll need to filter them out first using the `select()` function.

For example, in the `drunk` data set, suppose we want to convert it to long format, keeping `sex` in one column and putting the numeric values in another column. This time, we want the values for only the `0-29` and `30-39` columns, and we want to discard the values for the other age ranges:

``````# Our source data
drunk
#>      sex 0-29 30-39 40-49 50-59 60+
#> 1   male  185   207   260   180  71
#> 2 female    4    13    10     7  10

# Try gather() with just 0-29 and 30-39
drunk %>%
gather(age, count, "0-29", "30-39")
#>      sex 40-49 50-59 60+   age count
#> 1   male   260   180  71  0-29   185
#> 2 female    10     7  10  0-29     4
#> 3   male   260   180  71 30-39   207
#> 4 female    10     7  10 30-39    13``````

That doesn’t look right! We told `gather()` that `0-29` and `30-39` were the value columns we wanted, and it automatically inferred that we wanted to use all of the other columns as ID columns, when we wanted to just keep `sex` and discard the others. The solution is to use `select()` to remove the unwanted columns first, and then `gather()`.

``````library(dplyr)  # For the select() function

drunk %>%
select(sex, "0-29", "30-39") %>%
gather(age, count, "0-29", "30-39")
#>      sex   age count
#> 1   male  0-29   185
#> 2 female  0-29     4
#> 3   male 30-39   207
#> 4 female 30-39    13``````

There are times where you may want to use more than one column as the ID variables:

``````plum_wide
#>   length      time dead alive
#> 1   long   at_once   84   156
#> 2   long in_spring  156    84
#> 3  short   at_once  133   107
#> 4  short in_spring  209    31

# Use length and time as the ID variables (by not naming them as value variables)
gather(plum_wide, survival, count, -length, -time)
#>   length      time survival count
#> 1   long   at_once     dead    84
#> 2   long in_spring     dead   156
#>  ...<4 more rows>...
#> 7  short   at_once    alive   107
#> 8  short in_spring    alive    31``````

Some data sets don’t come with a column with an ID variable. For example, in the `corneas` data set, each row represents one pair of measurements, but there is no ID variable. Without an ID variable, you won’t be able to tell how the values are meant to be paired together. In these cases, you can add an ID variable before using `gather()`:

``````# Make a copy of the data
co <- corneas
co\$id <- 1:nrow(co)

gather(co, "eye", "thickness", affected, notaffected)
#>    id         eye thickness
#> 1   1    affected       488
#> 2   2    affected       478
#>  ...<12 more rows>...
#> 15  7 notaffected       464
#> 16  8 notaffected       476``````

Having numeric values for the ID variable may be problematic for subsequent analyses, so you may want to convert `id` to a character vector with `as.character()`, or a factor with `factor()`.

See Recipe 15.20 to do conversions in the other direction, from long to wide.

See the `stack()` function for another way of converting from wide to long.
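As a sketch of that base R approach, `stack()` puts the value columns into a single `values` column and records each value’s source column in a factor named `ind`. Here, `anthoming_df` is a small stand-in for the `anthoming` data set, so the example doesn’t depend on gcookbook:

``````anthoming_df <- data.frame(
angle = c(-20, -10, 0, 10, 20),
expt  = c(1, 7, 2, 0, 0),
ctrl  = c(0, 3, 3, 3, 1)
)

# Stack the two value columns; repeat angle to keep the ID variable
long <- cbind(
angle = rep(anthoming_df[["angle"]], 2),
stack(anthoming_df, select = c(expt, ctrl))
)
head(long, 3)
#>   angle values  ind
#> 1   -20      1 expt
#> 2   -10      7 expt
#> 3     0      2 expt``````

Note that `stack()` gives you no control over the output column names (`values` and `ind`), which is one reason `gather()` is usually more convenient.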

## 15.20 Converting Data from Long to Wide

### 15.20.1 Problem

You want to convert a data frame from “long” format to “wide” format.

### 15.20.2 Solution

Use the `spread()` function from the tidyr package. In this example, we’ll use the `plum` data set, which is in a long format:

``````library(gcookbook) # For the data set
plum
#>   length      time survival count
#> 1   long   at_once     dead    84
#> 2   long in_spring     dead   156
#>  ...<4 more rows>...
#> 7  short   at_once    alive   107
#> 8  short in_spring    alive    31``````

The conversion to wide format takes each unique value in one column and uses those values as headers for new columns, then uses another column for source values. For example, we can “move” values in the `survival` column to the top and fill them with values from `count`:

``````library(tidyr)
spread(plum, survival, count)
#>   length      time dead alive
#> 1   long   at_once   84   156
#> 2   long in_spring  156    84
#> 3  short   at_once  133   107
#> 4  short in_spring  209    31``````

### 15.20.3 Discussion

The `spread()` function requires you to specify a key column which is used for header names, and a value column which is used to fill the values in the output data frame. It’s assumed that you want to use all the other columns as ID variables.

In the preceding example, there are two ID columns, `length` and `time`, one key column, `survival`, and one value column, `count`. What if we want to use two of the columns as keys? Suppose, for example, that we want to use `length` and `survival` as keys. This would leave us with `time` as the ID column.

The way to do this is to combine the `length` and `survival` columns together and put it in a new column, then use that new column as a key.

``````# Create a new column, length_survival, from length and survival.
plum %>%
unite(length_survival, length, survival)
#>   length_survival      time count
#> 1       long_dead   at_once    84
#> 2       long_dead in_spring   156
#>  ...<4 more rows>...
#> 7     short_alive   at_once   107
#> 8     short_alive in_spring    31

# Now pass it to spread() and use length_survival as a key
plum %>%
unite(length_survival, length, survival) %>%
spread(length_survival, count)
#>        time long_alive long_dead short_alive short_dead
#> 1   at_once        156        84         107        133
#> 2 in_spring         84       156          31        209``````

See Recipe 15.19 to do conversions in the other direction, from wide to long.

See the `unstack()` function for another way of converting from long to wide.
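As a sketch of that base R approach, `unstack()` reverses `stack()`: given a `values ~ key` formula, it spreads the values into one column per level of the key. Again, this uses a small stand-in data frame rather than `plum`:

``````long <- data.frame(
values = c(1, 7, 2, 0, 3, 3),
ind    = factor(rep(c("expt", "ctrl"), each = 3))
)

# One output column per level of ind (levels are alphabetical by default)
unstack(long, values ~ ind)
#>   ctrl expt
#> 1    0    1
#> 2    3    7
#> 3    3    2``````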

## 15.21 Converting a Time Series Object to Times and Values

### 15.21.1 Problem

You have a time series object that you wish to convert to numeric vectors representing the time and values at each time.

### 15.21.2 Solution

Use the `time()` function to get the time for each observation, then convert the times and values to numeric vectors with `as.numeric()`:

``````# Look at nhtemp Time Series object
nhtemp
#> Time Series:
#> Start = 1912
#> End = 1971
#>  ...
#> [31] 51.0 50.6 51.7 51.5 52.1 51.3 51.0 54.0 51.4 52.7 53.1 54.6 52.0 52.0 50.9
#> [46] 52.6 50.2 52.6 51.6 51.9 50.5 50.9 51.7 51.4 51.7 50.8 51.9 51.8 51.9 53.0

# Get times for each observation
as.numeric(time(nhtemp))
#>  [1] 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926
#> [16] 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
#> [31] 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956
#> [46] 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971

# Get value of each observation
as.numeric(nhtemp)
#>  [1] 49.9 52.3 49.4 51.1 49.4 47.9 49.8 50.9 49.3 51.9 50.8 49.6 49.3 50.6 48.4
#> [16] 50.7 50.9 50.6 51.5 52.8 51.8 51.1 49.8 50.2 50.4 51.6 51.8 50.9 48.8 51.7
#> [31] 51.0 50.6 51.7 51.5 52.1 51.3 51.0 54.0 51.4 52.7 53.1 54.6 52.0 52.0 50.9
#> [46] 52.6 50.2 52.6 51.6 51.9 50.5 50.9 51.7 51.4 51.7 50.8 51.9 51.8 51.9 53.0
# Put them in a data frame
nht <- data.frame(year = as.numeric(time(nhtemp)), temp = as.numeric(nhtemp))
nht
#>    year temp
#> 1  1912 49.9
#> 2  1913 52.3
#>  ...<56 more rows>...
#> 59 1970 51.9
#> 60 1971 53.0``````

### 15.21.3 Discussion

Time series objects efficiently store information when there are observations at regular time intervals, but for use with ggplot, they need to be converted to a format that separately represents times and values for each observation.
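The pattern is the same for any `ts` object. Here is a minimal sketch using a small hand-built annual series (not one of R’s data sets):

```r
# A small annual time series starting in 2000
x <- ts(c(10, 12, 11, 13), start = 2000, frequency = 1)

# time() returns the observation times; as.numeric() drops the ts class
data.frame(time = as.numeric(time(x)), value = as.numeric(x))
#>   time value
#> 1 2000    10
#> 2 2001    12
#> 3 2002    11
#> 4 2003    13
```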

Some time series objects are cyclical. The `presidents` data set, for example, contains four observations per year, one for each quarter:

``````presidents
#>      Qtr1 Qtr2 Qtr3 Qtr4
#> 1945   NA   87   82   75
#> 1946   63   50   43   32
#>  ...
#> 1973   68   44   40   27
#> 1974   28   25   24   24``````

To convert it to a two-column data frame, in which one column represents the year with fractional values, we can do the same as before:

``````pres_rating <- data.frame(
  year = as.numeric(time(presidents)),
  rating = as.numeric(presidents)
)
pres_rating
#>        year rating
#> 1   1945.00     NA
#> 2   1945.25     87
#>  ...<116 more rows>...
#> 119 1974.50     24
#> 120 1974.75     24``````

It is also possible to store the year and quarter in separate columns, which may be useful in some visualizations:

``````pres_rating2 <- data.frame(
  year = as.numeric(floor(time(presidents))),
  quarter = as.numeric(cycle(presidents)),
  rating = as.numeric(presidents)
)
pres_rating2
#>     year quarter rating
#> 1   1945       1     NA
#> 2   1945       2     87
#>  ...<116 more rows>...
#> 119 1974       3     24
#> 120 1974       4     24``````
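As a quick consistency check (a sketch of my own, not part of the recipe), the fractional year used in `pres_rating` can be recovered from these two columns, since each quarter adds 0.25 to the year:

```r
pres_rating2 <- data.frame(
  year    = as.numeric(floor(time(presidents))),
  quarter = as.numeric(cycle(presidents)),
  rating  = as.numeric(presidents)
)

# quarter 1 -> .00, quarter 2 -> .25, and so on
frac_year <- pres_rating2$year + (pres_rating2$quarter - 1) / 4
all.equal(frac_year, as.numeric(time(presidents)))
#> [1] TRUE
```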