# Chapter 15 Getting Your Data into Shape

When it comes to making data graphics, half the battle occurs before you call any plotting commands. Before you pass your data to the plotting functions, it must first be read in and given the correct structure. The data sets provided with R are ready to use, but when dealing with real-world data, this usually isn’t the case: you’ll have to clean up and restructure the data before you can visualize it.

The recipes in this chapter will often use packages from the *tidyverse*. For a little background about the tidyverse, see the introduction section of Chapter 1. I will also show how to do many of the same tasks using base R, because in some situations it is important to minimize the number of packages you use, and because it is useful to be able to understand code written for base R.

Note: The `%>%` symbol, also known as the pipe operator, is used extensively in this chapter. If you are not familiar with it, see Recipe 1.7.

Most of the tidyverse functions used in this chapter are from the dplyr package, and I’ll assume that dplyr is already loaded. You can load it with `library(tidyverse)`, or, if you want to keep things more streamlined, you can load dplyr directly:
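For example:

```r
library(dplyr)
```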

Data sets in R are most often stored in data frames. They’re typically used as two-dimensional data structures, with each row representing one case and each column representing one variable. Data frames are essentially lists of vectors and factors, all of the same length, where each vector or factor represents one column.

Here’s the `heightweight` data set:

```
library(gcookbook) # Load gcookbook for the heightweight data set
heightweight
#> sex ageYear ageMonth heightIn weightLb
#> 1 f 11.92 143 56.3 85.0
#> 2 f 12.92 155 62.3 105.0
#> ...<232 more rows>...
#> 236 m 13.92 167 62.0 107.5
#> 237 m 12.58 151 59.3 87.0
```

It consists of five columns, with each row representing one case: a set of information about a single person. We can get a clearer idea of how it’s structured by using the `str()` function:

```
str(heightweight)
#> 'data.frame': 236 obs. of 5 variables:
#> $ sex : Factor w/ 2 levels "f","m": 1 1 1 1 1 1 1 1 1 1 ...
#> $ ageYear : num 11.9 12.9 12.8 13.4 15.9 ...
#> $ ageMonth: int 143 155 153 161 191 171 185 142 160 140 ...
#> $ heightIn: num 56.3 62.3 63.3 59 62.5 62.5 59 56.5 62 53.8 ...
#> $ weightLb: num 85 105 108 92 112 ...
```

The first column, `sex`, is a factor with two levels, `"f"` and `"m"`, and the other four columns are vectors of numbers (one of them, `ageMonth`, is specifically a vector of integers, but for the purposes here, it behaves the same as any other numeric vector).

Factors and character vectors behave similarly in ggplot – the main difference is that with character vectors, items will be displayed in lexicographical order, but with factors, items will be displayed in the same order as the factor levels, which you can control.

## 15.1 Creating a Data Frame

### 15.1.1 Problem

You want to create a data frame from vectors.

### 15.1.2 Solution

You can put vectors together in a data frame with `data.frame()`:
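For example, with a character vector `g` and an integer vector `x` (the same two vectors are used with `data_frame()` below):

```r
g <- c("A", "B", "C")
x <- 1:3

data.frame(g, x)
#>   g x
#> 1 A 1
#> 2 B 2
#> 3 C 3
```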

### 15.1.3 Discussion

A data frame is essentially a list of vectors and factors. Each vector or factor can be thought of as a column in the data frame.

If your vectors are in a list, you can convert the list to a data frame with the `as.data.frame()` function:
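For example:

```r
lst <- list(g = c("A", "B", "C"), x = 1:3)
as.data.frame(lst)
#>   g x
#> 1 A 1
#> 2 B 2
#> 3 C 3
```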

The tidyverse way of creating a data frame is to use `data_frame()` or `as_data_frame()` (note the underscores instead of periods), though these names are now deprecated in favor of `tibble()` and `as_tibble()`. This returns a special kind of data frame – a *tibble* – which behaves like a regular data frame in most contexts, but prints out more nicely and is specifically designed to play well with the tidyverse functions.

```
data_frame(g, x)
#> Warning: `data_frame()` is deprecated, use `tibble()`.
#> This warning is displayed once per session.
#> # A tibble: 3 x 2
#> g x
#> <chr> <int>
#> 1 A 1
#> 2 B 2
#> 3 C 3
```

A regular data frame can be converted to a tibble using `as_tibble()`:
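For example:

```r
library(tibble)
dat <- data.frame(g = c("A", "B", "C"), x = 1:3)
as_tibble(dat)
```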

## 15.2 Getting Information About a Data Structure

### 15.2.1 Problem

You want to find out information about an object or data structure.

### 15.2.2 Solution

Use the `str()` function:

```
str(ToothGrowth)
#> 'data.frame': 60 obs. of 3 variables:
#> $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
#> $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
#> $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
```

This tells us that `ToothGrowth` is a data frame with three columns, `len`, `supp`, and `dose`. `len` and `dose` contain numeric values, while `supp` is a factor with two levels.

Another useful function is `summary()`:
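For example:

```r
summary(ToothGrowth)
```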

Instead of showing you the first few values of each column as `str()` does, `summary()` provides basic descriptive statistics (the minimum, maximum, median, mean, and first and third quartile values) for numeric variables, and tells you the number of values corresponding to each character value or factor level if it is a character or factor variable.

### 15.2.3 Discussion

The `str()` function is very useful for finding out more about data structures. One common source of problems is a data frame where one of the columns is a character vector instead of a factor, or vice versa. This can cause puzzling issues with analyses or graphs.

When you print out a data frame the normal way, by just typing the name at the prompt and pressing Enter, factor and character columns appear exactly the same. The difference will be revealed only when you run `str()` on the data frame, or print out the column by itself:

```
tg <- ToothGrowth
tg$supp <- as.character(tg$supp)
str(tg)
#> 'data.frame': 60 obs. of 3 variables:
#> $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
#> $ supp: chr "VC" "VC" "VC" "VC" ...
#> $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
```

```
# Print out the columns by themselves
# From old data frame (factor)
ToothGrowth$supp
#> [1] VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC
#> [26] VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
#> [51] OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
#> Levels: OJ VC
# From new data frame (character)
tg$supp
#> [1] "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC"
#> [16] "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC"
#> [31] "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ"
#> [46] "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ"
```

## 15.3 Adding a Column to a Data Frame

### 15.3.1 Problem

You want to add a column to a data frame.

### 15.3.2 Solution

Use `mutate()` from dplyr to add a new column and assign values to it. This returns a new data frame, which you’ll typically want to save over the original.

If you assign a single value to the new column, the entire column will be filled with that value. This adds a column named `newcol`, filled with `NA`:

```
library(dplyr)
ToothGrowth %>%
  mutate(newcol = NA)
#> len supp dose newcol
#> 1 4.2 VC 0.5 NA
#> 2 11.5 VC 0.5 NA
#> ...<56 more rows>...
#> 59 29.4 OJ 2.0 NA
#> 60 23.0 OJ 2.0 NA
```

You can also assign a vector to the new column:

```
# Since ToothGrowth has 60 rows, we must create a new vector with 60 elements
vec <- rep(c(1, 2), 30)
ToothGrowth %>%
  mutate(newcol = vec)
```

Note that the vector being added to the data frame must either have one element, or the same number of elements as the data frame has rows. In the example above, we created a new vector with 60 elements by repeating the values `c(1, 2)` thirty times.

### 15.3.3 Discussion

Each column of a data frame is a vector. R handles columns in data frames slightly differently from standalone vectors because all the columns in a data frame must have the same length.

To add a column using base R, you can simply assign values into the new column like so:

```
# Make a copy of ToothGrowth for this example
ToothGrowth2 <- ToothGrowth
# Assign NA's for the whole column
ToothGrowth2$newcol <- NA
# Assign 1 and 2, automatically repeating to fill
ToothGrowth2$newcol <- c(1, 2)
```

With base R, the vector being assigned into the data frame will automatically be repeated to fill the number of rows in the data frame.

## 15.4 Deleting a Column from a Data Frame

### 15.4.1 Problem

You want to delete a column from a data frame.

### 15.4.2 Solution

Use `select()` from dplyr and specify the columns you want to drop by using `-` (a minus sign). This returns a new data frame, which you’ll typically want to save over the original.
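For example, to drop the `len` column:

```r
library(dplyr)
ToothGrowth %>%
  select(-len)
```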

### 15.4.3 Discussion

You can list multiple columns that you want to drop at the same time, or conversely specify only the columns that you want to keep. The following two pieces of code are thus equivalent:

```
# Remove both len and supp from ToothGrowth
ToothGrowth %>%
  select(-len, -supp)
#> dose
#> 1 0.5
#> 2 0.5
#> ...<56 more rows>...
#> 59 2.0
#> 60 2.0
# This keeps just dose, which has the same effect for this data set
ToothGrowth %>%
  select(dose)
#> dose
#> 1 0.5
#> 2 0.5
#> ...<56 more rows>...
#> 59 2.0
#> 60 2.0
```

To remove a column using base R, you can simply assign `NULL` to that column.
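For example:

```r
# Make a copy of ToothGrowth for this example
ToothGrowth2 <- ToothGrowth
ToothGrowth2$len <- NULL
```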

### 15.4.4 See Also

See Recipe 15.7 for more on getting a subset of a data frame.

See `?select` for more ways to drop and keep columns.

## 15.5 Renaming Columns in a Data Frame

### 15.5.1 Problem

You want to rename the columns in a data frame.

### 15.5.2 Solution

Use `rename()` from dplyr. This returns a new data frame, which you’ll typically want to save over the original.

### 15.5.3 Discussion

You can rename multiple columns within the same call to `rename()`:

```
ToothGrowth %>%
  rename(
    length = len,
    supplement_type = supp
  )
#> length supplement_type dose
#> 1 4.2 VC 0.5
#> 2 11.5 VC 0.5
#> ...<56 more rows>...
#> 59 29.4 OJ 2.0
#> 60 23.0 OJ 2.0
```

Renaming a column using base R is a bit more verbose. It uses the `names()` function on the left side of the `<-` operator.
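For example, to rename `len` to `length` (working on a copy so the original data set is unchanged):

```r
tg2 <- ToothGrowth
names(tg2)[names(tg2) == "len"] <- "length"
```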

### 15.5.4 See Also

See `?select` for more ways to rename columns within a data frame.

## 15.6 Reordering Columns in a Data Frame

### 15.6.1 Problem

You want to change the order of columns in a data frame.

### 15.6.2 Solution

Use `select()` from dplyr:

```
ToothGrowth %>%
  select(dose, len, supp)
#> dose len supp
#> 1 0.5 4.2 VC
#> 2 0.5 11.5 VC
#> ...<56 more rows>...
#> 59 2.0 29.4 OJ
#> 60 2.0 23.0 OJ
```

The new data frame will contain the columns you specified in `select()`, in the order you specified. Note that `select()` returns a new data frame, so if you want to change the original variable, you’ll need to save the new result over it.

### 15.6.3 Discussion

If you are only reordering a few variables and want to keep the rest of the variables in order, you can use `everything()` as a placeholder:

```
ToothGrowth %>%
  select(dose, everything())
#> dose len supp
#> 1 0.5 4.2 VC
#> 2 0.5 11.5 VC
#> ...<56 more rows>...
#> 59 2.0 29.4 OJ
#> 60 2.0 23.0 OJ
```

See `?select_helpers` for other ways to select columns. You can, for example, select columns by matching parts of the name.

Using base R, you can also reorder columns by their name or numeric position. This returns a new data frame, which can be saved over the original.
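For example:

```r
ToothGrowth[c("dose", "len", "supp")]  # Reorder by name
ToothGrowth[c(3, 1, 2)]                # Reorder by numeric position
```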

In these examples, I used list-style indexing. A data frame is essentially a list of vectors, and indexing into it as a list will return another data frame. You can get the same effect with matrix-style indexing:

```
ToothGrowth[c("dose", "len", "supp")] # List-style indexing
ToothGrowth[, c("dose", "len", "supp")] # Matrix-style indexing
```

In this case, both methods return the same result, a data frame. However, when retrieving a single column, list-style indexing will return a data frame, while matrix-style indexing will return a vector:

```
ToothGrowth["dose"]
#> dose
#> 1 0.5
#> 2 0.5
#> ...<56 more rows>...
#> 59 2.0
#> 60 2.0
ToothGrowth[, "dose"]
#> [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
#> [20] 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
#> [39] 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
#> [58] 2.0 2.0 2.0
```

You can use `drop = FALSE` to ensure that it returns a data frame:
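For example:

```r
ToothGrowth[, "dose", drop = FALSE]
```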

## 15.7 Getting a Subset of a Data Frame

### 15.7.1 Problem

You want to get a subset of a data frame.

### 15.7.2 Solution

Use `filter()` to get the rows, and `select()` to get the columns you want. These operations can be chained together using the `%>%` operator. These functions return a new data frame, so if you want to change the original variable, you’ll need to save the new result over it.

We’ll use the `climate` data set for the examples here:

```
library(gcookbook) # Load gcookbook for the climate data set
climate
#> Source Year Anomaly1y Anomaly5y Anomaly10y Unc10y
#> 1 Berkeley 1800 NA NA -0.435 0.505
#> 2 Berkeley 1801 NA NA -0.453 0.493
#> ...<495 more rows>...
#> 498 CRUTEM3 2010 0.8023 NA NA NA
#> 499 CRUTEM3 2011 0.6193 NA NA NA
```

Let’s say that we only want to keep rows where `Source` is `"Berkeley"` and where the year is between 1900 and 2000, inclusive. You can do so with the `filter()` function:
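For example:

```r
library(dplyr)
climate %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000)
```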

If you want only the `Year` and `Anomaly10y` columns, use `select()`, as we did in Recipe 15.4:

```
climate %>%
  select(Year, Anomaly10y)
#> Year Anomaly10y
#> 1 1800 -0.435
#> 2 1801 -0.453
#> ...<495 more rows>...
#> 498 2010 NA
#> 499 2011 NA
```

These operations can be chained together using the `%>%` operator:
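For example, filtering the rows and then selecting the columns in a single chain:

```r
climate %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000) %>%
  select(Year, Anomaly10y)
```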

### 15.7.3 Discussion

The `filter()` function picks out rows based on a condition. If you want to pick out rows based on their numeric position, use the `slice()` function:
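For example, to get the first 100 rows:

```r
climate %>%
  slice(1:100)
```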

I generally recommend indexing using names rather than numbers when possible. It makes the code easier to understand when you’re collaborating with others or when you come back to it months or years after writing it, and it makes the code less likely to break when there are changes to the data, such as when columns are added or removed.

With base R, you can get a subset of rows like this:

```
climate[climate$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000, ]
#> Source Year Anomaly1y Anomaly5y Anomaly10y Unc10y
#> 101 Berkeley 1900 NA NA -0.171 0.108
#> 102 Berkeley 1901 NA NA -0.162 0.109
#> ...<97 more rows>...
#> 200 Berkeley 1999 NA NA 0.734 0.025
#> 201 Berkeley 2000 NA NA 0.748 0.026
```

Notice that we needed to prefix each column name with `climate$`, and that there’s a comma after the selection criteria. This indicates that we’re getting rows, not columns.

This row filtering can also be combined with the column selection from Recipe 15.4:
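For example:

```r
climate[climate$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000,
        c("Year", "Anomaly10y")]
```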

## 15.8 Changing the Order of Factor Levels

### 15.8.1 Problem

You want to change the order of levels in a factor.

### 15.8.2 Solution

Pass the factor to `factor()`, and give it the levels in the order you want. This returns a new factor, so if you want to change the original variable, you’ll need to save the new result over it.

```
# By default, levels are ordered alphabetically
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes
#> [1] small large large small medium
#> Levels: large medium small
factor(sizes, levels = c("small", "medium", "large"))
#> [1] small large large small medium
#> Levels: small medium large
```

The order can also be specified with `levels` when the factor is first created:
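For example:

```r
factor(c("small", "large", "large", "small", "medium"),
       levels = c("small", "medium", "large"))
#> [1] small large large small medium
#> Levels: small medium large
```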

### 15.8.3 Discussion

There are two kinds of factors in R: ordered factors and regular factors. (In practice, ordered factors are not commonly used.) In both types, the levels are arranged in *some* order; the difference is that the order is meaningful for an ordered factor, but it is arbitrary for a regular factor – it simply reflects how the data is stored. For plotting data, the distinction between ordered and regular factors is generally unimportant, and they can be treated the same.

The order of factor levels affects graphical output. When a factor variable is mapped to an aesthetic property in ggplot, the aesthetic adopts the ordering of the factor levels. If a factor is mapped to the x-axis, the ticks on the axis will be in the order of the factor levels, and if a factor is mapped to color, the items in the legend will be in the order of the factor levels.

To reverse the level order, you can use `rev(levels())`:
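For example:

```r
sizes <- factor(c("small", "large", "large", "small", "medium"))
factor(sizes, levels = rev(levels(sizes)))
#> [1] small large large small medium
#> Levels: small medium large
```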

The tidyverse function for reordering factors is `fct_relevel()` from the forcats package. It has a syntax similar to the `factor()` function from base R.
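For example:

```r
library(forcats)
sizes <- factor(c("small", "large", "large", "small", "medium"))
fct_relevel(sizes, "small", "medium", "large")
#> [1] small large large small medium
#> Levels: small medium large
```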

## 15.9 Changing the Order of Factor Levels Based on Data Values

### 15.9.1 Problem

You want to change the order of levels in a factor based on values in the data.

### 15.9.2 Solution

Use `reorder()` with the factor that has levels to reorder, the values to base the reordering on, and a function that aggregates the values:

```
# Make a copy of the InsectSprays data set since we're modifying it
iss <- InsectSprays
iss$spray
#> [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C C C D D
#> [39] D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F F F F F
#> Levels: A B C D E F
iss$spray <- reorder(iss$spray, iss$count, FUN = mean)
iss$spray
#> [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C C C D D
#> [39] D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F F F F F
#> attr(,"scores")
#> A B C D E F
#> 14.500000 15.333333 2.083333 4.916667 3.500000 16.666667
#> Levels: C E D A B F
```

Notice that the original levels were `ABCDEF`, while the reordered levels are `CEDABF`. What we’ve done is reorder the levels of `spray` based on the mean value of `count` for each level of `spray`.

### 15.9.3 Discussion

The usefulness of `reorder()` might not be obvious from just looking at the raw output. Figure 15.1 shows three plots made with `reorder()`. In these plots, the order in which the items appear is determined by their values.

In the middle plot in Figure 15.1, the boxes are sorted by the mean. The horizontal line that runs across each box represents the *median* of the data. Notice that these values do not increase strictly from left to right. That’s because with this particular data set, sorting by the mean gives a different order than sorting by the median. To make the median lines increase from left to right, as in the plot on the right in Figure 15.1, we used the `median()` function in `reorder()`.

The tidyverse function for reordering factors is `fct_reorder()`, and it is used the same way as `reorder()`. These do the same thing:
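For example, both of these reorder the levels of `spray` by the mean of `count`:

```r
library(forcats)
reorder(iss$spray, iss$count, FUN = mean)
fct_reorder(iss$spray, iss$count, .fun = mean)
```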

## 15.10 Changing the Names of Factor Levels

### 15.10.1 Problem

You want to change the names of levels in a factor.

### 15.10.2 Solution

Use `fct_recode()` from the forcats package:
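For example:

```r
library(forcats)
sizes <- factor(c("small", "large", "large", "small", "medium"))
fct_recode(sizes, S = "small", M = "medium", L = "large")
#> [1] S L L S M
#> Levels: L M S
```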

### 15.10.3 Discussion

If you want to use two vectors, one with the original levels and one with the new ones, use `do.call()` with `fct_recode()`:

```
old <- c("small", "medium", "large")
new <- c("S", "M", "L")
# Create a named vector that has the mappings between old and new
mappings <- setNames(old, new)
mappings
#> S M L
#> "small" "medium" "large"
# Create a list of the arguments to pass to fct_recode
args <- c(list(sizes), mappings)
# Look at the structure of the list
str(args)
#> List of 4
#> $ : Factor w/ 3 levels "large","medium",..: 3 1 1 3 2
#> $ S: chr "small"
#> $ M: chr "medium"
#> $ L: chr "large"
# Use do.call to call fct_recode with the arguments
do.call(fct_recode, args)
#> [1] S L L S M
#> Levels: L M S
```

Or, more concisely, we can do all of that in one go:

```
do.call(
  fct_recode,
  c(list(sizes), setNames(c("small", "medium", "large"), c("S", "M", "L")))
)
#> [1] S L L S M
#> Levels: L M S
```

For a more traditional (and clunky) base R method for renaming factor levels, use the `levels()<-` function:

```
sizes <- factor(c("small", "large", "large", "small", "medium"))
# Index into the levels and rename each one
levels(sizes)[levels(sizes) == "large"] <- "L"
levels(sizes)[levels(sizes) == "medium"] <- "M"
levels(sizes)[levels(sizes) == "small"] <- "S"
sizes
#> [1] S L L S M
#> Levels: L M S
```

If you are renaming *all* your factor levels, there is a simpler method: you can pass a list to `levels()<-`:

```
sizes <- factor(c("small", "large", "large", "small", "medium"))
levels(sizes) <- list(S = "small", M = "medium", L = "large")
sizes
#> [1] S L L S M
#> Levels: S M L
```

With this method, all factor levels must be specified in the list; if any are missing, they will be replaced with `NA`.

It’s also possible to rename factor levels by position, but this is somewhat inelegant:

```
sizes <- factor(c("small", "large", "large", "small", "medium"))
levels(sizes)[1] <- "L"
sizes
#> [1] small L L small medium
#> Levels: L medium small
# Rename all levels at once
levels(sizes) <- c("L", "M", "S")
sizes
#> [1] S L L S M
#> Levels: L M S
```

It’s safer to rename factor levels by name rather than by position, since you will be less likely to make a mistake (and mistakes here may be hard to detect). Also, if your input data set changes to have more or fewer levels, the numeric positions of the existing levels could change, which could cause serious but nonobvious problems for your analysis.

### 15.10.4 See Also

If, instead of a factor, you have a character vector with items to rename, see Recipe 15.12.

## 15.11 Removing Unused Levels from a Factor

### 15.11.1 Problem

You want to remove unused levels from a factor.

### 15.11.2 Solution

Sometimes, after processing your data you will have a factor that contains levels that are no longer used. Here’s an example:

```
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes <- sizes[1:3]
sizes
#> [1] small large large
#> Levels: large medium small
```

To remove them, use `droplevels()`:
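For example, with the `sizes` factor from above:

```r
droplevels(sizes)
#> [1] small large large
#> Levels: large small
```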

### 15.11.3 Discussion

The `droplevels()` function preserves the order of factor levels. You can use the `except` parameter to keep particular levels.

The tidyverse way: use `fct_drop()` from the forcats package:
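For example:

```r
library(forcats)
fct_drop(sizes)
#> [1] small large large
#> Levels: large small
```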

## 15.12 Changing the Names of Items in a Character Vector

### 15.12.1 Problem

You want to change the names of items in a character vector.

### 15.12.2 Solution

Use `recode()` from the dplyr package:

```
library(dplyr)
sizes <- c("small", "large", "large", "small", "medium")
sizes
#> [1] "small" "large" "large" "small" "medium"
# With recode(), pass it a named vector with the mappings
recode(sizes, small = "S", medium = "M", large = "L")
#> [1] "S" "L" "L" "S" "M"
# Can also use quotes -- useful if there are spaces or other strange characters
recode(sizes, "small" = "S", "medium" = "M", "large" = "L")
#> [1] "S" "L" "L" "S" "M"
```

### 15.12.3 Discussion

If you want to use two vectors, one with the original values and one with the new ones, use `do.call()` with `recode()`:

```
old <- c("small", "medium", "large")
new <- c("S", "M", "L")
# Create a named vector that has the mappings between old and new
mappings <- setNames(new, old)
mappings
#> small medium large
#> "S" "M" "L"
# Create a list of the arguments to pass to recode
args <- c(list(sizes), mappings)
# Look at the structure of the list
str(args)
#> List of 4
#> $ : chr [1:5] "small" "large" "large" "small" ...
#> $ small : chr "S"
#> $ medium: chr "M"
#> $ large : chr "L"
# Use do.call to call recode with the arguments
do.call(recode, args)
#> [1] "S" "L" "L" "S" "M"
```

Or, more concisely, we can do all of that in one go:

```
do.call(
  recode,
  c(list(sizes), setNames(c("S", "M", "L"), c("small", "medium", "large")))
)
#> [1] "S" "L" "L" "S" "M"
```

Note that for `recode()`, the name and value of the arguments are reversed, compared to the `fct_recode()` function from the forcats package. With `recode()`, you would use `small = "S"`, whereas for `fct_recode()`, you would use `S = "small"`.

A more traditional R method is to use square-bracket indexing to select the items and rename them:
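For example:

```r
sizes <- c("small", "large", "large", "small", "medium")
sizes[sizes == "small"] <- "S"
sizes[sizes == "medium"] <- "M"
sizes[sizes == "large"] <- "L"
sizes
#> [1] "S" "L" "L" "S" "M"
```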

### 15.12.4 See Also

If, instead of a character vector, you have a factor with levels to rename, see Recipe 15.10.

## 15.13 Recoding a Categorical Variable to Another Categorical Variable

### 15.13.1 Problem

You want to recode a categorical variable to another variable.

### 15.13.2 Solution

For the examples here, we’ll use a subset of the `PlantGrowth` data set:

```
# Work on a subset of the PlantGrowth data set
pg <- PlantGrowth[c(1,2,11,21,22), ]
pg
#> weight group
#> 1 4.17 ctrl
#> 2 5.58 ctrl
#> 11 4.81 trt1
#> 21 6.31 trt2
#> 22 5.12 trt2
```

In this example, we’ll recode the categorical variable `group` into another categorical variable, `treatment`. If the old value was `"ctrl"`, the new value will be `"No"`, and if the old value was `"trt1"` or `"trt2"`, the new value will be `"Yes"`.

This can be done with the `recode()` function from the dplyr package:

```
library(dplyr)
recode(pg$group, ctrl = "No", trt1 = "Yes", trt2 = "Yes")
#> [1] No No Yes Yes Yes
#> Levels: No Yes
```

You can assign it as a new column in the data frame:
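For example:

```r
pg$treatment <- recode(pg$group, ctrl = "No", trt1 = "Yes", trt2 = "Yes")
```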

Note that since the input was a factor, it returns a factor. If you want to get a character vector instead, use `as.character()`:
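For example:

```r
recode(as.character(pg$group), ctrl = "No", trt1 = "Yes", trt2 = "Yes")
#> [1] "No"  "No"  "Yes" "Yes" "Yes"
```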

### 15.13.3 Discussion

You can also use the `fct_recode()` function from the forcats package. It works the same, except the names and values are swapped, which may be a little more intuitive:

```
library(forcats)
fct_recode(pg$group, No = "ctrl", Yes = "trt1", Yes = "trt2")
#> [1] No No Yes Yes Yes
#> Levels: No Yes
```

Another difference is that `fct_recode()` will always return a factor, whereas `recode()` will return a character vector if it is given a character vector, and will return a factor if it is given a factor. (Although dplyr does have a `recode_factor()` function which also always returns a factor.)

Using base R, recoding can be done with the `match()` function:

```
oldvals <- c("ctrl", "trt1", "trt2")
newvals <- factor(c("No", "Yes", "Yes"))
newvals[ match(pg$group, oldvals) ]
#> [1] No No Yes Yes Yes
#> Levels: No Yes
```

It can also be done by indexing in the vectors:

```
pg$treatment[pg$group == "ctrl"] <- "No"
pg$treatment[pg$group == "trt1"] <- "Yes"
pg$treatment[pg$group == "trt2"] <- "Yes"
# Convert to a factor
pg$treatment <- factor(pg$treatment)
pg
#> weight group treatment
#> 1 4.17 ctrl No
#> 2 5.58 ctrl No
#> 11 4.81 trt1 Yes
#> 21 6.31 trt2 Yes
#> 22 5.12 trt2 Yes
```

Here, we combined two of the factor levels and put the result into a new column. If you simply want to rename the levels of a factor, see Recipe 15.10.

The coding criteria can also be based on values in multiple columns, by using the `&` and `|` operators:

```
pg$newcol[pg$group == "ctrl" & pg$weight < 5] <- "no_small"
pg$newcol[pg$group == "ctrl" & pg$weight >= 5] <- "no_large"
pg$newcol[pg$group == "trt1"] <- "yes"
pg$newcol[pg$group == "trt2"] <- "yes"
pg$newcol <- factor(pg$newcol)
pg
#> weight group newcol
#> 1 4.17 ctrl no_small
#> 2 5.58 ctrl no_large
#> 11 4.81 trt1 yes
#> 21 6.31 trt2 yes
#> 22 5.12 trt2 yes
```

It’s also possible to combine two columns into one using the `interaction()` function, which appends the values with a `.` in between. This combines the `weight` and `group` columns into a new column, `weightgroup`:
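For example:

```r
pg$weightgroup <- interaction(pg$weight, pg$group)
```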

## 15.14 Recoding a Continuous Variable to a Categorical Variable

### 15.14.1 Problem

You want to recode a continuous variable to a categorical variable.

### 15.14.2 Solution

Use the `cut()` function. In this example, we’ll use the `PlantGrowth` data set and recode the continuous variable `weight` into a categorical variable, `wtclass`:
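For example, using the same breaks as in the Discussion:

```r
pg <- PlantGrowth
pg$wtclass <- cut(pg$weight, breaks = c(0, 5, 6, Inf))
```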

### 15.14.3 Discussion

For three categories we specify four bounds, which can include `Inf` and `-Inf`. If a data value falls outside of the specified bounds, it’s categorized as `NA`. The result of `cut()` is a factor, and you can see from the example that the factor levels are named after the bounds.

To change the names of the levels, set the labels:

```
pg$wtclass <- cut(pg$weight, breaks = c(0, 5, 6, Inf),
                  labels = c("small", "medium", "large"))
pg
#> weight group wtclass
#> 1 4.17 ctrl small
#> 2 5.58 ctrl medium
#> ...<26 more rows>...
#> 29 5.80 trt2 medium
#> 30 5.26 trt2 medium
```

As indicated by the factor levels, the bounds are by default *open* on the left and *closed* on the right. In other words, they don’t include the lowest value, but they do include the highest value. For the smallest category, you can have it include both the lower and upper values by setting `include.lowest=TRUE`. In this example, this would result in 0 values going into the small category; otherwise, 0 would be coded as `NA`.

If you want the categories to be closed on the left and open on the right, set `right = FALSE`:

```
cut(pg$weight, breaks = c(0, 5, 6, Inf), right = FALSE)
#> [1] [0,5) [5,6) [5,6) [6,Inf) [0,5) [0,5) [5,6) [0,5) [5,6)
#> [10] [5,6) [0,5) [0,5) [0,5) [0,5) [5,6) [0,5) [6,Inf) [0,5)
#> [19] [0,5) [0,5) [6,Inf) [5,6) [5,6) [5,6) [5,6) [5,6) [0,5)
#> [28] [6,Inf) [5,6) [5,6)
#> Levels: [0,5) [5,6) [6,Inf)
```

### 15.14.4 See Also

To recode a categorical variable to another categorical variable, see Recipe 15.13.

## 15.15 Calculating New Columns From Existing Columns

### 15.15.1 Problem

You want to calculate a new column of values in a data frame.

### 15.15.2 Solution

Use `mutate()` from the dplyr package.

```
library(gcookbook) # Load gcookbook for the heightweight data set
heightweight
#> sex ageYear ageMonth heightIn weightLb
#> 1 f 11.92 143 56.3 85.0
#> 2 f 12.92 155 62.3 105.0
#> ...<232 more rows>...
#> 236 m 13.92 167 62.0 107.5
#> 237 m 12.58 151 59.3 87.0
```

This will convert `heightIn` to centimeters and store it in a new column, `heightCm`:

```
library(dplyr)
heightweight %>%
  mutate(heightCm = heightIn * 2.54)
#> sex ageYear ageMonth heightIn weightLb heightCm
#> 1 f 11.92 143 56.3 85.0 143.002
#> 2 f 12.92 155 62.3 105.0 158.242
#> ...<232 more rows>...
#> 235 m 13.92 167 62.0 107.5 157.480
#> 236 m 12.58 151 59.3 87.0 150.622
```

This returns a new data frame, so if you want to replace the original variable, you will need to save the result over it.

### 15.15.3 Discussion

You can use `mutate()` to transform multiple columns at once:

```
heightweight %>%
  mutate(
    heightCm = heightIn * 2.54,
    weightKg = weightLb / 2.204
  )
#> sex ageYear ageMonth heightIn weightLb heightCm weightKg
#> 1 f 11.92 143 56.3 85.0 143.002 38.56624
#> 2 f 12.92 155 62.3 105.0 158.242 47.64065
#> ...<232 more rows>...
#> 235 m 13.92 167 62.0 107.5 157.480 48.77495
#> 236 m 12.58 151 59.3 87.0 150.622 39.47368
```

It is also possible to calculate a new column based on multiple columns:
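For example, computing a BMI-like quantity directly from both `heightIn` and `weightLb`:

```r
heightweight %>%
  mutate(bmi = (weightLb / 2.204) / ((heightIn * 2.54) / 100)^2)
```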

With `mutate()`, the columns are added sequentially. That means that we can reference a newly created column when calculating a new column:

```
heightweight %>%
  mutate(
    heightCm = heightIn * 2.54,
    weightKg = weightLb / 2.204,
    bmi = weightKg / (heightCm / 100)^2
  )
#> sex ageYear ageMonth heightIn weightLb heightCm weightKg bmi
#> 1 f 11.92 143 56.3 85.0 143.002 38.56624 18.85919
#> 2 f 12.92 155 62.3 105.0 158.242 47.64065 19.02542
#> ...<232 more rows>...
#> 235 m 13.92 167 62.0 107.5 157.480 48.77495 19.66736
#> 236 m 12.58 151 59.3 87.0 150.622 39.47368 17.39926
```

With base R, calculating a new column can be done by referencing the new column with the `$` operator and assigning some values to it:
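For example:

```r
heightweight$heightCm <- heightweight$heightIn * 2.54
```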

### 15.15.4 See Also

See Recipe 15.16 for how to perform group-wise transformations on data.

## 15.16 Calculating New Columns by Groups

### 15.16.1 Problem

You want to create new columns that are the result of calculations performed on groups of data, as specified by a grouping column.

### 15.16.2 Solution

Use `group_by()` from the dplyr package to specify the grouping variable, and then specify the operations in `mutate()`:

```
library(MASS) # Load MASS for the cabbages data set
library(dplyr)
cabbages %>%
  group_by(Cult) %>%
  mutate(DevWt = HeadWt - mean(HeadWt))
#> # A tibble: 60 x 5
#> # Groups: Cult [2]
#> Cult Date HeadWt VitC DevWt
#> <fct> <fct> <dbl> <int> <dbl>
#> 1 c39 d16 2.5 51 -0.407
#> 2 c39 d16 2.2 55 -0.707
#> 3 c39 d16 3.1 45 0.193
#> 4 c39 d16 4.3 42 1.39
#> 5 c39 d16 2.5 53 -0.407
#> 6 c39 d16 4.3 50 1.39
#> # … with 54 more rows
```

This returns a new data frame, so if you want to replace the original variable, you will need to save the result over it.

### 15.16.3 Discussion

Let’s take a closer look at the `cabbages` data set. It has two grouping variables (factors): `Cult`, which has levels `c39` and `c52`, and `Date`, which has levels `d16`, `d20`, and `d21`. It also has two measured numeric variables, `HeadWt` and `VitC`:

```
cabbages
#> Cult Date HeadWt VitC
#> 1 c39 d16 2.5 51
#> 2 c39 d16 2.2 55
#> ...<56 more rows>...
#> 59 c52 d21 1.5 66
#> 60 c52 d21 1.6 72
```

Suppose we want to find, for each case, the deviation of `HeadWt` from the overall mean. All we have to do is take the overall mean and subtract it from the observed value for each case:

```
mutate(cabbages, DevWt = HeadWt - mean(HeadWt))
#> Cult Date HeadWt VitC DevWt
#> 1 c39 d16 2.5 51 -0.09333333
#> 2 c39 d16 2.2 55 -0.39333333
#> ...<56 more rows>...
#> 59 c52 d21 1.5 66 -1.09333333
#> 60 c52 d21 1.6 72 -0.99333333
```

You’ll often want to do separate operations like this for each group, where the groups are specified by one or more grouping variables. Suppose, for example, we want to normalize the data within each group by finding the deviation of each case from the mean *within the group*, where the groups are specified by `Cult`. In these cases, we can use `group_by()` and `mutate()` together:
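This is the same pipeline as in the Solution; here it is stored in a variable (named `cb`, since the plotting code below refers to a data frame by that name):

```r
library(MASS)   # For the cabbages data set
library(dplyr)

cb <- cabbages %>%
  group_by(Cult) %>%
  mutate(DevWt = HeadWt - mean(HeadWt))
```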

First it groups `cabbages` based on the value of `Cult`. There are two levels of `Cult`, `c39` and `c52`. It then applies the `mutate()` function to each group.

The before and after results are shown in Figure 15.2:

```
# The data before normalizing
ggplot(cb, aes(x = Cult, y = HeadWt)) +
  geom_boxplot()

# After normalizing
ggplot(cb, aes(x = Cult, y = DevWt)) +
  geom_boxplot()
```

You can also group the data frame on multiple variables and perform operations on multiple variables. The following code groups the data by `Cult` and `Date`, forming a group for each distinct combination of the two variables. After forming these groups, the code calculates the deviation of `HeadWt` and `VitC` from the mean of each group:

```
cabbages %>%
  group_by(Cult, Date) %>%
  mutate(
    DevWt = HeadWt - mean(HeadWt),
    DevVitC = VitC - mean(VitC)
  )
#> # A tibble: 60 x 6
#> # Groups: Cult, Date [6]
#> Cult Date HeadWt VitC DevWt DevVitC
#> <fct> <fct> <dbl> <int> <dbl> <dbl>
#> 1 c39 d16 2.5 51 -0.68 0.7
#> 2 c39 d16 2.2 55 -0.98 4.7
#> 3 c39 d16 3.1 45 -0.08 -5.30
#> 4 c39 d16 4.3 42 1.12 -8.30
#> 5 c39 d16 2.5 53 -0.68 2.7
#> 6 c39 d16 4.3 50 1.12 -0.300
#> # … with 54 more rows
```

### 15.16.4 See Also

To summarize data by groups, see Recipe 15.17.

## 15.17 Summarizing Data by Groups

### 15.17.1 Problem

You want to summarize your data, based on one or more grouping variables.

### 15.17.2 Solution

Use `group_by()` and `summarise()` from the dplyr package, and specify the operations to do:

```
library(MASS) # Load MASS for the cabbages data set
library(dplyr)
cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),
    VitC = mean(VitC)
  )
#> # A tibble: 6 x 4
#> # Groups: Cult [2]
#> Cult Date Weight VitC
#> <fct> <fct> <dbl> <dbl>
#> 1 c39 d16 3.18 50.3
#> 2 c39 d20 2.8 49.4
#> 3 c39 d21 2.74 54.8
#> 4 c52 d16 2.26 62.5
#> 5 c52 d20 3.11 58.9
#> 6 c52 d21 1.47 71.8
```

### 15.17.3 Discussion

There are a few things going on here that may be unfamiliar if you’re new to dplyr and the tidyverse in general.

First, let’s take a closer look at the `cabbages` data set. It has two factors that can be used as grouping variables: `Cult`, which has levels `c39` and `c52`, and `Date`, which has levels `d16`, `d20`, and `d21`. It also has two numeric variables, `HeadWt` and `VitC`:

```
cabbages
#> Cult Date HeadWt VitC
#> 1 c39 d16 2.5 51
#> 2 c39 d16 2.2 55
#> ...<56 more rows>...
#> 59 c52 d21 1.5 66
#> 60 c52 d21 1.6 72
```

Finding the overall mean of `HeadWt` is simple. We could just use the `mean()` function on that column, but for reasons that will soon become clear, we’ll use the `summarise()` function instead:
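A minimal sketch of that call (the mean value is consistent with the `DevWt` output shown earlier: 2.5 − 2.593333 = −0.093333):

```r
library(MASS)   # For the cabbages data set
library(dplyr)

summarise(cabbages, Weight = mean(HeadWt))
#>     Weight
#> 1 2.593333
```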

The result is a data frame with one row and one column, named `Weight`.

Often we want to find information about each subset of the data, as specified by a grouping variable. For example, suppose we want to find the mean of each `Cult` group. To do this, we can use `summarise()` with `group_by()`:

```
tmp <- group_by(cabbages, Cult)
summarise(tmp, Weight = mean(HeadWt))
#> # A tibble: 2 x 2
#> Cult Weight
#> <fct> <dbl>
#> 1 c39 2.91
#> 2 c52 2.28
```

The command first groups the data frame `cabbages` based on the value of `Cult`. There are two levels of `Cult`, `c39` and `c52`, so there are two groups. It then applies the `summarise()` function to each of these data frames; it calculates `Weight` by taking the `mean()` of the `HeadWt` column in each of the sub-data frames. The resulting summaries for each group are assembled into a data frame, which is returned.

You can imagine that the `cabbages` data is split up into two separate data frames, then `summarise()` is called on each data frame (returning a one-row data frame for each), and then those results are combined together into a final data frame. This is actually how things worked in dplyr’s predecessor, plyr, with the `ddply()` function.

The syntax of the previous code used a temporary variable to store results. That’s a little verbose, so instead, we can use `%>%`, also known as the pipe operator, to chain the function calls together. The pipe operator simply takes what’s on its left and substitutes it as the first argument of the function call on the right. The following two lines of code are equivalent:

```
group_by(cabbages, Cult)
# The pipe operator moves `cabbages` to the first argument position of group_by()
cabbages %>% group_by(Cult)
```

The reason it’s called a pipe operator is that it lets you connect function calls together in sequence to form a pipeline of operations. Another common term for this, based on a different metaphor, is *chaining*.

So the first argument of the function call is in a different place. So what? The advantages become apparent when chaining is involved. Here’s what it would look like if you wanted to call `group_by()` and then `summarise()` without making use of a temporary variable. Instead of proceeding left to right, the computation occurs from the inside out:
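A sketch of the nested, inside-out form; it produces the same result as the two-step version above:

```r
summarise(group_by(cabbages, Cult), Weight = mean(HeadWt))
```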

Using a temporary variable, as we did earlier, makes it more readable, but a more elegant solution is to use the pipe operator:
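A sketch of the piped version, which reads left to right:

```r
cabbages %>%
  group_by(Cult) %>%
  summarise(Weight = mean(HeadWt))
```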

Back to summarizing data. Summarizing the data frame by grouping using more variables (or columns) is simple: just give it the names of the additional variables. It’s also possible to get more than one summary value by specifying more calculated columns. Here we’ll summarize each `Cult` and `Date` group, getting the average of `HeadWt` and `VitC`:

```
cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),
    VitC = mean(VitC)
  )
#> # A tibble: 6 x 4
#> # Groups: Cult [2]
#> Cult Date Weight VitC
#> <fct> <fct> <dbl> <dbl>
#> 1 c39 d16 3.18 50.3
#> 2 c39 d20 2.8 49.4
#> 3 c39 d21 2.74 54.8
#> 4 c52 d16 2.26 62.5
#> 5 c52 d20 3.11 58.9
#> 6 c52 d21 1.47 71.8
```

Note: You might have noticed that it says the result is grouped by `Cult`, but not `Date`. This is because the `summarise()` function removes one level of grouping. This is typically what you want when the input has one grouping variable. When there are multiple grouping variables, this may or may not be what you want. To remove all grouping, use `ungroup()`, and to add back the original grouping, use `group_by()` again.

It’s possible to do more than take the mean. You may, for example, want to compute the standard deviation and count of each group. To get the standard deviation, use `sd()`, and to get a count of rows in each group, use `n()`:

```
cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),
    sd = sd(HeadWt),
    n = n()
  )
#> # A tibble: 6 x 5
#> # Groups: Cult [2]
#> Cult Date Weight sd n
#> <fct> <fct> <dbl> <dbl> <int>
#> 1 c39 d16 3.18 0.957 10
#> 2 c39 d20 2.8 0.279 10
#> 3 c39 d21 2.74 0.983 10
#> 4 c52 d16 2.26 0.445 10
#> 5 c52 d20 3.11 0.791 10
#> 6 c52 d21 1.47 0.211 10
```

Other useful functions for generating summary statistics include `min()`, `max()`, and `median()`. The `n()` function is a special function that works only inside of the dplyr functions `summarise()`, `mutate()`, and `filter()`. See `?summarise` for more useful functions.

The `n()` function gets a count of rows, but if you want it to *not* count `NA` values from a column, you need to use a different technique. For example, if you want it to ignore any `NA`s in the `HeadWt` column, use `sum(!is.na(HeadWt))`.

#### 15.17.3.1 Dealing with NAs {#_dealing_with_literal_na_literal_s}

One potential pitfall is that `NA`s in the data will lead to `NA`s in the output. Let’s see what happens if we sprinkle a few `NA`s into `HeadWt`:

```
c1 <- cabbages                 # Make a copy
c1$HeadWt[c(1, 20, 45)] <- NA  # Set some values to NA

c1 %>%
  group_by(Cult) %>%
  summarise(
    Weight = mean(HeadWt),
    sd = sd(HeadWt),
    n = n()
  )
#> # A tibble: 2 x 4
#> Cult Weight sd n
#> <fct> <dbl> <dbl> <int>
#> 1 c39 NA NA 30
#> 2 c52 NA NA 30
```

The problem is that `mean()` and `sd()` simply return `NA` if any of the input values are `NA`. Fortunately, these functions have an option to deal with this very issue: setting `na.rm = TRUE` will tell them to ignore the `NA`s.
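A sketch of the fixed summary, assuming the `c1` data frame created in the previous block:

```r
c1 %>%
  group_by(Cult) %>%
  summarise(
    Weight = mean(HeadWt, na.rm = TRUE),
    sd = sd(HeadWt, na.rm = TRUE),
    n = n()
  )
```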

#### 15.17.3.2 Missing combinations {#_missing_combinations}

If there are any empty combinations of the grouping variables, they will not appear in the summarized data frame. These missing combinations can cause problems when making graphs. To illustrate, we’ll remove all entries that have levels `c52` and `d21`. The graph on the left in Figure 15.3 shows what happens when there’s a missing combination in a bar graph:

```
# Copy cabbages and remove all rows with both c52 and d21
c2 <- filter(cabbages, !(Cult == "c52" & Date == "d21"))

c2a <- c2 %>%
  group_by(Cult, Date) %>%
  summarise(Weight = mean(HeadWt))

ggplot(c2a, aes(x = Date, fill = Cult, y = Weight)) +
  geom_col(position = "dodge")
```

To fill in the missing combination (Figure 15.3, right), use the `complete()` function from the tidyr package, which is also part of the tidyverse. Also, the grouping for `c2a` must be removed with `ungroup()`; otherwise it will return too many rows.

```
library(tidyr)
c2b <- c2a %>%
  ungroup() %>%
  complete(Cult, Date)

ggplot(c2b, aes(x = Date, fill = Cult, y = Weight)) +
  geom_col(position = "dodge")
```

When we used `complete()`, it filled in the missing combinations with `NA`. It’s possible to fill with a different value, with the `fill` parameter. See `?complete` for more information.

## 15.18 Summarizing Data with Standard Errors and Confidence Intervals

### 15.18.1 Problem

You want to summarize your data with the standard error of the mean and/or confidence intervals.

### 15.18.2 Solution

Getting the standard error of the mean involves two steps: first get the standard deviation and count for each group, then use those values to calculate the standard error. The standard error for each group is just the standard deviation divided by the square root of the sample size:

```
library(MASS) # Load MASS for the cabbages data set
library(dplyr)
ca <- cabbages %>%
  group_by(Cult, Date) %>%
  summarise(
    Weight = mean(HeadWt),
    sd = sd(HeadWt),
    n = n(),
    se = sd / sqrt(n)
  )
ca
#> # A tibble: 6 x 6
#> # Groups: Cult [2]
#> Cult Date Weight sd n se
#> <fct> <fct> <dbl> <dbl> <int> <dbl>
#> 1 c39 d16 3.18 0.957 10 0.303
#> 2 c39 d20 2.8 0.279 10 0.0882
#> 3 c39 d21 2.74 0.983 10 0.311
#> 4 c52 d16 2.26 0.445 10 0.141
#> 5 c52 d20 3.11 0.791 10 0.250
#> 6 c52 d21 1.47 0.211 10 0.0667
```

### 15.18.3 Discussion

The `summarise()` function computes the columns in order, so you can refer to columns created earlier in the same call. That’s why `se` can use the `sd` and `n` columns.

The `n()` function gets a count of rows, but if you want it to *not* count `NA` values from a column, you need to use a different technique. For example, if you want it to ignore any `NA`s in the `HeadWt` column, use `sum(!is.na(HeadWt))`.

#### 15.18.3.1 Confidence Intervals {#_confidence_intervals}

Confidence intervals are calculated using the standard error of the mean and the degrees of freedom. To calculate a confidence interval, use the `qt()` function to get the quantile, then multiply that by the standard error. The `qt()` function will give quantiles of the *t*-distribution when given a probability level and degrees of freedom. For a 95% confidence interval, use a probability level of .975; for the bell-shaped *t*-distribution, this will in essence cut off 2.5% of the area under the curve at either end. The degrees of freedom equal the sample size minus one.

This will calculate the multiplier for each group. There are six groups and each has the same number of observations (10), so they will all have the same multiplier:
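Reconstructing that step from the description (the name `ciMult` matches the variable used in the next code block, and `ca` is the summary data frame from the Solution; with n = 10 in every group, qt(.975, 9) ≈ 2.262):

```r
ciMult <- qt(.975, ca$n - 1)
ciMult
#> [1] 2.262157 2.262157 2.262157 2.262157 2.262157 2.262157
```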

Now we can multiply that vector by the standard error to get the 95% confidence interval:

```
ca$ci95 <- ca$se * ciMult
ca
#> # A tibble: 6 x 7
#> # Groups: Cult [2]
#> Cult Date Weight sd n se ci95
#> <fct> <fct> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 c39 d16 3.18 0.957 10 0.303 0.684
#> 2 c39 d20 2.8 0.279 10 0.0882 0.200
#> 3 c39 d21 2.74 0.983 10 0.311 0.703
#> 4 c52 d16 2.26 0.445 10 0.141 0.318
#> 5 c52 d20 3.11 0.791 10 0.250 0.566
#> 6 c52 d21 1.47 0.211 10 0.0667 0.151
```

This could be done in one line, like this:
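A sketch of the one-line version, combining the multiplier and the multiplication:

```r
ca$ci95 <- ca$se * qt(.975, ca$n - 1)
```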

For a 99% confidence interval, use .995.

Error bars that represent the standard error of the mean and confidence intervals serve the same general purpose: to give the viewer an idea of how good the estimate of the population mean is. The standard error is the standard deviation of the sampling distribution of the sample mean. Confidence intervals are a little easier to interpret. Very roughly, a 95% confidence interval means that there’s a 95% chance that the true population mean is within the interval (actually, it doesn’t mean this at all, but this seemingly simple topic is way too complicated to cover here; if you want to know more, read up on Bayesian statistics).

This function will perform all the steps of calculating the standard deviation, count, standard error, and confidence intervals. It can also handle `NA`s and missing combinations, with the `na.rm` and `.drop` options. By default, it provides a 95% confidence interval, but this can be set with the `conf.interval` argument:

```
summarySE <- function(data = NULL, measurevar, groupvars = NULL, na.rm = FALSE,
                      conf.interval = .95, .drop = TRUE) {

  # New version of length which can handle NA's: if na.rm == TRUE, don't count them
  length2 <- function(x, na.rm = FALSE) {
    if (na.rm) sum(!is.na(x))
    else       length(x)
  }

  groupvars  <- rlang::syms(groupvars)
  measurevar <- rlang::sym(measurevar)

  datac <- data %>%
    dplyr::group_by(!!!groupvars) %>%
    dplyr::summarise(
      N  = length2(!!measurevar, na.rm = na.rm),
      sd = sd(!!measurevar, na.rm = na.rm),
      !!measurevar := mean(!!measurevar, na.rm = na.rm),
      se = sd / sqrt(N),
      # Confidence interval multiplier for standard error
      # Calculate t-statistic for confidence interval:
      # e.g., if conf.interval is .95, use .975 (above/below), and use df = N-1
      ci = se * qt(conf.interval / 2 + .5, N - 1)
    ) %>%
    dplyr::ungroup() %>%
    # Rearrange the columns so that sd, se, ci are last
    dplyr::select(seq_len(ncol(.) - 4), ncol(.) - 2, sd, se, ci)

  datac
}
```

The following usage example has a 99% confidence interval and handles `NA`s and missing combinations:

```
# Remove all rows with both c52 and d21
c2 <- filter(cabbages, !(Cult == "c52" & Date == "d21"))

# Set some values to NA
c2$HeadWt[c(1, 20, 45)] <- NA

summarySE(c2, "HeadWt", c("Cult", "Date"),
          conf.interval = .99, na.rm = TRUE, .drop = FALSE)
#> # A tibble: 5 x 7
#> Cult Date N HeadWt sd se ci
#> <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 c39 d16 9 3.26 0.982 0.327 1.10
#> 2 c39 d20 9 2.72 0.139 0.0465 0.156
#> 3 c39 d21 10 2.74 0.983 0.311 1.01
#> 4 c52 d16 10 2.26 0.445 0.141 0.458
#> 5 c52 d20 9 3.04 0.809 0.270 0.905
```

### 15.18.4 See Also

See Recipe 7.7 to use the values calculated here to add error bars to a graph.

## 15.19 Converting Data from Wide to Long

### 15.19.1 Problem

You want to convert a data frame from “wide” format to “long” format.

### 15.19.2 Solution

Use `gather()` from the tidyr package. In the `anthoming` data set, for each `angle`, there are two measurements: one column contains measurements in the experimental condition and the other contains measurements in the control condition:

```
library(gcookbook) # For the data set
anthoming
#> angle expt ctrl
#> 1 -20 1 0
#> 2 -10 7 3
#> 3 0 2 3
#> 4 10 0 3
#> 5 20 0 1
```

We can reshape the data so that all the measurements are in one column. This will put the values from `expt` and `ctrl` into one column, and put the names into a different column:

```
library(tidyr)
gather(anthoming, condition, count, expt, ctrl)
#> angle condition count
#> 1 -20 expt 1
#> 2 -10 expt 7
#> ...<6 more rows>...
#> 9 10 ctrl 3
#> 10 20 ctrl 1
```

This data frame represents the same information as the original one, but it is structured in a way that is more conducive to some analyses.

### 15.19.3 Discussion

In the source data, there are *ID* variables and *value* variables. The ID variables are those that specify which values go together. In the source data, the first row holds measurements for when `angle` is –20. In the output data frame, the two measurements, for `expt` and `ctrl`, are no longer in the same row, but we can still tell that they belong together because they have the same value of `angle`.

The value variables are by default all the non-ID variables. The names of these variables are put into a new *key* column, which we called `condition`, and the values are put into a new *value* column, which we called `count`.

You can designate the *value* columns from the source data by naming them individually, as we did above with `expt` and `ctrl`; `gather()` automatically inferred that the ID variable was the remaining column, `angle`. Another way to tell it which columns are values is to do the reverse: if you exclude the `angle` column, then `gather()` will infer that the value columns are the remaining ones, `expt` and `ctrl`.

```
gather(anthoming, condition, count, expt, ctrl)
# Prepending the column name with a '-' means it is not a value column
gather(anthoming, condition, count, -angle)
```

There are other convenient shortcuts to specify which columns are values. For example, `expt:ctrl` means to select all columns between `expt` and `ctrl` (in this particular case, there are no other columns in between, but for a larger data set you can imagine how this would save typing).

By default, `gather()` will use all of the columns from the source data as either ID columns or value columns. That means that if you want to ignore some columns, you’ll need to filter them out first using the `select()` function.

For example, in the `drunk` data set, suppose we want to convert it to long format, keeping `sex` in one column and putting the numeric values in another column. This time, we want the values for only the `0-29` and `30-39` columns, and we want to discard the values for the other age ranges:

```
# Our source data
drunk
#> sex 0-29 30-39 40-49 50-59 60+
#> 1 male 185 207 260 180 71
#> 2 female 4 13 10 7 10
# Try gather() with just 0-29 and 30-39
drunk %>%
  gather(age, count, "0-29", "30-39")
#> sex 40-49 50-59 60+ age count
#> 1 male 260 180 71 0-29 185
#> 2 female 10 7 10 0-29 4
#> 3 male 260 180 71 30-39 207
#> 4 female 10 7 10 30-39 13
```

That doesn’t look right! We told `gather()` that `0-29` and `30-39` were the value columns we wanted, and it automatically inferred that we wanted to use all of the other columns as ID columns, when we actually wanted to keep just `sex` and discard the others. The solution is to use `select()` to remove the unwanted columns first, and then `gather()`:

```
library(dplyr) # For the select() function
drunk %>%
  select(sex, "0-29", "30-39") %>%
  gather(age, count, "0-29", "30-39")
#> sex age count
#> 1 male 0-29 185
#> 2 female 0-29 4
#> 3 male 30-39 207
#> 4 female 30-39 13
```

There are times when you may want to use more than one column as the ID variables:

```
plum_wide
#> length time dead alive
#> 1 long at_once 84 156
#> 2 long in_spring 156 84
#> 3 short at_once 133 107
#> 4 short in_spring 209 31
# Use length and time as the ID variables (by not naming them as value variables)
gather(plum_wide, "survival", "count", dead, alive)
#> length time survival count
#> 1 long at_once dead 84
#> 2 long in_spring dead 156
#> ...<4 more rows>...
#> 7 short at_once alive 107
#> 8 short in_spring alive 31
```

Some data sets don’t come with a column with an ID variable. For example, in the `corneas` data set, each row represents one pair of measurements, but there is no ID variable. Without an ID variable, you won’t be able to tell how the values are meant to be paired together. In these cases, you can add an ID variable before using `gather()`:

```
# Make a copy of the data
co <- corneas
# Add an ID column
co$id <- 1:nrow(co)
gather(co, "eye", "thickness", affected, notaffected)
#> id eye thickness
#> 1 1 affected 488
#> 2 2 affected 478
#> ...<12 more rows>...
#> 15 7 notaffected 464
#> 16 8 notaffected 476
```

Having numeric values for the ID variable may be problematic for subsequent analyses, so you may want to convert `id` to a character vector with `as.character()`, or a factor with `factor()`.

### 15.19.4 See Also

See Recipe 15.20 to do conversions in the other direction, from long to wide.

See the `stack()` function for another way of converting from wide to long.

## 15.20 Converting Data from Long to Wide

### 15.20.1 Problem

You want to convert a data frame from “long” format to “wide” format.

### 15.20.2 Solution

Use the `spread()` function from the tidyr package. In this example, we’ll use the `plum` data set, which is in a long format:

```
library(gcookbook) # For the data set
plum
#> length time survival count
#> 1 long at_once dead 84
#> 2 long in_spring dead 156
#> ...<4 more rows>...
#> 7 short at_once alive 107
#> 8 short in_spring alive 31
```

The conversion to wide format takes each unique value in one column and uses those values as headers for new columns, then uses another column for the source values. For example, we can “move” the values in the `survival` column to the top and fill them with values from `count`:
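That conversion is a single `spread()` call, using `survival` as the key column and `count` as the value column:

```r
library(gcookbook)  # For the plum data set
library(tidyr)

plum %>%
  spread(survival, count)
```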

### 15.20.3 Discussion

The `spread()` function requires you to specify a *key* column, which is used for header names, and a *value* column, which is used to fill the values in the output data frame. It’s assumed that you want to use all the other columns as ID variables.

In the preceding example, there are two ID columns, `length` and `time`, one key column, `survival`, and one value column, `count`. What if we want to use two of the columns as keys? Suppose, for example, that we want to use `length` and `survival` as keys. This would leave us with `time` as the ID column.

The way to do this is to combine the `length` and `survival` columns into a new column, then use that new column as a key:

```
# Create a new column, length_survival, from length and survival.
plum %>%
  unite(length_survival, length, survival)
#> length_survival time count
#> 1 long_dead at_once 84
#> 2 long_dead in_spring 156
#> ...<4 more rows>...
#> 7 short_alive at_once 107
#> 8 short_alive in_spring 31
# Now pass it to spread() and use length_survival as a key
plum %>%
  unite(length_survival, length, survival) %>%
  spread(length_survival, count)
#> time long_alive long_dead short_alive short_dead
#> 1 at_once 156 84 107 133
#> 2 in_spring 84 156 31 209
```

### 15.20.4 See Also

See Recipe 15.19 to do conversions in the other direction, from wide to long.

See the `unstack()` function for another way of converting from long to wide.

## 15.21 Converting a Time Series Object to Times and Values

### 15.21.1 Problem

You have a time series object that you wish to convert to numeric vectors representing the time and values at each time.

### 15.21.2 Solution

Use the `time()` function to get the time for each observation, then convert the times and values to numeric vectors with `as.numeric()`:

```
# Look at nhtemp Time Series object
nhtemp
#> Time Series:
#> Start = 1912
#> End = 1971
#> ...
#> [31] 51.0 50.6 51.7 51.5 52.1 51.3 51.0 54.0 51.4 52.7 53.1 54.6 52.0 52.0 50.9
#> [46] 52.6 50.2 52.6 51.6 51.9 50.5 50.9 51.7 51.4 51.7 50.8 51.9 51.8 51.9 53.0
# Get times for each observation
as.numeric(time(nhtemp))
#> [1] 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926
#> [16] 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
#> [31] 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956
#> [46] 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971
# Get value of each observation
as.numeric(nhtemp)
#> [1] 49.9 52.3 49.4 51.1 49.4 47.9 49.8 50.9 49.3 51.9 50.8 49.6 49.3 50.6 48.4
#> [16] 50.7 50.9 50.6 51.5 52.8 51.8 51.1 49.8 50.2 50.4 51.6 51.8 50.9 48.8 51.7
#> [31] 51.0 50.6 51.7 51.5 52.1 51.3 51.0 54.0 51.4 52.7 53.1 54.6 52.0 52.0 50.9
#> [46] 52.6 50.2 52.6 51.6 51.9 50.5 50.9 51.7 51.4 51.7 50.8 51.9 51.8 51.9 53.0
# Put them in a data frame
nht <- data.frame(year = as.numeric(time(nhtemp)), temp = as.numeric(nhtemp))
nht
#> year temp
#> 1 1912 49.9
#> 2 1913 52.3
#> ...<56 more rows>...
#> 59 1970 51.9
#> 60 1971 53.0
```

### 15.21.3 Discussion

Time series objects efficiently store information when there are observations at regular time intervals, but for use with ggplot, they need to be converted to a format that separately represents times and values for each observation.

Some time series objects are cyclical. The `presidents` data set, for example, contains four observations per year, one for each quarter:

```
presidents
#> Qtr1 Qtr2 Qtr3 Qtr4
#> 1945 NA 87 82 75
#> 1946 63 50 43 32
#> ...
#> 1973 68 44 40 27
#> 1974 28 25 24 24
```

To convert it to a two-column data frame with one column representing the year with fractional values, we can do the same as before:

```
pres_rating <- data.frame(
  year = as.numeric(time(presidents)),
  rating = as.numeric(presidents)
)
pres_rating
#> year rating
#> 1 1945.00 NA
#> 2 1945.25 87
#> ...<116 more rows>...
#> 119 1974.50 24
#> 120 1974.75 24
```

It is also possible to store the year and quarter in separate columns, which may be useful in some visualizations:
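A sketch of that approach, using `floor()` on the time and `cycle()` to extract the quarter for each observation (the variable name `pres_rating2` is assumed):

```r
pres_rating2 <- data.frame(
  year    = as.numeric(floor(time(presidents))),
  quarter = as.numeric(cycle(presidents)),
  rating  = as.numeric(presidents)
)
```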

### 15.21.4 See Also

The zoo package is also useful for working with time series objects.