## 15.19 Converting Data from Wide to Long

### 15.19.1 Problem

You want to convert a data frame from “wide” format to “long” format.

### 15.19.2 Solution

Use `gather()` from the tidyr package. In the `anthoming` data set, for each `angle`, there are two measurements: one column contains measurements in the experimental condition and the other contains measurements in the control condition:

``````library(gcookbook) # For the data set
anthoming
#>   angle expt ctrl
#> 1   -20    1    0
#> 2   -10    7    3
#> 3     0    2    3
#> 4    10    0    3
#> 5    20    0    1``````

We can reshape the data so that all the measurements are in one column. This will put the values from `expt` and `ctrl` into one column, and put the names into a different column:

``````library(tidyr)
gather(anthoming, condition, count, expt, ctrl)
#>    angle condition count
#> 1    -20      expt     1
#> 2    -10      expt     7
#>  ...<6 more rows>...
#> 9     10      ctrl     3
#> 10    20      ctrl     1``````

This data frame represents the same information as the original one, but it is structured in a way that is more conducive to some analyses.

### 15.19.3 Discussion

In the source data, there are ID variables and value variables. The ID variables are those that specify which values go together. In the source data, the first row holds measurements for when `angle` is –20. In the output data frame, the two measurements, for `expt` and `ctrl`, are no longer in the same row, but we can still tell that they belong together because they have the same value of `angle`.

The value variables are by default all the non-ID variables. The names of these variables are put into a new key column, which we called `condition`, and the values are put into a new value column which we called `count`.

You can designate the value columns from the source data by naming them individually, as we did above with `expt` and `ctrl`. `gather()` automatically inferred that the ID variable was the remaining column, `angle`. Another way to tell it which columns are values is to do the reverse: if you exclude the `angle` column, then `gather()` will infer that the value columns are the remaining ones, `expt` and `ctrl`.

``````gather(anthoming, condition, count, expt, ctrl)
# Prepending the column name with a '-' means it is not a value column
gather(anthoming, condition, count, -angle)``````

There are other convenient shortcuts to specify which columns are values. For example `expt:ctrl` means to select all columns between `expt` and `ctrl` (in this particular case, there are no other columns in between, but for a larger data set you can imagine how this would save typing).

By default, `gather()` will use all of the columns from the source data as either ID columns or value columnbs. That means that if you want to ignore some columns, you’ll need to filter them out first using the `select()` function.

For example, in the `drunk` data set, suppose we want to convert it to long format, keeping `sex` in one column and putting the numeric values in another column. This time, we want the values for only the `0-29` and `30-39` columns, and we want to discard the values for the other age ranges:

``````# Our source data
drunk
#>      sex 0-29 30-39 40-49 50-59 60+
#> 1   male  185   207   260   180  71
#> 2 female    4    13    10     7  10

# Try gather() with just 0-29 and 30-39
drunk %>%
gather(age, count, "0-29", "30-39")
#>      sex 40-49 50-59 60+   age count
#> 1   male   260   180  71  0-29   185
#> 2 female    10     7  10  0-29     4
#> 3   male   260   180  71 30-39   207
#> 4 female    10     7  10 30-39    13``````

That doesn’t look right! We told `gather()` that `0-29` and `30-39` were the value columns we wanted, and it automatically inferred that we wanted to use all of the other columns as ID columns, when we wanted to just keep `sex` and discard the others. The solution is to use `select()` to remove the unwanted columns first, and then `gather()`.

``````library(dplyr)  # For the select() function

drunk %>%
select(sex, "0-29", "30-39") %>%
gather(age, count, "0-29", "30-39")
#>      sex   age count
#> 1   male  0-29   185
#> 2 female  0-29     4
#> 3   male 30-39   207
#> 4 female 30-39    13``````

There are times where you may want to use use more than one column as the ID variables:

``````plum_wide
#> 1   long   at_once   84   156
#> 2   long in_spring  156    84
#> 3  short   at_once  133   107
#> 4  short in_spring  209    31
# Use length and time as the ID variables (by not naming them as value variables)
#>   length      time survival count
#> 1   long   at_once     dead    84
#> 2   long in_spring     dead   156
#>  ...<4 more rows>...
#> 7  short   at_once    alive   107
#> 8  short in_spring    alive    31``````

Some data sets don’t come with a column with an ID variable. For example, in the `corneas` data set, each row represents one pair of measurements, but there is no ID variable. Without an ID variable, you won’t be able to tell how the values are meant to be paired together. In these cases, you can add an ID variable before using melt():

``````# Make a copy of the data
co <- corneas
co\$id <- 1:nrow(co)

gather(co, "eye", "thickness", affected, notaffected)
#>    id         eye thickness
#> 1   1    affected       488
#> 2   2    affected       478
#>  ...<12 more rows>...
#> 15  7 notaffected       464
#> 16  8 notaffected       476``````

Having numeric values for the ID variable may be problematic for subsequent analyses, so you may want to convert id to a character vector with `as.character()`, or a factor with `factor()`.

See the `stack()` function for another way of converting from wide to long.