15.19 Converting Data from Wide to Long

15.19.1 Problem

You want to convert a data frame from “wide” format to “long” format.

15.19.2 Solution

Use gather() from the tidyr package. In the anthoming data set, for each angle, there are two measurements: one column contains measurements in the experimental condition and the other contains measurements in the control condition:

We can reshape the data so that all the measurements are in one column. This will put the values from expt and ctrl into one column, and put the names into a different column:

This data frame represents the same information as the original one, but it is structured in a way that is more conducive to some analyses.

15.19.3 Discussion

In the source data, there are ID variables and value variables. The ID variables are those that specify which values go together. In the source data, the first row holds measurements for when angle is –20. In the output data frame, the two measurements, for expt and ctrl, are no longer in the same row, but we can still tell that they belong together because they have the same value of angle.

The value variables are by default all the non-ID variables. The names of these variables are put into a new key column, which we called condition, and the values are put into a new value column which we called count.

You can designate the value columns from the source data by naming them individually, as we did above with expt and ctrl. gather() automatically inferred that the ID variable was the remaining column, angle. Another way to tell it which columns are values is to do the reverse: if you exclude the angle column, then gather() will infer that the value columns are the remaining ones, expt and ctrl.

There are other convenient shortcuts to specify which columns are values. For example expt:ctrl means to select all columns between expt and ctrl (in this particular case, there are no other columns in between, but for a larger data set you can imagine how this would save typing).

By default, gather() will use all of the columns from the source data as either ID columns or value columnbs. That means that if you want to ignore some columns, you’ll need to filter them out first using the select() function.

For example, in the drunk data set, suppose we want to convert it to long format, keeping sex in one column and putting the numeric values in another column. This time, we want the values for only the 0-29 and 30-39 columns, and we want to discard the values for the other age ranges:

That doesn’t look right! We told gather() that 0-29 and 30-39 were the value columns we wanted, and it automatically inferred that we wanted to use all of the other columns as ID columns, when we wanted to just keep sex and discard the others. The solution is to use select() to remove the unwanted columns first, and then gather().

There are times where you may want to use use more than one column as the ID variables:

Some data sets don’t come with a column with an ID variable. For example, in the corneas data set, each row represents one pair of measurements, but there is no ID variable. Without an ID variable, you won’t be able to tell how the values are meant to be paired together. In these cases, you can add an ID variable before using melt():

Having numeric values for the ID variable may be problematic for subsequent analyses, so you may want to convert id to a character vector with as.character(), or a factor with factor().

15.19.4 See Also

See Recipe 15.20 to do conversions in the other direction, from long to wide.

See the stack() function for another way of converting from wide to long.