Chapter 15 Getting Your Data into Shape

When it comes to making data graphics, half the battle occurs before you call any plotting commands. Before you pass your data to the plotting functions, it must first be read in and given the correct structure. The data sets provided with R are ready to use, but when dealing with real-world data, this usually isn’t the case: you’ll have to clean up and restructure the data before you can visualize it.

The recipes in this chapter will often use packages from the tidyverse. For a little background about the tidyverse, see the introduction section of Chapter 1. I will also show how to do many of the same tasks using base R, because in some situations it is important to minimize the number of packages you use, and because it is useful to be able to understand code written for base R.

Note

The %>% symbol, also known as the pipe operator, is used extensively in this chapter. If you are not familiar with it, see Recipe 1.7.

Most of the tidyverse functions used in this chapter are from the dplyr package, and in this chapter, I’ll assume that dplyr is already loaded. You can load it with library(tidyverse), or, if you want to keep things more streamlined, you can load dplyr directly with library(dplyr).

Data sets in R are most often stored in data frames. They’re typically used as two-dimensional data structures, with each row representing one case and each column representing one variable. Data frames are essentially lists of vectors and factors, all of the same length, where each vector or factor represents one column.

Here’s the heightweight data set:

It consists of five columns, with each row representing one case: a set of information about a single person. We can get a clearer idea of how it’s structured by using the str() function:

The first column, sex, is a factor with two levels, "f" and "m", and the other four columns are vectors of numbers (one of them, ageMonth, is specifically a vector of integers, but for the purposes here, it behaves the same as any other numeric vector).

Factors and character vectors behave similarly in ggplot – the main difference is that with character vectors, items will be displayed in lexicographical order, but with factors, items will be displayed in the same order as the factor levels, which you can control.

15.1 Creating a Data Frame

15.1.1 Problem

You want to create a data frame from vectors.

15.1.2 Solution

You can put vectors together in a data frame with data.frame():
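A minimal sketch with made-up vectors (the names g and x are illustrative):

```r
# Two vectors of the same length become the columns of a data frame
g <- c("A", "B", "C")
x <- c(1, 2, 3)

dat <- data.frame(g, x)
str(dat)

# Note: before R 4.0, data.frame() converted character vectors to factors
# unless you passed stringsAsFactors = FALSE; since R 4.0 they stay character
```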

15.1.3 Discussion

A data frame is essentially a list of vectors and factors. Each vector or factor can be thought of as a column in the data frame.

If your vectors are in a list, you can convert the list to a data frame with the as.data.frame() function:

The tidyverse way of creating a data frame is to use data_frame() or as_data_frame() (note the underscores instead of periods). This returns a special kind of data frame – a tibble – which behaves like a regular data frame in most contexts, but prints out more nicely and is specifically designed to play well with the tidyverse functions.

A regular data frame can be converted to a tibble using as_tibble():

15.2 Getting Information About a Data Structure

15.2.1 Problem

You want to find out information about an object or data structure.

15.2.2 Solution

Use the str() function:

This tells us that ToothGrowth is a data frame with three columns, len, supp, and dose. len and dose contain numeric values, while supp is a factor with two levels.

Another useful function is the summary() function:

Instead of showing you the first few values of each column as str() does, summary() provides basic descriptive statistics (the minimum, maximum, median, mean, and first & third quartile values) for numeric variables, and tells you the number of values corresponding to each character value or factor level if it is a character or factor variable.
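Both functions in action on the built-in ToothGrowth data set:

```r
# str() shows the structure: 60 obs. of 3 variables (len, supp, dose)
str(ToothGrowth)

# summary() shows descriptive statistics for numeric columns
# and level counts for the factor column supp
summary(ToothGrowth)
```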

15.3 Adding a Column to a Data Frame

15.3.1 Problem

You want to add a column to a data frame.

15.3.2 Solution

Use mutate() from dplyr to add a new column and assign values to it. This returns a new data frame, which you’ll typically want to save over the original.

If you assign a single value to the new column, the entire column will be filled with that value. This adds a column named newcol, filled with NA:

You can also assign a vector to the new column:

Note that the vector being added to the data frame must either have one element, or the same number of elements as the data frame has rows. In the example above we created a new vector that had 60 rows by repeating the values c(1, 2) thirty times.
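A sketch of both forms, assuming dplyr is loaded and using the built-in ToothGrowth data set (which also has 60 rows):

```r
library(dplyr)

# A single value fills the entire new column
tg <- ToothGrowth %>%
  mutate(newcol = NA)

# A vector must match the number of rows: repeat c(1, 2) thirty times
tg <- tg %>%
  mutate(newcol = rep(c(1, 2), 30))
```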

15.3.3 Discussion

Each column of a data frame is a vector. R handles columns in data frames slightly differently from standalone vectors because all the columns in a data frame must have the same length.

To add a column using base R, you can simply assign values into the new column like so:

With base R, the vector being assigned into the data frame will automatically be repeated to fill the number of rows in the data frame.
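The base R equivalent, again sketched with ToothGrowth:

```r
tg <- ToothGrowth

# Assigning into a column that doesn't exist yet creates it
tg$newcol <- NA

# A shorter vector is automatically recycled to fill all 60 rows
tg$newcol <- c(1, 2)
```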

15.4 Deleting a Column from a Data Frame

15.4.1 Problem

You want to delete a column from a data frame.

15.4.2 Solution

Use select() from dplyr and specify the columns you want to drop by using - (a minus sign). This returns a new data frame, which you’ll typically want to save over the original.

15.4.3 Discussion

You can list multiple columns that you want to drop at the same time, or conversely specify only the columns that you want to keep. The following two pieces of code are thus equivalent:

To remove a column using base R, you can simply assign NULL to that column.
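For example, a sketch using ToothGrowth:

```r
tg <- ToothGrowth

# Assigning NULL to a column removes it from the data frame
tg$dose <- NULL
names(tg)
#> [1] "len"  "supp"
```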

15.4.4 See Also

Recipe 15.7 for more on getting a subset of a data frame.

See ?select for more ways to drop and keep columns.

15.5 Renaming Columns in a Data Frame

15.5.1 Problem

You want to rename the columns in a data frame.

15.5.2 Solution

Use rename() from dplyr. This returns a new data frame, which you’ll typically want to save over the original.

15.5.4 See Also

See ?select for more ways to rename columns within a data frame.

15.6 Reordering Columns in a Data Frame

15.6.1 Problem

You want to change the order of columns in a data frame.

15.6.2 Solution

Use select() from dplyr.

The new data frame will contain the columns you specified in select(), in the order you specified. Note that select() returns a new data frame, so if you want to change the original variable, you’ll need to save the new result over it.

15.6.3 Discussion

If you are only reordering a few variables and want to keep the rest of the variables in order, you can use everything() as a placeholder:

See ?select_helpers for other ways to select columns. You can, for example, select columns by matching parts of the name.

Using base R, you can also reorder columns by their name or numeric position. This returns a new data frame, which can be saved over the original.

In these examples, I used list-style indexing. A data frame is essentially a list of vectors, and indexing into it as a list will return another data frame. You can get the same effect with matrix-style indexing:

In this case, both methods return the same result, a data frame. However, when retrieving a single column, list-style indexing will return a data frame, while matrix-style indexing will return a vector:

You can use drop=FALSE to ensure that it returns a data frame:
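A sketch of the difference, using ToothGrowth:

```r
tg <- ToothGrowth

# List-style indexing returns a data frame
head(tg[c("dose", "len")])
# Matrix-style indexing (note the comma) also returns a data frame here
head(tg[, c("dose", "len")])

# But with a single column, matrix-style indexing drops to a vector...
is.vector(tg[, "len"])                     # TRUE
# ...unless drop = FALSE is given
is.data.frame(tg[, "len", drop = FALSE])   # TRUE
```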

15.7 Getting a Subset of a Data Frame

15.7.1 Problem

You want to get a subset of a data frame.

15.7.2 Solution

Use filter() to get the rows, and select() to get the columns you want. These operations can be chained together using the %>% operator. These functions return a new data frame, so if you want to change the original variable, you’ll need to save the new result over it.

We’ll use the climate data set for the examples here:

Let’s say that we want to keep only the rows where Source is "Berkeley" and where the year is between 1900 and 2000, inclusive. You can do so with the filter() function:

If you want only the Year and Anomaly10y columns, use select(), as we did in 15.4:

These operations can be chained together using the %>% operator:
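The climate data set comes from the gcookbook package; as a stand-in, here is the same filter-then-select pattern, chained with %>%, on the built-in ToothGrowth data (the column names differ, but the shape of the code is the same):

```r
library(dplyr)

# Keep rows matching a condition, then keep just two columns
result <- ToothGrowth %>%
  filter(supp == "VC" & dose >= 1) %>%
  select(len, dose)
```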

15.7.3 Discussion

The filter() function picks out rows based on a condition. If you want to pick out rows based on their numeric position, use the slice() function:

I generally recommend indexing using names rather than numbers when possible. It makes the code easier to understand when you’re collaborating with others or when you come back to it months or years after writing it, and it makes the code less likely to break when there are changes to the data, such as when columns are added or removed.

With base R, you can get a subset of rows like this:

Notice that we needed to prefix each column name with climate$, and that there’s a comma after the selection criteria. This indicates that we’re getting rows, not columns.

This row filtering can also be combined with the column selection from 15.4:

15.8 Changing the Order of Factor Levels

15.8.1 Problem

You want to change the order of levels in a factor.

15.8.2 Solution

Pass the factor to factor(), and give it the levels in the order you want. This returns a new factor, so if you want to change the original variable, you’ll need to save the new result over it.

The order can also be specified with levels when the factor is first created:
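A minimal sketch with a made-up sizes factor (the variable name is illustrative):

```r
# By default, factor levels are in alphabetical order
sizes <- factor(c("small", "large", "large", "small", "medium"))
levels(sizes)
#> [1] "large"  "medium" "small"

# Pass the factor back to factor() with the levels in the order you want
sizes <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes)
#> [1] "small"  "medium" "large"
```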

15.8.3 Discussion

There are two kinds of factors in R: ordered factors and regular factors. (In practice, ordered factors are not commonly used.) In both types, the levels are arranged in some order; the difference is that the order is meaningful for an ordered factor, but it is arbitrary for a regular factor – it simply reflects how the data is stored. For plotting data, the distinction between ordered and regular factors is generally unimportant, and they can be treated the same.

The order of factor levels affects graphical output. When a factor variable is mapped to an aesthetic property in ggplot, the aesthetic adopts the ordering of the factor levels. If a factor is mapped to the x-axis, the ticks on the axis will be in the order of the factor levels, and if a factor is mapped to color, the items in the legend will be in the order of the factor levels.

To reverse the level order, you can use rev(levels()):

The tidyverse function for reordering factors is fct_relevel() from the forcats package. It has a syntax similar to the factor() function from base R.

15.8.4 See Also

To reorder a factor based on the value of another variable, see Recipe 15.9.

Reordering factor levels is useful for controlling the order of axes and legends. See Recipes 8.4 and 10.3 for more information.

15.9 Changing the Order of Factor Levels Based on Data Values

15.9.1 Problem

You want to change the order of levels in a factor based on values in the data.

15.9.3 Discussion

The usefulness of reorder() might not be obvious from just looking at the raw output. Figure 15.1 shows three plots made with reorder(). In these plots, the order in which the items appear is determined by their values.


Figure 15.1: Original data (left); Reordered by the mean of each group (middle); Reordered by the median of each group (right)

In the middle plot in Figure 15.1, the boxes are sorted by the mean. The horizontal line that runs across each box represents the median of the data. Notice that these values do not increase strictly from left to right. That’s because with this particular data set, sorting by the mean gives a different order than sorting by the median. To make the median lines increase from left to right, as in the plot on the right in Figure 15.1, we used the median() function in reorder().

The tidyverse function for reordering factors is fct_reorder(), and it is used the same way as reorder(). These do the same thing:
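A sketch of the base R version, using the built-in InsectSprays data set:

```r
iss <- InsectSprays

# Reorder the levels of spray by the median of count within each level
iss$spray <- reorder(iss$spray, iss$count, FUN = median)
levels(iss$spray)

# The forcats equivalent (assuming forcats is installed) would be:
# iss$spray <- forcats::fct_reorder(iss$spray, iss$count, .fun = median)
```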

15.9.4 See Also

Reordering factor levels is also useful for controlling the order of axes and legends. See Recipes 8.4 and 10.3 for more information.

15.10 Changing the Names of Factor Levels

15.10.1 Problem

You want to change the names of levels in a factor.

15.10.3 Discussion

If you want to use two vectors, one with the original levels and one with the new ones, use do.call() with fct_recode().

Or, more concisely, we can do all of that in one go:

For a more traditional (and clunky) base R method for renaming factor levels, use the levels()<- function:

If you are renaming all your factor levels, there is a simpler method. You can pass a list to levels()<-:

With this method, all factor levels must be specified in the list; if any are missing, they will be replaced with NA.
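For example, with a made-up sizes factor:

```r
sizes <- factor(c("small", "large", "small"))

# Rename all the levels at once; every existing level must appear in the list
levels(sizes) <- list(S = "small", L = "large")
levels(sizes)
#> [1] "S" "L"
```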

It’s also possible to rename factor levels by position, but this is somewhat inelegant:

It’s safer to rename factor levels by name rather than by position, since you will be less likely to make a mistake (and mistakes here may be hard to detect). Also, if your input data set changes to have more or fewer levels, the numeric positions of the existing levels could change, which could cause serious but nonobvious problems for your analysis.

15.10.4 See Also

If, instead of a factor, you have a character vector with items to rename, see Recipe 15.12.

15.11 Removing Unused Levels from a Factor

15.11.1 Problem

You want to remove unused levels from a factor.

15.11.2 Solution

Sometimes, after processing your data you will have a factor that contains levels that are no longer used. Here’s an example:

To remove them, use droplevels():
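A minimal sketch with a made-up factor:

```r
sizes <- factor(c("small", "large", "medium"))

# Subsetting removes the "medium" observation but keeps the level
sizes <- sizes[sizes != "medium"]
levels(sizes)
#> [1] "large"  "medium" "small"

# droplevels() removes the unused level
sizes <- droplevels(sizes)
levels(sizes)
#> [1] "large" "small"
```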

15.11.3 Discussion

The droplevels() function preserves the order of factor levels. You can use the except parameter to keep particular levels.

The tidyverse way: Use fct_drop() from the forcats package:

15.12 Changing the Names of Items in a Character Vector

15.12.1 Problem

You want to change the names of items in a character vector.

15.12.4 See Also

If, instead of a character vector, you have a factor with levels to rename, see Recipe 15.10.

15.13 Recoding a Categorical Variable to Another Categorical Variable

15.13.1 Problem

You want to recode a categorical variable to another variable.

15.13.2 Solution

For the examples here, we’ll use a subset of the PlantGrowth data set:

In this example, we’ll recode the categorical variable group into another categorical variable, treatment. If the old value was "ctrl", the new value will be "No", and if the old value was "trt1" or "trt2", the new value will be "Yes".

This can be done with the recode() function from the dplyr package:

You can assign it as a new column in the data frame:

Note that since the input was a factor, it returns a factor. If you want to get a character vector instead, use as.character():
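A sketch of the whole recoding, assuming dplyr is loaded:

```r
library(dplyr)

# PlantGrowth is built-in; group is a factor with levels ctrl, trt1, trt2
pg <- PlantGrowth
pg$treatment <- recode(pg$group, ctrl = "No", trt1 = "Yes", trt2 = "Yes")
is.factor(pg$treatment)   # TRUE: factor in, factor out

# For a character vector instead, wrap the result in as.character()
pg$treatment_chr <- as.character(
  recode(pg$group, ctrl = "No", trt1 = "Yes", trt2 = "Yes")
)
```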

15.13.3 Discussion

You can also use the fct_recode() function from the forcats package. It works the same, except the names and values are swapped, which may be a little more intuitive:

Another difference is that fct_recode() will always return a factor, whereas recode() will return a character vector if it is given a character vector, and will return a factor if it is given a factor. (Although dplyr does have a recode_factor() function which also always returns a factor.)

Using base R, recoding can be done with the match() function:

It can also be done by indexing in the vectors:
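Sketches of both base R approaches on PlantGrowth:

```r
pg <- PlantGrowth
oldvals <- c("ctrl", "trt1", "trt2")
newvals <- c("No",   "Yes",  "Yes")

# match() gives each group value's position in oldvals;
# use those positions to index into newvals
pg$treatment <- newvals[match(pg$group, oldvals)]

# The same result by indexing into the new column directly
pg$treatment2 <- NA
pg$treatment2[pg$group == "ctrl"] <- "No"
pg$treatment2[pg$group %in% c("trt1", "trt2")] <- "Yes"
```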

Here, we combined two of the factor levels and put the result into a new column. If you simply want to rename the levels of a factor, see Recipe 15.10.

The coding criteria can also be based on values in multiple columns, by using the & and | operators:

It’s also possible to combine two columns into one using the interaction() function, which appends the values with a . in between. This combines the weight and group columns into a new column, weightgroup:

15.13.4 See Also

For more on renaming factor levels, see Recipe 15.10.

See Recipe 15.14 for recoding continuous values to categorical values.

15.14 Recoding a Continuous Variable to a Categorical Variable

15.14.1 Problem

You want to recode a continuous variable to another variable.

15.14.2 Solution

Use the cut() function. In this example, we’ll use the PlantGrowth data set and recode the continuous variable weight into a categorical variable, wtclass, using the cut() function:

15.14.3 Discussion

For three categories we specify four bounds, which can include Inf and -Inf. If a data value falls outside of the specified bounds, it’s categorized as NA. The result of cut() is a factor, and you can see from the example that the factor levels are named after the bounds.

To change the names of the levels, set the labels:

As indicated by the factor levels, the bounds are by default open on the left and closed on the right. In other words, they don’t include the lowest value, but they do include the highest value. For the smallest category, you can have it include both the lower and upper values by setting include.lowest=TRUE. In this example, this would result in 0 values going into the small category; otherwise, 0 would be coded as NA.

If you want the categories to be closed on the left and open on the right, set right = FALSE:
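The whole recipe sketched on PlantGrowth (the bounds 0, 5, 6, Inf are illustrative choices):

```r
pg <- PlantGrowth

# Four bounds make three categories; by default, levels are named
# after the bounds: "(0,5]", "(5,6]", "(6,Inf]"
pg$wtclass <- cut(pg$weight, breaks = c(0, 5, 6, Inf))

# Name the categories with labels, and make the intervals
# closed on the left and open on the right with right = FALSE
pg$wtclass <- cut(pg$weight, breaks = c(0, 5, 6, Inf),
                  labels = c("small", "medium", "large"),
                  right = FALSE)
```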

15.14.4 See Also

To recode a categorical variable to another categorical variable, see Recipe 15.13.

15.15 Calculating New Columns From Existing Columns

15.15.1 Problem

You want to calculate a new column of values in a data frame.

15.15.4 See Also

See Recipe 15.16 for how to perform group-wise transformations on data.

15.16 Calculating New Columns by Groups

15.16.1 Problem

You want to create new columns that are the result of calculations performed on groups of data, as specified by a grouping column.

15.16.3 Discussion

Let’s take a closer look at the cabbages data set. It has two grouping variables (factors): Cult, which has levels c39 and c52, and Date, which has levels d16, d20, and d21. It also has two measured numeric variables, HeadWt and VitC:

Suppose we want to find, for each case, the deviation of HeadWt from the overall mean. All we have to do is take the overall mean and subtract it from the observed value for each case:

You’ll often want to do separate operations like this for each group, where the groups are specified by one or more grouping variables. Suppose, for example, we want to normalize the data within each group by finding the deviation of each case from the mean within the group, where the groups are specified by Cult. In these cases, we can use group_by() and mutate() together:

First it groups cabbages based on the value of Cult. There are two levels of Cult, c39 and c52. It then applies the mutate() function to each data frame.
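The cabbages data set requires the MASS package; the same pattern can be sketched with the built-in PlantGrowth data, assuming dplyr is loaded:

```r
library(dplyr)

# Deviation of each weight from the mean of its own group
pg <- PlantGrowth %>%
  group_by(group) %>%
  mutate(dev_wt = weight - mean(weight))
```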

The before and after results are shown in Figure 15.2:


Figure 15.2: Before normalizing (left); After normalizing (right)

You can also group the data frame on multiple variables and perform operations on multiple variables. The following code groups the data by Cult and Date, forming a group for each distinct combination of the two variables. After forming these groups, the code will calculate the deviation of HeadWt and VitC from the mean of each group:

15.16.4 See Also

To summarize data by groups, see Recipe 15.17.

15.17 Summarizing Data by Groups

15.17.1 Problem

You want to summarize your data, based on one or more grouping variables.

15.17.3 Discussion

There are a few things going on here that may be unfamiliar if you’re new to dplyr and the tidyverse in general.

First, let’s take a closer look at the cabbages data set. It has two factors that can be used as grouping variables: Cult, which has levels c39 and c52, and Date, which has levels d16, d20, and d21. It also has two numeric variables, HeadWt and VitC:

Finding the overall mean of HeadWt is simple. We could just use the mean() function on that column, but for reasons that will soon become clear, we’ll use the summarise() function instead:

The result is a data frame with one row and one column, named Weight.

Often we want to find information about each subset of the data, as specified by a grouping variable. For example, suppose we want to find the mean of each Cult group. To do this, we can use summarise() with group_by().

The command first groups the data frame cabbages based on the value of Cult. There are two levels of Cult, c39 and c52, so there are two groups. It then applies the summarise() function to each of these data frames; it calculates Weight by taking the mean() of the HeadWt column in each of the sub-data frames. The resulting summaries for each group are assembled into a data frame, which is returned.
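Sketched with the built-in PlantGrowth data as a stand-in for cabbages, assuming dplyr is loaded:

```r
library(dplyr)

# One row per group, containing the mean weight of that group
res <- PlantGrowth %>%
  group_by(group) %>%
  summarise(Weight = mean(weight))
res
```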

You can imagine that the cabbages data is split up into two separate data frames, then summarise() is called on each data frame (returning a one-row data frame for each), and then those results are combined together into a final data frame. This is actually how things worked in dplyr’s predecessor, plyr, with the ddply() function.

The syntax of the previous code used a temporary variable to store results. That’s a little verbose, so instead, we can use %>%, also known as the pipe operator, to chain the function calls together. The pipe operator simply takes what’s on its left and substitutes it as the first argument of the function call on the right. The following two lines of code are equivalent:
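For example, assuming dplyr is loaded, these two calls produce identical results:

```r
library(dplyr)

# %>% takes what's on its left and passes it as the
# first argument of the function call on its right
f1 <- filter(PlantGrowth, group == "ctrl")
f2 <- PlantGrowth %>% filter(group == "ctrl")
```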

The reason it’s called a pipe operator is that it lets you connect function calls together in sequence to form a pipeline of operations. Another common term for this is a different metaphor: chaining.

So the first argument of the function call is in a different place. So what? The advantages become apparent when chaining is involved. Here’s what it would look like if you wanted to call group_by() and then summarise() without making use of a temporary variable. Instead of proceeding left to right, the computation occurs from the inside out:

Using a temporary variable, as we did earlier, makes it more readable, but a more elegant solution is to use the pipe operator:

Back to summarizing data. Summarizing the data frame by grouping using more variables (or columns) is simple: just give it the names of the additional variables. It’s also possible to get more than one summary value by specifying more calculated columns. Here we’ll summarize each Cult and Date group, getting the average of HeadWt and VitC:

Note

You might have noticed that it says that the result is grouped by Cult, but not Date. This is because the summarise() function removes one level of grouping. This is typically what you want when the input has one grouping variable. When there are multiple grouping variables, this may or may not be what you want. To remove all grouping, use ungroup(), and to add back the original grouping, use group_by() again.

It’s possible to do more than take the mean. You may, for example, want to compute the standard deviation and count of each group. To get the standard deviation, use sd(), and to get a count of rows in each group, use n():

Other useful functions for generating summary statistics include min(), max(), and median(). The n() function is a special function that works only inside of the dplyr functions summarise(), mutate() and filter(). See ?summarise for more useful functions.

The n() function gets a count of rows, but if you want to have it not count NA values from a column, you need to use a different technique. For example, if you want it to ignore any NAs in the HeadWt column, use sum(!is.na(HeadWt)).


15.17.3.1 Dealing with NAs

One potential pitfall is that NAs in the data will lead to NAs in the output. Let’s see what happens if we sprinkle a few NAs into HeadWt:

The problem is that mean() and sd() simply return NA if any of the input values are NA. Fortunately, these functions have an option to deal with this very issue: setting na.rm=TRUE will tell them to ignore the NAs.
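A minimal illustration with a made-up vector:

```r
x <- c(5, 3, NA, 8)

mean(x)                # NA: a single missing value poisons the result
mean(x, na.rm = TRUE)  # the NAs are dropped before computing the mean
sd(x, na.rm = TRUE)
```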

15.17.3.2 Missing combinations

If there are any empty combinations of the grouping variables, they will not appear in the summarized data frame. These missing combinations can cause problems when making graphs. To illustrate, we’ll remove all entries that have levels c52 and d21. The graph on the left in Figure 15.3 shows what happens when there’s a missing combination in a bar graph:

To fill in the missing combination (Figure 15.3, right), use the complete() function from the tidyr package – which is also part of the tidyverse. Also, the grouping for c2a must be removed, with ungroup(); otherwise it will return too many rows.


Figure 15.3: Bar graph with a missing combination (left); With missing combination filled (right)

When we used complete(), it filled in the missing combinations with NA. It’s possible to fill with a different value, with the fill parameter. See ?complete for more information.

15.17.4 See Also

If you want to calculate standard errors and confidence intervals, see Recipe 15.18.

See Recipe 6.8 for an example of using stat_summary() to calculate means and overlay them on a graph.

To perform transformations on data by groups, see Recipe 15.16.

15.18 Summarizing Data with Standard Errors and Confidence Intervals

15.18.1 Problem

You want to summarize your data with the standard error of the mean and/or confidence intervals.

15.18.3 Discussion

The summarise() function computes the columns in order, so you can refer to previous newly-created columns. That’s why se can use the sd and n columns.


15.18.3.1 Confidence Intervals

Confidence intervals are calculated using the standard error of the mean and the degrees of freedom. To calculate a confidence interval, use the qt() function to get the quantile, then multiply that by the standard error. The qt() function will give quantiles of the t-distribution when given a probability level and degrees of freedom. For a 95% confidence interval, use a probability level of .975; for the bell-shaped t-distribution, this will in essence cut off 2.5% of the area under the curve at either end. The degrees of freedom equal the sample size minus one.

This will calculate the multiplier for each group. There are six groups and each has the same number of observations (10), so they will all have the same multiplier:

Now we can multiply that vector by the standard error to get the 95% confidence interval:

This could be done in one line, like this:

For a 99% confidence interval, use .995.
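A sketch of the calculation for one group, with a made-up sample of 10 observations:

```r
# Multiplier for a 95% CI with n = 10 observations (9 degrees of freedom)
n <- 10
ciMult <- qt(.975, df = n - 1)
ciMult
#> [1] 2.262157

# Multiply by the standard error of the mean to get the CI half-width
x  <- c(5, 7, 3, 9, 6, 4, 8, 5, 6, 7)   # hypothetical data
se <- sd(x) / sqrt(n)
ci95 <- ciMult * se
c(mean(x) - ci95, mean(x) + ci95)
```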

Error bars that represent the standard error of the mean and confidence intervals serve the same general purpose: to give the viewer an idea of how good the estimate of the population mean is. The standard error is the standard deviation of the sampling distribution. Confidence intervals are a little easier to interpret. Very roughly, a 95% confidence interval means that there’s a 95% chance that the true population mean is within the interval (actually, it doesn’t mean this at all, but this seemingly simple topic is way too complicated to cover here; if you want to know more, read up on Bayesian statistics).

This function will perform all the steps of calculating the standard deviation, count, standard error, and confidence intervals. It can also handle NAs and missing combinations, with the na.rm and .drop options. By default, it provides a 95% confidence interval, but this can be set with the conf.interval argument:

The following usage example has a 99% confidence interval and handles NAs and missing combinations:

15.18.4 See Also

See Recipe 7.7 to use the values calculated here to add error bars to a graph.

15.19 Converting Data from Wide to Long

15.19.1 Problem

You want to convert a data frame from “wide” format to “long” format.

15.19.2 Solution

Use gather() from the tidyr package. In the anthoming data set, for each angle, there are two measurements: one column contains measurements in the experimental condition and the other contains measurements in the control condition:

We can reshape the data so that all the measurements are in one column. This will put the values from expt and ctrl into one column, and put the names into a different column:
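The anthoming data set comes from the gcookbook package; here is a toy stand-in with made-up values showing the same reshaping, assuming tidyr is loaded:

```r
library(tidyr)

# One row per angle, with separate expt and ctrl measurement columns
anth <- data.frame(
  angle = c(-20, -10, 0),
  expt  = c(1, 7, 2),     # made-up counts
  ctrl  = c(0, 3, 3)
)

# Put the column names into "condition" and the values into "count"
anth_long <- gather(anth, condition, count, expt, ctrl)
```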

This data frame represents the same information as the original one, but it is structured in a way that is more conducive to some analyses.

15.19.3 Discussion

In the source data, there are ID variables and value variables. The ID variables are those that specify which values go together. In the source data, the first row holds measurements for when angle is –20. In the output data frame, the two measurements, for expt and ctrl, are no longer in the same row, but we can still tell that they belong together because they have the same value of angle.

The value variables are by default all the non-ID variables. The names of these variables are put into a new key column, which we called condition, and the values are put into a new value column which we called count.

You can designate the value columns from the source data by naming them individually, as we did above with expt and ctrl. gather() automatically inferred that the ID variable was the remaining column, angle. Another way to tell it which columns are values is to do the reverse: if you exclude the angle column, then gather() will infer that the value columns are the remaining ones, expt and ctrl.

There are other convenient shortcuts to specify which columns are values. For example expt:ctrl means to select all columns between expt and ctrl (in this particular case, there are no other columns in between, but for a larger data set you can imagine how this would save typing).

By default, gather() will use all of the columns from the source data as either ID columns or value columns. That means that if you want to ignore some columns, you’ll need to filter them out first using the select() function.

For example, in the drunk data set, suppose we want to convert it to long format, keeping sex in one column and putting the numeric values in another column. This time, we want the values for only the 0-29 and 30-39 columns, and we want to discard the values for the other age ranges:

That doesn’t look right! We told gather() that 0-29 and 30-39 were the value columns we wanted, and it automatically inferred that we wanted to use all of the other columns as ID columns, when we wanted to just keep sex and discard the others. The solution is to use select() to remove the unwanted columns first, and then gather().

There are times where you may want to use more than one column as the ID variables:

Some data sets don’t come with a column with an ID variable. For example, in the corneas data set, each row represents one pair of measurements, but there is no ID variable. Without an ID variable, you won’t be able to tell how the values are meant to be paired together. In these cases, you can add an ID variable before using gather():

Having numeric values for the ID variable may be problematic for subsequent analyses, so you may want to convert id to a character vector with as.character(), or a factor with factor().

15.19.4 See Also

See Recipe 15.20 to do conversions in the other direction, from long to wide.

See the stack() function for another way of converting from wide to long.

15.20 Converting Data from Long to Wide

15.20.1 Problem

You want to convert a data frame from “long” format to “wide” format.

15.20.2 Solution

Use the spread() function from the tidyr package. In this example, we’ll use the plum data set, which is in a long format:

The conversion to wide format takes each unique value in one column and uses those values as headers for new columns, then uses another column for source values. For example, we can “move” values in the survival column to the top and fill them with values from count:
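The plum data set comes from the gcookbook package; here is a toy stand-in with made-up counts showing the same reshaping, assuming tidyr is loaded:

```r
library(tidyr)

# A small long-format stand-in for plum
plum <- data.frame(
  length   = rep(c("long", "short"), each = 2),
  survival = rep(c("died", "alive"), 2),
  count    = c(84, 156, 133, 107)   # made-up values
)

# Values of survival become column headers, filled from count
plum_wide <- spread(plum, survival, count)
```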

15.20.3 Discussion

The spread() function requires you to specify a key column which is used for header names, and a value column which is used to fill the values in the output data frame. It’s assumed that you want to use all the other columns as ID variables.

In the preceding example, there are two ID columns, length and time, one key column, survival, and one value column, count. What if we want to use two of the columns as keys? Suppose, for example, that we want to use length and survival as keys. This would leave us with time as the ID column.

The way to do this is to combine the length and survival columns together and put it in a new column, then use that new column as a key.

15.20.4 See Also

See Recipe 15.19 to do conversions in the other direction, from wide to long.

See the unstack() function for another way of converting from long to wide.

15.21 Converting a Time Series Object to Times and Values

15.21.1 Problem

You have a time series object that you wish to convert to numeric vectors representing the time and values at each time.

15.21.3 Discussion

Time series objects efficiently store information when there are observations at regular time intervals, but for use with ggplot, they need to be converted to a format that separately represents times and values for each observation.

Some time series objects are cyclical. The presidents data set, for example, contains four observations per year, one for each quarter:

To convert it to a two-column data frame with one column representing the year with fractional values, we can do the same as before:

It is also possible to store the year and quarter in separate columns, which may be useful in some visualizations:
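Both conversions sketched with the built-in presidents time series:

```r
# time() gives fractional years (1945.00, 1945.25, ...);
# as.numeric() strips the ts attributes from times and values
pres_df <- data.frame(
  year   = as.numeric(time(presidents)),
  rating = as.numeric(presidents)
)

# Alternatively, store the year and quarter in separate columns;
# cycle() gives the position within each year (1 through 4)
pres_df2 <- data.frame(
  year    = floor(as.numeric(time(presidents))),
  quarter = as.numeric(cycle(presidents)),
  rating  = as.numeric(presidents)
)
```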

15.21.4 See Also

The zoo package is also useful for working with time series objects.