Chapter 15 Getting Your Data into Shape

When it comes to making data graphics, half the battle occurs before you call any plotting commands. Before you pass your data to the plotting functions, it must first be read in and given the correct structure. The data sets provided with R are ready to use, but when dealing with real-world data, this usually isn’t the case: you’ll have to clean up and restructure the data before you can visualize it.

The recipes in this chapter will often use packages from the tidyverse. For a little background about the tidyverse, see the introduction section of Chapter 1. I will also show how to do many of the same tasks using base R, because in some situations it is important to minimize the number of packages you use, and because it is useful to be able to understand code written for base R.

Note

The `%>%` symbol, also known as the pipe operator, is used extensively in this chapter. If you are not familiar with it, see Recipe 1.7.

Most of the tidyverse functions used in this chapter are from the dplyr package, and I’ll assume that dplyr is already loaded. You can load it along with the rest of the tidyverse by calling `library(tidyverse)`, or, if you want to keep things more streamlined, you can load dplyr directly:

```r
library(dplyr)
```
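As a quick reminder of what the pipe does, here is a minimal sketch. It uses the built-in `mtcars` data set and the dplyr functions `filter()` and `select()`. The choice of data set and columns is only for illustration and is not something the rest of this chapter depends on.

```r
library(dplyr)

# With the pipe: each step receives the result of the previous one
# as its first argument.
piped <- mtcars %>%
  filter(cyl == 4) %>%   # keep only the four-cylinder cars
  select(mpg, wt) %>%    # keep two columns
  head(3)                # first three rows

# The same operations written as nested calls, read from the inside out
nested <- head(select(filter(mtcars, cyl == 4), mpg, wt), 3)

identical(piped, nested)
```

The two forms do the same work. The piped version simply lists the steps in the order they happen.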

Data sets in R are most often stored in data frames. They’re typically used as two-dimensional data structures, with each row representing one case and each column representing one variable. Data frames are essentially lists of vectors and factors, all of the same length, where each vector or factor represents one column.
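To see the "list of vectors and factors" idea in action, here is a small sketch in base R. The data frame `df` and its values are made up for illustration and are not taken from any data set in this book.

```r
# Hypothetical toy data: one factor and one numeric vector, same length
sex    <- factor(c("f", "m", "f"))
height <- c(56.3, 62.0, 59.3)

df <- data.frame(sex = sex, height = height)

is.list(df)    # a data frame is a list under the hood
length(df)     # number of list elements, which is the number of columns
nrow(df)       # number of cases

df$height      # pull out one column as a plain vector
df[["sex"]]    # the same kind of extraction, using list indexing

# Columns must have compatible lengths; this call fails with an error,
# so it is wrapped in try() to let the script continue.
bad <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(bad, "try-error")
```

Because each column is just a list element, the usual list tools such as `$`, `[[ ]]`, and `lapply()` work on data frames as well.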

Here’s the `heightweight` data set:

```r
library(gcookbook) # Load gcookbook for the heightweight data set
heightweight
#>     sex ageYear ageMonth heightIn weightLb
#> 1     f   11.92      143     56.3     85.0
#> 2     f   12.92      155     62.3    105.0
#>  ...<232 more rows>...
#> 236   m   13.92      167     62.0    107.5
#> 237   m   12.58      151     59.3     87.0
```

It consists of five columns, with each row representing one case: a set of information about a single person. We can get a clearer idea of how it’s structured by using the `str()` function:

```r
str(heightweight)
#> 'data.frame':    236 obs. of  5 variables:
#>  $ sex     : Factor w/ 2 levels "f","m": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ ageYear : num  11.9 12.9 12.8 13.4 15.9 ...
#>  $ ageMonth: int  143 155 153 161 191 171 185 142 160 140 ...
#>  $ heightIn: num  56.3 62.3 63.3 59 62.5 62.5 59 56.5 62 53.8 ...
#>  $ weightLb: num  85 105 108 92 112 ...
```

The first column, `sex`, is a factor with two levels, `"f"` and `"m"`, and the other four columns are vectors of numbers (one of them, `ageMonth`, is specifically a vector of integers, but for the purposes here, it behaves the same as any other numeric vector).
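The point about integer vectors can be checked directly. The sketch below rebuilds the first three `ageMonth` values from the output above as a standalone vector, so it runs without gcookbook. The name `age_month` is just for this example.

```r
# First three ageMonth values from the str() output; the L suffix
# makes them integers rather than doubles.
age_month <- c(143L, 155L, 153L)

class(age_month)        # "integer"
is.numeric(age_month)   # TRUE: integers count as numeric

# Arithmetic works as with any numeric vector; the result of
# division is a double.
age_year <- age_month / 12
class(age_year)         # "numeric"
```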

Factors and character vectors behave similarly in ggplot – the main difference is that with character vectors, items will be displayed in lexicographical order, but with factors, items will be displayed in the same order as the factor levels, which you can control.
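The ordering difference is easy to see in base R, before any plotting is involved. The following is a sketch with a made-up vector, `sizes`; the values are hypothetical.

```r
# Hypothetical categorical values stored as a character vector
sizes <- c("small", "large", "medium", "small")

# Sorted as strings, the values come out alphabetically
sort(unique(sizes))      # "large" "medium" "small"

# By default, factor() also orders its levels alphabetically
levels(factor(sizes))    # "large" "medium" "small"

# Supplying levels explicitly sets the order, and that is the order
# a discrete axis or legend would follow for this factor.
sizes_f <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_f)          # "small" "medium" "large"
```

Converting a character column to a factor with explicit levels is therefore a common way to control the order in which categories are displayed.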