# Chapter 5 Scatter Plots

Scatter plots are used to display the relationship between two continuous variables. In a scatter plot, each observation in a data set is represented by a point. Often, a scatter plot will also have a line showing the predicted values based on some statistical model. Adding this line is easy to do with R and the ggplot2 package, and can help to make sense of data when the trends aren’t immediately obvious just by looking at the points.

With large data sets, plotting every single observation in the data set can result in overplotting, when points overlap and obscure one another. To deal with the problem of overplotting, you’ll probably want to summarize the data before displaying it. We’ll also see how to do that in this chapter.

## 5.1 Making a Basic Scatter Plot

### 5.1.1 Problem

You want to make a scatter plot using two continuous variables.

### 5.1.2 Solution

Use `geom_point()`

, and map one variable to `x`

and one variable to `y`

.

We will use the `heightweight`

data set. There are a number of columns in this data set, but we’ll only use two in this example (Figure 5.1):

```
library(gcookbook) # Load gcookbook for the heightweight data set
library(dplyr)
# Show the head of the two columns we'll use in the plot
heightweight %>%
select(ageYear, heightIn)
#> ageYear heightIn
#> 1 11.92 56.3
#> 2 12.92 62.3
#> 3 12.75 63.3
#> ...<230 more rows>...
#> 235 13.67 61.5
#> 236 13.92 62.0
#> 237 12.58 59.3
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
geom_point()
```

### 5.1.3 Discussion

Instead of points, you can use different shapes for your scatter plot by using the `shape`

aesthetic. A common alternative to the default solid circles (shape #19) is hollow ones (#21), as seen in Figure 5.2 (left):

```
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
geom_point(shape = 21)
```

The size of the points can be controlled with the `size`

aesthetic. The default value of size is 2 (`size = 2`

). The following code will set `size = 1.5`

to create smaller points (Figure 5.2, right):

```
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
geom_point(size = 1.5)
```

## 5.2 Grouping Points Together using Shapes or Colors

### 5.2.1 Problem

You want to visually group points by some variable (the grouping variable), using different shapes or colors.

### 5.2.2 Solution

Map the grouping variable to the aesthetic of `shape`

or `colour`

. We’ll use three columns from the `heightweight`

data set for this example:

```
library(gcookbook) # Load gcookbook for the heightweight data set
# Show the head of the three columns we'll use
heightweight %>%
select(sex, ageYear, heightIn)
#> sex ageYear heightIn
#> 1 f 11.92 56.3
#> 2 f 12.92 62.3
#> 3 f 12.75 63.3
#> ...<230 more rows>...
#> 235 m 13.67 61.5
#> 236 m 13.92 62.0
#> 237 m 12.58 59.3
```

We can use the aesthetics of `colour`

or `shape`

to visually differentiate the data points belonging to different categories of `sex`

. We do this by mapping `sex`

to one of the aesthetics `colour`

or `shape`

(Figure5.3):

### 5.2.3 Discussion

The grouping variable you choose must be categorical – in other words, a factor or character vector. If the grouping variable is a numeric vector, you should convert it to a factor first.

It is possible to map a variable to both `shape`

and `colour`

, or, if you have multiple grouping variables, to map each grouping variable to a different aesthetic. Here, we’ll map the variable `sex`

to both `shape`

and `colour`

aesthetics (Figure 5.4, left):

```
ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex, colour = sex)) +
geom_point()
```

You may want to use different shapes and colors than are given by the default settings. You can select other shapes for the grouping variables using `scale_shape_manual()`

, and select other colors using `scale_colour_brewer()`

or `scale_colour_manual()`

. (Figure 5.4, right):

```
ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex, colour = sex)) +
geom_point() +
scale_shape_manual(values = c(1,2)) +
scale_colour_brewer(palette = "Set1")
```

## 5.3 Using Different Point Shapes

### 5.3.1 Problem

You want to change the default scatterplot shapes for the data points.

### 5.3.2 Solution

You can set the shape of all the data points at once (Figure 5.5, left) by setting a shape in `geom_point()`

:

```
library(gcookbook) # Load gcookbook for the heightweight data set
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
geom_point(shape = 3)
```

If you have mapped a variable to `shape`

, you can use `scale_shape_manual()`

to manually change the shapes mapped to the levels of that variable:

```
# Use slightly larger points and use custom values for the shape scale
ggplot(heightweight, aes(x = ageYear, y = heightIn, shape = sex)) +
geom_point(size = 3) +
scale_shape_manual(values = c(1, 4))
```

### 5.3.3 Discussion

Figure 5.6 shows the shapes that are already built into R. Some of the point shapes (1–14) only have an outline; some (15–20) have solid fill; and some (21–25) have an outline and fill that can be controlled separately. You can also use characters for points.

For shapes 1–20, the color of the entire point – even the points that have solid fill – is controlled by the `colour`

aesthetic. For shapes 21–25, the outline is controlled by `colour`

and the fill is controlled by `fill`

.

It’s possible to have one variable represented by the shape of a point, and and another variable represented by the fill (empty or filled) of the point. To do this, you need to first choose point shapes that have both colour and fill, and set these in `scale_shape_manual`

. You then need to choose a fill palette that includes `NA`

and another color (the `NA`

will result in a hollow shape) and use these in `scale_fill_manual()`

.

For example, we’ll take the `heightweight`

data set and add another column that indicates whether the child weighed 100 pounds or more (Figure 5.7):

```
# Using the heightweight data set, create a new column that indicates if the
# child weighs < 100 or >= 100 pounds. We'll save this modified dataset as 'hw'.
hw <- heightweight %>%
mutate(weightgroup = ifelse(weightLb < 100, "< 100", ">= 100"))
# Specify shapes with fill and color, and specify fill colors that includes an empty (NA) color
ggplot(hw, aes(x = ageYear, y = heightIn, shape = sex, fill = weightgroup)) +
geom_point(size = 2.5) +
scale_shape_manual(values = c(21, 24)) +
scale_fill_manual(
values = c(NA, "black"),
guide = guide_legend(override.aes = list(shape = 21))
)
```

## 5.4 Mapping a Continuous Variable to Color or Size

### 5.4.1 Problem

You want to represent a third continuous variable using color or size.

### 5.4.2 Solution

Map the continuous variable to `size`

or `colour`

. We will use the `heightweight`

data set for this example. There are many columns in this data set, but we’ll only use four of them in this example:

```
library(gcookbook) # Load gcookbook for the heightweight data set
# Show the head of the four columns we'll use
heightweight %>%
select(sex, ageYear, heightIn, weightLb)
#> sex ageYear heightIn weightLb
#> 1 f 11.92 56.3 85.0
#> 2 f 12.92 62.3 105.0
#> 3 f 12.75 63.3 108.0
#> ...<230 more rows>...
#> 235 m 13.67 61.5 140.0
#> 236 m 13.92 62.0 107.5
#> 237 m 12.58 59.3 87.0
```

The basic scatter plot in Recipe 5.1 shows the relationship between the continuous variables `ageYear`

and `heightIn`

. We can represent a third continuous variable, `weightLb`

, by mapping this variable to another aesthetic property, such as `colour`

or `size`

(Figure 5.8:

```
ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = weightLb)) +
geom_point()
ggplot(heightweight, aes(x = ageYear, y = heightIn, size = weightLb)) +
geom_point()
```

### 5.4.3 Discussion

A basic scatter plot shows the relationship between two continuous variables: one mapped to the x-axis, and one to the y-axis. When there are more than two continuous variables, these additional variables must be mapped to other aesthetics, like `size`

and `color`

.

Humans can easily perceive small differences in spatial position, so we can interpret the variables mapped to *x* and *y* coordinates with high precision. Humans aren’t as good at perceiving small differences in `size`

and `color`

though, so we will interpret variables mapped to these aesthetic attributes with much lower precision. Therefore, when you map a variable to `size`

or `color`

, make sure it is a variable where high precision is not very important for correctly intepreting the data.

There is another consideration when mapping a variable to `size`

, which is that the results can be perceptually misleading. While the largest dots in Figure 5.8 are about 36 times the size of the smallest ones, they are only supposed to represent about 3.5 times the weight of the smallest dots.

This relative misrepresentation of size happens because the default values in ggplot2 for the diameter of points ranges from 1 to 6mm, regardless of the actual data values. For example, if the data values range from 0 to 10, the smallest value of 0 will be represented on the plot with a point that is 1mm wide, while the largest value of 10 will be represented on the plot with a point that is 6mm wide. Similarly, if the data values range from 100 to 110, the smallest value of 100 will still be represented by a point that is 1mm wide, and the largest value of 110 will be represented by a point that is 6mm wide. Thus regardless of the actual data values, the largest point will have a diameter that is 6 times the diameter of the smallest point, and will be 36 times the area.

If it is important for the size of the points to accurately reflect the proportional differences of your data values, you should first decide if you want the diameter of the points to represent the data values, or if you want to area of the points to represent the data values. Figure 5.9 shows the difference between these representations.

```
range(heightweight$weightLb)
#> [1] 50.5 171.5
size_range <- range(heightweight$weightLb) / max(heightweight$weightLb) * 6
size_range
#> [1] 1.766764 6.000000
ggplot(heightweight, aes(x = ageYear, y = heightIn, size = weightLb)) +
geom_point() +
scale_size_continuous(range = size_range)
ggplot(heightweight, aes(x = ageYear, y = heightIn, size = weightLb)) +
geom_point() +
scale_size_area()
```

See Recipe 5.12 for details on making the area of points proportional to the data values.

When it comes to color, there are actually two aesthetic attributes that can be used: `color`

and `fill`

. You will use `color`

for most point shapes. However, shapes 21–25 have an outline with a solid region in the middle where the color is controlled by fill. These outlined shapes can be useful when using a color scale with light colors as in Figure 5.10, because the outline sets the shapes off from the background. In this example, we also set the fill gradient to go from black to white and make the points larger so that the fill is easier to see:

Mapping a continuous variable to an aesthetic doesn’t prevent us from mapping a categorical variable to other aesthetics. In Figure 5.11, we’ll map `weightLb`

to `size`

, and also map `sex`

to `color`

. Because there is a fair amount of overplotting (where the points overlap), we’ll make the points 50% transparent by setting `alpha = .5`

. We’ll also use `scale_size_area()`

to make the area of the points proportional to the data values (see Recipe 5.12), and manually change the color palette:

When a variable is mapped to `size`

, it’s a good idea to *not* map a variable to `shape`

. This is because it is difficult to compare the sizes of different shapes; for example, a size 4 triangle could appear larger than a size 3.5 circle. Also, some of the shapes really are different sizes: shapes 16 and 19 are both circles, but at any given numeric size, shape 19 circles are visually larger than shape 16 circles.

## 5.5 Dealing with Overplotting

### 5.5.1 Problem

You have many points that overlap and obscure each other when plotted.

### 5.5.2 Solution

With large data sets, the points in a scatter plot may overlap and obscure each other and prevent the viewer from accurately assessing the distribution of the data. This is called *overplotting*. If the amount of overplotting is low, you may be able to alleviate the problem by using smaller points, or by using a different shape (like shape 1, a hollow circle) through which other points can be seen. Figure 5.2 in Recipe 5.1 demonstrates both of these solutions.

If there’s a high degree of overplotting, there are a number of possible solutions:

- Make the points semi-transparent
- Bin the data into rectangles (better for quantitative analysis)
- Bin the data into hexagons
- Use box plots

### 5.5.3 Discussion

The scatter plot in Figure 5.12 contains about 54,000 points. They are heavily overplotted, making it impossible to get a sense of the relative density of points in different areas of the graph:

```
# We'll use the diamonds data set and create a base plot called `diamonds_sp`
diamonds_sp <- ggplot(diamonds, aes(x = carat, y = price))
diamonds_sp +
geom_point()
```

We can make the points semitransparent using the `alpha`

aesthetic, as in Figure 5.13. Here, we’ll make them 90% transparent and then 99% transparent, by setting `alpha = .1`

and `alpha = .01`

:

```
diamonds_sp +
geom_point(alpha = .1)
diamonds_sp +
geom_point(alpha = .01)
```

Now we can see that there appear to be vertical bands at nice round values of carats, indicating that diamonds tend to be cut to those sizes. Still, the data is so dense that even when the points are 99% transparent, much of the graph appears black and the data distribution is still somewhat obscured.

NoteFor most plots, vector formats (such as PDF, EPS, and SVG) result in smaller output files than bitmap formats (such as TIFF and PNG). But in cases where there are tens of thousands of points, vector output files can be very large and slow to render – the scatter plot here with 99% transparent points is a 1.5 MB PDF! In these cases, high-resolution bitmaps will be smaller and faster to display on computer screens. See Chapter 14 for more information.

Another solution is to *bin* the points into rectangles and map the density of the points to the fill color of the rectangles, as shown in Figure 5.14. With the binned visualization, the vertical bands are barely visible. The density of points in the lower-left corner is much greater, which tells us that the vast majority of diamonds are small and inexpensive.

By default, `stat_bin_2d()`

divides the space into 30 groups in the *x* and *y* directions, for a total of 900 bins. In the second version, we increase the number of bins with `bins = 50`

.

The default colors are somewhat difficult to distinguish because they don’t vary much in luminosity. In the second version we set the colors by using `scale_fill_gradient()`

and by specifying the low and high colors. By default, the legend doesn’t show an entry for the lowest values. This is because the range of the color scale starts not from zero, but from the smallest nonzero quantity in a bin – probably 1, in this case. To make the legend show a zero (as in Figure 5.14, right), we can manually set the range from 0 to the maximum, 6000, using limits (Figure 5.14, left):

```
diamonds_sp +
stat_bin2d()
diamonds_sp +
stat_bin2d(bins = 50) +
scale_fill_gradient(low = "lightblue", high = "red", limits = c(0, 6000))
```

Another alternative is to bin the data into hexagons instead of rectangles, with `stat_binhex()`

(Figure 5.15). It works just like `stat_bin2d()`

. To use `stat_binhex()`

, you must first install the hexbin package, with the command `install.packages("hexbin")`

:

```
library(hexbin) # Load the hexbin library to access stat_binhex()
diamonds_sp +
stat_binhex() +
scale_fill_gradient(low = "lightblue", high = "red", limits = c(0, 8000))
diamonds_sp +
stat_binhex() +
scale_fill_gradient(
low = "lightblue", high = "red", limits = c(0, 6000),
breaks = c(0, 250, 500, 1000, 2000, 4000, 6000)
)
```

For both of these methods, if you manually specify the range and there is a bin that falls outside that range because it has too many or too few points, that bin will show up as grey rather than the color at the high or low end of the range, as seen in the graph on the right in Figure 5.15.

Overplotting can also occur when the data is *discrete* on one or both axes, as shown in Figure 5.16. In these cases, you can randomly *jitter* the points with `position_jitter()`

. By default the amount of jitter is 40% of the resolution of the data in each direction, but these amounts can be controlled with `width`

and `height`

:

```
# We'll use the ChickWeight data set and create a base plot called `cw_sp` (for ChickWeight scatter plot)
cw_sp <- ggplot(ChickWeight, aes(x = Time, y = weight))
cw_sp +
geom_point()
cw_sp +
geom_point(position = "jitter") # Could also use geom_jitter(), which is equivalent
cw_sp +
geom_point(position = position_jitter(width = .5, height = 0))
```

When the data has one discrete axis and one continuous axis, it might make sense to use box plots, as shown in Figure 5.17. This will convey a different story than a standard scatter plot because a box plot will obscure the *number* of data points at each location on the discrete axis. This may be problematic in some cases, but desirable in others.

When we look at the `ChickWeight`

data we know that we conceptually want to treat `Time`

as a discrete variable. However since `Time`

is taken as a numerical variable by default, ggplot doesn’t know to group the data to form each boxplot box. If you don’t tell ggplot how to group the data, you get a result like the graph on the right in Figure 5.17. To tell it how to group the data, use `aes(group = ...)`

. In this case, we’ll group by each distinct value of `Time`

:

```
cw_sp +
geom_boxplot(aes(group = Time))
cw_sp +
geom_boxplot() # Without groups
```

### 5.5.4 See Also

Instead of binning the data, it may be useful to display a 2D density estimate. To do this, see Recipe 6.12.

## 5.6 Adding Fitted Regression Model Lines

### 5.6.1 Problem

You want to add lines from a fitted regression model to a scatter plot.

### 5.6.2 Solution

To add a linear regression line to a scatter plot, add `stat_smooth()`

and tell it to use `method = lm`

. This instructs ggplot to fit the data with the `lm()`

(linear model) function. First we’ll save the base plot object in `sp`

, then we’ll add different components to it:

```
library(gcookbook) # Load gcookbook for the heightweight data set
# We'll use the heightweight data set and create a base plot called `hw_sp` (for heighweight scatter plot)
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn))
hw_sp +
geom_point() +
stat_smooth(method = lm)
```

By default, `stat_smooth()`

also adds a 95% confidence region for the regression fit. The confidence interval can be changed by modifying the value for `level`

, or it can be disabled with `se = FALSE`

(Figure 5.18):

```
# 99% confidence region
hw_sp +
geom_point() +
stat_smooth(method = lm, level = 0.99)
# No confidence region
hw_sp +
geom_point() +
stat_smooth(method = lm, se = FALSE)
```

The default color of the fit line is blue. This can be change by setting `colour`

. As with any other line, the attributes `linetype`

and `size`

can also be set. To emphasize the line, you can make the dots less prominent by changing the `colour`

of the points (Figure 5.18, bottom right):

```
hw_sp +
geom_point(colour = "grey60") +
stat_smooth(method = lm, se = FALSE, colour = "black")
```

### 5.6.3 Discussion

The linear regression line is not the only way of fitting a model to the data – in fact, it’s not even the default. If you add `stat_smooth()`

without specifying the method, it will use a LOESS (locally weighted polynomial) curve by default, as shown in Figure 5.19:

```
hw_sp +
geom_point(colour = "grey60") +
stat_smooth()
# Equivalent to:
hw_sp +
geom_point(colour = "grey60") +
stat_smooth(method = loess)
```

It may be useful to specify additional parameters for the modeling function, which in this case is `loess()`

. If, for example, you wanted to use `loess(degree = 1)`

, you would call `stat_smooth(method = loess, method.args = list(degree = 1))`

. The same could be done for other modeling functions like `lm()`

or `glm()`

.

Another common type of model fit is a logistic regression. Logistic regression isn’t appropriate for `heightweight`

, but it’s perfect for the `biopsy`

data set in the `MASS`

package. In the `biopsy`

data, there are nine different measured attributes of breast cancer biopsies, as well as the class of the tumor, which is either benign or malignant. To prepare the data for logistic regression, we must convert the factor `class`

, with the levels `benign`

and `malignant`

, to a vector with numeric values of 0 and 1. We’ll make a copy of the `biopsy`

data frame called `biopsy_mod`

, then store the numeric coded `class`

in a column called `classn`

:

```
library(MASS) # Load MASS for the biopsy data set
biopsy_mod <- biopsy %>%
mutate(classn = recode(class, benign = 0, malignant = 1))
biopsy_mod
#> ID V1 V2 V3 V4 V5 V6 V7 V8 V9 class classn
#> 1 1000025 5 1 1 1 2 1 3 1 1 benign 0
#> 2 1002945 5 4 4 5 7 10 3 2 1 benign 0
#> 3 1015425 3 1 1 1 2 2 3 1 1 benign 0
#> ...<693 more rows>...
#> 697 888820 5 10 10 3 7 3 8 10 2 malignant 1
#> 698 897471 4 8 6 4 3 4 10 6 1 malignant 1
#> 699 897471 4 8 8 5 4 5 10 4 1 malignant 1
```

Although there are many attributes we could examine, for this example we’ll just look at the relationship of `V1`

(clump thickness) and the `class`

of the tumor. Because there is a large degree of overplotting, we’ll jitter the points and make them semitransparent (`alpha = 0.4`

), hollow (`shape = 21`

), and slightly smaller (`size = 1.5`

). Then we’ll add a fitted logistic regression line (Figure 5.20) by telling `stat_smooth()`

to use the `glm()`

function with `family = binomial`

:

```
ggplot(biopsy_mod, aes(x = V1, y = classn)) +
geom_point(
position = position_jitter(width = 0.3, height = 0.06),
alpha = 0.4,
shape = 21,
size = 1.5
) +
stat_smooth(method = glm, method.args = list(family = binomial))
```

If your scatter plot has points grouped by a factor, and that factor is mapped to an aesthetic such as `colour`

or `shape`

, one fit line will be drawn for each factor level. First we’ll make the base plot object `hw_sp`

, then we’ll add the LOESS lines to it. We’ll also make the points less prominent by making them semitransparent, using `alpha = .4`

(Figure 5.21):

```
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex)) +
geom_point() +
scale_colour_brewer(palette = "Set1")
hw_sp +
geom_smooth()
```

Notice that the blue line, for males, doesn’t run all the way to the right side of the plot. There are two reasons for this. The first is that by default, `stat_smooth()`

limits the prediction to within the range of the predictor data on the x-axis. The second is that even if it extrapolates, the `loess()`

function only offers prediction within the *x* range of the data.

If you want the lines to extrapolate from the data, as shown in the right-hand image of Figure 5.21, you must use a model method that allows extrapolation, like `lm()`

, and pass `stat_smooth()`

the option `fullrange = TRUE`

:

```
hw_sp +
geom_smooth(method = lm, se = FALSE, fullrange = TRUE)
```

In this example with the `heightweight`

data set, the default settings for `stat_smooth()`

(with `loess`

and no extrapolation) may make more sense than the extrapolated linear predictions, because humans don’t grow linearly and we don’t grow forever.

## 5.7 Adding Fitted Lines from an Existing Model

### 5.7.1 Problem

You have already created a fitted regression model object for a data set, and you want to plot the lines for that model.

### 5.7.2 Solution

Usually the easiest way to overlay a fitted model is to simply ask `stat_smooth()`

to do it for you, as described in Recipe 5.6. Sometimes, however, you may want to create the model yourself and then add it to your graph. This allows you to be sure that the model you’re using for other calculations is the same one that you see.

In this example, we’ll build a quadratic model using `lm()`

with `ageYear`

as a predictor of `heightIn`

. Then we’ll use the `predict()`

function and find the predicted values of `heightIn`

across the range of values for the predictor, `ageYear`

:

```
library(gcookbook) # Load gcookbook for the heightweight data set
model <- lm(heightIn ~ ageYear + I(ageYear^2), heightweight)
model
#>
#> Call:
#> lm(formula = heightIn ~ ageYear + I(ageYear^2), data = heightweight)
#>
#> Coefficients:
#> (Intercept) ageYear I(ageYear^2)
#> -10.3136 8.6673 -0.2478
# Create a data frame with ageYear column, interpolating across range
xmin <- min(heightweight$ageYear)
xmax <- max(heightweight$ageYear)
predicted <- data.frame(ageYear = seq(xmin, xmax, length.out = 100))
# Calculate predicted values of heightIn
predicted$heightIn <- predict(model, predicted)
predicted
#> ageYear heightIn
#> 1 11.5800 56.82624
#> 2 11.6398 57.00047
#> 3 11.6996 57.17294
#> ...<94 more rows>...
#> 98 17.3804 65.47641
#> 99 17.4402 65.47875
#> 100 17.5000 65.47933
```

We can now plot the data points along with the values predicted from the model (as you’ll see in Figure 5.22):

```
# Create a base plot called `hw_sp` (for heightweight scatter plot)
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
geom_point(colour = "grey40")
hw_sp +
geom_line(data = predicted, size = 1)
```

### 5.7.3 Discussion

Any model object (e.g. `lm`

) can be used, so long as it has a corresponding `predict()`

method. For example, `lm`

has `predict.lm()`

, loess has `predict.loess()`

, and so on. Adding lines from a model can be simplified by using the function `predictvals()`

, defined below. If you simply pass a model to `predictvals()`

, the function will do the work of finding the variable names and the range of the predictor, and will return a data frame with predictor and predicted values. That data frame can then be passed to `geom_line()`

to draw the fitted line, as we did earlier:

```
# Given a model, predict values of yvar from xvar
# This function supports one predictor and one predicted variable
# xrange: If NULL, determine the x range from the model object. If a vector with
# two numbers, use those as the min and max of the prediction range.
# samples: Number of samples across the x range.
# ...: Further arguments to be passed to predict()
predictvals <- function(model, xvar, yvar, xrange = NULL, samples = 100, ...) {
# If xrange isn't passed in, determine xrange from the models.
# Different ways of extracting the x range, depending on model type
if (is.null(xrange)) {
if (any(class(model) %in% c("lm", "glm")))
xrange <- range(model$model[[xvar]])
else if (any(class(model) %in% "loess"))
xrange <- range(model$x)
}
newdata <- data.frame(x = seq(xrange[1], xrange[2], length.out = samples))
names(newdata) <- xvar
newdata[[yvar]] <- predict(model, newdata = newdata, ...)
newdata
}
```

With the heightweight data set, we’ll make a linear model with `lm()`

and a LOESS model with `loess()`

(Figure 5.22):

```
modlinear <- lm(heightIn ~ ageYear, heightweight)
modloess <- loess(heightIn ~ ageYear, heightweight)
```

Then we can call `predictvals()`

on each model, and pass the resulting data frames to `geom_line()`

:

```
lm_predicted <- predictvals(modlinear, "ageYear", "heightIn")
loess_predicted <- predictvals(modloess, "ageYear", "heightIn")
hw_sp +
geom_line(data = lm_predicted, colour = "red", size = .8) +
geom_line(data = loess_predicted, colour = "blue", size = .8)
```

For `glm`

models use a nonlinear link function, you need to specify `type = "response"`

to the `predictvals()`

function. This is because the default behavior of `glm`

is to return predicted values in the scale of the linear predictors, instead of in the scale of the response (*y*) variable.

To illustrate this, we’ll use the `biopsy`

data set from the `MASS`

package. As we did in Recipe 5.6, we’ll use `V1`

to predict `class`

. Since logistic regressions require numeric values from 0 to 1, we need to convert the factor `class`

to 0s and 1s:

```
library(MASS) # Load MASS for the biopsy data set
# Using the biopsy data set, create a new column that stores the factor `class` as a numeric variable named `classn`. If `class` == "benign", set `classn` to 0. If `class` == "malignant", set `classn` to 1. Save this new dataset as `biopsy_mod`.
biopsy_mod <- biopsy %>%
mutate(classn = recode(class, benign = 0, malignant = 1))
biopsy_mod
#> ID V1 V2 V3 V4 V5 V6 V7 V8 V9 class classn
#> 1 1000025 5 1 1 1 2 1 3 1 1 benign 0
#> 2 1002945 5 4 4 5 7 10 3 2 1 benign 0
#> 3 1015425 3 1 1 1 2 2 3 1 1 benign 0
#> ...<693 more rows>...
#> 697 888820 5 10 10 3 7 3 8 10 2 malignant 1
#> 698 897471 4 8 6 4 3 4 10 6 1 malignant 1
#> 699 897471 4 8 8 5 4 5 10 4 1 malignant 1
```

Next, we’ll perform the logistic regression:

`fitlogistic <- glm(classn ~ V1, biopsy_mod, family = binomial)`

Finally, we’ll make the graph with jittered points and the `fitlogistic`

line. We’ll make the line in a shade of blue by specifying a color in RGB values, and slightly thicker, with `size = 1`

(Figure 5.23):

```
# Get predicted values
glm_predicted <- predictvals(fitlogistic, "V1", "classn", type = "response")
ggplot(biopsy_mod, aes(x = V1, y = classn)) +
geom_point(
position = position_jitter(width = .3, height = .08),
alpha = 0.4,
shape = 21,
size = 1.5
) +
geom_line(data = glm_predicted, colour = "#1177FF", size = 1)
```

## 5.8 Adding Fitted Lines from Multiple Existing Models

### 5.8.1 Problem

You have already created a fitted regression model object for a data set, and you want to plot the lines for that model.

### 5.8.2 Solution

Use the `predictvals()`

function from the previous recipe (5.7) along with functions from the `dplyr`

package, including `group_by()`

and `do()`

.

With the `heightweight`

data set, we’ll make a linear model for each of the levels of `sex`

. The model building is done for each subset of the data frame by specifying the model computation we want within the `do()`

function.

The following code splits the `heightweight`

data frame into two data frames, by the grouping variable `sex`

. This creates a data frame subset for males and a data frame subset for females. We then apply `lm(heightIn ~ ageYear, .)`

to each subset. The `.`

in `lm(heightIn ~ ageYear)`

represents the data frame we are piping in from the previous line – in this case, the two data frame subsets we have just created. Finally, this code returns a data frame which contains a column, `model`

, which is a list of the model outputs corresponding to each level of the grouping variable `sex`

, female and male.

```
library(gcookbook) # Load gcookbook for the heightweight data set
library(dplyr)
# Create an lm model object for each value of sex; this returns a data frame
models <- heightweight %>%
group_by(sex) %>%
do(model = lm(heightIn ~ ageYear, .)) %>%
ungroup()
# Print the data frame
models
#> # A tibble: 2 x 2
#> sex model
#> * <fct> <list>
#> 1 f <S3: lm>
#> 2 m <S3: lm>
# Print out the model column of the data frame
models$model
#> [[1]]
#>
#> Call:
#> lm(formula = heightIn ~ ageYear, data = .)
#>
#> Coefficients:
#> (Intercept) ageYear
#> 43.963 1.209
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = heightIn ~ ageYear, data = .)
#>
#> Coefficients:
#> (Intercept) ageYear
#> 30.658 2.301
```

Now that we have the list of model objects, we can run the `predictvals()`

as defined in 5.7 to get the predicted values from each model:

```
predvals <- models %>%
group_by(sex) %>%
do(predictvals(.$model[[1]], xvar = "ageYear", yvar = "heightIn"))
```

Finally, we can plot the data with the predicted values (Figure 5.24):

```
ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex)) +
geom_point() +
geom_line(data = predvals)
# Using facets instead of colors for the groups
ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
geom_point() +
geom_line(data = predvals) +
facet_grid(. ~ sex)
```

### 5.8.3 Discussion

The `group_by()`

and `do()`

calls are used for splitting the data into parts, running functions on those parts, and then reassembling the output.

With the preceding code, the *x* range of the predicted values for each group spans the *x* range of each group, and no further; for the males, the prediction line stops at the oldest male, while for females, the prediction line continues further right, to the oldest female. To form prediction lines that have the same *x* range across all groups, we can simply pass in `xrange`

, like this:

```
predvals <- models %>%
group_by(sex) %>%
do(predictvals(
.$model[[1]],
xvar = "ageYear",
yvar = "heightIn",
xrange = range(heightweight$ageYear))
)
```

Then we can plot it, the same as we did before:

```
ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex)) +
geom_point() +
geom_line(data = predvals)
```

As you can see in Figure 5.25, the line for males now extends as far to the right as the line for females. Keep in mind that extrapolating past the data isn’t always appropriate, though; whether or not it’s justified will depend on the nature of your data and the assumptions you bring to the table.

## 5.9 Adding Annotations with Model Coefficients

### 5.9.1 Problem

You want to add numerical information about a model to a plot.

### 5.9.2 Solution

To add simple text to a plot, simply add an annotation. In this example, we’ll create a linear model and use the `predictvals()`

function defined in Recipe 5.7 to create a prediction line from the model. Then we’ll add an annotation:

```
library(gcookbook) # Load gcookbook for the heightweight data set
model <- lm(heightIn ~ ageYear, heightweight)
summary(model)
#>
#> Call:
#> lm(formula = heightIn ~ ageYear, data = heightweight)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -8.3517 -1.9006 0.1378 1.9071 8.3371
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 37.4356 1.8281 20.48 <2e-16 ***
#> ageYear 1.7483 0.1329 13.15 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.989 on 234 degrees of freedom
#> Multiple R-squared: 0.4249, Adjusted R-squared: 0.4225
#> F-statistic: 172.9 on 1 and 234 DF, p-value: < 2.2e-16
```

This shows that the *r^2* value is 0.4249. We’ll create a graph and manually add the text using `annotate()`

(Figure 5.26):

```
# First generate prediction data
pred <- predictvals(model, "ageYear", "heightIn")
# Save a base plot
hw_sp <- ggplot(heightweight, aes(x = ageYear, y = heightIn)) +
geom_point() +
geom_line(data = pred)
hw_sp +
annotate("text", x = 16.5, y = 52, label = "r^2=0.42")
```

Instead of using a plain text string, it’s also possible to enter formulas using R’s math expression syntax, by using `parse = TRUE`

:

```
hw_sp +
annotate("text", x = 16.5, y = 52, label = "r^2 == 0.42", parse = TRUE)
```

### 5.9.3 Discussion

Text geoms in ggplot do not take expression objects directly; instead, they take character strings that can be turned into expressions with R’s `parse()`

function.

If you use a mathematical expression, the syntax must be correct for the expression to be a valid R expression object. You can test the validity by wrapping the object in the `expression()`

function and seeing if it throws an error (make sure *not* to use quotes around the expression). In the example here, `==`

is a valid construct in an expression to express equality, but `=`

is not:

```
expression(r^2 == 0.42) # Valid
expression(r^2 = 0.42) # Not valid
#> Error: unexpected '=' in "expression(r\^2 ="
```

It’s possible to automatically extract values from the model object and build an expression using those values. In this example, we’ll create a string which when parsed, yields a valid expression:

```
# Use sprintf() to construct our string.
# The %.3g and %.2g are replaced with numbers with 3 significant digits and 2
# significant digits, respectively. The numbers are supplied after the string.
eqn <- sprintf(
"italic(y) == %.3g + %.3g * italic(x) * ',' ~~ italic(r)^2 ~ '=' ~ %.2g",
coef(model)[1],
coef(model)[2],
summary(model)$r.squared
)
eqn
#> [1] "italic(y) == 37.4 + 1.75 * italic(x) * ',' ~~ italic(r)^2 ~ '=' ~ 0.42"
# Test validity by using parse()
parse(text = eqn)
#> expression(italic(y) == 37.4 + 1.75 * italic(x) * "," ~ ~italic(r)^2 ~
#> "=" ~ 0.42)
```

Now that we have the expression string, we can add it to the plot. In this example we’ll put the text in the bottom-right corner, by setting `x = Inf`

and `y = -Inf`

and using horizontal and vertical adjustments so that the text all fits inside the plotting area (Figure 5.27):

```
hw_sp +
annotate(
"text",
x = Inf, y = -Inf,
label = eqn, parse = TRUE,
hjust = 1.1, vjust = -.5
)
```

### 5.9.4 See Also

The math expression syntax in R can be a bit tricky. See Recipe 7.2 for more information.

## 5.10 Adding Marginal Rugs to a Scatter Plot

### 5.10.1 Problem

You want to add marginal rugs to a scatter plot.

### 5.10.2 Solution

Use `geom_rug()`

. For this example (Figure 5.28), we’ll use the `faithful`

data set. This data set has two columns with data about the Old Faithful geyser: `eruptions`

, which is the length of each eruption, and `waiting`

, which is the length of time until the next eruption:

```
ggplot(faithful, aes(x = eruptions, y = waiting)) +
geom_point() +
geom_rug()
```

### 5.10.3 Discussion

A marginal rug plot is essentially a one-dimensional scatter plot that can be used to visualize the distribution of data on each axis.

In this particular data set, the marginal rug is not as informative as it could be. The resolution of the `waiting`

variable is in whole minutes, and because of this, the rug lines have a lot of overplotting. To reduce overplotting, we can jitter the line positions and make them slightly thinner by specifying size (Figure 5.29). This helps the viewer see the distribution more clearly:

```
ggplot(faithful, aes(x = eruptions, y = waiting)) +
geom_point() +
geom_rug(position = "jitter", size = 0.2)
```

### 5.10.4 See Also

For more about overplotting, see Recipe 5.5.

## 5.11 Labeling Points in a Scatter Plot

### 5.11.1 Problem

You want to add labels to points in a scatter plot.

### 5.11.2 Solution

For annotating just one or a few points, you can use `annotate()`

or `geom_text()`

. For this example, we’ll use the countries data set and visualize the relationship between health expenditures and infant mortality rate per 1,000 live births. To keep things manageable, we’ll filter the data to only look at data from 2009 for a subset of countries that spent more than $2,000 USD per capita:

```
library(gcookbook) # Load gcookbook for the countries data set
library(dplyr)
# Filter the data to only look at 2009 data for countries that spent > 2000 USD per capita
countries_sub <- countries %>%
filter(Year == 2009 & healthexp > 2000)
```

We’ll save the basic scatter plot object in `countries_sp`

(for countries scatter plot) and add then add our annotations to it. To manually add annotations, use `annotate()`

, and specify the coordinates and label (Figure 5.30, left). It may require some trial-and-error tweaking to get the labels positioned just right:

```
countries_sp <- ggplot(countries_sub, aes(x = healthexp, y = infmortality)) +
geom_point()
countries_sp +
annotate("text", x = 4350, y = 5.4, label = "Canada") +
annotate("text", x = 7400, y = 6.8, label = "USA")
```

To automatically add the labels from your data (Figure 5.30, right), use `geom_text()`

and map a column that is a factor or character vector to the label aesthetic. In this case, we’ll use `Name`

, and we’ll make the font slightly smaller to reduce crowding. The default value for `size`

is 5, which doesn’t correspond directly to a point size:

```
countries_sp +
geom_text(aes(label = Name), size = 4)
```

As you can see in the center of (Figure 5.30, right), you may find yourself with a plot where labels are overlapping. To automatically adjust point labels so that they don’t overlap, we can use `geom_text_repel`

(Figure 5.31, left) or `geom_label_repel`

(which adds a box around the label, Figure 5.31, right) from the ggrepel package, which functions similarly to `geom_text`

.

```
# Make sure to have installed ggrepel with install.packages("ggrepel")
library(ggrepel)
countries_sp +
geom_text_repel(aes(label = Name), size = 3)
countries_sp +
geom_label_repel(aes(label = Name), size = 3)
```

### 5.11.3 Discussion

Using `geom_text_repel`

or `geom_label_repel`

is the easiest way to have nicely-placed labels on a plot. It makes automatic (and random) decisions about label placement, so if exact control over where each label is placed, you should use `annotate()`

or `geom_text()`

.

The automatic method for placing annotations using `geom_text()`

centers each annotation on the *x* and *y* coordinates. You’ll probably want to shift the text vertically, horizontally, or both.

Setting `vjust = 0`

will make the baseline of the text on the same level as the point (Figure 5.32, left), and setting `vjust = 1`

will make the top of the text level with the point. This usually isn’t enough, though – you can increase or decrease `vjust`

to shift the labels higher or lower, or you can add or subtract a bit to or from the y mapping to get the same effect (Figure 5.32, right):

```
countries_sp +
geom_text(aes(label = Name), size = 3, vjust = 0)
# Add a little extra to y
countries_sp +
geom_text(aes(y = infmortality + .1, label = Name), size = 3)
```

It often makes sense to right- or left-justify the labels relative to the points. To left-justify, set `hjust = 0`

(Figure 5.33, left), and to right-justify, set `hjust = 1`

. As was the case with `vjust`

, the labels will still slightly overlap with the points. This time, though, it’s not a good idea to try to fix it by increasing or decreasing `hjust`

. Doing so will shift the labels a distance proportional to the length of the label, making longer labels move further than shorter ones. It’s better to just set hjust to 0 or 1, and then add or subtract a bit to or from `x`

(Figure 5.33, right):

```
countries_sp +
geom_text(
aes(label = Name),
size = 3,
hjust = 0
)
countries_sp +
geom_text(
aes(x = healthexp + 100, label = Name),
size = 3,
hjust = 0
)
```

NoteIf you are using a logarithmic axis, instead of adding to x or y, you’ll need to

multiplythe x or y value by a number to shift the labels a consistent amount.

Besides right- or left-justifying all of your labels, you can also adjust the position of all of the labels at once is to use `position = position_nudge()`

. This allows you to specify the amount of vertical or horizontal distance you want to move the labels. As you can see from the figures below (Figure 5.34, this strategy works best when there are fewer labels, or fewer points that can cause overlap with labels. Note that the units you specify with `x = ...`

and `y = ...`

correspond to the units of the x and y axis.

```
countries_sp +
geom_text(
aes(x = healthexp + 100, label = Name),
size = 3,
hjust = 0
)
countries_sp +
geom_text(
aes(x = healthexp + 100, label = Name),
size = 3,
hjust = 0,
position = position_nudge(x = 100, y = -0.2)
)
```

If you want to label just some of the points but want the placement to be handled automatically, you can add a new column to your data frame containing just the labels you want. Here’s one way to do that: first we’ll make a copy of the data we’re using, then we’ll copy the `Name`

column into `plotname`

, converting from a factor to a character vector, for reasons we’ll see below.

```
cdat <- countries %>%
filter(Year == 2009, healthexp > 2000) %>%
mutate(plotname = as.character(Name))
```

Now that `plotname`

is a character vector, we can use an `ifelse()`

function and the `%in%`

operator to identify if each row of `plotname`

matches the list of names we want to show on our plot, which we have specified manually below. The `%in%`

operator returns a logical vector that allows us to specify within the `ifelse()`

function that we want to replace all values of `plotname`

that do not match one of our specified names with a blank string.

```
countrylist <- c("Canada", "Ireland", "United Kingdom", "United States",
"New Zealand", "Iceland", "Japan", "Luxembourg", "Netherlands", "Switzerland")
cdat <- cdat %>%
mutate(plotname = ifelse(plotname %in% countrylist, plotname, ""))
# Take a look at the resulting `plotname` variable, as compared to the original `Name` variable
cdat %>%
select(Name, plotname)
#> Name plotname
#> 1 Andorra
#> 2 Australia
#> 3 Austria
#> ...<21 more rows>...
#> 25 Switzerland Switzerland
#> 26 United Kingdom United Kingdom
#> 27 United States United States
```

Now we can make the plot (Figure 5.35). This time, we’ll also expand the *x* range so that the text will fit:

```
ggplot(cdat, aes(x = healthexp, y = infmortality)) +
geom_point() +
geom_text(aes(x = healthexp + 100, label = plotname), size = 4, hjust = 0) +
xlim(2000, 10000)
```

If any individual position adjustments are needed, you have a couple of options. One option is to copy the columns used for the *x* and *y* coordinates and modify the numbers for the individual items to move the text around. Make sure to use the original numbers for the coordinates of the points, of course!

Finally, another option is to save the output to a vector format such as PDF or SVG (see Recipes Recipe 14.1 and Recipe 14.2), then edit it in a program like Illustrator or Inkscape.

## 5.12 Creating a Balloon Plot

### 5.12.1 Problem

You want to make a balloon plot, where the area of the dots is proportional to their numerical value.

### 5.12.2 Solution

Use `geom_point()`

with `scale_size_area()`

. For this example, we’ll filter the data set `countries`

to only include data from the year 2009, for certain countries we have specified in `countrylist`

:

```
library(gcookbook) # Load gcookbook for the countries data set
countrylist <- c("Canada", "Ireland", "United Kingdom", "United States",
"New Zealand", "Iceland", "Japan", "Luxembourg", "Netherlands", "Switzerland")
cdat <- countries %>%
filter(Year == 2009, Name %in% countrylist)
cdat
#> Name Code Year GDP laborrate healthexp infmortality
#> 1 Canada CAN 2009 39599.04 67.8 4379.761 5.2
#> 2 Iceland ISL 2009 37972.24 77.5 3130.391 1.7
#> 3 Ireland IRL 2009 49737.93 63.6 4951.845 3.4
#> ...<4 more rows>...
#> 8 Switzerland CHE 2009 63524.65 66.9 7140.729 4.1
#> 9 United Kingdom GBR 2009 35163.41 62.2 3285.050 4.7
#> 10 United States USA 2009 45744.56 65.0 7410.163 6.6
```

If we just map `GDP`

to `size`

, the value of `GDP`

gets mapped to the *radius* of the dots (Figure 5.36, left), which is not what we want; a doubling of value results in a quadrupling of area, and this will distort the interpretation of the data. We instead want to map the value of `GDP`

to the *area* of the dots, which we can do this using `scale_size_area()`

(Figure 5.36, right):

```
# Create a base plot using the cdat data frame. We will call this base plot `cdat_sp` (for cdat scatter plot)
cdat_sp <- ggplot(cdat, aes(x = healthexp, y = infmortality, size = GDP)) +
geom_point(shape = 21, colour = "black", fill = "cornsilk")
# GDP mapped to radius (default with scale_size_continuous)
cdat_sp
# GDP mapped to area instead, and larger circles
cdat_sp +
scale_size_area(max_size = 15)
```

### 5.12.3 Discussion

The example here is a scatter plot, but that is not the only way to use balloon plots. It may also be useful to use balloon plots to represent values on a grid, where the x- and y-axes are categorical, as in Figure 5.37:

```
# Create a data frame that adds up counts for males and females
hec <- HairEyeColor %>%
# Convert to long format
as_tibble() %>%
group_by(Hair, Eye) %>%
summarize(count = sum(n))
# Create the base balloon plot
hec_sp <- ggplot(hec, aes(x = Eye, y = Hair)) +
geom_point(aes(size = count), shape = 21, colour = "black", fill = "cornsilk") +
scale_size_area(max_size = 20, guide = FALSE) +
geom_text(aes(
y = as.numeric(as.factor(Hair)) - sqrt(count)/34, label = count),
vjust = 1.3,
colour = "grey60",
size = 4
)
hec_sp
# Add red guide points
hec_sp +
geom_point(aes(y = as.numeric(as.factor(Hair)) - sqrt(count)/34), colour = "red", size = 1)
```

In this example we’ve used a few tricks to add the text labels under the circles. First, we used `vjust = 1.3`

to justify the top of text slightly below the *y* coordinate. Next, we wanted to set the *y* coordinate so that it is at the bottom of each circle. This requires a little wrangling and arithmetic: we need to first convert the levels of `Hair`

and `Eye`

into numeric values, which involves converting these variables from being a character vector to being a factor variable, and then converting them again into a numeric variable. We then take the *numeric* value of `Hair`

and subtract a small value from it, where the value depends in some way on count. This actually requires taking the square root of count, since the radius has a linear relationship with the square root of `count`

. The number that this value is divided by (34 in this case) is found by trial and error; it depends on the particular data values, radius, text size, and output image size.

To help find the correct *y* offset, we can add guide points in red and adjusted the value until they lined up with the bottom of each circle. Once we have the correct value, we can place the text and remove the points.

The text under the circles is in a shade of grey. This is so that it doesn’t jump out at the viewer and overwhelm the perceptual impact of the circles, but is still available if the viewer wants to know the exact values.

## 5.13 Making a Scatter Plot Matrix

### 5.13.1 Problem

You want to make a scatter plot matrix.

### 5.13.2 Solution

A scatter plot matrix is an excellent way of visualizing the pairwise relationships among several variables. To make one, use the `pairs()`

function from R’s base graphics.

For this example, we’ll use a subset of the `countries`

data. We’ll pull out the data for the year 2009, and keep only the columns that are relevant:

```
library(gcookbook) # Load gcookbook for the countries data set
c2009 <- countries %>%
filter(Year == 2009) %>%
select(Name, GDP, laborrate, healthexp, infmortality)
c2009
#> Name GDP laborrate healthexp infmortality
#> 1 Afghanistan NA 59.8 50.88597 103.2
#> 2 Albania 3772.6047 59.5 264.60406 17.2
#> 3 Algeria 4022.1989 58.5 267.94653 32.0
#> ...<210 more rows>...
#> 214 Yemen, Rep. 1130.1833 46.8 64.00204 58.7
#> 215 Zambia 1006.3882 69.2 47.05637 71.5
#> 216 Zimbabwe 467.8534 66.8 NA 52.2
```

To make the scatter plot matrix (Figure 5.38), we’ll use all of the variables except for `Name`

, since making a scatter plot matrix using the names of the countries wouldn’t make sense and would produce strange-looking results:

```
c2009 %>%
select(-Name) %>%
pairs()
```

### 5.13.3 Discussion

You can also use customized functions for the panels. To show the correlation coefficient of each pair of variables instead of a scatter plot, we’ll define the function `panel.cor`

. This will also show higher correlations in a larger font. Don’t worry about the details for now – just paste this code into your R session or script:

```
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
usr <- par("usr")
on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y, use = "complete.obs"))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste(prefix, txt, sep = "")
if (missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * (1 + r) / 2)
}
```

To show histograms of each variable along the diagonal, we’ll define `panel.hist`

:

```
panel.hist <- function(x, ...) {
usr <- par("usr")
on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks
nB <- length(breaks)
y <- h$counts
y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "white", ...)
}
```

Both of these panel functions are taken from the `pairs()`

help page, so if it’s more convenient, you can simply open that help page, then copy and paste. The last line of this version of the `panel.cor`

function is slightly modified, however, so that the changes in font size aren’t as extreme as with the original.

Now that we’ve defined these functions we can use them for our scatter plot matrix, by telling `pairs()`

to use `panel.cor`

for the upper panels and `panel.hist`

for the diagonal panels.

We’ll also throw in one more thing: `panel.smooth`

for the lower panels, which makes a scatter plot and adds a LOWESS smoothed line, as shown in Figure 5.39. (LOWESS is slightly different from LOESS, which we saw in Recipe 5.6, but the differences aren’t important for this sort of rough exploratory visualization):

```
c2009 %>%
select(-Name) %>%
pairs(
upper.panel = panel.cor,
diag.panel = panel.hist,
lower.panel = panel.smooth
)
```

It may be more desirable to use linear regression lines instead of LOWESS lines. The `panel.lm()`

function will do the trick (unlike the previous panel functions, this one isn’t in the pairs help page):

```
panel.lm <- function (x, y, col = par("col"), bg = NA, pch = par("pch"),
cex = 1, col.smooth = "black", ...) {
points(x, y, pch = pch, col = col, bg = bg, cex = cex)
abline(stats::lm(y ~ x), col = col.smooth, ...)
}
```

This time the default line color is black instead of red, though you can change it here (and with `panel.smooth`

) by setting `col.smooth`

when you call `pairs()`

.

We’ll also use small points in the visualization, so that we can distinguish them a bit better (Figure 5.40). This is done by setting `pch = "."`

:

```
c2009 %>%
select(-Name) %>%
pairs(
upper.panel = panel.cor,
diag.panel = panel.hist,
lower.panel = panel.smooth,
pch = "."
)
```

The size of the points can also be controlled using the `cex`

parameter. The default value for `cex`

is 1; make it smaller for smaller points and larger for larger points. Values below .5 might not render properly with PDF output.

### 5.13.4 See Also

To create a correlation matrix, see Recipe 13.1.

It is worth noting that we didn’t use ggplot here because it doesn’t make scatter plot matrices (at least, not well).

Other packages like `GGally`

have been developed as extensions to ggplot to fill in this gap. The `ggpairs()`

function from the `GGally`

package makes scatter plot matrices, for example.