Chapter 5 Scatter Plots

Scatter plots are used to display the relationship between two continuous variables. In a scatter plot, each observation in a data set is represented by a point. Often, a scatter plot will also have a line showing the predicted values based on some statistical model. Adding this line is easy to do with R and the ggplot2 package, and can help to make sense of data when the trends aren’t immediately obvious just by looking at the points.

With large data sets, plotting every single observation in the data set can result in overplotting, when points overlap and obscure one another. To deal with the problem of overplotting, you’ll probably want to summarize the data before displaying it. We’ll also see how to do that in this chapter.

5.1 Making a Basic Scatter Plot

5.1.1 Problem

You want to make a scatter plot using two continuous variables.

5.1.2 Solution

Use geom_point(), and map one variable to x and one variable to y.

We will use the heightweight data set. There are a number of columns in this data set, but we’ll only use two in this example (Figure 5.1):

A basic scatter plot

Figure 5.1: A basic scatter plot

5.1.3 Discussion

Instead of points, you can use different shapes for your scatter plot by using the shape aesthetic. A common alternative to the default solid circles (shape #19) is hollow ones (#21), as seen in Figure 5.2 (left):

The size of the points can be controlled with the size aesthetic. The default value of size is 2 (size = 2). The following code will set size = 1.5 to create smaller points (Figure 5.2, right):

Scatter plot with hollow circles (shape 21, left); With smaller points (right)Scatter plot with hollow circles (shape 21, left); With smaller points (right)

Figure 5.2: Scatter plot with hollow circles (shape 21, left); With smaller points (right)

5.2 Grouping Points Together using Shapes or Colors

5.2.1 Problem

You want to visually group points by some variable (the grouping variable), using different shapes or colors.

5.2.2 Solution

Map the grouping variable to the aesthetic of shape or colour. We’ll use three columns from the heightweight data set for this example:

We can use the aesthetics of colour or shape to visually differentiate the data points belonging to different categories of sex. We do this by mapping sex to one of the aesthetics colour or shape (Figure5.3):

Grouping points by a variable mapped to colour (left), or to shape (right)Grouping points by a variable mapped to colour (left), or to shape (right)

Figure 5.3: Grouping points by a variable mapped to colour (left), or to shape (right)

5.2.3 Discussion

The grouping variable you choose must be categorical – in other words, a factor or character vector. If the grouping variable is a numeric vector, you should convert it to a factor first.

It is possible to map a variable to both shape and colour, or, if you have multiple grouping variables, to map each grouping variable to a different aesthetic. Here, we’ll map the variable sex to both shape and colour aesthetics (Figure 5.4, left):

You may want to use different shapes and colors than are given by the default settings. You can select other shapes for the grouping variables using scale_shape_manual(), and select other colors using scale_colour_brewer() or scale_colour_manual(). (Figure 5.4, right):

Mapping to both shape and colour (left); With manually set shapes and colors (right)Mapping to both shape and colour (left); With manually set shapes and colors (right)

Figure 5.4: Mapping to both shape and colour (left); With manually set shapes and colors (right)

5.2.4 See Also

To use different shapes, see Recipe 5.3.

For more on using different colors, see Chapter 12.

5.3 Using Different Point Shapes

5.3.1 Problem

You want to change the default scatterplot shapes for the data points.

5.3.2 Solution

You can set the shape of all the data points at once (Figure 5.5, left) by setting a shape in geom_point():

If you have mapped a variable to shape, you can use scale_shape_manual() to manually change the shapes mapped to the levels of that variable:

Scatter plot with the shape aesthetic set to a custom value (left); With a variable mapped to shape, using a custom shape palette (right)

Figure 5.5: Scatter plot with the shape aesthetic set to a custom value (left); With a variable mapped to shape, using a custom shape palette (right)

5.3.3 Discussion

Figure 5.6 shows the shapes that are already built into R. Some of the point shapes (1–14) only have an outline; some (15–20) have solid fill; and some (21–25) have an outline and fill that can be controlled separately. You can also use characters for points.

For shapes 1–20, the color of the entire point – even the points that have solid fill – is controlled by the colour aesthetic. For shapes 21–25, the outline is controlled by colour and the fill is controlled by fill.

Shapes in R graphics

Figure 5.6: Shapes in R graphics

It’s possible to have one variable represented by the shape of a point, and and another variable represented by the fill (empty or filled) of the point. To do this, you need to first choose point shapes that have both colour and fill, and set these in scale_shape_manual. You then need to choose a fill palette that includes NA and another color (the NA will result in a hollow shape) and use these in scale_fill_manual().

For example, we’ll take the heightweight data set and add another column that indicates whether the child weighed 100 pounds or more (Figure 5.7):

A variable mapped to shape and another mapped to fill

Figure 5.7: A variable mapped to shape and another mapped to fill

5.3.4 See Also

For more on using different colors, see Chapter 12.

For more information about recoding a continuous variable to a categorical one, see Recipe 15.14.

5.4 Mapping a Continuous Variable to Color or Size

5.4.1 Problem

You want to represent a third continuous variable using color or size.

5.4.2 Solution

Map the continuous variable to size or colour. We will use the heightweight data set for this example. There are many columns in this data set, but we’ll only use four of them in this example:

The basic scatter plot in Recipe 5.1 shows the relationship between the continuous variables ageYear and heightIn. We can represent a third continuous variable, weightLb, by mapping this variable to another aesthetic property, such as colour or size (Figure 5.8:

A continuous variable mapped to colour (left); Mapped to size (right)A continuous variable mapped to colour (left); Mapped to size (right)

Figure 5.8: A continuous variable mapped to colour (left); Mapped to size (right)

5.4.3 Discussion

A basic scatter plot shows the relationship between two continuous variables: one mapped to the x-axis, and one to the y-axis. When there are more than two continuous variables, these additional variables must be mapped to other aesthetics, like size and color.

Humans can easily perceive small differences in spatial position, so we can interpret the variables mapped to x and y coordinates with high precision. Humans aren’t as good at perceiving small differences in size and color though, so we will interpret variables mapped to these aesthetic attributes with much lower precision. Therefore, when you map a variable to size or color, make sure it is a variable where high precision is not very important for correctly intepreting the data.

There is another consideration when mapping a variable to size, which is that the results can be perceptually misleading. While the largest dots in Figure 5.8 are about 36 times the size of the smallest ones, they are only supposed to represent about 3.5 times the weight of the smallest dots.

This relative misrepresentation of size happens because the default values in ggplot2 for the diameter of points ranges from 1 to 6mm, regardless of the actual data values. For example, if the data values range from 0 to 10, the smallest value of 0 will be represented on the plot with a point that is 1mm wide, while the largest value of 10 will be represented on the plot with a point that is 6mm wide. Similarly, if the data values range from 100 to 110, the smallest value of 100 will still be represented by a point that is 1mm wide, and the largest value of 110 will be represented by a point that is 6mm wide. Thus regardless of the actual data values, the largest point will have a diameter that is 6 times the diameter of the smallest point, and will be 36 times the area.

If it is important for the size of the points to accurately reflect the proportional differences of your data values, you should first decide if you want the diameter of the points to represent the data values, or if you want to area of the points to represent the data values. Figure 5.9 shows the difference between these representations.

Value mapped to diameter of points (left); Value mapped to area of points (right)Value mapped to diameter of points (left); Value mapped to area of points (right)

Figure 5.9: Value mapped to diameter of points (left); Value mapped to area of points (right)

See Recipe 5.12 for details on making the area of points proportional to the data values.

When it comes to color, there are actually two aesthetic attributes that can be used: color and fill. You will use color for most point shapes. However, shapes 21–25 have an outline with a solid region in the middle where the color is controlled by fill. These outlined shapes can be useful when using a color scale with light colors as in Figure 5.10, because the outline sets the shapes off from the background. In this example, we also set the fill gradient to go from black to white and make the points larger so that the fill is easier to see:

Outlined points with a continuous variable mapped to fill (left); With a discrete legend instead of continuous colorbar (right)Outlined points with a continuous variable mapped to fill (left); With a discrete legend instead of continuous colorbar (right)

Figure 5.10: Outlined points with a continuous variable mapped to fill (left); With a discrete legend instead of continuous colorbar (right)

Mapping a continuous variable to an aesthetic doesn’t prevent us from mapping a categorical variable to other aesthetics. In Figure 5.11, we’ll map weightLb to size, and also map sex to color. Because there is a fair amount of overplotting (where the points overlap), we’ll make the points 50% transparent by setting alpha = .5. We’ll also use scale_size_area() to make the area of the points proportional to the data values (see Recipe 5.12), and manually change the color palette:

Continuous variable mapped to size and categorical variable mapped to colour

Figure 5.11: Continuous variable mapped to size and categorical variable mapped to colour

When a variable is mapped to size, it’s a good idea to not map a variable to shape. This is because it is difficult to compare the sizes of different shapes; for example, a size 4 triangle could appear larger than a size 3.5 circle. Also, some of the shapes really are different sizes: shapes 16 and 19 are both circles, but at any given numeric size, shape 19 circles are visually larger than shape 16 circles.

5.4.4 See Also

To use different colors from the default, see Recipe 12.6.

See Recipe 5.12 for creating a balloon plot.

5.5 Dealing with Overplotting

5.5.1 Problem

You have many points that overlap and obscure each other when plotted.

5.5.2 Solution

With large data sets, the points in a scatter plot may overlap and obscure each other and prevent the viewer from accurately assessing the distribution of the data. This is called overplotting. If the amount of overplotting is low, you may be able to alleviate the problem by using smaller points, or by using a different shape (like shape 1, a hollow circle) through which other points can be seen. Figure 5.2 in Recipe 5.1 demonstrates both of these solutions.

If there’s a high degree of overplotting, there are a number of possible solutions:

  • Make the points semi-transparent
  • Bin the data into rectangles (better for quantitative analysis)
  • Bin the data into hexagons
  • Use box plots

5.5.3 Discussion

The scatter plot in Figure 5.12 contains about 54,000 points. They are heavily overplotted, making it impossible to get a sense of the relative density of points in different areas of the graph:

Overplotting, with about 54,000 points

Figure 5.12: Overplotting, with about 54,000 points

We can make the points semitransparent using the alpha aesthetic, as in Figure 5.13. Here, we’ll make them 90% transparent and then 99% transparent, by setting alpha = .1 and alpha = .01:

Semitransparent points with alpha=.1 (left); With alpha=.01 (right)Semitransparent points with alpha=.1 (left); With alpha=.01 (right)

Figure 5.13: Semitransparent points with alpha=.1 (left); With alpha=.01 (right)

Now we can see that there appear to be vertical bands at nice round values of carats, indicating that diamonds tend to be cut to those sizes. Still, the data is so dense that even when the points are 99% transparent, much of the graph appears black and the data distribution is still somewhat obscured.

Note

For most plots, vector formats (such as PDF, EPS, and SVG) result in smaller output files than bitmap formats (such as TIFF and PNG). But in cases where there are tens of thousands of points, vector output files can be very large and slow to render – the scatter plot here with 99% transparent points is a 1.5 MB PDF! In these cases, high-resolution bitmaps will be smaller and faster to display on computer screens. See Chapter 14 for more information.

Another solution is to bin the points into rectangles and map the density of the points to the fill color of the rectangles, as shown in Figure 5.14. With the binned visualization, the vertical bands are barely visible. The density of points in the lower-left corner is much greater, which tells us that the vast majority of diamonds are small and inexpensive.

By default, stat_bin_2d() divides the space into 30 groups in the x and y directions, for a total of 900 bins. In the second version, we increase the number of bins with bins = 50.

The default colors are somewhat difficult to distinguish because they don’t vary much in luminosity. In the second version we set the colors by using scale_fill_gradient() and by specifying the low and high colors. By default, the legend doesn’t show an entry for the lowest values. This is because the range of the color scale starts not from zero, but from the smallest nonzero quantity in a bin – probably 1, in this case. To make the legend show a zero (as in Figure 5.14, right), we can manually set the range from 0 to the maximum, 6000, using limits (Figure 5.14, left):

Binning data with stat_bin2d() (left); With more bins, manually specified colors, and legend breaks (right)Binning data with stat_bin2d() (left); With more bins, manually specified colors, and legend breaks (right)

Figure 5.14: Binning data with stat_bin2d() (left); With more bins, manually specified colors, and legend breaks (right)

Another alternative is to bin the data into hexagons instead of rectangles, with stat_binhex() (Figure 5.15). It works just like stat_bin2d(). To use stat_binhex(), you must first install the hexbin package, with the command install.packages("hexbin"):

Binning data with stat_binhex() (left); Cells outside of the range shown in grey (right)Binning data with stat_binhex() (left); Cells outside of the range shown in grey (right)

Figure 5.15: Binning data with stat_binhex() (left); Cells outside of the range shown in grey (right)

For both of these methods, if you manually specify the range and there is a bin that falls outside that range because it has too many or too few points, that bin will show up as grey rather than the color at the high or low end of the range, as seen in the graph on the right in Figure 5.15.

Overplotting can also occur when the data is discrete on one or both axes, as shown in Figure 5.16. In these cases, you can randomly jitter the points with position_jitter(). By default the amount of jitter is 40% of the resolution of the data in each direction, but these amounts can be controlled with width and height:

Data with a discrete x variable (left); Jittered (middle); Jittered horizontally only (right)Data with a discrete x variable (left); Jittered (middle); Jittered horizontally only (right)Data with a discrete x variable (left); Jittered (middle); Jittered horizontally only (right)

Figure 5.16: Data with a discrete x variable (left); Jittered (middle); Jittered horizontally only (right)

When the data has one discrete axis and one continuous axis, it might make sense to use box plots, as shown in Figure 5.17. This will convey a different story than a standard scatter plot because a box plot will obscure the number of data points at each location on the discrete axis. This may be problematic in some cases, but desirable in others.

When we look at the ChickWeight data we know that we conceptually want to treat Time as a discrete variable. However since Time is taken as a numerical variable by default, ggplot doesn’t know to group the data to form each boxplot box. If you don’t tell ggplot how to group the data, you get a result like the graph on the right in Figure 5.17. To tell it how to group the data, use aes(group = ...). In this case, we’ll group by each distinct value of Time:

Grouping into box plots (left); What happens if you don't specify groups (right)Grouping into box plots (left); What happens if you don't specify groups (right)

Figure 5.17: Grouping into box plots (left); What happens if you don’t specify groups (right)

5.5.4 See Also

Instead of binning the data, it may be useful to display a 2D density estimate. To do this, see Recipe 6.12.

5.6 Adding Fitted Regression Model Lines

5.6.1 Problem

You want to add lines from a fitted regression model to a scatter plot.

5.6.2 Solution

To add a linear regression line to a scatter plot, add stat_smooth() and tell it to use method = lm. This instructs ggplot to fit the data with the lm() (linear model) function. First we’ll save the base plot object in sp, then we’ll add different components to it:

By default, stat_smooth() also adds a 95% confidence region for the regression fit. The confidence interval can be changed by modifying the value for level, or it can be disabled with se = FALSE (Figure 5.18):

The default color of the fit line is blue. This can be change by setting colour. As with any other line, the attributes linetype and size can also be set. To emphasize the line, you can make the dots less prominent by changing the colour of the points (Figure 5.18, bottom right):

An lm fit with the default 95\% confidence region (top left); A 99\% confidence region (top right); No confidence region (bottom left); In black with grey points (bottom right)An lm fit with the default 95\% confidence region (top left); A 99\% confidence region (top right); No confidence region (bottom left); In black with grey points (bottom right)An lm fit with the default 95\% confidence region (top left); A 99\% confidence region (top right); No confidence region (bottom left); In black with grey points (bottom right)An lm fit with the default 95\% confidence region (top left); A 99\% confidence region (top right); No confidence region (bottom left); In black with grey points (bottom right)

Figure 5.18: An lm fit with the default 95% confidence region (top left); A 99% confidence region (top right); No confidence region (bottom left); In black with grey points (bottom right)

5.6.3 Discussion

The linear regression line is not the only way of fitting a model to the data – in fact, it’s not even the default. If you add stat_smooth() without specifying the method, it will use a LOESS (locally weighted polynomial) curve by default, as shown in Figure 5.19:

A LOESS fit

Figure 5.19: A LOESS fit

It may be useful to specify additional parameters for the modeling function, which in this case is loess(). If, for example, you wanted to use loess(degree = 1), you would call stat_smooth(method = loess, method.args = list(degree = 1)). The same could be done for other modeling functions like lm() or glm().

Another common type of model fit is a logistic regression. Logistic regression isn’t appropriate for heightweight, but it’s perfect for the biopsy data set in the MASS package. In the biopsy data, there are nine different measured attributes of breast cancer biopsies, as well as the class of the tumor, which is either benign or malignant. To prepare the data for logistic regression, we must convert the factor class, with the levels benign and malignant, to a vector with numeric values of 0 and 1. We’ll make a copy of the biopsy data frame called biopsy_mod, then store the numeric coded class in a column called classn:

Although there are many attributes we could examine, for this example we’ll just look at the relationship of V1 (clump thickness) and the class of the tumor. Because there is a large degree of overplotting, we’ll jitter the points and make them semitransparent (alpha = 0.4), hollow (shape = 21), and slightly smaller (size = 1.5). Then we’ll add a fitted logistic regression line (Figure 5.20) by telling stat_smooth() to use the glm() function with family = binomial:

A logistic model

Figure 5.20: A logistic model

If your scatter plot has points grouped by a factor, and that factor is mapped to an aesthetic such as colour or shape, one fit line will be drawn for each factor level. First we’ll make the base plot object hw_sp, then we’ll add the LOESS lines to it. We’ll also make the points less prominent by making them semitransparent, using alpha = .4 (Figure 5.21):

Notice that the blue line, for males, doesn’t run all the way to the right side of the plot. There are two reasons for this. The first is that by default, stat_smooth() limits the prediction to within the range of the predictor data on the x-axis. The second is that even if it extrapolates, the loess() function only offers prediction within the x range of the data.

If you want the lines to extrapolate from the data, as shown in the right-hand image of Figure 5.21, you must use a model method that allows extrapolation, like lm(), and pass stat_smooth() the option fullrange = TRUE:

LOESS fit lines for each group (left); Extrapolated linear fit lines (right)LOESS fit lines for each group (left); Extrapolated linear fit lines (right)

Figure 5.21: LOESS fit lines for each group (left); Extrapolated linear fit lines (right)

In this example with the heightweight data set, the default settings for stat_smooth() (with loess and no extrapolation) may make more sense than the extrapolated linear predictions, because humans don’t grow linearly and we don’t grow forever.

5.7 Adding Fitted Lines from an Existing Model

5.7.1 Problem

You have already created a fitted regression model object for a data set, and you want to plot the lines for that model.

5.7.2 Solution

Usually the easiest way to overlay a fitted model is to simply ask stat_smooth() to do it for you, as described in Recipe 5.6. Sometimes, however, you may want to create the model yourself and then add it to your graph. This allows you to be sure that the model you’re using for other calculations is the same one that you see.

In this example, we’ll build a quadratic model using lm() with ageYear as a predictor of heightIn. Then we’ll use the predict() function and find the predicted values of heightIn across the range of values for the predictor, ageYear:

We can now plot the data points along with the values predicted from the model (as you’ll see in Figure 5.22):

5.7.3 Discussion

Any model object (e.g. lm) can be used, so long as it has a corresponding predict() method. For example, lm has predict.lm(), loess has predict.loess(), and so on. Adding lines from a model can be simplified by using the function predictvals(), defined below. If you simply pass a model to predictvals(), the function will do the work of finding the variable names and the range of the predictor, and will return a data frame with predictor and predicted values. That data frame can then be passed to geom_line() to draw the fitted line, as we did earlier:

With the heightweight data set, we’ll make a linear model with lm() and a LOESS model with loess() (Figure 5.22):

Then we can call predictvals() on each model, and pass the resulting data frames to geom_line():

A quadratic prediction line from an lm object (left); Prediction lines from linear (in red) and LOESS (in blue) models (right)A quadratic prediction line from an lm object (left); Prediction lines from linear (in red) and LOESS (in blue) models (right)

Figure 5.22: A quadratic prediction line from an lm object (left); Prediction lines from linear (in red) and LOESS (in blue) models (right)

For glm models use a nonlinear link function, you need to specify type = "response" to the predictvals() function. This is because the default behavior of glm is to return predicted values in the scale of the linear predictors, instead of in the scale of the response (y) variable.

To illustrate this, we’ll use the biopsy data set from the MASS package. As we did in Recipe 5.6, we’ll use V1 to predict class. Since logistic regressions require numeric values from 0 to 1, we need to convert the factor class to 0s and 1s:

Next, we’ll perform the logistic regression:

Finally, we’ll make the graph with jittered points and the fitlogistic line. We’ll make the line in a shade of blue by specifying a color in RGB values, and slightly thicker, with size = 1 (Figure 5.23):

A fitted logistic model

Figure 5.23: A fitted logistic model

5.8 Adding Fitted Lines from Multiple Existing Models

5.8.1 Problem

You have already created a fitted regression model object for a data set, and you want to plot the lines for that model.

5.8.2 Solution

Use the predictvals() function from the previous recipe (5.7) along with functions from the dplyr package, including group_by() and do().

With the heightweight data set, we’ll make a linear model for each of the levels of sex. The model building is done for each subset of the data frame by specifying the model computation we want within the do() function.

The following code splits the heightweight data frame into two data frames, by the grouping variable sex. This creates a data frame subset for males and a data frame subset for females. We then apply lm(heightIn ~ ageYear, .) to each subset. The . in lm(heightIn ~ ageYear) represents the data frame we are piping in from the previous line – in this case, the two data frame subsets we have just created. Finally, this code returns a data frame which contains a column, model, which is a list of the model outputs corresponding to each level of the grouping variable sex, female and male.

Now that we have the list of model objects, we can run the predictvals() as defined in 5.7 to get the predicted values from each model:

Finally, we can plot the data with the predicted values (Figure 5.24):

Predictions from two separate lm objects, one for each subset of data (left); With facets (right)

Figure 5.24: Predictions from two separate lm objects, one for each subset of data (left); With facets (right)

5.8.3 Discussion

The group_by() and do() calls are used for splitting the data into parts, running functions on those parts, and then reassembling the output.

With the preceding code, the x range of the predicted values for each group spans the x range of each group, and no further; for the males, the prediction line stops at the oldest male, while for females, the prediction line continues further right, to the oldest female. To form prediction lines that have the same x range across all groups, we can simply pass in xrange, like this:

Then we can plot it, the same as we did before:

Predictions for each group extend to the full x range of all groups together

Figure 5.25: Predictions for each group extend to the full x range of all groups together

As you can see in Figure 5.25, the line for males now extends as far to the right as the line for females. Keep in mind that extrapolating past the data isn’t always appropriate, though; whether or not it’s justified will depend on the nature of your data and the assumptions you bring to the table.

5.9 Adding Annotations with Model Coefficients

5.9.1 Problem

You want to add numerical information about a model to a plot.

5.9.2 Solution

To add simple text to a plot, simply add an annotation. In this example, we’ll create a linear model and use the predictvals() function defined in Recipe 5.7 to create a prediction line from the model. Then we’ll add an annotation:

This shows that the r^2 value is 0.4249. We’ll create a graph and manually add the text using annotate() (Figure 5.26):

Instead of using a plain text string, it’s also possible to enter formulas using R’s math expression syntax, by using parse = TRUE:

Plain text (left); Math expression (right)Plain text (left); Math expression (right)

Figure 5.26: Plain text (left); Math expression (right)

5.9.3 Discussion

Text geoms in ggplot do not take expression objects directly; instead, they take character strings that can be turned into expressions with R’s parse() function.

If you use a mathematical expression, the syntax must be correct for the expression to be a valid R expression object. You can test the validity by wrapping the object in the expression() function and seeing if it throws an error (make sure not to use quotes around the expression). In the example here, == is a valid construct in an expression to express equality, but = is not:

It’s possible to automatically extract values from the model object and build an expression using those values. In this example, we’ll create a string which when parsed, yields a valid expression:

Now that we have the expression string, we can add it to the plot. In this example we’ll put the text in the bottom-right corner, by setting x = Inf and y = -Inf and using horizontal and vertical adjustments so that the text all fits inside the plotting area (Figure 5.27):

Scatter plot with automatically generated expression

Figure 5.27: Scatter plot with automatically generated expression

5.9.4 See Also

The math expression syntax in R can be a bit tricky. See Recipe 7.2 for more information.

5.10 Adding Marginal Rugs to a Scatter Plot

5.10.1 Problem

You want to add marginal rugs to a scatter plot.

5.10.2 Solution

Use geom_rug(). For this example (Figure 5.28), we’ll use the faithful data set. This data set has two columns with data about the Old Faithful geyser: eruptions, which is the length of each eruption, and waiting, which is the length of time until the next eruption:

Marginal rug added to a scatter plot

Figure 5.28: Marginal rug added to a scatter plot

5.10.3 Discussion

A marginal rug plot is essentially a one-dimensional scatter plot that can be used to visualize the distribution of data on each axis.

In this particular data set, the marginal rug is not as informative as it could be. The resolution of the waiting variable is in whole minutes, and because of this, the rug lines have a lot of overplotting. To reduce overplotting, we can jitter the line positions and make them slightly thinner by specifying size (Figure 5.29). This helps the viewer see the distribution more clearly:

Marginal rug with thinner, jittered lines

Figure 5.29: Marginal rug with thinner, jittered lines

5.10.4 See Also

For more about overplotting, see Recipe 5.5.

5.11 Labeling Points in a Scatter Plot

5.11.1 Problem

You want to add labels to points in a scatter plot.

5.11.2 Solution

For annotating just one or a few points, you can use annotate() or geom_text(). For this example, we’ll use the countries data set and visualize the relationship between health expenditures and infant mortality rate per 1,000 live births. To keep things manageable, we’ll filter the data to only look at data from 2009 for a subset of countries that spent more than $2,000 USD per capita:

We’ll save the basic scatter plot object in countries_sp (for countries scatter plot) and add then add our annotations to it. To manually add annotations, use annotate(), and specify the coordinates and label (Figure 5.30, left). It may require some trial-and-error tweaking to get the labels positioned just right:

To automatically add the labels from your data (Figure 5.30, right), use geom_text() and map a column that is a factor or character vector to the label aesthetic. In this case, we’ll use Name, and we’ll make the font slightly smaller to reduce crowding. The default value for size is 5, which doesn’t correspond directly to a point size:

A scatter plot with manually labeled points (left); With automatically labeled points and a smaller font (right)A scatter plot with manually labeled points (left); With automatically labeled points and a smaller font (right)

Figure 5.30: A scatter plot with manually labeled points (left); With automatically labeled points and a smaller font (right)

As you can see in the center of (Figure 5.30, right), you may find yourself with a plot where labels are overlapping. To automatically adjust point labels so that they don’t overlap, we can use geom_text_repel (Figure 5.31, left) or geom_label_repel (which adds a box around the label, Figure 5.31, right) from the ggrepel package, which functions similarly to geom_text.

A scatter plot labeled with geom_text_repel (left); Labeled with geom_label_repel (right)A scatter plot labeled with geom_text_repel (left); Labeled with geom_label_repel (right)

Figure 5.31: A scatter plot labeled with geom_text_repel (left); Labeled with geom_label_repel (right)

5.11.3 Discussion

Using geom_text_repel or geom_label_repel is the easiest way to have nicely-placed labels on a plot. It makes automatic (and random) decisions about label placement, so if exact control over where each label is placed, you should use annotate() or geom_text().

The automatic method for placing annotations using geom_text() centers each annotation on the x and y coordinates. You’ll probably want to shift the text vertically, horizontally, or both.

Setting vjust = 0 will make the baseline of the text on the same level as the point (Figure 5.32, left), and setting vjust = 1 will make the top of the text level with the point. This usually isn’t enough, though – you can increase or decrease vjust to shift the labels higher or lower, or you can add or subtract a bit to or from the y mapping to get the same effect (Figure 5.32, right):

A scatter plot with vjust=0 (left); With a little extra added to y (right)A scatter plot with vjust=0 (left); With a little extra added to y (right)

Figure 5.32: A scatter plot with vjust=0 (left); With a little extra added to y (right)

It often makes sense to right- or left-justify the labels relative to the points. To left-justify, set hjust = 0 (Figure 5.33, left), and to right-justify, set hjust = 1. As was the case with vjust, the labels will still slightly overlap with the points. This time, though, it’s not a good idea to try to fix it by increasing or decreasing hjust. Doing so will shift the labels a distance proportional to the length of the label, making longer labels move further than shorter ones. It’s better to just set hjust to 0 or 1, and then add or subtract a bit to or from x (Figure 5.33, right):

A scatter plot with hjust=0 (left); With a little extra added to x (right)A scatter plot with hjust=0 (left); With a little extra added to x (right)

Figure 5.33: A scatter plot with hjust=0 (left); With a little extra added to x (right)

Note

If you are using a logarithmic axis, instead of adding to x or y, you’ll need to multiply the x or y value by a number to shift the labels a consistent amount.

Besides right- or left-justifying all of your labels, you can also adjust the position of all of the labels at once is to use position = position_nudge(). This allows you to specify the amount of vertical or horizontal distance you want to move the labels. As you can see from the figures below (Figure 5.34, this strategy works best when there are fewer labels, or fewer points that can cause overlap with labels. Note that the units you specify with x = ... and y = ... correspond to the units of the x and y axis.

Original scatter plot (left); Scatter plot with labels nudged down and to the right (right)Original scatter plot (left); Scatter plot with labels nudged down and to the right (right)

Figure 5.34: Original scatter plot (left); Scatter plot with labels nudged down and to the right (right)

If you want to label just some of the points but want the placement to be handled automatically, you can add a new column to your data frame containing just the labels you want. Here’s one way to do that: first we’ll make a copy of the data we’re using, then we’ll copy the Name column into plotname, converting from a factor to a character vector, for reasons we’ll see below.

Now that plotname is a character vector, we can use an ifelse() function and the %in% operator to identify if each row of plotname matches the list of names we want to show on our plot, which we have specified manually below. The %in% operator returns a logical vector that allows us to specify within the ifelse() function that we want to replace all values of plotname that do not match one of our specified names with a blank string.

Now we can make the plot (Figure 5.35). This time, we’ll also expand the x range so that the text will fit:

Scatter plot with selected labels and expanded x range

Figure 5.35: Scatter plot with selected labels and expanded x range

If any individual position adjustments are needed, you have a couple of options. One option is to copy the columns used for the x and y coordinates and modify the numbers for the individual items to move the text around. Make sure to use the original numbers for the coordinates of the points, of course!

Finally, another option is to save the output to a vector format such as PDF or SVG (see Recipes Recipe 14.1 and Recipe 14.2), then edit it in a program like Illustrator or Inkscape.

5.11.4 See Also

For more on controlling the appearance of the text, see Recipe 9.2.

If you want to manually edit a PDF or SVG file, see Recipe 14.4.

5.12 Creating a Balloon Plot

5.12.1 Problem

You want to make a balloon plot, where the area of the dots is proportional to their numerical value.

5.12.2 Solution

Use geom_point() with scale_size_area(). For this example, we’ll filter the data set countries to only include data from the year 2009, for certain countries we have specified in countrylist:

If we just map GDP to size, the value of GDP gets mapped to the radius of the dots (Figure 5.36, left), which is not what we want; a doubling of value results in a quadrupling of area, and this will distort the interpretation of the data. We instead want to map the value of GDP to the area of the dots, which we can do this using scale_size_area() (Figure 5.36, right):

Balloon plot with value mapped to radius (left); With value mapped to area (right)Balloon plot with value mapped to radius (left); With value mapped to area (right)

Figure 5.36: Balloon plot with value mapped to radius (left); With value mapped to area (right)

5.12.3 Discussion

The example here is a scatter plot, but that is not the only way to use balloon plots. It may also be useful to use balloon plots to represent values on a grid, where the x- and y-axes are categorical, as in Figure 5.37:

Balloon plot with categorical axes and text labels. Qith guide points to help position text (left); Without guide points (right)Balloon plot with categorical axes and text labels. Qith guide points to help position text (left); Without guide points (right)

Figure 5.37: Balloon plot with categorical axes and text labels. Qith guide points to help position text (left); Without guide points (right)

In this example we’ve used a few tricks to add the text labels under the circles. First, we used vjust = 1.3 to justify the top of text slightly below the y coordinate. Next, we wanted to set the y coordinate so that it is at the bottom of each circle. This requires a little wrangling and arithmetic: we need to first convert the levels of Hair and Eye into numeric values, which involves converting these variables from being a character vector to being a factor variable, and then converting them again into a numeric variable. We then take the numeric value of Hair and subtract a small value from it, where the value depends in some way on count. This actually requires taking the square root of count, since the radius has a linear relationship with the square root of count. The number that this value is divided by (34 in this case) is found by trial and error; it depends on the particular data values, radius, text size, and output image size.

To help find the correct y offset, we can add guide points in red and adjusted the value until they lined up with the bottom of each circle. Once we have the correct value, we can place the text and remove the points.

The text under the circles is in a shade of grey. This is so that it doesn’t jump out at the viewer and overwhelm the perceptual impact of the circles, but is still available if the viewer wants to know the exact values.

5.12.4 See Also

To add labels to the circles, see Recipes Recipe 5.11 and Recipe 7.1.

See Recipe 5.4 for ways of mapping variables to other aesthetics in a scatter plot.

5.13 Making a Scatter Plot Matrix

5.13.1 Problem

You want to make a scatter plot matrix.

5.13.2 Solution

A scatter plot matrix is an excellent way of visualizing the pairwise relationships among several variables. To make one, use the pairs() function from R’s base graphics.

For this example, we’ll use a subset of the countries data. We’ll pull out the data for the year 2009, and keep only the columns that are relevant:

To make the scatter plot matrix (Figure 5.38), we’ll use all of the variables except for Name, since making a scatter plot matrix using the names of the countries wouldn’t make sense and would produce strange-looking results:

A scatter plot matrix

Figure 5.38: A scatter plot matrix

5.13.3 Discussion

You can also use customized functions for the panels. To show the correlation coefficient of each pair of variables instead of a scatter plot, we’ll define the function panel.cor. This will also show higher correlations in a larger font. Don’t worry about the details for now – just paste this code into your R session or script:

To show histograms of each variable along the diagonal, we’ll define panel.hist:

Both of these panel functions are taken from the pairs() help page, so if it’s more convenient, you can simply open that help page, then copy and paste. The last line of this version of the panel.cor function is slightly modified, however, so that the changes in font size aren’t as extreme as with the original.

Now that we’ve defined these functions we can use them for our scatter plot matrix, by telling pairs() to use panel.cor for the upper panels and panel.hist for the diagonal panels.

We’ll also throw in one more thing: panel.smooth for the lower panels, which makes a scatter plot and adds a LOWESS smoothed line, as shown in Figure 5.39. (LOWESS is slightly different from LOESS, which we saw in Recipe 5.6, but the differences aren’t important for this sort of rough exploratory visualization):

Scatter plot with correlations in the upper triangle, smoothing lines in the lower triangle, and histograms on the diagonal

Figure 5.39: Scatter plot with correlations in the upper triangle, smoothing lines in the lower triangle, and histograms on the diagonal

It may be more desirable to use linear regression lines instead of LOWESS lines. The panel.lm() function will do the trick (unlike the previous panel functions, this one isn’t in the pairs help page):

This time the default line color is black instead of red, though you can change it here (and with panel.smooth) by setting col.smooth when you call pairs().

We’ll also use small points in the visualization, so that we can distinguish them a bit better (Figure 5.40). This is done by setting pch = ".":

Scatter plot matrix with smaller points and linear fit lines

Figure 5.40: Scatter plot matrix with smaller points and linear fit lines

The size of the points can also be controlled using the cex parameter. The default value for cex is 1; make it smaller for smaller points and larger for larger points. Values below .5 might not render properly with PDF output.

5.13.4 See Also

To create a correlation matrix, see Recipe 13.1.

It is worth noting that we didn’t use ggplot here because it doesn’t make scatter plot matrices (at least, not well).

Other packages like GGally have been developed as extensions to ggplot to fill in this gap. The ggpairs() function from the GGally package makes scatter plot matrices, for example.