Chapter 13 Miscellaneous Graphs

There are many, many ways of visualizing data, and sometimes things don’t fit into nice, tidy categories. This chapter shows how to make some of these other visualizations.

13.1 Making a Correlation Matrix

13.1.1 Problem

You want to make a graphical correlation matrix.

13.1.2 Solution

We’ll look at the mtcars data set:

mtcars
#>                mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#>  ...<26 more rows>...
#> Ferrari Dino  19.7   6  145 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
#> Volvo 142E    21.4   4  121 109 4.11 2.780 18.60  1  1    4    2

First, generate the numerical correlation matrix using cor. This will generate correlation coefficients for each pair of columns:

mcor <- cor(mtcars)
# Print mcor and round to 2 digits
round(mcor, digits = 2)
#>        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#> mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
#> cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
#> disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
#>  ...<5 more rows>...
#> am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
#> gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
#> carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00

If there are any columns that you don’t want used for correlations (for example, a column of names), you should exclude them. If there are any NA cells in the original data, the resulting correlation matrix will have NA values. To deal with this, you will probably want to use the argument use="complete.obs" or use="pairwise.complete.obs".
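For example, here is a sketch of both techniques using a small made-up data frame (the dat object and its columns are hypothetical):

```r
# Hypothetical data frame with a non-numeric column and an NA value
dat <- data.frame(
  name = c("a", "b", "c", "d"),
  x    = c(1, 2, NA, 4),
  y    = c(2, 4, 6, 8)
)

# Exclude the non-numeric "name" column, and use pairwise complete
# observations so the NA doesn't propagate into the result
mcor_dat <- cor(dat[, c("x", "y")], use = "pairwise.complete.obs")
mcor_dat
```

Here x and y are perfectly correlated in the complete pairs, so the off-diagonal entries are 1; with the default use="everything", they would be NA.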

To plot the correlation matrix (Figure 13.1), we’ll use the corrplot package:

# If needed, first install with install.packages("corrplot")
library(corrplot)

corrplot(mcor)

Figure 13.1: A correlation matrix

13.1.3 Discussion

The corrplot() function has many, many options. Here is an example of how to make a correlation matrix with colored squares and black labels, rotated 45 degrees along the top (Figure 13.2):

corrplot(mcor, method = "shade", shade.col = NA, tl.col = "black", tl.srt = 45)

Figure 13.2: Correlation matrix with colored squares and black, rotated labels

It may also be helpful to display labels representing the correlation coefficient on each square in the matrix. In this example we’ll make a lighter palette so that the text is readable, and we’ll remove the color legend, since it’s redundant. We’ll also order the items so that correlated items are closer together, using the order="AOE" (angular order of eigenvectors) option. The result is shown in Figure 13.3:

# Generate a lighter palette
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))

corrplot(mcor, method = "shade", shade.col = NA, tl.col = "black", tl.srt = 45,
         col = col(200), addCoef.col = "black", cl.pos = "n", order = "AOE")

Figure 13.3: Correlation matrix with correlation coefficients and no legend

Like many other standalone plotting functions, corrplot() has its own menagerie of options, which can’t all be illustrated here. Table 13.1 lists some useful options.

Table 13.1: Options for corrplot()
Option                                Description
type={"lower" | "upper"}              Only use the lower or upper triangle
diag=FALSE                            Don't show values on the diagonal
addshade="all"                        Add lines indicating the direction of the correlation
shade.col=NA                          Hide correlation direction lines
method="shade"                        Use colored squares
method="ellipse"                      Use ellipses
addCoef.col="color"                   Add correlation coefficients, in the given color
tl.srt=number                         Specify the rotation angle for top labels
tl.col="color"                        Specify the label color
order={"AOE" | "FPC" | "hclust"}      Sort labels using angular order of eigenvectors, first principal component, or hierarchical clustering

13.1.4 See Also

To create a scatter plot matrix, see Recipe 5.13.

For more on subsetting data, see Recipe 15.7.

13.2 Plotting a Function

13.2.1 Problem

You want to plot a function.

13.2.2 Solution

Use stat_function(). It’s also necessary to give ggplot a dummy data frame so that it will get the proper x range. In this example we’ll use dnorm(), which gives the density of the normal distribution (Figure 13.4, left):

library(ggplot2)

# The data frame is only used for setting the range
p <- ggplot(data.frame(x = c(-3, 3)), aes(x = x))

p + stat_function(fun = dnorm)

13.2.3 Discussion

Some functions take additional arguments. For example, dt(), the function for the density of the t-distribution, takes a parameter for degrees of freedom (Figure 13.4, right). These additional arguments can be passed to the function by putting them in a list and giving the list to args:

p + stat_function(fun = dt, args = list(df = 2))

Figure 13.4: The normal distribution (left); The t-distribution with df=2 (right)

It’s also possible to define your own function to plot. It should take a vector of x values as its first argument, and it should return a vector of y values. In this example, we’ll define a sigmoid function (Figure 13.5):

myfun <- function(xvar) {
    1 / (1 + exp(-xvar + 10))
}

ggplot(data.frame(x = c(0, 20)), aes(x = x)) +
  stat_function(fun = myfun)

Figure 13.5: A user-defined function

By default, the function is calculated at 101 points along the x range. If you have a rapidly fluctuating function, you may be able to see the individual segments. To smooth out the curve, pass a larger value of n to stat_function(), as in stat_function(fun=myfun, n=200).
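For example, here is a sketch of the sigmoid curve from above evaluated at 200 points:

```r
library(ggplot2)

# Evaluate the function at 200 points instead of the default 101
p200 <- ggplot(data.frame(x = c(0, 20)), aes(x = x)) +
  stat_function(fun = function(x) 1 / (1 + exp(-x + 10)), n = 200)

p200
```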

13.2.4 See Also

For plotting predicted values from model objects (such as lm and glm), see Recipe 5.7.

13.3 Shading a Subregion Under a Function Curve

13.3.1 Problem

You want to shade part of the area under a function curve.

13.3.2 Solution

Define a new wrapper function around your curve function, and replace out-of-range values with NA, as shown in Figure 13.6:

# Return dnorm(x) for 0 <= x <= 2, and NA for all other x
dnorm_limit <- function(x) {
    y <- dnorm(x)
    y[x < 0  |  x > 2] <- NA
    return(y)
}

# ggplot() with dummy data
p <- ggplot(data.frame(x = c(-3, 3)), aes(x = x))

p +
  stat_function(fun = dnorm_limit, geom = "area", fill = "blue", alpha = 0.2) +
  stat_function(fun = dnorm)

Figure 13.6: Function curve with a shaded region

Remember that what gets passed to this function is a vector, not individual values. If this function operated on single elements at a time, it might make sense to use an if/else statement to decide what to return, conditional on the value of x. But that won’t work here, since x is a vector with many values.
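To see the vectorized behavior, call dnorm_limit() on a vector directly; it returns one y value (or NA) per input (the definition is repeated here so the snippet stands alone):

```r
dnorm_limit <- function(x) {
    y <- dnorm(x)
    y[x < 0  |  x > 2] <- NA
    return(y)
}

# One result per input; out-of-range inputs come back as NA
dnorm_limit(c(-1, 0, 1, 3))
#> [1]        NA 0.3989423 0.2419707        NA
```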

13.3.3 Discussion

R has first-class functions, and we can write a function that returns a closure; that is, we can write a function that generates another function.

This function will allow you to pass in a function, a minimum value, and a maximum value. Values outside the range will again be returned with NA:

limitRange <- function(fun, min, max) {
  function(x) {
    y <- fun(x)
    y[x < min  |  x > max] <- NA
    return(y)
  }
}

Now we can call this function to create another function – one that is effectively the same as the dnorm_limit() function used earlier:

# This returns a function
dlimit <- limitRange(dnorm, 0, 2)
# Now we'll try out the new function -- it only returns values for inputs
# between 0 and 2
dlimit(-2:4)
#> [1]         NA         NA 0.39894228 0.24197072 0.05399097         NA         NA

We can use limitRange() to create a function that is passed to stat_function():

p +
  stat_function(fun = dnorm) +
  stat_function(fun = limitRange(dnorm, 0, 2), geom = "area", fill = "blue",
                alpha = 0.2)

The limitRange() function can be used with any function, not just dnorm(), to create a range-limited version of that function. The result of all this is that instead of having to write functions with different hard-coded values for each situation that arises, we can write one function and simply pass it different arguments depending on the situation.

If you look very, very closely at the graph in Figure 13.6, you may see that the shaded region does not align exactly with the range we specified. This is because ggplot2 does a numeric approximation by calculating values at fixed intervals, and these intervals may not fall exactly within the specified range. As in Recipe 13.2, we can improve the approximation by increasing the number of interpolated values with stat_function(n = 200).
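Putting the pieces together, here is a sketch of the shaded plot with the finer approximation (repeating limitRange() so the snippet stands alone):

```r
library(ggplot2)

limitRange <- function(fun, min, max) {
  function(x) {
    y <- fun(x)
    y[x < min  |  x > max] <- NA
    return(y)
  }
}

# Shade the region from 0 to 2, computed at 200 points
p_shaded <- ggplot(data.frame(x = c(-3, 3)), aes(x = x)) +
  stat_function(fun = dnorm) +
  stat_function(fun = limitRange(dnorm, 0, 2), geom = "area",
                fill = "blue", alpha = 0.2, n = 200)

p_shaded
```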

13.4 Creating a Network Graph

13.4.1 Problem

You want to create a network graph.

13.4.2 Solution

Use the igraph package. To create a graph, pass a vector containing pairs of items to graph(), then plot the resulting object (Figure 13.7):

# May need to install first, with install.packages("igraph")
library(igraph)

# Specify edges for a directed graph
gd <- graph(c(1,2, 2,3, 2,4, 1,4, 5,5, 3,6))
plot(gd)

# For an undirected graph
gu <- graph(c(1,2, 2,3, 2,4, 1,4, 5,5, 3,6), directed = FALSE)
# No labels
plot(gu, vertex.label = NA)

Figure 13.7: A directed graph (left); An undirected graph, with no vertex labels (right)

This is the structure of each of the graph objects:

gd
#> IGRAPH 7d4a625 D--- 6 6 -- 
#> + edges from 7d4a625:
#> [1] 1->2 2->3 2->4 1->4 5->5 3->6
gu
#> IGRAPH 2abf0a4 U--- 6 6 -- 
#> + edges from 2abf0a4:
#> [1] 1--2 2--3 2--4 1--4 5--5 3--6

13.4.3 Discussion

In a network graph, the position of the nodes is unspecified by the data, and they’re placed randomly. To make the output repeatable, you can set the random seed before making the plot. You can try different random numbers until you get a result that you like:

set.seed(229)
plot(gu)

It’s also possible to create a graph from a data frame. The first two columns of the data frame are used, and each row specifies a connection between two nodes. In the next example (Figure 13.8), we’ll use the madmen2 data set, which has this structure. We’ll also use the Fruchterman-Reingold layout algorithm. The idea is that all the nodes have a magnetic repulsion from one another, but the edges between nodes act as springs, pulling the nodes together:

library(gcookbook) # For the data set
madmen2
#>                       Name1          Name2
#> 1               Abe Drexler    Peggy Olson
#> 2                   Allison     Don Draper
#> 3               Arthur Case   Betty Draper
#>  ...<81 more rows>...
#> 85                    Vicky Roger Sterling
#> 86                 Waitress     Don Draper
#> 87 Woman at the Clios party     Don Draper
# Create a graph object from the data set
g <- graph.data.frame(madmen2, directed=TRUE)
# Remove unnecessary margins
par(mar = c(0, 0, 0, 0))
plot(g, layout = layout.fruchterman.reingold, vertex.size = 8,
     edge.arrow.size = 0.5, vertex.label = NA)

Figure 13.8: A directed graph from a data frame, with the Fruchterman-Reingold algorithm

It’s also possible to make an undirected graph from a data frame. The madmen data set has only one row for each pairing, since direction doesn’t matter for an undirected graph. This time we’ll use a circle layout (Figure 13.9):

g <- graph.data.frame(madmen, directed = FALSE)
par(mar = c(0, 0, 0, 0))  # Remove unnecessary margins
plot(g, layout = layout.circle, vertex.size = 8, vertex.label = NA)

Figure 13.9: A circular undirected graph from a data frame

13.4.4 See Also

For more information about the available output options, see ?plot.igraph. Also see ?igraph::layout for layout options.

An alternative to igraph is Rgraphviz, which is a frontend for Graphviz, an open-source library for visualizing graphs. It works better with labels and makes it easier to create graphs with a controlled layout, but it can be a bit challenging to install. Rgraphviz is available through the Bioconductor repository system.

13.5 Using Text Labels in a Network Graph

13.5.1 Problem

You want to use text labels in a network graph.

13.5.2 Solution

The vertices/nodes may have names, but these names are not used as labels by default. To set the labels, pass in a vector of names to vertex.label (Figure 13.10):

library(igraph)
library(gcookbook) # For the data set

# Copy madmen and drop every other row
m <- madmen[1:nrow(madmen) %% 2 == 1, ]

g <- graph.data.frame(m, directed = FALSE)

# Print out the names of each vertex
V(g)$name
#>  [1] "Betty Draper"      "Don Draper"        "Harry Crane"      
#>  [4] "Joan Holloway"     "Lane Pryce"        "Peggy Olson"      
#>  [7] "Pete Campbell"     "Roger Sterling"    "Sal Romano"       
#>  ...
#> [19] "Rebecca Pryce"     "Abe Drexler"       "Duck Phillips"    
#> [22] "Playtex bra model" "Ida Blankenship"   "Mirabelle Ames"   
#> [25] "Vicky"             "Kitty Romano"

plot(g, layout=layout.fruchterman.reingold,
    vertex.size = 4,          # Smaller nodes
    vertex.label = V(g)$name, # Set the labels
    vertex.label.cex = 0.8,   # Slightly smaller font
    vertex.label.dist = 0.4,  # Offset the labels
    vertex.label.color = "black")

Figure 13.10: A network graph with labels

13.5.3 Discussion

Another way to achieve the same effect is to modify the plot object, instead of passing in the values as arguments to plot(). To do this, use V()$xxx<- instead of passing a value to a vertex.xxx argument. For example, this will result in the same output as the previous code:

# This is equivalent to the preceding code
V(g)$size        <- 4
V(g)$label       <- V(g)$name
V(g)$label.cex   <- 0.8
V(g)$label.dist  <- 0.4
V(g)$label.color <- "black"

# Set a property of the entire graph
g$layout <- layout.fruchterman.reingold

plot(g)

The properties of the edges can also be set, either with the E() function or by passing values to edge.xxx arguments (Figure 13.11):

# View the edges
E(g)
#> + 20/20 edges from 08b4dc9 (vertex names):
#>  [1] Betty Draper  --Henry Francis     Don Draper    --Allison          
#>  [3] Betty Draper  --Don Draper        Don Draper    --Candace          
#>  [5] Don Draper    --Faye Miller       Don Draper    --Megan Calvet     
#>  [7] Don Draper    --Rachel Menken     Don Draper    --Suzanne Farrell  
#>  [9] Harry Crane   --Hildy             Joan Holloway --Franklin         
#> [11] Joan Holloway --Roger Sterling    Lane Pryce    --Rebecca Pryce    
#> [13] Peggy Olson   --Abe Drexler       Peggy Olson   --Duck Phillips    
#> [15] Peggy Olson   --Pete Campbell     Pete Campbell --Playtex bra model
#> [17] Roger Sterling--Ida Blankenship   Roger Sterling--Mirabelle Ames   
#> [19] Roger Sterling--Vicky             Sal Romano    --Kitty Romano

# Set some of the labels to "M"
E(g)[c(2,11,19)]$label <- "M"

# Set color of all to grey, and then color a few red
E(g)$color             <- "grey70"
E(g)[c(2,11,19)]$color <- "red"

plot(g)

Figure 13.11: A network graph with labeled and colored edges

13.5.4 See Also

See ?igraph.plotting for more information about graphical parameters in igraph.

13.6 Creating a Heat Map

13.6.1 Problem

You want to make a heat map.

13.6.2 Solution

Use geom_tile() or geom_raster() and map a continuous variable to fill. We’ll use the presidents data set, which is a time series object rather than a data frame:

presidents
#>      Qtr1 Qtr2 Qtr3 Qtr4
#> 1945   NA   87   82   75
#> 1946   63   50   43   32
#> 1947   35   60   54   55
#>  ...
#> 1972   49   61   NA   NA
#> 1973   68   44   40   27
#> 1974   28   25   24   24
str(presidents)
#>  Time-Series [1:120] from 1945 to 1975: NA 87 82 75 63 50 43 32 35 60 ...

We’ll first convert it to a format that is usable by ggplot: a data frame with columns that are numeric:

pres_rating <- data.frame(
  rating = as.numeric(presidents),
  year = as.numeric(floor(time(presidents))),
  quarter = as.numeric(cycle(presidents))
)
pres_rating
#>     rating year quarter
#> 1       NA 1945       1
#> 2       87 1945       2
#> 3       82 1945       3
#>  ...<114 more rows>...
#> 118     25 1974       2
#> 119     24 1974       3
#> 120     24 1974       4

Now we can make the plot using geom_tile() or geom_raster() (Figure 13.12). Simply map one variable to x, one to y, and one to fill:

# Base plot
p <- ggplot(pres_rating, aes(x = year, y = quarter, fill = rating))

# Using geom_tile()
p + geom_tile()

# Using geom_raster() - looks the same, but a little more efficient
p + geom_raster()

Figure 13.12: A heat map; the grey squares represent NAs in the data

Note

The results with geom_tile() and geom_raster() should look the same, but in practice they might appear different. See Recipe 6.12 for more information about this issue.

13.6.3 Discussion

To better convey useful information, you may want to customize the appearance of the heat map. In this example, we’ll reverse the y-axis so that it progresses from top to bottom, and we’ll add tick marks every four years along the x-axis, to correspond with each presidential term. For the x and y scales, we’ll remove the padding by using expand=c(0, 0). We’ll also change the color scale using scale_fill_gradient2(), which lets you specify a midpoint color and the two colors at the low and high ends (Figure 13.13):

p +
  geom_tile() +
  scale_x_continuous(breaks = seq(1940, 1976, by = 4), expand = c(0, 0)) +
  scale_y_reverse(expand = c(0, 0)) +
  scale_fill_gradient2(midpoint = 50, mid = "grey70", limits = c(0, 100))

Figure 13.13: A heat map with customized appearance

13.6.4 See Also

If you want to use a different color palette, see Recipe 12.6.

13.7 Creating a Three-Dimensional Scatter Plot

13.7.1 Problem

You want to create a three-dimensional scatter plot.

13.7.2 Solution

We’ll use the rgl package, which provides an interface to the OpenGL graphics library for 3D graphics. To create a 3D scatter plot, as in Figure 13.14, use plot3d() and pass in a data frame where the first three columns represent x, y, and z coordinates, or pass in three vectors representing the x, y, and z coordinates.

# You may need to install first, with install.packages("rgl")
library(rgl)
plot3d(mtcars$wt, mtcars$disp, mtcars$mpg, type = "s", size = 0.75, lit = FALSE)

Figure 13.14: A 3D scatter plot

Note

If you are using a Mac, you will need to install XQuartz before you can use rgl. You can download it from https://www.xquartz.org/.

Viewers can rotate the image by clicking and dragging with the mouse, and zoom in and out with the scroll wheel.

Note

By default, plot3d() uses square points, which do not appear properly when saving to a PDF. For improved appearance, the example above uses type="s" for spherical points, makes them smaller with size=0.75, and turns off the 3D lighting with lit=FALSE (otherwise they would look like shiny spheres).

13.7.3 Discussion

Three-dimensional scatter plots can be difficult to interpret, so it’s often better to use a two-dimensional representation of the data. That said, there are things that can help make a 3D scatter plot easier to understand.

In Figure 13.15, we’ll add vertical segments to help give a sense of the spatial positions of the points:

# Function to interleave the elements of two vectors
interleave <- function(v1, v2)  as.vector(rbind(v1,v2))

# Plot the points
plot3d(mtcars$wt, mtcars$disp, mtcars$mpg,
       xlab = "Weight", ylab = "Displacement", zlab = "MPG",
       size = .75, type = "s", lit = FALSE)

# Add the segments
segments3d(interleave(mtcars$wt,   mtcars$wt),
           interleave(mtcars$disp, mtcars$disp),
           interleave(mtcars$mpg,  min(mtcars$mpg)),
           alpha = 0.4, col = "blue")

Figure 13.15: A 3D scatter plot with vertical lines for each point

It’s possible to tweak the appearance of the background and the axes. In Figure 13.16, we change the number of tick marks and add tick marks and axis labels to the specified sides:

# Make plot without axis ticks or labels
plot3d(mtcars$wt, mtcars$disp, mtcars$mpg,
       xlab = "", ylab = "", zlab = "",
       axes = FALSE,
       size = .75, type = "s", lit = FALSE)

segments3d(interleave(mtcars$wt,   mtcars$wt),
           interleave(mtcars$disp, mtcars$disp),
           interleave(mtcars$mpg,  min(mtcars$mpg)),
           alpha = 0.4, col = "blue")

# Draw the box.
rgl.bbox(color = "grey50",          # grey50 surface and black text
         emission = "grey50",       # emission color is grey50
         xlen = 0, ylen = 0, zlen = 0)  # Don't add tick marks

# Set default color of future objects to black
rgl.material(color = "black")

# Add axes to specific sides. Possible values are "x--", "x-+", "x+-", and "x++".
axes3d(edges = c("x--", "y+-", "z--"),
       ntick = 6,                       # Attempt 6 tick marks on each side
       cex = .75)                       # Smaller font

# Add axis labels. 'line' specifies how far to set the label from the axis.
mtext3d("Weight",       edge = "x--", line = 2)
mtext3d("Displacement", edge = "y+-", line = 3)
mtext3d("MPG",          edge = "z--", line = 3)

Figure 13.16: 3D scatter plot with axis ticks and labels repositioned

13.7.4 See Also

See ?plot3d for more options for controlling the output.

13.8 Adding a Prediction Surface to a Three-Dimensional Plot

13.8.1 Problem

You want to add a surface of predicted values to a three-dimensional scatter plot.

13.8.2 Solution

First we need to define some utility functions for generating the predicted values from a model object:

# Given a model, predict zvar from xvar and yvar
# Defaults to range of x and y variables, and a 16x16 grid
predictgrid <- function(model, xvar, yvar, zvar, res = 16, type = NULL) {
  # Find the range of the predictor variable. This works for lm and glm
  # and some others, but may require customization for others.
  xrange <- range(model$model[[xvar]])
  yrange <- range(model$model[[yvar]])

  newdata <- expand.grid(x = seq(xrange[1], xrange[2], length.out = res),
                         y = seq(yrange[1], yrange[2], length.out = res))
  names(newdata) <- c(xvar, yvar)
  newdata[[zvar]] <- predict(model, newdata = newdata, type = type)
  newdata
}


# Convert long-style data frame with x, y, and z vars into a list
# with x and y as row/column values, and z as a matrix.
df2mat <- function(p, xvar = NULL, yvar = NULL, zvar = NULL) {
  if (is.null(xvar)) xvar <- names(p)[1]
  if (is.null(yvar)) yvar <- names(p)[2]
  if (is.null(zvar)) zvar <- names(p)[3]

  x <- unique(p[[xvar]])
  y <- unique(p[[yvar]])
  z <- matrix(p[[zvar]], nrow = length(y), ncol = length(x))

  m <- list(x, y, z)
  names(m) <- c(xvar, yvar, zvar)
  m
}

# Function to interleave the elements of two vectors
interleave <- function(v1, v2)  as.vector(rbind(v1,v2))

With these utility functions defined, we can make a linear model from the data and plot it as a mesh along with the data, using the surface3d() function, as shown in Figure 13.17:

library(rgl)

# Make a copy of the data set
m <- mtcars

# Generate a linear model
mod <- lm(mpg ~ wt + disp + wt:disp, data = m)

# Get predicted values of mpg from wt and disp
m$pred_mpg <- predict(mod)

# Get predicted mpg from a grid of wt and disp
mpgrid_df <- predictgrid(mod, "wt", "disp", "mpg")
mpgrid_list <- df2mat(mpgrid_df)

# Make the plot with the data points
plot3d(m$wt, m$disp, m$mpg, type = "s", size = 0.5, lit = FALSE)

# Add the corresponding predicted points (smaller)
spheres3d(m$wt, m$disp, m$pred_mpg, alpha = 0.4, type = "s", size = 0.5, lit = FALSE)

# Add line segments showing the error
segments3d(interleave(m$wt,   m$wt),
           interleave(m$disp, m$disp),
           interleave(m$mpg,  m$pred_mpg),
           alpha = 0.4, col = "red")

# Add the mesh of predicted values
surface3d(mpgrid_list$wt, mpgrid_list$disp, mpgrid_list$mpg,
          alpha = 0.4, front = "lines", back = "lines")

Figure 13.17: A 3D scatter plot with a prediction surface

13.8.3 Discussion

We can tweak the appearance of the graph, as shown in Figure 13.18. We’ll add each of the components of the graph separately:

plot3d(mtcars$wt, mtcars$disp, mtcars$mpg,
       xlab = "", ylab = "", zlab = "",
       axes = FALSE,
       size = .5, type = "s", lit = FALSE)

# Add the corresponding predicted points (smaller)
spheres3d(m$wt, m$disp, m$pred_mpg, alpha = 0.4, type = "s", size = 0.5, lit = FALSE)

# Add line segments showing the error
segments3d(interleave(m$wt,   m$wt),
           interleave(m$disp, m$disp),
           interleave(m$mpg,  m$pred_mpg),
           alpha = 0.4, col = "red")

# Add the mesh of predicted values
surface3d(mpgrid_list$wt, mpgrid_list$disp, mpgrid_list$mpg,
          alpha = 0.4, front = "lines", back = "lines")

# Draw the box
rgl.bbox(color = "grey50",          # grey50 surface and black text
         emission = "grey50",       # emission color is grey50
         xlen = 0, ylen = 0, zlen = 0)  # Don't add tick marks

# Set default color of future objects to black
rgl.material(color = "black")

# Add axes to specific sides. Possible values are "x--", "x-+", "x+-", and "x++".
axes3d(edges = c("x--", "y+-", "z--"),
       ntick = 6,                       # Attempt 6 tick marks on each side
       cex = .75)                       # Smaller font

# Add axis labels. 'line' specifies how far to set the label from the axis.
mtext3d("Weight",       edge = "x--", line = 2)
mtext3d("Displacement", edge = "y+-", line = 3)
mtext3d("MPG",          edge = "z--", line = 3)

Figure 13.18: Three-dimensional scatter plot with customized appearance

13.8.4 See Also

For more on changing the appearance of the surface, see ?rgl.material.

13.9 Saving a Three-Dimensional Plot

13.9.1 Problem

You want to save a three-dimensional plot created with the rgl package.

13.9.2 Solution

To save a bitmap image of a plot created with rgl, use rgl.snapshot(). This will capture the exact image that is on the screen:

library(rgl)
plot3d(mtcars$wt, mtcars$disp, mtcars$mpg, type = "s", size = 0.75, lit = FALSE)

rgl.snapshot('3dplot.png', fmt = 'png')

You can also use rgl.postscript() to save a PostScript or PDF file:

rgl.postscript('3dplot.pdf', fmt = 'pdf')

rgl.postscript('3dplot.ps', fmt = 'ps')

PostScript and PDF output does not support many features of the OpenGL library on which rgl is based. For example, it does not support transparency, and the sizes of objects such as points and lines may not be the same as what appears on the screen.

13.9.3 Discussion

To make the output more repeatable, you can save your current viewpoint and restore it later:

# Save the current viewpoint
view <- par3d("userMatrix")

# Restore the saved viewpoint
par3d(userMatrix = view)

To save view in a script, you can use dput(), then copy and paste the output into your script:

dput(view)

Once you have the text representation of the userMatrix, add the following to your script:

view <- structure(c(0.907931625843048, 0.267511069774628, -0.322642296552658,
0, -0.410978674888611, 0.417272746562958, -0.810543060302734,
0, -0.0821993798017502, 0.868516683578491, 0.488796472549438,
0, 0, 0, 0, 1), .Dim = c(4L, 4L))

par3d(userMatrix = view)

13.10 Animating a Three-Dimensional Plot

13.10.1 Problem

You want to animate a three-dimensional plot by moving the viewpoint around the plot.

13.10.2 Solution

Rotating a 3D plot can provide a more complete view of the data. To animate a 3D plot, use play3d() with spin3d():

library(rgl)
plot3d(mtcars$wt, mtcars$disp, mtcars$mpg, type = "s", size = 0.75, lit = FALSE)

play3d(spin3d())

13.10.3 Discussion

By default, the graph will be rotated on the z (vertical) axis, until you send a break command to R.

You can change the rotation axis, rotation speed, and duration:

# Spin on x-axis, at 4 rpm, for 20 seconds
play3d(spin3d(axis = c(1,0,0), rpm = 4), duration = 20)

To save the movie, use the movie3d() function in the same way as play3d(). It will generate a series of .png files, one for each frame, and then attempt to combine them into a single animated .gif file using the convert program from the ImageMagick image utility.

This will spin the plot once in 15 seconds, at 50 frames per second:

# Spin on z axis, at 4 rpm, for 15 seconds
movie3d(spin3d(axis = c(0,0,1), rpm = 4), duration = 15, fps = 50)

The output file will be saved in a temporary directory, and the name will be printed on the R console.

If you don’t want to use ImageMagick to convert the output to a .gif, you can specify convert=FALSE and then convert the series of .png files to a movie using some other utility.

13.11 Creating a Dendrogram

13.11.1 Problem

You want to make a dendrogram to show how items are clustered.

13.11.2 Solution

Use hclust() and plot the output from it. This can require a fair bit of data preprocessing. For this example, we’ll first take a subset of the countries data set from the year 2009. For simplicity, we’ll also drop all rows that contain an NA, and then select a random 25 of the remaining rows:

library(dplyr)
library(tidyr)     # For drop_na function
library(gcookbook) # For the data set

# Set random seed to make random operation below repeatable
set.seed(392)

c2 <- countries %>%
  filter(Year == 2009) %>%  # Get data from year 2009
  drop_na() %>%             # Drop rows that have any NA values
  sample_n(25)              # Select 25 random rows

c2
#>                Name Code Year         GDP laborrate  healthexp infmortality
#> 1  Egypt, Arab Rep.  EGY 2009   2370.7111      48.8  113.29717         20.0
#> 2             Haiti  HTI 2009    656.7792      69.9   39.60249         59.3
#> 3            Belize  BLZ 2009   4056.1224      64.1  217.15214         14.9
#>  ...<19 more rows>...
#> 23       Luxembourg  LUX 2009 106252.2442      55.5 8182.85511          2.2
#> 24        Lithuania  LTU 2009  11033.5885      55.7  729.78492          5.7
#> 25          Denmark  DNK 2009  55933.3545      65.4 6272.72868          3.4

Notice that the row names (the first column) are essentially random numbers, since the rows were selected randomly. We need to do a few more things to the data before making a dendrogram from it. First, we need to set the row names: right now there’s a column called Name, but the row names are those random numbers. (We don’t often use row names, but for the hclust() function they’re essential.) Next, we need to drop all the columns that aren’t values used for clustering. These columns are Name, Code, and Year:

rownames(c2) <- c2$Name
c2 <- c2[, 4:7]
c2
#>                          GDP laborrate  healthexp infmortality
#> Egypt, Arab Rep.   2370.7111      48.8  113.29717         20.0
#> Haiti               656.7792      69.9   39.60249         59.3
#> Belize             4056.1224      64.1  217.15214         14.9
#>  ...<19 more rows>...
#> Luxembourg       106252.2442      55.5 8182.85511          2.2
#> Lithuania         11033.5885      55.7  729.78492          5.7
#> Denmark           55933.3545      65.4 6272.72868          3.4

The values for GDP are several orders of magnitude larger than the values for, say, infmortality. Because of this, the effect of infmortality on the clustering will be negligible compared to the effect of GDP. This probably isn’t what we want. To address this issue, we’ll scale the data:

c3 <- scale(c2)
c3
#>                         GDP    laborrate  healthexp infmortality
#> Egypt, Arab Rep. -0.6183699 -1.564089832 -0.6463085   -0.2387436
#> Haiti            -0.6882421  0.602552464 -0.6797433    1.0985742
#> Belize           -0.5496604  0.006982544 -0.5991902   -0.4122886
#>  ...<19 more rows>...
#> Luxembourg        3.6165896 -0.876103889  3.0147969   -0.8444498
#> Lithuania        -0.2652086 -0.855566996 -0.3666121   -0.7253503
#> Denmark           1.5652292  0.140472354  2.1481851   -0.8036157

By default the scale() function centers each column on its mean and scales it relative to its standard deviation, but other centering and scaling methods can be used.
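
As a check on what the default settings do, here is a minimal base R sketch (with made-up numbers) showing that scale() is equivalent to subtracting each column’s mean and dividing by its standard deviation:

```r
# Made-up data: two columns on very different scales
m <- cbind(gdp = c(1000, 2000, 3000, 5000), rate = c(0.1, 0.2, 0.3, 0.5))

# scale() with defaults: center on the column mean, divide by the column sd
s1 <- scale(m)

# The same thing done by hand with sweep()
s2 <- sweep(sweep(m, 2, colMeans(m), "-"), 2, apply(m, 2, sd), "/")

all.equal(as.numeric(s1), as.numeric(s2))  # TRUE
```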

Finally, we’re ready to make the dendrogram, as shown in Figure 13.19:

hc <- hclust(dist(c3))

# Make the dendrogram
plot(hc)

# With text aligned
plot(hc, hang = -1)

Figure 13.19: A dendrogram (left); With text aligned (right)

13.11.3 Discussion

A cluster analysis is simply a way of assigning points to groups in an n-dimensional space (four dimensions, in this example). A hierarchical cluster analysis divides each group into two smaller groups, and can be represented with the dendrograms in this recipe. There are many different parameters you can control in the hierarchical cluster analysis process, and there may not be a single “right” way to do it for your data.

First, we normalized the data using scale() with its default settings. You can scale your data differently, or not at all. (With this data set, not scaling the data will lead to GDP overwhelming the other variables, as shown in Figure 13.20.)

hc_unscaled <- hclust(dist(c2))
plot(hc_unscaled)

Figure 13.20: Dendrogram with unscaled data: notice the much larger Height values, which are largely due to the unscaled GDP values

For the distance calculation, we used the default method, “euclidean”, which calculates the Euclidean distance between the points. The other possible methods are “maximum”, “manhattan”, “canberra”, “binary”, and “minkowski”.
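
The difference between these methods is easy to see on a toy example (the points here are invented for illustration):

```r
# Two points forming the classic 3-4-5 right triangle
pts <- rbind(a = c(0, 0), b = c(3, 4))

dist(pts)                          # euclidean: sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")    # sum of absolute differences: 3 + 4 = 7
dist(pts, method = "maximum")      # largest single-coordinate difference: 4
```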

The hclust() function provides several methods for performing the cluster analysis. The default is “complete”; the other possible methods are “ward.D”, “ward.D2”, “single”, “average”, “mcquitty”, “median”, and “centroid”.
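
To try a non-default linkage method, pass method to hclust(); cutree() then assigns each observation to a cluster. This is a sketch on synthetic data (the two well-separated groups are invented for illustration):

```r
set.seed(1)
# Ten 2-D points: five near (0, 0) and five near (5, 5)
m <- rbind(matrix(rnorm(10, mean = 0), ncol = 2),
           matrix(rnorm(10, mean = 5), ncol = 2))

hc_avg <- hclust(dist(m), method = "average")  # "average" linkage instead of "complete"
groups <- cutree(hc_avg, k = 2)                # cut the tree into 2 clusters
groups
```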

13.11.4 See Also

See ?hclust for more information about the different clustering methods.

13.12 Creating a Vector Field

13.12.1 Problem

You want to make a vector field.

13.12.2 Solution

Use geom_segment(). For this example, we’ll use the isabel data set:

library(gcookbook) # For the isabel data set
isabel
#>                x        y      z        vx        vy          vz         t
#> 1      -83.00000 41.70000  0.035        NA        NA          NA        NA
#> 2      -83.00000 41.55571  0.035        NA        NA          NA        NA
#> 3      -83.00000 41.41142  0.035        NA        NA          NA        NA
#>  ...<156,244 more rows>...
#> 156248 -62.12625 24.09679 18.035 -11.39709 -5.315139 0.009657148 -66.99567
#> 156249 -62.12625 23.95251 18.035 -11.37965 -5.275015 0.040921956 -67.00032
#> 156250 -62.12625 23.80822 18.035 -12.16637 -5.435891 0.030216325 -66.98057
#>           speed
#> 1            NA
#> 2            NA
#> 3            NA
#>  ...<156,244 more rows>...
#> 156248 12.57555
#> 156249 12.54281
#> 156250 13.32552

x and y are the longitude and latitude, respectively, and z is the height in kilometers. The vx, vy, and vz values are the wind speed components in each of these directions, in meters per second, and speed is the wind speed.

The height (z) ranges from 0.035 km to 18.035 km. For this example, we’ll just use the lowest slice of data.

To draw the vectors (Figure 13.21), we’ll use geom_segment(). Each segment has a starting point and an ending point. We’ll use the x and y values as the starting points for each segment, then add a fraction of the vx and vy values to get the end points for each segment. If we didn’t scale down these values, the lines would be much too long:

islice <- filter(isabel, z == min(z))

ggplot(islice, aes(x = x, y = y)) +
       geom_segment(aes(xend = x + vx/50, yend = y + vy/50),
                    size = 0.25)   # Make the line segments 0.25 mm thick

Figure 13.21: First attempt at a vector field. The resolution of the data is too high, but it does hint at some interesting patterns not visible in graphs with a lower data resolution

This vector field has two problems: the data is at too high a resolution to read, and the segments do not have arrows indicating the direction of the flow. To reduce the resolution of the data, we’ll define a function every_n() that keeps one out of every n values in the data and drops the rest:

# Take a slice where z is equal to the minimum value of z
islice <- filter(isabel, z == min(z))

# Keep 1 out of every 'by' values in vector x
every_n <- function(x, by = 2) {
  x <- sort(x)
  x[seq(1, length(x), by = by)]
}

# Keep 1 of every 4 values in x and y
keepx <- every_n(unique(isabel$x), by = 4)
keepy <- every_n(unique(isabel$y), by = 4)

# Keep only those rows where x value is in keepx and y value is in keepy
islicesub <- filter(islice, x %in% keepx  &  y %in% keepy)
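
To see what every_n() does on its own, here it is applied to a small made-up vector:

```r
every_n <- function(x, by = 2) {
  x <- sort(x)
  x[seq(1, length(x), by = by)]
}

every_n(1:10, by = 4)   # keeps elements 1, 5, and 9 of the sorted vector
```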

Now that we’ve taken a subset of the data, we can plot it, with arrowheads, as shown in Figure 13.22:

# Need to load grid for arrow() function
library(grid)

# Make the plot with the subset, and use an arrowhead 0.1 cm long
ggplot(islicesub, aes(x = x, y = y)) +
    geom_segment(aes(xend = x+vx/50, yend = y+vy/50),
                 arrow = arrow(length = unit(0.1, "cm")), size = 0.25)

Figure 13.22: Vector field with arrowheads

13.12.3 Discussion

One effect of arrowheads is that short vectors appear with more ink than is proportional to their length. This could somewhat distort the interpretation of the data. To mitigate this effect, it may also be useful to map the speed to other properties, like size (line thickness), alpha, or colour. Here, we’ll map speed to alpha (Figure 13.23, left):

# The existing 'speed' column includes the z component. We'll calculate
# speedxy, the horizontal speed.
islicesub$speedxy <- sqrt(islicesub$vx^2 + islicesub$vy^2)

# Map speed to alpha
ggplot(islicesub, aes(x = x, y = y)) +
    geom_segment(aes(xend = x+vx/50, yend = y+vy/50, alpha = speed),
                 arrow = arrow(length = unit(0.1,"cm")), size = 0.6)

Next, we’ll map speed to colour. We’ll also add a map of the United States and zoom in on the area of interest using coord_cartesian(), as shown in the graph on the right in Figure 13.23 (without the zoom, the entire USA would be displayed):

# Get USA map data
usa <- map_data("usa")

# Map speed to colour, and set the colours to go from "grey80" to "darkred"
ggplot(islicesub, aes(x = x, y = y)) +
    geom_segment(aes(xend = x+vx/50, yend = y+vy/50, colour = speed),
                 arrow = arrow(length = unit(0.1,"cm")), size = 0.6) +
    scale_colour_continuous(low = "grey80", high = "darkred") +
    geom_path(aes(x = long, y = lat, group = group), data = usa) +
    coord_cartesian(xlim = range(islicesub$x), ylim = range(islicesub$y))

Figure 13.23: Vector field with speed mapped to alpha (left); With speed mapped to colour (right)

The isabel data set has three-dimensional data, so we can also make a faceted graph of the data, as shown in Figure 13.24. Because each facet is small, we will use a sparser subset than before:

# Keep 1 out of every 5 values in x and y, and 1 in 2 values in z
keepx <- every_n(unique(isabel$x), by = 5)
keepy <- every_n(unique(isabel$y), by = 5)
keepz <- every_n(unique(isabel$z), by = 2)

isub <- filter(isabel, x %in% keepx  &  y %in% keepy  &  z %in% keepz)

ggplot(isub, aes(x = x, y = y)) +
    geom_segment(aes(xend = x+vx/50, yend = y+vy/50, colour = speed),
                 arrow = arrow(length = unit(0.1,"cm")), size = 0.5) +
    scale_colour_continuous(low = "grey80", high = "darkred") +
    facet_wrap( ~ z)

Figure 13.24: Vector field of wind speeds, faceted on z

13.12.4 See Also

If you want to use a different color palette, see Recipe 12.6.

See Recipe 8.2 for more information about zooming in on part of a graph.

13.13 Creating a QQ Plot

13.13.1 Problem

You want to make a quantile-quantile (QQ) plot to compare an empirical distribution to a theoretical distribution.

13.13.2 Solution

Use geom_qq() and geom_qq_line() to compare to a normal distribution (Figure 13.25):

library(gcookbook) # For the data set

ggplot(heightweight, aes(sample = heightIn)) +
  geom_qq() +
  geom_qq_line()

ggplot(heightweight, aes(sample = ageYear)) +
  geom_qq() +
  geom_qq_line()

Figure 13.25: QQ plot of height, which is close to normally distributed (left); QQ plot of age, which is not normally distributed (right)

13.13.3 Discussion

The points for heightIn are close to the line, which means that the distribution is close to normal. In contrast, the points for ageYear veer far away from the line, especially on the left, indicating that the distribution is skewed. A histogram may also be useful for exploring how the data is distributed.
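
Under the hood, a normal QQ plot simply pairs the sorted data with normal quantiles taken at evenly spaced probabilities. A base R sketch, with made-up height values:

```r
# Made-up sample of ten height measurements
x <- c(50.5, 52.0, 53.1, 54.0, 55.2, 56.0, 57.4, 58.0, 59.9, 62.0)

empirical   <- sort(x)                    # sample quantiles (the y-axis of geom_qq)
theoretical <- qnorm(ppoints(length(x)))  # normal quantiles (the x-axis)

# Plotting theoretical against empirical reproduces the QQ points;
# if x were exactly normal, the pairs would fall on a straight line.
cbind(theoretical, empirical)
```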

13.13.4 See Also

See ?stat_qq for information on comparing data to theoretical distributions other than the normal distribution.

13.14 Creating a Graph of an Empirical Cumulative Distribution Function

13.14.1 Problem

You want to graph the empirical cumulative distribution function (ECDF) of a data set.

13.14.2 Solution

Use stat_ecdf() (Figure 13.26):

library(gcookbook) # For the data set

# ecdf of heightIn
ggplot(heightweight, aes(x = heightIn)) +
  stat_ecdf()

# ecdf of ageYear
ggplot(heightweight, aes(x = ageYear)) +
  stat_ecdf()

Figure 13.26: ECDF of height (left); ECDF of age (right)

13.14.3 Discussion

The ECDF shows what proportion of observations are at or below the given x value. Because it is empirical, the line takes a step up at each x value where there are one or more observations.
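
The same quantity is available numerically through base R’s ecdf(), which returns a step function; here is a small sketch with made-up values:

```r
x <- c(1, 2, 2, 3, 5)
f <- ecdf(x)   # f(v) = proportion of observations <= v

f(2)   # 3 of the 5 observations are <= 2, so 0.6
f(0)   # 0: below the smallest observation
f(5)   # 1: at or above the largest observation
```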

13.15 Creating a Mosaic Plot

13.15.1 Problem

You want to make a mosaic plot to visualize a contingency table.

13.15.2 Solution

Use the mosaic() function from the vcd package. For this example we’ll use the UCBAdmissions data set, which is a contingency table with three dimensions. We’ll first take a look at the data in a few different ways:

UCBAdmissions
#> , , Dept = A
#> 
#>           Gender
#> Admit      Male Female
#>   Admitted  512     89
#>   Rejected  313     19
#> 
#> , , Dept = B
#> 
#>           Gender
#> Admit      Male Female
#>   Admitted  353     17
#>   Rejected  207      8
#> 
#>  ... with 41 more lines of text

# Print a "flat" contingency table
ftable(UCBAdmissions)
#>                 Dept   A   B   C   D   E   F
#> Admit    Gender                             
#> Admitted Male        512 353 120 138  53  22
#>          Female       89  17 202 131  94  24
#> Rejected Male        313 207 205 279 138 351
#>          Female       19   8 391 244 299 317

dimnames(UCBAdmissions)
#> $Admit
#> [1] "Admitted" "Rejected"
#> 
#> $Gender
#> [1] "Male"   "Female"
#> 
#> $Dept
#> [1] "A" "B" "C" "D" "E" "F"

The three dimensions are Admit, Gender, and Dept. To visualize the relationships between the variables (Figure 13.27), use mosaic() and pass it a formula with the variables that will be used to split up the data:

# You may need to install first, with install.packages("vcd")
library(vcd)
# Split by Admit, then Gender, then Dept
mosaic( ~ Admit + Gender + Dept, data = UCBAdmissions)

Figure 13.27: Mosaic plot of UC-Berkeley admissions data: the area of each rectangle is proportional to the number of cases in that cell

Notice that mosaic() splits the data in the order in which the variables are provided: first on admission status, then gender, then department. The resulting plot order makes it very clear that more applicants were rejected than admitted. It is also clear that within the admitted group there were many more men than women, while in the rejected group there were approximately the same number of men and women. It is difficult to make comparisons within each department, though. A different variable splitting order may reveal some other interesting information.

Another way of looking at the data is to split first by department, then gender, then admission status, as in Figure 13.28. This makes the admission status the last variable that is partitioned, so that after partitioning by department and gender, the admitted and rejected cells for each group are right next to each other:

mosaic( ~ Dept + Gender + Admit, data = UCBAdmissions,
       highlighting = "Admit", highlighting_fill = c("lightblue", "pink"),
       direction = c("v","h","v"))

Figure 13.28: Mosaic plot with a different variable splitting order: first department, then gender, then admission status

We also specified a variable to highlight (Admit), and which colors to use in the highlighting.

13.15.3 Discussion

In the preceding example we also specified the direction in which each variable will be split. The first variable, Dept, is split vertically; the second variable, Gender, is split horizontally; and the third variable, Admit, is split vertically. The reason that we chose these directions is because, in this particular example, it makes it easy to compare the male and female groups within each department.

We can also use different splitting directions, as shown in Figure 13.29 and Figure 13.30:

# Another possible set of splitting directions
mosaic( ~ Dept + Gender + Admit, data = UCBAdmissions,
       highlighting = "Admit", highlighting_fill = c("lightblue", "pink"),
       direction = c("v", "v", "h"))

Figure 13.29: Splitting Dept vertically, Gender vertically, and Admit horizontally

# This order makes it difficult to compare male and female
mosaic( ~ Dept + Gender + Admit, data = UCBAdmissions,
       highlighting = "Admit", highlighting_fill = c("lightblue", "pink"),
       direction = c("v", "h", "h"))

Figure 13.30: Splitting Dept vertically, Gender horizontally, and Admit horizontally

The example here illustrates a classic case of Simpson’s paradox, in which a relationship between variables within subgroups can change (or reverse!) when the groups are combined. The UCBAdmissions table contains admissions data from the University of California-Berkeley in 1973. Overall, men were admitted at a higher rate than women, and because of this, the university was sued for gender bias. But when each department was examined separately, it was found that they each had approximately equal admission rates for men and women. The difference in overall admission rates was because women were more likely to apply to competitive departments with lower admission rates.
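
You can verify this numerically from the table itself. The sketch below (base R only) computes the overall admission rate by gender, and then the rate within each department:

```r
# Overall rates: collapse over Dept (dimension 3), then column-wise proportions
overall <- prop.table(margin.table(UCBAdmissions, c(1, 2)), margin = 2)
round(overall["Admitted", ], 2)   # men admitted at a higher overall rate

# Per-department rates: proportions within each 2x2 Dept slice
per_dept <- apply(UCBAdmissions, 3, function(tab) prop.table(tab, margin = 2)["Admitted", ])
round(per_dept, 2)                # within departments the rates are roughly equal
```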

In Figure 13.28 and Figure 13.29, you can see that within each department, admission rates were approximately equal between men and women. You can also see that departments with higher admission rates (A and B) were very imbalanced in the gender ratio of applicants: far more men applied to these departments than did women. As you can see, partitioning the data in different orders and directions can bring out different aspects of the data. In Figure 13.29, as in Figure 13.28, it’s easy to compare male and female admission rates within each department and across departments. Splitting Dept vertically, Gender horizontally, and Admit horizontally, as in Figure 13.30, makes it difficult to compare male and female admission rates within each department, but it is easy to compare male and female application rates across departments.

13.15.4 See Also

See ?mosaicplot for another function that can create mosaic plots.

P.J. Bickel, E.A. Hammel, and J.W. O’Connell, “Sex Bias in Graduate Admissions: Data from Berkeley,” Science 187 (1975): 398–404.

You want to make a pie chart.

13.15.5 Solution

Use the pie() function. In this example (Figure 13.31), we’ll use the survey data set from the MASS library:

library(MASS) # For the data set
# Get a table of how many cases are in each level of Fold
fold <- table(survey$Fold)
fold
#> 
#>  L on R Neither  R on L 
#>      99      18     120

# Reduce margins so there's less blank space around the plot
par(mar = c(1, 1, 1, 1))
# Make the pie chart
pie(fold)

Figure 13.31: A pie chart

We passed pie() an object of class table. We could have instead given it a named vector, or a vector of values and a vector of labels like this, with the same result:

pie(c(99, 18, 120), labels = c("L on R", "Neither", "R on L"))

13.15.6 Discussion

The lowly pie chart is the subject of frequent abuse from data visualization experts. If you’re thinking of using a pie chart, consider whether a bar graph (or stacked bar graph) would convey the information more effectively. Despite their faults, pie charts do have one important virtue: everyone knows how to read them.
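
For comparison, here are the same counts as a base-graphics bar graph (drawn to a null device in this sketch so it runs without opening a window):

```r
counts <- c("L on R" = 99, "Neither" = 18, "R on L" = 120)

pdf(NULL)                  # null device: render without displaying anything
barplot(counts, ylab = "Count")
invisible(dev.off())
```

A reader can compare bar heights against a common baseline, which is usually easier than comparing wedge angles.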

13.16 Creating a Map

13.16.1 Problem

You want to create a geographical map.

13.16.2 Solution

Retrieve map data from the maps package and draw it with geom_polygon() (which can have a color fill) or geom_path() (which can’t have a fill). By default, the latitude and longitude will be drawn on a Cartesian coordinate plane, but you can use coord_map() and specify a projection. The default projection is “mercator”, which, unlike the Cartesian plane, has a progressively changing spacing for latitude lines (Figure 13.32):

library(maps) # For map data
# Get map data for USA
states_map <- map_data("state") # ggplot2 must be loaded to use map_data()

ggplot(states_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

# geom_path (no fill) and Mercator projection
ggplot(states_map, aes(x = long, y = lat, group = group)) +
  geom_path() + coord_map("mercator")

Figure 13.32: Top: a basic map with fill; bottom: with no fill, and Mercator projection

13.16.3 Discussion

The map_data() function returns a data frame with the following columns:

  • long: Longitude.
  • lat: Latitude.
  • group: This is a grouping variable for each polygon. A region or subregion might have multiple polygons, for example, if it includes islands.
  • order: The order to connect each point within a group.
  • region: Roughly, the names of countries, although some other objects are present (such as some lakes).
  • subregion: The names of subregions within a region, which can contain multiple groups. For example, the Alaska subregion includes many islands, each with its own group.
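
To make the roles of group and order concrete, here is a hypothetical miniature map data frame: one region drawn as two polygons (say, a mainland and an island). All names and coordinates are invented:

```r
mini_map <- data.frame(
  long   = c(0, 1, 1, 0,  2, 3, 2.5),
  lat    = c(0, 0, 1, 1,  0, 0, 1),
  group  = c(1, 1, 1, 1,  2, 2, 2),    # two polygons
  order  = c(1, 2, 3, 4,  1, 2, 3),    # draw order within each polygon
  region = "mainland_and_island"
)

# One run of coordinates per polygon
split(mini_map$long, mini_map$group)
```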

There are a number of different maps available, including world, nz, france, italy, usa (outline of the United States), state (each state in the USA), and county (each county in the USA). For example, to get map data for the world:

# Get map data for world
world_map <- map_data("world")
world_map
#>             long      lat group  order  region subregion
#> 1      -69.89912 12.45200     1      1   Aruba      <NA>
#> 2      -69.89571 12.42300     1      2   Aruba      <NA>
#> 3      -69.94219 12.43853     1      3   Aruba      <NA>
#>  ...<99,332 more rows>...
#> 100962  12.42754 41.90073  1627 100962 Vatican   enclave
#> 100963  12.43057 41.89756  1627 100963 Vatican   enclave
#> 100964  12.43916 41.89839  1627 100964 Vatican   enclave

If you want to draw a map of a region in the world map for which there isn’t a separate map, you can first look for the region name, like so:

sort(unique(world_map$region))
#>   [1] "Afghanistan"                        
#>   [2] "Albania"                            
#>   [3] "Algeria"                            
#>  ...
#> [250] "Yemen"                              
#> [251] "Zambia"                             
#> [252] "Zimbabwe"

It’s possible to get data for specific regions from a particular map (Figure 13.33):

east_asia <- map_data("world", region = c("Japan", "China", "North Korea", "South Korea"))

# Map region to fill color
ggplot(east_asia, aes(x = long, y = lat, group = group, fill = region)) +
  geom_polygon(colour = "black") +
  scale_fill_brewer(palette = "Set2")

Figure 13.33: Specific regions from the world map

If there is a separate map available for a region, such as nz (New Zealand), that map data will be at a higher resolution than if you were to extract it from the world map, as shown in Figure 13.34:

# Get New Zealand data from world map
nz1 <- map_data("world", region = "New Zealand") %>%
  filter(long > 0 & lat > -48)        # Trim off islands

ggplot(nz1, aes(x = long, y = lat, group = group)) +
  geom_path()

# Get New Zealand data from the nz map
nz2 <- map_data("nz")
ggplot(nz2, aes(x = long, y = lat, group = group)) +
  geom_path()

Figure 13.34: New Zealand data taken from world map (left); Data from nz map (right)

13.16.4 See Also

See the mapdata package for more map data sets. It includes maps of China and Japan, as well as a high-resolution world map, worldHires.

See the map() function, for quickly generating maps.

See ?mapproject for a list of available map projections.

13.17 Creating a Choropleth Map

13.17.1 Problem

You want to create a map with regions that are colored according to variable values.

13.17.2 Solution

Merge the value data with the map data, then map a variable to fill:

# Transform the USArrests data set to the correct format
crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests)
crimes
#>                       state Murder Assault UrbanPop Rape
#> Alabama             alabama   13.2     236       58 21.2
#> Alaska               alaska   10.0     263       48 44.5
#> Arizona             arizona    8.1     294       80 31.0
#>  ...<44 more rows>...
#> West Virginia west virginia    5.7      81       39  9.3
#> Wisconsin         wisconsin    2.6      53       66 10.8
#> Wyoming             wyoming    6.8     161       60 15.6

library(maps) # For map data
states_map <- map_data("state")
# Merge the data sets together
crime_map <- merge(states_map, crimes, by.x = "region", by.y = "state")
# After merging, the order has changed, which would lead to polygons drawn in
# the incorrect order. So, we'll sort the data.
crime_map
#>        region   long  lat group order subregion Murder Assault UrbanPop Rape
#> 1     alabama  -87.5 30.4     1     1      <NA>   13.2     236       58 21.2
#> 2     alabama  -87.5 30.4     1     2      <NA>   13.2     236       58 21.2
#> 3     alabama  -88.0 30.2     1    13      <NA>   13.2     236       58 21.2
#>  ...<15,521 more rows>...
#> 15525 wyoming -107.9 41.0    63 15597      <NA>    6.8     161       60 15.6
#> 15526 wyoming -109.1 41.0    63 15598      <NA>    6.8     161       60 15.6
#> 15527 wyoming -109.1 41.0    63 15599      <NA>    6.8     161       60 15.6

library(dplyr) # For arrange() function
# Sort by group, then order
crime_map <- arrange(crime_map, group, order)
crime_map
#>        region   long  lat group order subregion Murder Assault UrbanPop Rape
#> 1     alabama  -87.5 30.4     1     1      <NA>   13.2     236       58 21.2
#> 2     alabama  -87.5 30.4     1     2      <NA>   13.2     236       58 21.2
#> 3     alabama  -87.5 30.4     1     3      <NA>   13.2     236       58 21.2
#>  ...<15,521 more rows>...
#> 15525 wyoming -107.9 41.0    63 15597      <NA>    6.8     161       60 15.6
#> 15526 wyoming -109.1 41.0    63 15598      <NA>    6.8     161       60 15.6
#> 15527 wyoming -109.1 41.0    63 15599      <NA>    6.8     161       60 15.6
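
The reordering problem that arrange() fixes can also be seen on a toy example (made-up regions and values), using base R’s order() for the sort:

```r
map_df <- data.frame(
  region = c("b", "b", "a", "a"),   # 'b' polygons come first in the map data
  order  = c(1, 2, 1, 2)
)
values <- data.frame(region = c("a", "b"), val = c(10, 20))

m <- merge(map_df, values, by = "region")   # merging may scramble the row order
m <- m[order(m$region, m$order), ]          # restore group/draw order explicitly

m$order
```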

Once the data is in the correct format, it can be plotted (Figure 13.35), mapping one of the columns with data values to fill:

ggplot(crime_map, aes(x = long, y = lat, group = group, fill = Assault)) +
  geom_polygon(colour = "black") +
  coord_map("polyconic")

Figure 13.35: A map with a variable mapped to fill

13.17.3 Discussion

The preceding example used the default color scale, which goes from dark to light blue. If you want to show how the values diverge from some middle value, you can use scale_fill_gradient2(); another option is the perceptually uniform scale_fill_viridis_c(). Both are shown in Figure 13.36:

# Create a base plot
crime_p <- ggplot(crimes, aes(map_id = state, fill = Assault)) +
  geom_map(map = states_map, colour = "black") +
  expand_limits(x = states_map$long, y = states_map$lat) +
  coord_map("polyconic")

crime_p +
  scale_fill_gradient2(low = "#559999", mid = "grey90", high = "#BB650B",
                       midpoint = median(crimes$Assault))

crime_p +
    scale_fill_viridis_c()

Figure 13.36: With a diverging color scale

The previous example mapped continuous values to fill, but we could just as well use discrete values. It’s sometimes easier to interpret the data if the values are discretized. For example, we can categorize the values into quantiles and show those quantiles, as in Figure 13.37:

# Find the quantile bounds
qa <- quantile(crimes$Assault, c(0, 0.2, 0.4, 0.6, 0.8, 1.0))
qa
#>    0%   20%   40%   60%   80%  100% 
#>  45.0  98.8 135.0 188.8 254.2 337.0

# Add a column of the quantile category
crimes$Assault_q <- cut(crimes$Assault, qa,
                        labels = c("0-20%", "20-40%", "40-60%", "60-80%", "80-100%"),
                        include.lowest = TRUE)
crimes
#>                       state Murder Assault UrbanPop Rape Assault_q
#> Alabama             alabama   13.2     236       58 21.2    60-80%
#> Alaska               alaska   10.0     263       48 44.5   80-100%
#> Arizona             arizona    8.1     294       80 31.0   80-100%
#>  ...<44 more rows>...
#> West Virginia west virginia    5.7      81       39  9.3     0-20%
#> Wisconsin         wisconsin    2.6      53       66 10.8     0-20%
#> Wyoming             wyoming    6.8     161       60 15.6    40-60%
# Generate a discrete color palette with 5 values
pal <- colorRampPalette(c("#559999", "grey80", "#BB650B"))(5)
pal
#> [1] "#559999" "#90B2B2" "#CCCCCC" "#C3986B" "#BB650B"

ggplot(crimes, aes(map_id = state, fill = Assault_q)) +
  geom_map(map = states_map, colour = "black") +
  scale_fill_manual(values = pal) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  coord_map("polyconic") +
  labs(fill = "Assault Rate\nPercentile")

Figure 13.37: Choropleth map with discretized data

Another way to make a choropleth, but without needing to merge the map data with the value data, is to use geom_map(). As of this writing, this will render maps faster than the method just described.

For this method, the map data frame must have columns named lat, long, and region. In the value data frame, there must be a column that is matched to the region column in the map data frame, and this column is specified by mapping it to the map_id aesthetic. For example, this code will have the same output as the first example (Figure 13.35):

# The 'state' column in the crimes data is to be matched to the 'region' column
# in the states_map data
ggplot(crimes, aes(map_id = state, fill = Assault)) +
  geom_map(map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  coord_map("polyconic")

Notice that we also needed to use expand_limits(). This is because unlike most geoms, geom_map() doesn’t automatically set the x and y limits; the use of expand_limits() makes it include those x and y values. (Another way to accomplish the same result is to use ylim() and xlim().)

13.17.4 See Also

For an example of data overlaid on a map, see Recipe 13.12.

For more on using continuous colors, see Recipe 12.6.

13.18 Making a Map with a Clean Background

13.18.1 Problem

You want to remove background elements from a map.

13.18.2 Solution

Use theme_void() (Figure 13.38). In this example, we’ll add it to one of the choropleths we created in Recipe 13.17:

ggplot(crimes, aes(map_id = state, fill = Assault_q)) +
  geom_map(map = states_map, colour = "black") +
  scale_fill_manual(values = pal) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  coord_map("polyconic") +
  labs(fill = "Assault Rate\nPercentile") +
  theme_void()

Figure 13.38: A map with a clean background

13.18.3 Discussion

In some maps, it’s important to include contextual information such as the latitude and longitude. In others, this information is unimportant and distracts from the information that’s being conveyed. In Figure 13.38, it’s unlikely that viewers will care about the latitude and longitude of the states. They can probably identify the states by shape and relative position, and even if they can’t, having the latitude and longitude isn’t really helpful.

13.19 Creating a Map from a Shapefile

13.19.1 Problem

You want to create a geographical map from an Esri shapefile.

13.19.2 Solution

Load the shapefile using st_read() from the sf package, then plot it with geom_sf() (Figure 13.39):

library(sf)

# Load the shapefile
taiwan_shp <- st_read("fig/TWN_adm/TWN_adm2.shp")
#> Reading layer `TWN_adm2' from data source `/home/travis/build/wch/rgcookbook/fig/TWN_adm/TWN_adm2.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 22 features and 11 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: 116.71 ymin: 20.6975 xmax: 122.1085 ymax: 25.63431
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs

ggplot(taiwan_shp) +
  geom_sf()

Figure 13.39: A map created from a shapefile

13.19.3 Discussion

Esri shapefiles are a common format for map data. The st_read() function reads a shapefile and returns an sf object, which contains the feature geometry along with other useful columns. Here’s a look at the contents of the object:

taiwan_shp
#> Simple feature collection with 22 features and 11 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: 116.71 ymin: 20.6975 xmax: 122.1085 ymax: 25.63431
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
#> First 6 features:
#>   ID_0 ISO NAME_0 ID_1         NAME_1 ID_2         NAME_2 NL_NAME_2
#> 1  223 TWN Taiwan    1      Kaohsiung    1 Kaohsiung City      <NA>
#> 2  223 TWN Taiwan    2 Pratas Islands    2           <NA>      <NA>
#> 3  223 TWN Taiwan    3         Taipei    3    Taipei City      <NA>
#> 4  223 TWN Taiwan    4         Taiwan    4       Changhwa      <NA>
#> 5  223 TWN Taiwan    4         Taiwan    5         Chiayi      <NA>
#> 6  223 TWN Taiwan    4         Taiwan    6        Hsinchu      <NA>
#>           VARNAME_2         TYPE_2            ENGTYPE_2
#> 1      Gaoxiong Shi     Chuan-shih Special Municipality
#> 2              <NA>           <NA>                 <NA>
#> 3        Taibei Shi     Chuan-shih Special Municipality
#> 4 Zhanghua|Changhua District|Hsien               County
#> 5       Jiayi|Chiai District|Hsien               County
#> 6            Xinzhu District|Hsien               County
#>                         geometry
#> 1 MULTIPOLYGON (((120.239 22....
#> 2 MULTIPOLYGON (((116.7172 20...
#> 3 MULTIPOLYGON (((121.525 25....
#> 4 MULTIPOLYGON (((120.4176 24...
#> 5 MULTIPOLYGON (((120.1526 23...
#> 6 MULTIPOLYGON (((120.9146 24...

The sf object is a special kind of data frame, with 22 rows and 12 columns. Each row corresponds to one feature, and each column holds data about that feature. One of the columns is the geometry for each feature. This is a list-column: a special type of column in which each of the 22 elements contains one or more matrices of numbers representing the shape of the feature.
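
A list-column can be built in base R directly; this toy sketch mirrors the structure (two features, each with a matrix of coordinates, all values invented):

```r
features <- data.frame(id = 1:2)

# Assign a list of coordinate matrices as a single column
features$geometry <- list(
  matrix(c(0, 0, 1, 0, 1, 1), ncol = 2, byrow = TRUE),        # feature 1: 3 vertices
  matrix(c(2, 2, 3, 2, 3, 3, 2, 3), ncol = 2, byrow = TRUE)   # feature 2: 4 vertices
)

nrow(features$geometry[[2]])   # number of vertices in the second feature
```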

Columns in the data can be mapped to aesthetics like fill. For example, we can map the ENGTYPE_2 column to fill, as shown in Figure 13.40:

# Remove rows for which ENGTYPE_2 is NA; otherwise NA will show in the legend
taiwan_shp_mod <- taiwan_shp[!is.na(taiwan_shp$ENGTYPE_2), ]

ggplot(taiwan_shp_mod) +
  geom_sf(aes(fill = ENGTYPE_2))

Figure 13.40: With a column mapped to fill

13.19.4 See Also

The shapefile used in this example is not included in the gcookbook package. It and many other shapefiles are available for download at http://www.gadm.org.