Chapter 13 Miscellaneous Graphs

There are many, many ways of visualizing data, and sometimes things don’t fit into nice, tidy categories. This chapter shows how to make some of these other visualizations.

13.1 Making a Correlation Matrix

13.1.1 Problem

You want to make a graphical correlation matrix.

13.1.2 Solution

We’ll look at the mtcars data set:

First, generate the numerical correlation matrix using cor. This will generate correlation coefficients for each pair of columns:
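A sketch of this step (mtcars is all-numeric, so no columns need to be excluded; rounding is only for readability):

```r
mcor <- cor(mtcars)

# Print mcor, rounded to 2 digits
round(mcor, digits = 2)

# If the data contained NAs, we could instead use, e.g.:
# cor(mtcars, use = "complete.obs")
```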

If there are any columns that you don’t want used for correlations (for example, a column of names), you should exclude them. If there are any NA cells in the original data, the resulting correlation matrix will have NA values. To deal with this, you will probably want to use the argument use="complete.obs" or use="pairwise.complete.obs".

To plot the correlation matrix (Figure 13.1), we’ll use the corrplot package:
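One way this plot might be produced (assuming the corrplot package is installed):

```r
library(corrplot)

mcor <- cor(mtcars)
corrplot(mcor)
```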

Figure 13.1: A correlation matrix

13.1.3 Discussion

The corrplot() function has many, many options. Here is an example of how to make a correlation matrix with colored squares and black labels, rotated 45 degrees along the top (Figure 13.2):
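A sketch of that call:

```r
library(corrplot)

mcor <- cor(mtcars)
corrplot(mcor, method = "shade", shade.col = NA, tl.col = "black", tl.srt = 45)
```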

Figure 13.2: Correlation matrix with colored squares and black, rotated labels

It may also be helpful to display labels representing the correlation coefficient on each square in the matrix. In this example we’ll make a lighter palette so that the text is readable, and we’ll remove the color legend, since it’s redundant. We’ll also order the items so that correlated items are closer together, using the order="AOE" (angular order of eigenvectors) option. The result is shown in Figure 13.3:
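A hedged sketch combining these options; the specific palette colors below are illustrative choices, not prescribed values:

```r
library(corrplot)

mcor <- cor(mtcars)

# A lighter palette so the coefficient text stays readable
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))

corrplot(mcor, method = "shade", shade.col = NA, tl.col = "black", tl.srt = 45,
         col = col(200),          # Use the lighter palette
         addCoef.col = "black",   # Print the coefficients in black
         cl.pos = "n",            # Suppress the (redundant) color legend
         order = "AOE")           # Angular order of eigenvectors
```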

Figure 13.3: Correlation matrix with correlation coefficients and no legend

Like many other standalone plotting functions, corrplot() has its own menagerie of options, which can’t all be illustrated here. Table 13.1 lists some useful options.

Table 13.1: Options for corrplot()
Option                                Description
type={"lower" | "upper"}              Only use the lower or upper triangle
diag=FALSE                            Don’t show values on the diagonal
addshade="all"                        Add lines indicating the direction of the correlation
shade.col=NA                          Hide correlation direction lines
method="shade"                        Use colored squares
method="ellipse"                      Use ellipses
addCoef.col="color"                   Add correlation coefficients, in the specified color
tl.srt=number                         Specify the rotation angle for top labels
tl.col="color"                        Specify the label color
order={"AOE" | "FPC" | "hclust"}      Sort labels using angular order of eigenvectors, first principal component, or hierarchical clustering

13.1.4 See Also

To create a scatter plot matrix, see Recipe 5.13.

For more on subsetting data, see Recipe 15.7.

13.2 Plotting a Function

13.2.1 Problem

You want to plot a function.

13.2.2 Solution

Use stat_function(). It’s also necessary to give ggplot a dummy data frame so that it will get the proper x range. In this example we’ll use dnorm(), which gives the density of the normal distribution (Figure 13.4, left):
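A sketch of the Solution; the range c(-3, 3) is a typical choice for the standard normal:

```r
library(ggplot2)

# The dummy data frame sets the x range; no y values are needed
ggplot(data.frame(x = c(-3, 3)), aes(x = x)) +
  stat_function(fun = dnorm)
```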

13.2.3 Discussion

Some functions take additional arguments. For example, dt(), the function for the density of the t-distribution, takes a parameter for degrees of freedom (Figure 13.4, right). These additional arguments can be passed to the function by putting them in a list and giving the list to args:
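For example, to pass df = 2 to dt():

```r
library(ggplot2)

ggplot(data.frame(x = c(-3, 3)), aes(x = x)) +
  stat_function(fun = dt, args = list(df = 2))
```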

Figure 13.4: The normal distribution (left); The t-distribution with df=2 (right)

It’s also possible to define your own functions. It should take an x value for its first argument, and it should return a y value. In this example, we’ll define a sigmoid function (Figure 13.5):
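A sketch of such a function; this particular sigmoid (centered at x = 10) is an illustrative choice:

```r
library(ggplot2)

myfun <- function(xvar) {
  1 / (1 + exp(-xvar + 10))
}

ggplot(data.frame(x = c(0, 20)), aes(x = x)) +
  stat_function(fun = myfun)
```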

Figure 13.5: A user-defined function

By default, the function is calculated at 101 points along the x range. If you have a rapidly fluctuating function, you may be able to see the individual segments. To smooth out the curve, pass a larger value of n to stat_function(), as in stat_function(fun=myfun, n=200).

13.2.4 See Also

For plotting predicted values from model objects (such as lm and glm), see Recipe 5.7.

13.3 Shading a Subregion Under a Function Curve

13.3.1 Problem

You want to shade part of the area under a function curve.

13.3.2 Solution

Define a new wrapper function around your curve function, and replace out-of-range values with NA, as shown in Figure 13.6:
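A sketch of this approach; the 0-2 range and the blue fill are illustrative values:

```r
library(ggplot2)

# Return dnorm(x) for 0 <= x <= 2, and NA for all other x
dnorm_limit <- function(x) {
  y <- dnorm(x)
  y[x < 0 | x > 2] <- NA
  y
}

ggplot(data.frame(x = c(-3, 3)), aes(x = x)) +
  stat_function(fun = dnorm_limit, geom = "area", fill = "blue", alpha = 0.2) +
  stat_function(fun = dnorm)
```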

Figure 13.6: Function curve with a shaded region

Remember that what gets passed to this function is a vector, not individual values. If this function operated on single elements at a time, it might make sense to use an ifelse statement to decide what to return, conditional on the value of x. But that won’t work here, since x is a vector with many values.

13.3.3 Discussion

R has first-class functions, and we can write a function that returns a closure; that is, we can write a function that builds and returns another function.

This function will allow you to pass in a function, a minimum value, and a maximum value. Values outside the range will again be returned with NA:
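A sketch of such a function:

```r
limitRange <- function(fun, min, max) {
  function(x) {
    y <- fun(x)
    y[x < min | x > max] <- NA
    y
  }
}

# This returns a function that behaves like dnorm_limit()
dlimit <- limitRange(dnorm, 0, 2)

# Try it out: values outside of 0-2 come back as NA
dlimit(-2:4)
```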

Now we can call this function to create another function, one that is effectively the same as the dnorm_limit() function used earlier:

We can use limitRange() to create a function that is passed to stat_function():
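A sketch of that usage (limitRange() is the closure-returning function described above):

```r
library(ggplot2)

limitRange <- function(fun, min, max) {
  function(x) {
    y <- fun(x)
    y[x < min | x > max] <- NA
    y
  }
}

ggplot(data.frame(x = c(-3, 3)), aes(x = x)) +
  stat_function(fun = dnorm) +
  stat_function(fun = limitRange(dnorm, 0, 2),
                geom = "area", fill = "blue", alpha = 0.2)
```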

The limitRange() function can be used with any function, not just dnorm(), to create a range-limited version of that function. The result of all this is that instead of having to write functions with different hard-coded values for each situation that arises, we can write one function and simply pass it different arguments depending on the situation.

If you look very, very closely at the graph in Figure 13.6, you may see that the shaded region does not align exactly with the range we specified. This is because ggplot2 does a numeric approximation by calculating values at fixed intervals, and these intervals may not fall exactly within the specified range. As in Recipe 13.2, we can improve the approximation by increasing the number of interpolated values with stat_function(n = 200).

13.4 Creating a Network Graph

13.4.1 Problem

You want to create a network graph.
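13.4.2 Solution

Use the igraph package. A hedged sketch; the specific edge list here is illustrative:

```r
library(igraph)

# Specify the edges of a directed graph as pairs of vertex IDs
gd <- graph(c(1,2, 2,3, 2,4, 1,4, 5,5, 3,6))
plot(gd)

# An undirected graph from the same edge list
gu <- graph(c(1,2, 2,3, 2,4, 1,4, 5,5, 3,6), directed = FALSE)
plot(gu, vertex.label = NA)  # No labels
```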

13.4.3 Discussion

In a network graph, the position of the nodes is unspecified by the data, and they’re placed randomly. To make the output repeatable, you can set the random seed before making the plot. You can try different random numbers until you get a result that you like:
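For example (the seed value is arbitrary; try others until you like the layout):

```r
library(igraph)

gu <- graph(c(1,2, 2,3, 2,4, 1,4, 5,5, 3,6), directed = FALSE)

set.seed(229)
plot(gu, vertex.label = NA)
```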

It’s also possible to create a graph from a data frame. The first two columns of the data frame are used, and each row specifies a connection between two nodes. In the next example (Figure 13.8), we’ll use the madmen2 data set, which has this structure. We’ll also use the Fruchterman-Reingold layout algorithm. The idea is that all the nodes have a magnetic repulsion from one another, but the edges between nodes act as springs, pulling the nodes together:
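A sketch of this step, assuming the madmen2 data set comes from the gcookbook package that accompanies this book:

```r
library(igraph)
library(gcookbook)  # For the madmen2 data set (an assumption)

g <- graph_from_data_frame(madmen2, directed = TRUE)

# Remove unnecessary margins, then use the Fruchterman-Reingold layout
par(mar = c(0, 0, 0, 0))
plot(g, layout = layout.fruchterman.reingold,
     vertex.size = 8, edge.arrow.size = 0.5, vertex.label = NA)
```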

Figure 13.8: A directed graph from a data frame, with the Fruchterman-Reingold algorithm

It’s also possible to make a directed graph from a data frame. The madmen data set has only one row for each pairing, since direction doesn’t matter for an undirected graph. This time we’ll use a circle layout (Figure 13.9):
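A sketch, again assuming the madmen data set is from the gcookbook package:

```r
library(igraph)
library(gcookbook)  # For the madmen data set (an assumption)

g <- graph_from_data_frame(madmen, directed = FALSE)

par(mar = c(0, 0, 0, 0))
plot(g, layout = layout.circle, vertex.size = 8, vertex.label = NA)
```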

Figure 13.9: A circular undirected graph from a data frame

13.4.4 See Also

For more information about the available output options, see ?plot.igraph. Also see ?igraph::layout for layout options.

An alternative to igraph is Rgraphviz, which is a frontend for Graphviz, an open-source library for visualizing graphs. It works better with labels and makes it easier to create graphs with a controlled layout, but it can be a bit challenging to install. Rgraphviz is available through the Bioconductor repository system.

13.5 Using Text Labels in a Network Graph

13.5.1 Problem

You want to use text labels in a network graph.
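13.5.2 Solution

Pass vertex.label and related arguments to plot(). A hedged sketch, assuming the madmen data set comes from the gcookbook package:

```r
library(igraph)
library(gcookbook)  # For the madmen data set (an assumption)

# Drop every other row of madmen, for a less crowded graph
m <- madmen[1:nrow(madmen) %% 2 == 1, ]
g <- graph_from_data_frame(m, directed = FALSE)

# Print out the names of each vertex
V(g)$name

plot(g, layout = layout.fruchterman.reingold,
     vertex.size = 4,              # Smaller nodes
     vertex.label = V(g)$name,     # Set the labels
     vertex.label.cex = 0.8,       # Slightly smaller font
     vertex.label.dist = 0.4,      # Offset the labels
     vertex.label.color = "black")
```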

13.5.3 Discussion

Another way to achieve the same effect is to modify the plot object, instead of passing in the values as arguments to plot(). To do this, use V()$xxx<- instead of passing a value to a vertex.xxx argument. For example, this will result in the same output as the previous code:
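A sketch of the equivalent property-setting approach (g built from madmen as described in the Solution):

```r
library(igraph)
library(gcookbook)  # For the madmen data set (an assumption)

g <- graph_from_data_frame(madmen[1:nrow(madmen) %% 2 == 1, ], directed = FALSE)

# Equivalent to passing vertex.xxx arguments to plot()
V(g)$size        <- 4
V(g)$label       <- V(g)$name
V(g)$label.cex   <- 0.8
V(g)$label.dist  <- 0.4
V(g)$label.color <- "black"

# A property of the whole graph can be set the same way
g$layout <- layout.fruchterman.reingold

plot(g)
```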

The properties of the edges can also be set, either with the E() function or by passing values to edge.xxx arguments (Figure 13.11):
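A sketch of the edge version; the particular edge indices and colors below are illustrative:

```r
library(igraph)
library(gcookbook)  # For the madmen data set (an assumption)

g <- graph_from_data_frame(madmen[1:nrow(madmen) %% 2 == 1, ], directed = FALSE)

# View the edges
E(g)

# Label a few edges, leaving the rest blank
E(g)$label <- NA
E(g)[c(2, 11, 19)]$label <- "M"

# Set the color of all edges, then make a few of them red
E(g)$color <- "grey70"
E(g)[c(2, 11, 19)]$color <- "red"

plot(g)
```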

Figure 13.11: A network graph with labeled and colored edges

13.5.4 See Also

See ?igraph.plotting for more information about graphical parameters in igraph.

13.6 Creating a Heat Map

13.6.1 Problem

You want to make a heat map.

13.6.2 Solution

Use geom_tile() or geom_raster() and map a continuous variable to fill. We’ll use the presidents data set, which is a time series object rather than a data frame:

We’ll first convert it to a format that is usable by ggplot: a data frame with columns that are numeric:
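A sketch of the conversion, using the time series accessors time() and cycle():

```r
pres_rating <- data.frame(
  rating  = as.numeric(presidents),
  year    = as.numeric(floor(time(presidents))),
  quarter = as.numeric(cycle(presidents))
)

head(pres_rating)
```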

Now we can make the plot using geom_tile() or geom_raster() (Figure 13.12). Simply map one variable to x, one to y, and one to fill:
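A sketch of the plotting step, with pres_rating built as above:

```r
library(ggplot2)

pres_rating <- data.frame(
  rating  = as.numeric(presidents),
  year    = as.numeric(floor(time(presidents))),
  quarter = as.numeric(cycle(presidents))
)

p <- ggplot(pres_rating, aes(x = year, y = quarter, fill = rating))
p + geom_tile()

# Or, with the same appearance:
p + geom_raster()
```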

Figure 13.12: A heat map; the grey squares represent NAs in the data

Note

The results with geom_tile() and geom_raster() should look the same, but in practice they might appear different. See Recipe 6.12 for more information about this issue.

13.6.3 Discussion

To better convey useful information, you may want to customize the appearance of the heat map. With this example, we’ll reverse the y-axis so that it progresses from top to bottom, and we’ll add tick marks every four years along the x-axis, to correspond with each presidential term. For the x and y scales, we remove the padding by using expand=c(0, 0). We’ll also change the color scale using scale_fill_gradient2(), which lets you specify a midpoint color and the two colors at the low and high ends (Figure 13.13):
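A sketch of these customizations; the midpoint of 50 and the grey midpoint color are illustrative choices:

```r
library(ggplot2)

pres_rating <- data.frame(
  rating  = as.numeric(presidents),
  year    = as.numeric(floor(time(presidents))),
  quarter = as.numeric(cycle(presidents))
)

ggplot(pres_rating, aes(x = year, y = quarter, fill = rating)) +
  geom_tile() +
  scale_x_continuous(breaks = seq(1940, 1976, by = 4), expand = c(0, 0)) +
  scale_y_reverse(expand = c(0, 0)) +
  scale_fill_gradient2(midpoint = 50, mid = "grey70", limits = c(0, 100))
```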

Figure 13.13: A heat map with customized appearance

13.6.4 See Also

If you want to use a different color palette, see Recipe 12.6.

13.7 Creating a Three-Dimensional Scatter Plot

13.7.1 Problem

You want to create a three-dimensional scatter plot.

13.7.2 Solution

We’ll use the rgl package, which provides an interface to the OpenGL graphics library for 3D graphics. To create a 3D scatter plot, as in Figure 13.14, use plot3d() and pass in a data frame where the first three columns represent x, y, and z coordinates, or pass in three vectors representing the x, y, and z coordinates.
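A sketch using columns from mtcars (the choice of variables here is illustrative); the type, size, and lit settings are explained in the note below:

```r
library(rgl)

plot3d(mtcars$wt, mtcars$disp, mtcars$mpg,
       xlab = "Weight", ylab = "Displacement", zlab = "MPG",
       type = "s", size = 0.75, lit = FALSE)
```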

Figure 13.14: A 3D scatter plot

Note

If you are using a Mac, you will need to install XQuartz before you can use rgl. You can download it from https://www.xquartz.org/.

Viewers can rotate the image by clicking and dragging with the mouse, and zoom in and out with the scroll wheel.

Note

By default, plot3d() uses square points, which do not render properly when saved to a PDF. For improved appearance, the example above uses type="s" for spherical points, makes them smaller with size=0.75, and turns off the 3D lighting with lit=FALSE (otherwise the points look like shiny spheres).

13.7.3 Discussion

Three-dimensional scatter plots can be difficult to interpret, so it’s often better to use a two-dimensional representation of the data. That said, there are things that can help make a 3D scatter plot easier to understand.

In Figure 13.15, we’ll add vertical segments to help give a sense of the spatial positions of the points:
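One way to draw those segments is with segments3d(), which reads coordinates in start/end pairs; an interleaving helper makes this convenient (the interleave() helper is introduced here for illustration):

```r
library(rgl)

# Interleave the elements of two vectors: a1 b1 a2 b2 ...
interleave <- function(v1, v2)  as.vector(rbind(v1, v2))

plot3d(mtcars$wt, mtcars$disp, mtcars$mpg,
       xlab = "Weight", ylab = "Displacement", zlab = "MPG",
       type = "s", size = 0.75, lit = FALSE)

# Add a segment from each point straight down to the bottom plane
segments3d(interleave(mtcars$wt,   mtcars$wt),
           interleave(mtcars$disp, mtcars$disp),
           interleave(mtcars$mpg,  min(mtcars$mpg)),
           alpha = 0.4, col = "blue")
```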

Figure 13.15: A 3D scatter plot with vertical lines for each point

It’s possible to tweak the appearance of the background and the axes. In Figure 13.16, we change the number of tick marks and add tick marks and axis labels to the specified sides:

Figure 13.16: 3D scatter plot with axis ticks and labels repositioned

13.7.4 See Also

See ?plot3d for more options for controlling the output.

13.8 Adding a Prediction Surface to a Three-Dimensional Plot

13.8.1 Problem

You want to add a surface of predicted values to a three-dimensional scatter plot.

13.8.2 Solution

First we need to define some utility functions for generating the predicted values from a model object:

With these utility functions defined, we can make a linear model from the data and plot it as a mesh along with the data, using the surface3d() function, as shown in Figure 13.17:
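The original utility functions are not reproduced here; the following is a simplified sketch of the same idea, building the prediction grid inline with expand.grid() and predict():

```r
library(rgl)

# Fit a model of mpg on wt and disp (an illustrative model)
m <- lm(mpg ~ wt + disp, data = mtcars)

# Build a grid of predictor values and predict mpg at each grid point
wt_grid   <- seq(min(mtcars$wt),   max(mtcars$wt),   length.out = 30)
disp_grid <- seq(min(mtcars$disp), max(mtcars$disp), length.out = 30)
grid <- expand.grid(wt = wt_grid, disp = disp_grid)
grid$mpg <- predict(m, newdata = grid)

# Plot the data points, then draw the prediction surface as a wire mesh
plot3d(mtcars$wt, mtcars$disp, mtcars$mpg,
       type = "s", size = 0.5, lit = FALSE)
surface3d(wt_grid, disp_grid, matrix(grid$mpg, nrow = length(wt_grid)),
          alpha = 0.4, front = "lines", back = "lines")
```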

Figure 13.17: A 3D scatter plot with a prediction surface

13.8.4 See Also

For more on changing the appearance of the surface, see ?rgl.material.

13.9 Saving a Three-Dimensional Plot

13.9.1 Problem

You want to save a three-dimensional plot created with the rgl package.

13.9.2 Solution

To save a bitmap image of a plot created with rgl, use rgl.snapshot(). This will capture the exact image that is on the screen:

You can also use rgl.postscript() to save a PostScript or PDF file:
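Sketches of both calls (the file names are hypothetical):

```r
library(rgl)

plot3d(mtcars$wt, mtcars$disp, mtcars$mpg,
       type = "s", size = 0.75, lit = FALSE)

# Capture the exact image currently on screen
rgl.snapshot("3dplot.png", fmt = "png")

# Or save a PDF version
rgl.postscript("3dplot.pdf", fmt = "pdf")
```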

PostScript and PDF output does not support many features of the OpenGL library on which rgl is based. For example, it does not support transparency, and the sizes of objects such as points and lines may not be the same as what appears on the screen.

13.9.3 Discussion

To make the output more repeatable, you can save your current viewpoint and restore it later:

To save the view in a script, you can use dput(), then copy and paste the output into your script:

Once you have the text representation of the userMatrix, add the following to your script:
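A sketch of the whole save-and-restore workflow:

```r
library(rgl)

plot3d(mtcars$wt, mtcars$disp, mtcars$mpg,
       type = "s", size = 0.75, lit = FALSE)

# Save the current viewpoint...
view <- par3d("userMatrix")
# ...rotate the plot with the mouse, then restore the saved viewpoint
par3d(userMatrix = view)

# Print a text representation of the matrix, to paste into a script
dput(view)

# The pasted result looks roughly like this (values omitted here):
# view <- structure(c(...), .Dim = c(4L, 4L))
# par3d(userMatrix = view)
```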

13.10 Animating a Three-Dimensional Plot

13.10.1 Problem

You want to animate a three-dimensional plot by moving the viewpoint around the plot.

13.10.2 Solution

Rotating a 3D plot can provide a more complete view of the data. To animate a 3D plot, use play3d() with spin3d():
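A sketch of the basic call, with a plot already created:

```r
library(rgl)

plot3d(mtcars$wt, mtcars$disp, mtcars$mpg,
       type = "s", size = 0.75, lit = FALSE)

play3d(spin3d())
```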

13.10.3 Discussion

By default, the graph will be rotated on the z (vertical) axis, until you send a break command to R.

You can change the rotation axis, rotation speed, and duration:
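For example, with an rgl plot already open (the axis, speed, and duration values are illustrative):

```r
# Spin on the x-axis, at 4 rpm, for 20 seconds
play3d(spin3d(axis = c(1, 0, 0), rpm = 4), duration = 20)
```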

To save the movie, use the movie3d() function in the same way as play3d(). It will generate a series of .png files, one for each frame, and then attempt to combine them into a single animated .gif file using the convert program from the ImageMagick image utility.

This will spin the plot once in 15 seconds, at 50 frames per second:
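A sketch of that call (4 rpm for 15 seconds is exactly one rotation):

```r
movie3d(spin3d(axis = c(0, 0, 1), rpm = 4), duration = 15, fps = 50)
```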

The output file will be saved in a temporary directory, and the name will be printed on the R console.

If you don’t want to use ImageMagick to convert the output to a .gif, you can specify convert=FALSE and then convert the series of .png files to a movie using some other utility.

13.11 Creating a Dendrogram

13.11.1 Problem

You want to make a dendrogram to show how items are clustered.

13.11.2 Solution

Use hclust() and plot the output from it. This can require a fair bit of data preprocessing. For this example, we’ll first take a subset of the countries data set from the year 2009. For simplicity, we’ll also drop all rows that contain an NA, and then select a random 25 of the remaining rows:
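A sketch of the preprocessing, assuming the countries data set comes from the gcookbook package (the seed value is arbitrary):

```r
library(gcookbook)  # For the countries data set (an assumption)

# Keep the rows for 2009 that have no NAs, then pick 25 rows at random
c2 <- subset(countries, Year == 2009)
c2 <- c2[complete.cases(c2), ]
set.seed(201)  # Make the random sample repeatable
c2 <- c2[sample(1:nrow(c2), 25), ]
c2
```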

Notice that the row names (the first column) are essentially random numbers, since the rows were selected randomly. We need to do a few more things to the data before making a dendrogram from it. First, we need to set the row names; right now there’s a column called Name, but the row names are those random numbers (we don’t often use row names, but for the hclust() function they’re essential). Next, we’ll need to drop all the columns that aren’t values used for clustering. These columns are Name, Code, and Year:

The values for GDP are several orders of magnitude larger than the values for, say, infmortality. Because of this, the effect of infmortality on the clustering will be negligible compared to the effect of GDP. This probably isn’t what we want. To address this issue, we’ll scale the data:
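A sketch of these two steps; the column positions 4:7 assume Name, Code, and Year are the first three columns, with the four numeric columns following:

```r
# Use the country names as row names, then drop the non-numeric columns
rownames(c2) <- c2$Name
c2 <- c2[, 4:7]

# Scale each column so the variables are in comparable units
c3 <- scale(c2)
```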

By default the scale() function scales each column relative to its standard deviation, but other methods may be used.

Finally, we’re ready to make the dendrogram, as shown in Figure 13.19:
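A sketch of the clustering and plotting calls, with c3 being the scaled matrix from the previous step:

```r
hc <- hclust(dist(c3))

# Make the dendrogram
plot(hc)

# With text aligned
plot(hc, hang = -1)
```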

Figure 13.19: A dendrogram (left); With text aligned (right)

13.11.3 Discussion

A cluster analysis is simply a way of assigning points to groups in an n-dimensional space (four dimensions, in this example). A hierarchical cluster analysis divides each group into two smaller groups, and can be represented with the dendrograms in this recipe. There are many different parameters you can control in the hierarchical cluster analysis process, and there may not be a single “right” way to do it for your data.

First, we normalized the data using scale() with its default settings. You can scale your data differently, or not at all. (With this data set, not scaling the data will lead to GDP overwhelming the other variables, as shown in Figure 13.20.)

Figure 13.20: Dendrogram with unscaled data; notice the much larger Height values, which are largely due to the unscaled GDP values

For the distance calculation, we used the default method, “euclidean”, which calculates the Euclidean distance between the points. The other possible methods are “maximum”, “manhattan”, “canberra”, “binary”, and “minkowski”.

The hclust() function provides several methods for performing the cluster analysis. The default is “complete”; the other possible methods are “ward”, “single”, “average”, “mcquitty”, “median”, and “centroid”.

13.11.4 See Also

See ?hclust for more information about the different clustering methods.

13.12 Creating a Vector Field

13.12.1 Problem

You want to make a vector field.

13.12.2 Solution

Use geom_segment(). For this example, we’ll use the isabel data set:

x and y are the longitude and latitude, respectively, and z is the height in kilometers. The vx, vy, and vz values are the wind speed components in each of these directions, in meters per second, and speed is the wind speed.

The height (z) ranges from 0.035 km to 18.035 km. For this example, we’ll just use the lowest slice of data.

To draw the vectors (Figure 13.21), we’ll use geom_segment(). Each segment has a starting point and an ending point. We’ll use the x and y values as the starting points for each segment, then add a fraction of the vx and vy values to get the end points for each segment. If we didn’t scale down these values, the lines would be much too long:
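A sketch of this step, assuming the isabel data set comes from the gcookbook package; the divisor of 50 is an arbitrary scaling factor:

```r
library(ggplot2)
library(gcookbook)  # For the isabel data set (an assumption)

# Take just the lowest slice of data
islice <- subset(isabel, z == min(z))

ggplot(islice, aes(x = x, y = y)) +
  geom_segment(aes(xend = x + vx/50, yend = y + vy/50),
               size = 0.25)  # Make the line segments 0.25 mm thick
```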

Figure 13.21: First attempt at a vector field. The resolution of the data is too high, but it does hint at some interesting patterns not visible in graphs with a lower data resolution

This vector field has two problems: the data is at too high a resolution to read, and the segments do not have arrows indicating the direction of the flow. To reduce the resolution of the data, we’ll define a function every_n() that keeps one out of every n values in the data and drops the rest:
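A sketch of the helper:

```r
# Keep 1 out of every 'by' values in vector x
every_n <- function(x, by = 2) {
  x <- sort(x)
  x[seq(1, length(x), by = by)]
}
```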

Now that we’ve taken a subset of the data, we can plot it, with arrowheads, as shown in Figure 13.22:
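A sketch, with the subsetting repeated so the block stands on its own (isabel is assumed to come from the gcookbook package):

```r
library(ggplot2)
library(grid)       # For arrow() and unit()
library(gcookbook)  # For the isabel data set (an assumption)

islice <- subset(isabel, z == min(z))
every_n <- function(x, by = 2) {
  x <- sort(x)
  x[seq(1, length(x), by = by)]
}

# Keep 1 of every 4 values in x and y
keepx <- every_n(unique(isabel$x), by = 4)
keepy <- every_n(unique(isabel$y), by = 4)
islicesub <- subset(islice, x %in% keepx & y %in% keepy)

ggplot(islicesub, aes(x = x, y = y)) +
  geom_segment(aes(xend = x + vx/50, yend = y + vy/50),
               arrow = arrow(length = unit(0.1, "cm")), size = 0.25)
```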

Figure 13.22: Vector field with arrowheads

13.12.3 Discussion

One effect of arrowheads is that short vectors appear with more ink than is proportional to their length. This could somewhat distort the interpretation of the data. To mitigate this effect, it may also be useful to map the speed to other properties, like size (line thickness), alpha, or colour. Here, we’ll map speed to alpha (Figure 13.23, left):
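A sketch, assuming islicesub is the reduced-resolution subset built as described above:

```r
ggplot(islicesub, aes(x = x, y = y)) +
  geom_segment(aes(xend = x + vx/50, yend = y + vy/50, alpha = speed),
               arrow = arrow(length = unit(0.1, "cm")), size = 0.6)
```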

Next, we’ll map speed to colour. We’ll also add a map of the United States and zoom in on the area of interest, as shown in the graph on the right in Figure 13.23, using coord_cartesian() (without this, the entire USA would be displayed):
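A sketch; the grey-to-dark-red color scale is an illustrative choice, and islicesub is the subset built as described above:

```r
library(maps)  # For the US map data

usa <- map_data("usa")

ggplot(islicesub, aes(x = x, y = y)) +
  geom_segment(aes(xend = x + vx/50, yend = y + vy/50, colour = speed),
               arrow = arrow(length = unit(0.1, "cm")), size = 0.6) +
  scale_colour_continuous(low = "grey80", high = "darkred") +
  geom_path(aes(x = long, y = lat, group = group), data = usa) +
  coord_cartesian(xlim = range(islicesub$x), ylim = range(islicesub$y))
```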

Figure 13.23: Vector field with speed mapped to alpha (left); With speed mapped to colour (right)

The isabel data set has three-dimensional data, so we can also make a faceted graph of the data, as shown in Figure 13.24. Because each facet is small, we will use a sparser subset than before:

Figure 13.24: Vector field of wind speeds, faceted on z

13.12.4 See Also

If you want to use a different color palette, see Recipe 12.6.

See Recipe 8.2 for more information about zooming in on part of a graph.

13.13 Creating a QQ Plot

13.13.1 Problem

You want to make a quantile-quantile (QQ) plot to compare an empirical distribution to a theoretical distribution.

13.13.2 Solution

Use geom_qq() and geom_qq_line() to compare to a normal distribution (Figure 13.25):
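A sketch, assuming the heightweight data set (with heightIn and ageYear columns, as discussed below) comes from the gcookbook package:

```r
library(ggplot2)
library(gcookbook)  # For the heightweight data set (an assumption)

# QQ plot of heightIn
ggplot(heightweight, aes(sample = heightIn)) +
  geom_qq() +
  geom_qq_line()

# QQ plot of ageYear
ggplot(heightweight, aes(sample = ageYear)) +
  geom_qq() +
  geom_qq_line()
```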

Figure 13.25: QQ plot of height, which is close to normally distributed (left); QQ plot of age, which is not normally distributed (right)

13.13.3 Discussion

The points for heightIn are close to the line, which means that the distribution is close to normal. In contrast, the points for ageYear veer far away from the line, especially on the left, indicating that the distribution is skewed. A histogram may also be useful for exploring how the data is distributed.

13.13.4 See Also

See ?stat_qq for information on comparing data to theoretical distributions other than the normal distribution.

13.14 Creating a Graph of an Empirical Cumulative Distribution Function

13.14.1 Problem

You want to graph the empirical cumulative distribution function (ECDF) of a data set.
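13.14.2 Solution

Use stat_ecdf(). A sketch, assuming the heightweight data set from the gcookbook package:

```r
library(ggplot2)
library(gcookbook)  # For the heightweight data set (an assumption)

ggplot(heightweight, aes(x = heightIn)) +
  stat_ecdf()
```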

13.14.3 Discussion

The ECDF shows what proportion of observations are at or below the given x value. Because it is empirical, the line takes a step up at each x value where there are one or more observations.

13.15 Creating a Mosaic Plot

13.15.1 Problem

You want to make a mosaic plot to visualize a contingency table.

13.15.2 Solution

Use the mosaic() function from the vcd package. For this example we’ll use the UCBAdmissions data set, which is a contingency table with three dimensions. We’ll first take a look at the data in a few different ways:
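For example:

```r
# The contingency table as stored
UCBAdmissions

# A "flat" view of all three dimensions
ftable(UCBAdmissions)

# Marginal totals, e.g., Admit x Gender
margin.table(UCBAdmissions, c(1, 2))
```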

The three dimensions are Admit, Gender, and Dept. To visualize the relationships between the variables (Figure 13.27), use mosaic() and pass it a formula with the variables that will be used to split up the data:
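A sketch of the call:

```r
library(vcd)

# Split by Admit, then Gender, then Dept
mosaic( ~ Admit + Gender + Dept, data = UCBAdmissions)
```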

Figure 13.27: Mosaic plot of UC-Berkeley admissions data; the area of each rectangle is proportional to the number of cases in that cell

Notice that mosaic() splits the data in the order in which the variables are provided: first on admission status, then gender, then department. The resulting plot order makes it very clear that more applicants were rejected than admitted. It is also clear that within the admitted group there were many more men than women, while in the rejected group there were approximately the same number of men and women. It is difficult to make comparisons within each department, though. A different variable splitting order may reveal some other interesting information.

Another way of looking at the data is to split first by department, then gender, then admission status, as in Figure 13.28. This makes the admission status the last variable that is partitioned, so that after partitioning by department and gender, the admitted and rejected cells for each group are right next to each other:
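A sketch; the highlighting fill colors here are illustrative choices:

```r
library(vcd)

mosaic( ~ Dept + Gender + Admit, data = UCBAdmissions,
        highlighting = "Admit",
        highlighting_fill = c("lightblue", "pink"))
```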

Figure 13.28: Mosaic plot with a different variable splitting order: first department, then gender, then admission status

We also specified a variable to highlight (Admit), and which colors to use in the highlighting.

13.15.3 Discussion

In the preceding example we also specified the direction in which each variable will be split. The first variable, Dept, is split vertically; the second variable, Gender, is split horizontally; and the third variable, Admit, is split vertically. The reason that we chose these directions is because, in this particular example, it makes it easy to compare the male and female groups within each department.

We can also use different splitting directions, as shown in Figure 13.29 and Figure 13.30:
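Sketches of both variants, controlled with the direction argument (fill colors illustrative):

```r
library(vcd)

# Dept split vertically, Gender vertically, Admit horizontally
mosaic( ~ Dept + Gender + Admit, data = UCBAdmissions,
        highlighting = "Admit", highlighting_fill = c("lightblue", "pink"),
        direction = c("v", "v", "h"))

# Dept split vertically, Gender horizontally, Admit horizontally
mosaic( ~ Dept + Gender + Admit, data = UCBAdmissions,
        highlighting = "Admit", highlighting_fill = c("lightblue", "pink"),
        direction = c("v", "h", "h"))
```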

Figure 13.29: Splitting Dept vertically, Gender vertically, and Admit horizontally

Figure 13.30: Splitting Dept vertically, Gender horizontally, and Admit horizontally

The example here illustrates a classic case of Simpson’s paradox, in which a relationship between variables within subgroups can change (or reverse!) when the groups are combined. The UCBAdmissions table contains admissions data from the University of California, Berkeley, in 1973. Overall, men were admitted at a higher rate than women, and because of this, the university was sued for gender bias. But when each department was examined separately, the departments were found to have approximately equal admission rates for men and women. The difference in overall admission rates arose because women were more likely to apply to competitive departments with lower admission rates.

In Figure 13.28 and Figure 13.29, you can see that within each department, admission rates were approximately equal between men and women. You can also see that departments with higher admission rates (A and B) were very imbalanced in the gender ratio of applicants: far more men applied to these departments than did women. As you can see, partitioning the data in different orders and directions can bring out different aspects of the data. In Figure 13.29, as in Figure 13.28, it’s easy to compare male and female admission rates within each department and across departments. Splitting Dept vertically, Gender horizontally, and Admit horizontally, as in Figure 13.30, makes it difficult to compare male and female admission rates within each department, but it is easy to compare male and female application rates across departments.

13.15.4 See Also

See ?mosaicplot for another function that can create mosaic plots.

P.J. Bickel, E.A. Hammel, and J.W. O’Connell, “Sex Bias in Graduate Admissions: Data from Berkeley,” Science 187 (1975): 398–404.

You want to make a pie chart.

13.15.5 Solution

Use the pie() function. In this example (Figure 13.31), we’ll use the survey data set from the MASS library:
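A sketch, tabulating the Fold variable from the survey data set:

```r
library(MASS)  # For the survey data set

fold <- table(survey$Fold)
fold

pie(fold)
```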

Figure 13.31: A pie chart

We passed pie() an object of class table. We could have instead given it a named vector, or a vector of values and a vector of labels like this, with the same result:
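For example, with explicit values and labels (treat these particular counts as illustrative):

```r
pie(c(99, 18, 120), labels = c("L on R", "Neither", "R on L"))
```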

13.15.6 Discussion

The lowly pie chart is the subject of frequent abuse from data visualization experts. If you’re thinking of using a pie chart, consider whether a bar graph (or stacked bar graph) would convey the information more effectively. Despite their faults, pie charts do have one important virtue: everyone knows how to read them.

13.16 Creating a Map

13.16.1 Problem

You want to create a geographical map.

13.16.2 Solution

Retrieve map data from the maps package and draw it with geom_polygon() (which can have a color fill) or geom_path() (which can’t have a fill). By default, the latitude and longitude will be drawn on a Cartesian coordinate plane, but you can use coord_map() and specify a projection. The default projection is “mercator”, which, unlike the Cartesian plane, has a progressively changing spacing for latitude lines (Figure 13.32):
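A sketch using the map of US states:

```r
library(maps)  # For map data
library(ggplot2)

states_map <- map_data("state")

# A basic map with fill
ggplot(states_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

# No fill, with a Mercator projection
ggplot(states_map, aes(x = long, y = lat, group = group)) +
  geom_path() +
  coord_map("mercator")
```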

Figure 13.32: Top: a basic map with fill; bottom: with no fill, and Mercator projection

13.16.3 Discussion

The map_data() function returns a data frame with the following columns:

  • long: Longitude.
  • lat: Latitude.
  • group: This is a grouping variable for each polygon. A region or subregion might have multiple polygons, for example, if it includes islands.
  • order: The order to connect each point within a group.
  • region: Roughly, the names of countries, although some other objects are present (such as some lakes).
  • subregion: The names of subregions within a region, which can contain multiple groups. For example, the Alaska subregion includes many islands, each with its own group.

There are a number of different maps available, including world, nz, france, italy, usa (outline of the United States), state (each state in the USA), and county (each county in the USA). For example, to get map data for the world:

If you want to draw a map of a region in the world map for which there isn’t a separate map, you can first look for the region name, like so:
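For example:

```r
library(maps)
library(ggplot2)

# Get map data for the world
world_map <- map_data("world")

# List the available region names
sort(unique(world_map$region))
```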

It’s possible to get data for specific regions from a particular map (Figure 13.33):
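A sketch; the choice of East Asian regions and the Set2 palette are illustrative:

```r
library(maps)
library(ggplot2)

east_asia <- map_data("world",
                      region = c("Japan", "China", "North Korea", "South Korea"))

# Map region to fill, so each country gets its own color
ggplot(east_asia, aes(x = long, y = lat, group = group, fill = region)) +
  geom_polygon(colour = "black") +
  scale_fill_brewer(palette = "Set2")
```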

Figure 13.33: Specific regions from the world map

If there is a separate map available for a region, such as nz (New Zealand), that map data will be at a higher resolution than if you were to extract it from the world map, as shown in Figure 13.34:
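A sketch comparing the two sources; the lat/long cutoffs used to trim the outlying islands are illustrative:

```r
library(maps)
library(ggplot2)

# New Zealand data extracted from the world map
nz1 <- map_data("world", region = "New Zealand")
nz1 <- subset(nz1, long > 0 & lat > -48)  # Trim off the outlying islands
ggplot(nz1, aes(x = long, y = lat, group = group)) +
  geom_path()

# New Zealand data from the higher-resolution nz map
nz2 <- map_data("nz")
ggplot(nz2, aes(x = long, y = lat, group = group)) +
  geom_path()
```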

Figure 13.34: New Zealand data taken from world map (left); Data from nz map (right)

13.16.4 See Also

See the mapdata package for more map data sets. It includes maps of China and Japan, as well as a high-resolution world map, worldHires.

See the map() function, for quickly generating maps.

See ?mapproject for a list of available map projections.

13.17 Creating a Choropleth Map

13.17.1 Problem

You want to create a map with regions that are colored according to variable values.

13.17.2 Solution

Merge the value data with the map data, then map a variable to fill:

# Transform the USArrests data set to the correct format
crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests)
crimes
#>                       state Murder Assault UrbanPop Rape
#> Alabama             alabama   13.2     236       58 21.2
#> Alaska               alaska   10.0     263       48 44.5
#> Arizona             arizona    8.1     294       80 31.0
#>  ...<44 more rows>...
#> West Virginia west virginia    5.7      81       39  9.3
#> Wisconsin         wisconsin    2.6      53       66 10.8
#> Wyoming             wyoming    6.8     161       60 15.6

library(maps) # For map data
states_map <- map_data("state")
# Merge the data sets together
crime_map <- merge(states_map, crimes, by.x = "region", by.y = "state")
# After merging, the order has changed, which would lead to polygons drawn in
# the incorrect order. So, we'll sort the data.
crime_map
#>        region   long  lat group order subregion Murder Assault UrbanPop Rape
#> 1     alabama  -87.5 30.4     1     1      <NA>   13.2     236       58 21.2
#> 2     alabama  -87.5 30.4     1     2      <NA>   13.2     236       58 21.2
#> 3     alabama  -88.0 30.2     1    13      <NA>   13.2     236       58 21.2
#>  ...<15,521 more rows>...
#> 15525 wyoming -107.9 41.0    63 15597      <NA>    6.8     161       60 15.6
#> 15526 wyoming -109.1 41.0    63 15598      <NA>    6.8     161       60 15.6
#> 15527 wyoming -109.1 41.0    63 15599      <NA>    6.8     161       60 15.6

library(dplyr) # For arrange() function
# Sort by group, then order
crime_map <- arrange(crime_map, group, order)
crime_map
#>        region   long  lat group order subregion Murder Assault UrbanPop Rape
#> 1     alabama  -87.5 30.4     1     1      <NA>   13.2     236       58 21.2
#> 2     alabama  -87.5 30.4     1     2      <NA>   13.2     236       58 21.2
#> 3     alabama  -87.5 30.4     1     3      <NA>   13.2     236       58 21.2
#>  ...<15,521 more rows>...
#> 15525 wyoming -107.9 41.0    63 15597      <NA>    6.8     161       60 15.6
#> 15526 wyoming -109.1 41.0    63 15598      <NA>    6.8     161       60 15.6
#> 15527 wyoming -109.1 41.0    63 15599      <NA>    6.8     161       60 15.6

Once the data is in the correct format, it can be plotted (Figure 13.35), mapping one of the columns with data values to fill:
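A sketch of that plotting code (the merge and sort steps are repeated here so the block runs on its own; `coord_map()` requires the mapproj package):

```r
library(maps)     # for map data
library(ggplot2)
library(dplyr)    # for arrange()

# Rebuild the merged, sorted data from above
crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests)
states_map <- map_data("state")
crime_map <- arrange(merge(states_map, crimes, by.x = "region", by.y = "state"),
                     group, order)

# Map the Assault column to fill; group keeps each polygon's points together
ggplot(crime_map, aes(x = long, y = lat, group = group, fill = Assault)) +
  geom_polygon(colour = "black") +
  coord_map("polyconic")  # requires the mapproj package
```

Any of the other numeric columns (Murder, UrbanPop, Rape) could be mapped to fill in the same way.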


Figure 13.35: A map with a variable mapped to fill

13.17.3 Discussion

The preceding example used the default color scale, which goes from dark to light blue. If you want to show how the values diverge from some middle value, you can use scale_fill_gradient2() or scale_fill_viridis_c(), as shown in Figure 13.36:
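A sketch of the diverging-scale version, again rebuilding `crime_map` so the block is self-contained. The color choices and the midpoint (here the median of Assault) are illustrative assumptions, not prescribed values:

```r
library(maps)
library(ggplot2)
library(dplyr)
library(scales)   # for muted()

crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests)
states_map <- map_data("state")
crime_map <- arrange(merge(states_map, crimes, by.x = "region", by.y = "state"),
                     group, order)

# Diverging scale centered on the median Assault value (midpoint is illustrative)
ggplot(crime_map, aes(x = long, y = lat, group = group, fill = Assault)) +
  geom_polygon(colour = "grey65") +
  scale_fill_gradient2(low = muted("blue"), mid = "white", high = muted("red"),
                       midpoint = median(crimes$Assault)) +
  coord_map("polyconic")
```

Swapping scale_fill_gradient2() for scale_fill_viridis_c() gives the viridis variant.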


Figure 13.36: With a diverging color scale

The previous example mapped continuous values to fill, but we could just as well use discrete values. It’s sometimes easier to interpret the data if the values are discretized. For example, we can categorize the values into quantiles and show those quantiles, as in Figure 13.37:
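One way to discretize the values is with quantile() and cut(), as sketched below; the quintile breaks and the brewer palette are illustrative choices:

```r
library(maps)
library(ggplot2)
library(dplyr)

crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests)

# Break Assault into quintiles, stored as a factor
qa <- quantile(crimes$Assault, c(0, 0.2, 0.4, 0.6, 0.8, 1.0))
crimes$Assault_q <- cut(crimes$Assault, qa,
                        labels = c("0-20%", "20-40%", "40-60%", "60-80%", "80-100%"),
                        include.lowest = TRUE)

states_map <- map_data("state")
crime_map <- arrange(merge(states_map, crimes, by.x = "region", by.y = "state"),
                     group, order)

# Discrete fill values get a discrete scale
ggplot(crime_map, aes(x = long, y = lat, group = group, fill = Assault_q)) +
  geom_polygon(colour = "black") +
  scale_fill_brewer(palette = "PuRd") +   # palette choice is illustrative
  coord_map("polyconic")
```

Note include.lowest = TRUE, which keeps the minimum value from falling outside the first bin.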


Figure 13.37: Choropleth map with discretized data

Another way to make a choropleth, without needing to merge the map data with the value data, is to use geom_map(). As of this writing, geom_map() renders maps faster than the method just described.

For this method, the map data frame must have columns named lat, long, and region. In the value data frame, there must be a column that is matched to the region column in the map data frame, and this column is specified by mapping it to the map_id aesthetic. For example, this code will have the same output as the first example (Figure 13.35):
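A sketch of that approach, assuming the crimes and states_map data frames defined in the Solution:

```r
library(maps)
library(ggplot2)

crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests)
states_map <- map_data("state")

# No merge needed: map_id links crimes$state to states_map$region
ggplot(crimes, aes(map_id = state, fill = Assault)) +
  geom_map(map = states_map, colour = "black") +
  expand_limits(x = states_map$long, y = states_map$lat) +
  coord_map("polyconic")
```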

Notice that we also needed to use expand_limits(). This is because unlike most geoms, geom_map() doesn’t automatically set the x and y limits; the use of expand_limits() makes it include those x and y values. (Another way to accomplish the same result is to use ylim() and xlim().)

13.17.4 See Also

For an example of data overlaid on a map, see Recipe 13.12.

For more on using continuous colors, see Recipe 12.6.

13.18 Making a Map with a Clean Background

13.18.1 Problem

You want to remove background elements from a map.
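13.18.2 Solution

One way to do this (a sketch, reusing the choropleth data from Recipe 13.17; the book's exact theme settings may differ) is to add theme_void(), which removes the axes, tick labels, gridlines, and background panel:

```r
library(maps)
library(ggplot2)
library(dplyr)

crimes <- data.frame(state = tolower(rownames(USArrests)), USArrests)
states_map <- map_data("state")
crime_map <- arrange(merge(states_map, crimes, by.x = "region", by.y = "state"),
                     group, order)

# theme_void() strips all background elements, leaving only the map itself
ggplot(crime_map, aes(x = long, y = lat, group = group, fill = Assault)) +
  geom_polygon(colour = "black") +
  coord_map("polyconic") +
  theme_void()
```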

13.18.3 Discussion

In some maps, it’s important to include contextual information such as the latitude and longitude. In others, this information is unimportant and distracts from the information that’s being conveyed. In Figure 13.38, it’s unlikely that viewers will care about the latitude and longitude of the states. They can probably identify the states by shape and relative position, and even if they can’t, having the latitude and longitude isn’t really helpful.

13.19 Creating a Map from a Shapefile

13.19.1 Problem

You want to create a geographical map from an Esri shapefile.
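13.19.2 Solution

Read the shapefile with st_read() from the sf package, then plot it with geom_sf(). The file path below is a placeholder; substitute the path to your own .shp file:

```r
library(sf)
library(ggplot2)

# Placeholder path -- substitute your own shapefile
taiwan_shp <- st_read("TWN_adm2.shp")

ggplot(taiwan_shp) +
  geom_sf()
```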

13.19.3 Discussion

Esri shapefiles are a common format for map data. The st_read() function from the sf package reads a shapefile and returns an sf object, which contains the feature geometry along with other useful columns.

The sf object is a special kind of data frame, with 22 rows and 12 columns. Each row corresponds to one feature, and each column has some data about each feature. One of the columns is the geometry for each feature. This is a list-column – a special type of column where each of the 22 elements contains one or more matrices of numbers representing the shape of the feature.

Columns in the data can be mapped to aesthetics like fill. For example, we can map the ENGTYPE_2 column to fill, as shown in Figure 13.40:
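A sketch of that mapping, assuming the shapefile has been read into an sf object (here named taiwan_shp, with a placeholder path) and that it contains an ENGTYPE_2 column:

```r
library(sf)
library(ggplot2)

taiwan_shp <- st_read("TWN_adm2.shp")  # placeholder path

# geom_sf() uses the object's geometry column automatically;
# a data column supplies the fill
ggplot(taiwan_shp) +
  geom_sf(aes(fill = ENGTYPE_2))
```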


Figure 13.40: With a column mapped to fill

13.19.4 See Also

The shapefile used in this example is not included in the gcookbook package. It and many other shapefiles are available for download at http://www.gadm.org.