A.2 Some Terminology and Theory

Before we go any further, it’ll be helpful to define some of the terminology used in ggplot2:

  • The data is what we want to visualize. It consists of variables, which are stored as columns in a data frame.

  • Geoms are the geometric objects that are drawn to represent the data, such as bars, lines, and points.

  • Aesthetic attributes, or aesthetics, are visual properties of geoms, such as x and y position, line color, point shapes, etc.

  • Mappings connect variables in the data to aesthetics. For example, one variable might be mapped to x position and another to color.

  • Scales control the mapping from the values in the data space to values in the aesthetic space. A continuous y scale maps larger numerical values to vertically higher positions in space.

  • Guides show the viewer how to map the visual properties back to the data space. The most commonly used guides are the tick marks and labels on an axis.
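To see how these terms fit together, here is a minimal sketch of a single plot with each piece labeled in the comments. The data frame and its column names (dose, length, supp) are made up for illustration and are not taken from the text.

```r
library(ggplot2)

# Data: variables stored as columns in a data frame.
# (Hypothetical values, for illustration only.)
dat <- data.frame(
  dose   = c(1, 2, 3, 4, 5, 6),
  length = c(4.2, 6.1, 7.9, 9.8, 12.4, 13.7),
  supp   = c("A", "A", "A", "B", "B", "B")
)

p <- ggplot(dat, aes(x = dose, y = length, colour = supp)) +  # mappings to aesthetics
  geom_point() +                                              # geom: points
  scale_y_continuous(limits = c(0, 15)) +                     # scale for the y aesthetic
  guides(colour = guide_legend(title = "Supplement"))         # guide: the colour legend

p
```

The x and y axes are guides too; ggplot2 draws them automatically from the position scales.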

Here’s an example of how a typical mapping works. You have data, which is a set of numerical or categorical values. You have geoms to represent each observation. You have an aesthetic, such as y (vertical) position. And you have a scale, which defines the mapping from the data space (numeric values) to the aesthetic space (vertical position). A typical linear y scale might map the value 0 to the baseline of the graph, 5 to the middle, and 10 to the top. A logarithmic y scale would place values differently: equal ratios, rather than equal differences, would be the same distance apart.
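The arithmetic behind such a scale can be sketched in a few lines of base R. The helper functions linear_position() and log_position() below are hypothetical names invented for this illustration; they are not part of ggplot2, which handles this internally through functions such as scale_y_continuous() and scale_y_log10(). Positions are expressed as a fraction of the plot height, from 0 (baseline) to 1 (top).

```r
# Hypothetical helper: position of a value on a linear scale,
# as a fraction of the distance from the lower to the upper limit.
linear_position <- function(value, limits) {
  (value - limits[1]) / (limits[2] - limits[1])
}

# Hypothetical helper: the same idea on a base-10 logarithmic scale.
# The limits must be positive, because log10(0) is not finite.
log_position <- function(value, limits) {
  (log10(value) - log10(limits[1])) / (log10(limits[2]) - log10(limits[1]))
}

# Linear scale from 0 to 10: 0 is at the baseline, 5 in the middle, 10 at the top.
linear_position(c(0, 5, 10), limits = c(0, 10))

# Log scale from 1 to 10: 5 sits well above the middle, at about 0.7.
log_position(c(1, 5, 10), limits = c(1, 10))
```

Note that the logarithmic example starts at 1 rather than 0, since 0 has no position on a log scale.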

These aren’t the only kinds of data and aesthetic spaces possible. In the abstract grammar of graphics, the data and aesthetics could be anything; in the ggplot2 implementation, there are some predetermined types of data and aesthetics. Commonly used data types include numeric values, categorical values, and text strings. Some commonly used aesthetics include horizontal and vertical position, color, size, and shape.

To interpret the plot, viewers refer to the guides. An example of a guide is the y-axis, including the tick marks and labels. The viewer refers to this guide to interpret what it means when a point is in the middle of the scale. A legend is another type of guide. A legend might show people what it means for a point to be a circle or a triangle, or what it means for a line to be blue or red.
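As a rough illustration of both kinds of guide, the sketch below sets the tick marks on the y-axis and produces a legend that explains the point shapes and line colors. The data and the group names ("control", "treated") are invented for this example.

```r
library(ggplot2)

# Hypothetical data, for illustration only.
dat <- data.frame(
  time  = c(1, 2, 3, 1, 2, 3),
  value = c(2, 5, 9, 3, 4, 6),
  group = c("control", "control", "control", "treated", "treated", "treated")
)

p <- ggplot(dat, aes(x = time, y = value, shape = group, colour = group)) +
  geom_line() +
  geom_point(size = 3) +
  # Axis guide: tick marks and labels at 0, 5, and 10 on the y-axis.
  scale_y_continuous(breaks = c(0, 5, 10), limits = c(0, 10)) +
  # Legend guide: shape 16 is a filled circle, shape 17 a filled triangle.
  scale_shape_manual(values = c(control = 16, treated = 17)) +
  # The same legend also shows which line color belongs to which group.
  scale_colour_manual(values = c(control = "blue", treated = "red"))

p
```

Because shape and colour are mapped to the same variable, ggplot2 should merge them into a single legend.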

Some aesthetics can only work with categorical variables, such as the shape of a point: triangles, circles, squares, etc. Some aesthetics work with categorical or continuous variables, such as x (horizontal) position. For a bar graph, the x variable is usually categorical, with one bar for each category. For a scatter plot, it is usually numeric. Both of these types of data (categorical and numeric) can be mapped to the aesthetic space of x position, but they require different types of scales.
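The following sketch shows the same aesthetic, x position, used with each type of variable. The sales data frame and its columns are made up for illustration. The scale_x_discrete() and scale_x_continuous() calls are written out only to make the scale type visible; ggplot2 normally picks the matching scale on its own.

```r
library(ggplot2)

# Hypothetical data, for illustration only.
sales <- data.frame(
  fruit  = c("apple", "banana", "cherry"),  # categorical
  weight = c(1.2, 2.5, 0.8),                # numeric
  price  = c(3.1, 2.2, 5.4)                 # numeric
)

# Bar graph: a categorical variable on x, so a discrete x scale.
bars <- ggplot(sales, aes(x = fruit, y = price)) +
  geom_col() +
  scale_x_discrete()

# Scatter plot: a numeric variable on x, so a continuous x scale.
scatter <- ggplot(sales, aes(x = weight, y = price)) +
  geom_point() +
  scale_x_continuous()

# Shape works with a categorical variable: one point shape per fruit.
shapes <- ggplot(sales, aes(x = weight, y = price, shape = fruit)) +
  geom_point(size = 3)
```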

Note

In ggplot2 terminology, categorical variables are called discrete, and numeric variables are called continuous. These terms may not always correspond to how they’re used elsewhere. Sometimes a variable that is continuous in the ggplot2 sense is discrete in the ordinary sense. For example, the number of visible sunspots must be an integer, so it’s numeric (continuous to ggplot2) and discrete (in ordinary language).
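A small sketch of the distinction, using invented daily sunspot counts (the spots data frame is hypothetical): an integer column is numeric, so ggplot2 treats it as continuous, and converting it to a factor is one common way to have it treated as discrete instead.

```r
library(ggplot2)

# Hypothetical counts, for illustration only.
spots <- data.frame(
  day   = 1:6,
  count = c(3L, 5L, 5L, 2L, 3L, 5L)  # integers: discrete in the ordinary sense
)

# Integers are numeric in R, so ggplot2 uses a continuous scale for them.
is.numeric(spots$count)
continuous_plot <- ggplot(spots, aes(x = day, y = count)) +
  geom_point()

# Converting to a factor makes the variable discrete in the ggplot2 sense.
spots$count_f <- factor(spots$count)
is.factor(spots$count_f)
discrete_plot <- ggplot(spots, aes(x = count_f)) +
  geom_bar()
```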