When the first edition of this book was published five years ago, the phrase “data science” had only recently entered the popular lexicon. Today, the phrase is unavoidable if you’re involved with the sciences, journalism, or high-tech industries. Many interrelated developments have made this possible: there’s a general awareness that understanding quantitative data has tangible benefits; there are better and more widely available educational resources about how to do data science; and finally, the tools have evolved, becoming easier to use and get started with.
The goal of this book is to help you understand your data by visualizing it, and to help you convey that understanding to others. You can think of data analysis as the process of transforming raw data into ideas in somebody’s mind. One of the key techniques for doing this is to create visualizations of the data. Our brains have highly developed visual pattern detection systems, and data visualizations are a way to efficiently use those visual systems to get quantitative information into a person’s mind.
Each recipe in this book lists a problem and a solution. In most cases, the solutions I offer aren’t the only way to do things in R, but they are, in my opinion, the best way. One of the reasons for R’s popularity is that there are many available add-on packages, each of which provides some functionality for R. There are many packages for visualizing data in R, but this book primarily uses ggplot2.
This book isn’t meant to be a comprehensive manual of all the different ways of creating data visualizations in R, but hopefully it will help you figure out how to make the graphics you have in mind. Or, if you’re not sure what you want to make, browsing its pages may give you some ideas about what’s possible.
This book is intended for readers who have at least a basic understanding of R. The recipes in this book will show you how to do specific tasks. I’ve tried to use examples that are simple, so that you can understand how they work and transfer the solutions over to your own problems.
0.2 Software and Platform Notes
Most of the recipes here use the ggplot2 plotting package, and the dplyr package for data wrangling. These packages are both part of the tidyverse, which is a collection of R packages that make it easier to work with data. Some of the recipes require the most recent version of ggplot2, 3.0.0, and this in turn requires a relatively recent version of R. You can always get the latest version of R from the main R project site, http://www.r-project.org.
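One way to confirm that your setup meets these requirements is to check the versions from the R console. The following is a minimal sketch: the variable names are only illustrative, and the 3.0.0 threshold is the ggplot2 version mentioned above. It guards against ggplot2 not being installed yet.

```r
# Version of R that is currently running
r_version <- getRversion()
print(r_version)

# TRUE if ggplot2 is installed, FALSE otherwise (nothing is attached here)
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

# TRUE only if ggplot2 is installed and is at least version 3.0.0
ggplot2_ok <- has_ggplot2 && packageVersion("ggplot2") >= "3.0.0"
print(ggplot2_ok)
```

If `ggplot2_ok` is `FALSE`, installing or updating the packages as described in this section should resolve it.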
You can use the recipes with just a surface understanding of ggplot2, but if you want a deeper understanding of how it works, see Appendix A.
Once you’ve installed R, you can install the necessary packages. In addition to the tidyverse packages, you’ll also want to install the gcookbook package, which contains data sets for many of the examples in this book. You can install the tidyverse packages and the gcookbook package with:
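A minimal sketch of that step, assuming both packages are installed from CRAN with the standard `install.packages()` function and that you have a working internet connection:

```r
# Install the tidyverse collection, which includes ggplot2 and dplyr
install.packages("tidyverse")

# Install gcookbook, which provides the example data sets
install.packages("gcookbook")
```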
You may be asked to choose a mirror site for CRAN, the Comprehensive R Archive Network. Any of the sites should work, but it’s a good idea to choose one close to you because it will likely be faster than one far away. Once you’ve installed the packages, load the ggplot2 and dplyr packages in each R session in which you want to use the recipes in this book:
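A minimal sketch of the loading step, assuming both packages have already been installed:

```r
# Attach the plotting and data-wrangling packages for this session
library(ggplot2)
library(dplyr)
```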
The recipes in this book will assume that you’ve already loaded ggplot2 and dplyr, so they won’t show these lines of code.
If you see an error like this, it means that you forgot to load ggplot2:
Error: could not find function "ggplot"
The major platforms for R are macOS, Linux, and Windows, and all the recipes in this book should work on all of these platforms. There are some platform-specific differences when it comes to creating bitmap output files, and these differences are covered in Chapter 14.
0.3 Conventions Used in This Book
The following typographical conventions are used in this book:
- Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
- Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
- Constant width text that is prefaced with #> represents output from R.
- Constant width bold
Shows commands or other text that should be typed literally by the user.
- Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
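For instance, a listing that follows the `#>` output convention might look like the sketch below. The vector `x` is a made-up example, not one of the book's data sets.

```r
# Code that you would type at the R console
x <- c(1, 2, 3)
mean(x)
#> [1] 2
```

The first three lines are input; the line beginning with `#>` is what R prints in response.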
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.
0.4 Using Code Examples
0.5 How to Contact Us
0.6 Acknowledgments
No book is the product of a single person. There are many people who have helped make this book possible. I’d like to thank the R community for creating R and for fostering a dynamic ecosystem around it. Thanks to Hadley Wickham and other members of the tidyverse team for creating the software that this book revolves around, and for opening up many opportunities for me to deepen my understanding of R, data analysis, and visualization. I’m grateful that my employer, RStudio, not only makes it possible for me to work with some of the leading lights in the R community, but also pays us to work on software that the entire R community benefits from.
Thanks to the technical reviewers of this book and of its first edition: Garrett Grolemund, Thomas Lin Pedersen, Paul Teetor, Hadley Wickham, Dennis Murphy, and Erik Iverson. Their depth of knowledge and attention to detail have resulted in a much better book.
Finally, I would like to thank my wife, Sylia, for her support and understanding – and not just with regard to the book.