15.7 Getting a Subset of a Data Frame

15.7.1 Problem

You want to get a subset of a data frame.

15.7.2 Solution

Use filter() to get the rows, and select() to get the columns you want. These operations can be chained together using the %>% operator. These functions return a new data frame, so if you want to change the original variable, you’ll need to save the new result over it.

We’ll use the climate data set for the examples here:
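A minimal sketch for loading and inspecting the data, assuming the gcookbook package (which supplies climate) and dplyr are both installed:

```r
library(gcookbook)  # Provides the climate data set
library(dplyr)      # Provides filter(), select(), slice(), and %>%

# Look at the first few rows and the column types
head(climate)
str(climate)
```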

Let’s say that you want to keep only the rows where Source is "Berkeley" and where Year is between 1900 and 2000, inclusive. You can do so with the filter() function:
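One way to write that filter is sketched here. The variable name berkeley_20c is only illustrative, and the code assumes gcookbook and dplyr are installed:

```r
library(gcookbook)  # Provides the climate data set
library(dplyr)

# Keep only Berkeley rows from 1900 through 2000 (both endpoints included)
berkeley_20c <- climate %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000)

head(berkeley_20c)
```

Conditions can also be passed to filter() as separate arguments, as in filter(Source == "Berkeley", Year >= 1900, Year <= 2000). Separate arguments are combined with a logical AND, so the result is the same.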

If you want only the Year and Anomaly10y columns, use select(), as we did in 15.4:
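A short sketch of the column selection, again assuming gcookbook and dplyr are installed (year_anomaly is an illustrative name):

```r
library(gcookbook)  # Provides the climate data set
library(dplyr)

# Keep only the Year and Anomaly10y columns; all rows are retained
year_anomaly <- climate %>%
  select(Year, Anomaly10y)

head(year_anomaly)
```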

These operations can be chained together using the %>% operator:
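The row filter and column selection can be combined in a single pipeline, sketched here under the same assumptions (gcookbook and dplyr installed; climate_sub is an illustrative name):

```r
library(gcookbook)  # Provides the climate data set
library(dplyr)

# Filter the rows first, then keep only the two columns of interest
climate_sub <- climate %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000) %>%
  select(Year, Anomaly10y)

head(climate_sub)
```

Because filter() and select() return new data frames, climate itself is unchanged here. To replace the original, assign the result back to climate instead of to a new name.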

15.7.3 Discussion

The filter() function picks out rows based on a condition. If you want to pick out rows based on their numeric position, use the slice() function:
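As an illustration, this sketch takes the first 100 rows by position. It assumes gcookbook and dplyr are installed, and first_100 is an illustrative name:

```r
library(gcookbook)  # Provides the climate data set
library(dplyr)

# Select rows 1 through 100 by their numeric position
first_100 <- slice(climate, 1:100)

nrow(first_100)
```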

I generally recommend indexing using names rather than numbers when possible. It makes the code easier to understand when you’re collaborating with others or when you come back to it months or years after writing it, and it makes the code less likely to break when there are changes to the data, such as when columns are added or removed.

With base R, you can get a subset of rows like this:
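A base R version of the same row filter might look like the following. It needs only gcookbook for the data, and berkeley_base is an illustrative name:

```r
library(gcookbook)  # Provides the climate data set

# Logical condition before the comma selects rows;
# leaving the part after the comma empty keeps every column
berkeley_base <- climate[climate$Source == "Berkeley" &
                         climate$Year >= 1900 &
                         climate$Year <= 2000, ]

head(berkeley_base)
```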

Notice that we needed to prefix each column name with climate$, and that there’s a comma after the selection criteria. The trailing comma indicates that we’re selecting rows, not columns.

This row filtering can also be combined with the column selection from 15.4:
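A sketch of the combined form in base R, with the row condition before the comma and a vector of column names after it (gcookbook is assumed to be installed; berkeley_cols is an illustrative name):

```r
library(gcookbook)  # Provides the climate data set

# Rows: Berkeley, 1900 through 2000. Columns: Year and Anomaly10y, chosen by name
berkeley_cols <- climate[climate$Source == "Berkeley" &
                         climate$Year >= 1900 &
                         climate$Year <= 2000,
                         c("Year", "Anomaly10y")]

head(berkeley_cols)
```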