15.7 Getting a Subset of a Data Frame

15.7.1 Problem

You want to get a subset of a data frame.

15.7.2 Solution

Use filter() to get the rows, and select() to get the columns you want. These operations can be chained together using the %>% operator. These functions return a new data frame, so if you want to change the original variable, you’ll need to save the new result over it.

We’ll use the climate data set for the examples here:

library(gcookbook) # Load gcookbook for the climate data set
climate
#>       Source Year Anomaly1y Anomaly5y Anomaly10y Unc10y
#> 1   Berkeley 1800        NA        NA     -0.435  0.505
#> 2   Berkeley 1801        NA        NA     -0.453  0.493
#>  ...<495 more rows>...
#> 498  CRUTEM3 2010    0.8023        NA         NA     NA
#> 499  CRUTEM3 2011    0.6193        NA         NA     NA

Let’s that say that only want to keep rows where Source is "Berkeley" and where the year is inclusive of and between 1900 and 2000. You can do so with the filter() function:

climate %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000)

If you want only the Year and Anomaly10y columns, use select(), as we did in 15.4:

climate %>%
  select(Year, Anomaly10y)
#>     Year Anomaly10y
#> 1   1800     -0.435
#> 2   1801     -0.453
#>  ...<495 more rows>...
#> 498 2010         NA
#> 499 2011         NA

These operations can be chained together using the %>% operator:

climate %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000) %>%
  select(Year, Anomaly10y)
#>     Year Anomaly10y
#> 1   1900     -0.171
#> 2   1901     -0.162
#>  ...<97 more rows>...
#> 100 1999      0.734
#> 101 2000      0.748

15.7.3 Discussion

The filter() function picks out rows based on a condition. If you want to pick out rows based on their numeric position, use the slice() function:

slice(climate, 1:100)

I generally recommend indexing using names rather than numbers when possible. It makes the code easier to understand when you’re collaborating with others or when you come back to it months or years after writing it, and it makes the code less likely to break when there are changes to the data, such as when columns are added or removed.

With base R, you can get a subset of rows like this:

climate[climate$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000, ]
#>       Source Year Anomaly1y Anomaly5y Anomaly10y Unc10y
#> 101 Berkeley 1900        NA        NA     -0.171  0.108
#> 102 Berkeley 1901        NA        NA     -0.162  0.109
#>  ...<97 more rows>...
#> 200 Berkeley 1999        NA        NA      0.734  0.025
#> 201 Berkeley 2000        NA        NA      0.748  0.026

Notice that we needed to prefix each column name with climate$, and that there’s a comma after the selection criteria. This indicates that we’re getting rows, not columns.

This row filtering can also be combined with the column selection from 15.4:

climate[climate$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000,
        c("Year", "Anomaly10y")]
#>     Year Anomaly10y
#> 101 1900     -0.171
#> 102 1901     -0.162
#>  ...<97 more rows>...
#> 200 1999      0.734
#> 201 2000      0.748