15.7 Getting a Subset of a Data Frame
15.7.2 Solution
Use filter()
to get the rows, and select()
to get the columns you want. These operations can be chained together using the %>%
operator. These functions return a new data frame, so if you want to change the original variable, you’ll need to save the new result over it.
We’ll use the climate
data set for the examples here:
library(gcookbook) # Load gcookbook for the climate data set
climate#> Source Year Anomaly1y Anomaly5y Anomaly10y Unc10y
#> 1 Berkeley 1800 NA NA -0.435 0.505
#> 2 Berkeley 1801 NA NA -0.453 0.493
#> ...<495 more rows>...
#> 498 CRUTEM3 2010 0.8023 NA NA NA
#> 499 CRUTEM3 2011 0.6193 NA NA NA
Let’s that say that only want to keep rows where Source
is "Berkeley"
and where the year is inclusive of and between 1900 and 2000. You can do so with the filter()
function:
%>%
climate filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000)
If you want only the Year
and Anomaly10y
columns, use select()
, as we did in 15.4:
%>%
climate select(Year, Anomaly10y)
#> Year Anomaly10y
#> 1 1800 -0.435
#> 2 1801 -0.453
#> ...<495 more rows>...
#> 498 2010 NA
#> 499 2011 NA
These operations can be chained together using the %>%
operator:
%>%
climate filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000) %>%
select(Year, Anomaly10y)
#> Year Anomaly10y
#> 1 1900 -0.171
#> 2 1901 -0.162
#> ...<97 more rows>...
#> 100 1999 0.734
#> 101 2000 0.748
15.7.3 Discussion
The filter()
function picks out rows based on a condition. If you want to pick out rows based on their numeric position, use the slice()
function:
slice(climate, 1:100)
I generally recommend indexing using names rather than numbers when possible. It makes the code easier to understand when you’re collaborating with others or when you come back to it months or years after writing it, and it makes the code less likely to break when there are changes to the data, such as when columns are added or removed.
With base R, you can get a subset of rows like this:
$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000, ]
climate[climate#> Source Year Anomaly1y Anomaly5y Anomaly10y Unc10y
#> 101 Berkeley 1900 NA NA -0.171 0.108
#> 102 Berkeley 1901 NA NA -0.162 0.109
#> ...<97 more rows>...
#> 200 Berkeley 1999 NA NA 0.734 0.025
#> 201 Berkeley 2000 NA NA 0.748 0.026
Notice that we needed to prefix each column name with climate$
, and that there’s a comma after the selection criteria. This indicates that we’re getting rows, not columns.
This row filtering can also be combined with the column selection from 15.4:
$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000,
climate[climatec("Year", "Anomaly10y")]
#> Year Anomaly10y
#> 101 1900 -0.171
#> 102 1901 -0.162
#> ...<97 more rows>...
#> 200 1999 0.734
#> 201 2000 0.748