15.13 Recoding a Categorical Variable to Another Categorical Variable

15.13.1 Problem

You want to recode a categorical variable into another categorical variable.

15.13.2 Solution

For the examples here, we’ll use a subset of the PlantGrowth data set:

# Work on a subset of the PlantGrowth data set
pg <- PlantGrowth[c(1,2,11,21,22), ]
pg
#>    weight group
#> 1    4.17  ctrl
#> 2    5.58  ctrl
#> 11   4.81  trt1
#> 21   6.31  trt2
#> 22   5.12  trt2

In this example, we’ll recode the categorical variable group into another categorical variable, treatment. If the old value was "ctrl", the new value will be "No", and if the old value was "trt1" or "trt2", the new value will be "Yes".

This can be done with the recode() function from the dplyr package:

library(dplyr)

recode(pg$group, ctrl = "No", trt1 = "Yes", trt2 = "Yes")
#> [1] No  No  Yes Yes Yes
#> Levels: No Yes

You can assign it as a new column in the data frame:

pg$treatment <- recode(pg$group, ctrl = "No", trt1 = "Yes", trt2 = "Yes")

Because the input was a factor, recode() returns a factor. To get a character vector instead, convert the input with as.character() first:

recode(as.character(pg$group), ctrl = "No", trt1 = "Yes", trt2 = "Yes")
#> [1] "No"  "No"  "Yes" "Yes" "Yes"
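Two variations on the calls above, as a minimal sketch: the same recoding can be written inside dplyr's mutate(), and recode()'s .default argument can stand in for values that are not listed by name. The variable name treatment2 is only illustrative, and pg is rebuilt here so that the block runs on its own.

```r
library(dplyr)

# Rebuild the example subset so this block is self-contained
pg <- PlantGrowth[c(1,2,11,21,22), ]

# Add the recoded column with mutate() instead of pg$treatment <- ...
pg <- mutate(pg, treatment = recode(group, ctrl = "No", trt1 = "Yes", trt2 = "Yes"))

# Name only "ctrl"; every other value falls through to .default
# (treatment2 is an illustrative name, not part of the recipe)
treatment2 <- recode(as.character(pg$group), ctrl = "No", .default = "Yes")
```

The .default form is convenient when many old values map to the same new value, but it will also silently recode any value you did not anticipate, so it is worth checking the distinct input values first.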

15.13.3 Discussion

You can also use the fct_recode() function from the forcats package. It works the same way, except that the names and values are swapped: the new value is the argument name and the old value is the string. This may be a little more intuitive:

library(forcats)
fct_recode(pg$group, No = "ctrl", Yes = "trt1", Yes = "trt2")
#> [1] No  No  Yes Yes Yes
#> Levels: No Yes
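When several old levels map to the same new level, forcats also provides fct_collapse(), which takes a vector of old levels for each new name so that the new name does not have to be repeated. A minimal sketch, with collapsed as an illustrative variable name:

```r
library(forcats)

# Rebuild the example subset so this block is self-contained
pg <- PlantGrowth[c(1,2,11,21,22), ]

# Each argument name is a new level; its value lists the old levels to merge
collapsed <- fct_collapse(pg$group, No = "ctrl", Yes = c("trt1", "trt2"))
```

As with fct_recode(), the result is a factor.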

Another difference is that fct_recode() always returns a factor, whereas recode() returns a character vector when given a character vector and a factor when given a factor. (dplyr also has a recode_factor() function, which always returns a factor.)

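Using base R, recoding can be done with the match() function. match() returns the position of each value of pg$group in oldvals, and those positions are then used to index into newvals: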

oldvals <- c("ctrl", "trt1", "trt2")
newvals <- factor(c("No", "Yes", "Yes"))

newvals[ match(pg$group, oldvals) ]
#> [1] No  No  Yes Yes Yes
#> Levels: No Yes
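A closely related base R idiom, sketched here as an alternative to the match() call, is a named lookup vector: the names are the old values and the elements are the new ones. The names lookup and treatment are illustrative. Note the as.character() call: indexing with a factor directly would use its integer codes rather than its labels.

```r
# Rebuild the example subset so this block is self-contained
pg <- PlantGrowth[c(1,2,11,21,22), ]

# Names are the old values, elements are the new values
lookup <- c(ctrl = "No", trt1 = "Yes", trt2 = "Yes")

# Index by label (not by the factor's integer codes), then drop the names
treatment <- unname(lookup[as.character(pg$group)])

# Convert to a factor if a factor is wanted
treatment <- factor(treatment)
```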

It can also be done by indexing into the vectors with logical conditions:

pg$treatment[pg$group == "ctrl"] <- "No"
pg$treatment[pg$group == "trt1"] <- "Yes"
pg$treatment[pg$group == "trt2"] <- "Yes"

# Convert to a factor
pg$treatment <- factor(pg$treatment)
pg
#>    weight group treatment
#> 1    4.17  ctrl        No
#> 2    5.58  ctrl        No
#> 11   4.81  trt1       Yes
#> 21   6.31  trt2       Yes
#> 22   5.12  trt2       Yes

Here, we combined two of the factor levels and put the result into a new column. If you simply want to rename the levels of a factor, see Recipe 15.10.
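When the recoding has only two outcomes, as it does here, the three indexed assignments can be condensed with base R's ifelse(). This is a sketch of an alternative under that assumption, not a replacement for the general approach:

```r
# Rebuild the example subset so this block is self-contained
pg <- PlantGrowth[c(1,2,11,21,22), ]

# "ctrl" becomes "No"; everything else becomes "Yes"
pg$treatment <- factor(ifelse(pg$group == "ctrl", "No", "Yes"))
```

Unlike the indexed assignments, which leave unmatched rows as NA, ifelse() sends every value other than "ctrl" to "Yes", including any unexpected ones.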

The coding criteria can also be based on values in multiple columns, by using the & and | operators:

pg$newcol[pg$group == "ctrl" & pg$weight < 5]  <- "no_small"
pg$newcol[pg$group == "ctrl" & pg$weight >= 5] <- "no_large"
pg$newcol[pg$group == "trt1"] <- "yes"
pg$newcol[pg$group == "trt2"] <- "yes"
pg$newcol <- factor(pg$newcol)
pg
#>    weight group   newcol
#> 1    4.17  ctrl no_small
#> 2    5.58  ctrl no_large
#> 11   4.81  trt1      yes
#> 21   6.31  trt2      yes
#> 22   5.12  trt2      yes
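The same multi-column recoding can be sketched with dplyr's case_when(), which evaluates the conditions in order and uses the first one that is true for each row. The final TRUE condition here is a catch-all chosen for this example; it assumes that every remaining row belongs to a treatment group.

```r
library(dplyr)

# Rebuild the example subset so this block is self-contained
pg <- PlantGrowth[c(1,2,11,21,22), ]

# Conditions are checked top to bottom; TRUE catches all remaining rows
pg$newcol <- factor(case_when(
  pg$group == "ctrl" & pg$weight < 5  ~ "no_small",
  pg$group == "ctrl" & pg$weight >= 5 ~ "no_large",
  TRUE                                ~ "yes"
))
```

Without the catch-all line, rows that match none of the conditions would be NA, which is also what happens with the indexed assignments above.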

It’s also possible to combine two columns into one using the interaction() function, which joins the values with a . in between and returns a factor. This combines the weight and group columns into a new column, weightgroup:

pg$weightgroup <- interaction(pg$weight, pg$group)
pg
#>    weight group weightgroup
#> 1    4.17  ctrl   4.17.ctrl
#> 2    5.58  ctrl   5.58.ctrl
#> 11   4.81  trt1   4.81.trt1
#> 21   6.31  trt2   6.31.trt2
#> 22   5.12  trt2   5.12.trt2
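One detail worth knowing about interaction() is that, by default, the resulting factor has a level for every possible combination of the inputs, not just the combinations that occur in the data. The sketch below compares the default with drop = TRUE and also shows the sep argument; the variable names are illustrative.

```r
# Rebuild the example subset so this block is self-contained
pg <- PlantGrowth[c(1,2,11,21,22), ]

# Default: one level per possible combination (5 weights x 3 groups = 15)
all_combos <- interaction(pg$weight, pg$group)
nlevels(all_combos)

# drop = TRUE keeps only the 5 combinations that actually occur
used_combos <- interaction(pg$weight, pg$group, drop = TRUE)
nlevels(used_combos)

# sep changes the separator placed between the values
underscored <- interaction(pg$weight, pg$group, sep = "_", drop = TRUE)
```

Unused levels usually do not matter for display in a data frame, but they can show up as empty categories in tables and plot legends.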

15.13.4 See Also

For more on renaming factor levels, see Recipe 15.10.

See Recipe 15.14 for recoding continuous values to categorical values.