15.2 Getting Information About a Data Structure

15.2.1 Problem

You want to find out information about an object or data structure.

15.2.2 Solution

Use the str() function:

str(ToothGrowth)
#> 'data.frame':    60 obs. of  3 variables:
#>  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
#>  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
#>  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

This tells us that ToothGrowth is a data frame with three columns, len, supp, and dose. len and dose contain numeric values, while supp is a factor with two levels.

Another useful function is the summary() function:

summary(ToothGrowth)
#>       len        supp         dose      
#>  Min.   : 4.20   OJ:30   Min.   :0.500  
#>  1st Qu.:13.07   VC:30   1st Qu.:0.500  
#>  Median :19.25           Median :1.000  
#>  Mean   :18.81           Mean   :1.167  
#>  3rd Qu.:25.27           3rd Qu.:2.000  
#>  Max.   :33.90           Max.   :2.000

Instead of showing you the first few values of each column as str() does, summary() provides basic descriptive statistics (the minimum, maximum, median, mean, and first & third quartile values) for numeric variables, and tells you the number of values corresponding to each character value or factor level if it is a character or factor variable.

15.2.3 Discussion

The str() function is very useful for finding out more about data structures. One common source of problems is a data frame where one of the columns is a character vector instead of a factor, or vice versa. This can cause puzzling issues with analyses or graphs.

When you print out a data frame the normal way, by just typing the name at the prompt and pressing Enter, factor and character columns appear exactly the same. The difference will be revealed only when you run str() on the data frame, or print out the column by itself:

tg <- ToothGrowth
tg$supp <- as.character(tg$supp)
str(tg)
#> 'data.frame':    60 obs. of  3 variables:
#>  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
#>  $ supp: chr  "VC" "VC" "VC" "VC" ...
#>  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# Print out the columns by themselves
# From old data frame (factor)
ToothGrowth$supp
#>  [1] VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC
#> [25] VC VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
#> [49] OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
#> Levels: OJ VC
# From new data frame (character)
tg$supp
#>  [1] "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC"
#> [15] "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC" "VC"
#> [29] "VC" "VC" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ"
#> [43] "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ" "OJ"
#> [57] "OJ" "OJ" "OJ" "OJ"