13.11 Creating a Dendrogram

13.11.1 Problem

You want to make a dendrogram to show how items are clustered.

13.11.2 Solution

Use hclust() and plot the output from it. This can require a fair bit of data preprocessing. For this example, we’ll first take a subset of the countries data set from the year 2009. For simplicity, we’ll also drop all rows that contain an NA, and then select 25 of the remaining rows at random:
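The subsetting steps can be sketched in base R as follows. This is a minimal sketch, not the recipe's own code: countries_demo is a made-up stand-in for the countries data filled with random numbers, and the laborrate and healthexp column names are assumed placeholders for the remaining clustering variables.

```r
# Made-up stand-in for the countries data; all values are random, not real
set.seed(392)
n <- 60
countries_demo <- data.frame(
  Name = paste0("Country", rep(1:n, times = 2)),
  Code = paste0("C", rep(1:n, times = 2)),
  Year = rep(c(2008, 2009), each = n),
  GDP = runif(2 * n, 500, 60000),
  laborrate = runif(2 * n, 40, 85),       # assumed column name
  healthexp = runif(2 * n, 20, 8000),     # assumed column name
  infmortality = runif(2 * n, 2, 100)
)
# Knock out a few values so the NA step has something to drop
countries_demo$GDP[c(65, 70, 90)] <- NA

c2 <- subset(countries_demo, Year == 2009)  # keep only rows from 2009
c2 <- c2[complete.cases(c2), ]              # drop rows containing any NA
c2 <- c2[sample(nrow(c2), 25), ]            # pick 25 of the remaining rows at random

head(c2)
```

Because the rows are picked at random, the row names printed by head() are the original row positions in shuffled order.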

Notice that the row names (the first column) are essentially random numbers, since the rows were selected randomly. We need to do a few more things to the data before making a dendrogram from it. First, we need to set the row names. Right now there’s a column called Name, but the row names are those random numbers (we don’t often use row names, but for the hclust() function they’re essential). Next, we’ll need to drop all the columns that aren’t values used for clustering. These columns are Name, Code, and Year:
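Both steps can be sketched as follows. The three-row data frame is a made-up stand-in for the sampled rows (the country names and values are invented, and laborrate and healthexp are assumed column names); only the last two assignments are the steps described above.

```r
# Tiny made-up stand-in for the sampled rows; values are illustrative only
c2 <- data.frame(
  Name = c("Aland", "Borduria", "Carpania"),
  Code = c("ALD", "BRD", "CRP"),
  Year = 2009,
  GDP = c(1200, 45000, 9800),
  laborrate = c(61.5, 70.2, 55.0),        # assumed column name
  healthexp = c(80, 4100, 600),           # assumed column name
  infmortality = c(55.1, 3.9, 21.4),
  row.names = c("117", "42", "86")
)

# Use the names as row names so they become the dendrogram labels
rownames(c2) <- c2$Name

# Keep only the columns used for clustering
c2 <- c2[, !(names(c2) %in% c("Name", "Code", "Year"))]

c2
```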

The values for GDP are several orders of magnitude larger than the values for, say, infmortality. Because of this, the effect of infmortality on the clustering will be negligible compared to the effect of GDP. This probably isn’t what we want. To address this issue, we’ll scale the data:
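A minimal sketch of the scaling step, using a made-up two-column data frame in place of the prepared data:

```r
# Made-up values on very different scales
c2 <- data.frame(
  GDP = c(1200, 45000, 9800, 30000),
  infmortality = c(55.1, 3.9, 21.4, 8.0),
  row.names = c("A", "B", "C", "D")
)

c3 <- scale(c2)          # returns a numeric matrix; row names are kept

round(colMeans(c3), 10)  # column means after scaling
apply(c3, 2, sd)         # column standard deviations after scaling
```

After scaling, both columns are on a comparable footing, so neither dominates the distance calculation simply because of its units.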

By default, the scale() function centers each column on its mean and scales it relative to its standard deviation, but other methods may be used.

Finally, we’re ready to make the dendrogram, as shown in Figure 13.19:
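The plotting calls might look like the following sketch. The matrix c3 here is a made-up, already-scaled stand-in for the prepared data. Passing hang = -1 to plot() is what lines the labels up along the bottom of the plot.

```r
set.seed(1)
# Made-up, already-scaled stand-in for the prepared data: 10 items, 4 variables
c3 <- matrix(rnorm(40), nrow = 10,
             dimnames = list(paste0("Item", 1:10), c("v1", "v2", "v3", "v4")))

hc <- hclust(dist(c3))  # compute the distance matrix, then cluster it

plot(hc)                # dendrogram with labels hanging just below each leaf
plot(hc, hang = -1)     # dendrogram with labels aligned at the bottom
```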

Figure 13.19: A dendrogram (left); with text aligned (right)

13.11.3 Discussion

A cluster analysis is simply a way of assigning points to groups in an n-dimensional space (four dimensions, in this example). A hierarchical cluster analysis divides each group into two smaller groups, and can be represented with the dendrograms in this recipe. There are many different parameters you can control in the hierarchical cluster analysis process, and there may not be a single “right” way to do it for your data.

First, we normalized the data using scale() with its default settings. You can scale your data differently, or not at all. (With this data set, not scaling the data will lead to GDP overwhelming the other variables, as shown in Figure 13.20.)

Figure 13.20: Dendrogram with unscaled data. Notice the much larger Height values, which are largely due to the unscaled GDP values
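To see the effect of skipping the scaling step, we can compare the two approaches on a small example. This is a sketch on made-up random data, where GDP is drawn from a range several orders of magnitude wider than infmortality:

```r
set.seed(2)
# Made-up data: GDP values are far larger than infmortality values
c2 <- data.frame(
  GDP = runif(12, 500, 60000),
  infmortality = runif(12, 2, 100),
  row.names = paste0("Item", 1:12)
)

hc_unscaled <- hclust(dist(c2))         # cluster the raw values
hc_scaled   <- hclust(dist(scale(c2)))  # cluster the scaled values

max(hc_unscaled$height)  # heights are in GDP-sized units
max(hc_scaled$height)    # heights are in standard-deviation units

# Distances computed from GDP alone track the unscaled distances almost exactly
cor(as.vector(dist(c2)), as.vector(dist(c2["GDP"])))

plot(hc_unscaled, hang = -1)
```

The near-perfect correlation shows that, without scaling, infmortality contributes almost nothing to the distances that drive the clustering.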

For the distance calculation, we used the default method, “euclidean”, which calculates the Euclidean distance between the points. The other possible methods are “maximum”, “manhattan”, “canberra”, “binary”, and “minkowski”.
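The distance method is chosen with the method argument of dist(). Here is a small sketch with two made-up points in four dimensions, so the results are easy to check by hand:

```r
# Two made-up points in four dimensions
pts <- rbind(a = c(0, 0, 0, 0),
             b = c(1, 2, 3, 4))

dist(pts)                               # Euclidean (default): sqrt(1 + 4 + 9 + 16)
dist(pts, method = "manhattan")         # sum of absolute differences: 1 + 2 + 3 + 4
dist(pts, method = "maximum")           # largest absolute difference: 4
dist(pts, method = "minkowski", p = 3)  # cube root of 1 + 8 + 27 + 64
```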

The hclust() function provides several methods for performing the cluster analysis. The default is “complete”; the other possible methods are “ward.D”, “ward.D2”, “single”, “average”, “mcquitty”, “median”, and “centroid”. (The older method name “ward” has been renamed to “ward.D”.)
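The clustering method is chosen with the method argument of hclust(). The sketch below uses made-up random data and compares two methods by cutting each tree into three groups with cutree(); the choice of three groups is arbitrary and only for illustration.

```r
set.seed(3)
# Made-up, already-scaled stand-in data: 10 items, 4 variables
m <- matrix(rnorm(40), nrow = 10,
            dimnames = list(paste0("Item", 1:10), NULL))
d <- dist(m)

hc_complete <- hclust(d)                     # default: method = "complete"
hc_average  <- hclust(d, method = "average")
hc_single   <- hclust(d, method = "single")

# Cut two of the trees into three groups and cross-tabulate the assignments
table(complete = cutree(hc_complete, k = 3),
      single   = cutree(hc_single, k = 3))
```

If the two methods agreed perfectly, each row of the table would have a single nonzero cell; any spread across cells shows items that the methods group differently.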

13.11.4 See Also

See ?hclust for more information about the different clustering methods.