3.5 Summarising data frames

Now that we’re able to manipulate and extract data from our data frames, our next task is to start exploring and getting to know our data. In this section we’ll start producing tables of useful summary statistics of the variables in our data frame, and in the next two chapters we’ll cover visualising our data with base R graphics and with the ggplot2 package.

A really useful starting point is to produce some simple summary statistics of all of the variables in our flowers data frame using the summary() function.

summary(flowers)
##     treat             nitrogen      block         height           weight      
##  Length:96          low   :32   Min.   :1.0   Min.   : 1.200   Min.   : 5.790  
##  Class :character   medium:32   1st Qu.:1.0   1st Qu.: 4.475   1st Qu.: 9.027  
##  Mode  :character   high  :32   Median :1.5   Median : 6.450   Median :11.395  
##                                 Mean   :1.5   Mean   : 6.840   Mean   :12.155  
##                                 3rd Qu.:2.0   3rd Qu.: 9.025   3rd Qu.:14.537  
##                                 Max.   :2.0   Max.   :17.200   Max.   :23.890  
##     leafarea       shootarea         flowers      
##  Min.   : 5.80   Min.   :  5.80   Min.   : 1.000  
##  1st Qu.:11.07   1st Qu.: 39.05   1st Qu.: 4.000  
##  Median :13.45   Median : 70.05   Median : 6.000  
##  Mean   :14.05   Mean   : 79.78   Mean   : 7.062  
##  3rd Qu.:16.45   3rd Qu.:113.28   3rd Qu.: 9.000  
##  Max.   :49.20   Max.   :189.60   Max.   :17.000

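For numeric variables (e.g. height and weight) the mean, minimum, maximum, median, first (lower) quartile and third (upper) quartile are presented. For factor variables (nitrogen in this data frame) the number of observations in each of the factor levels is given. Character variables such as treat are not tabulated; only their length, class and mode are shown. Notice also that block is stored as a number, so it is summarised as a numeric variable even though it identifies groups. If a variable contains missing data then the number of NA values is also reported.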
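
To see how summary() treats the different variable types and reports missing values, here is a minimal sketch using a small made-up data frame. The toy object and its values are hypothetical and are not part of the flowers data set.

```r
# Hypothetical toy data, invented purely for illustration
toy <- data.frame(
  treat    = c("tip", "notip", "tip", "notip"),        # character variable
  nitrogen = factor(c("low", "low", "high", "high"),
                    levels = c("low", "high")),        # factor variable
  height   = c(4.2, NA, 7.5, 6.1),                     # numeric with one NA
  stringsAsFactors = FALSE
)

# Character: length/class/mode; factor: counts per level; numeric: six-number summary
summary(toy)

# For a numeric vector with missing data an extra "NA's" element is added
height_summary <- summary(toy$height)
names(height_summary)
```

The extra NA's entry only appears for variables that actually contain missing values.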

If we want to summarise a smaller subset of variables in our data frame we can use our indexing skills in combination with the summary() function. For example, to summarise only the height, weight, leafarea and shootarea variables we can include the appropriate column indexes inside the square brackets [ ]. Notice we include all rows by not specifying a row index.

summary(flowers[, 4:7])
##      height           weight          leafarea       shootarea     
##  Min.   : 1.200   Min.   : 5.790   Min.   : 5.80   Min.   :  5.80  
##  1st Qu.: 4.475   1st Qu.: 9.027   1st Qu.:11.07   1st Qu.: 39.05  
##  Median : 6.450   Median :11.395   Median :13.45   Median : 70.05  
##  Mean   : 6.840   Mean   :12.155   Mean   :14.05   Mean   : 79.78  
##  3rd Qu.: 9.025   3rd Qu.:14.537   3rd Qu.:16.45   3rd Qu.:113.28  
##  Max.   :17.200   Max.   :23.890   Max.   :49.20   Max.   :189.60

# or equivalently 
# summary(flowers[, c("height", "weight", "leafarea", "shootarea")])

And to summarise a single variable:

summary(flowers$leafarea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.80   11.07   13.45   14.05   16.45   49.20

# or equivalently
# summary(flowers[, 6])

As you’ve seen above, the summary() function reports the number of observations in each level of our factor variables. Another useful function for generating tables of counts is the table() function, which can be used to build contingency tables of different combinations of factor levels. For example, to count the number of observations for each level of nitrogen:

table(flowers$nitrogen)
## 
##    low medium   high 
##     32     32     32

We can extend this further by producing a table of counts for each combination of nitrogen and treat factor levels.

table(flowers$nitrogen, flowers$treat)
##         
##          notip tip
##   low       16  16
##   medium    16  16
##   high      16  16
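
A table of counts is an ordinary R object, so it can be stored and passed to other functions. The sketch below uses a hypothetical toy data frame (not the flowers data) and the base R functions addmargins() and prop.table() to add totals and convert counts to row proportions.

```r
# Hypothetical toy data, invented purely for illustration
toy <- data.frame(
  nitrogen = factor(c("low", "low", "low", "high", "high", "high"),
                    levels = c("low", "high")),
  treat    = c("tip", "notip", "tip", "notip", "notip", "tip"),
  stringsAsFactors = FALSE
)

# Two-way table of counts: rows are nitrogen levels, columns are treat values
counts <- table(toy$nitrogen, toy$treat)
counts

# Add row and column totals
addmargins(counts)

# Convert counts to proportions within each row (margin = 1 means rows)
row_props <- prop.table(counts, margin = 1)
row_props
```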

A more flexible version of the table() function is the xtabs() function. The xtabs() function uses a formula notation (~) to build contingency tables with the cross-classifying variables separated by a + symbol on the right hand side of the formula. xtabs() also has a useful data = argument so you don’t have to include the data frame name when specifying each variable.

xtabs(~ nitrogen + treat, data = flowers)
##         treat
## nitrogen notip tip
##   low       16  16
##   medium    16  16
##   high      16  16
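
One way in which xtabs() is more flexible than table() is that a variable can be placed on the left hand side of the formula, in which case xtabs() sums that variable within each cell instead of counting rows. A minimal sketch, assuming a hypothetical toy data frame with a made-up n_flowers column:

```r
# Hypothetical toy data, invented purely for illustration
toy <- data.frame(
  nitrogen  = factor(c("low", "low", "high", "high"),
                     levels = c("low", "high")),
  treat     = c("tip", "notip", "tip", "tip"),
  n_flowers = c(3, 5, 2, 4),
  stringsAsFactors = FALSE
)

# No left hand side: count the rows in each nitrogen x treat combination
row_counts <- xtabs(~ nitrogen + treat, data = toy)
row_counts

# With a left hand side: sum n_flowers within each combination
flower_totals <- xtabs(n_flowers ~ nitrogen + treat, data = toy)
flower_totals
```

Combinations that do not occur in the data (high nitrogen with notip in this toy example) are reported as zero.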

We can even build more complicated contingency tables using more variables. Note, in the example below the xtabs() function has quietly coerced our block variable to a factor.

xtabs(~ nitrogen + treat + block, data = flowers)
## , , block = 1
## 
##         treat
## nitrogen notip tip
##   low        8   8
##   medium     8   8
##   high       8   8
## 
## , , block = 2
## 
##         treat
## nitrogen notip tip
##   low        8   8
##   medium     8   8
##   high       8   8

And for a nicer formatted table we can nest the xtabs() function inside the ftable() function to ‘flatten’ the table.

ftable(xtabs(~ nitrogen + treat + block, data = flowers))
##                block 1 2
## nitrogen treat          
## low      notip       8 8
##          tip         8 8
## medium   notip       8 8
##          tip         8 8
## high     notip       8 8
##          tip         8 8
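
The layout of a flattened table can be changed with the row.vars = and col.vars = arguments of ftable(). The sketch below is only an illustration on a hypothetical toy data frame built with expand.grid(), so every combination occurs exactly once.

```r
# Hypothetical toy data: one row for every nitrogen x treat x block combination
toy <- expand.grid(
  nitrogen = c("low", "high"),
  treat    = c("notip", "tip"),
  block    = 1:2
)

counts <- xtabs(~ nitrogen + treat + block, data = toy)

# Default layout: nitrogen and treat in the rows, block in the columns
flat_default <- ftable(counts)
flat_default

# Alternative layout: block in the rows, nitrogen and treat in the columns
flat_by_block <- ftable(counts, row.vars = "block",
                        col.vars = c("nitrogen", "treat"))
flat_by_block
```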

We can also summarise our data for each level of a factor variable. Let’s say we want to calculate the mean value of height for each of our low, medium and high levels of nitrogen. To do this we will use the mean() function and apply this to the height variable for each level of nitrogen using the tapply() function.

tapply(flowers$height, flowers$nitrogen, mean)
##      low   medium     high 
## 5.853125 7.012500 7.653125

The tapply() function is not restricted to calculating mean values. You can use it to apply many of the functions that come with R, or even functions you’ve written yourself (see Chapter 7 for more details). For example, we can apply the sd() function to calculate the standard deviation of height for each level of nitrogen, or even the summary() function.

tapply(flowers$height, flowers$nitrogen, sd)
##      low   medium     high 
## 2.828425 3.005345 3.483323
tapply(flowers$height, flowers$nitrogen, summary)
## $low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.800   3.600   5.550   5.853   8.000  12.300 
## 
## $medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.800   4.500   7.000   7.013   9.950  12.300 
## 
## $high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   5.800   7.450   7.653   9.475  17.200
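
As a sketch of applying a function you have written yourself, the example below defines a hypothetical std_error() helper and passes it to tapply(). Both the helper name and the toy data are invented for illustration.

```r
# Hypothetical toy data, invented purely for illustration
toy <- data.frame(
  nitrogen = factor(rep(c("low", "high"), each = 3),
                    levels = c("low", "high")),
  height   = c(2, 4, 6, 5, 7, 9)
)

# A user-written function: standard error of the mean
std_error <- function(x) sd(x) / sqrt(length(x))

# Apply it to height within each level of nitrogen
se_by_group <- tapply(toy$height, toy$nitrogen, std_error)
se_by_group

# A function can also be written inline (an anonymous function)
range_by_group <- tapply(toy$height, toy$nitrogen,
                         function(x) max(x) - min(x))
range_by_group
```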

Note that if the variable you want to summarise contains missing values (NA) you will also need to include an argument specifying how you want the function to deal with the NA values. We saw an example of this in Chapter 2 where the mean() function returned an NA when we had missing data. To include the na.rm = TRUE argument we simply add this as another argument when using tapply().

tapply(flowers$height, flowers$nitrogen, mean, na.rm = TRUE)
##      low   medium     high 
## 5.853125 7.012500 7.653125
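
The height variable in flowers has no missing values, so the result above is the same with or without na.rm = TRUE. To see the argument make a difference, here is a small sketch using a hypothetical toy data frame that does contain an NA.

```r
# Hypothetical toy data with one missing height, invented for illustration
toy <- data.frame(
  nitrogen = factor(c("low", "low", "high", "high"),
                    levels = c("low", "high")),
  height   = c(3, NA, 6, 8)
)

# Without na.rm the group containing the NA returns NA
means_with_na <- tapply(toy$height, toy$nitrogen, mean)
means_with_na

# Extra arguments after the function name are passed on to mean()
means_na_removed <- tapply(toy$height, toy$nitrogen, mean, na.rm = TRUE)
means_na_removed
```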

We can also use tapply() to apply functions to more than one factor. The only thing to remember is that the factors need to be supplied to the tapply() function in the form of a list using the list() function. To calculate the mean height for each combination of nitrogen and treat factor levels we can use the list(flowers$nitrogen, flowers$treat) notation.

tapply(flowers$height, list(flowers$nitrogen, flowers$treat), mean)
##          notip    tip
## low    3.66875 8.0375
## medium 4.83750 9.1875
## high   5.70625 9.6000

And if you get a little fed up with having to write flowers$ for every variable you can nest the tapply() function inside the with() function. The with() function allows R to evaluate an R expression with respect to a named data object (in this case flowers).

with(flowers, tapply(height, list(nitrogen, treat), mean))
##          notip    tip
## low    3.66875 8.0375
## medium 4.83750 9.1875
## high   5.70625 9.6000

The with() function also works with many other functions and can save you a lot of typing!
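
As a brief sketch of with() used alongside a few other base R functions, again on a hypothetical toy data frame rather than the flowers data:

```r
# Hypothetical toy data, invented purely for illustration
toy <- data.frame(
  height = c(2, 4, 6, 8),
  weight = c(5, 7, 9, 15),
  treat  = c("tip", "tip", "notip", "notip"),
  stringsAsFactors = FALSE
)

# Column names are looked up inside toy, so no toy$ prefix is needed
mean_height <- with(toy, mean(height))
mean_height

# Correlation between two columns
height_weight_cor <- with(toy, cor(height, weight))
height_weight_cor

# Table of counts for a single column
treat_counts <- with(toy, table(treat))
treat_counts
```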

Another really useful function for summarising data is the aggregate() function. The aggregate() function works in a very similar way to tapply() but is a bit more flexible.

For example, to calculate the mean of the variables height, weight, leafarea and shootarea for each level of nitrogen:

aggregate(flowers[, 4:7], by = list(nitrogen = flowers$nitrogen), FUN = mean)
##   nitrogen   height    weight leafarea shootarea
## 1      low 5.853125  8.652812 11.14375   45.1000
## 2   medium 7.012500 11.164062 13.83125   67.5625
## 3     high 7.653125 16.646875 17.18125  126.6875

In the code above we have indexed the columns we want to summarise in the flowers data frame using flowers[, 4:7]. The by = argument specifies a list of factors (list(nitrogen = flowers$nitrogen)) and the FUN = argument names the function to apply (mean in this example).
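
One practical difference from tapply() is that aggregate() returns a data frame, so the result can be indexed like any other data frame. A minimal sketch on a hypothetical toy data frame:

```r
# Hypothetical toy data, invented purely for illustration
toy <- data.frame(
  nitrogen = factor(rep(c("low", "high"), each = 2),
                    levels = c("low", "high")),
  height   = c(2, 4, 6, 8),
  weight   = c(5, 7, 9, 15)
)

# Mean of two columns for each level of nitrogen
means <- aggregate(toy[, c("height", "weight")],
                   by = list(nitrogen = toy$nitrogen), FUN = mean)
means

# The result is a data frame ...
class(means)

# ... so the usual indexing tools work on it
means[means$height > 4, ]
```

Naming the element of the by = list (nitrogen = ...) is what gives the grouping column a readable name in the output.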

Similar to the tapply() function we can include more than one factor to apply a function to. Here we calculate the mean values for each combination of nitrogen and treat.

aggregate(flowers[, 4:7], by = list(nitrogen = flowers$nitrogen,
                                 treat = flowers$treat), FUN = mean)
##   nitrogen treat  height    weight leafarea shootarea
## 1      low notip 3.66875  8.289375 12.32500  59.89375
## 2   medium notip 4.83750 11.316875 14.17500  94.53125
## 3     high notip 5.70625 16.604375 18.81875 155.31875
## 4      low   tip 8.03750  9.016250  9.96250  30.30625
## 5   medium   tip 9.18750 11.011250 13.48750  40.59375
## 6     high   tip 9.60000 16.689375 15.54375  98.05625

We can also use the aggregate() function in a different way by using the formula method (as we did with xtabs()). On the left hand side of the formula (~) we specify the variable we want to apply the mean function to, and on the right hand side our factors separated by a + symbol. The formula method also allows you to use the data = argument for convenience.

aggregate(height ~ nitrogen + treat, FUN = mean, data = flowers)
##   nitrogen treat  height
## 1      low notip 3.66875
## 2   medium notip 4.83750
## 3     high notip 5.70625
## 4      low   tip 8.03750
## 5   medium   tip 9.18750
## 6     high   tip 9.60000
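
The formula method can also summarise several variables at once if they are combined with cbind() on the left hand side. The following is a sketch on a hypothetical toy data frame, not the flowers data.

```r
# Hypothetical toy data, invented purely for illustration
toy <- data.frame(
  nitrogen = factor(rep(c("low", "high"), each = 2),
                    levels = c("low", "high")),
  height   = c(2, 4, 6, 8),
  weight   = c(5, 7, 9, 15)
)

# cbind() on the left hand side requests a summary of both variables
both_means <- aggregate(cbind(height, weight) ~ nitrogen,
                        FUN = mean, data = toy)
both_means
```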

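One advantage of using the formula method is that we can also use the subset = argument to apply the function to subsets of the original data. For example, we can calculate the mean height for each combination of the nitrogen and treat levels, but only for those plants that have fewer than 7 flowers. Note that the subset = condition is evaluated inside the data frame, so flowers in the condition refers to the flowers column, not to the data frame of the same name.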

aggregate(height ~ nitrogen + treat, FUN = mean, subset = flowers < 7, data = flowers)
##   nitrogen treat   height
## 1      low notip 3.533333
## 2   medium notip 5.316667
## 3     high notip 3.850000
## 4      low   tip 8.176923
## 5   medium   tip 8.570000
## 6     high   tip 7.900000

Or for only those plants in block 1.

aggregate(height ~ nitrogen + treat, FUN = mean, subset = block == "1", data = flowers)
##   nitrogen treat  height
## 1      low notip  3.3250
## 2   medium notip  5.2375
## 3     high notip  5.9250
## 4      low   tip  8.7500
## 5   medium   tip  9.5375
## 6     high   tip 10.0375
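
Conditions in subset = can be combined with the usual logical operators. The sketch below, on a hypothetical toy data frame with made-up block and n_flowers columns, also shows that indexing the data frame first should give the same answer.

```r
# Hypothetical toy data, invented purely for illustration
toy <- data.frame(
  nitrogen  = factor(c("low", "low", "high", "high", "low", "high"),
                     levels = c("low", "high")),
  block     = c(1, 1, 1, 1, 2, 2),
  height    = c(2, 4, 6, 8, 10, 12),
  n_flowers = c(3, 9, 5, 8, 2, 4)
)

# Keep only plants in block 1 that have fewer than 7 flowers
via_subset <- aggregate(height ~ nitrogen, FUN = mean,
                        subset = block == 1 & n_flowers < 7, data = toy)
via_subset

# The same rows selected by indexing the data frame beforehand
keep <- toy$block == 1 & toy$n_flowers < 7
via_index <- aggregate(height ~ nitrogen, FUN = mean, data = toy[keep, ])
via_index
```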