3.5 Summarising data frames
Now that we’re able to manipulate and extract data from our data frames, our next task is to start exploring and getting to know our data. In this section we’ll start producing tables of useful summary statistics of the variables in our data frame, and in the next two chapters we’ll cover visualising our data with base R graphics and with the ggplot2 package.
A really useful starting point is to produce some simple summary statistics of all of the variables in our flowers data frame using the summary() function.
summary(flowers)
## treat nitrogen block height weight
## Length:96 low :32 Min. :1.0 Min. : 1.200 Min. : 5.790
## Class :character medium:32 1st Qu.:1.0 1st Qu.: 4.475 1st Qu.: 9.027
## Mode :character high :32 Median :1.5 Median : 6.450 Median :11.395
## Mean :1.5 Mean : 6.840 Mean :12.155
## 3rd Qu.:2.0 3rd Qu.: 9.025 3rd Qu.:14.537
## Max. :2.0 Max. :17.200 Max. :23.890
## leafarea shootarea flowers
## Min. : 5.80 Min. : 5.80 Min. : 1.000
## 1st Qu.:11.07 1st Qu.: 39.05 1st Qu.: 4.000
## Median :13.45 Median : 70.05 Median : 6.000
## Mean :14.05 Mean : 79.78 Mean : 7.062
## 3rd Qu.:16.45 3rd Qu.:113.28 3rd Qu.: 9.000
## Max. :49.20 Max. :189.60 Max. :17.000
For numeric variables (e.g. height and weight) the mean, minimum, maximum, median, first (lower) quartile and third (upper) quartile are presented. For factor variables (here nitrogen) the number of observations in each of the factor levels is given, whereas character variables (here treat) are reported only by their length, class and mode. If a variable contains missing data then the number of NA values is also reported.
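The three cases described above can be sketched on a small data frame. This is only an illustration: the toy data frame and its values are invented here and are not part of the flowers data set.

```r
# Toy stand-in for the flowers data (hypothetical values, for illustration only)
toy <- data.frame(
  treat    = c("tip", "tip", "notip", "notip"),        # character column
  nitrogen = factor(c("low", "high", "low", "high"),
                    levels = c("low", "high")),        # factor column
  height   = c(7.5, NA, 3.1, 5.6),                     # numeric column with one NA
  stringsAsFactors = FALSE
)

# One call summarises every column according to its type
summary(toy)

# For a single numeric column the NA count is stored under the name "NA's"
height_summary <- summary(toy$height)
height_summary["NA's"]
```

In the printed summary the treat column should show only length, class and mode, the nitrogen column should show a count per level, and the height column should gain an extra NA's entry.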
If we want to summarise a smaller subset of variables in our data frame we can use our indexing skills in combination with the summary() function. For example, to summarise only the height, weight, leafarea and shootarea variables we can include the appropriate column indexes inside the square brackets [ ]. Notice we include all rows by not specifying a row index.
summary(flowers[, 4:7])
## height weight leafarea shootarea
## Min. : 1.200 Min. : 5.790 Min. : 5.80 Min. : 5.80
## 1st Qu.: 4.475 1st Qu.: 9.027 1st Qu.:11.07 1st Qu.: 39.05
## Median : 6.450 Median :11.395 Median :13.45 Median : 70.05
## Mean : 6.840 Mean :12.155 Mean :14.05 Mean : 79.78
## 3rd Qu.: 9.025 3rd Qu.:14.537 3rd Qu.:16.45 3rd Qu.:113.28
## Max. :17.200 Max. :23.890 Max. :49.20 Max. :189.60
# or equivalently
# summary(flowers[, c("height", "weight", "leafarea", "shootarea")])
And to summarise a single variable:
summary(flowers$leafarea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.80 11.07 13.45 14.05 16.45 49.20
# or equivalently
# summary(flowers[, 6])
As you’ve seen above, the summary() function reports the number of observations in each level of our factor variables. Another useful function for generating tables of counts is the table() function, which can be used to build contingency tables of different combinations of factor levels. For example, table(flowers$nitrogen) counts the number of observations for each level of nitrogen.
We can extend this further by passing a second variable, as in table(flowers$nitrogen, flowers$treat), to produce a table of counts for each combination of nitrogen and treat levels.
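A minimal sketch of the one-way and two-way table() calls, assuming a small toy data frame whose values are invented for illustration (it is not the flowers data):

```r
# Toy stand-in for the flowers data (hypothetical values)
toy <- data.frame(
  nitrogen = factor(c("low", "low", "high", "high", "high"),
                    levels = c("low", "high")),
  treat    = c("tip", "notip", "tip", "tip", "notip"),
  stringsAsFactors = FALSE
)

# One-way table: number of rows in each level of nitrogen
counts_one <- table(toy$nitrogen)
counts_one

# Two-way table: number of rows in each nitrogen/treat combination
counts_two <- table(toy$nitrogen, toy$treat)
counts_two
```

Individual cells can be extracted by name, for example counts_two["high", "tip"].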
A more flexible version of the table() function is the xtabs() function. The xtabs() function uses a formula notation (~) to build contingency tables with the cross-classifying variables separated by a + symbol on the right hand side of the formula. xtabs() also has a useful data = argument so you don’t have to include the data frame name when specifying each variable.
xtabs(~ nitrogen + treat, data = flowers)
## treat
## nitrogen notip tip
## low 16 16
## medium 16 16
## high 16 16
We can even build more complicated contingency tables using more variables. Note that in the example below the xtabs() function has quietly coerced our block variable to a factor.
xtabs(~ nitrogen + treat + block, data = flowers)
## , , block = 1
##
## treat
## nitrogen notip tip
## low 8 8
## medium 8 8
## high 8 8
##
## , , block = 2
##
## treat
## nitrogen notip tip
## low 8 8
## medium 8 8
## high 8 8
And for a more nicely formatted table we can nest the xtabs() function inside the ftable() function to ‘flatten’ the table.
ftable(xtabs(~ nitrogen + treat + block, data = flowers))
## block 1 2
## nitrogen treat
## low notip 8 8
## tip 8 8
## medium notip 8 8
## tip 8 8
## high notip 8 8
## tip 8 8
We can also summarise our data for each level of a factor variable. Let’s say we want to calculate the mean value of height for each of our low, medium and high levels of nitrogen. To do this we apply the mean() function to the height variable for each level of nitrogen using the tapply() function: tapply(flowers$height, flowers$nitrogen, mean).
The tapply() function is not restricted to calculating mean values; you can use it to apply many of the functions that come with R or even functions you’ve written yourself (see Chapter 7 for more details). For example, we can apply the sd() function to calculate the standard deviation of height for each level of nitrogen, or even the summary() function.
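As a rough sketch of how tapply() splits a variable by a factor and applies a function to each piece, consider the toy example below. The data frame and its values are assumptions made up for illustration, chosen so the group means are easy to check by hand.

```r
# Toy stand-in for the flowers data (hypothetical values)
toy <- data.frame(
  nitrogen = factor(c("low", "low", "high", "high"),
                    levels = c("low", "high")),
  height   = c(2, 4, 6, 10)
)

# Arguments are: the values, the grouping factor, the function to apply
group_means <- tapply(toy$height, toy$nitrogen, mean)
group_means   # low = (2 + 4) / 2 = 3, high = (6 + 10) / 2 = 8
```

The result is a named vector with one element per factor level, in the order of the levels.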
tapply(flowers$height, flowers$nitrogen, sd)
## low medium high
## 2.828425 3.005345 3.483323
tapply(flowers$height, flowers$nitrogen, summary)
## $low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.800 3.600 5.550 5.853 8.000 12.300
##
## $medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.800 4.500 7.000 7.013 9.950 12.300
##
## $high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 5.800 7.450 7.653 9.475 17.200
Note, if the variable you want to summarise contains missing values (NA) you will also need to include an argument specifying how you want the function to deal with the NA values. We saw an example of this in Chapter 2 where the mean() function returned an NA when we had missing data. To include the na.rm = TRUE argument we simply add this as another argument when using tapply().
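The effect of na.rm = TRUE is easiest to see on data that actually contain a missing value. The sketch below uses two small vectors invented for illustration (they are not taken from the flowers data):

```r
# Hypothetical heights with one missing value in the "low" group
height   <- c(2, NA, 6, 10)
nitrogen <- factor(c("low", "low", "high", "high"),
                   levels = c("low", "high"))

# Without na.rm the group containing the NA returns NA
with_na <- tapply(height, nitrogen, mean)

# Extra arguments after the function name are passed on to mean()
without_na <- tapply(height, nitrogen, mean, na.rm = TRUE)

with_na      # low is NA, high is 8
without_na   # low is 2, high is 8
```

Applied to the flowers data frame, the same extra argument looks like this: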
tapply(flowers$height, flowers$nitrogen, mean, na.rm = TRUE)
## low medium high
## 5.853125 7.012500 7.653125
We can also use tapply() to apply functions to more than one factor. The only thing to remember is that the factors need to be supplied to the tapply() function in the form of a list using the list() function. To calculate the mean height for each combination of nitrogen and treat factor levels we can use the list(flowers$nitrogen, flowers$treat) notation.
tapply(flowers$height, list(flowers$nitrogen, flowers$treat), mean)
## notip tip
## low 3.66875 8.0375
## medium 4.83750 9.1875
## high 5.70625 9.6000
And if you get a little fed up with having to write flowers$ for every variable you can nest the tapply() function inside the with() function. The with() function allows R to evaluate an R expression with respect to a named data object (in this case flowers).
with(flowers, tapply(height, list(nitrogen, treat), mean))
## notip tip
## low 3.66875 8.0375
## medium 4.83750 9.1875
## high 5.70625 9.6000
The with() function also works with many other functions and can save you a lot of typing!
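To illustrate with() wrapping functions other than tapply(), here is a small sketch. The toy data frame and its values are assumptions invented for this example.

```r
# Toy stand-in for the flowers data (hypothetical values)
toy <- data.frame(
  nitrogen = factor(c("low", "low", "high", "high"),
                    levels = c("low", "high")),
  treat    = c("tip", "notip", "tip", "notip"),
  height   = c(2, 4, 6, 10),
  weight   = c(5, 7, 9, 13),
  stringsAsFactors = FALSE
)

# Column names are looked up inside toy, so no toy$ prefix is needed
counts            <- with(toy, table(nitrogen, treat))
overall_mean      <- with(toy, mean(height))        # same as mean(toy$height)
height_weight_cor <- with(toy, cor(height, weight))

counts
overall_mean
height_weight_cor
```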
Another really useful function for summarising data is the aggregate() function. The aggregate() function works in a very similar way to tapply() but is a bit more flexible.
For example, we can calculate the mean of the variables height, weight, leafarea and shootarea for each level of nitrogen.
aggregate(flowers[, 4:7], by = list(nitrogen = flowers$nitrogen), FUN = mean)
## nitrogen height weight leafarea shootarea
## 1 low 5.853125 8.652812 11.14375 45.1000
## 2 medium 7.012500 11.164062 13.83125 67.5625
## 3 high 7.653125 16.646875 17.18125 126.6875
In the code above we have indexed the columns we want to summarise in the flowers data frame using flowers[, 4:7]. The by = argument specifies a list of factors (list(nitrogen = flowers$nitrogen)) and the FUN = argument names the function to apply (mean in this example).
As with the tapply() function, we can group by more than one factor. Here we calculate the mean values for each combination of nitrogen and treat.
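One detail worth sketching is why the by = list is given a name. In the toy example below (data invented for illustration, not the flowers data), an unnamed list element leads aggregate() to label the grouping column Group.1 instead of nitrogen.

```r
# Toy stand-in for the flowers data (hypothetical values)
toy <- data.frame(
  nitrogen = factor(c("low", "low", "high", "high"),
                    levels = c("low", "high")),
  height   = c(2, 4, 6, 10),
  weight   = c(5, 7, 9, 13)
)

# Named list element: the grouping column is called "nitrogen"
named <- aggregate(toy[, c("height", "weight")],
                   by = list(nitrogen = toy$nitrogen), FUN = mean)

# Unnamed list element: the grouping column falls back to "Group.1"
unnamed <- aggregate(toy[, c("height", "weight")],
                     by = list(toy$nitrogen), FUN = mean)

names(named)
names(unnamed)
```

Unlike tapply(), the result is a data frame, so it can be indexed and summarised like any other data frame.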
aggregate(flowers[, 4:7], by = list(nitrogen = flowers$nitrogen,
treat = flowers$treat), FUN = mean)
## nitrogen treat height weight leafarea shootarea
## 1 low notip 3.66875 8.289375 12.32500 59.89375
## 2 medium notip 4.83750 11.316875 14.17500 94.53125
## 3 high notip 5.70625 16.604375 18.81875 155.31875
## 4 low tip 8.03750 9.016250 9.96250 30.30625
## 5 medium tip 9.18750 11.011250 13.48750 40.59375
## 6 high tip 9.60000 16.689375 15.54375 98.05625
We can also use the aggregate() function in a different way by using the formula method (as we did with xtabs()). On the left hand side of the formula (~) we specify the variable we want to apply the mean function to, and on the right hand side our factors separated by a + symbol. The formula method also allows you to use the data = argument for convenience.
aggregate(height ~ nitrogen + treat, FUN = mean, data = flowers)
## nitrogen treat height
## 1 low notip 3.66875
## 2 medium notip 4.83750
## 3 high notip 5.70625
## 4 low tip 8.03750
## 5 medium tip 9.18750
## 6 high tip 9.60000
One advantage of using the formula method is that we can also use the subset = argument to apply the function to subsets of the original data. For example, we can calculate the mean height for each combination of the nitrogen and treat levels but only for those plants that have fewer than 7 flowers. Note that in the subset = expression below, flowers refers to the flowers column inside the data frame, not to the data frame itself.
aggregate(height ~ nitrogen + treat, FUN = mean, subset = flowers < 7, data = flowers)
## nitrogen treat height
## 1 low notip 3.533333
## 2 medium notip 5.316667
## 3 high notip 3.850000
## 4 low tip 8.176923
## 5 medium tip 8.570000
## 6 high tip 7.900000
Or for only those plants in block 1, by setting subset = block == 1 in the same call.
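A minimal sketch of the block 1 case, assuming a toy data frame with the same column names as flowers but with values invented for illustration:

```r
# Toy stand-in for the flowers data (hypothetical values)
toy <- data.frame(
  nitrogen = factor(rep(c("low", "high"), each = 4),
                    levels = c("low", "high")),
  treat    = rep(c("tip", "notip"), times = 4),
  block    = rep(c(1, 1, 2, 2), times = 2),
  height   = c(8, 4, 9, 3, 10, 6, 11, 5),
  stringsAsFactors = FALSE
)

# subset = is evaluated inside toy, so block refers to the block column
block1 <- aggregate(height ~ nitrogen + treat, FUN = mean,
                    subset = block == 1, data = toy)
block1
```

Only the four rows with block equal to 1 contribute, so each nitrogen and treat combination here is the mean of a single plant.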