3.5 Summarising data frames
Now that we’re able to manipulate and extract data from our data frames, our next task is to start exploring and getting to know our data. In this section we’ll start producing tables of useful summary statistics of the variables in our data frame, and in the next two chapters we’ll cover visualising our data with base R graphics and with the ggplot2
package.
A really useful starting point is to produce some simple summary statistics of all of the variables in our flowers
data frame using the summary()
function.
summary(flowers)
##     treat             nitrogen      block         height           weight
##  Length:96          low   :32   Min.   :1.0   Min.   : 1.200   Min.   : 5.790
##  Class :character   medium:32   1st Qu.:1.0   1st Qu.: 4.475   1st Qu.: 9.027
##  Mode  :character   high  :32   Median :1.5   Median : 6.450   Median :11.395
##                                 Mean   :1.5   Mean   : 6.840   Mean   :12.155
##                                 3rd Qu.:2.0   3rd Qu.: 9.025   3rd Qu.:14.537
##                                 Max.   :2.0   Max.   :17.200   Max.   :23.890
##     leafarea       shootarea         flowers
##  Min.   : 5.80   Min.   :  5.80   Min.   : 1.000
##  1st Qu.:11.07   1st Qu.: 39.05   1st Qu.: 4.000
##  Median :13.45   Median : 70.05   Median : 6.000
##  Mean   :14.05   Mean   : 79.78   Mean   : 7.062
##  3rd Qu.:16.45   3rd Qu.:113.28   3rd Qu.: 9.000
##  Max.   :49.20   Max.   :189.60   Max.   :17.000
For numeric variables (e.g. height, weight etc.) the minimum, first (lower) quartile, median, mean, third (upper) quartile and maximum are presented. For factor variables (here nitrogen) the number of observations in each of the factor levels is given, whereas for character variables (here treat) only the length, class and mode are reported. If a variable contains missing data then the number of NA values is also reported.
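As a minimal sketch of how summary() treats different kinds of variable, the code below builds a small made-up data frame (toy is hypothetical and is not the flowers data set) containing a character variable, a factor and a numeric variable with one missing value.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  treat    = c("tip", "tip", "notip", "notip"),        # character variable
  nitrogen = factor(c("low", "high", "low", "high"),
                    levels = c("low", "high")),        # factor variable
  height   = c(7.5, NA, 3.2, 5.1),                     # numeric variable with one NA
  stringsAsFactors = FALSE
)

# Summarise every variable in the data frame
summary(toy)

# Summarising the factor alone returns the count in each level
nitrogen_summary <- summary(toy$nitrogen)
nitrogen_summary

# Summarising the numeric variable alone adds an "NA's" entry
height_summary <- summary(toy$height)
height_summary
```

The character column is only described by its length, class and mode, the factor is tabulated by level, and the numeric column gains an extra NA's entry because one value is missing.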
If we wanted to summarise a smaller subset of variables in our data frame we can use our indexing skills in combination with the summary()
function. For example, to summarise only the height
, weight
, leafarea
and shootarea
variables we can include the appropriate column indexes inside the square brackets [ ]. Notice that we include all rows by not specifying a row index.
summary(flowers[, 4:7])
##      height           weight          leafarea       shootarea
##  Min.   : 1.200   Min.   : 5.790   Min.   : 5.80   Min.   :  5.80
##  1st Qu.: 4.475   1st Qu.: 9.027   1st Qu.:11.07   1st Qu.: 39.05
##  Median : 6.450   Median :11.395   Median :13.45   Median : 70.05
##  Mean   : 6.840   Mean   :12.155   Mean   :14.05   Mean   : 79.78
##  3rd Qu.: 9.025   3rd Qu.:14.537   3rd Qu.:16.45   3rd Qu.:113.28
##  Max.   :17.200   Max.   :23.890   Max.   :49.20   Max.   :189.60
# or equivalently
# summary(flowers[, c("height", "weight", "leafarea", "shootarea")])
And to summarise a single variable:
summary(flowers$leafarea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.80 11.07 13.45 14.05 16.45 49.20
# or equivalently
# summary(flowers[, 6])
As you’ve seen above, the summary()
function reports the number of observations in each level of our factor variables. Another useful function for generating tables of counts is the table()
function. The table()
function can be used to build contingency tables of different combinations of factor levels. For example, table(flowers$nitrogen) counts the number of observations for each level of nitrogen. We can extend this further by producing a table of counts for each combination of nitrogen and treat factor levels with table(flowers$nitrogen, flowers$treat).
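The two uses of table() described above can be sketched as follows. The toy data frame is made up for illustration (it is not the flowers data set), so the counts differ from those you would get with the real data.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  nitrogen = factor(rep(c("low", "medium", "high"), each = 4),
                    levels = c("low", "medium", "high")),
  treat    = rep(c("tip", "notip"), times = 6),
  stringsAsFactors = FALSE
)

# Number of observations for each level of a single factor
nitrogen_counts <- table(toy$nitrogen)
nitrogen_counts

# Number of observations for each combination of two variables
# (rows are nitrogen levels, columns are treat levels)
combo_counts <- table(toy$nitrogen, toy$treat)
combo_counts
```

With the real data the analogous calls would be table(flowers$nitrogen) and table(flowers$nitrogen, flowers$treat).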
A more flexible version of the table()
function is the xtabs()
function. The xtabs()
function uses a formula notation (~
) to build contingency tables with the cross-classifying variables separated by a +
symbol on the right hand side of the formula. xtabs()
also has a useful data =
argument so you don’t have to include the data frame name when specifying each variable.
xtabs(~ nitrogen + treat, data = flowers)
##           treat
## nitrogen notip tip
##   low       16  16
##   medium    16  16
##   high      16  16
We can even build more complicated contingency tables using more variables. Note, in the example below the xtabs()
function has quietly coerced our block
variable to a factor.
xtabs(~ nitrogen + treat + block, data = flowers)
## , , block = 1
##
##           treat
## nitrogen notip tip
##   low        8   8
##   medium     8   8
##   high       8   8
##
## , , block = 2
##
##           treat
## nitrogen notip tip
##   low        8   8
##   medium     8   8
##   high       8   8
And for a nicer formatted table we can nest the xtabs()
function inside the ftable()
function to ‘flatten’ the table.
ftable(xtabs(~ nitrogen + treat + block, data = flowers))
##                block 1 2
## nitrogen treat
## low      notip       8 8
##          tip         8 8
## medium   notip       8 8
##          tip         8 8
## high     notip       8 8
##          tip         8 8
We can also summarise our data for each level of a factor variable. Let’s say we want to calculate the mean value of height
for each of our low
, medium
and high
levels of nitrogen
. To do this we will use the mean()
function and apply this to the height
variable for each level of nitrogen
using the tapply()
function.
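A minimal sketch of this use of tapply() is shown below. The toy data frame and its values are made up for illustration and are not the flowers data set.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  nitrogen = factor(rep(c("low", "medium", "high"), each = 2),
                    levels = c("low", "medium", "high")),
  height   = c(4, 6, 7, 9, 8, 12)
)

# tapply(values, grouping factor, function):
# split height by nitrogen level and apply mean() to each group
mean_height <- tapply(toy$height, toy$nitrogen, mean)
mean_height
```

The first argument is the variable to summarise, the second is the grouping factor and the third is the function to apply. With the real data the analogous call would be tapply(flowers$height, flowers$nitrogen, mean).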
The tapply()
function is not restricted to calculating mean values; you can use it to apply many of the functions that come with R or even functions you’ve written yourself (see Chapter 7 for more details). For example, we can apply the sd()
function to calculate the standard deviation for each level of nitrogen
or even the summary()
function.
tapply(flowers$height, flowers$nitrogen, sd)
## low medium high
## 2.828425 3.005345 3.483323
tapply(flowers$height, flowers$nitrogen, summary)
## $low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.800 3.600 5.550 5.853 8.000 12.300
##
## $medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.800 4.500 7.000 7.013 9.950 12.300
##
## $high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 5.800 7.450 7.653 9.475 17.200
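To illustrate applying a function you have written yourself, here is a small sketch using a hypothetical helper, std_error(), which calculates the standard error of the mean. Both the helper and the toy data frame are assumptions made for this example and are not part of the flowers data set.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  nitrogen = factor(rep(c("low", "medium", "high"), each = 2),
                    levels = c("low", "medium", "high")),
  height   = c(4, 6, 7, 9, 8, 12)
)

# Hypothetical user-written function: standard error of the mean
std_error <- function(x) {
  sd(x) / sqrt(length(x))
}

# Apply our own function to height for each level of nitrogen
se_height <- tapply(toy$height, toy$nitrogen, std_error)
se_height
```

Any function that accepts a vector as its first argument can be supplied to tapply() in the same way.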
Note, if the variable you want to summarise contains missing values (NA
) you will also need to include an argument specifying how you want the function to deal with the NA
values. We saw an example of this in Chapter 2 where the mean()
function returned an NA
when we had missing data. To include the na.rm = TRUE
argument we simply add this as another argument when using tapply()
.
tapply(flowers$height, flowers$nitrogen, mean, na.rm = TRUE)
## low medium high
## 5.853125 7.012500 7.653125
We can also use tapply()
to apply functions to more than one factor. The only thing to remember is that the factors need to be supplied to the tapply()
function in the form of a list using the list()
function. To calculate the mean height
for each combination of nitrogen
and treat
factor levels we can use the list(flowers$nitrogen, flowers$treat)
notation.
tapply(flowers$height, list(flowers$nitrogen, flowers$treat), mean)
##          notip    tip
## low    3.66875 8.0375
## medium 4.83750 9.1875
## high   5.70625 9.6000
And if you get a little fed up with having to write flowers$
for every variable you can nest the tapply()
function inside the with()
function. The with()
function allows R to evaluate an R expression with respect to a named data object (in this case flowers
).
with(flowers, tapply(height, list(nitrogen, treat), mean))
##          notip    tip
## low    3.66875 8.0375
## medium 4.83750 9.1875
## high   5.70625 9.6000
The with()
function also works with many other functions and can save you a lot of typing!
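As a brief sketch of using with() alongside other functions, the code below wraps table() and mean() instead of tapply(). The toy data frame is made up for illustration and is not the flowers data set.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  nitrogen = factor(rep(c("low", "high"), each = 2),
                    levels = c("low", "high")),
  treat    = c("tip", "notip", "tip", "notip"),
  height   = c(8, 4, 10, 6),
  stringsAsFactors = FALSE
)

# Column names are looked up inside toy, so no toy$ prefix is needed
counts <- with(toy, table(nitrogen, treat))
counts

overall_mean <- with(toy, mean(height))
overall_mean
```

A small bonus of this approach is that table() labels the dimensions of the result with the variable names nitrogen and treat.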
Another really useful function for summarising data is the aggregate()
function. The aggregate()
function works in a very similar way to tapply()
but is a bit more flexible.
For example, we can calculate the mean of the variables height
, weight
, leafarea
and shootarea
for each level of nitrogen
.
aggregate(flowers[, 4:7], by = list(nitrogen = flowers$nitrogen), FUN = mean)
##   nitrogen   height    weight leafarea shootarea
## 1      low 5.853125  8.652812 11.14375   45.1000
## 2   medium 7.012500 11.164062 13.83125   67.5625
## 3     high 7.653125 16.646875 17.18125  126.6875
In the code above we have indexed the columns we want to summarise in the flowers
data frame using flowers[, 4:7]
. The by =
argument specifies a list of factors (list(nitrogen = flowers$nitrogen)
) and the FUN =
argument names the function to apply (mean
in this example).
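One detail worth seeing in action is why the list supplied to by = is given a name. The sketch below uses a made-up toy data frame (not the flowers data set) to compare a named and an unnamed list.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  nitrogen = factor(rep(c("low", "medium", "high"), each = 2),
                    levels = c("low", "medium", "high")),
  height   = c(4, 6, 7, 9, 8, 12),
  weight   = c(10, 12, 14, 16, 18, 22)
)

# Named list: the grouping column in the result is called nitrogen
named <- aggregate(toy[, c("height", "weight")],
                   by = list(nitrogen = toy$nitrogen), FUN = mean)
named

# Unnamed list: aggregate() falls back to the default name Group.1
unnamed <- aggregate(toy[, c("height", "weight")],
                     by = list(toy$nitrogen), FUN = mean)
unnamed
```

Naming the list elements is optional, but it makes the resulting data frame easier to read and to index later on.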
Similar to the tapply()
function we can include more than one factor to apply a function to. Here we calculate the mean values for each combination of nitrogen
and treat.
aggregate(flowers[, 4:7], by = list(nitrogen = flowers$nitrogen,
treat = flowers$treat), FUN = mean)
##   nitrogen treat  height    weight leafarea shootarea
## 1      low notip 3.66875  8.289375 12.32500  59.89375
## 2   medium notip 4.83750 11.316875 14.17500  94.53125
## 3     high notip 5.70625 16.604375 18.81875 155.31875
## 4      low   tip 8.03750  9.016250  9.96250  30.30625
## 5   medium   tip 9.18750 11.011250 13.48750  40.59375
## 6     high   tip 9.60000 16.689375 15.54375  98.05625
We can also use the aggregate()
function in a different way by using the formula method (as we did with xtabs()
). On the left hand side of the formula (~
) we specify the variable we want to apply the mean function to, and on the right hand side our factors separated by a +
symbol. The formula method also allows you to use the data =
argument for convenience.
aggregate(height ~ nitrogen + treat, FUN = mean, data = flowers)
##   nitrogen treat  height
## 1      low notip 3.66875
## 2   medium notip 4.83750
## 3     high notip 5.70625
## 4      low   tip 8.03750
## 5   medium   tip 9.18750
## 6     high   tip 9.60000
One advantage of using the formula method is that we can also use the subset =
argument to apply the function to subsets of the original data. For example, we can calculate the mean height for each combination of the nitrogen and treat levels, but only for those plants that have fewer than 7 flowers.
aggregate(height ~ nitrogen + treat, FUN = mean, subset = flowers < 7, data = flowers)
##   nitrogen treat   height
## 1      low notip 3.533333
## 2   medium notip 5.316667
## 3     high notip 3.850000
## 4      low   tip 8.176923
## 5   medium   tip 8.570000
## 6     high   tip 7.900000
Or for only those plants in block
1.