3.5 Summarising data frames
Now that we’re able to manipulate and extract data from our data frames, our next task is to start exploring and getting to know our data. In this section we’ll start producing tables of useful summary statistics of the variables in our data frame, and in the next two chapters we’ll cover visualising our data, including with base R graphics.
A really useful starting point is to produce some simple summary statistics of all of the variables in our flowers data frame using the summary() function.
summary(flowers)
##     treat             nitrogen      block         height           weight
##  Length:96          low   :32   Min.   :1.0   Min.   : 1.200   Min.   : 5.790
##  Class :character   medium:32   1st Qu.:1.0   1st Qu.: 4.475   1st Qu.: 9.027
##  Mode  :character   high  :32   Median :1.5   Median : 6.450   Median :11.395
##                                 Mean   :1.5   Mean   : 6.840   Mean   :12.155
##                                 3rd Qu.:2.0   3rd Qu.: 9.025   3rd Qu.:14.537
##                                 Max.   :2.0   Max.   :17.200   Max.   :23.890
##     leafarea       shootarea         flowers
##  Min.   : 5.80   Min.   :  5.80   Min.   : 1.000
##  1st Qu.:11.07   1st Qu.: 39.05   1st Qu.: 4.000
##  Median :13.45   Median : 70.05   Median : 6.000
##  Mean   :14.05   Mean   : 79.78   Mean   : 7.062
##  3rd Qu.:16.45   3rd Qu.:113.28   3rd Qu.: 9.000
##  Max.   :49.20   Max.   :189.60   Max.   :17.000
For numeric variables (e.g. height, weight etc) the mean, minimum, maximum, median, first (lower) quartile and third (upper) quartile are presented. For factor variables (i.e. nitrogen) the number of observations in each of the factor levels is given. For character variables (i.e. treat) only the length, class and mode are shown. If a variable contains missing data then the number of NA values is also reported.
If we want to summarise a smaller subset of variables in our data frame we can use our indexing skills in combination with the summary() function. For example, to summarise only the height, weight, leafarea and shootarea variables we can include the appropriate column indexes inside the square brackets [ ]. Notice we include all rows by not specifying a row index.
summary(flowers[, 4:7])
##      height           weight          leafarea       shootarea
##  Min.   : 1.200   Min.   : 5.790   Min.   : 5.80   Min.   :  5.80
##  1st Qu.: 4.475   1st Qu.: 9.027   1st Qu.:11.07   1st Qu.: 39.05
##  Median : 6.450   Median :11.395   Median :13.45   Median : 70.05
##  Mean   : 6.840   Mean   :12.155   Mean   :14.05   Mean   : 79.78
##  3rd Qu.: 9.025   3rd Qu.:14.537   3rd Qu.:16.45   3rd Qu.:113.28
##  Max.   :17.200   Max.   :23.890   Max.   :49.20   Max.   :189.60

# or equivalently
# summary(flowers[, c("height", "weight", "leafarea", "shootarea")])
And to summarise a single variable.
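As a minimal sketch of summarising a single variable, the code below passes one column to summary() using the $ notation. The toy data frame and its values are made up for illustration and are not the flowers data.

```r
# 'toy' is a hypothetical stand-in for the flowers data frame
toy <- data.frame(
  height   = c(8, 9, 10, 7),
  leafarea = c(12.5, 14.0, 18.5, 11.0)
)

# summarise just one column by extracting it with $
leaf_summary <- summary(toy$leafarea)
print(leaf_summary)
```

The same idea should carry over to the real data, for example summary(flowers$leafarea).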
As you’ve seen above, the summary() function reports the number of observations in each level of our factor variables. Another useful function for generating tables of counts is the table() function. The table() function can be used to build contingency tables of different combinations of factor levels. For example, to count the number of observations for each level of nitrogen we pass flowers$nitrogen to table().
We can extend this further by producing a table of counts for each combination of nitrogen and treat factor levels.
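The two uses of table() described above can be sketched as follows. The toy data frame is an assumption made for illustration (12 made-up rows, not the flowers data); only the column names mirror the text.

```r
# 'toy' is a hypothetical stand-in for the flowers data frame
toy <- data.frame(
  treat    = rep(c("tip", "notip"), each = 6),
  nitrogen = factor(rep(c("low", "medium", "high"), times = 4),
                    levels = c("low", "medium", "high"))
)

# one-way table: number of observations in each nitrogen level
nitrogen_counts <- table(toy$nitrogen)
print(nitrogen_counts)

# two-way table: counts for each combination of nitrogen and treat
combo_counts <- table(toy$nitrogen, toy$treat)
print(combo_counts)
```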
A more flexible version of the
table() function is the
xtabs() function. The
xtabs() function uses a formula notation (
~) to build contingency tables with the cross-classifying variables separated by a
+ symbol on the right hand side of the formula.
xtabs() also has a useful
data = argument so you don’t have to include the data frame name when specifying each variable.
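A minimal sketch of the formula notation, assuming a small made-up data frame called toy in place of flowers. The variables go on the right hand side of ~ and the data frame is supplied once through data =.

```r
# 'toy' is a hypothetical stand-in for the flowers data frame
toy <- data.frame(
  treat    = rep(c("tip", "notip"), each = 6),
  nitrogen = factor(rep(c("low", "medium", "high"), times = 4),
                    levels = c("low", "medium", "high"))
)

# cross-classify by nitrogen and treat; no need to write toy$ each time
counts <- xtabs(~ nitrogen + treat, data = toy)
print(counts)
```

Unlike the unnamed call to table(), the dimensions of the result are labelled with the variable names, which makes the printed table easier to read.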
We can even build more complicated contingency tables using more variables. Note, in the example below the
xtabs() function has quietly coerced our
block variable to a factor.
And for a nicer formatted table we can nest the
xtabs() function inside the
ftable() function to ‘flatten’ the table.
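The three-variable table and its flattened form can be sketched like this. Again toy is a made-up data frame used only for illustration, with a numeric block column so that the coercion to a factor mentioned above can be seen.

```r
# 'toy' is a hypothetical stand-in for the flowers data frame
toy <- data.frame(
  treat    = rep(c("tip", "notip"), each = 6),
  nitrogen = factor(rep(c("low", "medium", "high"), times = 4),
                    levels = c("low", "medium", "high")),
  block    = rep(c(1, 2), times = 6)   # numeric, treated as a factor by xtabs()
)

# three-way contingency table: printed as one two-way table per block
counts3 <- xtabs(~ nitrogen + treat + block, data = toy)
print(counts3)

# flatten the three-way table into a single two-dimensional layout
flat <- ftable(counts3)
print(flat)
```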
We can also summarise our data for each level of a factor variable. Let’s say we want to calculate the mean value of height for each of our low, medium and high levels of nitrogen. To do this we will use the mean() function and apply this to the height variable for each level of nitrogen using the tapply() function.
The tapply() function is not just restricted to calculating mean values; you can use it to apply many of the functions that come with R or even functions you’ve written yourself (see Chapter 7 for more details). For example, we can apply the sd() function to calculate the standard deviation for each level of nitrogen or even the summary() function.
tapply(flowers$height, flowers$nitrogen, sd)
##      low   medium     high
## 2.828425 3.005345 3.483323

tapply(flowers$height, flowers$nitrogen, summary)
## $low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.800   3.600   5.550   5.853   8.000  12.300
##
## $medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.800   4.500   7.000   7.013   9.950  12.300
##
## $high
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.200   5.800   7.450   7.653   9.475  17.200
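To make the pattern concrete, here is a small sketch of tapply() with mean() and sd(). It uses a made-up data frame called toy rather than flowers, so the numbers are illustrative only.

```r
# 'toy' is a hypothetical stand-in for the flowers data frame
toy <- data.frame(
  nitrogen = factor(rep(c("low", "medium", "high"), times = 4),
                    levels = c("low", "medium", "high")),
  height   = c(8, 9, 10, 7, 10, 11, 3, 5, 6, 4, 4, 7)
)

# arguments: the variable to summarise, the grouping factor, the function
mean_height <- tapply(toy$height, toy$nitrogen, mean)
print(mean_height)

# swapping the third argument changes the statistic, nothing else
sd_height <- tapply(toy$height, toy$nitrogen, sd)
print(sd_height)
```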
Note, if the variable you want to summarise contains missing values (NA) you will also need to include an argument specifying how you want the function to deal with the NA values. We saw an example of this in Chapter 2 where the mean() function returned an NA when we had missing data. To include the na.rm = TRUE argument we simply add this as another argument when using tapply().
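The effect of na.rm = TRUE can be sketched with a tiny made-up data frame (toy is an assumption for illustration, not the flowers data). Any extra arguments given to tapply() after the function name are passed on to that function.

```r
# 'toy' is a hypothetical data frame with one missing height
toy <- data.frame(
  nitrogen = factor(c("low", "low", "high", "high"),
                    levels = c("low", "high")),
  height   = c(4, NA, 8, 10)
)

# without na.rm the group containing the NA returns NA
with_na <- tapply(toy$height, toy$nitrogen, mean)
print(with_na)

# na.rm = TRUE is handed on to mean(), which then drops the NA
without_na <- tapply(toy$height, toy$nitrogen, mean, na.rm = TRUE)
print(without_na)
```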
We can also use tapply() to apply functions to more than one factor. The only thing to remember is that the factors need to be supplied to the tapply() function in the form of a list using the list() function. To calculate the mean height for each combination of nitrogen and treat factor levels we can use the list(flowers$nitrogen, flowers$treat) notation.
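A short sketch of supplying two factors as a list, using a made-up data frame toy in place of flowers. The result is a matrix with one row per level of the first factor and one column per level of the second.

```r
# 'toy' is a hypothetical stand-in for the flowers data frame
toy <- data.frame(
  treat    = rep(c("tip", "notip"), each = 6),
  nitrogen = factor(rep(c("low", "medium", "high"), times = 4),
                    levels = c("low", "medium", "high")),
  height   = c(8, 9, 10, 7, 10, 11, 3, 5, 6, 4, 4, 7)
)

# the grouping factors are wrapped in list()
mean_by_both <- tapply(toy$height, list(toy$nitrogen, toy$treat), mean)
print(mean_by_both)
```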
And if you get a little fed up with having to write flowers$ for every variable you can nest the tapply() function inside the with() function. The with() function allows R to evaluate an R expression with respect to a named data object (in this case flowers). The with() function also works with many other functions and can save you a lot of typing!
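As a sketch of the saving in typing, the two calls below should give the same result. The toy data frame is made up for illustration and stands in for flowers.

```r
# 'toy' is a hypothetical stand-in for the flowers data frame
toy <- data.frame(
  treat    = rep(c("tip", "notip"), each = 6),
  nitrogen = factor(rep(c("low", "medium", "high"), times = 4),
                    levels = c("low", "medium", "high")),
  height   = c(8, 9, 10, 7, 10, 11, 3, 5, 6, 4, 4, 7)
)

# column names are looked up inside 'toy', so no toy$ prefix is needed
via_with <- with(toy, tapply(height, list(nitrogen, treat), mean))

# the longhand version for comparison
via_dollar <- tapply(toy$height, list(toy$nitrogen, toy$treat), mean)

print(via_with)
```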
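Another really useful function for summarising data is the aggregate() function. The aggregate() function works in a very similar way to tapply() but is a bit more flexible.
For example, to calculate the mean of the variables height, weight, leafarea and shootarea for each level of nitrogen we can use aggregate(flowers[, 4:7], by = list(nitrogen = flowers$nitrogen), FUN = mean).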
In the code above we have indexed the columns we want to summarise in the
flowers data frame using
flowers[, 4:7]. The
by = argument specifies a list of factors (
list(nitrogen = flowers$nitrogen)) and the
FUN = argument names the function to apply (
mean in this example).
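The by = and FUN = arguments described above can be tried out on a small made-up data frame. Here toy is an assumption for illustration, with only two numeric columns rather than the four in flowers[, 4:7].

```r
# 'toy' is a hypothetical stand-in for the flowers data frame
toy <- data.frame(
  nitrogen = factor(rep(c("low", "medium", "high"), times = 4),
                    levels = c("low", "medium", "high")),
  height   = c(8, 9, 10, 7, 10, 11, 3, 5, 6, 4, 4, 7),
  weight   = c(9, 11, 16, 9, 12, 17, 8, 11, 16, 8, 10, 17)
)

# summarise several columns at once; the name given inside list()
# becomes the name of the grouping column in the result
means <- aggregate(toy[, c("height", "weight")],
                   by = list(nitrogen = toy$nitrogen),
                   FUN = mean)
print(means)
```

Note that the result is a data frame with one row per group, whereas tapply() returns a vector or matrix.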
Similar to the tapply() function we can include more than one factor to apply a function to. Here we calculate the mean values for each combination of nitrogen and treat.
aggregate(flowers[, 4:7], by = list(nitrogen = flowers$nitrogen,
          treat = flowers$treat), FUN = mean)
##   nitrogen treat  height    weight leafarea shootarea
## 1      low notip 3.66875  8.289375 12.32500  59.89375
## 2   medium notip 4.83750 11.316875 14.17500  94.53125
## 3     high notip 5.70625 16.604375 18.81875 155.31875
## 4      low   tip 8.03750  9.016250  9.96250  30.30625
## 5   medium   tip 9.18750 11.011250 13.48750  40.59375
## 6     high   tip 9.60000 16.689375 15.54375  98.05625
We can also use the aggregate() function in a different way by using the formula method (as we did with xtabs()). On the left hand side of the formula (~) we specify the variable we want to apply the mean function to, and on the right hand side our factors, separated by a + symbol. The formula method also allows you to use the data = argument for convenience.
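A minimal sketch of the formula method, assuming a made-up data frame toy in place of flowers. The response goes to the left of ~ and the grouping factors to the right.

```r
# 'toy' is a hypothetical stand-in for the flowers data frame
toy <- data.frame(
  treat    = rep(c("tip", "notip"), each = 6),
  nitrogen = factor(rep(c("low", "medium", "high"), times = 4),
                    levels = c("low", "medium", "high")),
  height   = c(8, 9, 10, 7, 10, 11, 3, 5, 6, 4, 4, 7)
)

# mean height for each nitrogen-by-treat combination
means <- aggregate(height ~ nitrogen + treat, data = toy, FUN = mean)
print(means)
```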
One advantage of using the formula method is that we can also use the subset = argument to apply the function to subsets of the original data. For example, to calculate the mean height for each combination of the nitrogen and
treat levels but only for those plants that have less than 7
Or for only those plants in
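The subset = argument described above can be sketched as follows. Both the toy data frame and the condition flowers < 7 are assumptions chosen for illustration. The condition is evaluated inside the data frame given to data =, so column names can be used directly.

```r
# 'toy' is a hypothetical stand-in for the flowers data frame;
# its 'flowers' column is a made-up count per plant
toy <- data.frame(
  treat    = rep(c("tip", "notip"), each = 6),
  nitrogen = factor(rep(c("low", "medium", "high"), times = 4),
                    levels = c("low", "medium", "high")),
  height   = c(8, 9, 10, 7, 10, 11, 3, 5, 6, 4, 4, 7),
  flowers  = c(3, 9, 5, 8, 4, 10, 2, 6, 12, 5, 7, 3)
)

# only rows where the condition is TRUE contribute to the group means
means_sub <- aggregate(height ~ nitrogen + treat, data = toy,
                       FUN = mean, subset = flowers < 7)
print(means_sub)
```

Any logical expression built from the columns of the data frame can be used in the same way.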