Manipulating, summarising and sorting data using R is an important skill to master but one which many people find a little confusing at first. We’ll go through a few simple examples here using vectors to illustrate some important concepts but will build on this in much more detail in Chapter 3 where we will look at more complicated (and useful) data structures.
To extract (also known as indexing or subscripting) one or more values (more generally known as elements) from a vector we use the square bracket
[ ] notation. The general approach is to name the object you wish to extract from, then a set of square brackets with an index of the element you wish to extract contained within the square brackets. This index can be a position or the result of a logical test.
To extract elements based on their position we simply write the position inside the
[ ]. For example, to extract the 3rd value of
my_vec we write my_vec[3].
Note that the positional index starts at 1 rather than 0 like some other programming languages (e.g. Python).
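A minimal sketch of positional indexing. The values of my_vec are not given in the text here; they are reconstructed from the outputs shown later in this section, so treat them as an assumption. The name third is introduced only for illustration.

```r
# my_vec as implied by the outputs in this section (an assumption)
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# extract the 3rd element; positions are counted from 1, not 0
third <- my_vec[3]
third
## [1] 1
```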
We can also extract more than one value by using the
c() function inside the square brackets. Here we extract the 1st, 5th, 6th and 8th elements from
my_vec with my_vec[c(1, 5, 6, 8)].
Or we can extract a range of values using the
: notation. To extract the values from the 3rd to the 8th elements we write
my_vec[3:8].
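As a rough illustration of extracting several positions at once, the sketch below uses c() and the : notation inside the square brackets. The values of my_vec are assumed (reconstructed from outputs elsewhere in this section), and picked and ranged are illustrative names.

```r
# assumed values for my_vec
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# 1st, 5th, 6th and 8th elements
picked <- my_vec[c(1, 5, 6, 8)]
picked
## [1] 2 4 3 7

# 3rd to 8th elements; 3:8 expands to c(3, 4, 5, 6, 7, 8)
ranged <- my_vec[3:8]
ranged
## [1] 1 6 4 3 3 7
```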
Another really useful way to extract data from a vector is to use a logical expression as an index. For example, to extract all elements with a value greater than 4 in the vector
my_vec we write my_vec[my_vec > 4].
Here, the logical expression is
my_vec > 4 and R will only extract those elements that satisfy this logical condition. So how does this actually work? If we look at the output of just the logical expression without the square brackets you can see that R returns a vector containing either
TRUE or
FALSE, which corresponds to whether the logical condition is satisfied for each element. In this case only the 4th and 8th elements return a
TRUE as their value is greater than 4.
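The sketch below shows the logical vector on its own and then uses the same expression as an index. The values of my_vec are an assumption reconstructed from this section's outputs; is_big and big_vals are illustrative names.

```r
# assumed values for my_vec
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# the logical test alone returns one TRUE or FALSE per element
is_big <- my_vec > 4
is_big
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

# used as an index, only the TRUE positions are kept
big_vals <- my_vec[my_vec > 4]
big_vals
## [1] 6 7
```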
So what R is actually doing under the hood is equivalent to indexing
my_vec with that vector of TRUE and FALSE values directly, and only those elements that are
TRUE will be extracted.
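To make the equivalence concrete, here is a small sketch that writes the logical index out by hand and compares it with the result of the logical test. my_vec uses assumed values, and by_hand and by_test are hypothetical names.

```r
# assumed values for my_vec
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# spelling out the logical index element by element
by_hand <- my_vec[c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)]

# letting R build the same index from the test
by_test <- my_vec[my_vec > 4]

identical(by_hand, by_test)
## [1] TRUE
```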
In addition to the
< and
> operators you can also use composite operators to increase the complexity of your expressions. For example the expression for ‘greater than or equal to’ is
>=. To test whether a value is equal to a value we need to use a double equals symbol
== and for ‘not equal to’ we use
!= (the
! symbol means ‘not’).
my_vec[my_vec >= 4] # values greater or equal to 4
## [1] 6 4 7
my_vec[my_vec < 4] # values less than 4
## [1] 2 3 1 3 3
my_vec[my_vec <= 4] # values less than or equal to 4
## [1] 2 3 1 4 3 3
my_vec[my_vec == 4] # values equal to 4
## [1] 4
my_vec[my_vec != 4] # values not equal to 4
## [1] 2 3 1 6 3 3 7
We can also combine multiple logical expressions using Boolean operators. In R the
& symbol means AND and the
| symbol means OR. For example, to extract values in
my_vec which are less than 6 AND greater than 2 we write my_vec[my_vec < 6 & my_vec > 2],
or to extract values in
my_vec that are greater than 6 OR less than 3 we write my_vec[my_vec > 6 | my_vec < 3].
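A short sketch of combining logical tests with & and |, again using assumed values for my_vec. The names both and either are introduced for illustration only.

```r
# assumed values for my_vec
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# less than 6 AND greater than 2: both tests must be TRUE
both <- my_vec[my_vec < 6 & my_vec > 2]
both
## [1] 3 4 3 3

# greater than 6 OR less than 3: at least one test must be TRUE
either <- my_vec[my_vec > 6 | my_vec < 3]
either
## [1] 2 1 7
```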
We can change the values of some elements in a vector using our
[ ] notation in combination with the assignment operator
<-. For example, to replace the 4th value of our
my_vec object (currently 6) we assign a new value to my_vec[4].
We can also replace more than one value or even replace values based on a logical expression.
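The three kinds of replacement described above might look like the sketch below. The starting values of my_vec are assumed, and the replacement values (500, 100 and 1000) are arbitrary numbers chosen only to make the changes easy to spot.

```r
# assumed starting values for my_vec
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# replace a single element by position
my_vec[4] <- 500
my_vec
## [1]   2   3   1 500   4   3   3   7

# replace several elements at once
my_vec[c(6, 7)] <- 100
my_vec
## [1]   2   3   1 500   4 100 100   7

# replace every element that passes a logical test
my_vec[my_vec <= 4] <- 1000
my_vec
## [1] 1000 1000 1000  500 1000  100  100    7
```

Note that each assignment overwrites my_vec in place, so later steps operate on the already modified vector.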
In addition to extracting particular elements from a vector we can also order the values contained in a vector. To sort the values from lowest to highest value we can use the
sort() function.
To reverse the sort, from highest to lowest, we can either include the
decreasing = TRUE argument when using the
sort() function,
or first sort the vector using the
sort() function and then reverse the sorted vector using the
rev() function. This is another example of nesting one function inside another function.
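A minimal sketch of the sorting options just described, using assumed values for my_vec. The names asc, desc and desc_rev are illustrative.

```r
# assumed values for my_vec
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# lowest to highest
asc <- sort(my_vec)
asc
## [1] 1 2 3 3 3 4 6 7

# highest to lowest using the decreasing argument
desc <- sort(my_vec, decreasing = TRUE)
desc
## [1] 7 6 4 3 3 3 2 1

# the same result by nesting sort() inside rev()
desc_rev <- rev(sort(my_vec))
identical(desc, desc_rev)
## [1] TRUE
```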
Whilst sorting a single vector is fun, perhaps a more useful task would be to sort one vector according to the values of another vector. To do this we should use the
order() function in combination with
[ ]. To demonstrate this let’s create a vector called
height containing the height of 5 different people and another vector called
p.names containing the names of these people (so Joanna is 180 cm, Charlotte is 155 cm etc)
Our goal is to order the people in
p.names in ascending order of their
height. The first thing we’ll do is use the
order() function with the
height variable to create a vector called
height_ord.
OK, what’s going on here? The first value,
2, (remember to ignore the
[1]) should be read as ‘the smallest value of
height is the second element of the
height vector’. If we check this by looking at the
height vector above, you can see that element 2 has a value of 155, which is the smallest value. The second smallest value in
height is the 3rd element of
height, which when we check is 160 and so on. The largest value of
height is element
5 which is 181. Now that we have a vector of the positional indices of heights in ascending order (
height_ord), we can extract these values from our
p.names vector in this order using p.names[height_ord].
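The whole order() workflow can be sketched as follows. Only some of the data are stated in the text (Joanna 180, Charlotte 155, a 3rd height of 160 and a 5th of 181); the 4th height and the last three names below are placeholders assumed for illustration, and names_ord is a hypothetical name.

```r
# 4th height (167) is an assumed placeholder
height <- c(180, 155, 160, 167, 181)
# the last three names are assumed placeholders
p.names <- c("Joanna", "Charlotte", "Helen", "Karen", "Amy")

# positions of the heights, from smallest to largest
height_ord <- order(height)
height_ord
## [1] 2 3 4 1 5

# use those positions to reorder the names
names_ord <- p.names[height_ord]
names_ord
## [1] "Charlotte" "Helen"     "Karen"     "Joanna"    "Amy"
```

Indexing height with the same vector, height[height_ord], gives the same result as sort(height), which is a handy way to check that the names and heights stay paired.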
You’re probably thinking ‘what’s the use of this?’ Well, imagine you have a dataset which contains two columns of data and you want to sort each column. If you just use
sort() to sort each column separately, the values of each column will become uncoupled from each other. By using
order() on one column, we create a vector of positional indices that puts the values of that column in ascending order. This vector can then be used as the index on the second column, which returns its values arranged according to the first column.
One of the great things about R functions is that most of them are vectorised. This means that the function will operate on all elements of a vector without needing to apply the function on each element separately. For example, to multiply each element of a vector by 5 we can simply use the
* operator, as in my_vec * 5.
Or we can add the elements of two or more vectors using the + operator.
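A brief sketch of vectorised arithmetic. The values of my_vec are assumed, and my_vec2 is a hypothetical second vector of the same length made up for this example.

```r
# assumed values for my_vec
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# multiply every element by 5 in one step
times5 <- my_vec * 5
times5
## [1] 10 15  5 30 20 15 15 35

# hypothetical second vector of the same length
my_vec2 <- c(1, 2, 3, 4, 5, 6, 7, 8)

# element-wise addition: 1st + 1st, 2nd + 2nd, and so on
total <- my_vec + my_vec2
total
## [1]  3  5  4 10  9  9 10 15
```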
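However, you must be careful when using vectorisation with vectors of different lengths as R will quietly recycle the elements in the shorter vector rather than throw a wobbly (error). R only issues a warning when the length of the longer vector is not a multiple of the length of the shorter one.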
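To see recycling in action, the sketch below adds a vector of length 2 to a vector of length 6. Both vectors (long_vec and short_vec) are made up for this example.

```r
# hypothetical vectors of different lengths
long_vec <- c(1, 2, 3, 4, 5, 6)
short_vec <- c(10, 20)

# short_vec is reused as 10, 20, 10, 20, 10, 20 without any message
recycled <- long_vec + short_vec
recycled
## [1] 11 22 13 24 15 26
```

Because 6 is a multiple of 2, R says nothing at all here, which is exactly why this behaviour can hide mistakes.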
In R, missing data is usually represented by an
NA symbol meaning ‘Not Available’. Data may be missing for a whole bunch of reasons, maybe your machine broke down, maybe you broke down, maybe the weather was too bad to collect data on a particular day etc etc. Missing data can be a pain in the proverbial both from an R perspective and also a statistical perspective. From an R perspective missing data can be problematic as different functions deal with missing data in different ways. For example, let’s say we collected air temperature readings over 10 days, but our thermometer broke on day 2 and again on day 9 so we have no data for those days
We now want to calculate the mean temperature over these days using the
mean() function.
Flippin heck, what’s happened here? Why does the
mean() function return an
NA? Actually, R is doing something very sensible (at least in our opinion!). If a vector has a missing value then the only possible value to return when calculating a mean is
NA. R doesn’t know that you perhaps want to ignore the
NA values (R can’t read your mind - yet!). Happily, if we look at the help file (use
help("mean") - see the next section for more details) associated with the
mean() function we can see there is an argument
na.rm = which is set to
FALSE by default.
na.rm - a logical value indicating whether NA values should be stripped before the computation proceeds.
If we change this argument to
na.rm = TRUE when we use the
mean() function this will allow us to ignore the
NA values when calculating the mean
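The behaviour described above can be sketched like this. The temperature readings are invented for illustration; only the positions of the two NA values (days 2 and 9) come from the text. mean_default and mean_no_na are illustrative names.

```r
# illustrative readings; NA on day 2 and day 9 as described in the text
temp <- c(7.2, NA, 7.1, 6.9, 6.5, 5.8, 5.8, 5.5, NA, 5.5)

# default behaviour (na.rm = FALSE): any NA makes the result NA
mean_default <- mean(temp)
mean_default
## [1] NA

# ignore the NA values when computing the mean
mean_no_na <- mean(temp, na.rm = TRUE)
mean_no_na
## [1] 6.2875
```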
It’s important to note that the
NA values have not been removed from our
temp object (that would be bad practice), rather the
mean() function has just ignored them. The point of the above is to highlight how we can change the default behaviour of a function using an appropriate argument. The problem is that not all functions will have an
na.rm = argument; they might deal with
NA values differently. However, the good news is that the help file associated with a function will usually tell you how missing data are handled by default.
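As a small illustration of both points, the sketch below checks that the NA values are still in the vector after using na.rm = TRUE, and shows one base R function, table(), that handles missing values through a different argument (useNA) rather than na.rm. The temperature readings are invented for this example, and the choice of table() is ours rather than something the text prescribes.

```r
# illustrative readings with NA on day 2 and day 9
temp <- c(7.2, NA, 7.1, 6.9, 6.5, 5.8, 5.8, 5.5, NA, 5.5)

mean(temp, na.rm = TRUE)  # ignores the NAs but does not change temp

# the vector still has 10 elements, 2 of them missing
n_total <- length(temp)
n_missing <- sum(is.na(temp))

# table() drops NA by default and has no na.rm argument
counts <- table(temp)
# asking for NAs to be counted uses the useNA argument instead
counts_na <- table(temp, useNA = "ifany")

sum(counts)     # number of values counted without NAs
sum(counts_na)  # number of values counted including NAs
```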