2.4 Working with vectors
Manipulating, summarising and sorting data using R is an important skill to master but one which many people find a little confusing at first. We’ll go through a few simple examples here using vectors to illustrate some important concepts but will build on this in much more detail in Chapter 3 where we will look at more complicated (and useful) data structures.
2.4.1 Extracting elements
To extract (also known as indexing or subscripting) one or more values (more generally known as elements) from a vector we use the square bracket [ ] notation. The general approach is to name the object you wish to extract from, followed by a set of square brackets containing an index of the element you wish to extract. This index can be a position or the result of a logical test.
Positional index
To extract elements based on their position we simply write the position inside the [ ]. For example, to extract the 3rd value of my_vec:
my_vec # remind ourselves what my_vec looks like
## [1] 2 3 1 6 4 3 3 7
my_vec[3] # extract the 3rd value
## [1] 1
# if you want to store this value in another object
val_3 <- my_vec[3]
val_3
## [1] 1
Note that the positional index starts at 1, rather than at 0 as in some other programming languages (e.g. Python).
We can also extract more than one value by using the c() function inside the square brackets. For example, my_vec[c(1, 5, 6, 8)] extracts the 1st, 5th, 6th and 8th elements from the my_vec object.
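A minimal sketch of extracting several positions at once. The definition of my_vec here is an assumption, reconstructed from the values printed earlier in this section.

```r
# my_vec as printed above (definition reconstructed from that output)
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# extract the 1st, 5th, 6th and 8th elements
my_vec[c(1, 5, 6, 8)]
## [1] 2 4 3 7
```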
Or we can extract a range of values using the : notation. For example, my_vec[3:8] extracts the values from the 3rd to the 8th elements.
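As a short illustration of the : notation, again assuming my_vec holds the values printed earlier:

```r
# my_vec as printed above (definition reconstructed from that output)
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# 3:8 expands to the positions 3, 4, 5, 6, 7, 8
my_vec[3:8]
## [1] 1 6 4 3 3 7
```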
Logical index
Another really useful way to extract data from a vector is to use a logical expression as an index. For example, my_vec[my_vec > 4] extracts all elements with a value greater than 4 in the vector my_vec.
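A minimal sketch of a logical index, assuming my_vec holds the values printed earlier:

```r
# my_vec as printed above (definition reconstructed from that output)
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# keep only the elements whose value is greater than 4
my_vec[my_vec > 4]
## [1] 6 7
```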
Here, the logical expression is my_vec > 4 and R will only extract those elements that satisfy this logical condition. So how does this actually work? If we look at the output of just the logical expression without the square brackets you can see that R returns a vector containing either TRUE or FALSE, with each value indicating whether the logical condition is satisfied for the corresponding element. In this case only the 4th and 8th elements return a TRUE as their value is greater than 4.
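Running the logical expression on its own shows the TRUE/FALSE vector described above (my_vec is assumed to hold the values printed earlier):

```r
# my_vec as printed above (definition reconstructed from that output)
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# the logical test is applied to every element in turn
my_vec > 4
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
```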
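So what R is actually doing under the hood is equivalent to indexing with that logical vector directly, my_vec[c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)], and only those elements that are TRUE will be extracted.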
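A quick check of this equivalence, written out by hand. The object name lgl_index is hypothetical and my_vec is assumed to hold the values printed earlier.

```r
# my_vec as printed above (definition reconstructed from that output)
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# the logical vector typed out explicitly (hypothetical name)
lgl_index <- c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)
my_vec[lgl_index]
## [1] 6 7

# same result as using the expression directly
identical(my_vec[lgl_index], my_vec[my_vec > 4])
## [1] TRUE
```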
In addition to the < and > operators you can also use composite operators to increase the complexity of your expressions. For example, the expression for ‘greater than or equal to’ is >=. To test whether a value is equal to another value we need to use a double equals symbol == and for ‘not equal to’ we use != (the ! symbol means ‘not’).
my_vec[my_vec >= 4] # values greater than or equal to 4
## [1] 6 4 7
my_vec[my_vec < 4] # values less than 4
## [1] 2 3 1 3 3
my_vec[my_vec <= 4] # values less than or equal to 4
## [1] 2 3 1 4 3 3
my_vec[my_vec == 4] # values equal to 4
## [1] 4
my_vec[my_vec != 4] # values not equal to 4
## [1] 2 3 1 6 3 3 7
We can also combine multiple logical expressions using Boolean operators. In R the & symbol means AND and the | symbol means OR. For example, my_vec[my_vec < 6 & my_vec > 2] extracts values in my_vec which are less than 6 AND greater than 2, while my_vec[my_vec > 6 | my_vec < 3] extracts values in my_vec that are greater than 6 OR less than 3.
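A short sketch of both combinations, assuming my_vec holds the values printed earlier. Note that each side of & or | must be a complete logical expression that names the vector.

```r
# my_vec as printed above (definition reconstructed from that output)
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# less than 6 AND greater than 2
my_vec[my_vec < 6 & my_vec > 2]
## [1] 3 4 3 3

# greater than 6 OR less than 3
my_vec[my_vec > 6 | my_vec < 3]
## [1] 2 1 7
```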
2.4.2 Replacing elements
We can change the values of some elements in a vector using our [ ] notation in combination with the assignment operator <-. For example, my_vec[4] <- 500 replaces the 4th value of our my_vec object, changing it from 6 to 500.
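A minimal sketch of this replacement, assuming my_vec starts with the values printed earlier:

```r
# my_vec as printed above (definition reconstructed from that output)
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# replace the 4th element (currently 6) with 500
my_vec[4] <- 500
my_vec
## [1]   2   3   1 500   4   3   3   7
```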
We can also replace more than one value or even replace values based on a logical expression.
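The following sketch illustrates both ideas. The positions, the replacement values (100 and 1000) and the copy name my_vec_new are chosen purely for illustration, and my_vec is assumed to start with the values printed at the beginning of this section.

```r
# work on a copy so the original my_vec is left untouched
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)
my_vec_new <- my_vec

# replace the 6th and 7th elements with 100
my_vec_new[c(6, 7)] <- 100
my_vec_new
## [1]   2   3   1   6   4 100 100   7

# replace every element that is less than or equal to 4 with 1000
my_vec_new[my_vec_new <= 4] <- 1000
my_vec_new
## [1] 1000 1000 1000    6 1000  100  100    7
```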
2.4.3 Ordering elements
In addition to extracting particular elements from a vector we can also order the values contained in a vector. To sort the values from lowest to highest value we can use the sort() function.
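A minimal sketch of sort(), assuming my_vec holds the values printed at the start of this section (that is, before any replacement). The name vec_sort is hypothetical.

```r
# my_vec as printed at the start of the section
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# sort from lowest to highest; my_vec itself is not changed
vec_sort <- sort(my_vec)
vec_sort
## [1] 1 2 3 3 3 4 6 7
```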
To reverse the sort, from highest to lowest, we can either include the decreasing = TRUE argument when using the sort() function or first sort the vector using the sort() function and then reverse the sorted vector using the rev() function. This is another example of nesting one function inside another function.
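Both approaches can be sketched as follows, assuming my_vec holds the values printed at the start of this section:

```r
# my_vec as printed at the start of the section
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# option 1: ask sort() for decreasing order
sort(my_vec, decreasing = TRUE)
## [1] 7 6 4 3 3 3 2 1

# option 2: sort ascending, then reverse (sort() nested inside rev())
rev(sort(my_vec))
## [1] 7 6 4 3 3 3 2 1
```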
Whilst sorting a single vector is fun, perhaps a more useful task would be to sort one vector according to the values of another vector. To do this we should use the order() function in combination with [ ]. To demonstrate this let’s create a vector called height containing the height of 5 different people and another vector called p.names containing the names of these people (so Joanna is 180 cm, Charlotte is 155 cm etc).
height <- c(180, 155, 160, 167, 181)
height
## [1] 180 155 160 167 181
p.names <- c("Joanna", "Charlotte", "Helen", "Karen", "Amy")
p.names
## [1] "Joanna" "Charlotte" "Helen" "Karen" "Amy"
Our goal is to order the people in p.names in ascending order of their height. The first thing we’ll do is use the order() function with the height variable to create a vector called height_ord (height_ord <- order(height)), which contains the values 2 3 4 1 5.
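This step can be sketched as follows, using the height vector defined above:

```r
height <- c(180, 155, 160, 167, 181)

# order() returns the positions of the elements, smallest value first
height_ord <- order(height)
height_ord
## [1] 2 3 4 1 5
```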
OK, what’s going on here? The first value, 2, (remember to ignore the [1]) should be read as ‘the smallest value of height is the second element of the height vector’. If we check this by looking at the height vector above, you can see that element 2 has a value of 155, which is the smallest value. The second smallest value in height is the 3rd element of height, which when we check is 160, and so on. The largest value of height is element 5, which is 181. Now that we have a vector of the positional indices of heights in ascending order (height_ord), we can extract these values from our p.names vector in this order.
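A minimal sketch of this final step, using the vectors defined above. The name names_ord is hypothetical.

```r
height <- c(180, 155, 160, 167, 181)
p.names <- c("Joanna", "Charlotte", "Helen", "Karen", "Amy")
height_ord <- order(height)

# use the positions in height_ord to index p.names
names_ord <- p.names[height_ord]
names_ord
## [1] "Charlotte" "Helen"     "Karen"     "Joanna"    "Amy"
```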
You’re probably thinking ‘what’s the use of this?’ Well, imagine you have a dataset which contains two columns of data and you want to sort each column. If you just use sort() to sort each column separately, the values of each column will become uncoupled from each other. By using order() on one column, we create a vector of positional indices that puts the values of that column in ascending order. Using this vector as the index for the second column returns its values in the order defined by the first column, so the two columns stay paired.
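The difference can be illustrated with the two vectors above standing in for the two columns (a sketch only; data frames are not used here):

```r
height <- c(180, 155, 160, 167, 181)
p.names <- c("Joanna", "Charlotte", "Helen", "Karen", "Amy")

# sorting each vector separately uncouples them:
# the shortest height (155) now lines up with Amy, not Charlotte
sort(height)[1]
## [1] 155
sort(p.names)[1]
## [1] "Amy"

# applying one index from order() to both keeps each pair together
height_ord <- order(height)
height[height_ord][1]
## [1] 155
p.names[height_ord][1]
## [1] "Charlotte"
```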
2.4.4 Vectorisation
One of the great things about R functions is that most of them are vectorised. This means that the function will operate on all elements of a vector without needing to apply the function to each element separately. For example, to multiply each element of a vector by 5 we can simply use
# create a vector
my_vec2 <- c(3, 5, 7, 1, 9, 20)
# multiply each element by 5
my_vec2 * 5
## [1] 15 25 35 5 45 100
Or we can add the elements of two or more vectors
# create a second vector
my_vec3 <- c(17, 15, 13, 19, 11, 0)
# add both vectors
my_vec2 + my_vec3
## [1] 20 20 20 20 20 20
# multiply both vectors
my_vec2 * my_vec3
## [1] 51 75 91 19 99 0
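However, you must be careful when using vectorisation with vectors of different lengths as R will quietly recycle the elements in the shorter vector rather than throw a wobbly (error). R only issues a warning, not an error, when the length of the longer vector is not a multiple of the length of the shorter one.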
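A minimal sketch of recycling, using my_vec2 from above and a hypothetical two-element vector called short_vec:

```r
my_vec2 <- c(3, 5, 7, 1, 9, 20)

# hypothetical shorter vector
short_vec <- c(1, 2)

# short_vec is silently recycled to 1, 2, 1, 2, 1, 2 before adding
my_vec2 + short_vec
## [1]  4  7  8  3 10 22
```

No warning appears here because the length of my_vec2 (6) is a multiple of the length of short_vec (2).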
2.4.5 Missing data
In R, missing data is usually represented by an NA symbol meaning ‘Not Available’. Data may be missing for a whole bunch of reasons: maybe your machine broke down, maybe you broke down, maybe the weather was too bad to collect data on a particular day, etc. Missing data can be a pain in the proverbial both from an R perspective and also a statistical perspective. From an R perspective missing data can be problematic as different functions deal with missing data in different ways. For example, let’s say we collected air temperature readings over 10 days, but our thermometer broke on day 2 and again on day 9 so we have no data for those days.
temp <- c(7.2, NA, 7.1, 6.9, 6.5, 5.8, 5.8, 5.5, NA, 5.5)
temp
## [1] 7.2 NA 7.1 6.9 6.5 5.8 5.8 5.5 NA 5.5
We now want to calculate the mean temperature over these days using the mean() function.
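Calling mean() with its default arguments on the temp vector defined above gives:

```r
temp <- c(7.2, NA, 7.1, 6.9, 6.5, 5.8, 5.8, 5.5, NA, 5.5)

# default behaviour: any NA in the input makes the result NA
mean_temp <- mean(temp)
mean_temp
## [1] NA
```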
Flippin heck, what’s happened here? Why does the mean() function return an NA? Actually, R is doing something very sensible (at least in our opinion!). If a vector has a missing value then the only possible value to return when calculating a mean is NA. R doesn’t know that you perhaps want to ignore the NA values (R can’t read your mind - yet!). Happily, if we look at the help file (use help("mean") - see the next section for more details) associated with the mean() function we can see there is an argument na.rm = which is set to FALSE by default.
na.rm - a logical value indicating whether NA values should be stripped before the computation proceeds.
If we change this argument to na.rm = TRUE when we use the mean() function this will allow us to ignore the NA values when calculating the mean.
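A minimal sketch using the temp vector defined above:

```r
temp <- c(7.2, NA, 7.1, 6.9, 6.5, 5.8, 5.8, 5.5, NA, 5.5)

# na.rm = TRUE drops the two NA values before the mean is calculated,
# so the result is the mean of the 8 remaining readings
mean_temp <- mean(temp, na.rm = TRUE)
mean_temp
## [1] 6.2875

# temp itself still contains both NA values
sum(is.na(temp))
## [1] 2
```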
It’s important to note that the NA values have not been removed from our temp object (that would be bad practice), rather the mean() function has just ignored them. The point of the above is to highlight how we can change the default behaviour of a function using an appropriate argument. The problem is that not all functions will have an na.rm = argument; they might deal with NA values differently. However, the good news is that the help file associated with a function will usually tell you how missing data are handled by default.