2.4 Working with vectors

Manipulating, summarising and sorting data using R is an important skill to master but one which many people find a little confusing at first. We’ll go through a few simple examples here using vectors to illustrate some important concepts but will build on this in much more detail in Chapter 3 where we will look at more complicated (and useful) data structures.

 

Take a look at this video for a quick introduction to working with vectors in R using positional and logical indexes

2.4.1 Extracting elements

To extract (also known as indexing or subscripting) one or more values (more generally known as elements) from a vector we use the square bracket [ ] notation. The general approach is to name the object you wish to extract from, then a set of square brackets with an index of the element you wish to extract contained within the square brackets. This index can be a position or the result of a logical test.

Positional index

To extract elements based on their position we simply write the position inside the [ ]. For example, to extract the 3rd value of my_vec

my_vec        # remind ourselves what my_vec looks like
## [1] 2 3 1 6 4 3 3 7
my_vec[3]     # extract the 3rd value
## [1] 1

# if you want to store this value in another object
val_3 <- my_vec[3]
val_3
## [1] 1

Note that the positional index starts at 1 rather than 0 like some other programming languages (i.e. Python).

We can also extract more than one value by using the c() function inside the square brackets. Here we extract the 1st, 5th, 6th and 8th element from the my_vec object

my_vec[c(1, 5, 6, 8)]
## [1] 2 4 3 7

Or we can extract a range of values using the : notation. To extract the values from the 3rd to the 8th elements.

my_vec[3:8]
## [1] 1 6 4 3 3 7

Logical index

Another really useful way to extract data from a vector is to use a logical expression as an index. For example, to extract all elements with a value greater than 4 in the vector my_vec

my_vec[my_vec > 4]
## [1] 6 7

Here, the logical expression is my_vec > 4 and R will only extract those elements that satisfy this logical condition. So how does this actually work? If we look at the output of just the logical expression without the square brackets you can see that R returns a vector containing either TRUE or FALSE which correspond to whether the logical condition is satisfied for each element. In this case only the 4th and 8th elements return a TRUE as their value is greater than 4.

my_vec > 4
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

So what R is actually doing under the hood is equivalent to

my_vec[c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)]
## [1] 6 7

and only those element that are TRUE will be extracted.

In addition to the < and > operators you can also use composite operators to increase the complexity of your expressions. For example the expression for ‘greater or equal to’ is >=. To test whether a value is equal to a value we need to use a double equals symbol == and for ‘not equal to’ we use != (the ! symbol means ‘not’).

my_vec[my_vec >= 4]        # values greater or equal to 4
## [1] 6 4 7
my_vec[my_vec < 4]         # values less than 4
## [1] 2 3 1 3 3
my_vec[my_vec <= 4]        # values less than or equal to 4
## [1] 2 3 1 4 3 3
my_vec[my_vec == 4]        # values equal to 4
## [1] 4
my_vec[my_vec != 4]        # values not equal to 4
## [1] 2 3 1 6 3 3 7

We can also combine multiple logical expressions using Boolean expressions. In R the & symbol means AND and the | symbol means OR. For example, to extract values in my_vec which are less than 6 AND greater than 2

val26 <- my_vec[my_vec < 6 & my_vec > 2]
val26
## [1] 3 4 3 3

or extract values in my_vec that are greater than 6 OR less than 3.

val63 <- my_vec[my_vec > 6 | my_vec < 3]
val63
## [1] 2 1 7

2.4.2 Replacing elements

We can change the values of some elements in a vector using our [ ] notation in combination with the assignment operator <-. For example, to replace the 4th value of our my_vec object from 6 to 500

my_vec[4] <- 500
my_vec
## [1]   2   3   1 500   4   3   3   7

We can also replace more than one value or even replace values based on a logical expression.

# replace the 6th and 7th element with 100
my_vec[c(6, 7)] <- 100
my_vec
## [1]   2   3   1 500   4 100 100   7

# replace element that are less than or equal to 4 with 1000
my_vec[my_vec <= 4] <- 1000
my_vec
## [1] 1000 1000 1000  500 1000  100  100    7

2.4.3 Ordering elements

In addition to extracting particular elements from a vector we can also order the values contained in a vector. To sort the values from lowest to highest value we can use the sort() function.

vec_sort <- sort(my_vec)
vec_sort
## [1]    7  100  100  500 1000 1000 1000 1000

To reverse the sort, from highest to lowest, we can either include the decreasing = TRUE argument when using the sort() function

vec_sort2 <- sort(my_vec, decreasing = TRUE)
vec_sort2
## [1] 1000 1000 1000 1000  500  100  100    7

or first sort the vector using the sort() function and then reverse the sorted vector using the rev() function. This is another example of nesting one function inside another function.

vec_sort3 <- rev(sort(my_vec))
vec_sort3
## [1] 1000 1000 1000 1000  500  100  100    7

Whilst sorting a single vector is fun, perhaps a more useful task would be to sort one vector according to the values of another vector. To do this we should use the order() function in combination with [ ]. To demonstrate this let’s create a vector called height containing the height of 5 different people and another vector called p.names containing the names of these people (so Joanna is 180 cm, Charlotte is 155 cm etc).

height <- c(180, 155, 160, 167, 181)
height
## [1] 180 155 160 167 181

p.names <- c("Joanna", "Charlotte", "Helen", "Karen", "Amy")
p.names
## [1] "Joanna"    "Charlotte" "Helen"     "Karen"     "Amy"

Our goal is to order the people in p.names in ascending order of their height. The first thing we’ll do is use the order() function with the height variable to create a vector called height_ord.

height_ord <- order(height)
height_ord
## [1] 2 3 4 1 5

OK, what’s going on here? The first value, 2, (remember ignore [1]) should be read as ‘the smallest value of height is the second element of the height vector’. If we check this by looking at the height vector above, you can see that element 2 has a value of 155, which is the smallest value. The second smallest value in height is the 3rd element of height, which when we check is 160 and so on. The largest value of height is element 5 which is 181. Now that we have a vector of the positional indices of heights in ascending order (height_ord), we can extract these values from our p.names vector in this order.

names_ord <- p.names[height_ord]
names_ord
## [1] "Charlotte" "Helen"     "Karen"     "Joanna"    "Amy"

You’re probably thinking ‘what’s the use of this?’ Well, imagine you have a dataset which contains two columns of data and you want to sort each column. If you just use sort() to sort each column separately, the values of each column will become uncoupled from each other. By using the order() on one column, a vector of positional indices is created of the values of the column in ascending order. This vector can be used on the second column, as the index of elements which will return a vector of values based on the first column.

2.4.4 Vectorisation

One of the great things about R functions is that most of them are vectorised. This means that the function will operate on all elements of a vector without needing to apply the function on each element separately. For example, to multiple each element of a vector by 5 we can simply use

# create a vector
my_vec2 <- c(3, 5, 7, 1, 9, 20)

# multiply each element by 5
my_vec2 * 5
## [1]  15  25  35   5  45 100

Or we can add the elements of two or more vectors

# create a second vector
my_vec3 <- c(17, 15, 13, 19, 11, 0)

# add both vectors
my_vec2 + my_vec3
## [1] 20 20 20 20 20 20

# multiply both vectors
my_vec2 * my_vec3
## [1] 51 75 91 19 99  0

However, you must be careful when using vectorisation with vectors of different lengths as R will quietly recycle the elements in the shorter vector rather than throw a wobbly (error).

# create a third vector
my_vec4 <- c(1, 2)

# add both vectors - quiet recycling!
my_vec2 + my_vec4
## [1]  4  7  8  3 10 22

2.4.5 Missing data

In R, missing data is usually represented by an NA symbol meaning ‘Not Available’. Data may be missing for a whole bunch of reasons, maybe your machine broke down, maybe you broke down, maybe the weather was too bad to collect data on a particular day etc etc. Missing data can be a pain in the proverbial both from an R perspective and also a statistical perspective. From an R perspective missing data can be problematic as different functions deal with missing data in different ways. For example, let’s say we collected air temperature readings over 10 days, but our thermometer broke on day 2 and again on day 9 so we have no data for those days.

temp  <- c(7.2, NA, 7.1, 6.9, 6.5, 5.8, 5.8, 5.5, NA, 5.5)
temp
##  [1] 7.2  NA 7.1 6.9 6.5 5.8 5.8 5.5  NA 5.5

We now want to calculate the mean temperature over these days using the mean() function.

mean_temp <- mean(temp)
mean_temp
## [1] NA

Flippin heck, what’s happened here? Why does the mean() function return an NA? Actually, R is doing something very sensible (at least in our opinion!). If a vector has a missing value then the only possible value to return when calculating a mean is NA. R doesn’t know that you perhaps want to ignore the NA values (R can’t read your mind - yet!). Happily, if we look at the help file (use help("mean") - see the next section for more details) associated with the mean() function we can see there is an argument na.rm = which is set to FALSE by default.

na.rm - a logical value indicating whether NA values should be stripped before the computation proceeds.

If we change this argument to na.rm = TRUE when we use the mean() function this will allow us to ignore the NA values when calculating the mean.

mean_temp <- mean(temp, na.rm = TRUE)
mean_temp
## [1] 6.2875

It’s important to note that the NA values have not been removed from our temp object (that would be bad practice), rather the mean() function has just ignored them. The point of the above is to highlight how we can change the default behaviour of a function using an appropriate argument. The problem is that not all functions will have an na.rm = argument, they might deal with NA values differently. However, the good news is that every help file associated with any function will always tell you how missing data are handled by default.