## 2.4 Working with vectors

Manipulating, summarising and sorting data using R is an important skill to master but one which many people find a little confusing at first. We’ll go through a few simple examples here using vectors to illustrate some important concepts but will build on this in much more detail in Chapter 3 where we will look at more complicated (and useful) data structures.

### 2.4.1 Extracting elements

To extract (also known as indexing or subscripting) one or more values (more generally known as elements) from a vector we use the square bracket `[ ]` notation. The general approach is to name the object you wish to extract from, then a set of square brackets with an index of the element you wish to extract contained within the square brackets. This index can be a position or the result of a logical test.

#### Positional index

To extract elements based on their position we simply write the position inside the `[ ]`. For example, to extract the 3rd value of `my_vec`:

```
my_vec # remind ourselves what my_vec looks like
## [1] 2 3 1 6 4 3 3 7
my_vec[3] # extract the 3rd value
## [1] 1
# if you want to store this value in another object
val_3 <- my_vec[3]
val_3
## [1] 1
```

Note that the positional index starts at 1 rather than at 0, as it does in some other programming languages (e.g. Python).

We can also extract more than one value by using the `c()` function inside the square brackets. For example, `my_vec[c(1, 5, 6, 8)]` extracts the 1st, 5th, 6th and 8th elements from the `my_vec` object.
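A minimal sketch of this, assuming `my_vec` still holds the values shown earlier (it is recreated here so the example runs on its own):

```r
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# extract the 1st, 5th, 6th and 8th elements
my_vec[c(1, 5, 6, 8)]
## [1] 2 4 3 7
```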

Or we can extract a range of values using the `:` notation. For example, `my_vec[3:8]` extracts the values from the 3rd to the 8th elements.
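As a short sketch (again assuming the earlier values of `my_vec`), the `:` notation simply builds the sequence of positions for us, so it gives the same result as writing each position out with `c()`:

```r
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# extract the 3rd to the 8th elements
my_vec[3:8]
## [1] 1 6 4 3 3 7

# the same positions written out in full give the same result
identical(my_vec[3:8], my_vec[c(3, 4, 5, 6, 7, 8)])
## [1] TRUE
```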

#### Logical index

Another really useful way to extract data from a vector is to use a logical expression as an index. For example, `my_vec[my_vec > 4]` extracts all elements with a value greater than 4 in the vector `my_vec`.

Here, the logical expression is `my_vec > 4` and R will only extract those elements that satisfy this logical condition. So how does this actually work? If we look at the output of just the logical expression without the square brackets you can see that R returns a vector containing either `TRUE` or `FALSE` values, which correspond to whether the logical condition is satisfied for each element. In this case only the 4th and 8th elements return a `TRUE` as their value is greater than 4.
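A minimal sketch of both steps, assuming `my_vec` holds the values shown earlier:

```r
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# extract all elements greater than 4
my_vec[my_vec > 4]
## [1] 6 7

# the logical expression on its own returns one TRUE or FALSE per element
my_vec > 4
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
```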

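So what R is actually doing under the hood is equivalent to indexing with this logical vector directly, as in `my_vec[c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)]`, and only those elements that are `TRUE` will be extracted.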
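To convince ourselves, here is a small sketch that writes the logical vector out by hand. The object name `keep` is only an illustrative choice, and `my_vec` is assumed to hold its earlier values:

```r
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# the logical index written out by hand
keep <- c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)
my_vec[keep]
## [1] 6 7

# which() shows the positions of the TRUE values
which(keep)
## [1] 4 8

# same result as using the logical expression directly
identical(my_vec[keep], my_vec[my_vec > 4])
## [1] TRUE
```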

In addition to the `<` and `>` operators you can also use composite operators to increase the complexity of your expressions. For example the expression for ‘greater or equal to’ is `>=`. To test whether a value is equal to a value we need to use a double equals symbol `==` and for ‘not equal to’ we use `!=` (the `!` symbol means ‘not’).

```
my_vec[my_vec >= 4] # values greater or equal to 4
## [1] 6 4 7
my_vec[my_vec < 4] # values less than 4
## [1] 2 3 1 3 3
my_vec[my_vec <= 4] # values less than or equal to 4
## [1] 2 3 1 4 3 3
my_vec[my_vec == 4] # values equal to 4
## [1] 4
my_vec[my_vec != 4] # values not equal to 4
## [1] 2 3 1 6 3 3 7
```

We can also combine multiple logical expressions using Boolean operators. In R the `&` symbol means AND and the `|` symbol means OR. For example, `my_vec[my_vec < 6 & my_vec > 2]` extracts the values in `my_vec` which are less than 6 AND greater than 2, whereas `my_vec[my_vec > 6 | my_vec < 3]` extracts the values in `my_vec` that are greater than 6 OR less than 3.
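A minimal sketch of both combinations, assuming the earlier values of `my_vec`:

```r
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# values less than 6 AND greater than 2
my_vec[my_vec < 6 & my_vec > 2]
## [1] 3 4 3 3

# values greater than 6 OR less than 3
my_vec[my_vec > 6 | my_vec < 3]
## [1] 2 1 7
```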

### 2.4.2 Replacing elements

We can change the values of some elements in a vector using our `[ ]` notation in combination with the assignment operator `<-`. For example, `my_vec[4] <- 500` replaces the 4th value of our `my_vec` object, changing it from `6` to `500`.

We can also replace more than one value or even replace values based on a logical expression.
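The sketch below walks through all three kinds of replacement in turn. It assumes `my_vec` starts with the values shown earlier, and the replacement values `100` and `1000` are arbitrary choices for illustration:

```r
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# replace the 4th element with 500
my_vec[4] <- 500
my_vec
## [1]   2   3   1 500   4   3   3   7

# replace the 6th and 7th elements with 100
my_vec[c(6, 7)] <- 100
my_vec
## [1]   2   3   1 500   4 100 100   7

# replace every element less than or equal to 4 with 1000
my_vec[my_vec <= 4] <- 1000
my_vec
## [1] 1000 1000 1000  500 1000  100  100    7
```

Note that each replacement overwrites `my_vec` itself, so the original values are gone unless you stored a copy first.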

### 2.4.3 Ordering elements

In addition to extracting particular elements from a vector we can also order the values contained in a vector. To sort the values from lowest to highest value we can use the `sort()` function.

To reverse the sort, from highest to lowest, we can either include the `decreasing = TRUE` argument when using the `sort()` function, or first sort the vector using the `sort()` function and then reverse the sorted vector using the `rev()` function, as in `rev(sort(my_vec))`. This is another example of nesting one function inside another function.
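A short sketch of the three approaches. To keep the numbers easy to check we assume `my_vec` holds its original values here, rather than any values replaced in the previous section:

```r
my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

# sort from lowest to highest
sort(my_vec)
## [1] 1 2 3 3 3 4 6 7

# sort from highest to lowest
sort(my_vec, decreasing = TRUE)
## [1] 7 6 4 3 3 3 2 1

# sort, then reverse the sorted vector
rev(sort(my_vec))
## [1] 7 6 4 3 3 3 2 1
```

Neither `sort()` nor `rev()` changes `my_vec` itself; assign the result to an object if you want to keep it.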

Whilst sorting a single vector is fun, perhaps a more useful task would be to sort one vector according to the values of another vector. To do this we should use the `order()` function in combination with `[ ]`. To demonstrate this let’s create a vector called `height` containing the height of 5 different people and another vector called `p.names` containing the names of these people (so Joanna is 180 cm, Charlotte is 155 cm etc).

```
height <- c(180, 155, 160, 167, 181)
height
## [1] 180 155 160 167 181
p.names <- c("Joanna", "Charlotte", "Helen", "Karen", "Amy")
p.names
## [1] "Joanna" "Charlotte" "Helen" "Karen" "Amy"
```

Our goal is to order the people in `p.names` in ascending order of their `height`. The first thing we’ll do is use the `order()` function with the `height` variable to create a vector called `height_ord`, that is `height_ord <- order(height)`.
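A minimal sketch of this step, recreating `height` so the example runs on its own:

```r
height <- c(180, 155, 160, 167, 181)

# positions of the elements of height, from smallest to largest value
height_ord <- order(height)
height_ord
## [1] 2 3 4 1 5
```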

OK, what’s going on here? Printing `height_ord` gives the values 2, 3, 4, 1 and 5. The first value, `2`, (remember to ignore the `[1]` at the start of the output) should be read as ‘the smallest value of `height` is the second element of the `height` vector’. If we check this by looking at the `height` vector above, you can see that element 2 has a value of 155, which is the smallest value. The second smallest value in `height` is the 3rd element of `height`, which when we check is 160 and so on. The largest value of `height` is element `5` which is 181. Now that we have a vector of the positional indices of heights in ascending order (`height_ord`), we can extract these values from our `p.names` vector in this order using `p.names[height_ord]`.
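As a sketch, using the two vectors created above (the object name `names_ord` is just an illustrative choice):

```r
height <- c(180, 155, 160, 167, 181)
p.names <- c("Joanna", "Charlotte", "Helen", "Karen", "Amy")
height_ord <- order(height)

# use the ordered positions of height to index p.names
names_ord <- p.names[height_ord]
names_ord
## [1] "Charlotte" "Helen" "Karen" "Joanna" "Amy"
```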

You’re probably thinking ‘what’s the use of this?’ Well, imagine you have a dataset which contains two columns of data and you want to sort each column. If you just use `sort()` to sort each column separately, the values of each column will become uncoupled from each other. By using `order()` on one column, we create a vector of positional indices that puts the values of that column in ascending order. This vector can then be used as the index for the second column, which returns the values of the second column arranged according to the order of the first.
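The sketch below illustrates the difference using our two vectors as stand-ins for the two columns:

```r
height <- c(180, 155, 160, 167, 181)
p.names <- c("Joanna", "Charlotte", "Helen", "Karen", "Amy")

# sorting each vector separately breaks the pairing between them
sort(height)
## [1] 155 160 167 180 181
sort(p.names)
## [1] "Amy" "Charlotte" "Helen" "Joanna" "Karen"
# position 1 now wrongly pairs Amy with 155 cm (Amy is 181 cm)

# using one set of indices for both vectors keeps each person with their height
height_ord <- order(height)
height[height_ord]
## [1] 155 160 167 180 181
p.names[height_ord]
## [1] "Charlotte" "Helen" "Karen" "Joanna" "Amy"
```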

### 2.4.4 Vectorisation

One of the great things about R functions is that most of them are vectorised. This means that the function will operate on all elements of a vector without needing to apply the function on each element separately. For example, to multiply each element of a vector by 5 we can simply use

```
# create a vector
my_vec2 <- c(3, 5, 7, 1, 9, 20)
# multiply each element by 5
my_vec2 * 5
## [1] 15 25 35 5 45 100
```

Or we can add the elements of two or more vectors

```
# create a second vector
my_vec3 <- c(17, 15, 13, 19, 11, 0)
# add both vectors
my_vec2 + my_vec3
## [1] 20 20 20 20 20 20
# multiply both vectors
my_vec2 * my_vec3
## [1] 51 75 91 19 99 0
```

However, you must be careful when using vectorisation with vectors of different lengths as R will quietly recycle the elements in the shorter vector rather than throw a wobbly (error).
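A short sketch of recycling in action. The vectors `my_vec4` and `my_vec5` are made up for illustration. Note that R stays completely silent only when the length of the longer vector is a multiple of the length of the shorter one; otherwise it still recycles but also issues a warning.

```r
my_vec2 <- c(3, 5, 7, 1, 9, 20)

# length 2 divides length 6 exactly: c(1, 2) is silently reused three times
my_vec4 <- c(1, 2)
my_vec2 + my_vec4
## [1]  4  7  8  3 10 22

# length 4 does not divide length 6: R recycles anyway but warns that the
# longer object length is not a multiple of the shorter object length
my_vec5 <- c(1, 2, 3, 4)
my_vec2 + my_vec5
## [1]  4  7 10  5 10 22
```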

### 2.4.5 Missing data

In R, missing data is usually represented by an `NA` symbol meaning ‘Not Available’. Data may be missing for a whole bunch of reasons: maybe your machine broke down, maybe you broke down, maybe the weather was too bad to collect data on a particular day etc etc. Missing data can be a pain in the proverbial both from an R perspective and also a statistical perspective. From an R perspective missing data can be problematic as different functions deal with missing data in different ways. For example, let’s say we collected air temperature readings over 10 days, but our thermometer broke on day 2 and again on day 9 so we have no data for those days.

```
temp <- c(7.2, NA, 7.1, 6.9, 6.5, 5.8, 5.8, 5.5, NA, 5.5)
temp
## [1] 7.2 NA 7.1 6.9 6.5 5.8 5.8 5.5 NA 5.5
```

We now want to calculate the mean temperature over these days using the `mean()` function, by running `mean(temp)`.
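A minimal sketch of this call, recreating `temp` so the example runs on its own:

```r
temp <- c(7.2, NA, 7.1, 6.9, 6.5, 5.8, 5.8, 5.5, NA, 5.5)

# the mean of a vector containing missing values
mean(temp)
## [1] NA
```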

Flippin heck, what’s happened here? Why does the `mean()` function return an `NA`? Actually, R is doing something very sensible (at least in our opinion!). If a vector has a missing value then the only possible value to return when calculating a mean is `NA`. R doesn’t know that you perhaps want to ignore the `NA` values (R can’t read your mind - yet!). Happily, if we look at the help file (use `help("mean")` - see the next section for more details) associated with the `mean()` function we can see there is an argument `na.rm =` which is set to `FALSE` by default.

`na.rm` - a logical value indicating whether `NA` values should be stripped before the computation proceeds.

If we change this argument to `na.rm = TRUE` when we use the `mean()` function this will allow us to ignore the `NA` values when calculating the mean.
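A minimal sketch, using the same `temp` vector as before:

```r
temp <- c(7.2, NA, 7.1, 6.9, 6.5, 5.8, 5.8, 5.5, NA, 5.5)

# ignore the NA values: the mean of the 8 remaining readings
mean(temp, na.rm = TRUE)
## [1] 6.2875
```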

It’s important to note that the `NA` values have not been removed from our `temp` object (that would be bad practice), rather the `mean()` function has just ignored them. The point of the above is to highlight how we can change the default behaviour of a function using an appropriate argument. The problem is that not all functions will have an `na.rm =` argument; they might deal with `NA` values differently. However, the good news is that the help file associated with a function will usually tell you how missing data are handled by default.
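As a quick check, the sketch below confirms that `temp` is unchanged after calling `mean()` with `na.rm = TRUE`. It also shows `length()`, one example of a function that has no `na.rm =` argument and simply counts `NA` values along with everything else:

```r
temp <- c(7.2, NA, 7.1, 6.9, 6.5, 5.8, 5.8, 5.5, NA, 5.5)
mean(temp, na.rm = TRUE)
## [1] 6.2875

# temp still contains all 10 values, including the missing ones
length(temp)
## [1] 10

# is.na() flags the missing values, so summing it counts them
sum(is.na(temp))
## [1] 2
```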