7.2 Functions in R

Functions are your loyal servants, waiting patiently to do your bidding to the best of their ability. They’re made with the utmost care and attention … though sometimes they may end up being something of a Frankenstein’s monster, with an extra limb or two and a head put on backwards. But no matter how ugly they may be, they’re completely faithful to you.

They’re also very stupid.

If we asked you to go to the supermarket to get us some ingredients to make Francesinha, even if you don’t know what the heck that is, you’d be able to guess and bring at least something back. Or you could decide to make something else. Or you could ask a celebrity chef for help. Or you could pull out your phone and search online for what Francesinha is. The point is, even if we didn’t give you enough information to do the task, you’re intelligent enough to, at the very least, try to find a workaround.

If, instead, we asked our loyal function to do the same, it would listen intently to our request, stand still for a few milliseconds, compose itself, and then start shouting Error: 'data' must be a data frame, or other object .... It would then repeat this every single time we asked it to do the job. The point here is that code and functions are not intelligent. They cannot find workarounds. They are totally reliant on you to tell them, very explicitly and step by step, what they need to do.

Remember two things: the intelligence of code comes from the coder, not the computer; and functions need exact instructions to work.

To prevent functions from being too stupid you must provide the information the function needs in order for it to function. As with the Francesinha example, if we’d supplied a recipe list to the function, it would have managed just fine. We call this “fulfilling an argument”. The vast majority of functions require the user to fulfill at least one argument.

This can be illustrated in the pseudocode below. When we make a function we can specify what arguments the user must fulfill (e.g. argument1 and argument2), as well as what to do once it has this information (expression):

nameOfFunction <- function(argument1, argument2, ...) {expression}

The first thing to note is that we’ve used the function function() to create a new function called nameOfFunction. To walk through the above code: within the round brackets we specify what information (i.e. arguments) the function requires to run (as many or as few as needed). These arguments are then passed to the expression part of the function. The expression can be any valid R command or set of R commands and is usually contained between a pair of braces { } (if the body of the function is a single expression you can omit the braces). Once you run the above code, you can then use your new function by typing:

nameOfFunction(argument1, argument2)
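
As a minimal sketch of the template above, here is a tiny function written both with and without braces. The names add_one, add_one_short and value are hypothetical, chosen only for illustration. The last few lines show what happens when a required argument is not fulfilled: R raises an error, which we capture here with tryCatch() so the script keeps running.

```r
# Hypothetical example function: takes one argument and adds 1 to it
add_one <- function(value) {
  value + 1
}

# The body is a single expression, so the braces are optional
add_one_short <- function(value) value + 1

# Both versions are used in exactly the same way
add_one(value = 5)
add_one_short(value = 5)

# Calling the function without its argument fails; tryCatch() lets us
# capture the error message instead of stopping the script
missing_arg_message <- tryCatch(
  add_one(),
  error = function(e) conditionMessage(e)
)
missing_arg_message
```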

Confused? Let’s work through an example to help clear things up.

First we are going to create a data frame called city, where columns porto, aberdeen, nairobi, and genoa are filled with 100 random values drawn from a bag (using the rnorm() function to draw random values from a Normal distribution with mean 0 and standard deviation of 1). We also include a “problem”, for us to solve later, by including 10 NA values within the nairobi column (using rep(NA, 10)).

city <- data.frame(
  porto = rnorm(100),
  aberdeen = rnorm(100),
  nairobi = c(rep(NA, 10), rnorm(90)),
  genoa = rnorm(100)
)
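
Because rnorm() draws random values, the numbers you get will differ from run to run. If you would like the same values every time, one option is to call set.seed() first. The sketch below assumes an arbitrary seed of 123 (any number would do), so its values will not match the output printed in this chapter. It also runs two quick checks on the data frame we have just described.

```r
# Fix the random number generator so the same values are drawn each run
# (the seed value 123 is an arbitrary choice)
set.seed(123)

city <- data.frame(
  porto = rnorm(100),
  aberdeen = rnorm(100),
  nairobi = c(rep(NA, 10), rnorm(90)),
  genoa = rnorm(100)
)

# Check the dimensions: we expect 100 rows and 4 columns
dim(city)

# Count the missing values in each column: only nairobi should have any
colSums(is.na(city))
```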

Let’s say that you want to multiply the values in the variables porto and aberdeen and create a new object called porto_aberdeen. We can do this “by hand” using:

porto_aberdeen <- city$porto * city$aberdeen

We’ve now created an object called porto_aberdeen by multiplying the vectors city$porto and city$aberdeen. Simple. If this was all we needed to do, we could stop here. R works with vectors, so doing these kinds of operations in R is actually much simpler than in many other programming languages, where this type of code might require loops (we say that R is a vectorised language). Something to keep in mind for later is that doing these kinds of operations with loops can be much slower compared to vectorisation.
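
To make the comparison concrete, here is a rough sketch of the same element-by-element multiplication written once in vectorised form and once with a loop. The vectors a and b are small made-up examples, not part of the city data frame.

```r
# Two small example vectors (hypothetical values)
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

# Vectorised: one line multiplies the vectors element by element
vectorised_result <- a * b

# Loop: create an empty result vector, then fill it one element at a time
loop_result <- numeric(length(a))
for (i in seq_along(a)) {
  loop_result[i] <- a[i] * b[i]
}

# Both approaches give the same values
identical(vectorised_result, loop_result)
```

The loop needs four lines to do what the vectorised version does in one, and on large vectors it will usually be slower too.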

But what if we want to repeat this multiplication many times? Let’s say we wanted to multiply columns porto and aberdeen, aberdeen and genoa, and nairobi and genoa. In this case we could copy and paste the code, replacing the relevant information.

porto_aberdeen <- city$porto * city$aberdeen
aberdeen_genoa <- city$aberdeen * city$aberdeen
nairobi_genoa <- city$nairobi * city$genoa

While this approach works, it’s easy to make mistakes. In fact, here we’ve “forgotten” to change aberdeen to genoa in the second line of code when copying and pasting. This is where writing a function comes in handy. If we were to write this as a function, there is only one source of potential error (within the function itself) instead of many copy-pasted lines of code (which we also cut down on by using a function).

In this case, we’re using some fairly trivial code where it’s maybe hard to make a genuine mistake. But what if we increased the complexity?

city$porto * city$aberdeen / city$porto + (city$porto * 10^(city$aberdeen)) -
  city$aberdeen - (city$porto * sqrt(city$aberdeen + 10))

Now imagine having to copy and paste this three times, and in each case having to change the porto and aberdeen variables (especially if we had to do it more than three times).

What we could do instead is generalise our code for x and y columns instead of naming specific cities. If we did this, we could recycle the x * y code. Whenever we wanted to multiply columns together, we assign a city to either x or y. We’ll assign the multiplication to the objects porto_aberdeen and aberdeen_nairobi so we can come back to them later.

# Assign x and y values
x <- city$porto
y <- city$aberdeen

# Use multiplication code
porto_aberdeen <- x * y

# Assign new x and y values
x <- city$aberdeen
y <- city$nairobi

# Reuse multiplication code
aberdeen_nairobi <- x * y

This is essentially what a function does. OK down to business, let’s call our new function multiply_columns() and define it with two arguments, x and y. In the function code we simply return the value of x * y using the return() function. Using the return() function is not strictly necessary in this example as R will automatically return the value of the last line of code in our function. We include it here to make this explicit.

multiply_columns <- function(x, y) {
  return(x * y)
}
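
To see that return() really is optional here, the sketch below defines a second version of the function that relies on R returning the value of the last expression. The name multiply_columns_implicit is hypothetical, used only so that both versions can sit side by side.

```r
# Version with an explicit return(), as defined in the text
multiply_columns <- function(x, y) {
  return(x * y)
}

# Hypothetical variant: no return(), so R returns the value of the
# last (and only) expression in the function body
multiply_columns_implicit <- function(x, y) {
  x * y
}

# Both versions give the same result for the same inputs
identical(
  multiply_columns(x = 2:4, y = 5:7),
  multiply_columns_implicit(x = 2:4, y = 5:7)
)
```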

Now that we’ve defined our function we can use it. Let’s use the function to multiply the columns city$porto and city$aberdeen and assign the result to a new object called porto_aberdeen_func.

porto_aberdeen_func <- multiply_columns(x = city$porto, y = city$aberdeen)
porto_aberdeen_func
##   [1]  0.810772253  0.776314924 -1.050772718  1.876789255 -0.573413105
##   [6] -1.450528253  0.074028141  0.206610946  0.003716581  0.006331698
##  [11] -0.130728313 -0.125540444 -1.530147542 -1.279255376 -0.139219922
##  [16] -0.685148391 -0.152873229 -0.456245391  0.078221628  1.077202046
##  [21] -0.290998052 -0.008419029 -0.269453982 -1.023347307  0.742947653
##  [26]  0.232643941 -0.417828889  0.647897696 -1.593777480 -0.838914479
##  [31]  0.154984035 -0.044832339  0.593448130  0.800119805 -0.391219901
##  [36] -1.784810824  2.749534518  0.060978538  1.060162337 -0.050776211
##  [41]  0.865348265 -0.108787772  0.005085782 -0.015114437 -0.385855929
##  [46] -1.153896561  0.263904553  0.311670114 -0.064650407 -0.179069082
##  [51]  0.139916930  0.037112934 -0.190068331 -0.720278960 -0.164959184
##  [56]  0.147605871 -0.137591185  0.010120274  0.397249167 -0.008672550
##  [61] -0.768838625 -0.459626035 -0.182809006  2.865878306 -1.618576682
##  [66] -0.081581710 -1.276716658  1.197026320  0.314031879 -0.610246103
##  [71]  0.033465834  0.006078090 -0.147278256 -0.767849745  0.052828909
##  [76]  2.609447528 -0.214711996 -1.017728375  0.002334648 -0.587062073
##  [81] -0.715181336 -2.165668199 -0.339636043  0.121710383  0.158292572
##  [86] -0.038396354 -0.759252979 -0.005293516  0.147660832 -0.071157378
##  [91]  0.351060966  0.305738573  0.608519207 -0.095549685  1.323367119
##  [96]  0.684816245  0.012798506 -0.019045682  0.004099720  0.051902054

If we’re only interested in multiplying city$porto and city$aberdeen, it would be overkill to create a function to do something once. However, the benefit of creating a function is that we now have that function added to our environment which we can use as often as we like. We also have the code to create the function, meaning we can use it in completely new projects, reducing the amount of code that has to be written (and retested) from scratch each time. As a rule of thumb, you should consider writing a function whenever you’ve copied and pasted a block of code more than twice.

To satisfy ourselves that the function has worked properly, we can compare the porto_aberdeen variable with our new variable porto_aberdeen_func using the identical() function. The identical() function tests whether two objects are exactly identical and returns either a TRUE or FALSE value. Use ?identical if you want to know more about this function.

identical(porto_aberdeen, porto_aberdeen_func)
## [1] TRUE

And we confirm that the function has produced the same result as when we do the calculation manually. We recommend getting into the habit of checking that the function you’ve created works the way you think it does.
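
One caveat worth knowing when you run this kind of check: identical() is strict, so two numbers that differ only by a tiny floating-point rounding error are not considered identical. It worked above because both objects came from exactly the same calculation. When two results are reached by different routes, all.equal() may be the more forgiving choice. The sketch below uses made-up values purely to illustrate the difference.

```r
# Two ways of arriving at "the same" number
a <- 0.1 + 0.2
b <- 0.3

# identical() compares exactly, so a tiny rounding difference counts
identical(a, b)

# all.equal() allows a small numerical tolerance; wrap it in isTRUE()
# because it returns a description of the difference rather than FALSE
isTRUE(all.equal(a, b))
```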

Now let’s use our multiply_columns() function to multiply columns aberdeen and nairobi. Notice now that argument x is given the value city$aberdeen and y the value city$nairobi.

aberdeen_nairobi_func <- multiply_columns(x = city$aberdeen, y = city$nairobi)
aberdeen_nairobi_func
##   [1]           NA           NA           NA           NA           NA
##   [6]           NA           NA           NA           NA           NA
##  [11]  0.076719817 -0.407269922  0.997915540 -0.141323946  0.479459415
##  [16]  0.978115020  0.252538829  0.195906974 -0.222690256  1.112446387
##  [21] -1.010252726  0.474014792 -0.279776995  0.088401770 -1.718169811
##  [26] -0.368192477 -0.568976777 -0.149661642  0.059338299  0.549367776
##  [31] -0.309321527 -0.030589236 -0.830838972  0.360122382 -0.003456745
##  [36]  0.681367328 -1.103605585 -0.077554380 -0.637249275 -0.046379677
##  [41] -0.541475081 -0.770363298 -0.001722887  0.011593618  0.305518595
##  [46] -0.238725447 -1.328340077  0.004472346  0.055968441  0.352372084
##  [51]  0.905912089 -0.456386327  0.257811653  0.511326584 -0.372282055
##  [56] -0.136801001 -0.190454286 -0.871633953 -0.173728644 -0.013140105
##  [61] -0.276677728  1.045153785 -0.282407801  0.951665307 -4.836267755
##  [66] -0.800478413 -0.171434654 -2.473511192 -0.204467134 -0.753771014
##  [71] -0.086945898 -0.024146025 -0.087305424  1.436582002 -0.140783495
##  [76] -1.751029539  0.021366544  0.258306479 -0.002563041 -1.327754701
##  [81] -0.257049577 -0.011980050 -0.318872631 -0.110141911  0.009663467
##  [86]  0.093308045 -0.256268494  0.002680386 -0.242139210 -3.649490533
##  [91] -0.807378905 -1.760587428  0.465358642  0.136950099  0.526348414
##  [96] -0.439401135  0.122950199 -0.078183239  0.476755132 -0.014341594

So far so good. All we’ve really done is wrapped the code x * y into a function, where we ask the user to specify what their x and y variables are.
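
Coming back to the copy-and-paste example from earlier, here is a sketch of how the three multiplications might look when written with the function. The city data frame and multiply_columns() are recreated so the block runs on its own. Every call has the same shape, which makes it harder to repeat the mistake of leaving the wrong column in place.

```r
# Recreate the data frame and the function so this block is self-contained
city <- data.frame(
  porto = rnorm(100),
  aberdeen = rnorm(100),
  nairobi = c(rep(NA, 10), rnorm(90)),
  genoa = rnorm(100)
)

multiply_columns <- function(x, y) {
  return(x * y)
}

# The three multiplications, each with its columns named explicitly
porto_aberdeen <- multiply_columns(x = city$porto, y = city$aberdeen)
aberdeen_genoa <- multiply_columns(x = city$aberdeen, y = city$genoa)
nairobi_genoa <- multiply_columns(x = city$nairobi, y = city$genoa)

# Check one of the results against the calculation done "by hand"
identical(aberdeen_genoa, city$aberdeen * city$genoa)
```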

Now let’s add a little complexity. If you look at the output of aberdeen_nairobi_func, some of the calculations have produced NA values. This is because of those NA values we included in nairobi when we created the city data frame. Despite these NA values, the function appeared to work, and it gave us no indication that there might be a problem. In such cases we may prefer it to warn us that something is wrong. How can we get the function to let us know when NA values are produced? Here’s one way.

multiply_columns <- function(x, y) {
  temp_var <- x * y
  if (any(is.na(temp_var))) {
    warning("The function has produced NAs")
    return(temp_var)
  } else {
    return(temp_var)
  }
}

aberdeen_nairobi_func <- multiply_columns(city$aberdeen, city$nairobi)
## Warning in multiply_columns(city$aberdeen, city$nairobi): The function has
## produced NAs
porto_aberdeen_func <- multiply_columns(city$porto, city$aberdeen)

The core of our function is still the same. We still have x * y, but we’ve now got an extra six lines of code. Namely, we’ve included some conditional statements, if and else, to test whether any NAs have been produced and, if they have, to display a warning message to the user. The next section of this chapter will explain how these work and how to use them.
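
As one possible variation, and not the only way to do it, the sketch below reports how many NAs were produced. The function name multiply_columns_count and the small x and y vectors are hypothetical. Because the result is returned whether or not a warning is raised, this version uses a single if without an else. We use tryCatch() only to capture the warning text so we can inspect it.

```r
# Hypothetical variant of multiply_columns() that counts the NAs
multiply_columns_count <- function(x, y) {
  temp_var <- x * y
  n_missing <- sum(is.na(temp_var))
  if (n_missing > 0) {
    # warning() pastes its arguments together into a single message
    warning("The function has produced ", n_missing, " NAs")
  }
  # The result is returned either way
  temp_var
}

# Small made-up vectors: two of the three products will be NA
x <- c(NA, 2, 3)
y <- c(1, NA, 3)

# Capture the warning message instead of printing it
caught <- tryCatch(
  multiply_columns_count(x, y),
  warning = function(w) conditionMessage(w)
)
caught

# The function still returns the result; here we silence the warning
result <- suppressWarnings(multiply_columns_count(x, y))
result
```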