Week 5

An intro to programming with R

Up to this point, we have focused on using the functions defined in the tidyverse family of packages, as well as those defined in base R. However, there will likely come a time when you want or need to do something in R for which you cannot find a package that has pre-built functions to accomplish your goal. You will therefore need to do a little programming, which may include writing your own functions. This lab is meant to introduce you to some useful concepts and functions to help you accomplish custom tasks.

A motivating example

Let’s pretend we are population biologists interested in simulating population growth based on the following model of population density over time:

\[ N_t = N_{t-1}\lambda e^{-\alpha N_{t-1} + \varepsilon_t} \tag{1}\]

where

  • \(N_t\) is the population density at time \(t\)
  • \(\lambda > 0\) is the intrinsic growth factor
  • \(\alpha > 0\) is the intra-specific competition effect
  • \(\varepsilon_t \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)\) for all \(t=1,2,...\), adds demographic stochasticity.

How do we simulate such dynamics?

The initial conditions

First, we need a starting point, and we need to specify values for the parameters \(\lambda\), \(\alpha\), and \(\sigma\). Let’s specify some arbitrary parameter values based on known limits for a stable population.

library(tidyverse)

set.seed(55688)
alpha <- 0.002
lambda <- 1.5
sigma <- 0.1

# specify initial population density
N0 <- 10

Okay, given the initial state of the population, \(N_0 = 10\), as well as some arbitrary parameter values, let’s simulate the population density at time \(t=1\) using Equation 1.

(N1 <- N0 * lambda * exp(-alpha * N0 + rnorm(1, sd = sigma)))
[1] 14.51858

Cool, now given \(N_1\), we can again simulate the population density for \(N_2\).

(N2 <- N1 * lambda * exp(-alpha * N1 + rnorm(1, sd = sigma)))
[1] 21.44161

Okay, this is pretty easy, but what if we want to do this for 100 or 1000 time steps? We obviously don’t want to write 1000 lines of repeating code 😳.

Repetitive tasks

Tackling repetitive tasks is something computers are really good at, so, naturally, there are many ways to accomplish them in R. We will apply two base R options and one tidyverse option to the motivating example.

Loops 🔁

Loops are the bread and butter of many programs, but you will often hear advice to avoid loops in R. This is because loops can be slow in R, and code built around complicated loops that run over many iterations can get bogged down. However, most people find loops more intuitive than the other options in R, so we will start there.

for() loops

The structure of a for() loop in R is

for(<counter> in <list>){
  # code to run each time goes here
}

where we replace <counter> with the name of an object that changes with each iteration, and <list> with a list or vector that determines how the counter changes over the iterations. For example:

(list_of_values <- 1:4)
[1] 1 2 3 4
for(i in list_of_values){
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4

The list need not be numeric, consecutive, or formatted in any particular way; the only requirement is that it is a vector of some class. For example,

office_chars <- c("Jim", "Pam", "Dwight", "Michael")

for(name in office_chars){
  writeLines(name)
}
Jim
Pam
Dwight
Michael
Your Turn

Work in pairs for 5 minutes to write, in words (or pseudocode), how you think the loop should be structured to simulate a population over 200 time steps using Equation 1. What should the list that we iterate over be, and what should go inside the curly braces?

# initialize an empty vector with 200 time steps
N <- vector(mode = "double", length = 200)

# add initial population density
N[1] <- N0

# loop through the remainder
# with the vector to loop over as the time steps
for(t in 2:200){
  N[t] <- N[t-1] * lambda * exp(-alpha * N[t-1] + rnorm(1, sd = sigma))
}

Okay, let’s add this to our R script, but let’s be sure not to hard-code any values into the script. This will make it easier to change the simulation later on.

# define variable for the number of time steps
steps <- 200

# initialize an empty vector with 200 time steps
N <- vector(mode = "double", length = steps)

# add initial population density
N[1] <- N0

# loop through the remainder
for(t in 2:steps){
  N[t] <- N[t-1] * lambda * exp(-alpha * N[t-1] + rnorm(1, sd = sigma))
}

ggplot(data = data.frame(t = 1:steps, N = N)) +
  geom_point(aes(x = t, y = N)) +
  geom_line(aes(x = t, y = N)) +
  theme_bw()
Figure 1: Simulated population density with \(\alpha = 0.002,\ \lambda = 1.5\), and \(\sigma^2 = 0.01\).

Cool!🤘

while() loops

Another type of loop is the while() loop, which continues to run as long as the condition specified inside the while() call evaluates to TRUE.

Warning

While loops, if not constructed carefully, can result in infinite loops that will never stop running! It’s important to check your while loops on small test cases before trying to deploy them to bigger jobs so you don’t end up waiting for something to finish running that will never actually finish.

The structure of a while loop in R is

while(<logical expression>){
  # do some task that changes the conditions
  # evaluated in the logical expression
}

This is a bit harder to understand right off the bat, in my opinion, so here are a couple of examples:

# starting conditions
i <- 1

# set the logical expression to be evaluated
while(i <= 5){
  # do some operations
  print(i)
  # do an operation that changes the conditions
  # to be evaluated on the next iteration
  i <- i + 1
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
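Relating back to the warning above: one defensive pattern (just a sketch, not something we need for this lab) is to build an iteration cap into the condition, so the loop is guaranteed to stop even if the part of the condition you actually care about never flips.

# a sketch of a safety cap for a while loop
i <- 1
max_iter <- 1000      # hypothetical upper bound on iterations
keep_going <- TRUE    # stands in for a condition that might never become FALSE

while(keep_going & i <= max_iter){
  # ... work that may (or may not) eventually set keep_going to FALSE ...
  i <- i + 1
}

i  # if this equals max_iter + 1, the cap (not the condition) stopped the loop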

A short aside into logical expressions

We have seen logical expressions once before, when using filter() to subset a dataset. However, there are a few logical operators we have not yet seen. Here is a list of the comparison operators (a few quick examples follow the list):

  • ==: Will evaluate to TRUE if the objects on either side are equal to one another.

  • <: Less than will evaluate to TRUE if the numeric value on the left of the operator is less than that on the right.

  • >: Greater than will evaluate to TRUE if the numeric value on the left of the operator is greater than that on the right.

  • <=: Less than or equal to will evaluate to TRUE if the numeric value on the left of the operator is less than or equal to that on the right.

  • >=: Greater than or equal to will evaluate to TRUE if the numeric value on the left of the operator is greater than or equal to that on the right.

  • !=: Not equal to will evaluate to TRUE if the objects on either side of the operator are not equal.
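To make these concrete, here are a few quick evaluations (the particular values are arbitrary):

4 == 4
[1] TRUE
4 != 4
[1] FALSE
3 <= 3
[1] TRUE
"Jim" != "Pam"
[1] TRUE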

Combining logical expressions using Booleans

We can combine multiple logical expressions using the Boolean operators | and &, which are read as or and and, respectively. For example,

a <- 4

3 < a & a < 6
[1] TRUE

evaluates whether a is between 3 and 6.
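Similarly, the or operator | evaluates to TRUE if at least one of the expressions is TRUE, so

3 > a | a > 6
[1] FALSE

evaluates whether a lies outside the interval from 3 to 6 (it does not, so the result is FALSE).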

Your Turn

First, simulate a random number between 0 and 10 using num <- runif(1, 0, 10). Working with a partner, take 5 minutes to construct a while loop that finds which two integers from 0, 1, 2, …, 10 your random number falls between.

# set initial conditions
num <- runif(1, 0, 10)

lower <- 0
upper <- 1

# start the loop and stop once the answer is found
while(!(lower < num & num < upper)){
  lower <- lower + 1
  upper <- upper + 1
}

# once this stops, we get our answer with
paste("The random number is between", lower, "and", upper) |>
  writeLines()

while() loop for population growth

We could similarly write a while loop to simulate our population growth. The loop in that case would look something like

t <- 2
while(t <= steps){
  N[t] <- N[t-1] * lambda * exp(-alpha * N[t-1] + rnorm(1, sd = sigma))
  t <- t + 1
}

but we will not change our script to use a while loop in this case. while() loops are better suited to situations in which you don’t know in advance how many iterations it will take to finish a task (e.g., optimization routines), whereas for() loops are better when you know the number of steps you will be looping over.
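For instance (a sketch with an arbitrary threshold, not part of our script), a while() loop would be the natural choice if we wanted to simulate until the population first exceeds some density, rather than for a fixed number of steps:

# simulate until the density first exceeds a hypothetical threshold,
# keeping track of how many steps that takes
threshold <- 150
N_t <- N0
t <- 0

while(N_t < threshold & t < 10000){  # iteration cap as a safeguard
  N_t <- N_t * lambda * exp(-alpha * N_t + rnorm(1, sd = sigma))
  t <- t + 1
}

t  # the number of time steps it took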

Using the apply() family of functions

Using apply() functions is often preferred over for() loops in R because they can be faster and are more concise. However, they are not as useful for simulating population growth, because it is difficult to reference previously generated elements: in the population growth example, we use the previously simulated population density, N[t-1], to calculate N[t]. This kind of sequential dependence is not really what apply() was designed for. Still, I will introduce apply() and the tidyverse version, purrr::map(), and then demonstrate their use for a different purpose related to the population growth example.

Apply functions work best with existing vectors, matrices, or arrays. The general syntax is

apply(
  X =  <array>,
  MARGIN = <the margin to which we apply the function>,
  FUN = <some function to apply>
)

where the X argument is a vector, matrix, or array, and FUN is the function you want to apply along some MARGIN of the array. For example,

(X <- matrix(data = 1:6, ncol = 2))
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
apply(X, MARGIN = 2, FUN = mean)
[1] 2 5
Your Turn

🤔 What did the apply function do? Why did we specify MARGIN = 2? Take a moment to play with the code a bit and discuss with your neighbor.

Answer

The code applied the mean() function to the columns of the matrix since rows are margin 1 and columns are margin 2.
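To see the difference, we can apply the same function over the rows instead by setting MARGIN = 1, which returns the three row means:

apply(X, MARGIN = 1, FUN = mean)
[1] 2.5 3.5 4.5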

When we use apply-style functions with vectors and lists, we can use some shortcut functions, since these objects are one-dimensional and therefore only have one margin. lapply() applies a function to the elements of a vector or list of any class and always returns a list. sapply() is similar, but simplifies the result to an atomic vector (a vector of any class besides a list) when possible. For example:

(eg_list <- list(1:3, 4:6, 7:9))
[[1]]
[1] 1 2 3

[[2]]
[1] 4 5 6

[[3]]
[1] 7 8 9
lapply(eg_list, sum)
[[1]]
[1] 6

[[2]]
[1] 15

[[3]]
[1] 24
sapply(eg_list, sum)
[1]  6 15 24

Using purrr::map() from tidyverse

The tidyverse option for completing repetitive tasks comes from the purrr package with the map() family of functions. The syntax for map() is a bit cleaner than apply() in my opinion (particularly when we get into more complex operations that we want to perform on each element), and is generally structured as

purrr::map(
  <vector>, 
  ~ {# do something with each element of vector}
)

where the <vector> is, like with a for() loop, a vector (often a list) to loop over. The second argument can either be a named function (e.g., sqrt), or can start with a tilde. If it starts with a tilde, everything that comes after gets evaluated during each iteration, just like a for loop. Let’s see some examples.

purrr::map(eg_list, sum)
[[1]]
[1] 6

[[2]]
[1] 15

[[3]]
[1] 24
Your Turn

🤔 What class of object is returned by map()? Hint: assign the result of the above map() call to a named object.

Answer

map() always returns a list.
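We can confirm this by assigning the result and checking its class:

res <- purrr::map(eg_list, sum)
class(res)
[1] "list"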

Here is another example where we use the tilde syntax.

Note

Note that .x always takes the place of the current element of the supplied vector in the code that is to be evaluated with each iteration.

purrr::map(
  1:5,
  ~ .x^3
) %>% unlist()
[1]   1   8  27  64 125

Notice how I had to unlist the result of map() above. However, as with sapply(), there are shortcuts in the family of map() functions. For example, if I know the result should be a double, that is, a numeric vector, then I can use a shortcut:

purrr::map_dbl(
  1:5,
  ~ .x^3
)
[1]   1   8  27  64 125

See ?purrr::map() for the details on the other shortcut functions.
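For instance, purrr::map_chr() returns a character vector. A quick illustration using eg_list from above:

purrr::map_chr(eg_list, ~ paste(.x, collapse = "-"))
[1] "1-2-3" "4-5-6" "7-8-9"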

Introducing functions

Before returning to the motivating example, let’s first take a step back and introduce how to write your own functions in R. Recall that functions in R are treated as just another R object, one with class function. Because of that, it turns out to be quite easy to define a function. The general form of a user-defined function is

my_function <- function(arg1, arg2){
  # do something with arguments
  return(result)
}

You can include as many arguments as you need in a function and even specify default values for each argument. For example:

my_mean <- function(x, na.rm = TRUE){
  n <- sum(!is.na(x))
  return(
    sum(x, na.rm = na.rm) / n
  )
}

This user-defined mean function defaults to removing NAs from the calculation. We can test it by loading it into the environment and feeding it test cases.

x <- c(2, 4, NA, 5, 7, NA)
my_mean(x)
[1] 4.5
mean(x, na.rm = T)
[1] 4.5
my_mean(x, na.rm = F)
[1] NA
mean(x)
[1] NA

It can be very handy to define your own functions when you want to use apply() or map() for nuanced tasks. Here is an example using sapply().

# generate 5 random samples of size 20 from a normal distribution
samps <- lapply(rep(20, 5), rnorm)

# compute the sum of squared errors for each sample
sapply(
  samps,
  FUN = function(x, mu){sum((x - mu)^2)},
  mu = 0
)
[1] 13.365747  8.117842 13.354433 23.708973 13.958778
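For comparison (not something we need to add to our script), the same computation with purrr::map_dbl() and the tilde syntax would look like the following, and should give the same results since samps is unchanged:

purrr::map_dbl(
  samps,
  ~ sum((.x - 0)^2)
)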

The motivating example continued

Let’s return to the population growth example, but now imagine that you have data on population densities through time, as in Figure 1, and wish to estimate the parameters, \(\alpha\) and \(\lambda\), given the data. Indeed, there are sophisticated ways of doing this, but this lab is focused on repetitive computations, so we will use a brute force approach 💪.

First, let’s define our criteria for what constitutes a “good fit” to the data. For this, we will use the classic approach of minimizing the sum of squared errors. That is, we want to minimize the sum of the squared distances between the observed data and the fitted model. Formally,

\[ \hat{\boldsymbol \theta} = \underset{\boldsymbol \theta}{\arg\min} \sum_{i=1}^n (y_i - f({\boldsymbol \theta}, {\bf y}))^2 \] where \(\boldsymbol \theta = (\lambda, \alpha)^\top\) and \(f()\) is the model function.
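As a preview of how this criterion might be coded up (a minimal sketch with hypothetical names, not necessarily how we will proceed), here is a function that computes the sum of squared one-step-ahead errors for a candidate \(\boldsymbol \theta\), using the deterministic part of Equation 1 as \(f()\):

# hypothetical helper: sum of squared one-step-ahead errors for candidate
# parameters theta = c(lambda, alpha), given the observed densities N
sse <- function(theta, N){
  lambda <- theta[1]
  alpha <- theta[2]
  # deterministic predictions of N[2], ..., N[steps] from the previous values
  preds <- N[-length(N)] * lambda * exp(-alpha * N[-length(N)])
  return(sum((N[-1] - preds)^2))
}

# evaluate the criterion at the parameter values used in the simulation
sse(c(lambda, alpha), N)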