7 Iteration

Prerequisite: Chapter 21 ‘Iteration’ from R for Data Science, available at http://r4ds.had.co.nz/iteration.html

7.1 Introduction to iteration

In the Understanding Society data we have seven waves and seven separate files for adult questionnaires. We will need to read them all for the data to be joined. Of course, we can read them one by one, but this is inconvenient.

We will use this example to learn about iteration, one of the most important concepts in programming. You should read Chapter 21 from R for Data Science and do the exercises to learn the basics; here we will consider how we can apply iteration to our case.

Iteration simply means repeating a process, and it brings three main benefits. First, when you need to run the same, or similar, lines of code many times, iteration lets you write them once, reducing the amount of code you need to write. Second, if you need to change what the code does, you only change the body of the loop, rather than every line you have rewritten (or copied and pasted). Finally, if you made a mistake in your original code, you only have to correct it in one place, rather than many.

7.1.1 The for loop

Before we focus on using iterations and loops to load multiple waves of data in at once, let’s first write a simple for loop.

The for loop repeats a command once for each element of a sequence you specify. Let’s say we want to work out the square and cube of each number from 1 to 20. We could calculate these for each number individually, but that would take 40 separate lines of code. With a simple for loop, we can instead write this as follows:

n <- 1:20

for (i in seq_along(n)) {
  print(c(n[i], n[i]^2, n[i]^3))
}
## [1] 1 1 1
## [1] 2 4 8
## [1]  3  9 27
## [1]  4 16 64
## [1]   5  25 125
## [1]   6  36 216
## [1]   7  49 343
## [1]   8  64 512
## [1]   9  81 729
## [1]   10  100 1000
## [1]   11  121 1331
## [1]   12  144 1728
## [1]   13  169 2197
## [1]   14  196 2744
## [1]   15  225 3375
## [1]   16  256 4096
## [1]   17  289 4913
## [1]   18  324 5832
## [1]   19  361 6859
## [1]   20  400 8000
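
Printing is fine for a quick look, but usually we want to keep the results. The sketch below is one possible way to do this: it creates an empty matrix of the right size before the loop and fills in one row per iteration. The object name powers is my own choice for illustration.

```r
n <- 1:20

# Pre-allocate a matrix with one row per number and three named columns.
powers <- matrix(NA_real_, nrow = length(n), ncol = 3,
                 dimnames = list(NULL, c("n", "square", "cube")))

for (i in seq_along(n)) {
  # Fill row i with the number, its square and its cube.
  powers[i, ] <- c(n[i], n[i]^2, n[i]^3)
}

# Look at the first three rows.
head(powers, 3)
```

Because arithmetic in R is vectorised, cbind(n, n^2, n^3) gives the same numbers without a loop; the loop is used here only to practise the syntax.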

7.1.2 The while loop

Another type of loop you will come across, though not one we will use to load the Understanding Society data, is the while loop. Simply put, the while loop repeats a command for as long as a condition is true, and stops once it becomes false (in our example, once x reaches 36, so that x < 36 no longer holds). Let’s say, for example, we now want to see the square and cube of the numbers from 21 to 35. We can write this as a while loop in the following way:

x <- 21

while (x < 36) {
  print(c(x, x^2, x^3))
  x <- x + 1
}
## [1]   21  441 9261
## [1]    22   484 10648
## [1]    23   529 12167
## [1]    24   576 13824
## [1]    25   625 15625
## [1]    26   676 17576
## [1]    27   729 19683
## [1]    28   784 21952
## [1]    29   841 24389
## [1]    30   900 27000
## [1]    31   961 29791
## [1]    32  1024 32768
## [1]    33  1089 35937
## [1]    34  1156 39304
## [1]    35  1225 42875
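
A while loop is most useful when we do not know in advance how many iterations are needed. As a small sketch (the threshold of 50000 is arbitrary and chosen only for illustration), the loop below looks for the smallest whole number whose cube is greater than 50000:

```r
x <- 1

# Keep increasing x for as long as its cube is not yet greater than 50000.
while (x^3 <= 50000) {
  x <- x + 1
}

x
```

The loop stops at x = 37, since the cube of 36 is 46656 and the cube of 37 is 50653.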

7.2 Loading Understanding Society using iteration

To load the Understanding Society data, let’s first consider a very simple for loop.

for (i in 1:5) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

This loop goes through the values from 1 to 5 and, in each iteration, prints the number on the screen. With the Understanding Society data, we want to go from 1 to 7 (as we have seven waves) and in each iteration we want to read in the data and join it to the data from other waves. Let’s see how we can write a loop that does this.

First, we need to identify the files we want to open. The dir function returns the paths of all the files in our data folder, including its subfolders, whose names contain the pattern indresp.

files <- dir(
             # The folder in which the files are stored.
             "data/UKDA-6614-tab/tab",
             # Only return files whose names match this pattern.
             pattern = "indresp",
             # Search all the subfolders as well.
             recursive = TRUE,
             # Return the full file paths, rather than just the names of the individual files.
             full.names = TRUE)

files
##  [1] "data/UKDA-6614-tab/tab/bhps_w1/ba_indresp.tab" 
##  [2] "data/UKDA-6614-tab/tab/bhps_w10/bj_indresp.tab"
##  [3] "data/UKDA-6614-tab/tab/bhps_w11/bk_indresp.tab"
##  [4] "data/UKDA-6614-tab/tab/bhps_w12/bl_indresp.tab"
##  [5] "data/UKDA-6614-tab/tab/bhps_w13/bm_indresp.tab"
##  [6] "data/UKDA-6614-tab/tab/bhps_w14/bn_indresp.tab"
##  [7] "data/UKDA-6614-tab/tab/bhps_w15/bo_indresp.tab"
##  [8] "data/UKDA-6614-tab/tab/bhps_w16/bp_indresp.tab"
##  [9] "data/UKDA-6614-tab/tab/bhps_w17/bq_indresp.tab"
## [10] "data/UKDA-6614-tab/tab/bhps_w18/br_indresp.tab"
## [11] "data/UKDA-6614-tab/tab/bhps_w2/bb_indresp.tab" 
## [12] "data/UKDA-6614-tab/tab/bhps_w3/bc_indresp.tab" 
## [13] "data/UKDA-6614-tab/tab/bhps_w4/bd_indresp.tab" 
## [14] "data/UKDA-6614-tab/tab/bhps_w5/be_indresp.tab" 
## [15] "data/UKDA-6614-tab/tab/bhps_w6/bf_indresp.tab" 
## [16] "data/UKDA-6614-tab/tab/bhps_w7/bg_indresp.tab" 
## [17] "data/UKDA-6614-tab/tab/bhps_w8/bh_indresp.tab" 
## [18] "data/UKDA-6614-tab/tab/bhps_w9/bi_indresp.tab" 
## [19] "data/UKDA-6614-tab/tab/us_w1/a_indresp.tab"    
## [20] "data/UKDA-6614-tab/tab/us_w2/b_indresp.tab"    
## [21] "data/UKDA-6614-tab/tab/us_w3/c_indresp.tab"    
## [22] "data/UKDA-6614-tab/tab/us_w4/d_indresp.tab"    
## [23] "data/UKDA-6614-tab/tab/us_w5/e_indresp.tab"    
## [24] "data/UKDA-6614-tab/tab/us_w6/f_indresp.tab"    
## [25] "data/UKDA-6614-tab/tab/us_w7/g_indresp.tab"
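
If you do not have the data to hand, the behaviour of dir can be explored in a temporary folder. The sketch below builds a miniature folder structure with empty files; the folder name dirDemo and the file a_hhresp.tab are made up for the illustration.

```r
# Create a small made-up folder structure inside R's temporary directory.
root <- file.path(tempdir(), "dirDemo")
dir.create(file.path(root, "us_w1"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "bhps_w1"), recursive = TRUE, showWarnings = FALSE)

# Create three empty files; only two of them contain "indresp" in their names.
file.create(file.path(root, "us_w1", "a_indresp.tab"))
file.create(file.path(root, "us_w1", "a_hhresp.tab"))
file.create(file.path(root, "bhps_w1", "ba_indresp.tab"))

# With full.names = FALSE the paths are given relative to root.
found <- dir(root, pattern = "indresp", recursive = TRUE, full.names = FALSE)
found
```

The results come back in alphabetical order, which is also why bhps_w10 is listed before bhps_w2 in the output above.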

There are 25 files because the folder also contains data from the BHPS, not just Understanding Society. We do not need the BHPS, so we want to select only the files from Understanding Society. We can use the function str_detect from the package stringr to select only the files whose paths contain us.

# str_detect returns a logical vector. Note that I specify which package the function comes from without explicitly attaching it.

stringr::str_detect(files, "us")
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## [23]  TRUE  TRUE  TRUE
# Now I select only the files from Understanding Society.

files <- files[stringr::str_detect(files, "us")]

files
## [1] "data/UKDA-6614-tab/tab/us_w1/a_indresp.tab"
## [2] "data/UKDA-6614-tab/tab/us_w2/b_indresp.tab"
## [3] "data/UKDA-6614-tab/tab/us_w3/c_indresp.tab"
## [4] "data/UKDA-6614-tab/tab/us_w4/d_indresp.tab"
## [5] "data/UKDA-6614-tab/tab/us_w5/e_indresp.tab"
## [6] "data/UKDA-6614-tab/tab/us_w6/f_indresp.tab"
## [7] "data/UKDA-6614-tab/tab/us_w7/g_indresp.tab"
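
The pattern us works here because none of the BHPS paths happens to contain these two letters, but a very short pattern can easily match more than intended. As a sketch, the same filtering can be done with the base R function grepl and a slightly more specific pattern. The three paths are copied from the listing above; the object names examplePaths and isUs are only for illustration.

```r
examplePaths <- c("data/UKDA-6614-tab/tab/bhps_w1/ba_indresp.tab",
                  "data/UKDA-6614-tab/tab/us_w1/a_indresp.tab",
                  "data/UKDA-6614-tab/tab/us_w2/b_indresp.tab")

# fixed = TRUE treats the pattern as a literal string, not a regular expression.
isUs <- grepl("/us_w", examplePaths, fixed = TRUE)
isUs

# Keep only the paths for which the test is TRUE.
examplePaths[isUs]
```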

Now we have a vector of file paths we want to loop over. We can write a short loop that prints the path of each file.

for (i in 1:7) {
  print(files[i])
}
## [1] "data/UKDA-6614-tab/tab/us_w1/a_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w2/b_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w3/c_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w4/d_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w5/e_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w6/f_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w7/g_indresp.tab"

Note that the same task can be achieved simply with:

for (i in files) {
  print(i)
}
## [1] "data/UKDA-6614-tab/tab/us_w1/a_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w2/b_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w3/c_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w4/d_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w5/e_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w6/f_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w7/g_indresp.tab"

You will see a bit later why I wanted to loop over numbers rather than elements of the character vector.
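
One caution about looping over numbers: 1:7 is typed by hand, so the loop has to be edited if the number of files changes. A common alternative, sketched here with a made-up vector, is seq_along, which builds the index from the vector itself and also copes with an empty vector.

```r
# A made-up vector standing in for files.
someFiles <- c("a_indresp.tab", "b_indresp.tab", "c_indresp.tab")

# seq_along(someFiles) is 1, 2, 3 here, and adapts if the vector changes length.
for (i in seq_along(someFiles)) {
  print(paste(i, someFiles[i]))
}

# With an empty vector, 1:length() counts down from 1 to 0,
# while seq_along() gives nothing to loop over.
empty <- character(0)
1:length(empty)
seq_along(empty)
```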

Now we need to read in the data. We can read the files in their entirety, but this is inefficient as we will only need a few variables. The function fread from the package data.table allows us to specify the variables we want to read (look back at Read Data if you need a recap here). Let’s choose the id variable (pidp), sex, age, interest in politics and net monthly income. The problem is that, apart from pidp, which has the same name in every wave, these variables have a different name in each wave, indicated by a prefix: a_ in wave 1, b_ in wave 2, etc. We will need to find a way to loop over not just the file names in files, but also the prefixes at the same time.

Let’s start with creating a vector of the variable names without the prefixes.

vars <- c("sex", "dvage", "vote6", "fimnnet_dv")

If we want to add a prefix to the elements of this vector we can use the function paste.

paste("a", vars, sep = "_")
## [1] "a_sex"        "a_dvage"      "a_vote6"      "a_fimnnet_dv"

The built-in constant letters contains the lower-case letters of the English alphabet, so letters[1] is "a" and the same expression can be written as follows:

paste(letters[1], vars, sep = "_")
## [1] "a_sex"        "a_dvage"      "a_vote6"      "a_fimnnet_dv"
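
paste is vectorised over all its arguments, which is what makes it convenient inside a loop. As a small illustration (the object name prefixes is my own), the first seven letters can be combined with a single variable name to show what that variable is called in each wave:

```r
# The prefixes for waves 1 to 7.
prefixes <- letters[1:7]

# The name of one variable in each of the seven waves.
sexNames <- paste(prefixes, "sex", sep = "_")
sexNames

# paste0 is a shortcut for paste with sep = "".
paste0(prefixes, "_sex")
```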

Now we can write a loop that goes through the values 1 to 7 and in each iteration reads the correct data file choosing the variables with the correct prefix.

# Attach data.table for the fread function.

library(data.table)

for (i in 1:7) {
        # Create a vector of the variables with the correct prefix.
        varsToSelect <- paste(letters[i], vars, sep = "_")
        # Add pidp to this vector (no prefix for pidp)
        varsToSelect <- c("pidp", varsToSelect)
        # Now read the data. 
        data <- fread(files[i], select = varsToSelect)
        # print the first line
        print(head(data, 1))
}        
## 
Read 0.0% of 50994 rows
Read 50994 rows and 5 (of 1364) columns from 0.187 GB file in 00:00:03
##        pidp a_sex a_dvage a_vote6 a_fimnnet_dv
## 1: 68001367     1      39       3         1400
## 
Read 0.0% of 54597 rows
Read 54597 rows and 5 (of 1615) columns from 0.233 GB file in 00:00:04
##        pidp b_sex b_dvage b_vote6 b_fimnnet_dv
## 1: 68004087     1      60       2     1276.667
## 
Read 0.0% of 49739 rows
Read 80.4% of 49739 rows
Read 49739 rows and 5 (of 3024) columns from 0.402 GB file in 00:00:06
##        pidp c_sex c_dvage c_vote6 c_fimnnet_dv
## 1: 68004087     1      61       2     914.3333
## 
Read 0.0% of 47157 rows
Read 47157 rows and 5 (of 2086) columns from 0.262 GB file in 00:00:04
##        pidp d_sex d_dvage d_vote6 d_fimnnet_dv
## 1: 68004087     1      62       1     914.3333
## 
Read 0.0% of 44903 rows
Read 44903 rows and 5 (of 2583) columns from 0.310 GB file in 00:00:05
##        pidp e_sex e_dvage e_vote6 e_fimnnet_dv
## 1: 68004087     1      63       2     1015.667
## 
Read 0.0% of 45290 rows
Read 45290 rows and 5 (of 2060) columns from 0.263 GB file in 00:00:04
##        pidp f_sex f_dvage f_vote6 f_fimnnet_dv
## 1: 68004087     1      64       2       1007.5
## 
Read 0.0% of 42217 rows
Read 42217 rows and 5 (of 2799) columns from 0.321 GB file in 00:00:07
##        pidp g_sex g_dvage g_vote6 g_fimnnet_dv
## 1: 68004087     1      65       2     1258.333

Now we need to join all these data frames together, and we want to do this in the loop. It is clear what we need to do in the second and later iterations of the loop: join the data from wave 2 with the data from wave 1, and so on. But what shall we do in the first iteration? There is no data frame yet to be joined with the data from wave 1. Clearly our algorithm for the first iteration needs to be different from the algorithm for all other iterations. We will use the if … else control structure for this.

In the first iteration of the loop we simply want to save the data from wave 1. In the second and other iterations we want the data to be joined with the data frame we have from the previous iteration.

# Attach dplyr for the full_join function.

library(dplyr)

for (i in 1:7) {
        # Create a vector of the variables with the correct prefix.
        varsToSelect <- paste(letters[i], vars, sep = "_")
        # Add pidp to this vector (no prefix for pidp)
        varsToSelect <- c("pidp", varsToSelect)
        # Now read the data. 
        data <- fread(files[i], select = varsToSelect)
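        if (i == 1) {
                # First iteration: there is nothing to join yet, so just keep wave 1.
                all7 <- data
        } else {
                # Later iterations: join the new wave to the data we have so far.
                all7 <- full_join(all7, data, by = "pidp")
        }
        # Now we can remove data to free up memory
        rm(data)
}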

all7 now contains the data from all seven waves.

head(all7, 3)
##       pidp a_sex a_dvage a_vote6 a_fimnnet_dv b_sex b_dvage b_vote6
## 1 68001367     1      39       3    1400.0000    NA      NA      NA
## 2 68004087     1      59       2     802.0833     1      60       2
## 3 68006127     2      39       4    1179.5267     2      40       4
##   b_fimnnet_dv c_sex c_dvage c_vote6 c_fimnnet_dv d_sex d_dvage d_vote6
## 1           NA    NA      NA      NA           NA    NA      NA      NA
## 2     1276.667     1      61       2     914.3333     1      62       1
## 3     1115.993     2      41       4    1175.6666     2      43       4
##   d_fimnnet_dv e_sex e_dvage e_vote6 e_fimnnet_dv f_sex f_dvage f_vote6
## 1           NA    NA      NA      NA           NA    NA      NA      NA
## 2     914.3333     1      63       2     1015.667     1      64       2
## 3     851.6666     2      43       4     1025.276     2      44       4
##   f_fimnnet_dv g_sex g_dvage g_vote6 g_fimnnet_dv
## 1           NA    NA      NA      NA           NA
## 2     1007.500     1      65       2     1258.333
## 3     1108.833     2      45       4      385.000
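
The joining step can also be tried out without the real files. The sketch below uses three tiny made-up data frames and base R only: it stores them in a list and joins them one after another with Reduce and merge, where all = TRUE keeps every pidp, much as full_join does. This is an alternative way of writing the join, not the approach used in this chapter, and all the values are invented.

```r
# Three made-up "waves" with different sets of individuals.
w1 <- data.frame(pidp = c(1, 2, 3), a_sex = c(1, 2, 1))
w2 <- data.frame(pidp = c(2, 3, 4), b_sex = c(2, 1, 2))
w3 <- data.frame(pidp = c(1, 4),    c_sex = c(1, 2))

waves <- list(w1, w2, w3)

# Reduce joins the first two data frames, then joins the result with the third.
joined <- Reduce(function(x, y) merge(x, y, by = "pidp", all = TRUE), waves)

joined
```

Individuals who did not take part in a wave get NA for that wave’s variables, just as in the first row of all7 above.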

We will now save this data frame in the myData folder for future use with the saveRDS function (first make sure this folder exists on your computer).

saveRDS(all7, "myData/all7.rds")
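
To load the saved data in a later session, the companion function readRDS can be used. The sketch below checks the round trip on a small made-up data frame written to a temporary file, so it runs without the myData folder; the object names toy and toyAgain are only for illustration.

```r
# A small made-up data frame standing in for all7.
toy <- data.frame(pidp = c(1, 2), a_sex = c(1, 2))

# Write to a temporary file so that nothing in the project is overwritten.
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, tmp)

# readRDS returns the object, so it needs to be assigned to a name.
toyAgain <- readRDS(tmp)
identical(toy, toyAgain)

# One way to create the folder from R, assuming it does not exist yet:
# dir.create("myData")
```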