7 Iteration

Please read ch.21 (Iteration) from R for Data Science - http://r4ds.had.co.nz/iteration.html

The Understanding Society data come in seven waves, each with its own individual-level file for the adult questionnaire. To join the data we need to read all seven files. We could read them one by one, but this is tedious and error-prone.

We will use this example to learn about iteration, one of the most important concepts in programming. You should read ch.21 from R for Data Science and do the exercises to learn the basics; here we will consider how we can apply iteration to our case.

Let us first consider a very simple for loop.

for (i in 1:5) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

This loop goes through the values from 1 to 5 and in each iteration it prints the number on the screen. With the Understanding Society data, we want to go from 1 to 7 (as we have seven waves) and in each iteration we want to read in the data and join it to the data from other waves. Let us see how we can write a loop that does it.

First, we need to identify the files we want to open. The dir function returns the paths of all the files in our data folder (including its subfolders, because recursive = TRUE) whose names match the pattern indresp.

files <- dir("data/UKDA-6614-tab/tab",
             pattern="indresp", recursive = TRUE, full.names=TRUE)
files
##  [1] "data/UKDA-6614-tab/tab/bhps_w1/ba_indresp.tab" 
##  [2] "data/UKDA-6614-tab/tab/bhps_w10/bj_indresp.tab"
##  [3] "data/UKDA-6614-tab/tab/bhps_w11/bk_indresp.tab"
##  [4] "data/UKDA-6614-tab/tab/bhps_w12/bl_indresp.tab"
##  [5] "data/UKDA-6614-tab/tab/bhps_w13/bm_indresp.tab"
##  [6] "data/UKDA-6614-tab/tab/bhps_w14/bn_indresp.tab"
##  [7] "data/UKDA-6614-tab/tab/bhps_w15/bo_indresp.tab"
##  [8] "data/UKDA-6614-tab/tab/bhps_w16/bp_indresp.tab"
##  [9] "data/UKDA-6614-tab/tab/bhps_w17/bq_indresp.tab"
## [10] "data/UKDA-6614-tab/tab/bhps_w18/br_indresp.tab"
## [11] "data/UKDA-6614-tab/tab/bhps_w2/bb_indresp.tab" 
## [12] "data/UKDA-6614-tab/tab/bhps_w3/bc_indresp.tab" 
## [13] "data/UKDA-6614-tab/tab/bhps_w4/bd_indresp.tab" 
## [14] "data/UKDA-6614-tab/tab/bhps_w5/be_indresp.tab" 
## [15] "data/UKDA-6614-tab/tab/bhps_w6/bf_indresp.tab" 
## [16] "data/UKDA-6614-tab/tab/bhps_w7/bg_indresp.tab" 
## [17] "data/UKDA-6614-tab/tab/bhps_w8/bh_indresp.tab" 
## [18] "data/UKDA-6614-tab/tab/bhps_w9/bi_indresp.tab" 
## [19] "data/UKDA-6614-tab/tab/us_w1/a_indresp.tab"    
## [20] "data/UKDA-6614-tab/tab/us_w2/b_indresp.tab"    
## [21] "data/UKDA-6614-tab/tab/us_w3/c_indresp.tab"    
## [22] "data/UKDA-6614-tab/tab/us_w4/d_indresp.tab"    
## [23] "data/UKDA-6614-tab/tab/us_w5/e_indresp.tab"    
## [24] "data/UKDA-6614-tab/tab/us_w6/f_indresp.tab"    
## [25] "data/UKDA-6614-tab/tab/us_w7/g_indresp.tab"
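To see what the arguments of dir do without needing the real data, here is a minimal sketch on a made-up folder. The folder layout and the object names root and demo_files are hypothetical and only mimic the structure shown above.

```r
# Hypothetical folder layout, created in a temporary directory for illustration
root <- file.path(tempdir(), "tab_demo")
dir.create(file.path(root, "us_w1"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "bhps_w1"), recursive = TRUE, showWarnings = FALSE)

# Three empty files; only two of them have "indresp" in their names
file.create(file.path(root, "us_w1", "a_indresp.tab"))
file.create(file.path(root, "us_w1", "a_hhresp.tab"))
file.create(file.path(root, "bhps_w1", "ba_indresp.tab"))

# recursive = TRUE looks inside subfolders; full.names = TRUE returns full paths
demo_files <- dir(root, pattern = "indresp", recursive = TRUE, full.names = TRUE)
basename(demo_files)
```

The household file is left out because its name does not match the pattern. The result is sorted alphabetically, which is also why bhps_w10 comes before bhps_w2 in the listing above.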

There are 25 files because the folder also contains data from the BHPS, not just Understanding Society. We do not need the BHPS, so we want to keep only the Understanding Society files. We can use the function str_detect from the package stringr to select only the files whose paths contain us.

# str_detect returns a logical vector. Note that I specify which package the function
# comes from (stringr::) without explicitly attaching it.
stringr::str_detect(files, "us")
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## [23]  TRUE  TRUE  TRUE
# Now I only select the files from UndSoc
files <- files[stringr::str_detect(files, "us")]
files
## [1] "data/UKDA-6614-tab/tab/us_w1/a_indresp.tab"
## [2] "data/UKDA-6614-tab/tab/us_w2/b_indresp.tab"
## [3] "data/UKDA-6614-tab/tab/us_w3/c_indresp.tab"
## [4] "data/UKDA-6614-tab/tab/us_w4/d_indresp.tab"
## [5] "data/UKDA-6614-tab/tab/us_w5/e_indresp.tab"
## [6] "data/UKDA-6614-tab/tab/us_w6/f_indresp.tab"
## [7] "data/UKDA-6614-tab/tab/us_w7/g_indresp.tab"
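The same selection can be sketched in base R with grepl, which also returns a logical vector. The shortened paths below are made up for illustration, and the stricter pattern "/us_w[0-9]+/" is my own suggestion: it only matches a folder called us_w followed by a wave number, so it would not be fooled by some other part of the path that happens to contain the letters us.

```r
# Hypothetical, shortened paths standing in for the real ones
paths <- c("data/tab/bhps_w1/ba_indresp.tab",
           "data/tab/us_w1/a_indresp.tab",
           "data/tab/us_w2/b_indresp.tab")

# grepl is the base R counterpart of stringr::str_detect
# (note the argument order: pattern first, then the vector)
keep <- grepl("/us_w[0-9]+/", paths)
keep

# Logical subsetting keeps only the elements where keep is TRUE
us_paths <- paths[keep]
us_paths
```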

Now we have a vector of file names we want to loop over. We can write a short loop that prints the path and name of each file.

for (i in 1:7) {
  print(files[i])
}
## [1] "data/UKDA-6614-tab/tab/us_w1/a_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w2/b_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w3/c_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w4/d_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w5/e_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w6/f_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w7/g_indresp.tab"

Note that the same task can be achieved simply with:

for (i in files) {
  print(i)
}
## [1] "data/UKDA-6614-tab/tab/us_w1/a_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w2/b_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w3/c_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w4/d_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w5/e_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w6/f_indresp.tab"
## [1] "data/UKDA-6614-tab/tab/us_w7/g_indresp.tab"

You will see a bit later why I wanted to loop over numbers rather than elements of the character vector.
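When looping over positions, a common alternative to hard-coding 1:7 is seq_along, which produces the positions of whatever vector it is given. The sketch below (with made-up object names) shows why this is often considered safer: for an empty vector, 1:length(x) counts down from 1 to 0, while seq_along(x) gives nothing to loop over.

```r
# An empty character vector, e.g. what we would get if no files were found
empty <- character(0)

# 1:length(empty) is 1:0, i.e. the two values 1 and 0
bad_index <- 1:length(empty)
bad_index

# seq_along(empty) has length zero, so a loop over it never runs
good_index <- seq_along(empty)
good_index

n_iter <- 0
for (i in seq_along(empty)) {
  n_iter <- n_iter + 1
}
n_iter
```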

Now we need to read in the data. We could read each file in full, but this is inefficient as we only need a few variables. The function fread from the package data.table allows us to specify the variables we want to read. Let us choose the id variable (pidp), sex, age, interest in politics and net monthly income. The problem is that the names of these variables differ from wave to wave. Only pidp has the same name in every wave. All the other variables carry a wave prefix: a_ in wave 1, b_ in wave 2, etc. We therefore need to loop over the file names in files and over the prefixes at the same time.

Let us start with creating a vector of the variable names without the prefixes.

vars <- c("sex", "dvage", "vote6", "fimnnet_dv")

If we want to add a prefix to the elements of this vector we can use the function paste.

paste("a", vars, sep = "_")
## [1] "a_sex"        "a_dvage"      "a_vote6"      "a_fimnnet_dv"

The built-in constant letters contains the 26 lower-case letters of the English alphabet, so the same expression can be written as follows.

paste(letters[1], vars, sep = "_")
## [1] "a_sex"        "a_dvage"      "a_vote6"      "a_fimnnet_dv"
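To check that this works for every wave and not just the first, here is a small sketch that builds the prefixed names for all seven waves at once. The object name prefixed and the list names wave1 to wave7 are made up for illustration.

```r
vars <- c("sex", "dvage", "vote6", "fimnnet_dv")

# One character vector of prefixed names per wave (a for wave 1, ..., g for wave 7)
prefixed <- lapply(letters[1:7], function(p) paste(p, vars, sep = "_"))
names(prefixed) <- paste0("wave", 1:7)

# The names we would need for wave 2
prefixed$wave2
```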

Now we can write a loop that goes through the values 1 to 7 and in each iteration reads the correct data file choosing the variables with the correct prefix.

# Attach data.table
library(data.table)
for (i in 1:7) {
        # Create a vector of the variables with the correct prefix.
        varsToSelect <- paste(letters[i], vars, sep = "_")
        # Add pidp to this vector (no prefix for pidp)
        varsToSelect <- c("pidp", varsToSelect)
        # Now read the data. 
        data <- fread(files[i], select = varsToSelect)
        # print the first line
        print(head(data, 1))
}        
## Read 50994 rows and 5 (of 1364) columns from 0.187 GB file in 00:00:04
##        pidp a_sex a_dvage a_vote6 a_fimnnet_dv
## 1: 68001367     1      39       3         1400
## Read 54597 rows and 5 (of 1615) columns from 0.233 GB file in 00:00:04
##        pidp b_sex b_dvage b_vote6 b_fimnnet_dv
## 1: 68004087     1      60       2     1276.667
## Read 49739 rows and 5 (of 3024) columns from 0.402 GB file in 00:00:10
##        pidp c_sex c_dvage c_vote6 c_fimnnet_dv
## 1: 68004087     1      61       2     914.3333
## Read 47157 rows and 5 (of 2086) columns from 0.262 GB file in 00:00:06
##        pidp d_sex d_dvage d_vote6 d_fimnnet_dv
## 1: 68004087     1      62       1     914.3333
## Read 44903 rows and 5 (of 2583) columns from 0.310 GB file in 00:00:08
##        pidp e_sex e_dvage e_vote6 e_fimnnet_dv
## 1: 68004087     1      63       2     1015.667
## Read 45290 rows and 5 (of 2060) columns from 0.263 GB file in 00:00:12
##        pidp f_sex f_dvage f_vote6 f_fimnnet_dv
## 1: 68004087     1      64       2       1007.5
## Read 42217 rows and 5 (of 2799) columns from 0.321 GB file in 00:00:09
##        pidp g_sex g_dvage g_vote6 g_fimnnet_dv
## 1: 68004087     1      65       2     1258.333
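If you do not have the data to hand, the same pattern can be tried on toy files. The sketch below is only an illustration: the folder, the file contents and the names demo_dir, demo_vars, demo_files, cols and wave are all made up, and there are two "waves" instead of seven. It assumes data.table is installed.

```r
library(data.table)

# Write two small tab-separated files that mimic the wave structure (made-up values)
demo_dir <- file.path(tempdir(), "fread_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_vars <- c("sex", "dvage")
demo_files <- character(2)
for (i in 1:2) {
  p <- letters[i]
  toy <- data.frame(pidp = 1:3, sex = c(1, 2, 1), dvage = c(30, 40, 50) + i, other = 0)
  # Every column except pidp gets the wave prefix
  names(toy)[-1] <- paste(p, names(toy)[-1], sep = "_")
  demo_files[i] <- file.path(demo_dir, paste0(p, "_indresp.tab"))
  write.table(toy, demo_files[i], sep = "\t", quote = FALSE, row.names = FALSE)
}

# Same logic as above: build the prefixed names, then read only those columns
for (i in seq_along(demo_files)) {
  cols <- c("pidp", paste(letters[i], demo_vars, sep = "_"))
  # showProgress = FALSE suppresses fread's progress messages
  wave <- fread(demo_files[i], select = cols, showProgress = FALSE)
  print(names(wave))
}
```

The column other (with its prefix) is in both files but is never read, because it is not listed in select.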

Now we need to join all these data frames together, and we want to do this in the loop. It is clear what we need to do in the second and later iterations of the loop: join the data from wave 2 with the data from wave 1, etc. But what shall we do in the first iteration? There is no data frame yet to be joined with the data from wave 1. Clearly our algorithm for the first iteration needs to be different from the algorithm for all other iterations. We will use the if … else control structure for this.

In the first iteration of the loop we simply want to save the data from wave 1. In the second and all later iterations we want the new data to be joined with the data frame we built in the previous iterations.

# full_join comes from dplyr, so attach it first
library(dplyr)
for (i in 1:7) {
        # Create a vector of the variables with the correct prefix.
        varsToSelect <- paste(letters[i], vars, sep = "_")
        # Add pidp to this vector (no prefix for pidp)
        varsToSelect <- c("pidp", varsToSelect)
        # Now read the data.
        data <- fread(files[i], select = varsToSelect)
        if (i == 1) {
                # First iteration: nothing to join yet, so just keep wave 1
                all7 <- data
        } else {
                # Later iterations: join the new wave to what we already have
                all7 <- full_join(all7, data, by = "pidp")
        }
        # Now we can remove data to free up memory
        rm(data)
}
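To see what the repeated full join does, here is a minimal sketch on three tiny made-up data frames. Note that it swaps in different tools from the loop above: base R merge with all = TRUE (which keeps unmatched rows from both sides, like a full join) and Reduce, which applies the join to the first two data frames, then joins the result with the third, and so on. All object names and values are hypothetical.

```r
# Three toy "waves": people enter and leave the panel over time
wave1 <- data.frame(pidp = c(1, 2, 3), a_sex = c(1, 2, 1))
wave2 <- data.frame(pidp = c(2, 3, 4), b_sex = c(2, 1, 2))
wave3 <- data.frame(pidp = c(3, 4, 5), c_sex = c(1, 2, 2))
waves <- list(wave1, wave2, wave3)

# merge(..., all = TRUE) keeps all rows from both inputs;
# Reduce carries the accumulated result forward from one wave to the next
joined <- Reduce(function(x, y) merge(x, y, by = "pidp", all = TRUE), waves)
joined
```

Everyone who appears in at least one wave gets a row, and the variables from the waves in which they did not take part are filled with NA, just as in all7.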

all7 now contains the data from all seven waves.

head(all7, 3)
##       pidp a_sex a_dvage a_vote6 a_fimnnet_dv b_sex b_dvage b_vote6
## 1 68001367     1      39       3    1400.0000    NA      NA      NA
## 2 68004087     1      59       2     802.0833     1      60       2
## 3 68006127     2      39       4    1179.5267     2      40       4
##   b_fimnnet_dv c_sex c_dvage c_vote6 c_fimnnet_dv d_sex d_dvage d_vote6
## 1           NA    NA      NA      NA           NA    NA      NA      NA
## 2     1276.667     1      61       2     914.3333     1      62       1
## 3     1115.993     2      41       4    1175.6666     2      43       4
##   d_fimnnet_dv e_sex e_dvage e_vote6 e_fimnnet_dv f_sex f_dvage f_vote6
## 1           NA    NA      NA      NA           NA    NA      NA      NA
## 2     914.3333     1      63       2     1015.667     1      64       2
## 3     851.6666     2      43       4     1025.276     2      44       4
##   f_fimnnet_dv g_sex g_dvage g_vote6 g_fimnnet_dv
## 1           NA    NA      NA      NA           NA
## 2     1007.500     1      65       2     1258.333
## 3     1108.833     2      45       4      385.000

I will now save this data frame in the myData folder for future use, using the saveRDS function (first make sure that this folder exists on your computer).

saveRDS(all7, "myData/all7.rds")
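The counterpart of saveRDS is readRDS. The sketch below shows the round trip on a toy data frame; it writes to a made-up folder inside a temporary directory rather than to myData, and all object names are hypothetical.

```r
# A toy data frame standing in for all7
toy <- data.frame(pidp = c(1, 2), a_sex = c(1, 2))

# Create the output folder only if it does not exist yet
out_dir <- file.path(tempdir(), "myData_demo")
if (!dir.exists(out_dir)) {
  dir.create(out_dir)
}

# Save, read back in, and check that nothing has changed
out_file <- file.path(out_dir, "toy.rds")
saveRDS(toy, out_file)
restored <- readRDS(out_file)
identical(toy, restored)
```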