8 Tidy data

For this class please read ch.12 on Tidy Data from R for Data Science – http://r4ds.had.co.nz/tidy-data.html.

In the previous part of the course (joiningData.Rmd) we learned how to join together data from seven waves of the Understanding Society. Let us open this data set.

UndSoc <- readRDS("myData/all7.rds")
head(UndSoc)
##       pidp a_sex a_dvage a_vote6 a_fimnnet_dv b_sex b_dvage b_vote6
## 1 68001367     1      39       3    1400.0000    NA      NA      NA
## 2 68004087     1      59       2     802.0833     1      60       2
## 3 68006127     2      39       4    1179.5267     2      40       4
## 4 68006135     2      17       4     130.0000    NA      NA      NA
## 5 68006807     2      72       4     933.8135     2      73       4
## 6 68007487     2      57       1     400.0000     2      58       2
##   b_fimnnet_dv c_sex c_dvage c_vote6 c_fimnnet_dv d_sex d_dvage d_vote6
## 1           NA    NA      NA      NA           NA    NA      NA      NA
## 2   1276.66663     1      61       2     914.3333     1      62       1
## 3   1115.99341     2      41       4    1175.6666     2      43       4
## 4           NA     2      19       4    1100.0000     2      21       2
## 5   1145.27893     2      74       4    1146.7035     2      75       4
## 6     16.66666     2      59       4     475.0000     2      60       1
##   d_fimnnet_dv e_sex e_dvage e_vote6 e_fimnnet_dv f_sex f_dvage f_vote6
## 1           NA    NA      NA      NA           NA    NA      NA      NA
## 2     914.3333     1      63       2    1015.6667     1      64       2
## 3     851.6666     2      43       4    1025.2756     2      44       4
## 4     977.7525     2      21       4     823.0209    NA      NA      NA
## 5    1405.1293     2      76       4   15000.0000     2      77       4
## 6     666.6168    NA      NA      NA           NA    NA      NA      NA
##   f_fimnnet_dv g_sex g_dvage g_vote6 g_fimnnet_dv
## 1           NA    NA      NA      NA           NA
## 2     1007.500     1      65       2    1258.3334
## 3     1108.833     2      45       4     385.0000
## 4           NA     2      23      -7     909.2457
## 5      904.209     2      78       3     996.6353
## 6           NA    NA      NA      NA           NA

Now we will work on how these data can be represented and prepared for the analysis. Please read ch.12 on Tidy Data from the R for Data Science Book – http://r4ds.had.co.nz/tidy-data.html.

8.1 Long and wide formats

Let us keep only a few observations and columns in the data and more closely look at its structure.

UndSocExample <- UndSoc %>%
  filter(pidp == 68001367 | pidp == 68004087) %>%
  select(pidp, a_sex: b_fimnnet_dv)
UndSocExample
##       pidp a_sex a_dvage a_vote6 a_fimnnet_dv b_sex b_dvage b_vote6
## 1 68001367     1      39       3    1400.0000    NA      NA      NA
## 2 68004087     1      59       2     802.0833     1      60       2
##   b_fimnnet_dv
## 1           NA
## 2     1276.667

These are the data for two individuals only in waves 1 and 2. The data are represented in the wide format. This means that we have one row for each individual, and data from different waves are recorded in several columns. For example, the data on sex from wave 1 is in column a_sex and the data on sex from wave is in b_sex.

You will find this representation of the data common in longitudinal data sets. It may be convenient for certain purposes, but it is generally recommended to keep the data in the long format (that corresponds to the tidy data principles as described in the R for Data Science book).

To move from the wide to the long format we can use the function melt and cast functions from the reshape2 package.

require(reshape2)

# First we "melt" the data frame.
UndSocExampleMolten <- UndSocExample %>%
  melt(id = "pidp")
UndSocExampleMolten
##        pidp     variable     value
## 1  68001367        a_sex    1.0000
## 2  68004087        a_sex    1.0000
## 3  68001367      a_dvage   39.0000
## 4  68004087      a_dvage   59.0000
## 5  68001367      a_vote6    3.0000
## 6  68004087      a_vote6    2.0000
## 7  68001367 a_fimnnet_dv 1400.0000
## 8  68004087 a_fimnnet_dv  802.0833
## 9  68001367        b_sex        NA
## 10 68004087        b_sex    1.0000
## 11 68001367      b_dvage        NA
## 12 68004087      b_dvage   60.0000
## 13 68001367      b_vote6        NA
## 14 68004087      b_vote6    2.0000
## 15 68001367 b_fimnnet_dv        NA
## 16 68004087 b_fimnnet_dv 1276.6666
# Next I want to split the column variable into a column indicating wave and a column indicating variable name. 
# I will use the function separate() from tidyr.

UndSocExampleSep <- UndSocExampleMolten %>%
        separate(variable, into = c("wave", "variable"), sep = "_")
UndSocExampleSep
##        pidp wave variable     value
## 1  68001367    a      sex    1.0000
## 2  68004087    a      sex    1.0000
## 3  68001367    a    dvage   39.0000
## 4  68004087    a    dvage   59.0000
## 5  68001367    a    vote6    3.0000
## 6  68004087    a    vote6    2.0000
## 7  68001367    a  fimnnet 1400.0000
## 8  68004087    a  fimnnet  802.0833
## 9  68001367    b      sex        NA
## 10 68004087    b      sex    1.0000
## 11 68001367    b    dvage        NA
## 12 68004087    b    dvage   60.0000
## 13 68001367    b    vote6        NA
## 14 68004087    b    vote6    2.0000
## 15 68001367    b  fimnnet        NA
## 16 68004087    b  fimnnet 1276.6666
# We have a problem here because one of our variables (fimnnet_dv) has _ in the name and we do not want to separate by it. To avoid this problem we need to add the argument extra = "merge"" in separate().

UndSocExampleSep <- UndSocExampleMolten %>%
        separate(variable, into = c("wave", "variable"), sep = "_", extra = "merge")
UndSocExampleSep
##        pidp wave   variable     value
## 1  68001367    a        sex    1.0000
## 2  68004087    a        sex    1.0000
## 3  68001367    a      dvage   39.0000
## 4  68004087    a      dvage   59.0000
## 5  68001367    a      vote6    3.0000
## 6  68004087    a      vote6    2.0000
## 7  68001367    a fimnnet_dv 1400.0000
## 8  68004087    a fimnnet_dv  802.0833
## 9  68001367    b        sex        NA
## 10 68004087    b        sex    1.0000
## 11 68001367    b      dvage        NA
## 12 68004087    b      dvage   60.0000
## 13 68001367    b      vote6        NA
## 14 68004087    b      vote6    2.0000
## 15 68001367    b fimnnet_dv        NA
## 16 68004087    b fimnnet_dv 1276.6666
# Next we "cast" the molten data frame into the format we want.

UndSocExampleLong <- UndSocExampleSep %>%
  dcast(pidp + wave ~ variable)
UndSocExampleLong
##       pidp wave dvage fimnnet_dv sex vote6
## 1 68001367    a    39  1400.0000   1     3
## 2 68001367    b    NA         NA  NA    NA
## 3 68004087    a    59   802.0833   1     2
## 4 68004087    b    60  1276.6666   1     2

Now the data are in the “long format”. This means that we have as many rows for each individual as the number of waves, a variable indicating wave, and all other variables are in columns. In most cases with longitudinal data, the long format is easier to work with.

What if we want to convert the data back to the wide format?

# First melt
UndSocExampleMolten2 <- UndSocExampleLong %>%
  melt(id = c("pidp", "wave"))
UndSocExampleMolten2
##        pidp wave   variable     value
## 1  68001367    a      dvage   39.0000
## 2  68001367    b      dvage        NA
## 3  68004087    a      dvage   59.0000
## 4  68004087    b      dvage   60.0000
## 5  68001367    a fimnnet_dv 1400.0000
## 6  68001367    b fimnnet_dv        NA
## 7  68004087    a fimnnet_dv  802.0833
## 8  68004087    b fimnnet_dv 1276.6666
## 9  68001367    a        sex    1.0000
## 10 68001367    b        sex        NA
## 11 68004087    a        sex    1.0000
## 12 68004087    b        sex    1.0000
## 13 68001367    a      vote6    3.0000
## 14 68001367    b      vote6        NA
## 15 68004087    a      vote6    2.0000
## 16 68004087    b      vote6    2.0000
# Unite the columns
UndSocExampleUnited <- UndSocExampleMolten2 %>%
  unite("variable", c("wave", "variable"), sep = "_")
UndSocExampleUnited
##        pidp     variable     value
## 1  68001367      a_dvage   39.0000
## 2  68001367      b_dvage        NA
## 3  68004087      a_dvage   59.0000
## 4  68004087      b_dvage   60.0000
## 5  68001367 a_fimnnet_dv 1400.0000
## 6  68001367 b_fimnnet_dv        NA
## 7  68004087 a_fimnnet_dv  802.0833
## 8  68004087 b_fimnnet_dv 1276.6666
## 9  68001367        a_sex    1.0000
## 10 68001367        b_sex        NA
## 11 68004087        a_sex    1.0000
## 12 68004087        b_sex    1.0000
## 13 68001367      a_vote6    3.0000
## 14 68001367      b_vote6        NA
## 15 68004087      a_vote6    2.0000
## 16 68004087      b_vote6    2.0000
# And now cast
UndSocExampleWide <- UndSocExampleUnited %>%
  dcast(pidp ~ variable)
UndSocExampleWide
##       pidp a_dvage a_fimnnet_dv a_sex a_vote6 b_dvage b_fimnnet_dv b_sex
## 1 68001367      39    1400.0000     1       3      NA           NA    NA
## 2 68004087      59     802.0833     1       2      60     1276.667     1
##   b_vote6
## 1      NA
## 2       2

We can also restructure the data using the gather and spread functions from the tidyr package (part of tidyverse). gather is roughy equivalent to melt and spread is roughy equivalent to dcast.

Moving from wide to long:

UndSocExample
##       pidp a_sex a_dvage a_vote6 a_fimnnet_dv b_sex b_dvage b_vote6
## 1 68001367     1      39       3    1400.0000    NA      NA      NA
## 2 68004087     1      59       2     802.0833     1      60       2
##   b_fimnnet_dv
## 1           NA
## 2     1276.667
# This "melts" the data frame. 
UndSocExample %>%
  gather(a_sex:b_fimnnet_dv, key = "variable", value = "value")
##        pidp     variable     value
## 1  68001367        a_sex    1.0000
## 2  68004087        a_sex    1.0000
## 3  68001367      a_dvage   39.0000
## 4  68004087      a_dvage   59.0000
## 5  68001367      a_vote6    3.0000
## 6  68004087      a_vote6    2.0000
## 7  68001367 a_fimnnet_dv 1400.0000
## 8  68004087 a_fimnnet_dv  802.0833
## 9  68001367        b_sex        NA
## 10 68004087        b_sex    1.0000
## 11 68001367      b_dvage        NA
## 12 68004087      b_dvage   60.0000
## 13 68001367      b_vote6        NA
## 14 68004087      b_vote6    2.0000
## 15 68001367 b_fimnnet_dv        NA
## 16 68004087 b_fimnnet_dv 1276.6666
# Next we want to split the "variable" column and "cast" in the long format
UndSocExample %>%
  gather(a_sex:b_vote6, key = "variable", value = "value") %>%
  separate(variable, into = c("wave", "variable"), sep = "_", extra = "merge") %>%
  spread(key = variable, value = value)
##       pidp b_fimnnet_dv wave dvage fimnnet_dv sex vote6
## 1 68001367           NA    a    39  1400.0000   1     3
## 2 68001367           NA    b    NA         NA  NA    NA
## 3 68004087     1276.667    a    59   802.0833   1     2
## 4 68004087     1276.667    b    60         NA   1     2

If we want to move from long to wide:

UndSocExampleLong
##       pidp wave dvage fimnnet_dv sex vote6
## 1 68001367    a    39  1400.0000   1     3
## 2 68001367    b    NA         NA  NA    NA
## 3 68004087    a    59   802.0833   1     2
## 4 68004087    b    60  1276.6666   1     2
UndSocExampleLong %>%
  gather(dvage:vote6, key = "variable", value = "value") %>%
  unite("variable", c("wave", "variable"), sep = "_") %>%
  spread(key = variable, value = value)
##       pidp a_dvage a_fimnnet_dv a_sex a_vote6 b_dvage b_fimnnet_dv b_sex
## 1 68001367      39    1400.0000     1       3      NA           NA    NA
## 2 68004087      59     802.0833     1       2      60     1276.667     1
##   b_vote6
## 1      NA
## 2       2

Exercise. Reshape the full UndSoc data frame from wide to long format. Call the object where you will store the result UndSocLong.

Solution:

UndSocLong <- UndSoc %>%
  gather(a_sex:g_fimnnet_dv, key = "variable", value = "value") %>%
  separate(variable, into = c("wave", "variable"), sep = "_", extra = "merge") %>%
  spread(key = variable, value = value)
head(UndSocLong, 5)
##    pidp wave dvage fimnnet_dv sex vote6
## 1 22445    a    NA         NA  NA    NA
## 2 22445    b    NA         NA  NA    NA
## 3 22445    c    NA         NA  NA    NA
## 4 22445    d    27   1140.000   2     2
## 5 22445    e    28   1602.667   2     2

8.2 Cleaning the data

Before we begin the analysis we want to make sure that the data have been cleaned and all the missing values have been correctly identified. It usually makes sense to separate the cleaning and analysis stages into separate scripts.

Let us explore the data set we have. Note that if we had not converted the data into the long format we would have to tabulate and clean each variable seven times.

summary(UndSocLong)
##       pidp               wave               dvage          fimnnet_dv    
##  Min.   :2.244e+04   Length:584703      Min.   : -9.00   Min.   :-42904  
##  1st Qu.:4.086e+08   Class :character   1st Qu.: 32.00   1st Qu.:   630  
##  Median :7.493e+08   Mode  :character   Median : 46.00   Median :  1159  
##  Mean   :7.973e+08                      Mean   : 47.09   Mean   :  1383  
##  3rd Qu.:1.224e+09                      3rd Qu.: 61.00   3rd Qu.:  1800  
##  Max.   :1.653e+09                      Max.   :104.00   Max.   : 15000  
##                                         NA's   :249806   NA's   :249806  
##       sex             vote6       
##  Min.   :-9.00    Min.   :-10.00  
##  1st Qu.: 1.00    1st Qu.:  2.00  
##  Median : 2.00    Median :  3.00  
##  Mean   : 1.54    Mean   :  1.82  
##  3rd Qu.: 2.00    3rd Qu.:  4.00  
##  Max.   : 2.00    Max.   :  4.00  
##  NA's   :249806   NA's   :249806
table(UndSocLong$wave)
## 
##     a     b     c     d     e     f     g 
## 83529 83529 83529 83529 83529 83529 83529
table(UndSocLong$dvage)
## 
##   -9   -2   -1   16   17   18   19   20   21   22   23   24   25   26   27 
##   19    6   34 5735 5769 5536 5350 5188 5000 4779 4668 4537 4555 4556 4610 
##   28   29   30   31   32   33   34   35   36   37   38   39   40   41   42 
## 4742 4836 5129 5181 5286 5387 5380 5512 5591 5519 5765 5896 6155 6274 6422 
##   43   44   45   46   47   48   49   50   51   52   53   54   55   56   57 
## 6291 6450 6489 6316 6263 6187 6119 6013 5901 5727 5701 5483 5406 5193 5042 
##   58   59   60   61   62   63   64   65   66   67   68   69   70   71   72 
## 5052 4856 4782 4962 4995 5013 4999 4846 4814 4666 4435 4210 3914 3707 3504 
##   73   74   75   76   77   78   79   80   81   82   83   84   85   86   87 
## 3316 3130 2977 2749 2618 2452 2241 2106 1886 1749 1599 1390 1211  988  860 
##   88   89   90   91   92   93   94   95   96   97   98   99  100  101  102 
##  683  547  453  344  243  172  131   92   80   41   36   25   15    6    2 
##  103  104 
##    1    1
table(UndSocLong$sex)
## 
##     -9     -1      1      2 
##      2      1 154045 180849
table(UndSocLong$vote6)
## 
##    -10     -9     -7     -2     -1      1      2      3      4 
##   4656    366  24752    431    358  30981 102212  86790  84351
summary(UndSocLong$fimnnet_dv)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -42904     630    1159    1383    1800   15000  249806

Note the negative values for dvage, sex and vote6. These are missing values that need to be coded as missing.

UndSocLong  <- UndSocLong %>%
  mutate(dvage = ifelse(dvage > 0, dvage, NA)) %>%
  mutate(sex = ifelse(sex > 0, sex, NA)) %>%
  mutate(vote6 = ifelse(vote6 > 0, vote6, NA))
table(UndSocLong$dvage)
## 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
## 5735 5769 5536 5350 5188 5000 4779 4668 4537 4555 4556 4610 4742 4836 5129 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
## 5181 5286 5387 5380 5512 5591 5519 5765 5896 6155 6274 6422 6291 6450 6489 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
## 6316 6263 6187 6119 6013 5901 5727 5701 5483 5406 5193 5042 5052 4856 4782 
##   61   62   63   64   65   66   67   68   69   70   71   72   73   74   75 
## 4962 4995 5013 4999 4846 4814 4666 4435 4210 3914 3707 3504 3316 3130 2977 
##   76   77   78   79   80   81   82   83   84   85   86   87   88   89   90 
## 2749 2618 2452 2241 2106 1886 1749 1599 1390 1211  988  860  683  547  453 
##   91   92   93   94   95   96   97   98   99  100  101  102  103  104 
##  344  243  172  131   92   80   41   36   25   15    6    2    1    1
table(UndSocLong$sex)
## 
##      1      2 
## 154045 180849
table(UndSocLong$vote6)
## 
##      1      2      3      4 
##  30981 102212  86790  84351

We also have negative values for income (fimnnet_dv), but we will leave as it is for now.

We may also want to code sex as “male” and “female” and assign meaningful labels to vote6.

UndSocLongClean <- UndSocLong %>%
  mutate(sex = recode(sex, "1" = "male", "2" = "female")) %>%
  mutate(vote6 = recode(vote6, "1" = "very", "2" = "fairly", "3" = "not very", "4" = "not al all"))
head(UndSocLongClean, 10)
##     pidp wave dvage fimnnet_dv    sex  vote6
## 1  22445    a    NA         NA   <NA>   <NA>
## 2  22445    b    NA         NA   <NA>   <NA>
## 3  22445    c    NA         NA   <NA>   <NA>
## 4  22445    d    27   1140.000 female fairly
## 5  22445    e    28   1602.667 female fairly
## 6  22445    f    29   2012.000 female fairly
## 7  22445    g    30   1840.000 female   very
## 8  29925    a    NA         NA   <NA>   <NA>
## 9  29925    b    NA         NA   <NA>   <NA>
## 10 29925    c    NA         NA   <NA>   <NA>
saveRDS(UndSocLongClean, "myData/all7clean.rds")