6 Tidy data

Prerequisite: Chapter 12 ‘Tidy data’ from R for Data Science, available at http://r4ds.had.co.nz/tidy-data.html.

Earlier in this course (entitiled ‘Join Data’) we learned how to join together data from seven waves of the Understanding Society. Let us open this data set.

UndSoc <- readRDS("myData/all7.rds")

head(UndSoc)

##       pidp a_sex a_dvage a_vote6 a_fimnnet_dv b_sex b_dvage b_vote6
## 1 68001367     1      39       3    1400.0000    NA      NA      NA
## 2 68004087     1      59       2     802.0833     1      60       2
## 3 68006127     2      39       4    1179.5267     2      40       4
## 4 68006135     2      17       4     130.0000    NA      NA      NA
## 5 68006807     2      72       4     933.8135     2      73       4
## 6 68007487     2      57       1     400.0000     2      58       2
##   b_fimnnet_dv c_sex c_dvage c_vote6 c_fimnnet_dv d_sex d_dvage d_vote6
## 1           NA    NA      NA      NA           NA    NA      NA      NA
## 2   1276.66663     1      61       2     914.3333     1      62       1
## 3   1115.99341     2      41       4    1175.6666     2      43       4
## 4           NA     2      19       4    1100.0000     2      21       2
## 5   1145.27893     2      74       4    1146.7035     2      75       4
## 6     16.66666     2      59       4     475.0000     2      60       1
##   d_fimnnet_dv e_sex e_dvage e_vote6 e_fimnnet_dv f_sex f_dvage f_vote6
## 1           NA    NA      NA      NA           NA    NA      NA      NA
## 2     914.3333     1      63       2    1015.6667     1      64       2
## 3     851.6666     2      43       4    1025.2756     2      44       4
## 4     977.7525     2      21       4     823.0209    NA      NA      NA
## 5    1405.1293     2      76       4   15000.0000     2      77       4
## 6     666.6168    NA      NA      NA           NA    NA      NA      NA
##   f_fimnnet_dv g_sex g_dvage g_vote6 g_fimnnet_dv
## 1           NA    NA      NA      NA           NA
## 2     1007.500     1      65       2    1258.3334
## 3     1108.833     2      45       4     385.0000
## 4           NA     2      23      -7     909.2457
## 5      904.209     2      78       3     996.6353
## 6           NA    NA      NA      NA           NA

Now we will work on how these data can be represented and prepared for analysis.

6.1 Long and wide formats

Let us keep only a few observations and columns in the data and more closely look at its structure.

# First attach the tidyverse package.

library(tidyverse)

UndSocExample <- UndSoc %>%
  filter(pidp == 68001367 | pidp == 68004087) %>%
  select(pidp, a_sex: b_fimnnet_dv)

UndSocExample

##       pidp a_sex a_dvage a_vote6 a_fimnnet_dv b_sex b_dvage b_vote6
## 1 68001367     1      39       3    1400.0000    NA      NA      NA
## 2 68004087     1      59       2     802.0833     1      60       2
##   b_fimnnet_dv
## 1           NA
## 2     1276.667

These are the data for two individuals only in waves 1 and 2, presented in the ‘wide’ format. This means that we have one row for each individual, and data from different waves are recorded in several columns. For example, the data on sex from wave 1 is in column a_sex and the data on sex from wave 2 is in column b_sex. This quite literally makes the table of the data wider than if each wave was represented by a new row, which would instead make the table longer and narrower.

In longitudinal data sets, it is common to find data presented in this ‘wide’ format. While this may be convenient for certain purposes, it is generally recommended to work with data in the ‘long’ format (that corresponds to the ‘tidy data’ principles as described in the R for Data Science book).

6.1.1 Reshaping the data

To restructure the data from the ‘wide’ format to the ‘long’ format, we can use the gather and spread functions from the tidyr package (part of tidyverse) with separate, shown below:

# Here is our data in its current format:

UndSocExample

##       pidp a_sex a_dvage a_vote6 a_fimnnet_dv b_sex b_dvage b_vote6
## 1 68001367     1      39       3    1400.0000    NA      NA      NA
## 2 68004087     1      59       2     802.0833     1      60       2
##   b_fimnnet_dv
## 1           NA
## 2     1276.667

# First, we want to 'melt' the data frame, putting each new variable on a new row.

UndSocExample %>%
  gather(a_sex:b_fimnnet_dv, key = "variable", value = "value")

##        pidp     variable     value
## 1  68001367        a_sex    1.0000
## 2  68004087        a_sex    1.0000
## 3  68001367      a_dvage   39.0000
## 4  68004087      a_dvage   59.0000
## 5  68001367      a_vote6    3.0000
## 6  68004087      a_vote6    2.0000
## 7  68001367 a_fimnnet_dv 1400.0000
## 8  68004087 a_fimnnet_dv  802.0833
## 9  68001367        b_sex        NA
## 10 68004087        b_sex    1.0000
## 11 68001367      b_dvage        NA
## 12 68004087      b_dvage   60.0000
## 13 68001367      b_vote6        NA
## 14 68004087      b_vote6    2.0000
## 15 68001367 b_fimnnet_dv        NA
## 16 68004087 b_fimnnet_dv 1276.6666

# Then, we split the 'variable' column into separate columns for each variable and 'cast' these new columns in the 'long' format.

UndSocLongExample <- UndSocExample %>%
  gather(a_sex:b_fimnnet_dv, key = "variable", value = "value") %>%
  separate(variable, into = c("wave", "variable"), sep = "_", extra = "merge") %>%
  spread(key = variable, value = value)

# The key command here is the 'separate' command, which takes the letter representing the wave from which the data was taken, and puts this instead as a new variable, called 'wave'.

Now the data are in the ‘long’ format. This means that we have as many rows for each individual as the number of waves, a variable indicating wave, and all other variables are in columns. In most cases with longitudinal data, the ‘long’ format is easier to work with.

If we want to restructure our data back to the ‘wide’ format from the ‘long’ format, we use the same gather and spread commands, but this time with the unite command, shown below:

# The data in its 'long' format:

UndSocLongExample

##       pidp wave dvage fimnnet_dv sex vote6
## 1 68001367    a    39  1400.0000   1     3
## 2 68001367    b    NA         NA  NA    NA
## 3 68004087    a    59   802.0833   1     2
## 4 68004087    b    60  1276.6666   1     2

# Here, we again use the 'gather' command to 'melt' the data frame, putting every variable on a new row, then 'cast' these new rows in the 'wide' data format.

UndSocLongExample %>%
  gather(dvage:vote6, key = "variable", value = "value") %>%
  unite("variable", c("wave", "variable"), sep = "_") %>%
  spread(key = variable, value = value)

##       pidp a_dvage a_fimnnet_dv a_sex a_vote6 b_dvage b_fimnnet_dv b_sex
## 1 68001367      39    1400.0000     1       3      NA           NA    NA
## 2 68004087      59     802.0833     1       2      60     1276.667     1
##   b_vote6
## 1      NA
## 2       2

# The key command here is the 'unite' command, which takes the letter representing the wave from which the data was taken, and recombines this with each variable, meaning each variable has a letter in front of it that represents the wave the data is from.

6.1.2 Reshaping the data using the reshape2 package

To move from the ‘wide’ to the ‘long’ format we can also use the functions melt and dcast, from the reshape2 package. The melt function is roughly equivalent to gather and the dcast function is roughly equivalent to spread.

# You can use require() to install packages if and only if they are necessary in your code.

require(reshape2)

# First we "melt" the data frame.

UndSocExampleMolten <- UndSocExample %>%
  melt(id = "pidp")
UndSocExampleMolten

##        pidp     variable     value
## 1  68001367        a_sex    1.0000
## 2  68004087        a_sex    1.0000
## 3  68001367      a_dvage   39.0000
## 4  68004087      a_dvage   59.0000
## 5  68001367      a_vote6    3.0000
## 6  68004087      a_vote6    2.0000
## 7  68001367 a_fimnnet_dv 1400.0000
## 8  68004087 a_fimnnet_dv  802.0833
## 9  68001367        b_sex        NA
## 10 68004087        b_sex    1.0000
## 11 68001367      b_dvage        NA
## 12 68004087      b_dvage   60.0000
## 13 68001367      b_vote6        NA
## 14 68004087      b_vote6    2.0000
## 15 68001367 b_fimnnet_dv        NA
## 16 68004087 b_fimnnet_dv 1276.6666

# Next we need to split the column variable into a column indicating wave and a column indicating variable name.

UndSocExampleSep <- UndSocExampleMolten %>%
        separate(variable, into = c("wave", "variable"), sep = "_")

UndSocExampleSep

##        pidp wave variable     value
## 1  68001367    a      sex    1.0000
## 2  68004087    a      sex    1.0000
## 3  68001367    a    dvage   39.0000
## 4  68004087    a    dvage   59.0000
## 5  68001367    a    vote6    3.0000
## 6  68004087    a    vote6    2.0000
## 7  68001367    a  fimnnet 1400.0000
## 8  68004087    a  fimnnet  802.0833
## 9  68001367    b      sex        NA
## 10 68004087    b      sex    1.0000
## 11 68001367    b    dvage        NA
## 12 68004087    b    dvage   60.0000
## 13 68001367    b    vote6        NA
## 14 68004087    b    vote6    2.0000
## 15 68001367    b  fimnnet        NA
## 16 68004087    b  fimnnet 1276.6666

# We have a problem here because one of our variables (fimnnet_dv) has '_' in the name and we do not want to separate by it. To avoid this problem we need to add the argument 'extra = "merge"' in separate().

UndSocExampleSep <- UndSocExampleMolten %>%
        separate(variable, into = c("wave", "variable"), sep = "_", extra = "merge")

UndSocExampleSep

##        pidp wave   variable     value
## 1  68001367    a        sex    1.0000
## 2  68004087    a        sex    1.0000
## 3  68001367    a      dvage   39.0000
## 4  68004087    a      dvage   59.0000
## 5  68001367    a      vote6    3.0000
## 6  68004087    a      vote6    2.0000
## 7  68001367    a fimnnet_dv 1400.0000
## 8  68004087    a fimnnet_dv  802.0833
## 9  68001367    b        sex        NA
## 10 68004087    b        sex    1.0000
## 11 68001367    b      dvage        NA
## 12 68004087    b      dvage   60.0000
## 13 68001367    b      vote6        NA
## 14 68004087    b      vote6    2.0000
## 15 68001367    b fimnnet_dv        NA
## 16 68004087    b fimnnet_dv 1276.6666

# Finally, we "cast" the molten data frame into the format we want.

UndSocExampleLong <- UndSocExampleSep %>%
  dcast(pidp + wave ~ variable)

UndSocExampleLong

##       pidp wave dvage fimnnet_dv sex vote6
## 1 68001367    a    39  1400.0000   1     3
## 2 68001367    b    NA         NA  NA    NA
## 3 68004087    a    59   802.0833   1     2
## 4 68004087    b    60  1276.6666   1     2

To convert the data back to the ‘wide’ format, we do the following:

# First melt
UndSocExampleMolten2 <- UndSocExampleLong %>%
  melt(id = c("pidp", "wave"))
UndSocExampleMolten2

##        pidp wave   variable     value
## 1  68001367    a      dvage   39.0000
## 2  68001367    b      dvage        NA
## 3  68004087    a      dvage   59.0000
## 4  68004087    b      dvage   60.0000
## 5  68001367    a fimnnet_dv 1400.0000
## 6  68001367    b fimnnet_dv        NA
## 7  68004087    a fimnnet_dv  802.0833
## 8  68004087    b fimnnet_dv 1276.6666
## 9  68001367    a        sex    1.0000
## 10 68001367    b        sex        NA
## 11 68004087    a        sex    1.0000
## 12 68004087    b        sex    1.0000
## 13 68001367    a      vote6    3.0000
## 14 68001367    b      vote6        NA
## 15 68004087    a      vote6    2.0000
## 16 68004087    b      vote6    2.0000

# Unite the columns
UndSocExampleUnited <- UndSocExampleMolten2 %>%
  unite("variable", c("wave", "variable"), sep = "_")
UndSocExampleUnited

##        pidp     variable     value
## 1  68001367      a_dvage   39.0000
## 2  68001367      b_dvage        NA
## 3  68004087      a_dvage   59.0000
## 4  68004087      b_dvage   60.0000
## 5  68001367 a_fimnnet_dv 1400.0000
## 6  68001367 b_fimnnet_dv        NA
## 7  68004087 a_fimnnet_dv  802.0833
## 8  68004087 b_fimnnet_dv 1276.6666
## 9  68001367        a_sex    1.0000
## 10 68001367        b_sex        NA
## 11 68004087        a_sex    1.0000
## 12 68004087        b_sex    1.0000
## 13 68001367      a_vote6    3.0000
## 14 68001367      b_vote6        NA
## 15 68004087      a_vote6    2.0000
## 16 68004087      b_vote6    2.0000

# And now cast
UndSocExampleWide <- UndSocExampleUnited %>%
  dcast(pidp ~ variable)
UndSocExampleWide

##       pidp a_dvage a_fimnnet_dv a_sex a_vote6 b_dvage b_fimnnet_dv b_sex
## 1 68001367      39    1400.0000     1       3      NA           NA    NA
## 2 68004087      59     802.0833     1       2      60     1276.667     1
##   b_vote6
## 1      NA
## 2       2

6.2 Exercise:

Reshape the full UndSoc data frame from ‘wide’ to ‘long’ format. Call the object where you will store the result UndSocLong.

6.2.1 Solution:

UndSocLong <- UndSoc %>%
  gather(a_sex:g_fimnnet_dv, key = "variable", value = "value") %>%
  separate(variable, into = c("wave", "variable"), sep = "_", extra = "merge") %>%
  spread(key = variable, value = value)

head(UndSocLong, 5)

##    pidp wave dvage fimnnet_dv sex vote6
## 1 22445    a    NA         NA  NA    NA
## 2 22445    b    NA         NA  NA    NA
## 3 22445    c    NA         NA  NA    NA
## 4 22445    d    27   1140.000   2     2
## 5 22445    e    28   1602.667   2     2

6.3 Cleaning the data

Before we begin the analysis, we want to make sure that the data have been cleaned and all the missing values have been correctly identified. It usually makes sense to separate the cleaning and analysis stages into separate scripts.

Let us explore the data set we have. Note that if we had not converted the data into the ‘long’ format we would have to tabulate and clean each variable seven times.

summary(UndSocLong)

##       pidp               wave               dvage          fimnnet_dv    
##  Min.   :2.244e+04   Length:584703      Min.   : -9.00   Min.   :-42904  
##  1st Qu.:4.086e+08   Class :character   1st Qu.: 32.00   1st Qu.:   630  
##  Median :7.493e+08   Mode  :character   Median : 46.00   Median :  1159  
##  Mean   :7.973e+08                      Mean   : 47.09   Mean   :  1383  
##  3rd Qu.:1.224e+09                      3rd Qu.: 61.00   3rd Qu.:  1800  
##  Max.   :1.653e+09                      Max.   :104.00   Max.   : 15000  
##                                         NA's   :249806   NA's   :249806  
##       sex             vote6       
##  Min.   :-9.00    Min.   :-10.00  
##  1st Qu.: 1.00    1st Qu.:  2.00  
##  Median : 2.00    Median :  3.00  
##  Mean   : 1.54    Mean   :  1.82  
##  3rd Qu.: 2.00    3rd Qu.:  4.00  
##  Max.   : 2.00    Max.   :  4.00  
##  NA's   :249806   NA's   :249806

table(UndSocLong$wave)

## 
##     a     b     c     d     e     f     g 
## 83529 83529 83529 83529 83529 83529 83529

table(UndSocLong$dvage)

## 
##   -9   -2   -1   16   17   18   19   20   21   22   23   24   25   26   27 
##   19    6   34 5735 5769 5536 5350 5188 5000 4779 4668 4537 4555 4556 4610 
##   28   29   30   31   32   33   34   35   36   37   38   39   40   41   42 
## 4742 4836 5129 5181 5286 5387 5380 5512 5591 5519 5765 5896 6155 6274 6422 
##   43   44   45   46   47   48   49   50   51   52   53   54   55   56   57 
## 6291 6450 6489 6316 6263 6187 6119 6013 5901 5727 5701 5483 5406 5193 5042 
##   58   59   60   61   62   63   64   65   66   67   68   69   70   71   72 
## 5052 4856 4782 4962 4995 5013 4999 4846 4814 4666 4435 4210 3914 3707 3504 
##   73   74   75   76   77   78   79   80   81   82   83   84   85   86   87 
## 3316 3130 2977 2749 2618 2452 2241 2106 1886 1749 1599 1390 1211  988  860 
##   88   89   90   91   92   93   94   95   96   97   98   99  100  101  102 
##  683  547  453  344  243  172  131   92   80   41   36   25   15    6    2 
##  103  104 
##    1    1

table(UndSocLong$sex)

## 
##     -9     -1      1      2 
##      2      1 154045 180849

table(UndSocLong$vote6)

## 
##    -10     -9     -7     -2     -1      1      2      3      4 
##   4656    366  24752    431    358  30981 102212  86790  84351

summary(UndSocLong$fimnnet_dv)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -42904     630    1159    1383    1800   15000  249806

Note the negative values for dvage, sex and vote6. These are missing values that need to be coded as NA.

UndSocLong  <- UndSocLong %>%
  mutate(dvage = ifelse(dvage > 0, dvage, NA)) %>%
  mutate(sex = ifelse(sex > 0, sex, NA)) %>%
  mutate(vote6 = ifelse(vote6 > 0, vote6, NA))

# If you need a recap on how the ifelse function works, we previously covered it in 'Transform Data', section 6 'Recode variables Both base R and dplyr using ifelse'.

table(UndSocLong$dvage)

## 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
## 5735 5769 5536 5350 5188 5000 4779 4668 4537 4555 4556 4610 4742 4836 5129 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
## 5181 5286 5387 5380 5512 5591 5519 5765 5896 6155 6274 6422 6291 6450 6489 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
## 6316 6263 6187 6119 6013 5901 5727 5701 5483 5406 5193 5042 5052 4856 4782 
##   61   62   63   64   65   66   67   68   69   70   71   72   73   74   75 
## 4962 4995 5013 4999 4846 4814 4666 4435 4210 3914 3707 3504 3316 3130 2977 
##   76   77   78   79   80   81   82   83   84   85   86   87   88   89   90 
## 2749 2618 2452 2241 2106 1886 1749 1599 1390 1211  988  860  683  547  453 
##   91   92   93   94   95   96   97   98   99  100  101  102  103  104 
##  344  243  172  131   92   80   41   36   25   15    6    2    1    1

table(UndSocLong$sex)

## 
##      1      2 
## 154045 180849

table(UndSocLong$vote6)

## 
##      1      2      3      4 
##  30981 102212  86790  84351

We also have negative values for income (fimnnet_dv), but we will leave as it is for now.

We may want to code sex as ‘male’ and ‘female’ and assign meaningful labels to vote6.

UndSocLongClean <- UndSocLong %>%
  mutate(sex = recode(sex, "1" = "male", "2" = "female")) %>%
  mutate(vote6 = recode(vote6, "1" = "very", "2" = "fairly", "3" = "not very", "4" = "not at all"))

head(UndSocLongClean, 10)

##     pidp wave dvage fimnnet_dv    sex  vote6
## 1  22445    a    NA         NA   <NA>   <NA>
## 2  22445    b    NA         NA   <NA>   <NA>
## 3  22445    c    NA         NA   <NA>   <NA>
## 4  22445    d    27   1140.000 female fairly
## 5  22445    e    28   1602.667 female fairly
## 6  22445    f    29   2012.000 female fairly
## 7  22445    g    30   1840.000 female   very
## 8  29925    a    NA         NA   <NA>   <NA>
## 9  29925    b    NA         NA   <NA>   <NA>
## 10 29925    c    NA         NA   <NA>   <NA>

You will have noticed that we used the ifelse command to code the negative values from the dvage, sex and vote6 variables as missing values, whereas we used the recode command to assign labels to the values in the data for the sex and vote6 variables. This is because the ifelse function works with conditional statements (used in creating NA values for missing numbers) and the recode function requires fewer lines of code to recode multiple values in a variable, as it can simply reassign values, rather than having to set values to a number of conditional statements (explained in Transform Data, section 6 Recode variables Both base R and dplyr using ifelse).

Finally, we save this cleaned data in our myData folder, as all7clean.

saveRDS(UndSocLongClean, "myData/all7clean.rds")