# 5 Transform data

Prerequisite: Chapter 11 ‘Data transformation’ from R for Data Science, available at http://r4ds.had.co.nz/transform.html

## 5.1 Introduction to transforming data

Once you have imported your data in R you will usually want to clean it, transform it and produce some data summaries. All these tasks can be accomplished with base R. However, it is usually more convenient to use specialised packages for this, such as dplyr and data.table. In this module we will use dplyr.

We are going to work with wave 1 from the Understanding Society dataset.

# First, install the 'data.table' package.

library(data.table)

# Revisit the class "Read Data" if you need a reminder on how to best read data into R.

W1 <- fread("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab")
##
Read 50994 rows and 1364 (of 1364) columns from 0.187 GB file in 00:00:06

## 5.2 Operators recap

However, before we learn how to transform data in R, we are going to recap on operators you will need that you should have covered in other courses using R. These operators are relational and logical.

### 5.2.1 Relational operators

Relational operators are used when comparing values, and will help with selecting cases, recoding and creating variables in your assignment. Below is a table with some useful relational operators available in R.

Operator Description
< Less than
> Greater than
<! Not less than
>! Not greater than
<= Less than or equal to
>= Greater than or equal to
== Equal to
!= Not equal to

For example, if we code x to be 5 and y to be 6, we can use these operators to compare the two. R gives us a TRUE or FALSE reading for each comparison we make.

x <- 5

y <- 6

x < y
## [1] TRUE
x == y
## [1] FALSE
x != y
## [1] TRUE

If we were selecting variables from a data frame that were greater than x, for example, we could use these operators to do so. This will be covered later in this class.

### 5.2.2 Logical operators

Logical operators are used to help compare values, and are especially useful when comparing multiple values. Below is a table of some useful logical operators available in R.

Operator Description
& And
| Or
z <- 7

x & y < y
## [1] FALSE
x | z < y
## [1] TRUE
x | z <! y
## [1] TRUE

Again, these operators will be used more later in this class, and are necessary when recoding values.

# Cleaning up the working environment.

rm(x, y, z)

## 5.3 The pipe operator (%>%)

The first step in transforming data is understanding what the pipe operator %>% is, and how it works. To do this, we are going to create a simple table showing the proportions of responses given to the variable political interest (a_vote6). To begin with, we create a table showing the numbers of responses given for each category.

table(W1$a_vote6) ## ## -9 -7 -2 -1 1 2 3 4 ## 118 3262 42 42 4882 15862 14017 12769 To convert this table into a table of proportions we do the following: prop.table(table(W1$a_vote6))
##
##           -9           -7           -2           -1            1
## 0.0023139977 0.0639683100 0.0008236263 0.0008236263 0.0957367533
##            2            3            4
## 0.3110562027 0.2748754755 0.2504020081

To print this table in an easy-to-read manner, we would usually use the kable() function from the knitr package, and would need to convert the table into a data frame first and then apply the function kable().

# Loading the 'knitr' package into R.

library(knitr)

# Printing the table of proportions.

kable(data.frame(prop.table(table(W1$a_vote6))), digits = 2) Var1 Freq -9 0.00 -7 0.06 -2 0.00 -1 0.00 1 0.10 2 0.31 3 0.27 4 0.25 As you can see, at this point we have four ‘nested’ functions (functions within other functions) and the code becomes difficult to read. With the pipe operator %>% (Shift+Ctrl+M on Windows or Shift+Cmd+M on Mac) you can achieve the same result with the following code: # Loading the 'tidyverse' package, in which the package containing the pipe operator ('dplyr') is found. library(tidyverse) W1$a_vote6 %>%
table() %>%
prop.table() %>%
data.frame() %>%
kable(digits = 2)
. Freq
-9 0.00
-7 0.06
-2 0.00
-1 0.00
1 0.10
2 0.31
3 0.27
4 0.25

The pipe operator passes the results of the execution of a function to the next function, essentially treating the entire chunk of code as a ladder, which is executed from top to bottom, with each extra line of code being run on the results from the previous line. This makes code easier to write, read and understand.

## 5.4 Select variables

For your assignment, you will only want to work with select variables from Understanding Society. You learnt how to read in only select variables from a dataset in the Read data class, though if you have already loaded in a dataset (as we have here) and want to select only certain variables from this, you need to learn how to transform the data as you do not want to have to keep loading data in to R every time you decide to add or remove a variable from your analysis.

Here, we will select the variables for sex, age, place of birth and measures of weight and height, as well as the personal identification variable for each individual. In base R you could use the following code:

newW1 <- subset(W1, select = c("pidp", "a_sex", "a_dvage", "a_ukborn", "a_hlht", "a_hlhtf", "a_hlhti", "a_hlhtc", "a_hlwt", "a_hlwts", "a_hlwtp", "a_hlwtk"))
head(newW1, 3)
##        pidp a_sex a_dvage a_ukborn a_hlht a_hlhtf a_hlhti a_hlhtc a_hlwt
## 1: 68001367     1      39        1      1       6       0      -8      1
## 2: 68004087     1      59        5      1       5      11      -8      2
## 3: 68006127     2      39        1      1       5       1      -8      1
##    a_hlwts a_hlwtp a_hlwtk
## 1:      15       8      -8
## 2:      -8      -8      70
## 3:      14       7      -8

However, with dplyr, we are able to do this with much less code, as follows.

newW1 <- W1 %>%
select(pidp:a_dvage, a_ukborn, a_hlht:a_hlwtk)
head(newW1, 3)
##        pidp a_sex a_dvage a_ukborn a_hlht a_hlhtf a_hlhti a_hlhtc a_hlwt
## 1: 68001367     1      39        1      1       6       0      -8      1
## 2: 68004087     1      59        5      1       5      11      -8      2
## 3: 68006127     2      39        1      1       5       1      -8      1
##    a_hlwts a_hlwtp a_hlwtk
## 1:      15       8      -8
## 2:      -8      -8      70
## 3:      14       7      -8

As some of the variables we want to select from the W1 data appear next to one another, we can use a colon to select these adjacent variables, meaning we do not need to write out every variable we want to select.

# Clearing W1 from the working environment.

rm(W1)

## 5.5 Select cases

Relatedly, sometimes you will want to only work with certain cases in variables that meet certain conditions. Here, we only include women aged 18 to 25 in a new data frame. To do this in base R, we use what we recapped in Operators recap and can save the new data frame as follows.

women <- newW1[newW1$a_sex == 2 & (newW1$a_dvage >= 18 & newW1$a_dvage <=25),] Once again, by using dplyr, our code will be shorter and easier to read. newW1 %>% filter(a_sex == 2 & (a_dvage >= 18 & a_dvage <=25)) %>% head(3) ## pidp a_sex a_dvage a_ukborn a_hlht a_hlhtf a_hlhti a_hlhtc a_hlwt ## 1 68010207 2 24 1 1 5 5 -8 1 ## 2 68010891 2 23 1 -7 -7 -7 -7 -7 ## 3 68023131 2 23 1 1 5 9 -8 1 ## a_hlwts a_hlwtp a_hlwtk ## 1 10 10 -8 ## 2 -7 -7 -7 ## 3 11 13 -8 # Note: In this case we are not saving the new data frame as an object, but instead printing the first three rows from the data to demonstrate the result. We can make this as complicated as we want and use more than one variable. For example, here is a new data frame that includes only people born in Wales or Northern Ireland and over the age of 40. newW1 %>% filter((a_ukborn == 3 | a_ukborn == 4) & a_dvage > 40) %>% head(3) ## pidp a_sex a_dvage a_ukborn a_hlht a_hlhtf a_hlhti a_hlhtc a_hlwt ## 1 68051011 2 41 3 1 5 5 -8 1 ## 2 68062567 2 50 4 1 5 1 -8 1 ## 3 68189727 2 95 3 1 5 4 -8 1 ## a_hlwts a_hlwtp a_hlwtk ## 1 10 2 -8 ## 2 9 2 -8 ## 3 10 3 -8 ## 5.6 Recode variables Both base R and dplyr using ifelse Wanting to change the output of variables is common when working with data, as variables are often coded in ways that make them easy to code for those recording the variables, but difficult to understand without a code book for people analysing the data. This is the case with some of the variables in Understanding Society. The ifelse command is used to recode variables, and works by a) using a relational operator to specify the range of values you want to include from the original variable, b) setting what you want to recode this value/set of values as, c) finally setting what values outside the range you specified are recoded as. As a simple example, here we create a new dummy variable using base R named a_Scotland that takes the value of 1 if a person was born in Scotland (with a value of 2 from the a_ukborn variable) and 0 otherwise. newW1$a_Scotland <- ifelse(newW1$a_ukborn == 2, 1, 0) However, there are often times when you will want to recode a number of values from a variable. If you look at the code book for Understanding Society, you can see that the variable ukborn is coded to give numerical values that correspond to a country of birth. To make interpreting this variable easier, both for us and for anyone reading our analysis of the Understanding Society data, we can recode this variable to instead show the country name, rather than a numerical value, as its output. In this case, as there are multiple values we want to recode, the final clause in the ifelse() command can be set as another ifelse() command to recode other values from this variable, before setting those outside all of the values we are interested in (such as responses where this question was refused an answer) as NA to finish the command. # Instead of recoding the variable and overwriting the original, we create a new variable to put the recoded variable in, as if we make a mistake with our code and have overwritten the original data, we will have to reload the dataset and start this process again. It is good coding practice to never overwrite original data. newW1$a_uk <- ifelse(newW1$a_ukborn == 1, 'England', ifelse(newW1$a_ukborn == 2, 'Scotland',
ifelse(newW1$a_ukborn == 3, 'Wales', ifelse(newW1$a_ukborn == 4, 'NIR',

## 5.8 Sort data

We can sort data with the arrange() command from dplyr.

Initially, we sort the data by BMI.

W1mod %>%
arrange(bmi) %>%
select(pidp, bmi) %>%
head(5)
##         pidp       bmi
## 1  614152207  3.581188
## 2 1158204567  4.357708
## 3 1292128539 10.098136
## 4 1088805127 11.732290
## 5  478722727 12.050592

However, we can also sort data by a number of factors. Finally, we sort by BMI in decreasing order, separately for each sex.

library(tidyverse)
W1mod %>%
arrange(a_sex, desc(bmi)) %>%
select(pidp, a_sex, bmi) %>%
head(5)
##         pidp a_sex      bmi
## 1  952435887     1 73.90269
## 2 1632274047     1 66.37745
## 3  340892847     1 65.19274
## 4 1292157767     1 63.10072
## 5  340443367     1 62.98222

Here, we have just printed the first five rows of this data as an example. If you wanted, you could save these ordered datasets in the working environment, the same way we saved our new datasets earlier.