# 5 Transform data

Once you have imported your data in R you will usually want to clean it, transform it and produce some data summaries. All these tasks can be accomplished with base R. However, it is usually more convenient to use specialised packages for this, such as dplyr and data.table. In this module we will use dplyr.

Let us read the data from wave 1 of the Understanding Society.

library(tidyverse)
W1 <- read_tsv("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab")

## 5.1 The pipe operator (%>%)

table(W1$a_vote6) ## ## -9 -7 -2 -1 1 2 3 4 ## 118 3262 42 42 4882 15862 14017 12769 Now if we want to convert this table into a table of proportions we need to do the following. prop.table(table(W1$a_vote6))
##
##           -9           -7           -2           -1            1
## 0.0023139977 0.0639683100 0.0008236263 0.0008236263 0.0957367533
##            2            3            4
## 0.3110562027 0.2748754755 0.2504020081

Imagine now that you want to use the kable() function from the knitr package to print this table. We will need to convert the table into a data frame first and then apply the function kable.

library(knitr)
kable(data.frame(prop.table(table(W1$a_vote6))), digits = 2) Var1 Freq -9 0.00 -7 0.06 -2 0.00 -1 0.00 1 0.10 2 0.31 3 0.27 4 0.25 At this point we have four nested functions and the code becomes difficult to read. With the pipe operator (%>%) you can achieve the same result with the following code. W1$a_vote6 %>%
table() %>%
prop.table() %>%
data.frame() %>%
kable(digits = 2)
. Freq
-9 0.00
-7 0.06
-2 0.00
-1 0.00
1 0.10
2 0.31
3 0.27
4 0.25

The pipe operator passes the results of the execution of a function to the next function. This makes code easier to write, read and understand.

## 5.2 Select variables

Imagine we want to select some variables from a data frame. Indeed, W1 is too large for our purposes and we do not need all the variables today. Let us select the variables for sex, age, place of birth and measures of weight and height.

In base R you could do the following:

newW1 <- subset(W1, select = c("pidp", "a_sex", "a_dvage", "a_ukborn", "a_hlht", "a_hlhtf",
"a_hlhti", "a_hlhtc", "a_hlwt", "a_hlwts", "a_hlwtp", "a_hlwtk"))
head(newW1, 3)
## # A tibble: 3 x 12
##       pidp a_sex a_dvage a_ukborn a_hlht a_hlhtf a_hlhti a_hlhtc a_hlwt
##      <int> <int>   <int>    <int>  <int>   <int>   <int>   <int>  <int>
## 1 68001367     1      39        1      1       6       0      -8      1
## 2 68004087     1      59        5      1       5      11      -8      2
## 3 68006127     2      39        1      1       5       1      -8      1
## # ... with 3 more variables: a_hlwts <int>, a_hlwtp <int>, a_hlwtk <int>

With dplyr the following code will produce the same result.

newW1 <- W1 %>%
select(pidp:a_dvage, a_ukborn, a_hlht:a_hlwtk)
head(newW1, 3)
## # A tibble: 3 x 12
##       pidp a_sex a_dvage a_ukborn a_hlht a_hlhtf a_hlhti a_hlhtc a_hlwt
##      <int> <int>   <int>    <int>  <int>   <int>   <int>   <int>  <int>
## 1 68001367     1      39        1      1       6       0      -8      1
## 2 68004087     1      59        5      1       5      11      -8      2
## 3 68006127     2      39        1      1       5       1      -8      1
## # ... with 3 more variables: a_hlwts <int>, a_hlwtp <int>, a_hlwtk <int>

Note that I combined together the variables that follow each other in the original data frame with :.

## 5.3 Select cases

Another common task is to select cases based on some conditions. For example, we may want to have a data frame that only indludes women aged 18 to 25.

In base R you can do the following.

## 5.5 Sort data

We can sort data with arrange.

We may want to sort the data by BMI.

W1mod %>%
arrange(bmi) %>%
select(pidp, bmi) %>%
head(5)
## # A tibble: 5 x 2
##         pidp       bmi
##        <int>     <dbl>
## 1  614152207  3.581188
## 2 1158204567  4.357708
## 3 1292128539 10.098136
## 4 1088805127 11.732290
## 5  478722727 12.050592

We can also sort cases by BMI in the decreasing order, separatey for each sex.

W1mod %>%
arrange(a_sex, desc(bmi)) %>%
select(pidp, a_sex, bmi) %>%
head(5)
## # A tibble: 5 x 3
##         pidp a_sex      bmi
##        <int> <int>    <dbl>
## 1  952435887     1 73.90269
## 2 1632274047     1 66.37745
## 3  340892847     1 65.19274
## 4 1292157767     1 63.10072
## 5  340443367     1 62.98222

## 5.6 Summarise data

dplyr is also helpful when you want to create a data frame with summary statistics. For example, we may want to calculate mean and median BMI in our sample and the proportion of people with BMI over 30 (considered obese).

W1mod %>%
# create a binary variable for being obese
mutate(bmiover30 = ifelse(bmi > 30, 1, 0)) %>%
summarise(
meanBMI = mean(bmi, na.rm=TRUE),
medianBMI = median(bmi, na.rm=TRUE),
proportionObese = mean(bmiover30, na.rm=TRUE)
)
## # A tibble: 1 x 3
##    meanBMI medianBMI proportionObese
##      <dbl>     <dbl>           <dbl>
## 1 26.04229  25.37527       0.1807575

We can also produce summary statistics by group. Let us look at the BMI summaries by age group and by sex.

W1mod %>%
# create a variabe for being obese
mutate(bmiover30 = ifelse(bmi > 30, 1, 0)) %>%
# create a variable for age groups
mutate(agegr = ifelse(a_dvage >= 18 & a_dvage <= 35, "18-35",
ifelse((a_dvage >= 36 & a_dvage <= 55), "36-55",
ifelse(a_dvage >= 56, ">55", NA)))) %>%
# filter out people with missing age
filter(!is.na(agegr)) %>%
# group by sex and age
group_by(a_sex, agegr) %>%
# calculate summary statistics
summarise(
meanBMI = mean(bmi, na.rm=TRUE),
medianBMI = median(bmi, na.rm=TRUE),
proportion = mean(bmiover30, na.rm=TRUE)
)
## # A tibble: 6 x 5
## # Groups:   a_sex [?]
##   a_sex agegr  meanBMI medianBMI proportion
##   <int> <chr>    <dbl>     <dbl>      <dbl>
## 1     1   >55 26.95474  26.43255  0.1967436
## 2     1 18-35 24.99670  24.46032  0.1179154
## 3     1 36-55 27.25879  26.59878  0.2157322
## 4     2   >55 26.65251  25.96255  0.2168707
## 5     2 18-35 24.48421  23.37306  0.1262995
## 6     2 36-55 26.45840  25.38130  0.2179358

We can see that people aged 18 to 35 have the lowest proportion of obese people both among men and women.

Let us now save W1mod for future use.

saveRDS(W1mod, "myData/W1mod.rds")