6 Join data

Prerequisite: Chapter 13 ‘Relational Data’ from R for Data Science, available at http://r4ds.had.co.nz/relational-data.html

6.1 Introduction to joining data

When you work with data, it is rare that all the data you need will be confined to just one table. As a result, you will often find that you have to manipulate multiple data tables, or data frames. In order to do this, it helps to be able to join these multiple sources of data into just one table.

This is very relevant for your assignment, as you are required to work with multiple waves of the Understanding Society data set. Without being able to join these waves together, you will struggle to analyse more than one wave at a time.

This class will teach you a number of ways to do this. We will start by joining just two waves together in the four ways introduced in Chapter 13 of R for Data Science into one, easy to work with, table. The next class will focus on joining multiple waves together at once.

6.2 Loading the data

Before we join the waves together, we first need to read the individual waves into R. To do this, we need to use the function read_tsv() from the readr package, as the data are tab separated. We can load this by attaching the tidyverse package (you will need to install the tidyverse package first if you have not already done so, with the install.packages() function).

# This code attaches the tidyverse package and then reads the first two waves of Understanding Society separately into R.
library(tidyverse)

UndSoc1 <- read_tsv("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab")
UndSoc2 <- read_tsv("data/UKDA-6614-tab/tab/us_w2/b_indresp.tab")

You may have noticed that this took a lot longer to load into R than other data files you have used in the past. This is because the files themselves are large, and take up a lot of space in the memory. We can see just how much space they take up by using the object.size() command.

object.size(UndSoc1)
## 286289824 bytes
object.size(UndSoc2)
## 367629552 bytes

However, this does not give us an answer that we can easily interpret. Therefore, by using the format() command, we can specify how we want R to show us this data. In this case, by adding units = “” we can tell R to show us how large the files are in specific units.

format(object.size(UndSoc1), units = "auto")
## [1] "273 Mb"
format(object.size(UndSoc2), units = "auto")
## [1] "350.6 Mb"
# By adding units = 'auto' R automatically chooses the clearest way to show us the size of the files.

To make the files we are working with smaller, and thus make R carry out our analysis quicker, we can select only the variables we need for our analysis. We are only going to load a few variables here to keep this simple; you can always add more to your analysis. Here, as we are looking at level of interest in politics, they are:

  • pidp: this is the id number given to each respondent to identify them in each wave

  • a_sex: sex from wave 1

  • a_dvage: age from wave 1

  • a_vote6: level of interest in politics from wave 1

  • b_sex: sex from wave 2

  • b_dvage: age from wave 2

  • b_vote6: level of interest in politics from wave 2

By using what we learnt in the previous class, Transforming Data, we can edit both data sets to keep only the variables we need.

UndSoc1ed <- UndSoc1 %>%
        select(pidp, a_sex, a_dvage, a_vote6)
UndSoc2ed <- UndSoc2 %>%
        select(pidp, b_sex, b_dvage, b_vote6)

We can check that these are much smaller objects, and that our code was successful.

format(object.size(UndSoc1ed), units = "auto")
## [1] "1.3 Mb"
format(object.size(UndSoc2ed), units = "auto")
## [1] "5.5 Mb"
head(UndSoc1ed)
## # A tibble: 6 x 4
##       pidp a_sex a_dvage a_vote6
##      <int> <int>   <int>   <int>
## 1 68001367     1      39       3
## 2 68004087     1      59       2
## 3 68006127     2      39       4
## 4 68006135     2      17       4
## 5 68006807     2      72       4
## 6 68007487     2      57       1
head(UndSoc2ed)
## # A tibble: 6 x 4
##       pidp b_sex b_dvage b_vote6
##      <int> <int>   <int>   <int>
## 1 68004087     1      60       2
## 2 68006127     2      40       4
## 3 68006807     2      73       4
## 4 68007487     2      58       2
## 5 68008167     2      39       4
## 6 68008171     1      52      -7

As we no longer have any use for the original data we loaded into R, we can remove them from R to free up the memory they are unnecessarily taking up.

rm(UndSoc1, UndSoc2)

6.3 Joining waves 1 and 2

Now our data is ready for us to join the data sets. As you already know, there can be several types of joins. Here, we will use each of them to join the two waves of data together, and look at the differences between them. However, first it is important to understand how joining data works.

6.3.1 The ‘key’

When we join two data frames together, we have to choose which variable we want to join them by, known as the ‘key’. As we are working with a data set that has an id number (the variable pidp in the data set) that is unique to each respondent, it makes sense to join the two waves we are using by this id number, so we can compare responses by each person in each wave of data. To do this, we use the by = “” command.

6.3.2 Inner join

The first join we will use is the inner join. By using inner_join() we can join both waves together and keep the observations that are present in both data frames. This means that respondents who were in both waves 1 and 2 will be in our new data frame, but anyone who appeared in only wave 1 or wave 2 will be excluded.

inner <- UndSoc1ed %>%
        inner_join(UndSoc2ed, by = "pidp")

We can see that this has excluded a number of respondents from both waves 1 and 2 who only appeared in one wave and not the other, as we only have 38388 observations in the our new data frame compared to 50994 in UndSoc1, and 54597 in UndSoc2.

6.3.3 Left join

However, we may not want to exclude all of the respondents who do not appear in both waves. If, for example, we were working with three waves of data and wanted to compare how the answers given by respondents in wave 1 changed over time, we would want to keep answers from people who were in waves 1 and 3, even if they were not in wave 2. To do this, we can use left_join(), which will keep answers from everyone who appeared in wave 1, and exclude answers from everyone who did not.

left <- UndSoc1ed %>%
        left_join(UndSoc2ed, by = "pidp")

To check we have done this correctly, we can compare the number of observations in UndSoc1ed and left, which should be the same. As we can see they each have 50994 observations, we can be confident our new data frame has been created successfully. If any respondents appeared in wave 1 but not wave 2, their answers for wave 2 will show as ‘NA’ values. We can have a quick look at how this new data frame looks with head().

head(left)
## # A tibble: 6 x 7
##       pidp a_sex a_dvage a_vote6 b_sex b_dvage b_vote6
##      <int> <int>   <int>   <int> <int>   <int>   <int>
## 1 68001367     1      39       3    NA      NA      NA
## 2 68004087     1      59       2     1      60       2
## 3 68006127     2      39       4     2      40       4
## 4 68006135     2      17       4    NA      NA      NA
## 5 68006807     2      72       4     2      73       4
## 6 68007487     2      57       1     2      58       2

6.3.4 Right join

The command right_join() is very similar to left_join(), except that instead of keeping all the respondents who were in wave 1, this will keep all the respondents who appeared in wave 2 and exclude those who did not. Again, any respondents who appeared in wave 2 but not wave 1 will have their answers for wave 1 show as ‘NA’ values.

right <- UndSoc1ed %>%
        right_join(UndSoc2ed, by = "pidp")
head(right)
## # A tibble: 6 x 7
##       pidp a_sex a_dvage a_vote6 b_sex b_dvage b_vote6
##      <int> <int>   <int>   <int> <int>   <int>   <int>
## 1 68004087     1      59       2     1      60       2
## 2 68006127     2      39       4     2      40       4
## 3 68006807     2      72       4     2      73       4
## 4 68007487     2      57       1     2      58       2
## 5 68008167     2      38       3     2      39       4
## 6 68008171     1      51      -7     1      52      -7

6.3.5 Full join

Usually, we would want all the respondents from both waves to remain in the data set, no matter if they appeared in the other waves or not. If we need to exclude any respondents from our analysis we can do this manually. We can do a full join with the full_join() command.

full <- UndSoc1ed %>%
        full_join(UndSoc2ed, by = "pidp")

By doing a full join, our new data frame has 67203 observations compared to 50994 in wave 1 and 54597 in wave 2. This is because it includes not only all the individuals that took part in both waves 1 and 2, but also those who only took part in either wave 1 or wave 2.

Now that we have joined two waves together, we could begin our analysis of these waves, if we had decided to just analyse the first couple of waves from the Understanding Society dataset.

head(full)
## # A tibble: 6 x 7
##       pidp a_sex a_dvage a_vote6 b_sex b_dvage b_vote6
##      <int> <int>   <int>   <int> <int>   <int>   <int>
## 1 68001367     1      39       3    NA      NA      NA
## 2 68004087     1      59       2     1      60       2
## 3 68006127     2      39       4     2      40       4
## 4 68006135     2      17       4    NA      NA      NA
## 5 68006807     2      72       4     2      73       4
## 6 68007487     2      57       1     2      58       2

If you wanted to work with more than just these two waves, you could manually join each new wave to our ‘full’ data frame. However, this is a slow and cumbersome process, so the next class will teach you how to join several waves of data together at once, using iteration and loops.