4 Join data

Prerequisite: Chapter 13 ‘Relational Data’ from R for Data Science, available at http://r4ds.had.co.nz/relational-data.html

4.1 Introduction to joining data

When you work with data, it is rare that all the data you need will be confined to just one table. As a result, you will often find that you have to manipulate multiple data tables, or data frames. In order to do this, it helps to be able to join these multiple sources of data into just one table.

This is very relevant for your assignment, as you are required to work with multiple waves of the Understanding Society data set. Without being able to join these waves together, you will struggle to analyse more than one wave at a time.

This class will teach you a number of ways to do this. We will start by joining just two waves together in the four ways introduced in Chapter 13 of R for Data Science into one, easy to work with, table. The next class will focus on joining multiple waves together at once.

4.2 Joining waves 1 and 2

Following on from the class Read data, you should have the two waves of data we are going to join already saved in your myData folder, named UndSoc1ed and UndSoc2ed, with these loaded into the workspace saved as ReadData. Once you have loaded this workspace back into R, you are ready to join these two waves together.

load("myData/ReadData.RData")

As you already know, there can be several types of joins. Here, we will use each of them to join the two waves of data together, and look at the differences between them. However, first it is important to understand how joining data works.

4.2.1 The ‘key’

When we join two data frames together, we have to choose which variable we want to join them by, known as the ‘key’. As we are working with a data set that has an id number (the variable pidp in the data set) that is unique to each respondent, it makes sense to join the two waves we are using by this id number, so we can compare responses by each person in each wave of data. To do this, we use the by = “” command.

4.2.2 Inner join

The first join we will use is the inner join. By using inner_join() we can join both waves together and keep the observations that are present in both data frames. This means that respondents who were in both waves 1 and 2 will be in our new data frame, but anyone who appeared in only wave 1 or wave 2 will be excluded.

# Attaching the tidyverse package

library(tidyverse)

inner <- UndSoc1ed %>%
        inner_join(UndSoc2ed, by = "pidp")

We can see that this has excluded a number of respondents from both waves 1 and 2 who only appeared in one wave and not the other, as we only have 38388 observations in the our new data frame compared to 50994 in UndSoc1, and 54597 in UndSoc2.

4.2.3 Left join

However, we may not want to exclude all of the respondents who do not appear in both waves. If, for example, we were working with three waves of data and wanted to compare how the answers given by respondents in wave 1 changed over time, we would want to keep answers from people who were in waves 1 and 3, even if they were not in wave 2. To do this, we can use left_join(), which will keep answers from everyone who appeared in wave 1, and exclude answers from everyone who did not.

left <- UndSoc1ed %>%
        left_join(UndSoc2ed, by = "pidp")

To check we have done this correctly, we can compare the number of observations in UndSoc1ed and left, which should be the same. As we can see they each have 50994 observations, we can be confident our new data frame has been created successfully. If any respondents appeared in wave 1 but not wave 2, their answers for wave 2 will show as ‘NA’ values. We can have a quick look at how this new data frame looks with head().

head(left)
##       pidp a_sex a_dvage a_vote6 b_sex b_dvage b_vote6
## 1 68001367     1      39       3    NA      NA      NA
## 2 68004087     1      59       2     1      60       2
## 3 68006127     2      39       4     2      40       4
## 4 68006135     2      17       4    NA      NA      NA
## 5 68006807     2      72       4     2      73       4
## 6 68007487     2      57       1     2      58       2

4.2.4 Right join

The command right_join() is very similar to left_join(), except that instead of keeping all the respondents who were in wave 1, this will keep all the respondents who appeared in wave 2 and exclude those who did not. Again, any respondents who appeared in wave 2 but not wave 1 will have their answers for wave 1 show as ‘NA’ values.

right <- UndSoc1ed %>%
        right_join(UndSoc2ed, by = "pidp")
head(right)
##       pidp a_sex a_dvage a_vote6 b_sex b_dvage b_vote6
## 1 68004087     1      59       2     1      60       2
## 2 68006127     2      39       4     2      40       4
## 3 68006807     2      72       4     2      73       4
## 4 68007487     2      57       1     2      58       2
## 5 68008167     2      38       3     2      39       4
## 6 68008171     1      51      -7     1      52      -7

4.2.5 Full join

Usually, we would want all the respondents from both waves to remain in the data set, no matter if they appeared in the other waves or not. If we need to exclude any respondents from our analysis we can do this manually. We can do a full join with the full_join() command.

full <- UndSoc1ed %>%
        full_join(UndSoc2ed, by = "pidp")

By doing a full join, our new data frame has 67203 observations compared to 50994 in wave 1 and 54597 in wave 2. This is because it includes not only all the individuals that took part in both waves 1 and 2, but also those who only took part in either wave 1 or wave 2.

Now that we have joined two waves together, we could begin our analysis of these waves, if we had decided to just analyse the first couple of waves from the Understanding Society dataset.

head(full)
##       pidp a_sex a_dvage a_vote6 b_sex b_dvage b_vote6
## 1 68001367     1      39       3    NA      NA      NA
## 2 68004087     1      59       2     1      60       2
## 3 68006127     2      39       4     2      40       4
## 4 68006135     2      17       4    NA      NA      NA
## 5 68006807     2      72       4     2      73       4
## 6 68007487     2      57       1     2      58       2

If you wanted to work with more than just these two waves, you could manually join each new wave to our ‘full’ data frame. However, this is a slow and cumbersome process, so the next class will teach you how to join several waves of data together at once, using iteration and loops.