2 Read data

Prerequisite: Chapter 11 ‘Data Import’ from R for Data Science, available at http://r4ds.had.co.nz/data-import.html

2.1 Introduction to Reading Data

Once you have downloaded the data from the Understanding Society survey, the first thing you need to do is read the data into R. There are a number of ways to do this, and this class will cover loading data into R by using base R and both the readr and data.table packages.

The data we will be loading into R is the individual adult questionnaire answers from wave 1 (UKDA-6614-tab/tab/us_w1/a_indresp.tab). This should be saved in the data folder you created in your project folder (this was covered in the Introduction section to the course).

2.2 Reading common data files into R

2.2.1 Ways to read data into R

2.2.1.1 Using base R

To read this data into R using functions from base R, we can first use the read.table() function. To read the data in correctly, we need to use header = TRUE as the first row of the data contains the names of the variables, and also stringsAsFactors = FALSE, which stops R from treating text variables as factors. We can convert these into factors later, when necessary.

UndSoc.1 <- read.table("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab",
                       header = TRUE,
                       stringsAsFactors = FALSE)

Different types of files need to read into R in different ways. The read.table() function reads data that is tab separated (the file will be called ‘name.tab’), the read.csv() function reads data that is comma separated (the file will be called ‘name.csv’) and read.delim() reads data that is separated in any way. These work in the same way, so once you’ve mastered one, you’ve mastered them all.

2.2.1.2 Using the readr package

We can also read data into R using the package readr, part of the tidyverse package, with the read_tsv() command. This is the equivalent command to read.table() from base R.

# We can either load the entire 'tidyverse' package into R, which includes the readr
# package as well as others such as ggplot2, or just the 'readr' package on its own.

library(readr)

UndSoc.2 <- read_tsv("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab")

The readr package also allows us to read different types of files into R. The commands read_csv() and read_delim() read comma and any type of separated data into R respectively.

2.2.1.3 Using the data.table package

The final way we are going to read this data into R is by using the fread() function from the data.table package. The advantage of the fread() command is that it can read any type of separated data into R, without you having to change anything if the file type you are trying to read changes.

library(data.table)

UndSoc.3 <- fread("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab")
## 
Read 58.8% of 50994 rows
Read 98.1% of 50994 rows
Read 50994 rows and 1364 (of 1364) columns from 0.187 GB file in 00:00:04

2.2.2 Comparing these three methods

One of the important differences between these three methods for reading data into R is the length of time they take to read the data. We can compare this by reading the data into R again, this time wrapping our code with the command system.time().

2.2.2.1 Using base R

# Base R
system.time(UndSoc.1 <- read.table("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab",
                                   header = TRUE,
                                   stringsAsFactors = FALSE))
##    user  system elapsed 
##   20.50    0.58   21.73

2.2.2.2 Using the readr package

system.time(UndSoc.2 <- read_tsv("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab"))
##    user  system elapsed 
##    4.85    0.28    6.82

2.2.2.3 Using the data.table package

system.time(UndSoc.3 <- fread("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab"))
## 
Read 0.0% of 50994 rows
Read 39.2% of 50994 rows
Read 78.4% of 50994 rows
Read 50994 rows and 1364 (of 1364) columns from 0.187 GB file in 00:00:06
##    user  system elapsed 
##    2.79    0.11    5.29

From this we can see clear differences in loading time. With small data sets, the difference between these three methods will not be very noticeable or important, but with larger data the increase in loading times the readr and data.table packages provide can be quite substantial.

2.2.3 Editing a dataset while reading it into R

2.2.3.1 Skipping rows of data

Sometimes we do not want to read in the entire dataset we have into R, but instead want to ignore the first few rows. For example, if we were to do some sentiment analysis on Hamlet (found here: http://www.gutenberg.org/files/1524/1524-0.txt), we would want to only read in the play itself, not the introductions before the start of the play. We can use the skip = “” command to only read the file from the start of the play, with each line of the script counting as a row.

Hamlet <- read.delim("http://www.gutenberg.org/files/1524/1524-0.txt",
                     skip = 30,
                     header = FALSE)

2.2.3.2 Reading only select variables

When working with large datasets, like the one you will be using for your assignment, there are only certain variables that are of interest. As a result, it makes sense to load in only these variables, to make analysing your data easier to do.

Let’s say that we’re interested in how political interest is distributed across the UK, and want to analyse how this is differentiated by sex and age. The variables we need to read into R are these two (a_sex and a_dvage), as well as interest in politics (a_vote6) and the personal identification variable for each individual (a_pidp). We can use the select = “” command to do this.

UndSoc.4 <- fread("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab",
                  select = c("pidp", "a_sex", "a_dvage", "a_vote6"))

2.2.3.3 Setting NA values

Sometimes, if we look at the raw data we are about to analyse, we see that there are a number of missing values. Instead of recoding these values as NA after we have loaded our data (found, in this example, at https://raw.githubusercontent.com/abessudnov/dataanalysis3/master/exData/Table3.txt), we can do this while reading our data into R using the na.strings = “” command.

MissingData <- read.table("exData/Table3.txt",
                          header = TRUE,
                          skip = 2,
                          na.strings = c("*", "**", "--"))

2.3 Reading other data files

2.3.1 Excel

In R, you can also read data that was created in Excel. Excel files are saved primarily as .xls or .xlsx files. Fortunately, after installing the readxl package, the command read_excel() can read both formats into R. Using the example Excel document found at (https://github.com/abessudnov/dataanalysis3/blob/master/exData/tableExcel.xlsx), we can easily read this into R.

library(readxl)

Excel <- read_excel("exData/tableExcel.xlsx")

When reading Excel files into R, read_excel() defaults to loading the first sheet. If there are multiple sheets in our document (by opening the Excel file we can see that there are 2 in this case), we can load different sheets in with the sheet = “” command.

Excel2 <- read_excel("exData/tableExcel.xlsx",
                     sheet = 2)

However, if we look at the sheet we have just loaded, we can see that there are two NA values, one of which is coded as “NA”. Furthermore, if we compare how these NA values appear in our data with the ones from our MissingData dataset, we can see that they aren’t actually coded as missing values, but instead as values called NA and “NA”.

head(MissingData)
##     Name Age Height Weight Sex
## 1   Alex  25    177     57   F
## 2  Lilly  31     NA     69   F
## 3   Mark  NA    190     83   M
## 4 Oliver  52    179     75   M
## 5 Martha  76   <NA>     70   F
## 6  Lucas  49    183     NA   M
head(Excel2)
## # A tibble: 3 x 3
##   actor movie sentence
##   <chr> <dbl>    <chr>
## 1     A     1       NA
## 2     B     2 "\"NA\""
## 3     C     3       15

To fix this, we need to set both these values as NA, which we can do with the na = “” command.

Excel3 <- read_excel("exData/tableExcel.xlsx",
sheet = 2,
na = c('NA', '"NA"'))

# Note that as one of the missing values is called "NA", you have to use inverted commas
# ('') around the entire value for R to change it to an NA value, rather than the usual
# speech marks (""). Inverted commas and speech marks are interchangable in R, though
# if you use one to open a command, you cannot use the other to close it.

We can now see that this dataset has true NA values in it for the missing data.

head(Excel3)
## # A tibble: 3 x 3
##   actor movie sentence
##   <chr> <dbl>    <dbl>
## 1     A     1       NA
## 2     B     2       NA
## 3     C     3       15

2.3.2 SPSS

As many people use SPSS instead of R to analyse data, there are times when you will have to use data that is intended for use in SPSS. The haven package allows us to do this, and as an example we will read a dataset containing some data from a survey.

library(haven)

SPSS <- read_spss("exData/survey.sav")

You can also use haven to read data in the Stata and SAS formats.

2.4 Saving datasets

Once you have read your data into R you can save it as a file so you can load it back into R at a later date to continue working on it. Again, there are a number of different ways to do this.

2.4.1 Saving data as a text file

Having read the first wave of data into R with only the variables we are interested in (see Reading only select variables), we can save this data as a text file using base R, without having to install any new packages. All we need to do is create a folder called myData in the folder where you have saved your project, and then you can save your dataset to this folder.

write.csv(UndSoc.4, "myData/UndSoc.4.csv")

As you had to get specific permission to access the Understanding Society dataset, it is important you do not save this data to Github. This applies to every method of saving data from R we will cover here. Add the following line to your gitignore file to keep the myData folder untracked:

myData/

2.4.2 Saving data as an Excel file

To save our data as an Excel file, we need to first install the writexl package, and use the command write_xlsx().

library(writexl)

write_xlsx(UndSoc.4, path = 'myData/UndSoc.4.xlsx', col_names = TRUE)

2.4.3 Saving data as an SPSS file

We can use the haven package we used to read files saved for use in SPSS to write data as SPSS files too, using the write_sav() command.

write_sav(UndSoc.4, "myData/UndSoc.4.sav")

2.4.4 Saving an object as an RDS file

Finally, we can just save a file for future use in R as an RDS file. It will work faster with large objects, but it will only allow saving one object (such as a data frame) at a time and it will also drop the name of the object while saving.

saveRDS(UndSoc.4, "myData/UndSoc.4.rds")

2.4.5 Loading these datasets back into R

To read any of these datasets back into R, use what you learnt in Reading common data files into R.

When loading an RDS file back into R, simply use the readRDS() command. Note that we assign a new name to the object.

RDS <- readRDS("myData/UndSoc.4.rds")

2.5 Saving the working environment

If you want to save an entire workspace, rather than just individual files, the command save.image() does this. This saves not only all loaded data frames but any saved objects: models, plots, functions, etc.

Before saving the entirer workspace we’ll remove all the objects from it, except two: Hamlet and MissingData. We’ll do this just to save time, otherwise saving the data image will take too long.

# First we'll print all the objects we currently have in the memory.

ls()
##  [1] "Excel"       "Excel2"      "Excel3"      "Hamlet"      "MissingData"
##  [6] "RDS"         "SPSS"        "UndSoc.1"    "UndSoc.2"    "UndSoc.3"   
## [11] "UndSoc.4"
# Then we can remove all the objects, except the 4th and the 5th.

rm(list = ls()[-c(4,5)])
# Saving the workspace.

save.image("myData/Workspace.RData")

# Clearing the current workspace.

rm(list = ls())

# Loading the workspace back into R.

load("myData/Workspace.RData")

2.6 Reading in multiple waves from Understanding Society

For your assignment, you are required to analyse at least two waves of data from the Understanding Society dataset. We will cover how to join multiple waves together in a future lesson, but before doing this, let’s read in the first two waves of Understanding Society. To do this, we need to use the function fread() from the data.table package, as the data are tab separated.

# First, we clear the current workspace.

rm(list = ls())

UndSoc1 <- fread("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab")
## 
Read 78.4% of 50994 rows
Read 50994 rows and 1364 (of 1364) columns from 0.187 GB file in 00:00:03
UndSoc2 <- fread("data/UKDA-6614-tab/tab/us_w2/b_indresp.tab")
## 
Read 0.0% of 54597 rows
Read 36.6% of 54597 rows
Read 73.3% of 54597 rows
Read 54597 rows and 1615 (of 1615) columns from 0.233 GB file in 00:00:07

As we discussed earlier in this class, these files take a lot longer to load into R than other data files you have used in the past. This is because the files themselves are large, and take up a lot of space in the memory. We can see just how much space they take up by using the object.size() command.

object.size(UndSoc1)
## 286020872 bytes
object.size(UndSoc2)
## 364341464 bytes

However, this does not give us an answer that we can easily interpret. Therefore, by using the format() command, we can specify how we want R to show us this data. In this case, by adding units = “” we can tell R to show us how large the files are in specific units.

# By adding units = "auto" R automatically chooses the clearest way to show us the
# size of the files.

format(object.size(UndSoc1), units = "auto")
## [1] "272.8 Mb"
format(object.size(UndSoc2), units = "auto")
## [1] "347.5 Mb"

To make the files we are working with smaller, and thus make R carry out our analysis quicker, we can select only the variables we need for our analysis. We are only going to load a few variables here to keep this simple; you can always add more to your analysis. Here, as we are looking at level of interest in politics, they are:

  • pidp: this is the id number given to each respondent to identify them in each wave

  • a_sex: sex from wave 1

  • a_dvage: age from wave 1

  • a_vote6: level of interest in politics from wave 1

  • b_sex: sex from wave 2

  • b_dvage: age from wave 2

  • b_vote6: level of interest in politics from wave 2

By using what we covered earlier, we can select only the variables we want to read into R, making the loading times decrease significantly as they are much smaller datasets.

UndSoc1ed <- fread("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab",
select = c("pidp", "a_sex", "a_dvage", "a_vote6"))

UndSoc2ed <- fread("data/UKDA-6614-tab/tab/us_w2/b_indresp.tab",
select = c("pidp", "b_sex", "b_dvage", "b_vote6"))

We can check that these are much smaller objects, and that our code was successful.

format(object.size(UndSoc1ed), units = "auto")
## [1] "798.3 Kb"
format(object.size(UndSoc2ed), units = "auto")
## [1] "854.6 Kb"
head(UndSoc1ed)
##        pidp a_sex a_dvage a_vote6
## 1: 68001367     1      39       3
## 2: 68004087     1      59       2
## 3: 68006127     2      39       4
## 4: 68006135     2      17       4
## 5: 68006807     2      72       4
## 6: 68007487     2      57       1
head(UndSoc2ed)
##        pidp b_sex b_dvage b_vote6
## 1: 68004087     1      60       2
## 2: 68006127     2      40       4
## 3: 68006807     2      73       4
## 4: 68007487     2      58       2
## 5: 68008167     2      39       4
## 6: 68008171     1      52      -7

As we no longer have any use for the original data we loaded into R, we can remove them from R to free up the memory they are unnecessarily taking up.

rm(UndSoc1, UndSoc2)

We can then save these datasets so we can easily load them into R at a later date to continue working on them. We will use these in future classes. Here, we save the workspace, which includes just waves 1 and 2 of the Understanding Society data, for future use.

save.image("myData/ReadData.RData")