4 Read data

Please read ch.11 (Data import) from R for Data Science - http://r4ds.had.co.nz/data-import.html

For this module we will use data from the Understanding Society survey (https://discover.ukdataservice.ac.uk/catalogue/?sn=6614). I assume that you have familiarised yourself with the data set, registered an account with the UK Data Service website and downloaded the data in tab-delimited format.

The first thing we need to do is to read the data into R. There are several ways of doing this.

For all the examples I will use the individual adult data from wave 1 (a_indresp.tab).

4.1 Base R

In base R we have the read.table function. I will wrap it in the system.time function to measure how long it takes to run.

system.time(UndSoc1 <- read.table("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab",
                      header = TRUE,
                      stringsAsFactors = FALSE))
##    user  system elapsed 
##   22.71    0.57   23.40

I set header = TRUE to make sure that the first row in the data is interpreted as variable names. stringsAsFactors = FALSE means that text variables will be read in as character vectors rather than factors (this has been the default since R 4.0.0). We can convert them into factors when necessary.
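As a minimal sketch of what these two arguments do, the example below writes a small tab-separated file to a temporary location and reads it back. The data values and the names toy, toy_read and tmp are made up for illustration. The sep = "\t" argument is an addition that is not used in the call above: it makes the tab delimiter explicit, which is safer when text fields may contain spaces.

```r
# Toy data standing in for a survey file (values are made up)
toy <- data.frame(id  = c(1, 2, 3),
                  sex = c("male", "female", "female"),
                  stringsAsFactors = FALSE)

# Write it as a tab-separated file in a temporary location
tmp <- tempfile(fileext = ".tab")
write.table(toy, tmp, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back: the first row supplies the variable names,
# and text columns stay as character vectors
toy_read <- read.table(tmp,
                       header = TRUE,
                       sep = "\t",
                       stringsAsFactors = FALSE)
class(toy_read$sex)

# Convert a text column to a factor only when it is needed
toy_read$sex <- factor(toy_read$sex)
levels(toy_read$sex)
```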

4.2 Package readr

We can also read in these data with the readr package (part of the tidyverse). Its main advantage is speed.

library(readr)
system.time(UndSoc2 <- read_tsv("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab"))
##    user  system elapsed 
##    4.41    0.12    5.15

readr read the data set in about 5 seconds of elapsed time, compared with about 23 seconds for read.table.
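By default read_tsv guesses the type of each column from the data. The sketch below shows one way to state the types explicitly with the col_types argument, which can be useful when a guess turns out wrong. The file is a tiny made-up example written to a temporary location, and the column names are hypothetical rather than taken from a_indresp.tab.

```r
library(readr)

# A tiny tab-separated file with made-up values
tmp <- tempfile(fileext = ".tab")
writeLines(c("id\tsex\tage",
             "1\tmale\t34",
             "2\tfemale\t51"), tmp)

# Declare the column types instead of letting read_tsv() guess them
toy <- read_tsv(tmp,
                col_types = cols(id  = col_integer(),
                                 sex = col_character(),
                                 age = col_double()))
toy
```

The result is a tibble, which prints more compactly than a base R data frame.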

4.3 Package data.table

The fread function from the data.table package is probably the fastest way to read in the data.

library(data.table)
system.time(UndSoc3 <- fread("data/UKDA-6614-tab/tab/us_w1/a_indresp.tab"))
## Read 58.8% of 50994 rows
## Read 98.1% of 50994 rows
## Read 50994 rows and 1364 (of 1364) columns from 0.187 GB file in 00:00:04
##    user  system elapsed 
##    2.69    0.05    3.39

It took under 3.5 seconds of elapsed time, the fastest of the three!
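With a file as wide as this one (1364 columns) you may not need every variable. As a sketch, fread has a select argument that reads only the named columns. The example uses a small made-up file and hypothetical column names, so treat it as an illustration rather than a recipe for a_indresp.tab.

```r
library(data.table)

# A tiny tab-separated file with made-up values
tmp <- tempfile(fileext = ".tab")
writeLines(c("id\tsex\tage",
             "1\tmale\t34",
             "2\tfemale\t51"), tmp)

# Read only the columns that are needed
toy <- fread(tmp, select = c("id", "age"))
toy

# fread() returns a data.table, which is also a data frame
class(toy)
```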

With small data sets the choice between these three methods is not very important, but with larger data the gain in efficiency that readr and data.table provide can be quite substantial.

4.4 Other data formats

In R you can easily read in data in other formats, such as csv files, Stata, SPSS, SAS, Excel and others. There are many tutorials on how to do this on the web. See, for example, https://www.datacamp.com/courses/importing-data-in-r-part-1/ and https://www.datacamp.com/courses/importing-data-in-r-part-2 .
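As a small illustration, the sketch below reads a made-up csv file with base R. The commented lines name functions from the haven and readxl packages that are commonly used for the other formats. The file names in those lines are placeholders, which is why they are left commented out.

```r
# A small csv file with made-up values
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("id,sex,age",
             "1,male,34",
             "2,female,51"), tmp_csv)

# Base R reads comma-separated files with read.csv()
toy_csv <- read.csv(tmp_csv, stringsAsFactors = FALSE)
str(toy_csv)

# Other formats usually need a dedicated package, for example:
# haven::read_dta("myfile.dta")        # Stata
# haven::read_sav("myfile.sav")        # SPSS
# haven::read_sas("myfile.sas7bdat")   # SAS
# readxl::read_excel("myfile.xlsx")    # Excel
```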

4.5 Saving the R workspace

Once you have read your data into R you can save it as an R workspace.

# I will remove some objects from memory to speed things up
rm(UndSoc2, UndSoc3)
# save the R workspace in myData (you need to create the myData folder first)
save.image("myData/readTest.RData")

Next time I need this file I can simply load the workspace.

# first let's remove everything from the workspace
rm(list = ls())
# load the workspace
system.time(load("myData/readTest.RData"))
##    user  system elapsed 
##    0.73    0.00    0.73

Of course, in an R workspace you can save not only data frames but any objects: models, plots, functions, etc.
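If you only want to keep some objects rather than the whole workspace, one option is save with the objects named explicitly. The sketch below uses made-up data and a simple linear model, and writes to temporary files instead of the myData folder. It also shows saveRDS and readRDS, an alternative that is not used above and that stores a single object.

```r
# Made-up data and a simple model fitted to it
toy <- data.frame(x = 1:10,
                  y = c(2, 4, 5, 4, 6, 7, 9, 8, 10, 12))
fit <- lm(y ~ x, data = toy)

# Save only the named objects rather than the whole workspace
tmp_rdata <- tempfile(fileext = ".RData")
save(toy, fit, file = tmp_rdata)

# Remove them, then restore both under their original names
rm(toy, fit)
load(tmp_rdata)
class(fit)

# saveRDS() stores a single object, and readRDS() lets you
# choose the name it gets when you read it back
tmp_rds <- tempfile(fileext = ".rds")
saveRDS(toy, tmp_rds)
toy_copy <- readRDS(tmp_rds)
identical(toy, toy_copy)
```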