7 Data visualisation

Pre-requisite for this class: ch.3 (“Data visualisation”) from R for Data Science - http://r4ds.had.co.nz/data-visualisation.html

At home you learned about the basic principles of data visualisation in R with the ggplot2 package. Let us see how we can apply this to the Understanding Society data set.

Personally I can never remember all the details of the ggplot2 syntax, so I often use the ready-made “recipes” from the R Graphics Cookbook by W. Chang – https://www.amazon.co.uk/R-Graphics-Cookbook-Winston-Chang/dp/1449316956/. The 2nd edition is coming out later this year – https://www.amazon.co.uk/Graphics-Cookbook-2e-Winston-Chang/dp/1491978600 .

You may also find Winston Chang’s website useful (and not only for graphics) - http://www.cookbook-r.com .

7.1 Reading in the data

First let us read in the data we used in week 2 when we learned about dplyr (a short version of the wave 1 data). The saved file already includes the measures for weight, height and BMI that we created then.

library(tidyverse)
library(data.table)
W1 <- readRDS("myData/W1mod.rds")
head(W1, 3)

## # A tibble: 3 x 15
##       pidp a_sex a_dvage a_ukborn a_hlht a_hlhtf a_hlhti a_hlhtc a_hlwt
##      <int> <int>   <int>    <int>  <int>   <int>   <int>   <int>  <int>
## 1 68001367     1      39        1      1       6       0      -8      1
## 2 68004087     1      59        5      1       5      11      -8      2
## 3 68006127     2      39        1      1       5       1      -8      1
## # ... with 6 more variables: a_hlwts <int>, a_hlwtp <int>, a_hlwtk <int>,
## #   heightcm <dbl>, weightkg <dbl>, bmi <dbl>
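
The file already contains heightcm, weightkg and bmi (the last three columns in the output above). As a reminder, BMI is weight in kilograms divided by squared height in metres. A minimal sketch on a made-up three-row tibble (the object `toy` and its values are invented for illustration and are not Understanding Society data):

```r
library(dplyr)

# Made-up heights (cm) and weights (kg); not real survey data
toy <- tibble(heightcm = c(160, 175, 190),
              weightkg = c(60, 80, 100))

# BMI = weight in kg / (height in m)^2
toy <- toy %>%
  mutate(bmi = weightkg / (heightcm / 100)^2)

toy
```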

7.2 Visualising one quantitative variable

Exercise. Visualise the distribution of the BMI with ggplot2. Which statistical graphs would be appropriate for this?

7.2.1 Histogram.

ggplot(W1, aes(x=bmi)) +
  geom_histogram(bins = 100) +
  xlab("Body mass index")

7.2.2 Density chart.

ggplot(W1, aes(x=bmi)) +
  geom_density() +
  xlab("Body mass index")
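
The number of bins changes how a histogram looks, so it is worth trying a few values. Instead of bins you can set binwidth in the units of the variable, and you can draw the density curve on top of the histogram by putting the bars on the density scale. A sketch on simulated data (`sim` and its values are made up so that the example runs without the survey file; it assumes ggplot2 3.3.0 or later for after_stat()):

```r
library(ggplot2)

# Simulated stand-in for BMI; not real survey data
set.seed(1)
sim <- data.frame(bmi = rnorm(1000, mean = 27, sd = 5))

p <- ggplot(sim, aes(x = bmi)) +
  # binwidth = 1: each bar covers one BMI unit;
  # after_stat(density) rescales the bars so that their total area is 1
  geom_histogram(aes(y = after_stat(density)), binwidth = 1) +
  geom_density(colour = "red") +
  xlab("Body mass index")

p
```

With the real data, replace `sim` with W1.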

7.3 Visualising one categorical variable

Exercise. Visualise the distribution of a_ukborn with ggplot2. Which statistical graphs would be appropriate for this?

7.3.1 Bar plot.

table(W1$a_ukborn)

## 
##    -9    -2    -1     1     2     3     4     5 
##     6     2     8 33480  3567  2154  2033  9744

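W1 <- W1 %>%
  # Negative values are missing-value codes, so set them to NA
  mutate(a_ukborn = ifelse(a_ukborn > 0, a_ukborn, NA)) %>%
  # Attach text labels to the numeric codes
  mutate(cbirth = recode(a_ukborn, "1" = "England",
                         "2" = "Scotland",
                         "3" = "Wales",
                         "4" = "Northern Ireland",
                         "5" = "Not UK"))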
  
table(W1$cbirth)

## 
##          England Northern Ireland           Not UK         Scotland 
##            33480             2033             9744             3567 
##            Wales 
##             2154

W1 %>% 
  filter(!is.na(cbirth)) %>%
  ggplot(aes(x=cbirth)) +
  geom_bar() +
  xlab("Country of birth")

table(W1$cbirth, useNA = "always")

## 
##          England Northern Ireland           Not UK         Scotland 
##            33480             2033             9744             3567 
##            Wales             <NA> 
##             2154               16
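
By default the bars follow the alphabetical order of the labels. A possible variation, not part of the original exercise, is to order the bars by frequency with fct_infreq() from the forcats package (loaded with the tidyverse). The sketch uses a small made-up data frame rather than the survey data:

```r
library(ggplot2)
library(forcats)

# Made-up example values; not real survey data
toy <- data.frame(cbirth = c("England", "England", "England",
                             "Scotland", "Wales", "Wales"))

# fct_infreq() reorders the factor levels from most to least frequent
toy$cbirth <- fct_infreq(toy$cbirth)
levels(toy$cbirth)

p <- ggplot(toy, aes(x = cbirth)) +
  geom_bar() +
  xlab("Country of birth")
p
```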

7.4 Visualising two quantitative variables

Exercise. Visualise the joint distribution of weight (in kg) and height (in cm). In your chart show the regression line and the nonparametric smoothing line.

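ggplot(W1, aes(x = weightkg, y = heightcm)) +
  geom_point() +
  # nonparametric smoothing line (ggplot2 picks the smoother by default)
  geom_smooth() +
  # linear regression line
  geom_smooth(method = "lm")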
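
With tens of thousands of respondents the points overlap heavily, and the two fitted lines are drawn in the same colour. One way to make such a chart easier to read is to make the points semi-transparent and give each line its own colour. A sketch on simulated data (`sim` is made up, and the colours and the alpha value are arbitrary choices):

```r
library(ggplot2)

# Simulated weights and heights; not real survey data
set.seed(1)
sim <- data.frame(weightkg = rnorm(500, mean = 75, sd = 15))
sim$heightcm <- 140 + 0.4 * sim$weightkg + rnorm(500, sd = 7)

p <- ggplot(sim, aes(x = weightkg, y = heightcm)) +
  geom_point(alpha = 0.2) +                      # semi-transparent points
  geom_smooth(colour = "blue") +                 # default smoother (loess or GAM, by sample size)
  geom_smooth(method = "lm", colour = "red") +   # linear regression line
  xlab("Weight (kg)") +
  ylab("Height (cm)")
p
```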

7.5 Visualising one categorical and one quantitative variable

Exercise. Visualise the distribution of BMI for a) men and women, b) different age groups.

# Coding a categorical variable for age groups

table(W1$a_dvage, useNA = "always")

## 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
##  937  864  787  798  786  738  756  786  806  791  827  849  878  936  914 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
##  884  879  917  864  928  983  923  976 1051 1054 1032 1043  935  968  987 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
##  940  941  917  889  873  824  817  765  803  722  761  703  756  666  662 
##   61   62   63   64   65   66   67   68   69   70   71   72   73   74   75 
##  820  806  775  685  621  646  591  521  563  571  500  498  443  411  380 
##   76   77   78   79   80   81   82   83   84   85   86   87   88   89   90 
##  368  356  338  294  287  267  225  207  166  147  132  108   80   73   54 
##   91   92   93   94   95   96   97   98   99  100  101 <NA> 
##   38   27   20   21   14   14    4    3    2    1    1    0

W1 <- W1 %>%
        mutate(agegr = ifelse(a_dvage < 31, "16-30",
                              ifelse(a_dvage > 30 & a_dvage < 46, "31-45",
                                ifelse(a_dvage > 45 & a_dvage < 61, "46-60",
                                       ">60")))) %>%
        mutate(agegr = factor(agegr, c("16-30", "31-45", "46-60", ">60")))
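
The nested ifelse() calls work, but they become hard to read as the number of groups grows. A possible alternative is base R's cut(), which returns a factor with the levels already in the right order. A sketch on a few made-up ages chosen to sit on the group boundaries:

```r
# A few made-up ages around the group boundaries
age <- c(16, 30, 31, 45, 46, 60, 61, 101)

# Intervals are closed on the right by default:
# (15,30], (30,45], (45,60], (60,Inf]
agegr <- cut(age,
             breaks = c(15, 30, 45, 60, Inf),
             labels = c("16-30", "31-45", "46-60", ">60"))

table(agegr)
```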

ggplot(W1, aes(x = agegr, y= bmi)) +
  geom_boxplot() +
  xlab("Age group") +
  ylab("Body mass index")
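
The code above covers part b) of the exercise. For part a) we can draw the same kind of chart by sex. A sketch, assuming that a_sex is coded 1 for men and 2 for women (check this against the codebook) and using a made-up data frame in place of W1; the variable name `sex` is introduced here only for illustration:

```r
library(dplyr)
library(ggplot2)

# Made-up data; not real survey values
toy <- tibble(a_sex = c(1, 2, 1, 2, 1, 2),
              bmi   = c(24, 22, 28, 27, 31, 25))

toy <- toy %>%
  # Assumed coding: 1 = men, 2 = women
  mutate(sex = factor(a_sex, levels = c(1, 2), labels = c("Men", "Women")))

p <- ggplot(toy, aes(x = sex, y = bmi)) +
  geom_boxplot() +
  xlab("Sex") +
  ylab("Body mass index")
p
```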

7.6 Visualising two categorical variables

Exercise. Use facets to visualise the distribution of a_ukborn by age group.

W1 %>% 
  filter(!is.na(cbirth)) %>%
  ggplot(aes(x=cbirth)) +
  geom_bar() +
  xlab("Country of birth") +
  facet_wrap(~ agegr)
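
Because the age groups differ in size, raw counts are hard to compare across panels. A possible complement to the faceted chart is to tabulate the share of each country of birth within each age group. A sketch with dplyr on made-up data (`toy` and `shares` are illustrative names):

```r
library(dplyr)

# Made-up data; not real survey values
toy <- tibble(agegr  = c("16-30", "16-30", "16-30", "16-30", ">60", ">60"),
              cbirth = c("England", "England", "England", "Not UK",
                         "England", "Wales"))

shares <- toy %>%
  count(agegr, cbirth) %>%        # cell counts
  group_by(agegr) %>%
  mutate(prop = n / sum(n)) %>%   # share within each age group
  ungroup()

shares
```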

Alternatively you can do a jitter plot, but with two categorical variables and this many observations the points overlap so heavily that it wouldn’t look nice.

W1 %>% 
  filter(!is.na(cbirth)) %>%
  ggplot(aes(x=cbirth, y = agegr)) +
  geom_jitter() +
  xlab("Country of birth") +
  ylab("Age group")

7.7 Showing the relationships by group

Exercise. Use facets to visualise the association between age and BMI by country of birth.

W1 %>%
  filter(!is.na(cbirth)) %>%
  ggplot(aes(x = a_dvage, y = bmi)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~ cbirth)