Happy new year!

The aim of this workshop is to review the code and analysis I showed you last year.

Last term

Last term there were three workshops:

Today

We can group the code from last term into three categories

We loaded or defined data, explored it, then fit a simple statistical model and made an inference. In other words, we found a question and answered it with some data.
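
The three steps can be sketched end to end in a few lines. This is not last term's code: it uses sleep, a small data set that ships with R, purely as a stand-in, and the choice of a t-test is an assumption made for illustration.

```r
# 1. Load/define data: `sleep` ships with R (a stand-in, not workshop data)
data(sleep)

# 2. Explore the data
str(sleep)
summary(sleep$extra)
boxplot(extra ~ group, data = sleep)

# 3. Fit a simple statistical model and make an inference
fit <- t.test(extra ~ group, data = sleep)
fit$p.value
```

The same load, explore, fit pattern is what we will apply to the BSA data.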

In this workshop, we are going to consider the British Social Attitudes survey. We will start by re-covering old ground and then move into more small-group work. The goal here is to help you think about how you can use R to fit, consider and evaluate trends in data. Those who feel a little lost are encouraged to look at my previous workshop materials.

British Social Attitudes

The BSA is a rich data source and available here. A copy of the BSA is here. There are technical details available which outline how the survey was carried out. The user guide is useful, too.

The technical notes also - and very usefully, what nice people! - offer guidance on statistical analysis of the data. The relevant section is at the bottom of the document and is written in readable text. What a relief!

Please download the tab data file containing the BSA responses. You may need to fill in some details and log in using Athens before you can access the data.

Reading in and writing out

Our data for this workshop is the British Social Attitudes survey (see above). Those of you who attended my workshops last term will be familiar with this data set.

The data is a table. Each row is a person and each column is a response. In the file, each row is on a new line and the columns are separated by tabs.

Loading the data is done using a function called read.table. The arguments we give to the function are the filename, the column separator (a tab, written "\t") and a flag saying that the first row of the file contains the names of our variables. We use the assignment operator <- to save the output of this function (the data R reads in) in the environment, in a variable called d.

# this is a comment, just so you know. R will not run this line.

# Make sure R is looking at the directory containing the data file
# You can set the working directory by going to Session > Set Working Directory > Choose Directory 

# read in our data.
d <- read.table(file = 'bsa16_to_ukda.tab', sep = "\t", header = TRUE)
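
After reading a file in, it is worth checking that R understood the layout. The sketch below does this on a tiny made-up file written to a temporary location, so it runs without the BSA download. The name toy and the values in it are invented for illustration.

```r
# Toy stand-in for the BSA file: two made-up rows, tab-separated, with a header
toy_file <- tempfile(fileext = ".tab")
writeLines(c("Rsex\tCountry", "1\t1", "2\t3"), toy_file)

toy <- read.table(file = toy_file, sep = "\t", header = TRUE)

dim(toy)    # number of rows, then number of columns
names(toy)  # column names taken from the header row
head(toy)   # the first few rows
```

Running the same three checks on d should show one row per respondent and one column per variable.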

If we make changes to d then we can write a file - saving the data on the computer.

# We can write the content of d into a file laid out the same way as our original data
# (row.names = FALSE stops R adding an extra column of row numbers)
write.table(x = d, file = 'our_data.tab', sep = '\t', row.names = FALSE)

# Or we can save the data in an R format
# The R format will often be smaller but must be loaded in R
save(d, file = 'our_data.RData')

# We can clear our environment and lose d
rm(list=ls())
# And load the R file back in
load(file = 'our_data.RData')
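
One way to convince yourself that nothing is lost when saving is a round trip: write the data out, read it back in and compare. The sketch below does this for both formats on a small made-up data frame. The object and file names are hypothetical.

```r
# Small made-up data frame standing in for d
toy <- data.frame(Rsex = c(1L, 2L, 2L), Country = c(1L, 1L, 3L))

# Text round trip: row.names = FALSE keeps the file the same shape as the data
tab_file <- tempfile(fileext = ".tab")
write.table(x = toy, file = tab_file, sep = "\t", row.names = FALSE)
toy_from_tab <- read.table(file = tab_file, sep = "\t", header = TRUE)
all.equal(toy, toy_from_tab)

# R-format round trip: save(), remove the object, then load() it back
rdata_file <- tempfile(fileext = ".RData")
toy_before <- toy
save(toy, file = rdata_file)
rm(toy)
load(file = rdata_file)
identical(toy, toy_before)
```

Note that load() restores the object under the name it was saved with, so there is no assignment on that line.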

Exploring your variables

The BSA data has lots of variables. We are going to pick a few variables to explore from the documentation. RStudio has a cool autocomplete feature which might help. If you type d$ into a code area (within the shaded area of this notebook or the console) RStudio will list all the columns in d.
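
Outside RStudio, or when there are too many columns to scroll through, the column names can also be searched in code. A minimal sketch on a made-up data frame, where the column names mimic the ones used in this workshop and the values are invented:

```r
# Toy data frame with made-up values
toy <- data.frame(Rsex = c(1, 2), Country = c(1, 3), Married = c(1, 4))

names(toy)                             # every column name
grep("^Ma", names(toy), value = TRUE)  # names starting with "Ma"

# names containing "sex", ignoring upper/lower case
grep("sex", names(toy), ignore.case = TRUE, value = TRUE)
```

Replacing toy with d searches the BSA columns in the same way.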

Below we explore the data by looking at sex, location and marriage status. All of these variables are stored as numbers and need to be recoded (have labels attached to them).
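
When labels are attached with factor(), they are matched to the sorted distinct values in the column. If a column contains an extra code (for example a "not answered" code), the labels can end up on the wrong values or R will stop with an error. A more defensive sketch names the codes explicitly. The coding used here (1 = Male, 2 = Female, 9 = not answered) is an assumption for illustration. Check the BSA documentation for the real codes.

```r
# Made-up codes: 1 and 2 are assumed to mean Male and Female; 9 is a
# hypothetical "not answered" code, not taken from the BSA documentation
codes <- c(1, 2, 2, 9, 1)

# Naming the levels pairs each code with its label explicitly.
# Codes not listed in `levels` (here 9) become NA instead of shifting labels.
sex <- factor(x = codes, levels = c(1, 2), labels = c("Male", "Female"))

table(sex, useNA = "ifany")
```

The useNA argument makes table() report how many values fell outside the listed codes.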

# sex
d$Rsex.recode <- factor(x = d$Rsex, labels = c('Male', 'Female'))
table(d$Rsex.recode)
## 
##   Male Female 
##   1291   1651
plot(d$Rsex.recode)

# location
d$Country.recode <- factor(x = d$Country, labels = c('England', 'Scotland', 'Wales'))
table(d$Country.recode)
## 
##  England Scotland    Wales 
##     2525      252      165
plot(d$Country.recode)

# marriage status
d$Married.recode <- factor(x = d$Married, labels = c('Married/living as married', 'Separated/divorced', 'Widowed', 'Never married'))
table(d$Married.recode)
## 
## Married/living as married        Separated/divorced 
##                      1618                       426 
##                   Widowed             Never married 
##                       289                       609
plot(d$Married.recode, las = 2)

# also
summary(d$Married.recode)
## Married/living as married        Separated/divorced 
##                      1618                       426 
##                   Widowed             Never married 
##                       289                       609
head(d$Married.recode)
## [1] Married/living as married Married/living as married
## [3] Married/living as married Married/living as married
## [5] Widowed                   Never married            
## 4 Levels: Married/living as married Separated/divorced ... Never married
str(d$Married.recode)
##  Factor w/ 4 levels "Married/living as married",..: 1 1 1 1 3 4 2 2 1 3 ...

# A side note on finding out data types

# d is a data frame
class(d)
## [1] "data.frame"
# number variables
class(d$Rsex)
## [1] "integer"
# we turned some variables into categories (factors)
class(d$Rsex.recode)
## [1] "factor"

We can count data across categories.

table(d$Married.recode, d$Country.recode)
##                            
##                             England Scotland Wales
##   Married/living as married    1383      147    88
##   Separated/divorced            381       27    18
##   Widowed                       245       21    23
##   Never married                 516       57    36
# proportions
plot(d$Country.recode, d$Rsex.recode)
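
The plot shows proportions as a picture. To get the numbers themselves, one option is prop.table(). The sketch below rebuilds the marital status by country counts printed above by hand, so it runs without the data file, and then converts them to within-country proportions.

```r
# Counts copied from the marital status by country table above
counts <- matrix(
  c(1383, 147, 88,
     381,  27, 18,
     245,  21, 23,
     516,  57, 36),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Married = c("Married/living as married", "Separated/divorced",
                "Widowed", "Never married"),
    Country = c("England", "Scotland", "Wales")
  )
)
tab <- as.table(counts)

# margin = 2 divides each column by its total, so every country sums to 1
props <- prop.table(tab, margin = 2)
round(props, 2)
```

With the data loaded, prop.table(table(d$Married.recode, d$Country.recode), margin = 2) should give the same result without typing the counts in.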

Group task

In your groups, visualise some of the variables. Look at their frequencies. Can you think of any interesting questions you would like to address?

Something new: Lattice.

# a short introduction is here 
# https://www.statmethods.net/advgraphs/trellis.html
library(lattice)

histogram(~ Rsex.recode | Country.recode, data = d, type = 'count')
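
For categorical variables, lattice also offers barchart(), which plots counts you have already tabulated. The sketch below is one possible alternative to the histogram, run on made-up responses so it works without the BSA file. The data frame toy and its values are invented for illustration.

```r
library(lattice)

# Made-up responses standing in for the recoded BSA columns
toy <- data.frame(
  Sex     = factor(c("Male", "Female", "Female", "Male", "Female", "Female")),
  Country = factor(c("England", "England", "England", "Wales", "Wales", "Scotland"))
)

# Count each Sex/Country combination; as.data.frame() adds a Freq column
counts <- as.data.frame(table(Sex = toy$Sex, Country = toy$Country))

# One panel per country, one bar per sex
p <- barchart(Freq ~ Sex | Country, data = counts, ylab = "Count")
print(p)
```

The part of the formula after the | sets the panels, just as in the histogram call above.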