6 Data wrangling 2
6.1 Walkthrough video
There is a walkthrough video of this chapter available via Zoom.
- Video notes: this video was recorded in 2020 and uses the server, and the book's appearance has since been updated. There are no other differences between the video and this book chapter.
6.2 Activity 1: Set-up
- Open R Studio and ensure the environment is clear.
- Open the stub-wrangling-2.Rmd file and ensure that the working directory is set to your Data Skills folder and that the two .csv data files (participant-info.csv and ahi-cesd.csv) are in your working directory (you should see them in the file pane).
- If you're on the server, avoid a number of issues by restarting the session: click Session - Restart R.
- Type and run the below code to load the tidyverse package and to load in the data files.
library(tidyverse)
dat <- read_csv('ahi-cesd.csv')
pinfo <- read_csv('participant-info.csv')
all_dat <- inner_join(dat, pinfo, by= c("id", "intervention"))
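To see what the join in the last line does, here is a minimal sketch on made-up data. The tibbles toy_dat and toy_pinfo and all of their values are hypothetical stand-ins for the two .csv files, and only dplyr (part of the tidyverse) is loaded to keep the example small.

```r
library(dplyr)

# Hypothetical toy data standing in for ahi-cesd.csv and participant-info.csv
toy_dat <- tibble(
  id = c(1, 2, 3),
  intervention = c(1, 2, 1),
  ahiTotal = c(63, 73, 77)
)
toy_pinfo <- tibble(
  id = c(1, 2, 4),
  intervention = c(1, 2, 3),
  age = c(35, 59, 51)
)

# inner_join() keeps only the rows whose id AND intervention
# values appear in both tables, and combines their columns
toy_all <- inner_join(toy_dat, toy_pinfo, by = c("id", "intervention"))
toy_all
```

Participants 3 and 4 each appear in only one of the toy tables, so they are dropped from toy_all.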
Now let's start working with our tidyverse verb functions...
6.3 Activity 2: Select
Select the columns ahiTotal, cesdTotal, sex, age, educ, income, occasion, elapsed.days from all_dat and create a variable called summarydata.
summarydata <- select(all_dat, ahiTotal, cesdTotal, sex, age, educ, income, occasion, elapsed.days)
If you get an error message when using select that says unused argument, it means that R is trying to use the wrong version of the select function (another loaded package has a function with the same name). There are two solutions. First, save your work, restart the R session (click Session - Restart R), and then run all your code above again from the start. Alternatively, replace select with dplyr::select, which tells R exactly which version of the select function to use. We'd recommend restarting the session because this will get you in the habit, and it's a useful thing to try for a range of problems.
Pause here and interpret the above code and output
- How would you translate this code into English?
- What columns have been removed from the data?
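As a rough illustration of the dplyr::select workaround described above, the sketch below uses a small made-up tibble (toy_all and its values are hypothetical, not the chapter's data). It also shows one possible way to check which columns a select() call has dropped.

```r
library(dplyr)

# Hypothetical toy data with one column we do not want to keep
toy_all <- tibble(
  id = c(1, 2, 3),
  ahiTotal = c(63, 73, 77),
  cesdTotal = c(14, 6, 3),
  age = c(35, 59, 51)
)

# The dplyr:: prefix means R uses dplyr's select(),
# even if another loaded package also defines a select() function
toy_summary <- dplyr::select(toy_all, ahiTotal, cesdTotal, age)

# setdiff() lists the column names present before but not after
dropped <- setdiff(names(toy_all), names(toy_summary))
dropped
```

The same setdiff() comparison could be run on all_dat and summarydata to answer the second question above.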
6.4 Activity 3: Arrange
Arrange the data in the variable created above (summarydata) by ahiTotal with lowest score first.
ahi_asc <- arrange(summarydata, ahiTotal)
- How could you arrange this data in descending order (highest score first)?
What is the smallest ahiTotal score?
What is the largest ahiTotal score?
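One way to approach the descending-order question is dplyr's desc() helper. The sketch below is only an illustration on made-up scores (toy_summary and its values are hypothetical), so the numbers it returns are not the answers for the real data.

```r
library(dplyr)

# Hypothetical toy scores, not the real summarydata
toy_summary <- tibble(
  ahiTotal = c(73, 32, 114, 63),
  cesdTotal = c(6, 50, 0, 14)
)

# desc() reverses the sort order so the highest score comes first
toy_desc <- arrange(toy_summary, desc(ahiTotal))
toy_desc

# The smallest and largest scores can also be read off directly
min(toy_summary$ahiTotal)
max(toy_summary$ahiTotal)
```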
6.5 Activity 4: Filter
Filter the data ahi_asc by taking out those who are over 65 years of age.
age_65max <- filter(ahi_asc, age <= 65)
- What does filter() do?
- How many observations are left in age_65max after running filter()?
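A minimal sketch of what filter() is doing, using a made-up tibble (toy_ages and its values are hypothetical). Note that age <= 65 keeps people who are exactly 65, and nrow() is one way to count the observations that remain.

```r
library(dplyr)

# Hypothetical toy data: five participants with different ages
toy_ages <- tibble(
  id = 1:5,
  age = c(23, 65, 66, 71, 40)
)

# filter() keeps the rows where the condition is TRUE,
# so everyone older than 65 is removed
toy_65max <- filter(toy_ages, age <= 65)

# Count how many observations are left
nrow(toy_65max)
```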
6.6 Activity 5: Summarise
Then, use summarise to create a new variable data_median, which calculates the median ahiTotal score in the age_65max data and gives the result the column name median_score.
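As a minimal sketch of this step, the code below applies summarise() to made-up scores (toy_65max and its values are hypothetical). With the chapter's data, the same pattern would presumably be applied to age_65max and stored in data_median.

```r
library(dplyr)

# Hypothetical toy scores standing in for age_65max
toy_65max <- tibble(ahiTotal = c(60, 70, 74, 80, 91))

# summarise() collapses the data to one row;
# median_score is the name given to the new column
toy_median <- summarise(toy_65max, median_score = median(ahiTotal))
toy_median
```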
What is the median score?
Change the above code to give you the mean score. What is the mean score to 2 decimal places?
6.7 Activity 6: Group_by
Use mutate to create a new column called Happiness_Category in age_65max which categorises participants based on whether they score above the median ahiTotal score for all participants.
Then, group the data stored in age_65max by Happiness_Category, and store it in an object named happy_dat.
Finally, use summarise to calculate the median cesdTotal score for participants who scored above and below the median ahiTotal score and save it in a new object named data_median_group.
age_65max <- mutate(age_65max, Happiness_Category = (ahiTotal > 74))
happy_dat <- group_by(age_65max, Happiness_Category)
data_median_group <- summarise(happy_dat, median_score = median(cesdTotal))
If you get what looks like an error that says summarise() ungrouping output (override with .groups argument), don't worry: this isn't an error, it's just R telling you what it has done. This message was added in a recent update to the tidyverse, which is why it doesn't appear in some of the walkthrough videos.
Pause here and interpret the above code and output
- What does group_by() do?
- How would you change the code to group by education rather than Happiness_Category?
group_by(age_65max, educ)
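The sketch below repeats the mutate, group_by and summarise steps on made-up data (toy_65max and its values are hypothetical) with two optional variations. First, the cut-off is computed with median(ahiTotal) instead of typing 74 by hand, which avoids having to update the number if the data change. Second, the .groups argument of summarise() (available in dplyr 1.0.0 and later) is set explicitly, which should stop the grouping message from appearing.

```r
library(dplyr)

# Hypothetical toy data standing in for age_65max
toy_65max <- tibble(
  ahiTotal  = c(60, 70, 74, 80, 91, 95),
  cesdTotal = c(30, 22, 15, 8, 4, 2)
)

# TRUE if a participant scores above the median of all ahiTotal scores
toy_65max <- mutate(toy_65max, Happiness_Category = ahiTotal > median(ahiTotal))

# Group by the new column, then summarise within each group
toy_happy <- group_by(toy_65max, Happiness_Category)

# .groups = "drop" returns an ungrouped table and silences the message
toy_median_group <- summarise(
  toy_happy,
  median_score = median(cesdTotal),
  .groups = "drop"
)
toy_median_group
```

The result has one row per group, with the FALSE group (at or below the median) listed first.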
6.8 Activity 7: Data visualisation
Copy, paste and run the below code into a new code chunk to create a plot of depression scores grouped by income level using the age_65max data.
ggplot(age_65max, aes(x = as.factor(income), y = cesdTotal, fill = as.factor(income))) +
  geom_violin(trim = FALSE, show.legend = FALSE, alpha = .6) +
  geom_boxplot(width = .2, show.legend = FALSE, alpha = .5) +
  scale_fill_viridis_d(option = "D") +
  scale_x_discrete(name = "Income Level", labels = c("Below Average", "Average", "Above Average")) +
  scale_y_continuous(name = "Depression Score")
Which income group has the highest median depression scores?
Which group has the highest density of scores at any one point?
Density is represented by the curvy line around the boxplot that looks a little bit like a (drunk) violin. The fatter the violin, the more data points there are at any one point. This means that in the above plot, the Above Average group has the highest density because this has the widest violin, i.e., there are lots of people in the Above Average income group with a score of about 5.
- Is income group a between-subject or within-subject variable?
Between-subject designs are where different participants are in different groups. Within-subject designs are where the same participants are in all groups. Income is an example of a between-subject variable because each participant can only be in one level of the independent variable.
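Answers read off a plot can be cross-checked numerically with the verbs from earlier in the chapter. The sketch below is an illustration on made-up data (toy_65max and its values are hypothetical), and it assumes income is coded 1, 2 and 3 for Below Average, Average and Above Average, which is an assumption based on the plot labels rather than something stated in the chapter.

```r
library(dplyr)

# Hypothetical toy data; income coding of 1, 2, 3 is assumed
toy_65max <- tibble(
  income    = c(1, 1, 2, 2, 3, 3),
  cesdTotal = c(20, 30, 10, 14, 4, 6)
)

# One row per income level: median depression score and group size
toy_by_income <- group_by(toy_65max, income)
toy_income_medians <- summarise(
  toy_by_income,
  median_cesd = median(cesdTotal),
  n = n(),
  .groups = "drop"
)

# Put the income level with the highest median first
toy_ranked <- arrange(toy_income_medians, desc(median_cesd))
toy_ranked
```

The n column counts the observations in each income level, which is a quick check that every row has been assigned to exactly one group.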
6.9 Activity 8: Make R your own
Finally, you can customise how R Studio looks to make it feel more like your own personal version. Click Tools - Global Options - Appearance. You can change the default font, font size, and general appearance of R Studio, including using dark mode. Play around with the settings and see which one you prefer - you're going to spend a lot of time with R, it might as well look nice!