12 Resilience 2
12.1 Intended Learning Outcomes
By the end of this chapter you should be able to:
- Create new columns with `mutate()` and `case_when()`
- Identify and deal with missing data using `na.rm`, `is.na()`, and `drop_na()`
12.2 Walkthrough video
There is a walkthrough video of this chapter available via Echo360. We recommend first trying to work through each section of the book on your own and then watching the video if you get stuck, or if you would like more information. This will feel slower than just starting with the video, but you will learn more in the long-run. Please note that there may have been minor edits to the book since the video was recorded. Where there are differences, the book should always take precedence.
12.3 Activity 1: Set-up
Log in to the server and then:
- Open your Resilience project.
- Create a new R Markdown document named "Resilience 2". Delete the default text and then create a new code chunk.
- Your environment should be clear but if there are objects in it, remove them by pressing the brush icon.
- Create a new code chunk and then copy, paste, and run the below code. You don't need to edit anything; this will get you to the point you were at in Activity 7 in Resilience 1. Once you've run the code, click on `dat_wide` to view it and take a minute to familiarise yourself with the variables and structure of the dataset.
```r
library(tidyverse)

demo_data <- read_csv("demographic_data.csv")
q_data <- read_csv("questionnaire_data.csv")
scoring <- read_csv("scoring.csv")

demo_cleaned <- demo_data %>%
  mutate(treatment = factor(treatment,
                            levels = c(1, 2),
                            labels = c("control", "intervention")),
         gender = as.factor(gender),
         participant_ID = as.character(participant_ID)) # convert to character

q_cleaned <- q_data %>%
  mutate(participant_ID = as.character(participant_ID))

dat_wide <- inner_join(x = demo_cleaned, y = q_cleaned, by = "participant_ID")
```
- How many rows of data are there in `dat_wide`?
- How many columns (variables) are there in `dat_wide`?
- What is the name of the column that stores the participant's ID number?
12.4 Activity 2: Identify missing data
As noted in Resilience 1, there is a little bit of missing data in this dataset. There are a few ways to identify missing data. The first method you have already encountered: you can run `summary()` on the object, and any variables with missing data will have a count of NA's.
```r
summary(dat_wide)
```
But you can also do this using `tidyverse` functions. This is potentially more useful because it is easier to store the result as a table and refer back to it later, and it contains only the information you need about missing data, rather than all the summary stats.
- `summarise_all()` is a function that applies the summary calculation to all variables in an object
- `sum(is.na(.))` adds up (sums) how many values are NAs (`is.na()`)
- The `.` in `is.na(.)` acts as a placeholder for each column in turn.
```r
missing_data <- dat_wide %>%
  summarise_all(~ sum(is.na(.)))
```
Finally, you could also eyeball the data (literally just look at it), although with a large number of data points this isn't particularly reliable.
12.5 Activity 3: Removing missing values
Now we've identified which columns have missing values, we can deal with them using the function `drop_na()`, which drops any rows containing NA values from our object.
The first option is to do the nuclear version and drop any line of data if there is a single missing data point so that we only have complete cases. Because our data is in wide-form, this would mean that we drop all the data from an entire participant if they are missing data in any of the variables.
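The complete-cases code itself isn't shown in this extract; a minimal sketch (using the object name `dat_complete`, which is referred to below) would be:

```r
# drop every row (participant) that has an NA in any column
dat_complete <- dat_wide %>%
  drop_na()
```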
- How many data points are left after running `drop_na()`?
You can figure this out by looking in the environment pane and seeing how many observations `dat_complete` has. You could also do it with code:
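For example (a sketch; any method of counting rows would do):

```r
# count the number of remaining rows (participants)
dat_complete %>%
  count()
```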
| n |
|---|
| 685 |
We lost a lot of data by doing this: it removed 205 participants (about 23% of the original dataset) for having one or more NAs in any column. You'll learn more about statistical power in Level 2, but the tl;dr version is that we generally want our sample sizes to be as large as possible, and we don't want to throw away data unnecessarily.
Our study is going to look at whether resilience scores change depending on the treatment condition participants underwent. Look at the columns that have missing data and think about which ones are absolutely necessary for that analysis.
- Which column might we not mind having missing data for?

Generally speaking, you always need the `participant_ID` so you know which data comes from which person. Additionally, `participant_ID` doesn't actually have any missing data. `bounce_back_quickly` is one of the items on the resilience scale, so if we want to accurately measure resilience scores, we don't want any missing data in this variable. However, `age` isn't necessary for our analysis. When writing up a research report, you would normally report the age and gender stats of your sample, but we can make a note of any missing data rather than having to delete the entire participant. For example:
There were 890 participants in total (mean age = 28.92, SD = 6.62, missing = 8)
Instead of running `drop_na()` on the entire dataset, we can specify which columns to include or ignore. This works the same way as `select()`, in that you can either say which columns you want it to run on, or say which columns to ignore. In this case, it's easier to say which ones to ignore.
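The code itself isn't shown in this extract; a sketch, assuming the output object is named `dat_final` (the name used in Activity 4), might look like this:

```r
# drop rows with NAs in any column except age
dat_final <- dat_wide %>%
  drop_na(-age)
```

`drop_na()` uses the same tidyselect syntax as `select()`, so `-age` means "check every column apart from age".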
This then means only the resilience and treatment columns are included in the call to `drop_na()`, which gives us an extra 8 participants. That isn't much, but every little counts.
12.6 Activity 4: Reshape and score
Now we've got rid of the missing data, we can transform the object to long-form and score it. Copy and paste the below code (this was explained in the last chapter; we've provided it for you so we can jump ahead to more interesting things).
```r
dat_long <- dat_final %>%
  pivot_longer(cols = bounce_back_quickly:long_time_over_setbacks,
               names_to = "item",
               values_to = "response")

dat <- inner_join(x = dat_long, y = scoring, by = c("item", "response"))

dat_scores <- dat %>%
  group_by(participant_ID, age, gender, treatment) %>%
  summarise(resilience_score = mean(score, na.rm = TRUE)) %>%
  ungroup()
```
12.7 Activity 5: Creating new variables
In addition to using the data that is contained in our original datasets, we can also create new variables using `mutate()` combined with other functions.
Let's imagine that for some reason, there is an error in the software used to record the data and age has been recorded incorrectly. Each participant is actually 1 year older than the values we have.
We can create a new column using `mutate()`.

- The bit on the left of the equals sign is the name of the new variable we will create; in this case, it will be a column named `age_corrected`.
- The bit on the right of the equals sign determines the values that column will have. In this case, it will add 1 to the values that are currently stored in the column `age`.
- We just want to add columns to the existing object, which is why we are still using `dat_scores` rather than creating a new object.
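The `mutate()` call described above isn't shown in this extract; a sketch of it would be:

```r
# add 1 to every value of age and store the result in age_corrected
dat_scores <- dat_scores %>%
  mutate(age_corrected = age + 1)
```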
We can also create new variables using a combination of `mutate()` and `case_when()`. `case_when()` is a very useful and powerful function that allows you to create new variables based on multiple conditions and values.
For example, we can create a new categorical variable that assigns participants to the "Younger" category if their `age_corrected` is less than or equal to (`<=`) 30, and "Older" if `age_corrected` is more than (`>`) 30. This categorisation is not based on any scientifically motivated split; it's just around the age I stopped being able to go out more than once a week.
- You can read `~` as "then", i.e., if age is less than or equal to 30, then write "Younger".
- You can have as many conditions as you like; just separate them with commas.
- `TRUE` isn't always necessary, but it controls what value is entered if a value doesn't meet any of the conditions. In this case, it will return the text "Missing" for all the NA values in age.
- Fun fact: if you're working on your own computer and have recently installed the tidyverse, you have a slightly different version of `case_when()` than on the server. If you'd like to see Emily discover this in real-time, you can watch the walkthrough video. The below code will work on both the server and your local installation.
```r
dat_scores <- dat_scores %>%
  mutate(age_category = case_when(age_corrected <= 30 ~ "Younger",
                                  age_corrected > 30 ~ "Older",
                                  TRUE ~ "Missing"))
```
The new variable `age_category` will be created as a `character` variable, but we want it to be stored as a factor.
- Add on a line of code to the above to overwrite the new variable as a factor.
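One possible solution (a sketch; there are other ways to do this):

```r
# overwrite age_category as a factor
dat_scores <- dat_scores %>%
  mutate(age_category = as.factor(age_category))
```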
Now that it's a factor, use your method of choice to count how many participants are in each group.
- There are ___ participants in the younger group
- There are ___ participants in the older group
- There are ___ participants in the missing group
```
## participant_ID        age          gender           treatment  
## Length:692      Min.   :18.0   man       :322   control     :324  
## Class :character   1st Qu.:23.0   non-binary: 22   intervention:368  
## Mode  :character   Median :29.0   woman     :348  
##                    Mean   :28.9  
##                    3rd Qu.:35.0  
##                    Max.   :40.0  
##                    NA's   :7  
## resilience_score   age_corrected   age_category  
## Min.   :1.833    Min.   :19.0    Missing: 7  
## 1st Qu.:2.667    1st Qu.:24.0    Older  :334  
## Median :3.000    Median :30.0    Younger:351  
## Mean   :3.026    Mean   :29.9  
## 3rd Qu.:3.333    3rd Qu.:36.0  
## Max.   :4.333    Max.   :41.0  
##                  NA's   :7
```
| age_category | n |
|---|---|
| Missing | 7 |
| Older | 334 |
| Younger | 351 |
12.8 Activity 6: More variable creation
Let's try some other examples to give you a sense of the flexibility this affords (this will help you with the group project analysis).
The software we used to store the data really has gone wrong, and we've also discovered that `gender` has been coded incorrectly - `man` and `woman` have been switched, although `non-binary` is correct.
- Create a new variable named `gender_corrected` using `mutate()` and `case_when()`
- If `gender` equals `man`, `gender_corrected` should say `woman`
- If `gender` equals `woman`, `gender_corrected` should say `man`
- If `gender` equals `non-binary`, `gender_corrected` should also say `non-binary`
- Remember to make it a factor once you're done
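If you get stuck, here is a sketch of one possible solution (your exact code may differ):

```r
# switch man and woman, keep non-binary, then convert to a factor
dat_scores <- dat_scores %>%
  mutate(gender_corrected = case_when(gender == "man" ~ "woman",
                                      gender == "woman" ~ "man",
                                      gender == "non-binary" ~ "non-binary"),
         gender_corrected = as.factor(gender_corrected))
```

You can then check the counts with `count(dat_scores, gender_corrected)`.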
- How many men are there in the corrected gender variable?
- How many women are there in the corrected gender variable?
- How many non-binary people are there in the corrected gender variable?
12.9 Activity 7: Analysis and visualisation
For the final step, let's calculate all the numbers you would need if you were writing up this study for an analysis. You've done all of this before multiple times throughout Psych 1A and Psych 1B - try and do as much of it as you can from memory and trial-and-error before you look at the hints or solutions.
12.9.1 Demographic stats
Write code to calculate the following (remember to use the corrected variables):
- Total number of participants
- Number of each gender
- Mean age and standard deviation
- Number of participants in each treatment condition
These are all the functions you need to achieve the above: `count()`, `summarise()`, `mean()`, `sd()`, `sum()`, and `is.na()`.
If you have looked at the solutions without trying it for yourself, just remember that it's your own learning that will suffer and Level 2 is going to be an unpleasant shock to the system if you've been taking shortcuts throughout Level 1....
```r
# total ppts
dat_scores %>%
  count()

# gender split
dat_scores %>%
  count(gender_corrected)

# mean age, sd and missing
dat_scores %>%
  summarise(mean_age = mean(age_corrected, na.rm = TRUE),
            sd_age = sd(age_corrected, na.rm = TRUE),
            missing = sum(is.na(age_corrected)))

# number in each condition
dat_scores %>%
  count(treatment)
```
12.9.2 Analysis stats
- Mean, standard deviation, min and max resilience score across all participants
- Mean, standard deviation, min and max resilience score for each treatment condition
- A visualisation of the difference between the two groups
These are all the functions you need to achieve the above: `summarise()`, `group_by()`, `mean()`, `sd()`, `min()`, `max()`, `ggplot()`, and `aes()`. Which plot you choose is up to you; it might involve `geom_histogram()`, `geom_boxplot()`, `geom_violin()`, or `facet_wrap()`.
If you have looked at the solutions without trying it for yourself, just remember that it's your own learning that will suffer and Level 2 is going to be an unpleasant shock to the system if you've been taking shortcuts throughout Level 1....
```r
# overall resilience score
dat_scores %>%
  summarise(mean_score = mean(resilience_score),
            sd_score = sd(resilience_score),
            min_score = min(resilience_score),
            max_score = max(resilience_score))

# resilience scores by group
dat_scores %>%
  group_by(treatment) %>%
  summarise(mean_score = mean(resilience_score),
            sd_score = sd(resilience_score),
            min_score = min(resilience_score),
            max_score = max(resilience_score))
```
```r
# visualisation
## histogram by group
ggplot(dat_scores, aes(x = resilience_score, fill = treatment)) +
  geom_histogram(colour = "black") +
  facet_wrap(~ treatment, nrow = 2) +
  guides(fill = "none") +
  scale_fill_viridis_d(option = "E")

# violin-boxplot
ggplot(dat_scores, aes(x = treatment, y = resilience_score, fill = treatment)) +
  geom_violin(alpha = .3) +
  geom_boxplot(width = .2, alpha = .6) +
  guides(fill = "none") +
  scale_fill_viridis_d(option = "E")
```
12.9.3 Write-up
To get the same numbers, report all numbers to 2 decimal places, rounding up at .5.
After participants were excluded for having missing data in the resilience scores, in total there were ___ participants (___ men, ___ women, ___ non-binary, mean age = ___, SD = ___, missing = ___).

___ participants were randomly assigned to the control condition and ___ were randomly assigned to the treatment condition.

Collapsing across both treatment conditions, the overall mean resilience score was ___ (SD = ___, min = ___, max = ___).

When analysed by treatment group, participants in the control condition (mean = ___, SD = ___, min = ___, max = ___) self-reported ___ resilience scores than participants in the treatment condition (mean = ___, SD = ___, min = ___, max = ___).

The hypothesis that the treatment would improve resilience scores was therefore ___.