3  Data wrangling II

Intended Learning Outcomes

By the end of this chapter, you should be able to:

  • apply familiar data wrangling functions to novel datasets
  • read and interpret error messages
  • realise there are several ways of getting to the results

In this chapter, we will pick up where we left off in Chapter 2. We will calculate average scores for two of the questionnaires, address an error mode problem, and finally, join all data objects together. This will finalise our data for the upcoming data visualization sections (Chapter 4 and Chapter 5).

Individual Walkthrough

3.1 Activity 1: Setup

  • Go to the project folder we have been using in the last two weeks and double-click on the project icon to open the project in RStudio
  • Either Create a new .Rmd file for chapter 3 and save it to your project folder or continue the one from last week. See Section 1.3 if you need some guidance.

3.2 Activity 2: Load in the libraries and read in the data

Today, we will be using tidyverse along with the two csv files created at the end of the last chapter: data_prp_for_ch3.csv and qrp_t1.csv. If you need to download them again for any reason, click on the following links: data_prp_for_ch3.csv and qrp_t1.csv.

library(???)
data_prp <- read_csv("???")
qrp_t1 <- read_csv("???")
library(tidyverse)
data_prp <- read_csv("prp_data_reduced.csv")
qrp_t1 <- read_csv("qrp_t1.csv")

If you need a quick reminder what the dataset was about, have a look at the abstract in Section 1.4. We also addressed the changes we made to the dataset there.

And remember to have a quick glimpse() at your data.

3.3 Activity 3: Confidence in understanding Open Science practices

The main goal is to compute the mean Understanding score per participant.

The mean Understanding score for time point 2 has already been calculated (in the Time2_Understanding_OS column), but we still need to compute it for time point 1.

Looking at the Understanding data at time point 1, you determine that

  • individual item columns are , and
  • according to the codebook, there are reverse-coded items in this questionnaire.

The steps are quite similar to those for QRP, but we need to add an extra step: converting the character labels into numbers.

Again, let’s do this step by step:

  • Step 1: Select the relevant columns Code, and every Understanding column from time point 1 (e.g., from Understanding_OS_1_Time1 to Understanding_OS_12_Time1) and store them in an object called understanding_t1
  • Step 2: Pivot the data from wide format to long format using pivot_longer() so we can recode the labels into values (step 3) and calculate the average score (in step 4) more easily
  • Step 3: Recode the values “Not at all confident” as 1 and “Entirely confident” as 7. All other values are already numbers. We can use functions mutate() in combination with case_match() for that
  • Step 4: Calculate the average Understanding Open Science score (Time1_Understanding_OS) per participant using group_by() and summarise()

Steps 1 and 2: Select and pivot

How about you try the first 2 steps yourself using the code from Chapter 2 Activity 4 (Section 2.4) as a template?

understanding_t1 <- data_prp %>% 
  select(???) %>% # Step 1
  pivot_longer(cols = ???, names_to = "???", values_to = "???") # Step 2
understanding_t1 <- data_prp %>% 
  # Step 1
  select(Code, Understanding_OS_1_Time1:Understanding_OS_12_Time1) %>% 
  # Step 2 - I picked different column labels this time for some variety
  pivot_longer(cols = Understanding_OS_1_Time1:Understanding_OS_12_Time1, names_to = "Understanding_Qs", values_to = "Responses") 

Step 3: recoding the values

OK, we now want to recode the values in the Responses column (or whatever name you picked for your column that has some of the numbers in it) so that “Not at all confident” = 1 and “Entirely confident” = 7. We want to keep all other values as they are (2-6 look already quite “numeric”).

Let’s create a new column Responses_corrected that stores the new values with mutate(). Then we can combine that with the case_match() function.

  • The first argument in case_match() is the column name of the variable you want to recode.
  • Then you can start recoding the values in the way of CurrentValue ~ NewValue (~ is a tilde). Make sure you use the ~ and not =!
  • Lastly, the .default argument tells R what to do with values that are neither “Not at all confident” nor “Entirely confident”. Here, we want to replace them with the original value of the Responses column. In other datasets, you may want to set the default to NA for missing values, a character string or a number, and case_match() is happy to oblige.
understanding_t1 <- understanding_t1 %>% 
  mutate(Responses_corrected = case_match(Responses, # column of the values to recode
                                          "Not at all confident" ~ 1, # values to recode
                                          "Entirely confident" ~ 7,
                                          .default = Responses # all other values taken from column Responses
  ))
Error in `mutate()`:
ℹ In argument: `Responses_corrected = case_match(...)`.
Caused by error in `case_match()`:
! Can't combine `..1 (right)` <double> and `.default` <character>.

Have a look at the error message. It’s pretty helpful this time. It says Can't combine ..1 (right) <double> and .default <character>. It means that the replacement values are expected to be data type character since the original column type was type character.

So how do we fix this? Actually, there are several ways this could be done. Click on the tabs below to check out 3 possible solutions.

One option is to modify the .default argument Responses so that the values are copied over from the original column but as a number rather than the original character value. The function as.numeric() does the conversion.

understanding_t1_step3_v1 <- understanding_t1 %>% 
  mutate(Responses_corrected = case_match(Responses, # column of the values to recode
                                          "Not at all confident" ~ 1, # values to recode
                                          "Entirely confident" ~ 7,
                                          .default = as.numeric(Responses) # all other values taken from column Responses but as numeric data type 
  ))

Change the numeric values on the right side of the ~ to character. Then in a second step, we would need to turn the character column into a numeric type. Again, we have several options to do so. We could either use the parse_number() function we encountered earlier during the demographics wrangling or the as.numeric() function.

  • V1: Responses_corrected = parse_number(Responses_corrected)
  • V2: Responses_corrected = as.numeric(Responses_corrected)

Just pay attention that you are still working within the mutate() function.

understanding_t1_step3_v2 <- understanding_t1 %>% 
  mutate(Responses_corrected = case_match(Responses, # column of the values to recode
                                          "Not at all confident" ~ "1",
                                          "Entirely confident" ~ "7",
                                          .default = Responses # all other values taken from column Responses (character)
  ),
  Responses_corrected = parse_number(Responses_corrected)) # turning Responses_corrected into a numeric column

If you recode all the labels into numbers (e.g., “2” into 2, “3” into 3, etc.) from the start, you won’t need to perform any additional conversions later.

understanding_t1_step3_v2 <- understanding_t1 %>% 
  mutate(Responses_recoded = case_match(Responses, # column of the values to recode
                                        "Not at all confident" ~ 1, # recode all of them
                                        "2" ~ 2,
                                        "3" ~ 3,
                                        "4" ~ 4,
                                        "5" ~ 5,
                                        "6" ~ 6,
                                        "Entirely confident" ~ 7))
Your Turn

Choose the option that works best for you to modify the code of understanding_t1 above that didn’t work/ gave you an error message. Once you do that, you should be able to calculate the mean Understanding Score per participant. Store the average scores in a variable called Time1_Understanding_OS. If you need help, refer to the hint below or use Chapter 2 Activity 4 (Section 2.4) as guidance.

understanding_t1 <- understanding_t1 %>% 
  # Step 3
  mutate(Responses_corrected = case_match(Responses, # column of the values to recode
                                          "Not at all confident" ~ 1, # values to recode
                                          "Entirely confident" ~ 7,
                                          .default = as.numeric(Responses) # all other values taken from column Responses but as numeric data type 
  )) %>% 
  # Step 4: calculating averages per participant
  group_by(Code) %>%
  summarise(Time1_Understanding_OS = mean(Responses_corrected)) %>%
  ungroup()

Of course, this could have been written up as a single pipe.

understanding_t1 <- data_prp %>% 
  # Step 1
  select(Code, Understanding_OS_1_Time1:Understanding_OS_12_Time1) %>% 
  # Step 2
  pivot_longer(cols = -Code, names_to = "Understanding_Qs", values_to = "Responses") %>% 
  # Step 3
  mutate(Responses_corrected = case_match(Responses, # column of the values to recode
                                          "Not at all confident" ~ 1, # values to recode
                                          "Entirely confident" ~ 7,
                                          .default = as.numeric(Responses) # all other values taken from column Responses but as numeric data type 
  )) %>% 
  # Step 4
  group_by(Code) %>%
  summarise(Time1_Understanding_OS = mean(Responses_corrected)) %>%
  ungroup()

3.4 Activity 4: Survey of Attitudes Toward Statistics (SATS-28)

The main goal is to compute the mean SATS-28 score for each of the 4 subscales per participant for time point 1.

Looking at the SATS data at time point 1, you determine that

  • individual item columns are , and
  • according to the codebook, there are reverse-coded items in this questionnaire.
  • Additionally, we are looking to compute the means for the 4 different subscales of the SAT-28 which are , , , and .

This scenario is slightly more tricky than the previous ones due to the reverse-coding and the 4 subscales. So, let’s tackle this step by step again:

  • Step 1: Select the relevant columns Code, and every SATS28 column from time point 1 (e.g., from SATS28_1_Affect_Time1 to SATS28_28_Difficulty_Time1) and store them in an object called sats_t1
  • Step 2: Pivot the data from wide format to long format using pivot_longer() so we can recode the labels into values (step 3) and calculate the average score (in step 4) more easily
  • Step 3: We need to know which items belong to which subscale - fortunately, we have that information in the variable name and can use the separate() function to access it.
  • Step 4: We need to know which items are reverse-coded and then reverse-score them - unfortunately, the info is only in the codebook and we need to find a work-around. case_when() can help identify and re-score the reverse-coded items.
  • Step 5: Calculate the average SATS score per participant and subscale using group_by() and summarise()
  • Step 6: use pivot_wider() to spread out the dataframe into wide format and rename() to tidy up the column names

Steps 1 and 2: select and pivot

The selecting and pivoting are exactly the same way as we already practiced in the other 2 questionnaires. Apply them here to this questionnaire.

sats_t1 <- data_prp %>% 
  select(???) %>% # Step 1
  pivot_longer(cols = ???, names_to = "???", values_to = "???") # Step 2
sats_t1 <- data_prp %>% 
  select(Code, SATS28_1_Affect_Time1:SATS28_28_Difficulty_Time1) %>% # Step 1
  pivot_longer(cols = -Code, names_to = "Items", values_to = "Response") # Step 2

Step 3: separate Subscale information

If you look at the Items column more closely, you can see that there is information on the Questionnaire, the Item_number, the Subscale, and the Timepoint the data was collected at.

We can separate the information into separate columns using the separate() function. The function’s first argument is the column to separate, then define into which columns you want the original column to split up, and lastly, define the separator sep (here an underscore). For our example, we would write:

  • V1: separate(Items, into = c("SATS", "Item_number", "Subscale", "Time"), sep = "_")

However, we don’t need all of those columns, so we could just drop the ones we are not interested in by replacing them with NA.

  • V2: separate(Items, into = c(NA, "Item_number", "Subscale", NA), sep = "_")

We might also add an extra argument of convert = TRUE to have numeric columns (i.e., Item_number) converted to numeric as opposed to keeping them as characters. Saves us typing a few quotation marks later in Step 4.

sats_t1 <- sats_t1 %>% 
  # Step 3
  separate(Items, into = c(NA, "Item_number", "Subscale", NA), sep = "_", convert = TRUE)

Step 4: identifying reverse-coded items and then correct them

We can use case_when() within the mutate() function here to create a new column FW_RV that stores information on whether the item is a reverse-coded item or not.

case_when() works similarly to case_match(), however case_match() only allows us to “recode” values (i.e., replace one value with another), whereas case_when() is more flexible. It allows us to use conditional statements on the left side of the tilde which is useful when you want to change only some of the data based on specific conditions.

Looking at the codebook, it seems that items 2, 3, 4, 6, 7, 8, 9, 12, 13, 16, 17, 19, 20, 21, 23, 25, 26, 27, and 28 are reverse-coded. The rest are forward-coded.

We want to tell R now, that

  • if the Item_number is any of those numbers listed above, R should write “Reverse” into the new column FW_RV we are creating. Since we have a few possible matches for Item_number, we need the Boolean expression %in% rather than ==.
  • if Item_number is none of those numbers, then we would like the word “Forward” in the FW_RV column to appear. We can achieve that by specifying a .default argument again, but this time we want a “word” rather than a value from another column.
sats_t1 <- sats_t1 %>% 
  mutate(FW_RV = case_when(
    Item_number %in% c(2, 3, 4, 6, 7, 8, 9, 12, 13, 16, 17, 19, 20, 21, 23, 25, 26, 27, 28) ~ "Reverse",
    .default = "Forward"
  ))

Moving on to correcting the scores: Once again, we can use case_when () within the mutate() function to create another conditional statement. This time, the condition is:

  • if FW_RV column has a value of “Reverse” then we would like to turn all 1 into 7, 2 into 6, etc.
  • if FW_RV column has a value of “Forward” then we would like to keep the score from the Response column

There is a quick way and a not-so-quick way to achieve the actual reverse-coding.

  • Option 1 (quick): The easiest way to reverse-code scores is to take the maximum value of the scale, add 1 unit, and subtract the original value. For example, on a 5-point Likert scale, it would be 6 minus the original rating; for a 7-point Likert scale, 8 minus the original rating, etc. (see Option 1 tab).
  • Option 2 (not so quick): This involves using two conditional statements (see Option 2 tab).

Use the one you find more intuitive.

Here we are using a Boolean expression to check if the string “Reverse” is present in the FW_RV column. If this condition is TRUE, the value in the new column we’re creating, Scores_corrected, will be calculated as 8 minus the value from the Response column. If the condition is FALSE (handled by the .default argument), the original values from the Response column will be retained.

sats_t1 <- sats_t1 %>% 
  mutate(Scores_corrected = case_when(
    FW_RV == "Reverse" ~ 8-Response,
    .default = Response
  ))

As stated above, the longer approach involves using two conditional statements. The first condition checks if the value in the FW_RV column is “Reverse”, while the second condition checks if the value in the Response column equals a specific number. When both conditions are met, the corresponding value on the right side of the tilde is placed in the newly created Scores_corrected_v2 column.

For example, line 3 would read: if the value in the FW_RV column is “Reverse” AND the value in the Response column is 1, then assign a value of 7 to the Scores_corrected_v2 column.

sats_t1 <- sats_t1 %>% 
  mutate(Scores_corrected_v2 = case_when(
    FW_RV == "Reverse" & Response == 1 ~ 7,
    FW_RV == "Reverse" & Response == 2 ~ 6,
    FW_RV == "Reverse" & Response == 3 ~ 5,
    # no need to recode 4 as 4
    FW_RV == "Reverse" & Response == 5 ~ 3,
    FW_RV == "Reverse" & Response == 6 ~ 2,
    FW_RV == "Reverse" & Response == 7 ~ 1,
    .default = Response
  ))

As you can see now in sats_t1, both columns Scores_corrected and Scores_corrected_v2 are identical.

One way to check whether our reverse-coding worked is by examining the distinct values in the original Response column and comparing them with the Scores_corrected. We should also retain the FW_RV column to observe how the reverse-coding applied.

To see the patterns more clearly, we can use arrange() to sort the values in a meaningful order. Remember, the default sorting order is ascending, so if you want to sort values in descending order, you’ll need to wrap your variable in the desc() function.

check_coding <- sats_t1 %>% 
  distinct(FW_RV, Response, Scores_corrected) %>% 
  arrange(desc(FW_RV), Response)
check_coding
FW_RV Response Scores_corrected
Reverse 1 7
Reverse 2 6
Reverse 3 5
Reverse 4 4
Reverse 5 3
Reverse 6 2
Reverse 7 1
Forward 1 1
Forward 2 2
Forward 3 3
Forward 4 4
Forward 5 5
Forward 6 6
Forward 7 7

Step 5

Now that we know everything worked out as intended, we can calculate the mean scores of each subscale for each participant in sats_t1.

sats_t1 <- sats_t1 %>% 
  group_by(???, ???) %>% 
  summarise(mean_score = ???(???)) %>% 
  ungroup()
sats_t1 <- sats_t1 %>% 
  group_by(Code, Subscale) %>% 
  summarise(mean_score = mean(Scores_corrected)) %>% 
  ungroup()
`summarise()` has grouped output by 'Code'. You can override using the
`.groups` argument.

Step 6

The final step is to transform the data back into wide format, ensuring that each subscale has its own column. This will make it easier to join the data objects later on. In pivot_wider(), the first argument, names_from, specifies the column you want to use for your new column headings. The second argument, values_from, tells R which column should provide the cell values.

We should also rename the column names to match those in the codebook. Conveniently, we can use a function called rename() that works exactly like select() (following the pattern new_name = old_name), but it keeps all other column names the same rather than reducing the number of columns.

sats_t1 <- sats_t1 %>% 
  pivot_wider(names_from = Subscale, values_from = mean_score) %>% 
  rename(SATS28_Affect_Time1_mean = Affect,
         SATS28_CognitiveCompetence_Time1_mean = CognitiveCompetence,
         SATS28_Value_Time1_mean = Value,
         SATS28_Difficulty_Time1_mean = Difficulty)
head(sats_t1, n = 5)
Code SATS28_Affect_Time1_mean SATS28_CognitiveCompetence_Time1_mean SATS28_Difficulty_Time1_mean SATS28_Value_Time1_mean
AD03 2.333333 3.833333 3.428571 5.555556
AD05 3.500000 5.000000 2.142857 4.777778
Ab01 5.166667 5.666667 4.142857 5.444444
Al05 2.166667 2.666667 2.857143 3.777778
Am05 4.166667 5.666667 5.571429 4.888889

Again, this could have been written up as a single pipe.

sats_t1 <- data_prp %>% 
  # Step 1
  select(Code, SATS28_1_Affect_Time1:SATS28_28_Difficulty_Time1) %>% 
  # Step 2
  pivot_longer(cols = -Code, names_to = "Items", values_to = "Response") %>% 
  # Step 3
  separate(Items, into = c(NA, "Item_number", "Subscale", NA), sep = "_", convert = TRUE) %>% 
  # step 4
  mutate(FW_RV = case_when(
    Item_number %in% c(2, 3, 4, 6, 7, 8, 9, 12, 13, 16, 17, 19, 20, 21, 23, 25, 26, 27, 28) ~ "Reverse",
    .default = "Forward"
  ),
    Scores_corrected = case_when(
      FW_RV == "Reverse" ~ 8-Response,
      .default = Response
  )) %>% 
  # step 5
  group_by(Code, Subscale) %>% 
  summarise(mean_score = mean(Scores_corrected)) %>% 
  ungroup() %>% 
  # step 6
  pivot_wider(names_from = Subscale, values_from = mean_score) %>% 
  rename(SATS28_Affect_Time1_mean = Affect,
         SATS28_CognitiveCompetence_Time1_mean = CognitiveCompetence,
         SATS28_Value_Time1_mean = Value,
         SATS28_Difficulty_Time1_mean = Difficulty)

3.5 Activity 5 (Error Mode): Perceptions of supervisory support

The main goal is to compute the mean score for perceived supervisory support per participant.

Looking at the supervisory support data, you determine that

  • individual item columns are , and
  • according to the codebook, there are reverse-coded items in this questionnaire.

I have outlined my steps as follows:

  • Step 1: Reverse-code the single column first because that’s less hassle than having to do that with conditional statements (Supervisor_15_R). mutate() is my friend.
  • Step 2: I want to filter out everyone who failed the attention check in Supervisor_7. I can do this with a Boolean expression within the filter() function. The correct response was “completely disagree” which is 1.
  • Step 3: Select their id from time point 2 and all the columns that start with the word “super”, apart from Supervisor_7 and the original Supervisor_15_R column
  • Step 4: pivot into long format so I can calculate the averages better
  • Step 5: calculate the average scores per participant

I’ve started coding but there are some errors in my code. Help me find and fix all of them. Try to go through the code line by line and read the error messages.

super <- data_ppr %>% 
  mutate(Supervisor_15 = 9-supervisor_15_R) %>% 
  filter(Supervisor_7 = 1) %>% 
  select(Code, starts_with("Super"), -Supervisor_7, -Supervisor_15_R) 
pivot_wider(cols = -Code, names_to = "Item", values_to = "Response") %>% 
  group_by(Time2_Code) %>% 
  summarise(Mean_Supervisor_Support = mean(Score_corrected, na.rm = TRUE)) %>% 
  ungroup()

There are 8 mistakes in the code.

Did you spot all 8 mistakes? Let’s go through them line by line.

super <- data_prp %>% # spelling mistake in data object
  mutate(Supervisor_15 = 8-Supervisor_15_R) %>% # semantic error: 8 minus response for a 7-point scale and supervisor_15_R needs a capital S
  filter(Supervisor_7 == 1) %>% # needs a Boolean expression == instead of =
  select(Code, starts_with("Super"), -Supervisor_7, -Supervisor_15_R) %>% # no pipe at the end, the rest is actually legit
  pivot_longer(cols = -Code, names_to = "Item", values_to = "Response") %>% # pivot_longer instead of pivot_wider
  group_by(Code) %>% # Code rather than Time2_Code - the reduced dataset does not contain Time2_Code
  summarise(Mean_Supervisor_Support = mean(Response, na.rm = TRUE)) %>% # Score_corrected doesn't exist; needs to be Response
  ungroup()
  • Note that the semantic error in line 2 will not give you an error message.
  • Were you thrown off by the starts_with("Super") expression in line 4? starts_with() and ends_with() are great alternatives to selecting columns via : But, using select(Code, Supervisor_1:Supervisor_6, Supervisor_8:Supervisor_14) would have given us the same result. [I admit, that one was perhaps a bit mean]

3.6 Activity 6: Join everything together with ???_join()

Time to join all the relevant data files into a single dataframe, which will be used in the next chapters on data visualization. There are four ways to join data: inner_join(), left_join(), right_join(), and full_join(). Each function behaves differently in terms of what information is retained from the two data objects. Here is a quick overview:

Info on mutating joins

You have 4 types of join functions you could make use of. Click on the panels to know more

inner_join() returns only the rows where the values in the column specified in the by = statement match in both tables.

inner_join(): gif by Garrick Aden-Buie

left_join() retains the complete first (left) table and adds values from the second (right) table that have matching values in the column specified in the by = statement. Rows in the left table with no match in the right table will have missing values (NA) in the new columns.

left_join(): gif by Garrick Aden-Buie

right_join() retains the complete second (right) table and adds values from the first (left) table that have matching values in the column specified in the by = statement. Rows in the right table with no match in the left table will have missing values (NA) in the new columns.

right_join(): gif by Garrick Aden-Buie

full_join() returns all rows and all columns from both tables. NA values fill unmatched rows.

full_join(): gif by Garrick Aden-Buie

From our original data_prp, we need to select demographics data and all summarised questionnaire data from time point 2. Next, we will join this with all other aggregated datasets from time point 1 which are currently stored in separate data objects in the Global Environment.

While you may be familiar with inner_join() from last year, for this task, we want to retain all data from all the data objects. Therefore, we will use full_join(). Keep in mind, you can only join two data objects at a time, so the upcoming code chunk will involve a fair bit of piping and joining.

Note: Since I (Gaby) like my columns arranged in a meaningful way, I will use select() at the end to order them better.

data_prp_final <- data_prp %>% 
  select(Code:Plan_prereg, Pre_reg_group:Time2_Understanding_OS) %>% 
  full_join(qrp_t1) %>% 
  full_join(understanding_t1) %>% 
  full_join(sats_t1) %>% 
  full_join(super) %>% 
  select(Code:Plan_prereg, Pre_reg_group, SATS28_Affect_Time1_mean, SATS28_CognitiveCompetence_Time1_mean, SATS28_Value_Time1_mean, SATS28_Difficulty_Time1_mean, QRPs_Acceptance_Time1_mean, Time1_Understanding_OS, Other_OS_behav_2:Time2_Understanding_OS, Mean_Supervisor_Support)
No by argument in the code above?

Note how I didn’t include a by argument in the code above. If you leave by = out, R will join the 2 data objects by ALL columns that have the same name.

Special case 1: matching column names but different values

If you want more control, you should include the by argument; for example, if both data objects include a column age but data was recorded at 2 different time points. In that case, the information from both age columns should be retained and the by argument would not include age.

Special case 2: different column names but matching values

Another special case presents when both data objects contain identical information but the variable names don’t match. Let’s say, both data objects contain gender information, but in one data object the variable is named gender and in the other one gender_label. In that case, your by argument needs to be modified as: by = join_by(gender == gender_label).

More info on joins can be found https://www.tidyverse.org/blog/2023/01/dplyr-1-1-0-joins/

And this is basically the dataset we need for Chapter 4 and Chapter 5.

3.7 Activity 7: Knit and export

Knit the .Rmd file to ensure everything runs as expected. Once it does, export the data object data_prp_final as a csv for use in the Chapter 4. Name it something meaningful, something like data_prp_for_ch4.csv.

write_csv(data_prp_final, "data_prp_for_ch4.csv")

Pair-coding

We will once again be working with data from Binfet et al. (2021), which focuses on the randomised controlled trials data involving therapy dog interventions. Today, our goal is to calculate the average Loneliness score for each participant measured at time point 1 (pre-intervention) using the raw data file dog_data_raw. Currently, the data looks like this:

RID L1_1 L1_2 L1_3 L1_4 L1_5 L1_6 L1_7 L1_8 L1_9 L1_10 L1_11 L1_12 L1_13 L1_14 L1_15 L1_16 L1_17 L1_18 L1_19 L1_20
1 3 3 4 3 2 3 1 2 3 4 3 1 3 1 2 3 2 3 2 4
2 3 2 3 3 4 3 2 2 4 3 2 2 1 2 4 3 3 2 4 3
3 3 3 2 3 3 4 2 3 3 3 2 2 2 2 3 3 4 3 3 3
4 4 2 2 3 4 4 1 3 3 4 2 1 2 2 4 4 3 3 4 3
5 2 3 3 3 4 3 2 2 3 2 4 4 4 3 2 2 3 4 3 2

But we want the data to look like this:

RID Loneliness_pre
1 2.25
2 1.90
3 2.25
4 1.75
5 2.85

This task is a bit more challenging compared to last week’s lab activity, as the Loneliness scale includes some reverse-coded items.

Task 1: Open the R project for the lab

Task 2: Open your .Rmd file from last week or create a new .Rmd file

You could continue the .Rmd file you used last week, or create a new .Rmd. If you need some guidance, have a look at Section 1.3.

Task 3: Load in the library and read in the data

The data should already be in your project folder. If you want a fresh copy, you can download the data again here: data_pair_coding.

We are using the package tidyverse today, and the datafile we should read in is dog_data_raw.csv.

# loading tidyverse into the library
library(???)

# reading in `dog_data_raw.csv`
dog_data_raw <- read_csv("???")

Task 4: Calculating the mean for Loneliness_pre

  • Step 1: Select all relevant columns, such as the participant ID and all 20 items of the Loneliness questionnaire completed by participants before the intervention. Store this data in an object called data_loneliness.

Look at the codebook. Try to figure out

  • the variable name of the column in which the participant id is stored, and
  • which items relate to the Loneliness scale at Stage “pre”
  • the participant id column is called RID
  • The Loneliness items at pre-intervention stage start with L1_
  • Step 2: Pivot the data from wide format to long format so we can reverse-score and calculate the average score more easily (in step 3)

pivot_

We also need 3 arguments in that function:

  • the columns we want to select (e.g., all the loneliness items),
  • the name of the column in which the current column headings will be stored (e.g., “Qs”), and
  • the name of the column that should store all the values (e.g., “Responses”).
  pivot_longer(cols = ???, names_to = "???", values_to = "???")
  • Step 3: Reverse-scoring

Identify the items on the Loneliness scale that are reverse-coded, and then reverse-score them accordingly.

We need to figure out:

  • which are the items of the loneliness scale we need to reverse-score
  • what is the measuring scale of loneliness so we can determine the new values
  • which function to use to create a new column that has the corrected scores in it
  • which one of the case_ functions will get us there
  • The items to be reverse-coded items can be found in the codebook: L1_1, L1_5, L1_6, L1_9, L1_10, L1_15, L1_16, L1_19, L1_20
  • the loneliness scale ranges from 1 to 4, so we need to replace 1 with 4, 2 with 3, 3 with 2, and 4 with 1
  • the function to create a new column mutate()
  • it’s a conditional statement rather than “just” replacing values, hence we need case_when()
  mutate(Score_corrected = case_when(
    ??? ~ ???,
    .default = ???
    ))
  • Step 4: Calculate the average Loneliness score per participant. To match with the table above, we want to call this column Loneliness_pre

grouping and summarising

  group_by(???) %>% 
  summarise(Loneliness_pre = ???(???)) %>% 
  ungroup()
# loading tidyverse into the library
library(tidyverse)

# reading in `dog_data_raw.csv`
dog_data_raw <- read_csv("dog_data_raw.csv")

# Task 4: Tidying 
loneliness_tidy <- dog_data_raw %>% 
  # Step 1
  select(RID, starts_with("L1")) %>% # select(RID, L1_1:L1_20) also works
  # Step 2
  pivot_longer(cols = -RID, names_to = "Qs", values_to = "Response") %>% 
  # Step 3
  mutate(Score_corrected = case_when(
    Qs %in% c("L1_1", "L1_5", "L1_6", "L1_9", "L1_10", "L1_15", "L1_16", "L1_19", "L1_20") ~ 5-Response,
    .default = Response
    )) %>% 
  # Step 4
  group_by(RID) %>% 
  summarise(Loneliness_pre = mean(Score_corrected, na.rm = TRUE)) %>% 
  ungroup()

Test your knowledge and challenge yourself

Knowledge check

Question 1

When using mutate(), which additional function could you use to recode an existing variable?

Question 2

When using mutate(), which additional function could you use to create a new variable based on one or multiple conditional statements?

Question 3

Which of the following functions would you use if you wanted to join two data sets by their shared identifier?

Question 4

Your data object contains a column Score with numbers, but they have been read in incorrectly as a character datatype. Which of the following functions would not work for fixing this issue?

  • parse_number() from the readr package extracts numeric values from strings, so this would work.
  • factor(Score): This would not work as expected because it converts the column into a factor, not a numeric datatype, leading to incorrect results if numeric operations are needed.
  • mutate(Score = as.numeric(Score)): This would work too because mutate() can be used in combination with as.numeric() to create a new numeric column or override the existing character column.
  • as.numeric(): This would also work to convert a character column to numeric. Without mutate, you could use it in a BaseR way, e.g., data$Score <- as.numeric(data$Score) (shudder, BaseR!!! But effective)

Challenge yourself

If you want to challenge yourself and further apply the skills from Chapter 3, you could wrangle the data from dog_data_raw for one of the other questionnaires. There are plenty of options to choose from:

  • recode column Live_Pets so the values read yes and no rather than 1 and 2
  • recode Year_of_Study so they have the labels from the codebook rather than the numbers
  • reverse-code the Homesickness scale for _pre and _post
  • renaming the columns of the other one-item scales as Stress_pre, Stress_post, Engagement_pre and Engagement_post

Any of these tasks should be doable in one step. No need to select or pivot anything. You could just modify dog_data_raw.

  • For the recoding tasks, you need to work out which function to use to recode one value as another - just plain replacing, no conditional statements
  • The reverse-coding might sound daunting to do in one step, but it is only a single value that needs to be recoded. Take some inspiration from Activity 5 (error mode).
  • For the renaming tasks, check how you would change column names without reducing the number of columns overall
## Live_Pets
dog_data_raw <- dog_data_raw %>%
  mutate(Live_Pets = case_match(Live_Pets,
                                1 ~ "yes",
                                2 ~ "no"))
## Year of Study
dog_data_raw <- dog_data_raw %>%
  mutate(Year_of_Study = case_match(Year_of_Study,
                                    1 ~ "First",
                                    2 ~ "Second",
                                    3 ~ "Third",
                                    4 ~ "Fourth",
                                    5 ~ "Fifth or above"))
## Reverse-coding of homesickness pre and post. It's a 5-point scale, hence you'd calculate 6-the original response column
dog_data_raw <- dog_data_raw %>% 
  mutate(Homesick_pre = 6-HO1_1,
         Homesick_post = 6-HO2_1)
## Renaming of Stress and Engagement
dog_data_raw <- dog_data_raw %>% 
  rename(Stress_pre = S1_1, Stress_post = S2_1, Engagement_pre = HO1_2, Engagement_post = HO2_2)
  • reverse-code the Social connectedness scale (pre-intervention) and compute a mean score per participant

This task would take 4 steps to complete. These are the exact same steps we applied to Loneliness_pre in the lab activity. You would just need to figure out which items are related to the Social connectedness scale (pre-intervention) and which ones of those are reverse-coded. The codebook has all the answers.

## SCS pre
scs_pre <- dog_data_raw %>% 
  select(RID, starts_with("SC1")) %>% 
  pivot_longer(cols = -RID, names_to = "Names", values_to = "Response") %>% 
  mutate(Score_corrected = case_when(
    Names %in% c("SC1_3", "SC1_6", "SC1_7", "SC1_9", "SC1_11", "SC1_13", "SC1_15", "SC1_17", "SC1_18", "SC1_20") ~ 7-Response,
    .default = Response
    )) %>% 
  group_by(RID) %>% 
  summarise(SCS_pre = mean(Score_corrected, na.rm = TRUE)) %>% 
  ungroup()
  • reverse-code the Loneliness scale (post-intervention) and compute a mean score per participant
  • reverse-code the Social connectedness scale (post-intervention) and compute a mean score per participant

Both activities are similar to Activity 3 from the individual walkthrough and would take about 5 steps to complete. Start by mapping out the steps.

  • Step 1: Select all relevant columns, such as participant ID and all the items that belong to the questionnaire that participants completed after the intervention
  • Step 2: Pivot the data from wide format to long format so we can reverse-score and calculate the average score more easily
  • Step 3: Recode the initial responses so that the new column has numbers instead of labels
  • Step 4: Reverse-score the items that are labelled as “Reverse” in the codebook and then reverse-score them
  • Step 5: Group by and summarise to calculate the mean Score
## loneliness post
lonely_post <- dog_data_raw %>% 
  # Step 1
  select(RID, starts_with("L2")) %>% 
  # Step 2
  pivot_longer(cols = -RID, names_to = "Names", values_to = "Response") %>% 
  # Step 3
  mutate(Score = case_match(Response,
                            "never" ~ 1,
                            "rarely" ~ 2,
                            "sometimes" ~ 3,
                            "often" ~ 4,
                            .default = NA
  ),
  # Step 4 - we are still in the same mutate function (count the brackets)
        Score_corrected = case_when(
          Names %in% c("L2_1", "L2_5", "L2_6", "L2_9", "L2_10", "L2_15", "L2_16", "L2_19", "L2_20") ~ 5-Score,
          .default = Score
  )) %>% 
  # Step 5
  group_by(RID) %>% 
  summarise(Loneliness_post = mean(Score_corrected, na.rm = TRUE)) %>% 
  ungroup()
## SCS post
scs_post <- dog_data_raw %>% 
  # Step 1
  select(RID, starts_with("SC2")) %>% 
  # Step 2
  pivot_longer(cols = -RID, names_to = "Names", values_to = "Response") %>% 
  # Step 3
  mutate(Response = case_match(Response,
                               "strongly disagree" ~ "1",
                               "strongly agree" ~ "6",
                               .default = Response),
         Response = parse_number(Response),
  # Step 4 - we are still in the same mutate function (count the brackets)
         Score_corrected = case_when(
           Names %in% c("SC2_3", "SC2_6", "SC2_7", "SC2_9", "SC2_11", "SC2_13", "SC2_15", "SC2_17", "SC2_18", "SC2_20") ~ 7-Response,
           .default = Response
         )) %>% 
  # Step 5
  group_by(RID) %>% 
  summarise(SCS_post = mean(Score_corrected, na.rm = TRUE)) %>% 
  ungroup()
  • PANAS: positive and negative affect of pre- and post-intervention in a single pipe rather than in 4 different data objects (see last week’s)

This task would take about 7 steps to get it from

RID PN1_1 PN1_2 PN1_3 PN1_4 PN1_5 PN1_6 PN1_7 PN1_8 PN1_9 PN1_10 PN2_1 PN2_2 PN2_3 PN2_4 PN2_5 PN2_6 PN2_7 PN2_8 PN2_9 PN2_10
1 1 1 1 1 4 1 4 3 1 4 2 1 3 1 4 1 4 4 1 4
2 1 2 3 2 1 3 3 4 1 4 1 1 2 1 3 1 3 4 1 4
3 1 1 3 1 2 4 4 3 1 2 2 2 3 1 3 2 4 3 1 2
4 1 1 5 1 4 3 5 5 3 2 1 1 5 1 4 3 4 4 2 2
5 2 3 5 2 3 2 3 4 2 2 1 2 5 2 3 2 4 5 1 3

to

RID Stage PANAS_NA PANAS_PA
1 post 1.2 3.8
1 pre 1.0 3.2
2 post 1.0 3.2
2 pre 1.8 3.0
3 post 1.6 3.0

Start by mapping out the steps

  • Step 1: select all relevant columns, such as participant ID and all the items that belong to PANAs scale (pos, neg, pre, and post)
  • Step 2: pivot the data from wide format to long format. You want to do that for ALL columns that are not the participant id. The data object should have 3 columns and 5680 observations, i.e. each participant has 20 rows.
  • Step 3: All of the items will have the structure PN1_1. Use separate to split the information across 2 columns. First column has information about the Stage, second column should turn into an Item_number and it should convert into a numeric column in the process to save you typing quotation marks in Step 5.
  • Step 4: recode the Stage column you just created so that everything that starts with PN1 relates to “pre” and PN2 as post.
  • Step 5: identify the subscales positive affect (PA) and negative affect (NA) by item number and recode them. This requires a conditional statement.
  • Step 6: group by and summarise to calculate the mean Score
  • Step 7: pivot, so that you have the 2 PANAS subscales presented in separate columns (see table above). You might need an extra step if the columns aren’t labelled exactly as shown in the table above.
PANAS <- dog_data_raw %>% 
  # Step 1
  select(RID, starts_with("PN")) %>% 
  # Step 2
  pivot_longer(cols = -RID, names_to = "Items", values_to = "Scores") %>% 
  # Step 3
  separate(Items, into = c("Stage", "Item_number"), sep = "_", convert = TRUE) %>% 
  # Step 4 recode Stage
  mutate(Stage = case_match(Stage,
                            "PN1" ~ "pre",
                            "PN2" ~ "post")) %>% 
  # Step 5 identify subscales by item number
  mutate(Subscales = case_when(
    Item_number %in% c(3, 5, 7, 8, 10) ~ "PANAS_PA",
    .default = "PANAS_NA"
  )) %>% 
  # Step 6 
  group_by(RID, Stage, Subscales) %>% 
  summarise(Score = mean(Scores)) %>% 
  ungroup() %>% 
  # Step 7 - to make the data look like the data in `dog_data_clean_long.csv`
  pivot_wider(names_from = Subscales, values_from = Score)