2 Data wrangling I
Intended Learning Outcomes
In the next two chapters, we will build on the data wrangling skills from level 1. We will revisit all the functions you have already encountered (and might have forgotten over the summer break) and introduce 2 or 3 new functions. These two chapters will provide an opportunity to revise and apply the functions to a novel dataset.
By the end of this chapter, you should be able to:
- apply familiar data wrangling functions to novel datasets
- read and interpret error messages
- realise that there are several ways of getting to the same result
- export data objects as csv files
The main purpose of this chapter and Chapter 3 is to wrangle your data into shape for data visualisation (Chapter 4 and Chapter 5). For the two chapters, we will:
- calculate demographics
- tidy 3 different questionnaires with varying degrees of complexity
- solve an error mode problem
- join all data objects together
Individual Walkthrough
Before we start, we need to set up some things.
2.1 Activity 1: Setup
- We will be working on the dataset by Pownall et al. (2023) again, which means we can still use the project we created last week. The data files will already be there, so no need to download them again.
- To open the project in RStudio, go to the folder in which you stored the project and the data last time, and double click on the project icon.
- Create a new .Rmd file for Chapter 2 and save it to your project folder. Name it something meaningful (e.g., “chapter_02.Rmd”, “02_data_wrangling.Rmd”). See Section 1.3 if you need some guidance.
- In your newly created .Rmd file, delete everything below line 12 (after the set-up code chunk).
2.2 Activity 2: Load in the libraries and read in the data
We will use tidyverse today, and we want to create a data object data_prp that stores the data from the file prp_data_reduced.csv.
If you need a quick reminder of what the dataset was about, have a look at the abstract in Section 1.4. We also addressed the changes we made to the dataset there.
And remember to have a quick glimpse() at your data.
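A minimal sketch of that setup chunk might look like this:

library(tidyverse)

data_prp <- read_csv("prp_data_reduced.csv")

glimpse(data_prp)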
2.3 Activity 3: Calculating demographics
Let’s start with some simple data-wrangling steps to compute demographics for our original dataset, data_prp. First, we want to determine how many participants took part in the study by Pownall et al. (2023) and compute the mean age and the standard deviation of age for the sample.
2.3.1 … for the full sample using summarise()
The summarise() function is part of the “Wickham Six” alongside group_by(), select(), filter(), mutate(), and arrange(). You used them plenty of times last year.
Within summarise(), we can use the n() function, which calculates the number of rows in the dataset. Since each row corresponds to a unique participant, this gives us the total number of participants.
To calculate the mean age and the standard deviation of age, we need to use the functions mean() and sd() on the column Age, respectively.
demo_total <- data_prp %>%
summarise(n = n(), # participant number
mean_age = mean(Age), # mean age
sd_age = sd(Age)) # standard deviation of age
Warning: There were 2 warnings in `summarise()`.
The first warning was:
ℹ In argument: `mean_age = mean(Age)`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
n | mean_age | sd_age |
---|---|---|
89 | NA | NA |
R did not give us an error message per se, but the output is not quite as expected either. There are NA values in the mean_age and sd_age columns. Looking at the warning message and at Age, can you explain what happened?
Fixing Age
It might be wise to look at the unique answers in the Age column to determine what is wrong. We can do that with the distinct() function.
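One way to run that check:

data_prp %>%
  distinct(Age) # lists each unique value in the Age column once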
One cell has the string “years” appended to the number 25, which has converted the entire column into a character column.
We can easily fix this by extracting only the numbers from the column and converting it into a numeric data type. The parse_number() function, which is part of the tidyverse package, handles both steps in one go (so there’s no need to load additional packages).
We will combine this with the mutate() function to create a new column called Age (containing those numeric values), effectively replacing the old Age column (which had the character values).
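A minimal sketch of that fix, overwriting the existing data_prp object, might look like this:

data_prp <- data_prp %>%
  mutate(Age = parse_number(Age)) # extract the number and convert the column to numeric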
Computing summary stats
Excellent. Now that the numbers are in a numeric format, let’s try calculating the demographics for the total sample again.
demo_total <- data_prp %>%
summarise(n = n(), # participant number
mean_age = mean(Age), # mean age
sd_age = sd(Age)) # standard deviation of age
demo_total
n | mean_age | sd_age |
---|---|---|
89 | NA | NA |
Even though there’s no error or warning, the table still shows NA values for mean_age and sd_age. So, what could possibly be wrong now?
Computing summary stats - third attempt
To ensure R ignores missing values during calculations, we need to add the extra argument na.rm = TRUE to the mean() and sd() functions.
demo_total <- data_prp %>%
summarise(n = n(), # participant number
mean_age = mean(Age, na.rm = TRUE), # mean age
sd_age = sd(Age, na.rm = TRUE)) # standard deviation of age
demo_total
n | mean_age | sd_age |
---|---|---|
89 | 21.88506 | 3.485603 |
Finally, we’ve got it! 🥳 Third time’s the charm!
2.3.2 … per gender using summarise() and group_by()
Now we want to compute the summary statistics for each gender. The code inside the summarise() function remains unchanged; we just need to use the group_by() function beforehand to tell R that we want to compute the summary statistics for each group separately. It’s also good practice to use ungroup() afterwards, so you are not taking groupings forward unintentionally.
demo_by_gender <- data_prp %>%
group_by(Gender) %>% # split data up into groups (here Gender)
summarise(n = n(), # participant number
mean_age = mean(Age, na.rm = TRUE), # mean age
sd_age = sd(Age, na.rm = TRUE)) %>% # standard deviation of age
ungroup()
demo_by_gender
Gender | n | mean_age | sd_age |
---|---|---|---|
1 | 17 | 23.31250 | 5.770254 |
2 | 69 | 21.57353 | 2.738973 |
3 | 3 | 21.33333 | 1.154700 |
2.3.3 Adding percentages
Sometimes, it may be useful to calculate percentages, such as for the gender split. You can do this by adding a line within the summarise() function to perform the calculation. All we need to do is take the number of female, male, and non-binary participants (stored in the n column of demo_by_gender), divide it by the total number of participants (stored in the n column of demo_total), and multiply by 100. Let’s add percentage to the summarise() function of demo_by_gender. Make sure that the code for percentage is placed after the value for n has been computed.
Accessing the value of n for the different gender categories is straightforward because we can refer back to it directly. However, since the total number of participants is stored in a different data object, we need to use a base R function to access it – specifically the $ operator. To do this, you simply type the name of the data object (in this case, demo_total), followed by the $ symbol (with no spaces), and then the name of the column you want to retrieve (in this case, n). The general pattern is data$column.
demo_by_gender <- data_prp %>%
group_by(Gender) %>%
summarise(n = n(),
# n from the line above divided by n from demo_total *100
percentage = n/demo_total$n *100,
mean_age = mean(Age, na.rm = TRUE),
sd_age = sd(Age, na.rm = TRUE)) %>%
ungroup()
demo_by_gender
Gender | n | percentage | mean_age | sd_age |
---|---|---|---|---|
1 | 17 | 19.101124 | 23.31250 | 5.770254 |
2 | 69 | 77.528090 | 21.57353 | 2.738973 |
3 | 3 | 3.370786 | 21.33333 | 1.154700 |
round()
Not super important, because you could round the values yourself when writing up your reports, but if you want to tidy up the decimal places in the output, you can do that using the round() function. You “wrap” it around your computation and specify how many decimal places you want to display (for example, mean(Age) would turn into round(mean(Age), 1)). It may look odd for percentage; just make sure the number that specifies the decimal places is placed within the round() function. The default value is 0 (meaning no decimal places).
demo_by_gender <- data_prp %>%
group_by(Gender) %>%
summarise(n = n(),
percentage = round(n/demo_total$n *100, 2), # percentage with 2 decimal places
mean_age = round(mean(Age, na.rm = TRUE), 1), # mean Age with 1 decimal place
sd_age = round(sd(Age, na.rm = TRUE), 3)) %>% # sd Age with 3 decimal places
ungroup()
demo_by_gender
Gender | n | percentage | mean_age | sd_age |
---|---|---|---|---|
1 | 17 | 19.10 | 23.3 | 5.770 |
2 | 69 | 77.53 | 21.6 | 2.739 |
3 | 3 | 3.37 | 21.3 | 1.155 |
2.4 Activity 4: Questionable Research Practices (QRPs)
The main goal is to compute the mean QRP score per participant for time point 1.
At the moment, the data is in wide format. The table below shows data from the first 3 participants:
Code | Gender | Age | Ethnicity | Secondyeargrade | Opptional_mod | Opptional_mod_1_TEXT | Research_exp | Research_exp_1_TEXT | Plan_prereg | SATS28_1_Affect_Time1 | SATS28_2_Affect_Time1 | SATS28_3_Affect_Time1 | SATS28_4_Affect_Time1 | SATS28_5_Affect_Time1 | SATS28_6_Affect_Time1 | SATS28_7_CognitiveCompetence_Time1 | SATS28_8_CognitiveCompetence_Time1 | SATS28_9_CognitiveCompetence_Time1 | SATS28_10_CognitiveCompetence_Time1 | SATS28_11_CognitiveCompetence_Time1 | SATS28_12_CognitiveCompetence_Time1 | SATS28_13_Value_Time1 | SATS28_14_Value_Time1 | SATS28_15_Value_Time1 | SATS28_16_Value_Time1 | SATS28_17_Value_Time1 | SATS28_18_Value_Time1 | SATS28_19_Value_Time1 | SATS28_20_Value_Time1 | SATS28_21_Value_Time1 | SATS28_22_Difficulty_Time1 | SATS28_23_Difficulty_Time1 | SATS28_24_Difficulty_Time1 | SATS28_25_Difficulty_Time1 | SATS28_26_Difficulty_Time1 | SATS28_27_Difficulty_Time1 | SATS28_28_Difficulty_Time1 | QRPs_1_Time1 | QRPs_2_Time1 | QRPs_3_Time1 | QRPs_4_Time1 | QRPs_5_Time1 | QRPs_6_Time1 | QRPs_7_Time1 | QRPs_8_Time1 | QRPs_9_Time1 | QRPs_10_Time1 | QRPs_11_Time1 | QRPs_12NotQRP_Time1 | QRPs_13NotQRP_Time1 | QRPs_14NotQRP_Time1 | QRPs_15NotQRP_Time1 | Understanding_OS_1_Time1 | Understanding_OS_2_Time1 | Understanding_OS_3_Time1 | Understanding_OS_4_Time1 | Understanding_OS_5_Time1 | Understanding_OS_6_Time1 | Understanding_OS_7_Time1 | Understanding_OS_8_Time1 | Understanding_OS_9_Time1 | Understanding_OS_10_Time1 | Understanding_OS_11_Time1 | Understanding_OS_12_Time1 | Pre_reg_group | Other_OS_behav_2 | Other_OS_behav_4 | Other_OS_behav_5 | Closely_follow | SATS28_Affect_Time2_mean | SATS28_CognitiveCompetence_Time2_mean | SATS28_Value_Time2_mean | SATS28_Difficulty_Time2_mean | QRPs_Acceptance_Time2_mean | Time2_Understanding_OS | Supervisor_1 | Supervisor_2 | Supervisor_3 | Supervisor_4 | Supervisor_5 | Supervisor_6 | Supervisor_7 | Supervisor_8 | Supervisor_9 | Supervisor_10 | Supervisor_11 | Supervisor_12 | Supervisor_13 | Supervisor_14 | Supervisor_15_R |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tr10 | 2 | 22 | White European | 2 | 1 | Research methods in first year | 2 | NA | 1 | 4 | 5 | 3 | 4 | 5 | 5 | 4 | 2 | 2 | 6 | 4 | 3 | 1 | 7 | 7 | 2 | 1 | 3 | 3 | 2 | 2 | 3 | 5 | 2 | 6 | 4 | 4 | 1 | 7 | 7 | 5 | 7 | 3 | 4 | 5 | 7 | 6 | 7 | 7 | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 6 | Entirely confident | Entirely confident | 6 | 6 | Entirely confident | Entirely confident | Entirely confident | Entirely confident | 1 | 1 | 1 | NA | 2 | 3.500000 | 4.166667 | 3.000000 | 2.857143 | 5.636364 | 5.583333 | 5 | 5 | 6 | 6 | 5 | 5 | 1 | 5 | 6 | 5 | NA | 4 | 4 | 5 | 1 |
Bi07 | 2 | 20 | White British | 3 | 2 | NA | 2 | NA | 3 | 5 | 6 | 2 | 5 | 5 | 6 | 2 | 2 | 2 | 7 | 3 | 5 | 1 | 7 | 7 | 1 | 1 | 6 | 3 | 1 | 1 | 2 | 6 | 2 | 7 | 2 | 5 | 7 | 7 | 7 | 2 | 7 | 3 | 7 | 7 | 7 | 7 | 6 | 7 | 2 | 1 | 4 | 4 | 2 | Not at all confident | Not at all confident | Not at all confident | 6 | Entirely confident | Not at all confident | 3 | 6 | 6 | 2 | 2 | 1 | NA | NA | NA | 2 | 3.166667 | 4.666667 | 6.222222 | 2.857143 | 5.454546 | 3.333333 | 7 | 6 | 7 | 7 | 7 | 7 | 1 | 5 | 7 | 7 | 7 | 5 | 2 | 7 | 1 |
SK03 | 2 | 22 | White British | 1 | 2 | NA | 2 | NA | 1 | 5 | 3 | 5 | 2 | 5 | 2 | 2 | 2 | 2 | 6 | 5 | 3 | 2 | 6 | 6 | 3 | 3 | 5 | 3 | 4 | 3 | 5 | 5 | 2 | 5 | 2 | 5 | 5 | 7 | 7 | 6 | 6 | 7 | 6 | 7 | 7 | 7 | 5 | 7 | 1 | 1 | 3 | 2 | 6 | 2 | 3 | 6 | 6 | 5 | 2 | 5 | 5 | 5 | 4 | 5 | 1 | NA | NA | NA | 2 | 4.833333 | 6.166667 | 6.000000 | 4.000000 | 6.272727 | 5.416667 | 7 | 7 | 7 | 7 | 7 | 7 | 1 | 7 | 7 | 7 | 7 | 7 | 5 | 7 | 1 |
Looking at the QRP data at time point 1, you determine that
- the individual item columns are numeric, and
- according to the codebook, there are no reverse-coded items in this questionnaire.
According to the codebook and the data table above, we just have to compute the average score for QRP items 1 to 11, since items 12 to 15 are distractor items. Seems quite straightforward.
However, as you can see in the table above, each item is in a separate column, meaning the data is in wide format. It would be much easier to calculate the mean scores if the items were arranged in long format.
Let’s tackle this problem step by step. It’s best to create a separate data object for this. If we tried to compute it within data_prp, it could quickly become messy.
- Step 1: Select the relevant columns Code, and QRPs_1_Time1 to QRPs_11_Time1, and store them in an object called qrp_t1.
- Step 2: Pivot the data from wide format to long format using pivot_longer() so we can calculate the average score more easily (in step 3).
- Step 3: Calculate the average QRP score (QRPs_Acceptance_Time1_mean) per participant using group_by() and summarise().
qrp_t1 <- data_prp %>%
#Step 1
select(Code, QRPs_1_Time1:QRPs_11_Time1) %>%
# Step 2
pivot_longer(cols = -Code, names_to = "Items", values_to = "Scores") %>%
# Step 3
group_by(Code) %>% # grouping by participant id
summarise(QRPs_Acceptance_Time1_mean = mean(Scores)) %>% # calculating the average Score
ungroup() # just make it a habit
You could rename the columns whilst selecting them. The pattern is select(new_name = old_name). For example, if we wanted to select the variable Code and rename it as Participant_ID, we could do that.
2.5 Activity 5: Knitting
Once you’ve completed your R Markdown file, the final step is to “knit” it, which converts the .Rmd file into an HTML file. Knitting combines your code, text, and output (like tables and plots) into a single cohesive document. This is a really good way to check your code is working.
To knit the file, click the Knit button at the top of your RStudio window. The document will be generated and, depending on your settings, automatically opened in the viewer in the Output pane or in an external browser window.
If any errors occur during knitting, RStudio will show you an error message with details to help you troubleshoot.
If you want to intentionally keep any of the errors we tackled today as a reference for how you solved them, you can add error=TRUE or eval=FALSE to the code chunk that isn’t running.
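For example, a chunk that you want to keep visible but not run when knitting could look like this (the chunk label first-attempt is just illustrative):

```{r first-attempt, eval=FALSE}
demo_total <- data_prp %>%
  summarise(mean_age = mean(Age)) # kept for reference; not evaluated when knitting
```

Alternatively, swapping eval=FALSE for error=TRUE runs the chunk and shows any error message in the knitted document instead of stopping the knit.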
2.6 Activity 6: Export a data object as a csv
To avoid having to repeat the same steps in the next chapter, it’s a good idea to save the data objects you’ve created today as csv files. You can do this by using the write_csv() function from the readr package. The csv files will appear in your project folder.
The basic syntax is:
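write_csv(data_object, "filename.csv")

Here, data_object and "filename.csv" are placeholders for your own data object and your chosen file name.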
Now, let’s export the objects data_prp and qrp_t1.
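A sketch of those two export calls (the second file name, "qrp_t1.csv", is just a suggestion):

write_csv(data_prp, "data_prp_for_ch3.csv")
write_csv(qrp_t1, "qrp_t1.csv")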
Here we named the file data_prp_for_ch3.csv so that we don’t overwrite the original data csv file prp_data_reduced.csv. However, feel free to choose a name that makes sense to you.
Check that your csv files have appeared in your project folder, and you’re all set!
That’s it for Chapter 2: Individual Walkthrough.
Pair-coding
We will continue working with the data from Binfet et al. (2021), focusing on the randomised controlled trial of therapy dog interventions. Today, our goal is to calculate an average Flourishing score for each participant at time point 1 (pre-intervention) using the raw data file dog_data_raw. Currently, the data looks like this:
RID | F1_1 | F1_2 | F1_3 | F1_4 | F1_5 | F1_6 | F1_7 | F1_8 |
---|---|---|---|---|---|---|---|---|
1 | 6 | 7 | 5 | 5 | 7 | 7 | 6 | 6 |
2 | 5 | 7 | 6 | 5 | 5 | 5 | 5 | 4 |
3 | 5 | 5 | 5 | 6 | 6 | 6 | 5 | 5 |
4 | 7 | 6 | 7 | 7 | 7 | 6 | 7 | 4 |
5 | 5 | 5 | 4 | 6 | 7 | 7 | 7 | 6 |
However, we want the data to look like this:
RID | Flourishing_pre |
---|---|
1 | 6.125 |
2 | 5.250 |
3 | 5.375 |
4 | 6.375 |
5 | 5.875 |
Task 1: Open the R project you created last week
If you haven’t created an R project for the lab yet, please do so now. If you already have one set up, go ahead and open it.
Task 2: Open your .Rmd file from last week
Since we haven’t used it much yet, feel free to continue using the .Rmd file you created last week in Task 2.
Task 3: Load in the library and read in the data
The data should be in your project folder. If you didn’t download it last week, or if you’d like a fresh copy, you can download the data again here: data_pair_coding.
We will be using the tidyverse package today, and the data file we need to read in is dog_data_raw.csv.
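A minimal sketch for this task (assuming dog_data_raw.csv is sitting in your project folder):

library(tidyverse)

dog_data_raw <- read_csv("dog_data_raw.csv")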
Task 4: Calculating the mean for Flourishing_pre
- Step 1: Select all relevant columns from dog_data_raw, including the participant ID and all items from the Flourishing questionnaire completed before the intervention. Store this data in an object called data_flourishing.
- Step 2: Pivot the data from wide format to long format so we can calculate the average score more easily (in step 3).
- Step 3: Calculate the average Flourishing score per participant and name this column Flourishing_pre to match the table above.
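If you want to check your approach, here is one possible sketch, assuming the pre-intervention Flourishing items are the columns F1_1 to F1_8 shown in the table above:

data_flourishing <- dog_data_raw %>%
  # Step 1: keep the participant ID and the pre-intervention Flourishing items
  select(RID, F1_1:F1_8) %>%
  # Step 2: reshape so that all item scores end up in a single column
  pivot_longer(cols = -RID, names_to = "Items", values_to = "Scores") %>%
  # Step 3: average the item scores per participant
  group_by(RID) %>%
  summarise(Flourishing_pre = mean(Scores)) %>%
  ungroup()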
Test your knowledge and challenge yourself
Knowledge check
Question 1
Which function of the Wickham Six would you use to include or exclude certain variables (columns)?
Question 2
Which function of the Wickham Six would you use to create new columns or modify existing columns in a dataframe?
Question 3
Which function of the Wickham Six would you use to organise data into groups based on one or more columns?
Question 4
Which function of the Wickham Six would you use to sort the rows of a dataframe based on the values in one or more columns?
Question 5
Which function of the Wickham Six would NOT modify the original dataframe?
Error mode
Some of the code chunks contain mistakes and result in errors, while others do not produce the expected results. Your task is to identify any issues, explain why they occurred, and, if possible, fix them.
We will use a few built-in datasets, such as billboard and starwars, to help you replicate the errors in your own R environment. You can view the data either by typing the dataset name directly into your console or by storing the data as a separate object in your Global Environment.
Question 6
Currently, the weekly song rankings for the Billboard Top 100 in 2000 are in wide format, with each week in a separate column. The following code is supposed to reshape the wide-format billboard data into long format:
Error in `pivot_longer()`:
! `cols` must select at least one column.
What does this error message mean and how do you fix it?
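The original code chunk is not shown here, but one way to reproduce this error (an assumption for illustration) is to give cols a selection that matches no columns, since the week columns in billboard are named wk1 to wk76 rather than week1 to week76:

billboard %>%
  pivot_longer(cols = starts_with("week"), # matches no columns, hence the error
               names_to = "week",
               values_to = "rank")

Changing the selection to cols = starts_with("wk") (or cols = wk1:wk76) selects the week columns and the pivot works.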
Question 7
The following code is intended to calculate the mean height of all the characters in the built-in starwars dataset, grouped by their gender.
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
The code runs, but it’s giving us some weird warning and the output is also not as expected. What steps should we take to fix this?
Question 8
Following up on Question 7, we now have summary_data that looks approximately correct: it has the expected number of rows and columns, but the cell values are “weird”.
Can you explain what is happening here? And how can we modify the code to fix this?
Challenge yourself
If you want to challenge yourself and further apply the skills from Chapter 2, you can wrangle the data from dog_data_raw for additional questionnaires from either the pre- and/or post-intervention stages:
- Calculate the mean score for flourishing_post for each participant.
- Calculate the mean score for the PANAS (Positive and/or Negative Affect) per participant.
- Calculate the mean score for happiness (SHS) per participant.
The 3 steps are equivalent for those questionnaires - select, pivot, group_by and summarise; you just have to “replace” the questionnaire items involved.