2 Data-Wrangling: A Key Skill

2.1 Overview

One of the key skills in any researcher's toolbox is the ability to work with data. When you run an experiment you get lots of data in various files. For instance, it is not uncommon for experimental software to create a new file for every participant you run, with each participant's file containing numerous columns and rows of data, only some of which are important. Being able to wrangle that data - manipulating it into different layouts, extracting the parts you need, and summarising it - is one of the most important skills.

The next few chapters are aimed at refreshing and consolidating your skills in working with data. This chapter focuses on organizing data using the tidyverse package. Over the course of the activities, you will recap the main functions and how to use them, and we will use a number of real datasets to give you a wide range of exposure to what Psychology is about, and to reiterate that the same skills apply across different datasets. The skills don't change, just the data!

In the style of programming we teach, the most efficient format/layout of data is what is known as Tidy Data, and any data in this format is easily processed through the tidyverse package. You can read more about this type of data layout in this paper: Tidy Data (Wickham, 2014). It is a surprisingly good read.

However, the data you work with will not always be formatted in the most efficient way possible. If that happens then our first step is to put it into Tidy Data format. There are two fundamental principles defining Tidy Data:

  1. Each variable must have its own column.
  2. Each observation must have its own row.

Tidy Data (Wickham, 2014) adds the following principle:

  • Each type of observation unit forms a table.

And Grolemund and Wickham (2017) restate this third principle as:

  • Each value must have its own cell (i.e. no grouping two variables together, e.g. time/date in one cell).

A cell is where any specific row and column meet; a single data point in a tibble, for example, is a cell. The Grolemund and Wickham (2017) book is a very useful read and it is free, and browsing the chapter on Tidy Data will help you visualise how you want to arrange data. Try to keep the principles in mind whilst doing so.


There are some questions to answer as you go along to help build your skills: use the example code as a guide and check your answers against the solutions at the end of the chapter. Finally, remember to be proactive in your learning, work together as a community, and if you get stuck, Google what you are trying to do, use the cheatsheets, or use the Data Skills R Book. The key cheatsheet for this activity is the Data Transformation Cheatsheet with dplyr.

In this chapter you will recap on:

  • Data-Wrangling with the Wickham Six one-table verbs
  • Additional useful functions such as count, pivot_longer, and joins
  • Piping and making efficient code

Remember to open up the portfolio you created in Chapter 1 so you can add useful information to it as you work through the tasks. Summarising the information we give in this chapter in your own words is also a great way to learn! You don't have to read all of the resources we link, but they might help to explain parts further from time to time.

For instance, do you remember how to get help on an R function in RStudio? In your Console window, you can call the help function (e.g. ?mutate) to view the reference page for each function. This example shows how to get help on the mutate() function within dplyr, which we will use in later labs.
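Both of the lines below, typed in the Console, open the same reference page; the second form lets you name the package explicitly, which helps if several loaded packages have functions with the same name:

?mutate
help("mutate", package = "dplyr")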

2.2 Data Wrangling Basics

2.2.1 Revisiting the Wickham Six

The main way we teach data-wrangling skills is by using the Wickham Six one-table verbs. These are part of the tidyverse package which we introduced to you in the first PsyTeachR book, and more specifically from the dplyr package that is contained within the tidyverse. These six verbs are often referred to as the Wickham Six "one-table" dplyr verbs as they perform actions on a single table of data.

We will look at some of the basics again here, but try to look back at the exercises in the Data Skills book to see how we used these verbs (functions) previously.

The Wickham Six are:

Function Description
select() Include or exclude certain variables (columns)
filter() Include or exclude certain observations/data (rows)
mutate() Creates new variables (columns)
arrange() Changes the order of observations (rows)
group_by() Organises the observations (rows) into groups
summarise() Creates summary variables for groups of observations (rows)

You will use the Wickham Six very frequently for wrangling your data so this would definitely be something you should be making notes about - not just the names, but how they work and any particular nuances that you spot. Perhaps recreate the above table and add your own examples.

2.2.2 Learning to Wrangle

Today we are going to be using data from this paper: Witt et al. (2017), "Is there a chastity belt on perception?". The main research question asks: Does your ability to perform an action influence your perception? For instance, does your ability to hit a tennis ball influence how fast you perceive the ball to be moving? Or, to phrase it another way, do expert tennis players perceive the tennis ball moving slower than novice tennis players?

This experiment does not use tennis players, however; instead it used the Pong task: "a computerised game in which participants aim to block moving balls with various sizes of paddles" - a bit like a classic retro arcade game. Participants tend to estimate the balls as moving faster when they have to block them with a smaller paddle as opposed to when they have a bigger paddle. You can read the paper for more details if you wish, but hopefully that gives enough of an idea to help you understand the wrangling we will do on the data. We have cleaned up the data a little to start with. Let's begin!

  1. Download the data as a zip file from this link and save it somewhere you have access. In the lab, use your M: drive.

  2. Set your working directory to the same folder as the data. Session >> Set Working Directory >> Choose Directory

  3. Open a new script, then copy and paste the two lines below, which load the tidyverse library into your session and load in the data through the read_csv() function, storing it in a tibble called pong_data.

library("tidyverse")
pong_data <- read_csv("PongBlueRedBack 1-16 Codebook.csv")

DO NOT install packages in the Boyd Orr labs; they are already there and just need to be called in through library().

However, if you are using your own computer and you haven't installed the tidyverse package before, you will have to install it first (install.packages("tidyverse")).

If you have already installed tidyverse, but it was a long time ago, it might be worth running some updates on the packages as you may have an old version that works differently. The easiest way to do this is through RStudio using the menu at the top - Tools >> Check for Package Updates. You can update packages individually or just run all updates. It tends to be better to update all packages, as many of them are linked.

The three most common mistakes we see are:

  1. Make sure you have spelt the data file name exactly as it is shown - spaces and everything. Do not change the name of the .csv file; fix your code instead. The reason is that if you have a different name for your file than someone else, your code is not reproducible. We would say avoid using spaces in filenames you create, but as this one was created by another researcher and already has them, we will leave it as is and work with them.
  2. Remember when uploading data we use read_csv which has an underscore, whereas the data file itself will have a dot in its name, filename.csv.
  3. Check that the datafile is actually in the folder you have set as your working directory.
  4. Let's have a look at pong_data and see how it is organised. Type View(pong_data) or glimpse(pong_data) in your Console window (capital V and little g).

In the dataset, you will see that each row (observation) represents one trial per participant and that there were 288 trials for each of the 16 participants. The columns (variables) we have in the dataset are as follows:

Variable Type Description
Participant integer participant number
JudgedSpeed integer speed judgement (1 = fast, 0 = slow)
PaddleLength integer paddle length (pixels)
BallSpeed integer ball speed (2 pixels/4ms)
TrialNumber integer trial number
BackgroundColor character background display colour
HitOrMiss integer hit ball = 1, missed ball = 0
BlockNumber integer block number (out of 12 blocks)

We will use this data to master our skills of the Wickham Six verbs, taking each verb in turn and looking at it briefly. You should develop your skills by setting yourself new challenges based on the ones we set. There are 6 verbs to work through and then after that we will briefly recap on two other functions before finishing with a quick look at pipes. Try everything out and let us know anything you can't quite get.

Data in research methods is stored in two-dimensional tables; called data-frames, tables, or tibbles. There are other ways of storing data that you will discover in time, but mainly we will be using tibbles (if you would like more info, type vignette("tibble") in the Console window). A tibble is a table of data with columns and rows of information, and within a tibble you can have different types of data, i.e. double, integer, and character.

Type of Data Description
Double Numbers including decimals (e.g. 3.14)
Integer Numbers without decimals (e.g. 3)
Character Tends to contain letters or be words, but can be numbers (e.g. “AB12”, “Data Rocks!”)
  • Note: Double and Integer can both be referred to as Numeric data, and you will see this word from time to time. For clarity, we will use Double as a term for any number with a decimal (e.g., 3.14) and Integer as a term for any whole number (e.g., 3).
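If you ever want to check what type of data a value is, base R's typeof() function will tell you. A quick sketch:

typeof(3.14)    # "double"
typeof(3L)      # "integer" - the L suffix forces an integer in R
typeof("AB12")  # "character"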

2.2.3 select() Function - keep only specific columns

The select() function lets us pick out the variables within a dataset that we want to work with. For example, say in pong_data we wanted to only keep the columns Participant, JudgedSpeed, PaddleLength, BallSpeed, TrialNumber, and HitOrMiss, but we don't need BackgroundColor or BlockNumber.

We can do this in two ways:

  1. Tell the function what variables to include
select(pong_data, Participant, JudgedSpeed, PaddleLength, BallSpeed, TrialNumber, HitOrMiss)
  2. Tell the function what variables to exclude through the -ColumnName approach (i.e., minus the ColumnName)
select(pong_data, -BackgroundColor, -BlockNumber)

In this second example, -BackgroundColor means 'not BackgroundColor', so here you are saying all columns except BackgroundColor and BlockNumber. The minus sign is the crucial part!

Task 1: Using the select() function

  1. Either by inclusion or exclusion, select only the columns Participant, PaddleLength, TrialNumber, BackgroundColor and HitOrMiss from pong_data.

Did you know select() can also be used to reorder columns?

  2. Use select() to keep only the columns Participant, JudgedSpeed, BallSpeed, TrialNumber, and HitOrMiss, but have them display in alphabetical order, left to right.

Hints:

  • Have you remembered to include the dataset pong_data? Pay attention to upper/lower case letters and spelling!

  • Think about how you first entered the column names as they appeared. But what happens if you change the order that you enter the column names?
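To see the reordering idea in action (a minimal sketch, not the task answer), compare these two lines, which keep the same two columns but display them in different orders:

select(pong_data, TrialNumber, Participant)  # TrialNumber displayed first
select(pong_data, Participant, TrialNumber)  # Participant displayed first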


2.2.4 arrange() Function - sort and arrange rows

The arrange() function sorts the rows in the tibble according to what column you tell it to sort by.

Arrange the data by one column, e.g., by BallSpeed:

arrange(pong_data, BallSpeed)

Arrange the data by multiple columns, e.g., by BallSpeed (fastest first) and BackgroundColor:

arrange(pong_data, desc(BallSpeed), BackgroundColor)
  • desc() is how to sort by largest to smallest, i.e., descending order.
  • Compare the output of the two lines above on the BallSpeed column.
  • Does desc() also work for BackgroundColor?

Task 2: Arranging Data with the arrange() function

  1. Arrange the data in pong_data by two variables: HitOrMiss (putting hits - 1 - first), and JudgedSpeed (fast judgement - 1 - first).

2.2.5 filter() Function - keep only parts of the data

The filter() function lets us parse out a subset of the data, meaning we keep only parts of the data.

For example, we might want to only keep the red BackgroundColor

filter(pong_data, BackgroundColor == "red")

or only keep BallSpeed above 4 pixels

filter(pong_data, BallSpeed > 4)

or only keep trials that match both the red BackgroundColor and a BallSpeed above 4 pixels. Any trial that does not have a red background, or has a ball speed of 4 pixels or slower, will be removed.

filter(pong_data, BackgroundColor == "red", BallSpeed > 4)

This last example can also be written as follows. Two arguments or requirements separated by a comma is equivalent to an & (ampersand - meaning "and"):

filter(pong_data, BackgroundColor == "red" & BallSpeed > 4)

Or say you want to keep specific Participant IDs. Say we want just the data from Participants 1, 3, 10, 14 and 16. We would write it as follows.

  • The %in% operator is called group membership and means keep each of these Participants
  • The c() creates a little container of items called a vector.
filter(pong_data, Participant %in% c("1", "3", "10", "14", "16")) 

The == operator compares the first value of one vector with the first value of another vector, whereas the %in% operator compares the first value of the first vector with all values of the second vector.

For example, let’s create the following vectors:

x <- c(4, 5, 6)

y <- c(6, 5, 4)

Here is what happens if we use the %in% operator to compare the values in vector x and y:

x %in% y

[1] TRUE TRUE TRUE

Hooray! All the values in vector x were found to exist in vector y - as shown by the word 'TRUE' for each corresponding number in vector x. But what happens if we use the == operator instead?

x == y

[1] FALSE TRUE FALSE

Oh no! The first and last values in vector x were not found to exist in vector y, because the == operator directly compared the first number in vector x (4) to the first number in vector y (6), and so on. Only the middle number (5) was found to exist in both vectors, because it is positioned in the same location in vectors x and y.

So, if you want to keep rows where the value equals any of the ones you specify, use the %in% operator!

And finally, say you wanted to keep all Participants except Participant 7:

  • You can read != (exclamation mark followed by equals) as 'does not equal'. So Participant != "7" means keep all Participants where the values in the Participant column are not 7.
  • More generally, the exclamation mark negates the logical expression that follows it.
filter(pong_data, Participant != "7")

Task 3: Using the filter() Function

  1. Use filter() to extract all Participants that had a fast speed judgement, for speeds 2, 4, 5, and 7, but missed the ball. Store this remaining data in a variable called pong_fast_miss.

There are three parts to this filter so it is best to think about them individually and then combine them.

  1. Filter all fast speed judgements (JudgedSpeed)

  2. Filter for the speeds 2, 4, 5 and 7 (BallSpeed)

  3. Filter for all Misses (HitOrMiss)

You could do this in three filters, where each one uses the output of the preceding one, or remember that a filter function can take more than one argument - see the example above. Also, because you are testing whether JudgedSpeed and HitOrMiss equal certain values, you will need == and not just =.


Common mistakes with filter()

The filter function is very useful, but if used wrongly it can give you very misleading findings. This is why it is very important to always check your data after you perform an action. Let's say you are working in comparative psychology and have run a study looking at how cats, dogs, and horses perceive emotion. The data is all stored in the tibble animal_data, and there is a column called animals that tells you what type of animal your participant was. Something like this:

Participant ID animals Perceived Emotion Accuracy (%)
1 dog 80
2 dog 90
3 cat 10
4 dog 85
5 horse 100
6 cat 6

Ok, so imagine you wanted all the data from just cats:

filter(animal_data, animals == "cat")

Exactly! But what if you wanted cats and dogs?

filter(animal_data, animals == "cat", animals == "dog")

Right? Wrong! This actually says "give me everything that is a cat and a dog". But nothing is a cat and a dog, that would be weird - like a dat or a cog! What you actually want is everything that is either a cat or a dog, which is stated as:

filter(animal_data, animals == "cat" | animals == "dog")

The vertical line | is the symbol for or, just as & is the symbol for and.
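As an aside, an equivalent and often tidier way to write the cat-or-dog filter is with the %in% operator you met earlier:

filter(animal_data, animals %in% c("cat", "dog"))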

So, always pay attention to what you want and most importantly to what your code produces.

2.2.6 mutate() Function - add new columns

The mutate() function lets us create a new variable in our dataset. For example, let's add a new column to pong_data in which the background color is represented by numbers, where red will be represented as 1, and blue will be represented as 2.

pong_data <- mutate(pong_data, 
                    BackgroundColorNumeric = recode(BackgroundColor, 
                                                    "red" = 1, 
                                                    "blue" = 2))

Let's look at the code in detail:

  • BackgroundColorNumeric is the name of the new column you are adding to the tibble.
  • BackgroundColor is the name of the original column in the tibble and the one to take information from.
  • 1 and 2 are the new codings of red and blue, respectively.

The mutate() function is also handy for making some calculations on or across columns in your data. For example, say you realise you made a mistake in your experiment where your participant numbers should be 1 higher for every participant, i.e. Participant 1 should actually be numbered as Participant 2, etc. You would do something like:

mutate(pong_data, Participant = Participant + 1)

Note: The "new column" has the same name as the old column, i.e. Participant. In the resulting table, the Participant column will have the new values which will differ from the values in the original pong_data table. While it may seem like you have overwritten these values, in reality you have created a copy of the table with altered values, but you have not lost anything: the original values are still there in pong_data because you didn't store (assign) this action to pong_data. You didn't save the change basically.

In general, it is good practice not to overwrite pong_data with a new version of pong_data, but to store the altered table in a new tibble, e.g., pong_data_mutated, like this:

pong_data_mutated <- mutate(pong_data, Participant = Participant + 1)

For some research questions, you may need to create a new variable that is the mean of other variables in the dataset. For example, if you have several items in a questionnaire that all measure a specific construct, you may want to create the mean of all these items as a way to represent that construct. Using example data, we will show you one way to accomplish this in R with the mutate() function.

Imagine that you have a questionnaire with 4 items measuring test anxiety. The first 3 rows in the dataset might look like this:

Participant_ID test_anxiety1 test_anxiety2 test_anxiety3 test_anxiety4
1 7 7 6 7
2 4 6 4 1
3 5 1 1 2

The following code could be used to calculate a mean test anxiety score for each participant in the dataset:

data_new <- mutate(data, mean = (test_anxiety1 + test_anxiety2 + test_anxiety3 + test_anxiety4)/4)

This line of code would add a new column, mean, to the end of the original dataset. The mean column would contain the mean test anxiety score per participant, as that was the calculation you provided.

Participant_ID test_anxiety1 test_anxiety2 test_anxiety3 test_anxiety4 mean
1 7 7 6 7 6.75
2 4 6 4 1 3.75
3 5 1 1 2 2.25

Likewise, if you wanted to calculate the total test anxiety score for each participant in the dataset, you would just need to amend the calculation accordingly (i.e., remove the division).

data_new <- mutate(data, total = test_anxiety1 + test_anxiety2 + test_anxiety3 + test_anxiety4)

Participant_ID test_anxiety1 test_anxiety2 test_anxiety3 test_anxiety4 total
1 7 7 6 7 27
2 4 6 4 1 15
3 5 1 1 2 9
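With many questionnaire items, typing every column name into the calculation gets tedious. One alternative sketch, using the same hypothetical data, is dplyr's rowwise(), which makes mutate() work row by row:

data_rowwise <- rowwise(data)  # treat each row as its own group
data_new <- mutate(data_rowwise,
                   total = sum(c(test_anxiety1, test_anxiety2,
                                 test_anxiety3, test_anxiety4)))
data_new <- ungroup(data_new)  # rowwise() adds a row grouping, so remove it when done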

Task 4: Mutating variables with mutate()

  • You realise another mistake in that all your trial numbers are wrong. The first trial (trial number 1) was a practice so should be excluded and your experiment actually started on trial 2. Tidy this up by:
  1. Creating a new tibble called pong_data_filt and in it store data from pong_data after filtering out all trials with the number 1 (TrialNumber column).
  2. Now use the mutate() function to renumber all the remaining trial numbers, in pong_data_filt, starting them at 1 again instead of 2. Store this output in a new tibble called pong_data2.

Step 1:

  • filter(TrialNumber does not equal 1).
  • 'does not equal' is written as an exclamation mark followed by equals (!=).
  • remember to store this output in a tibble called pong_data_filt

Step 2:

  • mutate(TrialNumber = TrialNumber minus 1)

2.2.7 group_by() Function - group parts of data together

The group_by() function groups the rows in a dataset according to a category you specify, e.g., in the animals example above, grouping all cat data together, all dog data together, and all horse data together.

Looking at the data within pong_data2, say you wanted to eventually create means, etc., for the different background color conditions, you would start by grouping trials by BackgroundColor, grouping the data into red background data and blue background data:

group_by(pong_data2, BackgroundColor)

Or you can add numerous grouping variables depending on how you want to split up the data. Here we group by hit or miss (in the HitOrMiss column) and background color (red or blue). This gives four groups (i.e., Hit Red, Miss Red, Hit Blue, Miss Blue):

group_by(pong_data2, HitOrMiss, BackgroundColor)

Note: Nothing actually appears to change in the data, unlike with the other functions, but a big operation has taken place. Look at the output in your console when you run group_by(pong_data2, BackgroundColor). At the top of the output notice that the 2nd line of the output tells us the grouping criteria and how many groups now exist: see the line Groups: BackgroundColor [2]: we grouped by BackgroundColor and there are [2] groups - one for red and one for blue.
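For example, assuming pong_data2 holds the 4,592 trials left after removing the practice trial in Task 4 (16 participants x 287 trials), the top of the grouped output should look something like this:

## # A tibble: 4,592 x 8
## # Groups:   BackgroundColor [2]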

Task 5: Grouping Data with group_by()

  1. Group the data by BlockNumber and by BackgroundColor, in that order, and then enter the number of groups (i.e., a number) you get as a result here:

It is the same procedure as this but with different column names:

group_by(pong_data2, HitOrMiss, BackgroundColor)

The number of groups should be the product (i.e., multiplication) of the number of background colors (2: red and blue) and the number of blocks (12).


group_by() is incredibly useful as, once the data is organised into groups, you can then apply other functions (filter, arrange, mutate,...) to the groups within your data that you are interested in, instead of to the entire dataset. For instance, a common second step after group_by might be to summarise the data.

Good to know: The ungroup() function

The ungroup() function undoes the action of the group_by() function.

After grouping data together using the group_by() function and then performing a task on it, e.g., filter(), summarise(), it can be very good practice to ungroup the data before performing another function. Forgetting to ungroup the dataset won't always affect further processing, but sometimes it can really mess up other things.

Again, this is just a good reminder to always check that the data you are getting out of a function a) makes sense and b) is what you expect.
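As a minimal sketch of that practice, using the pong data:

pong_grouped <- group_by(pong_data2, BackgroundColor, PaddleLength)
pong_tally <- summarise(pong_grouped, n_trials = n())
pong_tally <- ungroup(pong_tally)  # summarise() leaves the output grouped by BackgroundColor; this clears it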

2.2.8 summarise() Function - do some calculations on the data

The summarise() function lets you calculate some descriptive statistics on your data. For example, say you want to count the number of hits there were for different paddle lengths, or the number of hits when the background color was red or blue.

First we group the data accordingly, storing it in pong_data2_group

pong_data2_group <- group_by(pong_data2, BackgroundColor, PaddleLength)

Then we summarise it, storing the answer in total_hits

pong_data2_hits <- summarise(pong_data2_group, total_hits = sum(HitOrMiss))

Finally, for fun we can filter just the red, small paddle hits from the summarised data.

pong_data2_hits_red_small <- filter(pong_data2_hits, BackgroundColor == "red", PaddleLength == 50)
## `summarise()` has grouped output by 'BackgroundColor'. You can override using
## the `.groups` argument.

(This message comes from the summarise() call above - it is not an error, just dplyr telling you how the output is still grouped.)

Which would leave us with:

Table 2.1: Summarising with group_by() and summarise()
BackgroundColor PaddleLength total_hits
red 50 516


Tip: The name of the column within pong_data2_hits_red_small that has the summarised data is total_hits; this is what you called it when creating pong_data2_hits. You could have called it anything you wanted, but always try to use something sensible. Make sure to call your variables something you (and anyone looking at your code) will understand and recognize later (i.e., not variable1, variable2, variable3, etc.), and avoid spaces (use_underscores_never_spaces).

summarise() has a range of internal functions that make life really easy, e.g., mean(), median(), n(), sum(), max(), min(), etc. Some common ones are shown below, but see the dplyr cheatsheets for more examples.

function example purpose
n() N = n() number of values
mean() m = mean(X) mean of values in stated column - e.g. column X
median() mdn = median(X) median of values in stated column - e.g. column X
sum() sm = sum(X) sum of values in stated column - e.g. column X
max() mx = max(X) max of values in stated column - e.g. column X
min() mn = min(X) min of values in stated column - e.g. column X
sd() stdev = sd(X) standard deviation of values in stated column - e.g. column X
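Several of these can be combined in one summarise() call. A quick sketch using the grouped data from above:

summarise(pong_data2_group,
          n = n(),                     # number of trials per group
          total_hits = sum(HitOrMiss), # total hits per group
          mean_hits = mean(HitOrMiss)) # proportion of trials that were hits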

Task 6: Summarising Data with summarise()

  1. Use the lines of code above to calculate the mean number of hits made with the small paddle (50) and the red color background. Enter that value in this box to two decimal places (e.g., 0.12):

Quickfire Questions

  • Which of the Wickham Six would you use to sort columns from smallest to largest:

  • Which of the Wickham Six would you use to calculate the mean of a column:

  • Which of the Wickham Six would you use to remove certain observations - e.g. remove all males:

2.2.9 Other Useful Functions: bind_rows() and count()

The Wickham Six verbs let you do a lot of things with data however there are thousands of other functions at your disposal. If you want to do something with your data that you are not sure how to do using these functions, do a Google search for an alternative function - chances are someone else has had the same problem and has a help guide.

Two useful functions are the bind_rows() function and the count() function. We will briefly show you these here.

Binding rows with bind_rows()

The bind_rows() function is useful if you want to combine two tibbles that have the same column structure into one larger tibble, i.e. they have exactly the same columns and you want to combine them by attaching one to the bottom of the other:

Say you had one tibble with data for ball speeds 1 and 2:

slow_ball <- filter(pong_data2, BallSpeed < 3) 

And another tibble with data for ball speeds 6 and 7:

fast_ball <- filter(pong_data2, BallSpeed >= 6) 

Now combine those tibbles together into one big tibble containing these extreme ball speeds:

extreme_balls <- bind_rows(slow_ball, fast_ball) 

Counting with the count() function

The count() function is a shortcut that can sometimes be used to count up the number of rows you have for groups in your data, without having to use the group_by() and summarise() functions. It is basically a tally: it doesn't sum up values, it just counts how many observations you have.

For example, in Task 6 we combined group_by() and summarise() to calculate how many hits there were based on background color and paddle length. Alternatively, we could have done:

count(pong_data2, BackgroundColor, PaddleLength, HitOrMiss)

The results match, except that in the count() version we get all the information, including misses, because we are just counting rows, whereas in the summarise() method we only got hits because that was what we summed. So two different methods give similar answers.

2.2.10 Pipes (%>%) - make your code efficient

By now you'll have noticed that tidyverse functions generally take the following grammatical structure (called syntax): function_name(dataset, arg1, arg2,..., argN), where the dataset is the entire tibble of data you are using and each argument (arg) is some operation on a particular column or variable, or the column name you want to work with. For example:

filter(pong_data2, PaddleLength == "50", BallSpeed > 4)
group_by(pong_data2, BallSpeed, Participant)

Both of these examples follow the structure of function_name(dataset, arg1, arg2, ....)

In the first example, we are filtering (function) the whole pong_data2 dataset by a particular paddle length, then by particular speeds (arguments). In the second, we are grouping by BallSpeed and then by Participant. Note that the order of arguments is specific as it performs argument1 then argument2, etc. Changing the order of arguments may give a different output. So the order you work in is important, and this is called your pipeline. For example, here is a pipeline we used above to find how many hits there were with the small paddle length and the red background.

  • First we group the data accordingly, storing it in pong_data2_group
  • Then we summarise it, storing the answer in total_hits
  • And finally filter just the red, small paddle hits
pong_data2_group <- group_by(pong_data2, BackgroundColor, PaddleLength)
pong_data2_hits <- summarise(pong_data2_group, total_hits = sum(HitOrMiss))
pong_data2_hits_red_small <- filter(pong_data2_hits, BackgroundColor == "red", PaddleLength == 50)

Pipelines allow us to quickly and reproducibly perform an action that would take much longer manually. However, we can make our code even more efficient, using less code, by stringing our sequence of functions together using 'pipes', written as %>%. Changing the above code into one using pipes would give us:

pong_data_hits_red_small <- pong_data2 %>% 
  group_by(BackgroundColor, PaddleLength) %>% 
  summarise(total_hits = sum(HitOrMiss)) %>%
  filter(BackgroundColor == "red", PaddleLength == 50)

Both these chunks show exactly the same procedure, but adding pipes can make code easier to read once you understand piping.

Compare code without a pipe:

  • function_name(dataset, arg1, arg2,...,argN)

to code with a pipe:

  • dataset %>% function_name(arg1, arg2,...,argN)

The premise is that you can pipe (%>%) between functions when the input of a function is the output of the previous function. Alternatively, you can use a pipe to put the data into the first function, as shown directly above.

You can think of the pipe (%>%) as saying 'and then' or 'goes into', e.g., the data goes into this function and then into this function and then into this function.

One last point on pipes is that they can be written in a single line of code but it's much easier to see what the pipe is doing if each function takes its own line. Every time you add a function to the pipeline, remember to add a %>% first and note that when using separate lines for each function, the %>% must appear at the end of the line and not the start of the next line. Compare the two examples below. The first won't work, but the second will because the second puts the pipes at the end of the line where they need to be!

  • Example 1: won't work because the pipes (%>%) are in the wrong place.
data_arrange <- pong_data2 
                %>% filter(PaddleLength == "50")
                %>% arrange(BallSpeed) 
  • Example 2: will work because the pipes (%>%) are in the correct place.
data_arrange <- pong_data2 %>%
                filter(PaddleLength == "50") %>%
                arrange(BallSpeed) 

Where piping becomes most useful is when we string a series of functions together, rather than using them as separate steps and having to save the data each time under a new tibble name and getting ourselves all confused. In the non-piped version we have to create a new tibble each time, for example, data, data_filtered, data_arranged, data_grouped, data_summarised just to get to the final one we actually want, which was data_summarised. This creates a lot of tibbles in our environment and can make everything unclear and eventually slow down our computer. The piped version however uses one tibble name, saving space in the environment, and is clear and easy to read. With pipes, we skip unnecessary steps and avoid cluttering our environment.


Quickfire Questions

  • What does this line of code say? data %>% filter() %>% group_by() %>% summarise():

2.3 Data Wrangling Application

We looked at a series of functions known as the Wickham Six one-table verbs: filter, arrange, group_by, select, mutate and summarise. Now we will focus on working with data across two or more tables, using functions you will have come across in the Data Skills book. The two main functions we will add to the Wickham Six are pivot_longer() and inner_join().

  • pivot_longer() allows us to transform a table from wide format to long format.

We use Tidy Data because it is really efficient and works well with the tidyverse. However, data has traditionally been structured in either long format or wide format, and you will still come across both.

Long format is where each row is a single observation, typically a single trial in an experiment or a response to a single item on a questionnaire. When you have multiple trials per participant, you will have multiple rows for the same participant. To identify participants, you would need a variable with some kind of participant id, which can be as simple as a distinct integer value for each participant. In addition to the participant identifier, you would have any measurements taken during each observation (e.g., response time) and what experimental condition the observation was taken under.

In wide format data, each row corresponds to a single participant, with multiple observations for that participant spread across columns. So for instance, with survey data, you would have a separate column for each survey question.

Tidy data is a mix of both of these approaches, and most functions in the tidyverse assume the tidy format, so typically the first thing you need to do when you get data, particularly wide-format data, is to reshape it through wrangling. This is why we teach these really important skills.
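To make the distinction concrete, here is a toy sketch with hypothetical data: a wide-format table with one row per participant, reshaped so that each row becomes a single participant-question observation. (The pivot_longer() function is covered properly in Task 5 below.)

# hypothetical wide data: one row per participant, one column per question
survey_wide <- tibble(id = c(1, 2),
                      Q1 = c(5, 3),
                      Q2 = c(4, 2))

# long format: one row per observation (participant by question)
survey_long <- pivot_longer(survey_wide,
                            cols = Q1:Q2,
                            names_to = "Question",
                            values_to = "Response")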

  • inner_join() allows us to combine two tables together based on common columns.

Analysing the Autism Spectrum Quotient (AQ)

To continue building your data-wrangling skills we will recap on skills from the Data Skills book by tidying up data from the Autism Spectrum Quotient (AQ) questionnaire. If you have completed the Data Skills book then you may be familiar with the AQ10: a non-diagnostic short form of the AQ with only 10 questions per participant. It is a discrete scale, and the higher a participant scores on the AQ10, the more autistic-like traits they are said to display. Anyone scoring 7 or above is recommended for further diagnosis. You can see an example of the AQ10 through this link: AQ10 Example.

We have four data files to work with:

  • responses.csv containing the AQ survey responses to each of the 10 questions for our 66 participants
  • qformats.csv containing information on how a question should be coded, i.e. forward or reverse coded
  • scoring.csv containing information on how many points a specific response should get; depending on whether it is forward or reverse coded
  • pinfo.csv containing participant information such as Age, Sex and importantly ID number.

Click here to download the files as a zip file. Now unzip the files into a folder you have access to.

csv stands for ‘comma separated values’ and is a very basic format for storing data in a plain text file. It really just stores numbers and text separated by commas and nothing else. The great thing about being this basic is that it can be read by many different systems and is non-proprietary, i.e., you don’t need to purchase commercial software to open it.


Now set your working directory to the folder where you saved the .csv files. Do this through the dropdown menus at the top toolbar: Session >> Set Working Directory >> Choose Directory and then find your folder with your .csv files.

Today we will work in an RScript instead of .Rmd, but if you want to turn it into an R Markdown report or add elements to your Portfolio then please feel free.

Thinking Cap Point

Now would be a good time to make sure that you are all using RStudio effectively and know what each window does.

  • TRUE or FALSE, the Console is best for practice and the Script window is for saving:
  • TRUE or FALSE, the Environment holds all the data and objects you have loaded in and created:
  • TRUE or FALSE, clicking the name of a table in the Environment window will open it in the Script window:

The answer to all of these are True.

  1. The Script window is where you should write code and comments that you are going to save and send to people. The Console is where you should practice stuff - nothing is saved here; it is like a sandbox that just gets wiped away.

  2. Any data you load in or create is held in the Environment (or Global Environment) window with the variable name that you gave it.

  3. By clicking the name of the table in the Environment window it will open up in the Script window and you can look at it to make sure it is what you expect. This only works for tables but not for other types of data. You will learn the difference as we go along!

2.3.1 Task 1: Open a Script

  1. Start a new RScript and save it in the same folder as your .csv files, calling the RScript something informative like AQ_DataWrangling.R.
  2. Make sure your environment is completely empty so we don't mix up one analysis with another. You can run the following line in the console to clear the environment, or click the little brush on your Environment window.
rm(list = ls()) 

Remember that when using a script you can write notes to yourself to remind you what a line of code does. Just put a hashtag at the start of the line and R will ignore that line. This is where you have to be clear on using a Script versus an R Markdown file: in a Script, # means the line is ignored; in Markdown, # sets the line as a header!

To run any line on a script, the simplest way is to click anywhere on that line and either press Run on the top of the script window or press CTRL+Enter on the keyboard (or mac equivalent).

2.3.2 Task 2: Bring in Your Library

  1. Add a line to your code that brings the tidyverse package into your working environment and run it.

Combine the function library() and the package tidyverse and remember that the solutions are at the end of the chapter.

On our lab machines in Psychology all the necessary packages will already be on the machines; they just need to be called into the library. If, however, you are using your own machine, you will have to install the packages first (and you only have to do this once).

Do not install packages on the Psychology machines! Why?

  1. They are already installed, and trying to install the same package again can cause it to stop working on our machines.
  2. They are already installed and it is a bit like using apps on your phone. Install is putting the app onto your phone, library is just opening the app. If you’ve already downloaded the app (package) then you just need to open it (library()) to use it!

2.3.3 Task 3: Load in the Data

Now we have to load in the .csv datafiles using the read_csv() function and save them as tibbles in our environment. For example, to load in the data in responses.csv and save it as the tibble responses we would type:

responses <- read_csv("responses.csv") 
  1. Add the following lines of code to your script and complete them to load in all four .csv datafiles. Use the above code as an example and name each tibble the same as its original filename (minus the .csv part), again as above, e.g. responses.csv gets saved as responses. Remember to run the lines so that the data is loaded in and stored in your environment.
responses <-  read_csv()    # survey responses
qformats <-                 # question formats
scoring <-                  # scoring info
pinfo <-                    # participant information

As you work with data and functions you will find there are functions with similar names, but that give different results. One of these is the read function for csv. Make sure to always use read_csv() as your function to load in csv files. Nothing else. It is part of the readr package automatically brought in with tidyverse.

There is a very similarly named function called read.csv(). DO NOT use this function. We will always expect you to use read_csv(). Although very similar in name they do not work the same way and create differences in your data.
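One visible difference, sketched with a hypothetical file whose header contains a space (e.g. a column called "Participant ID"):

d1 <- read.csv("some_file.csv")   # base R: the column name becomes Participant.ID
d2 <- read_csv("some_file.csv")   # readr: the column name stays as `Participant ID`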

2.3.4 Task 4: Review Your Data

Now that we have the data loaded in it is always best to have a look at the data to get an idea of its layout. We showed you one way before, by clicking on the name in the environment, but you can also use the glimpse() or View() functions in your Console window. Put the name of the data between the brackets to see how it is arranged. Don't add these to your script though - they are just one-offs for testing.

Have a look at the data in responses to see if you think it is Tidy or not, and answer the following question: the data in responses is in _______ format.

The responses tibble is far from being tidy; each row represents multiple observations from the same participant, i.e., each row shows responses to multiple questions - wide format.

2.3.5 Task 5: Gathering Data with pivot_longer()

In order to make it easier for us to get the AQ score for each participant, we need to change the layout of the responses tibble from its wide format to a long format using the pivot_longer() function.

  1. Copy the below code line to your script and run it.
rlong <- pivot_longer(responses,
                      cols = Q1:Q10,
                      names_to = "Question",
                      values_to = "Response")

Let's look at the code in detail:

  • The first argument given to the pivot_longer() function is the tibble which holds the data we want to wrangle, responses.
    • Remember we could have written this with a pipe as well, e.g. rlong <- responses %>% pivot_longer(...)
  • The second argument is the names of specific columns in the original tibble that we want to gather together, Q1:Q10 meaning all columns between Q1 and Q10.
    • You do not actually need to write cols = but it makes things clearer.
    • "Gathering" of columns is based on position in the tibble. If the order of columns in the tibble was Q1 then Q10, the above code would only gather those two columns. As it is, in our tibble, the order, is Q1, Q2, Q3, ... Q10, and therefore the code gathers all the columns between Q1 and Q10.
    • Column names are not put in quotes because they exist already in the tibble responses.
  • The third and fourth arguments are the names of the new columns we are creating;
    • The first will store the question numbers, Question. I.e. put the question names (names_to = ...) in a column called "Question".
    • The second will store the values/responses, Response. I.e. put the values/responses to the questions (values_to = ...) in a column called "Response".
    • These new column names are put in quotes because they do not already exist in the tibble. This is not always the case, but is the case for this function.
    • Note that these names could have been anything but by using these names the code makes more sense.
    • Lastly, you do need to write names_to = ... and values_to = ... otherwise the columns won't be created correctly.

In case you are wondering, if we wanted to go back the other way and spread out the data we just gathered, we would use the pivot_wider() function: e.g. rwide <- rlong %>% pivot_wider(names_from = Question, values_from = Response). But we do not want to do that here, so let's not add this to the code.

Quickfire Questions

Let's see if you understand pivot_longer(). Say you wanted to gather the first three columns of responses (Q1, Q2, Q3), put the question numbers in a column called Jam, the responses in a column called Strawberry, and store everything in a tibble called sandwich. Fill in the box with what you would write:

sandwich <- pivot_longer(responses, cols = Q1:Q3, names_to = "Jam", values_to = "Strawberry")

pivot_longer() wants the data first, then the columns to gather, then the name of the new column to store the gathered column names in, and finally the name of the new column to store the values in.

2.3.6 Task 6: Combining Data

So now that our responses data is in tidy format, we are closer to getting an AQ score for each person. However, we still need to add some information to it:

  • show if the question is reverse or forward scored - found in qformats
  • show the number of points to give a specific response - found in scoring.

This is a typical analysis situation where different information is stored in different tables and you need to join them together. Both these pieces of information are contained in qformats and scoring, respectively, and we want to join them to the data in rlong to create one informative tidy table with all the info.

We can do this sort of join through the function inner_join(); a function to combine information in two tibbles using a column (or columns) common to both tibbles.

  1. Copy the below line into your code and run it. This piece of code combines rows in the tibble rlong with rows in the tibble qformats, based on the common column "Question".
rlong2 <- inner_join(rlong, qformats, "Question")
  2. Now have a look in rlong2. We have matched each question with its scoring format, forward or reverse.

A lot of questionnaires have some questions that are Forward scored and some questions that are Reverse scored. What does this mean? Imagine a situation where your options in replying to a question are: 1 - extremely agree, 2 - agree, 3 - neutral, 4 - disagree, 5 - extremely disagree. In a forward-scoring question you would get 1 point for extremely agree, 2 for agree, 3 for neutral, etc. In a reverse scoring question, you would get 5 for extremely agree, 4 for agree, 3 for neutral, etc.

The reasoning behind this shift is that sometimes agreeing or disagreeing might be more favourable depending on how the question is worded. In addition, sometimes these questions are used just to catch people out - imagine if you had two similar questions where one has the reverse meaning of the other. In this scenario, people should give opposite responses. If they respond the same then they might not be paying attention.


Now we need to combine the information in our table, rlong2, with the scoring table so we know how many points to attribute each question based on the answer the participant gave, and whether the question was forward or reverse coded. Again, we use the inner_join() function, but this time the common columns found in rlong2 and scoring are QFormat and Response. To combine by two columns you just write them in sequence as shown below. Note: when there is more than one common column between two tibbles you are joining, you should combine by ALL the columns to avoid repeat columns in the new tibble. If you forget to do this, your new tibble will have names such as column_name.x and column_name.y. This will cause confusion so avoid it by combining by all common columns.

  1. Copy the below line into your code and run it. This code combines rows in rlong2 and scoring based on the common columns QFormat and Response.
rscores <- inner_join(rlong2, scoring, c("QFormat", "Response"))

2.3.7 Task 7: Calculating the AQ Scores

We have now created rscores which has information on how each participant responded to each question and how each question should be coded and scored, all within the one tibble. All we need now is to sum the scores for each participant to get their AQ score.

  1. Based on what you learned in the previous section, copy the below line into your code and complete it to obtain individual aq_scores for each participant.
  2. Save your script and run it all again from the start to make sure it works!
aq_scores <- rscores %>% 
             group_by() %>% # how will you group individual participants?
             summarise(AQ = sum()) # which column will you sum to obtain AQ scores?
  • Each participant could be grouped by their Id.
  • If we summed up the value for each Score we might get a full AQ score for each participant.

Yep, well spotted. Pipes are your friend. Think of them as saying ‘and then’ or ‘goes into’. So in the example above we take rscores and then group it by something and then summarise it into AQ scores based on…

In most cases, the pipe serves the purpose of putting the input into the function or taking the output of one function and treating it as the input of another function.

In the example above the first pipe takes rscores as the input for group_by, and the second pipe takes the output of group_by and puts it as the input to summarise. See how you can almost read it as a chain of actions or steps.


Quickfire Questions

The whole purpose of this chapter was to calculate AQ scores for individual participants. Try to answer the following questions, using code where possible to help you, based on your knowledge from this chapter. Remember the cheatsheets as well - look for the dplyr one!

  1. From the options, choose the correct citation for the AQ 10 question questionnaire:

  2. Complete the sentence, the higher the AQ score...

  3. Type in the AQ score (just the number) of Participant ID No. 87:

  4. Type how many participants had an AQ score of 3 (again just the number):

  5. The cut-off for the AQ10 is usually said to be around 6 meaning that anyone with a score of more than 6 should be referred for diagnostic assessment. Type in how many participants we should refer from our sample:

  1. From the link above you can see that an appropriate citation for the AQ10 would be (Allison, Auyeung, & Baron-Cohen, 2012).

  2. As mentioned, the higher the score on the AQ10 the more autistic-like traits a participant is said to show.

  3. You could do this by code with filter(aq_scores, Id == 87), which would give you a tibble of 1x2 showing the ID number and score. If you just wanted the score you could use pull() which we haven’t shown you yet but works as follows: filter(aq_scores, Id == 87) %>% pull(AQ). The answer is an AQ score of 2.

  4. Same as above but changing the argument of the filter. filter(aq_scores, AQ == 3) %>% count(). The answer is 13. Remember you can do this by counting but the code makes it reproducible every time.

  5. filter(aq_scores, AQ > 6) %>% count() or filter(aq_scores, AQ >= 7) %>% count(). The answer is 6.

2.3.8 Task 8: Practice using pipes

You now have complete code to load in your data, convert it to Tidy format, combine the tables, and calculate an AQ score for each participant. But if you look at it, some of your code could be made more efficient by using pipes.

Go back through your code and rewrite it using pipes %>% so that it is as efficient as possible.

At any point where the first argument of your function is the name of a variable created before that line, there is a good chance you could have used a pipe! Here are all the bits of this code that could be piped together into one chain:

rlong <- pivot_longer(responses, cols = Q1:Q10, names_to = "Question", values_to = "Response")

rlong2 <- inner_join(rlong, qformats, "Question")

rscores <- inner_join(rlong2, scoring, c("QFormat", "Response"))

aq_scores <- rscores %>% group_by(Id) %>% summarise(AQ = sum(Score))
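Piped together, those steps reduce to a single chain with no intermediate tibbles cluttering your environment (one possible answer):

aq_scores <- responses %>%
  pivot_longer(cols = Q1:Q10,
               names_to = "Question",
               values_to = "Response") %>%
  inner_join(qformats, "Question") %>%
  inner_join(scoring, c("QFormat", "Response")) %>%
  group_by(Id) %>%
  summarise(AQ = sum(Score))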


2.4 Practice Your Skills

In order to complete these tasks you will need to download the data .csv files and the .Rmd file, which you need to edit, titled Ch2_PracticeSkills_Template.Rmd. These can be downloaded within a zip file from the below link. Once downloaded and unzipped, you should create a new folder that you will use as your working directory; put the data files and the .Rmd file in that folder and set your working directory to that folder through the drop-down menus at the top. Download the Exercises .zip file from here.

Now open the .Rmd file within RStudio. You will see there is a code chunk for each task. Follow the instructions on what to edit in each code chunk. This will often be entering code based on what we have covered up until this point.

In this chapter we recapped on data-wrangling using the Wickham Six verbs, looked at additional functions such as pivot_longer() and inner_join(), and at piping chains of code for efficiency using %>%. You will need these skills to complete the following exercises, so make sure you have worked through this chapter before attempting them. The cheatsheets and the Data Skills R Book mentioned earlier are also useful resources to keep on hand.

2.4.1 The Ageing Brain

A key topic in current psychological research, and one which forms a main focus for some of the research in our School, is that of human ageing. In this research we use brain imaging techniques to understand how changes in brain function and structure relate to changes in perception and behaviour. A typical 'ageing' experiment will compare a measure (or a number of measures) such as performance on a cognitive or perceptual task between younger and older adults (i.e., a between-subjects design experiment).

However, in order to make sure we are studying 'healthy' ageing, we first have to 'screen' our older participants for symptoms of age-related dementia (Alzheimer's Disease), where cognitive function can be significantly impaired. We do this using a range of cognitive tests. Some studies will also test participants' sensory acuity (ability to perceive something), as a function of age (particularly eyesight and hearing).

The data you have downloaded for this exercise is example screening data taken from research investigating how the ageing brain processes different types of sounds. The tests used in this study are detailed below. Please note that the links are there to provide you with further information and examples of the tests once you have completed the assignment if you so wish; you do not have to read them to complete the exercises.

The Data Files

You have just downloaded the three .csv files containing all the data you need. Below is a list of the .csv file names and a description of the variables each contains:

  • p_screen.csv contains participants' demographic information including:
    • ID Participant Id number - for confidentiality (no names or other identifying info)
    • AGE in years
    • SEX M for male, F for female
    • HANDEDNESS L for left-handed, R for right-handed
    • EDUCATION in years
    • MUSICAL whether they have any musical abilities/experience (YES or NO)
    • FLANG speak any foreign languages (YES or NO)
    • MOCA Montreal Cognitive Assessment score
    • D-SPAN Working Memory Digit Span test score
    • D2 D2 Test of Attention score
  • QHC_responses.csv contains participants' responses to each question on the "Better Hearing Institute Quick Hearing Check (QHC)" questionnaire.
    • Column 1 represents participants' ID (matching up to that in p_screen.csv).
    • Each column thereafter represents the 15 questions from the questionnaire.
    • Each row represents a participant and their response to each question.
  • QHC_scoring.csv contains the scoring key for each question of the QHC, with the columns:
    • RESPONSE the types of responses participants could give (STRONGLY DISAGREE, SLIGHTLY DISAGREE, NEUTRAL, SLIGHTLY AGREE, STRONGLY AGREE)
    • SCORE the points awarded for each response type (from 0 to 4). A score for each participant can be calculated by converting their categorical responses to values and summing the values.

Before starting, let's check:

  1. The .csv files are saved into a folder on your computer and you have manually set this folder as your working directory.

  2. The .Rmd file is saved in the same folder as the .csv files.
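If you need to check or set your working directory from the Console, here is a minimal sketch (the folder path below is hypothetical - replace it with the folder where you saved your files):

getwd() # print the current working directory

# hypothetical path - swap in the folder containing your .csv and .Rmd files
setwd("C:/Users/me/Documents/PsychLabs/Chapter2")

You can also do this through the RStudio menus: Session > Set Working Directory > Choose Directory.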

2.4.2 Load in the data

  1. You will see a code chunk called libraries, similar to the one below, at the top of your .Rmd assignment file. It is all set up to attach the tidyverse via library() and to read in the data for you. Run this code chunk now to bring in the data and the tidyverse. You can do this in the Console, in a script, or directly in the code chunk by clicking the small green play symbol in its top right.
library("tidyverse")

screening <- read_csv("p_screen.csv")
responses <- read_csv("QHC_responses.csv")
scoring <- read_csv("QHC_scoring.csv")

View the data

  1. It is always a good idea to familiarise yourself with the layout of the data that you have just loaded in. You can do this by using glimpse() or View() in the Console window, but you must never put these functions in your assignment file.
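For example, to get a quick overview of the screening data you could run the following in the Console (not in the .Rmd):

glimpse(screening) # shows each column's name, type, and first few values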

The Tasks:

Now that we have the data loaded and the tidyverse attached, and have viewed our data, you should try to complete the following nine tasks. You may want to practice them first to get the correct code and format, and to make sure they work; you can do this in the console or in a script. Once you have the correct code, edit the necessary parts of the assignment .Rmd file to produce a reproducible .Rmd file. This is what you will do from now on for all other assessment files, so practicing now will really help. In short, go through the tasks, replace only the NULL with what each question asks for, and then make sure that the file knits at the end so that you have fully reproducible code.

2.4.3 Task 1 - Oldest Participant

Replace the NULL in the T1 code chunk with the Participant ID of the oldest participant. Store this single value in oldest_participant (e.g. oldest_participant <- 999).

  • hint: look at your data, who is oldest?
oldest_participant <- NULL

2.4.4 Task 2 - Arranging D-SPAN

Replace the NULL in the T2 code chunk with code that arranges participants' D-SPAN performance from highest to lowest using the appropriate one-table dplyr (i.e., Wickham) verb. Store the output in cogtest_sort. (e.g. cogtest_sort <- verb(data, argument))

  • hint: arrange your screening data
cogtest_sort <- NULL

2.4.5 Task 3 - Foreign Language Speakers

Replace the NULL in each of the two lines of code chunk T3 so that descriptives has a column called n, showing the number of participants that do and do not speak a foreign language, and another column called median_age, showing the median age of those two groups. If you have done this correctly, descriptives should have 3 columns and 2 rows of data, not including the header row.

  • hint: First need to group_by() foreign language
  • hint: Second need to summarise(). You will need the n() function. Pay attention to specific column names given.
screen_groups <- NULL
descriptives <- NULL

2.4.6 Task 4 - Creating Percentage MOCA scores

Replace the NULL in the T4 code chunk with code using one of the dplyr verbs to add a new column called MOCA_Perc to the dataframe screening. This new column should contain the MOCA scores converted to percentages. The maximum achievable score on the MOCA is 30, and percentages are calculated as (participant score / max score) * 100; for example, a MOCA score of 27 would become (27 / 30) * 100 = 90. Store this output in screening.

  • hint: mutate() something using MOCA and the percentage formula
screening <- NULL

2.4.7 Task 5 - Remove the MOCA column

Now that we have our MOCA scores expressed as percentages in MOCA_Perc, we no longer need the raw scores held in MOCA. Replace the NULL in the T5 code chunk using a one-table dplyr verb to keep all the columns of screening, in the same order, but without the MOCA column. Store this output in screening.

  • hint: select your columns
screening <- NULL

The remaining tasks focus on merging two tables.

You suspect that the older adults with musical experience might report more finely-tuned hearing abilities than those without. You therefore decide to check whether this trend exists in your data. You measured participants' self-reported hearing abilities using the Better Hearing Institute Quick Hearing Check Questionnaire. In this questionnaire, participants rated the extent to which they agree or disagree with a list of statements (e.g., 'I have a problem hearing over the telephone') using a 5-point Likert scale (Strongly Disagree, Slightly Disagree, Neutral, Slightly Agree, Strongly Agree).

Each participant's response to each question is contained in the responses dataframe in your environment. Each response type is worth a certain number of points (e.g., Strongly Disagree = 0, Strongly Agree = 4), and the scoring key is contained in the scoring dataframe. A score for each participant is calculated by totalling the points across all 15 questions to derive an overall score. The lower the overall score, the better the participant's self-reported hearing ability.
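To see the logic of this scoring at a small scale, here is a minimal sketch using a made-up one-participant, two-question tibble (toy_responses and toy_scoring are hypothetical objects, not part of the assignment data):

toy_responses <- tibble(ID = 1,
                        Question = c("Q1", "Q2"),
                        RESPONSE = c("NEUTRAL", "STRONGLY AGREE"))

toy_scoring <- tibble(RESPONSE = c("NEUTRAL", "STRONGLY AGREE"),
                      SCORE = c(2, 4))

# attach the points to each response, then total them per participant
inner_join(toy_responses, toy_scoring, "RESPONSE") %>%
  group_by(ID) %>%
  summarise(Total_QHC = sum(SCORE)) # 2 + 4 = 6

This mirrors the join-and-sum steps you will carry out in Tasks 7 and 8 below.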

In order to score the questionnaire we first need to perform a couple of steps.

2.4.8 Task 6 - Gather the Responses together

Replace the NULL in the T6 code chunk with code that gathers the responses to all 15 questions of the QHC from wide format into tidy/long format. Put the names in Question and the values in RESPONSE, and store the output in responses_long. (A generic sketch follows the code chunk below.)

  • hint: pivot_longer()
  • hint: names to "Question"
  • hint: values to "RESPONSE"
responses_long <- NULL 
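If pivot_longer() still feels abstract, here is a minimal sketch on a hypothetical two-question tibble (toy_wide is not part of the assignment data):

toy_wide <- tibble(ID = c(1, 2), Q1 = c(3, 4), Q2 = c(0, 1))

pivot_longer(toy_wide,
             cols = Q1:Q2,           # gather these columns
             names_to = "Question",  # old column names go here
             values_to = "RESPONSE") # old cell values go here

# the result has one row per participant per question (4 rows in total)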

2.4.9 Task 7 - Joining the data

Now we need to join the number of points for each response in scoring to the participants' responses in responses_long.

Replace the NULL in the T7 code chunk using inner_join() to combine responses_long and scoring into a new variable called responses_points.

  • hint: join them by the column common to both scoring and responses_long
responses_points <- NULL

2.4.10 Task 8 - Working the Pipes

Below we have given you five lines of code that take the data in its current long format and create a QHC score for each participant (group_by()...summarise()). They then join in the screening information (inner_join()) before calculating a mean QHC score for the two groups of participants - those that play musical instruments and those that don't. This final result is stored in a tibble called musical_means.

participant_groups <- group_by(responses_points, ID)
participant_scores <- summarise(participant_groups, Total_QHC = sum(SCORE))
participant_screening <- inner_join(participant_scores, screening, "ID")
screening_groups_new <- group_by(participant_screening, MUSICAL)
musical_means <- summarise(screening_groups_new, mean_score = mean(Total_QHC))

Use the above five lines of code to replace the NULL in the T8 code chunk with a functioning code pipeline using pipes. Put each function on a new line one under the other. This pipeline should result in the mean QHC values of musical and non-musical people being stored in the tibble musical_means. This final tibble will consist of two rows by two columns (i.e. four cells in total).

  • hint: in pipes, the output of the previous function is the input of the subsequent function.
  • hint: function1(...) %>% function2(...)
musical_means <- NULL
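If you want to check the piping pattern on something smaller first, here is a minimal sketch with a hypothetical tibble (demo is not part of the assignment data):

demo <- tibble(ID = c(1, 1, 2, 2), SCORE = c(1, 2, 3, 4))

demo %>%
  group_by(ID) %>%              # the output of group_by()...
  summarise(Total = sum(SCORE)) # ...is piped straight into summarise()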

2.4.11 Task 9 - Difference in Musical Means

Finally, replace the NULL in the T9 code chunk with the value of how much higher the QHC score of people who play music is compared to people who don't play music. This should be a single numeric value, to two decimal places, e.g. 2.93

  • hint: look in musical_means and enter the difference between the two means.
QHC_diff <- NULL

Well done, you are finished! Now you should go check your answers against the solutions at the end of this chapter. You are looking to check that the answers you have submitted are exactly the same as the ones in the solution - for example, remember that Mycolumn is different to mycolumn and only one is correct.

2.5 Solutions to Questions

Below you will find the solutions to the questions for the Activities for this chapter. Only look at them after giving the questions a good try and speaking to the tutor about any issues.

2.5.1 Data Wrangling Basics

2.5.1.1 Task 1

  • Using select() to include only stated columns:
select(pong_data, Participant, PaddleLength, TrialNumber, BackgroundColor, HitOrMiss)
  • Using select() to exclude certain columns:
select(pong_data, -JudgedSpeed, -BallSpeed, -BlockNumber)
  • Using select() to change the order of columns:
select(pong_data, BallSpeed, HitOrMiss, JudgedSpeed, Participant, TrialNumber)

Return to Task

2.5.1.2 Task 2

arrange(pong_data, desc(HitOrMiss), desc(JudgedSpeed))

Return to Task

2.5.1.3 Task 3

filter(pong_data, 
       JudgedSpeed == 1, 
       BallSpeed %in% c("2", "4", "5", "7"), 
       HitOrMiss == 0)

Return to Task

2.5.1.4 Task 4

pong_data_filt <- filter(pong_data, TrialNumber >= 2) 
pong_data2 <- mutate(pong_data_filt, TrialNumber = TrialNumber - 1)

Return to Task

2.5.1.5 Task 5

group_by(pong_data2, BlockNumber, BackgroundColor)

Return to Task

2.5.1.6 Task 6

pong_data2_group <- group_by(pong_data2, BackgroundColor, PaddleLength)
pong_data2_hits <- summarise(pong_data2_group, mean_hits = mean(HitOrMiss))
## `summarise()` has grouped output by 'BackgroundColor'. You can override using
## the `.groups` argument.
  • You should find that the mean hit rate (mean_hits) for the small paddle (50) with the red background color is 0.450655

Return to Task

2.5.2 Data Wrangling Application

2.5.2.1 Task 2


Return to Task

2.5.2.2 Task 3

responses <- read_csv("responses.csv")                  
qformats <- read_csv("qformats.csv")                 
scoring <- read_csv("scoring.csv")                  
pinfo <- read_csv("pinfo.csv")
  • Note that there is a difference between responses <- read_csv("responses.csv") and responses <- read_csv(responses.csv). You need quotes around the .csv filename, as shown in the code chunk above, or the code won't work.

Return to Task

2.5.2.3 Task 7

aq_scores <- rscores %>% 
             group_by(Id) %>% # group by the ID number in column Id
             summarise(AQ = sum(Score)) # sum column Score to obtain AQ scores.

Return to Task

2.5.2.4 Task 8

aq_scores2 <- responses %>% 
  pivot_longer(cols = Q1:Q10,
               names_to = "Question",
               values_to = "Response") %>%
  inner_join(qformats, "Question") %>%
  inner_join(scoring, c("QFormat", "Response")) %>%
  group_by(Id) %>% 
  summarise(AQ = sum(Score))

Return to Task

2.5.3 Practice Your Skills

2.5.3.1 Task 1 - Oldest Participant

  • Whether you coded this answer or just read it from the data, the participant with ID number 3 is the oldest.
oldest_participant <- 3

This could also be answered with code. We haven't quite shown you how yet but it would look like this:

oldest_participant_code <- arrange(screening, desc(AGE)) %>% # sort from oldest to youngest
  slice(1) %>% # keep only the first (oldest) row
  pull(ID)     # extract the ID value from that row
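An alternative sketch, assuming you have dplyr version 1.0 or later, uses slice_max() to take the row with the largest AGE in one step:

oldest_participant_code2 <- slice_max(screening, AGE, n = 1) %>% # keeps all tied rows if participants share the oldest age
  pull(ID)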

Return to Task

2.5.3.2 Task 2 - Arranging D-SPAN

  • arrange() is the main function here
  • You also needed to use desc() to sort from high to low
cogtest_sort <- arrange(screening, desc(DSPAN))

Return to Task

2.5.3.3 Task 3 - Foreign Language Speakers

  1. First group the screening data by FLANG using group_by()
  2. Next, summarise, paying attention to use the variable names as instructed
  3. n() is a function that we use within summarise() to count how many observations we have. It works like count() but you can't use count() within summarise()
  4. median() is the function that we use within summarise() to calculate the median. Much like we do with sum() or mean() or sd(), etc.
screen_groups <- group_by(screening, FLANG) 
descriptives <- summarise(screen_groups, 
                          n = n(), 
                          median_age = median(AGE))
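These two steps could equally be written as a single pipeline, which avoids the intermediate screen_groups object:

descriptives <- screening %>%
  group_by(FLANG) %>%
  summarise(n = n(),
            median_age = median(AGE))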

Return to Task

2.5.3.4 Task 4 - Creating Percentage MOCA scores

  • mutate() is the function to add a new column to data
  • Here we are mutating/adding on a column called MOCA_Perc which shows a participant's MOCA score divided by 30 and multiplied by 100.
screening <- mutate(screening, MOCA_Perc = (MOCA / 30) * 100)

Return to Task

2.5.3.5 Task 5 - Remove the MOCA column

  • select() is the key function to keep and remove certain columns.
  • Two options here; both would give the same dataframe.
  • The first option shows how to deselect a column and keep everything else.
  • The second option shows how to select all the columns you want.
  • Remember that order is very important and you should select the columns in the order you want.

Option 1:

screening <- select(screening, -MOCA)

Option 2:

screening <- select(screening, ID, AGE, SEX, HANDEDNESS, EDUCATION, MUSICAL, FLANG, DSPAN, D2, MOCA_Perc)

Return to Task

2.5.3.6 Task 6 - Gather the Responses together

  • pivot_longer() is the function to use here.
  • This function can take a while to understand, but spend some time looking at the example below and it will start to make sense.
  • The first argument is the data. In this case, responses.
  • The second argument is the names of the columns you want to gather. Here we are gathering all columns between the Q1 column and the Q15 column. Remember that the colon (:) says "from ... to ..."
    • You do not actually need to write cols = but it makes things clearer.
    • "Gathering" of columns is based on position in the tibble. If the order of columns in the tibble was Q1 then Q15, the above code would only gather those two columns. As it is, in our tibble, the order is Q1, Q2, Q3, ... Q15, and therefore the code gathers all the columns from Q1 to Q15.
    • Column names are not put in quotes because they already exist in the tibble responses.
  • The third and fourth arguments are the names of the new columns we are creating:
    • the first will store the question numbers, Question. I.e. put the question names (names_to = ...) in a column called "Question".
    • the second will store the values/responses, RESPONSE. I.e. put the values/responses to the questions (values_to = ...) in a column called "RESPONSE".
    • These new column names are put in quotes because they do not already exist in the tibble. This is not always the case, but it is here for this function.
    • Note that these names could have been anything, but using these names makes the code easier to follow.
    • Lastly, you do need to write names_to = ... and values_to = ..., otherwise the columns won't be created correctly.
responses_long <- pivot_longer(responses, 
                         cols = Q1:Q15, 
                         names_to = "Question", 
                         values_to = "RESPONSE")

Return to Task

2.5.3.7 Task 7 - Joining the data

  • inner_join() will combine all common information in two sets of data by a common column or columns.
  • Here we are joining the data in responses_long with the data in scoring by the common column RESPONSE.
  • Keep in mind that inner_join() keeps only the rows that have data in both datasets. It will remove rows that only have data in one dataset.
  • When joining two datasets, join by ALL common columns when there is more than one column in common.
responses_points <- inner_join(responses_long, scoring, "RESPONSE")

Return to Task

2.5.3.8 Task 8 - Working the Pipes

  • This is the code we started with.
participant_groups <- group_by(responses_points, ID)
participant_scores <- summarise(participant_groups, Total_QHC = sum(SCORE))
participant_screening <- inner_join(participant_scores, screening, "ID")
screening_groups_new <- group_by(participant_screening, MUSICAL)
musical_means <- summarise(screening_groups_new, mean_score = mean(Total_QHC))
  • Below is how to transcribe the above series of functions into a pipeline.
  • Remember, when using pipes, the output of the previous function is the input of the subsequent function.
musical_means <- group_by(responses_points, ID) %>%
                  summarise(Total_QHC = sum(SCORE)) %>%
                  inner_join(screening, "ID") %>%
                  group_by(MUSICAL) %>%
                  summarise(mean_score = mean(Total_QHC))

Return to Task

2.5.3.9 Task 9 - Difference in Musical Means

  • People who play music have a QHC score that is 1.53 units higher than people who don't play music.
  • You can do this by looking in musical_means, reading the values, and doing some quick maths.
  • A second option is through code. Code is always better as it can reduce error and is reproducible!
# Option 1
QHC_diff <- 1.53

# Option 2
# You will soon learn the functions to do this by code but here is how you could do it.
QHC_diff_code <- pivot_wider(musical_means, 
                             names_from = "MUSICAL", 
                             values_from = "mean_score") %>% 
  mutate(diff = YES - NO) %>% 
  pull(diff) %>% 
  round(2)
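A third sketch, assuming summarise() has ordered the groups alphabetically (NO then YES, its default), uses diff() on the mean_score column:

QHC_diff_code2 <- musical_means %>%
  pull(mean_score) %>% # c(mean for NO, mean for YES)
  diff() %>%           # YES mean minus NO mean
  round(2)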

Return to Task