Lab 2 Data-Wrangling: A Key Skill

2.1 Overview

One of the key skills in any researcher's toolbox is the ability to work with data. When you run an experiment you get lots of data in various files. For instance, it is not uncommon for an experimental software to create a new file for every participant you run and for each participant's file to contain numerous columns and rows of data, only some of which are important. Being able to wrangle that data, manipulate it into different layouts, extract the parts you need, and summarise it, is one of the most important skills we will help you learn in the coming weeks.

The next few labs are aimed at refreshing and consolidating your skills in working with data. This lab focuses on organizing data using the tidyverse package. Over the course of the activities, you will recap the main functions and how to use them, and we will use a number of real datasets to give you a wide range of exposure to what Psychology is about, and to reiterate that the same skills apply across different datasets. The skills don't change, just the data!

There are some questions to answer as you go along to help build your skills: use the example code as a guide and check your answer against the solutions at the end of the chapter. Finally, remember to be pro-active in your learning, work together as a community, and if you get stuck, ask on the forums, google what you are trying to do (e.g. "mutate table tidyverse"), and use the cheatsheets or the Grassroots PsyTeacher book. The key cheatsheet for this activity is the Data Transformation Cheatsheet with dplyr.

In this lab you will recap on:

  • Data-Wrangling with the Wickham Six one-table verbs.
  • Additional useful functions such as count, pivot_longer (to gether data) and joins
  • Piping and making efficient codes.

Note: This preclass is a bit of a read but it is important that you have all this information in the one place so you can quickly refer back to it. Also, you did a very similar task in the Grassroots book so it is about recapping more than learning afresh. But take your time to try to understand the information and be sure to ask any questions you have. "Try first, then ask" is our only rule!

Remember to open up your Portfolio that you created in Lab 1 so you can add useful information to it as you work through the tasks! Also summarising the information we give in this preclass, in your own words, is a great way to learn! You don't have to read all of these but they might help from time to time to explain parts further.

For instance, do you remember how to get help on an R function in RStudio? In your Console window, you can call the help function (e.g. ?mutate) to view the reference page for each function. This example shows how to get help on the mutate() function within dplyr, which we will use in later labs.

2.2 PreClass Activity

Revisiting Tabular Data

From working with data previously you know that nearly all data in research methods is stored in two-dimensional tables, either called data-frames, tables or tibbles. There are other ways of storing data that you will discover in time but mainly we will be using tibbles (if you like more info, type vignette("tibble") in the Console Window). A tibble is really just a table of data with columns and rows of information, and within the tibble you can get different types of data, i.e. double, integer, and character.

Type of Data Description
Double Numbers including decimals (e.g. 3.14)
Integer Numbers without decimals (e.g. 3)
Character Tends to contain letters or be words, but can be numbers (e.g. "AB12", "Data Rocks!")
  • Note: Double and Integer can both be referred to as Numeric data, and you will see this word from time to time. For clarity, we will use Double as a term for any number with a decimal (e.g. 3.14) and Integer as a term for any whole number (no decimal, e.g. 3).

Quickfire Questions

What type of data would these most likely be:

  • Male =

  • 7.15 =

  • 137 =

There are lots of different types of data as well as different levels of measurements and it can get very confusing. It's important to try to remember which is which because you can only do certain types of analyses on certain types of data and certain types of measurements. For instance, you can't take the average of characters just like you can't take the average of categorical data. Likewise, you can do any maths on double and integer data, just like you can on interval and ratio data. Integer data is funny in that sometimes it is ordinal and sometimes it is interval, sometimes you should take the median, sometimes you should take the mean.

The main point is to always know what type of data you are using and to think about what you can and cannot do with them.

2.2.1 Revisiting the Wickham Six

The main way we teach data-wrangling skills is by using the Wickham Six one-table verbs. These are part of the tidyverse package which we introduced to you in the first PsyTeachR book, and more specifically from the dplyr package that is contained within the tidyverse. These six verbs are often referred to as the Wickham Six "one-table" dplyr verbs as they perform actions on a single table of data.

We will look at some of the basics again here but try to look back at the exercises in the Grassroots book to see how we used these verbs (functions) previously. The Wickham Six are:

Function Description
select() Include or exclude certain variables (columns)
filter() Include or exclude certain observations/data (rows)
mutate() Creates new variables (columns)
arrange() Changes the order of observations (rows)
group_by() Organises the observations (rows) into groups
summarise() Create summary variables for groups of observations (rows)


You will use the Wickham Six very frequently for wrangling your data so this would definitely be something you should be making notes about - not just the names, but how they work and any particular nuances that you spot. Perhaps recreate the above table and add your own examples.

2.2.2 Learning to Wrangle: Is there a Chastity Belt on Perception

Today we are going to be using data from this recent paper: Is there a Chastity Belt on Perception. You can read the full paper if you like, it is a nice representation of action/perception/cogntion working together, but we will summarise the paper for you. The main research question asks, does your ability to perform an action influence your perception? For instance, does your ability to hit a tennis ball influence how fast you perceive the ball to be moving? Or to phrase another way, do expert tennis players perceive the tennis ball moving slower than novice tennis players?

This experiment does not use tennis players however, they used the Pong task: "a computerised game in which participants aim to block moving balls with various sizes of paddles". A bit like a very classic retro arcade game. Participants tend to estimate the balls as moving faster when they have to block it with a smaller paddle as opposed to when they have a bigger paddle. You can read the paper to get more details if you wish but hopefully that gives enough of an idea to help you understand the wrangling we will do on the data. We have cleaned up the data a little to start with. Let's begin!

  1. Download the data as a zip file from this link and save it somewhere you have access. In the lab, use your M: drive.

  2. Set your working directory to the same folder as the data. Session >> Set Working Directory >> Choose Directory

  3. Open a new script and copy and paste the two lines below. Here we are:

    1. loading the tidyverse library into our session and then
    2. loading in the data through the read_csv() function and storing it in the tibble called pong_data.
library("tidyverse")
pong_data <- read_csv("PongBlueRedBack 1-16 Codebook.csv")

DO NOT install packages in the Boyd Orr labs; they are already there and just need called in through library().

However, If you are using your own computer and you haven't previously installed the tidyverse package before, you will have to install it first, e.g. install.packages("tidyverse").

If you have already installed tidyverse but it was a long time ago, it might be worth running some updates on the packages as you may have an old version that works differently. The easiest way to do this is through RStudio using the menu at the top - Tools >> Check for Package Updates. You can update packages individually or just run all updates. Tends to be better to just update all packages as many of the packages are linked, unless you specifically don't want to update a certain package.


The three most common mistakes we see are:

  1. Make sure you have spelt the data file name exactly as it is shown. Spaces and everything. Do not change the name of the csv file, fix your code instead. The reason being is that if you have a different name for your file than someone else then your code is not reproducible. We would say avoid using spaces in filenames you create, but as this is one created by another researcher and already has them, we will leave it as is and work with them.
  2. Remember when uploading data we use read_csv which has an underscore, whereas the data file itself will have a dot in its name, filename.csv.
  3. Check that the datafile is actually in the folder you have set as your working directory.


  1. Let's have a look at the pong_data and see how it is organized. Type View(pong_data) or glimpse(pong_data) in your Console window. Capital V and little g.

In the dataset you will see that each row (observation) represents one trial per participant and that there were 288 trials for each of the 16 participants. The columns (variables) we have in the dataset are as follows:

Variable Type Description
Participant integer participant number
JudgedSpeed integer speed judgement (1 = fast, 0 = slow)
PaddleLength integer paddle length (pixels)
BallSpeed integer ball speed (2 pixels/4ms)
TrialNumber integer trial number
BackgroundColor character background display colour
HitOrMiss integer hit ball = 1, missed ball = 0
BlockNumber integer block number (out of 12 blocks)

We will use this data to master our skills of the Wickham Six verbs, taking each verb in turn and looking at it briefly. You should develop your skills by setting yourself new challenges based on the ones we set. There are 6 verbs to work through and then after that we will briefly recap on two other functions before finishing with a quick look at pipes. Try everything out and let us know anything you can't quite get.

2.2.3 The select() Function - to keep only specific columns

The select() function lets us pick out the variables within a dataset that we want to work with. For example, say in pong_data we wanted to only keep the columns Participant, JudgedSpeed, PaddleLength, BallSpeed, TrialNumber, and HitOrMiss but we don't need BackgroundColor or BlockNumber. We can do this in two ways:

  1. We can tell the function what variables we want to include
select(pong_data, Participant, JudgedSpeed, PaddleLength, BallSpeed, TrialNumber, HitOrMiss)
  1. Or we can do it the opposite way by excluding columns through -ColumnName approach (i.e. minus the ColumnName)
select(pong_data, -BackgroundColor, -BlockNumber)

In this latter example, -BackgroundColor means 'not BackgroundColor', so here you are saying all columns except BackgroundColor and BlockNumber. The minus sign is the crucial part!

Task 1: Using the select() function

  1. Either by inclusion or exclusion, select only the columns Participant, PaddleLength, TrialNumber, BackgroundColor and HitOrMiss from pong_data.

Did you know select() can also be used to reorder columns?

  1. Use select() to keep only the columns Participant, JudgedSpeed, BallSpeed, TrialNumber, and HitOrMiss but have them display in alphabetical order, left to right.
  1. Have you remembered to include the dataset pong_data? Pay attention to upper/lower case letters and spelling!

  2. Think about how you first entered the column names as they appeared. But what happens if you change the order that you enter the column names?

2.2.4 The arrange() Function - to sort and arrange columns

The arrange() function sorts the rows in the tibble according to what column you tell it to sort by.

  • In this example we show how to the data arrange by one column e.g. by BallSpeed
arrange(pong_data, BallSpeed)
  • Or by multiple columns e.g. by BallSpeed (fastest first) and BackgroundColor
arrange(pong_data, desc(BallSpeed), BackgroundColor)
  • What does desc() do?
    • desc() is how to sort by largest to smallest - i.e. descending order.
    • Compare the output of the two lines above on the BallSpeed column.
    • Does desc() also work for BackgroundColor?


Task 2: Arranging Data with the arrange() function

  1. Arrange the data in pong_data by two variables: HitOrMiss (putting hits - 1 - first), and JudgedSpeed (fast judgement - 1 - first).

2.2.5 The filter() Function - to keep only parts of the data

The filter() function lets us parse out a subset of the data, meaning we keep only parts of the data.

  • For example, we might want to only keep the red BackgroundColor
filter(pong_data, BackgroundColor == "red")
  • or only keep BallSpeed above 4 pixels
filter(pong_data, BallSpeed > 4)
  • or only keep trials that match both the red BackgroundColor and BallSpeed above 4 pixels. Any trial that is not red background color or slower than 5 pixels will be removed.
filter(pong_data, BackgroundColor == "red", BallSpeed > 4)
  • This last example can also be written as follows. Two arguements or requirements separated by a comma is equivalent to an & (ampersand - meaning "and").
filter(pong_data, BackgroundColor == "red" & BallSpeed > 4)
  • Or say you want to keep specific Participant IDs. Say we want just the data from Participants 1, 3, 10, 14 and 16. We would write it as follows.
    • The %in% is called group membership and means keep each of these Participants
    • The c() creates a little container of items called a vector.
filter(pong_data, Participant %in% c("1", "3", "10", "14", "16")) 
  • And finally, say you wanted to keep all Participants except Participant 7. Say the experiment didn't work for them and you want to remove them. You would write:
    • you can read != (exclamation mark followed by equals) as 'does not equal'. So Participant != "7" means keep all Participants where the values in Participant column are not 7.
    • The exclamation mark can sometimes be used to negate the function that follows it.
filter(pong_data, Participant != "7")

Task 3: Using the filter() Function

  1. Use filter() to extract all Participants that had a fast speed judgement, for speeds 2, 4, 5, and 7, but missed the ball. Store this remaining data in a variable called pong_fast_miss

There are three parts to this filter so it is best to think about them individually and then combine them.

  1. Filter all fast speed judgements (JudgedSpeed)

  2. Filter for the speeds 2, 4, 5 and 7 (BallSpeed)

  3. Filter for all Misses (HitOrMiss)

You could do this in three filters where each one uses the output of the preceeding one, or remember that filter functions can take more than one arguement - see the example above. Also, because the JudgedSpeed and HitOrMiss are Integer you will need == instead of just =.


  • And not Or. Mistakes with filter()

The filter function is very useful but if used wrongly can give you very misleading findings. This is why it is very important to always check your data after you perform an action. Let's say you are working in comparative psychology and have run a study looking at how cats, dogs and horses perceive emotion. Let's say the data is all stored in the tibble animal_data and there is a column called animals that tells you what type of animal your participant was. Something like this:

Participant ID animals Perceived Emotion Accuracy (%)
1 dog 80
2 dog 90
3 cat 10
4 dog 85
5 horse 100
6 cat 6

Ok, so imagine you wanted all the data from just cats

filter(animal_data, animals == "cat")

Exactly! But what if you wanted cats and dogs?

filter(animal_data, animals == "cat", animals == "dog")

Right? Wrong! This actually says "give me everything that is a cat and a dog". But nothing is a cat and a dog, that would be weird - like a dat or a cog! What you actually want is everything that is either a cat or a dog, which is stated as:

filter(animal_data, animals == "cat" | animals == "dog")

The vertical line | is the symbol for or, just as & is the symbol for and.

TOP TIP: Always pay attention to what you want and most importantly to what your code produces.

2.2.6 The mutate() Function - for adding new columns

The mutate() function lets us create a new variable in our dataset. For example, let's add a new column to pong_data in which the background color is represented by numbers, where red will be represented as 1, and blue will be represented as 2.

pong_data <- mutate(pong_data, 
                    BackgroundColorNumeric = recode(BackgroundColor, 
                                                    "red" = 1, 
                                                    "blue" = 2))

The code here is is a bit complicated but:

  • BackgroundColorNumeric is the name of the new column you are adding to the tibble,
  • BackgroundColor is the name of the original column in the tibble and the one to take information from,
  • and 1 and 2 are the new codings of red and blue respectively.

The mutate() function is also handy for making some calculations on or across columns in your data. For example, say you realise you made a mistake in your experiment where your participant numbers should be 1 higher for every participant, i.e. Participant 1 should actually be numbered as Participant 2, etc. You would do something like:

mutate(pong_data, Participant = Participant + 1)

Note here that the "new column" has the same name as the old column, i.e. Participant. In the resulting table, the Participant column will have the new values which will differ from the values in the original pong_data table. While it may seem like you have overwritten these values, in reality you have created a copy of the table with altered values, but you have not lost anything: the original values are still there in pong_data because you didn't store (assign) this action to pong_data. You didn't save the change basically.

In general however it is good practice not to overwrite pong_data with a new version of pong_data, but to store the altered table in a new tibble, e.g., pong_data2, like this:

pong_data_mutated <- mutate(pong_data, Participant = Participant + 1)

Task 4: Mutating Variables with mutate()

  • You realise another mistake in that all your trial numbers are wrong. The first trial (trial number 1) was a practice so should be excluded. And your experiment actually started on trial 2. Tidy this up by:
  1. Creating a new tibble called pong_data_filt and in it store data from pong_data after filtering out all trials with the number 1 (TrialNumber column).
  2. Now use the mutate() function to renumber all the remaining trial numbers, in pong_data_filt, starting them at one again instead of two. Store this output in a new tibble called pong_data2.

Step 1:

  • filter(TrialNumber does not equal 1).
  • remember to store this output in a tibble called pong_data_filt

Step 2:

  • mutate(TrialNumber = TrialNumber minus 1)
  • exclamation mark, equals

2.2.7 The group_by() Function - to group parts of data altogether

The group_by() function groups the rows in a dataset according to a category you specify, e.g. in the animals example above, grouping all cat data together and all dog data together, and all horse data together.

Looking at the data within pong_data2, say you wanted to eventually creating means, etc, for the different background color conditions. You would start by grouping trials by BackgroundColor, grouping the data into red background data and blue background data, as such:

group_by(pong_data2, BackgroundColor)
  • Or you can add numerous grouping variables depending on how you want to split up the data. Here we group by Hit Or Miss (in HitOrMiss column), and background color (Red or Blue). This gives four grousp - Hit Red, Miss Red, Hit Blue, Miss Blue.
group_by(pong_data2, HitOrMiss, BackgroundColor)

Note: Nothing actually appears to change in the data, unlike with the other functions, but a big operation has taken place. Look at the output in your console when you run group_by(pong_data2, BackgroundColor). At the top of the output notice that the 2nd line of the output tells us the grouping criteria and how many groups now exist: see the line Groups: BackgroundColor [2]: we grouped by BackgroundColor and there are [2] groups - one for red and one for blue.

Task 5: Grouping Data with group_by()

  • Group the data by BlockNumber and by BackgroundColor, in that order, and then enter the number of groups (i.e. a number) you get as a result:

It is the same procedure as this but with different column names:

group_by(pong_data2, HitOrMiss, BackgroundColor)

The number of groups should be between the sum (i.e. multiplication) of the number of background colors (red and blue) and the number of blocks (12).


group_by() is incredibly useful as, once the data is organised into groups, you can then apply other functions (filter, arrange, mutate...etc.) to the groups within your data that you are interested in, instead of to the entire dataset. For instance, a common second step after group_by might be to summarise the data...

2.2.8 The summarise() Function - to do some calculations on the data

The summarise() function lets you calculate some descriptive statistics on your data. For example, say you want to count the number of hits there were for different paddle lengths, or number of hits there were when the background color was red or blue.

  • First we group the data accordingly, storing it in pong_data2_group
pong_data2_group <- group_by(pong_data2, BackgroundColor, PaddleLength)
  • And then we summarise it, storing the answer in total_hits
pong_data2_hits <- summarise(pong_data2_group, total_hits = sum(HitOrMiss))
  • And then for fun we can filter just the red, small paddle hits from the summarised data.
pong_data2_hits_red_small <- filter(pong_data2_hits, BackgroundColor == "red", PaddleLength == 50)
## `summarise()` has grouped output by 'BackgroundColor'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'BackgroundColor'. You can override using the `.groups` argument.

Which would leave us with:

Table 2.1: Summarising with group_by() and summarise()
BackgroundColor PaddleLength total_hits
red 50 516


  • The name of the column within pong_data2_hits_red_small that has the summarised data is total_hits; this is what you called it when creating pong_data2_hits. You could have called it anything you wanted but always try to use something sensible. Make sure to call your variables something you (and anyone looking at your code) will understand and recognize later (i.e. not variable1, variable2, variable3. etc.), and avoid spaces (use_underscores_never_spaces).

summarise() has a range of internal functions that make life really easy, e.g. mean(), median(), n(), sum(), max, min, etc. Some common ones are shown below but see the dplyr cheatsheets for more examples.

function example purpose
n() N = n() number of values
mean() m = mean(X) mean of values in stated column - e.g. column X
median() mdn = median(X) median of values in stated column - e.g. column X
sum() sm = sum(X) sum of values in stated column - e.g. column X
max() mx = max(X) max of values in stated column - e.g. column X
min() mn = min(X) min of values in stated column - e.g. column X
sd() stdev = sd(X) standard deviation of values in stated column - e.g. column X

Task 6: Summarising Data with summarise()

  • Use the lines of code above to calculate the mean number of hits made with the small paddle (50) and the red color background. Enter that value in this box to two decimal places - (e.g. 0.12):

A quick point on the ungroup() function

The ungroup() undoes the action of the group_by() function.

After grouping data together using the group_by() function and then peforming a task on it, e.g. filter(), summarise(), it can be very good practice to ungroup the data before performing another function. Forgetting to ungroup the dataset won't always affect further processing but sometimes it can really mess up other things.

Again just a good reminder to always check the data you are getting out of a function a) makes sense and b) is what you expect.

Quickfire Questions

  • Which of the Wickham Six would I use to sort columns from smallest to largest:

  • Which of the Wickham Six would I use to calculate the mean of a column:

  • Which of the Wickham Six would I use to remove certain observations - e.g. remove all males:

2.2.9 Two Other Useful Functions

The Wickham Six verbs let you to do a lot of things with data however there are thousands of other functions at your disposal. If you want to do something with your data that you are not sure how to do using these functions, do a Google search for an alternative function - chances are someone else has had the same problem and has a help guide. For example, two other functions to note are the bind_rows() function and the count() function. We will breidly show you these here.

Binding columns with bind_rows()

The bind_rows() function is useful if you want to combine two tibbles together into one larger tibble that have the same column structure - i.e. the have exactly the same columns and you want to combine them by attaching one to the bottom of the other. For example:

  • Say you had on tibble with data for ball speeds 1 and 2
slow_ball<- filter(pong_data2, BallSpeed < 3) 
  • And another tibble with data for ball speeds 6 and 7
fast_ball <- filter(pong_data2, BallSpeed >= 6) 
  • And you wanted to combine those tibbles together into one big tibble containing these extreme ball speeds.
extreme_balls <- bind_rows(slow_ball, fast_ball) 

We are going to use bind_rows() throughout the labs so do keep it in mind!

Quick counts with the count() function

Finally, the count() function is a shortcut that can sometimes be used to count up the number of rows you have for groups in your data, without having to use the group_by() and summarise() functions. It is a tally basically. It doesn't sum up values. It just counts how many observations you have.

For example, in Task 6 we combined group_by() and summarise() to calculate how many hits there were based on background color and paddle length. Alternatively we could have done:

count(pong_data2, BackgroundColor, PaddleLength, HitOrMiss)

The results are the same, just that in the count() version we get all the information, including misses, because we are just counting rows. In the summarise() method we only got hits because that was the effect of what we summed. So two different methods give similar answers - coding can be individualised and get the same result!

Again, we will use count() at various times in the labs so again it is handy to make a note of.

2.2.10 Last but not least - Pipes (%>%) to make your code efficient

By now you'll have noticed thattidyverse functions generally take the following grammatical structure (called syntax): function_name(dataset, arg1, arg2,..., argN) where the dataset is the entire tibble of data you are using, and each argument (arg) is some operation on a particular column or variable, or the column name you want to work with. For example:

filter(pong_data2, PaddleLength == "50", BallSpeed > 4)
group_by(pong_data2, BallSpeed, Participant)

Both of these examples follow the structure of function_name(dataset, arg1, arg2, ....)

In the first example, we are filtering (function) the whole pong_data2 dataset by a particular paddle length, then by particular speeds (arguments). In the second, we are grouping by BallSpeed and then by Participant. Note that the order of arguments is specific as it performs argument1 then argument2, etc. Changing the order of arguments may give a different output. So the order you work in is important, and this is called your pipeline. For example, here is a pipeline we used above to find how many hits there were with the small paddle length and the red background.

  • First we group the data accordingly, storing it in pong_data2_group
  • And then we summarise it, storing the answer in total_hits
  • And filter just the red, small paddle hits
pong_data2_group <- group_by(pong_data, BackgroundColor, PaddleLength)
pong_data2_hits <- summarise(pong_data2_group, total_hits = sum(HitOrMiss))
pong_data2_hits_red_small <- filter(pong_data2_hits, BackgroundColor == "red", PaddleLength == 50)

Pipelines allow us to quickly and reproducibly, perform an action that would take much longer manually. However, we can make our code even more efficient, using less code, by stringing our sequence of functions together using 'pipes', written as %>%. Changing the above code into one using pipes would give us:

pong_data_hits_red_small <- pong_data2 %>% 
  group_by(BackgroundColor, PaddleLength) %>% 
  summarise(total_hits = sum(HitOrMiss)) %>%
  filter(BackgroundColor == "red", PaddleLength == 50)

Both these chunks show exactly the same procedure, but adding pipes can make code easier to read and follow once you understand piping.

Code without a pipe would look like

  • function_name(dataset, arg1, arg2,...,argN)

but a pipe version would look like

  • dataset %>% function_name(arg1, arg2,...,argN)

The premise is that you can pipe (%>%) between functions when the input of a function is the output of the previous function. Alternatively, you can use a pipe to put the data into the first function, as shown directly above.

You can think of the pipe (%>%) as saying 'and then' or 'goes into'. E.g. the data goes into this function and then into this function and then into this function. We will expand on this in the lab where you can ask more questions, but try comparing the two chunks of code above and see if you can match them up.

One last point on pipes is that they can be written in a single line of code but it's much easier to see what the pipe is doing if each function takes its own line. Every time you add a function to the pipeline, remember to add a %>% first and note that when using separate lines for each function, the %>% must appear at the end of the line and not the start of the next line. Compare the two examples below. The first won't work but the second will because the second puts the pipes at the end of the line where they need to be!

  • Example 1: won't work because the pipes (%>%) are in the wrong place.
data_arrange <- pong_data2 
                %>% filter(PaddleLength == "50")
                %>% arrange(BallSpeed) 
  • Example 2: will work because the pipes (%>%) are in the correct place.
data_arrange <- pong_data2 %>%
                filter(PaddleLength == "50") %>%
                arrange(BallSpeed) 

Where piping becomes most useful is when we string a series of functions together, rather than using them as separate steps and having to save the data each time under a new tibble name and getting ourselves all confused. In the non-piped version we have to create a new tibble each time, for example, data, data_filtered, data_arranged, data_grouped, data_summarised just to get to the final one we actually want, which was data_summarised. This creates a lot of tibbles in our environment and can make everything unclear and eventually slow down our computer. The piped version however uses one tibble name, saving space in the environment, and is clear and easy to read. With pipes, we skip unnecessary steps and avoid cluttering our environment.


Quickfire Questions

  • What does this line of code say? data %>% filter() %>% group_by() %>% summarise():

Job Done - Activity Complete!

We have now recapped a number of functions and verbs that you will need as the semester goes on. You will use them in upcoming labs and chapters so be sure to go over these and try them out to make yourself more comfortable with them. Remember to also start looking back at the Grassroots book and remembering some of the work you did there. If you have any questions please post them on the forums. Finally, don't forget to add any useful information to your Portfolio before you leave it too long and forget. Happy Wrangling!

2.3 InClass Activity

Data Wrangling

In the PreClass activity we looked at a series of functions known as the Wickham six one-table to filter, arrange, group_by, select, mutate and summarise. Now we will focus on working with data across two or more tables using functions you will have come across in the Grassroots book. The two main functions we will add to the Wickham six today are pivot_longer() and inner_join().

  • pivot_longer() allows us to transform a table from wide format to long format (more on this below).
  • inner_join() allows us to combine two tables together based on common columns.

A function is a tool that takes an input, performs some action, and gives an output. They are nothing more than that. If you think about it, your toaster is a function: it takes bread as an input; it perfoms the action of heating it up (nicely sometimes; on both sides would be a luxury); and it gives an output, the toast. A good thing about the Wickham six functions is that they are nicely named as verbs to describe what they do - mutate() mutates (adds on a column); arrange() arranges columns, summarise() summarises, etc.

In terms of remembering all the functions, the truth is you don't have to know them all or remember them all. However, through practice and repetition, you will quickly learn to remember which ones are which and what package they come from. Sort of like where to find your spoons in your kitchen - you don't look in the fridge, and then the washing machine, and then the drawer. Nope, you learnt, by repetition, to look in the drawer first time. It's the same with functions. Keep in mind that research methods is like a language in that the more you use it and work with it the more it makes sense.


A Note on Tidy Data

In the style of programming we teach, the most efficient format/layout of data is what is known as Tidy Data, and any data in this format is easily processed through the tidyverse package. You can read more about this type of data layout in this paper: Tidy Data (Wickham, 2014). It is a surprisingly good read.

However, the data you work with will not always be formatted in the most efficient way possible. If that happens then our first step is to put it into Tidy Data format. There are two fundamental principles defining Tidy Data:

  1. Each variable must have its own column.
  2. Each observation must have its own row.

Tidy Data (Wickham, 2014) adds the following principle:

  • Each type of observation unit forms a table.

And Grolemund and Wickham (2017) restate this third principle as:

  • Each value must have its own cell (i.e. no grouping two variables together, e.g. time/date in one cell).

Where a cell is where any specific row and column meet; a single data point in a tibble is a cell for example. The Grolemund and Wickham (2017) book is a very useful read and it is free, but browsing the chapter on Tidy Data will help you visualise how you want to arrange data. Try to keep the principles in mind whilst doing so.

We use Tidy Data because it is really efficient and works well with the tidyverse. However, people used to use data structured in long format or wide format.

Long format is where each row is a single observation, typically a single trial in an experiment or a response to a single item on a questionnaire. When you have multiple trials per participant, you will have multiple rows for the same participant. To identify participants, you would need a variable with some kind of participant id, which can be as simple as a distinct integer value for each participant. In addition to the participant identifier, you would have any measurements taken during each observation (e.g., response time) and what experimental condition the observation was taken under.

In wide format data, each row corresponds to a single participant, with multiple observations for that participant spread across columns. So for instance, with survey data, you would have a separate column for each survey question.

Tidy is a mix of both of these approachs and most functions in the tidyverse assume the tidy format, so typically the first thing you need to do when you get data, particularly wide-format data, is to reshape it through wrangling. Which is why we teach these really important skills.


Today's Lab - Analysing the Autism Specturm Quotient (AQ)

To continue building your data wrangling skills we will recap on skills from Level 1 by tidying up data from the Autism Spectrum Quotient (AQ) questionnaire. If you have completed the Grassroots book then you may be familiar with the AQ10; a non-diagnostic short form of the AQ with only 10 questions per participant. It is a discrete scale and the higher a participant scores on the AQ10 the more autistic-like traits they are said to display. Anyone scoring 7 or above is recommended for further diagnosis. You can see an example of the AQ10 through this link: AQ10 Example. Remember you can revisit the Grassroots book at any time but we will recap here for you.

Today we have 66 participants and your goal in this lab is to find an AQ score for each of them through your data-wrangling skills.

We have four data files to work with:

  • responses.csv containing the AQ survey responses to each of the 10 questions for our 66 participants
  • qformats.csv containing information on how a question should be coded - i.e. forward or reverse coded
  • scoring.csv containing information on how many points a specific response should get; depending on whether it is forward or reverse coded
  • pinfo.csv containing participant information such as Age, Sex and importantly ID number.

Click here to download the files as a zip file. Now unzip the files into a folder you have access to (e.g. your M: drive on campus). We will use zip folders a lot so if this is something you struggle with please ask.

csv stands for 'comma separated values' and is a very basic format for storing data in a plain text file. It really just stores numbers and text separated by commas and nothing else. The great thing about being this basic is that it can be read by many different systems and is non-proprietary, i.e., you don't need to purchase commercial software to open it.


Now set your working directory to the folder where you saved the .csv files. Do this through the dropdown menus at the top toolbar: Session >> Set Working Directory >> Choose Directory and then find your folder with your .csv files. Remember what we previously said about folder structure - use a drive you have access to (e.g. M: drive in the labs, but never call your folders R).

Today we will work in an RScript instead of .Rmd but if you want to turn the lab into an R Markdown report or to add elements to your Portfolio then please feel free.

Group Discussion Point

Now would be a good time to make sure that you are all using RStudio effectively and know what each window does.

  • TRUE or FALSE, the Console is best for practice and the Script Window is for saving:
  • TRUE or FALSE, the Environment holds all the data and objects you have loaded in and created:
  • TRUE or FALSE, clicking the name of a table in the Environment window will open it in the Script window:

The answer to all of these are True.

  1. The Script window is where you should write code and comments that you are going to save and send to people. The Console is where you should practice stuff - nothing is saved here; it is like a sandbox that just gets wiped away.

  2. Any data you load in or create is held in the Environment (or Global Environment) window with the variable name that you gave it.

  3. By clicking the name of the table in the Environment window it will open up in the Script window and you can look at it to make sure it is what you expect. This only works for tables but not for other types of data. You will learn the difference as we go along!

2.3.1 Task 1: Open a Script

  1. Start a new RScript and save it in the same folder as your .csv files, calling the RScript something informative like Lab2_AQ_DataWrangling.R.
  2. Make sure your environment is completely empty so we don't mix up one analysis with the other. You can run the following code line in the console to clear the environment or by clicking the little brush on your environment window.
rm(list = ls()) 

Remember that when using a script you can write notes to yourself to remind you what a line of code does. Just put a hashtag at the start of the line and R will ignore this line. This is where you have to be clear on using a Script versus an R Markdown file. In a Script, # means the line is ignored, in Markdown # sets the line as a header!.

To run any line on a script, the simplest way is to click anywhere on that line and either press Run on the top of the script window or press CTRL+Enter on the keyboard (or mac equivalent).

2.3.2 Task 2: Bring in Your Library

  1. Add a line to your code that brings the tidyverse package into your working environment and run it.

Combine the function library() and the package tidyverse and remember that the solutions are at the end of the chapter.

On our lab machines in Psychology all the necessary packages will already be on the machines, they just need called into the library. If however you are using your own machine you will have to install the packages first.

Do not install packages on the Psychology machines! Why?

  1. They are already installed and can cause the package to stop working if a student tries to install the same package on our machines.
  2. They are already installed and it is a bit like using apps on your phone. Install is putting the app onto your phone, library is just opening the app. If you've already downloaded the app (package) then you just need to open it (library()) to use it!

2.3.3 Task 3: Load in the Data

Now we have to load in the .csv datafiles using the read_csv() function and save them as tibbles in our environment. For example, to load in the data in responses.csv and save it as the tibble responses we would type:

responses <- read_csv("responses.csv") 
  1. Add the following lines of code to your script and complete them to load in all four .csv datafiles. Use the above code as an example and name each tibble the same as its original filename (minus the .csv part), again as above, e.g. responses.csv gets saved as responses. Remember to run the lines so that the data is loaded in and stored in your environment.
responses <-  read_csv()    # survey responses
qformats <-                 # question formats
scoring <-                  # scoring info
pinfo <-                    # participant information

As you work with data and functions you will find there are functions with similar names but that give different results. One of these is the read function for csv. Make sure to always use read_csv() as your function to load in csv files. Nothing else. It is part of the readr package automatically brought in with tidyverse.

There is a very similarly named function called read.csv(). DO NOT use this function in these labs. We will always expect you to use read_csv(). Although very similar in name they do not work the same way and create differences in your data.

2.3.4 Task 4: Review Your Data.

Group Discussion Point

Now that we have the data loaded in it is always best to have a look at the data to get an idea of its layout. We showed you one way before, by clicking on the name in the environment, but you can also use the glimpse() or View() functions in your Console window. Put the name of the data between the brackets to see how it is arranged. Don't add these to your script though - they are just one-offs for testing.

As a small group, have a look at the data in responses to see if you think it is Tidy or not and answer the following question: The data in responses is in format

The reponses tibble is far from being tidy; each row represents multiple observations from the same participant, i.e. each row shows responses to multiple questions - wide format. Remember we want the data in tidy format as described above.

Eh, remind what's a tibble?

A tibble is simply a dataframe - or a table of data with columns and rows - that is really handy for working with when using the tidyverse package. When we say tibble, you can think of a dataframe with rows and columns of information and numbers stored in them - like responses, it is a tibble. For more info, see here: Tibbles

2.3.5 Task 5: Gathering Data with pivot_longer().

We now have all the data we need loaded in, but in order to make it easier for us to get the AQ score for each participant, we need to change the layout of the responses tibble using the pivot_longer() function.

  1. Copy the below code line to your script and run it.
rlong <- pivot_longer(responses,
                      cols = Q1:Q10,
                      names_to = "Question",
                      values_to = "Response")
  • The first argument given to the pivot_longer() function is the tibble which holds the data we want to wrangle, responses.
    • Remember we could have written this with a pipe as well, e.g. rlong <- responses %>% pivot_longer(...)
  • The second argument is the names of specific columns in the original tibble that we want to gather together, Q1:Q10 meaning all columns between Q1 and Q10.
    • You do not actually need to write cols = but it makes things clearer.
    • "Gathering" of columns is based on position in the tibble. If the order of columns in the tibble was Q1 then Q10, the above code would only gather those two columns. As it is, in our tibble, the order, is Q1, Q2, Q3, ... Q10, and therefore the code gathers all the columns between Q1 and Q10.
    • Colum names are not put in quotes because they exist already in the tibble responses.
  • The third and fourth arguments are the names of the new columns we are creating;
    • the first will store the question numbers, Question. I.e. put the question names (names_to = ...) in a column called "Question".
    • the second will store the values/responses, Response. I.e. put the values/responses to the questions (values_to = ...) in a column called "Response".
    • These new column names are put in quotes because they do not already exist in the tibble. This is not always the case but is the case for this function.
    • Note that these names could have been anything but by using these names the code makes more sense.
    • Lastly, you do need to write names_to = ... and values_to = ... otherwise the columns won't be created correctly.

In case you are wondering, if we wanted to go back the way and ungather the data we just gathered, we would use the pivot_wider() function: e.g. rwide <- rlong %>% pivot_wider(names_from = Question, values_from = Response). But we do not want to do that here so let's not add this to the code.

Quickfire Questions

Let's see if you understand pivot_longer(). Say I wanted to gather the first three columns of responses (Q1, Q2, Q3), put the question numbers in a column called Jam, the responses in a column called Strawberry, and store everything in a tibble called sandwich. Fill in the box with what you would write:

sandwich <- pivot_longer(responses, cols = Q1:Q3, names_to = "Jam", values_to = "Strawberry")

pivot_longer() wants the data first, then the columns to gather, then the name of the new column to store the gathered column names in and finally the name of the new column to store the values in.

2.3.6 Task 6: Combining Data.

So now our responses data is in tidy format, we are closer to getting an AQ score for each person. However, we still need to add some information to:

  • show if the question is reverse or forward scored - found in qformats
  • show the number of points to give a specific response - found in scoring.

This is a typical analysis situation where different information is in different tables and you need to join them altogether. Both these pieces of information are contained in qformats and scoring respectively, but we want to join it to the data in rlong to create one informative tidy table with all the info.

We can do this sort of join through the function inner_join(); a function to combine information in two tibbles using a column (or columns) common to both tibbles.

  1. Copy the below line into your code and run it. This piece of code combines rows in the tibble rlong with rows in the tibble qformats, based on the common column "Question".
rlong2 <- inner_join(rlong, qformats, "Question")
  1. Now have a look in rlong2. We have matched each question with its scoring format, forward or reverse.

A lot of questionnaires have some questions that are Forward scored and some questions that are Reverse scored. What does this mean? Imagine a situation where your options in replying to a question are: 1 - extremely agree, 2 - agree, 3 - neutral, 4 - disagree, 5 - extremely disagree. In a forward-scoring question you would get 1 point for extremely agree, 2 for agree, 3 for neutral, etc. In a reverse scoring question you would get 5 for extremely agree, 4 for agree, 3 for neutral, etc.

The reasoning behind this shift is that sometimes agreeing or disagreeing might be more favourable depending on how the question is worded. Secondly, sometimes these questions are used just to catch people out - imagine if you had two similar questions where one has the reverse meaning of the other. In this scenario, people should respond opposites. If they respond the same then they might not be paying attention.


Now we need to combine the information in our table, rlong2, with the scoring table so we know how many points to attribute each question based on the answer the participant gave, and whether the question was forward or reverse coded. Again, we use the inner_join() function, but this time the common columns found in rlong2 and scoring are QFormat and Response. To combine by two columns you just write them in sequence as shown below. Note: when there is more than one common column between two tibbles you are joining, you should combine by ALL the columns to avoid repeat columns in the new tibble. If you forget to do this, your new tibble will have names such as column_name.x and column_name.y. This will cause confusion so avoid it by combining by all common columns.

  1. Copy the below line into your code and run it. This code combine rows in rlong2 and scoring based on the commn columns, QFormat and Response.
rscores <- inner_join(rlong2, scoring, c("QFormat", "Response"))

2.3.7 Task 7: Calculating the AQ Scores.

We have now created rscores which has information on how each participant responded to each question and how each question should be coded and scored, all within the one tibble. All we need now is to sum the scores for each participant to get their AQ score.

  1. Based on your PreClass learning, copy the below line into your code and complete it to obtain individual aq_scores for each participant.
  2. Save your script and run it all again from the start to make sure it works!
aq_scores <- rscores %>% 
             group_by() %>% # how will you group individual participants?
             summarise(AQ = sum()) # which column will you sum to obtain AQ scores?
  • Each participant could be grouped by their Id.
  • If we summed up the value for each Score we might get a full AQ Score for each particpipant.

Yep, well spotted. Pipes are your friend. Think of them as saying 'and then' or 'goes into'. So in the example above we take rscores and then group it by something and then summarise it into AQ scores based on...

In most cases, the pipe serves the purpose of putting the input into the function or taking the output of one function and treating it as the input of another function.

In the example above the first pipe takes rscores as the input for group_by, and the second pipe takes the output of group_by and puts it as the input to summarise. See how you can almost read it as a chain of actions or steps.


Quickfire Questions

The whole purpose of this lab was to calculate AQ scores for individual participants. As a small group, try to answer the following questions. Try to do it using code where possible to help you based on your knowledge from the preclass and inclass activity. Remember the cheatsheets as well. Look for the dplyr one!

  1. From the options, choose the correct citation for the AQ 10 question questionnaire:

  2. Complete the sentence, the higher the AQ score...

  3. Type in the AQ score (just the number) of Participant ID No. 87:

  4. Type how many participants had an AQ score of 3 (again just the number):

  5. The cut-off for the AQ10 is usually said to be around 6 meaning that anyone with a score of more than 6 should be referred for diagnostic assessment. Type in how many participants we should refer from our sample:

  1. From the link above you can see that an appropriate citation for the AQ10 would be (Allison, Auyeung, and Baron-Cohen, (2012))

  2. As mentioned, the higher the score on the AQ10 the more autistic-like traits a participant is said to show.

  3. You could do this by code with filter(aq_scores, Id == 87), which would give you a tibble of 1x2 showing the ID number and score. If you just wanted the score you could use pull() which we haven't shown you yet but works as follows: filter(aq_scores, Id == 87) %>% pull(AQ). The answer is an AQ score of 2.

  4. Same as above but changing the argument of the filter. filter(aq_scores, AQ == 3) %>% count(). The answer is 13. Remember you can do this by counting but the code makes it reproducible every time.

  5. filter(aq_scores, AQ > 6) %>% count() or filter(aq_scores, AQ >= 7) %>% count(). The answer is 6.

2.3.8 Task 8: One Last Thing on Pipes

You now have a complete code to load in your data, convert it to Tidy, combine the tables and calculate an AQ score for each participant. But, if you look at it, some of your code could be more efficient by using pipes.

Go back through your code and rewrite it using pipes %>% so that it is as efficient as possible.

At any point where the first argument of your function is the name of a variable created before that line, there is a good chance you could have used a pipe! Here are all the bits of this code that could be piped together into one chain:

rlong <- pivot_longer(responses, cols = Q1:Q10, names_to = "Question", values_to = "Response")

rlong2 <- inner_join(rlong, qformats, "Question")

rscores <- inner_join(rlong2, scoring, c("QFormat", "Response"))

aq_scores <- rscores %>% group_by(Id) %>% summarise(AQ = sum(Score))


Job Done - Activity Complete!

You have now recapped on a number of handy one-table and two-table verbs. These are great to know as for example, in the above Activity, it actually only took a handful of reproducible steps to get from messy data to tidy data; could you imagine doing this by hand in Excel through cutting and pasting? Not to mention the mistakes you could make!

You should now be ready to complete the Homework Assignment for this lab. The assignment for this Lab is FORMATIVE and is NOT to be submitted and will NOT count towards the overall grade for this module. However you are strongly encouraged to do the assignment as it will continue to boost your data-wrangling skills which you will need in future assignments. If you have any questions, please post them on the forums. Finally, don't forget to add any useful information to your Portfolio before you leave it too long and forget.

Happy wrangling! Excellent work! You are a DataWrangling expert! Now go try the assignment!

2.4 Test Yourself

This is a formative assignment meaning that it is purely for you to test your own knowledge, skill development, and learning, and does not count towards an overall grade. However, you are strongly encouraged to do the assignment as it will continue to boost your skills which you will need in future assignments. You will be instructed by the Course Lead on Moodle as to when you should attempt this assignment. Please check the information and schedule on the Level 2 Moodle page.

One thing to keep in mind in all assignments is that spelling it really important when learning about being reproducible. Remember that Mycolumn is a different column to mycolumn and when being reproducible only one is correct. Likewise, Mydata and mydata are two different tibbles, and read_csv() and read.csv() are two different functions. Check the instructions and check your work!

Lab 2: Formative Data Wrangling Assignment

In order to complete this assignment you will need to download the data .csv files, as well as the assignment .Rmd file, which you need to edit, titled GUID_Level2_Semester1_Lab2.Rmd. These can be downloaded within a zip file from the below link. Once downloaded and unzipped, you should create a new folder that you will use as your working directory; put the data files and the .Rmd file in that folder and set your working directory to that folder through the drop-down menus at the top. Download the Assignment .zip file from here.

Now open the assignment .Rmd file within RStudio. You will see there is a code chunk for each task. As you did in previous assignments, follow the instructions on what to edit in each code chunk. This will often be entering code based on what we have covered up until this point, either entering code or a value.

In this chapter we recapped on data-wrangling using the Wickham 6 verbs, looked at additional functions such as pivot_longer() and inner_join(), and at piping chains of code for efficiency using %>%. You will need these skills to complete the following assignment so please make sure you have carried out the PreClass and InClass activities before attempting this formative assignment. Remember to follow the instructions and if you get stuck at any point to post questions on the available forums. Also, two useful online resources are:

As well as the PsyTeachR Grassroots book and remembering some of the work you did there.

2.4.1 Today's Topic - The Ageing Brain

A key topic in current psychologial research, and one which forms a main focus for some of the research in our School, is that of human ageing. In this research we use brain imaging techniques to understand how changes in brain function and structure relate to changes in perception and behaviour. A typical 'ageing' experiment will compare a measure (or a number of measures) such as performance on a cognitive or perceptual task between younger and older adults (i.e. a between-subjects design experiment).

However, in order to make sure we are studying 'healhty' ageing, we first have to 'screen' our older participants for symptoms of age-related dementia (Alzheimer's Disease), where cognitive function can be significantly impaired. We do this using a range of cognitive tests. Some studies will also test participants' sensory acuity (ability to perceive something), as a function of age (particularly eyesight and hearing).

The data you have downloaded for this lab is example screening data taken from research investigating how the ageing brain processes different types of sounds. The tests used in this study are detailed below. Please note that the links are there to provide you with further information and examples of the tests once you have completed the assignment if you so wish; you do not have to read them to complete the assignment.

The Data Files

You have just downloaded the three .csv files containing all the data you need. Below is a list of the .csv file names and a description of the variables each contains:

  • p_screen.csv contains particpants demographic information including:
    • ID Participant Id number - for confidentiality (no names or other identifying info)
    • AGE in years
    • SEX M for male, F for female
    • HANDEDNESS L for left-handed, R for right-handed
    • EDUCATION in years
    • MUSICAL whether they have any musical abilties/experience (YES or NO)
    • FLANG speak any foreign languages (YES or NO)
    • MOCA Montreal Cognitive Assessment score
    • D-SPAN Working Memory Digit Span test score
    • D2 D2 Test of Attention score
  • QHC_responses.csv contains participants' responses to each question on the Better Hearing Institute Quick Hearing Check (QHC) questionnaire.
    • Column 1 represents participants' ID (matching up to that in p_screen.csv).
    • Each column thereafter represents the 15 questions from the questionnaire.
    • Each row represents a participant and their response to each question.
  • QHC_scoring.csv contains the scoring key for each question of the QHC, with the columns:
    • RESPONSE the types of responses participants could give (STRONGLY DISAGREE, SLIGHTLY DISAGREE, NEUTRAL, SLIGHTLY AGREE, STRONGLY AGREE)
    • SCORE the points awarded for each response type (from 0 to 4). A score for each participant can be calculated by converting their categorical responses to values and summing the values.

Before starting lets check:

  1. The .csv files are saved into a folder on your computer and you have manually set this folder as your working directory.

  2. The .Rmd file is saved in the same folder as the .csv files. For assessments we ask that you save it with the format GUID_Level2_Semester1_Lab2.Rmd where GUID is replaced with your GUID. Though this is a formative assessment, it may be good practice to do the same here.

2.4.2 Load in the data

  1. You will see a code chunk called libraries, similar to the one below, at the top of your .Rmd assignment file. It is all set-up to load in the data for you and to call tidyverse to the library(). Run this code chunk now to bring in the data and tidyverse. You can do this in the console, in a script, or even through the code chunk by clicking the small green play symbol in the top right of the code chunk.
library("tidyverse")

screening <- read_csv("p_screen.csv")
responses <- read_csv("QHC_responses.csv")
scoring <- read_csv("QHC_scoring.csv")

View the data

  1. It is always a good idea to familiarise yourself with the layout of the data that you have just loaded in. You can do this through using glimpse() or View() in the Console window, but you must never put these functions in your assignment file.

The Tasks:

Now that we have the data loaded, tidyverse attached, and have viewed our data, you should now try to complete the following 9 tasks. You may want to practice them first to get the correct code and format, and to make sure they work. You can do this in the console or a script, but remember, once you have the correct code, edit the necessary parts of the assignment .Rmd file to produce a reproducible Rmd file. This is what you will do from now on for all other assessment files so practicing this now will really help. In short, go through the tasks and change only the NULL with what the question asks for and then make sure that the file knits at the end so that you have a fully reproducible code.

2.4.3 Task 1 - Oldest Participant

Replace the NULL in the T1 code chunk with the Participant ID of the oldest participant. Store this single value in oldest_participant (e.g. oldest_participant <- 999.

  • hint: look at your data, who is oldest?
oldest_participant <- NULL

2.4.4 Task 2 - Arranging D-SPAN

Replace the NULL in the T2 code chunk with code that arranges participants' D-SPAN performance from highest to lowest using the appropriate one-table dplyr (i.e., Wickham) verb. Store the output in cogtest_sort. (e.g. cogtest_sort <- verb(data, argument))

  • hint: arrange your screening data
cogtest_sort <- NULL

2.4.5 Task 3 - Foreign Language Speakers

Replace the NULL in each of the two lines of code chunk T3, so that descriptives has a column called n that shows the number of participants that speak a foreign language and number of participants that do not speak a foreign language, and another column called median_age that shows the median age for those two groups. If you have done this correctly, descriptives should have 3 columns and 2 rows of data, not including the header row.

  • hint: First need to group_by() foreign language
  • hint: Second need to summarise(). You will need the n() function. Pay attention to specific column names given.
screen_groups <- NULL
descriptives <- NULL

2.4.6 Task 4 - Creating Percentage MOCA scores

Replace the NULL in the T4 code chunk with code using one of the dplyr verbs to add a new column called MOCA_Perc to the dataframe screening In this new column should be the MOCA scores converted to percentages. The maximum achievable score on MOCA is 30 and percentages are calculated as (participant score / max score) * 100. Store this output in screening.

  • hint: mutate() something using MOCA and the percentage formula
screening <- NULL

2.4.7 Task 5 - Remove the MOCA column

Now that we have our MoCA score expressed as a percentage MOCA_Perc we no longer need the raw scores held in MOCA. Replace the NULL in the T5 code chunk using a one-table dplyr verb to keep all the columns of screening, with the same order, but without the MOCA column. Store this output in screening.

  • hint: select your columns
screening <- NULL

Halfway There!

The remaining tasks focus on merging two tables.

You suspect that the older adults with musical experience might report more finely-tuned hearing abilities than those without musical experience. You therefore decide to check whether this trend exists in your data. You measured participants' self reported hearing abilties using the Better Hearing Institute Quick Hearing Check Questionnaire. In this questionnaire, participants rated the extent to which they agree or disagree with a list of statements (e.g. 'I have a problem hearing over the telephone') using a 5 point Likert scale (Strongly Disagree, Slightly Disagree, Neutral, Slightly Agree, Strongly Agree).

Each participant's response to each question is contained in the responses dataframe in your environment. Each response type is worth a certain number of points (e.g. Strongly Disagree = 0, Strongly Agree = 5) and the scoring key is contained in the scoring dataframe. A score for each participant is calculated by totalling up the number of points across all the questions to derive an overall score. The lower the overall score, the better the participants' self-reported hearing ability.

In order to score the questionnaire we first need to perform a couple of steps.

2.4.8 Task 6 - Gather the Responses together

Replace the NULL in the T6 code chunk using code to gather the responses to all the questions of the QHC from wide format to tidy/long format. Put the names in Question and the values in RESPONSE. Store this output in responses_long.

  • hint: pivot_lnger()
  • hint: names to "Question"
  • hint: values to "RESPONSE"
responses_long <- NULL 

2.4.9 Task 7 - Joining the data

Now we need to join the number of points for each response in scoring to the participants' responses in responses_long.

Replace the NULL in the T7 code chunk using inner_join() to combine responses_long and scoring into a new variable called responses_points.

  • hint: join them by the column common to both scoring and responses_long
responses_points <- NULL

2.4.10 Task 8 - Working the Pipes

Below we have given you five lines of code that takes the data in its current long format and then creates a QHC score for each participant (group_by()...summarise()). It then joins the screening information (inner_join()) before calculating a mean QHC score for the two groups of participants - those that play musical intruments and those that don't. This final step is stored as a tibble called musical_means.

participant_groups <- group_by(responses_points, ID)
participant_scores <- summarise(participant_groups, Total_QHC = sum(SCORE))
participant_screening <- inner_join(participant_scores, screening, "ID")
screening_groups_new <- group_by(participant_screening, MUSICAL)
musical_means <- summarise(screening_groups_new, mean_score = mean(Total_QHC))

Use the above five lines of code to replace the NULL in the T8 code chunk with a functioning code pipeline using pipes. Put each function on a new line one under the other. This pipeline should result in the mean QHC values of musical and non-musical people being stored in the tibble musical_means. This final tibble will consist of two rows by two columns (i.e. four cells in total).

  • hint: in pipes, the output of the previous function is the input of the subsequent function.
  • hint: function1(...) %>% function2(...)
musical_means <- NULL

2.4.11 Task 9 - Difference in Musical Means

Finally, replace the NULL in the T9 code chunk with the value of how much higher the QHC score of people who play music is compared to people who don't play music. This should be a single numeric value, to two decimal places, e.g. 2.93

  • hint: look in musical means and enter the difference between the two means.
QHC_diff <- NULL

Job Done - Activity Complete!

Well done, you are finished! Now you should go check your answers against the solutions at the end of this chapter. You are looking to check that the answers you have submitted are exactly the same as the ones in the solution - for example, remember that Mycolumn is different to mycolumn and only one is correct. If you have any questions, please post them on the forums.

2.5 Solutions to Questions

Below you will find the solutions to the questions for the Activities for this chapter. Only look at them after giving the questions a good try and speaking to the tutor about any issues.

2.5.1 PreClass Activities

2.5.1.1 PreClass Task 1

  • Using select() to include only stated columns:
select(pong_data, Participant, PaddleLength, TrialNumber, BackgroundColor, HitOrMiss)
  • Using select() to exclude certain columns:
select(pong_data, -JudgedSpeed, -BallSpeed, -BlockNumber)
  • Using select() to change the order of columns:
select(pong_data, BallSpeed, HitOrMiss, JudgedSpeed, Participant, TrialNumber)

Return to Task

2.5.1.2 PreClass Task 2

arrange(pong_data, desc(HitOrMiss), desc(JudgedSpeed))

Return to Task

2.5.1.3 PreClass Task 3

filter(pong_data, 
       JudgedSpeed == 1, 
       BallSpeed %in% c("2", "4", "5", "7"), 
       HitOrMiss == 0)

Return to Task

2.5.1.4 PreClass Task 4

  • first step is created with filter()
  • the second step is created with mutate()
pong_data_filt <- filter(pong_data, TrialNumber >= 2) 
pong_data2 <- mutate(pong_data_filt, TrialNumber = TrialNumber - 1)

Return to Task

2.5.1.5 PreClass Task 5

group_by(pong_data2, BlockNumber, BackgroundColor)

Return to Task

2.5.1.6 PreClass Task 6

pong_data2_group <- group_by(pong_data2, BackgroundColor, PaddleLength)
pong_data2_hits <- summarise(pong_data2_group, mean_hits = mean(HitOrMiss))
## `summarise()` has grouped output by 'BackgroundColor'. You can override using the `.groups` argument.
  • You should find that the number of hits made with the small paddle (50) and the red color background is 0.450655

Return to Task

2.5.2 InClass Actitivies

2.5.2.1 InClass Task 2

library(tidyverse)

or

library("tidyverse")
  • Note, there is no difference between library(tidyverse) and library("tidyverse") both will work.

Return to Task

2.5.2.2 InClass Task 3

responses <- read_csv("responses.csv")                  
qformats <- read_csv("qformats.csv")                 
scoring <- read_csv("scoring.csv")                  
pinfo <- read_csv("pinfo.csv")
  • Note that there is a difference between responses <- read_csv("responses.csv") and responses <- read_csv(responses.csv). You will need quotes around the .csv filename as shown in the code chunk above (e.g. responses <- read_csv("responses.csv")), or the code won't work.

Return to Task

2.5.2.3 InClass Task 7

aq_scores <- rscores %>% 
             group_by(Id) %>% # group by the ID number in column Id
             summarise(AQ = sum(Score)) # sum column Score to obtain AQ scores.

Return to Task

2.5.2.4 InClass Task 8

aq_scores2 <- responses %>% 
  pivot_longer(cols = Q1:Q10,
               names_to = "Question",
               values_to = "Response"),
  inner_join(qformats, "Question") %>%
  inner_join(scoring, c("QFormat", "Response")) %>%
             group_by(Id) %>% 
             summarise(AQ = sum(Score))

Return to Task

2.5.3 Test Yourself Activities

2.5.3.1 Assignment Task 1 - Oldest Participant

  • Whether you coded this answer or just read from the data, the Participant with ID Number 3 is the oldest.
oldest_participant <- 3

This could also be answered with code. We haven't quite shown you how yet but it would look like this:

oldest_participant_code <- arrange(screening, desc(AGE)) %>% 
  slice(1) %>% 
  pull(ID)

Return to Task

2.5.3.2 Assignment Task 2 - Arranging D-SPAN

  • arrange() is the main function here
  • You also needed to use desc() to sort from high to low
cogtest_sort <- arrange(screening, desc(DSPAN))

Return to Task

2.5.3.3 Assignment Task 3 - Foreign Language Speakers

  1. First group the screening data by FLANG using group_by()
  2. Next, summarise, paying attention to use the variable names as instructed
  3. n() is a function that we use within summarise() to count how many observations we have. It works like count() but you can't use count() within summarise()
  4. median() is the function that we use within summarise() to calculate the median. Much like we do with sum() or mean() or sd(), etc.
screen_groups <- group_by(screening, FLANG) 
descriptives <- summarise(screen_groups, 
                          n = n(), 
                          median_age = median(AGE))

Return to Task

2.5.3.4 Assignment Task 4 - Creating Percentage MOCA scores

  • mutate() is the function to add a new column to data
  • Here we are mutating/adding on a column called MOCA_Perc which shows a participant's MOCA score divided by 30 and multiplied by 100.
screening <- mutate(screening, MOCA_Perc = (MOCA / 30) * 100)

Return to Task

2.5.3.5 Assignment Task 5 - Remove the MOCA column

  • select() is the key function to keep and remove certain columns.
  • Two options here; both would give the same dataframe.
  • The first option shows how to deselect a column and keep everything else.
  • The second option shows how to select all the columns you want.
  • Remember that order is very important and you should select the columns in the order you want.

Option 1:

screening <- select(screening, -MOCA)

Option 2:

screening <- select(screening, ID, AGE, SEX, HANDEDNESS, EDUCATION, MUSICAL, FLANG, DSPAN, D2, MOCA_Perc)

Return to Task

2.5.3.6 Assignment Task 6 - Gather the Responses together

  • pivot_longer() is the function to use here.
  • People take a while to understand this function but spend some time looking at the example below and it will start to make some sense.
  • The first argument is the data. In this case responses.
  • The second argument is the name of the columns you want to gather. Here we are gathering all columns between the Q1 column and the Q15 column. Remember that the colon (:) says "from ... to ..."
    • You do not actually need to write cols = but it makes things clearer.
    • "Gathering" of columns is based on position in the tibble. If the order of columns in the tibble was Q1 then Q15, the above code would only gather those two columns. As it is, in our tibble, the order, is Q1, Q2, Q3, ... Q15, and therefore the code gathers all the columns between Q1 and Q15.
    • Colum names are not put in quotes because they exist already in the tibble responses.
  • The third and fourth arguments are the names of the new columns we are creating;
    • the first will store the question numbers, Question. I.e. put the question names (names_to = ...) in a column called "Question".
    • the second will store the values/responses, Response. I.e. put the values/responses to the questions (values_to = ...) in a column called "Response".
    • These new column names are put in quotes because they do not already exist in the tibble. This is not always the case but is the case for this function.
    • Note that these names could have been anything but by using these names the code makes more sense.
    • Lastly, you do need to write names_to = ... and values_to = ... otherwise the columns won't be created correctly.
responses_long <- pivot_longer(responses, 
                         cols = Q1:Q15, 
                         names_to = "Question", 
                         values_to = "RESPONSE")

Return to Task

2.5.3.7 Assignment Task 7 - Joining the data

  • inner_join() will combine all common information in two sets of data by a common column or columns.
  • Here we are joining the data in responses_long with the data in scoring by the common column RESPONSE.
  • Keep in mind that inner_join() keeps only the rows that have data in both datasets. It will remove rows that only have data in one dataset.
  • When joining two datasets, join by ALL common columns when there is more than one column in common.
responses_points <- inner_join(responses_long, scoring, "RESPONSE")

Return to Task

2.5.3.8 Assignment Task 8 - Working the Pipes

  • This is the code we started with.
participant_groups <- group_by(responses_points, ID)
participant_scores <- summarise(participant_groups, Total_QHC = sum(SCORE))
participant_screening <- inner_join(participant_scores, screening, "ID")
screening_groups_new <- group_by(participant_screening, MUSICAL)
musical_means <- summarise(screening_groups_new, mean_score = mean(Total_QHC))
  • Below is how to transcribe the above series of functions into a pipeline.
  • Remember, when using pipes, the output of the previous function is the input of the subsequent function
musical_means <- group_by(responses_points, ID) %>%
                  summarise(Total_QHC = sum(SCORE)) %>%
                  inner_join(screening, "ID") %>%
                  group_by(MUSICAL) %>%
                  summarise(mean_score = mean(Total_QHC))

Return to Task

2.5.3.9 Assignment Task 9 - Difference in Musical Means

  • People who play music have a QHC score that is 1.53 units higher than people who don't play music.
  • You can do this by looking in musical_means, reading the values, and doing some quick maths.
  • A second option is through code. Code is always better as it can reduce error and is reproducible!
# Option 1
QHC_diff <- 1.53

# Option 2
# You will soon learn the functions to do this by code but here is how you could do it.
QHC_diff_code <- pivot_wider(musical_means, 
                             names_from = "MUSICAL", 
                             values_from = "mean_score") %>% 
  mutate(diff = YES - NO) %>% 
  pull(diff) %>% 
  round(2)

Return to Task

Chapter Complete!

2.6 Additional Material

Below is some additional material that might help your wrangling.

More on read_csv()

In the preclass activity we used the following code to load in our data:

pong_data <- read_csv("PongBlueRedBack 1-16 Codebook.csv")

This is a totally acceptable approach and it is the one we will use 99% of the time. Now one thing to note that the read_csv() function by default always loads any number as double, meaning that it can take a decimal. As shown here by the at the start of each variable. The one column that is not a double is the BackgroundColor column, which is of course "characters" .

  • Note: we can use glimpse() from dplyr to check our data types.
glimpse(pong_data)
## Rows: 4,608
## Columns: 8
## $ Participant     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ JudgedSpeed     <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, …
## $ PaddleLength    <dbl> 50, 250, 50, 250, 250, 50, 250, 50, 250, 50, 50, 250, …
## $ BallSpeed       <dbl> 5, 3, 4, 3, 7, 5, 6, 2, 4, 4, 7, 7, 3, 6, 5, 7, 2, 5, …
## $ TrialNumber     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ BackgroundColor <chr> "red", "blue", "red", "red", "blue", "blue", "red", "r…
## $ HitOrMiss       <dbl> 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, …
## $ BlockNumber     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

However you might not always want this default and might want to stipulate the data-type to load the data in as. Particularly if you already know the data-type you should have. Let's look at this first and then talk about it.

pong_data3 <- read_csv("PongBlueRedBack 1-16 Codebook.csv",
                      col_types = "iiiiicii")

Which if we look at again, we now see that the numerical data are integers instead of double .

glimpse(pong_data3)
## Rows: 4,608
## Columns: 8
## $ Participant     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ JudgedSpeed     <int> 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, …
## $ PaddleLength    <int> 50, 250, 50, 250, 250, 50, 250, 50, 250, 50, 50, 250, …
## $ BallSpeed       <int> 5, 3, 4, 3, 7, 5, 6, 2, 4, 4, 7, 7, 3, 6, 5, 7, 2, 5, …
## $ TrialNumber     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ BackgroundColor <chr> "red", "blue", "red", "red", "blue", "blue", "red", "r…
## $ HitOrMiss       <int> 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, …
## $ BlockNumber     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

So what is the difference? Note that in the new code we have specified the col_types argument as "iiiiicii". The col_types argument allows you to control the data type for each variable. If you don't specify this argument, the default is for read_csv() to guess, and when it sees numbers in a column, it will default to treating it as type double.

What does the "iiiiicii" string do? Well, we know that there are 8 different columns in the csv file, and we have 8 characters in the string "iiiiicii"; each letter in this string tells read_csv() the data type for each of these columns. The string has five "i"s followed by one "c" followed by two more "i"s, which tells read_csv() to treat the first five columns as type integer (i), the sixth column as type character (c), and the last two columns as type integer. (If we wanted a column to be read in as a double, we would use "d".)

Whilst this is a very useful approach if you are already familiar with the type and structure of the data you are working with, it can cause issues if you don't know that. For instance, you need to know exactly how many columns there are, what order, and what type they are in. So it can get tricky. For this series of lab activities, we will just stick to using the basic read_csv() defaults and not state column types.


More on Code Layout

One issue we see a lot is people not being able to debug their code quickly (i.e. find issues) because of the way the code is laid out. Pipes (%>%) helps with that, but so does taking new lines for different parts of your code. After a comma (,) or a pipe (%>%), you can take a new line to continue your code to make it easier to read and follow. For example, both of the following will work, but the second is easiest to read.

musical_means <- group_by(responses_points, ID) %>% summarise(Total_QHC = sum(SCORE)) %>% inner_join(screening, "ID") %>% group_by(MUSICAL) %>% summarise(mean_score = mean(Total_QHC))
musical_means <- group_by(responses_points, ID) %>%
                  summarise(Total_QHC = sum(SCORE)) %>%
                  inner_join(screening, "ID") %>%
                  group_by(MUSICAL) %>%
                  summarise(mean_score = mean(Total_QHC))

And you could even expand this second option further to make it clearer on the group_by() and inner_join() what are the different inputs:

musical_means <- group_by(responses_points, 
                          ID) %>%
                  summarise(Total_QHC = sum(SCORE)) %>%
                  inner_join(screening, 
                             "ID") %>%
                  group_by(MUSICAL) %>%
                  summarise(mean_score = mean(Total_QHC))

Remember, Tidy Code and Tidy Data Make Wrangling Fun! OK, that is not catchy, but true!


More on gathering data - pivot_longer() and gather()

With time, some functions change a bit, others change entirely, and others get forgotten about. gather() is one of these functions. gather() was one of these functions that many people just couldn't ever get right in their head and as such pivot_longer() was created. The first version of this book was written using the gather() function and we have tried to update the book to replace it with the pivot_longer() but we will make mistakes and no doubt gather() will appear somewhere, or you may get some old code and see the gather() function and wonder what it does. As such, we have put below a version of the code from the inclass using the gather() function instead of the pivot_longer() function, just in case it helps.

  • gather() allows us to transform a table from wide format to long format. It has now been replaced by pivot_longer()
  • spread() allows us to transform a table from long format to wide format. It has now been replaced by pivot_wider()

An example with gather()

In the example in class, Task 5, we have all the data we need loaded in, but in order to make it easier for us to get the AQ score for each participant, we need to change the layout of the responses tibble. In class we use the pivot_longer() function but you could have done it with the gather() function as follows

rlong <- gather(responses, Question, Response, Q1:Q10)
  • The first argument given to the gather() function is the dataset which holds the data we want to wrangle, responses.
  • The second and third arguments are the names we want to give the columns we are creating;
    • the first will store the question numbers, Question
    • the second will store the responses, Response.
    • Note that these names could have been anything but by using these names the code makes more sense.
  • Finally, the fourth argument is the names of specific columns in the original tibble that we want to gather together - columns Q1 to Q10. The colon (:) says "all columns from this to that" - all columns from Q1 to Q10.

Notes:

  • "Gathering" of columns is based on position in the tibble. If the order of columns in the tibble was Q1 then Q10, the above code would only gather those two columns. As it is, in our tibble, the order, is Q1, Q2, Q3, ... Q10, and therefore the code gathers all the columns between Q1 and Q10.

  • If you had just wanted columns Q1 and Q10, then you could do c(Q1, Q10).

  • pivot_longer() and gather() do not function exactly the same. They tend to sort data differently. Make sure you are using the function you need to do. You should always always always check the output of your functions to make sure they are doing what you expect.

  • In case you are wondering, if we wanted to go back the way and ungather the data we just gathered, we would use the spread() function: e.g. rwide <- spread(rlong, Questions, Response). But we do not want to do that here so let's not add this to the code.

Quickfire Questions

Let's see if you understand gather(). Say I wanted to gather the first three columns from a tibble called responses where the columns are called Q1, Q2, Q3 (in that order), put the question numbers in a column called Jam, the responses in a column called Strawberry, and store everything in a tibble called sandwich. Fill in the box with what you would write:

sandwich <- gather(responses, Jam, Strawberry, Q1:Q3)

gather wants the data first, then the name of the new column to store the gathered column names, then the name of the new column to store the data, and then finally which columns to gather.


As mentioned above we hope to replace every instance of gather() and spread() in this book with pivot_longer() and pivot_wider(), but if we make mistakes the above might help.


More on binding and joining

In this lab we looked at two methods of combining datasets. We used bind_rows() and we used inner_join(). They do quite different tasks but we thought a quick summary might help.

Let's say we have the following two tables that show Perceived Emotion Accuracy in five different cats:

Table 2.2: Cat data - not very good
ID Animal Perceived_Emotion_Accuracy
1 Cats 11
2 Cats 8
3 Cats 8
4 Cats 10
5 Cats 9

And in five different dogs:

Table 2.3: Dog data - smart!
ID Animal Perceived_Emotion_Accuracy
1 Dogs 82
2 Dogs 78
3 Dogs 83
4 Dogs 82
5 Dogs 84

Because these two tables have no overlapping information about participants but have a similar tibble structure (i.e. number and names of columns) then the way we can combine the tibbles is by putting one tibble below the other using bind_rows()

all_animals <- bind_rows(cats, dogs)

Which would give us:

Table 2.4: By using bind_rows()
ID Animal Perceived_Emotion_Accuracy
1 Cats 11
2 Cats 8
3 Cats 8
4 Cats 10
5 Cats 9
1 Dogs 82
2 Dogs 78
3 Dogs 83
4 Dogs 82
5 Dogs 84

And say now we want to tack on the age information from each participant. Because all_animals and ages have overlapping information that we can use to link the information, we can put the new information at the side of the tibble as a new column using inner_join().

all_animals_ages <- inner_join(all_animals, ages, c("ID", "Animal"))

Which would give us:

Table 2.5: By using inner_join()
ID Animal Perceived_Emotion_Accuracy Ages
1 Cats 11 5
2 Cats 8 4
3 Cats 8 3
4 Cats 10 5
5 Cats 9 5
1 Dogs 82 5
2 Dogs 78 5
3 Dogs 83 3
4 Dogs 82 6
5 Dogs 84 4

There is actualy bind_cols() as well as a host of other join functions, e.g. full_join(), anti_join(). You will pick up more of these functions as we continue but we hope this helps clarify these two functions a little.

A quick table on some useful joins()`

function Example Final tibble has...
full_join() full_join(x,y, by = "ID") information for all observations found in both original tibbles
inner_join() inner_join(x,y, by = "ID") information for only observations found in both original tibbles
anti_join() anti_join(x,y, by = "ID") information for observations found in tibble x but not tibble y
anti_join(y,x, by = "ID") information for observations found in tibble y but not tibble x

End of Additional Material!